
generate unicode strings to test utf-8 handling for all non-IWSLT dataset tests. #1599

Merged: 7 commits into pytorch:main on Feb 12, 2022

Conversation

@erip (Contributor, Author) commented Feb 10, 2022

Partially addresses the lingering TODO in #1493.

TODO: add IWSLT, which fails due to XML parsing errors -- needs investigation.

@erip (Contributor, Author) commented Feb 10, 2022

These test failures are actually good. 🔥

@erip (Contributor, Author) commented Feb 10, 2022

I have a Windows desktop, so instead of using CI as a debugging mechanism I can just wrap this PR up once I'm home. I'm glad we caught these. 😄

@Nayef211 (Contributor) commented:

Thanks for taking this on @erip! Just a couple of questions from my side.

  • Do we know for certain that all of our datasets are UTF-8 encoded?
  • Do you have any initial guesses as to why these test failures are occurring? At first glance it seems like they might be caused by the LineReaderIterDataPipe and the CSVParserIterDataPipe not handling UTF-8-encoded strings correctly.
  • Does this PR update all of our dataset tests to use the UTF-8-encoded strings? If so, it might be worthwhile to mention that in the PR description as well.

@erip (Contributor, Author) commented Feb 10, 2022

  1. I don't know for certain, but I strongly suspect it. I can confirm later by checking the legacy code that loads these datasets -- if the encoding is set, that's evidence.
  2. I suspect that the text readers either aren't passing the appropriate encoding in 'r' mode or aren't decoding to UTF-8 in 'b' mode.
  3. All but the IWSLTs. Can update for clarity.

Edit: a cursory glance at dataset_utils suggests they are all UTF-8.

@erip changed the title from "generate unicode strings to test utf-8 handling." to "generate unicode strings to test utf-8 handling for all non-IWSLT dataset tests." on Feb 11, 2022
Review threads on test/datasets/test_amazonreviewfull.py and test/datasets/test_yelpreviewfull.py (outdated, resolved).
@erip (Contributor, Author) commented Feb 11, 2022

Tracked down the underlying issue with encoding here on Windows. See pytorch/pytorch#72713 for context.
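(For readers without the upstream context: on Windows, opening a file in text mode without an explicit encoding falls back to the locale's ANSI code page rather than UTF-8, which mangles the generated test strings. A minimal Python illustration, with a hypothetical file path:)

import locale

# On Windows this typically prints something like "cp1252" rather than "utf-8";
# it is the encoding open(path, "r") uses when none is passed explicitly.
print(locale.getpreferredencoding(False))

# Reading UTF-8 data through that default corrupts (or errors on) non-ASCII
# text, so the encoding has to be made explicit:
with open("train.csv", "r", encoding="utf-8") as f:  # hypothetical path
    text = f.read()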

@parmeet (Contributor) commented Feb 11, 2022

  I suspect that the text readers either aren't passing the appropriate encoding in 'r' mode or aren't decoding to UTF-8 in 'b' mode.

Seems like the culprit is not setting decode to True in our readers (it defaults to False): https://github.com/pytorch/data/blob/9c6e5ddfcdf1061e3968ed5cd9d55754cc713965/torchdata/datapipes/iter/util/plain_text_reader.py#L90

@erip (Contributor, Author) commented Feb 11, 2022

  Seems like the culprit is not setting decode to True in our readers

Indeed, one option is to read the files in binary mode and decode appropriately. Ideally, FileOpener could handle opening the file with the appropriate encoding/mode for us, since extra downstream decoding is a bit of a pain. See the issue in upstream pytorch for the "better" option and alternatives.
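(For concreteness, a sketch of that first option, with a made-up input path; the decode flag is confirmed by the torchdata source linked above, but treat the rest of the call as an assumption:)

from torchdata.datapipes.iter import FileOpener, IterableWrapper

paths_dp = IterableWrapper(["train.de"])  # hypothetical path

# Open in binary mode so no implicit, locale-dependent text decoding happens...
data_dp = FileOpener(paths_dp, mode="b")

# ...and have the line reader decode the raw bytes as UTF-8 explicitly.
lines_dp = data_dp.readlines(return_path=False, strip_newline=False, decode=True)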

@parmeet (Contributor) commented Feb 11, 2022

  Indeed, one option is to read the files in binary mode and decode appropriately. Ideally, FileOpener could handle opening the file with the appropriate encoding/mode for us, since extra downstream decoding is a bit of a pain. See the issue in upstream pytorch for the "better" option and alternatives.

Agreed. This could (should) be handled at file opening so that downstream readers do not have to worry about it. Thanks for picking this up @erip. Looking forward to the resolution at the torchdata level. Meanwhile (since I'm not sure whether resolving the FileOpener APIs will take longer), can we try closing this PR by reading in binary mode and setting decode to True for the downstream readers? We can then cherry-pick the changes once FileOpener handles the decoding scheme.

@erip (Contributor, Author) commented Feb 11, 2022

Yes, that seems reasonable.

@parmeet (Contributor) commented Feb 11, 2022

Yes, that seems reasonable.

Great! Then let's try to land this before we make the branch-cut. cc: @Nayef211

New commit pushed: "…DO: replace with FileOpener with appropriate encoding when this lands in upstream pytorch."
@erip (Contributor, Author) commented Feb 11, 2022

OK, I think this should be gtg now @parmeet

@parmeet (Contributor) left a review

Great work @erip! This looks good to me. I will merge it once the CI is green for unit-testing.

@parmeet (Contributor) commented Feb 11, 2022

Oh, one more change @erip: I know the IWSLT 16/17 test suite update is pending, but can we at least update the reading/decoding part for the actual datasets?

tgt_data_dp = FileOpener(cache_inner_tgt_decompressed_dp, mode="r")
src_data_dp = FileOpener(cache_inner_src_decompressed_dp, mode="r")
src_lines = src_data_dp.readlines(return_path=False, strip_newline=False)
tgt_lines = tgt_data_dp.readlines(return_path=False, strip_newline=False)
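(The fix would presumably mirror the binary-mode pattern discussed above; a sketch, reusing the variable names from the snippet:)

tgt_data_dp = FileOpener(cache_inner_tgt_decompressed_dp, mode="b")
src_data_dp = FileOpener(cache_inner_src_decompressed_dp, mode="b")
src_lines = src_data_dp.readlines(return_path=False, strip_newline=False, decode=True)
tgt_lines = tgt_data_dp.readlines(return_path=False, strip_newline=False, decode=True)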

@erip (Contributor, Author) commented Feb 11, 2022

Good catch, I'll fix that.

@erip (Contributor, Author) commented Feb 11, 2022

It looks like there's a lingering issue with IMDB that I'm trying to debug. The cache is currently written as text, but when I change it to be written as bytes (encoding appropriately before the cache is written)...

     cache_decompressed_dp = (
         cache_decompressed_dp.lines_to_paragraphs()
     )  # group by label in cache file
+    cache_decompressed_dp = cache_decompressed_dp.map(lambda x: (x[0], x[1].encode()))
     cache_decompressed_dp = cache_decompressed_dp.end_caching(
-        mode="wt",
-        filepath_fn=lambda x: os.path.join(root, decompressed_folder, split, x),
+        mode="wb",
+        filepath_fn=lambda x: os.path.join(root, decompressed_folder, split, x)
     )

I'm met with the following errors:

test/datasets/test_imdb.py:84: in test_imdb_split_argument
    for d1, d2 in zip_equal(dataset1, dataset2):
test/common/case_utils.py:53: in zip_equal
    for combo in zip_longest(*iterables, fillvalue=sentinel):
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/_typing.py:366: in wrap_generator
    response = gen.send(None)
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/datapipes/iter/callable.py:95: in __iter__
    for data in self.datapipe:
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/_typing.py:366: in wrap_generator
    response = gen.send(None)
../data/torchdata/datapipes/iter/util/plain_text_reader.py:106: in __iter__
    for path, file in self.source_datapipe:
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/_typing.py:366: in wrap_generator
    response = gen.send(None)
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/datapipes/iter/fileopener.py:45: in __iter__
    yield from get_file_binaries_from_pathnames(self.datapipe, self.mode)
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/datapipes/utils/common.py:85: in get_file_binaries_from_pathnames
    for pathname in pathnames:
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/_typing.py:366: in wrap_generator
    response = gen.send(None)
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/datapipes/iter/combining.py:38: in __iter__
    for data in dp:
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/_typing.py:366: in wrap_generator
    response = gen.send(None)
../data/torchdata/datapipes/iter/util/saver.py:36: in __iter__
    for filepath, data in self.source_datapipe:
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/_typing.py:366: in wrap_generator
    response = gen.send(None)
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/datapipes/iter/callable.py:95: in __iter__
    for data in self.datapipe:
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/_typing.py:366: in wrap_generator
    response = gen.send(None)
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/datapipes/iter/callable.py:96: in __iter__
    yield self._apply_fn(data)
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/datapipes/iter/callable.py:69: in _apply_fn
    res = self.fn(data[self.input_col])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

fd = b'\xc3\xafp\xc4\xa9\xc8\xba\xc2\xbd\xc5\xb7\n\xc5\x9b\xe1\x9b\xac\xe2\xb1\xb4\xc7\xbf\xe2\xb1\xa3\xe2\xb1\xa4\xc2\xb6'

    def _read_bytes(fd):
>       return b"".join(fd)
E       TypeError: sequence item 0: expected a bytes-like object, int found

../data/torchdata/datapipes/iter/util/cacheholder.py:215: TypeError

@erip (Contributor, Author) commented Feb 11, 2022

The test seems OK if I pass skip_read=True to end_caching, but I'm not super comfortable with this decision because I don't quite understand what I'm losing from this...
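(Concretely, relative to the diff above, the change amounts to something like this; skip_read is the end_caching flag in question, and the rest is copied from the earlier snippet:)

cache_decompressed_dp = cache_decompressed_dp.end_caching(
    mode="wb",
    filepath_fn=lambda x: os.path.join(root, decompressed_folder, split, x),
    skip_read=True,  # the payload is already materialized bytes, not a stream to read
)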

@parmeet (Contributor) commented Feb 11, 2022

The test seems OK if I pass skip_read=True to end_caching, but I'm not super comfortable with this decision because I don't quite understand what I'm losing from this...

It seems that _read_bytes is expecting a file-like object. But here, since we are already mapping the string to bytes (by encoding it), this function doesn't work on the raw bytes (although, interestingly, _read_str works with both a file-like object and a str, in which case it simply returns the same string). I think it is safe to use skip_read=True since we don't really need to read from a stream. cc: @ejguan to make sure my understanding is right?
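(The TypeError itself comes from iterating a bytes object, which yields ints rather than bytes; a two-line illustration:)

payload = b"\xc3\xaf"
# join() iterates its argument; iterating bytes yields ints (0xC3, 0xAF),
# whereas iterating a file handle opened in "rb" mode yields bytes.
b"".join(payload)  # TypeError: sequence item 0: expected a bytes-like object, int found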

@parmeet (Contributor) commented Feb 11, 2022

@erip, could you please push this change so that we can close this PR in prep for the branch-cut? Thanks!

@erip (Contributor, Author) commented Feb 11, 2022

Done!

@erip (Contributor, Author) commented Feb 12, 2022

I think this is good to merge @Nayef211 @parmeet

@parmeet merged commit 2e93d94 into pytorch:main on Feb 12, 2022
@erip deleted the feature/unicode-tests branch on February 12, 2022 at 02:26