
Updated RFPDupeFilter line separator for correct universal newlines mode usage. #4283

Merged
merged 4 commits into scrapy:master on Feb 6, 2020

Conversation

LanetheGreat
Contributor

@LanetheGreat LanetheGreat commented Jan 23, 2020

Hey guys, I found another small non-breaking glitch, specific to Windows and the RFPDupeFilter used in Scrapy, that I'm not sure anybody else has noticed; it has existed since RFPDupeFilter was implemented back in commit 5497252. When RFPDupeFilter writes its request hashes to a job's requests.seen file, it uses Python's universal newlines feature incorrectly and ends up writing '\r\r\n' as each line ending instead of '\r\n'. On Linux it works fine and writes '\n' as expected.

Here's an example file demonstrating what's happening, opened in Notepad++ and a Hex editor with one of the line endings highlighted:
[screenshot: "Newlines" - the requests.seen file with one line ending highlighted]

I've tested this PR (using a subclass in my actual project to override the base behavior) and it works fine on both Linux and Windows, writing '\n' and '\r\n' respectively. The bug isn't serious, but fixing it is a nice data storage optimization for anyone running Scrapy on Windows with job storage, since it avoids writing thousands of extra carriage returns (\r) to the file.
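To make the doubling mechanics concrete, here's a small sketch (not Scrapy code; the fingerprint value is a dummy) that reproduces the bug using io.StringIO, whose newline argument mimics a Windows text-mode file handle:

```python
import io

# io.StringIO(newline='\r\n') translates every '\n' written to it into
# '\r\n', just like a text-mode file handle does on Windows.
buf = io.StringIO(newline='\r\n')
fp = 'a' * 40  # dummy request fingerprint

# Old behavior: appending os.linesep ('\r\n' on Windows) means the '\n'
# gets translated a second time, producing the doubled '\r\r\n'.
buf.write(fp + '\r\n')

# Fixed behavior: write a plain '\n' and let the text layer translate once.
buf.write(fp + '\n')

print(repr(buf.getvalue()))
# first line ends with '\r\r\n', second line ends with '\r\n'
```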

@codecov

@codecov codecov bot commented Jan 23, 2020

Codecov Report

Merging #4283 into master will decrease coverage by 0.24%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master   #4283      +/-   ##
=========================================
- Coverage   84.15%   83.9%   -0.25%     
=========================================
  Files         166     166              
  Lines        9761    9869     +108     
  Branches     1462    1469       +7     
=========================================
+ Hits         8214    8281      +67     
- Misses       1295    1334      +39     
- Partials      252     254       +2
Impacted Files Coverage Δ
scrapy/dupefilters.py 98.11% <100%> (ø) ⬆️
scrapy/utils/test.py 49.35% <0%> (-8.99%) ⬇️
scrapy/utils/ftp.py 23.8% <0%> (-6.2%) ⬇️
scrapy/pipelines/files.py 61.66% <0%> (-3.99%) ⬇️
scrapy/utils/trackref.py 82.85% <0%> (-2.86%) ⬇️
scrapy/core/downloader/handlers/datauri.py 93.33% <0%> (-0.79%) ⬇️
scrapy/crawler.py 89.26% <0%> (-0.36%) ⬇️
scrapy/core/downloader/handlers/http10.py 100% <0%> (ø) ⬆️
scrapy/http/request/__init__.py 100% <0%> (ø) ⬆️
scrapy/core/downloader/handlers/file.py 100% <0%> (ø) ⬆️
... and 9 more

@elacuesta
Member

@elacuesta elacuesta commented Jan 23, 2020

Would it be possible to add some tests for this?

@LanetheGreat
Contributor Author

@LanetheGreat LanetheGreat commented Jan 28, 2020

Just updated my pull request to include a new test for the incorrect line endings. While writing it I also noticed a slight bug in test_dupefilter_path, which I triggered by accident while testing my test. When that test's assertion fails, it leaves requests.seen open and locked (because the .close() method of df/df2 never gets called), which makes shutil.rmtree fail with a PermissionError. I added a commit to my PR that fixes this, though I'm not sure whether the current behavior is intended: by not deleting the file, it let me still see the temporary requests.seen and spot my error sooner...


with open(os.path.join(path, 'requests.seen'), 'rb') as seen_file:
    line = next(seen_file).decode()
    assert not line.endswith('\r\r\n')
Member

@elacuesta elacuesta Jan 28, 2020


Could you also check if the line ends with \n here?

Contributor Author

@LanetheGreat LanetheGreat Jan 29, 2020


Just updated the PR, how's that? It now checks that universal newlines mode is selected and working as expected on each platform (Linux/Mac and Windows).
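For reference, here's a hedged sketch of what a platform-aware assertion can look like (the names are illustrative, not the exact PR test):

```python
import os

# 'deadbeef' stands in for a request fingerprint; os.linesep is what a
# text-mode write of '\n' produces in the raw file bytes on each platform.
line = 'deadbeef' + os.linesep

assert not line.endswith('\r\r\n')  # no doubled carriage return
assert line.endswith('\n')          # still terminated by a newline
if os.linesep == '\r\n':            # Windows-only extra check
    assert line.endswith('\r\n')
```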

@elacuesta
Member

@elacuesta elacuesta commented Jan 28, 2020

I believe the test changes are ok.
Regarding the underlying newline change, I'm not 100% sure because I'm not a Windows user, but it does make sense to me, especially considering the following quote from https://docs.python.org/3/library/os.html#os.linesep (emphasis mine):

os.linesep
The string used to separate (or, rather, terminate) lines on the current platform. This may be a single character, such as '\n' for POSIX, or multiple characters, for example, '\r\n' for Windows. Do not use os.linesep as a line terminator when writing files opened in text mode (the default); use a single '\n' instead, on all platforms.
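Concretely, the change amounts to something like the following (a minimal stand-in, not the actual scrapy/dupefilters.py source):

```python
class SimpleDupeFilter:
    """Minimal stand-in for RFPDupeFilter's seen-file handling."""

    def __init__(self, path):
        self.fingerprints = set()
        # Text mode (the default): Python translates '\n' to the platform
        # separator on write and strips any separator style on read.
        self.file = open(path, 'a+')
        self.file.seek(0)
        self.fingerprints.update(line.rstrip() for line in self.file)

    def request_seen(self, fp):
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        self.file.write(fp + '\n')  # was: fp + os.linesep
        return False

    def close(self):
        self.file.close()
```

Writing a bare '\n' per the os.linesep docs means the file ends up with '\n' endings on POSIX and '\r\n' on Windows, with no doubling.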

@kmike
Member

@kmike kmike commented Jan 30, 2020

@LanetheGreat could you please check whether the updated RFPDupeFilter can load a file created by a previous version of RFPDupeFilter? It would be fine if it doesn't load, but in that case we would need to mark this change as backwards incompatible.

@LanetheGreat
Contributor Author

@LanetheGreat LanetheGreat commented Jan 30, 2020

@LanetheGreat could you please check whether the updated RFPDupeFilter can load a file created by a previous version of RFPDupeFilter? It would be fine if it doesn't load, but in that case we would need to mark this change as backwards incompatible.

Just created a virtual environment (pipenv with Python 3.6.8 on Windows 10), installed my PR version of Scrapy into it, and tested loading the same old requests.seen file from my first comment above. It loaded all the hashes fine, with every line ending properly stripped, so it doesn't conflict when reading old files. (I cut the output down to the first 20 hashes to avoid overfilling the console screen.)

[screenshot: console output showing the loaded fingerprints]
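The reason old files still load: universal newlines treat '\r', '\n', and '\r\n' all as line terminators, so the doubled endings just produce extra empty lines. A sketch with dummy hashes (assuming the filter strips each line with rstrip(), as the reading code does):

```python
# A file written by the old code on Windows ends every hash with '\r\r\n'.
old_format = 'aaaa\r\r\nbbbb\r\r\n'

# splitlines() honors universal newlines: 'aaaa\r\r\n' splits into
# 'aaaa' (ended by the lone '\r') plus an empty line (ended by '\r\n').
fingerprints = {line.rstrip() for line in old_format.splitlines()}
fingerprints.discard('')  # the empty leftovers are harmless

print(sorted(fingerprints))
# ['aaaa', 'bbbb']
```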

kmike
kmike approved these changes Jan 31, 2020
Member

@kmike kmike left a comment

Thanks @LanetheGreat!

@Gallaecio Gallaecio merged commit 3263441 into scrapy:master Feb 6, 2020
2 checks passed