Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check if file is already present on running scrapy genspider and terminate if so (#4561) #4623

Merged
merged 21 commits into from Aug 17, 2020

Conversation

ajaymittur28
Copy link
Contributor

@ajaymittur28 ajaymittur28 commented Jun 10, 2020

Continuing from PR #4616 to fix #4561, I've added a direct file check to see if a spider with the name provided in scrapy genspider is already present in the directory and not overwrite the existing spider file if present.

3Dook and others added 3 commits Jun 8, 2020
Issue scrapy#4561 enhancement. The bug occurs when genspider is called outside of a startproject. I added a small and simple check to compare the new spider names to the current files in the directory. If there is already a spider with the file name it will return and stop the function.
@ajaymittur28 ajaymittur28 changed the title Check if spider is already present on scrapy genspider and terminate if present (#4561) Check if spider is already present on scrapy genspider and terminate if so (#4561) Jun 10, 2020
@codecov
Copy link

codecov bot commented Jun 10, 2020

Codecov Report

Merging #4623 into master will increase coverage by 0.46%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4623      +/-   ##
==========================================
+ Coverage   86.29%   86.75%   +0.46%     
==========================================
  Files         160      160              
  Lines        9646     9717      +71     
  Branches     1416     1427      +11     
==========================================
+ Hits         8324     8430     +106     
+ Misses       1062     1025      -37     
- Partials      260      262       +2     
Impacted Files Coverage Δ
scrapy/commands/genspider.py 86.86% <100.00%> (+3.33%) ⬆️
scrapy/pipelines/media.py 97.16% <0.00%> (-0.14%) ⬇️
scrapy/exporters.py 100.00% <0.00%> (ø)
scrapy/utils/curl.py 100.00% <0.00%> (ø)
scrapy/pipelines/images.py 91.81% <0.00%> (ø)
scrapy/settings/default_settings.py 98.73% <0.00%> (+<0.01%) ⬆️
scrapy/settings/__init__.py 92.98% <0.00%> (+0.04%) ⬆️
scrapy/utils/conf.py 93.13% <0.00%> (+0.06%) ⬆️
scrapy/utils/python.py 85.71% <0.00%> (+0.25%) ⬆️
scrapy/extensions/feedexport.py 87.35% <0.00%> (+1.22%) ⬆️
... and 2 more

scrapy/commands/genspider.py Outdated Show resolved Hide resolved
tests/test_commands.py Outdated Show resolved Hide resolved
@ajaymittur28 ajaymittur28 changed the title Check if spider is already present on scrapy genspider and terminate if so (#4561) Check if file is already present on running scrapy genspider and terminate if so (#4561) Jun 12, 2020
@ajaymittur28 ajaymittur28 requested a review from elacuesta Jun 12, 2020
@elacuesta
Copy link
Member

elacuesta commented Jun 13, 2020

Many thanks for the work @ajaymittur28.

It's looking good so far, but I tested this locally and realized that since the exists check is done only with the passed name, it doesn't work in the context of a project:

$ scrapy startproject foobar
New Scrapy project 'foobar', using template directory '/.../scrapy/templates/project', created in:
    /.../foobar

You can start your first spider with:
    cd foobar
    scrapy genspider example example.com

$ cd foobar/

$ scrapy genspider example example.com
Created spider 'example' using template 'basic' in module:
  foobar.spiders.example

# rename the spider so it's not handled by https://github.com/scrapy/scrapy/blob/2.1.0/scrapy/commands/genspider.py#L69-L78
$ sed -i -e "s/name = 'example'/name = 'renamed'/g" foobar/spiders/example.py

$ scrapy genspider example example.com
Created spider 'example' using template 'basic' in module:
  foobar.spiders.example

The added test confirms the patch works for standalone spiders, but it'd be good to also make it work for spiders within projects. Do you think you can make such a change?

@ajaymittur28
Copy link
Contributor Author

ajaymittur28 commented Jun 13, 2020

Yep, I can make it work for the case you just mentioned when it's run within projects as well. working on it! Thanks for showing how the issue looks like!

@ajaymittur28
Copy link
Contributor Author

ajaymittur28 commented Jun 14, 2020

Pushed the latest changes that handle the case mentioned above.

I made a few changes from the previous commits:

  1. Added a check to see if the command was being run as a standalone or from a project (or rather if the spiders module exists), since I felt if it was being run from a project, the original issue (overwriting existing file in same directory) wouldn't be a problem since the spiders would always be generated inside the spiders\ directory and both the cases (the one you mentioned and the one already handled in lines 69-78) would be handled along with the lines I added here

  2. Moved the code to after the attempt to load the spider was done here and a KeyError exception (spider not found) was thrown because:-

    Output if check is done before trying to load spider (before checking if spider with same spidername exists)

    $ scrapy genspider example example.com
    Created spider 'example' using template 'basic' in module:
      foobar.spiders.example
    
    $ scrapy genspider example example.com
    Spider with filename 'example.py' already exists in directory:
       <path_of_spiders_directory>
    

    Output if check is done after trying to load the spider (after checking if spider with same spidername exists)

    $ scrapy genspider example example.com
    Created spider 'example' using template 'basic' in module:
      foobar.spiders.example
    
    $ scrapy genspider example example.com
    Spider 'example' already exists in module:
      foobar.spiders.example
    

    Since the tests already present expect it to check for the spidername first (display second output) I moved the code to inside the except block which runs if the spider with the given name wasn't found; and then proceed to check if a spider with filename same as provided exists.

scrapy/commands/genspider.py Outdated Show resolved Hide resolved
tests/test_commands.py Outdated Show resolved Hide resolved
scrapy/commands/genspider.py Outdated Show resolved Hide resolved
scrapy/commands/genspider.py Outdated Show resolved Hide resolved
Copy link
Member

@Gallaecio Gallaecio left a comment

I’m leaving an inline comment with refactoring ideas in case you are interested, but I’m OK with the changes as is being merged. Thanks!

scrapy/commands/genspider.py Outdated Show resolved Hide resolved
@ajaymittur28
Copy link
Contributor Author

ajaymittur28 commented Jun 16, 2020

I'll refactor the code, thanks for the suggestions

scrapy/commands/genspider.py Outdated Show resolved Hide resolved
scrapy/commands/genspider.py Show resolved Hide resolved
@Gallaecio Gallaecio self-requested a review Jun 18, 2020
Copy link
Member

@Gallaecio Gallaecio left a comment

What about checking the file contents in addition to the modification time?

@ajaymittur28
Copy link
Contributor Author

ajaymittur28 commented Jun 23, 2020

Yep, will do that, initially thought doing so will stress the memory reading the file everytime but since it's only once during the test and file doesn't seem to be that big, shouldn't be an issue.

wRAR
wRAR approved these changes Jul 22, 2020
@ajaymittur28
Copy link
Contributor Author

ajaymittur28 commented Jul 22, 2020

Previous commit resolves a merge conflict in the imports in test_command.py

Copy link
Member

@elacuesta elacuesta left a comment

Thanks for your work! I left a small comment, not a blocker IMHO.

scrapy/commands/genspider.py Outdated Show resolved Hide resolved
Co-authored-by: Eugenio Lacuesta <1731933+elacuesta@users.noreply.github.com>
@Gallaecio Gallaecio merged commit a8e08d5 into scrapy:master Aug 17, 2020
2 checks passed
@Gallaecio
Copy link
Member

Gallaecio commented Aug 17, 2020

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

scrapy genspider should not overwrite existing file
5 participants