
Running scrapy shell against a local file #1550

Closed
alecxe opened this issue Oct 19, 2015 · 17 comments
@alecxe commented Oct 19, 2015

Before Scrapy 1.0, I could execute:

 scrapy shell index.html

In >=1.0, it started to throw ValueError: Missing scheme in request url: index.html:

$ scrapy shell index.html
2015-10-12 15:32:59 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2015-10-12 15:32:59 [scrapy] INFO: Optional features available: ssl, http11, boto
2015-10-12 15:32:59 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
Traceback (most recent call last):
  File "/Users/user/.virtualenvs/so/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/scrapy/cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/scrapy/commands/shell.py", line 50, in run
    spidercls = spidercls_for_request(spider_loader, Request(url),
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 59, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: index.html 

As a workaround, I've used the "file" protocol providing the full path to a file:

$ scrapy shell file:////absolute/path/to/index.html

From a comment on the relevant SO topic http://stackoverflow.com/questions/33088877/scrapy-shell-against-a-local-file, we can see where the relevant change was introduced.

Would it be possible, and would it make sense, to bring the previous behavior back so that we can execute the shell against a local file as easily as scrapy shell filename?

Thanks!

@kmike (Member) commented Oct 19, 2015

#1498 is not part of the 1.0.3 release, so it must be something else. I was not aware of this feature, and there are no tests for it.

But it's true that Scrapy is on track to interpret scrapy shell index.html as scrapy shell http://index.html. Do you think it should first check whether a local file with the requested name exists?

What about using ./index.html and ../index.html and /foo/bar as shortcuts to a local file name?

@alecxe (Author) commented Oct 19, 2015

@kmike thanks for looking into this.

Yeah, I think the desired behavior is for scrapy shell, if it does not find a "protocol" in the argument (or if there is an error during URL parsing), to try to interpret the argument as a path to a file. It should handle both absolute paths and paths relative to the current directory. I think this is actually how it worked before 1.0.

Here are the samples:

scrapy shell index.html  # file in the current directory
scrapy shell subdirectory/index.html  # subdirectory
scrapy shell /absolute/path/to/a/file.html  # absolute path
scrapy shell C:\absolute\path\to\a\file.html  # Windows paths
scrapy shell ../index.html  # handle directory traversals

I'm sure you can think of other test/use cases. Thanks again!
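The sample paths above can all be normalized the same way: converted to file:// URLs before being handed to the shell. A hypothetical helper sketching that idea (path_to_file_url is an assumed name, not part of Scrapy's API):

```python
from pathlib import Path

def path_to_file_url(path_str):
    """Turn a relative or absolute local path into a file:// URL.

    resolve() makes the path absolute and collapses ".." segments,
    and as_uri() produces the file:// form that scrapy shell accepts.
    """
    return Path(path_str).resolve().as_uri()

# path_to_file_url("index.html") yields something like
# file:///current/working/dir/index.html
```

On Windows, as_uri() also handles drive letters (C:\... becomes file:///C:/...), so the same helper covers all of the sample shapes above.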

@kmike (Member) commented Oct 19, 2015

I'm not 100% sure it is good to support index.html and subdirectory/index.html because they are ambiguous; the other examples look fine - there can't be a website at http://../index.html, but http://subdirectory/index.html could be available.

On the other hand, if it is limited to scrapy shell then checking for file existence may be fine.

Maybe we should have a guess_scheme function, not just add_http_...

@alecxe (Author) commented Oct 23, 2015

Yes, I think we can check if the file exists first. There are, though, several ways the logic can be implemented here - for instance, we could check if there is a protocol at the beginning of the argument and interpret the argument as a URL, or handle URL parsing errors and fall back to interpreting it as a local file - sort of an EAFP approach. Just thoughts. Thanks.

@kmike added this to the Scrapy 1.1 milestone Oct 30, 2015
@Preetwinder (Contributor) commented Oct 31, 2015

The inability to interpret the scrapy shell argument as a file path was introduced in the following commit - 9cbbfd8.
The following line tries to infer the spider to be used for the crawl from the domain in the URL, if a spider is not provided as an argument:

    elif url:
        spidercls = spidercls_for_request(spiders, Request(url), spidercls, log_multiple=True)

The error arises from the call to Request.
Since inferring the spider from a file path is not meaningful, this check can be circumvented if the argument is detected to be a file path.

> Maybe we should have guess_scheme function, not just add_http_...

The guess_scheme function should check if the argument is a file path; if yes, it should circumvent the check mentioned above; if no, add_http_if_no_scheme should be called. Is this correct?
Should the argument be checked for being a file path by searching the filesystem, or by interpreting arguments beginning with / or ./ or ../ or containing \ as file paths?

Thank you.

@kmike (Member) commented Nov 2, 2015

The guess_scheme function shouldn't check if a file exists; it should only work with the path as a string. I think checking for a file is only OK if it is limited to the scrapy shell command.
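A minimal, string-only sketch of what such a guess_scheme heuristic could look like, following the prefix rules discussed in this thread (this is an illustration, not Scrapy's actual implementation; a real version would also need to absolutize relative paths before building the file:// URL):

```python
import re

def guess_scheme(url):
    # A scheme like http://, https:// or file:// is already present:
    # leave the argument untouched.
    if re.match(r'^[a-zA-Z][a-zA-Z0-9+.\-]*://', url):
        return url
    # Arguments shaped like local paths (absolute, explicitly relative,
    # or Windows drive paths) get the file scheme.
    if url.startswith(('/', './', '../')) or re.match(r'^[a-zA-Z]:\\', url):
        return 'file://' + url
    # Everything else is assumed to be a web address.
    return 'http://' + url
```

Note that this never touches the filesystem, which is exactly the constraint stated above: the shape of the string alone decides between file:// and http://.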

@redapple (Contributor) commented Nov 2, 2015

@kmike , I would also allow it for scrapy parse. What do you think? I have used that to test callback code with local files.

@kmike (Member) commented Nov 2, 2015

@redapple yeah, this makes sense.

side2k pushed commits to side2k/scrapy referencing this issue (Nov 2–3, 2015), including:
…scheme(): when given URI where scheme is present, but not 'http' the function gave bad result
…uests "http://example.com"; "example" requests "file://example"; "./example.com" requests "file://example.com"
@redapple (Contributor) commented Jan 25, 2016

@alecxe , can I ask you to check if #1710 fixes this issue?
Thanks.

@alecxe (Author) commented Jan 25, 2016

@redapple almost, scrapy shell ./index.html works, but scrapy shell index.html produces twisted.internet.error.DNSLookupError: DNS lookup failed. Let me know if you need more details. Thanks for looking into this.
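The DNSLookupError is a direct consequence of string-based guessing: a bare index.html has no path-like prefix, so it receives an http scheme and the file name ends up parsed as a host. A small standard-library illustration of how the bare name parses (not Scrapy code):

```python
from urllib.parse import urlparse

# A bare file name, once given an http scheme, parses as a host name,
# so the crawler tries to resolve "index.html" via DNS and fails.
print(urlparse('http://index.html').hostname)  # prints: index.html

# With an explicit ./ prefix the argument is recognizably a path,
# which is why scrapy shell ./index.html works.
```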

@kmike (Member) commented Jan 25, 2016

@alecxe I begged for this behaviour because this way you can tell, just by looking at the command, whether it opens a local or a remote file.

@alecxe (Author) commented Jan 25, 2016

@kmike I am okay with it, and it makes sense. Please mention it in the docs, though. Thanks.

@nramirezuy (Contributor) commented Jan 25, 2016

@kmike @alecxe I would suggest adding both examples and, based on that, explaining why DNSLookupError is raised.

@redapple (Contributor) commented Jan 26, 2016

@alecxe , @nramirezuy , docs updated. Please have a look at #1710

@kmike (Member) commented Jan 29, 2016

Fixed by #1710.

@kmike closed this Jan 29, 2016
@alecxe (Author) commented Jan 29, 2016

Thanks everyone for your time! Glad to see it becoming part of 1.1.

@tarunteckedge commented Sep 10, 2018

Good to see that scrapy shell "./path/to/file/hello.html" works.

But the same URL doesn't work in a spider. Can anyone help with that, or confirm it is not supposed to work there?
