
response.body is duplicate #1606

Closed
fanpei91 opened this issue Nov 16, 2015 · 2 comments
fanpei91 commented Nov 16, 2015

Access the text page (not mine) in a browser or with wget and you will see the response content is not duplicated, but Scrapy's response.body is. I tried setting Scrapy's request headers to match a real browser's, but the body is still duplicated.

Just run the following command and you will see the issue:

scrapy shell "http://files.qidian.com/Author4/3615059/88542882.txt"

Sorry for my bad English.
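A quick way to confirm the duplication without eyeballing the page is to count occurrences of a substring that should appear exactly once in a clean response. This is a sketch, not part of Scrapy's API; the marker b"document" is an assumption based on the page's content:

```python
def looks_duplicated(body: bytes, marker: bytes = b"document") -> bool:
    """Heuristic check: a marker that should occur exactly once in a
    clean response appearing more than once suggests a duplicated body."""
    return body.count(marker) > 1
```

For example, passing response.body from the Scrapy shell session above would return True, while the body fetched with requests would return False.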

jdemaeyer (Contributor) commented Nov 17, 2015

Hm, I can replicate this on my machine. The body is not fully duplicated, but almost:

jakob@MosEisley ~ % scrapy shell http://files.qidian.com/Author4/3615059/88542882.txt               
2015-11-17 01:40:37 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2015-11-17 01:40:37 [scrapy] INFO: Optional features available: ssl, http11
2015-11-17 01:40:37 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2015-11-17 01:40:38 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, CoreStats, SpiderState
2015-11-17 01:40:38 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-11-17 01:40:38 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-11-17 01:40:38 [scrapy] INFO: Enabled item pipelines: 
2015-11-17 01:40:38 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6025
2015-11-17 01:40:38 [scrapy] INFO: Spider opened
2015-11-17 01:40:38 [scrapy] DEBUG: Crawled (200) <GET http://files.qidian.com/Author4/3615059/88542882.txt> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f78dff83ad0>
[s]   item       {}
[s]   request    <GET http://files.qidian.com/Author4/3615059/88542882.txt>
[s]   response   <200 http://files.qidian.com/Author4/3615059/88542882.txt>
[s]   settings   <scrapy.settings.Settings object at 0x7f78d749bbd0>
[s]   spider     <DefaultSpider 'default' at 0x7f78dc893190>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
2015-11-17 01:40:38 [root] DEBUG: Using default logger
2015-11-17 01:40:38 [root] DEBUG: Using default logger

In [1]: len(response.body)
Out[1]: 16661

In [2]: import requests

In [3]: resp = requests.get("http://files.qidian.com/Author4/3615059/88542882.txt")
2015-11-17 01:41:01 [requests.packages.urllib3.connectionpool] INFO: Starting new HTTP connection (1): files.qidian.com
2015-11-17 01:41:01 [requests.packages.urllib3.connectionpool] DEBUG: "GET /Author4/3615059/88542882.txt HTTP/1.1" 200 None

In [4]: len(resp.text)
Out[4]: 8465

In [5]: response.body.count('document')
Out[5]: 2

In [6]: resp.text.count('document')
Out[6]: 1

In [7]: response.body.find('document', 1)
Out[7]: 8196

In [8]: len(response.body[8196:])
Out[8]: 8465

In [9]: response.body_as_unicode()[8196:] == resp.text
Out[9]: True
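The comparison above shows that the clean text starts at the second occurrence of 'document' (offset 8196). A small helper can recover the clean tail under that assumption. This is a diagnostic sketch mirroring the session above, not a general fix, and the default marker is specific to this page:

```python
def recover_clean_tail(body: bytes, marker: bytes = b"document") -> bytes:
    """If `marker` occurs twice, assume everything from the second
    occurrence onward is the clean copy (as in the shell session above);
    otherwise return the body unchanged."""
    first = body.find(marker)
    second = body.find(marker, first + 1) if first != -1 else -1
    return body[second:] if second != -1 else body
```

On the session above, recover_clean_tail(response.body) would return the same 8465 bytes that requests sees.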

Downloading the same file from a different server (http://pastebin.com/raw.php?i=37ewyTRK) doesn't change anything on the requests side, but interestingly changes Scrapy's response:

In [12]: resp2 = requests.get("http://pastebin.com/raw.php?i=37ewyTRK")
2015-11-17 01:49:17 [requests.packages.urllib3.connectionpool] INFO: Starting new HTTP connection (1): pastebin.com
2015-11-17 01:49:18 [requests.packages.urllib3.connectionpool] DEBUG: "GET /raw.php?i=37ewyTRK HTTP/1.1" 200 None

In [14]: resp2.text == resp.text
Out[14]: True

In [15]: fetch("http://pastebin.com/raw.php?i=37ewyTRK")
2015-11-17 01:49:40 [scrapy] DEBUG: Crawled (200) <GET http://pastebin.com/raw.php?i=37ewyTRK> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f78dff83ad0>
[s]   item       {}
[s]   request    <GET http://pastebin.com/raw.php?i=37ewyTRK>
[s]   response   <200 http://pastebin.com/raw.php?i=37ewyTRK>
[s]   settings   <scrapy.settings.Settings object at 0x7f78d749bbd0>
[s]   spider     <DefaultSpider 'default' at 0x7f78dc893190>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [16]: len(response.body)
Out[16]: 16595

In [17]: response.body.count('document')
Out[17]: 1

In [18]: response.body.find('document')
Out[18]: 0

fanpei91 (Author) commented Nov 17, 2015

Thank you very much, @jdemaeyer! Your detailed analysis is very helpful.
