'=' added at end of urls extracted from LinkExtractor #1982

Closed
gmargari opened this issue May 11, 2016 · 3 comments

@gmargari gmargari commented May 11, 2016

$ scrapy shell "https://uat.payleap.com/transactservices.svc"
> from scrapy.linkextractors import LinkExtractor
> [ l.url for l in LinkExtractor(allow=()).extract_links(response) ]
['https://uat.payleap.com/TransactServices.svc?wsdl=']

Scrapy shell info:

2016-05-11 16:37:21 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-05-11 16:37:21 [scrapy] INFO: Optional features available: ssl, http11
2016-05-11 16:37:21 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2016-05-11 16:37:22 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, CoreStats, SpiderState
2016-05-11 16:37:22 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-05-11 16:37:22 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-05-11 16:37:22 [scrapy] INFO: Enabled item pipelines: 
2016-05-11 16:37:22 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-05-11 16:37:22 [scrapy] INFO: Spider opened
2016-05-11 16:37:23 [scrapy] DEBUG: Crawled (200) <GET https://uat.payleap.com/transactservices.svc> (referer: None)
@gmargari gmargari changed the title May 11, 2016
    old: Extra "=" added at end of urls from LinkExtractor
    new: Extra "=" added at end of urls extracted from LinkExtractor
@gmargari gmargari changed the title May 11, 2016
    old: Extra "=" added at end of urls extracted from LinkExtractor
    new: '=' added at end of urls extracted from LinkExtractor
Contributor

@redapple redapple commented May 11, 2016

@gmargari, this is the (unfortunate) behavior of canonicalize_url (enabled by default in LinkExtractor): it re-serializes the query string as key=value pairs, so a bare wsdl parameter comes out as wsdl=.

You can disable it though:

In [1]: from scrapy.linkextractors import LinkExtractor

In [2]: [ l.url for l in LinkExtractor(allow=()).extract_links(response) ]
Out[2]: ['https://uat.payleap.com/TransactServices.svc?wsdl=']

In [3]: [ l.url for l in LinkExtractor(allow=(), canonicalize=False).extract_links(response) ]
Out[3]: ['https://uat.payleap.com/TransactServices.svc?wsdl']
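
For readers wondering where the trailing '=' comes from, here is a minimal standard-library sketch of the mechanism. It is not Scrapy's actual canonicalize_url code; the helper name sketch_canonicalize is hypothetical, and it only assumes that canonicalization parses the query string while keeping blank values, sorts the pairs, and re-encodes them.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def sketch_canonicalize(url):
    """Rough stand-in for query-string canonicalization (hypothetical, not Scrapy's code)."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    # keep_blank_values=True keeps a bare "wsdl" as the pair ("wsdl", "")
    pairs = sorted(parse_qsl(query, keep_blank_values=True))
    # urlencode always writes key=value, so a blank value is emitted as "wsdl="
    return urlunsplit((scheme, netloc, path, urlencode(pairs), ""))


original = "https://uat.payleap.com/TransactServices.svc?wsdl"
print(sketch_canonicalize(original))
```

Under these assumptions a key-only parameter and a key with an empty value become indistinguishable after the round trip, which is why passing canonicalize=False is the way to keep the URL exactly as it appears in the page.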


Author

@gmargari gmargari commented May 11, 2016

Thanks!

@gmargari gmargari closed this May 11, 2016
Member

@Digenis Digenis commented May 11, 2016

gmargari added a commit to gmargari/progwebspider that referenced this issue May 13, 2016
3 participants