'=' added at end of urls extracted from LinkExtractor #1982

Closed
gmargari opened this issue May 11, 2016 · 3 comments

Comments

@gmargari

$ scrapy shell "https://uat.payleap.com/transactservices.svc"
> from scrapy.linkextractors import LinkExtractor
> [ l.url for l in LinkExtractor(allow=()).extract_links(response) ]
['https://uat.payleap.com/TransactServices.svc?wsdl=']

Scrapy shell info:

2016-05-11 16:37:21 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-05-11 16:37:21 [scrapy] INFO: Optional features available: ssl, http11
2016-05-11 16:37:21 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2016-05-11 16:37:22 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, CoreStats, SpiderState
2016-05-11 16:37:22 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-05-11 16:37:22 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-05-11 16:37:22 [scrapy] INFO: Enabled item pipelines: 
2016-05-11 16:37:22 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-05-11 16:37:22 [scrapy] INFO: Spider opened
2016-05-11 16:37:23 [scrapy] DEBUG: Crawled (200) <GET https://uat.payleap.com/transactservices.svc> (referer: None)
@gmargari gmargari changed the title Extra "=" added at end of urls from LinkExtractor Extra "=" added at end of urls extracted from LinkExtractor May 11, 2016
@gmargari gmargari changed the title Extra "=" added at end of urls extracted from LinkExtractor '=' added at end of urls extracted from LinkExtractor May 11, 2016
@redapple
Contributor

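@gmargari, this is the (unfortunate) behavior of canonicalize_url, which LinkExtractor applies by default: a query key with no value, such as ?wsdl, is written back out as ?wsdl=.

You can disable it, though, by passing canonicalize=False: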

In [1]: from scrapy.linkextractors import LinkExtractor

In [2]: [ l.url for l in LinkExtractor(allow=()).extract_links(response) ]
Out[2]: ['https://uat.payleap.com/TransactServices.svc?wsdl=']

In [3]: [ l.url for l in LinkExtractor(allow=(), canonicalize=False).extract_links(response) ]
Out[3]: ['https://uat.payleap.com/TransactServices.svc?wsdl']
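
For illustration, the trailing '=' can be reproduced with the standard library alone. The sketch below is only an approximation of the query-string step, not Scrapy's actual canonicalize_url code; the helper name normalize_query is hypothetical. It assumes the normalization parses the query into key/value pairs (keeping blank values), sorts them, and serializes them again.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_query(url: str) -> str:
    """Hypothetical helper: re-serialize the query as sorted key=value pairs."""
    parts = urlsplit(url)
    # keep_blank_values=True preserves "wsdl" as the pair ("wsdl", "")
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    # urlencode always emits key=value, so a blank value comes out as "wsdl="
    return urlunsplit(parts._replace(query=urlencode(pairs)))


original = "https://uat.payleap.com/TransactServices.svc?wsdl"
normalized = normalize_query(original)
print(normalized)  # https://uat.payleap.com/TransactServices.svc?wsdl=
```

Because the pair ("wsdl", "") and the bare key "wsdl" parse to the same thing, the round trip cannot tell them apart, which is why skipping the normalization (canonicalize=False) is the way to keep the URL exactly as it appears in the page.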


@gmargari
Author

Thanks!

@Digenis
Member

Digenis commented May 11, 2016

@gmargari, see #1941
