How to deal with the Chinese character in url #1571

Closed
ghost opened this issue Oct 30, 2015 · 4 comments

ghost commented Oct 30, 2015

I am following the links on this page to scrape the content of each linked page:
http://www.littleoslo.com/lyc/home/category/style/rap/
However, the URLs of the songs contain Chinese characters. I tried to scrape the data with the following code, but it failed:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule


class LyricItem(scrapy.Item):
    # Not shown in the original post; assumed to have a single "link" field.
    link = scrapy.Field()


class LyricSpider(scrapy.Spider):
    name = "lyric_spider"
    allowed_domains = ["littleoslo.com"]
    start_urls = [
        "http://www.littleoslo.com/lyc/home/category/style/rap/"
    ]
    # Note: rules are only honored by CrawlSpider subclasses, so this
    # pagination rule has no effect on a plain scrapy.Spider.
    rules = (
        Rule(LinkExtractor(allow=(r'page/\d+',)), follow=True),
    )

    def parse(self, response):
        # Follow each song link found on the category page.
        for href in response.css("div > div > a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # Store the href of every heading anchor on the song page.
        for sel in response.xpath('//div/h3/a'):
            item = LyricItem()
            item['link'] = sel.xpath('@href').extract()
            yield item

Thanks in advance!!!

OldPanda commented Nov 1, 2015

I cannot see your whole project, so I can only offer a hint based on my guess. For encoding problems, you can try:

  • ftfy
  • calling url = url.encode('utf-8') before sending the request to this address

Hope this helps.

ghost commented Nov 2, 2015

@OldPanda Thanks a lot! Actually, the URL crawled by Scrapy is stored as unicode (or some other format I don't know of), and when it is sent in a new request the server recognises it. So no additional effort is needed. Anyway, thank you for your reply.
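
For anyone who does need an ASCII-only form of such a URL (for logging or for another HTTP client), here is a minimal sketch using only the Python 3 standard library. The helper name percent_encode_url and the example.com URL are hypothetical, made up for illustration. This is not necessarily what Scrapy does internally, and non-ASCII host names (which need IDNA encoding) are not handled.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def percent_encode_url(url: str) -> str:
    """Percent-encode non-ASCII characters in the path and query as UTF-8.

    Hypothetical helper for illustration; the host name is left untouched.
    """
    parts = urlsplit(url)
    # Keep "/" and "%" as-is so existing escapes are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


# Hypothetical URL with Chinese characters in the path.
original = "http://www.example.com/lyrics/饒舌/"
encoded = percent_encode_url(original)
print(encoded)
# Decoding the escapes gives back the original unicode URL.
print(unquote(encoded) == original)
```

The encoded string contains only ASCII characters, and unquote() recovers the original text, so both forms refer to the same resource.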

@redapple
Contributor

@rylanchiu, do I understand correctly that your issue is now solved?

ghost commented Feb 18, 2016

@redapple Yes. Thanks a lot.
