How to deal with the Chinese character in url #1571

ghost · 2015-10-30T13:41:38Z

I am following the links on this page to scrape its content:
http://www.littleoslo.com/lyc/home/category/style/rap/
However, the URLs of the songs contain Chinese characters. I tried to scrape the data using the following code, but it failed:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class LyricSpider(scrapy.Spider):
    name = "lyric_spider"
    allowed_domains = ["littleoslo.com"]
    start_urls = [
        "http://www.littleoslo.com/lyc/home/category/style/rap/"
    ]
    rules = (
        Rule(LinkExtractor(allow=(r'page/\d+',)), follow=True),
    )

    def parse(self, response):
        # Follow every link found in the listing page.
        for href in response.css("div > div > a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath('//div/h3/a'):
            # LyricItem is not defined in this snippet.
            item = LyricItem()
            link = sel.xpath('@href').extract()
            yield item

Thanks in advance!!!

OldPanda · 2015-11-01T01:51:36Z

I cannot see your whole project, so I can only offer a hint based on a guess. For encoding problems, you can try:

ftfy
calling url = url.encode('utf-8') before sending the request to this address
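To illustrate the second suggestion, here is a minimal sketch of percent-encoding the non-ASCII part of a URL as UTF-8 using only the Python 3 standard library. The helper name percent_encode_url and the sample path containing 歌词 are hypothetical and not taken from the site; the sketch also assumes the host name itself is ASCII.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def percent_encode_url(url):
    """Percent-encode non-ASCII characters in the path and query as UTF-8.

    Hypothetical helper for illustration. '%' is treated as safe so that
    a URL that is already encoded is not encoded a second time.
    """
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


# The path segment below is made up for the example.
encoded = percent_encode_url("http://www.littleoslo.com/lyc/home/歌词/")
print(encoded)
```

The result contains only ASCII characters, which is the form a server receives on the wire.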

Hope this helps.

ghost · 2015-11-02T11:30:22Z

@OldPanda Thanks a lot! Actually, the URL crawled by Scrapy is stored as unicode (or some other format I am not aware of), and when it is sent in a new request the server recognises it. So no additional effort is required. Anyway, thank you for your reply.

redapple · 2016-02-18T12:06:39Z

@rylanchiu, do I understand correctly that your issue is now solved?

ghost · 2016-02-18T14:32:30Z

@redapple Yes. Thanks a lot.

ghost mentioned this issue Oct 30, 2015

'module' object has no attribute 'Filed' #1573

Closed

redapple closed this as completed Feb 18, 2016
