# Introduction
Data collection is the first step of any data science project, and it is essential because all further manipulation and analysis depend on the data it produces. When assembling data from web sources, a data scientist cares about how efficient and how capable the collection method is. Thus, it's very important to choose a proper method and tool for this step.

This tutorial introduces a data collection library, Scrapy. Most of us are already familiar with useful web requesting and parsing libraries such as BeautifulSoup. Compared to these libraries, Scrapy is a complete crawling framework: it covers requesting, parsing, processing, storing and exporting data.

This complete data collection process is supported by Scrapy's well-structured architecture. Take a look at the picture: https://www.dropbox.com/s/zmysnamlzqbd3rt/scrapy_architecture_02.png?dl=0. 

1. Engine: This is the core processing unit of Scrapy. It manages the data flow between the other components.
2. Spiders: As the name suggests, this component sends requests to crawl pages and parses the responses.
3. Item Pipelines: Before talking about Item Pipelines, Item should be explained first. An Item is like a container that holds the extracted data. Items are then passed to Item Pipelines, where the data is processed and exported, for example to a database.
4. Downloader: It is in charge of fetching web pages and feeding them to the engine, which passes them on to the spiders.

In this tutorial, an example that collects lyrics and other information about songs from http://www.metrolyrics.com/ will help explain how each component of Scrapy works.

# Content


[Install Scrapy](#install)

[Write a Scrapy project](#write)
   
[Advantages of Scrapy](#advantages)

[Practice experience](#experience)

[Reference and further exploration](#more)



<a id='install'></a>
# Install Scrapy 
To use Scrapy, the first step is to install it. There are several ways to do so, depending on the platform. 
With Anaconda installed on my PC, it's quite convenient to pick Scrapy from the packages available to install. It should be the same for those who use Miniconda. Without these environments, you can simply use this command, which is the typical way to install packages for Python:

$ pip install Scrapy

Now, we can start to write a scrapy project.
<a id='write'></a>
# Write a Scrapy project

## Step 1: Creating a project
In the directory where you want to store your project, use this command to create a new project:

$ scrapy startproject  XXXX   (where XXXX refers to the project name)

Here, I named my project 'lyrics'. The command creates a directory 'lyrics' with the following structure. (Note: Looking back at the main components of Scrapy mentioned above makes it easier to understand the files generated for us.)

lyrics/   
    
    scrapy.cfg            # deploy configuration file

    lyrics/               # Main code files are in this directory
        __init__.py

        items.py          # items definition file

        middlewares.py    # middlewares file

        pipelines.py      # pipelines file

        settings.py       # settings file

        spiders/          # You will later put your spiders here
            __init__.py

## Step 2: Define an Item
An Item in Scrapy is a holder for the extracted data. So, the first job is to define an Item by modifying the items.py file. (Note: Here in this notebook, I just use the Jupyter Magic commands to load and rewrite those python files under the 'lyrics' directory. If you want to run and test it, please follow the commands or just run them outside of this notebook.)

Scrapy provides an Item class. We need to inherit from it to define our own Item class, 'LyricsItem', to give the data a clear structure. Field objects are used to declare the fields of the Item. 

In [60]:
# %load lyrics/lyrics/items.py

In [72]:
%%writefile lyrics/lyrics/items.py
import scrapy

class LyricsItem(scrapy.Item):
    song = scrapy.Field()
    artist = scrapy.Field()
    url = scrapy.Field()

Overwriting lyrics/lyrics/items.py


Working as a data container, an Item in Scrapy is dictionary-like. We can build a simple instance of our Item to see this. However, rather than passing plain dictionaries around again and again, it's more convenient for the other components of Scrapy to work on the data through Items. So, later when writing spiders for a project, we just need to initialize and return Items.

In [74]:
from lyrics.lyrics.items import LyricsItem

testLyricItem = LyricsItem(song="Perfect", artist="Ed Sheeran")
print(testLyricItem)

{'artist': 'Ed Sheeran', 'song': 'Perfect'}


## Step 3: Write a spider
As its name suggests, a spider in Scrapy is used to scrape the websites we want to extract data from. It should be placed in the 'spiders' directory. 

In [113]:
%%writefile lyrics/lyrics/spiders/lyricSpider.py
import scrapy
from lyrics.items import LyricsItem

class LyricSpider(scrapy.Spider):
    name = 'lyricSpider'  #name for each spider should be unique
    allowed_domains = ["www.metrolyrics.com"]
    start_urls = ["http://www.metrolyrics.com/top100.html"] #where to start scraping
    
    
    #the parse() method is called by default for each response from start_urls
    def parse(self, response):
        #use Scrapy's Selector to select certain parts of the HTML source:
        slc = scrapy.Selector(response)
        sites = slc.xpath('//div[@id="top100"]/div[@class="row"]/div/div[@class="grid_8"]/div/div/ul/li/span[3]')
        
        items = []        
        for site in sites:
            item = LyricsItem()
            item['song'] = site.xpath('a/text()').extract_first().strip()[:-7] #drop the trailing ' Lyrics' (7 characters) from the link text
            item['artist'] = site.xpath('span/a/text()').extract()[0].strip() #extract()[0] is like extract_first(), but raises IndexError instead of returning None when nothing matches
            item['url'] = site.xpath('a/@href').extract()[0]
            items.append(item)            
        
        return items   

Writing lyrics/lyrics/spiders/lyricSpider.py
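
The `[:-7]` slice in the spider assumes that every link text ends with the seven characters ` Lyrics`. A slightly more defensive variant could look like the sketch below. `strip_lyrics_suffix` is a hypothetical helper name (not part of Scrapy), and the suffix itself is an assumption based on the page as it was scraped here.

```python
def strip_lyrics_suffix(title, suffix=" Lyrics"):
    """Remove a trailing suffix from a song title only if it is present."""
    title = title.strip()
    if title.endswith(suffix):
        return title[:-len(suffix)]
    return title


print(strip_lyrics_suffix("  Perfect Lyrics\n"))  # Perfect
print(strip_lyrics_suffix("Despacito"))           # Despacito
```

With a helper like this, a title that happens to lack the suffix is left untouched instead of losing its last seven characters.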


### Selectors
Here I already used Selector, which Scrapy provides to parse HTML/XML source. You can also use BeautifulSoup, which we have learned in the course, but it's slower than lxml. Scrapy's Selector is built on top of lxml. With Selector, the xpath(), css() and re() methods accept XPath, CSS and regular expressions respectively to help parse the source.

XPath:

In my LyricSpider above, I used XPath. Using Chrome's 'Inspect' tool, we can navigate to the 'top100' part of the webpage and find that the song's name is the content of an 'a' tag nested inside. So we can access it with this expression: 

//div[@id="top100"]/div[@class="row"]/div/div[@class="grid_8"]/div/div/ul/li/span[3]/a/text()

The Inspect tool can also generate an XPath expression automatically, which is very convenient. All you need to do is navigate to the element you want and right-click on it. Then select 'Copy' --> 'Copy XPath'. More about the XPath rules can be found at https://www.w3.org/TR/xpath/.
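
To get a feel for how such a path walks down the tree, here is a minimal sketch that needs no network access. It deliberately uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, instead of Scrapy's Selector, and the markup is a made-up, simplified fragment (with example.com urls), not the real page.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment loosely shaped like a chart listing.
html = """
<html><body>
<div id="top100">
  <ul>
    <li>
      <span>1</span>
      <span>cover</span>
      <span><a href="http://example.com/perfect-lyrics.html">Perfect Lyrics</a></span>
    </li>
    <li>
      <span>2</span>
      <span>cover</span>
      <span><a href="http://example.com/havana-lyrics.html">Havana Lyrics</a></span>
    </li>
  </ul>
</div>
</body></html>
"""

root = ET.fromstring(html.strip())

# div with id "top100" -> ul -> every li -> third span -> a
links = root.findall(".//div[@id='top100']/ul/li/span[3]/a")
titles = [a.text for a in links]
urls = [a.get("href") for a in links]

print(titles)
print(urls)
```

The path reads the same way as the expression used in the spider: an attribute predicate picks the container, and `span[3]` picks the third `span` child of each `li`.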

CSS:

CSS selectors are as explicit as XPath expressions and easy to learn. Here's a small example showing how a CSS selector works for parsing HTML. More about CSS selectors can be found at https://www.w3.org/TR/selectors/.

In [111]:
from scrapy import Selector
htmlTest = """
<div style="display: none;" id="div-gpt-ad-skin" x-dn="true" data-google-query-id="CJK3_M_ElNoCFYoxaQod7UsBIw">
    <div id="google_ads_iframe_/8264/aw-metrolyrics/charts/top100_1__container__" style="border: 0pt none;">
          <span>tea</span>
          <li class = "list_items">atheletic</li>
          <li class = "list_items">boots</li>
          <li class = "list_items">flats<a href="/set-up/asdh$%31988"></a></li>
          <li class = "grid_4">frame in blue</li>
    </div>
</div>
"""

slc1 = Selector(text = htmlTest)
nodes = slc1.css('li.list_items')
for node in nodes:
    print(node.css('::text').extract_first())
    if len(node.css('a::attr(href)').extract()) != 0:
         print("url:"+str(node.css('a::attr(href)').extract_first()))

atheletic
boots
flats
url:/set-up/asdh$%31988


### Use this Spider to scrape
Now we can use this spider to extract the top-ranked songs and their singers from the main page. Scrapy's Feed Exports support several serialization formats, such as JSON, CSV and XML, to store data. This is one of its advantages over other data collection methods, since the scraped data usually has to be exported and stored in a proper format for further use, possibly by other systems or tools. In the 'lyrics' directory, use this command to invoke the spider and write the data to a JSON file:

$ scrapy crawl lyricSpider -o topSongs_data.json

Let's open this json file to see the data we have scraped:

In [98]:
import json

with open("lyrics/topSongs_data.json", encoding='utf-8') as f:
    data = json.load(f)
print(data)

[{'song': 'Daru Badnaam', 'artist': 'Param Singh', 'url': 'http://www.metrolyrics.com/daru-badnaam-lyrics-param-singh.html'}, {'song': 'Beautiful In White', 'artist': 'Westlife', 'url': 'http://www.metrolyrics.com/beautiful-in-white-lyrics-westlife.html'}, {'song': 'Despacito', 'artist': 'Luis Fonsi', 'url': 'http://www.metrolyrics.com/despacito-lyrics-luis-fonsi.html'}, {'song': 'Torete', 'artist': 'Moonstar88', 'url': 'http://www.metrolyrics.com/torete-lyrics-moonstar88.html'}, {'song': 'Akin Ka Na Lang', 'artist': 'Morissette Amon', 'url': 'http://www.metrolyrics.com/akin-ka-na-lang-lyrics-morissette-amon.html'}, {'song': 'Summer Nights', 'artist': 'Grease', 'url': 'http://www.metrolyrics.com/summer-nights-lyrics-grease.html'}, {'song': 'Nadarang', 'artist': 'Shanti Dope', 'url': 'http://www.metrolyrics.com/nadarang-lyrics-shanti-dope.html'}, {'song': 'Pokémon Theme', 'artist': 'Pokémon', 'url': 'http://www.metrolyrics.com/pokemon-theme-lyrics-pokemon.html'}, {'song': 'A Thousand Ye

### An improved spider -- scraping multiple urls
The previous spider is a simple example to illustrate what a basic spider looks like. It only scrapes the information from the 'start_urls'. What we really want are the lyrics, which are on the page behind each song's link. Next, a new spider is created to parse not only the information on the main page, but also the pages it links to. 

Note that the Item is also modified slightly by adding some fields. Only the code for the new spider is shown here.

In [116]:
%%writefile lyrics/lyrics/spiders/lyricSpiderImproved.py
import scrapy
from lyrics.items import LyricsItem

class LyricSpiderImproved(scrapy.Spider):
    name = 'lyricSpiderImproved' 
    allowed_domains = ["www.metrolyrics.com"]
    start_urls = ["http://www.metrolyrics.com/top100.html"]
    
    def parse(self, response):
        slc2 = scrapy.Selector(response)
        sites = slc2.xpath('//*[@id="main-content"]/div[1]/div/div/ul/li')
    
        for site in sites:
            item = LyricsItem()
            item['song'] = site.xpath('span[3]/a/text()').extract_first().strip()[:-7] 
            item['artist'] = site.xpath('span[3]/span/a/text()').extract_first().strip()
            item['url'] = site.xpath('span[3]/a/@href').extract_first()
            
            #The following fields are assigned for later use. 
            #Another field called 'lastWeekRank' is also declared in Item
            item['thisWeekRank'] = site.xpath('span[1]/text()').extract_first()
            item['up'] = site.xpath('div[@class="last-week up"]/text()').extract_first()
            item['same'] = site.xpath('div[@class="last-week same"]/text()').extract_first()
            item['down'] = site.xpath('div[@class="last-week down"]/text()').extract_first()
            
            yield scrapy.Request(url = item['url'], meta = {'item':item}, callback = self.parse_lyric)

    
    def parse_lyric(self, response):
        slc3 = scrapy.Selector(response)
        textLines= slc3.xpath('//*[@id="lyrics-body-text"]/p/text()').extract()
        #print(textLines)
        
        item = response.meta['item']
        item['lyric'] = " ".join(textLines)
        yield item

Writing lyrics/lyrics/spiders/lyricSpiderImproved.py


The core improvement is to yield a Request for each url we want to follow and pass the parse_lyric method as its callback. The spider then parses one level deeper. The same approach can be applied to scrape paginated listings. For example, if we want to get information about all shoes listed on Amazon, which may span multiple pages, we can use this method to go from the first page to the last. Such cases are even easier, since no other parse method needs to be written: find the next page's url and pass self.parse itself as the callback. The spider will recursively scrape the pages until there is no next page. You can give it a try if you're interested. It's easier to implement than the Yelp reviews example in our homework :)
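
The "pass the parse method itself as callback" idea can be tried without touching a real website. The sketch below is a plain-Python simulation, not Scrapy code: `Request`, `Response`, `PAGES` and `crawl` are hypothetical stand-ins for scrapy.Request, a downloaded response, the website and the engine.

```python
from collections import deque, namedtuple

# Simplified stand-ins for scrapy.Request and a downloaded response (hypothetical).
Request = namedtuple("Request", ["url", "callback"])
Response = namedtuple("Response", ["url", "products", "next_url"])

# Fake "website": three listing pages chained by a next link.
PAGES = {
    "page1": Response("page1", ["boots", "flats"], "page2"),
    "page2": Response("page2", ["sneakers"], "page3"),
    "page3": Response("page3", ["sandals"], None),
}


def parse(response):
    """Yield one item per product, then a request for the next page, if any."""
    for name in response.products:
        yield {"name": name, "page": response.url}
    if response.next_url is not None:
        # Passing parse itself as the callback makes the crawl recursive.
        yield Request(response.next_url, callback=parse)


def crawl(start_url):
    """Tiny engine loop: fetch, call the callback, queue any new requests."""
    queue = deque([Request(start_url, callback=parse)])
    items = []
    while queue:
        request = queue.popleft()
        response = PAGES[request.url]  # plays the role of the Downloader
        for result in request.callback(response):
            if isinstance(result, Request):
                queue.append(result)
            else:
                items.append(result)
    return items


items = crawl("page1")
print([item["name"] for item in items])
```

In a real spider, the corresponding line would be something like `yield scrapy.Request(next_url, callback=self.parse)`, with `next_url` extracted from the page by a selector that depends on the site.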

Run this improved spider and export data into 'lyrics_data.json' file. See what we get:

In [145]:
with open("lyrics/lyrics_data.json", encoding='utf-8') as f:
    data = json.load(f)
print("Name: ", data[3]['song'])
print("Artist: ", data[3]['artist'])
print("Lyric:\n", data[3]['lyric'])

Name:  Beautiful In White
Artist:  Westlife
Lyric:
 Not sure if you know this 
But when we first met 
I got so nervous 
I couldn't speak 
In that very moment I found the one and 
My life had found its missing piece So as long as I live I'll love you 
Will have and hold you 
You look so beautiful in white 
And from now to my very last breath 
This day I'll cherish 
You look so beautiful in white 
Tonight What we have is timeless 
My love is endless 
And with this ring I say to the world 
You're my every reason 
You're all that I believe in 
With all my heart I mean every word So as long as I live I'll love you 
Will have and hold you 
You look so beautiful in white 
And from now to my very last breath 
This day I'll cherish 
You look so beautiful in white 
Tonight Oh, oh 
You look so beautiful in white 
Na na na na na 
So beautiful in white 
Tonight And if our daughter's what our future holds 
I hope she has your eyes 
Finds love like you and I did, yeah. 
When she falls in love we let 

## Step 4: Create Item Pipelines
According to the architecture of Scrapy, Items are finally passed to Item Pipelines. In these components, data is processed and either kept or dropped. Common applications are cleaning data, validating fields and dropping duplicates. Besides, data can be stored in a database through these pipelines.

Now, I only want to keep the songs that rose in the ranking compared to last week. On the page, this is indicated by a 'div' tag with class "last-week up" (as opposed to "last-week same" or "last-week down"). Thus, in the 'process_item()' method, which is called for every item, I check the 'up' field and drop the items without a value. Last week's ranking can then be calculated from this week's rank and the number of places gained. (Some modifications were also made to the items.py file and the LyricSpiderImproved spider to add these fields.)
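
As a small worked example of that calculation: a song ranked 5th this week that gained 3 places was ranked 8th last week. The sketch below assumes the 'up' text consists of one marker character followed by the number of places gained (for instance '+3'), which is what the `item['up'][1:]` slice in the pipeline implies. `last_week_rank` is a hypothetical helper name.

```python
def last_week_rank(this_week_rank, up):
    """Return last week's rank as a string.

    Assumes `up` looks like '+3': one marker character followed by the
    number of places gained (mirrors item['up'][1:] in the pipeline).
    """
    places_gained = int(up[1:])
    return str(int(this_week_rank) + places_gained)


print(last_week_rank("5", "+3"))  # 8
```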

Instead of specifying a JSON file on the command line, here I write another pipeline class, 'JsonWriterPipeline', to write the JSON file. Calling the spider will then automatically write the data in the format you want. Writing data into a database like MongoDB is very similar; see https://doc.scrapy.org/en/latest/topics/item-pipeline.html .

In [130]:
# %load lyrics/lyrics/pipelines.py

In [139]:
%%writefile lyrics/lyrics/pipelines.py
import json
from scrapy.exceptions import DropItem

class LyricsPipeline(object):
    
    def process_item(self, item, spider):
        if item['up']:
            item['lastWeekRank'] = str(int(item['thisWeekRank'])+int(item['up'][1:]))
            return item
        else:
            raise DropItem("Song %s does not rise in ranking" % item['song'])

            
class JsonWriterPipeline(object):
    
    def open_spider(self, spider):
        self.file = open('processed_data.json','w')
        
    def close_spider(self,spider):
        self.file.close()
        
    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

Overwriting lyrics/lyrics/pipelines.py


Modifying the pipelines.py file alone is not enough, though. The pipelines must also be enabled in the settings.py file.

In [132]:
# %load lyrics/lyrics/settings.py

In [134]:
%%writefile lyrics/lyrics/settings.py

BOT_NAME = 'lyrics'

SPIDER_MODULES = ['lyrics.spiders']
NEWSPIDER_MODULE = 'lyrics.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True


# Configure item pipelines
ITEM_PIPELINES = {
   'lyrics.pipelines.LyricsPipeline': 300, #the number decides the running order (lower runs first). It should be between 0 and 1000.
   'lyrics.pipelines.JsonWriterPipeline': 800,
    
}

Overwriting lyrics/lyrics/settings.py
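
The numbers in ITEM_PIPELINES matter more than the order in which the entries are written: items pass through the pipelines from the lowest value to the highest. The snippet below only illustrates that ordering rule with a plain sort; it is not how Scrapy loads pipelines internally.

```python
# Same mapping as in settings.py (pipeline path -> priority), written in reverse on purpose.
ITEM_PIPELINES = {
    'lyrics.pipelines.JsonWriterPipeline': 800,
    'lyrics.pipelines.LyricsPipeline': 300,
}

# Lower values run first, regardless of the order the keys are written in.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path)
```

Because LyricsPipeline runs first and raises DropItem for songs that did not rise, those items never reach JsonWriterPipeline and are not written to the file.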


Now, run the spider again and see how many records of songs we have filtered out.

In [144]:
processed_data = []
for line in open("lyrics/processed_data.json", encoding='utf-8'):
    processed_data.append(line)
    
print(len(data))
print(len(processed_data))

20
12
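
Note that JsonWriterPipeline writes one JSON object per line, so the file as a whole is not a single JSON document and `json.load()` on it would fail. Each line has to be parsed on its own. A minimal sketch, using an in-memory file with made-up sample values instead of the real processed_data.json:

```python
import io
import json

# Two lines in the same shape JsonWriterPipeline writes (sample values are made up).
sample_file = io.StringIO(
    '{"song": "Perfect", "artist": "Ed Sheeran", "lastWeekRank": "8"}\n'
    '{"song": "Despacito", "artist": "Luis Fonsi", "lastWeekRank": "12"}\n'
)

# Parse each non-empty line as its own JSON document.
records = [json.loads(line) for line in sample_file if line.strip()]

print(len(records))
print(records[0]["song"])
```

For the real file, the same list comprehension works on `open("lyrics/processed_data.json", encoding='utf-8')`.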


### Even more: ImagesPipeline for scraping images
Scrapy provides a special Pipeline, ImagesPipeline, which is very handy for downloading images. Now, let's use this amazing feature to scrape the albums' covers of songs.

Note: The following code cells are not run here. Simply paste them into the current corresponding files for Item, ItemPipeline and settings. 

First, add some configurations in settings for the new Pipeline:

In [None]:
#Modify the pipelines' configuration
ITEM_PIPELINES = {
   'lyrics.pipelines.MyImagesPipeline': 1, #the custom ImagesPipeline subclass defined in pipelines.py
   'lyrics.pipelines.LyricsPipeline': 300,
   'lyrics.pipelines.JsonWriterPipeline': 800,   
}

#Direct to the path where you want to store the downloaded images
IMAGES_STORE = 'pics_scraped'

Add fields for images and their urls in Item:

In [None]:
#Modify the class in items.py
images_urls = scrapy.Field()
images = scrapy.Field()

Also, let the spider scrape the urls of the images by adding this to the parse() method:

In [None]:
#Modify the LyricSpiderImproved class's parse() method in corresponding spider file
item['images_urls'] = site.xpath('span[2]/a/img/@src').extract() #a list of urls, since the pipeline iterates over this field

Finally, build this Pipeline class for images:

In [None]:
#Modify the pipelines.py
import scrapy
from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline): #Be careful to inherit from ImagesPipeline here!
    
    def get_media_requests(self,item,info):
        #request every image url stored on the item
        for url in item['images_urls']:
            yield scrapy.Request(url)
    
    def item_completed(self,results,item,info):
        #results is a list of (success, info) tuples; info['path'] is where the image was stored
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Song has no cover")
        item['images'] = image_paths
        return item
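
To make the `results` argument of item_completed() less abstract: according to the Scrapy documentation, it is a list of two-element tuples, one per requested image, where the first element tells whether the download succeeded and the second is a dict with the keys 'url', 'path' and 'checksum' on success. The values below are made up, and the failed entry uses a plain Exception as a placeholder for the Twisted Failure that Scrapy would pass.

```python
# Hypothetical `results` value, shaped like what item_completed() receives.
results = [
    (True, {'url': 'http://example.com/cover1.jpg',
            'path': 'full/0a1b2c.jpg',
            'checksum': 'abc123'}),
    (False, Exception('download failed')),  # placeholder for a Twisted Failure
]

# Keep the storage path of every successful download.
image_paths = [info['path'] for ok, info in results if ok]
print(image_paths)
```

The paths are relative to the directory configured in IMAGES_STORE.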

<a id='advantages'></a>
# Advantages of Scrapy
From the hands-on practice with Scrapy above, we can summarize the advantages of this web crawler.

a) Scrapy is a relatively complete framework. The whole process of data collection, from requesting and parsing to processing and storing data, is supported by a dedicated component. It is really well structured.

b) Scrapy offers various ways to store data, which makes further manipulation of the data possible, even in other systems or tools.

c) Scrapy also has many handy features, such as fast parsing with Selectors and image scraping with special pipelines.

Scrapy has more advantages than those illustrated here. For example, the crawling speed can be adjusted automatically rather than manually (see the AutoThrottle extension in the official documentation), so that requesting and parsing work efficiently. More can be found through further practice with Scrapy.

<a id='experience'></a>
# Practice experience
When I began writing this tutorial, I practiced Scrapy techniques on Amazon. I changed the target because my IP was banned: at that time I was not paying attention to the frequency of the requests I sent. To protect their data, most companies and organizations are cautious about crawlers and other robot-like visitors. There are several ways to handle this.

Some other tools can rotate our IP address or visit the website indirectly, which may be improper or impractical under some circumstances, so they are not recommended here. Only some natural methods built into Scrapy itself are listed here:

a) Slow down requesting by setting download delays. 

In [None]:
#Add DOWNLOAD_DELAY variable in settings.py
DOWNLOAD_DELAY = 3  # 2 or higher is recommended

b) Disable cookies since they are used to keep track of us visitors.

In [None]:
#Set the COOKIES_ENABLED variable to False in settings.py
COOKIES_ENABLED = False

However, other than the techniques suggested above, the general rule for data collection is to BE POLITE.
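
As a complement to a fixed delay, Scrapy's built-in AutoThrottle extension can adapt the delay to how quickly the server responds. The fragment below is a sketch for settings.py; the setting names are Scrapy's, but the values are only illustrative assumptions, not recommendations for any particular site.

```python
# settings.py (sketch): optional politeness settings with illustrative values.
AUTOTHROTTLE_ENABLED = True          # adapt the delay to the server's response times
AUTOTHROTTLE_START_DELAY = 5         # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60          # upper bound for the delay under high latency
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # limit parallel requests to a single site
```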

<a id='more'></a>
# References and further explorations
This tutorial only provides guidelines for new users of Scrapy. Scrapy offers more functions than those introduced here. If you are interested, the following links will be very useful:

1.Official tutorial for Scrapy: https://doc.scrapy.org/en/latest/index.html

2.Some useful topics about Scrapy:

  Logging: https://doc.scrapy.org/en/latest/topics/logging.html
  
  Debugging Spiders: https://doc.scrapy.org/en/latest/topics/debug.html
  
  Link Extractors: https://doc.scrapy.org/en/latest/topics/link-extractors.html
  
  Distributed crawlers based on Scrapy: https://github.com/rmax/scrapy-redis

3.How to crawl the web politely with Scrapy: https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy/