# SCRAPY TUTORIAL - Introduction to Python Scrapy Tool 

Information reaches us daily from every channel, in growing variety and volume and with relentless updates. Data preparation is therefore of pivotal importance in any analysis, planning, or decision-making process. 

In this tutorial, we will use Scrapy to get exchange rate information from this <a href="http://www.x-rates.com/table/?from=USD&amount=1"> website </a>.  
The typical architecture of Scrapy is shown in the picture below. Once set up, a Scrapy project is built from these key parts: Spider, Item Pipeline, Downloader, middleware, Scheduler and the Engine. 

<img src="Scrapy Structure.jpg" alt="Scrapy Structure" style="width:400px;height:300px"/>


The major parts shown above are:
- Scrapy Engine: responsible for the communication between the Spider, Item Pipeline, Downloader and Scheduler.
- Scheduler: takes the requests sent from the Engine (which gets them from the Spider) and puts them into a scheduled queue. It also eliminates repeated requests.
- Downloader: responsible for downloading the responses to the Engine's requests. It hands the responses back to the Engine, which then hands them over to the Spider.
- Spider: the core part that extracts data items from the responses and returns the URLs that need to be followed up to the Engine. (The Engine then sends these URLs back to the Scheduler.)
- Item Pipeline: takes the items from the Spider, processes them and stores them in the format specified by the user (such as JSON).
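
The hand-offs described in the list above can be sketched with a toy model. This is not Scrapy code: every class and function name below (`ToyScheduler`, `toy_downloader`, `toy_spider`, `toy_engine`) is a hypothetical stand-in, and the "website" is just an in-memory dictionary. It only illustrates how the Engine moves data between the parts and how the Scheduler drops repeated requests.

```python
from collections import deque


class ToyScheduler:
    """Queue of pending URLs that silently drops URLs it has already seen."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_url(self):
        return self.queue.popleft() if self.queue else None


def toy_downloader(url, site):
    """Stand-in for the Downloader: look the page up in an in-memory 'site'."""
    return site.get(url, {"data": None, "links": []})


def toy_spider(response):
    """Stand-in for the Spider: return (items, links to follow)."""
    items = [response["data"]] if response["data"] is not None else []
    return items, response["links"]


def toy_engine(start_urls, site):
    """Stand-in for the Engine: pass data between the other components."""
    scheduler = ToyScheduler()
    stored = []  # plays the role of the Item Pipeline's output
    for url in start_urls:
        scheduler.enqueue(url)
    url = scheduler.next_url()
    while url is not None:
        response = toy_downloader(url, site)
        items, links = toy_spider(response)
        stored.extend(items)
        for link in links:
            scheduler.enqueue(link)  # repeated URLs are dropped here
        url = scheduler.next_url()
    return stored


# Two hypothetical pages that link to each other
site = {
    "/a": {"data": "item-a", "links": ["/b", "/a"]},
    "/b": {"data": "item-b", "links": ["/a"]},
}
crawled = toy_engine(["/a"], site)
print(crawled)
```

Although the two pages link to each other, each one is visited only once because the scheduler remembers which URLs it has already queued.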



You can create a Scrapy project from the terminal and modify its Python files in an editor such as Sublime Text. But that involves a lot of environment setup, installation and switching between screens. In this tutorial, we will show you how Scrapy can be started and used in a relatively lightweight way within a Jupyter Notebook. 

If you are interested, you can still try the Scrapy tool from your terminal after installing Scrapy and all its dependencies. Here is a <a href="https://doc.scrapy.org/en/latest/intro/install.html">guide</a> for the installation. 

After the installation, there are many ways to check that Scrapy is properly set up and good to go. For example, type "scrapy" directly in your terminal, and the following messages should appear:


<img src="scrapyStart.PNG" alt="Scrapy Start" style="width:500px;height:300px"/>

You can also use the commands listed above to test the Scrapy tool itself. The 'fetch' command is also useful for checking whether a website can be scraped. Below are examples of a successful fetch (status 200) and a failed fetch (status 404):
```python
apple@venti:~$ scrapy fetch "https://finance.yahoo.com/"

2018-03-20 11:26:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://finance.yahoo.com/trending-tickers> (referer: None)
_______
apple@venti:~$ scrapy fetch "https://www.amazon.com/gp/twitter/"

2018-03-20 12:01:28 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.amazon.com/gp/twitter/> (referer: None)

```
You can also start your Scrapy project in the terminal using `scrapy startproject <your project name>`:

<img src="startANewProject.PNG" alt="Scrapy Start" style="width:700px;height:100px"/>


Once started, Scrapy will auto-generate the files that you can use to create your own Scrapy project.

<img src="scrapyDetails.PNG" alt="Scrapy Start" style="width:500px;height:80px"/>

Again, all of the above demonstrates the traditional way Scrapy is used. That is not the aim of this tutorial, which instead uses the powerful but easy-to-use Jupyter Notebook. 


## 1. Setting Up the Environment 

First, you will need to install the scrapy package from the Anaconda Navigator's Environments panel. Find the scrapy package and apply it to the environment. The version used in this tutorial is 1.3.3.
<img src="Anaconda scrapy package.PNG" alt="Anaconda scrapy package" style="width:600px;height:200px"/>
Next, import scrapy into this notebook.
You will also need to import the Scrapy crawler process so that you can start the spider, which sends its requests to the Engine, from inside the notebook.

In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess

## 2. Create the initial Spider

Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words, Spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites).




Before using Scrapy, it is important to have one or more specific destination websites from which you want to get data for analysis. In this part, you need to specify: 
- the <b>name</b> of this spider
- the <b>allowed_domains</b> (the boundaries of your data) (optional)
- the <b>start_urls</b> to kick off the crawling process
- the <b>parse</b> method, which defines how your data will be extracted

Since we haven't specified the item pipeline yet, we can simply write the extracted values to the log using <b>self.log()</b> to see their structure.

Note that Scrapy is designed to run from the terminal, so the crawling process can only be run once inside this notebook: Scrapy runs on the Twisted reactor, which cannot be restarted within the same Python process. Thus we will have to save the response locally for further analysis. 

We will use a "RERUN REQUIRED" notice to mark the code for which you must restart the kernel first. 


In [None]:
import logging

class XRateSpider(scrapy.Spider):
    
    name = 'ExchangeRates'
    start_urls = ['http://www.x-rates.com/table/?from=USD&amount=1'
    ]
    
    def parse(self, response):
        '''
        This method is called every time a response is sent from the Engine 
        to the Spider. It is responsible for extracting the desired data from the 
        response.
        
        Args: response sent from the Engine (in fact from the Downloader)
        '''
        CurrencyType = response.xpath('//table[@class="tablesorter ratesTable"]/tbody/tr/td[not(@class)]/text()').extract() 
        DollarEquivalent = response.xpath("//table[@class='tablesorter ratesTable']/tbody/tr/td[@class='rtRates']/a/text()").extract()
        
        self.log("Log 1")
        self.log(CurrencyType)
        self.log(DollarEquivalent)
    

Test your Spider using the code below. You should see output similar to the sample that follows. 

In [None]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(XRateSpider)
process.start()

```python
2018-03-09 18:04:48 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot) 
2018-03-09 18:04:48 [scrapy.utils.log] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2018-03-09 18:04:48 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
...
2018-03-09 18:04:48 [scrapy.core.engine] INFO: Spider closed (finished)
```
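
To see what the two XPath expressions in `parse` select, the sketch below runs comparable queries against a tiny, hypothetical copy of the table markup. It deliberately swaps in the standard library's `xml.etree.ElementTree` rather than Scrapy's selectors, so it runs without a crawl. ElementTree supports only a subset of XPath and has no `not(@class)`, so that condition is replaced by a filter in Python. The HTML, the currency names and the rates are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the rates table markup.
html = """
<table class="tablesorter ratesTable">
  <tbody>
    <tr>
      <td>Currency A</td>
      <td class="rtRates"><a href="#">2.0</a></td>
      <td class="rtRates"><a href="#">0.5</a></td>
    </tr>
    <tr>
      <td>Currency B</td>
      <td class="rtRates"><a href="#">4.0</a></td>
      <td class="rtRates"><a href="#">0.25</a></td>
    </tr>
  </tbody>
</table>
""".strip()

table = ET.fromstring(html)

# Cells without a class attribute hold the currency name.
# (ElementTree has no not(@class), so the filter is done in Python.)
currency_type = [
    td.text for td in table.findall("./tbody/tr/td") if td.get("class") is None
]

# Cells with class="rtRates" wrap the rate in an <a> element.
dollar_equivalent = [
    a.text for a in table.findall("./tbody/tr/td[@class='rtRates']/a")
]

print(currency_type)
print(dollar_equivalent)
```

Note that the rates come back as one flat list with two entries per table row, which is why the spider later has to pair them up by index.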

## 3. Item

Now you will need to define the Item class in order to specify the structure/format of the extracted data. Scrapy provides a handy Item class to define a common output data format.
It acts like a container to collect the scraped data. Inside the Item class, fields and their values are arranged in the same way as in a Python dictionary. 

In the Item class below, you will need to declare the item's fields. Items are declared using a simple class definition syntax and Field objects.

In [None]:
import scrapy

class XRateItem(scrapy.Item):
    '''
    Given that our purpose is to extract all the types of currencies and their relevant value in terms
    of dollars, this class should contain three fields: 
      OneUnitinDollar - the dollar value of one unit of this currency 
      OneDollarEquivalent - the units of this currency equivalent to one dollar
      Type - the currency's type
    '''
    OneUnitinDollar = scrapy.Field()
    OneDollarEquivalent = scrapy.Field()
    Type = scrapy.Field()
    
    

There are multiple ways to view the values of an item, similar to viewing the value of a certain key inside a dictionary. 
```python
>>> item['Type']
'Argentine Peso'
>>> item.get('OneDollarEquivalent')
'20.234946'

>>> item['OneUnitinDollar']
'0.100125'
```

## 4. Item Pipeline

Next we need to set up the item pipeline class. The class name must match the name you later register in the ITEM_PIPELINES setting. Inside this class, the process_item method has to be filled in to define how the data items are processed (stored). 
Please pay attention to the following: 
- Don't forget to add your pipeline to the settings inside the spider. 
- The item passed from the spider is a dictionary-like Item object, not a JSON string. To store it, convert it with dict(), encode it using json.dumps and write it to the destination file. 
- Do remember to return the item. The returned item is handed on to the next pipeline stage, if there is one.

In [None]:
import json

class JsonWriterPipeline(object):
    
    def open_spider(self, spider):
        '''
        This method is executed when the spider starts. 
        '''
        self.file = open('itemlist.jl', 'w')

    def close_spider(self, spider):
        '''
        This method will only be executed when the spider is closed
        
        '''
        self.file.close()

    def process_item(self, item, spider):
        '''
        Args: item - this is the item passed from the Spider 
              spider - this is the spider associated with this item pipeline
        Returns: Item 
        '''
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
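
As a rough illustration of the order in which the three hooks are used, the sketch below calls them by hand on a minimal stand-in class and reads the resulting JSON Lines file back. `JsonLinesWriter`, the temporary file location and the two sample items are assumptions made for this example; in a real crawl Scrapy calls these methods for you.

```python
import json
import os
import tempfile


class JsonLinesWriter:
    """Minimal stand-in with the same three hooks as the pipeline above."""

    def __init__(self, path):
        self.path = path

    def open_spider(self, spider):
        self.file = open(self.path, "w")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # One JSON document per line ("JSON Lines")
        self.file.write(json.dumps(dict(item)) + "\n")
        return item


# Hypothetical items and a throwaway output location
path = os.path.join(tempfile.mkdtemp(), "itemlist.jl")
items = [
    {"Type": "Currency A", "OneDollarEquivalent": "2.0"},
    {"Type": "Currency B", "OneDollarEquivalent": "4.0"},
]

pipeline = JsonLinesWriter(path)
pipeline.open_spider(spider=None)       # called once when the spider starts
returned = [pipeline.process_item(item, spider=None) for item in items]
pipeline.close_spider(spider=None)      # called once when the spider closes

# Each line of the file decodes back into one item
with open(path) as f:
    loaded = [json.loads(line) for line in f]
print(loaded)
```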


## 5. Settings 

Spiders can define their own settings that will take precedence and override the project ones. They can do so by setting their custom_settings attribute:
```python
class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'SOME_SETTING': 'some value',
    }
```

You should configure 'ITEM_PIPELINES' for this spider inside the settings block below. It is a dictionary in which each key is the name of a pipeline to be associated with this spider,
and each value is a number indicating the order of execution: a pipeline with a smaller number is executed first. By convention, the numbers are in the 0-1000 range.
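
The ordering rule can be mimicked in plain Python. The snippet below is only a sketch of the rule, not of Scrapy's internals, and the dictionary with two pipelines and their numbers is a hypothetical example.

```python
# Hypothetical setting with two pipelines; only the numbers matter here.
item_pipelines = {
    "__main__.MongoPipeline": 300,
    "__main__.JsonWriterPipeline": 100,
}

# Pipelines run in ascending order of their number.
execution_order = sorted(item_pipelines, key=item_pipelines.get)
print(execution_order)
```

With these numbers, the pipeline registered with 100 would run before the one registered with 300, regardless of the order in which the dictionary lists them.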


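In this tutorial, we will also export all the data to a JSON file named 'itemlist.json', saved in the same directory as the notebook. To do so, we will use FEED_FORMAT and FEED_URI in the custom_settings. (These two settings match Scrapy 1.3.3, the version used here; Scrapy 2.1 and later replace them with the FEEDS setting.) 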



```python
custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 100},
        'FEED_FORMAT':'json',                                 
        'FEED_URI': 'itemlist.json'
    }
```    

## 6. Finishing up Spider
Complete your Spider so that it stores the items in the way you defined in the Pipeline class. Please note that, since a crawl can only be run once per kernel session inside the Jupyter Notebook, you will need to restart the kernel and run the rest of the cells except for the initial Spider. 

Notice the difference between using <b>yield</b> inside the for loop and using <b>return</b> to hand back the items after the for loop. You may find this <a href="https://www.quora.com/What-is-the-difference-between-yield-and-return-in-python"> post </a> useful in deciding which to use. 
RERUN REQUIRED-- You will need to restart the kernel and run ONLY the Item cell, the Item Pipeline cell and the new spider you create below.
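
The difference between the two keywords can be seen in a small, Scrapy-free sketch. The function names and sample rows are hypothetical; the point is only that `return` hands everything over at the end, while `yield` hands over one item at a time.

```python
def parse_with_return(rows):
    items = []
    for row in rows:
        items.append({"Type": row})
    return items  # nothing is handed over until the loop has finished


def parse_with_yield(rows):
    for row in rows:
        yield {"Type": row}  # each item is handed over as soon as it is built


rows = ["Currency A", "Currency B"]

returned = parse_with_return(rows)   # a complete list

generator = parse_with_yield(rows)   # a generator; no code has run yet
first = next(generator)              # runs the loop body once
rest = list(generator)               # runs the remaining iterations

print(returned, first, rest)
```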


In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess
import logging

class XRateSpider(scrapy.Spider):
    
    name = 'ExchangeRates'
    start_urls = ['http://www.x-rates.com/table/?from=USD&amount=1'
    ]

    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 100},
        'FEED_FORMAT':'json',                                 
        'FEED_URI': 'itemlist.json'
    }
    
    def parse(self, response):
        '''
        This method is called every time a response is sent from the Engine 
        to the Spider. It is responsible for extracting the desired data from the 
        response.
        
        Args: response sent from the Engine (in fact from the Downloader)
        '''
        CurrencyType = response.xpath('//table[@class="tablesorter ratesTable"]/tbody/tr/td[not(@class)]/text()').extract() 
        DollarEquivalent = response.xpath("//table[@class='tablesorter ratesTable']/tbody/tr/td[@class='rtRates']/a/text()").extract()
        
        self.log("Log 1")
        self.log(CurrencyType)
        self.log(DollarEquivalent)
        
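        for i in range(len(CurrencyType)):
            item = XRateItem()
            item["Type"] = CurrencyType[i]

            # Each table row contributes two consecutive rate cells:
            # index 2*i is the amount per dollar, index 2*i+1 is its inverse.
            j = i*2
            k = i*2+1
            item["OneUnitinDollar"] = DollarEquivalent[k]

            item["OneDollarEquivalent"] = DollarEquivalent[j]
        

            yield item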
            


In [None]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'  
})

process.crawl(XRateSpider)
process.start()
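
The index arithmetic in `parse` pairs each currency with two entries of the flat `DollarEquivalent` list. The sketch below spells this out on hypothetical data, assuming each table row contributes first the amount per dollar and then its inverse. Note that for row `i` the inverse sits at index `2*i + 1`; an index of `(i-1)*2 + 1` evaluates to `-1` for the first row and would silently read the last element of the list.

```python
def pair_rates(currency_types, rates):
    """Pair each currency with its two rates from the flat list."""
    rows = []
    for i, name in enumerate(currency_types):
        rows.append({
            "Type": name,
            "OneDollarEquivalent": rates[2 * i],   # first rate cell of row i
            "OneUnitinDollar": rates[2 * i + 1],   # second rate cell of row i
        })
    return rows


# Hypothetical values: two rows, two rate cells per row
currency_types = ["Currency A", "Currency B"]
rates = ["2.0", "0.5", "4.0", "0.25"]

rows = pair_rates(currency_types, rates)
print(rows)

# What the off-by-two index would pick for the first row (i = 0):
wrong_index = (0 - 1) * 2 + 1
print(wrong_index, rates[wrong_index])
```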

In [None]:
import pandas as pd
dfjson = pd.read_json('itemlist.json')
dfjson


The above code should have the following output:
<img src="Pandas_output.PNG" alt="Pandas_output" style="width:400px;height:400px"/>



## 7. Write Items to MongoDB

You can also write your items to MongoDB using pymongo. You can find details about MongoDB at this <a href="https://www.mongodb.com/">link</a>. This tutorial uses <a href='https://mlab.com/welcome/'>mLab</a>, a Database-as-a-Service platform, to store the data items. You will need to set up your credentials following this <a href='http://docs.mlab.com/'>documentation</a>, and create a database named '15688scrapy' for the purposes of this tutorial.

Once set up, you can use the command below in your terminal to connect to your remote database:
```python
% mongo ds012345.mlab.com:56789/dbname -u dbuser -p dbpassword
```
You should see the following message:
```python
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see
	http://docs.mongodb.org/
Questions? Try the support group
	http://groups.google.com/group/mongodb-user
```

Now return to the item pipeline. You can define another item pipeline class to process and store data in your mLab MongoDB:
- First you will need to specify the collection_name as 'XRateScrapycollection' (for the purposes of this tutorial) under the 15688scrapy database. The collection is the "table" that holds all the data items from the pipeline.  
- Inside the init method, define the mongo_uri and mongo_db. 
- We will use a class method called from_crawler to create the pipeline instance from a Crawler. It must return a new instance of the pipeline. The Crawler object provides access to all Scrapy core components such as settings and signals; it is the way for a pipeline to access them and hook its functionality into Scrapy.

In [None]:
import pymongo

class MongoPipeline(object):
    '''Given the data items extracted from the spider, this class is to define another way
    to store data into NoSQL database MongoDB'''
    
    collection_name = 'XRateScrapycollection'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGODB_DATABASE', 'items')
        )

    def open_spider(self, spider):
        '''
        Open the connection to your remote MongoDB when the spider starts
        '''
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        '''
        Instead of writing the items from the spider to a local file, insert each
        item into MongoDB as a document (a dictionary).
        Log messages track the insertion status and the data inserted.
        '''
        self.db[self.collection_name].insert_one(dict(item))
        logging.debug("Added to mongo")
        logging.debug(dict(item))
        return item
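
The `from_crawler` pattern can be tried without Scrapy or MongoDB. In the sketch below, `SettingsAwarePipeline` and `fake_crawler` are hypothetical stand-ins: a plain dictionary plays the role of `crawler.settings`, and the URI is a placeholder. It shows how the class method reads the settings and falls back to the default database name when a setting is missing.

```python
from types import SimpleNamespace


class SettingsAwarePipeline:
    """Minimal stand-in that only keeps the configuration it is given."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Build the instance from values found in the crawler's settings
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGODB_DATABASE", "items"),
        )


# Hypothetical stand-in for a Crawler: only a 'settings' mapping is provided,
# and MONGODB_DATABASE is left out on purpose to show the default.
fake_crawler = SimpleNamespace(settings={"MONGO_URI": "mongodb://localhost:27017"})

pipeline = SettingsAwarePipeline.from_crawler(fake_crawler)
print(pipeline.mongo_uri, pipeline.mongo_db)
```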

## 8. Finishing up Spider
Again you will need to modify your Spider to reflect the changes and save the data into MongoDB.
Note that you will need to rerun the Item cell, the MongoPipeline cell and the cells below.
Mainly, you will need to specify your MongoDB properties inside the custom_settings:

```python
custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.MongoPipeline': 1}, # Used for pipeline 1
        'MONGO_URI': 'mongodb://<dbuser>:<dbpassword>@ds012345.mlab.com:11059/15688scrapy',
        'MONGODB_DATABASE' : '15688scrapy'
    }
```

RERUN REQUIRED-- You will need to restart the kernel and run ONLY the Item cell (provided below), your new MongoPipeline cell and the new spider you create below.

In [None]:
import scrapy
class XRateItem(scrapy.Item):
    '''
    Given that our purpose is to extract all the types of currencies and their relevant value in terms
    of dollars, this class should contain three fields: 
      OneUnitinDollar - the dollar value of one unit of this currency 
      OneDollarEquivalent - the units of this currency equivalent to one dollar
      Type - the currency's type
    '''
    OneUnitinDollar = scrapy.Field()
    OneDollarEquivalent = scrapy.Field()
    Type = scrapy.Field()
    
    

In [None]:
import logging
import scrapy

from scrapy.crawler import CrawlerProcess

class XRateSpider(scrapy.Spider):
    
    name = 'ExchangeRates'
    start_urls = ['http://www.x-rates.com/table/?from=USD&amount=1'
    ]
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.MongoPipeline': 1}, # Used for pipeline 1
        'MONGO_URI': 'mongodb://[username]:[password]@ds111059.mlab.com:11059/15688scrapy',
        'MONGODB_DATABASE' : '15688scrapy'
    }
    
    def parse(self, response):
        '''
        This method is called every time a response is sent from the Engine 
        to the Spider. It is responsible for extracting the desired data from the 
        response.
        
        Args: response sent from the Engine (in fact from the Downloader)
        '''
        CurrencyType = response.xpath('//table[@class="tablesorter ratesTable"]/tbody/tr/td[not(@class)]/text()').extract() 
        DollarEquivalent = response.xpath("//table[@class='tablesorter ratesTable']/tbody/tr/td[@class='rtRates']/a/text()").extract()
        
        self.log("Log 1")
        self.log(CurrencyType)
        self.log(DollarEquivalent)
        
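        for i in range(len(CurrencyType)):
            item = XRateItem()
            item["Type"] = CurrencyType[i]

            # Each table row contributes two consecutive rate cells:
            # index 2*i is the amount per dollar, index 2*i+1 is its inverse.
            j = i*2
            k = i*2+1
            item["OneUnitinDollar"] = DollarEquivalent[k]

            item["OneDollarEquivalent"] = DollarEquivalent[j]
        

            yield item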
            
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'  
})

process.crawl(XRateSpider)
process.start()


## 9. Data Presentation

In the last phase, we load the data from the database in preparation for any sort of analysis. Here we will load the data and output it as a Python dictionary. For instance, we might wonder which currencies have an exchange rate against the USD greater than 5. In other words, for which countries do we get more than 5 units of the currency in exchange for one dollar?
You may find <a href="https://docs.mongodb.com/manual/reference/sql-comparison/">MongoDB's documentation</a> helpful in filtering your results.

In [None]:
from pymongo import MongoClient

'''
Steps to follow:
1. Connect to the Mongo client using MongoClient. The URI you pass in is the same one
you used to connect to your remote database, namely the value of MONGO_URI.
2. Store your database in 'db'
3. Store your collection in 'collection'
4. Filter the results in the find() call, which returns a cursor
5. Lastly, save your findings in a dictionary and close the connection. 


'''
#Connect to Mongo Client 
client = MongoClient('mongodb://[username]:[password]@ds111059.mlab.com:11059/15688scrapy')

db = client.get_default_database()

print(db)
collection = db['XRateScrapycollection']
print(collection)

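# Note: the rates are stored as strings, so '$gte': '5' is a string comparison,
# not a numeric one (e.g. '20.234946' sorts before '5' and is left out).
cursor = collection.find({'OneDollarEquivalent': {'$gte': '5'}}).sort('_id', 1)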

result = {}

for doc in cursor:
    result.update({doc['Type']:doc['OneDollarEquivalent']})
    
print(result)
client.close()



Your expected result from the above code should be as follows:

```python
Database(MongoClient(host=['ds111059.mlab.com:11059'], document_class=dict, tz_aware=False, connect=True), '15688scrapy')

Collection(Database(MongoClient(host=['ds111059.mlab.com:11059'], document_class=dict, tz_aware=False, connect=True), '15688scrapy'), 'XRateScrapycollection')

{'Botswana Pula': '9.460701', 'Chilean Peso': '606.074548', 'Chinese Yuan Renminbi': '6.281979', 'Croatian Kuna': '5.999345', 'Danish Krone': '6.003960', 'Hong Kong Dollar': '7.846314', 'Icelandic Krona': '97.928434', 'Indian Rupee': '64.908844', 'Norwegian Krone': '7.713880', 'Philippine Peso': '52.426108', 'Russian Ruble': '57.474332', 'Swedish Krona': '8.232627', 'Trinidadian Dollar': '6.802672', 'Venezuelan Bolivar': '9.987500'}
```
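
Because the spider stores the rates as strings, the `'$gte': '5'` filter compares text rather than numbers, which is why a rate such as '20.234946' does not appear in the result above. The sketch below reproduces the effect in plain Python on three sample documents and shows one possible remedy, converting to `float` before comparing. The first two rates appear in this tutorial; 'Currency X' and its rate are hypothetical, and Python's string comparison is used here only as a stand-in for the database's behaviour on these ASCII values.

```python
# Sample documents shaped like the ones stored by the pipeline.
docs = [
    {"Type": "Argentine Peso", "OneDollarEquivalent": "20.234946"},
    {"Type": "Danish Krone", "OneDollarEquivalent": "6.003960"},
    {"Type": "Currency X", "OneDollarEquivalent": "0.750000"},  # hypothetical
]

# Text comparison: '2' sorts before '5', so '20.234946' is left out.
string_matches = [d["Type"] for d in docs if d["OneDollarEquivalent"] >= "5"]

# Numeric comparison: convert to float first.
numeric_matches = [
    d["Type"] for d in docs if float(d["OneDollarEquivalent"]) >= 5
]

print(string_matches)
print(numeric_matches)
```

An alternative, if you control the pipeline, would be to convert the rates to numbers before inserting them, so that a numeric filter such as `{'$gte': 5}` can be applied in the query itself.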

Also, go to your mLab Collection, you should be able to see all the data item records in this format:

<img src="mLab.PNG" alt="mlab" style="width:400px;height:90px"/>
