<a href="https://colab.research.google.com/github/sukcsie/NLP-with-Python/blob/main/Scrapy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Using Scrapy in a Google Colab notebook
I built this tutorial from ideas in [Scrapy's](https://docs.scrapy.org/en/latest/intro/tutorial.html) tutorial page and a [Stack Overflow](https://stackoverflow.com/questions/40856730/how-to-run-scrapy-project-in-jupyter) post, in particular how to run Scrapy inside a `notebook` instead of installing it separately on your machine.

I have implemented a basic crawler that crawls a few pages and shows how to load the scraped data into a Pandas dataframe. It uses the `Scrapy` library to crawl and scrape data from a website.

Once the quotes are retrieved, a JSON file is created on disk (in the cloud, in this case) and can be loaded into a Pandas dataframe. The dataframe can then be analyzed, modified, and used for further processing. This notebook simply loads the JSON file into a dataframe and writes it back out as a pickle.

In [None]:
# Settings for notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Show Python version
import platform
platform.python_version()

#### Import scrapy

In [None]:
try:
    import scrapy
except ImportError:
    !pip install scrapy
    import scrapy
from scrapy.crawler import CrawlerProcess

#### Set up a pipeline
This class defines a simple pipeline that writes every scraped item to a JSON Lines file, where each line contains one JSON object.



In [None]:

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('quoteresult.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
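
To see what the pipeline does without running a crawl, its hooks can be called by hand in the order Scrapy invokes them: `open_spider` once, `process_item` for each item, then `close_spider`. The sketch below is a stand-in rather than the class above. `DemoJsonWriterPipeline`, the sample items, and the temporary file path are all made up for illustration, and `None` is passed as the spider because these hooks never use it.

```python
import json
import os
import tempfile


class DemoJsonWriterPipeline:
    """Hypothetical copy of the pipeline that writes to a configurable path."""

    def __init__(self, path):
        self.path = path

    def open_spider(self, spider):
        # Called once when the crawl starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the crawl ends.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every scraped item: write one JSON object per line.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item


# Made-up items shaped like the ones the spider yields.
sample_items = [
    {"text": "First sample quote", "author": "Author A", "tags": ["x", "y"]},
    {"text": "Second sample quote", "author": "Author B", "tags": []},
]

demo_path = os.path.join(tempfile.mkdtemp(), "demo.jl")
pipeline = DemoJsonWriterPipeline(demo_path)
pipeline.open_spider(None)
returned = [pipeline.process_item(item, None) for item in sample_items]
pipeline.close_spider(None)

with open(demo_path, encoding="utf-8") as fh:
    written_lines = fh.read().splitlines()
print(len(written_lines), "lines written")
```

Returning the item from `process_item` matters: Scrapy hands the returned value to the next pipeline stage and to the feed export.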

#### Define the spider
The `MySpider` class defines which URLs to start crawling from and which values to retrieve. I set the crawler's logging level to WARNING; otherwise the notebook is flooded with DEBUG messages about the retrieved data.

In [None]:

import logging

class MySpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
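        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1}, # Item pipeline: writes quoteresult.jl
        # Feed export: writes quoteresult.json (FEEDS replaces the deprecated
        # FEED_FORMAT / FEED_URI settings in Scrapy 2.1 and later)
        'FEEDS': {'quoteresult.json': {'format': 'json'}},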
    }
    
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }


#### Start the crawler

In [None]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

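process.crawl(MySpider)
# start() blocks until the crawl finishes. The Twisted reactor cannot be
# restarted, so restart the runtime before running this cell a second time.
process.start()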

#### Check the files
Verify that the files have been created on disk. Both files exist and contain data. The .jl file holds line-separated JSON objects, while the .json file holds one large JSON array containing all the quotes.

In [None]:
ll quoteresult.*

In [None]:
#!rm quoteresult.*

In [None]:
# Run a shell command from the notebook
#!tail -n 2 quoteresult.jl
!cat quoteresult.jl


In [None]:
# Run a shell command from the notebook
!tail -n 2 quoteresult.json
#!cat quoteresult.json
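
The difference between the two file layouts can be reproduced with the standard library alone. This is a minimal sketch using made-up records rather than the scraped quotes: the JSON Lines text is parsed one line at a time, the JSON array is parsed in a single call, and feeding the whole JSON Lines text to `json.loads` fails because it is not a single JSON document.

```python
import json

# Made-up records standing in for scraped quotes.
records = [
    {"author": "Author A", "tags": ["x"]},
    {"author": "Author B", "tags": ["y", "z"]},
]

# JSON Lines (.jl): one object per line, each line parses on its own.
jl_text = "".join(json.dumps(r) + "\n" for r in records)
from_jl = [json.loads(line) for line in jl_text.splitlines() if line.strip()]

# JSON array (.json): the whole document is one value, parsed in one call.
json_text = json.dumps(records)
from_json = json.loads(json_text)

# A multi-line .jl file taken as a whole is not one valid JSON document.
try:
    json.loads(jl_text)
    jl_is_single_document = True
except json.JSONDecodeError:
    jl_is_single_document = False

print(from_jl == from_json, jl_is_single_document)
```

Because every line stands alone, a JSON Lines file can be appended to or read incrementally, which is why it suits a pipeline that writes items as they arrive.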

#### Create dataframes
Pandas can now be used to create dataframes and save them as pickles. The .json file can be loaded directly into a dataframe, whereas the .jl file needs `lines=True` to tell Pandas that each line holds a separate JSON object.



In [None]:
import pandas as pd
#dfjson = pd.read_json('quoteresult.json')
#dfjson

In [None]:
dfjl = pd.read_json('quoteresult.jl', lines=True)
dfjl
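
The introduction mentions writing the dataframe to a pickle, a step that has no cell of its own. A minimal sketch of that round trip follows, assuming Pandas is installed. The sample rows, the temporary file name `quotes_demo.pkl`, and the tag count at the end are illustrative additions, not part of the original notebook.

```python
import io
import os
import tempfile

import pandas as pd

# Made-up JSON Lines content shaped like quoteresult.jl.
sample_jl = (
    '{"text": "First sample quote", "author": "Author A", "tags": ["x", "y"]}\n'
    '{"text": "Second sample quote", "author": "Author B", "tags": ["y"]}\n'
)

# lines=True tells Pandas to parse one JSON object per line.
df = pd.read_json(io.StringIO(sample_jl), lines=True)

# Write the frame to a pickle and read it back.
pickle_path = os.path.join(tempfile.mkdtemp(), "quotes_demo.pkl")
df.to_pickle(pickle_path)
restored = pd.read_pickle(pickle_path)

# Example of further processing: count how often each tag occurs.
tag_counts = df["tags"].explode().value_counts()
print(tag_counts)
```

With the real data, the same calls would be applied to `dfjl` and a path of your choosing.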


#### Access Google Drive and keep a copy of the scraped file

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
!ls /content/gdrive/

In [None]:
import shutil
shutil.copyfile('quoteresult.jl', '/content/gdrive/My Drive/Colab Notebooks/quote2.jl')

In [None]:
!cat '/content/gdrive/My Drive/Colab Notebooks/quote2.jl'
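
`shutil.copyfile` raises `FileNotFoundError` when the destination folder does not exist, which can happen if the Drive has no `Colab Notebooks` folder yet. The sketch below shows one way to guard against that. `copy_to_folder` is a hypothetical helper, and temporary directories stand in for the mounted Drive so the example runs anywhere.

```python
import os
import shutil
import tempfile


def copy_to_folder(src_path, dest_dir, dest_name):
    """Hypothetical helper: create dest_dir if needed, then copy src_path into it."""
    os.makedirs(dest_dir, exist_ok=True)
    dest_path = os.path.join(dest_dir, dest_name)
    shutil.copyfile(src_path, dest_path)
    return dest_path


# Temporary stand-ins for the scraped file and the mounted Drive folder.
work_dir = tempfile.mkdtemp()
src = os.path.join(work_dir, "quoteresult_demo.jl")
with open(src, "w", encoding="utf-8") as fh:
    fh.write('{"author": "Author A"}\n')

fake_drive_dir = os.path.join(work_dir, "gdrive", "My Drive", "Colab Notebooks")
copied = copy_to_folder(src, fake_drive_dir, "quote2_demo.jl")
print(copied)
```

In the notebook itself, the source would be `quoteresult.jl` and the destination folder would be the path under `/content/gdrive` used above.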