## Introduction

A web crawler, also called an internet bot or web spider, is a program that scans the internet programmatically and fetches data for further use. After almost thirty years of development, web crawlers can be grouped into four major types: general-purpose, focused, incremental and deep web crawlers. In this tutorial we focus on implementing a general-purpose web crawler. More specifically, we use the Scrapy library in Python to build a basic crawler that collects information about food ingredients from the [BBC Food Channel](https://www.bbc.co.uk/food/ingredients), formats the data as JSON and displays it in the notebook.

## Tutorial Content

In this tutorial, we will focus on how to write a basic web crawler using the [Scrapy](https://scrapy.org/) library. After extracting the data, we will use the [pandas](https://pandas.pydata.org/) library to display it.

We will cover the following topics. Because Scrapy is mostly used from the terminal, some results are not generated by the notebook directly. In those cases, the output is captured as screenshots and included as pictures in this tutorial.

- [Install the Scrapy library](#Install-the-Scrapy-library)
- [Create a starter project](#Create-a-starter-project)
- [Implement a basic web crawler](#Implement-a-Basic-Web-Crawler)
- [Use Item to Hold Data](#Use-Item-to-Hold-Data)
- [Set up Item Pipeline to process and store the data](#Set-Up-Item-Pipeline)
- [Display Data with Pandas](#Load-Data-Into-Pandas)

After working through these topics, you should have a basic understanding of how to build a web crawler with the Scrapy library. You can find more of Scrapy's functionality in [the official documentation](https://docs.scrapy.org/en/latest/index.html).

## Install the Scrapy library

Installing the Scrapy library is fast and easy. According to the official Scrapy documentation, you can install it in either of the following two ways.
First, use `pip`, the Python 3 package manager:

     $ pip3 install scrapy

Second, you can use [Anaconda](https://www.anaconda.com/download/#macos). If you haven't installed it yet, click the link and follow the installation guide on the webpage. Once Anaconda is installed, you can use it to install Scrapy:

     $ conda install -c conda-forge scrapy
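
Whichever method you choose, it can be useful to confirm that the package is importable from the Python environment the notebook is using. The following is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name introduced here for illustration, not part of Scrapy.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("scrapy"):
    import scrapy
    print("Scrapy version:", scrapy.__version__)
else:
    print("Scrapy is not installed in this environment.")
```

If the second message appears inside the notebook even though the install command succeeded, the notebook kernel is probably using a different Python environment than the terminal.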

## Create a starter project

Using the command-line tools that the Scrapy framework provides, it is easy to create a web crawler project and start implementing our own crawler. First, navigate to the directory where you want to store the project, then type the following command into the terminal. (Since we are using a notebook, you can also use its built-in terminal: go to the directory page of the notebook, click the New button in the upper right corner of the page and select Terminal, which starts a new terminal in the notebook.)

    $ scrapy startproject tutorial

This command creates a folder named tutorial with the following structure (the comment to the right of each entry describes the main purpose of that file):

    tutorial/
        scrapy.cfg  # the configuration file for deployment
        tutorial/
            __init__.py
            items.py  # define your items in this file
            middlewares.py  # project middleware file
            pipelines.py  # project pipelines file
            settings.py  # project settings file
            spiders/  # the directory where you place your own spiders
                __init__.py
            
In this tutorial, we will implement our crawler and put it in the spiders directory. We will then modify the crawler along with items.py, pipelines.py and settings.py. <b>The code is therefore shown in cells, but running some of them in the notebook won't generate any output directly. Instead, we suggest you use the terminal to run the commands and examine the output. Additionally, we include sample output as pictures in this tutorial.</b> So, let's get started!

## Implement a Basic Web Crawler

Before we implement the web crawler, we should have a clear view of which data we need to extract and how we should format it. We are extracting information about food ingredients from the index page and from each detail page. From the index page, assume that we want to extract the following three data fields:

1. name of this food ingredient
2. url of the page containing the detail of this food ingredient
3. url of this food ingredient image

From each detail page, we will extract just the following two data fields to keep things simple:

1. name of this food ingredient
2. description of this food ingredient

In this part, we will implement a basic crawler in the tutorial project. To create a web crawler in the Scrapy framework, you subclass [scrapy.Spider](https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider) and define the initial requests to send and how to parse the responses. Typically, there are three things to keep an eye on when implementing this class:

1. `name`: an attribute of your spider class that identifies the spider within the project. It must be unique: you cannot assign the same name to two different spiders in one project.
2. `start_requests` method: the initial requests are normally defined here. It should return an iterable of Request objects from which the crawler begins to crawl. Subsequent requests are generated from these initial requests.
3. parse method: handles the downloaded response for each request. By default Scrapy calls a method named `parse`, but any method can be passed as a request's callback, as we do below. The response parameter is an instance of [TextResponse](https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.TextResponse) that holds the response content and offers helpful methods for processing it.

The following code is a basic food ingredient spider that fetches the BBC Food page listing all ingredients whose names start with 'a'. <b>Note that for now we just output the fetched data to the terminal: self.log() is the logging function of the Scrapy framework, which lets us print debug information in the terminal. In the next few sections, we will modify the code step by step to reach the final version of the spider.</b> As for the project structure, create a new Python file called ingredient_spider.py and place it in the spiders directory.

<b>Please don't run this cell because this is just a prototype version of the web crawler, instead you could run the same code using terminal!</b>

In [None]:
import scrapy
import re

class IngredientsSpider(scrapy.Spider):
    # name: identifies the Spider. It must be unique within a project, 
    #       that is, you can’t set the same name for different Spiders.
    name = "ingredients"

    def start_requests(self):
        """
        start_requests(): must return an iterable of Requests (you can return a 
            list of requests or write a generator function) which the Spider will begin to crawl from. 
            Subsequent requests will be generated successively from these initial requests.
        (you can just define a start_urls class attribute with a list of URLs. )
        """
        urls = [
            'http://www.bbc.co.uk/food/ingredients/by/letter/a',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_ingredient)

    def parse_ingredient(self, response):
        """
        parse_ingredient(): the callback that handles the response downloaded for the index page.
            The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.
        """
        # we can use regular expressions to extract the attribute values;
        # raw strings keep \w and \W from being read as string escape sequences
        alt_attr = re.compile(r'alt="([\w\W]+?)"')
        href_attr = re.compile(r'href="([\w\W]+?)"')
        src_attr = re.compile(r'src="([\w\W]+?)"')
        list_results = response.xpath('//ol[@class="resources foods grid-view"]/li/a[not(@class)]').extract()
        # error handling: if there is no such element in the response content, stop parsing this page
        if (len(list_results) == 0):
            return
        for temp in list_results:
            name = re.search(alt_attr, temp).group(1)
            url = re.search(href_attr, temp).group(1)
            src = re.search(src_attr, temp).group(1)
            self.log('Ingredient Name: ' + name)
            self.log('URL link to Details: ' + url)
            self.log('Ingredient Image Link: ' + src)
            # continue to crawl the web page that contains the details of this ingredient
            next_url = response.urljoin(url)
            # use yield to send the requests to the downloader
            yield scrapy.Request(next_url, callback=self.parse_ingredient_detail)

    def parse_ingredient_detail(self, response):
        """
        this parse function is used to parse the response content from detail page of one food ingredient
        """
        # get the name of the ingredient
        name = response.xpath('//div[@id="summary"]/h1[@class]/text()').extract()
        # error handling: set the variable as empty string if no such element is found
        if (len(name) != 0):
            name = name[0].rstrip()
        else:
            name = ''
        desp = response.xpath('//div[@id="summary"]/div/p/text()').extract()
        # error handling: set the variable as empty string if no such element is found
        if (len(desp) != 0):
            desp = desp[0]
        else:
            desp = ''
        self.log('Ingredient Name: ' + name)
        self.log('Ingredient Description: ' + desp)


There are a few things to pay attention to regarding the logic behind the code shown above:

1. The creation of the scrapy.Request object: two important parameters are passed to its constructor, a URL and a callback function. The url parameter specifies the endpoint of the request, while the callback is the function that will be called once the response is downloaded. In most cases, this function parses the response content and issues further requests for subsequent scraping.

2. We locate and extract HTML elements using XPath expressions. [XPath](https://www.w3.org/TR/xpath/) is a language for addressing parts of an XML document. It is a very powerful tool for extracting data from XML-like files such as HTML.

3. The three regular expressions in the code extract the values of the 'alt', 'href' and 'src' attributes.
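
To see what the regular expressions and the URL joining step do without running a crawl, here is a minimal standalone sketch. The HTML fragment in `sample_anchor` is made up for illustration (it only imitates the shape of an index-page entry and is not copied from the BBC site), and `urllib.parse.urljoin` from the standard library stands in for `response.urljoin`, on the assumption that the page's own URL is the base for relative links.

```python
import re
from urllib.parse import urljoin

# Hypothetical fragment imitating one entry of the index page (not real site markup).
sample_anchor = (
    '<a href="/food/ackee">'
    '<img src="https://example.com/images/ackee.jpg" alt="ackee"/>'
    '</a>'
)

# Same patterns as in the spider; the lazy +? stops at the first closing quote.
alt_attr = re.compile(r'alt="([\w\W]+?)"')
href_attr = re.compile(r'href="([\w\W]+?)"')
src_attr = re.compile(r'src="([\w\W]+?)"')

name = alt_attr.search(sample_anchor).group(1)
url = href_attr.search(sample_anchor).group(1)
src = src_attr.search(sample_anchor).group(1)

# Resolve the relative link against the page on which it was found.
page_url = 'http://www.bbc.co.uk/food/ingredients/by/letter/a'
next_url = urljoin(page_url, url)

print(name, url, src, next_url)
```

Note that `search()` returns `None` when a pattern does not match, so calling `.group(1)` on markup that lacks one of the attributes would raise an `AttributeError`. A more defensive spider would check for that case first.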

After adding the code shown above to the project, you can run the following command in the terminal to start your spider:
    
    $ scrapy crawl ingredients
    
The output of the program is very long, so we only show the part containing the debug information that the spider logs:

![Figure 1](sample_output1.png)
The picture above shows the ingredient name, the link to the detail page and the link to the image displayed in the terminal, which means the data was scraped successfully from the index page.

![Figure 2](sample_output2.png)
The picture above contains similar information. The ingredient name and description are shown in the terminal, which means the data was scraped successfully from each ingredient detail page.

Although the data is scraped successfully from our target website, simply displaying it in the terminal is of little use to us. In the next sections, we will use the Item class provided by Scrapy to wrap the data and then set up an Item Pipeline to write the data out in JSON format.

## Use Item to Hold Data

In this section, you will learn to use the Item class provided by the Scrapy framework to wrap the extracted data, which can then be processed in an Item Pipeline. In Scrapy, Item classes are simply containers for collecting the extracted data. They provide a Python dictionary-like API, so setting, getting and changing the values of an Item is easy and convenient. First, we define the ingredient Item classes inside <b>items.py</b> in the project as follows:

<b>You can run the following cell in notebook or copy the code into items.py and start the spider using terminal.</b>

In [1]:
import scrapy

class IngredientItem(scrapy.Item):
    # define this class for collecting data about ingredient name, url and image link.
    name = scrapy.Field()
    url = scrapy.Field()
    img_src = scrapy.Field()
    
class IngredientDetailItem(scrapy.Item):
    # define this class for collecting data about ingredient name and description about it.
    name = scrapy.Field()
    description = scrapy.Field()

In the code above, we defined two Item classes to hold the two kinds of data we want to extract from the website. Inside each class, you declare a field using a [scrapy.Field](https://docs.scrapy.org/en/latest/topics/items.html) object. A field can hold any data type, and you can also specify a serializer function for each field. Most often, we create an Item by passing the field values directly to its constructor, like this:

    ingredient = IngredientItem(name=name, url=url, img_src=src)
    ingredient_detail = IngredientDetailItem(name=name, description=desp)
    
The above code snippets will be included in the final version of our spider, replacing the log calls we had previously.

## Set Up Item Pipeline

In this section, you will learn to set up an item pipeline to process the items that contain the data extracted by our spider. Item pipelines are typically used to clean or validate the extracted data, check for duplicates, or store the data in a database. For demonstration purposes, we will only show how to set up an item pipeline that writes the two types of items into two JSON Lines files. The following code shows how the pipeline can be set up to achieve this:

In [2]:
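import json
# when this class lives in pipelines.py rather than in the notebook, also add:
# from tutorial.items import IngredientItem, IngredientDetailItem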

class JsonWriterPipeline(object):
    # the method will be called when the spider is opened
    def open_spider(self, spider):
        self.file_ingredient = open('ingredient.jl', 'w')
        self.file_ingredient_detail = open('ingredient_detail.jl', 'w')

    # this method will be called when the spider is closed
    def close_spider(self, spider):
        self.file_ingredient.close()
        self.file_ingredient_detail.close()

    def process_item(self, item, spider):
        # if it is an Ingredient, write it into the 'ingredient.jl' file
        if (isinstance(item, IngredientItem)):
            line = json.dumps(dict(item)) + '\n'
            self.file_ingredient.write(line)
        # if it is an ingredient detail item, write it into the 'ingredient_detail.jl' file
        elif (isinstance(item, IngredientDetailItem)):
            line = json.dumps(dict(item)) + '\n'
            self.file_ingredient_detail.write(line)
        return item

There are a few things to explain about the code shown above:

1. The three methods are called in a specific order: <b>open_spider</b> is called first, when the spider is opened; <b>process_item</b> is then called for every item the spider yields; and <b>close_spider</b> is called last, when the spider is closed.

2. Given this order, the files should be opened in open_spider and closed in close_spider, while the logic for processing the data belongs in process_item.

3. Since all items go through the same pipeline, process_item uses the isinstance function to check whether an item is an IngredientItem or an IngredientDetailItem and writes it to the corresponding file as a JSON object.
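
The routing idea in the last point can be tried without Scrapy. The following is a minimal sketch under stated assumptions: `IndexRecord` and `DetailRecord` are hypothetical plain `dict` subclasses standing in for the two Item classes, `write_record` is a hypothetical helper, and `io.StringIO` buffers stand in for the two `.jl` files. It is not the pipeline itself, only the same type-based dispatch and one-JSON-object-per-line output.

```python
import io
import json


class IndexRecord(dict):
    """Hypothetical stand-in for IngredientItem."""


class DetailRecord(dict):
    """Hypothetical stand-in for IngredientDetailItem."""


def write_record(record, index_sink, detail_sink):
    """Serialize `record` as one JSON line into the sink matching its type."""
    line = json.dumps(dict(record)) + '\n'
    if isinstance(record, IndexRecord):
        index_sink.write(line)
    elif isinstance(record, DetailRecord):
        detail_sink.write(line)
    return record


index_sink, detail_sink = io.StringIO(), io.StringIO()
records = [
    IndexRecord(name='ackee', url='/food/ackee'),
    DetailRecord(name='Ackee recipes', description='(made-up text)'),
    IndexRecord(name='ale', url='/food/ale'),
]
for record in records:
    write_record(record, index_sink, detail_sink)

# Each line is an independent JSON document, so it can be parsed on its own.
index_rows = [json.loads(line) for line in index_sink.getvalue().splitlines()]
detail_rows = [json.loads(line) for line in detail_sink.getvalue().splitlines()]
print(len(index_rows), len(detail_rows))
```

Because every record occupies exactly one line, the files can be appended to and read back incrementally, which is why this layout suits crawler output.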

After setting up the items and the item pipeline, one more step is needed to enable the pipeline for our spider: the ITEM_PIPELINES setting must name it. In the notebook, we do this through the spider's custom_settings attribute, pointing at the class defined in the notebook itself (`__main__.JsonWriterPipeline`). This notebook-specific entry is not needed if you run the spider from the terminal inside the project, where the pipeline is enabled in the project's settings.py instead. You can type the following command in the terminal to start your spider:

    $ scrapy crawl ingredients
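
For the terminal route, the pipeline has to be listed in the project settings. A minimal sketch of the relevant fragment of settings.py follows; it assumes that `JsonWriterPipeline` was saved in tutorial/pipelines.py, and the order value 300 is an arbitrary choice within the usual 0 to 1000 range.

```python
# tutorial/settings.py (fragment)
# Assumes JsonWriterPipeline was saved in tutorial/pipelines.py.
ITEM_PIPELINES = {
    'tutorial.pipelines.JsonWriterPipeline': 300,
}

# Optional: reduce log noise, mirroring the notebook's custom_settings.
LOG_LEVEL = 'WARNING'
```

With a single pipeline the order value has no visible effect; it only matters when several pipelines are enabled, in which case items pass through them from the lowest value to the highest.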
    
<b>Or you can just run the following two cells to start our spiders.</b>

In [3]:
import re
import logging

class IngredientsSpider(scrapy.Spider):
    name = "ingredients"
    # settings that enable the pipeline and raise the log level
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        # '__main__' because the pipeline class is defined in the notebook itself; 1 is its order value
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1},
    }
    def start_requests(self):
        urls = [
            'http://www.bbc.co.uk/food/ingredients/by/letter/a',
        ]
    
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_ingredient)

    def parse_ingredient(self, response):
        """
        parse_ingredient(): the callback that handles the response downloaded for the index page.
            The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.
        """
        # we can use regular expressions to extract the attribute values;
        # raw strings keep \w and \W from being read as string escape sequences
        alt_attr = re.compile(r'alt="([\w\W]+?)"')
        href_attr = re.compile(r'href="([\w\W]+?)"')
        src_attr = re.compile(r'src="([\w\W]+?)"')
        list_results = response.xpath('//ol[@class="resources foods grid-view"]/li/a[not(@class)]').extract()
        # error handling: if there is no such element in the response content, stop parsing this page
        if (len(list_results) == 0):
            return
        for temp in list_results:
            name = re.search(alt_attr, temp).group(1)
            url = re.search(href_attr, temp).group(1)
            src = re.search(src_attr, temp).group(1)
            # the log calls are replaced by yielding an Item
            # self.log('Ingredient Name: ' + name)
            # self.log('URL link to Details: ' + url)
            # self.log('Ingredient Image Link: ' + src)
            ingredient = IngredientItem(name=name, url=url, img_src=src)
            yield ingredient
            # continue to crawl the web page that contains the details of this ingredient
            next_url = response.urljoin(url)
            # use yield to send the requests to the downloader
            yield scrapy.Request(next_url, callback=self.parse_ingredient_detail)

    def parse_ingredient_detail(self, response):
        """
        this parse function is used to parse the response content from detail page of one food ingredient
        """
        # get the name of the ingredient
        name = response.xpath('//div[@id="summary"]/h1[@class]/text()').extract()
        # error handling: set the variable as empty string if no such element is found
        if (len(name) != 0):
            name = name[0].rstrip()
        else:
            name = ''
        desp = response.xpath('//div[@id="summary"]/div/p/text()').extract()
        # error handling: set the variable as empty string if no such element is found
        if (len(desp) != 0):
            desp = desp[0]
        else:
            desp = ''
        ingredient_detail = IngredientDetailItem(name=name, description=desp)
        yield ingredient_detail

In [4]:
from scrapy.crawler import CrawlerProcess
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(IngredientsSpider)
process.start()

2018-03-27 15:54:08 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-03-27 15:54:08 [scrapy.utils.log] INFO: Versions: lxml 4.1.0.0, libxml2 2.9.4, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.3 |Anaconda, Inc.| (default, Oct  6 2017, 12:04:38) - [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 17.2.0 (OpenSSL 1.0.2l  25 May 2017), cryptography 2.0.3, Platform Darwin-17.4.0-x86_64-i386-64bit
2018-03-27 15:54:08 [scrapy.crawler] INFO: Overridden settings: {'LOG_LEVEL': 30, 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}


## Load Data Into Pandas

The purpose of this step is simply to verify that the two files have been created and that the extracted data has been written into them. If everything is implemented correctly, you should see formatted output when you run the following cell.

In [5]:
# read the data into pandas DataFrames
import pandas as pd
ingredient_df = pd.read_json('ingredient.jl', lines=True)
display(ingredient_df)

ingredient_detail_df = pd.read_json('ingredient_detail.jl', lines=True)
display(ingredient_detail_df)

| | img_src | name | url |
|---|---|---|---|
| 0 | https://ichef.bbc.co.uk/food/ic/food_16x9_111/... | acidulated water | /food/acidulated_water |
| 1 | https://ichef.bbc.co.uk/food/ic/food_16x9_111/... | ackee | /food/ackee |
| 2 | https://ichef.bbc.co.uk/food/ic/food_16x9_111/... | acorn squash | /food/acorn_squash |
| 3 | https://ichef.bbc.co.uk/food/ic/food_16x9_111/... | aduki beans | /food/aduki_beans |
| 4 | https://ichef.bbc.co.uk/food/ic/food_16x9_111/... | Advocaat | /food/egg_liqueur |
| 5 | https://ichef.bbc.co.uk/food/ic/food_16x9_111/... | agar-agar | /food/agar-agar |
| 6 | https://ichef.bbc.co.uk/food/ic/food_16x9_111/... | ale | /food/ale |
| 7 | https://ichef.bbc.co.uk/food/ic/food_16x9_111/... | Aleppo pepper | /food/aleppo_pepper |
| 8 | https://ichef.bbc.co.uk/food/ic/food_16x9_111/... | alfalfa sprouts | /food/alfalfa_sprouts |
| 9 | https://ichef.bbc.co.uk/food/ic/food_16x9_111/... | allspice | /food/allspice |


| | description | name |
|---|---|---|
| 0 | Serve them as an after-dinner treat with sweet... | Amaretti recipes |
| 1 | Also known as Chinese spinach or callaloo in C... | Amaranth recipes |
| 2 | Lactose and cholesterol-free, almond milk is a... | Almond milk recipes |
| 3 | | Alfalfa sprouts recipes |
| 4 | | Almond essence recipes |
| 5 | Our almond recipes make the most of this versa... | Almond recipes |
| 6 | Aleppo pepper, also known as pul biber, Halaby... | Aleppo pepper recipes |
| 7 | Almond extract is distilled from the essential... | Almond extract recipes |
| 8 | An aromatic spice that looks like a large, smo... | Allspice recipes |
| 9 | A large family of beers with a great deal of s... | Ale recipes |
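
If pandas is not available, the same check can be done with the standard library alone. The sketch below is one possible approach: `load_json_lines` is a hypothetical helper, and the demonstration runs on a throwaway file with made-up rows so that it does not depend on a previous crawl.

```python
import json
import os
import tempfile


def load_json_lines(path):
    """Return the records of a JSON Lines file as a list of dicts, skipping blank lines."""
    with open(path, encoding='utf-8') as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Demonstration on a temporary file with made-up rows; after a real crawl,
# point the helper at 'ingredient.jl' or 'ingredient_detail.jl' instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.jl')
    with open(demo_path, 'w', encoding='utf-8') as handle:
        handle.write('{"name": "ackee", "url": "/food/ackee"}\n')
        handle.write('{"name": "ale", "url": "/food/ale"}\n')
    rows = load_json_lines(demo_path)

print(len(rows), sorted(rows[0]))
```

An empty list from `load_json_lines('ingredient.jl')` would suggest that the spider ran but the XPath expression matched nothing, for example because the page layout has changed.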
