# Tutorial: Web Crawlers Development Using Scrapy for Python 

   
### Learning Outcomes:
**--** This tutorial teaches web scraping with Scrapy by building a scraper for the power banks listing page on shopclues.com.<br>
**--** Learn how to apply the same approach in Python to collect data from other websites.

### Prereqs:
**--** This article has no specific prerequisites, but basic knowledge of HTML and CSS will help you follow it with greater ease and speed.

### Table of Contents:
   **1.** Scrapy Outline <br>
   **2.** Scrapy Installation & Setup<br>
   **3.** Scrapy Shell<br>
   **4.** Scraping **Shopclues.com:Power Banks** with Scrapy Shell<br>
   $\;\;\;\;\;$**4.1.** Using CSS Selectors for Extraction <br>
   $\;\;\;\;\;$**4.2.** Using XPath for Extraction <br>
   **5.** Writing Custom Spiders <br>
   $\;\;\;\;\;$**5.1.** Creating Scrapy Project <br>
   $\;\;\;\;\;$**5.2.** Creating a Custom Spider <br>
   $\;\;\;\;\;$**5.3.** Exporting Scraped Data as a CSV <br>
   **6.** Additional Details <br>
   $\;\;\;\;\;$**6.1.** Few Other Commands & Attributes <br>
   $\;\;\;\;\;$**6.2.** Following Links <br>
   **7.** Referred Sources <br>

### 1. Scrapy Outline:

Scrapy is a free and open-source web-crawling framework written in Python. It gives you all the tools you need to efficiently extract data from websites, process them as you want, and store them in your preferred structure and format.

Web scraping has become an effective way of extracting information from the web for decision making and analysis. Data scientists should know how to gather data from web pages and store that data in different formats for further analysis. So, web scraping became an essential part of the data science toolkit.

Scrapy uses a **web crawler** (also called a **spider** or **spiderbot**) to extract information that is visible on a web page. A spider is a program or automated script that browses the World Wide Web in a methodical, automated manner.

A crawler/spider has to be coded for the specific web page it extracts from, because every page has its own design, structure and web elements. Scrapy makes it easier to build and scale large crawling projects by allowing developers to reuse their code.

The structured process of web scraping is shown below:
![web_scraping](https://topwebscrapingservice.files.wordpress.com/2016/06/custom-web-scraping-624x301.png)


### 2. Scrapy Installation & Setup

With Python 3 installed, if you are using Anaconda, you can use **conda** to install **scrapy** by running the following command in the **Anaconda prompt**:<br>

**`conda install -c conda-forge scrapy`**

Alternatively, you can install it with pip, the Python package installer. This works on Linux, Mac, and Windows:

**`pip install scrapy`**

### 3. Scrapy Shell

The Scrapy shell is an interactive shell where you can try and debug your scraping code very quickly, without having to 
run the spider. It is meant to be used for testing data extraction code, but you can actually use it for testing any kind 
of code as it is also a regular Python shell.

With the Scrapy shell, you can inspect the components of a web page and try out the expressions that extract them. To start the Scrapy shell, type the following command in your command line or Anaconda prompt:

**`scrapy shell`**

The above command prints log output similar to the following when executed in the Anaconda prompt:

![scrapyshell](OSINT-Data-Science-Research-Work/Assets/1.scrapyshell.png)

### 4. Scraping Shopclues.com:Power Banks with Scrapy Shell

We already know that a crawler or spider goes through a webpage downloading its text and metadata, and that it needs a starting point from which to begin crawling (downloading) content. <br>

In this implementation, the crawler’s start URL is https://www.shopclues.com/mobile-accessories-power-banks.html

To run the crawler, use **fetch command** in the Scrapy Shell:

**`fetch('https://www.shopclues.com/mobile-accessories-power-banks.html')`**

Output of above command is as follows:

![fetch](https://lh4.googleusercontent.com/fy9fMqOVcG1U3iXAEpb0RHFGlvwpfe4ThfurG5Ya8SR6ss2drWJ1LLSjW0vpbAUii5_yXQ)

**NOTE:** The fetch command raises a *syntax error* if the URL passed to it is not enclosed in **quotes**.

When you crawl a page with the fetch command, the shell makes a *response object* available (as the variable `response`) that contains the downloaded information. The content of the response object can be viewed with the help of the **view command** as follows:

**`view(response)`**

This command will open the downloaded page in your default browser as shown below:

<img src= "https://lh6.googleusercontent.com/OEjmYPpXqN8IV1jP7Iw-YuROlcHPdNLwolrFl8pN1AdsAyugtZfcxPSByJcGzvhrhOlxnw" alt="reponse" style="width: 700px;"/>

By observing the downloaded website, it is clear that the crawler downloaded the entire webpage successfully.

In order to view the raw HTML of the downloaded webpage, use the **print command** as follows:

**`print(response.text)`**

This displays the HTML of the webpage, the same markup you get in a browser by right-clicking the page and selecting view source or view page source. By inspecting the HTML of the webpage, we can identify the element names given to the required details. They can be found within the `<a>` tag, each with a class attribute.

Let us take the following elements for extraction: <br>
**--** Power Bank Name (prod_name)<br>
**--** Power Bank price (p_price)<br>
**--** Power Bank discount (prd_discount)<br>
**--** Product image<br>

### 4.1. *Using CSS Selectors for Extraction*

To extract the required data from the website, use element attributes or CSS selectors such as classes and ids, together with the **extract() command**.

#### 4.1.1. Power Bank Name Extraction:

The following code extracts the power bank names in the webpage:

**`response.css(".prod_name::text").extract()`**

<img src= "https://lh3.googleusercontent.com/HM5VroG1N3vB0b5RAPXxxA-KUEgMClbWbd53sYhRizzNCm9Y22qnVXKPgbfqtPigNgXO0Q" alt="name" style="width: 500px;"/>

To extract only the first element that satisfies the CSS selector, use the **extract_first() command** as follows:

**`response.css(".prod_name::text").extract_first()`**

![first](https://lh3.googleusercontent.com/IgIz11iT5mGOoIr41oOPm9HF2TsvA0GqM8U3n8gcGr7owZZ3PuH2Um6OrxXpEgur5Qk33g)

#### 4.1.2. Extracting Power Bank Price:

The following command displays the price of each power bank:

**`response.css(".p_price::text").extract()`**

<img src="https://lh4.googleusercontent.com/a8EQk874keynMA2gLeKV1bILXZbbDUlqwRgisDKsiueRn0Tt7RL-T_rD60yNXCqwrsl7QQ" alt="price" style="width: 400px;"/>

#### 4.1.3. Extracting Power Bank Discount:

The following command displays the discount given on each power bank:

**`response.css(".prd_discount::text").extract()`**

<img src= "https://lh3.googleusercontent.com/o1ircKYiWPQOo9Hlkjx2Y_Fh6V1qcBN4KXQcNIi6fAksLE1u83XvQ0wW5fG4VfBvvbrjeQ" alt="discount" style="width: 500px;"/>

### Till now:

`response` – An object that the scrapy crawler returns which contains all the information about the downloaded content.<br>
`response.css()` – Matches the element with the given CSS selectors.<br>
`extract_first()` – Extracts the “first” element that matches the given criteria.<br>
`extract()` – Extracts “all” the elements that match the given criteria.<br>

### 4.2. *Using XPath for Extraction*

XPath is a query language used for navigating through an XML document by selecting nodes from it. Scrapy uses XPath to navigate throughout the HTML document. The CSS selectors used above can also be converted to XPath, but in many cases CSS is easier to use than XPath.

The following are a few commands for extracting data using XPath:

To get the code under the `<html>` tag (the html node):

`response.xpath("/html").extract()`

![html](https://lh4.googleusercontent.com/-SAaJx2-8vVoqvLORLGCVPbZVVhlj2A26YO8SFRqqYSb0DCxXYH7rZdQxmmURy4fGB8TBg)

To extract body node, which is the child of html node:

`response.xpath("/html/body").extract()`

![body](https://lh4.googleusercontent.com/u4sXKJNRb4jS1C1FpzR1EeazhFUl5u8uGKtTbt9MGWUDH8lxpvDiw2R4QlF_arh41TO9-Q)

To access all `<div>` descendants of html node:

`response.xpath("/html//div").extract()`

The above command can also be written by excluding '/html' as follows:

`response.xpath("//div").extract()`

**NOTE:** The XPath language is based on a tree representation of the document, with the html node at its root. In the commands above, `/` selects direct child nodes and `//` selects descendant nodes at any depth.

Apart from using HTML tags for navigating through the webpage, we can also use attributes and their values to extract the required data. The syntax is as follows:

`response.xpath("//div[@class='sc_callouts']").extract()`

![callouts](https://lh5.googleusercontent.com/OjYxYL-aOXgIZ8ksQ3S0BBp6yQfUiMJZLXPRRVCqBf6Z2_5vQJrYoe-gtnE5irkvShqbPQ)

To filter nodes further, chain attribute conditions along the path (the class names in this example are only illustrative):

`response.xpath("//div[@class='quote']/span[@class='text']").extract()`

In order to extract the text inside nodes, use **text()** at the end of the path. Consider the following HTML code:

![html](https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1547156867/tut22_foyf9t.jpg)

To get the text inside the `<a>` tag:

`response.xpath('//div[@class="site-notice-container container"]/a[@class="notice-close"]/text()').extract()`

![text](https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1547156867/tut23_um9wdm.jpg)

### 5. Writing Custom Spiders

We know that a spider is a program that downloads content from a given website. But each website has its own structure, which results in unique HTML code. In order to extract data from different websites, one has to write a custom spider for each website according to its design.

We also need to write code to convert the extracted data to a structured form and store it in a reusable format like CSV, JSON, Excel etc. To create a custom spider for any website, the first step is to create a Scrapy project.

### 5.1. *Creating Scrapy Project*

A Scrapy project is created to store the spider code and its results. The following command, run in the command line or Anaconda prompt, creates a Scrapy project:

`scrapy startproject shopclues`

![start](https://lh5.googleusercontent.com/-uKIxZYYP5N1QsmQcrLwhwXvk1hM4dwHWe9AJas7fw4_qwodTf0c0J6JIHkY2NOWtwku6A)

The above command creates a folder named shopclues in the current working directory (shopclues is the name used in this scenario; you can give any name). The contents of this folder and their purposes are described in the table below:

| File/Folder | Purpose |
| --- | --- |
| scrapy.cfg | deploy configuration file |
| shopclues/ | project's Python module; you'll import your code from here |
| `__init__.py` | initialization file |
| items.py | project items file |
| pipelines.py | project pipelines file |
| settings.py | project settings file |
| spiders/ | a directory where you'll later put your spiders |


### 5.2. *Creating a Custom Spider*

To create the spider, we have to work from inside the newly created project folder. So change your working directory accordingly by using the `cd` command, and then execute the following command in the Anaconda prompt:

`scrapy genspider shopclues_powerbanks https://www.shopclues.com/mobile-accessories-power-banks.html`

![power](https://lh6.googleusercontent.com/YcvMrA3ocEZzHGDXbNu6ZDOCAQLbjmrKPkiWIiGiLHHtBQGemB1p7uTPiwjojflYCuxJRA)

The above command creates a template file named shopclues_powerbanks.py in the spiders directory as mentioned above. The code in that file is as below:

<img src="https://lh6.googleusercontent.com/75-CBqzBrFVo1saDGggTZ6YF8oBJP3CmsrDQmZX1VoyywsA7v6_-AwkwWpd2lFZcqMZm5A" alt="file" style ="width:700px;" />


**Describing the terms in above code:**
*   **name:** The name of the spider. Descriptive names help you keep track of all the spiders you make. Names must be unique, because the name is what you pass to `scrapy crawl name_of_spider` to run the spider.
*   **allowed_domains:** An optional Python list containing the domains that are allowed to be crawled. Requests for URLs outside these domains will not be crawled. This should include only the domain of the website (example: aliexpress.com) and not the entire URL specified in start_urls; otherwise you will get warnings.
*   **start_urls:** A list of URLs from which the spider will begin to crawl when no particular URLs are specified. The first pages downloaded will be those listed here. Subsequent requests will be generated successively from data contained in the start URLs.
* **parse(self, response):** This function is called whenever a URL is crawled successfully. It is also called the callback function. The response (the same object used in the Scrapy shell) returned as a result of crawling is passed to this function, and you write the extraction code inside it!

Add the extraction logic from above, using CSS selectors or XPath, to the parse function in the shopclues_powerbanks.py file as shown below:

<img src="https://lh3.googleusercontent.com/SPLhpmrj404xtRrmKid9l50OpedpntqvgyQZBC8wELzKKGZ0tlvtNXeuR-VkYTdnKpKSjQ" alt="code" style ="width:700px;" />

**NOTE:** In start_urls, we can add multiple URLs from the same domain for data extraction, separated by commas. <br>

**Describing the terms in above code:**
* **zip():** Takes any number of iterables and returns an iterator of tuples (in Python 3). The i-th tuple is built from the i-th element of each of the iterables.
* **yield:** This keyword is used when defining a generator function. A generator function is just like a normal function except that it uses the yield keyword instead of return. Each time the caller needs a value, the function runs until it reaches yield, hands that value to the caller, and retains its local state so that it can continue executing from where it left off. Here, yield gives the generated dictionary to Scrapy, which will process and save it!

After saving the above file, run the following command in anaconda prompt:

`scrapy crawl shopclues_powerbanks`

The above command prints a number of log lines, among which the extracted data can be seen as follows:

![result](https://lh5.googleusercontent.com/6oOUMt-13YkdZGYgugQF7XEU2HwyTNYeN20UJFgtT69zlhyW3tP5JeWbhPqbnVylDeDB3w)

By the end of this, we have successfully built a custom spider and extracted the required details from the shopclues e-commerce website. The next step is to export the scraped data as a CSV.

### 5.3. *Exporting Scraped Data as a CSV*

On observing the scraped data from the above step, we can see that each item/entity is separated by a comma (,). With that representation, it is not easy to perform data analysis or classification. So we need to export the data from the spider in a more convenient format like CSV, Excel, JSON etc. that can then be imported into other programs.

Scrapy provides this nifty little functionality where you can export the downloaded content in various formats. We can export the data to CSV by adding the following highlighted code snippet in the `settings.py` file:

<img src="https://lh3.googleusercontent.com/lNMRkJ4e0h28EQDBG31UR2S4ZP6tyb03LNJV3ZjavFNYCEdR6hvQJXss602NpDPa-zGdBg" alt="Drawing" style="width: 700px;"/>
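In text form, feed export settings of this kind might look like the excerpt below. This is a minimal sketch, assuming the output file is called `shopclues.csv`; `FEED_FORMAT` and `FEED_URI` are the settings described in section 6.1.

```python
# settings.py (excerpt)

# Export every scraped item to a CSV file in the project folder.
FEED_FORMAT = "csv"
FEED_URI = "shopclues.csv"

# Newer Scrapy releases (2.1 and later) deprecate the two settings above
# in favour of a single FEEDS dictionary that expresses the same thing:
# FEEDS = {"shopclues.csv": {"format": "csv"}}
```

For a one-off export, the same result can be obtained without touching settings.py by passing an output file on the command line, for example `scrapy crawl shopclues_powerbanks -o shopclues.csv`.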

After saving the above file, rerun the following command in anaconda prompt:

`scrapy crawl shopclues_powerbanks`

The above command creates a CSV file named `shopclues` and the data in it is as follows:

![excel](https://lh3.googleusercontent.com/KwljWXuPR_p30KK2RrvPQdWLiLvOIX7o10FM5C3Jrqrozzfh-WID2g3EbbLkn8Agby5-GQ)

By the end of this, we are able to successfully scrape the website contents and store them in a readable and accessible CSV format.

### 6. Additional Details:

So far, we have learnt the basic commands and coding required to custom build a spider. Now, let's learn a few more commands and attributes.

### 6.1. Few Other Commands & Attributes:
* **FEED_FORMAT:** Sets the format in which the scraped data is exported. Supported formats are: JSON, JSON lines, XML and CSV.
* **FEED_URI:** Specifies the desired location to store the exported file. FTP (File Transfer Protocol) can also be used.
* **%(time)s:** A placeholder in the exported file name that gets replaced by a timestamp when the feed is being created.
* **%(name)s:** A placeholder in the exported file name that gets replaced by the spider name.
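The two placeholders use Python's named `%(key)s` formatting syntax. The plain-Python sketch below illustrates the substitution on an ordinary string; the timestamp value and the file name pattern are made up for the example, and in a real crawl Scrapy supplies the actual values when it creates the feed.

```python
# Hypothetical values; Scrapy fills in the real ones at export time.
params = {
    "name": "shopclues_powerbanks",  # the spider's name attribute
    "time": "2019-01-10T12-30-00",   # assumed timestamp format, for illustration only
}

# A FEED_URI-style template using both placeholders.
uri_template = "%(name)s_%(time)s.json"

# Named %-formatting replaces each %(key)s with the matching dictionary value.
feed_uri = uri_template % params
print(feed_uri)  # shopclues_powerbanks_2019-01-10T12-30-00.json
```

Including `%(time)s` in the file name means each run writes to a fresh file instead of reusing the file from the previous crawl.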

**NOTE:** The feed changes you make in settings.py apply to all spiders in the project. If custom settings are set for a particular spider, then they will override the settings in the settings.py file.

Modify the spider file, shopclues_powerbanks.py, with custom settings as shown below:

<img src="https://lh6.googleusercontent.com/34kNs9ywUDpQNZ4B1D3d7cVHIEXwJjiroP4_l0PE-gwaa3pSlbTmLV2nTebGr-b3RqmUAA" alt="modified" style="width: 700px;"/>

Then rerun `scrapy crawl shopclues_powerbanks`, which produces the following JSON file with the timestamp in the file name.
![timestamp](https://lh6.googleusercontent.com/RU_fb5_VkPj5SuhMVg2NL_2s5dyD3OKjve8UYeYG_qwcrC7-R5gLUY-29HtjVeAYum2I5Q)

Even though settings.py has predefined FEED_FORMAT & FEED_URI values, the custom settings included in the spider have overridden them.

### 6.2. Following Links:

As of now, we know that to scrape multiple pages of the same website, we include them in `start_urls`, separated by commas. But as the number of pages increases, adding all the links becomes tedious. A crawler should be able to crawl through all the pages of the website by itself, with only the starting point mentioned in start_urls.

If a page has subsequent pages, you will see a navigator at the end of the page that allows moving back and forth between the pages. Upon inspecting the HTML code of this navigator, we can see that it has a bunch of URLs within `<a>` tags. Now, modify the spider by adding the following code snippet after the yield statement to continue parsing the other pages of the website:

<img src="https://lh3.googleusercontent.com/j0RYq0maszclhZx4GzWWAoMpjH_fUpY6vgdj4ft0rG8ZJvTRNLSjPrKyBFKwpnYTR1fKMw" alt="follow" style="width :700px;" />

**Describing the terms:**
* Using `next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()`, we first extracted the link of the next page. If the variable next_page gets a link and is not empty, then it will enter the if body.

* **response.urljoin(next_page):** The parse() method uses this method to build an absolute URL from the extracted link and creates a new request for it, whose response is later passed to the callback.

* After receiving the new URL, it will scrape that link executing the for body and again look for the next page. This will continue until it doesn't get a next page link.

This spider scrapes all the pages of the website and returns an exported data file with the required details from all subsequent pages. *The resulting file is larger than the initial CSV file.*

Apart from what was discussed here, Scrapy supports a plethora of options for exporting feeds and using CSS selectors. If you want to dig deeper, you can refer to the <a href="https://doc.scrapy.org/en/latest/index.html">Scrapy documentation</a>, which has complete information.

### 7. Referred Sources:
* https://doc.scrapy.org/en/latest/index.html
* https://www.datacamp.com/community/tutorials/making-web-crawlers-scrapy-python
* https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy/
* https://www.tutorialspoint.com/scrapy/scrapy_create_project.htm (Entire Scrapy Live Project Module)
