# Scrapy Webscraping Tutorial

#### Table of Contents

1. <a href='##Introduction'>Webscraping with Scrapy</a> <br>
2. <a href='##Scrapy_Shell'>Using the Scrapy shell</a> <br>
> 2.a <a href='##xpath'>Intro to data extraction with xpath</a> <br>
3. <a href='##start_project'>Initializing a Scrapy project</a> <br>
> 3.a <a href='##file_structure'>File structure of the Scrapy framework</a> <br>
4. <a href='##example'>Scrapy Example</a> <br>

<a id='#Introduction'></a>

### Webscraping with Scrapy

What is **Scrapy**, and why is it an important Python package for us to learn? Our understanding of webscraping packages is incomplete without a detailed look at this tool for data extraction, for several reasons:

* **Webscraping Framework**: **Scrapy** is a framework for large-scale webscraping projects. Where **BeautifulSoup** and **Selenium** (which we previously covered) are good tools for individual or small-scale webscraping tasks, **Scrapy** is built for professional-level deployment. <br><br>
* **Multiple Requests**: While packages such as **Selenium** drive a browser through a page one action at a time, **Scrapy** issues requests concurrently (asynchronously) and handles errors per request. This means that if one request fails, the crawl keeps running and moves on to the next request. <br><br>
* **Cloud Deployment**: Webcrawlers created in **Scrapy** can be deployed in the cloud.

<a id='#Scrapy_Shell'></a>

### The Scrapy Shell

Scrapy ships with an interactive shell that lets the user send an **HTTP** request to a website and then parse the response with **XPath** or CSS selectors in the console.

**`scrapy shell www.reddit.com`** will fetch the page and open a shell in which a **`response`** object is available

**`response.url`** will return the URL that was fetched

**`response.body`** will return the raw HTML of the response (as bytes)

**`response.css`** takes a CSS selector and returns all of the elements that match it

Run the following code to create a collection of the **h2** objects which we can loop through and extract the text:

`>>items = response.css('h2')` <br>
`>>len(items)`

`>>for item in items:`
 > `print(item.extract())`

The **response** object exposes the following attributes and methods:
* **url** (string) – the URL of this response.
* **status** (integer) – the HTTP status of the response.
* **headers** (dict-like) – the headers of this response.
* **flags** (list) – the flags attached to this response (for example, 'cached' or 'redirected').
* **request** (Request object) – the request that generated this response.
* **xpath** (method) – takes an XPath query and returns the matching elements of the HTML page.

https://doc.scrapy.org/en/latest/topics/request-response.html

<a id='#xpath'></a>

### Introduction to XPath

XPath allows us to define specific tags that we want to grab in the HTML code. Because XPath queries are hierarchical, a tag can be selected by its position in the document tree, which makes the queries more specific. For example, going back to the www.rottentomatoes.com website: suppose we want to grab all of the tags for movies which are opening this week. How can we define this query?

![title](xpath1.png)

We can grab these objects with the following xpath code:

**`>>response.xpath("//tr[@class='sidebarInTheaterOpening']")`**

This returns a list of selector objects, but not the specific values that we need (href, text). In order to get these, we need to dig deeper into the object:

**`>>response.xpath('//tr[@class="sidebarInTheaterOpening"]/td[@class="middle_col"]/a/text()')`**

**`>>response.xpath('//tr[@class="sidebarInTheaterOpening"]/td[@class="middle_col"]/a/@href')`**

In the above example, the query starts with the **//** operator, which tells XPath to search the whole document, so **//tr** finds every **tr** tag.

Next, the condition in square brackets keeps only the **tr** tags whose **@class** attribute is equal to **sidebarInTheaterOpening**.

Once we have these rows, we go down one level and find the **td** tags whose **@class** attribute is equal to **middle_col**.

We proceed another level lower and find the **a** tags. Once we have the **a** tags, we select their text with **text()** and their link target with **@href**.

* **//tr** - search the entire tree, starting from the root, and return all **tr** instances
* **//tr/a** - for each **tr** instance, find the **a** tags that are its direct children
* **//tr[@class='sidebarInTheaterOpening']** - for each **tr** tag, select the ones which have **@class** equal to 'sidebarInTheaterOpening'
* **//tr[@class='sidebarInTheaterOpening']/td** - find the **td** tags that are direct children of those **tr** tags
* **//tr[@class='sidebarInTheaterOpening']/td/text()** - select the text inside those **td** tags

For a more thorough tutorial on XPath, take a look at the following link:

https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath-with-examples

<a id='#start_project'></a>

### Initialize a Scrapy Project

#### 1. Step One

We can initialize our Scrapy project by navigating to the directory where we want to store our code and typing:

**`>>scrapy startproject scrapy_tutorial`**

which will generate a file structure with the following hierarchy:

![title](scrapyFileStructure.png)

#### 2. Step Two

Next we create the scraper file for parsing our webpage. This scraper will be named and called through the command line. The code for the file looks like the following:

![title](spiderFile.png)

We run the scraper using the command 

**`>>scrapy crawl CFEM`**

where 

**`name = "CFEM"`** is the name we set in the **`CFEMScraper.py`** file

<a id='#file_structure'></a>
Our file tree now looks like the following:

![title](scrapyFileStructureAfter.png)

**Pro Tip:** Run the webscraper from the top-level **'scrapy_tutorial'** folder (the one that contains **scrapy.cfg**)!
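
A quick way to confirm we are in the right folder before starting a crawl is to check for **scrapy.cfg**. The helper below is a hypothetical convenience function (`in_project_root` is not part of Scrapy) and uses only the standard library.

```python
from pathlib import Path


def in_project_root(directory="."):
    """Return True if `directory` contains a scrapy.cfg file."""
    return (Path(directory) / "scrapy.cfg").is_file()


# Check the current working directory.
print(in_project_root())
```

If this prints `False`, change into the folder that holds **scrapy.cfg** before running **`scrapy crawl`**.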

<a id='#example'></a>

### Scrapy Example

Now that we've covered a few of the basics, let's look at a specific example and work through some code hands-on. Much of the documentation, along with relevant examples, can be found at the following website:

https://docs.scrapy.org/en/latest/intro/tutorial.html

Additionally, take a look at my **Scrapy** example in the **scrapy_tutorial** folder at:

https://github.com/zachescalante/Zach-Escalante-Code/tree/master/scrapy_tutorial