## Scrapy Basics

Install two libraries:
- scrapy
    - `conda install -c conda-forge scrapy` or `pip install Scrapy`
- protego
    - `conda install -c conda-forge protego` or `pip install Protego`

See the available commands by running `scrapy` in your terminal.  
To check how Scrapy performs on your computer, run the benchmark test with `scrapy bench`.  
Fetch a URL using the Scrapy downloader (this prints the page's HTML): `scrapy fetch http://google.com`  
Generate a new spider from a pre-defined template with `scrapy genspider`. We run this command every time we want to create a spider for a website.

To start a project, run `scrapy startproject project_name` in the terminal.  
This creates a folder with the given project name. Inside it there is another folder with the same name and a `scrapy.cfg` file, which lets us run Scrapy commands for the project.  
Inside the inner `project_name` folder you will see `items.py`, which lets us define a structure for the scraped data.  
There is also `middlewares.py`, where we can plug in custom functionality to process requests and responses.  
`pipelines.py` is where we process the scraped items, for example to store them in a database such as MongoDB or an SQL database.  
`settings.py` holds extra configuration for the project.


To create a new spider, run `scrapy genspider spider_name link`  
For example, `scrapy genspider worldometer www.worldometers.info/world-population/population-by-country`.   
After running it you'll get a message like this:  
```
Created spider 'worldometer' using template 'basic' in module:
  scrapy_spider.spiders.worldometer
```


The new spider file is created inside the project's `spiders` folder.

Scrapy has two popular spider classes: `scrapy.Spider` and `CrawlSpider`. `scrapy.Spider` is the simplest spider.  
It doesn't provide any special functionality, but we can customize it to scrape the way we want.  

On the other hand, `CrawlSpider` is the most commonly used spider for crawling regular websites. It provides a mechanism for following links by defining a set of rules.

*Note that crawling is not the same as scraping a website.*

A crawler usually browses the World Wide Web for the purpose of web indexing, whereas web scraping is about extracting information from websites.
So a `CrawlSpider` might not be the best fit for your web scraping project.

Now take a look at the scrapy spider template.  
![spider_temp](../images/scrapy_spider_temp.png)  
This has a default class named after the spider we created.
`start_urls` is the list of URLs the spider starts from, and the `parse` function handles the response to each request. 

In Scrapy, we use `response` to find elements, just like the driver in Selenium or the soup in BeautifulSoup. It represents the response we get after sending a request to a website.   
Unlike Selenium, Scrapy doesn't have functions for finding elements by id, class name or tag name. Instead we select elements with selector expressions. These notes use XPath (`response.xpath`), although Scrapy also accepts CSS selectors through `response.css`. Any lookup by id, class or tag can be written as an equivalent XPath.

To find an element with XPath in Scrapy, we write -   
`response.xpath('//tag[@AttributeName="Value"]')`  
This returns a list of selectors. Use `get()` to extract the first match as a string and `getall()` to extract every match as a list of strings -   
`response.xpath('//tag[@AttributeName="Value"]').get()`  
`response.xpath('//tag[@AttributeName="Value"]').getall()`

Scrapy collects scraped data through the `yield` keyword: whatever the `parse` function yields is handed to Scrapy, which can then store or export it.  

A callback can yield only the following:
- Request (Scrapy object)
- Item (Scrapy object, called `BaseItem` in older Scrapy versions)
- Dict
- None

This means that you can't yield a plain string or integer; if you do, Scrapy reports an error.

### Scrapy Shell

To enter the Scrapy shell, run `scrapy shell` in the terminal.  
Typically, Request objects are generated in the spiders and passed across the system until they reach the Downloader, which executes the request and returns a Response object that travels back to the spider that issued the request.  
Create a request for a URL - `r = scrapy.Request(url='https://www.worldometers.info/world-population/population-by-country/')`  
Send the request and store the result in `response` - `fetch(r)`  
Check the HTML of the page - `response.body`  
Select specific elements - `response.xpath('//h1')`  
Select the text of a specific element - `response.xpath('//h1/text()')`. This returns a selector object that wraps the text, not the plain string.  
Get the clean text of a specific element - `response.xpath('//h1/text()').get()`  
Get all the country names from the above website - `response.xpath('//td/a/text()').getall()`

We can do the same thing by building a spider (see the Scrapy spider template screenshot above). The code can be found [here](https://github.com/sahaavi/Web-Scraping/blob/main/Scrapy/scrapy_spider/scrapy_spider/spiders/worldometer.py).  
To run this spider, use the command `scrapy crawl spider_name`. In this case the spider name is `worldometer`.  

## Getting links listed in a website [click](https://github.com/sahaavi/Web-Scraping/blob/main/Scrapy/scrapy_spider/scrapy_spider/spiders/worldometer_links.py)

## Scraping data from multiple links [here](https://github.com/sahaavi/Web-Scraping/blob/main/Scrapy/scrapy_spider/scrapy_spider/spiders/worldometer_multi_links.py).  
To export the data to a JSON or CSV file, run `scrapy crawl spider_name -o filename.extension`.  
For example, `scrapy crawl worldometer_multi_links -o population.json`

## Pagination with Scrapy 

Let's build a new spider for this. This time we'll use the [Audible website (Canada region)](https://www.audible.ca/search).  
Scrape the first page: [code](https://github.com/sahaavi/Web-Scraping/blob/main/Scrapy/scrapy_spider/scrapy_spider/spiders/audible.py)  
Scrape all pages: [code](https://github.com/sahaavi/Web-Scraping/blob/main/Scrapy/scrapy_spider/scrapy_spider/spiders/audible_pagination.py)  
*Unlike Selenium, Scrapy (like BeautifulSoup) can't click a button, so we follow the link to the next page instead.*

Scrapy is much faster than Selenium and BeautifulSoup. In this run it scraped 25 pages, each with information on roughly 20 books, in just 24 seconds.

## Change User Agent

To check the user agent, open the page in the Scrapy shell with `scrapy shell "link"`  
For example, `scrapy shell "https://www.audible.ca/search"`. Then inspect `request.headers`, which gives you a dictionary of headers. One of its keys is `User-Agent`, and by default its value identifies Scrapy (`Scrapy/<version> (+https://scrapy.org)`). That lets the website easily detect that we're using Scrapy to scrape it. Therefore, we change the user agent so the request looks like it comes from Chrome.  

Here is how to find the user agent in the browser -  
Step 1: Open Inspect.  
![INSPECT](../images/inspect_audible.png)  
Step 2: Select the Network tab, reload the page, and wait until all the elements are loaded.  
![Network](../images/network.png)  
Step 3: After the page has loaded, search for any HTML request in the search bar.  
![search html](../images/search.png)  
Step 4: Click any of the names and you'll find the User-Agent on the right.  
![user agent](../images/header.png)  

Now that we have the user agent value, open the `settings.py` file inside the project folder we created earlier.  
Set the `User-Agent` key of the `DEFAULT_REQUEST_HEADERS` dictionary as below -  
```
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.82'
}
```
We can also do this in the spider file itself; check [here](paste link). 

[Here](https://explore.whatismybrowser.com/useragents/explore/) you can find user agents for different platforms and software.


## Spider template

List the available spider templates with this command - `scrapy genspider -l`  
To create a spider from the `crawl` template, use this command - `scrapy genspider -t crawl spider_name link`  
For example, `scrapy genspider -t crawl transcripts subslikescript.com/movies`   
The spider file now has a new property, `rules`. Each `Rule` object tells the spider which links to follow while crawling.  
`rules = (Rule(LinkExtractor(allow=r"Items/"), callback="parse_item", follow=True),)`  
`LinkExtractor` extracts the links whose URL matches the regular expression given in the `allow` parameter, and the rule uses the `parse_item` function to parse the pages behind those links. The `deny` parameter does the opposite: links that match it are not followed. Another parameter, `restrict_xpaths`, limits link extraction to the parts of the page matched by the given XPath expressions.    

By default, older Scrapy versions escape non-ASCII characters in JSON exports (for example `\u00e9` instead of `é`), so check whether `settings.py` has this line, usually at the bottom.  
`FEED_EXPORT_ENCODING = "utf-8"`

## Full code with crawler template [here](https://github.com/sahaavi/Web-Scraping/blob/main/Scrapy/scrapy_spider/scrapy_spider/spiders/transcripts_crawler.py)

## Pagination with spider crawl template [code](https://github.com/sahaavi/Web-Scraping/blob/main/Scrapy/scrapy_spider/scrapy_spider/spiders/crawler_pagination.py)
Set `DOWNLOAD_DELAY = 0.5` in `settings.py`; otherwise you may get HTTP 429 (Too Many Requests) responses, which indicate that you are making too many requests to the server in a given time frame.   
After running this you will see that the spider code is shorter, and runs much faster, than the BeautifulSoup and Selenium versions. 

## Change Crawler User Agent [code](https://github.com/sahaavi/Web-Scraping/blob/main/Scrapy/scrapy_spider/scrapy_spider/spiders/crawler_user_agent.py)

## Pipeline
We can modify the `pipelines.py` file in the project folder to log a message when the spider starts and stops running.  
By default there is only one function inside `pipelines.py`, which is `process_item`. We'll add two other functions - 
 - open_spider
 - close_spider  

Then go to `settings.py` and uncomment these lines.  
```
ITEM_PIPELINES = {
   "scrapy_spider.pipelines.ScrapySpiderPipeline": 300,
}
```

Now we'll get messages like the ones below whenever a spider opens, because of the `open_spider` function.  
 ```
2023-07-18 11:03:56 [scrapy.core.engine] INFO: Spider opened
2023-07-18 11:03:56 [crawler_user_agent] INFO: Spider opened: crawler_user_agent
 ```
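
A minimal sketch of such a pipeline is shown below. The class name matches the `ITEM_PIPELINES` entry above; the exact log messages are an assumption and may differ from the repository's code.

```python
class ScrapySpiderPipeline:
    def open_spider(self, spider):
        # Runs once when the spider starts.
        spider.logger.info("Spider opened: %s", spider.name)

    def close_spider(self, spider):
        # Runs once when the spider finishes.
        spider.logger.info("Spider closed: %s", spider.name)

    def process_item(self, item, spider):
        # Runs for every item; return it so later pipelines receive it too.
        return item
```

The number `300` in `ITEM_PIPELINES` is the pipeline's order: when several pipelines are enabled, lower numbers run first.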

## Export Data to MongoDB Database

Set up a free [MongoDB](https://www.mongodb.com/) account. Then create a project using the free M0 option.
Then install these packages on your PC.

- pymongo
- dnspython

Use this command to install both together `conda install pymongo dnspython -y`  
Then, in the Network Access settings of the MongoDB project, add the IP address `0.0.0.0/0`. This allows connections from any IP address, so use it only for testing.

Then go to the Database option under Deployment and select Connect.  
![mongo_connect](../images/mongo_connect.png)  
Then choose the Drivers option and select Python and your Python version. Next, copy the connection string, which we have to paste inside `pipelines.py`.  
Replace `<password>` with the actual password you set.  
Then change the pipeline to `MongoDBPipeline` in `ITEM_PIPELINES` inside `settings.py`, because this is the class name we used in our pipelines file.  
Check all the code - 
[scrapy_mongo](paste folder link)
    - [pipelines]()  
    - [settings]()  
    - [spider]()  

Then click on Collections under Database in MongoDB to check the scraped data.
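
As a rough idea of what the pipeline can look like, here is a minimal sketch using `pymongo`. The connection string is a placeholder to be replaced with your own, and the database and collection names are hypothetical.

```python
class MongoDBPipeline:
    # Placeholder: paste the connection string copied from MongoDB here.
    mongo_uri = "mongodb+srv://<username>:<password>@<cluster-address>/"
    database_name = "scrapy_demo"    # hypothetical database name
    collection_name = "transcripts"  # hypothetical collection name

    def open_spider(self, spider):
        # Imported lazily so pymongo is only required when this pipeline runs.
        import pymongo

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.database_name][self.collection_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Insert a copy, because insert_one adds an "_id" key to the dict it receives.
        self.collection.insert_one(dict(item))
        return item
```

Opening the connection in `open_spider` and closing it in `close_spider` means one connection is reused for every item, rather than reconnecting per item.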

## Export Data to SQLite Database

You can find the code for the settings and pipelines in the scrapy_mongo project under MongoDB Codes.
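
For orientation, a SQLite pipeline can be sketched with Python's built-in `sqlite3` module, as below. The file name, table name and the `title`/`url` columns are hypothetical and should be adapted to the fields your spider yields.

```python
import sqlite3


class SQLitePipeline:
    db_path = "transcripts.db"  # hypothetical file name

    def open_spider(self, spider):
        self.connection = sqlite3.connect(self.db_path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS transcripts (title TEXT, url TEXT)"
        )

    def close_spider(self, spider):
        # Save all inserted rows, then release the file.
        self.connection.commit()
        self.connection.close()

    def process_item(self, item, spider):
        # The ? placeholders keep scraped text from being interpreted as SQL.
        self.connection.execute(
            "INSERT INTO transcripts (title, url) VALUES (?, ?)",
            (item.get("title"), item.get("url")),
        )
        return item
```

As with any pipeline, the class has to be listed in `ITEM_PIPELINES` in `settings.py` before Scrapy will call it.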