# Assignment Title: Web Scraping with Scrapy

## Objective:
The objective of this assignment is to help trainees gain hands-on experience with Scrapy, a powerful web scraping framework in Python. By the end of this assignment, trainees should be able to create Scrapy projects, build spiders to extract data from websites, and store the scraped data in various formats.


## Task 1: Install and Set Up Scrapy
### Install Scrapy:
  - Install Scrapy in your Python environment.
  - Use the following command to install: pip install scrapy

In [1]:
pip install scrapy

Collecting scrapy
  Using cached Scrapy-2.11.2-py2.py3-none-any.whl (290 kB)
Collecting tldextract
  Using cached tldextract-5.1.2-py3-none-any.whl (97 kB)
Collecting parsel>=1.5.0
  Using cached parsel-1.9.1-py2.py3-none-any.whl (17 kB)
Collecting cssselect>=0.9.1
  Using cached cssselect-1.2.0-py2.py3-none-any.whl (18 kB)
Collecting queuelib>=1.4.2
  Using cached queuelib-1.7.0-py2.py3-none-any.whl (13 kB)
Note: you may need to restart the kernel to use updated packages.

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.9.1 requires libclang>=13.0.0, which is not installed.
tensorflow 2.9.1 requires tensorflow-io-gcs-filesystem>=0.23.1, which is not installed.
spyder 5.1.5 requires pyqt5<5.13, which is not installed.
spyder 5.1.5 requires pyqtwebengine<5.13, which is not installed.
conda-repo-cli 1.0.4 requires pathlib, which is not installed.
anaconda-project 0.10.1 requires ruamel-yaml, which is not installed.
tensorflow 2.9.1 requires absl-py>=1.0.0, but you have absl-py 0.15.0 which is incompatible.
tensorflow 2.9.1 requires flatbuffers<2,>=1.12, but you have flatbuffers 2.0 which is incompatible.
tensorflow 2.9.1 requires gast<=0.4.0,>=0.2.1, but you have gast 0.5.3 which is incompatible.
tensorflow 2.9.1 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.1 which is incompatible.



Collecting PyDispatcher>=2.0.5
  Using cached PyDispatcher-2.0.7-py3-none-any.whl (12 kB)
Collecting protego>=0.1.15
  Using cached Protego-0.3.1-py2.py3-none-any.whl (8.5 kB)
Collecting cryptography>=36.0.0
  Downloading cryptography-43.0.1-cp39-abi3-win_amd64.whl (3.1 MB)
Collecting w3lib>=1.17.0
  Using cached w3lib-2.2.1-py3-none-any.whl (21 kB)
Collecting Twisted>=18.9.0
  Using cached twisted-24.7.0-py3-none-any.whl (3.2 MB)
Collecting itemadapter>=0.1.0
  Using cached itemadapter-0.9.0-py3-none-any.whl (11 kB)
Collecting itemloaders>=1.0.1
  Using cached itemloaders-1.3.1-py3-none-any.whl (12 kB)
Collecting service-identity>=18.1.0
  Using cached service_identity-24.1.0-py3-none-any.whl (12 kB)
Collecting jmespath>=0.9.5
  Using cached jmespath-1.0.1-py3-none-any.whl (20 kB)
Collecting incremental>=24.7.0
  Using cached incremental-24.7.2-py3-none-any.whl (20 kB)
Collecting hyperlink>=17.1.1
  Using cached hyperlink-21.0.0-py2.py3-none-any.whl (74 kB)
Collecting typing-extensio

### Create a Scrapy Project:
 - Create a new Scrapy project named "web_scraper" in your working directory.

In [None]:

"""
After installing Scrapy from the command terminal with the command above, I created a folder
(for example, quotes) to hold the project.
I then changed the terminal's working directory to that folder
and ran `scrapy startproject web_scraper` to create the Scrapy project.
"""

scrapy startproject web_scraper  # create a Scrapy project named web_scraper



## Task 2: Create a Spider to Scrape a Website
### Choose a Website: Select a simple, publicly accessible website to scrape.
Examples include:
 - http://quotes.toscrape.com (A website designed for practicing web scraping)

### Generate a Spider: Create a spider within your project to scrape the website.
 - Name the spider based on the website, e.g., quotes_spider for the quotes website.

In [None]:
# Create a spider inside the web_scraper project directory
cd web_scraper
scrapy genspider quotes_spider quotes.toscrape.com

### Extract Data:
- Extract the following data from the website:

    - Quotes: Extract the text of the quotes.
    - Authors: Extract the name of the author for each quote.
    - Tags: Extract tags associated with each quote

In [None]:
# Spider file: web_scraper/spiders/quotes_spider.py

import scrapy


class QuotesSpiderSpider(scrapy.Spider):
    name = "quotes_spider"
    # Restrict the crawl to the site being scraped
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote">
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.xpath("span/small/text()").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        # Follow the "Next" link until the last page is reached
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

## Task 3: Save the Scraped Data
- Save Data to a JSON File: Run the spider and save the scraped data to a JSON file.

In [None]:
# Run the spider from the project directory in the terminal and save the scraped data to output.json
scrapy crawl quotes_spider -o output.json

 - Save Data to a CSV File: Run the spider again and save the data to a CSV file.
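
Scrapy chooses the exporter from the output file's extension, so the CSV run can be as simple as `scrapy crawl quotes_spider -o output.csv` (with `-O` instead of `-o` to overwrite an existing file rather than append to it). As an alternative, the sketch below declares both feeds once in the project's settings file through the `FEEDS` setting. The file names and the column order are assumptions chosen for illustration, not requirements of the assignment.

```python
# web_scraper/settings.py (fragment)
# Hypothetical feed configuration: write every scraped item to both files.
FEEDS = {
    "output.json": {
        "format": "json",
        "encoding": "utf8",
        "overwrite": True,  # replace the file on each run
    },
    "output.csv": {
        "format": "csv",
        "fields": ["text", "author", "tags"],  # assumed column order
        "overwrite": True,
    },
}
```

With this in place, a plain `scrapy crawl quotes_spider` should produce both files. In the CSV output, list-valued fields such as the tags are joined into a single comma-separated cell by default.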

## Task 4: Implement Error Handling and Logging
 - Add Error Handling: Modify your spider to include basic error handling, such as retrying failed requests or skipping certain elements if they are not found.

- Enable Logging: Configure Scrapy’s logging to monitor your spider’s activity. Write logs to a file for review.
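
A minimal sketch of a logging configuration, assuming it is placed in the project's settings file. The log file name and the chosen level are assumptions; the setting names themselves are Scrapy's standard logging settings.

```python
# web_scraper/settings.py (fragment)
LOG_LEVEL = "INFO"               # hide DEBUG noise, keep warnings and errors
LOG_FILE = "quotes_spider.log"   # assumed file name; logs go here instead of the console
LOG_FILE_APPEND = False          # start a fresh log file on every run
LOG_FORMAT = "%(asctime)s [%(name)s] %(levelname)s: %(message)s"
```

The same effect can be had for a single run from the terminal with `scrapy crawl quotes_spider --logfile quotes_spider.log -L INFO`. Inside the spider, messages written through `self.logger` end up in the same file.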
 