# Task 1: Web Scraping Project - Codveda Internship
Scraping data from the [quotes.toscrape.com](http://quotes.toscrape.com) website using Python libraries such as 'requests', 'BeautifulSoup', and 'pandas', and storing it in a structured format (CSV).



## Introduction
In this task, I performed web scraping to collect data from the [quotes.toscrape.com](http://quotes.toscrape.com) website, with the aim of:
- Identifying and inspecting the website structure
- Using 'requests' and 'BeautifulSoup' to extract data
- Handling pagination and storing the data in CSV format

This exercise improves my skills in automation, data extraction, and structuring real-world data.

In [None]:
!pip install requests
!pip install beautifulsoup4



## Step 1: Importing Required Libraries
Here, I imported the Python libraries essential for the task:
- 'requests' to make HTTP calls
- 'BeautifulSoup' for parsing HTML
- 'pandas' for storing and manipulating data

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Step 2: Sending HTTP Request and Parsing HTML
I used the 'requests' library to fetch the web page and then parsed the HTML content using 'BeautifulSoup'. This allowed me to inspect and extract specific elements from the page.

In [None]:
url = 'http://quotes.toscrape.com/page/1/'
response = requests.get(url)
print(response.status_code)

200


In [None]:
soup = BeautifulSoup(response.text, 'html.parser')
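
The request above prints the status code but carries on even if the page fails to load. As a minimal sketch, the fetch could be wrapped in a helper that sets a timeout and raises on an error response. The name `fetch_html` and its `get` and `timeout` parameters are hypothetical and only serve this illustration.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page body as text, raising if the request fails.

    `get` defaults to requests.get; it is a parameter only so that the
    helper can be exercised without network access (an assumption made
    for this sketch, not part of the original notebook).
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, the parsing step would become `soup = BeautifulSoup(fetch_html(url), 'html.parser')`.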

## Step 3: Extracting the data
I identified the HTML elements (tags and classes) that contain the data I need. I then extracted the required fields (quotes, authors, and tags) using the BeautifulSoup method 'find_all()'.

In [None]:
quotes = soup.find_all('div', class_='quote')
data = []
for quote in quotes:
    text = quote.find('span', class_='text').text
    author = quote.find('small', class_='author').text
    tags = [tag.text for tag in quote.find_all('a', class_='tag')]
    data.append({'Quote': text, 'Author': author, 'Tags': ','.join(tags)})

# Build the DataFrame once, after the loop has collected every quote
df = pd.DataFrame(data)
df.head()
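
The extraction above assumes every quote block contains both a text and an author element; if one is missing, `.find(...)` returns `None` and `.text` raises an `AttributeError`. The sketch below shows one defensive variant, run against a small inline HTML sample so it does not need network access. The function name `extract_quotes` and the `SAMPLE_HTML` snippet are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the structure inspected above
SAMPLE_HTML = """
<div class="quote">
  <span class="text">“A sample quote.”</span>
  <small class="author">Sample Author</small>
  <a class="tag" href="#">life</a>
  <a class="tag" href="#">humor</a>
</div>
<div class="quote">
  <span class="text">“A quote with a missing author.”</span>
</div>
"""


def extract_quotes(html):
    """Return one dict per quote block, using None for any missing field."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for quote in soup.find_all('div', class_='quote'):
        text = quote.find('span', class_='text')
        author = quote.find('small', class_='author')
        tags = [tag.get_text(strip=True) for tag in quote.find_all('a', class_='tag')]
        rows.append({
            'Quote': text.get_text(strip=True) if text is not None else None,
            'Author': author.get_text(strip=True) if author is not None else None,
            'Tags': ','.join(tags),
        })
    return rows


rows = extract_quotes(SAMPLE_HTML)
```

Here `rows` holds two dictionaries, and the second one has `None` as its author instead of stopping the run with an error.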

## Step 4: Handling Pagination
To scrape multiple pages, I looped through the paginated URLs for pages 1 to 10. For each page, I repeated the data extraction process and appended the results to a single list.

In [None]:
all_data = []
for page in range(1, 11):  # Scrape first 10 pages
    url = f'http://quotes.toscrape.com/page/{page}/'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    quotes = soup.find_all('div', class_='quote')
    for quote in quotes:
        text = quote.find('span', class_='text').text
        author = quote.find('small', class_='author').text
        tags = [tag.text for tag in quote.find_all('a', class_='tag')]
        all_data.append({'Quote': text, 'Author': author, 'Tags': ','.join(tags)})
df = pd.DataFrame(all_data)
df.head()


| | Quote | Author | Tags |
|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | change,deep-thoughts,thinking,world |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | abilities,choices |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | inspirational,life,live,miracle,miracles |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | aliteracy,books,classic,humor |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | be-yourself,inspirational |
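
The loop in this step hard-codes the page count. An alternative is to follow the page's "Next" link until none is left, which is sketched below. It assumes the link is marked up as `<li class="next"><a href="...">`; that is worth confirming in the browser inspector before relying on it. The names `scrape_all_pages`, `fetch`, and `max_pages` are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def scrape_all_pages(start_url, fetch, max_pages=50):
    """Follow 'next' links from start_url, collecting one dict per quote.

    fetch is any callable that takes a URL and returns the page HTML.
    max_pages is a safety cap so a bad link cannot loop forever.
    """
    rows = []
    url = start_url
    pages_seen = 0
    while url is not None and pages_seen < max_pages:
        soup = BeautifulSoup(fetch(url), 'html.parser')
        for quote in soup.find_all('div', class_='quote'):
            text = quote.find('span', class_='text').text
            author = quote.find('small', class_='author').text
            tags = [tag.text for tag in quote.find_all('a', class_='tag')]
            rows.append({'Quote': text, 'Author': author, 'Tags': ','.join(tags)})
        # Assumed markup for the pager: <li class="next"><a href="/page/2/">
        next_item = soup.find('li', class_='next')
        next_link = next_item.find('a') if next_item is not None else None
        # urljoin resolves a relative href against the current page URL
        url = urljoin(url, next_link['href']) if next_link is not None else None
        pages_seen += 1
    return rows
```

Against the live site this could be called as `scrape_all_pages('http://quotes.toscrape.com/page/1/', lambda u: requests.get(u, timeout=10).text)`. Passing `fetch` in as an argument also makes it possible to try the function on saved HTML without any network requests.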


## Step 5: Storing Data in CSV
After collecting the data, I structured it into a pandas DataFrame and exported it as a '.csv' file. This makes it easier to analyze and reuse in later tasks such as data cleaning and modelling.

In [None]:
df.to_csv('quotes.csv' , index=False)
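
Passing `index=False` keeps the row index out of the file, so no extra `Unnamed: 0` column appears when the CSV is read back. As a rough check of the round trip, the sketch below writes a small made-up DataFrame to an in-memory buffer (standing in for 'quotes.csv'), reads it back, and splits the comma-joined tags into lists. The sample rows and the `TagList` column name are invented for illustration.

```python
import io

import pandas as pd

# Made-up rows with the same columns as the scraped data
sample = pd.DataFrame([
    {'Quote': '“A sample quote.”', 'Author': 'Sample Author', 'Tags': 'life,humor'},
    {'Quote': '“Another, with a comma.”', 'Author': 'Other Author', 'Tags': ''},
])

# An in-memory text buffer stands in for the file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# keep_default_na=False keeps an empty Tags cell as '' instead of NaN
restored = pd.read_csv(buffer, keep_default_na=False)

# Turn the comma-joined string back into a list of tags
restored['TagList'] = restored['Tags'].apply(lambda s: s.split(',') if s else [])
```

Quotes that contain commas survive the round trip because `to_csv` wraps such fields in double quotes.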

In [None]:
from google.colab import files
files.download('quotes.csv')


## Conclusion
I successfully scraped quotes, authors, and tags from 'quotes.toscrape.com', navigated pagination, and stored the data in a structured CSV format using 'pandas'.

This data can now be used for text analysis, author statistics, or NLP projects in future tasks.