<div id="header">
    <p style="color:black; text-align:center; font-weight:bold; font-family:Tahoma, sans-serif; font-size:24px;">
        Data Gathering with Web Scraping
    </p>
</div>

<div style="background-color:#bfbfbf; padding:8px; border:2px dotted black; border-radius:8px; font-family:sans-serif; line-height: 1.7em">
Data gathering is the first and one of the most essential steps in the machine learning workflow.

Web Scraping is the process of extracting data from websites, allowing us to access structured or unstructured information available on the internet. It is a valuable tool for collecting data that may not be readily available through APIs or public datasets.

When performing web scraping to gather data, the general process follows these steps:

1. HTTP Request: A client (such as our Python code) sends an HTTP request to the website's URL.
2. HTML Response: The server returns an HTML response containing the webpage's content.
3. HTML Parsing: The HTML content is parsed using libraries such as BeautifulSoup or lxml, allowing the extraction of specific elements (e.g., text, links, images).
4. Data Structuring: The extracted data is organized into a structured format such as a pandas DataFrame.
5. Data Cleaning: After scraping, the data is cleaned by handling missing values, removing duplicates, and formatting columns for consistency.

Example: Gathering Book Details

This notebook gathers book details (titles, prices, ratings, and availability) from Books to Scrape, a mock e-commerce website. It requests each catalogue page, extracts the relevant fields from the HTML, collects them in a pandas DataFrame, and saves the result as a CSV file for further analysis. The same approach can be extended to real-world websites, subject to their terms of service.

</div>
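The parsing step described above can be tried offline before any request is sent. The sketch below is a minimal illustration, not the notebook's own code: `SAMPLE_HTML` is a hand-written snippet that only assumes the same class names the scraper looks for (`product_pod`, `price_color`, `star-rating`, `instock availability`), and `parse_book` is a hypothetical helper name. It uses the built-in `html.parser` backend so that `lxml` is not required.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one product card (an assumption, not real site output).
SAMPLE_HTML = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="#" title="Sample Book Title">Sample Book ...</a></h3>
  <p class="price_color">£10.00</p>
  <p class="instock availability">
    In stock
  </p>
</article>
"""


def parse_book(article):
    """Return the fields of one product card as a dict (hypothetical helper)."""
    # The 'class' attribute is a list such as ['star-rating', 'Three'];
    # the second entry holds the rating word.
    rating_classes = article.find("p", class_="star-rating")["class"]
    return {
        "Title": article.h3.a["title"],
        "Price": article.find("p", class_="price_color").text.strip(),
        "Availability": article.find("p", class_="instock availability").text.strip(),
        "Rating": rating_classes[1],
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
books = [parse_book(a) for a in soup.find_all("article", class_="product_pod")]
print(books)
```

Checking the selectors against a small fixed snippet like this makes it easier to tell a parsing mistake from a network problem.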

In [21]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [22]:
base_url = "http://books.toscrape.com/catalogue/page-{}.html"

In [23]:
dataframes = []

In [24]:
final = pd.DataFrame()

In [25]:
# Loop through the pages
for page in range(1, 51):
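    # Hand the raw bytes to BeautifulSoup so it can use the page's own charset;
    # requests' .text can fall back to ISO-8859-1 and turn "£" into "Â£".
    response = requests.get(base_url.format(page), timeout=10)
    response.raise_for_status()  # stop early on HTTP errors such as 404
    soup = BeautifulSoup(response.content, 'lxml')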

    books = soup.find_all('article', class_='product_pod')

    # Lists to store data for this page
    titles = []
    prices = []
    availability = []
    ratings = []

    for book in books:
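        # A missing tag or attribute raises AttributeError, TypeError, or KeyError;
        # record NaN for that field instead of aborting the scrape.
        try:
            titles.append(book.h3.a['title'])
        except (AttributeError, KeyError, TypeError):
            titles.append(np.nan)

        try:
            prices.append(book.find('p', class_='price_color').text.strip())
        except (AttributeError, KeyError, TypeError):
            prices.append(np.nan)

        try:
            availability.append(book.find('p', class_='instock availability').text.strip())
        except (AttributeError, KeyError, TypeError):
            availability.append(np.nan)

        try:
            # The class list looks like ['star-rating', 'Three']
            rating_class = book.find('p', class_='star-rating')['class']
            ratings.append(rating_class[1] if len(rating_class) > 1 else np.nan)
        except (AttributeError, KeyError, TypeError):
            ratings.append(np.nan)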

    # Create a DataFrame for this page
    df = pd.DataFrame({
        'Title': titles,
        'Price': prices,
        'Availability': availability,
        'Rating': ratings
    })

    dataframes.append(df)
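When a loop like the one above is pointed at a real website, it is common to reuse one connection, set a timeout, check the status code, and pause between requests. The following is a minimal sketch under those assumptions; `fetch_html` and its `timeout` and `delay` parameters are hypothetical names chosen for illustration, and the default values are arbitrary.

```python
import time

import requests


def fetch_html(url, session=None, timeout=10, delay=0.5):
    """Fetch one page and return its raw bytes (hypothetical helper)."""
    if session is None:
        session = requests.Session()
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    time.sleep(delay)            # pause so consecutive requests are spaced out
    return response.content      # bytes, so the parser can detect the charset itself


# Example usage (performs real network requests, so it is left commented out):
# session = requests.Session()
# html = fetch_html("http://books.toscrape.com/catalogue/page-1.html", session=session)
```

Passing an existing `requests.Session` lets all pages share one connection pool, and returning bytes rather than text leaves encoding detection to the HTML parser.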

In [26]:
# Combine DataFrames
final = pd.concat(dataframes, ignore_index=True)

In [27]:
print(final.sample(5))

                                                 Title    Price Availability  \
164  The 10% Entrepreneur: Live Your Startup Dream ...   £27.55     In stock   
227                                       Twenty Yawns   £22.08     In stock   
910         Travels with Charley: In Search of America   £57.82     In stock   
856                                        Dark Places   £23.90     In stock   
844               Fifty Shades Freed (Fifty Shades #3)   £15.36     In stock   

    Rating  
164  Three  
227    Two  
910   Five  
856   Five  
844   Five  
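The data cleaning step listed in the introduction is not shown in the notebook, so here is one possible sketch of it. It runs on a small hand-made `sample` frame with the same four columns; `clean_books` and `RATING_WORDS` are hypothetical names, and the choice to store prices as floats and ratings as nullable integers is an assumption about what later analysis would need.

```python
import pandas as pd

# Hand-made rows shaped like the scraped data; one price keeps the stray "Â"
# to show that the numeric extraction does not depend on the currency prefix.
sample = pd.DataFrame({
    "Title": ["Twenty Yawns", "Dark Places", "Dark Places"],
    "Price": ["£22.08", "Â£23.90", "Â£23.90"],
    "Availability": ["In stock", "In stock", "In stock"],
    "Rating": ["Two", "Five", "Five"],
})

RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def clean_books(df):
    """Return a cleaned copy: numeric Price, integer Rating, duplicates dropped."""
    cleaned = df.copy()
    # Keep only the numeric part of the price string and convert it to float.
    cleaned["Price"] = (
        cleaned["Price"].str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)
    )
    # Map rating words to numbers; unknown or missing words become <NA>.
    cleaned["Rating"] = cleaned["Rating"].map(RATING_WORDS).astype("Int64")
    return cleaned.drop_duplicates().reset_index(drop=True)


cleaned = clean_books(sample)
print(cleaned)
```

The same function could be applied to `final` before saving, provided its `Price` column contains strings.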


In [28]:
# Save the final DataFrame to a CSV file
final.to_csv('books_to_scrape.csv', index=False, encoding='utf-8')
print("Data saved to books_to_scrape.csv")

Data saved to books_to_scrape.csv
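A quick way to confirm that a saved file is usable is to read it back and compare it with what was written. The sketch below does this for a tiny hand-made frame in a temporary directory, so it does not touch `books_to_scrape.csv`; the file name `books_sample.csv` and the sample row are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "Title": ["Twenty Yawns"],
    "Price": ["£22.08"],
    "Availability": ["In stock"],
    "Rating": ["Two"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "books_sample.csv")
    # Write and read with the same explicit encoding so "£" survives the round trip.
    df.to_csv(path, index=False, encoding="utf-8")
    restored = pd.read_csv(path, encoding="utf-8")

print(restored)
```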
