# Step 2: Parse HTML and Extract Book Data

In this notebook, we'll:
1. Parse the downloaded HTML files using BeautifulSoup
2. Extract book information (title, price, rating)
3. Combine data from all 50 pages
4. Export the data to a CSV file

## Install Required Libraries

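First, let's install BeautifulSoup4 and lxml for HTML parsing. Note that the extraction code below uses Python's built-in `html.parser`; lxml only comes into play if you pass `"lxml"` to `BeautifulSoup` instead.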

In [1]:
!pip install beautifulsoup4 lxml



## Import Libraries

Import BeautifulSoup for HTML parsing and pandas for data manipulation.

In [2]:
from bs4 import BeautifulSoup
import pandas as pd
import os

## Extract Data from All Pages

Now we'll loop through all 50 downloaded HTML files and extract book information from each page.
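
Before parsing, it can help to confirm that every expected file is actually on disk, so a missing download shows up as a clear message rather than an error inside the loop. The following is a minimal sketch, assuming the `htmls/page{N}.html` naming used in this notebook; `missing_pages` is a hypothetical helper, not part of the original code.

```python
import os

def missing_pages(directory="htmls", total_pages=50):
    """Return the page numbers whose HTML file is absent from `directory`."""
    return [
        n for n in range(1, total_pages + 1)
        if not os.path.isfile(os.path.join(directory, f"page{n}.html"))
    ]

missing = missing_pages()
if missing:
    print(f"Missing pages: {missing}")
else:
    print("All expected pages are present")
```

If the list is not empty, re-running the download step for those page numbers first avoids partial results later.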

In [3]:
# List to store all book data
all_books = []

# Loop through all 50 pages
for page_num in range(1, 51):
    try:
        # Read the HTML file
        with open(f"htmls/page{page_num}.html", "r", encoding="utf-8") as f:
            content = f.read()
        
        # Parse with BeautifulSoup
        soup = BeautifulSoup(content, "html.parser")
        
        # Find all book articles on the page
        articles = soup.select("article.product_pod")
        
        # Extract data from each book
        for article in articles:
            # Get book title
            title = article.find("h3").find("a")["title"]
            
            # Get price (remove the £ symbol)
            price = article.select_one("p.price_color").text.split("£")[1]
            
            # Get rating
            rating_element = article.select_one("p.star-rating")
            rating = rating_element['class'][1]
            
            # Add to our list
            all_books.append([title, price, rating])
        
        print(f"Extracted data from page {page_num}")
        
    except Exception as e:
        print(f"Error processing page {page_num}: {e}")

print(f"\nTotal books extracted: {len(all_books)}")

Extracted data from page 1
Extracted data from page 2
Extracted data from page 3
Extracted data from page 4
Extracted data from page 5
Extracted data from page 6
Extracted data from page 7
Extracted data from page 8
Extracted data from page 9
Extracted data from page 10
Extracted data from page 11
Extracted data from page 12
Extracted data from page 13
Extracted data from page 14
Extracted data from page 15
Extracted data from page 16
Extracted data from page 17
Extracted data from page 18
Extracted data from page 19
Extracted data from page 20
Extracted data from page 21
Extracted data from page 22
Extracted data from page 23
Extracted data from page 24
Extracted data from page 25
Extracted data from page 26
Extracted data from page 27
Extracted data from page 28
Extracted data from page 29
Extracted data from page 30
Extracted data from page 31
Extracted data from page 32
Extracted data from page 33
Extracted data from page 34
Extracted data from page 35
Extracted data from page 36
...
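
Because the `try` block in the loop wraps a whole page, a single book with a missing tag would skip every remaining book on that page. One possible alternative is to move the per-page logic into a function that skips incomplete entries. This is only a sketch under assumptions: `parse_books` and `sample_html` are hypothetical names, and the sample markup is a hand-written fragment that mirrors what the selectors above expect, not a copy of a real page.

```python
from bs4 import BeautifulSoup

def parse_books(html):
    """Return a list of {"title", "price", "rating"} dicts for one page."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for article in soup.select("article.product_pod"):
        link = article.select_one("h3 a")
        price_tag = article.select_one("p.price_color")
        rating_tag = article.select_one("p.star-rating")
        # Skip incomplete entries instead of raising, so one bad book
        # does not discard the rest of the page
        if link is None or price_tag is None or rating_tag is None:
            continue
        # The rating word is whichever class is not "star-rating"
        rating_words = [c for c in rating_tag.get("class", []) if c != "star-rating"]
        books.append({
            "title": link.get("title", link.get_text(strip=True)),
            "price": price_tag.get_text(strip=True).lstrip("£"),
            "rating": rating_words[0] if rating_words else None,
        })
    return books

# Hypothetical fragment: one complete book and one without price or rating
sample_html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
<article class="product_pod">
  <h3><a href="#" title="No price here">No price here</a></h3>
</article>
"""

print(parse_books(sample_html))
```

Returning dictionaries rather than positional lists also lets `pd.DataFrame` pick up column names directly, though the column labels would then be `title`, `price`, and `rating` rather than the ones used in this notebook.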

## Create DataFrame

Convert our list of books into a pandas DataFrame for easier manipulation.

In [4]:
# Create DataFrame with appropriate column names
df = pd.DataFrame(all_books, columns=["Book Title", "Price", "Rating"])
print(f"DataFrame created with {len(df)} books")

DataFrame created with 1000 books


## Preview the Data

Let's take a look at the first few rows of our dataset.

In [5]:
# Display first 10 rows
df.head(10)

|   | Book Title | Price | Rating |
|---|---|---|---|
| 0 | A Light in the Attic | 51.77 | Three |
| 1 | Tipping the Velvet | 53.74 | One |
| 2 | Soumission | 50.1 | One |
| 3 | Sharp Objects | 47.82 | Four |
| 4 | Sapiens: A Brief History of Humankind | 54.23 | Five |
| 5 | The Requiem Red | 22.65 | One |
| 6 | The Dirty Little Secrets of Getting Your Dream... | 33.34 | Four |
| 7 | The Coming Woman: A Novel Based on the Life of... | 17.93 | Three |
| 8 | The Boys in the Boat: Nine Americans and Their... | 22.6 | Four |
| 9 | The Black Maria | 52.15 | One |


## Export to CSV

Save our extracted data to a CSV file for future use.

In [6]:
# Save to CSV file
df.to_csv("data.csv", index=False)
print("Data successfully saved to data.csv!")

Data successfully saved to data.csv!


## Summary

✅ Successfully scraped and extracted book data from all 50 pages!

You can now use `data.csv` for further analysis or visualization.
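
For follow-up analysis, keep in mind that `Rating` is stored as a word (such as `One` or `Five`) rather than a number. The sketch below shows one way to load the exported CSV with numeric columns. To stay self-contained it reads an in-memory sample built from rows in the preview instead of the real file. `load_books` and `RATING_TO_INT` are hypothetical names, and the `Two` entry assumes the site follows the same word pattern for every rating.

```python
import io
import pandas as pd

# A few rows copied from the preview, standing in for data.csv
sample_csv = io.StringIO(
    "Book Title,Price,Rating\n"
    "A Light in the Attic,51.77,Three\n"
    "Tipping the Velvet,53.74,One\n"
    "Sharp Objects,47.82,Four\n"
)

# Assumed mapping from the rating words to integers
RATING_TO_INT = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def load_books(source):
    """Load the exported CSV and return it with numeric Price and Rating."""
    books = pd.read_csv(source)
    # Coerce anything unparseable to NaN rather than raising
    books["Price"] = pd.to_numeric(books["Price"], errors="coerce")
    # Words outside the mapping become NaN
    books["Rating"] = books["Rating"].map(RATING_TO_INT)
    return books

books = load_books(sample_csv)
print(books.dtypes)
```

With the real export, `load_books("data.csv")` should behave the same way, since `pd.read_csv` accepts a file path as well as a file-like object.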