# Lab Assignment 5: Web Scraping
## DS 6001

### Instructions
Please answer the following questions as completely as possible using text, code, and the results of code as needed. Format your answers in a Jupyter notebook. To receive full credit, make sure you address every part of the problem, and make sure your document is formatted in a clean and professional way.


## Problem 0
Import the following packages:

In [1]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup


## Problem 1
Large language models are very impressive. They use deep neural networks with billions of parameters. They fine-tune results with clever approaches to reinforcement learning through human feedback. Many of them employ APIs with user interfaces that allow users to chat in natural language through a simple textbox, and receive a response in only a few seconds. And the societal impact of these models is undeniably enormous, to an extent we are only beginning to understand.

But, none of that is the most impressive part of LLMs. Like anything else in machine learning, the most impressive and difficult element of the work is the data collection.

The data used to train major LLMs is something that big companies communicate very little about. [OpenAI](https://help.openai.com/en/articles/7842364-how-chatgpt-and-our-foundation-models-are-developed) only says that GPT and other baseline models are trained on data that is "freely and openly accessible on the internet." Many commentators, such as Dr. Vered Shwartz, assistant professor in the University of British Columbia department of computer science, say that LLMs like GPT are trained on "[essentially the entire internet](https://science.ubc.ca/news/chatgpt-has-read-almost-whole-internet-hasnt-solved-its-diversity-issues)." Think about pulling the entire internet into a single model and tell me this isn't the most impressive thing about ChatGPT.

A primary training set for OpenAI's GPT and other major LLMs is the corpus compiled by [Common Crawl](https://commoncrawl.org/faq), a nonprofit organization that describes itself as "dedicated to providing a copy of the Internet to Internet researchers, companies and individuals at no cost for the purpose of research and analysis." According to the [Mozilla Foundation](https://www.mozillafoundation.org/en/blog/Mozilla-Report-How-Common-Crawl-Data-Infrastructure-Shaped-the-Battle-Royale-over-Generative-AI/), "[o]ver 80% of GPT-3 tokens (a representation unit of text data) stemmed from Common Crawl. Many models published by other developers likewise rely heavily on it: the study analyzed 47 LLMs published between 2019 and October 2023 that power text generators and found at least 64% of them were trained on Common Crawl."

Common Crawl is a massive web scraping endeavor. But it's not necessarily a sophisticated one. Mostly, Common Crawl is downloading the raw HTML from the webpages it scrapes and extracting text from the HTML. A task like this is exactly the kind of thing we can use `requests` and `BeautifulSoup` to do. 

Common Crawl is also at the center of many controversies and legal actions regarding generative AI, such as the New York Times' copyright infringement [lawsuit](https://www.npr.org/2025/03/26/nx-s1-5288157/new-york-times-openai-copyright-case-goes-forward) against OpenAI, and concerns about bias and racism in LLMs stemming from their training data.

For this problem, please examine the [Common Crawl website](https://commoncrawl.org/) and read "[Training Data for the Price of a Sandwich: Common Crawl’s Impact on Generative AI](https://assets.mofoprod.net/network/documents/Common_Crawl_Mozilla_Foundation_2024.pdf)" by Stefan Baack for the Mozilla Foundation.


### Part a
If you want access to the latest Common Crawl dataset, where can you get it? How many websites does the dataset contain? And outside of computational and storage costs, how much will it cost to obtain the data? (Use [Common Crawl's website](https://commoncrawl.org/) to answer this question.) [8 points]

### References

- https://commoncrawl.org/blog/september-2025-crawl-archive-now-available
- https://commoncrawl.org/get-started
- https://webdatacommons.org/structureddata/
- https://aws.amazon.com/s3/pricing/

#### 1. Where to get the latest Common Crawl dataset

- Common Crawl data is freely available to anyone. It’s hosted on AWS Public Datasets (in an S3 bucket: s3://commoncrawl/ in the US-East-1 region) and also accessible via HTTP(S) using https://data.commoncrawl.org/ for external downloads. 

- You can choose particular monthly “crawls” (they are released regularly). 

#### 2. How many websites/pages does the dataset contain

- This depends on how you count (pages, domains, hosts). Here are the stats:

    - The Common Crawl corpus spans over 15 years and now includes over 250 billion pages. 

    - They add 3–5 billion new pages per month. 

    - On the domain / host side: in some recent monthly crawls, the number of parsed HTML URLs is in the billions, and the number of distinct domains (pay-level / registered domains) is tens of millions. For example, in one of the October 2024 crawls, there were ~2.39 billion HTML URLs, and ~37.4 million domains. 

#### 3. What does it cost (outside of compute / storage)

- The data itself is free: you do not pay for the raw dataset. 

But there are other costs you need to think about:

- Data Transfer
    - Downloading directly from Common Crawl's public endpoints is not billed by Common Crawl. AWS data-transfer (egress) fees become relevant once you move copies out of your own AWS resources, for example from a bucket or instance in your account to a local machine or to another region, and the total can be significant if you are moving many terabytes. Common Crawl hosts the data in US-East-1, so processing it inside that region avoids cross-region transfer. Any such fee is charged by AWS, not by Common Crawl, and the exact rate depends on AWS’s current pricing tiers.

- Compute / Storage Costs
    - Even if the data is free, storing petabytes (or many terabytes) locally or on cloud storage costs money. Processing it (e.g. parsing, cleaning, extracting) requires compute resources (CPU / memory / possibly distributed compute) which costs money.

- Time / Software Costs
    - Depending on how you access and use the data, there may be costs in managing infrastructure, downloading, handling big files, etc.

#### 4. Estimate: If you wanted to obtain the whole dataset downloaded locally

Suppose the corpus is many petabytes (CC has “petabytes of data regularly collected since 2008”). 

- Storing even 1 PB locally (or on cloud) could cost thousands of dollars per month (or in upfront hardware costs).

- Transferring hundreds of terabytes out of AWS could cost roughly $0.05-$0.09 per GB, depending on region and volume, which adds up quickly. For example, 100 TB at $0.09/GB is $9,000 for transfer alone.

So while the data is “free,” realistically accessing large portions of it has non-trivial cost in network and storage.
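
The transfer arithmetic above can be checked with a few lines of Python. This is a rough sketch: the per-GB rates are the illustrative figures quoted in this answer, not current AWS pricing, and the helper name `transfer_cost_usd` is made up for this example.

```python
# Back-of-the-envelope transfer cost using the per-GB rates quoted above.
# The rates are illustrative assumptions, not current AWS pricing.
GB_PER_TB = 1000  # decimal terabytes, matching the arithmetic in the text


def transfer_cost_usd(terabytes, rate_per_gb):
    """Return the dollar cost of moving `terabytes` at `rate_per_gb` dollars per GB."""
    return terabytes * GB_PER_TB * rate_per_gb


for rate in (0.05, 0.09):
    cost = transfer_cost_usd(100, rate)
    print(f"100 TB at ${rate:.2f}/GB: ${cost:,.0f}")
```

Scaling the first argument shows how quickly the bill grows as the amount of data moved increases.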

### Part b
Is Common Crawl more useful for the pre-training or fine-tuning stage of the development of an LLM? Why is the Common Crawl corpus used by so many AI efforts, and why must it usually be altered or filtered in some way? (see pages 11-13 of Baack's article) [8 points]

#### Answer:

- Common Crawl is primarily used in the pre-training stage of large language models. It provides a massive, diverse, and freely available dataset that helps models learn general language patterns, knowledge, and context across a wide range of topics. However, the raw Common Crawl data is not used directly in pre-training because it contains harmful content, biases, misinformation, and strong overrepresentation of English content.

    - To make it suitable for training, developers must filter and curate the data. This involves removing toxic or misleading material, balancing representation across languages and cultures, and improving data quality through metadata enrichment. These steps are essential to ensure that the resulting models are reliable, safe, and effective.


- Common Crawl is not typically used for fine-tuning.

    - Fine-tuning usually relies on smaller, carefully curated datasets that are high-quality, task-specific, and often labeled—for example, datasets for summarization, translation, or instruction-following. Common Crawl, being massive and noisy, is too unfiltered and general for fine-tuning, which requires precise, clean, and often human-reviewed examples.

- Common Crawl is widely used in AI development for three reasons:

    - Scale and diversity: it covers a vast array of web content, giving large language models (LLMs) a broad linguistic and topical foundation.

    - Cost: it is freely accessible, which significantly reduces the financial barriers for researchers and developers.

    - Openness: anyone can use and build upon the data, which fosters innovation and collaboration within the AI community.

- However, its raw form is not suitable for direct use in training due to several inherent issues:

    - The corpus includes hate speech, misinformation, and other harmful material that can lead to undesirable outputs in models.

    - The automated crawling process prioritizes frequently linked pages, which may marginalize certain communities and perspectives.

    - There's a strong bias toward English content, with underrepresentation of other languages and regional viewpoints.

- To make the Common Crawl data usable, AI developers typically apply extensive filtering and curation: removing toxic or misleading content, ensuring diverse linguistic and cultural representation, and improving data quality through metadata enrichment.

### Part c
Given how Common Crawl decides which URLs to crawl, what are three reasons why the data cannot be said to be the complete internet or a representative sample of it? (see pages 17-22 of Baack's article) [8 points]

#### Answer:

- Many major websites, including news outlets, social media platforms, and commercial services, explicitly block Common Crawl’s bots using robots.txt files or other technical barriers. As a result, entire categories of influential content, such as Facebook, Instagram, and The New York Times, are systematically excluded. As Baack notes, this creates a structural blind spot in the dataset.

- Common Crawl disproportionately captures English-language content and domains optimized for search engines. This leads to an overrepresentation of commercial, Western-centric websites and underrepresentation of non-English, regional, or marginalized voices. Baack emphasizes that this linguistic and geographic skew limits the diversity of perspectives in the dataset.

- Common Crawl uses a link-following strategy, starting from seed URLs and expanding through hyperlinks. This method favors well-connected, high-traffic websites while neglecting isolated or less-linked pages. As Baack explains, this approach amplifies the visibility of already dominant domains and suppresses niche or emerging content, making the dataset structurally biased toward the mainstream.
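
The effect of link-following on coverage can be illustrated with a toy example. The sketch below is not Common Crawl's crawler. It is a plain breadth-first traversal over a made-up link graph (`toy_web` and its `.example` hosts are hypothetical), and it shows that a page nobody links to is never discovered from the seeds.

```python
from collections import deque

# Hypothetical toy web: each page maps to the pages it links to.
toy_web = {
    "hub.example": ["news.example", "shop.example"],
    "news.example": ["hub.example", "blog.example"],
    "shop.example": ["hub.example"],
    "blog.example": [],
    "island.example": ["hub.example"],  # links out, but nothing links to it
}


def crawl(seeds, links):
    """Follow links breadth-first from the seed pages and return every page reached."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


reached = crawl(["hub.example"], toy_web)
missed = set(toy_web) - reached
print("reached:", sorted(reached))
print("missed:", sorted(missed))
```

Here `island.example` exists and even links to the hub, yet it stays outside the crawl because discovery depends on inbound links.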

### Part d
Of the suggestions that the Mozilla Foundation makes for the future development of Common Crawl, are there any that you especially agree with or disagree with, and why? (See pages 30-31 of Baack's article) [8 points]

#### Answer:

- One suggestion I strongly agree with is the call for greater transparency and documentation around Common Crawl’s data collection methods. As Baack notes, while the dataset is open, the crawling strategy—such as how seed URLs are chosen, which domains are excluded, and how filtering is applied—is not always clearly explained. Improving this transparency would help researchers better understand the biases baked into the dataset and make more informed decisions when using it to train AI models.

- I also support the recommendation to diversify the sources and languages included in future crawls. Common Crawl currently overrepresents English-language and commercial domains, which can reinforce cultural and linguistic biases in downstream models. By intentionally including more regional, non-English, and marginalized content, the dataset could become more representative of global internet usage and reduce the risk of exclusionary AI systems.

- One suggestion I’m more cautious about is the idea of curating or filtering content more aggressively at the source level. While this could reduce harmful or low-quality data, it raises concerns about who decides what gets filtered and why. Over-curation could unintentionally suppress dissenting voices or niche communities. I believe filtering is important, but it should be done transparently and with input from diverse stakeholders.


## Problem 2
For the following problems, you will be scraping http://books.toscrape.com/. This website is a fake book retailer, designed to mimic the design of many retail websites. It exists solely to help students practice web-scraping, so there aren’t going to be any ethical concerns with this particular exercise, and there shouldn’t be any issues with rate limits or other gates that could prevent web-scraping. Take a moment and look at this website, so that you know what you will be working with.

Your goal is to generate a dataframe with four columns: one for the title, one for the price, one for the star-rating, and one for the book cover JPEG’s URL. The dataframe will also have 1000 rows, one for each of the 1000 books listed on the 50 pages of this website.

### Part a
Pull the HTML code from http://books.toscrape.com/. Make sure you provide the correct user agent string. Then parse this HTML code and save the parsed code as a separate Python variable. [8 points]

In [None]:
# Define the URL to scrape

url = 'https://books.toscrape.com/'
 
# Define the useragent & headers for the HTTP request

useragent = f'tm/0.0 (xdy6sg@virginia.edu) python-requests/{requests.__version__}'
headers = {'User-Agent': useragent, 'From': 'xdy6sg@virginia.edu'}


In [None]:
# Make the GET request to the website 

r = requests.get(url, headers=headers)
r

<Response [200]>

In [None]:
# Parse the HTML content using BeautifulSoup and store it in a variable

mysoup = BeautifulSoup(r.text, 'html.parser')
mysoup

<!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="s

### Part b
Extract all 20 of the book titles and save them in a list. [8 points]

In [6]:
# Find all the anchor tags with a title attribute

ataglist = mysoup.find_all('a', title=True)
ataglist

[<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>,
 <a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a>,
 <a href="catalogue/soumission_998/index.html" title="Soumission">Soumission</a>,
 <a href="catalogue/sharp-objects_997/index.html" title="Sharp Objects">Sharp Objects</a>,
 <a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html" title="Sapiens: A Brief History of Humankind">Sapiens: A Brief History ...</a>,
 <a href="catalogue/the-requiem-red_995/index.html" title="The Requiem Red">The Requiem Red</a>,
 <a href="catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html" title="The Dirty Little Secrets of Getting Your Dream Job">The Dirty Little Secrets ...</a>,
 <a href="catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html" title="The Coming Woman: A Novel Based on the Life of the 

In [7]:
# Print the title attribute of the first anchor tag in the list

ataglist[0]['title']

'A Light in the Attic'

In [13]:
# initiate the list to store the titles

book_titles = []

# Loop through the first 20 anchor tags and extract the title attribute

for a in ataglist[:20]:
    book_titles.append(a['title'])

book_titles

['A Light in the Attic',
 'Tipping the Velvet',
 'Soumission',
 'Sharp Objects',
 'Sapiens: A Brief History of Humankind',
 'The Requiem Red',
 'The Dirty Little Secrets of Getting Your Dream Job',
 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull',
 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics',
 'The Black Maria',
 'Starving Hearts (Triangular Trade Trilogy, #1)',
 "Shakespeare's Sonnets",
 'Set Me Free',
 "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)",
 'Rip it Up and Start Again',
 'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991',
 'Olio',
 'Mesaerion: The Best Science Fiction Stories 1800-1849',
 'Libertarianism for Beginners',
 "It's Only the Himalayas"]

### Part c
Extract the price of each of the 20 books and save these prices in a list. (The prices are listed in British pounds, and include the £ symbol. Remove the £ symbols: if you’ve saved the prices in a list named `prices`, then the following code should work: `prices = [s.replace('Â£', '') for s in prices]`.) [8 points]

In [14]:
# Find all the paragraph tags with class 'price_color'

ptaglist = mysoup.find_all('p', class_='price_color')
ptaglist

[<p class="price_color">Â£51.77</p>,
 <p class="price_color">Â£53.74</p>,
 <p class="price_color">Â£50.10</p>,
 <p class="price_color">Â£47.82</p>,
 <p class="price_color">Â£54.23</p>,
 <p class="price_color">Â£22.65</p>,
 <p class="price_color">Â£33.34</p>,
 <p class="price_color">Â£17.93</p>,
 <p class="price_color">Â£22.60</p>,
 <p class="price_color">Â£52.15</p>,
 <p class="price_color">Â£13.99</p>,
 <p class="price_color">Â£20.66</p>,
 <p class="price_color">Â£17.46</p>,
 <p class="price_color">Â£52.29</p>,
 <p class="price_color">Â£35.02</p>,
 <p class="price_color">Â£57.25</p>,
 <p class="price_color">Â£23.88</p>,
 <p class="price_color">Â£37.59</p>,
 <p class="price_color">Â£51.33</p>,
 <p class="price_color">Â£45.17</p>]

In [15]:
# Print the text of the first price tag in the list

ptaglist[0].text

'Â£51.77'

In [27]:
# List to store prices
book_prices = []

# Loop through the first 20 price tags and get their text
for p in ptaglist[:20]:
    book_prices.append(p.get_text())

# Remove the £ symbols (and any extra encoding issues)
book_prices = [s.replace('Â£', '').replace('£', '') for s in book_prices]

# output the cleaned prices
book_prices

['51.77',
 '53.74',
 '50.10',
 '47.82',
 '54.23',
 '22.65',
 '33.34',
 '17.93',
 '22.60',
 '52.15',
 '13.99',
 '20.66',
 '17.46',
 '52.29',
 '35.02',
 '57.25',
 '23.88',
 '37.59',
 '51.33',
 '45.17']

In [28]:
# Create a dictionary of title: price
book_dict = dict(zip(book_titles, book_prices))

# Print the dictionary
for title, price in book_dict.items():
    print(f"{title}: {price}")

A Light in the Attic: 51.77
Tipping the Velvet: 53.74
Soumission: 50.10
Sharp Objects: 47.82
Sapiens: A Brief History of Humankind: 54.23
The Requiem Red: 22.65
The Dirty Little Secrets of Getting Your Dream Job: 33.34
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull: 17.93
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics: 22.60
The Black Maria: 52.15
Starving Hearts (Triangular Trade Trilogy, #1): 13.99
Shakespeare's Sonnets: 20.66
Set Me Free: 17.46
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1): 52.29
Rip it Up and Start Again: 35.02
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991: 57.25
Olio: 23.88
Mesaerion: The Best Science Fiction Stories 1800-1849: 37.59
Libertarianism for Beginners: 51.33
It's Only the Himalayas: 45.17
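
The stray `Â` in front of each `£` is consistent with UTF-8 bytes being decoded as Latin-1 (ISO-8859-1). The sketch below reproduces that effect on a single price string, with no network access. Whether this is what happens to `r.text` for this site is an assumption about how the server declares its encoding. If it is, parsing `r.content` or setting `r.encoding = 'utf-8'` before reading `r.text` should avoid the extra character.

```python
# The pound sign occupies two bytes in UTF-8 ...
raw = "£51.77".encode("utf-8")

# ... and decoding those bytes as Latin-1 turns each byte into its own character
garbled = raw.decode("latin-1")
print(garbled)

# Decoding with the matching codec recovers the original text
clean = raw.decode("utf-8")
print(clean)

# Either form can be cleaned and converted to a number
price = float(clean.replace("£", ""))
price_from_garbled = float(garbled.replace("Â£", ""))
```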


### Part d
Extract the star level ratings for the 20 books. [Hint: for tags such as `<p class="star-rating One">` in which the class has a space, the class is actually a list in which the first item in the list is `"star-rating"` and the second item in the list is `"One"`. It's possible to search on either item in this list.] [8 points]

In [None]:
# Find all the paragraph tags with class 'star-rating'

rating_tags = mysoup.find_all('p', class_='star-rating')
rating_tags

[<p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>,
 <p class="star-rating One">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>,
 <p class="star-rating One">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>,
 <p class="star-rating Four">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>,
 <p class="star-rating Five">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>,
 <p class="star-rating One">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i 

In [24]:
# Print the class list of the first rating tag

rating_tags[0]['class']  # returns a list; the rating word is the second item


['star-rating', 'Three']

In [25]:
# initiate the list to store the star ratings

book_ratings = []

# Loop through the first 20 rating tags and extract the second class (which indicates the rating)

for tag in rating_tags[:20]:
    book_ratings.append(tag['class'][1])

book_ratings

['Three',
 'One',
 'One',
 'Four',
 'Five',
 'One',
 'Four',
 'Three',
 'Four',
 'One',
 'Two',
 'Four',
 'Five',
 'Five',
 'Five',
 'Three',
 'One',
 'One',
 'Two',
 'Two']
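
If the ratings are later needed as numbers (for sorting or modeling), the words can be mapped to integers. This is a minimal sketch: the names `rating_words` and `sample_ratings` are introduced here for illustration, and the sample holds the first five values from the output above.

```python
# Map each rating word to its numeric star count
rating_words = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

# First five ratings from the scraped list above
sample_ratings = ["Three", "One", "One", "Four", "Five"]

numeric_ratings = [rating_words[word] for word in sample_ratings]
print(numeric_ratings)
```

An unexpected class value would raise a `KeyError` here, which flags a page layout that differs from the one assumed.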

In [29]:
# Combine title, price, and rating into a single dictionary
book_dict = {}

for title, price, rating in zip(book_titles, book_prices, book_ratings):
    book_dict[title] = {
        'Price': price,
        'Rating': rating
    }

# Print the updated dictionary
for title, info in book_dict.items():
    print(f"{title}: Price = {info['Price']}, Rating = {info['Rating']}")

A Light in the Attic: Price = 51.77, Rating = Three
Tipping the Velvet: Price = 53.74, Rating = One
Soumission: Price = 50.10, Rating = One
Sharp Objects: Price = 47.82, Rating = Four
Sapiens: A Brief History of Humankind: Price = 54.23, Rating = Five
The Requiem Red: Price = 22.65, Rating = One
The Dirty Little Secrets of Getting Your Dream Job: Price = 33.34, Rating = Four
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull: Price = 17.93, Rating = Three
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics: Price = 22.60, Rating = Four
The Black Maria: Price = 52.15, Rating = One
Starving Hearts (Triangular Trade Trilogy, #1): Price = 13.99, Rating = Two
Shakespeare's Sonnets: Price = 20.66, Rating = Four
Set Me Free: Price = 17.46, Rating = Five
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1): Price = 52.29, Rating = Five
Rip it Up and Start Again: Price = 35.02, Rating = Five
Our Band Could Be You

### Part e
Extract the URLs for the JPEG thumbnail images that show the covers of the 20 books. (Maybe we want to mine the images to build models that predict the star level, literally judging books by their covers.) [8 points]

In [30]:
# Find all the image tags

img_tags = mysoup.find_all('img')
img_tags

[<img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/>,
 <img alt="Tipping the Velvet" class="thumbnail" src="media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg"/>,
 <img alt="Soumission" class="thumbnail" src="media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg"/>,
 <img alt="Sharp Objects" class="thumbnail" src="media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg"/>,
 <img alt="Sapiens: A Brief History of Humankind" class="thumbnail" src="media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg"/>,
 <img alt="The Requiem Red" class="thumbnail" src="media/cache/68/33/68339b4c9bc034267e1da611ab3b34f8.jpg"/>,
 <img alt="The Dirty Little Secrets of Getting Your Dream Job" class="thumbnail" src="media/cache/92/27/92274a95b7c251fea59a2b8a78275ab4.jpg"/>,
 <img alt="The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull" class="thumbnail" src="media/cache/3d/54/3d54940e57e662c4dd1f3ff00c78cc6

In [31]:
# Print the 'src' attribute of the first image tag

img_tags[0]['src']

'media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg'

In [32]:
# initiate the list to store the image URLs

book_images = []

# Find all <img> tags with class 'thumbnail'
img_tags = mysoup.find_all('img', class_='thumbnail')

# Loop through the first 20 image tags and extract the full image URL

for img in img_tags[:20]:
    # Replace relative path with full URL
    full_url = "http://books.toscrape.com/" + img['src'].replace('../', '')
    book_images.append(full_url)

book_images

['http://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg',
 'http://books.toscrape.com/media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg',
 'http://books.toscrape.com/media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg',
 'http://books.toscrape.com/media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg',
 'http://books.toscrape.com/media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg',
 'http://books.toscrape.com/media/cache/68/33/68339b4c9bc034267e1da611ab3b34f8.jpg',
 'http://books.toscrape.com/media/cache/92/27/92274a95b7c251fea59a2b8a78275ab4.jpg',
 'http://books.toscrape.com/media/cache/3d/54/3d54940e57e662c4dd1f3ff00c78cc64.jpg',
 'http://books.toscrape.com/media/cache/66/88/66883b91f6804b2323c8369331cb7dd1.jpg',
 'http://books.toscrape.com/media/cache/58/46/5846057e28022268153beff6d352b06c.jpg',
 'http://books.toscrape.com/media/cache/be/f4/bef44da28c98f905a3ebec0b87be8530.jpg',
 'http://books.toscrape.com/media/cache/10/48/1048f63d3b5061cd2f4
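
An alternative to stripping `'../'` by hand is `urllib.parse.urljoin` from the standard library, which resolves a relative `src` against the URL of the page it came from. In the sketch below, the front-page `src` is the one shown above. The `../`-prefixed form for catalogue pages is an assumption suggested by the `.replace('../', '')` call, not something confirmed by the output shown here.

```python
from urllib.parse import urljoin

front_page = "http://books.toscrape.com/"
catalogue_page = "http://books.toscrape.com/catalogue/page-2.html"

# src as it appears on the front page
src_front = "media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"
# assumed form of the same src on a catalogue page
src_catalogue = "../" + src_front

url_from_front = urljoin(front_page, src_front)
url_from_catalogue = urljoin(catalogue_page, src_catalogue)

print(url_from_front)
print(url_from_catalogue)
```

Both calls resolve to the same absolute URL, so the same code works whichever page the tag was scraped from.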

### Part f
Create a dataframe with one row for each of the 20 books, and the book titles, prices, star ratings, and cover JPEG URLs as the four columns. [8 points]

In [33]:
# Create a dictionary with column names and corresponding lists

book_data = {
    'Title': book_titles,
    'Price': book_prices,
    'Rating': book_ratings,
    'ImageURL': book_images
}

book_data

{'Title': ['A Light in the Attic',
  'Tipping the Velvet',
  'Soumission',
  'Sharp Objects',
  'Sapiens: A Brief History of Humankind',
  'The Requiem Red',
  'The Dirty Little Secrets of Getting Your Dream Job',
  'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull',
  'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics',
  'The Black Maria',
  'Starving Hearts (Triangular Trade Trilogy, #1)',
  "Shakespeare's Sonnets",
  'Set Me Free',
  "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)",
  'Rip it Up and Start Again',
  'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991',
  'Olio',
  'Mesaerion: The Best Science Fiction Stories 1800-1849',
  'Libertarianism for Beginners',
  "It's Only the Himalayas"],
 'Price': ['51.77',
  '53.74',
  '50.10',
  '47.82',
  '54.23',
  '22.65',
  '33.34',
  '17.93',
  '22.60',
  '52.15',
  '13.99',
  '20.66',
  '17.46',
  '52.29',


In [34]:
# Convert the dictionary into a pandas DataFrame

book_df = pd.DataFrame(book_data)

book_df.head()

Unnamed: 0,Title,Price,Rating,ImageURL
0,A Light in the Attic,51.77,Three,http://books.toscrape.com/media/cache/2c/da/2c...
1,Tipping the Velvet,53.74,One,http://books.toscrape.com/media/cache/26/0c/26...
2,Soumission,50.1,One,http://books.toscrape.com/media/cache/3e/ef/3e...
3,Sharp Objects,47.82,Four,http://books.toscrape.com/media/cache/32/51/32...
4,Sapiens: A Brief History of Humankind,54.23,Five,http://books.toscrape.com/media/cache/be/a5/be...
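
The prices were scraped as strings, so the `Price` column holds text rather than numbers. A small sketch of the conversion, using a three-row stand-in frame (`sample_df` is a made-up name, with values copied from the first rows above) rather than the scraped `book_df`:

```python
import pandas as pd

# Stand-in for the scraped frame: prices are strings, as in the lists above
sample_df = pd.DataFrame({
    "Title": ["A Light in the Attic", "Tipping the Velvet", "Soumission"],
    "Price": ["51.77", "53.74", "50.10"],
})

# Convert the price strings to floats so they can be summed, averaged, or sorted
sample_df["Price"] = pd.to_numeric(sample_df["Price"])

print(sample_df.dtypes)
print(sample_df["Price"].mean())
```

The same `pd.to_numeric` call could be applied to `book_df['Price']` once the `£` symbols have been removed.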


### Part g
Create a function that takes the URL of the webpage to scrape as an input, applies the code you wrote for parts a through e, and generates the dataframe from part f as the output. [10 points]

In [35]:
# Define the function to scrape one listing page

def scrape_book_page(url, headers=None):
    """Scrape one books.toscrape.com listing page and return a DataFrame."""

    # Part a: Pull the HTML, building the default user agent if none is passed in
    if headers is None:
        useragent = f'tm/0.0 (xdy6sg@virginia.edu) python-requests/{requests.__version__}'
        headers = {'User-Agent': useragent, 'From': 'xdy6sg@virginia.edu'}

    r = requests.get(url, headers=headers)
    r.raise_for_status()  # stop early if the page did not load

    # Part a (continued): Parse the HTML with BeautifulSoup
    mysoup = BeautifulSoup(r.content, 'html.parser')

    # Part b: Extract book titles
    ataglist = mysoup.find_all('a', title=True)
    book_titles = [a['title'] for a in ataglist[:20]]

    # Part c: Extract prices and strip the pound sign
    ptaglist = mysoup.find_all('p', class_='price_color')
    book_prices = [p.get_text() for p in ptaglist[:20]]
    book_prices = [s.replace('Â£', '').replace('£', '') for s in book_prices]

    # Part d: Extract star ratings (the second class holds the rating word)
    rating_tags = mysoup.find_all('p', class_='star-rating')
    book_ratings = [tag['class'][1] for tag in rating_tags[:20]]

    # Part e: Extract cover image URLs from the thumbnail <img> tags
    img_tags = mysoup.find_all('img', class_='thumbnail')
    book_images = ["http://books.toscrape.com/" + img['src'].replace('../', '') for img in img_tags[:20]]

    # Part f: Create the DataFrame
    book_data = {
        'Title': book_titles,
        'Price': book_prices,
        'Rating': book_ratings,
        'ImageURL': book_images
    }
    book_df = pd.DataFrame(book_data)

    return book_df

In [36]:
scrape_book_page(url, headers)

Unnamed: 0,Title,Price,Rating,ImageURL
0,A Light in the Attic,51.77,Three,http://books.toscrape.com/media/cache/2c/da/2c...
1,Tipping the Velvet,53.74,One,http://books.toscrape.com/media/cache/26/0c/26...
2,Soumission,50.1,One,http://books.toscrape.com/media/cache/3e/ef/3e...
3,Sharp Objects,47.82,Four,http://books.toscrape.com/media/cache/32/51/32...
4,Sapiens: A Brief History of Humankind,54.23,Five,http://books.toscrape.com/media/cache/be/a5/be...
5,The Requiem Red,22.65,One,http://books.toscrape.com/media/cache/68/33/68...
6,The Dirty Little Secrets of Getting Your Dream...,33.34,Four,http://books.toscrape.com/media/cache/92/27/92...
7,The Coming Woman: A Novel Based on the Life of...,17.93,Three,http://books.toscrape.com/media/cache/3d/54/3d...
8,The Boys in the Boat: Nine Americans and Their...,22.6,Four,http://books.toscrape.com/media/cache/66/88/66...
9,The Black Maria,52.15,One,http://books.toscrape.com/media/cache/58/46/58...


### Part h
Notice that there are many pages to http://books.toscrape.com/. When you click on “Next” in the bottom-right corner of the screen, it takes you to http://books.toscrape.com/catalogue/page-2.html. The front page is the same as http://books.toscrape.com/catalogue/page-1.html, and there are 50 total pages.

Write a loop that uses the function you wrote in part g to scrape each of the 50 pages, and append each of these data frames together. If you write this loop correctly, your dataframe will have 1000 rows (20 books on each of the 50 pages). 

Some hints:

* Typing `new_df = pd.DataFrame()` with nothing in the parentheses will create an empty data frame on which new data can be appended.

* There are many loops you can use, but the most straightforward one is a for-values loop that counts from 1 to 50. In Python, you can initialize such a loop with `for i in range(1, 51):`, and then indent every line below it that belongs inside the loop. Inside the loop, the letter `i` is now a stand-in for the number currently being considered.

* You will need to figure out how to replace the number in URLs like http://books.toscrape.com/catalogue/page-2.html with the number currently under consideration in the loop. You might need the `str()` function, which turns numeric values into strings.

* `pd.concat()` is a function that appends dataframes together.

[10 points]
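
The URL hint above can be sketched without any scraping: build the 50 page URLs first and inspect them. The names `PAGE_TEMPLATE` and `page_urls` are introduced here for illustration.

```python
# Template with a placeholder where the page number goes
PAGE_TEMPLATE = "http://books.toscrape.com/catalogue/page-{}.html"

# range(1, 51) counts from 1 through 50
page_urls = [PAGE_TEMPLATE.format(i) for i in range(1, 51)]

print(len(page_urls))
print(page_urls[0])
print(page_urls[-1])
```

`str.format` inserts the integer directly, so no explicit `str()` call is needed in this variant.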

In [41]:
# Initialize the master DataFrame in the same cell as the loop, so that
# re-running this cell starts from empty instead of appending all 50 pages again

all_books_df = pd.DataFrame()

# Loop through pages 1 to 50

for i in range(1, 51):
    # Construct the URL for the current page
    page_url = "http://books.toscrape.com/catalogue/page-" + str(i) + ".html"
    
    # Scrape the page using the function from part g
    page_df = scrape_book_page(page_url, headers)
    
    # Append the page's DataFrame to the master DataFrame
    all_books_df = pd.concat([all_books_df, page_df], ignore_index=True)


all_books_df



Unnamed: 0,Title,Price,Rating,ImageURL
0,A Light in the Attic,51.77,Three,http://books.toscrape.com/media/cache/2c/da/2c...
1,Tipping the Velvet,53.74,One,http://books.toscrape.com/media/cache/26/0c/26...
2,Soumission,50.10,One,http://books.toscrape.com/media/cache/3e/ef/3e...
3,Sharp Objects,47.82,Four,http://books.toscrape.com/media/cache/32/51/32...
4,Sapiens: A Brief History of Humankind,54.23,Five,http://books.toscrape.com/media/cache/be/a5/be...
...,...,...,...,...
995,Alice in Wonderland (Alice's Adventures in Won...,55.53,One,http://books.toscrape.com/media/cache/96/ee/96...
996,"Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)",57.06,Four,http://books.toscrape.com/media/cache/09/7c/09...
997,A Spy's Devotion (The Regency Spies of London #1),16.97,Five,http://books.toscrape.com/media/cache/1b/5f/1b...
998,1st to Die (Women's Murder Club #1),53.98,One,http://books.toscrape.com/media/cache/2b/41/2b...


In [42]:
# Preview the final DataFrame

print(all_books_df.head())
print(f"Total books scraped: {len(all_books_df)}")

                                   Title  Price Rating  \
0                   A Light in the Attic  51.77  Three   
1                     Tipping the Velvet  53.74    One   
2                             Soumission  50.10    One   
3                          Sharp Objects  47.82   Four   
4  Sapiens: A Brief History of Humankind  54.23   Five   

                                            ImageURL  
0  http://books.toscrape.com/media/cache/2c/da/2c...  
1  http://books.toscrape.com/media/cache/26/0c/26...  
2  http://books.toscrape.com/media/cache/3e/ef/3e...  
3  http://books.toscrape.com/media/cache/32/51/32...  
4  http://books.toscrape.com/media/cache/be/a5/be...  
Total books scraped: 1000
