
This assignment will help you practice web scraping techniques by extracting structured data
from a live practice website. You will learn how to navigate HTML structures, extract relevant
information, and save it in a structured format for analysis.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urljoin

Q1. Write a Python program to scrape all available books from Books to Scrape
(https://books.toscrape.com/), a live site built for practicing scraping (safe, legal, no
anti-bot measures). For each book, extract the following details:
1. Title
2. Price
3. Availability (In stock / Out of stock)
4. Star Rating (One, Two, Three, Four, Five)

Store the scraped results into a Pandas DataFrame and export them to a CSV file named
books.csv.

(Note: Use the requests library to fetch each HTML page and BeautifulSoup to parse it and
extract the book details. Handle pagination so that books from all pages are scraped.)
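
The pagination step depends on turning the relative href of a "next" link into an absolute URL. A minimal sketch with `urljoin` from the standard library follows; the helper name `resolve_next` and the sample href values are assumptions for illustration, not scraped output.

```python
from urllib.parse import urljoin


def resolve_next(current_url, href):
    """Resolve a relative 'next' href against the URL of the page it was found on."""
    return urljoin(current_url, href)


# From inside the catalogue folder, a bare file name replaces the last path segment
print(resolve_next("https://books.toscrape.com/catalogue/page-1.html", "page-2.html"))

# From the site root, the href has to carry the folder itself
print(resolve_next("https://books.toscrape.com/", "catalogue/page-2.html"))
```

Both calls resolve to the same absolute URL, which is why joining against the current page URL (rather than a fixed base) keeps a pagination loop correct regardless of where it starts.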

In [None]:
response = requests.get("https://books.toscrape.com/")
html = response.text
print(response.status_code) #200 means OK
print(response.headers)

200
{'Date': 'Thu, 04 Sep 2025 16:19:47 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Last-Modified': 'Wed, 08 Feb 2023 21:02:32 GMT', 'ETag': 'W/"63e40de8-c85e"', 'Strict-Transport-Security': 'max-age=0; includeSubDomains; preload', 'Content-Encoding': 'br'}
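
The Content-Type header in this response is plain `text/html` with no charset. In that situation requests falls back to ISO-8859-1 when it builds `response.text`, which garbles UTF-8 characters such as the pound sign used in prices. The sketch below reproduces the effect with plain bytes and no network access; the sample price string is only an illustration.

```python
# UTF-8 bytes for a price string, as a UTF-8 encoded page would send them
raw = "£51.77".encode("utf-8")

# Decoding with the wrong codec turns each UTF-8 byte into its own character
wrong = raw.decode("iso-8859-1")
# Decoding with the right codec restores the original text
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

With requests, setting `response.encoding = "utf-8"` before reading `response.text`, or passing `response.content` to BeautifulSoup so it can detect the encoding itself, are two common ways to avoid this.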


In [None]:
def scrap_books(soup):
  # Each book on a listing page sits in an <article class="product_pod">
  books_scrap = soup.find_all("article", class_="product_pod")
  page_books = []
  for book in books_scrap:
    title = book.h3.a["title"]  # the full title is stored in the link's title attribute
    price = book.find("p", class_="price_color").text
    avail = book.find("p", class_="instock availability").text.strip()
    # The class list looks like ["star-rating", "Three"]; the second entry is the rating word
    rating = book.find("p", class_="star-rating")["class"][1]
    page_books.append({"Title": title, "Price": price, "Availability": avail, "Rating": rating})
    # print(book.prettify())  # uncomment to inspect the HTML and class names
  return page_books

In [None]:
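url = "https://books.toscrape.com/catalogue/page-1.html"
books = []
while url:
  response = requests.get(url)
  response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
  # The server sends no charset, so requests would decode the page as ISO-8859-1
  # and turn "£" into "Â£"; the page itself is UTF-8
  response.encoding = "utf-8"
  soup = BeautifulSoup(response.text, "html.parser")

  books.extend(scrap_books(soup))

  # The "next" link is relative, so resolve it against the current page URL
  next_button = soup.find("li", class_="next")
  if next_button:
    next_page = next_button.a["href"]
    url = urljoin(url, next_page)
  else:
    url = None  # no "next" link: last page reached

df = pd.DataFrame(books)
df.to_csv("books.csv", index=False)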

In [None]:
df.head()

                                   Title   Price  Availability  Rating
0                   A Light in the Attic  £51.77      In stock   Three
1                     Tipping the Velvet  £53.74      In stock     One
2                             Soumission  £50.10      In stock     One
3                          Sharp Objects  £47.82      In stock    Four
4  Sapiens: A Brief History of Humankind  £54.23      In stock    Five


In [None]:
df.shape

(1000, 4)
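
Every scraped field is a string, so numeric analysis needs a conversion step first. The sketch below shows one way to do it in plain Python; the helper names `clean_price` and `rating_to_int` are hypothetical, and the regular expression simply picks out the first number, so it also tolerates stray characters in front of the currency symbol.

```python
import re

# Star ratings are spelled out as words in the page's class names
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def clean_price(text):
    """Return the numeric part of a price string such as '£51.77' as a float."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())


def rating_to_int(word):
    """Map a rating word ('One' to 'Five') to its integer value."""
    return RATING_WORDS[word]


print(clean_price("£51.77"), rating_to_int("Three"))
```

On the DataFrame itself, the same helpers could be applied column-wise, for example `df["Price"].map(clean_price)` and `df["Rating"].map(rating_to_int)`.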

Q2. Write a Python program to scrape the IMDB Top 250 Movies list
(https://www.imdb.com/chart/top/). For each movie, extract the following details:
1. Rank (1–250)
2. Movie Title
3. Year of Release
4. IMDB Rating

Store the results in a Pandas DataFrame and export it to a CSV file named imdb_top250.csv.
(Note: Use Selenium/Playwright to scrape the required details from this website)

In [None]:
!pip install selenium



In [6]:
from selenium import webdriver
import time
import tempfile

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                     "AppleWebKit/537.36 (KHTML, like Gecko) "
                     "Chrome/117.0.0.0 Safari/537.36")

# Make a unique temp directory for user data
temp_profile = tempfile.mkdtemp()
options.add_argument(f"--user-data-dir={temp_profile}")
driver = webdriver.Chrome(options=options)

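url = "https://www.imdb.com/chart/top/"
try:
    driver.get(url)

    time.sleep(3)  # crude fixed wait so the JavaScript-rendered list can load

    html = driver.page_source
finally:
    driver.quit()  # always release the browser, even if loading fails
print(html[:1000])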

<html lang="en-US" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml" class=" scriptsOn"><head><script async="" src="https://images-na.ssl-images-amazon.com/images/I/216YVwoRFDL.js" crossorigin="anonymous"></script><meta charset="utf-8"><meta name="viewport" content="width=device-width"><script async="" defer="" src="https://launchpad.privacymanager.io/latest/launchpad.bundle.js"></script><script src="https://cdn.hadronid.net/hadron.js?url=https%3A%2F%2Fwww.imdb.com%2Fchart%2Ftop%2F&amp;ref=&amp;_it=amazon&amp;partner_id=745"></script><script>if(typeof uet === 'function'){ uet('bb', 'LoadTitle', {wb: 1}); }</script><title>IMDb Top 250 movies</title><meta name="description" content="As rated by regular IMDb voters." data-id="main"><meta name="google-site-verification" content="0cadf7898134e79b"><meta name="msvalidate.01" content="C1DACEF2769068C0B0D2687C9E5105FA"><meta name="robots" content="max-image-preview:large"><meta property="og:url" conte

In [7]:
soup = BeautifulSoup(html, "html.parser")

# Find all movie items
movies = soup.find_all("li", class_="ipc-metadata-list-summary-item")
print(len(movies))

250


In [8]:
movies = []
for m in soup.find_all("li", class_="ipc-metadata-list-summary-item"):
  # Heading text has the form "<rank>. <title>", e.g. "1. The Shawshank Redemption"
  heading = m.find("h3", class_="ipc-title__text ipc-title__text--reduced")
  if heading is None:
    continue  # skip list items without a title heading
  # Split only on the first ". " so titles that contain ". " stay intact
  rank, title = heading.get_text(strip=True).split(". ", 1)

  # The first metadata span holds the release year
  release = m.find("span", class_="cli-title-metadata-item")
  release = release.get_text(strip=True) if release else None
  rating = m.find("span", class_="ipc-rating-star--rating")
  rating = rating.get_text(strip=True) if rating else None
  movies.append({"Rank": rank, "Title": title, "Release": release, "Rating": rating})

print(movies)

df = pd.DataFrame(movies)
df.to_csv("imdb_top250.csv", index=False)

[{'Rank': '1', 'Title': 'The Shawshank Redemption', 'Release': '1994', 'Rating': '9.3'}, {'Rank': '2', 'Title': 'The Godfather', 'Release': '1972', 'Rating': '9.2'}, {'Rank': '3', 'Title': 'The Dark Knight', 'Release': '2008', 'Rating': '9.1'}, {'Rank': '4', 'Title': 'The Godfather Part II', 'Release': '1974', 'Rating': '9.0'}, {'Rank': '5', 'Title': '12 Angry Men', 'Release': '1957', 'Rating': '9.0'}, {'Rank': '6', 'Title': 'The Lord of the Rings: The Return of the King', 'Release': '2003', 'Rating': '9.0'}, {'Rank': '7', 'Title': "Schindler's List", 'Release': '1993', 'Rating': '9.0'}, {'Rank': '8', 'Title': 'Pulp Fiction', 'Release': '1994', 'Rating': '8.8'}, {'Rank': '9', 'Title': 'The Lord of the Rings: The Fellowship of the Ring', 'Release': '2001', 'Rating': '8.9'}, {'Rank': '10', 'Title': 'The Good, the Bad and the Ugly', 'Release': '1966', 'Rating': '8.8'}, {'Rank': '11', 'Title': 'Forrest Gump', 'Release': '1994', 'Rating': '8.8'}, {'Rank': '12', 'Title': 'The Lord of the Rin

In [9]:
df.head()

   Rank                     Title  Release  Rating
0     1  The Shawshank Redemption     1994     9.3
1     2             The Godfather     1972     9.2
2     3           The Dark Knight     2008     9.1
3     4     The Godfather Part II     1974     9.0
4     5              12 Angry Men     1957     9.0


In [10]:
df.shape

(250, 4)
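
The scraped rank, year and rating are all strings, and the rank arrives fused with the title in a heading such as "1. The Shawshank Redemption". A minimal sketch of splitting and typing those values is shown below; the function names `parse_heading` and `to_typed_record` are hypothetical, and the second sample heading uses an arbitrary rank number purely to show a title that itself contains ". ".

```python
import re

# "<rank>. <title>": digits, a dot, whitespace, then the rest of the line
HEADING_RE = re.compile(r"^(\d+)\.\s+(.*)$")


def parse_heading(text):
    """Split a heading like '1. The Shawshank Redemption' into (rank, title)."""
    match = HEADING_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unexpected heading format: {text!r}")
    return int(match.group(1)), match.group(2)


def to_typed_record(heading, release, rating):
    """Build a record with numeric rank, year and rating from scraped strings."""
    rank, title = parse_heading(heading)
    return {"Rank": rank, "Title": title, "Release": int(release), "Rating": float(rating)}


print(to_typed_record("1. The Shawshank Redemption", "1994", "9.3"))
print(parse_heading("99. Dr. Strangelove"))  # rank 99 is a made-up value for illustration
```

Numeric columns make later steps such as sorting by rating or filtering by decade straightforward.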

Q3. Write a Python program to scrape the weather information for top world cities from the
given website (https://www.timeanddate.com/weather/). For each city, extract the following
details:
1. City Name
2. Temperature
3. Weather Condition (e.g., Clear, Cloudy, Rainy, etc.)

Store the results in a Pandas DataFrame and export it to a CSV file named weather.csv.

In [19]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.timeanddate.com/weather"
headers = {"User-Agent": "Mozilla/5.0"}
res = requests.get(url, headers=headers)

soup = BeautifulSoup(res.text, "html.parser")

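rows = []
for row in soup.find_all("tr"):
    cols = row.find_all("td")
    # A table row may list several cities side by side, each taking four cells
    # (city, local time, weather icon, temperature), so walk the cells in groups of four
    for i in range(0, len(cols) - 3, 4):
        city = cols[i].get_text(strip=True)

        # weather condition comes from the icon's alt text
        img = cols[i + 2].find("img")
        condition = img.get("alt") if img else None

        temp = cols[i + 3].get_text(strip=True)

        rows.append({"City": city, "Temperature": temp, "Condition": condition})

# Sort by city so the output order does not depend on the table layout
df = pd.DataFrame(rows).sort_values("City").reset_index(drop=True)
df.to_csv("weather.csv", index=False)

print(df.head(10))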


          City Temperature                      Condition
0        Accra       77 °F            Partly sunny. Warm.
1  Addis Ababa       63 °F          Passing clouds. Mild.
2     Adelaide       50 °F                    Quite cool.
3      Algiers       79 °F          Passing clouds. Warm.
4       Almaty       55 °F      Clear. Refreshingly cool.
5        Amman       77 °F                   Clear. Warm.
6   Amsterdam*       66 °F          Passing clouds. Mild.
7       Anadyr       41 °F  Scattered clouds. Quite cool.
8   Anchorage*       61 °F           Mostly cloudy. Cool.
9       Ankara       77 °F                   Clear. Warm.
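
The Temperature column holds strings such as "77 °F", and some city names carry a trailing asterisk, which appears to be a footnote marker from the site. A small sketch of cleaning both follows; the helper names are hypothetical, and the Celsius conversion is an optional extra rather than something the assignment asks for.

```python
import re


def parse_temperature(text):
    """Extract the number from a string like '77 °F'; return None if there is none."""
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


def fahrenheit_to_celsius(degrees_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (degrees_f - 32) * 5 / 9


def clean_city(name):
    """Drop a trailing footnote asterisk and surrounding whitespace from a city name."""
    return name.rstrip("*").strip()


temp_f = parse_temperature("77 °F")
print(clean_city("Amsterdam*"), temp_f, round(fahrenheit_to_celsius(temp_f), 1))
```

The regular expression ignores whatever separates the number from the unit, so it works whether that character is a regular or a non-breaking space.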
