This assignment will help you practice web scraping techniques by extracting structured data
from a live practice website. You will learn how to navigate HTML structures, extract relevant
information, and save it in a structured format for analysis.
Q1. Write a Python program to scrape all available books from Books to Scrape
(https://books.toscrape.com/), a live site built for practicing scraping (safe, legal, no anti-bot
measures). For each book, extract the following details:
1. Title
2. Price
3. Availability (In stock / Out of stock)
4. Star Rating (One, Two, Three, Four, Five)
Store the scraped results into a Pandas DataFrame and export them to a CSV file named
books.csv.
(Note: Use the requests library to fetch each HTML page. Use BeautifulSoup to parse and extract
book details, and handle pagination so that books from all pages are scraped.)


In [None]:
%pip install requests beautifulsoup4 pandas



In [None]:
import requests
from bs4 import BeautifulSoup

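def scrape_book_details(url):
    """Scrape title, price, availability and star rating for every book on one page."""
    response = requests.get(url, timeout=10)
    # Passing the raw bytes lets BeautifulSoup detect the page encoding itself
    soup = BeautifulSoup(response.content, 'html.parser')
    books = []
    # Each book on a listing page is an <article class="product_pod">
    articles = soup.find_all('article', class_='product_pod')
    for article in articles:
        # The full title is stored in the link's 'title' attribute
        title = article.h3.a['title']
        price = article.find('p', class_='price_color').text
        # Match on the 'availability' class alone rather than the exact string
        # 'instock availability', so the lookup does not depend on the stock state
        availability = article.find('p', class_='availability').text.strip()
        # The class list looks like ['star-rating', 'Three']; the second item is the rating
        star_rating = article.find('p', class_='star-rating')['class'][1]
        books.append({
            'Title': title,
            'Price': price,
            'Availability': availability,
            'Star Rating': star_rating
        })
    return books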

In [None]:
from urllib.parse import urljoin

base_url = 'https://books.toscrape.com/'
all_books = []
url = base_url

while url:
    print(f"Scraping {url}...")
    all_books.extend(scrape_book_details(url))
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.content, 'html.parser')
    next_button = soup.find('li', class_='next')
    if next_button:
        # The href is relative to the current page ('catalogue/page-2.html' on the
        # first page, 'page-3.html' afterwards), so resolve it against url, not base_url
        url = urljoin(url, next_button.a['href'])
    else:
        url = None

print("Finished scraping all pages.")

Scraping https://books.toscrape.com/...
Scraping https://books.toscrape.com/catalogue/page-2.html...
Scraping https://books.toscrape.com/catalogue/page-3.html...
...
Finished scraping all pages.
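
The "next" link on each page is a relative href, so it has to be resolved against the URL of the page it was found on rather than the site root. The sketch below illustrates this with the standard library's urljoin. The href strings are hard-coded stand-ins for what the site returns (an assumption, so the snippet runs without network access).

```python
from urllib.parse import urljoin

# Hard-coded stand-ins for the relative "next" hrefs found on the first two pages
first_page = "https://books.toscrape.com/"
second_page = urljoin(first_page, "catalogue/page-2.html")
third_page = urljoin(second_page, "page-3.html")

print(second_page)
print(third_page)

# Plain concatenation with the site root loses the catalogue/ directory
naive_third_page = first_page + "page-3.html"
print(naive_third_page)
```

Only the first page lives at the site root. Every later href is relative to the catalogue/ directory, which is why resolving against the current URL matters from page 2 onward.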


In [None]:
import pandas as pd

books_df = pd.DataFrame(all_books)
display(books_df.head())

                                   Title   Price  Availability  Star Rating
0                   A Light in the Attic  £51.77      In stock        Three
1                     Tipping the Velvet  £53.74      In stock          One
2                             Soumission  £50.10      In stock          One
3                          Sharp Objects  £47.82      In stock         Four
4  Sapiens: A Brief History of Humankind  £54.23      In stock         Five
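
The Price and Star Rating columns are scraped as text ('£51.77', 'Three'), which is awkward for later analysis. Below is a minimal sketch of converting them to numbers using only the standard library. The helper names parse_price and parse_rating are hypothetical and not part of the original notebook.

```python
import re

# Rating words as they appear in the star-rating class, mapped to integers
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_price(price_text):
    """Return the numeric part of a price string such as '£51.77' as a float."""
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    if match is None:
        raise ValueError(f"No number found in {price_text!r}")
    return float(match.group())

def parse_rating(rating_word):
    """Return the integer rating (1 to 5) for a word such as 'Three'."""
    return RATING_WORDS[rating_word]

# Sample row copied from the DataFrame preview
sample = {"Title": "A Light in the Attic", "Price": "£51.77", "Star Rating": "Three"}
print(parse_price(sample["Price"]), parse_rating(sample["Star Rating"]))
```

With pandas, the same helpers could be applied column-wise, for example books_df['Price'].map(parse_price), before exporting to CSV.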


In [None]:
books_df.to_csv('books.csv', index=False)
print("Data exported to books.csv")

Data exported to books.csv


Q2. Write a Python program to scrape the IMDB Top 250 Movies list
(https://www.imdb.com/chart/top/). For each movie, extract the following details:
1. Rank (1–250)
2. Movie Title
3. Year of Release
4. IMDB Rating
Store the results in a Pandas DataFrame and export it to a CSV file named imdb_top250.csv.
(Note: Use Selenium/Playwright to scrape the required details from this website)


In [24]:
!pip install selenium pandas webdriver-manager --quiet

In [25]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import pandas as pd
import time


In [26]:
chrome_options = Options()
chrome_options.add_argument('--headless')  # Run in headless mode
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

In [27]:
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=chrome_options)

In [28]:
url = "https://www.imdb.com/chart/top/"
driver.get(url)
time.sleep(3)  # Wait for page to load
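
A fixed sleep is a guess: it is either longer than needed or too short on a slow connection. Selenium provides WebDriverWait for polling until a condition holds. The underlying idea can be sketched in plain Python as below. The helper wait_until and the fake_find_rows stand-in are hypothetical illustrations, not Selenium APIs.

```python
import time

def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(interval)

# Stand-in for a page whose rows only "appear" on the third poll
attempts = {"count": 0}

def fake_find_rows():
    attempts["count"] += 1
    return ["row"] * 250 if attempts["count"] >= 3 else []

rows = wait_until(fake_find_rows, timeout=5.0, interval=0.01)
print(len(rows), "rows after", attempts["count"], "polls")
```

In the notebook, the condition would be a lambda that calls driver.find_elements(...) and returns the (possibly empty) list, so scraping starts as soon as the rows exist.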

In [29]:
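# NOTE: these selectors target a table-based chart markup (tbody.lister-list);
# if the page layout differs, movies_table will be empty and they must be updated
movies_table = driver.find_elements(By.XPATH, '//tbody[@class="lister-list"]/tr')

ranks = []
titles = []
years = []
ratings = []

for index, row in enumerate(movies_table, start=1):
    # Rows are listed in rank order, so the 1-based position is the rank
    title_column = row.find_element(By.CLASS_NAME, 'titleColumn')
    title = title_column.find_element(By.TAG_NAME, 'a').text
    # The year is rendered as '(1994)'; drop the parentheses
    year = title_column.find_element(By.CLASS_NAME, 'secondaryInfo').text.strip("()")
    rating = row.find_element(By.CLASS_NAME, 'imdbRating').text.strip()

    ranks.append(index)
    titles.append(title)
    years.append(int(year))
    ratings.append(float(rating))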

In [30]:
driver.quit()

In [31]:
df = pd.DataFrame({
    'Rank': ranks,
    'Movie Title': titles,
    'Year of Release': years,
    'IMDB Rating': ratings
})

df.to_csv('imdb_top250.csv', index=False)

In [32]:
print("✅ Scraping completed. File saved as imdb_top250.csv")

# Create a download link (works only in Google Colab)
from google.colab import files
files.download('imdb_top250.csv')

✅ Scraping completed. File saved as imdb_top250.csv



Q3. Write a Python program to scrape the weather information for top world cities from the
given website (https://www.timeanddate.com/weather/). For each city, extract the following
details:
1. City Name
2. Temperature
3. Weather Condition (e.g., Clear, Cloudy, Rainy, etc.)
Store the results in a Pandas DataFrame and export it to a CSV file named weather.csv.

In [33]:
!pip install selenium pandas webdriver-manager --quiet


In [34]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import time


In [35]:
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

In [36]:
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=chrome_options)

In [37]:
url = "https://www.timeanddate.com/weather/"
driver.get(url)
time.sleep(3)

In [38]:
cities_data = []

In [39]:
cities_table = driver.find_elements(By.CSS_SELECTOR, "table.zebra.tb-theme.tb-hover tbody tr")

# Collect names and links first: navigating away invalidates the row elements,
# so reusing them after driver.get() raises StaleElementReferenceException
city_links = []
for row in cities_table:
    try:
        city_element = row.find_element(By.CSS_SELECTOR, 'td a')
        city_links.append((city_element.text, city_element.get_attribute("href")))
    except Exception as e:
        print(f"Skipping a row due to error: {e}")

for city_name, city_link in city_links:
    try:
        # Navigate to each city page
        driver.get(city_link)
        time.sleep(2)

        # Scrape temperature and weather condition
        temp = driver.find_element(By.ID, "wt-tp").text.strip()
        condition = driver.find_element(By.ID, "wt-cc").text.strip()

        # Append to list
        cities_data.append({
            "City Name": city_name,
            "Temperature": temp,
            "Weather Condition": condition
        })
    except Exception as e:
        print(f"Skipping {city_name} due to error: {e}")

# Close the browser
driver.quit()

# Save to a DataFrame and export to CSV
df = pd.DataFrame(cities_data)
df.to_csv("weather.csv", index=False)

# Download the CSV (works only in Google Colab)
from google.colab import files
files.download("weather.csv")

print("✅ Scraping completed and saved to weather.csv")

✅ Scraping completed and saved to weather.csv
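
The Temperature column is stored as scraped text. If a numeric value is needed, one option is to split the number from the unit, as in the sketch below. The sample formats ('23 °C', '75 °F', possibly with a non-breaking space) are assumptions about what the page returns, and parse_temperature and to_celsius are hypothetical helpers.

```python
import re

def parse_temperature(temp_text):
    """Split a string such as '23 °C' into (23.0, 'C'); return (None, None) if no match."""
    # \s also matches the non-breaking space (U+00A0) that may precede the unit
    match = re.search(r"(-?\d+(?:\.\d+)?)\s*°\s*([CF])", temp_text)
    if match is None:
        return None, None
    return float(match.group(1)), match.group(2)

def to_celsius(value, unit):
    """Convert a temperature to Celsius; Celsius values are returned unchanged."""
    return value if unit == "C" else (value - 32) * 5 / 9

for text in ["23 °C", "75\u00a0°F", "N/A"]:
    print(repr(text), "->", parse_temperature(text))
```

Returning (None, None) for unparseable text keeps a single odd value from aborting the whole conversion. Such rows can then be inspected separately.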
