#  IMDb Top 100 Movies Web Scraper (Selenium & Python)
##  Introduction

This Jupyter Notebook scrapes IMDb's Top 100 Movies and extracts:
-  Movie Title
-  Release Year
-  IMDb Rating

It uses Selenium WebDriver, which automates web browsers to interact with dynamic websites like IMDb.

---

###  Why Selenium?
IMDb loads much of its content dynamically with JavaScript, so static approaches (such as `requests` with BeautifulSoup), which see only the initial HTML, are less effective.
Selenium **automates a real browser**, letting the page render before the required data is extracted.

## Install & Import Dependencies
Before running the script, ensure all required libraries are installed.

To install them, run:
```bash
pip install selenium pandas  # run this only if the packages are not already installed
```

In [None]:
# Import necessary libraries
from selenium import webdriver  # Automate browser interaction
from selenium.webdriver.common.by import By  # Locate elements on a webpage
from selenium.webdriver.chrome.options import Options  # Configure Chrome behavior
from selenium.webdriver.chrome.service import Service  # Manage ChromeDriver execution
import pandas as pd  # Handle extracted data
import time  # Introduce delays to allow page loading

Setup ChromeDriver & Open IMDb Page

## Setting Up WebDriver
To interact with IMDb, Selenium requires ChromeDriver, which controls Google Chrome.

### 📌 Steps:
1. Find your Chrome version:
   - Open Chrome and go to: `chrome://settings/help`
   - Note the version (e.g., Version 121.0.0.0)
   
2. Download the matching ChromeDriver:
   - Visit: [ChromeDriver Website](https://sites.google.com/chromium.org/driver/)
   - Download the version matching your Chrome.
   - Extract `chromedriver.exe` and copy its path.

3. Update the script: Replace `chrome_driver_path` with your actual path.
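
Step 2 depends on the browser and driver versions agreeing. As a rough illustration, the check below compares only the major version (the number before the first dot), which is the part that generally has to match. The function names and the sample version strings are made up for this sketch. Separately, Selenium 4.6 and later ship with Selenium Manager, which can usually locate or download a matching driver when `webdriver.Chrome()` is called without a `Service` path, so the manual download may be unnecessary on recent installs.

```python
def major_version(version: str) -> int:
    """Return the major component of a dotted version string, e.g. '121.0.0.0' -> 121."""
    return int(version.strip().split(".", 1)[0])


def same_major_version(chrome_version: str, driver_version: str) -> bool:
    """True when Chrome and ChromeDriver share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)


# Hypothetical version strings, as reported by chrome://settings/help
# and by running `chromedriver --version`
print(same_major_version("121.0.6167.85", "121.0.6167.184"))
print(same_major_version("121.0.6167.85", "120.0.6099.109"))
```

If the second kind of mismatch shows up, downloading the driver build for your Chrome version (or updating Chrome) is the usual fix.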

In [None]:
# Define the path to ChromeDriver (change this to your actual path)
chrome_driver_path = r"C:\Users\Acer\Documents\chromedriver-win64\chromedriver.exe"

# Set up Chrome WebDriver options
options = Options()  # Holds preferences for how Chrome behaves when controlled by Selenium

# Run in headless mode (no visible window); useful for automation and avoids UI pop-ups
options.add_argument('--headless')

# Disable the Blink "AutomationControlled" feature so the session looks less like a bot
options.add_argument('--disable-blink-features=AutomationControlled')

# Send a custom user agent that mimics a regular Chrome browser on Windows
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36")

# Initialize Chrome WebDriver
driver = webdriver.Chrome(service=Service(chrome_driver_path), options=options)

# Open IMDb Top 250 Movies page
url = 'https://www.imdb.com/chart/top/'
driver.get(url)

# Allow time for page to fully load
time.sleep(10)  # Adjust based on network speed
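
A fixed `time.sleep(10)` waits the full ten seconds even when the page is ready sooner, and it may still be too short on a slow connection. The sketch below shows the polling idea behind an explicit wait in plain Python; `wait_until`, its parameters, and the `page_ready` stand-in are hypothetical names used only for illustration. Selenium offers the same pattern through `WebDriverWait(driver, timeout).until(...)` in `selenium.webdriver.support.ui`, which is probably the better fit inside the notebook itself.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or the timeout elapses.

    Returns the truthy value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page that becomes ready on the third check.
# With a real driver the condition could be, for example:
#   lambda: driver.find_elements(By.XPATH, '//li[contains(@class,"ipc-metadata-list-summary-item")]')
checks = {"count": 0}


def page_ready():
    checks["count"] += 1
    return ["movie element"] if checks["count"] >= 3 else []


items = wait_until(page_ready, timeout=5, poll_interval=0.01)
print(items)
```

Because `find_elements` returns an empty list when nothing matches, it works directly as a condition: the wait ends as soon as at least one movie item is present.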

##  Extract Movie Titles, Years & Ratings
Once the IMDb page loads, we extract:
- Title: Movie name without ranking
- Year: Release year
- Rating: IMDb rating (out of 10)

In [None]:
# Locate all movie list items on the page.
# find_elements returns a list of WebElements matching the XPath; contains(@class, ...)
# matches the <li> elements that hold each movie's information.
movies = driver.find_elements(By.XPATH, '//li[contains(@class,"ipc-metadata-list-summary-item")]')

# Collect one dictionary per movie
movie_data = []

# Loop through the first 100 movies (change the slice if needed)
for movie in movies[:100]:
    try:
        # Title: the <h3> inside the title block, e.g. "1. The Shawshank Redemption"
        title_raw = movie.find_element(By.XPATH, './/div[contains(@class,"ipc-title")]/a/h3').text
        title = title_raw.split('. ', 1)[-1]  # Drop the ranking prefix ("1. ")

        # Release year: the first <span> in the metadata <div> (which also holds duration, etc.)
        year = movie.find_element(By.XPATH, './/div[contains(@class,"cli-title-metadata")]/span[contains(@class,"cli-title-metadata-item")][1]').text

        # IMDb rating: keep the leading number and ignore any extra text
        rating = movie.find_element(By.XPATH, './/span[contains(@class,"ipc-rating-star--rating")]').text.split()[0]

        # Store the extracted fields
        movie_data.append({'Title': title, 'Year': year, 'Rating': rating})

    except Exception as e:
        # A missing element or Selenium error skips this movie instead of crashing the script
        print(f"Skipping a movie due to error: {e}")
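
`split('. ', 1)[-1]` assumes every title starts with a ranking prefix. If the page ever returns a title without one, a name such as "Dr. Strangelove" would lose its first word. A slightly more defensive variant, sketched here with a hypothetical helper `strip_rank`, removes the prefix only when the text really begins with digits followed by a period:

```python
import re

# One or more digits, a period, then whitespace, anchored at the start of the string
RANK_PREFIX = re.compile(r"^\d+\.\s+")


def strip_rank(title_raw: str) -> str:
    """Remove a leading ranking prefix such as '1. ' if present; otherwise return the title as-is."""
    return RANK_PREFIX.sub("", title_raw.strip(), count=1)


print(strip_rank("1. The Shawshank Redemption"))  # The Shawshank Redemption
print(strip_rank("Dr. Strangelove"))              # Dr. Strangelove (unchanged)
print("Dr. Strangelove".split(". ", 1)[-1])       # Strangelove (what the plain split gives)
```

In the loop, `title = strip_rank(title_raw)` would be a drop-in replacement for the `split` line.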


## Data Cleaning
Now that you have scraped the IMDb Top 100 movies, clean the dataset to ensure its quality before further analysis.
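
The cleaning steps read the results from a CSV file, so the scraped `movie_data` list has to be written to disk first. A minimal sketch using only the standard library follows; the helper name `save_movies`, the sample rows, and the output file name are placeholders. In the notebook, `pd.DataFrame(movie_data).to_csv(path, index=False)` with the same path that is later passed to `pd.read_csv` would do the same job. This is also a reasonable point to call `driver.quit()` so the headless Chrome process does not keep running.

```python
import csv
import os
import tempfile


def save_movies(rows, path):
    """Write a list of {'Title', 'Year', 'Rating'} dictionaries to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["Title", "Year", "Rating"])
        writer.writeheader()
        writer.writerows(rows)


# Placeholder rows in the same shape as movie_data (illustrative, not scraped output)
sample_rows = [
    {"Title": "The Shawshank Redemption", "Year": "1994", "Rating": "9.3"},
    {"Title": "The Godfather", "Year": "1972", "Rating": "9.2"},
]

# Placeholder location; the notebook would use its own CSV path instead
out_path = os.path.join(tempfile.gettempdir(), "imdb_top_100_movies_example.csv")
save_movies(sample_rows, out_path)

# Read the file back to confirm what was written
with open(out_path, newline="", encoding="utf-8") as f:
    saved = list(csv.DictReader(f))
print(len(saved), saved[0]["Title"])
```

The sketch writes UTF-8, so reading the file back with `encoding='utf-8'` is the matching choice.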

## Step 1: Load IMDb Data & Check Initial Structure
Before cleaning, let's load the scraped data and inspect its structure.

In [None]:
# Import required libraries
import pandas as pd

csv_path = r'C:\Users\Acer\Documents\BDA3.3\Project 1- Web scraping\imdb_top_100_movies.csv'
try:
    df = pd.read_csv(csv_path, encoding='latin1')  # Use encoding='utf-8' if the file was saved as UTF-8
    print("✅ File successfully loaded!")
except Exception as e:
    print(f"❌ Error loading file: {e}")
    raise  # Stop here: the cells below need df

print("\n🔹 Dataset Overview:")
df.info()  # info() prints the summary itself and returns None

## Step 2: Trim Whitespace from Text Fields

To remove extra spaces that might cause inconsistencies:

In [None]:
df['Title'] = df['Title'].str.strip()
df['Year'] = df['Year'].astype(str).str.strip()
df['Rating'] = df['Rating'].astype(str).str.strip()

## Step 3: Convert Data Types

To ensure numerical values are properly formatted:

In [None]:
df['Year'] = pd.to_numeric(df['Year'], errors='coerce')  # Non-numeric years become NaN
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')  # Non-numeric ratings become NaN

# Drop rows where a conversion failed or a value is missing, then set the final dtypes
df = df.dropna()
df['Year'] = df['Year'].astype(int)
df['Rating'] = df['Rating'].astype(float)
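
To see what `errors='coerce'` followed by `dropna()` does, here is a small self-contained illustration on a made-up three-row DataFrame. The rows and the name `toy` are invented for the example and are not scraped output.

```python
import pandas as pd

# Made-up rows: the second one has a non-numeric Year, as a failed scrape might produce
toy = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Year": ["1994", "N/A", "1972"],
    "Rating": ["9.3", "9.0", "9.2"],
})

toy["Year"] = pd.to_numeric(toy["Year"], errors="coerce")      # "N/A" becomes NaN
toy["Rating"] = pd.to_numeric(toy["Rating"], errors="coerce")  # all values parse as floats

toy = toy.dropna().reset_index(drop=True)  # the "Movie B" row is removed
toy["Year"] = toy["Year"].astype(int)      # safe now that no NaN remains

print(toy)
```

The order matters: a column that contains NaN cannot be cast with `astype(int)`, which is why the rows are dropped before the cast.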

## Step 4: Remove Duplicate Entries

Duplicate records are removed to maintain data integrity:

In [None]:
df = df.drop_duplicates()
df.reset_index(drop=True, inplace=True)

## Step 5: Save the Cleaned Dataset

After cleaning, the dataset is saved as a new CSV file:

In [None]:
clean_csv_path = r'C:\Users\Acer\Documents\BDA3.3\Project 1- Web scraping\imdb_top_100_movies_cleaned.csv'
df.to_csv(clean_csv_path, index=False)