<a href="https://colab.research.google.com/github/vkquests/Web_Scraping/blob/main/web_scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping IMDb's Top 250 movies
IMDb (Internet Movie Database) is one of the most comprehensive and popular online databases of movies, television shows, and celebrities. It provides a wealth of information about films, including details about cast and crew, user reviews, ratings, and more. While IMDb offers an extensive user interface for browsing and searching its database, there are many situations where we might want to extract or analyze IMDb data programmatically. This is where web scraping comes into play.

**Web scraping IMDb's Top 250 movies** means automatically extracting data about these highly rated films from IMDb's website. The data collected here includes each movie's rank, title, IMDb rating, release year, runtime, and age rating.

## Project Overview
Web scraping IMDb's Top 250 movies typically involves sending HTTP requests to IMDb's website, parsing the HTML content, and extracting the desired movie-related data. It's important to approach web scraping responsibly and adhere to IMDb's terms of service and ethical scraping practices while collecting data from their platform.
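
One common courtesy check before scraping is to consult the site's `robots.txt` rules. The sketch below uses Python's standard `urllib.robotparser` on a hypothetical rules file; the file content, the `my-scraper` user agent, the `example.com` URLs, and the `is_allowed` helper are all assumptions for illustration and do not reflect IMDb's actual rules or terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
# It is NOT a copy of IMDb's actual file.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

chart_ok = is_allowed(EXAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/chart/top/")
private_ok = is_allowed(EXAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/data")
print(chart_ok, private_ok)
```

A `robots.txt` check complements, but does not replace, reading the site's terms of service.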

In this project, we'll demonstrate how to scrape IMDb's Top 250 movies and walk through the web scraping techniques involved. The scraped data is organized into a pandas DataFrame and saved to a CSV file for later analysis and exploration.



## Code Implementation

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# Create a header dictionary with a User-Agent to bypass 403: Forbidden error
header = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36',
    'Accept-Language':'en-US,en;q=1'
}

# Start with the header row; one row per movie is appended below.
# Defined before the try block so later cells can still use it if the request fails.
data = [['Rank', 'IMDb_Rating', 'Name', 'Release_Year', 'Runtime', 'Age_Rating']]

try:
    source = requests.get(url="https://www.imdb.com/chart/top/", headers=header, timeout=30)
    # Raise an HTTPError if the response status is 4xx or 5xx
    source.raise_for_status()

    # Create a BeautifulSoup object to parse the HTML content
    soup = BeautifulSoup(source.text, 'html.parser')
    movies = soup.find_all('li', class_="ipc-metadata-list-summary-item")

    for movie in movies:
        # Fetch rank and name. Split on the first '.' only, so that titles
        # that contain a period keep their full name.
        rank_name_elements = movie.find('h3', class_='ipc-title__text').text.split('.', 1)
        rank = rank_name_elements[0].strip()
        name = rank_name_elements[1].strip()

        # Fetch the rating
        rating_element = movie.find('span', class_='ipc-rating-star')
        imdb_rating = rating_element['aria-label'].split(':')[-1].strip() if rating_element else 'N/A'

        # Fetch the 1st, 2nd and 3rd <span> elements inside the metadata <div>
        div_element = movie.find('div', class_='cli-title-metadata')
        span_elements = div_element.find_all('span', class_='cli-title-metadata-item') if div_element else []

        # Fetch release year, runtime and age rating, falling back to 'N/A' when missing
        release_year = span_elements[0].text.strip() if len(span_elements) >= 1 else 'N/A'
        runtime = span_elements[1].text.strip() if len(span_elements) >= 2 else 'N/A'
        age_rating = span_elements[2].text.strip() if len(span_elements) >= 3 else 'N/A'

        data.append([rank, imdb_rating, name, release_year, runtime, age_rating])

        print(rank, imdb_rating, name, release_year, runtime, age_rating, sep=' | ')

except Exception as e:
    print(e)


1 | 9.3 | The Shawshank Redemption | 1994 | 2h 22m | R
2 | 9.2 | The Godfather | 1972 | 2h 55m | R
3 | 9.0 | The Dark Knight | 2008 | 2h 32m | PG-13
4 | 9.0 | The Godfather Part II | 1974 | 3h 22m | R
5 | 9.0 | 12 Angry Men | 1957 | 1h 36m | Approved
6 | 9.0 | Schindler's List | 1993 | 3h 15m | R
7 | 9.0 | The Lord of the Rings: The Return of the King | 2003 | 3h 21m | PG-13
8 | 8.9 | Pulp Fiction | 1994 | 2h 34m | R
9 | 8.8 | The Lord of the Rings: The Fellowship of the Ring | 2001 | 2h 58m | PG-13
10 | 8.8 | The Good, the Bad and the Ugly | 1966 | 2h 58m | Approved
11 | 8.8 | Forrest Gump | 1994 | 2h 22m | PG-13
12 | 8.8 | Fight Club | 1999 | 2h 19m | R
13 | 8.8 | The Lord of the Rings: The Two Towers | 2002 | 2h 59m | PG-13
14 | 8.8 | Inception | 2010 | 2h 28m | PG-13
15 | 8.7 | Star Wars: Episode V - The Empire Strikes Back | 1980 | 2h 4m | PG
16 | 8.7 | The Matrix | 1999 | 2h 16m | R
17 | 8.7 | Goodfellas | 1990 | 2h 25m | R
18 | 8.7 | One Flew Over the Cuckoo's Nest | 1975 | 2h 1
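
The extraction logic can be exercised offline, without a network request, by running it against a small hand-written HTML fragment. The sketch below is a minimal version under stated assumptions: `SAMPLE_HTML` only mimics the tag and class names the scraper searches for and is not a copy of IMDb's real markup, the movie in it is made up, and `parse_movie` is a hypothetical helper name. This sketch splits the heading on the first period only, so a title that itself contains a period stays intact.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that mimics the tags and class names the scraper looks for.
# This is an assumption for illustration, not IMDb's actual page source.
SAMPLE_HTML = """
<ul>
  <li class="ipc-metadata-list-summary-item">
    <h3 class="ipc-title__text">1. Example Movie Vol. 2</h3>
    <div class="cli-title-metadata">
      <span class="cli-title-metadata-item">1999</span>
      <span class="cli-title-metadata-item">2h 5m</span>
    </div>
    <span class="ipc-rating-star" aria-label="IMDb rating: 8.5">8.5</span>
  </li>
</ul>
"""

def parse_movie(item):
    """Extract one row from a single <li> element, using 'N/A' for missing fields."""
    heading = item.find('h3', class_='ipc-title__text').text
    rank, name = (part.strip() for part in heading.split('.', 1))

    star = item.find('span', class_='ipc-rating-star')
    rating = star['aria-label'].split(':')[-1].strip() if star else 'N/A'

    meta = item.find('div', class_='cli-title-metadata')
    spans = meta.find_all('span', class_='cli-title-metadata-item') if meta else []
    fields = [span.text.strip() for span in spans[:3]]
    fields += ['N/A'] * (3 - len(fields))  # pad missing year/runtime/age rating
    year, runtime, age_rating = fields

    return [rank, rating, name, year, runtime, age_rating]

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
rows = [parse_movie(li) for li in soup.find_all('li', class_='ipc-metadata-list-summary-item')]
print(rows)
```

Because the sample item has no third metadata `<span>`, its age rating falls back to `'N/A'`. Class names on a live site can change at any time, so a test like this only checks the parsing logic, not the selectors.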

In [3]:
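# Create a DataFrame from the collected rows. The column names are still the
# first row of `data`, so pandas assigns default integer column labels (0-5).
df = pd.DataFrame(data)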
df

Unnamed: 0,0,1,2,3,4,5
0,Rank,IMDb_Rating,Name,Release_Year,Runtime,Age_Rating
1,1,9.3,The Shawshank Redemption,1994,2h 22m,R
2,2,9.2,The Godfather,1972,2h 55m,R
3,3,9.0,The Dark Knight,2008,2h 32m,PG-13
4,4,9.0,The Godfather Part II,1974,3h 22m,R
...,...,...,...,...,...,...
246,246,8.1,The 400 Blows,1959,1h 39m,Not Rated
247,247,8.1,Persona,1966,1h 23m,Not Rated
248,248,8.0,Life of Brian,1979,1h 34m,R
249,249,8.0,Aladdin,1992,1h 30m,G
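
As an alternative sketch, the header row can be used as real column labels, which makes the columns addressable by name and lets the numeric fields be converted from strings. The names `sample_data` and `named_df` are assumptions for illustration; the two rows are taken from the output above.

```python
import pandas as pd

# Two sample rows in the same layout as `data`: header row first, then values.
sample_data = [
    ['Rank', 'IMDb_Rating', 'Name', 'Release_Year', 'Runtime', 'Age_Rating'],
    ['1', '9.3', 'The Shawshank Redemption', '1994', '2h 22m', 'R'],
    ['2', '9.2', 'The Godfather', '1972', '2h 55m', 'R'],
]

# Use the first row as column labels and the remaining rows as records.
named_df = pd.DataFrame(sample_data[1:], columns=sample_data[0])

# Convert numeric columns from strings; unparseable values such as 'N/A' become NaN.
for column in ['Rank', 'IMDb_Rating', 'Release_Year']:
    named_df[column] = pd.to_numeric(named_df[column], errors='coerce')

print(named_df.dtypes)
```

With named columns, a later `to_csv` call would no longer need `header=False`, since pandas would write the column labels itself.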


In [4]:
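# Write the DataFrame to a CSV file. index=False omits the row index, and
# header=False omits the default 0-5 column labels; the real column names are
# already the first row of the data, so they become the CSV header line.
df.to_csv('IMDb_Movie_Data.csv', index=False, header=False)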
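
The write-and-read-back round trip can be illustrated in memory, without touching the file system. This is a minimal sketch with one sample row; the names `rows`, `csv_text`, and `round_trip` are assumptions for illustration. It shows that the header row becomes the first CSV line and that a title containing commas is quoted automatically.

```python
import io
import pandas as pd

# Header row first, then one movie whose title contains commas.
rows = [
    ['Rank', 'IMDb_Rating', 'Name', 'Release_Year', 'Runtime', 'Age_Rating'],
    ['10', '8.8', 'The Good, the Bad and the Ugly', '1966', '2h 58m', 'Approved'],
]

# With no path argument, to_csv returns the CSV content as a string.
csv_text = pd.DataFrame(rows).to_csv(index=False, header=False)
lines = csv_text.splitlines()
print(lines[0])
print(lines[1])

# Reading the text back treats the first line as the column names.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip.loc[0, 'Name'])
```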

In [5]:
# Read the file back to check that the write was successful
csv_data = pd.read_csv('IMDb_Movie_Data.csv')
csv_data.head(20)


Unnamed: 0,Rank,IMDb_Rating,Name,Release_Year,Runtime,Age_Rating
0,1,9.3,The Shawshank Redemption,1994,2h 22m,R
1,2,9.2,The Godfather,1972,2h 55m,R
2,3,9.0,The Dark Knight,2008,2h 32m,PG-13
3,4,9.0,The Godfather Part II,1974,3h 22m,R
4,5,9.0,12 Angry Men,1957,1h 36m,Approved
5,6,9.0,Schindler's List,1993,3h 15m,R
6,7,9.0,The Lord of the Rings: The Return of the King,2003,3h 21m,PG-13
7,8,8.9,Pulp Fiction,1994,2h 34m,R
8,9,8.8,The Lord of the Rings: The Fellowship of the Ring,2001,2h 58m,PG-13
9,10,8.8,"The Good, the Bad and the Ugly",1966,2h 58m,Approved
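
For analysis, runtime strings such as `2h 22m` are easier to work with as a number of minutes. The helper below is a minimal sketch, not part of the original notebook; `runtime_to_minutes` is a hypothetical name, and it assumes runtimes follow the `Xh Ym` pattern seen in the output above (with either part optional).

```python
import re

def runtime_to_minutes(runtime: str):
    """Convert a string such as '2h 22m' to total minutes; return None if it does not match."""
    match = re.fullmatch(r'\s*(?:(\d+)h)?\s*(?:(\d+)m)?\s*', runtime)
    # Reject strings that do not fit the pattern, or that contain neither part.
    if not match or not any(match.groups()):
        return None
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return hours * 60 + minutes

print(runtime_to_minutes('2h 22m'))  # 2 * 60 + 22 = 142
print(runtime_to_minutes('N/A'))     # None, since the placeholder is not a runtime
```

Applied to a DataFrame with a named `Runtime` column, this could be used as `csv_data['Runtime'].map(runtime_to_minutes)`.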


## Conclusion
In this project, we scraped IMDb's Top 250 chart with requests and BeautifulSoup, extracted each movie's rank, rating, title, release year, runtime, and age rating, and saved the result to a CSV file with pandas.

As we conclude this project, it's essential to note that web scraping should always be performed responsibly and ethically, in compliance with the terms of service of the websites from which data is collected.

The same techniques (sending a request with suitable headers, parsing the HTML, and selecting elements by tag and class) carry over to other web scraping projects. Whether one is passionate about movies, conducting research, or simply curious about web scraping, the IMDb Top 250 project offers a concrete, working example to build on.