<a href="https://colab.research.google.com/github/srishagorasa1/IMDb/blob/main/IMDbMovies_WebScraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Web Scraping - IMDb Site:**

Step-by-step web scraping of IMDb's top 1000 movies using Python.

Below is the information gathered from each movie:

*   Title
*   Release year
*   Rating
*   Metascore
*   Gross earnings
*   Votes
*   Movie length


Link to the IMDb search page: https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv

References:

https://medium.com/@angelicacodes

https://medium.com/better-programming/the-only-step-by-step-guide-youll-need-to-build-a-web-scraper-with-python-e79066bd895a 

https://medium.com/better-programming/how-to-scrape-multiple-pages-of-a-website-using-a-python-web-scraper-4e2c641cff8

**Loading the libraries**

In [16]:
import pandas as pd
import numpy as np

import requests
from bs4 import BeautifulSoup

from time import sleep
from random import randint

**Scraping from the web**

In [17]:
# Creating the lists we want to write into
titles = []
years = []
time = []
imdb_ratings = []
metascores = []
votes = []
us_gross = []

In [18]:
# Getting English translated titles from the movies
headers = {'Accept-Language': 'en-US, en;q=0.5'}

There are 1000 movies and each page lists 50 of them, so the results span 20 pages. The `start` parameter in the URL is the position of the first movie shown on that page.

URL for the first 50 movies: https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv

Movies 51 to 100: https://www.imdb.com/search/title/?groups=top_1000&start=51&ref_=adv_nxt

Movies 101 to 150: https://www.imdb.com/search/title/?groups=top_1000&start=101&ref_=adv_nxt

In [20]:
pages = np.arange(1, 1001, 50)
pages

array([  1,  51, 101, 151, 201, 251, 301, 351, 401, 451, 501, 551, 601,
       651, 701, 751, 801, 851, 901, 951])
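The offsets above map directly onto page URLs. A minimal sketch of that mapping, using only the standard library; the names `BASE_URL`, `PAGE_SIZE` and `build_page_urls` are illustrative and not part of the notebook:

```python
from urllib.parse import urlencode

# Search endpoint taken from the links above
BASE_URL = "https://www.imdb.com/search/title/"
PAGE_SIZE = 50        # movies listed per page
TOTAL_MOVIES = 1000   # size of the top 1000 list


def build_page_urls(total=TOTAL_MOVIES, page_size=PAGE_SIZE):
    """Return one search URL per results page, keyed by the `start` offset."""
    urls = []
    for start in range(1, total + 1, page_size):
        query = urlencode({"groups": "top_1000", "start": start, "ref_": "adv_nxt"})
        urls.append(f"{BASE_URL}?{query}")
    return urls


page_urls = build_page_urls()
print(len(page_urls))   # number of pages to request
print(page_urls[1])     # URL for movies 51 to 100
```

Building the query with `urlencode` instead of string concatenation keeps the parameters in one place if the page size or list changes.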

In [21]:
# Requesting each results page (50 movies per page)
for page in pages:
    # Getting the contents of each url
    response = requests.get('https://www.imdb.com/search/title/?groups=top_1000&start=' + str(page) + '&ref_=adv_nxt', headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Selecting the part of the html we want to get the information from
    movie_div = soup.find_all('div', class_='lister-item mode-advanced')
    
    # Controlling the loop's rate by pausing execution between requests
    # Waiting a random whole number of seconds between 2 and 10 (inclusive)
    sleep(randint(2,10))
    
    for container in movie_div:
        # Scraping the movie's name
        name = container.h3.a.text
        titles.append(name)
        
        # Scraping the movie's year
        year = container.h3.find('span', class_='lister-item-year').text
        years.append(year)
        
        # Scraping the movie's length
        runtime = container.find('span', class_='runtime').text if container.find('span', class_='runtime') else '-'
        time.append(runtime)
        
        # Scraping the rating
        imdb = float(container.strong.text)
        imdb_ratings.append(imdb)
        
        # Scraping the metascore
        m_score = container.find('span', class_='metascore').text if container.find('span', class_='metascore') else '-'
        metascores.append(m_score)
        
        # Scraping votes and gross earnings
        nv = container.find_all('span', attrs={'name':'nv'})
        vote = nv[0].text
        votes.append(vote)
        grosses = nv[1].text if len(nv) > 1 else '-'
        us_gross.append(grosses)
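The votes and gross values share the same `name="nv"` spans, so the gross figure is simply absent when a movie has only one such span. A small sketch of that fallback on plain lists of strings; `split_votes_and_gross` is a hypothetical helper, and the sample strings are assumed raw formats rather than scraped values. Unlike the loop above, it also guards against an empty list instead of raising `IndexError`:

```python
def split_votes_and_gross(nv_texts, missing="-"):
    """Mirror the nv[0] / nv[1] logic on a plain list of span texts."""
    vote = nv_texts[0] if len(nv_texts) > 0 else missing
    gross = nv_texts[1] if len(nv_texts) > 1 else missing
    return vote, gross


# Illustrative inputs (assumed formats, not real scraped data)
print(split_votes_and_gross(["30,785"]))            # gross missing
print(split_votes_and_gross(["30,930", "$2.54M"]))  # both present
print(split_votes_and_gross([]))                    # nothing found
```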

**Creating the dataset**

In [24]:
movies = pd.DataFrame({'movie':titles,
                       'year':years,
                       'time_minute':time,
                       'imdb_rating':imdb_ratings,
                       'metascore':metascores,
                       'vote':votes,
                       'gross_earning':us_gross})

movies.head()

|   | movie | year | time_minute | imdb_rating | metascore | vote | gross_earning |
|---|---|---|---|---|---|---|---|
| 0 | The Mitchells vs the Machines | (2021) | 113 min | 7.9 | 81 | 30785 | - |
| 1 | The Father | (I) (2020) | 97 min | 8.3 | 88 | 54434 | - |
| 2 | Zack Snyder's Justice League | (2021) | 242 min | 8.2 | 54 | 282156 | - |
| 3 | Sound of Metal | (2019) | 120 min | 7.8 | 82 | 81947 | - |
| 4 | Another Round | (2020) | 117 min | 7.8 | 80 | 73068 | - |
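`pd.DataFrame` raises a `ValueError` when the lists in the dict have different lengths, which can happen if one field is appended conditionally during scraping. A minimal sketch of a check that could run before building the DataFrame; `check_equal_lengths` is a hypothetical helper, and the sample columns are made up for illustration:

```python
def check_equal_lengths(columns):
    """Raise ValueError if the column lists differ in length, else return that length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


# Illustrative columns (not the scraped lists)
sample = {"movie": ["A", "B"], "year": ["(2021)", "(2020)"]}
print(check_equal_lengths(sample))
```

The error message lists every column length, which makes it easier to see which field was skipped.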


In [25]:
movies.dtypes

movie             object
year              object
time_minute       object
imdb_rating      float64
metascore         object
vote              object
gross_earning     object
dtype: object

**Cleaning the dataset**

In [26]:
# Cleaning 'year' column
movies['year'] = movies['year'].str.extract(r'(\d+)').astype(int)
movies.head(3)

|   | movie | year | time_minute | imdb_rating | metascore | vote | gross_earning |
|---|---|---|---|---|---|---|---|
| 0 | The Mitchells vs the Machines | 2021 | 113 min | 7.9 | 81 | 30785 | - |
| 1 | The Father | 2020 | 97 min | 8.3 | 88 | 54434 | - |
| 2 | Zack Snyder's Justice League | 2021 | 242 min | 8.2 | 54 | 282156 | - |
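The pattern `(\d+)` captures the first run of digits, which is why a value such as `(I) (2020)` still yields `2020`: the roman numeral contains no digits. A standard-library sketch of the same extraction on plain strings; `extract_year` is an illustrative name, and `re.search` stands in here for pandas' `str.extract`:

```python
import re


def extract_year(raw):
    """Return the first run of digits in `raw` as an int, or None if there is none."""
    match = re.search(r"(\d+)", raw)
    return int(match.group(1)) if match else None


# Raw year strings as they appear in the table above
samples = ["(2021)", "(I) (2020)", "(2019)"]
print([extract_year(s) for s in samples])  # [2021, 2020, 2019]
```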


In [27]:
# Cleaning 'time_minute' column
movies['time_minute'] = movies['time_minute'].str.extract(r'(\d+)').astype(int)
movies.head(3)

|   | movie | year | time_minute | imdb_rating | metascore | vote | gross_earning |
|---|---|---|---|---|---|---|---|
| 0 | The Mitchells vs the Machines | 2021 | 113 | 7.9 | 81 | 30785 | - |
| 1 | The Father | 2020 | 97 | 8.3 | 88 | 54434 | - |
| 2 | Zack Snyder's Justice League | 2021 | 242 | 8.2 | 54 | 282156 | - |


In [28]:
# Cleaning 'metascore' column
movies['metascore'] = movies['metascore'].str.extract(r'(\d+)')
# convert it to float and if there are dashes turn it into NaN
movies['metascore'] = pd.to_numeric(movies['metascore'], errors='coerce')

In [29]:
# Cleaning 'vote' column
movies['vote'] = movies['vote'].str.replace(',', '').astype(int)
movies.head(3)

|   | movie | year | time_minute | imdb_rating | metascore | vote | gross_earning |
|---|---|---|---|---|---|---|---|
| 0 | The Mitchells vs the Machines | 2021 | 113 | 7.9 | 81.0 | 30785 | - |
| 1 | The Father | 2020 | 97 | 8.3 | 88.0 | 54434 | - |
| 2 | Zack Snyder's Justice League | 2021 | 242 | 8.2 | 54.0 | 282156 | - |


In [30]:
# Cleaning 'gross_earning' column
# left strip $ and right strip M 
movies['gross_earning'] = movies['gross_earning'].map(lambda x: x.lstrip('$').rstrip('M'))
# convert it to float and if there are dashes turn it into NaN
movies['gross_earning'] = pd.to_numeric(movies['gross_earning'], errors='coerce')
movies.head(3)

|   | movie | year | time_minute | imdb_rating | metascore | vote | gross_earning |
|---|---|---|---|---|---|---|---|
| 0 | The Mitchells vs the Machines | 2021 | 113 | 7.9 | 81.0 | 30785 | NaN |
| 1 | The Father | 2020 | 97 | 8.3 | 88.0 | 54434 | NaN |
| 2 | Zack Snyder's Justice League | 2021 | 242 | 8.2 | 54.0 | 282156 | NaN |
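The gross cleaning is a strip followed by a tolerant numeric conversion: anything that is not a number after stripping, such as the `-` placeholder, becomes NaN. A standard-library sketch of the same idea; `parse_gross` is a hypothetical helper, the `try`/`except` stands in for `pd.to_numeric(..., errors='coerce')`, and `$2.54M` is an assumed raw format consistent with the strip step above:

```python
import math


def parse_gross(raw):
    """Strip a leading '$' and trailing 'M', then convert to float (NaN on failure)."""
    stripped = raw.lstrip("$").rstrip("M")
    try:
        return float(stripped)
    except ValueError:
        # Mirrors errors='coerce': unparseable values become NaN
        return math.nan


print(parse_gross("$2.54M"))           # 2.54
print(math.isnan(parse_gross("-")))    # True
```

The resulting numbers are in millions of US dollars, since the `M` suffix is dropped rather than applied.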


In [31]:
movies.dtypes

movie             object
year               int64
time_minute        int64
imdb_rating      float64
metascore        float64
vote               int64
gross_earning    float64
dtype: object

**Final Dataset**

In [32]:
movies

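|   | movie | year | time_minute | imdb_rating | metascore | vote | gross_earning |
|---|---|---|---|---|---|---|---|
| 0 | The Mitchells vs the Machines | 2021 | 113 | 7.9 | 81.0 | 30785 | NaN |
| 1 | The Father | 2020 | 97 | 8.3 | 88.0 | 54434 | NaN |
| 2 | Zack Snyder's Justice League | 2021 | 242 | 8.2 | 54.0 | 282156 | NaN |
| 3 | Sound of Metal | 2019 | 120 | 7.8 | 82.0 | 81947 | NaN |
| 4 | Another Round | 2020 | 117 | 7.8 | 80.0 | 73068 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Raazi | 2018 | 138 | 7.8 | NaN | 26057 | NaN |
| 996 | A Night at the Opera | 1935 | 96 | 7.9 | NaN | 30930 | 2.54 |
| 997 | The Breath | 2009 | 128 | 8.0 | NaN | 32233 | NaN |
| 998 | English Vinglish | 2012 | 134 | 7.8 | NaN | 34029 | 1.67 |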


**Saving the dataset to a csv file**

In [33]:
# index=False avoids writing the row index as an extra unnamed column
movies.to_csv('movies.csv', index=False)