<a href="https://colab.research.google.com/github/tegacodess/My-Data-Projects/blob/main/IMDB_scrape.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IMDB Site Webscrape 🎥

In this notebook, I scrape the top 25 movies listed on the IMDB site. Because this listing changes constantly, the values shown in this notebook may no longer match the top 25 movies on the site by the time you read it.

For this reason, I display the data and its state after every line of code that modifies it, so the viewer can follow each step.

### Objectives
To obtain:
1. Movie titles
2. Movie rank number
3. Number of votes
4. Duration
5. Release year
6. Metascore
7. Movie description
8. Parental guidance rating


### Tools
* Python
* Requests Library
* BeautifulSoup Library
* Pandas Library

After the data has been obtained, it is converted into a DataFrame and exported in CSV, TSV, and Excel file formats.

In [49]:
# import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# Define a user-agent to mimic a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}

url = "https://www.imdb.com/search/title/?groups=top_250"  # search results for the Top 250 group
response = requests.get(url, headers=headers)

# Check if the request was successful
response.status_code


200

In [50]:
body = response.content

In [51]:
soup = BeautifulSoup(body, 'html.parser')

### Scrape Movie Titles & Ranks

In [55]:
# html element
#  <h3 class="ipc-title__text">Jaws</h3>

In [56]:
titles = soup.find_all("h3", class_='ipc-title__text')
titles

[<h3 class="ipc-title__text ipc-title__text--reduced">1. Jaws</h3>,
 <h3 class="ipc-title__text ipc-title__text--reduced">2. How to Train Your Dragon</h3>,
 <h3 class="ipc-title__text ipc-title__text--reduced">3. The Shawshank Redemption</h3>,
 <h3 class="ipc-title__text ipc-title__text--reduced">4. Top Gun: Maverick</h3>,
 <h3 class="ipc-title__text ipc-title__text--reduced">5. The Godfather</h3>,
 <h3 class="ipc-title__text ipc-title__text--reduced">6. The Wild Robot</h3>,
 <h3 class="ipc-title__text ipc-title__text--reduced">7. Interstellar</h3>,
 <h3 class="ipc-title__text ipc-title__text--reduced">8. Oppenheimer</h3>,
 <h3 class="ipc-title__text ipc-title__text--reduced">9. Inception</h3>,
 <h3 class="ipc-title__text ipc-title__text--reduced">10. The Departed</h3>,
 <h3 class="ipc-title__text ipc-title__text--reduced">11. Goodfellas</h3>,
 <h3 class="ipc-title__text ipc-title__text--reduced">12. Jurassic Park</h3>,
 <h3 class="ipc-title__text ipc-title__text--reduced">13. The Dark

In [57]:
movie_titles = []
for title in titles:
  movie = title.get_text()
  movie_titles.append(movie)

movie_titles

['1. Jaws',
 '2. How to Train Your Dragon',
 '3. The Shawshank Redemption',
 '4. Top Gun: Maverick',
 '5. The Godfather',
 '6. The Wild Robot',
 '7. Interstellar',
 '8. Oppenheimer',
 '9. Inception',
 '10. The Departed',
 '11. Goodfellas',
 '12. Jurassic Park',
 '13. The Dark Knight',
 '14. Se7en',
 '15. Fight Club',
 '16. Dune: Part Two',
 '17. The Lord of the Rings: The Fellowship of the Ring',
 '18. Pulp Fiction',
 '19. The Matrix',
 '20. Parasite',
 '21. The Prestige',
 '22. Inglourious Basterds',
 '23. Gladiator',
 '24. The Silence of the Lambs',
 '25. No Country for Old Men',
 'Recently viewed']

In [58]:
del movie_titles[-1]  # remove the last item ('Recently viewed'), since it's not a movie title
movie_titles

['1. Jaws',
 '2. How to Train Your Dragon',
 '3. The Shawshank Redemption',
 '4. Top Gun: Maverick',
 '5. The Godfather',
 '6. The Wild Robot',
 '7. Interstellar',
 '8. Oppenheimer',
 '9. Inception',
 '10. The Departed',
 '11. Goodfellas',
 '12. Jurassic Park',
 '13. The Dark Knight',
 '14. Se7en',
 '15. Fight Club',
 '16. Dune: Part Two',
 '17. The Lord of the Rings: The Fellowship of the Ring',
 '18. Pulp Fiction',
 '19. The Matrix',
 '20. Parasite',
 '21. The Prestige',
 '22. Inglourious Basterds',
 '23. Gladiator',
 '24. The Silence of the Lambs',
 '25. No Country for Old Men']
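Deleting the last item works as long as the stray heading is always last. As a hedged alternative, the sketch below keeps only entries that start with a rank number. The names `RANK_PREFIX`, `keep_ranked`, and `sample` are introduced here for illustration and are not part of the original notebook.

```python
import re

# Entries we want start with a rank such as "1. " or "25. "
RANK_PREFIX = re.compile(r"^\d+\.\s")


def keep_ranked(entries):
    """Return only the entries that begin with '<number>. '."""
    return [entry for entry in entries if RANK_PREFIX.match(entry)]


# Small illustrative sample, not the full scraped list
sample = ["1. Jaws", "2. How to Train Your Dragon", "Recently viewed"]
print(keep_ranked(sample))
```

This filters by content rather than position, so it would still work if the page added another unranked heading elsewhere in the list.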

In [59]:
titles = []
ranks = []
for movie in movie_titles:
  # split only at the first '. ' so titles that contain '. ' stay intact
  rank, title = movie.split('. ', 1)
  ranks.append(rank)
  titles.append(title)
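The rank and title are separated at the first `'. '` in each entry. A minimal sketch of the same idea using `str.partition` is shown below; the helper name `split_rank_title` is hypothetical, and the second sample entry (including its rank) is made up to show a title that itself contains `'. '`.

```python
def split_rank_title(entry):
    """Split '<rank>. <title>' at the first '. ' only."""
    rank, sep, title = entry.partition(". ")
    if not sep or not rank.isdigit():
        raise ValueError(f"unexpected entry format: {entry!r}")
    return int(rank), title


print(split_rank_title("1. Jaws"))
# Made-up entry: the title contains '. ' but is still kept whole
print(split_rank_title("99. E.T. the Extra-Terrestrial"))
```

Unlike the loop above, this sketch returns the rank as an integer, which may be more convenient for sorting later.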

In [60]:
titles

['Jaws',
 'How to Train Your Dragon',
 'The Shawshank Redemption',
 'Top Gun: Maverick',
 'The Godfather',
 'The Wild Robot',
 'Interstellar',
 'Oppenheimer',
 'Inception',
 'The Departed',
 'Goodfellas',
 'Jurassic Park',
 'The Dark Knight',
 'Se7en',
 'Fight Club',
 'Dune: Part Two',
 'The Lord of the Rings: The Fellowship of the Ring',
 'Pulp Fiction',
 'The Matrix',
 'Parasite',
 'The Prestige',
 'Inglourious Basterds',
 'Gladiator',
 'The Silence of the Lambs',
 'No Country for Old Men']

In [61]:
ranks

['1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '10',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '19',
 '20',
 '21',
 '22',
 '23',
 '24',
 '25']

In [11]:
len(movie_titles)

25

### Scrape Number of Votes

In [12]:
# html element
# <span class="ipc-rating-star--voteCount">&nbsp;(<!-- -->1.5M<!-- -->)</span>

In [62]:
vote = soup.find_all("span", class_='ipc-rating-star--voteCount')
vote

[<span class="ipc-rating-star--voteCount"> (<!-- -->700K<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->853K<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->3.1M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->785K<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->2.1M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->183K<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->2.4M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->904K<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->2.7M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->1.5M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->1.3M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->1.1M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->3M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount

In [63]:
vote_count = []
for rating in vote:
  num = rating.get_text()
  text = num.strip('\xa0()')
  vote_count.append(text)


vote_count
# len(vote_count)

['700K',
 '853K',
 '3.1M',
 '785K',
 '2.1M',
 '183K',
 '2.4M',
 '904K',
 '2.7M',
 '1.5M',
 '1.3M',
 '1.1M',
 '3M',
 '1.9M',
 '2.5M',
 '643K',
 '2.1M',
 '2.3M',
 '2.2M',
 '1.1M',
 '1.5M',
 '1.7M',
 '1.8M',
 '1.6M',
 '1.1M']
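The vote counts are stored as abbreviated strings such as `'700K'` and `'3.1M'`. If numeric values are needed, a conversion along the following lines could be used. This is a sketch: `parse_vote_count` is a hypothetical helper, and the fallback for plain digit strings is an assumption, since the scraped output above only shows `K` and `M` suffixes.

```python
def parse_vote_count(text):
    """Convert an abbreviated count such as '700K' or '3.1M' to an int."""
    multipliers = {"K": 1_000, "M": 1_000_000}
    text = text.strip().upper()
    suffix = text[-1:]
    if suffix in multipliers:
        # round() guards against floating-point error, e.g. in 2.3 * 1_000_000
        return round(float(text[:-1]) * multipliers[suffix])
    # Assumption: anything without a suffix is a plain integer string
    return int(text.replace(",", ""))


print([parse_vote_count(v) for v in ["700K", "3.1M", "3M"]])
```

The results are approximate, because the site displays rounded counts.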

### Release Year, Duration and Parental Guidance Rating

In [15]:
#html element
# <span class="sc-86fea7d1-8 JTbpG dli-title-metadata-item">1975</span>

In [44]:
release= soup.find_all("span", class_='sc-86fea7d1-8 JTbpG dli-title-metadata-item')
release
# len(release)

[<span class="sc-86fea7d1-8 JTbpG dli-title-metadata-item">1975</span>,
 <span class="sc-86fea7d1-8 JTbpG dli-title-metadata-item">2h 4m</span>,
 <span class="sc-86fea7d1-8 JTbpG dli-title-metadata-item">PG</span>,
 <span class="sc-86fea7d1-8 JTbpG dli-title-metadata-item">2010</span>,
 <span class="sc-86fea7d1-8 JTbpG dli-title-metadata-item">1h 38m</span>,
 <span class="sc-86fea7d1-8 JTbpG dli-title-metadata-item">PG</span>,
 <span class="sc-86fea7d1-8 JTbpG dli-title-metadata-item">1994</span>,
 <span class="sc-86fea7d1-8 JTbpG dli-title-metadata-item">2h 22m</span>,
 <span class="sc-86fea7d1-8 JTbpG dli-title-metadata-item">R</span>,
 <span class="sc-86fea7d1-8 JTbpG dli-title-metadata-item">2022</span>,
 <span class="sc-86fea7d1-8 JTbpG dli-title-metadata-item">2h 10m</span>,
 <span class="sc-86fea7d1-8 JTbpG dli-title-metadata-item">PG-13</span>,
 <span class="sc-86fea7d1-8 JTbpG dli-title-metadata-item">1972</span>,
 <span class="sc-86fea7d1-8 JTbpG dli-title-metadata-item">2h 5

In [64]:
# Use a for loop and conditional statements to sort each item in 'release'
# into release year, duration, or parental guidance rating

release_year = []
pg_rating = []
duration = []

for item in release:
  text = item.get_text()

  # release year: exactly four digits, e.g. '1975'
  if text.isdigit() and len(text) == 4:
    release_year.append(text)

  # duration: starts with a digit and ends in 'h' or 'm', e.g. '2h 4m', '3h'
  elif text[:1].isdigit() and text.endswith(('h', 'm')):
    duration.append(text)

  # parental guidance rating: everything else, e.g. 'PG-13'
  else:
    pg_rating.append(text)
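The classification relies on the shape of each metadata string. The sketch below expresses the same rules with regular expressions; `YEAR_RE`, `DURATION_RE`, and `classify_metadata` are names introduced here for illustration, and the patterns assume the formats seen in the output above (`'1975'`, `'2h 4m'`, `'3h'`), plus minutes-only values such as `'45m'`.

```python
import re

YEAR_RE = re.compile(r"\d{4}")
DURATION_RE = re.compile(r"\d+h(?: \d+m)?|\d+m")


def classify_metadata(text):
    """Label a metadata string as 'year', 'duration', or 'rating'."""
    if YEAR_RE.fullmatch(text):
        return "year"
    if DURATION_RE.fullmatch(text):
        return "duration"
    # Anything else is treated as a parental guidance rating
    return "rating"


for value in ["1975", "2h 4m", "3h", "PG-13"]:
    print(value, "->", classify_metadata(value))
```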

In [65]:
release_year

['1975',
 '2010',
 '1994',
 '2022',
 '1972',
 '2024',
 '2014',
 '2023',
 '2010',
 '2006',
 '1990',
 '1993',
 '2008',
 '1995',
 '1999',
 '2024',
 '2001',
 '1994',
 '1999',
 '2019',
 '2006',
 '2009',
 '2000',
 '1991',
 '2007']

In [66]:
pg_rating

['PG',
 'PG',
 'R',
 'PG-13',
 'R',
 'PG',
 'PG-13',
 'R',
 'PG-13',
 'R',
 'R',
 'PG-13',
 'PG-13',
 'R',
 'R',
 'PG-13',
 'PG-13',
 'R',
 'R',
 'R',
 'PG-13',
 'R',
 'R',
 'R',
 'R']

In [67]:
duration

['2h 4m',
 '1h 38m',
 '2h 22m',
 '2h 10m',
 '2h 55m',
 '1h 42m',
 '2h 49m',
 '3h',
 '2h 28m',
 '2h 31m',
 '2h 25m',
 '2h 7m',
 '2h 32m',
 '2h 7m',
 '2h 19m',
 '2h 46m',
 '2h 58m',
 '2h 34m',
 '2h 16m',
 '2h 12m',
 '2h 10m',
 '2h 33m',
 '2h 35m',
 '1h 58m',
 '2h 2m']
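Durations such as `'2h 4m'` and `'3h'` are easier to compare once converted to minutes. A minimal sketch follows, assuming the only formats are hours, minutes, or both; `DURATION_PATTERN` and `duration_to_minutes` are hypothetical names, not part of the original notebook.

```python
import re

# Optional '<n>h' followed by optional '<n>m'
DURATION_PATTERN = re.compile(r"(?:(\d+)h)?\s*(?:(\d+)m)?")


def duration_to_minutes(text):
    """Convert a duration such as '2h 4m', '3h', or '45m' to total minutes."""
    match = DURATION_PATTERN.fullmatch(text.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes = match.groups()
    return int(hours or 0) * 60 + int(minutes or 0)


print([duration_to_minutes(d) for d in ["2h 4m", "1h 38m", "3h"]])
```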

### Movie Description

In [25]:
# html element
# <div class="ipc-html-content-inner-div" role="presentation">Luke Skywalker joins forces with a Jedi Knight, a cocky pilot, a Wookiee and two droids to save the galaxy from the Empire's world-destroying battle station, while also attempting to rescue Princess Leia from the mysterious Darth Vader.</div>

In [42]:
description= soup.find_all("div", class_='ipc-html-content-inner-div')
description

In [68]:
movie_description = []
for desc in description:
  movie_desc = desc.get_text()
  movie_description.append(movie_desc)

movie_description

["When a massive killer shark unleashes chaos on a beach community off Long Island, it's up to the local police chief, a marine biologist, and an old seafarer to hunt the beast down.",
 'A hapless young Viking who aspires to hunt dragons becomes the unlikely friend of a young dragon himself, and learns there may be more to the creatures than he assumed.',
 'A banker convicted of uxoricide forms a friendship over a quarter century with a hardened convict, while maintaining his innocence and trying to remain hopeful through simple compassion.',
 'The story involves Maverick confronting his past while training a group of younger Top Gun graduates, including the son of his deceased best friend, for a dangerous mission.',
 'The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son.',
 "After a shipwreck, an intelligent robot called Roz is stranded on an uninhabited island. To survive the harsh environment, Roz bonds with the island's 

In [28]:
len(movie_description)

25

### Metascore

In [69]:
# html tag
# <span class="sc-9fe7b0ef-0 hDuMnh metacritic-score-box" style="background-color:#54A72A">87</span>

In [70]:
meta_score = soup.find_all("span", class_ = 'sc-9fe7b0ef-0 hDuMnh metacritic-score-box')
meta_score

[<span class="sc-9fe7b0ef-0 hDuMnh metacritic-score-box" style="background-color:#54A72A">87</span>,
 <span class="sc-9fe7b0ef-0 hDuMnh metacritic-score-box" style="background-color:#54A72A">75</span>,
 <span class="sc-9fe7b0ef-0 hDuMnh metacritic-score-box" style="background-color:#54A72A">82</span>,
 <span class="sc-9fe7b0ef-0 hDuMnh metacritic-score-box" style="background-color:#54A72A">78</span>,
 <span class="sc-9fe7b0ef-0 hDuMnh metacritic-score-box" style="background-color:#54A72A">100</span>,
 <span class="sc-9fe7b0ef-0 hDuMnh metacritic-score-box" style="background-color:#54A72A">85</span>,
 <span class="sc-9fe7b0ef-0 hDuMnh metacritic-score-box" style="background-color:#54A72A">74</span>,
 <span class="sc-9fe7b0ef-0 hDuMnh metacritic-score-box" style="background-color:#54A72A">90</span>,
 <span class="sc-9fe7b0ef-0 hDuMnh metacritic-score-box" style="background-color:#54A72A">74</span>,
 <span class="sc-9fe7b0ef-0 hDuMnh metacritic-score-box" style="background-color:#54A72A">

In [71]:
metascore =[]
for score in meta_score:
  meta = score.get_text()
  metascore.append(meta)

metascore

['87',
 '75',
 '82',
 '78',
 '100',
 '85',
 '74',
 '90',
 '74',
 '85',
 '92',
 '68',
 '85',
 '65',
 '67',
 '79',
 '92',
 '95',
 '73',
 '97',
 '66',
 '69',
 '67',
 '86',
 '92']
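Collecting each field into its own list assumes every movie has every field. If a movie had no Metascore, the lists would no longer line up by position. One way to guard against this is to parse each movie's container separately, as sketched below. The markup and the class names `movie-card`, `title`, and `metascore` are made up for this example and are not IMDB's real class names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second card deliberately has no score
sample_html = """
<ul>
  <li class="movie-card"><h3 class="title">1. Jaws</h3><span class="metascore">87</span></li>
  <li class="movie-card"><h3 class="title">2. Example Without Score</h3></li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for card in sample_soup.find_all("li", class_="movie-card"):
    title_tag = card.find("h3", class_="title")
    score_tag = card.find("span", class_="metascore")
    rows.append({
        "title": title_tag.get_text(strip=True),
        # None marks a missing score instead of shifting later values
        "metascore": int(score_tag.get_text(strip=True)) if score_tag else None,
    })

print(rows)
```

Because each row is built from a single card, a missing field affects only that row.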

### Store the scraped data in a data frame


In [37]:
data = pd.DataFrame()

data['Title'] = titles
data['Rank'] = ranks
data['Vote'] = vote_count
data['Duration'] = duration
data['Release Year'] = release_year
data['PG Rating'] = pg_rating
data['Metascore'] = metascore
data['Description'] = movie_description

data.head()

|   | Title | Rank | Vote | Duration | Release Year | PG Rating | Metascore | Description |
|---|---|---|---|---|---|---|---|---|
| 0 | Jaws | 1 | 700K | 2h 4m | 1975 | PG | 87 | When a massive killer shark unleashes chaos on... |
| 1 | How to Train Your Dragon | 2 | 853K | 1h 38m | 2010 | PG | 75 | A hapless young Viking who aspires to hunt dra... |
| 2 | The Shawshank Redemption | 3 | 3.1M | 2h 22m | 1994 | R | 82 | A banker convicted of uxoricide forms a friend... |
| 3 | Top Gun: Maverick | 4 | 785K | 2h 10m | 2022 | PG-13 | 78 | The story involves Maverick confronting his pa... |
| 4 | The Godfather | 5 | 2.1M | 2h 55m | 1972 | R | 100 | The aging patriarch of an organized crime dyna... |


In [38]:
data.shape


(25, 8)
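The DataFrame is built from separate lists, so every list must have the same number of items. A small check like the one below could be run before building it to report all lengths at once. `check_equal_lengths` and `sample_columns` are hypothetical names used only for this sketch.

```python
def check_equal_lengths(columns):
    """Return each column's length, raising ValueError if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return lengths


# Small illustrative sample, not the full scraped data
sample_columns = {
    "Title": ["Jaws", "The Godfather"],
    "Rank": ["1", "5"],
    "Metascore": ["87", "100"],
}
print(check_equal_lengths(sample_columns))
```

In the notebook, the same function could be called with the scraped lists (for example `titles`, `ranks`, and `vote_count`) as the dictionary values.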

In [72]:
# export as CSV, TSV, and Excel files (to_excel needs an Excel engine such as openpyxl)
data.to_csv('IMDB Top 25 Movies.csv', index = False)
data.to_csv('IMDB Top 25 Movies.tsv', sep='\t', index=False)
data.to_excel('IMDB Top 25 Movies.xlsx', index=False)