### IMDb Movie Reviews: Scraping and Cleaning the Movie Ratings

In [6]:
# Install the html5lib parser, which BeautifulSoup uses in 'html5lib' mode
!pip install html5lib


Collecting html5lib
  Downloading html5lib-1.1-py2.py3-none-any.whl.metadata (16 kB)
Downloading html5lib-1.1-py2.py3-none-any.whl (112 kB)
Installing collected packages: html5lib
Successfully installed html5lib-1.1


In [8]:
# Import libraries
# html5lib is selected by name in BeautifulSoup(...), so it needs no explicit import
import json

import requests
from bs4 import BeautifulSoup

### Fetch and Parse the IMDb Top 250 Page

In [13]:
# Set the headers and the URL
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246"}
URL = 'https://m.imdb.com/chart/top/?ref_=nv_mv_250'

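# Get the page content (the timeout avoids hanging; raise_for_status surfaces HTTP errors)
r = requests.get(url=URL, headers=headers, timeout=30)
r.raise_for_status()
soup = BeautifulSoup(r.content, 'html5lib')  # Parse the HTML content
print(soup.prettify())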

<!DOCTYPE html>
<html lang="en-US" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width" name="viewport"/>
  <script>
   if(typeof uet === 'function'){ uet('bb', 'LoadTitle', {wb: 1}); }
  </script>
  <script>
   window.addEventListener('load', (event) => {
        if (typeof window.csa !== 'undefined' && typeof window.csa === 'function') {
            var csaLatencyPlugin = window.csa('Content', {
                element: {
                    slotId: 'LoadTitle',
                    type: 'service-call'
                }
            });
            csaLatencyPlugin('mark', 'clickToBodyBegin', 1727210388266);
        }
    })
  </script>
  <title>
   IMDb Top 250 Movies
  </title>
  <meta content="As rated by regular IMDb voters." data-id="main" name="description"/>
  <script type="application/ld+json">
   {"@type":"ItemList","itemListElement":[{"@type":"ListItem","item":{"@t

### Extract JSON Data from HTML Script Tag

In [16]:
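# The chart data is embedded as JSON-LD in a <script type="application/ld+json"> tag
script_tag = soup.find('script', type='application/ld+json')
if script_tag is None:
    # Stop here: without the tag there is nothing to parse
    raise ValueError("No script tag found containing movie data.")

json_string = script_tag.string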
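
To make the JSON-LD step easier to follow, here is a minimal, dependency-free sketch of the same idea. It is not the notebook's method: it swaps BeautifulSoup for Python's built-in `html.parser`, and `JsonLdExtractor`, `SAMPLE_HTML`, and the placeholder movie entry are all hypothetical names and data made up for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # Start a new block only for JSON-LD script tags
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append
        if self._in_json_ld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False


# Hypothetical miniature page (placeholder data, not real IMDb content)
SAMPLE_HTML = """
<html><head>
<title>Sample chart</title>
<script>var unrelated = 1;</script>
<script type="application/ld+json">
{"@type": "ItemList", "itemListElement": [
  {"@type": "ListItem", "item": {"@type": "Movie", "name": "Example Movie",
   "url": "https://example.com/title/1/"}}
]}
</script>
</head><body></body></html>
"""

parser = JsonLdExtractor()
parser.feed(SAMPLE_HTML)
parser.close()

sample_data = json.loads(parser.blocks[0])
print(sample_data["itemListElement"][0]["item"]["name"])
```

The plain `<script>` tag is ignored because its `type` attribute does not match, which mirrors what `soup.find('script', type='application/ld+json')` does in the cell above.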

### Extract the List of Movies

In [19]:
# Extract movie list
data = json.loads(json_string)
movies = data['itemListElement']
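
As a rough illustration of what `movies` contains, the sketch below flattens an `itemListElement` list into one dictionary per movie. The keys `item`, `name`, `url`, `aggregateRating`, and `ratingValue` are the ones this notebook reads; `summarize_movies` and `SAMPLE_JSON` are hypothetical, and the sample values are placeholders rather than real chart data.

```python
import json

# Hypothetical sample shaped like the JSON-LD on the chart page (placeholder values)
SAMPLE_JSON = """
{"@type": "ItemList", "itemListElement": [
  {"@type": "ListItem", "item": {"name": "Example Movie A",
    "url": "https://example.com/title/a/",
    "aggregateRating": {"ratingValue": 9.1}}},
  {"@type": "ListItem", "item": {"name": "Example Movie B",
    "url": "https://example.com/title/b/"}}
]}
"""


def summarize_movies(data):
    """Return one dict per list entry with title, URL, and rating (None if absent)."""
    rows = []
    for entry in data.get("itemListElement", []):
        item = entry.get("item", {})
        rating = item.get("aggregateRating", {}).get("ratingValue")
        rows.append({"title": item.get("name"), "url": item.get("url"), "rating": rating})
    return rows


rows = summarize_movies(json.loads(SAMPLE_JSON))
for row in rows:
    print(row)
```

Using `.get()` with defaults means an entry without a rating yields `None` instead of raising a `KeyError`.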

### Loop Through Movies and Fetch Data

In [None]:
# The chart page has no reviews, so loop over the movies and fetch each movie's own page
for movie in movies:
    movie_data = movie['item']
    title = movie_data['name']
    url = movie_data['url']

    # Fetch the movie's page content
    response = requests.get(url, headers=headers, timeout=30)
    movie_soup = BeautifulSoup(response.content, 'html5lib')
    # The featured review sits in an <article> tag; this class name looks auto-generated and may change
    review_div = movie_soup.find('article', class_="sc-f5a8a7b0-0 dXbwtD")

    if review_div is None:
        continue  # Skip if no review found

    # Get the review text and rating
    review_text = review_div.get_text(strip=True)
    rating = movie_data.get('aggregateRating', {}).get('ratingValue', 'N/A')
    print(f"Title: {title}, URL: {url}, Rating value: {rating}, Review: {review_text}")




Title: The Shawshank Redemption, URL: https://www.imdb.com/title/tt0111161/, Rating value: 9.3, Review: Featured reviewPrepare to be movedI have never seen such an amazing film since I saw The Shawshank Redemption. Shawshank encompasses friendships, hardships, hopes, and dreams.  And what is so great about the movie is that it moves you, it gives you hope.  Even though the circumstances between the characters and the viewers are quite different, you don't feel that far removed from what the characters are going through.It is a simple film, yet it has an everlasting message.  Frank Darabont didn't need to put any kind of outlandish special effects to get us to love this film, the narration and the acting does that for him.  Why this movie didn't win all seven Oscars is beyond me, but don't let that sway you to not see this film, let its ranking on the IMDb's top 250 list sway you, let your friends recommendation about the movie sway you.Set aside a little over two hours tonight and rent

In [None]:
### And here we have the desired output
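
The loop falls back to `'N/A'` when a rating is missing, and the printed review begins with the label "Featured review" fused to the review title. One possible cleaning step, sketched under those two observations, is shown below. `clean_rating` and `strip_label` are hypothetical helpers, not part of the notebook, and the default label is an assumption based only on the output shown above.

```python
def clean_rating(value):
    """Convert a scraped rating to float; return None for 'N/A' or other unparseable values."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def strip_label(review_text, label="Featured review"):
    """Remove a leading label from the review text if it is present."""
    if review_text.startswith(label):
        return review_text[len(label):].lstrip()
    return review_text


print(clean_rating("9.3"))
print(clean_rating("N/A"))
print(strip_label("Featured reviewPrepare to be moved"))
```

Returning `None` rather than the string `'N/A'` keeps the rating column numeric, which tends to make later sorting or averaging simpler.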
