# Option 1: JSON in HTML
Example:

Write a Python function that takes the URL of an IMDb movie page, extracts the embedded JSON data and converts it into a flattened DataFrame.
The DataFrame should contain only text or numeric columns; avoid columns that hold lists or dictionaries.
Apply this function to a list of IMDb movie URLs, for example, the top 200 movies. Write a script to obtain the links.
Document your code.

In [1]:
# Import necessary libraries
from scrapethat import read_cloud  # the only scrapethat helper used below; returns the parsed page
import json
from bs4 import BeautifulSoup
import requests
import cloudscraper
import pandas as pd
from pandas import json_normalize
from tqdm import tqdm
import re

## Get list of IMDb movie URLs

In [2]:
link = 'https://www.imdb.com/chart/top/?ref_=nv_mv_250'

In [3]:
top250 = read_cloud(link)

In [28]:
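# Get a list of links to the top 250 movies.
# Note: the class string below includes what appear to be auto-generated class names,
# so it may stop matching if IMDb changes its markup.
elements = top250.find_all(class_='ipc-title ipc-title--base ipc-title--title ipc-title-link-no-icon ipc-title--on-textPrimary sc-479faa3c-9 dkLVoC cli-title')
# Each title element wraps an <a> tag with a relative href; prepend the domain to get a full URL
links_list = ['https://www.imdb.com' + x.find('a')['href'] for x in elements]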

In [29]:
links_list

['https://www.imdb.com/title/tt0111161/?ref_=chttp_t_1',
 'https://www.imdb.com/title/tt0068646/?ref_=chttp_t_2',
 'https://www.imdb.com/title/tt0468569/?ref_=chttp_t_3',
 'https://www.imdb.com/title/tt0071562/?ref_=chttp_t_4',
 'https://www.imdb.com/title/tt0050083/?ref_=chttp_t_5',
 'https://www.imdb.com/title/tt0108052/?ref_=chttp_t_6',
 'https://www.imdb.com/title/tt0167260/?ref_=chttp_t_7',
 'https://www.imdb.com/title/tt0110912/?ref_=chttp_t_8',
 'https://www.imdb.com/title/tt0120737/?ref_=chttp_t_9',
 'https://www.imdb.com/title/tt0060196/?ref_=chttp_t_10',
 'https://www.imdb.com/title/tt0109830/?ref_=chttp_t_11',
 'https://www.imdb.com/title/tt0137523/?ref_=chttp_t_12',
 'https://www.imdb.com/title/tt0167261/?ref_=chttp_t_13',
 'https://www.imdb.com/title/tt1375666/?ref_=chttp_t_14',
 'https://www.imdb.com/title/tt0080684/?ref_=chttp_t_15',
 'https://www.imdb.com/title/tt0133093/?ref_=chttp_t_16',
 'https://www.imdb.com/title/tt0099685/?ref_=chttp_t_17',
 'https://www.imdb.com/
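The scraped links carry a `?ref_=chttp_t_N` tracking parameter. Below is a minimal sketch of stripping that parameter and dropping duplicate links with the standard library, assuming the title ID (the `tt...` part of the path) is enough to identify a movie page. The helper name `clean_title_url` and the sample list are illustrative and not part of the original notebook.

```python
import re
from urllib.parse import urlsplit

# Matches the IMDb title ID in a path such as /title/tt0111161/
TITLE_ID = re.compile(r'/title/(tt\d+)')

def clean_title_url(url):
    """Return the title-page URL without its query string, or None if no title ID is found."""
    match = TITLE_ID.search(urlsplit(url).path)
    if match is None:
        return None
    return f'https://www.imdb.com/title/{match.group(1)}/'

sample_links = [
    'https://www.imdb.com/title/tt0111161/?ref_=chttp_t_1',
    'https://www.imdb.com/title/tt0068646/?ref_=chttp_t_2',
    'https://www.imdb.com/title/tt0111161/?ref_=chttp_t_1',  # duplicate
    'https://www.imdb.com/chart/top/',                        # not a title page
]

cleaned = [clean_title_url(u) for u in sample_links]
# dict.fromkeys keeps the first occurrence of each URL and preserves order
unique_links = list(dict.fromkeys(u for u in cleaned if u is not None))
print(unique_links)
```

Whether the shortened URLs behave the same as the originals on IMDb is an assumption worth checking on one page before scraping the full list.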

In [5]:
# Define a function that gets one movie
def get_a_movie(link):
    """Scrape one IMDb movie page and return its embedded JSON data as a flat dict (None on failure)."""
    def nth(items, i, key=None):
        # Return the i-th item (or its `key` field) if it exists, otherwise None
        if i >= len(items):
            return None
        return items[i][key] if key else items[i]

    try:
        t = read_cloud(link)
        data = json.loads(t.find('script', {'type': 'application/ld+json'}).text)
        genres = data.get('genre', [])
        actors = data.get('actor', [])
        directors = data.get('director', [])
        # Look for each column in the JSON
        return {
            'movie': data['name'],
            'alternateName': data.get('alternateName', None),  # None if there is no alternateName
            'description': data['description'],
            'ratingCount': data['aggregateRating']['ratingCount'],
            'ratingValue': data['aggregateRating']['ratingValue'],
            'genre1': nth(genres, 0),
            'genre2': nth(genres, 1),  # None if there is only 1 genre
            'genre3': nth(genres, 2),  # None if there are only 2 genres
            'published': data['datePublished'],
            'keywords': data['keywords'],
            'duration': data['duration'],
            'actor1': nth(actors, 0, 'name'),
            'actor2': nth(actors, 1, 'name'),  # None if there is only 1 actor
            'actor3': nth(actors, 2, 'name'),  # None if there are only 2 actors
            'director': nth(directors, 0, 'name')
        }
    except Exception as e:
        # Report which link failed and why, then return None so the caller can skip it
        print(f'error for {link}: {e}')
        return None

In [6]:
# try it on 1 movie
get_a_movie('https://www.imdb.com/title/tt0111161/?ref_=chttp_t_1')

{'movie': 'The Shawshank Redemption',
 'alternateName': None,
 'description': 'Over the course of several years, two convicts form a friendship, seeking consolation and, eventually, redemption through basic compassion.',
 'ratingCount': 2824041,
 'ratingValue': 9.3,
 'genre1': 'Drama',
 'genre2': None,
 'genre3': None,
 'published': '1994-10-14',
 'keywords': 'prison,based on the works of stephen king,escape from prison,friendship between men,wrongful conviction',
 'duration': 'PT2H22M',
 'actor1': 'Tim Robbins',
 'actor2': 'Morgan Freeman',
 'actor3': 'Bob Gunton',
 'director': 'Frank Darabont'}
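`get_a_movie` relies on the page embedding its metadata in a `<script type="application/ld+json">` tag. Here is a rough sketch of that extraction step on a made-up HTML snippet. It uses the standard library's `html.parser` in place of BeautifulSoup so that it runs without a network call. The sample HTML and the class name `LdJsonExtractor` are illustrative assumptions, not IMDb's actual markup.

```python
import json
from html.parser import HTMLParser

class LdJsonExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script' and dict(attrs).get('type') == 'application/ld+json':
            self._inside = True
            self.blocks.append('')

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append to the current block
        if self._inside:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == 'script':
            self._inside = False

sample_html = '''
<html><head>
<script type="application/ld+json">
{"name": "Example Movie", "genre": ["Drama"], "aggregateRating": {"ratingValue": 9.3}}
</script>
</head><body></body></html>
'''

parser = LdJsonExtractor()
parser.feed(sample_html)
data = json.loads(parser.blocks[0])  # same json.loads step as in get_a_movie
print(data['name'], data['aggregateRating']['ratingValue'])
```

Once the JSON is loaded, nested values such as `aggregateRating.ratingValue` are read with plain dictionary indexing, which is what `get_a_movie` does for each column.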

## Get All Movies

In [7]:
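# Function to get all movies in the list
def get_all_movies(links_list):
    """Scrape every link and return one DataFrame with a row per successfully scraped movie."""
    dicts = []

    for link in tqdm(links_list):
        movie = get_a_movie(link)
        if movie is not None:  # skip movies that failed to scrape
            dicts.append(movie)

    # Build the DataFrame once, after the loop
    return pd.DataFrame(dicts)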

In [22]:
# get all movies
movies = get_all_movies(links_list)

100%|████████████████████████████████████████████████████████████████████████████████| 250/250 [05:26<00:00,  1.30s/it]
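A request can fail partway through a long scrape, and the progress bar counts failed links as processed. One possible way to keep track of failures is sketched below with a stand-in `fetch` function so that it runs offline. The names `collect_records` and `fake_fetch` and the `delay` parameter are assumptions for illustration; in the notebook, `fetch` would be `get_a_movie`, which yields None when a page cannot be parsed.

```python
import time

def collect_records(links, fetch, delay=0.0):
    """Call fetch(link) for each link; return (records, failed_links)."""
    records, failed = [], []
    for link in links:
        result = fetch(link)
        if result is None:
            failed.append(link)   # remember the link so it can be retried later
        else:
            records.append(result)
        if delay:
            time.sleep(delay)     # optional pause between requests
    return records, failed

def fake_fetch(link):
    # Stand-in for get_a_movie: pretend that links containing 'bad' fail
    return None if 'bad' in link else {'movie': link}

records, failed = collect_records(['a', 'bad-link', 'b'], fake_fetch)
print(len(records), failed)
```

Comparing `len(records)` with the number of links gives a direct check on how many movies were actually scraped.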


In [10]:
# Make a function that converts a `duration` string such as 'PT2H22M' into minutes
def duration_to_min(duration):
    # The non-capturing groups skip the H and M letters, so each captured group holds digits only
    match = re.match(r'PT(?:(\d+)H)?(?:(\d+)M)?', duration)
    hours = int(match.group(1)) if match.group(1) else 0
    minutes = int(match.group(2)) if match.group(2) else 0
    return hours * 60 + minutes  # duration in minutes
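As a worked check, the value `PT2H22M` seen earlier means 2 hours and 22 minutes, which is 2 × 60 + 22 = 142 minutes. Below is a small sketch of a stricter variant that returns None for missing or unexpected values instead of raising. The name `iso_duration_to_minutes` is hypothetical, and the sketch covers only the hour and minute parts.

```python
import re

# Only the hour and minute parts are handled; anything else is treated as unparseable
DURATION = re.compile(r'PT(?:(\d+)H)?(?:(\d+)M)?')

def iso_duration_to_minutes(value):
    """Convert 'PT2H22M'-style strings to minutes, or return None if the value does not fit."""
    if not isinstance(value, str):
        return None
    match = DURATION.fullmatch(value)
    # Reject strings like 'PT' that match the pattern but carry no hours or minutes
    if match is None or not any(match.groups()):
        return None
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes

for raw in ['PT2H22M', 'PT45M', 'PT3H', 'PT', None]:
    print(raw, '->', iso_duration_to_minutes(raw))
```

Returning None rather than 0 for unparseable input keeps missing durations distinguishable from genuinely short ones.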

In [11]:
# Apply function to the column and create new variable
movies['duration_min'] = movies['duration'].apply(duration_to_min)

In [17]:
# Create function that converts `published` to datetime datatype
def change_datetime(df):
    # pd.to_datetime works on the whole column at once, so no per-row apply is needed
    df['published'] = pd.to_datetime(df['published'], format='%Y-%m-%d')
    return df

In [18]:
# apply function
change_datetime(movies)


In [14]:
# Drop the `duration` column; `duration_min` replaces it
movies.drop(columns='duration',inplace=True)

In [19]:
movies

| | movie | alternateName | description | ratingCount | ratingValue | genre1 | genre2 | genre3 | published | keywords | actor1 | actor2 | actor3 | director | duration_min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | | Over the course of several years, two convicts... | 2824041 | 9.3 | Drama | | | 1994-10-14 | prison,based on the works of stephen king,esca... | Tim Robbins | Morgan Freeman | Bob Gunton | Frank Darabont | 142 |
| 1 | The Godfather | | Don Vito Corleone, head of a mafia family, dec... | 1968486 | 9.2 | Crime | Drama | | 1972-03-24 | mafia,patriarch,crime family,organized crime,r... | Marlon Brando | Al Pacino | James Caan | Francis Ford Coppola | 175 |
| 2 | The Dark Knight | | When the menace known as the Joker wreaks havo... | 2805532 | 9.0 | Action | Crime | Drama | 2008-07-18 | dc comics,psychopath,moral dilemma,superhero,c... | Christian Bale | Heath Ledger | Aaron Eckhart | Christopher Nolan | 152 |
| 3 | The Godfather Part II | | The early life and career of Vito Corleone in ... | 1336147 | 9.0 | Crime | Drama | | 1974-12-18 | revenge,1950s,corrupt politician,cuban revolut... | Al Pacino | Robert De Niro | Robert Duvall | Francis Ford Coppola | 202 |
| 4 | 12 Angry Men | | The jury in a New York City murder trial is fr... | 841987 | 9.0 | Crime | Drama | | 1957-04-10 | jury,dialogue driven,murder,courtroom,trial | Henry Fonda | Lee J. Cobb | Martin Balsam | Sidney Lumet | 96 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 245 | Les quatre cents coups | The 400 Blows | A young boy, left without attention, delves in... | 125476 | 8.1 | Crime | Drama | | 1959-11-16 | coming of age,skipping school,mother son relat... | Jean-Pierre Léaud | Albert Rémy | Claire Maurier | François Truffaut | 99 |
| 246 | Aladdin | | A kind-hearted street urchin and a power-hungr... | 453913 | 8.0 | Animation | Adventure | Comedy | 1992-11-25 | genie,three wishes,arab,princess,prince | Scott Weinger | Robin Williams | Linda Larkin | Ron Clements | 90 |
| 247 | Dances with Wolves | | Lieutenant John Dunbar, assigned to a remote w... | 284086 | 8.0 | Adventure | Drama | Western | 1990-11-21 | friendship,native american,wolf,19th century,f... | Kevin Costner | Mary McDonnell | Graham Greene | Kevin Costner | 181 |
| 248 | Life of Brian | | Born on the original Christmas in the stable n... | 417338 | 8.0 | Comedy | | | 1979-08-17 | monty python,actor playing multiple roles,sati... | Graham Chapman | John Cleese | Michael Palin | Terry Jones | 94 |


## Findings
- All 250 movies were scraped successfully.
- The data table includes an `alternateName` column. For movies listed under their original foreign-language title, it gives the English title; otherwise it is empty.
- `genre1`, `genre2`, `genre3` hold the movie's genres. A slot is None if the movie has only 1 or 2 genres.
- `actor1`, `actor2`, `actor3` hold the lead actors and actresses. A slot is None if only 1 or 2 are listed.
- `published` is stored as a datetime.
- `duration_min` gives the movie's runtime in minutes.
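The task asks for only text or numeric columns. Below is a quick sketch of how that could be checked on the scraped records before building the DataFrame. `non_scalar_fields` is a hypothetical helper and the sample records are made up.

```python
def non_scalar_fields(records):
    """Return the sorted names of fields whose value is a list, tuple, set or dict in any record."""
    bad = set()
    for record in records:
        for key, value in record.items():
            if isinstance(value, (list, tuple, set, dict)):
                bad.add(key)
    return sorted(bad)

sample_records = [
    {'movie': 'Example A', 'ratingValue': 9.3, 'genre1': 'Drama', 'genre2': None},
    {'movie': 'Example B', 'ratingValue': 8.0, 'genre': ['Crime', 'Drama']},  # a list sneaks in
]

print(non_scalar_fields(sample_records))
```

An empty result means every field holds a scalar (text, number, or None), which is the shape the flattened table is meant to have.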