# Design an App that predicts movie genres and detects spoilers in reviews

Now that I have the CSV files of the movies, split by decade and containing the `tconst` identifiers, I have everything I need to write my web scraper and run it. So let's jump into it...

### Part 2: IMDb web scraper

Author: Sana Krichen    
https://www.linkedin.com/in/sanakrichen/    
https://github.com/skrichen
    

Let's proceed with our imports

In [12]:
# Standard imports
import numpy as np
import pandas as pd
import ast 

# For scraping
import requests
from bs4 import BeautifulSoup

# For adding delays so that we don't spam requests
import time
# JSON, regex and string utilities
import string
import re
import json


The following function extracts the genres from the HTML code of an IMDb title page. In principle, it could be replaced by the following command in my web scraper:

    #X=json.loads(str(soup.findAll('script', {'type':'application/ld+json'})[0].text))
    #X['genre']

However, for some reason, it returns nothing for me. My guess is that this is related to the version of `BeautifulSoup` that I have installed, but I did not push the troubleshooting any further. I preferred to save myself some time and a headache by writing a function that does the job. It is not the cleanest way to do this, but it gets the job done most of the time.
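For reference, here is a minimal sketch of the JSON route, assuming the page embeds a `<script type="application/ld+json">` block whose `genre` value is either a list or a single string. The helper name `genres_from_json_ld` and the sample HTML are illustrative and not part of this notebook. One possible cause of the empty result (an assumption, not verified in this environment) is that recent BeautifulSoup releases do not treat the contents of `<script>` tags as text, so `.text` comes back empty while the tag's `.string` still holds the JSON.

```python
import json
import re

# Matches the body of a JSON-LD script tag. A regex is enough for this sketch;
# with BeautifulSoup the equivalent would be passing the tag's `.string` to json.loads.
JSON_LD_PATTERN = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)

def genres_from_json_ld(html):
    """Return the genres found in the first JSON-LD block, or [] if there are none."""
    match = JSON_LD_PATTERN.search(html)
    if match is None:
        return []
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    genre = data.get("genre", [])
    # The value may be a single string or a list of strings.
    return [genre] if isinstance(genre, str) else list(genre)

# Hypothetical, hand-written sample standing in for a downloaded page.
sample_html = '''
<html><head>
<script type="application/ld+json">
{"@type": "Movie", "name": "Example", "genre": ["Crime", "Drama"]}
</script>
</head></html>
'''
print(genres_from_json_ld(sample_html))
```

Parsing the JSON directly keeps hyphenated genre names intact and does not depend on which key happens to follow `genre`.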

The image below shows a typical output of the command `str(soup.findAll('script', {'type':'application/ld+json'}))`:

<img src="https://drive.google.com/uc?export=view&id=1wOoZw8DpaiqGswURMqFoSNs0kUyK3gzd" width="640" height="480">


This looks like a JSON dictionary, and the goal of the function is to extract what comes after the key `genre`. However, I cannot extract everything after it: I need to stop when I reach the next key.
The genres are stored in a list, so I figured that reaching the closing bracket `]` would be a good stopping condition. After running the scraper for a few trials, however, I had to add other stop tokens such as `'"actor":','"director":','"creator":','"datePublished":','}</script>]'` to avoid errors.


The idea of the function is to get the dictionary, convert it to a string, split it on whitespace, and go through every token until we locate `"genre":`. From there, we append every token that follows to a list of genres, after stripping any punctuation. We keep appending until we hit one of the stop tokens mentioned above.

In [14]:
def get_genre(str_dic):
    '''
    Extract the genres from the tokenized JSON-LD text of an IMDb title page.
    '''
    genre_list=[]
    # stop tokens: the closing bracket of the genre list or the start of the next key
    stop_tokens=['],','"actor":','"director":','"creator":','"datePublished":','}</script>]']
    # loop through every element in str_dic until we hit the word genre
    for i in str_dic:
        if i == '"genre":':
            # keep track of the index of the token that follows '"genre":'
            j=str_dic.index(i)+1
            # keep appending words to genre_list until we reach a stop token (or run out of tokens)
            while j<len(str_dic) and str_dic[j] not in stop_tokens:
                word=str_dic[j]
                # strip punctuation such as '[', ']', ',' and quotes
                for punctuation_mark in string.punctuation:
                    word=word.replace(punctuation_mark,'')
                if word!="":
                    genre_list.append(word)
                j+=1
            break

    return genre_list
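To see why several stop tokens are needed, here is a small self-contained sketch of the same token scan applied to two hand-written samples, one compact and one pretty-printed. The function `scan_genres` is a compact restatement written for this illustration, and both sample strings are hypothetical rather than real IMDb output.

```python
import string

STOP_TOKENS = {'],', '"actor":', '"director":', '"creator":',
               '"datePublished":', '}</script>]'}

def scan_genres(tokens):
    """Collect the tokens that follow '"genre":' until a stop token or the end."""
    genres = []
    if '"genre":' not in tokens:
        return genres
    j = tokens.index('"genre":') + 1
    strip_punct = str.maketrans('', '', string.punctuation)
    while j < len(tokens) and tokens[j] not in STOP_TOKENS:
        word = tokens[j].translate(strip_punct)
        if word:
            genres.append(word)
        j += 1
    return genres

# Compact JSON: the closing bracket is glued to the last genre ('"Sci-Fi"],'),
# so the scan only stops at the next key, '"actor":'.
compact = '"name": "X", "genre": ["Crime", "Sci-Fi"], "actor": []'.split()

# Pretty-printed JSON: the closing bracket is its own token, '],'.
pretty = '''"genre": [
    "Crime",
    "Drama"
  ],
  "datePublished": "2000-01-01"'''.split()

print(scan_genres(compact))
print(scan_genres(pretty))
```

Note that stripping every character in `string.punctuation` also removes hyphens, so a genre such as `Sci-Fi` comes out as `SciFi` with this approach.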

Let's write a web scraper for a single movie. Given the unique identifier of the movie, we first access its plot synopsis. If there is one (some movies on the IMDb website have no synopsis), we store it along with the identifier and then fetch the movie's genres using the function `get_genre()`.


In [15]:

#tconst ='tt0110912'
#tconst='tt0133093'
#tconst='tt0000009'
#tconst='tt0000739'
#tconst='tt0001163'
#tconst='tt0003140'
#tconst='tt0109830'
#tconst='tt0049310'

tconst='tt0068646'

# First we pull down the content of the movie's plot summary page, using Requests.
response2=requests.get(f'https://www.imdb.com/title/{tconst}/plotsummary')

# Turn the undecoded content into a Beautiful Soup object and assign it to a variable
# (the parser is named explicitly so the result does not depend on which parsers are installed)
soup2 = BeautifulSoup(response2.content,'html.parser')

# extract the plot content
plot=soup2.find('ul', id="plot-synopsis-content").text.strip()

# initialize an empty dictionary
imdb_to_scrap = {'tconst':"", 'genre':[], 'plot':''}

# Check if the movie has a plot, if yes proceed
if plot!='It looks like we don\'t have a Synopsis for this title yet. Be the first to contribute! Just click the "Edit page" button at the bottom of the page or learn more in the Synopsis submission guide.':
    
    # assign tconst its value
    imdb_to_scrap = {'tconst':tconst, 'genre':[], 'plot':''}
    
    # assign the plot its content
    imdb_to_scrap['plot']=plot

    # Send a get request and assign the response to a variable
    response = requests.get(f'https://www.imdb.com/title/{tconst}/')

    

    # Turn the response.content into a Beautiful Soup object
    soup = BeautifulSoup(response.content,'html.parser')
    

    # extract genres using get_genre
    str_dic=str(soup.findAll('script', {'type':'application/ld+json'})).split()
    imdb_to_scrap['genre']=[get_genre(str_dic)]


print(imdb_to_scrap)

{'tconst': 'tt0068646', 'genre': [['Crime', 'Drama']], 'plot': 'In late summer 1945, guests are gathered for the wedding reception of Don Vito Corleone\'s daughter Connie (Talia Shire) and Carlo Rizzi (Gianni Russo). Vito (Marlon Brando), the head of the Corleone Mafia family, is known to friends and associates as "Godfather." He and Tom Hagen (Robert Duvall), the Corleone family lawyer, are hearing requests for favors because, according to Italian tradition, "no Sicilian can refuse a request on his daughter\'s wedding day." One of the men who asks the Don for a favor is Amerigo Bonasera, a successful mortician and acquaintance of the Don, whose daughter was brutally beaten by two young men because she refused their advances; the men received minimal punishment from the presiding judge. The Don is disappointed in Bonasera, who\'d avoided most contact with the Don due to Corleone\'s nefarious business dealings. The Don\'s wife is godmother to Bonasera\'s shamed daughter, a relationship 
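The synopsis check in the cell above compares the scraped text against the full placeholder sentence, and it assumes that the `plot-synopsis-content` element is always present. A slightly more tolerant variant could look like the sketch below. The helper name `clean_synopsis` is hypothetical, and matching only on the opening words of the placeholder is an assumption about how stable that message is.

```python
# Opening words of the placeholder that IMDb shows when a title has no synopsis
# (copied from the comparison string used in this notebook).
SYNOPSIS_PLACEHOLDER_PREFIX = "It looks like we don't have a Synopsis for this title yet"

def clean_synopsis(raw_text):
    """Return the stripped synopsis, or None when it is missing or only the placeholder."""
    if raw_text is None:
        return None
    text = raw_text.strip()
    if not text or text.startswith(SYNOPSIS_PLACEHOLDER_PREFIX):
        return None
    return text

print(clean_synopsis("  In late summer 1945, guests are gathered...  "))
print(clean_synopsis("It looks like we don't have a Synopsis for this title yet. Be the first to contribute!"))
print(clean_synopsis(None))
```

With BeautifulSoup, `raw_text` would be the `.text` of the element when `find` returns one, and `None` otherwise.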

Perfect!!! Now that I can do it for one movie, let's make a loop and do it for the movies that I previously collected.

Let's, for instance, choose the `df_nineties` dataset.

In [16]:
df_nineties=pd.read_csv('/Users/SanaKrichen/Desktop/Brainstation/Capstone Project/nineties/df_nineties.csv')
df_nineties

Unnamed: 0,tconst,primaryTitle,startYear
0,tt0015724,Dama de noche,1993
1,tt0059900,"Wenn du groß bist, lieber Adam",1990
2,tt0065188,"Vojtech, receny sirotek",1990
3,tt0066498,The Ear,1990
4,tt0072670,Attila 74: The Rape of Cyprus,1995
...,...,...,...
13952,tt9501772,Nan Fu Nv Zhu Ren,1999
13953,tt9531456,Proklyaty i zabyty,1997
13954,tt9673398,"I Respond to You, God",1996
13955,tt9673442,Blood and Musk,1997


Let's rewrite the previous web scraper so that it loops over a series of unique identifiers, stored in this case in `df_nineties['tconst']`. Every time I have collected all the info I need for a movie, I append it to my `scrap_df` dataframe. The following code also includes some improvements. Sometimes, when the code requests the URL of a movie, it gets a `404 Error - IMDb` page, which means that the page it wants to access cannot be found. Without a guard, the code would throw an error and the loop would stop. To avoid this problem, I added a conditional statement so that the extraction commands run only when the page is not a 404 page.
Looping over more than ten thousand movies takes a long time, so to keep an eye on the overall progress I added a few lines of code that print a notification every time another 50 movies are stored in my `scrap_df` dataframe.
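The imports at the top mention adding delays between requests, so here is a hedged sketch of how a pause and per-movie error handling could be wrapped around the loop. The names `scrape_ids` and `fetch_page` and the default delay are assumptions made for this illustration. `fetch_page` stands for whatever function downloads and parses one movie, and a fake version is used here so that the example runs without network access.

```python
import time

def scrape_ids(ids, fetch_page, delay_seconds=1.0, report_every=50):
    """Call fetch_page(tconst) for each id, pausing between requests.

    fetch_page is a hypothetical callable that returns a dict for a usable
    movie, or None when the page is missing or has no synopsis.
    """
    rows = []
    for position, tconst in enumerate(ids):
        if position > 0:
            time.sleep(delay_seconds)  # space out the requests
        try:
            row = fetch_page(tconst)
        except Exception as error:  # one bad page should not stop the whole run
            print(f'skipping {tconst}: {error}')
            continue
        if row is None:
            continue
        rows.append(row)
        if len(rows) % report_every == 0:
            print(f'{len(rows)} movies stored, last one: {tconst}')
    return rows

def fake_fetch(tconst):
    """Stand-in for the real download: one id fails, one has no synopsis."""
    if tconst == 'tt0000002':
        raise ValueError('404')
    if tconst == 'tt0000003':
        return None
    return {'tconst': tconst, 'genre': ['Drama'], 'plot': 'placeholder plot'}

rows = scrape_ids(['tt0000001', 'tt0000002', 'tt0000003', 'tt0000004'],
                  fake_fetch, delay_seconds=0)
print([row['tconst'] for row in rows])
```

Collecting the rows in a plain list and building the dataframe once at the end also avoids growing a dataframe inside the loop.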

In [22]:
# initialize a dataframe with the columns tconst, genre and plot
scrap_df = pd.DataFrame(columns=['tconst', 'genre', 'plot'])
i=0

# Looping over all my movies
for tconst in df_nineties['tconst']:

    imdb_to_scrap = {'tconst':"", 'genre':[], 'plot':""}
    
    # Send a get request and assign the response to a variable
    response2=requests.get(f'https://www.imdb.com/title/{tconst}/plotsummary')
    # Turn the undecoded content into a Beautiful Soup object and assign it to a variable
    soup2 = BeautifulSoup(response2.content,'html.parser')
    
    
    #Make sure to execute my commands only if I don't have a 404 error
    if soup2.title.text.strip() != '404 Error - IMDb':
        # extract the plot content
        plot=soup2.find('ul', id="plot-synopsis-content").text.strip()

        # Check if the movie has a plot, if yes proceed
        if plot!='It looks like we don\'t have a Synopsis for this title yet. Be the first to contribute! Just click the "Edit page" button at the bottom of the page or learn more in the Synopsis submission guide.':

            # assign tconst and the plot their values
            imdb_to_scrap = {'tconst':tconst, 'genre':[], 'plot':plot}

            # Send a get request and assign the response to a variable
            response1 = requests.get(f'https://www.imdb.com/title/{tconst}/')
            # Turn the undecoded content into a Beautiful Soup object and assign it to a variable
            soup1 = BeautifulSoup(response1.content,'html.parser')

            # extract the genres of the movie
            str_dic=str(soup1.findAll('script', {'type':'application/ld+json'})).split()
            imdb_to_scrap['genre']=[get_genre(str_dic)]

            # append all the collected info into my dataframe
            # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
            df = pd.DataFrame.from_dict(imdb_to_scrap)
            scrap_df = pd.concat([scrap_df, df], ignore_index=True)

            #follow progress: print tconst every 50 iterations
            i+=1
            if i%50==0:
                print(f'after 50 append {tconst}')

#print scrap_df
scrap_df


after 50 append tt0100318
after 50 append tt0101811
after 50 append tt0102975
after 50 append tt0104815
after 50 append tt0106489
after 50 append tt0107983
after 50 append tt0109759
after 50 append tt0111161
after 50 append tt0112870
after 50 append tt0114369
after 50 append tt0116275
after 50 append tt0117737
after 50 append tt0118983
after 50 append tt0120018
after 50 append tt0120746
after 50 append tt0124268
after 50 append tt0137338
after 50 append tt0151738
after 50 append tt0172396
after 50 append tt0204709
after 50 append tt0270539
after 50 append tt1085811


Unnamed: 0,tconst,genre,plot
0,tt0097571,[Drama],When Amber film and photography collective dec...
1,tt0098990,[War],The film opens with a scene of an Islamic circ...
2,tt0099012,"[Comedy, Romance]","Synopsis by ""Cult Movies"" author Danny Peary....."
3,tt0099052,"[Comedy, Fantasy, Horror, Thriller]",A photographer named Jerry Manley arrives in V...
4,tt0099077,"[Biography, Drama]","In 1969, Dr. Malcolm Sayer (Robin Williams) is..."
...,...,...,...
1105,tt1260583,[Documentary],Twelve youngsters (ages 10 to 15) from the Fre...
1106,tt1409031,[Documentary],Filmed over a period of three years between Ju...
1107,tt2010915,"[Drama, Fantasy, Romance]",Beladingala Baale is the tale of a chess playe...
1108,tt4744132,"[Drama, Family, Romance]",Deepa Nandy lives in Taramati with her sister ...



In [None]:
# export scrap_df into a csv format and place it in the folder where I need it
scrap_df.to_csv(r'/Users/SanaKrichen/Desktop/Brainstation/Capstone Project/nineties/scraped_df_nineties.csv', index=False, header=True)

Please note that this web scraper is slightly adjusted for each dataset that I want to loop over. Here I am showcasing the scraper for movies from the nineties, so I am using my dataset for that decade. You will find many folders named after decades in my working directory. Each folder has its own copy of the web scraper, adjusted to read a specific input dataframe (in this case `df_nineties`) and to write a specific output file (in this case `scraped_df_nineties.csv`).
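As an alternative to keeping one copy of the scraper per folder, the input and output paths could be derived from the decade name. The sketch below assumes that every decade follows the same layout as the nineties folder (`<base>/<decade>/df_<decade>.csv` in, `<base>/<decade>/scraped_df_<decade>.csv` out). The helper `decade_paths` and the `base_dir` value are hypothetical.

```python
from pathlib import Path

DECADES = ['fifties', 'sixties', 'seventies', 'eighties', 'nineties',
           'twothousands', 'twothousandstens', 'twothousandstwenties']

def decade_paths(base_dir, decade):
    """Return (input_csv, output_csv) for one decade, assuming the nineties layout."""
    folder = Path(base_dir) / decade
    return folder / f'df_{decade}.csv', folder / f'scraped_df_{decade}.csv'

# 'Capstone Project' is a placeholder base directory for this example.
for decade in DECADES:
    input_csv, output_csv = decade_paths('Capstone Project', decade)
    print(input_csv.name, '->', output_csv.name)
```

The scraper loop could then take `input_csv` and `output_csv` as arguments, which keeps a single version of the code to maintain.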


After running my web scraper on every CSV file that I have, I get one CSV file of scraped info per decade. In other words, I end up with the following files:


In [20]:
print('The files are:')
for dec in ['fifties', 'sixties', 'seventies', 'eighties', 'nineties', 'twothousands', 'twothousandstens','twothousandstwenties']:
    print(f'scraped_df_{dec}.csv')

The files are:
scraped_df_fifties.csv
scraped_df_sixties.csv
scraped_df_seventies.csv
scraped_df_eighties.csv
scraped_df_nineties.csv
scraped_df_twothousands.csv
scraped_df_twothousandstens.csv
scraped_df_twothousandstwenties.csv


For more info, you can check out the folders named after the different decades. Please be aware that the code there is the same as in this web scraper, only slightly adjusted to use the right input and output. Please also note that I am not including all the comments and markdown cells in those notebooks, as everything is explained here.

Coming next: IMDB_EDA