# Movie Director stuff

Stuff about the goal.

whatever.

## HTTP Code

First, write a function that fetches one page of Metacritic's "Movie Releases by Score" list. The input is the page number within that list: page 0 returns the HTML for the top 100 movies, page 1 returns movies 101-200, and so on.

Notice the extra headers passed to the requests.get function. They are needed because Metacritic responds with a 403 error to any request that lacks a User-Agent HTTP header.

The function also checks that the response has status code 200 and an HTML content type, in case Metacritic changes the URL of the page. If the check fails, it waits 25 seconds and tries once more, returning None if the second attempt also fails.

In [1]:
from datetime import date, datetime
from typing import Optional, Iterable, Dict, Union
import pandas as pd
from bs4 import BeautifulSoup, element
from time import sleep
import requests
import numpy as np



def getMetaCriticPage(page_num: int) -> Optional[str]:
    """Return the HTML of one "Movie Releases by Score" page, or None if both attempts fail."""
    url = f"https://www.metacritic.com/browse/movies/score/metascore/all/filtered?sort=desc&page={page_num}"
    meta_critic_page_headers = {
        "accept": "text/html",
        # Metacritic answers with a 403 when the User-Agent header is missing.
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36"
    }
    # Try twice: Metacritic will sometimes reject a request, even at one request every 5 seconds.
    for attempt in range(2):
        http_req = requests.get(url, headers=meta_critic_page_headers)
        # .get avoids a KeyError if the response has no Content-Type header.
        if http_req.status_code == 200 and "text/html" in http_req.headers.get("content-type", ""):
            return http_req.text
        if attempt == 0:
            sleep(25)  # wait 25 seconds before the single retry
    return None
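
The check-then-retry idea described above can be tried out without touching the network. The following is only a sketch under stated assumptions: the names is_html_ok and fetch_with_retry, the (status_code, headers, body) tuple, and the injected fetch and sleep_fn callables are all hypothetical and are not part of the notebook or of the requests library. Headers are a plain dict with a lower-case key here, whereas requests matches header names case-insensitively.

```python
import time
from typing import Callable, Mapping, Optional, Tuple

# Hypothetical response shape for this sketch: (status_code, headers, body).
Response = Tuple[int, Mapping[str, str], str]


def is_html_ok(status_code: int, headers: Mapping[str, str]) -> bool:
    """True only for a 200 response whose Content-Type mentions text/html."""
    # .get avoids a KeyError when the Content-Type header is absent.
    return status_code == 200 and "text/html" in headers.get("content-type", "")


def fetch_with_retry(
    fetch: Callable[[], Response],
    retries: int = 1,
    delay_seconds: float = 25.0,
    sleep_fn: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch up to retries + 1 times and return the body, or None."""
    for attempt in range(retries + 1):
        status_code, headers, body = fetch()
        if is_html_ok(status_code, headers):
            return body
        if attempt < retries:
            sleep_fn(delay_seconds)  # wait before the next attempt
    return None
```

Passing the sleep function in as a parameter keeps the waiting policy separate from the request, so the retry logic can be exercised with canned responses and no real delay.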

## Web Scraping Code

This code uses BeautifulSoup to scrape the data from Metacritic. The entry point is the "getMetaCriticValues" function. Its input is the HTML of one "Movie Releases by Score" page, and its output is a generator that yields one dictionary per movie on that page. For example, if the first Metacritic page is the input, the generator produces a dictionary for each movie in the top 100.

We used a generator rather than a list to avoid keeping unnecessary intermediate objects in memory (e.g. a separate list holding a dict for each of the more than 10,000 movies).
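
To illustrate the point about generators, here is a small sketch that uses placeholder rows instead of real Metacritic data. The function fake_movie_rows and the built log are hypothetical names used only for this illustration. The log records each row at the moment it is created, which shows that rows are produced only when the consumer asks for them.

```python
from itertools import islice
from typing import Dict, Iterator, List


def fake_movie_rows(count: int, log: List[int]) -> Iterator[Dict[str, int]]:
    """Yield one placeholder row at a time, logging each row as it is built."""
    for rank in range(1, count + 1):
        log.append(rank)  # runs only when the consumer requests this row
        yield {"rank": rank}


built: List[int] = []
rows = fake_movie_rows(10_000, built)

# Creating the generator builds nothing; rows appear only as they are consumed.
first_three = list(islice(rows, 3))

print(first_three)
print(len(built))  # number of rows that were actually created
```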

The code also performs minor text normalization, because we wanted to remove website-specific quirks from the data before doing any real data cleaning.

In [2]:
def normalizeTitle(title: str) -> str:
    # lower-case and trim whitespace so titles compare consistently
    return title.lower().strip()


def findLastPage(html: str) -> int:
    # read the page number shown on the last pagination link
    soup = BeautifulSoup(html, 'html.parser')
    last_page = soup.find("li", {"class": "page last_page"})
    return int(last_page.find("a", {"class": "page_num"}).text)


def findMetaCriticReleaseDate(movie: element.Tag) -> Union[datetime, float]:
    dateText = movie.find("div", {"class": "clamp-details"}).find("span").text
    normalizedDateText = dateText.strip()
    # "TBA" marks a missing release date (np.nan is a plain Python float)
    if normalizedDateText == "TBA":
        return np.nan
    return datetime.strptime(normalizedDateText, "%B %d, %Y")


def findMetaCriticScore(movie: element.Tag, score_str: str) -> float:
    # the score value sits two siblings after the label span
    span = movie.find("span", string=score_str)
    text = str(span.next_sibling.next_sibling.text).strip()
    # "tbd" marks a missing score
    if text == "tbd":
        return np.nan
    try:
        return float(text)
    except ValueError:
        raise ValueError("unable to find meta critic score, and was not tbd. Most likely the website layout changed") from None


def getMetaCriticValues(html: str) -> Iterable[Dict]:
    soup = BeautifulSoup(html, 'html.parser')
    for single_movie in soup.find_all("td", {"class": "clamp-summary-wrap"}):
        release_date = findMetaCriticReleaseDate(single_movie)
        year = release_date.year if isinstance(release_date, datetime) else np.nan
        title = normalizeTitle(str(single_movie.find("h3").text))
        critic_rating = findMetaCriticScore(single_movie, "Metascore:")
        user_rating = findMetaCriticScore(single_movie, "User Score:")
        yield {
            "date_published": release_date,
            "title": title,
            "year": year,
            # Metascore is out of 100; divide by 10 to match the 0-10 user score scale
            "metacritic_critic_rating": critic_rating / 10,
            "metacritic_user_rating": user_rating
        }
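
The text handling in the helpers above can be checked on its own, separately from the HTML lookup. The sketch below repeats the same rules using only the standard library. The names parse_release_date and parse_score are hypothetical, and it differs from the notebook in two ways: it returns None for a missing date and math.nan for a missing score, where the notebook uses np.nan for both.

```python
import math
from datetime import datetime
from typing import Optional


def parse_release_date(raw: str) -> Optional[datetime]:
    """Parse text such as 'September 4, 1941'; 'TBA' means no date yet."""
    text = raw.strip()
    if text == "TBA":
        return None  # assumption of this sketch; the notebook returns np.nan
    return datetime.strptime(text, "%B %d, %Y")


def parse_score(raw: str) -> float:
    """Parse a score string; 'tbd' means no score yet."""
    text = raw.strip()
    if text == "tbd":
        return math.nan
    # float() raises ValueError on unexpected text, which signals a layout change.
    return float(text)


print(parse_release_date("  September 4, 1941 "))
print(parse_score("100") / 10)  # Metascore rescaled to a 0-10 range
```

Keeping the string rules in small pure functions makes it possible to test them with literal strings, so a change in the site's wording shows up as a failed parse rather than as silently wrong data.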



## Main Webscraping Script

With those functions in place, we can write the code that scrapes every page of Metacritic’s "Movie Releases by Score" list. First we check how many "Movie Releases by Score" pages there are on Metacritic. Then we iterate through each page and call the getMetaCriticValues function on it. At the end, we have a dataframe that holds all of the Metacritic data we need.

Note that the script sleeps for 5 seconds after each HTML page it fetches. This is to prevent Metacritic from blocking us.

In [3]:
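def iterateThroughMetaCriticValues() -> Iterable[Dict]:
    first_page = getMetaCriticPage(0)
    if first_page is None:
        raise RuntimeError("could not fetch the first Metacritic page, so the page count is unknown")
    last_num = findLastPage(first_page)

    # the last page number shown on the website is 1-based, but the page parameter
    # in the URL is 0-based, so the valid page numbers are 0 .. last_num - 1
    for page_num in range(last_num):
        # reuse the page that was already fetched to find the page count
        page = first_page if page_num == 0 else getMetaCriticPage(page_num)
        if page is not None:
            yield from getMetaCriticValues(page)
        sleep(5)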

metacritic=pd.DataFrame(iterateThroughMetaCriticValues(),columns=['date_published','title','year','metacritic_critic_rating','metacritic_user_rating'])
metacritic

Unnamed: 0,date_published,title,year,metacritic_critic_rating,metacritic_user_rating
0,1941-09-04,citizen kane,1941.0,10.0,8.4
1,1972-03-11,the godfather,1972.0,10.0,9.2
2,1954-09-01,rear window,1954.0,10.0,8.8
3,1943-01-23,casablanca,1943.0,10.0,8.9
4,2014-07-11,boyhood,2014.0,10.0,7.6
...,...,...,...,...,...
13579,1987-08-22,the garbage pail kids movie,1987.0,0.1,0.8
13580,2015-06-05,united passions,2015.0,0.1,0.7
13581,1996-01-12,bio-dome,1996.0,0.1,7.2
13582,2005-08-12,chaos,2005.0,0.1,2.3
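
The page numbering and pacing used by the script above can be sketched without any network access. In this sketch, crawl_pages, fetch_page, parse_page and sleep_fn are hypothetical names. The fetch and parse steps are passed in as plain callables so that the loop can be run against canned pages.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, TypeVar

Row = TypeVar("Row")


def crawl_pages(
    last_page_label: int,
    fetch_page: Callable[[int], Optional[str]],
    parse_page: Callable[[str], Iterable[Row]],
    delay_seconds: float = 5.0,
    sleep_fn: Callable[[float], None] = time.sleep,
) -> Iterator[Row]:
    """Yield rows from pages 0 .. last_page_label - 1, pausing after each page."""
    # The website labels pages from 1 while the URL parameter starts at 0,
    # so a last label of N corresponds to page numbers 0 .. N - 1.
    for page_num in range(last_page_label):
        html = fetch_page(page_num)
        if html is not None:  # skip pages that could not be fetched
            yield from parse_page(html)
        sleep_fn(delay_seconds)  # pause between requests to stay polite
```

Because the function is a generator, the pause happens between pages as rows are consumed, not up front.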


In [4]:
# Temporary: cache the scraped dataframe so that nobody has to re-run the long scrape each time.
# Saves it as a Python pickle file.
metacritic.to_pickle("metacritic.pickle")
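
The caching idea above can be wrapped in a small helper that runs the slow step only when no cached file exists. This is a sketch under assumptions: load_or_build is a hypothetical name, and it uses the standard-library pickle module on a generic object. The notebook itself uses the dataframe's to_pickle method together with pd.read_pickle.

```python
import pickle
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(path: Path, build: Callable[[], T]) -> T:
    """Return the cached object at path, building and saving it if absent."""
    if path.exists():
        # Only unpickle files you created yourself; pickle can execute code on load.
        with path.open("rb") as handle:
            return pickle.load(handle)
    result = build()  # the slow step, e.g. a full scrape
    with path.open("wb") as handle:
        pickle.dump(result, handle)
    return result
```

Deleting the cache file is enough to force a fresh build on the next call.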

# Data Cleaning

In [5]:
metacritic=pd.read_pickle("metacritic.pickle")
# usecols will specify columns we want to add from the csv to our dataframe
imdb = pd.read_csv("IMDB_movies.csv", usecols=['title','year','date_published','genre','duration','country','language','director','actors','avg_vote','votes','budget','usa_gross_income','worldwide_gross_income','metascore','reviews_from_users','reviews_from_critics'])
imdb
metacritic

Unnamed: 0,date_published,title,year,metacritic_critic_rating,metacritic_user_rating
0,1941-09-04,citizen kane,1941.0,10.0,8.4
1,1972-03-11,the godfather,1972.0,10.0,9.2
2,1954-09-01,rear window,1954.0,10.0,8.8
3,1943-01-23,casablanca,1943.0,10.0,8.9
4,2014-07-11,boyhood,2014.0,10.0,7.6
...,...,...,...,...,...
13579,1987-08-22,the garbage pail kids movie,1987.0,0.1,0.8
13580,2015-06-05,united passions,2015.0,0.1,0.7
13581,1996-01-12,bio-dome,1996.0,0.1,7.2
13582,2005-08-12,chaos,2005.0,0.1,2.3
