In [1]:
import pandas as pd
import requests
import time
import re

from bs4 import BeautifulSoup

In [2]:
import functions as fun

Using TensorFlow backend.


# Obtaining Data

All data for this report is gathered by web scraping the following websites:
    
- metacritic.com for movie rating information.
- rottentomatoes.com for additional movie rating information.
- SpringfieldSpringfield.co.uk for gathering the screenplay texts.

## Rating Data

There were a few attempts at scraping data before finding versions that worked well for my purposes. What follows are the final attempts.

### Scraping metacritic.com

I'm taking the highest-rated and lowest-rated films as listed on this site. These extremes will be used to train my classification models to predict whether an arbitrary movie will be rated highly or poorly.

#### Great Movies

In [3]:
goods_titles = []

for page_num in range(0, 20):
    # There are 20 pages to flip through of 100 movies each.
    page = requests.get(
        'https://www.metacritic.com/browse/movies/score/metascore/all/filtered?page={}'.format(page_num),
        headers={'User-Agent': 'Chrome/80.0.3987.116'})
    soup = BeautifulSoup(page.content, 'html.parser')
    
    # Now that we've gotten the content from the page, we need to loop through
    # each numbered title element. Searching once per page avoids re-scanning
    # the whole page for every title.
    for span in soup.find_all('span', class_="title numbered")[:100]:
        title = span.next_sibling.next_sibling.contents[1].contents[0]
        goods_titles.append(title)
    
    # We're only pinging 20 times, but pausing between requests costs next to
    # nothing and is kinder to the site.
    time.sleep(1)

KeyboardInterrupt: 

Knowing that I won't be able to come up with screenplays for every single movie, I'm taking 2000 great and 2000 terrible films, in the hopes of winding up with at least 1000 of each.
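Because the goal is a large pool of titles rather than a complete one, a single failed page does not need to abort the whole run. What follows is a minimal sketch of that idea, not the code used for this report: `scrape_pages`, `fake_fetch`, and `fake_parse` are hypothetical names, and the fetch and parse steps are passed in as functions so the example runs without network access.

```python
import time
from typing import Callable, Iterable, List


def scrape_pages(page_numbers: Iterable[int],
                 fetch: Callable[[int], str],
                 parse: Callable[[str], List[str]],
                 delay: float = 1.0) -> List[str]:
    """Collect titles from several listing pages, pausing between requests."""
    titles: List[str] = []
    for page_number in page_numbers:
        try:
            html = fetch(page_number)
        except Exception as error:
            # Assumption: losing one page is acceptable, so log it and move on.
            print('Skipping page {}: {}'.format(page_number, error))
            continue
        titles.extend(parse(html))
        time.sleep(delay)
    return titles


# Stand-ins for requests.get and the BeautifulSoup parsing step.
def fake_fetch(page_number: int) -> str:
    if page_number == 1:
        raise ConnectionError('simulated timeout')
    return 'Movie {}a|Movie {}b'.format(page_number, page_number)


def fake_parse(html: str) -> List[str]:
    return html.split('|')


sample_titles = scrape_pages(range(3), fake_fetch, fake_parse, delay=0)
print(sample_titles)
```

In a real run, `fetch` would wrap `requests.get` and `parse` would wrap the BeautifulSoup lookup, with `delay` left at a second or more.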

In [None]:
len(goods_titles)

In [None]:
goods_titles[:5]

#### Terrible Movies

In [None]:
bads_titles = []

for page_num in range(110, 130):
    # Flipping through 20 pages from the low end of the ranking, 100 movies each.
    page = requests.get(
        'https://www.metacritic.com/browse/movies/score/metascore/all/filtered?page={}'.format(page_num),
        headers={'User-Agent': 'Chrome/80.0.3987.116'})
    soup = BeautifulSoup(page.content, 'html.parser')
    
    # Now that we've gotten the content from the page, we need to loop through each element.
    for span in soup.find_all('span', class_="title numbered")[:100]:
        try:
            title = span.next_sibling.next_sibling.contents[1].contents[0]
        except (AttributeError, IndexError):
            # Skip entries whose markup doesn't match the expected layout.
            continue
        bads_titles.append(title)
    # We're only pinging 20 times, but pausing between requests costs next to
    # nothing and is kinder to the site.
    time.sleep(1)

In [None]:
len(bads_titles)

In [None]:
goods_formatted = fun.format_titles(goods_titles)
bads_formatted = fun.format_titles(bads_titles)

### Scraping rottentomatoes.com

The rottentomatoes.com information will be used for linear regression. Whereas with metacritic I used only the best and worst films for classification, here I'm sampling from the entire range of scores.

In [None]:
all_rotten_movies = []
rotten_scores = []
# Matches quoted movie paths such as "/m/some_title" in the page text.
comp = re.compile(r'"/m/\w+"')
for i in range(0, 101):
    page = requests.get("https://www.rottentomatoes.com/browse/"
                        "dvd-streaming-all?minTomato={}&maxTomato={}&services"
                        "=amazon;hbo_go;itunes;netflix_iw;vudu;amazon_prime;"
                        "fandango_now&genres=1;2;4;5;6;8;9;10;11;13;18;14"
                        "&sortBy=release".format(i, i+1))
    soup = BeautifulSoup(page.content, 'html.parser')
    page_text = soup.get_text()
    movies = comp.findall(page_text)
    # Strip the leading '"/m/' and the trailing quote, then drop duplicates.
    movies_unique = list({movie[4:-1] for movie in movies})
    # Every movie found on this page is recorded with score i.
    rotten_scores.extend([i] * len(movies_unique))
    all_rotten_movies.extend(movies_unique)
    print(i)
    print(movies_unique)
    time.sleep(1)
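The extraction step in the cell above can be tried on its own, without hitting the site. This is a small sketch under assumptions: `extract_slugs` is a hypothetical helper and `sample_text` is made-up page text shaped like the quoted `/m/...` paths the regex targets. Unlike a plain `set`, it keeps the slugs in the order they first appear.

```python
import re

# The capture group pulls out just the part after /m/.
MOVIE_PATH = re.compile(r'"/m/(\w+)"')


def extract_slugs(page_text):
    """Return the unique movie slugs in page_text, in first-seen order."""
    seen = set()
    slugs = []
    for slug in MOVIE_PATH.findall(page_text):
        if slug not in seen:
            seen.add(slug)
            slugs.append(slug)
    return slugs


# Made-up text for illustration only.
sample_text = ('{"url":"/m/parasite_2019"},{"url":"/m/up"},'
               '{"url":"/m/parasite_2019"}')
sample_slugs = extract_slugs(sample_text)
print(sample_slugs)
```

Keeping first-seen order makes reruns reproducible, which helps when the scores list has to stay aligned with the titles list.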

In [None]:
# Chopping off the year for those movies that have it.
rotten_movies_noyear = [film[:-5] if film[-4:-2] == '20' else film 
        for film in all_rotten_movies]
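The slice above only recognizes years starting with '20' and assumes the year is preceded by a separator. A regex-based alternative is sketched below, assuming slugs end in `_YYYY`; `strip_year` is a hypothetical helper and not part of `functions.py`.

```python
import re

# Assumption: a trailing year looks like _19xx or _20xx at the end of the slug.
TRAILING_YEAR = re.compile(r'_(?:19|20)\d{2}$')


def strip_year(slug):
    """Remove a trailing _YYYY from a slug, leaving other slugs unchanged."""
    return TRAILING_YEAR.sub('', slug)


# Caveat: a title that itself ends in such a number (for example
# 'blade_runner_2049') would also be trimmed.
for example in ['parasite_2019', 'casablanca_1942', 'up', '2012']:
    print(example, '->', strip_year(example))
```

Requiring the underscore keeps a slug that consists only of a year, such as `2012`, from being reduced to an empty string.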

In [None]:
rotten_form = fun.format_titles(rotten_movies_noyear)

## Scraping in the Screenplays

As this will be an analysis centered around natural language processing, my primary data source will be the screenplay text from every movie.

Unfortunately, I was eventually locked out of SpringfieldSpringfield.co.uk, the site I retrieved the content from, due to too many 'visits'. As I had hit the site upwards of 10k times over the course of a few days, it was probably a fair call.
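One way to cut down on repeat visits is to cache each page on disk the first time it is downloaded, so rerunning the notebook never requests the same URL twice. The sketch below illustrates the idea only and is not what `fun.grab_screenplays` does: `cached_fetch` and `counting_fetch` are hypothetical, and the URL is a placeholder.

```python
import hashlib
import tempfile
from pathlib import Path


def cached_fetch(url, fetch, cache_dir):
    """Return the page text for url, downloading it only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a safe, fixed-length file name.
    key = hashlib.sha256(url.encode('utf-8')).hexdigest()
    cache_file = cache_dir / (key + '.html')
    if cache_file.exists():
        return cache_file.read_text(encoding='utf-8')
    text = fetch(url)
    cache_file.write_text(text, encoding='utf-8')
    return text


# Stand-in for a real download that records how often it is called.
calls = []


def counting_fetch(url):
    calls.append(url)
    return '<html>script for {}</html>'.format(url)


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch('https://example.com/a', counting_fetch, tmp)
    second = cached_fetch('https://example.com/a', counting_fetch, tmp)

print(len(calls))
```

In practice `fetch` would wrap `requests.get(url).text`, and the cache directory would live next to the notebook rather than in a temporary folder.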

In [None]:
# Getting the good screenplays along with a list of titles I couldn't find scripts for.
the_good, good_errors = fun.grab_screenplays(goods_formatted)

In [None]:
len(good_errors)

In [None]:
# Getting the bad screenplays along with a list of titles I couldn't find scripts for.
the_bad, bad_errors = fun.grab_screenplays(bads_formatted)

In [None]:
len(bad_errors)

Putting each set into its own DataFrame to be used as data for the rest of the project.

In [None]:
df_good = pd.DataFrame([the_good]).T
df_bad = pd.DataFrame([the_bad]).T

Now getting the screenplays for the rottentomatoes titles.

In [None]:
rotten_movies, rotten_errors = fun.grab_screenplays(rotten_form)

In [None]:
rotten_df = pd.DataFrame(columns=['titles',
                                  'titles_formatted',
                                  'rotten_scores',
                                  'scripts'])

In [None]:
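# Getting the titles and scores loaded into the DataFrame.
rotten_df['titles'] = rotten_movies_noyear
rotten_df['titles_formatted'] = rotten_form
rotten_df['rotten_scores'] = rotten_scores

In [None]:
# Looking up each script by its formatted title. This assumes `rotten_movies`
# is a dict keyed by formatted title; titles without a script become None.
rotten_df['scripts'] = rotten_df['titles_formatted'].apply(rotten_movies.get)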

Saving all of this data to CSV files to be used with the later notebooks.

In [None]:
df_good.to_csv('df_good_obtain.csv')
df_bad.to_csv('df_bad_obtain.csv')
rotten_df.to_csv('rotten_df_obtain.csv')

To be continued with scrubbing in scrubbing.ipynb.