# Web Scraping BoxOfficeMojo with BeautifulSoup (more data added)

**The goal** of this notebook is to produce a dataframe with information about movies from [Box Office Mojo](https://www.boxofficemojo.com). The scraped dataset will be used for exploratory data analysis, feature engineering and linear regression modeling. The resulting dataframe is saved in csv format for easier access in the future, and also in the movies.py Python file, located in the same repository, for future projects. This notebook is a continuation of "attempt3", where I scraped data for 2019. It turned out I needed more data points, because budget values were missing for many movies.

**The code** executed below will **result** in extracting data about domestic movies from 2018 (roughly 800 of them), with fields such as movie title, total domestic gross revenue, runtime, rating, and budget.

#### Tools and Libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import time, os
import dateutil.parser
import numpy as np
import pandas as pd
import re
import seaborn as sns 
import matplotlib.pyplot as plt

from urllib.parse import urljoin

#### URL details for the request

In [2]:
extension_url = '/year/2018/?ref_=bo_yl_table_1'
base_url = 'https://www.boxofficemojo.com' 
url = base_url + extension_url

In [3]:
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')
table = soup.find_all('table')
df = pd.read_html(str(table))[0]
df.drop(columns=['Genre', 'Budget', 'Running Time'], inplace=True)

#### Helper functions: $ --> integer, hours + min --> min, date --> datestring

In [4]:
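def money_to_int(moneystring):
    if isinstance(moneystring, float):
        # NaN marks a missing value; int(NaN) would raise, so pass it through
        return moneystring if np.isnan(moneystring) else int(moneystring)
    return int(moneystring.replace('$', '').replace(',', ''))

def runtime_to_minutes(runtimestring):
    # Expects strings shaped like '2 hr 14 min'; anything else yields None
    if runtimestring is None:
        return None
    runtime = runtimestring.split()
    try:
        return int(runtime[0])*60 + int(runtime[2])
    except (IndexError, ValueError):
        return None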

def to_date(datestring):
    date = dateutil.parser.parse(datestring)
    return date
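
The runtime helper above returns None for any string that does not contain both an hour part and a minute part. Below is a minimal sketch of a more tolerant parser, assuming runtimes might also appear as '58 min' or '2 hr' (I have not checked which shapes the site actually uses). The name `parse_runtime` is hypothetical and not part of this notebook.

```python
import re

def parse_runtime(text):
    """Hypothetical helper: convert '2 hr 14 min', '2 hr' or '58 min' to minutes."""
    if not text:
        return None
    hours = re.search(r'(\d+)\s*hr', text)
    minutes = re.search(r'(\d+)\s*min', text)
    if not hours and not minutes:
        # Nothing recognizable as a runtime
        return None
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    return total

print(parse_runtime('2 hr 14 min'))
print(parse_runtime('58 min'))
print(parse_runtime(None))
```

Using regular expressions instead of positional indexing means the result does not depend on how many whitespace-separated tokens the string has.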

#### Helper function: find_next() to find the value that follows a field label.

In [5]:
def get_movie_value(soup, field_name):
    
    '''Grab a value from Box Office Mojo HTML
    
    Takes a string attribute of a movie on the page and returns the text of
    the element that follows it (the value for that attribute) or None if nothing is found.
    '''
    
    # string= is the current name of the older text= argument
    obj = soup.find(string=re.compile(field_name))
    
    if not obj: 
        return None
    
    # this works for most of the values
    next_element = obj.find_next()
    
    if next_element:
        return next_element.text 
    else:
        return None
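
To see what this lookup does without hitting the site, here is a small sketch that runs the same label-then-next-element pattern on a made-up HTML snippet. The markup, the `lookup` name and the `soup_demo` variable are assumptions for illustration only; the real page layout may differ.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup for illustration; the real page layout may differ
html = """
<div><span>Running Time</span><span>2 hr 14 min</span></div>
<div><span>MPAA</span><span>PG-13</span></div>
"""
soup_demo = BeautifulSoup(html, 'html.parser')

def lookup(soup, field_name):
    # Find the text node that matches the label, then take the next tag after it
    label = soup.find(string=re.compile(field_name))
    if label is None:
        return None
    next_tag = label.find_next()
    return next_tag.text if next_tag else None

runtime_value = lookup(soup_demo, 'Running')
rating_value = lookup(soup_demo, 'MPAA')
missing_value = lookup(soup_demo, 'Budget')  # label absent, so None
print(runtime_value, rating_value, missing_value)
```

Note that the next tag in document order is not necessarily a sibling of the label, so on pages with a different structure this pattern can pick up an unrelated element.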

#### Helper function: extracts movie stats, such as title, money, rating etc. and puts them into a dictionary.

I had to remove the *release_date* clause from the function below. It was causing a bug that I couldn't identify quickly enough, and I had to proceed, so for this particular exercise I sacrifice the date each movie was released.

In [6]:
def get_movie_dict(link):
    '''
    From BoxOfficeMojo link stub, request movie html, parse with BeautifulSoup, and
    collect 
        - title 
        - domestic gross
        - runtime 
        - MPAA rating
        - budget (0 when the page lists none)
    Return information as a dictionary.
    '''
    
    base_url = 'https://www.boxofficemojo.com'
    
    #Create full url to scrape
    url = urljoin(base_url, link)
    
    #Request HTML and parse
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page,"lxml")


    headers = ['movie_title', 'domestic_total_gross',
               'runtime_minutes', 'rating', 'budget']

    #Get title (split on the last ' - ' so hyphenated titles are not cut short)
    title_string = soup.find('title').text
    title = title_string.rsplit(' - ', 1)[0].strip()

    #Get domestic gross
    try:
        raw_domestic_total_gross = (soup.find(class_='mojo-performance-summary-table')
                                    .find_all('span', class_='money')[0]
                                    .text
                               )
    except (AttributeError, IndexError):
        # Summary table or money span not found on the page
        raw_domestic_total_gross = None

    if raw_domestic_total_gross is None:
        print('This is NaN')
        domestic_total_gross = float("NaN")
    else:
        domestic_total_gross = money_to_int(raw_domestic_total_gross)

    #Get runtime (runtime_to_minutes returns None when the value is missing or malformed)
    raw_runtime = get_movie_value(soup,'Running')
    runtime = runtime_to_minutes(raw_runtime)

    #Get rating
    rating = get_movie_value(soup,'MPAA')

#     I had to take out this part because it was causing a bug in the code,
#     and I couldn't pull enough values until I removed this block.

#     #Get release date
#     if '-' in get_movie_value(soup, 'Release Date'):
#         raw_release_date = get_movie_value(soup,'Release Date').split('-')[0]
#     elif '(' in get_movie_value(soup, 'Release Date'):
#         raw_release_date = get_movie_value(soup,'Release Date').split('(')[0]
#     else:
#         raw_release_date = get_movie_value(soup,'Release Date').split('(')[0]
#     release_date = to_date(raw_release_date)



    # Get budget alt
    raw_budget = get_movie_value(soup,'Budget')
    if raw_budget:
        budget = money_to_int(raw_budget)
    else:
        budget = 0

    #Create movie dictionary and return
    movie_dict = dict(zip(headers,[title,
                                domestic_total_gross,
                                runtime,
                                rating,
                                
                                budget]))

    return movie_dict


###### Now, we create expressions to sort through the html soup.

In [7]:
table = soup.find('table')
rows = [row for row in table.find_all('tr')]

We are looking for the value of each link's "href" attribute, which holds the extension to our base url. We use those extension links to access each individual movie page and extract its data.

In [8]:
rows[1].find_all('td')[1].find('a')

<a class="a-link-normal" href="/release/rl2992866817/?ref_=bo_yld_table_1">Black Panther</a>

In [9]:
rows[1].find_all('td')[1].find('a')['href']

'/release/rl2992866817/?ref_=bo_yld_table_1'

We slice the url extension found via the path shown above by manually setting the start and the end of the part we are interested in. For instance, to extract *rl2992866817* from the text *'/release/rl2992866817/?ref_=bo_yld_table_1'* returned in the cell above, we create a variable called *section*, which holds the extension of the movie link. We then define the *start* and the *end* of the id and use them to slice *section*, which gives the *substring*. We save those movie stubs into the *mojo_links* list. I've been told it works faster with Selenium, but I chose to focus on [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) instead.
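
As a worked example on a single link, the slicing described above can be checked on the href returned earlier. The regular-expression variant is an alternative I am adding for comparison, not what the notebook uses.

```python
import re

href = '/release/rl2992866817/?ref_=bo_yld_table_1'

# Slicing approach: cut between 'release/' and '/?ref_'
start = href.find('release/') + len('release/')
end = href.find('/?ref_')
stub = href[start:end]

# Alternative: capture whatever sits between '/release/' and the next '/'
match = re.search(r'/release/([^/]+)/', href)
stub_re = match.group(1) if match else None

print(stub, stub_re)
```

One difference worth knowing: `str.find` returns -1 when the marker is missing, so the slice silently produces a wrong string, while the regex version yields None in that case.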

In [10]:
mojo_links = []
# Skip the header row (index 0) and take at most the next 799 rows
for row in rows[1:800]:
    section = row.find('a')['href']
    start = section.find("release/") + len("release/")
    end = section.find("/?ref_")
    substring = section[start:end]
    mojo_links.append(substring)
mojo_links

['rl2992866817',
 'rl3043198465',
 'rl2071758337',
 'rl1602061825',
 'rl2488436225',
 'rl3075966465',
 'rl3095234049',
 'rl2449573377',
 'rl2088535553',
 'rl1954383361',
 'rl2734720513',
 'rl2164360705',
 'rl3108800001',
 'rl3439363585',
 'rl2399176193',
 'rl2742978049',
 'rl1157858817',
 'rl3111945729',
 'rl1342342657',
 'rl1534625281',
 'rl2105509377',
 'rl2206107137',
 'rl2977793537',
 'rl1700234753',
 'rl1828947457',
 'rl1493206529',
 'rl3380381185',
 'rl1594197505',
 'rl879986177',
 'rl3263137281',
 'rl2708702721',
 'rl494765569',
 'rl2122286593',
 'rl3229582849',
 'rl4014048769',
 'rl2105312769',
 'rl537036289',
 'rl419661313',
 'rl3590358529',
 'rl3909125633',
 'rl2139063809',
 'rl1526760961',
 'rl1744864769',
 'rl3053815297',
 'rl1157793281',
 'rl2936047105',
 'rl394167809',
 'rl235243009',
 'rl4077356545',
 'rl1048479233',
 'rl1382647297',
 'rl1179420161',
 'rl3663955457',
 'rl654542337',
 'rl3657664001',
 'rl4261774849',
 'rl2256438785',
 'rl1082099201',
 'rl3707995649',
 'rl

###### Now, we turn those movie link stubs into a list of dictionaries, using our helper function get_movie_dict() from above.

**Warning:** The cell below takes a long time to run. I saved its output in the *return_titles()* function, located in the *movies.py* module. It's also available as a text file called *scraping output*. All are located in the same repository. **Run at your own risk.**

In [11]:
dicts = []

for link in mojo_links:
    dicts.append(get_movie_dict('/release/{}/'.format(link)))

dicts

[{'movie_title': 'Black Panther',
  'domestic_total_gross': 700059566,
  'runtime_minutes': 134,
  'rating': 'PG-13',
  'budget': 0},
 {'movie_title': 'Avengers: Infinity War',
  'domestic_total_gross': 678815482,
  'runtime_minutes': 149,
  'rating': 'PG-13',
  'budget': 0},
 {'movie_title': 'Incredibles 2',
  'domestic_total_gross': 608581744,
  'runtime_minutes': 118,
  'rating': 'PG',
  'budget': 0},
 {'movie_title': 'Jurassic World: Fallen Kingdom',
  'domestic_total_gross': 417719760,
  'runtime_minutes': 128,
  'rating': 'PG-13',
  'budget': 170000000},
 {'movie_title': 'Deadpool 2',
  'domestic_total_gross': 318491426,
  'runtime_minutes': 119,
  'rating': 'R',
  'budget': 110000000},
 {'movie_title': 'The Grinch',
  'domestic_total_gross': 270620950,
  'runtime_minutes': 85,
  'rating': 'PG',
  'budget': 75000000},
 {'movie_title': 'Jumanji: Welcome to the Jungle',
  'domestic_total_gross': 404515480,
  'runtime_minutes': 119,
  'rating': 'PG-13',
  'budget': 90000000},
 {'mov
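
The loop above sends its requests back to back and stops at the first exception. Below is a minimal sketch of a gentler variant that pauses between calls and records failures instead of aborting. The names `scrape_all` and `fake_fetch` are hypothetical, the one-second default delay is an arbitrary assumption, and `fake_fetch` only stands in for get_movie_dict so the sketch runs offline.

```python
import time

def scrape_all(stubs, fetch, delay_seconds=1.0):
    """Call fetch(link) for each stub, pausing between calls and collecting failures."""
    results, failed = [], []
    for stub in stubs:
        try:
            results.append(fetch('/release/{}/'.format(stub)))
        except Exception as exc:
            # Keep going, but remember which stub failed and why
            failed.append((stub, repr(exc)))
        time.sleep(delay_seconds)
    return results, failed

def fake_fetch(link):
    # Stand-in for get_movie_dict; raises on a deliberately bad stub
    if 'bad' in link:
        raise ValueError('could not parse ' + link)
    return {'link': link}

results, failed = scrape_all(['rl2992866817', 'bad', 'rl3043198465'],
                             fake_fetch, delay_seconds=0)
print(len(results), len(failed))
```

In the notebook, passing get_movie_dict as `fetch` and mojo_links as `stubs` would be the intended use; the failed list then shows which pages need a second look.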

###### Finally, we save this list of dictionaries into a Pandas DataFrame and eventually a csv file, for faster data analysis in the next notebook.

In [21]:
movie_df_2018 = pd.DataFrame(dicts)

In [22]:
movie_df_2018.head()

| | movie_title | domestic_total_gross | runtime_minutes | rating | budget |
|---|---|---|---|---|---|
| 0 | Black Panther | 700059566 | 134.0 | PG-13 | 0 |
| 1 | Avengers: Infinity War | 678815482 | 149.0 | PG-13 | 0 |
| 2 | Incredibles 2 | 608581744 | 118.0 | PG | 0 |
| 3 | Jurassic World: Fallen Kingdom | 417719760 | 128.0 | PG-13 | 170000000 |
| 4 | Deadpool 2 | 318491426 | 119.0 | R | 110000000 |


In [23]:
movie_df_2018.to_csv('box_office_mojo_data_2018.csv')

The code above saved our *movie_df_2018* dataframe into a csv in the working directory on my computer. 

In [None]:
# Find the way to scrape for multiple years.
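
One possible starting point for the multi-year idea is to generate the yearly chart URLs in a loop. This is only a sketch: it assumes every year follows the same '/year/<year>/' pattern as the 2018 URL used earlier and that the '?ref_=' query string is optional, neither of which I have verified. `year_urls` is a hypothetical helper name.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.boxofficemojo.com'

def year_urls(first_year, last_year):
    """Hypothetical helper: map each year (inclusive range) to its yearly-chart URL."""
    return {year: urljoin(BASE_URL, '/year/{}/'.format(year))
            for year in range(first_year, last_year + 1)}

urls = year_urls(2017, 2019)
for year, url in urls.items():
    print(year, url)
```

Each URL could then go through the same request, table parsing and link extraction steps used for 2018, with the year stored alongside each movie.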

In [26]:
movie_df_2019 = pd.read_csv("box_office_mojo_data.csv")

In [31]:
# sort=True keeps the alphabetical column order shown below and avoids the pandas FutureWarning
movie_dfs = pd.concat([movie_df_2019, movie_df_2018], sort=True)


In [32]:
movie_dfs.head()

| | Unnamed: 0 | budget | domestic_total_gross | movie_title | rating | runtime_minutes |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 356000000 | 858373000 | Avengers: Endgame | PG-13 | 181.0 |
| 1 | 1.0 | 260000000 | 543638043 | The Lion King | PG | 118.0 |
| 2 | 2.0 | 200000000 | 434038008 | Toy Story 4 | G | 100.0 |
| 3 | 3.0 | 150000000 | 477373578 | Frozen II | PG | 103.0 |
| 4 | 4.0 | 160000000 | 426829839 | Captain Marvel | PG-13 | 123.0 |


In [33]:
movie_dfs.shape

(1598, 6)
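
The combined frame carries a leftover "Unnamed: 0" column (the index that was written into the 2019 csv), and both years keep their original row labels. Here is a hedged sketch of one way to tidy that up, shown on toy frames with made-up values rather than the real data; the `year` column is my own addition and not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for the two yearly frames (values are made up)
df_2019 = pd.DataFrame({'Unnamed: 0': [0, 1],
                        'movie_title': ['A', 'B'],
                        'budget': [100, 0]})
df_2018 = pd.DataFrame({'movie_title': ['C'], 'budget': [50]})

# Drop the saved index column, tag each frame with its year, and renumber the rows
df_2019 = df_2019.drop(columns=['Unnamed: 0'])
combined = pd.concat([df_2019.assign(year=2019), df_2018.assign(year=2018)],
                     ignore_index=True, sort=False)

print(combined)
```

Saving with `to_csv(..., index=False)` would avoid creating the extra column in the first place, at the cost of changing what later notebooks see when they read the file.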