# Overview

This notebook loads the TMDB data file with IMDB IDs, containing 10,000 movie entries, and scrapes the Box Office Mojo website for any or all of the following:
    * budget
    * domestic gross
    * worldwide gross
    * studio
    * MPAA rating

# Library Imports

In [1]:
import pandas as pd
import requests
import re
import numpy as np
from bs4 import BeautifulSoup as bs
import time
from random import randint
from time import sleep
from fake_useragent import UserAgent

In [2]:
ua = UserAgent()
header = {'User-Agent':str(ua.random)}
print(header)

{'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36'}


# Scraping Functions

In [3]:
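# Fetches the base (franchise index) page and hands the parsed HTML to franchise_links().
# The global dictionary `franchiselinks` must already exist when this is called.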
def get_franchises():
    followurl = 'https://www.the-numbers.com/movies/franchises'
    page = requests.get(followurl, headers=header)
    soup = bs(page.content, 'html.parser')
    franchise_links(soup)

In [4]:
# Collects the franchise links from the base page and stores them in the
# global dictionary `franchiselinks`, keyed by franchise name

def franchise_links(soup):
    for link in soup.find_all('a', href=True):
        if '/movies/franchise/' in link.get('href'):
            franchise = link.text  # the franchise name is used as the dictionary key
            franchiselinks[franchise] = link.get('href')
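
The function above writes into the global `franchiselinks` dictionary, so it only works once that dictionary exists. A minimal sketch of a variant that returns a new dictionary instead is shown here. The name `collect_franchise_links` and the `(text, href)` input pairs are hypothetical, introduced only for illustration, and the sample anchors are made up so the snippet runs without a network request.

```python
def collect_franchise_links(anchors, marker='/movies/franchise/'):
    """Return {link text: href} for every (text, href) pair whose href contains marker."""
    return {text.strip(): href for text, href in anchors if marker in href}


# Hypothetical input. In the notebook the pairs could be built from the soup with:
#   anchors = [(a.text, a['href']) for a in soup.find_all('a', href=True)]
sample_anchors = [
    ('Star Wars', '/movies/franchise/Star-Wars'),
    ('Home', '/'),
    ('James Bond', '/movies/franchise/James-Bond'),
]

links = collect_franchise_links(sample_anchors)
print(links)
```

Unlike the original, this sketch strips surrounding whitespace from the link text before using it as a key.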

In [5]:
# Follows a franchise link from the base page and returns the parsed franchise page
def grab_next(franchise_link):
    followurl = 'https://www.the-numbers.com' + franchise_link
    page = requests.get(followurl, headers=header)
    soup = bs(page.content, 'html.parser')
    return soup
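
A long scrape can hit temporary network failures, and `grab_next` as written raises on the first one. The sketch below shows one possible retry wrapper with exponential backoff. `fetch_with_retries` and its parameters are hypothetical names, not part of the original notebook, and the `fetch` argument stands in for whatever function performs the request.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on an exception wait and retry, doubling the delay each time.

    The last failure is re-raised once `attempts` calls have been made.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts:
                raise
            # delays are base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


# Possible use inside grab_next (assumes `requests` and `header` from the cells above):
#   page = fetch_with_retries(
#       lambda u: requests.get(u, headers=header, timeout=10), followurl)
```

Passing `sleep` as a parameter keeps the function easy to try out without actually waiting.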

In [6]:
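def scraper(links, enddict):
    # links: {franchise name: relative URL}
    # enddict: {'franchise': [], 'movie': []}, extended in place and returned
    count = 1
    length = len(links)
    ts = time.time()

    for key, value in links.items():
        print(('Item {} / {} - {}').format(count, length, key))

        soup = grab_next(value)
        franchise_table = soup.find('table', id="franchise_movies_overview")
        # If the table is missing (layout change, blocked request), skip this franchise
        # Every link in the table is collected, so release-date links land in
        # 'movie' next to the titles and have to be cleaned up afterwards
        entries = franchise_table.find_all('a', href=True) if franchise_table is not None else []
        for item in entries:
            enddict['franchise'].append(key)
            enddict['movie'].append(item.text)

        sleep(randint(0,2))  # random pause of 0 to 2 seconds between requests
        count += 1
    tnow = time.time()
    duration = round((tnow - ts), 2)
    scrape_average = round(duration/length, 2)
    print('{} minutes elapsed'.format(duration/60))
    print('{} seconds per item'.format(scrape_average))
    return enddict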

The script in plain English

    * make an empty dictionary for all of the franchise page links
    * get the base page with get_franchises
    * grab all of the links and store them in the dictionary
    * for each item in the dictionary, follow it to the franchise page
    * from the franchise page, grab the franchise and the movie name and put them in the dataframe
    * clean up the movie names with our cleanup function
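
The cleanup function itself is not shown in this notebook. As a rough sketch of what the last step could look like, the code below drops entries in the `movie` list that are really release dates (strings such as "May 6, 2022", which the scraper picks up alongside titles). The names `is_release_date` and `drop_date_rows` are hypothetical, and the date format `'%b %d, %Y'` is an assumption based on the dates seen in the scraped output.

```python
from datetime import datetime


def is_release_date(text, fmt='%b %d, %Y'):
    """Return True if text parses as a date such as 'Nov 5, 2021' (assumed format)."""
    try:
        datetime.strptime(text.strip(), fmt)
        return True
    except ValueError:
        return False


def drop_date_rows(scraped):
    """Return a copy of the scraped dict without the rows whose 'movie' is a date."""
    keep = [i for i, title in enumerate(scraped['movie']) if not is_release_date(title)]
    return {column: [values[i] for i in keep] for column, values in scraped.items()}


sample = {
    'franchise': ['Marvel Cinematic Universe'] * 4,
    'movie': ['May 6, 2022', 'Black Panther II', 'Nov 5, 2021', 'Eternals'],
}
cleaned = drop_date_rows(sample)
print(cleaned['movie'])
```

Month abbreviations are parsed according to the current locale, so this assumes an English locale.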

# Testing Set

Before we scrape for 10k results, we will do a small test scrape.

In [7]:
franchiselinks = {}

In [8]:
tempset = {'franchise' : [], 'movie' : []}

In [9]:
get_franchises()

In [10]:
franchiselinks

{'Marvel Cinematic Universe': '/movies/franchise/Marvel-Cinematic-Universe',
 'Star Wars': '/movies/franchise/Star-Wars',
 'James Bond': '/movies/franchise/James-Bond',
 'Batman': '/movies/franchise/Batman',
 'Harry Potter': '/movies/franchise/Harry-Potter',
 'Spider-Man': '/movies/franchise/Spider-Man',
 'X-Men': '/movies/franchise/X-Men',
 'Avengers': '/movies/franchise/Avengers',
 'Jurassic Park': '/movies/franchise/Jurassic-Park',
 'Star Trek': '/movies/franchise/Star-Trek',
 "Peter Jackson's Lord of the Rings": '/movies/franchise/Peter-Jacksons-Lord-of-the-Rings',
 'DC Extended Universe': '/movies/franchise/DC-Extended-Universe',
 'Indiana Jones': '/movies/franchise/Indiana-Jones',
 'Superman': '/movies/franchise/Superman',
 'Fast and the Furious': '/movies/franchise/Fast-and-the-Furious',
 'Shrek': '/movies/franchise/Shrek',
 'Rocky': '/movies/franchise/Rocky',
 'Pirates of the Caribbean': '/movies/franchise/Pirates-of-the-Caribbean',
 'Transformers': '/movies/franchise/Transform

In [11]:
#testing set

# Number of franchises to include in the test set
N = 3

# Take the first N items of the dictionary using items() + list slicing
out = dict(list(franchiselinks.items())[0: N])
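
The slice above copies every item into a list before taking the first N. An alternative sketch uses `itertools.islice`, which stops after N items. `first_n` is a hypothetical helper name, and the sample dictionary is a shortened stand-in for `franchiselinks`.

```python
from itertools import islice


def first_n(mapping, n):
    """Return a new dict with the first n items of mapping, in insertion order."""
    return dict(islice(mapping.items(), n))


sample_links = {
    'Marvel Cinematic Universe': '/movies/franchise/Marvel-Cinematic-Universe',
    'Star Wars': '/movies/franchise/Star-Wars',
    'James Bond': '/movies/franchise/James-Bond',
    'Batman': '/movies/franchise/Batman',
}
subset = first_n(sample_links, 3)
print(list(subset))
```

This relies on dictionaries preserving insertion order, which Python guarantees from version 3.7 on.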

In [12]:
tempstuff = scraper(out, tempset)

Item 1 / 3 - Marvel Cinematic Universe
Item 2 / 3 - Star Wars
Item 3 / 3 - James Bond
0.11883333333333333 minutes elapsed
2.38 seconds per item


In [13]:
tempstuff

{'franchise': ['Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Universe',
  'Marvel Cinematic Univer

In [14]:
franchises = pd.DataFrame(tempstuff)
franchises

| | franchise | movie |
|---|---|---|
| 0 | Marvel Cinematic Universe | May 6, 2022 |
| 1 | Marvel Cinematic Universe | Black Panther II |
| 2 | Marvel Cinematic Universe | Nov 5, 2021 |
| 3 | Marvel Cinematic Universe | Eternals |
| 4 | Marvel Cinematic Universe | Nov 5, 2021 |
| ... | ... | ... |
| 126 | James Bond | Casino Royale |
| 127 | James Bond | Thunderball |
| 128 | James Bond | Goldfinger |
| 129 | James Bond | From Russia With Love |


# The Big Scrape

We're ready to do the big scrape!
We'll break our frame of 10,000 entries into 10 smaller ones in case of any errors.
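
One way to split the work into smaller pieces is sketched below: the links are cut into fixed-size batches and each batch is scraped separately, so an error costs only that batch. `batches` and `scrape_in_batches` are hypothetical helpers, not part of the original notebook. `scrape_batch` is assumed to have the same signature as `scraper(links, enddict)`.

```python
from itertools import islice


def batches(mapping, size):
    """Yield successive dicts holding at most `size` items of mapping."""
    items = iter(mapping.items())
    while True:
        chunk = dict(islice(items, size))
        if not chunk:
            return
        yield chunk


def scrape_in_batches(links, size, scrape_batch):
    """Run scrape_batch on each batch; return (combined results, failed batches)."""
    results = {'franchise': [], 'movie': []}
    failed = []
    for number, chunk in enumerate(batches(links, size), start=1):
        try:
            # a fresh dict per batch, so a failed batch leaves no partial rows behind
            partial = scrape_batch(chunk, {'franchise': [], 'movie': []})
        except Exception as err:
            failed.append((number, err))
            continue
        for column in results:
            results[column].extend(partial[column])
    return results, failed


# Possible use with the notebook's own function:
#   movietitles, failed = scrape_in_batches(franchiselinks, 80, scraper)
```

The batch numbers in `failed` show which parts need to be run again.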

In [15]:
franchiselinks = {}

In [16]:
movietitles = {'franchise' : [], 'movie' : []}

In [17]:
get_franchises()

In [18]:
# Upper limit on the number of franchises to scrape
N = 800

# Take the first N items of the dictionary using items() + list slicing
listof800 = dict(list(franchiselinks.items())[0: N])

In [None]:
movietitles = scraper(listof800, movietitles)

Item 1 / 800 - Marvel Cinematic Universe


In [None]:
movietitles

In [None]:
franchises = pd.DataFrame(movietitles)
franchises

In [None]:
# Setting the title as the index
#franchises.set_index('id', inplace=True)

# Data Export

In [None]:
#exporting the dataframe to a csv
#franchises.to_csv('api_data/franchises_scraped.csv')