# Introduction

I want to combine streaming availability data with the box office info for some analysis. I'm curious whether movies that did well at the box office are more likely to be available to rent or buy than by subscription, and whether they are more likely to be on multiple platforms. I think this is important for figuring out whether the same things that make a movie successful at the box office will also make it successful in post-release streaming.

The problem is, this data isn't easy to find. The Numbers has a paid data set that I don't want to pay for. So does JustWatch.

I checked out https://api.watchmode.com/, which has a free option.

In [46]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import json
import gzip
import pickle

In [2]:
# Credit to https://stackoverflow.com/questions/4028904/
#    how-to-get-the-home-directory-in-python for how to get the home directory
from os.path import expanduser

def get_keys(path):
    """Load API keys from a JSON file. A leading '~' means the home directory."""
    with open(expanduser(path)) as f:
        return json.load(f)

# Watchmode API

In [61]:
keys = get_keys("~/.secret/watchmode_api.json")
api_key = keys['api_key']
len(api_key)

40

In [62]:
# Test out the new API with key and make sure we get a good response
url = 'https://api.watchmode.com/v1/list-titles'

url_params = {
                'apiKey': api_key,
                'types': 'movie',
                'regions': 'US',
                'source_types': 'sub',
                'page': '1'
             }
# source_types: sub, rent, buy, free, tve
# Starting with 'sub' for subscription

response = requests.get(url, params=url_params)
response.status_code

200

In [117]:
head_dict = dict(response.headers)

head_dict.keys()

dict_keys(['Date', 'Server', 'Access-Control-Allow-Origin', 'Access-Control-Allow-Methods', 'Access-Control-Allow-Headers', 'X-RateLimit-Limit', 'X-RateLimit-Remaining', 'Retry-After', 'X-Account-Quota', 'X-Account-Quota-Used', 'Content-Encoding', 'Vary', 'Keep-Alive', 'Connection', 'Transfer-Encoding', 'Content-Type'])
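
The quota and rate-limit headers listed above can be checked before committing to a long download. This is a minimal sketch, assuming the header values are integer strings; the helper name `quota_summary` is my own and is not part of Watchmode or requests.

```python
def quota_summary(headers):
    """Summarize API quota usage from a mapping of response headers.

    Assumes the header values are integer strings. A missing or
    non-numeric header is reported as None rather than raising.
    """
    def to_int(name):
        try:
            return int(headers[name])
        except (KeyError, TypeError, ValueError):
            return None

    quota = to_int('X-Account-Quota')
    used = to_int('X-Account-Quota-Used')
    remaining = None
    if quota is not None and used is not None:
        remaining = quota - used

    return {
        'quota': quota,
        'used': used,
        'remaining': remaining,
        'rate_limit_remaining': to_int('X-RateLimit-Remaining'),
    }
```

Calling `quota_summary(head_dict)` on the dictionary built above would show whether enough calls are left for the number of pages reported by the API.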

In [64]:
# Load the response into a dictionary and check out the keys
resp_dict = json.loads(response.text)

resp_dict.keys()

dict_keys(['titles', 'page', 'total_results', 'total_pages'])

In [66]:
# How many results did we get?
print(f"Total results available: {resp_dict['total_results']}")
print(f"Total pages available: {resp_dict['total_pages']}")

Total results available: 17783
Total pages available: 72


In [132]:
def watchmode_movies_listtitles(api_key, source_types
                                , start_page=1, run_test=False):
    """Requests movie information from Watchmode.com
    using the list-titles endpoint.
    US region is hardcoded, as this is meant for a free API user license.
    Pass the API key to use and the source type.
    Source type options are:
    sub (subscription); rent; buy; free; tve. Pass as a string or a list for
    multiple source types.
    run_test is False by default; set it to True
    to see only the number of results and the API quota used."""
    
    url = 'https://api.watchmode.com/v1/list-titles'
    df = pd.DataFrame()
    url_params = {
                    'apiKey': api_key,
                    'types': 'movie',
                    'regions': 'US',
                    'source_types': source_types,
                    'page': start_page
                 }
    
    # The first GET request always runs to obtain totals
    response = requests.get(url, params=url_params)
    
    # If response was good and we're actually going through all the pages...
    if response.status_code == 200 and run_test == False:
        resp_dict = json.loads(response.text)
        print(f"Total results available: {resp_dict['total_results']}")
        print(f"Total pages available: {resp_dict['total_pages']}")
        print(f"Initiating loop for each page.")
        end_page = resp_dict['total_pages']
        current_page = start_page
        
        while current_page <= end_page:
            print(f"Initiating page {current_page} of {end_page}...")
            url_params['page'] = current_page
            response = requests.get(url, params=url_params)
            resp_dict = json.loads(response.text)
            if len(df) == 0:
                df = pd.DataFrame(resp_dict['titles'])
            else:
                df_temp = pd.DataFrame(resp_dict['titles'])
                df = pd.concat([df, df_temp], ignore_index=True)
            print(f"Page {current_page} complete!")
            current_page += 1
            
        print(f"""API Calls Used: {dict(response.headers)['X-Account-Quota-Used']} 
        out of {dict(response.headers)['X-Account-Quota']}""")
        return df
    
    # If response was good and we're only running a test...
    elif response.status_code == 200 and run_test == True:
        resp_dict = json.loads(response.text)
        print(f"Total results available: {resp_dict['total_results']}")
        print(f"Total pages available: {resp_dict['total_pages']}")
        print(f"""API Calls Used: {dict(response.headers)['X-Account-Quota-Used']} 
        out of {dict(response.headers)['X-Account-Quota']}""")
        return None
    else:
        print(f"""Error response attempting initial call:{response.status_code}. 
        Aborted.""")
        return None
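
One thing I noticed about the function above: it requests the first page twice (once for the totals, once inside the loop), and it never checks the status code of the later pages. Below is a sketch of one way to avoid both. The HTTP call is passed in as `fetch_page`, a hypothetical callable of my own, so the paging logic can be tried out without spending API calls.

```python
def collect_titles(fetch_page, start_page=1):
    """Collect the 'titles' records from every page of a paginated endpoint.

    fetch_page(page) is a hypothetical callable supplied by the caller. It
    must return (status_code, body), where body is the parsed JSON dict with
    'titles' and 'total_pages' keys, as in the list-titles response above.
    Stops early and returns what it has if any page is not a 200.
    """
    titles = []
    page = start_page
    total_pages = None

    while total_pages is None or page <= total_pages:
        status, body = fetch_page(page)
        if status != 200:
            print(f"Page {page} returned status {status}. Stopping early.")
            break
        # Every page reports the total, so the first call doubles as the
        # totals check and no page is requested twice.
        total_pages = body['total_pages']
        titles.extend(body['titles'])
        page += 1

    return titles
```

With requests, `fetch_page` could be a small wrapper around `requests.get(url, params=url_params)` that sets the page number and returns the status code along with the parsed body. `pd.DataFrame(titles)` then builds the frame in one step instead of concatenating per page.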

In [100]:
# Get results for movies available by subscription
df_sub = watchmode_movies_listtitles(api_key=api_key, source_types='sub')

Total results available: 17783
Total pages available: 72
Initiating loop for each page.
Initiating page 1 of 72...
Page 1 complete!
Initiating page 2 of 72...
Page 2 complete!
Initiating page 3 of 72...
Page 3 complete!
Initiating page 4 of 72...
Page 4 complete!
Initiating page 5 of 72...
Page 5 complete!
Initiating page 6 of 72...
Page 6 complete!
Initiating page 7 of 72...
Page 7 complete!
Initiating page 8 of 72...
Page 8 complete!
Initiating page 9 of 72...
Page 9 complete!
Initiating page 10 of 72...
Page 10 complete!
Initiating page 11 of 72...
Page 11 complete!
Initiating page 12 of 72...
Page 12 complete!
Initiating page 13 of 72...
Page 13 complete!
Initiating page 14 of 72...
Page 14 complete!
Initiating page 15 of 72...
Page 15 complete!
Initiating page 16 of 72...
Page 16 complete!
Initiating page 17 of 72...
Page 17 complete!
Initiating page 18 of 72...
Page 18 complete!
Initiating page 19 of 72...
Page 19 complete!
Initiating page 20 of 72...
Page 20 complete!
Initiating

In [101]:
df_sub

|  | id | title | year | imdb_id | tmdb_id | tmdb_type | type |
|---|---|---|---|---|---|---|---|
| 0 | 11 | !Women Art Revolution | 2010 | tt1699720 | 55245 | movie | movie |
| 1 | 135 | #Female Pleasure | 2018 | tt8372826 | 540210 | movie | movie |
| 2 | 164 | Pellichoopulu | 2016 | tt5824826 | 393841 | movie | movie |
| 3 | 166 | #realityhigh | 2017 | tt6119504 | 455656 | movie | movie |
| 4 | 1155 | $elfie Shootout | 2016 | tt4004608 | 327946 | movie | movie |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 17778 | 4150062 | The Mystery of D.B. Cooper | 2020 | tt8694594 | 743073 | movie | movie |
| 17779 | 518463 | The Ballad of Buster Scruggs | 2018 | tt6412452 | 537996 | movie | movie |
| 17780 | 538946 | The First Temptation of Christ | 2019 | tt11310004 | 652020 | movie | movie |
| 17781 | 540875 | Dance Dreams: Hot Chocolate Nutcracker | 2020 | tt8741182 | 755328 | movie | movie |


In [104]:
ls

CONTRIBUTING.md                  awesome.gif
LICENSE.md                       data/
README.md                        student.ipynb
Streaming Data Collection.ipynb  zippedData/


In [106]:
# Export to file for safekeeping
df_sub.to_json('data/watchmode_subscription_titles_20210314.json')

In [112]:
# Test to make sure file exported ok
with open('data/watchmode_subscription_titles_20210314.json') as f:
    loaded = json.load(f)
    
df_test = pd.DataFrame(loaded)

df_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17783 entries, 0 to 17782
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         17783 non-null  int64 
 1   title      17783 non-null  object
 2   year       17783 non-null  int64 
 3   imdb_id    17783 non-null  object
 4   tmdb_id    17783 non-null  int64 
 5   tmdb_type  17783 non-null  object
 6   type       17783 non-null  object
dtypes: int64(3), object(4)
memory usage: 1.1+ MB


In [120]:
# Now, let's see how many results we'll get for the other source types
# Do I have enough API calls left this month to do all of them?

df_rent = watchmode_movies_listtitles(api_key=api_key, source_types='rent', run_test=True)

Total results available: 22669
Total pages available: 91
API Calls Remaining: 120


In [121]:
df_rent = watchmode_movies_listtitles(api_key=api_key, source_types='rent', run_test=False)

Total results available: 22669
Total pages available: 91
Initiating loop for each page.
Initiating page 1 of 91...
Page 1 complete!
Initiating page 2 of 91...
Page 2 complete!
Initiating page 3 of 91...
Page 3 complete!
Initiating page 4 of 91...
Page 4 complete!
Initiating page 5 of 91...
Page 5 complete!
Initiating page 6 of 91...
Page 6 complete!
Initiating page 7 of 91...
Page 7 complete!
Initiating page 8 of 91...
Page 8 complete!
Initiating page 9 of 91...
Page 9 complete!
Initiating page 10 of 91...
Page 10 complete!
Initiating page 11 of 91...
Page 11 complete!
Initiating page 12 of 91...
Page 12 complete!
Initiating page 13 of 91...
Page 13 complete!
Initiating page 14 of 91...
Page 14 complete!
Initiating page 15 of 91...
Page 15 complete!
Initiating page 16 of 91...
Page 16 complete!
Initiating page 17 of 91...
Page 17 complete!
Initiating page 18 of 91...
Page 18 complete!
Initiating page 19 of 91...
Page 19 complete!
Initiating page 20 of 91...
Page 20 complete!
Initiating

In [123]:
df_rent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22669 entries, 0 to 22668
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         22669 non-null  int64 
 1   title      22669 non-null  object
 2   year       22669 non-null  int64 
 3   imdb_id    22669 non-null  object
 4   tmdb_id    22669 non-null  int64 
 5   tmdb_type  22669 non-null  object
 6   type       22669 non-null  object
dtypes: int64(3), object(4)
memory usage: 1.2+ MB


In [124]:
# Export to file for safekeeping
df_rent.to_json('data/watchmode_rental_titles_20210314.json')

In [125]:
# Test to make sure file exported ok
with open('data/watchmode_rental_titles_20210314.json') as f:
    loaded = json.load(f)
    
df_test = pd.DataFrame(loaded)

df_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 22669 entries, 0 to 22668
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         22669 non-null  int64 
 1   title      22669 non-null  object
 2   year       22669 non-null  int64 
 3   imdb_id    22669 non-null  object
 4   tmdb_id    22669 non-null  int64 
 5   tmdb_type  22669 non-null  object
 6   type       22669 non-null  object
dtypes: int64(3), object(4)
memory usage: 1.4+ MB


In [133]:
df_buy = watchmode_movies_listtitles(api_key=api_key, source_types='buy', run_test=False)

Total results available: 22776
Total pages available: 92
Initiating loop for each page.
Initiating page 1 of 92...
Page 1 complete!
Initiating page 2 of 92...
Page 2 complete!
Initiating page 3 of 92...
Page 3 complete!
Initiating page 4 of 92...
Page 4 complete!
Initiating page 5 of 92...
Page 5 complete!
Initiating page 6 of 92...
Page 6 complete!
Initiating page 7 of 92...
Page 7 complete!
Initiating page 8 of 92...
Page 8 complete!
Initiating page 9 of 92...
Page 9 complete!
Initiating page 10 of 92...
Page 10 complete!
Initiating page 11 of 92...
Page 11 complete!
Initiating page 12 of 92...
Page 12 complete!
Initiating page 13 of 92...
Page 13 complete!
Initiating page 14 of 92...
Page 14 complete!
Initiating page 15 of 92...
Page 15 complete!
Initiating page 16 of 92...
Page 16 complete!
Initiating page 17 of 92...
Page 17 complete!
Initiating page 18 of 92...
Page 18 complete!
Initiating page 19 of 92...
Page 19 complete!
Initiating page 20 of 92...
Page 20 complete!
Initiating

In [135]:
df_buy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22776 entries, 0 to 22775
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         22776 non-null  int64 
 1   title      22776 non-null  object
 2   year       22776 non-null  int64 
 3   imdb_id    22776 non-null  object
 4   tmdb_id    22776 non-null  int64 
 5   tmdb_type  22776 non-null  object
 6   type       22776 non-null  object
dtypes: int64(3), object(4)
memory usage: 1.2+ MB


In [136]:
# Export to file for safekeeping
df_buy.to_json('data/watchmode_buy_titles_20210314.json')

In [138]:
# Test to make sure file exported ok
with open('data/watchmode_buy_titles_20210314.json') as f:
    loaded = json.load(f)
    
df_test = pd.DataFrame(loaded)

df_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 22776 entries, 0 to 22775
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         22776 non-null  int64 
 1   title      22776 non-null  object
 2   year       22776 non-null  int64 
 3   imdb_id    22776 non-null  object
 4   tmdb_id    22776 non-null  int64 
 5   tmdb_type  22776 non-null  object
 6   type       22776 non-null  object
dtypes: int64(3), object(4)
memory usage: 1.4+ MB


In [140]:
df_free = watchmode_movies_listtitles(api_key=api_key, source_types='free', run_test=False)

Total results available: 7058
Total pages available: 29
Initiating loop for each page.
Initiating page 1 of 29...
Page 1 complete!
Initiating page 2 of 29...
Page 2 complete!
Initiating page 3 of 29...
Page 3 complete!
Initiating page 4 of 29...
Page 4 complete!
Initiating page 5 of 29...
Page 5 complete!
Initiating page 6 of 29...
Page 6 complete!
Initiating page 7 of 29...
Page 7 complete!
Initiating page 8 of 29...
Page 8 complete!
Initiating page 9 of 29...
Page 9 complete!
Initiating page 10 of 29...
Page 10 complete!
Initiating page 11 of 29...
Page 11 complete!
Initiating page 12 of 29...
Page 12 complete!
Initiating page 13 of 29...
Page 13 complete!
Initiating page 14 of 29...
Page 14 complete!
Initiating page 15 of 29...
Page 15 complete!
Initiating page 16 of 29...
Page 16 complete!
Initiating page 17 of 29...
Page 17 complete!
Initiating page 18 of 29...
Page 18 complete!
Initiating page 19 of 29...
Page 19 complete!
Initiating page 20 of 29...
Page 20 complete!
Initiating 

In [141]:
df_free.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7059 entries, 0 to 7058
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         7059 non-null   int64 
 1   title      7059 non-null   object
 2   year       7059 non-null   int64 
 3   imdb_id    7059 non-null   object
 4   tmdb_id    7059 non-null   int64 
 5   tmdb_type  7059 non-null   object
 6   type       7059 non-null   object
dtypes: int64(3), object(4)
memory usage: 386.2+ KB


In [142]:
# Export to file for safekeeping
df_free.to_json('data/watchmode_free_titles_20210314.json')

In [143]:
# Test to make sure file exported ok
with open('data/watchmode_free_titles_20210314.json') as f:
    loaded = json.load(f)
    
df_test = pd.DataFrame(loaded)

df_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7059 entries, 0 to 7058
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         7059 non-null   int64 
 1   title      7059 non-null   object
 2   year       7059 non-null   int64 
 3   imdb_id    7059 non-null   object
 4   tmdb_id    7059 non-null   int64 
 5   tmdb_type  7059 non-null   object
 6   type       7059 non-null   object
dtypes: int64(3), object(4)
memory usage: 441.2+ KB
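
With the title lists saved, the "multiple platforms" question from the introduction starts with counting how many of the lists each title appears in. This is a minimal sketch using only the ID values; the function name is my own. Note that it counts source types (sub, rent, buy, free), not individual streaming services, since the columns shown above don't include a per-service field.

```python
from collections import Counter

def source_type_counts(ids_by_source):
    """Count how many source types each title ID appears under.

    ids_by_source maps a source type label (e.g. 'sub', 'rent') to an
    iterable of title IDs, such as a DataFrame's tmdb_id column.
    Returns a Counter of {title_id: number_of_source_types}.
    """
    counts = Counter()
    for ids in ids_by_source.values():
        # set() so a title repeated within one list is only counted once
        counts.update(set(ids))
    return counts
```

For example, `source_type_counts({'sub': df_sub['tmdb_id'], 'rent': df_rent['tmdb_id'], 'buy': df_buy['tmdb_id'], 'free': df_free['tmdb_id']})` would give a count between 1 and 4 for every title in any of the four lists.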


# The Movie Database

Looking at the IDs I got from the Watchmode results, I see they include both The Movie Database (TMDb) IDs and IMDb IDs. This will make it easy to join the results.

The Movie Database does have its own API, although it appears IMDb does not.

https://developers.themoviedb.org/3/getting-started/introduction

After further review, I decided to pull movie details from the TMDb API because I don't trust the data in the files I already have.

In [3]:
# Get API key
keys = get_keys("~/.secret/tmdb_api.json")
api_key = keys['api_keyv4']
len(api_key)

211

## Get a list of valid movie IDs, to loop through

TMDb posts a daily list of valid IDs so you can know which are good and which aren't. Since what I want is in the movie details, which requires an ID in the API call, I'm going to download this list and loop through it.

http://files.tmdb.org/p/exports/movie_ids_03_15_2021.json.gz

In [5]:
cd data

/Users/jessicamiles/Documents/Flatiron_Data_Science/git_labs_lessons/phase1/dsc-phase-1-project/data


In [102]:
# Open the gzip file, specify for reading and text
f = gzip.open('movie_ids_03_15_2021.json.gz','rt')

line = f.readline()
dict_line = json.loads(line)
dict_line

{'adult': False,
 'id': 3924,
 'original_title': 'Blondie',
 'popularity': 3.837,
 'video': False}

In [103]:
tmdb_movies = []

# Each remaining line of the file is one JSON object
for line in f:
    tmdb_movies.append(json.loads(line))

f.close()

len(tmdb_movies)

589029

In [87]:
tmdb_movies[0]['id']

6124
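
Looking back at the two cells above: the file handle stays open for the whole read, and because `readline()` already consumed the first record, `tmdb_movies` starts at the second line of the export (ID 6124 rather than 3924). Here is a sketch of a loader that reads every line and closes the file. The function name is my own, and UTF-8 encoding is an assumption about the export.

```python
import gzip
import json

def load_tmdb_id_export(path):
    """Read a gzipped, one-JSON-object-per-line export into a list of dicts."""
    movies = []
    # 'rt' opens the gzip stream in text mode; the with block closes it.
    # UTF-8 is assumed here.
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, if any
                movies.append(json.loads(line))
    return movies
```

Used as `tmdb_movies = load_tmdb_id_export('movie_ids_03_15_2021.json.gz')`, this would include the first record as well, so the length would be one more than the count shown above.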

## API Test Call

In [84]:
# Test out the API with key and make sure we get a good response
#url = f"https://api.themoviedb.org/3/movie/{tmdb_movies[0]['id']}"
url = "https://api.themoviedb.org/3/movie/790879"

headers = {'Authorization': f'Bearer {api_key}'}

response = requests.get(url, headers=headers)
response.text

'{"adult":false,"backdrop_path":null,"belongs_to_collection":null,"budget":0,"genres":[],"homepage":"","id":790879,"imdb_id":null,"original_language":"pt","original_title":"Vera - Canções do Arco-Íris","overview":"","popularity":0.6,"poster_path":"/gquRcWjcG7iPpMREIIJ1OgmG4HL.jpg","production_companies":[],"production_countries":[],"release_date":"2019-01-01","revenue":0,"runtime":0,"spoken_languages":[],"status":"Released","tagline":"","title":"Vera - Canções do Arco-Íris","video":false,"vote_average":0.0,"vote_count":0}'

In [85]:
resp_dict = json.loads(response.text)
resp_dict.keys()

dict_keys(['adult', 'backdrop_path', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id', 'imdb_id', 'original_language', 'original_title', 'overview', 'popularity', 'poster_path', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'video', 'vote_average', 'vote_count'])

In [86]:
resp_dict

{'adult': False,
 'backdrop_path': None,
 'belongs_to_collection': None,
 'budget': 0,
 'genres': [],
 'homepage': '',
 'id': 790879,
 'imdb_id': None,
 'original_language': 'pt',
 'original_title': 'Vera - Canções do Arco-Íris',
 'overview': '',
 'popularity': 0.6,
 'poster_path': '/gquRcWjcG7iPpMREIIJ1OgmG4HL.jpg',
 'production_companies': [],
 'production_countries': [],
 'release_date': '2019-01-01',
 'revenue': 0,
 'runtime': 0,
 'spoken_languages': [],
 'status': 'Released',
 'tagline': '',
 'title': 'Vera - Canções do Arco-Íris',
 'video': False,
 'vote_average': 0.0,
 'vote_count': 0}

## Figure out how to lump movie details with other calls and append them

https://developers.themoviedb.org/3/getting-started/append-to-response

The details call gives me good info, but while I'm here I might as well get as much info as possible. This API lets you bundle several requests for the same movie into a single response using append_to_response. I need to try it out.

I want:
- movie/id to get the details
- movie/id/release_dates for the various release dates
- movie/id/watch/providers for providers by country and by monetization

I considered reviews too, but since they span multiple pages I'm going to grab those from my existing Rotten Tomatoes records for now.

In [30]:
# let's find an ID that will be streaming to make sure we get good results

# Credit to https://stackoverflow.com/questions/8653516/
# python-list-of-dictionaries-search for this elegant solution to search
# a list of dictionaries

list(filter(lambda movie: movie['original_title'] == 'Avatar', tmdb_movies))

[{'adult': False,
  'id': 19995,
  'original_title': 'Avatar',
  'popularity': 173.639,
  'video': False},
 {'adult': False,
  'id': 421403,
  'original_title': 'Avatar',
  'popularity': 1.804,
  'video': False}]

In [31]:
# How do I append the response?
url = f"https://api.themoviedb.org/3/movie/{19995}\
?append_to_response=release_dates,watch/providers"

headers = {'Authorization': f'Bearer {api_key}'}

response = requests.get(url, headers=headers)
response.text

'{"adult":false,"backdrop_path":"/AmHOQ7rpHwiaUMRjKXztnauSJb7.jpg","belongs_to_collection":{"id":87096,"name":"Avatar Collection","poster_path":"/gC3tW9a45RGOzzSh6wv91pFnmFr.jpg","backdrop_path":"/syGPZuzcHBBHMLiNDN0x0Tms4Fk.jpg"},"budget":237000000,"genres":[{"id":28,"name":"Action"},{"id":12,"name":"Adventure"},{"id":14,"name":"Fantasy"},{"id":878,"name":"Science Fiction"}],"homepage":"http://www.avatarmovie.com/","id":19995,"imdb_id":"tt0499549","original_language":"en","original_title":"Avatar","overview":"In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.","popularity":173.639,"poster_path":"/6EiRUJpuoeQPghrs3YNktfnqOVh.jpg","production_companies":[{"id":444,"logo_path":"/42UPdZl6B2cFXgNUASR8hSt9mpS.png","name":"Dune Entertainment","origin_country":"US"},{"id":574,"logo_path":"/iB6GjNVHs5hOqcEYt2rcjBqIjki.png","name":"Lightstorm Entertainment","origin_countr

In [32]:
resp_dict = json.loads(response.text)
resp_dict.keys()

dict_keys(['adult', 'backdrop_path', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id', 'imdb_id', 'original_language', 'original_title', 'overview', 'popularity', 'poster_path', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'video', 'vote_average', 'vote_count', 'release_dates', 'watch/providers'])

OK I'm glad I know how to do that but that's a LOT of data. Let's get just the basic details for now to get things moving, and I'll come back and get the rest of the details later if needed.
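
When I do come back for the provider data, a small helper could pull out just one country's providers instead of the whole block. This sketch assumes the appended block is shaped like `{'results': {country_code: {monetization_type: [provider dicts with a 'provider_name' key]}}}`. That shape is my understanding of the format and isn't visible in the truncated output above, so it needs checking against a real response. The function name is my own.

```python
def providers_for_country(resp_dict, country='US'):
    """Return {monetization_type: [provider names]} for one country.

    Assumes resp_dict['watch/providers'] looks like
    {'results': {'US': {'link': '...', 'rent': [{'provider_name': ...}]}}}.
    Missing keys yield an empty dict instead of raising.
    """
    by_country = resp_dict.get('watch/providers', {}).get('results', {})
    country_block = by_country.get(country, {})

    providers = {}
    for kind, entries in country_block.items():
        # Non-list values (e.g. a 'link' URL) are not provider lists
        if isinstance(entries, list):
            providers[kind] = [e.get('provider_name') for e in entries]
    return providers
```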

## Get details via API

In [129]:
def tmdb_getmoviedetails(movie_list_slice, api_key):
    """Adds TMDb movie details to each movie dictionary in movie_list_slice.
    Each dictionary needs an 'id' key holding a TMDb movie ID.
    The dictionaries are updated in place, so the list they came from
    changes too, and the same list is returned.
    api_key is the TMDb bearer token sent in the Authorization header."""
    new_keys_to_keep = ['budget', 'genres', 'imdb_id',
                   'original_language', 'overview', 'production_companies',
                   'production_countries', 'release_date', 'revenue',
                   'runtime', 'status', 'tagline', 'title', 'vote_average',
                   'vote_count']

    movie_count = len(movie_list_slice)

    print(f"{movie_count} movies to process. Initiating.")

    # The same header is used for every request
    headers = {'Authorization': f'Bearer {api_key}'}

    for counter, movie in enumerate(movie_list_slice):

        # run GET API call to get movie details for movie id
        url = f"https://api.themoviedb.org/3/movie/{movie['id']}"

        response = requests.get(url, headers=headers)
        resp_dict = json.loads(response.text)

        # add details from the response back to the original dictionary
        # Only add the keys I want (removed duplicates and unnecessary info)
        for key in new_keys_to_keep:
            try:
                movie[key] = resp_dict[key]
            except KeyError:
                # Key missing from the response (e.g. an error response)
                print(f"Error on movie ID {movie['id']}, key {key}. "
                      "Added placeholder.")
                movie[key] = "Exception encountered"

        # Print output to the console every 100 movies to show progress
        if not (counter + 1) % 100:
            print(f"{counter + 1} movies processed so far...")

    print(f"{movie_count} total movies processed")
    return movie_list_slice
        

### Test out the protocol on a small batch

In [None]:
movies_test = tmdb_getmoviedetails(tmdb_movies[:100], api_key)

In [52]:
# The with block closes the file once the dump is finished
with open('testfile.pickle', 'wb') as f:
    pickle.dump(movies_test, f)

In [55]:
with open('testfile.pickle', 'rb') as f:
    test_unpickle = pickle.load(f)

In [60]:
df = pd.DataFrame(test_unpickle)
df.head()

|  | adult | id | original_title | popularity | video | budget | genres | homepage | imdb_id | original_language | ... | production_companies | production_countries | release_date | revenue | runtime | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | 6124 | Der Mann ohne Namen | 0.6 | False | 0 | [{'id': 12, 'name': 'Adventure'}] |  | tt0011436 | de | ... | [] | [{'iso_3166_1': 'DE', 'name': 'Germany'}] | 1921-01-01 | 0 | 420 | Released |  | Peter Voss, Thief of Millions | 0.0 | 0 |
| 1 | False | 8773 | L'amour à vingt ans | 2.932 | False | 0 | [{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n... |  | tt0055747 | fr | ... | [{'id': 38936, 'logo_path': '/ypvTqUeQOxORhFEF... | [{'iso_3166_1': 'DE', 'name': 'Germany'}, {'is... | 1962-06-22 | 0 | 110 | Released | The Intimate Secrets of Young Lovers | Love at Twenty | 6.8 | 32 |
| 2 | False | 25449 | New World Disorder 9: Never Enough | 2.981 | False | 0 | [] | http://www.nwdfilms.com |  | en | ... | [] | [] | 2008-12-08 | 0 | 69 | Released |  | New World Disorder 9: Never Enough | 4.5 | 2 |
| 3 | False | 31975 | Sesame Street: Elmo Loves You! | 1.965 | True | 0 | [{'id': 10751, 'name': 'Family'}] |  | tt1656746 | en | ... | [] | [] | 2010-01-05 | 0 | 46 | Released |  | Sesame Street: Elmo Loves You! | 0.0 | 0 |
| 4 | False | 2 | Ariel | 8.455 | False | 0 | [{'id': 18, 'name': 'Drama'}, {'id': 80, 'name... |  | tt0094675 | fi | ... | [{'id': 2303, 'logo_path': None, 'name': 'Vill... | [{'iso_3166_1': 'FI', 'name': 'Finland'}] | 1988-10-21 | 0 | 73 | Released |  | Ariel | 6.8 | 128 |


### Attempt 1, Batch 1

In [58]:
#movies_batch1 = tmdb_getmoviedetails(tmdb_movies[:10001], api_key)

#pickle.Pickler(open('tmdb_moviedetails_batch01.pickle', 'wb')).dump(movies_batch1)

10001 movies to process. Initiating.
10 movies processed so far...
20 movies processed so far...
30 movies processed so far...
40 movies processed so far...
50 movies processed so far...
60 movies processed so far...
70 movies processed so far...
80 movies processed so far...
90 movies processed so far...
100 movies processed so far...
110 movies processed so far...
120 movies processed so far...
130 movies processed so far...
140 movies processed so far...
150 movies processed so far...
160 movies processed so far...
170 movies processed so far...
180 movies processed so far...
190 movies processed so far...
200 movies processed so far...
210 movies processed so far...
220 movies processed so far...
230 movies processed so far...
240 movies processed so far...
250 movies processed so far...
260 movies processed so far...
270 movies processed so far...
280 movies processed so far...
290 movies processed so far...
300 movies processed so far...
310 movies processed so far...
320 movies 

2600 movies processed so far...
2610 movies processed so far...
2620 movies processed so far...
2630 movies processed so far...
2640 movies processed so far...
2650 movies processed so far...
2660 movies processed so far...
2670 movies processed so far...
2680 movies processed so far...
2690 movies processed so far...
2700 movies processed so far...
2710 movies processed so far...
2720 movies processed so far...
2730 movies processed so far...
2740 movies processed so far...
2750 movies processed so far...
2760 movies processed so far...
2770 movies processed so far...
2780 movies processed so far...
2790 movies processed so far...
2800 movies processed so far...
2810 movies processed so far...
2820 movies processed so far...
2830 movies processed so far...
2840 movies processed so far...
2850 movies processed so far...
2860 movies processed so far...
2870 movies processed so far...
2880 movies processed so far...
2890 movies processed so far...
2900 movies processed so far...
2910 mov

5170 movies processed so far...
5180 movies processed so far...
5190 movies processed so far...
5200 movies processed so far...
5210 movies processed so far...
5220 movies processed so far...
5230 movies processed so far...
5240 movies processed so far...
5250 movies processed so far...
5260 movies processed so far...
5270 movies processed so far...
5280 movies processed so far...
5290 movies processed so far...
5300 movies processed so far...
5310 movies processed so far...
5320 movies processed so far...
5330 movies processed so far...
5340 movies processed so far...
5350 movies processed so far...
5360 movies processed so far...
5370 movies processed so far...
5380 movies processed so far...
5390 movies processed so far...
5400 movies processed so far...
5410 movies processed so far...
5420 movies processed so far...
5430 movies processed so far...
5440 movies processed so far...
5450 movies processed so far...
5460 movies processed so far...
5470 movies processed so far...
5480 mov

7740 movies processed so far...
7750 movies processed so far...
7760 movies processed so far...
7770 movies processed so far...
7780 movies processed so far...
7790 movies processed so far...
7800 movies processed so far...
7810 movies processed so far...
7820 movies processed so far...
7830 movies processed so far...
7840 movies processed so far...
7850 movies processed so far...
7860 movies processed so far...
7870 movies processed so far...
7880 movies processed so far...
7890 movies processed so far...
7900 movies processed so far...
7910 movies processed so far...
7920 movies processed so far...
7930 movies processed so far...
7940 movies processed so far...
7950 movies processed so far...
7960 movies processed so far...
7970 movies processed so far...
7980 movies processed so far...
7990 movies processed so far...
8000 movies processed so far...
8010 movies processed so far...
8020 movies processed so far...
8030 movies processed so far...
8040 movies processed so far...
8050 mov

In [73]:
with open('tmdb_moviedetails_batch01.pickle', 'rb') as f:
    batch_1 = pickle.load(f)

df_batch1 = pd.DataFrame(batch_1)
df_batch1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001 entries, 0 to 10000
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   adult                 10001 non-null  bool   
 1   id                    10001 non-null  int64  
 2   original_title        10001 non-null  object 
 3   popularity            10001 non-null  float64
 4   video                 10001 non-null  bool   
 5   budget                10001 non-null  int64  
 6   genres                10001 non-null  object 
 7   homepage              10 non-null     object 
 8   imdb_id               10001 non-null  object 
 9   original_language     10001 non-null  object 
 10  overview              10001 non-null  object 
 11  production_companies  10001 non-null  object 
 12  production_countries  10001 non-null  object 
 13  release_date          10001 non-null  object 
 14  revenue               10001 non-null  int64  
 15  runtime            
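
A run of 10,001 movies takes a while, so saving smaller batches as they finish would mean an interruption only loses the batch in progress. This is a sketch of that idea, not what I ran above. `process` stands in for a call like `tmdb_getmoviedetails`, and the function name, file name pattern, and batch size are all placeholders of my own.

```python
import os
import pickle

def run_in_batches(items, process, out_dir, batch_size=1000):
    """Process items in fixed-size batches, pickling each finished batch.

    process(batch) is any callable that returns the processed batch.
    A batch whose pickle file already exists is skipped, so rerunning
    after an interruption resumes where the last run stopped.
    Returns the list of pickle file paths, in batch order.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for start in range(0, len(items), batch_size):
        batch_number = start // batch_size + 1
        path = os.path.join(out_dir, f"batch{batch_number:03d}.pickle")
        if not os.path.exists(path):
            result = process(items[start:start + batch_size])
            # Write to a temporary name first so a half-written file is
            # never mistaken for a finished batch
            tmp_path = path + '.tmp'
            with open(tmp_path, 'wb') as f:
                pickle.dump(result, f)
            os.replace(tmp_path, path)
        paths.append(path)
    return paths
```

With the function above, `process` could be `lambda batch: tmdb_getmoviedetails(batch, api_key)`.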

### Need to whittle down the list

In [74]:
len(tmdb_movies)

589029

OK, so I just realized that's more than 500 thousand IDs, not 50 thousand. I need to filter this down.

In [95]:
# What if we filter for only films released on or after 2019-01-01,
# with US as the certification country

url = f"https://api.themoviedb.org/3/discover/movie?\
certification_country=US&primary_release_date.gte=2019-01-01\
&sort_by=primary_release_date.asc"

headers = {'Authorization': f'Bearer {api_key}'}

url_params = {
                
             }

response = requests.get(url, headers=headers)
response.text

'{"status_code":11,"status_message":"Internal error: Something went wrong, contact TMDb.","success":false}'

In [83]:
resp_dict = json.loads(response.text)

list(filter(lambda movie: movie['original_title'] == 'Vera - Canções do Arco-Íris'
            , resp_dict['results']))

[{'adult': False,
  'backdrop_path': None,
  'genre_ids': [],
  'id': 790879,
  'original_language': 'pt',
  'original_title': 'Vera - Canções do Arco-Íris',
  'overview': '',
  'popularity': 0.6,
  'poster_path': '/gquRcWjcG7iPpMREIIJ1OgmG4HL.jpg',
  'release_date': '2019-01-01',
  'title': 'Vera - Canções do Arco-Íris',
  'video': False,
  'vote_average': 0,
  'vote_count': 0}]

There isn't much here to explain why my region filter doesn't appear to be working. Instead, I'll try filtering the original list to include only movies above a certain popularity score.

In [104]:
tmdb_movies[10].keys()

dict_keys(['adult', 'id', 'original_title', 'popularity', 'video'])

In [93]:
# Keep only the movies with a popularity score above 3

# Credit to https://stackoverflow.com/questions/8653516/
# python-list-of-dictionaries-search for this elegant solution to search
# a list of dictionaries

popular = list(filter(lambda movie: movie['popularity'] > 3, tmdb_movies))

len(popular)

80851

In [100]:
# What's the popularity of a really old movie?
list(filter(lambda movie: movie['original_title'] == "We're No Angels", tmdb_movies))

[{'adult': False,
  'id': 5971,
  'original_title': "We're No Angels",
  'popularity': 10.64,
  'video': False,
  'budget': 20000000,
  'genres': [{'id': 35, 'name': 'Comedy'}, {'id': 80, 'name': 'Crime'}],
  'imdb_id': 'tt0098625',
  'original_language': 'en',
  'overview': "Two escaped cons' only prayer to escape is to pass themselves off as priests and pass by the police blockade at the border into the safety of Canada.",
  'production_companies': [{'id': 4,
    'logo_path': '/fycMZt242LVjagMByZOLUGbCvv3.png',
    'name': 'Paramount',
    'origin_country': 'US'}],
  'production_countries': [{'iso_3166_1': 'US',
    'name': 'United States of America'}],
  'release_date': '1989-06-02',
  'revenue': 0,
  'runtime': 106,
  'status': 'Released',
  'tagline': "Escaped convicts disguised as priests. It'll take a miracle to get away with this one.",
  'title': "We're No Angels",
  'vote_average': 6.1,
  'vote_count': 310},
 {'adult': False,
  'id': 5996,
  'original_title': "We're No Angels

OK, that's not bad. I can do 80k. Let's scrap what we had before and start over with the list of 80k IDs that have popularity over 3.

I'd also like to try running the rest of the movies overnight to see if they finish. Will try!

In [136]:
unpopular = list(filter(lambda movie: movie['popularity'] <= 3, tmdb_movies))

### Attempt 2

In [127]:
def batch_api_data_collection(full_list, batch_count, api_key):
    """Pull movie details in batches, pickling each batch as soon as it finishes."""
    interval = round(len(full_list) / batch_count)
    start_index = 0
    stop_index = interval

    print(f"Starting set of {batch_count} batches. {len(full_list)} total movies to pull.")

    for i in range(1, batch_count + 1):
        if i != batch_count:
            batch = tmdb_getmoviedetails(full_list[start_index:stop_index], api_key)
            # Save right away so a failure in a later batch doesn't lose this one
            with open(f'tmdb_moviedetails_batch{i}.pickle', 'wb') as f:
                pickle.dump(batch, f)

            print(f"Completed batch {i}. Indexes {start_index} to {stop_index - 1} retrieved.")
            start_index = stop_index
            stop_index += interval
        else:
            # Last batch: take everything that's left
            batch = tmdb_getmoviedetails(full_list[start_index:], api_key)
            with open(f'tmdb_moviedetails_batch{i}.pickle', 'wb') as f:
                pickle.dump(batch, f)

            print(f"Completed batch {i}. Indexes {start_index} to the end retrieved.")

    print("All movies retrieved!")

    return None
# if len was 10000 and batch_count was 10
# interval would be 1000
# start index initial = 0
# stop index initial would be interval (1000), which would go to index 999,
#   which would be 1000 items
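
To sanity-check the index arithmetic without calling the API, the same slicing logic can be pulled out into a small pure function. This is just a sketch; `batch_bounds` is a hypothetical helper name, and it mirrors the `round(len / batch_count)` interval used above, with the last batch taking whatever remains.

```python
def batch_bounds(total, batch_count):
    """Return a (start, stop) slice pair per batch; the last batch runs to the end."""
    interval = round(total / batch_count)
    bounds = []
    for i in range(batch_count):
        start = i * interval
        # Every batch but the last is exactly one interval long
        stop = total if i == batch_count - 1 else start + interval
        bounds.append((start, stop))
    return bounds

bounds = batch_bounds(80851, 8)
print(bounds[4])   # fifth batch: (40424, 50530), i.e. indexes 40424 to 50529
print(bounds[-1])  # last batch picks up the remainder
```

The fifth pair lines up with the "Completed batch 5. Indexes 40424 to 50529 retrieved." message in the run log.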


In [128]:
batch_api_data_collection(popular, 8, api_key)

Starting set of 8 batches. 80851 total movies to pull.
10106 movies to process. Initiating.
100 movies processed so far...
200 movies processed so far...
300 movies processed so far...
400 movies processed so far...
500 movies processed so far...
600 movies processed so far...
700 movies processed so far...
800 movies processed so far...
900 movies processed so far...
1000 movies processed so far...
1100 movies processed so far...
1200 movies processed so far...
1300 movies processed so far...
1400 movies processed so far...
1500 movies processed so far...
1600 movies processed so far...
1700 movies processed so far...
1800 movies processed so far...
1900 movies processed so far...
2000 movies processed so far...
2100 movies processed so far...
2200 movies processed so far...
2300 movies processed so far...
2400 movies processed so far...
2500 movies processed so far...
2600 movies processed so far...
2700 movies processed so far...
2800 movies processed so far...
2900 movies processed

4500 movies processed so far...
4600 movies processed so far...
4700 movies processed so far...
4800 movies processed so far...
4900 movies processed so far...
5000 movies processed so far...
5100 movies processed so far...
5200 movies processed so far...
5300 movies processed so far...
5400 movies processed so far...
5500 movies processed so far...
5600 movies processed so far...
5700 movies processed so far...
5800 movies processed so far...
5900 movies processed so far...
6000 movies processed so far...
6100 movies processed so far...
6200 movies processed so far...
6300 movies processed so far...
6400 movies processed so far...
6500 movies processed so far...
6600 movies processed so far...
6700 movies processed so far...
6800 movies processed so far...
6900 movies processed so far...
7000 movies processed so far...
7100 movies processed so far...
7200 movies processed so far...
7300 movies processed so far...
7400 movies processed so far...
7500 movies processed so far...
7600 mov

9200 movies processed so far...
9300 movies processed so far...
9400 movies processed so far...
9500 movies processed so far...
9600 movies processed so far...
9700 movies processed so far...
9800 movies processed so far...
9900 movies processed so far...
10000 movies processed so far...
10100 movies processed so far...
10106 total movies processed successfully
Completed batch 5. Indexes 40424 to 50529 retrieved.
10106 movies to process. Initiating.
100 movies processed so far...
200 movies processed so far...
300 movies processed so far...
400 movies processed so far...
500 movies processed so far...
600 movies processed so far...
700 movies processed so far...
800 movies processed so far...
900 movies processed so far...
1000 movies processed so far...
1100 movies processed so far...
1200 movies processed so far...
1300 movies processed so far...
1400 movies processed so far...
1500 movies processed so far...
1600 movies processed so far...
1700 movies processed so far...
1800 movies

3400 movies processed so far...
3500 movies processed so far...
3600 movies processed so far...
3700 movies processed so far...
3800 movies processed so far...
3900 movies processed so far...
4000 movies processed so far...
4100 movies processed so far...
4200 movies processed so far...
4300 movies processed so far...
4400 movies processed so far...
4500 movies processed so far...
4600 movies processed so far...
4700 movies processed so far...
4800 movies processed so far...
4900 movies processed so far...
5000 movies processed so far...
5100 movies processed so far...
5200 movies processed so far...
5300 movies processed so far...
5400 movies processed so far...


KeyError: 'budget'
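
One `KeyError` part-way through a batch throws away everything retrieved in that batch. A more forgiving option is to copy the detail fields defensively and record which ones were missing. The sketch below is not the `tmdb_getmoviedetails` function used above (that is defined earlier in the notebook); `merge_details` and `DETAIL_FIELDS` are hypothetical names, and the field list is only a sample of the columns seen in the batch DataFrame.

```python
# Hypothetical subset of the detail fields; extend as needed
DETAIL_FIELDS = ["budget", "genres", "imdb_id", "revenue", "runtime"]

def merge_details(movie, details, fields=DETAIL_FIELDS):
    """Copy detail fields onto a movie record, using None where a field is absent."""
    merged = dict(movie)  # leave the original record untouched
    missing = []
    for field in fields:
        if field in details:
            merged[field] = details[field]
        else:
            merged[field] = None
            missing.append(field)
    return merged, missing

movie = {"id": 644910, "popularity": 6.65}
# Simulate a details response with no usable fields
merged, missing = merge_details(movie, {"status_code": 34})
print(missing)
```

Movies that come back with a non-empty `missing` list can then be reviewed afterwards instead of stopping the run.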

In [None]:
batch8 = tmdb_getmoviedetails(popular[70721:], api_key)

In [131]:
pickle.Pickler(open(f'tmdb_moviedetails_batch8.pickle', 'wb')).dump(batch)

10130 movies to process. Initiating.
100 movies processed so far...
200 movies processed so far...
300 movies processed so far...
400 movies processed so far...
500 movies processed so far...
600 movies processed so far...
700 movies processed so far...
800 movies processed so far...
900 movies processed so far...
1000 movies processed so far...
1100 movies processed so far...
1200 movies processed so far...
1300 movies processed so far...
1400 movies processed so far...
1500 movies processed so far...
1600 movies processed so far...
1700 movies processed so far...
1800 movies processed so far...
1900 movies processed so far...
2000 movies processed so far...
2100 movies processed so far...
2200 movies processed so far...
2300 movies processed so far...
2400 movies processed so far...
2500 movies processed so far...
2600 movies processed so far...
2700 movies processed so far...
2800 movies processed so far...
2900 movies processed so far...
3000 movies processed so far...
3100 movies 

NameError: name 'batch' is not defined

In [132]:
# Use a context manager so the file is closed once the dump finishes
with open('tmdb_moviedetails_batch8.pickle', 'wb') as f:
    pickle.dump(batch8, f)
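
With all the batches pickled, they can be read back and stitched into one list. This is a rough sketch that assumes each pickle holds a list of movie dictionaries; `load_batches` is a hypothetical helper, and the glob pattern simply matches the file names used above.

```python
import pickle
from pathlib import Path

def load_batches(folder, pattern="tmdb_moviedetails_batch*.pickle"):
    """Load every pickled batch in a folder and return one combined list."""
    movies = []
    # Plain name sorting keeps batch1..batch8 in order (it would not for 10+ batches)
    for path in sorted(Path(folder).glob(pattern)):
        with open(path, "rb") as f:
            movies.extend(pickle.load(f))
    return movies
```

Calling `load_batches(".")` from the notebook's working directory would pick up the eight batch files, and the combined list can go straight into `pd.DataFrame`.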

In [135]:
# The error occurred on movie ID 644910

error_movie = list(filter(lambda movie: movie['id'] == 644910, tmdb_movies))

error_movie

[{'adult': False,
  'id': 644910,
  'original_title': 'Saturday Morning Cartoons: 1970s — Volume 1',
  'popularity': 6.65,
  'video': False}]

I'm just going to remove movie ID 644910, since it actually looks like it's a compilation of cartoons which isn't vital to my analysis.
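
Dropping it before re-running is a one-line filter. A minimal sketch, with `drop_ids` as a hypothetical helper name and the sample records trimmed down to the `id` and `original_title` keys seen in `tmdb_movies`:

```python
def drop_ids(movies, bad_ids):
    """Return a new list without the movies whose 'id' is in bad_ids."""
    bad = set(bad_ids)
    return [movie for movie in movies if movie["id"] not in bad]

sample = [
    {"id": 5971, "original_title": "We're No Angels"},
    {"id": 644910, "original_title": "Saturday Morning Cartoons: 1970s"},
]
cleaned = drop_ids(sample, [644910])
print(len(cleaned))
```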

# Attempt to scrape from JustWatch

JustWatch has infinite scroll, and the info I want isn't on the initial page, so I would need to loop through the pages. I was able to get some of the pages 

In [44]:
body = {
    "fields": ["cinema_release_date", "full_path", "full_paths", "id",
               "localized_release_date", "object_type", "poster", "scoring",
               "title", "tmdb_popularity", "backdrops"],
    "content_types": ["movie"],
    "providers": ["nfx", "amp", "dnp", "fuv", "atp", "itu", "hlu", "hbm",
                  "pct", "pcp", "amz", "ply", "yot", "ytr", "hbn", "sho"],
    "sort_by": "release_year",
    "enable_provider_filter": False,
    "monetization_types": [],
    "page": 3,
    "page_size": 30,
    "matching_offers_only": True,
}
body_json = json.dumps(body)
body_json

'{"fields": ["cinema_release_date", "full_path", "full_paths", "id", "localized_release_date", "object_type", "poster", "scoring", "title", "tmdb_popularity", "backdrops"], "content_types": ["movie"], "providers": ["nfx", "amp", "dnp", "fuv", "atp", "itu", "hlu", "hbm", "pct", "pcp", "amz", "ply", "yot", "ytr", "hbn", "sho"], "sort_by": "release_year", "enable_provider_filter": false, "monetization_types": [], "page": 3, "page_size": 30, "matching_offers_only": true}'

In [45]:
html_page = requests.get("https://apis.justwatch.com/content/titles/en_US/popular",
                         data=body_json)
# Name the parser explicitly so BeautifulSoup doesn't have to guess one
soup = BeautifulSoup(html_page.content, 'html.parser')

In [46]:
soup.prettify()

'<!DOCTYPE html>\n<html lang="en">\n <head>\n  <meta charset="utf-8"/>\n  <meta content="initial-scale=1, minimum-scale=1, width=device-width" name="viewport"/>\n  <title>\n   Error 400 (Bad Request)!!1\n  </title>\n  <style>\n   *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100%

OK, so I managed to find the API call used to get more results, but I don't think it will allow me to use it. Which makes sense since I'm making the call externally without a request header :-p

However, the info saved in devtools actually has the links for the returned movies, so I can definitely parse those out of the HAR file from devtools and then loop through the sites to get the actual info. I have to scroll manually, but it's pretty easy to do. The site only loads 66 pages of 30 results, which is only 1,980 results at a time, so I have to figure out how to filter my results to stay under that limit.

Without specifying a release date in the filter, I get results that haven't been released yet... going to come back to that. I started by filtering by release year, but got too many movies. I also tried filtering by streaming service for the most popular ones, but still got too many. Finally, I selected only the services that are subscription only (no rentals) and set monetization to subscription. Success! I'll keep doing this and log the files along with the criteria.

### Filter criteria used and results file tracking

|URL|Release Year|Services|Monetization|Results|Saved File Name|
|---|---|---|---|---|---|
|//www.justwatch.com/us/movies<br>?monetization_types=flatrate<br>&providers=amp,atp,dnp,hbm,hlu,nfx<br>&release_year_from=2020<br>&release_year_until=2020<br>&sort_by=release_year  |2020|amp,atp,dnp,hbm,hlu,nfx|Subscription|1602|JustWatch_released2020_AvailBySubscription.har|
|//www.justwatch.com/us/movies<br>?monetization_types=flatrate<br>&providers=atp,dnp,hbm,hlu,nfx<br>&release_year_from=2019<br>&release_year_until=2019<br>&sort_by=release_year |2019|atp,dnp,hbm,hlu,nfx|Subscription|941|JustWatch_released2019_AvailBySubscription_NoPrime.har|
|//www.justwatch.com/us/movies<br>?monetization_types=flatrate<br>&providers=amp<br>&release_year_from=2019<br>&release_year_until=2019<br>&sort_by=release_year|2019|amp|Subscription|1800|JustWatch_released2019_AvailBySubscription_PrimeOnly.har|


These are the popular services that don't have rentals or buy options; you get what you get with your subscription:

- amp = Amazon Prime Video
- atp = Apple TV Plus
- dnp = Disney Plus
- hbm = HBO Max
- hlu = Hulu
- nfx = Netflix
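
Since every filter URL in the tracking table follows the same pattern, it can be generated rather than typed out each time. Here is a small sketch that just rebuilds the query string seen in the table; `build_justwatch_url` is a hypothetical helper, and I've assumed an `https:` scheme in front of the `//www.justwatch.com` address.

```python
from urllib.parse import urlencode

def build_justwatch_url(year, providers, monetization="flatrate"):
    """Build the JustWatch browse URL for one release year and a set of provider codes."""
    query = {
        "monetization_types": monetization,
        "providers": ",".join(providers),
        "release_year_from": year,
        "release_year_until": year,
        "sort_by": "release_year",
    }
    # safe="," keeps the provider list readable instead of encoding commas as %2C
    return "https://www.justwatch.com/us/movies?" + urlencode(query, safe=",")

url_2020 = build_justwatch_url(2020, ["amp", "atp", "dnp", "hbm", "hlu", "nfx"])
print(url_2020)
```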

### Parsing the HAR files to get paths

Before getting too far with the HAR files, I want to make sure I'll be able to parse the paths I need from them and make a script to scrape the info from the individual page URLs.
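
As a starting point, here is a rough sketch of pulling the paths out of a HAR file. A HAR file is JSON with the captured requests under `log.entries`, each with a `request.url` and a `response.content.text`. What I'm assuming here (not yet confirmed against my saved files) is that the JustWatch responses are JSON objects with an `items` list whose entries carry the `full_path` field requested earlier; `paths_from_har` is a hypothetical helper name.

```python
import base64
import json

def paths_from_har(har, url_fragment="/content/titles/"):
    """Collect 'full_path' values from matching JSON responses in a parsed HAR dict."""
    paths = []
    for entry in har["log"]["entries"]:
        if url_fragment not in entry["request"]["url"]:
            continue
        content = entry["response"].get("content", {})
        text = content.get("text")
        if not text:
            continue
        # HAR marks encoded bodies with encoding == "base64"
        if content.get("encoding") == "base64":
            text = base64.b64decode(text).decode("utf-8")
        try:
            payload = json.loads(text)
        except json.JSONDecodeError:
            continue  # not a JSON response, skip it
        if not isinstance(payload, dict):
            continue
        # 'items' is an assumed key name for the list of returned titles
        for item in payload.get("items", []):
            if "full_path" in item:
                paths.append(item["full_path"])
    return paths

# Tiny made-up HAR structure, just to exercise the function
fake_har = {"log": {"entries": [
    {"request": {"url": "https://apis.justwatch.com/content/titles/en_US/popular"},
     "response": {"content": {"text": json.dumps({"items": [{"full_path": "/us/movie/example"}]})}}},
    {"request": {"url": "https://www.justwatch.com/static/app.js"},
     "response": {"content": {"text": "console.log(1)"}}},
]}}
print(paths_from_har(fake_har))
```

For a real file, the dictionary would come from `json.load` on the saved `.har` file.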

Credit to https://www.accordbox.com/blog/how-crawl-infinite-scrolling-pages-using-python/ for the instructions on how to web scrape a site with infinite scrolling, although it didn't work. Accessed 3/14/2021.

Credit to https://stackoverflow.com/questions/4028904/how-to-get-the-home-directory-in-python for how to get the home directory in Python. Accessed 3/14/2021.