# Introduction

I want streaming availability data to analyze alongside the box office info. I'm curious whether movies that did well at the box office are more likely to be available to rent or buy rather than by subscription, and whether they are more likely to be on multiple platforms. This matters for figuring out whether the same things that make a movie successful at the box office also make it successful in post-release streaming.

The problem is, this data isn't easy to find. The Numbers has a paid data set that I don't want to pay for. So does JustWatch.

I checked out https://api.watchmode.com/, which has a free option.

In [53]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import json

## Watchmode API

In [61]:
# Credit to https://stackoverflow.com/questions/4028904/
#    how-to-get-the-home-directory-in-python for how to get the home directory
from os.path import expanduser

def get_keys(path):
    """Load API keys from a JSON file. A leading ~ expands to the home directory."""
    with open(expanduser(path)) as f:
        return json.load(f)

keys = get_keys("~/.secret/watchmode_api.json")
api_key = keys['api_key']
len(api_key)

40

In [62]:
# Test out the new API with key and make sure we get a good response
url = 'https://api.watchmode.com/v1/list-titles'

url_params = {
                'apiKey': api_key,
                'types': 'movie',
                'regions': 'US',
                'source_types': 'sub',
                'page': '1'
             }
# source_types: sub, rent, buy, free, tve
# Starting with 'sub' for subscription

response = requests.get(url, params=url_params)
response.status_code

200

In [117]:
head_dict = dict(response.headers)

head_dict.keys()

dict_keys(['Date', 'Server', 'Access-Control-Allow-Origin', 'Access-Control-Allow-Methods', 'Access-Control-Allow-Headers', 'X-RateLimit-Limit', 'X-RateLimit-Remaining', 'Retry-After', 'X-Account-Quota', 'X-Account-Quota-Used', 'Content-Encoding', 'Vary', 'Keep-Alive', 'Connection', 'Transfer-Encoding', 'Content-Type'])
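
The quota headers listed above are what tell me how much of the monthly allowance is left. A minimal sketch of pulling them into one summary follows; the header names come from the response above, but the assumption that their values are integer strings is mine, and `quota_summary` is a hypothetical helper, not part of the Watchmode API.

```python
def quota_summary(headers):
    """Summarize account quota from a response-headers mapping.

    Assumes (not confirmed by the API docs here) that both quota headers
    hold integer strings.
    """
    quota = int(headers["X-Account-Quota"])
    used = int(headers["X-Account-Quota-Used"])
    return {"quota": quota, "used": used, "remaining": quota - used}


# Made-up header values, for illustration only
example_headers = {"X-Account-Quota": "1000", "X-Account-Quota-Used": "880"}
print(quota_summary(example_headers))
```

With a real response, `quota_summary(response.headers)` would work the same way, since `response.headers` supports dictionary-style lookup.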

In [64]:
# Load the response into a dictionary and check out the keys
resp_dict = json.loads(response.text)

resp_dict.keys()

dict_keys(['titles', 'page', 'total_results', 'total_pages'])

In [66]:
# How many results did we get?
print(f"Total results available: {resp_dict['total_results']}")
print(f"Total pages available: {resp_dict['total_pages']}")

Total results available: 17783
Total pages available: 72


In [132]:
def watchmode_movies_listtitles(api_key, source_types,
                                start_page=1, run_test=False):
    """Requests movie information from Watchmode.com
    using the list-titles endpoint.
    US region is hardcoded, as this is meant for a free API user license.
    Pass the API key to use and the source type.
    Source type options are:
    sub (Subscription); rent; buy; free; tve. Pass as a string or a list for
    multiple source types.
    run_test is False by default, but change to True
    to simply see the number of results and how many API calls have been used."""

    url = 'https://api.watchmode.com/v1/list-titles'
    url_params = {
                    'apiKey': api_key,
                    'types': 'movie',
                    'regions': 'US',
                    'source_types': source_types,
                    'page': start_page
                 }

    # The first GET request always runs to obtain totals
    response = requests.get(url, params=url_params)

    if response.status_code != 200:
        print(f"Error response attempting initial call: {response.status_code}. "
              "Aborted.")
        return None

    resp_dict = response.json()
    print(f"Total results available: {resp_dict['total_results']}")
    print(f"Total pages available: {resp_dict['total_pages']}")

    df = None
    # If we're actually going through all the pages (not just running a test)...
    if not run_test:
        print("Initiating loop for each page.")
        end_page = resp_dict['total_pages']
        pages = []

        for current_page in range(start_page, end_page + 1):
            print(f"Initiating page {current_page} of {end_page}...")
            # The initial response already holds start_page, so reuse it
            # instead of spending a second API call on the same page
            if current_page > start_page:
                url_params['page'] = current_page
                response = requests.get(url, params=url_params)
                resp_dict = response.json()
            pages.append(pd.DataFrame(resp_dict['titles']))
            print(f"Page {current_page} complete!")

        # Concatenate once at the end rather than growing df on every page
        df = pd.concat(pages, ignore_index=True) if pages else pd.DataFrame()

    print(f"API Calls Used: {response.headers['X-Account-Quota-Used']} "
          f"out of {response.headers['X-Account-Quota']}")
    return df
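
Since every request counts against the quota, it helps to be able to exercise the paging logic without touching the network. Here is a rough sketch of the same idea with the fetching step passed in as a function; `collect_titles` and `fetch_page` are hypothetical names of my own, and the only assumption about the payload is the `titles` and `total_pages` keys seen in the response above.

```python
def collect_titles(fetch_page, start_page=1):
    """Walk pages until total_pages is reached, requesting each page once.

    fetch_page(page) must return (status_code, payload), where payload
    has 'titles' and 'total_pages' keys.
    """
    titles = []
    page = start_page
    total_pages = None  # unknown until the first successful response

    while total_pages is None or page <= total_pages:
        status, payload = fetch_page(page)
        if status != 200:
            # Keep what was collected so far instead of losing it
            print(f"Stopping at page {page}: status {status}")
            break
        titles.extend(payload["titles"])
        total_pages = payload["total_pages"]
        page += 1

    return titles


# A stand-in for the real request, so no API calls are spent
def fake_fetch(page):
    return 200, {"titles": [{"id": page}], "total_pages": 3}


print(len(collect_titles(fake_fetch)))
```

For real use, `fetch_page` could wrap `requests.get` and return `(response.status_code, response.json())`, and the collected list could go straight into `pd.DataFrame`.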

In [100]:
# Get results for movies available by subscription
df_sub = watchmode_movies_listtitles(api_key=api_key, source_types='sub')

Total results available: 17783
Total pages available: 72
Initiating loop for each page.
Initiating page 1 of 72...
Page 1 complete!
Initiating page 2 of 72...
Page 2 complete!
Initiating page 3 of 72...
Page 3 complete!
Initiating page 4 of 72...
Page 4 complete!
Initiating page 5 of 72...
Page 5 complete!
Initiating page 6 of 72...
Page 6 complete!
Initiating page 7 of 72...
Page 7 complete!
Initiating page 8 of 72...
Page 8 complete!
Initiating page 9 of 72...
Page 9 complete!
Initiating page 10 of 72...
Page 10 complete!
Initiating page 11 of 72...
Page 11 complete!
Initiating page 12 of 72...
Page 12 complete!
Initiating page 13 of 72...
Page 13 complete!
Initiating page 14 of 72...
Page 14 complete!
Initiating page 15 of 72...
Page 15 complete!
Initiating page 16 of 72...
Page 16 complete!
Initiating page 17 of 72...
Page 17 complete!
Initiating page 18 of 72...
Page 18 complete!
Initiating page 19 of 72...
Page 19 complete!
Initiating page 20 of 72...
Page 20 complete!
Initiating

In [101]:
df_sub

| |id|title|year|imdb_id|tmdb_id|tmdb_type|type|
|---|---|---|---|---|---|---|---|
|0|11|!Women Art Revolution|2010|tt1699720|55245|movie|movie|
|1|135|#Female Pleasure|2018|tt8372826|540210|movie|movie|
|2|164|Pellichoopulu|2016|tt5824826|393841|movie|movie|
|3|166|#realityhigh|2017|tt6119504|455656|movie|movie|
|4|1155|$elfie Shootout|2016|tt4004608|327946|movie|movie|
|...|...|...|...|...|...|...|...|
|17778|4150062|The Mystery of D.B. Cooper|2020|tt8694594|743073|movie|movie|
|17779|518463|The Ballad of Buster Scruggs|2018|tt6412452|537996|movie|movie|
|17780|538946|The First Temptation of Christ|2019|tt11310004|652020|movie|movie|
|17781|540875|Dance Dreams: Hot Chocolate Nutcracker|2020|tt8741182|755328|movie|movie|


In [104]:
ls

CONTRIBUTING.md                  awesome.gif
LICENSE.md                       data/
README.md                        student.ipynb
Streaming Data Collection.ipynb  zippedData/


In [106]:
# Export to file for safekeeping
df_sub.to_json('data/watchmode_subscription_titles_20210314.json')

In [112]:
# Test to make sure file exported ok
with open('data/watchmode_subscription_titles_20210314.json') as f:
    loaded = json.load(f)
    
df_test = pd.DataFrame(loaded)

df_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17783 entries, 0 to 17782
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         17783 non-null  int64 
 1   title      17783 non-null  object
 2   year       17783 non-null  int64 
 3   imdb_id    17783 non-null  object
 4   tmdb_id    17783 non-null  int64 
 5   tmdb_type  17783 non-null  object
 6   type       17783 non-null  object
dtypes: int64(3), object(4)
memory usage: 1.1+ MB
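
The `.info()` check confirms the row and column counts, but note the reloaded frame has a plain `Index` instead of a `RangeIndex`: `to_json` writes the row labels as JSON object keys, and those come back as strings. A stricter check, sketched below, converts the labels back to integers and compares the actual values; `roundtrip_matches` is a hypothetical helper, and it assumes there are no missing values (true for these files, per the non-null counts).

```python
import json

import pandas as pd


def roundtrip_matches(df, path):
    """Write df to JSON, reload it, and compare the contents row by row."""
    df.to_json(path)
    with open(path) as f:
        loaded = json.load(f)

    restored = pd.DataFrame(loaded)
    # JSON object keys are strings, so turn the row labels back into ints
    restored.index = restored.index.astype(int)
    restored = restored.sort_index()

    return restored.to_dict("records") == df.to_dict("records")
```

Something like `roundtrip_matches(df_sub, 'data/watchmode_subscription_titles_20210314.json')` would then give a single True/False answer.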


In [120]:
# Now, let's see how many results we'll get for the other source types
# Do I have enough API calls left this month to do all of them?

df_rent = watchmode_movies_listtitles(api_key=api_key, source_types='rent', run_test=True)

Total results available: 22669
Total pages available: 91
API Calls Remaining: 120
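
To answer the "enough calls left?" question up front, the page counts can be estimated from the result totals. The totals and page counts seen so far (17783 results in 72 pages, 22669 in 91) are consistent with 250 titles per page, but that page size is my inference from those numbers, not something stated by the API. A small sketch:

```python
import math


def pages_needed(total_results, page_size=250):
    """Pages required to fetch every result; page_size=250 is an inferred assumption."""
    return math.ceil(total_results / page_size)


def can_afford(total_results, calls_remaining, page_size=250):
    """True if at least one request per page fits in the remaining quota."""
    return pages_needed(total_results, page_size) <= calls_remaining


print(pages_needed(22669))     # 91, matching the rent totals above
print(can_afford(22669, 120))  # True
```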


In [121]:
df_rent = watchmode_movies_listtitles(api_key=api_key, source_types='rent', run_test=False)

Total results available: 22669
Total pages available: 91
Initiating loop for each page.
Initiating page 1 of 91...
Page 1 complete!
Initiating page 2 of 91...
Page 2 complete!
Initiating page 3 of 91...
Page 3 complete!
Initiating page 4 of 91...
Page 4 complete!
Initiating page 5 of 91...
Page 5 complete!
Initiating page 6 of 91...
Page 6 complete!
Initiating page 7 of 91...
Page 7 complete!
Initiating page 8 of 91...
Page 8 complete!
Initiating page 9 of 91...
Page 9 complete!
Initiating page 10 of 91...
Page 10 complete!
Initiating page 11 of 91...
Page 11 complete!
Initiating page 12 of 91...
Page 12 complete!
Initiating page 13 of 91...
Page 13 complete!
Initiating page 14 of 91...
Page 14 complete!
Initiating page 15 of 91...
Page 15 complete!
Initiating page 16 of 91...
Page 16 complete!
Initiating page 17 of 91...
Page 17 complete!
Initiating page 18 of 91...
Page 18 complete!
Initiating page 19 of 91...
Page 19 complete!
Initiating page 20 of 91...
Page 20 complete!
Initiating

In [123]:
df_rent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22669 entries, 0 to 22668
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         22669 non-null  int64 
 1   title      22669 non-null  object
 2   year       22669 non-null  int64 
 3   imdb_id    22669 non-null  object
 4   tmdb_id    22669 non-null  int64 
 5   tmdb_type  22669 non-null  object
 6   type       22669 non-null  object
dtypes: int64(3), object(4)
memory usage: 1.2+ MB


In [124]:
# Export to file for safekeeping
df_rent.to_json('data/watchmode_rental_titles_20210314.json')

In [125]:
# Test to make sure file exported ok
with open('data/watchmode_rental_titles_20210314.json') as f:
    loaded = json.load(f)
    
df_test = pd.DataFrame(loaded)

df_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 22669 entries, 0 to 22668
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         22669 non-null  int64 
 1   title      22669 non-null  object
 2   year       22669 non-null  int64 
 3   imdb_id    22669 non-null  object
 4   tmdb_id    22669 non-null  int64 
 5   tmdb_type  22669 non-null  object
 6   type       22669 non-null  object
dtypes: int64(3), object(4)
memory usage: 1.4+ MB


In [133]:
df_buy = watchmode_movies_listtitles(api_key=api_key, source_types='buy', run_test=False)

Total results available: 22776
Total pages available: 92
Initiating loop for each page.
Initiating page 1 of 92...
Page 1 complete!
Initiating page 2 of 92...
Page 2 complete!
Initiating page 3 of 92...
Page 3 complete!
Initiating page 4 of 92...
Page 4 complete!
Initiating page 5 of 92...
Page 5 complete!
Initiating page 6 of 92...
Page 6 complete!
Initiating page 7 of 92...
Page 7 complete!
Initiating page 8 of 92...
Page 8 complete!
Initiating page 9 of 92...
Page 9 complete!
Initiating page 10 of 92...
Page 10 complete!
Initiating page 11 of 92...
Page 11 complete!
Initiating page 12 of 92...
Page 12 complete!
Initiating page 13 of 92...
Page 13 complete!
Initiating page 14 of 92...
Page 14 complete!
Initiating page 15 of 92...
Page 15 complete!
Initiating page 16 of 92...
Page 16 complete!
Initiating page 17 of 92...
Page 17 complete!
Initiating page 18 of 92...
Page 18 complete!
Initiating page 19 of 92...
Page 19 complete!
Initiating page 20 of 92...
Page 20 complete!
Initiating

In [135]:
df_buy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22776 entries, 0 to 22775
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         22776 non-null  int64 
 1   title      22776 non-null  object
 2   year       22776 non-null  int64 
 3   imdb_id    22776 non-null  object
 4   tmdb_id    22776 non-null  int64 
 5   tmdb_type  22776 non-null  object
 6   type       22776 non-null  object
dtypes: int64(3), object(4)
memory usage: 1.2+ MB


In [136]:
# Export to file for safekeeping
df_buy.to_json('data/watchmode_buy_titles_20210314.json')

In [138]:
# Test to make sure file exported ok
with open('data/watchmode_buy_titles_20210314.json') as f:
    loaded = json.load(f)
    
df_test = pd.DataFrame(loaded)

df_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 22776 entries, 0 to 22775
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         22776 non-null  int64 
 1   title      22776 non-null  object
 2   year       22776 non-null  int64 
 3   imdb_id    22776 non-null  object
 4   tmdb_id    22776 non-null  int64 
 5   tmdb_type  22776 non-null  object
 6   type       22776 non-null  object
dtypes: int64(3), object(4)
memory usage: 1.4+ MB


In [140]:
df_free = watchmode_movies_listtitles(api_key=api_key, source_types='free', run_test=False)

Total results available: 7058
Total pages available: 29
Initiating loop for each page.
Initiating page 1 of 29...
Page 1 complete!
Initiating page 2 of 29...
Page 2 complete!
Initiating page 3 of 29...
Page 3 complete!
Initiating page 4 of 29...
Page 4 complete!
Initiating page 5 of 29...
Page 5 complete!
Initiating page 6 of 29...
Page 6 complete!
Initiating page 7 of 29...
Page 7 complete!
Initiating page 8 of 29...
Page 8 complete!
Initiating page 9 of 29...
Page 9 complete!
Initiating page 10 of 29...
Page 10 complete!
Initiating page 11 of 29...
Page 11 complete!
Initiating page 12 of 29...
Page 12 complete!
Initiating page 13 of 29...
Page 13 complete!
Initiating page 14 of 29...
Page 14 complete!
Initiating page 15 of 29...
Page 15 complete!
Initiating page 16 of 29...
Page 16 complete!
Initiating page 17 of 29...
Page 17 complete!
Initiating page 18 of 29...
Page 18 complete!
Initiating page 19 of 29...
Page 19 complete!
Initiating page 20 of 29...
Page 20 complete!
Initiating 

In [141]:
df_free.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7059 entries, 0 to 7058
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         7059 non-null   int64 
 1   title      7059 non-null   object
 2   year       7059 non-null   int64 
 3   imdb_id    7059 non-null   object
 4   tmdb_id    7059 non-null   int64 
 5   tmdb_type  7059 non-null   object
 6   type       7059 non-null   object
dtypes: int64(3), object(4)
memory usage: 386.2+ KB


In [142]:
# Export to file for safekeeping
df_free.to_json('data/watchmode_free_titles_20210314.json')

In [143]:
# Test to make sure file exported ok
with open('data/watchmode_free_titles_20210314.json') as f:
    loaded = json.load(f)
    
df_test = pd.DataFrame(loaded)

df_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7059 entries, 0 to 7058
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         7059 non-null   int64 
 1   title      7059 non-null   object
 2   year       7059 non-null   int64 
 3   imdb_id    7059 non-null   object
 4   tmdb_id    7059 non-null   int64 
 5   tmdb_type  7059 non-null   object
 6   type       7059 non-null   object
dtypes: int64(3), object(4)
memory usage: 441.2+ KB


## The Movie Database

Looking at the IDs in the Watchmode results, I see each title carries both a The Movie Database (TMDb) ID and an IMDb ID. This will make it easy to join the results.

The Movie Database does have its own API, although it appears IMDb does not.

https://developers.themoviedb.org/3/getting-started/introduction
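
Because every Watchmode row carries an `imdb_id`, the per-source-type frames can be combined into one availability table keyed on that ID, which is what the rent/buy versus subscription question needs. A minimal sketch, assuming frames shaped like `df_sub`, `df_rent`, and so on; `availability_flags` is a hypothetical helper and the IDs in the example are placeholders.

```python
import pandas as pd


def availability_flags(frames):
    """Build one row per imdb_id with a True/False column per source type.

    frames maps a source-type name (e.g. 'sub') to a DataFrame that has
    an 'imdb_id' column.
    """
    tagged = [df[["imdb_id"]].assign(source_type=name)
              for name, df in frames.items()]
    combined = pd.concat(tagged, ignore_index=True)

    # Count occurrences per (title, source type), then reduce to flags
    flags = pd.crosstab(combined["imdb_id"], combined["source_type"])
    flags = flags.clip(upper=1).astype(bool)
    flags["n_source_types"] = flags.sum(axis=1)
    return flags


# Placeholder IDs, for illustration only
example = {
    "sub": pd.DataFrame({"imdb_id": ["tt0000001", "tt0000002"]}),
    "rent": pd.DataFrame({"imdb_id": ["tt0000002", "tt0000003"]}),
}
print(availability_flags(example))
```

The resulting table is indexed by `imdb_id`, so it could later be joined to box office data on the same key.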

## Attempt to scrape from JustWatch

JustWatch uses infinite scroll, and the info I want isn't on the initial page, so I would need to loop through the pages. I was able to get some of the pages 

In [44]:
body = {"fields":["cinema_release_date","full_path","full_paths","id","localized_release_date","object_type","poster","scoring","title","tmdb_popularity","backdrops"],"content_types":["movie"],"providers":["nfx","amp","dnp","fuv","atp","itu","hlu","hbm","pct","pcp","amz","ply","yot","ytr","hbn","sho"],"sort_by":"release_year","enable_provider_filter":False,"monetization_types":[],"page":3,"page_size":30,"matching_offers_only":True}
body_json = json.dumps(body)
body_json

'{"fields": ["cinema_release_date", "full_path", "full_paths", "id", "localized_release_date", "object_type", "poster", "scoring", "title", "tmdb_popularity", "backdrops"], "content_types": ["movie"], "providers": ["nfx", "amp", "dnp", "fuv", "atp", "itu", "hlu", "hbm", "pct", "pcp", "amz", "ply", "yot", "ytr", "hbn", "sho"], "sort_by": "release_year", "enable_provider_filter": false, "monetization_types": [], "page": 3, "page_size": 30, "matching_offers_only": true}'

In [45]:
html_page = requests.get("https://apis.justwatch.com/content/titles/en_US/popular",
                         data=body_json)
# Name the parser explicitly so BeautifulSoup doesn't have to guess one
soup = BeautifulSoup(html_page.content, 'html.parser')

In [46]:
soup.prettify()

'<!DOCTYPE html>\n<html lang="en">\n <head>\n  <meta charset="utf-8"/>\n  <meta content="initial-scale=1, minimum-scale=1, width=device-width" name="viewport"/>\n  <title>\n   Error 400 (Bad Request)!!1\n  </title>\n  <style>\n   *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100%

OK, so I managed to find the API call used to get more results, but I don't think it will let me use it, which makes sense since I'm making the call externally without a request header :-p

However, the info saved in devtools actually has the links for the returned movies, so I can parse those out of the HAR file from devtools and then loop through the individual pages to get the actual info. I have to manually scroll, but it's pretty easy to do. The site only loads 66 pages of 30 results each, which is only 1,980 results at a time, so I have to figure out how to filter my results to stay under that limit.

Without specifying a release date in the filter, I get results that haven't been released yet... going to come back to that. I started by filtering on release year, but got too many movies. I also tried filtering by streaming service for the most popular ones, but that was still too many. Finally, I selected only the services that are subscription-only (no rentals) and set monetization to subscription. Success! I will keep trying this and logging the files with the criteria.

### Filter criteria used and results file tracking

|URL|Release Year|Services|Monetization|Results|Saved File Name|
|---|---|---|---|---|---|
|//www.justwatch.com/us/movies<br>?monetization_types=flatrate<br>&providers=amp,atp,dnp,hbm,hlu,nfx<br>&release_year_from=2020<br>&release_year_until=2020<br>&sort_by=release_year  |2020|amp,atp,dnp,hbm,hlu,nfx|Subscription|1602|JustWatch_released2020_AvailBySubscription.har|
|//www.justwatch.com/us/movies<br>?monetization_types=flatrate<br>&providers=atp,dnp,hbm,hlu,nfx<br>&release_year_from=2019<br>&release_year_until=2019<br>&sort_by=release_year |2019|atp,dnp,hbm,hlu,nfx|Subscription|941|JustWatch_released2019_AvailBySubscription_NoPrime.har|
|//www.justwatch.com/us/movies<br>?monetization_types=flatrate<br>&providers=amp<br>&release_year_from=2019<br>&release_year_until=2019<br>&sort_by=release_year|2019|amp|Subscription|1800|JustWatch_released2019_AvailBySubscription_PrimeOnly.har|


These are the popular services that don't have rentals or buy options; you get what you get with your subscription:

- amp = Amazon Prime Video
- atp = Apple TV Plus
- dnp = Disney Plus
- hbm = HBO Max
- hlu = Hulu
- nfx = Netflix
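
The filter URLs logged in the table above all follow the same pattern, so they can be generated rather than typed by hand. A small sketch, using the query parameters exactly as they appear in the table; the `https:` scheme is my assumption (the table records the URLs without one), and `justwatch_filter_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def justwatch_filter_url(year, providers, monetization="flatrate"):
    """Build a single-release-year filter URL in the pattern logged above."""
    params = {
        "monetization_types": monetization,
        "providers": ",".join(providers),
        "release_year_from": year,
        "release_year_until": year,
        "sort_by": "release_year",
    }
    # safe="," keeps the provider list readable instead of encoding commas
    return "https://www.justwatch.com/us/movies?" + urlencode(params, safe=",")


print(justwatch_filter_url(2019, ["atp", "dnp", "hbm", "hlu", "nfx"]))
```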

### Parsing the HAR files to get paths

Before getting too far with the HAR files, I want to make sure I'll be able to parse the paths I need from them and make a script to scrape the info from the individual page URLs.
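
As a starting point, here is a rough sketch of that parsing step. A HAR file is JSON with a `log.entries` list, and each entry's response body sits under `response.content.text` (base64-encoded when `encoding` says so). What I'm assuming, and haven't verified, is that the JustWatch responses are JSON objects with an `items` list whose elements carry the `full_path` field requested earlier; `full_paths_from_har` is a hypothetical helper.

```python
import base64
import json


def full_paths_from_har(har, url_fragment="apis.justwatch.com"):
    """Collect unique full_path values from matching responses in a HAR dict."""
    paths = []
    for entry in har["log"]["entries"]:
        if url_fragment not in entry["request"]["url"]:
            continue  # not one of the API calls we care about

        content = entry["response"].get("content", {})
        text = content.get("text")
        if not text:
            continue
        if content.get("encoding") == "base64":
            text = base64.b64decode(text).decode("utf-8")

        try:
            payload = json.loads(text)
        except json.JSONDecodeError:
            continue  # body was not JSON (e.g. an HTML error page)
        if not isinstance(payload, dict):
            continue

        # 'items' is an assumed key name for the list of returned titles
        for item in payload.get("items", []):
            path = item.get("full_path") if isinstance(item, dict) else None
            if path and path not in paths:
                paths.append(path)
    return paths
```

To use it on a saved file, load the HAR with `json.load` first, for example from `JustWatch_released2020_AvailBySubscription.har`, and pass the resulting dictionary in.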

Credit to https://www.accordbox.com/blog/how-crawl-infinite-scrolling-pages-using-python/ for the instructions on how to web scrape a site with infinite scrolling, although it didn't work. Accessed 3/14/2021.

Credit to https://stackoverflow.com/questions/4028904/how-to-get-the-home-directory-in-python for how to get the home directory in Python. Accessed 3/14/2021.