# Data Collection

This section collects the data for the analysis.
At a minimum, we need a way to describe the features of each movie
and a quantitative measure of its success (e.g., sales, views, ratings, or level of virality).

It would also help to see how availability on streaming services interacts with film features to influence a movie's success.

### The Movie Database (TMDB) 
It's a good community-built alternative to IMDb.  
The API documentation is here: https://www.themoviedb.org/documentation/api


#### Load the movie database API

In [1]:
# Load the TMDB API key.
# Get an API key from themoviedb.org and save it
# under the "api_key" field in .secret/tmdb_api.json

import json

def get_keys(path):
    with open(path) as f:
        return json.load(f)

path = '/Users/stereopickles/.secret' # directory that contains your tmdb_api.json

api_key = get_keys(f"{path}/tmdb_api.json")['api_key']
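
For reference, here is a minimal sketch of a loader that fails with a clearer message when the key file or the field is missing. The helper name `load_api_key` and the error wording are assumptions for illustration; only the file name `tmdb_api.json` and the `api_key` field come from the cell above. The file itself is assumed to look like `{"api_key": "YOUR_KEY"}`.

```python
import json
from pathlib import Path


def load_api_key(secret_dir):
    """Read the "api_key" field from <secret_dir>/tmdb_api.json (hypothetical helper)."""
    # expanduser() lets callers pass a home-relative directory such as "~/.secret"
    key_file = Path(secret_dir).expanduser() / "tmdb_api.json"
    if not key_file.is_file():
        raise FileNotFoundError(f"Expected a key file at {key_file}")
    with key_file.open() as f:
        keys = json.load(f)
    if "api_key" not in keys:
        raise KeyError(f'{key_file} has no "api_key" field')
    return keys["api_key"]
```

With this helper, `api_key = load_api_key("~/.secret")` would replace the hard-coded user path, assuming the key file lives in that directory.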


In [40]:
# Collecting basic movie data with popularity and voting

import requests

url = 'https://api.themoviedb.org/3/discover/movie'

url_params = {
    "api_key": api_key,
    "sort_by": "popularity.desc", # sort by popularity
    "page": 1 # each page contains 20 movies
}

resp = requests.get(url, params=url_params)
# A status code of 200 means the request succeeded.
print(resp.status_code)
print(resp.text[:1000])

200
{"page":1,"total_results":10000,"total_pages":500,"results":[{"popularity":226.202,"vote_count":3853,"video":false,"poster_path":"\/xBHvZcjRiWyobQ9kxBhO6B2dtRI.jpg","id":419704,"adult":false,"backdrop_path":"\/t4z8OlOEzH7J1JRFUN3rcm6XHNL.jpg","original_language":"en","original_title":"Ad Astra","genre_ids":[18,878],"title":"Ad Astra","vote_average":6.1,"overview":"The near future, a time when both hope and hardships drive humanity to look to the stars and beyond. While a mysterious phenomenon menaces to destroy life on planet Earth, astronaut Roy McBride undertakes a mission across the immensity of space and its many perils to uncover the truth about a lost expedition that decades before boldly faced emptiness and silence in search of the unknown.","release_date":"2019-09-17"},{"popularity":200.36,"vote_count":77,"video":false,"poster_path":"\/h1B7tW0t399VDjAcWJh8m87469b.jpg","id":556574,"adult":false,"backdrop_path":"\/uWVkEo9PWHu9algZsiLPi6sRU64.jpg","original_language":"en","ori
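
Reading the raw text works, but the paging fields are easier to check once the body is decoded. Below is a small sketch of that idea. The helper `summarize_page` is hypothetical, and `sample_text` is a shortened stand-in for `resp.text` that reuses values from the output above.

```python
import json

# Shortened stand-in for resp.text; the values mirror the output above.
sample_text = (
    '{"page": 1, "total_results": 10000, "total_pages": 500, '
    '"results": [{"id": 419704, "title": "Ad Astra", "popularity": 226.202}]}'
)


def summarize_page(payload):
    """Return paging metadata and the field names of the first result."""
    results = payload.get("results", [])
    return {
        "page": payload.get("page"),
        "total_pages": payload.get("total_pages"),
        "total_results": payload.get("total_results"),
        "n_results": len(results),
        "fields": sorted(results[0]) if results else [],
    }


summary = summarize_page(json.loads(sample_text))
print(summary)
```

With a live response, `summarize_page(resp.json())` should give the same kind of summary, and `total_pages` tells us how many pages the loop in the next cell has to cover.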

In [62]:
# Now that we have reviewed the structure of the response,
# loop through every page to build a full dataset
# of the 10,000 most popular movies released since 2000.

tmdb_dat = []

for i in range(500):
    url_params = {
        "api_key": api_key,
        "sort_by": "popularity.desc",
        "page": i + 1,
        # only movies released on or after 2000-01-01
        "release_date.gte": "2000-01-01",
    }
    resp = requests.get(url, params=url_params)
    try:
        tmdb_dat.extend(resp.json()["results"])
    except (ValueError, KeyError):
        # ValueError: the body is not valid JSON; KeyError: no "results" field
        print(f"Error on page {i+1} (status {resp.status_code})")
        
tmdb_dat[0]    

{'popularity': 226.202,
 'vote_count': 3853,
 'video': False,
 'poster_path': '/xBHvZcjRiWyobQ9kxBhO6B2dtRI.jpg',
 'id': 419704,
 'adult': False,
 'backdrop_path': '/t4z8OlOEzH7J1JRFUN3rcm6XHNL.jpg',
 'original_language': 'en',
 'original_title': 'Ad Astra',
 'genre_ids': [18, 878],
 'title': 'Ad Astra',
 'vote_average': 6.1,
 'overview': 'The near future, a time when both hope and hardships drive humanity to look to the stars and beyond. While a mysterious phenomenon menaces to destroy life on planet Earth, astronaut Roy McBride undertakes a mission across the immensity of space and its many perils to uncover the truth about a lost expedition that decades before boldly faced emptiness and silence in search of the unknown.',
 'release_date': '2019-09-17'}
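
The loop above only prints the pages that fail. One possible variation, sketched below, records them so they can be retried later. Everything here is an assumption for illustration: `collect_pages` is a hypothetical helper, `fetch_page` is any callable that takes a page number and returns the decoded JSON, and `fake_fetch` is an offline stand-in so the sketch runs without an API key.

```python
import time


def collect_pages(fetch_page, n_pages, pause=0.0):
    """Gather the "results" lists of pages 1..n_pages.

    Pages that raise or lack a "results" field are recorded in `failed`
    instead of being skipped silently.
    """
    rows, failed = [], []
    for page in range(1, n_pages + 1):
        try:
            rows.extend(fetch_page(page)["results"])
        except (KeyError, ValueError, OSError) as err:
            failed.append((page, repr(err)))
        if pause:
            time.sleep(pause)  # optional delay between requests
    return rows, failed


# Offline stand-in for the API: page 2 comes back without "results".
def fake_fetch(page):
    if page == 2:
        return {"status_message": "simulated failure"}
    return {"results": [{"id": page * 10 + k} for k in range(3)]}


rows, failed = collect_pages(fake_fetch, n_pages=3)
print(len(rows), [page for page, _ in failed])
```

Against the real API, `fetch_page` could be something like `lambda p: requests.get(url, params={**url_params, "page": p}).json()`, and the pages listed in `failed` could be passed through the same function again.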

In [49]:
import pandas as pd

In [115]:
movie_df = pd.DataFrame(tmdb_dat)
print(movie_df.shape)
movie_df.head()

(10000, 14)


Unnamed: 0,popularity,vote_count,video,poster_path,id,adult,backdrop_path,original_language,original_title,genre_ids,title,vote_average,overview,release_date
0,226.202,3853,False,/xBHvZcjRiWyobQ9kxBhO6B2dtRI.jpg,419704,False,/t4z8OlOEzH7J1JRFUN3rcm6XHNL.jpg,en,Ad Astra,"[18, 878]",Ad Astra,6.1,"The near future, a time when both hope and har...",2019-09-17
1,200.36,77,False,/h1B7tW0t399VDjAcWJh8m87469b.jpg,556574,False,/uWVkEo9PWHu9algZsiLPi6sRU64.jpg,en,Hamilton,"[18, 36]",Hamilton,8.8,Presenting the tale of American founding fathe...,2020-07-03
2,170.32,1831,False,/s1cVTQEZYn4nSjZLnFbzLP0j8y2.jpg,8619,False,/m11Mej9vbQqcXWgYrgPboCJ9NUh.jpg,en,Master and Commander: The Far Side of the World,"[12, 18, 10752]",Master and Commander: The Far Side of the World,7.0,After an abrupt and violent encounter with a F...,2003-11-14
3,168.356,2667,False,/hLZHJjsZ83Wfn3MRteNb6ew0WC7.jpg,72545,False,/9f33P2UvXw8HnLNB3cHdzI64Xpq.jpg,en,Journey 2: The Mysterious Island,"[28, 12, 878]",Journey 2: The Mysterious Island,6.0,Sean Anderson partners with his mom's boyfrien...,2012-01-19
4,123.127,14,False,/goEW6QqoFxNI2pfbpVqmXj2WXwd.jpg,531876,False,/n1RohH2VoK1CdVI2fXvcP19dSlm.jpg,en,The Outpost,"[28, 18, 36, 10752]",The Outpost,4.9,"A small unit of U.S. soldiers, alone at the re...",2020-06-24


In [108]:
movie_df.set_index('id', inplace = True)

#### Adding genre info
We now have the 10,000 most popular movies released from 2000 to the present.  
This dataset stores genres as ids, so let's pull the genre list to see what each id refers to. 

In [116]:
# getting genre data
url = 'https://api.themoviedb.org/3/genre/movie/list'

url_params = {
    "api_key": api_key
}

resp = requests.get(url, params = url_params)
print(resp.status_code)
print(resp.text[:1000])

200
{"genres":[{"id":28,"name":"Action"},{"id":12,"name":"Adventure"},{"id":16,"name":"Animation"},{"id":35,"name":"Comedy"},{"id":80,"name":"Crime"},{"id":99,"name":"Documentary"},{"id":18,"name":"Drama"},{"id":10751,"name":"Family"},{"id":14,"name":"Fantasy"},{"id":36,"name":"History"},{"id":27,"name":"Horror"},{"id":10402,"name":"Music"},{"id":9648,"name":"Mystery"},{"id":10749,"name":"Romance"},{"id":878,"name":"Science Fiction"},{"id":10770,"name":"TV Movie"},{"id":53,"name":"Thriller"},{"id":10752,"name":"War"},{"id":37,"name":"Western"}]}


In [75]:
genre_df = pd.DataFrame(resp.json()['genres'])
genre_df.head(10)
len(genre_df)

19

There are 19 genres. Each movie has one or more genres associated with it.  
It is probably best to have a separate True/False column for each genre,  
so that we can compare their counts. 
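
To make the idea concrete before applying it to the dataframe, here is a plain-Python sketch on two of the movies and three of the genres shown above. The helper `add_genre_flags` is hypothetical; the notebook itself builds the columns with pandas in the next cell.

```python
genres = [
    {"id": 18, "name": "Drama"},
    {"id": 36, "name": "History"},
    {"id": 878, "name": "Science Fiction"},
]
movies = [
    {"title": "Ad Astra", "genre_ids": [18, 878]},
    {"title": "Hamilton", "genre_ids": [18, 36]},
]


def add_genre_flags(movie, genres):
    """Return a copy of movie with one True/False entry per genre name."""
    flagged = dict(movie)
    ids = set(movie["genre_ids"])
    for genre in genres:
        flagged[genre["name"]] = genre["id"] in ids
    return flagged


flagged_movies = [add_genre_flags(m, genres) for m in movies]

# Summing booleans counts the movies that carry each genre.
genre_counts = {
    g["name"]: sum(m[g["name"]] for m in flagged_movies) for g in genres
}
print(genre_counts)
# {'Drama': 2, 'History': 1, 'Science Fiction': 1}
```

The key point is that membership is tested against the genre's `id` value, not its position in the genre list.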


In [117]:
# For each genre, add a True/False column to movie_df that says
# whether the genre's id appears in the movie's genre_ids list.
# Iterate over the 'id' column itself: genre_df.index is only the row position.

for genre_id, genre_name in zip(genre_df['id'], genre_df['name']):
    movie_df[genre_name] = movie_df.genre_ids.apply(lambda x: genre_id in x)

In [118]:
movie_df.head()

Unnamed: 0,popularity,vote_count,video,poster_path,id,adult,backdrop_path,original_language,original_title,genre_ids,...,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
0,226.202,3853,False,/xBHvZcjRiWyobQ9kxBhO6B2dtRI.jpg,419704,False,/t4z8OlOEzH7J1JRFUN3rcm6XHNL.jpg,en,Ad Astra,"[18, 878]",...,False,False,False,False,False,True,False,False,False,False
1,200.36,77,False,/h1B7tW0t399VDjAcWJh8m87469b.jpg,556574,False,/uWVkEo9PWHu9algZsiLPi6sRU64.jpg,en,Hamilton,"[18, 36]",...,True,False,False,False,False,False,False,False,False,False
2,170.32,1831,False,/s1cVTQEZYn4nSjZLnFbzLP0j8y2.jpg,8619,False,/m11Mej9vbQqcXWgYrgPboCJ9NUh.jpg,en,Master and Commander: The Far Side of the World,"[12, 18, 10752]",...,False,False,False,False,False,False,False,False,True,False
3,168.356,2667,False,/hLZHJjsZ83Wfn3MRteNb6ew0WC7.jpg,72545,False,/9f33P2UvXw8HnLNB3cHdzI64Xpq.jpg,en,Journey 2: The Mysterious Island,"[28, 12, 878]",...,False,False,False,False,False,True,False,False,False,False
4,123.127,14,False,/goEW6QqoFxNI2pfbpVqmXj2WXwd.jpg,531876,False,/n1RohH2VoK1CdVI2fXvcP19dSlm.jpg,en,The Outpost,"[28, 18, 36, 10752]",...,True,False,False,False,False,False,False,False,True,False


#### Additional cleaning
The dataframe is getting wide and contains some columns we will not use, so let's drop them. 

In [119]:
movie_df.drop(['video', 'poster_path', 'backdrop_path',
               'original_title', 'genre_ids'],
              axis=1, inplace=True)

In [121]:
movie_df.head()

Unnamed: 0,popularity,vote_count,id,adult,original_language,title,vote_average,overview,release_date,Action,...,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
0,226.202,3853,419704,False,en,Ad Astra,6.1,"The near future, a time when both hope and har...",2019-09-17,False,...,False,False,False,False,False,True,False,False,False,False
1,200.36,77,556574,False,en,Hamilton,8.8,Presenting the tale of American founding fathe...,2020-07-03,False,...,True,False,False,False,False,False,False,False,False,False
2,170.32,1831,8619,False,en,Master and Commander: The Far Side of the World,7.0,After an abrupt and violent encounter with a F...,2003-11-14,False,...,False,False,False,False,False,False,False,False,True,False
3,168.356,2667,72545,False,en,Journey 2: The Mysterious Island,6.0,Sean Anderson partners with his mom's boyfrien...,2012-01-19,True,...,False,False,False,False,False,True,False,False,False,False
4,123.127,14,531876,False,en,The Outpost,4.9,"A small unit of U.S. soldiers, alone at the re...",2020-06-24,True,...,True,False,False,False,False,False,False,False,True,False


### Adding cast information

Our questions also need information on cast, budget, and revenue.  
This information is available from the movie API, but we need to fetch it one movie at a time by movie_id. While we are at it, we should also collect some additional fields: 

budget, production_companies (names), revenue, runtime
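
A possible shape for that per-movie step is sketched below. It assumes the movie details endpoint (`/3/movie/{movie_id}`) with `append_to_response=credits`, so that cast arrives in the same response. The helper names `details_request` and `extract_details` are hypothetical, and `sample_payload` holds placeholder values, not real TMDB data.

```python
BASE_URL = "https://api.themoviedb.org/3/movie"


def details_request(movie_id, api_key):
    """Build the URL and query parameters for one movie's details."""
    url = f"{BASE_URL}/{movie_id}"
    params = {"api_key": api_key, "append_to_response": "credits"}
    return url, params


def extract_details(payload, top_cast=5):
    """Keep only the fields we need from a decoded details payload."""
    companies = [c["name"] for c in payload.get("production_companies", [])]
    cast = payload.get("credits", {}).get("cast", [])
    return {
        "id": payload.get("id"),
        "budget": payload.get("budget"),
        "revenue": payload.get("revenue"),
        "runtime": payload.get("runtime"),
        "production_companies": companies,
        "cast": [member["name"] for member in cast[:top_cast]],
    }


# Placeholder payload (made-up values) that mimics the assumed response layout.
sample_payload = {
    "id": 0,
    "budget": 1000,
    "revenue": 2500,
    "runtime": 90,
    "production_companies": [{"id": 1, "name": "Example Studio"}],
    "credits": {"cast": [{"name": "Actor A"}, {"name": "Actor B"}]},
}

url, params = details_request(419704, "YOUR_API_KEY")
print(url)
print(extract_details(sample_payload, top_cast=1))
```

With `requests`, the live call would be `extract_details(requests.get(url, params=params).json())`, repeated for each id in the movie dataframe.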