# Project 3 - Part 2.1 - ETL

- Name: Tyler Schelling
- Date Started: 12/20/2022

---

Your Stakeholder Wants More Data!

After investigating the preview of your data from Part 1, your stakeholder realized that there is no financial information included in the IMDB data (e.g. budget or revenue).

This is a major roadblock that must be addressed before you can determine which movies are successful.

Your stakeholder identified The Movie Database (TMDB) as a great source of financial data (https://www.themoviedb.org/). Thankfully, TMDB offers a free API for programmatic access to their data!

Your stakeholder wants you to extract the budget, revenue, and MPAA Rating (G/PG/PG-13/R), which is also called "Certification".

---

## Specifications - Financial Data

- Your stakeholder would like you to extract and save the results for movies that meet all of the criteria established in Part 1 of the project (you should already have a filtered dataframe saved from Part 1 as a .csv.gz file)

- As a proof-of-concept, they requested you perform a test extraction of movies that started in 2000 or 2001

- Each year should be saved as a separate .csv.gz file

- One helper function will add the certification (MPAA Rating) to movie.info
- The other helper function will append to or extend a JSON file with Python

### Confirm Your API Function Works

To ensure your function for extracting movie data from TMDB is working, test it on these 2 movie ids:
- tt0848228 ("The Avengers")
- tt0332280 ("The Notebook")

Make sure that your function runs without error and that it returns the correct movie's data for both test ids.

---

## Import Libraries

In [1]:
import json, time, os
import tmdbsimple as tmdb
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm_notebook

## API Key and Token

In [2]:
with open('/Users/tyler/.secret/tmdb_api.json', 'r') as f:
    login = json.load(f)
#Display the key names of the login dict.
login.keys()

dict_keys(['api_key', 'token'])
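
The credentials file is expected to be a small JSON object with `api_key` and `token` entries, as the output above shows. A minimal sketch of a loader that fails early when the key is missing follows; the helper name `load_tmdb_login` and the placeholder values are assumptions for illustration and are not part of the original notebook.

```python
import json
import os
import tempfile


def load_tmdb_login(path):
    """Load a credentials JSON file and confirm it contains an 'api_key' entry."""
    with open(path, "r") as f:
        login = json.load(f)
    if "api_key" not in login:
        raise KeyError(f"'api_key' not found in {path}")
    return login


# Demonstration with a throwaway file and placeholder values (not real credentials).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "tmdb_api.json")
    with open(demo_path, "w") as f:
        json.dump({"api_key": "PLACEHOLDER_KEY", "token": "PLACEHOLDER_TOKEN"}, f)
    demo_login = load_tmdb_login(demo_path)

print(sorted(demo_login.keys()))  # → ['api_key', 'token']
```

Keeping the file outside the project folder, as the cell above does, helps keep the key out of version control.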

In [3]:
#Apply the API Key to the TMDB API
tmdb.API_KEY = login['api_key']

In [4]:
#Create the Data folder if it doesn't exist and view its contents
FOLDER = 'Data/'
os.makedirs(FOLDER, exist_ok = True)
os.listdir(FOLDER)

['.ipynb_checkpoints',
 'title_akas.csv.gz',
 'title_basics.csv.gz',
 'title_ratings.csv.gz',
 'tmdb_api_results_2000.json']

## Functions

In [5]:
#Appends a new record (or extends with a list of records) in a JSON file. Adapted from: 
#https://www.geeksforgeeks.org/append-to-json-file-using-python/ 
def write_json(new_data, filename):     
    with open(filename,'r+') as file:
        #First we load existing data (a list of records).
        file_data = json.load(file)
        #Choose extend or append
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        #Move back to the start of the file before rewriting it.
        file.seek(0)
        #Convert back to json.
        json.dump(file_data, file)
        #Drop any leftover bytes in case the new content is shorter.
        file.truncate()
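
To see the append/extend behaviour without touching the real results files, the sketch below repeats the same logic in a standalone copy of `write_json` and runs it against a temporary file. The file name `demo_results.json` and the ids are made up for illustration.

```python
import json
import os
import tempfile


def write_json(new_data, filename):
    """Standalone copy of the append/extend logic so this sketch runs on its own."""
    with open(filename, "r+") as file:
        file_data = json.load(file)
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)   # list of records: add each one
        else:
            file_data.append(new_data)   # single record: add it as one item
        file.seek(0)
        json.dump(file_data, file)
        file.truncate()


with tempfile.TemporaryDirectory() as tmp:
    demo_file = os.path.join(tmp, "demo_results.json")
    # Start from the same placeholder list the notebook writes for a new year.
    with open(demo_file, "w") as f:
        json.dump([{"imdb_id": 0}], f)
    write_json({"imdb_id": "tt0000001"}, demo_file)                            # append
    write_json([{"imdb_id": "tt0000002"}, {"imdb_id": "tt0000003"}], demo_file)  # extend
    with open(demo_file) as f:
        demo_records = json.load(f)

print([r["imdb_id"] for r in demo_records])
# → [0, 'tt0000001', 'tt0000002', 'tt0000003']
```

Because the whole file is reloaded and rewritten on every call, this approach is simple but gets slower as the file grows.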

In [6]:
def get_movie_with_rating(movie_id):
    #Get the movie object for the current ID
    movie = tmdb.Movies(movie_id)
    #Save the .info and .releases dictionaries
    info = movie.info()
    releases = movie.releases()
    #Loop through countries in releases
    for c in releases['countries']:
        #If the country abbreviation is US
        if c['iso_3166_1'] == 'US':
            #Save the certification rating in info
            info['certification'] = c['certification']
            
    return info
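
The certification lookup can be checked offline against a hand-built dictionary shaped like the `releases` result used above (a `countries` list whose entries carry `iso_3166_1` and `certification`). The helper name `extract_us_certification` and the mock data are assumptions for illustration. Unlike the loop above, which keeps the last US entry and leaves `info` unchanged when there is none, this sketch returns the first US entry and falls back to `None`.

```python
def extract_us_certification(releases):
    """Return the first US certification in a releases-style dict, or None."""
    for country in releases.get("countries", []):
        if country.get("iso_3166_1") == "US":
            return country.get("certification")
    return None


# Mock data only; not an actual API response.
mock_releases = {
    "countries": [
        {"iso_3166_1": "GB", "certification": "12A"},
        {"iso_3166_1": "US", "certification": "PG-13"},
    ]
}

print(extract_us_certification(mock_releases))      # → PG-13
print(extract_us_certification({"countries": []}))  # → None
```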

In [7]:
def movie_year_ratings(years_to_filter):
    #Relies on the globals FOLDER, basics, and ERRORS being defined before the call
    #Begin looping through the years
    for YEAR in tqdm_notebook(years_to_filter, desc='YEARS', position=0):
        #Defining the JSON file to store results for year
        JSON_FILE = f'{FOLDER}tmdb_api_results_{YEAR}.json'
        #Check if file exists
        file_exists = os.path.isfile(JSON_FILE)
        #If it does not exist: create it
        if not file_exists:
            #Save a placeholder list with a single {"imdb_id": 0} record to the new json file.
            with open(JSON_FILE,'w') as f:
                json.dump([{'imdb_id':0}],f)
        #Saving new year as the current df
        df = basics.loc[basics['startYear']==YEAR].copy()
        #Saving movie ids to list
        movie_ids = df['tconst'].copy()
        #Load existing data from json into a dataframe called "previous_df"
        previous_df = pd.read_json(JSON_FILE)
        #Filter out any ids that are already in the JSON_FILE
        movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df['imdb_id'])]

        #INNER loop: request each movie id that is not yet in the JSON file
        for movie_id in tqdm_notebook(movie_ids_to_get,
                                      desc=f'Movies from {YEAR}',
                                      position=1,
                                      leave=True):
            try:
                # Retrieve the data for the movie id
                temp = get_movie_with_rating(movie_id)  
                # Append/extend results to file using a pre-made function
                write_json(temp,JSON_FILE)
                # Short 20 ms sleep to prevent overwhelming server
                time.sleep(0.02)

            except Exception as e:
                ERRORS.append([movie_id, e])

        final_year_df = pd.read_json(JSON_FILE)
        final_year_df.to_csv(f"{FOLDER}final_tmdb_data_{YEAR}.csv.gz",
                             compression="gzip", index=False)

    print(f"- Total errors: {len(ERRORS)}")
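
The function can be re-run safely because it skips ids that already have a record in the year's JSON file. The notebook does this with a pandas `isin` filter; the sketch below shows the same idea with plain Python sets so it runs without pandas. The helper name `ids_still_needed` and the sample ids are hypothetical.

```python
def ids_still_needed(all_ids, saved_records):
    """Return the ids in all_ids with no matching 'imdb_id' in saved_records, keeping order."""
    already_saved = {record.get("imdb_id") for record in saved_records}
    return [movie_id for movie_id in all_ids if movie_id not in already_saved]


# Sample data: the placeholder record plus one movie that was saved on an earlier run.
year_ids = ["tt0000001", "tt0000002", "tt0000003"]
saved = [{"imdb_id": 0}, {"imdb_id": "tt0000002"}]

print(ids_still_needed(year_ids, saved))  # → ['tt0000001', 'tt0000003']
```

Note that ids whose request raised an exception are never written to the file, so they would be requested again on the next run.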

## Verify API Extraction

In [8]:
#Create list of test IDs
test_ids = ["tt0848228", "tt0332280"]
results = []
#Loop through test ID list and append info into results
for movie_id in test_ids:
    try:
        movie_info = get_movie_with_rating(movie_id)
        results.append(movie_info)
    #On exception, report the ID and the error instead of silently skipping it
    except Exception as e:
        print(f"{movie_id} failed: {e}")
#View the final results
pd.DataFrame(results)

| | adult | backdrop_path | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | ... | revenue | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count | certification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | /9BBTo63ANSmhC4e6r62OJFuK2GL.jpg | {'id': 86311, 'name': 'The Avengers Collection... | 220000000 | [{'id': 878, 'name': 'Science Fiction'}, {'id'... | https://www.marvel.com/movies/the-avengers | 24428 | tt0848228 | en | The Avengers | ... | 1518815515 | 143 | [{'english_name': 'English', 'iso_639_1': 'en'... | Released | Some assembly required. | The Avengers | False | 7.7 | 27909 | PG-13 |
| 1 | False | /qom1SZSENdmHFNZBXbtJAU0WTlC.jpg | | 29000000 | [{'id': 10749, 'name': 'Romance'}, {'id': 18, ... | http://www.newline.com/properties/notebookthe.... | 11036 | tt0332280 | en | The Notebook | ... | 115603229 | 123 | [{'english_name': 'English', 'iso_639_1': 'en'... | Released | Behind every great love is a great story. | The Notebook | False | 7.881 | 9980 | PG-13 |


## Output Yearly Movie Data

In [9]:
#List of years to perform analysis on
YEARS = [2000, 2001]
#Empty list to contain exceptions when running the function
ERRORS = []
#Bring in the basics table from part 1
basics = pd.read_csv('Data/title_basics.csv.gz')

In [10]:
#Run the list of years in our function
movie_year_ratings(YEARS)

YEARS:   0%|          | 0/2 [00:00<?, ?it/s]

Movies from 2000:   0%|          | 0/1418 [00:00<?, ?it/s]

Movies from 2001:   0%|          | 0/1532 [00:00<?, ?it/s]

- Total errors: 446
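
The run only reports how many errors occurred, not what they were. Since each entry in `ERRORS` is a `[movie_id, exception]` pair, one way to inspect them is to tally the exception types. This is a minimal sketch; the helper name `summarize_errors` and the sample exceptions are made up and do not reflect the actual 446 errors.

```python
from collections import Counter


def summarize_errors(errors):
    """Count [movie_id, exception] pairs by exception class name, most common first."""
    counts = Counter(type(exc).__name__ for _, exc in errors)
    return counts.most_common()


# Hypothetical sample in the same [movie_id, exception] shape as ERRORS.
sample_errors = [
    ["tt0000001", KeyError("countries")],
    ["tt0000002", ValueError("bad response")],
    ["tt0000003", KeyError("countries")],
]

print(summarize_errors(sample_errors))  # → [('KeyError', 2), ('ValueError', 1)]
```

In the notebook this would be called as `summarize_errors(ERRORS)` after the extraction finishes.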


## Verify File Output

In [11]:
os.listdir(FOLDER)

['.ipynb_checkpoints',
 'final_tmdb_data_2000.csv.gz',
 'final_tmdb_data_2001.csv.gz',
 'title_akas.csv.gz',
 'title_basics.csv.gz',
 'title_ratings.csv.gz',
 'tmdb_api_results_2000.json',
 'tmdb_api_results_2001.json']
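
Listing the folder confirms the files exist, but not that they are readable. A quick check using only the standard library is to open a `.csv.gz` file and look at its header and first rows. The sketch below writes a tiny stand-in file first so it runs anywhere; the helper name `peek_csv_gz`, the file name, and the row values are placeholders, not real extraction results.

```python
import csv
import gzip
import os
import tempfile


def peek_csv_gz(path, n_rows=2):
    """Return the header and up to n_rows data rows from a gzip-compressed CSV."""
    with gzip.open(path, "rt", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_year.csv.gz")
    # Write a small stand-in file with placeholder values.
    with gzip.open(demo_path, "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["imdb_id", "budget", "revenue", "certification"])
        writer.writerow(["tt0000001", "100", "250", "PG"])
    header, rows = peek_csv_gz(demo_path)

print(header)  # → ['imdb_id', 'budget', 'revenue', 'certification']
print(rows)    # → [['tt0000001', '100', '250', 'PG']]
```

The `csv` module returns every value as a string; in the notebook, `pd.read_csv` on the same path would infer numeric types instead.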