# Project 3 - Part 2

## Business problem

- Produce a MySQL database on movies from a subset of IMDB's publicly available dataset.
- Use this database to analyze what makes a movie successful.
- Provide recommendations to the stakeholder on how to make a successful movie.

## Requirements

- Perform a test extraction of movies that started in 2000 or 2001.
- Save each year as a separate .csv.gz file.
- Use one function to add the certification (MPAA rating) to movie.info.
- Test your function on these 2 movie ids: tt0848228 ("The Avengers") and tt0332280 ("The Notebook").
- Use a second function to append to or extend a JSON file with Python.
- Save the final results to 2 separate .csv.gz files.

## Exploratory Data Analysis

- Load the .csv.gz results file for each extracted year.
- Concatenate the data into 1 dataframe for the remainder of the analysis, then answer:
    - How many movies had at least some valid financial information (values > 0 for budget OR revenue)?
    - Please exclude any movies with 0's for budget AND revenue from the remaining visualizations.
    - How many movies are there in each of the certification categories (G/PG/PG-13/R)?
    - What is the average revenue per certification category?
    - What is the average budget per certification category?
- Save a final merged .csv.gz of all of the TMDB API data called "tmdb_results_combined.csv.gz".
- One code file for API calls
- One code file for EDA
- Submit the link
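
The EDA steps listed above can be sketched roughly as follows. This is a minimal illustration on toy data, not the project's actual results: the two small dataframes and their values are hypothetical stand-ins for the per-year files, and only the column names `budget`, `revenue`, and `certification` are taken from the TMDB output shown later in this notebook.

```python
import pandas as pd

# Toy stand-ins for the per-year result files (hypothetical values).
df_2000 = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002"],
    "budget": [1_000_000, 0],
    "revenue": [5_000_000, 0],
    "certification": ["PG", "R"],
})
df_2001 = pd.DataFrame({
    "imdb_id": ["tt0000003", "tt0000004"],
    "budget": [0, 2_000_000],
    "revenue": [3_000_000, 4_000_000],
    "certification": ["PG", "R"],
})

# Concatenate the yearly results into one dataframe.
combined = pd.concat([df_2000, df_2001], ignore_index=True)

# Keep movies with a positive budget OR a positive revenue.
has_financials = (combined["budget"] > 0) | (combined["revenue"] > 0)
valid = combined[has_financials]
n_valid = len(valid)

# Count movies and average the financials per certification category.
counts = valid["certification"].value_counts()
averages = valid.groupby("certification")[["revenue", "budget"]].mean()

print(n_valid)
print(counts)
print(averages)
```

Filtering before grouping means the per-category averages are not pulled down by movies that have no financial data at all.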

## Imports

In [2]:
import pandas as pd
import numpy as np
import tmdbsimple as tmdb

import os, json, math, time

from tqdm.notebook import tqdm_notebook

## Code

In [4]:
## Extract all movies
all_movies = pd.read_csv("Data/title_basics.csv.gz", low_memory=False)

In [3]:
## Filter movies for start year 2000 and 2001
movies_2000 = all_movies[all_movies['startYear'] == 2000]
movies_2001 = all_movies[all_movies['startYear'] == 2001]
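
The comparison `startYear == 2000` only matches when the column is numeric. The raw IMDB files mark missing values with `\N`, so if `title_basics.csv.gz` has not been cleaned beforehand (an assumption, since the preprocessing is not shown here), the column may load as strings and the filter would return no rows. A minimal sketch of a defensive conversion, using a hypothetical toy frame:

```python
import pandas as pd

# Toy frame mimicking raw IMDB columns (hypothetical rows).
raw = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "startYear": ["2000", "\\N", "2001"],
})

# Coerce to numbers; anything unparseable (such as "\N") becomes NaN.
raw["startYear"] = pd.to_numeric(raw["startYear"], errors="coerce")

# Split into one dataframe per year of interest.
by_year = {year: raw[raw["startYear"] == year] for year in (2000, 2001)}

print({year: len(frame) for year, frame in by_year.items()})
```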

In [4]:
## Save as a separate .csv.gz file
movies_2000.to_csv("Data/movies_2000.csv.gz", compression='gzip', index=False)
movies_2001.to_csv("Data/movies_2001.csv.gz", compression='gzip', index=False)

In [5]:
## Load API credentials (json is already imported above)
with open('/Users/ShPatel/.secret/tmdb_api.json', 'r') as f:
    login = json.load(f)

In [6]:
tmdb.API_KEY = login['api-key']

In [7]:
## Create a function to add certification to movies data
def get_movie_with_rating(movie_id):
    """Adapted from source https://github.com/celiao/tmdbsimple"""
    
    ## Get the movie object for the current id
    movie = tmdb.Movies(movie_id)
    
    ## Save the .info and .releases dictionaries
    info = movie.info()
    releases = movie.releases()

    ## Loop through countries in releases
    for c in releases['countries']:
        ## When country abbreviation == US
        if c['iso_3166_1'] == 'US':
            ## Save a certification key in info with the certification
            info['certification'] = c['certification']
            break

    return info
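
Note that the function above leaves `info` without a `certification` key when no US entry exists, and it stops at the first US entry even if that entry's certification is an empty string. A minimal sketch of a stricter lookup follows. The helper name `extract_us_certification` and both sample payloads are hypothetical; they only mirror the shape of the `releases` dictionary used above, and no API call is made.

```python
def extract_us_certification(releases):
    """Return the first non-empty US certification, or None if absent."""
    for country in releases.get("countries", []):
        # Skip US entries whose certification is missing or empty.
        if country.get("iso_3166_1") == "US" and country.get("certification"):
            return country["certification"]
    return None


# Hypothetical payloads shaped like the `releases` dictionary used above.
with_us = {"countries": [
    {"iso_3166_1": "GB", "certification": "12A"},
    {"iso_3166_1": "US", "certification": ""},
    {"iso_3166_1": "US", "certification": "PG-13"},
]}
without_us = {"countries": [{"iso_3166_1": "GB", "certification": "12A"}]}

print(extract_us_certification(with_us))
print(extract_us_certification(without_us))
```

Always setting `info['certification']` to the returned value, `None` included, would give every record the same set of keys, which keeps the resulting dataframe columns consistent.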

In [8]:
## Testing our function by looping through a list of ids
test_ids = ["tt0848228", "tt0332280"]
results = []
errors = []

for movie_id in test_ids:
    
    try:
        movie_info = get_movie_with_rating(movie_id)
        results.append(movie_info)
        
    except Exception as e: 
        errors.append([movie_id, e])
    
pd.DataFrame(results)

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,False,/9BBTo63ANSmhC4e6r62OJFuK2GL.jpg,"{'id': 86311, 'name': 'The Avengers Collection...",220000000,"[{'id': 878, 'name': 'Science Fiction'}, {'id'...",https://www.marvel.com/movies/the-avengers,24428,tt0848228,en,The Avengers,...,1518815515,143,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Some assembly required.,The Avengers,False,7.711,29302,PG-13
1,False,/qom1SZSENdmHFNZBXbtJAU0WTlC.jpg,,29000000,"[{'id': 10749, 'name': 'Romance'}, {'id': 18, ...",http://www.newline.com/properties/notebookthe....,11036,tt0332280,en,The Notebook,...,115603229,123,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Behind every great love is a great story.,The Notebook,False,7.881,10705,PG-13


In [12]:
## Create a function to extend or append to a JSON file with Python
def add_certification_to_movie(data, file_name):
    """Appends a record, or extends with a list of records (data), to a JSON file (file_name).
    Adapted from: https://www.geeksforgeeks.org/append-to-json-file-using-python/"""

    with open(file_name, 'r+') as file:

        ## Load existing data
        file_data = json.load(file)

        ## Extend with a list of records, or append a single record
        if isinstance(data, list) and isinstance(file_data, list):
            file_data.extend(data)
        else:
            file_data.append(data)

        ## Move back to the start of the file before rewriting it
        file.seek(0)

        ## Write the updated data back as JSON
        json.dump(file_data, file)
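
The append/extend pattern can be tried without touching the project data. The sketch below is a standalone illustration under stated assumptions: `append_records` is a hypothetical name, the file lives in a temporary directory, and the `file.truncate()` call is an added safeguard that is not part of the function above.

```python
import json
import os
import tempfile


def append_records(new_data, file_name):
    """Append one record, or extend with a list of records, in a JSON list file."""
    with open(file_name, "r+") as file:
        file_data = json.load(file)
        if isinstance(new_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        # Rewrite from the start and drop any leftover bytes.
        file.seek(0)
        json.dump(file_data, file)
        file.truncate()


# Seed a temporary file with a JSON list holding one placeholder record.
path = os.path.join(tempfile.mkdtemp(), "demo.json")
with open(path, "w") as file:
    json.dump([{"imdb_id": 0}], file)

append_records({"imdb_id": "tt0000001"}, path)
append_records([{"imdb_id": "tt0000002"}, {"imdb_id": "tt0000003"}], path)

with open(path) as file:
    stored = json.load(file)
print(len(stored))
```

Because the file content only ever grows here, rewriting from offset 0 already overwrites everything; truncating simply guards against a shorter rewrite leaving stale text at the end.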


In [13]:
## Run for extracted movies

folder = 'Data/'
years = [2000, 2001]

for year in tqdm_notebook(years, desc='Years', position=0):

    file_name = f'{folder}movies_{year}_final.json'

    ## Start each year with an empty error list
    errors = []

    ## Seed the file with a list of records. Seeding it with a bare dict makes
    ## pd.read_json raise "ValueError: If using all scalar values, you must pass an index".
    ## Delete any file left over from a run that was seeded that way.
    if not os.path.exists(file_name):
        with open(file_name, 'w') as file:
            json.dump([{'imdb_id': 0}], file)

    selected_movies = all_movies[all_movies['startYear'] == year]
    movie_ids = selected_movies['tconst']

    ## Skip ids that were already saved in a previous run
    previous_df = pd.read_json(file_name)
    clean_movie_ids = movie_ids[~movie_ids.isin(previous_df['imdb_id'])]

    for movie_id in tqdm_notebook(clean_movie_ids,
                                  desc=f'Movies from {year}',
                                  position=1,
                                  leave=True):
        try:
            movie_info = get_movie_with_rating(movie_id)
            add_certification_to_movie(movie_info, file_name)
            time.sleep(0.02)

        except Exception as e:
            errors.append((movie_id, str(e)))

    results = pd.read_json(file_name)
    results.to_csv(f'{folder}movies_{year}_final.csv.gz', compression='gzip', index=False)

    ## Save the errors collected for this year
    errors_df = pd.DataFrame(errors, columns=['imdb_id', 'error'])
    errors_df.to_csv(f'{folder}errors_{year}.csv.gz', compression='gzip', index=False)
    print(f"- Total errors: {len(errors_df)}")
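
`pd.read_json` raises `ValueError: If using all scalar values, you must pass an index` when the file holds a single dict of scalars such as `{"imdb_id": 0}`. It reads a list of records such as `[{"imdb_id": 0}]` without trouble. The sketch below shows the resume step on a temporary file under that assumption. The ids are hypothetical, and the second write only simulates a movie saved by an earlier run.

```python
import json
import os
import tempfile

import pandas as pd

file_name = os.path.join(tempfile.mkdtemp(), "movies_demo_final.json")

# Seed the file with a list containing one placeholder record.
if not os.path.exists(file_name):
    with open(file_name, "w") as file:
        json.dump([{"imdb_id": 0}], file)

# Simulate an earlier run that already saved one movie.
with open(file_name, "w") as file:
    json.dump([{"imdb_id": 0}, {"imdb_id": "tt0000001"}], file)

# Hypothetical ids selected for this year.
movie_ids = pd.Series(["tt0000001", "tt0000002", "tt0000003"])

# Keep only the ids that are not in the saved file yet.
previous_df = pd.read_json(file_name)
remaining = movie_ids[~movie_ids.isin(previous_df["imdb_id"])]

print(remaining.tolist())
```

Filtering against the saved file before each run lets an interrupted extraction pick up where it stopped instead of requesting every movie again.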