# Module 1 Project

Please fill out:
* Student name: Jennifer Wadkins
* Student pace: self paced
* Scheduled project review date/time: 
* Instructor name: Jeff Herman
* Blog post URL:



Questions I have:
    * Do I need to justify not using provided data, or can I just go straight for my own?
    * Am I allowed to make some assumptions/framing for the case study, ex. set in a pre-covid world?

## Project Overview

## Importing our modules

We will be using the following libraries in this project:

pandas, numpy, matplotlib, json, re

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import re
%matplotlib inline

## Other preparation work

We also recommend installing the nbextensions "Table of Contents 2" and "Collapsible Headings" for easier navigation through this notebook.

GitHub here: https://github.com/ipython-contrib/jupyter_contrib_nbextensions

## Project Steps

    1) Open, explore, and perform necessary cleaning on provided data sets. Determine need for additional data and acquire it via API calls and web scraping. Decide on most robust data to use as the "master" set.
    2) Merge data sets into larger data sets as needed. Clean further until working with robust data.
    3) EDA on data sets including visualizations

## Notebook Functions

In [None]:
def string_cleanup(text):
    '''takes in an object, converts it to a string, strips common punctuation and special characters, and collapses repeated whitespace'''
    text = str(text)
    result = re.sub(r"[,@\'?\.$%_:â()-]", "", text, flags=re.I)
    result = re.sub(r"\s+"," ", result, flags = re.I)
    return result
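
As a quick sanity check of what this cleanup keeps and removes, here is a small sketch. The function body is repeated so the cell runs on its own, and the sample titles are just illustrations, not rows taken from our data.

```python
import re

def string_cleanup(text):
    """Same logic as the notebook function, repeated so this cell is self-contained."""
    text = str(text)
    result = re.sub(r"[,@\'?\.$%_:â()-]", "", text, flags=re.I)
    return re.sub(r"\s+", " ", result)

# Illustrative titles only
samples = ["Spider-Man: Far From Home", "Ocean's  Eleven", "Mr. & Mrs. Smith"]
cleaned = [string_cleanup(title) for title in samples]
print(cleaned)
# ['SpiderMan Far From Home', 'Oceans Eleven', 'Mr & Mrs Smith']
```

Note that characters outside the listed set, such as "&", survive the cleanup, so two sources still have to agree on those characters for titles to match.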

# Project Step 1 - Data Acquisition and Cleaning

Open, explore, and perform necessary cleaning on provided data sets. Determine need for additional data and acquire it via API calls and web scraping. Decide on most robust data to use as the "master" set.

## Source 1 - The Movie Database

First we will look at our provided data set from TMDB and see what cleaning it needs. When performing cleaning analysis on ALL datasets in this project, we initially want to know things like:

    * What is the shape of our imported data?
    * How many data entries?
    * What format is the data in?
    * How can we remove the most obvious redundancies (columns we just don't need, etc)
    * Are there missing/null values in the dataset that will need to be removed or imputed?

### Exploring the Data - original TMDB file

In [None]:
# importing the movie database movies data set from file
tmdb = pd.read_csv('zippedData/tmdb.movies.csv')

In [None]:
# taking a look at what we've imported
tmdb.head(10)

In [None]:
# what is the shape of our data?
tmdb.shape
# this dataset has 26,517 movie entries. At first glance we are very excited about all of this data!

In [None]:
# what kind of data is stored?
tmdb.dtypes
# Most of the data in this set seems to be stored in the correct format already (numbers as numbers, etc)
# we'll change the date to a proper date/time

In [None]:
# do we have any missing/null values?
tmdb.isnull().sum()
# This dataset has no missing values. That doesn't mean there aren't categorical placeholders, and we will look into that further


In [None]:
tmdb.describe()
# One thing we can see in this dataset is that there are a LOT of movies with 5 or fewer votes. A full 50% of the dataset
# has 5 or fewer votes. The jump from our 75th percentile to the max goes from 28 to 22,000 votes!!
# We will look more into this and figure out the situation.

In [None]:
tmdb['vote_count'].value_counts()
# There are 6541 entries in this dataset with only 1 vote

In [None]:
tmdb.sort_values('popularity')
# while sorting on popularity, I also notice for the first time that a lot of the genre_ids on this low popularity list are absent


In [None]:
tmdb[(tmdb['vote_count'] > 30)].count()
# we only have 6347 entries in this dataset with more than 30 user votes. I question the quality of this dataset.
# Overall, this might just not be great data, and since we have access to a TMDB API, we will pull better data
# ourselves

### Pulling Data from TMDB via API

Instead of using the provided data set from TMDB, we're going to pull the specific movie data that we want to use from TMDB using an API key. 

We're accessing the API documentation for TMDB at https://developers.themoviedb.org/3/getting-started/introduction, after registering for an API key.

We can see some interesting options in the TMDB API that we want to add to our data, including:
    * Movie genre list to match up with the genre-ids (under Genres)
    * More up-to-date dataset in general, retrieved with some predetermined data refinement criteria
    * A list of the IMDB movie ids, which will be incredibly helpful for us to join this TMDB info with our IMDB info later in the notebook (under Movies -> Get External IDs)
    


### Movie Genres Matchup

TMDB also allows for browser-based API calls, which work well for small, simple calls. We used their browser system for this simple call, copied the text results into our source code editor, and saved them as a JSON file.

We used https://developers.themoviedb.org/3/genres/get-movie-list to get a JSON dictionary of movie genres.

In [None]:
# We saved the resulting web-based text return as a JSON using our source code editor, and now we load it
# the with-block closes the file for us once the JSON is loaded
with open('api_data/tmdb_movie_genres.json') as f:
    data = json.load(f)
data

In [None]:
genres = data['genres']

tmdb_genres = {}

for x in range(len(genres)):
    key = genres[x]['id']
    value = genres[x]['name']
    tmdb_genres[key] = value
    
tmdb_genres

# This did what we want, but WE CAN DO BETTER.
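
One way to tighten this up is a dictionary comprehension plus a small helper that translates a movie's `genre_ids` list into names. This is only a sketch: the payload below is a hand-written stand-in in the same shape as the genre list response (three genres only), and `genre_names` is a hypothetical helper name, not something from the TMDB API.

```python
# Illustrative payload in the same shape as the genre list JSON; only three genres shown
data = {"genres": [{"id": 28, "name": "Action"},
                   {"id": 12, "name": "Adventure"},
                   {"id": 35, "name": "Comedy"}]}

# Build the id -> name lookup in a single pass
tmdb_genres = {genre["id"]: genre["name"] for genre in data["genres"]}

def genre_names(genre_ids, lookup):
    """Translate a list of genre ids into names, skipping ids missing from the lookup."""
    return [lookup[genre_id] for genre_id in genre_ids if genre_id in lookup]

print(genre_names([12, 28], tmdb_genres))
# ['Adventure', 'Action']
```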

### Discover Data Set

The big workhorse API for TMDB is in "Discover" located at https://developers.themoviedb.org/3/discover/movie-discover

In this section we can get back a data set that can, in some ways, be pre-cleaned. So we are going to determine how we plan to refine/clean our data set right now, and then figure out ways that we can pull data from TMDB that already fits the parameters we want.

Here are the data cleanup steps we are planning for our data set, some of which can be achieved while we grab the data:

    * Drop entries with fewer than 30 votes. Our client is looking for a blockbuster, not a bespoke production.
    * Drop entries with no genre specified. We'll want to use the genre to make recommendations.
    * Drop entries with 1.0 or less popularity, for the same reasons as votes
    * Drop movies older than 2000. We want a relatively current dataset in order to make proper recommendations.
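
If we had to apply all four of these rules ourselves in pandas, it could look roughly like the sketch below. The dataframe is toy data made up for illustration; only the column names mirror the TMDB fields.

```python
import pandas as pd

# Toy rows for illustration; column names follow the TMDB fields
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "vote_count": [1200, 12, 450, 95],
    "popularity": [35.2, 0.6, 8.1, 4.4],
    "genre_ids": [[28, 12], [18], [], [35]],
    "release_date": pd.to_datetime(["2015-06-01", "2012-03-09", "2008-11-21", "1997-07-04"]),
})

# One boolean mask per cleanup rule, combined with &
keep = (
    (movies["vote_count"] >= 30)                 # at least 30 votes
    & (movies["popularity"] > 1.0)               # more than 1.0 popularity
    & (movies["genre_ids"].map(len) > 0)         # at least one genre
    & (movies["release_date"].dt.year >= 2000)   # released in 2000 or later
)
filtered = movies[keep]
print(filtered["title"].tolist())
# ['A']
```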
    
The Discover API lets us pass the following useful parameters to fulfill some of our data refinement goals:
    * primary_release_date.gte lets us include movies that have a primary release date greater than or equal to the specified value
    * primary_release_date.lte lets us include movies that have a primary release date less than or equal to the specified value. This will keep our scope in 2019 or older for purposes of our case study. We're looking at movie production in a pre-covid world.
    * vote_count.gte lets us filter for movies with a vote count greater than or equal to the specified value
    * with_original_language lets us pull English-language films. Our client will be making films in English

This will take care of a few of the things we wanted to clean up in our dataset.
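
To make the parameters concrete, here is a minimal sketch of how one Discover request URL could be assembled. The real calls live in the tmdb_api_calls notebook, which is not shown here, so the exact filter values (the dates and the vote threshold of 30) and the `discover_url` helper are assumptions for illustration. Nothing is sent over the network.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.themoviedb.org/3/discover/movie"

def discover_url(api_key, page):
    """Build one Discover request URL; the filter values are assumed, not taken from our API notebook."""
    params = {
        "api_key": api_key,
        "primary_release_date.gte": "2000-01-01",  # assumed lower bound
        "primary_release_date.lte": "2019-12-31",  # assumed upper bound (pre-covid scope)
        "vote_count.gte": 30,                      # assumed vote threshold
        "with_original_language": "en",
        "include_adult": "false",
        "include_video": "false",
        "page": page,
    }
    return f"{BASE_URL}?{urlencode(params)}"

url = discover_url("YOUR_API_KEY", page=1)
print(url)
```

Each page of results needs its own request, so the `page` value is the only thing that changes between calls.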
 
We're getting this and other API data in a separate notebook, because we don't want to make these API calls every time we run this notebook! We've pulled the data via the notebook called "tmdb_api_calls" and saved it as a JSON file, and will now import our JSON file here for further processing.

#### !!!!! STOP !!!!! Go to the notebook at tmdb_api_calls.ipynb and run the first section titled "Discover Data Set" now.

Alternatively, load the provided JSON file below, where we have already done this task.

In [None]:
# opening up our Discover dataset

with open('api_data/tmdb_movies.json', encoding='utf-8') as f:
    discover = json.load(f)

type(discover) # we've loaded our Discover dataset and it's a dictionary

In [None]:
discover.keys() # checking the keys
# we ran our function to paginate in the API and as a result, our keys are each of the 500 calls we made to the api. We'll
# need to go a level lower to hit our data.

In [None]:
# what does the first level of our dictionary look like?
discover['1']
# This is page 1 of the results

discover['1']['results']
# these are the entries on page 1. Our plan now is to write a loop to iterate through the pages, and concatenate the 
# results onto our dataframe tmdb_discover
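
The nesting described here (page key, then `results`, then one dictionary per movie) can be illustrated with a toy stand-in. The two pages below are made up; only the layout mirrors what we loaded.

```python
# Toy stand-in for the loaded JSON: page numbers (as strings) map to one API response each
discover = {
    "1": {"page": 1, "results": [{"id": 1, "title": "First"}, {"id": 2, "title": "Second"}]},
    "2": {"page": 2, "results": [{"id": 3, "title": "Third"}]},
}

# Go one level down on every page and collect all movie dictionaries in a flat list
rows = [movie for page in discover.values() for movie in page["results"]]
print(len(rows), [movie["title"] for movie in rows])
# 3 ['First', 'Second', 'Third']
```

A flat list of dictionaries like `rows` can be handed straight to `pd.DataFrame` to get one row per movie.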

In [None]:
# turn each page of our response JSON into a dataframe, then concatenate them all in a single call
page_frames = [pd.DataFrame(discover[page]['results']) for page in discover]
tmdb_discover = pd.concat(page_frames)

tmdb_discover #finished dataframe

By pre-filtering for year 2000 or later, 31+ votes, and English-language films, we hit the 10,000 results limit with the TMDB API. However, we can see that the default sort on this data set is by popularity, so we will conclude that we have gotten the 10,000 most popular movies since 2000, and be happy with the quality of this data.

We are definitely going to use this data instead of the provided TMDB dataset, which had only around 6,000 results after removing entries with <30 votes, and had not yet been filtered for release after 2000 OR for English language.

We are NOT using the provided TMDB dataset from earlier in the notebook. We've found that we have higher quality data via our API pull, and will be using our tmdb_discover dataset and discarding our tmdb dataset.

### Exploring the Data - Part 2

In [None]:
tmdb_discover.shape
# we have 10,000 entries

In [None]:
tmdb_discover.dtypes
# we'll take a look at fixing the release_date format and converting that to a proper datetime. Everything else looks correct.

In [None]:
tmdb_discover.describe()
# we can see that we have meaningful data with a reasonable vote_count per entry and high popularity

In [None]:
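# genre_ids holds Python lists after loading from JSON, so we test for empty lists rather than comparing to the string '[]'
tmdb_discover[tmdb_discover['genre_ids'].map(len) == 0].count()
# A count of 0 in every column means all of our entries have genre ids.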

In [None]:
tmdb_discover.isnull().sum()
# we have no null or missing values in our dataset

In [None]:
tmdb_discover.columns
# we don't need all of these columns, so I need a reminder right here of what I want to drop

### Data Cleanup

What do we actually need to use from this data set?

We'll be using this data set as the basis for all further connections in this project, as the TMDB API allows us to gather both the most up-to-date information as well as provides us with important details such as a specific release date and genres.

We're going to do the following work on this dataset to clean it up:
    
    a) change our release date to standard format
    b) Drop unneeded columns
        * video - we know all of these values are false, as it was part of our API parameters
        * poster_path - provides a path to an image for the movie, which we don't need
        * adult - we know all of these values are false, as it was part of our API parameters
        * backdrop_path - another set of images, which we don't need
        * original_title - if the movie is in a foreign language, the original title is here; we only need the English titles
        * overview - summary description of the movie, which we cannot use in visualization
        * original_language - we're only using English-language movies, so this is a redundant field


In [None]:
# cleaning up this dataset

#drop columns by name
tmdb_discover.drop(columns=['video', 'poster_path', 'adult', 'backdrop_path', 'original_title', 'overview', 'original_language'], inplace=True)

In [None]:
# using pandas built-in datetime converter to change our release date column to standard format
tmdb_discover['release_date'] = pd.to_datetime(tmdb_discover['release_date'])

In [None]:
tmdb_discover.dtypes

In [None]:
tmdb_discover # confirming that we have cleaned up our data and have only the information we need to use

We need this data set in order to make our API calls for the IMDB ID matchup, so we're going to export it to a csv that we can then import into our API production file.

In [None]:
# Exporting our csv so that we can make our API calls to match up IMDB ID
#tmdb_discover.to_csv('api_data/tmdb_discover.csv', index=False)

### IMDB ID Matchup

Our next goal is to match up IMDB movie ids for each of the movie ids in our data set. TMDB has an API to do exactly this - submit the TMDB id, and get an IMDB id in return. Each TMDB movie id is a parameter that must be passed to an individual API call, so we won't be using the web interface for this action.

We move to the tmdb_api_calls notebook to do this process.

We've exported our Discover Data Set up above and will process it in our API notebook, and will then re-import it here with our TMDB ids replaced with IMDB ids!
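
For orientation, here is a rough sketch of what a single lookup involves: one URL per TMDB id, and an `imdb_id` field to read out of the JSON that comes back. The helper names are hypothetical, the response strings are hand-written samples rather than captured API output, and the actual requests are made in the API notebook, not here.

```python
import json

def external_ids_url(tmdb_id, api_key):
    """Build the 'Get External IDs' request URL for one TMDB movie id."""
    return f"https://api.themoviedb.org/3/movie/{tmdb_id}/external_ids?api_key={api_key}"

def extract_imdb_id(response_text):
    """Return the imdb_id from a response body, or None when it is missing or empty."""
    payload = json.loads(response_text)
    return payload.get("imdb_id") or None

# Hand-written sample responses for illustration
found = extract_imdb_id('{"id": 550, "imdb_id": "tt0137523"}')
missing = extract_imdb_id('{"id": 999999, "imdb_id": null}')

print(external_ids_url(550, "YOUR_API_KEY"))
print(found, missing)
```

Returning `None` for a missing id is what lets us spot and drop unmatched movies after the converted file is loaded back in.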

#### !!! STOP !!! Go to the API notebook tmdb_api_calls.ipynb and run the second section titled "IMDB ID Matchup" now.
Alternatively, load in the provided csv where we have already done this task.

In [None]:
tmdb_discover = pd.read_csv('api_data/tmdb_discover_converted.csv')

tmdb_discover.sort_values('id')
# we now have our original tmdb_discover dataset converted to IMDB ids instead of TMDB ids.
# We'll be able to cross reference this set later on with IMDB datasets.

In [None]:
# We didn't find four IMDB IDs, so we will drop them (.copy() gives us an independent dataframe that is safe to modify)
tmdb_discover = tmdb_discover[tmdb_discover['id'].notna()].copy()

In [None]:
# use the string_cleanup function to remove special characters from the titles, in hopes of matching other data to this data later
# mapping over the whole column avoids the chained assignment (df['title'][ind] = ...) that pandas warns about
tmdb_discover['title'] = tmdb_discover['title'].map(string_cleanup)

In [None]:
#Now that we have replaced our TMDB id with IMDB id, we'll set the IMDB id as our index
tmdb_discover.set_index('id', inplace=True)

In [None]:
# confirming it worked
tmdb_discover

In [None]:
tmdb_discover.dtypes

In [None]:
# using pandas built-in datetime converter to change our release date column to standard format
tmdb_discover['release_date'] = pd.to_datetime(tmdb_discover['release_date'])

### Export Web Scraper File

We now will export our completed tmdb_discover file in order to use it to scrape Box Office Mojo.

In [None]:
#exporting the dataframe to a csv to use with our web scraper
#tmdb_discover.to_csv('api_data/tmdb_imdb_ids.csv')

## Source 2 - Box Office Mojo

Box Office Mojo is part of IMDB pro and does not offer a personal-use API. We started with our movie data of 10,000 entries from TMDB and used another TMDB API to obtain all of the IMDB IDs for those 10,000 movies. Now we will use the web scraper in our bom_scraper notebook to look up each IMDB ID at Box Office Mojo and find the MPAA rating, studio, domestic gross, foreign gross and budget information for each movie, if available.

#### !!! STOP !!! Go to the notebook bom_scraper.ipynb now and run the web scraper

Alternatively, load in the provided csv where we have already done this task.

### Exploring the Data

In [None]:
#Box Office Mojo movie gross
bom = pd.read_csv('api_data/tmdb_bom_scraped.csv')

In [None]:
bom

In [None]:
# what is the shape of our data?
bom.shape
# this dataset has 3251 movie entries

In [None]:
# what kind of data is stored?
bom.dtypes
# Most of this data is stored correctly, except foreign_gross. We will have to fix this column

In [None]:
# do we have any missing/null values?
bom.isnull().sum()
# This dataset is missing a few ids. The id is how we will connect this data to our tmdb dataset later, so we'll drop these rows.

In [None]:
bom = bom[bom['id'].notna()]

In [None]:
round(bom.describe(), 2)
# One useful bit of info we get is that the earliest movie on this list is from 2010, and the latest is from 2018

In [None]:
bom.sort_values('dom_gross', ascending=False).head(30)
# The foreign_gross column needs to be fixed and turned into a float. Right now it is an object and does not sort properly.
# values over 1bil are stored as 4 digit numbers, which skews our information.
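
One possible repair for this column is sketched below. It rests on an assumption we have not verified against the scraped file: that the short values (for example something like "1,131.6") were recorded in millions of dollars while everything else is in plain dollars. The function name, the cutoff, and the sample values are all hypothetical.

```python
def fix_foreign_gross(value, million_cutoff=10_000):
    """Convert a foreign gross string to whole dollars.

    Assumption: any parsed number below `million_cutoff` was recorded in
    millions of dollars rather than in dollars.
    """
    number = float(str(value).replace(",", "").replace("$", ""))
    if number < million_cutoff:
        number *= 1_000_000
    return round(number)

print(fix_foreign_gross("1,131.6"))    # treated as millions
print(fix_foreign_gross("652000000"))  # already in dollars
```

A genuinely tiny gross (under $10,000) would be scaled up by mistake under this rule, and missing values would need to be handled before calling it, so the cutoff should be checked against the data first.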

### Data Cleanup

We performed some of our data cleanup during our web scrape, but we'll be doing these additional tasks:

    * Set our IMDB ID as the index so we can join on this field later

In [None]:
bom.isnull().sum()

In [None]:
# set the imdb id as the index (mistakenly named tmdb_id)
bom.set_index('id', inplace=True)

In [None]:
bom

## Source 3 - IMDB
   
While we do our exploration and cleanup analysis on each of these IMDB data sets, we'll explore how they will interact with each other when we merge them. We'll determine what needs to be cleaned before vs after merging the datasets.

### IMDB1 - User ratings per movie ID

In [None]:
# import imdb user ratings per movie
imdb1 = pd.read_csv('zippedData/title.ratings.csv')

#### Exploring the Data

In [None]:
# taking a look at what we've imported
imdb1
# this dataset is using the movie id and showing the average user rating, and the number of votes

In [None]:
# what is the shape of our data?
imdb1.shape
# this dataset has 73,856 movie entries

In [None]:
# what kind of data is stored?
imdb1.dtypes
# The data in this set appears to be stored in the proper formats

In [None]:
# what are our columns?
imdb1.columns
# The 'tconst' will be found throughout our IMDB datasets. We will consider turning it into our index for all of the IMDB datasets.

In [None]:
# do we have any missing/null values?
imdb1.isnull().sum()
# This dataset has no missing values. That doesn't mean there aren't categorical placeholders, and we will look into that further

In [None]:
round(imdb1.describe(), 2)

#### Data Cleanup

In [None]:
#We make the unique "tconst" into our index.
imdb1.set_index('tconst', inplace=True)

In [None]:
imdb1

### IMDB2 - Cast and crew per movie ID

In [None]:
# import imdb primary cast and crew per movie
imdb2 = pd.read_csv('zippedData/title.principals.csv')

#### Exploring the Data

In [None]:
# taking a look at what we've imported
imdb2
# this dataset is using the movie id and showing the principal cast and crew for each movie, by the cast/crew id

In [None]:
# what is the shape of our data?
imdb2.shape
# this dataset has 1,028,186 cast and crew entries

In [None]:
# what kind of data is stored?
imdb2.dtypes
# The data in this set appears to be stored in the proper formats

In [None]:
# what are our columns?
imdb2.columns
# The 'tconst' will be found throughout our IMDB datasets. We will turn it into our index for all of the IMDB datasets.

In [None]:
# do we have any missing/null values?
imdb2.isnull().sum()
# This dataset has large numbers of missing values. We will inspect the data itself to determine if this is important.

In [None]:
temp = imdb2.loc[(imdb2['job'].notnull())]
temp
# job seems very closely related to category. Only 177k (out of over 1mil) entries have this column filled
# and it's largely a duplicate or reword of category. We will drop this column.

In [None]:
temp = imdb2.loc[(imdb2['characters'].notnull())]
temp
# it seems unimportant to know what character the actors and actresses play. We can't really use that information.
# we will drop this column

After studying this dataset, we see that the movie id (tconst) is not unique. Because of this, we will not turn the tconst value into the index in any of the datasets.
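
A quick way to confirm this kind of repetition is `Series.is_unique`, plus a per-id row count. The sketch below uses three made-up rows in the same layout (one row per person per movie), not our real data.

```python
import pandas as pd

# Toy rows for illustration: one row per person per movie
principals = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000001", "tt0000002"],
    "nconst": ["nm0000010", "nm0000011", "nm0000012"],
    "category": ["director", "writer", "director"],
})

# False as soon as any movie id appears on more than one row
print(principals["tconst"].is_unique)

# How many cast/crew rows each movie id has
counts = principals.groupby("tconst").size().to_dict()
print(counts)
```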

#### Data Cleanup

We will remove three columns that are not needed for making recommendations.

In [None]:
# After inspecting the data, we can see that the "job" column is generally an extension of the "category" column 
# We will drop this column.
imdb2.drop(columns=['job'], inplace=True)

# We can also see that the "ordering" column is just for sorting the different jobs for each movie id
# we don't really need this column and will remove it as well
imdb2.drop(columns=['ordering'], inplace=True)

# lastly, we want all of our data to contribute to a recommendation, and while the actors themselves may be important,
# the characters they play do not seem particularly important. We will also drop the "characters" column
imdb2.drop(columns=['characters'], inplace=True)

In [None]:
imdb2

### IMDB3 - Director and writer assignments per movie id

In [None]:
#IMDB directors and writers per movie
imdb3 = pd.read_csv('zippedData/title.crew.csv')


#### Exploring the Data

This appears to give the same information as the previous data set, but in a different format. Let's do a few comparisons and see if that is the case.

In [None]:
imdb3

In [None]:
temp = imdb2.loc[imdb2['tconst'] == 'tt0417610']
temp
# our director is nm1145057 and our writer is nm0083201, let's check if it's the same in imdb3

In [None]:
temp = imdb3.loc[imdb3['tconst'] == 'tt0417610']
temp
# at first glance it's not the same! But then we see that the director is also a writer.

In [None]:
# using this information, we'll have to decide if we want to value when a person is credited in multiple roles.

# let's check one more multi-role
temp = imdb2.loc[imdb2['tconst'] == 'tt0999913']
temp
# we have 1 director and 3 writers listed

In [None]:
temp = imdb3.loc[imdb3['tconst'] == 'tt0999913']
temp
# 1 director and 4 writers, where one of the writers is the director.

In [None]:
# Let's take a listing from this dataset with no writer attached, and look it up in imdb2
temp = imdb2.loc[imdb2['tconst'] == 'tt0879859']
temp
# there is indeed no writer attached to this movie according to imdb2

#### Data Cleanup

Based on what we are seeing here, we are NOT going to use this dataset. We'll use the other cast and crew dataset to get this same information already broken apart, rather than having to break apart this dataset.
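
For a sense of what "breaking apart" this dataset would involve, here is a small sketch that turns one wide row (comma-separated ids per role) into one row per person and role. The ids are made up, `to_long_rows` is a hypothetical helper, and we are not actually running this on the data.

```python
# Toy row in the wide layout; the ids are invented for illustration
crew_row = {"tconst": "tt0000001", "directors": "nm0000010", "writers": "nm0000010,nm0000011"}

def to_long_rows(row):
    """Split the comma-separated id strings into one (movie, person, role) tuple each."""
    long_rows = []
    for role, column in (("director", "directors"), ("writer", "writers")):
        for person_id in row[column].split(","):
            if person_id:  # skip empty strings from blank cells
                long_rows.append((row["tconst"], person_id, role))
    return long_rows

long_rows = to_long_rows(crew_row)
print(long_rows)
```

The person credited as both director and writer shows up twice, once per role, which matches the multi-role pattern we noticed when comparing the two datasets.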

### IMDB4 - Movie stats per movie ID

In [None]:
# import imdb stats per movie
imdb4 = pd.read_csv('zippedData/title.basics.csv')

#### Exploring the Data

In [None]:
# taking a look at what we've imported
imdb4
# this dataset is using the movie id and finally we have the title of the movie, as well as the year, the runtime, and the genres

In [None]:
# what is the shape of our data?
imdb4.shape
# this dataset has 146,144 movie entries

In [None]:
# what kind of data is stored?
imdb4.dtypes
# The data in this set appears to be stored in the proper formats

In [None]:
# what are our columns?
imdb4.columns
# The 'tconst' is found throughout our IMDB datasets and is the movie identifier
# we will want to understand the distinction between primary_title and original_title

In [None]:
# do we have any missing/null values?
imdb4.isnull().sum()
# This dataset has some missing values. We will inspect the data itself to determine if this is important.
# there are no primary titles or years missing, which seems like the most important data to have

In [None]:
imdb4.describe()
# the IMDB dataset starts at 2010, and includes unreleased future movies.
# That's unhelpful, but we plan to use tmdb_discover as our base set to left join to, so it doesn't really matter.
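
To see why the extra IMDB rows stop mattering once we left join onto our base set, consider this toy example. Both dataframes are made up; only the pattern (a base set indexed by IMDB id, joined with `how="left"`) reflects our plan.

```python
import pandas as pd

# Toy base set, indexed by IMDB id the way tmdb_discover will be
base = pd.DataFrame({"title": ["Movie A", "Movie B"]},
                    index=pd.Index(["tt001", "tt002"], name="id"))

# Toy IMDB table: one matching id and two ids the base set does not contain
extra = pd.DataFrame({"runtime_minutes": [101.0, 88.0, 140.0]},
                     index=pd.Index(["tt001", "tt777", "tt888"], name="tconst"))

# A left join keeps every row of `base` and only the matching rows of `extra`
joined = base.join(extra, how="left")
print(joined)
```

The two unmatched IMDB rows disappear, and the base row without a match keeps its place with a missing runtime.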

We don't really have much use for this data, because we've directly pulled all of this data plus more into the tmdb_discover data set. 

Why are there so many more entries in this dataset than the tmdb dataset which is from 2000 onward?

Our first hint is the number of obvious foreign language films in the dataset preview above. Our tmdb API pull focused only on movies in English. This list also is not filtered on reviews in order to reduce the number of small-scale entries. This list also includes movies with release dates in the future.

We have no real need to filter these things at this time. The unneeded movie entries will be dropped when we join to tmdb_discover on a left join.

#### Data Cleanup

We'll later be merging this dataframe into our more robust tmdb_discover data set, so we don't need all of the information in this data set. In fact, we might not need ANY of this information, except maybe runtime. All of the other information is better represented in our tmdb_discover dataset. For now, we won't be cleaning this data much further, as this set seems to be a less-specific set with redundant information.

We DO want any user ratings available from IMDB, to cross-reference with the user ratings from tmdb_discover. So we'll do a few things on this dataset before merging with imdb1:

    1) set our IMDB id as our index
    2) drop redundant columns "original_title", 'primary_title', 'start_year', 'genres'
    

In [None]:
# Set tconst movie id as index
imdb4.set_index('tconst', inplace=True)

# Drop the redundant title, year and genre columns
imdb4.drop(columns=["original_title", 'primary_title', 'start_year', 'genres'], inplace=True)

In [None]:
imdb4
# Our final log of 146,144 movies in the imdb dataset from 2010 onward



#### Combining IMDB1 and IMDB4

We want the average user rating and number of votes from our imdb1 dataframe to be attached to our imdb4 dataframe.

In [None]:
imdb4 = imdb4.join(imdb1, how="left")
imdb4

The user ratings are now in with the movie entries, and we're ready to attach this data to our tmdb_discover dataset.

### IMDB5 - Alternate titles per movie ID

In [None]:
# import imdb alternate titles
imdb5 = pd.read_csv('zippedData/title.akas.csv')

#### Exploring the Data

In [None]:
# taking a look at what we've imported
imdb5

#### Data Cleanup

It is immediately apparent that this dataset lists all of the alternate titles for each movie id.

We won't be using this dataset.

### IMDB6 - Detailed crew info per person ID

In [None]:
# import imdb detailed crew information
imdb6 = pd.read_csv('zippedData/name.basics.csv')

#### Exploring the Data

In [None]:
# taking a look at what we've imported
imdb6
# this dataset has the information about the cast and crew ids

In [None]:
# what is the shape of our data?
imdb6.shape
# this dataset has 606,648 people entries

In [None]:
# what kind of data is stored?
imdb6.dtypes
# The data in this set appears to be stored in the proper formats

In [None]:
# what are our columns?
imdb6.columns

In [None]:
# do we have any missing/null values?
imdb6.isnull().sum()
# This dataset has a lot of missing values for birth year, death year, profession, and known for.
# We don't need some of this information, including birth year, profession and known for
# We will keep death year to make sure we don't make any recommendations for cast/crew that is deceased

#### Data Cleanup

In [None]:
# the only info we need on people is if they are alive, so we will drop their year of birth
imdb6.drop(columns=['birth_year'], inplace=True)

# We don't need the specific professions of our players. We can see their role from imdb2
imdb6.drop(columns=['primary_profession'], inplace=True)

# We're going to use other, more quantifiable metrics of popularity than the known for information
imdb6.drop(columns=['known_for_titles'], inplace=True)

# we will make the unique nconst the index
imdb6.set_index('nconst', inplace=True)

In [None]:
imdb6.head()
imdb6.sort_values('death_year').head()
# now we realize that we can have writers and composers that are long deceased. We are going to keep the death_year column.

## Source 4 -  The Numbers
Before we work on this data set, we should check if we can get better/updated data from the source. We followed the Data link at "The Numbers" to https://www.opusdata.com/ and submitted a request for access to their data set. In the meantime we will continue to work with this data set as given.

In [None]:
# import movie budgets dataset from file
thenum = pd.read_csv('zippedData/tn.movie_budgets.csv')

### Exploring the Data

We're going to perform our cleanup analysis on this dataset, including:
    * What is the shape of our imported data?
    * How many data entries?
    * What format is the data in?
    * How can we remove the most obvious redundancies (columns we just don't need, etc)
    * Are there missing/null values in the dataset that will need to be removed or imputed?

In [None]:
# taking a look at what we've imported
thenum.head(30)

In [None]:
# what is the shape of our data?
thenum.shape
# this data has 5782 entries

In [None]:
# what format is the data stored?
thenum.dtypes
# We have a lot of data format problems here. Everything but the id is stored as an object,
# including the monetary numbers and the date. We will fix these problems during data cleanup.

In [None]:
# do we have any missing/null values?
thenum.isnull().sum()
# since we know that all of our data is objects, we MAY actually have missing values. We won't be sure until later.
# for now let's look at the tail of the set and see if anything pops out.

In [None]:
thenum.tail()
# we do, in fact, see entries with a $0 for gross. These aren't showing up as null because
# they are actual entries rather than null values. We will need to remove or impute these entries after we convert these cells.

### Data Cleanup 

On the movie budgets dataset, we find the following things to clean up and resolve:
    * We have 5782 entries. We'll want to explore how/why movies were included in this dataset, as it's not a very large dataset compared to the number of movies released over time
    * all of the data in this set is objects. A lot of the data is numbers, so we need it to be in a numerical format
    * We have an id column, which can be used as our dataset index
    * Many entries with a $0 for gross. These aren't showing up as null in our initial EDA because they are actual entries of $0 not null values. We will need to remove these entries after we convert these cells.

We're going to clean up this dataset in the following way before moving on:

    1) set the id as the index
    2) convert the release date into a standard datetime
    3) convert all cost/gross fields into integers
    4) use regex to remove as many special characters from titles as possible, in hopes of matching this up with other data later
    5) remove rows without information for budget OR gross, as we won't be able to use this data
    

In [None]:
# block of cleanup actions performing actions 1-4 listed above

# use our string_cleanup function to remove special characters from the titles
# mapping over the whole column avoids the chained assignment (df['movie'][ind] = ...) that pandas warns about
thenum['movie'] = thenum['movie'].map(string_cleanup)

# sets the id as the index, removing a redundant column (former index)
thenum.set_index('id', inplace=True)

# using pandas built-in datetime converter to change our release date column to standard format
thenum['release_date'] = pd.to_datetime(thenum['release_date'])

# write a function to convert the cost/gross object entries into proper numbers that we can use in calculation
def convert_numbers(x):
    '''Takes in a string formatted number that starts with $ and may include commas, and returns that 
    number as a whole integer that can be used in calculations'''
    x = x[1:]
    x = x.replace(',', '')
    x = int(x)
    return x

# run the function on each of our three cost/gross entries
for col in ['production_budget', 'domestic_gross', 'worldwide_gross']:
    thenum[col] = thenum[col].map(convert_numbers)

# add two new columns for domestic net and worldwide net
#thenum['domestic_net'] = thenum['domestic_gross'] - thenum['production_budget']
#thenum['worldwide_net'] = thenum['worldwide_gross'] - thenum['production_budget']
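
As an alternative to mapping `convert_numbers` row by row, the same conversion can be sketched with pandas string methods. The `sample` frame and its values below are invented for illustration and are not the real `thenum` data.

```python
import pandas as pd

# Hypothetical stand-in for thenum; the values are made up for illustration
sample = pd.DataFrame({
    'production_budget': ['$1,000,000', '$25,500'],
    'domestic_gross': ['$0', '$3,200,100'],
})

money_cols = ['production_budget', 'domestic_gross']
for col in money_cols:
    # strip the dollar sign and the commas, then cast the result to integer
    sample[col] = (sample[col]
                   .str.replace('$', '', regex=False)
                   .str.replace(',', '', regex=False)
                   .astype('int64'))
```

Passing `regex=False` keeps `$` from being read as a regex anchor.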


In [None]:
# check that the data now looks the way we want it
thenum.head()

Now that we have corrected our numbers, we need to address the missing data that we identified before.

In [None]:
sum(thenum['production_budget'] == 0)
# all of the movies have a production budget listed. Regardless, we can't get enough info about success without any gross, so
# we'll be dropping the rows that have a gross of 0 for domestic

In [None]:
sum(thenum['domestic_gross'] == 0)
# 548 of our entries have no data for domestic_gross. We can't use these in calculations, and we're not going
# to impute them, so we are going to drop these rows from the dataset.

In [None]:
thenum = thenum[thenum['domestic_gross'] !=0]
# dropping all rows where there is no domestic gross information
thenum

In [None]:
# set movie as index
thenum.set_index('movie', inplace=True)

In [None]:
thenum.sort_values('release_date')

With this data set cleaned up, our only intended use is to join it to our tmdb_discover dataset in hopes of filling in any missing data that our Box Office Mojo scraper was unable to scrape.

## Source 5 - Rotten Tomatoes

### Exploring the Data

In [None]:
rt1 = pd.read_csv('zippedData/rt.reviews.tsv', sep='\t', encoding='Latin-1')
rt1.tail()

#It's immediately apparent that these are the posted reviews for movies on rotten tomatoes, using the id of the movie

In [None]:
rt2 = pd.read_csv('zippedData/rt.movie_info.tsv', sep='\t', encoding='Latin-1')
rt2.tail()

# this is the information on the movies, by id. But it doesn't include the movie name!!

After checking out the Rotten Tomatoes/Fandango API usage, we see that they do not grant API access to individuals. We will have to scrape for more data if we want to use this data set. Right now, we have no idea what the names of the movies are.

# Data Joins and Summary

After our data pulls and initial cleanup, we have the following data sets to use:


    thenum - Box office numbers with 'movie' by name as the unique key

    tmdb_discover - TMDB movie information, join 'id' on imdb data's 'tconst'

    tmdb_genres - can be joined into tmdb_discover on id

    bom - Box Office Mojo box office numbers, join 'imdb_id' on 'tconst'

    imdb1 - NOT USE further. IMDB user ratings and votes for each movie id. We already integrated this into imdb4.

    imdb2 - IMDB Cast and crew for each movie id. Join on movie id tconst and/or person id nconst

    imdb3 - NOT USE. Redundant information with imdb2.

    imdb4 - IMDB movie runtime, user ratings and votes on movie id. Join on movie id (tconst).

    imdb5 - NOT USE. Alternate titles.

    imdb6 - IMDB Cast and crew info. Join on nconst.
    
    rt1 - NOT USE. Rotten Tomatoes movie reviews with an ID identifier
    
    rt2 - NOT USE. Rotten Tomatoes movie stats, but no movie name



## Data Join Plan

We will combine our various data sets into a smaller number of data-rich sets that we'll use for our EDA

   ##### master_movies = tmdb_discover + imdb4 + bom + thenum
        * This dataset will reference movies by IMDB ID and have the average user ratings, vote counts, studio, and financials where available
        * Sets imdb4 and bom will be left joined on IMDB ID with tmdb_discover as the base data set
        * thenum will be joined on movie title, discarding anything from thenum that we cannot match up
   ##### imdb_crew = imdb2 + imdb6
        This dataset will reference cast/crew members by their unique id, as well as specify IMDB IDs that they have worked on, and the job they performed
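
To make the left-join part of this plan concrete, here is a minimal sketch on three tiny frames. The `_toy` frames, their ids, and their values are invented for illustration; only the column names `averagerating` and `dom_gross` come from this notebook.

```python
import pandas as pd

# Hypothetical miniature versions of our data sets, all indexed by IMDB id
tmdb_toy = pd.DataFrame({'title': ['Movie A', 'Movie B', 'Movie C']},
                        index=['tt001', 'tt002', 'tt003'])
imdb4_toy = pd.DataFrame({'averagerating': [7.1, 6.4]},
                         index=['tt001', 'tt003'])
bom_toy = pd.DataFrame({'dom_gross': [500000.0]}, index=['tt002'])

# Left joins keep every row of the base set and fill non-matches with NaN
joined = tmdb_toy.join(imdb4_toy, how='left').join(bom_toy, how='left')
```

Every movie from the base set survives the joins, so `joined` still has three rows, with NaN wherever a secondary set had no matching id.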


## Dataframe Join - master_movies

tmdb_discover + imdb4 + bom + thenum


In [None]:
# We are joining imdb4 onto tmdb_discover on the index, which is the IMDB id (tconst)
first_join = tmdb_discover.join(imdb4, how="left")
first_join.sort_values('vote_count', ascending=False)
# We are using tmdb_discover, which is our primary movie set, as the basis for the join. We want all records from this dataset,
# and any records from the other datasets which match.

In [None]:
# Joining on our Box Office Mojo data set and joining on the index which is the IMDB id
second_join = first_join.join(bom, how="left")
second_join.sort_values('vote_count', ascending=False)

In [None]:
# Now we are bringing in our thenum data set, attempting to join on both title string and release date for a correct match.
# We're dropping entries from thenum where we cannot make a match
second_join.reset_index(inplace=True) # We reset our index first so that we don't lose our IMDB id index from our data set
master_movies = second_join.merge(thenum, left_on=['title', 'release_date'], right_on=['movie', 'release_date'], how='left')
master_movies.sort_values('vote_count', ascending=False)
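
One way to check how many rows actually found a match in a title-plus-date merge like this is the `indicator=True` option of `merge`. The sketch below uses two invented frames (`left` and `right`), not our real data.

```python
import pandas as pd

# Hypothetical frames standing in for second_join and thenum
left = pd.DataFrame({
    'title': ['Movie A', 'Movie B'],
    'release_date': pd.to_datetime(['2010-01-01', '2011-06-15']),
})
right = pd.DataFrame({
    'movie': ['Movie A', 'Movie B'],
    'release_date': pd.to_datetime(['2010-01-01', '2012-06-15']),
    'production_budget': [100, 200],
})

# indicator=True adds a _merge column saying where each row came from
checked = left.merge(right, left_on=['title', 'release_date'],
                     right_on=['movie', 'release_date'],
                     how='left', indicator=True)
match_counts = checked['_merge'].value_counts()
```

Here 'Movie B' has the same title in both frames but a different release date, so it is counted as `left_only` and its budget stays NaN.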


### Data Cleanup

We have a lot of columns now and some cleanup work to do!

We have multiple columns about movie votes and user ratings. We will combine these into a single master user_rating column and a single total_votes column.

In [None]:
# We make a new column that takes the average of our two user rating columns. NaN values are skipped,
# so a movie with only one rating keeps that rating
master_movies['user_rating'] = master_movies[['averagerating', 'vote_average']].mean(axis=1)

In [None]:
# We make a new column that takes the sum of our two vote counts. This ignores any NaN and won't use them
# in the resulting calculation
master_movies['total_votes'] = master_movies[['numvotes', 'vote_count']].sum(axis=1)
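
A small sketch of how row-wise `mean` and `sum` treat NaN, using invented numbers in columns named like ours. One detail worth noticing is that a row with no vote data at all sums to 0 rather than NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings and vote counts; the values are made up for illustration
toy = pd.DataFrame({
    'averagerating': [8.0, np.nan, np.nan],
    'vote_average': [6.0, 7.0, np.nan],
    'numvotes': [10, np.nan, np.nan],
    'vote_count': [5, 3, np.nan],
})

# mean skips NaN; a row with no ratings at all stays NaN
combined_rating = toy[['averagerating', 'vote_average']].mean(axis=1)

# sum skips NaN too, but a row with no votes at all becomes 0.0
combined_votes = toy[['numvotes', 'vote_count']].sum(axis=1)
```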

In [None]:
master_movies.sort_values('total_votes', ascending=False)
# We now have 9996 entries

In [None]:
# We make a new column for the worldwide gross for the movie
master_movies['world_gross'] = (master_movies['for_gross'] + master_movies['dom_gross'])

In [None]:
# We make a new column for the domestic net for the movie
master_movies['dom_net'] = (master_movies['dom_gross'] - master_movies['budget'])

In [None]:
# We make a new column for the worldwide net for the movie
master_movies['world_net'] = (master_movies['for_gross'] + master_movies['dom_gross'] - master_movies['budget'])
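
Unlike the row-wise `mean` and `sum` used for ratings and votes, the `+` and `-` operators propagate NaN, so any movie missing one of its inputs ends up with NaN in these new columns. The sketch below shows this on invented numbers, along with `Series.add(fill_value=0)` as one possible alternative if we ever wanted a partial total. This is an illustration only and not something the notebook currently does.

```python
import numpy as np
import pandas as pd

# Hypothetical grosses; the values are made up for illustration
toy = pd.DataFrame({'dom_gross': [100.0, 50.0, np.nan],
                    'for_gross': [200.0, np.nan, np.nan]})

# plain addition: NaN in either column gives NaN
strict_total = toy['for_gross'] + toy['dom_gross']

# fill_value=0 treats a single missing side as 0; if both are missing the result is still NaN
lenient_total = toy['for_gross'].add(toy['dom_gross'], fill_value=0)
```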

In [None]:
master_movies.sort_values('dom_gross', ascending=False).head(10)

In [None]:
master_movies.columns

In [None]:
# setting our IMDB ID as our index once again
master_movies.set_index('id', inplace=True)

### We need to map our movie_genres dictionary onto our genre_ids.

We might not actually want to do this. We DO want to turn genre_ids into a real list, but possibly without replacing the ids with words.

In [None]:
# function to convert a list in string format into a true list
def convert_to_list(string):
    '''Takes a string that looks like a list but is actually a string. Turns it into an actual list.'''
    li = string.lstrip('[')
    li = li.rstrip(']')
    li = li.replace(" ", '')
    li = list(li.split(","))
    li = [int(x) for x in li]
    return li
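
For comparison, the standard library can parse this kind of string directly. The sketch below is a possible alternative, with `parse_id_list` being a hypothetical name that is not used elsewhere in this notebook.

```python
import ast

def parse_id_list(text):
    '''Hypothetical alternative to convert_to_list: parses a string such as
    "[28, 12]" into a list of integers using ast.literal_eval.'''
    parsed = ast.literal_eval(text)
    return [int(x) for x in parsed]

example_ids = parse_id_list('[28, 12, 878]')
empty_ids = parse_id_list('[]')
```

One difference from `convert_to_list` is that an empty list string parses to an empty list here instead of raising an error.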

In [None]:
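def ids_to_genres(string):
    '''Takes a genre id list in string format and returns a list of genre names.
    Returns an empty string if the entry cannot be converted.'''
    try:
        return [tmdb_genres[x] for x in convert_to_list(string)]
    except (AttributeError, ValueError, KeyError, TypeError):
        return ''

# build the genres column in one pass instead of assigning cell by cell
master_movies['genres'] = master_movies['genre_ids'].map(ids_to_genres)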

In [None]:
master_movies = master_movies[master_movies['rating'].notnull()]

In [None]:
master_movies['rating'].notnull().sort_values()

In [None]:
# We're mapping our MPAA rating to a number

rating_map = {'G': 1, 'PG': 2, 'PG-13': 3, 'R': 4, 'NC-17': 5, 'NR': 6}

# map each rating label to its number; any label missing from rating_map becomes NaN
master_movies['rating'] = master_movies['rating'].map(rating_map).astype(float)
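
A quick illustration of this kind of label-to-number encoding on an invented series. The names `toy_ratings`, `rating_order`, and `encoded` are hypothetical and only used here.

```python
import pandas as pd

# Same ordering as above: lower numbers are the more family-friendly ratings
rating_order = {'G': 1, 'PG': 2, 'PG-13': 3, 'R': 4, 'NC-17': 5, 'NR': 6}

# Hypothetical ratings, including one label that is not in the dictionary
toy_ratings = pd.Series(['PG', 'R', 'G', 'PG-13', 'TV-MA'])

# labels found in the dictionary become numbers; anything else becomes NaN
encoded = toy_ratings.map(rating_order).astype(float)
```

Because the result is numeric, the rating can then take part in correlations and scatter plots, keeping in mind that the numbers only express an ordering.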

In [None]:
master_movies.dtypes

In [None]:
# We're now ready to clean up all of the superfluous columns. We can get rid of our original sources for vote counts and user ratings.
# We'll also get rid of our original financial information and use our new combined columns.
# We'll do this by creating a new copy of our dataframe with just the columns we want, in the order that we want.
master_movies = master_movies[['title', 'rating', 'genres', 'studio', 'popularity', 'user_rating', 'total_votes', 'release_date', 'budget', 'dom_gross', 'for_gross', 'world_gross', 'dom_net', 'world_net' ]]

In [None]:
# Our completed master_movies data set of 6305 movies
master_movies

In [None]:
#exporting the dataframe to a csv
#master_movies.to_csv('api_data/imdb_masterlist.csv')

## Dataframe Join - imdb_crew

imdb2 to imdb6 - Movie cast/crew assignments + cast/crew info

In [None]:
# we are joining our imdb2 and imdb6 to pair the cast and crew names with the movies they have worked on

imdb_crew = imdb2.join(imdb6, on='nconst', how='inner')
# we lost a few hundred entries (out of over a million) for people listed in IMDB who have never worked on a movie

imdb_crew


In [None]:
imdb_crew.set_index('tconst', inplace=True)

In [None]:
cast_crew = master_movies.join(imdb_crew, how='left')

In [None]:
cast_crew

In [None]:
cast_crew.reset_index(inplace=True)

In [None]:
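# select the columns with a list (double brackets); a bare tuple of column names is not accepted by newer pandas
temp = cast_crew.groupby(['index', 'category'])[['title', 'genres', 'studio', 'rating', 'popularity', 'user_rating', 'total_votes', 'release_date', 'budget', 'dom_gross', 'for_gross', 'dom_net', 'world_net', 'primary_name']]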

In [None]:
temp.count()

## Data Condense - Full Financials

We are going to study all of our data, but for some of our comparisons, we'll need only data that has a FULL set of financials. For this data we'll make a specific dataset that drops all movies for which we have no budget information (and therefore cannot get a true net income number)

In [None]:
movies_withbudget = master_movies[(master_movies['budget'].notna())]
movies_withbudget

## Data Summary

master_movies has all movie titles, genres, user ratings, and available financials.

imdb_crew is our cast and crew information for each title

movies_withbudget is our master_movies with only entries that have full financials available.

# EDA

## Studying master_movies and movies_withbudget

We need to ensure that we use the correct data sets for our sorts from here on out. If we want information on GROSSES, we can use our master_movies dataset of 6,749 movies, which has domestic and foreign gross for all entries. However, any time we need to evaluate budget and/or net income, we must use our data set movies_withbudget, which is much smaller but has a full set of financials.

In [None]:
master_movies.columns

In [None]:
master_movies.describe()

# average runtime is 

In [None]:
pd.plotting.scatter_matrix(master_movies[['budget', 'dom_gross', 'rating', 'user_rating', 'popularity']], figsize=(15,15));

In [None]:
# investigate correlations with user_rating

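# restrict to numeric columns, since correlations are undefined for titles, genres and studios
master_movies.select_dtypes('number').corr()['user_rating'].sort_values()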

In [None]:
master_movies.select_dtypes('number').corr()['budget'].sort_values()

In [None]:
master_movies.select_dtypes('number').corr()['popularity'].sort_values()

In [None]:
master_movies.select_dtypes('number').corr()['rating'].sort_values()

In [None]:
master_movies.groupby('studio').mean(numeric_only=True).sort_values('dom_gross', ascending=False).head(10)
# sorting by average dom_gross, we see our top performing studios which produce the biggest blockbusters on average

# 349 studios made our 6749 movies

In [None]:
master_movies.groupby('studio').sum(numeric_only=True).sort_values('dom_gross', ascending=False).head(10)
# sorting on domestic gross as a sum, we see which studios bring in the most overall gross


In [None]:
master_movies.groupby('studio').mean(numeric_only=True).sort_values('dom_net', ascending=False).head(10)
# sorting on domestic net on average, we get some interesting results. Our big flashy studios are still there,
# but there are some smaller studios that have a very respectable net income per film

In [None]:
movies_withbudget.groupby('studio').mean(numeric_only=True).sort_values('dom_net', ascending=False).head(10)
# sorting on domestic net on average, we get some interesting results. Our big flashy studios are still there,
# but there are some smaller studios that have a very respectable domestic net income per film
# Pantelion films spent $12mil on a single film that ultimately netted $38mil domestic, which is only about $7mil less
# than the average Disney film nets domestically. Now, the WORLDWIDE net differs greatly ($79mil vs Disney's $260mil),
# but overall we can understand that we can get respectable results on a smaller budget if we do it right

In [None]:
master_movies['studio'].value_counts()

## Studying some cast and crew info

In [None]:
cast_crew.columns

In [None]:
cast_crew['category'].unique()
#What kinds of categories are tracked?

### Director

In [None]:
# make a new dataframe with only the director rows for each movie
director = cast_crew[cast_crew['category'] == 'director']

In [None]:
# checking out our mean values for this group
director.groupby(['primary_name']).mean(numeric_only=True).sort_values('dom_net', ascending=False)
# We have 2608 different directors for our list of 6729 movies

In [None]:
# Checking how many movies each director has directed.
director.groupby(['primary_name']).count().value_counts('title')
#1776 of our directors have directed only one movie

In [None]:
# select directors that have directed at least 2 movies, so that we know they are proven
director = director[director.duplicated(subset='primary_name', keep=False)]

# checking out our mean values for this group
director.groupby(['primary_name']).mean(numeric_only=True).sort_values('dom_net', ascending=False)
# We have 832 repeat directors
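
The `duplicated(..., keep=False)` filter used above marks every row whose name appears more than once, which is what keeps only repeat directors. A minimal sketch on invented credits follows; the names and grosses are placeholders.

```python
import pandas as pd

# Hypothetical director credits; names and numbers are made up for illustration
credits = pd.DataFrame({
    'primary_name': ['Director A', 'Director B', 'Director A', 'Director C'],
    'dom_gross': [10, 20, 30, 40],
})

# keep=False flags ALL rows of a repeated name, not just the later occurrences
repeat = credits[credits.duplicated(subset='primary_name', keep=False)]
mean_gross = repeat.groupby('primary_name')['dom_gross'].mean()
```

With the default `keep='first'`, the first film of each repeat director would be dropped from the group and the means would be skewed.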

In [None]:
# Looking at the mean domestic gross of our top 30 directors
directortop30 = director.groupby(['primary_name'])['dom_gross'].mean().sort_values(ascending=False).nlargest(30)
directortop30

In [None]:
# Bar graph of the mean domestic gross of our top 30 directors
graphit = directortop30.sort_values().plot(kind='barh', figsize=(20,10))

# label the chart through the Axes object returned by plot
graphit.set_title('Mean Dom Gross by Director')
graphit.set_xlabel('Domestic Gross')
graphit.set_ylabel('Director');

### Actor

In [None]:
# checking out the actors in our top 30 grossing movies
actor = cast_crew[cast_crew['category'] == 'actor']
actor30 = actor.groupby(['primary_name'])['dom_gross'].max().sort_values(ascending=False).nlargest(30)
actor30

In [None]:
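# Bar graph of the actors in our top 30 grossing movies
graphit = actor30.sort_values().plot(kind='barh', figsize=(20,10))

# label the chart through the Axes object returned by plot
graphit.set_title('Actors in Top 30 Grossing Movies')
graphit.set_xlabel('Domestic Gross')
graphit.set_ylabel('Actor');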

In [None]:
# Looking at the mean domestic gross of the actors
actormean = actor.groupby(['primary_name'])['dom_gross'].mean().sort_values(ascending=False).nlargest(30)
actormean

In [None]:
# Looking at the actors by movie user_rating, keeping only actors who appear in at least 2 movies
actoruser_rating = actor[actor.duplicated(subset='primary_name', keep=False)]
actoruser_rating = actoruser_rating.groupby(['primary_name'])['user_rating'].mean().sort_values(ascending=False).nlargest(30)
actoruser_rating

### Actress

In [None]:
# checking out the actresses in our top 30 grossing movies
actress = cast_crew[cast_crew['category'] == 'actress']
actress30 = actress.groupby(['primary_name'])['dom_gross'].max().sort_values(ascending=False).nlargest(30)
actress30

In [None]:
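# Bar graph of the actresses in our top 30 grossing movies
graphit = actress30.sort_values().plot(kind='barh', figsize=(20,10))

# label the chart through the Axes object returned by plot
graphit.set_title('Actresses in Top 30 Grossing Movies')
graphit.set_xlabel('Domestic Gross')
graphit.set_ylabel('Actress');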

### Writer

In [None]:
# make a new dataframe with only the writer rows for each movie
writer = cast_crew[cast_crew['category'] == 'writer']

In [None]:
# checking out our mean values for this group
writer.groupby(['primary_name']).mean(numeric_only=True).sort_values('dom_net', ascending=False)
# We have 3983 different writers for our list of 6729 movies

In [None]:
# Checking how many movies each writer has written.
writer.groupby(['primary_name']).count().value_counts('title')
# over half of

In [None]:
# Looking at the mean domestic gross of our top 30 writers
writertop30 = writer.groupby(['primary_name'])['dom_gross'].mean().sort_values(ascending=False).nlargest(30)
writertop30

In [None]:
# Bar graph of the mean domestic gross of our top 30 writers
graphit = writertop30.sort_values().plot(kind='barh', figsize=(20,10))

# label the chart through the Axes object returned by plot
graphit.set_title('Dom Gross by Writer')
graphit.set_xlabel('Domestic Gross')
graphit.set_ylabel('Writer');

# TO DO

Break out franchises (by looking at duplicate words in movie titles and then checking manually?), and study franchises vs single films.
Feature engineering - add a "Franchise" field? (Look up a library to ignore dumb words?) Look up how to perform feature engineering on non-categorical data.
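
A rough first sketch of the duplicate-word idea, assuming a small hand-written stop word set instead of a library. The titles, the `STOP_WORDS` set, and the helper `title_keywords` are all hypothetical, and the output would still need a manual review.

```python
import re
from collections import defaultdict

# Hypothetical list of words to ignore; a real list would need to be much longer
STOP_WORDS = {'the', 'a', 'an', 'of', 'and', 'in', 'part'}

def title_keywords(title):
    '''Returns the set of lowercase words in a title, minus the stop words.'''
    words = re.findall(r'[a-z]+', title.lower())
    return {w for w in words if w not in STOP_WORDS}

# Invented titles for illustration only
titles = ['Space Quest', 'Space Quest Returns', 'The Lonely Road']

# collect the titles that share each keyword
by_word = defaultdict(list)
for t in titles:
    for w in sorted(title_keywords(t)):
        by_word[w].append(t)

# any keyword shared by more than one title is a franchise candidate
candidates = {w: ts for w, ts in by_word.items() if len(ts) > 1}
```

Common words such as "love" or "night" would also show up as candidates on real data, which is why the manual pass is still needed.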

Break out genres, and study numbers vs genre

Visualizations/EDA to gather:
    gross/net by team of writer-director
    gross/net of franchise vs non-franchise
    gross/net by genre
    gross/net by genre+franchise status
    gross/net against studio
    check out genres of franchises
    gross/net by MPAA rating