# Project 3 - Part 1 

- Name: Tyler Schelling
- Date Started: 12/13/2022

---

For this project, you have been hired to build a MySQL database of movies from a subset of IMDb's publicly available dataset. You will ultimately use this database to analyze what makes a movie successful and to give the stakeholder recommendations on how to make one.

---

## Import Libraries

In [12]:
import pandas as pd
import numpy as np

## Downloading the Files

In [98]:
basics_url = 'https://datasets.imdbws.com/title.basics.tsv.gz'
ratings_url = 'https://datasets.imdbws.com/title.ratings.tsv.gz'
akas_url = 'https://datasets.imdbws.com/title.akas.tsv.gz'

In [99]:
basics = pd.read_csv(basics_url, sep='\t', low_memory=False)
ratings = pd.read_csv(ratings_url, sep='\t', low_memory=False)
akas = pd.read_csv(akas_url, sep='\t', low_memory=False)

In [100]:
basics.head()

| | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres |
|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0000001 | short | Carmencita | Carmencita | 0 | 1894 | \N | 1 | Documentary,Short |
| 1 | tt0000002 | short | Le clown et ses chiens | Le clown et ses chiens | 0 | 1892 | \N | 5 | Animation,Short |
| 2 | tt0000003 | short | Pauvre Pierrot | Pauvre Pierrot | 0 | 1892 | \N | 4 | Animation,Comedy,Romance |
| 3 | tt0000004 | short | Un bon bock | Un bon bock | 0 | 1892 | \N | 12 | Animation,Short |
| 4 | tt0000005 | short | Blacksmith Scene | Blacksmith Scene | 0 | 1893 | \N | 1 | Comedy,Short |


In [101]:
ratings.head()

| | tconst | averageRating | numVotes |
|---|---|---|---|
| 0 | tt0000001 | 5.7 | 1925 |
| 1 | tt0000002 | 5.8 | 261 |
| 2 | tt0000003 | 6.5 | 1741 |
| 3 | tt0000004 | 5.6 | 176 |
| 4 | tt0000005 | 6.2 | 2554 |


In [102]:
akas.head()

| | titleId | ordering | title | region | language | types | attributes | isOriginalTitle |
|---|---|---|---|---|---|---|---|---|
| 0 | tt0000001 | 1 | Карменсіта | UA | \N | imdbDisplay | \N | 0 |
| 1 | tt0000001 | 2 | Carmencita | DE | \N | \N | literal title | 0 |
| 2 | tt0000001 | 3 | Carmencita - spanyol tánc | HU | \N | imdbDisplay | \N | 0 |
| 3 | tt0000001 | 4 | Καρμενσίτα | GR | \N | imdbDisplay | \N | 0 |
| 4 | tt0000001 | 5 | Карменсита | RU | \N | imdbDisplay | \N | 0 |


## Filtering/Cleaning

In [103]:
#Replace null values ("\N") with np.nan across all 3 tables
basics = basics.replace({'\\N':np.nan})
ratings = ratings.replace({'\\N':np.nan})
akas = akas.replace({'\\N':np.nan})
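
As an alternative to replacing `\N` after loading, the placeholder can be converted while the file is read by passing `na_values` to `pd.read_csv`. The sketch below shows this on a tiny in-memory TSV; the rows, IDs, and titles are made-up sample data, not values from the IMDb files.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same tab-separated layout as title.basics
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tendYear\truntimeMinutes\n"
    "tt9999991\tshort\tSample A\t1894\t\\N\t1\n"
    "tt9999992\tmovie\tSample B\t2005\t\\N\t\\N\n"
)

# na_values tells pandas to treat the literal string "\N" as missing
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

# Count the missing values per column
print(sample.isna().sum())
```

Parsing the nulls at load time may also let pandas infer numeric dtypes for columns such as `runtimeMinutes`, which otherwise stay as strings because of the `\N` entries.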

In [104]:
#Eliminate movies that are null for runtimeMinutes
basics = basics[basics['runtimeMinutes'].notna()]

In [105]:
#Keep only rows whose titleType is 'movie'
basics = basics[basics['titleType'] == 'movie']

In [106]:
#Eliminate movies whose genres are null or include Documentary
basics = basics[(basics['genres'].notna())]
basics = basics[~basics['genres'].str.contains('documentary',case=False)]
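
To see how the documentary filter behaves, here is a minimal sketch on a toy frame. The rows are hypothetical, and the `na=False` argument is a variation on the cell above: it makes missing genres evaluate to `False` rather than `NaN`, so the mask is always boolean.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the 'genres' column matters here
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "genres": ["Comedy,Drama", "Documentary,Short", np.nan, "Action"],
})

# Case-insensitive substring match; na=False maps missing genres to False
is_documentary = toy["genres"].str.contains("documentary", case=False, na=False)

# Keep rows that have a genre and are not documentaries
kept = toy[toy["genres"].notna() & ~is_documentary]
print(kept["tconst"].tolist())
```

Only `tt1` and `tt4` remain: `tt2` is a documentary and `tt3` has no genre.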

In [111]:
#Keep startYear 2000-2022 (inclusive)
basics = basics[basics['startYear'].notna()]
basics['startYear'] = basics['startYear'].astype('int64')

year_filter = (basics['startYear'] >= 2000) & (basics['startYear'] <= 2022)
basics = basics[year_filter]
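
An inclusive 2000-2022 range can also be written with `Series.between`, which includes both endpoints by default. The sketch below uses a hypothetical toy frame and `pd.to_numeric` for the conversion; it illustrates the boundary behavior and is not a drop-in replacement for the cell above.

```python
import pandas as pd

# Hypothetical rows with years stored as strings, as they are after loading
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt5"],
    "startYear": ["1999", "2000", "2010", "2022", "2023"],
})

# errors="coerce" would turn any unparseable value into NaN
toy["startYear"] = pd.to_numeric(toy["startYear"], errors="coerce")

# between() is inclusive on both ends by default
in_range = toy["startYear"].between(2000, 2022)
print(toy.loc[in_range, "tconst"].tolist())
```

Both boundary years (2000 and 2022) are kept, while 1999 and 2023 are dropped.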

In [121]:
#Keep only movies with a US-region entry in akas (used here as the filter for US movies)
akas = akas[akas['region'] == 'US']
us_filter_basics = basics['tconst'].isin(akas['titleId'])
us_filter_ratings = ratings['tconst'].isin(akas['titleId'])

#Filter the basics DF for US movies
basics = basics[us_filter_basics]
#Filter the ratings DF for US movies
ratings = ratings[us_filter_ratings]
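
The `isin` filtering above can be sketched on toy frames to check what survives. All rows below are hypothetical. Filtering the ratings against the already-filtered basics table, rather than against akas, is an optional variation that keeps the two tables limited to the same set of titles.

```python
import pandas as pd

# Hypothetical stand-ins for the three tables
toy_basics = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"]})
toy_ratings = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "averageRating": [5.7, 6.1, 7.3],
})
toy_akas = pd.DataFrame({
    "titleId": ["tt1", "tt1", "tt2", "tt3"],
    "region": ["US", "DE", "FR", "US"],
})

# IDs of titles that have at least one US-region entry
us_ids = toy_akas.loc[toy_akas["region"] == "US", "titleId"].unique()

# Keep only those titles in basics
toy_basics = toy_basics[toy_basics["tconst"].isin(us_ids)]

# Variation: restrict ratings to the titles that remain in basics
toy_ratings = toy_ratings[toy_ratings["tconst"].isin(toy_basics["tconst"])]

print(toy_basics["tconst"].tolist())
print(toy_ratings["tconst"].tolist())
```

Here `tt2` is dropped from both tables because its only akas entry has region `FR`.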