# EDA Project

<div class="alert alert-block alert-info">The EDA project in this course has four main parts to it: <br>
    
1. Project Proposal
2. Phase 1
3. Phase 2
4. Report

This notebook will be used for Project Proposal, Phase 1, and Phase 2. You will have specific questions to answer within this notebook for Project Proposal and Phase 1. You will also continue using this notebook for Phase 2. However, guidance and expectations can be found on Canvas for that assignment. The report is completed outside of this notebook (delivered as a PDF). Detailed instructions for that assignment are provided in Canvas.</div>

<div class="alert alert-block alert-danger"><b><font size=4>Read this before proceeding:</font></b>
    
1. Review the list of data sets and sources of data to avoid before choosing your data. This list is provided in the instructions for the Project Proposal assignment in Canvas.<br><br>  

2. It is expected that when you are asked questions requiring typed explanations you are to use a <b><u>markdown cell</u></b> to type your answers neatly. <b><u><i>Do not provide typed answers to questions as extra comments within your code.</i></u></b> Only provide comments within your code as you normally would, i.e. as needed to explain or remind yourself what each part of the code is doing.</div>

# Project Proposal

<div class="alert alert-block alert-info">The intent of this assignment is for you to share your chosen data file(s) with your instructor and provide general information on your goals for the EDA project.</div>

<div class="alert alert-block alert-success"><b>Step 1 (2 pts)</b>: Give a brief <i><u>description</u></i> of the source(s) of your data and include a <i><u>direct link</u></i> to your data.</div>

##### Question 1

Our data comes from IMDb (Internet Movie Database), an online database and website for information about movies, television shows, and other streaming content. It is a popular site for looking up information about ratings, cast, directors, reviews, plot, and other details about the content. The four datasets we chose from their database are the title basics, the ratings data, the crew data, and the name basics. The title basics dataset has the name of the media, the years it was released or aired, its run time, and the genres it belongs to. The ratings data provides the average rating and number of votes for each title. The crew data provides the IDs for directors and writers. Lastly, the name basics dataset provides each person's name and primary professions. More information about the data sets can be found here: https://www.imdb.com/interfaces/. The specific data sets for download are found here: https://datasets.imdbws.com/.

<div class="alert alert-block alert-success"><b>Step 2 (2 pts)</b>: Briefly explain why you chose this data.</div>

##### Question 2

Understanding the data is a key first step in statistical analysis. Starting off, our team discussed interests and experiences we might have in common, ultimately agreeing on true crime media. This led us to the IMDb dataset, which we could easily understand given our backgrounds and previous use of the site. Furthermore, we were able to come up with unique questions related to True Crime that we want answered through the analysis of our dataset. While there are analytical reports of IMDb data online, we did not find any analysis related to the questions we came up with.

Out of this database, the datasets we believe are most relevant to our questions are title.basics, title.ratings, name.basics and title.crew. There are many other databases such as Rotten Tomatoes or Yahoo! Movies, but IMDb lets us easily read its TSV data files with the pandas library. This simple file format is easy to manipulate and extract, so we can dive into the analysis quickly.

<div class="alert alert-block alert-success"><b>Step 3 (1 pt)</b>: Provide a brief overview of your goals for this project.</div>

##### Question 3

We want to understand what drives the popularity of True Crime media, and media in general over time.  Specifically, we would like to be able to prove or disprove some of the following questions by the end of this project:
  * Is it true that True Crime has become more popular over the past five years? 
  * Has there been an increase in the popularity of documentaries in general?
  * Has there been an increase in the volume of documentaries in general?
  * Is there a correlation between the air date of a True Crime TV show or movie and its popularity?
  * Are there correlations between the popularity of a genre, specifically True Crime, and the crew?
  
Given these questions, we'll need to import data from IMDb for the title of the production itself, the crew, the names of the crew and the ratings for that production.  We'll need to download the data, clean it (there are many nulls represented by "\N") and join it in order to proceed with answering the above questions.
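
As a minimal sketch of that cleaning step, `pd.read_csv` can be told to treat the literal `\N` placeholder as a missing value through its `na_values` parameter. The rows below are made up for illustration and only mimic the layout of title.basics; the cells later in this notebook read the files without this option.

```python
import io
import pandas as pd

# Made-up rows in the same tab-separated layout as title.basics (illustration only)
sample_tsv = (
    "tconst\tprimaryTitle\tstartYear\tendYear\n"
    "tt0000001\tCarmencita\t1894\t\\N\n"
    "tt0000002\tExample Title\t\\N\t\\N\n"
)

# na_values tells pandas to treat the literal string \N as missing (NaN)
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

# Count missing values per column now that \N is recognized as null
print(sample_df.isnull().sum())
```

Without `na_values`, the `\N` entries are read as ordinary strings, so `isnull()` would not count them.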

<div class="alert alert-block alert-success"><b>Step 4 (1 pt)</b>: Read the data into this notebook.</div>

In [2]:
##Import libraries
import sys
!conda update --yes --prefix {sys.prefix} seaborn

import pandas as pd
import urllib.request  # used to retrieve files from the internet
import numpy as np
import re
import seaborn as sns
from IPython.display import display  # used to print multiple dataframes from a single cell in Jupyter

Collecting package metadata (current_repodata.json): done
Solving environment: | 

Updating seaborn is constricted by 

anaconda -> requires seaborn==0.11.0=py_0

If you are sure you want an update of your package either try `conda update --all` or install a specific version of the package you want using `conda install <pkg>=<version>`

done

# All requested packages already installed.



In [31]:
def read_imdb_data(*args):
    '''
    Accept one or more URLs from IMDb's datasets (https://datasets.imdbws.com/) and return a list of dataframes
    '''
    df_list = []  # instantiate a list
    if len(args):    # check to make sure the user passed at least one URL
        for i in args:                     # for each url:
            filename = i.split('/', 3)[-1] # extract a filename from the url (everything after the 3rd "/" delimiter)
            urllib.request.urlretrieve(i, filename) # retrieve the file from the internet and copy it locally (https://docs.python.org/3/library/urllib.request.html)
            df_list.append(pd.read_csv(filename, compression='gzip', sep='\t', low_memory=False)) 
            # open the local file as a dataframe and append the dataframe to a list 
            # low_memory=False reads each file in a single pass so column types are inferred consistently.  See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
    else:
        print('No URLs were passed to read_imdb_data()')
    return df_list  # returns a list of dataframes


urls = ['https://datasets.imdbws.com/title.ratings.tsv.gz',
       'https://datasets.imdbws.com/title.crew.tsv.gz',
       'https://datasets.imdbws.com/title.basics.tsv.gz',
       'https://datasets.imdbws.com/name.basics.tsv.gz']  # list of urls from imdb

df_list = read_imdb_data(*urls) # call the function with the list of urls, of any length, and save the dataframes returned

imdb_ratings, imdb_crew, imdb_title_basics, imdb_name = df_list[0], df_list[1], df_list[2], df_list[3] 
# save each dataframe independently so we can explore them

<div class="alert alert-block alert-success"><b>Step 5 (1 pt)</b>: Inspect the data using the <b>info(&nbsp;)</b>, <b>head(&nbsp;)</b>, and <b>tail(&nbsp;)</b> methods.</div>

In [117]:
# Use the info() method to inspect the variable (column) names, the number of non-null values,
#       and the data types for each variable.
imdb_title_basics.info()
imdb_crew.info()
imdb_ratings.info()
imdb_name.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613428 entries, 0 to 7613427
Data columns (total 9 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   tconst          object
 1   titleType       object
 2   primaryTitle    object
 3   originalTitle   object
 4   isAdult         object
 5   startYear       object
 6   endYear         object
 7   runtimeMinutes  object
 8   genres          object
dtypes: object(9)
memory usage: 522.8+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7610476 entries, 0 to 7610475
Data columns (total 3 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   tconst     object
 1   directors  object
 2   writers    object
dtypes: object(3)
memory usage: 174.2+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1123058 entries, 0 to 1123057
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   tconst         1123058 non-null  object 
 1   averageRating 

In [118]:
# Use the head() method to inspect the first five (or more) rows of the data
display(imdb_title_basics.head())
display(imdb_crew.head())
display(imdb_ratings.head())
display(imdb_name.head())

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


Unnamed: 0,tconst,directors,writers
0,tt0000001,nm0005690,\N
1,tt0000002,nm0721526,\N
2,tt0000003,nm0721526,\N
3,tt0000004,nm0721526,\N
4,tt0000005,nm0005690,\N


Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1680
1,tt0000002,6.0,207
2,tt0000003,6.5,1418
3,tt0000004,6.1,122
4,tt0000005,6.1,2214


Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0050419,tt0053137,tt0031983,tt0072308"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0117057,tt0037382,tt0038355,tt0071877"
2,nm0000003,Brigitte Bardot,1934,\N,"actress,soundtrack,music_department","tt0057345,tt0054452,tt0049189,tt0059956"
3,nm0000004,John Belushi,1949,1982,"actor,soundtrack,writer","tt0072562,tt0080455,tt0078723,tt0077975"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0083922,tt0050986,tt0050976,tt0060827"


In [119]:
# Use the tail() method to inspect the last five (or more) rows of the data
display(imdb_title_basics.tail())
display(imdb_crew.tail())
display(imdb_ratings.tail())
display(imdb_name.tail())

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
7613423,tt9916848,tvEpisode,Episode #3.17,Episode #3.17,0,2010,\N,\N,"Action,Drama,Family"
7613424,tt9916850,tvEpisode,Episode #3.19,Episode #3.19,0,2010,\N,\N,"Action,Drama,Family"
7613425,tt9916852,tvEpisode,Episode #3.20,Episode #3.20,0,2010,\N,\N,"Action,Drama,Family"
7613426,tt9916856,short,The Wind,The Wind,0,2015,\N,27,Short
7613427,tt9916880,tvEpisode,Horrid Henry Knows It All,Horrid Henry Knows It All,0,2014,\N,10,"Animation,Comedy,Family"


Unnamed: 0,tconst,directors,writers
7610471,tt9916848,"nm5519454,nm5519375","nm6182221,nm1628284,nm2921377"
7610472,tt9916850,"nm5519375,nm5519454","nm6182221,nm1628284,nm2921377"
7610473,tt9916852,"nm5519375,nm5519454","nm6182221,nm1628284,nm2921377"
7610474,tt9916856,nm10538645,nm6951431
7610475,tt9916880,nm0996406,"nm1482639,nm2586970"


Unnamed: 0,tconst,averageRating,numVotes
1123053,tt9916580,7.2,5
1123054,tt9916690,6.6,5
1123055,tt9916720,6.2,72
1123056,tt9916766,6.9,16
1123057,tt9916778,7.5,27


Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
10713444,nm9993714,Romeo del Rosario,\N,\N,"animation_department,art_department",tt2455546
10713445,nm9993716,Essias Loberg,\N,\N,,\N
10713446,nm9993717,Harikrishnan Rajan,\N,\N,cinematographer,tt8736744
10713447,nm9993718,Aayush Nair,\N,\N,cinematographer,\N
10713448,nm9993719,Andre Hill,\N,\N,,\N


<div class="alert alert-block alert-danger"><b>STOP HERE for your Project Proposal assignment. Submit your (1) original data file(s) along with (2) the completed notebook up to this point, and (3) the html file for grading and approval.</b></div>

<div class="alert alert-block alert-warning"><b>Instructor Feedback and Approval (3 pts)</b>: Your instructor will provide feedback in either the cell below this or via Canvas. You can expect one of the following point values for this portion.

<b>3 pts</b> - if your project goals and data set are both approved.<br>
<b>2 pts</b> - if your data set is approved but changes to your project goals (Step 3) are needed.<br>
<b>1 pt</b> - if your project goals are approved but your data set is not approved.<br>
<b>0 pts</b> - if neither your data set nor your project goals are approved.<br><br>
    
<i><u>As needed, follow your instructor's feedback and guidance to get on track for the remaining portions of the EDA project.</u></i>
</div>

# EDA Phase 1

<div class="alert alert-block alert-info">The overall goal of this assignment is to take all necessary steps to inspect the quality of your data and prepare the data according to your needs. For information and resources on the process of Exploratory Data Analysis (EDA), you should explore the <b><u>EDA Project Resources Module</u></b> in Canvas.

Once you’ve read through the information provided in that module and have a comfortable understanding of EDA using Python, complete steps 6 through 10 listed below to satisfy the requirements for your EDA Phase 1 assignment. **Remember to convert code cells provided to markdown cells for any typed responses to questions.**</div>

<div class="alert alert-block alert-success"><b>Step 6 (2 pts)</b>: Begin by elaborating in more detail from the previous assignment on why you chose this data?<br>
    
1. Explain what you hope to learn from this data. 
2. Do you have a hunch about what this data will reveal? (The answer to this question will be used in the Introduction section of your EDA report.)
</div>

We hope to learn whether the perceived popularity of true crime media in recent years is driving an increase in true crime documentaries, or whether other factors are contributing. The questions we hope to answer are:

* Is it true that True Crime has become more popular over time?
* Has there been a change in volume of media and/or documentaries in general?
    * Could this just be the result of media content increasing overall?
    * Could this just be the result of documentaries as a genre increasing overall?
* Has popularity of documentaries changed over time? How does this compare to other genres?
* Is there a correlation between release date and its popularity?
* Are there correlations between popularity and the crew? For example, is a particular director scoring higher votes and ratings?

Our hunch is that there has been both an increase in volume of media overall and in the genre. However, we believe that the genre is growing faster in popularity over time than other genres.  

<div class="alert alert-block alert-success"><b>Step 7 (2 pts)</b>: Discuss the population and the sample:<br>
    
1. What is the population being represented by the data you’ve chosen? 
2. What is the total sample size?
</div>

### Kathleen

Use the ones with title, rating, and average rating only? Assuming most common movies will have this populated. 
Code cell: population and sample size, drop the nulls, if none of the fields are available

total population: unique to tconst
sample population: whatever titles are filled in with each of the required fields (title, rating, and average rating only) & no null values
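
The notes above can be sketched in code as follows. This is only an illustration on made-up rows: the `titles` frame stands in for the merged titles data, and the list of required fields (`primaryTitle`, `averageRating`, `numVotes`) is an assumption based on the notes rather than a final decision.

```python
import pandas as pd

# Made-up stand-in for the merged titles data (illustration only)
titles = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "primaryTitle": ["A", "B", None, "D"],
    "averageRating": [7.1, None, 6.0, 8.2],
    "numVotes": [120, None, 15, 3400],
})

# Assumed required fields; adjust once the team settles on the final list
required = ["primaryTitle", "averageRating", "numVotes"]

population_size = titles["tconst"].nunique()   # every unique title ID
sample = titles.dropna(subset=required)        # titles with all required fields present
sample_size = sample["tconst"].nunique()

print(f"Population: {population_size} titles; sample: {sample_size} titles")
```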



<div class="alert alert-block alert-success"><b>Step 8 (2 pts)</b>: Describe how the data was collected. For example, is this a random sample? Are sampling weights used with the data?</div>

### Sahiti

This is not a random sample and no weighting was used. 

The data is collected by IMDb from several sources. As noted on their site (https://help.imdb.com/article/imdb/general-information/where-does-the-information-on-imdb-come-from/GGD7NGF5X3ECFKNN?ref_=helpart_nav_24#), they collect data from studios and filmmakers, but the bulk of the information is submitted by people in the industry or people visiting the site. IMDb regularly runs quality checks to keep the data as accurate as possible.


<div class="alert alert-block alert-success"><b>Step 9 (4 pts)</b>: In the Project Proposal assignment you used the info(&nbsp;) method to inspect the variables, their data types, and the number of non-null values. Using that information as a guide, provide definitions of each of your variables and their corresponding data types, i.e. a data dictionary. Also indicate which variables will be used for your purposes.</div>

### Sahiti 

The data will be combined to use the following fields:

titles dataframe:
* tconst (string object): title ID
* primaryTitle (string object): media title
* startYear (string object): Release Year
* genres (string object): Genre
* nconst (string object): Name ID
* averageRating (float): media rating
* numVotes (integer): media votes
* primaryName (string object): director's name
* titleType (string object): type of media

director dataframe:
* directorName (string object): director's name
* directorID (string object): director's ID
* tconst (string object): title ID
* startYear (string object): Release Year
* isTrueCrime (boolean): whether the title is true crime or not
* averageRating (float): media rating
* numVotes (integer): media votes

<div class="alert alert-block alert-success"><b>Step 10 (10 pts)</b>: For full credit in this problem you'll want to <i><u>take all necessary steps to report on the quality of the data</u></i> and <i><u>clean the data accordingly</u></i>. Some things to consider while doing this are listed below. <b>Depending on your data and goals, there may be additional steps needed than those listed here.</b>
    
1. Are there rows with missing or inconsistent values? If so, eliminate those rows from your data where appropriate.
2. Are there any outliers or duplicate rows? If so, eliminate those rows from your data where appropriate. 
3. Are you using all columns (variables) in the data? If not, are you eliminating those columns?
4. Consider some type of visual display such as a boxplot to determine any outliers. Do any outliers need to be removed? If so, how many were removed?

At each stage of cleaning the data, state how many rows were eliminated. <b><u><i>It is good practice to get the shape of the data before and after each step in cleaning the data and add typed explanations (in separate markdown cells) of the steps taken to clean the data.</i></u></b><br></div>
    
<div class="alert alert-block alert-info">Include the rest of your work below and insert cells where needed.</div>

_Notes:_ 
- smaller scope of items by year and title type
- focus set of columns
- merge the data set

- check how many directors in the list: make into unique column for their names?
- only titles with ratings, votes, and directors ?
- check how many genres in the list: make unique column for their names?
    - if true crime is included

#### Genre Cleanup - Step 1

* Add a column with a boolean datatype where it's True if the genres field contains both 'documentary' and 'crime', case insensitive
* Count the number of true crime documentaries in the dataset.
* Display the percent of true crime documentaries in the dataset

In [32]:
## Add a column "isTrueCrime"
imdb_title_basics['isTrueCrime'] = (imdb_title_basics.genres.str.contains('crime', 
                flags=re.IGNORECASE, na=False)) & (imdb_title_basics.genres.str.contains('documentary', 
                flags=re.IGNORECASE, na=False))

In [33]:
TrueCrimeCount = len(imdb_title_basics[imdb_title_basics["isTrueCrime"]])  ## count the number of True items in isTrueCrime

print(f'The total number of titles in the dataset is {len(imdb_title_basics)}.')

# The .2% format multiplies the fraction by 100 before printing it as a percentage
print(f'''The number of True Crime documentaries in the dataset is {TrueCrimeCount}, which is {TrueCrimeCount/len(imdb_title_basics):.2%} of the total titles.''')

The total number of titles in the dataset is 7619417.
The number of True Crime documentaries in the dataset is 20106, which is 0.26% of the total titles.


#### Director Cleanup - Step 2

We're trying to answer: "Are there correlations between popularity and the crew? For example, is a particular director scoring higher votes and ratings?"

We considered breaking out the director column into separate fields, so there was only one director per column.  
First, we checked whether this was practical by checking for the max number of directors in the imdb_crew dataset:

In [34]:
director_counts = imdb_crew.directors.str.split(",").str.len()  # number of director IDs listed for each title

b = director_counts.idxmax()  # index id of the first title with the largest director count
a = director_counts[b]        # the director count for that title

print(f'The title with the most number of directors has an index id of {b} in the crew table and it has {a} directors. See below for more details.')

display(imdb_title_basics[imdb_title_basics.tconst == imdb_crew.tconst[b]])

The title with the most number of directors has an index id of 423902 in the crew table and it has 463 directors. See below for more details.


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,isTrueCrime
423902,tt0441074,tvSeries,Television Theater,Teatr Telewizji,0,1953,\N,\N,Drama,False


Given that the max number of directors in the dataset is so large (though this could be an outlier), we don't think it's practical to separate them out into separate columns.  Instead we should build a director table consisting of:

* directorName (imdb_name.primaryName)
* directorID (imdb_crew.directors)
* tconst (imdb_crew.tconst)
* startYear (imdb_title_basics.startYear)
* isTrueCrime (imdb_title_basics.isTrueCrime)
* averageRating (imdb_ratings.averageRating)
* numVotes (imdb_ratings.numVotes)

With this dataset, we can determine whether the average rating, weighted by the number of votes, is positively correlated with the director.  We will also be able to group by whether the titles are True Crime or not.
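
As a rough sketch of the vote-weighted average mentioned above, each director's rating can be computed as the sum of rating times votes divided by the sum of votes. The rows below are made up; the column names follow the planned director table, and this is one possible way to do the weighting, not a settled method.

```python
import pandas as pd

# Made-up rows shaped like the planned director table (illustration only)
director_rows = pd.DataFrame({
    "director": ["nm1", "nm1", "nm2"],
    "averageRating": [8.0, 6.0, 7.0],
    "numVotes": [100, 300, 50],
})

# Ignore titles without ratings so they do not distort the weights
rated = director_rows.dropna(subset=["averageRating", "numVotes"])

# Weighted mean per director = sum(rating * votes) / sum(votes)
weighted_sum = (rated["averageRating"] * rated["numVotes"]).groupby(rated["director"]).sum()
total_votes = rated.groupby("director")["numVotes"].sum()
weighted_rating = weighted_sum / total_votes

print(weighted_rating)
```

Weighting by `numVotes` keeps a title with only a handful of votes from counting as much as a widely rated one.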

Next, prepare the data so that we can have a "director" dataframe with a single director per row.  For the directors column, we need the string with the director IDs to be a list.

In [35]:
#Convert the directors column, which is currently a string object, to a list and assign the result to a new column
print(type(imdb_crew.directors[0])) #confirm the type of the directors column

imdb_crew['directors_list'] = imdb_crew['directors'].str.split(',')
# split the directors string into a list and create a new column from it

display(imdb_crew.tail())

<class 'str'>


Unnamed: 0,tconst,directors,writers,directors_list
7619412,tt9916848,"nm5519454,nm5519375","nm6182221,nm1628284,nm2921377","[nm5519454, nm5519375]"
7619413,tt9916850,"nm5519454,nm5519375","nm6182221,nm1628284,nm2921377","[nm5519454, nm5519375]"
7619414,tt9916852,"nm5519454,nm5519375","nm6182221,nm1628284,nm2921377","[nm5519454, nm5519375]"
7619415,tt9916856,nm10538645,nm6951431,[nm10538645]
7619416,tt9916880,nm0996406,"nm1482639,nm2586970",[nm0996406]


Next, "explode" the crew data so that each row is an individual director.

In [36]:
directors = imdb_crew.explode('directors_list').drop(columns=['directors', 'writers'], inplace=False)
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html
# this will create a new dataframe where each row is an individual director.  We drop the directors and 
# writers columns

directors.rename(columns={"directors_list": "director"}, inplace = True) 
# rename the directors_list column to "director" since it's no longer a list

display(directors.tail())

Unnamed: 0,tconst,director
7619413,tt9916850,nm5519375
7619414,tt9916852,nm5519454
7619414,tt9916852,nm5519375
7619415,tt9916856,nm10538645
7619416,tt9916880,nm0996406


Create the final directors dataframe by merging data from imdb_name, imdb_title_basics and imdb_ratings.  Print the shape of the dataframe before and after each merge. Use a left merge (`how='left'`) so we can clean the dataframe appropriately later.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
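
To see what a left merge leaves behind for later cleaning, the `indicator=True` option of `pd.merge` adds a `_merge` column showing whether each row found a match. The sketch below uses made-up miniature tables (`directors_small`, `ratings_small`) rather than the real dataframes.

```python
import pandas as pd

# Made-up miniature versions of the directors and ratings tables (illustration only)
directors_small = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"],
                                "director": ["nm1", "nm2", "nm3"]})
ratings_small = pd.DataFrame({"tconst": ["tt1", "tt3"],
                              "averageRating": [7.5, 6.1]})

# how='left' keeps every director row; indicator=True adds a _merge column
# recording whether a matching rating was found
merged = pd.merge(directors_small, ratings_small, on="tconst",
                  how="left", indicator=True)

# Rows marked left_only had no rating and carry NaN in averageRating
unmatched = (merged["_merge"] == "left_only").sum()
print(f"{len(merged)} rows after the merge, {unmatched} without a rating")
```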

In [37]:
print(f'The original dataframe has a shape of {directors.shape}')

# Merge in the director's name and drop the redundant nconst field
directors = pd.merge(directors, imdb_name[['nconst','primaryName']], left_on = 'director', right_on = 'nconst', how='left').drop(columns=['nconst'], inplace = False)
print(f'After merging in data from the imdb_name dataframe, directors has a shape of {directors.shape}')

# Merge in the isTrueCrime and startYear fields
directors = pd.merge(directors, imdb_title_basics[['tconst','isTrueCrime', 'startYear']], on = 'tconst', how='left')
print(f'After merging in data from the imdb_title_basics dataframe, directors has a shape of {directors.shape}')

# Merge in the averageRating and numVotes fields
directors = pd.merge(directors, imdb_ratings[['tconst','averageRating', 'numVotes']], on = 'tconst', how='left')
print(f'After merging in data from the imdb_ratings dataframe, directors has a shape of {directors.shape}')

display(directors.tail())

The original dataframe has a shape of (8997019, 2)
After merging in data from the imdb_name dataframe, directors has a shape of (8997019, 3)
After merging in data from the imdb_title_basics dataframe, directors has a shape of (8997019, 5)
After merging in data from the imdb_ratings dataframe, directors has a shape of (8997019, 7)


Unnamed: 0,tconst,director,primaryName,isTrueCrime,startYear,averageRating,numVotes
8997014,tt9916850,nm5519375,Deniz Yorulmazer,False,2010,,
8997015,tt9916852,nm5519454,Semih Bagci,False,2010,,
8997016,tt9916852,nm5519375,Deniz Yorulmazer,False,2010,,
8997017,tt9916856,nm10538645,Johan Planefeldt,False,2015,,
8997018,tt9916880,nm0996406,Hilary Audus,False,2014,,


#### Merge Data - Step 3

* We will probably need at least two dataframes to answer the questions we've posed:
 * director dataframe (Curt already created the merged dataframe above, but didn't clean it for nulls or anything)
 * title dataframe 

* figuring out what is the missing data
* figure out counts for population & sample
* question 3

Kathleen

how many null values exist for each column

#### Remove Null Values

We reviewed the rows where the primary title was null and checked whether we could replace it with the original title. However, the original title was also null in every one of those rows.

In [38]:
imdb_title_basics.isnull().sum() #display the number of missing values in each column


tconst             0
titleType          0
primaryTitle       8
originalTitle      8
isAdult            0
startYear          0
endYear            0
runtimeMinutes     0
genres            10
isTrueCrime        0
dtype: int64

In [39]:
display(imdb_title_basics[imdb_title_basics.genres.isnull()])

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,isTrueCrime
1101682,tt10233364,tvEpisode,Rolling in the Deep Dish\tRolling in the Deep ...,0,2019,\N,\N,Reality-TV,,False
1523177,tt10970874,tvEpisode,Die Bauhaus-Stadt Tel Aviv - Vorbild für die M...,0,2019,\N,\N,\N,,False
1921543,tt11670006,tvEpisode,...ein angenehmer Unbequemer...\t...ein angene...,0,1981,\N,\N,Documentary,,False
2035319,tt11868642,tvEpisode,GGN Heavyweight Championship Lungs With Mike T...,0,2020,\N,\N,Talk-Show,,False
2195713,tt12149332,tvEpisode,Jeopardy! College Championship Semifinal Game ...,0,2020,\N,\N,Game-Show,,False
2346659,tt12415330,tvEpisode,Anthony Davis High Brow Tank\tAnthony Davis Hi...,0,2017,\N,\N,Reality-TV,,False
3083104,tt13704268,tvEpisode,Bay of the Triffids/Doctor of Doom\tBay of the...,0,\N,\N,\N,"Animation,Comedy,Family",,False
4895530,tt3984412,tvEpisode,"I'm Not Going to Come Last, I'm Just Going to ...",0,2014,\N,\N,Reality-TV,,False
7574648,tt9822816,tvEpisode,Zwischen Vertuschung und Aufklärung - Missbrau...,0,2019,\N,\N,\N,,False
7615757,tt9909210,tvEpisode,Politik und/oder Moral - Wie weit geht das Ver...,0,2005,\N,\N,\N,,False
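
In the rows above, `primaryTitle` holds two tab-separated titles and the remaining values are shifted one column to the left, which leaves `genres` empty. One plausible cause, offered here as an assumption rather than something confirmed against the raw file, is a title that begins with a double-quote character: by default `pd.read_csv` treats `"` as the start of a quoted field and reads past the next tab. The sketch below uses a made-up row to show how `quoting=csv.QUOTE_NONE` keeps every tab as a field separator.

```python
import csv
import io
import pandas as pd

# Made-up row whose titles start with an unclosed double quote (illustration only)
quoted_tsv = (
    "tconst\tprimaryTitle\toriginalTitle\tgenres\n"
    "tt0000001\t\"Quoted start\t\"Quoted start\tReality-TV\n"
)

# QUOTE_NONE disables quote handling, so each tab ends a field
unquoted = pd.read_csv(io.StringIO(quoted_tsv), sep="\t", quoting=csv.QUOTE_NONE)

print(unquoted.loc[0, "genres"])
```

If this is indeed the cause, re-reading title.basics with this option may recover the affected rows instead of dropping them.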


In [40]:
media_titles = imdb_title_basics #Create new table for analysis
null_ptitles = media_titles[media_titles.primaryTitle.isnull()]
display(null_ptitles)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,isTrueCrime
1419394,tt10790040,tvEpisode,,,0,2019,\N,\N,\N,False
3826023,tt1971246,tvEpisode,,,0,2011,\N,\N,Biography,False
3918574,tt2067043,tvEpisode,,,0,1965,\N,\N,Music,False
5084122,tt4404732,tvEpisode,,,0,2015,\N,\N,Comedy,False
5692716,tt5773048,tvEpisode,,,0,2015,\N,\N,Talk-Show,False
6938067,tt8473688,tvEpisode,,,0,1987,\N,\N,Drama,False
6969780,tt8541336,tvEpisode,,,0,2018,\N,\N,"Reality-TV,Romance",False
7575358,tt9824302,tvEpisode,,,0,2016,\N,\N,Documentary,False


In [41]:
null_otitles = media_titles[media_titles.originalTitle.isnull()]
display(null_otitles)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,isTrueCrime
1419394,tt10790040,tvEpisode,,,0,2019,\N,\N,\N,False
3826023,tt1971246,tvEpisode,,,0,2011,\N,\N,Biography,False
3918574,tt2067043,tvEpisode,,,0,1965,\N,\N,Music,False
5084122,tt4404732,tvEpisode,,,0,2015,\N,\N,Comedy,False
5692716,tt5773048,tvEpisode,,,0,2015,\N,\N,Talk-Show,False
6938067,tt8473688,tvEpisode,,,0,1987,\N,\N,Drama,False
6969780,tt8541336,tvEpisode,,,0,2018,\N,\N,"Reality-TV,Romance",False
7575358,tt9824302,tvEpisode,,,0,2016,\N,\N,Documentary,False


In [42]:
media_titles.shape  #This is the original population size

(7619417, 10)

In [43]:
media_titles = media_titles.drop(null_ptitles.index)  
media_titles.shape

(7619409, 10)

In [44]:
null_genres = media_titles[media_titles.genres.isnull()]
media_titles = media_titles.drop(null_genres.index)
media_titles.shape

(7619399, 10)
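
The shapes above show 8 rows removed for null titles and then 10 more for null genres. One way to state those counts at every cleaning stage is a small helper that prints the row count before and after each step. `drop_and_report` is a hypothetical helper written for this sketch, and the data is made up.

```python
import pandas as pd

def drop_and_report(df, mask, reason):
    """Drop rows where mask is True and print how many rows were removed.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    before = len(df)
    cleaned = df[~mask]
    print(f"{reason}: removed {before - len(cleaned)} rows ({before} -> {len(cleaned)})")
    return cleaned

# Made-up data (illustration only)
demo = pd.DataFrame({"primaryTitle": ["A", None, "C"],
                     "genres": ["Drama", "Comedy", None]})

# Each call reports its own row count so the cleaning log stays complete
demo = drop_and_report(demo, demo["primaryTitle"].isnull(), "Null primaryTitle")
demo = drop_and_report(demo, demo["genres"].isnull(), "Null genres")
```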

In [45]:
media_titles.isnull().sum() #recheck and display the number of missing values in each column

tconst            0
titleType         0
primaryTitle      0
originalTitle     0
isAdult           0
startYear         0
endYear           0
runtimeMinutes    0
genres            0
isTrueCrime       0
dtype: int64

In [47]:
media_titles = media_titles.drop(columns=['isAdult','endYear','runtimeMinutes','originalTitle'], inplace=False)
media_titles.info()
                                          

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7619399 entries, 0 to 7619416
Data columns (total 6 columns):
 #   Column        Dtype 
---  ------        ----- 
 0   tconst        object
 1   titleType     object
 2   primaryTitle  object
 3   startYear     object
 4   genres        object
 5   isTrueCrime   bool  
dtypes: bool(1), object(5)
memory usage: 356.1+ MB


In [56]:
#merge the average ratings and votes
media_titles = pd.merge(media_titles, imdb_ratings[['tconst','averageRating', 'numVotes']], on = 'tconst', how='left')


In [59]:
#merge the director id and name
media_titles = pd.merge(media_titles, directors[['tconst','director','primaryName']], on ='tconst', how='left')

In [60]:
media_titles.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8997001 entries, 0 to 8997000
Data columns (total 10 columns):
 #   Column         Dtype  
---  ------         -----  
 0   tconst         object 
 1   titleType      object 
 2   primaryTitle   object 
 3   startYear      object 
 4   genres         object 
 5   isTrueCrime    bool   
 6   averageRating  float64
 7   numVotes       float64
 8   director       object 
 9   primaryName    object 
dtypes: bool(1), float64(2), object(7)
memory usage: 695.0+ MB


In [76]:
media_titles.titleType.value_counts()

In [75]:
# Remove tvEpisode, videogame, radioSeries, audiobook, and episode titles because they aren't relevant. For example, a tvEpisode is just a subset of a tvSeries.
# Filter with a boolean mask: drop(..., inplace=True) returns None, so its result must not be assigned back to media_titles.

excluded_types = ['tvepisode', 'videogame', 'radioseries', 'audiobook', 'episode']  # compared in lower case

media_titles = media_titles[~media_titles.titleType.str.lower().isin(excluded_types)]

media_titles.titleType.value_counts()

#### Missing Data & Duplicate Rows 

* question 1 & 2
* state why it was dropped & why we didn't choose to impute (i.e. not take the avg and use that for the missing value)

Sahiti

#### Outliers & Graph

Curt

* do it against avg scores/popularity due to low volume of ratings (an example)
* question 4 & 2(outliers)

```
beer_reviews.boxplot(column=[
    'review_aroma', 
    'review_appearance', 
    'review_palate', 
    'review_taste', 
    'review_overall'], rot=55)
```
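
For this project's data, the same idea can be applied to `numVotes` and `averageRating`. A boxplot's whiskers are conventionally drawn at 1.5 times the interquartile range (IQR) beyond the quartiles, so the sketch below counts the rows outside those fences. The values are made up, and the 1.5 multiplier is the usual default rather than a choice made in this notebook.

```python
import pandas as pd

# Made-up vote counts (illustration only)
ratings_demo = pd.DataFrame({"numVotes": [5, 10, 12, 15, 20, 1000]})

q1 = ratings_demo["numVotes"].quantile(0.25)
q3 = ratings_demo["numVotes"].quantile(0.75)
iqr = q3 - q1

# Rows outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are the points a boxplot draws as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (ratings_demo["numVotes"] < lower) | (ratings_demo["numVotes"] > upper)

print(f"{is_outlier.sum()} of {len(ratings_demo)} rows fall outside the whiskers")
```

Calling `ratings_demo.boxplot(column=['numVotes'])` would draw the corresponding plot, and `ratings_demo[~is_outlier]` keeps only the rows inside the fences.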

In [None]:
## boxplot 


<div class="alert alert-block alert-danger"><b>STOP HERE for your EDA Phase 1 assignment. Submit your <i><u>cleaned</u></i> data file along with the completed notebook up to this point for grading.</b></div>

# EDA Phase 2

<div class="alert alert-block alert-info">All of your work for the EDA Phase 2 assignment will begin below here. Refer to the detailed instructions and expectations for this assignment in Canvas.</div>