## Context:

Today we are going to be working with the IMDB movie data set.  Our goal is to eventually create a linear regression model that will enable us to predict the box office gross of a movie based on characteristics of the movie.

Before we can start to model, we need to make sure our data is clean and in a usable format.  Therefore we will go through several steps of data cleaning. The code below is not an exhaustive list, but it includes many of the processes you will go through to clean data.  

In [None]:

import numpy as np 
import pandas as pd
import seaborn as sns
from datetime import datetime
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.max_columns', 100)

### Check Your Data … Quickly
The first thing you want to do when you get a new dataset is to quickly verify its contents with the .head() method.

In [None]:
df = pd.read_csv('movie_metadata.csv')
print(df.shape)
df.head()


Now let’s quickly see the names and types of the columns. Most of the time you’re going to get data that is not quite what you expected, such as dates that are actually strings and other oddities, so it is best to check upfront.

In [None]:
# Get column names
column_names = df.columns
print(column_names)
# Get column data types
df.dtypes


## Convert a column to a different data type

The most common example of this is converting a number stored as a string to an actual float or integer.  There are two ways you can achieve this.  

1. astype(float) method

`df['DataFrame Column'] = df['DataFrame Column'].astype(float)`
2.  to_numeric method

`df['DataFrame Column'] = pd.to_numeric(df['DataFrame Column'],errors='coerce')`

What is the difference between these two methods?

(1) `astype(float)` suits a column whose values are all numeric but stored as strings; it raises an error if any value cannot be converted.

(2) `to_numeric` suits a column that contains both numeric and non-numeric values. By setting `errors='coerce'`, you’ll transform the non-numeric values into NaN.


https://datatofish.com/convert-string-to-float-dataframe/
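
To make the difference concrete, here is a minimal sketch on a toy column. The values are made up for illustration and are not taken from the IMDB data.

```python
import pandas as pd

# Toy column (hypothetical values) mixing numeric strings with a non-numeric entry
prices = pd.Series(['1.5', '2', 'unknown'])

# to_numeric with errors='coerce' turns anything unparseable into NaN
coerced = pd.to_numeric(prices, errors='coerce')
print(coerced)

# astype(float) raises as soon as it meets a value it cannot parse
try:
    prices.astype(float)
except ValueError as err:
    print('astype failed:', err)

# astype(float) works when every value is a numeric string
clean = pd.Series(['1.5', '2', '3']).astype(float)
print(clean.dtype)
```

If you use `errors='coerce'`, it is worth counting the resulting NaN values afterwards so you know how much data was silently converted.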

In [None]:
df['title_year'][0]

In [None]:
df['title_year']= pd.to_datetime(df['title_year'], format='%Y')

In [None]:
df.dtypes

### Drop Columns

If you do not plan on using some data in your analysis, feel free to drop those columns. 

In [None]:
print(df.columns)

In [None]:
df.drop(columns=['aspect_ratio', 'plot_keywords'], inplace=True)

In [None]:
df.shape

In [None]:
#alternatively, select only the columns you want to keep
smaller_df = df[['gross', 'budget']]

## Investigate the data

In [None]:
df.content_rating

In [None]:
#look at the unique values for ratings
ratings = list(df['content_rating'].unique())
ratings

In [None]:
df['content_rating'].value_counts()

There are many unique values that don't have a high count or don't make sense to the common user.  How should we handle these?

In [None]:
#create a list of the ratings we want to group
unrated = ['Unrated','Approved', 'Not Rated', 'TV-MA', 'M', 'GP', 'Passed', np.nan, 'X', 'NC-17','TV-14', 'TV-PG', 'TV-G', 'TV-Y', 'TV-Y7']

In [None]:
#create a list of the movie ratings we want to maintain
rated = [x for x in ratings if x not in unrated]

In [None]:
rated

In [None]:
#create a dictionary with keys of the 'unrated' values and the value being 'unrated'
unrated_dict = dict.fromkeys(unrated, 'unrated')

In [None]:
unrated_dict

In [None]:
#create a dictionary of the rated values
rated_dict  = dict(zip(rated, rated))

In [None]:
rated_dict

In [None]:
#combine those dictionaries into 1
ratings_map = {**rated_dict,**unrated_dict}
ratings_map

#### What does `**` do? 

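Inside a dictionary literal, `**` unpacks the key/value pairs of the dictionary that follows it into the new dictionary. If the same key appears more than once, the value from the last dictionary wins.  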

https://medium.com/understand-the-python/understanding-the-asterisk-of-python-8b9daaa4a558

https://pynash.org/2013/03/13/unpacking/
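
As a small illustration of dictionary unpacking, the sketch below merges two dictionaries with made-up keys. It shows what happens when a key appears in both.

```python
# Two small dictionaries with made-up keys, for illustration only
first = {'PG': 'PG', 'R': 'R'}
second = {'TV-MA': 'unrated', 'R': 'unrated'}

# ** unpacks each dictionary's key/value pairs into the new literal;
# when a key appears in both, the value from the later dictionary wins
merged = {**first, **second}
print(merged)
```

Here `'R'` ends up mapped to `'unrated'` because `second` is unpacked after `first`.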

In [None]:
# use the pandas map function to change the content_rating column
df['rating'] = df['content_rating'].map(ratings_map)

In [None]:
#compare the two columns
df[['rating', 'content_rating']].tail()
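
One thing to watch for when mapping with a dictionary: any value that is not a key in the dictionary becomes NaN. The sketch below uses a toy Series with a made-up `'Banned'` value to show this.

```python
import pandas as pd

# Toy ratings; 'Banned' is a made-up value that is absent from the mapping
ratings = pd.Series(['PG', 'TV-MA', 'R', 'Banned'])
mapping = {'PG': 'PG', 'R': 'R', 'TV-MA': 'unrated'}

# map looks each value up in the dictionary
mapped = ratings.map(mapping)
print(mapped)

# Values missing from the dictionary come back as NaN, so count them
print(mapped.isna().sum())
```

Checking the NaN count after a `map` is a quick way to confirm the dictionary covered every category you expected.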

## Handling Missing Data:
    


In [None]:
df.head()

In [None]:
#creates a dataframe of booleans showing where data is missing
df.isna().head()

In [None]:
# Find the fraction of missing values in each column (multiply by 100 for a percentage)
df.isna().mean()
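
To see why the mean of the boolean mask gives the share of missing values, here is a minimal sketch on a toy frame. The numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with known gaps: 2 of 4 gross values and 1 of 4 budget values missing
toy = pd.DataFrame({
    'gross': [100.0, np.nan, 300.0, np.nan],
    'budget': [50.0, 60.0, np.nan, 80.0],
})

# True counts as 1 and False as 0, so the mean is the fraction missing per column
missing_share = toy.isna().mean()
print(missing_share)

# Multiply by 100 for percentages and sort so the worst columns come first
print((missing_share * 100).sort_values(ascending=False))
```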

In [None]:
#graphically see the missing data
sns.heatmap(df.isna(), cbar=False)

In [None]:
df.groupby('title_year')['gross'].mean()

#### Dropping missing rows

One way to handle missing data is just to drop the observation from the data set. This is not always ideal since you will lose observations, but it might be unavoidable.  For example, we want to predict the gross earnings for each film, so we have to remove those that don't have a value for gross.

In [None]:
df.dropna(subset=['gross'], inplace=True)

In [None]:
df.shape

In [None]:
sns.heatmap(df.isnull(), cbar=False)

In [None]:
#look at the observations that are missing a budget value
df[df['budget'].isna()].head()

Quite a few films are still missing the values for budget. We do not want to drop this column because we believe it is an important variable, but we must have a value for each observation in order to use it.

**Talk with a partner about different ways you could fill in the missing budget values.**

In [None]:
#you can fill the missing values with the average value of the observations
#inplace=False returns a new Series and leaves df unchanged
df['budget'].fillna(df['budget'].mean(), inplace=False)

Another way to fill the missing data is to use the average budget of films with the same rating.

In [None]:
df.groupby('rating')['gross'].mean().plot(kind='bar')

In [None]:
budget_ratings = df.groupby('rating')['budget'].mean().round(1).to_dict()
budget_ratings

In [None]:
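# assign the result back; calling fillna with inplace=True on a selected column may not update df in recent pandas
df['budget'] = df['budget'].fillna(df['rating'].map(budget_ratings))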
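
One possible way to express the same group-wise fill without building a dictionary is `groupby(...).transform('mean')`. The sketch below uses a made-up toy frame, not the IMDB data; the column names mirror the notebook's but the values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame: one missing budget in each rating group
toy = pd.DataFrame({
    'rating': ['PG', 'PG', 'R', 'R', 'R'],
    'budget': [10.0, np.nan, 40.0, 60.0, np.nan],
})

# transform('mean') returns the group mean aligned to every row
group_mean = toy.groupby('rating')['budget'].transform('mean')

# Assign the filled column back to the frame
toy['budget'] = toy['budget'].fillna(group_mean)
print(toy)
```

The missing PG budget is filled with the PG mean and the missing R budget with the R mean, so each film borrows from its own group rather than from the overall average.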


In [None]:
sns.heatmap(df.isnull(), cbar=False)

What statistical test could we use to support our use of this method?

### Handling Categorical Data

https://towardsdatascience.com/the-dummys-guide-to-creating-dummy-variables-f21faddb1d40

In [None]:
df['rating'].value_counts()

In [None]:
df['rating'].head(10)

In [None]:
pd.get_dummies(df['rating']).head(10)

In [None]:
df = pd.concat([df, pd.get_dummies(df['rating'])], axis=1)
df.head(10)
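
A minimal sketch of the same idea on a toy column, so the shape of the result is easy to see. The `prefix` argument is an optional addition here, not something the notebook uses.

```python
import pandas as pd

# Toy frame with three rating categories
toy = pd.DataFrame({'rating': ['PG', 'R', 'unrated', 'R']})

# One indicator column per category; prefix keeps the new names readable
dummies = pd.get_dummies(toy['rating'], prefix='rating')

# axis=1 joins the indicator columns side by side with the original frame
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```

For a linear regression you would typically drop one of the indicator columns (for example with `drop_first=True`) so the predictors are not perfectly collinear.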

## Removing Outliers

https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba

In [None]:
df.boxplot(['gross'])

In [None]:
df.sort_values('gross', ascending=False)

In [None]:
# Calculate the gross amount that is 3 standard deviations above the mean
above_3std = df.gross.mean()+(3*df.gross.std())

### Use a conditional selection to only return values lower than 3 standard deviations above the mean
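
A minimal sketch of such a conditional selection, using a toy frame with invented values (19 ordinary films and one extreme one) rather than the IMDB data:

```python
import pandas as pd

# Toy gross values (made up): 19 ordinary films and one extreme outlier
toy = pd.DataFrame({'gross': [10.0] * 19 + [1000.0]})

# Threshold: 3 standard deviations above the mean
cutoff = toy['gross'].mean() + 3 * toy['gross'].std()

# The boolean mask keeps only rows below the threshold
trimmed = toy[toy['gross'] < cutoff]
print(len(toy), len(trimmed))
```

Applied to the notebook's data, the same pattern would presumably look like `df = df[df['gross'] < above_3std]`. Note that the outlier itself inflates both the mean and the standard deviation, so this rule can miss outliers in small samples.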

## Creating New columns based on other columns

In [None]:
df['actor_1_facebook_likes'].describe()

In [None]:
# Create a new column called df.superstar where the value is 1
# if df.actor_1_facebook_likes is greater than or equal to 30000 and 0 if not
df['superstar'] = np.where(df['actor_1_facebook_likes']>=30000, 1, 0)

df[['movie_title', 'actor_1_name','actor_1_facebook_likes', 'superstar']].head(10)

**Create your own new column of data using the method above.**

In [None]:
#your code here

Another data cleaning resource:

https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3