In [1]:
# build a simple linear regression in python
from datetime import datetime
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr
import scipy as sp
plt.style.use('seaborn')
%matplotlib inline
pd.set_option('display.max_columns', 300)

In [2]:
#read in height weight data
weight_df = pd.read_csv('https://raw.githubusercontent.com/learn-co-students/nyc-mhtn-ds-071519-lectures/master/week-1/Descriptive_Statistics/weight-height.csv')

In [3]:
# building a linear regression model using statsmodels
lr_model = ols(formula='Weight~Height', data=weight_df).fit()

lr_model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Weight | R-squared: | 0.855 |
| Model: | OLS | Adj. R-squared: | 0.855 |
| Method: | Least Squares | F-statistic: | 59040.0 |
| Date: | Wed, 12 Feb 2020 | Prob (F-statistic): | 0.0 |
| Time: | 14:17:56 | Log-Likelihood: | -39219.0 |
| No. Observations: | 10000 | AIC: | 78440.0 |
| Df Residuals: | 9998 | BIC: | 78460.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -350.7372 | 2.111 | -166.109 | 0.000 | -354.876 | -346.598 |
| Height | 7.7173 | 0.032 | 242.975 | 0.000 | 7.655 | 7.780 |

| | | | |
|---|---|---|---|
| Omnibus: | 2.141 | Durbin-Watson: | 1.677 |
| Prob(Omnibus): | 0.343 | Jarque-Bera (JB): | 2.15 |
| Skew: | 0.036 | Prob(JB): | 0.341 |
| Kurtosis: | 2.991 | Cond. No. | 1150.0 |
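
As a quick illustration of how the coefficient table can be read, the sketch below plugs the two `coef` values reported above into the fitted line. The function name `predict_weight` and the sample height of 70 are made up for this example; the units are whatever the data file uses.

```python
# Coefficients copied from the summary table above
intercept = -350.7372
slope = 7.7173


def predict_weight(height):
    """Return the fitted value: Weight = intercept + slope * Height."""
    return intercept + slope * height


# Fitted weight for a hypothetical height of 70
print(round(predict_weight(70), 2))  # 189.47

# Each extra unit of height adds `slope` units of predicted weight
print(round(predict_weight(71) - predict_weight(70), 4))  # 7.7173
```

The intercept on its own is the fitted value at a height of 0, which lies far outside the observed data, so it is better read as an anchor for the line than as a realistic prediction.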



##  Coefficient of Determination ($R^2$)

The _coefficient of determination_ is a measure of how well the model fits the data.

$R^2$ for a model is ultimately a _relational_ notion. It's a measure of goodness of fit _relative_ to a (bad) baseline model. This bad baseline model is simply the horizontal line $y = \mu_Y$, for dependent variable $Y$.


$$\text{TSS }= \text{ESS} + \text{RSS }$$

- TSS or SST = Total Sum of Squares 
- ESS or SSE = Explained Sum of Squares
- RSS or SSR = Residual Sum of Squares

The actual calculation of $R^2$ is: <br/> $$\Large R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\Sigma_i(y_i - \hat{y}_i)^2}{\Sigma_i(y_i - \bar{y})^2}$$

For a least-squares model with an intercept, $R^2$ computed on the data used to fit the model takes values between 0 and 1.

$R^2$ is a measure of how much variation in the dependent variable your model explains.
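
To make the decomposition concrete, here is a minimal sketch that computes TSS, RSS, and ESS by hand and derives $R^2 = 1 - \text{RSS}/\text{TSS}$. The data are synthetic (generated here for illustration, not read from the height/weight file), and `np.polyfit` stands in for the `ols` fit used earlier.

```python
import numpy as np

# Synthetic data: an assumption for illustration only
rng = np.random.default_rng(0)
x = rng.normal(66, 4, size=500)
y = -350 + 7.7 * x + rng.normal(0, 12, size=500)

# Least-squares line y_hat = b0 + b1 * x
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

tss = np.sum((y - y.mean()) ** 2)      # error of the baseline line y = mean(y)
rss = np.sum((y - y_hat) ** 2)         # error left over after fitting the line
ess = np.sum((y_hat - y.mean()) ** 2)  # variation captured by the fitted line

r_squared = 1 - rss / tss
print(round(r_squared, 3))
print(np.isclose(tss, ess + rss))      # the TSS = ESS + RSS identity
```

With a single predictor, this value also equals the squared Pearson correlation between `x` and `y`.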


<img src='https://pbs.twimg.com/media/D-Gu7E0WsAANhLY.png' width ="700">

### Applied 
Build a linear regression model that will estimate the gross revenue of a film from the budget of the film. 

In [4]:
df = pd.read_csv('movie_metadata.csv')
df.columns

Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

In [5]:
# building a linear regression model using statsmodels
movie_model = ____(formula='____~____', data=_____).fit()




Look at the summary table of this model.  

In [None]:
____.summary()

Transform the output of the model into an equation. Then write a sentence that interprets what the coefficient of the independent variable and the y-intercept mean.

Equation:



Sentence: 

What does the P-value of the budget coefficient mean?

Answer:

Write a sentence interpreting the $R^2$ value from your movie model.

Sentence: 

---

**Cross-industry standard process for data mining**, known as **CRISP-DM**, is an open standard process model that describes common approaches used by data mining experts. It is often described as the most widely used analytics process model.

The six high-level phases of the data mining process:
- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment

The sequence of the phases is not strict; moving back and forth between different phases is often required.


<img src='https://www.kdnuggets.com/wp-content/uploads/crisp-dm-4-problems-fig1.png' width ="400">

## Context:

Today we are going to be working with the IMDB movie data set.  Our goal is to eventually create a linear regression model that will enable us to predict the box office gross of a movie based on characteristics of the movie.

Before we can start to model, we need to make sure our data is clean and in a usable format. Therefore we will go through several steps of data cleaning. The code below is not an exhaustive list, but it includes many of the processes you will go through to clean data.

### Check Your Data … Quickly
The first thing you want to do when you get a new dataset is to quickly verify the contents with the `.head()` method.

In [None]:
print(df.shape)
df.head()


Now let’s quickly see the names and types of the columns. Most of the time you’re going to get data that is not quite what you expected, such as dates that are actually strings and other oddities, so it is best to check up front.

In [None]:
# Get column names
column_names = df.columns
print(column_names)



In [None]:
# Get column data types
df.dtypes

## Convert a column to a different data type

The most common example of this is converting numbers stored as strings into actual floats or integers. There are two ways you can achieve this.

1. The `astype(float)` method

`df['DataFrame Column'] = df['DataFrame Column'].astype(float)`

2. The `to_numeric` function

`df['DataFrame Column'] = pd.to_numeric(df['DataFrame Column'], errors='coerce')`

What is the difference between these two methods?

(1) `astype(float)` is for a column that contains only numeric values stored as strings; it raises an error if any value cannot be parsed.

(2) `pd.to_numeric` is for a column that contains both numeric and non-numeric values. By setting `errors='coerce'`, you’ll transform the non-numeric values into NaN.


https://datatofish.com/convert-string-to-float-dataframe/
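
A small sketch of the difference between the two conversions. The two Series below are hypothetical stand-ins, not columns from the movie data.

```python
import pandas as pd

# Hypothetical columns, for illustration only
clean = pd.Series(['1.5', '2', '3.25'])    # every value is a numeric string
messy = pd.Series(['1.5', 'n/a', '3.25'])  # one value is not a number

clean_float = clean.astype(float)          # works: every string parses

try:
    messy.astype(float)                    # fails on 'n/a'
except ValueError as err:
    print('astype failed:', err)

# errors='coerce' turns anything unparseable into NaN instead of raising
messy_float = pd.to_numeric(messy, errors='coerce')
print(messy_float)
```

Coercing is convenient, but it silently creates missing values, so it is worth counting the NaNs afterwards with `messy_float.isna().sum()`.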

In [None]:
df['title_year']

In [None]:
pd.to_datetime(df['title_year'], format='%Y')

### Drop Columns

If you do not plan on using some data in your analysis, feel free to drop those columns. 

In [None]:
print(df.columns)

In [None]:
df.drop(columns=['aspect_ratio', 'plot_keywords'], inplace=True)

In [None]:
df.shape

## Investigate the data

In [None]:
#look at the unique values for ratings
ratings = list(df['content_rating'].unique())
ratings

In [None]:
df['content_rating'].value_counts()

There are many unique values that don't have a high count or don't make sense to the common user.  How should we handle these?

In [None]:
#create a list of the ratings we want to group
unrated = ['Unrated','Approved', 'Not Rated', 'TV-MA', 'M', 'GP', 'Passed', np.nan, 'X', 'NC-17']

In [None]:
#create a list of the movie ratings we want to maintain
rated = [x for x in ratings if x not in unrated]

In [None]:
#create a dictionary with keys of the 'unrated' values and the value being 'unrated'
unrated_dict = dict.fromkeys(unrated, 'unrated')

In [None]:
#create a dictionary of the rated values
rated_dict  = dict(zip(rated, rated))

In [None]:
#combine those dictionaries into 1
ratings_map = {**rated_dict,**unrated_dict}
ratings_map

#### What does `**` do? 

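Inside a dictionary literal, `**` unpacks the key-value pairs of the dictionary that follows it into the new dictionary. If the same key appears in more than one unpacked dictionary, the value from the later one wins.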

https://medium.com/understand-the-python/understanding-the-asterisk-of-python-8b9daaa4a558

https://pynash.org/2013/03/13/unpacking/
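
A minimal sketch of dictionary unpacking on its own, using two small made-up dictionaries rather than the full ratings lists:

```python
# Small hypothetical dictionaries, for illustration only
rated_example = {'PG': 'PG', 'R': 'R'}
unrated_example = {'Passed': 'unrated', 'X': 'unrated'}

# Each ** spills one dictionary's key-value pairs into the new literal
merged = {**rated_example, **unrated_example}
print(merged)

# When a key is repeated, the dictionary unpacked last supplies the value
override = {**{'R': 'R'}, **{'R': 'unrated'}}
print(override)  # {'R': 'unrated'}
```

Neither input dictionary is modified; the literal builds a new one.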

In [None]:
# use the pandas map function to change the content_rating column
df['rating'] = df['content_rating'].map(ratings_map)

In [None]:
#compare the two columns
df[['rating', 'content_rating']].tail()
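
One thing to watch when mapping with a dictionary: any value that is not a key in the dictionary comes back as NaN. The sketch below shows this on a made-up Series, with a deliberately incomplete mapping.

```python
import pandas as pd

# Hypothetical data, for illustration only
ratings_demo = pd.Series(['PG', 'Passed', 'TV-14'])
demo_map = {'PG': 'PG', 'Passed': 'unrated'}  # 'TV-14' deliberately left out

mapped = ratings_demo.map(demo_map)
print(mapped)                # the 'TV-14' row becomes NaN

# Count how many values were lost in the mapping
print(mapped.isna().sum())   # 1
```

This is why the ratings map above is built from every unique value in the column, both the kept ratings and the grouped ones.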

## Handling Missing Data:
    


In [None]:
#creates a dataframe of booleans showing where data is missing
df.isna().head()

In [None]:
# Find the fraction of missing values in each column
df.isna().mean()

In [None]:
#graphically see the missing data
sns.heatmap(df.isna(), cbar=False)

#### Dropping missing rows

One way to handle missing data is just to drop the observation from the data set. This is not always ideal since you will lose observations, but it might be unavoidable. For example, we want to predict the gross earnings for each film, so we have to remove those that don't have a value for gross.

In [None]:
df.dropna(subset=['gross'], inplace=True)

In [None]:
df.shape
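
The `subset` argument matters here. A short sketch on a made-up frame shows how it differs from a plain `dropna()`:

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
demo = pd.DataFrame({
    'gross':  [100.0, np.nan, 300.0],
    'budget': [np.nan, 20.0, 30.0],
})

# Only rows missing 'gross' are dropped; row 0 survives despite its NaN budget
kept = demo.dropna(subset=['gross'])
print(kept.index.tolist())          # [0, 2]

# Without subset, a row is dropped if ANY column is missing
all_complete = demo.dropna()
print(all_complete.index.tolist())  # [2]
```

Dropping on the target column alone keeps as many observations as possible for the later steps.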

In [None]:
sns.heatmap(df.isnull(), cbar=False)

In [None]:
#look at the observations that are missing a budget value
df[df['budget'].isna()].head()

Quite a few films are still missing the values for budget. We do not want to drop this column because we believe it is an important variable, but we must have a value for each observation in order to use it.

**Talk with a partner about different ways you can fill in the missing budget values.**

In [None]:
#you can fill the missing values with the average value of the observations
df['budget'].fillna(df['budget'].mean(), inplace=False)

Another way to fill the missing data is to use the average budget of the films that share the same rating.

In [None]:
df.groupby('rating')['budget'].mean().plot(kind='bar')

In [None]:
budget_ratings = df.groupby('rating')['budget'].mean().round(1).to_dict()
budget_ratings

In [None]:
# assign the result back; calling fillna with inplace=True on a selected column may not update df
df['budget'] = df['budget'].fillna(df['rating'].map(budget_ratings))
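
The same group-wise fill can be written without the intermediate dictionary. The sketch below is an alternative, not the approach used in the cells above: it uses `groupby(...).transform('mean')` on a small made-up frame.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
films = pd.DataFrame({
    'rating': ['PG', 'PG', 'R', 'R', 'R'],
    'budget': [10.0, np.nan, 30.0, 50.0, np.nan],
})

# transform returns one value per row: the mean budget of that row's rating
group_mean = films.groupby('rating')['budget'].transform('mean')

# Fill each missing budget with the mean of its own rating group
films['budget'] = films['budget'].fillna(group_mean)
print(films)
```

Here the missing PG budget becomes 10.0 and the missing R budget becomes 40.0, the mean of 30.0 and 50.0.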


In [None]:
sns.heatmap(df.isnull(), cbar=False)

What statistical test could we use to support our use of this method?

### Handling Categorical Data

https://towardsdatascience.com/the-dummys-guide-to-creating-dummy-variables-f21faddb1d40

In [None]:
df['rating'].value_counts()

In [None]:
df['rating'].head(10)

In [None]:
pd.get_dummies(df['rating']).head(10)

In [None]:
df = pd.concat([df, pd.get_dummies(df['rating'])], axis=1)
df.head(10)
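
When dummy columns feed a regression that includes an intercept, one category is usually dropped so the dummies are not perfectly collinear with the intercept. A minimal sketch on made-up data, using the `drop_first` and `prefix` options of `pd.get_dummies`:

```python
import pandas as pd

# Hypothetical data, for illustration only
demo = pd.DataFrame({'rating': ['PG', 'R', 'unrated', 'R']})

# drop_first=True removes the first category ('PG'), which becomes the baseline
dummies = pd.get_dummies(demo['rating'], prefix='rating',
                         drop_first=True, dtype=int)

demo = pd.concat([demo, dummies], axis=1)
print(demo)
```

A row with 0 in every dummy column belongs to the dropped baseline category, and each dummy coefficient is then read relative to that baseline.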

## Removing Outliers

https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba

In [None]:
df.boxplot(['gross'])

In [None]:
df.sort_values('gross', ascending=False)

In [None]:
# Calculate the gross amount that is 3 standard deviations above the mean
above_3std = df.gross.mean()+(3*df.gross.std())

In [None]:
df[df['gross']>above_3std]
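
Once the cutoff is known, the same boolean comparison can be flipped to keep only the rows at or below it. This is a sketch on synthetic numbers, with one extreme value planted on purpose; whether to actually remove such films is a modeling decision, not something the cutoff decides for you.

```python
import numpy as np
import pandas as pd

# Synthetic data with one planted extreme value, for illustration only
rng = np.random.default_rng(1)
gross_demo = pd.Series(rng.normal(50, 10, size=1000))
gross_demo.iloc[0] = 500

# Same rule as above: mean plus three standard deviations
cutoff = gross_demo.mean() + 3 * gross_demo.std()

# Keep everything at or below the cutoff
trimmed = gross_demo[gross_demo <= cutoff]
print(len(gross_demo), len(trimmed))
```

Note that the mean and standard deviation are themselves pulled upward by the extreme values, so this rule is more lenient than it first appears.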

## Creating New columns based on other columns

In [None]:
df['actor_1_facebook_likes'].describe()

In [None]:
# Create a new column called 'superstar' where the value is 1
# if actor_1_facebook_likes is at least 12000 and 0 if not
df['superstar'] = np.where(df['actor_1_facebook_likes']>=12000, 1, 0)

df[['movie_title', 'actor_1_name','actor_1_facebook_likes', 'superstar']].head(10)
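
A short sketch of how `np.where` behaves on a made-up Series, including a missing value. The comparison evaluates to False for NaN, so rows with no like count end up flagged as 0.

```python
import numpy as np
import pandas as pd

# Hypothetical like counts, for illustration only
likes = pd.Series([1000.0, 12000.0, 40000.0, np.nan])

# np.where(condition, value_if_true, value_if_false)
flag_where = np.where(likes >= 12000, 1, 0)
print(flag_where)   # [0 1 1 0]

# Equivalent for a 0/1 flag: cast the boolean mask to int
flag_astype = (likes >= 12000).astype(int)
print(flag_astype.tolist())   # [0, 1, 1, 0]
```

If a missing like count should stay missing rather than become 0, handle the NaN rows before building the flag.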

**Create your own new column of data using the method above.**

In [None]:
#your code here

Another data cleaning Resource:

https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3