## Day 27 Lecture 1 Assignment

In this assignment, we will explore statistical significance in linear models. We will use the Google Play Store dataset loaded below and analyze a regression fitted to it.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import bartlett
from scipy.stats import levene
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from statsmodels.tools.tools import add_constant
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
import pylab
from scipy.stats import jarque_bera
from scipy.stats import normaltest



In [None]:
reviews = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/googleplaystore.csv')

In [None]:
reviews.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


We will predict app ratings using other features describing the app. To use these features, we must clean the data.

To simplify, we will remove the `App`, `Category`, `Size`, `Installs`, `Genres`, `Last Updated`, `Current Ver`, and `Android Ver` columns.

In [None]:
# answer below:
reviews.drop(columns=['App', 'Category', 'Size', 'Installs', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'], inplace=True)

In [None]:
reviews.head()

Unnamed: 0,Rating,Reviews,Type,Price,Content Rating
0,4.1,159,Free,0,Everyone
1,3.9,967,Free,0,Everyone
2,4.7,87510,Free,0,Everyone
3,4.5,215644,Free,0,Teen
4,4.3,967,Free,0,Everyone


Check for missing values and remove all rows that contain them.

In [None]:
# answer below:
reviews.dropna(inplace=True)

In [None]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9366 entries, 0 to 10840
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Rating          9366 non-null   float64
 1   Reviews         9366 non-null   object 
 2   Type            9366 non-null   object 
 3   Price           9366 non-null   object 
 4   Content Rating  9366 non-null   object 
dtypes: float64(1), object(4)
memory usage: 439.0+ KB


Remove outliers from the Type and Content Rating columns (very rare values that won't train well).

In [None]:
# answer below:
from scipy.stats.mstats import winsorize

#reviews['Type'] = winsorize(reviews['Type'], (0, 0.10))

In [None]:
#reviews['Content Rating'] = winsorize(reviews['Content Rating'], (0, 0.10))

In [None]:
#reviews.info()
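
Winsorizing applies to numeric data, so it does not carry over to string categories such as `Type` and `Content Rating`. One possible approach, sketched here on a small made-up frame, is to count each category and keep only the rows whose category appears at least a chosen number of times. The helper name `drop_rare_categories` and the `min_count` threshold are assumptions for illustration, not part of the original assignment.

```python
import pandas as pd


def drop_rare_categories(df, column, min_count):
    """Return only the rows whose value in `column` occurs at least `min_count` times."""
    counts = df[column].value_counts()
    frequent = counts[counts >= min_count].index
    return df[df[column].isin(frequent)].copy()


# Toy data standing in for the real `reviews` frame (values are made up).
toy = pd.DataFrame({
    "Rating": [4.1, 3.9, 4.7, 4.5, 4.3, 4.0],
    "Content Rating": ["Everyone", "Everyone", "Teen", "Teen", "Everyone", "Unrated"],
})

# "Unrated" appears only once, so it falls below the threshold and is removed.
filtered = drop_rare_categories(toy, "Content Rating", min_count=2)
print(filtered["Content Rating"].value_counts())
```

On the real data, a suitable threshold would depend on how many rows each category has, which `reviews['Content Rating'].value_counts()` would show.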

Convert the Type and Content Rating columns to a numeric format, whether by one-hot encoding, ordinal encoding, or similar.

In [None]:
reviews.head()

Unnamed: 0,Rating,Reviews,Type,Price,Content Rating
0,4.1,159,Free,0,Everyone
1,3.9,967,Free,0,Everyone
2,4.7,87510,Free,0,Everyone
3,4.5,215644,Free,0,Teen
4,4.3,967,Free,0,Everyone


In [None]:
reviews = pd.get_dummies(reviews, columns=["Type"])

In [None]:
reviews.head()

Unnamed: 0,Rating,Reviews,Price,Content Rating,Type_Free,Type_Paid
0,4.1,159,0,Everyone,1,0
1,3.9,967,0,Everyone,1,0
2,4.7,87510,0,Everyone,1,0
3,4.5,215644,0,Teen,1,0
4,4.3,967,0,Everyone,1,0


In [None]:
reviews = pd.get_dummies(reviews, columns=["Content Rating"])

In [None]:
# Strip the literal dollar sign (regex=False so '$' is not read as a regex anchor),
# then convert the remaining text to float.
reviews['Price'] = reviews['Price'].str.replace('$', '', regex=False).astype(float)

In [None]:
reviews.head()

Unnamed: 0,Rating,Reviews,Price,Type_Free,Type_Paid,Content Rating_Adults only 18+,Content Rating_Everyone,Content Rating_Everyone 10+,Content Rating_Mature 17+,Content Rating_Teen,Content Rating_Unrated
0,4.1,159,0.0,1,0,0,1,0,0,0,0
1,3.9,967,0.0,1,0,0,1,0,0,0,0
2,4.7,87510,0.0,1,0,0,1,0,0,0,0
3,4.5,215644,0.0,1,0,0,0,0,0,1,0
4,4.3,967,0.0,1,0,0,1,0,0,0,0
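
Keeping every dummy column alongside an intercept makes the design matrix linearly dependent, because the dummies for one variable always sum to 1 (here `Type_Free + Type_Paid = 1`). A common remedy is to drop one level per variable with `drop_first=True`, so that the dropped level becomes the baseline. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy data standing in for the real `reviews` frame (values are made up).
toy = pd.DataFrame({
    "Rating": [4.1, 3.9, 4.7, 4.5],
    "Type": ["Free", "Free", "Paid", "Free"],
    "Content Rating": ["Everyone", "Teen", "Everyone", "Teen"],
})

# drop_first=True removes the first level of each encoded variable,
# leaving k - 1 dummy columns for a variable with k levels.
encoded = pd.get_dummies(toy, columns=["Type", "Content Rating"], drop_first=True)
print(encoded.columns.tolist())
```

With this encoding, each remaining dummy coefficient reads as a difference from the dropped baseline level.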


Finally, check that all the columns are of numeric type and change the type of columns that are not numeric. If coercing to numeric causes missing values, remove those rows containing missing values from our dataset.

In [None]:
# answer below:
reviews.describe()

Unnamed: 0,Rating,Price,Type_Free,Type_Paid,Content Rating_Adults only 18+,Content Rating_Everyone,Content Rating_Everyone 10+,Content Rating_Mature 17+,Content Rating_Teen,Content Rating_Unrated
count,9366.0,9366.0,9366.0,9366.0,9366.0,9366.0,9366.0,9366.0,9366.0,9366.0
mean,4.191757,0.960928,0.93092,0.06908,0.00032,0.792227,0.042387,0.049221,0.115738,0.000107
std,0.515219,15.816585,0.253603,0.253603,0.017895,0.405735,0.201482,0.21634,0.319927,0.010333
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
50%,4.3,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
75%,4.5,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
max,5.0,400.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
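
`describe()` only summarizes numeric columns, which is why `Reviews` is absent from the table above: it is still stored as `object`. A hedged sketch of the coercion step the instructions describe, using `pd.to_numeric` with `errors='coerce'` on made-up values (the malformed entry `'3.0M'` is an assumption, included to show how unparseable values become `NaN`):

```python
import pandas as pd

# Toy data standing in for the real `reviews` frame (values are made up).
toy = pd.DataFrame({
    "Rating": [4.1, 3.9, 4.7],
    "Reviews": ["159", "967", "3.0M"],
})

# Values that cannot be parsed as numbers are replaced by NaN ...
toy["Reviews"] = pd.to_numeric(toy["Reviews"], errors="coerce")
# ... and the rows holding them are then dropped.
toy = toy.dropna(subset=["Reviews"])
print(toy.dtypes)
```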


Perform a train-test split with 20% of the data in the test sample.

In [None]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9366 entries, 0 to 10840
Data columns (total 11 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Rating                          9366 non-null   float64
 1   Reviews                         9366 non-null   object 
 2   Price                           9366 non-null   float64
 3   Type_Free                       9366 non-null   uint8  
 4   Type_Paid                       9366 non-null   uint8  
 5   Content Rating_Adults only 18+  9366 non-null   uint8  
 6   Content Rating_Everyone         9366 non-null   uint8  
 7   Content Rating_Everyone 10+     9366 non-null   uint8  
 8   Content Rating_Mature 17+       9366 non-null   uint8  
 9   Content Rating_Teen             9366 non-null   uint8  
 10  Content Rating_Unrated          9366 non-null   uint8  
dtypes: float64(2), object(1), uint8(8)
memory usage: 365.9+ KB


In [None]:
# answer below:
# train_test_split was already imported at the top of the notebook.
X = reviews.drop(columns='Rating')  # predictors
y = reviews['Rating']               # target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Now generate a linear model using statsmodels and produce a p value for each coefficient in the model. Analyze the results. (Look at the results table and at a homoscedasticity plot.)

In [None]:
# answer below:
X = sm.add_constant(X)

results = sm.OLS(y, X.astype(float)).fit()

results.summary()

0,1,2,3
Dep. Variable:,Rating,R-squared:,0.009
Model:,OLS,Adj. R-squared:,0.008
Method:,Least Squares,F-statistic:,11.02
Date:,"Tue, 20 Oct 2020",Prob (F-statistic):,1.3e-15
Time:,16:16:02,Log-Likelihood:,-7034.2
No. Observations:,9366,AIC:,14090.0
Df Residuals:,9357,BIC:,14150.0
Df Model:,8,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.5443,0.060,42.642,0.000,2.427,2.661
Reviews,1.09e-08,1.7e-09,6.420,0.000,7.57e-09,1.42e-08
Price,-0.0010,0.000,-3.051,0.002,-0.002,-0.000
Type_Free,1.2216,0.031,39.809,0.000,1.161,1.282
Type_Paid,1.3226,0.033,40.450,0.000,1.259,1.387
Content Rating_Adults only 18+,0.5338,0.263,2.027,0.043,0.018,1.050
Content Rating_Everyone,0.4100,0.089,4.597,0.000,0.235,0.585
Content Rating_Everyone 10+,0.4645,0.092,5.069,0.000,0.285,0.644
Content Rating_Mature 17+,0.3494,0.091,3.828,0.000,0.170,0.528

0,1,2,3
Omnibus:,3669.654,Durbin-Watson:,1.773
Prob(Omnibus):,0.0,Jarque-Bera (JB):,18483.538
Skew:,-1.841,Prob(JB):,0.0
Kurtosis:,8.814,Cond. No.,4.09e+22


Scale your predictors and refit the linear model.

* How does this change the coefficients?
* How does this change the coefficients' p values?
* How does this change model performance?

In [None]:
# answer below:
from sklearn.preprocessing import StandardScaler

In [None]:
ss = StandardScaler()

In [None]:
scaled = ss.fit_transform(X.values)

In [None]:
X = sm.add_constant(scaled)

results = sm.OLS(y, X.astype(float)).fit()

results.summary()

0,1,2,3
Dep. Variable:,Rating,R-squared:,0.009
Model:,OLS,Adj. R-squared:,0.008
Method:,Least Squares,F-statistic:,11.02
Date:,"Tue, 20 Oct 2020",Prob (F-statistic):,1.3e-15
Time:,16:20:17,Log-Likelihood:,-7034.2
No. Observations:,9366,AIC:,14090.0
Df Residuals:,9357,BIC:,14150.0
Df Model:,8,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,4.1918,0.005,790.738,0.000,4.181,4.202
x1,2.026e-18,1.94e-18,1.044,0.296,-1.78e-18,5.83e-18
x2,0.0343,0.005,6.420,0.000,0.024,0.045
x3,-0.0166,0.005,-3.051,0.002,-0.027,-0.006
x4,-0.0128,0.003,-4.700,0.000,-0.018,-0.007
x5,0.0128,0.003,4.700,0.000,0.007,0.018
x6,0.0020,0.005,0.382,0.702,-0.008,0.012
x7,-0.0043,0.003,-1.511,0.131,-0.010,0.001
x8,0.0088,0.005,1.853,0.064,-0.001,0.018

0,1,2,3
Omnibus:,3669.654,Durbin-Watson:,1.773
Prob(Omnibus):,0.0,Jarque-Bera (JB):,18483.538
Skew:,-1.841,Prob(JB):,0.0
Kurtosis:,8.814,Cond. No.,2.35e+16
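
Comparing the two summaries, the t statistics for `Reviews` (6.420) and `Price` (-3.051) reappear unchanged for `x2` and `x3`, which appear to be the same predictors after scaling, and R-squared stays at 0.009. This matches the general result that standardizing a predictor rescales its coefficient by its standard deviation but leaves its t statistic, p value, and the overall fit unchanged. A small numeric check with NumPy on synthetic data (the helper `ols_t_stats` is an illustrative assumption, not a statsmodels function):

```python
import numpy as np


def ols_t_stats(X, y):
    """Fit OLS with an intercept; return (coefficients, t statistics, R-squared)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    dof = design.shape[0] - design.shape[1]
    sigma2 = resid @ resid / dof  # unbiased estimate of the error variance
    cov = sigma2 * np.linalg.inv(design.T @ design)
    t_stats = coef / np.sqrt(np.diag(cov))
    r_squared = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return coef, t_stats, r_squared


# Synthetic predictors on very different scales.
rng = np.random.default_rng(42)
X_raw = rng.normal(loc=[100.0, 5.0], scale=[50.0, 2.0], size=(500, 2))
y_sim = 1.0 + 0.01 * X_raw[:, 0] - 0.3 * X_raw[:, 1] + rng.normal(size=500)

# Standardize each predictor to mean 0 and standard deviation 1.
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

coef_raw, t_raw, r2_raw = ols_t_stats(X_raw, y_sim)
coef_scaled, t_scaled, r2_scaled = ols_t_stats(X_scaled, y_sim)

print("slope coefficients (raw):   ", coef_raw[1:])
print("slope coefficients (scaled):", coef_scaled[1:])
print("slope t statistics equal:", np.allclose(t_raw[1:], t_scaled[1:]))
print("R-squared equal:", np.isclose(r2_raw, r2_scaled))
```

The intercept is the exception: after centering, it equals the mean of the response, so its estimate and t statistic do change, as the `const` row shows (4.1918 matches the mean rating reported by `describe()`).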
