## Day 27 Lecture 1 Assignment

In this assignment, we will learn about statistical significance in linear models. We will use the Google Play Store dataset loaded below and analyze a regression fit to it.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
import statsmodels.api as sm



In [2]:
reviews = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/googleplaystore.csv')

In [3]:
reviews.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


We will predict app ratings using other features describing the app. To use these features, we must clean the data.

To simplify, we will remove the App, Category, Size, Installs, Genres, Last Updated, Current Ver, and Android Ver columns.

In [4]:
# answer below:
df = reviews.drop(columns=['App', 'Category', 'Size', 'Installs', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'])
print(df.shape)
df.head(2)

(10841, 5)


Unnamed: 0,Rating,Reviews,Type,Price,Content Rating
0,4.1,159,Free,0,Everyone
1,3.9,967,Free,0,Everyone


Check for missing values and remove all rows that contain them.

In [5]:
# answer below:
df.isnull().sum()

Rating            1474
Reviews              0
Type                 1
Price                0
Content Rating       1
dtype: int64

In [6]:
df.dropna(inplace=True)
df.isnull().sum()

Rating            0
Reviews           0
Type              0
Price             0
Content Rating    0
dtype: int64

In [7]:
df.shape

(9366, 5)

Remove outliers from the Type and Content Rating columns (very rare values that won't train well).

In [8]:
# answer below:
print(df.nunique())
print(df.Type.unique())
print(df['Content Rating'].unique())

Rating              39
Reviews           5992
Type                 2
Price               73
Content Rating       6
dtype: int64
['Free' 'Paid']
['Everyone' 'Teen' 'Everyone 10+' 'Mature 17+' 'Adults only 18+' 'Unrated']


In [9]:
df.Type.value_counts()

Free    8719
Paid     647
Name: Type, dtype: int64

In [10]:
df['Content Rating'].value_counts()

Everyone           7420
Teen               1084
Mature 17+          461
Everyone 10+        397
Adults only 18+       3
Unrated               1
Name: Content Rating, dtype: int64

In [11]:
# Drop the two content ratings that appear only a handful of times.
df = df.loc[~df['Content Rating'].isin(['Adults only 18+', 'Unrated'])]

In [12]:
df['Content Rating'].value_counts()

Everyone        7420
Teen            1084
Mature 17+       461
Everyone 10+     397
Name: Content Rating, dtype: int64
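The rare levels above were removed by naming them explicitly. As a minimal sketch of a more general approach, the same filter can be driven by a count threshold. The toy data and the `min_count` cutoff below are assumptions made for illustration, not values from the assignment.

```python
import pandas as pd

# Toy stand-in for the Content Rating column (counts are illustrative only).
toy = pd.DataFrame({
    "Content Rating": ["Everyone"] * 6 + ["Teen"] * 3 + ["Unrated"],
})

# Hypothetical cutoff: drop categories seen fewer than `min_count` times.
min_count = 2
counts = toy["Content Rating"].value_counts()
rare_levels = counts[counts < min_count].index

# Keep only rows whose category is not in the rare set.
filtered = toy.loc[~toy["Content Rating"].isin(rare_levels)]
print(filtered["Content Rating"].value_counts())
```

A threshold keeps the filter valid if the data is refreshed and a different rare level appears, at the cost of having to justify the cutoff.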

In [13]:
df

Unnamed: 0,Rating,Reviews,Type,Price,Content Rating
0,4.1,159,Free,0,Everyone
1,3.9,967,Free,0,Everyone
2,4.7,87510,Free,0,Everyone
3,4.5,215644,Free,0,Teen
4,4.3,967,Free,0,Everyone
...,...,...,...,...,...
10834,4.0,7,Free,0,Everyone
10836,4.5,38,Free,0,Everyone
10837,5.0,4,Free,0,Everyone
10839,4.5,114,Free,0,Mature 17+


Convert the Type and Content Rating columns to a numeric format, whether by one-hot encoding, ordinal encoding, or similar.

In [14]:
content_rating_dummies = pd.get_dummies(df['Content Rating'], drop_first=True)
type_dummies = pd.get_dummies(df.Type, drop_first=True)

In [15]:
df = pd.concat([df, content_rating_dummies], axis=1)

In [16]:
df = pd.concat([df, type_dummies], axis=1)
df

Unnamed: 0,Rating,Reviews,Type,Price,Content Rating,Everyone 10+,Mature 17+,Teen,Paid
0,4.1,159,Free,0,Everyone,0,0,0,0
1,3.9,967,Free,0,Everyone,0,0,0,0
2,4.7,87510,Free,0,Everyone,0,0,0,0
3,4.5,215644,Free,0,Teen,0,0,1,0
4,4.3,967,Free,0,Everyone,0,0,0,0
...,...,...,...,...,...,...,...,...,...
10834,4.0,7,Free,0,Everyone,0,0,0,0
10836,4.5,38,Free,0,Everyone,0,0,0,0
10837,5.0,4,Free,0,Everyone,0,0,0,0
10839,4.5,114,Free,0,Mature 17+,0,1,0,0


In [17]:
df = df.drop(columns=['Type', 'Content Rating'])

In [18]:
df

Unnamed: 0,Rating,Reviews,Price,Everyone 10+,Mature 17+,Teen,Paid
0,4.1,159,0,0,0,0,0
1,3.9,967,0,0,0,0,0
2,4.7,87510,0,0,0,0,0
3,4.5,215644,0,0,0,1,0
4,4.3,967,0,0,0,0,0
...,...,...,...,...,...,...,...
10834,4.0,7,0,0,0,0,0
10836,4.5,38,0,0,0,0,0
10837,5.0,4,0,0,0,0,0
10839,4.5,114,0,0,1,0,0
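The encode, concatenate, and drop steps above can also be collapsed into one call. The following is a sketch on made-up rows, assuming it is acceptable for the dummy columns to carry a prefix (for example `Type_Paid` rather than `Paid`).

```python
import pandas as pd

# Made-up rows with the same categorical columns as the notebook.
toy = pd.DataFrame({
    "Rating": [4.1, 3.9, 4.5, 4.0],
    "Type": ["Free", "Paid", "Free", "Paid"],
    "Content Rating": ["Everyone", "Teen", "Mature 17+", "Everyone"],
})

# `columns=` encodes the listed columns and removes the originals;
# `drop_first=True` drops one level per column to avoid collinearity
# with the intercept; `dtype=int` gives 0/1 instead of booleans.
encoded = pd.get_dummies(
    toy, columns=["Type", "Content Rating"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
```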


Finally, check that all the columns are of numeric type and change the type of columns that are not numeric. If coercing to numeric causes missing values, remove those rows containing missing values from our dataset.

In [19]:
# answer below
df.dtypes

Rating          float64
Reviews          object
Price            object
Everyone 10+      uint8
Mature 17+        uint8
Teen              uint8
Paid              uint8
dtype: object

In [20]:
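# Strip dollar signs and commas, then convert to float.
# A raw string avoids Python's invalid-escape warning for '\$'.
df.Price = df.Price.replace(r'[\$,]', '', regex=True).astype(float)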

In [21]:
df['Reviews'] = df['Reviews'].astype(int)

In [22]:
df.head()

Unnamed: 0,Rating,Reviews,Price,Everyone 10+,Mature 17+,Teen,Paid
0,4.1,159,0.0,0,0,0,0
1,3.9,967,0.0,0,0,0,0
2,4.7,87510,0.0,0,0,0,0
3,4.5,215644,0.0,0,0,1,0
4,4.3,967,0.0,0,0,0,0


In [23]:
df.dtypes

Rating          float64
Reviews           int64
Price           float64
Everyone 10+      uint8
Mature 17+        uint8
Teen              uint8
Paid              uint8
dtype: object
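The cells above convert with `astype`, which raises an error if any value cannot be parsed. The prompt also mentions coercion that may introduce missing values; a minimal sketch of that path uses `pd.to_numeric(..., errors="coerce")` followed by `dropna`. The malformed values in this toy frame are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Reviews": ["159", "967", "3.0M"],     # last value is deliberately malformed
    "Price": ["0", "$4.99", "Everyone"],   # last value is deliberately malformed
})

# Remove the currency symbol, then coerce: unparseable strings become NaN.
toy["Price"] = toy["Price"].str.replace("$", "", regex=False)
for col in ["Reviews", "Price"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.isnull().sum())

# Drop the rows that failed to parse and restore an integer dtype for Reviews.
clean = toy.dropna().astype({"Reviews": int})
print(clean.dtypes)
```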

Perform a train test split with 20% of the data in the test sample.

In [24]:
# answer below:
x = df.drop(columns='Rating')
y = df.Rating
# Hold out 20% of the rows; random_state makes the split reproducible.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

Now generate a linear model using statsmodels and produce a p value for each coefficient in the model. Analyze the results. (Look at the results table and at a homoscedasticity plot.)

In [25]:
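# answer below:
# Add an intercept column; statsmodels does not include one by default.
x = sm.add_constant(x)

# Note: this fits on the full dataset (all 9362 rows), not on x_train.
results = sm.OLS(y, x).fit()

results.summary()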

0,1,2,3
Dep. Variable:,Rating,R-squared:,0.009
Model:,OLS,Adj. R-squared:,0.009
Method:,Least Squares,F-statistic:,14.67
Date:,"Tue, 20 Oct 2020",Prob (F-statistic):,9.47e-17
Time:,20:04:07,Log-Likelihood:,-7032.4
No. Observations:,9362,AIC:,14080.0
Df Residuals:,9355,BIC:,14130.0
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,4.1759,0.006,673.905,0.000,4.164,4.188
Reviews,1.09e-08,1.7e-09,6.420,0.000,7.57e-09,1.42e-08
Price,-0.0010,0.000,-3.051,0.002,-0.002,-0.000
Everyone 10+,0.0545,0.027,2.056,0.040,0.003,0.107
Mature 17+,-0.0606,0.025,-2.459,0.014,-0.109,-0.012
Teen,0.0426,0.017,2.548,0.011,0.010,0.075
Paid,0.1010,0.021,4.700,0.000,0.059,0.143

0,1,2,3
Omnibus:,3667.865,Durbin-Watson:,1.773
Prob(Omnibus):,0.0,Jarque-Bera (JB):,18470.173
Skew:,-1.841,Prob(JB):,0.0
Kurtosis:,8.813,Cond. No.,16200000.0


Scale your predictors and refit the linear model.

* How does this change the coefficients?
* How does this change the coefficients' p values?
* How does this change model performance?

In [26]:
# answer below:
from sklearn.preprocessing import StandardScaler

In [27]:
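# Note: this standardizes every column, including the target (Rating)
# and the 0/1 dummy variables, not just the continuous predictors.
scaled = StandardScaler().fit_transform(df)
scaled_df = pd.DataFrame(scaled, columns=df.columns)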

In [28]:
x_scaled = scaled_df.drop(columns='Rating')
y_scaled = scaled_df.Rating
x_train, x_test, y_train, y_test = train_test_split(x_scaled, y_scaled, test_size=0.2)

In [30]:
x_scaled = sm.add_constant(x_scaled)

results_scaled = sm.OLS(y_scaled, x_scaled).fit()

results_scaled.summary()

0,1,2,3
Dep. Variable:,Rating,R-squared:,0.009
Model:,OLS,Adj. R-squared:,0.009
Method:,Least Squares,F-statistic:,14.67
Date:,"Tue, 20 Oct 2020",Prob (F-statistic):,9.47e-17
Time:,20:04:32,Log-Likelihood:,-13240.0
No. Observations:,9362,AIC:,26490.0
Df Residuals:,9355,BIC:,26540.0
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,3.224e-16,0.010,3.13e-14,1.000,-0.020,0.020
Reviews,0.0665,0.010,6.420,0.000,0.046,0.087
Price,-0.0322,0.011,-3.051,0.002,-0.053,-0.012
Everyone 10+,0.0213,0.010,2.056,0.040,0.001,0.042
Mature 17+,-0.0254,0.010,-2.459,0.014,-0.046,-0.005
Teen,0.0265,0.010,2.548,0.011,0.006,0.047
Paid,0.0497,0.011,4.700,0.000,0.029,0.070

0,1,2,3
Omnibus:,3667.865,Durbin-Watson:,1.773
Prob(Omnibus):,0.0,Jarque-Bera (JB):,18470.173
Skew:,-1.841,Prob(JB):,0.0
Kurtosis:,8.813,Cond. No.,1.27


In [31]:
results.summary()

0,1,2,3
Dep. Variable:,Rating,R-squared:,0.009
Model:,OLS,Adj. R-squared:,0.009
Method:,Least Squares,F-statistic:,14.67
Date:,"Tue, 20 Oct 2020",Prob (F-statistic):,9.47e-17
Time:,20:04:44,Log-Likelihood:,-7032.4
No. Observations:,9362,AIC:,14080.0
Df Residuals:,9355,BIC:,14130.0
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,4.1759,0.006,673.905,0.000,4.164,4.188
Reviews,1.09e-08,1.7e-09,6.420,0.000,7.57e-09,1.42e-08
Price,-0.0010,0.000,-3.051,0.002,-0.002,-0.000
Everyone 10+,0.0545,0.027,2.056,0.040,0.003,0.107
Mature 17+,-0.0606,0.025,-2.459,0.014,-0.109,-0.012
Teen,0.0426,0.017,2.548,0.011,0.010,0.075
Paid,0.1010,0.021,4.700,0.000,0.059,0.143

0,1,2,3
Omnibus:,3667.865,Durbin-Watson:,1.773
Prob(Omnibus):,0.0,Jarque-Bera (JB):,18470.173
Skew:,-1.841,Prob(JB):,0.0
Kurtosis:,8.813,Cond. No.,16200000.0
