## Day 27 Lecture 1 Assignment

In this assignment, we will learn about statistical significance in linear models. We will use the Google Play Store dataset loaded below and analyze a regression fitted to it.

In [0]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [0]:
reviews = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/googleplaystore.csv')

In [0]:
reviews.head()

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |


We will predict app ratings using other features describing the app. To use these features, we must clean the data.

Start by creating dummy variables from the Type and Content Rating columns.

In [0]:
reviews.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
       'Android Ver'],
      dtype='object')

In [0]:
# Create dummy variables out of the 'Type' and 'Content Rating' columns using pandas.get_dummies

reviews = pd.get_dummies(reviews, columns=['Type', 'Content Rating'], drop_first=True)
reviews.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Price',
       'Genres', 'Last Updated', 'Current Ver', 'Android Ver', 'Type_Free',
       'Type_Paid', 'Content Rating_Everyone', 'Content Rating_Everyone 10+',
       'Content Rating_Mature 17+', 'Content Rating_Teen',
       'Content Rating_Unrated'],
      dtype='object')
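
As a small illustration of what `drop_first=True` does, the sketch below uses a made-up three-row frame (the name `toy` and its values are hypothetical, not taken from the dataset). pandas orders the categories and drops the first one, which then acts as the baseline level.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Play Store dataset
toy = pd.DataFrame({'Type': ['Free', 'Paid', 'Free'],
                    'Rating': [4.1, 3.9, 4.7]})

# drop_first=True drops the first category in sorted order ('Free'),
# leaving one indicator column for the remaining category
toy_dummies = pd.get_dummies(toy, columns=['Type'], drop_first=True)
print(toy_dummies.columns.tolist())
```

In the real data both `Type_Free` and `Type_Paid` survive, which suggests the Type column holds at least one other value that sorted ahead of `'Free'` and was dropped instead.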

Next, check for missing values and remove all rows that contain them.

In [0]:
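# Note: isna().mean() * 100 alone is the percentage missing per column;
# the extra division by the row count scales the values below down
reviews.isna().mean()*100/reviews.isna().count()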

App                            0.000000
Category                       0.000000
Rating                         0.001254
Reviews                        0.000000
Size                           0.000000
Installs                       0.000000
Price                          0.000000
Genres                         0.000000
Last Updated                   0.000000
Current Ver                    0.000007
Android Ver                    0.000003
Type_Free                      0.000000
Type_Paid                      0.000000
Content Rating_Everyone        0.000000
Content Rating_Everyone 10+    0.000000
Content Rating_Mature 17+      0.000000
Content Rating_Teen            0.000000
Content Rating_Unrated         0.000000
dtype: float64
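
For reference, the percentage of missing values per column is usually computed as `isna().mean() * 100`; dividing by the row count again, as in the cell above, shrinks every value by that factor. A minimal sketch on a hypothetical toy frame (the names `toy`, `pct_missing` and `toy_clean` are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in 'a' out of four rows
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0],
                    'b': ['x', 'y', 'z', 'w']})

# isna() gives booleans; their mean is the fraction missing per column
pct_missing = toy.isna().mean() * 100
print(pct_missing)  # 'a' is 25.0, 'b' is 0.0

# dropna() removes every row that has at least one missing value
toy_clean = toy.dropna()
print(len(toy_clean))  # 3 rows remain
```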

In [0]:
# Drop every row with a missing value. The figures above are divided by the row
# count a second time, so they understate the true share of missing values.
reviews = reviews.dropna()
reviews.isna().mean()*100/reviews.isna().count()

App                            0.0
Category                       0.0
Rating                         0.0
Reviews                        0.0
Size                           0.0
Installs                       0.0
Price                          0.0
Genres                         0.0
Last Updated                   0.0
Current Ver                    0.0
Android Ver                    0.0
Type_Free                      0.0
Type_Paid                      0.0
Content Rating_Everyone        0.0
Content Rating_Everyone 10+    0.0
Content Rating_Mature 17+      0.0
Content Rating_Teen            0.0
Content Rating_Unrated         0.0
dtype: float64

To simplify, we will remove the App, Category, Size, Installs, Genres, Last Updated, Current Ver, and Android Ver columns.

In [0]:
reviews = reviews.drop(columns=['App', 'Category', 'Size', 'Installs', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'])
reviews.head()

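| | Rating | Reviews | Price | Type_Free | Type_Paid | Content Rating_Everyone | Content Rating_Everyone 10+ | Content Rating_Mature 17+ | Content Rating_Teen | Content Rating_Unrated |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.1 | 159 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 3.9 | 967 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 4.7 | 87510 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 4.5 | 215644 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 4.3 | 967 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |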


Next, check that all the columns are of numeric type and change the type of columns that are not numeric. If coercing to numeric causes missing values, remove those rows containing missing values from our dataset.

In [0]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9360 entries, 0 to 10840
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Rating                       9360 non-null   float64
 1   Reviews                      9360 non-null   object 
 2   Price                        9360 non-null   object 
 3   Type_Free                    9360 non-null   uint8  
 4   Type_Paid                    9360 non-null   uint8  
 5   Content Rating_Everyone      9360 non-null   uint8  
 6   Content Rating_Everyone 10+  9360 non-null   uint8  
 7   Content Rating_Mature 17+    9360 non-null   uint8  
 8   Content Rating_Teen          9360 non-null   uint8  
 9   Content Rating_Unrated       9360 non-null   uint8  
dtypes: float64(1), object(2), uint8(7)
memory usage: 356.5+ KB


In [0]:
reviews['Reviews'] = pd.to_numeric(reviews['Reviews'])
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9360 entries, 0 to 10840
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Rating                       9360 non-null   float64
 1   Reviews                      9360 non-null   int64  
 2   Price                        9360 non-null   object 
 3   Type_Free                    9360 non-null   uint8  
 4   Type_Paid                    9360 non-null   uint8  
 5   Content Rating_Everyone      9360 non-null   uint8  
 6   Content Rating_Everyone 10+  9360 non-null   uint8  
 7   Content Rating_Mature 17+    9360 non-null   uint8  
 8   Content Rating_Teen          9360 non-null   uint8  
 9   Content Rating_Unrated       9360 non-null   uint8  
dtypes: float64(1), int64(1), object(1), uint8(7)
memory usage: 356.5+ KB


In [0]:
reviews['Price'].unique()

array(['0', '$4.99', '$3.99', '$6.99', '$7.99', '$5.99', '$2.99', '$3.49',
       '$1.99', '$9.99', '$7.49', '$0.99', '$9.00', '$5.49', '$10.00',
       '$24.99', '$11.99', '$79.99', '$16.99', '$14.99', '$29.99',
       '$12.99', '$2.49', '$10.99', '$1.50', '$19.99', '$15.99', '$33.99',
       '$39.99', '$3.95', '$4.49', '$1.70', '$8.99', '$1.49', '$3.88',
       '$399.99', '$17.99', '$400.00', '$3.02', '$1.76', '$4.84', '$4.77',
       '$1.61', '$2.50', '$1.59', '$6.49', '$1.29', '$299.99', '$379.99',
       '$37.99', '$18.99', '$389.99', '$8.49', '$1.75', '$14.00', '$2.00',
       '$3.08', '$2.59', '$19.40', '$3.90', '$4.59', '$15.46', '$3.04',
       '$13.99', '$4.29', '$3.28', '$4.60', '$1.00', '$2.95', '$2.90',
       '$1.97', '$2.56', '$1.20'], dtype=object)

In [0]:
reviews['Price'] = reviews['Price'].str.strip('$')
reviews['Price'] = pd.to_numeric(reviews['Price'])
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9360 entries, 0 to 10840
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Rating                       9360 non-null   float64
 1   Reviews                      9360 non-null   int64  
 2   Price                        9360 non-null   float64
 3   Type_Free                    9360 non-null   uint8  
 4   Type_Paid                    9360 non-null   uint8  
 5   Content Rating_Everyone      9360 non-null   uint8  
 6   Content Rating_Everyone 10+  9360 non-null   uint8  
 7   Content Rating_Mature 17+    9360 non-null   uint8  
 8   Content Rating_Teen          9360 non-null   uint8  
 9   Content Rating_Unrated       9360 non-null   uint8  
dtypes: float64(2), int64(1), uint8(7)
memory usage: 356.5 KB
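
In this notebook both conversions succeeded without creating missing values (the non-null count stays at 9360). If a column did contain a string that cannot be parsed, `pd.to_numeric` would raise an error by default. The sketch below shows one way to handle that case with `errors='coerce'`, using hypothetical price strings rather than the real column:

```python
import pandas as pd

# Hypothetical price strings; 'Everyone' stands in for a malformed entry
prices = pd.Series(['0', '$4.99', 'Everyone', '$1.20'])

# Strip the dollar sign, then coerce anything unparseable to NaN
numeric = pd.to_numeric(prices.str.strip('$'), errors='coerce')

# Drop the rows that failed to convert
numeric = numeric.dropna()
print(numeric.tolist())  # [0.0, 4.99, 1.2]
```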


Perform a train/test split with 20% of the data in the test set.

In [0]:
from sklearn.model_selection import train_test_split

X = reviews.drop(columns='Rating')
y = reviews['Rating']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
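
As a quick sanity check on the split sizes, the sketch below runs the same call on a hypothetical stand-in array with 9360 rows (the `_demo` names are illustrative, not part of the assignment).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same number of rows as the cleaned data
X_demo = np.arange(9360).reshape(-1, 1)
y_demo = np.arange(9360)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# 20% of 9360 rows go to the test set, the rest to training
print(len(X_tr), len(X_te))  # 7488 1872
```

Fixing `random_state` makes the split reproducible, so the same rows land in the training set on every run.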

Now fit a linear model using statsmodels or sklearn and produce a p-value for each coefficient in the model. Analyze the results.

In [0]:
import statsmodels.api as sm

X_const = sm.add_constant(X_train)
lm = sm.OLS(y_train, X_const).fit()

lm.summary()



| | | | |
|---|---|---|---|
| Dep. Variable: | Rating | R-squared: | 0.009 |
| Model: | OLS | Adj. R-squared: | 0.008 |
| Method: | Least Squares | F-statistic: | 8.72 |
| Date: | Tue, 07 Apr 2020 | Prob (F-statistic): | 6.29e-12 |
| Time: | 15:44:35 | Log-Likelihood: | -5649.6 |
| No. Observations: | 7488 | AIC: | 11320.0 |
| Df Residuals: | 7479 | BIC: | 11380.0 |
| Df Model: | 8 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.7997 | 0.243 | 11.529 | 0.000 | 2.324 | 3.276 |
| Reviews | 1.088e-08 | 1.9e-09 | 5.718 | 0.000 | 7.15e-09 | 1.46e-08 |
| Price | -0.0010 | 0.000 | -2.719 | 0.007 | -0.002 | -0.000 |
| Type_Free | 1.3499 | 0.122 | 11.100 | 0.000 | 1.112 | 1.588 |
| Type_Paid | 1.4498 | 0.122 | 11.845 | 0.000 | 1.210 | 1.690 |
| Content Rating_Everyone | 0.0278 | 0.364 | 0.076 | 0.939 | -0.686 | 0.742 |
| Content Rating_Everyone 10+ | 0.0878 | 0.365 | 0.241 | 0.810 | -0.628 | 0.804 |
| Content Rating_Mature 17+ | -0.0361 | 0.365 | -0.099 | 0.921 | -0.752 | 0.680 |
| Content Rating_Teen | 0.0646 | 0.364 | 0.177 | 0.859 | -0.650 | 0.779 |

| | | | |
|---|---|---|---|
| Omnibus: | 2924.472 | Durbin-Watson: | 2.038 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 14563.566 |
| Skew: | -1.836 | Prob(JB): | 0.0 |
| Kurtosis: | 8.761 | Cond. No. | 9.76e+20 |
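
To make the link between the t and P>|t| columns concrete, the sketch below recomputes two p-values from their t statistics. It assumes a normal approximation to the t distribution, which is reasonable here only because the residual degrees of freedom (7479) are large; statsmodels itself uses the t distribution. The helper name `two_sided_p` is hypothetical.

```python
import math

def two_sided_p(t_stat):
    """Two-sided p-value for a t statistic, using a normal approximation."""
    # P(|Z| > |t|) for a standard normal Z equals erfc(|t| / sqrt(2))
    return math.erfc(abs(t_stat) / math.sqrt(2))

# t statistics copied from the summary table above
p_price = two_sided_p(-2.719)    # Price
p_everyone = two_sided_p(0.076)  # Content Rating_Everyone

print(round(p_price, 4), round(p_everyone, 3))
```

Both results should land close to the 0.007 and 0.939 reported in the summary: a t statistic far from zero gives a small p-value, and one near zero gives a p-value near 1.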


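We have generated coefficients and p-values for each feature using the summary method of the fitted statsmodels OLS model. The p-values for the Content Rating dummy variables are very high (0.81 to 0.94), so we cannot reject the hypothesis that their coefficients are zero; they add little to this model.

All other p-values are below 0.05, so those coefficients are statistically significant at the 5% level. Significance is not the same as predictive power, though: the R-squared of 0.009 means the model explains less than 1% of the variance in ratings. The condition number of 9.76e+20 also points to severe multicollinearity. Type_Free and Type_Paid are both in the model alongside the constant, and Price overlaps with them because free apps have a price of 0, so the individual coefficients and p-values for these features should be read with caution.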
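
The multicollinearity concern can be illustrated with a small hypothetical design matrix. The sketch assumes every remaining app is either free or paid, which the notebook does not verify directly; under that assumption the two Type dummies always sum to the intercept column, so the matrix loses a rank.

```python
import numpy as np

# Hypothetical design matrix: intercept, Type_Free, Type_Paid
# Assumes every app is either free or paid, so the dummies sum to the intercept
const = np.ones(4)
type_free = np.array([1.0, 1.0, 0.0, 1.0])
type_paid = 1.0 - type_free
X_demo = np.column_stack([const, type_free, type_paid])

# Three columns, but only two are linearly independent
rank = np.linalg.matrix_rank(X_demo)
print(X_demo.shape[1], rank)  # 3 2

# Dropping one of the two dummies restores full column rank
X_fixed = X_demo[:, :2]
rank_fixed = np.linalg.matrix_rank(X_fixed)
print(X_fixed.shape[1], rank_fixed)  # 2 2
```

If the assumption holds for the real data, dropping either `Type_Free` or `Type_Paid` before fitting would be one way to remove this exact redundancy.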