## Day 27 Lecture 1 Assignment

In this assignment, we will examine statistical significance in linear models. We will use the Google Play Store dataset loaded below and analyze a regression fitted to it.

In [0]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [0]:
reviews = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/googleplaystore.csv')

In [0]:
reviews.head()

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |


In [0]:
reviews.shape

(10841, 13)

We will predict app ratings using other features describing the app. To use these features, we must clean the data.

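Start by creating dummy variables from the `Type` and `Content Rating` columns. Before encoding, drop the one malformed row whose `Type` is `0` and the two rarest content ratings (`Adults only 18+` and `Unrated`), which have too few rows to support their own coefficients.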

In [0]:
reviews['Type'].value_counts()

Free    10039
Paid      800
0           1
Name: Type, dtype: int64

In [0]:
reviews = reviews.loc[reviews['Type'] != '0']

In [0]:
reviews['Content Rating'].value_counts()

Everyone           8714
Teen               1208
Mature 17+          499
Everyone 10+        414
Adults only 18+       3
Unrated               2
Name: Content Rating, dtype: int64

In [0]:
reviews = reviews.loc[reviews['Content Rating'].isin(['Everyone', 'Teen', 'Mature 17+', 'Everyone 10+'])]

In [0]:
# answer below:

reviews_dummy = pd.get_dummies(reviews, columns=['Content Rating', 'Type'], drop_first=True) 
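
As a small illustration of what `drop_first=True` does, the sketch below encodes a toy frame (the values are made up, not taken from the dataset). The first level in sorted order is dropped and becomes the reference category, which is why the real data ends up with `Type_Paid` but no `Type_Free`, and no `Content Rating_Everyone` column.

```python
import pandas as pd

# Toy frame with hypothetical values and one two-level categorical column.
toy = pd.DataFrame({
    "Rating": [4.1, 3.9, 4.7],
    "Type": ["Free", "Paid", "Free"],
})

# drop_first=True drops the first level in sorted order ("Free"),
# leaving a single indicator column for "Paid".
toy_dummy = pd.get_dummies(toy, columns=["Type"], drop_first=True)
print(toy_dummy.columns.tolist())
```

Each remaining dummy coefficient in the regression is then read as a difference relative to the dropped reference level.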

Next, check for missing values and remove every row that contains one.

In [0]:
# answer below:

reviews_dummy.isna().sum()

App                               0
Category                          0
Rating                         1473
Reviews                           0
Size                              0
Installs                          0
Price                             0
Genres                            0
Last Updated                      0
Current Ver                       8
Android Ver                       2
Content Rating_Everyone 10+       0
Content Rating_Mature 17+         0
Content Rating_Teen               0
Type_Paid                         0
dtype: int64

In [0]:
reviews_dummy.dropna(inplace=True)

To simplify, we will remove the `App`, `Category`, `Size`, `Installs`, `Genres`, `Last Updated`, `Current Ver`, and `Android Ver` columns.

In [0]:
# answer below:

cols = [x for x in reviews_dummy.columns.values if x not in 
        ['App', 'Category', 'Size', 'Installs', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']]
reviews_subset = reviews_dummy[cols]

# or
# reviews_subset = reviews_dummy.drop(['App', 'Category', 'Size', 'Installs', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'], axis=1)

Next, check that all the columns are of numeric type and change the type of columns that are not numeric. If coercing to numeric causes missing values, remove those rows containing missing values from our dataset.

In [0]:
# answer below:

reviews_subset.dtypes

Rating                         float64
Reviews                         object
Price                           object
Content Rating_Everyone 10+      uint8
Content Rating_Mature 17+        uint8
Content Rating_Teen              uint8
Type_Paid                        uint8
dtype: object

In [0]:
pd.to_numeric(reviews['Reviews'], errors='coerce').describe()

count    1.083500e+04
mean     4.443502e+05
std      2.928422e+06
min      0.000000e+00
25%      3.800000e+01
50%      2.094000e+03
75%      5.480250e+04
max      7.815831e+07
Name: Reviews, dtype: float64

In [0]:
reviews.loc[reviews['Type']=='Paid', 'Price'].describe()

count       800
unique       91
top       $0.99
freq        148
Name: Price, dtype: object

In [0]:
# regex=False treats '$' as a literal character, not the regex end-of-string anchor.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
reviews_subset = reviews_subset.assign(Price=reviews_subset['Price'].str.replace('$', '', regex=False))


In [0]:
reviews_subset = reviews_subset.apply(pd.to_numeric, errors='coerce')

In [0]:
reviews_subset.dropna(inplace=True)

In [0]:
reviews_subset.shape

(9356, 7)
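
The cleaning steps above (strip the currency symbol, coerce to numeric, drop rows that failed to parse) can be sketched on a toy frame. The values here are hypothetical, and the `"3.0M"` entry stands in for any string that cannot be parsed as a number.

```python
import pandas as pd

# Hypothetical string-typed columns, similar in shape to Reviews and Price.
toy = pd.DataFrame({
    "Reviews": ["159", "967", "3.0M"],   # last entry is unparseable on purpose
    "Price": ["0", "$4.99", "$0.99"],
})

# regex=False makes "$" a literal character rather than a regex anchor.
toy["Price"] = toy["Price"].str.replace("$", "", regex=False)

# errors="coerce" turns anything unparseable into NaN instead of raising.
numeric = toy.apply(pd.to_numeric, errors="coerce")

# Rows that picked up a NaN during coercion are removed.
clean = numeric.dropna()
print(clean)
```

Coercing first and dropping afterwards keeps the two concerns separate: you can inspect `numeric.isna().sum()` to see how many rows each column would cost before committing to the drop.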

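Perform a train/test split. The code below holds out 25% of the data as the test sample (`test_size = 0.25`), which is what produces the 7017 training rows reported further down.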

In [0]:
# answer below:
from sklearn.model_selection import train_test_split

x_cols = [x for x in reviews_subset.columns if x != 'Rating']
# or reviews_subset.drop('Rating', axis=1)
X_train, X_test, y_train, y_test = train_test_split(reviews_subset[x_cols], reviews_subset['Rating'], test_size = 0.25)
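
To see how `test_size` maps to partition sizes for the 9356 cleaned rows, here is a small sketch on a stand-in index array rather than the real data. The `random_state=0` argument is an arbitrary choice added here for reproducibility; it is not part of the original cell.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 9356  # rows left after cleaning, per reviews_subset.shape
idx = np.arange(n_rows)  # stand-in for the real rows

# Compare a 25% and a 20% test share on the same number of rows.
train_25, test_25 = train_test_split(idx, test_size=0.25, random_state=0)
train_20, test_20 = train_test_split(idx, test_size=0.20, random_state=0)

print(len(train_25), len(test_25))
print(len(train_20), len(test_20))
```

A 25% test share leaves 7017 training rows, while a 20% share would leave 7484.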

In [0]:
# MinMaxScaler rescales each feature to [0, 1] (based on the training data), matching the range of the binary dummy features.

from sklearn.preprocessing import MinMaxScaler

scale = MinMaxScaler()

X_train_scaled = scale.fit_transform(X_train)
X_test_scaled = scale.transform(X_test)

In [0]:
X_train_scaled = pd.DataFrame(X_train_scaled, index=X_train.index, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, index=X_test.index, columns=X_test.columns)
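
One consequence of fitting the scaler on the training set only is that test values are mapped with the training minimum and maximum, so they can fall outside [0, 1]. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data; the test set contains a value (150)
# larger than anything seen in training.
X_train_toy = np.array([[0.0], [50.0], [100.0]])
X_test_toy = np.array([[25.0], [150.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(X_train_toy)  # learns min=0, max=100
test_scaled = scaler.transform(X_test_toy)        # reuses the training min/max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

This is expected behavior rather than a bug: refitting the scaler on the test set would leak test information into preprocessing.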

In [0]:
X_train_scaled.head()

| | Reviews | Price | Content Rating_Everyone 10+ | Content Rating_Mature 17+ | Content Rating_Teen | Type_Paid |
|---|---|---|---|---|---|---|
| 1356 | 3.4e-05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4678 | 2.7e-05 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1407 | 3.2e-05 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 6053 | 0.000121 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 9613 | 3e-06 | 0.006225 | 0.0 | 0.0 | 0.0 | 1.0 |


In [0]:
y_train.head()

1356    4.1
4678    4.2
1407    4.0
6053    4.4
9613    4.6
Name: Rating, dtype: float64

In [0]:
print(X_train_scaled.shape, y_train.shape)

(7017, 6) (7017,)


Now fit a linear model using statsmodels or sklearn and produce a p-value for each coefficient in the model. Analyze the results.

In [0]:
#answer below:

import statsmodels.api as sm

X_train_scaled = sm.add_constant(X_train_scaled)
results = sm.OLS(y_train, X_train_scaled).fit()

In [0]:
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Rating | R-squared: | 0.01 |
| Model: | OLS | Adj. R-squared: | 0.009 |
| Method: | Least Squares | F-statistic: | 11.56 |
| Date: | Tue, 26 May 2020 | Prob (F-statistic): | 6.45e-13 |
| Time: | 18:59:12 | Log-Likelihood: | -5276.0 |
| No. Observations: | 7017 | AIC: | 10570.0 |
| Df Residuals: | 7010 | BIC: | 10610.0 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.1796 | 0.007 | 583.092 | 0.000 | 4.166 | 4.194 |
| Reviews | 0.9516 | 0.162 | 5.869 | 0.000 | 0.634 | 1.269 |
| Price | -0.4629 | 0.155 | -2.985 | 0.003 | -0.767 | -0.159 |
| Content Rating_Everyone 10+ | 0.0655 | 0.030 | 2.164 | 0.030 | 0.006 | 0.125 |
| Content Rating_Mature 17+ | -0.0616 | 0.029 | -2.141 | 0.032 | -0.118 | -0.005 |
| Content Rating_Teen | 0.0377 | 0.019 | 1.954 | 0.051 | -0.000 | 0.076 |
| Type_Paid | 0.0900 | 0.025 | 3.564 | 0.000 | 0.041 | 0.140 |

| | | | |
|---|---|---|---|
| Omnibus: | 2755.809 | Durbin-Watson: | 1.982 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 13732.109 |
| Skew: | -1.847 | Prob(JB): | 0.0 |
| Kurtosis: | 8.773 | Cond. No. | 26.8 |
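
To make the `P>|t|` column less of a black box, the sketch below recomputes two of the p-values from the reported t statistics and residual degrees of freedom. It assumes the usual two-sided t-test of each coefficient against zero and uses `scipy.stats`, which the notebook itself does not import.

```python
from scipy import stats

df_resid = 7010  # "Df Residuals" in the summary

# t statistics copied from the coefficient table
t_stats = {"Price": -2.985, "Content Rating_Teen": 1.954}

# Two-sided p-value: probability of a |t| at least this large
# under the null hypothesis that the coefficient is zero.
p_values = {name: 2 * stats.t.sf(abs(t), df_resid) for name, t in t_stats.items()}

for name, p in p_values.items():
    print(f"{name}: p = {p:.3f}")
```

These should round to the 0.003 and 0.051 shown in the table. At the conventional 0.05 threshold, `Content Rating_Teen` falls just short of significance while the other coefficients pass, though with an R-squared of about 0.01 the model explains very little of the variation in ratings.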


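What would the coefficients be on the original, unscaled data? `MinMaxScaler` transforms each feature as `X_scaled = X * scale_ + min_`, so a coefficient fitted on scaled data converts to the original units by multiplying it by that feature's `scale_` (equivalently, dividing it by the feature's range). We prepend a 1 so that the constant lines up with the parameter vector. Strictly, the intercept also shifts by the sum of `coef * min_`, which is zero for every feature whose minimum is 0.

In [0]:
scales = np.insert(scale.scale_, 0, 1.0, axis=0)
results.params * scales

The dummy features already lie in [0, 1], so their scale factor is 1 and their coefficients are unchanged. `Reviews` and `Price` shrink sharply: with a review-count range of roughly 7.8e7, the `Reviews` coefficient of 0.9516 becomes about 1.2e-08 rating points per additional review, and the `Price` coefficient becomes about -1.2e-03 rating points per dollar.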
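
The relationship between scaled and unscaled coefficients can be checked on synthetic data. This sketch uses made-up features and sklearn's `LinearRegression` in place of statsmodels, fits the same model on raw and on min-max-scaled inputs, and then recovers the raw-unit coefficients from the scaled fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic data (hypothetical): two features on very different scales.
rng = np.random.default_rng(0)
X = rng.uniform(low=[10.0, 0.0], high=[1000.0, 5.0], size=(200, 2))
y = 3.0 + 0.002 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

fit_raw = LinearRegression().fit(X, y)
fit_scaled = LinearRegression().fit(X_scaled, y)

# MinMaxScaler computes X_scaled = X * scale_ + min_, so
#   y = b0 + b . X_scaled = (b0 + b . min_) + (b * scale_) . X
coef_recovered = fit_scaled.coef_ * scaler.scale_
intercept_recovered = fit_scaled.intercept_ + fit_scaled.coef_ @ scaler.min_

print(coef_recovered, fit_raw.coef_)
print(intercept_recovered, fit_raw.intercept_)
```

The recovered values match the raw-data fit up to floating-point error, which confirms that the conversion multiplies by `scale_` and that the intercept picks up the `min_` term.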