## Day 30 Lecture 2 Assignment

In this assignment, we will learn about random forests. We will use the Google Play Store dataset loaded below.

In [184]:
!pip install category_encoders



In [185]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from category_encoders.leave_one_out import LeaveOneOutEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report as cr
from sklearn.metrics import confusion_matrix as cm
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV

In [186]:
reviews = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/googleplaystore.csv')

reviews.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
       'Android Ver'],
      dtype='object')

In this assignment, you will work more independently. Perform the following steps:
    
1. Select the columns best suited to predicting whether the rating is above 4.5.
2. Process the data (including converting columns to the correct types, removing missing values, creating dummy variables, and removing irrelevant variables).
3. Create a random forest model and evaluate it.
4. Using grid search cross-validation, tune the parameters to produce a better-performing model.
5. Show and discuss your results.

Good luck!

In [187]:
reviews['Category'].value_counts()

FAMILY                 1972
GAME                   1144
TOOLS                   843
MEDICAL                 463
BUSINESS                460
PRODUCTIVITY            424
PERSONALIZATION         392
COMMUNICATION           387
SPORTS                  384
LIFESTYLE               382
FINANCE                 366
HEALTH_AND_FITNESS      341
PHOTOGRAPHY             335
SOCIAL                  295
NEWS_AND_MAGAZINES      283
SHOPPING                260
TRAVEL_AND_LOCAL        258
DATING                  234
BOOKS_AND_REFERENCE     231
VIDEO_PLAYERS           175
EDUCATION               156
ENTERTAINMENT           149
MAPS_AND_NAVIGATION     137
FOOD_AND_DRINK          127
HOUSE_AND_HOME           88
LIBRARIES_AND_DEMO       85
AUTO_AND_VEHICLES        85
WEATHER                  82
ART_AND_DESIGN           65
EVENTS                   64
PARENTING                60
COMICS                   60
BEAUTY                   53
1.9                       1
Name: Category, dtype: int64
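
The stray `1.9` entry at the bottom of the counts suggests a malformed row whose values sit in the wrong columns. Below is a minimal sketch of one way to flag such rows, by checking that each label has the same upper-case form as the others. The toy frame and the helper name `find_suspect_categories` are made up for illustration and are not part of the dataset.

```python
import pandas as pd

def find_suspect_categories(df, column="Category"):
    """Return rows whose category label is not of the form UPPER_CASE_WORDS."""
    looks_valid = df[column].astype(str).str.match(r"^[A-Z_]+$")
    return df[~looks_valid]

# Made-up rows in the same shape as the real data
toy = pd.DataFrame({
    "App": ["a", "b", "c"],
    "Category": ["FAMILY", "1.9", "ART_AND_DESIGN"],
    "Rating": [4.1, 19.0, 4.7],
})

suspect = find_suspect_categories(toy)
print(suspect)
```

Inspecting the flagged rows before deleting them shows which column can be used to filter them out.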

In [188]:
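# Drop the malformed row (Category '1.9'), identified here by its out-of-range rating.
# .copy() avoids a SettingWithCopyWarning when columns are added to the filtered frame.
reviews = reviews[reviews['Rating'] != 19.0].copy()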

In [189]:
reviews['Elite Rate'] = reviews['Rating'] > 4.5

In [190]:
reviews.drop(columns='Rating', inplace=True)
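
One thing worth checking at this step: comparing a missing value with `>` yields `False`, so if `Rating` contains missing values, those apps are labelled as not elite rather than being excluded. The sketch below uses made-up ratings (not the real data) to show the effect and one possible way to handle it.

```python
import numpy as np
import pandas as pd

# Made-up ratings, one of them missing
ratings = pd.Series([4.7, np.nan, 4.2, 4.6], name="Rating")

# NaN > 4.5 evaluates to False, so the unrated app is silently labelled "not elite".
naive_label = ratings > 4.5

# One option: drop unrated rows before building the target.
rated = ratings.dropna()
clean_label = rated > 4.5

print(naive_label.tolist())
print(clean_label.tolist())
```

Whether unrated apps should be dropped or kept as negatives is a modelling choice, but it changes the class balance either way.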

In [191]:
reviews.head(10)

Unnamed: 0,App,Category,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Elite Rate
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up,False
1,Coloring book moana,ART_AND_DESIGN,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up,False
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up,True
3,Sketch - Draw & Paint,ART_AND_DESIGN,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up,False
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up,False
5,Paper flowers instructions,ART_AND_DESIGN,167,5.6M,"50,000+",Free,0,Everyone,Art & Design,"March 26, 2017",1.0,2.3 and up,False
6,Smoke Effect Photo Maker - Smoke Editor,ART_AND_DESIGN,178,19M,"50,000+",Free,0,Everyone,Art & Design,"April 26, 2018",1.1,4.0.3 and up,False
7,Infinite Painter,ART_AND_DESIGN,36815,29M,"1,000,000+",Free,0,Everyone,Art & Design,"June 14, 2018",6.1.61.1,4.2 and up,False
8,Garden Coloring Book,ART_AND_DESIGN,13791,33M,"1,000,000+",Free,0,Everyone,Art & Design,"September 20, 2017",2.9.2,3.0 and up,False
9,Kids Paint Free - Drawing Fun,ART_AND_DESIGN,121,3.1M,"10,000+",Free,0,Everyone,Art & Design;Creativity,"July 3, 2018",2.8,4.0.3 and up,True


In [192]:
# Drop identifier, version, and free-text columns that are not used as features
col_drop = reviews.drop(columns=['App','Current Ver','Android Ver','Size','Genres','Last Updated'])

In [193]:
# Encode the boolean target as a 0/1 column named 'Elite Rate_True'
reviews_set = col_drop.drop(columns='Elite Rate').assign(**{'Elite Rate_True': col_drop['Elite Rate'].astype(int)})

In [194]:
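# Strip thousands separators and the trailing '+' (literal matches, not regex)
reviews_set['Installs'] = reviews_set['Installs'].str.replace(',', '', regex=False)
reviews_set['Installs'] = reviews_set['Installs'].str.replace('+', '', regex=False)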

reviews_set['Installs'] = reviews_set['Installs'].astype(np.int64)

In [195]:
# Strip the literal '$' prefix before converting to float
reviews_set['Price'] = reviews_set['Price'].str.replace('$', '', regex=False)
reviews_set['Price'] = reviews_set['Price'].astype(np.float64)
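
Both `+` and `$` are regular-expression metacharacters, and whether `str.replace` treats its pattern as a regex by default has differed between pandas versions. Passing `regex=False` makes the literal match explicit. A short sketch on made-up values in the same format as `Installs` and `Price`:

```python
import pandas as pd

# Made-up values mimicking the raw string formats
installs = pd.Series(["10,000+", "5,000,000+", "0"])
price = pd.Series(["0", "$4.99", "$0.99"])

installs_clean = (
    installs.str.replace(",", "", regex=False)
            .str.replace("+", "", regex=False)
            .astype("int64")
)
price_clean = price.str.replace("$", "", regex=False).astype("float64")

print(installs_clean.tolist())
print(price_clean.tolist())
```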

In [196]:
reviews_set['Reviews'] = reviews_set['Reviews'].astype(np.int64)

In [197]:
reviews_set.isnull().sum()*100/reviews_set.isnull().count()

Category           0.000000
Reviews            0.000000
Installs           0.000000
Type               0.009225
Price              0.000000
Content Rating     0.000000
Elite Rate_True    0.000000
dtype: float64

In [198]:
reviews_set.dropna(axis=0, inplace=True)

In [199]:
reviews_set.isnull().sum()*100/reviews_set.isnull().count()

Category           0.0
Reviews            0.0
Installs           0.0
Type               0.0
Price              0.0
Content Rating     0.0
Elite Rate_True    0.0
dtype: float64

In [200]:
reviews_set.head()

Unnamed: 0,Category,Reviews,Installs,Type,Price,Content Rating,Elite Rate_True
0,ART_AND_DESIGN,159,10000,Free,0.0,Everyone,0
1,ART_AND_DESIGN,967,500000,Free,0.0,Everyone,0
2,ART_AND_DESIGN,87510,5000000,Free,0.0,Everyone,1
3,ART_AND_DESIGN,215644,50000000,Free,0.0,Teen,0
4,ART_AND_DESIGN,967,100000,Free,0.0,Everyone,0


In [201]:
y = reviews_set['Elite Rate_True']
X = reviews_set.drop(columns='Elite Rate_True')

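# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

# Target-encode the categorical columns; the encoder is fitted on the training split only
ce = LeaveOneOutEncoder()
ce.fit(X_train, y_train)

X_train_ce = ce.transform(X_train)
X_test_ce = ce.transform(X_test)

# Fit a random forest with default hyperparameters and predict on the held-out split
rf = RandomForestClassifier()
rf = rf.fit(X_train_ce, y_train)

y_pred = rf.predict(X_test_ce)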
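
For reference, leave-one-out encoding replaces each category with the mean target of the *other* training rows in that category, while unseen rows receive the plain per-category mean. The sketch below computes both in plain pandas on a made-up frame. It is an illustration of the idea, not the `category_encoders` implementation. As far as the `category_encoders` documentation describes it, `transform` applies the leave-one-out adjustment only when `y` is passed, so the training split may need `ce.transform(X_train, y_train)` (or `fit_transform`) to get these values. Treat that as something to verify against the installed version.

```python
import pandas as pd

# Made-up training data: one categorical feature and a binary target
train = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "TOOLS", "TOOLS"],
    "target":   [1, 0, 1, 0, 1],
})

grouped = train.groupby("Category")["target"]
cat_sum = grouped.transform("sum")
cat_count = grouped.transform("count")

# Leave-one-out: exclude each row's own target from its category mean.
# (A category with a single row would divide by zero and needs a fallback.)
train["loo"] = (cat_sum - train["target"]) / (cat_count - 1)

# Plain per-category mean, which is what unseen (test) rows would receive.
train["plain_mean"] = cat_sum / cat_count

print(train)
```

Excluding a row's own target keeps the encoded feature from leaking that row's label, which is one reason a large gap between training and test accuracy can appear when the plain mean is used on the training split.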

In [206]:
print(cr(y_test,y_pred))
print(cm(y_test,y_pred))
print('')
print('Train accuracy: ',rf.score(X_train_ce,y_train))
print('Test accuracy: ',rf.score(X_test_ce,y_test))

              precision    recall  f1-score   support

           0       0.87      0.91      0.89      1793
           1       0.44      0.33      0.38       375

    accuracy                           0.81      2168
   macro avg       0.65      0.62      0.63      2168
weighted avg       0.79      0.81      0.80      2168

[[1634  159]
 [ 250  125]]

Train accuracy:  0.9827009572137009
Test accuracy:  0.8113468634686347
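
The class-1 scores and the overall accuracy in the report follow directly from the confusion matrix. The matrix also gives the accuracy of always predicting the majority class, which is a useful baseline. A short check using only the numbers printed above:

```python
import numpy as np

# Confusion matrix from the output above: rows are true classes, columns are predictions.
matrix = np.array([[1634, 159],
                   [250, 125]])
tn, fp, fn, tp = matrix.ravel()

precision_1 = tp / (tp + fp)                  # share of predicted-elite apps that are elite
recall_1 = tp / (tp + fn)                     # share of elite apps that were found
accuracy = (tp + tn) / matrix.sum()
majority_baseline = (tn + fp) / matrix.sum()  # accuracy of always predicting class 0

print(round(precision_1, 2), round(recall_1, 2))
print(round(accuracy, 4), round(majority_baseline, 4))
```

The baseline (about 0.827) is slightly above the model's test accuracy (about 0.811), so accuracy alone overstates how useful this model is. The class-1 recall of 0.33 is the more telling figure.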


In [203]:
importance = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
importance

Reviews           0.637060
Category          0.174864
Installs          0.117300
Content Rating    0.033995
Price             0.031399
Type              0.005382
dtype: float64
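
The `feature_importances_` attribute is impurity-based and computed from the training data, which can favour features with many distinct values (such as a raw review count). Permutation importance on held-out data is one commonly used cross-check. The sketch below compares the two on synthetic data from `make_classification`. The data, sizes, and `random_state` values are illustrative assumptions, not the notebook's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the first two columns are the informative ones.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, n_informative=2,
                                     n_redundant=0, shuffle=False, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

demo_rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xd_train, yd_train)

impurity = demo_rf.feature_importances_   # derived from the training data
permuted = permutation_importance(        # measured on the held-out split
    demo_rf, Xd_test, yd_test, n_repeats=5, random_state=0
)

for i in range(X_demo.shape[1]):
    print(f"feature {i}: impurity={impurity[i]:.3f}  "
          f"permutation={permuted.importances_mean[i]:.3f}")
```

If the two rankings disagree sharply on the real data, the impurity-based ranking is the one to treat with more caution.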

In [207]:
param_grid1 = {
                 'n_estimators': [75, 100, 200],
                 'max_depth': [6, 8, 10],
                 'max_features': [2, 3, 4],
                 'min_samples_split': [50,100,200]
             }
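crossgrid_tree = GridSearchCV(rf, param_grid=param_grid1, cv=5)

# Note: the search is fitted on the test split here, so best_score_ is a cross-validated
# score within that split; fitting on X_train_ce, y_train would keep the test split untouched.
test_grid = crossgrid_tree.fit(X_test_ce, y_test)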

In [208]:
#dir(crossgrid_tree)
#print('Best estimator: ',crossgrid_tree.best_estimator_)

print('Best Test params: ',test_grid.best_params_)
print('Test Results: ',test_grid.best_score_)

Best Test params:  {'max_depth': 8, 'max_features': 3, 'min_samples_split': 50, 'n_estimators': 100}
Test Results:  0.8284117878694351
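
`best_score_` is the mean cross-validation accuracy on whatever data the search was fitted on. A common arrangement is therefore to tune on the training split only and score the refitted best model once on the held-out split. Below is a sketch of that arrangement on synthetic data from `make_classification`. The data, the small grid, and the `random_state` values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the encoded features and target
X_syn, y_syn = make_classification(n_samples=400, n_features=6, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

param_grid = {"n_estimators": [25, 50], "max_depth": [4, 8]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(Xs_train, ys_train)   # tuning sees only the training split

held_out_accuracy = search.score(Xs_test, ys_test)  # uses the refitted best estimator

print("Best params:", search.best_params_)
print("Mean CV accuracy:", search.best_score_)
print("Held-out accuracy:", held_out_accuracy)
```

Given the class imbalance in this assignment, passing `scoring="f1"` or `scoring="recall"` to `GridSearchCV` may be a better fit than the default accuracy.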
