## Day 30 Lecture 2 Assignment

In this assignment, we will learn about random forests. We will use the Google Play Store dataset loaded below.

In [16]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [17]:
reviews = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/googleplaystore.csv')

reviews.head()

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |


In this assignment, you will work more independently. Perform the following steps:
    
1. Select which columns are best suited to predict whether the rating is above 4.5
2. Process the data (including transforming to the correct column type, removing missing values, creating dummy variables, and removing irrelevant variables)
3. Create a random forest model and evaluate
4. Using grid search cross validation, tweak the parameters to produce a better performing model
5. Show and discuss your results

Good luck!

I will drop everything except the target (Rating), Reviews, and Content Rating.

In [21]:
reviews.drop(['App', 'Category', 'Size', 'Installs', 'Type', 'Price', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'], axis=1, inplace=True)

In [23]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Rating          9367 non-null   float64
 1   Reviews         10841 non-null  object 
 2   Content Rating  10840 non-null  object 
dtypes: float64(1), object(2)
memory usage: 254.2+ KB


Rating is missing for 1,474 of the 10,841 rows (about 14%) and Content Rating for one row. I will drop the affected rows from the dataset.

In [24]:
reviews.dropna(inplace=True)
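Before dropping rows, it can help to count what is missing and how many rows `dropna()` would remove. Below is a minimal sketch on a toy frame; `toy` and its values are made up for illustration and only stand in for `reviews`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `reviews` (values are made up for illustration).
toy = pd.DataFrame({
    "Rating": [4.1, np.nan, 4.7, np.nan],
    "Reviews": ["159", "967", "87510", "12"],
    "Content Rating": ["Everyone", "Teen", None, "Everyone"],
})

missing_counts = toy.isna().sum()            # missing values per column
missing_share = toy.isna().mean().round(2)   # fraction of rows missing per column
rows_lost = len(toy) - len(toy.dropna())     # rows that dropna() would remove

print(missing_counts)
print(missing_share)
print(rows_lost)
```

If the share of lost rows is large, imputing or modelling the missing values may be worth considering instead of dropping them.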

Examine Content Rating

In [19]:
reviews['Content Rating'].value_counts()

Everyone           8714
Teen               1208
Mature 17+          499
Everyone 10+        414
Adults only 18+       3
Unrated               2
Name: Content Rating, dtype: int64

I drop Adults only 18+ and Unrated because there are very few of them (3 and 2 rows, respectively).

In [26]:
reviews.drop(reviews[reviews['Content Rating']=='Unrated'].index, inplace=True)
reviews.drop(reviews[reviews['Content Rating']=='Adults only 18+'].index, inplace=True)
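The two `drop` calls above name each rare category by hand. A possible alternative, sketched here on toy data, is to keep only the categories that reach a minimum count. The threshold `min_count` and the frame `toy` are hypothetical and not part of the assignment.

```python
import pandas as pd

# Toy column standing in for reviews['Content Rating'] (made-up counts).
toy = pd.DataFrame(
    {"Content Rating": ["Everyone"] * 5 + ["Teen"] * 3 + ["Unrated"]}
)

min_count = 2  # hypothetical cutoff for "too rare to keep"

counts = toy["Content Rating"].value_counts()
keep = counts[counts >= min_count].index           # categories that are common enough
filtered = toy[toy["Content Rating"].isin(keep)]   # rows in those categories only

print(filtered["Content Rating"].value_counts())
```

This avoids hard-coding category names, at the cost of having to choose a cutoff.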

I also convert the Reviews column to integers.

In [28]:
reviews['Reviews'] = pd.to_numeric(reviews['Reviews'], errors='coerce', downcast='integer')
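With `errors='coerce'`, any string that cannot be parsed becomes `NaN` rather than raising an error, so it is worth counting the coerced values afterwards. A small sketch with a made-up malformed entry (`"3.0M"` is an illustrative value, not a claim about the dataset):

```python
import pandas as pd

# Made-up values; the last one cannot be parsed as a number.
raw = pd.Series(["159", "967", "3.0M"])

converted = pd.to_numeric(raw, errors="coerce")  # unparseable strings become NaN
n_bad = int(converted.isna().sum())              # how many values were coerced

print(converted)
print(n_bad)
```

If any `NaN` remains, the column stays floating point, so an integer dtype in `info()` suggests that every value was parsed.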

In [29]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9362 entries, 0 to 10840
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Rating          9362 non-null   float64
 1   Reviews         9362 non-null   int32  
 2   Content Rating  9362 non-null   object 
dtypes: float64(1), int32(1), object(1)
memory usage: 256.0+ KB


Convert Content Rating into dummy variables.

In [30]:
# Encode Content Rating as 0/1 dummies; passing columns= also drops the original string column
reviews = pd.get_dummies(reviews, columns=['Content Rating'], prefix='ContentRating', drop_first=True, dtype=int)

Lastly, I will split the ratings into two categories: Top (4.5 and above) and Average (below 4.5).

In [None]:
# Assign to the Rating column; assigning to `reviews` would overwrite the whole DataFrame
reviews['Rating'] = np.where(reviews['Rating'] >= 4.5, 'Top', 'Average')

In [48]:
reviews.head()

| | Rating | Reviews | ContentRating_Everyone 10+ | ContentRating_Mature 17+ | ContentRating_Teen |
|---|---|---|---|---|---|
| 0 | Average | 159 | 0 | 0 | 0 |
| 1 | Average | 967 | 0 | 0 | 0 |
| 2 | Top | 87510 | 0 | 0 | 0 |
| 3 | Top | 215644 | 0 | 0 | 1 |
| 4 | Average | 967 | 0 | 0 | 0 |
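Once the target is binary, the class balance gives a reference point for accuracy: a model that always predicts the majority class scores the majority share. The sketch below uses only the five labels shown by `head()`, so the numbers say nothing about the full dataset. On the real data the same calls would be applied to `reviews['Rating']`.

```python
import pandas as pd

# The five labels visible in the head() output above, not the full column.
labels = pd.Series(["Average", "Average", "Top", "Top", "Average"], name="Rating")

class_share = labels.value_counts(normalize=True)  # fraction of rows per class
majority_label = class_share.idxmax()              # most frequent class
baseline_accuracy = class_share.max()              # accuracy of always predicting it

print(class_share)
print(majority_label, baseline_accuracy)
```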


Let's get to the model

In [49]:
from sklearn.model_selection import train_test_split
X = reviews.drop(['Rating'], axis=1)
# Use a 1-D Series as the target; a one-column DataFrame triggers a DataConversionWarning in fit()
Y = reviews['Rating']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=.20)
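The split above has no `random_state`, so the test set (and the reported accuracy) changes on every run. Here is a sketch of a reproducible, stratified split on synthetic data. The seed value and the toy frames `X_toy` and `y_toy` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and Y (made-up values, 30% 'Top').
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"Reviews": rng.integers(1, 1000, size=100)})
y_toy = pd.Series(["Top"] * 30 + ["Average"] * 70, name="Rating")

# stratify keeps the class proportions equal in both parts;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=10
)

print(y_te.value_counts())
```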

In [50]:
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=10)
random_forest.fit(X_train, y_train)



RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=5, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=10, verbose=0,
                       warm_start=False)

In [51]:
from sklearn.feature_selection import SelectFromModel
sfm = SelectFromModel(random_forest, threshold=0.05)

sfm.fit(X_train, y_train)



SelectFromModel(estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                                 class_weight=None,
                                                 criterion='gini', max_depth=5,
                                                 max_features='auto',
                                                 max_leaf_nodes=None,
                                                 max_samples=None,
                                                 min_impurity_decrease=0.0,
                                                 min_impurity_split=None,
                                                 min_samples_leaf=1,
                                                 min_samples_split=2,
                                                 min_weight_fraction_leaf=0.0,
                                                 n_estimators=100, n_jobs=None,
                                                 oob_score=False,
                                                 ra
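A fitted `SelectFromModel` only becomes useful once it is queried. The sketch below, on synthetic data, shows one way to read the importances and the features that pass the 0.05 threshold. The column names `signal` and `noise` are invented for this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 'signal' determines the label, 'noise' is unrelated.
rng = np.random.default_rng(10)
signal = rng.normal(size=200)
X_syn = pd.DataFrame({"signal": signal, "noise": rng.normal(size=200)})
y_syn = np.where(signal > 0, "Top", "Average")

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, max_depth=5, random_state=10),
    threshold=0.05,
)
selector.fit(X_syn, y_syn)

importances = selector.estimator_.feature_importances_  # one value per column
mask = selector.get_support()                           # True where importance >= threshold
selected = list(X_syn.columns[mask])

print(dict(zip(X_syn.columns, importances.round(3))))
print(selected)
```

On the notebook's data, `sfm.get_support()` and `sfm.transform(X_train)` would play the same roles.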

In [54]:
y_pred = random_forest.predict(X_test)

In [56]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.6849973304858515
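Step 4 of the assignment asks for grid search cross validation. A minimal sketch of how that could look is given below on synthetic data from `make_classification`. The parameter grid is an arbitrary example, not a recommendation. On the notebook's data, `X_train` and `y_train` would take the place of `X_tr` and `y_tr`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the app features.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=10
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.20, stratify=y_syn, random_state=10
)

# Example grid (assumed values): 2 * 3 * 2 = 12 candidate settings.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=10),
    param_grid,
    cv=3,                # 3-fold cross validation on the training part only
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

best_params = search.best_params_
test_accuracy = search.score(X_te, y_te)  # refit best model, scored on held-out data

print(best_params)
print(round(search.best_score_, 3), round(test_accuracy, 3))
```

The held-out accuracy is only meaningful next to a baseline, such as the share of the majority class in `y_test`.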