We want to predict the price range of a mobile phone based on its characteristics, such as memory, battery power, and camera specifications. Data for 2000 phones is provided.

1) Train a decision tree to predict the price category.<br>
    a) What is the best score we get? Use 10-fold CV.<br>
    b) What are the best tree parameters?<br>
    c) Which variables come out as important?
    
2) Now train a Random Forest classifier. How does its score compare with the decision tree's?

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split, GridSearchCV

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)

In [2]:
# Load data
df = pd.read_csv('C:/Users/Karthik.Iyer/Downloads/AccelerateAI/Tree-Based-Models-main/09_mobile_price.csv')
df.head()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,frontcamera,Has4G,memory,mobile_thickness,mobile_wt,n_cores,primarycamera_mp,px_height,px_width,ram,screen_height,screen_width,talk_time,Has3G,touch_screen,wifi,price_range
0,842,0,2.2,0,1,0,7,0.6,188,2,2,20,756,2549,9,7,19,0,0,1,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,6,905,1988,2631,17,3,7,1,1,0,2
2,563,1,0.5,1,2,1,41,0.9,145,5,6,1263,1716,2603,11,2,9,1,1,0,2
3,615,1,2.5,0,0,0,10,0.8,131,6,9,1216,1786,2769,16,8,11,1,0,0,2
4,1821,1,1.2,0,13,1,44,0.6,141,2,14,1208,1212,1411,8,2,15,1,1,0,1


In [3]:
# Check shape
df.shape

(2000, 21)

In [4]:
# Check info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   battery_power     2000 non-null   int64  
 1   blue              2000 non-null   int64  
 2   clock_speed       2000 non-null   float64
 3   dual_sim          2000 non-null   int64  
 4   frontcamera       2000 non-null   int64  
 5   Has4G             2000 non-null   int64  
 6   memory            2000 non-null   int64  
 7   mobile_thickness  2000 non-null   float64
 8   mobile_wt         2000 non-null   int64  
 9   n_cores           2000 non-null   int64  
 10  primarycamera_mp  2000 non-null   int64  
 11  px_height         2000 non-null   int64  
 12  px_width          2000 non-null   int64  
 13  ram               2000 non-null   int64  
 14  screen_height     2000 non-null   int64  
 15  screen_width      2000 non-null   int64  
 16  talk_time         2000 non-null   int64  


In [5]:
# Prepare X and y
X = df.drop('price_range', axis=1)
y = df['price_range']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, stratify=y, random_state=42)

# Define parameters
params = {'criterion' : ['gini'],
          'max_depth' : range(1,40),
          'min_samples_split' : range(10,60,10)}

# GridSearch
clf_gs = GridSearchCV(DecisionTreeClassifier(), cv=10, param_grid=params)

# Fit the model
clf_gs.fit(X_train, y_train)

# Print best param and score
clf_gs.best_params_, clf_gs.best_score_

({'criterion': 'gini', 'max_depth': 7, 'min_samples_split': 20}, 0.828125)
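The grid above covers 39 values of `max_depth` and 5 values of `min_samples_split`, so 195 candidates and 1950 fits with 10 folds. `best_score_` reports only the winner; the fitted search also exposes `cv_results_`, which shows how close the runners-up were. The following is a minimal sketch on synthetic data, not the phone dataset: `X_demo`, `y_demo`, `demo_params` and the reduced grid are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the phone data (assumption: 4 classes, 10 features)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    n_classes=4, random_state=0)

# Deliberately small grid so the sketch runs quickly
demo_params = {'max_depth': [3, 5, 7],
               'min_samples_split': [10, 20]}

demo_gs = GridSearchCV(DecisionTreeClassifier(random_state=0),
                       param_grid=demo_params, cv=5)
demo_gs.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(demo_gs.cv_results_)
top = (results.sort_values('rank_test_score')
       [['params', 'mean_test_score', 'std_test_score']]
       .head(3))
print(top)
```

If several candidates score within one standard deviation of each other, the shallower tree is usually the more defensible choice.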

In [6]:
# Model performance on Training set
y_train_pred = clf_gs.predict(X_train)

from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report

accuracy = metrics.accuracy_score(y_train, y_train_pred)
print("Accuracy: {:.2f}".format(accuracy))
cm = confusion_matrix(y_train, y_train_pred)
print("Confusion Matrix: \n", cm)
print(classification_report(y_train, y_train_pred))

Accuracy: 0.93
Confusion Matrix: 
 [[384  16   0   0]
 [ 18 371  11   0]
 [  0  26 360  14]
 [  0   0  26 374]]
              precision    recall  f1-score   support

           0       0.96      0.96      0.96       400
           1       0.90      0.93      0.91       400
           2       0.91      0.90      0.90       400
           3       0.96      0.94      0.95       400

    accuracy                           0.93      1600
   macro avg       0.93      0.93      0.93      1600
weighted avg       0.93      0.93      0.93      1600



In [7]:
# Model performance on test set
y_pred = clf_gs.predict(X_test)

accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}".format(accuracy))
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: \n", cm)
print(classification_report(y_test, y_pred))

Accuracy: 0.84
Confusion Matrix: 
 [[91  9  0  0]
 [11 77 12  0]
 [ 0 11 83  6]
 [ 0  0 13 87]]
              precision    recall  f1-score   support

           0       0.89      0.91      0.90       100
           1       0.79      0.77      0.78       100
           2       0.77      0.83      0.80       100
           3       0.94      0.87      0.90       100

    accuracy                           0.84       400
   macro avg       0.85      0.85      0.85       400
weighted avg       0.85      0.84      0.85       400



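**The training accuracy is 93%, whereas the test accuracy drops to 84%, which suggests the tree overfits the training data somewhat. The best 10-fold CV score is 0.828, close to the test accuracy.**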
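The reported figures can be checked by hand from the test confusion matrix printed above. This sketch uses plain Python and copies the matrix values from that output; the variable names are illustrative.

```python
# Test-set confusion matrix copied from the output above
# (rows = true class, columns = predicted class)
cm_test = [
    [91,  9,  0,  0],
    [11, 77, 12,  0],
    [ 0, 11, 83,  6],
    [ 0,  0, 13, 87],
]

n_classes = len(cm_test)
total = sum(sum(row) for row in cm_test)
correct = sum(cm_test[i][i] for i in range(n_classes))

# Accuracy = correctly classified phones / all phones
accuracy = correct / total

# Recall per class = diagonal / row sum
recall = [cm_test[i][i] / sum(cm_test[i]) for i in range(n_classes)]

# Precision per class = diagonal / column sum
precision = [
    cm_test[i][i] / sum(cm_test[r][i] for r in range(n_classes))
    for i in range(n_classes)
]

print(f"correct={correct}, total={total}, accuracy={accuracy:.3f}")
print("recall:", [round(r, 2) for r in recall])
print("precision:", [round(p, 2) for p in precision])
```

All errors sit next to the diagonal, so the tree only ever confuses neighbouring price ranges.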

**The best parameters:**<br>
**{'criterion': 'gini', 'max_depth': 7, 'min_samples_split': 20}**
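For reference, the `gini` criterion scores a node by its impurity, 1 minus the sum of squared class proportions. A minimal sketch follows; the helper name `gini_impurity` is hypothetical, and the class counts are the 400 training phones per price range shown in the training report above.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Root node of the training set: four balanced classes of 400 phones each
root = gini_impurity([400, 400, 400, 400])

# A perfectly pure node contains a single class
pure = gini_impurity([400, 0, 0, 0])

print(root, pure)
```

The tree picks the split that reduces this impurity the most, weighted by the number of samples in each child node.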

In [8]:
# Let's fit a tree with default settings (not the tuned parameters) just to check feature importance
clf = DecisionTreeClassifier(criterion='gini')

clf = clf.fit(X_train, y_train)

In [9]:
# Feature importance
pd.Series(clf.feature_importances_, index=X_train.columns).sort_values(ascending=False)

ram                 0.615176
battery_power       0.125687
px_height           0.088204
px_width            0.082700
mobile_wt           0.019732
memory              0.011229
mobile_thickness    0.008830
primarycamera_mp    0.008224
clock_speed         0.007793
screen_width        0.007712
talk_time           0.006338
frontcamera         0.005278
n_cores             0.005198
screen_height       0.005012
Has3G               0.001389
Has4G               0.000833
dual_sim            0.000667
blue                0.000000
touch_screen        0.000000
wifi                0.000000
dtype: float64

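**RAM is by far the most important feature (importance about 0.62), followed by battery power (0.13), pixel height (0.09) and pixel width (0.08). Mobile weight and memory contribute about 0.02 and 0.01; every remaining feature contributes less than 0.01.**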
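The importances above come from a tree with default settings and no depth limit. An alternative is to read them from the tuned tree, which `GridSearchCV` exposes as `best_estimator_` when `refit=True` (the default). This is a sketch on synthetic data: the feature names, grid and sample sizes are assumptions, and the values will not match the table above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the phone data (assumption: 8 features, 4 classes)
X_arr, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    n_classes=4, random_state=0)
feature_names = [f'feature_{i}' for i in range(X_arr.shape[1])]
X_demo = pd.DataFrame(X_arr, columns=feature_names)

gs = GridSearchCV(DecisionTreeClassifier(random_state=0),
                  param_grid={'max_depth': [3, 5, 7]}, cv=5)
gs.fit(X_demo, y_demo)

# best_estimator_ is the tree refit on all of X_demo with the best parameters
tuned_tree = gs.best_estimator_
importances = (pd.Series(tuned_tree.feature_importances_, index=X_demo.columns)
               .sort_values(ascending=False))
print(importances)
```

In the notebook the equivalent call would be on `clf_gs.best_estimator_`, made before `clf_gs` is reassigned to the random forest search.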

In [10]:
# Let's fit a random forest
from sklearn.ensemble import RandomForestClassifier

In [11]:
# Let's define the same parameter grid
params = {'criterion' : ['gini'],
          'max_depth' : range(1,40),
          'min_samples_split' : range(10,60,10)}

# GridSearch
clf_gs = GridSearchCV(RandomForestClassifier(n_jobs=-1, n_estimators=100), cv=10, param_grid=params)

# Fit the model
clf_gs.fit(X_train, y_train)

# Print best param and score
clf_gs.best_params_, clf_gs.best_score_

({'criterion': 'gini', 'max_depth': 26, 'min_samples_split': 10}, 0.875625)

In [12]:
# Model Performance on Train set
y_train_pred = clf_gs.predict(X_train)

accuracy = metrics.accuracy_score(y_train, y_train_pred)
print("Accuracy: {:.2f}".format(accuracy))
cm = confusion_matrix(y_train, y_train_pred)
print("Confusion Matrix: \n", cm)
print(classification_report(y_train, y_train_pred))

Accuracy: 1.00
Confusion Matrix: 
 [[400   0   0   0]
 [  0 398   2   0]
 [  0   1 398   1]
 [  0   0   0 400]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       400
           1       1.00      0.99      1.00       400
           2       0.99      0.99      0.99       400
           3       1.00      1.00      1.00       400

    accuracy                           1.00      1600
   macro avg       1.00      1.00      1.00      1600
weighted avg       1.00      1.00      1.00      1600



In [13]:
# Model Performance on Test set
y_pred = clf_gs.predict(X_test)

accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}".format(accuracy))
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: \n", cm)
print(classification_report(y_test, y_pred))

Accuracy: 0.89
Confusion Matrix: 
 [[97  3  0  0]
 [ 6 82 12  0]
 [ 0 13 80  7]
 [ 0  0  4 96]]
              precision    recall  f1-score   support

           0       0.94      0.97      0.96       100
           1       0.84      0.82      0.83       100
           2       0.83      0.80      0.82       100
           3       0.93      0.96      0.95       100

    accuracy                           0.89       400
   macro avg       0.89      0.89      0.89       400
weighted avg       0.89      0.89      0.89       400



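**Test accuracy improved from 84% with the decision tree to 89% with RandomForestClassifier, and the best 10-fold CV score rose from 0.828 to 0.876. The near-perfect training accuracy shows that the forest still overfits the training set.**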
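A single train/test split gives one number per model. Scoring both models on the same cross-validation folds also shows the spread. This is a minimal sketch on synthetic data: the dataset, fold count and hyperparameters are illustrative assumptions, not the tuned values from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the phone data (assumption: 4 classes, 10 features)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    n_classes=4, random_state=0)

# Reuse the same folds for both models so the comparison is like-for-like
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    'decision_tree': DecisionTreeClassifier(max_depth=7, random_state=0),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

cv_scores = {name: cross_val_score(model, X_demo, y_demo, cv=cv)
             for name, model in models.items()}

for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

A gap between the two means that is large relative to the fold-to-fold standard deviation gives more confidence than a single split that one model is really better.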