We want to predict the price range of a mobile phone from its characteristics, such as memory, battery power, and camera specifications. Data for 2000 phones is provided.

1) Train a decision tree to predict the price category.<br>
    a) What is the best score we get? Use 10-fold CV.<br>
    b) What are the best tree parameters?<br>
    c) Which variables come out as important?
    
2) Now train a Random Forest classifier. How does its score compare with the decision tree's?

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split, GridSearchCV

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)

In [2]:
# Load data
df = pd.read_csv('C:/Users/Karthik.Iyer/Downloads/AccelerateAI/Tree-Based-Models-main/09_mobile_price.csv')
df.head()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,frontcamera,Has4G,memory,mobile_thickness,mobile_wt,n_cores,primarycamera_mp,px_height,px_width,ram,screen_height,screen_width,talk_time,Has3G,touch_screen,wifi,price_range
0,842,0,2.2,0,1,0,7,0.6,188,2,2,20,756,2549,9,7,19,0,0,1,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,6,905,1988,2631,17,3,7,1,1,0,2
2,563,1,0.5,1,2,1,41,0.9,145,5,6,1263,1716,2603,11,2,9,1,1,0,2
3,615,1,2.5,0,0,0,10,0.8,131,6,9,1216,1786,2769,16,8,11,1,0,0,2
4,1821,1,1.2,0,13,1,44,0.6,141,2,14,1208,1212,1411,8,2,15,1,1,0,1


In [3]:
# Check shape
df.shape

(2000, 21)

In [4]:
# Check info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   battery_power     2000 non-null   int64  
 1   blue              2000 non-null   int64  
 2   clock_speed       2000 non-null   float64
 3   dual_sim          2000 non-null   int64  
 4   frontcamera       2000 non-null   int64  
 5   Has4G             2000 non-null   int64  
 6   memory            2000 non-null   int64  
 7   mobile_thickness  2000 non-null   float64
 8   mobile_wt         2000 non-null   int64  
 9   n_cores           2000 non-null   int64  
 10  primarycamera_mp  2000 non-null   int64  
 11  px_height         2000 non-null   int64  
 12  px_width          2000 non-null   int64  
 13  ram               2000 non-null   int64  
 14  screen_height     2000 non-null   int64  
 15  screen_width      2000 non-null   int64  
 16  talk_time         2000 non-null   int64  


In [5]:
# Prepare X and y
X = df.drop('price_range', axis=1)
y = df['price_range']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

# Define parameters
params = {'criterion' : ['gini'],
          'max_depth' : range(1,40),
          'min_samples_split' : range(10,60,10)}

# GridSearch
clf_gs = GridSearchCV(DecisionTreeClassifier(), cv=10, param_grid=params)

# Fit the model
clf_gs.fit(X_train, y_train)

# Print score
clf_gs.best_params_, clf_gs.best_score_

({'criterion': 'gini', 'max_depth': 9, 'min_samples_split': 10}, 0.843125)
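
The size of this search is worth keeping in mind. As a rough sketch, scikit-learn's `ParameterGrid` can count the combinations in the grid used above. The names `grid`, `n_combinations`, and `n_fits` are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above
grid = {'criterion': ['gini'],
        'max_depth': range(1, 40),
        'min_samples_split': range(10, 60, 10)}

n_combinations = len(ParameterGrid(grid))  # 1 criterion * 39 depths * 5 split sizes
n_fits = n_combinations * 10               # one fit per combination per CV fold

print(n_combinations, n_fits)  # 195 1950
```

With the default `refit=True`, `GridSearchCV` also refits the best combination once on the full training set, so the same grid costs noticeably more for a random forest than for a single tree.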

In [6]:
# Check accuracy on test set
clf_gs.score(X_test, y_test)

0.825

**The cross-validated accuracy on the training set (84.3%) and the accuracy on the test set (82.5%) are close, which suggests the tuned tree generalizes reasonably well.**

**The best parameters:**<br>
**{'criterion': 'gini', 'max_depth': 9, 'min_samples_split': 10}**
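
Note that `best_score_` is the mean accuracy over the held-out CV folds, not the accuracy on the data the model was fitted to. A minimal sketch of how to see both numbers side by side, assuming a synthetic stand-in dataset from `make_classification` (not the phone data) and a smaller grid to keep it quick; all `demo_*` names are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (assumption: 4 classes and 20 features, like the phone data)
X_demo, y_demo = make_classification(n_samples=600, n_features=20,
                                     n_informative=6, n_classes=4,
                                     random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          train_size=0.8, random_state=42)

demo_params = {'max_depth': range(2, 11, 2),
               'min_samples_split': [10, 30, 50]}

# return_train_score=True also records accuracy on the training folds
demo_gs = GridSearchCV(DecisionTreeClassifier(random_state=42),
                       param_grid=demo_params, cv=10,
                       return_train_score=True)
demo_gs.fit(X_tr, y_tr)

best = demo_gs.best_index_
mean_train = demo_gs.cv_results_['mean_train_score'][best]  # fitted folds
mean_valid = demo_gs.cv_results_['mean_test_score'][best]   # held-out folds
test_acc = demo_gs.score(X_te, y_te)                        # untouched test set

print(f"train folds: {mean_train:.3f}")
print(f"CV (held-out folds): {mean_valid:.3f}")
print(f"test set: {test_acc:.3f}")
```

A large gap between the first two numbers would point to overfitting. The second and third are the ones that are comparable to each other.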

In [7]:
# Let's fit a default (untuned) tree just to check feature importance
clf = DecisionTreeClassifier(criterion='gini')

clf = clf.fit(X_train, y_train)

In [8]:
# Feature importance
pd.Series(clf.feature_importances_, index=X_train.columns).sort_values(ascending=False)

ram                 0.631534
battery_power       0.127918
px_height           0.077523
px_width            0.071451
mobile_wt           0.014626
talk_time           0.013258
n_cores             0.011215
frontcamera         0.009986
clock_speed         0.009939
primarycamera_mp    0.009333
mobile_thickness    0.007442
screen_width        0.003888
screen_height       0.003555
memory              0.003529
blue                0.001607
dual_sim            0.001250
wifi                0.001111
Has3G               0.000833
Has4G               0.000000
touch_screen        0.000000
dtype: float64

**RAM is by far the most important feature (importance ≈ 0.63), followed by battery power (≈ 0.13) and pixel height and width (≈ 0.08 and ≈ 0.07). Every other feature has an importance below 0.02, so each has little influence on the predicted price category.**
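
Impurity-based importances from a single unconstrained tree can shift from fit to fit, so a cross-check is useful. The sketch below compares them with permutation importance on held-out data. It uses a small synthetic table (hypothetical columns `ram`, `battery_power`, and `noise`, with the label built mostly from `ram`), not the phone dataset.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 800

# Synthetic features (assumption: ranges only loosely mimic the phone data)
toy = pd.DataFrame({
    'ram': rng.uniform(256, 4000, n),
    'battery_power': rng.uniform(500, 2000, n),
    'noise': rng.normal(size=n),
})

# Label: four quartile bins of a score dominated by ram
score = (toy['ram'] + 0.5 * toy['battery_power']).to_numpy()
toy_y = np.digitize(score, np.quantile(score, [0.25, 0.5, 0.75]))

Xa, Xb, ya, yb = train_test_split(toy, toy_y, train_size=0.8, random_state=42)
tree = DecisionTreeClassifier(random_state=42).fit(Xa, ya)

# Impurity-based importance (computed from the training data)
impurity_imp = pd.Series(tree.feature_importances_, index=toy.columns)

# Permutation importance: mean accuracy drop when a column is shuffled
perm = permutation_importance(tree, Xb, yb, n_repeats=10, random_state=42)
perm_imp = pd.Series(perm.importances_mean, index=toy.columns)

print(pd.DataFrame({'impurity': impurity_imp, 'permutation': perm_imp})
        .sort_values('impurity', ascending=False))
```

When both measures rank the same features at the top, the ranking is more likely to reflect the data than the particular tree that was grown.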

In [9]:
# Let's fit a random forest
from sklearn.ensemble import RandomForestClassifier

In [10]:
# Let's define the same parameters
params = {'criterion' : ['gini'],
          'max_depth' : range(1,40),
          'min_samples_split' : range(10,60,10)}

# GridSearch
clf_gs = GridSearchCV(RandomForestClassifier(n_jobs=-1, n_estimators=100), cv=10, param_grid=params)

# Fit the model
clf_gs.fit(X_train, y_train)

# Print score
clf_gs.best_params_, clf_gs.best_score_

({'criterion': 'gini', 'max_depth': 25, 'min_samples_split': 10},
 0.8768750000000001)

In [11]:
# Check accuracy on test set
clf_gs.score(X_test, y_test)

0.8825

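**Both the cross-validated accuracy and the test accuracy improved with the random forest, to about 88% each (87.7% CV, 88.25% test), compared with 84.3% and 82.5% for the decision tree.**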
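
A single train/test split gives only one number per model, so part of the gap could come from the particular split. One way to check is to score both models on identical CV folds and look at the spread. This is a sketch on synthetic data from `make_classification`, with hypothetical names (`X_cmp`, `folds`, and so on) and fixed settings rather than the tuned ones from the searches above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the phone data (assumption: 4 classes, 20 features)
X_cmp, y_cmp = make_classification(n_samples=600, n_features=20,
                                   n_informative=6, n_classes=4,
                                   random_state=42)

# A fixed random_state makes both models see exactly the same 10 folds
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

tree_scores = cross_val_score(
    DecisionTreeClassifier(min_samples_split=10, random_state=42),
    X_cmp, y_cmp, cv=folds)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, min_samples_split=10,
                           random_state=42),
    X_cmp, y_cmp, cv=folds)

print(f"tree:   {tree_scores.mean():.3f} +/- {tree_scores.std():.3f}")
print(f"forest: {forest_scores.mean():.3f} +/- {forest_scores.std():.3f}")
```

If the difference in means is large relative to the fold-to-fold standard deviations, the improvement is unlikely to be an artifact of one lucky split.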