Hello üôå, welcome to my notebook. In this notebook we will try to learn Multiclass Classification using 5 Models. Also develop feature selection to increased accuracy.
Feel free if you have any question or suggestion! Thank you!

![](https://i.pcmag.com/imagery/roundups/07ml3nh3QrzTLZ9UycfQQB2-36.fit_lim.size_1050x.jpg)
(https://sea.pcmag.com)

Problem:
- Estimate the price of mobile phones using sales data from various companies
- Find the correlation between mobile phone features and selling price
- Determine the price range (not the actual price)

In [None]:
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

train = pd.read_csv('../input/mobile-price-classification/train.csv')
test = pd.read_csv('../input/mobile-price-classification/test.csv')

In [None]:
train.head(3)

In [None]:
test.head(3)

Feature definition:
1. battery_power: Total energy a battery can store in one time measured in mAh
2. blue: Has bluetooth or not
3. clock_speed: speed at which microprocessor executes instructions
4. dual_sim: Has dual sim support or not
5. fc: Front Camera mega pixels
6. four_g: Has 4G or not
7. int_memory: Internal Memory in Gigabytes
8. m_dep: Mobile Depth in cm
9. mobile_wt: Weight of mobile phone
10. n_cores: Number of cores of processor
11. px_height: Height in pixel
12. px_width: Width in pixel
13. ram: Random Access Memory in megabytes
14. talk_time: Maximum talk time on a single battery charge (in hours)
15. three_g: 3G feature (Boolean, 1 if present, 0 if not)
16. touch_screen: Touch screen feature
17. wifi: Wifi feature
18. price_range: Price range of the phone, one of 0, 1, 2 or 3

Some features/columns have no description, but I still keep them to check their correlation

In [None]:
train.columns = ['Power Battery', 'Bluetooth', 'Clock Speed','Dual SIM', 
                 'Front Camera', '4G','Int. Memory','Thickness','Weight',
                 'Core Pros.','PC','Height','Width','RAM','SC H', 'SC W','Talk Time',
                 '3G','Touch Screen','Wifi','Price_Range']

test.columns = ['ID','Power Battery', 'Bluetooth', 'Clock Speed','Dual SIM', 
                 'Front Camera', '4G','Int. Memory','Thickness','Weight',
                 'Core Pros.','PC','Height','Width','RAM','SC H', 'SC W','Talk Time',
                 '3G','Touch Screen','Wifi']

In [None]:
print(f'Count of unique item for each columns in train data:\n{train.nunique().sort_values(ascending=False)}')
print('-'*20)
print(f'Count of unique item for each columns in test data:\n{test.nunique().sort_values(ascending=False)}')

In [None]:
print(f'Shape of train data:\n{train.shape}')
print('-'*20)
print(f'Shape of test data:\n{test.shape}')

In [None]:
# info() prints its report directly and returns None, so call it outside the f-string
print('Info of train data:')
train.info()
print('-'*50)
print('Info of test data:')
test.info()

In [None]:
train.describe().round(decimals=0)

In [None]:
test.describe().round(decimals=0)

In [None]:
print(f'Unique item in target feature:\n{train.Price_Range.unique()}')

- I renamed the columns to a more readable format
- From the unique-item counts, RAM, Height, Width and Power Battery have many unique values, while several columns are binary, such as Wifi, 3G and 4G
- Train and test data have the same number of columns. Train data has 2000 rows and test data has 1000 rows
- All columns are numerical (int and float)
- As the task states, our target feature is Price Range, and all other features can be kept for modelling
- The target feature has 4 unique values: 0, 1, 2 and 3. The smallest number indicates the lowest price range

In [None]:
all_data = pd.concat([train,test])

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

f, ax = plt.subplots(nrows=4, ncols=2, figsize=(15, 12))
f.suptitle('Countplot Graph', fontsize=16)

sns.countplot(x='Core Pros.', data = all_data, palette='YlOrBr_r', ax=ax[0,0])
ax[0,0].set_ylabel(ylabel='Core Pros.', fontsize=14)

sns.countplot(x='Wifi', data = all_data, palette='YlOrBr_r', ax=ax[0,1])
ax[0,1].set_ylabel(ylabel='Wifi', fontsize=14)

sns.countplot(x='3G', data = all_data, palette='YlOrBr_r', ax=ax[1,0])
ax[1,0].set_ylabel(ylabel='3G', fontsize=14)

sns.countplot(x='4G', data = all_data, palette='YlOrBr_r', ax=ax[1,1])
ax[1,1].set_ylabel(ylabel='4G', fontsize=14)

sns.countplot(x='Dual SIM', data = all_data, palette='YlOrBr_r', ax=ax[2,0])
ax[2,0].set_ylabel(ylabel='Dual SIM', fontsize=14)

sns.countplot(x='Touch Screen', data = all_data, palette='YlOrBr_r', ax=ax[2,1])
ax[2,1].set_ylabel(ylabel='Touch Screen', fontsize=14)

sns.countplot(x='Bluetooth', data = all_data, palette='YlOrBr_r', ax=ax[3,0])
ax[3,0].set_ylabel(ylabel='Bluetooth', fontsize=14)

f.delaxes(ax[3, 1])
plt.tight_layout()
plt.show()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

f, ax = plt.subplots(nrows=7, ncols=2, figsize=(10, 10))
f.suptitle('Histplot Graph', fontsize=16)

sns.distplot(all_data['RAM'], ax=ax[0,0])
ax[0,0].set_ylabel(ylabel='RAM', fontsize=14)

sns.distplot(all_data['Height'], ax=ax[0,1])
ax[0,1].set_ylabel(ylabel='Height', fontsize=14)

# Height is already plotted above, so this panel shows Width
sns.distplot(all_data['Width'], ax=ax[1,0])
ax[1,0].set_ylabel(ylabel='Width', fontsize=14)

sns.distplot(all_data['Power Battery'], ax=ax[1,1])
ax[1,1].set_ylabel(ylabel='Power Battery', fontsize=14)

sns.distplot(all_data['Weight'], ax=ax[2,0])
ax[2,0].set_ylabel(ylabel='Weight', fontsize=14)

sns.distplot(all_data['Int. Memory'], ax=ax[2,1])
ax[2,1].set_ylabel(ylabel='Int. Memory', fontsize=14)

sns.distplot(all_data['Clock Speed'], ax=ax[3,0])
ax[3,0].set_ylabel(ylabel='Clock Speed', fontsize=14)

sns.distplot(all_data['PC'], ax=ax[3,1])
ax[3,1].set_ylabel(ylabel='PC', fontsize=14)

sns.distplot(all_data['Front Camera'], ax=ax[4,0])
ax[4,0].set_ylabel(ylabel='Front Camera', fontsize=14)

sns.distplot(all_data['SC W'], ax=ax[4,1])
ax[4,1].set_ylabel(ylabel='SC W', fontsize=14)

sns.distplot(all_data['Talk Time'], ax=ax[5,0])
ax[5,0].set_ylabel(ylabel='Talk Time', fontsize=14)

sns.distplot(all_data['SC H'], ax=ax[5,1])
ax[5,1].set_ylabel(ylabel='SC H', fontsize=14)

sns.distplot(all_data['Thickness'], ax=ax[6,0])
ax[6,0].set_ylabel(ylabel='Thickness', fontsize=14)


f.delaxes(ax[6, 1])
plt.tight_layout()
plt.show()

In [None]:
feature_skew_kurt = ['RAM','Width','Power Battery','Height','Weight','Int. Memory',
                     'Clock Speed','PC','Front Camera','Talk Time','SC W', 'SC H',
                     'Thickness', 'Core Pros.']

print(f'Skewness:\n{all_data[feature_skew_kurt].skew().sort_values(ascending=False)}')
print('-'*30)
print(f'Kurtosis:\n{all_data[feature_skew_kurt].kurt().sort_values(ascending=False)}')

In [None]:
power_battery = all_data[["Power Battery", "Price_Range"]].groupby(['Price_Range'], as_index=False).mean().sort_values(by='Price_Range', ascending=False)
power_battery

In [None]:
ram = all_data[["RAM", "Price_Range"]].groupby(['Price_Range'], as_index=False).mean().sort_values(by='Price_Range', ascending=False)
ram

In [None]:
height = all_data[["Height", "Price_Range"]].groupby(['Price_Range'], as_index=False).mean().sort_values(by='Price_Range', ascending=False)
height

In [None]:
width = all_data[["Width", "Price_Range"]].groupby(['Price_Range'], as_index=False).mean().sort_values(by='Price_Range', ascending=False)
width

In [None]:
weight = all_data[["Weight", "Price_Range"]].groupby(['Price_Range'], as_index=False).mean().sort_values(by='Price_Range', ascending=False)
weight

In [None]:
int_memory= all_data[["Int. Memory", "Price_Range"]].groupby(['Price_Range'], as_index=False).mean().sort_values(by='Price_Range', ascending=False)
int_memory

In [None]:
clock_speed = all_data[["Clock Speed", "Price_Range"]].groupby(['Price_Range'], as_index=False).mean().sort_values(by='Price_Range', ascending=False)
clock_speed

In [None]:
pc = all_data[["PC", "Price_Range"]].groupby(['Price_Range'], as_index=False).mean().sort_values(by='Price_Range', ascending=False)
pc

In [None]:
front_camera = all_data[["Front Camera", "Price_Range"]].groupby(['Price_Range'], as_index=False).mean().sort_values(by='Price_Range', ascending=False)
front_camera

In [None]:
sc_w = all_data[["SC W", "Price_Range"]].groupby(['Price_Range'], as_index=False).mean().sort_values(by='Price_Range', ascending=False)
sc_w

In [None]:
talk_time = all_data[["Talk Time", "Price_Range"]].groupby(['Price_Range'], as_index=False).mean().sort_values(by='Price_Range', ascending=False)
talk_time

In [None]:
sc_h = all_data[["SC H", "Price_Range"]].groupby(['Price_Range'], as_index=False).mean().sort_values(by='Price_Range', ascending=False)
sc_h

In [None]:
thickness = all_data[["Thickness", "Price_Range"]].groupby(['Price_Range'], as_index=False).mean().sort_values(by='Price_Range', ascending=False)
thickness

In [None]:
wifi = all_data[["Wifi", "Price_Range"]].groupby(['Price_Range'], as_index=False).apply(pd.DataFrame.mode).reset_index(drop=True).sort_values(by='Price_Range', ascending=False)
wifi = wifi[['Price_Range','Wifi']]
wifi

In [None]:
three_g = all_data[["3G", "Price_Range"]].groupby(['Price_Range'], as_index=False).apply(pd.DataFrame.mode).reset_index(drop=True).sort_values(by='Price_Range', ascending=False)
three_g = three_g[['Price_Range','3G']]
three_g

In [None]:
four_g = all_data[["4G", "Price_Range"]].groupby(['Price_Range'], as_index=False).apply(pd.DataFrame.mode).reset_index(drop=True).sort_values(by='Price_Range', ascending=False)
four_g = four_g[['Price_Range','4G']]
four_g

In [None]:
dual_sim = all_data[["Dual SIM", "Price_Range"]].groupby(['Price_Range'], as_index=False).apply(pd.DataFrame.mode).reset_index(drop=True).sort_values(by='Price_Range', ascending=False)
dual_sim = dual_sim[['Price_Range','Dual SIM']]
dual_sim

In [None]:
touch_screen = all_data[["Touch Screen", "Price_Range"]].groupby(['Price_Range'], as_index=False).apply(pd.DataFrame.mode).reset_index(drop=True).sort_values(by='Price_Range', ascending=False)
touch_screen = touch_screen[['Price_Range','Touch Screen']]
touch_screen

In [None]:
bluetooth = all_data[["Bluetooth", "Price_Range"]].groupby(['Price_Range'], as_index=False).apply(pd.DataFrame.mode).reset_index(drop=True).sort_values(by='Price_Range', ascending=False)
bluetooth = bluetooth[['Price_Range','Bluetooth']]
bluetooth

- I concatenated train and test data to make visualization easier
- In the countplot graph, most features have balanced value counts, except the 3G feature
- A short explanation of skewness and kurtosis:
    1. Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. (www.itl.nist.gov)
    2. Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. (www.itl.nist.gov)
    3. Hair et al. (2010) and Byrne (2010) argued that data can be considered normal if skewness is between -2 and +2 and kurtosis is between -7 and +7.
- I only made histogram plots for features that have many unique values; for binary features I prefer a countplot
- Skewness ranges from -0.08 to 1, and kurtosis from -1.34 to 0.3
- Based on the thresholds above, our data can be treated as approximately normally distributed, so we don't need to apply a normalizing transformation
- I grouped each column by our target feature (Price Range) and aggregated with the mean or the mode, to see how each feature varies across price ranges
- Again, binary features use the mode and the remaining features use the mean
- The result is that some features, such as Power Battery, RAM and Weight, have a strong correlation with our target feature. The highest Price Range tends to have higher values.
- For the mode we can't clearly see a correlation, but we still keep these features for model validation later
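
The normality rule of thumb above can be sketched in plain Python. This is a minimal illustration, assuming the simple moment-based (population) formulas for skewness and excess kurtosis. pandas' `skew()` and `kurt()` apply a small-sample bias correction, so their values differ slightly. The function names and the sample list are hypothetical and not part of the notebook.

```python
from statistics import fmean

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (0 for a symmetric sample)."""
    mean = fmean(values)
    m2 = fmean((v - mean) ** 2 for v in values)
    m3 = fmean((v - mean) ** 3 for v in values)
    return m3 / m2 ** 1.5

def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    mean = fmean(values)
    m2 = fmean((v - mean) ** 2 for v in values)
    m4 = fmean((v - mean) ** 4 for v in values)
    return m4 / m2 ** 2 - 3

def looks_normal(values, skew_limit=2.0, kurt_limit=7.0):
    """Rule of thumb: |skewness| <= 2 and |kurtosis| <= 7."""
    return (abs(skewness(values)) <= skew_limit
            and abs(excess_kurtosis(values)) <= kurt_limit)

# Hypothetical sample, not taken from the dataset
sample = [1, 2, 3, 4, 5]
print(skewness(sample), excess_kurtosis(sample), looks_normal(sample))
```

A symmetric sample such as the one above has zero skewness, while a sample with one large outlier on the right gets a positive value.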

In [None]:
print(f'Missing values in train data:\n{train.isnull().sum().sort_values(ascending=False)}')
print('-'*30)
print(f'Missing values in test data:\n{test.isnull().sum().sort_values(ascending=False)}')

In [None]:
all_data.corr().style.background_gradient(cmap='coolwarm')

In [None]:
print(f'Correlation to target feature:\n{all_data.corr().Price_Range.sort_values(ascending=False)}')

In [None]:
train.loc[:,'Price_Range'].value_counts()

- Train and test data have no missing values
- The features with the highest correlation to the target feature are RAM, Power Battery, Width and Height
- The classes in train data are balanced, so we don't need to resample
- For this problem we will use classification models:

    1. Decision Tree Classifier (DTC)
    2. Random Forest Classifier (RFC)
    3. Extra Tree Classifier (ETC)
    4. XGB Classifier (XGB)
    5. Stacking
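
The balance check behind the decision not to resample can be sketched without pandas. This is a minimal sketch: `class_shares`, `is_balanced` and the `tolerance` threshold are hypothetical choices for illustration, and the label list is made up rather than taken from the dataset.

```python
from collections import Counter

def class_shares(labels):
    """Return each class's share of the total, keyed by class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}

def is_balanced(labels, tolerance=0.05):
    """True if every class share is within `tolerance` of the uniform share."""
    shares = class_shares(labels)
    uniform = 1 / len(shares)
    return all(abs(share - uniform) <= tolerance for share in shares.values())

# Hypothetical labels with four equally frequent price ranges
labels = [0, 1, 2, 3] * 5
print(class_shares(labels))   # each class has a share of 0.25
print(is_balanced(labels))    # True
```

With four classes the uniform share is 0.25, so a class that drifts far from a quarter of the rows would be a signal to consider resampling or class weights.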

In [None]:
evaluation = pd.DataFrame({'Model': [],
                           'Details':[],
                           'Accuracy':[],
                           'Precision':[],
                           'Recall':[],
                           'F1':[],
                           'CVS':[]})

In [None]:
'''DECISION TREE CLASSIFIER'''
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score 
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score

train_data,test_data = train_test_split(train,train_size = 0.8,random_state=3)

features = train.columns.values.tolist()[0:20]

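# Tree-based models are insensitive to feature scale, and the scaled arrays are not
# assigned back, so the models below train on the unscaled features.
standardScalerX = StandardScaler()
standardScalerX.fit_transform(train_data[features])  # fit on the training split only
standardScalerX.transform(test_data[features])       # reuse the training statistics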

# 'sqrt' is what the removed 'auto' option meant for classifiers
dtc1 = DecisionTreeClassifier(criterion= 'gini', min_samples_split=4, min_samples_leaf = 3, max_features = 'sqrt')
estimator1 = dtc1.fit(train_data[features], train_data['Price_Range'])
predict1 = dtc1.predict(test_data[features])                                                                                
acc1 = (accuracy_score(test_data['Price_Range'], predict1)*100)                                
cvs1 = (cross_val_score(dtc1, train_data[features], train_data['Price_Range'], cv=5).mean())*100
recall1 = recall_score(test_data['Price_Range'], predict1, average='weighted')*100
precision1 = precision_score(test_data['Price_Range'], predict1, average='weighted')*100
f11 = f1_score(test_data['Price_Range'], predict1, average='weighted')*100                                                                                                                                                                                                                           
                                                                                                                                                                                                                                               
r = evaluation.shape[0]
evaluation.loc[r] = ['Decision Tree','All Feature',acc1,precision1,recall1,f11,cvs1]
evaluation.sort_values(by = 'Accuracy', ascending=False)
evaluation

In [None]:
'''RANDOM FOREST CLASSIFIER'''
from sklearn.ensemble import RandomForestClassifier

rfc1 = RandomForestClassifier(min_samples_leaf = 3, min_samples_split=4, n_estimators = 100)
estimator2 = rfc1.fit(train_data[features], train_data['Price_Range'])
predict2 = rfc1.predict(test_data[features])                                                                                
acc2 = (accuracy_score(test_data['Price_Range'], predict2)*100)
cvs2 = (cross_val_score(rfc1, train_data[features], train_data['Price_Range'], cv=5).mean())*100
recall2 = recall_score(test_data['Price_Range'], predict2, average='weighted')*100
precision2 = precision_score(test_data['Price_Range'], predict2, average='weighted')*100
f12 = f1_score(test_data['Price_Range'], predict2, average='weighted')*100                                                                                                                                                                                                                           
                                                                                                                                                                                                                                               
r = evaluation.shape[0]
evaluation.loc[r] = ['Random Forest','All Feature',acc2,precision2,recall2,f12,cvs2]
evaluation.sort_values(by = 'Accuracy', ascending=False)
evaluation

In [None]:
'''EXTRA TREE CLASSIFIER'''
from sklearn.ensemble import ExtraTreesClassifier

etc1 = ExtraTreesClassifier(min_samples_leaf = 7, min_samples_split=2, n_estimators = 100)
estimator3 = etc1.fit(train_data[features], train_data['Price_Range'])
predict3 = etc1.predict(test_data[features])                                                                                
acc3 = (accuracy_score(test_data['Price_Range'], predict3)*100)
cvs3 = (cross_val_score(etc1, train_data[features], train_data['Price_Range'], cv=5).mean())*100
recall3 = recall_score(test_data['Price_Range'], predict3, average='weighted')*100
precision3 = precision_score(test_data['Price_Range'], predict3, average='weighted')*100
f13 = f1_score(test_data['Price_Range'], predict3, average='weighted')*100                                                                                                                                                                                                                           
                                                                                                                                                                                                                                               
r = evaluation.shape[0]
evaluation.loc[r] = ['Extra Tree','All Feature',acc3,precision3,recall3,f13,cvs3]
evaluation.sort_values(by = 'Accuracy', ascending=False)
evaluation

In [None]:
'''XGB CLASSIFIER'''
import xgboost as xgb
from xgboost import XGBClassifier

# 'criterion' is not an XGBoost parameter, and the target has 4 classes, so use a multiclass objective
xgb1 = XGBClassifier(learning_rate = 0.01, max_depth = 5, n_estimators = 100, objective ='multi:softprob', subsample = 1.0)
estimator4 = xgb1.fit(train_data[features], train_data['Price_Range'])
predict4 = xgb1.predict(test_data[features])                                                                                
acc4 = (accuracy_score(test_data['Price_Range'], predict4)*100)
cvs4 = (cross_val_score(xgb1, train_data[features], train_data['Price_Range'], cv=5).mean())*100
recall4 = recall_score(test_data['Price_Range'], predict4, average='weighted')*100
precision4 = precision_score(test_data['Price_Range'], predict4, average='weighted')*100
f14 = f1_score(test_data['Price_Range'], predict4, average='weighted')*100                                                                                                                                                                                                                           
                                                                                                                                                                                                                                               
r = evaluation.shape[0]
evaluation.loc[r] = ['XGB','All Feature',acc4,precision4,recall4,f14,cvs4]
evaluation.sort_values(by = 'Accuracy', ascending=False)
evaluation

In [None]:
'''STACKING CLASSIFIER'''
from mlxtend.classifier import StackingCVClassifier

stc1 = StackingCVClassifier(classifiers=[dtc1, rfc1, etc1, xgb1], meta_classifier=rfc1, random_state=1)
estimator5 = stc1.fit(train_data[features], train_data['Price_Range'])
predict5 = stc1.predict(test_data[features])                                                                                
acc5 = (accuracy_score(test_data['Price_Range'], predict5)*100)
cvs5 = (cross_val_score(stc1, train_data[features], train_data['Price_Range'], cv=5).mean())*100
recall5 = recall_score(test_data['Price_Range'], predict5, average='weighted')*100
precision5 = precision_score(test_data['Price_Range'], predict5, average='weighted')*100
f15 = f1_score(test_data['Price_Range'], predict5, average='weighted')*100                                                                                                                                                                                                                           
                                                                                                                                                                                                                                               
r = evaluation.shape[0]
evaluation.loc[r] = ['Stacking','All Feature',acc5,precision5,recall5,f15,cvs5]
evaluation.sort_values(by = 'Accuracy', ascending=False)
evaluation

In [None]:
print(f'Shape of final train data:\n{train_data.shape}')
print(f'Shape of final test data:\n{test_data.shape}')

In [None]:
import numpy as np
def plot_feature_importance(importance,names,model_type):
    feature_importance = np.array(importance)
    feature_names = np.array(names)

    data={'feature_names':feature_names,'feature_importance':feature_importance}
    fi_df = pd.DataFrame(data)

    fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True)

    plt.figure(figsize=(10,8))

    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])

    plt.title(model_type + 'FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')

In [None]:
plot_feature_importance(dtc1.feature_importances_,features,'DECISION TREE ')

In [None]:
plot_feature_importance(rfc1.feature_importances_,features,'RANDOM FOREST ')

In [None]:
plot_feature_importance(etc1.feature_importances_,features,'EXTRA TREE ')

In [None]:
plot_feature_importance(xgb1.feature_importances_,features,'XGB ')

- We have now built 5 different models using all of the given features
- I split the train data again into train and test sets, because the given test data doesn't contain the target feature, so we can't compute the key metrics (Accuracy, Precision, Recall, F1 and Cross Validation Score) on it for evaluation
- The average model accuracy is 85%
- The model with the highest accuracy is the gradient boosting model (XGB)
- We created a feature importance graph for the four base models; they show that RAM, Power Battery, Height and Width are the most important features
- The features with the highest importance are the ones with the strongest correlation calculated before
- We will retrain the models with these features only and see how the accuracy changes
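
Because the notebook reports precision, recall and F1 with `average='weighted'`, it may help to see what that average does. The sketch below is a plain-Python illustration of support-weighted precision, assuming the usual definition (per-class precision weighted by the number of true samples in each class). The function name and the toy labels are hypothetical, and scikit-learn's `precision_score` remains the reference implementation.

```python
from collections import Counter

def weighted_precision(y_true, y_pred):
    """Per-class precision, averaged with weights equal to each class's support."""
    support = Counter(y_true)      # number of true samples per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # Precision is taken as 0 when a class is never predicted
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        score += precision * n_true / total
    return score

# Toy labels, not taken from the dataset
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 2, 2, 2, 3, 0]
print(weighted_precision(y_true, y_pred))
```

With balanced classes, as in this dataset, the weighted average is close to the plain (macro) average because every class carries a similar weight.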

In [None]:
'''DECISION TREE SELECT FEATURE'''

features = ['RAM', 'Power Battery', 'Height', 'Width']

# 'sqrt' is what the removed 'auto' option meant for classifiers
dtc2 = DecisionTreeClassifier(criterion= 'gini', min_samples_split=4, min_samples_leaf = 3, max_features = 'sqrt')
estimator6 = dtc2.fit(train_data[features], train_data['Price_Range'])
predict6 = dtc2.predict(test_data[features])                                                                                
acc6 = (accuracy_score(test_data['Price_Range'], predict6)*100)                                
cvs6 = (cross_val_score(dtc2, train_data[features], train_data['Price_Range'], cv=5).mean())*100
recall6 = recall_score(test_data['Price_Range'], predict6, average='weighted')*100
precision6 = precision_score(test_data['Price_Range'], predict6, average='weighted')*100
f16 = f1_score(test_data['Price_Range'], predict6, average='weighted')*100                                                                                                                                                                                                                           
                                                                                                                                                                                                                                               
r = evaluation.shape[0]
evaluation.loc[r] = ['Decision Tree','Select Feature',acc6,precision6,recall6,f16,cvs6]
evaluation.sort_values(by = 'Accuracy', ascending=False)
evaluation

In [None]:
'''RANDOM FOREST SELECT FEATURE'''

rfc2 = RandomForestClassifier(min_samples_leaf = 3, min_samples_split=4, n_estimators = 100)
estimator7 = rfc2.fit(train_data[features], train_data['Price_Range'])
predict7 = rfc2.predict(test_data[features])                                                                                
acc7 = (accuracy_score(test_data['Price_Range'], predict7)*100)                                
cvs7 = (cross_val_score(rfc2, train_data[features], train_data['Price_Range'], cv=5).mean())*100
recall7 = recall_score(test_data['Price_Range'], predict7, average='weighted')*100
precision7 = precision_score(test_data['Price_Range'], predict7, average='weighted')*100
f17 = f1_score(test_data['Price_Range'], predict7, average='weighted')*100                                                                                                                                                                                                                           
                                                                                                                                                                                                                                               
r = evaluation.shape[0]
evaluation.loc[r] = ['Random Forest','Select Feature',acc7,precision7,recall7,f17,cvs7]
evaluation.sort_values(by = 'Accuracy', ascending=False)
evaluation

In [None]:
'''EXTRA TREE SELECT FEATURE'''

etc2 = ExtraTreesClassifier(min_samples_leaf = 7, min_samples_split=2, n_estimators = 100)
estimator8 = etc2.fit(train_data[features], train_data['Price_Range'])
predict8 = etc2.predict(test_data[features])                                                                                
acc8 = (accuracy_score(test_data['Price_Range'], predict8)*100)                                
cvs8 = (cross_val_score(etc2, train_data[features], train_data['Price_Range'], cv=5).mean())*100
recall8 = recall_score(test_data['Price_Range'], predict8, average='weighted')*100
precision8 = precision_score(test_data['Price_Range'], predict8, average='weighted')*100
f18 = f1_score(test_data['Price_Range'], predict8, average='weighted')*100                                                                                                                                                                                                                           
                                                                                                                                                                                                                                               
r = evaluation.shape[0]
evaluation.loc[r] = ['Extra Tree','Select Feature',acc8,precision8,recall8,f18,cvs8]
evaluation.sort_values(by = 'Accuracy', ascending=False)
evaluation

In [None]:
'''XGB SELECT FEATURE'''

# 'criterion' is not an XGBoost parameter, and the target has 4 classes, so use a multiclass objective
xgb2 = XGBClassifier(learning_rate = 0.01, max_depth = 5, n_estimators = 100, objective ='multi:softprob', subsample = 1.0)
estimator9 = xgb2.fit(train_data[features], train_data['Price_Range'])
predict9 = xgb2.predict(test_data[features])                                                                                
acc9 = (accuracy_score(test_data['Price_Range'], predict9)*100)                                
cvs9 = (cross_val_score(xgb2, train_data[features], train_data['Price_Range'], cv=5).mean())*100
recall9 = recall_score(test_data['Price_Range'], predict9, average='weighted')*100
precision9 = precision_score(test_data['Price_Range'], predict9, average='weighted')*100
f19 = f1_score(test_data['Price_Range'], predict9, average='weighted')*100                                                                                                                                                                                                                           
                                                                                                                                                                                                                                               
r = evaluation.shape[0]
evaluation.loc[r] = ['XGB','Select Feature',acc9,precision9,recall9,f19,cvs9]
evaluation.sort_values(by = 'Accuracy', ascending=False)
evaluation

In [None]:
'''STACKING SELECT FEATURE'''

stc2 = StackingCVClassifier(classifiers=[dtc2, rfc2, etc2, xgb2], meta_classifier=rfc2, random_state=1)
estimator10 = stc2.fit(train_data[features], train_data['Price_Range'])
predict10 = stc2.predict(test_data[features])                                                                                
acc10 = (accuracy_score(test_data['Price_Range'], predict10)*100)
cvs10 = (cross_val_score(stc2, train_data[features], train_data['Price_Range'], cv=5).mean())*100
recall10 = recall_score(test_data['Price_Range'], predict10, average='weighted')*100
precision10 = precision_score(test_data['Price_Range'], predict10, average='weighted')*100
f110 = f1_score(test_data['Price_Range'], predict10, average='weighted')*100                                                                                                                                                                                                                           
                                                                                                                                                                                                                                               
r = evaluation.shape[0]
evaluation.loc[r] = ['Stacking','Select Feature',acc10,precision10,recall10,f110,cvs10]
evaluation.sort_values(by = 'Accuracy', ascending=False)
evaluation

- As we can see:
    1. Decision Tree: Accuracy, Precision, Recall, F1 and Cross Validation Score (CVS) increased by about 11 percentage points
    2. Random Forest: increased by roughly 3.25 percentage points
    3. Extra Tree: increased by about 4.25 percentage points
    4. XGB: increased by 0.75 percentage points
    5. Stacking: increased by 4.25 percentage points
- The best model, which we will use to predict the test data, is Extra Tree with feature selection
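
The per-model gains listed above are differences between the 'Select Feature' and 'All Feature' rows of the evaluation table. Here is a minimal sketch of that comparison, assuming rows shaped like (model, details, accuracy). The `accuracy_gain` helper and the accuracy values are placeholders for illustration, not results from this notebook.

```python
def accuracy_gain(rows):
    """Return {model: select-feature accuracy minus all-feature accuracy}."""
    by_model = {}
    for model, details, accuracy in rows:
        by_model.setdefault(model, {})[details] = accuracy
    return {
        model: scores['Select Feature'] - scores['All Feature']
        for model, scores in by_model.items()
        # Skip models that are missing one of the two runs
        if {'All Feature', 'Select Feature'} <= scores.keys()
    }

# Placeholder accuracies (in %), NOT the notebook's actual results
rows = [
    ('Model A', 'All Feature', 80.0),
    ('Model B', 'All Feature', 90.0),
    ('Model A', 'Select Feature', 85.5),
    ('Model B', 'Select Feature', 91.0),
]
print(accuracy_gain(rows))   # gains in percentage points
```

Since the accuracies are already stored as percentages, the subtraction yields percentage points rather than a relative change.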

In [None]:
'''Learning Curve'''
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator1, estimator2, estimator3, estimator4, X, y, ylim=None, cv=None, n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    f, axes = plt.subplots(2, 2, figsize=(20, 14), sharey=True)
    if ylim is not None:
        plt.ylim(*ylim)  # the y-axis is shared, so this applies to all four panels

    estimators = [estimator1, estimator2, estimator3, estimator4]
    titles = ["DTC Learning Curve", "RFC Learning Curve", "ETC Learning Curve", "XGB Learning Curve"]

    # Draw one panel per estimator: mean score with a +/- 1 standard deviation band
    for ax, estimator, title in zip(axes.ravel(), estimators, titles):
        sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
        train_scores_mean = np.mean(train_scores, axis=1)
        train_scores_std = np.std(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)
        test_scores_std = np.std(test_scores, axis=1)
        ax.fill_between(sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="#ff9124")
        ax.fill_between(sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="#2492ff")
        ax.plot(sizes, train_scores_mean, 'o-', color="#ff9124", label="Training score")
        ax.plot(sizes, test_scores_mean, 'o-', color="#2492ff", label="Cross-validation score")
        ax.set_title(title, fontsize=14)
        ax.set_xlabel('Training size (m)')
        ax.set_ylabel('Score')
        ax.grid(True)
        ax.legend(loc="best")

    return plt

In [None]:
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import learning_curve

cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
plot_learning_curve(dtc2, rfc2, etc2, xgb2, train_data[features], train_data['Price_Range'], (0.67, 1.01), cv=cv, n_jobs=4)

In [None]:
'''Confusion Matrix'''
from sklearn.metrics import confusion_matrix

DTC_matrix = confusion_matrix(test_data['Price_Range'], predict1)
RF_matrix = confusion_matrix(test_data['Price_Range'], predict2)
ETC_matrix = confusion_matrix(test_data['Price_Range'], predict3)
XGB_matrix = confusion_matrix(test_data['Price_Range'], predict4) 
STACKING_matrix = confusion_matrix(test_data['Price_Range'], predict5) 

f, ax = plt.subplots(nrows=3, ncols=2, figsize=(15, 25))
sns.heatmap(DTC_matrix,annot=True, fmt="d", cbar=False, cmap="Pastel2",  ax = ax[0,0])
ax[0,0].set_title("DTC All Feature", weight='bold')
ax[0,0].set_xlabel('Predicted Labels')
ax[0,0].set_ylabel('Actual Labels')

sns.heatmap(RF_matrix,annot=True, fmt="d" ,cbar=False, cmap="tab20", ax = ax[0,1])
ax[0,1].set_title("RFC All Feature", weight='bold')
ax[0,1].set_xlabel('Predicted Labels')
ax[0,1].set_ylabel('Actual Labels')

sns.heatmap(ETC_matrix,annot=True, fmt="d", cbar=False, cmap="Paired", ax = ax[1,0])
ax[1,0].set_title("ETC All Feature", weight='bold')
ax[1,0].set_xlabel('Predicted Labels')
ax[1,0].set_ylabel('Actual Labels')

sns.heatmap(XGB_matrix,annot=True, fmt="d", cbar=False, cmap="Pastel1", ax = ax[1,1])
ax[1,1].set_title("XGB All Feature", weight='bold')
ax[1,1].set_xlabel('Predicted Labels')
ax[1,1].set_ylabel('Actual Labels')

sns.heatmap(STACKING_matrix,annot=True, fmt="d", cbar=False, cmap="Pastel1", ax = ax[2,0])
ax[2,0].set_title("Stacking All Feature", weight='bold')
ax[2,0].set_xlabel('Predicted Labels')
ax[2,0].set_ylabel('Actual Labels')

f.delaxes(ax[2, 1])
plt.tight_layout()
plt.show()

In [None]:
'''Confusion Matrix'''

DTC_matrix = confusion_matrix(test_data['Price_Range'], predict6)
RF_matrix = confusion_matrix(test_data['Price_Range'], predict7)
ETC_matrix = confusion_matrix(test_data['Price_Range'], predict8)
XGB_matrix = confusion_matrix(test_data['Price_Range'], predict9) 
STACKING_matrix = confusion_matrix(test_data['Price_Range'], predict10) 

f, ax = plt.subplots(nrows=3, ncols=2, figsize=(15, 25))
sns.heatmap(DTC_matrix,annot=True, fmt="d", cbar=False, cmap="Pastel2",  ax = ax[0,0])
ax[0,0].set_title("DTC Select Feature", weight='bold')
ax[0,0].set_xlabel('Predicted Labels')
ax[0,0].set_ylabel('Actual Labels')

sns.heatmap(RF_matrix,annot=True, fmt="d" ,cbar=False, cmap="tab20", ax = ax[0,1])
ax[0,1].set_title("RFC Select Feature", weight='bold')
ax[0,1].set_xlabel('Predicted Labels')
ax[0,1].set_ylabel('Actual Labels')

sns.heatmap(ETC_matrix,annot=True, fmt="d", cbar=False, cmap="Paired", ax = ax[1,0])
ax[1,0].set_title("ETC Select Feature", weight='bold')
ax[1,0].set_xlabel('Predicted Labels')
ax[1,0].set_ylabel('Actual Labels')

sns.heatmap(XGB_matrix,annot=True, fmt="d", cbar=False, cmap="Pastel1", ax = ax[1,1])
ax[1,1].set_title("XGB Select Feature", weight='bold')
ax[1,1].set_xlabel('Predicted Labels')
ax[1,1].set_ylabel('Actual Labels')

sns.heatmap(STACKING_matrix,annot=True, fmt="d", cbar=False, cmap="Pastel1", ax = ax[2,0])
ax[2,0].set_title("Stacking Select Feature", weight='bold')
ax[2,0].set_xlabel('Predicted Labels')
ax[2,0].set_ylabel('Actual Labels')

f.delaxes(ax[2, 1])
plt.tight_layout()
plt.show()

- Types of learning curves:

    1. Bad Learning Curve: High Bias
        - Training and testing errors converge, but both are high
        - Poor fit
        - Poor generalization
    2. Bad Learning Curve: High Variance
        - There is a large gap between the training and testing errors
        - Requires more data to improve
        - Can simplify the model with fewer or less complex features
    3. Ideal Learning Curve
        - The model generalizes to new data
        - Testing and training learning curves converge at similar values
        - The smaller the gap, the better the model generalizes
    (www.ritchieng.com)
    
![](https://lh4.googleusercontent.com/OC2DpGenL7UAswdG5EPTgGM1XA2ULiw_P7I31F-peWgBGgnF_zzlZift-RhIqMC3zRiO13xc6xCijOTCERlbGKqLLaSswOxUAMeXnOy1ZqZGF9qxvsb_oDRDejpGlvp9diXa2VM)

![](https://lh3.googleusercontent.com/grZboodXthKKYjZvJS5b5LkDovRR8Rwsxv3GxArkVOLYEYBR0jcS6XAVMtrGluytcdsurHwc9fO72KUE4MrLbAUC0C22rfL9INOMqsmtY85Y64Kn-miC6sRmc7aaSB9RiLjiL5I)
(www.mygreatlearning.com)
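
The three curve types above can be turned into a rough rule of thumb. The sketch below works on scores rather than errors (as the learning curve plots in this notebook do), so "high error" becomes "low score". The function name `diagnose_learning_curve` and the thresholds `gap_tol` and `min_score` are hypothetical choices for illustration, not values taken from this notebook.

```python
import numpy as np

def diagnose_learning_curve(train_scores_mean, test_scores_mean, gap_tol=0.05, min_score=0.85):
    """Label a learning curve from its scores at the largest training size.

    gap_tol and min_score are illustrative thresholds (assumptions).
    """
    train_final = float(np.asarray(train_scores_mean)[-1])
    test_final = float(np.asarray(test_scores_mean)[-1])
    gap = train_final - test_final

    if gap > gap_tol:
        return 'high variance'  # large gap between training and validation scores
    if test_final < min_score:
        return 'high bias'      # curves converge, but at a low score
    return 'good fit'           # curves converge at a high score

# Made-up mean scores for three hypothetical models
print(diagnose_learning_curve([1.00, 1.00], [0.70, 0.80]))
print(diagnose_learning_curve([0.75, 0.72], [0.65, 0.70]))
print(diagnose_learning_curve([0.99, 0.96], [0.90, 0.94]))
```

A fixed threshold is only a starting point; looking at the whole curve, including the spread across cross-validation splits, is still needed before drawing conclusions.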

- From our learning curves, we can see that the Extra Tree Classifier is in good condition (neither overfit nor underfit)
- From the confusion matrix we can see the counts of True Positives, False Positives, True Negatives, and False Negatives. These counts are what Accuracy, Precision, Recall, and F1 score are computed from (we calculated them earlier)
- In the confusion matrix, a true positive exists where observation is positive with a positive prediction. A false positive exists where observation is negative, with a positive prediction. A true negative exists where observation is negative with negative prediction, and a false negative indicates a positive observation with a negative prediction. (www.techopedia.com)
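
With four price classes, these counts are defined per class: each class in turn is treated as "positive" and all other classes as "negative". A minimal sketch of that bookkeeping, assuming the matrix is laid out like the output of `confusion_matrix` (rows are actual labels, columns are predicted labels). The helper `per_class_counts` and the 3-class matrix are made up for illustration.

```python
import numpy as np

def per_class_counts(cm):
    """Return TP, FP, FN, TN arrays (one entry per class) from a confusion matrix
    whose rows are actual labels and whose columns are predicted labels."""
    cm = np.asarray(cm)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp  # predicted as this class, but actually another class
    fn = cm.sum(axis=1) - tp  # actually this class, but predicted as another class
    tn = cm.sum() - (tp + fp + fn)
    return tp, fp, fn, tn

# Made-up 3-class matrix for illustration only
demo_cm = [[5, 1, 0],
           [2, 6, 1],
           [0, 1, 4]]

tp, fp, fn, tn = per_class_counts(demo_cm)
precision = tp / (tp + fp)  # per-class precision
recall = tp / (tp + fn)     # per-class recall
print(precision, recall)
```

Averaging the per-class precision and recall gives the macro-averaged scores; whether that matches the averaging used earlier in the notebook depends on the `average` argument chosen there.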

In [None]:
test['Predict'] = stc2.predict(test[features])

In [None]:
test[['ID', 'Predict']].head(3)

In [None]:
test.loc[:,'Predict'].value_counts().sort_values(ascending=False)

In [None]:
sns.countplot(x='Predict', data = test, palette='YlOrBr_r')

Conclusion:
- We have estimated the price range for the test data; the predictions are in the 'Predict' column
    1. Class 0: 262
    2. Class 1: 225
    3. Class 2: 260
    4. Class 3: 253
- The features with the highest correlation are RAM, Power Battery, Width, Height, and Int. Memory
- The features that show no correlation are Clock Speed, Weight, and Touch Screen

- Don't forget to upvote! Thank you! :)