We want to predict the price of a laptop from its component specifications. The price is a numerical variable, so this is a regression problem. The dataset is provided below. Some of the input features are descriptive text and may need some feature engineering.

Train a regression tree to predict the price of a laptop based on its components. Use feature engineering as appropriate. Report results for three cases:
1) Pre-pruning<br>
2) Post-pruning<br>
3) Ensemble technique (bagging)

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics

import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.max_columns', None)

In [2]:
# Load data
laptop_df = pd.read_csv('C:/Users/Karthik.Iyer/Downloads/AccelerateAI/Tree-Based-Models-main/10_laptop_price.csv')
laptop_df.head()

Unnamed: 0,laptop_ID,Company,Gaming,Convertible,Screen(inches),4k Screen,TouchScreen,Cpu,Intel,Highend CPU,Ram,Memory,Memory Type,OperatingSystem,Weight,Price_euros
0,1,Apple,0,0,13.3,0,0,Intel Core i5 2.3GHz,1,0,8GB,128GB,SSD,mac,1.37,1339.69
1,2,Apple,0,0,13.3,0,0,Intel Core i5 1.8GHz,1,0,8GB,128GB,Flash,mac,1.34,898.94
2,3,HP,0,0,15.6,0,0,Intel Core i5 7200U 2.5GHz,1,0,8GB,256GB,SSD,other,1.86,575.0
3,4,Apple,0,0,15.4,0,0,Intel Core i7 2.7GHz,1,1,16GB,512GB,SSD,mac,1.83,2537.45
4,5,Apple,0,0,13.3,0,0,Intel Core i5 3.1GHz,1,0,8GB,256GB,SSD,mac,1.37,1803.6


In [3]:
# Check shape
laptop_df.shape

(1303, 16)

In [4]:
# Check info
laptop_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   laptop_ID        1303 non-null   int64  
 1   Company          1303 non-null   object 
 2   Gaming           1303 non-null   int64  
 3   Convertible      1303 non-null   int64  
 4   Screen(inches)   1303 non-null   float64
 5   4k Screen        1303 non-null   int64  
 6   TouchScreen      1303 non-null   int64  
 7   Cpu              1303 non-null   object 
 8   Intel            1303 non-null   int64  
 9   Highend CPU      1303 non-null   int64  
 10  Ram              1303 non-null   object 
 11  Memory           1303 non-null   object 
 12  Memory Type      1293 non-null   object 
 13  OperatingSystem  1303 non-null   object 
 14  Weight           1303 non-null   float64
 15  Price_euros      1303 non-null   float64
dtypes: float64(3), int64(7), object(6)
memory usage: 163.0+ KB


In [5]:
# Check percentage of missing values
(laptop_df.isna().sum()/laptop_df.shape[0])*100

laptop_ID          0.00000
Company            0.00000
Gaming             0.00000
Convertible        0.00000
Screen(inches)     0.00000
4k Screen          0.00000
TouchScreen        0.00000
Cpu                0.00000
Intel              0.00000
Highend CPU        0.00000
Ram                0.00000
Memory             0.00000
Memory Type        0.76746
OperatingSystem    0.00000
Weight             0.00000
Price_euros        0.00000
dtype: float64

Only about 0.77% of the values are missing for Memory Type (10 of 1303 rows). Let's look at those rows.

In [6]:
# Memory type
laptop_df[laptop_df['Memory Type'].isna()]

Unnamed: 0,laptop_ID,Company,Gaming,Convertible,Screen(inches),4k Screen,TouchScreen,Cpu,Intel,Highend CPU,Ram,Memory,Memory Type,OperatingSystem,Weight,Price_euros
151,154,Dell,1,0,15.6,0,0,Intel Core i7 7700HQ 2.8GHz,1,1,8GB,1.0TB,,win,2.62,899.0
976,990,Lenovo,0,0,14.0,0,0,Intel Core i5 6200U 2.3GHz,1,0,4GB,508GB,,win,1.7,1002.0
1010,1024,Dell,1,0,15.6,0,0,Intel Core i5 7300HQ 2.5GHz,1,0,8GB,1.0TB,,win,2.65,949.0
1135,1150,Lenovo,0,0,15.6,0,0,Intel Core i7 6500U 2.5GHz,1,1,16GB,1.0TB,,win,2.5,1099.0
1158,1176,Lenovo,0,0,15.6,0,0,Intel Core i5 6200U 2.3GHz,1,0,8GB,1.0TB,,win,2.5,788.49
1176,1194,Lenovo,0,0,15.6,0,0,Intel Core i7 6500U 2.5GHz,1,1,4GB,1.0TB,,win,2.32,825.0
1258,1276,Lenovo,0,0,15.6,0,0,Intel Core i7 6500U 2.5GHz,1,1,8GB,1.0TB,,win,2.32,895.0
1266,1284,HP,0,0,15.6,0,0,AMD A9-Series 9410 2.9GHz,0,1,6GB,1.0TB,,win,2.04,549.99
1280,1298,HP,0,0,15.6,0,0,AMD A9-Series 9410 2.9GHz,0,1,6GB,1.0TB,,win,2.04,549.99
1294,1312,HP,0,0,15.6,0,0,AMD A9-Series 9410 2.9GHz,0,1,6GB,1.0TB,,win,2.04,549.99


In [7]:
# Lets check value counts
laptop_df['Memory Type'].value_counts()

SSD      644
HDD      575
Flash     74
Name: Memory Type, dtype: int64

Since the rows with a missing Memory Type form a very small percentage of the data, we can drop them from the data frame.
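
As a minimal sketch of the options at this step, the toy frame below (made-up rows, not the laptop data) compares dropping the rows that lack `Memory Type` with filling them using the most frequent category. Mode imputation is shown only for comparison and is an assumption of this sketch, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (not the laptop dataset)
toy = pd.DataFrame({
    "Memory Type": ["SSD", "HDD", np.nan, "SSD", np.nan],
    "Price_euros": [900.0, 450.0, 800.0, 1200.0, 550.0],
})

# Option 1: drop only the rows where this one column is missing
dropped = toy.dropna(subset=["Memory Type"])

# Option 2: fill missing entries with the most frequent category (the mode)
most_frequent = toy["Memory Type"].mode()[0]
filled = toy.assign(**{"Memory Type": toy["Memory Type"].fillna(most_frequent)})

print(len(toy), len(dropped), len(filled))
print(filled["Memory Type"].value_counts())
```

With only about ten affected rows out of 1303, either choice should have little effect on the fitted trees.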

In [8]:
# Drop missing values
laptop_df.dropna(inplace=True)

In [9]:
# Check missing values again
laptop_df.isna().sum()

laptop_ID          0
Company            0
Gaming             0
Convertible        0
Screen(inches)     0
4k Screen          0
TouchScreen        0
Cpu                0
Intel              0
Highend CPU        0
Ram                0
Memory             0
Memory Type        0
OperatingSystem    0
Weight             0
Price_euros        0
dtype: int64

In [10]:
# Check shape
laptop_df.shape

(1293, 16)

In [11]:
# list of all categorical variables
cat_var = laptop_df.select_dtypes(include='object').columns.to_list()

# Check value counts
for var in cat_var:
    print(laptop_df[var].value_counts(normalize=True)*100)
    print()

Dell         22.815159
Lenovo       22.583140
HP           20.959010
Asus         12.219644
Acer          7.965971
MSI           4.176334
Toshiba       3.712297
Apple         1.624130
Samsung       0.696056
Razer         0.541377
Mediacom      0.541377
Microsoft     0.464037
Xiaomi        0.309358
Vero          0.309358
Chuwi         0.232019
Google        0.232019
Fujitsu       0.232019
LG            0.232019
Huawei        0.154679
Name: Company, dtype: float64

Intel Core i5 7200U 2.5GHz               14.694509
Intel Core i7 7700HQ 2.8GHz              11.214230
Intel Core i7 7500U 2.7GHz               10.363496
Intel Core i7 8550U 1.8GHz                5.645785
Intel Core i5 8250U 1.6GHz                5.568445
                                           ...    
Intel Celeron Dual Core N3060 1.60GHz     0.077340
Intel Core M M3-6Y30 0.9GHz               0.077340
Intel Core i5 2.9GHz                      0.077340
AMD A6-Series 7310 2GHz                   0.077340
AMD A9-Series 9420 2.9

In [12]:
# Group the low-frequency companies into 'Others' to reduce the number of categories.
# Assigning the result back avoids relying on inplace=True on a column selection,
# which newer pandas versions warn about.
rare_companies = ['Samsung', 'Razer', 'Mediacom', 'Microsoft',
                  'Xiaomi', 'Vero', 'Chuwi', 'Google',
                  'Fujitsu', 'LG', 'Huawei']
laptop_df['Company'] = laptop_df['Company'].replace(rare_companies, 'Others')

# Bin Ram into Low / Medium / High
laptop_df['Ram'] = laptop_df['Ram'].replace(['2GB','4GB','6GB','8GB'], 'Low')
laptop_df['Ram'] = laptop_df['Ram'].replace(['16GB','12GB'], 'Medium')
laptop_df['Ram'] = laptop_df['Ram'].replace(['24GB','32GB','64GB'], 'High')

# Bin Memory (storage size) into Low / Medium / High
laptop_df['Memory'] = laptop_df['Memory'].replace(['8GB', '16GB', '32GB', '64GB'], 'Low')
laptop_df['Memory'] = laptop_df['Memory'].replace(['128GB', '180GB', '240GB', '256GB', '500GB', '512GB'], 'Medium')
laptop_df['Memory'] = laptop_df['Memory'].replace(['1TB', '1.0TB', '2TB'], 'High')

In [13]:
# Check value counts again
var_list = ['Company', 'Ram', 'Memory']

for var in var_list:
    print(laptop_df[var].value_counts(normalize=True)*100)
    print()

Dell       22.815159
Lenovo     22.583140
HP         20.959010
Asus       12.219644
Acer        7.965971
MSI         4.176334
Others      3.944316
Toshiba     3.712297
Apple       1.624130
Name: Company, dtype: float64

Low       81.051817
Medium    17.324053
High       1.624130
Name: Ram, dtype: float64

Medium    59.087394
High      35.344161
Low        5.568445
Name: Memory, dtype: float64
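
The binning above turns Ram and Memory into three categories each. A tree can also split directly on a numeric size, so an alternative is to parse the strings into gigabytes and let the tree pick its own thresholds. The sketch below shows that idea on a toy frame; the helper `to_gigabytes` and the `*_gb` column names are hypothetical, and treating 1TB as 1000GB is an assumption.

```python
import pandas as pd

def to_gigabytes(size: str) -> float:
    """Convert a size string such as '256GB' or '1.0TB' to gigabytes."""
    size = size.strip().upper()
    if size.endswith("TB"):
        return float(size[:-2]) * 1000  # assumption: 1TB counted as 1000GB
    if size.endswith("GB"):
        return float(size[:-2])
    raise ValueError(f"Unrecognised size string: {size!r}")

# Toy rows in the same string format as the Ram and Memory columns
toy = pd.DataFrame({
    "Ram": ["4GB", "8GB", "16GB"],
    "Memory": ["128GB", "1.0TB", "2TB"],
})

toy["Ram_gb"] = toy["Ram"].map(to_gigabytes)
toy["Memory_gb"] = toy["Memory"].map(to_gigabytes)
print(toy)
```

Numeric columns like these would replace the one-hot Ram and Memory dummies; whether that helps on this dataset would need to be checked by cross-validation.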



In [14]:
# Lets check Cpu and Intel together
laptop_df[['Cpu','Intel']].sample(10)

Unnamed: 0,Cpu,Intel
725,Intel Core i5 7200U 2.5GHz,1
1077,Intel Core i7 7700HQ 2.8GHz,1
269,Intel Core i7 8550U 1.8GHz,1
1260,Intel Core i5 6200U 2.3GHz,1
1197,Intel Core i7 6700HQ 2.6GHz,1
444,Intel Core i7 7700HQ 2.8GHz,1
1019,Intel Core i7 6500U 2.5GHz,1
732,AMD A9-Series A9-9420 3GHz,0
398,Intel Core i7 7700HQ 2.8GHz,1
14,Intel Core M m3 1.2GHz,1


Wherever the Cpu string names an Intel processor, the feature Intel is coded as 1, otherwise 0. The vendor information in Cpu is therefore already captured by Intel. The remaining detail in Cpu (model and clock speed) is free text with many rare values, so for simplicity we retain the already encoded Intel feature and drop Cpu.

We can also drop laptop_ID, since it is only a row identifier.
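
If the clock speed were wanted as a feature before Cpu is dropped, it could be pulled out of the text with a regular expression. This is a sketch of one possible extra feature, not a step the notebook performs; the name `clock_ghz` is hypothetical and the strings are an illustrative subset in the style of the Cpu column.

```python
import pandas as pd

# Example strings in the style of the Cpu column (illustrative subset)
cpu = pd.Series([
    "Intel Core i5 7200U 2.5GHz",
    "Intel Celeron Dual Core N3060 1.60GHz",
    "AMD A6-Series 7310 2GHz",
    "Intel Core i7",  # no clock speed in the text, so the result is NaN
])

# Capture the number that sits immediately before 'GHz'
clock_ghz = cpu.str.extract(r"(\d+(?:\.\d+)?)\s*GHz", expand=False).astype(float)
print(clock_ghz.tolist())
```

Rows whose text has no `GHz` value come back as NaN and would need handling before fitting a tree.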

In [15]:
# Drop laptop_ID and Cpu
laptop_df.drop(['laptop_ID', 'Cpu'], axis=1, inplace=True)

In [16]:
# Lets check data
laptop_df.sample(5)

Unnamed: 0,Company,Gaming,Convertible,Screen(inches),4k Screen,TouchScreen,Intel,Highend CPU,Ram,Memory,Memory Type,OperatingSystem,Weight,Price_euros
628,Asus,0,1,13.3,0,1,1,0,Low,High,HDD,win,1.5,639.01
8,Asus,0,0,14.0,0,0,1,1,Medium,Medium,SSD,win,1.3,1495.0
415,Dell,0,0,15.6,0,0,1,0,Low,Medium,SSD,linux,2.3,598.9
101,HP,0,0,15.6,0,0,0,0,Low,Medium,HDD,win,2.1,349.0
112,Lenovo,0,1,13.3,0,1,1,0,Low,Medium,SSD,win,1.37,1399.0


In [17]:
# Lets convert categorical to dummies
cat_var = laptop_df.select_dtypes(include='object').columns.to_list()

laptop_df_onehot = pd.get_dummies(laptop_df, columns=cat_var)

In [18]:
# Check data
laptop_df_onehot.sample(5)

Unnamed: 0,Gaming,Convertible,Screen(inches),4k Screen,TouchScreen,Intel,Highend CPU,Weight,Price_euros,Company_Acer,Company_Apple,Company_Asus,Company_Dell,Company_HP,Company_Lenovo,Company_MSI,Company_Others,Company_Toshiba,Ram_High,Ram_Low,Ram_Medium,Memory_High,Memory_Low,Memory_Medium,Memory Type_Flash,Memory Type_HDD,Memory Type_SSD,OperatingSystem_linux,OperatingSystem_mac,OperatingSystem_other,OperatingSystem_win
1276,0,0,15.6,0,0,1,0,2.3,459.0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,1
498,0,0,13.3,0,0,1,0,1.4,949.0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,1
773,1,0,15.6,0,0,1,1,2.62,1099.0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,1
1036,0,0,15.6,0,0,1,0,2.04,742.0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,1
545,0,0,15.6,0,0,1,0,2.1,705.5,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,1


In [19]:
# Just to ensure there are no missing values in onehot data frame
laptop_df_onehot.isna().sum().sum()

0

In [20]:
# Fit a fully grown decision tree (no depth limit) as a baseline
X = laptop_df_onehot.drop('Price_euros', axis=1)
y = laptop_df_onehot['Price_euros']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=1)

clf = DecisionTreeRegressor()
clf = clf.fit(X_train, y_train)

print('Train Score:', clf.score(X_train, y_train))
print('Test Score:', clf.score(X_test, y_test))

Train Score: 0.9757009944822207
Test Score: 0.5552216652651857
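
For a regressor, `score` returns the coefficient of determination R², so the large gap between the train and test values above points to overfitting. R² has no units, so it can help to report errors in euros as well. The sketch below computes R², MAE and RMSE with `sklearn.metrics` on a few made-up prices; the arrays are illustrative, not model output.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true and predicted prices (euros), for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
# Take the square root manually so the code does not depend on the
# version-specific way of requesting RMSE directly
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"R2={r2:.3f}  MAE={mae:.1f}  RMSE={rmse:.1f}")
```

The same three calls could be applied to `y_test` and `clf.predict(X_test)` in this notebook.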


In [21]:
# Check depth
clf.get_depth()

23

In [22]:
# Check feature importance
pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)

Ram_Low                  0.336518
Weight                   0.218229
Memory Type_SSD          0.091946
Highend CPU              0.059540
Ram_High                 0.051218
Screen(inches)           0.042170
Gaming                   0.024543
4k Screen                0.021298
Memory_Low               0.018575
Intel                    0.017896
Memory Type_HDD          0.016773
Company_Others           0.014420
Company_Asus             0.013046
OperatingSystem_win      0.012394
Company_Lenovo           0.011545
Company_Dell             0.010055
Company_MSI              0.008250
Convertible              0.007149
TouchScreen              0.006792
Company_HP               0.006781
Memory_Medium            0.003846
Company_Toshiba          0.001957
Company_Acer             0.001488
Memory_High              0.001383
OperatingSystem_mac      0.001288
OperatingSystem_linux    0.000682
OperatingSystem_other    0.000195
Company_Apple            0.000014
Memory Type_Flash        0.000010
Ram_Medium    

### Pre-Pruning with hyperparameters

In [23]:
params = {'max_depth' : range(1,11), 'min_samples_split' : range(10,60,10)}

clf_gs = GridSearchCV(DecisionTreeRegressor(), cv=5, param_grid=params)

clf_gs.fit(X_train, y_train)

print(clf_gs.best_params_)
print(clf_gs.best_score_)

{'max_depth': 9, 'min_samples_split': 30}
0.5873285085178571


In [24]:
# Check on test set
clf_gs.score(X_test, y_test)

0.7083329314640758

### Post-Pruning

In [25]:
# Cost-complexity pruning: for each ccp_alpha, rerun the grid search over max_depth and min_samples_split

ccp = np.arange(0, 1, 0.1)

for v in ccp:
    clf_p = DecisionTreeRegressor(random_state=4, ccp_alpha=v)
    clf_p_gs = GridSearchCV(clf_p, cv=5, param_grid=params)
    clf_p_gs.fit(X_train, y_train)
    print('For ccp_alpha=', v)
    print(clf_p_gs.best_params_)
    print(clf_p_gs.best_score_)
    print()

For ccp_alpha= 0.0
{'max_depth': 9, 'min_samples_split': 30}
0.5873285085178571

For ccp_alpha= 0.1
{'max_depth': 9, 'min_samples_split': 30}
0.5873285085178571

For ccp_alpha= 0.2
{'max_depth': 9, 'min_samples_split': 30}
0.5873285085178571

For ccp_alpha= 0.30000000000000004
{'max_depth': 9, 'min_samples_split': 30}
0.5873285085178571

For ccp_alpha= 0.4
{'max_depth': 9, 'min_samples_split': 30}
0.5873285085178571

For ccp_alpha= 0.5
{'max_depth': 9, 'min_samples_split': 30}
0.5873285085178571

For ccp_alpha= 0.6000000000000001
{'max_depth': 9, 'min_samples_split': 30}
0.5873285085178571

For ccp_alpha= 0.7000000000000001
{'max_depth': 9, 'min_samples_split': 30}
0.5873285085178571

For ccp_alpha= 0.8
{'max_depth': 9, 'min_samples_split': 30}
0.5873285085178571

For ccp_alpha= 0.9
{'max_depth': 9, 'min_samples_split': 30}
0.5873285085178571



In [26]:
# Check on test set (clf_p_gs is the last search fitted in the loop, i.e. ccp_alpha=0.9)
clf_p_gs.score(X_test, y_test)

0.7083329314640758

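**The score is the same for Pre-Pruning and Post-Pruning: R² on the test set is about 0.708 in both cases. The best cross-validation score is also identical for every ccp_alpha tried, which suggests that values between 0 and 0.9 are too small to prune anything from these trees.**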
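
One way to pick candidate `ccp_alpha` values on the right scale is to take them from the tree itself with `cost_complexity_pruning_path`, which returns the effective alphas at which subtrees get pruned. The sketch below does this on synthetic data with price-like magnitudes; `X_demo`, `y_demo` and the thinning of the alpha list are assumptions for illustration, and the same steps would apply to `X_train` and `y_train`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a price-like target (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = 500 + 200 * X_demo[:, 0] + rng.normal(0, 150, size=300)

# Effective alphas of the fully grown tree, in ascending order
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_demo, y_demo)
alphas = np.unique(np.clip(path.ccp_alphas, 0, None))  # guard against tiny negatives

# Thin the list so the grid search stays small
candidate_alphas = alphas[:: max(1, len(alphas) // 20)]

search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid={"ccp_alpha": candidate_alphas}, cv=5)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["ccp_alpha"]
pruned_leaves = search.best_estimator_.get_n_leaves()
full_leaves = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo).get_n_leaves()

print("largest alpha on the path:", alphas.max())
print("best alpha:", best_alpha, "| leaves:", pruned_leaves, "vs", full_leaves)
```

Because the impurity is a squared error in euros, the alphas on the path are typically far larger than 0.9, which is why a 0 to 0.9 grid may leave the tree untouched.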

### Ensemble (Random Forest)

In [27]:
# Fit a Random Forest (a bagging-based ensemble of decision trees)
from sklearn.ensemble import RandomForestRegressor

In [28]:
params = {'max_depth' : range(1,11), 'min_samples_split' : range(10,60,10)}

clf_gs_rf = GridSearchCV(RandomForestRegressor(n_jobs=-1, n_estimators=100), cv=5, param_grid=params)

clf_gs_rf.fit(X_train, y_train)

print(clf_gs_rf.best_params_)
print(clf_gs_rf.best_score_)

{'max_depth': 10, 'min_samples_split': 10}
0.6679725639017485


In [29]:
# Check on test set
clf_gs_rf.score(X_test, y_test)

0.7694027387701435

**With the ensemble, R² on the test set improved to about 0.77 (from about 0.71 for the single pruned tree).**
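
The task asks for bagging, and the notebook uses a Random Forest, which builds on bagging. In recent scikit-learn versions, to our understanding, `RandomForestRegressor` considers all features at each split by default, which makes it behave like bagged trees. For comparison, plain bagging can also be written explicitly with `BaggingRegressor`. The sketch below uses synthetic data (`X_demo`, `y_demo` are assumptions); on the laptop data one would pass `X_train` and `y_train` instead.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a price-like target (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 3))
y_demo = 500 + 200 * X_demo[:, 0] + rng.normal(0, 150, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.8, random_state=1)

# A single fully grown tree as the reference model
single_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

# Bagging: 100 trees, each fitted on a bootstrap sample, predictions averaged.
# The base tree is passed positionally because its keyword name differs
# between scikit-learn versions.
bagged = BaggingRegressor(DecisionTreeRegressor(random_state=0),
                          n_estimators=100, random_state=0).fit(X_tr, y_tr)

tree_r2 = single_tree.score(X_te, y_te)
bag_r2 = bagged.score(X_te, y_te)
print(f"single tree R2: {tree_r2:.3f}")
print(f"bagged trees R2: {bag_r2:.3f}")
```

Averaging many bootstrap-trained trees reduces the variance of a single deep tree, which is the likely reason the ensemble scores higher on the test set here.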