### Assignment 01 - Classification using Decision Tree

Consider the dataset Assignment01_Lasagna_Triers.csv.
File location: https://drive.google.com/drive/folders/1Jl8iDu7nGmrqCECbrLqmVafgwE5PYfiU

The file contains details of people in an area who have either tried Lasagna or not in an Italian restaurant chain. 
Train a decision tree classifier using the given data to predict whether someone has tried Lasagna or not.
Use an 80/20 split for train/test.

1) What is the train and test accuracy score?

2) Which features come out to be important?

3) Does grouping 'age' and 'income' into 5 categories each improve the prediction score?

In [1]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn import tree
pd.set_option('display.max_columns', None)

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Read and display the data file
df = pd.read_excel('/Users/riteshturlapaty/ai-ml-learning/AccelerateAI/7.DecisionTree/Assignments/Assignment01.xlsx')
lasagna_triers=df
lasagna_triers.head(5)

Unnamed: 0,Person,Age,Weight,Income,Pay Type,Car Value,CC Debt,Gender,Live Alone,Dwell Type,Mall Trips,Nbhd,Have Tried
0,1,48,175,65500,Hourly,2190,3510,Male,No,Home,7,East,No
1,2,33,202,29100,Hourly,2110,740,Female,No,Condo,4,East,Yes
2,3,51,188,32200,Salaried,5140,910,Male,No,Condo,1,East,No
3,4,56,244,19000,Hourly,700,1620,Female,No,Home,3,West,No
4,5,28,218,81400,Salaried,26620,600,Male,No,Apt,3,West,Yes


In [3]:
# Empty DataFrame that collects the test score of each modelling case
df_case_score = pd.DataFrame(columns=['Case', 'Score'])

In [4]:
lasagna_triers.shape

(856, 13)

In [5]:
lasagna_triers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 856 entries, 0 to 855
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Person      856 non-null    int64 
 1   Age         856 non-null    int64 
 2   Weight      856 non-null    int64 
 3   Income      856 non-null    int64 
 4   Pay Type    856 non-null    object
 5   Car Value   856 non-null    int64 
 6   CC Debt     856 non-null    int64 
 7   Gender      856 non-null    object
 8   Live Alone  856 non-null    object
 9   Dwell Type  856 non-null    object
 10  Mall Trips  856 non-null    int64 
 11  Nbhd        856 non-null    object
 12  Have Tried  856 non-null    object
dtypes: int64(7), object(6)
memory usage: 87.1+ KB


In [6]:
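# Segregate columns by data type so the categorical ones can be converted to numeric
# Note: 'Person' looks like a row identifier rather than a real predictor, yet it is passed to the model as a feature
cat_var_list = lasagna_triers[['Pay Type','Gender','Live Alone','Dwell Type','Nbhd']]
num_var_list = lasagna_triers[['Person','Age','Weight','Income','Car Value','CC Debt','Mall Trips']]
# .copy() gives an independent frame, so mapping the target later does not write into a slice of lasagna_triers
target_list=lasagna_triers[['Have Tried']].copy()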

In [7]:
# Create dummy values for columns where values are non-numeric
cat_var_dummies=pd.get_dummies(cat_var_list)
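
To make the encoding step concrete, here is a small sketch of what `pd.get_dummies` produces. The three-row frame `demo` is a hypothetical stand-in built from the first rows displayed above, not the full dataset, and `dtype=int` is an added assumption so that the output is 0/1 regardless of the pandas version in use.

```python
import pandas as pd

# Hypothetical three-row sample taken from the head() output above
demo = pd.DataFrame({
    'Pay Type': ['Hourly', 'Hourly', 'Salaried'],
    'Gender': ['Male', 'Female', 'Male'],
})

# Each categorical column becomes one 0/1 column per level,
# named "<column>_<level>"
demo_dummies = pd.get_dummies(demo, dtype=int)
print(demo_dummies)
```

Each original column expands into as many indicator columns as it has distinct levels, which is why the feature table grows from 13 to 19 columns after encoding.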

In [8]:
# Convert target variable values into numeric. Map the values
target_list['Have Tried'] = target_list['Have Tried'].map({'Yes':1, 'No':0})

# Check the class distribution of the target
target_list['Have Tried'].value_counts()

1    495
0    361
Name: Have Tried, dtype: int64

In [9]:
# Concatenate the categorical (dummies) and numerical list
new_lasagna_triers = pd.concat([num_var_list,cat_var_dummies], axis=1)

In [10]:
new_lasagna_triers.head(2)

Unnamed: 0,Person,Age,Weight,Income,Car Value,CC Debt,Mall Trips,Pay Type_Hourly,Pay Type_Salaried,Gender_Female,Gender_Male,Live Alone_No,Live Alone_Yes,Dwell Type_Apt,Dwell Type_Condo,Dwell Type_Home,Nbhd_East,Nbhd_South,Nbhd_West
0,1,48,175,65500,2190,3510,7,1,0,0,1,1,0,0,0,1,1,0,0
1,2,33,202,29100,2110,740,4,1,0,1,0,1,0,0,1,0,1,0,0


In [11]:
# Prepare X and y
X=new_lasagna_triers
y=target_list

In [12]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=42)

In [13]:
clf = tree.DecisionTreeClassifier(criterion='gini')
clf = clf.fit(X_train, y_train)

In [14]:
# Predict on test data
p_pred = clf.predict_proba(X_test)
y_pred = clf.predict(X_test)

score_ = clf.score(X_test, y_test)
print("Test data score: ",score_)

Test data score:  0.7790697674418605
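
Question 1 asks for both the train and the test accuracy, while the cell above reports only the test score. The sketch below shows how both could be obtained from the same fitted tree. It runs on synthetic data from `make_classification` as a stand-in (an assumption, since the assignment file is not bundled here), and the names `X_demo`, `y_demo` and `demo_clf` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same number of rows as the assignment data
X_demo, y_demo = make_classification(n_samples=856, n_features=10, random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.80, random_state=42
)

# random_state on the tree makes the fit repeatable
demo_clf = DecisionTreeClassifier(criterion='gini', random_state=42).fit(X_tr, y_tr)

# score() returns mean accuracy on whatever data it is given
train_score = demo_clf.score(X_tr, y_tr)
test_score = demo_clf.score(X_te, y_te)
print("Train accuracy:", train_score)
print("Test accuracy: ", test_score)
```

In the notebook itself the equivalent call would be `clf.score(X_train, y_train)`. An unpruned tree usually fits its training data almost perfectly, so a large gap between the two numbers is a sign of overfitting.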


In [15]:
# Assign values of model score
df_case_score.loc[len(df_case_score.index)] = ["Model score without grouping of Age and Income",score_]

In [16]:
df_case_score

Unnamed: 0,Case,Score
0,Model score without grouping of Age and Income,0.77907


## 2) Which features come out to be important?

In [17]:
# Feature importance
feature_imp = pd.Series(clf.feature_importances_, index=X.columns)
feature_imp.sort_values(ascending=False,inplace=True)
feature_imp

Mall Trips           0.343350
Age                  0.114192
Nbhd_West            0.092808
Income               0.081247
CC Debt              0.063862
Person               0.058079
Car Value            0.050476
Pay Type_Hourly      0.044529
Weight               0.037745
Dwell Type_Condo     0.036685
Nbhd_East            0.032088
Live Alone_Yes       0.018445
Nbhd_South           0.012268
Gender_Female        0.011209
Gender_Male          0.003017
Pay Type_Salaried    0.000000
Live Alone_No        0.000000
Dwell Type_Apt       0.000000
Dwell Type_Home      0.000000
dtype: float64

In [18]:
# Top 5 features
top5_features = list(feature_imp.index[:5])
top5_features

['Mall Trips', 'Age', 'Nbhd_West', 'Income', 'CC Debt']
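
Because one-hot encoding spreads a categorical column over several dummy columns, its importance is split across them in the list above. The sketch below sums the dummy importances back onto their source column. It reuses the importance values printed above and assumes the dummy names follow the `<column>_<level>` pattern produced by `pd.get_dummies`; the helper `base_column` is hypothetical.

```python
import pandas as pd

# Importance values copied from the output above
feature_imp = pd.Series({
    'Mall Trips': 0.343350, 'Age': 0.114192, 'Nbhd_West': 0.092808,
    'Income': 0.081247, 'CC Debt': 0.063862, 'Person': 0.058079,
    'Car Value': 0.050476, 'Pay Type_Hourly': 0.044529, 'Weight': 0.037745,
    'Dwell Type_Condo': 0.036685, 'Nbhd_East': 0.032088,
    'Live Alone_Yes': 0.018445, 'Nbhd_South': 0.012268,
    'Gender_Female': 0.011209, 'Gender_Male': 0.003017,
    'Pay Type_Salaried': 0.0, 'Live Alone_No': 0.0,
    'Dwell Type_Apt': 0.0, 'Dwell Type_Home': 0.0,
})

cat_cols = ['Pay Type', 'Gender', 'Live Alone', 'Dwell Type', 'Nbhd']

def base_column(name):
    """Map a dummy column name back to the column it was derived from."""
    prefix = name.rsplit('_', 1)[0]
    return prefix if prefix in cat_cols else name

# Group the index labels by source column and add up the importances
grouped_imp = feature_imp.groupby(base_column).sum().sort_values(ascending=False)
print(grouped_imp)
```

Viewed this way, `Nbhd` as a whole (about 0.137) ranks second, ahead of `Age`, which a per-dummy ranking does not show.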

## 3) Does grouping 'age' and 'income' into 5 categories each improve the prediction score?

In [19]:
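# Note: this is another name for df (and lasagna_triers), not a copy,
# so the grouping applied below also changes those frames
df_group_data=df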
df_group_data.head(2)

Unnamed: 0,Person,Age,Weight,Income,Pay Type,Car Value,CC Debt,Gender,Live Alone,Dwell Type,Mall Trips,Nbhd,Have Tried
0,1,48,175,65500,Hourly,2190,3510,Male,No,Home,7,East,No
1,2,33,202,29100,Hourly,2110,740,Female,No,Condo,4,East,Yes


In [20]:
df_group_data[['Age']].describe()

Unnamed: 0,Age
count,856.0
mean,38.78271
std,9.610763
min,22.0
25%,31.0
50%,37.5
75%,46.0
max,64.0


In [21]:
df_group_data[['Income']].describe()

Unnamed: 0,Income
count,856.0
mean,45266.939252
std,28631.290583
min,2600.0
25%,24475.0
50%,39950.0
75%,58225.0
max,190500.0


In [22]:
# Create function to map age values
def map_age(v_age):
    if v_age>=20 and v_age<=30:
        return_age=1
    elif v_age>=31 and v_age<=40:
        return_age=2
    elif v_age>=41 and v_age<=50:
        return_age=3
    elif v_age>=51 and v_age<=60:
        return_age=4
    elif v_age>=61:
        return_age=5
    else:
        return_age=6
    return return_age

In [23]:
# Create function to map Income values
def map_income(v_income):
    if v_income>=0 and v_income<=40000:
        return_income=1
    elif v_income>=40001 and v_income<=80000:
        return_income=2
    elif v_income>=80001 and v_income<=120000:
        return_income=3
    elif v_income>=120001 and v_income<=160000:
        return_income=4
    elif v_income>=160001 and v_income<=200000:
        return_income=5
    elif v_income>=200001:
        return_income=6
    else:
        return_income=7
    return return_income

In [24]:
print(map_income(100000))

3
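
The same grouping can also be expressed with `pd.cut`, which assigns each value to a bin in one vectorised call. This is a sketch of an alternative, not the method used in this notebook. The bin edges mirror the ranges in `map_age` and `map_income` for integer inputs, and the sample values are illustrative. One difference to note: values outside the bins become `NaN` here, whereas the functions above return a fallback code.

```python
import numpy as np
import pandas as pd

# Illustrative values, including the boundaries of the first groups
ages = pd.Series([22, 30, 31, 48, 64])
incomes = pd.Series([2600, 40000, 65500, 100000, 190500])

# Bins are right-inclusive: (19, 30], (30, 40], (40, 50], (50, 60], (60, inf)
age_group = pd.cut(
    ages, bins=[19, 30, 40, 50, 60, np.inf], labels=[1, 2, 3, 4, 5]
).astype(int)

# include_lowest=True makes the first bin include 0 itself
income_group = pd.cut(
    incomes,
    bins=[0, 40000, 80000, 120000, 160000, 200000],
    labels=[1, 2, 3, 4, 5],
    include_lowest=True,
).astype(int)

print(age_group.tolist())
print(income_group.tolist())
```

Fixed-width bins leave the upper groups nearly empty on skewed data such as income. `pd.qcut` is a possible alternative that produces groups of roughly equal size.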


In [25]:
# Map Age to its group code (1-5). After mapping, Age is treated as categorical data
df_group_data['Age'] = df_group_data['Age'].apply(map_age)

# Check how many rows fall into each Age group
df_group_data['Age'].value_counts()

2    312
3    219
1    204
4    110
5     11
Name: Age, dtype: int64

In [26]:
# Map Income to its group code (1-5). After mapping, Income is treated as categorical data
df_group_data['Income'] = df_group_data['Income'].apply(map_income)

# Check how many rows fall into each Income group
df_group_data['Income'].value_counts()

1    430
2    338
3     62
4     24
5      2
Name: Income, dtype: int64

In [27]:
df_group_data.head()

Unnamed: 0,Person,Age,Weight,Income,Pay Type,Car Value,CC Debt,Gender,Live Alone,Dwell Type,Mall Trips,Nbhd,Have Tried
0,1,3,175,2,Hourly,2190,3510,Male,No,Home,7,East,No
1,2,2,202,1,Hourly,2110,740,Female,No,Condo,4,East,Yes
2,3,4,188,1,Salaried,5140,910,Male,No,Condo,1,East,No
3,4,4,244,1,Hourly,700,1620,Female,No,Home,3,West,No
4,5,1,218,3,Salaried,26620,600,Male,No,Apt,3,West,Yes


In [28]:
# Convert the grouped Age and Income codes to object dtype so that get_dummies one-hot encodes them
lasagna_triers['Age'] = lasagna_triers['Age'].astype('object')
lasagna_triers['Income'] = lasagna_triers['Income'].astype('object')

In [29]:
# Segregate columns by data type; the grouped Age and Income now count as categorical
cat_var_list = lasagna_triers[['Age','Income','Pay Type','Gender','Live Alone','Dwell Type','Nbhd']]
num_var_list = lasagna_triers[['Person','Weight','Car Value','CC Debt','Mall Trips']]
# .copy() gives an independent frame, so mapping the target later does not write into a slice of lasagna_triers
target_list=lasagna_triers[['Have Tried']].copy()

In [30]:
# Create dummy values for columns where values are non-numeric
cat_var_dummies=pd.get_dummies(cat_var_list)

In [31]:
cat_var_dummies.head(5)

Unnamed: 0,Age_1,Age_2,Age_3,Age_4,Age_5,Income_1,Income_2,Income_3,Income_4,Income_5,Pay Type_Hourly,Pay Type_Salaried,Gender_Female,Gender_Male,Live Alone_No,Live Alone_Yes,Dwell Type_Apt,Dwell Type_Condo,Dwell Type_Home,Nbhd_East,Nbhd_South,Nbhd_West
0,0,0,1,0,0,0,1,0,0,0,1,0,0,1,1,0,0,0,1,1,0,0
1,0,1,0,0,0,1,0,0,0,0,1,0,1,0,1,0,0,1,0,1,0,0
2,0,0,0,1,0,1,0,0,0,0,0,1,0,1,1,0,0,1,0,1,0,0
3,0,0,0,1,0,1,0,0,0,0,1,0,1,0,1,0,0,0,1,0,0,1
4,1,0,0,0,0,0,0,1,0,0,0,1,0,1,1,0,1,0,0,0,0,1


In [32]:
# Convert target variable values into numeric. Map the values
target_list['Have Tried'] = target_list['Have Tried'].map({'Yes':1, 'No':0})

In [33]:
# Check the class distribution of the target
target_list['Have Tried'].value_counts()

1    495
0    361
Name: Have Tried, dtype: int64

In [34]:
# Concatenate the categorical (dummies) and numerical list
new_lasagna_triers = pd.concat([num_var_list,cat_var_dummies], axis=1)

In [35]:
new_lasagna_triers.head(2)

Unnamed: 0,Person,Weight,Car Value,CC Debt,Mall Trips,Age_1,Age_2,Age_3,Age_4,Age_5,Income_1,Income_2,Income_3,Income_4,Income_5,Pay Type_Hourly,Pay Type_Salaried,Gender_Female,Gender_Male,Live Alone_No,Live Alone_Yes,Dwell Type_Apt,Dwell Type_Condo,Dwell Type_Home,Nbhd_East,Nbhd_South,Nbhd_West
0,1,175,2190,3510,7,0,0,1,0,0,0,1,0,0,0,1,0,0,1,1,0,0,0,1,1,0,0
1,2,202,2110,740,4,0,1,0,0,0,1,0,0,0,0,1,0,1,0,1,0,0,1,0,1,0,0


In [36]:
new_lasagna_triers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 856 entries, 0 to 855
Data columns (total 27 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Person             856 non-null    int64
 1   Weight             856 non-null    int64
 2   Car Value          856 non-null    int64
 3   CC Debt            856 non-null    int64
 4   Mall Trips         856 non-null    int64
 5   Age_1              856 non-null    uint8
 6   Age_2              856 non-null    uint8
 7   Age_3              856 non-null    uint8
 8   Age_4              856 non-null    uint8
 9   Age_5              856 non-null    uint8
 10  Income_1           856 non-null    uint8
 11  Income_2           856 non-null    uint8
 12  Income_3           856 non-null    uint8
 13  Income_4           856 non-null    uint8
 14  Income_5           856 non-null    uint8
 15  Pay Type_Hourly    856 non-null    uint8
 16  Pay Type_Salaried  856 non-null    uint8
 17  Gender_Female   

In [37]:
target_list.head(2)

Unnamed: 0,Have Tried
0,0
1,1


In [38]:
# Prepare X and y
X=new_lasagna_triers
y=target_list

In [39]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=42)

In [40]:
clf = tree.DecisionTreeClassifier(criterion='gini')
clf = clf.fit(X_train, y_train)

In [41]:
# Predict on test data
p_pred = clf.predict_proba(X_test)
y_pred = clf.predict(X_test)

score_ = clf.score(X_test, y_test)

In [42]:
# Assign values of model score
df_case_score.loc[len(df_case_score.index)] = ["Model score with grouping of Age and Income",score_]

In [43]:
df_case_score

Unnamed: 0,Case,Score
0,Model score without grouping of Age and Income,0.77907
1,Model score with grouping of Age and Income,0.755814


As can be seen, after grouping the Age and Income columns the test score drops from 0.779 to 0.756, so grouping does not improve the prediction score on this split.
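
Both scores come from a single 80/20 split, where the gap between 0.779 and 0.756 amounts to 4 of the 172 test rows (134 versus 130 correct). A comparison averaged over several folds is usually more stable. The sketch below shows the idea with `cross_val_score` on synthetic data from `make_classification`. The data and the names `X_demo`, `y_demo` and `scores` are assumptions for illustration only; in the notebook, each feature table (ungrouped and grouped) would be passed in turn.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same number of rows as the assignment data
X_demo, y_demo = make_classification(n_samples=856, n_features=10, random_state=42)

# Five stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One accuracy value per fold
scores = cross_val_score(
    DecisionTreeClassifier(criterion='gini', random_state=42),
    X_demo, y_demo, cv=cv,
)
print("Mean accuracy:", scores.mean())
print("Std deviation:", scores.std())
```

If the difference between the two feature sets is smaller than the fold-to-fold standard deviation, the single-split result should be read with caution.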