# Project - Happy Customer
**Name: Zimin Lee**

In [1]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score
import xgboost as xgb
from xgboost import plot_tree
from sklearn.model_selection import train_test_split

**Background:**

We are one of the fastest growing startups in the logistics and delivery domain. We work with several partners and make on-demand delivery to our customers. During the COVID-19 pandemic, we are facing several different challenges and every day we are trying to address these challenges.

We thrive on making our customers happy. As a growing startup with a global expansion strategy, we know that we need to make our customers happy and the only way to do that is to measure how happy each customer is. If we can predict what makes our customers happy or unhappy, we can then take necessary actions.

Getting feedback from customers is not easy either, but we do our best to get constant feedback from our customers. This is a crucial function to improve our operations across all levels.

We recently did a survey to a select customer cohort. You are presented with a subset of this data. We will be using the remaining data as a private test set.


In [2]:
#Read File 
df = pd.read_csv("C:\\Users\\zzzim\\Desktop\\Apziva\\ACME-HappinessSurvey2020.csv", sep=',')
df.head(10)

Unnamed: 0,Y,X1,X2,X3,X4,X5,X6
0,0,3,3,3,4,2,4
1,0,3,2,3,5,4,3
2,1,5,3,3,3,3,5
3,0,5,4,3,3,3,5
4,0,5,4,3,3,3,5
5,1,5,5,3,5,5,5
6,0,3,1,2,2,1,3
7,1,5,4,4,4,4,5
8,0,4,1,4,4,4,4
9,0,4,4,4,2,5,5


**Variable dictionary**:

Y = target attribute (Y) with values indicating 0 (unhappy) and 1 (happy) customers\
X1 = my order was delivered on time\
X2 = contents of my order was as I expected\
X3 = I ordered everything I wanted to order\
X4 = I paid a good price for my order\
X5 = I am satisfied with my courier\
X6 = the app makes ordering easy for me\
\
Attributes X1 to X6 indicate the responses for each question and have values from 1 to 5 where the smaller number indicates less and the higher number indicates more towards the answer.


In [13]:
#Renaming columns to meaningful features
df = df.rename(columns={'Y': 'Happy','X1': 'DeliveredOnTime','X2': 'ExpectedContent','X3': 'DesiredOrder','X4': 'GoodPrice','X5': 'SatisfiedOrder','X6': 'EasyOrderProcess'})
df

Unnamed: 0,Happy,DeliveredOnTime,ExpectedContent,DesiredOrder,GoodPrice,SatisfiedOrder,EasyOrderProcess
0,0,3,3,3,4,2,4
1,0,3,2,3,5,4,3
2,1,5,3,3,3,3,5
3,0,5,4,3,3,3,5
4,0,5,4,3,3,3,5
...,...,...,...,...,...,...,...
121,1,5,2,3,4,4,3
122,1,5,2,3,4,2,5
123,1,5,3,3,4,4,5
124,0,4,3,3,4,4,5


**Goal(s):**

> Predict if a customer is happy or not based on the answers they give to questions asked.

**Success Metrics:**

> Reach 73% accuracy score or above.



In [4]:
#Check for imbalance data
df[['Happy']].value_counts()

Happy
1        69
0        57
dtype: int64

The target classes are mildly imbalanced (69 happy vs. 57 unhappy customers). The dataset is therefore downsampled to the size of the smaller class (57 rows per class), so that the classifier is not biased toward the majority class and can predict each class well.
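To put the imbalance in numbers, the short sketch below computes the class shares and the accuracy of a model that always predicts the majority class. The counts are copied from the `value_counts()` output above; the variable names are illustrative only.

```python
# Class counts taken from the value_counts() output above
counts = {1: 69, 0: 57}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a classifier that always predicts the majority class
majority_baseline = max(counts.values()) / total

print(f"share happy:   {shares[1]:.3f}")
print(f"share unhappy: {shares[0]:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

The majority-class baseline comes out at roughly 55%, which gives a reference point for the 73% accuracy target.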

In [5]:
#Resample and downsampling step

from sklearn.utils import resample
from sklearn.utils import shuffle

#Balancing classes
#Split each class into a separate DataFrame
happy = df[df['Happy']==1]
unhappy = df[df['Happy']==0]

#Draw 57 rows per class (the size of the smaller class) with the resample function
#Note: replace=True samples with replacement, so the same row can be drawn more than once
df_majority_downsampled_happy = resample(happy, replace = True, n_samples = 57)
df_majority_downsampled_unhappy = resample(unhappy, replace = True, n_samples = 57)

df_ = pd.concat([df_majority_downsampled_happy, df_majority_downsampled_unhappy])

#Randomize the row order of the concatenated happy and unhappy DataFrames
df_ = shuffle(df_)
df_[['Happy']].value_counts()

Happy
0        57
1        57
dtype: int64
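The resampling cell above calls `resample` with `replace = True`, which draws rows with replacement and can therefore duplicate rows. A strict downsample would draw without replacement instead. The sketch below shows one possible way to do that; the helper `downsample_majority`, the toy DataFrame and the fixed seed are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd
from sklearn.utils import resample, shuffle


def downsample_majority(frame, label_col, seed=1):
    """Shrink every class to the size of the smallest class, without replacement."""
    n_min = frame[label_col].value_counts().min()
    parts = [
        resample(group, replace=False, n_samples=n_min, random_state=seed)
        for _, group in frame.groupby(label_col)
    ]
    # Shuffle so the classes are not stored in contiguous blocks
    return shuffle(pd.concat(parts), random_state=seed)


# Toy frame standing in for the survey data (hypothetical values)
toy = pd.DataFrame({
    "Happy": [1] * 6 + [0] * 4,
    "DeliveredOnTime": [5, 4, 5, 3, 4, 5, 3, 2, 4, 1],
})

balanced = downsample_majority(toy, "Happy")
print(balanced["Happy"].value_counts())
```

Because no row is drawn twice, the same observation cannot end up in both the training and the test split later on. Fixing `random_state` also makes the balanced sample reproducible between runs.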

## Predicting Happy Customer
A classification method is required to predict whether a customer is happy or not based on the survey features. A decision tree is selected for this case because the dataset has a relatively small set of features (6) and few observations.

### Decision Tree

In [6]:
from sklearn.preprocessing import OneHotEncoder

rs = 1

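#Separate features and label
X = df_[['DeliveredOnTime','ExpectedContent','DesiredOrder','GoodPrice','SatisfiedOrder', 'EasyOrderProcess']]
y = df_[['Happy']]

#One-hot encode each answer value (1 to 5) of each question into its own 0/1 column
ohe = OneHotEncoder()
ohe.fit(X)
X_ohe = ohe.transform(X).toarray()
#get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
ohe_df = pd.DataFrame(X_ohe, columns=ohe.get_feature_names_out(X.columns))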

#Split train and test set
X_train, X_test, y_train, y_test = train_test_split(X_ohe, y, test_size = 0.3, random_state = rs)

# simple decision tree training
clf = DecisionTreeClassifier(random_state=rs)

clf.fit(X_train, y_train)
print("Train accuracy:", clf.score(X_train, y_train))
print("Test accuracy:", clf.score(X_test, y_test))

Train accuracy: 0.9493670886075949
Test accuracy: 0.7714285714285715


In [23]:
# one-hot encoding (applied to the original df, not the resampled df_)
df_ = pd.get_dummies(df)

# target/input split
y = df_['Happy']
X = df_.drop(['Happy'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = rs)

# simple decision tree training
clf = DecisionTreeClassifier(random_state=rs)

clf.fit(X_train, y_train)
print("Train accuracy:", clf.score(X_train, y_train))
print("Test accuracy:", clf.score(X_test, y_test))

Train accuracy: 0.9772727272727273
Test accuracy: 0.6052631578947368
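One detail worth checking in the cell above: by default `pd.get_dummies` only encodes columns of object, string or category dtype, so integer survey answers pass through unchanged. The toy DataFrame below is a hypothetical stand-in used only to illustrate that behaviour.

```python
import pandas as pd

# Toy frame with integer columns only (hypothetical values)
toy = pd.DataFrame({"Happy": [1, 0], "GoodPrice": [3, 5]})

# Integer columns are left as they are
encoded = pd.get_dummies(toy)
print(encoded.equals(toy))

# Casting a column to category makes get_dummies expand it into indicator columns
as_cat = pd.get_dummies(toy.astype({"GoodPrice": "category"}))
print(list(as_cat.columns))
```

If one-hot encoding of the answers is intended, the columns would need to be cast to a categorical dtype first, or listed explicitly through the `columns` parameter.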


In [None]:
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.90      0.90      0.90        21
           1       0.86      0.86      0.86        14

    accuracy                           0.89        35
   macro avg       0.88      0.88      0.88        35
weighted avg       0.89      0.89      0.89        35



The decision tree above predicts whether a customer is happy with an accuracy of about 95% on the training set and 77% on the test set. The gap of about 18 percentage points suggests that the model may be overfitted.\
To further improve this model, a grid search is used to find the optimal hyperparameters for this dataset. 

In [10]:
from sklearn.model_selection import GridSearchCV

# grid search CV
params = {'criterion': ['gini', 'entropy'],
          'max_depth': range(1, 6),
          'min_samples_leaf': range(1, 25)}

cv_1 = GridSearchCV(param_grid=params,
                    estimator=DecisionTreeClassifier(random_state=rs),
                    return_train_score=True,
                    cv=10)
cv_1.fit(X_train, y_train)

result_set = cv_1.cv_results_
print(cv_1.best_params_)

{'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 5}


In [11]:
# Note: this repeats the grid search already run in the previous cell
cv_1.fit(X_train, y_train)

print("Train accuracy:", cv_1.score(X_train, y_train))
print("Test accuracy:", cv_1.score(X_test, y_test))

Train accuracy: 0.810126582278481
Test accuracy: 0.7428571428571429


In [12]:
y_pred = cv_1.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.61      0.85      0.71        13
           1       0.88      0.68      0.77        22

    accuracy                           0.74        35
   macro avg       0.75      0.76      0.74        35
weighted avg       0.78      0.74      0.75        35



With the hyperparameters chosen by the grid search, both the training accuracy (95% to 81%) and the test accuracy (77% to 74%) decrease. This suggests that the hyperparameters selected by cross-validation may not be the best fit after all, which could be due to the small amount of training data we have. With the higher test accuracy of 77%, the basic decision tree is selected as the best model for this case. 
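With 35 test rows, the 77% and 74% test scores correspond to 27 and 26 correct predictions, so a single hold-out split separates the two models by one row. Repeated cross-validation is one way to get a steadier comparison on a dataset this small. The sketch below is only an illustration: the data are synthetic (random 1 to 5 answers with a made-up labelling rule), the names `X_demo`, `y_demo` and `cv_scores` are hypothetical, and the tuned hyperparameters are the ones printed by the grid search above.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Synthetic stand-in: 114 rows, six answers on a 1-5 scale (not the real survey)
X_demo = rng.integers(1, 6, size=(114, 6))
# Hypothetical noisy labelling rule, only so that there is something to learn
y_demo = ((X_demo[:, 0] + X_demo[:, 4] + rng.normal(0, 2, 114)) > 6).astype(int)

# 5 folds repeated 10 times gives 50 accuracy estimates per model
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=1)

models = {
    "default tree": DecisionTreeClassifier(random_state=1),
    "tuned tree": DecisionTreeClassifier(criterion="gini", max_depth=5,
                                         min_samples_leaf=5, random_state=1),
}

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Comparing the mean and spread over many splits shows whether a difference between two models is larger than the variation caused by the split itself.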

**Additional Information:**

The business is very interested in finding which questions/features are more important when predicting a customer’s happiness.

> Using a feature selection approach, show us what is the minimal set of attributes/features that would preserve the most information about the problem while increasing predictability of the data we have. Is there any question that we can remove in our next survey?

### Feature Importance

In [41]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline


# translate rows to dicts
def row_to_dict(X, y=None):
    return X.apply(dict, axis=1)

# define prediction model
ft = FunctionTransformer(row_to_dict, validate=False)
dv = DictVectorizer()
clf = DecisionTreeClassifier(random_state=rs)

# glue steps together
model = make_pipeline(ft, dv, clf)

# train
model.fit(X_train, y_train)

# get feature importances, sorted from most to least important
sorted_feature = sorted(zip(dv.feature_names_, clf.feature_importances_),
                        key=lambda row: row[1], reverse=True)
for i in sorted_feature:
    print(i)

('SatisfiedOrder', 0.22614055021864132)
('ExpectedContent', 0.19207604950874047)
('DeliveredOnTime', 0.19078121046903485)
('DesiredOrder', 0.17235382882757933)
('EasyOrderProcess', 0.11546049146811417)
('GoodPrice', 0.10318786950788991)


In descending order, the three most important features for this model are SatisfiedOrder (X5), ExpectedContent (X2) and DeliveredOnTime (X1).\
EasyOrderProcess (X6) and GoodPrice (X4) are the two least important features and could be excluded from the next survey to ease the survey experience, given their comparatively low feature importance. 
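One way to check whether the two lowest-ranked questions can really be dropped is to compare cross-validated accuracy with and without them. The sketch below uses synthetic data under the notebook's column names; the helper `cv_accuracy`, the labelling rule and the seed are assumptions for illustration, so the printed numbers say nothing about the real survey.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def cv_accuracy(X, y, seed=1, folds=5):
    """Mean cross-validated accuracy of a default decision tree."""
    model = DecisionTreeClassifier(random_state=seed)
    return cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()


cols = ["DeliveredOnTime", "ExpectedContent", "DesiredOrder",
        "GoodPrice", "SatisfiedOrder", "EasyOrderProcess"]

rng = np.random.default_rng(1)
# Synthetic stand-in for the survey: 114 rows of 1-5 answers (not the real data)
X_demo = pd.DataFrame(rng.integers(1, 6, size=(114, 6)), columns=cols)
# Hypothetical noisy labelling rule driven by two of the questions
y_demo = ((X_demo["SatisfiedOrder"] + X_demo["DeliveredOnTime"]
           + rng.normal(0, 2, 114)) > 6).astype(int)

full = cv_accuracy(X_demo, y_demo)
reduced = cv_accuracy(X_demo.drop(columns=["GoodPrice", "EasyOrderProcess"]), y_demo)

print(f"all six questions:  {full:.3f}")
print(f"without X4 and X6:  {reduced:.3f}")
```

If the reduced feature set scores about as well as the full set on the real data, that would support removing the two questions from the next survey.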