# Predicting customer satisfaction via ML

The client is a startup in the logistics and delivery domain, whose main goal is their customers' satisfaction. Getting feedback from customers is not always easy, but it is the one effective way to gauge customers' satisfaction, and improve their operations accordingly. The company provides us with a subset of a bigger survey, and asked to come up with the most effective ML method to predict customers' happiness from their answers to the survey. 

In particular, the company asked to:

1. create a ML model that is at least 73% accurancy in predicting customer's satisfaction;
2. understand which questions are the most crucial to make correct predictions, and which can be removed from the survey without impinging on model accuracy.

## Libraries

In [None]:
# basics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import lazypredict
from lazypredict.Supervised import LazyClassifier

# models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier # the sklearn GradientBoosting models are reported less performative than the xgboost ones
from sklearn.svm import SVC #support vector classifier

# accuracy metrics
from sklearn.metrics import roc_curve, accuracy_score, classification_report, confusion_matrix, auc, precision_recall_curve
from sklearn.model_selection import cross_val_score, cross_validate, GridSearchCV, train_test_split

## Data exploration

In [None]:
data = pd.read_csv("ACME-HappinessSurvey2020.csv")
data

Data Description:

- Y = target attribute (Y) with values indicating 0 (unhappy) and 1 (happy) customers
- X1 = my order was delivered on time
- X2 = contents of my order was as I expected
- X3 = I ordered everything I wanted to order
- X4 = I paid a good price for my order
- X5 = I am satisfied with my courier
- X6 = the app makes ordering easy for me

Attributes X1 to X6 indicate the responses for each question and have values from 1 to 5 where the smaller number indicates less and the higher number indicates more towards the answer.

In [None]:
data['Y'].value_counts().plot.bar()

From the plot above, we can see that there were more happy customers than unhappy customers (though the difference between the two is not as big as one may want to).

In [None]:
sns.pairplot(data, corner=True, kind='reg')

In [None]:
correlations = data.corr()
sns.heatmap(correlations, annot=True)

As we can see from the correlation heatmap above, it looks like customer happiness correlates more with X1 ('my order was delivered on time') and X5 ('I am satisfied with my courier') than with any other feature. This seems to suggest that customery satisfaction hinges on the experience with courier more than anything else (i.e., order content, price, and app experience).

Notably, X1 ('my order was delivered on time') highly correlates with and X3 ('I ordered everything I wanted to order'), X5 ('I am satisfied with my courier') and X6 ('the app makes ordering easy for me'). All other combinations had lesser correlation coefficients (< 0.1).

Therefore, it looks like X2 ('The content of my order was as I expected') and X4 ('I paid a good price for my order') are not so informative, and may be removed from analysis.

## Modelling

First, let's split the data so that the features *Xn* are separated from the to-be-predicted feature *y* (customer happiness). Then, we split the dataset in two subsets: the _train_ set of *Xn* and *y* features will be used to identify the right ML algorithm to predict customer happiness; the _test_ set will be used to evaluate the chosen algorithm by comparing the predicted *y* values with the true *y* values.

In [None]:
np.random.seed(1)
X = data.drop('Y', axis=1)
y = data['Y']

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1) # split the dataset

The `LazyClassifier` class allows us to run all possible estimators, and compare their accuracy scores in one line.

In [8]:
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)

models

100%|███████████████████████████████████████████| 29/29 [00:02<00:00, 10.33it/s]

[LightGBM] [Info] Number of positive: 58, number of negative: 49
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000194 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 31
[LightGBM] [Info] Number of data points in the train set: 107, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.542056 -> initscore=0.168623
[LightGBM] [Info] Start training from score 0.168623





Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
BernoulliNB,0.63,0.61,0.61,0.63,0.01
QuadraticDiscriminantAnalysis,0.63,0.6,0.6,0.61,0.01
LGBMClassifier,0.58,0.59,0.59,0.58,0.06
LabelSpreading,0.58,0.57,0.57,0.58,0.01
NuSVC,0.58,0.57,0.57,0.58,0.01
DummyClassifier,0.58,0.57,0.57,0.58,0.01
AdaBoostClassifier,0.58,0.55,0.55,0.57,0.08
SGDClassifier,0.53,0.54,0.54,0.53,0.01
PassiveAggressiveClassifier,0.58,0.53,0.53,0.54,0.01
KNeighborsClassifier,0.58,0.53,0.53,0.54,0.01


The most accurate predictions were made by the Decision Tree and the Ada Boost classifiers. However, none of them actually reached our target accuracy threshold (73%). This means that both models are indeed promising in making good predictions for our case, though they might need (i) some hyperparameter tuning (i.e., identification of the best combination of parameters), and/or (ii) some feature selection (i.e., elimination of some features that might be either information, or, more dangerously, noisy, and impede more accurate predictions). 

### Feature selection

We will select only the features that show high correlations coefficients (r >= 0.15) with the target label.

In [9]:
# define feature selection
features_to_remove = correlations.loc[:, correlations.loc['Y'] < 0.15].columns
data_sel = data.drop(features_to_remove, axis=1)

X_sel = data_sel.drop('Y', axis=1)
y = data_sel['Y']

X_selected_train, X_selected_test, y_train, y_test = train_test_split(X_sel, y, test_size=0.15, random_state=42)

We will proceed to hyperparamenter tuning via GridSearch for each of the following models:
1. Logistic regression
2. KNN
3. XGBoost
4. DecisionTree
5. Random Forest
6. SVM

### Hyperparameter Turning via Grid Search

In [None]:
models = ['LogisticRegression', 'DecisionTreeClassifier', 'RandomForestClassifier', 'XGBClassifier', 'KNeighborsClassifier']
param_grid = dict().fromkeys(models)

param_grid['LogisticRegression'] = {'C': [1/0.0001, 1/0.001, 1/0.01, 1/0.1]} #inverse of regularization: 1/lambda

param_grid['DecisionTreeClassifier'] = {'criterion': ['entropy', 'log_loss', 'gini'], # measure of impurity
                                        'max_depth': np.arange(1, len(X_selected_train.columns)+1, 1), # how many levels does the tree have?
                                        'min_samples_split': [2, 3, 4, 5]} # how many samples are needed to make a split?

param_grid['RandomForestClassifier'] = {'n_estimators': [30, 50, 75, 100, 150, 200], # number of trees in the forest
                                        'max_depth': np.arange(1, len(X_selected_train.columns)+1, 1), # how many levels does the tree have?
                                        'min_samples_split': [2, 3, 4, 5], # how many samples are needed to make a split?
                                        'warm_start': ['True', 'False']
                                       } 
param_grid['XGBClassifier'] = {'eta': [0.3, 0.01, 0.001], # i.e., the learning rate
                               'max_depth': np.arange(1, len(X_selected_train.columns)+1, 1), # how many levels does the tree have?
                               'min_samples_split': [2, 3, 4, 5], # how many samples are needed to make a split?
                               'sampling_method': ['uniform', 'gradient_based'], # how are the training data sampled?
                               'gamma': [0.5, 1, 1.5, 2, 5], # minimum loss reduction required to make a split -- the larger gamma is, the more conservative the algorithm will be
                               'lambda': [0.0001, 0.001, 0.01, 0.1], # regularization term
                               'max_leaves': [0, 1, 2, 3, 4, 5]
                            }
param_grid['KNeighborsClassifier'] = {'n_neighbors' : np.arange(5, 35, 5)}

param_grid['SVC'] = {'C': [0.0001, 0.001, 0.01, 0.1], # regularization term
                     'kernel' : ['rbf', 'linear']
                    }


def load_model(model):
    if model == 'LogisticRegression':
        return LogisticRegression()
    if model == "DecisionTreeClassifier":
        return DecisionTreeClassifier(random_state=42)
    if model == 'RandomForestClassifier':
        return RandomForestClassifier(random_state=42)
    if model == 'XGBClassifier':
        return XGBClassifier()
    if model == 'KNeighborsClassifier':
        return KNeighborsClassifier()
    if model =- 'SVC':
        return SVC()

dict_models = {}
for model in models:
    estimator = load_model(model)
    gs = GridSearchCV(estimator = estimator, param_grid = param_grid[model])
    gs.fit(X_selected_train, y_train)
    dict_models[model] = gs.best_estimator_
    y_pred = dict_models[model].predict(X_selected_test)
    print("\n Report of the model: ", dict_models[model],"\n", classification_report(y_test, y_pred),"\n")


 Report of the model:  LogisticRegression(C=10.0) 
               precision    recall  f1-score   support

           0       0.60      0.50      0.55         6
           1       0.62      0.71      0.67         7

    accuracy                           0.62        13
   macro avg       0.61      0.61      0.61        13
weighted avg       0.61      0.62      0.61        13
 


 Report of the model:  DecisionTreeClassifier(max_depth=4) 
               precision    recall  f1-score   support

           0       0.50      0.33      0.40         6
           1       0.56      0.71      0.63         7

    accuracy                           0.54        13
   macro avg       0.53      0.52      0.51        13
weighted avg       0.53      0.54      0.52        13
 


 Report of the model:  RandomForestClassifier(max_depth=3, min_samples_split=4, n_estimators=30,
                       warm_start='False') 
               precision    recall  f1-score   support

           0       0.50      