# HappyCustomer - An Apziva Project

By Samuel Alter

This project centers on a customer survey dataset from a delivery company. The dataset consists of the following:
* `Y`: The target attribute, indicating whether the customer noted their happiness or unhappiness
* `X1`: Order was delivered on time
* `X2`: Contents of the order was as expected
* `X3`: I ordered everything that I wanted to order
* `X4`: I paid a good price for my order
* `X5`: I am satisfied with my courier
* `X6`: The app makes ordering easy for me

Attributes `X1` through `X6` are on a 1 to 5 scale, with 5 indicating most agreement with the statement.

The goal of this project is to train a model that predicts whether a customer is happy or not, based on their answers to the survey. Specifically, I aim to reach 73% accuracy or higher with my modeling, or explain why my solution is superior.

A stretch goal would be to determine which features are more important when predicting a customer's happiness. What is the minimal set of attributes or features that would preserve the most information about the problem, while at the same time increasing predictability? The aim here is to see if any question can be eliminated in the next survey round.
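One way to approach the stretch goal is to score every subset of the six questions with cross-validation and prefer the smallest subset among the top scorers. The sketch below is only an illustration: the data are synthetic (random 1 to 5 answers with a made-up target), `LogisticRegression` is a stand-in classifier, and the names `X_demo`, `y_demo`, and `results` are hypothetical.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the survey (NOT the real data): 126 rows of 1-5 answers
rng = np.random.default_rng(42)
features = ['X1', 'X2', 'X3', 'X4', 'X5', 'X6']
X_demo = pd.DataFrame(rng.integers(1, 6, size=(126, 6)), columns=features)
# Made-up target that depends mostly on X1 and X5, plus noise
y_demo = ((X_demo['X1'] + X_demo['X5'] + rng.normal(0, 1.5, 126)) > 6).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Score all 63 non-empty subsets of the six features
results = []
for k in range(1, len(features) + 1):
    for subset in combinations(features, k):
        scores = cross_val_score(LogisticRegression(max_iter=1000),
                                 X_demo[list(subset)],
                                 y_demo,
                                 cv=cv,
                                 scoring='accuracy')
        results.append((subset, scores.mean()))

# Highest mean accuracy first; ties broken in favor of fewer features
results.sort(key=lambda r: (-r[1], len(r[0])))
best_subset, best_score = results[0]
print(best_subset, round(best_score, 3))
```

With only six features the exhaustive search is cheap (63 subsets), so no greedy selection heuristic is needed for a sketch like this.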

The statistical analysis of the features can be found in the [Statistical Modeling](#statistical_modeling) section at the end of this document.

## Table of Contents

1. [EDA](#eda)
1. [Initial `lazypredict` model exploration](#lazy_predict)
1. [`XGBoost` in `sklearn`](#xgboost)

## EDA <a name='eda'></a>

## Run this section in the `lazypredict` environment <a name='lazy_predict'></a>

[`lazypredict`](https://lazypredict.readthedocs.io/en/latest/) is a helpful package that fits default builds of many models in order to give a high-level view of how they perform on your particular dataset. It is a great place to start and saves a lot of time that would otherwise be spent manually exploring the accuracy of different models.
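Conceptually, a sweep like this amounts to fitting several default estimators on the same split and tabulating their scores. The sketch below shows that idea with plain scikit-learn rather than `lazypredict` itself; the data are synthetic, and the four candidate models and the name `leaderboard` are illustrative choices, not what the package does internally.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic 1-5 survey answers with a made-up target (NOT the real data)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.integers(1, 6, size=(126, 6)),
                      columns=[f'X{i}' for i in range(1, 7)])
y_demo = (X_demo['X1'] + rng.normal(0, 1.5, 126) > 3).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

# A handful of default-configured models, all fit on the same split
candidates = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'BernoulliNB': BernoulliNB(),
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=42),
    'RandomForestClassifier': RandomForestClassifier(random_state=42),
}

rows = []
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    pred = estimator.predict(X_te)
    rows.append({
        'Model': name,
        'Accuracy': accuracy_score(y_te, pred),
        'Balanced Accuracy': balanced_accuracy_score(y_te, pred),
        'F1 Score': f1_score(y_te, pred, average='weighted'),
    })

leaderboard = (pd.DataFrame(rows)
               .set_index('Model')
               .sort_values('Accuracy', ascending=False))
print(leaderboard.round(2))
```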

In [11]:
import numpy as np
import pandas as pd

In [30]:
from lazypredict.Supervised import LazyClassifier
from sklearn.model_selection import train_test_split

In [31]:
# setup random state for reproducibility
random_state=42

In [32]:
# read in and setup dataset

df=pd.read_csv('../data/ACME-HappinessSurvey2020.csv')

# renaming columns to preserve order
# and make them more intelligible
df.rename(columns={'Y':'y',
                   'X1':'a_time',
                   'X2':'b_contents',
                   'X3':'c_complete',
                   'X4':'d_price',
                   'X5':'e_courier',
                   'X6':'f_app'},inplace=True)

# df.dtypes

X=df[[col for col in df.columns if col != 'y']].copy()
y=df['y'].copy().astype('int8') # because it's a binary

In [33]:
X_train, \
X_test, \
y_train, \
y_test = train_test_split(X, 
                          y, 
                          test_size=0.2, 
                          stratify=y,
                          random_state=random_state)

In [34]:
print(f'''
Shapes of splits:
X_train: {X_train.shape}
X_test:  {X_test.shape}
y_train: {y_train.shape}
y_test:  {y_test.shape}
''')


Shapes of splits:
X_train: (100, 6)
X_test:  (26, 6)
y_train: (100,)
y_test:  (26,)



In [35]:
clf = LazyClassifier(verbose=0,
                     ignore_warnings=True,
                     random_state=random_state)

In [36]:
models, predictions = clf.fit(X_train=X_train,
                              X_test=X_test,
                              y_train=y_train,
                              y_test=y_test)

100%|█████████████████████████████████████████████████████████| 29/29 [00:00<00:00, 44.32it/s]

[LightGBM] [Info] Number of positive: 55, number of negative: 45
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000323 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 31
[LightGBM] [Info] Number of data points in the train set: 100, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.550000 -> initscore=0.200671
[LightGBM] [Info] Start training from score 0.200671





In [37]:
models

| Model | Accuracy | Balanced Accuracy | ROC AUC | F1 Score | Time Taken |
|---|---|---|---|---|---|
| SGDClassifier | 0.69 | 0.69 | 0.69 | 0.69 | 0.01 |
| NearestCentroid | 0.69 | 0.68 | 0.68 | 0.69 | 0.01 |
| AdaBoostClassifier | 0.69 | 0.68 | 0.68 | 0.68 | 0.05 |
| BernoulliNB | 0.69 | 0.68 | 0.68 | 0.68 | 0.01 |
| DecisionTreeClassifier | 0.65 | 0.66 | 0.66 | 0.65 | 0.01 |
| RandomForestClassifier | 0.65 | 0.66 | 0.66 | 0.65 | 0.11 |
| LGBMClassifier | 0.65 | 0.65 | 0.65 | 0.65 | 0.08 |
| CalibratedClassifierCV | 0.65 | 0.63 | 0.63 | 0.62 | 0.01 |
| ExtraTreesClassifier | 0.62 | 0.61 | 0.61 | 0.62 | 0.06 |
| BaggingClassifier | 0.62 | 0.61 | 0.61 | 0.62 | 0.02 |


In [38]:
models.to_csv('../joblib/1_lazypredict.csv')

## Switch to `sklearn` environment <a name='xgboost'></a>

In [1]:
import numpy as np
import pandas as pd

In [2]:
models=pd.read_csv('../joblib/1_lazypredict.csv')

In [3]:
models

| | Model | Accuracy | Balanced Accuracy | ROC AUC | F1 Score | Time Taken |
|---|---|---|---|---|---|---|
| 0 | SGDClassifier | 0.692308 | 0.690476 | 0.690476 | 0.692308 | 0.007916 |
| 1 | NearestCentroid | 0.692308 | 0.684524 | 0.684524 | 0.688578 | 0.007125 |
| 2 | AdaBoostClassifier | 0.692308 | 0.678571 | 0.678571 | 0.680769 | 0.054332 |
| 3 | BernoulliNB | 0.692308 | 0.678571 | 0.678571 | 0.680769 | 0.006422 |
| 4 | DecisionTreeClassifier | 0.653846 | 0.660714 | 0.660714 | 0.652308 | 0.007512 |
| 5 | RandomForestClassifier | 0.653846 | 0.660714 | 0.660714 | 0.652308 | 0.114229 |
| 6 | LGBMClassifier | 0.653846 | 0.64881 | 0.64881 | 0.652289 | 0.07994 |
| 7 | CalibratedClassifierCV | 0.653846 | 0.630952 | 0.630952 | 0.617195 | 0.01489 |
| 8 | ExtraTreesClassifier | 0.615385 | 0.613095 | 0.613095 | 0.615385 | 0.055475 |
| 9 | BaggingClassifier | 0.615385 | 0.613095 | 0.613095 | 0.615385 | 0.016343 |


After exploring alternatives, including `LGBMClassifier`, I will now use `XGBoost`, which had given me the second-highest accuracy in an earlier run. You'll note that `XGBoost` now ranks 25th out of 26 model options. The lesson learned is that I must always specify a `random_state` to ensure reproducibility.
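To illustrate why the seed matters so much on a dataset this small, the sketch below repeats the same split-and-fit procedure under 20 different seeds and reports the spread in test accuracy. It uses synthetic data and a `DecisionTreeClassifier` as a stand-in model; the helper `accuracy_for_seed` is hypothetical. With 26 test rows, a single prediction changes accuracy by 1/26, or roughly 3.8 percentage points.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 1-5 survey answers with a noisy made-up target (NOT the real data)
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 6, size=(126, 6))
y_demo = (X_demo[:, 0] + rng.normal(0, 2.0, 126) > 3).astype(int)


def accuracy_for_seed(seed):
    """Split, fit, and score using one seed for both the split and the model."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=seed)
    model = DecisionTreeClassifier(max_depth=3, random_state=seed)
    return model.fit(X_tr, y_tr).score(X_te, y_te)


accuracies = np.array([accuracy_for_seed(seed) for seed in range(20)])
print(f'min={accuracies.min():.2f}  max={accuracies.max():.2f}  '
      f'std={accuracies.std():.2f}')

# Re-running with the same seed reproduces the same number
print(accuracy_for_seed(42) == accuracy_for_seed(42))
```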

### Imports

In [4]:
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [5]:
# setup random state for reproducibility
random_state=42

In [6]:
# read in and setup dataset

df=pd.read_csv('../data/ACME-HappinessSurvey2020.csv')

# renaming columns to preserve order
# and make them more intelligible
df.rename(columns={'Y':'y',
                   'X1':'a_time',
                   'X2':'b_contents',
                   'X3':'c_complete',
                   'X4':'d_price',
                   'X5':'e_courier',
                   'X6':'f_app'},inplace=True)

# df.dtypes

X=df[[col for col in df.columns if col != 'y']].copy()
y=df['y'].copy().astype('int8') # because it's a binary

X_train, \
X_test, \
y_train, \
y_test = train_test_split(X, 
                          y, 
                          test_size=0.2, 
                          stratify=y,
                          random_state=random_state)

print(f'''
Shapes of splits:
X_train: {X_train.shape}
X_test:  {X_test.shape}
y_train: {y_train.shape}
y_test:  {y_test.shape}
''')


Shapes of splits:
X_train: (100, 6)
X_test:  (26, 6)
y_train: (100,)
y_test:  (26,)



### Initial model run

In [7]:
xgbc = XGBClassifier(random_state=random_state)
xgbc.fit(X_train, y_train)

y_pred = xgbc.predict(X_test)
print(f'Score on test: {xgbc.score(X_test,y_test)}')
print(classification_report(y_test, y_pred))

Score on test: 0.5384615384615384
              precision    recall  f1-score   support

           0       0.50      0.42      0.45        12
           1       0.56      0.64      0.60        14

    accuracy                           0.54        26
   macro avg       0.53      0.53      0.53        26
weighted avg       0.53      0.54      0.53        26



The base model is not great yet. Let's keep going forward.
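For context, it helps to compare against a majority-class baseline. The classification report above lists 12 unhappy and 14 happy customers in the test split, and the LightGBM log in the `lazypredict` section reports 55 positive and 45 negative training rows. The sketch below rebuilds label arrays from those counts only (the feature values are placeholders) and scores a `DummyClassifier`, assuming both sections use the same split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Label counts taken from the outputs above:
#   train: 45 unhappy (0), 55 happy (1); test: 12 unhappy (0), 14 happy (1)
y_train_counts = np.array([0] * 45 + [1] * 55)
y_test_counts = np.array([0] * 12 + [1] * 14)

# The dummy model ignores the features, so placeholder zeros are enough
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(np.zeros((len(y_train_counts), 1)), y_train_counts)
baseline_accuracy = baseline.score(np.zeros((len(y_test_counts), 1)),
                                   y_test_counts)

print(f'{baseline_accuracy:.4f}')  # 14 / 26 = 0.5385
```

Always predicting "happy" scores 14/26, which is the same 0.538 that the untuned `XGBClassifier` reached on this test split.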

Note that the dataset consists of survey responses on a 1 to 5 scale, which are categorical (ordinal) rather than continuous, but they are stored as plain integers. Converting the survey results to the `category` datatype may help.

In [8]:
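# note: X_train and X_test were split from X before this conversion,
# so they keep their integer dtype unless the split is re-run
for col in X.columns:
    X[col]=X[col].astype('category')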
    
print(f'''
Datatypes:

y: 
{y.dtypes}

X: 
{X.dtypes}
''')


Datatypes:

y: 
int8

X: 
a_time        category
b_contents    category
c_complete    category
d_price       category
e_courier     category
f_app         category
dtype: object
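Because the answers sit on an ordered 1 to 5 scale, one alternative is an ordered categorical dtype, which keeps the ranking information that a plain `category` conversion drops. This is a minimal pandas sketch with made-up answers; the names `likert` and `answers` are hypothetical.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Ordered dtype covering the full 1-5 scale, even if a value never occurs
likert = CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)

# Made-up answers for illustration only
answers = pd.DataFrame({'a_time': [5, 3, 4], 'f_app': [1, 5, 2]})
answers_cat = answers.astype(likert)

print(answers_cat.dtypes)
# Comparisons respect the declared order
print((answers_cat['a_time'] >= 4).tolist())
# Integer codes are the zero-based positions in the category list
print(answers_cat['a_time'].cat.codes.tolist())
```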



### Grid Search Exploration with `XGBoost`

In [17]:
from sklearn.model_selection import GridSearchCV,StratifiedKFold,cross_val_score

In [10]:
xgbc = XGBClassifier(random_state=random_state)

# specifying the k-fold so that we can control the randomness
stratified_k_fold=StratifiedKFold(n_splits=5,
                                  random_state=random_state,
                                  shuffle=True)

# narrow search ranges; the commented-out lists are the wider ranges
parameters = {
    'alpha': [0], #(list(np.linspace(0,1,3))),
    'gamma': [0], #(list(np.linspace(0,1,3))),
    'lambda': (list(np.linspace(0.275,0.325,6))),
    # logspace takes exponents: 10**0.211 to 10**0.213, about 1.626 to 1.633
    'learning_rate': (np.logspace(0.211,0.213,9)),
    'max_depth': [2], #(list(np.arange(1,4))),
    'min_child_weight': (list(np.linspace(3.5,4.5,9))),
    'n_estimators': (np.arange(53,58))
}

grid_search = GridSearchCV(xgbc, 
                           parameters, 
                           cv = stratified_k_fold, 
                           n_jobs = -1, 
                           verbose = 0)

grid_search.fit(X_train, y_train)

In [11]:
# best score
print(f"best score: {grid_search.best_score_}")

# best parameters 
print(f"best parameters: {grid_search.best_params_}")

best score: 0.64
best parameters: {'alpha': 0, 'gamma': 0, 'lambda': 0.275, 'learning_rate': 1.625548755750484, 'max_depth': 2, 'min_child_weight': 4.375, 'n_estimators': 53}


```python
best score: 0.6900000000000001
best parameters: {'alpha': 0, 'gamma': 0, 'lambda': 0.3, 'learning_rate': 1.6292960326397223, 'max_depth': 2, 'min_child_weight': 4.0, 'n_estimators': 55}
```

```python
best score: 0.68
best parameters: {'alpha': 0, 'gamma': 0, 'lambda': 0.275, 'learning_rate': 1.6330519478943344, 'max_depth': 2, 'min_child_weight': 3.875, 'n_estimators': 56}
```
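The best scores recorded above are mean accuracies across the five folds (`best_score_` is the mean cross-validated score of the best parameter combination), so it is worth checking the fold-to-fold spread before reading much into differences of a few points. The sketch below pulls the mean and standard deviation from `cv_results_`; it uses synthetic data and a `DecisionTreeClassifier` with a small hypothetical grid in place of the `XGBoost` search.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 100 training rows (NOT the real data)
rng = np.random.default_rng(42)
X_demo = rng.integers(1, 6, size=(100, 6))
y_demo = (X_demo[:, 0] + rng.normal(0, 2.0, 100) > 3).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      {'max_depth': [1, 2, 3],
                       'min_samples_leaf': [1, 5, 10]},
                      cv=cv,
                      scoring='accuracy')
search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
summary = (pd.DataFrame(search.cv_results_)
           [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
           .sort_values('rank_test_score'))
print(summary.head())
```

If the standard deviation across folds is larger than the gap between the top few combinations, the ranking among them is probably not meaningful.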

To confirm the results, I will re-run the model with the best parameters. I will use a pipeline so that I can include a `random_state` and make sure that the model treats the features as categorical data.

In [67]:
from sklearn.pipeline import Pipeline

# X columns are categorical so they need to be OHE'd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [78]:
# setup classifier
model=XGBClassifier(alpha=0,
                    gamma=0,
                    reg_lambda=0.275,
                    learning_rate=1.625548755750484,
                    max_depth=2,
                    min_child_weight=4.375,
                    n_estimators=53,
                    random_state=random_state)

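# preprocessor to handle categorical features, make them OHE'd
# handle_unknown='ignore' encodes categories not seen during fit as all zeros
# note: only column positions 0 and 1 are listed here; with the default
# remainder='drop', the other four columns are not passed to the model
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), [0, 1])
    ])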

# create pipeline
pipeline=Pipeline([
    ('preprocessor',preprocessor),
    ('xgb',model)
])

# allow for five cross-validation folds
stratified_k_fold=StratifiedKFold(n_splits=5,
                                  random_state=random_state,
                                  shuffle=True)

# perform cross-validation and print accuracy
scores=cross_val_score(pipeline, 
                       X, 
                       y, 
                       cv=stratified_k_fold, 
                       scoring='accuracy')

print('Cross-validated accuracy: '\
f'{scores.mean()*100:.2f}% ± {scores.std()*100:.2f}%')

Cross-validated accuracy: 61.91% ± 9.64%
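The `ColumnTransformer` in the cell above is given column positions `[0, 1]`, and scikit-learn's default `remainder='drop'` discards every column that is not listed. If the intent is to one-hot encode all six questions, the columns can be passed by name. The sketch below shows this under stated assumptions: the data are synthetic and `LogisticRegression` is swapped in for `XGBClassifier` to keep the example light, so the printed score says nothing about the real survey.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

cols = ['a_time', 'b_contents', 'c_complete', 'd_price', 'e_courier', 'f_app']

# Synthetic 1-5 survey answers with a made-up target (NOT the real data)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.integers(1, 6, size=(126, 6)), columns=cols)
y_demo = (X_demo['a_time'] + rng.normal(0, 1.5, 126) > 3).astype(int)

# Listing every column by name means none is silently dropped
preprocessor = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), cols)])

pipe = Pipeline([('preprocessor', preprocessor),
                 ('clf', LogisticRegression(max_iter=1000))])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring='accuracy')
print(f'{scores.mean()*100:.2f}% ± {scores.std()*100:.2f}%')

# Sanity check: how many encoded columns reach the classifier
n_encoded = (pipe.fit(X_demo, y_demo)
             .named_steps['preprocessor']
             .transform(X_demo)
             .shape[1])
print(n_encoded)
```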


In [None]:
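from sklearn.metrics import precision_score, recall_score, f1_score

# y_pred holds the predictions from the initial XGBClassifier fit above
metrics = {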
    'accuracy': accuracy_score(y_test, y_pred),
    'precision': precision_score(y_test, y_pred),
    'recall': recall_score(y_test, y_pred),
    'f1': f1_score(y_test, y_pred)
}

### Pipeline

Example of pipeline below:

```python
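from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)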

model = Pipeline(
    steps=[
        ('scaler', StandardScaler()),
        ('rfe', RFE(
            estimator=BernoulliNB(),
            n_features_to_select=3,
            importance_getter='feature_log_prob_'
        )),
        ('bnb', BernoulliNB())
    ]
)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)
metrics = {
    'accuracy': accuracy_score(y_test, y_pred),
    'precision': precision_score(y_test, y_pred),
    'recall': recall_score(y_test, y_pred),
    'f1': f1_score(y_test, y_pred)
}
```
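A pipeline with an `RFE` step also speaks to the question of which survey items could be dropped, because the fitted step records which features it kept. The sketch below reads `support_` and `ranking_` from the fitted step. Note the swap: it uses `LogisticRegression` (whose `coef_` is picked up by `RFE` automatically) instead of `BernoulliNB`, and synthetic data with a made-up target, so the selected columns are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

cols = ['a_time', 'b_contents', 'c_complete', 'd_price', 'e_courier', 'f_app']

# Synthetic 1-5 survey answers; the made-up target depends on two columns
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.integers(1, 6, size=(126, 6)), columns=cols)
y_demo = ((X_demo['a_time'] + X_demo['e_courier']
           + rng.normal(0, 1.0, 126)) > 6).astype(int)

demo_pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('rfe', RFE(estimator=LogisticRegression(max_iter=1000),
                n_features_to_select=3)),
    ('clf', LogisticRegression(max_iter=1000)),
])
demo_pipe.fit(X_demo, y_demo)

# support_ is a boolean mask of kept features; ranking_ is 1 for kept
# features and increases for features that were eliminated earlier
rfe = demo_pipe.named_steps['rfe']
kept = [col for col, keep in zip(cols, rfe.support_) if keep]
ranking = {col: int(rank) for col, rank in zip(cols, rfe.ranking_)}
print(kept)
print(ranking)
```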

### Statistical Modeling <a name="statistical_modeling"></a>

Following [this tutorial](https://www.datacamp.com/tutorial/xgboost-in-python) on DataCamp, I will create a `DMatrix`, which is XGBoost's own data structure, optimized for memory use and speed during modeling.

In [15]:
import xgboost as xgb

dtrain_reg = xgb.DMatrix(X_train, y_train, enable_categorical=True)
dtest_reg = xgb.DMatrix(X_test, y_test, enable_categorical=True)