# Happy Customers!

Data Description:

* Y = target attribute (Y) with values indicating 0 (unhappy) and 1 (happy) customers
* X1 = my order was delivered on time
* X2 = contents of my order was as I expected
* X3 = I ordered everything I wanted to order
* X4 = I paid a good price for my order
* X5 = I am satisfied with my courier
* X6 = the app makes ordering easy for me

Attributes X1 to X6 hold the response to each question on a scale from 1 to 5, where a lower number means the customer agrees less with the statement and a higher number means they agree more.

Goal : Predict if a customer is happy or not based on the answers they give to questions asked.

Success Metrics : Reach 73% accuracy score or above, or convince us why your solution is superior. We are definitely interested in every solution and insight you can provide us.

Try to submit your working solution as soon as possible. The sooner the better.

### Loading the data 

In [1]:
# Let's load the data and take a look at it 
import pandas as pd

df = pd.read_csv("ACME-HappinessSurvey2020.csv")

# Show n random rows from the dataframe
df.sample(10)

     Y  X1  X2  X3  X4  X5  X6
102  0   5   2   3   3   3   5
40   0   5   2   3   3   3   3
48   1   5   2   5   5   5   3
16   0   5   3   4   5   4   5
54   1   4   3   2   4   3   4
42   0   5   2   3   3   4   5
69   1   5   4   5   5   5   5
75   0   3   2   3   3   4   4
74   1   5   2   5   5   5   5
84   0   4   3   4   4   2   4


### Preprocessing

In [2]:
# What are our column data types?

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126 entries, 0 to 125
Data columns (total 7 columns):
Y     126 non-null int64
X1    126 non-null int64
X2    126 non-null int64
X3    126 non-null int64
X4    126 non-null int64
X5    126 non-null int64
X6    126 non-null int64
dtypes: int64(7)
memory usage: 7.0 KB


In [3]:
# Do we have any null values?

df.isnull().sum()

Y     0
X1    0
X2    0
X3    0
X4    0
X5    0
X6    0
dtype: int64

In [4]:
# Descriptive stats
df.describe().T

    count      mean       std  min  25%  50%  75%  max
Y   126.0  0.547619  0.499714  0.0  0.0  1.0  1.0  1.0
X1  126.0  4.333333  0.800000  1.0  4.0  5.0  5.0  5.0
X2  126.0  2.531746  1.114892  1.0  2.0  3.0  3.0  5.0
X3  126.0  3.309524  1.023440  1.0  3.0  3.0  4.0  5.0
X4  126.0  3.746032  0.875776  1.0  3.0  4.0  4.0  5.0
X5  126.0  3.650794  1.147641  1.0  3.0  4.0  4.0  5.0
X6  126.0  4.253968  0.809311  1.0  4.0  4.0  5.0  5.0


In [5]:
# Correlation: are the answers to our questions linearly related?
# Ex: If one of them is low, do any of the others tend to follow it?

df.corr()

           Y        X1        X2        X3        X4        X5        X6
Y   1.000000  0.280160 -0.024274  0.150838  0.064415  0.224522  0.167669
X1  0.280160  1.000000  0.059797  0.283358  0.087541  0.432772  0.411873
X2 -0.024274  0.059797  1.000000  0.184129  0.114838  0.039996 -0.062205
X3  0.150838  0.283358  0.184129  1.000000  0.302618  0.358397  0.203750
X4  0.064415  0.087541  0.114838  0.302618  1.000000  0.293115  0.215888
X5  0.224522  0.432772  0.039996  0.358397  0.293115  1.000000  0.320195
X6  0.167669  0.411873 -0.062205  0.203750  0.215888  0.320195  1.000000


In [6]:
# Correlation with happiness

df.corr()["Y"].sort_values(ascending=False)

Y     1.000000
X1    0.280160
X5    0.224522
X6    0.167669
X3    0.150838
X4    0.064415
X2   -0.024274
Name: Y, dtype: float64
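
Since the answers are ordinal (1 to 5), a rank-based correlation may be a useful complement to the default Pearson coefficient. The sketch below only illustrates the call `corr(method="spearman")`; it is built from the ten rows displayed by `df.sample(10)` earlier, which is far too few to draw conclusions from, and the name `sample` is introduced here for illustration.

```python
import pandas as pd

# The ten rows shown by df.sample(10) above (illustration only).
sample = pd.DataFrame(
    {
        "Y":  [0, 0, 1, 0, 1, 0, 1, 0, 1, 0],
        "X1": [5, 5, 5, 5, 4, 5, 5, 3, 5, 4],
        "X2": [2, 2, 2, 3, 3, 2, 4, 2, 2, 3],
        "X3": [3, 3, 5, 4, 2, 3, 5, 3, 5, 4],
        "X4": [3, 3, 5, 5, 4, 3, 5, 3, 5, 4],
        "X5": [3, 3, 5, 4, 3, 4, 5, 4, 5, 2],
        "X6": [5, 3, 3, 5, 4, 5, 5, 4, 5, 4],
    }
)

# Spearman correlation works on ranks, which suits ordinal survey answers.
spearman_with_y = sample.corr(method="spearman")["Y"].sort_values(ascending=False)
print(spearman_with_y)
```

On the full dataframe the same call would be `df.corr(method="spearman")["Y"]`.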

In [7]:
# How are the happy and unhappy classes distributed? Do we need to balance them?

df["Y"].value_counts()

1    69
0    57
Name: Y, dtype: int64
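
These counts also give a reference point for the accuracy scores later on. As a small worked example using only the two numbers printed above, a model that always predicts the majority class (happy) would already reach roughly 55% accuracy:

```python
# Class counts copied from the value_counts() output above.
n_happy = 69
n_unhappy = 57
n_total = n_happy + n_unhappy

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(n_happy, n_unhappy) / n_total

print(f"Total customers: {n_total}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

The classes are close to balanced, so resampling does not look necessary, but a model scoring near 0.55 is doing no better than this baseline.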

In [8]:
# Shape of the dataset
df.shape

(126, 7)

### Model Building

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [10]:
# Splitting the dataset to train and test datasets

X = df.iloc[:,[1,2,3,4,5,6]]
y = df["Y"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print("Shape of X_train: ", X_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_test: ", y_test.shape)

Shape of X_train:  (100, 6)
Shape of y_train:  (100,)
Shape of X_test:  (26, 6)
Shape of y_test:  (26,)
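
With only 26 rows in the test set, accuracy can move only in coarse steps. The short calculation below, which uses just the test size printed above and the 73% target from the problem statement, shows how many correct predictions the target requires:

```python
import math

n_test = 26      # size of the test split printed above
target = 0.73    # accuracy target from the problem statement

# Each test row moves the accuracy by 1 / n_test.
step = 1 / n_test

# Smallest number of correct predictions that reaches the target.
min_correct = math.ceil(target * n_test)

print(f"One test row is worth {step:.4f} accuracy")
print(f"Need at least {min_correct} of {n_test} correct "
      f"({min_correct / n_test:.4f})")
```

A single row is worth almost four percentage points, so small differences between models on this split should be read with care.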


### Logistic Regression

In [11]:
from sklearn.linear_model import LogisticRegression

In [12]:
model_log = LogisticRegression().fit(X_train,y_train)
preds = model_log.predict(X_test)
preds

array([1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       0, 1, 1, 1], dtype=int64)

In [13]:
print("Number of happy customers : {}".format(preds.sum()))
print("Number of unhappy customers : {}".format(len(preds) - preds.sum()))

Number of happy customers : 19
Number of unhappy customers : 7


In [14]:
# Accuracy score of the model
print("Accuracy of the model is : %{}".format(accuracy_score(y_test, preds)*100))

Accuracy of the model is : %61.53846153846154


### Decision Trees

In [15]:
from sklearn.tree import DecisionTreeClassifier

model_dt = DecisionTreeClassifier()
model_dt.fit(X_train,y_train)
preds_dt = model_dt.predict(X_test)

In [16]:
print("Number of happy customers : {}".format(preds_dt.sum()))
print("Number of unhappy customers : {}".format(len(preds_dt) - preds_dt.sum()))

Number of happy customers : 13
Number of unhappy customers : 13


In [17]:
# Accuracy score of the model
print("Accuracy of the model is : %{}".format(accuracy_score(y_test, preds_dt)*100))

Accuracy of the model is : %46.15384615384615


### Random Forest 

In [18]:
from sklearn.ensemble import RandomForestClassifier

model_rf= RandomForestClassifier(random_state=1)
model_rf.fit(X_train,y_train)
preds_rf=model_rf.predict(X_test)

In [19]:
print("Number of happy customers : {}".format(preds_rf.sum()))
print("Number of unhappy customers : {}".format(len(preds_rf) - preds_rf.sum()))

Number of happy customers : 17
Number of unhappy customers : 9


In [20]:
# Accuracy score of the model
print("Accuracy of the model is : %{}".format(accuracy_score(y_test, preds_rf)*100))

Accuracy of the model is : %53.84615384615385


### Gridsearch to optimize models 

In [21]:
from sklearn.model_selection import GridSearchCV

In [22]:
import numpy as np
grid={"C":np.logspace(-3,3,7), "penalty":["l1","l2"]}# l1 lasso l2 ridge

grid_log=GridSearchCV(model_log,grid)
grid_log.fit(X_train,y_train)

Traceback (most recent call last):
  File "C:\Users\Semanur\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\Semanur\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1304, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "C:\Users\Semanur\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 443, in _check_solver
    "got %s penalty." % (solver, penalty))
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.


GridSearchCV(estimator=LogisticRegression(),
             param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]),
                         'penalty': ['l1', 'l2']})

In [23]:
print("tuned hyperparameters (best parameters): ",grid_log.best_params_)
print("accuracy :",grid_log.best_score_)

tuned hyperparameters (best parameters):  {'C': 0.01, 'penalty': 'l2'}
accuracy : 0.56
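
The `ValueError` in the traceback above is raised because the default `lbfgs` solver does not support the `l1` penalty, so the `l1` candidates in the grid fail and only the `l2` candidates are actually scored. Below is a minimal sketch of one possible fix, switching to the `liblinear` solver, which accepts both penalties. It runs on synthetic stand-in data (`X_demo` and `y_demo` are hypothetical names, not the survey file), so its numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the survey data (assumption: not the real file).
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 6, size=(100, 6))   # six answers on a 1-5 scale
y_demo = rng.integers(0, 2, size=100)        # 0 = unhappy, 1 = happy

param_grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l1", "l2"]}

# liblinear supports both the l1 and l2 penalties, unlike the default lbfgs.
search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=5)
search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated accuracy on the training folds,
# not the accuracy on a held-out test set.
print(search.best_params_)
print(round(search.best_score_, 3))
```

Note that `best_score_` is a cross-validation average over the training data, so the 0.56 printed above is not directly comparable with the test accuracies reported elsewhere in this notebook.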


In [24]:
grid_search1_params = {
    'max_depth': [3,4,5,6,7],
    'min_samples_leaf': [3,4,5],
    'min_samples_split': [3,4,5]
             }

grid_search1 = GridSearchCV(model_dt, grid_search1_params, cv=5, n_jobs=-2)
grid_search1.fit(X_train, y_train)
pred_grid=grid_search1.predict(X_test)

print("==========================================")
print("Best parameters for Grid search is:")
print(grid_search1.best_params_)
print("==========================================")
print(accuracy_score(y_test,pred_grid))

Best parameters for Grid search is:
{'max_depth': 6, 'min_samples_leaf': 4, 'min_samples_split': 5}
0.5384615384615384


In [25]:
grid_search2_params = {
    'max_leaf_nodes': [5,6,7,8,9,10],
    'min_samples_split': [2,3,4,6,8],
     "n_estimators" : [100, 200, 300, 400, 500]
             }

grid_search2 = GridSearchCV(model_rf, grid_search2_params, cv=5, verbose=2, n_jobs=-2)
grid_search2.fit(X_train, y_train)
pred_grid2=grid_search2.predict(X_test)

print("==========================================")
print("Best parameters for Grid search is:")
print(grid_search2.best_params_)
print("==========================================")
print(accuracy_score(y_test,pred_grid2))

Fitting 5 folds for each of 150 candidates, totalling 750 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  35 tasks      | elapsed:    3.8s
[Parallel(n_jobs=-2)]: Done 156 tasks      | elapsed:   19.1s
[Parallel(n_jobs=-2)]: Done 359 tasks      | elapsed:   43.5s
[Parallel(n_jobs=-2)]: Done 642 tasks      | elapsed:  1.4min


Best parameters for Grid search is:
{'max_leaf_nodes': 8, 'min_samples_split': 8, 'n_estimators': 100}
0.6153846153846154


[Parallel(n_jobs=-2)]: Done 750 out of 750 | elapsed:  1.6min finished


In [26]:
# LightGBM
from lightgbm import LGBMClassifier

model_LGBM = LGBMClassifier().fit(X_train,y_train)
pred_LGBM = model_LGBM.predict(X_test)

print(accuracy_score(y_test,pred_LGBM))

0.6153846153846154


In [27]:
params_LGBM = {"learning_rate" : [0.001, 0.01, 0.1],
          "n_estimators" : [100, 200, 300, 400, 500],
          "max_depth" : [2, 3, 4, 5, 6, 7, 8, 9, 10],
              "num_leaves":[2,3,4,5]}

grid_search = GridSearchCV(model_LGBM, params_LGBM, cv=5)
grid_search.fit(X_train, y_train)
pred_grid=grid_search.predict(X_test)

print("==========================================")
print("Best parameters for Grid search is:")
print(grid_search.best_params_)
print("==========================================")
print(accuracy_score(y_test,pred_grid))

Best parameters for Grid search is:
{'learning_rate': 0.01, 'max_depth': 2, 'n_estimators': 500, 'num_leaves': 3}
0.6153846153846154


After running grid search on several models, none of them beat our best test accuracy so far, approximately 0.62 (0.6154), which the untuned logistic regression had already reached.
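
To get a feel for how much a score of about 0.62 on 26 test rows can tell us, here is a rough sketch of a 95% Wilson score interval for that accuracy. The helper `wilson_interval` is introduced here for illustration, and the interval treats the 26 test rows as independent draws, which is an assumption.

```python
import math

def wilson_interval(n_correct, n_total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = n_correct / n_total
    denom = 1 + z**2 / n_total
    center = (p + z**2 / (2 * n_total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n_total
                               + z**2 / (4 * n_total**2)) / denom
    return center - half_width, center + half_width

# 0.6154 on 26 test rows corresponds to 16 correct predictions.
low, high = wilson_interval(16, 26)
print(f"Accuracy 16/26 = {16 / 26:.3f}, interval ~ [{low:.3f}, {high:.3f}]")
```

The interval is wide enough to contain both chance-level accuracy and the 0.73 target, which suggests that the differences between the models above could largely be noise from the small test set.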

### RFE (Recursive Feature Elimination)

In [36]:
from sklearn.feature_selection import RFE

In [38]:
rfe = RFE(model_LGBM,step=1)
rfe.fit(X_train,y_train)

RFE(estimator=LGBMClassifier())

In [39]:
for i in range(X_train.shape[1]):
    print('Column: %d, Selected %s, Rank: %.3f' % (i, rfe.support_[i], rfe.ranking_[i]))

Column: 0, Selected True, Rank: 1.000
Column: 1, Selected True, Rank: 1.000
Column: 2, Selected False, Rank: 3.000
Column: 3, Selected True, Rank: 1.000
Column: 4, Selected False, Rank: 4.000
Column: 5, Selected False, Rank: 2.000


In [40]:
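# Count the selected features directly from the boolean mask
# (this also avoids shadowing the built-in sum)
n_selected = rfe.support_.sum()

print("Total number of selected features: %d" % n_selected)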

Total number of selected features: 3
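
Column indices are harder to read than question names, so it can help to map the RFE mask back to the column labels. The sketch below shows the idea on synthetic stand-in data, with a decision tree in place of LightGBM to keep it light; `X_demo`, `y_demo` and `rfe_demo` are hypothetical names and the features it selects are not meaningful.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey answers (assumption: not the real data).
rng = np.random.default_rng(0)
cols = ["X1", "X2", "X3", "X4", "X5", "X6"]
X_demo = pd.DataFrame(rng.integers(1, 6, size=(100, 6)), columns=cols)
y_demo = rng.integers(0, 2, size=100)

# A decision tree replaces LightGBM here only to keep the sketch light.
rfe_demo = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=3)
rfe_demo.fit(X_demo, y_demo)

# support_ is a boolean mask, so it can index the column names directly.
selected = X_demo.columns[rfe_demo.support_].tolist()
n_selected = int(rfe_demo.support_.sum())

print("Selected features:", selected)
print("Number of selected features:", n_selected)
```

In the run above, the selected columns 0, 1 and 3 correspond to X1, X2 and X4.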


In [41]:
print(accuracy_score(y_test,rfe.predict(X_test)))

0.7307692307692307


In [42]:
# RFE-2

In [50]:
rfe2 = RFE(model_log,step=3)
rfe2.fit(X_train,y_train)
print(accuracy_score(y_test,rfe2.predict(X_test)))

0.6153846153846154


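We managed to get a test accuracy of about 0.731, just above the 0.73 target, after RFE with LightGBM, which kept three of the six features (columns 0, 1 and 3, i.e. X1, X2 and X4). RFE with logistic regression stayed at about 0.615.

Comment: I am not sure whether feature elimination techniques are appropriate for a survey dataset, since every column represents a survey question.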

### Trying a regression model 

With relatively small datasets, it is sometimes better to try other simple algorithms and keep the one that fits best than to reach for more advanced models. Applying regression to a classification problem is one such option. Here I fit a linear regression, which produces continuous predictions, and then pick a cut-off value (0.65 in this case): predictions at or above it are set to 1 and those below it to 0. The value 0.65 is arbitrary; I arrived at it by trying different values starting from 0.5. (I am trying this method for the first time.)

In [46]:
from sklearn.linear_model import LinearRegression

In [47]:
model_linear = LinearRegression().fit(X_train,y_train)
pred_linear = model_linear.predict(X_test)
pred_linear

array([0.61416641, 0.65338591, 0.46174141, 0.48135541, 0.72015349,
       0.33488129, 0.41986033, 0.67729651, 0.65907771, 0.521422  ,
       0.37288508, 0.53626817, 0.50872744, 0.7988093 , 0.63335147,
       0.58636659, 0.59188968, 0.52711494, 0.66684423, 0.679111  ,
       0.50344302, 0.75528241, 0.36900778, 0.72263674, 0.68298945,
       0.64027206])

In [48]:
corrected_linear = [0 if i<0.65 else 1 for i in pred_linear ]
corr_lin = pd.Series(corrected_linear)

In [49]:
accuracy_score(y_test,corr_lin)

0.6538461538461539
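
Because the 0.65 cut-off was found by trying values against the test set, the score above may be optimistic. One possible alternative, sketched here on synthetic stand-in data (all names such as `X_demo` and `best_threshold` are hypothetical), is to hold out part of the training data and choose the cut-off there, again starting from 0.5:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training data (assumption: not the real survey).
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 6, size=(100, 6))
# Make the label loosely depend on the first answer so there is some signal.
y_demo = (X_demo[:, 0] + rng.normal(0, 1.5, size=100) > 3.5).astype(int)

# Hold out part of the training data for choosing the threshold.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

reg = LinearRegression().fit(X_fit, y_fit)
val_scores = reg.predict(X_val)

# Try a small grid of cut-offs and keep the best one on the validation split.
thresholds = np.linspace(0.50, 0.75, 6)
accuracies = [accuracy_score(y_val, (val_scores >= t).astype(int))
              for t in thresholds]
best_threshold = float(thresholds[int(np.argmax(accuracies))])

print(f"Best validation threshold: {best_threshold:.2f}")
print(f"Validation accuracy at that threshold: {max(accuracies):.3f}")
```

The chosen threshold would then be applied once to the test predictions, so the test set is not used for tuning.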

# Explainable AI

In [None]:
# !pip install shap

In [None]:
import shap

In [None]:
# We will use a "Linear Explainer" since logistic regression is a linear model
explainer = shap.LinearExplainer(model_log, X_train)
shap_values = explainer.shap_values(X_test)
pd.DataFrame(shap_values, columns=X_test.columns).head(5)

In [None]:
print('Expected Value:', explainer.expected_value)

In [None]:

# Sum of the SHAP values for the first test observation
a = pd.DataFrame(shap_values, columns=X_test.columns)
a.loc[0, :].sum()
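
The sum above is useful because, for a linear model, the expected value plus the per-feature contributions should add up to the model's log-odds output for that row. The sketch below checks this additivity by hand rather than through the shap library. It assumes features are treated as independent, so that the contribution of feature j is its coefficient times the distance from the background mean, and it uses synthetic stand-in data (`X_demo`, `y_demo` and `clf` are hypothetical names).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the survey data (assumption: not the real file).
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 6, size=(100, 6)).astype(float)
y_demo = (X_demo[:, 0] + rng.normal(0, 1.5, size=100) > 3.5).astype(int)

clf = LogisticRegression().fit(X_demo, y_demo)
weights = clf.coef_[0]
background_mean = X_demo.mean(axis=0)

# For a linear model with features treated as independent, the contribution
# of feature j for row x is weight_j * (x_j - mean_j), in log-odds units.
x = X_demo[0]
contributions = weights * (x - background_mean)

# Base value: the model's log-odds output at the background mean.
base_value = clf.decision_function(background_mean.reshape(1, -1))[0]

# Additivity: base value plus contributions equals the row's log-odds.
reconstructed = base_value + contributions.sum()
actual = clf.decision_function(x.reshape(1, -1))[0]
print(reconstructed, actual)
```

The two printed numbers should agree up to floating-point error. Both are in log-odds units, so a positive total points towards the happy class and a negative one towards the unhappy class.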

In [None]:
# Compare the label and the prediction for the same observation (row 0) as the SHAP sum above
print("Test data (actual observation): {}".format(y_test.iloc[0]))
print("Model's prediction: {}".format(preds[0]))