# XGBoost

This notebook uses `XGBoost` to try to push the accuracy above 73%.

In [1]:
import numpy as np
import pandas as pd

from lazypredict.Supervised import LazyClassifier

In [2]:
X_train=pd.read_csv('../data/1_X_train.csv').to_numpy()
X_test=pd.read_csv('../data/1_X_test.csv').to_numpy()
y_train=pd.read_csv('../data/1_y_train.csv').to_numpy().flatten()
y_test=pd.read_csv('../data/1_y_test.csv').to_numpy().flatten()

In [3]:
clf = LazyClassifier(verbose=0,ignore_warnings=True)
models, predictions = clf.fit(X_train=X_train,
                              X_test=X_test,
                              y_train=y_train,
                              y_test=y_test)
# predictions

100%|██████████████████████████████████████████████████████████████████████████| 29/29 [00:00<00:00, 43.21it/s]

[LightGBM] [Info] Number of positive: 55, number of negative: 45
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000474 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 31
[LightGBM] [Info] Number of data points in the train set: 100, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.550000 -> initscore=0.200671
[LightGBM] [Info] Start training from score 0.200671





In [6]:
models.to_csv('../joblib/1_lazypredict.csv')

After exploring alternatives, including `LGBMClassifier`, I will now use `XGBoost` instead.

## Imports

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [3]:
X_train=pd.read_csv('../data/1_X_train.csv').to_numpy()
X_test=pd.read_csv('../data/1_X_test.csv').to_numpy()
y_train=pd.read_csv('../data/1_y_train.csv').to_numpy().flatten()
y_test=pd.read_csv('../data/1_y_test.csv').to_numpy().flatten()

In [4]:
print(f'''
Shapes:
X_train: {X_train.shape}
X_test:  {X_test.shape}
y_train: {y_train.shape}
y_test:  {y_test.shape}
''')


Shapes:
X_train: (100, 6)
X_test:  (26, 6)
y_train: (100,)
y_test:  (26,)



## Initial model run

In [5]:
xgbc = XGBClassifier()
xgbc.fit(X_train, y_train)

y_pred = xgbc.predict(X_test)
print(f'Score on test: {xgbc.score(X_test,y_test)}')
print(classification_report(y_test, y_pred))

Score on test: 0.5769230769230769
              precision    recall  f1-score   support

           0       0.56      0.42      0.48        12
           1       0.59      0.71      0.65        14

    accuracy                           0.58        26
   macro avg       0.57      0.57      0.56        26
weighted avg       0.57      0.58      0.57        26



The base model is not great yet: 0.58 accuracy on 26 test rows is well short of the 73% target. Let's keep going.
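To put the score in context, it helps to compare it with a majority-class baseline and to see how noisy an accuracy estimate is on so few rows. The sketch below uses only the support counts and the accuracy printed in the report above; the binomial standard error is a rough approximation, not an exact interval.

```python
import math

# Counts taken from the classification report above (support column).
n_class0, n_class1 = 12, 14
n_test = n_class0 + n_class1

# Number of correct predictions implied by the reported accuracy on 26 rows.
n_correct = round(0.5769230769230769 * n_test)

accuracy = n_correct / n_test
# Accuracy of always predicting the more frequent class.
baseline = max(n_class0, n_class1) / n_test
# Binomial standard error of the accuracy estimate (rough approximation).
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)

print(f'accuracy:  {accuracy:.3f}')
print(f'baseline:  {baseline:.3f}')
print(f'std error: {std_err:.3f}')
```

The model beats the baseline by roughly 0.04, which is less than one standard error, so this single split cannot really distinguish the model from always predicting class 1.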

## Restart from the beginning with a full dataset

We will start from the beginning with the full dataset so that we don't have to monkey around with keeping track of four dataframes.

In [6]:
df=pd.read_csv('../data/ACME-HappinessSurvey2020.csv')

# renaming columns to preserve order
# and make them more intelligible
df.rename(columns={'Y':'y',
                   'X1':'a_time',
                   'X2':'b_contents',
                   'X3':'c_complete',
                   'X4':'d_price',
                   'X5':'e_courier',
                   'X6':'f_app'},inplace=True)

df.sample(5)

Unnamed: 0,y,a_time,b_contents,c_complete,d_price,e_courier,f_app
16,0,5,3,4,5,4,5
107,0,4,2,4,4,4,4
41,1,4,2,4,3,2,4
125,0,5,3,2,5,5,5
50,1,5,1,3,3,4,4


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126 entries, 0 to 125
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   y           126 non-null    int64
 1   a_time      126 non-null    int64
 2   b_contents  126 non-null    int64
 3   c_complete  126 non-null    int64
 4   d_price     126 non-null    int64
 5   e_courier   126 non-null    int64
 6   f_app       126 non-null    int64
dtypes: int64(7)
memory usage: 7.0 KB


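Note that the dataset consists of survey responses, which are categorical (ratings stored as integers), but they are not encoded as such. Converting the survey columns to the `category` dtype lets the model treat them as discrete levels rather than continuous numbers, which may help.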

In [8]:
X=df[[col for col in df.columns if col != 'y']].copy()
y=df['y'].copy().astype('int8') # because it's a binary

for col in X.columns:
    X[col]=X[col].astype('category')
    
y.dtypes

dtype('int8')
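A possible refinement of the conversion above is to declare the category levels explicitly, so every column carries the same set of levels even if a rating never shows up in a given column or split. This is a minimal sketch on a small made-up frame; the 1 to 5 scale is an assumption based on the sample rows, and `X_demo` and `likert` are hypothetical names.

```python
import pandas as pd

# Hypothetical stand-in for two of the renamed survey columns
# (values assumed to be on a 1-5 scale).
X_demo = pd.DataFrame({'a_time': [5, 4, 4, 5, 5],
                       'b_contents': [3, 2, 2, 3, 1]})

# One shared dtype: every column gets the same five ordered levels,
# including levels that do not occur in the data.
likert = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
X_demo = X_demo.astype(likert)

print(X_demo.dtypes)
print(X_demo['a_time'].cat.categories.tolist())
```

Whether a downstream model makes use of the ordering depends on the model; the main benefit shown here is a consistent set of levels across columns.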

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=.2,
                                                    stratify=y)

print(f'''
X_train shape: {X_train.shape}
y_train shape: {y_train.shape}
X_test shape:  {X_test.shape}
y_test shape:  {y_test.shape}
''')


X_train shape: (100, 6)
y_train shape: (100,)
X_test shape:  (26, 6)
y_test shape:  (26,)



Following [this tutorial](https://www.datacamp.com/tutorial/xgboost-in-python) on DataCamp, I will create a `DMatrix`. This is XGBoost's internal data structure, optimized for memory efficiency and training speed.

In [33]:
import xgboost as xgb

# named `_clf` rather than the tutorial's `_reg`, since this is a classification task
dtrain_clf = xgb.DMatrix(X_train, y_train, enable_categorical=True)
dtest_clf = xgb.DMatrix(X_test, y_test, enable_categorical=True)

In [None]:
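from sklearn.model_selection import GridSearchCV

# enable_categorical lets XGBoost consume the `category` columns directly
# (assumes a recent XGBoost release with the `hist` tree method)
xgbc = XGBClassifier(tree_method='hist', enable_categorical=True)

# 'loss' is a GradientBoostingClassifier option, not an XGBoost one, so it is dropped
parameters = {
    'learning_rate': np.logspace(-1.3, -.8239, 10),
    'n_estimators': np.linspace(195, 205, 10).astype(int)
}
grid_search_xgbc = GridSearchCV(xgbc, parameters, cv=5, n_jobs=7, verbose=3)
# fit on the unscaled training split; tree models do not need feature scaling
grid_search_xgbc.fit(X_train, y_train)

# best cross-validated score
print(f"best score: {grid_search_xgbc.best_score_}")

# best parameters
print(f"best parameters: {grid_search_xgbc.best_params_}")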