# XGBoost

This notebook uses `XGBoost` to try to push the accuracy above 73%.

## Run this section in the `lazypredict` environment

In [None]:
import numpy as np
import pandas as pd

from lazypredict.Supervised import LazyClassifier

In [None]:
X_train=pd.read_csv('../data/1_X_train.csv').to_numpy()
X_test=pd.read_csv('../data/1_X_test.csv').to_numpy()
y_train=pd.read_csv('../data/1_y_train.csv').to_numpy().flatten()
y_test=pd.read_csv('../data/1_y_test.csv').to_numpy().flatten()

In [None]:
clf = LazyClassifier(verbose=0,ignore_warnings=True)
models, predictions = clf.fit(X_train=X_train,
                              X_test=X_test,
                              y_train=y_train,
                              y_test=y_test)

In [None]:
models

In [None]:
models.to_csv('../joblib/1_lazypredict.csv')

## Switch to `sklearn` environment

In [3]:
import numpy as np
import pandas as pd

In [4]:
models=pd.read_csv('../joblib/1_lazypredict.csv')

In [5]:
models

|   | Model | Accuracy | Balanced Accuracy | ROC AUC | F1 Score | Time Taken |
|---|-------|----------|-------------------|---------|----------|------------|
| 0 | LGBMClassifier | 0.615385 | 0.613095 | 0.613095 | 0.615385 | 0.075215 |
| 1 | XGBClassifier | 0.576923 | 0.565476 | 0.565476 | 0.567175 | 0.124167 |
| 2 | LabelPropagation | 0.576923 | 0.565476 | 0.565476 | 0.567175 | 0.007883 |
| 3 | ExtraTreesClassifier | 0.538462 | 0.529762 | 0.529762 | 0.532867 | 0.033458 |
| 4 | KNeighborsClassifier | 0.538462 | 0.529762 | 0.529762 | 0.532867 | 0.011579 |
| 5 | LabelSpreading | 0.538462 | 0.529762 | 0.529762 | 0.532867 | 0.016714 |
| 6 | RandomForestClassifier | 0.538462 | 0.52381 | 0.52381 | 0.521154 | 0.082213 |
| 7 | DummyClassifier | 0.538462 | 0.5 | 0.5 | 0.376923 | 0.002986 |
| 8 | BaggingClassifier | 0.5 | 0.488095 | 0.488095 | 0.488479 | 0.010814 |
| 9 | LogisticRegression | 0.5 | 0.482143 | 0.482143 | 0.472089 | 0.00746 |


After exploring alternatives, including `LGBMClassifier`, I will now use `XGBoost` instead.

## Imports

In [6]:
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [7]:
# read in and setup dataset

df=pd.read_csv('../data/ACME-HappinessSurvey2020.csv')

# renaming columns to preserve order
# and make them more intelligible
df.rename(columns={'Y':'y',
                   'X1':'a_time',
                   'X2':'b_contents',
                   'X3':'c_complete',
                   'X4':'d_price',
                   'X5':'e_courier',
                   'X6':'f_app'},inplace=True)

# df.dtypes

X=df[[col for col in df.columns if col != 'y']].copy()
y=df['y'].copy().astype('int8') # binary target, so int8 is sufficient

ValueError: DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, The experimental DMatrix parameter`enable_categorical` must be set to `True`.  Invalid columns:a_time: category, b_contents: category, c_complete: category, d_price: category, e_courier: category, f_app: category

In [8]:
# setup random state for reproducibility
random_state=42

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.2, 
                                                    stratify=y,
                                                    random_state=random_state)

In [10]:
print(f'''
Shapes:
X_train: {X_train.shape}
X_test:  {X_test.shape}
y_train: {y_train.shape}
y_test:  {y_test.shape}
''')


Shapes:
X_train: (100, 6)
X_test:  (26, 6)
y_train: (100,)
y_test:  (26,)



## Initial model run

In [11]:
xgbc = XGBClassifier()
xgbc.fit(X_train, y_train)

y_pred = xgbc.predict(X_test)
print(f'Score on test: {xgbc.score(X_test,y_test)}')
print(classification_report(y_test, y_pred))

Score on test: 0.5384615384615384
              precision    recall  f1-score   support

           0       0.50      0.42      0.45        12
           1       0.56      0.64      0.60        14

    accuracy                           0.54        26
   macro avg       0.53      0.53      0.53        26
weighted avg       0.53      0.54      0.53        26



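The base model is not great yet: its test accuracy of 0.54 is the same as the `DummyClassifier` accuracy in the `lazypredict` table (0.538462). Let's keep going.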
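To put the 0.54 in context, here is a minimal sketch of the majority-class baseline implied by the class supports in the report above (12 and 14). The variable names are illustrative only.

```python
# Class supports copied from the classification report above
support = {0: 12, 1: 14}

n_test = sum(support.values())

# A classifier that always predicts the most frequent class
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / n_test

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

Always predicting class 1 gets 14 of 26 test rows right, which is the same accuracy the untuned `XGBClassifier` reports.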

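Note that the dataset consists of survey responses, which are categorical, but they are not stored with a categorical dtype. Casting them to the pandas `category` dtype may help, because XGBoost can then treat each response level as a category instead of a continuous value (this requires `enable_categorical=True`).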

In [14]:
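# Cast every feature column to the pandas category dtype.
# X_train and X_test were split off earlier, so the split has to be
# re-run for them to pick up the new dtype.
for col in X.columns:
    X[col]=X[col].astype('category')
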
print(f'''
Datatypes:

y: 
{y.dtypes}

X: 
{X.dtypes}
''')


Datatypes:

y: 
int8

X: 
a_time        category
b_contents    category
c_complete    category
d_price       category
e_courier     category
f_app         category
dtype: object



Following [this tutorial](https://www.datacamp.com/tutorial/xgboost-in-python) on DataCamp, I will create a `DMatrix`. This is XGBoost's internal data structure, optimized for memory efficiency and training speed.

In [15]:
import xgboost as xgb

dtrain_reg = xgb.DMatrix(X_train, y_train, enable_categorical=True)
dtest_reg = xgb.DMatrix(X_test, y_test, enable_categorical=True)

In [16]:
from sklearn.model_selection import GridSearchCV

In [68]:
xgbc = XGBClassifier()

parameters = {
    'alpha': [0], #(list(np.linspace(0,1,3))),
    'gamma': [0], #(list(np.linspace(0,1,3))),
    'lambda': (list(np.linspace(0.275,0.325,6))),
    'learning_rate': (np.logspace(0.211,0.213,9)),
    'max_depth': [2], #(list(np.arange(1,4))),
    'min_child_weight': (list(np.linspace(3.5,4.5,9))),
    'n_estimators': (np.arange(53,58))
}

grid_search = GridSearchCV(xgbc, 
                           parameters, 
                           cv = 5, 
                           n_jobs = -1, 
                           verbose = 0)

grid_search.fit(X_train, y_train)

In [69]:
# best score
print(f"best score: {grid_search.best_score_}")

# best parameters 
print(f"best parameters: {grid_search.best_params_}")

best score: 0.68
best parameters: {'alpha': 0, 'gamma': 0, 'lambda': 0.275, 'learning_rate': 1.6330519478943344, 'max_depth': 2, 'min_child_weight': 3.875, 'n_estimators': 56}


```python
best score: 0.6900000000000001
best parameters: {'alpha': 0, 'gamma': 0, 'lambda': 0.3, 'learning_rate': 1.6292960326397223, 'max_depth': 2, 'min_child_weight': 4.0, 'n_estimators': 55}
```
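
To make the size and range of the search concrete, the sketch below rebuilds the same grid and counts the candidates. It only uses NumPy. The names `grid`, `rates`, `n_candidates`, and `n_fits` are illustrative.

```python
import numpy as np

# Same grid definitions as in the search above
grid = {
    'alpha': [0],
    'gamma': [0],
    'lambda': list(np.linspace(0.275, 0.325, 6)),
    'learning_rate': np.logspace(0.211, 0.213, 9),
    'max_depth': [2],
    'min_child_weight': list(np.linspace(3.5, 4.5, 9)),
    'n_estimators': np.arange(53, 58),
}

# np.logspace takes base-10 exponents: values run from 10**0.211 to 10**0.213
rates = grid['learning_rate']
print(rates.min(), rates.max())

# Number of parameter combinations, and total fits with cv = 5
n_candidates = int(np.prod([len(v) for v in grid.values()]))
n_fits = n_candidates * 5
print(n_candidates, n_fits)
```

The grid holds 2430 combinations, so the 5-fold search runs 12150 fits. Every learning rate in it is above 1.6, because `np.logspace` interprets its arguments as exponents. XGBoost documents `learning_rate` (`eta`) with a range of [0, 1], so if smaller rates were intended, the exponents may be worth revisiting.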

### Pipeline

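Example of a pipeline that chains scaling, recursive feature elimination (RFE), and a Bernoulli naive Bayes classifier:

```python
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# X and y are the features and target prepared earlier in this notebook
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

model = Pipeline(
    steps=[
        # standardize features; BernoulliNB then binarizes them at 0
        ('scaler', StandardScaler()),
        # keep the 3 features ranked highest by the estimator's log probabilities
        ('rfe', RFE(
            estimator=BernoulliNB(),
            n_features_to_select=3,
            importance_getter='feature_log_prob_'
        )),
        ('bnb', BernoulliNB())
    ]
)

model.fit(X_train, y_train)

# evaluate on the held-out test set
y_pred = model.predict(X_test)
metrics = {
    'accuracy': accuracy_score(y_test, y_pred),
    'precision': precision_score(y_test, y_pred),
    'recall': recall_score(y_test, y_pred),
    'f1': f1_score(y_test, y_pred)
}
```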