# Problem Statement

In this competition, we predict whether or not an email is spam.

We are going to cover the following steps:
1. Import Libraries
2. Model: XGBoost Linear using gblinear
3. Evaluation
4. Submission
5. References
6. Notes

Let's get started.

# Import Libraries

In [None]:
import pandas as pd
from pathlib import Path

data_dir = Path('../input/tabular-playground-series-nov-2021/')

df_train = pd.read_csv(
    data_dir / "train.csv",
    index_col='id'
)

FEATURES = df_train.columns[:-1]
TARGET = df_train.columns[-1]

df_train.head()

The 'target' column is binary: 0 for not spam and 1 for spam.

# Model: XGBoost Linear using gblinear

In [None]:
from xgboost import XGBClassifier

X = df_train.loc[:, FEATURES]
y = df_train.loc[:, TARGET]

model = XGBClassifier(
    n_jobs=-1,
    random_state=0,
    # Binary target, so use the logistic objective. 'reg:squarederror'
    # (the replacement for the deprecated 'reg:linear') is meant for regression.
    objective='binary:logistic',
    # Linear booster: tree-only options such as max_depth, subsample,
    # colsample_bytree and tree_method are not used by gblinear.
    booster='gblinear',
)

# Evaluation

The evaluation metric is AUC, the area under the ROC (receiver operating characteristic) curve.

In [None]:
from sklearn.model_selection import cross_validate
import warnings 
warnings.filterwarnings('ignore')

def score(X, y, model, cv):
    scoring = ["roc_auc"]
    scores = cross_validate(
        model, X, y, scoring=scoring, cv=cv, return_train_score=True
    )
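    scores = pd.DataFrame(scores).T
    # Compute both statistics from the per-fold columns only. Chaining them in
    # one assign() call would let std see the freshly added mean column.
    fold_mean = scores.mean(axis=1)
    fold_std = scores.std(axis=1)
    return scores.assign(mean=fold_mean, std=fold_std)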

scores = score(X, y, model, cv=2)

display(scores)

An AUC of 0.5 corresponds to random guessing, so any score above 0.5 means our model learned something useful.
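One way to read AUC: it is the probability that a randomly chosen spam email gets a higher score than a randomly chosen non-spam email, with ties counted as half. The sketch below computes it by brute force in plain Python. The helper name `pairwise_auc` and the toy labels and scores are made up for illustration; in the notebook itself, scikit-learn's `roc_auc` scorer does this work.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]  # toy labels, not competition data

print(pairwise_auc(y_true, [0.1, 0.4, 0.35, 0.8]))  # 0.75: one pair out of four is misordered
print(pairwise_auc(y_true, [0.5, 0.5, 0.5, 0.5]))   # 0.5: constant scores carry no information
print(pairwise_auc(y_true, [0.1, 0.2, 0.8, 0.9]))   # 1.0: every spam outranks every non-spam
```

Because only the ordering of the scores matters, AUC does not require the scores to be calibrated probabilities.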

# Submission

By default, predict returns hard labels (0 or 1), but we're allowed to submit probabilities instead, which suit a ranking metric like AUC better. With the scikit-learn API, that means calling the predict_proba method instead of predict.

In [None]:
# Fit on full training set
model.fit(X, y)

X_test = pd.read_csv(data_dir / "test.csv", index_col='id')

# Make predictions: column 1 of predict_proba is the score for class 1 (spam)
y_pred = pd.Series(
    model.predict_proba(X_test)[:, 1],
    index=X_test.index,
    name=TARGET,
)

# Create submission file
y_pred.to_csv("submission_xgb_linear.csv")
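To see why submitting scores rather than hard labels can matter for AUC, here is a small sketch in plain Python. Thresholding collapses the scores into two values, which creates ties and throws away ranking information. The helper `auc`, the 0.5 threshold, and the toy numbers are all assumptions made for illustration, not output from the model above.

```python
def auc(y_true, y_score):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly, ties count half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): true labels and predicted spam probabilities.
y_true = [0, 0, 1, 0, 1, 1]
proba = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

# Hard labels, as a 0.5 threshold would produce them.
hard = [1 if p > 0.5 else 0 for p in proba]

auc_proba = auc(y_true, proba)
auc_hard = auc(y_true, hard)

print(round(auc_proba, 3))  # 0.889
print(round(auc_hard, 3))   # 0.667
```

On this toy data the probabilities score higher because they keep the relative ordering within each predicted class, which the hard labels discard.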

# References

1. Thank you to this [Stack Overflow question](https://stackoverflow.com/questions/55493454/xgboost-linear-regression-gblinear-wrong-predictions) for demonstrating how to use XGBoost Linear.
2. Thank you to Ryan Holbrook, Alexis Cook and inversion for demonstrating how to use XGBoost in their [notebook](https://www.kaggle.com/ryanholbrook/getting-started-september-2021-tabular-playground/notebook).
3. Thank you to @pinstripezebra (Lucas See) for suggesting that I try the XGBoost Linear model instead of a tree-based one.

# Notes

1. What is the difference between XGBRegressor and XGBClassifier, and between booster='gblinear' and booster='gbtree'?
2. Compare outputs of [XGBClassifier (tree)](https://www.kaggle.com/sugamkhetrapal/tps-nov-2021-1-06-xgboost/notebook) and [XGBClassifier (linear)](https://www.kaggle.com/sugamkhetrapal/tps-nov-2021-1-14-xgboost-linear/).
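As a rough starting point for note 1: with booster='gblinear' the fitted model is a single regularized linear function of the features, while booster='gbtree' sums the outputs of decision trees, which gives a piecewise-constant function. With a logistic objective, both pass that raw margin through a sigmoid. The sketch below uses made-up weights and a made-up one-split tree (both hypothetical, not taken from a fitted XGBoost model) to show the difference in shape.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical linear booster: one weight per feature plus a bias.
WEIGHTS = [0.8, -0.5]
BIAS = 0.1


def linear_score(x):
    """Score changes smoothly with every feature value."""
    margin = BIAS + sum(w * xi for w, xi in zip(WEIGHTS, x))
    return sigmoid(margin)


def stump_score(x):
    """Hypothetical tree booster with a single depth-1 tree splitting on feature 0."""
    margin = 1.0 if x[0] >= 0.5 else -1.0
    return sigmoid(margin)


# Move feature 0 while holding feature 1 fixed.
for x in ([0.0, 0.0], [0.4, 0.0], [0.6, 0.0], [2.0, 0.0]):
    print(x, round(linear_score(x), 3), round(stump_score(x), 3))
```

The linear score keeps rising as feature 0 grows, while the stump only changes when the input crosses its split point. A real tree ensemble combines many such trees, so it can capture interactions and non-linear effects that a linear booster cannot.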