# Application of XGBoost

We will train an XGBoost model and use its predictions to make a submission to the [Tabular Playground Series - Mar 2021 Competition](https://www.kaggle.com/c/tabular-playground-series-mar-2021). XGBoost is one of the most widely used models on Kaggle, so we will put it to the test in a competition.

In [None]:
# Import packages
import numpy as np # Handling matrices
import pandas as pd # Data processing
import matplotlib.pyplot as plt # Plotting
import seaborn as sns # Plotting 
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, OrdinalEncoder # Handling categorical data and normalization
from sklearn.model_selection import train_test_split, cross_val_score # Train/validation split and cross-validation
from sklearn.metrics import roc_auc_score, precision_score, confusion_matrix, accuracy_score, roc_curve, f1_score # Several useful metrics
from xgboost import XGBClassifier # XGB model
from sklearn.pipeline import Pipeline # Chain preprocessing and model steps
from sklearn.compose import ColumnTransformer # Apply transformers to specific columns

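# Set matplotlib configuration
%matplotlib inline
# The 'seaborn' style was renamed 'seaborn-v0_8' in matplotlib 3.6; use whichever is available
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')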

# 1) Review and analysis of data

In [None]:
# Import data
data = pd.read_csv('../input/tabular-playground-series-mar-2021/train.csv')
print("This dataset contains: {} rows and {} columns".format(data.shape[0],data.shape[1]))
data.head()

In [None]:
# Review the type of each feature
data.dtypes

In [None]:
# Count the features of each type
print(data.dtypes.value_counts())
print('This dataset contains {} categorical features'.format(data.select_dtypes(include='object').shape[1]))
print('This dataset contains {} numerical features'.format(data.select_dtypes(include='float64').shape[1]))

# id and target are the only integer columns

In [None]:
# Analyse missing values
data.isna().sum()

# The dataset has no missing values

In [None]:
# Identify categorical features
cat = (data.dtypes == 'object')
cat_cols = list(cat[cat].index)
print(cat_cols)

# Plot the category counts of each categorical feature
for col in cat_cols:
    plt.figure(figsize=(8, 4))
    sns.countplot(x=data[col])

In [None]:
# Create a list of numerical_cols
numerical_cols = [cname for cname in data.columns if data[cname].dtype in ['float64']]

# Plot the distribution of each numerical feature
data[numerical_cols].hist(bins=15, figsize=(20, 14), layout=(7, 3));

Our categorical and numerical features behave differently. Some categorical features have few classes and others have many, while the numerical features follow a variety of distributions.

In [None]:
# Analyse our target column
data['target'].hist(bins=15, figsize=(12,6));

# We observe that our data is imbalanced. This is an important point.

# 2) Create a model

Following the recommendation of Kaggle's [Intermediate ML](https://www.kaggle.com/alexisbcook/categorical-variables) tutorial, we encode categorical features differently depending on their cardinality. Features with fewer than 12 unique values get One-Hot Encoding, while features with 12 or more unique values get Ordinal Encoding.
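
As a small illustration of the difference between the two encodings, here is a sketch on a hypothetical toy column (`color` is not part of the competition data). One-Hot Encoding creates one binary column per category, while Ordinal Encoding maps each category to an integer code in a single column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy column, not taken from the competition data
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category
one_hot = OneHotEncoder(handle_unknown="ignore").fit_transform(toy).toarray()

# Ordinal encoding: a single column of codes (categories are sorted: blue=0, green=1, red=2)
ordinal = OrdinalEncoder().fit_transform(toy)

print(one_hot.shape)    # (4, 3)
print(ordinal.ravel())  # [2. 1. 0. 1.]
```

With many categories, one-hot encoding adds one column per category, which is one reason to fall back on ordinal codes for high-cardinality features.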

In [None]:
# Separate the features from the target
y = data['target']
X = data.drop(['id','target'],axis = 1)

In [None]:
# Divide data into training and validation subsets, stratifying by target class.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=123, stratify=y)

In [None]:
# Print class proportions in the entire dataset
print("Proportion of classes in entire data: ")
print(100. * y.value_counts() / len(y),"\n")

# Print class proportions in the train and validation sets
print("Proportion of classes in train data: ")
print(100. * y_train.value_counts() / len(y_train),"\n")
print("Proportion of classes in valid data: ")
print(100. * y_valid.value_counts() / len(y_valid))
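
Because the classes are imbalanced, one option (not used in this notebook) is XGBoost's `scale_pos_weight` parameter, which is commonly set to the ratio of negative to positive examples. A minimal sketch of that ratio on a hypothetical toy target:

```python
import pandas as pd

# Hypothetical toy target with 3 negatives for every positive (not the competition data)
toy_y = pd.Series([0, 0, 0, 1, 0, 0, 0, 1])

# Ratio of negative to positive examples
counts = toy_y.value_counts()
ratio = counts.loc[0] / counts.loc[1]
print(ratio)  # 3.0
```

With the notebook's data, the ratio would presumably be computed from `y_train` and passed as `XGBClassifier(..., scale_pos_weight=ratio)`. Whether it improves the score here is untested.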

In [None]:
# Identify categorical columns with relatively low cardinality (low number of unique values)
categorical_cols_O = [cname for cname in X_train.columns if X_train[cname].nunique() < 12 and 
                    X_train[cname].dtype == "object"]

# Identify categorical columns with high cardinality
categorical_cols_L = [cname for cname in X_train.columns if X_train[cname].nunique() >= 12 and 
                    X_train[cname].dtype == "object"]

# Identify numerical columns
numerical_cols = [cname for cname in X_train.columns if X_train[cname].dtype in ['int64', 'float64']]

In [None]:
# Preprocessing

# For categorical columns with low cardinality
categorical_O_transformer = OneHotEncoder(handle_unknown = 'ignore')

# For categorical columns with high cardinality (unseen categories are encoded as -99)
categorical_L_transformer = OrdinalEncoder(handle_unknown = 'use_encoded_value',
                                          unknown_value = -99)

# For numerical columns
numerical_transformer = MinMaxScaler()

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('cat_O', categorical_O_transformer, categorical_cols_O),
        ('cat_L', categorical_L_transformer, categorical_cols_L),
        ('num', numerical_transformer, numerical_cols)
    ])
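
To see what this preprocessor produces, and how the `handle_unknown` settings behave on categories that were not seen during fitting, here is a minimal sketch on hypothetical toy data. The column names `low_card`, `high_card` and `num` are illustrative only, and `sparse_threshold=0` is added just to force a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder

# Hypothetical toy data; column names are illustrative only
train = pd.DataFrame({
    "low_card": ["a", "b", "a", "b"],
    "high_card": ["k1", "k2", "k3", "k4"],
    "num": [0.0, 5.0, 10.0, 20.0],
})
# A new row whose categories were never seen during fit
new = pd.DataFrame({"low_card": ["c"], "high_card": ["k9"], "num": [10.0]})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("cat_O", OneHotEncoder(handle_unknown="ignore"), ["low_card"]),
        ("cat_L", OrdinalEncoder(handle_unknown="use_encoded_value",
                                 unknown_value=-99), ["high_card"]),
        ("num", MinMaxScaler(), ["num"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
toy_preprocessor.fit(train)
encoded_new = toy_preprocessor.transform(new)

# Unknown one-hot category -> all zeros; unknown ordinal category -> -99;
# 10.0 scaled to the fitted range [0, 20] -> 0.5
print(encoded_new)
```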

In [None]:
# Create the model
model = XGBClassifier(n_estimators=1000, learning_rate=0.05, n_jobs=4, use_label_encoder=False,
                      objective="binary:logistic", eval_metric="auc")

In [None]:
# Bundle preprocessing and modeling code in a pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

In [None]:
# Preprocess the training data and fit the model
pipeline.fit(X_train, y_train)
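
`cross_val_score` is imported at the top of the notebook but never used. A single 80/20 split gives only one estimate of performance, so cross-validation is a natural complement. The sketch below uses synthetic data and a `LogisticRegression` purely as a stand-in estimator; neither is part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (not the competition data)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=123)

# LogisticRegression is only a stand-in estimator for the notebook's pipeline
scores = cross_val_score(LogisticRegression(), X_toy, y_toy, cv=5, scoring="roc_auc")

# One ROC AUC value per fold
print(scores.mean(), scores.std())
```

With the notebook's objects, the call would presumably be `cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")`, at the cost of fitting the 1000-tree model five times.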

In [None]:
# Preprocess the validation data and get predicted probabilities
y_pred = pipeline.predict_proba(X_valid)

# Note that the output has two columns, one per class (column 1 is the positive class)
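
The shape of a `predict_proba` output can be checked on a small example. This sketch uses hypothetical toy data and a `LogisticRegression` as a stand-in for the XGBoost pipeline. The columns follow the order of `classes_`, and each row sums to 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, binary target
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# LogisticRegression is only a stand-in for the XGBoost pipeline
clf = LogisticRegression().fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy)

print(clf.classes_)  # column order of predict_proba: [0 1]
print(proba.shape)   # (6, 2)

# 1-D array with the probability of the positive class
positive_proba = proba[:, 1]
```

Indexing with `[:, 1]` returns a 1-D array, whereas `[:, [1]]` keeps a second dimension of size 1.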

In [None]:
# Function to plot ROC curve
def plot_roc_curve(fpr, tpr):
    plt.plot(fpr, tpr, color='orange', label='ROC')
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()

# Create plot using the predicted probability of the positive class
fpr, tpr, thresholds = roc_curve(y_valid, y_pred[:, 1])
plot_roc_curve(fpr, tpr)

In [None]:
# Create confusion matrix using a 0.5 probability threshold
pred_class = (y_pred[:, 1] > 0.5).astype(int)
cm = confusion_matrix(y_valid, pred_class)
print("Confusion matrix: \n",cm,"\n")

# Get accuracy
accuracy = round(accuracy_score(y_valid,pred_class),4)
print("Accuracy: {}".format(accuracy),"\n")

# Get F1 score (required by Task 1 of this dataset)
f1 = f1_score(y_valid,pred_class)
print("F1: {}".format(f1),"\n")
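
The model is trained with `eval_metric = "auc"` and `roc_auc_score` is imported, yet the cell above reports only accuracy and F1. Unlike those two, ROC AUC is computed from the raw probabilities rather than thresholded classes. A minimal sketch on hypothetical toy values:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 positive/negative pairs are ranked correctly, so AUC = 0.75
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```

In this notebook, the equivalent call would presumably be `roc_auc_score(y_valid, y_pred[:, 1])`.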


# 3) Use the model on the test set

In [None]:
# Load test data
test = pd.read_csv("../input/tabular-playground-series-mar-2021/test.csv")
test.head()

# Remove id
X_test = test.drop("id",axis = 1)

In [None]:
# Predict probabilities on the test set
test_pred = pipeline.predict_proba(X_test)

# 4) Write results

In [None]:
# Create submission file with the predicted probability of the positive class
output = test[['id']].copy()
output['target'] = test_pred[:, 1]
output.head()

In [None]:
# write csv
output.to_csv("submissionv_XGBoost.csv",index = False)