# Spaceship Titanic Cleanlab Tutorial 

This notebook improves the XGBoost model from this [EDA + XGBoost notebook](https://www.kaggle.com/code/meetnagadia/titanic-spaceship-eda-xgboost-0-75-score) using the [cleanlab](https://github.com/cleanlab/cleanlab/) data cleaning library. 

`cleanlab` can improve a model by automatically removing datapoints with likely label errors from the model's training set. With fewer than 5 extra lines of code, we obtain roughly a **4% relative reduction in error** on the public leaderboard without changing the model at all.

| Model                | Public Score |
| -------------------- | ------------ |
| XGBoost              | 0.7587       |
| XGBoost + `cleanlab` | 0.76782      |
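
For reference, the reduction in error can be worked out directly from the two public scores in the table, treating each score as an accuracy. The variable names below are just for illustration. The result is about 3.8%, which rounds to the 4% quoted above.

```python
# Public leaderboard accuracies taken from the table above.
acc_base = 0.7587
acc_cleanlab = 0.76782

# Error rate is 1 - accuracy.
err_base = 1 - acc_base
err_cleanlab = 1 - acc_cleanlab

# Relative reduction in error, as a fraction of the baseline error.
relative_reduction = (err_base - err_cleanlab) / err_base

print(f"Baseline error:     {err_base:.4f}")
print(f"cleanlab error:     {err_cleanlab:.4f}")
print(f"Relative reduction: {relative_reduction:.1%}")
```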

## Loading Libraries and Data Preprocessing

The data preprocessing steps are identical to those in the [original notebook](https://www.kaggle.com/code/meetnagadia/titanic-spaceship-eda-xgboost-0-75-score).

In [None]:
import pandas as pd
import numpy as np
import random
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
import xgboost as xgb
import warnings
warnings.simplefilter('ignore')

SEED = 123  # for reproducibility
np.random.seed(SEED)
random.seed(SEED)

train_df = pd.read_csv("../input/spaceship-titanic/train.csv")
test_df = pd.read_csv("../input/spaceship-titanic/test.csv")

imputer_cols = ["Age", "FoodCourt", "ShoppingMall", "Spa", "VRDeck", "RoomService"]
imputer = SimpleImputer(strategy="median")
imputer.fit(train_df[imputer_cols])
train_df[imputer_cols] = imputer.transform(train_df[imputer_cols])
test_df[imputer_cols] = imputer.transform(test_df[imputer_cols])
# Treat a missing home planet as its own category, 'Z'
train_df["HomePlanet"] = train_df["HomePlanet"].fillna('Z')
test_df["HomePlanet"] = test_df["HomePlanet"].fillna('Z')

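label_cols = ["HomePlanet", "CryoSleep", "Cabin", "Destination", "VIP"]

def label_encoder(train, test, columns):
    """Integer-encode the given categorical columns in place."""
    for col in columns:
        train[col] = train[col].astype(str)
        test[col] = test[col].astype(str)
        # Note: as in the original notebook, train and test are encoded with
        # separate encoders, so the same category can map to different integers.
        train[col] = LabelEncoder().fit_transform(train[col])
        test[col] = LabelEncoder().fit_transform(test[col])
    return train, test

train_df, test_df = label_encoder(train_df, test_df, label_cols)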

p_id = test_df["PassengerId"]

y_train = train_df["Transported"]
X_train = train_df.drop(["Transported", "Name", "PassengerId"], axis=1)
X_test = test_df.drop(["Name", "PassengerId"], axis=1)
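
One caveat about the preprocessing above: `label_encoder` fits a separate `LabelEncoder` on the train and test columns. Since `LabelEncoder` assigns integers by the sorted order of the values it was fit on, a category can receive different codes in the two splits whenever their sets of values differ (as is likely for a high-cardinality column such as `Cabin`). The sketch below uses plain NumPy and made-up toy values (not the competition data) to illustrate the effect and one possible alternative, a single vocabulary shared by both splits. It is only an illustration. The notebook keeps the original behaviour so that the scores stay comparable with the original notebook.

```python
import numpy as np

# Hypothetical toy columns, not taken from the competition data.
train_col = np.array(["Earth", "Europa", "Mars", "Earth"])
test_col = np.array(["Europa", "Mars", "Mars"])  # "Earth" never appears here

def encode_separately(values):
    # Mimics fitting a fresh encoder on one split: each code is the rank of
    # the value among the sorted unique values of that split only.
    _, codes = np.unique(values, return_inverse=True)
    return codes

def encode_shared(values, classes):
    # Each code is the position of the value in one sorted vocabulary
    # that is shared by both splits.
    return np.searchsorted(classes, values)

# Shared vocabulary built from the union of both splits.
classes = np.unique(np.concatenate([train_col, test_col]))

print("separate:", encode_separately(train_col), encode_separately(test_col))
print("shared:  ", encode_shared(train_col, classes), encode_shared(test_col, classes))
```

With separate encoders, "Europa" is encoded as 1 in the toy train column but 0 in the toy test column. With the shared vocabulary it is 1 in both.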

## Model Training

First, a regular XGBoost model is trained (as demonstrated in the [original notebook](https://www.kaggle.com/code/meetnagadia/titanic-spaceship-eda-xgboost-0-75-score)). Then we add `cleanlab` to see how much it improves the performance of the base XGBoost model.

### XGBoost [0.7587]

In [None]:
# Training regular XGBoost

xgb_base = xgb.XGBClassifier(eval_metric='error')
xgb_base.fit(X_train, y_train)

y_pred_base = xgb_base.predict(X_test)  # base model predictions for test data

In [None]:
# Estimating performance of base XGB model via cross-validation

cv_results = cross_validate(xgb_base, X_train, y_train, cv=5)  # cv=5 is also the default
print(f"Mean accuracy using 5-fold cv: {np.mean(cv_results['test_score'])}")

### XGBoost + `cleanlab` [0.76782]

In [None]:
# install + import cleanlab library
# make sure internet is toggled on (Settings > Internet)

!pip install cleanlab
from cleanlab.classification import CleanLearning

In [None]:
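# Training improved XGBoost model with cleanlab

# Fresh, unfitted XGBoost model with the same settings as the baseline
xgb_base = xgb.XGBClassifier(eval_metric='error')
# CleanLearning finds likely label errors using cross-validated predictions,
# removes them, and then retrains the wrapped classifier on the remaining data
cl = CleanLearning(clf=xgb_base, verbose=True)
cl.fit(X_train.values, y_train.values)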

y_pred_cl = cl.predict(X_test.values)  # cleanlab-improved model predictions for test data
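
To build some intuition for what happens inside `cl.fit`, here is a deliberately simplified sketch of the kind of check that confident learning performs on out-of-sample predicted probabilities. It is not cleanlab's actual implementation, which goes further (for example, it estimates the joint distribution of given and true labels before deciding which examples to prune). The function name `flag_likely_label_issues` and the toy arrays are hypothetical and exist only for this illustration.

```python
import numpy as np

def flag_likely_label_issues(labels, pred_probs):
    """Simplified confident-learning-style check (not cleanlab's full algorithm).

    labels:     (n,) integer given labels in 0..K-1; every class must appear
    pred_probs: (n, K) out-of-sample predicted class probabilities
    Returns a boolean mask that is True where the given label looks suspicious.
    """
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs, dtype=float)
    n_classes = pred_probs.shape[1]

    # Per-class threshold: mean predicted probability of class k,
    # taken over the examples whose given label is k.
    thresholds = np.array(
        [pred_probs[labels == k, k].mean() for k in range(n_classes)]
    )

    # Classes that each example "confidently" belongs to.
    confident = pred_probs >= thresholds
    # Among the confident classes, keep the most probable one (-1 if none).
    masked = np.where(confident, pred_probs, -np.inf)
    confident_class = np.where(confident.any(axis=1), masked.argmax(axis=1), -1)

    # Suspicious: confidently assigned to a class other than the given label.
    return (confident_class != -1) & (confident_class != labels)

# Hypothetical toy data: 6 examples, 2 classes.
labels = np.array([0, 0, 0, 1, 1, 1])
pred_probs = np.array([
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],  # given label 0, but the model strongly prefers class 1
    [0.15, 0.85],
    [0.10, 0.90],
    [0.30, 0.70],
])

mask = flag_likely_label_issues(labels, pred_probs)
print(np.flatnonzero(mask))  # indices of the flagged examples
```

On this toy input only the third example (index 2) is flagged. The last example is left alone even though the model is less sure about it, because its probability for the other class never reaches that class's threshold.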

In [None]:
# Estimating performance of XGB model with cleanlab via cross-validation

cv_results = cross_validate(CleanLearning(clf=xgb_base), X_train.values, y_train.values, cv=5)
print(f"Mean accuracy using 5-fold cv: {np.mean(cv_results['test_score'])}")

## Submission

Loading the predictions into a format ready for submission. 

**Note that you can switch between `submission_base` and `submission_cl` to compare the results of adding cleanlab.**


In [None]:
submission_base = pd.DataFrame(
    {'PassengerId': p_id, 'Transported': y_pred_base.astype(bool)},
    columns=['PassengerId', 'Transported'],
)

submission_base.head()

In [None]:
submission_cl = pd.DataFrame(
    {'PassengerId': p_id, 'Transported': y_pred_cl.astype(bool)},
    columns=['PassengerId', 'Transported'],
)

submission_cl.head()

In [None]:
# switch between the two submission files here

submission_cl.to_csv("submission.csv", index=False)

## Final Notes

The XGBoost + `cleanlab` model above was trained with default XGBoost hyperparameters. Both these and the `cleanlab` parameters can be tuned, which may further improve overall performance.

While this notebook used an XGBoost model, `cleanlab` can be used with any scikit-learn-compatible classifier. Feel free to experiment with other models and let me know if you see an improvement!