<a href="https://www.kaggle.com/code/yorkyong/spaceship-titanic-xgboost?scriptVersionId=156786032" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
# Import helpful libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

This is my second attempt at this competition, this time using a different method: XGBoost.
The XGBoost approach applied here follows the exercise notebook by Alexis Cook from the Intermediate Machine Learning course on Kaggle Learn: https://www.kaggle.com/code/alexisbcook/xgboost/tutorial


The earlier EDA, covering the steps listed below, is in my Random Forest notebook: https://www.kaggle.com/code/yorkyong/spaceship-titanic-random-forest

* Step 1: Understanding the Data
* Step 2: Data Preparation
* Step 3: Feature Understanding
* Step 4: Feature Relationship

# **Step 5: Apply XGBoost**

Load the training and validation sets into X_train, X_valid, y_train, and y_valid. The test set is loaded into X_test.

In [2]:
from sklearn.model_selection import train_test_split

# Read the data
X = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv', index_col='PassengerId')
X_test_full = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv', index_col='PassengerId')

# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['Transported'], inplace=True)
y = X.Transported             
X.drop(['Transported'], axis=1, inplace=True)

# Break off validation set from training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numeric columns
numeric_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numeric_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()

# One-hot encode the data (to shorten the code, we use pandas)
X_train = pd.get_dummies(X_train)
X_valid = pd.get_dummies(X_valid)
X_test = pd.get_dummies(X_test)
# Align the validation and test columns to the training columns
X_train, X_valid = X_train.align(X_valid, join='left', axis=1)
X_train, X_test = X_train.align(X_test, join='left', axis=1)
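
To illustrate what the align step does, here is a minimal sketch on toy data. The frames train_toy and valid_toy and their values are made up for illustration and are not taken from the competition files. A category that appears in only one split produces a dummy column the other split lacks, and join='left' resolves this by keeping exactly the training columns.

```python
import pandas as pd

# Toy frames (hypothetical values): 'Mars' appears only in training,
# 'Venus' appears only in validation
train_toy = pd.DataFrame({"HomePlanet": ["Earth", "Mars"], "Age": [30.0, 41.0]})
valid_toy = pd.DataFrame({"HomePlanet": ["Earth", "Venus"], "Age": [25.0, 52.0]})

train_enc = pd.get_dummies(train_toy)
valid_enc = pd.get_dummies(valid_toy)

# join='left' keeps exactly the training columns, in training order:
# the validation-only 'HomePlanet_Venus' column is dropped, and the
# training-only 'HomePlanet_Mars' column is added to validation as NaN
train_enc, valid_enc = train_enc.align(valid_enc, join="left", axis=1)

print(list(valid_enc.columns))
print(valid_enc["HomePlanet_Mars"].isna().all())
```

XGBoost treats NaN entries as missing values, so the NaN-filled column does not need to be imputed before fitting.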

**Default Run**
* Begin by setting my_model_1 to an XGBoost model. Use the XGBRegressor class, and set the random seed to 0 (random_state=0). Leave all other parameters as default.
* Then, fit the model to the training data in X_train and y_train.

In [3]:
from xgboost import XGBRegressor

# Define the model
my_model_1 = XGBRegressor(random_state=0)

# Fit the model

my_model_1.fit(X_train, y_train)

In [4]:
# Get predictions
predictions_1 = my_model_1.predict(X_valid)

# XGBRegressor returns continuous scores, not class labels.
# Set a threshold (e.g., 0.5) to convert them to binary predictions
threshold = 0.5
predictions_1 = (predictions_1 >= threshold).astype(bool)

In [5]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_valid, predictions_1)
print(f"Accuracy: {accuracy}")

Accuracy: 0.7780333525014376
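
The thresholding and accuracy steps above can be sketched on a handful of made-up numbers. The scores and y_true arrays are hypothetical, chosen only to show that regression outputs can fall outside the 0 to 1 range and that accuracy is the fraction of thresholded predictions that match the labels.

```python
import numpy as np

# Hypothetical regression outputs and true labels (made up for illustration)
scores = np.array([0.91, 0.12, 0.55, 0.49, -0.03, 1.07])
y_true = np.array([True, False, False, False, False, True])

# Scores at or above the threshold become True, the rest False
threshold = 0.5
y_pred = scores >= threshold

# Accuracy is the fraction of predictions that match the labels
accuracy_toy = (y_pred == y_true).mean()
print(accuracy_toy)
```

Here five of the six thresholded predictions match the labels; the third score (0.55) is the one false positive.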


**Improved Run**
* Begin by setting my_model_2 to an XGBoost model, using the XGBRegressor class. Change the default parameters (like n_estimators and learning_rate) to get better results.
* Then, fit the model to the training data in X_train and y_train.
* Set predictions_2 to the model's predictions for the validation data. Validation features are stored in X_valid.
* Finally, convert the predictions to binary labels with a 0.5 threshold and use the accuracy_score() function to calculate the accuracy on the validation set. Labels for the validation data are stored in y_valid.

In [6]:
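# Define the model
# early_stopping_rounds is set in the constructor (supported since XGBoost 1.6);
# newer releases no longer accept it as an argument to fit()
my_model_2 = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4,
                          early_stopping_rounds=5)

# Fit the model, monitoring the validation set for early stopping
my_model_2.fit(X_train, y_train,
               eval_set=[(X_valid, y_valid)],
               verbose=False)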

# Get predictions
predictions_2 = my_model_2.predict(X_valid)

# Set a threshold (e.g., 0.5) to convert the regression scores to binary predictions
threshold = 0.5
predictions_2 = (predictions_2 >= threshold).astype(bool)

# Calculate accuracy score
accuracy = accuracy_score(y_valid, predictions_2)
print(f"Accuracy: {accuracy}")




Accuracy: 0.7987349051178838
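
The improved run sets n_estimators=1000 together with early_stopping_rounds=5, so training can stop well before 1000 rounds once the validation score stops improving. The sketch below shows the general idea in plain Python. It is an illustration under simplifying assumptions, not XGBoost's actual implementation, and the function name and the loss values are hypothetical.

```python
def best_round_with_early_stopping(val_losses, patience=5):
    """Return (best_round, stop_round) for a sequence of validation losses.

    Illustrative only: training stops once `patience` rounds have passed
    since the last new best loss.
    """
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember where it occurred
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds: stop here
            return best_round, round_idx
    # Ran out of rounds without triggering early stopping
    return best_round, len(val_losses) - 1


# Hypothetical validation losses, one per boosting round
losses = [0.50, 0.45, 0.42, 0.43, 0.44, 0.43, 0.45, 0.46, 0.41]
best_round, stop_round = best_round_with_early_stopping(losses, patience=5)
print(best_round, stop_round)
```

In this toy sequence the loop stops before it ever sees the final 0.41, which shows the trade-off of a small patience value: it saves training time but can miss a later improvement.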


**Generate test predictions and submit results**

In [7]:
# make predictions to submit.

test_preds = my_model_2.predict(X_test)

# Set a threshold (e.g., 0.5) to convert the regression scores to binary predictions
threshold = 0.5
test_preds = (test_preds >= threshold).astype(bool)

In [8]:
# Save predictions in the format used for competition scoring

output = pd.DataFrame({'PassengerId': X_test.index,
                       'Transported': test_preds})
output.to_csv('submission.csv', index=False)
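
Before submitting, it can help to run a few sanity checks on the output frame. The helper below is a hypothetical sketch, not part of any Kaggle API; it assumes the submission needs exactly the two columns written above, one row per test passenger, and boolean predictions.

```python
import pandas as pd


def check_submission(df, expected_rows):
    """Basic sanity checks for a submission frame (hypothetical helper)."""
    assert list(df.columns) == ["PassengerId", "Transported"]
    assert len(df) == expected_rows          # one row per test passenger
    assert df["PassengerId"].is_unique       # no duplicated passengers
    assert df["Transported"].dtype == bool   # True/False predictions
    return True


# Toy frame with made-up predictions, standing in for the real output
toy_output = pd.DataFrame({"PassengerId": ["0013_01", "0018_01"],
                           "Transported": [True, False]})
print(check_submission(toy_output, expected_rows=2))
```

With the real data, the call would be check_submission(output, expected_rows=len(X_test)).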

In [9]:
output.head()

| | PassengerId | Transported |
|---|---|---|
| 0 | 0013_01 | True |
| 1 | 0018_01 | False |
| 2 | 0019_01 | True |
| 3 | 0021_01 | True |
| 4 | 0023_01 | True |


In [10]:
output.shape

(4277, 2)