<div style="padding:20px;color:white;margin:0;font-size:200%;text-align:center;display:fill;border-radius:5px;background-color:#38A6A5;overflow:hidden;font-weight:500">Titanic Spaceship</div>

# <b><span style='color:#444444'>1 |</span><span style='color:#38A6A5'> Competition Overview</span></b>

In this [competition](https://www.kaggle.com/competitions/spaceship-titanic/overview) your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal records recovered from the ship's damaged computer system.

In [None]:
#Import libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.preprocessing import LabelEncoder

from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

print("Libraries imported")

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train = pd.read_csv("/kaggle/input/spaceship-titanic/train.csv")
test = pd.read_csv("/kaggle/input/spaceship-titanic/test.csv")

# <b><span style='color:#444444'>1 |</span><span style='color:#38A6A5'> Meet and greet Data</span></b>

In [None]:
train.head()

In [None]:
train.describe()

In [None]:
train.info()

In [None]:
def missing_val(df):
    missing = df.isnull().sum()
    missing_percent = (df.isnull().sum()/df.shape[0] * 100).round(2)
    missing_df = pd.DataFrame({'column_name' : df.columns,
                               'missing' : missing,
                               'percent' : missing_percent})
   
    return missing_df

In [None]:
print("Train data missing values: " )
print(missing_val(train))
print('-' * 50)
print("Test data missing values: " )
print(missing_val(test))


In [None]:
train.nunique()

**Observations:**

1. There are missing values (approx. 2%) in both the train and test data.

**Field and data descriptions:**

train.csv - Personal records for about two-thirds (8700) of the passengers, to be used as training data.

PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.

HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.

CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.

Destination - The planet the passenger will be debarking to.

Age - The age of the passenger.

VIP - Whether the passenger has paid for special VIP service during the voyage.

RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.

Name - The first and last names of the passenger.

Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.


test.csv - Personal records for the remaining one-third (4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.

# <b><span style='color:#444444'>2 |</span><span style='color:#38A6A5'> Exploratory Data Analysis</span></b>

In [None]:
df_num = train[['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Transported']]
df_cat = train[['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Transported']]

In [None]:
fig = plt.figure(figsize = (8,5))
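# Count each target class. Using the Series directly avoids depending on the
# column names produced by reset_index(), which differ across pandas versions.
counts = train['Transported'].value_counts()

plt.pie(counts.values, labels=counts.index.astype(str), colors=['#88CAC9','#EDD3B3'],
        autopct='%.1f%%', shadow=True, normalize=True, startangle=50, explode=[0.1,0.1])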
plt.title("Transported Distribution")

plt.show()

In [None]:
sns.pairplot(df_num,hue = 'Transported', palette = 'pastel')
plt.show()

In [None]:
sns.histplot(data = train, x = 'Age', hue = "Transported",kde = True)

In [None]:
features = ['HomePlanet','CryoSleep', 'Destination','VIP' ]
plt.subplots(figsize = (16,16))
for i,col in enumerate(features):
    ax = plt.subplot(2,2,i+1)
    ax = sns.countplot(x = col, data = train, hue = 'Transported', palette = 'pastel')

plt.show()

In [None]:
sns.boxplot(data=df_num, orient="h", palette="pastel")

In [None]:

for i,col in enumerate(df_num.columns[:-1]):
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    ax = sns.violinplot(data=df_num, x="Transported", y=col,palette = 'pastel')
    
plt.show()

In [None]:
exp_feats=['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Plot expenditure features
fig=plt.figure(figsize=(10,20))
for i, col in enumerate(exp_feats):
    # Left plot: full count range
    ax=fig.add_subplot(5,2,2*i+1)
    sns.histplot(data=train, x=col, ax=ax, bins=30, kde=False, hue='Transported')
    ax.set_title(col)
    
    # Right plot: same data with the y-axis truncated at 100
    ax=fig.add_subplot(5,2,2*i+2)
    sns.histplot(data=train, x=col, ax=ax, bins=30, kde=True, hue='Transported')
    ax.set_ylim([0,100])
    ax.set_title(col)
fig.tight_layout()  # Improves appearance a bit
plt.show()

In [None]:
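# Restrict to numeric columns; pandas 2.x raises an error on string columns otherwise
correlation = train.corr(numeric_only=True)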
sns.heatmap(correlation, annot = True)

**Observations:**

1. Passengers who were transported tended to spend less.
2. RoomService, Spa and VRDeck have different distributions from FoodCourt and ShoppingMall; we can think of this as luxury vs. essential amenities.
3. 0-18 year olds were more likely to be transported than not.
   18-25 year olds were less likely to be transported than not.
   Passengers over 25 were about as likely to be transported as not.
4. VIP and Destination do not appear to be useful features; the target split is more or less equal.
5. In contrast, CryoSleep and HomePlanet appear to be very useful features.
6. The target classes are roughly balanced, so no resampling is required.
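
Observations like these can also be checked numerically rather than read off the plots. The following is a minimal sketch on a small made-up DataFrame standing in for `train`; the values and the `age_band` column name are illustrative assumptions, not taken from the competition data.

```python
import pandas as pd

# Toy stand-in for `train`; the values are invented for illustration only
toy = pd.DataFrame({
    "Age": [5, 16, 20, 23, 30, 45, 60, 12],
    "CryoSleep": [True, True, False, False, True, False, False, True],
    "Transported": [True, True, False, False, True, False, True, True],
})

# Share of transported passengers per category (mean of a boolean column)
rate_by_cryo = toy.groupby("CryoSleep")["Transported"].mean()

# Same idea for the age bands mentioned in observation 3
toy["age_band"] = pd.cut(toy["Age"], bins=[0, 18, 25, 100],
                         labels=["0-18", "18-25", "25+"], include_lowest=True)
rate_by_age = toy.groupby("age_band", observed=False)["Transported"].mean()

print(rate_by_cryo)
print(rate_by_age)
```

A rate close to 0.5 in every category would suggest the feature carries little signal on its own, while rates far from 0.5 suggest it is worth keeping.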

# <b><span style='color:#444444'>3 |</span><span style='color:#38A6A5'> Feature Engineering</span></b>



**New Features:**
1. The group number and the number of passengers in each group can be derived from PassengerId.
2. The deck and side (Port or Starboard) can be derived from Cabin.
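
To make the two formats concrete, here is a small sketch that parses `gggg_pp` and `deck/num/side` with pandas string methods. The three toy records are invented, and unlike the notebook's own function this variant leaves missing cabins as missing rather than filling a placeholder.

```python
import pandas as pd

# Toy records in the documented formats gggg_pp and deck/num/side
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02"],
    "Cabin": ["B/0/P", "F/12/S", None],
})

# Group number is the part before the underscore
toy["Group"] = toy["PassengerId"].str.split("_").str[0].astype(int)
# Group size: number of passengers sharing the same group number
toy["Group_size"] = toy.groupby("Group")["Group"].transform("size")

# Split Cabin into its three parts; missing cabins stay missing here
cabin_parts = toy["Cabin"].str.split("/", expand=True)
cabin_parts.columns = ["Cabin_deck", "Cabin_number", "Cabin_side"]
toy = toy.join(cabin_parts)

print(toy)
```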

In [None]:
features = ['PassengerId', 'Cabin' ,'Name']
train[features].head()

In [None]:
def new_features(df):
    #Features from Passenger Id
    df['Group'] = df['PassengerId'].apply(lambda x: x.split('_')[0]).astype(int)
    # Count group members once and map the counts back onto each row
    df['Group_size'] = df['Group'].map(df['Group'].value_counts())
    df['IsAlone'] = np.where((df["Group_size"] == 1),1,0)

    #Fill missing Cabin values with a placeholder so they can still be split
    df['Cabin'] = df['Cabin'].fillna('Z/9999/Z')
    #New features from Cabin
    df['Cabin_deck'] = df['Cabin'].apply(lambda x: x.split('/')[0])
    df['Cabin_number'] = df['Cabin'].apply(lambda x: x.split('/')[1]).astype(int)
    df['Cabin_side'] = df['Cabin'].apply(lambda x: x.split('/')[2])
    return df
    
    

In [None]:
train = new_features(train)
test = new_features(test)

train.head()

In [None]:
features = ['Group_size','IsAlone', 'Cabin_deck','Cabin_side']
plt.subplots(figsize = (16,16))
for i,col in enumerate(features):
    ax = plt.subplot(2,2,i+1)
    ax = sns.countplot(x = col, data = train, hue = 'Transported', palette = 'pastel')

plt.show()


In [None]:
sns.histplot(data=train, x='Cabin_number', hue='Transported',binwidth=20)
plt.xlim([0,2000])
plt.tight_layout()

**Insights:**
1. Age and cabin number can be divided into bins.
2. A Premium feature can be created by summing RoomService, Spa and VRDeck.
3. An Essential feature can be created by summing FoodCourt and ShoppingMall.
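
Two details of binning and summing are easy to miss, so a short sketch on invented values may help: `pd.cut` assigns no bin to values outside the outermost edges, and adding columns with `+` yields a missing result whenever any input is missing. The bin edges below match the cabin-number edges used in this notebook; the data is made up.

```python
import pandas as pd

# Cabin numbers, including one far outside the bin range
cabin_number = pd.Series([0, 150, 300, 301, 1999, 9999])
bins = [0, 300, 600, 900, 1200, 1500, 2000]

# Intervals are closed on the right; include_lowest keeps 0 in the first bin
binned = pd.cut(cabin_number, bins=bins, include_lowest=True)
print(binned)

# Values beyond the last edge (here 9999) are not assigned to any bin
print(binned.isna().tolist())

# Summing spending columns: a missing input makes the whole sum missing
spend = pd.DataFrame({"RoomService": [10.0, None],
                      "Spa": [5.0, 20.0],
                      "VRDeck": [0.0, 1.0]})
premium = spend["RoomService"] + spend["Spa"] + spend["VRDeck"]
print(premium.tolist())
```

Rows affected by either effect end up with missing values in the engineered columns, so they need to be handled in the missing-value step.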


In [None]:
def more_features(df):
    df['Premium'] = df.RoomService +  df.Spa + df.VRDeck
    df['Essential'] =  df.FoodCourt + df.ShoppingMall
    df['Age_bin'] = pd.cut(df.Age, bins=[0,20,35,50,90],
                       include_lowest=True)
    df['Cabin_number_bin'] = pd.cut(df.Cabin_number, bins=[0,300,600,900,1200,1500,2000],
                       include_lowest=True)
    
    return df
    

In [None]:
train = more_features(train)
test = more_features(test)

train.head()

In [None]:
plt.subplots(figsize = (16,16))
ax1 = plt.subplot(2,1,1)
ax1 = sns.countplot(data=train, x='Age_bin', hue='Transported', palette = 'pastel') 

ax2 = plt.subplot(2,1,2)
ax2 = sns.countplot(data=train, x='Cabin_number_bin', hue='Transported', palette = 'pastel') 

plt.show()

In [None]:
features = ['Premium','Essential']
plt.subplots(figsize = (16,16))
for i,col in enumerate(features):
    ax = plt.subplot(2,1,i+1)
    ax1 = sns.histplot(data=train, x=col, hue='Transported',binwidth = 20, palette = 'pastel') 
    plt.ylim([0,100])
    plt.xlim([0,2000])

plt.show()


In [None]:
train.head()

# <b><span style='color:#444444'>4 |</span><span style='color:#38A6A5'> Missing Values</span></b>

1. The easiest way to fill missing values is to use the mean for numeric columns and the mode for categorical columns.

2. HomePlanet - Passengers travelling in the same group are assumed to share a home planet, so a missing HomePlanet is filled from the other members of the group where the group size is more than 1.
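
The group-based idea can be sketched on a toy table as follows. The data is invented, and this is one possible way to express the rule, not necessarily identical to the function used in this notebook.

```python
import pandas as pd

# Toy data: group 1 has one known planet, group 2 has none, group 3 has two
toy = pd.DataFrame({
    "Group": [1, 1, 2, 3, 3, 3],
    "HomePlanet": ["Earth", None, None, "Mars", "Mars", None],
})

# Most common known planet per group (groups with no known planet are absent)
known = toy.dropna(subset=["HomePlanet"])
planet_by_group = known.groupby("Group")["HomePlanet"].agg(lambda s: s.mode().iloc[0])

# Fill only the missing entries, looking up each row's group
toy["HomePlanet"] = toy["HomePlanet"].fillna(toy["Group"].map(planet_by_group))

print(toy)
```

Group 2 stays missing because no member has a known planet; such rows still need a fallback such as the column mode.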


In [None]:
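def fill_nan_HomePlanet(df):
    # Count passengers per (Group, HomePlanet); rows are groups with at least one known planet
    df_grp=df.groupby(['Group','HomePlanet'])['HomePlanet'].size().unstack().fillna(0)
    # Most common home planet within each group, computed once
    grp_planet=df_grp.idxmax(axis=1)

    # Rows with a missing HomePlanet whose group has a known planet
    mask=df['HomePlanet'].isna() & df['Group'].isin(df_grp.index)
    # Label-based lookup, so this also works when the index is not 0..n-1
    df.loc[mask,'HomePlanet']=df.loc[mask,'Group'].map(grp_planet)
    
    return df   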

In [None]:
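def fill_nan(df):
    # Fill every feature column; the target is skipped by name, since after
    # feature engineering it is no longer the last column
    for col in df.columns:
        if col == 'Transported':
            continue
        if df[col].dtype == 'float64':
            # Numeric columns: fill with the column mean
            df[col] = df[col].fillna(value = df[col].mean())
        else:
            # Categorical columns: fill with the most frequent value
            df[col] = df[col].fillna(value = df[col].mode()[0])
    
    return df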


In [None]:
train = fill_nan_HomePlanet(train)
train = fill_nan(train)

test = fill_nan_HomePlanet(test)
test = fill_nan(test)

train.isnull().sum()


# <b><span style='color:#444444'>5 |</span><span style='color:#38A6A5'> Preprocessing Data</span></b>

First we use label encoding to convert categorical columns into numerical values.

After that we use StandardScaler to scale the values.
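
One thing to watch with encoding and scaling is that fitting them separately on the train and test tables can give the two tables different category codes and different scales. A possible alternative, sketched below on invented data, is to fit the transformers on the training table only and reuse them on the test table. Note that this sketch swaps in scikit-learn's `OrdinalEncoder` (designed for feature columns) in place of `LabelEncoder`, and the toy frames and variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy frames standing in for the train and test feature tables
toy_train = pd.DataFrame({"HomePlanet": ["Earth", "Mars", "Europa", "Earth"],
                          "Premium": [0.0, 100.0, 50.0, 250.0]})
toy_test = pd.DataFrame({"HomePlanet": ["Mars", "Earth"],
                         "Premium": [100.0, 0.0]})

pre = ColumnTransformer([
    # Encode categories as integers; unseen test categories map to -1
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     ["HomePlanet"]),
    # Scale numeric columns with statistics learned from the training data
    ("num", StandardScaler(), ["Premium"]),
])

X_toy_train = pre.fit_transform(toy_train)  # learn encodings and scaling
X_toy_test = pre.transform(toy_test)        # reuse them, no refitting

print(X_toy_train)
print(X_toy_test)
```

With this arrangement "Mars" receives the same code in both tables, and a `Premium` value of 100 is scaled identically wherever it appears.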

In [None]:

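def preprocess_data(df):
    # Work on a copy so the caller's DataFrame (a slice of train/test) is not modified
    df = df.copy()
    cat_features = ['HomePlanet','CryoSleep','Cabin_deck','Cabin_side',
                    'Age_bin','Cabin_number_bin']

    # Encode each categorical column as integers
    label = LabelEncoder()
    for col in cat_features:
        df[col] = label.fit_transform(df[col])
    
    # Standardise all columns to zero mean and unit variance
    scaler = StandardScaler()
    df = scaler.fit_transform(df)
    return df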
    

In [None]:
features = ['HomePlanet','CryoSleep','Group_size','Cabin_deck','Cabin_side','Premium', 'Essential',
            'Age_bin','Cabin_number_bin']
label = 'Transported'
df_X, y = train[features], train[label].values

X = preprocess_data(df_X)

df_test_f = test[features]
df_test_final = preprocess_data(df_test_f)

In [None]:
# Split data 70%-30% into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

print ('Training cases: %d\nTest cases: %d' % (X_train.shape[0], X_test.shape[0]))

# <b><span style='color:#444444'>6 |</span><span style='color:#38A6A5'> Model</span></b>

In [None]:
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split

We use GridSearchCV to find the best hyperparameters for the model.
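
For readers new to grid search, here is a self-contained sketch of how it works. It uses synthetic data and scikit-learn's `DecisionTreeClassifier` in place of LightGBM so that it runs with scikit-learn alone; the grid values and variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for X_train, y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Small illustrative grid; every combination is scored by cross-validation
param_grid = {"max_depth": [2, 4], "min_samples_leaf": [1, 5]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid=param_grid, cv=cv, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)           # combination with the best mean CV score
print(round(search.best_score_, 3))  # mean CV accuracy of that combination

# By default (refit=True) the best estimator is refit on all of X_demo
best_model = search.best_estimator_
```

Because the best estimator is refit automatically, `best_estimator_` can be used directly for prediction instead of copying the winning parameters into a new model by hand.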

In [None]:
from sklearn.model_selection import GridSearchCV
lgbm = LGBMClassifier(random_state=0)
grid = {'n_estimators': [50, 100, 150, 200],
        'max_depth': [4, 8, 12],
        'learning_rate': [0.05, 0.1, 0.15]}

clf = GridSearchCV(estimator=lgbm, param_grid=grid, n_jobs=-1, cv=None)

clf.fit(X_train, y_train)

print(clf.best_params_)

print(clf.best_score_)

In [None]:
#Train the model
model = LGBMClassifier(n_estimators = 100, 
                      learning_rate = 0.05, 
                      random_state=0, 
                      max_depth = 12)

model.fit(X_train,y_train)

# Get predictions from test data
predictions = model.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# Get metrics
print("Overall Accuracy:",accuracy_score(y_test, predictions))
print("Overall Precision:",precision_score(y_test, predictions, average='macro'))
print("Overall Recall:",recall_score(y_test, predictions, average='macro'))

In [None]:
# Plot confusion matrix
cm = confusion_matrix(y_test, predictions)
classes = ['False','True']  # Transported is boolean; confusion_matrix orders False before True
plt.imshow(cm, interpolation="nearest", cmap=plt.cm.Blues)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
plt.title('Confusion Matrix')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
plt.show()

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score

y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))

# calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])

# plot ROC curve
fig = plt.figure(figsize=(6, 6))
# Plot the diagonal 50% line
plt.plot([0, 1], [0, 1], 'k--')
# Plot the FPR and TPR achieved by our model
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

# <b><span style='color:#444444'>7 |</span><span style='color:#38A6A5'> Submission</span></b>

We train the model again on the complete training dataset before submitting the results.

In [None]:
model.fit(X,y)

# Get predictions from test data
predictions = model.predict(df_test_final)

predictions

In [None]:
submit =pd.read_csv('../input/spaceship-titanic/sample_submission.csv')

# Add predictions
submit['Transported']=predictions

# Replace 0 to False and 1 to True
submit=submit.replace({0:False, 1:True})

# Prediction distribution
plt.figure(figsize=(6,6))
submit['Transported'].value_counts().plot.pie(explode=[0.1,0.1], colors = ['#88CAC9','#EDD3B3'],autopct='%1.1f%%', shadow=True, textprops={'fontsize':16}).set_title("Prediction distribution")

plt.show()

In [None]:
# Output to csv
submit.to_csv('submission.csv', index=False)

**References:**

1. Visualizations help from https://www.kaggle.com/code/kellibelcher/tps-may-2022-eda-lgbm-neural-networks

2. Data Analysis help : https://www.kaggle.com/code/samuelcortinhas/spaceship-titanic-a-complete-guide#EDA
