# Spaceship Titanic
Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!



To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

<img src="https://storage.googleapis.com/kaggle-media/competitions/Spaceship%20Titanic/joel-filipe-QwoNAhbmLLo-unsplash.jpg" width="600" height="400" />

## Importing data and packages

In [None]:
import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
%matplotlib inline 
import matplotlib.pyplot as plt
from sklearn.metrics import jaccard_score

In [None]:
df_test = pd.read_csv('../input/spaceship-titanic/test.csv')
df_test.head(5)

In [None]:
df = pd.read_csv('../input/spaceship-titanic/train.csv')
df.head(5)

#### Data Description
In this competition your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal records recovered from the ship's damaged computer system.

#### File and Data Field Descriptions
 - train.csv - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
     - PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
     - HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
     - CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
     - Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
     - Destination - The planet the passenger will be debarking to.
     - Age - The age of the passenger.
     - VIP - Whether the passenger has paid for special VIP service during the voyage.
     - RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
     - Name - The first and last names of the passenger.
     - Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
 - test.csv - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.
 - sample_submission.csv - A submission file in the correct format.
     - PassengerId - Id for each passenger in the test set.
     - Transported - The target. For each passenger, predict either True or False.
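
The `PassengerId` (`gggg_pp`) and `Cabin` (`deck/num/side`) formats described above can be split into separate columns. The following is a minimal sketch on a small hand-made frame: the sample rows and the new column names (`Group`, `NumInGroup`, `Deck`, `CabinNum`, `Side`, `GroupSize`) are illustrative assumptions, not part of the competition files.

```python
import pandas as pd

# Hand-made rows that follow the documented formats (not real records).
sample = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02"],
    "Cabin": ["B/0/P", "F/0/S", "A/0/S", None],
})

# PassengerId has the form gggg_pp: the group, then the number within the group.
sample[["Group", "NumInGroup"]] = sample["PassengerId"].str.split("_", expand=True)

# Cabin has the form deck/num/side; a missing cabin stays missing in every part.
sample[["Deck", "CabinNum", "Side"]] = sample["Cabin"].str.split("/", expand=True)

# Count how many passengers share each group.
sample["GroupSize"] = sample.groupby("Group")["PassengerId"].transform("count")

print(sample)
```

Splitting the fields this way keeps the deck and side available as separate categorical features instead of one high-cardinality cabin string.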

## Exploratory data analysis

In [None]:
print(df.describe(include = 'all'))
print('')
print('Data types of dataset:')
print(df.dtypes)
print('')
# info() prints its summary directly, so it is called rather than printed
df.info()

This summary shows that the dataset has many missing values, which we can replace with a representative value such as the mean or the mode. To do classification, we also have to convert the categorical columns to numeric types.

In [None]:
df.isna().any()

In [None]:
df_test.isna().any()
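
`isna().any()` only reports whether a column has gaps at all. To see how many values are missing, the counts and shares can be tabulated per column. This is a minimal sketch on a hand-made frame; the helper name `missing_summary` and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values for each column."""
    counts = frame.isna().sum()
    return pd.DataFrame({
        "missing": counts,
        "share": counts / len(frame),
    }).sort_values("missing", ascending=False)

# Toy data with gaps in two of the three columns.
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, np.nan],
    "HomePlanet": ["Europa", "Earth", None, "Earth"],
    "VIP": [False, False, True, False],
})

summary = missing_summary(toy)
print(summary)
```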

In [None]:
df['Cabin'].value_counts()

In [None]:
# Draw the boxplot without overwriting the plt.boxplot function
df.boxplot(column=['Age'])
plt.show()

In [None]:
import seaborn as sns
sns.countplot(x ='Transported', data = df)
plt.show()

In [None]:
sns.countplot(x ='HomePlanet', data = df)
plt.show()

In [None]:
df['Destination'].value_counts()

In [None]:
x1 = df['Age']
y1 = df['Transported']
plt.scatter(x1,y1)
plt.title('Scatterplot of age and transported status')


In [None]:
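# Restrict the correlation matrix to numeric columns; recent pandas versions raise an error otherwise
corrMatrix = df.corr(numeric_only=True)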
sns.heatmap(corrMatrix, annot=True)
plt.show()

## Data preprocessing

### Replacing missing values

In [None]:
df['HomePlanet'] = df['HomePlanet'].fillna(df['HomePlanet'].mode()[0])
df['CryoSleep'] = df['CryoSleep'].fillna(df['CryoSleep'].mode()[0])
df['Cabin'] = df['Cabin'].fillna(df['Cabin'].mode()[0])
df['Destination'] = df['Destination'].fillna(df['Destination'].mode()[0])
df['VIP'] = df['VIP'].fillna(df['VIP'].mode()[0])
df['Age'] = df['Age'].fillna(df['Age'].mode()[0])
df['RoomService'] = df['RoomService'].fillna(df['RoomService'].mode()[0])
df['FoodCourt'] = df['FoodCourt'].fillna(df['FoodCourt'].mode()[0])
df['ShoppingMall'] = df['ShoppingMall'].fillna(df['ShoppingMall'].mode()[0])
df['VRDeck'] = df['VRDeck'].fillna(df['VRDeck'].mode()[0])
df['Spa'] = df['Spa'].fillna(df['Spa'].mode()[0])



In [None]:
df.isna().any()

In [None]:
cols_mode = ['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Fill each column's missing values with that column's most frequent value
for col in cols_mode:
    df_test[col] = df_test[col].fillna(df_test[col].mode()[0])

In [None]:
df_test.isna().any()

In [None]:
df.head()

In [None]:
df_test.head()
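
The cells above fill the training and test sets separately, each with its own modes. A common variant is to choose the fill values from the training data only and reuse them for the test data, so both sets are treated identically. The sketch below illustrates that idea on toy frames; the helper `fit_fill_values`, the use of the median for numeric columns, and the toy values are assumptions, not the approach this notebook takes.

```python
import numpy as np
import pandas as pd

def fit_fill_values(train: pd.DataFrame, columns: list) -> dict:
    """Pick one fill value per column using the training data only."""
    values = {}
    for col in columns:
        if pd.api.types.is_numeric_dtype(train[col]):
            values[col] = train[col].median()   # numeric: median
        else:
            values[col] = train[col].mode()[0]  # categorical: most frequent value
    return values

toy_train = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 40.0],
    "HomePlanet": ["Earth", "Earth", None, "Mars"],
})
toy_test = pd.DataFrame({
    "Age": [np.nan, 50.0],
    "HomePlanet": [None, "Europa"],
})

fill_values = fit_fill_values(toy_train, ["Age", "HomePlanet"])

# The same dictionary fills both frames.
toy_train_filled = toy_train.fillna(fill_values)
toy_test_filled = toy_test.fillna(fill_values)
print(toy_test_filled)
```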

### Choosing the predictors and target

In [None]:
# Copy the selected columns so later edits do not trigger SettingWithCopyWarning
x = df[['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].copy()
x

In [None]:
x.isna().any()

In [None]:
y = df['Transported']
y

In [None]:
# Copy the selected columns so later edits do not trigger SettingWithCopyWarning
x_pred = df_test[['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].copy()

In [None]:
x_pred.head()

### Changing data types

In [None]:
x.head()

In [None]:
x['Cabin'].value_counts()

In [None]:
# Keep only the deck letter from the deck/num/side cabin string
x['Cabin'] = x['Cabin'].str.slice(0, 1)

# Encode booleans as 0/1 and categories as integer codes
x['VIP'] = x['VIP'].astype(int)
x['HomePlanet'] = x['HomePlanet'].map({'Europa': 2, 'Earth': 3, 'Mars': 1})
x['Destination'] = x['Destination'].map({'TRAPPIST-1e': 3, '55 Cancri e': 2, 'PSO J318.5-22': 1})
x['Cabin'] = x['Cabin'].map({'T': 1, 'A': 2, 'D': 3, 'C': 4, 'B': 5, 'E': 6, 'G': 7, 'F': 8})
x['CryoSleep'] = x['CryoSleep'].astype(int)
y = y.astype(int)

x[['HomePlanet', 'Destination', 'Cabin']] = x[['HomePlanet', 'Destination', 'Cabin']].astype('int')
print(x.head())
print(x.dtypes)

In [None]:
# Apply the same encoding to the test features
x_pred['Cabin'] = x_pred['Cabin'].str.slice(0, 1)
x_pred['VIP'] = x_pred['VIP'].astype(int)
x_pred['HomePlanet'] = x_pred['HomePlanet'].map({'Europa': 2, 'Earth': 3, 'Mars': 1})
x_pred['Destination'] = x_pred['Destination'].map({'TRAPPIST-1e': 3, '55 Cancri e': 2, 'PSO J318.5-22': 1})
x_pred['Cabin'] = x_pred['Cabin'].map({'T': 1, 'A': 2, 'D': 3, 'C': 4, 'B': 5, 'E': 6, 'G': 7, 'F': 8})
x_pred['CryoSleep'] = x_pred['CryoSleep'].astype(int)
# Convert x_pred's own columns; assigning from x would overwrite the test rows with training rows
x_pred[['HomePlanet', 'CryoSleep', 'Destination', 'Cabin']] = x_pred[['HomePlanet', 'CryoSleep', 'Destination', 'Cabin']].astype('int')
print(x_pred.head())
print(x_pred.dtypes)

In [None]:
x.head()

In [None]:
x.describe(include = 'all')

In [None]:
x_pred.describe()
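
The integer codes used above give the categories an order (for example Mars < Europa < Earth) that distance-based models such as KNN will treat as meaningful. One alternative, not used in this notebook, is one-hot encoding. The sketch below shows it on toy frames; the toy values are assumptions, and the `reindex` step is there because a test set may lack a category that the training set contains.

```python
import pandas as pd

toy_train = pd.DataFrame({"HomePlanet": ["Earth", "Europa", "Mars"]})
toy_test = pd.DataFrame({"HomePlanet": ["Earth", "Earth"]})

# One indicator column per category, stored as 0/1 integers.
train_encoded = pd.get_dummies(toy_train, columns=["HomePlanet"], dtype=int)
test_encoded = pd.get_dummies(toy_test, columns=["HomePlanet"], dtype=int)

# Align the test columns with the training columns; absent categories become 0.
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(train_encoded)
print(test_encoded)
```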

## Data normalization 

In [None]:
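X = np.asarray(x)
Y = np.asarray(y)
from sklearn import preprocessing
# Fit the scaler on the training features and keep it for the test features
scaler = preprocessing.StandardScaler().fit(X)
X = scaler.transform(X)
X[0:5]


In [None]:
X_pred = np.asarray(x_pred)
# Reuse the training scaler so both sets are scaled with the same mean and variance
X_pred = scaler.transform(X_pred)
X_pred[0:5]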
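
For reference, `StandardScaler` subtracts each column's mean and divides by its population standard deviation, and a fitted scaler applies those same statistics to any new rows. The sketch below checks this by hand on toy arrays; the array values and the names `toy_scaler`, `toy_features` and `toy_new` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_features = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
toy_new = np.array([[2.0, 400.0]])

toy_scaler = StandardScaler().fit(toy_features)

# Manual standardization: (value - column mean) / column standard deviation (ddof=0).
manual = (toy_features - toy_features.mean(axis=0)) / toy_features.std(axis=0)
matches_manual = np.allclose(toy_scaler.transform(toy_features), manual)

# New rows are scaled with the statistics learned from toy_features.
toy_new_scaled = toy_scaler.transform(toy_new)

print(matches_manual)
print(toy_new_scaled)
```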

## Train/test split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

## Modeling

### K-Nearest Neighbors classification

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report
for i in range(1, 20):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    print(i)
    print("Train acc", metrics.accuracy_score(y_train, knn.predict(X_train)))
    print("Test acc", metrics.accuracy_score(y_test,pred_i))
    print('Sum acc', metrics.accuracy_score(y_train, knn.predict(X_train)) + metrics.accuracy_score(y_test,pred_i))

In [None]:
# As we can see, the optimal value for k is 9
k = 9
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
yhat_knn = neigh.predict(X_test)
print(y_test[0:5])
print(yhat_knn[0:5])
# Score the k = 9 predictions, not the leftover pred_i from the loop
print("KNN accuracy:", metrics.accuracy_score(y_test, yhat_knn))
print('Classification report for KNN classification:', classification_report(y_test, yhat_knn))
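
Choosing `k` from a single train/test split can depend on how the split happened to fall. A hedged alternative is to compare candidate values with cross-validation. The sketch below does this on synthetic data from `make_classification`, which stands in for the real features; the sample size, feature count, candidate range and the names `X_toy`, `y_toy`, `cv_scores` and `best_k` are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled feature matrix and the target.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

# Mean 5-fold accuracy for each odd k from 1 to 19.
cv_scores = {}
for k in range(1, 20, 2):
    model = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(model, X_toy, y_toy, cv=5).mean()

# Pick the k with the highest mean cross-validated accuracy.
best_k = max(cv_scores, key=cv_scores.get)
print(best_k, round(cv_scores[best_k], 3))
```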

### Decision tree classification

In [None]:
from sklearn.tree import DecisionTreeClassifier
for i in ('gini', 'entropy'):
    drugTree = DecisionTreeClassifier(criterion=i, max_depth = 4)
    drugTree.fit(X_train,y_train)
    pred_i = drugTree.predict(X_test)
    # Score the tree itself on the training data, not the earlier KNN model
    train_acc = metrics.accuracy_score(y_train, drugTree.predict(X_train))
    test_acc = metrics.accuracy_score(y_test, pred_i)
    print(i)
    print("Train acc", train_acc)
    print("Test acc", test_acc)
    print('Sum acc', train_acc + test_acc)

In [None]:
drugTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
drugTree.fit(X_train,y_train)
yhat_dt= drugTree.predict(X_test)
print (yhat_dt [0:5])
print (y_test [0:5])
print("Decision tree accuracy: ", metrics.accuracy_score(y_test, yhat_dt))
print('Classification report for Decision Tree classification:', classification_report(y_test, yhat_dt))

### Logistic Regression 

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
for i in ('newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'):
    LR_n = LogisticRegression(C=0.01, solver=i)
    LR_n.fit(X_train,y_train)
    # Predict with the logistic regression just fitted, not the tree or KNN models
    pred_i = LR_n.predict(X_test)
    train_acc = metrics.accuracy_score(y_train, LR_n.predict(X_train))
    test_acc = metrics.accuracy_score(y_test, pred_i)
    print(i)
    print("Train acc", train_acc)
    print("Test acc", test_acc)
    print('Sum acc', train_acc + test_acc)

In [None]:
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
yhat_lr = LR.predict(X_test)
print(yhat_lr[0:5])
print(y_test[0:5])
print("Logistic regression accuracy: ", metrics.accuracy_score(y_test, yhat_lr))
print('Classification report for Logistic regression:', classification_report(y_test, yhat_lr))

### SVM (Support Vector Machines) classification

In [None]:
from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train) 
yhat_svm = clf.predict(X_test)
# Report the SVM predictions, not the logistic regression ones
print(yhat_svm[0:5])
print(y_test[0:5])
print("SVM accuracy: ", metrics.accuracy_score(y_test, yhat_svm))
print('Classification report for SVM classification:', classification_report(y_test, yhat_svm))

## Model evaluation

In [None]:
knn_acc = metrics.accuracy_score(y_test, yhat_knn)
dt_acc = metrics.accuracy_score(y_test, yhat_dt)
lr_acc =metrics.accuracy_score(y_test, yhat_lr)
svm_acc = metrics.accuracy_score(y_test, yhat_svm)

In [None]:
models = pd.DataFrame({
    'Model' : ['KNN classification', 'Decision Tree Classification', 'Logistic regression classification',
             'SVM classification'],
    'Score' : [knn_acc, dt_acc, lr_acc, svm_acc]
})


models.sort_values(by = 'Score', ascending = False)

As we can see, SVM has the best accuracy.

## Predicting test dataset

I decided to use the KNN classifier.

In [None]:
knn_pr = KNeighborsClassifier(n_neighbors = 9).fit(X, Y)
# Predict on the scaled test array X_pred; the model was trained on scaled features
predict = knn_pr.predict(X_pred)
df_pred = pd.DataFrame(predict)
print(df_pred)



In [None]:
# Store the predictions as True/False under the target column name, Transported
df_test['Transported'] = predict.astype(bool)
print(df_test)
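
The data description specifies a submission with a `PassengerId` column and a `Transported` column holding `True` or `False`. The following is a minimal sketch of building that file; the toy IDs and predictions are assumptions, and the CSV is written to an in-memory buffer here, whereas a real submission would pass a file path to `to_csv`.

```python
import io
import pandas as pd

# Toy stand-ins for the test IDs and the model's 0/1 predictions.
toy_ids = pd.Series(["0013_01", "0018_01", "0019_01"], name="PassengerId")
toy_predictions = [1, 0, 1]

submission = pd.DataFrame({
    "PassengerId": toy_ids,
    # Convert 0/1 predictions to the True/False values the format expects.
    "Transported": pd.Series(toy_predictions).astype(bool),
})

# Write without the index so only the two required columns appear.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```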