# Titanic Survival Prediction

1. [Import Libraries](#heading1)
2. [Read Data](#heading2)
3. [Data Cleaning](#heading3)
4. [Machine Learning](#heading4)

# 1. Import Libraries <a id="heading1"></a>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# 2. Read Data with Pandas Library <a id="heading2"></a>

In [None]:
train_df = pd.read_csv("/kaggle/input/titanic/train.csv")
test_df = pd.read_csv("/kaggle/input/titanic/test.csv")

[Titanic data link](https://www.kaggle.com/c/titanic/data)

| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

In [None]:
# Train data preview
train_df.head(20)

In [None]:
print(train_df.shape)

Q1: Can you check the first 10 rows of the test data?

In [None]:
# your answer goes here
test_df.head(10)

In [None]:
print(test_df.shape)

In [None]:
train_df.info()
print('_'*40)
test_df.info()

In [None]:
train_df.describe()

Q2: Can you compute the same summary statistics for 'test_df'?

In [None]:
# Your answer goes here
test_df.describe()

In [None]:
# plot dataframe
train_df.hist(column='Survived')

In [None]:
g = sns.FacetGrid(train_df, col='Survived') # you can check the help doc from https://seaborn.pydata.org/
g.map(plt.hist, 'Age', bins=20)

In [None]:
g = sns.FacetGrid(train_df, col='Survived') # you can check the help doc from https://seaborn.pydata.org/
g.map(plt.hist, 'Fare', bins=20)

Q3: Can you plot the above type of diagram for 'Sex' feature?

In [None]:
# your answer goes here
g = sns.FacetGrid(train_df, col='Sex') # you can check the help doc from https://seaborn.pydata.org/
g.map(plt.hist, 'Survived', bins=20)

In [None]:
g = sns.FacetGrid(train_df, col='Embarked') # you can check the help doc from https://seaborn.pydata.org/
g.map(plt.hist, 'Survived', bins=20)

# 3. Data Cleaning <a id="heading3"></a>

## 3.1 Remove Null Values

The 'Cabin' column has too many missing values, so we drop the whole column.

In [None]:
train_df.drop('Cabin',axis=1,inplace=True)
test_df.drop('Cabin',axis=1,inplace=True)

Here we handle the remaining NaN values by dropping every row that contains one. Other approaches, such as filling in the missing values, are also possible.

In [None]:
train_df.dropna(inplace=True)
test_df.dropna(inplace=True)
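
One alternative to dropping rows is imputation, that is, filling the gaps with a typical value. The sketch below is only an illustration on a made-up toy frame (`toy_df` and its values are assumptions, not the Titanic data): numeric columns are filled with the median and the categorical column with its most frequent value.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for Titanic-style columns with missing values
# (hypothetical data, for illustration only).
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 8.05, np.nan],
})

# Numeric columns: fill with the median, which is less sensitive to outliers than the mean.
toy_df["Age"] = toy_df["Age"].fillna(toy_df["Age"].median())
toy_df["Fare"] = toy_df["Fare"].fillna(toy_df["Fare"].median())

# Categorical column: fill with the most frequent value (the mode).
toy_df["Embarked"] = toy_df["Embarked"].fillna(toy_df["Embarked"].mode()[0])

# Count the missing values left in each column.
print(toy_df.isna().sum())
```

Unlike `dropna`, this keeps every row, which matters when each row of the test data needs a prediction.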

In [None]:
train_df.head(20)

In [None]:
train_df.info()

Now we have no null values in the train data.

## 3.2 Converting Categorical Features

In [None]:
embarked_df = pd.get_dummies(train_df['Embarked'],drop_first=True)
embarked_df.head()

In [None]:
embarked_df_test = pd.get_dummies(test_df['Embarked'],drop_first=True)

In [None]:
sex_df = pd.get_dummies(train_df['Sex'],drop_first=True)
sex_df.head()

In [None]:
sex_df_test = pd.get_dummies(test_df['Sex'],drop_first=True)
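
To see what `drop_first=True` does in the cells above, here is a small sketch on a made-up series (`toy_sex` is a hypothetical stand-in for the 'Sex' column). With two categories, one indicator column already carries all the information, so the first category is dropped.

```python
import pandas as pd

# Hypothetical stand-in for the 'Sex' column.
toy_sex = pd.Series(["male", "female", "female", "male"], name="Sex")

# Without drop_first: one indicator column per category.
full = pd.get_dummies(toy_sex)

# With drop_first=True: the first category (alphabetically) is dropped,
# so a row with no indicator set implicitly means 'female'.
reduced = pd.get_dummies(toy_sex, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

The same idea applies to 'Embarked': three ports are encoded with two indicator columns.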

In [None]:
train_df.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)
train_df.head()

In [None]:
test_df.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)
test_df.head()

In [None]:
train_df = pd.concat([train_df,sex_df,embarked_df],axis=1)
train_df.head()

In [None]:
test_df = pd.concat([test_df,sex_df_test,embarked_df_test],axis=1)
test_df.head()

# 4. Machine Learning <a id="heading4"></a>

In [None]:
from sklearn.model_selection import train_test_split # https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html


In [None]:
X_train, X_validation, y_train, y_validation = train_test_split(train_df.drop('Survived',axis=1), 
                                                    train_df['Survived'], test_size=0.20)
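
The split above is random, so the accuracy scores change from run to run. As a sketch on made-up arrays (not the Titanic data), passing `random_state` makes the split reproducible, and `stratify` keeps the class proportions the same in both parts. Both are standard `train_test_split` parameters; using them here is a suggestion, not something the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 2 features, 15 negatives and 5 positives.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# random_state fixes the shuffle; stratify=y preserves the 75/25 class ratio.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

print(X_tr.shape, X_val.shape)
print("positives in validation:", y_val.sum())
```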


## 4.1 logistic regression model

In [None]:
# Import the LogisticRegression model and the accuracy metric.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create the model object.
model = LogisticRegression(solver='liblinear',
                           penalty='l1', random_state=42)

# Fit the model with "X_train" and "y_train".
model.fit(X_train, y_train)

# Once the model is trained, we check how well it performs on data that
# was not used to fit it: we predict on the "X_validation" portion.
y_pred = model.predict(X_validation)

# The predictions are saved in "y_pred". We then compare them with the
# actual values ("y_validation") to measure the model's accuracy.
accuracy_score(y_validation, y_pred)
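
Accuracy is simply the fraction of predictions that match the actual labels. The sketch below works through it on made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical) and also prints a confusion matrix, which shows where the mistakes fall.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 5 passengers, 1 wrong prediction (the third one).
y_true_toy = [0, 1, 1, 0, 1]
y_pred_toy = [0, 1, 0, 0, 1]

# 4 of 5 predictions are correct.
acc = accuracy_score(y_true_toy, y_pred_toy)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy)

print(acc)  # 0.8
print(cm)
```

Here the single error is a survivor predicted as a non-survivor, which appears in the second row, first column of the matrix.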


Now we can predict the test set.

In [None]:
y_test = model.predict(test_df)
print(y_test[1])

In [None]:
print(y_test)

In [None]:
test_df.head()

## 4.2 KNeighbors Model

In [None]:
from sklearn.neighbors import KNeighborsClassifier

model_knn = KNeighborsClassifier()


model_knn.fit(X_train, y_train)
y_pred_knn = model_knn.predict(X_validation)
acc_knn = accuracy_score(y_validation, y_pred_knn)

print("The Score for KNeighbors is: " + str(acc_knn))
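
KNeighbors classifies by distance, so features on a large scale (such as 'Fare' or 'PassengerId') can dominate small-scale ones. A common remedy is to standardize the features first. The sketch below uses synthetic data (an assumption, not the Titanic set) to compare a plain KNN with one wrapped in a `StandardScaler` pipeline; the scores it prints depend on the generated data.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one informative feature on a small scale and
# one pure-noise feature on a much larger scale.
rng = np.random.default_rng(0)
signal = rng.uniform(0, 1, size=200)
noise = rng.uniform(0, 1000, size=200)
X = np.column_stack([signal, noise])
y = (signal > 0.5).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.20, random_state=0)

# Plain KNN: distances are driven mostly by the large-scale noise feature.
raw_knn = KNeighborsClassifier().fit(X_tr, y_tr)
raw_acc = accuracy_score(y_val, raw_knn.predict(X_val))

# Scaled KNN: the scaler is fitted on the training split only and
# applied automatically before the neighbor search.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)
scaled_acc = accuracy_score(y_val, scaled_knn.predict(X_val))

print("raw:", raw_acc, "scaled:", scaled_acc)
```

On the Titanic features the same pipeline could be fitted with `X_train` and `y_train` in place of the synthetic arrays.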

In [None]:
y_test_knn = model_knn.predict(test_df)
print(y_test_knn[1])

In [None]:
print(y_test_knn)