# Titanic Data Preprocessing Step by Step
## Intro
 These notes come only from *my experience on Titanic*, so they may not hold for you; treat them with caution.

 - "Has_Cabin" feature does not help. I engineered a feature that is 0 if a passenger has no Cabin (NaN) and 1 if they have one. It might make sense, because the cabin data for 1st class passengers was found (IRL) on the body of steward Herbert Cave, so I tried it. But that doesn't seem to help.
 - "Deck" feature does not help. Based on letters found in Cabin column we may engineer a Deck feature, indicating which deck (A - G, T or U for Unknown) the passenger was on. But it's rather noisy, it doesn't help the score.
 - "Embarked" does not help. In my experiments it had no measurable effect on the score, so I don't see why so many people include it in their kernels.
 - *Edit*: actually certain algorithms may perform better if you turn ordinal features into one-hot (dummy) ones (like turning Pclass into Pclass_1, Pclass_2 and Pclass_3 features with possible values {0, 1}). The pro is higher accuracy in certain cases. The cons are that you lose the relation between Pclasses (the algorithm will treat them as independent, unordered classes, when in fact they are ordered: Pclass=1 is "better" than Pclass=3) and that you add dimensions, which is not always good because of the curse of dimensionality. In my specific case turning Pclass into 3 features did not help, but I learned it's a good idea to try both approaches and see which is better in your case.
 - I don't know about feature scaling in R, maybe R methods scale them by default? If not, and if you're using R, try scaling, it may help.
 - There is not much sense in scaling features that are already 0 or 1, like Sex, but for now I scale them all. You can try picking which features to scale. If you don't use bins (if you use Age or Fare "as is"), scaling may boost your score a bit, so try it.
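
The feature ideas in the list above can be sketched on a few made-up rows. This is only an illustration under assumptions: the toy values are invented, `StandardScaler` is one of several possible scalers, and the `Age_scaled` / `Fare_scaled` column names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy rows shaped like the Titanic columns; the values are made up.
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Cabin": ["C85", None, "E46", None],
    "Age": [38.0, 22.0, 54.0, 27.0],
    "Fare": [71.28, 7.25, 51.86, 11.13],
})

# Has_Cabin: 1 if a cabin is recorded, 0 if it is missing.
toy["Has_Cabin"] = toy["Cabin"].notna().astype(int)

# Deck: first letter of the cabin code, "U" for unknown.
toy["Deck"] = toy["Cabin"].str[0].fillna("U")

# One-hot encode Pclass into Pclass_1, Pclass_2, Pclass_3 with values {0, 1}.
pclass_onehot = pd.get_dummies(toy["Pclass"], prefix="Pclass").astype(int)

# Scale the continuous columns to zero mean and unit variance.
scaled = StandardScaler().fit_transform(toy[["Age", "Fare"]])
toy["Age_scaled"] = scaled[:, 0]
toy["Fare_scaled"] = scaled[:, 1]
```

Whether any of these columns helps is an empirical question; the point of the list is that each one is cheap to build and test.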


## Workflow goals

The data science solutions workflow solves for seven major goals.

**Classifying.** We may want to classify or categorize our samples. We may also want to understand the implications or correlation of different classes with our solution goal.

**Correlating.** One can approach the problem based on the features available in the training dataset. Which features within the dataset contribute significantly to our solution goal? Statistically speaking, is there a [correlation](https://en.wikiversity.org/wiki/Correlation) between a feature and the solution goal? As the feature values change, does the solution state change as well, and vice versa? This can be tested both for numerical and categorical features in the given dataset. We may also want to determine correlation among features other than survival for subsequent goals and workflow stages. Correlating certain features may help in creating, completing, or correcting features.
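
As a rough sketch of both cases, the snippet below checks a categorical and a numerical feature against the target. The rows are invented for illustration and are not taken from the Titanic data.

```python
import pandas as pd

# Made-up rows; only the column names mirror the Titanic data.
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0],
    "Sex": ["female", "male", "female", "male", "female", "male"],
    "Fare": [80.0, 7.5, 53.1, 8.05, 30.0, 13.0],
})

# Categorical feature: survival rate per category.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

# Numerical feature: Pearson correlation with the target.
fare_corr = toy["Fare"].corr(toy["Survived"])

print(rate_by_sex)
print(fare_corr)
```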

**Converting.** For the modeling stage, one needs to prepare the data. Depending on the choice of model algorithm, all features may need to be converted to numerical equivalents, for instance by converting text categorical values to numeric values.
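
Two common ways to do this in pandas are an explicit mapping and category codes. The sketch below uses made-up rows, and the `Sex_num` and `Embarked_code` column names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Explicit mapping: full control over which number each label gets.
toy["Sex_num"] = toy["Sex"].map({"male": 0, "female": 1})

# Category codes: labels are numbered in sorted order (C=0, Q=1, S=2).
toy["Embarked_code"] = toy["Embarked"].astype("category").cat.codes
```

Integer codes imply an order between the labels, which may or may not be meaningful for a given feature.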

**Completing.** Data preparation may also require us to estimate any missing values within a feature. Model algorithms may work best when there are no missing values.

**Correcting.** We may also analyze the given training dataset for errors or possibly inaccurate values within features and try to correct these values or exclude the samples containing the errors. One way to do this is to detect any outliers among our samples or features. We may also completely discard a feature if it is not contributing to the analysis or may significantly skew the results.
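
One simple way to flag outliers is the interquartile-range rule. The sketch below applies it to a handful of made-up fares; the 1.5 multiplier is the conventional choice, not something this notebook prescribes.

```python
import pandas as pd

# Made-up fares with one extreme value.
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 512.33], name="Fare")

q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = fares[(fares < lower) | (fares > upper)]

print(outliers)
```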

**Creating.** Can we create new features based on an existing feature or a set of features, such that the new feature follows the correlation, conversion, and completeness goals?
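
As one illustration, a family-size feature can be derived from SibSp and Parch. Both `FamilySize` and `IsAlone` are hypothetical names for this sketch and are not used elsewhere in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Family size counts the passenger plus relatives aboard.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Flag passengers travelling without any relatives.
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)
```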

**Charting.** How do we select the right visualization plots and charts depending on the nature of the data and the solution goals?

# Part 1 : Importing Libraries

In [None]:
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Part 2 : Importing & Exploring Data

In [None]:
Titanic = pd.read_csv("../input/titanic/train.csv")
test = pd.read_csv("../input/titanic/test.csv")


In [None]:
Titanic.columns

**Data Dictionary**

* Survived: 0 = No, 1 = Yes
* Pclass: Ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd
* SibSp: # of siblings/spouses aboard the Titanic (0 means neither a spouse nor siblings aboard)
* Parch: # of parents/children aboard the Titanic
* Ticket: Ticket number
* Cabin: Cabin number
* Embarked: Port of embarkation, C = Cherbourg, Q = Queenstown, S = Southampton

In [None]:
Titanic.head()

In [None]:
Titanic.tail()

In [None]:
Titanic.shape

In [None]:
Titanic.info()

We've got a sense of our variables, their types, and the first few observations of each. We know we're working with 1309 observations of 12 variables, of which 891 observations are from the training set and 418 are from the test set. When we separate the variables by type, we have the ordinal variable PassengerId, the label variables Name and Ticket, numeric variables such as Age, SibSp, Parch, and Fare, and categorical variables such as Survived, Pclass, Sex, Cabin, and Embarked.

In [None]:
Titanic.describe()

In [None]:
Titanic.describe(include=['O'])

# Part 3 : Analyzing Data by Pivoting Features

In [None]:
Titanic[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
Titanic[["Sex" , "Survived"]].groupby(["Sex"] , as_index = False).mean().sort_values(by="Survived" , ascending = False)

In [None]:
Titanic[["Parch" , "Survived"]].groupby(["Parch"] , as_index = False) .mean().sort_values(by="Survived" , ascending = False)

In [None]:
Titanic[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
Titanic[['SibSp', 'Survived']].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)

# Part 4 : Analyzing Data by Visualization

In [None]:

Titanic.hist(bins=10,figsize=(9,7),grid=False)

In [None]:
print("No of Passengers in original data:  " , str(len(Titanic.index)))

In [None]:
sns.countplot(x="Survived" , data=Titanic , color="orange")

In [None]:
sns.countplot(x="Survived", hue="Parch",data=Titanic)

In [None]:
sns.countplot(x="Survived" ,hue="Sex" , data=Titanic )



We can see from the count plot above that the survival rate of female passengers is higher than that of male passengers.


In [None]:
sns.countplot(x="Survived" ,hue="Pclass" , data=Titanic )

In [None]:
sns.countplot(x="Survived" ,hue="Embarked" , data=Titanic )

In [None]:
sns.countplot(x="Survived", hue="SibSp",data=Titanic)

In [None]:
Titanic["Parch"].plot.hist( figsize=(5,4))

In [None]:
Titanic["SibSp"].plot.hist()

# Part 5 : Data Cleaning Or Filling The Missing Values

 **Data Cleaning**

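From the data set, we notice missing values in the Age, Cabin, and Embarked columns of the training data, and in Fare in the test data. One option for Age is to replace missing values with a random sample drawn from the existing ages. For Cabin, since the cabin number itself says little about the result, one option is to create a new Cabin column indicating how many cabins the passenger has. The cells below compare three simpler strategies instead: deleting rows, `SimpleImputer`, and filling with the mean or median.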
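
The random-sample fill for Age and the cabin count mentioned above are not what the cells in this part implement, so here is a minimal sketch of the idea on made-up rows. The fixed seed and the `Cabin_count` column name are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names mirror the Titanic data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan, 54.0],
    "Cabin": ["C85", None, "C23 C25 C27", None, "E46"],
})

# Fill each missing Age with a value drawn at random from the observed ages.
rng = np.random.default_rng(0)  # fixed seed so the fill is reproducible
observed_ages = toy["Age"].dropna().to_numpy()
missing = toy["Age"].isna()
toy.loc[missing, "Age"] = rng.choice(observed_ages, size=int(missing.sum()))

# Count the cabins listed per passenger; a missing Cabin counts as 0.
toy["Cabin_count"] = toy["Cabin"].fillna("").str.split().str.len()
```

Sampling from the observed ages keeps the shape of the Age distribution, whereas a mean or median fill piles all imputed rows onto a single value.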

## Part 5.1 : Using Three Different Methods

### Part 5.1.1 : Simple Row Del Method 

In [None]:
Titanic.isnull()

In [None]:
Titanic.isnull().sum()

In [None]:
sns.heatmap(Titanic.isnull())

In [None]:
null_var = Titanic.isnull().sum()/Titanic.shape[0] *100
null_var

In [None]:
# Columns with more than 20% missing values
drop_column = null_var[null_var >20].keys()
drop_column

In [None]:
# Drop the columns that are mostly missing
N_Titanic_datA = Titanic.drop(columns = drop_column)

In [None]:
Titanic_copy = Titanic.copy()
Titanic_copy2 = Titanic.copy()
Titanic_Deep = Titanic_copy.copy()

In [None]:
sns.heatmap( N_Titanic_datA.isnull())

In [None]:
N_Titanic_datA.isnull().sum()/Titanic_Deep.shape[0] *100

In [None]:
N_Titanic_datAA = N_Titanic_datA.dropna()

In [None]:
sns.heatmap( N_Titanic_datAA.isnull())

In [None]:
Categorical_Values = N_Titanic_datAA.select_dtypes(include=["object"]).columns
Categorical_Values_test = test.select_dtypes(include=["object"]).columns

In [None]:
Numarical_Values = N_Titanic_datAA.select_dtypes(include=['int64','float64']).columns
Numarical_Values_test = test.select_dtypes(include=['int64','float64']).columns

In [None]:
test.shape

In [None]:
def cat_var_dist(var):
    return pd.concat([Titanic_Deep[var].value_counts()/Titanic_Deep.shape[0] * 100, 
          N_Titanic_datAA[var].value_counts()/N_Titanic_datAA.shape[0] * 100], axis=1,
         keys=[var+'_org', var+'clean'])
    

In [None]:
cat_var_dist("Ticket")

### Part 5.1.2 : SimpleImputer Method

In [None]:
Imputer_mean = SimpleImputer(strategy='mean')


In [None]:
Imputer_mean.fit(Titanic_Deep[Numarical_Values])


In [None]:
Imputer_mean.statistics_

In [None]:
Imputer_mean.transform(Titanic_Deep[Numarical_Values])

In [None]:
Titanic_Deep[Numarical_Values] = Imputer_mean.transform(Titanic_Deep[Numarical_Values])

In [None]:
Titanic_Deep[Numarical_Values].isnull().sum()

In [None]:
# A separate imputer for categorical columns: fill with the most frequent value
Imputer_mode = SimpleImputer(strategy='most_frequent')

In [None]:
Titanic_Deep[Categorical_Values] = Imputer_mode.fit_transform(Titanic_Deep[Categorical_Values])

In [None]:
Titanic_Deep[Categorical_Values].isnull().sum()

In [None]:
New_Titanic_datA = pd.concat([Titanic_Deep[Numarical_Values] , Titanic_Deep[Categorical_Values]] , axis=1)


In [None]:
New_Titanic_datA.isnull().sum()

### Part 5.1.3 : Mean Median and Mode Method

In [None]:
skip_column = null_var[null_var >20].keys()
skip_column


In [None]:
Nn_Titanic_datA = Titanic_copy.drop(columns = skip_column)


In [None]:
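# numeric_only=True: recent pandas versions raise an error when averaging text columns
Titanic_mean = Nn_Titanic_datA.fillna(Nn_Titanic_datA.mean(numeric_only=True))
# Drop the few rows that still have a missing categorical value (Embarked)
Titanic_mean = Titanic_mean.dropna()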


In [None]:
print(Titanic_mean.isnull().sum())

In [None]:
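# numeric_only=True: recent pandas versions raise an error when taking the median of text columns
Titanic_median = Nn_Titanic_datA.fillna(Nn_Titanic_datA.median(numeric_only=True))
test_median =  test.fillna(test.median(numeric_only=True))
Titanic_median = Titanic_median.dropna()
Titanic_median.isnull().sum()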

In [None]:
print("*"*30 , "Data Cleaning Using Different Method" , "*"*30)
print("*"*30 , "Simple Row Delete Method" , "*"*30)
print(N_Titanic_datAA.isnull().sum())
print("*"*30 , "SimpleImputer Method" , "*"*30)
print(New_Titanic_datA.isnull().sum())
print("*"*30 , "Median" , "*"*30)
print(Titanic_median.isnull().sum())
print("*"*30 , "Mean" , "*"*30)
print(Titanic_mean.isnull().sum())

# Part 6 : Finding categorical feature, Training Testing, and Accuracy Using Three Different Methods

## Part 6.1 : Simple Row Del Method

### Part 6.1.1 : Finding categorical feature

In [None]:
N_Titanic_datAA.tail()

In [None]:
sex = pd.get_dummies(N_Titanic_datAA["Sex"] , drop_first=True)
sexx =  pd.get_dummies(test_median["Sex"] , drop_first=True)

In [None]:
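# prefix gives string column names; recent scikit-learn versions reject mixed int/str names
pclass = pd.get_dummies(N_Titanic_datAA["Pclass"] , prefix="Pclass" , drop_first=True)
pclasss = pd.get_dummies(test_median["Pclass"] , prefix="Pclass" , drop_first=True)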

In [None]:
embarked = pd.get_dummies(N_Titanic_datAA["Embarked"] , drop_first=True)
embarkedd = pd.get_dummies(test_median["Embarked"] , drop_first=True)

In [None]:
N_Titanic_datAA_copy = N_Titanic_datAA.copy()

In [None]:
N_Titanic_datAA_copy.drop(['Embarked', 'Pclass' ,"Sex" , "Ticket" , "Name"], axis=1 , inplace=True)

In [None]:
test_median.drop(['Embarked', 'Pclass' ,"Sex" , "Ticket" , "Name"], axis=1 , inplace=True)

In [None]:
N_Titanic_datAA_copy = pd.concat([N_Titanic_datAA_copy ,sex ,pclass ,embarked] ,axis=1)
N_Titanic_datAA_copy.head()
test_median = pd.concat([test_median ,sexx ,pclasss ,embarkedd] ,axis=1)
test_median.head()

In [None]:
test_median.drop(["Cabin"], axis=1 , inplace=True)

In [None]:
test1= test_median.copy()

In [None]:
test_median.head()

### Part 6.1.2 : Training & Testing

In [None]:
X = N_Titanic_datAA_copy.drop("Survived" , axis=1)
y = N_Titanic_datAA_copy["Survived"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
y.shape

### Part 6.1.3 : Finding The Accuracy

In [None]:
# KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
# Score the held-out predictions, not the rows the model was fitted on
acc_knn = round(accuracy_score(y_test, y_pred) * 100, 2)
acc_knn

In [None]:
# Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(X_train, y_train)
y_pred = gaussian.predict(X_test)
acc_gaussian = round(accuracy_score(y_test, y_pred) * 100, 2)
acc_gaussian

In [None]:
# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(accuracy_score(y_test, y_pred) * 100, 2)
acc_decision_tree

In [None]:
# Random Forest

random_forest = RandomForestClassifier(n_estimators=10)
random_forest.fit(X_train, y_train)
y_pred = random_forest.predict(X_test)
acc_random_forest = round(accuracy_score(y_test, y_pred) * 100, 2)
acc_random_forest

In [None]:
# LogisticRegression
# max_iter is raised because the solver converges slowly on unscaled features
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
acc_logreg = round(accuracy_score(y_test, y_pred) * 100, 2)
acc_logreg

In [None]:
models = pd.DataFrame({
    'Model': [ 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 
              'Decision Tree'],
    'Score': [ acc_knn, acc_logreg, 
              acc_random_forest, acc_gaussian,
              acc_decision_tree]})
models.sort_values(by='Score', ascending=False)
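
A note on reading a score table like this one: accuracy measured on the rows a model was fitted on can be far higher than accuracy on held-out rows. The sketch below shows the gap on synthetic data (not the Titanic set) whose labels are pure noise; every name in it is made up for the illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic features with labels that carry no signal at all.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# An unpruned tree can memorize its training rows.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, tree.predict(X_tr))
test_acc = accuracy_score(y_te, tree.predict(X_te))
print(f"train accuracy: {train_acc:.2f}, held-out accuracy: {test_acc:.2f}")
```

The tree fits its own training rows perfectly even though there is nothing to learn, which is why the held-out number is the one worth comparing across models.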

## Part 6.2 : SimpleImputer Method

### Part 6.2.1 : Finding categorical feature

In [None]:
New_Titanic_datA.head()


In [None]:
embarked = pd.get_dummies(New_Titanic_datA["Embarked"] , drop_first=True)
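# prefix gives string column names; recent scikit-learn versions reject mixed numeric/str names
pclass = pd.get_dummies(New_Titanic_datA["Pclass"] , prefix="Pclass" , drop_first=True)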
sex = pd.get_dummies(New_Titanic_datA["Sex"] , drop_first=True)


In [None]:
New_Titanic_datA_copy= New_Titanic_datA.copy()

In [None]:
New_Titanic_datA_copy.drop(['Embarked', 'Pclass' ,"Sex" , "Ticket" , "Name"], axis=1 , inplace=True)

In [None]:
New_Titanic_datA_copy = pd.concat([New_Titanic_datA_copy ,sex ,pclass ,embarked] ,axis=1)
New_Titanic_datA_copy.head()

### Part 6.2.2 : Training & Testing

In [None]:
XXX = New_Titanic_datA_copy.drop("Survived" , axis=1)
yyy = New_Titanic_datA_copy["Survived"]

In [None]:
XXX_train, XXX_test, yyy_train, yyy_test = train_test_split(XXX, yyy, test_size=0.2, random_state=42)

### Part 6.2.3 : Finding The Accuracy

In [None]:
# KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(XXX_train, yyy_train)
yyy_pred = knn.predict(XXX_test)
# Score the held-out predictions, not the rows the model was fitted on
acc_knn = round(accuracy_score(yyy_test, yyy_pred) * 100, 2)
acc_knn

In [None]:
# LogisticRegression
# max_iter is raised because the solver converges slowly on unscaled features
logreg = LogisticRegression(max_iter=1000)
logreg.fit(XXX_train, yyy_train)
yyy_pred = logreg.predict(XXX_test)
acc_logreg = round(accuracy_score(yyy_test, yyy_pred) * 100, 2)
acc_logreg

In [None]:
# Random Forest

random_forest = RandomForestClassifier(n_estimators=10)
random_forest.fit(XXX_train, yyy_train)
yyy_pred = random_forest.predict(XXX_test)
acc_random_forest = round(accuracy_score(yyy_test, yyy_pred) * 100, 2)
acc_random_forest

In [None]:
# Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(XXX_train, yyy_train)
yyy_pred = gaussian.predict(XXX_test)
acc_gaussian = round(accuracy_score(yyy_test, yyy_pred) * 100, 2)
acc_gaussian

In [None]:
# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(XXX_train, yyy_train)
yyy_pred = decision_tree.predict(XXX_test)
acc_decision_tree = round(accuracy_score(yyy_test, yyy_pred) * 100, 2)
acc_decision_tree

In [None]:
models = pd.DataFrame({
    'Model': [ 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 
              'Decision Tree'],
    'Score': [ acc_knn, acc_logreg, 
              acc_random_forest, acc_gaussian,
              acc_decision_tree]})
models.sort_values(by='Score', ascending=False)

## Part 6.3 : Mean Median and Mode Method

### Part 6.3.1 : Finding categorical feature

In [None]:
Titanic_median.head()

In [None]:
embarked = pd.get_dummies(Titanic_median["Embarked"] , drop_first=True)
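# prefix gives string column names; recent scikit-learn versions reject mixed int/str names
pclass = pd.get_dummies(Titanic_median["Pclass"] , prefix="Pclass" , drop_first=True)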
sex = pd.get_dummies(Titanic_median["Sex"] , drop_first=True)


In [None]:
Titanic_median_copy= Titanic_median.copy()

In [None]:
Titanic_median_copy.drop(['Embarked', 'Pclass' ,"Sex" , "Ticket" , "Name"], axis=1 , inplace=True)



In [None]:
Titanic_median_copy = pd.concat([Titanic_median_copy ,sex ,pclass ,embarked] ,axis=1)
Titanic_median_copy.head()

### Part 6.3.2 : Training & Testing

In [None]:
XX = Titanic_median_copy.drop("Survived" , axis=1)
yy = Titanic_median_copy["Survived"]

In [None]:
XXX_train, XXX_test, yyy_train, yyy_test = train_test_split(XX, yy, test_size=0.2, random_state=42)

In [None]:
XXX_train.shape

### Part 6.3.3 : Finding The Accuracy

In [None]:
# KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(XXX_train, yyy_train)
yyy_pred = knn.predict(XXX_test)
# Score the held-out predictions, not the rows the model was fitted on
acc_knn = round(accuracy_score(yyy_test, yyy_pred) * 100, 2)
acc_knn

In [None]:
# Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(XXX_train, yyy_train)
yyy_pred = gaussian.predict(XXX_test)
acc_gaussian = round(accuracy_score(yyy_test, yyy_pred) * 100, 2)
acc_gaussian

In [None]:
# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(XXX_train, yyy_train)
yyy_pred = decision_tree.predict(XXX_test)
acc_decision_tree = round(accuracy_score(yyy_test, yyy_pred) * 100, 2)
acc_decision_tree

In [None]:
# Random Forest

random_forest = RandomForestClassifier(n_estimators=10)
random_forest.fit(XXX_train, yyy_train)
yyy_pred = random_forest.predict(XXX_test)
acc_random_forest = round(accuracy_score(yyy_test, yyy_pred) * 100, 2)
acc_random_forest

In [None]:
# LogisticRegression
# max_iter is raised because the solver converges slowly on unscaled features
logreg = LogisticRegression(max_iter=1000)
logreg.fit(XXX_train, yyy_train)
yyy_pred = logreg.predict(XXX_test)
acc_logreg = round(accuracy_score(yyy_test, yyy_pred) * 100, 2)
acc_logreg

In [None]:
models = pd.DataFrame({
    'Model': [ 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 
              'Decision Tree'],
    'Score': [ acc_knn, acc_logreg, 
              acc_random_forest, acc_gaussian,
              acc_decision_tree]})
models.sort_values(by='Score', ascending=False)

In [None]:
test_median.isnull().sum()

In [None]:
# Additional classifiers to try
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import Perceptron, SGDClassifier
from sklearn.svm import SVC, LinearSVC

gbk = GradientBoostingClassifier()
sgd = SGDClassifier()
perceptron = Perceptron()
linear_svc = LinearSVC()
svc = SVC()


In [None]:
# Gradient Boosting Classifier
gbkk = GradientBoostingClassifier()
gbkk.fit(X_train, y_train)
gbk_pred = gbkk.predict(X_test)
# Score the held-out predictions, not the rows the model was fitted on
acc_gbkk = round(accuracy_score(y_test, gbk_pred) * 100, 2)
predictions = gbkk.predict(test_median)
acc_gbkk

In [None]:
gbk_pred = gbkk.predict(X_test)
gbk_pred

In [None]:
# Stochastic Gradient Descent
sgd.fit(XXX_train, yyy_train)
yyy_pred = sgd.predict(XXX_test)
acc_sgd = round(accuracy_score(yyy_test, yyy_pred) * 100, 2)
acc_sgd

In [None]:
# Perceptron
perceptron.fit(XXX_train, yyy_train)
yyy_pred = perceptron.predict(XXX_test)
acc_perceptron = round(accuracy_score(yyy_test, yyy_pred) * 100, 2)
acc_perceptron

In [None]:
# Linear SVC
linear_svc.fit(XXX_train, yyy_train)
yyy_pred = linear_svc.predict(XXX_test)
acc_linear_svc = round(accuracy_score(yyy_test, yyy_pred) * 100, 2)
acc_linear_svc

In [None]:
# Support Vector Machine
svc.fit(XXX_train, yyy_train)
yyy_pred = svc.predict(XXX_test)
acc_svc = round(accuracy_score(yyy_test, yyy_pred) * 100, 2)
acc_svc

In [None]:
#set ids as PassengerId and predict survival 
ids = test1['PassengerId']
predictions = gbkk.predict(test_median)

#set the output as a dataframe and convert to csv file named submission.csv
output = pd.DataFrame({ 'PassengerId' : ids, 'Survived': predictions })
output.to_csv('submission.csv', index=False)