# Introduction
The sinking of the Titanic is one of the most notorious shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

<font color = blue>
Content


1. [Load and Check Data](#1)
1. [Variable Description](#2)
    
    1. [Univariate Variable Analysis](#3)
    
        1. [Categorical Variable Analysis](#4)
    
        1. [Numerical Variable Analysis](#5)
1. [Basic Data Analysis](#6)
1. [Outlier Detection](#7)
1. [Missing Value](#8)
    1. [Find missing value](#9)
    
    1. [Fill missing value](#10)
1. [Visualization](#11)
    1. [Correlation Between SibSp - Parch - Age - Fare - Survived](#12)
    1. [SibSp with Survived features](#13)
    1. [Parch with Survived](#14)
    1. [Pclass with Survived](#15)
    1. [Age -- Survived](#16)
    1. [Pclass -- Survived -- Age](#17)
    1. [Embarked -- Sex -- Pclass -- Survived](#18)
    1. [Embarked -- Sex -- Fare -- Survived](#19)
1. [Fill Missing: Age Feature](#20)
    

<a id = '1'></a><br>
### [1. Load and Check Data](#1)

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-pastel' if 'seaborn-v0_8-pastel' in plt.style.available else 'seaborn-pastel') # theme of grids
import seaborn as sns
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

In [1]:
train_df = pd.read_csv('/kaggle/input/titanic/train.csv')
test_df = pd.read_csv('/kaggle/input/titanic/test.csv')
test_PassengerID = test_df['PassengerId']

<hr >
<a id = '2'></a><br>

## [2. Variable Description](#2)

<hr >

Our Variable Features:
1. PassengerId: Unique number of each passenger
1. Survived: 0 = No, 1 = Yes
1. Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
1. Name
1. Sex
1. Age
1. SibSp: Number of siblings / spouses aboard the Titanic
1. Parch: Number of parents / children aboard the Titanic
1. Ticket: Ticket number
1. Fare: Passenger fare (ticket price)
1. Cabin: Cabin number
1. Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)


* pclass: A proxy for socio-economic status (SES)
    * 1st = Upper
    * 2nd = Middle
    * 3rd = Lower
* age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
* sibsp: The dataset defines family relations in this way:
    * Sibling = brother, sister, stepbrother, stepsister
    * Spouse = husband, wife (mistresses and fiancés were ignored)
* parch: The dataset defines family relations in this way:
    * Parent = mother, father
    * Child = daughter, son, stepdaughter, stepson
    * Some children travelled only with a nanny, therefore parch = 0 for them.
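
The age convention described above can be checked in code. This is a minimal sketch on a made-up `Age` column; the values and the `AgeEstimated` column name are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical ages: an infant, an estimated age, an exact age, a missing age
ages = pd.DataFrame({"Age": [0.42, 28.5, 30.0, np.nan]})

# Estimated ages are recorded as xx.5; ages below 1 are fractional for a different reason
ages["AgeEstimated"] = (ages["Age"] >= 1) & (ages["Age"] % 1 == 0.5)
print(ages)
```

Missing ages compare as `False`, so they are not flagged as estimated.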

In [1]:
train_df.info()

* float64: Age and Fare
* int64: PassengerId, Survived, Pclass, SibSp, Parch
* object: Name, Sex, Ticket, Cabin, Embarked
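
The same split by dtype can be obtained programmatically instead of reading the `info()` output. This is a small sketch on an illustrative frame with only a few of the columns; `sample`, `numeric_cols` and `other_cols` are names introduced here.

```python
import pandas as pd

# Illustrative three-row frame with a subset of the Titanic columns
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
    "Sex": ["male", "female", "female"],
})

# Columns with a numeric dtype (int64 and float64 here)
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
# Everything else (text columns such as Sex)
other_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, other_cols)
```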

<hr >
<a id = '3'></a><br>

## [3. Univariate Variable Description](#3)

   
* [a. Categorical Variable:](#4) Survived, Sex, Pclass, Embarked, Cabin, Name, Ticket, SibSp and Parch

* [b. Numerical Variable:](#5) Age, Fare and PassengerId



<hr >
<a id = '4'></a><br>

### [a. Categorical Variable:](#4)
    
    

In [1]:
def bar_plot(variable):
    """
    input: variable example: Sex
    output: bar plot & value count
    """
    # get feature
    var = train_df[variable]
    # count the occurrences of each category
    varValue = var.value_counts()
    # visualize
    plt.figure(figsize = (9,3))
    plt.bar(varValue.index,varValue)
    plt.xticks(varValue.index,varValue.index.values)
    plt.ylabel('Frequency')
    plt.show() 
    print("{}: \n {}" .format(variable,varValue))
    
    
    

In [1]:
category1 = ['Survived','Sex','Pclass','Embarked','SibSp','Parch']
for i in category1:
    bar_plot(i)
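
The bar heights drawn by `bar_plot` come straight from `value_counts`. A minimal sketch on a made-up `Embarked` column shows the absolute counts and, as an optional extra, the relative shares:

```python
import pandas as pd

# Hypothetical Embarked values for six passengers
embarked = pd.Series(["S", "C", "S", "Q", "S", "C"], name="Embarked")

counts = embarked.value_counts()                # absolute frequency per category
shares = embarked.value_counts(normalize=True)  # relative frequency per category
print(counts)
print(shares)
```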
    

In [1]:
category2 = ['Cabin', 'Name', 'Ticket']
for i in category2:
    print("{}: \n" .format(train_df[i].value_counts()))

<hr >
<a id = '5'></a><br>

###  [b. Numerical Variable:](#5)


In [1]:
def plot_hist(variable):
    plt.figure(figsize = (9,3))
    plt.hist(train_df[variable], bins = 100)
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} distribution with hist".format(variable))
    plt.show()

In [1]:
numericVar = ["Fare", "Age","PassengerId"]
for n in numericVar:
    plot_hist(n)
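
To read the numbers behind a histogram without plotting, the bin counts can be computed with `np.histogram`. This is a sketch on made-up ages, with 4 bins instead of 100 to keep the output short:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value
age = pd.Series([2.0, 19.0, 22.0, 24.0, 31.0, 35.0, 58.0, np.nan])

# Drop NaN first: np.histogram cannot infer a range from missing values
counts, edges = np.histogram(age.dropna(), bins=4)
print(counts)  # number of passengers per bin
print(edges)   # bin boundaries, one more than the number of bins
```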

<hr > 

<a id = '6'></a><br>
### [4. Basic Data Analysis](#6)


* Pclass - Survived
* Sex - Survived
* SibSp - Survived
* Parch - Survived

In [1]:
#Pclass vs Survived
train_df[['Pclass','Survived']].groupby(['Pclass'], as_index = False).mean().sort_values(by='Survived', ascending=False) 

In [1]:
#Sex vs Survived
train_df[['Sex','Survived']].groupby(['Sex'], as_index = False).mean().sort_values(by='Survived', ascending=False) 

In [1]:
#SibSp vs Survived
train_df[['SibSp','Survived']].groupby(['SibSp'], as_index = False).mean().sort_values(by='Survived', ascending=False) 

In [1]:
# Parch vs Survived
train_df[['Parch','Survived']].groupby(['Parch'], as_index = False).mean().sort_values(by='Survived', ascending=False) 
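
The four cells above work because `Survived` is coded 0/1, so its mean within a group is that group's survival rate. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical passengers: Survived is 0 or 1
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column = fraction of ones = survival rate per class
rates = (
    toy[["Pclass", "Survived"]]
    .groupby("Pclass", as_index=False)
    .mean()
    .sort_values(by="Survived", ascending=False)
)
print(rates)
```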

<hr > 

<a id = '7'></a><br>
### [5. Outlier Detection](#7)

In [1]:
def detect_outliers(df,features):
    """
    Return the indices of rows that are outliers (1.5 * IQR rule)
    in more than two of the given features.
    """
    outlier_indices = []
    
    for c in features:
        # 1st quartile
        # note: np.percentile returns NaN for a column with missing values (e.g. Age),
        # so such a column flags no outliers
        Q1 = np.percentile(df[c],25)
        # 3rd quartile
        Q3 = np.percentile(df[c],75)
        # IQR
        IQR = Q3 - Q1
        # Outlier step
        outlier_step = IQR * 1.5
        # detect outliers and their indices
        outlier_list_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index
        # store indices
        outlier_indices.extend(outlier_list_col)
    
    # count how many features flagged each row
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = [i for i, v in outlier_indices.items() if v > 2]
    
    return multiple_outliers

In [1]:
train_df.loc[detect_outliers(train_df,['Age','SibSp','Parch','Fare'])]

In [1]:
#Drop outliers
train_df = train_df.drop(detect_outliers(train_df,["Age","SibSp","Parch","Fare"]),axis = 0).reset_index(drop = True)
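
The two steps inside `detect_outliers` can be traced by hand on small inputs. This sketch uses made-up fares for the IQR fences and a made-up list of flagged row indices for the "more than two features" filter:

```python
from collections import Counter

import numpy as np

# Hypothetical fares with one extreme value
fare = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

q1, q3 = np.percentile(fare, [25, 75])
step = 1.5 * (q3 - q1)               # outlier step
lower, upper = q1 - step, q3 + step  # values outside these fences are outliers
outliers = fare[(fare < lower) | (fare > upper)]
print(lower, upper, outliers)

# Hypothetical row indices flagged across features: row 7 is flagged three times
flagged = Counter([7, 7, 7, 4, 4, 9])
multiple = [idx for idx, n in flagged.items() if n > 2]
print(multiple)
```

Only row 7 passes the filter, because `v > 2` keeps rows flagged in at least three features.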

<hr > 

<a id = '8'></a><br>
### [6. Missing Value](#8)


<a id = '9'></a><br>
#### [A. Find missing value](#9)
   

In [1]:
train_df_len = len(train_df )
train_df = pd.concat([train_df,test_df], axis=0).reset_index(drop = True)

In [1]:
# Which columns have null values?
train_df.columns[train_df.isnull().any()]

In [1]:
train_df.isnull().sum()
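
Both calls above can be tried on a small made-up frame to see what they return:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in Age and Embarked
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Embarked": ["S", "C", None, "S"],
})

cols_with_nan = toy.columns[toy.isnull().any()].tolist()  # columns with at least one gap
nan_counts = toy.isnull().sum()                            # number of gaps per column
print(cols_with_nan)
print(nan_counts)
```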

<a id = '10'></a><br>
#### [B. Fill missing value](#10)

*  Embarked has 2 missing values
*  Fare has only 1 missing value
   

In [1]:
# Find null values
train_df[train_df['Embarked'].isnull()]

In [1]:
# Try to infer the port of embarkation from the ticket price.
where = train_df['Fare'].where(train_df['Fare'] == 80).dropna()
where
# This approach was abandoned: the only rows with Fare == 80 are these two passengers.
# Another way:

In [1]:
train_df.boxplot(column='Fare',by='Embarked')
plt.show()
# Both passengers paid a fare of 80, so we assume they embarked from C: the typical fare at C is closer to 80 than at the other ports.

In [1]:
train_df['Embarked'] = train_df['Embarked'].fillna('C')
train_df[train_df['Embarked'].isnull()]
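
The visual comparison on the boxplot can also be done numerically. This sketch is not the notebook's method: it compares the median fare per port with the target fare, and the fares are made up for illustration (real values would come from `train_df`).

```python
import pandas as pd

# Hypothetical fares per port
toy = pd.DataFrame({
    "Embarked": ["C", "C", "C", "Q", "Q", "S", "S", "S"],
    "Fare":     [60.0, 78.0, 90.0, 7.0, 8.0, 10.0, 26.0, 30.0],
})

target_fare = 80  # fare paid by the passengers with a missing port
median_fare = toy.groupby("Embarked")["Fare"].median()
# Port whose median fare is closest to the target
closest_port = (median_fare - target_fare).abs().idxmin()
print(median_fare)
print(closest_port)
```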

In [1]:
train_df[train_df['Fare'].isnull()]

In [1]:
train_df[train_df['Pclass']==3]
# First, filter the rows with Pclass == 3.
# Then take the Fare values of those rows, i.e. compute their mean fare.

In [1]:
train_df['Fare'] = train_df['Fare'].fillna(np.mean(train_df[train_df['Pclass']==3]['Fare']))
train_df[train_df["Fare"].isnull()]
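
The cell above fills the single gap with the 3rd class mean. If gaps could occur in any class, a more general variant is to fill each row with the mean of its own class. This is a sketch on made-up data and an alternative to the notebook's approach:

```python
import numpy as np
import pandas as pd

# Hypothetical fares with one gap in 3rd class
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Fare":   [80.0, 60.0, 8.0, 12.0, np.nan],
})

# Mean fare of each passenger's own class, aligned row by row
class_mean = toy.groupby("Pclass")["Fare"].transform("mean")
toy["Fare"] = toy["Fare"].fillna(class_mean)
print(toy)
```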

<a id = "11"></a><br>
# Visualization

<a id = "12"></a><br>
## Correlation Between SibSp -- Parch -- Age -- Fare -- Survived

In [1]:
list1 = ["SibSp", "Parch", "Age", "Fare", "Survived"]
sns.heatmap(train_df[list1].corr(), annot = True, fmt = ".2f")
# annot=True writes the correlation value in each cell
# fmt=".2f" prints two decimal places; ".3f" would print three
plt.show()

# Comment: Fare appears to be correlated with Survived (0.26)
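
Each cell of the heatmap is a Pearson correlation coefficient. As a sketch of what `.corr()` computes, the coefficient can be reproduced by hand on made-up values (these numbers are illustrative and do not reproduce the 0.26 above):

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen only to illustrate the computation
toy = pd.DataFrame({
    "Fare":     [7.0, 8.0, 13.0, 26.0, 71.0, 80.0],
    "Survived": [0, 0, 1, 0, 1, 1],
})

# Pearson r = covariance / (std_x * std_y), computed by hand
x = toy["Fare"] - toy["Fare"].mean()
y = toy["Survived"] - toy["Survived"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

r_pandas = toy["Fare"].corr(toy["Survived"])  # same quantity via pandas
print(round(r_manual, 2), round(r_pandas, 2))
```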

<a id = "13"></a><br>
## SibSp with Survived features

In [1]:
g = sns.catplot(x='SibSp',y='Survived',data=train_df,kind='bar',height=6)
g.set_ylabels('Survived probability')
plt.show()


# Comment: Passengers with many siblings/spouses (SibSp) have a lower chance of survival
# Passengers with SibSp <= 2 have a higher chance of survival
# We can consider a new feature describing these categories

<a id = "14"></a><br>
## Parch -- Survived

In [1]:
g = sns.catplot(x='Parch',y='Survived',data=train_df,kind='bar',height=6)
g.set_ylabels('Survived probability') 
plt.show()

* SibSp and Parch can be combined into a new feature with a threshold of 3
* Small families have a higher chance of survival.
* The survival rate of passengers with Parch = 3 has a large standard deviation
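
One possible reading of the feature idea above is sketched here. The column names `FamilySize` and `SmallFamily`, the `+ 1` for the passenger, and applying the threshold of 3 to the combined size are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical SibSp / Parch values for four passengers
toy = pd.DataFrame({"SibSp": [0, 1, 1, 4], "Parch": [0, 0, 2, 2]})

# Hypothetical feature: the passenger plus relatives aboard
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
# Hypothetical flag using the threshold of 3 mentioned above
toy["SmallFamily"] = (toy["FamilySize"] <= 3).astype(int)
print(toy)
```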

<a id = "15"></a><br>
## Pclass -- Survived

In [1]:
g = sns.catplot(x='Pclass',y='Survived',data=train_df,kind='bar',height=6)
g.set_ylabels('Survived probability') 
plt.show()

<a id = "16"></a><br>
## Age -- Survived

In [1]:
g = sns.FacetGrid(train_df, col = "Survived")
g.map(sns.histplot, "Age", bins = 25, kde = True, stat = "density")
plt.show()

* Passengers with Age <= 10 have a high survival rate
* The oldest passengers (around 80 years old) appear among the survivors
* A large number of 20-year-olds did not survive
* Most passengers are in the 15-35 age range
* Use the Age distribution to fill the missing values of the Age feature
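
The observations above are read off the plots. A numeric counterpart is to bin ages and compute the survival rate per band. This is a sketch on made-up data; the band edges are arbitrary assumptions.

```python
import pandas as pd

# Hypothetical ages and outcomes
toy = pd.DataFrame({
    "Age":      [4, 8, 20, 22, 25, 33, 47, 80],
    "Survived": [1, 1, 0, 0, 1, 0, 0, 1],
})

# Hypothetical age bands; right edges are inclusive, e.g. (0, 10]
bands = pd.cut(toy["Age"], bins=[0, 10, 35, 80])
rate_by_band = toy.groupby(bands, observed=True)["Survived"].mean()
print(rate_by_band)
```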

<a id = "17"></a><br>
## Pclass -- Survived -- Age

In [1]:
g = sns.FacetGrid(train_df,col="Survived",row="Pclass",height=2)
g.map(plt.hist, "Age",bins=20)
g.add_legend()
plt.show()

#### Description
* Pclass is an important feature for model training

<a id = "18"></a><br>
## Embarked -- Sex -- Pclass -- Survived

In [1]:
g = sns.FacetGrid(train_df, row = "Embarked", height = 2)
g.map(sns.pointplot, "Pclass","Survived","Sex")
g.add_legend()
plt.show()

* Females have a much better survival rate than males
* Males have a better survival rate, by Pclass, when they embarked at C
* Embarked and Sex will be used in training

<a id = "19"></a><br>
## Embarked -- Sex -- Fare -- Survived

In [1]:
g = sns.FacetGrid(train_df, row = "Embarked", col = "Survived", height = 2.3)
g.map(sns.barplot, "Sex", "Fare")
g.add_legend()
plt.show()

* Passengers who paid a higher fare have a better survival rate
* Fare can be used as a categorical feature for training
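
One way to turn Fare into a categorical feature is quantile binning with `pd.qcut`. This is a sketch on made-up fares; the choice of four bins is an assumption, not something the notebook specifies.

```python
import pandas as pd

# Hypothetical fares for eight passengers
fare = pd.Series([5, 7, 8, 10, 15, 30, 80, 500], name="Fare")

# Quartile-based bins: each category holds roughly the same number of passengers
fare_band = pd.qcut(fare, q=4, labels=False)
print(fare_band.tolist())
```

Quantile bins are less sensitive to a few very high fares than equal-width bins would be.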

<a id = "20"></a><br>
## Fill Missing: Age Feature

In [1]:
train_df[train_df["Age"].isnull()]
# The output has 256 rows and 12 columns: 256 passengers have a missing Age

In [1]:
sns.catplot(x="Sex",y="Age",data = train_df, kind="box")
plt.show()

* Sex is not informative for age prediction: the age distribution looks the same for both sexes

In [1]:
sns.catplot(x="Sex",y="Age",hue="Pclass",data = train_df, kind="box")
plt.show()

Looking at the median ages:
1st class passengers are older than 2nd class passengers, who are in turn older than 3rd class passengers.
Older passengers tend to be richer, younger ones poorer :D

In [1]:
sns.catplot(x="Parch",y="Age", data = train_df, kind="box")
sns.catplot(x="SibSp",y="Age", data = train_df, kind="box")
plt.show()

* For Parch
    - If Parch is less than 3, passengers tend to be younger than 30
* For SibSp
    - If feature of 

In [1]:
# Encode Sex as a number: 1 for male, 0 otherwise

train_df['sexnumber'] = [1 if i == "male" else 0 for i in train_df['Sex']]

In [1]:
sns.heatmap(train_df[["Age","SibSp","Parch","Pclass","sexnumber"]].corr(), annot = True)
plt.show()

* Age is not correlated with Sex, but it is correlated with SibSp, Parch and Pclass

In [1]:
index_nan_age = list(train_df["Age"][train_df["Age"].isnull()].index)
for i in index_nan_age:
    # median age of passengers with the same SibSp, Parch and Pclass as passenger i
    age_pred = train_df["Age"][((train_df["SibSp"] == train_df.iloc[i]["SibSp"])&(train_df["Parch"] == train_df.iloc[i]["Parch"])& (train_df["Pclass"] == train_df.iloc[i]["Pclass"]))].median()
    # fallback: overall median age
    age_med = train_df["Age"].median()
    # .loc avoids chained assignment; the index was reset, so label i equals position i
    if not np.isnan(age_pred):
        train_df.loc[i, "Age"] = age_pred
    else:
        train_df.loc[i, "Age"] = age_med
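
The same idea can be written without an explicit loop. This sketch on made-up data is an alternative, not the notebook's exact procedure: it computes every group median once from the observed ages, whereas the loop above also sees the values it has already filled in.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers; three of them have no Age
toy = pd.DataFrame({
    "SibSp":  [0, 0, 0, 1, 1, 5],
    "Parch":  [0, 0, 0, 2, 2, 2],
    "Pclass": [3, 3, 3, 1, 1, 3],
    "Age":    [20.0, 30.0, np.nan, 40.0, np.nan, np.nan],
})

# Median age of passengers sharing the same SibSp, Parch and Pclass
group_median = toy.groupby(["SibSp", "Parch", "Pclass"])["Age"].transform("median")
overall_median = toy["Age"].median()

# Use the group median where it exists, otherwise the overall median
toy["Age"] = toy["Age"].fillna(group_median).fillna(overall_median)
print(toy)
```

The last row has no peers with a known age, so it falls back to the overall median.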

In [1]:
train_df[train_df["Age"].isnull()]
# No rows remain: the missing Age values have been filled
