# Introduction

This project looks at the RMS Titanic, which sank in 1912 after colliding with an iceberg, and at who survived. 1502 of the 2224 passengers and crew died in the disaster.

In this project, we will try to build a predictive model to find out what sorts of people were more likely to survive. We will use passenger data such as name, age, gender, and socio-economic class.

<font color = 'blue' >

1. [Load and Check Data](#1)
1. [Variable Description](#2)
    * [Univariate Variable Analysis](#3)
       * [Categorical Variable](#4)
       * [Numerical Variable](#5)
1. [Basic Data Analysis](#6)
1. [Outlier Detection](#7)
1. [Missing Values](#8)
    * [Finding Missing Values](#9)
    * [Filling Missing Values](#10)

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
# Note: matplotlib 3.6+ renamed this style to "seaborn-v0_8-whitegrid"
plt.style.use("seaborn-whitegrid")

import seaborn as sns

from collections import Counter

import warnings
warnings.filterwarnings("ignore")

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a ID = "1"></a><br>
## 1. Load and Check Data

In [1]:
train_df = pd.read_csv("/kaggle/input/titanic/train.csv")
test_df = pd.read_csv("/kaggle/input/titanic/test.csv")
test_PassengerId = test_df["PassengerId"]

In [1]:
train_df.columns

In [1]:
train_df.head()

In [1]:
train_df.describe()

<a ID = "2"></a><br>
# 2. Variable Description
1. PassengerId: unique ID number
2. Survived: whether the passenger survived (1) or died (0)
3. Pclass: passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
4. Name: passenger name
5. Sex: gender of passenger
6. Age: age of passenger
7. SibSp: number of siblings or spouses aboard
8. Parch: number of parents or children aboard
9. Ticket: ticket number
10. Fare: amount paid for the ticket
11. Cabin: cabin number
12. Embarked: port where passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)

In [1]:
train_df.info()

* float64(2): Fare, Age
* int64(5): PassengerId, Survived, Pclass, SibSp, Parch
* object(5): Name, Sex, Ticket, Cabin, Embarked
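
The dtype summary above can also be produced programmatically. The following is a minimal sketch on a small made-up frame; the helper name `columns_by_dtype` and the `toy` data are assumptions for illustration and are not part of the original notebook. On the real data one would pass `train_df` instead.

```python
import numpy as np
import pandas as pd


def columns_by_dtype(df):
    """Group column names by the string name of their dtype."""
    groups = {}
    for col, dtype in df.dtypes.items():
        groups.setdefault(str(dtype), []).append(col)
    return groups


# Hypothetical toy frame that mimics a few Titanic columns.
toy = pd.DataFrame({
    "PassengerId": np.array([1, 2, 3], dtype="int64"),
    "Fare": np.array([7.25, 71.28, 7.92], dtype="float64"),
    "Sex": ["male", "female", "female"],
})

dtype_groups = columns_by_dtype(toy)
for dtype_name, cols in dtype_groups.items():
    print(f"{dtype_name}({len(cols)}): {', '.join(cols)}")
```

The name reported for text columns depends on the pandas version, so the sketch does not rely on it.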

<a ID = "3"></a><br>
# Univariate Variable Analysis
   * Categorical Variable: Survived, Sex, Pclass, Embarked, Cabin, Name, Ticket, SibSp, Parch
   * Numerical Variable: Age, PassengerId, Fare
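
One rough way to sanity-check a split like this is to count distinct values per column: non-numeric columns, and numeric columns with only a few distinct values, tend to behave like categories. The sketch below is a heuristic under stated assumptions, not the rule used in this notebook; the helper `split_by_cardinality`, the threshold `max_levels`, and the toy values are all hypothetical.

```python
import pandas as pd


def split_by_cardinality(df, max_levels=10):
    """Heuristically split columns into categorical and numerical lists."""
    categorical, numerical = [], []
    for col in df.columns:
        is_numeric = pd.api.types.is_numeric_dtype(df[col])
        # Text columns, or numeric columns with few distinct values,
        # are treated as categorical.
        if not is_numeric or df[col].nunique() <= max_levels:
            categorical.append(col)
        else:
            numerical.append(col)
    return categorical, numerical


# Hypothetical toy frame; the values are made up.
toy = pd.DataFrame({
    "Pclass": [1, 2, 3, 3, 1, 2],
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46],
})

categorical_cols, numerical_cols = split_by_cardinality(toy, max_levels=3)
print(categorical_cols, numerical_cols)
```

An identifier column such as PassengerId has as many distinct values as rows, so a heuristic like this would class it as numerical, as the list above does.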

<a ID = "4"></a><br>
## Categorical Variable

In [1]:
def bar_plot(variable):
    """
    input: variable ex: "Sex"
    
    output: bar plot + value count
    
    """
    # get feature
    var = train_df[variable]
    # count categorical variable
    varValue = var.value_counts()
    
    # visualise
    plt.figure(figsize = (5,3))
    plt.bar(varValue.index, varValue)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    print("{} \n {}".format(variable, varValue))

In [1]:
category1 = ["Survived", "Sex", "Pclass", "Embarked", "SibSp", "Parch"]
for i in category1:
    bar_plot(i)

In [1]:
category2 = ["Cabin", "Name", "Ticket"]
for i in category2:
    print("{} \n".format(train_df[i].value_counts()))

<a ID = "5"></a><br>
## Numerical Variable

In [1]:
def plot_hist(variable):
    plt.figure(figsize= (9,3))
    plt.hist(train_df[variable], bins=80)
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} distribution with hist".format(variable))
    plt.show()

In [1]:
numericVar = ["Fare", "Age", "PassengerId"]
for c in numericVar:
    plot_hist(c)

<a ID = "6"></a><br>
# Basic Data Analysis
* Pclass - Survived
* Sex - Survived
* SibSp - Survived
* Parch - Survived
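
Because Survived is coded as 0 or 1, its mean within a group is that group's survival rate, so each pair above reduces to a group-wise mean. A minimal sketch of that pattern as a reusable helper follows; the name `survival_rate_by` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd


def survival_rate_by(df, feature, target="Survived"):
    """Return the mean of `target` per level of `feature`, highest first."""
    return (
        df.groupby(feature, as_index=False)[target]
        .mean()
        .sort_values(by=target, ascending=False)
        .reset_index(drop=True)
    )


# Hypothetical toy data; the values are made up.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Survived": [0, 1, 1, 0, 0],
})

rates = survival_rate_by(toy, "Sex")
print(rates)
```

With the real data, the call would be `survival_rate_by(train_df, "Pclass")`, and likewise for the other features.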

In [1]:
# Pclass - Survived
train_df[["Pclass","Survived"]].groupby(["Pclass"], as_index= False).mean().sort_values(by="Survived", ascending=False)

In [1]:
# Sex - Survived
train_df[["Sex","Survived"]].groupby(["Sex"], as_index= False).mean().sort_values(by="Survived", ascending=False)

In [1]:
# SibSp - Survived
train_df[["SibSp","Survived"]].groupby(["SibSp"], as_index= False).mean().sort_values(by="Survived", ascending=False)

In [1]:
# Parch - Survived
train_df[["Parch","Survived"]].groupby(["Parch"], as_index= False).mean().sort_values(by="Survived", ascending=False)

<a ID = "7"></a><br>
# Outlier Detection

In [1]:
def detect_outliers(df, features):
    outlier_indices = []
    
    for i in features:
        
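        # 1st Quartile
        # (np.percentile returns NaN for a column with missing values, so such a
        # column flags no outliers; np.nanpercentile would ignore the NaNs)
        Q1 = np.percentile(df[i],25)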
        
        # 3rd Quartile
        Q3 = np.percentile(df[i],75)
        
        # IQR
        IQR = Q3 - Q1
        
        # Outlier step
        outlier_step = IQR * 1.5
        
        # detect outliers and their indices
        outlier_list_col = df[(df[i] < Q1 - outlier_step) | (df[i] > Q3 + outlier_step )].index 
    
        # store indices
        outlier_indices.extend(outlier_list_col)
        
    # count how many features flagged each row
    outlier_indices = Counter(outlier_indices)
    # keep only rows that are outliers in more than 2 features
    multiple_outliers = [i for i, v in outlier_indices.items() if v > 2]
       
    return multiple_outliers
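
The function above flags a value as an outlier when it falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. A minimal sketch of that rule on a made-up array shows what the two fences look like. It also shows how `np.percentile` behaves on data with missing values, which matters for columns such as Age that contain NaN in this dataset. All values here are hypothetical.

```python
import numpy as np

# Hypothetical toy data with one obvious extreme value.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

q1 = np.percentile(values, 25)
q3 = np.percentile(values, 75)
step = 1.5 * (q3 - q1)
lower, upper = q1 - step, q3 + step

outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)

# With a missing value, np.percentile propagates NaN. Every comparison
# against a NaN fence is False, so nothing would be flagged.
with_nan = np.array([1.0, np.nan, 3.0])
q1_plain = np.percentile(with_nan, 25)
# np.nanpercentile ignores the NaN and uses the remaining values.
q1_ignoring_nan = np.nanpercentile(with_nan, 25)
print(q1_plain, q1_ignoring_nan)
```

Whether to switch to `np.nanpercentile` is a design choice: it would make columns with missing values contribute to the outlier count, which changes the set of rows that get dropped.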

In [1]:
train_df.loc[detect_outliers(train_df,["Age","SibSp","Parch","Fare"])]

In [1]:
# drop outliers
train_df = train_df.drop(detect_outliers(train_df,["Age","SibSp","Parch","Fare"]), axis = 0).reset_index(drop=True)

<a ID = "8"></a><br>
# Missing Values

In [1]:
train_df_len = len(train_df)
train_df = pd.concat([train_df, test_df], axis=0).reset_index(drop=True)
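
Storing the training length before concatenating presumably allows the combined frame to be split back into its two parts later. A minimal sketch of that round trip on hypothetical toy frames follows; the names `toy_train`, `toy_test`, `combined`, `train_part`, and `test_part` are illustrative only.

```python
import pandas as pd

# Hypothetical toy frames: the test frame has no Survived column.
toy_train = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})
toy_test = pd.DataFrame({"PassengerId": [4, 5]})

toy_train_len = len(toy_train)
combined = pd.concat([toy_train, toy_test], axis=0).reset_index(drop=True)

# Rows that came from the test frame have no label, so Survived is NaN there.
print(combined["Survived"].isnull().sum())

# Split back by position using the stored length.
train_part = combined.iloc[:toy_train_len]
test_part = combined.iloc[toy_train_len:].drop(columns=["Survived"])
```

One consequence is that Survived shows up as a column with missing values in the combined frame; those rows are the unlabeled test rows, not gaps to fill.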

In [1]:
train_df.head()

<a ID = "9"></a><br>
## Finding Missing Values
    

In [1]:
train_df.columns[train_df.isnull().any()]

In [1]:
train_df.isnull().sum()
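
The raw counts above can be easier to read as a table that lists only the affected columns together with the share of rows missing. A minimal sketch follows, assuming a hypothetical helper `missing_summary` and made-up toy values.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return missing-value counts and percentages for affected columns."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": 100 * counts / len(df),
    })
    # Keep only columns that actually have missing values.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical toy frame; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Pclass": [3, 1, 3, 3],
})

summary = missing_summary(toy)
print(summary)
```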

<a ID = "10"></a><br>
## Filling Missing Values
* "Embarked" has 2 missing values
* "Fare" has 1 missing value

In [1]:
train_df[train_df["Embarked"].isnull()]

In [1]:
train_df.boxplot(column="Fare", by="Embarked")
plt.show()
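
A box plot like the one above can be complemented by per-port summary statistics, which make it possible to compare the fare of a passenger with a missing port against the typical fare at each port. The sketch below is one possible way to do that, not necessarily how the choice is made in this notebook; the toy values are made up and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the last row has no port of embarkation.
toy = pd.DataFrame({
    "Embarked": ["S", "S", "C", "C", "Q", None],
    "Fare": [8.0, 10.0, 60.0, 90.0, 7.5, 80.0],
})

# Median fare per port (rows with a missing port are excluded by groupby).
median_fare = toy.groupby("Embarked")["Fare"].median()

# Fare paid by the passenger whose port is missing.
target_fare = toy.loc[toy["Embarked"].isnull(), "Fare"].iloc[0]

# Port whose median fare is closest to that fare.
closest_port = (median_fare - target_fare).abs().idxmin()
print(median_fare)
print(closest_port)
```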

In [1]:
train_df["Embarked"] = train_df["Embarked"].fillna("C")

In [1]:
train_df[train_df["Embarked"].isnull()]

In [1]:
train_df[train_df["Fare"].isnull()]

In [1]:
train_df[train_df["Pclass"] == 3]

In [1]:
np.mean(train_df[train_df["Pclass"] == 3]["Fare"])

In [1]:
train_df["Fare"] = train_df["Fare"].fillna(np.mean(train_df[train_df["Pclass"] == 3]["Fare"]))

In [1]:
train_df[train_df["Fare"].isnull()]
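
The fill above uses the third-class mean because the passenger with the missing fare is in third class. A more general variant, shown here only as a sketch and not as what the notebook does, fills each missing fare with the mean of that row's own class, so no class needs to be hard-coded. The toy values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing third-class fare.
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Fare": [80.0, 100.0, 8.0, 10.0, np.nan],
})

# Mean fare of each row's class, aligned to the original index
# (missing values are skipped when the mean is computed).
class_mean = toy.groupby("Pclass")["Fare"].transform("mean")
toy["Fare"] = toy["Fare"].fillna(class_mean)
print(toy)
```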