______________________
# Summary
#### The aim of this notebook is to perform exploratory data analysis (EDA) and, as a result, create two promising new features: `Ticket_class` and `Cabin_type`. It presents four insights into the data.
#### The styling of the graphs below was inspired by this [notebook](https://www.kaggle.com/dwin183287/tps-jan-2021-eda-models). Those graphs are on another level. Thanks, [Sharlto Cope](https://www.kaggle.com/dwin183287).
#### My previous notebook in this series: [Making first submission for baseline score (0.788)](https://www.kaggle.com/abhinavnayak/making-first-submission-for-baseline-score-0-788)

<a id='content-table'></a>
## Table of Contents
1. [Looking at the null values](#tag1)
2. [Insight-1 : Survival rate among different classes in a column](#tag2)   
    - [`Pclass`](#tag2a)   
    - [`Sex`](#tag2b)   
    - [`SibSp`](#tag2c)   
    - [`Parch`](#tag2d)   
3. [Insight-2 : The `Ticket` column & creating new feature `Ticket_class`](#tag3)   
    - [Insight-1](#tag3a)    
    - [Insight-2](#tag3b)    
    - [Creating `Ticket_class` column](#tag3c)    
4. [Insight-3 : The `Cabin` column & creating new feature `Cabin_type`](#tag4)
5. [Insight-4 : Correlation between `Fare` & `Pclass`](#tag5)

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
train = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/train.csv')
test = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/test.csv')
submission = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/sample_submission.csv')

print(train.shape, test.shape, submission.shape)
print(train.columns)                             #printing the column names
print(set(train.columns)-set(test.columns))      #printing the target column

In [None]:
train.head()

<a id='tag1'></a>
## 1) [Looking at the null values](#content-table)

It helps to remember which columns have a very high percentage of null values. This will be useful in [Insight-3](#tag4).

In [None]:
df = train.isnull().sum()/len(train)*100
df

In [None]:
plt.figure(figsize = (9,5), facecolor='#f6f6f6')
sns.barplot(x = df.values, y = list(df.index), color='#ffd514')
plt.title('% Missing values')

ax = plt.gca()
ax.set_facecolor('#f6f6f6')
for s in ["top", "right"]:
    ax.spines[s].set_visible(False)

<a id='tag2'></a>
## 2) [Insight-1 : Survival rate among different classes in a column](#content-table)

I compared the survival rate for each value of a given column. This is done for the categorical columns, i.e. `Pclass`, `Sex`, `SibSp` and `Parch`.
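The comparison repeated in the subsections below boils down to a single group-by. Here is a minimal sketch on a made-up toy frame; the helper name `survival_rate_by` and the data are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def survival_rate_by(frame, column, target="Survived"):
    """Return survival rate (%), survivors and group size per value of `column`."""
    grouped = frame.groupby(column)[target]
    return pd.DataFrame({
        "rate_pct": grouped.mean() * 100,  # mean of a 0/1 column = fraction survived
        "survived": grouped.sum(),
        "total": grouped.count(),
    })

# Toy data, for illustration only (not the competition data)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
})
rates = survival_rate_by(toy, "Pclass")
print(rates)
```

On the real data the same call would be `survival_rate_by(train, 'Pclass')`, and likewise for `Sex`, `SibSp` and `Parch`.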

<a id='tag2a'></a>
### a) [`Pclass`](#content-table)

In [None]:
train.groupby('Pclass')['Survived'].apply(lambda x: f'{x.sum()/len(x)*100: 0.2f} % ({x.sum()}/{len(x)}) Survived')

In [None]:
df = train.groupby('Pclass')['Survived'].apply(lambda x: x.sum()/len(x)*100)

plt.figure(figsize = (4,5), facecolor='#f6f6f6')
sns.barplot(x = list(df.index), y = df.values, color='#ffd514')
plt.title("% Survival Rate - [PClass]")

ax = plt.gca()
ax.set_facecolor('#f6f6f6')
for s in ["top", "right"]:
    ax.spines[s].set_visible(False)

> ### What does 57.98% mean here?
> *Class 1 in `Pclass` has `30315` people of the total `100000` people in train data. Of these `30315`, `17576` people have survived.   
> Hence survival rate in class 1 = `17576/30315 = 57.98%`*

The 3rd class has the lowest survival rate of all. `Pclass` therefore has a strong effect on `Survived`.
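The class 1 figure quoted above can be reproduced from the two counts alone. This is only a sanity check of the arithmetic; the variable names are illustrative.

```python
# Counts quoted in the text for Pclass == 1
survived_class1 = 17576
total_class1 = 30315

# Survival rate = survivors / group size, expressed as a percentage
rate_class1 = survived_class1 / total_class1 * 100
print(f"{rate_class1:.2f} %")
```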

<a id='tag2b'></a>
### b) [`Sex`](#content-table)

In [None]:
train.groupby('Sex')['Survived'].apply(lambda x: f'{x.sum()/len(x)*100: 0.2f} % ({x.sum()}/{len(x)}) Survived')

In [None]:
df = train.groupby('Sex')['Survived'].apply(lambda x: x.sum()/len(x)*100)

plt.figure(figsize = (4,5), facecolor='#f6f6f6')
sns.barplot(x = list(df.index), y = df.values, color='#ffd514')
plt.title('% Survival Rate - [Sex]')

ax = plt.gca()
ax.set_facecolor('#f6f6f6')
for s in ["top", "right"]:
    ax.spines[s].set_visible(False)

Females have a survival rate more than 3 times that of males. `Sex` therefore has a strong effect on `Survived`.

<a id='tag2c'></a>
### c) [`SibSp`](#content-table)

In [None]:
train.groupby('SibSp')['Survived'].apply(lambda x: f'{x.sum()/len(x)*100: 0.2f} % ({x.sum()}/{len(x)}) Survived')

In [None]:
df = train.groupby('SibSp')['Survived'].apply(lambda x: x.sum()/len(x)*100)

plt.figure(figsize = (9,5), facecolor='#f6f6f6')
sns.barplot(x = list(df.index), y = df.values, color='#ffd514')
plt.title('% Survival Rate - [SibSp]')

ax = plt.gca()
ax.set_facecolor('#f6f6f6')
for s in ["top", "right"]:
    ax.spines[s].set_visible(False)

<a id='tag2d'></a>
### d) [`Parch`](#content-table)

In [None]:
train.groupby('Parch')['Survived'].apply(lambda x: f'{x.sum()/len(x)*100: 0.2f} % ({x.sum()}/{len(x)}) Survived')

In [None]:
df = train.groupby('Parch')['Survived'].apply(lambda x: x.sum()/len(x)*100)

plt.figure(figsize = (9,5), facecolor='#f6f6f6')
sns.barplot(x = list(df.index), y = df.values, color='#ffd514')
plt.title('% Survival Rate - [Parch]')

ax = plt.gca()
ax.set_facecolor('#f6f6f6')
for s in ["top", "right"]:
    ax.spines[s].set_visible(False)

<a id='tag3'></a>
## 3) [Insight-2 : The `Ticket` column & creating new feature `Ticket_class`](#content-table)

<a id='tag3a'></a>
### a) [Insight-1](#content-table)
I found that not all values in this column are strings; some are floats.
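One likely explanation, which is worth verifying with `train['Ticket'].isnull().sum()`, is that the float entries are missing values: when `pd.read_csv` loads a text column, empty cells come back as `NaN`, which is a Python float. A minimal sketch with a made-up three-row CSV:

```python
import io
import pandas as pd

# Tiny made-up CSV with one empty Ticket cell (row 1)
csv_text = "PassengerId,Ticket\n0,A/5 21171\n1,\n2,PC 17599\n"
toy = pd.read_csv(io.StringIO(csv_text))

# Count the Python types found in the column
print(toy["Ticket"].map(type).value_counts())

# Select the float entries and check that they are all missing values
is_float = toy["Ticket"].map(lambda v: isinstance(v, float))
floats = toy["Ticket"][is_float]
print(floats.isnull().all())
```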

In [None]:
_ = train['Ticket'].apply(lambda x: type(x))
_.value_counts()

<a id='tag3b'></a>
### b) [Insight-2](#content-table)
Although more than 66% of the values in this column are unique, the tickets follow naming patterns.
After some EDA I found 12 types of tickets.
- Some tickets are just numbers (the float values are counted here too)
- Some tickets start with 'A.', 'A/5', 'A/4' etc.
- Some tickets start with 'C.A', 'CA' etc.
- Some tickets start with 'SC/PARIS', 'SC/Paris', 'SC/AH' etc.
- and so on
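One way to surface such patterns is to strip each ticket down to a normalised prefix and count the results. The sketch below is an exploratory aid under stated assumptions, not the 12-type mapping used in this notebook; the helper `ticket_prefix`, the `'NUM'` label and the sample tickets are all made up for illustration.

```python
import re
import pandas as pd

def ticket_prefix(ticket):
    """Return an upper-cased alphabetic prefix, or 'NUM' for numeric or missing tickets."""
    if not isinstance(ticket, str):
        return "NUM"  # assumption: missing values are grouped with numeric tickets
    parts = ticket.split()
    head = parts[0] if parts else ""
    if re.fullmatch(r"\d*", head):
        return "NUM"  # the first token is purely numeric (or empty)
    # Drop dots, slashes and digits so that 'C.A.' and 'CA' collapse together
    return re.sub(r"[./\d]", "", head).upper()

# Made-up sample tickets
toy = pd.Series(["12345", "A/5 21171", "C.A. 31029", "CA 2144",
                 "SC/PARIS 2133", "SC/Paris 2163", None])
prefixes = toy.map(ticket_prefix)
print(prefixes.value_counts())
```

Counting the prefixes this way makes spelling variants of the same ticket family easy to spot before grouping them by hand.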

In [None]:
pd.set_option('display.max_rows', 101)   # To enable printing 100 rows
train['Ticket'].value_counts().head(50)

Here I map each `Ticket` value to one of the 12 types.

In [None]:
import re

def fn1(x):
    # Map a raw ticket value to one of 12 coarse ticket types.
    # Patterns are raw strings and are tried in order, so the first match wins.
    if isinstance(x, str):
        if re.search(r"^\d+$", x):
            return 'type1'
        if re.search(r"^(A\.|A/S|A/5|A/4|AQ/4|AQ/3|A4)", x):
            return 'type2'
        if re.search(r"^(C|CA|CA\.|C\.A\.)", x):
            return 'type3'
        if re.search(r"^(SC|S\.C\.|SC/PARIS|S\.C\./PARIS|SC/Paris|SC/AH|S\.C\./A\.4)", x):
            return 'type4'
        if re.search(r"^(PC|PP|P\.P|P/PP)", x):
            return 'type5'
        if re.search(r"^(W\.C\.|W\./C\.|W/C)", x):
            return 'type6'
        if re.search(r"^(SOTON/O\.Q|SOTON/OQ|STON/O|STON/O2|SOTON/O2)", x):
            return 'type7'
        if re.search(r"^(WE/P|W\.E\.P)", x):
            return 'type8'
        if re.search(r"^(F\.C|F\.C\.C|Fa)", x):
            return 'type9'
        if re.search(r"^(LP)", x):
            return 'type10'
        if re.search(r"^(S\.O\.C|S\.P|S\.O|P\.P|SO/C)", x):
            return 'type11'
        if re.search(r"^(S\.W\./PP|SW/PP)", x):
            return 'type12'
        # No pattern matched: return the raw value so it shows up in value_counts()
        return x
    else:
        # Non-string entries (floats) are grouped with the numeric tickets
        return 'type1'

df = train['Ticket'].apply(fn1)
df.value_counts()

Based on the above insight, we can create a new feature `Ticket_class`.

<a id='tag3c'></a>
### c) [Creating `Ticket_class` column](#content-table)

In [None]:
train['Ticket_class'] = df.values
train.head()

<a id='tag4'></a>
## 4) [Insight-3 : The `Cabin` column & creating new feature `Cabin_type`](#content-table)

**1 -** `Cabin` column has a pattern. 
- Some start with letter 'A'
- Some start with letter 'B' 
- and so on 

In [None]:
_ = train['Cabin']
_[_.notnull()].head(10)

**2 -** Also, the `Cabin` column is more than 67% empty, but we can check whether this has anything to do with survival.
We compare the survival rates of records whose `Cabin` is filled with those where it is missing.

In [None]:
train['Cabin_filledornot'] = train['Cabin'].notnull().astype(int)
train.head()

In [None]:
train.groupby('Cabin_filledornot')['Survived'].apply(lambda x: f'{x.sum()/len(x)*100: 0.2f} % ({x.sum()}/{len(x)}) Survived')

We see that passengers with a null `Cabin` have almost half the survival rate of those with it filled.  
We will create a new feature `Cabin_type`, where null values are filled with 'X' and all other values are replaced by their first letter.

In [None]:
train['Cabin_type'] = train['Cabin'].fillna('X').map(lambda x: x[0])    # first letter, 'X' for missing
train.drop(['Cabin', 'Cabin_filledornot'], axis = 1, inplace = True)
train.head()

<a id='tag5'></a>
## 5) [Insight-4 : Correlation between `Fare` & `Pclass`](#content-table)

`Fare` has 0.133% and 0.134% missing values in the train and test data respectively. One way to fill them is to find another column that correlates with `Fare` and fill each missing value with the mean `Fare` of its group in that column. The most promising candidate is `Pclass`, which tells us the ticket class.

In [None]:
train.groupby('Pclass')['Fare'].mean()

The average fare differs considerably between classes, so we can fill missing `Fare` values with the mean fare of the passenger's `Pclass`.
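The fill itself is not shown in this notebook. One possible way to do it, sketched here on made-up numbers, is `groupby(...).transform('mean')`, which returns each row's class mean aligned with the original index so it can be passed straight to `fillna`.

```python
import numpy as np
import pandas as pd

# Toy data with one missing fare in each class (illustrative values only)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Fare":   [80.0, 100.0, np.nan, 10.0, np.nan, 20.0],
})

# Mean fare of each row's class; NaN fares are ignored when averaging
class_mean = toy.groupby("Pclass")["Fare"].transform("mean")

# Replace only the missing fares with the mean of their own class
toy["Fare"] = toy["Fare"].fillna(class_mean)
print(toy)
```

For the test set, a reasonable choice would be to reuse the class means computed on the train data rather than recomputing them, so that both sets are filled consistently.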

______________