## 1 - Set Up Environment

In [1]:
import numpy as np
import pandas as pd
import pickle

<hr>

## 2 - Load Dataframe

In [2]:
with open('../Assets/Version 1-3.pickle', 'rb') as file:
    df = pickle.load(file)

df.head()

Unnamed: 0,PassengerId,Name,Ticket,Age,Parch,Fare,Pclass,Sex,Cabin,Embarked,Survived
0,1,"Braund, Mr. Owen Harris",A/5 21171,22.0,0,7.25,3,male,,S,0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",PC 17599,38.0,0,71.2833,1,female,C85,C,1
2,3,"Heikkinen, Miss. Laina",STON/O2. 3101282,26.0,0,7.925,3,female,,S,1
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",113803,35.0,0,53.1,1,female,C123,S,1
4,5,"Allen, Mr. William Henry",373450,35.0,0,8.05,3,male,,S,0


<hr>

## 3 - Declare Variable

In this section, I work only on the text-valued categorical variables:<br>
<li> Sex
<li> Cabin
<li> Embarked

In [3]:
CatTxt_list = ['Sex' , 'Cabin', 'Embarked']

<hr>

## 4 - Handle Null Values

First, we check whether these variables contain any null values.

In [4]:
def null_checker():
    # Print the row labels of the null entries in each categorical column.
    for col in CatTxt_list:
        print(col, ':', df.loc[df[col].isnull()].index)

Then we run the function:

In [5]:
null_checker()

Sex : Int64Index([], dtype='int64')
Cabin : Int64Index([   0,    2,    4,    5,    7,    8,    9,   12,   13,   14,
            ...
            1294, 1297, 1299, 1300, 1301, 1303, 1304, 1306, 1307, 1308],
           dtype='int64', length=1013)
Embarked : Int64Index([61, 829], dtype='int64')
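
When only the number of missing values matters, rather than every row label, a vectorized count can be a compact alternative to the loop. The following is a minimal sketch on a small made-up frame; `toy`, `cat_cols`, `null_counts` and `null_index` are hypothetical names and the data is not the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataframe.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Cabin': [np.nan, 'C85', np.nan, 'C123'],
    'Embarked': ['S', 'C', np.nan, 'S'],
})
cat_cols = ['Sex', 'Cabin', 'Embarked']

# isna() marks missing cells; sum() counts them per column.
null_counts = toy[cat_cols].isna().sum()
print(null_counts)

# Row labels of the missing entries, similar to what null_checker prints.
null_index = {col: toy.index[toy[col].isna()].tolist() for col in cat_cols}
print(null_index)
```

The per-column counts give a quick overview, while the dictionary of labels is handy when the affected rows need to be inspected.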


The result shows that <b>"Cabin"</b> and <b>"Embarked"</b> contain null values. Next, we check whether the share of missing values in each variable is below or above 5% of the whole dataset.

In [6]:
for col in CatTxt_list:
    count_nan = df[col].isna().sum()
    print(col, ':', round((count_nan / df.shape[0]) * 100, 3), '%')

Sex : 0.0 %
Cabin : 77.446 %
Embarked : 0.153 %
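
The 5% rule can also be written down as code, so that the drop-or-impute decision is made in one place. This is only a sketch under assumptions: `THRESHOLD`, `toy`, `to_drop` and `to_impute` are hypothetical names, and the toy data is invented for illustration.

```python
import numpy as np
import pandas as pd

THRESHOLD = 5.0  # percent; the cut-off used in this notebook

# Hypothetical toy data: 75% of 'Cabin' and 2.5% of 'Embarked' are missing.
toy = pd.DataFrame({
    'Sex': ['male'] * 40,
    'Cabin': [np.nan] * 30 + ['C85'] * 10,
    'Embarked': [np.nan] + ['S'] * 39,
})

# The mean of a boolean mask is the fraction of True values.
pct_missing = toy.isna().mean() * 100

# Below the threshold: drop the affected rows. At or above it: impute.
to_drop = [c for c, p in pct_missing.items() if 0 < p < THRESHOLD]
to_impute = [c for c, p in pct_missing.items() if p >= THRESHOLD]

print(pct_missing.round(3).to_dict())
print('drop rows for:', to_drop)
print('impute:', to_impute)
```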


>### 4.1 - Drop Null

Since fewer than 5% of the <b>"Embarked"</b> values are missing, we can drop the rows where it is null.

In [7]:
print("the dimension before removing null: " , df.shape)
df = df.dropna(subset=['Embarked'])
print("the dimension after removing null: " , df.shape)

the dimension before removing null:  (1308, 11)
the dimension after removing null:  (1306, 11)
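
To illustrate what `dropna(subset=...)` does, here is a small sketch on invented data (`toy` and `cleaned` are hypothetical names). It shows that only nulls in the listed column remove a row, and that the surviving rows keep their original index labels.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 has a null 'Embarked', rows 0 and 2 a null 'Cabin'.
toy = pd.DataFrame({
    'Cabin': [np.nan, 'B28', np.nan, 'C85'],
    'Embarked': ['S', np.nan, 'S', 'C'],
})

# Only nulls in 'Embarked' trigger a drop; rows with a null 'Cabin' are kept.
cleaned = toy.dropna(subset=['Embarked'])

print(cleaned.shape)
print(cleaned.index.tolist())  # label 1 is gone, the others are unchanged
```

Because the labels are preserved, the index of the cleaned frame can contain gaps.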


Now we can check the status of the null values again:

In [8]:
null_checker()

Sex : Int64Index([], dtype='int64')
Cabin : Int64Index([   0,    2,    4,    5,    7,    8,    9,   12,   13,   14,
            ...
            1294, 1297, 1299, 1300, 1301, 1303, 1304, 1306, 1307, 1308],
           dtype='int64', length=1013)
Embarked : Int64Index([], dtype='int64')


>### 4.2 - Impute Null

Since more than 5% of the <b>"Cabin"</b> values are missing, we should impute them instead of dropping the rows. Because the share of nulls is so high, we convert every null to a dedicated category, "Unknown".

In [9]:
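# Assign the result back: fillna(inplace=True) on a selected column
# may not update df in recent pandas versions.
df['Cabin'] = df['Cabin'].fillna('Unknown')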

Now we can run the null check again:

In [11]:
null_checker()

Sex : Int64Index([], dtype='int64')
Cabin : Int64Index([], dtype='int64')
Embarked : Int64Index([], dtype='int64')


<hr>

## 6 - Possible Range

According to the data documentation, these variables can only take the following values:<br>
<li> <b>Sex:</b> <i>male, female</i>
<li> <b>Cabin:</b> <i>A, B, C, D, E, F, G, T</i>
<li> <b>Embarked:</b> <i>C: Cherbourg, Q: Queenstown, S: Southampton</i>
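
A range check like this can also be automated by comparing each column with its set of allowed values. The sketch below runs on invented data; `allowed`, `toy` and `unexpected` are hypothetical names, and the allowed values for "Sex" are written in lowercase, as they appear in the data.

```python
import pandas as pd

# Allowed values per column (hypothetical dictionary for illustration).
allowed = {
    'Sex': {'male', 'female'},
    'Embarked': {'C', 'Q', 'S'},
}

# Hypothetical toy data with one out-of-range value in each column.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'Male'],
    'Embarked': ['S', 'C', 'X'],
})

# Collect the distinct values that fall outside each allowed set.
unexpected = {
    col: sorted(set(toy[col].dropna()) - values)
    for col, values in allowed.items()
}
print(unexpected)
```

An empty list for every column would mean that all values are within the documented range.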

>### 6.1 - Sex

In [12]:
print(df['Sex'].value_counts())

male      842
female    464
Name: Sex, dtype: int64


>### 6.2 - Cabin

In [13]:
print(df['Cabin'].value_counts())

Unknown            1013
C23 C25 C27           6
B57 B59 B63 B66       5
G6                    5
F33                   4
                   ... 
A14                   1
E63                   1
E12                   1
E38                   1
C105                  1
Name: Cabin, Length: 186, dtype: int64


For <b>"Cabin"</b>, we keep only the first letter, which indicates the class of the cabin:

In [14]:
def extract_first_letter_except_unknown(x):
    # Keep the 'Unknown' placeholder; otherwise return the leading letter.
    if x == 'Unknown':
        return x
    return x[0]

df['Cabin'] = df['Cabin'].apply(extract_first_letter_except_unknown)

Now we check the range again:

In [15]:
print(df['Cabin'].value_counts())

Unknown    1013
C            94
B            63
D            46
E            41
A            22
F            21
G             5
T             1
Name: Cabin, dtype: int64
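
The same first-letter extraction can be expressed without a row-wise `apply`, using the `.str` accessor together with `Series.where`. This is a sketch of an alternative, not the method used in this notebook; `cabin` and `deck` are hypothetical names and the values are a small sample.

```python
import pandas as pd

# Hypothetical sample of cabin values after null imputation.
cabin = pd.Series(['Unknown', 'C23 C25 C27', 'B57 B59 B63 B66', 'G6', 'Unknown'])

# where() keeps entries for which the condition holds ('Unknown')
# and replaces all others with the first character of the string.
deck = cabin.where(cabin == 'Unknown', cabin.str[0])

print(deck.tolist())
```

On a dataframe this would be assigned back with something like `df['Cabin'] = ...`, assuming nulls have already been filled.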


<hr>

## Checkpoint

In [16]:
with open('../Assets/Version 1-4.pickle', 'wb') as file:
    pickle.dump(df, file)
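
After writing a checkpoint, it can be reassuring to confirm that the file loads back unchanged. The sketch below does a round trip in a temporary directory, so no real asset is touched; `toy`, `restored`, `round_trip_ok` and the file name are hypothetical.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical toy data standing in for the cleaned dataframe.
toy = pd.DataFrame({'Cabin': ['Unknown', 'C'], 'Embarked': ['S', 'C']})

# Write to a temporary directory and read the file straight back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'checkpoint.pickle')  # hypothetical file name
    with open(path, 'wb') as file:
        pickle.dump(toy, file)
    with open(path, 'rb') as file:
        restored = pickle.load(file)

# equals() compares values, index, columns and dtypes.
round_trip_ok = toy.equals(restored)
print(round_trip_ok)
```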