# Data Modelling
---
Possible data treatments for each attribute type were described in the previous notebook. This notebook walks through each step of the methods contained in data_treatment.py.

The appropriate treatment for each feature depends on the model used. Some of these treatments can be found in data_treatment. All treatments used here were discussed in Titanic 1.
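
As a small illustration of how the treatment can depend on the model, the sketch below encodes the same categorical column in two ways. The toy column and the choice of one-hot encoding as the alternative are assumptions for illustration; this notebook itself only uses label encoding.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for a categorical feature such as Embarked.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Label encoding: one integer per class (classes are sorted alphabetically).
label_codes = LabelEncoder().fit_transform(toy["Embarked"])

# One-hot encoding: one indicator column per class, with no implied ordering.
one_hot = pd.get_dummies(toy["Embarked"], prefix="Embarked")

print(label_codes)
print(one_hot.columns.tolist())
```

Integer codes are usually acceptable for tree-based models, while models that read the codes as magnitudes (for example linear models) may be better served by the indicator columns.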

The treatments are illustrated on the training set, but they also apply to the test set, with some restrictions.
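
The text does not spell out the restrictions. One common one, assumed here, is that encoders and imputation statistics are fitted on the training set only and then reused on the test set. A minimal sketch with hypothetical toy frames:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames standing in for the training and test sets (hypothetical values).
train = pd.DataFrame({"Sex": ["male", "female", "male"], "Age": [20.0, 30.0, None]})
test = pd.DataFrame({"Sex": ["female", "male"], "Age": [None, 40.0]})

# Fit the encoder on the training set only, then reuse it on the test set.
sex_le = LabelEncoder()
train["Sex"] = sex_le.fit_transform(train["Sex"])
test["Sex"] = sex_le.transform(test["Sex"])

# Compute the mean age per gender on the training set only.
age_means = train.groupby("Sex")["Age"].mean()

# Fill missing ages in both sets with the training means.
train["Age"] = train["Age"].fillna(train["Sex"].map(age_means))
test["Age"] = test["Age"].fillna(test["Sex"].map(age_means))
```

Note that `LabelEncoder.transform` raises a `ValueError` for labels it did not see during fitting, so a category that appears only in the test set needs separate handling.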

In the model training notebooks, all treatments are encapsulated in a method of data_treatment.py. These treatments correspond to what is shown here, unless stated otherwise.

In [1]:
#Imports 
import pandas as pd

from sklearn.preprocessing import LabelEncoder

In [2]:
#Load data
data_path = "./data/train.csv"

df = pd.read_csv(data_path)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## PassengerId

In [3]:
#Remove column as it is a unique identifier
df = df.drop(["PassengerId"], axis=1)
df.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Pclass
Pclass is already a numeric field with no missing values, so it needs no treatment.

## Name

In [4]:
#Transform Name feature into one containing only the title
df["Title"] = df["Name"].apply(lambda x: x.split(",")[-1].split()[0])
df.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr.
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs.
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss.
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs.
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr.


In [5]:
#Verify resulting values for Name as it is now categorical
print(df["Title"].value_counts(dropna=False))

Mr.          517
Miss.        182
Mrs.         125
Master.       40
Dr.            7
Rev.           6
Mlle.          2
Col.           2
Major.         2
Jonkheer.      1
Lady.          1
Don.           1
Ms.            1
Sir.           1
Mme.           1
Capt.          1
the            1
Name: Title, dtype: int64


In [6]:
# Verify inconsistency cases
print(df[df["Title"] == "Jonkheer."])

     Survived  Pclass                             Name   Sex   Age  SibSp  \
822         0       1  Reuchlin, Jonkheer. John George  male  38.0      0   

     Parch Ticket  Fare Cabin Embarked      Title  
822      0  19972   0.0   NaN        S  Jonkheer.  
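
The `the` entry in the title counts suggests that taking the first word after the comma breaks on multi-word titles. A possible alternative, not used in this notebook, is to capture everything between the comma and the first period. The second sample name below is assumed to be the row behind the `the` value; the other two appear in the outputs above.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "Reuchlin, Jonkheer. John George",
])

# Capture everything between the comma and the first period after it.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

Unlike the split-based extraction, this variant drops the trailing period from each title, so the resulting categories would be named slightly differently.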


In [7]:
#As Name is a categorical value, encode it with a label encoder.
name_le = LabelEncoder()
df["Title"] = name_le.fit_transform(df["Title"])

In [8]:
#Verify resulting values for Name as it is now encoded
print(df["Title"].value_counts(dropna=False))

11    517
8     182
12    125
7      40
3       7
14      6
1       2
9       2
6       2
4       1
2       1
16      1
5       1
15      1
10      1
13      1
0       1
Name: Title, dtype: int64


Since some classes are poorly represented, one possible approach would be to merge all classes with few examples into a single class. This approach is not investigated here.
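
For reference, merging rare classes could look like the sketch below. It would have to run on the string titles, before label encoding; the threshold `min_count` and the label `"Rare"` are hypothetical choices, and the toy values are not taken from the dataset.

```python
import pandas as pd

# Toy title column (hypothetical values), still in string form.
titles = pd.Series(["Mr.", "Mr.", "Mr.", "Miss.", "Miss.", "Dr.", "Rev."])

min_count = 2  # hypothetical threshold for a class to be kept as-is
counts = titles.value_counts()
rare = counts[counts < min_count].index

# Keep frequent titles and replace every rare one with a single shared label.
grouped = titles.where(~titles.isin(rare), "Rare")
print(grouped.value_counts())
```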

In [9]:
#Drop original Name column
df = df.drop(["Name"], axis=1)

## Sex

In [10]:
#Verify feature before transformation
print(df["Sex"].value_counts(dropna=False))

male      577
female    314
Name: Sex, dtype: int64


In [11]:
#Categorical feature without missing values, use label encoder.
sex_le = LabelEncoder()
df["Sex"] = sex_le.fit_transform(df["Sex"])

In [12]:
#Verify feature after transformation
print(df["Sex"].value_counts(dropna=False))

1    577
0    314
Name: Sex, dtype: int64


## Age

In [13]:
#Fill missing values with the mean age for each gender
df["Age"] = df.groupby("Sex")["Age"].transform(lambda x: x.fillna(x.mean()))

In [14]:
#Verify if no missing values are left
print(df["Age"].isna().sum())

0


## SibSp, Parch and Fare
These are already numeric fields with no missing values, so no treatment is applied.

## Cabin

In [15]:
#Remove feature as it has too many missing values
df = df.drop(["Cabin"], axis=1)
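
To make "too many missing values" concrete, the fraction of missing entries per column can be measured before deciding what to drop. The sketch below uses a toy frame and a hypothetical threshold `max_missing`; neither comes from this notebook.

```python
import pandas as pd

# Toy frame (hypothetical rows) with a sparsely filled Cabin column.
toy = pd.DataFrame({
    "Cabin": [None, "C85", None, "C123", None],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Fraction of missing values per column.
missing_ratio = toy.isna().mean()

max_missing = 0.5  # hypothetical cut-off above which a column is dropped
to_drop = missing_ratio[missing_ratio > max_missing].index.tolist()
toy = toy.drop(columns=to_drop)
print(to_drop)
```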

## Ticket

In [16]:
#Remove feature due to value inconsistency
df = df.drop(["Ticket"], axis=1)

## Embarked

In [17]:
print(df.head())
print(df["Embarked"].value_counts())
#Fill missing values with mode depending on passenger class
df["Embarked"] = df.groupby("Pclass")["Embarked"].transform(lambda x: x.fillna(x.value_counts().idxmax()))

   Survived  Pclass  Sex   Age  SibSp  Parch     Fare Embarked  Title
0         0       3    1  22.0      1      0   7.2500        S     11
1         1       1    0  38.0      1      0  71.2833        C     12
2         1       3    0  26.0      0      0   7.9250        S      8
3         1       1    0  35.0      1      0  53.1000        S     12
4         0       3    1  35.0      0      0   8.0500        S     11
S    644
C    168
Q     77
Name: Embarked, dtype: int64


In [18]:
#Label encode port
print(df["Embarked"].value_counts(dropna=False))
port_le = LabelEncoder()
df["Embarked"] = port_le.fit_transform(df["Embarked"])

S    646
C    168
Q     77
Name: Embarked, dtype: int64


In [19]:
#Final Dataset
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Title
0,0,3,1,22.0,1,0,7.25,2,11
1,1,1,0,38.0,1,0,71.2833,0,12
2,1,3,0,26.0,0,0,7.925,2,8
3,1,1,0,35.0,1,0,53.1,2,12
4,0,3,1,35.0,0,0,8.05,2,11
