# Steps performed:
* [1. Merge the train and test datasets](#1.-Merge-the-train-and-test-datasets)
* [2. Binarize attribute 'Sex'](#2.-Binarize-attribute-'Sex')
* [3. Impute the mean for null values and drop decimal places from attribute 'Age'](#3.-Impute-the-mean-for-null-values-and-drop-decimal-places-from-attribute-'Age')
* [4. Drop attributes 'Name', 'Ticket' and 'Cabin'](#4.-Drop-attributes-'Name',-'Ticket'-and-'Cabin')
* [5. One-hot encode attribute 'Embarked' and remove rows with null values](#5.-One-hot-encode-attribute-'Embarked'-and-remove-rows-with-null-values)
* [6. Impute the mean for null values and drop decimal places from attribute 'Fare'](#6.-Impute-the-mean-for-null-values-and-drop-decimal-places-from-attribute-'Fare')
* [7. PCA on attributes 'SibSp' and 'Parch'](#7.-PCA-on-attributes-'SibSp'-and-'Parch')

# Load libraries

In [1]:
import pandas as pd
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings("ignore")

# Read the dataset

**survival:** Survival (0 = No, 1 = Yes) <br>
**pclass:** Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd) <br>
**sex:** Sex <br>
**age:** Age in years <br>
**sibsp:** # of siblings / spouses aboard the Titanic <br>
**parch:** # of parents / children aboard the Titanic <br>
**ticket:** Ticket number <br>
**fare:** Passenger fare <br>
**cabin:** Cabin number <br>
**embarked:** Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

***

#### Additional Info:
**pclass:** A proxy for socio-economic status (SES)<br>
1st = Upper<br>
2nd = Middle<br>
3rd = Lower

**age:** Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.

**sibsp:** The dataset defines family relations in this way...<br>
Sibling = brother, sister, stepbrother, stepsister<br>
Spouse = husband, wife (mistresses and fiancés were ignored)

**parch:** The dataset defines family relations in this way...<br>
Parent = mother, father<br>
Child = daughter, son, stepdaughter, stepson<br>
Some children travelled only with a nanny, therefore parch=0 for them.

In [2]:
dataset_train = pd.read_csv('dados/train.csv')

dataset_test = pd.read_csv('dados/test.csv')

In [3]:
dataset_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# Data preprocessing

### 1. Merge the train and test datasets

In [4]:
dataset = pd.concat([dataset_train, dataset_test])
dataset.head()

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket
0,22.0,,S,7.25,"Braund, Mr. Owen Harris",0,1,3,male,1,0.0,A/5 21171
1,38.0,C85,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,2,1,female,1,1.0,PC 17599
2,26.0,,S,7.925,"Heikkinen, Miss. Laina",0,3,3,female,0,1.0,STON/O2. 3101282
3,35.0,C123,S,53.1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,4,1,female,1,1.0,113803
4,35.0,,S,8.05,"Allen, Mr. William Henry",0,5,3,male,0,0.0,373450
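A note on the output above: `Survived` appears as `0.0`/`1.0` because the test set has no such column, so `pd.concat` fills it with `NaN` for those rows and the column becomes float. The minimal sketch below uses two made-up toy frames (hypothetical values, not the Titanic files) to show this. It also shows `ignore_index=True`, an optional argument that this notebook does not use, which avoids repeated index labels after stacking.

```python
import pandas as pd

# Toy stand-ins for the train and test sets (hypothetical values)
toy_train = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1]})
toy_test = pd.DataFrame({'PassengerId': [3, 4]})

# Default concat keeps each frame's own index, so labels repeat
stacked = pd.concat([toy_train, toy_test])

# ignore_index=True renumbers the rows from 0 to n-1
renumbered = pd.concat([toy_train, toy_test], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
# 'Survived' is missing for the test rows, so the column is stored as float
print(renumbered['Survived'].dtype)
```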


### 2. Binarize attribute 'Sex'

In [5]:
dataset.Sex.unique()

array(['male', 'female'], dtype=object)

In [6]:
dataset.loc[dataset['Sex'] == 'male', 'Sex'] = 0
dataset.loc[dataset['Sex'] == 'female', 'Sex'] = 1

dataset.Sex.unique()

array([0, 1])
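The same encoding can also be written as a single dictionary lookup. The sketch below is an alternative on a made-up toy frame, not the code this notebook ran; the `toy` name and its values are assumptions for illustration.

```python
import pandas as pd

# Toy frame with the two categories seen in the dataset
toy = pd.DataFrame({'Sex': ['male', 'female', 'female']})

# Map each label to its binary code in one step
toy['Sex'] = toy['Sex'].map({'male': 0, 'female': 1})

print(toy['Sex'].tolist())
```

Any label missing from the dictionary would become `NaN`, so checking `unique()` first, as done above, is still worthwhile.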

### 3. Impute the mean for null values and drop decimal places from attribute 'Age'

In [7]:
dataset.Age.unique()

array([ 22.  ,  38.  ,  26.  ,  35.  ,    nan,  54.  ,   2.  ,  27.  ,
        14.  ,   4.  ,  58.  ,  20.  ,  39.  ,  55.  ,  31.  ,  34.  ,
        15.  ,  28.  ,   8.  ,  19.  ,  40.  ,  66.  ,  42.  ,  21.  ,
        18.  ,   3.  ,   7.  ,  49.  ,  29.  ,  65.  ,  28.5 ,   5.  ,
        11.  ,  45.  ,  17.  ,  32.  ,  16.  ,  25.  ,   0.83,  30.  ,
        33.  ,  23.  ,  24.  ,  46.  ,  59.  ,  71.  ,  37.  ,  47.  ,
        14.5 ,  70.5 ,  32.5 ,  12.  ,   9.  ,  36.5 ,  51.  ,  55.5 ,
        40.5 ,  44.  ,   1.  ,  61.  ,  56.  ,  50.  ,  36.  ,  45.5 ,
        20.5 ,  62.  ,  41.  ,  52.  ,  63.  ,  23.5 ,   0.92,  43.  ,
        60.  ,  10.  ,  64.  ,  13.  ,  48.  ,   0.75,  53.  ,  57.  ,
        80.  ,  70.  ,  24.5 ,   6.  ,   0.67,  30.5 ,   0.42,  34.5 ,
        74.  ,  22.5 ,  18.5 ,  67.  ,  76.  ,  26.5 ,  60.5 ,  11.5 ,
         0.33,   0.17,  38.5 ])

In [8]:
# Compute the mean age over the non-null values, rounded to the nearest integer
media_age = round(dataset.loc[~dataset['Age'].isnull(), 'Age'].mean())
media_age

30

In [9]:
# Impute the mean into the null values
dataset.loc[dataset['Age'].isnull(), 'Age'] = media_age

In [10]:
# Drop decimal places (astype(int) truncates, so ages below 1 become 0)
dataset.Age = dataset.Age.astype(int)

dataset.Age.unique()

array([22, 38, 26, 35, 30, 54,  2, 27, 14,  4, 58, 20, 39, 55, 31, 34, 15,
       28,  8, 19, 40, 66, 42, 21, 18,  3,  7, 49, 29, 65,  5, 11, 45, 17,
       32, 16, 25,  0, 33, 23, 24, 46, 59, 71, 37, 47, 70, 12,  9, 36, 51,
       44,  1, 61, 56, 50, 62, 41, 52, 63, 43, 60, 10, 64, 13, 48, 53, 57,
       80,  6, 74, 67, 76])
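The three cells above can be condensed into one `fillna` call. The following is a minimal sketch on a made-up toy series (hypothetical values) that also makes the truncation visible: `28.5` becomes `28` and `0.83` becomes `0`, which is why `0` appears among the unique ages.

```python
import pandas as pd

# Toy ages: a whole number, an estimated age, an infant, and a missing value
ages = pd.Series([22.0, 28.5, 0.83, None])

# mean() skips NaN by default; round to the nearest integer as in the notebook
mean_age = int(round(ages.mean()))

# Fill the gap, then truncate the decimals
filled = ages.fillna(mean_age).astype(int)

print(mean_age)
print(filled.tolist())
```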

### 4. Drop attributes 'Name', 'Ticket' and 'Cabin'

In [11]:
dataset.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

dataset.head()

Unnamed: 0,Age,Embarked,Fare,Parch,PassengerId,Pclass,Sex,SibSp,Survived
0,22,S,7.25,0,1,3,0,1,0.0
1,38,C,71.2833,0,2,1,1,1,1.0
2,26,S,7.925,0,3,3,1,0,1.0
3,35,S,53.1,0,4,1,1,1,1.0
4,35,S,8.05,0,5,3,0,0,0.0


### 5. One-hot encode attribute 'Embarked' and remove rows with null values

In [12]:
dataset.Embarked.unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [13]:
len(dataset)

1309

In [14]:
# Remove the 2 rows whose 'Embarked' value is null
dataset = dataset.loc[~dataset['Embarked'].isnull()]
len(dataset)

1307
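Before dropping rows, it can help to count the nulls per column. The sketch below shows that check, plus `dropna(subset=...)` as an equivalent way to write the filter, on a made-up toy frame; the frame and its values are assumptions, not the notebook's data.

```python
import pandas as pd

# Toy frame with one null in each column (hypothetical values)
toy = pd.DataFrame({'Embarked': ['S', None, 'C'],
                    'Fare': [7.25, 80.0, None]})

# Number of nulls in each column
null_counts = toy.isnull().sum()

# Keep only the rows where 'Embarked' is present; nulls elsewhere are kept
kept = toy.dropna(subset=['Embarked'])

print(null_counts)
print(len(kept))
```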

In [15]:
dummies_embarked = pd.get_dummies(dataset['Embarked'], prefix='Embarked')
dummies_embarked.head()

Unnamed: 0,Embarked_C,Embarked_Q,Embarked_S
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1


In [16]:
# Join the one-hot encoded attribute to the dataset
dataset = pd.concat([dataset, dummies_embarked], axis=1)

dataset.head()

Unnamed: 0,Age,Embarked,Fare,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Embarked_C,Embarked_Q,Embarked_S
0,22,S,7.25,0,1,3,0,1,0.0,0,0,1
1,38,C,71.2833,0,2,1,1,1,1.0,1,0,0
2,26,S,7.925,0,3,3,1,0,1.0,0,0,1
3,35,S,53.1,0,4,1,1,1,1.0,0,0,1
4,35,S,8.05,0,5,3,0,0,0.0,0,0,1


In [17]:
# Remove the attribute 'Embarked', since it is already in the dataset in one-hot encoded form
dataset.drop(['Embarked'], axis=1, inplace=True)

dataset.head()

Unnamed: 0,Age,Fare,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Embarked_C,Embarked_Q,Embarked_S
0,22,7.25,0,1,3,0,1,0.0,0,0,1
1,38,71.2833,0,2,1,1,1,1.0,1,0,0
2,26,7.925,0,3,3,1,0,1.0,0,0,1
3,35,53.1,0,4,1,1,1,1.0,0,0,1
4,35,8.05,0,5,3,0,0,0.0,0,0,1
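The encode, join and drop cells above can be collapsed into one call by passing `columns=` to `pd.get_dummies`. This is a sketch on a made-up toy frame, not the notebook's own code. Recent pandas versions return booleans from `get_dummies` by default, so `dtype=int` is passed here to reproduce the 0/1 values shown in the outputs.

```python
import pandas as pd

# Toy frame with one row per port (hypothetical values)
toy = pd.DataFrame({'Age': [22, 38, 26], 'Embarked': ['S', 'C', 'Q']})

# Replace 'Embarked' with one indicator column per category
encoded = pd.get_dummies(toy, columns=['Embarked'], dtype=int)

print(list(encoded.columns))
print(encoded.iloc[0].tolist())
```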


### 6. Impute the mean for null values and drop decimal places from attribute 'Fare'

In [18]:
dataset.Fare.unique()

array([   7.25  ,   71.2833,    7.925 ,   53.1   ,    8.05  ,    8.4583,
         51.8625,   21.075 ,   11.1333,   30.0708,   16.7   ,   26.55  ,
         31.275 ,    7.8542,   16.    ,   29.125 ,   13.    ,   18.    ,
          7.225 ,   26.    ,    8.0292,   35.5   ,   31.3875,  263.    ,
          7.8792,    7.8958,   27.7208,  146.5208,    7.75  ,   10.5   ,
         82.1708,   52.    ,    7.2292,   11.2417,    9.475 ,   21.    ,
         41.5792,   15.5   ,   21.6792,   17.8   ,   39.6875,    7.8   ,
         76.7292,   61.9792,   27.75  ,   46.9   ,   83.475 ,   27.9   ,
         15.2458,    8.1583,    8.6625,   73.5   ,   14.4542,   56.4958,
          7.65  ,   29.    ,   12.475 ,    9.    ,    9.5   ,    7.7875,
         47.1   ,   15.85  ,   34.375 ,   61.175 ,   20.575 ,   34.6542,
         63.3583,   23.    ,   77.2875,    8.6542,    7.775 ,   24.15  ,
          9.825 ,   14.4583,  247.5208,    7.1417,   22.3583,    6.975 ,
          7.05  ,   14.5   ,   15.0458,   26.2833, 

In [19]:
# Compute the mean of 'Fare' over the non-null values, rounded to the nearest integer
media_fare = round(dataset.loc[~dataset['Fare'].isnull(), 'Fare'].mean())
media_fare

33

In [20]:
# Impute the mean into the null values
dataset.loc[dataset['Fare'].isnull(), 'Fare'] = media_fare

In [21]:
# Drop decimal places (astype(int) truncates rather than rounds)
dataset.Fare = dataset.Fare.astype(int)

dataset.Fare.unique()

array([  7,  71,  53,   8,  51,  21,  11,  30,  16,  26,  31,  29,  13,
        18,  35, 263,  27, 146,  10,  82,  52,   9,  41,  15,  17,  39,
        76,  61,  46,  83,  73,  14,  56,  12,  47,  34,  20,  63,  23,
        77,  24, 247,  22,   6,  79,  36,  66,  69,  55,  25,  33,  28,
         0,  50, 113,  90,  86, 512, 153, 135,  19,  78,  91, 151, 110,
       108, 262, 164, 134,  57, 133,  75, 211,   4, 227, 120,  32,  81,
        89,  38,  49,  59,  93, 221, 106,  40,  42,  65,  37,   5,   3,
        60, 136,  45])

### 7. PCA on attributes 'SibSp' and 'Parch'

In [22]:
pca = PCA(n_components=1)
pca_sibsp_parch = pca.fit_transform(dataset[['SibSp', 'Parch']])
pca_sibsp_parch

array([[ 0.22262695],
       [ 0.22262695],
       [-0.62776265],
       ..., 
       [-0.62776265],
       [-0.62776265],
       [ 0.74878047]])

In [23]:
pd_pca_sibsp_parch = pd.DataFrame(pca_sibsp_parch, columns=['pca_sibsp_parch'])
pd_pca_sibsp_parch.head()

Unnamed: 0,pca_sibsp_parch
0,0.222627
1,0.222627
2,-0.627763
3,0.222627
4,-0.627763


In [24]:
# Join the PCA component to the dataset
# (the index is reset first so the rows align by position)
dataset.reset_index(drop=True, inplace=True)
dataset = pd.concat([dataset, pd_pca_sibsp_parch], axis=1)

# Remove the attributes 'SibSp' and 'Parch'
dataset.drop(['SibSp', 'Parch'], axis=1, inplace=True)

dataset.head()

Unnamed: 0,Age,Fare,PassengerId,Pclass,Sex,Survived,Embarked_C,Embarked_Q,Embarked_S,pca_sibsp_parch
0,22,7,1,3,0,0.0,0,0,1,0.222627
1,38,71,2,1,1,1.0,1,0,0,0.222627
2,26,7,3,3,1,1.0,0,0,1,-0.627763
3,35,53,4,1,1,1.0,0,0,1,0.222627
4,35,8,5,3,0,0.0,0,0,1,-0.627763
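Reducing two columns to one component discards some information. The sketch below shows one way to check how much variance the single component keeps, via `explained_variance_ratio_`. The notebook does not run this check, and the toy array is made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy (SibSp, Parch) pairs (hypothetical values)
toy = np.array([[1, 0], [1, 0], [0, 0], [1, 2], [3, 1], [0, 0]], dtype=float)

pca = PCA(n_components=1)
component = pca.fit_transform(toy)

# Fraction of the total variance captured by the first component
ratio = pca.explained_variance_ratio_[0]

print(component.shape)
print(ratio)
```

PCA centers the data before projecting, which is why the resulting column mixes positive and negative values, as in the output above.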


### Split and save the cleaned train and test data

In [25]:
len(dataset_test)

418

In [26]:
len(dataset)

1307

In [27]:
len(dataset) - len(dataset_test)

889

In [28]:
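# Positional split: assumes every row dropped in step 5 came from the train set
n_train = len(dataset) - len(dataset_test)
train_clean = dataset[:n_train].copy()
test_clean = dataset[n_train:].copy()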

In [29]:
len(test_clean)

418

In [30]:
# Cast the class attribute 'Survived' back to int (it became float when the datasets were merged)
train_clean = train_clean.assign(Survived=train_clean['Survived'].astype(int))

In [31]:
train_clean.to_csv('dados/train_clean.csv', sep=';', index=False)
test_clean.to_csv('dados/test_clean.csv', sep=';', index=False)
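The split above depends on row counts. A possible alternative, sketched below on a made-up toy frame, separates the rows by whether `Survived` is present, since the test rows have no label after merging. This is not the method the notebook uses, and it assumes that every train row has a `Survived` value.

```python
import pandas as pd

# Toy merged frame: two labeled train rows, two unlabeled test rows
combined = pd.DataFrame({'PassengerId': [1, 2, 3, 4],
                         'Survived': [0.0, 1.0, None, None]})

# Rows with a label are treated as train, the rest as test
is_train = combined['Survived'].notna()
train_part = combined[is_train].copy()
test_part = combined[~is_train].drop(columns='Survived')

# With no NaN left, the label can be stored as int
train_part['Survived'] = train_part['Survived'].astype(int)

print(train_part['Survived'].tolist())
print(list(test_part.columns))
```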