# Titanic: Machine Learning from Disaster
## CRISP-DM: Modeling
**Author:** Wanderson Marques - wdsmarques@gmail.com

This Jupyter Notebook covers the choice of **model and parameters** for the Titanic dataset. Modeling is the fourth phase of the CRISP-DM methodology.

<img src="imgs/modeling.jpg" />

### Load libraries

In [1]:
import pandas as pd
import joblib
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

### Load dataset

At this point, the dataset has already been preprocessed.

In [2]:
dataset = pd.read_csv('datasets/train-preprocessado.csv')
dataset.head()

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Sex_female,Embarked_C,Embarked_Q,Survived
0,0.802787,-0.9864448,-0.471492,-0.47646,-0.376645,1.367833,2.136001,-0.301775,0
1,-0.405272,-0.1570346,0.491588,-0.47646,-0.151489,1.367833,2.136001,-0.301775,1
2,0.802787,2.678779e-16,-0.471492,-0.47646,-0.531501,-0.731083,-0.468165,-0.301775,0
3,0.802787,-0.760242,-0.471492,-0.47646,-0.485487,-0.731083,-0.468165,-0.301775,0
4,-0.405272,2.678779e-16,-0.471492,-0.47646,-0.717819,-0.731083,-0.468165,-0.301775,0


In [3]:
X = dataset.drop(['Survived'], axis=1)
y = dataset['Survived']

### Test models and parameters

Experiments to find the predictive model and parameters that best explain the relationship between the predictor variables and the target variable.

In [4]:
# Dictionary of candidate classifiers
modelos = {
    'KNN': KNeighborsClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Neural Network': MLPClassifier(),
    'Naives Bayes': GaussianNB()
}

# List of parameter grids, one dictionary per classifier, in the same order as `modelos`. GridSearchCV tries every combination
parametros = [
    # Note: with the default p=2, 'minkowski' is the same distance as 'euclidean'
    {'n_neighbors': [3, 5, 7], 'metric': ['euclidean', 'minkowski', 'manhattan']},
    {'n_estimators': [50, 100, 200], 'min_samples_split': [2, 5, 10]},
    {'hidden_layer_sizes': [25, 50, 100, (25,25), (50,50), (100, 100)], 'activation': ['logistic', 'relu']},
    {}
]
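
To see how many candidates each grid expands to before running the search, the grids can be enumerated with scikit-learn's `ParameterGrid`, which is the expansion `GridSearchCV` uses internally. This is a minimal sketch that restates the grids above so the block is self-contained; the names `grids` and `n_candidates` are introduced here for illustration only.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative copy of the grids above, keyed by model name
grids = {
    'KNN': {'n_neighbors': [3, 5, 7], 'metric': ['euclidean', 'minkowski', 'manhattan']},
    'Random Forest': {'n_estimators': [50, 100, 200], 'min_samples_split': [2, 5, 10]},
    'Neural Network': {'hidden_layer_sizes': [25, 50, 100, (25, 25), (50, 50), (100, 100)],
                       'activation': ['logistic', 'relu']},
    'Naives Bayes': {},
}

# Number of parameter combinations per model; an empty grid yields one candidate
n_candidates = {nome: len(ParameterGrid(grid)) for nome, grid in grids.items()}
for nome, n in n_candidates.items():
    print(nome, n, "candidates ->", n * 10, "fits with cv=10")
```

The number of fits is the number of candidates times the number of folds, which is a quick way to estimate the cost of a grid before launching it.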

In [5]:
results = []
# Pair each model with its parameter grid (both are declared in the same order)
for (nome, modelo), grid in zip(modelos.items(), parametros):
    print("Treinando ", nome)

    # Search the parameter combinations of each model with 10-fold cross-validation
    gs = GridSearchCV(modelo, grid, scoring='accuracy', n_jobs=-1, verbose=10, cv=10)
    gs.fit(X, y)
    results.append([nome, gs.best_params_, gs.best_score_])

resultados = pd.DataFrame(results, columns=['Modelo', 'Melhor Parâmetros', 'Score'])

Treinando  KNN
Fitting 10 folds for each of 9 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:   13.5s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   13.6s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   13.6s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:   13.6s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.1918s.) Setting batch_size=2.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   13.7s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0480s.) Setting batch_size=16.
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:   14.0s finished


Treinando  Random Forest
Fitting 10 folds for each of 9 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:   14.6s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   14.8s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   15.4s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:   16.2s
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   17.3s
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   17.9s
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed:   19.2s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   20.5s
[Parallel(n_jobs=-1)]: Done  77 tasks      | elapsed:   21.2s
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:   22.7s finished


Treinando  Neural Network
Fitting 10 folds for each of 12 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:   14.4s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   14.9s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   16.9s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:   18.4s
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   19.9s
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   21.4s
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed:   23.7s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   26.6s
[Parallel(n_jobs=-1)]: Done  77 tasks      | elapsed:   28.8s
[Parallel(n_jobs=-1)]: Done  90 tasks      | elapsed:   31.9s
[Parallel(n_jobs=-1)]: Done 105 tasks      | elapsed:   37.2s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:   45.2s finished


Treinando  Naives Bayes
Fitting 10 folds for each of 1 candidates, totalling 10 fits


[Parallel(n_jobs=-1)]: Done   5 out of  10 | elapsed:   14.8s remaining:   14.8s
[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:   14.8s remaining:    6.3s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:   14.8s finished
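
The log above only reports progress. The per-candidate scores are stored in the fitted search object's `cv_results_` attribute. The sketch below shows how the mean and the fold-to-fold spread of each candidate could be inspected. It uses synthetic data from `make_classification` as a stand-in for the Titanic features, and the names `X_demo`, `y_demo`, `gs_demo` and `cv_table` are assumptions made for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed Titanic data (not the real dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

gs_demo = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [3, 5, 7]},
                       scoring='accuracy', cv=5)
gs_demo.fit(X_demo, y_demo)

# One row per candidate: mean and standard deviation of accuracy across folds
cv_table = pd.DataFrame(gs_demo.cv_results_)[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
].sort_values('rank_test_score')
print(cv_table)
```

Comparing `std_test_score` between candidates helps judge whether a small difference in mean accuracy is meaningful or within fold-to-fold noise.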


### Check results (using accuracy)

In [6]:
resultados.sort_values('Score', ascending=False)

Unnamed: 0,Modelo,Melhor Parâmetros,Score
1,Random Forest,"{'min_samples_split': 10, 'n_estimators': 200}",0.873737
2,Neural Network,"{'activation': 'relu', 'hidden_layer_sizes': (...",0.84596
0,KNN,"{'metric': 'manhattan', 'n_neighbors': 3}",0.839646
3,Naives Bayes,{},0.808081


### Train the best model with the best parameters

In [7]:
model = RandomForestClassifier(min_samples_split=10, n_estimators=200)
model.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=10,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
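
As an alternative to retyping the winning parameters by hand, `GridSearchCV` refits the best candidate on all the data passed to `fit` (its `refit` parameter defaults to `True`) and exposes it as `best_estimator_`. The sketch below illustrates this on synthetic data. The fixed `random_state` and the smaller grid are assumptions made to keep the example quick and repeatable; they are not taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed Titanic data (not the real dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

gs_demo = GridSearchCV(
    RandomForestClassifier(random_state=42),  # seed fixed only for repeatability
    {'n_estimators': [10, 20], 'min_samples_split': [2, 10]},
    scoring='accuracy', cv=3,
)
gs_demo.fit(X_demo, y_demo)

# Already fitted on all of X_demo with the winning parameters (refit=True by default)
best_model = gs_demo.best_estimator_
print(gs_demo.best_params_)
print(best_model.predict(X_demo[:5]))
```

Taking the estimator from the search object avoids copy errors between the results table and the training cell.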

### Check performance on the training set

In [8]:
y_pred = model.predict(X)

In [9]:
print(classification_report(y, y_pred))

             precision    recall  f1-score   support

          0       0.90      0.96      0.93       396
          1       0.96      0.89      0.93       396

avg / total       0.93      0.93      0.93       792
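
The report above is computed on the same rows the model was fitted on, so it tends to be optimistic: roughly 0.93 here, against a cross-validated accuracy of about 0.87 in the grid search. One possible way to get a less biased report is to use out-of-fold predictions from `cross_val_predict`. The sketch below runs on synthetic data; `X_demo`, `y_demo`, `rf_demo` and the fixed seeds are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the preprocessed Titanic data (not the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
rf_demo = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=42)

# Each prediction comes from a model that never saw that row during fitting
y_oof = cross_val_predict(rf_demo, X_demo, y_demo, cv=5)
print(classification_report(y_demo, y_oof))

# For comparison: accuracy when predicting the same rows used for fitting
train_acc = accuracy_score(y_demo, rf_demo.fit(X_demo, y_demo).predict(X_demo))
oof_acc = accuracy_score(y_demo, y_oof)
print("train accuracy:", train_acc, "out-of-fold accuracy:", oof_acc)
```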



### Save the trained predictive model

In [10]:
joblib.dump(model, filename='models/model.pkl')

['models/model.pkl']
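
A saved model can be read back with `joblib.load`. The sketch below is a round trip on synthetic data that checks the restored model reproduces the original predictions. A temporary directory stands in for the `models/` folder, and `model_demo`, `restored` and `same_predictions` are names assumed for this example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the preprocessed Titanic data (not the real dataset)
X_demo, y_demo = make_classification(n_samples=100, n_features=8, random_state=0)
model_demo = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# Temporary directory used in place of the notebook's models/ folder
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.pkl')
    joblib.dump(model_demo, path)
    restored = joblib.load(path)

# Compare predictions from the original and the restored model
same_predictions = np.array_equal(model_demo.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

A pickled model should generally be loaded with the same scikit-learn version that saved it, so recording that version next to the file is a reasonable precaution.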