<a href="https://colab.research.google.com/github/titocampos/estudo-crm/blob/master/aula_20200207.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Titanic Challenge**


In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 12)
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

To use the Kaggle API, sign up for a Kaggle account at https://www.kaggle.com. Then go to the 'Account' tab of your user profile (https://www.kaggle.com/<username>/account) and select 'Create API Token'. This triggers the download of kaggle.json, a file that contains your API credentials. Save this file to your local disk, then run the cell below to upload it to Colab.

In [3]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"titocampos","key":"<redacted>"}'}

After uploading the API file, run the commands below to download the challenge files. You must already have accepted the challenge rules at <https://www.kaggle.com/c/titanic/rules>.


In [4]:
!pip install -U -q kaggle
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json
!kaggle competitions download -c titanic
!rm -rf ~/.kaggle/

Downloading train.csv to /content
  0% 0.00/59.8k [00:00<?, ?B/s]
100% 59.8k/59.8k [00:00<00:00, 23.1MB/s]
Downloading gender_submission.csv to /content
  0% 0.00/3.18k [00:00<?, ?B/s]
100% 3.18k/3.18k [00:00<00:00, 2.77MB/s]
Downloading test.csv to /content
  0% 0.00/28.0k [00:00<?, ?B/s]
100% 28.0k/28.0k [00:00<00:00, 26.3MB/s]
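
The shell commands above place kaggle.json in `~/.kaggle` and restrict it to owner-only access before calling the CLI. The same step can be sketched in plain Python. The helper name `install_kaggle_token`, the throwaway directory, and the dummy token below are assumptions made for illustration, not part of the notebook.

```python
import json
import os
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(token_path, home_dir):
    """Copy a kaggle.json token into <home_dir>/.kaggle with owner-only access."""
    target_dir = Path(home_dir) / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    target.write_text(Path(token_path).read_text())
    os.chmod(target, 0o600)  # same effect as `chmod 600`
    return target


# Demo with a throwaway directory and a dummy token (not real credentials).
demo_home = tempfile.mkdtemp()
dummy_token = Path(demo_home) / "downloaded_kaggle.json"
dummy_token.write_text(json.dumps({"username": "example", "key": "not-a-real-key"}))

installed = install_kaggle_token(dummy_token, demo_home)
mode = stat.S_IMODE(installed.stat().st_mode)
print(installed, oct(mode))
```

In Colab, `home_dir` would be the real home directory; the sketch uses a temporary one so it can run without touching existing credentials.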


Building the baseline submission file

In [0]:
!pwd
!ls -l
df_test = pd.read_csv("test.csv")

df_test.head()

In [0]:
df_test["Survived"] = np.logical_or(df_test["Sex"]=="female", df_test["Age"]<=12) * 1
df_test.head(10)

In [0]:
df_base = df_test[['PassengerId', 'Survived']]
df_base.head(10)

In [0]:
df_base.to_csv("gender_submission.csv", index=False)

Uploading this file to Kaggle should reach an accuracy of 77%.
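
To make the baseline rule concrete, here is a minimal sketch that applies it to a hypothetical five-row table. The rows are invented for illustration and do not come from train.csv. It also shows how a missing `Age` behaves: the comparison evaluates to False, so those rows are decided by `Sex` alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy passengers, not taken from train.csv.
toy = pd.DataFrame({
    "Sex": ["female", "male", "male", "female", "male"],
    "Age": [30.0, 8.0, 40.0, np.nan, np.nan],
    "Survived": [1, 1, 0, 1, 0],
})

# Same rule as the baseline: predict survival for women and for children up to 12.
# A missing Age compares as False, so those rows rely on Sex alone.
pred = np.logical_or(toy["Sex"] == "female", toy["Age"] <= 12) * 1

# Accuracy is the fraction of rows where the rule matches the label.
toy_acc = np.mean(pred == toy["Survived"])
print(pred.tolist(), toy_acc)
```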


## **First model**

Random forest <https://medium.com/machina-sapiens/o-algoritmo-da-floresta-aleat%C3%B3ria-3545f6babdf8>


In [0]:
from sklearn.ensemble import RandomForestClassifier

modelo = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
variaveis = ['Sex_binario', 'Age', 'Pclass']
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

df_train.head(10)

We need to encode the categorical variable Sex as a number, so we create a function for this transformation.

In [0]:
def transformar_sexo(valor):
  if valor == 'female':
    return 1
  else: 
    return 0

In [0]:
df_train['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64
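
The element-wise function `transformar_sexo` can also be written as a single vectorized comparison. The sketch below uses a hypothetical four-value column and checks that both forms give the same encoding.

```python
import pandas as pd

# Hypothetical sample of the Sex column.
sex = pd.Series(["male", "female", "female", "male"])

# Element-wise mapping, equivalent to transformar_sexo.
mapped = sex.map(lambda valor: 1 if valor == "female" else 0)

# Vectorized equivalent: compare the whole column, then cast booleans to integers.
vectorized = (sex == "female").astype(int)

print(mapped.tolist(), vectorized.tolist())
```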

In [0]:
df_train['Sex_binario'] = df_train['Sex'].map(transformar_sexo)
df_test['Sex_binario'] = df_test['Sex'].map(transformar_sexo)
df_train.head(10)

In [0]:
X = df_train[variaveis]
X = X.fillna(-1)
y = df_train['Survived']

modelo.fit(X, y)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

In [0]:

df_prev = df_test[variaveis]
df_prev = df_prev.fillna(-1)

p = modelo.predict(df_prev)
p

Building the submission file

In [0]:
df_sub = pd.Series(p, index=df_test["PassengerId"], name="Survived")
df_sub.to_csv("modelo1.csv", header=True)
!head -n10 modelo1.csv

Uploading this file to Kaggle shows that the accuracy did not improve; it actually dropped to 72.7%.
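
The submission file is a two-column CSV. Building it from a `Series` indexed by `PassengerId` produces the header `PassengerId,Survived`, because the index takes its name from the column passed in. A minimal sketch with hypothetical ids and predictions, written to an in-memory buffer instead of a file:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical ids and predictions, for illustration only.
ids = pd.Series([892, 893, 894], name="PassengerId")
preds = np.array([0, 1, 0])

# The index inherits the name "PassengerId" from the ids Series.
sub = pd.Series(preds, index=ids, name="Survived")

buffer = io.StringIO()
sub.to_csv(buffer, header=True)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```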

## **Training and validation**


In [0]:
from sklearn.model_selection import train_test_split

X_falso = np.arange(10)
X_falso

In [0]:
train_test_split(X_falso, test_size=.40, random_state=0)
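
As a sanity check on what `train_test_split` returns, the sketch below repeats the call on a ten-element array (a stand-in for `X_falso`) and verifies the sizes of the two parts and that no row is lost or duplicated.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(10)  # stand-in for X_falso
train_part, valid_part = train_test_split(X_demo, test_size=0.40, random_state=0)

# 40% of 10 rows go to validation; the remaining 6 go to training.
print(len(train_part), len(valid_part))

# The two parts are disjoint and together cover every row.
combined = sorted(np.concatenate([train_part, valid_part]).tolist())
print(combined)
```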

In [0]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=.50, random_state=0)  # 50% held out because the dataset is small

In [0]:
modelo.fit(X_train, y_train)
p = modelo.predict(X_valid)


Checking the accuracy

In [0]:
np.mean(y_valid == p)

In [0]:
# Evaluating the baseline
p = (X_valid["Sex_binario"]==1).astype(np.int64)
np.mean(y_valid == p)

0.7825112107623319
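
The expression `np.mean(y_valid == p)` is the fraction of positions where prediction and label agree, which is the same quantity that scikit-learn's `accuracy_score` computes. A short sketch with hypothetical labels and predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([1, 0, 0, 1, 1, 0, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0])

manual_acc = np.mean(y_true == y_pred)        # fraction of matching positions
sklearn_acc = accuracy_score(y_true, y_pred)  # same quantity from scikit-learn
print(manual_acc, sklearn_acc)
```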

## **Cross-validation**


In [0]:
from sklearn.model_selection import RepeatedKFold, KFold
kf = KFold(3, shuffle=True, random_state = 0)
for linhas_treino, linhas_valid in kf.split(X_falso):
  print("Train", linhas_treino)
  print("Validation", linhas_valid)
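
Two properties of `KFold` are worth checking on the ten-row example: within each fold the training and validation rows never overlap, and across the folds every row is used for validation exactly once. A minimal sketch, using a stand-in array rather than the Titanic data:

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10)  # stand-in for X_falso
kf_demo = KFold(n_splits=3, shuffle=True, random_state=0)

fold_sizes = []
valid_seen = []
for train_idx, valid_idx in kf_demo.split(X_demo):
    # Within a fold, no row is used for both training and validation.
    assert set(train_idx).isdisjoint(valid_idx)
    fold_sizes.append(len(valid_idx))
    valid_seen.extend(valid_idx.tolist())

print(fold_sizes)          # 10 rows over 3 folds: sizes 4, 3, 3
print(sorted(valid_seen))  # each row is validated exactly once
```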

In [0]:
kf = RepeatedKFold(n_splits=2, n_repeats=10, random_state = 0)
resultados=[]
for linhas_treino, linhas_valid in kf.split(X):
  X_train, X_valid = X.iloc[linhas_treino], X.iloc[linhas_valid]
  y_train, y_valid = y.iloc[linhas_treino], y.iloc[linhas_valid]
  
  modelo = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
  modelo.fit(X_train, y_train)
  p = modelo.predict(X_valid)
  
  acc = np.mean(y_valid == p)
  resultados.append(acc)
  print("Acc:", acc)

np.mean(resultados)

In [0]:
plt.hist(resultados)
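
The manual loop over `RepeatedKFold` can also be expressed with scikit-learn's `cross_val_score`, which fits a fresh clone of the model on each split and returns one score per split. The sketch below uses synthetic data from `make_classification` and a smaller forest as stand-ins; both are assumptions chosen to keep the example fast, not the notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in for X and y (assumption: not the Titanic data).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

cv = RepeatedKFold(n_splits=2, n_repeats=10, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0)

# One accuracy per split: 2 folds x 10 repeats = 20 values.
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
print(len(scores), scores.mean(), scores.std())
```

Reporting the standard deviation next to the mean gives a rough sense of how much of a difference between two feature sets could be noise from the splits.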

## **Second model**

In [0]:
# previous accuracy: 0.77
variaveis = ['Sex_binario', 'Age', 'Pclass', 'SibSp', 'Parch', 'Fare']

X = df_train[variaveis].fillna(-1)
y = df_train['Survived']

In [0]:
kf = RepeatedKFold(n_splits=2, n_repeats=10, random_state = 0)
resultados=[]
for linhas_treino, linhas_valid in kf.split(X):
  X_train, X_valid = X.iloc[linhas_treino], X.iloc[linhas_valid]
  y_train, y_valid = y.iloc[linhas_treino], y.iloc[linhas_valid]
  
  modelo = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
  modelo.fit(X_train, y_train)
  p = modelo.predict(X_valid)
  
  acc = np.mean(y_valid == p)
  resultados.append(acc)
  print("Acc:", acc)

# Print the mean explicitly: only the last expression of a cell is displayed.
print("Mean acc:", np.mean(resultados))
plt.hist(resultados)

## **Retraining the model**


In [0]:
modelo = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
modelo.fit(X, y)

p = modelo.predict(df_test[variaveis].fillna(-1))
p

In [0]:
df_sub = pd.Series(p, index=df_test["PassengerId"], name="Survived")
df_sub.to_csv("modelo2.csv", header=True)
!head -n10 modelo2.csv

## **Candidate features**

In [0]:
df_train['Embarked_S'] = (df_train['Embarked'] == 'S').astype(int)
df_train['Embarked_C'] = (df_train['Embarked'] == 'C').astype(int)
#df_train['Embarked_Q'] = (df_train['Embarked'] == 'Q').astype(int)

df_train['Cabine_nula'] = df_train['Cabin'].isnull().astype(int)

df_train['Miss'] = df_train['Name'].str.contains('Miss').astype(int)
df_train['Mrs'] = df_train['Name'].str.contains('Mrs').astype(int)
df_train['Master'] = df_train['Name'].str.contains('Master').astype(int)
df_train['Col'] = df_train['Name'].str.contains('Col').astype(int)
df_train['Major'] = df_train['Name'].str.contains('Major').astype(int)
df_train['Mr'] = df_train['Name'].str.contains('Mr').astype(int)
df_train
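
One detail of the title flags is that `str.contains` does substring matching, so `'Mr'` also matches names containing `'Mrs'`, and `'Col'` matches any surname that contains those letters. The sketch below shows the effect on hypothetical names, and one possible alternative that extracts the title first. It assumes names follow the "Surname, Title. Given names" layout; the regular expression is an illustration, not the notebook's method.

```python
import pandas as pd

# Hypothetical names in the "Surname, Title. Given names" layout.
names = pd.Series([
    "Doe, Mr. John",
    "Doe, Mrs. Jane",
    "Roe, Miss. Ann",
    "Collins, Master. Tom",
])

# Substring search: 'Mr' also matches 'Mrs', and 'Col' matches the surname 'Collins'.
mr_substring = names.str.contains("Mr").astype(int)
col_substring = names.str.contains("Col").astype(int)

# Alternative sketch: capture the text between the comma and the first period.
title = names.str.extract(r",\s*([^.,]+)\.", expand=False).str.strip()
mr_exact = (title == "Mr").astype(int)

print(mr_substring.tolist(), col_substring.tolist())
print(title.tolist(), mr_exact.tolist())
```

If the flags are changed this way, the same transformation has to be applied to both `df_train` and `df_test` so the model sees consistent features.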

In [0]:
variaveis = ['Sex_binario', 'Age', 'Pclass', 'SibSp', 'Parch', 'Fare', 
             'Embarked_S', 'Embarked_C', 'Cabine_nula', 'Miss', 'Mrs',
             'Master', 'Col', 'Major', 'Mr']

X = df_train[variaveis].fillna(-1)
y = df_train['Survived']

In [0]:
resultados=[]
kf = RepeatedKFold(n_splits=2, n_repeats=10, random_state = 0)
for linhas_treino, linhas_valid in kf.split(X):
  X_train, X_valid = X.iloc[linhas_treino], X.iloc[linhas_valid]
  y_train, y_valid = y.iloc[linhas_treino], y.iloc[linhas_valid]
  
  modelo = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
  modelo.fit(X_train, y_train)
  p = modelo.predict(X_valid)
  
  acc = np.mean(y_valid == p)
  resultados.append(acc)
  print("Acc:", acc)

# Print the mean explicitly: only the last expression of a cell is displayed.
print("Mean acc:", np.mean(resultados))
plt.hist(resultados)

In [0]:
df_test['Embarked_S'] = (df_test['Embarked'] == 'S').astype(int)
df_test['Embarked_C'] = (df_test['Embarked'] == 'C').astype(int)
#df_test['Embarked_Q'] = (df_test['Embarked'] == 'Q').astype(int)

df_test['Cabine_nula'] = df_test['Cabin'].isnull().astype(int)

df_test['Miss'] = df_test['Name'].str.contains('Miss').astype(int)
df_test['Mrs'] = df_test['Name'].str.contains('Mrs').astype(int)
df_test['Master'] = df_test['Name'].str.contains('Master').astype(int)
df_test['Col'] = df_test['Name'].str.contains('Col').astype(int)
df_test['Major'] = df_test['Name'].str.contains('Major').astype(int)
df_test['Mr'] = df_test['Name'].str.contains('Mr').astype(int)

modelo = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
modelo.fit(X, y)

p = modelo.predict(df_test[variaveis].fillna(-1))
p


In [0]:
df_sub = pd.Series(p, index=df_test["PassengerId"], name="Survived")
df_sub.to_csv("modelo3.csv", header=True)
!head -n10 modelo3.csv

In [0]:
from sklearn.tree import DecisionTreeClassifier

modelo = DecisionTreeClassifier(max_depth= 10, min_samples_leaf=5)
modelo.fit(X, y)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=10, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [0]:
p = modelo.predict(df_test[variaveis].fillna(-1))

df_sub = pd.Series(p, index=df_test["PassengerId"], name="Survived")
df_sub.to_csv("modelo5.csv", header=True)
!head -n10 modelo5.csv

In [0]:
df_train['Embarked_S'] = (df_train['Embarked'] == 'S').astype(int)
df_train['Embarked_C'] = (df_train['Embarked'] == 'C').astype(int)
#df_train['Embarked_Q'] = (df_train['Embarked'] == 'Q').astype(int)

df_train['Cabine_nula'] = df_train['Cabin'].isnull().astype(int)

df_train['Miss'] = df_train['Name'].str.contains('Miss').astype(int)
df_train['Mrs'] = df_train['Name'].str.contains('Mrs').astype(int)
df_train['Master'] = df_train['Name'].str.contains('Master').astype(int)
df_train['Col'] = df_train['Name'].str.contains('Col').astype(int)
df_train['Major'] = df_train['Name'].str.contains('Major').astype(int)
df_train['Mr'] = df_train['Name'].str.contains('Mr').astype(int)

df_test['Embarked_S'] = (df_test['Embarked'] == 'S').astype(int)
df_test['Embarked_C'] = (df_test['Embarked'] == 'C').astype(int)
#df_test['Embarked_Q'] = (df_test['Embarked'] == 'Q').astype(int)

df_test['Cabine_nula'] = df_test['Cabin'].isnull().astype(int)

df_test['Miss'] = df_test['Name'].str.contains('Miss').astype(int)
df_test['Mrs'] = df_test['Name'].str.contains('Mrs').astype(int)
df_test['Master'] = df_test['Name'].str.contains('Master').astype(int)
df_test['Col'] = df_test['Name'].str.contains('Col').astype(int)
df_test['Major'] = df_test['Name'].str.contains('Major').astype(int)
df_test['Mr'] = df_test['Name'].str.contains('Mr').astype(int)

variaveis = ['Sex_binario', 'Age', 'Pclass', 'SibSp', 'Parch', 'Fare', 
             'Embarked_S', 'Embarked_C', 'Cabine_nula', 'Miss', 'Mrs',
             'Master', 'Col', 'Major', 'Mr']

X = df_train[variaveis].fillna(-1)
y = df_train['Survived']

from sklearn.linear_model import LinearRegression
modelo = LinearRegression()
modelo.fit(X, y)


p = modelo.predict(df_test[variaveis].fillna(-1))
p = (p > 0.7).astype(int)
df_sub = pd.Series(p, index=df_test["PassengerId"], name="Survived")
df_sub.to_csv("modelo6.csv", header=True)
!head -n10 modelo6.csv


# see https://www.kaggle.com/paulorzp/titanic-gp-model-training
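
The last model regresses the 0/1 label with `LinearRegression` and cuts the output at 0.7. Linear regression outputs are not confined to the interval [0, 1], so the threshold is a tuning choice rather than a probability. A common alternative, shown here only as a sketch and not as the notebook's method, is `LogisticRegression`, whose `predict_proba` values do stay within [0, 1]. The synthetic data and the 0.5 cut-off are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic stand-in for the feature matrix (assumption: not the Titanic data).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Approach used above: regress the 0/1 label and cut the output at a threshold.
linear = LinearRegression().fit(X_demo, y_demo)
linear_pred = (linear.predict(X_demo) > 0.7).astype(int)

# Alternative: logistic regression, whose predict_proba stays within [0, 1].
logistic = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
proba = logistic.predict_proba(X_demo)[:, 1]
logistic_pred = (proba > 0.5).astype(int)

print(linear_pred.mean(), logistic_pred.mean())
```

Either variant can be evaluated with the same repeated K-fold procedure used for the random forest before choosing a threshold.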