# **Statistical Learning Project**
# Part One
**José Barrios - 20007192** 

## Overview
The project consists of a binary classification to determine whether a person survives (y=1) or does not survive (y=0) the sinking of the Titanic.

The goal is a model with an accuracy of at least 80%.

The project is divided into two parts. In this first part we focus on creating the models, evaluating their performance, and saving them for later use in another notebook.

In general, we will do _feature engineering_ and then train 4 models:
* Decision tree
* Support Vector Machine
* Naive Bayes
* Logistic regression

The results of these models will be combined to predict a person's survival by consensus.
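
As a rough illustration of the consensus step, the sketch below combines the 0/1 predictions of four models by majority vote. The helper name `majority_vote`, the toy predictions, and the tie-breaking rule (a 2-2 tie counts as non-survival) are assumptions made for this example; the notebook does not state how the models are actually combined.

```python
import numpy as np


def majority_vote(predictions):
    """Combine binary predictions (one row per model) by majority vote.

    Hypothetical helper: ties are resolved as 0 (did not survive).
    """
    predictions = np.asarray(predictions)
    votes_for_survival = predictions.sum(axis=0)
    # Predict survival only if more than half of the models vote for it.
    return (votes_for_survival > predictions.shape[0] / 2).astype(int)


# Made-up predictions from four models for five passengers.
toy_predictions = [
    [1, 0, 1, 0, 1],  # decision tree
    [1, 0, 0, 0, 1],  # SVM
    [1, 1, 0, 0, 0],  # naive Bayes
    [0, 0, 1, 0, 1],  # logistic regression
]
consensus = majority_vote(toy_predictions)
print(consensus)
```

With an even number of models, the tie-breaking rule matters; weighting each model by its validation accuracy would be one alternative.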

## Data
We load the data to get to know its structure.

In [98]:
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn as sk
import sklearn.preprocessing as skp 
from sklearn import tree
from sklearn import svm
import datetime

In [2]:
if tf.__version__.startswith("2."):
  import tensorflow.compat.v1 as tf
  tf.compat.v1.disable_v2_behavior()
  tf.compat.v1.disable_eager_execution()
  print("Enabled compatibility to tf1.x")

Instructions for updating:
non-resource variables are not supported in the long term
Enabled compatibility to tf1.x


In [None]:
%load_ext tensorboard

In [63]:
data = pd.read_csv('data_titanic_proyecto.csv')
data.head()

Unnamed: 0,PassengerId,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,passenger_class,passenger_sex,passenger_survived
0,1,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,,S,Lower,M,N
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,C,Upper,F,Y
2,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,,S,Lower,F,Y
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,S,Upper,F,Y
4,5,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,,S,Lower,M,N


In [75]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
PassengerId     891 non-null int64
Age             891 non-null float64
SibSp           891 non-null int64
Parch           891 non-null int64
Fare            891 non-null float64
Embarked_C      891 non-null uint8
Embarked_Q      891 non-null uint8
Embarked_S      891 non-null uint8
Class_Lower     891 non-null uint8
Class_Middle    891 non-null uint8
Class_Upper     891 non-null uint8
Sex_F           891 non-null uint8
Sex_M           891 non-null uint8
Survived        891 non-null int64
dtypes: float64(2), int64(4), uint8(8)
memory usage: 48.9 KB


We can see that some columns can be omitted from the analysis because they are not expected to contribute to the models:
* Name
* Ticket
* Cabin

Cabin was also dropped because it has too many null values, and comparable information can be obtained from other features.
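
One way to check how many values are missing per column is sketched below. The toy frame and its values are made up for illustration; in the notebook the same expression would be applied to `data` before the columns are dropped.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# Fraction of missing values per column, highest first.
null_fraction = toy.isna().mean().sort_values(ascending=False)
print(null_fraction)
```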

In [64]:
data = data.drop(columns = ["Name", "Ticket", "Cabin"])
data.head()

Unnamed: 0,PassengerId,Age,SibSp,Parch,Fare,Embarked,passenger_class,passenger_sex,passenger_survived
0,1,22.0,1,0,7.25,S,Lower,M,N
1,2,38.0,1,0,71.2833,C,Upper,F,Y
2,3,26.0,0,0,7.925,S,Lower,F,Y
3,4,35.0,1,0,53.1,S,Upper,F,Y
4,5,35.0,0,0,8.05,S,Lower,M,N


In [73]:
data["Age"] = data["Age"].fillna(data["Age"].mean())

We apply one-hot encoding to the variables Embarked, passenger_class, and passenger_sex.

In [65]:
data = pd.concat([data, pd.get_dummies(data["Embarked"], prefix = "Embarked")], axis = 1)
data = pd.concat([data, pd.get_dummies(data["passenger_class"], prefix = "Class")], axis = 1)
data = pd.concat([data, pd.get_dummies(data["passenger_sex"], prefix = "Sex")], axis = 1)

data = data.drop(columns = ["Embarked", "passenger_class", "passenger_sex"])
data.head()

Unnamed: 0,PassengerId,Age,SibSp,Parch,Fare,passenger_survived,Embarked_C,Embarked_Q,Embarked_S,Class_Lower,Class_Middle,Class_Upper,Sex_F,Sex_M
0,1,22.0,1,0,7.25,N,0,0,1,1,0,0,0,1
1,2,38.0,1,0,71.2833,Y,1,0,0,0,0,1,1,0
2,3,26.0,0,0,7.925,Y,0,0,1,1,0,0,1,0
3,4,35.0,1,0,53.1,Y,0,0,1,0,0,1,1,0
4,5,35.0,0,0,8.05,N,0,0,1,1,0,0,0,1
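
The three `concat` calls plus the `drop` can also be written as a single `pd.get_dummies` call that takes the columns to encode. The sketch below uses a made-up toy frame; it is an alternative formulation, not the code that produced the output above.

```python
import pandas as pd

# Toy frame with the same categorical columns as the notebook (values are made up).
toy = pd.DataFrame({
    "Embarked": ["S", "C", "S"],
    "passenger_class": ["Lower", "Upper", "Lower"],
    "passenger_sex": ["M", "F", "F"],
    "Fare": [7.25, 71.2833, 7.925],
})

# Encode the listed columns and drop the originals in one step.
encoded = pd.get_dummies(
    toy,
    columns=["Embarked", "passenger_class", "passenger_sex"],
    prefix=["Embarked", "Class", "Sex"],
)
print(encoded.columns.tolist())
```

Only categories present in the data get a column, so a category missing from a small sample (here `Embarked_Q` and `Class_Middle`) does not appear.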


We convert passenger_survived to integers, where Y = 1 and N = 0.

In [66]:
data["Survived"] = 1
data.loc[data["passenger_survived"] == "N", "Survived"] = 0
data = data.drop(columns = ["passenger_survived"])
data.head()

Unnamed: 0,PassengerId,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S,Class_Lower,Class_Middle,Class_Upper,Sex_F,Sex_M,Survived
0,1,22.0,1,0,7.25,0,0,1,1,0,0,0,1,0
1,2,38.0,1,0,71.2833,1,0,0,0,0,1,1,0,1
2,3,26.0,0,0,7.925,0,0,1,1,0,0,1,0,1
3,4,35.0,1,0,53.1,0,0,1,0,0,1,1,0,1
4,5,35.0,0,0,8.05,0,0,1,1,0,0,0,1,0


We use pandas to create the training, test, and validation datasets.

In [76]:
data_train = data.sample(frac = 0.8, random_state = 123) # random_state is the random seed
data_test = data.drop(data_train.index) # Test data
data_cv = data_train.sample(frac = 0.12, random_state = 123) # Validation data (12% of the training set, approx. 10% of the original dataset)
data_train = data_train.drop(data_cv.index) # Final training data, without the test and cross-validation rows

print(data_train.shape)
print(data_test.shape)
print(data_cv.shape)

(627, 14)
(178, 14)
(86, 14)
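
The printed shapes can be reproduced with simple arithmetic. The sketch below assumes that `DataFrame.sample` rounds `frac * len(data)` to the nearest integer, which is consistent with the shapes shown above; the variable names are chosen for this example only.

```python
n_total = 891                             # rows reported by data.info()

n_train_initial = round(0.8 * n_total)    # first sample: 80% of all rows
n_test = n_total - n_train_initial        # rows left over for testing
n_cv = round(0.12 * n_train_initial)      # validation: 12% of the initial training rows
n_train = n_train_initial - n_cv          # final training rows

print(n_train, n_test, n_cv)
# Share of the original dataset in each split.
print([round(n / n_total, 3) for n in (n_train, n_test, n_cv)])
```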


## Model creation
### Decision tree
We implement the decision tree with scikit-learn. Decision trees need little feature engineering before training, so we simply rely on the available library functions.

In [79]:
model_tree = tree.DecisionTreeClassifier()
model_tree = model_tree.fit(data_train[["Age", "SibSp", "Parch", "Fare", "Embarked_C", "Embarked_Q", "Embarked_S", "Class_Lower", "Class_Middle", "Class_Upper", "Sex_F", "Sex_M"]], 
                            data_train["Survived"])

In [82]:
model_tree.predict(data_cv[["Age", "SibSp", "Parch", "Fare", "Embarked_C", "Embarked_Q", "Embarked_S", "Class_Lower", "Class_Middle", "Class_Upper", "Sex_F", "Sex_M"]])

array([1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0,
       1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0],
      dtype=int64)
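
To compare a model against the 80% accuracy target, its predictions can be scored against the true labels. The sketch below is a minimal version with made-up labels and predictions; the helper name `accuracy` is an assumption, and the numbers are not results from this notebook.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float((y_true == y_pred).mean())


# Made-up labels and predictions, not results from the notebook.
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0, 0, 0]

acc = accuracy(y_true, y_pred)
print(f"accuracy = {acc:.2f}, meets 80% target: {acc >= 0.80}")
```

In the notebook, `y_true` would be `data_cv["Survived"]` and `y_pred` the array returned by `model_tree.predict` on the validation features.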

Next, we export the tree to a Graphviz DOT file so that it can be used in another project.

In [95]:
tree.export_graphviz(model_tree, 
                     out_file = "models/model_tree.dot", 
                     feature_names = ["Age", "SibSp", "Parch", "Fare", "Embarked_C", "Embarked_Q", "Embarked_S", "Class_Lower", "Class_Middle", "Class_Upper", "Sex_F", "Sex_M"],
                     filled=True, 
                     rounded=True,
                     special_characters=True)
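
A DOT file describes the tree for visualization; scikit-learn cannot load it back as a fitted model. If the goal is to make predictions in another notebook, one option is to serialize the fitted estimator. The sketch below uses the standard-library `pickle` module with toy data and a temporary file path, both of which are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn import tree

# Toy training data (made-up values) standing in for the Titanic features.
X = np.array([[22.0, 0], [38.0, 1], [26.0, 1], [35.0, 0]])
y = np.array([0, 1, 1, 0])

model = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

# Hypothetical location; the notebook could use a path inside its models/ folder.
path = os.path.join(tempfile.mkdtemp(), "model_tree.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later, for example in the second notebook, load the fitted model back.
with open(path, "rb") as f:
    restored = pickle.load(f)

same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Pickled models should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.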

We save the experiment data to an Excel sheet.

In [None]:
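# The context manager saves and closes the workbook on exit,
# so no explicit writer.save() or writer.close() is needed.
# mode='a' appends to an existing file, so bitacora.xlsx must already exist.
with pd.ExcelWriter('bitacora.xlsx', engine='openpyxl', mode='a') as writer:
    data_cv.to_excel(writer, sheet_name='Sheet1')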

### Support Vector Machine (SVM)
It is a supervised learning method with the following advantages:
* Effective in high dimensional spaces.
* Still effective in cases where number of dimensions is greater than the number of samples.
* Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
* Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
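
As a minimal sketch of how such a classifier might be trained with scikit-learn, the example below fits an `svm.SVC` on synthetic two-cluster data. The RBF kernel, the `StandardScaler` step, and the toy data are assumptions for illustration and not the configuration used in this project. Scaling is included because SVMs are sensitive to features with very different ranges, such as Age and Fare.

```python
import numpy as np
from sklearn import svm
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two well-separated clusters with two features each.
rng = np.random.default_rng(123)
X0 = rng.normal(loc=[20.0, 10.0], scale=1.0, size=(20, 2))
X1 = rng.normal(loc=[40.0, 80.0], scale=1.0, size=(20, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 20 + [1] * 20)

# Standardize the features, then fit an SVM with an RBF kernel.
model_svm = make_pipeline(StandardScaler(), svm.SVC(kernel="rbf"))
model_svm.fit(X, y)

train_accuracy = model_svm.score(X, y)
# Number of support vectors per class: the subset of points kept by the model.
n_support = model_svm.named_steps["svc"].n_support_
print(train_accuracy, n_support)
```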

