# TITANIC

## Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

## Data

### Overview

The data has been split into two groups:

- training set (train.csv)
- test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
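
The gender-only rule described above can be sketched as follows. This is a minimal illustration, not the code used to produce gender_submission.csv: `test_sample` and `baseline` are hypothetical names, and the three rows are a toy stand-in for test.csv.

```python
import pandas as pd

# Toy stand-in for test.csv (assumption: only the two columns needed here).
test_sample = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

# Gender-only baseline: predict 1 (survived) for women, 0 for men.
baseline = pd.DataFrame({
    "PassengerId": test_sample["PassengerId"],
    "Survived": (test_sample["Sex"] == "female").astype(int),
})
print(baseline)
```

Any model worth submitting should score at least as well as this rule, which makes it a useful reference point.
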

### Data Dictionary

| Variable | Definition | Key |
| --- | --- | --- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

### Variable Notes

pclass: A proxy for socio-economic status (SES)

- 1st = Upper
- 2nd = Middle
- 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5.

sibsp: The dataset defines family relations in this way:

- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way:

- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.
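
Since sibsp and parch both count relatives aboard, one possible piece of feature engineering is to combine them. The sketch below is only an illustration on toy rows: `FamilySize` and `IsAlone` are hypothetical feature names that are not part of the dataset and are not used in the model fitted later in this notebook.

```python
import pandas as pd

# Toy rows with the SibSp and Parch columns described above.
df = pd.DataFrame({"SibSp": [1, 0, 1], "Parch": [0, 0, 1]})

# Hypothetical engineered feature: the passenger plus all listed relatives.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Hypothetical flag for passengers travelling with no listed relatives.
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)
print(df)
```

Note that a child travelling only with a nanny would be flagged as alone under this definition, because parch is 0 for them.
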

In [269]:
# special IPython command to prepare the notebook for matplotlib
%matplotlib inline 

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn
import statsmodels.api as sm

import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

# special matplotlib argument for improved plots
from matplotlib import rcParams
from sklearn.linear_model import LogisticRegression

In [270]:
# Load the data into train and test
train=pd.read_csv("train.csv")
test=pd.read_csv("test.csv")

In [271]:
# Explore the data
print(train.shape)
print(test.shape)

(891, 12)
(418, 11)


In [272]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [273]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [274]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [275]:
test.describe()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
count,418.0,418.0,332.0,418.0,418.0,417.0
mean,1100.5,2.26555,30.27259,0.447368,0.392344,35.627188
std,120.810458,0.841838,14.181209,0.89676,0.981429,55.907576
min,892.0,1.0,0.17,0.0,0.0,0.0
25%,996.25,1.0,21.0,0.0,0.0,7.8958
50%,1100.5,3.0,27.0,0.0,0.0,14.4542
75%,1204.75,3.0,39.0,1.0,0.0,31.5
max,1309.0,3.0,76.0,8.0,9.0,512.3292


In [276]:
print(train.isnull().sum())
print(test.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64


In [277]:
# The ID, name, and ticket columns are of no use for our prediction.
# Cabin has too many null values, so it is also considered unusable for the model.

train=train.drop(["PassengerId","Name","Ticket","Cabin"], axis=1)
# In test, PassengerId and Name stay for now; they are dropped later when building x_test.
test=test.drop(["Ticket","Cabin"], axis=1)

In [278]:
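# Age and Fare also have null values; we replace them with the rounded column mean.
# Note: Age is filled with the training-set mean in both frames, while Fare
# (missing only in test) is filled with the test-set mean.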
train["Age"]=train["Age"].fillna(round(train["Age"].mean(),0))
test["Age"]=test["Age"].fillna(round(train["Age"].mean(),0))

test["Fare"]=test["Fare"].fillna(round(test["Fare"].mean(),0))
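
The cell above mixes training-set and test-set statistics. A common alternative is to compute every fill value from the training set once and reuse it for both frames, so that nothing from the test set influences preprocessing. The sketch below shows that variant on toy data, under that assumption; `train_toy`, `test_toy`, and `fill_values` are illustrative names, and this is not what the notebook itself does.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train and test frames.
train_toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0], "Fare": [7.25, 71.28, 8.05]})
test_toy = pd.DataFrame({"Age": [np.nan, 47.0], "Fare": [np.nan, 7.0]})

# Compute the fill values once, from the training data only.
fill_values = train_toy[["Age", "Fare"]].mean().round(0)

# Passing a Series to fillna fills each column with its matching entry.
train_toy = train_toy.fillna(fill_values)
test_toy = test_toy.fillna(fill_values)
print(test_toy)
```
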

In [279]:
# Next, convert the categorical variables to numeric ones

# Collect all categorical variables present in the data
cat_vars = [var for var in train.columns if train[var].dtype == 'O']
cat_vars

['Sex', 'Embarked']

In [280]:
# Convert these variables to numeric (one-hot encoding)
train = pd.get_dummies(train, columns=cat_vars)
test = pd.get_dummies(test, columns=cat_vars)

In [281]:
# Drop redundant columns (one level of each variable serves as the baseline)
train=train.drop(["Sex_female","Embarked_S"], axis=1)
test=test.drop(["Sex_female","Embarked_S"], axis=1)
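
The two `pd.get_dummies` calls above run independently on train and test, so the resulting columns match only if both frames contain the same category levels. That happens to hold here, but a defensive sketch could align the test columns to the training columns explicitly. The frames below are toy data and the names `train_enc` and `test_enc` are illustrative.

```python
import pandas as pd

# Toy frames: the test frame has no passenger who embarked at "C".
train_toy = pd.DataFrame({"Embarked": ["S", "C", "Q"]})
test_toy = pd.DataFrame({"Embarked": ["S", "Q"]})

train_enc = pd.get_dummies(train_toy, columns=["Embarked"])
test_enc = pd.get_dummies(test_toy, columns=["Embarked"])

# Give the test frame exactly the training columns, in the same order,
# filling any column that was absent in test with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```

In the real notebook the training frame also holds `Survived`, so that column would need to be excluded before aligning.
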

In [282]:
# Check the null values again
print(train.isnull().sum())
print(test.isnull().sum())

Survived      0
Pclass        0
Age           0
SibSp         0
Parch         0
Fare          0
Sex_male      0
Embarked_C    0
Embarked_Q    0
dtype: int64
PassengerId    0
Pclass         0
Name           0
Age            0
SibSp          0
Parch          0
Fare           0
Sex_male       0
Embarked_C     0
Embarked_Q     0
dtype: int64


In [283]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Age,SibSp,Parch,Fare,Sex_male,Embarked_C,Embarked_Q
0,892,3,"Kelly, Mr. James",34.5,0,0,7.8292,1,0,1
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",47.0,1,0,7.0,0,0,0
2,894,2,"Myles, Mr. Thomas Francis",62.0,0,0,9.6875,1,0,1
3,895,3,"Wirz, Mr. Albert",27.0,0,0,8.6625,1,0,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",22.0,1,1,12.2875,0,0,0


In [284]:
# The data is now clean and ready for prediction

In [285]:
# Separate the dependent and independent variables
# Remember that this must be done with the train data; the predictions are then made on the test data


x_train = train.drop("Survived", axis=1)
y_train = train["Survived"]

x_test = test.drop(["PassengerId","Name"],axis=1)

In [286]:
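# fit logistic regression model
# statsmodels works nicely with pandas dataframes
# note: sm.Logit does not add an intercept on its own and none is added here,
# so this model is fitted without a constant term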
logit = sm.Logit(y_train, x_train).fit()
print(logit.params)
logit.summary()

Optimization terminated successfully.
         Current function value: 0.491208
         Iterations 6
Pclass        0.061404
Age           0.005657
SibSp        -0.258011
Parch        -0.093061
Fare          0.016180
Sex_male     -2.253984
Embarked_C    0.670764
Embarked_Q    0.162848
dtype: float64


0,1,2,3
Dep. Variable:,Survived,No. Observations:,891.0
Model:,Logit,Df Residuals:,883.0
Method:,MLE,Df Model:,7.0
Date:,"Thu, 03 Oct 2019",Pseudo R-squ.:,0.2624
Time:,00:04:33,Log-Likelihood:,-437.67
converged:,True,LL-Null:,-593.33
Covariance Type:,nonrobust,LLR p-value:,2.306e-63

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Pclass,0.0614,0.074,0.829,0.407,-0.084,0.207
Age,0.0057,0.006,1.017,0.309,-0.005,0.017
SibSp,-0.2580,0.095,-2.719,0.007,-0.444,-0.072
Parch,-0.0931,0.113,-0.827,0.408,-0.314,0.127
Fare,0.0162,0.003,5.316,0.000,0.010,0.022
Sex_male,-2.2540,0.181,-12.474,0.000,-2.608,-1.900
Embarked_C,0.6708,0.217,3.092,0.002,0.246,1.096
Embarked_Q,0.1628,0.301,0.542,0.588,-0.426,0.752
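
The coefficients in the summary above are on the log-odds scale, which is hard to read directly. Exponentiating a coefficient gives an odds ratio: the multiplicative change in the odds of survival for a one-unit increase in that feature, with the other features held fixed. The sketch below applies this to three coefficients copied from the output. `coefs` and `odds_ratios` are illustrative names, and the interpretation is conditional on this particular model, which has no intercept.

```python
import math

# Coefficients copied from the statsmodels output above (log-odds scale).
coefs = {"Sex_male": -2.253984, "Embarked_C": 0.670764, "Fare": 0.016180}

# exp(coef) converts a log-odds coefficient into an odds ratio.
odds_ratios = {name: math.exp(value) for name, value in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.3f}")
```

An odds ratio below 1 (as for `Sex_male`, roughly 0.1) means lower odds of survival, and a ratio above 1 (as for `Embarked_C`) means higher odds.
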


In [287]:
# Fit the model with scikit-learn
logit2 = LogisticRegression()
logit2.fit(x_train, y_train)

# Check the model's accuracy on the training data
# (for classifiers, score returns mean accuracy, not R-squared)
print(logit2.score(x_train, y_train))

0.8024691358024691
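
For a classifier, the number printed above is the mean accuracy: the share of rows where the predicted label equals the true label. The sketch below computes the same quantity by hand on toy labels, which are made-up values and not taken from the dataset. Because the score above is measured on the data the model was trained on, it tends to be an optimistic estimate of performance on unseen passengers.

```python
# Toy true labels and predictions for five passengers (made-up values).
y_true = [0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0]

# Mean accuracy: fraction of predictions that match the true label.
matches = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = matches / len(y_true)
print(accuracy)
```
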




In [288]:
# Predict on the same data used for training
pred_train=logit2.predict(x_train)

# Add a column to compare the results
train["Prediction"]=pred_train
train[["Survived","Prediction"]].head()

Unnamed: 0,Survived,Prediction
0,0,0
1,1,1
2,1,1
3,1,1
4,0,0


In [289]:
# Predict on the test data
pred_test=logit2.predict(x_test)

# Add the prediction column to the test data
test["Survived"] = pred_test
datafinal=test[["PassengerId","Survived"]]
datafinal



Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1
5,897,0
6,898,1
7,899,0
8,900,1
9,901,0


In [290]:
# Generate the final file
datafinal.to_csv("Lab5_A01730893.csv",index=False)
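
Before uploading, it can help to confirm that the file has the expected shape. The sketch below runs a few basic checks on a toy frame, writing to an in-memory buffer rather than to disk. `submission` and `check` are illustrative names, and the checks assume the two-column format (`PassengerId`, `Survived`) used in this notebook.

```python
import io

import pandas as pd

# Toy stand-in for the final prediction frame.
submission = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 0, 0]})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

# Basic format checks: column names, binary labels, unique passenger IDs.
assert list(check.columns) == ["PassengerId", "Survived"]
assert check["Survived"].isin([0, 1]).all()
assert check["PassengerId"].is_unique
print(check.shape)
```

With the real data, `check.shape` should report 418 rows, one per passenger in the test set.
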