# Logistic regression exercise with Titanic data

## Introduction

- Data from Kaggle's Titanic competition: [data](https://github.com/justmarkham/DAT8/blob/master/data/titanic.csv), [data dictionary](https://www.kaggle.com/c/titanic/data)
- **Goal**: Predict survival based on passenger characteristics
- `titanic.csv` is already in our repo, so there is no need to download the data from the Kaggle website

# Import the required libraries and read the data

In [3]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt  # plotting

# Raw CSV file from the course repository
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/titanic.csv'
titanic = pd.read_csv(url)

## Let us now take a first look at the data

In [4]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Filling up missing data

In [5]:
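def impute_age(cols):
    # Fill a missing age with a fixed value chosen by passenger class.
    # Access by label: positional access such as cols[0] is deprecated on a labelled row.
    age = cols['Age']
    pclass = cols['Pclass']

    if pd.isnull(age):
        if pclass == 1:
            return 37
        elif pclass == 2:
            return 29
        else:
            return 24
    return age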


Let us now apply this function to every row to fill in the missing ages.

In [6]:
titanic['Age'] = titanic[['Age', 'Pclass']].apply(impute_age, axis=1)
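The fill values in `impute_age` are hard-coded per passenger class. As a minimal sketch of one way such per-class values could be derived and applied without a row-wise function, the example below uses a small made-up frame (`toy` and its ages are hypothetical, not the Titanic data) and fills each missing age with the median age of that passenger's class. This is an illustration under those assumptions, not necessarily how the values above were chosen.

```python
import pandas as pd

# Toy data standing in for the Titanic frame (all values are made up).
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2, 3, 3, 3],
    'Age':    [40.0, 36.0, None, 30.0, None, 20.0, 24.0, None],
})

# Median age per class, computed from the non-missing values only.
class_medians = toy.groupby('Pclass')['Age'].median()

# Fill each missing age with the median of that passenger's class.
toy['Age_filled'] = toy['Age'].fillna(toy['Pclass'].map(class_medians))

print(class_medians)
print(toy)
```

Compared with `apply(..., axis=1)`, this form is vectorized and recomputes the fill values if the data changes.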

In [7]:
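# Drop the Cabin column
titanic.drop('Cabin', axis=1, inplace=True)

# Convert categorical features to dummy (indicator) columns.
# drop_first=True drops the first level of each feature, which would be redundant.
# dtype=int keeps the 0/1 columns shown below (newer pandas versions default to booleans).
sex = pd.get_dummies(titanic['Sex'], drop_first=True, dtype=int)
embark = pd.get_dummies(titanic['Embarked'], drop_first=True, dtype=int)
titanic.drop(['Sex', 'Embarked', 'Name', 'Ticket'], axis=1, inplace=True)
titanic = pd.concat([titanic, sex, embark], axis=1)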
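To see what `drop_first` does in the cell above, here is a small sketch on a made-up `Embarked` column (the `ports` series is hypothetical). With three ports, two indicator columns are enough: a row with zeros in both `Q` and `S` must be `C`.

```python
import pandas as pd

# Hypothetical embarkation ports for four passengers.
ports = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# One indicator column per level.
full = pd.get_dummies(ports, dtype=int)

# Same encoding with the first level (alphabetically, 'C') dropped.
reduced = pd.get_dummies(ports, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

Dropping one level avoids feeding the model a set of columns that always sum to one.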

In [8]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,male,Q,S
0,1,0,3,22.0,1,0,7.25,1,0,1
1,2,1,1,38.0,1,0,71.2833,0,0,0
2,3,1,3,26.0,0,0,7.925,0,0,1
3,4,1,1,35.0,1,0,53.1,0,0,1
4,5,0,3,35.0,0,0,8.05,1,0,1


## Let us define the feature columns. The Survived column is the prediction target

In [9]:
#feature_cols = ['Pclass', 'Parch']
feature_cols = ['Pclass', 'Parch','male','Age']
X = titanic[feature_cols]
y = titanic.Survived

## Split the data: we will build the model on the training set and apply it to the test set

In [10]:
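# sklearn.cross_validation was removed from scikit-learn; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split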
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
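The call above relies on the defaults of `train_test_split`. A minimal sketch on made-up arrays (`X_toy` and `y_toy` are hypothetical) shows the two behaviours that matter here: with no `test_size` given, a quarter of the rows are held out for testing, and a fixed `random_state` makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 made-up samples with 2 features each, and alternating 0/1 labels.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20) % 2

# Default split: 75% train, 25% test, shuffled with a fixed seed.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same seed gives the same rows.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, random_state=1)
print(np.array_equal(X_te, X_te2))
```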




# Let us fit the model

In [12]:
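from sklearn.linear_model import LogisticRegression
# A very large C makes the default L2 penalty negligible (effectively no regularization).
# solver='liblinear' is set explicitly to match the output shown below.
logreg = LogisticRegression(C=1e9, solver='liblinear')
logreg.fit(X_train, y_train)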


LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [14]:
y_pred_class = logreg.predict(X_test)
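The `predict` call above returns hard class labels. As a rough sketch of the idea behind it, a fitted binary logistic regression turns a weighted sum of the features into a probability with the logistic (sigmoid) function and predicts class 1 when that probability exceeds 0.5. The helper names, coefficients, and feature values below are hypothetical and chosen only for illustration; they are not the fitted values of `logreg` and not scikit-learn's implementation.

```python
import math

def sigmoid(z):
    # Map any real-valued score to a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(features, coefs, intercept, threshold=0.5):
    # Linear score: intercept plus the weighted sum of the features.
    z = intercept + sum(c * x for c, x in zip(coefs, features))
    p = sigmoid(z)
    # Predict class 1 when the probability exceeds the threshold.
    return (1 if p > threshold else 0), p

# Hypothetical model with two features.
coefs = [-1.0, 0.5]
intercept = 0.2

label_a, prob_a = predict_class([2.0, 1.0], coefs, intercept)  # score -1.3
label_b, prob_b = predict_class([0.0, 1.0], coefs, intercept)  # score 0.7
print(label_a, round(prob_a, 3))
print(label_b, round(prob_b, 3))
```

On the real model, `logreg.predict_proba(X_test)` exposes the probabilities behind the labels.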

In [16]:
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))


0.7937219730941704


# Confusion Matrix

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing.

In [17]:
print(metrics.confusion_matrix(y_test, y_pred_class))

[[110  18]
 [ 28  67]]


true positive (TP): equivalent to a hit  
true negative (TN): equivalent to a correct rejection  
false positive (FP): equivalent to a false alarm, Type I error  
false negative (FN): equivalent to a miss, Type II error
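The four terms above can be made concrete by counting them directly. The sketch below uses made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical) and a small helper, `confusion_counts`, written for this illustration only. It returns the counts in the order TN, FP, FN, TP, which follows the row-by-row layout of the matrix printed by scikit-learn for 0/1 labels: `[[TN, FP], [FN, TP]]`.

```python
def confusion_counts(y_true, y_pred):
    # Count each outcome for binary labels, where 1 is the positive class.
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1  # hit
        elif truth == 0 and pred == 0:
            tn += 1  # correct rejection
        elif truth == 0 and pred == 1:
            fp += 1  # false alarm (Type I error)
        else:
            fn += 1  # miss (Type II error)
    return tn, fp, fn, tp

# Made-up true and predicted labels for six passengers.
y_true_toy = [0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true_toy, y_pred_toy)
print([[tn, fp], [fn, tp]])
```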

In [18]:
# Rows are the true classes, columns are the predicted classes
confusion = metrics.confusion_matrix(y_test, y_pred_class)
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]

In [19]:
print('True Positives:', TP)
print('True Negatives:', TN)
print('False Positives:', FP)
print('False Negatives:', FN)

True Positives: 67
True Negatives: 110
False Positives: 18
False Negatives: 28


In [20]:
# calculate the sensitivity (true positive rate): TP / (TP + FN)
print(TP / float(TP + FN))


0.7052631578947368


In [21]:
# calculate the specificity (true negative rate): TN / (TN + FP)
print(TN / float(TN + FP))


0.859375
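As a cross-check, all three scores reported in this exercise can be recomputed from the four confusion-matrix counts alone. The sketch below plugs in the counts printed earlier (TP = 67, TN = 110, FP = 18, FN = 28); the helper `rates` is a name introduced here for illustration.

```python
def rates(tp, tn, fp, fn):
    # Derive the summary scores from the four confusion-matrix counts.
    total = tp + tn + fp + fn
    return {
        'accuracy': (tp + tn) / total,     # share of all predictions that are correct
        'sensitivity': tp / (tp + fn),     # share of actual positives that are found
        'specificity': tn / (tn + fp),     # share of actual negatives that are found
    }

# Counts taken from the confusion matrix above.
scores = rates(tp=67, tn=110, fp=18, fn=28)
for name, value in scores.items():
    print(name, value)
```

The accuracy, (67 + 110) / 223, agrees with the value returned by `metrics.accuracy_score`. Comparing the other two scores shows that the model is better at identifying passengers who did not survive (specificity) than those who did (sensitivity).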
