# Logistic Regression & Confusion Matrix

In [None]:
# Import required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.metrics import confusion_matrix
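# plot_confusion_matrix was removed in scikit-learn 1.2; alias its replacement to the old name
from sklearn.metrics import ConfusionMatrixDisplay
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator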
from sklearn.metrics import classification_report

Data Source: https://www.kaggle.com/azeembootwala/titanic?select=train_data.csv

The purpose of this project is to develop a logistic regression model for survival based on the train data. A confusion matrix is then used to evaluate the model's predictions on the test data.

# Data Loading

In [None]:
# Import train data
train_df = pd.read_csv('../input/titanic/train_data.csv')
# View the first five rows of train_df
train_df.head()

In [None]:
# Import test data
test_df = pd.read_csv('../input/titanic/test_data.csv')
# View the first five rows of test_df
test_df.head()

The tables above show the first five rows of the train data and the test data. There are 17 columns in total. The dependent variable is Survived, and all the other columns are independent variables.

The values of Age, Fare and Family_Size in the first five rows are decimals smaller than 1 because these columns were normalized. In addition, the first two columns are irrelevant to the logistic regression below and will be dropped. Let's take a closer look at this below.

# Data Cleaning

Both the train data and the test data are cleaned simultaneously, but only the train data will be viewed after cleaning.

In [None]:
# Drop the first two columns which are not required for the analysis below
train_df = train_df.drop(['Unnamed: 0', 'PassengerId'], axis=1)
test_df = test_df.drop(['Unnamed: 0', 'PassengerId'], axis=1)
# View some details for each variable
train_df.describe()

From the table above, there are 792 rows with data for each variable. Based on the description from Kaggle, all missing values have been filled with the median of the column values, and all real-valued data columns have been normalized. Thus, the columns for Age, Fare and Family_Size consist of values between 0 and 1.
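
The Kaggle description does not say which normalization was applied. As a minimal sketch, assuming min-max scaling (one common way to map a column into the 0 to 1 range), the transformation could look like this. The function name `min_max_scale` and the ages are made up for illustration.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1] (assumed method, not confirmed by the source)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up ages, not taken from the Titanic data
toy_ages = [2, 22, 38, 80]
scaled_ages = min_max_scale(toy_ages)
print(scaled_ages)
```

The smallest value maps to 0 and the largest to 1, which matches the range reported by `describe()` for the normalized columns.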

Let's look at the correlation coefficient below to check if we need all the columns for the logistic regression model.

In [None]:
# Set the figure's size
plt.figure(figsize=(20,15))
# Plot heatmap
sns.heatmap(train_df.corr(), annot = True, cmap = 'Blues_r')

From the heatmap above, it seems that all variables are important and should be included in the logistic regression model. However, we should check the assumptions of logistic regression before applying the model.

# Assumptions of Logistic Regression

The assumptions of logistic regression listed at https://www.statisticssolutions.com/assumptions-of-logistic-regression/ are used to check whether the data is a good fit.

- The dependent variable, survival, is binary (0 = No, 1 = Yes), so we will be using binary logistic regression.


- Logistic regression requires the observations to be independent of each other, which means that the observations should not come from repeated measurements or matched data.

   Based on the description on Kaggle, four columns (Title_1 to Title_4) were engineered from the Name column, signifying males and females depending on whether they were married or not (Mr, Mrs, Master, Miss). This indicates that the title columns were created from the name, sex and age columns of the original dataset. Therefore, Title_1 to Title_4 are excluded, as we will be keeping the Sex and Age columns.


- Logistic Regression requires little or no multicollinearity among the independent variables.

   From the heatmap above, Fare is positively correlated with Pclass_1 (1st Class Ticket) and negatively correlated with Pclass_2 (2nd Class Ticket) and Pclass_3 (3rd Class Ticket). This means that as the fare increases, it is more likely that the ticket is a 1st Class Ticket. Thus, Pclass_1 to Pclass_3 are excluded since we are keeping the Fare column.


- Logistic regression requires the independent variables to be linearly related to the log odds. Normalization rescales the columns but does not by itself guarantee this, so the requirement is assumed to hold here rather than verified.


- Logistic regression requires a large sample size. Since we will be using 7 independent variables, having 792 rows of data should be fine.
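
The multicollinearity check above was done by reading the heatmap. A rough sketch of the same idea in code is to list the column pairs whose absolute correlation exceeds a cutoff. The helper `correlated_pairs`, the cutoff value and the toy DataFrame are all assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

def correlated_pairs(df, threshold=0.5):
    """Return (col_a, col_b, r) for every column pair with |r| >= threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips self-correlation
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Toy data (not the Titanic data): 'b' is an exact multiple of 'a', 'c' is only weakly related
toy = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                    'b': [2, 4, 6, 8, 10],
                    'c': [3, 1, 4, 1, 5]})
print(correlated_pairs(toy, threshold=0.9))
```

On the notebook's data, the equivalent call would be `correlated_pairs(train_df.drop('Survived', axis=1))`, with a cutoff chosen to suit the analysis.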

In [None]:
# Drop the title columns
train_df = train_df.drop(['Title_1','Title_2','Title_3','Title_4' ], axis=1)
test_df = test_df.drop(['Title_1','Title_2','Title_3','Title_4' ], axis=1)
# Drop the Pclass columns
train_df = train_df.drop(['Pclass_1','Pclass_2','Pclass_3'], axis=1)
test_df = test_df.drop(['Pclass_1','Pclass_2','Pclass_3'], axis=1)
# View the first five rows of train_df
train_df.head()

# Setting the X and y variables

Split the cleaned data into the independent variables (X) and the dependent variable (y).

In [None]:
# Independent Variables for train data
train_x = train_df.iloc[:,1:]
# View the first three rows of train_x
train_x.head(3)

In [None]:
# Independent Variables for test data
test_x = test_df.iloc[:,1:]
# View the first three rows of test_x
test_x.head(3)

In [None]:
# Dependent Variable for train data
train_y = train_df['Survived']
# View the first three rows of train_y
train_y.head(3)

In [None]:
# Dependent Variable for test data
test_y = test_df['Survived']
# View the first three rows of test_y
test_y.head(3)

# Logistic Regression

In [None]:
# Fit the logistic regression model according to the given training data
lr = linear_model.LogisticRegression(random_state=0).fit(train_x, train_y)
# Check accuracy of model
print ('Train : ', lr.score(train_x,train_y))
print ('Test  : ', lr.score(test_x, test_y))

The output above shows the ratio of the number of correct predictions to the number of observations in the train data and the test data. As the training set accuracy is higher than the test set accuracy by 0.04%, there is only a very slight indication of overfitting. On the test set, the model predicts with 79% accuracy. This is good but not great.

Note that I also checked the logistic regression score() when all the Pclass and Title columns are included. As expected, the accuracy score is higher than 79%. However, as including these columns does not meet the assumptions of logistic regression, the accuracy of that model cannot be trusted.
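
To make the accuracy ratio concrete, here is a small sketch that computes it by hand and compares a train score with a test score. The helper `accuracy` and all labels below are made up for illustration; they are not the notebook's data.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)

# Made-up labels: 9 of 10 correct on "train", 4 of 5 correct on "test"
toy_train_actual = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
toy_train_pred   = [0, 1, 1, 0, 1, 0, 1, 1, 1, 0]
toy_test_actual  = [1, 0, 0, 1, 0]
toy_test_pred    = [1, 0, 1, 1, 0]

train_acc = accuracy(toy_train_actual, toy_train_pred)
test_acc = accuracy(toy_test_actual, toy_test_pred)
gap = train_acc - test_acc  # a large positive gap would suggest overfitting
print(train_acc, test_acc, round(gap, 2))
```

For a classifier, `lr.score(X, y)` reports this same ratio of correct predictions to observations.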

# Confusion Matrix

In [None]:
# Predict for test_x
pred_y = lr.predict(test_x)
# Confusion Matrix
confusion_matrix(test_y, pred_y)

In [None]:
# Plot the confusion matrix graph
plot_confusion_matrix(lr, test_x, test_y, cmap = 'Blues_r')

0 indicates the person did not survive, whereas 1 indicates the person survived.

There are 24 data points in cell (1,1), where the model correctly predicts that the person survived. There are 55 data points in cell (0,0), where the model correctly predicts that the person did not survive. In total, 79 data points were correctly classified as survived or not survived.

From the plot above, Type 1 and Type 2 errors account for a total of 21 (9 + 12) incorrectly classified data points. These are the cases where the predicted outcome differs from the actual outcome.
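
The cell layout can be illustrated with a small hand-rolled tally of (actual, predicted) pairs. This is only a sketch: `tally_confusion` is a hypothetical helper and the labels are toy values, not the Titanic test set.

```python
from collections import Counter

def tally_confusion(actual, predicted):
    """Count (actual, predicted) label pairs; key (1, 1) is a correctly predicted survivor."""
    return Counter(zip(actual, predicted))

# Toy labels for illustration only
toy_actual    = [0, 0, 0, 1, 1, 1, 1, 0]
toy_predicted = [0, 0, 1, 1, 1, 0, 1, 0]
cells = tally_confusion(toy_actual, toy_predicted)

correct = cells[(0, 0)] + cells[(1, 1)]  # diagonal cells: prediction matches the actual label
wrong = cells[(0, 1)] + cells[(1, 0)]    # off-diagonal cells: Type 1 and Type 2 errors
print(correct, wrong)

# The same bookkeeping with the counts reported above
reported_correct = 55 + 24
reported_wrong = 9 + 12
reported_total = reported_correct + reported_wrong
```

With the reported counts, the correct and incorrect cells add up to 100 test observations, which is consistent with the 79% test accuracy.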

# Classification Report

In [None]:
# View the classification report
print(classification_report(test_y, pred_y))

The values in the report above are calculated from the values in the confusion matrix. Precision is the proportion of predictions for a class that are correct, and recall is the proportion of actual members of a class that the model identifies.

The F1 score is the harmonic mean of precision and recall, so it is high only when both are high.

In conclusion, there are no perfect models. The logistic regression model predicts reasonably well, with 79% accuracy on the test set, but it may not be the best model for this dataset.
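
As a sketch of how these report values follow from confusion matrix counts, the function below computes precision, recall and F1 for one class. The counts are hypothetical: the text above does not say which of the two error cells is the false positives, so the notebook's own numbers are not used here.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)  # of everything predicted positive, how much was right
    recall = tp / (tp + fn)     # of everything actually positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts for illustration only
toy_tp, toy_fp, toy_fn = 8, 2, 4
toy_precision, toy_recall, toy_f1 = precision_recall_f1(toy_tp, toy_fp, toy_fn)
print(round(toy_precision, 3), round(toy_recall, 3), round(toy_f1, 3))
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values, so a model cannot score well by maximizing only precision or only recall.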