
### Logistic Regression with Python

This notebook fits a logistic regression model to the [Titanic Data Set from Kaggle](https://www.kaggle.com/c/titanic) to predict passenger survival.

## Import Libraries


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
from IPython.display import Image
Image("titanic.png")

## The Data

Read the `titanic_train.csv` file into a pandas DataFrame.

In [None]:
train = pd.read_csv('titanic_train.csv')

In [None]:
train.head()

# Exploratory Data Analysis

Start by checking for missing data.

## Missing Data

We can use seaborn to create a simple heatmap to see where we are missing data!

In [None]:
train.isnull().sum()

In [None]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')

Roughly 20 percent of the Age data is missing. 

The proportion of Age missing is likely small enough for reasonable replacement with some form of imputation. 

The Cabin column, by contrast, is missing too many values to be useful at a basic level.
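To put numbers on what the heatmap shows, the share of missing values per column can be computed directly. The sketch below uses a small made-up frame called `toy` (a hypothetical stand-in for `train`, with invented values), so the exact fractions are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `train`; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# isnull() gives a boolean frame; the column-wise mean of booleans is the
# fraction of missing values in each column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data, the same expression with `train` in place of `toy` would give the per-column fractions behind the statements above.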

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',data=train)

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Pclass',data=train,palette='rainbow')

In [None]:
sns.displot(train['Age'].dropna(),kde=False,color='darkred',bins=40)

In [None]:
train['Age'].hist(bins=30,color='darkred',alpha=0.3)

In [None]:
sns.countplot(x='SibSp',data=train)

In [None]:
train['Fare'].hist(color='green',bins=40,figsize=(8,4))

___
## Data Cleaning
We want to fill in the missing age values rather than drop the rows that lack them. One way to do this is to fill in the mean age of all passengers (imputation).
However, we can be smarter about this and look at the typical age by passenger class. For example:


In [None]:
plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass',y='Age',data=train,palette='winter')

We can see the wealthier passengers in the higher classes tend to be older, which makes sense. 

We'll impute each missing Age value with the mean age of that passenger's Pclass.

In [None]:
# Mean age per passenger class, computed from the non-missing values.
# (An earlier version hard-coded 37, 29 and 24 for classes 1, 2 and 3.)
age_means = train.groupby('Pclass')['Age'].mean()

def impute_age(cols):
    # Access by label; positional access such as cols[0] is deprecated in recent pandas.
    Age = cols['Age']
    Pclass = cols['Pclass']

    if pd.isnull(Age):
        # Pclass arrives as a float because each row is upcast alongside Age.
        return age_means[int(Pclass)]
    else:
        return Age


Now apply that function!

In [None]:
train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)
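The row-wise `apply` works, but the same per-class mean imputation can also be written without a Python-level loop. This is a minimal sketch of that alternative on a made-up frame (`toy` and its values are hypothetical, not taken from the Titanic data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two passengers per class, two missing ages.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [40.0, np.nan, 30.0, 28.0, np.nan, 20.0],
})

# transform("mean") returns a Series aligned with the original rows,
# holding each row's class mean (missing values are ignored in the mean).
class_mean = toy.groupby("Pclass")["Age"].transform("mean")

# fillna with an aligned Series only touches the rows where Age is missing.
toy["Age"] = toy["Age"].fillna(class_mean)
print(toy)
```

The result should match what the `apply` version produces; the vectorized form mainly pays off on larger frames.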

Check the heatmap again!

In [None]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')

Drop the Cabin column, then drop the remaining rows that contain NaN (the missing Embarked values).

In [None]:
train.drop('Cabin',axis=1,inplace=True)

In [None]:
train.head()

In [None]:
train.dropna(inplace=True)

## Converting Categorical Features 

We'll need to convert categorical features to dummy variables using pandas! Otherwise our machine learning algorithm won't be able to directly take in those features as inputs.
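As a small illustration of what `drop_first=True` does, the sketch below encodes a made-up `Embarked` series (the four values are hypothetical). With all three dummy columns present, each row sums to 1, so one column is redundant; dropping the first category removes that redundancy.

```python
import pandas as pd

# Hypothetical sample of embarkation ports.
embarked = pd.Series(["S", "C", "Q", "S"], name="Embarked")

full = pd.get_dummies(embarked)                      # one column per category
reduced = pd.get_dummies(embarked, drop_first=True)  # first category dropped

print(list(full.columns))     # ['C', 'Q', 'S']
print(list(reduced.columns))  # ['Q', 'S']
print(reduced)
```

A passenger who embarked at `C` is then represented by zeros in both remaining columns.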

In [None]:
train.info()

In [None]:
pd.get_dummies(train['Embarked'],drop_first=True).head()

In [None]:
sex = pd.get_dummies(train['Sex'],drop_first=True)
embark = pd.get_dummies(train['Embarked'],drop_first=True)

In [None]:
train.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)

In [None]:
train.head()

In [None]:
train = pd.concat([train,sex,embark],axis=1)

In [None]:
train.head()


# Building a Logistic Regression model

Let's start by splitting our data into a training set and test set (there is another test.csv file that you can play around with in case you want to use all this data for training).

## Train Test Split

In [None]:
train.drop('Survived',axis=1).head()

In [None]:
train['Survived'].head()

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train.drop('Survived',axis=1), 
                                                    train['Survived'], test_size=0.30, 
                                                    random_state=101)
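To make the effect of `test_size` and `random_state` concrete, here is a minimal sketch on a ten-row made-up frame (`toy` and its `feature` column are hypothetical). It shows the 70/30 row counts and that repeating the call with the same seed gives the same split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row frame standing in for the cleaned `train` data.
toy = pd.DataFrame({
    "feature": np.arange(10),
    "Survived": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
X = toy.drop("Survived", axis=1)
y = toy["Survived"]

# test_size=0.30 holds out 30% of the rows; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=101)
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.30, random_state=101)

print(len(X_tr), len(X_te))            # 7 3
print(X_te.index.equals(X_te2.index))  # True
```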

## Training and Predicting

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
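With unscaled features such as Fare next to 0/1 dummies, the default solver may stop at its iteration limit and emit a `ConvergenceWarning`. One possible remedy, sketched here on synthetic data rather than the Titanic frame, is to standardize the features and raise `max_iter`. The data, the scale factor, and the value `1000` are assumptions for illustration, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (not the Titanic data).
X, y = make_classification(n_samples=200, n_features=5, random_state=101)
X[:, 0] *= 1000  # mimic one column on a much larger scale

# Scale features first, then fit logistic regression with a higher
# iteration limit than the default of 100.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.score(X, y))
```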

In [None]:
predictions = logmodel.predict(X_test)
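`predict` returns hard class labels, but a logistic regression model also exposes the underlying probabilities through `predict_proba`. The sketch below, on a tiny made-up one-feature dataset, shows that in the binary case the labels correspond to thresholding the class-1 probability at 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature data: small values are class 0, large are class 1.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)   # column 0: P(class 0), column 1: P(class 1)
labels = clf.predict(X)

# Thresholding P(class 1) at 0.5 reproduces the predicted labels.
manual = (proba[:, 1] > 0.5).astype(int)
print(proba.round(3))
print(labels, manual)
```

A different threshold can be applied to `proba[:, 1]` when false positives and false negatives carry different costs.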

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test,predictions)

In [None]:
cm

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy=accuracy_score(y_test,predictions)
accuracy
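The accuracy score and the confusion matrix are two views of the same counts: accuracy is the diagonal of the matrix (correct predictions) divided by the total. A minimal worked example on made-up labels (`y_true` and `y_pred` here are hypothetical, not model output):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for eight passengers.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 1, 0, 1, 0])

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()   # binary case: [[tn, fp], [fn, tp]]

# Accuracy = correct predictions / all predictions.
manual_accuracy = (tn + tp) / cm.sum()
print(cm)
print(manual_accuracy, accuracy_score(y_true, y_pred))  # 0.75 0.75
```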

In [None]:
predictions