
# Logistic Regression with Python

We'll be trying to predict a classification: survived or deceased. Let's begin by implementing logistic regression in Python for classification.

## Import Libraries
Let's import some libraries to get started!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## The Data


In [None]:
train = pd.read_csv('../input/train.csv')

In [None]:
train.tail()

# Exploratory Data Analysis

Let's begin some exploratory data analysis! We'll start by checking out missing data!

## Missing Data

We can use seaborn to create a simple heatmap to see where we are missing data!

In [None]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')

In [None]:
train.isnull().sum().sort_values(ascending=False)

Roughly 20 percent of the Age data is missing. That proportion is likely small enough for reasonable replacement with some form of imputation. The Cabin column, by contrast, is missing too much data to do anything useful with at a basic level. We'll probably drop it later, or change it to another feature like "Cabin Known: 1 or 0".
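
To put numbers behind the heatmap, here is a minimal sketch on a made-up toy frame (the values are illustrative, not the real Titanic data). It computes the share of missing values per column and builds the "Cabin Known" indicator mentioned above; the column name `CabinKnown` is our own choice.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# Fraction of missing values per column, largest first.
missing_fraction = toy.isnull().mean().sort_values(ascending=False)
print(missing_fraction)

# Hypothetical "Cabin Known" feature: 1 if a cabin is recorded, else 0.
toy["CabinKnown"] = toy["Cabin"].notnull().astype(int)
print(toy["CabinKnown"].tolist())  # [0, 1, 0, 1, 0]
```

Using `.mean()` on the boolean mask gives a proportion directly, which is easier to compare across columns than raw counts.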

#### Let's continue by visualizing some more of the data! Check out the video for full explanations of these plots; this code just serves as a reference.

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',data=train,palette='RdBu_r')

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Pclass',data=train,palette='rainbow')

In [None]:
train['Age'].hist(bins=30,color='darkred',alpha=0.7)

In [None]:
sns.countplot(x='SibSp',data=train)

In [None]:
sns.countplot(x='Parch',data=train)

In [None]:
train['Fare'].hist(color='green',bins=40,figsize=(8,4))

___
## Data Cleaning
We want to fill in the missing age data instead of just dropping the rows where age is missing. One way to do this is to fill in the mean age of all the passengers (imputation).
However, we can be smarter about this and check the average age by passenger class. For example:


In [None]:
plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass',y='Age',data=train,palette='winter')

We can see the wealthier passengers in the higher classes tend to be older, which makes sense. We'll use these average age values to impute based on Pclass for Age.

In [None]:
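def impute_age(cols):
    # Look up values by label; positional access like cols[0] is deprecated in recent pandas.
    Age = cols['Age']
    Pclass = cols['Pclass']

    if pd.isnull(Age):
        # Fill with a typical age for the passenger's class.
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age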

Now apply that function!

In [None]:
train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)
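
The class-based fill can also be written without hard-coded ages. The sketch below is one possible alternative, not the approach used in this notebook: it assumes the per-class median is an acceptable fill value and uses a made-up toy frame.

```python
import numpy as np
import pandas as pd

# Toy data: a few passengers per class, some with missing ages (made up).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, 36.0, np.nan, 30.0, np.nan, 20.0, 28.0, np.nan],
})

# Median age within each class, broadcast back to every row.
class_median = toy.groupby("Pclass")["Age"].transform("median")

# Keep known ages; fill only the gaps with the class median.
toy["AgeFilled"] = toy["Age"].fillna(class_median)
print(toy["AgeFilled"].tolist())
# [40.0, 36.0, 38.0, 30.0, 30.0, 20.0, 28.0, 24.0]
```

Because the fill values come from the data, the same two lines keep working if the class medians shift, for example after filtering rows.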

In [None]:
train['Embarked'] = train['Embarked'].fillna('S')


Now let's check that heat map again!

In [None]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')

Great! Let's go ahead and drop the Cabin column, along with any rows that still contain missing values.

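We could also sum the family-member counts (SibSp and Parch) into a single feature; the cells below keep them as separate columns.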

In [None]:
train.drop('Cabin',axis=1,inplace=True)

In [None]:
train.head()

In [None]:
train.dropna(inplace=True)

## Converting Categorical Features
We'll need to convert categorical features to dummy variables using pandas! Otherwise, our machine learning algorithm won't be able to take those features in directly as inputs.

In [None]:
train.info()

In [None]:
# dtype=int keeps the dummy columns as 0/1 integers rather than booleans.
sex = pd.get_dummies(train['Sex'],drop_first=True,dtype=int)
embark = pd.get_dummies(train['Embarked'],drop_first=True,dtype=int)

In [None]:
train.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)

In [None]:
train = pd.concat([train,sex,embark],axis=1)
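
To see what `drop_first=True` does, here is a small sketch on a made-up column. With three ports, the full encoding has three columns that always sum to 1, so one of them is redundant; dropping the first (alphabetically, `C`) leaves it as the implicit baseline.

```python
import pandas as pd

# Toy column with the three embarkation codes (rows are made up).
toy = pd.DataFrame({"Embarked": ["S", "C", "Q"]})

# Full one-hot encoding: one column per category.
full = pd.get_dummies(toy["Embarked"], dtype=int)
print(list(full.columns))     # ['C', 'Q', 'S']

# drop_first=True removes the first category, which becomes the baseline.
reduced = pd.get_dummies(toy["Embarked"], drop_first=True, dtype=int)
print(list(reduced.columns))  # ['Q', 'S']

# A row of all zeros now means "embarked at C".
print(reduced.values.tolist())  # [[0, 1], [0, 0], [1, 0]]
```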

In [None]:
train.head()

### Great! Our data is ready for our model!

## Building a Logistic Regression model
Let's start by splitting our data into a training set and a test set (there is another test.csv file that you can play around with in case you want to use all this data for training).

## Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train.drop(['Survived'],axis=1), 
                                                    train['Survived'], test_size=0.10, 
                                                    random_state=101)
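
With a small hold-out set, the class balance in the test split can drift from the full data. The sketch below shows the optional `stratify` argument of `train_test_split` on made-up arrays; the split above does not use it, so treat this as a variation rather than what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features, 60% class 0 and 40% class 1 (made up).
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 12 + [1] * 8)

# stratify=y keeps the 60/40 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=101, stratify=y
)

print(X_tr.shape, X_te.shape)  # (15, 2) (5, 2)
print(int(y_te.sum()))         # 2 positives out of 5 test rows
```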

## Training and Predicting

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

In [None]:
predictions = logmodel.predict(X_test)
X_test.head()

In [None]:
predictions
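
It can help to see what `predict` does under the hood. This sketch fits a logistic regression on synthetic data from `make_classification` (not the Titanic features) and reproduces the predicted probabilities by hand from the fitted coefficients, assuming the default binary setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem, used only for illustration.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Probability of class 1 as reported by scikit-learn.
proba = model.predict_proba(X)[:, 1]

# Same quantity by hand: sigmoid of the linear score w.x + b.
z = X @ model.coef_.ravel() + model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# predict() returns class 1 wherever the linear score is positive,
# which corresponds to a class-1 probability above 0.5.
manual_labels = (z > 0).astype(int)

print(np.allclose(proba, manual_proba))
print((manual_labels == model.predict(X)).all())
```

If a different cut-off than 0.5 suits the problem better, thresholding `predict_proba` directly is one way to apply it.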

## Evaluation

We can check precision, recall, and F1-score using a classification report!

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

In [None]:
print(confusion_matrix(y_test,predictions))

In [None]:
print(classification_report(y_test,predictions))
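
The numbers in the classification report come straight from the confusion matrix. Here is a small worked example on made-up labels that computes precision, recall, and F1 for the positive class by hand and checks them against scikit-learn.

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up true and predicted labels.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields (tn, fp, fn, tp) in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 4 1 1 4

precision = tp / (tp + fp)  # of predicted survivors, how many survived
recall = tp / (tp + fn)     # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall)    # 0.8 0.8

# The hand-computed values agree with scikit-learn's helpers.
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12
```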

# Decision Tree Classification

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dt_model=DecisionTreeClassifier()
dt_model.fit(X_train,y_train)

In [None]:
dt_pred = dt_model.predict(X_test)

In [None]:
print(confusion_matrix(y_test,dt_pred))

In [None]:
print(classification_report(y_test,dt_pred))

# Random Forest Classification

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf = RandomForestClassifier(n_estimators=500)
rf.fit(X_train,y_train)

In [None]:
rf_pre=rf.predict(X_test)

In [None]:
print(confusion_matrix(y_test,rf_pre))

In [None]:
print(classification_report(y_test,rf_pre))

# XGBoost Classifier

In [None]:
from xgboost import XGBClassifier
xgboost = XGBClassifier(n_estimators=1000)
xgboost.fit(X_train,y_train)

In [None]:
xg_pred = xgboost.predict(X_test)

In [None]:
print(confusion_matrix(y_test,xg_pred))

In [None]:
print(classification_report(y_test,xg_pred))
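
Because the hold-out set here is only 10% of the data, scores from a single split can be noisy. One way to compare the models more fairly is cross-validation. The sketch below uses synthetic data and only scikit-learn estimators; the model names in the dictionary and the smaller `n_estimators` are our own choices to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared feature matrix and labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=101)

# Candidate models; names are labels for the printout only.
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=101),
    "forest": RandomForestClassifier(n_estimators=50, random_state=101),
}

# Mean accuracy over 5 folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Each model is trained and scored five times on different folds, so the averages are less sensitive to which rows happen to land in the test set.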

# ANN

In [None]:
import keras 
from keras.layers import Dense
from keras.models import Sequential

In [None]:
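ann = Sequential()
# kernel_initializer is the current name for the old Keras 1 `init` argument.
ann.add(Dense(units=32, kernel_initializer='uniform', activation='relu', input_dim=9))
ann.add(Dense(units=32, kernel_initializer='uniform', activation='relu'))
# A single sigmoid unit outputs the predicted probability of survival.
ann.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))
ann.compile(optimizer='adam',
              loss='mean_squared_error',
              metrics=['accuracy'])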

In [None]:
# `epochs` replaces the old `nb_epoch` argument.
ann.fit(X_train,y_train, batch_size=32, epochs=300,verbose= 0)

In [None]:
ann_pred = ann.predict(X_test)
ann_pred = [ 1 if y>=0.5 else 0 for y in ann_pred]
print(ann_pred)

In [None]:
print(confusion_matrix(y_test,ann_pred))

In [None]:
print(classification_report(y_test,ann_pred))
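
The sigmoid output layer returns one probability per row, with shape `(n_samples, 1)`. As a small sketch, the list comprehension used above can also be written as a vectorized NumPy threshold. The `probs` array here is a made-up stand-in for the output of `ann.predict`.

```python
import numpy as np

# Stand-in for ann.predict(X_test): one probability per row, shape (4, 1).
probs = np.array([[0.91], [0.12], [0.50], [0.49]])

# Threshold at 0.5, convert booleans to 0/1, and flatten to shape (4,).
labels = (probs >= 0.5).astype(int).ravel()
print(labels.tolist())  # [1, 0, 1, 0]
```

Flattening with `ravel()` gives a 1-D array, which is the shape `confusion_matrix` and `classification_report` expect.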

Now we will apply the same preprocessing to the test dataset and predict on it.

In [None]:
test = pd.read_csv('../input/test.csv')

In [None]:
sns.heatmap(test.isnull())

In [None]:
test.drop('Cabin',axis=1,inplace=True)

In [None]:
# Assign the result back; in-place fillna on a selected column is unreliable in recent pandas.
test['Fare'] = test['Fare'].fillna(test['Fare'].median())

In [None]:
test.info()

In [None]:
test.head()

In [None]:
test['Age'] = test[['Age','Pclass']].apply(impute_age,axis=1)

In [None]:
# dtype=int keeps the dummy columns as 0/1 integers rather than booleans.
sex_test = pd.get_dummies(test['Sex'],drop_first=True,dtype=int)
embark_test= pd.get_dummies(test['Embarked'],drop_first=True,dtype=int)

In [None]:
test.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)

In [None]:
test = pd.concat([test,sex_test,embark_test],axis=1)
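
The model expects the test columns to match the training features in both name and order. Dummy encoding can break this if a category is absent from one file. The sketch below, on made-up frames, shows one defensive option using `reindex`; the frame names `train_features` and `test_features` are hypothetical.

```python
import pandas as pd

# Made-up training features, in the order the model was fitted on.
train_features = pd.DataFrame({
    "Pclass": [1, 3], "Age": [38.0, 22.0],
    "male": [0, 1], "Q": [0, 0], "S": [1, 1],
})

# Made-up test features: different column order, and no "Q" column.
test_features = pd.DataFrame({
    "Age": [30.0], "S": [1], "male": [1], "Pclass": [2],
})

# Reorder to the training layout; missing dummy columns are filled with 0.
aligned = test_features.reindex(columns=train_features.columns, fill_value=0)
print(list(aligned.columns))  # ['Pclass', 'Age', 'male', 'Q', 'S']
print(aligned["Q"].tolist())  # [0]
```

Filling a missing dummy column with 0 is only appropriate because it means "this category never occurs in the test rows".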

In [None]:
test.head()

In [None]:
train.head()

In [None]:
# `epochs` replaces the old `nb_epoch` argument.
ann.fit(train.drop(['Survived'],axis=1),train['Survived'] , epochs=300,verbose= 0)

In [None]:
test_prediction = ann.predict(test)
test_prediction = [ 1 if y>=0.5 else 0 for y in test_prediction]

In [None]:
test_pred = pd.DataFrame(test_prediction, columns= ['Survived'])

In [None]:
new_test = pd.concat([test, test_pred], axis=1, join='inner')

In [None]:
new_test.head()

In [None]:
df= new_test[['PassengerId' ,'Survived']]

In [None]:
df.head()

In [None]:
df.to_csv('predictions.csv' , index=False)
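
The last few cells can be condensed into a single step. This sketch builds a two-column frame directly from the passenger IDs and the predicted labels; the IDs and predictions here are made up, and `submission` is our own variable name.

```python
import pandas as pd

# Made-up passenger IDs and predicted labels of the same length.
passenger_ids = pd.Series([892, 893, 894], name="PassengerId")
predicted = [0, 1, 0]

# Build the two-column frame in one step; to_numpy() avoids index alignment.
submission = pd.DataFrame({
    "PassengerId": passenger_ids.to_numpy(),
    "Survived": predicted,
})

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Building the frame from plain arrays sidesteps any dependence on the two objects sharing the same index.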
