# Titanic Survival Predictions via a Logistic Regression Model
We will be using the <a href="https://www.kaggle.com/tedllh/titanic-train">Titanic data set from kaggle</a>. This is a binary classification problem: each passenger either died or survived.

### Importing Libraries

In [None]:
import pandas as pd # for dataframes
import numpy as np
import matplotlib.pyplot as plt # for plotting
import seaborn as sns # for data visualization

%matplotlib inline

### Loading the Data Set

In [None]:
train = pd.read_csv("../input/titanic-train/titanic_train.csv")

In [None]:
train.head()

Now that we have the data, what is our problem? We will use features like Sex, Age, Pclass, etc. to predict whether a person **Survived** or not.

Of course, we will preprocess the data to see which parts are necessary and which are not. We will drop some features that don't help our prediction, such as Name, Cabin and others.

And this will be done after data visualization :)

## Exploratory Data Analysis (EDA)
Let's search for missing values!

In [None]:
train.isnull().sum()
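
Raw counts are easier to judge as a share of all rows. The sketch below shows one way to get that share; the `toy` frame and its values are made up for illustration and only stand in for `train`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (values are invented for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() gives a boolean frame; the mean of each boolean column is the
# fraction of missing entries in that column
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data the same expression would be `train.isnull().mean()`.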

#### Note:
Whenever you need to show null or missing values, *always go for **Data Visualization** here*.

## Heatmap

In [None]:
# using seaborn
sns.heatmap(train.isnull(), yticklabels = False, cmap = 'viridis', cbar = False)

In the heatmap above, the missing values can be seen very clearly. This makes it a good choice for presentations in an industry setting.

## Countplot

In [None]:
# countplot of deaths vs. survivals
sns.set_style("whitegrid")
sns.countplot(x = "Survived", data = train, palette = 'rainbow')

In [None]:
# countplot for male and female
sns.set_style('darkgrid')
sns.countplot(x = "Sex", data = train, palette = 'rocket')

In [None]:
# countplot of survival, split by Pclass
sns.set_style('whitegrid')
sns.countplot(x = "Survived", hue = 'Pclass', data = train)

## Histogram

In [None]:
# distribution of passenger ages
train['Age'].hist(bins=40, color='darkred', alpha=0.5)

In [None]:
train['Fare'].hist(bins=20, color='purple', alpha = 0.7, figsize = (10, 5))

## Data cleaning
We are going to fill in the missing values. As the heatmap showed, we cannot afford to drop the rows with missing values, because we would lose a large amount of data.

Replacing missing values is known as imputation. So, we are going to impute average values.

---
Getting the average Age

In [None]:
plt.figure(figsize = (10, 5))
sns.boxplot(x = 'Pclass', y = 'Age', data = train, palette = 'winter')

Reading rough values off the boxplot, let's say the average age is:
- 1st class ==> 37
- 2nd class ==> 28
- 3rd class ==> 25
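
Instead of reading the values off the plot, the per-class averages can also be computed directly. This is a minimal sketch on a made-up `toy` frame (not the real data), so the numbers it prints are only illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train[['Pclass', 'Age']] (values are invented)
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [40.0, 34.0, 30.0, np.nan, 20.0, 30.0],
})

# mean and median per class; both skip NaN values by default
age_by_class = toy.groupby("Pclass")["Age"].agg(["mean", "median"])
print(age_by_class)
```

On the real data the same call would be `train.groupby('Pclass')['Age'].agg(['mean', 'median'])`.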

---
Now we will write a function that replaces every null Age with the average age of that passenger's Pclass.

## Imputing Average Age

In [None]:
# function for imputing age
def impute_age(cols):
    # cols is one row holding the 'Age' and 'Pclass' values;
    # label-based access avoids the deprecated positional cols[0] / cols[1]
    Age = cols['Age']
    Pclass = cols['Pclass']

    # only null ages are replaced
    if pd.isnull(Age):
        # returning avg. age (37) for 1st class
        if Pclass == 1:
            return 37
        # for 2nd class age (28)
        elif Pclass == 2:
            return 28
        # for 3rd class age (25)
        else:
            return 25
    else:
        return Age

In [None]:
# applying the above function row by row (axis = 1)
train['Age'] = train[['Age', 'Pclass']].apply(impute_age, axis = 1)
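
The same idea can be written without hard-coded ages. The sketch below is an alternative, not what this notebook runs: it fills each missing Age with the mean Age of that row's Pclass, again on a made-up `toy` frame.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train[['Pclass', 'Age']] (values are invented)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [30.0, np.nan, 44.0, 20.0, 30.0, np.nan],
})

# transform('mean') returns, for every row, the mean Age of that row's Pclass
class_mean = toy.groupby("Pclass")["Age"].transform("mean")

# fillna only touches the rows where Age is missing
toy["Age"] = toy["Age"].fillna(class_mean)
print(toy)
```

This keeps the imputed values in sync with the data if the data set changes.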

In [None]:
sns.heatmap(train.isnull(), yticklabels = False, cmap = 'viridis', cbar = False)

### Dropping unnecessary columns
Such as Cabin (dropped here), and later Name and Ticket.

In [None]:
# cabin
train.drop('Cabin', inplace = True, axis = 1)

In [None]:
sns.heatmap(train.isnull())

In [None]:
train.shape # 1 column is dropped now

In [None]:
train.head()

## Converting Categorical Features
As we know, an ML model like this one works only on numerical values, not categorical ones.

In [None]:
train.info() # 'object' columns hold strings here

---

As we have **object(4)**, there are 4 string-type columns (Name, Sex, Ticket and Embarked). Of these, Sex and Embarked need to be converted into numerical values. Name and Ticket are dropped below, since identifiers like these don't affect whether a person lives or dies.

---


In [None]:
sex = pd.get_dummies(train['Sex'], drop_first = True)
embark = pd.get_dummies(train['Embarked'], drop_first = True)

In [None]:
sex # into numerical values

In [None]:
embark # into numerical
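
To see what `drop_first = True` does, here is a small sketch on a made-up Series (the values are illustrative, not taken from the data set). With three categories, two dummy columns are enough: a row with zeros in both must belong to the dropped category.

```python
import pandas as pd

# Toy stand-in for train['Embarked'] (values are invented)
embarked = pd.Series(["S", "C", "Q", "S"], name="Embarked")

full = pd.get_dummies(embarked)                      # one column per category
reduced = pd.get_dummies(embarked, drop_first=True)  # first category dropped

print(list(full.columns))     # ['C', 'Q', 'S']
print(list(reduced.columns))  # ['Q', 'S']
print(reduced)
```

Dropping one column avoids feeding the model a set of columns that always sum to 1.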

In [None]:
# dropping extra cols
train.drop(['Sex', 'Embarked', 'Name', 'Ticket'], axis = 1, inplace = True)

In [None]:
train # categorical values dropped here!

## Concat Numerical Values
Now the next step is to combine or concatenate the features that we just converted from categorical to numerical.

In [None]:
train = pd.concat([train, sex, embark], axis = 1)

In [None]:
train # all values in numerical form! yay!!

# Building Logistic Regression Model
## Machine Learning Model
As usual, we will split the data into two parts, i.e. train and test.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# 'Survived' is the target we want to predict
Y = train['Survived'] # Y == Survived column

In [None]:
# all other columns are the features, so we drop 'Survived' and keep the rest
X = train.drop(['Survived'], axis=1) # X == all cols, excluding Survived column

In [None]:
X

In [None]:
# splitting into training (80%) and testing (20%) data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 101)
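
As a quick sanity check of what the split returns, here is a sketch on made-up data. The `stratify` argument is an addition that the cell above does not use; it keeps the class balance the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and a balanced toy target (values are invented)
X_toy = pd.DataFrame({"feature": range(20)})
y_toy = pd.Series([0, 1] * 10, name="Survived")

# 80/20 split; stratify=y_toy keeps the 50/50 class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=101, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts())
```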

## Training and Prediction via Model

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logReg = LogisticRegression(max_iter = 1000) # higher max_iter so the solver can converge on unscaled features

In [None]:
# training our Logistic Regression Model here
logReg.fit(X_train, Y_train)

In [None]:
# making predictions on Testing data
predictions = logReg.predict(X_test)

## Evaluation

In [None]:
from sklearn.metrics import classification_report

In [None]:
# using actual testing data and the predictions our Model just made
print(classification_report(Y_test, predictions)) # precision, recall, f1-score and accuracy
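
The numbers in the report come from counting correct and wrong predictions per class. The sketch below shows that counting on made-up labels; `y_true` and `y_pred` are illustrative and are not the notebook's results.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels: 0 = died, 1 = survived
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# rows = actual class, columns = predicted class
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

print(cm)   # [[3 1]
            #  [1 3]]
print(acc)  # 0.75
```

For the model above, the same calls would take `Y_test` and `predictions`.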

## Making a .csv File of the Predictions
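This file can be submitted on <a href="http://kaggle.com/">kaggle</a> or used for any other purpose where you want to show your predictions. Note that the predictions below are made on the training data `X` itself, so they illustrate the file format rather than performance on unseen passengers.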

In [None]:
# predicting for every row of X, so the lengths match when building the DataFrame
pred = logReg.predict(X)

In [None]:
pred.shape

In [None]:
X.shape # same number of Rows

In [None]:
# making our own DataFrame named 'submission'
submission = pd.DataFrame({
    'PassengerId' : X['PassengerId'],
    'Survived' : pred
})

In [None]:
submission.head()
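
To actually get a .csv out of the DataFrame, `to_csv` can be used. The sketch below writes a made-up frame to an in-memory buffer so the result can be inspected; `submission_demo` is a hypothetical stand-in for `submission`.

```python
import io
import pandas as pd

# Toy stand-in for the submission frame (values are invented)
submission_demo = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
})

# index=False keeps the row index out of the file, leaving just the two columns
buffer = io.StringIO()
submission_demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

For a file on disk, pass a path instead of the buffer, for example `submission.to_csv('submission.csv', index = False)`, where the file name is only an example.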