In this notebook, we will analyze the Titanic dataset.  
The process followed for the data analysis is as below:  
1. Reading the Data  
    a. Loading the data  
    b. Using info(), describe() to view the data  
        
2. Data Cleaning  
    a. Checking Null values in df  
    b. Imputing the null values  
        
3. Exploratory Data Analysis  
    a. Plotting graphs against target variable  
    b. Plotting correlation heatmap  
        
4. Machine Learning - Preprocessing  
    a. Train Test Split  
    b. Standard Scaling for numerical variables  
    c. One Hot Encoding for categorical variables  
        
5. Machine Learning - Modelling  
    a. Logistic Regression  
    b. Decision Tree  
    c. Random Forest  

Importing the Libraries

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Reading the Data

In [None]:
df=pd.read_csv("../input/taitanictrain/datasets_11657_16098_train.csv")
df.head()

In [None]:
df.shape

We now know the dimensions of the dataset and its constitution.

In [None]:
df.nunique()

From this, we see that PassengerId and Name are unique for each observation.  
They act as row identifiers, so as they stand they would have no bearing on the analysis.  
Hence, we can remove these columns from our dataframe.
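
The check described above can be sketched on a small made-up frame. The `toy` data and the `id_like` name are illustrative assumptions, not part of the Titanic file. Note that a continuous numeric column can also be all-unique, so the result is a list of candidates to inspect rather than an automatic drop list.

```python
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up)
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Name": ["A", "B", "C", "D"],
    "Pclass": [1, 3, 3, 2],
    "Survived": [1, 0, 0, 1],
})

# A column is an identifier candidate when every row holds a distinct value
n_unique = toy.nunique()
id_like = n_unique[n_unique == len(toy)].index.tolist()
print(id_like)
```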

In [None]:
df = df.drop(['PassengerId','Name'], axis=1)

In [None]:
df.info()

From the above, we can see the structure of the data.  
The insights we can get from it are:
 - Total number of variables - 10
 - Variables and their types  
 - Number of categorical variables (object) - 4
 - Number of numerical variables (int64 or float64) - 6
 - Variables that have null values - Age, Cabin and Embarked

Knowing the above, we can delve deeper into handling null values.  
Had the Non-Null Count been 891 for all, we would have skipped this step.
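
The same summary can be derived programmatically instead of being read off the `info()` output. This is a minimal sketch on a made-up frame; the `toy` data and the three list names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with mixed types and a few missing values (values are made up)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, np.nan, 26.0],
    "Pclass": [3, 1, 3],
    "Embarked": ["S", "C", None],
})

# Numerical columns are int/float; everything else is treated as categorical here
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Columns that contain at least one null value
cols_with_nulls = toy.columns[toy.isnull().any()].tolist()

print(len(toy.columns), categorical_cols, numerical_cols, cols_with_nulls)
```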

In [None]:
df.isnull().sum()

Let's visualize the above for a better view.

In [None]:
plt.figure(figsize = (10,7))
sns.heatmap(df.isnull(), yticklabels=False, cmap='ocean');

In [None]:
687/891

We see that the null values from Cabin column account for 77% of total number of rows.  
Also, inspecting the column, we see it is a categorical variable with 147 categories.  
Hence it would not be logical to keep that column for our analysis.  
We will drop the Cabin column from our dataframe.
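
The share of missing values can be computed for every column at once, which makes the drop decision explicit. This is a sketch on made-up data; the 0.5 `threshold` is a hypothetical cut-off chosen for illustration, not a rule taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (values are made up); Cabin is mostly missing, Age only slightly
toy = pd.DataFrame({
    "Cabin": ["C85", np.nan, np.nan, np.nan, "E46"],
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86],
})

# Mean of the boolean null mask = fraction of missing rows per column
null_share = toy.isnull().mean()

threshold = 0.5  # hypothetical cut-off
to_drop = null_share[null_share > threshold].index.tolist()
cleaned = toy.drop(columns=to_drop)

# The figure quoted in the text: 687 missing Cabin values out of 891 rows
cabin_share = 687 / 891
print(to_drop, round(cabin_share, 2))
```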

In [None]:
df = df.drop('Cabin', axis=1)

Let's check whether the column has been dropped successfully.

In [None]:
df.isnull().sum()

Now, let's handle the next column with the most null values - Age.  
As Age is a numerical (float64) variable, we could impute using the mean or the median.  
Let us analyze it further before making any decision.

In [None]:
df.Age.describe().round(2)

Let's visualize the Age data using a boxplot.

In [None]:
sns.boxplot(data = df, x = df.Age)

In [None]:
df.Age.median()

We could simply impute the missing values with the median.  
But before that, let us break down the Age column by other categorical variables.

In [None]:
sns.boxplot(data = df, x = df.Sex, y = df.Age)

We see that there is no significant difference between the median ages of the two sexes.  
Let us do this with another categorical variable.

In [None]:
sns.boxplot(data=df, x=df.Pclass, y=df.Age)

Here, we can clearly see the difference between Age values for each Pclass.  
Hence, it would be logical to impute the Age according to the Pclass.

To confirm our visualization is correct, let us look at the numbers themselves.

In [None]:
df[df['Pclass']==1].Age.median()

In [None]:
df[df['Pclass']==2].Age.median()

In [None]:
df[df['Pclass']==3].Age.median()
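
The three per-class lookups above can be collapsed into a single `groupby` call, and the same grouping can drive the imputation. This is a sketch on made-up ages; the `toy` frame and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: two classes, one missing age in each (values are made up)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# One median per class, ignoring missing values
class_medians = toy.groupby("Pclass")["Age"].median()

# transform('median') broadcasts each class median back to every row of that class,
# so fillna picks the median of the row's own class
filled = toy["Age"].fillna(toy.groupby("Pclass")["Age"].transform("median"))

print(class_medians.to_dict())  # {1: 45.0, 3: 25.0}
print(filled.tolist())          # [40.0, 50.0, 45.0, 20.0, 25.0, 30.0]
```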

We see that imputing Age with the median age of the passenger's Pclass is more meaningful.  
Hence, we will impute the data accordingly.

In [None]:
df['Age'] = df['Age'].fillna(df.groupby('Pclass').Age.transform('median'))

We have imputed the null values in the Age column according to their Pclass median.  
Let's check the null values again.

In [None]:
df.isnull().sum()

We still have to handle the null values in the Embarked column.  
Let's understand that column better to make a decision.

In [None]:
df.Embarked.value_counts()

Since category S makes up the largest share of the Embarked column, the missing values are most likely to belong to category S as well.  
Hence, we will replace the missing values of the Embarked column with the value S.

In [None]:
df.Embarked = df.Embarked.fillna('S')
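
Rather than hard-coding the letter, the most frequent category can be looked up with `mode()`. This is a sketch on a made-up series; `embarked` and `most_common` are illustrative names.

```python
import pandas as pd

# Toy Embarked column with two missing entries (values are made up)
embarked = pd.Series(["S", "C", "S", None, "Q", "S", None], name="Embarked")

# mode() ignores missing values and returns the most frequent value(s); take the first
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)

print(most_common, filled.isnull().sum())  # S 0
```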

Let's check if we were able to successfully execute it.

In [None]:
df.Embarked.isnull().sum()

Now let's have an overall view of the dataframe after all the cleaning.

In [None]:
df.info()

## Exploratory Data Analysis

Now that we have cleaned the data, let us proceed with EDA to better understand the data.

We want to see the relationship of the input variables with respect to the output variable.  
Here our output variable is 'Survived.'  
Hence we will plot our charts to show the distribution of each variable with Survived data.

Before that, let's look at the distribution of Survived.

In [None]:
sns.countplot(data = df, x = df.Survived)

#### Pclass

In [None]:
sns.countplot(data = df, x = df.Pclass, hue = df.Survived)

Here we can immediately see a pattern. The better your passenger class (Pclass 1 being the highest), the greater your chance of surviving.  
A passenger in Pclass 3 has a high probability of not surviving.
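
The pattern read off the count plot can be put into numbers as a survival rate per class. This is a sketch on made-up rows, so the rates below say nothing about the real dataset; the `toy` frame is an assumption for illustration.

```python
import pandas as pd

# Toy frame (values are made up)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Survived is 0/1, so its mean within each class is the survival rate of that class
rate_by_class = toy.groupby("Pclass")["Survived"].mean()
print(rate_by_class.round(2).to_dict())  # {1: 0.67, 2: 0.5, 3: 0.25}
```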

#### Age

In [None]:
sns.barplot(data=df, x=df.Survived, y=df.Age)

In [None]:
sns.boxplot(data=df, x=df.Survived, y=df.Age)

Not much inference can be derived from the Age column.  
However, the boxplot does match our understanding of Titanic survivors.  
The people who survived were mostly either old people, children or women.  
Young and middle-aged men were given lower priority due to the dearth of lifeboats.
  
Hence, it makes sense that the Age distribution of those who survived is more widespread than that of those who didn't survive.

#### SibSp

In [None]:
df.SibSp.value_counts()

SibSp refers to the number of siblings/spouses aboard.

In [None]:
sns.countplot(data=df, x=df.SibSp, hue=df.Survived)
plt.legend(bbox_to_anchor=(1.05, 1))

To an untrained eye, there doesn't seem to be much to interpret from the above plot.

#### Parch

In [None]:
df.Parch.value_counts()

Parch refers to the number of parents/children aboard.

In [None]:
sns.countplot(data=df, x=df.Parch, hue=df.Survived)
plt.legend(bbox_to_anchor=(1.05, 1))

Again, this plot seems very similar to the earlier plot.  
Not much can be inferred from it.

#### Ticket

In [None]:
df.Ticket.value_counts()

In [None]:
df.Ticket.nunique()

For a dataset with a length of 891, we have 681 unique values.  
With so many distinct values, there may not be much to infer from this column.
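
A quick worked calculation makes the point concrete, using only the two counts quoted above. The variable names are illustrative.

```python
n_rows = 891            # rows in the dataset
n_unique_tickets = 681  # distinct Ticket values

# Share of distinct values: close to 1 means the column behaves almost like an identifier
unique_ratio = n_unique_tickets / n_rows

# Average number of passengers sharing one ticket value
avg_rows_per_ticket = n_rows / n_unique_tickets

print(round(unique_ratio, 3), round(avg_rows_per_ticket, 2))  # 0.764 1.31
```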

In [None]:
sns.countplot(data=df, x=df.Ticket, hue=df.Survived)

From the above plot, we can clearly say there is no significant pattern between passengers who have survived and their Ticket number.

We will drop the Ticket column from our dataset. We were able to come to this conclusion thanks to EDA.

In [None]:
df = df.drop('Ticket', axis=1)

In [None]:
df.info()

#### Fare

In [None]:
plt.figure(figsize = (10,4))
sns.boxplot(data=df, x=df.Fare, y=df.Survived, orient='h', palette='viridis');


In [None]:
sns.barplot(data=df, x=df.Survived, y=df.Fare, palette='viridis')

From the above plots, we see that those who had paid a higher Fare were more likely to survive.

#### Embarked

This variable refers to the port of embarkation.  
(C = Cherbourg; Q = Queenstown; S = Southampton)

In [None]:
df.Embarked.value_counts()

In [None]:
sns.countplot(data=df, x=df.Embarked, hue=df.Survived, palette='viridis')

Logically, the port of embarkation shouldn't have any correlation with survival.

## Machine Learning Modelling

### One Hot encoding

In [None]:
# Note: LabelEncoder is imported here but not used below; pd.get_dummies does the one-hot encoding
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

In [None]:
df.head()

In [None]:
df = pd.get_dummies(df, drop_first = True)
df.head()
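
To see what `drop_first=True` does, here is a sketch on a made-up frame (the `toy` data is an assumption). For each categorical column the first category in sorted order is dropped and becomes the implicit baseline, so a column with k categories yields k-1 indicator columns.

```python
import pandas as pd

# Toy frame with two categorical columns and one numerical column (values are made up)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Fare": [7.25, 71.28, 7.92],
})

encoded = pd.get_dummies(toy, drop_first=True)

# 'female' and 'C' are the dropped baselines; Fare passes through unchanged
print(sorted(encoded.columns))
```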

### Feature Selection

In [None]:
plt.figure(figsize = (12,12))
sns.heatmap(df.corr(), annot = True, cmap = 'Blues');

There don't seem to be any strongly correlated variables that could cause a multicollinearity problem in the analysis.  
Hence, we can proceed further.
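
Reading a large heatmap by eye is error-prone, so the same check can be done numerically by listing feature pairs whose absolute correlation exceeds a cut-off. This is a sketch on synthetic data; the `toy` columns and the 0.8 `threshold` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'a_copy' is almost a rescaled copy of 'a', 'b' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2 + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle (above the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.8  # hypothetical cut-off
high_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold  # NaN comparisons evaluate to False
]
print(high_pairs)
```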


### Splitting the dataset 
i.e. Removing the Dependent variable from the dataset into a new dataframe

In [None]:
# Independent Variable
X = df.drop('Survived', axis = 1)
X.head()

In [None]:
# Dependent Variable
y = df['Survived']
y.head()

### Feature Importance

In [None]:
from sklearn.ensemble import ExtraTreesRegressor
model = ExtraTreesRegressor()
model.fit(X,y)

In [None]:
model.feature_importances_

In [None]:
feature_plot = pd.Series(model.feature_importances_, index = X.columns)
feature_plot

In [None]:
feature_plot.nlargest(10).plot(kind = "barh")
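
The cells above use `ExtraTreesRegressor`. Since the target here is a 0/1 class label, `ExtraTreesClassifier` is an alternative worth considering; this is a swapped-in technique, not the notebook's own method. The sketch below runs on synthetic data, and the `signal` and `noise` columns are made up so that the expected ranking is known.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data: the label depends only on 'signal'; 'noise' is irrelevant
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y_toy = (X_toy["signal"] > 0).astype(int)

clf = ExtraTreesClassifier(n_estimators=50, random_state=0)
clf.fit(X_toy, y_toy)

# Impurity-based importances are normalized to sum to 1
importances = pd.Series(clf.feature_importances_, index=X_toy.columns)
print(importances.sort_values(ascending=False))
```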

### Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_sc = sc.fit_transform(X[['Age','Fare']])

In [None]:
X.columns

In [None]:
X_df = pd.DataFrame(X_sc, columns = ['Age','Fare'])
X_df.head()

In [None]:
X_rem = X.drop(['Age','Fare'], axis = 1)

In [None]:
X_fin =pd.concat([X_df, X_rem], axis = 1) 
X_fin.head()
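
The outline at the top of this notebook lists the train/test split before scaling. A hedged sketch of that order is shown below on synthetic data: the scaler is fitted on the training rows only, so no statistics from the test rows leak into preprocessing. The `X_toy` frame and the `num_cols` list are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (values are made up)
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "Age": rng.uniform(1, 80, size=100),
    "Fare": rng.uniform(5, 500, size=100),
    "Sex_male": rng.integers(0, 2, size=100),
})
y_toy = pd.Series(rng.integers(0, 2, size=100))

# Split first, then copy so the assignments below do not touch the original frame
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)
X_tr, X_te = X_tr.copy(), X_te.copy()

num_cols = ["Age", "Fare"]
scaler = StandardScaler()
X_tr[num_cols] = scaler.fit_transform(X_tr[num_cols])  # learn mean/std on train only
X_te[num_cols] = scaler.transform(X_te[num_cols])      # reuse the train statistics

print(X_tr[num_cols].mean().round(6).tolist())
```

Assigning back into the original columns also keeps the row index intact, which avoids having to concatenate the scaled and unscaled parts afterwards.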

### Train Test Split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_fin, y, test_size = 0.3, random_state = 0)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape
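
The shapes can be sanity-checked by hand. As far as I know, scikit-learn rounds a fractional `test_size` up to the next whole row and gives the remainder to the training set; under that assumption, the arithmetic for 891 rows is:

```python
import math

n_samples = 891
test_size = 0.3

# Assumed rounding rule: test rows are rounded up, training rows get the rest
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 623 268
```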

### Model Building

In [None]:
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression()
log_model.fit(X_train, y_train)

In [None]:
log_preds = log_model.predict(X_test)

### Checking Accuracy of Model

In [None]:
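from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, ConfusionMatrixDisplay
# plot_confusion_matrix was removed in scikit-learn 1.2; alias the old name to its replacement (needs >= 1.0)
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator
confusion_matrix(y_test, log_preds)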

In [None]:
round(accuracy_score(y_test, log_preds), 2)

In [None]:
plot_confusion_matrix(log_model, X_test, y_test, cmap = "GnBu")

In [None]:
print(classification_report(y_test, log_preds))
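
The numbers in the report all derive from the four cells of the confusion matrix. Here is a small worked sketch on made-up labels (`y_true` and `y_pred` are assumptions, not this notebook's predictions):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 10 passengers, 2 of them misclassified
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)

# For binary labels the matrix flattens to: true neg, false pos, false neg, true pos
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # correct predictions over all predictions
precision = tp / (tp + fp)       # of those predicted 1, how many really are 1
recall = tp / (tp + fn)          # of those really 1, how many were found

print(accuracy, precision, recall)  # 0.8 0.8 0.8
```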

### Automating the model

In [None]:
def algo(algorithm):
    # Fit the given estimator class with its default settings and report test-set metrics
    model = algorithm()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    # round() works whether accuracy_score returns a Python float or a NumPy float
    print('The accuracy of {} model is {}'.format(model, round(accuracy_score(y_test, preds), 2)))
    print("\n")
    print("CLASSIFICATION REPORT",'\n',classification_report(y_test, preds),"\n")
    print("CONFUSION MATRIX",'\n')
    plot_confusion_matrix(model, X_test, y_test, cmap = "GnBu")

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
algo(DecisionTreeClassifier)

### Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
algo(RandomForestClassifier)

### K Nearest Neighbor

In [None]:
from sklearn.neighbors import KNeighborsClassifier
algo(KNeighborsClassifier)

### Support Vector Machine (SVM)

In [None]:
from sklearn.svm import SVC
algo(SVC)

### Linear SVC

In [None]:
from sklearn.svm import LinearSVC
algo(LinearSVC)

### Gaussian Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
algo(GaussianNB)

### Stochastic Gradient Descent (SGD) Classifier

In [None]:
from sklearn.linear_model import SGDClassifier
algo(SGDClassifier)

### Perceptron

In [None]:
from sklearn.linear_model import Perceptron
algo(Perceptron)

#### From the above results, it is clear that RandomForest Classifier is the best performing model.
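
A ranking based on a single train/test split can shift with the random split, so one hedged way to back up such a comparison is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` and only three of the models; the `candidates` dictionary and all settings are assumptions for illustration, and the scores say nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem standing in for X_fin, y
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean accuracy over 5 folds for each candidate
cv_means = {
    name: cross_val_score(estimator, X_toy, y_toy, cv=5).mean()
    for name, estimator in candidates.items()
}
best_name = max(cv_means, key=cv_means.get)

for name, score in cv_means.items():
    print(f"{name}: {score:.3f}")
```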

### Hyperparameter Tuning on Random Forest

In [None]:
from sklearn.model_selection import GridSearchCV

parameters = [{'n_estimators': range(10,20), 
               'max_depth': range(2,6)}]

In [None]:
model = RandomForestClassifier(random_state = 0)

rfc_grid = GridSearchCV(estimator = model, param_grid = parameters, cv = 10)

rfc_grid.fit(X_train, y_train)

In [None]:
rfc_grid.best_score_

In [None]:
rfc_grid.best_estimator_
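
`best_score_` is a cross-validated score on the training data, so a natural follow-up is to read the winning settings from `best_params_` and score the refitted model on the held-out test set. This is a minimal sketch on synthetic data with a deliberately small grid; the data, the grid values and `cv=3` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared Titanic features
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 15], "max_depth": [2, 4]},
    cv=3,
)
grid.fit(X_tr, y_tr)

best_params = grid.best_params_          # winning combination from the grid
test_accuracy = grid.score(X_te, y_te)   # accuracy of the refitted best model on unseen rows

print(best_params, round(test_accuracy, 2))
```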