# Introduction

Most people know the story of the Titanic, not least from the movie: the ship met its tragic end in the early hours of 15 April 1912. Billed as 'unsinkable', it struck an iceberg and within a few hours lay at the bottom of the ocean.

More than 1,500 people died in the shipwreck. The lifeboats could not accommodate everyone on board, which led to a huge loss of life.

# Objective

In this kernel, we will go through a beginner-friendly ML use case in which we predict the survival probability of a passenger based on a few variables (features) recorded about them. We will go step by step through the points that need attention in a basic machine learning project. So let's get started!

First, let us understand the basic steps in an ML project.

# Steps in an ML project

1. **Data Collection**: You can collect data from your company, Kaggle, APIs, etc. I have picked this dataset from Kaggle.
2. **Exploratory Data Analysis**: Analyzing the features against the target variable helps us understand the data.
3. **Feature Engineering**: Handling NaN values, outliers, and categorical variables, as well as feature scaling, comes under this step.
4. **Model Building**: After cleaning the data, it is time to feed it to the model.
5. **Model Deployment**: Hosting your model on a third-party server is the last step.

**A] Importing the Libraries and Dataset**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df=pd.read_csv('../input/titanic/train.csv')          #read_csv is a pandas function that loads a CSV file into a DataFrame

In [None]:
df.head()

**B] Exploratory Data Analysis (EDA)**

From the dataset we can see that the 'Survived' column is our target variable and the remaining columns are the independent features.

In [None]:
df.info()

In [None]:
df.columns

In [None]:
df.describe()

In [None]:
df.isnull().mean()
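
The call above works because the mean of a True/False column equals the fraction of True values. Here is a minimal sketch of that idea on a small made-up frame (the `toy` data and the name `missing_share` are illustrative assumptions, not part of the Titanic dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() gives a True/False mask; the mean of a boolean column is the
# fraction of True values, i.e. the share of missing entries per column.
missing_share = toy.isnull().mean()
print(missing_share)
```

A column with a very high share of missing values is usually a candidate for dropping, while a column with a small share can often be imputed.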

In [None]:
sns.heatmap(df.isnull(),yticklabels=False,cmap='viridis')     #Age and Cabin have NULL values

In [None]:
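df.select_dtypes(include='number').corr()     #correlate numeric columns only; recent pandas versions raise an error on text columns such as Name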

In [None]:
sns.heatmap(df.select_dtypes(include='number').corr(),annot=True,linewidths=.5)
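
When the goal is to see which features move with the target, it can help to read a single column of the correlation matrix instead of the full heatmap. The following is only a sketch on made-up values (the `toy` frame and the name `corr_with_target` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Pclass": [3, 1, 2, 3, 1],
    "Fare": [7.25, 71.28, 13.0, 8.05, 53.1],
    "Name": ["a", "b", "c", "d", "e"],  # text column, cannot be correlated
})

# Keep numeric columns only, then compute pairwise Pearson correlations.
numeric = toy.select_dtypes(include="number")

# Take the target's column, drop its self-correlation, and sort the rest.
corr_with_target = numeric.corr()["Survived"].drop("Survived").sort_values()
print(corr_with_target)
```

Correlation only captures linear relationships, so a value near zero does not prove that a feature is useless.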

In [None]:
df['Survived'].value_counts()

In [None]:
sns.set(style='darkgrid')
sns.countplot(x='Survived',data=df,palette='twilight')     #recent seaborn versions need x= as a keyword; palette overrides color, so color is dropped
plt.xlabel('Survived/Not survived')
plt.ylabel('Frequency')
plt.title('Graph representing number of survivors and non survivors')
plt.show()

In [None]:
sns.countplot(x='Survived',hue='Sex',data=df,dodge=True,palette='BrBG')
plt.xlabel('Survived/Not survived')
plt.ylabel('Frequency')
plt.title('Survivor and non survivor based on gender')
plt.show()
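
The bars in a count plot like this one can also be read off as numbers. A minimal sketch, assuming a made-up `toy` frame (not the real data), of how the counts and the survival rate per gender could be computed:

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Counts per (Sex, Survived) pair: the same numbers a count plot draws as bars.
counts = pd.crosstab(toy["Sex"], toy["Survived"])

# Because Survived is 0/1, its mean within each group is the survival rate.
survival_rate = toy.groupby("Sex")["Survived"].mean()

print(counts)
print(survival_rate)
```

The rate is often easier to compare than raw counts when the groups have different sizes.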

In [None]:
df['Pclass'].value_counts()

In [None]:
sns.catplot(x='Pclass',hue='Sex',col='Survived',kind='count',data=df,palette='cubehelix_r')

In [None]:
sns.histplot(df['Age'].dropna(),kde=False)       #distplot is deprecated; the distribution looks roughly bell-shaped

In [None]:
sns.boxplot(x='Pclass',y='Age',data=df)

In [None]:
sns.pairplot(df)

In [None]:
sns.catplot(x='SibSp',hue='Sex',col='Survived',kind='count',data=df,palette='cubehelix_r')

**C] Feature Engineering**

In [None]:
df.head()

In [None]:
def impute_nan(df,column,median):                       #Handling missing values by replacing NaN with the median
    df[column+'_median']=df[column].fillna(median)      #the filled values go into a new column, e.g. 'Age_median'
median=df['Age'].median()
impute_nan(df,'Age',median)


In [None]:
df.head()
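
Median imputation as used above can be sketched on a tiny made-up column. The `toy` values are assumptions chosen so that one large value pulls the mean upward, which shows why the median is often preferred when outliers are present:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; 80 plays the role of an outlier.
toy = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0, 80.0]})

# Both statistics skip NaN by default.
median_age = toy["Age"].median()  # middle of 20, 30, 40, 80
mean_age = toy["Age"].mean()      # pulled upward by the outlier

# Fill the gap with the median and keep the result in a new column.
toy["Age_median"] = toy["Age"].fillna(median_age)
print(median_age, mean_age)
print(toy)
```

Filling every gap with one value shrinks the variance of the column, so this is a simple baseline rather than the only option.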

In [None]:
df.drop('Age',axis=1,inplace=True)

In [None]:
df.drop('Cabin',axis=1,inplace=True)

In [None]:
df.head()

In [None]:
df['Embarked'].mode()

In [None]:
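df['Embarked']=df['Embarked'].fillna('S')     #'S' is the mode found above; assigning back avoids the unreliable inplace fill on a column selection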

In [None]:
df['Embarked'].isnull().sum()
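
Instead of typing the most frequent value by hand, it could be taken from `mode()` directly. A minimal sketch on a made-up column (the `toy` frame and the name `most_common` are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries.
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

# mode() ignores NaN and returns a Series (ties are possible),
# so take the first entry as the fill value.
most_common = toy["Embarked"].mode()[0]

# Assign the filled column back instead of filling in place.
toy["Embarked"] = toy["Embarked"].fillna(most_common)
print(toy["Embarked"].tolist())
```

This keeps the code correct even if the dataset changes and a different port becomes the most frequent one.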

In [None]:
df.isnull().sum()    #all NaN values have now been handled

**Encoding categorical data into dummy variable**

In [None]:
df1=pd.get_dummies(df['Sex'],drop_first=True)
df1.head()

In [None]:
df=pd.concat([df1,df],axis=1)

In [None]:
df.head()

In [None]:
df.drop('Sex',axis=1,inplace=True)

In [None]:
df2=pd.get_dummies(df['Embarked'],drop_first=True)

In [None]:
df=pd.concat([df2,df],axis=1)

In [None]:
df.head()
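
The encoding steps above build each dummy frame separately and concatenate it by hand. As an alternative sketch (not the approach used in this kernel), `pd.get_dummies` can encode several columns in one call through its `columns` argument; the `toy` frame below is made up for illustration:

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "Q"],
    "Fare": [7.25, 71.28, 8.05],
})

# Encode both categorical columns at once. drop_first=True drops one level
# per column (here 'female' and 'C') to avoid redundant, perfectly
# correlated dummy columns. The original text columns are removed automatically.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True, dtype=int)
print(encoded)
```

With this form the new columns are prefixed with the source column name (for example `Sex_male`), which makes them easier to trace back than bare names like `male` or `Q`.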

In [None]:
df.isnull().sum()

In [None]:
df.drop('Name',axis=1,inplace=True)

In [None]:
df.drop('Ticket',axis=1,inplace=True)

In [None]:
df.drop('Embarked',axis=1,inplace=True)

In [None]:
df.head()

That covers the EDA and feature engineering part of this use case. We now have a dataset that is ready to be given to the model. I will come up with the last part very soon. Till then, you can refer to the Medium article I have written on this.

[https://medium.com/swlh/machine-learning-project-titanic-problem-statement-c45997a75d5b](https://medium.com/swlh/machine-learning-project-titanic-problem-statement-c45997a75d5b)