> ##### *Please upvote and share the kernel if you like the work! Also, feel free to copy the kernel and practise the code yourself.*

In [None]:
import numpy as np 
import pandas as pd
import cufflinks as cf
from sklearn.linear_model import LogisticRegression
cf.go_offline()

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# **Loading Titanic Data:**

In [None]:
train = pd.read_csv("../input/titanic/train.csv",index_col = 'PassengerId')
test = pd.read_csv("../input/titanic/test.csv",index_col = 'PassengerId')

# **Training data Overview:** 
This gives a general sense of the missing (null) values and the overall structure of the dataset.

In [None]:
train.info()

In [None]:
train.head()

In [None]:
train.describe()

# **Data Dictionary:** 
Now let us create a data dictionary for the Titanic dataset, to better understand the variables and what they stand for.

In [None]:
# Keys match the column names in train.csv exactly (case-sensitive)
titanic_dictionary = {'Survived':'Survival (0 = No, 1 = Yes)',
               'Pclass':'Passenger Ticket class',
               'Sex':'Sex',
               'Age':'Age in years',
               'SibSp':'# of siblings / spouses aboard the Titanic',
               'Parch':'# of parents / children aboard the Titanic',
               'Ticket':'Ticket number',
               'Fare':'Passenger fare',
               'Cabin':'Cabin number',
               'Embarked':'Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)'}

# **Test data overview:**

In [None]:
test.head()

In [None]:
test.describe()

In [None]:
test.info()

# **Missing data in the training dataset:**

In [None]:
train.isnull().sum()

As shown above, Age has 177 missing values and Cabin has 687. Let's create a heatmap to visualise this.

In [None]:
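import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10,8))
# Each highlighted cell marks a missing value in that row and column
sns.heatmap(train.isnull(),yticklabels = False)

# Keyword arguments are case-sensitive: fontsize, not Fontsize
plt.title("Train Feature Missing Values",fontsize = 20)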
plt.show()
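The raw counts are easier to compare across columns as percentages. Here is a minimal sketch of that calculation on a small made-up frame; `toy` and its values are hypothetical stand-ins, not rows from the real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (hypothetical values)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() gives a boolean frame; its column-wise mean is the missing fraction
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct.sort_values(ascending=False))
```

Running the same expression on `train` should list Cabin and Age at the top.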

# **Exploratory data Analysis on the Training set** :
Now let us explore the data for our better understanding.

In [None]:
plt.figure(figsize=(10,6))
sns.set_style('whitegrid')
sns.countplot(x = 'Survived', data = train)
plt.title('Survival Statistics', fontsize = 18)
plt.show()

In the plot above, '1' marks passengers who survived and '0' those who did not. About 340 passengers survived the disaster and roughly 550 did not.

Now let us check if there is dependence of Sex/Gender on the people who survived.

In [None]:
plt.figure(figsize= (10,6))
sns.set_style('whitegrid')
sns.countplot(x = 'Survived', hue = 'Sex',data = train)
plt.title("Male/Female survivals",fontsize = 20)

In the count plot above, '1' again marks survivors and '0' non-survivors, while the bar colour (see the legend) separates male from female passengers.

Split by sex, far more men than women died: well over twice as many.

Now let's see how it looks when we split passenger survival by ticket class.

In [None]:
plt.figure(figsize = (10,6))
sns.countplot(x = "Survived", data = train, hue = "Pclass")
plt.show()

One observation from the count plot above is that most of the non-survivors were third-class passengers.
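Count plots show absolute numbers, which can hide differences in group size. As a rough sketch, survival rates per group can be computed with a group-wise mean; the `toy` frame below is hypothetical and only mimics the column names.

```python
import pandas as pd

# Hypothetical mini-dataset using the same column names as train
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1],
    "Sex": ["female", "male", "female", "male", "male", "male"],
    "Pclass": [1, 3, 2, 3, 3, 1],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()
print(rate_by_sex)
print(rate_by_class)
```

Applied to `train`, the same two lines give the survival rate per sex and per ticket class.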

Now let's check the distribution of passengers age.

In [None]:
plt.figure(figsize = (10,6))
sns.distplot(train['Age'].dropna(),bins = 30)

We can see that the largest group of passengers is between 20 and 30 years old.

Let's check the distribution of passenger fares.

In [None]:
plt.figure(figsize= (10,6))
plt.hist(train['Fare'],bins = 30)

As we saw earlier, most passengers travelled in third class, and the plot shows that third-class tickets were relatively cheap. There are some outliers; probably some first-class passengers paid for extra services. For a closer look at the fare distribution, let's use an interactive iplot.

In [None]:
train['Fare'].iplot(kind='hist',bins=30,color='blue')

# **Cleaning Training Data**

We know that the "Age" column has 177 missing values out of 891 passengers in total. So the percentage of rows with a missing Age is (177/891)*100, or about 19.9%.

In [None]:
# The mean of the boolean mask is the fraction of missing values
train['Age'].isnull().mean() * 100

Let's create a box plot to check whether Age depends on the passenger's travelling class.

In [None]:
plt.figure(figsize = (10,6))
sns.boxplot(x= 'Pclass' , y = "Age" , data = train)
plt.title("Distribution of Age based on Passenger Class", fontsize =20)
plt.show()

We can see that first-class passengers tend to be older than passengers in the other classes; we may assume that older, wealthier passengers travelled first class. So let's impute the missing ages based on ticket class.

Assigning Age to null rows (Age Imputation)

In [None]:
# Select the column before averaging so non-numeric columns are never aggregated
train.groupby('Pclass')['Age'].mean()

In [None]:
# Rounded mean age per ticket class, computed once instead of once per row
class_mean_age = train.groupby("Pclass")["Age"].mean().round()

def age_imputation(row):
    # Keep a known age; otherwise fall back to the mean age of the passenger's class
    if pd.isnull(row["Age"]):
        return class_mean_age.loc[int(row["Pclass"])]
    return row["Age"]

In [None]:
train["Age"] = train[["Age","Pclass"]].apply(age_imputation,axis = 1)
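The same class-based imputation can also be written without a row-wise `apply`. The sketch below shows one vectorized alternative using `groupby(...).transform`; it is not the approach used in this kernel, and the `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset with one missing age per class
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 36.0, np.nan, 20.0, 26.0, np.nan],
})

# Mean age of each passenger's class, broadcast back to every row
class_mean = toy.groupby("Pclass")["Age"].transform("mean").round()

# Keep known ages; fill only the gaps with the rounded class mean
toy["Age"] = toy["Age"].fillna(class_mean)
print(toy)
```

`transform` returns a Series aligned with the original rows, so `fillna` can use it directly.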

In [None]:
train.isnull().sum()

In [None]:
(687/891)* 100

Cabin is missing for about 77% of the rows, so we drop the column.

In [None]:
train.drop("Cabin", axis = 1, inplace = True)

In [None]:

train.isnull().sum()

Now only one column has missing values: Embarked. Since only two observations are affected, we can delete those rows.

In [None]:
train.dropna(inplace = True)

In [None]:
train.isnull().sum()

Let's go through the dataset again:

In [None]:
train.head()

Now let's try to turn the free-text data into a categorical feature.

The Name column contains a title (Mr, Mrs, etc.) in the middle; let's extract it and treat it as a category.

In [None]:
train['Title'] = train.Name.str.extract(r',\s([a-zA-Z ]+)', expand = False)

In [None]:
train['Title'].value_counts()

Titles such as Ms and Mme occur only once or twice in the dataset. We can replace them with more common, easily understood titles such as Mr and Miss.

In [None]:
train['Title'] = train['Title'].replace(to_replace = "Master", value = 'Mr')

In [None]:
train['Title'] = train['Title'].replace(to_replace = ['Mlle','Ms','Mme'], value = 'Miss')

In [None]:
train['Title'] = train['Title'].replace(to_replace = ['Dr','Rev','Major','Col','Don','Jonkheer','Lady','the Countess','Capt','Sir'],value = 'Other')

Let's see how it turned out.

In [None]:
train["Title"].value_counts()
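The extract-then-group steps above can be sketched end to end on a few names written in the dataset's `Surname, Title. Given names` format. The regular expression and the `title_map` grouping below are illustrative assumptions rather than the exact rules used in this kernel; for instance, unlisted titles simply fall back to "Other".

```python
import pandas as pd

# A few sample names in the "Surname, Title. Given names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "Sagesser, Mlle. Emma",
])

# Capture the text between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Hypothetical grouping: titles not listed here fall back to "Other"
title_map = {"Mr": "Mr", "Mrs": "Mrs", "Miss": "Miss",
             "Mlle": "Miss", "Ms": "Miss", "Mme": "Miss"}
grouped = titles.map(title_map).fillna("Other")
print(grouped.tolist())
```

A single dictionary keeps the grouping rules in one place, which makes it easier to apply the same rules to the test set later.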


Let's create a box plot of the new Title variable versus Age.

In [None]:
plt.figure(figsize = (10,6))
sns.boxplot( x = 'Title', data = train, y = 'Age' )

# Keyword arguments are case-sensitive: fontsize, not Fontsize
plt.title("Age vs Title", fontsize = 20 )

Let's now check the survival counts for the regrouped titles.

In [None]:
plt.figure(figsize = (10,6))

sns.countplot( x= 'Survived', data = train, hue = "Title" )

In [None]:
train.head()

Let us use the pandas get_dummies function to convert the categorical features into dummy variables.

In [None]:
d_title = pd.get_dummies(train['Title'],drop_first = True, prefix = "Title")

In [None]:
d_sex = pd.get_dummies(train['Sex'], drop_first = True, prefix = "Sex")

In [None]:
# The prefix must match the one used for the test set so both frames share column names
d_embarked = pd.get_dummies(train['Embarked'], drop_first = True, prefix = "Embarked")

In [None]:
train = pd.concat([train,d_title,d_sex,d_embarked], axis = 1)

In [None]:
train.drop(["Name","Sex","Ticket",'Embarked','Title'], axis = 1 , inplace = True)

In [None]:
train.head()

# **Now let's go ahead and start cleaning the Test Data**

In [None]:
test.head()

In [None]:
test.isnull().sum()

In [None]:
# Select the column before averaging so non-numeric columns are never aggregated
test.groupby('Pclass')['Age'].mean()

In [None]:
# Rounded mean age per ticket class in the test set, computed once
test_class_mean_age = test.groupby("Pclass")["Age"].mean().round()

def age_imputation_test(row):
    # Keep a known age; otherwise fall back to the mean age of the passenger's class
    if pd.isnull(row["Age"]):
        return test_class_mean_age.loc[int(row["Pclass"])]
    return row["Age"]

In [None]:
test["Age"] = test[["Age","Pclass"]].apply(age_imputation_test,axis = 1)

In [None]:
test.isnull().sum()

In [None]:
plt.figure(figsize = (8,6))

sns.boxplot(x = 'Pclass', y = 'Fare',hue = 'Pclass', data = test)

In [None]:
# Select the column before averaging so non-numeric columns are never aggregated
test.groupby('Pclass')['Fare'].mean()

In [None]:
test.head()

In [None]:
test[test['Fare'].isnull()]

In [None]:
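# Fill only the Fare column; a frame-wide fillna would also write the fare into Cabin
test['Fare'] = test['Fare'].fillna(test.loc[test['Pclass'] == 3, 'Fare'].mean())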

In [None]:
test.isnull().sum()

In [None]:
test.head()

In [None]:
# expand = False returns a Series, which is what a single new column needs
test["Name_"] = test.Name.str.extract(r',\s([a-zA-Z]+)', expand = False)

In [None]:
test['Name_'].value_counts()

In [None]:
test['Name_'] = test['Name_'].replace(to_replace = "Master", value = "Mr")

In [None]:
test["Name_"] = test["Name_"].replace(to_replace = "Ms", value = "Miss")

In [None]:
# Use the same label as the training set ('Other') so the dummy column names match
test["Name_"] = test["Name_"].replace(to_replace = ["Rev", 'Col','Dona', 'Dr'], value = 'Other')

In [None]:
test.head()

In [None]:
Dummy_Name = pd.get_dummies(test["Name_"],drop_first = True, prefix = "Title")

In [None]:
# Same prefix as the training set so the column is named Sex_male in both frames
Dummy_Sex = pd.get_dummies(test['Sex'], drop_first = True, prefix = "Sex")

In [None]:
Dummy_Embarked = pd.get_dummies(test['Embarked'], drop_first = True,prefix = 'Embarked')

In [None]:
# Same column order as the training frame: title, sex, embarked
test = pd.concat([test,Dummy_Name,Dummy_Sex,Dummy_Embarked],axis = 1)

In [None]:
test.drop(['Name','Sex','Embarked','Name_','Ticket','Cabin'],axis = 1,inplace = True)

In [None]:
test.head()
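Because train and test are one-hot encoded separately, their dummy columns can differ in name, order, or presence, and scikit-learn expects the same features at prediction time as at fit time. One way to guard against this, sketched here on hypothetical toy frames, is to reindex the test columns against the training columns.

```python
import pandas as pd

# Hypothetical frames: the "Other" title appears only in the training data
train_toy = pd.DataFrame({"Fare": [7.25, 71.28, 8.05], "Title": ["Mr", "Mrs", "Other"]})
test_toy = pd.DataFrame({"Fare": [7.83, 9.69], "Title": ["Mrs", "Mr"]})

train_X = pd.get_dummies(train_toy, columns=["Title"], drop_first=True)
test_X = pd.get_dummies(test_toy, columns=["Title"], drop_first=True)

# Reorder test columns to match train; categories unseen in test become all-zero
test_X = test_X.reindex(columns=train_X.columns, fill_value=0)
print(list(test_X.columns))
```

This only works when both frames use the same labels and prefixes. Note also that `drop_first` drops whichever category sorts first in each frame, so the two frames should share the same first category.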

# Splitting the train data

In [None]:
train.head()

In [None]:
X = train.drop('Survived', axis = 1)
y = train['Survived']

In [None]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y, random_state = 1)

# 1: Logistic Regression

In [None]:
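# max_iter is raised from the default of 100 so the solver can converge on unscaled features
l_model = LogisticRegression(max_iter = 1000)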
l_model.fit(X_train,y_train)

In [None]:
l_model_data = l_model.predict(X_test)

In [None]:
l_train_model = l_model.predict(X_train)
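The two prediction arrays above are not scored anywhere in the kernel. A minimal sketch of how such predictions could be evaluated follows; it uses synthetic stand-in data (`X_demo`, `y_demo` are hypothetical) so that it runs on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: two features, label depends on the sign of their sum
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo.sum(axis=1) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Compare accuracy on seen and unseen rows to spot overfitting
train_acc = accuracy_score(y_tr, model.predict(X_tr))
test_acc = accuracy_score(y_te, model.predict(X_te))
cm = confusion_matrix(y_te, model.predict(X_te))
print(f"train accuracy: {train_acc:.3f}, hold-out accuracy: {test_acc:.3f}")
print(cm)
```

In this notebook the equivalent calls would be `accuracy_score(y_test, l_model_data)` and `accuracy_score(y_train, l_train_model)`.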

In [None]:
test_prediction = l_model.predict(test)

In [None]:
results = pd.DataFrame({'PassengerId':test.index,
                       'Survived':test_prediction})

In [None]:
results.to_csv("Submission.csv", index = False)