**Predicting Who Survived the Titanic Disaster (1912)**

Questions below:

The data has been split into two groups:

* training set (train.csv)
* test set (test.csv)



* Training set :- 891 entries for training (12 columns)
* Test set :- 418 entries for testing (11 columns)
* Tool used :- Pandas, a data manipulation library for Python

Given this data, the task is to predict which passengers survived the Titanic disaster.

**Variable Notes**

* pclass: A proxy for socio-economic status (SES)
       1st = Upper
       2nd = Middle
       3rd = Lower
* survival - Survival (0 = No; 1 = Yes)
* class - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
* name - Name
* sex - Sex
* age - Age
* sibsp - Number of Siblings/Spouses Aboard
* parch - Number of Parents/Children Aboard
* ticket - Ticket Number
* fare - Passenger Fare
* cabin - Cabin
* embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

**Step 1: Importing Library**

Libraries I used:
* Pandas :- data manipulation and analysis
* NumPy :- support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions that operate on these arrays
* Matplotlib :- plotting library, used here through `matplotlib.pyplot`; it also provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK
* Seaborn :- statistical data visualization
* DecisionTreeClassifier (scikit-learn) :- builds a model that predicts the value of a target variable from several input variables

In [None]:
#libraries
import pandas as pd
import numpy as np
import re
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

**Step 2 :- Loading the csv files**

We load the two given CSV files.

In [None]:
#Training (891 Entries) & Testing (418 Entries) data
train_data = pd.read_csv('../input/test-dataset-for-titanic-competition/titanic_train.csv')
test_data = pd.read_csv('../input/test-dataset-for-titanic-competition/titanic_test.csv')
all_data = [train_data, test_data]

In [None]:
#to know the rows and column of a train_data
train_data.shape

In [None]:
#Training (891 Entries)
train_data.info()

In [None]:
#Show the first 5 entries of train_data
train_data.head()

In [None]:
#Count the null values in each column of train_data
train_data.isnull().sum()

In [None]:
#to know the rows and column of a test_data
test_data.shape

In [None]:
#Testing (418 Entries)
test_data.info()

In [None]:
test_data.head()

In [None]:
#To know the number of null values in each column
test_data.isnull().sum()

**Bar Chart Function**


In [None]:
#Bar chart function
def bar_chart(feature):
    #Count the values of `feature` separately for survivors and non-survivors
    survived = train_data[train_data['Survived']==1][feature].value_counts()
    dead = train_data[train_data['Survived']==0][feature].value_counts()
    #One row per outcome, one column per feature value
    df = pd.DataFrame([survived,dead])
    df.index = ['Survived','Dead']
    df.plot(kind='bar',stacked=True,figsize=(10,5))

**Step 3 : Going through each column**

We will now go through each column and analyse it, to see whether we can use it directly or derive new columns from it that make a significant improvement in our output.

Feature 1: Pclass
Pclass contains three classes (class1, class2 and class3), where
class1 is more expensive than class2, followed by class3.
The survival rates follow the same order:
class1 passengers survived at a higher rate than class2, followed by class3.

In [None]:
#Feature 1: Pclass
print( train_data[["Pclass","Survived"]].groupby(["Pclass"], as_index = False).mean() )
bar_chart('Pclass')

Feature 2 : Sex

Women survived at a much higher rate than men, consistent with women and children being saved first.

In [None]:
#Feature 2: Sex
print( train_data[["Sex","Survived"]].groupby(["Sex"], as_index = False).mean() )
bar_chart('Sex')

Feature 3: Family

Family = No. of Siblings/Spouses + No. of Parents/Children + 1 (the passenger)

We take the two columns SibSp and Parch and add them, plus one for the passenger, to get the family size.

In [None]:
#Feature 3: Family
for data in all_data:
    data['family_size']=data['SibSp']+data['Parch']+1

#print(train_data[["family_size","Survived"]].groupby(["family_size"],as_index=False).mean())
bar_chart('family_size')

This output does not help the analysis much, so let's instead look at whether a passenger travelling alone survived or not.

Feature 3.1: is_alone?
If the family size is 1 (only the passenger), then the passenger is alone.

In [None]:
#Feature 3.1: is alone?

for data in all_data:
    data['is_alone']=0
    data.loc[data['family_size']==1,'is_alone']=1

#print(train_data[['is_alone','Survived']].groupby(['is_alone'],as_index=False).mean())
bar_chart('is_alone')
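
As a small self-contained illustration of the two derived columns, the sketch below builds `family_size` and `is_alone` on a made-up three-passenger frame (the values are invented, not taken from train.csv; the name `toy` is hypothetical). It uses a single vectorized comparison instead of the two-step assignment in the cell; both give the same result.

```python
import pandas as pd

# Toy data: invented values, only the column names match the Titanic set
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# family_size = siblings/spouses + parents/children + the passenger
toy["family_size"] = toy["SibSp"] + toy["Parch"] + 1

# is_alone is 1 when the passenger is the only family member aboard
toy["is_alone"] = (toy["family_size"] == 1).astype(int)

print(toy)
```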

#Feature 4 : Embarked part 1 

The port of embarkation is one of S, C or Q (full names are given in the Variable Notes at the beginning).

We can count how many passengers from S, C and Q travelled in class1,
and the same for class2 and class3.

In [None]:
#Feature 4: Embarked part 1


Pclass1 = train_data[train_data['Pclass']==1]['Embarked'].value_counts()
Pclass2 = train_data[train_data['Pclass']==2]['Embarked'].value_counts()
Pclass3 = train_data[train_data['Pclass']==3]['Embarked'].value_counts()
df = pd.DataFrame([Pclass1,Pclass2,Pclass3])
df.index = ['1st class','2nd class','3rd class']

#print(train_data[["Embarked","Survived"]].groupby(["Embarked"],as_index=False).mean())
df.plot(kind='bar',stacked=True,figsize=(10,5))

As we can see, the largest number of passengers from S travelled in class3, and almost half that number in class2.
Since class3 has the highest number of deaths, this suggests that many of the class3 passengers who died had embarked at S.
A similar analysis applies to the class1 and class2 passengers from S.

#Feature 4 : Embarked part 2
Fill the null values with S, since most passengers embarked at S.

In [None]:
#Feature 4: Embarked part 2
for data in all_data:
    data['Embarked']=data['Embarked'].fillna('S')
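
The fill value 'S' is hardcoded in the cell. A possible alternative, sketched here on an invented toy column (the frame `toy` and the variable `most_common` are hypothetical names), is to compute the most frequent port from the data and fill with that.

```python
import numpy as np
import pandas as pd

# Toy column with two missing ports (invented values)
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

# mode() ignores NaN by default and returns a Series; take the first entry
most_common = toy["Embarked"].mode()[0]

# Replace only the missing entries with the most frequent port
toy["Embarked"] = toy["Embarked"].fillna(most_common)

print(most_common, toy["Embarked"].isnull().sum())
```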
    

Feature 5: Fare
Passengers who paid a higher fare had a better chance of surviving,
but we get much the same result from Pclass, since Pclass also reflects the fare:
* higher fare - class1
* average fare - class2
* low fare - class3

So this column may not add much to the analysis.

In [None]:
#Feature 5: Fare
for data in all_data:
    data['Fare'] = data['Fare'].fillna(data['Fare'].median())
    
train_data['category_fare']=pd.qcut(train_data['Fare'],4)

print(train_data[["category_fare","Survived"]].groupby(["category_fare"],as_index=False).mean())
bar_chart('category_fare')
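
`pd.qcut` splits a column into bins that hold roughly equal numbers of rows, so the bin edges depend on the data. The sketch below, on invented fares, shows how to retrieve those edges with `retbins=True`. On the real training set, edges obtained this way could serve as the cut points for a later fare mapping; that link is an assumption, not something the notebook states.

```python
import pandas as pd

# Invented fares, eight values so that each quartile bin holds two of them
fares = pd.Series([5.0, 7.5, 8.0, 10.0, 13.0, 20.0, 30.0, 80.0])

# q=4 requests quartile bins; retbins=True also returns the bin edges
categories, edges = pd.qcut(fares, 4, retbins=True)

print(categories.value_counts().sort_index())
print(edges)
```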

#Feature 6: Name part 1
Now comes the most interesting part, where we extract the salutation from the Name.

Most common salutations: Mrs., Mr., Miss, Master, Other
We store it in a new column called title.
So we split the names by salutation and count how many times each salutation occurs.


In [None]:
#Feature 6: Name part 1
def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\. ', name)
    if title_search:
        return title_search.group(1)
    return ""

for data in all_data:
    data['title'] = data['Name'].apply(get_title)

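#Count each title in the training set (a bare `data` here would be the last frame of the loop, test_data)
train_data['title'].value_counts()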
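
To see what the pattern captures, here is a standalone sketch that applies the same idea to a few sample strings in the dataset's "Surname, Title. Given names" format. It uses a raw string for the pattern, which avoids invalid-escape warnings on recent Python versions.

```python
import re

def get_title(name):
    # Capture the word that sits between a space and ". ", e.g. " Mr. "
    title_search = re.search(r' ([A-Za-z]+)\. ', name)
    if title_search:
        return title_search.group(1)
    return ""

print(get_title("Braund, Mr. Owen Harris"))
print(get_title("Heikkinen, Miss. Laina"))
print(repr(get_title("No title here")))
```

Names without a "word followed by a period" fall through to the empty string, which a later mapping step can treat as an unknown title.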

#Feature 6: Name part 2

In [None]:
#Feature 6: Name part 2

#replacing every title with the common title 
for data in all_data:
    data['title'] = data['title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'],'Rare')
    data['title'] = data['title'].replace('Mlle','Miss')
    data['title'] = data['title'].replace('Ms','Miss')
    data['title'] = data['title'].replace('Mme','Mrs')
    
#Cross-tabulate title against Sex
print(pd.crosstab(train_data['title'], train_data['Sex']))
print("----------------------")

print(train_data[['title','Survived']].groupby(['title'], as_index = False).mean())
bar_chart('title')

Feature 7: Age
Since this column has many nulls, we fill each null with a random integer between (mean age - standard deviation) and (mean age + standard deviation).
We then categorise age into 5 bins.

In [None]:
#Feature 7: Age
#train_data['Age'].fillna(train_data.groupby("title")["Age"].transform("median"), inplace=True)
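for data in all_data:
    age_avg  = data['Age'].mean()
    age_std  = data['Age'].std()
    age_null = data['Age'].isnull().sum()

    #randint needs integer bounds; the upper bound is exclusive
    random_list = np.random.randint(int(age_avg - age_std), int(age_avg + age_std), size = age_null)
    #Use .loc instead of chained indexing so the assignment reaches the DataFrame
    data.loc[data['Age'].isnull(), 'Age'] = random_list
    data['Age'] = data['Age'].astype(int)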

train_data['category_age'] = pd.cut(train_data['Age'], 5)
print( train_data[["category_age","Survived"]].groupby(["category_age"], as_index = False).mean() )
bar_chart('category_age')
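
The commented-out line at the top of the cell points at an alternative to random filling: replacing each missing age with the median age of passengers who share the same title. The sketch below shows that idea on invented data (the frame `toy` and the variable `title_median` are hypothetical); it is deterministic, so repeated runs give the same result.

```python
import numpy as np
import pandas as pd

# Toy data with one missing age per title (invented values)
toy = pd.DataFrame({
    "title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Miss"],
    "Age":   [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# Median age within each title, broadcast back to the original row order
title_median = toy.groupby("title")["Age"].transform("median")

# Only the missing ages are replaced; known ages stay unchanged
toy["Age"] = toy["Age"].fillna(title_median)

print(toy)
```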

#Mapping Data

The classifier only accepts numerical values, not strings, so every entry must be converted to a number.
We therefore map every string entry to an integer.

We also drop the unwanted columns.

In [None]:
#Map Data
for data in all_data:

    #Mapping Sex
    sex_map = { 'female':0 , 'male':1 }
    data['Sex'] = data['Sex'].map(sex_map).astype(int)

    #Mapping Title
    title_map = {'Mr':1, 'Miss':2, 'Mrs':3, 'Master':4, 'Rare':5}
    data['title'] = data['title'].map(title_map)
    data['title'] = data['title'].fillna(0)

    #Mapping Embarked
    embark_map = {'S':0, 'C':1, 'Q':2}
    data['Embarked'] = data['Embarked'].map(embark_map).astype(int)

    #Mapping Fare
    data.loc[ data['Fare'] <= 7.91, 'Fare']                            = 0
    data.loc[(data['Fare'] > 7.91) & (data['Fare'] <= 14.454), 'Fare'] = 1
    data.loc[(data['Fare'] > 14.454) & (data['Fare'] <= 31), 'Fare']   = 2
    data.loc[ data['Fare'] > 31, 'Fare']                               = 3
    data['Fare'] = data['Fare'].astype(int)

    #Mapping Age
    data.loc[ data['Age'] <= 16, 'Age']                       = 0
    data.loc[(data['Age'] > 16) & (data['Age'] <= 32), 'Age'] = 1
    data.loc[(data['Age'] > 32) & (data['Age'] <= 48), 'Age'] = 2
    data.loc[(data['Age'] > 48) & (data['Age'] <= 64), 'Age'] = 3
    data.loc[ data['Age'] > 64, 'Age']                        = 4

#Feature Selection
#Create list of columns to drop
drop_elements = ["Name", "Ticket", "Cabin", "SibSp", "Parch"]

#Drop columns from both data sets
train_data = train_data.drop(drop_elements, axis = 1)
train_data = train_data.drop(['PassengerId','category_fare', 'category_age'], axis = 1)
test_data = test_data.drop(drop_elements, axis = 1)

#Print ready to use data
print(train_data.head(10))
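
The Mapping Age step assigns a bin number with one `.loc` line per range. A more compact sketch of the same binning, assuming the same cut points (16, 32, 48, 64) and right-closed intervals, uses `pd.cut` with `labels=False`. The ages below are invented and the names `age_bins` and `age_codes` are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented ages, chosen to include values that sit exactly on a bin edge
ages = pd.Series([4, 16, 17, 32, 40, 60, 70])

# Right-closed bins: (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf)
age_bins = [-np.inf, 16, 32, 48, 64, np.inf]

# labels=False returns the integer index of each bin (0 to 4)
age_codes = pd.cut(ages, bins=age_bins, labels=False)

print(age_codes.tolist())
```

Because the bins are defined once and applied to the untouched ages, no step can accidentally re-match a value that an earlier step has already rewritten.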

#Prediction

We need to train our model. To do that, we provide the data in two parts, X and Y.

* X : X_train : contains all the features
* Y : Y_train : contains the actual output (Survived)

The Y values tell the model which output we are looking for, so that it can learn how the features relate to it. It is like shopping online: if a dress is out of stock, we search for a similar one by saying, "hey, I want a dress like this one".

In [None]:
#Prediction
#Train and Test data
X_train = train_data.drop("Survived", axis=1)
Y_train = train_data["Survived"]
X_test  = test_data.drop("PassengerId", axis=1).copy()

#Running our classifier

With the data separated, we now create our classifier, fit it on the training data with the .fit method of scikit-learn, and predict the output for the test data with the .predict method.


In [None]:
#Running our classifier
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
#Accuracy on the training data itself, which overstates performance on unseen data
accuracy = round(decision_tree.score(X_train, Y_train) * 100, 2)
print("Training Accuracy: ",accuracy)
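
The score in the cell is measured on the same rows the tree was fitted on, and an unpruned decision tree can memorise those rows. One common way to get a less optimistic estimate is k-fold cross-validation. The sketch below is a hedged illustration on synthetic data: `X_demo` and `y_demo` are random stand-ins, not the Titanic features; in the notebook one would pass `X_train` and `Y_train` instead.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / Y_train (random values, not Titanic data)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 4, size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 3).astype(int)

# 5-fold cross-validation: each fold is scored on rows the tree did not see
model = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

print(scores.mean())
```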

#Creating a CSV with results

The output submission.csv file contains only two columns, PassengerId and Survived, as required on the competition page.

In [None]:
#Create a CSV with results
submission = pd.DataFrame({
    "PassengerId": test_data["PassengerId"],
    "Survived": Y_pred
})
submission.to_csv('submission.csv', index = False)

**References:**

1. [Predict Who Survived the Titanic Disaster](https://towardsdatascience.com/your-first-kaggle-competition-submission-64da366e48cb)
2. [User Guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html)
3. [Kaggle - Titanic Solution [1/3] - data analysis](https://www.youtube.com/watch?v=3eTSVGY_fIE&t=31s)