# IMPORT THE LIBRARIES

In [1]:
import pandas as pd
import numpy as np

# READING DATA USING PANDAS
    
We use the pandas read_csv function to read the CSV file. It returns a DataFrame directly, so no further conversion is needed.

In [2]:
df = pd.read_csv('./titanic/train.csv')
df.head(6)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |


To find the number of rows and columns of the DataFrame, use the shape attribute.

In [3]:
df.shape

(891, 12)

### Description of the attributes of the dataset

Pclass : Passenger Class (1 = 1st, 2 = 2nd, 3 = 3rd)

Survived : Survival (0 = No, 1 = Yes)

Name : Name

Sex : Sex

Age : Age

SibSp : Number of Siblings/Spouses Aboard

Parch : Number of Parents/Children Aboard

Ticket : Ticket Number

Fare : Passenger Fare (British pounds)

Cabin : Cabin

Embarked : Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

# HANDLING NULL VALUES

A dataset may contain rows and columns in which some values are missing. We can't leave those missing values as they are, so we have two options:

    1. Drop the entire row or column.
    2. Fill the missing values with an appropriate value, e.g. the mean of all the values in that column.
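
The two options above can be sketched on a tiny made-up frame. The column values and the names `toy`, `rows_dropped`, `cols_dropped`, and `filled` are illustrative assumptions and are not taken from the Titanic file.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Option 1a: drop the rows where Age is missing.
rows_dropped = toy.dropna(axis=0, subset=["Age"])

# Option 1b: drop every column that contains at least one missing value.
cols_dropped = toy.dropna(axis=1)

# Option 2: fill missing ages with the column mean (mean() skips NaN).
age_mean = toy["Age"].mean()
filled = toy.assign(Age=toy["Age"].fillna(age_mean))
```

Dropping loses information, while filling keeps every row at the cost of inventing values, so the choice usually depends on how much of the column is missing.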

In [4]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Separating out the columns that have more than 35% of their values missing in the dataset.

In [5]:
drop_col = df.isnull().sum()[df.isnull().sum() > (35/100 * df.shape[0])]
drop_col

Cabin    687
dtype: int64
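
An equivalent way to express the same threshold is to work with the fraction of missing values per column, since the mean of a boolean mask is the share of True values. The sketch below uses a small invented frame; `toy`, `missing_fraction`, and `threshold` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, 35.0, 54.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Fare":  [7.25, 71.28, 7.93, 53.10, 8.05],
})

# Fraction of missing values in each column.
missing_fraction = toy.isnull().mean()

# Keep the same 35% cut-off used in the text.
threshold = 0.35
cols_to_drop = missing_fraction[missing_fraction > threshold].index

# Return a new frame without the sparse columns.
reduced = toy.drop(columns=cols_to_drop)
```

Working with fractions avoids repeating the row count and makes the cut-off readable at a glance.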

In [6]:
drop_col.index

Index(['Cabin'], dtype='object')

In [7]:
df.drop(drop_col.index,axis=1,inplace = True)

A missing age cannot sensibly be replaced with 0, so we fill it with the average age instead.

In [8]:
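# numeric_only=True skips string columns such as Name, which recent pandas versions refuse to average
df.fillna(df.mean(numeric_only=True), inplace=True)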
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       2
dtype: int64

For Embarked we cannot use the average, because its values are strings rather than numbers.

Instead we fill it with the most frequent value.

In [9]:
df['Embarked'].describe()

count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object

In [10]:
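# Assign the result back; an inplace fillna on a selected column may not update df in recent pandas versions
df['Embarked'] = df['Embarked'].fillna('S')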

df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
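
Rather than reading the top value from describe() and typing it by hand, the most frequent value can be computed with mode(). This is a minimal sketch on invented data; `toy` and `most_frequent` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

# mode() ignores NaN by default and returns a Series; take the first entry.
most_frequent = toy["Embarked"].mode()[0]

# Fill the gaps with the most frequent port and assign the result back.
toy["Embarked"] = toy["Embarked"].fillna(most_frequent)
```

If two values are tied for most frequent, mode() returns both in sorted order, so taking the first entry is an arbitrary but deterministic choice.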

# FINDING CORRELATION

In [11]:
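# numeric_only=True restricts the correlation to numeric columns (required in recent pandas versions)
df.corr(numeric_only=True)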

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| PassengerId | 1.0 | -0.005007 | -0.035144 | 0.033207 | -0.057527 | -0.001652 | 0.012658 |
| Survived | -0.005007 | 1.0 | -0.338481 | -0.069809 | -0.035322 | 0.081629 | 0.257307 |
| Pclass | -0.035144 | -0.338481 | 1.0 | -0.331339 | 0.083081 | 0.018443 | -0.5495 |
| Age | 0.033207 | -0.069809 | -0.331339 | 1.0 | -0.232625 | -0.179191 | 0.091566 |
| SibSp | -0.057527 | -0.035322 | 0.083081 | -0.232625 | 1.0 | 0.414838 | 0.159651 |
| Parch | -0.001652 | 0.081629 | 0.018443 | -0.179191 | 0.414838 | 1.0 | 0.216225 |
| Fare | 0.012658 | 0.257307 | -0.5495 | 0.091566 | 0.159651 | 0.216225 | 1.0 |


    SibSp : Number of Siblings/Spouses Aboard
    Parch : Number of Parents/Children Aboard
    
    So we can make a new column FamilySize by adding these two columns.

In [12]:
df['FamilySize'] = df['SibSp']+df['Parch']
df.drop(['SibSp', 'Parch'], axis=1, inplace = True)
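# numeric_only=True restricts the correlation to numeric columns (required in recent pandas versions)
df.corr(numeric_only=True)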

| | PassengerId | Survived | Pclass | Age | Fare | FamilySize |
|---|---|---|---|---|---|---|
| PassengerId | 1.0 | -0.005007 | -0.035144 | 0.033207 | 0.012658 | -0.040143 |
| Survived | -0.005007 | 1.0 | -0.338481 | -0.069809 | 0.257307 | 0.016639 |
| Pclass | -0.035144 | -0.338481 | 1.0 | -0.331339 | -0.5495 | 0.065997 |
| Age | 0.033207 | -0.069809 | -0.331339 | 1.0 | 0.091566 | -0.248512 |
| Fare | 0.012658 | 0.257307 | -0.5495 | 0.091566 | 1.0 | 0.217138 |
| FamilySize | -0.040143 | 0.016639 | 0.065997 | -0.248512 | 0.217138 | 1.0 |


### FamilySize has very little correlation with the survival rate

    Let's check whether being alone or not affects the survival rate.

In [13]:
# 1 if the passenger has no family aboard, 0 otherwise (FamilySize is never negative)
df['Alone'] = (df['FamilySize'] == 0).astype(int)
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | Ticket | Fare | Embarked | FamilySize | Alone |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | A/5 21171 | 7.25 | S | 1 | 0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | PC 17599 | 71.2833 | C | 1 | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | STON/O2. 3101282 | 7.925 | S | 0 | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 113803 | 53.1 | S | 1 | 0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 373450 | 8.05 | S | 0 | 1 |


In [14]:
df.groupby(['Alone'])['Survived'].mean()

Alone
0    0.505650
1    0.303538
Name: Survived, dtype: float64

### Passengers travelling alone had a lower chance of surviving

    One possible reason is that passengers travelling with family were more likely to belong to a wealthier class and may have been prioritized over others, or that family members could help one another.
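
One way to probe the wealth explanation is to compare the class mix of passengers with and without family aboard. The sketch below uses invented values, so its numbers say nothing about the real data; `toy` and `class_share` are hypothetical names.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Alone":  [0, 0, 0, 1, 1, 1, 1, 0],
    "Pclass": [1, 1, 3, 3, 3, 2, 3, 2],
})

# Share of each passenger class within the Alone = 0 and Alone = 1 groups.
# normalize="index" makes every row sum to 1.
class_share = pd.crosstab(toy["Alone"], toy["Pclass"], normalize="index")
```

If the Alone = 0 row of the real data put clearly more weight on Pclass 1 than the Alone = 1 row, that would be consistent with the explanation, although it would not prove it.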

In [15]:
df[['Alone','Fare']].corr()

| | Alone | Fare |
|---|---|---|
| Alone | 1.0 | -0.271832 |
| Fare | -0.271832 | 1.0 |


The negative correlation (about -0.27) shows that passengers who were not alone tended to pay higher fares.

In [16]:
df

| | PassengerId | Survived | Pclass | Name | Sex | Age | Ticket | Fare | Embarked | FamilySize | Alone |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.000000 | A/5 21171 | 7.2500 | S | 1 | 0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.000000 | PC 17599 | 71.2833 | C | 1 | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.000000 | STON/O2. 3101282 | 7.9250 | S | 0 | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.000000 | 113803 | 53.1000 | S | 1 | 0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.000000 | 373450 | 8.0500 | S | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.000000 | 211536 | 13.0000 | S | 0 | 1 |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.000000 | 112053 | 30.0000 | S | 0 | 1 |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | 29.699118 | W./C. 6607 | 23.4500 | S | 3 | 0 |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.000000 | 111369 | 30.0000 | C | 0 | 1 |


In [17]:
df.index

RangeIndex(start=0, stop=891, step=1)

In [18]:
# Encode Sex as 0 for male and 1 for everyone else (female)
df['Sex'] = (df['Sex'] != 'male').astype(int)
df.groupby(['Sex'])['Survived'].mean()

Sex
0    0.188908
1    0.742038
Name: Survived, dtype: float64

Female passengers had a much higher chance of surviving (about 74%) than male passengers (about 19%).

This suggests that women were prioritized over men.

In [19]:
df.groupby(['Embarked'])['Survived'].mean()

Embarked
C    0.553571
Q    0.389610
S    0.339009
Name: Survived, dtype: float64

 
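# CONCLUSION

* Female passengers had a much higher survival rate than male passengers (about 74% versus 19%), which suggests women were prioritized.
* Passengers in a higher class, or who paid a higher fare, had a higher survival rate than others: Survived correlates at about -0.34 with Pclass (where 1 is the highest class) and about 0.26 with Fare. A class hierarchy may have been followed while saving the passengers.
* Passengers travelling with family had a higher survival rate (about 51%) than those travelling alone (about 30%).
* Passengers who boarded the ship at Cherbourg survived in a higher proportion (about 55%) than those who boarded at Queenstown or Southampton.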
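
The class conclusion rests on correlation coefficients. A per-class survival rate, computed the same way as the rates for Alone, Sex, and Embarked, would state it more directly. The sketch below shows the pattern on invented data; `toy` and `summary` are hypothetical names, and the numbers are not results from the Titanic file.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Survival rate and group size for each passenger class.
# Reporting the count next to the mean shows how much data backs each rate.
summary = toy.groupby("Pclass")["Survived"].agg(["mean", "count"])
```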