Step 1: Click __+ Add data__ and add the dataset using this link
Data URL:
https://www.kaggle.com/benjaminashley/titanic/download

Step 2: Copy the location of the csv file:
'../input/titanic/Titanic_Dataset.csv'


|Variable|Definition|Key|
| ------------- |:-------------:| -----:|
|survival|Survival|0 = No, 1 = Yes|
|pclass|Ticket class|1 = 1st, 2 = 2nd, 3 = 3rd|
|sex|Sex||
|Age|Age in years||
|sibsp|# of siblings / spouses aboard the Titanic||
|parch|# of parents / children aboard the Titanic||
|ticket|Ticket number||
|fare|Passenger fare||
|cabin|Cabin number||
|embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton|

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for plotting data
%matplotlib inline

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

The first thing we will do is load the data and take a quick look at the dataset as a dataframe table.

In [None]:
# Path of the file to read
Titanic = '../input/titanic/Titanic_Dataset.csv'

# Read the file into a variable called titanic
titanic = pd.read_csv(Titanic)
print('The shape of the dataframe is:',titanic.shape)
titanic.head()

So this is a relatively small dataset. There are 891 observations (rows) with 12 variables (columns). 

Even though it is small, this dataset has a lot of information we will be able to use to get some interesting insights.

Let's quickly change the Survival status of the passengers to text values to help with interpretation.

In [None]:
#df['column name'] = df['column name'].replace(['1st old value','2nd old value',...],['1st new value','2nd new value',...])
titanic['Survived'] = titanic['Survived'].replace([0,1],['perished' , 'survived'])
titanic.head()

We have 2 variables not previously mentioned, *Passenger_Id* and *Full_Name*. These 2 variables are not really of interest to us: every value is unique, so they won't show any trends or provide useful insights into the data. The *Passenger_Id* is a unique identifier for each observation and *Full_Name* is the full name of each passenger.

Here we can see a summary description of the data which shows, amongst other things, that no passenger name is repeated...

In [None]:
# Get a break down of the summary stats for the dataset.
titanic.describe(include='all')

Take a moment to study the values. It's not as scary as it looks!

* Notice the difference between the numerical values and the categorical values. Take *Age* for example and compare it with *Embarked*
* We have some missing values.
* We also have some numerical values which should really be understood as categorical variables
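The last two points can be checked directly in pandas. The sketch below is only an illustration: `toy` is a small made-up dataframe that mimics a few Titanic columns (it is not the real data), and the same calls should work on the `titanic` dataframe, assuming its columns are named the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic a few Titanic columns (NOT the real data)
toy = pd.DataFrame({
    'Pclass': [1, 3, 2, 3],
    'Age': [38.0, np.nan, 26.0, 35.0],
    'Embarked': ['C', 'S', None, 'S'],
})

# Count the missing values in each column
missing = toy.isna().sum()
print(missing)

# Pclass is stored as a number but is really a label (1st, 2nd, 3rd),
# so it can be converted to a categorical type
toy['Pclass'] = toy['Pclass'].astype('category')
print(toy.dtypes)
```

Once a column is categorical, `describe(include='all')` reports counts and the most frequent value for it instead of a mean, which is usually more meaningful for labels such as ticket class.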

Remember:
- Survived is either 0 or 1 (perished/survived)
....


OK, so it's time to explain, based on the data, the story of who survived and who didn't on the Titanic.

## 1. What was the survival rate?
To do this, you only need three things:
1. The data
2. Matplotlib library (https://matplotlib.org/)
3. A method of visualization (e.g. a bar chart)

(A bar chart is a good way of visualising categorical data: we cannot apply simple arithmetic such as + - / * to the categories, but we can count them.)

In [None]:
titanic['Survived'].value_counts()

In [None]:
titanic['Survived'].value_counts().plot(kind="bar")

With the graph we get a more visual representation, right? So we know that over 500 people perished while only around 300 survived.

As good practice and for better interpretation of the graphs, we can add a few more lines of code. __Let's give the viz a title and label the axes as well as change the colour from the default blue.__

All we have to do is copy and paste our previous code and add a few more lines...

In [None]:
titanic['Survived'].value_counts().plot(kind="bar",color=['salmon', 'turquoise'])
plt.suptitle("Survival Rate on the Titanic", fontsize=16)
plt.xlabel('Survived')
plt.ylabel('Nº_Passengers')

We can also check the proportions as __percentages__ to get a better idea of their contribution to the total number of passengers.

In [None]:
# Calculate the survival rate and assign it to a variable called, 'SurvivalRate'
SurvivalRate = titanic['Survived'].value_counts(normalize=True) * 100
SurvivalRate

So we now know that around 60% perished and 40% survived.
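As a cross-check on these percentages, the share of survivors can also be computed as the mean of a True/False mask. This is a minimal sketch on a made-up Series called `toy` (not the real data), so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy labels (NOT the real Titanic data): 3 perished, 2 survived
toy = pd.Series(['perished', 'survived', 'perished', 'survived', 'perished'])

# value_counts(normalize=True) returns fractions that sum to 1
shares = toy.value_counts(normalize=True) * 100

# The mean of a True/False mask is the fraction of True values
survived_pct = (toy == 'survived').mean() * 100

print(shares)
print(survived_pct)
```

Both routes should agree; the mask version is handy when only one of the categories is of interest.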

-------

Now we have an understanding of the relative proportion of survival. We've also studied up on the subject of the Titanic (or just watched the film), and we remember that when deploying the lifeboats they often called out, "Women and children first!". So a reasonable hypothesis to explore is: did more women survive than men?

## 2. What was the survival rate by gender?


In [None]:
Female = titanic[titanic['Sex']=='female']['Survived'].value_counts() # Count the number of survivors and those that perished when passenger was female.
Male = titanic[titanic['Sex']=='male']['Survived'].value_counts()

df = pd.DataFrame([Female,Male]) # Create a dataframe of this new data
df.index=['Female','Male'] # Label the rows
df.plot(kind='bar',stacked=True,figsize=(18,6),title= "Titanic Survival Rates by Sex", color=['turquoise','salmon']) # figsize (not fig) sets the plot size

So this tells us that females survived disproportionately more than males did. That's interesting, but it would be good to know if other factors also played an important role in the survival of the passengers. 
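The stacked bars show counts, so "disproportionately" is easier to judge from row-wise percentages. One way to get them, sketched here on a small made-up dataframe called `toy` (not the real data), is `pd.crosstab` with `normalize='index'`.

```python
import pandas as pd

# Hypothetical toy rows (NOT the real Titanic data)
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'male', 'male', 'male', 'male'],
    'Survived': ['survived', 'survived', 'perished',
                 'perished', 'perished', 'perished', 'survived'],
})

# normalize='index' turns each row into fractions that sum to 1,
# so every gender is compared on the same 0-100 scale
rates = pd.crosstab(toy['Sex'], toy['Survived'], normalize='index') * 100
print(rates.round(1))
```

Assuming the real dataframe has the same `Sex` and `Survived` columns, swapping `toy` for `titanic` would give the survival percentage within each gender.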

As you can imagine, the higher the class of ticket, the higher up on the ship you were (with better views). Passengers higher up were also closer to the deck, so they had better access to the lifeboats.

## 3. What was the survival rate by class of ticket?


In [None]:
First = titanic[titanic['Pclass']==1]['Survived'].value_counts()
Second = titanic[titanic['Pclass']==2]['Survived'].value_counts()
Third = titanic[titanic['Pclass']==3]['Survived'].value_counts()

df2 = pd.DataFrame([First,Second,Third])
df2.index=['First','Second','Third']
df2.plot(kind='bar',stacked=True,figsize=(18,6),title= "Titanic Survival Rates by Class",color=['turquoise','salmon']) # figsize (not fig) sets the plot size
plt.ylabel('Nº_Passengers')

Something pops out straight away with this viz.

* Unfortunately, the folks in 3rd class had a pretty poor survival rate, whereas in 1st class we see that more than half survived.

What is also interesting is the class sizes. We would expect few 1st class passengers and many 3rd class passengers, which is what we see here. However, we would typically expect the number of 2nd class passengers to fall somewhere between the two extremes, and yet there are fewer 2nd class passengers than 1st.

Despite this, given the proportions:
* Over half the 1st class passengers survived.
* Almost half the 2nd class passengers survived.
* Just under a quarter of the 3rd class passengers survived.

This shows a clear drop in the survival rate as you go down through the ship.
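Proportions like these can be read off a chart, but they can also be computed. The sketch below flags survivors and averages the flag within each class; `toy` is a made-up dataframe (not the real data), so the printed rates are illustrative only.

```python
import pandas as pd

# Hypothetical toy rows (NOT the real Titanic data)
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': ['survived', 'survived', 'perished',
                 'survived', 'perished',
                 'perished', 'perished', 'perished', 'survived'],
})

# Flag the survivors as True/False, then average the flag within each class
survived_flag = toy['Survived'].eq('survived')
rate_by_class = survived_flag.groupby(toy['Pclass']).mean() * 100
print(rate_by_class.round(1))
```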

So gender definitely seems to be a factor in whether or not you survived, and so does the class of your ticket. But what about both of them together? Let's drill down on both...
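Before building any charts, a compact way to get counts for both factors at once is a two-level cross-tabulation. This is a sketch on a made-up dataframe called `toy` (not the real data); it assumes the same `Pclass`, `Sex` and `Survived` column names as the notebook.

```python
import pandas as pd

# Hypothetical toy rows (NOT the real Titanic data)
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 1, 3, 3, 3, 3],
    'Sex': ['female', 'female', 'male', 'male',
            'female', 'female', 'male', 'male'],
    'Survived': ['survived', 'survived', 'survived', 'perished',
                 'survived', 'perished', 'perished', 'perished'],
})

# Passing a list of columns as the rows gives one row per (class, gender) pair;
# combinations that never occur are filled with 0
counts = pd.crosstab([toy['Pclass'], toy['Sex']], toy['Survived'])
print(counts)
```

The table has one column per survival label, so each row can be compared directly with the matching stacked bar.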

## 4. What was the survival rate by class of ticket and Gender?


In [None]:
# filtering by pclass 1 to create a copy in a new dataframe for only passengers in First Class.
Class1 = titanic[titanic['Pclass']==1].copy()

Female1 = Class1[Class1['Sex']=='female']['Survived'].value_counts()
Male1 = Class1[Class1['Sex']=='male']['Survived'].value_counts()

df = pd.DataFrame([Female1,Male1]) # Create a dataframe of this new data
df.index=['Female','Male'] # Label the rows
#df.plot(kind='bar',stacked=True,figsize=(18,6),title= "Titanic Survival Rates by Sex in First Class", color=['turquoise','salmon'])

# filtering by pclass 2
Class2 = titanic[titanic['Pclass']==2].copy()

Female2 = Class2[Class2['Sex']=='female']['Survived'].value_counts()
Male2 = Class2[Class2['Sex']=='male']['Survived'].value_counts()

df2 = pd.DataFrame([Female2,Male2]) # Create a dataframe of this new data
df2.index=['Female','Male'] # Label the rows
#df2.plot(kind='bar',stacked=True,figsize=(18,6),title= "Titanic Survival Rates by Sex in Second Class", color=['turquoise','salmon'])

# filtering by pclass 3
Class3 = titanic[titanic['Pclass']==3].copy()

Female3 = Class3[Class3['Sex']=='female']['Survived'].value_counts()
Male3 = Class3[Class3['Sex']=='male']['Survived'].value_counts()

df3 = pd.DataFrame([Female3,Male3]) # Create a dataframe of this new data
df3.index=['Female','Male'] # Label the rows
#df3.plot(kind='bar',stacked=True,figsize=(18,6),title= "Titanic Survival Rates by Sex in Third Class", color=['turquoise','salmon'])

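# The figure size is set once here, when the three subplots are created
fig, axes = plt.subplots(nrows=1, ncols=3, sharey=True, figsize=(18,6))
plt.gcf().suptitle('Titanic Survival Rates by Class and Gender')

# Select the columns explicitly so 'survived' is always turquoise and 'perished' is always salmon,
# whatever order value_counts() returned them in
cols = ['survived','perished']
df[cols].plot(ax=axes[0],kind='bar',stacked=True,title= "First Class", color=['turquoise','salmon'])
df2[cols].plot(ax=axes[1],kind='bar',stacked=True,title= "Second Class", color=['turquoise','salmon'],legend=False)
df3[cols].plot(ax=axes[2],kind='bar',stacked=True,title= "Third Class", color=['turquoise','salmon'],legend=False)
plt.show()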

What we can see from this....


## 5. What was the distribution of passenger ages?
A histogram is a great way to show this.

A histogram "bins" the ages into groups, and the area of each bar represents the amount of data in that group (with equal-width bins, as here, that is simply the bar height).
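To make the binning concrete, the sketch below runs NumPy's `np.histogram` on a handful of made-up ages (not the real data) and prints the bin edges and the count in each bin. Missing ages are dropped first, because `np.histogram` cannot work out a range when the data contains NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages (NOT the real Titanic data), including one missing value
ages = pd.Series([2, 5, 22, 24, 29, 35, 38, 54, 61, 80, np.nan])

# Drop the missing value, then split the age range into 4 equal-width bins
counts, edges = np.histogram(ages.dropna(), bins=4)

print(edges)   # the 5 boundaries of the 4 bins
print(counts)  # how many ages fall into each bin
```

The plotting cells in this section do the same kind of counting behind the scenes, just with `bins=17` instead of 4.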

In [None]:
# Make separate dataframes for whether passengers survived or not.
Survived = titanic[titanic['Survived']=='survived'].copy()
Perished = titanic[titanic['Survived']=='perished'].copy()

Perished['Age'].plot(kind = 'hist',bins=17, alpha=0.7,label='Perished',color=['salmon'])
Survived['Age'].plot(kind = 'hist',bins=17, alpha=0.5,label='Survived',color=['turquoise'])

plt.suptitle("Titanic Age Distribution", fontsize=16)
plt.xlabel('Age')
plt.ylabel('Nº_Passengers')
plt.legend(loc='upper right')
plt.show()

In [None]:
Perished['Age'].plot(kind = 'density', alpha=0.7,label='Perished',color=['salmon'])
Survived['Age'].plot(kind = 'density', alpha=0.5,label='Survived',color=['turquoise'])

plt.suptitle("Titanic Age Distribution", fontsize=16)
plt.xlabel('Age')
plt.ylabel('Density') # a density plot shows an estimated density, not a passenger count
plt.legend(loc='upper right')
plt.show()

## 6. What was the distribution of passenger survival by age, class and gender?

In [None]:
# We already created dataframes of the different classes
#Class1 = titanic[titanic['Pclass']==1].copy()

# Set the Viz area
fig, axes = plt.subplots(nrows=2, ncols=3, sharey=True, sharex=True, figsize=(10,6))
plt.gcf().suptitle('Titanic Survival Rates by Age, Class and Gender',fontsize=16)

# We can further filter this by Gender for each class...
# First
Class1M = Class1[Class1['Sex']=='male'].copy()
Class1M['Age'].groupby(Class1M['Survived']).plot(ax=axes[0,0],kind = 'hist', alpha=0.7,legend=True, title='First')
Class1F = Class1[Class1['Sex']=='female'].copy()
Class1F['Age'].groupby(Class1F['Survived']).plot(ax=axes[1,0],kind = 'hist', alpha=0.7,legend=False)
                                                 
# Second
Class2M = Class2[Class2['Sex']=='male'].copy()
Class2M['Age'].groupby(Class2M['Survived']).plot(ax=axes[0,1],kind = 'hist', alpha=0.7,legend=False, title='Second')
Class2F = Class2[Class2['Sex']=='female'].copy()
Class2F['Age'].groupby(Class2F['Survived']).plot(ax=axes[1,1],kind = 'hist', alpha=0.7,legend=False)
               
# Third
Class3M = Class3[Class3['Sex']=='male'].copy()
Class3M['Age'].groupby(Class3M['Survived']).plot(ax=axes[0,2],kind = 'hist', alpha=0.7,legend=False, title='Third')
Class3F = Class3[Class3['Sex']=='female'].copy()
Class3F['Age'].groupby(Class3F['Survived']).plot(ax=axes[1,2],kind = 'hist', alpha=0.7,legend=False)
               
# Add MALE and FEMALE labels    
plt.gcf().text(1, 0.7, 'MALE', style='italic', fontsize=20)    
plt.gcf().text(1, 0.3, 'FEMALE', style='italic', fontsize=20)    
    
plt.xlabel('Age')
plt.show()

In [None]:
# We already created dataframes of the different classes
#Class1 = titanic[titanic['Pclass']==1].copy()

# Set the Viz area
fig, axes = plt.subplots(nrows=2, ncols=3, sharey=True, sharex=True)
plt.gcf().suptitle('Titanic Survival Rates by Age, Class and Gender',fontsize=16)

# We can further filter this by Gender for each class...
# First
Class1M = Class1[Class1['Sex']=='male'].copy()
Class1M['Age'].groupby(Class1M['Survived']).plot(ax=axes[0,0],kind = 'density', alpha=0.7,legend=True, title='First')
Class1F = Class1[Class1['Sex']=='female'].copy()
Class1F['Age'].groupby(Class1F['Survived']).plot(ax=axes[1,0],kind = 'density', alpha=0.7,legend=False)
                                                 
# Second
Class2M = Class2[Class2['Sex']=='male'].copy()
Class2M['Age'].groupby(Class2M['Survived']).plot(ax=axes[0,1],kind = 'density', alpha=0.7,legend=False, title='Second')
Class2F = Class2[Class2['Sex']=='female'].copy()
Class2F['Age'].groupby(Class2F['Survived']).plot(ax=axes[1,1],kind = 'density', alpha=0.7,legend=False)
               
# Third
Class3M = Class3[Class3['Sex']=='male'].copy()
Class3M['Age'].groupby(Class3M['Survived']).plot(ax=axes[0,2],kind = 'density', alpha=0.7,legend=False, title='Third')
Class3F = Class3[Class3['Sex']=='female'].copy()
Class3F['Age'].groupby(Class3F['Survived']).plot(ax=axes[1,2],kind = 'density', alpha=0.7,legend=False)
               
# Add MALE and FEMALE labels    
plt.gcf().text(1, 0.7, 'MALE', style='italic', fontsize=20)    
plt.gcf().text(1, 0.3, 'FEMALE', style='italic', fontsize=20)    
    
plt.xlabel('Age')
plt.show()