
RMS Titanic was a passenger ship built in the United Kingdom which sank in the early hours of 15 April 1912 following a collision with an iceberg in the North Atlantic Ocean. It was the ship's maiden voyage, travelling from Southampton, UK to New York City, USA with stops along the way in Cherbourg, France and Queenstown (now Cobh), Ireland.

The purpose of this notebook is to explore a dataset of passengers on board the doomed liner, making observations about different aspects of it and hopefully obtaining interesting insights along the way. The dataset is available online as a CSV file.

First, let's import <i>matplotlib.pyplot</i> and <i>seaborn</i>, which we will use to create graphs from the dataset, along with <i>pandas</i> and <i>numpy</i> for handling the data:

In [None]:
import matplotlib.pyplot as plt

# Must also include the following so that matplotlib graphs will display properly in Jupyter notebook cells
%matplotlib inline

import seaborn as sns

# Set the context to 'talk' now so that all graphs created will appear reasonably big (see seaborn notebook for explanation of contexts)
sns.set_context("talk")

import pandas as pd
import numpy as np

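Before we begin, here is a brief overview of the columns commonly found in Titanic datasets. The labels below are descriptive; the exact column names (for example <i>Pclass</i> for the ticket class) and which columns are present depend on the CSV file being loaded: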

<table>
<tr><th><i>Column Name</i></th><th><i>Description</i></th></tr>
<tr><th><b>Class</b></th><td>Which class ticket the passenger held (1 = First, 2 = Second, 3 = Third).</td></tr>
<tr><th><b>Survived</b></th><td>Did the passenger survive the Titanic's sinking? (1 = Yes, 0 = No)</td></tr>
<tr><th><b>Name</b></th><td>The passenger's full name and title.</td></tr>
<tr><th><b>Age</b></th><td>How old the passenger was at the time of the Titanic's sinking.</td></tr>
<tr><th><b>Sibling or Spouse</b></th><td>Number of passenger's siblings or spouses aboard, if any.</td></tr>
<tr><th><b>Parents or Children</b></th><td>Number of passenger's parents or children aboard, if any.</td></tr>
<tr><th><b>Ticket No.</b></th><td>Number on this passenger's ticket.</td></tr>
<tr><th><b>Fare</b></th><td>The amount (in pounds) that this passenger's ticket cost.</td></tr>
<tr><th><b>Cabin</b></th><td>The passenger's cabin on the Titanic.</td></tr>
<tr><th><b>Embarked</b></th><td>Place of embarkation (S = Southampton, Q = Queenstown [now Cobh], C = Cherbourg)</td></tr>
<tr><th><b>Boat</b></th><td>The number/letter of the lifeboat the passenger escaped the ship on, if any.</td></tr>
<tr><th><b>Body</b></th><td>Body identification number if passenger died and body was recovered successfully.</td></tr>
<tr><th><b>Home/Dest</b></th><td>The passenger's home/destination.</td></tr>
</table>

Now we will use <i>pandas</i> to create a DataFrame object from the CSV file which holds the data.

In [None]:
# Read in CSV file to a DataFrame object
online_csv = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(online_csv)
# Display first 10 rows
df.head(10)

In [None]:
df.info()

We can already see in the first 10 rows that there are "missing" values. In the output from info() you can see that some columns don't have 891 entries. How many missing values are there?

In [None]:
df.isnull().sum()

So we can see that there are many missing values for Age and Cabin, and two for Embarked.    

This summary is generated using two functions:
  * Pandas [dataframe.isnull()](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.isnull.html)    
  * Pandas [dataframe.sum()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html)    
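
To see how the two functions combine, here is a minimal sketch on a small made-up DataFrame (the `toy` data and its values are hypothetical, not taken from the Titanic CSV). `isnull()` produces a table of booleans, and `sum()` counts the `True` values in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only (not the Titanic dataset)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 8.05, 8.46],
})

# isnull() returns a DataFrame of booleans: True wherever a value is missing
mask = toy.isnull()

# sum() adds up each column, treating True as 1 and False as 0,
# which gives the number of missing values per column
missing_counts = mask.sum()
print(missing_counts)

# Taking the mean of the boolean mask instead gives the fraction missing per column
missing_fraction = mask.mean()
print(missing_fraction)
```

The fraction is often easier to interpret than the raw count, since it shows how much of a column is unusable regardless of the dataset's size.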

Missing values appear as NaN ("not a number"), a special floating-point value available in numpy as `np.nan`. Try the following ...    

In [None]:
type(np.nan)

In [None]:
np.nan == np.nan

In [None]:
None == None

In [None]:
np.nan == None
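
Because NaN never compares equal to anything, including itself, `==` cannot be used to find missing values. The sketch below shows the dedicated checks that can be used instead; the variable names are only illustrative.

```python
import math

import numpy as np
import pandas as pd

# NaN follows the IEEE 754 floating-point rules: it is not equal to itself
nan_equals_itself = (np.nan == np.nan)
nan_differs_from_itself = (np.nan != np.nan)
print(nan_equals_itself, nan_differs_from_itself)

# Dedicated functions test for NaN without relying on ==
is_nan_numpy = bool(np.isnan(np.nan))
is_nan_math = math.isnan(np.nan)
print(is_nan_numpy, is_nan_math)

# pandas treats both NaN and None as missing
nan_is_missing = pd.isna(np.nan)
none_is_missing = pd.isna(None)
print(nan_is_missing, none_is_missing)

# On a Series the check is applied element by element
flags = pd.Series([1.0, np.nan, 3.0]).isna().tolist()
print(flags)
```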

Calculate the average age of all Titanic passengers.

In [None]:
df['Age'].mean()

What happens to the `np.nan` values when the mean is calculated? What if you use [fillna](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html) to "correct" the missing values first? Is the following code a "good idea"?

In [None]:
print(df['Age'].fillna(0).mean())
print(df['Age'].mean())
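
The effect is easier to follow on a tiny example where the arithmetic can be checked by hand. The sketch below uses a made-up `ages` Series (hypothetical values, not Titanic data) and compares the default mean, which skips NaN, with the mean after filling with 0. Filling with the median is shown as one possible alternative, not as a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical ages: three known values and two missing ones
ages = pd.Series([20.0, 30.0, np.nan, 40.0, np.nan])

# By default mean() skips NaN: (20 + 30 + 40) / 3
mean_skipping_nan = ages.mean()

# Filling with 0 adds two fake "newborns": (20 + 30 + 0 + 40 + 0) / 5
mean_zero_filled = ages.fillna(0).mean()

# Filling with the median of the known values keeps the centre where it was
mean_median_filled = ages.fillna(ages.median()).mean()

# With skipna=False a single NaN makes the whole result NaN
mean_with_nan = ages.mean(skipna=False)

print(mean_skipping_nan, mean_zero_filled, mean_median_filled, mean_with_nan)
```

Filling with 0 pulls the average down because 0 is treated as a real age, so the fill value needs to make sense for the column in question.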

Let's make some simple graphs to explore the data.

**Histograms**

Let's begin by exploring the ages of the passengers who survived, using a histogram. We will use the `plt.hist()` function.

Remember that `plt` is the short name we assigned to matplotlib.pyplot when we imported it.

`plt.hist()` creates a histogram from a single column of values, so you need to select one column to plot.

In [None]:
# Select the ages of surviving passengers, dropping missing ages before plotting
plt.hist(df.loc[df["Survived"] == 1, "Age"].dropna(), bins=20)
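
A histogram is just a count of values per bin, and `plt.hist()` relies on `np.histogram` to do that counting. The sketch below bins a small made-up array of ages (hypothetical values) into 4 equal-width bins so the counts can be checked by hand. Missing values are removed first, since `np.histogram` cannot work out a bin range when NaN is present.

```python
import numpy as np

# Hypothetical ages, including one missing value
ages = np.array([2.0, 5.0, 18.0, 22.0, 25.0, 31.0, 38.0, 40.0, np.nan])

# Keep only the entries that are not NaN
valid_ages = ages[~np.isnan(ages)]

# Split the range from the smallest to the largest age into 4 equal-width bins
counts, edges = np.histogram(valid_ages, bins=4)

print(edges)   # bin boundaries: 5 edges for 4 bins
print(counts)  # number of ages falling in each bin
```

The `bins` argument controls how coarse the picture is: fewer bins smooth over detail, while more bins show finer structure but become noisy when there are few values.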

**A line graph**

The following density plot (a kernel density estimate, drawn as a smooth line) shows the distribution of age across the three passenger classes (`Pclass`).

In [None]:
for pclass in [1, 2, 3]:  # draw one density curve per passenger class
    df.loc[df["Pclass"] == pclass, "Age"].plot(kind="kde")
plt.title("Age distribution by passenger class")
plt.legend(("1st", "2nd", "3rd"))