# Exploring the Raw Titanic Dataset

Part of the journey to making a good regression model is to understand the data that we are modelling. To do this, we will perform some exploratory data analysis on the raw data from the [Titanic Kaggle Challenge](https://www.kaggle.com/c/titanic). The purpose of this challenge is to predict the probability of survival for a passenger, given their boarding details.

<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://upload.wikimedia.org/wikipedia/commons/f/fd/RMS_Titanic_3.jpg"
     alt="Titanic"
     style="padding-bottom: 0.5em"
     width="600"/>
The RMS Titanic
</div>

## Imports

In [1]:
import pandas as pd
import numpy as np

## Data

In [2]:
df_train = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/regression_sprint/titanic_train_raw.csv')
df_test = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/regression_sprint/titanic_test_raw.csv')

In [3]:
df_train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [4]:
df_test.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


## Questions

### Question 1

After briefly looking through the data, you may notice that some entries are missing.

Write a function that determines the percentage of missing entries for each column in the dataset. The function should return a `dict` that has each column name as a key and the percentage of missing entries in that column as its value.

_**Function Specifications:**_
* Should take a pandas `DataFrame` as input and return a `dict` as output.
* The `dict` should contain a key / value pair, where the key is the column name, and the value is the percentage of missing entries in the column.
* Should be generalised to be able to work on _**ANY**_ dataframe.
* Numeric values should be rounded to two decimal places.

In [5]:
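def percent_missing(df):
    # Map each column name to its percentage of missing entries
    return_dict = {}
    for column in df.columns:
        # Missing count divided by the number of rows, as a percentage
        return_dict[column] = round(df[column].isnull().sum() / df.shape[0] * 100, 2)

    return return_dict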

In [6]:
percent_missing(df_train)

_**Expected Outputs:**_
```python
percent_missing(df_train) == {
    'PassengerId': 0.0,
    'Survived': 0.0,
    'Pclass': 0.0,
    'Name': 0.0,
    'Sex': 0.0,
    'Age': 19.87,
    'SibSp': 0.0,
    'Parch': 0.0,
    'Ticket': 0.0,
    'Fare': 0.0,
    'Cabin': 77.1,
    'Embarked': 0.22
}
```
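The per-column loop can also be written without an explicit loop. The following is only a sketch of one alternative, run on a small made-up `DataFrame` (the `toy` data and the name `percent_missing_vectorised` are illustrative assumptions, not part of the exercise). It relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import numpy as np
import pandas as pd


def percent_missing_vectorised(df):
    """Return {column: percentage of missing entries}, rounded to 2 decimals."""
    # isnull() gives a boolean frame; the column-wise mean of booleans
    # is the fraction of missing entries in each column.
    return (df.isnull().mean() * 100).round(2).to_dict()


# Hypothetical toy data, not taken from the Titanic files.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],       # 1 of 4 missing
    'Cabin': [None, 'C85', None, None],      # 3 of 4 missing
    'Sex': ['male', 'female', 'female', 'male'],
})

result = percent_missing_vectorised(toy)
print(result)  # Age: 25.0, Cabin: 75.0, Sex: 0.0
```

Both versions should agree on any non-empty `DataFrame`; the vectorised form simply hands the counting to pandas.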

### Question 2

It would be a good idea to replace some of our missing data with numerical data. Missing values can be replaced with either the _mean_ or the _median_ of the column. Write a function that takes as input a dataframe, a column name, and a string that is either `'mean'` or `'median'`, and returns either the mean or the median of that column.

_**Function Specifications:**_
* The function should take three inputs: `(df, column_name, choice)`, where `df` is a pandas `DataFrame`, `column_name` is a `str`, and `choice` is a `str` that defaults to `'mean'`.
* If the `column_name` does not exist in `df`, raise a `ValueError`.
* Should return as output the relevant value based on the `choice` input, that is, either the mean or median of that column.
* The value should be rounded to 2 decimal places.

In [7]:
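def calc_mean_median(df, column_name, choice='mean'):
    # Reject column names that are not present in the DataFrame
    if column_name not in df.columns:
        raise ValueError('Column Name does not exist')

    # Compare case-insensitively so 'Mean' and 'MEDIAN' are accepted too
    choice = choice.lower()
    if choice == 'mean':
        result = round(df[column_name].mean(), 2)
    elif choice == 'median':
        result = round(df[column_name].median(), 2)
    else:
        # Any other choice would leave `result` undefined, so fail loudly
        raise ValueError("choice must be 'mean' or 'median'")
    return result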

In [8]:
calc_mean_median(df_train, 'Age')

_**Expected Outputs:**_
```python
calc_mean_median(df_train, 'Age') == 29.7
calc_mean_median(df_train, 'Pclass', choice='median') == 3.0
```
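To connect this back to the motivation for the question, here is a minimal sketch of how a mean or median could be used to fill missing values with `fillna`. The `toy` data is made up for illustration and is not part of the Titanic dataset. Note that pandas skips `NaN` entries by default when computing `mean()` and `median()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing ages.
toy = pd.DataFrame({'Age': [20.0, np.nan, 30.0, 70.0, np.nan]})

# NaN values are ignored: mean = (20 + 30 + 70) / 3, median = middle of 20, 30, 70
mean_age = round(toy['Age'].mean(), 2)
median_age = round(toy['Age'].median(), 2)

# fillna returns a new Series; the original column is left unchanged.
filled_mean = toy['Age'].fillna(mean_age)
filled_median = toy['Age'].fillna(median_age)

print(mean_age, median_age)
print(filled_mean.tolist())
```

The outlier of 70 pulls the mean well above the median here, which is one reason the median is often preferred for skewed columns.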

### Question 3

We ultimately want to predict the survival chance of the passengers in the testing set. We can start by building a simple model from the data we already have, using _conditional probability_! Write a function that returns the survival probability of a passenger, given a variable from the dataset.

_**Function specifications:**_
* The function should make use of the `df_train` `DataFrame` loaded earlier in this notebook.
* It should take a `column_name` as input. Assume that the column name exists in `df_train`.
* It should return a survival likelihood for each element within the given `column_name` as a number between 0 and 1. This should return a `DataFrame` that contains the chosen `column_name` as an index, and should contain a column containing the survival probabilities. You can use `groupby` and an aggregate function to help you calculate this value.
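As a rough illustration of why a grouped mean gives a conditional probability: because `Survived` is coded as 0 or 1, its mean within a group equals the number of survivors in that group divided by the group size. The sketch below checks this on a small made-up table (the `toy` values are hypothetical and not taken from the Titanic data).

```python
import pandas as pd

# Hypothetical toy data: 3 first-class, 2 second-class, 4 third-class passengers.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# P(Survived = 1 | Pclass = 1), computed directly from counts.
in_first = toy['Pclass'] == 1
p_manual = ((toy['Survived'] == 1) & in_first).sum() / in_first.sum()

# The same quantity for every class at once: the mean of a 0/1 column
# within each group is the proportion of ones in that group.
p_grouped = toy.groupby('Pclass')[['Survived']].mean()

print(p_manual)
print(p_grouped)
```

Here the manual count for first class (2 survivors out of 3) matches the first row of the grouped result.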

In [9]:
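def survival_likelihood(column_name):
    # Select 'Survived' before aggregating so that non-numeric columns
    # (such as 'Name') are never passed to mean(); the mean of a 0/1
    # column is the proportion of survivors in each group
    df_prob = df_train.groupby(column_name)[['Survived']].mean()

    return df_prob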

In [10]:
survival_likelihood('SibSp')

_**Expected Outputs:**_
```python
survival_likelihood('Pclass')
```
> <table class="dataframe" border="1">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Survived</th>
    </tr>
    <tr>
      <th>Pclass</th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>1</th>
      <td>0.629630</td>
    </tr>
    <tr>
      <th>2</th>
      <td>0.472826</td>
    </tr>
    <tr>
      <th>3</th>
      <td>0.242363</td>
    </tr>
  </tbody>
</table>

```python
survival_likelihood('SibSp')
```
> <table class="dataframe" border="1">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Survived</th>
    </tr>
    <tr>
      <th>SibSp</th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0.345395</td>
    </tr>
    <tr>
      <th>1</th>
      <td>0.535885</td>
    </tr>
    <tr>
      <th>2</th>
      <td>0.464286</td>
    </tr>
    <tr>
      <th>3</th>
      <td>0.250000</td>
    </tr>
    <tr>
      <th>4</th>
      <td>0.166667</td>
    </tr>
    <tr>
      <th>5</th>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>8</th>
      <td>0.000000</td>
    </tr>
  </tbody>
</table>