# Exploring the Raw Titanic Dataset
© Explore Data Science Academy

Part of the journey to making a good regression model is to understand the data that we are modelling. To do this, we will perform some exploratory data analysis on the raw data from the [Titanic Kaggle Challenge](https://www.kaggle.com/c/titanic). The purpose of this challenge is to predict the probability of survival for a given passenger, based on their boarding details.

<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://upload.wikimedia.org/wikipedia/commons/f/fd/RMS_Titanic_3.jpg"
     alt="Titanic"
     style="padding-bottom: 0.5em"
     width=600px/>
The RMS Titanic
</div>

### Honour Code

I **RIZQAH, MENIERS**, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the EDSA honour code (https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).  

Non-compliance with the honour code constitutes a material breach of contract.

## Imports

In [1]:
import pandas as pd
import numpy as np

## Data

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/regression_sprint/titanic_train_raw.csv')
df_clean = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/regression_sprint/titanic_train_clean_raw.csv')

In [3]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
df_clean.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,B96 B98,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,B96 B98,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,B96 B98,S


## Questions

### Question 1

After briefly looking through the data, you may notice that some entries are missing.

Write a function that determines the number of missing entries for a specified column in the dataset. The function should return an `int` that corresponds to the number of missing entries in the specified column.

_**Function Specifications:**_
* Should take a pandas `DataFrame` and a `column_name` as input and return an `int` as output.
* The `int` should be the number of missing entries in the column.
* Should be generalised to be able to work on _**ANY**_ dataframe.

In [5]:
### START FUNCTION
def total_missing(df, column_name):
    # isna() flags each missing entry; summing the flags counts them.
    # Cast to a plain int, since sum() returns a NumPy integer.
    return int(df[column_name].isna().sum())
### END FUNCTION

In [6]:
total_missing(df,'Survived')

0

_**Expected Outputs:**_
```python
total_missing(df,'Age') == 177
total_missing(df,'Survived') == 0
```
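Since the function is meant to work on any dataframe, it can help to check the same idea on data that has nothing to do with the Titanic files. The sketch below is only an illustration: the `toy` frame and the helper name `count_missing` are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, unrelated to the Titanic files
toy = pd.DataFrame({
    "height": [1.7, np.nan, 1.6, np.nan],
    "city": ["Cape Town", None, "Durban", "Durban"],
})

def count_missing(frame, column_name):
    # isna() treats NaN and None alike; summing the flags counts them
    return int(frame[column_name].isna().sum())

height_missing = count_missing(toy, "height")
city_missing = count_missing(toy, "city")
print(height_missing, city_missing)
```

Both numeric gaps (`NaN`) and missing strings (`None`) are counted, which is why the same approach should carry over to columns of any type.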

### Question 2

It would be a good idea to replace some of our missing data. Missing values can be replaced with either the _mean_, the _median_ or the _mode_ (in the case of categorical columns). Write a function that takes in as input a dataframe and a column name, and returns the `mean` for numerical columns and the `mode` for non-numerical columns.

_**Function Specifications:**_
* The function should take two inputs: `(df, column_name)`, where `df` is a pandas `DataFrame` and `column_name` is a `str`.
* If the `column_name` does not exist in `df`, raise a `ValueError`.
* Should return as output the `mean` if the specified column is numerical and return a list of the `mode(s)` otherwise.
* The mean should be rounded to 2 decimal places.
* **If there is more than one `mode` for a given non-numerical column, the function should return a list of all modes**.
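To see what "more than one mode" looks like in practice, here is a small sketch using pandas' `Series.mode()`, which returns every value tied for the highest count. The `toy_ports` and `single_port` series are invented for illustration and are not part of the Titanic data.

```python
import pandas as pd

# Hypothetical column where "S" and "C" are tied for most frequent
toy_ports = pd.Series(["S", "C", "S", "C", "Q"])

# mode() returns all tied values; tolist() turns them into a plain list
modes = sorted(toy_ports.mode().tolist())
print(modes)  # ['C', 'S']

# With a single most frequent value, the list has one element
single_port = pd.Series(["S", "S", "Q"])
single_mode = single_port.mode().tolist()
print(single_mode)  # ['S']
```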


In [18]:
### START FUNCTION
def calc_mean_mode(df, column_name):
    # Guard against columns that are not in the dataframe
    if column_name not in df.columns:
        raise ValueError(f"Column '{column_name}' not found in dataframe")

    # Numerical columns: return the mean rounded to 2 decimal places
    if pd.api.types.is_numeric_dtype(df[column_name]):
        return round(float(df[column_name].mean()), 2)

    # Non-numerical columns: return every mode as a sorted list
    return sorted(df[column_name].mode().tolist())
### END FUNCTION

In [19]:
calc_mean_mode(df,'Embarked')

['S']

_**Expected Outputs:**_
```python
calc_mean_mode(df, 'Age') == 29.7
calc_mean_mode(df, 'Embarked') == ['S']
```
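Once the mean or mode is known, it can be used to fill the gaps. The following is a minimal sketch of that replacement step with `fillna()`, run on a made-up `toy` frame rather than the real data. Whether mean or mode imputation suits a particular column is a modelling decision this example does not settle.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in each column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Embarked": ["S", None, "S"],
})

filled = toy.copy()
# Numerical column: fill with the mean of the observed values
filled["Age"] = filled["Age"].fillna(round(filled["Age"].mean(), 2))
# Categorical column: fill with the first (here, only) mode
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])

print(filled)
```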

### Question 3

We ultimately want to predict the survival chances of the passengers in the testing set. We can start by building a simple model using the data we already have by using _conditional probability_! Write a function that returns the survival probability of a passenger, given a condition on a **numerical variable** from the dataset. The condition will consist of a `column_name`, a `value` and a `boolean_operator`. Possible boolean operators include `"<"`, `">"`, or `"=="`. For example, `column_name = "Age"`, `boolean_operator = ">"`, and `value = 40` together form the condition `Age > 40`.

_**Function specifications:**_
* The function should make use of the `df_clean` `DataFrame` loaded earlier in this notebook.
* It should take a numerical `column_name` string, a `boolean_operator` string, and a `value` of type string as input. 
* It should return a survival likelihood as a number between 0 and 1, rounded to 2 decimal places. 
* Assume that `column_name` exists in `df_clean`.

_**Hint:** You can use `eval()` to evaluate string boolean expressions._
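As a rough illustration of the idea, the conditional probability of survival is the number of survivors among the rows that satisfy the condition, divided by the number of rows that satisfy it. Because `Survived` is coded as 0 or 1, that ratio is simply the mean of `Survived` over the filtered rows. The `toy` frame below is invented for this sketch.

```python
import pandas as pd

# Hypothetical toy data: five passengers with an age and a 0/1 survival flag
toy = pd.DataFrame({
    "Age": [10, 12, 30, 45, 50],
    "Survived": [1, 0, 1, 0, 0],
})

# Boolean mask for the condition "Age < 15"
condition = toy["Age"] < 15

# P(Survived | Age < 15) = survivors matching the condition / rows matching it
p_survived_given_young = round(float(toy.loc[condition, "Survived"].mean()), 2)
print(p_survived_given_young)  # 0.5
```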

In [28]:
### START FUNCTION
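def survival_likelihood(df_clean, column_name, boolean_operator, value):
    # Build a boolean mask from the condition string, e.g. df_clean["Age"] < 15
    mask = eval(f"df_clean[column_name] {boolean_operator} {value}")
    # 'Survived' is 0/1, so its mean over the matching rows is the survival probability
    return round(float(df_clean.loc[mask, 'Survived'].mean()), 2)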

### END FUNCTION

In [29]:
survival_likelihood(df_clean,"Age","<","15")

0.58

_**Expected Outputs:**_
```python
survival_likelihood(df_clean,"Pclass","==","3") == 0.24
survival_likelihood(df_clean,"Age","<","15") == 0.58
```
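The hint suggests `eval()`, which is convenient but will execute any string it is given. One possible alternative, sketched here under the assumption that only `"<"`, `">"` and `"=="` need to be supported, is to look the operator up in a dictionary built from Python's standard `operator` module. The function name `survival_likelihood_no_eval` and the `toy` frame are hypothetical.

```python
import operator

import pandas as pd

# Supported operator strings mapped to their function equivalents
OPERATORS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}

def survival_likelihood_no_eval(frame, column_name, boolean_operator, value):
    # Reject anything outside the three supported operators
    if boolean_operator not in OPERATORS:
        raise ValueError(f"Unsupported operator: {boolean_operator!r}")
    # The value arrives as a string, so convert it before comparing
    mask = OPERATORS[boolean_operator](frame[column_name], float(value))
    return round(float(frame.loc[mask, "Survived"].mean()), 2)

# Hypothetical toy data for a quick check
toy = pd.DataFrame({"Pclass": [1, 3, 3, 3, 2], "Survived": [1, 0, 1, 0, 1]})
toy_result = survival_likelihood_no_eval(toy, "Pclass", "==", "3")
print(toy_result)  # 0.33
```

The trade-off is a little more code in exchange for never evaluating arbitrary input.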