# Titanic

### Dataset

The Titanic dataset is a classic in the field of machine learning. It consists of a list of passengers aboard the ship during its maiden, and only, voyage to the United States. Several features describe the passengers on board, as well as their final status: did they survive the shipwreck?

Out of the 2,200 people on board the ship at the time of the sinking, approximately 1,500 perished. While our dataset is not exhaustive, it contains enough entries to apply machine learning methods to it.

### Objective

The goal of this exercise is to create a model to predict a passenger's survival based on their various known characteristics.

### Importing Libraries

In the following cell, import all the libraries you will use. You will need to return to this cell from time to time to add libraries during this exercise. Don't forget to re-run the cell each time you add one.

If the library you want is not installed on your system, please refer to the "installation" section of its online documentation. In the vast majority of cases, you can open a terminal window (Anaconda Prompt for Windows or a regular terminal for Mac and Linux), make sure you are in the right virtual environment if you are using one, and then type:

``pip install YOUR_LIBRARY_NAME``

To get started:

**>>>** Import the "pandas" library using the alias "pd"

**>>>** Import the "numpy" library using the alias "np"

In [None]:
# code here!


### Reading and Displaying the DataFrame

**>>>** Using Pandas and the `pd.read_csv()` function, import and store the dataset in a dataframe named `df`.

**>>>** Use the `.head()` method to display the first 5 rows.

In [None]:
# code here!


### Columns

**>>>** Display the columns of the dataframe using the `.columns` property.

In [None]:
# code here!


### Explanation of the Different Columns

When studying a dataset, it's important to always refer to the documentation or ask the person who generated the dataset for explanations. Here are the details that are relevant to us:

- **'PassengerId'**: The ID of the passenger. It simply numbers the rows (index + 1), so it carries no more information than our index.
- **'Survived'**: 1 if the person survived, 0 if they did not survive.
- **'Pclass'**: Passenger class.
    - 1: First class (upper class).
    - 2: Second class (middle class).
    - 3: Third class (lower class).
- **'Name'**: The name of the passenger.
- **'Sex'**: The gender of the passenger.
- **'Age'**: The age of the passenger, in years. It is fractional for children under 1, and estimated ages are given in the form xx.5.
- **'SibSp'**: *Siblings and Spouses*. The number of brothers, sisters, half-brothers, half-sisters, husbands, or wives accompanying them on the ship.
- **'Parch'**: *Parents and Children*. The number of fathers, mothers, sons, daughters, stepsons, or stepdaughters accompanying them on the ship.
- **'Ticket'**: The ticket number of the person.
- **'Fare'**: The cost of the ticket.
- **'Cabin'**: The cabin number.
- **'Embarked'**: The port of embarkation:
    - C: Cherbourg
    - Q: Queenstown
    - S: Southampton

## Exploration

**>>>** Apply a `.describe()` on the dataframe. By default, this method applies only to numerical variables. If you prefer the result to be transposed, use the `.T` property on the dataframe returned by `.describe()`.

In [None]:
# Code here!


**>>>** Display basic statistics for non-numeric variables. To do this, use the "exclude" parameter of the `.describe()` method and provide `np.number` as the argument.

In [None]:
# Code here!
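
As a rough illustration of the two `.describe()` calls above, here is a minimal sketch on a small made-up dataframe (`toy` and its values are hypothetical, not the Titanic data). It only shows how the default call differs from the call with `exclude=np.number`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# Default behaviour: numerical columns only, transposed for readability
numeric_summary = toy.describe().T

# Everything that is NOT numerical: count, unique, top (mode), freq
categorical_summary = toy.describe(exclude=np.number)

print(numeric_summary)
print(categorical_summary)
```

For non-numeric columns, the summary reports the number of distinct values, the most frequent value, and how often it occurs.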


### Removing Unnecessary Columns

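**>>>** Use the `.drop()` method to remove the 'PassengerId' column since it's just a duplicate of the index. Remember to indicate that you are dropping a column (`columns=...` or `axis=1`) and to store the result.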

In [None]:
# Code here!
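
A minimal sketch of dropping a column, on a hypothetical two-column frame (`toy` is made up for the example). Note that `.drop()` returns a new dataframe by default, so the result has to be reassigned.

```python
import pandas as pd

# Hypothetical toy frame with an ID column that mirrors the index
toy = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})

# columns= makes it explicit that a column (not a row label) is dropped
toy = toy.drop(columns="PassengerId")
print(toy.columns.tolist())
```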


### Statistics and Visualization

To better understand the data, let's explore each column using statistical tools and visualizations.

**>>>** Go back to the first cell and import the 'seaborn' library as 'sns' and the 'matplotlib.pyplot' library as 'plt'.

### 'Survived'

**>>>** Examine the distribution of the 'Survived' column using the `.value_counts()` method. You can optionally pass `normalize=True` to get proportions instead of raw counts.

In [None]:
# Code here!
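
To show what `normalize=True` changes, here is a small sketch on a hypothetical Series of five 0/1 values (not the real column).

```python
import pandas as pd

# Hypothetical 0/1 outcomes: three zeros, two ones
survived = pd.Series([0, 0, 0, 1, 1], name="Survived")

counts = survived.value_counts()                # raw counts per value
shares = survived.value_counts(normalize=True)  # proportions summing to 1

print(counts)
print(shares)
```

The second result is the one to read as a rate: here 60% of the values are 0 and 40% are 1.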


**>>>** Display the results in the form of a graph by using the `.plot.bar()` method on this result.

In [None]:
# Code here!


**>>>** Display the same chart, but use the `sns.countplot()` function, providing the series `df['Survived']` as the argument for the "x" parameter.

In [None]:
# Code here!


**>>>** Let's now examine how the passenger class influences the survival rate. Use the same function as before, but this time pass the column `df['Pclass']` as "x" and add a second parameter named "hue", giving it `df['Survived']` as the argument.

In [None]:
# Code here!


### Passenger Names

The title of address (Mrs., Miss, Mr., etc.) can provide us with information about the passengers. For now, this information is contained within their names.

**>>>** Create a new column named "NameTitle" containing only the titles of address. There are various ways to do this:
- Use the `str.split()` method twice with the `expand=True` parameter, retrieving the correct newly created series each time.
- Use the `map()` method twice, along with lambda functions, to apply successive `split()` operations to the series and retrieve the desired value using the index of the created list.

PS: The title "Master" was historically used for boys and young men.

In [None]:
# Code here!
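
Both approaches listed above can be sketched on a few names written in the same "Surname, Title. First names" layout as the 'Name' column. The Series `names` and the variable names are chosen for the example; the sketch assumes every name contains exactly one ", " separator before the title.

```python
import pandas as pd

# A few names in the "Surname, Title. First names" layout
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
], name="Name")

# Option 1: str.split twice with expand=True, keeping the useful column each time
after_comma = names.str.split(", ", expand=True)[1]
titles_split = after_comma.str.split(" ", expand=True)[0]

# Option 2: map twice with lambdas, indexing into the list returned by split()
titles_map = (
    names.map(lambda s: s.split(", ")[1])
         .map(lambda s: s.split(" ")[0])
)

print(titles_split.tolist())
print(titles_map.tolist())
```

Splitting on a space keeps the trailing period in the title ("Mr.", "Mrs."), which matters when you later build a replacement dictionary.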


**>>>** Perform a `value_counts()` on the "NameTitle" Series.

In [None]:
# Code here!


**>>>** Some titles are very rare, and others need to be grouped into larger categories. Create a dictionary with the old titles as keys and the new titles to which they will be assigned as values. Make sure that:

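- `Mlle.` and `Ms.` are assigned to `Miss.`
- `Lady.`, `Mme.`, and `the` are assigned to `Mrs.`
- `Major.`, `Col.`, `Sir.`, `Don.`, and `Jonkheer.` are assigned to `Mr.`
- There's no need to add titles that will remain the same, but you can do so if you prefer. The titles above are written with their trailing period, as produced by splitting on spaces; drop it if your extraction removes it.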

*Tips*:

- The `.replace()` method will be the best choice here!
- Then, verify with a `.value_counts()` that everything is in order.

In [None]:
# Code here!
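
A minimal sketch of the dictionary-based `.replace()`, on a hypothetical Series of titles. The dictionary below is deliberately partial and only illustrates the mechanism.

```python
import pandas as pd

# Hypothetical extracted titles (with their trailing period)
titles = pd.Series(["Mr.", "Mlle.", "Ms.", "Mme.", "Major.", "Miss."])

# Old title -> new title; titles absent from the keys are left untouched
title_map = {
    "Mlle.": "Miss.",
    "Ms.": "Miss.",
    "Mme.": "Mrs.",
    "Major.": "Mr.",
}

grouped = titles.replace(title_map)
print(grouped.value_counts())
```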


**>>>** Calculate the survival rate for each name title. To do this, group the dataframe by the 'NameTitle' column with `.groupby()`, select the 'Survived' column, and aggregate it with `mean()`. Because 'Survived' is coded 0/1, its mean is the survival rate. You can also add a `.sort_values(ascending=False)` to sort the result in descending order.

*Tips*:

- If you want to display other indicators as well, you can use the `.agg()` method on the grouped data, giving it a list of functions to use as strings. For example, "mean", "median", "count". The result will then be a dataframe.

In [None]:
# Code here!
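
The group-then-aggregate pattern can be sketched as follows on a hypothetical six-row frame (`toy` and its values are made up). It shows both the single `mean()` and the `.agg()` variant mentioned in the tip.

```python
import pandas as pd

# Hypothetical toy data: one title and one 0/1 outcome per row
toy = pd.DataFrame({
    "NameTitle": ["Mr.", "Mr.", "Mrs.", "Mrs.", "Miss.", "Mr."],
    "Survived":  [0,     0,     1,      1,      1,       1],
})

# Mean of a 0/1 column within each group = rate of 1s in that group
rate = (
    toy.groupby("NameTitle")["Survived"]
       .mean()
       .sort_values(ascending=False)
)

# Several indicators at once; the result is a dataframe
summary = toy.groupby("NameTitle")["Survived"].agg(["mean", "count"])

print(rate)
print(summary)
```

Displaying the count next to the mean helps spot rates that rest on only a handful of rows.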


**>>>** Next, generate a "bar" type graph for this newly created series to better visualize the data. You can use matplotlib or seaborn as you prefer.

In [None]:
# Code here!


### Sex

**>>>** Display the proportion of men and women with a `.value_counts()`. You can normalize the data to make the result clearer.

In [None]:
# Code here!


**>>>** Display the survival rate for each gender.

In [None]:
# Code here!


### Age

**>>>** Determine how many values are missing in this column.

In [None]:
# Code here!


**>>>** We are going to artificially fill in these missing values. While we could simply assign them the mean or median age of the passengers, we have information about the titles of address for all passengers, including those 177 people whose age is missing.

Therefore, we will complete these missing data with the mean age relative to their title of address.

The `.fillna()` method allows you to fill empty values in a dataframe or series with a value of your choice. Its argument can be a single value, but also a Series, in which case each missing value is filled with the value found at the same index. Since we know the titles of all passengers, we can apply the following method:

- Create a dictionary containing the 'NameTitles' as keys and the average age as values. You can do this in a single line of code with a dictionary comprehension.
- Use `.map()` on the `df['NameTitle']` column with this dictionary as an argument to create a Series that contains, for each person, the average age of the 'NameTitle' category to which they belong. Be careful not to modify the `df['NameTitle']` column!
- Once you have achieved this result, apply the `.fillna()` method to `df['Age']` and pass the newly generated Series as an argument.

*Tips*:
- Start by displaying the average age for each 'NameTitle'.
- To generate the dictionary, remember that the `.unique()` method returns an array of all the unique values in a series.
- Calculate the average of a group by selecting the appropriate rows with `.loc`.

In [None]:
# Code here!
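
The three steps described above (dictionary of means, `.map()`, then `.fillna()`) can be sketched on a hypothetical six-row frame. The data and the variable names `mean_age_by_title` and `age_from_title` are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing ages
toy = pd.DataFrame({
    "NameTitle": ["Mr.", "Mr.", "Mr.", "Master.", "Master.", "Miss."],
    "Age":       [30.0,  40.0,  np.nan, 4.0,      np.nan,    20.0],
})

# Step 1: title -> mean age, via a dictionary comprehension and .loc
mean_age_by_title = {
    title: toy.loc[toy["NameTitle"] == title, "Age"].mean()
    for title in toy["NameTitle"].unique()
}

# Step 2: a new Series holding, for each row, the mean age of its title
# (the NameTitle column itself is not modified)
age_from_title = toy["NameTitle"].map(mean_age_by_title)

# Step 3: only the missing ages are replaced, row by row (index alignment)
toy["Age"] = toy["Age"].fillna(age_from_title)
print(toy)
```

The `.mean()` call skips missing values by default, so the group means are computed from the known ages only.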


**>>>** Use `pd.qcut` to create a new variable "AgeClass" that divides the distribution into 6 quantile-based bins, each containing roughly the same number of passengers. Also, check the type of this newly created object.

In [None]:
# Code here!
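
To illustrate what `pd.qcut` returns, here is a small sketch on twelve hypothetical ages. With distinct values and 6 bins, each bin ends up with the same number of rows.

```python
import pandas as pd

# Twelve hypothetical, distinct ages
ages = pd.Series([2, 8, 15, 19, 22, 25, 28, 31, 35, 40, 52, 70], name="Age")

# Bin edges are chosen from the quantiles of the data, not at regular widths
age_class = pd.qcut(ages, 6)

print(type(age_class))   # a pandas Series...
print(age_class.dtype)   # ...of categorical dtype, one interval per bin
print(age_class.value_counts(sort=False))
```

Because the result is categorical, it can be used directly as a grouping key in a `groupby`.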


**>>>** Calculate the survival rate for each age class.

In [None]:
# Code here!

### Embarked

In [None]:
df['Embarked'].value_counts()

In [None]:
df['Embarked'].isna().sum()

**>>>** We have 2 empty values in this column. Fill them with the letter 'S' as it is the mode of the distribution.

In [None]:
# Code here!


**>>>** Create an `sns.countplot()` on the 'Embarked' column, using the 'hue' parameter (set to the 'Pclass' column) to visualize how the passenger classes are distributed across the ports.

In [None]:
# Code here!


### Cabin

The first letter of each cabin can be useful to us. Passengers who were in cabins near the lifeboats may have had a better chance of survival. The fact that most passengers did not have a cabin may have also influenced their survival rate.

In [None]:
df['Cabin'].isna().sum()

**>>>** Create a new column named "CabinLetter" that contains the first letter of each cabin. Note that missing values are stored as `NaN`, which is a float, so you need to convert the values to strings first. `NaN` then becomes the string "nan", whose first letter is a lowercase "n", which is acceptable.

In [None]:
# Code here!
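
One possible way to handle the float `NaN` values is sketched below on a hypothetical four-row frame. Converting each value with `str()` inside `.map()` is an assumption of this sketch; other conversions work as well.

```python
import numpy as np
import pandas as pd

# Hypothetical cabins, two of them missing
toy = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", np.nan]})

# str(NaN) is "nan", so missing cabins end up with the letter "n"
toy["CabinLetter"] = toy["Cabin"].map(lambda cabin: str(cabin)[0])
print(toy)
```

The lowercase "n" cannot be confused with a real deck letter, since those are uppercase.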


### Ticket

At first glance, ticket numbers may not seem useful. However, tickets with longer numbers appear to be correlated with a higher survival rate. Let's test this hypothesis.

**>>>** Create a "TicketLen" column containing the length, in characters, of each ticket number.

In [None]:
# Code here!
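
A minimal sketch of computing string lengths with the `.str.len()` accessor, on three ticket strings in the same style as the 'Ticket' column (`toy` is a made-up frame).

```python
import pandas as pd

# Tickets mix prefixes, spaces, and digits; every character counts
toy = pd.DataFrame({"Ticket": ["A/5 21171", "PC 17599", "113803"]})

toy["TicketLen"] = toy["Ticket"].str.len()
print(toy)
```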


**>>>** Use `sns.countplot()` to display a chart with the ticket length on the x-axis, the count on the y-axis, and differentiate it all (*hue*) by the 'Survived' column.

In [None]:
# Code here!


### Model building

Here, we will use a random forest. We need to prepare the data and select our features (X). Let's start by converting our categorical columns into "dummy" variables.

In [None]:
df.columns

**>>>** Create two lists:

- The first list, named `cat_cols`, will contain the names of all categorical columns.
- The second list, named `num_cols`, will contain the names of all numerical columns.

In [None]:
# Code here!


**>>>** Now, transform all the categorical columns into *dummy* columns using the `pd.get_dummies()` function. When it's done, create a new dataframe that concatenates the dataframe containing the numerical columns and the dataframe containing the "dummified" categorical columns using the `pd.concat` function (with `axis=1`, so that the columns are placed side by side). Store this new dataframe in a variable X. Then, check the *dtypes* of X.

In [None]:
# code here!
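
The dummy encoding and the column-wise concatenation can be sketched on a hypothetical three-row frame. The column lists below are made up for the example and are much shorter than the ones you will build.

```python
import pandas as pd

# Hypothetical toy data: two categorical and two numerical columns
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
})
cat_cols = ["Sex", "Embarked"]
num_cols = ["Age", "Fare"]

# One indicator column per category, named "<column>_<value>"
dummies = pd.get_dummies(toy[cat_cols])

# axis=1 glues the two frames side by side, aligned on the index
X = pd.concat([toy[num_cols], dummies], axis=1)

print(X.columns.tolist())
print(X.dtypes)
```

Depending on the pandas version, the dummy columns are boolean or small integers; both are accepted by scikit-learn.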


**>>>** Create a variable named "y" that will contain the 'Survived' column (the target value to predict).

In [None]:
# code here!


**>>>** Use the `train_test_split()` function to create your variables X_train, X_test, y_train, and y_test. Set the random_state parameter to 42.

In [None]:
from sklearn.model_selection import train_test_split
# Code here!
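
A minimal sketch of `train_test_split()` on hypothetical arrays (20 rows, not the Titanic features). It shows the order of the four returned objects and the default split proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features (20 rows, 2 columns) and a 0/1 target
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

# Returned in this order: X train, X test, y train, y test
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(X_tr.shape, X_te.shape)
```

When no size is given, 25% of the rows go to the test set. Fixing `random_state` makes the split reproducible from one run to the next.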


## RandomForestClassifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, ConfusionMatrixDisplay

**>>>** In the next cell, write four lines of code:

- Declare a variable `rf` (for random forest) that will contain the model `RandomForestClassifier()`. Use 42 as the value for the "random_state" parameter.
- Fit the model to your training data.
- Store the predictions made on your X_test in a variable named `y_pred`.
- Calculate the accuracy score with the `accuracy_score()` function, using y_test and y_pred as arguments.

In [None]:
# Code here!
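
The four steps listed above can be sketched on synthetic data. The dataset below comes from scikit-learn's `make_classification` and is an assumption of this sketch; it has nothing to do with the Titanic features, and the `_demo` variable names are made up.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (stand-in for X and y)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

rf_demo = RandomForestClassifier(random_state=42)  # 1. declare the model
rf_demo.fit(X_tr, y_tr)                            # 2. fit on the training data
y_pred_demo = rf_demo.predict(X_te)                # 3. predict on the test data
score = accuracy_score(y_te, y_pred_demo)          # 4. share of correct predictions

print(score)
```

The model is fitted on the training set only; the test set is used exclusively to measure how well it generalizes.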


**>>>** Display the `feature_importances_` property of the rf model.

In [None]:
# Code here!


**>>>** We want to visualize each variable along with its importance in the final outcome of our model.

To do this, create a new DataFrame (no need to store it in a variable) using the `pd.DataFrame` function. Provide it with an object that is a `zip` of X.columns and rf.feature_importances_. Name the columns 'variable' and 'importance' using the "columns" parameter. Then, reorganize your DataFrame using `.sort_values()` so that it is sorted in descending order by the 'importance' column.

In [None]:
# Code here!
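
A sketch of the `zip`-based construction, using hypothetical column names and importance values in place of `X.columns` and `rf.feature_importances_`. The `zip` is wrapped in `list()` here to make the intermediate object explicit.

```python
import pandas as pd

# Hypothetical stand-ins for X.columns and rf.feature_importances_
columns = ["Age", "Fare", "Sex_male"]
importances = [0.25, 0.30, 0.45]

# zip pairs each column name with its importance: one tuple per row
table = (
    pd.DataFrame(list(zip(columns, importances)),
                 columns=["variable", "importance"])
      .sort_values("importance", ascending=False)
)
print(table)
```

Feature importances of a random forest sum to 1, so each value can be read as a share of the total.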


**>>>** To better visualize the performance of our model, use the `confusion_matrix()` function. Provide it with y_test and y_pred as arguments and use `rf.classes_` for the "labels" parameter. Store the result in a variable "cm" and display it.

In [None]:
# Code here!


**>>>** Now, use the `ConfusionMatrixDisplay` class to visualize the confusion matrix more effectively. Provide our variable "cm" as the argument to the "confusion_matrix" parameter, and use `rf.classes_` as the argument for the "display_labels" parameter. Then call `.plot()` on the resulting object to draw the figure.

In [None]:
# Code here!


### True / False - Positive / Negative

- The intersection between the `true label` (1) and the `predicted label` (1) represents TP (*True Positive*).
- The intersection between the `true label` (0) and the `predicted label` (0) represents TN (*True Negative*).
- The intersection between the `true label` (1) and the `predicted label` (0) represents FN (*False Negative*).
- The intersection between the `true label` (0) and the `predicted label` (1) represents FP (*False Positive*).
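
To connect these four cells to the array returned by `confusion_matrix()`, here is a small sketch on eight hypothetical labels. Rows correspond to true labels and columns to predicted labels, in the order given by `labels`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat  = [0, 1, 0, 1, 0, 1, 1, 0]

cm_demo = confusion_matrix(y_true, y_hat, labels=[0, 1])

# With labels [0, 1], flattening the matrix gives TN, FP, FN, TP
tn, fp, fn, tp = cm_demo.ravel()

print(cm_demo)
print(tn, fp, fn, tp)
```

Correct predictions sit on the diagonal; the two off-diagonal cells are the two kinds of error.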

**>>>** Finally, use the `classification_report` function to display the precision, recall, and f1-score values.

In [None]:
# Code here!


<div>
<img src="files/precision_recall_wiki.png" alt="precision_recall_accuracy" width="40%" align='center' source="wikipedia" /> </div>

<div>
<img src="files/precision_recall_accuracy.png" alt="precision_recall_accuracy" width="70%" align='center' source="https://medium.com/@shrutisaxena0617/precision-vs-recall-386cf9f89488" /> </div>

## Understanding the classification report

**Precision**: Precision is a measure of how many of the predicted positive instances were actually positive. In this context, it tells us the percentage of passengers predicted to have survived (1) who actually survived. For the "0" class (did not survive), the precision is 0.85, meaning that 85% of the passengers predicted as non-survivors did not survive. For the "1" class (survived), the precision is 0.79, indicating that 79% of the passengers predicted as survivors actually survived.

**Recall**: Recall, also known as sensitivity or true positive rate, is a measure of how many of the actual positive instances were correctly predicted. In this context, it tells us the percentage of passengers who actually survived (1) that were correctly identified by the model. The recall for the "0" class is 0.87, indicating that 87% of the actual non-survivors were correctly identified. For the "1" class, the recall is 0.78, meaning that 78% of the actual survivors were correctly identified.

**F1-score**: The F1-score is the harmonic mean of precision and recall. It provides a balance between precision and recall and is particularly useful when you want to consider both false positives and false negatives. The F1-score for the "0" class is 0.86, and for the "1" class, it's 0.78.
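
Written with the TP, FP, and FN counts from the confusion matrix, the three per-class metrics take the following standard form for the positive class (for class 0, the roles of the two classes are swapped):

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$

As a quick check with the rounded class "1" values quoted above:

$$
F_1 \approx \frac{2 \cdot 0.79 \cdot 0.78}{0.79 + 0.78} \approx 0.785
$$

This is in line with the reported 0.78, given that the precision and recall used here are themselves rounded.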

**Support**: Support represents the number of occurrences of each class in the test set. In this case, there are 134 instances of the "0" class (did not survive) and 89 instances of the "1" class (survived).

**Accuracy**: Accuracy is a measure of overall correctness. It tells us what percentage of all instances (both "0" and "1" classes) were correctly classified. In this case, the overall accuracy is 0.83, which means that 83% of the test set instances were correctly classified by the model.

**Macro Avg**: Macro average takes the unweighted average of precision, recall, and F1-score for each class. In this case, the macro average F1-score is 0.82.

**Weighted Avg**: Weighted average takes the weighted average of precision, recall, and F1-score for each class, with weights based on the number of instances in each class. In this case, the weighted average F1-score is 0.83.
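
The two averages can be recomputed by hand from the per-class F1-scores and supports quoted above. The sketch below uses only those rounded figures, so it reproduces the reported values up to rounding.

```python
# Per-class values quoted in the text above (rounded to 2 decimals)
f1 = {0: 0.86, 1: 0.78}
support = {0: 134, 1: 89}

# Macro average: every class counts the same
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: every class counts in proportion to its support
total = sum(support.values())
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.82 0.83
```

The weighted average is pulled toward the score of the larger class, which is why it sits slightly above the macro average here.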

## Bonus

- Use a logistic regression model.

### Logistic Regression

In [None]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score

In [None]:
# Code here!
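
For the bonus, the workflow is the same as for the random forest; only the model changes. Here is a minimal sketch on synthetic data from `make_classification` (an assumption of this sketch, not the Titanic features). The `max_iter=1000` setting is also an assumption, chosen to give the solver room to converge.

```python
from sklearn import linear_model
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (stand-in for X and y)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Same fit / predict / score sequence as with the random forest
logreg = linear_model.LogisticRegression(max_iter=1000)
logreg.fit(X_tr, y_tr)
y_pred_lr = logreg.predict(X_te)
lr_score = accuracy_score(y_te, y_pred_lr)

print(lr_score)
```

Comparing this score with the random forest's on the same split gives a first idea of which model suits the data better.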
