---
# **Lab 4: Titanic Project**
---

Through the following questions, you will hone and apply your skills to a famous dataset containing information about passengers on the Titanic and whether they survived or not: [Titanic dataset from Kaggle](https://www.kaggle.com/competitions/titanic/overview).


<br>

The target column is `Survived`, which indicates whether a passenger survived (1) or not (0). The 14 features initially available are:

* `PassengerId`: Numeric, a unique number for each passenger.
* `Pclass`: Numeric, the ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd.
* `Name`: Categorical, the name of the passenger.
* `Sex`: Categorical, the sex of the passenger.
* `Age`: Numeric, the passenger's age in years.
* `Sibsp`: Numeric, the number of siblings / spouses aboard the Titanic.
* `Parch`: Numeric, the number of parents / children aboard the Titanic.
* `Ticket`: Categorical, ticket number.
* `Fare`: Numeric, passenger fare.
* `Cabin`: Categorical, cabin number.
* `Embarked`: Categorical, port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton.
* `Hometown`: Categorical, passenger home town.
* `Destination`: Categorical, ultimate return point.
* `HasCabin`: Categorical, whether the passenger(s) had a cabin or not.

<br>

**Run the cell below to import all necessary libraries and functions.**

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import matplotlib.pyplot as plt

import numpy as np

url = "https://github.com/the-codingschool/TRAIN-datasets/raw/main/titanic/titanic_cleaned.csv"
data = pd.read_csv(url)

#### **Problem #1**

Print the first 10 rows of the data.

In [None]:
# COMPLETE THIS CODE

#### **Problem #2**

Determine how many data points and how many variables are in this dataset using the `.shape` attribute.
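As a quick illustration of what `.shape` returns, here is a minimal sketch on a small made-up table. The `toy` DataFrame and its values are assumptions for illustration only and are not taken from the Titanic data.

```python
import pandas as pd

# Hypothetical toy table (not the Titanic data)
toy = pd.DataFrame({
    "Age": [22, 38, 26, 35],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Survived": [0, 1, 1, 1],
})

# .shape is an attribute, so it takes no parentheses.
# It holds a tuple: (number of rows, number of columns).
n_rows, n_cols = toy.shape
print(n_rows, n_cols)
```

Each row is one data point and each column is one variable, so the same pattern applies to `data`.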

In [None]:
# COMPLETE THIS CODE

#### **Problem #3**

Now, get a general picture of the data's statistics using the `.describe()` method. Then, specifically determine:

* The average age of the passengers.
* The standard deviation (std) of the fare.
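The summary returned by `.describe()` is itself a DataFrame, so individual statistics can be read out of it by row and column label. The sketch below shows this on a made-up table; the `Height` and `Weight` columns are hypothetical and chosen only to show the pattern.

```python
import pandas as pd

# Hypothetical toy table (not the Titanic data)
toy = pd.DataFrame({
    "Height": [150.0, 160.0, 170.0, 180.0],
    "Weight": [50.0, 60.0, 70.0, 80.0],
})

# One row per statistic (count, mean, std, min, ...), one column per numeric variable
summary = toy.describe()
print(summary)

# Pick out single values by row label and column label
mean_height = summary.loc["mean", "Height"]
std_weight = summary.loc["std", "Weight"]
print(mean_height, std_weight)
```

Reading the values straight off the printed summary works just as well; the `.loc` lookup is only a convenience.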

In [None]:
# COMPLETE THIS CODE

#### **Problem #4**

Determine how many passengers did or did not survive using the `.value_counts()` method.
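For reference, `.value_counts()` tallies how often each distinct value appears in a column. A minimal sketch on a made-up Series (the `colors` data is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy column (not the Titanic data)
colors = pd.Series(["red", "blue", "red", "red", "blue", "green"])

# Counts per distinct value, sorted from most to least frequent
counts = colors.value_counts()
print(counts)

# Individual counts can be looked up by value
print(counts["red"])
```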

In [None]:
# COMPLETE THIS CODE

#### **Problem #5**

Visualize the number of passengers who did versus did not survive using a bar graph.
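If the bar-graph calls are unfamiliar, the following sketch shows the general pattern on made-up numbers. The category names and heights are hypothetical; only the `plt.bar`, `plt.ylabel`, and `plt.title` calls carry over to this problem.

```python
import matplotlib.pyplot as plt

# Hypothetical toy values (not the Titanic data)
categories = ["Cats", "Dogs"]
heights = [3, 5]

# One bar per category; heights are matched to categories by position
bars = plt.bar(categories, heights)

plt.ylabel("Count")
plt.title("Pets by type (toy data)")
```

In a notebook cell, finish with `plt.show()` to display the figure.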

In [None]:
categories = ['Not Survived', 'Survived']
bars = # COMPLETE THIS CODE

plt.# COMPLETE THIS CODE

plt.ylabel(# COMPLETE THIS CODE
plt.title(# COMPLETE THIS CODE

plt.show()

#### **Problem #6**

Visualize the number of male versus female passengers.

In [None]:
categories = ['Male', 'Female']
bars = # COMPLETE THIS CODE

plt.# COMPLETE THIS CODE

plt.ylabel(# COMPLETE THIS CODE
plt.title(# COMPLETE THIS CODE

plt.show()

#### **Problem #7**

Create the following two bar plots:
1. Number of males who did or did not survive.
2. Number of females who did or did not survive.

In [None]:
categories = ['Not Survived', 'Survived']

male_data = data[# COMPLETE THIS CODE
male_bars = male_data[# COMPLETE THIS CODE

plt.bar(# COMPLETE THIS CODE

plt.ylabel(# COMPLETE THIS CODE
plt.title(# COMPLETE THIS CODE

plt.show()

In [None]:
# COMPLETE THIS CODE

#### **Problem #8**

Comparing these counts directly can be misleading because the dataset contains different numbers of males and females, as seen in Problem #6. For a fairer comparison, do the following:
1. Use the `.groupby(...)` method to group the data by sex.
2. Determine the mean value of `Survived` for males and females. **NOTE**: This represents the *fraction* of those who survived since `Survived` is either 0 or 1.
3. Visualize the results with a bar graph.
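To see why the mean of a 0/1 column is a fraction, consider this minimal sketch on a made-up table. The `Team` and `Won` columns are hypothetical stand-ins and are not part of the Titanic data.

```python
import pandas as pd

# Hypothetical toy table: 1 = won, 0 = lost
toy = pd.DataFrame({
    "Team": ["A", "A", "A", "B", "B"],
    "Won": [1, 0, 1, 0, 0],
})

# Group rows by team, then average the 0/1 column within each group.
# The sum of a 0/1 column counts the 1s, so the mean is the fraction of 1s.
win_frac = toy.groupby("Team")["Won"].mean()
print(win_frac)
```

Team A won 2 of its 3 games, so its mean is 2/3, even though the two teams have different numbers of rows.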

In [None]:
grouped_data = data.groupby(# COMPLETE THIS CODE
surv_frac = grouped_data['Survived'].# COMPLETE THIS CODE

x_values = surv_frac.index
plt.bar(x_values, surv_frac)

plt.xlabel(# COMPLETE THIS CODE
plt.ylabel(# COMPLETE THIS CODE
plt.title(# COMPLETE THIS CODE

plt.show()

#### **Problem #9**

Now, perform the same analysis as in Problem #8 using the variable `HasCabin` instead of `Sex`.

In [None]:
# COMPLETE THIS CODE

#### **Problem #10**

Performing the same type of analysis, determine which port of embarkation had the lowest survival rate.

<br>

**NOTE**: The description of the dataset at the top of the notebook explains what each possible category (C, Q, S) means.

In [None]:
# COMPLETE THIS CODE

#### **Problem #11**

Performing the same type of analysis, determine if the passenger class had a noticeable effect on survival.

In [None]:
# COMPLETE THIS CODE

#### **Problem #12**

Performing the same type of analysis on the new variable created below, determine if the family size had a noticeable effect on survival.

In [None]:
data['FamilySize'] = data['Sibsp'] + data['Parch'] + 1

# COMPLETE THIS CODE

#### **Problem #13**

For the sake of comparison, visualize the fraction of survivors versus family size found in Problem #12 using a *line plot*.

In [None]:
grouped_data = # COMPLETE THIS CODE
surv_frac = grouped_data[# COMPLETE THIS CODE

x_values = surv_frac.index
plt.# COMPLETE THIS CODE

plt.xlabel(# COMPLETE THIS CODE
plt.ylabel(# COMPLETE THIS CODE
plt.title(# COMPLETE THIS CODE

plt.# COMPLETE THIS CODE

#### **Problem #14**

Visualize the fraction of survivors versus age using a scatter plot.

In [None]:
# COMPLETE THIS CODE

#### **Problem #15**

The results from Problem #14 are not very insightful because we are treating age like a categorical variable when it is really a numerical variable that can take on *many* values.

<br>

So, let's *make* a categorical age variable as follows:
1. Run the first cell to create this categorical variable.
2. Create a bar plot showing the survival rate versus age category.
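As an aside, pandas also offers `pd.cut` for turning a numeric column into categories. The sketch below is only an illustration on made-up ages, not the approach this lab uses; the `ages` values are assumptions, and the cutoffs are chosen to mirror the ones in the provided cell.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value
ages = pd.Series([5, 13, 17.9, 34, 35, 70, np.nan])

# Bin edges: [0, 13), [13, 18), [18, 35), [35, 65), [65, inf)
bins = [0, 13, 18, 35, 65, np.inf]

# right=False makes each bin include its left edge and exclude its right edge
age_category = pd.cut(ages, bins=bins, right=False, labels=[1, 2, 3, 4, 5])
print(age_category)
```

One difference to keep in mind: with `pd.cut`, a missing age stays missing (NaN) rather than being labeled `'unknown'`.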

In [None]:
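# Passengers with a missing age keep the 'unknown' label
data['Age Category'] = 'unknown'

# Assign with .loc and a boolean mask; chained indexing such as
# data['Age Category'][mask] = ... may silently fail to update `data`
data.loc[data['Age'] < 13, 'Age Category'] = 1
data.loc[(data['Age'] >= 13) & (data['Age'] < 18), 'Age Category'] = 2
data.loc[(data['Age'] >= 18) & (data['Age'] < 35), 'Age Category'] = 3
data.loc[(data['Age'] >= 35) & (data['Age'] < 65), 'Age Category'] = 4
data.loc[data['Age'] >= 65, 'Age Category'] = 5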

In [None]:
# COMPLETE THIS CODE

#### **Problem #16**

Apply what you've learned about this data to *try* making predictions about who survived or not. Specifically,

1. Run the provided code to predict that only first class passengers survived (based on the fact that they had the highest survival rate). **Make sure you understand this code before moving on.**

2. Using the same approach, make predictions based on whether the passengers were male or female and considering the results you found above.

3. Using the same approach, make predictions based on which port the passengers embarked from and considering the results you found above.
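The prediction pattern used in the provided cell can be sketched on a small made-up table first. The `Group` and `Outcome` columns are hypothetical, and taking the mean of the comparison as an accuracy score is an optional extra that the provided cell does not do.

```python
import pandas as pd

# Hypothetical toy table: Outcome is 1 or 0, like Survived
toy = pd.DataFrame({
    "Group": ["x", "x", "y", "y", "y"],
    "Outcome": [1, 0, 0, 0, 1],
})

# Rule: predict Outcome = 1 exactly when Group is "x" (a True/False Series)
predicted = (toy["Group"] == "x")

# True where the prediction matches reality (1 matches True, 0 matches False)
correct = (toy["Outcome"] == predicted)
print(correct.value_counts())

# Fraction of correct predictions
accuracy = correct.mean()
print(accuracy)
```

Here the rule is right for 3 of the 5 rows, so the accuracy is 0.6.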

In [None]:
predicted_values = (data['Pclass'] == 1)

correct_predictions = (data['Survived'] == predicted_values)

correct_predictions.value_counts()

True     605
False    286
dtype: int64

In [None]:
# COMPLETE THIS CODE

In [None]:
# COMPLETE THIS CODE

---

## **[OPTIONAL] Homework**

---

As we transition from data science and general machine learning concepts to learning our first machine learning model tomorrow, we recommend that you review the optional notebook problems so far, reflect on what topics you feel most and least comfortable with, and consider the following open-ended questions:

1. Would you consider Exploratory Data Analysis a “linear” process? Why or why not?

1. Can you recall a time you’ve read about or observed the results of unethical data practices in terms of data collection, use, etc.?

1. After exploring the Titanic data, were there specific variables that seemed more related to survival rates than others, or did they all seem equally related?

1. Are there other ways you could imagine making predictions about the survival of passengers in the Titanic dataset? How could you use more information about one variable (such as its average or standard deviation)? How could you combine the values of multiple variables (ex: `Sex` and `Pclass`) to produce one prediction?

1. As we will see tomorrow, linear regression is a form of finding the “line of best fit”. Are there places in your own classes where you use or teach this?

1. Given the slope and intercept of a line, how can you determine y values given an x value? How could this be used to make predictions for unseen data? How does this work if you have multiple input variables (not just x)?

---

# End of Notebook

© 2024 The Coding School, All rights reserved