In [None]:
# Run this cell
import matplotlib.pyplot as plt
%matplotlib inline
%load_ext jupyter_ai_magics
from datascience import *
import numpy as np
import math

# CMPSC 5A Midterm — Titanic

In this project, you will apply all the concepts you have learned in the class so far. This includes table manipulation (all table functions), iteration (for loops), conditionals (if statements), data cleaning, etc.

## Names
Please list all students who worked on this project.
1. Student Name 1
2. Student Name 2
3. Student Name 3

## Member Responsibilities
To make sure every member of your team contributes fairly, you should split up responsibilities in writing before starting. A couple of sentences is enough. Here are two examples of what that could look like for a hypothetical team of 2 (Alice and Bob):
- “Alice will work on Q1-5 and Bob will do Q6-10. Each of us will make the slides that correspond to the problems we did.” 
- “We will meet in person to do the whole assignment as a team, switching up who is typing. Whoever is not currently typing will still help by checking documentation pages for the typist and advising the typist as they write code. Alice will make the first half of the slides, Bob will make the second half.” 

As these examples show, everyone should contribute to the notebook and presentation roughly equally.

After presentations are complete, we will send out a survey where you say what you ended up doing and whether your teammate(s) held up their end of the bargain. Your self-evaluation and your teammates' evaluations of you are not directly part of the rubric.

This does not mean you can't help each other!!! It’s a group project. Please please please reach out to your teammate(s) if you’re struggling with questions that are assigned to you!! The survey evaluation is not a competition. We do not care if someone thinks they did 55% of the work and their partner did 45%. We care a lot if someone is not doing their work, not communicating with their team, and generally not contributing in good faith.


*Write your team member responsibility distribution here*

## Logistics

**Deadline:** The midterm project notebook is due Thursday, February 13th, 2025 at 11:59pm PT. The midterm project presentation slides are due by 10:00am PT on Friday, February 14th. Unlike labs, **no late submissions are allowed**.

**Submission:** For full credit, you must complete all the questions and submit to Gradescope. You may still change your answers before the project deadline - only your final submission will be graded for correctness. Only one partner needs to submit the notebook to Gradescope, and they will need to add the other two group members as members on Gradescope. See [How to Add Group Member in Gradescope](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members). **After they submit, all group members should open their Gradescope accounts and see that a submission has been processed. Be sure that the final notebook has all the ChatGPT prompts used by all the team members**.

**Presentation:** Your group will need to create a presentation slide deck and give a 6-8 minute oral presentation during your assigned time slot. Presentations must use slides. Your group is not allowed to scroll through your notebook during the presentation. All slides must be uploaded to https://drive.google.com/drive/folders/19zNx7JWtwSde3cDNeCT3gLOdBJ7Kuset?usp=drive_link.

**IMPORTANT NOTES:** 
- You are not limited to just one solution code cell, one prompt cell and one workflow cell for each question. Use as many of each as you like to ensure your notebook is presented well, easy-to-read, and has all the required plots and intermediate tables visible to show how you deduced that answer.
- None of the questions are designed to be answered in one line. If your answer is a single line, you are probably missing something.
- Every group's answers may differ depending on the approach taken to the data analysis. Other groups may have presented a result with a different graph than yours, or filtered the table in a different way. That does not mean yours is wrong. We are looking for diversity in how information is displayed, and there is more than one correct answer to each question.

**Partners:** You will work with two other partners (total three in a group); your partners can be from any lab section. 

**Rules:** Don't share your code with anybody but your partners. You are welcome to discuss questions with other students, but don't share the answers. The experience of solving the problems in this project will prepare you for exams (and life). If someone asks you for the answer, resist! Instead, you can demonstrate how you would solve a similar problem. Since the problems are open-ended, they can have many different answers. What is important is the approach you take to solve your task.

**Support:** You are not alone! Come to office hours, post on Ed, and talk to your classmates. If you want to ask about the details of your solution to a problem, make a private Ed post and the staff will respond. If you're ever feeling overwhelmed or don't know how to make progress, email your TA or ULA.

**Advice:** Develop your answers incrementally. To perform a complicated table manipulation, break it up into steps, perform each step on a different line, give a new name to each result, and check that each intermediate result is what you expect. You can add any additional names or functions you want to the provided cells. Make sure that you are using distinct and meaningful variable names throughout the notebook. Along the same lines, **DO NOT** reuse variable names.

You **never** have to use just one line in this project or any others. Use intermediate variables and multiple lines as much as you would like!
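As a rough illustration of the step-by-step style described above, here is a minimal sketch. It uses a small made-up list of `(class, fare)` pairs rather than the real table, and every name in it is hypothetical; the same pattern of naming and checking each intermediate result carries over to table operations.

```python
# Hypothetical mini-dataset of (passenger class, fare) pairs; not taken from the real file.
sample_rows = [(1, 80.0), (3, 7.25), (1, 120.0), (2, 13.0), (3, 8.05)]

# Step 1: keep only the first-class rows.
first_class_rows = [row for row in sample_rows if row[0] == 1]

# Step 2: pull the fares out of those rows.
first_class_fares = [fare for _, fare in first_class_rows]

# Step 3: compute the average from the previous result.
first_class_average_fare = sum(first_class_fares) / len(first_class_fares)

# Inspect every intermediate result before moving on.
print(first_class_rows)
print(first_class_fares)
print(first_class_average_fare)
```

Each step gets its own distinct, meaningful name, so a wrong result can be traced to the step that produced it.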

## Titanic Dataset
The *R.M.S. Titanic* was a British luxury passenger and mail carrying ocean liner that was operated by the White Star Line. It was the largest ship afloat at the time of its creation, and the ship was thought to be unsinkable. The *Titanic* began its maiden voyage from Southampton, United Kingdom on April 10, 1912 en route to New York City. However, the *Titanic* struck an iceberg at 11:40 p.m. on April 14, 1912, and the ship sank in the Atlantic Ocean during the morning of April 15, 1912.

The *Titanic* dataset was created by Thomas Cason of the University of Virginia and reflects information known as of August 2, 1999. The dataset contains information on 1309 passengers, and it does not contain information about the *Titanic* crew. A total of 2240 people (passengers and crew) sailed on the *Titanic*. The dataset originated from the [Encyclopedia Titanica](https://www.encyclopedia-titanica.org/), which has a goal of telling the story of every single person that traveled on the *Titanic*. The dataset is continually evolving, and therefore some data is missing, which is common in many real-world datasets. The missing data are marked by `nan`.

Below are the descriptions of the variables in the dataset.

- `Pclass`: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)

- `survived`: Survival (0 = No; 1 = Yes)

- `name`: Name

- `sex`: Sex

- `age`: Age (years)

- `sibsp`: Number of Siblings/Spouses Aboard

- `parch`: Number of Parents/Children Aboard

- `ticket`: Ticket Number

- `fare`: Passenger Fare (British pound)

- `embarked`: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

- `boat`: Lifeboat

- `body`: Body Identification Number

- `home.dest`: Home/Destination

In [None]:
# Run This Cell

# Read the dataset
titanic_unclean = Table.read_table("data/titanic.csv")

# Display the first few rows of the table
titanic_unclean.show(10)

**Question 1:** Real-world datasets do not always have complete information! If you observe some of the rows above, you will find a `nan` (Not a Number). **`nan` values are not strings in this dataset**. They are a special floating-point value (type `float`) that marks missing data. These missing values can cause issues if we try to use their corresponding rows to do any kind of arithmetic. There are many ways to work with missing or skewed data. For now, we will just choose to delete any rows that have missing information. Name the final table something meaningful - like `titanic`. This will be the table we use to answer all the other questions.

*Hint:*
- One way you can do this is by writing a function and then using the `.apply()` function.
- **Your final dataset should have roughly 1000-1100 rows.**
- In Python, `nan` is equal to *nothing*, not even itself. So, to check if a variable `x` is `nan`, you can test `x != x`.
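
The self-inequality test from the last hint can be sketched as follows. The helper name `is_missing` and the sample values are hypothetical, and `float("nan")` stands in for a missing entry; this shows only the check itself, not the full cleaning step.

```python
import math

def is_missing(value):
    """Return True when value is nan, relying on the fact that nan != nan."""
    return value != value

missing_age = float("nan")  # hypothetical missing entry
known_age = 29.0            # hypothetical valid entry

print(is_missing(missing_age))   # nan is not equal to itself
print(is_missing(known_age))     # an ordinary number equals itself
print(is_missing("S"))           # strings also equal themselves
print(math.isnan(missing_age))   # math.isnan agrees for floats
```

Unlike `math.isnan`, which raises a `TypeError` on strings, the `value != value` test works on any value, which can be convenient when a column mixes text and numbers.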

**Note: Do not remove rows if they are only missing values in the `boat`, `body` and `home.dest` columns. Missing values in `boat` and `body` are meaningful: a missing `boat` marks a passenger who did not get on a lifeboat, and a missing `body` marks a passenger who did not die or whose body was not found. We are okay to work with `home.dest` as it is.**

In [None]:
# SOLUTION

#### Prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

*Type your answer here*

We now have the final dataset ready to use!

**Question 2:** Exploratory data analysis is one of the first steps data scientists conduct to identify and analyze patterns in the data. Investigate the passengers aboard the *Titanic* so you can get to know your data better. What was the average age of the passengers? What was the sum of all the passenger fares? How many males and females were on board? Which was the most common port of embarkation?


In [None]:
# SOLUTION

#### Prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

*Type your answer here*

**Question 3:** Explore the relationship between `age` and `fare`. Is there a correlation between the two variables? Think about what kind of plot you could make to show a correlation between two variables, and produce that below. What can you infer from this plot?

In [None]:
# SOLUTION

#### Prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

*Type your answer here*

**Question 4:** Examine the relationship between passenger class (`Pclass`), which serves as a proxy for socio-economic status, and survival (`survived`). Calculate the survival rate for each passenger class. What percentage of first-class passengers survived? What percentage of third-class passengers survived? What could be possible reasons for the discrepancy? What might this tell us about the socio-economic disparities in survival rates? Support your answer with statistics.

In [None]:
# SOLUTION

#### Prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

*Type your answer here*

**Question 5:** Create an age profile of the passengers on board. You can categorize the ages into different groups (e.g., 0-12, 13-19, 20-35, etc.) and then calculate the proportion of each age group within the total number of passengers. Using this, you should also find the most common class amongst each age range. Discuss any interesting findings and how this age profile might reflect the demographics of early 20th-century travelers.
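
One possible way to turn ages into group labels and proportions is sketched below. The cut-offs, the helper name `age_group`, and the sample ages are all illustrative assumptions; your group is free to choose different ranges.

```python
from collections import Counter

def age_group(age):
    """Map an age in years to a label; the cut-offs here are illustrative only."""
    if age <= 12:
        return "0-12"
    elif age <= 19:
        return "13-19"
    elif age <= 35:
        return "20-35"
    else:
        return "36+"

# Hypothetical ages, including a fractional age for an infant.
sample_ages = [4, 17, 22, 30, 35, 41, 58, 0.75]

# Count how many ages fall in each group, then convert counts to proportions.
group_counts = Counter(age_group(age) for age in sample_ages)
group_proportions = {
    label: count / len(sample_ages) for label, count in group_counts.items()
}

print(group_counts)
print(group_proportions)
```

A function like this takes a single value and returns a single label, so it has the shape that `.apply()` expects. Note that with these cut-offs a fractional age such as 12.5 lands in the "13-19" group, so decide deliberately where boundary values should go.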

In [None]:
# SOLUTION

#### Prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

*Type your answer here*

**Question 6:** Explore how the number of siblings/spouses (`sibsp`) and parents/children (`parch`) aboard might relate to a passenger's chance of survival. Start by creating new variables to represent family size, and then group the passengers by their family size to calculate survival rates. What trends do you notice?
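
The "create a family-size variable, then group and compute rates" idea can be sketched in plain Python as below. The triples are made up, and counting the passenger themselves (`+ 1`) is an assumed convention, not something the dataset prescribes.

```python
from collections import defaultdict

# Hypothetical (sibsp, parch, survived) triples; not taken from the real file.
sample_passengers = [(0, 0, 0), (1, 0, 1), (1, 2, 1), (0, 0, 1), (3, 1, 0), (0, 0, 0)]

# Collect the survival flags of every passenger under their family size.
survival_flags_by_size = defaultdict(list)
for sibsp, parch, survived in sample_passengers:
    family_size = sibsp + parch + 1  # +1 counts the passenger; an assumed convention
    survival_flags_by_size[family_size].append(survived)

# Because survived is 0 or 1, the mean of the flags is the survival rate.
survival_rate_by_size = {
    size: sum(flags) / len(flags)
    for size, flags in sorted(survival_flags_by_size.items())
}

print(survival_rate_by_size)
```

Groups with very few passengers produce rates of exactly 0 or 1, so it is worth reporting the group sizes alongside the rates.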

In [None]:
# SOLUTION

#### Prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

*Type your answer here*

**Question 7:** Investigate the passengers on the *Titanic* who got on a lifeboat. How many were there? How many were males, and how many were females? What were their top three home destinations? How many survived?

Think about how you can best display the distribution of ages in equal-size bins among those who escaped and those who didn't, and make appropriate plots. What can you infer from the plots?

In [None]:
# SOLUTION

#### Prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

*Type your answer here*

**Question 8:** Investigate and plot the frequency of the titles (ex: Mr, Mrs, etc.) of the passengers. What are the names of the passengers aboard the ship whose title was unique? Then, analyze and explain the relationship between passenger title and at least 2 variables in the dataset of your choice that you believe would give the best insights. Are there any patterns? What could be the reasons for such patterns? Support your answer with data and use comparative charts.

Hint: Python has functions/methods like `append()` and `split()` that you can potentially use to extract a passenger's title.
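
As a small illustration of the hint, the sketch below uses `split()` and `append()` on a few sample names. It assumes names are written as "Surname, Title. Given names"; the helper name `extract_title` is hypothetical, and names that do not follow this shape would need extra handling.

```python
def extract_title(full_name):
    """Return the title from a name shaped like 'Surname, Title. Given names'."""
    after_comma = full_name.split(",")[1]  # text after the surname
    title = after_comma.split(".")[0]      # text before the first period
    return title.strip()                   # drop surrounding spaces

# Sample names written in the assumed format.
sample_names = [
    "Allen, Miss. Elisabeth Walton",
    "Allison, Master. Hudson Trevor",
    "Andrews, Mr. Thomas Jr",
]

# Build up the list of titles one name at a time.
titles = []
for name in sample_names:
    titles.append(extract_title(name))

print(titles)
```

A name with no comma would raise an `IndexError` here, so check a few rows of the `name` column before relying on this shape.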

In [None]:
# SOLUTION

#### Prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

*Type your answer here*

## Novel analysis
For the last section of your midterm project, you are expected to think like a data scientist. Formulate **TWO** unique questions/problems/insights that you can solve or obtain from the dataset given to you. **You need to come up with the problem and present the solution in the notebook**. The problems you come up with will be graded on novelty and the work necessary to obtain them. **DO NOT** reuse problems that are already covered in the midterm or those that could be solved or visualized with a single line of code.

**Question 9. *Type your question here***

In [None]:
# SOLUTION

#### Prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

*Type your answer here*

**Question 10. *Type your question here***

In [None]:
# SOLUTION

#### Prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

*Type your answer here*

### Full Submission (due Thursday, February 13th at 11:59pm PT)

To submit:
- **Save the notebook** first (press the icon on the top left).
- Go up to the `Kernel` menu and select `Restart & Clear Output` (make sure the notebook is saved first, because otherwise, you will lose all your work!).
- Go to `Cell -> Run All`. Carefully look through your notebook and verify that all computations execute correctly. You should see **no errors**; if there are any errors, make sure to correct them before you submit the notebook.
- Go to `File -> Download as -> Notebook` and download the notebook to your own computer. ([Please verify](https://ucsb-ds.github.io/ds1-f20/troubleshooting/#i-downloaded-the-notebook-file-but-it-saves-as-the-ipynbjson-extension-so-whenever-i-upload-it-to-gradescope-it-fails) that it got saved as an .ipynb file.)
- Upload the notebook to [Gradescope](https://www.gradescope.com/). You can drag and drop the file too.
- Make one submission per group. After you submit your notebook, you can Add Group Member. [How to Add Group Member in Gradescope](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members)