In [None]:
# Run this cell once when starting the notebook.
# On Google Colab, wait for the Google Drive permission prompts before proceeding
import matplotlib.pyplot as plt
%matplotlib inline
import os
import sys
try:
    %load_ext jupyter_ai_magics
except Exception:
    print("%%ai cells will not work in this notebook")
    print("Please use Gemini for AI queries instead")
from datascience import *
import numpy as np
import math

DATA_FILENAME="data/titanic.csv"
try:
  from google.colab import drive
  drive.mount('/content/drive', force_remount=True)
  !mkdir -p /content/data
  !gdown --fuzzy https://drive.google.com/file/d/1ZOLRWMYK6XVLY8E8xdvzysyRbCkcoMve/view?usp=sharing -O /content/data/titanic.csv
  DATA_FILENAME = DATA_FILENAME.replace("data/", "/content/data/")
  !ls -l /content/data
except Exception:
  print("Google Drive not mounted; this is normal on Jupyter Hub")


## CS5A S25 Midterm Project: Titanic

* Please refer to the general instructions in [this document](https://docs.google.com/document/d/1gYMuXukOTtJkEthxUeHmWuA1Qn9BtIYkSJa37Gb524E/edit?usp=sharing) before starting.
* You may work on either JupyterHub or Google Colab.
* The Google Colab version is in [this folder](https://drive.google.com/drive/folders/1r6v3k0fO5rGgLqPiDkfA71TB_hWkjKpK?usp=drive_link); make a copy of the notebook in your [group folder for the midterm team project](https://docs.google.com/spreadsheets/d/1oLPWP0j1jvx7eLq7kcQc15PiSDUwjlnZAolUVSHWBOQ/edit?usp=sharing) before you start editing.





## Names
Please list all students who worked on this project.
1. Student Name 1
2. Student Name 2
3. Student Name 3
4. Student Name 4

*Write your team member responsibility distribution here*

## Titanic Dataset
The *R.M.S. Titanic* was a British luxury passenger and mail-carrying ocean liner operated by the White Star Line. It was the largest ship afloat when it entered service, and it was thought to be unsinkable. The *Titanic* began its maiden voyage from Southampton, United Kingdom, on April 10, 1912, en route to New York City. However, the *Titanic* struck an iceberg at 11:40 p.m. on April 14, 1912, and sank in the Atlantic Ocean during the morning of April 15, 1912.

The *Titanic* dataset was created by Thomas Cason of the University of Virginia and reflects information known as of August 2, 1999. The dataset contains information on 1309 passengers and does not include the *Titanic* crew. A total of 2240 people (passengers and crew) sailed on the *Titanic*. The dataset originated from the [Encyclopedia Titanica](https://www.encyclopedia-titanica.org/), which aims to tell the story of every person who traveled on the *Titanic*. The dataset is continually being updated, and therefore some data are missing, which is common in real-world datasets. Missing values are marked by `nan`.

Below are the descriptions of the variables in the dataset.

- `Pclass`: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)

- `survived`: Survival (0 = No; 1 = Yes)

- `name`: Name

- `sex`: Sex

- `age`: Age (years)

- `sibsp`: Number of Siblings/Spouses Aboard

- `parch`: Number of Parents/Children Aboard

- `ticket`: Ticket Number

- `fare`: Passenger Fare (British pound)

- `embarked`: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

- `boat`: Lifeboat

- `body`: Body Identification Number

- `home.dest`: Home/Destination

In [None]:
# Run This Cell

# Read the dataset
titanic_unclean = Table.read_table(DATA_FILENAME)

# Display the first few rows of the table
titanic_unclean.show(10)

**Question 1:** Real-world datasets do not always have complete information! If you look at some of the rows above, you will find `nan` (Not a Number) values. **In this dataset, `nan` values are not strings.** They are special floating-point values that stand for missing data, and they can cause issues if we use their rows in any kind of arithmetic. There are many ways to work with missing or skewed data. For now, we will simply delete any rows that have missing information. Name the final table something meaningful, such as `titanic`. This is the table we will use to answer all the other questions.

*Hint:*
- One way to do this is to write a function and then use the `.apply()` method.
- **Your final dataset should have roughly 1000-1100 rows.**
- In Python, `nan` is not equal to anything, not even itself. So, to check whether a variable `x` is `nan`, you can test `x != x`.

**Note: Do not remove rows that are missing values only in the `boat`, `body`, or `home.dest` columns. Missing values in `boat` and `body` are expected: a missing `boat` means the passenger did not board a lifeboat, and a missing `body` means the passenger either survived or their body was not recovered. We can work with `home.dest` as it is.**
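
The self-inequality check from the hint can be sketched as follows. This is only a small illustration on stand-alone values, not a full solution; the helper name `is_missing` is hypothetical and is not part of the `datascience` library.

```python
import math

# Stand-in for a missing value, as it would appear after reading a CSV.
missing = float("nan")

def is_missing(x):
    """Return True when x is NaN, relying on the fact that NaN != NaN."""
    return x != x

print(is_missing(missing))   # True: NaN is not equal to itself
print(is_missing(29.0))      # False: an ordinary number equals itself
print(is_missing("S"))       # False: strings also equal themselves
print(math.isnan(missing))   # True: math.isnan works too, but only on numbers
```

Unlike `math.isnan`, the `x != x` test does not raise an error on strings, which may be convenient when a column mixes text and missing values.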

In [None]:
# SOLUTION

#### Prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

*Type your answer here*

We now have the final dataset ready to use!

**Question 2:** Exploratory data analysis is one of the first steps data scientists take to identify and analyze patterns in the data. Investigate the passengers aboard the *Titanic* so you can get to know your data better. What was the average age of the passengers? What was the sum of all the passenger fares? How many males and females were on board? Which was the most common port of embarkation?


In [None]:
# SOLUTION

#### Prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

*Type your answer here*

**Question 3:** Explore the relationship between `age` and `fare`. Is there a correlation between the two variables? Think about what kind of plot you could make to show a correlation between two variables, and produce that below. What can you infer from this plot?
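
As a numeric companion to a plot, one common way to quantify a linear association is the correlation coefficient, computed as the average product of the two variables in standard units. The sketch below is one possible approach, not the required one. The helper names `standard_units` and `correlation` and the toy values are assumptions made for illustration; they are not taken from the dataset.

```python
import math

def standard_units(values):
    """Convert values to standard units: (value - mean) / standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

def correlation(xs, ys):
    """Average of the products of xs and ys, both in standard units."""
    xs_su = standard_units(xs)
    ys_su = standard_units(ys)
    return sum(x * y for x, y in zip(xs_su, ys_su)) / len(xs)

# Made-up toy values, only to show the calculation.
toy_ages = [22.0, 38.0, 26.0, 35.0, 54.0]
toy_fares = [7.25, 71.28, 7.93, 53.10, 51.86]

r = correlation(toy_ages, toy_fares)
print(round(r, 3))
```

A value near 1 or -1 suggests a strong linear association, and a value near 0 suggests little linear association. The plot is still needed to spot non-linear patterns and outliers.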

In [None]:
# SOLUTION

#### Prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

*Type your answer here*

**Question 4:** Examine the relationship between passenger class (`Pclass`), which serves as a proxy for socio-economic status, and survival (`survived`). Calculate the survival rate for each passenger class. What percentage of first-class passengers survived? What percentage of third-class passengers survived? What could be possible reasons for the discrepancy? What might this tell us about the socio-economic disparities in survival rates? Support your answer with statistics.

In [None]:
# SOLUTION

#### Prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

*Type your answer here*

**Question 5:** Create an age profile of the passengers on board. You can categorize the ages into different groups (e.g., 0-12, 13-19, 20-35, etc.) and then calculate the proportion of passengers in each age group. Using this, also find the most common passenger class within each age group. Discuss any interesting findings and how this age profile might reflect the demographics of early 20th-century travelers.
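
The grouping step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `age_group`, the final `36+` bucket, and the toy ages are all hypothetical, and your own bin boundaries may differ.

```python
from collections import Counter

def age_group(age):
    """Map an age in years to a labeled group (boundaries are an assumption)."""
    if age < 13:
        return "0-12"
    elif age < 20:
        return "13-19"
    elif age < 36:
        return "20-35"
    else:
        return "36+"

# Made-up toy ages; fractional ages are handled by the < comparisons.
toy_ages = [0.92, 8, 15, 22, 29, 35, 40, 61]

groups = [age_group(a) for a in toy_ages]
counts = Counter(groups)

# Proportion of passengers in each group.
proportions = {label: count / len(toy_ages) for label, count in counts.items()}
print(proportions)
```

A function like this is the kind of thing that can be passed to `.apply()` on the age column to build a new column of group labels.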

In [None]:
# SOLUTION

#### Prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

*Type your answer here*

**Question 6:** Explore how the number of siblings/spouses (`sibsp`) and parents/children (`parch`) aboard might relate to a passenger's chance of survival. Start by creating new variables to represent family size, and then group the passengers by their family size to calculate survival rates. What trends do you notice?
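
One possible way to set up the family-size variable and the grouped survival rates is sketched below on a few made-up rows. Defining family size as `sibsp + parch + 1` (counting the passenger) is an assumption, as are the toy rows; other definitions are equally reasonable.

```python
from collections import defaultdict

# Made-up toy rows of (sibsp, parch, survived); not taken from the dataset.
toy_rows = [
    (0, 0, 0),
    (0, 0, 1),
    (1, 0, 1),
    (1, 0, 1),
    (1, 2, 0),
    (0, 0, 0),
]

# Collect the survival indicators for each family size.
survived_by_size = defaultdict(list)
for sibsp, parch, survived in toy_rows:
    family_size = sibsp + parch + 1  # assumption: include the passenger
    survived_by_size[family_size].append(survived)

# The mean of a 0/1 column is the proportion of 1s, i.e. the survival rate.
rates = {size: sum(vals) / len(vals) for size, vals in survived_by_size.items()}

for size in sorted(rates):
    print(size, round(rates[size], 2))
```

When interpreting the rates, check how many passengers fall in each family size, since a rate based on one or two passengers is not very informative.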

In [None]:
# SOLUTION

#### Prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

*Type your answer here*

**Question 7:** Investigate the passengers on the *Titanic* who got on a lifeboat. How many were there? How many were males, and how many were females? What were their top three home destinations? How many survived?

Think about how you can best display the distribution of ages in equal-sized bins among those who escaped and those who didn't, and make appropriate plots. What can you infer from the plots?

In [None]:
# SOLUTION

#### Prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

*Type your answer here*

**Question 8:** Investigate and plot the frequency of the titles (e.g., Mr, Mrs) of the passengers. What are the names of the passengers aboard the ship whose title was unique? Then, analyze and explain the relationship between passenger title and at least two other variables in the dataset that you believe would give the best insights. Are there any patterns? What could be the reasons for such patterns? Support your answer with data and use comparative charts.

Hint: Python has functions/methods like `append()` and `split()` that you can potentially use to extract a passenger's title.
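
The hint can be illustrated with a short sketch. It assumes names are written as `Last, Title. First`, which you should verify against the `name` column before relying on it. The helper name `extract_title` and the sample names are hypothetical.

```python
from collections import Counter

def extract_title(name):
    """Return the text between the first comma and the following period."""
    after_comma = name.split(",", 1)[1]        # e.g. " Mr. John"
    return after_comma.split(".")[0].strip()   # e.g. "Mr"

# Made-up names in the assumed "Last, Title. First" format.
toy_names = [
    "Doe, Mr. John",
    "Roe, Mrs. Jane (Mary Smith)",
    "Poe, Master. Sam",
    "Low, Mr. Alan",
]

# Build the list of titles one at a time with append().
titles = []
for name in toy_names:
    titles.append(extract_title(name))

title_counts = Counter(titles)
print(title_counts)
```

Titles that appear exactly once in the counts are the unique ones the question asks about.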

In [None]:
# SOLUTION

#### Prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

*Type your answer here*

## Novel analysis
For the last section of your midterm project, you are expected to think like a data scientist. Formulate **TWO** unique questions/problems/insights that you can solve or obtain from the dataset given to you. **You need to come up with the problem and present the solution in the notebook**. The problems you come up with will be graded on novelty and the work necessary to obtain them. **DO NOT** reuse problems that are already covered in the midterm or those that could be solved or visualized with a single line of code.

**Question 9. *Type your question here***

In [None]:
# SOLUTION

#### Prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

*Type your answer here*

**Question 10. *Type your question here***

In [None]:
# SOLUTION

#### Prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

*Type your answer here*

# Deadline and Submission

Refer to these links:

* [Deadlines](https://docs.google.com/document/d/1gYMuXukOTtJkEthxUeHmWuA1Qn9BtIYkSJa37Gb524E/edit?tab=t.0#bookmark=id.fqop8bpqcnvf)
* [Submission Instructions](https://docs.google.com/document/d/1gYMuXukOTtJkEthxUeHmWuA1Qn9BtIYkSJa37Gb524E/edit?tab=t.0#bookmark=id.vsf44v7st0t)
