<font color='darkorange'> Unless otherwise noted, **this notebook will not be reviewed or autograded.**</font> You are welcome to use it for scratchwork, but **only the files listed in the exercises will be checked.**

---

# Exercises

For these exercises, add your functions to the *apputil\.py* file and *app\.py* file as instructed. *These exercises use the same [Titanic dataset](https://www.kaggle.com/competitions/titanic/data) as the lab.*


## Exercise 1: Survival Patterns


For this exercise you will analyze survival patterns on the Titanic by looking at passenger class, sex, and age group. Name the function `survival_demographics()`.

1. Create a new column in the Titanic dataset that classifies passengers into age categories (i.e., a pandas `category` series). The categories should be:
    - Child (up to 12)
    - Teen (13–19)
    - Adult (20–59)
    - Senior (60+)  
  
	Hint: The `pd.cut()` function might come in handy here.

2. Group the passengers by class, sex, and age group.  

3. For each group, calculate:  
    - The total number of passengers, `n_passengers`
    - The number of survivors, `n_survivors`
    - The survival rate, `survival_rate`

4. Return a table that includes the results for *all* combinations of class, sex, and age group.  

5. Order the results so they are easy to interpret.  

6. Come up with a clear question that your results table makes you curious about (e.g., “Did women in first class have a higher survival rate than men in other classes?”). Write this question in your `app.py` file above the call to your visualization function, using `st.write("Your Question Here")`.
   
7. Create a Plotly visualization in a function named `visualize_demographic()` that directly addresses your question by returning a Plotly figure (e.g., `fig = px. ...`). You are free to choose the chart type that you think best communicates the findings. Be creative — try different approaches, compare them, and ensure that your chart clearly answers the question you posed.
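
As a minimal sketch of the `pd.cut()` hint in step 1, the example below bins a handful of made-up ages (not taken from the Titanic data) into the four categories. The bin edges and the open-ended upper bound of `float("inf")` are assumptions chosen for illustration; with right-closed bins, a fractional age such as 12.5 lands in the next category up.

```python
import pandas as pd

# Toy ages (hypothetical values, not taken from the Titanic data).
ages = pd.Series([5, 12, 13, 19, 20, 59, 60, 80, None], name="Age")

# Right-closed bins: [0, 12], (12, 19], (19, 59], (59, inf)
bins = [0, 12, 19, 59, float("inf")]
labels = ["Child", "Teen", "Adult", "Senior"]

# include_lowest=True makes the first bin include its left edge (age 0).
age_group = pd.cut(ages, bins=bins, labels=labels, right=True, include_lowest=True)

print(age_group.dtype)     # categorical dtype
print(age_group.tolist())  # missing ages stay missing
```

The result is a pandas `category` series, so it can be used directly as a grouping key.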


In [36]:
import pandas as pd
import plotly.express as px
data = pd.read_csv('https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/titanic.csv')

In [38]:
# Right-closed age bins: [0, 12] Child, (12, 19] Teen, (19, 59] Adult, (59, max_age] Senior
max_age = data["Age"].max()
bins = [0, 12, 19, 59, max_age]
labels = ["Child", "Teen", "Adult", "Senior"]
data["age_cat"] = pd.cut(data["Age"], bins=bins, labels=labels, right=True, include_lowest=True)

data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,age_cat
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Adult
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Adult
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Adult
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Adult
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Adult


In [39]:
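# Group by class, sex, and age category; observed=False keeps age categories
# with no passengers, so every class/sex/age combination appears in the result
data_grouped = (
    data.groupby(["Pclass", "Sex", "age_cat"], observed=False)[["PassengerId", "Survived"]]
    .agg({"PassengerId": "count", "Survived": "sum"})
    .rename(columns={"PassengerId": "n_passengers", "Survived": "n_survived"})
    .reset_index()
)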





In [40]:
#Calculate the survival rate 
data_grouped["survival_rate"] = data_grouped["n_survived"]/data_grouped["n_passengers"]

In [41]:
#Arranging by survival rate 
data_grouped.sort_values(by = "survival_rate")

Unnamed: 0,Pclass,Sex,age_cat,n_passengers,n_survived,survival_rate
0,1,female,Child,1,0,0.0
23,3,male,Senior,4,0,0.0
14,2,male,Adult,76,4,0.052632
21,3,male,Teen,38,3,0.078947
13,2,male,Teen,10,1,0.1
22,3,male,Adult,186,26,0.139785
7,1,male,Senior,14,2,0.142857
5,1,male,Teen,4,1,0.25
15,2,male,Senior,4,1,0.25
20,3,male,Child,25,9,0.36


In [43]:
import streamlit as st  # `st` must be imported before calling st.write

st.write(
    "The cliché about the Titanic is the saying 'Women and children first.' "
    "This would make you think that women and children had a significantly higher survival "
    "rate than men. But this was also a time of stark socioeconomic divides, so the question is: "
    "Do women and children have higher survival rates across all passenger classes?"
)

In [None]:
# Start by simplifying the groups: put all female passengers and all children
# in one category, and all male, non-child passengers in another category

In [None]:
#Create the new variable 
data_grouped["woman_child"] = (data_grouped["Sex"] == "female") | (data_grouped["age_cat"] == "Child")

data_grouped


Unnamed: 0,Pclass,Sex,age_cat,n_passengers,n_survived,survival_rate,woman_child
0,1,female,Child,1,0,0.0,True
1,1,female,Teen,13,13,1.0,True
2,1,female,Adult,68,66,0.970588,True
3,1,female,Senior,3,3,1.0,True
4,1,male,Child,3,3,1.0,True
5,1,male,Teen,4,1,0.25,False
6,1,male,Adult,80,34,0.425,False
7,1,male,Senior,14,2,0.142857,False
8,2,female,Child,8,8,1.0,True
9,2,female,Teen,8,8,1.0,True


In [48]:
#Re-group the data 
data_regrouped = data_grouped.groupby(["Pclass", "woman_child"])\
    [["n_passengers", "n_survived"]].agg({"n_passengers": "sum", "n_survived": "sum"})\
          .reset_index()
#Then recalculate the survival rate 
data_regrouped["survival_rate"] = data_regrouped["n_survived"]/data_regrouped["n_passengers"]
data_regrouped

Unnamed: 0,Pclass,woman_child,n_passengers,n_survived,survival_rate
0,1,False,98,37,0.377551
1,1,True,88,85,0.965909
2,2,False,90,6,0.066667
3,2,True,83,77,0.927711
4,3,False,228,29,0.127193
5,3,True,127,56,0.440945


In [None]:
fig = px.bar(
    data_regrouped,
    x="Pclass",
    y="survival_rate",
    hover_data=["n_passengers"],
    color="woman_child",
    color_discrete_sequence=["#88AED0", "#E4A0B7"],
    template="plotly_white",
    barmode="group",
)
fig.update_layout(
    xaxis_title="Class",
    yaxis_title="Survival Rate",
    yaxis=dict(tickformat=".0%"),  # show rates as percentages
)
fig.show()



## Exercise 2: Family Size and Wealth

Using the Titanic dataset, write a function named `family_groups()` to explore the relationship between family size, passenger class, and ticket fare.  

1. Create a new column in the Titanic dataset that represents the total family size for each passenger, `family_size`. Family size is defined as the number of siblings/spouses aboard plus the number of parents/children aboard, plus the passenger themselves.

2. Group the passengers by family size and passenger class. For each group, calculate:  
   - The total number of passengers, `n_passengers`
   - The average ticket fare, `avg_fare`
   - The minimum and maximum ticket fares (to capture variation in wealth), `min_fare` and `max_fare`

3. Return a table with these results, sorted so that the values are clear and easy to interpret (for example, by class and then family size).

4. Write a function called `last_names()` that extracts the last name of each passenger from the `Name` column, and returns the count for each last name (i.e., a pandas series with last name as index, and count as value). Does this result agree with that of the data table above? Share your findings in your app using `st.write`.

5. Just like you did in Exercise 1, come up with a clear question that your results make you curious about. Write this question in your `app.py` file above the call to your visualization function. Then, create a Plotly visualization in a function named `visualize_families()` that directly addresses your question. As in Exercise 1, you are free to choose the chart type that you think best communicates the findings.
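
For step 4, one possible approach (a sketch, not the required implementation) relies on the `Name` column following a "Last, Title. First" pattern, as in the sample rows shown earlier. The toy series below is an assumption for illustration; its second entry is made up.

```python
import pandas as pd

# Toy names in the "Last, Title. First" format seen in the dataset preview.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Braund, Mrs. Example Person",  # hypothetical passenger, for illustration only
    "Allen, Mr. William Henry",
])

# Everything before the first comma is treated as the last name.
last_name_counts = names.str.split(",").str[0].str.strip().value_counts()

print(last_name_counts)
```

The result is a pandas series indexed by last name, which is the shape the exercise asks `last_names()` to return.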

In [79]:
#Creating a new column for family size 
data["family_size"] = data["SibSp"]  + data["Parch"] + 1

In [97]:
# Group by family size and class, using named aggregation for the fare statistics
grouped_data = (
    data.groupby(["family_size", "Pclass"])
    .agg(
        n_passengers=("PassengerId", "count"),
        avg_fare=("Fare", "mean"),
        min_fare=("Fare", "min"),
        max_fare=("Fare", "max"),
    )
    .reset_index()
    .rename(columns={"Pclass": "class"})
    .sort_values(by=["class", "family_size"])
)
grouped_data

Unnamed: 0,family_size,class,n_passengers,avg_fare,min_fare,max_fare
0,1,1,109,63.672514,0.0,512.3292
3,2,1,70,91.848039,29.7,512.3292
6,3,1,24,95.681075,26.2833,211.5
9,4,1,7,133.521429,120.0,151.55
12,5,1,2,262.375,262.375,262.375
15,6,1,4,263.0,263.0,263.0
1,1,2,104,14.066106,0.0,73.5
4,2,2,34,24.682962,11.5,33.0
7,3,2,31,31.693819,13.0,73.5
10,4,2,13,36.575969,11.5,65.0


In [80]:
data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'age_cat',
       'family_size'],
      dtype='object')

## Bonus Question

Add a new column, `older_passenger`, to the Titanic dataset that indicates whether each passenger’s age is above the median age for *their* passenger class. So, suppose row $x$ is in passenger class 2. Then, a value of `True` at row $x$ would indicate that the passenger is older than 50% of class 2 passengers, and `False` would indicate that they are not.

- You should use pandas functions to accomplish this.
- The new column should contain Boolean values (True if the age is above the median, False if less than or equal to).
- Return the updated table in the function `determine_age_division()`

Once you’ve created this column, consider how this age division relates to your analysis above. Try to visualize this analysis in Plotly using the function name `visualize_age_division()`.
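
One way to compare each age against its own class median is `groupby(...).transform("median")`, which returns a series aligned with the original rows. The sketch below uses a small made-up table rather than the Titanic data. Note that under this comparison a missing age evaluates to `False`, which is an assumption about how missing values should be handled.

```python
import pandas as pd

# Toy data (hypothetical values) with one missing age.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 2],
    "Age": [20.0, 40.0, 60.0, 10.0, 30.0, None],
})

# Median age of each passenger's own class, repeated on every row of that class.
class_median = toy.groupby("Pclass")["Age"].transform("median")

# True only when the age is strictly above the class median.
toy["older_passenger"] = toy["Age"] > class_median

print(toy)
```

Because `transform` keeps the original row order and length, the Boolean result can be assigned straight back as a new column.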