# INST414 — Lab 3: GroupBy → Conditional Probabilities (with Pandas)

**What you’ll do today:** learn a small set of `groupby` patterns and use them to compute **conditional probabilities** from data.

## Learning goals
By the end, you should be able to:
- Use `groupby` to compute averages **within groups**.
- Explain why `.mean()` of a 0/1 (or True/False) column is a probability.
- Compute conditional probabilities like `P(survived=1 | pclass)` using `groupby`.
- Condition on more than one variable (e.g., `sex` and `pclass`).
- Use `apply(lambda grp: ...)` when you need logic inside each group.

## How to work in this notebook
- Run cells from top to bottom. If a cell raises an error, fix it, then re-run starting from the cell after the last one that ran successfully.
- When you see **Checkpoint** prompts, pause and try before scrolling.

## Common issues (quick fixes)
- Combining conditions: use parentheses and `&` / `|`:
  - ✅ `(df['a'] > 1) & (df['b'] == 'x')`
  - ❌ `df['a'] > 1 & df['b'] == 'x'`
- Multi-group output can look “stacked” (a MultiIndex). Use `.unstack()` to make it easier to read.
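
To see the parenthesized pattern from the first bullet in action, here is a minimal sketch on a made-up table. The name `toy` and its values are hypothetical; only the column names `a` and `b` come from the bullets above.

```python
import pandas as pd

# Hypothetical toy table; the values are invented for illustration.
toy = pd.DataFrame({'a': [0, 2, 3, 5], 'b': ['x', 'y', 'x', 'x']})

# Each comparison gets its own parentheses, then & combines them row by row.
mask = (toy['a'] > 1) & (toy['b'] == 'x')

print(mask.tolist())   # one True/False per row
print(toy[mask])       # only the rows where both conditions hold
```

The parentheses matter because `&` binds more tightly than `>` and `==`, so without them Python would try to evaluate `1 & df['b']` first.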


# Load modules and settings

**Before you start:** Click **File → Save a copy in Drive** so you have your own version of this notebook. If you skip this step, your work will not be saved.

**Turn off AI assistance:** Go to **Settings → AI Assistance** and uncheck everything. AI-generated code is not allowed on assignments in this course.


In [None]:
import pandas as pd

pd.options.display.max_columns = None
pd.options.display.max_rows = 20


# Part 1 — Practice `groupby` on a tiny table

We’ll start with a small toy dataset so you can see `groupby` clearly.


In [None]:
df = pd.DataFrame({
    'group': ['a','a','a','a','b','b','b','b'],
    'x':     [0,  1,  1,  0,  0,  0,  0,  1],
    'y':     [0,  0,  1,  0,  0,  1,  1,  0],
})

df


## `groupby`: do the same operation within each subgroup

Example: “What is the mean of `x` in group `a` vs group `b`?”


In [None]:
df.groupby('group')['x'].mean()


**Checkpoint:** Compute the *sum* of `y` within each group.


In [None]:
# Try it here


## `apply` + `lambda`: custom calculations inside each group

Sometimes you want logic inside each group.

Example: “Within each group, what fraction of rows have `x == 1`?”

This is a probability: `P(x==1 | group)`.


In [None]:
df.groupby('group').apply(lambda grp: (grp['x'] == 1).mean())
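
As a check on the idea that a fraction of rows is a probability, the sketch below computes `P(x==1 | group)` for group `a` by hand and compares it with a grouped mean. It rebuilds the toy table so it runs on its own; the names `in_a`, `p_manual`, and `p_grouped` are introduced here for illustration.

```python
import pandas as pd

# Same toy table as above (without 'y'), rebuilt so this block is self-contained.
df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'x':     [0,   1,   1,   0,   0,   0,   0,   1],
})

# Manual version for group 'a': count rows with x == 1, divide by the group size.
in_a = df[df['group'] == 'a']
p_manual = (in_a['x'] == 1).sum() / len(in_a)

# Grouped version: the mean of a True/False Series is the fraction of Trues.
p_grouped = (df['x'] == 1).groupby(df['group']).mean()

print(p_manual)         # 0.5
print(p_grouped['a'])   # 0.5
print(p_grouped['b'])   # 0.25
```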


# Part 2 — Titanic: conditional probabilities with `groupby`

Each row is a passenger.


In [None]:
titanic = pd.read_csv('https://zjelveh.github.io/files/titanic.csv')

columns_to_keep = ['pclass', 'sex', 'survived', 'age', 'fare']
titanic = titanic[columns_to_keep]

titanic.head()


## Sanity checks

**Checkpoint:** How many rows and columns are in this dataset? What are three columns you recognize?


In [None]:
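# A notebook cell only displays its last expression, so print both.
print(titanic.shape)     # (number of rows, number of columns)
print(titanic.columns)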


## Conditional probability as a grouped mean

Question: “Among passengers in each class, what fraction survived?”

That is `P(survived=1 | pclass)`.


In [None]:
titanic.groupby('pclass')['survived'].mean()
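
To connect the grouped mean to the textbook definition `P(A | B) = P(A and B) / P(B)`, here is a small sketch. It uses a made-up table called `mini` (not the real Titanic data), so the numbers only illustrate the mechanics.

```python
import pandas as pd

# Made-up mini passenger table (NOT the real Titanic data), for illustration only.
mini = pd.DataFrame({
    'pclass':   [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    'survived': [1, 1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Definition: P(survived=1 | pclass=1) = P(survived=1 and pclass=1) / P(pclass=1)
p_joint = ((mini['survived'] == 1) & (mini['pclass'] == 1)).mean()
p_class = (mini['pclass'] == 1).mean()
p_by_definition = p_joint / p_class

# Shortcut: the grouped mean of the 0/1 column gives every class at once.
p_by_groupby = mini.groupby('pclass')['survived'].mean()

print(p_by_definition)
print(p_by_groupby)
```

Up to floating-point rounding, the two approaches agree for class 1. The `groupby` version simply repeats the same division for each class.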


### A common mistake (and the fix)

You might try to write something like:

```py
# DON'T RUN THIS (it will error)
# titanic.groupby('pclass').survived==1.mean()
```

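This fails because Python tries to attach `.mean()` to the number `1` rather than to the comparison. The fix is to build the True/False indicator first (wrapped in parentheses) and **then** take the mean within each group.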


In [None]:
# Two correct patterns:

# (A) create the indicator first, then groupby + mean
(titanic['survived'] == 1).groupby(titanic['pclass']).mean()

# (B) use apply + lambda
# titanic.groupby('pclass').apply(lambda grp: (grp['survived'] == 1).mean())


## Conditioning on more than one variable

Compute `P(survived=1 | sex, pclass)`.

**Tip:** `.unstack()` makes the table easier to read.


In [None]:
rates = titanic.groupby(['sex', 'pclass'])['survived'].mean()

rates.unstack()
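
To see what `.unstack()` changes, the sketch below compares the stacked and unstacked results on a made-up table. `mini` and its values are hypothetical and are not the real Titanic data.

```python
import pandas as pd

# Made-up mini table (NOT the real Titanic data), for illustration only.
mini = pd.DataFrame({
    'sex':      ['female', 'female', 'female', 'female', 'male', 'male', 'male', 'male'],
    'pclass':   [1, 1, 3, 3, 1, 1, 3, 3],
    'survived': [1, 1, 1, 0, 1, 0, 0, 0],
})

# Stacked: one row per (sex, pclass) pair, indexed by a MultiIndex.
stacked = mini.groupby(['sex', 'pclass'])['survived'].mean()

# Unstacked: the last grouping key (pclass) moves into the columns.
wide = stacked.unstack()

print(stacked)
print(wide)
```

By default `.unstack()` pivots the innermost index level, so the order of the keys in `groupby([...])` decides which variable ends up in the columns.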


## Conditional joint distributions with `value_counts`

Compute `P(sex, pclass | survived)`.

**Checkpoint:** Verify that within each `survived` group, the probabilities sum to 1.


In [None]:
joint_cond = titanic.groupby('survived')[['sex', 'pclass']].value_counts(normalize=True)

joint_cond
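
To see what `normalize=True` does inside each group, this sketch puts raw counts next to the normalized shares on a made-up table. `mini`, `counts`, and `shares` are illustrative names, and the data is not the real Titanic data.

```python
import pandas as pd

# Made-up mini table (NOT the real Titanic data), for illustration only.
mini = pd.DataFrame({
    'survived': [0, 0, 0, 0, 1, 1],
    'sex':      ['male', 'male', 'male', 'female', 'female', 'female'],
    'pclass':   [3, 3, 1, 3, 1, 1],
})

grouped = mini.groupby('survived')[['sex', 'pclass']]

# Raw counts of each (sex, pclass) combination within each survived group.
counts = grouped.value_counts()

# Same combinations, but each count is divided by the size of its survived group.
shares = grouped.value_counts(normalize=True)

print(counts)
print(shares)
```

In this table the `survived == 0` group has four rows, so a combination that appears twice gets a share of 2/4.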


# Lab Task

Try these without looking back too much. The goal is to recognize patterns:
- create indicator columns
- compute conditional probabilities with `groupby` + `.mean()`
- use `apply(lambda grp: ...)` when needed

1) Create a new column called `age_under_18` that is True if `age < 18` and False otherwise.


In [None]:
# 1) Try it here


2) Compute `P(survived==1 | sex, pclass, age_under_18)`.

- Which group was most likely to survive?
- Least likely?

**Hint:** Some ages are missing. A comparison such as `age < 18` evaluates to False when `age` is missing, so decide whether to include those passengers (they will be counted as not under 18) or to restrict to passengers with known ages.
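
The effect of missing values on a comparison can be seen on a tiny made-up Series. This is a sketch of the behavior only, not a solution: the name `ages` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, one of them missing.
ages = pd.Series([5.0, 40.0, np.nan])

# The missing value compares as False, so it is silently counted as "not under 18".
under_18_all = ages < 18
print(under_18_all.tolist())
print(under_18_all.mean())     # denominator includes the missing row

# Restricting to known ages changes the denominator.
known = ages.dropna()
print((known < 18).mean())
```

The two means differ because the first divides by three rows and the second by two. Either choice can be defended, as long as it is made deliberately.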


In [None]:
# 2) Try it here


3) Compute `P(age > 50 | survived)`.

**Hint:** consider restricting to rows with non-missing age.


In [None]:
# 3) Try it here


4) What is the average `fare` by `sex`?


In [None]:
# 4) Try it here


5) Compute `P(survived, pclass | sex)`.

- Which rows add up to 1?
- Among men, which group was least likely to survive?


In [None]:
# 5) Try it here


6) Compute `P(age_under_18 == 0, sex == 'male' | survived)`.


In [None]:
# 6) Try it here


7) Compute `P(age_under_18, survived==1 | sex)`.

**Checkpoint:** What adds up to 1 here, and why?


In [None]:
# 7) Try it here
