# INST414 — Lab 2: Tables → Probabilities (with Pandas)

**What you’ll do today:** use a small set of Pandas patterns to compute **marginal**, **joint**, and **conditional** probabilities from data.

## Learning goals
By the end, you should be able to:
- Use `iloc` to grab rows/columns by **position**.
- Create new columns using **direct assignment** and `.assign(...)`.
- Turn counts into probabilities with `.value_counts(normalize=True)`.
- Compute:
  - a **marginal** probability (one variable),
  - a **joint** probability (two variables),
  - and a **conditional** probability (one variable *given* another).

## How to work in this notebook
- Run cells top-to-bottom. If a cell raises an error, fix it and re-run from that cell onward.
- When you see **Checkpoint** prompts, pause and try before scrolling.
- If your output doesn’t match the expected *type* (Series vs DataFrame), check your brackets: `df['col']` returns a Series, `df[['col']]` returns a DataFrame.

## Common issues (quick fixes)
- **`NameError: name 'pd' is not defined`** → you didn’t run the import cell.
- **`KeyError: 'colname'`** → the column name is misspelled; check `df.columns`.
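- **`TypeError` (or `ValueError`) when combining conditions** → wrap each condition in parentheses and combine with `&` / `|`, because `&` and `|` bind more tightly than comparisons like `>` and `==`: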
  - ✅ `(df['a'] > 1) & (df['b'] == 'x')`
  - ❌ `df['a'] > 1 & df['b'] == 'x'`
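
The precedence problem can be reproduced on a small table. This is a minimal sketch: the `demo` DataFrame and its values are made up for illustration, and the exact error type may depend on the column types and your Pandas version.

```python
import pandas as pd

# Hypothetical toy table, only for this illustration
demo = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "x", "y"]})

# Correct: each comparison is wrapped in parentheses before combining with &
ok = demo[(demo["a"] > 1) & (demo["b"] == "x")]
print(ok)

# Incorrect: without parentheses, Python evaluates `1 & demo["b"]` first,
# which is not a meaningful operation on a column of strings
error_name = None
try:
    demo[demo["a"] > 1 & demo["b"] == "x"]
except Exception as err:
    error_name = type(err).__name__
    print("Raised:", error_name)
```

If you see an error like this in your own code, adding parentheses around each condition is usually the fix.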


# Load modules and settings

**Before you start:** Click **File → Save a copy in Drive** so you have your own version of this notebook. If you skip this step, your work will not be saved.

**Turn off AI assistance:** Go to **Settings → AI Assistance** and uncheck everything. AI-generated code is not allowed on assignments in this course.

In [None]:
import pandas as pd

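# Show every column; long tables are truncated once they exceed 30 rows
pd.options.display.max_columns = None
pd.options.display.max_rows = 30

# Display the result of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"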


# Part 1 — Practice on a tiny table

Before we jump into a real dataset, we’ll practice on a tiny table where you can see *every* row.

- Each row is one **case**.
- Each column is a **variable**.
- Each cell is a **value**.


In [None]:
df = pd.DataFrame({
    "caseID": [101, 102, 103, 104],
    "year":   [2019, 2019, 2020, 2020],
    "sex":    ["F", "F", "M", "M"],
    "age":    [30, 20, 40, 22],
    "fta":    [0, 0, 0, 1],   # 1 means "failed to appear"
})

df


## Quick inspection: shape, columns, head

- `df.shape` gives `(rows, columns)`
- `df.columns` lists variable names
- `df.head()` shows the first few rows


In [None]:
df.shape
df.columns
df.head()


## Selecting columns (Series vs DataFrame)

- `df['age']` is a **Series** (one column)
- `df[['age']]` is a **DataFrame** (a table with one column)


In [None]:
df['age']
type(df['age'])

df[['age']]
type(df[['age']])


## Summaries for numeric columns

If a column is numeric, you can summarize it quickly:
- `.sum()` adds values
- `.mean()` averages values

**Checkpoint:** What does `df.fta.mean()` represent in plain English?


In [None]:
df.age.sum()
df.age.mean()

df.fta.sum()
df.fta.mean()


## Counts → probabilities with value_counts

- `value_counts()` gives counts for each category
- `value_counts(normalize=True)` divides by the total count and gives **probabilities**

These probabilities should add up to 1.


In [None]:
df.sex.value_counts()
df.sex.value_counts(normalize=True)
df.sex.value_counts(normalize=True).sum()
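
To see what `normalize=True` is doing, you can divide the counts by their total yourself. The sketch below rebuilds the `sex` column from the tiny table as a standalone Series so it runs on its own; the names `by_hand` and `built_in` are just labels for this example.

```python
import pandas as pd

# Same values as the `sex` column in the tiny table
sex = pd.Series(["F", "F", "M", "M"], name="sex")

counts = sex.value_counts()                   # number of rows per category
by_hand = counts / counts.sum()               # divide each count by the total
built_in = sex.value_counts(normalize=True)   # same calculation in one step

print(by_hand)
print(built_in)
```

Note that `value_counts` drops missing values by default, so the total it divides by is the number of non-missing rows.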


## Joint distributions (two variables)

If you run `value_counts(normalize=True)` on multiple columns, you get a **joint distribution**.

Interpretation example:
- the pair `(sex='F', fta=0)` is one joint outcome
- its value is the probability of that joint outcome in the table


In [None]:
df[['sex', 'fta']].value_counts(normalize=True)
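
To pull out the probability of one joint outcome, you can index the result with a tuple of values. This is a sketch on a rebuilt copy of the tiny table (only the `sex` and `fta` columns); the variable names are illustrative.

```python
import pandas as pd

# Rebuild the two relevant columns of the tiny table
df = pd.DataFrame({
    "sex": ["F", "F", "M", "M"],
    "fta": [0, 0, 0, 1],
})

joint = df[["sex", "fta"]].value_counts(normalize=True)

# Look up single joint outcomes by (sex, fta)
p_f_and_0 = joint.loc[("F", 0)]
p_m_and_1 = joint.loc[("M", 1)]
print(p_f_and_0, p_m_and_1)

# Cross-check: the mean of a combined True/False condition gives the same number
check = ((df["sex"] == "F") & (df["fta"] == 0)).mean()

# A combination with zero rows does not appear in the result at all
has_f_and_1 = ("F", 1) in joint.index
print(has_f_and_1)
```

Because `(sex='F', fta=1)` never occurs in this table, its probability is 0, but it shows up as a missing entry rather than as an explicit 0.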


## Filtering rows (logical conditions)

Core pattern:
1) build a True/False condition
2) use it inside `df[ ... ]` to keep only matching rows

**Checkpoint:** Filter to only `year == 2019`, then compute the mean of `fta` in that subset.


In [None]:
df[df.year == 2019]


In [None]:
# Try it here


### Multiple conditions

When you combine conditions, use parentheses and `&` (AND) / `|` (OR).


In [None]:
df[(df.year == 2020) & (df.sex == "M")]


## Creating new columns

Two common patterns:

### 1) Direct assignment
`df['new_col'] = ...`

### 2) .assign(...)
`df = df.assign(new_col = ...)`

We’ll use this a lot to make **indicator columns** (True/False) that represent events.


In [None]:
# Direct assignment
df['adult'] = df.age >= 18
df


In [None]:
# .assign(...) returns a new DataFrame, so we reassign the result to df
df = df.assign(is_male = df.sex == "M")
df
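
The practical difference between the two patterns is what happens to the original table. The sketch below uses a made-up `people` DataFrame and a hypothetical `senior` column to show that direct assignment changes the table in place, while `.assign(...)` leaves it alone unless you store the result.

```python
import pandas as pd

# Hypothetical values, only for this illustration
people = pd.DataFrame({"age": [30, 20, 40, 16]})

# Direct assignment modifies `people` in place
people["adult"] = people["age"] >= 18

# .assign(...) returns a new DataFrame; `people` itself is unchanged
with_senior = people.assign(senior=people["age"] >= 65)

print(list(people.columns))
print(list(with_senior.columns))
```

This is why the `.assign(...)` cell above writes `df = df.assign(...)`: without the `df =`, the new column would be lost.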


## iloc: selecting by position

`iloc` selects by integer position (0-based):
- `df.iloc[0]` → first row
- `df.iloc[:, 2]` → third column (all rows)
- `df.iloc[1:3, 1:4]` → rows 1 and 2, columns 1 through 3 (the end of a slice is excluded)

**Checkpoint:** Use `iloc` to grab the value in row 0, column 3 (the first row’s age).


In [None]:
df.iloc[0]
df.iloc[:, 2]
df.iloc[1:3, 1:4]


In [None]:
# Try it here


# Part 2 — Apply the same patterns to the Titanic dataset

Now we’ll use a real dataset and compute probabilities from it.

**Dataset:** passengers on the Titanic (subset of the Kaggle dataset).


In [None]:
titanic = pd.read_csv("https://zjelveh.github.io/files/titanic.csv")
titanic.head()


## Sanity checks

**Checkpoint:** How many rows and columns are in this dataset? What are three columns you recognize?


In [None]:
titanic.shape
titanic.columns


## Marginal distributions (one variable)

A **marginal distribution** is just the distribution of one variable by itself.

Examples:
- distribution of `survived`
- distribution of `sex`
- distribution of `pclass`


In [None]:
titanic.survived.value_counts(normalize=True)
titanic.sex.value_counts(normalize=True)
titanic.pclass.value_counts(normalize=True)


## From a distribution to a single probability

Sometimes you want a single probability like: “probability that age is over 18”.

A clean way to do this:
1) create an indicator column like `age_over_18`
2) take its mean (because True is treated like 1, False like 0)

**Important:** `age` has missing values. Any comparison with `NaN`, such as `NaN > 18`, evaluates to False, so a passenger with a missing age is counted as *not* over 18.
So we’ll compute it two ways:
- (A) naive: includes missing ages as False
- (B) restricted: only among rows with known ages


In [None]:
titanic = titanic.assign(age_over_18 = titanic.age > 18)

# (A) naive: missing ages are treated as False
titanic.age_over_18.mean()


In [None]:
# (B) restricted: only among passengers with known ages
known_age = titanic[titanic.age.notna()]
known_age.age_over_18.mean()
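
The gap between (A) and (B) is easier to see on numbers you can check by hand. This sketch uses four made-up ages, one of them missing; it is not Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; the third one is missing
ages = pd.Series([10, 25, np.nan, 40])

# Comparing NaN with 18 gives False, so the missing age counts as "not over 18"
over_18 = ages > 18
print(over_18.tolist())

# (A) naive: 2 True values out of 4 rows
naive = over_18.mean()

# (B) restricted: 2 True values out of the 3 rows with a known age
restricted = over_18[ages.notna()].mean()

print(naive, restricted)
```

The numerator is the same in both cases; only the denominator changes, so (A) can never be larger than (B).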


## Joint distributions (two variables)

A **joint distribution** answers questions like:
- “What is the probability a passenger is female AND survived?”
- “What is the probability a passenger is male AND did not survive?”

Compute it with multi-column `value_counts(normalize=True)`.


In [None]:
titanic[['sex', 'survived']].value_counts(normalize=True)


## Marginalizing from a joint distribution

You can recover a marginal distribution by summing over the other variable.

Below we:
1) build a joint table for `sex` and `survived`
2) reshape it into a 2×2 table
3) sum over `sex` (down the rows, `axis=0`) to get the marginal distribution of `survived`


In [None]:
joint = titanic[['sex', 'survived']].value_counts(normalize=True)
joint_table = joint.unstack()   # rows: sex, cols: survived
joint_table


In [None]:
# Marginal distribution of survived (summing over sex)
joint_table.sum(axis=0)
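
The same steps can be checked by hand on the tiny table from Part 1. This is a sketch on a rebuilt copy of its `sex` and `fta` columns. One assumption worth flagging: because `(sex='F', fta=1)` never occurs there, `unstack()` would leave that cell as `NaN`, so the example passes `fill_value=0` to show it as an explicit 0.

```python
import pandas as pd

# Rebuild the two relevant columns of the tiny table
df = pd.DataFrame({
    "sex": ["F", "F", "M", "M"],
    "fta": [0, 0, 0, 1],
})

joint = df[["sex", "fta"]].value_counts(normalize=True)
joint_table = joint.unstack(fill_value=0)   # rows: sex, cols: fta
print(joint_table)

# Summing down the rows (over sex) gives the marginal distribution of fta
marginal_fta = joint_table.sum(axis=0)

# Summing across the columns (over fta) gives the marginal distribution of sex
marginal_sex = joint_table.sum(axis=1)

# The marginal computed directly from the column should agree
direct = df["fta"].value_counts(normalize=True)
print(marginal_fta, direct)
```

Each marginal sums to 1, and the one recovered from the joint table matches the one computed directly from the column.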


## Conditional probabilities (one variable given another)

Example question:
- “Among first-class passengers, what fraction survived?”

That is: probability `survived == 1` **given** `pclass == 1`.

We’ll do it three ways. The goal is to recognize that these are all the same idea.

### Method 1: filter first (most intuitive)


In [None]:
titanic[titanic.pclass == 1].survived.mean()


### Method 2: numerator / denominator using probabilities

- numerator: probability of (pclass==1 AND survived==1)
- denominator: probability of (pclass==1)


In [None]:
numerator = ((titanic.pclass == 1) & (titanic.survived == 1)).mean()
denominator = (titanic.pclass == 1).mean()
numerator / denominator


### Method 3: raw counts

Same idea, but using counts instead of probabilities.


In [None]:
numerator_ct = ((titanic.pclass == 1) & (titanic.survived == 1)).sum()
denominator_ct = (titanic.pclass == 1).sum()
numerator_ct / denominator_ct
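
To convince yourself that the three methods always agree, it helps to run them side by side on data you can count by hand. The sketch below uses a rebuilt copy of the tiny table from Part 1 and asks a different question than the Titanic one: the probability of `fta == 1` given `sex == 'M'`. The variable names are illustrative.

```python
import pandas as pd

# Rebuild the two relevant columns of the tiny table
df = pd.DataFrame({
    "sex": ["F", "F", "M", "M"],
    "fta": [0, 0, 0, 1],
})

is_m = df["sex"] == "M"      # the condition we are "given"
failed = df["fta"] == 1      # the event of interest

# Method 1: filter first, then take the mean
method1 = df[is_m]["fta"].mean()

# Method 2: P(M and fta) / P(M)
method2 = (is_m & failed).mean() / is_m.mean()

# Method 3: count(M and fta) / count(M)
method3 = (is_m & failed).sum() / is_m.sum()

print(method1, method2, method3)
```

Methods 2 and 3 differ only in that Method 2 divides both the numerator and the denominator by the number of rows, which cancels out.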


# Lab Task

Try these without looking back too much. The goal is to recognize patterns:
- **make indicators** as new columns
- **compute marginals** with `value_counts(normalize=True)` or `.mean()`
- **compute joints** with multi-column `value_counts(normalize=True)`
- **compute conditionals** by filtering or using numerator/denominator

1) Use `.assign(...)` to create a column called `is_male` that is True if `sex == 'male'` and False otherwise.


2) Use direct assignment to create a column called `age_over_50` that is True if `age > 50` and False otherwise.


3) Compute the joint distribution of `is_male` and `survived`.
(That is: the probability of each (is_male, survived) combination.)


4) Now compute that same joint distribution **only among first-class passengers** (`pclass == 1`).


5) Using the “filter first” idea, compute the probability that a passenger survived **given** the passenger is male.


6) Using the raw-count method, compute the probability that a passenger survived **given** `pclass == 2`.


7) Using a numerator/denominator approach, compute the probability that a passenger survived **given** the passenger is male **and** in first class.
