# INST414 — Lab 1: Tabular Data & Pandas

**What you’ll do today:** get comfortable thinking in *tables* (rows/columns) and using a small set of Pandas operations you’ll use all semester.

## Learning goals
By the end, you should be able to:
- Explain **observation vs variable vs value** (and what “tidy data” means).
- Load a CSV into a Pandas **DataFrame**.
- Inspect a dataset quickly (`.head()`, `.shape`, `.columns`).
- Summarize a column (`.value_counts()`, `.unique()`, `.nunique()`, `.mean()`).
- Filter rows with logical conditions (e.g., “only 2019”, “only survivors”, “pclass==2 AND survived”).

## How to work in this notebook
- Run cells top-to-bottom. If a cell raises an error, fix it, then re-run from the first cell after the last one that ran successfully.
- When you see **Checkpoint** prompts, pause and try before scrolling.
- If your output doesn’t match the expected *type* (Series vs DataFrame), check your brackets.

## Vocabulary (we’ll use these words precisely)
- **Observation (row):** one unit (a person / case / day / county).
- **Variable (column):** an attribute measured for every observation.
- **Value (cell):** one measurement for one variable on one observation.
- **DataFrame:** a table of data (many columns).
- **Series:** a single column of a DataFrame.
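
To make the vocabulary concrete, here is a minimal sketch on a made-up table. The names `toy`, `age_column`, and `one_value`, and the three passengers, are invented for illustration and are not part of the Titanic data.

```python
import pandas as pd

# Hypothetical toy table: each row is one passenger (one observation).
toy = pd.DataFrame({
    "name": ["Ann", "Ben", "Cal"],
    "age": [29, 41, 7],
    "survived": [1, 0, 1],
})

n_observations, n_variables = toy.shape   # (rows, columns)
age_column = toy["age"]                   # one variable -> a Series
one_value = age_column[1]                 # one value: the age in the row labeled 1 (Ben)

print(n_observations, n_variables)
print(type(age_column).__name__)
print(one_value)
```

Each row describes one passenger, each column is one attribute, and `one_value` is a single cell.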


## Common issues (quick fixes)

- **`NameError: name 'pd' is not defined`** → you didn’t run the import cell.
- **`KeyError: 'colname'`** → the column name is misspelled or has different capitalization; check `df.columns`.
- **`TypeError` when combining conditions** → use parentheses and `&` / `|`:
  - ✅ `(df['a'] > 1) & (df['b'] == 'x')`
  - ❌ `df['a'] > 1 & df['b'] == 'x'`
- **Seeing `...` instead of full columns** → we set display options near the top, but you can always use `df.head()` or `df[['col1','col2']].head()`.


# Import modules

We start by importing **Pandas** and setting a few display options so tables are easier to read in Colab.


**Before you start:** Click **File → Save a copy in Drive** so you have your own version of this notebook. If you skip this step, your work will not be saved.

In [None]:
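# first thing is to import pandas
import pandas as pd

# show every column, and up to 100 rows, when displaying a DataFrame
pd.options.display.max_columns = None
pd.options.display.max_rows = 100

# display the result of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"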

# `.read_csv()`: Loading data
We'll use the `read_csv` function to load our data.

CSV stands for comma-separated values: in the raw text, each cell in a row of the table is separated from the next by a comma.
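
To see what "comma-separated" means in practice, the sketch below reads a tiny CSV that is written out as a string instead of loaded from a file. The text in `raw_text` is made up for illustration; `io.StringIO` just lets `read_csv` treat a string as if it were a file.

```python
import io
import pandas as pd

# Hypothetical raw CSV text: the first line is the header, commas separate cells.
raw_text = "name,age,survived\nAnn,29,1\nBen,41,0\n"

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
toy = pd.read_csv(io.StringIO(raw_text))
print(toy)
```

The header line becomes the column names, and every following line becomes one row.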

# Titanic Data

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/RMS_Titanic_3.jpg/1280px-RMS_Titanic_3.jpg"
  width="30%" align="right">

Dataset info: https://www.kaggle.com/c/titanic/data

## Columns
These are the subset of columns that we'll care about:
 - pclass - "Passenger class". Has 3 values:
   - `1:` 1st, `2:` 2nd, `3:` 3rd
 - sex - The sex as recorded in this dataset (values are `male`, `female`, or `NA` for missing)
 - survived - An indicator of whether the passenger survived the sinking.
   - `0` - did not survive
   - `1` - survived
 - age - passenger age

In this lab we load a small dataset from a URL. In future labs you’ll also load local files that you upload to Colab.


In [None]:
df = pd.read_csv(filepath_or_buffer='https://zjelveh.github.io/files/titanic.csv')

In [None]:
df

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0000,0.0,0.0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1.0,2.0,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1305,3.0,0.0,"Zabour, Miss. Thamine",female,,1.0,0.0,2665,14.4542,,C,,,
1306,3.0,0.0,"Zakarian, Mr. Mapriededer",male,26.5000,0.0,0.0,2656,7.2250,,C,,304.0,
1307,3.0,0.0,"Zakarian, Mr. Ortin",male,27.0000,0.0,0.0,2670,7.2250,,C,,,
1308,3.0,0.0,"Zimmerman, Mr. Leo",male,29.0000,0.0,0.0,315082,7.8750,,S,,,


# `.shape`: Getting the number of rows and columns
The `shape` attribute of a DataFrame tells us the number of rows and columns.

It is returned in a tuple, which is similar to a list but a [little different](https://www.geeksforgeeks.org/python-difference-between-list-and-tuple/).

**Interpretation:** `(rows, columns)` = `(observations, variables)`.
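
Because `shape` is a tuple, we can unpack it into two names in one line. A small sketch on a made-up table (the name `toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy table with 4 observations and 2 variables.
toy = pd.DataFrame({"pclass": [1, 2, 3, 3], "fare": [80.0, 20.5, 7.25, 8.05]})

n_rows, n_cols = toy.shape    # tuple unpacking: first rows, then columns
print(n_rows, n_cols)
print(len(toy))               # len() of a DataFrame is its number of rows
```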


In [None]:
df.shape

In [None]:
print("This dataframe has", df.shape[0], 'rows and', df.shape[1], 'columns.')

# `.columns`: What do my columns represent?
The `columns` attribute holds the column names of a DataFrame.

Columns are variable names. Getting comfortable reading `df.columns` is the fastest way to debug a `KeyError`.


**Checkpoint:** Pick two columns and say (in plain English) what each one represents. Which are numeric vs categorical?

In [None]:
df.columns

In [None]:
df.columns[1]

`df.columns` is an `Index` object rather than a plain Python list.


In [None]:
type(df.columns)
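
An `Index` can be converted to a list and supports `in`, which is handy when tracking down a `KeyError`. A minimal sketch, assuming a made-up table where one column name is capitalized:

```python
import pandas as pd

# Hypothetical toy table; note the capital F in "Fare".
toy = pd.DataFrame({"pclass": [1, 3], "Fare": [80.0, 7.25]})

column_names = list(toy.columns)   # convert the Index to a plain list
print(column_names)
print("fare" in toy.columns)       # False: the column is spelled "Fare"
print("Fare" in toy.columns)       # True
```

Here `toy['fare']` would raise a `KeyError`, and the membership check shows why.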

## Accessing single columns


In [None]:
df['survived']

In [None]:
df.survived

In [None]:
df[['survived']]

**The type of a column is `Series`. A `DataFrame` is made up of a collection of `Series`.**

In [None]:
type(df['survived'])
type(df.survived)

Note that accessing a column with double brackets returns a DataFrame with one column, not a Series.

In [None]:
type(df[['survived']])

## Accessing multiple columns
To select multiple columns, use double brackets. The result is a DataFrame.

`df[['column1_name_here', 'column2_name_here', ... , 'columnK_name_here'  ]]`

Note that instead of a single bracket `[` we need a double bracket `[[` to access multiple columns.

The code in the next cell passes a tuple instead of a list, so it will raise a `KeyError`.

In [None]:
df[('pclass', 'fare')]

In [None]:
df[['pclass', 'fare']]
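
The "double bracket" is really an outer selection bracket with a Python list inside it. The sketch below, on a made-up table, compares the result types; all variable names here are hypothetical.

```python
import pandas as pd

# Hypothetical toy table.
toy = pd.DataFrame({"pclass": [1, 3], "fare": [80.0, 7.25], "age": [29, 41]})

one_col_series = toy["fare"]          # single brackets, one name -> Series
one_col_frame = toy[["fare"]]         # double brackets, one name -> DataFrame
two_cols = toy[["pclass", "fare"]]    # double brackets, two names -> DataFrame

wanted = ["pclass", "fare"]           # the inner brackets are just a list
same_two_cols = toy[wanted]           # so a list variable works the same way

print(type(one_col_series).__name__)
print(type(one_col_frame).__name__)
print(two_cols.shape)
```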

# `.head()` / `.tail()`: Seeing the first/last set of rows
`head(n=5)` - Show the top n rows in a dataframe

`tail(n=5)` - Show the bottom n rows in a dataframe

[head Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html)

[tail Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html)

Use these constantly as a sanity check: did I load what I think I loaded?


**Checkpoint:** Use `df.head(3)` and identify one row (one observation). What are the variables measured on that observation?

In [None]:
df.head()

In [None]:
# place cursor inside the parentheses and hit shift+tab after
df.head()


In [None]:
# default: show the first 5 rows
df.head()
# show the first 3 rows
df.head(n=3)

# Performing operations on columns
## `.sum() / .mean()`: Adding and averaging a numeric column
If a column holds numbers (type `int` or `float`), we can do standard math on it, such as adding up or averaging its values.

These are your first-pass summaries. Later we’ll do grouped summaries (by category) and visualizations.
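
For a 0/1 indicator column such as `survived`, the sum counts the 1s and the mean is the share of 1s. A minimal sketch with five made-up values:

```python
import pandas as pd

# Hypothetical indicator column: 1 = survived, 0 = did not survive.
survived = pd.Series([1, 0, 0, 1, 0])

n_survivors = survived.sum()    # adding 0s and 1s counts the 1s
share = survived.mean()         # the sum divided by the number of values

print(n_survivors)
print(share)
```

Two of the five values are 1, so the mean is 2 / 5.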


In [None]:
df['survived'].sum()

In [None]:
df['survived'].mean()

In [None]:
df.survived.sum()

In [None]:
df.survived.mean()

## `.value_counts()`: How many of this, how many of that?
Counts how many times each unique value appears in a column.

[Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html)

In [None]:
df.sex.value_counts()

In [None]:
df.pclass.value_counts()

## `normalize=True`
This option for `value_counts` divides each count by the sum of the counts.

In other words, it computes **proportions**, which we can read as empirical **probabilities**.
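
The sketch below checks that claim by hand on eight made-up `pclass` values: dividing the raw counts by their total should match what `normalize=True` returns.

```python
import pandas as pd

# Hypothetical column: four 3s, two 1s, two 2s.
pclass = pd.Series([3, 3, 1, 2, 3, 1, 3, 2])

counts = pclass.value_counts()                 # how many times each value occurs
shares = pclass.value_counts(normalize=True)   # each count divided by the total
by_hand = counts / counts.sum()                # the same calculation done manually

print(counts[3], shares[3])
print(shares.sum())
```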

In [None]:
# show shares (proportions) for the pclass column
df.pclass.value_counts(normalize=True)

In [None]:
# the shares sum to one
df.pclass.value_counts(normalize=True).sum()

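You can also use `value_counts` on more than one variable at a time. With `normalize=True`, this gives the share of rows in each combination of values.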

In [None]:
df[['survived', 'pclass']].value_counts(normalize=True)

# How to access rows using logical conditioning
Above we used the single bracket to access a column

Another use for a single bracket is to select rows that meet a logical condition.

This is the core pattern: build a **Boolean mask** (a True/False Series) and use it inside `df[ ... ]` to keep only matching rows.
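
Here is the mask pattern in two separate steps on a made-up table, so you can look at the mask itself before using it. The names `toy`, `mask`, and `kept` are hypothetical.

```python
import pandas as pd

# Hypothetical toy table.
toy = pd.DataFrame({
    "pclass": [1, 3, 2, 3],
    "fare": [80.0, 7.25, 20.5, 8.05],
})

mask = toy["fare"] > 10    # Series of True/False, one entry per row
kept = toy[mask]           # keeps only the rows where mask is True

print(mask.tolist())       # [True, False, True, False]
print(len(kept))           # 2
print(list(kept.index))    # [0, 2]: the original row labels are preserved
```

Writing `toy[toy["fare"] > 10]` does the same thing in one line.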


**Checkpoint:** Write a condition that keeps only rows where `sex` is `'male'`. Then compute the mean of `survived` for that subset.


## Example 1:
1. First let's remind ourselves what the DataFrame looks like

In [None]:
df

2. Let's see what happens when we test whether the values in the `pclass` column equal `1`

In [None]:
df.pclass==1

Notice that a `Series` was returned.

3. Now let's put that logical condition inside single brackets

We will see that we get back a DataFrame with 323 rows, where the rows all have `pclass==1`

In [None]:
df[df.pclass==1]

## Example 2:
Let's do something similar with the `sex` column

In [None]:
df[df['sex']=='male']

## Across multiple columns
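When you combine two or more logical conditions, wrap each condition in parentheses and join them with `&` (and) or `|` (or). The first cell below leaves out the parentheses and will throw an error; the second cell is the correct version.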


In [None]:
df[df.sex=='male' & df.pclass==1]

In [None]:
df[(df.sex=='male') & (df.pclass==1)]
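
Why the parentheses matter: `&` binds more tightly than `==`, so without them Python tries to evaluate `'male' & df.pclass` first. The sketch below shows this on a made-up table. It catches both `TypeError` and `ValueError` as a precaution, since the exact exception type may depend on the pandas version.

```python
import pandas as pd

# Hypothetical toy table.
toy = pd.DataFrame({
    "sex": ["male", "female", "male", "male"],
    "pclass": [1, 1, 3, 1],
})

# Without parentheses, Python evaluates "male" & toy.pclass first, which fails.
try:
    toy[toy.sex == "male" & toy.pclass == 1]
    raised = False
except (TypeError, ValueError):
    raised = True

# With parentheses, each comparison builds its own mask before they are combined.
both = toy[(toy.sex == "male") & (toy.pclass == 1)]     # AND: both must be True
either = toy[(toy.sex == "male") | (toy.pclass == 1)]   # OR: at least one is True

print(raised)        # True
print(len(both))     # 2
print(len(either))   # 4
```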

## `unique` and `nunique`
`unique` returns the distinct values in a column, and `nunique` returns how many distinct values there are.
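
One detail worth knowing when a column has missing values (as `age` does): `unique` includes the missing value in its result, while `nunique` skips it by default. A minimal sketch on made-up ages:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one repeated value and one missing value.
age = pd.Series([29.0, 41.0, 29.0, np.nan, 7.0])

values = age.unique()                          # distinct values, including nan
n_unique = age.nunique()                       # 3: missing values are not counted
n_unique_with_na = age.nunique(dropna=False)   # 4: count the missing value too

print(values)
print(n_unique, n_unique_with_na)
```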


In [None]:
df.pclass.unique()

In [None]:
df.pclass.nunique()

In [None]:
df[df.pclass==1]['pclass'].nunique()


# Lab Task

Try these without looking back too much. The goal is to recognize patterns:
- **count categories** with `.value_counts()`
- **compute averages** with `.mean()`
- **filter rows** with `df[condition]`

If you get stuck, search upward for the closest example and adapt it.


1. What is the name of the fifth column in `df`?

2. Compute the average (mean) value of the `fare` column.

3. Use `shape` and logical conditioning to figure out how many people had fares greater than $100.


4. Now use `sum` to do the same thing.

5. How many unique fares were paid?

6. Use logical conditioning to compute $P(age>25)$.

7. What is $P(pclass, survived)$?

8. How many people were in pclass 2 and survived?

9. What is the average age of people who survived?
