# Lesson 1.7: Introduction to Pandas

## Workshop Overview
This notebook is designed for a 3-hour interactive session.
- **Section 1:** Data Structures & Modification (Series & DataFrames)
- **Section 2:** Indexing, Selection & Alignment
- **Section 3:** Mapping, Sorting & Ranking

Each section follows a 60-minute cadence: Intuition (10m) -> Demo (15m) -> Exercise (30m) -> Review (5m).

In [None]:
import pandas as pd
import numpy as np

---
## Section 1: Data Structures & Modification (60 min)
**Goal:** Transition from NumPy arrays to labeled tabular data.

Pandas is built on top of NumPy. While NumPy is best suited for homogeneous numerical data, Pandas is designed for tabular or heterogeneous data, where each column can hold a different type.
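
As a rough illustration of the homogeneous/heterogeneous distinction, the sketch below compares how each library stores mixed data. The variable names and sample values are made up for this example.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype: mixing numbers and strings
# coerces every element to a string.
mixed = np.array([1, 2.5, "three"])
print(mixed.dtype)  # a Unicode string dtype

# A DataFrame tracks one dtype per column, so the numeric column stays numeric.
table = pd.DataFrame({"count": [1, 2, 3], "label": ["a", "b", "c"]})
print(table.dtypes)
```

The per-column dtypes are what make a DataFrame suitable for spreadsheet-like data.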

### 1.1 The Series
A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its **index**.

#### [INSTRUCTOR DEMO]
Creating a Series and exploring label-based context.

In [None]:
# The simplest Series
obj = pd.Series([5, 6, -3, 2])
print("Default Integer Index:\n", obj)

# Series with a label index
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
print("\nLabeled Index:\n", obj2)

# Dictionary-like behavior
print("\nIs 'b' in the index?", "b" in obj2)

#### [LEARNER EXERCISE 1.1]
1. Create a Series 'store_inventory' with values [15, 20, 8, 45].
2. Assign the index ['Apples', 'Oranges', 'Bananas', 'Pears'].
3. Increase the count of 'Apples' by 10 using label assignment (e.g., obj['label'] = val).
4. Check if 'Grapes' is in the inventory.

In [None]:
# Your code below:

---
### 1.2 The DataFrame
A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns. It can be thought of as a dict of Series sharing the same index.
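
The "dict of Series" view can be sketched directly. The names `heights`, `names`, and `small` are hypothetical and used only for this illustration.

```python
import pandas as pd

idx = ["x", "y", "z"]
heights = pd.Series([1.0, 2.0, 3.0], index=idx)
names = pd.Series(["a", "b", "c"], index=idx)

# Each dict key becomes a column; the Series share one row index.
small = pd.DataFrame({"height": heights, "name": names})

# Pulling a column back out returns a Series carrying that same index.
col = small["height"]
print(type(col).__name__)             # Series
print(col.index.equals(small.index))  # True
```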

#### [INSTRUCTOR DEMO]
Creating DataFrames and basic inspections.

In [None]:
data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}
frame = pd.DataFrame(data)

# Inspection methods
print("First 5 rows:\n", frame.head())
print("\nShape of data:", frame.shape)
print("\nColumn names:", frame.columns)

# Modification: Adding a column
frame['debt'] = 16.5
print("\nAfter adding 'debt':\n", frame)

#### [LEARNER EXERCISE 1.2]
1. Add a new boolean column 'is_big' which is True if 'pop' is greater than 2.5.
2. Create a new column 'capital' and assign the value "TBC" to all rows.
3. Use the 'del' keyword to remove the 'debt' column.

In [None]:
# Your code below:

---
## Section 2: Indexing, Selection & Alignment (60 min)
**Goal:** Master precision selection using labels and positions.

### 2.1 loc vs iloc
- `loc`: Label-based selection.
- `iloc`: Integer-position based selection.
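
The difference is easiest to see when integer labels do not match positions. The following is a minimal sketch on made-up data, not part of the workshop dataset.

```python
import pandas as pd

# Integer labels deliberately out of positional order (illustrative data).
ser = pd.Series(["first", "second", "third"], index=[2, 0, 1])

print(ser.loc[0])   # label 0    -> "second"
print(ser.iloc[0])  # position 0 -> "first"

# Slices also differ: loc includes the end label, iloc excludes the end position.
letters = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(len(letters.loc["a":"b"]))  # 2
print(len(letters.iloc[0:1]))     # 1
```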

#### [INSTRUCTOR DEMO]

In [None]:
data_selection = pd.DataFrame(np.arange(16).reshape((4, 4)),
                        index=["Ohio", "Colorado", "Utah", "New York"],
                        columns=["one", "two", "three", "four"])

# loc: selecting by row label and column labels
print("Selection using loc (labels):\n", data_selection.loc["Colorado", ["two", "three"]])

# iloc: selecting by integer position
print("\nSelection using iloc (positions):\n", data_selection.iloc[2, [0, 1]])

# Slicing behavior
print("\nSlicing with loc (includes end):\n", data_selection.loc["Ohio":"Utah", "two"])

#### [LEARNER EXERCISE 2.1]
1. Use .loc to select the 'New York' row for all columns.
2. Use .iloc to select the first 2 rows and the middle 2 columns (two and three).
3. Filter the DataFrame to show only rows where column 'three' is greater than 5.

In [None]:
# Your code below:

---
### 2.2 Data Alignment
An essential feature of Pandas is the automatic alignment of data based on labels during arithmetic operations.
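
A small sketch of alignment on two Series, using made-up labels: values are paired by label rather than by position, and labels found in only one operand produce `NaN`.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["c", "a", "d"])

# "a" pairs 1 with 20 and "c" pairs 3 with 10, regardless of position.
# "b" and "d" appear in only one Series, so their results are NaN.
total = s1 + s2
print(total)
```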

#### [INSTRUCTOR DEMO]

In [None]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"), index=["Ohio", "Texas", "Colorado"])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"), index=["Utah", "Ohio", "Texas", "Oregon"])

# Misaligned addition results in NaNs
print("Misaligned Addition:\n", df1 + df2)

# Using fill_value to handle missing labels
print("\nAddition with fill_value=0:\n", df1.add(df2, fill_value=0))

#### [LEARNER EXERCISE 2.2]
1. Multiply df1 and df2.
2. Use the .mul() method with a fill_value of 1 to ensure that any non-overlapping labels don't result in NaN.

In [None]:
# Your code below:

---
## Section 3: Mapping, Sorting & Ranking (60 min)
**Goal:** Transform data using functions and organize it through sorting.

### 3.1 Function Application (apply & applymap)
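NumPy ufuncs (element-wise array functions such as `np.abs`) work directly on Pandas objects, while `apply` runs custom logic along an axis, one column or one row at a time. `applymap` applies a function to every element; it has been deprecated since pandas 2.1 in favor of `DataFrame.map`.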
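
The three styles can be compared side by side. This is a sketch on made-up data. The `elementwise` name is a hypothetical helper that picks `DataFrame.map` when the installed pandas provides it (2.1 and later) and falls back to `applymap` otherwise.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"b": [-1.0, 2.0, -3.0], "d": [4.0, -5.0, 6.0]})

# A NumPy ufunc operates element-wise and returns a DataFrame.
abs_df = np.abs(df)

# apply calls the function once per column by default.
col_span = df.apply(lambda col: col.max() - col.min())
print(col_span)  # b: 5.0, d: 11.0

# Element-wise custom logic: prefer DataFrame.map where available.
elementwise = getattr(df, "map", None) or df.applymap
signs = elementwise(lambda x: "neg" if x < 0 else "pos")
print(signs)
```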

#### [INSTRUCTOR DEMO]

In [None]:
frame_stats = pd.DataFrame(np.random.randn(4, 3), columns=list("bde"), index=["Utah", "Ohio", "Texas", "Oregon"])

# Range function: max - min
f = lambda x: x.max() - x.min()

# Called once per column (the default: axis='index' / 0)
print("Range per column:\n", frame_stats.apply(f))

# Applied to every element (formatting)
# Note: applymap is deprecated since pandas 2.1; on those versions use frame_stats.map(format_fn)
format_fn = lambda x: f"{x:.2f}"
print("\nFormatted Data (applymap):\n", frame_stats.applymap(format_fn))

#### [LEARNER EXERCISE 3.1]
1. Create a function that returns the square of a number.
2. Apply this function to only the 'b' column.
3. Use .apply(axis='columns') to find the mean value across each row.

In [None]:
# Your code below:

---
### 3.2 Sorting and Ranking
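Sorting a dataset by its index labels or by its values is a fundamental operation. Ranking is closely related: it assigns each value its position in the sorted order, starting at 1.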
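
For DataFrames, the two kinds of sort look like this. It is a minimal sketch on illustrative data, and the column names `a` and `b` are placeholders.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [2, 1, 2, 1], "b": [0.5, 3.0, -1.0, 2.0]},
    index=["d", "a", "c", "b"],
)

# Sort rows by their index labels.
by_index = df.sort_index()
print(list(by_index.index))   # ['a', 'b', 'c', 'd']

# Sort rows by column values: first by "a", then by "b" to break ties.
by_values = df.sort_values(["a", "b"])
print(list(by_values.index))  # ['b', 'a', 'c', 'd']
```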

#### [INSTRUCTOR DEMO]

In [None]:
obj_sort = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# Sorting by values (NaNs go to end by default)
print("Sorted Series:\n", obj_sort.sort_values())

# Ranking data
obj_rank = pd.Series([7, -5, 7, 4, 2, 4, 0, 4])
print("\nRanks (Average of ties):\n", obj_rank.rank())

#### [LEARNER EXERCISE 3.2]
1. Create a DataFrame 'final_scores' with columns 'Math' and 'Science' and 4 rows of random numbers.
2. Sort the DataFrame by 'Math' in descending order.
3. Rank the 'Science' scores using the method='first' parameter to break ties.

In [None]:
# Your code below:

---
## Workshop Summary
1. **Labels provide context:** Always use meaningful indices.
2. **Precision Selection:** Favor `.loc` and `.iloc` over `[]` for clarity.
3. **Alignment is Safety:** Arithmetic matches values by label, not position, so data is never combined with the wrong row or column; labels present in only one operand yield NaN.