# Basic Functionality

## pandas Data Analysis

## Prerequisites & Outcomes

**Prerequisites:**
- pandas Introduction

**Outcomes:**
- Be familiar with `datetime`
- Use built-in aggregation functions and create custom ones
- Use built-in Series transformation functions
- Use built-in scalar transformation functions
- Select subsets using boolean selection
- Apply the "want operator"

## Data Source

US state unemployment data from the Bureau of Labor Statistics (BLS)

## Setup

In [3]:
import pandas as pd

%matplotlib inline

pd.__version__

'2.3.3'

## Loading State Unemployment Data

In [4]:
# Load up the data -- this will take a couple seconds
url = "https://datascience.quantecon.org/assets/data/state_unemployment.csv"
unemp_raw = pd.read_csv(url, parse_dates=["Date"])



Note: `parse_dates=["Date"]` tells pandas to parse the `Date` column into datetime values (a `datetime64` dtype) instead of leaving it as strings
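As a minimal sketch of what `parse_dates` changes, the snippet below reads the same small CSV text twice, once without and once with the argument. The CSV text is made up for illustration and is not the BLS file.

```python
import io

import pandas as pd

# Made-up CSV text standing in for the real file (illustration only)
csv_text = (
    "Date,state,UnemploymentRate\n"
    "2000-01-01,Alabama,4.7\n"
    "2000-02-01,Alabama,4.7\n"
)

# Without parse_dates, the Date column is read as plain text
raw = pd.read_csv(io.StringIO(csv_text))
print(raw["Date"].dtype)

# With parse_dates, pandas converts the column to datetime64 values
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
print(parsed["Date"].dtype)
```

Having real datetime values is what later allows date-based indexing and slicing.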

## Examining the Raw Data

In [8]:
unemp_raw.head()



| Unnamed: 0 | Date | state | LaborForce | UnemploymentRate |
|---|---|---|---|---|
| 0 | 2000-01-01 | Alabama | 2142945.0 | 4.7 |
| 1 | 2000-01-01 | Alaska | 319059.0 | 6.3 |
| 2 | 2000-01-01 | Arizona | 2499980.0 | 4.1 |
| 3 | 2000-01-01 | Arkansas | 1264619.0 | 4.4 |
| 4 | 2000-01-01 | California | 16680246.0 | 5.0 |


Each row contains: date, state, labor force size, and unemployment rate

## Transforming the Data

We want to look at unemployment rates across different states over time.

This requires a pivot table transformation:

In [None]:
# Don't worry about the details here quite yet
unemp_all = (
    unemp_raw
    .reset_index()
    .pivot_table(index="Date", columns="state", values="UnemploymentRate")
)
unemp_all.head()
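For readers who want a preview, here is a small sketch of the long-to-wide reshaping that `pivot_table` performs. The three-column frame is made-up data that only mimics the layout of `unemp_raw`.

```python
import pandas as pd

# Made-up long-format data: one row per (date, state) pair
long_df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2000-01-01", "2000-01-01", "2000-02-01", "2000-02-01"]
    ),
    "state": ["Alaska", "Texas", "Alaska", "Texas"],
    "UnemploymentRate": [6.3, 4.6, 6.3, 4.6],
})

# Wide format: one row per date, one column per state
wide_df = long_df.pivot_table(
    index="Date", columns="state", values="UnemploymentRate"
)
print(wide_df)
```

Each unique value of `state` becomes a column and each unique `Date` becomes a row label.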

## Filtering to Selected States

In [None]:
states = [
    "Arizona", "California", "Florida", "Illinois",
    "Michigan", "New York", "Texas"
]
unemp = unemp_all[states]
unemp.head()

## Plotting the Data

In [None]:
unemp.plot(figsize=(8, 6))

## Dates in pandas

The index displays dates in a readable format (YYYY-MM-DD) because it is a `DatetimeIndex`, whose dtype is `datetime64`

In [None]:
unemp.index

## Indexing with Dates

We can use string representations of dates to index:

In [None]:
# Data corresponding to a single date
unemp.loc["01/01/2000", :]

In [None]:
# Data for all dates from New Year's Day through June 1, 2000 (both endpoints included)
unemp.loc["01/01/2000":"06/01/2000", :]
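The behavior of string-based date indexing can be checked on a small example. The sketch below uses a made-up monthly frame named `toy` (not the BLS data) to show that different date formats select the same row, that date slices include both endpoints, and that a partial string such as a year selects a whole period.

```python
import pandas as pd

# Made-up monthly data (illustration only)
dates = pd.date_range("2000-01-01", periods=12, freq="MS")
toy = pd.DataFrame({"Texas": range(12)}, index=dates)

# Different string formats resolve to the same timestamp
same_row = toy.loc["2000-03-01"].equals(toy.loc["03/01/2000"])
print(same_row)

# Date slices include both endpoints, unlike ordinary Python slices
first_half = toy.loc["01/01/2000":"06/01/2000"]
print(len(first_half))

# A partial string such as a year selects every row in that period
year_2000 = toy.loc["2000"]
print(len(year_2000))
```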

## DataFrame Aggregations

**Aggregation**: An operation that combines multiple values into a single value

Example: Computing the mean of [0, 1, 2] returns 1

## Built-in Aggregations

pandas has many built-in aggregation functions:
- Mean (`mean`)
- Variance (`var`)
- Standard deviation (`std`)
- Minimum (`min`)
- Median (`median`)
- Maximum (`max`)
- etc...

## Aggregation Examples

In [None]:
# Default: aggregate each column
unemp.mean()

In [None]:
# Use axis=1 to aggregate by row
unemp.var(axis=1).head()
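To make the role of `axis` concrete, here is a minimal sketch on a made-up two-state frame named `toy`. The numbers are invented so that the results are easy to check by hand.

```python
import pandas as pd

# Made-up rates for two states over three months (illustration only)
toy = pd.DataFrame(
    {"Texas": [4.0, 5.0, 6.0], "Florida": [3.0, 5.0, 7.0]},
    index=pd.date_range("2000-01-01", periods=3, freq="MS"),
)

# Default (axis=0) collapses the rows: one value per column (state)
by_state = toy.mean()
print(by_state)

# axis=1 collapses the columns: one value per row (date)
by_date = toy.mean(axis=1)
print(by_date)
```

The result of the default call is indexed by state, while the `axis=1` result is indexed by date.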

## Writing Custom Aggregations

Two steps:
1. Write a Python function that takes a Series as input and outputs a single value
2. Call the `agg` method with the new function as an argument

## Custom Aggregation Example

Classify states as "high" or "low" unemployment based on whether their mean is above or below 6.5

In [None]:
# Step 1: Write the aggregation function
def high_or_low(s):
    """
    This function takes a pandas Series object and returns "High"
    if the mean is at least 6.5 and "Low" if the mean is below 6.5
    """
    if s.mean() < 6.5:
        out = "Low"
    else:
        out = "High"
    
    return out

In [None]:
# Step 2: Apply it via the agg method
unemp.agg(high_or_low)

## Multiple Aggregations

`agg` can accept multiple functions at once:

In [None]:
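# Pass built-in aggregations by name; recent pandas versions warn when given Python's `min`/`max`
unemp.agg(["min", "max", high_or_low])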
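The shape of the result is easier to see on a small example. The sketch below mixes built-in aggregations, referenced by name, with a custom one. Both the `toy` frame and the `value_range` function are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Made-up data (illustration only)
toy = pd.DataFrame({"Texas": [4.0, 5.0, 9.0], "Florida": [3.0, 5.0, 7.0]})


def value_range(s):
    """Hypothetical custom aggregation: largest value minus smallest value."""
    return s.max() - s.min()


# One row per aggregation, one column per state
summary = toy.agg(["min", "max", value_range])
print(summary)
```

The custom function's name becomes its row label in the result.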

## Transforms

Many operations produce a new Series rather than a single value.

Examples:
- Compute percentage change in unemployment from month to month
- Calculate cumulative sum of elements in each column

## Built-in Transforms

pandas includes many transform functions:
- Cumulative sum/max/min/product (`cumsum`, `cummax`, `cummin`, `cumprod`)
- Difference (`diff`)
- Elementwise operations (`+`, `-`, `*`, `/`)
- Percent change (`pct_change`)
- Number of occurrences (`value_counts`)
- Absolute value (`abs`)

## Transform Examples

In [None]:
unemp.head()

In [None]:
unemp.pct_change(fill_method=None).head()

In [None]:
unemp.diff().head()
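As a rough check of what these transforms compute, the sketch below applies them to a made-up three-element Series where the arithmetic can be done by hand.

```python
import pandas as pd

# Made-up monthly values (illustration only)
s = pd.Series(
    [4.0, 5.0, 4.0],
    index=pd.date_range("2000-01-01", periods=3, freq="MS"),
)

# diff: current value minus the previous value (first entry is NaN)
changes = s.diff()

# pct_change: the difference divided by the previous value
growth = s.pct_change(fill_method=None)

# cumsum: running total of the values so far
running = s.cumsum()

print(pd.DataFrame({"value": s, "diff": changes, "pct": growth, "cumsum": running}))
```

Each result has the same index as the input, which is what distinguishes a transform from an aggregation.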

## Transform Categories

1. **Series transforms**: Functions that take one Series and produce another Series (index can change)
2. **Scalar transforms**: Functions that take a single value and produce a single value (e.g., `abs`)

## Custom Series Transforms

Two steps:
1. Write a Python function that takes a Series and outputs a new Series
2. Pass the function to the `apply` method (or `transform`)

## Example: Standardizing Data

Transform unemployment data to have mean 0 and standard deviation 1

In [None]:
# Step 1: Write the Series transform function
def standardize_data(x):
    """
    Changes the data in a Series to become mean 0 with standard deviation 1
    """
    mu = x.mean()
    std = x.std()
    
    return (x - mu)/std

In [None]:
# Step 2: Apply via the apply method
std_unemp = unemp.apply(standardize_data)
std_unemp.head()
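One way to convince yourself that the transform works is to check the column means and standard deviations afterwards. The sketch below does this on a made-up frame named `toy`. It redefines `standardize_data` so that the snippet is self-contained.

```python
import pandas as pd

# Made-up data (illustration only)
toy = pd.DataFrame({"Texas": [4.0, 5.0, 9.0], "Florida": [3.0, 5.0, 7.0]})


def standardize_data(x):
    """Shift a Series to mean 0 and scale it to standard deviation 1."""
    return (x - x.mean()) / x.std()


std_toy = toy.apply(standardize_data)

# Up to floating-point error, each column now has mean 0 and std 1
print(std_toy.mean().round(10))
print(std_toy.std().round(10))
```

Note that pandas computes `std` as the sample standard deviation (`ddof=1`) by default, so "standard deviation 1" here is in that sense.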

## Finding Extreme Values

In [None]:
# Take absolute value
abs_std_unemp = std_unemp.abs()
abs_std_unemp.head()

In [None]:
# Find date when unemployment was "most different from normal" for each state
def idxmax(x):
    return x.idxmax()

abs_std_unemp.agg(idxmax)
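The same result can be obtained by calling the built-in `idxmax` method on the DataFrame directly. Here is a small sketch on made-up data, assuming a frame named `toy` that plays the role of `abs_std_unemp`.

```python
import pandas as pd

# Made-up absolute standardized values (illustration only)
toy = pd.DataFrame(
    {"Texas": [0.2, 1.5, 0.7], "Florida": [1.1, 0.3, 0.9]},
    index=pd.date_range("2000-01-01", periods=3, freq="MS"),
)

# Index label (here, a date) at which each column reaches its maximum
peak_dates = toy.idxmax()
print(peak_dates)

# The maximum values themselves, for comparison
peak_values = toy.max()
print(peak_values)
```

`idxmax` returns the label of the maximum rather than the maximum itself, which is why it answers a "when" question.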

## Custom Scalar Transforms

Two steps:
1. Define a function that takes a scalar and produces a scalar
2. Pass this function to the `map` method (`Series.map`, or `DataFrame.map` in pandas 2.1 and later, where it replaces `applymap`)
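The two steps above can be sketched as follows. The function `round_to_half` and the frame `toy` are hypothetical and serve only as an illustration. The fallback to `applymap` is an assumption about running on pandas versions older than 2.1.

```python
import pandas as pd

# Made-up data (illustration only)
toy = pd.DataFrame({"Texas": [4.7, 5.2, 9.1], "Florida": [3.3, 5.0, 7.6]})


# Step 1: a function from one scalar to one scalar
def round_to_half(x):
    """Hypothetical scalar transform: round to the nearest 0.5."""
    return round(x * 2) / 2


# Step 2a: apply it to every element of a Series
texas_rounded = toy["Texas"].map(round_to_half)

# Step 2b: apply it to every element of a DataFrame
if hasattr(toy, "map"):
    toy_rounded = toy.map(round_to_half)       # pandas 2.1 and later
else:
    toy_rounded = toy.applymap(round_to_half)  # older pandas versions

print(toy_rounded)
```

The function never sees a Series or DataFrame. pandas calls it once per element and assembles the results into an object with the same shape.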

## Boolean Selection

We can select data based on conditions met by the data itself.

Examples:
- Individuals older than 18
- Data from particular time periods
- Data during a recession
- Specific product or customer IDs

## Boolean Selection Examples

In [None]:
unemp_small = unemp.head()
unemp_small

In [None]:
# List of booleans selects rows
unemp_small.loc[[True, True, True, False, False]]

## Creating Boolean Series

Use conditional statements to construct Series of booleans:

In [None]:
unemp_small["Texas"] < 4.5

In [None]:
# Use boolean Series to extract rows
unemp_small.loc[unemp_small["Texas"] < 4.5]

## Comparing Columns

In [None]:
unemp_small["New York"] > unemp_small["Texas"]

In [None]:
big_NY = unemp_small["New York"] > unemp_small["Texas"]
unemp_small.loc[big_NY]

## Multiple Conditions

Instead of `and` and `or`, use:
- `(bool_series1) & (bool_series2)` for AND
- `(bool_series1) | (bool_series2)` for OR

In [None]:
small_NYTX = (unemp_small["Texas"] < 4.7) & (unemp_small["New York"] < 4.7)
small_NYTX

In [None]:
unemp_small[small_NYTX]
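The following sketch, on a made-up frame named `toy`, shows the elementwise operators side by side and what happens when Python's `and` is used instead. The `~` operator for NOT is not mentioned above but belongs to the same family.

```python
import pandas as pd

# Made-up data (illustration only)
toy = pd.DataFrame({"Texas": [4.0, 5.0, 9.0], "New York": [4.5, 4.2, 8.0]})

low_tx = toy["Texas"] < 4.7
low_ny = toy["New York"] < 4.7

both_low = low_tx & low_ny     # elementwise AND
either_low = low_tx | low_ny   # elementwise OR
not_low_tx = ~low_tx           # elementwise NOT

# Python's `and` needs one truth value for the whole Series, which is ambiguous
try:
    low_tx and low_ny
except ValueError as err:
    print("ValueError:", err)

print(toy.loc[both_low])
```

The parentheses around each comparison matter when conditions are written inline, because `&` and `|` bind more tightly than `<` and `>`.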

## The `isin` Method

Check if values match any of several fixed values:

In [None]:
unemp_small["Michigan"].isin([3.3, 3.2])

In [None]:
# Select full rows where this Series is True
unemp_small.loc[unemp_small["Michigan"].isin([3.3, 3.2])]

## `.any` and `.all` Methods

- `.any()`: Returns True if at least one value is True
- `.all()`: Returns True only when all values are True
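On a DataFrame, both methods work column by column unless `axis=1` is passed. A minimal sketch on a made-up boolean frame named `high`:

```python
import pandas as pd

# Made-up boolean data: was unemployment "high" in each state and month?
high = pd.DataFrame(
    {"Texas": [True, False, False], "Florida": [True, True, False]},
    index=pd.date_range("2000-01-01", periods=3, freq="MS"),
)

# Default: one answer per column (state)
any_by_state = high.any()
all_by_state = high.all()

# axis=1: one answer per row (date)
any_by_date = high.any(axis=1)
all_by_date = high.all(axis=1)

print(all_by_date)

# True counts as 1, so summing a boolean Series counts the True values
print(all_by_date.sum())
```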

## The "Want Operator"

A concept from Nobel Laureate Tom Sargent for clear analysis:

1. State the goal: **Want:** [clear objective]
2. Work backwards to identify necessary steps
3. Execute the plan

## Example: High Unemployment Months

**Want:** Count months where all states had unemployment above 6.5%

**Plan:**
1. Sum True values in a Series indicating dates with all high unemployment
2. Build the Series using `.all` on a boolean DataFrame
3. Build the DataFrame using `>` comparison

## Executing the Plan

In [None]:
# Step 3: construct the DataFrame of bools
high = unemp > 6.5
high.head()

In [None]:
# Step 2: use .all method on axis=1
all_high = high.all(axis=1)
all_high.head()

In [None]:
# Step 1: Call .sum to count True values
msg = "Out of {} months, {} had high unemployment across all states"
print(msg.format(len(all_high), all_high.sum()))

## Exercises

Practice exercises are included in the notebook cells below

## Exercise 1

- What is the minimum unemployment rate at each date across all states?
- What was the median unemployment rate in each state?
- What was the maximum unemployment rate? In which state and month?
- Classify each state as high or low volatility (variance above/below 4)

In [83]:
# min unemployment rate by date

In [84]:
# median unemployment rate by state


In [100]:
# max unemployment rate across all states and months


In [None]:
# low or high volatility


## Exercise 3

Classify unemployment as high (> 6.5), medium (4.5 < x <= 6.5), or low (<= 4.5)

1. Write a function to classify a single number
2. Pass to `map` and save as `unemp_bins`
3. Count occurrences of each classification per state and create a bar chart
4. Count how many states had each classification in each month

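| state | Arizona | California | Florida | Illinois | Michigan | New York | Texas |
|---|---|---|---|---|---|---|---|
| high | 75 | 106 | 68 | 91 | 142 | 65 | 51 |
| low | 44 | 4 | 69 | 19 | 17 | 22 | 58 |
| medium | 97 | 106 | 79 | 106 | 57 | 129 | 107 |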


In [None]:
# Part 2: Pass to map
# unemp_bins = unemp.map(...)

In [None]:
# Part 3: Count occurrences and create bar chart

In [None]:
# Part 4: Count by date instead of state

## Exercise 4

- For one state, determine mean unemployment during "Low", "Medium", and "High" times
- Which states perform best during "bad times" (when mean unemployment > 7)?

In [None]:
# Your code here

## Summary

Key concepts covered:
- Working with datetime indices
- Built-in and custom aggregations
- Built-in and custom transforms
- Boolean selection
- The "want operator" for structured analysis