
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zjelveh/zjelveh.github.io/blob/master/files/cfc/6_groupby.ipynb)

**IMPORTANT**: Save your own copy!
1. Click File → Save a copy in Drive
2. Rename it
3. Work in YOUR copy, not the original


---

# 6. Groupby and Aggregation - Finding Crime Patterns Across States and Years
## CCJS 418E: Coding for Criminology

Today's Goals:
- Understand when filtering isn't enough and you need groupby
- Master the "split-apply-combine" mental model for groupby operations
- Use groupby to find patterns across states and years in crime data
- Combine filtering with groupby for complex comparisons
- Calculate multiple statistics at once with .agg()
- Apply groupby to answer real criminological policy questions

Note: This builds directly on filtering from last class. We're adding a powerful new tool!

## Part 1: The Problem We Can't Answer Yet

### When Filtering Isn't Enough

Imagine you're working for the Department of Justice. They need to know:
- Which 5 states have the highest average violent crime rates over the past decade?
- Are crime rates increasing or decreasing in each state?
- Which regions of the country have different crime patterns?

We know how to filter for specific states or years, but how do we calculate statistics for EACH state or EACH year separately? This is where **groupby** becomes essential.

**Connection to Computational Thinking: PATTERN RECOGNITION - finding patterns within categories of data**

Let's start by loading our familiar state crime data and seeing why filtering alone isn't sufficient.

In [None]:
# Import pandas as always
import pandas as pd

pd.options.display.max_columns=None

# Load the state crime data we've been using
url = 'https://raw.githubusercontent.com/zjelveh/zjelveh.github.io/refs/heads/master/files/cfc/state_crime.csv'
df = pd.read_csv(filepath_or_buffer=url)

# Clean column names like before - making them easier to work with
# (raw strings with an escaped dot so '.' matches a literal period, not any character)
df.columns = df.columns.str.replace(r'^Data\.', '', regex=True)
df.columns = df.columns.str.replace(r'\.', '_', regex=True).str.lower()
# Drop the 'rates_' prefix so rate columns have short names like 'violent_all'
df.columns = df.columns.str.replace('rates_', '')

print("Dataset loaded!")
print(f"We have {len(df)} records covering {df['year'].min()} to {df['year'].max()}")
print(f"Number of states: {df['state'].nunique()}")


In [None]:
df

# Columns that start with "totals" are the counts for each state and year
# Columns that DON'T start with "totals" are the rate columns

In [None]:
# Here's what we CAN do with filtering:
# We can look at one state at a time
maryland_crimes = df[df['state'] == 'Maryland']
maryland_avg = maryland_crimes['violent_all'].mean()
print(f"Maryland's average violent crime rate: {maryland_avg:.1f}")


In [None]:
virginia_crimes = df[df['state'] == 'Virginia']
virginia_avg = virginia_crimes['violent_all'].mean()
print(f"Virginia's average violent crime rate: {virginia_avg:.1f}")


### The Tedious Alternative

Without groupby, to get every state's average, we'd need to:
1. Filter for each state individually (50+ times!)
2. Calculate the average for each filtered dataset
3. Manually combine all the results

This would take hundreds of lines of repetitive code. Instead, groupby can do this in a single line! Let's learn how.
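
To make the contrast concrete, here is a minimal sketch of both approaches side by side. It uses a tiny made-up table (the states `A`, `B`, `C` and their numbers are invented for illustration, not taken from `state_crime.csv`) so you can check the results by hand.

```python
import pandas as pd

# Toy data with made-up numbers (NOT from state_crime.csv)
toy = pd.DataFrame({
    'state': ['A', 'A', 'B', 'B', 'C', 'C'],
    'violent_all': [410.0, 430.0, 200.0, 220.0, 310.0, 290.0],
})

# The tedious way: filter once per state and collect the results by hand
manual = {}
for state in toy['state'].unique():
    manual[state] = toy[toy['state'] == state]['violent_all'].mean()
manual = pd.Series(manual)

# The groupby way: one line does the same work
grouped = toy.groupby(by='state')['violent_all'].mean()

print(manual)
print(grouped)
```

Both versions should print the same three averages. The loop is manageable for three states, but it grows with every new question you ask, while the groupby line stays the same length.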

## Part 2: The Groupby Mental Model

### Understanding "Split-Apply-Combine"

Groupby works in three steps that we call "split-apply-combine":
1. **SPLIT** the data into groups (like splitting a deck of cards by suit)
2. **APPLY** a calculation to each group (count the cards in each suit)
3. **COMBINE** the results back together (report how many of each suit)

Let's see this with a simplified example first, then apply it to our crime data.

In [None]:
# Create a small example dataset to visualize the concept
# Just 3 states, 2 years each 
demo_df = df[(df['state'].isin(['Maryland', 'Virginia', 'Delaware'])) & 
             (df['year'].isin([2018, 2019]))]

print("Our mini demonstration dataset:")
print(demo_df[['state', 'year', 'violent_all']].sort_values(by=['state', 'year']))

In [None]:
# Now let's see the three steps in action:

# STEP 1: SPLIT - Python internally separates the data by state
# (We don't see this happening, but imagine three separate piles)

# STEP 2: APPLY - Calculate mean for each state's pile
# Maryland: (454.0 + 453.6) / 2
# Virginia: (200.0 + 208.1) / 2  
# Delaware: (423.6 + 422.6) / 2

# STEP 3: COMBINE - Put results into a new structure
avg_violent = demo_df.groupby(by='state')['violent_all'].mean()

print("Average violent crime rate by state (2018-2019):")
print(avg_violent)
print("\nNotice: We got one result per state, not per row!")
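
If you want to actually see the "piles" from the SPLIT step, one option is to loop over the groupby object. The sketch below does this on a small made-up table (invented states and numbers) and then performs the APPLY and COMBINE steps by hand. You would not normally write it this way; it is only meant to show what `.groupby(...).mean()` is doing behind the scenes.

```python
import pandas as pd

# Toy data with made-up numbers (NOT from state_crime.csv)
toy = pd.DataFrame({
    'state': ['A', 'A', 'B', 'B'],
    'year': [2018, 2019, 2018, 2019],
    'violent_all': [410.0, 430.0, 200.0, 220.0],
})

pieces = {}
# SPLIT: looping over a groupby object yields (group name, sub-DataFrame) pairs
for state_name, pile in toy.groupby(by='state'):
    print(f"--- pile for {state_name} ---")
    print(pile)
    # APPLY: compute the mean for this pile only
    pieces[state_name] = pile['violent_all'].mean()

# COMBINE: gather the per-pile results into one Series
combined = pd.Series(pieces, name='violent_all')
print(combined)
```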

### Quick Check 1

Look at the output above. If a state had violent crime rates of 400 in 2018 and 500 in 2019, what would its average be? Think about it before moving on!

## Part 3: Core Groupby Operations

### The Essential Methods: size(), mean(), sum(), max(), min()

Different methods answer different questions. Let's explore each one using our full dataset, focusing on recent years for cleaner analysis.

In [None]:
# Focus on recent decade for most examples
recent_df = df[df['year'] >= 2010]
print(f"Working with {len(recent_df)} records from 2010 onwards")

In [None]:
# METHOD 1: .size() - How many records in each group?
# This counts rows, regardless of what's in them
records_per_state = recent_df.groupby(by='state').size()

print("Number of records per state:")
print(records_per_state.head(n=5))
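
A related method you may run into is `.count()`. As a rough sketch on made-up data (one value is deliberately missing), `.size()` counts every row in the group, while `.count()` on a column counts only the non-missing values in that column:

```python
import numpy as np
import pandas as pd

# Toy data with made-up numbers; one rate is deliberately missing
toy = pd.DataFrame({
    'state': ['A', 'A', 'B', 'B'],
    'violent_all': [410.0, np.nan, 200.0, 220.0],
})

# Counts rows per group, missing values included
rows = toy.groupby(by='state').size()

# Counts only non-missing values in the chosen column
non_missing = toy.groupby(by='state')['violent_all'].count()

print(rows)
print(non_missing)
```

If the two results differ for a group, that group has missing data in the column you picked.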


In [None]:
# METHOD 2: .mean() - Calculate averages for each group
# This is perfect for finding typical crime rates
avg_property_by_state = recent_df.groupby(by='state')['property_all'].mean()

print("Average property crime rate by state (2010-2019):")
print(avg_property_by_state.nlargest(n=5))
print("\nThese states have the highest property crime on average")

In [None]:
# METHOD 3: .sum() - Add up values in each group
# Use this with TOTALS not RATES (rates don't add meaningfully)
total_violent_by_state = recent_df.groupby(by='state')['totals_violent_all'].sum()

print("Total violent crimes by state (2010-2019 combined):")
print(total_violent_by_state.nlargest(n=5))
print("\nNote: Larger states have more total crimes (population effect)")
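
Why don't rates add meaningfully? The sketch below uses two invented states with very different populations. The column names `population` and `count` and all the numbers are hypothetical, chosen only to make the arithmetic easy to follow. Adding the two rates gives a number that describes nobody, while adding the counts and populations first gives a sensible combined rate.

```python
import pandas as pd

# Made-up numbers; 'population' and 'count' are hypothetical column names
toy = pd.DataFrame({
    'region': ['X', 'X'],
    'state': ['Big', 'Small'],
    'population': [10_000_000, 1_000_000],
    'count': [50_000, 1_000],
})

# Rate per 100,000 residents for each state
toy['rate'] = toy['count'] / toy['population'] * 100_000

# Summing rates: about 500 + 100 = 600, which is not a real rate for region X
summed_rates = toy.groupby(by='region')['rate'].sum()

# Summing counts and populations first, then computing the rate once
totals = toy.groupby(by='region')[['count', 'population']].sum()
pooled_rate = totals['count'] / totals['population'] * 100_000

print(summed_rates)
print(pooled_rate)
```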

In [None]:
# METHOD 4: .max() and .min() - Find extremes in each group
# Useful for finding best/worst years
worst_year_by_state = recent_df.groupby(by='state')['violent_all'].max()
best_year_by_state = recent_df.groupby(by='state')['violent_all'].min()

print("Each state's WORST violent crime rate (2010-2019):")
print(worst_year_by_state.nlargest(n=5))

print("\nEach state's BEST violent crime rate (2010-2019):")
print(best_year_by_state.nsmallest(n=5))
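
One of today's goals is calculating multiple statistics at once with `.agg()`. Here is a minimal sketch on made-up data (invented states and numbers): instead of calling `.mean()`, `.min()`, and `.max()` in three separate lines, you pass a list of statistic names and get one column per statistic.

```python
import pandas as pd

# Toy data with made-up numbers (NOT from state_crime.csv)
toy = pd.DataFrame({
    'state': ['A', 'A', 'A', 'B', 'B', 'B'],
    'violent_all': [410.0, 430.0, 420.0, 200.0, 220.0, 240.0],
})

# One row per state, one column per statistic
summary = toy.groupby(by='state')['violent_all'].agg(['mean', 'min', 'max'])
print(summary)
```

The same pattern should work on the real data, for example with `recent_df` and `'violent_all'` in place of the toy table.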

### Quick Check 2

You want to find which states have seen the most murders total over the past decade. 
Would you use:
- `groupby('state')['violent_murder'].sum()` 
- `groupby('state')['totals_violent_murder'].sum()`

Think about the difference between rates and totals!

## Part 4: The Power of Filter-Then-Group

### Combining What You Know

One of the most powerful patterns in data analysis is filtering your data first, then grouping. This lets you compare different time periods, crime types, or any other subset.

Let's answer a real policy question: "How has violent crime changed between the 2000s and 2010s?"

In [None]:
# PATTERN: Filter for time period, then group by state

# First, get crime rates for the 2000s
decade_2000s = df[(df['year'] >= 2000) & (df['year'] <= 2009)]
crime_2000s = decade_2000s.groupby(by='state')['violent_all'].mean()


In [None]:

# Then, get crime rates for the 2010s
decade_2010s = df[(df['year'] >= 2010) & (df['year'] <= 2019)]
crime_2010s = decade_2010s.groupby(by='state')['violent_all'].mean()


In [None]:

# Calculate improvement (positive = crime went down)
improvement = crime_2000s - crime_2010s


In [None]:

print("States with BIGGEST DECREASE in violent crime (2000s → 2010s):")
print(improvement.nlargest(n=5))
print("\nThese states have made the most progress!")

In [None]:
# Which states got worse?
worsened = improvement[improvement < 0].sort_values()

if len(worsened) > 0:
    print("States where violent crime INCREASED:")
    print(worsened)
else:
    print("Good news: No states saw increases in violent crime!")
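
The subtraction `crime_2000s - crime_2010s` works because pandas lines up the two results by their index labels (the state names), not by row position. A small sketch with made-up values shows this: the two Series are in different orders, and one label appears in only one of them.

```python
import pandas as pd

# Made-up averages; the labels stand in for state names
earlier = pd.Series({'A': 500.0, 'B': 300.0, 'C': 250.0})
# Different order on purpose, plus an extra label 'D' with no earlier value
later = pd.Series({'C': 270.0, 'A': 450.0, 'B': 300.0, 'D': 100.0})

# pandas matches A with A, B with B, and so on
improvement = earlier - later
print(improvement)
```

Labels present in both Series are subtracted correctly regardless of order. A label that exists on only one side (here `D`) comes out as `NaN`, which is a useful warning sign that a group is missing from one of your time periods.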

### Quick Check 3

How would you find the 5 states with the highest burglary rates specifically in 2019? 
Write the filter and groupby operations in your head before looking at the solution below.

<details>
<summary>Click for answer</summary>

```python
# Filter for 2019 first
df_2019 = df[df['year'] == 2019]
# Then group by state and get the mean (though there's only one 2019 per state)
# Or just sort the filtered data directly!
top_burglary_2019 = df_2019.nlargest(5, 'property_burglary')[['state', 'property_burglary']]
```
</details>

## Part 5: Grouping by Multiple Categories

### When One Group Isn't Enough

Sometimes we need to group by multiple things at once. For example, to see crime trends over time for each state, we need to group by BOTH state AND year.

In [None]:
# Group by both state and year to see trends
state_year_crime = recent_df.groupby(by=['state', 'year'])['property_all'].mean()

# This creates a "MultiIndex" - let's look at one state's trend
print("Maryland's property crime trend:")
print(state_year_crime['Maryland'])
print("\nNotice how crime has generally decreased over the decade!")

In [None]:
# For easier manipulation, we can reset the index to make it a regular DataFrame
crime_by_state_year = recent_df.groupby(by=['state', 'year'])['violent_all'].mean().reset_index()
crime_by_state_year


In [None]:

# Now we can easily find extremes
worst_state_years = crime_by_state_year.nlargest(n=5, columns='violent_all')
print("The 5 worst state-years for violent crime (2010-2019):")
print(worst_state_years)


In [None]:

best_state_years = crime_by_state_year.nsmallest(n=5, columns='violent_all')
print("\nThe 5 best state-years for violent crime (2010-2019):")
print(best_state_years)
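
Besides `reset_index()`, another way to tame a MultiIndex result is `.unstack()`, which moves one index level into the columns. The sketch below uses a made-up table (invented states and numbers) to turn a state-and-year result into a wide table with one row per state and one column per year.

```python
import pandas as pd

# Toy data with made-up numbers (NOT from state_crime.csv)
toy = pd.DataFrame({
    'state': ['A', 'A', 'B', 'B'],
    'year': [2018, 2019, 2018, 2019],
    'property_all': [2000.0, 1900.0, 1500.0, 1450.0],
})

# Result has a MultiIndex of (state, year)
by_state_year = toy.groupby(by=['state', 'year'])['property_all'].mean()

# Move the 'year' level into the columns: one row per state, one column per year
wide = by_state_year.unstack(level='year')

# With years side by side, year-to-year comparisons are a simple subtraction
wide['change'] = wide[2019] - wide[2018]
print(wide)
```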

## Part 6: Hands-On Exercise - Regional Crime Analysis

Let's apply everything we've learned to compare crime across different regions of the United States. This is the kind of analysis that informs federal policy and resource allocation.

In [None]:
# Define regions (simplified - you could improve these groupings!)
northeast = ['Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 
             'Rhode Island', 'Vermont', 'New Jersey', 'New York', 'Pennsylvania']
south = ['Delaware', 'Florida', 'Georgia', 'Maryland', 'North Carolina', 
         'South Carolina', 'Virginia', 'District of Columbia', 'West Virginia',
         'Kentucky', 'Tennessee', 'Alabama', 'Mississippi', 'Arkansas', 
         'Louisiana', 'Oklahoma', 'Texas']
midwest = ['Illinois', 'Indiana', 'Michigan', 'Ohio', 'Wisconsin',
           'Iowa', 'Kansas', 'Minnesota', 'Missouri', 'Nebraska', 
           'North Dakota', 'South Dakota']
west = ['Arizona', 'Colorado', 'Idaho', 'Montana', 'Nevada', 'New Mexico', 
        'Utah', 'Wyoming', 'Alaska', 'California', 'Hawaii', 'Oregon', 'Washington']

# Function to assign region based on state name
def assign_region(state):
    if state in northeast:
        return 'Northeast'
    elif state in south:
        return 'South'  
    elif state in midwest:
        return 'Midwest'
    elif state in west:
        return 'West'
    else:
        return 'Other'

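# Add region column to our recent data
# recent_df was created by filtering df, so take an independent copy first;
# this avoids pandas' SettingWithCopyWarning when adding a new column
recent_df = recent_df.copy()
recent_df['region'] = recent_df['state'].apply(assign_region)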

print("Regions assigned! First few rows:")
print(recent_df[['state', 'region']].head(n=10))
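
An alternative to the if/elif function, sketched here under simplifying assumptions, is a lookup dictionary combined with `.map()`. The region lists below are shortened on purpose and the name `state_to_region` is just an illustrative choice; the full lists defined in the cell above would be used the same way.

```python
import pandas as pd

# Shortened region lists for illustration only (not the full groupings)
region_lists = {
    'Northeast': ['Maine', 'Vermont'],
    'South': ['Maryland', 'Virginia'],
}

# Flip the lists into a state -> region lookup table
state_to_region = {
    state: region
    for region, states in region_lists.items()
    for state in states
}

# Made-up rows; the last one is not in any list
toy = pd.DataFrame({'state': ['Maryland', 'Maine', 'Somewhere Else']})

# .map() looks each state up in the dictionary; unmatched states become NaN,
# which we then label 'Other' to mirror the else branch of the function
toy['region'] = toy['state'].map(state_to_region).fillna('Other')
print(toy)
```

Both approaches give the same kind of column. The function version is easier to read step by step, while the dictionary version keeps all the groupings in one place.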

In [None]:
# Exercise 1: Which region has the highest average violent crime rate?
regional_violent = None  # YOUR CODE HERE - group by region, get mean of violent_all

print("Average violent crime rate by region:")
# YOUR PRINT STATEMENT HERE

In [None]:
# Exercise 2: How do property crime rates compare across regions?
regional_property = None  # YOUR CODE HERE

print("Average property crime rate by region:")
# YOUR PRINT STATEMENT HERE

In [None]:
# Exercise 3: Which region has seen the biggest decrease from 2010 to 2019?
# Hint: Filter for 2010, group by region, then filter for 2019, group by region, then compare

crime_2010 = None  # YOUR CODE HERE - filter for 2010, group by region
crime_2019 = None  # YOUR CODE HERE - filter for 2019, group by region
regional_change = None  # YOUR CODE HERE - calculate the difference

print("Change in violent crime by region (2010 to 2019):")
# YOUR PRINT STATEMENT HERE

In [None]:
# Exercise 4: Which specific violent crime type varies most by region?
# Look at murder, rape, robbery, and assault separately

regional_murder = None  # YOUR CODE HERE
regional_rape = None  # YOUR CODE HERE
regional_robbery = None  # YOUR CODE HERE
regional_assault = None  # YOUR CODE HERE

# Calculate the range (max - min) for each crime type
murder_range = None  # YOUR CODE HERE
rape_range = None  # YOUR CODE HERE
robbery_range = None  # YOUR CODE HERE
assault_range = None  # YOUR CODE HERE

print("Range by crime type across regions:")
print(f"Murder: {murder_range:.1f}")
print(f"Rape: {rape_range:.1f}")
print(f"Robbery: {robbery_range:.1f}")
print(f"Assault: {assault_range:.1f}")

## Quick Reference Card

### Essential Groupby Patterns

```python
# BASIC GROUPBY OPERATIONS

# Count records in each group
df.groupby('column').size()

# Average of a column for each group
df.groupby('column')['numeric_column'].mean()

# Sum totals for each group (use with counts, not rates!)
df.groupby('column')['totals_column'].sum()

# Find maximum/minimum in each group
df.groupby('column')['numeric_column'].max()
df.groupby('column')['numeric_column'].min()

# FILTER THEN GROUP

# First filter your data
filtered = df[df['year'] >= 2015]
# Then group the filtered data
filtered.groupby('state')['column'].mean()

# MULTIPLE GROUPING

# Group by two or more columns
df.groupby(['column1', 'column2']).size()
# Reset index to make it easier to work with
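# (select a numeric column first; averaging text columns raises an error in recent pandas)
df.groupby(['column1', 'column2'])['numeric_column'].mean().reset_index()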


# TOP N IN EACH GROUP

# Find top 5 in each group
df.groupby('column')['value'].nlargest(5)
```
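
The "top N in each group" pattern returns a result with a two-level index, which can be surprising the first time. Here is a small sketch on made-up data (the regions, states, and values are invented). Setting `state` as the index first is one way to make the output show state names rather than row numbers.

```python
import pandas as pd

# Toy data with made-up numbers
toy = pd.DataFrame({
    'region': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
    'state': ['s1', 's2', 's3', 's4', 's5', 's6'],
    'value': [10, 30, 20, 5, 15, 25],
})

# Top 2 values within each region; index becomes (region, state)
top2 = toy.set_index('state').groupby(by='region')['value'].nlargest(2)
print(top2)
```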

## Summary: Why Groupby Matters

Today you learned to:
- Use groupby to calculate statistics for each category separately
- Apply the "split-apply-combine" mental model
- Combine filtering with groupby for complex comparisons
- Group by multiple columns to find detailed patterns
- Use .agg() to get multiple statistics at once

These skills let you answer questions like:
- Which states have the highest crime rates?
- How has crime changed over time?
- Which regions show different patterns?
- Are certain crimes increasing while others decrease?

### Next Class
We'll learn to visualize these patterns, turning our groupby results into compelling charts that tell the story of crime in America.

### AI Prompting Tips for Groupby

When asking AI for help with groupby:

```
"I have a DataFrame with columns: state, year, violent_all, totals_violent_all
I want to find the average violent crime rate for each state.
Should I use groupby with mean() on the rates column or sum() on the totals column?"
```

Be specific about:
- What columns you have
- What you're grouping by
- Whether you want averages, totals, counts, etc.
- Whether you're working with rates or raw counts

## Practice Problems

Try these on your own:

1. Find the 3 states with the highest murder rates in 2019
2. Which state has seen the biggest decrease in property crime from 1990 to 2019?
3. Calculate the correlation between violent and property crime rates by state
4. Find which year in the dataset had the highest total murders nationwide (hint: group by year, sum totals)
5. Which state has the most consistent crime rate (smallest difference between min and max)?

Remember: You're encouraged to use AI tools to help, but make sure you understand what each line does!