[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zjelveh/zjelveh.github.io/blob/master/files/cfc/6_groupby_solutions.ipynb)

**IMPORTANT**: Save your own copy!
1. Click File → Save a copy in Drive
2. Rename it
3. Work in YOUR copy, not the original


---

# 6. Groupby and Aggregation - SOLUTIONS
## CCJS 418E: Coding for Criminology

This notebook contains worked solutions for Lab 6.

## Part 1: The Problem We Can't Answer Yet

### When Filtering Isn't Enough

Imagine you're working for the Department of Justice. They need to know:
- Which 5 states have the highest average violent crime rates over the past decade?
- Are crime rates increasing or decreasing in each state?
- Which regions of the country have different crime patterns?

We know how to filter for specific states or years, but how do we calculate statistics for EACH state or EACH year separately? This is where **groupby** becomes essential.

**Connection to Computational Thinking: PATTERN RECOGNITION - finding patterns within categories of data**

Let's start by loading our familiar state crime data and seeing why filtering alone isn't sufficient.

In [1]:
# Import pandas as always
import pandas as pd

pd.options.display.max_columns=None

# Load the state crime data we've been using
url = 'https://raw.githubusercontent.com/zjelveh/zjelveh.github.io/refs/heads/master/files/cfc/state_crime.csv'
df = pd.read_csv(filepath_or_buffer=url)

# Clean column names like before - making them easier to work with
# (raw strings with an escaped dot, so '.' matches a literal period only)
df.columns = df.columns.str.replace(r'^Data\.', '', regex=True)
df.columns = df.columns.str.replace(r'\.', '_', regex=True).str.lower()
# Drop the 'rates_' prefix so rate columns get short names like 'violent_all'
df.columns = df.columns.str.replace('rates_', '')


print(f"We have {len(df)} records covering {df['year'].min()} to {df['year'].max()}")
# Note: 'state' also includes the District of Columbia and a 'United States' row
print(f"Number of states: {df['state'].nunique()}")

We have 3115 records covering 1960 to 2019
Number of states: 52


In [2]:
df

# Columns that start with "totals" are the counts for each state and year
# Columns that DON'T start with "totals" are the rate columns

Unnamed: 0,state,year,population,property_all,property_burglary,property_larceny,property_motor,violent_all,violent_assault,violent_murder,violent_rape,violent_robbery,totals_property_all,totals_property_burglary,totals_property_larceny,totals_property_motor,totals_violent_all,totals_violent_assault,totals_violent_murder,totals_violent_rape,totals_violent_robbery
0,Alabama,1960,3266740,1035.4,355.9,592.1,87.3,186.6,138.1,12.4,8.6,27.5,33823,11626,19344,2853,6097,4512,406,281,898
1,Alabama,1961,3302000,985.5,339.3,569.4,76.8,168.5,128.9,12.9,7.6,19.1,32541,11205,18801,2535,5564,4255,427,252,630
2,Alabama,1962,3358000,1067.0,349.1,634.5,83.4,157.3,119.0,9.4,6.5,22.5,35829,11722,21306,2801,5283,3995,316,218,754
3,Alabama,1963,3347000,1150.9,376.9,683.4,90.6,182.7,142.1,10.2,5.7,24.7,38521,12614,22874,3033,6115,4755,340,192,828
4,Alabama,1964,3407000,1358.7,466.6,784.1,108.0,213.1,163.0,9.3,11.7,29.1,46290,15898,26713,3679,7260,5555,316,397,992
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3110,Wyoming,2015,586107,1902.6,300.6,1500.9,101.0,222.1,179.8,2.7,29.5,10.1,11151,1762,8797,592,1302,1054,16,173,59
3111,Wyoming,2016,585501,1957.3,302.5,1518.2,136.6,244.2,195.7,3.4,35.0,10.1,11460,1771,8889,800,1430,1146,20,205,59
3112,Wyoming,2017,579315,1830.4,275.0,1421.0,134.5,237.5,176.4,2.6,45.4,13.1,10604,1593,8232,779,1376,1022,15,263,76
3113,Wyoming,2018,577737,1785.1,264.0,1375.9,145.2,212.2,150.6,2.3,42.1,17.3,10313,1525,7949,839,1226,870,13,243,100


In [3]:
# Here's what we CAN do with filtering:
# We can look at one state at a time
maryland_crimes = df[df['state'] == 'Maryland']
maryland_avg = maryland_crimes['violent_all'].mean()
print(f"Maryland's average violent crime rate: {maryland_avg:.1f}")

Maryland's average violent crime rate: 653.1


In [4]:
virginia_crimes = df[df['state'] == 'Virginia']
virginia_avg = virginia_crimes['violent_all'].mean()
print(f"Virginia's average violent crime rate: {virginia_avg:.1f}")

Virginia's average violent crime rate: 279.5


### The Tedious Alternative

Without groupby, to get every state's average, we'd need to:
1. Filter for each state individually (50+ times!)
2. Calculate the average for each filtered dataset
3. Manually combine all the results

This would take hundreds of lines of repetitive code. Instead, groupby can do this in a single line! Let's learn how.

### A slightly better alternative

Let's automate this with a `for` loop, which also previews what pandas does behind the scenes when we call the `groupby` method.


We will need:
- a list of unique states
- a for loop which goes through each state
- inside the loop, we want to filter to the appropriate state
- compute the average violent crime rate for the state
- print it out

In [5]:
# SOLUTION
unique_states = df['state'].unique()

for state in unique_states:
    state_data = df[df['state'] == state]
    avg_violent = state_data['violent_all'].mean()
    print(f"{state}: {avg_violent:.1f}")

Alabama: 440.5
Alaska: 525.1
Arizona: 482.7
Arkansas: 393.8
California: 617.3
Colorado: 379.0
Connecticut: 290.2
Delaware: 477.7
District of Columbia: 1594.4
Florida: 702.9
Georgia: 445.6
Hawaii: 219.8
Idaho: 208.5
Illinois: 606.9
Indiana: 325.7
Iowa: 207.0
Kansas: 327.3
Kentucky: 259.3
Louisiana: 587.8
Maine: 122.6
Maryland: 653.1
Massachusetts: 435.7
Michigan: 552.6
Minnesota: 226.7
Mississippi: 289.3
Missouri: 479.3
Montana: 212.4
Nebraska: 248.6
Nevada: 601.2
New Hampshire: 121.9
New Jersey: 385.1
New Mexico: 596.5
New York: 685.3
North Carolina: 432.2
North Dakota: 103.6
Ohio: 339.3
Oklahoma: 400.4
Oregon: 343.1
Pennsylvania: 325.3
Rhode Island: 270.5
South Carolina: 604.1
South Dakota: 179.6
Tennessee: 511.3
Texas: 472.5
United States: 465.7
Utah: 221.8
Vermont: 109.9
Virginia: 279.5
Washington: 333.6
West Virginia: 200.9
Wisconsin: 197.3
Wyoming: 221.3


## Part 2: The Groupby Mental Model

### Understanding "Split-Apply-Combine"

Groupby works in three steps that we call "split-apply-combine":
1. **SPLIT** the data into groups (like splitting a deck of cards by suit)
2. **APPLY** a calculation to each group (count the cards in each suit)
3. **COMBINE** the results back together (report how many of each suit)

Let's walk through an example using the image below (from [rostools.org](https://r-cubed-intermediate.rostools.org/sessions/split-apply-combine)):   

<img src='https://r-cubed-intermediate.rostools.org/images/split-apply-combine.png' height=500 width=800>

Let's see this with a simplified example first, then apply it to our crime data.



In [6]:
# Create a small example dataset to visualize the concept
# Just 3 states, 2 years each
demo_df = df[(df['state'].isin(['Maryland', 'Virginia', 'Delaware'])) &
             (df['year'].isin([2018, 2019]))]

print("Our mini demonstration dataset:")
print(demo_df[['state', 'year', 'violent_all']].sort_values(by=['state', 'year']))

Our mini demonstration dataset:
         state  year  violent_all
478   Delaware  2018        423.6
479   Delaware  2019        422.6
1258  Maryland  2018        468.7
1259  Maryland  2019        454.1
2873  Virginia  2018        200.0
2874  Virginia  2019        208.0


In [7]:
# Now let's see the three steps in action:

# STEP 1: SPLIT - pandas internally separates the data by state
# (We don't see this happening, but imagine three separate piles)

# STEP 2: APPLY - Calculate mean for each state's pile
# Maryland: (468.7 + 454.1) / 2 = 461.4
# Virginia: (200.0 + 208.0) / 2 = 204.0
# Delaware: (423.6 + 422.6) / 2 = 423.1

# STEP 3: COMBINE - Put results into a new structure

In [8]:
avg_violent = demo_df.groupby(by='state')['violent_all'].mean()

print("Average violent crime rate by state (2018-2019):")
print(avg_violent)
print("\nNotice: We got one result per state, not per row!")

Average violent crime rate by state (2018-2019):
state
Delaware    423.1
Maryland    461.4
Virginia    204.0
Name: violent_all, dtype: float64

Notice: We got one result per state, not per row!
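To make the three steps concrete, here is a minimal sketch that performs split, apply, and combine by hand and compares the result with the one-line groupby. The `toy` table and its numbers are made up for illustration; they are not taken from the crime dataset.

```python
import pandas as pd

# Toy data (made-up numbers, not from the crime dataset)
toy = pd.DataFrame({
    'state': ['A', 'A', 'B', 'B'],
    'rate': [400.0, 500.0, 100.0, 300.0],
})

# SPLIT: iterating over a groupby yields (group name, sub-DataFrame) pairs
# APPLY: take the mean of each pile
pile_means = {}
for name, pile in toy.groupby('state'):
    pile_means[name] = pile['rate'].mean()

# COMBINE: collect the per-group results into a single Series
manual = pd.Series(pile_means, name='rate')

# The one-liner performs all three steps at once
one_liner = toy.groupby('state')['rate'].mean()

print(manual)
print(one_liner)
```

Both printouts should show the same two averages, one per group.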


### Quick Check 1

Look at the output above. If a state had violent crime rates of 400 in 2018 and 500 in 2019, what would its average be? Think about it before moving on!

## Part 3: Core Groupby Operations

### The Essential Methods: size(), mean(), sum(), max(), min()

Different methods answer different questions. Let's explore each one using our full dataset, focusing on recent years for cleaner analysis.

In [9]:
# Focus on recent decade for most examples
recent_df = df[df['year'] >= 2010]
print(f"Working with {len(recent_df)} records from 2010 onwards")

Working with 520 records from 2010 onwards


In [10]:
# METHOD 1: .size() - How many records in each group?
# This counts rows, regardless of what's in them
records_per_state = recent_df.groupby(by='state').size()

print("Number of records per state:")
print(records_per_state.head(n=5))

Number of records per state:
state
Alabama       10
Alaska        10
Arizona       10
Arkansas      10
California    10
dtype: int64
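A related method you may run into is `.count()`. As a small sketch on made-up data: `.size()` counts rows in each group, while `.count()` on a column counts only the non-missing values in that column.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value (made-up numbers)
toy = pd.DataFrame({
    'state': ['A', 'A', 'B', 'B'],
    'rate': [10.0, np.nan, 30.0, 40.0],
})

# .size() counts rows, whether or not values are missing
rows_per_state = toy.groupby('state').size()

# .count() counts only the non-missing values in the chosen column
values_per_state = toy.groupby('state')['rate'].count()

print(rows_per_state)
print(values_per_state)
```

If the two results differ for a group, that group has missing data in the column you counted.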


In [11]:
# METHOD 2: .mean() - Calculate averages for each group
# This is perfect for finding typical crime rates
avg_property_by_state = recent_df.groupby(by='state')['property_all'].mean()

print("Average property crime rate by state (2010-2019):")
print(avg_property_by_state.nlargest(n=5))
print("\nThese states have the highest property crime on average")

Average property crime rate by state (2010-2019):
state
District of Columbia    4690.74
New Mexico              3592.06
South Carolina          3442.40
Louisiana               3436.60
Washington              3411.28
Name: property_all, dtype: float64

These states have the highest property crime on average


In [12]:
# METHOD 3: .sum() - Add up values in each group
# Use this with TOTALS not RATES (rates don't add meaningfully)
# Note: the 'United States' row appears to be a national aggregate, so it tops the list
total_violent_by_state = recent_df.groupby(by='state')['totals_violent_all'].sum()

print("Total violent crimes by state (2010-2019 combined):")
print(total_violent_by_state.nlargest(n=5))
print("\nNote: Larger states have more total crimes (population effect)")

Total violent crimes by state (2010-2019 combined):
state
United States    12350037
California        1658477
Texas             1139761
Florida            924962
New York           744843
Name: totals_violent_all, dtype: int64

Note: Larger states have more total crimes (population effect)


In [13]:
# METHOD 4: .max() and .min() - Find extremes in each group
# Useful for finding best/worst years
worst_year_by_state = recent_df.groupby(by='state')['violent_all'].max()
best_year_by_state = recent_df.groupby(by='state')['violent_all'].min()

print("Each state's WORST violent crime rate (2010-2019):")
print(worst_year_by_state.nlargest(n=5))

print("\nEach state's BEST violent crime rate (2010-2019):")
print(best_year_by_state.nsmallest(n=5))

Each state's WORST violent crime rate (2010-2019):
state
District of Columbia    1326.8
Alaska                   885.0
New Mexico               856.6
Nevada                   695.9
Tennessee                651.5
Name: violent_all, dtype: float64

Each state's BEST violent crime rate (2010-2019):
state
Vermont           99.3
Maine            112.1
New Hampshire    152.5
Connecticut      183.6
Virginia         190.1
Name: violent_all, dtype: float64
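Notice that `.max()` and `.min()` return the extreme *rate*, not the year it happened. One possible way to recover the year is `.idxmax()`, which returns the row label of each group's maximum. This is a sketch on made-up data, not part of the original lab.

```python
import pandas as pd

# Toy data (made-up numbers)
toy = pd.DataFrame({
    'state': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year': [2010, 2011, 2012, 2010, 2011, 2012],
    'rate': [300.0, 350.0, 320.0, 90.0, 80.0, 85.0],
})

# idxmax() gives the row label where each state's rate peaks
worst_rows = toy.groupby('state')['rate'].idxmax()

# Use those labels to pull the full rows, including the year
worst_years = toy.loc[worst_rows, ['state', 'year', 'rate']]
print(worst_years)
```

Swap in `.idxmin()` to find each group's best year the same way.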


### Quick Check 2

You want to find which states have seen the most murders total over the past decade.
Would you use:
- `groupby('state')['violent_murder'].sum()`
- `groupby('state')['totals_violent_murder'].sum()`

Think about the difference between rates and totals!
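Here is a small sketch of why the distinction matters. The rate columns in our data appear to be per 100,000 residents (Alabama's 1960 row: 6,097 violent crimes / 3,266,740 people × 100,000 ≈ 186.6). The table below uses made-up numbers and a hypothetical `totals_murder` column to show that summing totals gives a real count, while summing rates gives a number that doesn't describe anything.

```python
import pandas as pd

# Toy data (made-up numbers); rate = murders per 100,000 residents
toy = pd.DataFrame({
    'state': ['A', 'A'],
    'year': [2018, 2019],
    'population': [1_000_000, 1_000_000],
    'totals_murder': [50, 70],
})
toy['murder_rate'] = toy['totals_murder'] * 100_000 / toy['population']

# Summing TOTALS answers "how many murders over the period?"
total_murders = toy.groupby('state')['totals_murder'].sum()

# Summing RATES gives a figure that matches no year and no period
summed_rates = toy.groupby('state')['murder_rate'].sum()

# Averaging rates is meaningful: the typical yearly rate
mean_rate = toy.groupby('state')['murder_rate'].mean()

print(total_murders, summed_rates, mean_rate, sep='\n\n')
```

So for "most murders total", sum the `totals_` column; for "typical murder rate", average the rate column.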

## Part 4: The Power of Filter-Then-Group

### Combining What You Know

One of the most powerful patterns in data analysis is filtering your data first, then grouping. This lets you compare different time periods, crime types, or any other subset.

Let's answer a real policy question: "How has violent crime changed between the 2000s and 2010s?"

In [14]:
# PATTERN: Filter for time period, then group by state

# First, get crime rates for the 2000s
decade_2000s = df[(df['year'] >= 2000) & (df['year'] <= 2009)]
crime_2000s = decade_2000s.groupby(by='state')['violent_all'].mean()
crime_2000s.head()

state
Alabama       443.56
Alaska        621.89
Arizona       512.89
Arkansas      492.81
California    550.07
Name: violent_all, dtype: float64

In [15]:
# Then, get crime rates for the 2010s
decade_2010s = df[(df['year'] >= 2010) & (df['year'] <= 2019)]
crime_2010s = decade_2010s.groupby(by='state')['violent_all'].mean()
crime_2010s.head()

state
Alabama       467.09
Alaska        724.03
Arizona       439.16
Arkansas      515.06
California    428.16
Name: violent_all, dtype: float64

In [16]:
# Verify that the states line up (we'll do this visually right now)

In [17]:
# Calculate improvement (positive = crime went down)
improvement = crime_2000s - crime_2010s
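The visual check above can also be done in code. A minimal sketch with made-up decade averages: when you subtract two Series, pandas matches values by index label rather than by position, so row order does not matter, but a state missing from one side produces `NaN`.

```python
import pandas as pd

# Made-up averages; note the different order and the missing state 'C'
avg_2000s = pd.Series({'A': 500.0, 'B': 300.0, 'C': 200.0})
avg_2010s = pd.Series({'B': 350.0, 'A': 400.0})

# True only if both indexes hold the same labels in the same order
same_states = avg_2000s.index.equals(avg_2010s.index)
print(f"Indexes identical: {same_states}")

# Subtraction aligns on labels: A with A, B with B; C has no partner
diff = avg_2000s - avg_2010s
print(diff)
```

A `NaN` in the result is a hint that one of the two groupby outputs lacks that state.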

In [18]:
print("States with BIGGEST DECREASE in violent crime (2000s → 2010s):")
print(improvement.nlargest(n=5))
print("\nThese states have made the most progress!")

States with BIGGEST DECREASE in violent crime (2000s → 2010s):
state
District of Columbia    299.18
Florida                 265.64
South Carolina          251.61
Maryland                219.69
Delaware                155.42
Name: violent_all, dtype: float64

These states have made the most progress!


In [19]:
# Which states got worse?
worsened = improvement[improvement < 0].sort_values()

if len(worsened) > 0:
    print("States where violent crime INCREASED:")
    print(worsened)
else:
    print("Good news: No states saw increases in violent crime!")

States where violent crime INCREASED:
state
South Dakota    -150.31
North Dakota    -117.84
Alaska          -102.14
Wisconsin        -38.98
West Virginia    -38.89
New Hampshire    -34.42
Indiana          -25.91
Alabama          -23.53
Vermont          -23.13
Arkansas         -22.25
Maine             -9.99
Hawaii            -0.60
Name: violent_all, dtype: float64


### Quick Check 3

How would you find the 5 states with the highest burglary rates specifically in 2019?
Write the filter and groupby operations in your head before looking at the solution below.

<details>
<summary>Click for answer</summary>

```python
# Filter for 2019 first
df_2019 = df[df['year'] == 2019]
# Each state has only one 2019 row, so there is nothing to aggregate:
# skip the groupby and take the 5 largest values directly
top_burglary_2019 = df_2019.nlargest(5, 'property_burglary')[['state', 'property_burglary']]
```
</details>

## Part 5: Grouping by Multiple Categories

### When One Group Isn't Enough

Sometimes we need to group by multiple things at once. For example, to see crime trends over time for each state, we need to group by BOTH state AND year.

In [20]:
# Group by both state and year to see trends
state_year_crime = recent_df.groupby(by=['state', 'year'])['property_all'].mean()
state_year_crime

state    year
Alabama  2010    3528.0
         2011    3605.4
         2012    3502.2
         2013    3351.3
         2014    3177.6
                  ...  
Wyoming  2015    1902.6
         2016    1957.3
         2017    1830.4
         2018    1785.1
         2019    1571.1
Name: property_all, Length: 520, dtype: float64

In [21]:
# This creates a "MultiIndex" - let's look at one state's trend
print("Maryland's property crime trend:")
print(state_year_crime['Maryland'])
print("\nNotice how crime has generally decreased over the decade!")

Maryland's property crime trend:
year
2010    2995.5
2011    2857.2
2012    2753.5
2013    2663.5
2014    2507.5
2015    2315.0
2016    2284.5
2017    2222.3
2018    2033.3
2019    1950.2
Name: property_all, dtype: float64

Notice how crime has generally decreased over the decade!
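Beyond selecting one state, there are a few other ways to slice a MultiIndex result. The sketch below uses made-up data; `.xs()` and `.unstack()` are standard pandas methods that we have not covered in class, so treat this as a preview.

```python
import pandas as pd

# Toy data (made-up numbers)
toy = pd.DataFrame({
    'state': ['A', 'A', 'B', 'B'],
    'year': [2018, 2019, 2018, 2019],
    'rate': [10.0, 12.0, 20.0, 18.0],
})
by_state_year = toy.groupby(['state', 'year'])['rate'].mean()

# One state, all years (outer index level)
one_state = by_state_year.loc['A']

# One year, all states (inner index level)
one_year = by_state_year.xs(2019, level='year')

# Pivot years into columns: one row per state
wide = by_state_year.unstack(level='year')

print(one_state, one_year, wide, sep='\n\n')
```

The wide layout is often the easiest to read when you want to compare years side by side.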


In [22]:
# For easier manipulation, we can reset the index to make it a regular DataFrame
crime_by_state_year = recent_df.groupby(by=['state', 'year'])['violent_all'].mean().reset_index()
crime_by_state_year

Unnamed: 0,state,year,violent_all
0,Alabama,2010,383.7
1,Alabama,2011,419.8
2,Alabama,2012,449.9
3,Alabama,2013,430.8
4,Alabama,2014,427.4
...,...,...,...
515,Wyoming,2015,222.1
516,Wyoming,2016,244.2
517,Wyoming,2017,237.5
518,Wyoming,2018,212.2


In [23]:
# Now we can easily find extremes
worst_state_years = crime_by_state_year.nlargest(n=5, columns='violent_all')
print("The 5 worst state-years for violent crime (2010-2019):")
print(worst_state_years)

The 5 worst state-years for violent crime (2010-2019):
                   state  year  violent_all
80  District of Columbia  2010       1326.8
83  District of Columbia  2013       1300.3
85  District of Columbia  2015       1269.1
84  District of Columbia  2014       1244.4
82  District of Columbia  2012       1243.7


In [24]:
best_state_years = crime_by_state_year.nsmallest(n=5, columns='violent_all')
print("\nThe 5 best state-years for violent crime (2010-2019):")
print(best_state_years)


The 5 best state-years for violent crime (2010-2019):
       state  year  violent_all
464  Vermont  2014         99.3
198    Maine  2018        112.1
199    Maine  2019        115.2
465  Vermont  2015        118.0
197    Maine  2017        121.0


## Part 7: Hands-On Exercise - Regional Crime Analysis

Let's apply everything we've learned to compare crime across different regions of the United States. This is the kind of analysis that informs federal policy and resource allocation.

In [25]:
# Define regions (simplified - you could improve these groupings!)
northeast = ['Connecticut', 'Maine', 'Massachusetts', 'New Hampshire',
             'Rhode Island', 'Vermont', 'New Jersey', 'New York', 'Pennsylvania']
south = ['Delaware', 'Florida', 'Georgia', 'Maryland', 'North Carolina',
         'South Carolina', 'Virginia', 'District of Columbia', 'West Virginia',
         'Kentucky', 'Tennessee', 'Alabama', 'Mississippi', 'Arkansas',
         'Louisiana', 'Oklahoma', 'Texas']
midwest = ['Illinois', 'Indiana', 'Michigan', 'Ohio', 'Wisconsin',
           'Iowa', 'Kansas', 'Minnesota', 'Missouri', 'Nebraska',
           'North Dakota', 'South Dakota']
west = ['Arizona', 'Colorado', 'Idaho', 'Montana', 'Nevada', 'New Mexico',
        'Utah', 'Wyoming', 'Alaska', 'California', 'Hawaii', 'Oregon', 'Washington']

# Function to assign region based on state name
def assign_region(state):
    if state in northeast:
        return 'Northeast'
    elif state in south:
        return 'South'
    elif state in midwest:
        return 'Midwest'
    elif state in west:
        return 'West'
    else:
        # In this dataset, only the 'United States' aggregate row lands here
        return 'Other'

# Add region column to our recent data
# recent_df was created by filtering df, so take an explicit copy first;
# this avoids pandas' SettingWithCopyWarning when we add a column
recent_df = recent_df.copy()
recent_df['region'] = recent_df['state'].apply(assign_region)

print("Regions assigned! First few rows:")
print(recent_df[['state', 'region']].head(n=10))

Regions assigned! First few rows:
      state region
50  Alabama  South
51  Alabama  South
52  Alabama  South
53  Alabama  South
54  Alabama  South
55  Alabama  South
56  Alabama  South
57  Alabama  South
58  Alabama  South
59  Alabama  South


In [26]:
# Exercise 1: Which region has the highest average violent crime rate?
# SOLUTION
regional_violent = recent_df.groupby(by='region')['violent_all'].mean()

print("Average violent crime rate by region:")
print(regional_violent.sort_values(ascending=False))
print(f"\nHighest: {regional_violent.idxmax()} with {regional_violent.max():.1f}")

Average violent crime rate by region:
region
South        464.775294
Other        386.560000
West         386.018462
Midwest      338.559167
Northeast    255.097778
Name: violent_all, dtype: float64

Highest: South with 464.8


In [27]:
# Exercise 2: How do property crime rates compare across regions?
# SOLUTION
regional_property = recent_df.groupby(by='region')['property_all'].mean()

print("Average property crime rate by region:")
print(regional_property.sort_values(ascending=False))
print(f"\nHighest: {regional_property.idxmax()} with {regional_property.max():.1f}")

Average property crime rate by region:
region
South        2971.304118
West         2781.006923
Other        2564.660000
Midwest      2357.296667
Northeast    1848.790000
Name: property_all, dtype: float64

Highest: South with 2971.3


In [28]:
# Exercise 3: Which region has seen the biggest decrease from 2010 to 2019?
# Hint: Filter for 2010, group by region, then filter for 2019, group by region, then compare

# SOLUTION
crime_2010 = recent_df[recent_df['year'] == 2010].groupby(by='region')['violent_all'].mean()
crime_2019 = recent_df[recent_df['year'] == 2019].groupby(by='region')['violent_all'].mean()
regional_change = crime_2010 - crime_2019

print("Change in violent crime by region (2010 to 2019):")
print(regional_change.sort_values(ascending=False))
print(f"\nBiggest decrease: {regional_change.idxmax()} with {regional_change.max():.1f} point reduction")

Change in violent crime by region (2010 to 2019):
region
Northeast    47.011111
South        45.864706
Other        25.100000
Midwest     -21.716667
West        -47.384615
Name: violent_all, dtype: float64

Biggest decrease: Northeast with 47.0 point reduction


In [29]:
# Exercise 4: Which specific violent crime type varies most by region?
# Look at murder, robbery, and assault separately

# SOLUTION
regional_murder = recent_df.groupby(by='region')['violent_murder'].mean()
regional_robbery = recent_df.groupby(by='region')['violent_robbery'].mean()
regional_assault = recent_df.groupby(by='region')['violent_assault'].mean()

In [30]:
# Calculate the range (max - min) for each crime type
# SOLUTION
murder_range = regional_murder.max() - regional_murder.min()
robbery_range = regional_robbery.max() - regional_robbery.min()
assault_range = regional_assault.max() - regional_assault.min()

In [31]:
print("Range by crime type across regions:")
print(f"Murder: {murder_range:.1f}")
print(f"Robbery: {robbery_range:.1f}")
print(f"Assault: {assault_range:.1f}")

# Pick the crime type with the largest range
ranges = {'Murder': murder_range, 'Robbery': robbery_range, 'Assault': assault_range}
most_variable = max(ranges, key=ranges.get)
print(f"\nCrime type with most regional variation: {most_variable}")

Range by crime type across regions:
Murder: 4.2
Robbery: 51.0
Assault: 146.7

Crime type with most regional variation: Assault


## Quick Reference Card

### Essential Groupby Patterns

```python
# BASIC GROUPBY OPERATIONS

# Count records in each group
df.groupby('column').size()

# Average of a column for each group
df.groupby('column')['numeric_column'].mean()

# Sum totals for each group (use with counts, not rates!)
df.groupby('column')['totals_column'].sum()

# Find maximum/minimum in each group
df.groupby('column')['numeric_column'].max()
df.groupby('column')['numeric_column'].min()

# FILTER THEN GROUP

# First filter your data
filtered = df[df['year'] >= 2015]
# Then group the filtered data
filtered.groupby('state')['column'].mean()

# MULTIPLE GROUPING

# Group by two or more columns
df.groupby(['column1', 'column2']).size()
# Reset index to make it easier to work with
# (select a numeric column first; .mean() can raise an error on text columns)
df.groupby(['column1', 'column2'])['numeric_column'].mean().reset_index()


# TOP N IN EACH GROUP

# Find top 5 in each group
df.groupby('column')['value'].nlargest(5)
```
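One pattern the card above leaves out is `.agg()`, which computes several statistics per group in one call. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy data (made-up numbers)
toy = pd.DataFrame({
    'state': ['A', 'A', 'B', 'B'],
    'rate': [100.0, 300.0, 50.0, 70.0],
})

# One column per statistic, one row per group
summary = toy.groupby('state')['rate'].agg(['mean', 'min', 'max'])

# The result is a regular DataFrame, so we can add derived columns
summary['range'] = summary['max'] - summary['min']
print(summary)
```

Each name in the list becomes a column, which saves writing one groupby line per statistic.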

## Summary: Why Groupby Matters

Today you learned to:
- Use groupby to calculate statistics for each category separately
- Apply the "split-apply-combine" mental model
- Combine filtering with groupby for complex comparisons
- Group by multiple columns to find detailed patterns
- Use .agg() to get multiple statistics at once

These skills let you answer questions like:
- Which states have the highest crime rates?
- How has crime changed over time?
- Which regions show different patterns?
- Are certain crimes increasing while others decrease?

### Next Class
We'll learn to visualize these patterns, turning our groupby results into compelling charts that tell the story of crime in America.

### AI Prompting Tips for Groupby

When asking AI for help with groupby:

```
"I have a DataFrame with columns: state, year, violent_all, totals_violent_all
I want to find the average violent crime rate for each state.
Should I use groupby with mean() on the rates column or sum() on the totals column?"
```

Be specific about:
- What columns you have
- What you're grouping by
- Whether you want averages, totals, counts, etc.
- Whether you're working with rates or raw counts

## Practice Problems - SOLUTIONS

Below are solutions to the practice problems. Remember: You're encouraged to use AI tools to help, especially for concepts we haven't covered in class (like correlation)!

### Problem 1: Find the 3 states with the highest murder rates in 2019

In [32]:
# SOLUTION
# Filter for 2019 first
df_2019 = df[df['year'] == 2019]

# Sort by murder rate (highest first) and take top 3
top_murder_2019 = df_2019.sort_values(by='violent_murder', ascending=False).head(n=3)

print("Top 3 states with highest murder rates in 2019:")
print(top_murder_2019[['state', 'violent_murder']])

Top 3 states with highest murder rates in 2019:
                     state  violent_murder
539   District of Columbia            23.5
1139             Louisiana            11.7
1499           Mississippi            11.2


### Problem 2: Which state has seen the biggest decrease in property crime from 1990 to 2019?

In [33]:
# SOLUTION
# Get property crime rates for 1990
df_1990 = df[df['year'] == 1990]

# Get property crime rates for 2019
df_2019 = df[df['year'] == 2019]

# For each state, calculate the change
# We'll do this by filtering to each state and comparing
states = df['state'].unique()

# Store results
results = []
for state in states:
    rate_1990 = df_1990[df_1990['state'] == state]['property_all'].values
    rate_2019 = df_2019[df_2019['state'] == state]['property_all'].values
    
    # Check if we have data for both years
    if len(rate_1990) > 0 and len(rate_2019) > 0:
        change = rate_1990[0] - rate_2019[0]
        results.append({'state': state, 'change': change, 
                       'rate_1990': rate_1990[0], 'rate_2019': rate_2019[0]})

# Convert to DataFrame and sort by change
results_df = pd.DataFrame(results)
results_df = results_df.sort_values(by='change', ascending=False)

# Get the state with biggest decrease
biggest = results_df.head(n=1)

print("State with biggest decrease in property crime (1990-2019):")
print(f"{biggest['state'].values[0]}: {biggest['change'].values[0]:.1f} point decrease")
print(f"From {biggest['rate_1990'].values[0]:.1f} in 1990 to {biggest['rate_2019'].values[0]:.1f} in 2019")

State with biggest decrease in property crime (1990-2019):
Florida: 5420.8 point decrease
From 7566.5 in 1990 to 2145.7 in 2019
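The loop works, but the same comparison can be written without one by reusing the Part 4 idea of subtracting two Series. This is a sketch on made-up data, not the only way to solve the problem; `set_index('state')` is used so the subtraction lines up by state name.

```python
import pandas as pd

# Toy data (made-up numbers); state 'C' has no 1990 row
toy = pd.DataFrame({
    'state': ['A', 'A', 'B', 'B', 'C'],
    'year': [1990, 2019, 1990, 2019, 2019],
    'property_all': [5000.0, 2000.0, 4000.0, 3500.0, 1000.0],
})

# One Series per year, indexed by state
rate_1990 = toy[toy['year'] == 1990].set_index('state')['property_all']
rate_2019 = toy[toy['year'] == 2019].set_index('state')['property_all']

# Subtract by state name, drop states missing a year, largest drop first
change = (rate_1990 - rate_2019).dropna().sort_values(ascending=False)

print(change)
print(f"Biggest decrease: {change.idxmax()} ({change.max():.1f} points)")
```

The `dropna()` plays the role of the loop's "do we have data for both years?" check.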


### Problem 3: Calculate the correlation between violent and property crime rates by state

**Note**: We didn't cover correlation in class! This is a perfect opportunity to ask an AI tool:
- "What does correlation mean?"
- "How do I calculate correlation between two columns in pandas?"
- "What does a correlation of 0.8 mean vs 0.2?"

The AI can explain that correlation measures how two variables move together, ranging from -1 (perfect negative relationship) to +1 (perfect positive relationship).
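To see the two extremes for yourself, here is a tiny sketch with made-up Series. `Series.corr()` computes the Pearson correlation by default.

```python
import pandas as pd

# Made-up values
x = pd.Series([1.0, 2.0, 3.0, 4.0])
moves_together = pd.Series([10.0, 20.0, 30.0, 40.0])   # rises as x rises
moves_opposite = pd.Series([8.0, 6.0, 4.0, 2.0])       # falls as x rises

r_pos = x.corr(moves_together)
r_neg = x.corr(moves_opposite)

print(f"Moves together: {r_pos:.2f}")
print(f"Moves opposite: {r_neg:.2f}")
```

Real data almost never hits the extremes exactly; values in between describe weaker relationships.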

In [34]:
# SOLUTION
# First, get average crime rates by state
state_violent = df.groupby('state')['violent_all'].mean()
state_property = df.groupby('state')['property_all'].mean()

# We haven't covered .corr() yet, so this is where AI tools come in handy!
# You could ask: "How do I calculate correlation between two pandas Series?"
# The AI would tell you about .corr()

correlation = state_violent.corr(state_property)

print(f"Correlation between violent and property crime rates by state: {correlation:.3f}")
print("\nInterpretation:")
if correlation > 0.7:
    print("Strong positive correlation - states with high violent crime tend to have high property crime")
elif correlation > 0.3:
    print("Moderate positive correlation - some relationship between violent and property crime")
elif correlation < -0.3:
    print("Negative correlation - states with high violent crime tend to have lower property crime")
else:
    print("Weak correlation - violent and property crime rates don't move together strongly")

Correlation between violent and property crime rates by state: 0.706

Interpretation:
Strong positive correlation - states with high violent crime tend to have high property crime


### Problem 4: Find which year had the highest total murders nationwide

**Note**: The original problem said "day" but our data is yearly, so we'll find the year instead.

In [35]:
# SOLUTION
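# Group by year and sum the total murders across all rows
# Caution: the data also has a 'United States' row (it tops the Part 3 .sum() output),
# so these sums include the national figure on top of the individual states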
murders_by_year = df.groupby('year')['totals_violent_murder'].sum()

# Find the year with the highest total
worst_year = murders_by_year.idxmax()
worst_year_total = murders_by_year.max()

print(f"Year with highest total murders nationwide: {worst_year}")
print(f"Total murders: {worst_year_total:,.0f}")

print("\nTop 5 worst years:")
print(murders_by_year.nlargest(5))

Year with highest total murders nationwide: 1991
Total murders: 49,406

Top 5 worst years:
year
1991    49406
1993    49052
1992    47520
1990    46876
1994    46652
Name: totals_violent_murder, dtype: int64
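One caution about sums like this: the dataset contains a `United States` row (it tops the `.sum()` output in Part 3), which appears to hold national totals. If so, summing every row counts each murder roughly twice. The ranking of years is likely unaffected, but the totals are inflated. A minimal sketch of the fix on made-up data:

```python
import pandas as pd

# Toy data (made-up numbers); 'United States' = A + B for each year
toy = pd.DataFrame({
    'state': ['A', 'B', 'United States', 'A', 'B', 'United States'],
    'year': [1990, 1990, 1990, 1991, 1991, 1991],
    'totals_violent_murder': [10, 20, 30, 15, 25, 40],
})

# Summing every row double counts because of the aggregate row
all_rows = toy.groupby('year')['totals_violent_murder'].sum()

# Filter out the aggregate row first, then group
states_only = toy[toy['state'] != 'United States']
by_year = states_only.groupby('year')['totals_violent_murder'].sum()

print(all_rows)
print(by_year)
```

This is another case of the filter-then-group pattern from Part 4.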


### Problem 5: Which state has the most consistent crime rate (smallest difference between min and max)?

In [36]:
# SOLUTION
# For each state, calculate the range (max - min) of violent crime rates
state_stats = df.groupby('state')['violent_all'].agg(['max', 'min'])
state_stats['range'] = state_stats['max'] - state_stats['min']

# Sort by range to find most consistent (smallest range)
state_stats = state_stats.sort_values(by='range')

# Get the state with smallest range
most_consistent = state_stats.head(n=1)

print(f"State with most consistent violent crime rate: {most_consistent.index[0]}")
print(f"Range: {most_consistent['range'].values[0]:.1f} points")
print(f"Min: {most_consistent['min'].values[0]:.1f}, Max: {most_consistent['max'].values[0]:.1f}")

print("\n5 most consistent states:")
print(state_stats.head(n=5)['range'])

print("\n5 most volatile states:")
print(state_stats.sort_values(by='range', ascending=False).head(n=5)['range'])

State with most consistent violent crime rate: Vermont
Range: 192.7 points
Min: 9.5, Max: 202.2

5 most consistent states:
state
Vermont          192.7
Maine            197.0
Virginia         197.2
New Hampshire    204.0
North Dakota     270.4
Name: range, dtype: float64

5 most volatile states:
state
District of Columbia    2368.1
Florida                 1052.2
Louisiana                926.0
South Carolina           906.9
California               887.0
Name: range, dtype: float64
