
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zjelveh/zjelveh.github.io/blob/master/files/cfc/5_pandas_filtering_lecture.ipynb)

**IMPORTANT**: Save your own copy!
1. Click File → Save a copy in Drive
2. Rename it
3. Work in YOUR copy, not the original


---


# 5. Filtering and Analyzing Data with Pandas
## CCJS 418E: Coding for Criminology

Today's Goals:
- Filter DataFrames to find specific rows (like using if statements on entire datasets)
- Combine multiple conditions to answer complex questions
- Create new columns from existing data
- Answer real criminological questions with pandas

Last class: We learned to load, explore, select columns, and sort data

Today: We learn to ask specific questions and get specific answers


## Quick Review: What We Know So Far

Let's reload our data and remind ourselves of the basics:

In [None]:
# First, import pandas (conventionally abbreviated as pd)
import pandas as pd

# Load actual state crime data
df = pd.read_csv(filepath_or_buffer='https://raw.githubusercontent.com/zjelveh/zjelveh.github.io/refs/heads/master/files/cfc/state_crime.csv')

# Cleaning up the column names
# (the dot is escaped so it matches a literal '.', not any character)
df.columns = df.columns.str.replace('^Data\\.', '', regex=True)
df.columns = df.columns.str.replace('\\.', '_', regex=True).str.lower()

# Filter to specific years and states to keep dataset manageable
df = df[df['year'].isin([2015, 2016, 2017, 2018, 2019])]
df = df[df['state'].isin(['Maryland', 'Virginia', 'Delaware', 'Pennsylvania'])]

# Select just the property crime columns
df = df[['state', 'year', 'rates_property_all',
       'rates_property_burglary', 'rates_property_larceny',
       'rates_property_motor']]

# Clean up column names one more time
df.columns = df.columns.str.replace('rates_', '')

# Confirm the load worked ('df' is the conventional variable name for a DataFrame)
print("Data loaded successfully!")
print(f"Dataset: {df.shape[0]} rows, {df.shape[1]} columns")

In [None]:
# Quick look at what we have
df.head(n=5)

In [None]:
# Last class we learned to:
# 1. Select columns
print("Average property crime rate:", df['property_all'].mean())

# 2. Sort data
print("\nHighest property crime rates:")
df.sort_values(by='property_all', ascending=False)

**But what if we want to answer questions like:**
- "Show me ONLY Maryland's data"
- "Which state-years had property crime rates above 2500?"
- "What was Virginia's crime trend from 2015-2019?"

That's what filtering lets us do!

## Part 1: Boolean Indexing - Finding Specific Rows

### The Concept: True/False for Every Row

**Connection to Computational Thinking: This is ALGORITHMIC THINKING - creating decision rules**

Think of filtering like using an if statement on every single row at once.
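
To make the "if statement on every row" idea concrete, here is a minimal sketch that writes the test twice: once as an explicit loop, and once as a single pandas comparison. The small `toy` DataFrame is invented for illustration and does not contain the real values from `state_crime.csv`.

```python
import pandas as pd

# Toy data invented for illustration (not the real state_crime.csv values)
toy = pd.DataFrame({
    'state': ['Maryland', 'Virginia', 'Maryland', 'Delaware'],
    'year': [2018, 2018, 2019, 2019],
})

# Version 1: an if statement applied to each row, one at a time
loop_result = []
for state in toy['state']:
    if state == 'Maryland':
        loop_result.append(True)
    else:
        loop_result.append(False)

# Version 2: one comparison applied to the whole column at once
mask = toy['state'] == 'Maryland'

print(loop_result)                    # [True, False, True, False]
print(mask.tolist())                  # [True, False, True, False]
print(loop_result == mask.tolist())   # True
```

Both versions produce the same True/False answers; the pandas version is simply shorter and is the form used in the rest of this lecture.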

In [None]:
# Let's start simple: find all rows where the state is Maryland
# Step 1: Create a True/False test for each row
is_maryland = df['state'] == 'Maryland'

print("True/False for each row:")
print(is_maryland)

# Count how many are True
print(f"\nNumber of Maryland rows: {is_maryland.sum()}")  # True=1, False=0

In [None]:
# Step 2: Use those True/False values to filter the DataFrame
maryland_data = df[is_maryland]

print("Just Maryland's data:")
maryland_data

### The Shortcut: Combine the Steps

Usually we don't create the True/False column separately - we do it in one line:

In [None]:
# Find all Maryland data in one line
maryland_data = df[df['state'] == 'Maryland']

print("Maryland's crime data (2015-2019):")
maryland_data

In [None]:
# The pattern: df[df['column'] COMPARISON value]
# Read it as: "Give me rows from df where column meets this condition"

# Example:
print("State-years with total property crime rate above 2500:")
high_crime = df[df['property_all'] > 2500]
print(high_crime[['state', 'year', 'property_all']])

### 🎯 QUICK CHECK #1
Filter the data to show only rows where the year is 2019. How many rows should you get?

In [None]:
# Your code here:

<details>
<summary>Click for solution</summary>

```python
data_2019 = df[df['year'] == 2019]
print(f"Number of rows in 2019: {len(data_2019)}")
data_2019
# Should get 4 rows (one for each state)
```
</details>

## Part 2: Different Types of Comparisons

### Comparison Operators You Can Use

In [None]:
# == : Equal to
delaware = df[df['state'] == 'Delaware']
print("Delaware data:")
print(delaware[['state', 'year', 'property_all']])

In [None]:
# > : Greater than
high_burglary = df[df['property_burglary'] > 400]
print("\nHigh burglary rates (>400):")
print(high_burglary[['state', 'year', 'property_burglary']])

In [None]:
# < : Less than
low_motor = df[df['property_motor'] < 200]
print("\nLow motor vehicle theft rates (<200):")
print(low_motor[['state', 'year', 'property_motor']])

In [None]:
# >= : Greater than or equal to
recent = df[df['year'] >= 2017]
print("\nRecent data (2017 or later):")
print(recent[['state', 'year', 'property_all']])

In [None]:
# != : Not equal to
not_virginia = df[df['state'] != 'Virginia']
print("\nAll states except Virginia:")
print(not_virginia['state'].value_counts())

### 🎯 QUICK CHECK #2
Find all rows where the larceny rate is less than or equal to 1500.
Which states and years have such low larceny rates?

In [None]:
# Your code here:

<details>
<summary>Click for solution</summary>

```python
low_larceny = df[df['property_larceny'] <= 1500]
print(f"Number of rows with larceny rate <= 1500: {len(low_larceny)}")
low_larceny[['state', 'year', 'property_larceny']]
```
</details>

## Part 3: Combining Multiple Conditions

### Using AND (&) - Both conditions must be True

Sometimes you need multiple conditions to be true at once.

In [None]:
# Find Maryland data from 2018 or later
# Use & for AND - both conditions must be True
# IMPORTANT: Put parentheses around EACH condition!

maryland_recent = df[(df['state'] == 'Maryland') & (df['year'] >= 2018)]

print("Maryland data from 2018 onwards:")
maryland_recent

In [None]:
# Find state-years with BOTH high total property crime (>2500) AND high burglary (>400)
high_both = df[(df['property_all'] > 2500) & (df['property_burglary'] > 400)]

print("High rates in both categories:")
high_both[['state', 'year', 'property_all', 'property_burglary']]

### Common Mistake: Forgetting Parentheses

In [None]:
# This will cause an error - missing parentheses.
# Python evaluates & BEFORE == and >=, so it tries 'Maryland' & df['year'] first:
# maryland_recent = df[df['state'] == 'Maryland' & df['year'] >= 2018]  # ERROR!

# Correct version - each condition needs its own parentheses:
maryland_recent = df[(df['state'] == 'Maryland') & (df['year'] >= 2018)]

print("This works because we used parentheses correctly!")
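
If you want to see the failure safely, the sketch below wraps the broken line in `try`/`except` so the notebook keeps running. It uses an invented `toy` DataFrame rather than the class data, and it catches the error broadly because the exact error type and message may vary by pandas version.

```python
import pandas as pd

# Toy data invented for illustration
toy = pd.DataFrame({
    'state': ['Maryland', 'Maryland', 'Virginia'],
    'year': [2017, 2019, 2019],
})

# Without parentheses, & runs before == and >=,
# so Python first attempts 'Maryland' & toy['year'], which cannot work.
raised_error = False
try:
    toy[toy['state'] == 'Maryland' & toy['year'] >= 2018]
except Exception as err:  # usually a TypeError; caught broadly to be safe
    raised_error = True
    print(f"{type(err).__name__} raised without parentheses")

# With parentheses, each comparison runs first, then & combines the results
fixed = toy[(toy['state'] == 'Maryland') & (toy['year'] >= 2018)]
print(fixed)
```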

### 🎯 QUICK CHECK #3
Find all rows where:
- The year is 2019 AND
- The total property crime rate is less than 2000

How many state-years meet both criteria?

In [None]:
# Your code here:

<details>
<summary>Click for solution</summary>

```python
safe_2019 = df[(df['year'] == 2019) & (df['property_all'] < 2000)]
print(f"States with property crime < 2000 in 2019: {len(safe_2019)}")
safe_2019[['state', 'property_all']]
```
</details>

### Using OR (|) - At least one condition must be True

In [None]:
# Find data that's EITHER from 2015 OR from 2019
endpoints = df[(df['year'] == 2015) | (df['year'] == 2019)]

print("Data from first and last year in dataset:")
print(endpoints[['state', 'year', 'property_all']])
print(f"\nTotal rows: {len(endpoints)}")  # Should be 8 (4 states × 2 years)

In [None]:
# Find states that are EITHER Maryland OR Pennsylvania
mid_atlantic = df[(df['state'] == 'Maryland') | (df['state'] == 'Pennsylvania')]

print("Just Maryland and Pennsylvania:")
print(mid_atlantic[['state', 'year', 'property_all']])

### Using NOT (~) - Flip the condition

In [None]:
# Find all rows that are NOT from Maryland
not_maryland = df[~(df['state'] == 'Maryland')]

print("All states except Maryland:")
print(not_maryland['state'].value_counts())
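
For a single condition, `~(... == ...)` selects the same rows as `!=`. Where `~` earns its place is in flipping a whole combined condition. The sketch below illustrates both cases on an invented `toy` DataFrame (not the class data).

```python
import pandas as pd

# Toy data invented for illustration
toy = pd.DataFrame({
    'state': ['Maryland', 'Maryland', 'Virginia', 'Delaware'],
    'year': [2018, 2019, 2019, 2019],
})

# Single condition: ~(==) and != pick the same rows
via_not = toy[~(toy['state'] == 'Maryland')]
via_neq = toy[toy['state'] != 'Maryland']
print(via_not.equals(via_neq))  # True

# Combined condition: everything EXCEPT Maryland in 2019
md_2019 = (toy['state'] == 'Maryland') & (toy['year'] == 2019)
everything_else = toy[~md_2019]
print(everything_else)
```

Storing the combined condition in a variable (`md_2019` here) first makes the flipped version easier to read.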

### 🎯 QUICK CHECK #4
Find all rows where the state is EITHER Delaware OR Virginia.
Calculate the average total property crime rate for these two states combined.

In [None]:
# Your code here:

<details>
<summary>Click for solution</summary>

```python
de_va = df[(df['state'] == 'Delaware') | (df['state'] == 'Virginia')]
print("Delaware and Virginia data:")
print(de_va[['state', 'year', 'property_all']])

avg_property = de_va['property_all'].mean()
print(f"\nAverage property crime rate (DE + VA): {avg_property:.1f}")
```
</details>

## Part 4: The .isin() Method - Checking Multiple Values

### When You Have Many Values to Check

Instead of writing `(state == 'A') | (state == 'B') | (state == 'C')`, use `.isin()`:

In [None]:
# Find data for specific states using a list
border_states = ['Maryland', 'Virginia', 'Delaware']
border_data = df[df['state'].isin(border_states)]

print("Border states data:")
print(border_data[['state', 'year', 'property_all']])

In [None]:
# Check multiple years
recent_years = [2018, 2019]
recent_data = df[df['year'].isin(recent_years)]

print(f"\nData from {recent_years}:")
print(recent_data[['state', 'year', 'property_all']])
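
As a quick check that `.isin()` really is shorthand for a chain of `|` conditions, the sketch below runs both versions on an invented `toy` DataFrame and compares the results. It also shows that `~` can flip `.isin()` to mean "not in this list".

```python
import pandas as pd

# Toy data invented for illustration
toy = pd.DataFrame({
    'state': ['Maryland', 'Virginia', 'Delaware', 'Pennsylvania'],
    'year': [2019, 2019, 2019, 2019],
})

wanted = ['Maryland', 'Virginia', 'Delaware']

# The long way: one == test per state, joined with |
long_way = toy[(toy['state'] == 'Maryland') |
               (toy['state'] == 'Virginia') |
               (toy['state'] == 'Delaware')]

# The short way: one .isin() call with a list
short_way = toy[toy['state'].isin(wanted)]
print(long_way.equals(short_way))  # True

# ~ flips .isin() to select everything NOT in the list
others = toy[~toy['state'].isin(wanted)]
print(others['state'].tolist())  # ['Pennsylvania']
```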

### 🎯 QUICK CHECK #5
Create a list of years [2015, 2016, 2017] and filter for Pennsylvania data in those years only.
What was Pennsylvania's average motor vehicle theft rate during this period?

In [None]:
# Your code here:

<details>
<summary>Click for solution</summary>

```python
early_years = [2015, 2016, 2017]
pa_early = df[(df['state'] == 'Pennsylvania') & (df['year'].isin(early_years))]

print("Pennsylvania 2015-2017:")
print(pa_early[['year', 'property_motor']])

avg_motor = pa_early['property_motor'].mean()
print(f"\nAverage motor vehicle theft rate: {avg_motor:.2f}")
```
</details>

## Part 5: Creating New Columns

### Calculating New Information from Existing Columns

Sometimes you need to create new variables based on what you have:

In [None]:
# Calculate what percentage of property crime is burglary
df['pct_burglary'] = (df['property_burglary'] / df['property_all']) * 100

print("Original columns plus our new one:")
df[['state', 'year', 'property_all', 'property_burglary', 'pct_burglary']].head(n=5)

In [None]:
# Now we can use this new column like any other
print("Where is burglary the highest percentage of property crime?")
df.sort_values(by='pct_burglary', ascending=False)[['state', 'year', 'pct_burglary']].head(n=5)

In [None]:
# Calculate what percentage of property crime is larceny
df['pct_larceny'] = (df['property_larceny'] / df['property_all']) * 100

print("\nWhere is larceny the highest percentage of property crime?")
df.sort_values(by='pct_larceny', ascending=False)[['state', 'year', 'pct_larceny']].head(n=5)
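
New columns do not have to be percentages. As a rough sketch, the example below creates a difference column and a True/False column on an invented `toy` DataFrame. The column names `larceny_minus_burglary` and `above_1800`, and the 1800 cutoff, are made up for this illustration.

```python
import pandas as pd

# Toy data invented for illustration (not the real rates)
toy = pd.DataFrame({
    'state': ['Maryland', 'Virginia'],
    'property_all': [2000.0, 1600.0],
    'property_burglary': [300.0, 160.0],
    'property_larceny': [1500.0, 1300.0],
})

# A difference between two existing columns
toy['larceny_minus_burglary'] = toy['property_larceny'] - toy['property_burglary']

# A True/False column: the same comparison we use for filtering, saved as a column
toy['above_1800'] = toy['property_all'] > 1800

print(toy[['state', 'larceny_minus_burglary', 'above_1800']])

# A True/False column can be used directly as a filter
print(toy[toy['above_1800']]['state'].tolist())  # ['Maryland']
```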

### 🎯 QUICK CHECK #6
Create a new column called 'pct_motor' that shows what percentage of property crime is motor vehicle theft.
Formula: (property_motor / property_all) * 100

Which state-year has the highest percentage?

In [None]:
# Your code here:

<details>
<summary>Click for solution</summary>

```python
df['pct_motor'] = (df['property_motor'] / df['property_all']) * 100

print("Motor vehicle theft as percentage of property crime:")
df.sort_values(by='pct_motor', ascending=False)[['state', 'year', 'pct_motor']].head(n=5)
```
</details>

## Part 6: Putting It All Together - Real Analysis

### Answering Complex Criminological Questions

Let's use all our skills to answer real questions:

In [None]:
# Question 1: How did Maryland's property crime change from 2015 to 2019?

# Filter for just Maryland
md = df[df['state'] == 'Maryland'].sort_values(by='year')

# Look at the trend
print("Maryland property crime trend:")
print(md[['year', 'property_all']])
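
Printing the trend shows the numbers, but "how did it change?" can also be answered with a single figure. One possible approach, sketched on invented numbers, is to compare the first and last year after sorting. Here `.iloc[0]` and `.iloc[-1]` pick the first and last values by position.

```python
import pandas as pd

# Toy data invented for illustration (not Maryland's real rates)
toy = pd.DataFrame({
    'state': ['Maryland'] * 3 + ['Virginia'] * 3,
    'year': [2015, 2017, 2019] * 2,
    'property_all': [2400.0, 2200.0, 1800.0, 1900.0, 1800.0, 1700.0],
})

# Filter to one state and sort so the earliest year comes first
md_toy = toy[toy['state'] == 'Maryland'].sort_values(by='year')

first = md_toy['property_all'].iloc[0]    # earliest year
last = md_toy['property_all'].iloc[-1]    # latest year

change = last - first
pct_change = change / first * 100

print(f"Change: {change:.1f}")              # Change: -600.0
print(f"Percent change: {pct_change:.1f}%")  # Percent change: -25.0%
```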



In [None]:
# Question 2: Compare burglary rates across all states in 2019

data_2019 = df[df['year'] == 2019].sort_values(by='property_burglary', ascending=False)

print("Burglary rates by state (2019):")
print(data_2019[['state', 'property_burglary']])

### Pattern Recognition: Filter + Calculate

Most analyses follow this pattern:
1. **Filter** to get the subset you care about
2. **Calculate** statistics on that subset
3. **Compare** to other subsets or overall averages

Let's practice this pattern:

In [None]:
# Compare states with high vs low larceny rates in 2019
# Define "low" as larceny rate < 1400 in 2019

data_2019 = df[df['year'] == 2019]

low_larceny_states = data_2019[data_2019['property_larceny'] < 1400]
high_larceny_states = data_2019[data_2019['property_larceny'] >= 1400]

print("Low larceny states (2019):")
print(low_larceny_states[['state', 'property_larceny']])

print("\nHigh larceny states (2019):")
print(high_larceny_states[['state', 'property_larceny']])

print(f"\nAverage motor vehicle theft in low-larceny states: {low_larceny_states['property_motor'].mean():.1f}")
print(f"Average motor vehicle theft in high-larceny states: {high_larceny_states['property_motor'].mean():.1f}")
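
The third step of the pattern also mentions comparing a subset to the overall average. A minimal sketch of that idea follows; `compare_to_overall` is a hypothetical helper written for this example (it is not a pandas function), and the `toy` values are invented.

```python
import pandas as pd

# Toy data invented for illustration
toy = pd.DataFrame({
    'state': ['Maryland', 'Virginia', 'Delaware', 'Pennsylvania'],
    'year': [2019, 2019, 2019, 2019],
    'property_larceny': [1500.0, 1300.0, 1700.0, 1100.0],
    'property_motor': [200.0, 120.0, 180.0, 100.0],
})

def compare_to_overall(data, mask, column):
    """Hypothetical helper: return (subset mean, overall mean, difference)."""
    subset_mean = data[mask][column].mean()   # 1. filter, 2. calculate
    overall_mean = data[column].mean()
    return subset_mean, overall_mean, subset_mean - overall_mean  # 3. compare

low_larceny = toy['property_larceny'] < 1400
subset_mean, overall_mean, diff = compare_to_overall(toy, low_larceny, 'property_motor')

print(f"Low-larceny average: {subset_mean:.1f}")   # 110.0
print(f"Overall average:     {overall_mean:.1f}")  # 150.0
print(f"Difference:          {diff:.1f}")          # -40.0
```

Passing the True/False mask in as an argument means the same helper can be reused for any filter and any column.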

## Hands-On Exercise: Your Turn to Analyze

Use everything you learned today to answer these questions:

In [None]:
# Exercise 1: How many state-years had burglary rates above 450?
# Your code here:




In [None]:
# Exercise 2: What was Delaware's average larceny rate across all 5 years?
# Your code here:




In [None]:
# Exercise 3: Filter for years 2017-2019 only.
# Which state had the lowest average total property crime during this period?
# Your code here:




In [None]:
# Exercise 4: Create a new column called 'burglary_per_motor' (property_burglary / property_motor)
# Which state-year has the highest ratio?
# Your code here:




In [None]:
# Exercise 5: Compare Maryland and Pennsylvania in 2019.
# Which had higher burglary? Which had higher motor vehicle theft?
# Your code here:

## Using AI for Filtering Questions

### Effective Prompts:

```
I have a pandas DataFrame with columns: state, year, property_all,
property_burglary, property_larceny, property_motor

How do I filter for:
1. Rows where state is Maryland AND year is after 2016
2. Rows where property_all is between 2000 and 2500

Please explain the syntax with parentheses.
```

### Common AI Questions:
- "Why do I need parentheses when combining conditions?"
- "What's the difference between & and 'and'?" (Answer: use & for pandas! Python's `and` expects a single True/False, while `&` compares row by row.)
- "How do I filter for multiple values in a column?"
- "How do I create a new column based on a calculation?"
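
To preview the kind of answer you might get back, here is a hedged sketch covering two of these questions on an invented `toy` DataFrame: why `and` fails where `&` works, and two ways to write a "between 2000 and 2500" filter. `.between()` is a pandas Series method that includes both endpoints by default.

```python
import pandas as pd

# Toy data invented for illustration
toy = pd.DataFrame({
    'state': ['Maryland', 'Maryland', 'Virginia'],
    'year': [2016, 2018, 2018],
    'property_all': [2600.0, 2200.0, 1900.0],
})

# `and` wants ONE True/False, but each condition is a whole column of them
and_failed = False
try:
    toy[(toy['state'] == 'Maryland') and (toy['year'] > 2016)]
except ValueError as err:
    and_failed = True
    print("'and' failed:", err)

# & combines the conditions row by row
md_after_2016 = toy[(toy['state'] == 'Maryland') & (toy['year'] > 2016)]
print(md_after_2016)

# "Between 2000 and 2500": two comparisons joined with & ...
in_range = toy[(toy['property_all'] >= 2000) & (toy['property_all'] <= 2500)]
# ... or the .between() shortcut
in_range_short = toy[toy['property_all'].between(2000, 2500)]
print(in_range.equals(in_range_short))  # True
```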

## Wrap-Up: Key Takeaways

Today you learned how to **ask specific questions of your data**:

### Filtering Basics:
- `df[df['column'] == value]` - find exact matches
- `df[df['column'] > value]` - greater than
- `df[df['column'] < value]` - less than
- `df[df['column'] >= value]` - greater than or equal
- `df[df['column'] <= value]` - less than or equal
- `df[df['column'] != value]` - not equal

### Combining Conditions:
- `&` for AND (both must be True)
- `|` for OR (at least one must be True)
- `~` for NOT (flip the condition)
- **Always use parentheses around each condition!**
- `.isin([list])` for checking multiple values

### Creating New Columns:
- `df['new_col'] = df['col1'] + df['col2']` - calculations
- Can use any math: `+`, `-`, `*`, `/`, `**`
- Can calculate percentages, ratios, differences

### The Analysis Pattern:
1. **Filter** for your subset
2. **Calculate** statistics  
3. **Compare** or **sort** results

## Before Next Class

1. **Practice filtering:**
   - Try different comparison operators
   - Combine multiple conditions
   - Use `.isin()` with different lists

2. **Experiment with new columns:**
   - Create ratios between existing columns
   - Calculate differences
   - Make percentage calculations

3. **Answer your own questions:**
   - Think of a question about the data
   - Use filtering to find the answer
   - Calculate relevant statistics

4. **Use AI when stuck:**
   - Copy your code and the error
   - Ask AI to explain the problem
   - Ask for alternative approaches

## Quick Reference Card

```python
# Filtering - Single Condition
df[df['column'] == value]     # Equal to
df[df['column'] > value]      # Greater than
df[df['column'] < value]      # Less than
df[df['column'] >= value]     # Greater/equal
df[df['column'] <= value]     # Less/equal
df[df['column'] != value]     # Not equal

# Filtering - Multiple Conditions (USE PARENTHESES!)
df[(df['col1'] == val1) & (df['col2'] > val2)]   # AND
df[(df['col1'] == val1) | (df['col2'] > val2)]   # OR
df[~(df['col1'] == val1)]                         # NOT

# Multiple values
df[df['column'].isin([val1, val2, val3])]

# Creating new columns
df['new_col'] = df['col1'] + df['col2']          # Add
df['new_col'] = df['col1'] - df['col2']          # Subtract
df['new_col'] = df['col1'] / df['col2']          # Divide
df['new_col'] = (df['col1'] / df['col2']) * 100  # Percentage

# Filter + Calculate pattern
subset = df[df['column'] > value]   # any True/False condition works here
result = subset['column'].mean()
```