
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zjelveh/zjelveh.github.io/blob/master/files/cfc/4_pandas_intro_lecture.ipynb)

**IMPORTANT**: Save your own copy!
1. Click File → Save a copy in Drive
2. Rename it 
3. Work in YOUR copy, not the original


---


# 4. Introduction to Pandas - From Lists to DataFrames
## CCJS 418E: Coding for Criminology


Today's Goals:
- Understand why we need DataFrames instead of lists
- Load and explore real crime data with pandas
- Select and filter data to answer criminological questions
- Calculate statistics on crime data columns
- Learn the fundamental pandas operations you'll use constantly


## Part 1: The Problem with Lists

### When Lists Become a Nightmare

Let's say you're analyzing crime data for three states. You need to track the state name, violent crime rate, and property crime rate for each.

**Connection to Computational Thinking: ABSTRACTION - hiding complexity behind simple, organized structures**

In [None]:
# Managing related data in separate lists - this gets messy fast!
states = ["Maryland", "Virginia", "Pennsylvania"]
violent_rates = [472.0, 212.7, 306.0]  # per 100,000
property_rates = [2023.6, 1687.7, 1744.2]  # per 100,000
years = [2019, 2019, 2019]

# To find Virginia's violent crime rate, we need to:
# 1. Find Virginia's position in the states list
virginia_index = states.index("Virginia")
print(f"This is the index for Virginia: {virginia_index}")
# 2. Use that position to get the rate from another list
print(f"Virginia violent crime rate: {violent_rates[virginia_index]}")

### What Goes Wrong?

1. **Lists can get out of sync** - What if you sort one list but forget the others?
2. **Inserting/deleting is risky** - Remove Maryland from states, but forget to remove its rates
3. **No built-in relationship** - Python doesn't know these lists are connected
4. **Inefficient for large data** - Imagine doing this for all 50 states over 20 years!

In [None]:
# The nightmare scenario - lists get out of sync
# Note: .copy() creates a duplicate of the list in memory, so the original stays untouched
states_copy = states.copy()
violent_rates_copy = violent_rates.copy()

# Someone sorts the states alphabetically but forgets the rates!
states_copy.sort()
print("States (sorted):", states_copy)
print("Violent rates (not sorted):", violent_rates_copy)
print(f"Now {states_copy[0]} appears to have rate {violent_rates_copy[0]} - WRONG!")
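
With plain lists, one workaround is to pair each state with its rate before sorting, so the two values travel together. A minimal sketch, reusing the numbers from above (the name `paired` is just for illustration):

```python
# Pair each state with its rate so they sort together
states = ["Maryland", "Virginia", "Pennsylvania"]
violent_rates = [472.0, 212.7, 306.0]  # per 100,000

# zip() builds (state, rate) tuples; sorted() orders them by state name
paired = sorted(zip(states, violent_rates))
for state, rate in paired:
    print(f"{state}: {rate}")
```

This works, but you have to remember to do it for every list, every time. That bookkeeping is what DataFrames take off your hands.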

## Part 2: Enter Pandas DataFrames

### The Solution: Keeping Related Data Together

Pandas solves this problem with **DataFrames** - think of them as spreadsheets in Python where:
- Each row stays together (no more sync issues!)
- Columns have names (access by meaning, not position)
- Built-in tools for sorting, filtering, and calculating

**Think of it as**: Excel or Google Sheets, but programmable

Instead of juggling separate lists, a DataFrame would organize our data like this:
```
    state         violent_rate  property_rate  year
0   Maryland      472.0         2023.6         2019
1   Virginia      212.7         1687.7         2019
2   Pennsylvania  306.0         1744.2         2019
```
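
As a rough sketch of how that table could be built by hand, the lists from Part 1 can be passed to `pd.DataFrame` as a dictionary of columns. The names `small_df` and `sorted_df` are made up for this example:

```python
import pandas as pd

# Build a small DataFrame from the same values used in the lists above
small_df = pd.DataFrame({
    "state": ["Maryland", "Virginia", "Pennsylvania"],
    "violent_rate": [472.0, 212.7, 306.0],      # per 100,000
    "property_rate": [2023.6, 1687.7, 1744.2],  # per 100,000
    "year": [2019, 2019, 2019],
})

# Sorting by state moves whole rows, so each rate stays with its state
sorted_df = small_df.sort_values(by="state")
print(sorted_df)
```

Notice that Pennsylvania keeps its own rate (306.0) after sorting, which is exactly what went wrong with the separate lists.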

Let's see this in action with real crime data!

## Part 3: Loading Real Crime Data

### Reading Data from CSV Files

Most real data comes in CSV (Comma-Separated Values) files. Pandas makes loading them simple.

In [None]:
# First, import pandas (conventionally abbreviated as pd)
import pandas as pd

# Load actual state crime data
df = pd.read_csv(filepath_or_buffer='https://raw.githubusercontent.com/zjelveh/zjelveh.github.io/refs/heads/master/files/cfc/state_crime.csv')

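# Clean up the column names: drop the leading 'Data.' prefix,
# then replace the remaining dots with underscores and lowercase everything
df.columns = df.columns.str.replace(r'^Data\.', '', regex=True)
df.columns = df.columns.str.replace(r'\.', '_', regex=True).str.lower()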

# The variable name 'df' is a common convention for DataFrame
print("Data loaded successfully!")
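
To see what the two cleaning lines do, here is a small sketch on an in-memory CSV. It assumes the raw file uses headers such as `State`, `Year`, and `Data.Rates.Violent.All`; the data row and the names `raw_csv` and `demo` are made up for illustration:

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for the real file (the values are made up)
raw_csv = io.StringIO(
    "State,Year,Data.Rates.Violent.All\n"
    "Exampleland,2019,100.0\n"
)
demo = pd.read_csv(raw_csv)
print("Before:", list(demo.columns))

# Step 1: drop the leading 'Data.' prefix
demo.columns = demo.columns.str.replace(r'^Data\.', '', regex=True)
# Step 2: turn the remaining dots into underscores and lowercase everything
demo.columns = demo.columns.str.replace(r'\.', '_', regex=True).str.lower()
print("After: ", list(demo.columns))
```

Short, lowercase names with underscores are easier to type and less error-prone than the original dotted headers.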

## Part 4: Exploring Your Data

### First Look at the Data

Before any analysis, always explore what you have:

In [None]:
# See the first 5 rows - like peeking at the top of a spreadsheet
print("First 5 rows of our crime data:")
df.head(n=5)

In [None]:
# How big is our dataset?
print(f"Dataset size: {df.shape}")
print(f"  - {df.shape[0]} rows (observations)")
print(f"  - {df.shape[1]} columns (variables)")

# Connection: df.shape is like (len(list), number_of_attributes)

In [None]:
# What columns do we have?
print("Column names:")
print(df.columns)

In [None]:
# What types of data are in each column?
print("\nData types and basic info:")
df.info()

### Common Mistake: Not Exploring First
Always look at your data before analyzing! You might find:
- Missing values
- Unexpected column names
- Wrong data types
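
A quick way to check for the first and third problems is sketched below on a hypothetical toy DataFrame (`toy` and its values are invented, not the course data):

```python
import pandas as pd

# Hypothetical toy data with two deliberate problems:
# a missing rate, and a year column stored as text
toy = pd.DataFrame({
    "state": ["A", "B", "C"],
    "year": ["2019", "2019", "2019"],   # strings, not integers
    "rate": [310.5, None, 275.0],       # None becomes NaN (missing)
})

# Count missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Check the type pandas assigned to each column
print(toy.dtypes)
```

If a column you expect to be numeric shows up as text, comparisons like `> 2015` will not behave the way you expect.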

## Part 5: Selecting Columns

### Getting a Single Column

Access a column with square brackets, much like indexing a list, except you use the column's name instead of a position.

In [None]:
# Select a single column - returns a Series (like a labeled list)
states_column = df['state']
print("First 10 states:")
print(states_column.head(n=10))

In [None]:
# You can perform operations on columns just like lists
violent_crime_rates = df['rates_violent_all']

# These operations should feel familiar from working with lists!
print(f"Maximum violent crime rate: {violent_crime_rates.max()}")
print(f"Minimum violent crime rate: {violent_crime_rates.min()}")
print(f"Average violent crime rate: {violent_crime_rates.mean():.1f}")

# Connection: .max() is like max(list), .mean() is like sum(list)/len(list)
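
If you want several of these statistics at once, `.describe()` is a handy shortcut. A minimal sketch on a made-up Series (`rates` is hypothetical):

```python
import pandas as pd

# Hypothetical rates, just to show the method
rates = pd.Series([200.0, 300.0, 400.0], name="rate")

# describe() reports count, mean, std, min, quartiles, and max at once
summary = rates.describe()
print(summary)

# The individual methods give the same numbers
print(rates.mean() == summary["mean"])
```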

### Selecting Multiple Columns

In [None]:
# Select multiple columns - use a list of column names
subset = df[['state', 'year', 'rates_violent_all']]
print("Just state, year, and violent crime rate:")
subset.head(n=5)
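
The double brackets matter: the outer pair selects, and the inner pair is a Python list of column names. The sketch below, on a hypothetical `toy` DataFrame, shows that the two forms return different kinds of objects:

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["A", "B"],
    "year": [2019, 2019],
    "rate": [310.5, 275.0],
})

one_column = toy["rate"]        # single brackets -> Series
one_column_df = toy[["rate"]]   # double brackets -> DataFrame with one column

print(type(one_column).__name__)
print(type(one_column_df).__name__)
```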

## Part 6: Understanding Your Data with Methods

### Counting and Unique Values

These methods help you understand what's in your data:

In [None]:
# How many unique states are in our data?
n_states = df['state'].nunique()
print(f"Number of unique states: {n_states}")

# What are they?
unique_states = df['state'].unique()
print(f"\nFirst 5 states: {unique_states[:5]}")

In [None]:
# How many times does each year appear?
year_counts = df['year'].value_counts()
print("Number of records per year:")
print(year_counts.head(n=5))

In [None]:
# Get proportions instead of counts
year_proportions = df['year'].value_counts(normalize=True)
print("\nProportion of records per year:")
print(year_proportions.head(n=5))
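
One detail worth knowing: `value_counts()` orders results by how often each value appears, not by the value itself. A small sketch with invented data (`years` is hypothetical):

```python
import pandas as pd

# Hypothetical toy column: 2019 appears three times, 2018 once
years = pd.Series([2019, 2018, 2019, 2019], name="year")

counts = years.value_counts()                 # most frequent value first
shares = years.value_counts(normalize=True)   # same order, as fractions

print(counts)
print(shares)
print("Shares add up to:", shares.sum())
```

To list the years chronologically instead, you could add `.sort_index()` to the result.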

## Part 7: Filtering Data (Boolean Indexing)

### Finding Specific Rows

This is like using if statements with lists, but more powerful!

In [None]:
# Find all records from 2019
# This creates a boolean mask - True/False for each row
is_2019 = df['year'] == 2019
print(f"Number of 2019 records: {is_2019.sum()}")  # True = 1, False = 0

# Use the mask to filter
df_2019 = df[is_2019]
print(f"\n2019 data shape: {df_2019.shape}")
df_2019.head()
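
To make the mask itself visible, here is the same idea on a three-row toy DataFrame (`toy` and `mask` are illustrative names with made-up values):

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["A", "A", "B"],
    "year": [2018, 2019, 2019],
})

mask = toy["year"] == 2019   # a Series of True/False, one per row
print(mask.tolist())
print(mask.sum())            # True counts as 1
print(toy[mask])             # only rows where the mask is True
```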

In [None]:
# More compact syntax - build the mask and filter in one step
df_2019 = df[df['year'] == 2019]
print("Maryland's 2019 data:")
df_2019[df_2019['state'] == 'Maryland'].head()

In [None]:
# Filter for high violent crime rates
high_crime = df[df['rates_violent_all'] > 500]
print(f"Number of state-years with violent crime rate > 500: {len(high_crime)}")
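print("\nHighest violent crime rates among them:")
# Sort first so that head() really shows the highest rates
top_high_crime = high_crime.sort_values(by='rates_violent_all', ascending=False)
print(top_high_crime[['state', 'year', 'rates_violent_all']].head())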

### Combining Conditions

Use `&` for AND, `|` for OR, and `~` for NOT.

Remember: Use parentheses around each condition!

In [None]:
# Find Maryland data from 2015 or later
maryland_recent = df[(df['state'] == 'Maryland') & (df['year'] >= 2015)]
print("Maryland crime data, 2015-present:")
print(maryland_recent[['year', 'rates_violent_all', 'rates_property_all']])

In [None]:
# Common Mistake: Forgetting parentheses
# Without them, Python evaluates 'Maryland' & df['year'] first, which raises an error:
# maryland_recent = df[df['state'] == 'Maryland' & df['year'] >= 2015]  # Error!

# Correct: Each condition needs parentheses
maryland_recent = df[(df['state'] == 'Maryland') & (df['year'] >= 2015)]
maryland_recent.head()
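
The examples above only use `&`. Here is a short sketch of `|` and `~` on a hypothetical toy DataFrame (all values are invented):

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["Maryland", "Maryland", "Virginia", "Virginia"],
    "year": [2014, 2019, 2014, 2019],
    "rate": [450.0, 470.0, 200.0, 210.0],
})

# OR: rows that are Maryland, or have a rate under 205 (or both)
either = toy[(toy["state"] == "Maryland") | (toy["rate"] < 205)]

# NOT: every row that is not Maryland
not_maryland = toy[~(toy["state"] == "Maryland")]

print(either)
print(not_maryland)
```

The parentheses are needed because `&` and `|` bind more tightly than comparisons such as `==` and `<`, so Python would otherwise group the expression in the wrong order.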

## Part 8: Calculating with Columns

### Creating New Columns from Existing Ones

In [None]:
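# Work on an independent copy of the 2019 rows so new columns can be added safely
df_2019 = df[df['year'] == 2019].copy()

# Calculate total crime rate (violent + property)
df_2019['total_crime_rate'] = df_2019['rates_violent_all'] + df_2019['rates_property_all']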

# Find states with highest total crime
print("States with highest total crime rates in 2019:")
sorted_2019 = df_2019.sort_values(by='total_crime_rate', ascending=False)
sorted_2019[['state', 'total_crime_rate']].head(n=5)

In [None]:
# Calculate what percentage of crime is violent
df_2019['pct_violent'] = (df_2019['rates_violent_all'] / 
                          df_2019['total_crime_rate'] * 100)

print("\nStates where violent crime is highest percentage of total:")
sorted_by_pct = df_2019.sort_values(by='pct_violent', ascending=False)
print(sorted_by_pct[['state', 'pct_violent']].head(n=5))
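
Depending on your pandas version, adding a column to a filtered DataFrame can trigger a `SettingWithCopyWarning`. One common way to avoid it is to call `.copy()` on the filtered result first. A minimal sketch with made-up numbers (`toy` and `toy_2019` are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["A", "A", "B"],
    "year": [2018, 2019, 2019],
    "violent": [300.0, 320.0, 150.0],
    "property": [1700.0, 1680.0, 1350.0],
})

# .copy() gives the filtered rows their own independent DataFrame
toy_2019 = toy[toy["year"] == 2019].copy()

# Column arithmetic runs row by row
toy_2019["total"] = toy_2019["violent"] + toy_2019["property"]
toy_2019["pct_violent"] = toy_2019["violent"] / toy_2019["total"] * 100

print(toy_2019)
print("Original columns unchanged:", list(toy.columns))
```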

## Part 9: Common Pandas Patterns

### Pattern 1: Filter and Calculate
Find a subset of data and calculate statistics on it

In [None]:
# What's the average violent crime rate in the Mid-Atlantic states?
mid_atlantic = ['Maryland', 'Delaware', 'Pennsylvania', 'New Jersey', 'New York']
ma_data = df[df['state'].isin(mid_atlantic)]
ma_2019 = ma_data[ma_data['year'] == 2019]

avg_violent = ma_2019['rates_violent_all'].mean()
print(f"Average violent crime rate in Mid-Atlantic (2019): {avg_violent:.1f}")
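
The two filtering steps can also be combined into a single condition. A sketch on invented data, where `region` is a hypothetical list of states:

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["A", "B", "C", "A"],
    "year": [2019, 2019, 2019, 2018],
    "rate": [100.0, 200.0, 300.0, 999.0],
})

region = ["A", "B"]  # hypothetical region

# Both filters in one step: in the region AND in 2019
in_region_2019 = toy[toy["state"].isin(region) & (toy["year"] == 2019)]
avg_rate = in_region_2019["rate"].mean()
print(f"Average rate: {avg_rate:.1f}")
```

Either style is fine. Separate steps are easier to debug, while the combined version is shorter.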

### Pattern 2: Find Extremes
Identify the highest/lowest values

In [None]:
# Which state had the lowest property crime rate in 2019?
df_2019 = df[df['year'] == 2019]
min_property_idx = df_2019['rates_property_all'].idxmin()
safest_state = df_2019.loc[min_property_idx]
print("Lowest property crime rate in 2019:")
print(f"  State: {safest_state['state']}")
print(f"  Rate: {safest_state['rates_property_all']:.1f} per 100,000")
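
Here is the same pattern on a tiny made-up DataFrame, with `nsmallest()` shown as one possible alternative (`toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["A", "B", "C"],
    "rate": [1700.0, 1350.0, 2100.0],
})

# idxmin() returns the index label of the smallest value...
lowest_label = toy["rate"].idxmin()
# ...and .loc[] uses that label to pull out the whole row
lowest_row = toy.loc[lowest_label]
print(lowest_row["state"], lowest_row["rate"])

# nsmallest() finds the same row, and can return more than one
lowest_two = toy.nsmallest(n=2, columns="rate")
print(lowest_two)
```

Because `idxmin()` returns a label rather than a position, it still works with `.loc[]` on a filtered DataFrame whose index no longer starts at 0.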

## Part 10: Using AI Tools for Pandas Help

### Effective Prompts for Pandas Problems:

Example prompt:
```
I'm a criminology student learning pandas. I have a DataFrame called 'df' with columns:
- state (string)
- year (integer)
- rates_violent_all (float)

How do I find the 5 states with the highest violent crime rates in 2019?
```
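
For reference, one reasonable answer to that prompt might look like the sketch below. It uses only methods from this lecture and the cleaned column names from Part 3 (`state`, `year`, `rates_violent_all`); the `toy` data is invented so the example runs on its own:

```python
import pandas as pd

# Hypothetical stand-in for df, using the cleaned column names from Part 3
toy = pd.DataFrame({
    "state": ["A", "B", "C", "D", "E", "F", "A"],
    "year": [2019, 2019, 2019, 2019, 2019, 2019, 2018],
    "rates_violent_all": [300.0, 800.0, 150.0, 620.0, 410.0, 505.0, 900.0],
})

# Step 1: keep only 2019; Step 2: sort high to low; Step 3: take the top 5
top5 = (
    toy[toy["year"] == 2019]
    .sort_values(by="rates_violent_all", ascending=False)
    .head(n=5)
)
print(top5[["state", "rates_violent_all"]])
```

When an AI tool gives you code like this, check that every column name matches what `df.columns` actually shows.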

### Good Questions to Ask AI:
- "How do I filter a DataFrame for multiple conditions?"
- "What's the difference between loc and iloc?"
- "How do I handle missing values in pandas?"
- "How do I sort a DataFrame by multiple columns?"

## Hands-On Exercise: State Crime Analysis

Complete each step to analyze crime patterns:

In [None]:
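# Reload and clean the data, repeating the steps from Part 3
df = pd.read_csv(filepath_or_buffer='https://raw.githubusercontent.com/zjelveh/zjelveh.github.io/refs/heads/master/files/cfc/state_crime.csv')
df.columns = df.columns.str.replace(r'^Data\.', '', regex=True)
df.columns = df.columns.str.replace(r'\.', '_', regex=True).str.lower()
df.head()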



In [None]:

# Step 1: How many years of data do we have?
# Your code here:




In [None]:
# Step 2: What's the average property crime rate across all states in 2019?
# Your code here:




In [None]:
# Step 3: Which state had the highest violent crime rate in 2018?
# Your code here:




In [None]:
# Step 4: How many states had violent crime rates below 200 in 2019?
# Your code here:




In [None]:
# Step 5: Create a subset with just California data from 2015-2019
# Your code here:

## Challenge Exercise: Regional Analysis

Compare crime rates across different regions:

In [None]:
# Define regions
northeast = ['Maine', 'New Hampshire', 'Vermont', 'Massachusetts', 'Rhode Island',
             'Connecticut', 'New York', 'New Jersey', 'Pennsylvania']
south = ['Delaware', 'Maryland', 'Virginia', 'West Virginia', 'North Carolina',
         'South Carolina', 'Georgia', 'Florida', 'Kentucky', 'Tennessee',
         'Alabama', 'Mississippi', 'Arkansas', 'Louisiana', 'Oklahoma', 'Texas']

# For 2019 data, calculate and compare average violent crime rates by region
# Your code here:

## Wrap-Up: Key Takeaways

Today you learned:
1. **DataFrames** keep related data together (like a spreadsheet)
2. **pd.read_csv()** loads data from files
3. **df.head()**, **df.shape**, **df.info()** help explore data
4. **df['column']** selects columns (like list indexing but with names)
5. **Boolean indexing** filters rows (like if statements for entire columns)
6. **.mean()**, **.max()**, **.min()** calculate statistics (like list operations)
7. **Always explore first** - understand your data before analyzing

### The Pandas-to-Lists Connection:
- `df['column']` is like accessing a named list
- `df[df['column'] > value]` is like filtering a list with a condition
- `df['column'].mean()` is like `sum(list)/len(list)`
- `df.shape[0]` is like `len(list)`
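
The sketch below puts the list version and the pandas version side by side, using the three rates from Part 1 (the variable names are illustrative):

```python
import pandas as pd

rates_list = [472.0, 212.7, 306.0]
rates_df = pd.DataFrame({"rate": rates_list})

# Filtering: list comprehension vs. boolean indexing
high_list = [r for r in rates_list if r > 300]
high_df = rates_df[rates_df["rate"] > 300]

# Average: manual formula vs. .mean()
mean_list = sum(rates_list) / len(rates_list)
mean_df = rates_df["rate"].mean()

# Compare the results, then the lengths: len() vs. shape[0]
print(high_list, high_df["rate"].tolist())
print(round(mean_list, 1), round(mean_df, 1))
print(len(rates_list), rates_df.shape[0])
```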

## Before Next Class

1. **Practice the basics:**
   - Load a CSV file
   - Explore it with `.head()`, `.shape`, `.info()`
   - Select columns and calculate statistics
   - Filter for specific conditions

2. **Get comfortable with syntax:**
   - Remember: square brackets for columns
   - Use parentheses around each condition when combining
   - Try different methods: `.unique()`, `.value_counts()`, `.describe()`

3. **Experiment with errors:**
   - Try filtering without parentheses
   - Access a column that doesn't exist
   - Learn from the error messages

## Quick Reference

### Essential Pandas Operations:
```python
# Loading and exploring
df = pd.read_csv(filepath_or_buffer='file.csv')
df.head(n=5)                  # First 5 rows
df.shape                       # (rows, columns)
df.info()                      # Column types and non-null counts
df.columns                     # Column names

# Selecting
df['column']                   # Single column
df[['col1', 'col2']]          # Multiple columns

# Filtering
df[df['column'] == value]     # Equal to
df[df['column'] > value]      # Greater than
df[(cond1) & (cond2)]         # Multiple conditions

# Statistics
df['column'].mean()           # Average
df['column'].max()            # Maximum
df['column'].min()            # Minimum
df['column'].sum()            # Total
df['column'].count()          # Non-null count

# Counting
df['column'].nunique()        # Number of unique values
df['column'].unique()         # Array of unique values
df['column'].value_counts()   # Frequency of each value
df['column'].value_counts(normalize=True)  # Proportions

# Sorting
df.sort_values(by='column', ascending=False)  # Sort by column
```