
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zjelveh/zjelveh.github.io/blob/master/files/cfc/4_pandas_intro_lecture.ipynb)

**IMPORTANT**: Save your own copy!
1. Click File → Save a copy in Drive
2. Rename it
3. Work in YOUR copy, not the original


---


# 4. Introduction to Pandas - From Lists to DataFrames
## CCJS 418E: Coding for Criminology

Today's Goals:
- Understand why we need DataFrames instead of lists
- Load and explore real crime data with pandas
- Select columns to focus on what matters
- Calculate basic statistics on crime data
- Get comfortable with pandas fundamentals before we do more complex analysis

Note: Today we're learning the basics. Next class we'll learn filtering and more advanced operations!


## Part 1: The Problem with Lists

### When Lists Become a Nightmare

Let's say you're analyzing crime data for three states. You need to track the state name, property crime rate, and burglary rate for each.

**Connection to Computational Thinking: ABSTRACTION - hiding complexity behind simple, organized structures**

In [1]:
# Managing related data in separate lists - this gets messy fast!
states = ["Maryland", "Virginia", "Pennsylvania"]
property_rates = [2023.6, 1687.7, 1744.2]  # per 100,000
burglary_rates = [472.0, 212.7, 306.0]  # per 100,000

# To find Virginia's property crime rate, we need to:
# 1. Find Virginia's position in the states list
virginia_index = states.index("Virginia")
# 2. Use that position to get the rate from another list
print(f"Virginia property crime rate: {property_rates[virginia_index]}")

Virginia property crime rate: 1687.7


### What Goes Wrong?

1. **Lists can get out of sync** - What if you sort one list but forget the others?
2. **Inserting/deleting is risky** - Remove Maryland from `states` but forget to remove its rates, and every remaining state lines up with the wrong number
3. **No built-in relationship** - Python doesn't know these lists are connected
4. **Hard to scale** - Imagine doing this for 50 states over 20 years!

In [2]:
# The nightmare scenario - lists get out of sync
states_copy = states.copy()
property_rates_copy = property_rates.copy()

# Someone sorts the states alphabetically but forgets the rates!
states_copy.sort()
print("States (sorted):", states_copy)
print("Property rates (not sorted):", property_rates_copy)
# Position 1 is now Pennsylvania, but property_rates_copy[1] is still Virginia's rate
print(f"\nNow {states_copy[1]} appears to have rate {property_rates_copy[1]} - WRONG!")

States (sorted): ['Maryland', 'Pennsylvania', 'Virginia']
Property rates (not sorted): [2023.6, 1687.7, 1744.2]

Now Pennsylvania appears to have rate 1687.7 - WRONG!
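
Staying with plain lists, one workaround is to pair the values up before sorting. The sketch below is only an illustration (the names `pairs`, `sorted_states`, and `sorted_rates` are made up for this example); it shows how much manual care the list approach needs to stay in sync.

```python
# Parallel lists from the example above
states = ["Maryland", "Virginia", "Pennsylvania"]
property_rates = [2023.6, 1687.7, 1744.2]  # per 100,000

# Pair each state with its rate, then sort the pairs by state name
pairs = sorted(zip(states, property_rates))

# Unzip back into two lists that are still aligned with each other
sorted_states, sorted_rates = zip(*pairs)
sorted_states = list(sorted_states)
sorted_rates = list(sorted_rates)

print(sorted_states)  # ['Maryland', 'Pennsylvania', 'Virginia']
print(sorted_rates)   # [2023.6, 1744.2, 1687.7]
```

This works, but every sort, insert, or delete needs the same zip-and-unzip routine for every related list.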


## Part 2: Enter Pandas DataFrames

### The Solution: Keeping Related Data Together

Pandas solves this problem with **DataFrames** - think of them as spreadsheets in Python where:
- Each row stays together (no more sync issues!)
- Columns have names (access by meaning, not position)
- Built-in tools for sorting, filtering, and calculating

**Think of it as**: Excel or Google Sheets, but programmable

Instead of juggling separate lists, a DataFrame would organize our data like this:
```
  state         property_all  property_burglary
0 Maryland      2023.6        472.0        
1 Virginia      1687.7        212.7        
2 Pennsylvania  1744.2        306.0        
```
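
As a rough sketch, the table above could be built from the Part 1 values like this. The variable names `crime` and `crime_sorted` are chosen only for this illustration.

```python
import pandas as pd

# Build a DataFrame from a dictionary: each key becomes a column name
crime = pd.DataFrame({
    "state": ["Maryland", "Virginia", "Pennsylvania"],
    "property_all": [2023.6, 1687.7, 1744.2],     # per 100,000
    "property_burglary": [472.0, 212.7, 306.0],   # per 100,000
})

# Sorting moves whole rows, so every rate stays with its state
crime_sorted = crime.sort_values(by="state")
print(crime_sorted)
```

After sorting, Pennsylvania still carries its own rate of 1744.2, which is exactly what went wrong with the separate lists.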

Let's see this in action with real crime data!

## Part 3: Loading Real Crime Data

### Reading Data from CSV Files

Most real data comes in CSV (Comma-Separated Values) files. Pandas makes loading them simple.
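
To see what a CSV file looks like on the inside, here is a small sketch that reads one from a string instead of a file. The `csv_text` and `sample` names are made up for this example, and `io.StringIO` just makes the string behave like a file.

```python
import io
import pandas as pd

# A tiny CSV written as a string: the first line is the header row,
# and commas separate the values on every line
csv_text = """state,year,property_all
Maryland,2019,1950.2
Virginia,2019,1642.7
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
print(sample.shape)  # (2, 3)
```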

In [3]:
# First, import pandas (conventionally abbreviated as pd)
import pandas as pd

# Load actual state crime data
df = pd.read_csv(filepath_or_buffer='https://raw.githubusercontent.com/zjelveh/zjelveh.github.io/refs/heads/master/files/cfc/state_crime.csv')

# Cleaning up the column names
# Raw strings (r'...') keep the backslash; \. matches a literal dot
df.columns = df.columns.str.replace(r'^Data\.', '', regex=True)
df.columns = df.columns.str.replace(r'\.', '_', regex=True).str.lower()

# Filter to specific years and states to keep dataset manageable
df = df[df.year.isin([2015, 2016, 2017, 2018, 2019])]
df = df[df.state.isin(['Maryland', 'Virginia', 'Delaware', 'Pennsylvania'])]

# Select just the property crime columns we want to focus on
df = df[['state', 'year', 'rates_property_all',
       'rates_property_burglary', 'rates_property_larceny',
       'rates_property_motor']]

# Clean up column names one more time
df.columns = df.columns.str.replace('rates_', '')

# The variable name 'df' is a common convention for DataFrame
print("Data loaded successfully!")
print(f"Dataset has {len(df)} rows")

Data loaded successfully!
Dataset has 20 rows
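
The column-cleaning steps in the cell above are easier to follow on a few names at a time. This sketch assumes the raw file uses names shaped like `Data.Rates.Property.All`; the names listed here are illustrative stand-ins, not a verified copy of the file's header.

```python
import pandas as pd

# Hypothetical raw column names, assumed to resemble the downloaded file
cols = pd.Index(["State", "Year",
                 "Data.Rates.Property.All",
                 "Data.Rates.Property.Burglary"])

cols = cols.str.replace(r"^Data\.", "", regex=True)  # drop the leading "Data."
cols = cols.str.replace(r"\.", "_", regex=True)      # turn dots into underscores
cols = cols.str.lower()                              # lowercase everything
cols = cols.str.replace("rates_", "", regex=False)   # drop the "rates_" prefix

print(list(cols))
# ['state', 'year', 'property_all', 'property_burglary']
```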


## Part 4: First Look at Your Data

### Always Start by Exploring

Before any analysis, always look at what you have. Let's use three essential commands:

In [4]:
# 1. See the first few rows - like peeking at the top of a spreadsheet
print("First 5 rows of our crime data:")
df.head(n=5)

First 5 rows of our crime data:


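|     | state | year | property_all | property_burglary | property_larceny | property_motor |
|-----|-------|------|--------------|-------------------|------------------|----------------|
| 475 | Delaware | 2015 | 2691.0 | 504.6 | 2061.6 | 124.9 |
| 476 | Delaware | 2016 | 2766.0 | 527.6 | 2078.7 | 159.7 |
| 477 | Delaware | 2017 | 2440.6 | 412.7 | 1885.6 | 142.3 |
| 478 | Delaware | 2018 | 2324.4 | 326.5 | 1845.3 | 152.6 |
| 479 | Delaware | 2019 | 2252.2 | 304.8 | 1782.7 | 164.7 |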


In [5]:
# 2. How big is our dataset?
print(f"Dataset shape: {df.shape}")
print(f"  - {df.shape[0]} rows (each row is one state-year)")
print(f"  - {df.shape[1]} columns (different pieces of information)")

# Connection to lists: df.shape is like (len(list), number_of_things_we_track)

Dataset shape: (20, 6)
  - 20 rows (each row is one state-year)
  - 6 columns (different pieces of information)


In [6]:
# 3. What columns do we have and what type of data?
print("Basic information about our dataset:")
df.info()

Basic information about our dataset:
<class 'pandas.core.frame.DataFrame'>
Index: 20 entries, 475 to 2874
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   state              20 non-null     object 
 1   year               20 non-null     int64  
 2   property_all       20 non-null     float64
 3   property_burglary  20 non-null     float64
 4   property_larceny   20 non-null     float64
 5   property_motor     20 non-null     float64
dtypes: float64(4), int64(1), object(1)
memory usage: 1.1+ KB


### 🎯 QUICK CHECK #1
Try it yourself! Look at the **last** 5 rows instead of the first 5.
Hint: If there's a `.head()` method, what do you think shows the bottom?

In [7]:
# Your code here:

<details>
<summary>Click for solution</summary>

```python
df.tail(n=5)
```
</details>

## Part 5: Selecting Columns

### Getting a Single Column

Access a column by putting its name in square brackets, just like looking up a key in a dictionary!

In [8]:
# Select a single column - returns a Series (like a labeled list)
states_column = df['state']
print("All states in our dataset:")
print(states_column)

All states in our dataset:
475         Delaware
476         Delaware
477         Delaware
478         Delaware
479         Delaware
1255        Maryland
1256        Maryland
1257        Maryland
1258        Maryland
1259        Maryland
2330    Pennsylvania
2331    Pennsylvania
2332    Pennsylvania
2333    Pennsylvania
2334    Pennsylvania
2870        Virginia
2871        Virginia
2872        Virginia
2873        Virginia
2874        Virginia
Name: state, dtype: object


In [9]:
# Columns come with built-in methods for the calculations you did on lists!
property_crime_rates = df['property_all']

# These operations should feel familiar from working with lists:
print(f"Maximum property crime rate: {property_crime_rates.max():.1f}")
print(f"Minimum property crime rate: {property_crime_rates.min():.1f}")
print(f"Average property crime rate: {property_crime_rates.mean():.1f}")

# Connection to lists:
#   .max() is like max(list)
#   .min() is like min(list)
#   .mean() is like sum(list)/len(list)

Maximum property crime rate: 2766.0
Minimum property crime rate: 1403.4
Average property crime rate: 2010.2
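
The same statistics can also be read off a single `.describe()` call. A minimal sketch, using a stand-in Series (`rates`, a name chosen for illustration) that holds Delaware's five `property_all` values from the first rows of the dataset:

```python
import pandas as pd

# Delaware's property crime rates for 2015-2019
rates = pd.Series([2691.0, 2766.0, 2440.6, 2324.4, 2252.2])

# describe() returns count, mean, std, min, quartiles, and max in one Series
summary = rates.describe()
print(summary)

# Individual entries can be pulled out by label
print(summary["max"])  # 2766.0
print(summary["min"])  # 2252.2
```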


### 🎯 QUICK CHECK #2
Calculate the maximum, minimum, and average for the **property_burglary** column.

In [10]:
# Your code here:

<details>
<summary>Click for solution</summary>

```python
burglary_rates = df['property_burglary']
print(f"Maximum burglary rate: {burglary_rates.max():.1f}")
print(f"Minimum burglary rate: {burglary_rates.min():.1f}")
print(f"Average burglary rate: {burglary_rates.mean():.1f}")
```
</details>

### Selecting Multiple Columns

Sometimes you only want to see a few columns at once:

In [11]:
# Select multiple columns - use a list of column names inside brackets
subset = df[['state', 'year', 'property_all']]
print("Just state, year, and total property crime rate:")
subset.head(n=8)

Just state, year, and total property crime rate:


|      | state | year | property_all |
|------|-------|------|--------------|
| 475  | Delaware | 2015 | 2691.0 |
| 476  | Delaware | 2016 | 2766.0 |
| 477  | Delaware | 2017 | 2440.6 |
| 478  | Delaware | 2018 | 2324.4 |
| 479  | Delaware | 2019 | 2252.2 |
| 1255 | Maryland | 2015 | 2315.0 |
| 1256 | Maryland | 2016 | 2284.5 |
| 1257 | Maryland | 2017 | 2222.3 |


In [12]:
# Why do we use double brackets [[  ]]?
# The outer brackets say "select from df"
# The inner brackets create a list of column names
# So: df[ ['col1', 'col2'] ] means "select these columns from df"

# Compare:
print("Single column (Series):")
print(type(df['state']))  # Returns a Series

print("\nMultiple columns (DataFrame):")
print(type(df[['state', 'year']]))  # Returns a DataFrame

Single column (Series):
<class 'pandas.core.series.Series'>

Multiple columns (DataFrame):
<class 'pandas.core.frame.DataFrame'>
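
The bracket rule also applies when the list holds just one name. A small sketch on a stand-in DataFrame (`demo` is a made-up name for this example) shows the difference in type and shape:

```python
import pandas as pd

demo = pd.DataFrame({
    "state": ["Delaware", "Maryland"],
    "year": [2015, 2015],
})

one_bracket = demo["state"]      # Series: one-dimensional
two_brackets = demo[["state"]]   # DataFrame that happens to have one column

print(type(one_bracket).__name__, one_bracket.shape)    # Series (2,)
print(type(two_brackets).__name__, two_brackets.shape)  # DataFrame (2, 1)
```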


### 🎯 QUICK CHECK #3
Create a subset showing only state, year, and property_larceny. Display the first 10 rows.

In [13]:
# Your code here:

<details>
<summary>Click for solution</summary>

```python
larceny_subset = df[['state', 'year', 'property_larceny']]
larceny_subset.head(n=10)
```
</details>

## Part 6: Understanding Your Data with Methods

### Counting Unique Values

These methods help you understand what's in your data:

In [14]:
# How many unique states are in our data?
n_states = df['state'].nunique()
print(f"Number of unique states: {n_states}")

# What are they?
unique_states = df['state'].unique()
print(f"\nThe states: {unique_states}")

Number of unique states: 4

The states: ['Delaware' 'Maryland' 'Pennsylvania' 'Virginia']


In [15]:
# How many times does each state appear in our data?
state_counts = df['state'].value_counts()
print("Number of records per state:")
print(state_counts)

# Each state appears 5 times because we have 5 years of data (2015-2019)

Number of records per state:
state
Delaware        5
Maryland        5
Pennsylvania    5
Virginia        5
Name: count, dtype: int64


In [16]:
# What years do we have?
years = df['year'].unique()
# .tolist() converts NumPy integers to plain Python ints for a cleaner printout
print(f"Years in dataset: {sorted(years.tolist())}")

Years in dataset: [2015, 2016, 2017, 2018, 2019]


### Understanding .unique(), .nunique(), and .value_counts()

**These three methods are similar but do DIFFERENT things. Let's see exactly what each returns:**

In [17]:
# Let's use the 'state' column to see the differences clearly

print("=" * 60)
print("METHOD 1: .unique() - Returns an ARRAY of unique values")
print("=" * 60)
states_unique = df['state'].unique()
print(states_unique)
print(f"Type: {type(states_unique)}")
print(f"\nWhat you get: The actual state names (each appears once)")
print("Use this when: You want to SEE what the unique values are")

print("\n" + "=" * 60)
print("METHOD 2: .nunique() - Returns a SINGLE NUMBER")
print("=" * 60)
num_states = df['state'].nunique()
print(num_states)
print(f"Type: {type(num_states)}")
print(f"\nWhat you get: Just a count (how many unique states)")
print("Use this when: You only care about HOW MANY unique values")

print("\n" + "=" * 60)
print("METHOD 3: .value_counts() - Returns COUNTS for each value")
print("=" * 60)
state_counts = df['state'].value_counts()
print(state_counts)
print(f"Type: {type(state_counts)}")
print(f"\nWhat you get: Each state AND how many times it appears")
print("Use this when: You want to see BOTH what values exist AND their frequencies")

METHOD 1: .unique() - Returns an ARRAY of unique values
['Delaware' 'Maryland' 'Pennsylvania' 'Virginia']
Type: <class 'numpy.ndarray'>

What you get: The actual state names (each appears once)
Use this when: You want to SEE what the unique values are

METHOD 2: .nunique() - Returns a SINGLE NUMBER
4
Type: <class 'int'>

What you get: Just a count (how many unique states)
Use this when: You only care about HOW MANY unique values

METHOD 3: .value_counts() - Returns COUNTS for each value
state
Delaware        5
Maryland        5
Pennsylvania    5
Virginia        5
Name: count, dtype: int64
Type: <class 'pandas.core.series.Series'>

What you get: Each state AND how many times it appears
Use this when: You want to see BOTH what values exist AND their frequencies


### Quick Comparison Table

| Method | What it returns | Example output | When to use |
|--------|----------------|----------------|-------------|
| `.unique()` | Array of unique values | `['Delaware' 'Maryland' ...]` | "What states are in my data?" |
| `.nunique()` | Single number (count) | `4` | "How many different states?" |
| `.value_counts()` | Each value + its count | `Delaware: 5, Maryland: 5, ...` | "How often does each state appear?" |

**Memory trick:**
- `.unique()` = "Show me the unique items"
- `.nunique()` = "**N**umber of unique items" (the 'n' stands for number!)
- `.value_counts()` = "**Count** how many times each **value** appears"
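
The three methods also agree with one another, which can serve as a quick sanity check. The sketch below uses a small stand-in column (`s`, with state names repeated on purpose) rather than the real dataset:

```python
import pandas as pd

# Small stand-in column; the state names are repeated on purpose
s = pd.Series(["Maryland", "Virginia", "Maryland", "Delaware", "Maryland"])

values = s.unique()        # distinct values, in order of first appearance
n_values = s.nunique()     # how many distinct values there are
counts = s.value_counts()  # Series mapping each value to its number of rows

print(values)
print(n_values)            # 3
print(counts)

# The three results are consistent with each other
print(len(values) == n_values)  # True
print(counts.sum() == len(s))   # True
print(counts["Maryland"])       # 3
```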

### 🎯 QUICK CHECK #4
How many unique years are in the dataset? What's the range (min and max year)?

In [18]:
# Your code here:

<details>
<summary>Click for solution</summary>

```python
n_years = df['year'].nunique()
min_year = df['year'].min()
max_year = df['year'].max()
print(f"Number of unique years: {n_years}")
print(f"Year range: {min_year} to {max_year}")
```
</details>

## Part 7: Simple Sorting

### Ordering Your Data

Sometimes you want to see data in a specific order:

In [19]:
# Sort by property crime rate (lowest to highest)
sorted_df = df.sort_values(by='property_all')
print("States/years with lowest property crime rates:")
sorted_df[['state', 'year', 'property_all']].head(n=8)

States/years with lowest property crime rates:


|      | state | year | property_all |
|------|-------|------|--------------|
| 2334 | Pennsylvania | 2019 | 1403.4 |
| 2333 | Pennsylvania | 2018 | 1489.9 |
| 2874 | Virginia | 2019 | 1642.7 |
| 2332 | Pennsylvania | 2017 | 1649.4 |
| 2873 | Virginia | 2018 | 1665.8 |
| 2331 | Pennsylvania | 2016 | 1742.7 |
| 2872 | Virginia | 2017 | 1792.9 |
| 2330 | Pennsylvania | 2015 | 1812.8 |


In [20]:
# Sort by property crime rate (highest to lowest)
sorted_df = df.sort_values(by='property_all', ascending=False)
print("States/years with highest property crime rates:")
sorted_df[['state', 'year', 'property_all']].head(n=8)

States/years with highest property crime rates:


|      | state | year | property_all |
|------|-------|------|--------------|
| 476  | Delaware | 2016 | 2766.0 |
| 475  | Delaware | 2015 | 2691.0 |
| 477  | Delaware | 2017 | 2440.6 |
| 478  | Delaware | 2018 | 2324.4 |
| 1255 | Maryland | 2015 | 2315.0 |
| 1256 | Maryland | 2016 | 2284.5 |
| 479  | Delaware | 2019 | 2252.2 |
| 1257 | Maryland | 2017 | 2222.3 |


In [21]:
# You can sort by multiple columns
# First by state, then by year
sorted_df = df.sort_values(by=['state', 'year'])
print("Data sorted by state, then year:")
sorted_df[['state', 'year', 'property_all']].head(n=10)

Data sorted by state, then year:


|      | state | year | property_all |
|------|-------|------|--------------|
| 475  | Delaware | 2015 | 2691.0 |
| 476  | Delaware | 2016 | 2766.0 |
| 477  | Delaware | 2017 | 2440.6 |
| 478  | Delaware | 2018 | 2324.4 |
| 479  | Delaware | 2019 | 2252.2 |
| 1255 | Maryland | 2015 | 2315.0 |
| 1256 | Maryland | 2016 | 2284.5 |
| 1257 | Maryland | 2017 | 2222.3 |
| 1258 | Maryland | 2018 | 2033.3 |
| 1259 | Maryland | 2019 | 1950.2 |
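
Two details of `sort_values` are worth a quick sketch: it returns a new DataFrame and leaves the original order alone, and `ascending` can take one True/False per column. The `demo` frame below is a four-row stand-in built from values in the dataset, and its variable names are chosen only for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "state": ["Maryland", "Delaware", "Maryland", "Delaware"],
    "year": [2015, 2015, 2016, 2016],
    "property_all": [2315.0, 2691.0, 2284.5, 2766.0],
})

# States A-Z, and within each state the most recent year first
by_state_recent_first = demo.sort_values(by=["state", "year"],
                                         ascending=[True, False])
print(by_state_recent_first)

# The original DataFrame keeps the order it was created with
print(demo["state"].tolist())
```

That is why the cells above store the result in `sorted_df` instead of expecting `df` itself to change.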


### 🎯 QUICK CHECK #5
Sort the data to find which state-years had the lowest burglary rates. Show the top 5.

In [24]:
# Your code here:


<details>
<summary>Click for solution</summary>

```python
sorted_burglary = df.sort_values(by='property_burglary')
print("Lowest burglary rates:")
sorted_burglary[['state', 'year', 'property_burglary']].head(n=5)
```
</details>

## Part 8: Putting It Together

### A Complete Analysis Workflow

Let's answer: "What was the average property crime rate across the four states in our dataset from 2015 to 2019?"

In [25]:
# This is a preview of what we can do - we'll learn more next class!
# For now, just observe the pattern:

# Step 1: Look at the first 10 rows of the columns we care about
print("First 10 rows:")
print(df[['state', 'year', 'property_all']].head(n=10))

# Step 2: Calculate the average across ALL states and years
overall_avg = df['property_all'].mean()
print(f"\nAverage property crime rate (all states, all years): {overall_avg:.1f}")

# Step 3: Let's also look at burglary specifically
burglary_avg = df['property_burglary'].mean()
print(f"Average burglary rate (all states, all years): {burglary_avg:.1f}")

First 10 rows:
         state  year  property_all
475   Delaware  2015        2691.0
476   Delaware  2016        2766.0
477   Delaware  2017        2440.6
478   Delaware  2018        2324.4
479   Delaware  2019        2252.2
1255  Maryland  2015        2315.0
1256  Maryland  2016        2284.5
1257  Maryland  2017        2222.3
1258  Maryland  2018        2033.3
1259  Maryland  2019        1950.2

Average property crime rate (all states, all years): 2010.2
Average burglary rate (all states, all years): 309.1


## Hands-On Exercise: Your Turn to Explore

Practice the fundamental operations we learned today:

In [26]:
# Exercise 1: Display the first 7 rows of the dataset
# Your code here:
df.head(n=7)



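|      | state | year | property_all | property_burglary | property_larceny | property_motor |
|------|-------|------|--------------|-------------------|------------------|----------------|
| 475  | Delaware | 2015 | 2691.0 | 504.6 | 2061.6 | 124.9 |
| 476  | Delaware | 2016 | 2766.0 | 527.6 | 2078.7 | 159.7 |
| 477  | Delaware | 2017 | 2440.6 | 412.7 | 1885.6 | 142.3 |
| 478  | Delaware | 2018 | 2324.4 | 326.5 | 1845.3 | 152.6 |
| 479  | Delaware | 2019 | 2252.2 | 304.8 | 1782.7 | 164.7 |
| 1255 | Maryland | 2015 | 2315.0 | 427.5 | 1668.5 | 218.9 |
| 1256 | Maryland | 2016 | 2284.5 | 410.4 | 1677.4 | 196.7 |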


In [28]:

# Exercise 2: How many total rows and columns does the dataset have?
# Your code here:
print(f"Number of rows: {df.shape[0]}, Number of columns: {df.shape[1]}") 

Number of rows: 20, Number of columns: 6


In [30]:


# Exercise 3: What's the maximum motor vehicle theft rate in the dataset?
# Your code here:
df['property_motor'].max()


np.float64(224.2)

In [31]:


# Exercise 4: Show the states sorted by larceny rate (highest first)
# Display just state, year, and property_larceny columns for the top 5
# Your code here:
df[['state', 'year', 'property_larceny']].sort_values(by='property_larceny', ascending=False).head(n=5)



|     | state | year | property_larceny |
|-----|-------|------|------------------|
| 476 | Delaware | 2016 | 2078.7 |
| 475 | Delaware | 2015 | 2061.6 |
| 477 | Delaware | 2017 | 1885.6 |
| 478 | Delaware | 2018 | 1845.3 |
| 479 | Delaware | 2019 | 1782.7 |


In [32]:

# Exercise 5: How many times does Delaware appear in the dataset?
# Your code here:
df['state'].value_counts()


state
Delaware        5
Maryland        5
Pennsylvania    5
Virginia        5
Name: count, dtype: int64

In [33]:

# Exercise 6: What is the range (difference between max and min) of burglary rates?
# Your code here:
df['property_burglary'].max() - df['property_burglary'].min()


np.float64(364.8)

In [34]:

# Exercise 7: Which crime type has a higher average rate: burglary or motor vehicle theft?
# Your code here:
burg_rate = df['property_burglary'].mean()
mvt_rate = df['property_motor'].mean()

if burg_rate > mvt_rate:
    print(f"Burglary > mvt: {burg_rate:.1f} > {mvt_rate:.1f}")
elif burg_rate < mvt_rate:
    print(f"Burglary < mvt: {burg_rate:.1f} < {mvt_rate:.1f}")
else:
    print("They have the same rate")

Burglary > mvt: 309.1 > 142.4


In [37]:

# Exercise 8: Create a subset with ONLY the year and all four crime rate columns
# Then display the first 8 rows sorted by year
# Your code here:
df_new = df[['year', 'property_all', 'property_burglary',
       'property_larceny', 'property_motor']]

df_new.sort_values(by='year').head(n=8)

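|      | year | property_all | property_burglary | property_larceny | property_motor |
|------|------|--------------|-------------------|------------------|----------------|
| 475  | 2015 | 2691.0 | 504.6 | 2061.6 | 124.9 |
| 1255 | 2015 | 2315.0 | 427.5 | 1668.5 | 218.9 |
| 2870 | 2015 | 1866.5 | 254.6 | 1515.2 | 96.8 |
| 2330 | 2015 | 1812.8 | 309.8 | 1408.2 | 94.8 |
| 1256 | 2016 | 2284.5 | 410.4 | 1677.4 | 196.7 |
| 2331 | 2016 | 1742.7 | 277.8 | 1362.8 | 102.1 |
| 476  | 2016 | 2766.0 | 527.6 | 2078.7 | 159.7 |
| 2871 | 2016 | 1859.4 | 238.0 | 1505.1 | 116.4 |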


## Using AI Tools for Pandas Help

### Effective Prompts for Pandas Problems:

Example prompt:
```
I'm a criminology student learning pandas. I have a DataFrame called 'df' with columns:
- state (string)
- year (integer)  
- property_all (float) - total property crime rate
- property_burglary (float) - burglary rate
- property_larceny (float) - larceny rate
- property_motor (float) - motor vehicle theft rate

How do I find the maximum property crime rate in my data?
```

### Good Questions to Ask AI:
- "How do I see the first 10 rows of my DataFrame?"
- "What's the difference between .head() and .tail()?"
- "How do I calculate the average of a column?"
- "How do I sort my data by a specific column?"

## Wrap-Up: Key Takeaways

Today you learned the **fundamental pandas operations** you'll use constantly:

1. **DataFrames** keep related data together (like a spreadsheet)
2. **pd.read_csv()** loads data from files
3. **df.head()** and **df.tail()** preview your data
4. **df.shape** tells you the size (rows, columns)
5. **df.info()** shows column types and basic information
6. **df['column']** selects a single column
7. **df[['col1', 'col2']]** selects multiple columns
8. **df['column'].mean()**, **.max()**, **.min()** calculate statistics
9. **df.sort_values()** orders your data

### The Pandas-to-Lists Connection:
- `df['column']` is like accessing a named list
- `df['column'].mean()` is like `sum(list)/len(list)`
- `df.shape[0]` is like `len(list)`
- `.max()` is like `max(list)`, `.min()` is like `min(list)`
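
These connections can be checked side by side. A quick sketch using the three-state list from Part 1 (the names `rates_list` and `rates_series` are made up for this example):

```python
import pandas as pd

rates_list = [2023.6, 1687.7, 1744.2]   # per 100,000, from Part 1
rates_series = pd.Series(rates_list)    # the same numbers as a pandas column

# Each line prints the list version first, then the pandas version
print(max(rates_list), rates_series.max())
print(min(rates_list), rates_series.min())
print(len(rates_list), rates_series.shape[0])
print(sum(rates_list) / len(rates_list), rates_series.mean())
```

The two averages may differ in the last decimal place because of floating-point rounding; for practical purposes they are the same number.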

## Before Next Class

1. **Practice the basics:**
   - Load the CSV file
   - Try `.head()`, `.tail()`, `.shape`, `.info()`
   - Select different columns
   - Calculate `.mean()`, `.max()`, `.min()` on different columns
   - Sort by different columns

2. **Experiment:**
   - What happens if you try `df.head(n=100)` on our 20-row dataset?
   - What happens if you ask for a column that doesn't exist?
   - Can you display the entire dataset without using `.head()` or `.tail()`?

3. **Use AI to learn:**
   - Pick any operation from today and ask ChatGPT/Claude to explain it
   - If you get confused, paste your code and ask what's wrong

## Quick Reference Card

### Essential Operations Learned Today:
```python
# Setup
import pandas as pd

# Loading data
df = pd.read_csv(filepath_or_buffer='file.csv')

# Exploring data
df.head(n=5)           # First 5 rows
df.tail(n=5)           # Last 5 rows
df.shape               # (rows, columns)
df.info()              # Column information
df.describe()          # Statistical summary

# Selecting columns
df['column']                    # Single column
df[['col1', 'col2']]           # Multiple columns

# Statistics on columns
df['column'].max()              # Maximum value
df['column'].min()              # Minimum value
df['column'].mean()             # Average
df['column'].sum()              # Total
df['column'].nunique()          # Count unique values
df['column'].unique()           # Array of unique values
df['column'].value_counts()     # Frequency of each value

# Sorting
df.sort_values(by='column')                    # Ascending
df.sort_values(by='column', ascending=False)   # Descending
df.sort_values(by=['col1', 'col2'])           # Multiple columns
```

### Common Errors to Watch For:
- **KeyError**: Column name doesn't exist - check spelling and capitalization
- **Forgetting quotes**: Column names need quotes: `df['state']` not `df[state]`
- **Single vs double brackets**: `df['col']` vs `df[['col']]` - one returns Series, two returns DataFrame
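
A short sketch of how a `KeyError` shows up and one way to check column names before selecting. The `demo` frame and the `caught` variable are stand-ins made up for this example.

```python
import pandas as pd

demo = pd.DataFrame({"state": ["Delaware"], "property_all": [2691.0]})

# Check the available names before selecting
print(list(demo.columns))       # ['state', 'property_all']
print("State" in demo.columns)  # False: capitalization matters
print("state" in demo.columns)  # True

# Asking for a column that does not exist raises KeyError
caught = None
try:
    demo["State"]
except KeyError as err:
    caught = f"KeyError: {err}"
    print(caught)
```

When you hit this error in your own notebook, printing `df.columns` is usually the fastest way to spot the typo.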