# 9. Merging DataFrames - SOLUTIONS
## Practice Exercises Solutions

This notebook contains solutions to the three practice exercises from Lecture 9.

## Setup

In [None]:
import pandas as pd
import seaborn as sns

In [None]:
# Load NYPD arrest data
df = pd.read_csv('https://raw.githubusercontent.com/zjelveh/zjelveh.github.io/refs/heads/master/files/cfc/nypd_arrests_2013_2015_garner.csv')

# Convert date to datetime
df['ARREST_DATE'] = pd.to_datetime(df['ARREST_DATE'])

# Extract year
df['year'] = df['ARREST_DATE'].dt.year

print(f"Loaded {len(df):,} arrest records")

---

## Exercise 1: Pattern 1 Practice - Drug Enforcement Priority

**Question**: In which borough were drug arrests the highest percentage of total arrests in December 2014?

**Pattern**: Merge Two Aggregations

**Approach**:
1. Filter to December 2014
2. Count total arrests by borough (Aggregation 1)
3. Count drug arrests by borough (Aggregation 2)
4. Merge the two aggregations
5. Calculate drug arrest percentage

In [None]:
# Step 1: Filter to December 2014
# .copy() makes dec_2014 an independent DataFrame, so adding a column in Step 3 is safe
dec_2014 = df[(df['year'] == 2014) & (df['ARREST_DATE'].dt.month == 12)].copy()

print(f"Arrests in December 2014: {len(dec_2014):,}")

In [None]:
# Step 2: Aggregation 1 - Total arrests by borough
total_by_borough = dec_2014.groupby('ARREST_BORO').size().reset_index(name='total_arrests')

print("Total arrests by borough:")
print(total_by_borough)

In [None]:
# Step 3: Create is_drug indicator and count drug arrests
dec_2014['is_drug'] = dec_2014['OFNS_DESC'].str.contains('DRUG', na=False)

# Aggregation 2 - Drug arrests by borough
drug_by_borough = dec_2014[dec_2014['is_drug']].groupby('ARREST_BORO').size().reset_index(name='drug_arrests')

print("Drug arrests by borough:")
print(drug_by_borough)
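
Step 3 adds an `is_drug` column to a DataFrame that was produced by filtering `df`. Depending on the pandas version, assigning to a filtered result like this can raise a `SettingWithCopyWarning`; taking an explicit `.copy()` right after filtering is one common way to avoid it. A minimal sketch, where `arrests_toy` and its rows are invented for illustration (not the NYPD data):

```python
import pandas as pd

# Toy data, invented for illustration only
arrests_toy = pd.DataFrame({
    'year': [2014, 2014, 2015],
    'OFNS_DESC': ['DANGEROUS DRUGS', 'ASSAULT 3', 'DANGEROUS DRUGS'],
})

# .copy() makes the filtered result independent of arrests_toy
subset_toy = arrests_toy[arrests_toy['year'] == 2014].copy()

# Adding a column now only affects subset_toy
subset_toy['is_drug'] = subset_toy['OFNS_DESC'].str.contains('DRUG', na=False)

print(subset_toy)
print(arrests_toy.columns.tolist())  # the original has no is_drug column
```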

In [None]:
# Step 4: Merge the two aggregations
drug_merged = total_by_borough.merge(drug_by_borough, on='ARREST_BORO', how='left')

print("Merged data:")
print(drug_merged)
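
The `how='left'` merge in Step 4 keeps every borough from `total_by_borough`, so a borough with no matching row in `drug_by_borough` would come back with `NaN` rather than 0. A minimal sketch on made-up counts (the numbers and the missing `'S'` row are assumptions for illustration, not values from the NYPD data) shows the effect and one way to handle it:

```python
import pandas as pd

# Toy counts, invented for illustration only
total_toy = pd.DataFrame({'ARREST_BORO': ['B', 'K', 'S'],
                          'total_arrests': [100, 200, 50]})

# 'S' is deliberately missing, as if it had no drug arrests
drug_toy = pd.DataFrame({'ARREST_BORO': ['B', 'K'],
                         'drug_arrests': [15, 20]})

merged_toy = total_toy.merge(drug_toy, on='ARREST_BORO', how='left')
print(merged_toy)  # drug_arrests is NaN for 'S'

# Treat "no matching row" as zero drug arrests before computing the rate
merged_toy['drug_arrests'] = merged_toy['drug_arrests'].fillna(0).astype(int)
merged_toy['drug_pct'] = merged_toy['drug_arrests'] / merged_toy['total_arrests'] * 100
print(merged_toy)
```

With the real December 2014 data every borough most likely has at least one drug arrest, so the fill would be a no-op there. It matters more for rarer offense categories.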

In [None]:
# Step 5: Calculate drug arrest percentage
drug_merged['drug_pct'] = (drug_merged['drug_arrests'] / drug_merged['total_arrests']) * 100

# Sort by drug percentage
drug_sorted = drug_merged.sort_values('drug_pct', ascending=False)

print("Boroughs ranked by drug arrest percentage:")
print(drug_sorted)

In [None]:
# Visualize
sns.barplot(data=drug_sorted, x='ARREST_BORO', y='drug_pct')

**Answer**: The borough with the highest drug arrest percentage in December 2014 was the Bronx (B), where approximately 16-17% of all arrests were for drug offenses.
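
As a cross-check on the merge-based answer, the same percentage can be computed in one pass, because the mean of a boolean column is the share of `True` values. The sketch below uses a small invented DataFrame (`toy` and its rows are assumptions, not the NYPD data) with the same column names as above:

```python
import pandas as pd

# Toy arrests, invented for illustration only
toy = pd.DataFrame({
    'ARREST_BORO': ['B', 'B', 'B', 'B', 'K', 'K', 'K', 'K', 'K'],
    'OFNS_DESC': ['DANGEROUS DRUGS', 'ASSAULT 3', 'DANGEROUS DRUGS', None,
                  'PETIT LARCENY', 'DANGEROUS DRUGS', 'ASSAULT 3',
                  'ROBBERY', 'BURGLARY'],
})

# Same indicator as in Step 3; missing descriptions count as non-drug
toy['is_drug'] = toy['OFNS_DESC'].str.contains('DRUG', na=False)

# Mean of True/False per borough = fraction of drug arrests
drug_pct_check = (toy.groupby('ARREST_BORO')['is_drug'].mean() * 100).reset_index(name='drug_pct')

print(drug_pct_check)
```

If the two approaches disagree on the real data, the merge step is the first place to look.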

---

## Exercise 2: Pattern 2 Practice - Queens' Share Over Time

**Question**: What percentage of misdemeanor arrests happened in Queens each month?

**Pattern**: Aggregate-Merge-Back

**Approach**:
1. Filter to misdemeanors only
2. Count total misdemeanor arrests per month (broader aggregation)
3. Count Queens misdemeanor arrests per month (detailed aggregation)
4. Merge monthly totals back to Queens data
5. Calculate percentage

In [None]:
# Step 1: Filter to misdemeanors only
# .copy() makes misdemeanors an independent DataFrame, so adding year_month below is safe
misdemeanors = df[df['LAW_CAT_CD'] == 'M'].copy()

print(f"Total misdemeanor arrests: {len(misdemeanors):,}")

In [None]:
# Create year-month period column
misdemeanors['year_month'] = misdemeanors['ARREST_DATE'].dt.to_period('M')

print("Sample of year-month:")
print(misdemeanors[['ARREST_DATE', 'year_month']].head())

In [None]:
# Step 2: Count total misdemeanor arrests per month (citywide)
monthly_totals = misdemeanors.groupby('year_month').size().reset_index(name='total_misdemeanors')

print("Monthly totals (first 10):")
print(monthly_totals.head(10))

In [None]:
# Step 3: Count Queens misdemeanor arrests per month
queens_only = misdemeanors[misdemeanors['ARREST_BORO'] == 'Q']
queens_monthly = queens_only.groupby('year_month').size().reset_index(name='queens_misdemeanors')

print("Queens monthly arrests (first 10):")
print(queens_monthly.head(10))

In [None]:
# Step 4: Merge monthly totals back to Queens data
queens_with_context = queens_monthly.merge(monthly_totals, on='year_month', how='left')

print("Queens data with citywide context:")
print(queens_with_context.head(10))
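
The same merge-back step extends from Queens alone to every borough at once: borough-month counts on the left, citywide monthly totals on the right. Many left rows then share one right row, and `validate='many_to_one'` asks pandas to raise an error if the totals table ever contains a duplicate month. A sketch on invented counts (all names ending in `_toy` and all numbers are assumptions for illustration):

```python
import pandas as pd

# Toy borough-month counts, invented for illustration only
boro_monthly_toy = pd.DataFrame({
    'year_month': pd.PeriodIndex(['2014-11', '2014-11', '2014-12', '2014-12'], freq='M'),
    'ARREST_BORO': ['Q', 'K', 'Q', 'K'],
    'misdemeanors': [20, 80, 30, 70],
})

# Broader aggregation: one row per month
monthly_totals_toy = (boro_monthly_toy.groupby('year_month')['misdemeanors']
                      .sum()
                      .reset_index(name='total_misdemeanors'))

# Merge the monthly totals back onto the detailed rows
with_context_toy = boro_monthly_toy.merge(monthly_totals_toy, on='year_month',
                                          how='left', validate='many_to_one')

with_context_toy['boro_pct'] = (with_context_toy['misdemeanors'] /
                                with_context_toy['total_misdemeanors']) * 100
print(with_context_toy)
```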

In [None]:
# Step 5: Calculate Queens' percentage
queens_with_context['queens_pct'] = (queens_with_context['queens_misdemeanors'] / 
                                      queens_with_context['total_misdemeanors']) * 100

print(queens_with_context.head(10))

# Summary statistics
print(f"\nQueens' share of misdemeanors ranged from {queens_with_context['queens_pct'].min():.1f}% to {queens_with_context['queens_pct'].max():.1f}%")
print(f"Average: {queens_with_context['queens_pct'].mean():.1f}%")

In [None]:
# Visualize Queens' share over time
queens_with_context['year'] = queens_with_context['year_month'].dt.year
queens_with_context['month'] = queens_with_context['year_month'].dt.month

sns.lineplot(data=queens_with_context, x='month', y='queens_pct', hue='year')

**Answer**: Queens accounted for approximately 17-20% of all misdemeanor arrests each month, with an average around 18-19%. The percentage was fairly stable across the 2013-2015 period.

---

## Exercise 3: Pattern 3 Practice - Borough Recovery Rates

**Question**: Did the Bronx and Manhattan recover at the same rate from 2014 to 2015?

**Pattern**: Compare Filtered DataFrames

**Approach**:
1. Filter to just Bronx and Manhattan
2. Count arrests by borough and year
3. Separate 2014 and 2015 data
4. Merge them with suffixes
5. Calculate percentage change

In [None]:
# Step 1: Filter to just Bronx (B) and Manhattan (M)
bronx_manhattan = df[df['ARREST_BORO'].isin(['B', 'M'])]

print(f"Arrests in Bronx and Manhattan: {len(bronx_manhattan):,}")
print(bronx_manhattan['ARREST_BORO'].value_counts())

In [None]:
# Step 2: Count arrests by borough and year
borough_year = bronx_manhattan.groupby(['ARREST_BORO', 'year']).size().reset_index(name='arrests')

print("Arrests by borough and year:")
print(borough_year)

In [None]:
# Step 3: Separate 2014 and 2015 data
arrests_2014 = borough_year[borough_year['year'] == 2014]
arrests_2015 = borough_year[borough_year['year'] == 2015]

print("2014 arrests:")
print(arrests_2014)
print("\n2015 arrests:")
print(arrests_2015)

In [None]:
# Step 4: Merge on borough with suffixes
comparison = pd.merge(arrests_2014, arrests_2015, 
                     on='ARREST_BORO', 
                     how='inner',
                     suffixes=('_2014', '_2015'))

print("Merged comparison:")
print(comparison)
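
Because `arrests_2014` and `arrests_2015` share every column name, `suffixes` is what keeps the two years apart after the merge. Each non-key column that appears on both sides (here `year` and `arrests`) gets the suffix for its side, while the key `ARREST_BORO` appears once. A sketch on invented counts (the numbers are assumptions, not the real totals):

```python
import pandas as pd

# Toy borough-year counts, invented for illustration only
toy_2014 = pd.DataFrame({'ARREST_BORO': ['B', 'M'],
                         'year': [2014, 2014],
                         'arrests': [1000, 2000]})
toy_2015 = pd.DataFrame({'ARREST_BORO': ['B', 'M'],
                         'year': [2015, 2015],
                         'arrests': [1100, 2300]})

toy_comparison = pd.merge(toy_2014, toy_2015,
                          on='ARREST_BORO',
                          how='inner',
                          suffixes=('_2014', '_2015'))

# Both overlapping columns were suffixed, including the now-redundant year
merged_columns = toy_comparison.columns.tolist()
print(merged_columns)

# Optional tidy-up: the suffixed year columns carry no extra information
toy_comparison = toy_comparison.drop(columns=['year_2014', 'year_2015'])
print(toy_comparison)
```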

In [None]:
# Step 5: Calculate percentage change
comparison['pct_change'] = ((comparison['arrests_2015'] - comparison['arrests_2014']) / 
                            comparison['arrests_2014']) * 100

print("Recovery rates by borough:")
print(comparison[['ARREST_BORO', 'arrests_2014', 'arrests_2015', 'pct_change']])

# Print the answer
for _, row in comparison.iterrows():
    borough_name = "Bronx" if row['ARREST_BORO'] == 'B' else "Manhattan"
    print(f"\n{borough_name}: {row['pct_change']:+.1f}% change from 2014 to 2015")

In [None]:
# Visualize the comparison
sns.barplot(data=comparison, x='ARREST_BORO', y='pct_change')

**Answer**: Both boroughs saw increases from 2014 to 2015, but Manhattan recovered at a slightly higher rate (around +10-12%) than the Bronx (around +8-10%). This suggests that the post-pullback recovery was not uniform across boroughs: Manhattan's arrests rebounded more strongly than the Bronx's.
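
As a cross-check on the merge, the same wide layout can be built by pivoting borough-year counts so that each year becomes its own column. This is an alternative sketch, not the pattern the exercise asks for, and the counts below are invented for illustration:

```python
import pandas as pd

# Toy borough-year counts, invented for illustration only
borough_year_toy = pd.DataFrame({
    'ARREST_BORO': ['B', 'B', 'M', 'M'],
    'year': [2014, 2015, 2014, 2015],
    'arrests': [1000, 1100, 2000, 2300],
})

# One row per borough, one column per year
wide_toy = borough_year_toy.pivot(index='ARREST_BORO', columns='year', values='arrests')

# Same percentage-change formula as in Step 5
wide_toy['pct_change'] = (wide_toy[2015] - wide_toy[2014]) / wide_toy[2014] * 100

print(wide_toy)
```

The pivot is shorter when the two periods come from the same table. The merge with suffixes generalizes better when the two sides were filtered or aggregated differently.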

---

## Key Takeaways

**Exercise 1 (Pattern 1)**: Merging two aggregations at the same level (borough) to calculate rates/percentages. This is useful for comparing what proportion of arrests fall into a specific category.

**Exercise 2 (Pattern 2)**: Merging a broader aggregation (citywide totals) back to a more detailed level (borough-month) to add context. This helps us understand each borough's share of the total.

**Exercise 3 (Pattern 3)**: Comparing two different time periods by creating separate aggregations and merging with suffixes. This is the foundation for calculating percentage changes and recovery rates.

All three patterns will be essential for Problem Set 3!