---
## Assignment 1.2: Crime Profiles by Police District

Different parts of the city have very different crime patterns. Here we quantify that using conditional probabilities.

*Draws from*: Week 3, Exercises 2.1 and 2.2.

> * For each police district in your dataset, compute the **conditional crime profile**: for each of your Personal Focus Crimes, calculate
>
>   $$r(\text{crime}, \text{district}) = \frac{P(\text{crime} \mid \text{district})}{P(\text{crime})}$$
>
>   A value above 1 means that crime type is *over-represented* in that district relative to the city-wide average; below 1 means it is *under-represented*.
> * Visualize these ratios in a way that makes it easy to compare across both districts and crime types. (Simple bar charts are fine, but you may also draw on more complex visualization techniques. For example, a heatmap could work well here, though you're free to choose another format if you can justify it.)
> * Pick **one district** whose profile stands out to you. Describe the pattern and offer an explanation for why that district looks the way it does. Are there geographic, demographic, or other factors that might explain it?
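
To make the ratio concrete, here is a minimal sketch on a made-up toy table. The two districts `A` and `B` and all incident counts are invented for illustration and are not taken from the dataset. Using `pd.crosstab` is one possible way to get $P(\text{crime} \mid \text{district})$, not necessarily the approach used in the cells below.

```python
import pandas as pd

# Hypothetical toy data: 10 incidents in two invented districts.
toy = pd.DataFrame({
    "police_district": ["A"] * 6 + ["B"] * 4,
    "incident_category": ["assault"] * 3 + ["burglary"] * 3   # district A
                       + ["assault"] * 1 + ["burglary"] * 3,  # district B
})

# P(crime | district): normalize="index" makes each district row sum to 1.
p_crime_given_district = pd.crosstab(
    toy["police_district"], toy["incident_category"], normalize="index"
)

# P(crime): city-wide share of each category.
p_crime = toy["incident_category"].value_counts(normalize=True)

# r(crime, district): divide each crime column by its city-wide share.
ratio = p_crime_given_district.div(p_crime, axis="columns")
print(ratio.round(2))
```

In this toy table, assault makes up 3/6 = 0.5 of district A's incidents but only 4/10 = 0.4 of all incidents, so $r(\text{assault}, A) = 0.5 / 0.4 = 1.25$: assault is over-represented in district A.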

In [None]:
# Load data
import pandas as pd
df = pd.read_csv('./merged_sfpd.csv')

personal_focus = [
    'larceny/theft',
    'non-criminal',
    'assault',
    'vehicle theft',
    'drug/narcotic',
    'vandalism',
    'warrants',
    'burglary',
    'suspicious occ'
]

df_pf = df[df['incident_category'].isin(personal_focus)].copy()

In [None]:
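# P(crime | district): share of each focus crime within each district (rows sum to 1)
prob_crime_district = df_pf.groupby('police_district')['incident_category'].value_counts(normalize=True).unstack().fillna(0)
# P(crime): share of each focus crime across all focus-crime incidents
prob_crime = df_pf['incident_category'].value_counts(normalize=True)

districts = sorted(df_pf['police_district'].dropna().unique().tolist())
crimes = sorted(df_pf['incident_category'].dropna().unique().tolist())

# Ratio r(crime, district) = P(crime | district) / P(crime), stored as {crime: {district: ratio}}
cond_probs = {crime: (prob_crime_district[crime] / prob_crime[crime]).to_dict() for crime in crimes}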

In [None]:
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

cond_probs = pd.DataFrame(cond_probs)

# Marginal probabilities for the side bar charts:
# row_sums is P(district), col_sums is P(crime), both over focus crimes only
row_sums = df_pf['police_district'].value_counts(normalize=True).sort_index()
col_sums = prob_crime.sort_index()

# Create figure with subplots using GridSpec
fig = plt.figure(figsize=(10, 10))
gs = GridSpec(2, 2, width_ratios=[8, 1], height_ratios=[1, 8], 
              hspace=0.05, wspace=0.05)

# Main heatmap
ax_main = fig.add_subplot(gs[1, 0])
im = ax_main.imshow(cond_probs.T, cmap='Blues', aspect='auto')
ax_main.set_xticks(range(len(districts)))
ax_main.set_xticklabels(districts, rotation=90)
ax_main.set_yticks(range(len(crimes)))
ax_main.set_yticklabels(crimes)
ax_main.set_xlabel('Police District')
ax_main.set_ylabel('Crime Category')

# Text annotations
for i in range(cond_probs.shape[0]):
    for j in range(cond_probs.shape[1]):
        val = cond_probs.iloc[i, j]
        ax_main.text(i, j, f"{val:.2f}", ha='center', va='center', 
                    color='black', fontsize=8+(1.5*val))

# Top bar chart: P(district), the share of focus-crime incidents in each district
ax_top = fig.add_subplot(gs[0, 0], sharex=ax_main)
ax_top.bar(range(len(districts)), row_sums, color='cornflowerblue', alpha=0.7)
ax_top.set_ylabel('Prob')
ax_top.tick_params(labelbottom=False)
ax_top.spines['top'].set_visible(False)
ax_top.spines['right'].set_visible(False)
ax_top.grid(axis='y', linestyle='--', alpha=0.7)

# Right bar chart: P(crime), the share of focus-crime incidents in each category
ax_right = fig.add_subplot(gs[1, 1], sharey=ax_main)
ax_right.barh(range(len(crimes)), col_sums, color='cornflowerblue', alpha=0.7)
ax_right.set_xlabel('Prob')
ax_right.tick_params(labelleft=False)
ax_right.spines['top'].set_visible(False)
ax_right.spines['right'].set_visible(False)
ax_right.grid(axis='x', linestyle='--', alpha=0.7)

fig.suptitle('Crime Profiles by Police District: P(crime | district) / P(crime)', fontsize=14, y=0.98)

plt.show()

The figure above shows how the focus crimes are distributed across the 11 police districts of San Francisco. In the heatmap, each cell shows the ratio $P(\text{crime} \mid \text{district}) / P(\text{crime})$, so a value above one means that the crime makes up a larger share of incidents in that district than it does city-wide, and a value below one means a smaller share. Two bar charts sit above and to the right of the heatmap. The top one shows the unconditional probability that an incident falls in a given district, and the right one shows the unconditional probability that an incident belongs to a given crime category. (Note: all probabilities are normalized over the focus crimes only, not over the full dataset.)
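
The two marginal bar charts also give a quick sanity check for a ratio table like this: weighting each district's ratio by that district's share of incidents must give exactly 1 for every crime, because $\sum_d P(d)\,P(c \mid d) / P(c) = P(c) / P(c) = 1$. Below is a minimal sketch of that check. The helper `ratio_table` and the three-district toy data are hypothetical, introduced only for illustration, and the column names are assumed to match the ones used above.

```python
import numpy as np
import pandas as pd

def ratio_table(df, district_col="police_district", crime_col="incident_category"):
    """Hypothetical helper: r(crime, district) as a districts x crimes table."""
    p_cond = pd.crosstab(df[district_col], df[crime_col], normalize="index")
    p_crime = df[crime_col].value_counts(normalize=True)
    return p_cond.div(p_crime, axis="columns")

# Invented toy data: 10 incidents across three made-up districts.
toy = pd.DataFrame({
    "police_district": ["A"] * 5 + ["B"] * 3 + ["C"] * 2,
    "incident_category": ["assault"] * 2 + ["burglary"] * 3  # district A
                       + ["assault"] * 3                     # district B
                       + ["burglary"] * 2,                   # district C
})

ratio = ratio_table(toy)

# P(district): what the top bar chart displays.
p_district = toy["police_district"].value_counts(normalize=True)

# District-weighted average of the ratios, one value per crime.
weighted = ratio.mul(p_district, axis="index").sum(axis="index")

# Every crime should average out to 1 (up to floating-point error).
assert np.allclose(weighted, 1.0)
print(weighted)
```

If the check fails on real data, a likely cause is that the conditional and unconditional probabilities were normalized over different sets of rows, for example one over the focus crimes and the other over the full dataset.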

Two things stand out in the plot. First, only a very small share of the registered incidents fall in the "Out of SF" district. This district scores quite high in the "non-criminal" category, which suggests that SFPD is occasionally called outside the city only to conclude that nothing illegal took place. When an incident there is a crime, it is most likely a theft, either of a vehicle or of other property. On its own this is not that interesting, but given that this is a dataset of crime in San Francisco, it is curious that it contains both a district called "Out of SF" and a category called "non-criminal".
Moving on to San Francisco proper, some interesting patterns appear. One that springs to mind is the "Southern" district. It does not spike in any single category the way Tenderloin does for drugs; instead, its ratios stay at a steady level across all crimes. It also has the most incidents in total, as the top bar chart shows. This suggests that offenders here are less picky about which crimes they commit, favoring quantity over quality. The category that deviates most from the city-wide average in this district is "warrants", which is not a crime in itself but typically suggests that a police investigation needed special permission. Taken together, the Southern district appears to be one of the "harsher" neighborhoods of San Francisco and also one of the most heavily policed. Which of the two causes the other remains an open question.