# Components of Crime 1: Principal Components Analysis of Law Enforcement Agency Data

<br />
<br />
<br />

### Table of Contents

* Introduction

* Analysis of Law Enforcement Agency Data
 * Load Data
 * Covariance and Eigenvalue Analysis
 * Principal Components Analysis
 * Subspace Projection
 * K-Means Cluster Analysis
 

<br />
<br />
<br />


# Introduction 

This notebook analyzes the California crime data set from the FBI.

In this notebook, we'll be loading eight data files, which form four data sets:
* Law enforcement agencies
* Cities
* Counties
* Campuses

Each set has two data files, one with data about law enforcement and the other with data about crimes.

In [None]:
# essentials for data analysis
%matplotlib inline
import numpy as np
import pandas as pd

# plots
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.pyplot import *

# useful for data wrangling
import io, os, re, subprocess

# for sanity
from pprint import pprint

In [None]:
# learn you some machines
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

In [None]:
data_files = os.listdir('../input/')
pprint(data_files)

## Analysis of Law Enforcement Agency Data

Start by analyzing crime statistics broken out by law enforcement agency. 

We can utilize PCA to reduce the number of dimensions of our data set, but it would be nice if that information were in a more useful form - or, even better, if it gave us a new way to look at law enforcement agencies. 

What we'll do is apply principal components analysis to reduce the dimensionality of our data set, then use K-Means clustering to group points that are neighbors in the lower-dimensional PCA space. This can help us identify the important characteristics of a law enforcement agency and group similar agencies based on the values of those characteristics.

## Load Data

The data requires a bit of wrangling to get into a DataFrame. We worked out a few functions in a prior notebook (link):

In [None]:
def ca_law_enforcement_by_agency(data_directory):
    filename = 'ca_law_enforcement_by_agency.csv'

    # Load the whole file into a single string
    with open(data_directory + '/' + filename) as f:
        content = f.read()

    content = re.sub('\r',' ',content)
    [header,data] = content.split("civilians\"")
    header += "civilians\""
    
    data = data.strip()
    # Raw strings keep the regex escape \w from being read as a string escape
    agencies = re.findall(r'\w+ Agencies', data)
    all_but_agencies = re.split(r'\w+ Agencies',data)
    del all_but_agencies[0]
    
    newlines = []
    for (a,aba) in zip(agencies,all_but_agencies):
        newlines.append(''.join([a,aba]))
    
    # Combine into one long string, and do more processing
    one_string = '\n'.join(newlines)
    sio = io.StringIO(one_string)
    
    # Process column names
    columnstr = header.strip()
    columnstr = re.sub(r'\s+',' ',columnstr)
    columnstr = re.sub('"','',columnstr)
    columns = columnstr.split(",")
    columns = [s.strip() for s in columns]

    # Load the whole thing into Pandas
    df = pd.read_csv(sio,quotechar='"',names=columns,thousands=',')

    return df

def ca_offenses_by_agency(data_directory):
    filename = 'ca_offenses_by_agency.csv'

    # Load file into list of strings
    with open(data_directory + '/' + filename) as f:
        lines = f.readlines()
    
    # readlines() keeps each trailing newline, so join without adding more
    one_line = ''.join(lines[1:])
    sio = io.StringIO(one_line)
    
    # Process column names
    columnstr = lines[0].strip()
    columnstr = re.sub(r'\s+',' ',columnstr)
    columnstr = re.sub('"','',columnstr)
    columns = columnstr.split(",")
    columns = [s.strip() for s in columns]
    
    # Load the whole thing into Pandas
    df = pd.read_csv(sio,quotechar='"',names=columns,thousands=',')

    return df

df1 = ca_law_enforcement_by_agency('../input/')
df1.head()

df2 = ca_offenses_by_agency('../input/')
df2.head()

df = pd.merge(df1,df2)

The DataFrame that results contains combined data about law enforcement and crimes:

In [None]:
print(df.shape)
print(df.head(2))
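When `pd.merge` is called without an `on` argument, as above, pandas performs an inner join on every column name the two frames share. A small sketch with made-up frames (the column values here are hypothetical, not taken from the FBI files) shows what that means, and how an outer join with `indicator=True` can reveal rows that the inner join would silently drop:

```python
import pandas as pd

# Hypothetical toy frames standing in for df1 and df2
left = pd.DataFrame({'Agency': ['A', 'B', 'C'],
                     'Total officers': [10, 20, 30]})
right = pd.DataFrame({'Agency': ['A', 'B', 'D'],
                      'Robbery': [1, 2, 3]})

# Default behavior: inner join on the shared column ('Agency')
merged = pd.merge(left, right)

# Outer join with an indicator column, to audit unmatched rows
audit = pd.merge(left, right, how='outer', indicator=True)

print(merged)
print(audit[['Agency', '_merge']])
```

If the audit shows many `left_only` or `right_only` rows for the real data, the agency names in the two files may not match exactly.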

In [None]:
# We should note that the columns
# "violent crime" and "property crime" 
# are sums of other columns.

col1 = df['Violent crime']
col2 = (df['Murder and nonnegligent manslaughter']+df['Rape (revised definition)']+df['Robbery']+df['Aggravated assault'])

print("Columns col1 (violent crime) and col2 (sum of violent types of crime) are identical.")
print((col2-col1)[:10])

In [None]:
# This column does not have data
try:
    del df['Rape (legacy definition)']
except KeyError:
    pass

df = df.replace(np.nan,0.0)

for col in df.columns.tolist():
    print("Number of NaNs in column %s is %d"%(col, df[col].isnull().sum() ))

In [None]:
pca_cols = df.columns.tolist()[3:]

X_orig = df[pca_cols].values

## Covariance and Eigenvalue Analysis

Next we use a function that normalizes our data, so that each input variable has a mean of 0 and a variance of 1. Once we've done that, we can proceed with a covariance and eigenvalue analysis:

In [None]:
def get_normed_mean_cov(X):
    X_std = StandardScaler().fit_transform(X)
    X_mean = np.mean(X_std, axis=0)
    
    ## Automatic:
    #X_cov = np.cov(X_std.T)
    
    # Manual:
    X_cov = (X_std - X_mean).T.dot((X_std - X_mean)) / (X_std.shape[0]-1)
    
    return X_std, X_mean, X_cov

X_std, X_mean, X_cov = get_normed_mean_cov(X_orig)
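As a sanity check on the manual formula, here is a minimal sketch on synthetic data (the array `X_demo` is random and purely illustrative). It confirms that the manual product matches `np.cov`, and that the covariance of standardized data is the correlation matrix of the original data, up to a factor of n/(n-1) that arises because `StandardScaler` divides by the population standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data with very different scales and offsets per column
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3)) * [1.0, 10.0, 100.0] + [5.0, -2.0, 30.0]

X_demo_std = StandardScaler().fit_transform(X_demo)
n = X_demo_std.shape[0]

# Manual covariance; the column means are already (numerically) zero
manual_cov = X_demo_std.T.dot(X_demo_std) / (n - 1)

print(np.allclose(manual_cov, np.cov(X_demo_std.T)))
print(np.allclose(manual_cov, np.corrcoef(X_demo.T) * n / (n - 1)))
```

One consequence is that the diagonal of `X_cov` is n/(n-1) rather than exactly 1, which is a negligible difference for a data set of any real size.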

The covariance matrix can be visualized with a heatmap, which will reveal any structure (variables that co-vary, and can therefore be reduced using PCA):

In [None]:
xlabels = pca_cols
xlabels = [re.sub("Murder and nonnegligent manslaughter","Murder, Manslaughter",j) for j in xlabels]
xlabels = [re.sub("Total law enforcement employees","Tot law enf empl",j) for j in xlabels]

In [None]:
fig = plt.figure(figsize=(6,6))
sns.heatmap(pd.DataFrame(X_cov), 
            xticklabels=xlabels, yticklabels=xlabels,
            vmin=-1,vmax=1,
            annot=False, square=True, cmap='BrBG')
plt.title('Heatmap of Covariance Matrix Magnitude: Law Enforcement Agency Data', size=14)

plt.show()

This visualization shows which variables vary together. The darker squares show a correspondence between changes in variables. The covariance matrix shows that rape and robbery, both violent crimes, co-vary positively with property crimes. We can also see that burglary is an outlier among property crimes in that it does not co-vary strongly with other property crimes. We also see that every crime, with the exception of aggravated assault, co-varies positively with the total number of civilians in the law enforcement agency.
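Reading pairs off a heatmap by eye can be error-prone, so it may help to list the strongest off-diagonal entries directly. The helper below is a sketch: `strongest_pairs` is a hypothetical name, and the small matrix and labels are made-up values for illustration, not results from this data set.

```python
import numpy as np

def strongest_pairs(cov, labels, top=3):
    """Return the `top` off-diagonal entries with the largest magnitude."""
    cov = np.asarray(cov)
    # Upper triangle only (k=1 skips the diagonal) so each pair appears once
    rows, cols = np.triu_indices_from(cov, k=1)
    order = np.argsort(-np.abs(cov[rows, cols]))[:top]
    return [(labels[rows[i]], labels[cols[i]], float(cov[rows[i], cols[i]]))
            for i in order]

# Made-up 3x3 covariance matrix and labels
demo_cov = np.array([[ 1.0,  0.9,  0.1],
                     [ 0.9,  1.0, -0.5],
                     [ 0.1, -0.5,  1.0]])
demo_labels = ['Robbery', 'Burglary', 'Arson']

pairs = strongest_pairs(demo_cov, demo_labels, top=2)
print(pairs)
```

With the real data, the same call would be `strongest_pairs(X_cov, xlabels)`.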

In [None]:
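# X_cov is symmetric, so use eigh: it returns real eigenvalues and eigenvectors
eigenvals, eigenvecs = np.linalg.eigh(X_cov)

# Clean up round-off (tiny negative eigenvalues) and keep only the
# magnitudes of the eigenvector entries; note that this discards their signs
eigenvals = np.abs(eigenvals)
eigenvecs = np.abs(eigenvecs)

# Eigenvalues are not sorted in descending order,
# but eigenvals[i] *does* correspond to the column eigenvecs[:,i]
#print("Eigenvals shape: "+str(eigenvals.shape))
#print("Eigenvecs shape: "+str(eigenvecs.shape))

# Create a list of (eigenvalue, eigenvector) tuples
unsrt_eigenvalvec = [(eigenvals[i], eigenvecs[:,i]) for i in range(len(eigenvals))]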

# Sort tuple by eigenvalues
eigenvalvec = sorted(unsrt_eigenvalvec, reverse=True, key=lambda x:x[0])

## This is noisy, but interesting:
#pprint([pair for pair in eigenvalvec])
## We will visualize this below.

In [None]:

fig = plt.figure(figsize=(6,3))
sns.heatmap(pd.DataFrame([pair[1] for pair in eigenvalvec]), 
            annot=False, cmap='coolwarm',
            xticklabels=xlabels, yticklabels=range(len(eigenvalvec)),
            vmin=-1,vmax=1)

plt.ylabel("Ranked Eigenvalue")
plt.xlabel("Eigenvector Components")
plt.title('Eigenvalue Analysis: Law Enforcement Agency Data', size=14)
plt.show()

The visualization above shows one eigenvector per row, sorted from highest to lowest eigenvalue. Because the code takes absolute values, each cell shows the magnitude of a variable's weight in that eigenvector. Large magnitudes in the top rows mark variables that contribute strongly to the directions of greatest variance, and so do a good job of representing the rest of the data set. While there is no single dominant quantity, there are a few quantities that we can drop or ignore.

For example, the murder and manslaughter column is nearly entirely gray - meaning that the number of cases of murder or manslaughter is the least statistically representative quantity we could look at when assessing a law enforcement agency. (Take note, all you crime journalists salivating over the latest murder statistics - murder stats are no judge of a law enforcement agency.)

It turns out that the eigenvectors of the covariance matrix, which we're visualizing above, are precisely the same as the principal components. The visualization above is showing the values of the principal components.
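That equivalence can be checked numerically. The sketch below uses synthetic data (not the agency data) and compares the eigenvectors of the covariance matrix with the `components_` attribute of a fitted scikit-learn `PCA` object. The comparison is made up to sign, since an eigenvector multiplied by -1 is still an eigenvector:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: four columns with distinct variances, two of them correlated
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4)) * np.array([4.0, 2.0, 1.0, 0.5])
X_demo[:, 1] += 0.5 * X_demo[:, 0]

# Eigendecomposition of the covariance matrix (eigh sorts ascending)
vals, vecs = np.linalg.eigh(np.cov(X_demo.T))
vals, vecs = vals[::-1], vecs[:, ::-1]   # reorder to descending

pca_demo = PCA(n_components=4).fit(X_demo)

# Rows of components_ are the eigenvectors, up to an arbitrary sign
same_directions = np.allclose(np.abs(pca_demo.components_), np.abs(vecs.T), atol=1e-6)
same_variances = np.allclose(pca_demo.explained_variance_, vals)
print(same_directions, same_variances)
```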

We can also look at the explained variance, which is a measure of how much of the variance in the data can be represented by a given principal component. This plot will tell us how useful PCA will be.

In [None]:
lam_sum = sum([j[0] for j in eigenvalvec])
explained_variance = [(lam_k/lam_sum) for lam_k in sorted(eigenvals, reverse=True)]

In [None]:
plt.figure(figsize=(6, 4))

plt.bar(range(len(explained_variance)), explained_variance, 
        alpha=0.5, align='center',
        label=r'Individual Explained Variance $\lambda_{k}$')

plt.ylabel('Explained variance ratio')
plt.xlabel('Ranked Eigenvalues')
plt.title("Scree Graph: Law Enforcement Agency Data", size=14)

plt.legend(loc='best')
plt.ylim([0,np.max(explained_variance)+0.1])
plt.tight_layout()

Not bad - the first principal component accounts for quite a bit of variance. Typically we specify a minimum amount of variance in the original data set that must be retained, and select a number of principal components based on this. By plotting the cumulative sum of the explained variance, we can use a number of components that gives us an explained variance of 90% or more:

In [None]:
fig = plt.figure(figsize=(6,4))
ax1 = fig.add_subplot(111)

ax1.plot(np.cumsum(explained_variance),'o')

ax1.set_ylim([0,1.01])

ax1.set_xlabel('Number of Principal Components')
ax1.set_ylabel('Cumulative explained variance')
ax1.set_title('Explained Variance: Law Enforcement Agency Data', size=14)

plt.show()

In [None]:
print(np.cumsum(explained_variance)[:4])

Four components it is - those four dimensions will account for 91.4% of the variance in the original data. Now we'll begin our PCA analysis.
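The threshold rule used to pick the number of components can also be written as a small helper instead of being read off the plot. This is a sketch: `n_components_for` is a hypothetical name, and the ratios in the demo are made-up numbers chosen to be easy to add by hand.

```python
import numpy as np

def n_components_for(explained_variance_ratio, threshold=0.90):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted gives the first index with cumulative >= threshold;
    # add 1 to turn a zero-based index into a count
    return int(np.searchsorted(cumulative, threshold) + 1)

# Made-up ratios: cumulative sums are 0.5, 0.75, 0.875, 0.9375, 1.0
demo_ratios = [0.5, 0.25, 0.125, 0.0625, 0.0625]
k = n_components_for(demo_ratios)
print(k)
```

For this notebook the call would be `n_components_for(explained_variance)`.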

## Principal Components Analysis

To perform PCA, we can do the linear algebra by hand (good learning experience), or we can let scikit-learn do all the heavy lifting for us (better idea). Perform PCA by creating a PCA object and setting parameters (mainly, number of components). Then fit it to the normalized data (vectors of input variables in `X_std`):

In [None]:
N_PCA = 4

# 4 components should explain about 90% of the variance.
sklearn_pca = PCA(n_components = N_PCA).fit(X_std)
print(sklearn_pca.components_.shape)

In [None]:
print("Principal Components:")
print(sklearn_pca.components_)

We already visualized these principal components in a heatmap, but it is useful to break out the first four principal components and examine them with bar charts to more easily see which variables contribute to each principal component.

In [None]:
# Tick locations and tick labels are set through the axis object itself:
# set_xticks() fixes where the ticks go, and set_xticklabels() sets
# their text, rotation, and size.

colors = [sns.xkcd_rgb[z] for z in ['dusty purple','dusty green','dusty blue','orange']]
for i in range(4):
    fig, ax = plt.subplots(figsize=(6,4))
    xstuff = list(range(len(sklearn_pca.components_[i])))
    # Pass x and y by keyword; recent seaborn versions expect keyword arguments
    sns.barplot(x=xstuff, y=sklearn_pca.components_[i], color=colors[i], ax=ax)
    
    ax.set_xticks(xstuff)
    ax.set_xticklabels(xlabels, rotation=90, size=14)
    
    ax.set_ylabel('Principal Component '+str(i+1)+' Value',size=12)
    ax.set_title('Principal Component '+str(i+1),size=12)
    plt.show()

Interestingly, the first principal component acts to greatly dampen the effect of certain variables - some we would expect, namely, the number of aggravated assaults, burglaries, and occurrences of murder or manslaughter, but also some that we would not expect, such as total number of law enforcement employees, total number of officers, and total occurrences of violent crime. Because these data consist of crime statistics, we would anticipate that a quantity such as total number of officers would be a good predictor of other crime statistics, but it is not. In fact, the number of civilians in the law enforcement agency is the quantity that represents crime data better.

When we perform PCA, we are applying a linear transform that reduces the number of dimensions by projecting the data onto a few directions, namely the eigenvectors of the covariance matrix with the largest eigenvalues; these are the directions along which the data varies the most. This transform can be expressed as:

$$
\mathbf{Z} = \mathbf{W} \mathbf{X}
$$

where $\mathbf{W}$ is the projection matrix (that's what we're fitting when we call the `fit()` method of the PCA object), $\mathbf{X}$ is the high-dimensional list of input vectors, and $\mathbf{Z}$ is a low-dimensional representation of said input vectors. In our case, this will be a four-element vector instead of a sixteen-element vector. 
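To make the role of $\mathbf{W}$ concrete, the sketch below (on synthetic data, with illustrative variable names) reproduces scikit-learn's `transform()` by hand. Because scikit-learn stores one input vector per row rather than per column, the product is written as $(\mathbf{X} - \boldsymbol{\mu}) \mathbf{W}^{T}$, where $\boldsymbol{\mu}$ is the per-feature mean stored in `mean_`:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 100 input vectors with 6 features each
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(100, 6))

pca_demo = PCA(n_components=2).fit(X_demo)

# W has one principal component per row: shape (2, 6)
W = pca_demo.components_

# Center the data, then project each row onto the two components
Z_manual = (X_demo - pca_demo.mean_).dot(W.T)
Z_sklearn = pca_demo.transform(X_demo)

print(W.shape, Z_manual.shape)
print(np.allclose(Z_manual, Z_sklearn))
```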

## Subspace Projection

Because we've picked a small number of principal components, we can perform a subspace projection from the high-dimensional data to the lower-dimensional subspace, and use that to visualize the structure of the data. While it may be difficult to think about what clusters in principal component space mean, it's easy to build intuition by visualization. Here's a scatterplot of the projection of data into principal component subspace:

In [None]:
Z = sklearn_pca.fit_transform(X_std)
print(Z.shape)

In [None]:
fig = plt.figure(figsize=(14,6))
ax1, ax2 = [fig.add_subplot(120 + i + 1) for i in range(2)]



ax1.scatter( Z[:,0], Z[:,1], s=80 )

ax1.set_title('Principal Components 0 and 1\nSubspace Projection', size=14)
ax1.set_xlabel('Principal Component 0')
ax1.set_ylabel('Principal Component 1')



ax2.scatter( Z[:,2], Z[:,3], s=80 )

ax2.set_title('Principal Components 2 and 3\nSubspace Projection', size=14)
ax2.set_xlabel('Principal Component 2')
ax2.set_ylabel('Principal Component 3')

plt.show()

In [None]:
for i in range(4):
    print("Explained Variance, Principal Component %d: %0.4f"%(i,sklearn_pca.explained_variance_[i]/sklearn_pca.explained_variance_.sum()))

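Note that the quantity above is normalized by the variance of the four retained components only, so the four values sum to 1. It is the within-model explained variance, not the fraction of the total variance across all input variables (scikit-learn exposes the latter as `explained_variance_ratio_`).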
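The difference between the two normalizations is easy to see on a small example. This sketch uses synthetic data with five features and keeps only two components; the names `within_model` and `of_total` are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X_demo = rng.normal(size=(200, 5))

pca_demo = PCA(n_components=2).fit(X_demo)

# Normalized by the retained components only: always sums to 1
within_model = pca_demo.explained_variance_ / pca_demo.explained_variance_.sum()

# Normalized by the variance of all five features: sums to less than 1
of_total = pca_demo.explained_variance_ratio_

print(within_model.sum(), of_total.sum())
```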

## K-Means Cluster Analysis

From the plots of the data in the four principal component dimensions above, we can visually see some clusters. We can use a k-means cluster analysis to group neighboring points together into clusters. Once we've done that, we can figure out which law enforcement agencies correspond to which group, and derive a useful way of classifying law enforcement agencies.

In [None]:
# random_state=0 fixes the seed so the cluster numbering is reproducible
km = KMeans(n_clusters=6, n_init=4, random_state=0)
km.fit(Z)
print(km.n_clusters)
print(km.predict(Z))
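For intuition about what the fitted model returns, here is a minimal sketch on two well-separated synthetic blobs (not the agency data). Each point receives an integer cluster label, and points from the same blob end up with the same label; which blob gets label 0 and which gets label 1 is arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two tight, well-separated synthetic blobs of 20 points each
rng = np.random.default_rng(3)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(20, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.1, size=(20, 2))
Z_demo = np.vstack([blob_a, blob_b])

km_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(Z_demo)

# labels_ holds the cluster index assigned to each training point
labels = km_demo.labels_
print(labels)
print(km_demo.cluster_centers_)
```

Because the label numbers carry no meaning of their own, the clusters found below have to be interpreted by looking at which agencies fall into each one.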

In [None]:
# To color each point by its cluster,
# create a color map with N elements (N rgb values),
# one for each cluster.
#
# Then index into that list with the cluster label that
# k-means assigned to each point (an integer from 0 to N-1).
def get_cmap(n):
    colorz = plt.get_cmap('Set1')
    return[ colorz(float(i)/n) for i in range(n)]

colorz = get_cmap( km.n_clusters )
colors = [colorz[j] for j in km.predict(Z)]

fig = plt.figure(figsize=(12,4))
ax1, ax2 = [fig.add_subplot(120 + i + 1) for i in range(2)]

s1 = ax1.scatter( Z[:,0], Z[:,1] , c=colors, s=80 )
ax1.set_title('Principal Components 0 and 1\nSubspace Projection')

s2 = ax2.scatter( Z[:,2], Z[:,3] , c=colors, s=80 )
ax2.set_title('Principal Components 2 and 3\nSubspace Projection')


# ------------
# thanks to matplotlib for legend stupid-ness.
# guess i'll just draw the legend myself.
labels = ["Cluster "+str(j) for j in range(km.n_clusters)]
rs = []
for i in range(len(colorz)):
    p = plt.Rectangle((0,0), 1, 1, fc = colorz[i])
    rs.append(p)
ax1.legend(rs, labels, loc='best')
ax2.legend(rs, labels, loc='best')
# ------------

ax1.set_ylim([-3,7])
ax2.set_ylim([-3,7])

ax1.set_xlabel("Principal Component 0")
ax1.set_ylabel("Principal Component 1")

ax2.set_xlabel("Principal Component 2")
ax2.set_ylabel("Principal Component 3")

plt.show()

In [None]:
# Store the cluster number in a new DataFrame column 
cluster_col = km.predict(Z)
df['Cluster'] = cluster_col

In [None]:
for k in range(km.n_clusters):
    if k!=0:
        print("-"*20)
    print("Cluster %d:"%(k))
    pprint(df['Agency'][df['Cluster']==k].tolist())

The cluster analysis appears to have grouped the law enforcement agencies into school districts (two clusters), tribal law enforcement agencies (one cluster), and three single-agency clusters for agencies that are statistical outliers from each other and from the rest: parks and recreation, a state hospital, and BART police.

The analysis provides two novel observations:

 * The San Bernardino school district law enforcement agency is statistically quite different from the three other school district law enforcement agencies;

 * The tribal law enforcement agency cluster also includes an airport, a rancheria, and a developmental center, meaning these agencies operate similarly to tribal law enforcement agencies.