# San Francisco Police Department Incident Dataset

In this notebook, we'll be exploring the incident report dataset from the San Francisco Police Department using data mining and visualization techniques.

----

## Reading Dataset

In [None]:
import pandas

Our raw source dataset can be retrieved from this link: <https://data.sfgov.org/api/views/tmnf-yvry/rows.csv?accessType=DOWNLOAD>.

It is larger than the maximum 100 MB file size that GitHub allows. Until we decide how to filter it and can check it in, make sure it is downloaded into the same directory as this notebook file.

Once the dataset is in place, the following code can read in the CSV as a dataframe and report the number of rows.

In [None]:
incident_df = pandas.read_csv('./SFPD_Incidents_-_from_1_January_2003.csv')
print(incident_df.shape[0])
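
Because the file is large, it may help to read it in pieces while we decide how to filter it. The following is a minimal sketch, assuming pandas' `chunksize` and `usecols` options to `read_csv` are acceptable for our purposes; the in-memory CSV and its rows are made up so the example runs without the real download.

```python
import io
import pandas

# Stand-in for the downloaded CSV so the sketch runs anywhere. The column
# names mirror ones used in this notebook; the rows themselves are made up.
sample_csv = io.StringIO(
    "Category,Descript,Date\n"
    "ROBBERY,EXAMPLE A,01/01/2003\n"
    "ASSAULT,EXAMPLE B,01/01/2003\n"
    "ROBBERY,EXAMPLE C,01/02/2003\n"
)

# Read a fixed number of rows at a time and keep only selected columns,
# so the whole file never has to sit in memory at once.
row_count = 0
for chunk in pandas.read_csv(sample_csv, usecols=["Category", "Date"], chunksize=2):
    row_count += len(chunk)

print(row_count)
```

With the real file, the `io.StringIO` object would be replaced by the CSV path used above.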

Print the full list of features from the original dataset so we can see what we're working with.

In [None]:
print(incident_df.columns)

It is likely to prove useful to know the unique types within the Resolution, Category, and Descript columns.

In [None]:
resolution_types = incident_df.groupby('Resolution').Resolution.nunique()
print("{} unique values in 'Resolution' column".format(resolution_types.size))
resolution_types

In [None]:
category_types = incident_df.groupby('Category').Category.nunique()
print("{} unique values in 'Category' column".format(category_types.size))
category_types

In [None]:
descript_types = incident_df.groupby('Descript').Descript.nunique()
print("{} unique values in 'Descript' column".format(descript_types.size))
descript_types
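
The groupby approach above works, but the resulting series only holds a 1 for each value. As a hedged alternative, `Series.nunique()` and `Series.value_counts()` report the number of unique values and how often each occurs. The toy dataframe below is an assumption for illustration, not data from the real file.

```python
import pandas

# Made-up values standing in for the real 'Resolution' column.
toy_df = pandas.DataFrame(
    {"Resolution": ["NONE", "ARREST, BOOKED", "NONE", "NONE"]}
)

# Number of distinct values in the column.
unique_count = toy_df["Resolution"].nunique()

# How many rows carry each distinct value, most frequent first.
counts = toy_df["Resolution"].value_counts()

print("{} unique values in 'Resolution' column".format(unique_count))
print(counts)
```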

In order to confirm that we're getting the expected data, report the value in the 'Date' column for row 0.

In [None]:
date = incident_df.at[0, 'Date']  # DataFrame.get_value() was removed in pandas 1.0
print(date)

This is a neat way to filter rows in a dataframe. Here I create a smaller dataframe by selecting the rows whose 'Date' column equals the date read above for row 0. Be careful, though: the row indices still match those from the original dataset, so the smaller dataframe no longer has contiguous indices, which will give you fits if you try to iterate over it by row position.

In [None]:
df_on_date = incident_df[incident_df.Date == date]
df_on_date
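
One way to sidestep the non-contiguous index problem is to renumber the filtered rows with `reset_index(drop=True)`. This is a small sketch on a made-up dataframe; whether we want to discard the original indices in the real analysis is still an open choice.

```python
import pandas

# Made-up rows; only the shape of the problem matters here.
toy_df = pandas.DataFrame({
    "Date": ["01/01/2003", "01/02/2003", "01/01/2003"],
    "Category": ["A", "B", "C"],
})

# Filtering keeps the original row labels, so they are no longer contiguous.
subset = toy_df[toy_df.Date == "01/01/2003"]
print(list(subset.index))

# drop=True discards the old labels instead of keeping them as a new column.
renumbered = subset.reset_index(drop=True)
print(list(renumbered.index))
```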

----

## Initializing Bokeh

Bokeh is a useful Python visualization library that we'll be using multiple times below.  Rather than repeat its initialization or spread it out over multiple locations in the document, I'm giving its initialization an independent section early in the notebook.

Also, please note that I had to run the following command before I could get Bokeh plots to display at all.

```
jupyter nbextension enable --py --sys-prefix widgetsnbextension
```

In [None]:
# Initialize Bokeh for visualizations.
import bokeh.io
import bokeh.models
import bokeh.plotting

bokeh.io.output_notebook()

----

## Mapping Incidents

We will probably need visualizations that layer incidents
over a base map of San Francisco, using coordinates
like those available in the dataset. To start with,
I've been using Bokeh:
<http://bokeh.pydata.org/en/latest/docs/user_guide/geo.html>.

Given that it was needed for Bokeh mapping, I went ahead 
and grabbed a Google Maps API key for a project named
"CMPE 188 - Crime Predictors".
It is "AIzaSyAs6Ugy0oz0R5YAxep9-kQ170t0U2fjELQ".

In [None]:
def map_incidents(df):

    # Extract the coordinates (Y is latitude, X is longitude) of the incidents in df.
    map_lats = []
    map_lons = []
    for row in df.itertuples():
        map_lats.append(float(row.Y))
        map_lons.append(float(row.X))
    map_data = dict(lat=map_lats, lon=map_lons)
    
    # Create the map plot object.
    map_plot = bokeh.models.GMapPlot(
        x_range=bokeh.models.DataRange1d(),
        y_range=bokeh.models.DataRange1d(),
        map_options=bokeh.models.GMapOptions(
            lat=map_lats[0],
            lng=map_lons[0],
            map_type="roadmap",
            zoom=11
        )
    )
    
    # Give the map plot a title.
    map_plot.title.text = "San Francisco Police Department Incident Locations"
    
    # Set the Google Maps API key to our project-specific key.
    map_plot.api_key = "AIzaSyAs6Ugy0oz0R5YAxep9-kQ170t0U2fjELQ"
    
    # Add glyphs of blue circles at extracted locations of incidents.
    map_plot.add_glyph(
        bokeh.models.ColumnDataSource(data=map_data),
        bokeh.models.Circle(
            x="lon",
            y="lat",
            size=10,
            fill_color="blue",
            fill_alpha=0.20,
            line_color=None
        )
    )
    
    # Add the standard map control tools.
    map_plot.add_tools(
        bokeh.models.PanTool(),
        bokeh.models.WheelZoomTool(),
        bokeh.models.BoxSelectTool()
    )

    return map_plot

In [None]:
bokeh.io.show(map_incidents(df_on_date))
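
The coordinate extraction inside `map_incidents` could also be done without an explicit loop. The sketch below is one possible variant, assuming the `X` and `Y` columns hold longitude and latitude as in the function above; centering on the mean instead of the first row is my own suggestion, and the coordinates are made up.

```python
import pandas

# Made-up coordinates stored as strings, to show the float conversion.
toy_df = pandas.DataFrame({
    "X": ["-122.42", "-122.40"],  # longitude
    "Y": ["37.77", "37.79"],      # latitude
})

# Convert both columns to floats in one step instead of row by row.
coords = toy_df[["Y", "X"]].astype(float)
map_data = dict(lat=coords["Y"].tolist(), lon=coords["X"].tolist())

# Center the map on the mean location rather than on the first incident.
center_lat = coords["Y"].mean()
center_lon = coords["X"].mean()

print(map_data)
print(center_lat, center_lon)
```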

----

## Clustering Incidents

In [None]:
# Brief overview: narrow down the SFPD dataset to a given date (plus a few other minor
# filters) to get a subset small enough for the k-means algorithm to run on my hardware.
# For k-means, the input must be numeric, and the dataframe is converted into a matrix
# to ditch the column headers.

# please remember, this is really only horseplay

reduced_df = incident_df[['Category', 'Descript', 'DayOfWeek', 'Date', 'Time']]

# Select rows from a single day (the date of row 0).
date = reduced_df.at[0, 'Date']  # DataFrame.get_value() was removed in pandas 1.0

# Filter out categories that are probably not useful here.
reduced_df = reduced_df[reduced_df.Category != "FRAUD"]
reduced_df = reduced_df[reduced_df.Category != "NON-CRIMINAL"]
reduced_df = reduced_df[reduced_df.Date == date]
reduced_df

In [None]:
# Get the number of distinct descriptions within each category. This shows the
# categories, but not necessarily the distinct descriptions, since some are duplicates.
# Found help from this site:
# http://stackoverflow.com/questions/15411158/pandas-countdistinct-equivalent

# for a given day 
reduced_df.groupby('Category').Descript.nunique()


# for the entire dataset
incident_df.groupby('Category').Descript.nunique()


#take a sample set of features to feed into Kmeans

data_df_test = incident_df[['Category', 'Descript']]

#data_df_test = data_df_test[data_df_test.Category != "TREA"]
#data_df_test = data_df_test[data_df_test.Category != "RECOVERED VEHICLE"]
#data_df_test = data_df_test[data_df_test.Category != "PORNOGRAPHY/OBSCENE MAT"]
#data_df_test = data_df_test[data_df_test.Category != "BRIBERY"]
#data_df_test = data_df_test[data_df_test.Category != "SUICIDE"]


data_df_test.groupby("Category").Descript.nunique()

In [None]:
from sklearn.preprocessing import LabelEncoder

# in order to use K-means, the inputs must be numerical, so we have to discretize the category input
# found this post off stackoverflow helpful 
# http://stackoverflow.com/questions/34915813/convert-text-columns-into-numbers-in-sklearn 

le = LabelEncoder()

test_series = reduced_df.apply(le.fit_transform)

print(test_series)
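
Reusing one `LabelEncoder` across columns refits it each time, so afterwards it only remembers the last column. As a hedged sketch, keeping one encoder per column lets us map numbers back to labels later. The `encoders` dictionary and the toy rows are assumptions for illustration.

```python
import pandas
from sklearn.preprocessing import LabelEncoder

# Made-up rows; note that every row shares the same Date.
toy_df = pandas.DataFrame({
    "Category": ["ROBBERY", "ASSAULT", "ROBBERY"],
    "Date": ["01/01/2003", "01/01/2003", "01/01/2003"],
})

# Keep a separate fitted encoder for each column.
encoders = {}
encoded = pandas.DataFrame(index=toy_df.index)
for column in toy_df.columns:
    encoders[column] = LabelEncoder()
    encoded[column] = encoders[column].fit_transform(toy_df[column])

# A column with a single distinct value encodes to all zeros.
print(encoded)

# The per-column encoder can translate the numbers back to labels.
decoded = encoders["Category"].inverse_transform(encoded["Category"])
print(list(decoded))
```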

In [None]:
#normalize the input data before using kmeans

from sklearn.cluster import KMeans

normalized_df = (test_series-test_series.mean())/test_series.std()

# Remove the DayOfWeek and Date columns since LabelEncoder zeroed both of them (not sure why).

# TMT: They were zeroed because you were encoding labels as numbers, and the dataframe
# had already been filtered down to incidents from the same Date (and, thus, DayOfWeek).
# A constant column also has zero standard deviation, so normalizing it yields NaN.

del normalized_df["DayOfWeek"]
del normalized_df["Date"]

# Remove the column names by converting the dataframe into a NumPy array.
# (DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it.)

testdata = normalized_df.to_numpy()

print(testdata)
#perform k-means analysis on the reduced data set

kmean = KMeans(n_clusters=5) 

kmean.fit(testdata)
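
Rather than deleting constant columns by name, they could be detected from their standard deviation before normalizing. This is a minimal sketch on made-up encoded values, assuming z-score normalization as used above; the variable names are hypothetical.

```python
import pandas

# Made-up encoded values; 'Date' is constant, as it is after filtering to one day.
encoded = pandas.DataFrame({
    "Category": [0, 1, 2, 1],
    "Date": [0, 0, 0, 0],
})

# Naive z-score: the constant column becomes 0 / 0, which is NaN.
naive = (encoded - encoded.mean()) / encoded.std()
print(naive["Date"].isna().all())

# Keep only the columns that actually vary, then normalize those.
varying = encoded.loc[:, encoded.std() > 0]
normalized = (varying - varying.mean()) / varying.std()
print(list(normalized.columns))
```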

In [None]:
plot = bokeh.plotting.figure(
    width = 500,
    height = 500,
    title = 'CrimeStoppers',
    x_axis_label = "category",
    y_axis_label = "descript"
)

#plot centroid / cluster center / group mean for each group

clus_xs = []

clus_ys = []

#we get the  cluster x / y values from the k-means algorithm

for entry in kmean.cluster_centers_:
    clus_xs.append(entry[0])
    clus_ys.append(entry[1])

#the cluster center is marked by a circle, with a cross in it

plot.circle_cross(
    x=clus_xs,
    y=clus_ys,
    size=40,
    fill_alpha=0,
    line_width=2,
    color=['red', 'blue', 'purple', 'green', 'yellow']
)

plot.text(text = ['something', 'other', 'another', 'yet', 'more'], x=clus_xs, y=clus_ys, text_font_size='30pt')

# Plot each sample using its normalized category and descript values.
# We get our x / y values from the normalized data fed to k-means.
# The k-means "labels_" attribute tells us which cluster each sample belongs to,
# and therefore which color to draw it in.

cluster_colors = ['red', 'blue', 'purple', 'green', 'yellow']

for sample, label in zip(testdata, kmean.labels_):
    plot.circle(x=sample[0], y=sample[1], size=15, color=cluster_colors[label])

# Output using the given date, normalization by standard deviation, and 5 clusters.
# As a last step, I have been trying to evaluate the messy plot and tweak some parameters.

bokeh.io.show(plot)
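
The loop above adds one glyph per sample, which gets slow as the subset grows. One possible alternative is to group the points by cluster label first, so each cluster needs only a single glyph call. The sketch below shows just the grouping step with NumPy on made-up points and labels; the `groups` dictionary is a hypothetical helper, not something from the notebook.

```python
import numpy

# Made-up 2-D points and cluster labels standing in for testdata and kmean.labels_.
points = numpy.array([[0.0, 0.1], [1.0, 1.1], [0.2, 0.0], [0.9, 1.0]])
labels = numpy.array([0, 1, 0, 1])
colors = ["red", "blue"]

# For each cluster, collect the x and y coordinates of its members.
groups = {}
for label, color in enumerate(colors):
    members = points[labels == label]  # boolean mask selects this cluster's rows
    groups[color] = (members[:, 0].tolist(), members[:, 1].tolist())

print(groups)
```

Each `(xs, ys)` pair could then be passed to a single `plot.circle` call with the matching color.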

----

## Notes

*Tahoma's list of interesting things to experiment with:*

* See if clustering would show anything interesting in our dataset.
* Make a set out of the category column, and then display its size and elements.
* Change the map to display different categories of incident in different colors.
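
As a starting point for the last two items, here is a rough sketch that builds a set from a category column and assigns each category a color. The toy rows and the three-color `palette` are assumptions; the real dataset has more categories, so the palette would cycle unless it is extended.

```python
import pandas

# Made-up categories standing in for the real 'Category' column.
toy_df = pandas.DataFrame(
    {"Category": ["ROBBERY", "ASSAULT", "ROBBERY", "VANDALISM"]}
)

# Make a set out of the category column and display its size and elements.
categories = set(toy_df["Category"])
print(len(categories), sorted(categories))

# Hypothetical palette; colors repeat if there are more categories than colors.
palette = ["red", "blue", "green"]
color_for = {
    category: palette[i % len(palette)]
    for i, category in enumerate(sorted(categories))
}

# Attach a per-row color that a map glyph could use as its fill color.
toy_df["color"] = toy_df["Category"].map(color_for)
print(toy_df)
```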