# San Francisco Police Department Incident Dataset

In this notebook, we'll be exploring the incident report dataset from the San Francisco Police Department using data mining and visualization techniques.

----

## Library Code

In order to make code development easier and to not clutter the notebook,
we'll be referring to definitions from our own notebook-local Python library
in the following sections.

In [None]:
import crime

----

## Reading Dataset

In [None]:
import pandas

Our raw source dataset can be retrieved from this link: <https://data.sfgov.org/api/views/tmnf-yvry/rows.csv?accessType=DOWNLOAD>.

It is larger than the maximum 100 MB file size that GitHub allows.  Until we decide how to filter it and can check it in, make sure it's downloaded into the same directory as this notebook file.

Once the dataset is in place, the following code reads the CSV into a dataframe and reports the number of rows and columns.

In [None]:
incident_file = 'SFPD_Incidents_-_from_1_January_2003.csv'
incident_df = pandas.read_csv(incident_file)
print("Read {} rows x {} columns of incident data from '{}'.".format(incident_df.shape[0], incident_df.shape[1], incident_file))
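
Because the full file is large, one option while the filtering question is still open is to load only the columns of interest. The sketch below is not part of this notebook's pipeline: it reads a small in-memory CSV as a stand-in for the real file, the rows are invented for illustration, and the `usecols` list is an assumption about which columns matter.

```python
import io
import pandas

# Toy stand-in for the real CSV; these rows are invented for illustration.
toy_csv = io.StringIO(
    "IncidntNum,Category,Descript,Date,Time,Resolution,X,Y\n"
    "1,ASSAULT,BATTERY,01/19/2015,14:00,NONE,-122.42,37.77\n"
    "2,LARCENY/THEFT,PETTY THEFT,01/20/2015,09:30,\"ARREST, BOOKED\",-122.41,37.78\n"
)

# Assumed list of columns worth keeping.
wanted_columns = ['Category', 'Descript', 'Date', 'Time', 'Resolution', 'X', 'Y']

# usecols tells read_csv to parse and keep only the listed columns.
toy_df = pandas.read_csv(toy_csv, usecols=wanted_columns)
print(toy_df.shape)
```

Passing the real file name instead of `toy_csv` should work the same way, provided the column names match.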

Print the list of features from the original dataset so that we can see what kinds of columns we're working with.

In [None]:
incident_features = incident_df.columns.values
print(incident_features)

Let's get a little bit more information about the types of values in our input data.

In [None]:
incident_df.info()

In [None]:
incident_df.describe()

In [None]:
incident_df.describe(include=['O'])

Also, we can see the first several rows of our input dataframe like this.

In [None]:
incident_df.head()

It is likely to prove useful to know the unique values within each of the Resolution, Category, and Descript columns.

In [None]:
# Sorted unique values; dropna() skips missing entries.
category_values = incident_df['Category'].dropna().sort_values().unique()
print("The 'Category' column contains {} unique values.".format(category_values.size))
category_values

In [None]:
descript_values = incident_df['Descript'].dropna().sort_values().unique()
print("The 'Descript' column contains {} unique values.".format(descript_values.size))
descript_values

In [None]:
resolution_values = incident_df['Resolution'].dropna().sort_values().unique()
print("The 'Resolution' column contains {} unique values.".format(resolution_values.size))
resolution_values

----

## Cleaning Dataset

Some of the columns in the input dataset need to be cleaned up before we can make much use of them.

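To begin with, observe that the 'Date' and 'Time' columns are stored as strings and should be converted to proper timestamps.  The following cell only previews the 'Date' conversion; the assignment back into the dataframe is left commented out for now.

In [None]:
# Preview the conversion; uncomment the assignment to store the result.
#incident_df['Date'] = pandas.to_datetime(incident_df['Date'])
pandas.to_datetime(incident_df['Date'])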
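
One way to fold both columns into a single timestamp is to join the strings and parse them together. This is a minimal sketch on toy rows, not a tested step of this notebook: the `MM/DD/YYYY` and `HH:MM` layouts are assumptions about the real data, and the `Timestamp` column name is hypothetical.

```python
import pandas

# Toy rows; the MM/DD/YYYY and HH:MM layouts are assumptions about the real data.
toy_df = pandas.DataFrame({
    'Date': ['01/19/2015', '02/01/2015'],
    'Time': ['14:00', '23:45'],
})

# Join the two strings and parse them in one pass with an explicit format.
# 'Timestamp' is a hypothetical column name chosen for this sketch.
toy_df['Timestamp'] = pandas.to_datetime(
    toy_df['Date'] + ' ' + toy_df['Time'],
    format='%m/%d/%Y %H:%M',
)
print(toy_df['Timestamp'])
```

An explicit `format` avoids per-row format guessing, which tends to matter on a dataset of this size. If the real strings use a different layout, the format string would need to change accordingly.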

----

## Reducing Dataset

Examining the available features, we can see several columns that are of little interest to the rest of the work.  So, let's immediately filter for the interesting features.

In [None]:
filtered_df = incident_df.filter(items=['Category', 'Descript', 'Date', 'Time', 'Resolution', 'X', 'Y'])
print("Filtered incident data down to {} rows x {} columns.".format(filtered_df.shape[0], filtered_df.shape[1]))

Now that we've reduced the dimensionality of our dataset by eliminating uninteresting columns, we can also consider selecting fewer rows.  Looking at the classes of 'Resolution' values, it might make sense to restrict our work to the incidents that resulted in an arrest and booking, and then drop the 'Resolution' column.

Boolean indexing, as used below, is a neat way to select rows from a dataframe.  Be careful, though: the selected rows keep their index labels from the original dataset, so the smaller dataframe no longer has contiguous indices.  Code that iterates over row numbers 0, 1, 2, ... will fail or pick the wrong rows unless the index is reset first.

In [None]:
bookings_df = filtered_df[filtered_df.Resolution == 'ARREST, BOOKED']
bookings_df = bookings_df.drop('Resolution', axis=1)
print("Bookings data reduced down to {} rows x {} columns.".format(bookings_df.shape[0], bookings_df.shape[1]))
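
The index caveat mentioned above can be seen on a toy dataframe. This sketch uses invented rows and hypothetical `toy_` names; `reset_index(drop=True)` is one way to get contiguous indices back if they are needed.

```python
import pandas

# Invented rows standing in for the filtered incident data.
toy_df = pandas.DataFrame({
    'Category': ['ASSAULT', 'BURGLARY', 'ASSAULT', 'ROBBERY'],
    'Resolution': ['NONE', 'ARREST, BOOKED', 'NONE', 'ARREST, BOOKED'],
})

# Boolean selection keeps the original row labels (1 and 3 here).
toy_bookings = toy_df[toy_df.Resolution == 'ARREST, BOOKED']
print(list(toy_bookings.index))

# reset_index(drop=True) renumbers from 0 and discards the old labels.
toy_bookings_reset = toy_bookings.reset_index(drop=True)
print(list(toy_bookings_reset.index))
```

Keeping the original labels can also be useful, since they allow a row to be traced back to the full dataset. Positional access with `.iloc` works either way.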

Let's take a look at the shape of the reduced bookings dataset.

In [None]:
bookings_df.info()

In [None]:
bookings_df.describe()

In [None]:
bookings_df.describe(include=['O'])

NOTE: For whatever it's worth, I'm starting to second guess filtering the incidents down to just bookings.

----

## Initializing Bokeh

Bokeh is a useful Python visualization library that we'll be using multiple times below.  Rather than repeat its initialization or spread it out over multiple locations in the document, I'm giving its initialization an independent section early in the notebook.

Also, please note that I had to run the following command before I could get Bokeh plots to display at all.

```
jupyter nbextension enable --py --sys-prefix widgetsnbextension
```

In [None]:
# Initialize Bokeh for visualizations.
import bokeh.io
import bokeh.models
import bokeh.plotting

bokeh.io.output_notebook()

----

## Mapping Incidents

It would be useful to build visualizations that layer incidents
over a base map of San Francisco, using the coordinates
available in the dataset.

That said, our bookings dataframe still has several hundred thousand
incidents in it, so let's constrain our example visualization
to a single date.  For simplicity, we arbitrarily use the date of
the first incident in our original input data.

In [None]:
# DataFrame.get_value() was removed in pandas 1.0; .at is the scalar accessor.
date = incident_df.at[0, 'Date']
print(date)

In [None]:
df_bookings_on_date = bookings_df[bookings_df.Date == date]
df_bookings_on_date = df_bookings_on_date.filter(items=['Category', 'Descript', 'Time', 'X', 'Y'])
df_bookings_on_date.head(5)

For map visualization, we're starting with Bokeh's geographic plotting support:
<http://bokeh.pydata.org/en/latest/docs/user_guide/geo.html>.

In [None]:
bokeh.io.show(crime.map_incidents(df_bookings_on_date))
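
`crime.map_incidents` lives in the notebook-local library and is not shown here, so the following is not its implementation. Bokeh's tile-based maps plot in Web Mercator coordinates, so if 'X' and 'Y' hold longitude and latitude in degrees (an assumption about this dataset), a conversion along these lines would be needed before plotting over map tiles. `lonlat_to_web_mercator` is a hypothetical helper, not part of `crime` or Bokeh.

```python
import numpy

EARTH_RADIUS_M = 6378137.0  # WGS 84 equatorial radius used by Web Mercator


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Convert longitude/latitude in degrees to Web Mercator meters.

    Hypothetical helper for illustration; accepts scalars or numpy arrays.
    """
    lon_rad = numpy.radians(lon_deg)
    lat_rad = numpy.radians(lat_deg)
    x = EARTH_RADIUS_M * lon_rad
    y = EARTH_RADIUS_M * numpy.log(numpy.tan(numpy.pi / 4 + lat_rad / 2))
    return x, y


# Approximate coordinates for San Francisco, for illustration only.
x, y = lonlat_to_web_mercator(numpy.array([-122.42]), numpy.array([37.77]))
print(x, y)
```

The resulting `x` and `y` arrays could then be stored as extra dataframe columns and passed to a Bokeh glyph in place of the raw coordinates.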

----

## Gaussian Naive Bayes

The following commented-out snippet is a placeholder that has not yet been adapted to this dataset.  It originally came from <https://www.kaggle.com/wikaiqi/titanic/titaniclearningqi>.

In [None]:
# Gaussian Naive Bayes
#gaussian = GaussianNB()
#gaussian.fit(X_data, Y_data)
#Y_pred = gaussian.predict(X_test_kaggle)
#acc_gaussian = cross_val_score(gaussian, X_data, Y_data, cv=Kfold)
#bcc_gaussian = round(gaussian.score(X_test, Y_test) * 100, 5)

#submission = pd.DataFrame({
#        "PassengerId": test_df["PassengerId"],
#        "Survived": Y_pred
#    })
#submission.to_csv('submission_Gassian_Naive_Bayes.csv', index=False)

----

## Clustering Incidents

In [None]:
from sklearn.preprocessing import LabelEncoder

# K-means needs numerical inputs, so we encode the text columns
# ('Category' and 'Descript') as integer labels.
# This Stack Overflow post was helpful:
# http://stackoverflow.com/questions/34915813/convert-text-columns-into-numbers-in-sklearn

le = LabelEncoder()

test_series = df_bookings_on_date[df_bookings_on_date.columns[0:2]].apply(le.fit_transform)

print(test_series)

In [None]:
#normalize the input data before using kmeans

from sklearn.cluster import KMeans

normalized_df = (test_series-test_series.mean())/test_series.std()

# TMT: it might make more sense to cluster by location...
#normalized_df = df_bookings_on_date[df_bookings_on_date.columns[3:5]]

# Drop the column names by converting the dataframe to a NumPy array.
# (DataFrame.as_matrix() was removed in pandas 1.0; .values is equivalent here.)
testdata = normalized_df.values

print(testdata)
#perform k-means analysis on the reduced data set

kmean = KMeans(n_clusters=5) 

kmean.fit(testdata)
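
The commented-out alternative in the cell above suggests clustering by location instead. A minimal sketch of that idea follows, using invented coordinates in place of the real 'X' and 'Y' columns. The cluster count, `n_init`, and `random_state` values are arbitrary choices for a reproducible toy example, not tuned settings.

```python
import pandas
from sklearn.cluster import KMeans

# Invented coordinates forming two well-separated groups of three points.
toy_coords = pandas.DataFrame({
    'X': [-122.50, -122.51, -122.49, -122.40, -122.41, -122.39],
    'Y': [37.70, 37.71, 37.69, 37.80, 37.81, 37.79],
})

# Standardize each column so both axes contribute on the same scale.
toy_normalized = (toy_coords - toy_coords.mean()) / toy_coords.std()

# Fixed random_state and explicit n_init keep the toy result reproducible.
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = toy_kmeans.fit_predict(toy_normalized.values)
print(toy_labels)
```

Unlike integer-encoded categories, coordinates have a meaningful notion of distance, which is what k-means relies on.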

In [None]:
plot = bokeh.plotting.figure(
    width = 500,
    height = 500,
    title = 'CrimeStoppers',
    x_axis_label = "category",
    y_axis_label = "descript"
)

# Plot the centroid (cluster center) of each group.
# The x/y values of the centers come from the fitted k-means model.
clus_xs = [center[0] for center in kmean.cluster_centers_]
clus_ys = [center[1] for center in kmean.cluster_centers_]

# Each cluster center is marked by a circle with a cross in it.

plot.circle_cross(
    x=clus_xs,
    y=clus_ys,
    size=40,
    fill_alpha=0,
    line_width=2,
    color=['red', 'blue', 'purple', 'green', 'yellow']
)

plot.text(text = ['something', 'other', 'another', 'yet', 'more'], x=clus_xs, y=clus_ys, text_font_size='30pt')

# Plot each sample (normalized 'Category' code vs. 'Descript' code).
# The x/y values come from the normalized data, and "labels_" tells us
# which cluster, and therefore which color, each point belongs to.
cluster_colors = ['red', 'blue', 'purple', 'green', 'yellow']
for sample, label in zip(testdata, kmean.labels_):
    plot.circle(x=sample[0], y=sample[1], size=15, color=cluster_colors[label])

# Output for the given date, using standard-deviation normalization and 5 clusters.
# As a last step, I have been trying to evaluate the messy plot and tweak some parameters.

bokeh.io.show(plot)

----

## Notes

*Interesting things to experiment with:*

* See if clustering would show anything interesting in our dataset.
* Change the map to display different categories of incident in different colors.
* Write a getDayOfWeekFromDate() helper function in crime library.
* Linearize 'Date' and 'Time' fields into a single date/time value (e.g. unix time in seconds since 1970).
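
The last two items might look something like the sketch below. Both function names are hypothetical (nothing like them exists in the `crime` library yet), and both assume the input has already been parsed into pandas datetimes.

```python
import pandas


def get_day_of_week_from_date(dates):
    """Return weekday names for a Series of datetimes (hypothetical helper)."""
    return pandas.to_datetime(dates).dt.day_name()


def to_unix_seconds(timestamps):
    """Return whole seconds since 1970-01-01 for a datetime Series."""
    epoch = pandas.Timestamp('1970-01-01')
    return (timestamps - epoch) // pandas.Timedelta(seconds=1)


# Toy values chosen so the results are easy to check by hand.
toy_times = pandas.Series(pandas.to_datetime(['1970-01-01 00:00', '1970-01-02 00:01']))

print(list(get_day_of_week_from_date(toy_times)))
print(list(to_unix_seconds(toy_times)))
```

A single numeric time value like this could then feed clustering or other algorithms that require numerical inputs.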