# DAT210x - Programming with Python for DS

## Module5 - Lab1

Start by importing whatever you need to import in order to make this lab work:

In [2]:
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib
import matplotlib.pyplot as plt


### How to Get The Dataset

1. Open up the City of Chicago's [Open Data | Crimes](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2) page.
1. In the `Primary Type` column, click on the `Menu` button next to the info button, and select `Filter This Column`. It might take a second for the filter option to show up, since it has to load the entire list first.
1. Scroll down to `GAMBLING`
1. Click the light blue `Export` button next to the `Filter` button, and select `Download As CSV`

Now that you have the dataset stored as a CSV, load it, being careful to double-check the headers, as per usual:

In [9]:
df1 = pd.read_csv('Datasets/crimes.csv')
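
One way to double-check the headers is to look at the column names and shape right after loading. The sketch below uses a small in-memory CSV as a stand-in for `Datasets/crimes.csv`; the `sample_csv` contents and the `sample` name are made up for illustration and are not part of the lab data.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for Datasets/crimes.csv.
sample_csv = io.StringIO(
    "ID,Primary Type,Latitude,Longitude\n"
    "1,GAMBLING,41.78,-87.70\n"
    "2,GAMBLING,41.96,-87.68\n"
)
sample = pd.read_csv(sample_csv)

# If a data row had been consumed as the header (or the header read as data),
# it would show up here as odd column names or an unexpected row count.
print(sample.columns.tolist())
print(sample.shape)
```

The same two `print` calls can be run on `df1` once the real file is loaded.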

Get rid of any _rows_ that have NaNs in them:

In [10]:
df1.dropna(how='any',inplace=True,axis=0)
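
To see what `how='any'` and `axis=0` do together, here is a minimal sketch on a toy frame. The `toy` and `cleaned` names and the values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame: only the first row is complete.
toy = pd.DataFrame({
    "Latitude": [41.78, np.nan, 41.96],
    "Longitude": [-87.70, -87.63, np.nan],
})

# axis=0 drops rows; how='any' drops a row if at least one value is missing.
cleaned = toy.dropna(how="any", axis=0)
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Comparing `len(df1)` before and after the `dropna` call in the same way shows how many crime records were discarded.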

Display the `dtypes` of your dataset:

In [11]:
df1.dtypes

ID                        int64
Case Number              object
Date                     object
Block                    object
IUCR                      int64
Primary Type             object
Description              object
Location Description     object
Arrest                     bool
Domestic                   bool
Beat                      int64
District                float64
Ward                    float64
Community Area          float64
FBI Code                  int64
X Coordinate            float64
Y Coordinate            float64
Year                      int64
Updated On               object
Latitude                float64
Longitude               float64
Location                 object
dtype: object

Coerce the `Date` feature (which is currently a string object) into a real datetime, and confirm by displaying the `dtypes` again. This step can be slow to execute:

In [12]:
df1['Date'] = pd.to_datetime(df1['Date'])

In [13]:
df1.dtypes

ID                               int64
Case Number                     object
Date                    datetime64[ns]
Block                           object
IUCR                             int64
Primary Type                    object
Description                     object
Location Description            object
Arrest                            bool
Domestic                          bool
Beat                             int64
District                       float64
Ward                           float64
Community Area                 float64
FBI Code                         int64
X Coordinate                   float64
Y Coordinate                   float64
Year                             int64
Updated On                      object
Latitude                       float64
Longitude                      float64
Location                        object
dtype: object
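
The coercion can be slow because, without a `format` argument, pandas has to work out the date layout itself. If the layout is known, passing it explicitly may help. The sketch below assumes the dates look like `05/03/2016 11:40:00 PM`; that layout is an assumption and should be checked against the downloaded file before use.

```python
import pandas as pd

# Hypothetical date strings; verify the real layout in crimes.csv first.
raw = pd.Series(["05/03/2016 11:40:00 PM", "01/15/2011 09:05:00 AM"])

# %I is the 12-hour clock and %p the AM/PM marker.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")
print(parsed.dtype)
print(parsed.dt.hour.tolist())
```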

In [14]:
# Keep only the rows whose Latitude is above 37
df1 = df1[df1.Latitude > 37]

In [16]:
def doKMeans(df):
    # Let's plot your data with a '.' marker, a 0.3 alpha at the Longitude,
    # and Latitude locations in your dataset. Longitude = x, Latitude = y
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(df.Longitude, df.Latitude, marker='.', alpha=0.3)

    
    # TODO: Filter `df` using indexing so it only contains Longitude and Latitude,
    # since the remaining columns aren't really applicable for this lab:
    #
    df = df.loc[:,['Longitude','Latitude']]

    # TODO: Use K-Means to try and find seven cluster centers in this df.
    # Be sure to name your kmeans model `model` so that the printing works.
    #
    model = KMeans(n_clusters=7)
    model.fit(df)
    labels = model.predict(df)

    # Now we can print and plot the centroids:
    centroids = model.cluster_centers_
    print(centroids)
    ax.scatter(centroids[:,0], centroids[:,1], marker='x', c='red', alpha=0.5, linewidths=3, s=169)
    plt.show()
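
K-Means starts from random initial centers, so repeated runs can return the centroids in a different order and with slightly different values. The sketch below is not the lab's solution; it fits a model on synthetic points (the `blob_centers` values are made up) and fixes `random_state` and `n_init` so the result is repeatable. It also shows that the columns of `cluster_centers_` follow the column order of the fitted frame, which is why `centroids[:,0]` is Longitude in `doKMeans`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic points around three made-up (Longitude, Latitude) centers.
rng = np.random.default_rng(0)
blob_centers = np.array([[-87.70, 41.78], [-87.60, 41.95], [-87.55, 41.70]])
points = np.vstack(
    [center + rng.normal(scale=0.005, size=(50, 2)) for center in blob_centers]
)
toy = pd.DataFrame(points, columns=["Longitude", "Latitude"])

# Fixing random_state and n_init makes repeated runs give the same centroids.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(toy)

# One row per cluster; column 0 is Longitude, column 1 is Latitude.
centroids = model.cluster_centers_
print(centroids.shape)
```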

In [20]:
%matplotlib notebook
# Print & Plot your data
doKMeans(df1)


[[-87.70936075  41.78630675]
 [-87.63688961  41.78503408]
 [-87.67985182  41.96835828]
 [-87.75664984  41.92599372]
 [-87.67979694  41.88844252]
 [-87.63991902  41.70199066]
 [-87.57548112  41.74559377]]


Filter the data, using indexing, so that it only contains samples with a `Date > '2011-01-01'`. Then, in a new figure, plot the remaining crime incidents along with the centroids from a new K-Means run.

In [18]:
# Keep only the incidents dated after 2011-01-01
df2 = df1.loc[df1.Date > '2011-01-01']
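
The comparison is strict, which matters at the boundary. A minimal sketch on a toy frame (the `toy`, `cutoff`, and `recent` names and values are illustrative assumptions) shows which rows survive:

```python
import pandas as pd

# Toy frame with dates just before, on, and just after the cutoff.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2010-12-31", "2011-01-01", "2011-01-02"]),
    "Latitude": [41.70, 41.78, 41.96],
})

# An explicit Timestamp makes the cutoff unambiguous: midnight on 2011-01-01.
cutoff = pd.Timestamp("2011-01-01")
recent = toy.loc[toy.Date > cutoff]
print(recent)
```

A row stamped exactly at midnight on 2011-01-01 is excluded, while any later time on that day would be kept.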

In [21]:
%matplotlib notebook
# Print & Plot your data
doKMeans(df2)


[[-87.57630838  41.74848815]
 [-87.71207888  41.78487714]
 [-87.75881503  41.92790985]
 [-87.63769862  41.70058702]
 [-87.64057197  41.78342326]
 [-87.68290846  41.88595454]
 [-87.68097374  41.96700236]]
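
The two runs list their centroids in different orders, so comparing them row by row is misleading. One possible way to compare them, sketched below with the values printed above, is to pair each centroid from the full run with its nearest centroid from the post-2011 run. The variable names are illustrative, and the distances are in raw degrees, not meters.

```python
import numpy as np

# Centroids printed by doKMeans(df1), as (Longitude, Latitude).
run_all = np.array([
    [-87.70936075, 41.78630675],
    [-87.63688961, 41.78503408],
    [-87.67985182, 41.96835828],
    [-87.75664984, 41.92599372],
    [-87.67979694, 41.88844252],
    [-87.63991902, 41.70199066],
    [-87.57548112, 41.74559377],
])

# Centroids printed by doKMeans(df2).
run_recent = np.array([
    [-87.57630838, 41.74848815],
    [-87.71207888, 41.78487714],
    [-87.75881503, 41.92790985],
    [-87.63769862, 41.70058702],
    [-87.64057197, 41.78342326],
    [-87.68290846, 41.88595454],
    [-87.68097374, 41.96700236],
])

# Pairwise Euclidean distances: rows index run_all, columns index run_recent.
dists = np.linalg.norm(run_all[:, None, :] - run_recent[None, :, :], axis=2)
nearest = dists.argmin(axis=1)   # index of the closest post-2011 centroid
shifts = dists.min(axis=1)       # how far each centroid moved, in degrees

for i, (j, shift) in enumerate(zip(nearest, shifts)):
    print(f"centroid {i} -> centroid {j}, shift {shift:.4f} degrees")
```

Each centroid from the first run pairs with a distinct centroid from the second, and every shift is well under 0.01 degrees, so the cluster centers barely move when the data is restricted to the later dates.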
