# Lab 5: Clustering

In this assignment, you will use clustering algorithms to explore 311 Service Request data in Chicago.

## Learning Objectives

In this assignment, you will learn the following:
* How to work with various clustering methods, including K-Means and hierarchical agglomerative clustering
* How to work with Chicago's 311 Service Request data
* How to characterize and interpret the outputs of clustering algorithms 
* How to evaluate the output of clustering algorithms

## Part 1: Loading the Data

### 1.1 Download the Data

Download the 311 Service Request data from the [City of Chicago Open Data Portal](https://data.cityofchicago.org/Service-Requests/311-Service-Requests/v6vf-nfxy). For this assignment, we will only use requests made in 2018 (that is, records whose `created_date` falls in 2018). We've provided the code for pulling this data below.

Print out the shape of your dataset and the first 5 rows below. Note that you should have 461,170 rows.

In [1]:
import pandas as pd
from sodapy import Socrata

# Init Socrata client
client = Socrata("data.cityofchicago.org", None)

# Query Socrata client to get first 500K rows of 311 requests made in 2018
results = client.get("v6vf-nfxy", limit=500000, where="date_extract_y(created_date) = 2018")

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)



In [2]:
# YOUR CODE HERE 

### 1.2 Preprocess the Data

First, we will aggregate the data to count the number of each service request type per [Chicago community area](https://en.wikipedia.org/wiki/Community_areas_in_Chicago).

1. Drop all records without a `community_area` value. 
2. Create a dataframe where each row is a community area, and each column is the **proportion** of a particular type of service request in that community area.
3. Print the first 5 rows of your dataframe.

You should have 77 rows (one for each community area) and 92 columns (one for each type of service request). To keep your column names short, you may want to use the service request short code (`sr_short_code`) rather than the full text of the service request type.
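One possible way to build the proportion table is `pd.crosstab` with `normalize="index"`. The snippet below is only a sketch on a tiny made-up dataframe: the name `toy_df` and the short codes `AAA`, `BBB`, and `CCC` are placeholders, not values from the real data.

```python
import pandas as pd

# Toy stand-in for results_df (made-up values); the real data is far larger.
toy_df = pd.DataFrame({
    "community_area": ["1", "1", "1", "2", "2", None],
    "sr_short_code": ["AAA", "AAA", "BBB", "BBB", "CCC", "AAA"],
})

# Step 1: drop records that have no community area.
toy_df = toy_df.dropna(subset=["community_area"])

# Step 2: count requests per (area, type) and divide each row by its total,
# so every row holds proportions that sum to 1.
proportions = pd.crosstab(
    toy_df["community_area"], toy_df["sr_short_code"], normalize="index"
)

# Step 3: inspect the result.
print(proportions.shape)
print(proportions.head())
```

Normalizing by row (rather than by column or over the whole table) keeps large and small community areas comparable, because each area is described by its mix of requests and not by its volume.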

In [3]:
# YOUR CODE HERE 

## Part 2: Clustering the Data

### 2.1 K-Means

#### 2.1.1 Clustering

1. Cluster your community areas using K-Means. Use `random_state=0` and three clusters (`n_clusters=3` in scikit-learn).
2. Print the array of cluster labels.
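As a rough illustration of the API, the sketch below runs K-Means on synthetic data. The matrix `X` is a random stand-in for your proportion table (77 rows that each sum to 1), and `n_init=10` is an assumption added here to keep behavior consistent across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the proportion table: 77 rows, 5 features,
# each row sums to 1. Replace with your own dataframe.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(5), size=77)

# scikit-learn calls the number of clusters `n_clusters`.
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
labels = kmeans.fit_predict(X)

print(labels)
```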

In [4]:
# YOUR CODE HERE 

#### 2.1.2 Analysing Clusters

For each cluster: 
1. Provide summary stats for the cluster.
2. Describe, using statistics, graphs, or any other visualization, what types of data points are in this cluster.
3. What are the distinctive features for this cluster? Hint: you may want to use a decision tree here.
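One way to follow the decision tree hint is to fit a shallow tree per cluster that separates it from all other clusters, then read off the feature importances. This is a sketch on synthetic data, not the only valid approach; the column names `A` to `E` and the depth limit of 2 are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the proportion table (placeholder column names).
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.dirichlet(np.ones(5), size=77), columns=list("ABCDE"))
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(features)

# Summary stats: mean of every feature within each cluster.
cluster_means = features.groupby(labels).mean()
print(cluster_means)

# Distinctive features: one shallow "this cluster vs. the rest" tree per
# cluster; features with high importance are the ones the tree splits on.
top_features = {}
for cluster in np.unique(labels):
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(features, labels == cluster)
    importances = pd.Series(tree.feature_importances_, index=features.columns)
    top_features[int(cluster)] = importances.sort_values(ascending=False).head(2)

for cluster, ranked in top_features.items():
    print(cluster, ranked.to_dict())
```

Keeping the tree shallow makes the result easy to interpret: each cluster is described by at most a few threshold rules.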

In [5]:
# YOUR CODE HERE 

*YOUR ANSWER HERE*

#### 2.1.3 Experimenting with Parameters

Explore how changing the number of clusters (`k`) affects the results above. Your analysis should include at least one visualization and a 5-7 sentence summary.  
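A common starting point is to record inertia (within-cluster sum of squares) and the silhouette score for a range of `k` values. The sketch below does this on synthetic data; the range 2 to 8 is an arbitrary assumption, and other evaluation measures would be equally reasonable.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the proportion table.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(5), size=77)

rows = []
for k in range(2, 9):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    rows.append({
        "k": k,
        "inertia": model.inertia_,                        # lower = tighter clusters
        "silhouette": silhouette_score(X, model.labels_)  # higher = better separated
    })

scores = pd.DataFrame(rows).set_index("k")
print(scores)
```

Plotting each column of `scores` against `k` (for example with matplotlib) gives the familiar elbow and silhouette curves.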

In [6]:
# YOUR CODE HERE

*YOUR ANSWER HERE*

### 2.2 Agglomerative Clustering

#### 2.2.1 Clustering

1. Cluster your community areas using Agglomerative Clustering. Unlike K-Means, scikit-learn's `AgglomerativeClustering` is deterministic and does not accept a `random_state` argument.
2. Print the array of cluster labels. How do your new clusters differ from the ones you discovered using K-Means?
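To compare two sets of cluster labels, a cross-tabulation and the adjusted Rand index are often useful, since the label numbers themselves are arbitrary. The sketch below uses synthetic data and assumes three clusters for both methods so that the results are comparable.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the proportion table.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(5), size=77)

kmeans_labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print(agglo_labels)

# Rows are K-Means clusters, columns are agglomerative clusters.
overlap = pd.crosstab(
    kmeans_labels, agglo_labels, rownames=["kmeans"], colnames=["agglo"]
)
print(overlap)

# 1.0 means identical partitions; values near 0 mean chance-level agreement.
ari = adjusted_rand_score(kmeans_labels, agglo_labels)
print(ari)
```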

In [7]:
# YOUR CODE HERE

#### 2.2.2 Analysing Clusters

For each cluster: 
1. Provide summary stats for the cluster.
2. Describe, using statistics, graphs, or any other visualization, what types of data points are in this cluster.
3. What are the distinctive features for this cluster? Hint: you may want to use a decision tree here.

In [8]:
# YOUR CODE HERE

*YOUR ANSWER HERE*

#### 2.2.3 Experimenting with Parameters

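Explore how changing the distance metric (`affinity`, renamed to `metric` in scikit-learn 1.2) affects the results above. Note that the default `ward` linkage only supports Euclidean distance, so you will need a different `linkage` to try other metrics. Your analysis should include at least one visualization and a 5-7 sentence summary.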
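The sketch below loops over a few distance metrics on synthetic data and measures how much each partition agrees with the Euclidean one. It makes several assumptions: three clusters, `average` linkage (because `ward` linkage only works with Euclidean distance), and a small check on the constructor signature so the same code runs whether your scikit-learn version names the argument `metric` or `affinity`.

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the proportion table.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(5), size=77)

# Newer scikit-learn versions call the distance argument `metric`,
# older ones call it `affinity`; pick whichever this version supports.
params = inspect.signature(AgglomerativeClustering).parameters
distance_arg = "metric" if "metric" in params else "affinity"

labels_by_metric = {}
for name in ["euclidean", "manhattan", "cosine"]:
    model = AgglomerativeClustering(
        n_clusters=3, linkage="average", **{distance_arg: name}
    )
    labels_by_metric[name] = model.fit_predict(X)

# Agreement of each partition with the Euclidean baseline (1.0 = identical).
baseline = labels_by_metric["euclidean"]
agreement = {
    name: adjusted_rand_score(baseline, labels)
    for name, labels in labels_by_metric.items()
}
print(agreement)
```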

In [9]:
# YOUR CODE HERE

*YOUR ANSWER HERE*

## Part 3: Open-Ended Exploration 

Note that this section is intentionally open-ended. The goal is to explore how integrating additional data and dimensionality reduction techniques can affect the robustness of your clusters. 

In this section, you are welcome to use K-Means, Agglomerative Clustering, or to try out additional clustering methods discussed in class. 

### 3.1 Integrating Additional Data 

Add at least three additional variables to your dataframe. For example, this might include demographic characteristics for the community areas. Feel free to pull this data from the Census (e.g. American Community Survey) or from the [City of Chicago Open Data Portal](https://data.cityofchicago.org/). 

Describe how including this additional data affects your clusters above. Again, include a 5-7 sentence summary of your analysis. 
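When you combine proportions with variables on very different scales (for example, income in dollars), the larger-scale variables can dominate distance-based clustering. The sketch below joins a made-up table of extra variables to a made-up proportion table and standardizes the result. All values and the column names `median_income`, `population`, and `pct_renters` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up proportion table indexed by community area.
proportions = pd.DataFrame(
    {"AAA": [0.7, 0.2, 0.5], "BBB": [0.3, 0.8, 0.5]},
    index=pd.Index(["1", "2", "3"], name="community_area"),
)

# Made-up additional variables (hypothetical names and values).
extra = pd.DataFrame({
    "community_area": ["1", "2", "3"],
    "median_income": [45000, 82000, 61000],
    "population": [30000, 55000, 12000],
    "pct_renters": [0.60, 0.30, 0.45],
})

# Join on community area; the key must have the same dtype in both tables.
combined = proportions.join(extra.set_index("community_area"), how="inner")

# Standardize every column to mean 0 and unit variance before clustering.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(combined),
    index=combined.index,
    columns=combined.columns,
)
print(scaled)
```

Whether to standardize the original proportion columns as well is a modeling choice worth discussing in your summary.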

In [10]:
# YOUR CODE HERE

*YOUR ANSWER HERE*

### 3.2 Dimensionality Reduction 

Above, we clustered on 92 features corresponding to each of the service request types. We suspect that dimensionality reduction can be used to shrink this feature space while still capturing many of the relevant relationships in the data.

Here, explore how principal component analysis (PCA) and clustering can be combined. First perform PCA on your features, and then cluster on the first N principal components. Be sure to justify your choice of N.
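One common way to justify N is to look at the cumulative explained variance and keep the smallest number of components that passes a threshold. The sketch below does this on synthetic data; the 90% threshold and the choice of K-Means with three clusters are assumptions for illustration, and you may prefer a scree plot or another criterion.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in for the proportion table: 77 rows, 20 features.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(20), size=77)

# Fit PCA with all components to see how variance accumulates.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest N whose components explain at least 90% of the variance.
n_components = int(np.searchsorted(cumulative, 0.90) + 1)
print(n_components, cumulative[n_components - 1])

# Project onto the first N components, then cluster in the reduced space.
X_reduced = PCA(n_components=n_components).fit_transform(X)
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X_reduced)
print(labels)
```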

Again, provide a 5-7 sentence summary of your analysis. 

In [11]:
# YOUR CODE HERE

*YOUR ANSWER HERE*