# Module 7 Assignment

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `# YOUR CODE HERE`. Do not write your answer anywhere else other than where it says `# YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.
2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select Kernel, then Restart & Run All to restart the kernel and run all cells.
3. Do not change the title (i.e. file name) of this notebook.
4. Make sure that you save your work (in the menubar, select File → Save and Checkpoint).
5. All work must be your own. If you do use any code from another source (such as a course notebook or a website), you need to properly cite the source.

-----

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from nose.tools import assert_is_instance, assert_equal, assert_almost_equal

-----

## Loading Car Specification Data

In this assignment, we will work with the car specification-price data set to perform clustering analysis. Before we build a model, we first load the data into the assignment notebook, and randomly sample several rows. The second Code cell removes non-numeric features.
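
As a small illustration of what removing non-numeric features means, the sketch below uses a toy DataFrame with hypothetical column names (not the actual columns of the data file) and the pandas `select_dtypes` method, which is one alternative to filtering on `df.dtypes` directly.

```python
import pandas as pd

# Toy frame with hypothetical columns; the real data set is not used here.
toy = pd.DataFrame({
    "make": ["audi", "bmw", "mazda"],
    "horsepower": [102, 121, 68],
    "price": [13950.0, 16430.0, 5195.0],
})

# Keep only the columns whose dtype is numeric (integers and floats).
toy_numeric = toy.select_dtypes(include="number")
print(list(toy_numeric.columns))
```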

-----

In [None]:
# Load data
df = pd.read_csv('./imports-85.data')
df.sample(5)

In [None]:
# Remove non-numeric features
df_simple = df[df.columns[df.dtypes!=object]]
df_simple.head()

-----

## Problem 1: K-Means Clustering

For this problem, you will complete the `cluster` function, started below, to perform K-Means clustering on an input DataFrame. This function must create the [`KMeans`](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) estimator and apply it to the input data to predict cluster membership. Your function will return both the model and the cluster labels for each instance in the input data set.
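
To show the general create-then-apply pattern for a scikit-learn clustering estimator, here is a minimal sketch on synthetic points rather than the car data. The names `points`, `demo_model`, and `demo_labels` are illustrative assumptions, and setting `n_init` explicitly is a choice made here because its default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two well-separated groups of 2-D points (not the car data).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(20, 2)),
])

# Create the estimator, then fit it and predict a label for every row.
demo_model = KMeans(n_clusters=2, random_state=0, n_init=10)
demo_labels = demo_model.fit_predict(points)

print(demo_labels.shape)                  # one label per input row
print(demo_model.cluster_centers_.shape)  # one center per cluster
```

Fixing `random_state` makes repeated runs on the same data and library version give the same labels, which is the role a reproducibility parameter typically plays.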

-----

In [None]:
from sklearn.cluster import KMeans

def cluster(data, rs=0, n_clusters=5):
    """
    Applies KMeans to find n_clusters in the given data.
    
    Parameters
    ----------
    data: The dataset to apply clustering to
    rs: A parameter for reproducibility
    n_clusters: The number of clusters
    
    Returns
    -------
    model: the K-means model object
    clusters: the cluster label of each datapoint
    """
    
    ### YOUR CODE HERE
    
    return model, clusters

In [None]:
# Perform K-Means clustering
k_means, clusters = cluster(df_simple, rs=1, n_clusters=6)

# Test Function (KMeans was imported in the previous cell; the private
# sklearn.cluster.k_means_ module is not available in recent scikit-learn releases)
assert_is_instance(k_means, KMeans)
assert_is_instance(clusters, np.ndarray)
assert_equal(k_means.n_clusters, 6)
assert_equal(clusters[0],1)

-----

## Problem 2: The Elbow Method

For this problem, you will complete the `inertia_calc` function, started below, to compute the inertia values for different K-Means cluster models. For each value in the `n_clusters` array, this function will call the clustering function passed in via the `cluster_function` parameter to fit a model to the `data` dataset. Extract the inertia from each fitted model and collect these values in a NumPy array, with one entry per value in `n_clusters`, that will be returned from the function.
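
For reference, a fitted scikit-learn `KMeans` model exposes its inertia as the `inertia_` attribute: the sum of squared distances from each sample to its closest cluster center. The sketch below, on synthetic points with illustrative names (`points`, `ks`, `models`, `demo_inertias`), collects that attribute for several cluster counts and cross-checks one value by hand.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points (not the car data), used only for illustration.
rng = np.random.default_rng(1)
points = rng.normal(size=(60, 2))

# Fit one model per candidate k and collect the fitted inertia_ attribute.
ks = np.arange(2, 6)
models = [KMeans(n_clusters=int(k), random_state=0, n_init=10).fit(points) for k in ks]
demo_inertias = np.array([m.inertia_ for m in models])

# Cross-check the first model by hand: sum of squared distances from each
# point to the center of the cluster it was assigned to.
first = models[0]
manual = ((points - first.cluster_centers_[first.labels_]) ** 2).sum()

print(demo_inertias.shape)
print(np.isclose(manual, demo_inertias[0]))
```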

-----

In [None]:
def inertia_calc(cluster_function, data, n_clusters):
    """
    Compute inertia values for different numbers of clusters
    
    Parameters
    ----------
    cluster_function: The function used to determine clusters
    data: The dataset to which the clustering function is applied
    n_clusters: An array containing the different numbers of clusters
    
    Returns
    -------
    inertias: A NumPy array of inertias for each k-means model.
    """
    
    ### YOUR CODE HERE
    
    return inertias

In [None]:
# Define cluster size array
n_clusters = np.arange(2,10)

# Compute Inertia
inertia_vec=inertia_calc(cluster, df_simple, n_clusters)

# Test Inertia Calculation
assert_is_instance(inertia_vec, np.ndarray)
assert_almost_equal(sum(inertia_vec), 8098575233.7986,places=2)

-----

The following Code cell generates the _elbow plot_, which indicates (based on the location of the elbow, or break in slope) that the optimal number of clusters is three.
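
The "break in slope" can also be located numerically. The sketch below is one simple heuristic, not the method used in this notebook: it takes the largest second difference of the inertia curve as the elbow. It runs on synthetic data built from three well-separated groups, and all names (`centers`, `points`, `ks`, `inertias`, `curvature`, `elbow_k`) are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated groups of 2-D points (not the car data).
rng = np.random.default_rng(3)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.66]])
points = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in centers])

# Inertia for k = 1..6.
ks = np.arange(1, 7)
inertias = np.array([
    KMeans(n_clusters=int(k), random_state=0, n_init=10).fit(points).inertia_
    for k in ks
])

# Second difference of the curve at the interior values of k (2..5);
# a large value marks a sharp change from steep to shallow slope.
curvature = inertias[:-2] - 2 * inertias[1:-1] + inertias[2:]
elbow_k = int(ks[1:-1][np.argmax(curvature)])
print(elbow_k)
```

On real data the curve is often smoother than in this synthetic case, so reading the elbow from the plot remains a judgment call.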

-----

In [None]:
# Make the Elbow plot
fig, ax = plt.subplots()
ax.plot(n_clusters, inertia_vec)
ax.set(title='Elbow Plot', 
       xlabel='Number of Clusters', 
       ylabel='Inertia (sum of squared distances)')

sns.despine()

-----

## Problem 3: Gaussian Mixture Models

For this problem, you will complete the `GMM_fit` function, started below, to fit a Gaussian Mixture Model to the supplied data. This function must create the [`GaussianMixture`](http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html) estimator and apply it to the input data to predict cluster membership. Your function will return both the model and the cluster labels for each instance in the input data set.
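
Unlike K-Means, a Gaussian mixture assigns each point a probability of belonging to every component, and the predicted label is the most probable one. The following minimal sketch shows this on synthetic points (not the car data); the names `points`, `demo_gmm`, `hard_labels`, and `soft_probs` are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: two well-separated groups of 2-D points.
rng = np.random.default_rng(2)
points = np.vstack([
    rng.normal(loc=-5.0, scale=1.0, size=(50, 2)),
    rng.normal(loc=5.0, scale=1.0, size=(50, 2)),
])

# Create the estimator, fit it, and predict the most likely component per row.
demo_gmm = GaussianMixture(n_components=2, random_state=0)
hard_labels = demo_gmm.fit_predict(points)

# Per-component membership probabilities; each row sums to 1.
soft_probs = demo_gmm.predict_proba(points)

print(hard_labels.shape)
print(soft_probs.shape)
```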

-----

In [None]:
from sklearn.mixture import GaussianMixture

def GMM_fit(data, rs=0, n_c=5):
    '''
    Apply a Gaussian Mixture Model to input data
    
    Parameters
    ----------
    data: The dataset to which the mixture model is applied
    rs: A parameter for reproducibility
    n_c: The number of clusters
    
    Returns
    -------
    model: the GMM model
    clusters: the cluster label of each datapoint
    '''
    
    ### YOUR CODE HERE
    
    return model, clusters

In [None]:
# Perform Gaussian Mixture Modeling
GMM, clusters = GMM_fit(df_simple, rs=1, n_c=3)

# Test Function
assert_is_instance(GMM, GaussianMixture)
assert_is_instance(clusters, np.ndarray)
assert_equal(GMM.n_components, 3)
assert_equal(clusters[0],1)

**&copy; 2017: Robert J. Brunner at the University of Illinois.**

This notebook is released under the [Creative Commons license CC BY-NC-SA 4.0][ll]. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.

[ll]: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode 