# Module 7 Assignment

A few things you should keep in mind when working on assignments:

1. Run the first code cell to import modules needed by this assignment before proceeding to problems.
2. Make sure you fill in any place that says `# YOUR CODE HERE`. Do not write your answer anywhere else other than where it says `# YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.
3. Each problem has an autograder cell below the answer cell. Run the autograder cell to check your answer. If there's anything wrong in your answer, the autograder cell will display error messages.
4. Before you submit your assignment, make sure everything runs as expected. In the menubar, select Kernel → Restart & Run All. If the notebook runs through the last code cell without an error message, you've answered all problems correctly.
5. Make sure that you save your work (in the menubar, select File → Save and Checkpoint).

-----

# Run Me First!

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

from nose.tools import assert_equal, assert_almost_equal, assert_true, assert_is_instance

# We do this to ignore warnings
import warnings
warnings.filterwarnings("ignore")

---
# Prepare Breast Cancer Data

In this assignment, we will use the breast cancer dataset to find clusters. Before building models, we first prepare the data.

Please run the next two code cells before proceeding to Problem 1.



In [None]:
from sklearn.preprocessing import LabelEncoder
#Load breast cancer dataset
df = pd.read_csv('data/breast-cancer-wisconsin.csv')
data = df[['clump thickness', 'uniformity cell size', 'uniformity cell shape', 'marginal adhesion', 'epithelial cell size', 'bare nuclei', 'bland chromatin', 'normal nucleoli', 'mitoses']]
label = LabelEncoder().fit_transform(df['class'])

data.sample()

---
# Problem 1: Standardize Features

For this problem you will use the DataFrame **data** created above.

To solve this problem do the following:
- Use `StandardScaler` to standardize `data` and assign the standardized data to the variable **data_ss**.

After this problem, there will be a new variable **data_ss** defined.
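
As an illustration of what standardization does, here is a minimal sketch on made-up data. The array `toy` and the name `toy_ss` are hypothetical and are not part of the assignment. It shows that `fit_transform` returns a NumPy array whose columns have mean 0 and standard deviation 1, which is why the autograder below indexes the result with `[row, column]`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (not the assignment data): 4 samples, 2 features
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0],
                [4.0, 40.0]])

# fit_transform learns each column's mean and standard deviation,
# then returns a NumPy array of standardized values
toy_ss = StandardScaler().fit_transform(toy)

print(toy_ss.mean(axis=0))  # each column mean is approximately 0
print(toy_ss.std(axis=0))   # each column standard deviation is approximately 1
```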

---

In [None]:
from sklearn.preprocessing import StandardScaler
# YOUR CODE HERE


In [None]:
assert_almost_equal(data_ss[0,0], 0.19790469484926235, msg='data_ss is not correct')
assert_almost_equal(data_ss[0,-1], -0.3483997074310662, msg='data_ss is not correct')
print('Sample Standardized data:')
print(data_ss[:2])

---
# Problem 2: Fit KMeans Model and Calculate Evaluation Metrics

Create and fit a KMeans model with the breast cancer dataset. Calculate the Adjusted Rand Index and Silhouette score of the model.

For this problem, use `data_ss` and `label` created above.

To solve this problem do the following:
 - Create a `KMeans` model **k_means**. Set `n_clusters` to 2, `n_init` to 10, and `random_state` to 23.
 - Fit the KMeans model on `data_ss`.
 - Apply the `predict` method of `k_means` to `data_ss` to get the predicted clusters, and assign them to the variable **pred_clusters**.
 - Use `adjusted_rand_score` in the `metrics` module with `label` and `pred_clusters` to calculate the Adjusted Rand Index, and assign the score to the variable **ari_score**.
 - Use `silhouette_score` in the `metrics` module with `data_ss` and `pred_clusters` to calculate the Silhouette score, and assign the score to the variable **s_score**.

After this problem, there will be a fitted KMeans model **k_means**, as well as **ari_score** and **s_score** defined.
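
The workflow above can be sketched on synthetic data. This is an illustrative example only: the blobs, the cluster count of 3, and all `toy_` names are assumptions made for the sketch, not the assignment's data or settings. Note that the Adjusted Rand Index needs ground-truth labels, while the Silhouette score uses only the features and the predicted clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn import metrics

# Hypothetical synthetic data: three well-separated blobs with known labels
X_toy, y_toy = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Create and fit the model, then assign each sample to a cluster
toy_km = KMeans(n_clusters=3, n_init=10, random_state=0)
toy_km.fit(X_toy)
toy_pred = toy_km.predict(X_toy)

# ARI compares predicted clusters with the true labels; it ignores how the
# cluster ids are numbered, so a relabeled but identical grouping scores 1.0
toy_ari = metrics.adjusted_rand_score(y_toy, toy_pred)

# Silhouette measures cohesion and separation from the features alone
toy_sil = metrics.silhouette_score(X_toy, toy_pred)

print(f"ARI: {toy_ari:5.3f}, Silhouette: {toy_sil:5.3f}")
```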

---

In [None]:
from sklearn.cluster import KMeans
from sklearn import metrics

# YOUR CODE HERE


In [None]:
assert_equal(type(k_means), type(KMeans()), msg="k_means is not a KMeans model")
assert_equal(k_means.get_params()['n_clusters'], 2, msg="k_means is not created with n_clusters = 2")
assert_equal(k_means.get_params()['n_init'], 10, msg="k_means is not created with n_init = 10")
assert_equal(k_means.get_params()['random_state'], 23, msg="k_means is not created with random_state = 23")
assert_almost_equal(k_means.inertia_, 2728.1495129753007, msg="k_means is not fit properly")
assert_almost_equal(ari_score, 0.8355975533950785, msg='Adjusted Rand Index is not correct')
assert_almost_equal(s_score, 0.5732450609290859, msg='Silhouette score is not correct')
print(f"Adjusted Rand Index of Kmeans: {ari_score:5.3f}")
print(f"Silhouette Score of Kmeans: {s_score:5.3f}")

---

# Problem 3: Prepare Data for k-distance Graph

Prepare data to plot a k-distance graph to determine a proper `eps` value for a DBSCAN model.

For this problem you will use **data_ss** created in Problem 1.

To solve this problem do the following:
- Create a `NearestNeighbors` model **nnb**. Set `n_neighbors` to 2.
- Fit `nnb` with `data_ss`.
- Use the `kneighbors` method of `nnb` with `data_ss` to calculate the distance from each data point to its nearest point. Assign the return values to the variables **distances** and **indices**.

After this problem, there will be a fitted model **nnb** defined, as well as a NumPy array of distances to the nearest points, **distances**. We will use `distances` to plot the k-distance graph in the autograder cell.
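
To see why `n_neighbors` is 2 and why the autograder reads column 1, consider this small sketch on made-up one-dimensional points (the array `toy` and the `toy_` names are hypothetical). When the query points are the same points the model was fitted on, and there are no duplicate points, column 0 holds each point's distance to itself (0) and column 1 holds the distance to its nearest other point.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data: four points on a line
toy = np.array([[0.0], [1.0], [3.0], [7.0]])

toy_nnb = NearestNeighbors(n_neighbors=2).fit(toy)

# kneighbors returns two arrays of shape (n_samples, n_neighbors)
toy_dist, toy_idx = toy_nnb.kneighbors(toy)

print(toy_dist[:, 0])  # [0. 0. 0. 0.]  distance from each point to itself
print(toy_dist[:, 1])  # [1. 1. 2. 4.]  distance to the nearest other point
print(toy_idx[:, 1])   # [1 0 1 2]      index of that nearest other point
```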

-----

In [None]:
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt

# YOUR CODE HERE


In [None]:
assert_equal(type(nnb), type(NearestNeighbors()), msg="nnb is not a NearestNeighbors model")
assert_equal(nnb.get_params()['n_neighbors'], 2, msg="nnb is not created with n_neighbors = 2")
assert_almost_equal(distances[1, 1], 1.00846778, msg="distances are not correct")
assert_almost_equal(distances[2, 1], 0.27463559, msg="distances are not correct")

#plot k-distance graph
import matplotlib.pyplot as plt
#sort by distance from low to high and plot
fig, ax = plt.subplots(figsize=(8, 6))
# Use a new name so distances stays 2-D and this cell can be rerun safely
k_dist = np.sort(distances[:, 1])
plt.plot(k_dist)
plt.title("K-Distance")
plt.ylabel("Distance")
sns.despine()

---

# Problem 4: Create and Fit DBSCAN model

Create and fit a DBSCAN model with the breast cancer dataset.

For this problem, you will use `data_ss` created in Problem 1.

To solve this problem do the following:

 - Create a `DBSCAN` model **dbscan**. Set `eps` to 2.0, `min_samples` to 20
 - Fit the `DBSCAN` model on `data_ss`

After this problem, there will be a fitted DBSCAN model **dbscan** defined.
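
As a rough illustration of how `eps` and `min_samples` interact, here is a sketch on made-up points. The data, the parameter values, and the `toy_` names are assumptions for this example only. A point is a core sample when at least `min_samples` points, counting itself, lie within `eps` of it; points that belong to no cluster are labeled `-1` (noise).

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight groups and one isolated point
toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
                [20.0, 20.0]])

# Illustrative parameters, not the assignment's values
toy_db = DBSCAN(eps=0.5, min_samples=3).fit(toy)

print(toy_db.labels_)            # [ 0  0  0  1  1  1 -1]
print(toy_db.components_.shape)  # (6, 2): the six core samples
```

DBSCAN has no `predict` method; the cluster assignments for the fitted data are read from the `labels_` attribute.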


-----

In [None]:
from sklearn.cluster import DBSCAN

# YOUR CODE HERE


In [None]:
assert_equal(type(dbscan), type(DBSCAN()), msg="dbscan is not a DBSCAN model")
assert_equal(dbscan.get_params()['eps'], 2.0, msg="dbscan is not created with eps = 2.0")
assert_equal(dbscan.get_params()['min_samples'], 20, msg="dbscan is not created with min_samples = 20")
assert_equal(dbscan.components_.shape[0], 444, msg='dbscan is not fit properly')
print(f'Cluster labels: {np.unique(dbscan.labels_)}')

---

# Problem 5: Calculate Clustering Metrics for DBSCAN Model

Compute two clustering metrics, the Adjusted Rand Index and the Silhouette score, for the DBSCAN model created in Problem 4.

For this problem, you will use `data_ss`, `label` and `dbscan` created above.

To solve this problem do the following:

- Get the predicted clusters from the `labels_` attribute of `dbscan`, and assign them to the variable **pred_clusters**.
- Use `adjusted_rand_score` in the `metrics` module with `label` and `pred_clusters` to calculate the Adjusted Rand Index, and assign the score to the variable **ari_score_db**.
- Use `silhouette_score` in the `metrics` module with `data_ss` and `pred_clusters` to calculate the Silhouette score, and assign the score to the variable **s_score_db**.

After this problem, there will be two new variables, **ari_score_db** and **s_score_db**, defined.
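
One detail worth keeping in mind when scoring DBSCAN output: when `labels_` is passed directly, both metrics treat the noise label `-1` as just another cluster id. The sketch below shows this on made-up data; the points, the hypothetical ground truth `toy_true`, and the parameter values are assumptions for illustration, not the assignment's data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics

# Hypothetical toy data: two tight groups and one isolated point
toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
                [20.0, 20.0]])

# Hypothetical ground truth: the isolated point belongs to class 1
toy_true = np.array([0, 0, 0, 1, 1, 1, 1])

toy_db = DBSCAN(eps=0.5, min_samples=3).fit(toy)
toy_pred = toy_db.labels_  # cluster ids, with -1 marking noise

# The noise point counts as its own group, so ARI falls below 1
toy_ari = metrics.adjusted_rand_score(toy_true, toy_pred)
toy_sil = metrics.silhouette_score(toy, toy_pred)
n_noise = int(np.sum(toy_pred == -1))

print(f"noise points: {n_noise}, ARI: {toy_ari:5.3f}, Silhouette: {toy_sil:5.3f}")
```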

-----

In [None]:
from sklearn import metrics

# YOUR CODE HERE


In [None]:
assert_almost_equal(ari_score_db, 0.7322398150440137, msg='Adjusted Rand Index is not correct')
assert_almost_equal(s_score_db, 0.48507225390269243, msg='Silhouette score is not correct')
print(f"Adjusted Rand Index: {ari_score_db:5.3f}")
print(f"Silhouette Score: {s_score_db:5.3f}")