**YOUR NAME**

Spring 2024

CS 251: Data Analysis and Visualization

# Lab 5 | K-Means Clustering

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.cluster.vq import kmeans2

plt.style.use(['seaborn-v0_8-colorblind', 'seaborn-v0_8-darkgrid'])
plt.rcParams.update({'font.size': 10})
plt.rcParams.update({'figure.figsize': [12, 5]})

np.set_printoptions(suppress=True, precision=5)

# Automatically reload external modules
%load_ext autoreload
%autoreload 2

 

## Task 1:  Import and explore the data

We will use K-Means to explore the flea beetle dataset, which contains measurements of three different species of beetles.

**Dataset Variables**      
**species:** Ch. concinna, Ch. heptapotamica, and Ch. heikertingeri     
**tars1:** width of the first joint of the first tarsus in microns    
**tars2:** width of the second joint of the first tarsus in microns    
**head:** the maximal width of the head between the external edges of the eyes in 0.01 mm    
**aede1:** the maximal width of the aedeagus in the fore-part in microns     
**aede2:** the front angle of the aedeagus (1 unit = 7.5 degrees)    
**aede3:** the aedeagus width from the side in microns    
 
1. Import the dataset using pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) to create a DataFrame from `data/flea.csv`.
2. Using the [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) documentation, inspect the data:
    1. Print the shape of the data.
    2. Print the first 5 rows of the data using `head`.
3. The `species` variable is a categorical data variable with 3 levels: `['Concinna', 'Heikert', 'Heptapot']`. Convert the data type of this column in the DataFrame to [categorical](https://pandas.pydata.org/docs/user_guide/categorical.html).
   1. Print the first 5 items of the updated DataFrame column to make sure that the conversion worked correctly. You should see `Categories (3, object): ['Concinna', 'Heikert', 'Heptapot']` at the bottom of the print-out.
4. Create a list of color-blind friendly hard-coded RGB colors that will be used to color samples that belong to the same cluster. Pick either the [Okabe & Ito or one of the Petroff color palettes](https://github.com/proplot-dev/proplot/issues/424) (*These lists have been studied to be effective for colorblind individuals*). Your list of color strings should be at least as long as the number of clusters you select. Since there are 3 in this lab, any of the color palettes will work. *Each string is represented as a [hexadecimal color code](https://en.wikipedia.org/wiki/Web_colors)*.
5. Graph a scatterplot of the `tars1` and `aede3` columns using `plt.scatter`:
    1. Write a loop that runs for 3 iterations (one for each flea species).
    2. Inside the loop, use logical indexing to select rows in the DataFrame that have the current species level. Assign this 'filtered' DataFrame to a temp variable. If everything is working as expected, you should have `21`, `31`, and `22` samples for the species `'Concinna'`, `'Heikert'`, `'Heptapot'`, respectively.
    3. Plot `tars1` on the x axis and `aede3` on the y axis from the **filtered** DataFrame with markers that have a black edgecolor. Set the color to one of the strings from your hard-coded list of colors.
    4. Add a useful title, and axis labels.
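
The categorical conversion and the per-species plotting loop described above can be sketched on a tiny stand-in table. This is only an illustration under stated assumptions: the `toy` DataFrame and its values are made up (it is not `data/flea.csv`), and the colors are the first three entries of the Okabe & Ito palette.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in for the flea data (illustrative values only, not the real dataset)
toy = pd.DataFrame({
    'species': ['Concinna', 'Heikert', 'Heptapot', 'Concinna', 'Heikert', 'Heptapot'],
    'tars1': [190, 125, 180, 200, 120, 165],
    'aede3': [105, 80, 120, 110, 85, 115],
})

# Convert the species column to a categorical with explicit levels
levels = ['Concinna', 'Heikert', 'Heptapot']
toy['species'] = pd.Categorical(toy['species'], categories=levels)

# First three Okabe & Ito colors as hexadecimal strings
colors = ['#E69F00', '#56B4E9', '#009E73']

fig, ax = plt.subplots()
counts = []
for level, color in zip(levels, colors):
    # Logical indexing: keep only the rows that belong to the current species
    subset = toy[toy['species'] == level]
    counts.append(len(subset))
    ax.scatter(subset['tars1'], subset['aede3'],
               color=color, edgecolors='k', label=level)

ax.set_title('Toy flea data by species')
ax.set_xlabel('tars1 (microns)')
ax.set_ylabel('aede3 (microns)')
ax.legend()
```

On the real data, the `counts` list is a quick way to confirm the expected number of samples per species before plotting.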

In [None]:
species = ['Concinna', 'Heikert', 'Heptapot']



## Task 2: K-Means

1. Use the [SciPy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans2.html) documentation to find out how to run K-Means on the columns `tars1` and `aede3` using `kmeans2`.
    1. Use 3 clusters and set the method of initialization to random. 
    2. Make sure to convert the data to floats, with `.astype('float')` before running the analysis.
2. Calculate the **inertia**: how closely packed samples are around their cluster centroids in the current clustering.
    1. For each data sample calculate the **squared** euclidean distance between that sample and its cluster centroid. Average these values across all samples.
    2. Print the inertia.
3. Graph the results of the clustering in a plot next to the actual data (2 subplots).
    1. Graph `'tars1'` on the x axis and `'aede3'` on the y axis
    2. Have a title for each subplot, and axis labels.
    3. Graph the original data following the instructions from Task 1 in the first subplot.
        1. Use the `species` column of your data for the color with a black edgecolor
    4. Graph the results of K-means in the second subplot. [See example](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans2.html)
        1. Graph the data colored by the clusters with the centroids labeled. Use your colorblind-friendly custom color palette to color each of the three clusters.

You should get an inertia of ~203.
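
As a rough sketch of the inertia definition used in this lab (the mean squared Euclidean distance from each sample to its assigned centroid), the snippet below checks the formula on a hand-computable case and then applies it to a `kmeans2` result. The helper name `mean_squared_distance` and the synthetic `toy` blobs are assumptions made for illustration; they are not the flea data, so the value will not match the one quoted above.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def mean_squared_distance(data, centroids, labels):
    """Average squared Euclidean distance from each sample to its assigned centroid."""
    diffs = data - centroids[labels]           # per-sample offset from its own centroid
    return np.mean(np.sum(diffs ** 2, axis=1))

# Hand-checkable case: both points sit 1 unit from the single centroid
pts = np.array([[0.0, 0.0], [2.0, 0.0]])
cent = np.array([[1.0, 0.0]])
lab = np.array([0, 0])
check = mean_squared_distance(pts, cent, lab)  # squared distances are 1 and 1, so the mean is 1

# Synthetic 2-D blobs standing in for the two selected columns
np.random.seed(1)
toy = np.vstack([np.random.normal(loc, 1.0, size=(20, 2))
                 for loc in ([0, 0], [10, 0], [5, 8])])

# kmeans2 returns the centroids and one integer cluster label per sample
centroids, labels = kmeans2(toy.astype('float'), 3, minit='random')
inertia = mean_squared_distance(toy, centroids, labels)
print(inertia)
```

Indexing `centroids[labels]` builds an array with one centroid row per sample, which avoids an explicit loop over samples.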


In [None]:
np.random.seed(1)



## Task 3: Analysis

1. Write code that runs `kmeans2` 50 times.
2. Save the centroids, labels and inertia for each clustering instance.
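
One possible way to organize the repeated runs is to append each run's results to parallel lists, so that run `i` can later be looked up by index. This is a minimal sketch on synthetic blobs; the list names (`all_centroids`, `all_labels`, `all_inertias`) and the `toy` data are hypothetical choices, not part of the lab's starter code.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Synthetic 2-D blobs standing in for the two selected columns
np.random.seed(1)
toy = np.vstack([np.random.normal(loc, 1.0, size=(20, 2))
                 for loc in ([0, 0], [10, 0], [5, 8])])

n_runs = 50
all_centroids, all_labels, all_inertias = [], [], []
for _ in range(n_runs):
    # Each call draws a fresh random initialization, so results can differ
    centroids, labels = kmeans2(toy, 3, minit='random')
    # Mean squared distance from each sample to its assigned centroid
    inertia = np.mean(np.sum((toy - centroids[labels]) ** 2, axis=1))
    all_centroids.append(centroids)
    all_labels.append(labels)
    all_inertias.append(inertia)

print(min(all_inertias), max(all_inertias))
```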

In [None]:
np.random.seed(1)




3. Graph a plot with 3 subplots
    1. **Original:** The first plot should graph `'tars1'` on the x axis and `'aede3'` on the y axis with the original labels.
    2. **Best clustering:** Using your above analysis, the second plot should graph the results with the **best** K-Means fit using the saved labels and centroids.
    3. **Worst clustering:** Using your above analysis, the third plot should graph the results with the **worst** K-Means fit using the saved labels and centroids.
    4. For formatting, follow the instructions in Task 2.
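
The selection step can be sketched as follows, assuming that "best" means the run with the lowest inertia and "worst" the run with the highest. The four-point `data` array and the two saved clusterings are hand-made placeholders for illustration, and `plot_clusters` is a hypothetical helper, not a library function.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hand-made placeholder data and two hypothetical saved clusterings
data = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
true_labels = np.array([0, 0, 1, 1])
saved_labels = [np.array([0, 0, 1, 1]), np.array([0, 1, 0, 1])]
saved_centroids = [np.array([[0.0, 0.5], [5.0, 0.5]]),
                   np.array([[2.5, 0.0], [2.5, 1.0]])]
saved_inertias = [np.mean(np.sum((data - c[l]) ** 2, axis=1))
                  for c, l in zip(saved_centroids, saved_labels)]

# Assumption: lowest inertia is the best fit, highest is the worst
best = int(np.argmin(saved_inertias))
worst = int(np.argmax(saved_inertias))

colors = ['#E69F00', '#56B4E9', '#009E73']  # Okabe & Ito hex colors

def plot_clusters(ax, points, labels, title, centroids=None):
    """Scatter the points colored by label, optionally marking the centroids."""
    ax.scatter(points[:, 0], points[:, 1],
               c=[colors[i] for i in labels], edgecolors='k')
    if centroids is not None:
        ax.scatter(centroids[:, 0], centroids[:, 1],
                   marker='X', s=150, color='k', label='centroids')
        ax.legend()
    ax.set_title(title)
    ax.set_xlabel('x')
    ax.set_ylabel('y')

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
plot_clusters(axes[0], data, true_labels, 'Original')
plot_clusters(axes[1], data, saved_labels[best], 'Best clustering',
              saved_centroids[best])
plot_clusters(axes[2], data, saved_labels[worst], 'Worst clustering',
              saved_centroids[worst])
```

Keeping the plotting in one helper means all three panels share the same colors and marker styling.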

#### Question 1: Why does kmeans produce different solutions?

#### Answer 1: