## Assignment: $k$ Means Clustering

Tamera Fang (ven7sg)

## **Do two questions.**

`! git clone https://www.github.com/DS3001/kmc`

**Q1.** This question is a case study for $k$ means clustering.

1. Load the `airbnb_hw.csv` data. Clean `Price` along with `Beds`, `Number of Reviews`, and `Review Scores Rating`.


```
import pandas as pd

df = pd.read_csv('/Users/tamerafang/PycharmProjects/DS3001_kMC/airbnb_hw.csv')

# Copy the four variables of interest under shorter names
df['price'] = df['Price']
df['beds'] = df['Beds']
df['n_reviews'] = df['Number Of Reviews']
df['score'] = df['Review Scores Rating']
# .copy() gives an independent frame, so later assignments don't trigger SettingWithCopyWarning
cleaned_data = df.loc[:,['price','beds','n_reviews','score']].copy()
print(cleaned_data.shape)
print(cleaned_data.describe())

# Strip commas from price so it can be converted to a number
print(cleaned_data['price'].value_counts())
cleaned_data['price'] = cleaned_data['price'].str.replace(',','')
cleaned_data['price'] = pd.to_numeric(cleaned_data['price'],errors='coerce')
print(cleaned_data.describe())

# Treat a missing bed count as one bed
cleaned_data['beds'] = cleaned_data['beds'].fillna(1)
print(cleaned_data.describe())

# Check how missing scores line up with having any reviews, then drop incomplete rows
print(pd.crosstab(df['score'].isnull(), df['n_reviews']>0))
cleaned_data = cleaned_data.dropna()
print(cleaned_data.describe())

```


2. Maxmin normalize the data and remove any `nan`'s (`KMeans` from `sklearn` doesn't accept `nan` input).


```
def maxmin(x):
    u = (x-min(x))/(max(x)-min(x))
    return u
normalized = cleaned_data.drop('price',axis=1)
normalized = normalized.apply(maxmin)
```
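As a quick check of what max-min normalization does, the sketch below applies the same formula to a small made-up table and compares the result with scikit-learn's `MinMaxScaler`, which performs the same rescaling to $[0,1]$. The toy values and the names `toy`, `by_hand`, and `by_sklearn` are illustrative assumptions, not part of the Airbnb data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def maxmin(x):
    # Rescale a column so its minimum maps to 0 and its maximum to 1
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({'beds': [1.0, 2.0, 5.0], 'score': [60.0, 80.0, 100.0]})

by_hand = toy.apply(maxmin)                     # column-by-column, as in the assignment code
by_sklearn = MinMaxScaler().fit_transform(toy)  # same rescaling, returned as a NumPy array

print(by_hand)
print(np.allclose(by_hand.to_numpy(), by_sklearn))
```

Note that a column holding a single repeated value makes the denominator zero, so the hand-written version returns `nan` for that column.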


3. Use `sklearn`'s `KMeans` module to cluster the data by `Beds`, `Number of Reviews`, and `Review Scores Rating` for `k=6`.


```
from sklearn.cluster import KMeans
model = KMeans(n_clusters=6, max_iter=300, n_init = 10, random_state=0)
model = model.fit(normalized) # Fit the model
normalized['cluster'] = model.labels_
print(normalized.describe())
```


4. Use `seaborn`'s `.pairplot()` to make a grid of scatterplots that show how the clustering is carried out in multiple dimensions.


```
import matplotlib.pyplot as plt
import seaborn as sns
pairplot = sns.pairplot(normalized, hue='cluster', palette='bright', vars=['beds', 'n_reviews', 'score'])
plt.show()
```


5. Use `.groupby` and `.describe` to compute the average price for each cluster. Which clusters have the highest rental prices?

* Cluster 3 has the highest average rental price, at about $293.53 per night.
```
cleaned_data['cluster'] = model.labels_
avg_cluster_price = cleaned_data.loc[:,['price','cluster']].groupby('cluster').describe()
print(avg_cluster_price)
mean_prices = avg_cluster_price['price']['mean']
highest_price_cluster = mean_prices.idxmax()
print(f"Highest average price cluster: {highest_price_cluster}")
print(f"Average price for this cluster: ${mean_prices[highest_price_cluster]:.2f}")
```


6. Use a scree plot to pick the number of clusters and repeat steps 4 and 5.


```
import numpy as np

# Fit on the three features only: `normalized` may already hold a 'cluster'
# column from step 3, and that column must not be used as an input
features = normalized.loc[:, ['beds', 'n_reviews', 'score']]

k_bar = 15
k_grid = np.arange(1, k_bar + 1)
SSE = np.zeros(k_bar)
for k in range(k_bar):
    model = KMeans(n_clusters=k+1, max_iter=300, n_init=10, random_state=0)
    model.fit(features)
    SSE[k] = model.inertia_ # Within-cluster sum of squared errors for k+1 clusters

plt.figure(figsize=(10, 6))
sns.lineplot(x=k_grid, y=SSE)
plt.title('Scree Plot')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.show()

chosen_clusters = 3 # Picked from the scree plot

model = KMeans(n_clusters=chosen_clusters, max_iter=300, n_init=10, random_state=0)
model.fit(features)
normalized['cluster'] = model.labels_
# pairplot creates its own figure, so no plt.figure() call is needed
sns.pairplot(data=normalized, hue='cluster', palette='bright', vars=['beds', 'n_reviews', 'score'])
plt.show()
cleaned_data['cluster'] = model.labels_
cluster_price_stats = cleaned_data.loc[:, ['price', 'cluster']].groupby('cluster').describe()
print(cluster_price_stats)

```



**Q2.** This is a question about $k$ means clustering. We want to investigate how adjusting the "noisiness" of the data impacts the quality of the algorithm and the difficulty of picking $k$.

1. Run the code below, which creates five datasets: `df0_125`, `df0_25`, `df0_5`, `df1_0`, and `df2_0`. Each dataset is created by increasing the amount of `noise` (standard deviation) around the cluster centers, from `0.125` to `0.25` to `0.5` to `1.0` to `2.0`.
* Ran the code below.

```
import numpy as np
import pandas as pd

def createData(noise,N=50):
    np.random.seed(100) # Set the seed for replicability
    # Generate (x1,x2,g) triples:
    X1 = np.array([np.random.normal(1,noise,N),np.random.normal(1,noise,N)])
    X2 = np.array([np.random.normal(3,noise,N),np.random.normal(2,noise,N)])
    X3 = np.array([np.random.normal(5,noise,N),np.random.normal(3,noise,N)])
    # Concatenate into one data frame
    gdf1 = pd.DataFrame({'x1':X1[0,:],'x2':X1[1,:],'group':'a'})
    gdf2 = pd.DataFrame({'x1':X2[0,:],'x2':X2[1,:],'group':'b'})
    gdf3 = pd.DataFrame({'x1':X3[0,:],'x2':X3[1,:],'group':'c'})
    df = pd.concat([gdf1,gdf2,gdf3],axis=0)
    return df

df0_125 = createData(0.125)
df0_25 = createData(0.25)
df0_5 = createData(0.5)
df1_0 = createData(1.0)
df2_0 = createData(2.0)
```

2. Make scatterplots of the $(X1,X2)$ points by group for each of the datasets. As the `noise` goes up from 0.125 to 2.0, what happens to the visual distinctness of the clusters?

* As noise increases, each cluster spreads out around its center. The clusters start to overlap, so they become harder to tell apart visually; at the highest noise level the groups largely sit on top of one another.


```
import seaborn as sns
import matplotlib.pyplot as plt
def scatter(df, noise_level, ax):
    sns.scatterplot(data=df, x='x1', y='x2', hue='group', style='group', ax=ax)
    ax.set_title(f'Scatterplot with Noise Level: {noise_level}')

fig, axes = plt.subplots(2, 3, figsize=(10.5, 7))

scatter(df0_125, '0.125', axes[0, 0])
scatter(df0_25, '0.25', axes[0, 1])
scatter(df0_5, '0.5', axes[0, 2])
scatter(df1_0, '1.0', axes[1, 0])
scatter(df2_0, '2.0', axes[1, 1])

axes[1, 2].axis('off')
plt.tight_layout()
plt.show()
```
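One way to put a number on the visual impression from the scatterplots is to cluster each dataset with $k=3$ and compare the cluster assignments to the true `group` labels. The sketch below does this with scikit-learn's `adjusted_rand_score` (1.0 means perfect agreement, values near 0 mean chance-level agreement). The helper names `make_data` and `recovery_score` are hypothetical; `make_data` mirrors `createData` so that the block is self-contained, the data are left unnormalized for simplicity, and the choice of metric is an illustrative assumption rather than part of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def make_data(noise, N=50):
    # Same construction as createData: three Gaussian blobs sharing one spread
    np.random.seed(100)
    centers = {'a': (1, 1), 'b': (3, 2), 'c': (5, 3)}
    frames = []
    for group, (m1, m2) in centers.items():
        x1 = np.random.normal(m1, noise, N)
        x2 = np.random.normal(m2, noise, N)
        frames.append(pd.DataFrame({'x1': x1, 'x2': x2, 'group': group}))
    return pd.concat(frames, axis=0)

def recovery_score(df, k=3):
    # Cluster on (x1, x2) only, then compare with the known group labels
    X = df[['x1', 'x2']]
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return adjusted_rand_score(df['group'], labels)

noise_levels = [0.125, 0.25, 0.5, 1.0, 2.0]
scores = {noise: recovery_score(make_data(noise)) for noise in noise_levels}
for noise, score in scores.items():
    print(f'noise={noise}: adjusted Rand index = {score:.3f}')
```

The adjusted Rand index ignores how the clusters happen to be numbered, which matters here because $k$ means assigns arbitrary cluster labels.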




3. Create a scree plot for each of the datasets. Describe how the level of `noise` affects the scree plot (particularly the presence of a clear "elbow") and your ability to definitively select a $k$.
* As noise increases, the elbow becomes less pronounced: the SSE curve flattens into a smooth decline, so it is harder to pick a single $k$ with confidence.

```
from sklearn.cluster import KMeans

def maxmin(x):
    x = (x-min(x))/(max(x)-min(x))
    return x

def scree(data, ax, title):
    X = data.loc[:, ['x1', 'x2']]
    X = X.apply(maxmin)
    k_bar = 15
    k_grid = np.arange(1, k_bar + 1)
    SSE = np.zeros(k_bar)
    for k in range(k_bar):
        model = KMeans(n_clusters=k + 1, max_iter=300, n_init=10, random_state=0)
        model.fit(X)
        SSE[k] = model.inertia_
    sns.lineplot(x=k_grid, y=SSE, ax=ax)
    ax.set_title(title)
    ax.set_ylim(0, 35)

fig, axes = plt.subplots(2, 3, figsize=(10.5, 7))
scree(df0_125, axes[0, 0], 'Noise 0.125')
scree(df0_25, axes[0, 1], 'Noise 0.25')
scree(df0_5, axes[0, 2], 'Noise 0.5')
scree(df1_0, axes[1, 0], 'Noise 1.0')
scree(df2_0, axes[1, 1], 'Noise 2.0')
axes[1, 2].axis('off')
plt.tight_layout()
plt.show()
```


4. Explain the intuition of the elbow, using this numerical simulation as an example.
* The simulation varies the noise over five levels (0.125, 0.25, 0.5, 1.0, and 2.0) while the true number of groups stays at three. SSE falls as $k$ rises, because adding centers lets each point sit closer to its nearest center. The elbow is the value of $k$ after which that fall becomes small. At low noise, the points sit tightly around three well-separated centers, so moving up to $k=3$ removes most of the SSE and any further cluster only splits an already tight group; the marginal benefit drops sharply and the elbow is clear. At high noise, the groups spread out and overlap, so there is no single $k$ at which the marginal benefit falls off, and the scree plot declines smoothly with no clear elbow.
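
The elbow can also be read off numerically by looking at how much the SSE falls with each added cluster. The sketch below is a minimal illustration on three synthetic, well-separated blobs; the random generator, the seed, and the names `sse` and `drops` are assumptions made for this example rather than the assignment's own code.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
centers = np.array([[1, 1], [3, 2], [5, 3]])
# 50 points per center with a small spread (low noise)
X = np.vstack([rng.normal(c, 0.125, size=(50, 2)) for c in centers])

ks = range(1, 7)
sse = np.array([
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
])

# Reduction in SSE gained by going from k to k+1 clusters
drops = sse[:-1] - sse[1:]
for k, d in zip(ks, drops):
    print(f'k={k} -> k={k + 1}: SSE falls by {d:.2f}')
```

With blobs this well separated, the falls up to $k=3$ should be large and the later ones close to zero, which is the elbow. Raising the spread from `0.125` toward `2.0` evens out the drops.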

**Q3.** We looked at computer vision with $k$NN in a previous question. Can $k$ means clustering correctly group digits, even if we don't know which symbols are which?

1. To load the data, run the following code in a chunk:
```
from keras.datasets import mnist
df = mnist.load_data('minst.db')
train,test = df
X_train, y_train = train
X_test, y_test = test
```
The `y_test` and `y_train` vectors, for each index `i`, tell you what number is written in the corresponding index in `X_train[i]` and `X_test[i]`. The value of `X_train[i]` and `X_test[i]`, however, is a 28$\times$28 array whose entries contain values between 0 and 255. Each element of the matrix is essentially a "pixel" and the matrix encodes a representation of a number. To visualize this, run the following code to see the first five numbers:
```
import matplotlib.pyplot as plt
import numpy as np
np.set_printoptions(edgeitems=30, linewidth=100000)
for i in range(5):
    print(y_test[i],'\n') # Print the label
    print(X_test[i],'\n') # Print the matrix of values
    plt.contourf(np.rot90(X_test[i].transpose())) # Make a contour plot of the matrix values
    plt.show()
```
OK, those are the data: Labels attached to handwritten digits encoded as a matrix.

2. What is the shape of `X_train` and `X_test`? What is the shape of `X_train[i]` and `X_test[i]` for each index `i`? What is the shape of `y_train` and `y_test`?
3. Use NumPy's `.reshape()` method to convert the training and testing data from a matrix into a vector of features. So, `X_test[index].reshape((1,784))` will convert the $index$-th element of `X_test` into a $28\times 28=784$-length row vector of values, rather than a matrix. Turn `X_train` into an $N \times 784$ matrix $X$ that is suitable for scikit-learn's kNN classifier where $N$ is the number of observations and $784=28*28$ (you could use, for example, a `for` loop).
4. Use $k$ means clustering on the reshaped `X_test` data with `k=10`.  
5. Cross tabulate the cluster assignments with the true labels for the test set values. How good is the correspondence? What proportion of digits are clustered correctly? Which digits are the hardest to distinguish from one another? Can $k$MC recover the latent digits 0 to 9, without even knowing what those digits were?
6. If you use a scree plot to determine the number of clusters $k$, does it pick 10 (the true number of digits), or not? If it fails to pick $k=10$, which digits does it tend to combine into the same classification?
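
The reshape, cluster, and cross-tabulate steps above can be sketched on a smaller stand-in dataset. The block below uses scikit-learn's bundled `load_digits` data (1797 images of size 8$\times$8) instead of MNIST so that it runs without a download; that substitution, and the names `tab` and `share_matched`, are assumptions for illustration. Scoring each cluster by its most common digit is one possible way to define "clustered correctly", not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

# Stand-in for MNIST: 8x8 digit images bundled with scikit-learn
digits = load_digits()
images, labels = digits.images, digits.target

# Flatten each image into a row vector: (N, 8, 8) -> (N, 64)
X = images.reshape((images.shape[0], -1))

# Cluster the images without using the labels
model = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

# Rows are clusters, columns are the true digits
tab = pd.crosstab(model.labels_, labels, rownames=['cluster'], colnames=['digit'])
print(tab)

# Treat each cluster's most common digit as its label, then count the matches
share_matched = tab.max(axis=1).sum() / tab.to_numpy().sum()
print(f"Share of images matching their cluster's majority digit: {share_matched:.3f}")
```

For MNIST, the equivalent reshape would be `X_test.reshape((X_test.shape[0], 784))`. Off-diagonal mass concentrated in one row of `tab` points to digits that the clustering tends to merge.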