# K means clustering for portfolio diversity

You are a machine learning engineer working in the finance industry. The goal of your current project is to help investors improve portfolio diversity by telling them which stocks are likely to "move together", i.e. to increase at the same time (e.g. because of a new development in their common industry) or decrease at the same time (e.g. because of a war in their common geographical home base).

Using a historical dataset of daily stock movement (the difference between a stock's closing price and its opening price on that day), you decide to use K means clustering to find groups of stocks that "move together". You will assign stocks to clusters for several different values of "number of clusters", so that the finance experts can decide which values are most useful for them. You will save the cluster indices assigned by the default K means clustering algorithm in sklearn to a variable `c_idx_def`.

Since it's not meaningful in this context to have a cluster with only a few stocks, you will also post-process the clusters as follows: whenever the default K means clustering finds a cluster with fewer than 3 samples, you will re-assign each sample in that cluster to the nearest cluster (by Euclidean distance to the cluster center) among those that had 3 or more samples assigned to them by the default K means clustering. You will save your modified version of the cluster indices in `c_idx_mod`.
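
The re-assignment rule can be tried out on a toy example before touching the real data. The sketch below is only an illustration: the helper name `reassign_small_clusters`, the toy points, and the hand-made labels are all hypothetical, and the centers are computed by hand instead of coming from a fitted model. Note that the centers from the original assignment are used as they are; they are not recomputed after samples are moved.

```python
import numpy as np

def reassign_small_clusters(X, labels, centers, min_size=3):
    """Move samples from clusters smaller than min_size to the nearest large cluster."""
    sizes = np.bincount(labels, minlength=len(centers))
    large = np.flatnonzero(sizes >= min_size)                   # clusters that keep their members
    moved = np.isin(labels, np.flatnonzero(sizes < min_size))   # mask of samples to re-assign
    new_labels = labels.copy()
    if moved.any() and large.size > 0:
        # distances[i, j] = Euclidean distance from moved sample i to large center j
        distances = np.linalg.norm(
            X[moved][:, None, :] - centers[large][None, :, :], axis=2
        )
        new_labels[moved] = large[distances.argmin(axis=1)]
    return new_labels

# Hypothetical toy data: two tight groups of three points plus one lone point.
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
                  [4.0, 4.0]])
labels_toy = np.array([0, 0, 0, 1, 1, 1, 2])   # cluster 2 has a single member
centers_toy = np.array([X_toy[labels_toy == k].mean(axis=0) for k in range(3)])

new_toy = reassign_small_clusters(X_toy, labels_toy, centers_toy)
print(new_toy)   # the lone point at (4, 4) joins cluster 1
```

Here the single-member cluster 2 disappears from the modified labels, while the samples in clusters 0 and 1 keep their original indices.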

| Name | Type | Description |
| --- | --- | --- |
| `c_idx_def` | 2D numpy array | Cluster indices from default sklearn implementation |
| `c_idx_mod` | 2D numpy array | Cluster indices after re-assigning samples in clusters with fewer than 3 members |

In [1]:
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd

Read in the `stock_movement.csv` dataset. Each row in this data corresponds to one stock, and each column corresponds to one day of trading. The data is indexed with its "ticker", or stock symbol.

In [2]:
df = pd.read_csv("stock_movement.csv", index_col=0)
df.head()

Unnamed: 0,2020-01-02,2020-01-03,2020-01-06,2020-01-07,2020-01-08,2020-01-09,2020-01-10,2020-01-13,2020-01-14,2020-01-15,...,2022-12-28,2022-12-29,2022-12-30,2023-01-03,2023-01-04,2023-01-05,2023-01-06,2023-01-09,2023-01-10,2023-01-11
AAPL,0.007886,0.000537,0.011531,-0.002782,0.01157,0.004586,-0.000518,0.010207,-0.007713,-0.000979,...,-0.02786,0.012433,0.011666,-0.039986,-0.004068,-0.016194,0.027706,-0.002456,0.003607,0.017192
ADBE,0.009704,0.005783,0.011873,-0.001665,0.008894,0.000679,-0.004907,0.00896,-0.005214,-0.007623,...,-0.014743,0.014962,0.007733,-0.007098,-0.003242,-0.019913,0.00103,0.006747,0.0023,0.012136
ADI,0.002177,0.002109,0.004558,0.004627,0.002381,-0.0066,-0.015172,0.001905,0.003742,-0.012927,...,-0.008301,0.009525,0.011362,-0.021228,0.003606,-0.032522,0.025582,-0.001973,0.018506,0.014084
ADP,-0.008449,0.009789,0.005719,-0.005976,0.00474,0.004276,-0.004688,0.004328,-0.004585,0.003812,...,-0.018032,0.012571,-0.004328,-0.016126,-0.007161,-0.019938,0.021535,-0.008295,0.009119,0.01592
ADSK,0.015114,0.00192,0.013945,0.00572,0.007223,0.00096,-0.00334,0.010229,-0.00476,-0.000626,...,-0.010813,0.019414,0.008726,-0.022837,0.000292,-0.016617,0.012358,0.008141,0.01336,0.01908


In [3]:
X = df.values

You will fit clusters across a range of different "number of clusters", so that the finance experts can decide which value for "number of clusters" is most useful for them. This is the list of values you will consider:

In [4]:
n_cluster_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

For each value in `n_cluster_list`:

* Use the `sklearn` implementation of K means clustering to assign each of the samples in `X` to a cluster, specifying the number of clusters to fit.
* Also specify `random_state=42` and `n_init=1`, but leave other arguments at their default values.
* Save the cluster indices assigned by the model to the corresponding column in `c_idx_def`.
* Then, find the size of each of the clusters. (Hint: you can use `np.unique`.)
* For each cluster that has fewer than the required number of samples (specified on the question page), re-assign each sample in that cluster to the closest cluster that *did* have enough samples assigned by the default model. ("Closest" means "minimum Euclidean distance to the cluster center". Hint: use `np.linalg.norm`.)
* Save your new cluster assignments (for all samples, not just the ones you modified) to the corresponding column in `c_idx_mod`.
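
The two hints in the list above can be tried in isolation. The following is a minimal sketch on made-up values (the arrays `labels_demo`, `centers_demo`, and `sample_demo` are hypothetical and not part of the assignment data). It shows `np.unique` with `return_counts=True` for cluster sizes, and `np.linalg.norm` with `axis=1` for the distance from one sample to every center.

```python
import numpy as np

# Hypothetical cluster indices for eight samples.
labels_demo = np.array([0, 0, 1, 0, 2, 2, 2, 0])
cluster_ids, counts = np.unique(labels_demo, return_counts=True)
# cluster 0 has 4 members, cluster 1 has 1, cluster 2 has 3

# Clusters that meet a minimum size of 3 (the threshold used in this notebook).
eligible = cluster_ids[counts >= 3]

# Hypothetical cluster centers and one sample that belongs to the small cluster 1.
centers_demo = np.array([[1.0, 1.0], [4.0, 6.0], [4.0, 2.0]])
sample_demo = np.array([4.0, 5.0])

# Euclidean distance from the sample to every center (one value per center).
dist_all = np.linalg.norm(centers_demo - sample_demo, axis=1)

# Restrict the search to eligible clusters, then map back to the cluster index.
nearest = eligible[np.argmin(dist_all[eligible])]
print(counts, dist_all, nearest)
```

In this example the overall nearest center is that of cluster 1, the sample's own undersized cluster, which is why the search is restricted to `eligible` before taking the `argmin`.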

For full credit, you should use no more than three `for` loops.

In [7]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)

# initialize graded arrays
c_idx_def = np.zeros((X.shape[0], len(n_cluster_list)))
c_idx_mod = np.zeros((X.shape[0], len(n_cluster_list)))

# fill in the rest of your code here...
for i, n_clusters in enumerate(n_cluster_list):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=1)
    labels = kmeans.fit_predict(X)
    c_idx_def[:, i] = labels

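    # count the samples in each cluster and split clusters by the size threshold
    cluster_sizes = np.bincount(labels)
    valid_clusters = np.where(cluster_sizes >= 3)[0]
    invalid_clusters = np.where(cluster_sizes < 3)[0]

    new_labels = labels.copy()
    for invalid_cluster in invalid_clusters:
        invalid_samples = np.where(labels == invalid_cluster)[0]
        # distances[j, k]: distance from the j-th sample of the small cluster to the k-th valid center
        distances = np.linalg.norm(
            X[invalid_samples][:, np.newaxis] - kmeans.cluster_centers_[valid_clusters],
            axis=2,
        )
        # map the position of the nearest valid center back to its cluster index
        closest_clusters = valid_clusters[np.argmin(distances, axis=1)]
        new_labels[invalid_samples] = closest_clusters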

    c_idx_mod[:, i] = new_labels
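
One possible sanity check on the result is to look at the smallest cluster in each column of the label array. The sketch below uses a hypothetical helper, `smallest_cluster_per_column`, and a made-up stand-in array so that it runs on its own. In the notebook, the same helper could be applied to `c_idx_mod`, where every entry should be at least 3, assuming each fit produced at least one cluster with 3 or more members.

```python
import numpy as np

def smallest_cluster_per_column(c_idx):
    """Return the size of the smallest non-empty cluster in each column of c_idx."""
    return np.array([
        np.unique(col, return_counts=True)[1].min() for col in c_idx.T
    ])

# Hypothetical stand-in for a label array: 7 samples, 2 values of "number of clusters".
c_idx_toy = np.array([[0, 0],
                      [0, 0],
                      [0, 0],
                      [0, 1],
                      [0, 1],
                      [0, 1],
                      [0, 1]], dtype=float)

smallest = smallest_cluster_per_column(c_idx_toy)
print(smallest)              # one entry per column
print((smallest >= 3).all()) # True if no undersized cluster remains
```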