# Unsupervised Model 1 - KMeans
In this approach, we implement the KMeans clustering algorithm to group the dataset by the combined encodings of ligand SMILES strings and target protein sequences. The aim is to build a classification model that can predict the drug-protein affinities of drug-protein combinations the model has not yet seen.


# Prerequisites

First, we need to install the prerequisites required to run our code. The most important libraries are as follows:


*   **rdkit** - an open-source cheminformatics library
*   **sklearn** (scikit-learn) - an open-source machine learning library



In [1]:
!pip install rdkit



In [2]:
from rdkit import Chem
from rdkit.Chem import AllChem

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

import pandas as pd
import numpy as np
import joblib
import math
import random

# Data Loading and Preprocessing

Next, we import our dataset. For this model, we use the `EC50_bind.tsv` data file.

In [3]:
import os
print(os.getcwd())
data = pd.read_csv('EC50_bind.tsv', sep='\t', on_bad_lines='skip')
print(data.shape)
data.head()

/content
(163745, 6)


Unnamed: 0,drug_id,target_id,smiles,target_seq,origin_affinity,affinity
0,100000,P49862,CN1CCN(Cc2c(O)c(Cl)cc3c(cc(=O)oc23)-c2ccccc2)CC1,MARSLLLPLQILLLSLALETAGEEAQGDKIIDGAPCARGSHPWQVA...,68293,4.165624
1,100001,P49862,COc1ccccc1C1CC(=Nc2nnnn12)c1ccc(C)cc1,MARSLLLPLQILLLSLALETAGEEAQGDKIIDGAPCARGSHPWQVA...,23546,4.628083
2,100002,P49862,Cc1oc2c(CN3CCCC3)c(O)ccc2c(=O)c1-c1ccc(Br)cc1,MARSLLLPLQILLLSLALETAGEEAQGDKIIDGAPCARGSHPWQVA...,>69498,4.158021
3,100003,P49862,CCN1C(c2ccccn2)n2c(nc3ccccc23)-c2ccccc12,MARSLLLPLQILLLSLALETAGEEAQGDKIIDGAPCARGSHPWQVA...,>69511,4.15794
4,100004,P49862,Oc1ccc2c(occ(-c3ccc(Br)cc3)c2=O)c1CN1CCOCC1,MARSLLLPLQILLLSLALETAGEEAQGDKIIDGAPCARGSHPWQVA...,66092,4.179851


Remove any rows where 'smiles' or 'target_seq' is NaN.

In [4]:
data = data[data['smiles'].notna()]
data = data[data['target_seq'].notna()]

print(data.shape)
data.head()

(163745, 6)


Unnamed: 0,drug_id,target_id,smiles,target_seq,origin_affinity,affinity
0,100000,P49862,CN1CCN(Cc2c(O)c(Cl)cc3c(cc(=O)oc23)-c2ccccc2)CC1,MARSLLLPLQILLLSLALETAGEEAQGDKIIDGAPCARGSHPWQVA...,68293,4.165624
1,100001,P49862,COc1ccccc1C1CC(=Nc2nnnn12)c1ccc(C)cc1,MARSLLLPLQILLLSLALETAGEEAQGDKIIDGAPCARGSHPWQVA...,23546,4.628083
2,100002,P49862,Cc1oc2c(CN3CCCC3)c(O)ccc2c(=O)c1-c1ccc(Br)cc1,MARSLLLPLQILLLSLALETAGEEAQGDKIIDGAPCARGSHPWQVA...,>69498,4.158021
3,100003,P49862,CCN1C(c2ccccn2)n2c(nc3ccccc23)-c2ccccc12,MARSLLLPLQILLLSLALETAGEEAQGDKIIDGAPCARGSHPWQVA...,>69511,4.15794
4,100004,P49862,Oc1ccc2c(occ(-c3ccc(Br)cc3)c2=O)c1CN1CCOCC1,MARSLLLPLQILLLSLALETAGEEAQGDKIIDGAPCARGSHPWQVA...,66092,4.179851


As the sample of the dataset in the output above shows, two features need to be transformed into numeric form before our model can use them.

The "smiles" feature holds a SMILES (Simplified Molecular Input Line Entry System) string, which is a textual representation of the molecular structure of the drug. So that our model can take each chemical structure into account, we convert the SMILES data into a 2048-bit vector called a Morgan fingerprint. The bits of the fingerprint indicate which chemical substructures are present in the drug molecule.

The "target_seq" feature is the sequence of amino acids in the target protein. We convert each sequence into the fractional composition of each amino acid, that is, the fraction of the sequence made up by that amino acid. This allows us to evaluate each amino acid individually and how it contributes to the classification.

The function `smiles_to_fingerprint(smiles_string, atom_radius, number_of_bits)` takes three inputs:


*   smiles_string = the SMILES string describing the molecular structure of the drug molecule
*   atom_radius = an integer giving how many bonds out from each atom in the molecule to consider
*   number_of_bits = an integer giving how many bits the output vector should have

The function attempts to convert the molecular structure to a Morgan fingerprint bit vector. If the SMILES string cannot be parsed, or fingerprint generation fails, it returns a vector of zeros instead.








In [5]:
def smiles_to_fingerprint(smiles_string, atom_radius=2, number_of_bits=2048):
  # MolFromSmiles returns None when RDKit cannot parse the SMILES string
  mol = Chem.MolFromSmiles(smiles_string)
  if mol is None:
    return np.zeros((number_of_bits,))
  morganGenerator = AllChem.GetMorganGenerator(radius = atom_radius, fpSize = number_of_bits)
  try:
    return morganGenerator.GetFingerprint(mol)
  except Exception:
    # Fall back to an all-zero vector if fingerprint generation fails
    return np.zeros((number_of_bits,))

The function `sequence_to_composition(target_sequence_string)` takes one input:

*   target_sequence_string = the sequence of amino acids making up the target protein, as a string

The function converts the sequence of amino acids into an array of fractional compositions, one entry per standard amino acid. This composition array tells us what fraction of the target protein each amino acid accounts for.


In [6]:
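def sequence_to_composition(target_sequence_string):
  # The 20 standard amino acids, by one-letter code
  amino_acids = 'ACDEFGHIKLMNPQRSTVWY'
  try:
    composition = [target_sequence_string.count(aa) / len(target_sequence_string) for aa in amino_acids]
  except (ZeroDivisionError, TypeError, AttributeError):
    # Empty or non-string sequence: report it and return NaNs (an assumed
    # fallback) so the row is caught by the NaN check later in the notebook
    print(target_sequence_string)
    return np.full(len(amino_acids), np.nan)
  return np.array(composition)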
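
As a quick illustration of the composition encoding, the sketch below computes the same kind of fractions for a short toy sequence using only the standard library. The helper name `composition_fractions` and the toy sequence are assumptions made for illustration and are not part of the notebook.

```python
# Illustrative sketch: amino acid composition of a toy sequence.
# `composition_fractions` and `toy_sequence` are hypothetical names.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def composition_fractions(sequence):
    """Return the fraction of `sequence` made up by each amino acid."""
    if not sequence:
        raise ValueError("sequence must be a non-empty string")
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]

toy_sequence = "MARSLLLP"  # first eight residues of the sample target above
fractions = composition_fractions(toy_sequence)

# Report only the amino acids that actually occur in the sequence
for aa, fraction in zip(AMINO_ACIDS, fractions):
    if fraction > 0:
        print(f"{aa}: {fraction:.3f}")
```

The fractions sum to 1 only when every character is one of the 20 standard letters. Any other character (for example `X` for an unknown residue) is not counted, so the sum would then fall below 1.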

Create two new features in the dataset, 'Fingerprint' and 'Composition', which hold the Morgan fingerprint and the target protein's amino acid composition respectively.

In [7]:
data['Fingerprint'] = data['smiles'].apply(smiles_to_fingerprint)
data['Composition'] = data['target_seq'].apply(sequence_to_composition)

[23:05:26] Can't kekulize mol.  Unkekulized atoms: 23 24 25
[23:05:26] Can't kekulize mol.  Unkekulized atoms: 29 30 31
[23:05:27] Explicit valence for atom # 23 N, 4, is greater than permitted
[23:05:28] Explicit valence for atom # 12 N, 4, is greater than permitted
[23:05:29] Explicit valence for atom # 1 N, 4, is greater than permitted
[23:05:29] Explicit valence for atom # 1 N, 4, is greater than permitted
[23:05:30] Explicit valence for atom # 5 N, 4, is greater than permitted
[23:05:30] Explicit valence for atom # 5 N, 4, is greater than permitted
[23:05:31] Explicit valence for atom # 41 N, 4, is greater than permitted
[23:05:31] Explicit valence for atom # 0 N, 4, is greater than permitted
[23:05:31] Explicit valence for atom # 1 N, 4, is greater than permitted
[23:05:31] Can't kekulize mol.  Unkekulized atoms: 7 8 9 10 11 12 13 14 15
[23:05:32] Explicit valence for atom # 25 N, 4, is greater than permitted
[23:05:32] Explicit valence for atom # 25 N, 4, is greater than permitt

In [8]:
print(data.shape)
data.head()

(163745, 8)


Unnamed: 0,drug_id,target_id,smiles,target_seq,origin_affinity,affinity,Fingerprint,Composition
0,100000,P49862,CN1CCN(Cc2c(O)c(Cl)cc3c(cc(=O)oc23)-c2ccccc2)CC1,MARSLLLPLQILLLSLALETAGEEAQGDKIIDGAPCARGSHPWQVA...,68293,4.165624,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...","[0.05533596837944664, 0.04743083003952569, 0.0..."
1,100001,P49862,COc1ccccc1C1CC(=Nc2nnnn12)c1ccc(C)cc1,MARSLLLPLQILLLSLALETAGEEAQGDKIIDGAPCARGSHPWQVA...,23546,4.628083,"[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...","[0.05533596837944664, 0.04743083003952569, 0.0..."
2,100002,P49862,Cc1oc2c(CN3CCCC3)c(O)ccc2c(=O)c1-c1ccc(Br)cc1,MARSLLLPLQILLLSLALETAGEEAQGDKIIDGAPCARGSHPWQVA...,>69498,4.158021,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.05533596837944664, 0.04743083003952569, 0.0..."
3,100003,P49862,CCN1C(c2ccccn2)n2c(nc3ccccc23)-c2ccccc12,MARSLLLPLQILLLSLALETAGEEAQGDKIIDGAPCARGSHPWQVA...,>69511,4.15794,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.05533596837944664, 0.04743083003952569, 0.0..."
4,100004,P49862,Oc1ccc2c(occ(-c3ccc(Br)cc3)c2=O)c1CN1CCOCC1,MARSLLLPLQILLLSLALETAGEEAQGDKIIDGAPCARGSHPWQVA...,66092,4.179851,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.05533596837944664, 0.04743083003952569, 0.0..."


Now, we will combine the 'Fingerprint' feature and the 'Composition' feature into one feature vector to pass into the KMeans model. We achieve this by using `np.hstack` to horizontally concatenate the 'Fingerprint' and 'Composition' arrays for each drug-target interaction, and we assign the resulting array to our variable, `X`.

In addition, we will also assign an array of drug-target affinities to our variable, `y`.

Although the affinities in `y` won't be utilized during clustering, we will utilize the affinity scores to evaluate the effectiveness of our model.

In [9]:
X_drug = np.array(list(data['Fingerprint']))
X_target = np.array(list(data['Composition']))

X = np.hstack([X_drug, X_target])

y = data['affinity'].values

Now, we'll do a bit of data cleaning to check both our X and y arrays for NaN values. Wherever a value is NaN, we will drop that row from both the X and y arrays, as well as from our original dataset.

In [10]:
# Row positions containing at least one NaN in X, or a NaN in y
x_nan_rows = np.argwhere(np.isnan(X))[:, 0]
y_nan_rows = np.argwhere(np.isnan(y))[:, 0]

# Sorted, de-duplicated union of the two sets of row positions
all_nan_indices = np.union1d(x_nan_rows, y_nan_rows)

clean_X = np.delete(X, all_nan_indices, axis = 0)
clean_y = np.delete(y, all_nan_indices)

# all_nan_indices holds row positions, not index labels, so translate them
# through data.index before dropping
clean_data = data.drop(data.index[all_nan_indices], axis = 0)

clean_data = clean_data.reset_index(drop=True)
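
The idea behind this cleaning step can be sketched without NumPy: find every row position where either a feature or the target is NaN, then drop those positions from both collections so they stay aligned. The toy data and the helper name `rows_with_nan` are assumptions made for illustration.

```python
# Illustrative sketch: dropping rows with NaN from paired features and targets.
# The toy data and `rows_with_nan` are hypothetical.
import math

def rows_with_nan(features, targets):
    """Return the sorted row positions where a feature or the target is NaN."""
    bad_rows = set()
    for position, (row, target) in enumerate(zip(features, targets)):
        if math.isnan(target) or any(math.isnan(value) for value in row):
            bad_rows.add(position)
    return sorted(bad_rows)

toy_X = [[0.0, 1.0], [float("nan"), 0.5], [1.0, 1.0], [0.2, 0.3]]
toy_y = [4.1, 4.6, float("nan"), 4.2]

# Remove the same positions from both collections so rows stay paired
drop = set(rows_with_nan(toy_X, toy_y))
clean_toy_X = [row for i, row in enumerate(toy_X) if i not in drop]
clean_toy_y = [t for i, t in enumerate(toy_y) if i not in drop]

print(sorted(drop))
print(clean_toy_X, clean_toy_y)
```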

The last part of our data preprocessing is standardizing the cleaned feature matrix, `clean_X`. KMeans relies on Euclidean distances, so features on larger scales would otherwise dominate the clustering. Standardization can easily be done with the `StandardScaler` class from the sklearn library.

In [11]:
standard_scaler_obj = StandardScaler()
clean_scaled_X = standard_scaler_obj.fit_transform(clean_X)

print(clean_scaled_X.shape)

(163745, 2068)
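
For intuition, standardization replaces each value with its z-score: subtract the column mean, then divide by the column's population standard deviation. The sketch below is a minimal pure-Python version of that idea. `standardize_columns` is a hypothetical helper and is not scikit-learn's implementation. It handles constant columns by dividing by 1, which is also how `StandardScaler` treats zero-variance features.

```python
# Illustrative sketch of column-wise standardization (z-scores).
# `standardize_columns` is a hypothetical helper, not scikit-learn's code.
from statistics import fmean, pstdev

def standardize_columns(rows):
    """Scale each column to zero mean and unit population standard deviation."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Constant columns have zero spread; divide by 1 instead to avoid 0/0
    scales = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - mean) / scale for value, mean, scale in zip(row, means, scales)]
        for row in rows
    ]

toy = [[1.0, 100.0, 0.0], [2.0, 300.0, 0.0], [3.0, 200.0, 0.0]]
scaled = standardize_columns(toy)
for row in scaled:
    print([round(value, 3) for value in row])
```

The third toy column mimics a fingerprint bit that is never set: it stays at zero after scaling and so contributes nothing to the distances KMeans computes.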


# Model Training and Evaluation

Now that our data has been cleaned and standardized, we can initialize and fit our KMeans model. We will evaluate the performance of our KMeans model at various cluster values.
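
To make clear what fitting a KMeans model involves, here is a minimal sketch of the basic k-means (Lloyd's) iteration in pure Python: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat. The function name, toy data, and random initialization are assumptions made for illustration. scikit-learn's `KMeans` uses k-means++ initialization by default and a much more optimized implementation.

```python
# Illustrative sketch of the basic k-means (Lloyd's) iteration.
# Names and toy data are hypothetical; this is not scikit-learn's code.
import math
import random

def kmeans(points, k, iterations=20, seed=42):
    """Cluster `points` into `k` groups; return (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random initial centers
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

toy_points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(toy_points, k=2)
print(labels)
```

Which cluster receives label 0 depends on the random initialization, so only the grouping is meaningful, not the label values themselves.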

In [12]:
import matplotlib.image as mpimg
import plotly.graph_objects as go
import plotly.io as pio
import plotly.express as px
from matplotlib import pyplot as plt
import seaborn as sns

## Principal Component Analysis (PCA) for Graphing and Dimensionality Reduction

Before we train our KMeans models, we're going to run Principal Component Analysis (PCA) on our dataset. PCA projects the data onto the directions along which it varies the most, so running it with 2 and 3 components lets us visualize our clustering results in both 2 and 3 dimensions.

In [13]:
pca_2 = PCA(n_components = 2)
clean_pca_2 = pca_2.fit_transform(clean_scaled_X)

pca_3 = PCA(n_components = 3)
clean_pca_3 = pca_3.fit_transform(clean_scaled_X)
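
As a rough illustration of what a principal component is, the sketch below finds the first one for a small 2-D toy dataset using power iteration on the covariance matrix. This is a teaching sketch under stated assumptions: `first_principal_component` is a hypothetical helper, and scikit-learn's `PCA` uses an SVD-based solver rather than this method.

```python
# Illustrative sketch: first principal component via power iteration.
# `first_principal_component` is a hypothetical helper, not scikit-learn's code.
import math

def first_principal_component(rows, iterations=100):
    """Return the unit vector along which the centered `rows` vary the most."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix of the centered data
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    vector = [1.0] * d  # arbitrary non-zero starting direction
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize to unit length
        product = [sum(cov[i][j] * vector[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(value * value for value in product))
        vector = [value / norm for value in product]
    return vector

# Toy data lying close to the line y = 2x
toy = [(-2.0, -4.1), (-1.0, -1.9), (0.0, 0.1), (1.0, 2.0), (2.0, 3.9)]
direction = first_principal_component(toy)
print([round(component, 3) for component in direction])
```

The recovered direction points along the line the toy data follows. Projecting each row onto the first two or three such directions gives the coordinates used for the 2-D and 3-D plots.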

After running PCA with 2 and 3 components, we're also going to run PCA with 30, 50, and 100 components to reduce the dimensionality of our dataset, with the aim of reducing the noise passed into our KMeans model.

In [14]:
pca_30 = PCA(n_components = 30)
clean_pca_30 = pca_30.fit_transform(clean_scaled_X)

pca_50 = PCA(n_components = 50)
clean_pca_50 = pca_50.fit_transform(clean_scaled_X)

pca_100 = PCA(n_components = 100)
clean_pca_100 = pca_100.fit_transform(clean_scaled_X)

In [15]:
print(clean_pca_2.shape)
print(clean_pca_3.shape)
print(clean_pca_30.shape)
print(clean_pca_50.shape)
print(clean_pca_100.shape)
print(clean_scaled_X.shape)

(163745, 2)
(163745, 3)
(163745, 30)
(163745, 50)
(163745, 100)
(163745, 2068)


## KMeans - Clustering Functions

Create a list to track the results of each combination of cluster count and feature dimensionality.

In [16]:
kMeansDataTracker = []

In [17]:
def kMeansExecution(input_data, num_clusters, num_features):
  ###----- set up arrays for averages -----###
  stdDeviations = []
  counts = []

  ###----- set up the KMeans model with num_clusters -----###
  kMeansModel = KMeans(n_clusters = num_clusters, random_state = 42)

  ###----- fit the KMeans model to the input data -----###
  kMeansModel.fit(input_data)

  ###----- get the cluster labels -----###
  cluster_labels = kMeansModel.labels_

  ###----- append the cluster labels to the clean_data dataset -----###
  clean_data[f'Features_{num_features}_Cluster_n_{num_clusters}'] = cluster_labels

  ###----- print the cluster affinities evaluation for each cluster -----###
  for cluster_group in range(num_clusters):
    cluster_affinities = clean_y[cluster_labels == cluster_group]

    stdDeviations.append(np.std(cluster_affinities))
    counts.append(len(cluster_affinities))

    print(f"Cluster group: {cluster_group}: " +
        f"Mean Affinity = {np.mean(cluster_affinities):.4f}, " +
        f"Std. Dev. = {np.std(cluster_affinities):.4f}, " +
        f"Count = {len(cluster_affinities)}")

  print(f"Average standard deviation: {np.mean(stdDeviations):.4f}")
  print(f"Weighted Average standard deviation: {np.average(stdDeviations, weights=counts):.4f}")
  kMeansDataEntry = [num_features, num_clusters, np.round(np.average(stdDeviations, weights=counts), decimals=4), np.round(np.mean(stdDeviations), decimals=4)]
  kMeansDataTracker.append(kMeansDataEntry)
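
The function above reports both a plain and a count-weighted average of the per-cluster standard deviations. The sketch below shows how the two can differ, using made-up affinity values and only the standard library. `statistics.pstdev` is the population standard deviation, which matches the default behavior of `np.std`.

```python
# Illustrative sketch: unweighted vs. count-weighted average of per-cluster
# standard deviations. The affinity values below are made up.
from statistics import fmean, pstdev

toy_clusters = {
    0: [4.0, 4.2, 4.1, 3.9, 4.3, 4.0, 4.1, 4.2],  # large, tight cluster
    1: [3.0, 7.0],                                 # small, spread-out cluster
}

# Population standard deviation and size of each cluster
std_devs = [pstdev(values) for values in toy_clusters.values()]
counts = [len(values) for values in toy_clusters.values()]

unweighted = fmean(std_devs)
weighted = sum(s * n for s, n in zip(std_devs, counts)) / sum(counts)

print(f"Average standard deviation: {unweighted:.4f}")
print(f"Weighted average standard deviation: {weighted:.4f}")
```

The weighted figure reflects the spread a typical data point sees, because large clusters count for more. Note also that a cluster with a single member has a standard deviation of 0, which can pull the unweighted average down when the number of clusters is large.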


In [31]:
def kMeans2dGraph(clean_data_clus_label_col_str, num_clusters, num_features):
  plt.figure(figsize = (12,8))
  custom_palette = ['#F0F8FF', '#FAEBD7', '#00FFFF', '#7FFFD4', '#F0FFFF', '#F5F5DC',
                    '#FFE4C4', '#FFEBCD', '#0000FF', '#8A2BE2', '#A52A2A', '#DEB887',
                    '#5F9EA0', '#7FFF00', '#D2691E', '#FF7F50', '#6495ED', '#FFF8DC',
                    '#DC143C', '#00FFFF', '#00008B', '#008B8B', '#B8860B', '#A9A9A9',
                    '#006400', '#A9A9A9', '#BDB76B', '#8B008B', '#556B2F', '#FF8C00',
                    '#9932CC', '#8B0000', '#E9967A', '#8FBC8F', '#483D8B', '#2F4F4F',
                    '#2F4F4F', '#00CED1', '#9400D3', '#FF1493', '#00BFFF', '#696969',
                    '#696969', '#1E90FF', '#B22222', '#FFFAF0', '#228B22', '#FF00FF',
                    '#DCDCDC', '#F8F8FF', '#FFD700', '#DAA520', '#808080', '#008000',
                    '#ADFF2F', '#808080', '#F0FFF0', '#FF69B4', '#CD5C5C', '#4B0082',
                    '#FFFFF0', '#F0E68C', '#E6E6FA', '#FFF0F5', '#7CFC00', '#FFFACD',
                    '#ADD8E6', '#F08080', '#E0FFFF', '#FAFAD2', '#D3D3D3', '#90EE90',
                    '#D3D3D3', '#FFB6C1', '#FFA07A', '#20B2AA', '#87CEFA', '#778899',
                    '#778899', '#B0C4DE', '#FFFFE0', '#00FF00', '#32CD32', '#FAF0E6',
                    '#FF00FF', '#800000', '#66CDAA', '#0000CD', '#BA55D3', '#9370DB',
                    '#3CB371', '#7B68EE', '#00FA9A', '#48D1CC', '#C71585', '#191970',
                    '#F5FFFA', '#FFE4E1', '#FFE4B5', '#FFDEAD', '#000080', '#FDF5E6',
                    '#808000', '#6B8E23', '#FFA500', '#FF4500', '#DA70D6', '#EEE8AA',
                    '#98FB98', '#AFEEEE', '#DB7093', '#FFEFD5', '#FFDAB9', '#CD853F',
                    '#FFC0CB', '#DDA0DD', '#B0E0E6', '#800080', '#663399', '#FF0000',
                    '#BC8F8F', '#4169E1', '#8B4513', '#FA8072', '#F4A460', '#2E8B57',
                    '#FFF5EE', '#A0522D', '#C0C0C0', '#87CEEB', '#6A5ACD', '#708090',
                    '#708090', '#FFFAFA', '#00FF7F', '#4682B4', '#D2B48C', '#008080',
                    '#D8BFD8', '#FF6347', '#40E0D0', '#EE82EE', '#F5DEB3', '#F5F5F5',
                    '#FFFF00', '#9ACD32', '#6f331e', '#0b6b70', '#b63092', '#1c4b54',
                    '#755a8c', '#1574eb', '#e086ff', '#dac096', '#311d8a', '#8aba24',
                    '#eb0e11', '#b319fe', '#729921', '#641cb5', '#2f126d', '#eaee48',
                    '#c7affb', '#330a8f', '#f03962', '#77235a', '#485996', '#fb8ece',
                    '#942b9b', '#b054d4', '#fc4838', '#2ceac3', '#0666e6', '#25257e',
                    '#f290e7', '#e41e9f', '#694fd0', '#225b51', '#8f86d7', '#a58565',
                    '#06b4d8', '#546846', '#7648b2', '#7ad8ca', '#7e9e5d', '#bc48ea',
                    '#36bea5', '#293d93', '#9eddd5', '#d50ec3', '#b24f92', '#71dc2d',
                    '#750f6d', '#e5280a', '#99e7b1', '#13ee0f', '#41f661', '#17679d',
                    '#f92c51', '#09e486', '#f99821', '#a5d3f5', '#04b043', '#b77fa4',
                    '#3fa354', '#1330a5', '#5a3f91', '#e2624e', '#1cdcd0', '#ce49d6',
                    '#377222', '#1fd7b2', '#8ba898', '#af8210', '#650297', '#76690e',
                    '#4df257', '#62bd5f', '#a3de16', '#a2a11b', '#6e5b37', '#a5c896',
                    '#f0d335', '#5c4588', '#150614', '#ef50b6', '#e80f95', '#bf748a',
                    '#2ccb4c', '#c2d7bc', '#7adfc1', '#e676fc', '#03cb36', '#01793c',
                    '#96382e', '#fc3733', '#a4d8ab', '#acade5', '#10235a', '#d33416',
                    '#b9dd6e', '#f87548', '#470a80', '#2047be', '#3adf64', '#d646cc',
                    '#bcfc4a', '#2f002c', '#5b361f', '#9980fb', '#b65878', '#452fb1',
                    '#48b273', '#d8e552', '#ec7d54', '#534174', '#46e528', '#bb61fb',
                    '#917094', '#d44a6e', '#dd58c8', '#5f20a7', '#c0711a', '#6e3b91',
                    '#1b4590', '#473c86', '#4ef44b', '#c9b45f', '#ffea78', '#7356f5',
                    '#19db0a', '#6157e1', '#5c7b33', '#008f80', '#ef6b39', '#11305e',
                    '#290184', '#b2a385', '#20bf4a', '#ceea38', '#beb4f6', '#c684b9',
                    '#18f4ee', '#4d5d70', '#021367', '#59ca8e', '#c01b33', '#7d14c3',
                    '#655f27', '#f870c5', '#06972a', '#44e104', '#d51abb', '#dd20de',
                    '#225fdd', '#34e88e', '#b426d6', '#1fba0d', '#b5e3b2', '#ce0b6b',
                    '#bbc967', '#fdbad4', '#e8c961', '#ed9793', '#acf797', '#a1acb8',
                    '#6146a3', '#0b2718', '#d80e89', '#39f756', '#3f5869', '#3aae2b',
                    '#89d25e', '#ad0685', '#03d219', '#6425b1', '#867fbe', '#af4210',
                    '#18bfa4', '#04d3b3', '#b5dec2', '#2ecf35', '#287d59', '#0577dd',
                    '#81ce7d', '#75f1d3', '#c7d5a2', '#b8f982', '#7dafe2', '#ccb745',
                    '#494d51', '#04a8d5', '#1a5275', '#6abeef', '#629a48', '#fe55aa',
                    '#825fd2', '#63eddc', '#cd74e6', '#f21d7e', '#0dec1a', '#3ef0ff',
                    '#73ef23', '#fb4f11', '#ce1da8', '#4ee074', '#78d607', '#d10c98',
                    '#ec2797', '#8ac81f', '#7a009b', '#192bf3', '#194184', '#1c2202',
                    '#2e1c98', '#0f0c47', '#383aea', '#73a3d2', '#4a847b', '#e102b3',
                    '#797a97', '#50ebba', '#40f418', '#85d7d5', '#de4064', '#37de77',
                    '#3d46a9', '#ea766b', '#5dabec', '#fc6d40', '#362358', '#b6e180',
                    '#b25125', '#173ce1', '#af446a', '#e573fa', '#53dbbd', '#2f3a87',
                    '#23d450', '#751460', '#ede802', '#ef6bdb', '#c5b291', '#185703',
                    '#509eef', '#669537', '#b8be6e', '#ce646d', '#a82053', '#ef86b9',
                    '#cad4b1', '#847ac7', '#0a7ca9', '#1c2013', '#5cdc4b', '#ee8c7d',
                    '#c7a6b1', '#f34b3d', '#a35604', '#56b641', '#a87e46', '#8ed5e9',
                    '#264c1a', '#2d7d27', '#5d89f0', '#4fb981', '#8502d1', '#a69158',
                    '#a9aa07', '#4cd2f8', '#372b0b', '#a01777', '#53a063', '#978e69',
                    '#2fdfd9', '#5589ed', '#acdf38', '#126280', '#d1c2ba', '#c2e627',
                    '#53f3bd', '#fe9703', '#22e120', '#8902ac', '#2f3173', '#401343',
                    '#121eaf', '#7e276b', '#65ec76', '#b1e22e', '#a9cf9d', '#dc1ad8',
                    '#42ddc9', '#00dca6', '#e12b1f', '#06ed7e', '#a2b984', '#b948fd',
                    '#fb3d51', '#97985d', '#dd93aa', '#182f67', '#c4f792', '#e09c24',
                    '#e65b87', '#9c032a', '#8426a1', '#1e6883', '#8d1f25', '#f88374',
                    '#5c787c', '#57abdf', '#167259', '#a0ea6d', '#3a3924', '#bd9a86',
                    '#0ef14f', '#cd4136', '#5637e2', '#d06d50', '#d4ce5d', '#516c7d',
                    '#883f3e', '#ce21e3', '#6a8c31', '#fb0aa6', '#3af313', '#8d695c',
                    '#34a211', '#cb01a9', '#850dc0', '#07d367', '#931eb9', '#645a40',
                    '#06b981', '#2b9d91', '#9f3183', '#f38eed', '#6a5eb0', '#f95004',
                    '#1b9e1d', '#68808d', '#ff765a', '#6ab410', '#8769ba', '#ff0611',
                    '#2647c2', '#0949b1', '#3384b4', '#1886b6', '#030e2e', '#8a093b',
                    '#d98670', '#6c05ce', '#e19fb6', '#041886', '#d8acca', '#a74953',
                    '#f99632', '#91c435']

  if num_clusters <= 500:
    sample_palette = random.sample(custom_palette, num_clusters)
  else:
    sample_palette = custom_palette

  sns.scatterplot(x = clean_pca_2[:,0], y = clean_pca_2[:,1], hue = clean_data[clean_data_clus_label_col_str], palette = sample_palette)

  plt.title(f"KMeans Clustering Results with {num_features} Features (n = {num_clusters})")
  plt.xlabel("Principal Component 1")
  plt.ylabel("Principal Component 2")

  legendColumnNum = 15

  plt.legend(title = 'Cluster group', loc = 'upper center', bbox_to_anchor=(0.5, -0.2), borderaxespad = 0., ncol=legendColumnNum)

  plt.show()
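
One possible alternative to a hard-coded color list is to generate colors on demand, which avoids running out when the number of clusters exceeds the length of the list. The sketch below spaces hues evenly around the color wheel using the standard library's `colorsys` module. `make_palette` is a hypothetical helper and is not part of the notebook.

```python
# Illustrative sketch: generating a palette of any requested size.
# `make_palette` is a hypothetical helper, not part of the notebook.
import colorsys

def make_palette(num_colors, saturation=0.75, value=0.9):
    """Return `num_colors` hex colors with hues spaced evenly around the wheel."""
    palette = []
    for i in range(num_colors):
        hue = i / num_colors  # evenly spaced in [0, 1)
        r, g, b = colorsys.hsv_to_rgb(hue, saturation, value)
        palette.append(f"#{round(r * 255):02x}{round(g * 255):02x}{round(b * 255):02x}")
    return palette

palette = make_palette(100)
print(len(palette), palette[:3])
```

With hundreds of clusters, neighboring hues become hard to tell apart by eye, so a plot like this is better suited to showing overall structure than to identifying individual clusters.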

In [19]:
def kMeans3dGraph(clean_data_clus_label_col_str, num_clusters, num_features):
  fig = px.scatter_3d(
    x = clean_pca_3[:,0],
    y = clean_pca_3[:,1],
    z = clean_pca_3[:,2],
    color = clean_data[clean_data_clus_label_col_str],
    labels = {'color': 'Cluster'},
    title = f"KMeans Clustering Results with {num_features} Features (n = {num_clusters})",
    width=1100,
    height=800
  )
  fig.update_traces(marker=dict(size=3))

  fig.show()

## KMeans - Full Features
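
The cells in this section follow the same pattern and differ only in the number of clusters. As a sketch of how that pattern could be driven from a single loop, the helpers below build the column name for each run and invoke a callback. `cluster_label_key` and `run_sweep` are hypothetical names introduced for illustration, and the callback here only prints what would be run.

```python
# Illustrative sketch: driving the repeated cells from a single loop.
# `cluster_label_key` and `run_sweep` are hypothetical helpers.
def cluster_label_key(num_features, num_clusters):
    """Build the column name used for one (feature count, cluster count) run."""
    return f"Features_{num_features}_Cluster_n_{num_clusters}"

def run_sweep(cluster_counts, num_features, run_one):
    """Call `run_one(num_clusters, label_key)` for each count; return the keys."""
    keys = []
    for num_clusters in cluster_counts:
        key = cluster_label_key(num_features, num_clusters)
        run_one(num_clusters, key)
        keys.append(key)
    return keys

# Stand-in callback that only reports what would be run
keys = run_sweep(
    cluster_counts=[5, 10, 15, 20, 40, 80, 200, 500, 1000, 1500],
    num_features=2068,
    run_one=lambda k, key: print(f"n_clusters={k:>4} -> column '{key}'"),
)
```

In the notebook, the callback would instead call `kMeansExecution`, `kMeans2dGraph`, and `kMeans3dGraph` with the chosen input matrix, as each cell below does by hand.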

In [23]:
### Number of clusters = 5
clusterNumber = 5
clusterLabelKey = f'Features_{clean_scaled_X.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_scaled_X, num_clusters = clusterNumber,
                num_features = clean_scaled_X.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_scaled_X.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_scaled_X.shape[1])
print("")

Output hidden; open in https://colab.research.google.com to view.

In [24]:
### Number of clusters = 10
clusterNumber = 10
clusterLabelKey = f'Features_{clean_scaled_X.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_scaled_X, num_clusters = clusterNumber,
                num_features = clean_scaled_X.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_scaled_X.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_scaled_X.shape[1])
print("")


In [25]:
### Number of clusters = 15
clusterNumber = 15
clusterLabelKey = f'Features_{clean_scaled_X.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_scaled_X, num_clusters = clusterNumber,
                num_features = clean_scaled_X.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_scaled_X.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_scaled_X.shape[1])
print("")


In [26]:
### Number of clusters = 20
clusterNumber = 20
clusterLabelKey = f'Features_{clean_scaled_X.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_scaled_X, num_clusters = clusterNumber,
                num_features = clean_scaled_X.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_scaled_X.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_scaled_X.shape[1])
print("")


In [27]:
### Number of clusters = 40
clusterNumber = 40
clusterLabelKey = f'Features_{clean_scaled_X.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_scaled_X, num_clusters = clusterNumber,
                num_features = clean_scaled_X.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_scaled_X.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_scaled_X.shape[1])
print("")


In [28]:
### Number of clusters = 80
clusterNumber = 80
clusterLabelKey = f'Features_{clean_scaled_X.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_scaled_X, num_clusters = clusterNumber,
                num_features = clean_scaled_X.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_scaled_X.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_scaled_X.shape[1])
print("")


In [29]:
### Number of clusters = 200
clusterNumber = 200
clusterLabelKey = f'Features_{clean_scaled_X.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_scaled_X, num_clusters = clusterNumber,
                num_features = clean_scaled_X.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_scaled_X.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_scaled_X.shape[1])
print("")


In [30]:
### Number of clusters = 500
clusterNumber = 500
clusterLabelKey = f'Features_{clean_scaled_X.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_scaled_X, num_clusters = clusterNumber,
                num_features = clean_scaled_X.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_scaled_X.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_scaled_X.shape[1])
print("")


In [32]:
### Number of clusters = 1000
clusterNumber = 1000
clusterLabelKey = f'Features_{clean_scaled_X.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_scaled_X, num_clusters = clusterNumber,
                num_features = clean_scaled_X.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_scaled_X.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_scaled_X.shape[1])
print("")


In [33]:
### Number of clusters = 1500
clusterNumber = 1500
clusterLabelKey = f'Features_{clean_scaled_X.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_scaled_X, num_clusters = clusterNumber,
                num_features = clean_scaled_X.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_scaled_X.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_scaled_X.shape[1])
print("")


## KMeans - PCA applied, 30 Features

In [34]:
### Number of clusters = 5
clusterNumber = 5
clusterLabelKey = f'Features_{clean_pca_30.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_30, num_clusters = clusterNumber,
                num_features = clean_pca_30.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_30.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_30.shape[1])
print("")


In [35]:
### Number of clusters = 10
clusterNumber = 10
clusterLabelKey = f'Features_{clean_pca_30.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_30, num_clusters = clusterNumber,
                num_features = clean_pca_30.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_30.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_30.shape[1])
print("")


In [36]:
### Number of clusters = 15
clusterNumber = 15
clusterLabelKey = f'Features_{clean_pca_30.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_30, num_clusters = clusterNumber,
                num_features = clean_pca_30.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_30.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_30.shape[1])
print("")


In [37]:
### Number of clusters = 20
clusterNumber = 20
clusterLabelKey = f'Features_{clean_pca_30.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_30, num_clusters = clusterNumber,
                num_features = clean_pca_30.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_30.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_30.shape[1])
print("")


In [38]:
### Number of clusters = 40
clusterNumber = 40
clusterLabelKey = f'Features_{clean_pca_30.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_30, num_clusters = clusterNumber,
                num_features = clean_pca_30.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_30.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_30.shape[1])
print("")


In [39]:
### Number of clusters = 80
clusterNumber = 80
clusterLabelKey = f'Features_{clean_pca_30.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_30, num_clusters = clusterNumber,
                num_features = clean_pca_30.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_30.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_30.shape[1])
print("")

In [40]:
### Number of clusters = 200
clusterNumber = 200
clusterLabelKey = f'Features_{clean_pca_30.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_30, num_clusters = clusterNumber,
                num_features = clean_pca_30.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_30.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_30.shape[1])
print("")

In [41]:
### Number of clusters = 500
clusterNumber = 500
clusterLabelKey = f'Features_{clean_pca_30.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_30, num_clusters = clusterNumber,
                num_features = clean_pca_30.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_30.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_30.shape[1])
print("")

In [42]:
### Number of clusters = 1000
clusterNumber = 1000
clusterLabelKey = f'Features_{clean_pca_30.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_30, num_clusters = clusterNumber,
                num_features = clean_pca_30.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_30.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_30.shape[1])
print("")

In [43]:
### Number of clusters = 1500
clusterNumber = 1500
clusterLabelKey = f'Features_{clean_pca_30.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_30, num_clusters = clusterNumber,
                num_features = clean_pca_30.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_30.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_30.shape[1])
print("")

## KMeans - PCA applied, 50 Features
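
The cells in each of these sections repeat the same three calls once per cluster count, changing only `clusterNumber`. As a minimal sketch, the same sweep could be driven from a loop. The names `CLUSTER_COUNTS`, `cluster_label_key`, and `run_sweep` are hypothetical and do not appear elsewhere in this notebook; only the label-key format is copied from the cells.

```python
from typing import Callable, Iterable, List

# Cluster counts used by the cells in this notebook.
CLUSTER_COUNTS = [5, 10, 15, 20, 40, 80, 200, 500, 1000, 1500]


def cluster_label_key(num_features: int, num_clusters: int) -> str:
    """Build the label-column name in the same format the cells use."""
    return f"Features_{num_features}_Cluster_n_{num_clusters}"


def run_sweep(num_features: int,
              cluster_counts: Iterable[int],
              steps: Iterable[Callable[[str, int, int], None]]) -> List[str]:
    """Call every step once per cluster count and return the label keys used.

    Each step is a hypothetical callable taking
    (label_key, num_clusters, num_features).
    """
    steps = list(steps)  # allow the steps to be reused for every cluster count
    keys = []
    for num_clusters in cluster_counts:
        key = cluster_label_key(num_features, num_clusters)
        for step in steps:
            step(key, num_clusters, num_features)
        keys.append(key)
    return keys
```

In the notebook, each step would be a small wrapper around `kMeansExecution`, `kMeans2dGraph`, or `kMeans3dGraph` that passes the same keyword arguments as the cells here. A loop of this kind also avoids running the same configuration twice by accident.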

In [44]:
### Number of clusters = 5
clusterNumber = 5
clusterLabelKey = f'Features_{clean_pca_50.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_50, num_clusters = clusterNumber,
                num_features = clean_pca_50.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_50.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_50.shape[1])
print("")

In [45]:
### Number of clusters = 10
clusterNumber = 10
clusterLabelKey = f'Features_{clean_pca_50.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_50, num_clusters = clusterNumber,
                num_features = clean_pca_50.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_50.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_50.shape[1])
print("")

In [60]:
### Number of clusters = 15
clusterNumber = 15
clusterLabelKey = f'Features_{clean_pca_50.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_50, num_clusters = clusterNumber,
                num_features = clean_pca_50.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_50.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_50.shape[1])
print("")

In [47]:
### Number of clusters = 20
clusterNumber = 20
clusterLabelKey = f'Features_{clean_pca_50.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_50, num_clusters = clusterNumber,
                num_features = clean_pca_50.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_50.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_50.shape[1])
print("")

In [49]:
### Number of clusters = 40
clusterNumber = 40
clusterLabelKey = f'Features_{clean_pca_50.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_50, num_clusters = clusterNumber,
                num_features = clean_pca_50.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_50.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_50.shape[1])
print("")

In [50]:
### Number of clusters = 80
clusterNumber = 80
clusterLabelKey = f'Features_{clean_pca_50.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_50, num_clusters = clusterNumber,
                num_features = clean_pca_50.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_50.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_50.shape[1])
print("")

In [52]:
### Number of clusters = 200
clusterNumber = 200
clusterLabelKey = f'Features_{clean_pca_50.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_50, num_clusters = clusterNumber,
                num_features = clean_pca_50.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_50.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_50.shape[1])
print("")

In [53]:
### Number of clusters = 500
clusterNumber = 500
clusterLabelKey = f'Features_{clean_pca_50.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_50, num_clusters = clusterNumber,
                num_features = clean_pca_50.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_50.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_50.shape[1])
print("")

In [54]:
### Number of clusters = 1000
clusterNumber = 1000
clusterLabelKey = f'Features_{clean_pca_50.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_50, num_clusters = clusterNumber,
                num_features = clean_pca_50.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_50.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_50.shape[1])
print("")

In [55]:
### Number of clusters = 1500
clusterNumber = 1500
clusterLabelKey = f'Features_{clean_pca_50.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_50, num_clusters = clusterNumber,
                num_features = clean_pca_50.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_50.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_50.shape[1])
print("")

## KMeans - PCA applied, 100 Features

In [56]:
### Number of clusters = 5
clusterNumber = 5
clusterLabelKey = f'Features_{clean_pca_100.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_100, num_clusters = clusterNumber,
                num_features = clean_pca_100.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_100.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_100.shape[1])
print("")

In [57]:
### Number of clusters = 10
clusterNumber = 10
clusterLabelKey = f'Features_{clean_pca_100.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_100, num_clusters = clusterNumber,
                num_features = clean_pca_100.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_100.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_100.shape[1])
print("")

In [58]:
### Number of clusters = 15
clusterNumber = 15
clusterLabelKey = f'Features_{clean_pca_100.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_100, num_clusters = clusterNumber,
                num_features = clean_pca_100.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_100.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_100.shape[1])
print("")

In [59]:
### Number of clusters = 20
clusterNumber = 20
clusterLabelKey = f'Features_{clean_pca_100.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_100, num_clusters = clusterNumber,
                num_features = clean_pca_100.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_100.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_100.shape[1])
print("")

In [61]:
### Number of clusters = 40
clusterNumber = 40
clusterLabelKey = f'Features_{clean_pca_100.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_100, num_clusters = clusterNumber,
                num_features = clean_pca_100.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_100.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_100.shape[1])
print("")

In [62]:
### Number of clusters = 80
clusterNumber = 80
clusterLabelKey = f'Features_{clean_pca_100.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_100, num_clusters = clusterNumber,
                num_features = clean_pca_100.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_100.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_100.shape[1])
print("")

In [63]:
### Number of clusters = 200
clusterNumber = 200
clusterLabelKey = f'Features_{clean_pca_100.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_100, num_clusters = clusterNumber,
                num_features = clean_pca_100.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_100.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_100.shape[1])
print("")

In [64]:
### Number of clusters = 500
clusterNumber = 500
clusterLabelKey = f'Features_{clean_pca_100.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_100, num_clusters = clusterNumber,
                num_features = clean_pca_100.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_100.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_100.shape[1])
print("")

In [65]:
### Number of clusters = 1000
clusterNumber = 1000
clusterLabelKey = f'Features_{clean_pca_100.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_100, num_clusters = clusterNumber,
                num_features = clean_pca_100.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_100.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_100.shape[1])
print("")

In [66]:
### Number of clusters = 1500
clusterNumber = 1500
clusterLabelKey = f'Features_{clean_pca_100.shape[1]}_Cluster_n_{clusterNumber}'

kMeansExecution(input_data = clean_pca_100, num_clusters = clusterNumber,
                num_features = clean_pca_100.shape[1])
print("")

kMeans2dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_100.shape[1])
print("")

kMeans3dGraph(clean_data_clus_label_col_str = clusterLabelKey,
              num_clusters = clusterNumber, num_features = clean_pca_100.shape[1])
print("")

# Results
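
The summary table reports a 'Weighted Avg. Std. Dev.' and an 'Avg. Std. Dev.' for each run. This section does not show how `kMeansExecution` computes them, so the sketch below is only one plausible reading. It assumes each figure is the standard deviation of a per-row value (for example the `affinity` column) within each cluster, averaged across clusters either weighted by cluster size or unweighted. The function name `cluster_std_summary` is hypothetical, and the choice of population rather than sample standard deviation is also an assumption.

```python
from collections import defaultdict
from statistics import pstdev


def cluster_std_summary(labels, values):
    """Return (weighted_avg_std, avg_std) of `values` grouped by `labels`.

    Assumption: spread is measured as the population standard deviation
    of the values in each cluster; the notebook's own metric may differ.
    """
    groups = defaultdict(list)
    for label, value in zip(labels, values):
        groups[label].append(value)

    # Per-cluster spread; a single-member cluster has a spread of 0.0.
    stds = {label: pstdev(members) for label, members in groups.items()}

    total_rows = sum(len(members) for members in groups.values())
    # Weight each cluster's spread by its share of the rows.
    weighted_avg = sum(stds[label] * len(groups[label]) for label in groups) / total_rows
    # Treat every cluster equally, regardless of size.
    plain_avg = sum(stds.values()) / len(stds)
    return weighted_avg, plain_avg
```

Under this reading, a lower value means the rows grouped into a cluster have more similar values. Very large cluster counts lower both averages almost by construction, since small clusters have little room to vary, so the two columns are best compared across runs with similar cluster counts.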

In [68]:
### Summary of results:

resultsDF = pd.DataFrame(kMeansDataTracker,
                         columns=['Num. of Features',
                                  'Num. of Clusters',
                                  'Weighted Avg. Std. Dev.',
                                  'Avg. Std. Dev.'
                                  ]
                         )

print(resultsDF)

    Num. of Features  Num. of Clusters  Weighted Avg. Std. Dev.  \
0               2068                 5                   1.5913   
1               2068                10                   1.5148   
2               2068                15                   1.5099   
3               2068                 5                   1.5913   
4               2068                10                   1.5148   
5               2068                15                   1.5099   
6               2068                20                   1.4669   
7               2068                40                   1.4876   
8               2068                80                   1.4777   
9               2068               200                   1.3758   
10              2068               500                   1.2995   
11              2068              1000                   1.2295   
12              2068              1500                   1.1690   
13                30                 5                   1.508