# K-Means
### NOTE: This version is for CSV file imports!

---

Created By: Xavier De Carvalho  
Created On: 06/07/2021  
Updated By: N/A  
Updated On: N/A  
Version: km0.0.01

### Requirements

---

##### Required Data Format
- File Type: CSV
- File Shape: (n) Columns, (n) Rows

##### Required Python Packages
- NumPy
- Matplotlib
    - PyPlot
    - ListedColormap
- Pandas
- scikit-learn
    - KMeans

### Description

---

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.

The K-means algorithm identifies K centroids and allocates every data point to the nearest one, while keeping the within-cluster sum of squares (WCSS), the total squared distance between each point and its centroid, as small as possible. In simple terms, it identifies the clusters that exist in your data on your behalf.
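
As a rough aid, the quantity K-means tries to keep small can be written out. The notation here is introduced only for illustration: $C_j$ is the set of points assigned to cluster $j$ and $\mu_j$ is that cluster's centroid.

$$
\mathrm{WCSS} = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
$$

scikit-learn reports this value on a fitted `KMeans` model as `inertia_`, which is what the elbow method later in this notebook plots.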

K-means clustering is an extensively used technique for data cluster analysis.

### Steps

---

- **Step 1** Choose the number K of clusters
- **Step 2** Select K points at random as the initial centroids (not necessarily from your dataset)
- **Step 3** Assign each data point to the closest centroid (That forms K clusters)
- **Step 4** Compute and place the new centroid of each cluster
- **Step 5** Reassign each data point to the new closest centroid
    - If any reassignment took place, go back to Step 4; otherwise stop, the model is ready.
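
The steps above can be sketched in plain NumPy. This is only an illustrative sketch, not how scikit-learn implements `KMeans`: the function name `kmeans_sketch`, its parameters, and the toy data are all made up for this example, and it assumes the initial centroids are drawn from the dataset itself (Step 2 allows other choices).

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=42):
    """Illustrative K-means loop (hypothetical helper, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick K distinct data points as the initial centroids (an assumption)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Steps 3 and 5: assign each point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop once no point has changed cluster
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of the points assigned to it
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Toy data: two well-separated groups (values invented for illustration)
X_demo = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]])
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)
print(labels_demo)
print(centroids_demo)
```

Step 1 (choosing K) is left to the caller here; the elbow method later in this notebook is one way to pick it.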

### Install Dependencies If Needed

---

NOTE: This might not be required if you're running your notebook instance in the cloud! 

Delete the cell below if this is the case...

In [None]:
# Import the sys dependency
import sys
# Install dependencies
!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install matplotlib
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install scikit-learn

### Import Packages

---

In [None]:
# Import packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
# Confirm packages have been imported
print("Packages imported!")

### Import Dataset

---

In [None]:
# Import data from CSV
dataset = pd.read_csv('YOUR_CSV')
X = dataset.iloc[:,[3,4]].values # Tweak this as required and use only the columns you need to identify clusters
# Confirm data has been imported
print('Data has been imported from CSV!')

### Use elbow method to find optimal number of clusters

---

In [None]:
# Build the elbow curve using the Within-Cluster Sum of Squares (WCSS)
wcss = []
for i in range(1,11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42) # random_state can be tweaked as required
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
# Plot the elbow
plt.plot(range(1,11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

### Train K-Means model

---

In [None]:
# Train the model
kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 42) # Tweak n_clusters and random_state as required
y_pred = kmeans.fit_predict(X)
# Confirm model has been trained
print('Model has been trained!')

In [None]:
print(y_pred)

### Visualise clusters

---

In [None]:
# Visualise the clusters - Tweak this to match the number of clusters you have created
#   For reference:
#       The x and y values of each cluster are selected as X[rows assigned to that cluster, column]
# Cluster 1
plt.scatter(X[y_pred == 0, 0], X[y_pred == 0, 1], s = 100, c = 'red', label = 'Cluster 1') 
# Cluster 2
plt.scatter(X[y_pred == 1, 0], X[y_pred == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
# Cluster 3
plt.scatter(X[y_pred == 2, 0], X[y_pred == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
# Cluster 4
plt.scatter(X[y_pred == 3, 0], X[y_pred == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
# Cluster 5
plt.scatter(X[y_pred == 4, 0], X[y_pred == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
# Plot centroids
#   For reference:
#       X = [All Rows, Column 0], Y = [All Rows, Column 1]
#       s can be tweaked as required
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')
# Format visualisation
# Replace the text marked with '@' with your own text.
# Don't forget to remove the '@' character!
plt.title('@YOUR_TITLE (Training Set)')
plt.xlabel('@YOUR_X_AXIS_NAME') # e.g. Propensity
plt.ylabel('@YOUR_Y_AXIS_NAME') # e.g. Income
plt.legend()
plt.show()