# How to Perform K-Means Clustering in Python
Here, you’ll take a step-by-step tour of the conventional version of the k-means algorithm. Understanding the details of the algorithm is a fundamental step in the process of writing your k-means clustering model in Python. 

# Understanding the K-Means Algorithm
K-means requires only a few steps. The first step is to randomly select k centroids, where k is equal to the number of clusters you choose. Centroids are data points representing the center of a cluster.

The core of the algorithm is a two-step process called **expectation-maximization**. 
<br>
The **expectation** step assigns each data point to its nearest centroid. 
<br>
Then, the **maximization** step computes the mean of all the points in each cluster and sets it as the new centroid. The two steps repeat until the centroids stop moving or a maximum number of iterations is reached.
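
The two steps can be sketched in a few lines of NumPy. This is a minimal illustration rather than the scikit-learn implementation: the function name kmeans_sketch, its parameters, and the demo data are hypothetical, and leaving an empty cluster's centroid where it was is a simplifying assumption.

```python
import numpy as np


def kmeans_sketch(points, k, max_iter=300, seed=0):
    """Illustrative k-means: random initialization, then alternate the two steps."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Expectation: assign each point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Maximization: move each centroid to the mean of its assigned points.
        # An empty cluster keeps its previous centroid (simplifying assumption).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Converged once the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Two well-separated synthetic blobs, used only to exercise the sketch.
demo_rng = np.random.default_rng(1)
demo_points = np.vstack([
    demo_rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    demo_rng.normal(loc=(5.0, 5.0), scale=0.5, size=(50, 2)),
])
demo_centroids, demo_labels = kmeans_sketch(demo_points, k=2)
print(demo_centroids)
```

On data this cleanly separated, each blob should end up in its own cluster, with one centroid near each blob's center.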

# Clustering Evaluation
The quality of the cluster assignments is determined by computing the **sum of the squared error (SSE)** after the centroids converge, that is, once the assignments no longer change from the previous iteration. 
<br>
The SSE is defined as the sum of the squared Euclidean distances of each point to its closest centroid. 
<br>
Since this is a measure of error, the **objective of k-means is to try to minimize this value**.
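
As a rough check of this definition, the SSE can be computed by hand and compared with the value scikit-learn reports in the inertia_ attribute. The data and the variable names sse_X, sse_model, and manual_sse below are assumptions made for the sketch, and random_state is fixed only to make it repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data for the sketch only.
sse_X, _ = make_blobs(n_samples=200, centers=3, cluster_std=2.75, random_state=0)
sse_model = KMeans(
    init="random", n_clusters=3, n_init=10, max_iter=300, random_state=42
).fit(sse_X)

# Look up the centroid each point was assigned to, then sum the squared
# Euclidean distances between the points and those centroids.
assigned_centroids = sse_model.cluster_centers_[sse_model.labels_]
manual_sse = np.sum((sse_X - assigned_centroids) ** 2)

print(manual_sse, sse_model.inertia_)
```

The two printed values should agree up to floating-point error.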

# K-Means Example
**First step:**
<br>
Generate the data using make_blobs(), a convenience function in scikit-learn used to generate synthetic clusters. 
<br>
make_blobs() uses these parameters:
<br>
n_samples is the total number of samples to generate.
<br>
centers is the number of centers to generate.
<br>
cluster_std is the standard deviation of the points within each cluster.


**Keep in mind that you can use any other dataset for clustering.
<br>
In this example, we generate our data.**


In [1]:
from sklearn.datasets import make_blobs
features, true_labels = make_blobs(n_samples=200, 
                                   centers=3, 
                                   cluster_std=2.75)
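
To see what make_blobs() returns, a quick inspection along these lines may help. The names blob_features and blob_labels are hypothetical, and random_state is an addition that only makes the sketch repeatable.

```python
import numpy as np
from sklearn.datasets import make_blobs

blob_features, blob_labels = make_blobs(
    n_samples=200, centers=3, cluster_std=2.75, random_state=0
)

# One row per sample; make_blobs() produces two features per sample by default.
print(blob_features.shape)

# The second return value holds the index of the blob each sample came from.
cluster_ids, counts = np.unique(blob_labels, return_counts=True)
print(cluster_ids, counts)
```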

Prepare data for clustering:
<br>
Standardization shifts and scales the values of each numerical feature in your dataset so that every feature has a mean of 0 and a standard deviation of 1.

In [2]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
# Print the first five rows
scaled_features[:5]

array([[ 0.50061548,  0.45926973],
       [-0.49885037, -1.20154105],
       [ 0.41958518, -0.19574834],
       [-1.65087195, -0.74467391],
       [-0.93807273, -1.00750175]])
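
One way to confirm what the scaler did is to check the column means and standard deviations, and to reproduce the transformation by hand. This is a small sketch on separately generated data; the names std_features, std_scaled, and manual_scaled are assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

std_features, _ = make_blobs(n_samples=200, centers=3, cluster_std=2.75, random_state=0)
std_scaled = StandardScaler().fit_transform(std_features)

# StandardScaler applies (x - mean) / std to each column,
# using the population standard deviation (ddof=0).
manual_scaled = (std_features - std_features.mean(axis=0)) / std_features.std(axis=0)

print(std_scaled.mean(axis=0))  # each entry is approximately 0
print(std_scaled.std(axis=0))   # each entry is approximately 1
print(np.allclose(std_scaled, manual_scaled))
```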

**Second step:**
<br>
The **KMeans estimator class in scikit-learn** is where you set the algorithm parameters before fitting the estimator to the data. The scikit-learn implementation is flexible, providing several parameters that can be tuned.
<br>
Here are the parameters used in this example:
<br>
- init controls the initialization technique. The standard version of the k-means algorithm is implemented by setting init to "random". Setting this to "k-means++" uses a seeding strategy that spreads the initial centroids apart to speed up convergence, which you’ll use later.
<br>
- n_clusters sets k for the clustering step. This is the most important parameter for k-means.
<br>
- n_init sets the number of initializations to perform. This is important because two runs can converge on different cluster assignments. Older scikit-learn releases default to ten k-means runs. Since version 1.4 the default is "auto", which performs ten runs when init is "random" and one run when init is "k-means++". In either case, the run with the lowest SSE is returned.
<br>
- max_iter sets the maximum number of iterations for each initialization of the k-means algorithm.
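
The effect of n_init can be explored with a sketch like the following, which compares several single-initialization runs against one model that tries ten initializations. The data, the seeds, and the names init_X, single_run_sse, and best_of_ten are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

init_X, _ = make_blobs(n_samples=200, centers=3, cluster_std=2.75, random_state=0)

# One random initialization per seed: the final SSE may differ between seeds.
single_run_sse = [
    KMeans(init="random", n_clusters=3, n_init=1, max_iter=300, random_state=seed)
    .fit(init_X)
    .inertia_
    for seed in range(5)
]

# Ten initializations in one call: the run with the lowest SSE is kept.
best_of_ten = KMeans(
    init="random", n_clusters=3, n_init=10, max_iter=300, random_state=0
).fit(init_X)

print(single_run_sse)
print(best_of_ten.inertia_)
```

If the single runs report noticeably different SSE values, that is the variability n_init is meant to guard against.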


Instantiate the KMeans class with the following arguments:

In [3]:
from sklearn.cluster import KMeans

kmeans = KMeans(init="random", 
                n_clusters=3, 
                n_init=10, 
                max_iter=300, 
                random_state=42)

**Third step:**
<br>
Now that the KMeans instance is ready, the next step is to fit it to the data in scaled_features. 
<br>
This will perform ten runs of the k-means algorithm on your data with a maximum of 300 iterations per run:

In [4]:
kmeans.fit(scaled_features)

KMeans(init='random', n_clusters=3, random_state=42)

In [5]:
# The lowest SSE value
kmeans.inertia_

167.87182245187537

In [6]:
# Final locations of the centroids
kmeans.cluster_centers_

array([[-0.13080457,  1.07293175],
       [-0.82557879, -0.79549786],
       [ 0.99114654,  0.07691387]])

In [7]:
# The number of iterations required to converge
kmeans.n_iter_

6
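
Beyond the attributes above, a fitted estimator also stores the cluster assigned to each sample in labels_, and predict() assigns samples to the nearest fitted centroid. The sketch below rebuilds a comparable model on separately generated data so that it runs on its own; the names lab_features, lab_scaled, and lab_kmeans are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

lab_features, _ = make_blobs(n_samples=200, centers=3, cluster_std=2.75, random_state=0)
lab_scaled = StandardScaler().fit_transform(lab_features)

lab_kmeans = KMeans(init="random", n_clusters=3, n_init=10, max_iter=300, random_state=42)
lab_kmeans.fit(lab_scaled)

# labels_ holds a cluster index (0 to n_clusters - 1) for every training sample.
print(lab_kmeans.labels_[:5])

# predict() returns the index of the nearest fitted centroid for each sample.
predicted = lab_kmeans.predict(lab_scaled)
print(predicted[:5])
```

Note that the cluster indices are arbitrary: cluster 0 in one fit need not correspond to cluster 0 in another fit or to label 0 in true_labels.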

# Practical Exercise
Read more about k-means and its parameters at the following link:
<br>
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
<br>
Build several other k-means clustering models. 
<br>
Try different parameter values for these models and compare their results.
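
One possible starting point for the exercise is to fit a model for several values of n_clusters and compare the resulting SSE. This is only a sketch: the range of k values, the data, and the names ex_features, ex_scaled, and sse_by_k are assumptions, and other parameters such as init or n_init can be varied in the same way.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

ex_features, _ = make_blobs(n_samples=200, centers=3, cluster_std=2.75, random_state=0)
ex_scaled = StandardScaler().fit_transform(ex_features)

# Fit one model per candidate k and record its SSE for comparison.
sse_by_k = {}
for k in range(1, 11):
    model = KMeans(init="random", n_clusters=k, n_init=10, max_iter=300, random_state=42)
    model.fit(ex_scaled)
    sse_by_k[k] = model.inertia_

for k, sse in sse_by_k.items():
    print(k, round(sse, 2))
```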