Title: K-means clustering

Authors: Bauyrzhan Taimanov - Madina Amanatayeva

Date: Jan 18, 2024

In [2]:
# imports 
from matplotlib import pyplot as plt
import pandas as pd

# 0) Context


## Customer Segmentation with Parallel K-Means

In business, we often say "the customer is king" and "the customer is always right." Both sayings mean that we need to focus on what customers want. The idea here is to apply that principle by grouping products to match what different groups of customers like. This helps a lot in online stores, because it lets us show customers the products they are most likely to be interested in.

![alt text](https://editor.analyticsvidhya.com/uploads/73654cluster.jpg)

This concept is handy wherever we use digital marketing tools to promote products. For example, in some types of ads, we show products to people based on attributes such as their interests. Then, by looking at whether they click on the ads, we can infer who else might like those products.
Similarly, in other types of ads, we bid in real time for space to show our products. The bids are based on what we know about the people seeing the ads and on what has worked well in the past. This way, we can show the right products to the right people at the right time.

## Concept

Customers buy products based on different needs and constraints, like budget and preference for certain items. To understand these behaviors, we can analyze customer invoice data and other non-personal data.
Here's a simplified approach:
* Analyze customer invoice data to find buying patterns.
* Use clustering algorithms to group customers based on similar purchase behaviors.
* Identify products that match each customer group's preferences.
* Tailor digital campaigns to each customer group for better targeting.

With our current data limitations, however, we may not have all the details needed to build a complete model.
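
As a minimal sketch of the first two steps listed above, the snippet below aggregates invoice lines into one feature row per customer. The column names (`CustomerID`, `InvoiceNo`, `Quantity`, `UnitPrice`) and the toy rows are assumptions made for illustration only; they are not the actual schema or data of this project.

```python
import pandas as pd

# Toy invoice lines; values and column names are hypothetical.
invoices = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3],
    "InvoiceNo": ["A1", "A2", "B1", "B1", "B2", "C1"],
    "Quantity": [2, 1, 10, 5, 3, 1],
    "UnitPrice": [5.0, 20.0, 1.5, 2.0, 4.0, 100.0],
})

# Revenue of each invoice line.
invoices["LineTotal"] = invoices["Quantity"] * invoices["UnitPrice"]

# One row per customer: distinct invoices, items bought, total spend.
customer_features = invoices.groupby("CustomerID").agg(
    n_invoices=("InvoiceNo", "nunique"),
    total_quantity=("Quantity", "sum"),
    total_spend=("LineTotal", "sum"),
)

print(customer_features)
```

A table like `customer_features` is the kind of input a clustering algorithm would take, usually after scaling the columns so that no single feature dominates the distance.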

# 1) Analysis of the Serial Algorithm

K-Means clustering involves two separate choices: how the initial centroids are picked (initialization) and how the clustering is then refined (optimization). Here are a few common options for each:

* Random Initialization: This is the simplest method: K initial centroids are chosen at random from the data points. Because K-Means only converges to a local optimum, a poor starting placement (for example, two centroids inside the same natural group) can lead to a suboptimal clustering.
* K-Means++ Initialization: This method improves on random initialization by choosing initial centroids that tend to be spread far apart, which increases the likelihood of converging to a better solution. The first centroid is picked uniformly at random from the data points; each subsequent centroid is picked from the data points with probability proportional to the squared distance to the closest centroid already chosen.
* Hartigan–Wong Algorithm: The Hartigan–Wong algorithm is an iterative refinement approach used to optimize the K-Means objective function (the total within-cluster sum of squares). Starting from an initial partition, it considers data points one at a time and moves a point to a different cluster whenever the move lowers the objective, taking into account how the centroids of both affected clusters shift as a result. Unlike Lloyd's algorithm, it therefore does not always assign a point to its nearest centroid. The process continues until no single move improves the objective.
* Lloyd's Algorithm (Standard K-Means): Also known as the standard K-Means algorithm, this is perhaps the most widely used approach. It alternates two steps: assign each data point to its nearest centroid, then recompute each centroid as the mean of the points assigned to it. The process repeats until convergence, that is, until the assignments and centroids no longer change significantly.
* Elkan's Algorithm: Elkan's algorithm is an optimized version of Lloyd's algorithm that produces the same result while reducing the number of distance calculations. It exploits the triangle inequality and keeps bounds on point-to-centroid distances so that many comparisons can be skipped. It can be significantly faster than Lloyd's algorithm, particularly on large datasets with many clusters, at the cost of extra memory for the bounds.
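
The pruning rule behind Elkan's algorithm can be derived in one line. The notation is introduced here for illustration: $x$ is a data point, $c$ its currently assigned centroid, $c'$ any other centroid, and $d$ the Euclidean distance. By the triangle inequality,

$$
d(c, c') \le d(c, x) + d(x, c') \quad\Longrightarrow\quad d(x, c') \ge d(c, c') - d(x, c).
$$

So whenever $d(c, c') \ge 2\,d(x, c)$, it follows that $d(x, c') \ge d(x, c)$: the centroid $c'$ cannot be closer to $x$ than $c$ is, and $d(x, c')$ does not need to be computed.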

Any of these methods can be used when solving the customer segmentation problem with K-Means clustering. The choice depends on factors such as computational resources, the size of the dataset, and the desired level of optimization. It's common to experiment with different initialization methods and optimization algorithms to find the best clustering solution.
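
The K-Means++ seeding rule and the assign/update loop of Lloyd's algorithm described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a production implementation: the function names `kmeans_pp_init` and `lloyd`, the tolerance, and the synthetic blobs are all introduced here, and NumPy is not among the imports of the original notebook.

```python
import numpy as np


def kmeans_pp_init(X, k, rng):
    """K-Means++ seeding: pick initial centroids that tend to be spread apart."""
    n = X.shape[0]
    # First centroid: a data point chosen uniformly at random.
    centroids = [X[rng.integers(n)]]
    for _ in range(1, k):
        # Squared distance from each point to its closest chosen centroid.
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d2
        # (assumes the data contains at least k distinct points).
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centroids)


def lloyd(X, centroids, max_iter=100, tol=1e-8):
    """Lloyd's iterations: assign points, then recompute the cluster means."""
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: mean of each cluster (keep the old centroid if empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        shift = np.abs(new_centroids - centroids).max()
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids


# Synthetic data: three well-separated 2-D blobs, for illustration only.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

labels, centroids = lloyd(X, kmeans_pp_init(X, 3, rng))
inertia = ((X - centroids[labels]) ** 2).sum()  # within-cluster sum of squares
print(centroids.round(2), inertia)
```

In the assignment step each point's nearest-centroid search is independent of every other point's, which makes that step the natural place to look when parallelizing the algorithm.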