# Clustering

## [Data - Mall Customers](https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python)

In [None]:
# URL: https://www.kaggle.com/api/v1/datasets/download/vjchoudhary7/customer-segmentation-tutorial-in-python?dataset_version_number=1

import requests
import zipfile
import io

def download_and_extract_zip(url, extract_to='.'):
  """Downloads a zip file from a URL and extracts it to a specified directory.

  Args:
    url: The URL of the zip file.
    extract_to: The directory to extract the zip file to. Defaults to the current directory.
  """
  try:
    response = requests.get(url, timeout=30)  # Whole body is read below, so streaming is not needed; time out instead of hanging
    response.raise_for_status()  # Raise an exception for bad status codes

    with zipfile.ZipFile(io.BytesIO(response.content)) as zip_ref:
      zip_ref.extractall(extract_to)
    print(f"Zip file downloaded and extracted to '{extract_to}' successfully.")

  except requests.exceptions.RequestException as e:
    print(f"Error downloading the file: {e}")
  except zipfile.BadZipFile as e:
    print(f"Error extracting the zip file: {e}")

# Example usage:
url = "https://www.kaggle.com/api/v1/datasets/download/vjchoudhary7/customer-segmentation-tutorial-in-python?dataset_version_number=1"
download_and_extract_zip(url)

In [None]:
import pandas as pd

df_mall_customers = pd.read_csv("Mall_Customers.csv")
df_mall_customers.head()

## Distance

<img src="https://weaviate.io/assets/images/hero-183a22407b0eaf83e53d574aee0a049a.png">
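Clustering algorithms group points by how far apart they are, so the choice of distance metric matters. Below is a minimal sketch of three common metrics using NumPy. The choice of these three, the function names, and the sample vectors are assumptions made for illustration.

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    # Sum of absolute differences along each axis
    return float(np.sum(np.abs(a - b)))

def cosine_distance(a, b):
    # 1 - cosine similarity; depends on direction only, not on vector length
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

print(euclidean(a, b))        # 5.0
print(manhattan(a, b))        # 7.0
print(cosine_distance(a, b))  # small value: the vectors point in similar directions
```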

## [K-Means](https://www.youtube.com/watch?v=5I3Ei69I40s)

1. Choose the number of clusters (K):
    - Decide how many clusters you want to form (K). This number must be defined before running the algorithm.
2. Initialize the centroids:
    - Randomly select K data points from the dataset as the initial centroids (cluster centers). These centroids represent the center of each cluster.
3. Assign each data point to the nearest centroid:
    - For each data point in the dataset, calculate the distance between the point and each of the K centroids (using a distance metric such as Euclidean distance).
    - Assign each data point to the cluster with the closest centroid.
4. Recalculate the centroids:
    - Once all the data points are assigned to clusters, recalculate the centroids. The new centroid of a cluster is the mean (average) of all the points in that cluster.
5. Repeat steps 3 and 4:
    - Reassign all the data points to the nearest centroids (based on the new centroids from step 4).
    - Recalculate the centroids again based on the updated clusters.
6. Convergence:
    - The algorithm stops when the centroids stabilize (i.e., when they no longer move significantly between iterations) or after a predefined maximum number of iterations.
    - At this point, the data points are divided into K clusters, with each point assigned to the nearest centroid.
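The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name `kmeans`, its parameters (`max_iter`, `tol`, `seed`), the empty-cluster handling, and the synthetic two-blob data are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iter):
        # Step 3: Euclidean distance from every point to every centroid, shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Step 4: new centroid = mean of the points assigned to it
        # (assumption: an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Steps 5-6: stop once the centroids barely move
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break

    # Final assignment against the final centroids
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)

# Synthetic data: two well-separated blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
])

centroids, labels = kmeans(X, k=2)
print(centroids)
```

To try it on the mall data, `X` could be built from numeric columns of `df_mall_customers` with `.to_numpy()`. Which columns to use is a modeling choice.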

## K-elbow - How many clusters do we need?

<img src="https://miro.medium.com/v2/resize:fit:670/0*aY163H0kOrBO46S-.png">
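The elbow method fits K-Means for a range of K values and records the inertia (the sum of squared distances from each point to its assigned centroid). Inertia keeps falling as K grows, so the value to look for is the K after which the improvement flattens out. Below is a minimal sketch with scikit-learn's `KMeans`. The synthetic three-blob data and the range of K are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three blobs, so the elbow is expected around K = 3
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=center, scale=0.4, size=(40, 2))
    for center in [(0, 0), (4, 0), (2, 4)]
])

inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # sum of squared distances to the closest centroid

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting `inertias` against K (for example with matplotlib) gives the familiar elbow curve. Here the drop from K=2 to K=3 is large and the drop from K=3 to K=4 is small.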

## Prediction
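Once a K-Means model is fitted, a new point can be assigned to a cluster by finding its nearest centroid. Below is a minimal sketch using `KMeans.predict` from scikit-learn. The toy data and the new points are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy training data: two obvious groups
X = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.3],
])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# New, unseen points: each one receives the label of its nearest centroid
new_points = np.array([[0.9, 1.0], [8.1, 8.1]])
predicted = model.predict(new_points)

print(model.cluster_centers_)
print(predicted)
```

The cluster numbers themselves are arbitrary. What matters is that each new point receives the same label as the training points it sits next to.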

## [Hierarchical clustering](https://www.youtube.com/shorts/W9j5pAIYbQQ)

### [Agglomerative Clustering](https://scikit-learn.org/dev/modules/generated/sklearn.cluster.AgglomerativeClustering.html)
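Agglomerative clustering works bottom-up: every point starts as its own cluster, and the two closest clusters are merged repeatedly until the requested number of clusters remains. Below is a minimal sketch with scikit-learn's `AgglomerativeClustering`. The toy data, `n_clusters=2`, and Ward linkage are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy data: two obvious groups
X = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.3],
])

# Ward linkage merges the pair of clusters that least increases within-cluster variance
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)

print(labels)
```

Unlike K-Means, this estimator has no `predict` method: labels are only produced for the data it was fitted on.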

## [DBSCAN](https://www.youtube.com/watch?v=_A9Tq6mGtLI)

### Setting parameters: The algorithm uses two main parameters:

 - **Epsilon (ε):** The radius of the neighborhood around a point. Two points are neighbors if the distance between them is at most ε.
 - **MinPts:** The minimum number of points that must lie within distance ε of a point for it to count as a core point, i.e. a point in a dense region.

### Categorization of points: The algorithm distinguishes three types of points:

 - **Core points:** Points that have at least MinPts neighbors within distance ε.
 - **Boundary points:** Points that have fewer than MinPts neighbors but lie within distance ε of a core point (also called border points).
 - **Noise points:** Points that are neither core points nor boundary points, i.e. they do not lie within distance ε of any core point.
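The three categories can be recovered from a fitted scikit-learn `DBSCAN` model. In scikit-learn, `eps` plays the role of ε and `min_samples` the role of MinPts, and `min_samples` counts the point itself. Below is a minimal sketch. The toy data and the parameter values are assumptions chosen so that each category appears.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([
    [0.0, 0.0], [0.0, 0.5], [0.5, 0.0], [0.5, 0.5],  # dense square
    [1.1, 0.5],                                       # at the edge of the square
    [10.0, 10.0],                                     # far from everything
])

model = DBSCAN(eps=0.8, min_samples=4).fit(X)

# Core points: indices reported by the fitted model
core_mask = np.zeros(len(X), dtype=bool)
core_mask[model.core_sample_indices_] = True

# Noise points: scikit-learn labels them -1
noise_mask = model.labels_ == -1

# Boundary points: assigned to a cluster but not dense enough to be core
border_mask = ~core_mask & ~noise_mask

print("labels:  ", model.labels_)
print("core:    ", np.flatnonzero(core_mask))
print("boundary:", np.flatnonzero(border_mask))
print("noise:   ", np.flatnonzero(noise_mask))
```

Unlike K-Means, the number of clusters is not set in advance. It follows from `eps` and `min_samples`.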