# Project 3: Clustering - Mall Customer Segmentation 

## Dataset: 

Mall Customer Segmentation Dataset (available from Kaggle)

## Analysis Goals:

1. Data Preprocessing:
    - Handle missing values, if any, using appropriate techniques such as imputation or dropping.
    - Normalize numerical features using StandardScaler or MinMaxScaler.

2. Exploratory Data Analysis (EDA):
    - Visualize feature distributions: Use histograms or box plots to understand the distribution of features like age, annual income, and spending score.
    - Explore relationships between variables: Use scatter plots or pair plots to identify potential correlations or patterns.

3. Feature Engineering:
    - Identify relevant features: Select features such as age, annual income, and spending score for customer segmentation.

4. Dimensionality Reduction:
    - Apply PCA: Reduce the dimensionality of the feature space if there are many features.
    - Visualize reduced dimensions: Use scatter plots of the principal components to inspect the reduced space, and a heatmap of the component loadings to see how much each original feature contributes to each component.

5. Clustering:
    - Algorithms: Use KMeans and DBSCAN for clustering.
    - Determine the optimal number of clusters for KMeans (DBSCAN infers the number of clusters from its `eps` and `min_samples` parameters):
        - Elbow method: Plot the sum of squared distances (inertia) against the number of clusters and identify the "elbow" point where the inertia starts decreasing at a slower rate.
        - Silhouette score: Calculate the silhouette score for different numbers of clusters and choose the one with the highest score.

6. Cluster Analysis:
    - Analyze cluster characteristics:
        - Centroids: For KMeans, analyze the centroid of each cluster to understand the cluster's characteristics.
        - Cluster means: For both KMeans and DBSCAN, calculate the mean values of features within each cluster, excluding points that DBSCAN labels as noise (label -1).
    - Visualize clusters:
        - Scatter plots: Visualize clusters in the reduced feature space to understand their distribution and separability.
        - Heatmaps: Visualize the mean values of features within each cluster using heatmaps.

7. Model Evaluation:
    - Evaluate clustering performance:
        - Silhouette score: Measure how similar an object is to its own cluster compared to other clusters. Higher silhouette scores indicate better-defined clusters.
        - Davies–Bouldin index: Measure the average similarity between each cluster and its most similar cluster, where lower values indicate better clustering.
    - Visualize clustering results:
        - Scatter plots: Plot the data points colored by their assigned cluster labels to visualize the clustering results.
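        - Separability and compactness: Use the plots to judge how well separated the clusters are from each other and how tightly the points group within each cluster.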
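
The clustering and evaluation goals above can be sketched roughly as follows. This is a minimal illustration, not the analysis itself. It assumes scikit-learn is available and uses synthetic data from `make_blobs` as a stand-in for the three numeric features. The range of `k`, the DBSCAN values `eps=0.5` and `min_samples=5`, and all variable names are placeholder assumptions that would need tuning on the real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for Age, Annual Income and Spending Score (assumption:
# the real notebook would pass those three columns here instead).
X_raw, _ = make_blobs(n_samples=200, centers=4, n_features=3, random_state=42)

# Standardize so that no single feature dominates the distance computation.
X = StandardScaler().fit_transform(X_raw)

# Elbow method and silhouette score: fit KMeans for a range of k and record
# the inertia (sum of squared distances) and the silhouette score for each.
inertia = {}
silhouette = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertia[k] = km.inertia_
    silhouette[k] = silhouette_score(X, km.labels_)

# Choose the k with the highest silhouette score; the inertia values would
# normally be plotted against k and inspected for an elbow as a cross-check.
best_k = max(silhouette, key=silhouette.get)
final = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X)

# Davies-Bouldin index for the chosen model (lower is better).
db_index = davies_bouldin_score(X, final.labels_)

# DBSCAN takes no cluster count; points labeled -1 are treated as noise.
# eps and min_samples are placeholder values, not tuned settings.
dbscan = DBSCAN(eps=0.5, min_samples=5).fit(X)
n_dbscan_clusters = len(set(dbscan.labels_) - {-1})
n_noise = int(np.sum(dbscan.labels_ == -1))

print("best k by silhouette:", best_k)
print("Davies-Bouldin index:", round(db_index, 3))
print("DBSCAN clusters:", n_dbscan_clusters, "noise points:", n_noise)
```

The silhouette score and the elbow plot do not always agree, so it is reasonable to treat `best_k` as a starting point and compare neighbouring values of `k` before settling on one.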

## Analysis

### Load Data

| Column Name            | Description                                                                       |
|------------------------|-----------------------------------------------------------------------------------|
| CustomerID             | Unique ID assigned to the customer                                                |
| Gender                 | Gender of the customer                                                            |
| Age                    | Age of the customer                                                               |
| Annual Income (k$)     | Annual income of the customer, in thousands of dollars                            |
| Spending Score (1-100) | Score assigned by the mall based on customer behavior and spending nature (1-100) |


In [1]:
import pandas as pd
df = pd.read_csv('data/Mall_Customers.csv')

df.head()

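|   | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|------------|--------|-----|--------------------|------------------------|
| 0 | 1          | Male   | 19  | 15                 | 39                     |
| 1 | 2          | Male   | 21  | 15                 | 81                     |
| 2 | 3          | Female | 20  | 16                 | 6                      |
| 3 | 4          | Female | 23  | 16                 | 77                     |
| 4 | 5          | Female | 31  | 17                 | 40                     |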
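
A quick sanity check after loading can confirm that the columns match the table above and show whether any values are missing. The sketch below rebuilds only the five preview rows shown in the output so that it runs without the CSV file. The name `df_preview` is a hypothetical stand-in, and in the notebook the same calls would presumably be made on `df` itself.

```python
import pandas as pd

# The five rows from the df.head() output, rebuilt by hand so this example
# does not depend on data/Mall_Customers.csv (df_preview is a hypothetical
# name; the notebook's own frame is df).
df_preview = pd.DataFrame(
    {
        "CustomerID": [1, 2, 3, 4, 5],
        "Gender": ["Male", "Male", "Female", "Female", "Female"],
        "Age": [19, 21, 20, 23, 31],
        "Annual Income (k$)": [15, 15, 16, 16, 17],
        "Spending Score (1-100)": [39, 81, 6, 77, 40],
    }
)

# Count missing values per column; a non-zero count would call for
# imputation or dropping rows in the preprocessing step.
missing_per_column = df_preview.isna().sum()

# List the numeric columns, which are the candidates for scaling.
numeric_cols = df_preview.select_dtypes(include="number").columns.tolist()

print(df_preview.shape)
print(missing_per_column)
print(numeric_cols)
```

`CustomerID` appears among the numeric columns but is only an identifier, so it would normally be left out of the features passed to the scaler and the clustering algorithms.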


### Data Preprocessing