# **Clustering**

The clustering dataset consists of generated data, which means there is no specific business context or attribute descriptions, as mentioned in the assessment documents. The necessary Python libraries and their API (Application Programming Interface) functions are imported at the relevant sections of the document, rather than importing them all at the beginning, ensuring clarity and improving the document's organization.

The report is divided into the following main sections:
- The nature of the generated data attributes is explained, and additional details about key concepts are provided for better understanding.
- Insights on the data are presented through relevant plots, and initial observations about the clusters are included.
- The data is prepared for use with the clustering algorithms, with details about the scaling and normalization processes included.
- Two algorithms are chosen for clustering the dataset, and the reasons for their selection are justified.
- Analyses are provided on how to determine if the clustering results are correct in the absence of class labels, while following the guidelines outlined in the assessment material.

## **1. Import, explore and visualize the data to gain insights**

The data is first loaded into a `DataFrame` from the `assessment_cluster_dataset.csv` file using the *Pandas* library. Pandas is designed for tabular data such as spreadsheets or database tables, and helps to explore, clean, and process it. In Pandas, a data table is stored in a `DataFrame`. The *NumPy* library is also imported because Pandas relies on it for numerical computations such as descriptive statistics.

In [None]:
# This line sets the filter for warnings. It tells Python to ignore all warnings that are generated during the execution of the program.
import warnings
warnings.filterwarnings('ignore')

# Import numpy and pandas libraries to import the data set into a Dataframe
import numpy as np, pandas as pd

# Load the data in DataFrame
data = pd.read_csv('data/dataset.csv', sep=',')
data.head()

### **Exploring data information**

The output information about the dataset, provided by `data.info()`, includes the number of rows, the number of columns, the names (data attributes) of the columns, how many entries in each column are not missing, and the data type of each column.

In [None]:
# Display data information
data.info()

The data imported into the DataFrame contains **1500** entries, numbered from **0** to **1499**. There are **3 columns** corresponding to **3 attributes** in total (**att1**, **att2**, and **att3**). Each column has **1500 non-null values** with the type `float64`.

In [None]:
data.describe()

`data.describe()` displays the following information:
- **count**: the number of non-null entries in a column used for the statistics.
- **mean**:  the **average of the values** in a column (for numbers only).
- **std**:   the **standard deviation** in a column (for numbers only).
- **min**:   the smallest value in a column (for numbers only).
- **25%**:   the **first quartile** or **25th percentile** in a column (for numbers only).
- **50%**:   the **second quartile** or **50th percentile/median** in a column (for numbers only).
- **75%**:   the **third quartile** or **75th percentile** in a column (for numbers only).
- **max**:   the largest value in a column (for numbers only).

#### **Standard deviation**


The **standard deviation** of a column in a dataset measures **how much the values in that column differ from the average (or mean) value**. The standard deviation shows how spread out the values are around the mean, where a **low standard deviation** means the values are **close to the mean with less difference**, and a **high standard deviation** means the values are **more spread out with more difference**.
 
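For a dataset column $x$ with values $x_1, x_2, ..., x_n$ and mean $\bar{x}$, the population standard deviation $\sigma_x$ is given by the following formula:
$$
\text{Standard deviation (} \sigma_x \text{)} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}
$$
Note that the `std` value reported by `data.describe()` is the sample standard deviation, which divides by $n - 1$ instead of $n$. With 1500 rows the difference is negligible.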
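As a minimal sketch on a hypothetical toy column (the values are illustrative and not taken from the dataset), the effect of dividing by $n$ versus $n - 1$ can be checked directly. In pandas, the `ddof` parameter of `std()` controls the divisor.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not taken from the assessment dataset
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Population standard deviation: divide the squared deviations by n
population_std = np.sqrt(((values - values.mean()) ** 2).mean())

# Sample standard deviation: pandas divides by n - 1 by default (ddof=1)
sample_std = values.std()

print("population std:", population_std)
print("sample std:    ", sample_std)
print("describe() std:", values.describe()["std"])
```

Passing `ddof=0` to `values.std()` reproduces the population formula.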

#### **Quartiles and Percentiles**

A **percentile** measures **how values are spread in a column** of the dataset. It is the value below which a certain percentage of entries fall. Quartiles split the column into four equal parts. For example, the **second quartile** (also known as the 50th percentile or median) is the **value below which 50% of the data points fall**. This helps to understand how the data is distributed, showing the lower, middle, and upper parts of the values, as well as whether most values are close to the mean or vary widely.
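A short sketch of how quartiles can be computed, assuming a hypothetical column of nine values (illustrative only). It uses the pandas `quantile` method, which interpolates linearly between neighbouring values by default.

```python
import pandas as pd

# Hypothetical toy column, not taken from the assessment dataset
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# First, second and third quartiles (25th, 50th and 75th percentiles)
quartiles = values.quantile([0.25, 0.5, 0.75])
print(quartiles)

# The second quartile is the median
median = values.median()
print("median:", median)
```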

### **Visualizing the data for insights**

The data will now be visualized to **help understand its patterns, clusters and trends**. By using different types of 2D and 3D plots, important insights can be gained about the data. Some of these visualizations make it possible to see with the naked eye how the values relate to each other within distinct clusters.

In [None]:
## Output histograms 
data.hist(figsize=(16, 4), layout=(1, len(data.columns)), edgecolor='black')

While these **histograms are not effective for identifying clusters**, as they display each attribute's data independently, they **clearly show how the data is spread around the mean value**, helping us understand the overall distribution and variation in the data values for each attribute.

In [None]:
# Import matplotlib.pyplot, seaborn and itertools libraries to display 2D scatter plots
import matplotlib.pyplot as plt, seaborn as sns
from itertools import combinations

# Create 2D scatter plots for all possible paired combinations of the attributes
cols = data.columns
indices_combinations = list(combinations(range(len(cols)), 2))
plt.figure(figsize=(20, 6))
for i, pair in enumerate(indices_combinations):
    plt.subplot(1, len(indices_combinations), i + 1)  # One row, many columns
    sns.scatterplot(data=data, x=cols[pair[0]], y=cols[pair[1]])
plt.tight_layout()
plt.show()

The 2D scatter plots offer a first indication of the number of clusters present in the data. While they help visualize how the data points group together, estimating the exact number of clusters from them remains non-trivial, because clusters that are separated in three dimensions can overlap in a 2D projection. These plots can **show patterns and relationships between attributes**, making it easier to see how they connect with one another. Overall, these 2D plots are useful for **understanding pairwise relationships between attributes**, even if they **don't provide an accurate visualization** of the number of clusters.

In [None]:
# Import plotly.express for 3D scatter plots
import plotly.express as px

# Generate a 3D scatter plot and reduce the size of the dots
data_3d = px.scatter_3d(data, x='att1', y='att2', z='att3')
data_3d.update_traces(marker=dict(symbol="circle", size=2)) 

# Adjust the plot size for improved rotated visualization
data_3d.update_layout(width=1000, height=800)
data_3d.show()

![First clusters visualisation](./img/3d_1.png)

The 3D scatter plots provide **clear insights into the number of clusters** and how the three attributes are connected with each other. They make it easier to see the relationships and patterns among the data points in three dimensions.

These are the main characteristics:  
- There are **7 distinct globular clusters** of different sizes.  
- Most clusters are **dense in the center** and **more sparse at the edges**.  
- A **few isolated scattered dots** appear to be noise in the data.  
- Some clusters are **close to one another**, while others **are well separated**.

## **2. Data preparation for clustering algorithms**

One key transformation for the data is **feature scaling**. Most machine learning algorithms don't perform well when the numerical input **attributes have very different scales**. Two methods are commonly used for attribute scaling: **min-max scaling** and **standardization**.

**Min-max scaling, often called normalization**, is straightforward: each attribute's values are adjusted to range from 0 to 1 by subtracting the minimum and dividing by the range (the difference between the maximum and the minimum values). Standardization works differently. It first subtracts the mean from each value (resulting in a zero mean) and then divides by the standard deviation, giving a standard deviation of 1. **Unlike min-max scaling, standardization doesn't limit values to a set range, making it less sensitive to outliers**.
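The two definitions can be compared on a small example. The sketch below assumes a hypothetical single attribute containing one outlier (the values are illustrative) and checks scikit-learn's `MinMaxScaler` and `StandardScaler` against the formulas written out by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single attribute with one outlier (100.0)
column = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Scaling with scikit-learn
minmax = MinMaxScaler().fit_transform(column)
standard = StandardScaler().fit_transform(column)

# The same results computed by hand from the definitions
minmax_manual = (column - column.min()) / (column.max() - column.min())
# NumPy's std divides by n, which is what StandardScaler uses as well
standard_manual = (column - column.mean()) / column.std()

print(minmax.ravel())    # the four ordinary values end up squeezed close to 0
print(standard.ravel())  # values are centred on 0 and not bounded to [0, 1]
```

With min-max scaling, the single outlier defines the upper end of the range, so the remaining values are compressed into a narrow band near 0.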

In [None]:
# Import StandardScaler class from the sklearn.preprocessing
from sklearn.preprocessing import StandardScaler
# create the standard scaler object
scaler = StandardScaler()
# train the scaler on data
scaler.fit(data)
# apply it to data to transform it and assign it to another variable
data_transformed = scaler.transform(data)
# create a new dataframe
data_scaled = pd.DataFrame(data_transformed, columns=data.columns)
data_scaled.head()

## **3. Data clustering**

In this section, the selected algorithms, K-Means and DBSCAN, are presented and applied to the scaled data. Agglomerative hierarchical clustering was discarded, and the reasons for this decision will be provided later.

### **K-Means clustering**

K-means is a popular clustering method that starts by selecting $k$ random rows as initial cluster centers. Each object is assigned to the nearest cluster based on distance, and new centers are calculated for each cluster. This process **repeats** until **no objects change clusters** or **the distances to the centers stabilize**.
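The loop described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters and the toy data are hypothetical, and the only stopping rule used is "no row changes cluster".

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal K-Means sketch (hypothetical helper, for illustration only)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct rows as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Step 2: assign each row to its nearest center (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop when no row changes cluster
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each center to the mean of the rows assigned to it
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers

# Two well-separated hypothetical groups of 2D points
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centers = kmeans_sketch(X, k=2)
print(labels)
print(centers)
```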

K-means performs effectively when natural **clusters are compact and clearly separated**, as is almost the case here, and it is **efficient for high-dimensional data**. However, it does not work well when the chosen **$k$ does not match the actual number of clusters**, is **sensitive to noise and outliers**, and **performs poorly with non-globular or differently-sized clusters**. 

For the data studied here, there is **strong confidence in the number (7) of clusters**, and most of them are **fairly elliptically compact** and **well-separated**, except for a few. Therefore, **K-Means** is deemed a suitable algorithm for this case.

In the following, two initialization techniques for K-Means, namely **`random`** and **`k-means++`**, will be applied, and their results will be presented and compared.

#### **Random Initialisation**

In [None]:
# Import KMeans class from sklearn.cluster library
from sklearn.cluster import KMeans

# create the kmeans clustering object
kmeans_random_model = KMeans(n_clusters=7, init='random').set_output(transform='pandas')
# kmeans_random_model = KMeans(n_clusters=7, init='random', random_state=42).set_output(transform='pandas')

# train and transform the scaled data
results_kmeans_random = kmeans_random_model.fit_transform(data_scaled)
results_kmeans_random.head()

An important hyperparameter to explain is **`random_state`**. It seeds the random number generator used to choose the starting centroids. **Passing an integer gives the same results on every run**, but **it is good practice to check that the results stay stable across different seeds**. Common choices for the seed are **0** and **42**.
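One possible way to run such a stability check is sketched below. It assumes hypothetical stand-in data from `make_blobs` rather than the assessment dataset, and compares partitions with the adjusted Rand index, which equals 1.0 when two runs produce the same grouping even if the cluster numbers are permuted.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Hypothetical stand-in data: three blobs in 3D (not the assessment dataset)
X, _ = make_blobs(n_samples=300, centers=3, n_features=3, random_state=0)

def fit_labels(seed):
    """Fit K-Means with random initialisation and return the cluster labels."""
    model = KMeans(n_clusters=3, init='random', n_init=10, random_state=seed)
    return model.fit_predict(X)

labels_a = fit_labels(42)
labels_b = fit_labels(42)  # same seed again
labels_c = fit_labels(0)   # different seed

# Same seed: the labels are reproduced exactly
same_seed_identical = bool((labels_a == labels_b).all())

# Different seeds: compare the two partitions, ignoring cluster numbering
agreement = adjusted_rand_score(labels_a, labels_c)
print(same_seed_identical, agreement)
```

An agreement close to 1.0 across several seeds suggests that the clustering does not depend on one particular initialisation.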

In [None]:
kmeans_random_model.labels_

In [None]:
kmeans_random_model.cluster_centers_

In [None]:
kmeans_random_data = data_scaled.copy()
kmeans_random_data['cluster'] = kmeans_random_model.labels_
kmeans_random_data

In [None]:
kmeans_random_centroids = pd.DataFrame(kmeans_random_model.cluster_centers_, columns=data_scaled.columns)
kmeans_random_centroids['cluster'] = ['Centroid 0','Centroid 1','Centroid 2','Centroid 3','Centroid 4','Centroid 5','Centroid 6']
kmeans_random_centroids

In [None]:
# Import parallel_coordinates plotting class from pandas.plotting library
from pandas.plotting import parallel_coordinates

# Plot the centroids across all attributes
parallel_coordinates(kmeans_random_centroids, 'cluster',  marker='o')

After training the model several times, **it was observed that all the centroids are stable, different and do not overlap**. This visualization of centroids helps to illustrate how each cluster is formed and how they are positioned in relation to the data.

In [None]:
kmeans_random_data['cluster'].value_counts()

In [None]:
# Plot for data points
fig_3d_kmeans_random_data = px.scatter_3d(kmeans_random_data, x='att1', y='att2', z='att3', color='cluster')
fig_3d_kmeans_random_data.update_traces(marker=dict(symbol="circle", size=2), opacity=0.7, name="Data")

# Plot for centroids
fig_3d_kmeans_random_centroids = px.scatter_3d(kmeans_random_centroids, x='att1', y='att2', z='att3', color='cluster')
fig_3d_kmeans_random_centroids.update_traces(marker=dict(symbol="x", size=5), name="Centroids")

# Combine centroids into the main figure with adjusted legend entries
for i, trace in enumerate(fig_3d_kmeans_random_centroids.data):
    cluster_value = kmeans_random_centroids['cluster'].iloc[i]
    trace.name = f"{cluster_value}"
    fig_3d_kmeans_random_data.add_trace(trace)

# Update layout to separate legends, position the first legend on the left and resize the final 3D plot
fig_3d_kmeans_random_data.update_layout(legend=dict(title="Centroids", x=0, xanchor="left",y=1), width=1000, height=800, 
                                        title="3D Scatter Plot with Centroids (k=7, init=random, default for other hyperparameters)")

# Add a second legend for centroids (workaround using annotations for the second legend)
fig_3d_kmeans_random_data.add_layout_image(dict(source=None,
                                                # Invisible image to create spacing for a legend
                                                x=1.1, y=1, xanchor="right", yanchor="top", layer="above"))

# Adjust layout to position centroids legend
fig_3d_kmeans_random_data.update_traces(selector=dict(name="Centroid"),showlegend=True)
fig_3d_kmeans_random_data.show()

![3D Scatter Plot with Centroids (k=7, init=random, default for other hyperparameters)](./img/3d_2.png)

According to the 3D plotting, it is clear that the **seven clusters are well identified** and **the centroids are well positioned in the center of the different clusters**. This indicates that the random initialization of the K-Means clustering is **effective and demonstrates a high level of stability**.

Let's consider two randomly scaled rows to predict their clusters using the trained K-Means model with random initialization.

In [None]:
new_rows = [[0.5, -1.6, -0.2], [-1.5, 1, 0.4]]
kmeans_random_model.predict(new_rows)

#### **K-means++ Initialisation**

K-means++ improves the starting positions of the cluster centers by spreading them far apart. It picks the first cluster center at random. Then, for each remaining center, the algorithm weights every row by its squared distance to the nearest center chosen so far, so that rows far from all existing centers are more likely to be chosen next.
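The seeding step can be sketched as follows. This is a simplified illustration with a hypothetical helper name (`kmeans_pp_init`) and toy data; scikit-learn's own implementation may differ in detail.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Minimal k-means++ seeding sketch (hypothetical helper, for illustration)."""
    rng = np.random.default_rng(seed)
    # First center: one row chosen uniformly at random
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Squared distance from every row to its nearest already-chosen center
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        sq_dist = (diffs ** 2).sum(axis=2).min(axis=1)
        # Rows far from all current centers get a proportionally higher probability;
        # rows already chosen have distance 0 and therefore probability 0
        probabilities = sq_dist / sq_dist.sum()
        centers.append(X[rng.choice(len(X), p=probabilities)])
    return np.array(centers)

# Three tight hypothetical groups of 2D points
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
              [-10.0, 10.0], [-10.0, 11.0], [-11.0, 10.0]])
initial_centers = kmeans_pp_init(X, k=3)
print(initial_centers)
```

The returned rows would then be used as the starting centers for the usual assign-and-update loop.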

The **results of the `k-means++` and `random` initialization methods are almost identical**.

In [None]:
kmeans_pp_model = KMeans(n_clusters=7, init='k-means++').set_output(transform='pandas')
results_kmeans_pp = kmeans_pp_model.fit_transform(data_scaled)
results_kmeans_pp.head()

In [None]:
kmeans_pp_model.labels_

In [None]:
kmeans_pp_model.cluster_centers_

In [None]:
kmeans_pp_data = data_scaled.copy()
kmeans_pp_data['cluster'] = kmeans_pp_model.labels_
kmeans_pp_data

In [None]:
kmeans_pp_centroids = pd.DataFrame(kmeans_pp_model.cluster_centers_, columns=data_scaled.columns)
kmeans_pp_centroids['cluster'] = ['Centroid 0','Centroid 1','Centroid 2','Centroid 3','Centroid 4','Centroid 5','Centroid 6']
kmeans_pp_centroids

In [None]:
parallel_coordinates(kmeans_pp_centroids, 'cluster',  marker='o')

In [None]:
kmeans_random_data['cluster'].value_counts()

In [None]:
kmeans_pp_data['cluster'].value_counts()

In [None]:
# Plot for data points
fig_3d_kmeans_pp_data = px.scatter_3d(kmeans_pp_data, x='att1', y='att2', z='att3', color='cluster')
fig_3d_kmeans_pp_data.update_traces(marker=dict(symbol="circle", size=2), opacity=0.6, name="Data")

# Plot for centroids
fig_3d_kmeans_pp_centroids = px.scatter_3d(kmeans_pp_centroids, x='att1', y='att2', z='att3', color='cluster')
fig_3d_kmeans_pp_centroids.update_traces(marker=dict(symbol="x", size=5), name="Centroids")

# Combine centroids into the main figure with adjusted legend entries
for i, trace in enumerate(fig_3d_kmeans_pp_centroids.data):
    cluster_value = kmeans_pp_centroids['cluster'].iloc[i]
    trace.name = f"{cluster_value}"
    fig_3d_kmeans_pp_data.add_trace(trace)

# Update layout to separate legends
fig_3d_kmeans_pp_data.update_layout(legend=dict(title="Data", x=0, xanchor="left", y=1), width=1000, height=800,
                                    title="3D Scatter Plot with Centroids (k=7, init=k-means++, default for other hyperparameters)")

# Add a second legend for centroids (workaround using annotations for the second legend)
fig_3d_kmeans_pp_data.add_layout_image(dict(source=None, x=1.1, y=1, xanchor="right", yanchor="top", layer="above"))

# Adjust layout to position centroids legend
fig_3d_kmeans_pp_data.update_traces(selector=dict(name="Centroid"),showlegend=True)
fig_3d_kmeans_pp_data.show()

![3D Scatter Plot with Centroids (k=7, init=k-means++, default for other hyperparameters)](./img/3d_3.png)

In [None]:
kmeans_pp_model.predict(new_rows)

#### **Searching for the optimal configuration**

In [None]:
from sklearn.model_selection import GridSearchCV
# Define the range of values to try out for each hyperparameter
param_grid = {
    'n_clusters': range(6, 9),
    'init': ['k-means++', 'random'],
    'n_init': [5, 10, 15],
    'max_iter': [100, 200, 300, 400, 500],
    'random_state': [1, 16, 34, 42, 57]
}
# Create the kmeans model
kmeans = KMeans()
# Use the grid search to try out all possible combinations of hyperparameter values from the grid above and fit a new model for each
grid_search = GridSearchCV(kmeans, param_grid)
grid_search.fit(data_scaled)
# output the hyperparameter values of the model that achieved highest performance
grid_search.best_params_

In [None]:
kmeans_pp_model = KMeans(n_clusters=8, init='k-means++', 
                         n_init=5, max_iter=100, random_state=42).set_output(transform='pandas')
results_kmeans_pp = kmeans_pp_model.fit_transform(data_scaled)
results_kmeans_pp.head()

In [None]:
kmeans_pp_model.labels_

In [None]:
kmeans_pp_model.cluster_centers_

In [None]:
kmeans_pp_data = data_scaled.copy()
kmeans_pp_data['cluster'] = kmeans_pp_model.labels_
kmeans_pp_data

In [None]:
kmeans_pp_centroids = pd.DataFrame(kmeans_pp_model.cluster_centers_, columns=data_scaled.columns)
kmeans_pp_centroids['cluster'] = ['centroid 0','centroid 1','centroid 2','centroid 3','centroid 4','centroid 5','centroid 6','centroid 7']
kmeans_pp_centroids

In [None]:
parallel_coordinates(kmeans_pp_centroids, 'cluster',  marker='o')

In [None]:
kmeans_pp_data['cluster'].value_counts()

In [None]:
# Plot for data points
fig_3d_kmeans_pp_data = px.scatter_3d(kmeans_pp_data, x='att1', y='att2', z='att3', color='cluster')
fig_3d_kmeans_pp_data.update_traces(marker=dict(symbol="circle", size=2), opacity=0.6, name="Data")

# Plot for centroids
fig_3d_kmeans_pp_centroids = px.scatter_3d(kmeans_pp_centroids, x='att1', y='att2', z='att3', color='cluster')
fig_3d_kmeans_pp_centroids.update_traces(marker=dict(symbol="x", size=5), name="Centroids")

# Combine centroids into the main figure with adjusted legend entries
for i, trace in enumerate(fig_3d_kmeans_pp_centroids.data):
    cluster_value = kmeans_pp_centroids['cluster'].iloc[i]
    trace.name = f"{cluster_value}"
    fig_3d_kmeans_pp_data.add_trace(trace)

# Update layout to separate legends
fig_3d_kmeans_pp_data.update_layout(legend=dict(title="Data", x=0, xanchor="left", y=1), width=1000, height=800, 
                                    title="3D Scatter plot with centroids (GridSearch optimal hyperparameters)")  

# Add a second legend for centroids (workaround using annotations for the second legend)
fig_3d_kmeans_pp_data.add_layout_image(dict(source=None,# Invisible image to create spacing for a legend
                                            x=1.1, y=1, xanchor="right", yanchor="top", layer="above"))

# Adjust layout to position centroids legend
fig_3d_kmeans_pp_data.update_traces(selector=dict(name="Centroid"),showlegend=True)
fig_3d_kmeans_pp_data.show()

![3D Scatter plot with centroids (GridSearch optimal hyperparameters)](./img/3d_4.png)

In Figures 1 and 2, the lower Cluster 1 (in the 7-cluster setup) is spread out elliptically across `att2` but splits into two clusters in the optimal configuration found by GridSearch (8 clusters). Seven clusters may actually be more accurate. This split in the optimal configuration could be due to the following:

1. **Globular clouds**: K-Means works well when clusters are compact, well-separated clouds.
2. **Noise sensitivity**: K-Means is sensitive to noise and outliers, which are present here.
3. **Cluster shape and size**: K-Means performs poorly when clusters are not distinct or vary in size, especially around sparse Cluster 1 and its neighboring clusters.

**Objective evaluation** of the K-Means clustering will help confirm our subjective decision regarding the correct number of clusters.

#### **K-Means clustering evaluation**

Subjective analyses were provided throughout the previous sections. This section is dedicated to the objective evaluation of the K-Means clustering, keeping the hyperparameters found by the grid search and varying only the number of clusters.

##### **Deciding the best number of clusters for K-Means**

###### **Elbow method**

In [None]:
inertias = []
for k in range(2, 11):
    # Only the n_clusters hyperparameter will vary, while the others will remain fixed according to the GridSearch results.    
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=5, max_iter=100, random_state=42)
    kmeans.fit(data_scaled)
    inertias.append(kmeans.inertia_)
plt.plot(range(2, 11), inertias, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')

###### **Davies-Bouldin and Silhouette indexes**

In [None]:
from sklearn.metrics import silhouette_score, davies_bouldin_score
db_scores = []
sil_scores = []
for k in range(2, 11):
  kmeans = KMeans(n_clusters=k, init='k-means++', n_init=5, max_iter=100, random_state=42)
  labels = kmeans.fit_predict(data_scaled)
  db_scores.append(davies_bouldin_score(data_scaled, labels))
  sil_scores.append(silhouette_score(data_scaled, labels))
plt.subplots()
plt.plot(range(2, 11), db_scores, marker='o')
plt.plot(range(2, 11), sil_scores, marker='x')
plt.xlabel('Number of clusters')
plt.ylabel('Score')
plt.legend(['Davies-Bouldin', 'Silhouette'])

The Elbow method diagram **doesn't accurately reflect the number of clusters**, as the curve only begins to converge at 7 clusters, so it doesn't help to **decide on the right number of clusters**. However, the Davies-Bouldin (DBI) and Silhouette (SI) curves give clearer indications (a lower DBI and a higher SI both indicate better clustering):
1. The highest **SI value, reached at 7 clusters, indicates the best clustering scheme and confirms the subjective observations**.
2. The lower DBI values and the highest SI value both occur between 7 and 8 clusters, which confirms the trade-off between these two options.
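The way both indexes are read can be illustrated on data where the answer is known. The sketch below assumes hypothetical, well-separated blobs generated with `make_blobs` (three clusters by construction) and picks the number of clusters with the highest Silhouette score and the lowest Davies-Bouldin score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Hypothetical data: three tight, well-separated blobs
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

# Store (silhouette, Davies-Bouldin) for each candidate number of clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = (silhouette_score(X, labels), davies_bouldin_score(X, labels))

# Silhouette: higher is better; Davies-Bouldin: lower is better
best_k_silhouette = max(scores, key=lambda k: scores[k][0])
best_k_davies_bouldin = min(scores, key=lambda k: scores[k][1])
print(best_k_silhouette, best_k_davies_bouldin)
```

On real data the two indexes do not always agree, which is why they are read together with the visual inspection.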

### **DBSCAN (density-based) clustering**

DBSCAN was selected as the second algorithm alongside K-Means because it offers complementary advantages. Unlike K-Means, which works best with round clusters of similar sizes, **DBSCAN can find clusters of different shapes and sizes**. This means it can discover clusters that K-Means might miss. Also, DBSCAN is **good at ignoring noise points and border points**, so it **works well with data that has outliers**. However, DBSCAN can have trouble with clusters of different densities, since it uses fixed values for the `eps` and `min_samples` parameters. **Overall, DBSCAN's ability to handle diverse, complex clusters with noise makes it a useful complement to K-Means and a good fit for our situation**.
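The noise handling can be seen on a small synthetic example. The sketch below assumes hypothetical data (two tight blobs plus two far-away points added by hand) and illustrative `eps` and `min_samples` values. DBSCAN assigns the label `-1` to points it treats as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Hypothetical data: two tight blobs plus two isolated points far from both
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [5, 5]],
                  cluster_std=0.3, random_state=0)
outliers = np.array([[20.0, 20.0], [-20.0, 20.0]])
X_with_noise = np.vstack([X, outliers])

# Illustrative parameter values, chosen for this toy data only
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_with_noise)

# Count clusters without the noise label (-1) and count the noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print("clusters:", n_clusters, "noise points:", n_noise)
print("labels of the two added points:", labels[-2:])
```

K-Means, by contrast, has no noise label and would have to assign the two isolated points to one of its clusters.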

In [None]:
from sklearn.cluster import DBSCAN
dbscan_model = DBSCAN()
dbscan_labels = dbscan_model.fit_predict(data_scaled)
dbscan_data = data_scaled.copy()
dbscan_data['cluster'] = dbscan_model.labels_
dbscan_data

In [None]:
dbscan_data['cluster'].value_counts()

In [None]:
fig_3d_dbscan_data = px.scatter_3d(dbscan_data, x='att1', y='att2', z='att3', color='cluster')
fig_3d_dbscan_data.update_traces(marker=dict(symbol="circle", size=2), name="Data")
fig_3d_dbscan_data.update_layout(width=1000, height=800,
                                 title="3D Scatter plot for DBScan (default hyperparameters)")
fig_3d_dbscan_data.show()

![3D Scatter plot for DBScan (default hyperparameters)](./img/3d_5.png)

Using the default hyperparameters (`min_samples=5` and `eps=0.5`) gave poor results: only 2 noise points (cluster -1) and 3 clusters, far fewer than the expected 7 clusters. A common rule of thumb is to set `min_samples` to at least the dimensionality of the dataset plus 1 (3 + 1 = 4), which the default of 5 already satisfies, so the main issue is likely `eps`. The default `eps=0.5` appears large enough on the scaled data to merge neighbouring clusters and leaves too few noise points, while the actual data has hundreds of noise points, as can be seen visually. The other hyperparameters are better left at their default values.

That's why we need to find suitable values for `eps` and `min_samples` to obtain at least 7 clusters, along with an accurate count of noise and border points. To achieve this, we need to create a k-distance plot following the instructions given in the lab sheet.

In [None]:
from sklearn.neighbors import NearestNeighbors

def plot_k_distance(k):
    nbrs = NearestNeighbors(n_neighbors=k).fit(data_scaled)
    # Distances to the k nearest neighbors; each point is its own first neighbor (distance 0)
    distances, indices = nbrs.kneighbors(data_scaled)
    # Sort the distances in the last column (index k - 1, zero-based)
    distances = np.sort(distances[:, k - 1])
    # Plotting
    plt.figure(figsize=(10, 6))
    plt.plot(distances, marker='o', markersize=2.5)
    plt.ylabel('k-distance')
    plt.xlabel('Points sorted by distance to k-th nearest neighbor')
    plt.title(f'k-Distance Plot for k={k}')
    plt.grid(True)
    plt.show()

k = 10
plot_k_distance(k)

We can now empirically evaluate different combinations of `min_samples` (from 4 to 9 inclusive) and `eps` (from 0.2 up to, but excluding, 1, in steps of 0.01) using a grid search.

In [None]:
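from sklearn.metrics import silhouette_score
# GridSearchCV calls a scorer as scorer(estimator, X, y). DBSCAN has no predict(),
# so the labels produced by fit() are scored directly.
def dbscan_silhouette(estimator, X, y=None):
    labels = estimator.labels_
    # silhouette_score needs at least two distinct labels (noise, -1, counts as one)
    return silhouette_score(X, labels) if len(set(labels)) > 1 else -1.0
param_grid = {
    'eps': np.arange(0.2, 1, 0.01),
    'min_samples': range(4, 10)
}
dbscan = DBSCAN()
# Fit and score on the full dataset (one "split" where train and test are the same rows)
all_rows = np.arange(len(data_scaled))
grid_search = GridSearchCV(dbscan, param_grid, scoring=dbscan_silhouette, cv=[(all_rows, all_rows)])
grid_search.fit(data_scaled)
grid_search.best_params_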

The results indicate that the optimal values are `eps=0.2` and `min_samples=4`. However, to obtain a realistic number of noise points, an `eps` value of `0.23` gives the best outcome: 7 clusters, with the cloud of noise and border points identified almost perfectly.

In [None]:
dbscan_model = DBSCAN(eps=0.23, min_samples=4)
dbscan_labels = dbscan_model.fit_predict(data_scaled)
dbscan_data = data_scaled.copy()
dbscan_data['cluster'] = dbscan_model.labels_
dbscan_data

In [None]:
dbscan_data['cluster'].value_counts()

In [None]:
fig_3d_dbscan_data = px.scatter_3d(dbscan_data, x='att1', y='att2', z='att3', color='cluster')
fig_3d_dbscan_data.update_traces(marker=dict(symbol="circle", size=2), name="Data")
fig_3d_dbscan_data.update_layout(width=1000, height=800, title="3D Scatter plot for DBScan (with the optimal hyperparameters eps=0.23 and min_samples=4)")
fig_3d_dbscan_data.show()

![3D Scatter plot for DBScan (with the optimal hyperparameters eps=0.23 and min_samples=4)](./img/3d_6.png)