# 1: Data Generation

In [None]:
from src.data_generator import generate_data

dataset_2 = generate_data(dim=2, k=3, n_per_cluster=100, radius=7, cluster_dim=2, has_noise=True, n_noise=100)
dataset_4 = generate_data(dim=4, k=4, n_per_cluster=100, radius=7, cluster_dim=3, has_noise=True, n_noise=150)
dataset_10 = generate_data(dim=10, k=5, n_per_cluster=100, radius=7, cluster_dim=5, has_noise=True, n_noise=200)
dataset_20 = generate_data(dim=20, k=5, n_per_cluster=100, radius=7, cluster_dim=7, has_noise=True, n_noise=200)
dataset_100 = generate_data(dim=100, k=10, n_per_cluster=100, radius=7, cluster_dim=20, has_noise=False, n_noise=0)

Every cluster in every dataset has a radius of 7. As the total dimensionality increases, we also increase the number of clusters (from 3 to 10) and, for all but the last dataset, the number of noise points (from 100 to 200).

To better visualize the curse of dimensionality, we omit the noise points in the final dataset (100 dimensions) completely. As the plots below show, the clusters in that dataset remain invisible even without noise.
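
One common symptom of the curse of dimensionality is that pairwise distances become increasingly similar as dimensions are added. The following is a minimal sketch of that effect, not part of the notebook's pipeline: it uses uniformly random points rather than the output of `generate_data`, and the helper `relative_contrast` is a hypothetical name introduced here for illustration.

```python
import numpy as np

def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min over all pairwise Euclidean distances of random points."""
    rng = np.random.default_rng(seed)
    # Assumption: uniform points in the unit hypercube stand in for real data.
    points = rng.uniform(size=(n_points, dim))
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Keep each unordered pair once and drop the zero self-distances.
    upper = dists[np.triu_indices(n_points, k=1)]
    return (upper.max() - upper.min()) / upper.min()

contrasts = {dim: relative_contrast(dim) for dim in (2, 10, 100)}
for dim, contrast in contrasts.items():
    print(f"dim={dim:>3}: relative contrast = {contrast:.2f}")
```

A smaller relative contrast means the nearest and farthest pairs are almost equally far apart, which is what makes distance-based cluster structure hard to see in high dimensions.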

# 2: Describing and Visualizing the Data

First, we will use Principal Component Analysis to reduce the dimensionality of the datasets to 2D. Then we shall visualize the data points in 2D using a scatter plot.

Since the data is not yet clustered, the scatter plots are expected to be messy, especially for the higher-dimensional datasets.
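
How much of the data's variance survives a 2D projection can be checked with the `explained_variance_ratio_` attribute of a fitted scikit-learn `PCA`. The sketch below illustrates this on toy data generated here as an assumption (two high-variance dimensions plus eight low-variance ones); the names `toy_data` and `pca_check` are hypothetical and do not appear in the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Assumption: toy data with 2 informative dimensions (std 10) and 8 noisy ones (std 1).
scales = np.array([10.0, 10.0] + [1.0] * 8)
toy_data = rng.normal(size=(500, 10)) * scales

pca_check = PCA(n_components=2).fit(toy_data)
# Fraction of the total variance captured by each of the two components.
ratio = pca_check.explained_variance_ratio_
print(f"variance kept by 2 components: {ratio.sum():.1%}")
```

Applied to the datasets above, a low sum would suggest that the 2D scatter plot hides most of the structure.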

In [None]:
# First, we will visualize the data in 2D using PCA
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# fit_transform refits the PCA on every call, so one PCA object can be reused;
# each dataset gets its own 2D projection.
pca = PCA(n_components=2)
dataset_2_pca = pca.fit_transform(dataset_2)
dataset_4_pca = pca.fit_transform(dataset_4)
dataset_10_pca = pca.fit_transform(dataset_10)
dataset_20_pca = pca.fit_transform(dataset_20)
dataset_100_pca = pca.fit_transform(dataset_100)

plt.scatter(dataset_2_pca[:, 0], dataset_2_pca[:, 1])
plt.title("2-Dimensional Data")
plt.show()

plt.scatter(dataset_4_pca[:, 0], dataset_4_pca[:, 1])
plt.title("4-Dimensional Data")
plt.show()

plt.scatter(dataset_10_pca[:, 0], dataset_10_pca[:, 1])
plt.title("10-Dimensional Data")
plt.show()

plt.scatter(dataset_20_pca[:, 0], dataset_20_pca[:, 1])
plt.title("20-Dimensional Data")
plt.show()

plt.scatter(dataset_100_pca[:, 0], dataset_100_pca[:, 1])
plt.title("100-Dimensional Data")
plt.show()

For the sake of completeness, we will also visualize the first 2 dimensions of the datasets using a scatter plot.

In [None]:
import matplotlib.pyplot as plt

plt.scatter(dataset_2[:, 0], dataset_2[:, 1])
plt.title("2-Dimensional Data")
plt.show()

plt.scatter(dataset_4[:, 0], dataset_4[:, 1])
plt.title("4-Dimensional Data")
plt.show()

plt.scatter(dataset_10[:, 0], dataset_10[:, 1])
plt.title("10-Dimensional Data")
plt.show()

plt.scatter(dataset_20[:, 0], dataset_20[:, 1])
plt.title("20-Dimensional Data")
plt.show()

plt.scatter(dataset_100[:, 0], dataset_100[:, 1])
plt.title("100-Dimensional Data")
plt.show()

As can be seen, the scatter plots still become messier as the number of dimensions increases. The odds of the first 2 dimensions capturing the underlying structure of the data are slim.

This is in large part because the proportion of dimensions that are relevant to the clustering task shrinks as the total dimensionality grows:

- In the first dataset (2 dimensions), both dimensions are relevant
- In the second dataset (4 dimensions), one randomly chosen dimension is irrelevant (25%)
- In the third dataset (10 dimensions), half of the dimensions are irrelevant
- In the fourth dataset (20 dimensions), 65% are irrelevant
- In the final dataset (100 dimensions), 80% are irrelevant
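
The percentages in the list above can be reproduced from the `dim` and `cluster_dim` arguments passed to `generate_data`. This is a small sketch that assumes `cluster_dim` is the number of dimensions relevant to a cluster; `irrelevant_fraction` is a hypothetical helper introduced here.

```python
# (dim, cluster_dim) pairs copied from the generate_data calls in section 1.
configs = {
    "dataset_2": (2, 2),
    "dataset_4": (4, 3),
    "dataset_10": (10, 5),
    "dataset_20": (20, 7),
    "dataset_100": (100, 20),
}

def irrelevant_fraction(dim, cluster_dim):
    """Share of dimensions that carry no cluster structure.

    Assumption: cluster_dim counts the dimensions relevant to a cluster.
    """
    return (dim - cluster_dim) / dim

fractions = {name: irrelevant_fraction(d, c) for name, (d, c) in configs.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.0%} of dimensions irrelevant")
```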