<a href="https://colab.research.google.com/github/tinayiluo0322/XAI_Projects/blob/main/Explainable_Deep_Learning_II_NV_Embed_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AIPI 590 - XAI | Assignment #08
### Explainable Deep Learning II
### Visualization of NV-Embed-v2 Embedding Spaces with t-SNE, PCA, and UMAP
### Luopeiwen Yi


## Introduction

#### Objective
This notebook explores the embedding space generated by NVIDIA's ["NV-Embed-v2"](https://huggingface.co/nvidia/NV-Embed-v2) model, which is featured on the MTEB (Massive Text Embedding Benchmark) leaderboard. We aim to understand the structure and distribution of the embeddings this model produces by applying and comparing three widely used dimensionality reduction techniques: PCA (Principal Component Analysis), t-SNE (t-Distributed Stochastic Neighbor Embedding), and UMAP (Uniform Manifold Approximation and Projection).

#### Methods
- **PCA**: A linear technique for reducing dimensions and identifying the principal components that capture the greatest variance within the data.
- **t-SNE**: A non-linear technique that maps high-dimensional data to two or three dimensions while keeping similar points close together, making local clusters easier to identify. Distances between well-separated groups in a t-SNE plot should be read with caution.
- **UMAP**: A modern manifold learning technique for dimension reduction that balances the preservation of local and global data structure, often outperforming t-SNE in terms of speed and scalability while maintaining similar quality visualizations.

#### Data Selection: Key Terms in Data Science Learning
To probe the model's capabilities, we have selected a diverse array of keywords central to the field of data science. These terms range from foundational concepts like "data" and "science" to advanced methodologies such as "neural networks" and "predictive modeling" (multi-word terms are embedded one word at a time, so "neural" and "networks" appear as separate points). This choice enables us to:
- **Visualize Embedding Relationships**: Examine how these key terms are spatially related within the model's embedding space.
- **Assess Model's Semantic Capture**: Evaluate the model's effectiveness in capturing semantic and syntactic similarities or distinctions.
- **Demonstrate Dimensionality Reduction Efficacies**: Highlight the unique advantages and limitations of PCA, t-SNE, and UMAP in visualizing complex data relationships.

#### Implementation
1. **Model Selection**: Utilize the "NV-Embed-v2" model from NVIDIA, chosen from the MTEB leaderboard for its robust performance in generating embeddings.
2. **Data Preparation**: Generate embeddings for the predefined set of keywords that reflect broad and specific data science topics.
3. **Dimensionality Reduction**:
   - Implement PCA, t-SNE, and UMAP on the extracted embeddings.
   - Use 2D and 3D scatter plots to visualize and analyze the distribution and clustering patterns of the embeddings.
4. **Comparison and Analysis**:
   - Evaluate the effectiveness and suitability of each dimensionality reduction technique based on the visualization results.
   - Discuss each method's strengths and weaknesses, especially in relation to the embeddings from "NV-Embed-v2".

In [None]:
import os

# Remove Colab default sample_data if it exists
if os.path.isdir("./sample_data"):
    !rm -r ./sample_data

# Clone GitHub repo (force re-clone if it already exists)
repo_name = "XAI_Projects"
git_path = 'https://github.com/tinayiluo0322/XAI_Projects.git'

if os.path.isdir(repo_name):
    !rm -rf "{repo_name}"
!git clone "{git_path}"

# Install dependencies from requirements.txt if it exists
#requirements_file = os.path.join(repo_name, 'requirements.txt')
#if os.path.isfile(requirements_file):
    #!pip install -r "{requirements_file}"
#else:
    #print("No requirements.txt found, skipping dependency installation.")

# Change working directory to location of notebook
notebook_dir = 'Explainable_Deep_Learning_II'
path_to_notebook = os.path.join(repo_name, notebook_dir)

# Check if the directory exists
if os.path.isdir(path_to_notebook):
    %cd "{path_to_notebook}"
    %ls
else:
    print(f"Directory {path_to_notebook} not found")

Cloning into 'XAI_Projects'...
remote: Enumerating objects: 6137, done.
remote: Counting objects: 100% (59/59), done.
remote: Compressing objects: 100% (52/52), done.
remote: Total 6137 (delta 26), reused 6 (delta 6), pack-reused 6078 (from 1)
Receiving objects: 100% (6137/6137), 134.91 MiB | 15.06 MiB/s, done.
Resolving deltas: 100% (49/49), done.
Updating files: 100% (5025/5025), done.
/content/XAI_Projects/Explainable_Deep_Learning_II/XAI_Projects/Explainable_Deep_Learning_II
placeholder


In [None]:
# from google.colab import userdata
# userdata.get('secretName')

In [1]:
!pip install gensim==4.3.2 matplotlib==3.7.1 scikit-learn==1.2.2 umap-learn==0.5.6 plotly==5.15.0
!pip uninstall -y transformer-engine
!pip install torch==2.2.0
!pip install transformers==4.42.4
#!pip install flash-attn==2.2.0
!pip install sentence-transformers==2.7.0

Collecting torch==2.2.0
  Using cached torch-2.2.0-cp310-cp310-manylinux1_x86_64.whl.metadata (25 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.2.0)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.2.0)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch==2.2.0)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch==2.2.0)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch==2.2.0)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch==2.2.0)
  Using cached nvidia_cufft_cu12-11.0.2.54-py3-n

In [None]:
!pip install datasets



In [2]:
import torch
from transformers import AutoTokenizer, AutoModel
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap
from sentence_transformers import SentenceTransformer
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Define the path in Google Drive to save the embeddings
path = '/content/drive/My Drive/Colab Notebooks/embeddings.npy'

## Embedding Model: NV-Embed-v2

In this exercise, we will be utilizing **NV-Embed-v2**, a state-of-the-art generalist embedding model renowned for its top-ranking performance on the Massive Text Embedding Benchmark (MTEB). As of August 30, 2024, NV-Embed-v2 leads the MTEB leaderboard with an impressive score of 72.31 across 56 text embedding tasks. It excels particularly in the retrieval sub-category, scoring 62.65 across 15 tasks, which underscores its efficacy in enhancing Retrieval-Augmented Generation (RAG) technologies.

NV-Embed-v2 is distinguished by several innovative features that enhance its embedding capabilities. The model leverages a Large Language Model (LLM) that attends to latent vectors, optimizing the quality of pooled embeddings. This is further augmented by a two-staged instruction tuning method tailored to improve performance across both retrieval and non-retrieval tasks. Additionally, a novel hard-negative mining technique is employed that utilizes positive relevance scores to more accurately eliminate false negatives, thereby refining its predictive accuracy.

For our visualization exercise, NV-Embed-v2 supplies high-quality 4096-dimensional embeddings to explore. By applying dimensionality reduction techniques such as PCA, t-SNE, and UMAP, we aim to visualize the relationships and clusters among the embeddings this model generates. The exploration demonstrates a practical use of NV-Embed-v2 and offers insight into how its embedding space is organized.


In [None]:
# Load NV-Embed-v2
model = SentenceTransformer('nvidia/NV-Embed-v2', trust_remote_code=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


A new version of the following files was downloaded from https://huggingface.co/nvidia/NV-Embed-v2:
- configuration_nvembed.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/nvidia/NV-Embed-v2:
- modeling_nvembed.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.



In [6]:
# Words about data science learning
words = [
    "data", "science", "machine", "learning", "statistics", "analytics", "deep", "learning",
    "neural", "networks", "python", "R", "visualization", "AI", "big", "data", "regression",
    "classification", "clustering", "NLP", "predictive", "modeling", "algorithms", "supervised",
    "unsupervised", "reinforcement", "learning", "decision", "trees", "random", "forests",
    "SVM", "k-means", "dimensionality", "reduction", "feature", "engineering", "gradient",
    "boosting", "data", "wrangling", "data", "cleaning", "ETL", "SQL", "NoSQL", "Hadoop",
    "Spark", "TensorFlow", "Keras", "PyTorch", "scikit-learn", "pandas", "numpy", "matplotlib",
    "seaborn", "plotly", "dash", "shiny", "biostatistics", "text", "mining", "time", "series",
    "forecasting", "computer", "vision", "deepfake", "GAN", "blockchain", "IoT", "edge", "computing"
]
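
The list above repeats a few tokens (for example "data" and "learning") because multi-word terms were split into single words. Repeated tokens should receive essentially identical embeddings, so they land on top of each other in every plot. The snippet below is a minimal sketch of how one might spot and remove such repeats. `sample_words`, `repeated`, and `unique_words` are illustrative names that do not appear elsewhere in this notebook, and de-duplicating the real list would change the embedding count from 73, so it is not applied here.

```python
from collections import Counter

# Hypothetical excerpt of the keyword list; the full `words` list works the same way.
sample_words = ["data", "science", "machine", "learning", "deep", "learning", "big", "data"]

# Tokens that occur more than once
repeated = {w: c for w, c in Counter(sample_words).items() if c > 1}
print(repeated)

# Order-preserving de-duplication (dict keys keep insertion order in Python 3.7+)
unique_words = list(dict.fromkeys(sample_words))
print(unique_words)
```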

In [None]:
# Get embeddings for the words
embeddings = np.array(model.encode(words))

# Display the shape of the embeddings to confirm size
print(embeddings.shape)

(73, 4096)


In [None]:
# Save the embeddings to the specified Google Drive path
np.save(path, embeddings)

In [4]:
# Load the embeddings from the specified Google Drive path
embeddings = np.load(path)

# Check the embeddings by printing or analyzing their shape
print(embeddings.shape)

(73, 4096)


## Principal Components Analysis (PCA)

* Focus is on capturing global linear relationships in the data
* Use to: simplify and find global linear relationships and patterns in the data

#### How does PCA work?

1. Standardize the Data: Center each feature to a mean of 0 and, optionally, scale it to a standard deviation of 1 (scikit-learn's `PCA` centers the data but does not scale it)
2. Compute the Covariance Matrix: Calculate the covariance matrix to understand how features vary together
3. Compute Eigenvalues and Eigenvectors: Derive the eigenvalues and eigenvectors from the covariance matrix. Eigenvectors represent principal components, and eigenvalues indicate their significance
4. Sort Eigenvalues and Eigenvectors: Order them by descending eigenvalues to prioritize the most significant components
5. Select Principal Components: Choose the top 𝑘 eigenvectors corresponding to the largest eigenvalues
6. Transform the Data: Project the original data onto the selected principal components to reduce dimensions
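
The steps above can be sketched directly in NumPy. This is an illustrative implementation on toy data, not the routine scikit-learn uses (its `PCA` relies on an SVD rather than an explicit covariance eigendecomposition). `pca_eig` and `X_demo` are hypothetical names introduced only for this example, and the sketch centers the data without scaling it.

```python
import numpy as np

def pca_eig(X, k):
    """Project X (n_samples, n_features) onto its top-k principal components."""
    # Step 1: center each feature (scaling to unit variance is left out here)
    Xc = X - X.mean(axis=0)
    # Step 2: covariance matrix of the features
    cov = np.cov(Xc, rowvar=False)
    # Step 3: eigh handles symmetric matrices and returns ascending eigenvalues
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 4: reorder so the largest eigenvalue comes first
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Steps 5-6: keep the top-k eigenvectors and project the centered data
    components = eigvecs[:, :k]
    return Xc @ components, eigvals[:k]

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))  # hypothetical toy data
Z, top_eigvals = pca_eig(X_demo, k=2)
print(Z.shape)
```

The variance of each projected column equals the corresponding eigenvalue, which is why the eigenvalues are read as the "significance" of the components.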

#### Implementation in Python
Need to set:
* `n_components` - The number of dimensions in the embedded space
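
When choosing `n_components`, it helps to know how much variance each component captures. A fitted scikit-learn `PCA` object exposes this as `explained_variance_ratio_`; the sketch below computes the same quantity with NumPy on toy data so the arithmetic is visible. `explained_variance_ratio`, `X_demo`, and `k_90` are hypothetical names, and the 90% threshold is an arbitrary choice for illustration.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional to component variances
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2
    return var / var.sum()

rng = np.random.default_rng(0)
# Hypothetical toy data whose features have decreasing spread
X_demo = rng.normal(size=(73, 20)) * np.linspace(5.0, 0.5, 20)
ratio = explained_variance_ratio(X_demo)
cumulative = np.cumsum(ratio)
# Smallest number of components whose cumulative share reaches 90%
k_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print(ratio[:3], k_90)
```

With only two or three components retained for plotting, a low cumulative share is a reminder that the scatter plot shows a small slice of the 4096-dimensional structure.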

2D Embedding Space Visualization

In [8]:
# Apply PCA
pca = PCA(n_components=2)
embeddings_pca = pca.fit_transform(embeddings)

# Plot PCA results using Plotly for interactivity
fig_pca = px.scatter(
    embeddings_pca, x=0, y=1,
    text=words,
    title="2D PCA of NV-Embed-v2 Embeddings",
    labels={'0': 'Principal Component 1', '1': 'Principal Component 2'}
)
fig_pca.update_traces(marker=dict(size=8))
fig_pca.show()

3D Embedding Space Visualization

In [7]:
# Apply PCA with 3 components
pca = PCA(n_components=3)
embeddings_pca = pca.fit_transform(embeddings)

# Plot PCA results in 3D using Plotly for interactivity
fig_pca = px.scatter_3d(
    x=embeddings_pca[:, 0],
    y=embeddings_pca[:, 1],
    z=embeddings_pca[:, 2],
    text=words,
    title="3D PCA of NV-Embed-v2 Embeddings",
    labels={'x': 'Principal Component 1', 'y': 'Principal Component 2', 'z': 'Principal Component 3'}
)
fig_pca.update_traces(marker=dict(size=5))
fig_pca.show()

### Analysis of PCA Embeddings Visualization

#### 2D Visualization

In the 2D PCA visualization, we observed distinct clustering patterns among data science-related terms. General terms used across the field are tightly clustered, which suggests that the model places them close together in its embedding space. The plot shows two main clusters, suggesting that certain groups of terms are more closely related to each other than to the rest; this warrants further investigation to understand the underlying connections.

However, certain specific tools and libraries like R, GAN, and Keras are positioned far from the rest of the terms and also maintain considerable distances from each other, indicating unique application contexts or functionalities that set them apart from more commonly associated data science technologies. Similarly, SVM, Spark, ETL, NLP, and Hadoop are observed to cluster closer to each other yet remain distant from the main clusters, hinting at a specialized niche or shared functionality that differentiates them from the broader data science tools.

AI and IoT are found to be farther from the main clustering groups, which may reflect their emerging role and the evolving nature of their integration into data science workflows. Surprisingly, SQL and NoSQL, despite both being database technologies, are positioned far from each other. This separation highlights the fundamental differences in their design and typical usage scenarios—SQL being more traditional and structured, whereas NoSQL caters to flexible, schema-less data handling.

This diverse positioning suggests that these technologies are utilized across a broad range of data science sub-domains, leading to their scattered distribution due to varied usage contexts. Understanding these distances and groupings can provide valuable insights into the evolving landscape of data science tools and methodologies.

#### 3D Visualization

The 3D PCA visualization adds a layer of depth to the observations made from the 2D analysis, revealing more intricate relationships between the terms. This visualization shows two prominent clusters focusing on libraries and modeling methods, which provides a more nuanced understanding of how data science tools and techniques are related spatially. Specific modeling techniques such as clustering, k-means, regression, and various forms of learning (including reinforcement, predictive, supervised, and unsupervised learning) are found to be tightly clustered. This grouping underscores the shared theoretical foundations and similar application contexts, indicating that these methods often coalesce around common analytical and modeling objectives.

The tools and libraries such as Matplotlib, Python, scikit-learn, NumPy, and R are observed to have a more dispersed arrangement in the 3D space. This spread reflects their versatility and widespread use across different sub-domains of data science, each serving a broad range of functionalities that might not tightly correlate with one another.

In terms of data management technologies, SQL and NoSQL are notably distant from each other. This separation underscores the fundamental architectural differences between the two: SQL is traditionally used for structured data with predefined schemas, whereas NoSQL is favored for its flexibility in handling unstructured data, making each suitable for distinct types of data management tasks.

A surprising proximity is noted between terms traditionally associated with advanced statistical methods and those linked with cutting-edge machine learning technologies, such as R and GAN. Their closeness in the 3D space might suggest an emerging intersection where statistical rigor meets innovative data processing, which is particularly relevant in areas like predictive modeling and artificial intelligence research.

This 3D visualization not only confirms some of the relationships suggested by the 2D view but also provides additional insights into how different data science concepts interact within a more complex spatial framework. By examining these relationships in a three-dimensional context, we gain a richer understanding of the data science landscape, revealing how theoretical and practical aspects of the field converge.

#### Implications and Insights
The PCA visualizations, both in 2D and 3D, provide compelling insights into the semantic relationships and contextual groupings within the data science lexicon as perceived by the NV-Embed-v2 model. The clustering patterns observed offer a window into how various concepts and tools are interconnected or distinct within the model's learned structure.

For instance, the nearness of R and GAN might inspire further exploration into integrative approaches that combine classical statistical methods with cutting-edge generative models. The dispersion of widely-used libraries suggests a versatility that transcends specific niches, potentially guiding decisions on educational focus or tool adoption based on desired breadth or specialization in data science roles.

Overall, these visualizations not only illuminate the model's understanding of data science terminologies but also serve as a guide for navigating the complex landscape of tools, techniques, and theories that define the field. Such insights are invaluable for both novices seeking to enter the field and seasoned practitioners aiming to optimize their methodologies or explore new technologies.
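
Because a 2D or 3D projection discards most of the original dimensions, apparent proximities such as the one between R and GAN are worth checking in the full embedding space. The following is a minimal sketch of such a check using cosine similarity. `cosine_similarity_matrix`, `nearest_terms`, `toy_labels`, and `toy_E` are hypothetical names, and the toy vectors stand in for the real embeddings.

```python
import numpy as np

def cosine_similarity_matrix(E):
    """Pairwise cosine similarity between the rows of E."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    unit = E / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def nearest_terms(term, labels, E, top_n=3):
    """Return the top_n labels most similar to `term`, excluding the term itself."""
    sims = cosine_similarity_matrix(E)
    i = labels.index(term)
    order = np.argsort(-sims[i])
    return [(labels[j], float(sims[i, j])) for j in order if j != i][:top_n]

# Hypothetical toy vectors standing in for the real embeddings
toy_labels = ["a", "b", "c"]
toy_E = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(nearest_terms("a", toy_labels, toy_E, top_n=2))
```

With the notebook's variables, a call such as `nearest_terms("R", words, embeddings)` would list the closest terms in the original space. Repeated tokens in `words` would show up as neighbors of each other with similarity close to 1.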

## t-distributed Stochastic Neighbor Embedding (t-SNE)

* Constructs a lower-dimensional representation where similar data points are placed closer together
* Use to: Emphasize visualization, reveal local patterns and clusters


#### How does t-SNE work?

1. Compute Pairwise Similarities: Measure how similar each pair of data points is in the high-dimensional space using a Gaussian kernel
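2. Initialize Embeddings: Start with an initial low-dimensional embedding for each data point (random in the original algorithm; scikit-learn 1.2 and later default to a PCA-based initialization)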
3. Compute Similarities in Low-Dimensional Space: Measure similarities between low-dimensional embeddings using a Student's t-distribution
4. Optimize Embeddings: Adjust the embeddings to minimize the difference between the distributions of similarities in high-dimensional and low-dimensional spaces
5. Reduce Dimensionality: Obtain a reduced-dimensional representation of the data, preserving local relationships between data points
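
The quantities named in the steps above can be written out in a few lines of NumPy. This is a simplified sketch, assuming a single fixed Gaussian bandwidth `sigma` for every point instead of the per-point bandwidth that the perplexity setting normally determines, and it only evaluates the objective rather than optimizing it. All function names and the toy arrays are hypothetical.

```python
import numpy as np

def squared_distances(X):
    """Pairwise squared Euclidean distances between rows of X."""
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.maximum(D, 0.0)

def high_dim_affinities(X, sigma=1.0):
    """Step 1: symmetric Gaussian affinities P (fixed sigma, no perplexity search)."""
    D = squared_distances(X)
    P_cond = np.exp(-D / (2.0 * sigma ** 2))
    np.fill_diagonal(P_cond, 0.0)
    P_cond /= P_cond.sum(axis=1, keepdims=True)
    # Symmetrize and normalize so that all entries sum to 1
    return (P_cond + P_cond.T) / (2.0 * X.shape[0])

def low_dim_affinities(Y):
    """Step 3: Student-t (one degree of freedom) affinities Q."""
    W = 1.0 / (1.0 + squared_distances(Y))
    np.fill_diagonal(W, 0.0)
    return W / W.sum()

def kl_divergence(P, Q, eps=1e-12):
    """Step 4 objective: KL(P || Q), the quantity the optimizer reduces."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 10))        # hypothetical high-dimensional points
Y_demo = rng.normal(size=(20, 2)) * 1e-2  # Step 2: small random initial layout
P = high_dim_affinities(X_demo)
Q = low_dim_affinities(Y_demo)
print(kl_divergence(P, Q))
```

The heavy tails of the Student-t kernel let moderately distant points sit far apart in the low-dimensional layout, which is one reason t-SNE plots emphasize local neighborhoods over global distances.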

#### Implementation in Python
Need to set:
* `n_components` - The number of dimensions in the embedded space
* `perplexity` - a hyperparameter that balances the attention given to local versus global aspects of the data. It affects the quality of the resulting embeddings. Higher perplexity values consider more points as neighbors of each other, potentially resulting in more global views of the data.
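* `n_iter` - the number of iterations the algorithm will run for. More iterations can lead to better convergence and potentially better embeddings, but it also increases computation time. (This is the parameter name in the scikit-learn 1.2.2 release pinned above; scikit-learn 1.5 and later rename it to `max_iter`.)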
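
To make `perplexity` more concrete: for each point, t-SNE picks a Gaussian bandwidth so that the perplexity (2 raised to the entropy) of that point's neighbor distribution matches the requested value, which can be read as an effective number of neighbors. The sketch below illustrates the idea with a simple bisection on toy data. It is not scikit-learn's implementation, and `perplexity_of`, `sigma_for_perplexity`, and the search bounds are assumptions made for this example.

```python
import numpy as np

def perplexity_of(dist_sq_row, sigma):
    """Perplexity 2**H of the Gaussian conditional distribution for one point."""
    # Subtract the minimum before exponentiating for numerical stability
    logits = -(dist_sq_row - dist_sq_row.min()) / (2.0 * sigma ** 2)
    p = np.exp(logits)
    p /= p.sum()
    entropy = -np.sum(p * np.log2(p + 1e-12))
    return 2.0 ** entropy

def sigma_for_perplexity(dist_sq_row, target, lo=1e-3, hi=1e3, n_steps=60):
    """Bisect on a log scale for the sigma whose perplexity matches `target`."""
    for _ in range(n_steps):
        mid = np.sqrt(lo * hi)
        if perplexity_of(dist_sq_row, mid) > target:
            hi = mid  # too many effective neighbors: narrow the kernel
        else:
            lo = mid
    return np.sqrt(lo * hi)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(73, 8))  # hypothetical data with the notebook's sample count
# Squared distances from point 0 to every other point
d0 = np.sum((X_demo[1:] - X_demo[0]) ** 2, axis=1)
sigma = sigma_for_perplexity(d0, target=30.0)
print(sigma, perplexity_of(d0, sigma))
```

With 73 points, each point has at most 72 neighbors, so a perplexity of 30 already treats a large share of the dataset as "nearby". This is worth keeping in mind when reading the plots below.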

2D Embedding Space Visualization

In [9]:
# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=30, n_iter=300, random_state=42)
embeddings_tsne = tsne.fit_transform(embeddings)

# Plot t-SNE results using Plotly
fig_tsne = px.scatter(
    embeddings_tsne, x=0, y=1,
    text=words,
    title="2D t-SNE of NV-Embed-v2 Embeddings",
    labels={'0': 'Component 1', '1': 'Component 2'}
)
fig_tsne.update_traces(marker=dict(size=8))
fig_tsne.show()


3D Embedding Space Visualization

In [10]:
# Apply 3D t-SNE
tsne = TSNE(n_components=3, perplexity=30, n_iter=300, random_state=42)
embeddings_tsne = tsne.fit_transform(embeddings)

# Plot 3D t-SNE results using Plotly
fig_tsne = px.scatter_3d(
    x=embeddings_tsne[:, 0],
    y=embeddings_tsne[:, 1],
    z=embeddings_tsne[:, 2],
    text=words,  # Ensure 'words' is a list that matches the 'embeddings' one-to-one
    title="3D t-SNE of NV-Embed-v2 Embeddings",
    labels={'x': 'Component 1', 'y': 'Component 2', 'z': 'Component 3'}
)
fig_tsne.update_traces(marker=dict(size=5))
fig_tsne.show()

### Analysis of t-SNE Embeddings Visualization

#### 2D Visualization Analysis
In the 2D t-SNE visualization, the clustering of data science-related terms appears somewhat diffuse, with each term being relatively dispersed. This scattered distribution indicates that, while t-SNE excels in revealing local structure and subtle relationships within high-dimensional data, it does not always result in distinct clustering, particularly in 2D where the capability to separate overlapping groups might be limited. This might suggest a complex interconnectivity among the terms where no single group dominates in terms of shared context or functionality.

#### 3D Visualization Analysis
In contrast, the 3D t-SNE visualization provides a more nuanced view, revealing a dense, ball-shaped cluster where core data science concepts are tightly grouped. This closer clustering reflects a stronger relational affinity and suggests these terms share more in common in their underlying contexts or are frequently used together within the field.

Surprisingly, specific terms like "forecasting," "clustering," "computing," "statistics," "TensorFlow," and "engineering" are positioned much farther from this central cluster. Their distance implies that these terms might encompass unique contexts or specialized applications that distinguish them from more general data science terminology.

The term "trees" also appears isolated from the main group, which could be rationalized by its dual relevance to both data structures in computer science and decision processes in machine learning, such as decision trees. This dual applicability might cause it to float between related but distinct conceptual groups, not fully belonging to any single one in the embedding space created by t-SNE.

#### Implications and Insights
The stark contrast between the 2D and 3D visualizations highlights t-SNE's sensitivity to dimensionality and its impact on the ability to discern clusters. While 2D plots provide a broad overview, 3D plots can unearth deeper layers of structure and association among terms, offering richer insights into the data's intrinsic patterns.

## Uniform Manifold Approximation and Projection (UMAP)

* Uses manifold learning (nonlinear dimensionality reduction) to understand the underlying structure or shape of the data
* Focus on capturing complex, non-linear relationships in the data
* Use to: preserve local structure and handle complex, nonlinear relationships



#### How does UMAP work?
1. Construct Local Neighborhoods: Find each data point's `n_neighbors` nearest neighbors in the high-dimensional space
2. Build a Fuzzy Graph: Turn the neighbor distances into edge weights, giving a fuzzy simplicial set (a weighted graph) that describes the data's local structure
3. Optimize Low-Dimensional Embeddings: Minimize the discrepancy between the high-dimensional graph and its low-dimensional counterpart using stochastic gradient descent
4. Reduce Dimensionality: Obtain a lower-dimensional representation of the data while preserving both local and global relationships
5. Effective Visualization: UMAP provides an effective tool for visualizing high-dimensional data in a reduced-dimensional space, capturing complex relationships and structures
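
The neighborhood-graph construction described in the steps above can be sketched in NumPy. This is a simplified illustration, assuming one fixed scale `sigma` for every point, whereas UMAP calibrates a separate scale per point. `fuzzy_graph` and `X_demo` are hypothetical names, and the later optimization stage is omitted.

```python
import numpy as np

def fuzzy_graph(X, n_neighbors=5, sigma=1.0):
    """Symmetric fuzzy neighbor graph (fixed sigma; UMAP calibrates one per point)."""
    n = X.shape[0]
    # Pairwise Euclidean distances
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt(np.sum(diff ** 2, axis=-1))
    A = np.zeros((n, n))
    for i in range(n):
        # Indices of the nearest neighbors, skipping the point itself
        nbrs = np.argsort(D[i])[1:n_neighbors + 1]
        rho = D[i, nbrs[0]]  # distance to the closest neighbor
        # Membership is 1 for the closest neighbor and decays for farther ones
        A[i, nbrs] = np.exp(-(D[i, nbrs] - rho) / sigma)
    # Probabilistic union of the two directed edges: a + b - a*b
    return A + A.T - A * A.T

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 6))  # hypothetical toy data
G = fuzzy_graph(X_demo, n_neighbors=5)
print(G.shape, np.count_nonzero(G))
```

Subtracting the distance to the closest neighbor guarantees that every point is connected to at least one other point with full strength, which keeps isolated points attached to the graph.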


#### Implementation in Python
Need to set:
* `n_components` - The number of dimensions in the embedded space
* `n_neighbors` - determines the number of neighboring points used in the construction of the high-dimensional fuzzy topological representation of the data. It controls the local connectivity structure in the high-dimensional space. Higher values result in a more global view of the data, while lower values emphasize local structure
* `min_dist` - controls the minimum distance between embedded points in the low-dimensional representation. It acts as a regularization parameter preventing points from being too close to each other in the embedding space. Larger values of min_dist result in a more spread-out embedding, while smaller values allow points to be closer together
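
One way to see what `min_dist` does is to look at the target similarity curve in the low-dimensional space: similarity stays at 1 up to `min_dist` and then decays exponentially, and umap-learn fits a smooth curve to this shape. The sketch below only evaluates that piecewise target; it does not reproduce the curve fitting. `target_membership` is a hypothetical name, and `spread` is assumed to be 1.0.

```python
import numpy as np

def target_membership(d, min_dist=0.1, spread=1.0):
    """Target low-dimensional similarity as a function of distance d."""
    d = np.asarray(d, dtype=float)
    # Flat at 1 up to min_dist, then exponential decay
    return np.where(d <= min_dist, 1.0, np.exp(-(d - min_dist) / spread))

distances = np.array([0.0, 0.05, 0.1, 0.5, 1.0])
for md in (0.0, 0.1, 0.5):
    print(md, np.round(target_membership(distances, min_dist=md), 3))
```

A larger `min_dist` widens the flat region, so points that are already "similar enough" gain nothing from moving closer together and the layout spreads out.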

2D Embedding Space Visualization

In [11]:
# Apply UMAP
umap_model = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
embeddings_umap = umap_model.fit_transform(embeddings)

# Plot UMAP results using Plotly
fig_umap = px.scatter(
    embeddings_umap, x=0, y=1,
    text=words,
    title="2D UMAP of NV-Embed-v2 Embeddings",
    labels={'0': 'Component 1', '1': 'Component 2'}
)
fig_umap.update_traces(marker=dict(size=8))
fig_umap.show()


n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.



3D Embedding Space Visualization

In [12]:
# Apply 3D UMAP
umap_model = umap.UMAP(n_components=3, n_neighbors=15, min_dist=0.1, random_state=42)
embeddings_umap = umap_model.fit_transform(embeddings)

# Plot 3D UMAP results using Plotly
fig_umap = px.scatter_3d(
    x=embeddings_umap[:, 0],
    y=embeddings_umap[:, 1],
    z=embeddings_umap[:, 2],
    text=words,  # Ensure 'words' is defined and matches the embeddings one-to-one
    title="3D UMAP of NV-Embed-v2 Embeddings",
    labels={'x': 'Component 1', 'y': 'Component 2', 'z': 'Component 3'}
)
fig_umap.update_traces(marker=dict(size=5))
fig_umap.show()


n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.



### UMAP Clustering Analysis

#### 2D Visualization Analysis
In the 2D UMAP visualization, the clustering of data science-related terms is not as distinct as one might expect, with each term appearing somewhat dispersed across the plot. Despite this diffusion, certain trends emerge, suggesting some underlying relationships:
- Visualization libraries like **Matplotlib**, **Seaborn**, and **Plotly** are grouped together, reflecting their common usage in data visualization tasks.
- Machine learning frameworks such as **TensorFlow** and **PyTorch** are seen closer to each other, indicating their shared functionalities in neural network development.
- Core programming and data manipulation tools **Python**, **Pandas**, and **NumPy** are clustered, highlighting their integration in data processing workflows.
- Data management technologies **SQL** and **NoSQL** also cluster closely, suggesting a thematic similarity despite their operational differences.
- **ETL** processes and big data frameworks like **Hadoop** are near each other, underscoring their roles in data integration and processing pipelines.
- **Text** and **Data** find themselves in a smaller, separate cluster, far away from the main grouping, possibly reflecting their foundational role in data science distinct from specific tools and technologies.
- Surprisingly, advanced analytics and machine learning terms like **GAN**, **R**, **SVM**, **Spark**, **NLP**, and **Keras** are closely grouped, raising questions about their shared contexts that might not be immediately apparent.

#### 3D Visualization Analysis
The 3D visualization reveals more pronounced clustering, displaying a cloud-like formation where many data science terms related to modeling methods and libraries are tightly grouped:
- **SQL** and **NoSQL** form a distinct sub-cluster away from the main cloud, emphasizing their specialized role in data management.
- Contrary to the 2D results, **AI** appears significantly distanced from the central cluster, intriguingly positioned nearer to **IoT**, hinting at emerging intersections in these technologies.
- Other terms like **GAN**, **R**, **SVM**, **Spark**, **NLP**, **Keras**, and surprisingly **ETL**, are positioned far from the main cluster, which is unexpected and suggests complex underlying relationships or distinct uses in data science that separate them from more conventional methods and tools.
- Notably, the positioning of **ETL** away from **Hadoop** in this visualization contradicts the 2D findings, presenting an interesting divergence that warrants further investigation into how these tools are viewed within the larger framework of data management and processing.

#### Implications and Insights
This UMAP analysis offers insights into the relationships and distances between key data science terms, and it shows that the 2D and 3D representations can disagree (for example, in where ETL sits relative to Hadoop). These observations encourage deeper examination of how UMAP groups different technologies and methodologies.

## Final Analysis of Dimensionality Reduction Techniques

#### PCA Embeddings Visualization Analysis
- **2D and 3D PCA Visualizations** reveal nuanced clustering and dispersal patterns among data science terms. The general terms cluster tightly, indicating high relatedness and common application across the field, whereas tools and methodologies like R, GAN, and Keras are distinctly apart, suggesting unique applications or functionalities. Notably, SQL and NoSQL show significant separation, highlighting the structural and functional differences between these database technologies.

- **Insights from PCA**: PCA effectively captures the global linear relationships and provides a clear delineation of high-level groupings and separations. It helps understand the overarching structures within the dataset but may not be as effective in capturing the local, non-linear intricacies that non-linear techniques like t-SNE or UMAP can reveal.

#### t-SNE Embeddings Visualization Analysis
- **2D t-SNE Visualization** presents a more scattered arrangement of data science terms, reflecting t-SNE's strength in highlighting local relationships at the cost of global structure. **3D t-SNE** deepens this view by clustering core concepts tightly in a dense, ball-shaped formation, while more specialized terms like "forecasting" and "TensorFlow" drift farther away, emphasizing their unique contexts.

- **Insights from t-SNE**: t-SNE excels in revealing hidden structures and subtle local patterns within high-dimensional data, making it invaluable for identifying clusters and relationships that are not immediately obvious. Its sensitivity to parameter settings like perplexity can significantly influence the visualization outcomes.

#### UMAP Clustering Analysis
- **2D and 3D UMAP Visualizations** underscore UMAP's ability to balance the preservation of both local and global data structures, showing distinct clusters that are less tightly bound than those in t-SNE but more integrated than in PCA. The placement of terms like AI and IoT suggests emerging intersections and evolving roles within the field.

- **Insights from UMAP**: UMAP provides a comprehensive view by effectively clustering related terms while maintaining a connection to the broader data context. This makes UMAP exceptionally good at facilitating an understanding of both macro and micro-scale structures within the data.

### Comparative Outcomes
- **Handling Complexities**: UMAP seems to handle the complexities of the "NV-Embed-v2" embeddings most effectively, striking a balance between detailing local patterns and maintaining an awareness of the overall structure.
- **Providing Meaningful Insights**: While PCA offers clarity and t-SNE provides depth in local cluster analysis, UMAP offers the most rounded insights into both local and global relationships, making it particularly useful for a nuanced understanding of data science terms.
- **Understanding Semantic/Syntactic Structures**: Because UMAP keeps semantically related terms (for example, the visualization libraries, or SQL and NoSQL) close together while retaining some of the broader layout, it gives the clearest picture of the semantic groupings in this dataset.
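
The comparison above rests on visual inspection. A simple numeric complement is to measure how many of each point's nearest neighbors in the original space remain nearest neighbors after projection. The following is a minimal sketch of such a score on toy data. `knn_indices`, `knn_overlap`, and `X_demo` are hypothetical names, and scikit-learn's `sklearn.manifold.trustworthiness` offers a related, more refined metric.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors of each row (Euclidean), excluding itself."""
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(D, np.inf)
    return np.argsort(D, axis=1)[:, :k]

def knn_overlap(X_high, X_low, k=10):
    """Mean fraction of each point's k high-dimensional neighbors kept in the projection."""
    high = knn_indices(X_high, k)
    low = knn_indices(X_low, k)
    shared = [len(set(h) & set(l)) for h, l in zip(high, low)]
    return float(np.mean(shared)) / k

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(73, 50))  # hypothetical stand-in for the embeddings
# Keeping only the first two coordinates plays the role of a (poor) projection
print(knn_overlap(X_demo, X_demo[:, :2], k=10))
```

In this notebook, the score could be computed as `knn_overlap(embeddings, embeddings_pca)` and likewise for the t-SNE and UMAP outputs, giving one number per method to set beside the qualitative observations. Note that the 2D and 3D results share variable names, so the score reflects whichever projection was computed last.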

### Conclusion
The application of PCA, t-SNE, and UMAP to the "NV-Embed-v2" model's embeddings has revealed distinct capabilities of each method. UMAP in particular captured a holistic view of the embeddings' structure while retaining detail in local relationships, which suggests it suits tasks where both kinds of relationship matter to the analysis. These insights enrich our understanding of the data science landscape and guide the selection of appropriate tools for specific data analysis tasks in real-world applications.
