---
title: "Unsupervised Learning"
format:
    html: 
        code-fold: false
---

<br>
<br>

# Overview

## Dimension Reduction

### PCA

In this section, I will use a few different text embedding methods to vectorize our reviews data. Similar to what was done in the EDA section for visualizing frequent keywords for different rating scores, I will use a simple bag-of-words approach to vectorize bigrams (pairs of adjacent words). Next, I will use term frequency-inverse document frequency (TF-IDF) to vectorize single words in the text. The rationale for using TF-IDF over bag-of-words for single terms is that bag-of-words provided little useful information when applied to single terms in the [EDA](../eda/main.ipynb) section. For a refresher on the bag-of-words result, please refer to the diagram in the [EDA](../eda/main.ipynb) section, and for further context on TF-IDF, see the equations outlined in the literature review on the [Home](../../index.qmd) page. 

From there, I will leverage several different unsupervised learning techniques. To begin, I use two dimension reduction techniques to collapse our embedded text data into a low-dimensional space for easier visualization: Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE). In case you are unfamiliar with these two techniques, PCA works by identifying axes in high-dimensional space along which the variance of the data is maximized. These so-called "principal" components are eigenvectors of the covariance matrix, and their selection (i.e. how many principal components we keep) depends on the share of total variance explained by their respective eigenvalues[@EigenPCA]. 

**Here is a helpful visualization of what is going on in PCA:**
<br>
![](../../xtra/multiclass-portfolio-website/images/pca.gif){width="600px"} 
<br>
Source: [Builtin](https://builtin.com/data-science/step-step-explanation-principal-component-analysis)
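
To make the eigenvector description of PCA concrete, the sketch below computes principal components directly from the covariance matrix and checks the explained-variance shares against scikit-learn's `PCA`. The random data and the 90% variance threshold are illustrative assumptions, not values used in this project.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples with 5 correlated features
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Center the data and compute the covariance matrix of the features
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigen-decomposition; eigh returns eigenvalues in ascending order, so reverse
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Share of total variance explained by each component
explained_ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest number of components explaining at least 90% of the variance
# (the 0.90 threshold is an arbitrary choice for this sketch)
n_components = int(np.searchsorted(cumulative, 0.90) + 1)

# Project the centered data onto the leading eigenvectors
X_reduced = Xc @ eigvecs[:, :n_components]

# Cross-check the variance shares against scikit-learn
pca = PCA(n_components=5).fit(X)
print(np.allclose(pca.explained_variance_ratio_, explained_ratio))
print(n_components, X_reduced.shape)
```

In practice the library call alone is enough; the manual version is only meant to show where the eigenvalues and eigenvectors enter.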

### t-SNE

On the other hand, t-SNE takes a non-linear, probabilistic approach to dimension reduction that works in two stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional points, assigning higher probabilities to similar points and lower probabilities to dissimilar points[@WikiTSNE]. It then defines a similar probability distribution over the points in a lower-dimensional space and shrinks the difference between the two distributions by minimizing the Kullback-Leibler (KL) divergence between them. In simple terms, the KL divergence measures how much one probability distribution differs from another[@WikiKL]. t-SNE also requires a `perplexity` hyperparameter, which represents a guess as to how many close neighbors a given point should have, or the "balance between preserving the global and local structure of the data"[@perplexity]. Feel free to head over [here](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) for a more thorough explanation of t-SNE and KL divergence.
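
For intuition about the KL divergence mentioned above, here is a small sketch for discrete distributions. The function name `kl_divergence` and the example distributions are made up for illustration; t-SNE applies the same quantity to its pairwise similarity distributions.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical distributions over three outcomes
p = [0.5, 0.3, 0.2]
q_close = [0.45, 0.35, 0.2]
q_far = [0.1, 0.1, 0.8]

print(kl_divergence(p, p))        # identical distributions
print(kl_divergence(p, q_close))  # small divergence
print(kl_divergence(p, q_far))    # larger divergence
```

Note that the KL divergence is not symmetric: `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ, so it is not a distance in the strict sense.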

**Example of How `perplexity` Influences t-SNE Results**
<br>
![](../../xtra/multiclass-portfolio-website/images/perplexity.png){width="600px"}
<br>
Source: [Single Cell Discoveries](https://www.scdiscoveries.com/blog/knowledge/what-is-t-sne-plot/)
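
A minimal sketch of experimenting with `perplexity` using scikit-learn's `TSNE` follows. The synthetic two-cluster data and the perplexity values 5 and 30 are assumptions chosen for illustration; the real analysis would pass the embedded review vectors instead.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Hypothetical data: two clusters of 50 points each in 20 dimensions
X = np.vstack([
    rng.normal(0, 1, size=(50, 20)),
    rng.normal(5, 1, size=(50, 20)),
])

embeddings = {}
for perplexity in (5, 30):
    # perplexity must be smaller than the number of samples
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)
    # kl_divergence_ is the final value of the objective t-SNE minimizes
    print(perplexity, embeddings[perplexity].shape, tsne.kl_divergence_)
```

Each entry in `embeddings` can be passed to a scatter plot to compare how the layout changes with perplexity.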

## Clustering

Once the data is properly collapsed into a lower-dimensional space, I will apply several clustering methods in order to better understand how different pieces of text group together. For this, I will use K-Means Clustering, Hierarchical Clustering, and DBSCAN. The goal for using clustering methods within the context of this study is to uncover underlying patterns in text for different review rating scores. 

### K-Means

As a first step, I will apply a K-Means clustering algorithm to the dimension-reduced data. The K-Means algorithm starts by randomly selecting $k$ points in the dataset as initial centroids, where $k$ is a hyperparameter that we can choose using the elbow method. From there, the algorithm calculates the distance from each of these $k$ centroids to every point in the dataset and assigns each point to its nearest centroid. For my distance metric, I elect to use Euclidean distance[@eucdistance]: 

$$
\text{for a point} \ x = (x_1, x_2, \ldots, x_n) \ \text{and centroid} \ \mu = (\mu_1, \mu_2, \ldots, \mu_n), \ \text{their distance is} \ d(x,\mu) = \sqrt{\sum_{i=1}^{n}(x_{i}-\mu_{i})^2}
$$
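
The assignment step based on this distance can be sketched with NumPy as follows. The four points and two centroids are made-up values for illustration.

```python
import numpy as np

# Hypothetical points and centroids in 2-D
X = np.array([[0.0, 0.0], [0.5, 0.0], [4.0, 4.0], [4.5, 4.0]])
centroids = np.array([[0.0, 0.0], [4.0, 4.0]])

# distances[i, j] = Euclidean distance between point i and centroid j
diffs = X[:, None, :] - centroids[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=2))

# Assign each point to the index of its nearest centroid
labels = distances.argmin(axis=1)

print(distances)
print(labels)
```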

After all data points have been assigned to their initial clusters, we calculate the mean of all data points in each cluster $S_{j}$ and use it as that cluster's new centroid[@Week8Slides]:

$$
\mu_{j}^{\text{new}} \leftarrow \frac{1}{|S_{j}|} \sum_{x_{i} \in{S_{j}}} x_{i}
$$

From there, we repeat the distance calculation and cluster re-assignment until convergence, i.e. until the centroids stop moving.
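
Putting the assignment and update steps together, the loop can be sketched as below. This is a bare-bones illustration under simple assumptions (random initialization from the data, a fixed iteration cap, synthetic two-blob data), and the function name `kmeans` is hypothetical; it is not the implementation used for the results in this project.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means: random initial centroids, then alternate assign/update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct points drawn from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: mean of each cluster (keep the old centroid if a cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Converged once the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical data: two well-separated blobs of 50 points each
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0, 0.5, size=(50, 2)),
    rng.normal(10, 0.5, size=(50, 2)),
])

centroids, labels = kmeans(X, k=2)
print(centroids)
```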

**Example of K-Means Convergence**
<br>
![](../../xtra/multiclass-portfolio-website/images/kmeans.gif){width="400px"}
<br>
Source: [Wikipedia](https://commons.wikimedia.org/wiki/File:K-means_convergence.gif)
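
One common way to apply the elbow method for choosing $k$ is to fit K-Means over a range of values and look at where the within-cluster sum of squares (exposed by scikit-learn as `inertia_`) stops dropping sharply. The sketch below assumes synthetic data with three blobs and an arbitrary range of $k$ from 1 to 6.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: three blobs of 50 points each in 2-D
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in (0, 5, 10)])

inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances from points to their assigned centroid
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting `inertias` against $k$ should show a sharp bend at the number of blobs in this toy data; on real text embeddings the bend is often less pronounced, so the choice of $k$ may need additional judgment.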

### Hierarchical Clustering



### Part 1: Dimensionality Reduction

The objective of this section is to explore and demonstrate the effectiveness of PCA and t-SNE in reducing the dimensionality of complex data while preserving essential information and improving visualization.

1. **PCA (Principal Component Analysis):**
   - Apply PCA to your dataset.
   - Determine the optimal number of principal components.
   - Visualize the reduced-dimensional data.
   - Analyze and interpret the results.

2. **t-SNE (t-distributed Stochastic Neighbor Embedding):**
   - Implement t-SNE on the same dataset.
   - Experiment with different perplexity values.
   - Visualize the t-SNE output to reveal patterns and clusters.
   - Compare the results of t-SNE with those from PCA.

3. **Evaluation and Comparison:**
   - Evaluate the effectiveness of PCA and t-SNE in preserving data structure.
   - Compare the visualization capabilities of both techniques.
   - Discuss the trade-offs and scenarios where one technique may perform better than the other.

### Part 2: Clustering Methods

Apply clustering techniques (K-Means, DBSCAN, and Hierarchical clustering) to a selected dataset. The goal is to understand how each method works, compare their performance, and interpret the results.

1. **Clustering Methods:**
   - Apply **K-Means**, **DBSCAN**, and **Hierarchical clustering** to your dataset.
   - Write a technical summary for each method (2–4 paragraphs per method) explaining how it works, its purpose, and any model selection methods used (e.g., Elbow, Silhouette).

2. **Results Section:**
   - Discuss and visualize the results of each clustering analysis.
   - Compare the performance of different clustering methods, noting any insights gained from the analysis.
   - Visualize cluster patterns and how they relate (if at all) to existing labels in the dataset.
   - Use professional, labeled, and clear visualizations that support your discussion.

3. **Conclusion:**
   - Summarize the key findings and their real-world implications in a non-technical way. Focus on the most important results and how they could apply to practical situations.

# Code 

Provide the source code used for this section of the project here.

If you're using a package for code organization, you can import it at this point. However, make sure that the **actual workflow steps**—including data processing, analysis, and other key tasks—are conducted and clearly demonstrated on this page. The goal is to show the technical flow of your project, highlighting how the code is executed to achieve your results.

If relevant, link to additional documentation or external references that explain any complex components. This section should give readers a clear view of how the project is implemented from a technical perspective.

Remember, this page is a technical narrative, NOT just a notebook with a collection of code cells. Include in-line prose to describe what is going on.