# unsupervised learning pipeline

## import dependencies

In [1]:
from hdbscan import HDBSCAN
import numpy as np
import pandas as pd
import re
from transformers import AutoTokenizer, AutoModel
import torch
from tqdm import tqdm
from umap import UMAP
import warnings

# mute deprecation warning raised by specific versions of the packages
warnings.filterwarnings('ignore', category=FutureWarning, message="'force_all_finite' was renamed to 'ensure_all_finite'")

## test GPU availability

Check that a GPU is available and properly configured for use with the BERT model. The dependencies listed in requirements.txt are the versions that work with my GPU hardware. If your hardware configuration is different, that is fine: the relevant functions will switch to CPU processing, which still works but may take longer. If the printed output below is "cuda", you are using the GPU; if it is "cpu", BERT will run on your CPU. Either way, the notebook will still execute.

In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


## load model parameters defined through gridsearch

The gridsearch that produced these parameters is not included in this notebook because it is time and GPU intensive. However, it is included in the project and may be rerun by running the file `_03_unserpervised_pipeline_gridsearch.py` in the ./scripts folder. The final parameters were chosen based on a combination of the highest cluster purity with respect to the input categories and the lowest overall noise.
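
The two selection criteria can be made concrete with a small sketch. The exact scoring used by the gridsearch script is not shown in this notebook, so the definitions here are assumptions: purity is taken as the share of clustered documents that carry their cluster's most common category, and noise as the share of documents labeled `-1`. The function name `purity_and_noise` and the toy data are hypothetical.

```python
from collections import Counter


def purity_and_noise(cluster_labels, category_labels):
    """Return (purity over non-noise points, fraction of noise points)."""
    per_cluster = {}
    n_noise = 0
    for cluster, category in zip(cluster_labels, category_labels):
        if cluster == -1:  # HDBSCAN marks noise with -1
            n_noise += 1
            continue
        per_cluster.setdefault(cluster, Counter())[category] += 1

    n_clustered = sum(sum(c.values()) for c in per_cluster.values())
    # For each cluster, count the documents that carry its majority category
    n_majority = sum(c.most_common(1)[0][1] for c in per_cluster.values())

    n_total = len(cluster_labels)
    purity = n_majority / n_clustered if n_clustered else 0.0
    noise = n_noise / n_total if n_total else 0.0
    return purity, noise


# Toy example: two clusters plus one noise point
toy_clusters = [0, 0, 0, 1, 1, -1]
toy_categories = ["earn", "earn", "acq", "grain", "grain", "trade"]
purity, noise = purity_and_noise(toy_clusters, toy_categories)
print(f"purity={purity:.2f}, noise={noise:.2f}")
```

A configuration that scores well on purity alone can do so by pushing most documents into noise, which is why the two numbers are read together.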

The hyperparameters tested in the original gridsearch are defined as follows:

* `n_components` is the number of dimensions UMAP reduces the data to - it sets the dimensionality of the output space, so a value of 20 projects the high-dimensional data into a 20-dimensional space

* `n_neighbors` controls how UMAP balances local versus global structure in the data - it sets the size of the local neighborhood used when constructing the high-dimensional graph; larger values put more emphasis on global structure, while smaller values (10 is the smallest value tested) emphasize local structure

* `min_dist` sets the minimum distance between points in the low-dimensional representation for UMAP - a value of 0.2 allows moderately tight packing while still keeping some separation between distinct groups

* `metric` (for UMAP) defines how distances between points are calculated in the original space - 'cosine' is based on the angle between vectors rather than their magnitude, which suits high-dimensional text embeddings

* `min_cluster_size` is the smallest group HDBSCAN will report as a cluster - with a value of 10, groups of fewer than 10 points are not kept as clusters of their own

* `min_samples` is the number of neighbors HDBSCAN uses to compute each point's core distance - larger values make the clustering more conservative, so more points are labeled as noise

* `cluster_selection_epsilon` is a distance threshold for merging in HDBSCAN - clusters separated by less than 0.1 distance units are merged rather than split further

* `metric` (for HDBSCAN) defines how distances between points are calculated during clustering - 'euclidean' measures straight-line distances between points in the transformed space

* `cluster_selection_method` specifies how HDBSCAN selects which clusters to keep - 'eom' (Excess of Mass) selects the clusters that are most stable across different density thresholds

And the values tested are:

```python
param_grid = {
	'umap': {
		'n_components': [10, 20, 30],
		'n_neighbors': [10, 15, 20],
		'min_dist': [0.1, 0.2],
		'metric': ['cosine']
	},
	'hdbscan': {
		'min_cluster_size': [5, 10, 15],
		'min_samples': [3, 5, 7],
		'cluster_selection_epsilon': [0.1, 0.2, 0.3],
		'metric': ['euclidean'],
		'cluster_selection_method': ['eom']
	}
}
```
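
To get a feel for the size of this search, the grid can be expanded with `itertools.product`. This is only a sketch: it assumes the search is exhaustive over every UMAP and HDBSCAN combination, which this notebook does not confirm, and the helper name `expand` is hypothetical.

```python
from itertools import product

param_grid = {
    'umap': {
        'n_components': [10, 20, 30],
        'n_neighbors': [10, 15, 20],
        'min_dist': [0.1, 0.2],
        'metric': ['cosine'],
    },
    'hdbscan': {
        'min_cluster_size': [5, 10, 15],
        'min_samples': [3, 5, 7],
        'cluster_selection_epsilon': [0.1, 0.2, 0.3],
        'metric': ['euclidean'],
        'cluster_selection_method': ['eom'],
    },
}


def expand(grid):
    """Turn {name: [values]} into a list of {name: value} dicts."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


umap_configs = expand(param_grid['umap'])
hdbscan_configs = expand(param_grid['hdbscan'])
all_configs = [{'umap': u, 'hdbscan': h}
               for u, h in product(umap_configs, hdbscan_configs)]

print(len(umap_configs), len(hdbscan_configs), len(all_configs))
```

Under that assumption, each UMAP projection can be computed once and reused for every HDBSCAN setting, since the clustering step is much cheaper than the embedding and reduction steps.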

In [3]:
model_params = {
   'umap': {
       'n_components': 20,  # Number of dimensions to reduce to
       'n_neighbors': 10,   # Size of local neighborhood
       'min_dist': 0.2,     # Minimum distance between points in low dimensional space
       'metric': 'cosine'   # Distance metric for comparing points
   },
   'hdbscan': {
       'min_cluster_size': 10,  # Minimum size for a cluster
       'min_samples': 5,        # Number of samples in neighborhood for core points
       'cluster_selection_epsilon': 0.1,  # Clusters closer than this are merged, not split
       'metric': 'euclidean',   # Distance metric for comparing points
       'cluster_selection_method': 'eom'  # Method for selecting clusters
   }
}

## load and preprocess the reuters data

For this project we will be using the Reuters-21578 dataset, which is available via NLTK. The data has already been loaded and saved in the data folder.

The Reuters-21578 dataset is one of the most widely used datasets for text classification 
and natural language processing (NLP) research. It consists of real news articles that 
appeared on the Reuters newswire in 1987.

Key characteristics:
- The full collection contains 21,578 news documents from the Reuters newswire in 1987; the NLTK version used in this notebook contains 10,788 of them
- Articles are labeled with categories (topics) such as 'earn' (earnings), 'acq' (acquisitions), 'grain', 'crude' (crude oil), etc.
- Documents can belong to multiple categories (multi-label classification)
- Highly imbalanced dataset: some categories have many documents while others have very few
- Contains both training and test splits in the original dataset

The Reuters-21578 categories were manually assigned by personnel from Reuters Ltd. 
and Carnegie Group, Inc. during the creation of the CONSTRUE text categorization system. 
The process involved:

1. Personnel from Reuters Ltd. (journalists/editors) originally tagged stories with 
   topic codes during their normal workflow

2. The categorization was later refined by Carnegie Group Inc. personnel to create 
   a more structured and consistent set of categories

The TOPICS categories are the most commonly used in research and benchmarking, as they 
were applied more consistently than the other category types.

Note: The original labeling process was designed for real-world news categorization 
rather than creating a perfect machine learning dataset, which explains some of the 
dataset's inherent biases and inconsistencies.

In this notebook, we'll be using NLTK's version of the Reuters corpus, which can be 
accessed using the nltk.corpus.reuters interface.
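
The script that wrote the parquet file is not part of this notebook, so the following is only a sketch of how a frame with `id`, `text` and `categories` columns could be built from the NLTK reader. The function names `corpus_to_frame` and `load_reuters_from_nltk` and the `TinyCorpus` stand-in are hypothetical; the stand-in exists so the example runs without downloading the corpus.

```python
import pandas as pd


def corpus_to_frame(corpus):
    """Build a DataFrame from any reader exposing fileids(), raw() and categories()."""
    rows = [
        {"id": fid, "text": corpus.raw(fid), "categories": corpus.categories(fid)}
        for fid in corpus.fileids()
    ]
    return pd.DataFrame(rows, columns=["id", "text", "categories"])


def load_reuters_from_nltk():
    """Download the corpus if needed and convert it (requires network access)."""
    import nltk
    nltk.download("reuters", quiet=True)
    from nltk.corpus import reuters
    return corpus_to_frame(reuters)


class TinyCorpus:
    """Offline stand-in with the same three methods as the NLTK reader."""
    _docs = {
        "test/1": ("GRAIN STOCKS FALL", ["grain"]),
        "training/2": ("OIL AND GAS DEMAND", ["crude", "nat-gas"]),
    }

    def fileids(self):
        return list(self._docs)

    def raw(self, fid):
        return self._docs[fid][0]

    def categories(self, fid):
        return self._docs[fid][1]


toy_df = corpus_to_frame(TinyCorpus())
print(toy_df)
```

The `test/` and `training/` prefixes in the ids carry the original train/test split, so no separate split column is needed.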

In [4]:
def load_and_preprocess_data(file_path):
    df = pd.read_parquet(file_path)
    def clean_text(text):
        # Remove extra whitespace and standardize text
        text = re.sub(r'\s+', ' ', str(text).strip())
        text = re.sub(r'[^\w\s.,!?-]', '', text)
        return text.lower()
    df['cleaned_text'] = df['text'].apply(clean_text)
    return df

df = load_and_preprocess_data('../data/reuters_data.parquet')

display(df)

Unnamed: 0,id,text,categories,cleaned_text
0,test/14826,ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RI...,[trade],asian exporters fear damage from u.s.-japan ri...
1,test/14828,CHINA DAILY SAYS VERMIN EAT 7-12 PCT GRAIN STO...,[grain],china daily says vermin eat 7-12 pct grain sto...
2,test/14829,JAPAN TO REVISE LONG-TERM ENERGY DEMAND DOWNWA...,"[crude, nat-gas]",japan to revise long-term energy demand downwa...
3,test/14832,THAI TRADE DEFICIT WIDENS IN FIRST QUARTER\n ...,"[corn, grain, rice, rubber, sugar, tin, trade]",thai trade deficit widens in first quarter tha...
4,test/14833,INDONESIA SEES CPO PRICE RISING SHARPLY\n Ind...,"[palm-oil, veg-oil]",indonesia sees cpo price rising sharply indone...
...,...,...,...,...
10783,training/999,U.K. MONEY MARKET SHORTAGE FORECAST REVISED DO...,"[interest, money-fx]",u.k. money market shortage forecast revised do...
10784,training/9992,KNIGHT-RIDDER INC &lt;KRN> SETS QUARTERLY\n Q...,[earn],knight-ridder inc ltkrn sets quarterly qtly di...
10785,training/9993,TECHNITROL INC &lt;TNL> SETS QUARTERLY\n Qtly...,[earn],technitrol inc lttnl sets quarterly qtly div 1...
10786,training/9994,NATIONWIDE CELLULAR SERVICE INC &lt;NCEL> 4TH ...,[earn],nationwide cellular service inc ltncel 4th qtr...


## get embeddings via BERT

BERT (Bidirectional Encoder Representations from Transformers) embeddings are an excellent 
choice for this text clustering task for several reasons:

1. Contextual Understanding:
   - Unlike traditional word embeddings (e.g., Word2Vec), BERT generates contextualized 
     embeddings where the same word can have different representations based on its context
   - This is particularly valuable for news articles where words can have different 
     meanings in different contexts (e.g., "bank" in financial vs. geographical contexts)

2. Pre-trained Knowledge:
   - BERT is pre-trained on a massive corpus of text, giving it broad understanding of 
     language patterns and relationships
   - This transfer learning approach is especially useful for news article analysis as 
     it captures semantic relationships without requiring additional training

Model Choice: bert-base-uncased
- This is the standard BERT base model with 12 layers, 768 hidden dimensions
- 'Uncased' means the model treats uppercase and lowercase letters the same
- While there are larger BERT variants available, bert-base-uncased provides a good 
  balance between computational efficiency and performance
- The model produces 768-dimensional embeddings for each text input, capturing rich 
  semantic information suitable for clustering

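Note: We use the [CLS] token embedding (first token) as our document representation, 
which is a common practice for document-level tasks as it serves as an aggregate 
representation of the input. Inputs are truncated to 512 tokens (max_length in the 
code), so for longer articles the embedding reflects only the beginning of the text.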

In [5]:
def get_bert_embeddings(texts, batch_size=32):
    texts = [str(text) if text is not None else "" for text in texts]
    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    model = AutoModel.from_pretrained('bert-base-uncased')
    model = model.to(device)
    model.eval()
    all_embeddings = []

    for i in tqdm(range(0, len(texts), batch_size)):
        batch_texts = texts[i:i + batch_size]
        # Tokenize text and move to GPU if available
        inputs = tokenizer(batch_texts, padding=True, truncation=True,
                           max_length=512, return_tensors="pt")
        inputs = {k: v.to(device) for k, v in inputs.items()}
        # Generate embeddings without computing gradients
        with torch.no_grad():
            outputs = model(**inputs)
            embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()
            all_embeddings.append(embeddings)
    return np.vstack(all_embeddings)

# Generate BERT embeddings for the text
embeddings = get_bert_embeddings(df['cleaned_text'].values)

100%|██████████| 338/338 [01:14<00:00,  4.53it/s]
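
The batching and the `[:, 0, :]` slice can be checked without loading BERT. The sketch below replaces the model output with random NumPy arrays of the same shape (batch, sequence length, 768); the array contents and the small `n_docs` value are made up for illustration.

```python
import math

import numpy as np

hidden_size = 768
batch_size = 32
n_docs = 70          # small stand-in for the real corpus size
seq_len = 16         # stand-in for the padded token length of a batch

rng = np.random.default_rng(0)
all_embeddings = []
for start in range(0, n_docs, batch_size):
    n_in_batch = min(batch_size, n_docs - start)
    # Stand-in for outputs.last_hidden_state: one vector per token
    last_hidden_state = rng.normal(size=(n_in_batch, seq_len, hidden_size))
    # Position 0 along the token axis is the [CLS] token
    cls_embeddings = last_hidden_state[:, 0, :]
    all_embeddings.append(cls_embeddings)

stacked = np.vstack(all_embeddings)
n_batches = math.ceil(n_docs / batch_size)
print(stacked.shape, n_batches)
```

The same arithmetic accounts for the progress bar: with 10,788 documents and a batch size of 32, `math.ceil(10788 / 32)` gives the 338 batches reported.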


In [6]:
# convert embeddings to dataframe only for display purposes

embedding_df = pd.DataFrame(
    embeddings, 
    columns=[f'dim_{i}' for i in range(embeddings.shape[1])]
)

display(embedding_df)

display(embedding_df.describe())

Unnamed: 0,dim_0,dim_1,dim_2,dim_3,dim_4,dim_5,dim_6,dim_7,dim_8,dim_9,...,dim_758,dim_759,dim_760,dim_761,dim_762,dim_763,dim_764,dim_765,dim_766,dim_767
0,-0.428730,-0.025434,-0.264896,0.456063,-0.177185,-0.439705,-0.110295,0.354280,0.096356,0.210663,...,0.649765,-0.221645,0.208191,-0.476144,-0.237052,-0.140197,0.067187,0.211033,0.649412,-0.231024
1,-0.270779,-0.274225,-0.600579,-0.292290,-0.592692,0.153562,-0.284960,0.369683,0.486812,-0.087363,...,0.648648,-0.140062,-0.272591,-0.367131,-0.005673,0.045423,-0.427298,0.073786,0.740310,0.128450
2,-0.444324,-0.276958,-0.206134,-0.177583,-0.638809,-0.226003,-0.214179,0.131220,0.275722,-0.411958,...,0.937353,-0.435612,0.101641,-0.318186,0.256640,-0.743144,-0.190932,0.037883,0.984742,-0.113217
3,-0.837269,-0.159402,-0.122768,0.272326,-0.535650,0.644488,-0.094840,0.670715,-0.574275,0.320695,...,0.769111,-0.379672,0.243916,-0.081050,0.465016,-0.225803,-0.215873,-0.424787,0.548916,-0.197874
4,-0.525324,-0.333880,-0.081065,-0.035053,-0.434420,0.240422,-0.269029,0.873720,-0.156470,-0.045422,...,0.600770,-0.633934,-0.204564,-0.225983,0.274420,-0.200259,-0.188022,-0.084735,0.569763,-0.138966
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10783,-0.739016,-0.392456,0.440443,-0.377801,0.214573,0.005754,-0.173556,0.008574,-0.061628,-0.360064,...,0.452739,-0.028916,0.345560,-0.270563,0.119662,-0.393551,-0.296690,-0.532035,0.393636,0.314855
10784,-0.465597,-0.161368,0.359532,0.004244,-0.381905,-0.133427,0.040702,-0.069312,-0.028190,-0.114309,...,0.143969,-0.290041,0.211303,0.145902,0.401821,-0.109177,-0.337553,-0.258821,0.217003,0.316857
10785,-0.491718,-0.232266,0.537869,0.191433,-0.228181,-0.151771,0.049638,0.075680,-0.225646,0.120087,...,0.073498,-0.385664,0.170001,0.183848,0.637824,-0.007299,-0.257174,-0.232548,0.257113,0.388722
10786,-0.949895,0.108980,0.018789,-0.027704,-0.255346,0.442332,0.564583,0.139734,-0.672353,-0.050057,...,0.303461,-0.208623,0.205580,-0.155188,0.570183,0.011948,-0.335062,0.052731,0.344069,0.376916


Unnamed: 0,dim_0,dim_1,dim_2,dim_3,dim_4,dim_5,dim_6,dim_7,dim_8,dim_9,...,dim_758,dim_759,dim_760,dim_761,dim_762,dim_763,dim_764,dim_765,dim_766,dim_767
count,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0,...,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0
mean,-0.693534,-0.205948,0.023409,-0.027518,-0.131916,0.008834,-0.105544,0.228415,-0.031647,-0.098369,...,0.366819,-0.186784,0.156445,-0.274742,0.260788,-0.072981,-0.150026,-0.055484,0.475829,0.299041
std,0.253929,0.240211,0.374507,0.244759,0.288768,0.239949,0.30151,0.268116,0.31786,0.216575,...,0.27685,0.24155,0.229899,0.220447,0.298352,0.268296,0.244497,0.313928,0.276261,0.286421
min,-1.716896,-1.264565,-1.585952,-1.182613,-1.108657,-0.930997,-1.039647,-0.778194,-1.336819,-0.945767,...,-0.824152,-1.27799,-0.844779,-1.097836,-0.820896,-1.27123,-0.956422,-1.147288,-0.517364,-0.862748
25%,-0.858509,-0.358958,-0.222796,-0.174928,-0.327748,-0.145086,-0.322435,0.039761,-0.241382,-0.239823,...,0.175277,-0.334677,0.016401,-0.421553,0.055488,-0.244352,-0.325171,-0.258677,0.27626,0.123099
50%,-0.689825,-0.205472,0.032592,-0.013841,-0.125726,0.00714,-0.127739,0.199375,-0.032366,-0.09557,...,0.356986,-0.174598,0.169477,-0.269653,0.272866,-0.072376,-0.148268,-0.05996,0.472423,0.316509
75%,-0.521527,-0.041661,0.284502,0.136575,0.066607,0.159141,0.101994,0.395306,0.183722,0.044899,...,0.552796,-0.020091,0.309363,-0.123082,0.478782,0.102015,0.02021,0.157526,0.66748,0.49007
max,0.245685,0.735193,1.228458,0.838937,1.131221,1.005181,0.964238,1.406344,1.138857,0.812154,...,1.402199,0.758056,0.995694,0.510922,1.322496,0.98625,0.737226,1.323447,1.439965,1.308281


## reduce dimensionality via UMAP

UMAP (Uniform Manifold Approximation and Projection) is used here to reduce our 
high-dimensional BERT embeddings (768 dimensions) into a more manageable 
lower-dimensional space (20 dimensions) while preserving the essential structure 
of the data.

Why UMAP?
1. Better Preservation of Global Structure:
   - Compared with t-SNE, UMAP tends to retain more of the data's global structure while still preserving local structure
   - This is crucial for document clustering as we want to maintain relationships 
     between different topic groups

2. Computational Efficiency:
   - UMAP scales better to large datasets compared to alternatives like t-SNE
   - This is important for processing the entire Reuters corpus efficiently

3. Theoretical Foundation:
   - UMAP is based on rigorous mathematical foundations from manifold learning and 
     topological data analysis
   - This helps ensure the reduced representation is meaningful, not just visually appealing

Parameters Used:
- n_components=20: Reducing to 20 dimensions, balancing information retention with 
  dimensionality reduction
- n_neighbors=10: Each point is connected to its 10 nearest neighbors, helping preserve 
  local structure
- min_dist=0.2: Minimum distance between points in the embedded space, controlling how 
  tightly points cluster together
- metric='cosine': Using cosine distance, which is appropriate for comparing document 
  embeddings as it focuses on the angle between vectors rather than their magnitude

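The reduced 20-dimensional representation is the input for the subsequent clustering 
step: low enough for efficient processing, yet high enough to retain meaningful 
document relationships. The value 20 was selected by the gridsearch for this dataset 
and is not claimed to be optimal in general.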
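
The point about cosine distance ignoring magnitude can be seen in a few lines of NumPy. This is an illustrative sketch with made-up vectors, not part of the pipeline; the helper `cosine_distance` is hypothetical and uses the usual definition, one minus the cosine similarity.

```python
import numpy as np


def cosine_distance(a, b):
    """1 - cosine similarity; depends on the angle between a and b, not their length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


v = np.array([1.0, 2.0, 3.0])
scaled = 10 * v                      # same direction, ten times longer

same_direction = cosine_distance(v, scaled)
orthogonal = cosine_distance([1, 0], [0, 1])
euclidean_gap = np.linalg.norm(v - scaled)

print(same_direction, orthogonal, euclidean_gap)
```

Two vectors pointing the same way are at cosine distance of about zero even though their Euclidean distance is large, which is why cosine is used on the raw embeddings while Euclidean is used later in the reduced space.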

In [7]:
def reduce_dimensions(embeddings):
    umap_reducer = UMAP(
        n_neighbors=model_params['umap']['n_neighbors'],
        min_dist=model_params['umap']['min_dist'],
        n_components=model_params['umap']['n_components'],
        random_state=None,  # no fixed seed, so the projection differs between runs
        metric=model_params['umap']['metric'],
        n_jobs=-1  # use all available CPU cores
    )
    return umap_reducer.fit_transform(embeddings)

reduced_embeddings = reduce_dimensions(embeddings)

In [8]:
# Convert reduced embeddings to dataframe with named columns
reduced_df = pd.DataFrame(
    reduced_embeddings, 
    columns=[f'umap_{i+1}' for i in range(reduced_embeddings.shape[1])]
)

display(reduced_df)

display(reduced_df.describe())

Unnamed: 0,umap_1,umap_2,umap_3,umap_4,umap_5,umap_6,umap_7,umap_8,umap_9,umap_10,umap_11,umap_12,umap_13,umap_14,umap_15,umap_16,umap_17,umap_18,umap_19,umap_20
0,12.476774,4.926239,1.202698,5.324039,3.468304,5.008056,4.265051,4.689017,4.536188,5.364054,5.330823,3.957876,6.343245,4.701857,5.030842,4.333262,6.152179,4.144895,3.943902,4.267467
1,12.254752,5.031675,1.615729,5.519593,3.658180,5.043080,4.468949,4.520995,4.565976,6.085142,5.158059,4.690832,5.471339,4.786910,5.203535,4.093320,6.065857,4.852396,5.152416,4.747603
2,12.307848,4.975024,1.450275,5.367286,3.781982,5.225036,4.314219,4.350098,4.571009,5.383678,5.188679,4.353727,6.138043,4.673213,4.764834,4.998540,5.898718,4.813584,4.898196,4.119071
3,12.040392,5.037462,1.920964,5.376246,4.166490,5.279925,4.334654,3.963124,4.626873,5.630208,5.171674,4.706514,5.667783,4.690176,4.899860,4.914391,5.537569,5.639688,5.868374,4.008059
4,12.383485,4.991853,1.415640,5.469172,3.611956,5.110348,4.296072,4.039837,4.517943,5.894064,5.149813,4.588213,5.804625,4.749483,5.040839,4.594365,6.017574,5.055973,5.035311,4.490597
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10783,11.417279,5.007935,2.454906,4.951458,4.533519,5.443195,5.229127,3.269327,5.116947,4.776906,5.035627,5.218732,4.860031,4.864752,4.286323,6.847292,5.454393,5.763556,5.360132,4.321248
10784,3.811187,4.487108,-0.584398,4.619744,6.338072,5.086279,5.027823,4.203545,5.341569,4.658635,5.032262,5.546281,4.214554,7.547571,4.619164,5.419848,4.698947,5.400413,4.750975,5.197391
10785,3.798437,4.489888,-0.642198,4.609520,6.323380,5.059988,5.036307,4.205247,5.342998,4.671871,5.006425,5.535074,4.196317,7.543933,4.612348,5.410929,4.710127,5.410806,4.742738,5.224290
10786,6.098542,5.037541,8.219587,4.455418,5.344000,5.305983,5.506177,4.373201,5.405365,4.557479,5.369881,5.792155,4.187212,5.150856,5.586587,5.900385,4.288100,4.617467,5.387799,6.065273


Unnamed: 0,umap_1,umap_2,umap_3,umap_4,umap_5,umap_6,umap_7,umap_8,umap_9,umap_10,umap_11,umap_12,umap_13,umap_14,umap_15,umap_16,umap_17,umap_18,umap_19,umap_20
count,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0,10788.0
mean,10.050451,5.00563,3.145961,4.986564,5.157342,5.047472,5.027951,5.577701,5.002333,4.903574,5.214313,4.945123,5.010972,5.0683,5.113915,5.030876,5.411095,4.778346,4.724228,4.738302
std,2.817895,0.465276,2.423734,0.564765,1.812076,0.339891,0.734057,1.941386,0.402346,0.58727,0.239373,0.696642,0.875368,0.597563,0.502218,0.680785,0.812879,0.51628,0.632073,0.519363
min,-1.392936,-0.072942,-1.144218,-2.996494,2.775743,3.442166,1.902146,1.765009,1.93801,2.272973,1.357,2.928653,2.438301,2.161571,1.769426,3.326659,3.231033,2.778313,-0.22586,3.609665
25%,6.977,4.949847,1.641783,4.670503,3.898301,4.777057,4.481314,4.16596,4.711554,4.488091,5.158546,4.434127,4.358405,4.783869,4.750524,4.546239,4.741763,4.564437,4.18189,4.305992
50%,11.719476,5.031109,2.016366,5.177838,4.55426,5.072793,4.965954,4.584288,5.006377,4.691502,5.239709,4.903962,4.952112,4.968262,5.103028,4.971076,5.664508,4.777419,4.815386,4.720332
75%,12.154092,5.066708,4.958774,5.325003,5.863509,5.259517,5.289204,7.34866,5.213143,5.420791,5.312774,5.601209,5.70652,5.153558,5.46581,5.550923,6.034364,5.075306,5.158939,5.131131
max,13.018955,13.97749,10.174134,6.520395,10.153961,10.21861,9.632088,9.806961,6.989809,7.099588,5.821776,6.427012,6.723781,8.012016,6.70741,7.114976,6.723334,6.725153,7.116347,7.757644


## clustering via HDBSCAN

HDBSCAN is the final step in our pipeline, grouping similar documents together based 
on their UMAP-reduced BERT embeddings. It's particularly well-suited for this task 
for several reasons:

Why HDBSCAN?
1. Natural Pairing with UMAP:
   - HDBSCAN is a density-based clustering algorithm, and UMAP-reduced embeddings 
     are a common input for it
   - Distances and densities are easier to estimate in 20 dimensions than in the 
     original 768
   - UMAP does not completely preserve density, so cluster boundaries found in the 
     reduced space should be read as approximate

2. Advantages over Traditional Clustering:
   - Unlike k-means, HDBSCAN doesn't require specifying the number of clusters beforehand
   - Can identify noise points (outliers) rather than forcing them into clusters
   - Handles clusters of varying densities and shapes, important for text data where 
     topics may have different levels of cohesion

3. Hierarchical Nature:
   - Creates a hierarchy of clusters, allowing for analysis at different granularity levels
   - Particularly useful for news articles where topics might have subtopics

Parameters Used:
- min_cluster_size=10: Minimum number of documents to form a cluster
- min_samples=5: Number of neighbors used to compute each point's core distance; larger values label more points as noise
- cluster_selection_epsilon=0.1: Clusters separated by less than this distance are merged rather than split further
- metric='euclidean': Distance metric for comparing points in UMAP space
- cluster_selection_method='eom': 'Excess of Mass' method for selecting flat clusters

The combination of BERT → UMAP → HDBSCAN creates a powerful pipeline because:
- BERT captures semantic relationships in the text
- UMAP preserves these relationships while making the data tractable
- HDBSCAN can find natural groupings in this reduced space without forcing artificial 
  cluster boundaries

In [9]:
def perform_clustering(reduced_embeddings):
    clusterer = HDBSCAN(
        min_cluster_size=model_params['hdbscan']['min_cluster_size'],
        min_samples=model_params['hdbscan']['min_samples'],
        metric=model_params['hdbscan']['metric'],
        cluster_selection_method=model_params['hdbscan']['cluster_selection_method'],
        cluster_selection_epsilon=model_params['hdbscan']['cluster_selection_epsilon'],
        prediction_data=True,  # cache the data needed to assign new points later
        core_dist_n_jobs=-1,  # use all CPU cores for the core-distance computation
        allow_single_cluster=True  # permit a result that contains only one cluster
    )
    cluster_labels = clusterer.fit_predict(reduced_embeddings)
    # Calculate and display clustering statistics (label -1 marks noise)
    n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
    n_noise = list(cluster_labels).count(-1)
    print(f"clusters: {n_clusters}, noise points: {n_noise}")
    return cluster_labels

cluster_labels = perform_clustering(reduced_embeddings)

In [10]:
# Convert cluster labels to dataframe
cluster_df = pd.DataFrame(cluster_labels, columns=['cluster'])

display(cluster_df)

display(cluster_df.describe())

Unnamed: 0,cluster
0,-1
1,79
2,-1
3,-1
4,92
...,...
10783,65
10784,6
10785,6
10786,9


Unnamed: 0,cluster
count,10788.0
mean,41.298851
std,50.164608
min,-1.0
25%,-1.0
50%,11.0
75%,75.0
max,156.0
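
Because cluster ids are names rather than quantities, the mean and standard deviation above carry little meaning. What can be read from the table is that the 25% quantile is -1, so at least a quarter of the documents were labeled as noise, and that the largest id is 156, which corresponds to 157 clusters if ids are contiguous from 0. A more direct summary counts the labels, as in this sketch with made-up labels; the helper `summarize_labels` is hypothetical.

```python
import numpy as np


def summarize_labels(labels):
    """Count clusters, noise points and cluster sizes from HDBSCAN-style labels."""
    labels = np.asarray(labels)
    is_noise = labels == -1
    ids, counts = np.unique(labels[~is_noise], return_counts=True)
    return {
        "n_clusters": int(ids.size),
        "n_noise": int(is_noise.sum()),
        "noise_fraction": float(is_noise.mean()) if labels.size else 0.0,
        "sizes": dict(zip(ids.tolist(), counts.tolist())),
    }


toy_labels = [-1, 79, -1, -1, 92, 79, 6, 6, 9]
summary = summarize_labels(toy_labels)
print(summary)
```

Applied to `cluster_labels`, the same function would report the actual cluster count and noise share for this run.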


## create final DataFrame with UMAP components and cluster labels

In [11]:
umap_df = pd.DataFrame(
    reduced_embeddings,
    columns=[f'umap_{i+1}' for i in range(model_params['umap']['n_components'])]
)
umap_df['cluster_label'] = cluster_labels
final_df = pd.concat([df, umap_df], axis=1)
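
`pd.concat(..., axis=1)` matches rows by index label, not by position. It works in the cell above because both frames have the default 0..n-1 index. The sketch below, using made-up frames, shows what would happen if rows had been filtered out of the document frame first, and how `reset_index(drop=True)` restores positional alignment.

```python
import pandas as pd

docs = pd.DataFrame({"text": ["a", "b", "c", "d"]})
kept = docs[docs["text"] != "b"]                       # index is now 0, 2, 3
features = pd.DataFrame({"umap_1": [0.1, 0.2, 0.3]})   # index is 0, 1, 2

# Index labels differ, so concat pads with NaN instead of pairing row by row
misaligned = pd.concat([kept, features], axis=1)

# Resetting the index makes both frames 0..2, so rows pair up by position
aligned = pd.concat([kept.reset_index(drop=True), features], axis=1)

print(misaligned.shape, aligned.shape)
```

If a later version of the pipeline drops or reorders documents before embedding, resetting the index before the concat avoids silently misassigned cluster labels.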

## format and save results for later notebooks

In [12]:
# format the final_df for saving
export_df = final_df

# remove id and cleaned_text columns from export_df
export_df = export_df.drop(columns=['id', 'cleaned_text'])

# rename categories to category_labels
export_df = export_df.rename(columns={'categories': 'category_labels'})

# add new column category_label_1 that shows the first category in the list
export_df['category_label_1'] = export_df['category_labels'].apply(lambda x: x[0] if isinstance(x, np.ndarray) and x.size > 0 else None)

# reorder the columns: category_labels, category_label_1, cluster_label, text, then the umap dimensions
cols = ['category_labels', 'category_label_1', 'cluster_label', 'text'] + [col for col in final_df.columns if col.startswith('umap_')]
export_df = export_df[cols]

# save
export_df.to_parquet('../data/reuters_with_clusters.parquet')

# select 10 random rows from export_df (no fixed seed, so the sample changes between runs)
df_sample = export_df.sample(10)
df_sample.to_csv('../data/clustered_data_sample.csv', index=False)

display(export_df)

Unnamed: 0,category_labels,category_label_1,cluster_label,text,umap_1,umap_2,umap_3,umap_4,umap_5,umap_6,...,umap_11,umap_12,umap_13,umap_14,umap_15,umap_16,umap_17,umap_18,umap_19,umap_20
0,[trade],trade,-1,ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RI...,12.476774,4.926239,1.202698,5.324039,3.468304,5.008056,...,5.330823,3.957876,6.343245,4.701857,5.030842,4.333262,6.152179,4.144895,3.943902,4.267467
1,[grain],grain,79,CHINA DAILY SAYS VERMIN EAT 7-12 PCT GRAIN STO...,12.254752,5.031675,1.615729,5.519593,3.658180,5.043080,...,5.158059,4.690832,5.471339,4.786910,5.203535,4.093320,6.065857,4.852396,5.152416,4.747603
2,"[crude, nat-gas]",crude,-1,JAPAN TO REVISE LONG-TERM ENERGY DEMAND DOWNWA...,12.307848,4.975024,1.450275,5.367286,3.781982,5.225036,...,5.188679,4.353727,6.138043,4.673213,4.764834,4.998540,5.898718,4.813584,4.898196,4.119071
3,"[corn, grain, rice, rubber, sugar, tin, trade]",corn,-1,THAI TRADE DEFICIT WIDENS IN FIRST QUARTER\n ...,12.040392,5.037462,1.920964,5.376246,4.166490,5.279925,...,5.171674,4.706514,5.667783,4.690176,4.899860,4.914391,5.537569,5.639688,5.868374,4.008059
4,"[palm-oil, veg-oil]",palm-oil,92,INDONESIA SEES CPO PRICE RISING SHARPLY\n Ind...,12.383485,4.991853,1.415640,5.469172,3.611956,5.110348,...,5.149813,4.588213,5.804625,4.749483,5.040839,4.594365,6.017574,5.055973,5.035311,4.490597
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10783,"[interest, money-fx]",interest,65,U.K. MONEY MARKET SHORTAGE FORECAST REVISED DO...,11.417279,5.007935,2.454906,4.951458,4.533519,5.443195,...,5.035627,5.218732,4.860031,4.864752,4.286323,6.847292,5.454393,5.763556,5.360132,4.321248
10784,[earn],earn,6,KNIGHT-RIDDER INC &lt;KRN> SETS QUARTERLY\n Q...,3.811187,4.487108,-0.584398,4.619744,6.338072,5.086279,...,5.032262,5.546281,4.214554,7.547571,4.619164,5.419848,4.698947,5.400413,4.750975,5.197391
10785,[earn],earn,6,TECHNITROL INC &lt;TNL> SETS QUARTERLY\n Qtly...,3.798437,4.489888,-0.642198,4.609520,6.323380,5.059988,...,5.006425,5.535074,4.196317,7.543933,4.612348,5.410929,4.710127,5.410806,4.742738,5.224290
10786,[earn],earn,9,NATIONWIDE CELLULAR SERVICE INC &lt;NCEL> 4TH ...,6.098542,5.037541,8.219587,4.455418,5.344000,5.305983,...,5.369881,5.792155,4.187212,5.150856,5.586587,5.900385,4.288100,4.617467,5.387799,6.065273
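
The `category_label_1` column relies on the categories arriving as NumPy arrays, which is how they come back from the parquet file here. If the same step were run on a frame whose categories are plain Python lists, the `isinstance(x, np.ndarray)` check would return `None` for every row. A slightly more permissive helper is sketched below; the name `first_label` is hypothetical and it is not the code used above.

```python
import numpy as np


def first_label(labels):
    """Return the first category, accepting arrays, lists or tuples; None if empty or missing."""
    if isinstance(labels, (list, tuple, np.ndarray)) and len(labels) > 0:
        return labels[0]
    return None


print(first_label(np.array(["crude", "nat-gas"])))
print(first_label(["earn"]))
print(first_label([]))
```

Note that for multi-label documents the first category is simply the first in the stored order, so `category_label_1` is a convenience label rather than a primary topic.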


## next steps

We will now examine the results in much greater detail in the analysis_of_results.ipynb notebook.