# K-Means Clustering Tutorial

This tutorial demonstrates how to use the Lexos `KMeans` class with a preprocessed document-term matrix (DTM). You will learn how to perform clustering, visualize the results, and export your outputs for further analysis.

## Load Your Data

In this step, we load all the text files that you want to cluster. Each file is treated as one document.

The code looks for `.txt` files in the `FilesToUse` folder and loads them in alphabetical order.

You'll see confirmation of how many documents were loaded and their labels.

In [None]:
import glob
from lexos.io.loader import Loader

# Get a sorted list of all text files in the "FilesToUse" directory
file_paths = sorted(glob.glob("FilesToUse/*.txt"))

# Load the documents using the Loader class
loader = Loader()
loader.load(paths=file_paths)

print(f"Loaded {len(loader.texts)} documents: {loader.names}")
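The `Loader` call above does the reading for you. As a rough illustration of what "load every `.txt` file in alphabetical order" amounts to, here is a standard-library sketch that does not use Lexos at all. The temporary folder, file names, and file contents are made up for the demonstration.

```python
import glob
import os
import tempfile

# Create a throwaway folder with three small files (hypothetical names)
folder = tempfile.mkdtemp()
for name, text in [("b.txt", "second"), ("a.txt", "first"), ("c.txt", "third")]:
    with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
        f.write(text)

# sorted() makes the order deterministic; glob alone guarantees no ordering
file_paths = sorted(glob.glob(os.path.join(folder, "*.txt")))

# Use each file name (without its extension) as the document label
names = [os.path.splitext(os.path.basename(p))[0] for p in file_paths]

# Read each file's contents in the same order as the labels
texts = []
for path in file_paths:
    with open(path, encoding="utf-8") as f:
        texts.append(f.read())

print(f"Loaded {len(texts)} documents: {names}")
```

Sorting matters because cluster assignments are reported in the order the documents were submitted, so a stable order keeps labels and results aligned between runs.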

Now we will convert our texts into spaCy Docs. Since our test files are novels, we'll increase the maximum character length to 2,000,000.

In [None]:
from lexos.tokenizer import Tokenizer

tokenizer = Tokenizer()
docs = tokenizer.make_docs(texts=loader.texts, max_length=2_000_000)

# Convert the generator to a list
docs = list(docs)

## Build the Document-Term Matrix (DTM)

Now we turn our loaded text documents into a Document-Term Matrix (DTM) using the Lexos `DTM` class.

Each row in the DTM represents a term, and each column represents a document. The values inside the matrix tell us how important each word is in each document.

For this example, we'll customize how the matrix is built using the following settings:

- `tf_type="linear"`: Uses raw term frequencies (how often a word appears).
- `idf_type="smooth"`: Reduces the impact of words that appear in many documents.
- `norm="l2"`: Normalizes values for fair comparison across different-length documents.
- `min_df=2`: Removes words that appear in fewer than 2 documents.
- `max_n_terms=100`: Keeps only the top 100 most important terms.

The `.to_df()` function converts the matrix into a readable table.

In [None]:
# Import the DTM class
from lexos.dtm import DTM

# Create a Document-Term Matrix (DTM) with specific settings
dtm = DTM(
    tf_type="linear",
    idf_type="smooth",
    norm="l2",
    min_df=2,
    max_n_terms=100,
)

# Build the DTM using the loaded documents and their labels
dtm(docs=docs, labels=loader.names)

# Convert the DTM to a DataFrame for easier viewing and analysis
df = dtm.to_df()
df.head()
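To make the weighting settings above more concrete, the following pure-Python sketch applies raw term counts, a smoothed inverse document frequency, L2 normalization, and a `min_df` filter to three made-up sentences. It is not the Lexos implementation: the smoothing formula used here is the one from scikit-learn's `TfidfTransformer`, which is an assumption, and Lexos may differ in detail. The `max_n_terms` setting is left out for brevity.

```python
import math
from collections import Counter

# Three tiny tokenized "documents" (made-up text for illustration)
docs = [
    "the whale swam past the ship".split(),
    "the ship sailed at dawn".split(),
    "a whale is not a fish".split(),
]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc))

# min_df=2: keep only terms found in at least 2 documents
vocab = sorted(term for term, count in df.items() if count >= 2)

# Smoothed IDF (scikit-learn's convention; assumed here, not taken from Lexos)
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def weigh(doc):
    """Return the L2-normalized tf-idf weights of one document."""
    tf = Counter(doc)  # tf_type="linear": raw counts
    raw = [tf[term] * idf[term] for term in vocab]
    length = math.sqrt(sum(v * v for v in raw))  # norm="l2"
    return [v / length if length else 0.0 for v in raw]

weights = [weigh(doc) for doc in docs]
print(vocab)
for doc_weights in weights:
    print([round(v, 3) for v in doc_weights])
```

After normalization every document vector has length 1, so a long novel and a short one can be compared on the basis of their relative word usage.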

## Perform K-Means Clustering

Now we’re ready to use the k-means clustering module from Lexos. This tool will analyze your document-term matrix and group the documents into clusters based on how similar their word usage is.

Here, we create a `KMeans` object, passing it the DTM we created earlier, the document labels, and our chosen number of clusters (_k_). The clustering produces an array in which each item corresponds to one of our documents (in the order submitted), and its value is the number of the cluster to which that document has been assigned.

In [None]:
# Import the KMeans class
from lexos.cluster import KMeans

labels = dtm.labels  # Get the labels from the DTM

# Create the clustering object using our DTM and 4 clusters
kmeans = KMeans(dtm=dtm, labels=labels, k=4)

You can save the data from your cluster analysis to a CSV file with `kmeans.to_csv("filename.csv")`.
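If you are curious about what happens inside a k-means run, the sketch below implements the basic textbook procedure (Lloyd's algorithm) in plain Python: assign each point to its nearest centroid, move each centroid to the mean of its members, and repeat until the centroids stop moving. This is only an illustration under simplifying assumptions (random initialization, toy 2D points); it is not the code Lexos runs, and the function names are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=300, tol=1e-4, seed=42):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # "random" initialization
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        shift = sum(sq_dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:  # centroids have (almost) stopped moving
            break
    return labels, centroids

# Two well-separated groups of toy points
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The returned `labels` list has one entry per point, in the order submitted, which mirrors the array of cluster assignments described above.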

### `KMeans` Settings

The `KMeans` class allows you to fine-tune how the clustering is performed by setting these optional parameters:

- `k`: The number of clusters to create.
- `init`: The initialization strategy, which can be "k-means++" or "random". "k-means++" selects initial cluster centers using a weighted probability distribution to speed up convergence, which helps to constrain the initial placement of the centroids. The "random" option chooses _k_ observations at random from the data to serve as the initial centroids. The default is "k-means++".
- `max_iter`: The maximum number of iterations of the k-means algorithm for a single run. The default is 300.
- `n_init`: The number of times (N) the k-means algorithm will be run with different centroid seeds. The final results will be the best output of those N consecutive runs. The default is 10.
- `tol`: The relative tolerance with respect to inertia to declare convergence. The default is 0.0001.
- `random_state`: A number to use as the initial seed to ensure that the results are reproducible. The default is 42.
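The "weighted probability distribution" behind the "k-means++" option can be sketched in a few lines. The version below is the basic textbook seeding: the first centroid is picked uniformly at random, and each later centroid is picked with probability proportional to its squared distance from the nearest centroid already chosen. Library implementations may refine this (for example by trying several candidates per step), so treat the function name and details here as illustrative assumptions.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, seed=42):
    rng = random.Random(seed)  # fixed seed, like random_state, for reproducibility
    centroids = [rng.choice(points)]  # first centroid: uniform random pick
    while len(centroids) < k:
        # Weight each point by its squared distance to the nearest chosen centroid;
        # points that are already centroids get weight 0 and cannot be picked again
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids = kmeans_pp_init(points, k=3)
print(centroids)
```

Because far-away points are much more likely to be chosen, the starting centroids tend to be spread across the data rather than bunched together.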

Try changing some of the settings in the previous cell to see how they affect the clustering.

### Choosing the Best Number of Clusters (k)

K-means clustering requires you to choose a number of clusters (_k_) before you begin your analysis. We can decide on an arbitrary number, or we can leverage some statistical guidance. One way to do this is by using an elbow plot.

The elbow plot shows how much better the clustering gets as we increase _k_. At some point, the improvement slows down — this point is called the elbow, and it's usually a good choice for _k_.

You can generate an elbow plot with the `elbow_plot()` method, submitting a range of values for _k_ to try. The example below uses `range(1, 10)`, which covers 1 through 9 because Python ranges exclude the end value.

In [None]:
kmeans.elbow_plot(k_range=range(1, 10))

In the graph above:

- The x-axis shows the number of clusters (_k_) we tried.
- The y-axis shows the inertia (or within-cluster sum of squares), which measures how compact the clusters are.
- Lower values of inertia mean tighter, more defined clusters.
- The "elbow" is where the curve sharply changes direction — it’s the point beyond which adding more clusters doesn't significantly reduce inertia.

In our case, the elbow occurs at `k = 4`, meaning that 4 clusters is a good balance between under- and over-clustering.
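Inertia itself is simple to compute by hand. The sketch below, which uses made-up 2D points and hand-picked cluster assignments rather than Lexos output, sums the squared distance from every point to the mean of its cluster and compares the totals for one, two, and three clusters.

```python
def inertia(points, labels):
    """Within-cluster sum of squared distances to each cluster's mean."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(axis) / len(members) for axis in zip(*members)]
        total += sum(
            sum((x - c) ** 2 for x, c in zip(p, centroid)) for p in members
        )
    return total

# Two obvious groups of toy points
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

one_cluster = inertia(points, [0, 0, 0, 0, 0, 0])     # about 302.67
two_clusters = inertia(points, [0, 0, 0, 1, 1, 1])    # about 2.67
three_clusters = inertia(points, [0, 0, 2, 1, 1, 1])  # about 1.83

print(round(one_cluster, 2), round(two_clusters, 2), round(three_clusters, 2))
```

Going from one cluster to two cuts inertia dramatically, while adding a third barely helps. That sharp bend followed by a flat stretch is the elbow, here at `k = 2` for the toy data.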

## Visualize Clusters

Lexos provides three methods of visualizing the results of a k-means cluster analysis. In each case, Lexos first applies PCA (Principal Component Analysis) to reduce the dimensions of the data so it can be viewed in a 2D or 3D graph.
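As a rough picture of the PCA step, the following NumPy sketch reduces a small made-up matrix (5 documents, 4 features) to 2D coordinates using a singular value decomposition. This is one standard way to compute PCA and is offered as an assumption-laden illustration; how Lexos performs the reduction internally is not shown here.

```python
import numpy as np

# Toy data: 5 "documents" described by 4 features (made-up values)
X = np.array([
    [2.0, 0.0, 1.0, 3.0],
    [1.0, 1.0, 0.0, 2.0],
    [0.0, 3.0, 2.0, 0.0],
    [1.0, 2.0, 3.0, 1.0],
    [3.0, 0.0, 0.0, 4.0],
])

# Center each feature so the components describe variation around the mean
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal axes, ordered by decreasing singular value
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the first two axes to get 2D plotting coordinates
coords_2d = X_centered @ Vt[:2].T

# Share of the total variance captured by each of the two axes
explained = (S**2 / np.sum(S**2))[:2]

print(coords_2d.shape)
print(explained)
```

The `explained` values indicate how much of the original variation survives in the plot. If they are low, points that look close together in 2D may be farther apart in the full feature space.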

### Scatter Plots

Scatter plots represent document clusters as dots in Cartesian space. Each dot is a document, and colors represent clusters.

You can create a two-dimensional scatter plot as shown below:

In [None]:
kmeans.scatter(dim=2, title="KMeans Clustering 2D Plot", show=True)

Run your mouse over the plot to see the toolbar, which includes an option to save the image as a `.png` file.

Set `show=False` if you want to save the plot to a variable or file without displaying it.

You can save the image programmatically with `kmeans.save("filename.png")`. The format of your output (e.g. `.jpg`, `.pdf`, `.svg`) will be determined by the extension in your filename. You can also set `html=True` to save your image as a web page (e.g. `kmeans.save("filename.html", html=True)`).

In some cases, documents in a two-dimensional plot can overlap, making the plot hard to read. In this case, you might want to try a three-dimensional plot. You can do this by setting `dim=3`:

In [None]:
kmeans.scatter(dim=3, title="KMeans Clustering 3D Plot", show=True)

The toolbar in three-dimensional plots has additional options for helping you manipulate the image.

You can also save the image using the same methods described above.

### Voronoi Diagrams

A Voronoi diagram shows how the clustering algorithm "divides" the PCA-reduced 2D space into decision regions.

Each colored area represents the region closest to one of the cluster centroids. The black X's mark the calculated centroids. Your documents are plotted as colored dots based on which cluster they belong to.

This visualization is useful for understanding how clearly separated your clusters are — especially when the groups overlap or are close together.
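The decision regions described above follow one rule: every location in the plane belongs to whichever centroid is nearest. The sketch below applies that rule to a coarse grid and prints the region number at each grid position. The centroid coordinates are invented for the example, and the function name is hypothetical.

```python
def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to the given point."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

# Three made-up centroids in 2D (not taken from a real clustering)
centroids = [(0.0, 0.0), (4.0, 0.0), (2.0, 4.0)]

# Print a text-only "diagram": each character is the region owning that grid cell
for y in range(4, -1, -1):
    print("".join(str(nearest_centroid((x, y), centroids)) for x in range(5)))
```

A plotted Voronoi diagram does the same thing at a much finer resolution and colors each region instead of printing a digit.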

You can generate a Voronoi diagram as follows:

In [None]:
kmeans.voronoi(show=True)

As with scatter plots, there is a Plotly toolbar when you run your mouse over the diagram.

You can also save the image using the same methods described above.