

This section clones the nlp-snippets repository from GitHub, which provides the data/news_data.csv file read later in the notebook.
After changing into the cloned directory, the pwd command prints the current directory to verify the working directory.


In [None]:
! git clone https://github.com/dylanjcastillo/nlp-snippets.git


Cloning into 'nlp-snippets'...
remote: Enumerating objects: 126, done.
remote: Counting objects: 100% (126/126), done.
remote: Compressing objects: 100% (96/96), done.
remote: Total 126 (delta 54), reused 82 (delta 23), pack-reused 0 (from 0)
Receiving objects: 100% (126/126), 2.66 MiB | 9.61 MiB/s, done.
Resolving deltas: 100% (54/54), done.


In [None]:
# "!cd" runs in a throwaway subshell and does not persist; the %cd magic does
%cd nlp-snippets/

In [None]:
!pwd

/content/nlp-snippets


In [None]:
import os
import random
import re
import string

import nltk
import numpy as np
import pandas as pd

from gensim.models import Word2Vec

from nltk import word_tokenize
from nltk.corpus import stopwords

from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_samples, silhouette_score

nltk.download("stopwords")
nltk.download("punkt")

SEED = 42
random.seed(SEED)
os.environ["PYTHONHASHSEED"] = str(SEED)
np.random.seed(SEED)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


os, random, re, string: Used for setting environment variables, seeding the random number generator, and text preprocessing (regular expressions and the punctuation list).
nltk: Natural language toolkit for text processing (tokenization, stopwords).
numpy & pandas: Used for numerical operations and handling data in tabular form (dataframes).
gensim: Provides word embedding models like Word2Vec for creating word vector representations.
scikit-learn: Contains clustering algorithms (e.g., MiniBatchKMeans) and evaluation metrics like silhouette scores.

In [None]:
def clean_text(text, tokenizer, stopwords):
    """Pre-process text and generate tokens

    Args:
        text: Text to tokenize.
        tokenizer: Callable that splits a string into tokens (e.g., word_tokenize).
        stopwords: Collection of tokens to drop.

    Returns:
        Tokenized text.
    """
    text = str(text).lower()  # Lowercase words
    text = re.sub(r"\[(.*?)\]", "", text)  # Remove [+XYZ chars] in content
    text = re.sub(r"\s+", " ", text)  # Remove multiple spaces in content
    text = re.sub(r"\w+…|…", "", text)  # Remove ellipsis (and last word)
    text = re.sub(r"(?<=\w)-(?=\w)", " ", text)  # Replace dash between words
    text = re.sub(
        f"[{re.escape(string.punctuation)}]", "", text
    )  # Remove punctuation

    tokens = tokenizer(text)  # Get tokens from text
    tokens = [t for t in tokens if t not in stopwords]  # Remove stopwords
    tokens = [t for t in tokens if not t.isdigit()]  # Remove digits
    tokens = [t for t in tokens if len(t) > 1]  # Remove short tokens
    return tokens


In [None]:
# Download the punkt_tab data that word_tokenize was missing
nltk.download("punkt_tab")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Downloads NLTK resources (stopwords and Punkt tokenizer) for text processing.


Purpose: clean_text pre-processes and tokenizes the text.

Steps:

Converts the text to lowercase.
Removes bracketed markers such as [+2186 chars], collapses repeated whitespace, drops each ellipsis together with the word it truncates, replaces dashes between words with spaces, and strips punctuation.
Tokenizes the text using the specified tokenizer (e.g., NLTK word_tokenize).
Removes stopwords.
Filters out digit-only and single-character tokens.
Returns: A list of cleaned tokens.
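To make the cleaning steps concrete, here is a small sketch that applies the same regular expressions to one string. The sample sentence and the one-word stopword set are made up for illustration, and str.split stands in for word_tokenize so that no NLTK downloads are needed.

```python
import re
import string

# Hypothetical sample in the style of the news data (not taken from the dataset)
sample = "Breaking: the Stock-market falls 3%  today, analysts sa… [+2186 chars]"

text = sample.lower()
text = re.sub(r"\[(.*?)\]", "", text)        # drop "[+2186 chars]"
text = re.sub(r"\s+", " ", text)             # collapse repeated whitespace
text = re.sub(r"\w+…|…", "", text)           # drop the ellipsis and the cut-off word "sa"
text = re.sub(r"(?<=\w)-(?=\w)", " ", text)  # "stock-market" -> "stock market"
text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)  # strip punctuation

# Assumption: a plain whitespace split replaces word_tokenize in this sketch
demo_stopwords = {"the"}
tokens = [t for t in text.split() if t not in demo_stopwords]
tokens = [t for t in tokens if not t.isdigit() and len(t) > 1]

print(tokens)
# ['breaking', 'stock', 'market', 'falls', 'today', 'analysts']
```

Note that the truncated word before the ellipsis is discarded on purpose: it is usually an incomplete token that would only add noise to the vocabulary.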

In [None]:
custom_stopwords = set(stopwords.words("english") + ["news", "new", "top"])
text_columns = ["title", "description", "content"]

df_raw = pd.read_csv("data/news_data.csv")
df = df_raw.copy()
df["content"] = df["content"].fillna("")

for col in text_columns:
    df[col] = df[col].astype(str)

# Create text column based on title, description, and content
df["text"] = df[text_columns].apply(lambda x: " | ".join(x), axis=1)
df["tokens"] = df["text"].map(lambda x: clean_text(x, word_tokenize, custom_stopwords))

# Remove duplicates after preprocessing
_, idx = np.unique(df["tokens"], return_index=True)
df = df.iloc[idx, :]

# Remove empty values and keep relevant columns
df = df.loc[df.tokens.map(lambda x: len(x) > 0), ["text", "tokens"]]

docs = df["text"].values
tokenized_docs = df["tokens"].values

print(f"Original dataframe: {df_raw.shape}")
print(f"Pre-processed dataframe: {df.shape}")

Original dataframe: (10437, 15)
Pre-processed dataframe: (9882, 2)


Setting a fixed random seed (SEED) and using a single worker thread (workers=1) helps make Word2Vec training reproducible across runs.

Trains a Word2Vec model on the tokenized documents with 100-dimensional vectors.
The model learns vector representations of words from the contexts in which they co-occur in the documents.

In [None]:
model = Word2Vec(sentences=tokenized_docs, vector_size=100, workers=1, seed=SEED)

In [None]:
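import gensim.downloader as api

# Optional: pre-trained Google News vectors (a large download).
# The rest of this notebook uses the model trained above, not wv.
wv = api.load('word2vec-google-news-300')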



In [None]:
model.wv.most_similar("trump")


[('trumps', 0.9885427355766296),
 ('president', 0.9746479988098145),
 ('donald', 0.9274885654449463),
 ('ivanka', 0.9203841686248779),
 ('impeachment', 0.9195799827575684),
 ('pences', 0.9152251482009888),
 ('avlon', 0.9148187637329102),
 ('biden', 0.9145993590354919),
 ('breitbart', 0.9144167900085449),
 ('vice', 0.9067206978797913)]
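
The scores returned by most_similar are cosine similarities between word vectors. As a rough illustration of that measure, the sketch below computes it for plain Python lists. The helper name cosine_similarity and the toy vectors are assumptions made for this example; they are not part of gensim.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Same direction, different length: similarity is 1.0
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
# Orthogonal vectors: similarity is 0.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0
```

Because the measure ignores vector length, very high scores on a small corpus such as this one mostly indicate that the words appear in similar contexts, not that they are synonyms.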

Purpose: Converts the list of documents (tokenized) into vectors using the word embeddings.
Steps:
For each document, it creates a vector by averaging the vectors of the words in that document.
If no words in a document have embeddings (i.e., not in the model's vocabulary), it returns a zero vector.
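
The averaging step can be tried on a toy vocabulary before running it on the full model. In this sketch the dictionary toy_vectors and the helper average_vector are hypothetical stand-ins for the trained embeddings, using 3 dimensions instead of 100.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; the notebook's model uses 100 dimensions
toy_vectors = {
    "stock": np.array([1.0, 0.0, 2.0]),
    "market": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, vectors, size=3):
    """Average the vectors of known tokens; fall back to zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(size)

# "unseen" is not in the vocabulary, so it is skipped
print(average_vector(["stock", "market", "unseen"], toy_vectors))  # [2. 1. 1.]
# No known tokens at all: the zero vector is returned
print(average_vector(["unseen"], toy_vectors))                     # [0. 0. 0.]
```

Averaging discards word order and gives every word equal weight, which is a simple baseline for turning word vectors into a document vector.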

In [None]:
def vectorize(list_of_docs, model):
    """Generate vectors for list of documents using a Word Embedding

    Args:
        list_of_docs: List of documents
        model: Gensim's Word Embedding

    Returns:
        List of document vectors
    """
    features = []

    for tokens in list_of_docs:
        # Keep only the tokens that exist in the model's vocabulary
        vectors = [model.wv[token] for token in tokens if token in model.wv]
        if vectors:
            # Document vector = mean of its word vectors
            features.append(np.asarray(vectors).mean(axis=0))
        else:
            # No known tokens: fall back to a zero vector
            features.append(np.zeros(model.vector_size))
    return features

vectorized_docs = vectorize(tokenized_docs, model=model)
len(vectorized_docs), len(vectorized_docs[0])

(9882, 100)

Purpose: Clusters the document vectors with MiniBatchKMeans, a mini-batch variant of k-means that is efficient for large datasets.
Steps:
Fits MiniBatchKMeans with a specified number of clusters (k) and mini-batch size (mb).
Prints the overall silhouette coefficient and the inertia.
If print_silhouette_values is set, prints the size and the average, minimum, and maximum silhouette value of each cluster, sorted by average, to show how well separated each cluster is.
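
For intuition about the silhouette values, the following sketch computes the value of a single sample by hand, using the usual definition s = (b - a) / max(a, b), where a is the mean distance to the other members of the sample's own cluster and b is the smallest mean distance to any other cluster. The function and the 1-D toy points are illustrative only, and the sketch assumes every cluster has at least two members.

```python
def silhouette(points, labels, i):
    """Silhouette value of sample i for 1-D points: (b - a) / max(a, b)."""
    def mean_dist(cluster):
        # Mean absolute distance from sample i to the members of `cluster`,
        # excluding sample i itself
        members = [p for j, (p, lab) in enumerate(zip(points, labels))
                   if lab == cluster and j != i]
        return sum(abs(points[i] - p) for p in members) / len(members)

    a = mean_dist(labels[i])  # cohesion: distance to its own cluster
    b = min(mean_dist(c) for c in set(labels) if c != labels[i])  # separation
    return (b - a) / max(a, b)

points = [0.0, 1.0, 10.0, 11.0]

# Well-separated clusters: value close to 1
print(round(silhouette(points, [0, 0, 1, 1], 0), 2))  # 0.9
# Poor assignment: a negative value means another cluster is closer on average
print(round(silhouette(points, [0, 1, 0, 1], 0), 2))  # -0.4
```

Read this way, the negative Min values in the per-cluster report point to documents that sit closer to a neighbouring cluster than to their own.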

In [None]:
def mbkmeans_clusters(
    X,
    k,
    mb,
    print_silhouette_values,
):
    """Generate clusters and print Silhouette metrics using MBKmeans

    Args:
        X: Matrix of features.
        k: Number of clusters.
        mb: Size of mini-batches.
        print_silhouette_values: Print silhouette values per cluster.

    Returns:
        Trained clustering model and labels based on X.
    """
    km = MiniBatchKMeans(n_clusters=k, batch_size=mb).fit(X)
    print(f"For n_clusters = {k}")
    print(f"Silhouette coefficient: {silhouette_score(X, km.labels_):0.2f}")
    print(f"Inertia:{km.inertia_}")

    if print_silhouette_values:
        sample_silhouette_values = silhouette_samples(X, km.labels_)
        print("Silhouette values:")
        silhouette_values = []
        for i in range(k):
            cluster_silhouette_values = sample_silhouette_values[km.labels_ == i]
            silhouette_values.append(
                (
                    i,
                    cluster_silhouette_values.shape[0],
                    cluster_silhouette_values.mean(),
                    cluster_silhouette_values.min(),
                    cluster_silhouette_values.max(),
                )
            )
        silhouette_values = sorted(
            silhouette_values, key=lambda tup: tup[2], reverse=True
        )
        for s in silhouette_values:
            print(
                f"    Cluster {s[0]}: Size:{s[1]} | Avg:{s[2]:.2f} | Min:{s[3]:.2f} | Max: {s[4]:.2f}"
            )
    return km, km.labels_

In [None]:
clustering, cluster_labels = mbkmeans_clusters(
    X=vectorized_docs,
    k=50,
    mb=500,
    print_silhouette_values=True,
)
df_clusters = pd.DataFrame({
    "text": docs,
    "tokens": [" ".join(text) for text in tokenized_docs],
    "cluster": cluster_labels
})

For n_clusters = 50
Silhouette coefficient: 0.11
Inertia:3558.382223620125
Silhouette values:
    Cluster 42: Size:31 | Avg:0.33 | Min:0.04 | Max: 0.53
    Cluster 4: Size:100 | Avg:0.32 | Min:-0.16 | Max: 0.52
    Cluster 36: Size:145 | Avg:0.27 | Min:-0.04 | Max: 0.51
    Cluster 16: Size:110 | Avg:0.25 | Min:-0.02 | Max: 0.44
    Cluster 43: Size:85 | Avg:0.25 | Min:-0.02 | Max: 0.44
    Cluster 34: Size:80 | Avg:0.24 | Min:-0.00 | Max: 0.44
    Cluster 25: Size:35 | Avg:0.24 | Min:0.02 | Max: 0.43
    Cluster 11: Size:137 | Avg:0.24 | Min:-0.03 | Max: 0.45
    Cluster 33: Size:60 | Avg:0.23 | Min:-0.06 | Max: 0.46
    Cluster 24: Size:67 | Avg:0.22 | Min:-0.27 | Max: 0.47
    Cluster 44: Size:45 | Avg:0.21 | Min:-0.02 | Max: 0.41
    Cluster 1: Size:127 | Avg:0.21 | Min:-0.03 | Max: 0.41
    Cluster 41: Size:68 | Avg:0.21 | Min:-0.08 | Max: 0.43
    Cluster 35: Size:81 | Avg:0.20 | Min:-0.00 | Max: 0.41
    Cluster 17: Size:254 | Avg:0.20 | Min:-0.05 | Max: 0.39
    Cluster 7: Size

In [None]:
print("Most representative terms per cluster (based on centroids):")
for i in range(50):
    tokens_per_cluster = ""
    most_representative = model.wv.most_similar(positive=[clustering.cluster_centers_[i]], topn=5)
    for t in most_representative:
        tokens_per_cluster += f"{t[0]} "
    print(f"Cluster {i}: {tokens_per_cluster}")

Most representative terms per cluster (based on centroids):
Cluster 0: december lawsuits manhattan decided baker 
Cluster 1: leo delay jo mps referendum 
Cluster 2: professional edition expensive popular performance 
Cluster 3: obama emmanuel impeach whistleblowers congress 
Cluster 4: category humberto landfall charleston wrath 
Cluster 5: stabbing murdering neighbor convicted manslaughter 
Cluster 6: supply yields wireless flagship managers 
Cluster 7: hospital bomb dozens soldiers injuring 
Cluster 8: geneva escalation assembly countrys italy 
Cluster 9: prize amount walk formula born 
Cluster 10: proposal compromise 31st impasse reject 
Cluster 11: serial passenger shocked contained conducted 
Cluster 12: orleans television opens produced mayo 
Cluster 13: agree imran alliance abu deadline 
Cluster 14: regulator rome automaker berlin geneva 
Cluster 15: appearances 20th haul mcavoy april 
Cluster 16: squad qualifying warm foursomes finals 
Cluster 17: likes tips deals computers cof

Purpose: Identifies the most representative documents for a specific cluster (e.g., test_cluster = 29).
Method: Measures the Euclidean distance between each document vector and the centroid of the cluster, then sorts documents by closeness to the cluster's centroid.
Output: Displays the top 3 most representative documents for the chosen cluster.
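
The distance-and-sort step can be checked on a handful of made-up points. The 2-D array doc_vectors and the centroid below are hypothetical values chosen so the ranking is easy to verify by hand.

```python
import numpy as np

# Hypothetical 2-D document vectors and centroid, for illustration only
doc_vectors = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0], [0.5, 0.0]])
centroid = np.array([0.0, 0.0])

# Broadcasting subtracts the centroid from every row; axis=1 gives one norm per document
distances = np.linalg.norm(doc_vectors - centroid, axis=1)
ranking = np.argsort(distances)  # indices ordered from closest to farthest

print(ranking[:3])  # [0 3 2]
```

As in the notebook cell, this ranks every document against the centroid, so a document assigned to a different cluster could in principle appear in the list; filtering by cluster label first would restrict it to members only.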

In [None]:
test_cluster = 29
most_representative_docs = np.argsort(
    np.linalg.norm(vectorized_docs - clustering.cluster_centers_[test_cluster], axis=1)
)
for d in most_representative_docs[:3]:
    print(docs[d])
    print("-------------")

Netflix's 'Unbelievable' Is True Crime Meets 'True Detective', Perfect For 'Mindhunter' Fans | The limited series will premiere on September 13. | A new Netflix original series will dive deep into the trauma and legalities that come with a rape accusation. The emotional show, Unbelievable, will air on the streaming platform on September 13 as a limited series. Reviews are already calling it a true crime… [+2186 chars]
-------------
How Fast Fashion Is Destroying the Planet | In “Fashionopolis,” Dana Thomas exposes the environmental, economic and humanitarian hazards of cheap clothing production. | Among the books delights are Thomass sketches of her individual subjects. I cant get her description of a woman as peaches-and-cream pretty out of my head; I know exactly what she looks like. The author also has a gift for bringing luxury to life: She conjure… [+3349 chars]
-------------
The work of beloved TV artist Bob Ross is finally being recognized in an exhibition | If you've ever wante

Conclusion
The code demonstrates a typical document clustering pipeline: text preprocessing, training Word2Vec word embeddings, averaging them into document vectors, and clustering with MiniBatchKMeans.
Clusters are evaluated with silhouette scores, and the most representative terms and documents of each cluster help interpret what it contains.