
# Data Preparation and Clustering Analysis

This notebook demonstrates how to collect, preprocess, and analyze text data for clustering purposes. 
We will start by gathering data from a specified directory, followed by preprocessing the text, and finally applying different clustering algorithms. 
We will also compare the performance of various clustering methods.

## Sections Overview
1. **Data Collection**: Walk through directories to collect text data.
2. **Data Preprocessing**: Clean and prepare the text data.
3. **Clustering & Comparison**: Apply different clustering algorithms, visualize the results, and compare how the algorithms perform.


In [None]:
import os
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import pandas as pd
import numpy as np
import yaml
from sklearn.compose import ColumnTransformer
from sklearn.cluster import DBSCAN, Birch
from sklearn.decomposition import PCA

import nltk
from nltk.corpus import stopwords
from tqdm import tqdm  # Importing tqdm for progress bars

In [None]:
path = r"/home/kaiser/work/repos/obsidian"


## 1. Data Collection

In this section, we traverse through the directories to collect text files. 
We will extract relevant information and store it in a DataFrame for further analysis.


In [None]:
def walk_in_data(rootdir=path):
    for folder, _, files in os.walk(rootdir):
        print("visited", folder)
        for filename in files:
            print("visited file", filename)

In [None]:
walk_in_data()

In [None]:
df = pd.DataFrame()


def walk_in_data_and_add(rootdir=path):
    rows_list = []

    for folder, dirnames, files in os.walk(rootdir):
        for filename in files:
            # Keep only Markdown notes; endswith avoids matching names like "foo.cmd"
            if not filename.endswith(".md"):
                continue
            with open(os.path.join(folder, filename), "r", encoding="utf-8") as f:
                # Use a name other than `dict` to avoid shadowing the builtin
                row = {
                    "directory": folder,
                    "name": ".".join(filename.split(".")[:-1]),
                    "extension": filename.split(".")[-1],
                    "text": f.read(),
                }

                rows_list.append(row)
    return pd.DataFrame(rows_list)

In [None]:
df = walk_in_data_and_add()
os.makedirs("data", exist_ok=True)  # make sure the output directory exists
df.to_csv("data/first.csv", index=False)
df


## 2. Data Preprocessing

Here we preprocess the text data, which includes removing unnecessary characters, handling YAML front matter, and vectorizing the text data.


In [None]:
print(df["text"].iloc[0])

Let's detect files that contain YAML front matter.

In [None]:
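# Non-capturing groups avoid the pandas "match groups" warning in str.contains
has_yaml = df["text"].str.contains(r"(?s)^---\s*\n.*?\n---\s*(?:\n|$)", regex=True)
df_with_yaml = df[has_yaml]

# Select the remaining rows with the inverted mask
df_without_yaml = df[~has_yaml]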
df_with_yaml = df_with_yaml.reset_index(drop=True)
df_with_yaml
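
To make the pattern easier to follow, here is a minimal sketch that applies the same front matter regex to two hand-written notes using only the standard library. The note contents and the helper name `split_front_matter` are invented for this illustration and are not part of the notebook.

```python
import re

# Same pattern as above, with a non-capturing trailing group
FRONT_MATTER = re.compile(r"(?s)^---\s*\n(.*?)\n---\s*(?:\n|$)")

# Hypothetical note contents, made up for this example
note_with_yaml = "---\ntags: [ml]\ndate: 2024-01-01\n---\n# Title\nBody text"
note_without_yaml = "# Title\nA horizontal rule follows\n---\nBody text"


def split_front_matter(text):
    """Return (yaml_block, body); yaml_block is None without front matter."""
    match = FRONT_MATTER.match(text)
    if match is None:
        return None, text
    return match.group(1), text[match.end():]


yaml_block, body = split_front_matter(note_with_yaml)
no_yaml_block, untouched = split_front_matter(note_without_yaml)
print(repr(yaml_block), repr(body))
```

Because the pattern is anchored at the start of the text, a `---` horizontal rule in the middle of a note is not mistaken for front matter.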

Extract the front matter into a separate column:

In [None]:
df_with_yaml["yaml_content"] = df_with_yaml["text"].str.extract(
    r"(?s)^---\s*\n(.*?)\n---\s*(\n|$)", expand=False
)[0]
df_with_yaml

It looks like a column of dicts will be painful to use with ML algorithms, so its features need to be extracted into separate columns, for example with `DictVectorizer`.

In [None]:
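def erase_yaml(row):
    # Strip the leading front matter block with the detection regex instead of
    # counting characters, so whitespace after the "---" fences cannot shift the cut
    row["text"] = re.sub(
        r"(?s)^---\s*\n.*?\n---\s*(?:\n|$)", "", row["text"], count=1
    )
    return row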


df_with_yaml = df_with_yaml.apply(erase_yaml, axis="columns")
df_with_yaml

In [None]:
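def preprocess_yaml(row):
    try:
        row = yaml.safe_load(row) if pd.notnull(row) else None
    except yaml.YAMLError:
        # Unparseable front matter: fall back to an empty mapping, not a string
        return {}

    if not isinstance(row, dict):
        # Missing or non-mapping front matter has no key/value pairs to use
        return {}
    for key in row.keys():
        # Unwrap single-element lists so they become plain scalar values
        if isinstance(row[key], list) and len(row[key]) == 1:
            row[key] = row[key][0]
    return row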


df_with_yaml["yaml_content"] = df_with_yaml["yaml_content"].apply(preprocess_yaml)
df_with_yaml["yaml_content"]

In [None]:
df_with_yaml

### TF-IDF

In [None]:
def get_word_cloud(text, ngram_range=(1, 1)):
    vec = TfidfVectorizer(ngram_range=ngram_range)
    X = vec.fit_transform(text)
    # Sum each term's TF-IDF weight over all documents
    words_tfidf = dict(
        zip(vec.get_feature_names_out(), np.asarray(X.sum(axis=0)).ravel())
    )
    wordCloud = WordCloud(
        width=2000, height=2000, random_state=42, background_color="white"
    ).generate_from_frequencies(words_tfidf)

    plt.figure(figsize=(15, 15))
    plt.axis("off")
    plt.imshow(wordCloud, interpolation="bilinear")
    plt.show()

#### A bit of visualization (`text` column)

In [None]:
get_word_cloud(df_with_yaml["text"], (1, 6))

### Getting YAML information

In [None]:
normalized_yaml_content = pd.json_normalize(df_with_yaml["yaml_content"]).fillna("")
normalized_yaml_content

These columns are redundant and carry no value for the analysis, so we drop them:

In [None]:
normalized_yaml_content.drop(
    columns=[
        "sr-due",
        "sr-interval",
        "sr-ease",
        "excalidraw-plugin",
        "complexity",
        "cssclasses",
    ],
    inplace=True,
)

In [None]:
df_yaml = pd.concat([df_with_yaml, normalized_yaml_content], axis=1).drop(
    columns=["yaml_content", "extension"]
)
df_yaml

In [None]:
df_yaml.loc[:, "aliases"] = df_yaml["aliases"].astype("str")

In [None]:
df_yaml.info()

Transform `date` column dtype to `datetime64`:

In [None]:
df_yaml["date"] = pd.to_datetime(df_yaml["date"])

Let's gather unique tags:

In [None]:
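unique_tags = set()
for x in (
    df_yaml["tags"]
    .str.replace("[", "", regex=False)
    .str.replace("]", "", regex=False)
    .str.split(", ")
):
    # Non-string entries come back from the .str accessor as NaN (a float)
    if isinstance(x, list):
        unique_tags.update(x)
unique_tags.discard("")  # discard() does not raise when "" is absent
unique_tags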

In [None]:
for tag in unique_tags:
    df_yaml["tag_" + tag] = df_yaml["tags"].apply(lambda x: tag in x)

df_yaml.drop(columns=["tags"], inplace=True)
df_yaml
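
Note that `tag in x` is a substring test when `x` is a string, so a tag such as `ml` would also match `html`. If the tags are available as lists, an exact-match encoding can be sketched with scikit-learn's `MultiLabelBinarizer`. The toy frame below is an assumption made for illustration and does not reflect the real data.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical notes whose tags are already parsed into lists
notes = pd.DataFrame(
    {"name": ["a", "b", "c"], "tags": [["ml", "python"], ["html"], []]}
)

mlb = MultiLabelBinarizer()
indicators = pd.DataFrame(
    mlb.fit_transform(notes["tags"]),
    columns=["tag_" + tag for tag in mlb.classes_],
    index=notes.index,
).astype(bool)

# One boolean column per tag, matched exactly rather than by substring
tagged = pd.concat([notes.drop(columns=["tags"]), indicators], axis=1)
print(tagged)
```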

There is no need to do the same for `aliases`: we will process that column with TF-IDF.

In [None]:
object_columns = df_yaml.columns.drop("date")
datetime_columns = pd.Index(["date"])
object_columns, datetime_columns

### Joining with files without YAML frontmatter

We have completely forgotten about the documents without YAML front matter (they may still contain tags, written inside the document using "#notation"). Let's add them to our data:

In [None]:
df_joined = pd.merge(df, df_yaml, how="left", on=["directory", "name"])
df_joined["text"] = df_joined["text_y"].fillna(df_joined["text_x"])
df_joined.drop(columns=["text_x", "text_y"], inplace=True)


def fill_na_custom(series):
    # print(series.name[:3])
    if series.name[:3] == "tag":
        return series.fillna(False)
    elif series.dtype == "float64":  # Numeric columns
        return series.fillna(0)
    elif series.dtype == "object":  # String columns
        return series.fillna("missing")
    elif series.dtype == "datetime64[ns]":  # Datetime columns
        return series.fillna(pd.Timestamp("2024-01-01"))
    else:
        return series.fillna("other")  # Default fill for other types


# Apply the custom fill logic
df_joined = df_joined.apply(fill_na_custom)

df_joined
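
The left merge plus `fillna` pattern can be hard to picture, so here is a small sketch on invented data. It also uses the `indicator=True` option of `pd.merge` to record which rows had front matter. The `has_yaml` column is a hypothetical addition, not something the notebook creates.

```python
import pandas as pd

# Hypothetical stand-ins for df (all notes) and df_yaml (notes with front matter)
all_notes = pd.DataFrame(
    {
        "directory": ["d", "d"],
        "name": ["a", "b"],
        "text": ["---\nx: 1\n---\nbody a", "body b"],
    }
)
yaml_notes = pd.DataFrame(
    {"directory": ["d"], "name": ["a"], "text": ["body a"], "tag_ml": [True]}
)

joined = pd.merge(
    all_notes, yaml_notes, how="left", on=["directory", "name"], indicator=True
)
# Prefer the text with front matter stripped, fall back to the raw text
joined["text"] = joined["text_y"].fillna(joined["text_x"])
# "_merge" equals "both" only for rows that were found in yaml_notes
joined["has_yaml"] = joined["_merge"] == "both"
joined = joined.drop(columns=["text_x", "text_y", "_merge"])
print(joined)
```

Rows that exist only on the left side end up with missing values in the YAML-derived columns, which is what the custom fill function has to handle.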

### NLTK for joined DataFrame

In [None]:
nltk.download("stopwords")
stop_words = set(stopwords.words(["english", "russian"]))


def clean_text(text):
    # Replace slashes with spaces
    text = text.replace("/", " ")

    text = re.sub(r"[^\w\s]", "", text.lower())
    text = " ".join([word for word in text.split() if word not in stop_words])
    return text


text_columns = ["text", "aliases", "link", "directory", "name"]
tqdm.pandas()  # Initialize tqdm for pandas
for column in text_columns:
    # Clean each text column exactly once
    df_joined[column] = df_joined[column].progress_apply(clean_text)
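
To see what the cleaning step does to a single string, here is a self-contained sketch of the same logic. It uses a tiny hand-picked stop word set instead of the NLTK lists, so both the set and the sample sentence are assumptions made for illustration.

```python
import re

# A tiny hypothetical stop word set, standing in for the NLTK lists
demo_stop_words = {"the", "a", "и", "в"}


def clean_text_demo(text, stop_words=demo_stop_words):
    # Slashes separate path components, so turn them into spaces first
    text = text.replace("/", " ")
    # Lowercase and drop everything that is neither a word character nor whitespace
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(word for word in text.split() if word not in stop_words)


cleaned = clean_text_demo("The notes/ML folder: кластеризация и TF-IDF!")
print(cleaned)  # notes ml folder кластеризация tfidf
```

Removing punctuation also removes hyphens, so `TF-IDF` collapses into the single token `tfidf`. `\w` matches Unicode letters in Python 3, so the Cyrillic words survive.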

In [None]:
get_word_cloud(df_joined["text"], (2, 2))

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ("text_directory", TfidfVectorizer(), "directory"),
        ("text_name", TfidfVectorizer(), "name"),
        ("text_text", TfidfVectorizer(), "text"),
        ("text_aliases", TfidfVectorizer(), "aliases"),
        ("text_link", TfidfVectorizer(), "link"),
    ],
)

# Apply transformations
transformed_data = pd.DataFrame(preprocessor.fit_transform(df_joined).toarray())
transformed_data

## 3. Clustering & Comparison

We will apply different clustering algorithms like DBSCAN and Birch to the preprocessed text data. Finally, we compare the results of different clustering algorithms to evaluate their performance on our dataset.


### DBSCAN

In [None]:
%%time
pca = PCA(n_components=2)
X_pca = pca.fit_transform(transformed_data)

nrows = 2
ncols = 5

fig, axes = plt.subplots(nrows, ncols, figsize=(25, 10))
for i, eps in enumerate(np.linspace(1, 3, nrows * ncols)):
    dbscan = DBSCAN(eps=eps, min_samples=10)
    clusters = dbscan.fit_predict(transformed_data)

    row = i // ncols
    col = i % ncols

    scatter = axes[row, col].scatter(
        X_pca[:, 0], X_pca[:, 1], c=clusters, cmap="plasma"
    )
    axes[row, col].set_title(f"DBSCAN eps = {eps:.2f}")
    axes[row, col].set_xlabel("PCA Component 1")
    axes[row, col].set_ylabel("PCA Component 2")

    # Get unique cluster labels
    unique_labels = np.unique(clusters)

    # Create legend handles
    handles = [
        Patch(color=scatter.cmap(scatter.norm(label)), label=f"Cluster {label}")
        for label in unique_labels
    ]

    # Add the legend to the plot
    axes[row, col].legend(handles=handles, title="Clusters", loc="upper right")

fig.tight_layout()
plt.show()
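
When reading the legends above, keep in mind that DBSCAN assigns the label `-1` to noise points, so "Cluster -1" is not a real cluster. The sketch below demonstrates this on synthetic points. The data and parameter values are invented for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight synthetic blobs plus one far-away outlier
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(20, 2))
outlier = np.array([[20.0, 20.0]])
points = np.vstack([blob_a, blob_b, outlier])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(points)

# Label -1 marks noise and should not be counted as a cluster
n_found_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))
print(n_found_clusters, n_noise)
```

Counting clusters and noise points for each `eps` is a cheap numeric companion to the scatter plots.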

### Birch
The BIRCH (**Balanced Iterative Reducing and Clustering using Hierarchies**) algorithm is a hierarchical clustering method designed to efficiently cluster large (not our case) datasets. BIRCH incrementally builds a tree-like data structure called the Clustering Feature Tree (CF Tree), which summarizes the dataset. This structure allows BIRCH to handle large datasets effectively, making it suitable for scenarios where memory efficiency and scalability are important.

In [None]:
%%time
pca = PCA(n_components=2)
X_pca = pca.fit_transform(transformed_data)

nrows = 3
ncols = 5

thresholds = np.linspace(0.1, 1.6, ncols)
n_clusters = [2, 5, 8]
fig, axes = plt.subplots(nrows, ncols, figsize=(25, 20))
for i in range(nrows * ncols):
    row = i // ncols
    col = i % ncols

    birch = Birch(threshold=thresholds[col], n_clusters=n_clusters[row])
    clusters = birch.fit_predict(transformed_data)

    scatter = axes[row, col].scatter(
        X_pca[:, 0], X_pca[:, 1], c=clusters, cmap="plasma"
    )
    axes[row, col].set_title(f"Birch threshold = {thresholds[col]:.2f}, num of clusters = {n_clusters[row]}")
    axes[row, col].set_xlabel("PCA Component 1")
    axes[row, col].set_ylabel("PCA Component 2")

    # Get unique cluster labels
    unique_labels = np.unique(clusters)

    # Create legend handles
    handles = [
        Patch(color=scatter.cmap(scatter.norm(label)), label=f"Cluster {label}")
        for label in unique_labels
    ]

    # Add the legend to the plot
    axes[row, col].legend(handles=handles, title="Clusters", loc="upper right")

fig.tight_layout()
plt.show()
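
The plots give only a visual impression. One possible way to compare the algorithms numerically is the silhouette score from scikit-learn. The sketch below runs on synthetic blobs rather than on the notebook's data, and the parameter values are assumptions chosen for the toy dataset.

```python
from sklearn.cluster import Birch, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, well-separated data standing in for the real feature matrix
X_demo, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=42)

candidates = {
    "DBSCAN": DBSCAN(eps=1.0, min_samples=5),
    "Birch": Birch(threshold=0.5, n_clusters=3),
}

scores = {}
for algo_name, model in candidates.items():
    demo_labels = model.fit_predict(X_demo)
    # The score is only defined for at least two distinct labels;
    # DBSCAN noise (-1) is treated here as one more label
    if len(set(demo_labels)) > 1:
        scores[algo_name] = silhouette_score(X_demo, demo_labels)

print(scores)
```

Scores lie between -1 and 1, with higher values indicating tighter, better separated clusters. A high score still says nothing about whether the clusters are meaningful for the task.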

Judging by the plots, Birch seems to give the cleanest separation, so we use it for the final clustering.

In [None]:
best_clustering_algo = Birch(threshold=0.1, n_clusters=2)
df_joined["birch_cluster"] = best_clustering_algo.fit_predict(transformed_data)
df_joined

In [None]:
df_joined.query("birch_cluster == 1")["aliases"].value_counts()

In [None]:
df_joined.query("birch_cluster == 0")["aliases"].value_counts()

Whoops, it seems that the clusters were separated by the presence of YAML front matter.
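
One way to check this suspicion is to cross-tabulate the cluster labels against a flag that marks notes with front matter. The sketch below uses invented labels and a hypothetical `has_yaml` column, which the notebook does not create.

```python
import pandas as pd

# Hypothetical cluster labels and front matter flags for five notes
demo = pd.DataFrame(
    {
        "birch_cluster": [0, 0, 1, 1, 1],
        "has_yaml": [False, False, True, True, True],
    }
)

# Rows are clusters, columns are the flag values, cells are note counts
table = pd.crosstab(demo["birch_cluster"], demo["has_yaml"])

# Share of notes that fall into the majority flag value of their cluster
purity = table.max(axis=1).sum() / table.to_numpy().sum()
print(table)
print(purity)
```

A purity close to 1 would mean the clusters mostly mirror the presence of front matter rather than the topics of the notes.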