
# Data Preparation and Clustering Analysis

This notebook demonstrates how to collect, preprocess, and analyze text data for clustering purposes. 
We will start by gathering data from a specified directory, followed by preprocessing the text, and finally applying different clustering algorithms. 
We will also compare the performance of various clustering methods.

## Sections Overview
1. **Data Collection**: Walk through directories to collect text data.
2. **Data Preprocessing**: Clean and prepare the text data.
3. **Clustering**: Apply different clustering algorithms and visualize the results.
4. **Comparison**: Compare the performance of clustering algorithms.


In [None]:
import os
import re
import yaml
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.cluster import DBSCAN, Birch

In [None]:
path = r"/home/kaiser/work/repos/obsidian"


## 1. Data Collection

In this section, we traverse through the directories to collect text files. 
We will extract relevant information and store it in a DataFrame for further analysis.


In [None]:
def walk_in_data(rootdir=path):
    for folder, _, files in os.walk(rootdir):
        print("visited", folder)
        for filename in files:
            print("visited file", filename)

In [None]:
walk_in_data()

In [None]:
def walk_in_data_and_add(rootdir=path):
    """Collect every Markdown file under `rootdir` into a DataFrame, one row per file."""
    rows_list = []

    for folder, _, files in os.walk(rootdir):
        for filename in files:
            # Compare the real extension; checking only the last two characters
            # would also accept names such as "notes.cmd".
            name, extension = os.path.splitext(filename)
            if extension != ".md":
                continue
            with open(os.path.join(folder, filename), "r", encoding="utf-8") as f:
                rows_list.append(
                    {
                        "directory": folder,
                        "name": name,
                        "extension": extension.lstrip("."),
                        "text": f.read(),
                    }
                )
    return pd.DataFrame(rows_list)

In [None]:
df = walk_in_data_and_add()
os.makedirs("data", exist_ok=True)  # make sure the output directory exists
df.to_csv("data/first.csv", index=False)
df
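
The same traversal can also be written with `pathlib`. The sketch below is an alternative, not the method used in this notebook: `collect_markdown` is a hypothetical helper, and it runs against a throwaway temporary directory instead of the vault path defined above.

```python
import tempfile
from pathlib import Path


def collect_markdown(rootdir):
    """Return one dict per Markdown file found under rootdir (hypothetical helper)."""
    rows = []
    for file_path in sorted(Path(rootdir).rglob("*.md")):
        if not file_path.is_file():
            continue  # skip directories whose name happens to end in .md
        rows.append(
            {
                "directory": str(file_path.parent),
                "name": file_path.stem,
                "extension": file_path.suffix.lstrip("."),
                "text": file_path.read_text(encoding="utf-8"),
            }
        )
    return rows


# Build a tiny throwaway directory tree to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "a.md").write_text("# A", encoding="utf-8")
    (Path(tmp) / "sub" / "b.md").write_text("# B", encoding="utf-8")
    (Path(tmp) / "sub" / "skip.txt").write_text("ignored", encoding="utf-8")
    rows = collect_markdown(tmp)

print([row["name"] for row in rows])
```

The resulting list of dicts can be passed straight to `pd.DataFrame`, which gives the same four columns as above.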


## 2. Data Preprocessing

Here we preprocess the text data, which includes removing unnecessary characters, handling YAML front matter, and vectorizing the text data.


In [None]:
print(df["text"].iloc[0])

Let's detect files that contain YAML front matter.

In [None]:
# Non-capturing groups avoid the pandas warning about match groups in `str.contains`.
df_with_yaml = df[
    df["text"].str.contains(r"(?s)^---\s*\n(?:.*?)\n---\s*(?:\n|$)", regex=True)
].reset_index(drop=True)
df_with_yaml

Extract the front matter into a separate column:

In [None]:
# A single capture group with expand=False returns a Series directly.
df_with_yaml["yaml_content"] = df_with_yaml["text"].str.extract(
    r"(?s)^---\s*\n(.*?)\n---\s*(?:\n|$)", expand=False
)
df_with_yaml
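
To see what the front matter pattern matches, here is a minimal sketch that applies it to a single toy note with the standard `re` module. The note content and the names `FRONT_MATTER`, `note`, and `body` are made up for illustration.

```python
import re

# Same pattern as above, with a non-capturing final group.
FRONT_MATTER = re.compile(r"(?s)^---\s*\n(.*?)\n---\s*(?:\n|$)")

# Toy note (illustrative content, not taken from the dataset).
note = "---\ntags: [ml, nlp]\ndate: 2024-01-01\n---\n# Title\nBody text.\n"

match = FRONT_MATTER.search(note)
yaml_content = match.group(1) if match else None

# Removing the matched block leaves only the note body.
body = FRONT_MATTER.sub("", note, count=1)

print(repr(yaml_content))
print(repr(body))
```

A note that does not start with `---` yields no match, which is why such rows are filtered out before extraction.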

A column of dicts will be painful to use with ML algorithms, so its fields must be turned into separate features, for example with `DictVectorizer`.

In [None]:
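def erase_yaml(row):
    # Strip the leading front matter block with the same pattern used to detect it,
    # instead of relying on a hand-computed character offset.
    row["text"] = re.sub(r"(?s)^---\s*\n.*?\n---\s*(?:\n|$)", "", row["text"], count=1)
    return row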


df_with_yaml = df_with_yaml.apply(erase_yaml, axis="columns")
df_with_yaml

In [None]:
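def preprocess_yaml(row):
    """Parse a front matter string into a dict; fall back to an empty dict on failure."""
    try:
        row = yaml.safe_load(row) if pd.notnull(row) else None
    except yaml.YAMLError:
        # Return a dict (not the string "{}") so that later normalization still works.
        return {}

    # Empty or scalar front matter has no keys to normalize.
    if not isinstance(row, dict):
        return {}
    for key in row.keys():
        # Unwrap single-element lists so that `tags: [x]` behaves like `tags: x`.
        if isinstance(row[key], list) and len(row[key]) == 1:
            row[key] = row[key][0]
    return row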


df_with_yaml["yaml_content"] = df_with_yaml["yaml_content"].apply(preprocess_yaml)
df_with_yaml["yaml_content"]

In [None]:
df_with_yaml

### TF-IDF

In [None]:
def get_word_cloud(text, ngram_range=(1, 1)):
    vec = TfidfVectorizer(ngram_range=ngram_range)
    X = vec.fit_transform(text)
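    # Sum each term's TF-IDF weight over all documents; `.A1` flattens the 1 x n matrix
    # (equivalent to np.asarray(X.sum(axis=0)).ravel()).
    words_tfidf = dict(zip(vec.get_feature_names_out(), X.sum(axis=0).A1))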
    wordCloud = WordCloud(
        width=2000, height=2000, random_state=42, background_color="white"
    ).generate_from_frequencies(words_tfidf)

    plt.figure(figsize=(15, 15))
    plt.axis("off")
    plt.imshow(wordCloud, interpolation="bilinear")
    plt.show()

#### A bit of visualization (`text` column)

In [None]:
get_word_cloud(df_with_yaml['text'], (1, 6))

In [None]:
normalized_yaml_content = pd.json_normalize(df_with_yaml["yaml_content"]).fillna("")
normalized_yaml_content

The following columns are not valuable for the analysis, so we drop them:

In [None]:
normalized_yaml_content.drop(columns=["sr-due", "sr-interval", "sr-ease", "excalidraw-plugin", "complexity", "cssclasses"], inplace=True)

In [None]:
df_yaml = pd.concat([df_with_yaml, normalized_yaml_content], axis=1).drop(columns=["yaml_content", "extension"])
df_yaml

In [None]:
df_yaml.loc[:, "aliases"] = df_yaml["aliases"].astype("str")

In [None]:
df_yaml.info()

Convert the `date` column to `datetime64`:

In [None]:
df_yaml["date"] = pd.to_datetime(df_yaml["date"])

Let's gather unique tags:

In [None]:
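def split_tags(value):
    """Return the tags of one note as a list (handles YAML lists and "a, b" strings)."""
    if isinstance(value, list):
        return [str(tag) for tag in value]
    if isinstance(value, str):
        return [tag.strip() for tag in value.strip("[]").split(",") if tag.strip()]
    return []


# Multi-tag notes are stored as lists, which the `.str` accessor would silently skip.
unique_tags = {tag for tags in df_yaml["tags"].apply(split_tags) for tag in tags}
unique_tags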

In [None]:
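for tag in unique_tags:
    # Exact membership test; a plain `tag in x` on a string would match substrings
    # (for example "ml" inside "html").
    df_yaml["tag_" + tag] = df_yaml["tags"].apply(
        lambda x: tag in x if isinstance(x, list)
        else tag in [t.strip() for t in str(x).strip("[]").split(",")]
    )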

df_yaml.drop(columns=["tags"], inplace=True)
df_yaml
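
The loop above builds one boolean column per tag. For comparison, here is a minimal sketch of the same idea using scikit-learn's `MultiLabelBinarizer`, assuming the tags are already available as one list per note. The tag values are made up for illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy tag lists, one per note (illustrative values only).
tag_lists = [["ml", "nlp"], ["html"], []]

binarizer = MultiLabelBinarizer()
indicator = binarizer.fit_transform(tag_lists)

print(binarizer.classes_)
print(indicator)
```

Because membership is tested per list element, the tag `ml` does not fire for the note tagged `html`.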

The `aliases` column does not need the same treatment; it will be vectorized with TF-IDF instead.

In [None]:
object_columns = df_yaml.columns.drop("date")
datetime_columns = pd.Index(["date"])
object_columns, datetime_columns

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('text_directory', TfidfVectorizer(), 'directory'),
        ('text_name', TfidfVectorizer(), 'name'),
        ('text_text', TfidfVectorizer(), 'text'),
        ('text_aliases', TfidfVectorizer(), 'aliases'),
        ('text_link', TfidfVectorizer(), 'link'),
    ],
)

# Apply transformations
transformed_data = pd.DataFrame(preprocessor.fit_transform(df_yaml).toarray())
transformed_data


## 3. Clustering

We cluster the vectorized text data with DBSCAN and visualize the result with PCA. `Birch` is imported as a second algorithm to try on the same matrix.


In [None]:
dbscan = DBSCAN(eps=2)

dbscan.fit(transformed_data)

In [None]:
dbscan.labels_

In [None]:
df_yaml['cluster'] = dbscan.labels_
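
Before plotting, it helps to summarize the label vector. In scikit-learn's DBSCAN, the label `-1` marks noise points that belong to no cluster. The sketch below counts clusters and noise on a hard-coded example vector; the values are illustrative and are not the output of the fit above.

```python
from collections import Counter

# Example label vector in the shape DBSCAN produces; -1 marks noise points.
labels = [0, 0, 1, -1, 1, 0, -1, 2]

counts = Counter(labels)
n_noise = counts.pop(-1, 0)  # remove the noise bucket before counting clusters
n_clusters = len(counts)

print(f"clusters: {n_clusters}, noise points: {n_noise}")
print(dict(counts))
```

With the fitted model, `labels` would be `dbscan.labels_.tolist()`. A single large cluster or mostly noise usually suggests that `eps` needs tuning.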

In [None]:
from sklearn.decomposition import PCA
# Reduce dimensions with PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(transformed_data)

# Plotting the clusters after PCA
plt.figure(figsize=(10, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df_yaml['cluster'], cmap='plasma')
plt.title('DBSCAN Clustering (PCA Reduced)')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.colorbar(label='Cluster Label')
plt.show()
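
Birch can be fitted through the same interface. The following is a minimal sketch on synthetic blobs, not on the notebook's data. The blob centers, `n_clusters=3`, and `threshold=0.5` are assumptions chosen for illustration, not tuned values.

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Synthetic stand-in for the vectorized notes (assumption: three well-separated groups).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)

# Birch builds a tree of compact subclusters, then merges them into n_clusters groups.
birch = Birch(n_clusters=3, threshold=0.5)
birch_labels = birch.fit_predict(X)

print(sorted(set(birch_labels.tolist())))
```

On the real data, `X` would be `transformed_data`. Unlike DBSCAN, Birch assigns every point to a cluster and produces no noise label.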


## 4. Comparison of Clustering Algorithms

Finally, we compare the results of different clustering algorithms to evaluate their performance on our dataset.
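
One possible way to carry out this comparison is to report the number of clusters, the number of noise points, and the silhouette score for each algorithm. The sketch below does this on synthetic blobs. `clustering_summary` is a hypothetical helper, and the parameters (`eps=1.0`, `min_samples=5`, `n_clusters=3`) are illustrative assumptions.

```python
from sklearn.cluster import DBSCAN, Birch
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def clustering_summary(X, labels):
    """Return cluster count, noise count, and silhouette score (None if undefined)."""
    mask = labels != -1  # DBSCAN marks noise with -1
    n_clusters = len(set(labels[mask].tolist()))
    score = None
    # The silhouette score needs at least 2 clusters and fewer clusters than points.
    if 2 <= n_clusters < int(mask.sum()):
        score = float(silhouette_score(X[mask], labels[mask]))
    return {"clusters": n_clusters, "noise": int((~mask).sum()), "silhouette": score}


# Synthetic stand-in for the vectorized notes.
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)

results = {
    "DBSCAN": clustering_summary(X, DBSCAN(eps=1.0, min_samples=5).fit_predict(X)),
    "Birch": clustering_summary(X, Birch(n_clusters=3).fit_predict(X)),
}
for name, summary in results.items():
    print(name, summary)
```

Noise points are excluded before scoring, which can flatter DBSCAN, so the noise count should be read alongside the silhouette value.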
