In [None]:
from model import TextClustering

Let's load the data we processed in notebook 2.

In [None]:
import pandas as pd
from pathlib import Path

datadir = Path("data/processed")
datafile = datadir / "posts.parquet"
# The parquet file is produced by notebook 2; fail early with a clear message if it is missing.
if not datafile.exists():
    raise FileNotFoundError(f"{datafile} not found; run notebook 2 first")
df = pd.read_parquet(datafile)
df.head()

This algorithm was designed by OrphAnalytics, a Swiss company specializing in authorship analysis. You can read the press release about their QAnon authorship research with this algorithm [here](https://www.prnewswire.com/news-releases/qanon-is-two-different-people-shows-machine-learning-analysis-from-orphanalytics-301192981.html).

While they didn't publish their code, I was able to reproduce their results based on their paper. I implemented their model in the `model.py` file, and we can instantiate it here.

In [None]:
clustering = TextClustering()

We will split the text into k=100 chunks, run a `CountVectorizer` on trigrams, and then compute the Manhattan distance between the chunk vectors. This gives us a `k x k` distance matrix, on which we run a dimensionality reduction algorithm (PCA or t-SNE).
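
To make these steps concrete, here is a minimal sketch of the same idea built directly on scikit-learn. It is not the code in `model.py`: the helper names `split_into_chunks` and `sketch_pipeline` are hypothetical, the toy text is made up, and the choice of character trigrams (rather than word trigrams) is an assumption, since the description above only says "trigrams".

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import manhattan_distances


def split_into_chunks(text: str, k: int) -> list[str]:
    """Split text into k contiguous chunks of roughly equal length."""
    bounds = np.linspace(0, len(text), k + 1, dtype=int)
    return [text[a:b] for a, b in zip(bounds[:-1], bounds[1:])]


def sketch_pipeline(text: str, k: int = 100) -> np.ndarray:
    """Chunk the text, count trigrams, compute distances, project to 2D."""
    chunks = split_into_chunks(text, k)
    # Assumption: character trigrams; word trigrams would use analyzer="word".
    vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
    counts = vectorizer.fit_transform(chunks)
    # k x k matrix of Manhattan (L1) distances between the chunk vectors.
    distances = manhattan_distances(counts)
    # Treat each row of the distance matrix as a feature vector and reduce it with PCA.
    return PCA(n_components=2).fit_transform(distances)


# Toy input made of two stylistically different halves (illustrative only).
toy_text = (
    "the quick brown fox jumps over the lazy dog. " * 40
    + "pack my box with five dozen liquor jugs! " * 40
)
X_toy = sketch_pipeline(toy_text, k=10)
print(X_toy.shape)
```

Swapping `PCA` for `sklearn.manifold.TSNE` would give the t-SNE variant; note that t-SNE requires its `perplexity` to be smaller than the number of chunks.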

In [None]:
import numpy as np
k = 100
X = clustering(df["text"], k=k, batch=True, method="PCA")
X.shape

We will use the labels as created in the preprocessing notebook.

In [None]:
labels = clustering.get_labels(df)
labels

With the reduced matrix and the labels, we can plot the first two components. The resulting picture is similar to the one in the original paper.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=labels)

imgdir = Path("../img")
if not imgdir.exists():
    print(f"Creating directory {imgdir}")
    imgdir.mkdir(parents=True)

imgfile = imgdir / Path("clustering.png")
plt.savefig(imgfile)