### Topic Modelling
In this notebook, we perform topic modelling separately on each of the three sections under consideration (Section 1, Section 1A and Section 7).



In [3]:
import pandas as pd
link = '../all_filings_and_sections.csv'
df = pd.read_csv(link)

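# Parse the filing timestamps (infer_datetime_format is deprecated since pandas 2.0 and no longer needed)
df['filedAt'] = pd.to_datetime(df['filedAt'])
# Drop the unnamed column (typically a leftover index from an earlier CSV export)
df = df.drop(columns='Unnamed: 0')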

### BERTopic (requires sentence_transformers and bertopic)
We use BERTopic to extract the topics, and add stop-word removal so that stop words do not appear in the topic representations. The stop words are still present in the text that the sentence transformer encodes, since they are only removed at the vectorizer stage. For production, we would use 'all-mpnet-base-v2' as the sentence transformer, which at the time of writing is the top performer among the pretrained [Sentence Embeddings](https://www.sbert.net/docs/pretrained_models.html) models. For the purposes of this MVP, we use 'all-MiniLM-L6-v2', which takes approximately 20 minutes per section, versus about 1h 40m for 'all-mpnet-base-v2', on an MPS device. We then add maximal marginal relevance (MMR), which selects representatives of a given cluster that are relevant to the cluster yet diverse, as measured by their cosine similarity to one another. In practice, we get N representatives whose embeddings are not too similar to each other, so that the information they carry is as varied as possible.
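
The MMR idea described above can be sketched in plain Python. This is an illustration only and is not taken from BERTopic's implementation: the function names `cosine` and `mmr`, the scoring weights and the toy vectors are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mmr(query_vec, candidate_vecs, top_n=2, diversity=0.2):
    """Return the indices of the candidates picked by maximal marginal relevance."""
    relevance = [cosine(query_vec, c) for c in candidate_vecs]
    # Start with the candidate that is most similar to the query (e.g. the topic).
    selected = [max(range(len(candidate_vecs)), key=relevance.__getitem__)]
    while len(selected) < min(top_n, len(candidate_vecs)):
        best_idx, best_score = None, -math.inf
        for i, cand in enumerate(candidate_vecs):
            if i in selected:
                continue
            # Penalise candidates that resemble something already selected.
            redundancy = max(cosine(cand, candidate_vecs[j]) for j in selected)
            score = (1 - diversity) * relevance[i] - diversity * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return selected


# Toy data (hypothetical): candidates 0 and 1 are near-duplicates, candidate 2 differs.
query = [1.0, 0.0]
candidates = [[1.0, 0.10], [1.0, 0.12], [0.6, 0.8]]
low_diversity = mmr(query, candidates, top_n=2, diversity=0.2)
high_diversity = mmr(query, candidates, top_n=2, diversity=0.8)
```

With a low `diversity` the near-duplicate is still picked second because relevance dominates the score; with a high `diversity` the dissimilar candidate is preferred instead.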

Let's start by preprocessing the content of the sections: we cannot embed them directly without running the risk of truncating a significant part of each document. We will use LangChain for this purpose. Its splitter can be configured with the sentence transformer of our choice, so the chunks fit within that model's token limit and nothing is truncated when we compute the embeddings.
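
To make the truncation risk concrete, here is a minimal sketch of window-based splitting. It is not LangChain's implementation: the helper `split_into_windows` is hypothetical, and whitespace-separated words stand in for the word-piece tokens a real sentence-transformer tokenizer would produce.

```python
def split_into_windows(tokens, max_tokens):
    """Split a token sequence into consecutive windows of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


# Assumption: whitespace tokens approximate the model's real tokens.
toy_text = "our business is subject to risks that could affect future results"
toy_tokens = toy_text.split()
toy_chunks = [" ".join(w) for w in split_into_windows(toy_tokens, max_tokens=4)]
```

An encoder with a fixed input limit would silently drop everything past that limit if given the full text, whereas every chunk produced here stays within it.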


In order to use sentence transformers, we need to extract the page content from each document. As of now, this cannot be done in a vectorized way.



In [None]:
import pickle
from langchain.document_loaders import DataFrameLoader
from langchain.text_splitter import SentenceTransformersTokenTextSplitter

# Load the splitter to use
splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, model_name='sentence-transformers/all-MiniLM-L6-v2')
sections = [f'Section{s}' for s in ['1', '1A', '7']]
split_sections = {}

# For each section load the corresponding column and split the documents using the sentence-transformer
for s in sections:
    loader_section_s = DataFrameLoader(df, page_content_column=s)
    docs_section_s = loader_section_s.load()
    split_sections[s] = splitter.split_documents(docs_section_s)

split_sections_text = {s: {'text': [doc.page_content for doc in v],
                           'meta': [doc.metadata for doc in v]} for s, v in split_sections.items()}
with open("../data/split_sections_text.pickle", "wb") as file:
    pickle.dump(split_sections_text, file)

Now we can compute the embeddings.

In [None]:
from sentence_transformers import SentenceTransformer

# Pre-calculate embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = {s: embedding_model.encode(v['text'], show_progress_bar=True) for s,v in split_sections_text.items()}


We save them in case we need to use them later.

In [None]:
with open("../data/embeddings.pickle", "wb") as file:
    pickle.dump(embeddings, file)
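
Reading the file back later could look like the sketch below. `load_pickle` is a hypothetical helper, and the round trip uses a throwaway temporary file with toy values so that the example runs on its own. Pickle files should only be loaded from sources you trust.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Read back an object that was written with pickle.dump."""
    with Path(path).open("rb") as file:
        return pickle.load(file)


# Round trip on a temporary file; the values are toy stand-ins for real embeddings.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.pickle"
    demo = {"Section1": [[0.1, 0.2], [0.3, 0.4]]}
    with demo_path.open("wb") as file:
        pickle.dump(demo, file)
    restored = load_pickle(demo_path)

# In this notebook the equivalent call would be:
# embeddings = load_pickle("../data/embeddings.pickle")
```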

We now load the required elements, namely the count vectorizer and the representation model. We set the n-gram range to (1, 2), so the vectorizer considers both single words and two-word phrases when building the topic representations.


In [None]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import MaximalMarginalRelevance


vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))
representation_model = MaximalMarginalRelevance(diversity=0.2)

We can now fit and save our 3 topic models.

In [None]:
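for s, v in split_sections_text.items():
    # Fit one model per section, reusing the pre-computed embeddings
    topic_model = BERTopic(embedding_model=embedding_model, representation_model=representation_model, vectorizer_model=vectorizer_model)
    topic_model.fit(v['text'], embeddings[s])
    # Pass the embedding model's name as a string so the saved config records which model to reload
    topic_model.save(f'topic_models/topic_models_{s}', serialization='safetensors', save_ctfidf=True, save_embedding_model='sentence-transformers/all-MiniLM-L6-v2')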


With the three models saved, we can move on to [analyzing](topic_analysis.ipynb) the topics.