***

### Embeddings

- Creates embeddings (transforms unstructured text into numeric vectors using a pre-trained language model)
- Inspired by [BERTopic documentation](https://maartengr.github.io/BERTopic/index.html)

***

In [None]:
# TODO: change the code so i'm working on one dataset with an extra variable: Referential (boolean)

In [None]:
# load data
import pandas as pd

data = "/data/df.xlsx"  # each document we want to embed must be in its own row
df = pd.read_excel(data)

# inspect df
df.head(1)

- How long are the documents we will embed? 
- We should check, as BERT has a limit of 512 tokens. 
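
The 512 limit applies to subword tokens (plus the special `[CLS]` and `[SEP]` tokens), not to whitespace-separated words, so a plain word count tends to underestimate the real length. The toy sketch below illustrates the idea with a greedy longest-match splitter in the style of WordPiece. `TOY_VOCAB`, `toy_wordpiece` and the sample text are hypothetical and made up for illustration; this is not the real BERT tokenizer or vocabulary.

```python
# Toy greedy longest-match subword splitter (WordPiece-style).
# TOY_VOCAB is a hypothetical vocabulary invented for this illustration.
TOY_VOCAB = {"author", "##ization", "control", "##s", "is", "required"}

def toy_wordpiece(word, vocab=TOY_VOCAB):
    """Split one lowercase word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

text = "authorization controls is required"
words = text.split(" ")                                  # what the word count sees
tokens = [p for w in words for p in toy_wordpiece(w)]    # what a subword model sees
n_with_special = len(tokens) + 2                         # add [CLS] and [SEP]

print(len(words), "words ->", len(tokens), "subword tokens:", tokens)
```

In practice, the tokenizer loaded later in this notebook gives the exact count, so the word-based check is best read as a lower bound.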

In [None]:
# How many are too long?
# Note: splitting on spaces counts words, which only approximates BERT's subword token count
l = df['full_control_text'].apply(lambda x: len(x.split(' ')))
z = len([x for x in l if x >= 512])
print(f'{z} values have length 512 or more and will be truncated.')

In [None]:
# Quick visualization to see text lengths (uses l and z from the cell above)
import matplotlib.pyplot as plt

ax = l.hist(bins=30)  # Series.hist returns a matplotlib Axes object
ax.set_title('How many "tokens" long are control texts?')
ax.axvline(x=512, color='red', linewidth=1)  # mark the 512-token limit
props = dict(boxstyle='square', facecolor='white', alpha=0.5)

ax.text(0.95, 0.7, f'{z} values >= 512', transform=ax.transAxes, fontsize=10,
        verticalalignment='top', horizontalalignment='right', bbox=props)
plt.show()

In [None]:
# to create embeddings, we need a list of documents (control texts in our case)
docs = list(df['full_control_text'].values)

In [None]:
# Load pre-trained BERT model and tokenizer
from transformers import BertModel, BertTokenizer

model_name = "bert-base-uncased"
model = BertModel.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

In [None]:
# function to do BERT embedding on text
import torch

def get_embedding(text):
    # tokenize; anything beyond 512 tokens is cut off
    encoded_input = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():  # inference only, so skip gradient tracking
        output = model(**encoded_input)

    # Use the pooler output (derived from the [CLS] token) and convert the tensor
    # to a numpy array of shape (1, hidden_size)
    embedding = output.pooler_output.detach().numpy()
    return embedding

In [None]:
# create a column in our dataset for embeddings
df['BERTembeddings'] = df['full_control_text'].apply(get_embedding)

#### Finetuned Model

- Let's try the same thing - but using our finetuned version of BERT. 
- Again, we'll just add these new embeddings as a new column.
- That lets us easily compare results if we like.
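
Vectors from two different models live in different spaces (and may have different dimensions), so they cannot be compared element by element. One option is to compare how similar each pair of documents is under each model. The sketch below does this with cosine similarity on tiny made-up vectors; `base_embeddings`, `finetuned_embeddings` and the helper names are hypothetical stand-ins for the two embedding columns, assuming each entry is a flat list of numbers.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_similarities(embeddings):
    """Cosine similarity for every pair of documents (i < j)."""
    n = len(embeddings)
    return {
        (i, j): cosine_similarity(embeddings[i], embeddings[j])
        for i in range(n)
        for j in range(i + 1, n)
    }

# Hypothetical embeddings for the same three documents under each model.
base_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
finetuned_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 1.0]]

base_sims = pairwise_similarities(base_embeddings)
finetuned_sims = pairwise_similarities(finetuned_embeddings)

# Side-by-side view: which document pairs moved closer or further apart?
for pair in base_sims:
    print(pair, round(base_sims[pair], 3), round(finetuned_sims[pair], 3))
```

Pairs whose similarity changes a lot between the two models are good candidates for a manual look.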

In [None]:
# Reload the base BERT model (note: "bert-base-uncased" is the base checkpoint, not the fine-tuned one)
model_name = "bert-base-uncased"
model = BertModel.from_pretrained(model_name)

# Make tokenizer based on the model
tokenizer = BertTokenizer.from_pretrained(model_name)

In [None]:
# Load fine-tuned sentence-transformers model
from sentence_transformers import SentenceTransformer

model_path = "outputs/sentence_transformers_compatible_model"
finetuned_model = SentenceTransformer(model_path)

In [None]:
# Generate embeddings for each document
embeddings = finetuned_model.encode(docs, show_progress_bar=True)

# Add the embeddings to DataFrame
df['finetune_embeddings'] = embeddings.tolist()

### Topic modeling

Having created the embeddings, we can run most of the remaining BERTopic steps in one function.
<!-- - Dimensionality reduction
- Clustering
- Tokenizer
- Weighting Scheme -->


- Step 1 - Extract embeddings

`embedding_model = SentenceTransformer("all-MiniLM-L6-v2")`

- Step 2 - Reduce dimensionality

`umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')`

- Step 3 - Cluster reduced embeddings

`hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)`

- Step 4 - Tokenize topics

`vectorizer_model = CountVectorizer(stop_words="english")`

- Step 5 - Create topic representation

`ctfidf_model = ClassTfidfTransformer()`

- Step 6 - (Optional) Fine-tune topic representations 

`representation_model=representation_model` 

See [documentation](https://maartengr.github.io/BERTopic/getting_started/parameter%20tuning/parametertuning.html#min_topic_size "More info on minimum topic size and other parameters") for more on min_topic_size and other parameter choices.
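
To make steps 4 and 5 more concrete, here is a simplified, pure-Python sketch of class-based TF-IDF: all documents in a topic are pooled into one bag of words, and each term is weighted by its frequency within the topic times `log(1 + A / f)`, where `A` is the average number of words per topic and `f` is the term's frequency across all topics. The clusters in `clustered_docs` and all names are hypothetical, there is no stop-word removal, and the details are an approximation of what `CountVectorizer` and `ClassTfidfTransformer` do, not their actual implementation.

```python
import math
from collections import Counter

# Hypothetical clusters: topic id -> documents assigned to that topic
clustered_docs = {
    0: ["password must be rotated", "password length policy"],
    1: ["backup tapes stored offsite", "backup schedule is weekly"],
}

# Step 4 (simplified): one bag of words per topic, pooling all its documents
topic_counts = {
    topic: Counter(" ".join(texts).split())
    for topic, texts in clustered_docs.items()
}

# Term frequencies across all topics, and the average word count per topic
total_counts = Counter()
for counts in topic_counts.values():
    total_counts.update(counts)
avg_words_per_topic = sum(total_counts.values()) / len(topic_counts)

def c_tf_idf(term, topic):
    """Step 5 (simplified): in-topic frequency times cross-topic rarity."""
    counts = topic_counts[topic]
    tf = counts[term] / sum(counts.values())
    idf = math.log(1 + avg_words_per_topic / total_counts[term])
    return tf * idf

# Highest-weighted terms per topic form the topic representation
top_terms = {
    topic: sorted(counts, key=lambda t: c_tf_idf(t, topic), reverse=True)[:3]
    for topic, counts in topic_counts.items()
}
print(top_terms)
```

Terms that are frequent inside one topic score highest there, which is why they end up describing that topic.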

In [None]:
# topic modeling function
import numpy as np
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

def topic_modeling(min_topic_size):
    # Load fine-tuned sentence-transformers model
    finetuned_model = SentenceTransformer(r"outputs\fine_tuned_model")
    # Load pre-generated embeddings
    pre_generated_embeddings = np.array(list(df['finetune_embeddings'].values))
    # specifying dimensionality reduction
    umap_model = UMAP(n_neighbors=15, n_components=5, metric='cosine', low_memory=False)  # may need to tweak
    # specifying cluster model; min_cluster_size sets the minimum topic size
    # (BERTopic ignores its own min_topic_size when a custom hdbscan_model is passed)
    hdbscan_model = HDBSCAN(min_cluster_size=min_topic_size, metric='euclidean', prediction_data=True)
    # Create two representation models (maybe one is better - nice to check)
    keybert_model = KeyBERTInspired(random_state=42)
    mmr_model = MaximalMarginalRelevance(diversity=0.3)
    representation_model = {
        "KeyBERT": keybert_model,
        "MMR": mmr_model
    }

    # Instantiate BERTopic with fine-tuned model's embeddings and the representation model
    # NOTE: the embedding_model value is a placeholder left unfilled in the original;
    # finetuned_model (loaded above) is presumably what belongs here
    topic_model = BERTopic(embedding_model='_____________',
                        umap_model=umap_model,
                        hdbscan_model=hdbscan_model,
                        verbose=False,
                        n_gram_range=(1, 3),  # we use n-grams of 1-3 words
                        min_topic_size=min_topic_size,
                        calculate_probabilities=True,
                        representation_model=representation_model)

    # run the topic modeling on the precomputed embeddings
    topics, probs = topic_model.fit_transform(docs, embeddings=pre_generated_embeddings)
    return topics, probs, topic_model
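
Once the function has run, a quick first check is how many documents landed in each topic and how many were left as outliers. BERTopic labels documents that fit no cluster as topic `-1`. The sketch below summarizes a made-up `topics` list; the values are hypothetical stand-ins for the list returned by `fit_transform`.

```python
from collections import Counter

# Hypothetical topic assignments, one per document (-1 marks outliers)
topics = [0, 1, -1, 0, 2, 0, -1, 1]

topic_sizes = Counter(topics)
n_outliers = topic_sizes.pop(-1, 0)        # separate outliers from real topics
outlier_share = n_outliers / len(topics)

print(f"{len(topic_sizes)} topics, {n_outliers} outliers ({outlier_share:.0%} of documents)")
for topic, size in topic_sizes.most_common():
    print(f"topic {topic}: {size} documents")
```

A large outlier share is often a hint that the minimum topic size or the UMAP settings need tweaking.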