# **Tutorial** - Topic Modeling with BERTopic
(last updated 01-09-2022)

In this tutorial we will be exploring how to use BERTopic to create topics from the well-known 20Newsgroups dataset. The most frequent use-cases and methods are discussed together with important parameters to keep a look out for. 


## BERTopic
BERTopic is a topic modeling technique that leverages 🤗 transformers and a custom class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions. 

<br>

<img src="https://raw.githubusercontent.com/MaartenGr/BERTopic/master/images/logo.png" width="40%">

# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)

# **Installing BERTopic**

We start by installing BERTopic from PyPi:

In [None]:
%%capture
!pip install bertopic
!pip install --upgrade joblib==1.1.0

## Restart the Notebook
After installing BERTopic, some packages that were already loaded were updated and in order to correctly use them, we should now restart the notebook.

From the Menu:

Runtime → Restart Runtime

# Data
For this example, we use the popular 20 Newsgroups dataset which contains roughly 18000 newsgroups posts

In [None]:
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

In [None]:
len(docs)

18846

# **Topic Modeling**

In this example, we will go through the main components of BERTopic and the steps necessary to create a strong topic model. 




## Training

We start by instantiating BERTopic. We set language to `english` since our documents are in the English language. If you would like to use a multi-lingual model, please use `language="multilingual"` instead. 

We will also calculate the topic probabilities. However, this can slow down BERTopic significantly at large amounts of data (>100_000 documents). It is advised to turn this off if you want to speed up the model. 


In [None]:
from bertopic import BERTopic

topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(docs)

Downloading:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/612 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/116 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/350 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/13.2k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/349 [00:00<?, ?B/s]

Batches:   0%|          | 0/589 [00:00<?, ?it/s]

2022-11-14 18:41:54,574 - BERTopic - Transformed documents to Embeddings
2022-11-14 18:42:34,384 - BERTopic - Reduced dimensionality
2022-11-14 18:43:04,478 - BERTopic - Clustered reduced embeddings


**NOTE**: Use `language="multilingual"` to select a model that support 50+ languages.

## Extracting Topics
After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents. 

In [None]:
freq = topic_model.get_topic_info(); freq.head(5)

Unnamed: 0,Topic,Count,Name
0,-1,6813,-1_to_the_and_of
1,0,1829,0_game_team_games_he
2,1,590,1_key_clipper_chip_encryption
3,2,527,2_ites_cheek_yep_huh
4,3,443,3_monitor_card_video_drivers


In [None]:
len(freq)

215

-1 refers to all outliers and should typically be ignored. Next, let's take a look at a frequent topic that were generated:

In [None]:
topic_model.get_topic(0)  # Select the most frequent topic

[('game', 0.010448871276050502),
 ('team', 0.009096858810746813),
 ('games', 0.0072451416014537385),
 ('he', 0.007127886563888632),
 ('players', 0.006349499426397751),
 ('season', 0.006292184542535288),
 ('hockey', 0.006167159374583367),
 ('play', 0.005814499133601291),
 ('25', 0.005694243777649715),
 ('year', 0.005660847590219884)]

**NOTE**: BERTopic is stocastic which means that the topics might differ across runs. This is mostly due to the stocastisc nature of UMAP.

## Attributes

There are a number of attributes that you can access after having trained your BERTopic model:


| Attribute | Description |
|------------------------|---------------------------------------------------------------------------------------------|
| topics_               | The topics that are generated for each document after training or updating the topic model. |
| probabilities_ | The probabilities that are generated for each document if HDBSCAN is used. |
| topic_sizes_           | The size of each topic                                                                      |
| topic_mapper_          | A class for tracking topics and their mappings anytime they are merged/reduced.             |
| topic_representations_ | The top *n* terms per topic and their respective c-TF-IDF values.                             |
| c_tf_idf_              | The topic-term matrix as calculated through c-TF-IDF.                                       |
| topic_labels_          | The default labels for each topic.                                                          |
| custom_labels_         | Custom labels for each topic as generated through `.set_topic_labels`.                                                               |
| topic_embeddings_      | The embeddings for each topic if `embedding_model` was used.                                                              |
| representative_docs_   | The representative documents for each topic if HDBSCAN is used.                                                |

For example, to access the predicted topics for the first 10 documents, we simply run the following:

In [None]:
topic_model.topics_[:10]

[0, 3, -1, 43, 107, -1, -1, 0, 0, -1]

In [None]:
len(topic_model.topics_)

18846

# **Visualization**
There are several visualization options available in BERTopic, namely the visualization of topics, probabilities and topics over time. Topic modeling is, to a certain extent, quite subjective. Visualizations help understand the topics that were created. 

## Visualize Topics
After having trained our `BERTopic` model, we can iteratively go through a hundred topics to get a good 
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. 
Instead, we can quantitatively visualize the topics that were generated in a way very similar to 
[LDAvis](https://github.com/cpsievert/LDAvis):

In [None]:
topic_model.visualize_topics()

## Visualize Topic Probabilities

The variable `probabilities` that is returned from `transform()` or `fit_transform()` can 
be used to understand how confident BERTopic is that certain topics can be found in a document. 

To visualize the distributions, we simply call:

In [None]:
probs.shape

(18846, 214)

In [None]:
import random
random_document_id = random.randint(0, len(probs))
print(docs[random_document_id])
print('Topic: {}'.format(topic_model.topics_[random_document_id]))

# topic_model.visualize_distribution(probs[random_document_id], min_probability=0.015)
topic_model.visualize_distribution(probs[random_document_id], min_probability=0.003)

    |                                                                  
    | > Mary at that time appeared to a girl named Bernadette at       
    | > Lourdes.  She referred to herself as the Immaculate Conception.
    | > Since a nine year old would have no way of knowing about the   
    | > doctrine, the apparition was deemed to be true and it sealed   
    | > the case for the doctrine.                                     
    |Bernadette was 14 years old when she had her visions, in 1858,    
    |four years after the dogma had been officially proclaimed by the  
    |Pope.                                                             
    |                                                                  
    | Yours,                                                           
    | James Kiefer

I forgot exactly what her age was but I remember clearly
that she was born in a family of poverty and she did not
have any education, whatsoever, at the age of the apparitions.
She suffere

## Visualize Topic Hierarchy

The topics that were created can be hierarchically reduced. In order to understand the potential hierarchical structure of the topics, we can use scipy.cluster.hierarchy to create clusters and visualize how they relate to one another. This might help selecting an appropriate nr_topics when reducing the number of topics that you have created.

In [None]:
topic_model.visualize_hierarchy(top_n_topics=10)

## Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [None]:
topic_model.visualize_barchart(top_n_topics=5)

## Visualize Topic Similarity
Having generated topic embeddings, through both c-TF-IDF and embeddings, we can create a similarity matrix by simply applying cosine similarities through those topic embeddings. The result will be a matrix indicating how similar certain topics are to each other.


In [None]:
topic_model.visualize_heatmap(top_n_topics = 20, width=1000, height=1000)

You can set `n_clusters` in `visualize_heatmap` to order the topics by their similarity. This will result in blocks being formed in the heatmap indicating which clusters of topics are similar to each other. This step is very much recommended as it will make reading the heatmap easier.

In [None]:
topic_model.visualize_heatmap(n_clusters = 5, top_n_topics = 20, width=1000, height=1000)

## Visualize Term Score Decline
Topics are represented by a number of words starting with the best representative word. Each word is represented by a c-TF-IDF score. The higher the score, the more representative a word to the topic is. Since the topic words are sorted by their c-TF-IDF score, the scores slowly decline with each word that is added. At some point adding words to the topic representation only marginally increases the total c-TF-IDF score and would not be beneficial for its representation.

To visualize this effect, we can plot the c-TF-IDF scores for each topic by the term rank of each word. In other words, the position of the words (term rank), where the words with the highest c-TF-IDF score will have a rank of 1, will be put on the x-axis. Whereas the y-axis will be populated by the c-TF-IDF scores. The result is a visualization that shows you the decline of c-TF-IDF score when adding words to the topic representation. It allows you, using the elbow method, the select the best number of words in a topic.


In [None]:
topic_model.visualize_term_rank(log_scale=True)

# **Topic Representation**
After having created the topic model, you might not be satisfied with some of the parameters you have chosen. Fortunately, BERTopic allows you to update the topics after they have been created. 

This allows for fine-tuning the model to your specifications and wishes. 

## Update Topics
When you have trained a model and viewed the topics and the words that represent them,
you might not be satisfied with the representation. Perhaps you forgot to remove
stopwords or you want to try out a different `n_gram_range`. We can use the function `update_topics` to update 
the topic representation with new parameters for `c-TF-IDF`: 


In [None]:
topic_model.update_topics(docs, n_gram_range=(2, 3))

In [None]:
topic_model.get_topic(2)   # We select topic that we viewed before

[('more each why', 0.27832230856202955),
 ('ites yep', 0.27832230856202955),
 ('not this again', 0.27832230856202955),
 ('not forget ites', 0.27832230856202955),
 ('of why ken', 0.27832230856202955),
 ('this again lets', 0.27832230856202955),
 ('us more each', 0.27832230856202955),
 ('why cheek', 0.27832230856202955),
 ('again lets not', 0.27832230856202955),
 ('ken huh not', 0.27832230856202955)]

## Topic Reduction
We can also reduce the number of topics after having trained a BERTopic model. The advantage of doing so, 
is that you can decide the number of topics after knowing how many are actually created. It is difficult to 
predict before training your model how many topics that are in your documents and how many will be extracted. 
Instead, we can decide afterwards how many topics seems realistic:





In [None]:
topic_model.reduce_topics(docs, nr_topics=60)

2022-11-14 19:04:01,205 - BERTopic - Reduced number of topics from 215 to 61


<bertopic._bertopic.BERTopic at 0x7f82c8dec390>

In [None]:
# Access the newly updated topics with:
print(topic_model.topics_)

[0, 3, -1, 43, -1, -1, -1, 0, 0, -1, -1, -1, -1, -1, -1, 18, -1, 4, -1, -1, -1, 14, 42, -1, 0, -1, 4, -1, 3, 19, -1, 36, 35, 0, 17, 12, -1, -1, -1, -1, -1, 39, 47, 26, 0, -1, -1, 6, 1, -1, -1, -1, 36, 1, 56, -1, -1, -1, 8, -1, 0, -1, -1, -1, -1, -1, 0, -1, -1, -1, -1, -1, 38, -1, -1, -1, 0, 33, 11, 0, 54, -1, -1, 47, 31, -1, -1, 20, -1, -1, 0, 2, -1, 3, 18, -1, -1, -1, 4, -1, -1, -1, -1, 28, 2, 0, -1, -1, -1, -1, 12, -1, -1, 14, 18, 52, -1, -1, 0, -1, -1, -1, -1, -1, -1, -1, -1, 2, -1, -1, -1, -1, -1, 0, 43, 2, -1, 1, 16, 13, -1, 12, -1, 7, -1, -1, 12, 27, 0, 6, 37, 37, 43, -1, -1, -1, -1, 5, 44, 3, -1, 2, -1, -1, -1, -1, -1, -1, -1, 17, 22, -1, -1, -1, 11, -1, 0, -1, -1, 0, -1, 0, 53, -1, -1, 2, -1, -1, 18, -1, -1, -1, 2, 15, -1, 59, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 11, -1, 48, 15, -1, -1, -1, -1, 37, 4, -1, -1, 57, 2, 0, 29, -1, 9, 1, -1, 1, -1, -1, 0, -1, -1, -1, 26, 0, -1, -1, 0, 0, 7, -1, -1, 17, -1, 6, 6, 2, 9, -1, 23, -1, -1, 26, 3, -1, -1, 32, -1, -1, -1, -1, -1, -1, -1,

# **Search Topics**
After having trained our model, we can use `find_topics` to search for topics that are similar 
to an input search_term. Here, we are going to be searching for topics that closely relate the 
search term "vehicle". Then, we extract the most similar topic and check the results: 

In [None]:
similar_topics, similarity = topic_model.find_topics("vehicle", top_n=5); similar_topics

[55, 21, 56, 9, 47]

In [None]:
topic_model.get_topic(7)

[('the remote', 0.0048154657400008875),
 ('to compile', 0.004686015838793279),
 ('parse error', 0.004390586458889873),
 ('in libxmulibxmua is', 0.004221851556530347),
 ('libxmulibxmua is', 0.004221851556530347),
 ('in libxmulibxmua', 0.004221851556530347),
 ('error before', 0.004221851556530347),
 ('parse error before', 0.004221851556530347),
 ('libxmulibxmua is undefined', 0.004221851556530347),
 ('is undefined', 0.004101629815855426)]

# **Model serialization**
The model and its internal settings can easily be saved. Note that the documents and embeddings will not be saved. However, UMAP and HDBSCAN will be saved. 

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Save model
topic_model.save("/content/drive/MyDrive/KDM-ICP11/my_model")	


Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.



In [None]:
# Load model
my_model = BERTopic.load("/content/drive/MyDrive/KDM-ICP11/my_model")	

# **Embedding Models**
The parameter `embedding_model` takes in a string pointing to a sentence-transformers model, a SentenceTransformer, or a Flair DocumentEmbedding model.

## Sentence-Transformers
You can select any model from sentence-transformers here and pass it through BERTopic with embedding_model:



In [None]:
topic_model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens") #Trained on NLI and STSB datasets

Or select a SentenceTransformer model with your own parameters:


In [None]:
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
topic_model = BERTopic(embedding_model=sentence_model, verbose=True)

Downloading:   0%|          | 0.00/690 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/3.99k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/550 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/122 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/265M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/450 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/229 [00:00<?, ?B/s]

#Assignment: 
- Use another data source - Twitter / Reddit API
- Use another base model ([Here](https://www.sbert.net/docs/pretrained_models.html) is for a list of supported sentence transformers models).
- Compare and contrast topics with previously studied shallow learning technique LDA


# Another Data Source from Reddit

In [None]:
!pip install praw
!pip install asyncpraw

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting praw
  Downloading praw-7.6.1-py3-none-any.whl (188 kB)
[K     |████████████████████████████████| 188 kB 14.3 MB/s 
[?25hCollecting update-checker>=0.18
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Collecting prawcore<3,>=2.1
  Downloading prawcore-2.3.0-py3-none-any.whl (16 kB)
Collecting websocket-client>=0.54.0
  Downloading websocket_client-1.4.2-py3-none-any.whl (55 kB)
[K     |████████████████████████████████| 55 kB 4.5 MB/s 
Installing collected packages: websocket-client, update-checker, prawcore, praw
Successfully installed praw-7.6.1 prawcore-2.3.0 update-checker-0.18.0 websocket-client-1.4.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting asyncpraw
  Downloading asyncpraw-7.6.0-py3-none-any.whl (195 kB)
[K     |████████████████████████████████| 195 kB 15.1 MB/s 
Collecting aiosqlite<=0

In [None]:
import praw
#import asyncpraw as praw
from asyncpraw.models import MoreComments
reddit = praw.Reddit(
    user_agent="icp6",
    client_id="mj0MooumrU-UuM5CIw5NTQ",
    client_secret="sDjVxLtRSryKvjvTbSLMbowsm7mMYQ",
    redirect_uri='http://127.0.0.1:65010/'
                                # 'authorize_callback'
    # username="USERNAME",
    # password="PASSWORD"
)

In [None]:
import time
from tqdm.notebook import tqdm
topic_library = {
    'Blood cancer' : [ 'https://www.reddit.com/r/science/comments/ysicvs/study_has_defined_five_new_subgroups_of_the_most/', 
    'https://www.reddit.com/r/AskDocs/comments/y6yjfy/is_this_lymphoma_some_type_of_blood_cancer/',
    'https://www.reddit.com/r/newsbotbot/comments/yomcln/reuters_gsks_blood_cancer_drug_fails_main_goal_of/'],
    'Breast cancer': [
        'https://www.reddit.com/r/DiagnoseMe/comments/w2dntr/is_this_inflammatory_breast_cancer_showed_up_a/',
    'https://www.reddit.com/r/doihavebreastcancer/comments/ycqzs4/inflammatory_breast_cancer/',
    'https://www.reddit.com/r/doihavebreastcancer/comments/ykcrrg/inflammatory_breast_cancer_concerns/',
    ],
    'Prostate cancer': [
         'https://www.reddit.com/r/Testosterone/comments/ye09j9/va_claims_trt_increases_risk_of_heart_disease_and/',
    'https://www.reddit.com/r/science/comments/wvwlmf/previous_use_of_cannabis_associated_with_a_lower/', 
    'https://www.reddit.com/r/ProstateCancer/comments/ytaxib/37_yo_diagnosed_with_prostate_cancer_gleason/',
    
    
    ]
}
for topic, urls in tqdm(topic_library.items()):
  for url in tqdm(urls):
    submission = reddit.submission(url=url)
    submission.comments.replace_more(limit=0) # flatten tree
    all_comments = []
    comments = submission.comments # all comments
    # comments = submission.comments.list() # all comments
    for top_level_comment in submission.comments:
      if isinstance(top_level_comment, MoreComments):
          continue
      all_comments.append(top_level_comment.body)
      with open('{}-reddit_author.txt'.format(topic), 'a') as f:
          f.write(submission.selftext + '\n')
      with open('{}-reddit_comments.txt'.format(topic), 'a') as f:
        for i, comment in enumerate(comments):
          f.write(comment.body + '\n')
    # time.sleep(1.5)

  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



  0%|          | 0/3 [00:00<?, ?it/s]

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



  0%|          | 0/3 [00:00<?, ?it/s]

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



In [None]:
import shutil
for topic, _ in tqdm(topic_library.items()):
  shutil.copy2('{}-reddit_author.txt'.format(topic), '/content/drive/MyDrive/KDN-ICP11')
  shutil.copy2('{}-reddit_comments.txt'.format(topic), '/content/drive/MyDrive/KDN-ICP11')

  0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
import pandas as pd
reddit_df = pd.DataFrame(columns = ['text'])
for topic, _ in tqdm(topic_library.items()):
  with open('{}-reddit_comments.txt'.format(topic)) as f:
    lines = f.readlines()
    reddit_sub_df = pd.DataFrame(lines, columns = ['text'])
    reddit_sub_df = reddit_sub_df.sample(20)
    reddit_df= reddit_df.append(reddit_sub_df).reset_index(drop=True)
# lines

  0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
reddit_df

Unnamed: 0,text
0,Welcome to r/science! This is a heavily modera...
1,\n
2,Welcome to r/science! This is a heavily modera...
3,See what the haematologists say but I wouldn’t...
4,Thank you for your submission. **Please note t...
5,**Reply here if you are an unverified user wis...
6,The most important thing is that you are conti...
7,\n
8,\n
9,**Reply here if you are an unverified user wis...


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np 
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer,HashingVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tfidf_vect = TfidfVectorizer(analyzer='word', stop_words = 'english', max_df = 0.7)
tfidf_tokens = tfidf_vect.fit_transform(reddit_df['text'])

In [None]:
tfidf_vocab = tfidf_vect.get_feature_names_out()

In [None]:
print(tfidf_tokens[0])
tfidf_vocab[100]

  (0, 18)	0.13619141135439072
  (0, 264)	0.13619141135439072
  (0, 263)	0.1188520476694801
  (0, 45)	0.1188520476694801
  (0, 186)	0.1188520476694801
  (0, 271)	0.1188520476694801
  (0, 106)	0.1188520476694801
  (0, 200)	0.27238282270878145
  (0, 150)	0.10743090078644281
  (0, 192)	0.10289833767224425
  (0, 57)	0.12642638812964477
  (0, 48)	0.2377040953389602
  (0, 16)	0.13619141135439072
  (0, 47)	0.27238282270878145
  (0, 197)	0.13619141135439072
  (0, 15)	0.13619141135439072
  (0, 17)	0.13619141135439072
  (0, 211)	0.13619141135439072
  (0, 128)	0.13619141135439072
  (0, 165)	0.27238282270878145
  (0, 189)	0.13619141135439072
  (0, 195)	0.1188520476694801
  (0, 86)	0.13619141135439072
  (0, 71)	0.13619141135439072
  (0, 258)	0.12642638812964477
  (0, 163)	0.27238282270878145
  (0, 184)	0.13619141135439072
  (0, 72)	0.27238282270878145
  (0, 157)	0.13619141135439072
  (0, 216)	0.09890033366433569
  (0, 143)	0.13619141135439072
  (0, 102)	0.13619141135439072
  (0, 202)	0.3565561430084

'health'

In [None]:
lda = LatentDirichletAllocation(n_components = 10, doc_topic_prior=1)
lda.fit(tfidf_tokens)

LatentDirichletAllocation(doc_topic_prior=1)

In [None]:
import numpy as np 
topic_words = {}
n_top_words = 15
for topic, comp in enumerate(lda.components_):
    word_idx = np.argsort(comp)[::-1][:n_top_words]
    # store the words most relevant to the topic
    topic_words[topic] = [tfidf_vocab[i] for i in word_idx]
    
for topic, words in topic_words.items():
    print('Topic: %d' % topic)
    print('  %s' % ', '.join(words))

Topic: 0
  sub, users, checked, diagnose, muscles, oncologists, doctors, changes, breasts, immediately, physician, gladly, blows, level, laypeople
Topic: 1
  va, little, disability, ll, monthly, best, good, luck, wish, surgery, bad, 100, really, trt, doc
Topic: 2
  visit, doctor, does, information, use, www, ccl, informal, second, terms_of_use, submission, catch, terms, note, taken
Topic: 3
  subreddit, concerns, compose, moderators, contact, questions, venal, alliance, https, com, injection, mod, real, relationship, final
Topic: 4
  hear, science, wishing, treatment, personal, comment, people, reddit, want, cured, away, time, keeping, result, low
Topic: 5
  update, automatically, performed, bot, action, message, research, did, lymphocyte, count, askdocs, say, normal, wiki, high
Topic: 6
  removed, sorry, need, contacted, pectoral, real, relationship, final, crapshot, low, ccl, injection, mod, took, doing
Topic: 7
  comments, continue, life, pay, trimix, total, big, genetic, died, doin

# Optimize LDA model with Coherence Score

In [None]:
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load(disable=['parser', 'ner'])

def lemmatization(texts):
  texts_out = []
  for sent in texts:
    doc= nlp(" ".join(sent))
    texts_out.append([token.lemma_ for token in doc])
  return texts_out
data_lemmatized = lemmatization(reddit_df['text'])
print(data_lemmatized[:20])

[['w', 'e', 'l', 'c', 'o', 'm', 'e', '  ', 't', 'o', '  ', 'r', '/', 's', 'c', 'I', 'e', 'n', 'c', 'e', '!', '  ', 't', 'h', 'I', 's', '  ', 'I', 's', '  ', 'a', '  ', 'h', 'e', 'a', 'v', 'I', 'l', 'y', '  ', 'm', 'o', 'd', 'e', 'r', 'a', 't', 'e', 'd', '  ', 's', 'u', 'b', 'r', 'e', 'd', 'd', 'I', 't', '  ', 'I', 'n', '  ', 'o', 'r', 'd', 'e', 'r', '  ', 't', 'o', '  ', 'k', 'e', 'e', 'p', '  ', 't', 'h', 'e', '  ', 'd', 'I', 's', 'c', 'u', 's', 's', 'I', 'o', 'n', '  ', 'o', 'n', '  ', 's', 'c', 'I', 'e', 'n', 'c', 'e', '.', '  ', 'h', 'o', 'w', 'e', 'v', 'e', 'r', ',', '  ', 'w', 'e', '  ', 'r', 'e', 'c', 'o', 'g', 'n', 'I', 'z', 'e', '  ', 't', 'h', 'a', 't', '  ', 'm', 'a', 'n', 'y', '  ', 'p', 'e', 'o', 'p', 'l', 'e', '  ', 'w', 'a', 'n', 't', '  ', 't', 'o', '  ', 'd', 'I', 's', 'c', 'u', 's', 's', '  ', 'h', 'o', 'w', '  ', 't', 'h', 'e', 'y', '  ', 'f', 'e', 'e', 'l', '  ', 't', 'h', 'e', '  ', 'r', 'e', 's', 'e', 'a', 'r', 'c', 'h', '  ', 'r', 'e', 'l', 'a', 't', 'e', 's', ' 

In [None]:
from scipy.sparse import data
import gensim.corpora as corpora

id2word = corpora.Dictionary(data_lemmatized)

# create corpus
texts  = data_lemmatized

# term document frequency
corpus = [id2word.doc2bow(text) for text in texts]

# view
print(corpus[:20])

[[(0, 1), (1, 78), (2, 1), (3, 1), (4, 1), (5, 1), (6, 4), (7, 3), (8, 5), (9, 7), (10, 1), (11, 21), (12, 1), (13, 1), (14, 2), (15, 23), (16, 2), (17, 22), (18, 16), (19, 65), (20, 1), (21, 2), (22, 16), (23, 6), (24, 3), (25, 22), (26, 16), (27, 28), (28, 41), (29, 12), (30, 25), (31, 36), (32, 37), (33, 9), (34, 5), (35, 15), (36, 5), (37, 1)], [(0, 1)], [(0, 1), (1, 78), (2, 1), (3, 1), (4, 1), (5, 1), (6, 4), (7, 3), (8, 5), (9, 7), (10, 1), (11, 21), (12, 1), (13, 1), (14, 2), (15, 23), (16, 2), (17, 22), (18, 16), (19, 65), (20, 1), (21, 2), (22, 16), (23, 6), (24, 3), (25, 22), (26, 16), (27, 28), (28, 41), (29, 12), (30, 25), (31, 36), (32, 37), (33, 9), (34, 5), (35, 15), (36, 5), (37, 1)], [(0, 1), (1, 52), (4, 1), (5, 1), (7, 2), (8, 3), (11, 14), (15, 14), (16, 6), (17, 14), (18, 4), (19, 20), (20, 1), (21, 5), (22, 14), (23, 2), (24, 1), (25, 14), (26, 6), (27, 19), (28, 24), (29, 4), (30, 13), (31, 13), (32, 21), (33, 16), (35, 8), (36, 16), (38, 2), (39, 3), (40, 2), (

In [None]:
import gensim
model = gensim.models.LdaMulticore(corpus=corpus, id2word=id2word,
                                   num_topics=25, random_state=100,
                                   chunksize=100, passes=50,
                                   per_word_topics=True, minimum_phi_value=0.01, alpha= 'symmetric')

In [None]:
from gensim.models import CoherenceModel

# Calculate Coherence score
coherence_model_lda = CoherenceModel(model=model, texts = texts, dictionary= id2word, coherence = 'c_v')
coherence_lda = coherence_model_lda.get_coherence()
print("\nCoherence Score: ", coherence_lda)


Coherence Score:  0.39533430380608287


# Visualize Topics

In [None]:
!pip install pyLDAvis

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyLDAvis
  Downloading pyLDAvis-3.3.1.tar.gz (1.7 MB)
[K     |████████████████████████████████| 1.7 MB 13.6 MB/s 
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Installing backend dependencies ... [?25l[?25hdone
    Preparing wheel metadata ... [?25l[?25hdone
Collecting sklearn
  Downloading sklearn-0.0.post1.tar.gz (3.6 kB)
Collecting funcy
  Downloading funcy-1.17-py2.py3-none-any.whl (33 kB)
Building wheels for collected packages: pyLDAvis, sklearn
  Building wheel for pyLDAvis (PEP 517) ... [?25l[?25hdone
  Created wheel for pyLDAvis: filename=pyLDAvis-3.3.1-py2.py3-none-any.whl size=136898 sha256=5b11fe7f6f3b6cee29707ff2fc075a6e12cccc1967c7d9434582f0daae5e4ea6
  Stored in directory: /root/.cache/pip/wheels/c9/21/f6/17bcf2667e8a68532ba2fbf6d5c72fdf4c7f7d9abfa4852d2f
  Building wheel for sklearn 

In [None]:
import pyLDAvis
import pyLDAvis
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()
#import pyLDAvis.gensim
import pickle 

pyLDAvis.enable_notebook()
LDAvis_prepared = pyLDAvis.gensim_models.prepare(model, corpus, id2word)
LDAvis_prepared

TypeError: ignored

PreparedData(topic_coordinates=                        x                   y  topics  cluster       Freq
topic                                                                    
10    -0.232013+0.000000j -0.003099+0.000000j       1        1  40.864381
19    -0.202670+0.000000j  0.009472+0.000000j       2        1  22.282780
2     -0.222041+0.000000j  0.040085+0.000000j       3        1  13.094488
8     -0.204174+0.000000j  0.017400+0.000000j       4        1  12.176285
17    -0.241974+0.000000j -0.039207+0.000000j       5        1   7.230169
12    -0.202055+0.000000j  0.000501+0.000000j       6        1   3.243116
4     -0.048630+0.000000j -0.260246+0.000000j       7        1   0.391612
7     -0.003600+0.000000j -0.180920+0.000000j       8        1   0.148896
13     0.194758+0.000000j -0.321535+0.000000j       9        1   0.127658
16     0.072654+0.000000j  0.046097+0.000000j      10        1   0.027538
23     0.072654+0.000000j  0.046097+0.000000j      11        1   0.027538
22     

# Another base model

In [None]:
len(reddit_df)

60

# Topic modeling

In [None]:
from bertopic import BERTopic

topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(reddit_df['text'])

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2022-11-14 19:20:07,560 - BERTopic - Transformed documents to Embeddings
2022-11-14 19:20:12,124 - BERTopic - Reduced dimensionality
2022-11-14 19:20:12,144 - BERTopic - Clustered reduced embeddings


# Extracting topics

In [None]:
freq = topic_model.get_topic_info(); freq.head(5)

Unnamed: 0,Topic,Count,Name
0,0,30,0_your_to_the_you
1,1,16,1_venal_alliance_the_
2,2,14,2_to_this_any_removed


In [None]:
len(freq)

3

In [None]:
topic_model.get_topic(0)  # Select the most frequent topic

[('your', 0.08218916535287318),
 ('to', 0.07981240038613065),
 ('the', 0.07077662331172603),
 ('you', 0.07045436366991276),
 ('is', 0.06965485308428696),
 ('not', 0.062416730036730594),
 ('have', 0.056193910398348715),
 ('but', 0.05491006040455053),
 ('and', 0.054577646497233935),
 ('of', 0.051159293397282764)]

In [None]:
topic_model.topics_[:10]

[2, 1, 2, 0, 0, 2, 0, 1, 1, 2]

In [None]:
len(topic_model.topics_)

60

# Visualization

In [None]:
topic_model.visualize_topics()


k >= N for N * N square matrix. Attempting to use scipy.linalg.eigh instead.



TypeError: ignored

# Visualize Topic Probabilities

In [None]:
probs.shape

(60, 3)

In [None]:
import random
random_document_id = random.randint(0, len(probs))
print(docs[random_document_id])
print('Topic: {}'.format(topic_model.topics_[random_document_id]))

# topic_model.visualize_distribution(probs[random_document_id], min_probability=0.015)
topic_model.visualize_distribution(probs[random_document_id], min_probability=0.003)

Is there a way to connect a PowerBook 145, Mac IIsi, and Personal LaserWriter
LS so that I can (not necessarily silmultaneoulsy) print from either the IIsi,
or PB, and file share between the IIsi and PB?
I know I can get the ($expensive$) LW NT upgrade for my LS, but I can't afford
that...
Topic: 2


# Visualize Topic Hierarchy

In [None]:
topic_model.visualize_hierarchy(top_n_topics=10)


scipy.array is deprecated and will be removed in SciPy 2.0.0, use numpy.array instead


scipy.array is deprecated and will be removed in SciPy 2.0.0, use numpy.array instead


scipy.array is deprecated and will be removed in SciPy 2.0.0, use numpy.array instead


scipy.array is deprecated and will be removed in SciPy 2.0.0, use numpy.array instead



# Visualize Terms

In [None]:
topic_model.visualize_barchart(top_n_topics=5)

# Visualize Topic Similarity

In [None]:
topic_model.visualize_heatmap(top_n_topics = 20, width=1000, height=1000)


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations



In [None]:
topic_model.visualize_heatmap(n_clusters = 2, top_n_topics = 20, width=1000, height=1000)


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations



# Visualize Term Score Decline

In [None]:
topic_model.visualize_term_rank(log_scale=True)

# Topic Representation
## Update Topics

In [None]:
topic_model.update_topics(reddit_df['text'], n_gram_range=(2, 3))

In [None]:
topic_model.get_topic(2)   # We select topic that we viewed before

[('if you', 0.044679377321206845),
 ('action was', 0.03673876316082127),
 ('automatically please', 0.03673876316082127),
 ('and this action', 0.03673876316082127),
 ('am bot', 0.03673876316082127),
 ('the moderators of', 0.03673876316082127),
 ('questions or', 0.03673876316082127),
 ('am bot and', 0.03673876316082127),
 ('please contact', 0.03673876316082127),
 ('this action was', 0.03673876316082127)]

# Topic Reduction

In [None]:
topic_model.reduce_topics(reddit_df['text'], nr_topics=60)

2022-11-14 19:40:41,687 - BERTopic - Reduced number of topics from 3 to 3


<bertopic._bertopic.BERTopic at 0x7f8100e0c790>

In [None]:
# Access the newly updated topics with:
print(topic_model.topics_)

[2, 1, 2, 0, 0, 2, 0, 1, 1, 2, 0, 1, 0, 0, 2, 1, 2, 2, 1, 2, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 2, 2, 1, 2, 0, 0, 1, 1, 0, 0, 2, 0, 2, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0]


# Search Topics

In [None]:
similar_topics, similarity = topic_model.find_topics("vehicle", top_n=5); similar_topics

[2, 0, 1]

In [None]:
topic_model.get_topic(7)

False

# Model serialization

In [None]:
# Save model
topic_model.save("/content/drive/MyDrive/KDM-ICP11/my_model2")	

In [None]:
# Load model
my_model = BERTopic.load("/content/drive/MyDrive/KDM-ICP11/my_model2")	

# **Embedding Models**
## Sentence-Transformers

# Sentence Embeddings with Transformers

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch


#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask



#Sentences we want sentence embeddings for
sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of string.',
             'The quick brown fox jumps over the lazy dog.']

#Load AutoModel from huggingface model repository
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

#Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')

#Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

#Perform pooling. In this case, mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

Downloading:   0%|          | 0.00/350 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/612 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

In [None]:
sentence_embeddings

tensor([[-0.0746, -0.2330, -0.0850,  ...,  0.5448,  0.6725, -0.2300],
        [ 0.2604,  0.2537,  0.1447,  ...,  0.3068,  0.3916, -0.1535],
        [ 0.2383,  0.3197,  0.2613,  ...,  0.2829,  0.3043,  0.5536]])

# Applying my data in Sentence Transformer

In [None]:
#Sentences we want sentence embeddings for
sentences = reddit_df['text'].values.tolist()

#Load AutoModel from huggingface model repository
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

#Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')

#Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

#Perform pooling. In this case, mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])


0     Welcome to r/science! This is a heavily modera...
1                                                    \n
2     Welcome to r/science! This is a heavily modera...
3     See what the haematologists say but I wouldn’t...
4     Thank you for your submission. **Please note t...
5     **Reply here if you are an unverified user wis...
6     The most important thing is that you are conti...
7                                                    \n
8                                                    \n
9     **Reply here if you are an unverified user wis...
10    My Dad died of CCL but his genetic info linked...
11                                                   \n
12    Thank you for your submission. **Please note t...
13    See what the haematologists say but I wouldn’t...
14    *I am a bot, and this action was performed aut...
15                                                   \n
16    *I am a bot, and this action was performed aut...
17    *I am a bot, and this action was performed

In [None]:
sentence_embeddings

tensor([[-0.1721,  0.0508,  0.0438,  ...,  0.0504,  0.0484, -0.0047],
        [-0.7565,  0.3075, -0.0162,  ...,  0.8047,  0.2963, -0.1001],
        [-0.1721,  0.0508,  0.0438,  ...,  0.0504,  0.0484, -0.0047],
        ...,
        [-0.1459, -0.0604,  0.4893,  ...,  0.1219, -0.4064,  0.1274],
        [-0.0063,  0.0546, -0.0157,  ..., -0.1443, -0.0582,  0.0334],
        [ 0.3607, -0.1823,  0.0381,  ..., -0.3491,  0.0435,  0.3744]])

# Semantic Textual Similarity

In [None]:
reddit_df

Unnamed: 0,text
0,Welcome to r/science! This is a heavily modera...
1,\n
2,Welcome to r/science! This is a heavily modera...
3,See what the haematologists say but I wouldn’t...
4,Thank you for your submission. **Please note t...
5,**Reply here if you are an unverified user wis...
6,The most important thing is that you are conti...
7,\n
8,\n
9,**Reply here if you are an unverified user wis...


In [None]:
sentence1 = reddit_df.iloc[[0]]
sentence2 = reddit_df.iloc[[1]]
print(sentence1.shape)
print(sentence2.shape)

(1, 1)
(1, 1)


In [None]:
reddit_df.shape
#sentences = reddit_df['text']#.values.tolist()
#sentences.shape()

(60, 1)

In [None]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

# Two lists of sentences
#sentence1 = sentence1.values.tolist()
#sentence2 = sentence2.values.tolist() 

#Compute embedding for both lists
#embeddings1 = model.encode(sentence1, convert_to_tensor=True)
#embeddings2 = model.encode(sentence2, convert_to_tensor=True)

#Compute cosine-similarities
cosine_scores = util.cos_sim(sentence_embeddings, sentence_embeddings)

#Output the pairs with their score
for i in range(len(sentence_embeddings)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentence_embeddings[i], sentence_embeddings[i], cosine_scores[i][i]))

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
        -4.1833e-02,  3.6004e-01, -3.5817e-02,  2.5329e-01, -2.4723e-02,
         4.8259e-01, -2.3814e-01, -1.3043e-01, -1.7557e-01,  2.8809e-01,
         3.3405e-02, -2.4878e-01, -9.5829e-02, -1.9098e-01,  5.4408e-02,
        -6.8453e-02, -2.3410e-01,  1.0773e-01, -2.2208e-01, -1.1165e-01,
        -2.3888e-01,  1.1333e-02,  7.4527e-02, -7.9351e-02,  7.0242e-02,
        -1.4941e-01, -8.8077e-02,  1.4858e-01,  1.5384e-01, -2.0481e-01,
         1.6035e-01, -2.1919e-01, -2.1513e-02, -1.9796e-01, -2.1603e-01,
        -1.2151e-01, -1.6162e-01, -2.0673e-02,  4.6083e-02, -2.0793e-01,
        -1.5914e-03, -2.9326e-01, -8.0528e-02,  2.0278e-01,  3.0266e-01,
         5.7685e-02, -2.2562e-02,  2.7360e-01, -2.3831e-02,  1.0989e-01,
        -5.0160e-02, -2.6469e-01,  1.2205e-01, -7.9747e-02, -1.6827e-02,
        -6.5379e-02, -1.0369e-03, -6.5232e-02, -1.2242e-01,  1.4631e-01,
        -4.1563e-02,  1.7183e-01,  9.1847e-02,  1.2909e-01,

# Semantic Search

In [None]:
"""
This is a simple application for sentence embeddings: semantic search

We have a corpus with various sentences. Then, for a given query sentence,
we want to find the most similar sentence in this corpus.

This script outputs for various queries the top 5 most similar sentences in the corpus.
"""
from sentence_transformers import SentenceTransformer, util
import torch

embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.'
          ]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['A man is eating pasta.', 'Someone in a gorilla costume is playing a set of drums.', 'A cheetah chases prey on across a field.']


# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 5 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))

    """
    # Alternatively, we can also use util.semantic_search to perform cosine similarty + topk
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)
    hits = hits[0]      #Get the hits for the first query
    for hit in hits:
        print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))
    """





Query: A man is eating pasta.

Top 5 most similar sentences in corpus:
A man is eating food. (Score: 0.7035)
A man is eating a piece of bread. (Score: 0.5272)
A man is riding a horse. (Score: 0.1889)
A man is riding a white horse on an enclosed ground. (Score: 0.1047)
A cheetah is running behind its prey. (Score: 0.0980)




Query: Someone in a gorilla costume is playing a set of drums.

Top 5 most similar sentences in corpus:
A monkey is playing drums. (Score: 0.6433)
A woman is playing violin. (Score: 0.2564)
A man is riding a horse. (Score: 0.1389)
A man is riding a white horse on an enclosed ground. (Score: 0.1191)
A cheetah is running behind its prey. (Score: 0.1080)




Query: A cheetah chases prey on across a field.

Top 5 most similar sentences in corpus:
A cheetah is running behind its prey. (Score: 0.8253)
A man is eating food. (Score: 0.1399)
A monkey is playing drums. (Score: 0.1292)
A man is riding a white horse on an enclosed ground. (Score: 0.1097)
A man is riding a 

In [None]:
reddit_df.head()

Unnamed: 0,text
0,Welcome to r/science! This is a heavily modera...
1,\n
2,Welcome to r/science! This is a heavily modera...
3,See what the haematologists say but I wouldn’t...
4,Thank you for your submission. **Please note t...


In [None]:
"""
This is a simple application for sentence embeddings: semantic search

We have a corpus with various sentences. Then, for a given query sentence,
we want to find the most similar sentence in this corpus.

This script outputs for various queries the top 5 most similar sentences in the corpus.
"""
from sentence_transformers import SentenceTransformer, util
import torch

embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Corpus with example sentences
corpus = ['See what the haematologists say but I wouldn’t be particularly concerned if your CD4% is lower', 
          'because your absolute lymphocyte count is high. When your lymphocyte count comes down your CD4 % will increase',
          'I don’t know why your lymphocyte count is (a bit) higher than normal, but it’s not usually anything concerning.' ]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=False)
#corpus_embeddings = sentence_embeddings
# Query sentences:
queries = ['lymphocyte', 'Blood cancer.', 'breast cance.']


# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 5 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))

RuntimeError: ignored

# Cross-Encoders

In [None]:
from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers import SentenceTransformer, util
#model = SentenceTransformer('all-MiniLM-L6-v2')
model = CrossEncoder('distilroberta-base')
scores = model.predict([["See what the haematologists say but I wouldn’t be particularly concerned if your CD4% is lower, because your absolute lymphocyte count is high."],  
                        ["When your lymphocyte count comes down your CD4 % will increase."]])

Downloading:   0%|          | 0.00/480 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/331M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.decoder.weight', 'lm_head.dense.bias', 'roberta.pooler.dense.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.weight'

Downloading:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

In [None]:
scores

array([0.5484427, 0.5511168], dtype=float32)