<a href="https://colab.research.google.com/github/sishef/nlpworkshop/blob/main/Example_ChatterbotCorpusWithTransformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Example: Using semantic similarity on an existing corpus of sentences
This example shows how to install an existing corpus of conversation data (the `chatterbot-corpus` package) and encode all of it into embeddings, which can then be used to quickly search for semantically similar sentences.

**NOTE:** The code below that encodes all the corpus sentences is quite slow to run on a CPU. You can dramatically speed it up by telling Colab to use a GPU (hardware acceleration).
To do this, go to the Runtime menu above --> Change runtime type --> Hardware accelerator = GPU
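
To confirm that the runtime change took effect, a quick check along these lines can be run in a cell. This is only a minimal sketch: the helper name `pick_device` is hypothetical, and it assumes PyTorch is installed (it is a dependency of `sentence-transformers`); if `torch` cannot be found it simply reports "cpu".

```python
import importlib.util


def pick_device():
    """Return "cuda" if PyTorch can see a GPU, otherwise "cpu"."""
    # Hypothetical helper for illustration; not part of any library.
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed, so no GPU can be used
    import torch
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Encoding will run on: {device}")
```

`SentenceTransformer` accepts a `device` argument, so the result could be passed explicitly, for example `SentenceTransformer('stsb-roberta-large', device=device)`. When the argument is omitted, the library picks a GPU by itself if one is available.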

In [None]:
!pip install transformers
!pip install sentence-transformers
!pip install chatterbot-corpus

In [None]:
import numpy as np
from sentence_transformers import SentenceTransformer, util

#model = SentenceTransformer('stsb-roberta-base')
model = SentenceTransformer('stsb-roberta-large') #--> the best one but slow to encode if you aren't using GPU

In [None]:
import os
import inspect
import torch
from yaml import safe_load
import chatterbot_corpus

# This is the location of the corpus YAML files installed with the chatterbot-corpus package
data_path = os.path.join(os.path.dirname(inspect.getfile(chatterbot_corpus)), 'data/english')

all_data = []
all_embeddings = []

# Loop through each YAML file
for conv_file in os.listdir(data_path):

  # Read the YAML (safe_load avoids the Loader argument that newer PyYAML requires for load)
  # and flatten its conversations into one long list of sentences
  with open(os.path.join(data_path, conv_file), 'r', encoding='utf-8') as f:
    conv_data = safe_load(f)
  conv_data = [item for sublist in conv_data['conversations'] for item in sublist]

  # Use the transformer model to encode the full list of sentences and store the result
  print(f"Encoding {len(conv_data)} lines in file {conv_file}...")
  all_embeddings.append(model.encode(conv_data, convert_to_tensor=True))
  all_data.append(conv_data)

# Turn a list of lists into one big list for both the embeddings and the original sentence data
all_embeddings = torch.cat(all_embeddings)
all_data = [item for sublist in all_data for item in sublist]
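
Flattening discards the conversation boundaries, so the sentence that follows a given line in the flattened list may be the first line of the next conversation rather than a reply. If that matters, one option is to record explicit (prompt, response) pairs while flattening. The sketch below uses a small hand-written list in the same shape as the `conversations` entry of a corpus file; the sample sentences and the helper name `to_pairs` are assumptions for illustration.

```python
def to_pairs(conversations):
    """Pair every sentence with the one that follows it in the same conversation."""
    pairs = []
    for conversation in conversations:
        # zip stops at the shorter input, so the last sentence starts no pair
        for prompt, response in zip(conversation, conversation[1:]):
            pairs.append((prompt, response))
    return pairs


# Made-up sample data shaped like conv_data['conversations']
sample_conversations = [
    ["Hello", "Hi there!"],
    ["What is your favorite food?", "Pizza.", "Why?", "Because it tastes good."],
]

pairs = to_pairs(sample_conversations)
for prompt, response in pairs:
    print(f"{prompt!r} -> {response!r}")
```

Encoding only the prompts and looking up the stored response would then keep each match inside its own conversation.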

In [None]:
# Let's try it out
query = "what is your favorite dish?"

query_emb = model.encode(query, convert_to_tensor=True)
cos_scores = util.cos_sim(query_emb, all_embeddings)[0] # cosine similarity between the query and every corpus sentence embedding
top_results = torch.topk(cos_scores, k=5) # pick the 5 best matches

for score, idx in zip(top_results.values, top_results.indices):
  idx = idx.item()
  # The "response" is simply the next sentence in the flattened list; the very last sentence has none
  response = all_data[idx + 1] if idx + 1 < len(all_data) else None
  print(f"match: {all_data[idx]}, response: {response}, score: {score.item():.4f}")
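
To make the search step less opaque, the sketch below reproduces it in plain Python on a few tiny made-up vectors: it computes the cosine similarity between a query vector and each stored vector, then keeps the highest-scoring indices. The vectors, the sentence labels and the helper names `cosine` and `top_k` are all assumptions for illustration; real embeddings from the model have many more dimensions.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k):
    """Return (score, index) pairs for the k vectors most similar to the query."""
    scored = [(cosine(query, vec), idx) for idx, vec in enumerate(vectors)]
    return heapq.nlargest(k, scored)


# Made-up 3-dimensional "embeddings" standing in for encoded sentences
sentences = ["about food", "about weather", "about cooking"]
vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.8, 0.0, 0.6]]
query = [1.0, 0.0, 0.1]

results = top_k(query, vectors, k=2)
for score, idx in results:
    print(f"match: {sentences[idx]}, score: {score:.4f}")
```

`util.cos_sim` and `torch.topk` perform the same two steps on tensors, in a vectorized form that can run on the GPU.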