# LLM Workshop - Dev Days 2023

This colab belongs to the "LLM, Generative AI and vector databases" workshop from the CM.com Dev Days 2023. In this workshop, you will create your own chatbot which can answer questions based on information it can find in a knowledge base of your choice.

## Initializing project

Please make sure to enable GPU computing by selecting a GPU: `Runtime -> Change runtime type -> Hardware accelerator -> GPU`

In [None]:
# Install the needed packages
!pip3 install faiss-cpu
!pip3 install pandas
# The code below uses the pre-1.0 `openai.ChatCompletion` interface, so pin the version
!pip3 install "openai<1"
!pip3 install sentence-transformers
!pip3 install ydata-profiling
!pip3 install --upgrade Pillow

After installing, restart the runtime: `Runtime -> Restart runtime`

In [None]:
# Import packages
import os
import timeit

import openai
from sentence_transformers import SentenceTransformer, util
import faiss
import pandas as pd
from ydata_profiling import ProfileReport

## Your first Large Language Model

Let's build a first chatbot to get a feeling for how an LLM works!

In [None]:
# Set correct settings for OpenAI API connection
openai.api_type = "azure"
openai.api_key = ""
openai.api_base = ''
openai.api_version = '2023-03-15-preview'
MODEL = "gpt-35-turbo"
MAX_TOKENS = 1000

In [None]:
# Create a system prompt: tell the language model what it should do and how it should behave
# EDIT THIS!
system_prompt = """You are a football score predictor that assists users in
predicting matches from the top teams in the Eredivisie. Only predictions for
the following clubs are allowed: Ajax, Feyenoord, PSV, AZ, FC Twente, Sparta,
FC Utrecht, SC Heerenveen. Please provide the predicted score in the format:
<GOALS_FOR_HOME_CLUB>-<GOALS_FOR_AWAY_CLUB>"""

# Define a user prompt: on what question or text you want a response on
# EDIT THIS!
user_prompt = "Ajax - Feyenoord"

# Define the complete prompt to send to OpenAI
prompt = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

# Ask for a chat answer
# https://platform.openai.com/docs/api-reference
completion = openai.ChatCompletion.create(
            engine=MODEL,
            messages=prompt,
            temperature=0.2,
            max_tokens=MAX_TOKENS,
            n=1
        )['choices'][0]['message']['content']

completion

'Based on recent performances and statistics, I predict that the score for Ajax vs Feyenoord will be 2-1 in favor of Ajax.'

### Few-shot vs zero-shot

[Additional code] Note how the model has a hard time sticking to the required format of returning only the score. You have several options here: optimize the prompt, extract the needed information from the output, or show the model some examples. The latter is called few-shot prompting and is shown below:

In [None]:
prompt = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Ajax - PSV"},
    {"role": "assistant", "content": "1-1"},
    {"role": "user", "content": "Sparta - AZ"},
    {"role": "assistant", "content": "0-0"},
    {"role": "user", "content": "Groningen - Heerenveen."},
    {"role": "assistant", "content": "Groningen is a club which is not in scope, my apologies."},
    {"role": "user", "content": user_prompt},
]
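
If you want to try different sets of examples, writing the message list by hand gets tedious. The sketch below builds the same kind of few-shot prompt from a list of example pairs. The helper name `build_few_shot_prompt` and the short system text are made up for illustration; they are not part of the workshop code.

```python
def build_few_shot_prompt(system, examples, user):
    """Build a chat message list with example turns before the real question.

    `examples` is a list of (user_text, assistant_text) pairs.
    """
    messages = [{"role": "system", "content": system}]
    for question, answer in examples:
        # Each example is shown to the model as a finished user/assistant exchange
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    # The real question always comes last
    messages.append({"role": "user", "content": user})
    return messages


examples = [("Ajax - PSV", "1-1"), ("Sparta - AZ", "0-0")]
few_shot = build_few_shot_prompt("You predict scores.", examples, "Ajax - Feyenoord")
print(len(few_shot))
```

The resulting list can be passed as `messages` in the same way as the hand-written `prompt`.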

In [None]:
completion = openai.ChatCompletion.create(
            engine=MODEL,
            messages=prompt,
            temperature=0.2,
            max_tokens=MAX_TOKENS,
            n=1
        )['choices'][0]['message']['content']

In [None]:
completion

Note that this is pretty costly, as you pay more for longer (input and output) texts. Therefore, optimizing the prompt or post-processing the output with a regex is often a better option. More sophisticated approaches also exist, such as fine-tuning (retraining) the model. However, this is out of scope for now.
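
As a rough idea of the regex route: the sketch below pulls the first `<number>-<number>` pattern out of a free-text answer. The helper name `extract_score` and the pattern are assumptions made for this example; adapt them to whatever output format you ask for.

```python
import re

# One or two digits, a dash, one or two digits (optional spaces around the dash)
SCORE_PATTERN = re.compile(r"\b(\d{1,2})\s*-\s*(\d{1,2})\b")


def extract_score(text):
    """Return the first score found in `text` as 'H-A', or None if there is none."""
    match = SCORE_PATTERN.search(text)
    if match is None:
        return None
    return f"{match.group(1)}-{match.group(2)}"


answer = ("Based on recent performances and statistics, I predict that the score "
          "for Ajax vs Feyenoord will be 2-1 in favor of Ajax.")
print(extract_score(answer))
```

This costs no extra tokens, but it only works when the answer actually contains a score, so the `None` case still needs handling.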

## Semantic search for Retrieval-Augmented Generation

You cannot put all your textual information into your LLM. For example, if you want to create a score predictor for your local football club, you want your LLM to have access to information about the clubs, the standings and past results. Sending all of this to the LLM is slow and costly, and not all of the information is needed: we only need information about the two clubs that are mentioned in the input.

Therefore, the following idea has been established: from all documents you have as background material, pick the most relevant ones and send these to the LLM to ask for an answer. We do this by performing semantic search.

In this process, we compare how semantically similar texts are: we want to measure how well the incoming question matches your documents. This can be done by vectorizing your texts and computing a vector similarity measure.
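
To make "vector similarity" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made up for the example; real embeddings have hundreds of dimensions and come from a model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings": the first document points in nearly the same direction as the query
query_vec = [1.0, 1.0, 0.0]
doc_close = [2.0, 2.0, 0.1]
doc_far = [0.0, 0.1, 3.0]

print(cosine_similarity(query_vec, doc_close))
print(cosine_similarity(query_vec, doc_far))
```

The document whose vector points in almost the same direction as the query gets the higher score, regardless of the length of the vectors.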


### Creating embeddings from your data

Transformer models are popular and known for their good understanding of text. The SentenceTransformers library has pre-trained models which transform texts into vectors that capture the meaning of the text, so-called embeddings. https://www.sbert.net/docs/quickstart.html

In [None]:
model_name = 'all-MiniLM-L6-v2'
model = SentenceTransformer(model_name, device='cuda')

Let's try to create some embeddings!

In [None]:
# Encode your first sentence into an embedding
query = "Someone is going to McDonalds."
query_emb = model.encode(query)
query_emb

In [None]:
sentences = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.'
          ]
# Encode multiple sentences
sentences_emb = model.encode(sentences)

# Compute the cosine similarity between these vectors: the score is higher the more semantically similar two sentences are
cos_sim = util.cos_sim(query_emb, sentences_emb)
cos_sim

In [None]:
# Function which uses the cosine similarity under the hood to rank the most similar sentences first
util.semantic_search(query_emb, sentences_emb)

### Load data

Now we are going to try this on a larger dataset containing movie descriptions. First, let's load the data.

Download and upload the data to Colab on the left of your screen: `files -> upload to session storage`.

 https://drive.google.com/drive/folders/1HH35yX0RpA44d5PYxXopTPB7l1-7aagb?usp=sharing

In [None]:
# Read the tables
movies = pd.read_csv("/content/movies.csv")
# ratings = pd.read_csv("/content/ratings_small.csv")

# Show first 5 rows
movies.head(5)

In [None]:
# For you to explore the data and see what is there!
ProfileReport(movies, title="Movie data exploratory data analysis").to_notebook_iframe()

Let's encode all our data.

In [None]:
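# Pass a plain list of strings; astype(str) turns missing overviews into the text "nan"
embeddings = model.encode(movies['overview'].astype(str).tolist(), show_progress_bar=True, normalize_embeddings=True)
embeddings.shape
# Takes about 47s on a GPU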

In [None]:
# Only needed when GPU is not available and you do not want to wait:

# import pickle
# with open('/content/embeddings.pickle', 'rb') as handle:
#     embeddings = pickle.load(handle)

## Semantic search using a Vector Library

If you build up a knowledge base containing all information on a certain subject, you eventually end up with huge lists of embeddings that you have to compare against every time you get a question. For 40k rows this is fine, but for 4M rows it is already getting pretty slow. To address this, vector libraries and vector databases have been developed, which contain algorithms to speed this up.

Let's use the vector library FAISS: we add our embeddings to an index and perform a semantic search.

In [None]:
# We want to search for a Roman or Greek movie
xq = model.encode(["A film about Roman or Greek antiquity"], normalize_embeddings=True)

In [None]:
# https://github.com/facebookresearch/faiss/wiki/Getting-started
# Create an 'index'
index = faiss.IndexFlatL2(model.get_sentence_embedding_dimension())
# Fill with our calculated embeddings
index.add(embeddings)

In [None]:
# Perform the search to obtain the k closest embeddings to our Roman/Greek movie query
k = 5
_, I = index.search(xq, k)
# These are the indices of the records that are semantically closest to the query
I
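
A note on the index type: the index above ranks by L2 (Euclidean) distance, while the earlier examples used cosine similarity. Because the embeddings were encoded with `normalize_embeddings=True`, both give the same ranking: for unit vectors, the squared L2 distance equals `2 - 2 * cosine`. The sketch below checks this on a few made-up vectors in plain Python; it does not use FAISS itself.

```python
import math


def normalize(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    # For unit vectors the dot product is the cosine similarity
    return sum(x * y for x, y in zip(a, b))


query_vec = normalize([1.0, 2.0, 0.5])
docs = [normalize(v) for v in ([1.0, 2.1, 0.4], [0.2, -1.0, 3.0], [2.0, 0.1, 0.1])]

# Smallest distance first versus largest cosine first
by_distance = sorted(range(len(docs)), key=lambda i: squared_l2(query_vec, docs[i]))
by_cosine = sorted(range(len(docs)), key=lambda i: -dot(query_vec, docs[i]))
print(by_distance, by_cosine)
```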

In [None]:
# What movies are these? Let's find out by printing their description
pd.set_option('max_colwidth', 300)
movies[['title', 'overview']].iloc[I.tolist()[0]]

In [None]:
# This needs to be real quick! Let's test the speed of semantic search.
n = 1000
timeit.timeit(lambda: index.search(xq, k), number=n) / n

### Speeding things up: Approximate nearest neighbour search (only for nerds :))



This speed is okay, but for large databases with millions of records, this gets pretty slow. We need another strategy: *approximate nearest neighbour search*. We can do this using the Inverted File Index, *IndexIVFFlat*. It clusters all the embeddings into groups, and during semantic search it only searches the groups closest to the question (query) embedding. In this way, we do not have to search through all embeddings! :D More information: https://www.pinecone.io/learn/vector-indexes/#inverted-file-index
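
To see the idea without any library, here is a toy inverted file index in plain Python. It is a simplified sketch, not how FAISS is implemented: the class name `ToyIVF` is made up, and the centroids are picked by hand, whereas FAISS learns them from the data with k-means during `train`.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(points, target, count):
    """Indices of the `count` points closest to `target`."""
    order = sorted(range(len(points)), key=lambda i: squared_l2(points[i], target))
    return order[:count]


class ToyIVF:
    def __init__(self, centroids):
        self.centroids = centroids
        self.lists = [[] for _ in centroids]  # one inverted list per group
        self.vectors = []

    def add(self, vectors):
        # Store each vector id in the list of its closest centroid
        for vec in vectors:
            group = nearest(self.centroids, vec, 1)[0]
            self.lists[group].append(len(self.vectors))
            self.vectors.append(vec)

    def search(self, query, k, nprobe=1):
        # Only look inside the `nprobe` groups whose centroids are closest to the query
        probed = nearest(self.centroids, query, nprobe)
        candidates = [i for group in probed for i in self.lists[group]]
        candidates.sort(key=lambda i: squared_l2(self.vectors[i], query))
        return candidates[:k]


centroids = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]  # hand-picked for the example
data = [[0.5, 0.2], [9.0, 9.5], [0.1, 9.0], [1.0, 1.0], [10.0, 11.0]]
ivf = ToyIVF(centroids)
ivf.add(data)
result = ivf.search([0.4, 0.4], k=2, nprobe=1)
print(result)
```

With `nprobe=1` only two of the five vectors are compared against the query. Raising `nprobe` to the number of groups turns it back into an exhaustive search, which is the speed versus accuracy trade-off this setting controls.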

In [None]:
# Initialize the index
nlist = 40
quantizer = faiss.IndexFlatL2(model.get_sentence_embedding_dimension())
index_ivf = faiss.IndexIVFFlat(quantizer, model.get_sentence_embedding_dimension(), nlist)
# Let the index learn which groups there are
index_ivf.train(embeddings)
index_ivf.add(embeddings)
# Set nprobe to 5, meaning we search the 5 groups whose centroids are closest to the incoming question/query
index_ivf.nprobe = 5

In [None]:
# Perform nearest neighbour search
_, indices = index_ivf.search(xq, k)

In [None]:
movies[['title', 'overview']].iloc[indices.tolist()[0]]

Identical results! But how much faster are we?

In [None]:
n = 1000
timeit.timeit(lambda: index_ivf.search(xq, k), number=n) / n

More than 10 times faster and similar accuracy!
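
"Similar accuracy" can be put into a number by checking how many of the exact results the approximate search also found, often called recall. A minimal sketch, with a made-up helper name `recall_at_k` and made-up ids:

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact top-k that the approximate search also returned."""
    exact = set(exact_ids)
    if not exact:
        return 0.0
    return len(exact & set(approx_ids)) / len(exact)


exact_ids = [12, 7, 31, 5, 18]    # made-up ids from an exact search
approx_ids = [12, 7, 31, 44, 18]  # made-up ids from an approximate search
print(recall_at_k(exact_ids, approx_ids))  # 0.8
```

In this notebook you could call it as `recall_at_k(I[0].tolist(), indices[0].tolist())`. For a fair picture, average it over many different queries rather than a single one.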

## Combine LLM with semantic search: Retrieval-Augmented Generation

In [None]:
# Settings
system_prompt = """
You are a movie recommender chatbot. Based on incoming user information and preferences, we have selected movies that fit the user.
These movies are included in the incoming message.
Use these movies to create a message explaining why one of them suits the user. Include the movie title and a reason to buy.
Your tone of voice when answering questions is very sarcastic.
"""

In [None]:
def create_chat_prompt(system: str, user: str):
  return [
      {"role": "system", "content": system},
      {"role": "user", "content": user},
  ]

In [None]:
def generate_answer(question: str):

  # Encode question to obtain an embedding
  xq = model.encode([question], normalize_embeddings=True)

  # Obtain the index of the single most similar item (top 1)
  _, indices = index.search(xq, 1)

  # Structure the item information data for the prompt
  # (str() guards against missing overviews, which pandas reads as NaN floats)
  item_info = "\n".join(
      "Movie title: '" + str(row['title']) + "'. Description: " + str(row['overview'])
      for _, row in movies.iloc[indices.tolist()[0]].iterrows()
  )
  print(item_info + "\n\n")

  # Create user prompt by concatenating question and item info
  user_prompt = question + "\n" + item_info

  # Create full prompt
  prompt = create_chat_prompt(system_prompt, user_prompt)

  # Send it!
  completion = openai.ChatCompletion.create(
            engine=MODEL,
            messages=prompt,
            temperature=1,
            max_tokens=MAX_TOKENS,
            n=1
        )['choices'][0]['message']['content']

  return completion
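
To inspect what the model actually receives, it can help to build the user prompt without calling the API. The sketch below mirrors the formatting step of `generate_answer` on a made-up movie; the helper names `format_items` and `build_user_prompt` and the example row are assumptions for illustration only.

```python
def format_items(items):
    """Turn retrieved rows (dicts with 'title' and 'overview') into prompt text."""
    return "\n".join(
        f"Movie title: '{item['title']}'. Description: {item['overview']}"
        for item in items
    )


def build_user_prompt(question, items):
    # The question comes first, followed by one line per retrieved movie
    return question + "\n" + format_items(items)


retrieved = [
    {"title": "Example Movie", "overview": "A made-up plot about ancient Rome."},
]
user_prompt_preview = build_user_prompt("I want to see a film about Roman antiquity", retrieved)
print(user_prompt_preview)
```

Printing the prompt like this is a cheap way to debug odd answers: if the retrieved description is irrelevant, the problem is in the search step rather than in the LLM.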

In [None]:
import time
start_time = time.time()
answer = generate_answer("I want to see a film about Roman or Greek antiquity")
print(answer)
print("--- %s seconds ---" % (time.time() - start_time))

## Create your own chatbot below!

Be creative! At the end, every team will present their chatbot in a 5-minute presentation. Additional ideas can be found on the slide.