This example Colab notebook implements a document embedding and retrieval use case. Using a small set of movies and their summaries, it builds an informational chatbot that answers questions about those movies.


Adapted from: https://github.com/google/generative-ai-docs/blob/main/site/en/gemini-api/tutorials/document_search.ipynb

Code taken from the adapted notebook is credited in comments within the code cells.

In [1]:
!pip install -U -q "google-generativeai>=0.7.2"

In [205]:
import textwrap
import numpy as np
import pandas as pd
import google.generativeai as genai
import os
import time

from google.colab import userdata
from IPython.display import Markdown

You need a Gemini API key to run this notebook. You can create one in Google AI Studio. The next cell reads the key from a Colab secret named **GEMINI_KEY**.

In [220]:
GOOGLE_API_KEY=userdata.get('GEMINI_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

Here, we fetch the movie .txt files. Each file is named after its movie, and the text inside the file is only the summary. This example dataset is available at the same GitHub link used to access this notebook.

In [90]:
folder_path = '/content/movies'
files = os.listdir(folder_path)

In [221]:
titles = []
summaries = []

In [213]:
for filename in files:
  file_path = os.path.join(folder_path, filename)
  with open(file_path, "r", encoding="utf-8") as file:
    summary = file.read()
  # The file name without its extension is the movie title
  title = os.path.splitext(filename)[0]
  titles.append(title)
  summaries.append(summary)
  print(title)

The Half of It
Dunkirk
Jojo Rabbit
Oppenheimer
18x2 Beyond Youthful Days
Parasite
Everything Everywhere all at Once
Manchester by the Sea
Schindler's List
The Godfather
Dead Poet's Society
Lady Bird
Gone With the Wind
West Side Story
The Theory of Everything
The Great Gatsby
La La Land
Titanic
500 Days of Summer
Barbie
Do Revenge
The Imitation Game
Interstellar
The Perks of Being a Wallflower
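
As an alternative to the loop above, here is a minimal sketch of the same loading step using pathlib. It assumes the layout described earlier (one .txt file per movie, file name equals title). The function name load_movies and the two demo files are made up for illustration and are not part of the dataset.

```python
import tempfile
from pathlib import Path


def load_movies(folder):
    """Return (titles, summaries) read from every .txt file in `folder`."""
    titles, summaries = [], []
    # sorted() makes the order deterministic; directory listing order is arbitrary.
    for path in sorted(Path(folder).glob("*.txt")):
        titles.append(path.stem)  # file name without the ".txt" extension
        summaries.append(path.read_text(encoding="utf-8"))
    return titles, summaries


# Demo on a throwaway folder with made-up files (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "Movie B.txt").write_text("Summary of B.", encoding="utf-8")
    Path(tmp, "Movie A.txt").write_text("Summary of A.", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")
    demo_titles, demo_summaries = load_movies(tmp)

print(demo_titles)
```

Filtering on the .txt extension skips stray files in the folder, and sorting keeps the dataframe row order stable between runs.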


In [208]:
df = pd.DataFrame({"title": titles, "content": summaries})
df

Unnamed: 0,title,content
0,The Half of It,Ellie Chu lives in the remote town of Squahami...
1,Dunkirk,"In 1940, during the Battle of France, Allied s..."
2,Jojo Rabbit,During the collapse of Nazi Germany in the cit...
3,Oppenheimer,"In 1926, 22-year-old doctoral student J. Rober..."
4,18x2 Beyond Youthful Days,"Jimmy, the founder of a video game development..."
5,Parasite,The Kim family lives in a semi-basement flat (...
6,Everything Everywhere all at Once,Evelyn Quan Wang is a middle-aged Chinese immi...
7,Manchester by the Sea,Lee Chandler is a depressed and asocial janito...
8,Schindler's List,"In German-occupied Kraków during World War II,..."
9,The Godfather,"In 1945, the New York City Corleone family don..."


Here, we define the two models that we will be using:


*   **embedding_model**: The name of the model used to generate embeddings, both for the movie summaries and for user queries
*   **gemini_model**: The model used for the Q&A section of the notebook, in which we provide a prompt to Gemini and receive a response


In [206]:
embedding_model = 'models/embedding-001'
gemini_model = genai.GenerativeModel('models/gemini-1.5-flash')

Now that we have our dataframe, we can create an embedding for each movie summary. These embeddings are what we later compare against a user's query to find the most similar document in the dataframe.

In [209]:
embeddings = []
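for index, row in df.iterrows():
  title = row["title"]
  summary = str(row["content"])
  # Embed the summary as a retrieval document, passing the movie title along with it
  embed = genai.embed_content(model=embedding_model,
                              content=summary,
                              task_type="retrieval_document",
                              title=title)["embedding"]

  embeddings.append(embed)
  time.sleep(3)  # pause between requests (presumably to stay under API rate limits)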

df["embeddings"] = embeddings

In [210]:
df

Unnamed: 0,title,content,embeddings
0,The Half of It,Ellie Chu lives in the remote town of Squahami...,"[0.040409114, 0.0020535272, -0.019358685, -0.0..."
1,Dunkirk,"In 1940, during the Battle of France, Allied s...","[0.013893375, 0.001847346, -0.021615293, -0.03..."
2,Jojo Rabbit,During the collapse of Nazi Germany in the cit...,"[-0.025704551, 0.033689447, 0.0054823584, 0.00..."
3,Oppenheimer,"In 1926, 22-year-old doctoral student J. Rober...","[-0.0050539807, -0.0053329207, -0.02729902, -0..."
4,18x2 Beyond Youthful Days,"Jimmy, the founder of a video game development...","[0.02347291, 0.0016603536, -0.025433268, -0.01..."
5,Parasite,The Kim family lives in a semi-basement flat (...,"[0.047810633, 0.012849592, -0.051752906, -0.03..."
6,Everything Everywhere all at Once,Evelyn Quan Wang is a middle-aged Chinese immi...,"[-0.018914413, -0.010303412, -0.029896662, -0...."
7,Manchester by the Sea,Lee Chandler is a depressed and asocial janito...,"[0.033810396, 0.010540827, -0.007948213, 0.024..."
8,Schindler's List,"In German-occupied Kraków during World War II,...","[-0.021749232, -0.0076108095, -0.06814597, -0...."
9,The Godfather,"In 1945, the New York City Corleone family don...","[-0.0054041487, 0.020592896, -0.04643777, 0.00..."


Now that each movie summary has an embedding, we can embed a user query and compare it against every embedding in the dataframe. The function below scores each summary by its dot product with the query embedding and returns the summary with the highest score. Note that the dot product equals **cosine similarity** only when the vectors have unit length.

In [218]:
# Provided code from https://github.com/google-gemini/cookbook/blob/main/examples/Talk_to_documents_with_embeddings.ipynb
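def generate_query_embeddings(query, df):
  # Embed the query with the retrieval_query task type
  request = genai.embed_content(model=embedding_model, content=query, task_type="retrieval_query")
  query_embed = request["embedding"]

  # Score every summary against the query and keep the best match
  dot_product = np.dot(np.stack(df["embeddings"]), query_embed)
  index = np.argmax(dot_product)
  response = df.iloc[index]["content"]

  # Despite the function's name, this returns the matching summary text, not an embedding
  return response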
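
The function above ranks summaries with np.dot. Here is a minimal sketch, in plain Python with invented 2-D toy vectors, of how the dot product and cosine similarity can pick different winners when the vectors are not unit length. The helper names and the vectors are illustrative assumptions; nothing here calls the Gemini API or says anything about the lengths of its embeddings.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors invented for illustration; real embeddings have many more dimensions.
query = [1.0, 0.0]
docs = {
    "long_but_off_topic": [3.0, 4.0],  # length 5, points well off the query direction
    "short_but_on_topic": [0.9, 0.1],  # length below 1, nearly parallel to the query
}

best_by_dot = max(docs, key=lambda name: dot(docs[name], query))
best_by_cosine = max(docs, key=lambda name: cosine_similarity(docs[name], query))

print(best_by_dot)
print(best_by_cosine)
```

The dot product favors the long vector (3.0 versus 0.9), while cosine similarity favors the nearly parallel one (about 0.99 versus 0.6). If every embedding has length 1, the two rankings coincide.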

Once we have the document most relevant to the query, the following **make_prompt** function builds the prompt that Gemini will answer.

In [217]:
# Provided prompt from https://github.com/google-gemini/cookbook/blob/main/examples/Talk_to_documents_with_embeddings.ipynb
def make_prompt(query, relevant_passage):
  escaped = relevant_passage.replace("'", "").replace('"', "").replace("\n", " ")
  prompt = textwrap.dedent("""You are a helpful and informative bot that answers questions using text from the reference passage included below. \
  Be sure to respond in a complete sentence, being comprehensive, including all relevant background information. \
  However, you are talking to a non-technical audience, so be sure to break down complicated concepts and \
  strike a friendly and conversational tone. \
  If the passage is irrelevant to the answer, you may ignore it. Refer to the movie list in answering questions
  at all times.
  QUESTION: '{query}'
  PASSAGE: '{relevant_passage}'

    ANSWER:
  """).format(query=query, relevant_passage=escaped)

  return prompt
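
To see what the first line of make_prompt does to a passage, here is a minimal sketch. The replace chain is copied from the function above. The helper name flatten_passage and the sample text are made up for illustration.

```python
def flatten_passage(passage):
    """Same clean-up as in make_prompt: drop quote characters, turn newlines into spaces."""
    return passage.replace("'", "").replace('"', "").replace("\n", " ")


# Invented sample containing double quotes, an apostrophe, and a newline.
sample = 'He says "run".\nIt\'s too late.'
print(flatten_passage(sample))
```

This prints `He says run. Its too late.` The clean-up keeps the passage from closing the single quotes that wrap it in the prompt, but it is lossy: apostrophes inside words disappear too, so "It's" becomes "Its".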

Finally, we can put it all together with the following function. The entire flow breaks down into three steps:

1. Embed the query and retrieve the summary of the most similar movie
2. Build the prompt from the query and that summary
3. Send the prompt to Gemini, which answers the original user query using the retrieved summary



In [207]:
def ask_movie_chatbot(query):
  relevant_passage = generate_query_embeddings(query, df)
  prompt = make_prompt(query, relevant_passage)
  answer = gemini_model.generate_content(prompt)

  return Markdown(answer.text)
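
The same retrieve, prompt, answer flow can be tried offline. Below is a minimal sketch in which fake_embed and fake_generate are hypothetical stand-ins for genai.embed_content and gemini_model.generate_content, and the two summaries are invented. It illustrates only the control flow, not the quality of real embeddings or model answers.

```python
def fake_embed(text):
    """Hypothetical stand-in for an embedding call: counts two keywords."""
    words = text.lower().split()
    return [words.count("war"), words.count("love")]


def fake_generate(prompt):
    """Hypothetical stand-in for the model call: echoes the prompt back."""
    return "(model answer based on) " + prompt


# Invented one-line summaries, not the notebook's dataset.
movies = {
    "Film A": "A war story about soldiers at war",
    "Film B": "A love story set in Paris",
}
doc_vectors = {title: fake_embed(text) for title, text in movies.items()}


def ask(query):
    # Step 1: embed the query and pick the summary with the largest dot product.
    q = fake_embed(query)
    best = max(doc_vectors, key=lambda t: sum(a * b for a, b in zip(doc_vectors[t], q)))
    # Step 2: build the prompt from the query and the retrieved summary.
    prompt = f"QUESTION: {query}\nPASSAGE: {movies[best]}\nANSWER:"
    # Step 3: hand the prompt to the (fake) model.
    return best, fake_generate(prompt)


title, answer = ask("Which film is about love")
print(title)
```

Swapping the two fake functions for the real API calls gives the structure of ask_movie_chatbot above. Keeping them as separate functions makes the retrieval step easy to check without spending API quota.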

Here's an example of how this works!

In [219]:
query = "What movie is about a scientist?"
ask_movie_chatbot(query)

The movie *The Theory of Everything* is about Stephen Hawking, a brilliant astrophysics student at the University of Cambridge.  The film follows his life, from his diagnosis with motor neuron disease to his groundbreaking work on black holes and his rise to international fame as a physicist.
