## Overview

This project demonstrates how to use the Gemini API to create embeddings,will use the Python client library to build a word embedding that allows you to compare search strings, or questions, to document contents.

In this project, will use embeddings to perform document search over a set of documents to ask questions related to the documents.

In [152]:
!pip install -U -q google-generativeai

In [153]:
import textwrap
import google.generativeai as GenAi
import google.ai.generativelanguage as glm
import pandas as pd
import numpy as np
from google.colab import userdata

from IPython.display import Markdown, display

In [154]:
GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')
GenAi.configure(api_key=GOOGLE_API_KEY)

## Embedding generation

In this section, you will see how to generate embeddings for a piece of text using the embeddings from the Gemini API.

See the [Embeddings quickstart](https://github.com/google-gemini/cookbook/blob/main/quickstarts/Embeddings.ipynb) to learn more about the `task_type` parameter used below.

In [155]:
# Example 1
text = "SATHISH Mattapalli"
result = GenAi.embed_content(model="models/text-embedding-004", content=text)

# Print just a part of the embedding to keep the output manageable
print(str(result['embedding'])[:50], '... TRIMMED]')

[-0.023678103, 0.04343279, -0.05387894, 0.00810668 ... TRIMMED]


In [156]:
print(len(result['embedding']))

768


In [157]:
# Example 2
title = "6 ways we supported learning and education in 2024"
sample_text = ("Title: 6 ways we supported learning and education in 2024"
    "\n"
    "Full article:\n"
    "\n"
    "Gemini API & Google AI Studio: An approachable way to explore and prototype with generative AI applications")

model = 'models/embedding-001'
embedding = GenAi.embed_content(model=model,
                                content='sample_text',
                                task_type="retrieval_document",
                                title=title)

print(embedding)

{'embedding': [0.058082152, -0.010528708, -0.03629449, 0.0521812, 0.05498597, 0.06341316, 0.013103872, -0.014023898, 0.03269887, 0.04185978, -0.029658034, 0.011816381, -0.042047273, -0.039411407, 0.030298395, -0.076788254, -0.011837379, 0.056390002, -0.011904153, 0.03627013, 0.007622999, 0.018235749, 0.0037318566, -0.0055210325, 0.015927318, -0.031068234, -0.008151011, -0.070511214, -0.018628987, 0.028235123, -0.05253229, 0.043766163, -0.02781425, 0.042436853, -0.021410724, -0.042646695, -0.02753972, -0.0008122271, -0.027216215, -0.02216579, -0.00889182, -0.072435215, -0.014347731, 0.017958324, 0.004720065, -0.049846135, 0.035029326, 0.013905076, -0.007294941, -0.060779214, 0.022042504, -0.0066578514, 0.08864284, -0.028215619, 0.003656741, -0.031895053, 0.025039444, -0.016132873, -0.014320613, -0.017309133, 0.018090209, 0.03398677, 0.0009894903, 0.030281445, -0.040673267, -0.03285413, -0.07414294, 0.0059097265, 0.05373042, -0.028807035, -0.011909862, -0.012380076, 0.053620517, 0.028198

In [158]:
print(len(embedding['embedding']))

768


# Building an embeddings database

Here are four sample text to use to build the embeddings database, we will use Gemini API to create embeddings of each of the documents.

In [159]:
DOCUMENT_A = {
    "title": "Evolution of the Internet",
    "content": "The internet's evolution began in the 1960s as a U.S. military project aimed at creating a resilient communication network. Its original purpose was to ensure functionality even in the face of partial infrastructure failures. Over time, this network grew into the modern internet, connecting billions of users globally via the Internet Protocol Suite (TCP/IP). A significant leap forward came in the early 1990s with the advent of the World Wide Web by Tim Berners-Lee, which revolutionized public access and usability."
}

DOCUMENT_B = {
    "title": "Impacts of Global Warming",
    "content": "Global warming represents a significant shift in Earth’s climate, largely driven by human activities like burning fossil fuels, which increase atmospheric greenhouse gases. This phenomenon leads to intensified weather patterns, including more frequent hurricanes, prolonged droughts, and devastating floods. Furthermore, it accelerates polar ice melting and rising sea levels, endangering coastal regions. The consequences extend to agriculture, water availability, and biodiversity, highlighting the urgent need for mitigation efforts."
}

DOCUMENT_C = {
    "title": "Advances in Green Energy",
    "content": "Recent progress in green energy technologies underscores the global push towards sustainable power solutions. Solar technology has become increasingly efficient and cost-effective, with innovations like advanced photovoltaic cells making solar energy more accessible. Wind energy has also seen remarkable improvements, with cutting-edge turbines generating power more consistently and in greater quantities. These innovations are pivotal in reducing dependence on fossil fuels and combatting climate change through renewable energy adoption."
}

DOCUMENT_D = {
    "title": "Artificial Intelligence in Healthcare",
    "content": "Artificial Intelligence (AI) is transforming healthcare by enhancing diagnostic accuracy, personalizing treatments, and streamlining administrative tasks. AI algorithms are capable of analyzing vast datasets to identify patterns and predict patient outcomes, leading to earlier interventions and better care. Examples include AI-driven tools for medical imaging analysis and virtual assistants that improve patient interaction. This technology is not only improving efficiency but also expanding access to high-quality healthcare worldwide."
}

documents = [DOCUMENT_A, DOCUMENT_B, DOCUMENT_C, DOCUMENT_D]


In [160]:
df = pd.DataFrame(documents)
df.columns = ['Title', 'Content']
df

Unnamed: 0,Title,Content
0,Evolution of the Internet,The internet's evolution began in the 1960s as...
1,Impacts of Global Warming,Global warming represents a significant shift ...
2,Advances in Green Energy,Recent progress in green energy technologies u...
3,Artificial Intelligence in Healthcare,Artificial Intelligence (AI) is transforming h...


Now we will create the embeddings for each of the text and will add it to the data frame.

In [161]:
def embed_fn(title, Content):
  return GenAi.embed_content(model="models/embedding-001", content=title, task_type='retrieval_document', title=title)["embedding"]
df['Embedding'] = df.apply(lambda x: embed_fn(x['Title'], x['Content']), axis=1)
df

# response = GenAi.embed_content(model="models/text-embedding-004", content=title, task_type='retrieval_document', title=title)["embedding"]
# print(response)

Unnamed: 0,Title,Content,Embedding
0,Evolution of the Internet,The internet's evolution began in the 1960s as...,"[0.03831492, -0.026607854, -0.043914452, -0.00..."
1,Impacts of Global Warming,Global warming represents a significant shift ...,"[0.013367976, -0.04466875, -0.029548366, 0.016..."
2,Advances in Green Energy,Recent progress in green energy technologies u...,"[0.006432495, 0.0057838606, -0.040501002, -0.0..."
3,Artificial Intelligence in Healthcare,Artificial Intelligence (AI) is transforming h...,"[0.02277285, -0.052651204, -0.038627297, -0.02..."


## Document search with Q&A

Now that the embeddings are generated, let's create a Q&A system to search these documents. You will ask a question about hyperparameter tuning, create an embedding of the question, and compare it against the collection of embeddings in the dataframe.

The embedding of the question will be a vector (list of float values), which will be compared against the vector of the documents using the dot product. This vector returned from the API is already normalized. The dot product represents the similarity in direction between two vectors.

The values of the dot product can range between -1 and 1, inclusive. If the dot product between two vectors is 1, then the vectors are in the same direction. If the dot product value is 0, then these vectors are orthogonal, or unrelated, to each other. Lastly, if the dot product is -1, then the vectors point in the opposite direction and are not similar to each other.

Note, with the new embeddings model (`embedding-001`), specify the task type as `QUERY` for user query and `DOCUMENT` when embedding a document text.

Task Type | Description
---       | ---
RETRIEVAL_QUERY	| Specifies the given text is a query in a search/retrieval setting.
RETRIEVAL_DOCUMENT | Specifies the given text is a document in a search/retrieval setting.

In [162]:
query = "What were the original purposes of the internet's creation in the 1960s, and how has it evolved into its current form?"
model = 'models/embedding-001'

request = GenAi.embed_content(model=model,
                              content=query,
                              task_type="retrieval_query")
print(request)

{'embedding': [0.018128846, -0.053428844, -0.050558563, -0.01625502, 0.011635872, 0.031778377, 0.033031106, 0.013995166, 0.025032334, 0.062753275, 0.02026338, 0.03223248, 0.011596497, 0.003504628, 0.07471315, -0.03346641, 0.01432801, 0.025758905, -0.0044261436, -0.058921654, -0.037283167, -0.017816246, -0.00089843635, -0.020403309, -0.00055217504, -0.008701618, 0.0038953996, -0.041881185, -0.001249064, -0.029964661, -0.052667927, 0.066917844, -0.030525934, -0.025875662, -4.8984242e-05, -0.030156212, 0.02391402, 0.028336478, -0.06940351, 0.08582472, 0.006636826, -0.0046984158, -0.07840328, 0.036097683, -0.0041771303, -0.028193621, -0.020981707, 0.04013237, 0.043409973, -0.05023337, 0.004670513, 0.07562865, 0.040254246, -0.04026467, 0.013595716, -0.02741997, 0.02793851, -0.011890958, -0.016532876, -0.0068630525, 0.042864818, -0.049098097, 0.0026542256, -0.0005585496, 0.010150841, -0.02776014, -0.0067178863, 0.03834721, 0.032066744, -0.046376042, 0.0252808, 0.018324634, 0.059684195, -0.03

In [163]:
# Use the find_best_passage function to calculate the dot products, and then sort the dataframe from the largest to smallest dot product value to retrieve the relevant passage out of the database.
def find_best_passage(query, dataframe):

  query_embedding = GenAi.embed_content(model=model,
                                        content=query,
                                        task_type="retrieval_query")
  dot_products = np.dot(np.stack(dataframe['Embedding']), query_embedding["embedding"])
  idx = np.argmax(dot_products)
  return dataframe.iloc[idx]['Content'] # Return text from index with max value

In [164]:
passage = find_best_passage(query, df)
passage

"The internet's evolution began in the 1960s as a U.S. military project aimed at creating a resilient communication network. Its original purpose was to ensure functionality even in the face of partial infrastructure failures. Over time, this network grew into the modern internet, connecting billions of users globally via the Internet Protocol Suite (TCP/IP). A significant leap forward came in the early 1990s with the advent of the World Wide Web by Tim Berners-Lee, which revolutionized public access and usability."

In [165]:
query1 ='How is Artificial Intelligence transforming healthcare to improve diagnostics, treatments, and efficiency?'
# request1 = GenAi.embed_content(model=model,
#                               content=query1,
#                               task_type="retrieval_query")
#print(request1)
def find_best_passage1(query1, dataframe):
  query_embedding = GenAi.embed_content(model=model,
                                        content=query1,
                                        task_type="retrieval_query")
  dot_products = np.dot(np.stack(dataframe['Embedding']), query_embedding["embedding"])
  idx = np.argmax(dot_products)
  return dataframe.iloc[idx]['Content'] # Return text from index with max value

passage1 = find_best_passage1(query1, df)
passage1

'Artificial Intelligence (AI) is transforming healthcare by enhancing diagnostic accuracy, personalizing treatments, and streamlining administrative tasks. AI algorithms are capable of analyzing vast datasets to identify patterns and predict patient outcomes, leading to earlier interventions and better care. Examples include AI-driven tools for medical imaging analysis and virtual assistants that improve patient interaction. This technology is not only improving efficiency but also expanding access to high-quality healthcare worldwide.'

In [166]:
query2 ='How global warming impacting the world?'
# request2 = GenAi.embed_content(model=model,
#                               content=query1,
#                               task_type="retrieval_query")
# #print(request1)
def find_best_passage2(query2, dataframe):
  query_embedding = GenAi.embed_content(model=model,
                                        content=query2,
                                        task_type="retrieval_query")
  dot_products = np.dot(np.stack(dataframe['Embedding']), query_embedding["embedding"])
  idx = np.argmax(dot_products)
  return dataframe.iloc[idx]['Content'] # Return text from index with max value

passage2 = find_best_passage2(query2, df)
passage2

'Global warming represents a significant shift in Earth’s climate, largely driven by human activities like burning fossil fuels, which increase atmospheric greenhouse gases. This phenomenon leads to intensified weather patterns, including more frequent hurricanes, prolonged droughts, and devastating floods. Furthermore, it accelerates polar ice melting and rising sea levels, endangering coastal regions. The consequences extend to agriculture, water availability, and biodiversity, highlighting the urgent need for mitigation efforts.'