## Overview

This project demonstrates how to use the Gemini API to create embeddings,will use the Python client library to build a word embedding that allows you to compare search strings, or questions, to document contents.

In this project, will build a rag application

In [3]:
!pip install -U -q google-generativeai

In [4]:
import textwrap
import google.generativeai as GenAi
import google.ai.generativelanguage as glm
import pandas as pd
import numpy as np
from google.colab import userdata

from IPython.display import Markdown, display

In [5]:
GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')
GenAi.configure(api_key=GOOGLE_API_KEY)

# Building an embeddings database

Here are 2 sample text to use to build the embeddings database, we will use Gemini API to create embeddings of each of the documents.We pulled data from websites to create the DataFrame dynamically. Using Python, you can use web scraping libraries like BeautifulSoup (from the bs4 library) for extracting data from HTML content. Additionally, APIs provided by websites (if available) can be a more structured and reliable way to fetch data.

In [15]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function for data cleaning
def clean_text(text):
    """
    Cleans input text by removing extra whitespace, newlines, and ensuring UTF-8 encoding.
    """
    return (
        text.replace("\n", " ")  # Remove newlines
        .replace("\r", " ")     # Remove carriage returns
        .strip()                # Strip leading/trailing spaces
        .encode("utf-8", "ignore")  # Ignore non-UTF characters
        .decode("utf-8")        # Decode back to string
    )

# URLs to scrape
urls = [
    "https://en.wikipedia.org/wiki/History_of_the_automobile",
    "https://en.wikipedia.org/wiki/Global_warming"
]

data = []

for url in urls:
    try:
        # Fetch webpage content
        response = requests.get(url)
        if response.status_code != 200:
            print(f"Failed to fetch {url} (Status Code: {response.status_code})")
            continue

        # Parse HTML content
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract title (fallback to "Title Not Found" if missing)
        title_tag = soup.find('h1')  # Most articles have <h1> for titles
        title = clean_text(title_tag.text) if title_tag else "Title Not Found"

        # Extract content (combine the first 3 paragraphs)
        paragraphs = soup.find_all('p')  # Get all <p> tags
        content = clean_text(" ".join([p.text for p in paragraphs[:3]])) if paragraphs else "Content Not Found"

        # Add extracted data to the list
        data.append({"Title": title, "Content": content})

    except Exception as e:
        print(f"Error scraping {url}: {e}")

# Create a DataFrame
df = pd.DataFrame(data)

# Perform additional cleaning
df.dropna(inplace=True)  # Remove rows with missing values
df.reset_index(drop=True, inplace=True)  # Reset index for a clean DataFrame

# Display the DataFrame
print(df)


                       Title  \
0  History of the automobile   
1             Climate change   

                                             Content  
0  Crude ideas and designs of automobiles can be ...  
1  Present-day climate change includes both globa...  


Now we will create the embeddings for each of the text and will add it to the data frame.

In [17]:
def embed_fn(title, Content):
  return GenAi.embed_content(model="models/embedding-001", content=title, task_type='retrieval_document', title=title)["embedding"]
df['Embedding'] = df.apply(lambda x: embed_fn(x['Title'], x['Content']), axis=1)
df


Unnamed: 0,Title,Content,Embedding
0,History of the automobile,Crude ideas and designs of automobiles can be ...,"[0.02445126, -0.023332464, -0.067578495, -0.01..."
1,Climate change,Present-day climate change includes both globa...,"[-0.014170468, -0.034987755, -0.06454118, 0.01..."


## Document search with Q&A

Now that the embeddings are generated, let's create a Q&A system to search these documents. You will ask a question about hyperparameter tuning, create an embedding of the question, and compare it against the collection of embeddings in the dataframe.

The embedding of the question will be a vector (list of float values), which will be compared against the vector of the documents using the dot product. This vector returned from the API is already normalized. The dot product represents the similarity in direction between two vectors.

The values of the dot product can range between -1 and 1, inclusive. If the dot product between two vectors is 1, then the vectors are in the same direction. If the dot product value is 0, then these vectors are orthogonal, or unrelated, to each other. Lastly, if the dot product is -1, then the vectors point in the opposite direction and are not similar to each other.

Note, with the new embeddings model (`embedding-001`), specify the task type as `QUERY` for user query and `DOCUMENT` when embedding a document text.

Task Type | Description
---       | ---
RETRIEVAL_QUERY	| Specifies the given text is a query in a search/retrieval setting.
RETRIEVAL_DOCUMENT | Specifies the given text is a document in a search/retrieval setting.

In [18]:
query = "Who created a small-scale steam-powered vehicle was created?"
model = 'models/embedding-001'

request = GenAi.embed_content(model=model,
                              content=query,
                              task_type="retrieval_query")
print(request)

{'embedding': [-0.026826635, -0.04280142, -0.056344602, -0.034657888, 0.045785226, 0.02341177, 0.057652794, -0.023186712, -0.026868613, 0.03340324, 0.04717992, -0.012326628, -0.006728747, 0.002680267, -0.022994498, 0.009573285, -0.0055102385, 0.0037289173, 0.0058312444, -0.04484913, 0.05276572, 0.022221332, -0.036666464, 0.017306665, -0.006452038, -0.008397161, 0.007468636, -0.03868062, -0.01009747, 0.011730041, -0.07851583, 0.058791123, -0.035904437, -0.011909979, -0.040258814, -0.009836895, 0.0028718885, -0.010888016, -0.0515669, 0.07759871, 0.036239196, 0.053189483, -0.056783278, -0.038672272, 0.042950418, -0.03977321, -0.045924194, 0.039767098, -0.009520786, -0.029797532, 0.04471093, 0.016361043, 0.03188916, -0.046700876, -0.0011000285, -0.07216628, 0.002928771, -0.040969793, -0.0248441, 0.008575118, 0.020667786, -0.01589764, -0.007454219, 0.058337387, 0.0054748715, -0.022651216, -0.037466418, 0.008430872, 0.04094743, -0.028460138, 0.056079704, -0.043632276, 0.056174573, -0.0269641

In [20]:
# Use the find_best_passage function to calculate the dot products, and then sort the dataframe from the largest to smallest dot product value to retrieve the relevant passage out of the database.
def find_best_passage(query, dataframe):

  query_embedding = GenAi.embed_content(model=model,
                                        content=query,
                                        task_type="retrieval_query")
  dot_products = np.dot(np.stack(dataframe['Embedding']), query_embedding["embedding"])
  idx = np.argmax(dot_products)
  return dataframe.iloc[idx]['Content'] # Return text from index with max value

In [21]:
passage = find_best_passage(query, df)
passage

'Crude ideas and designs of automobiles can be traced back to ancient and medieval times.[1][2] In 1649, Hans Hautsch of Nuremberg built a clockwork-driven carriage.[1][3] In 1672, a small-scale steam-powered vehicle was created by Ferdinand Verbiest;[4] the first steam-powered automobile capable of human transportation was built by Nicolas-Joseph Cugnot in 1769.[5][6] Inventors began to branch out at the start of the 19th century, creating the de Rivaz engine, one of the first internal combustion engines,[7] and an early electric motor.[8] Samuel Brown later tested the first industrially applied internal combustion engine in 1826. Only two of these were made.[9]  Development was hindered in the mid-19th century by a backlash against large vehicles, yet progress continued on some internal combustion engines. The engine evolved as engineers created two- and four-cycle combustion engines and began using gasoline. The first modern car—a practical, marketable automobile for everyday use—an

# Question and Answering Application
Let's try to use the text generation API to create a Q & A system. Input your own custom data below to create a simple question and answering example. You will still use the dot product as a metric of similarity.

In [22]:

def make_prompt(query, relevant_passage):
  escaped = relevant_passage.replace("'", "").replace('"', "").replace("\n", " ")
  prompt = textwrap.dedent("""You are a helpful and informative bot that answers questions using text from the reference passage included below. \
  Be sure to respond in a complete sentence, being comprehensive, including all relevant background information. \
  However, you are talking to a non-technical audience, so be sure to break down complicated concepts and \
  strike a friendly and converstional tone. \
  If the passage is irrelevant to the answer, you may ignore it.
  QUESTION: '{query}'
  PASSAGE: '{relevant_passage}'

    ANSWER:
  """).format(query=query, relevant_passage=escaped)

  return prompt

In [23]:

prompt = make_prompt(query, passage)
print(prompt)

You are a helpful and informative bot that answers questions using text from the reference passage included below.   Be sure to respond in a complete sentence, being comprehensive, including all relevant background information.   However, you are talking to a non-technical audience, so be sure to break down complicated concepts and   strike a friendly and converstional tone.   If the passage is irrelevant to the answer, you may ignore it.
  QUESTION: 'Who created a small-scale steam-powered vehicle was created?'
  PASSAGE: 'Crude ideas and designs of automobiles can be traced back to ancient and medieval times.[1][2] In 1649, Hans Hautsch of Nuremberg built a clockwork-driven carriage.[1][3] In 1672, a small-scale steam-powered vehicle was created by Ferdinand Verbiest;[4] the first steam-powered automobile capable of human transportation was built by Nicolas-Joseph Cugnot in 1769.[5][6] Inventors began to branch out at the start of the 19th century, creating the de Rivaz engine, one o

In [25]:
for m in GenAi.list_models():
  if 'generateContent' in m.supported_generation_methods:
    print(m.name)

models/gemini-1.0-pro-latest
models/gemini-1.0-pro
models/gemini-pro
models/gemini-1.0-pro-001
models/gemini-1.0-pro-vision-latest
models/gemini-pro-vision
models/gemini-1.5-pro-latest
models/gemini-1.5-pro-001
models/gemini-1.5-pro-002
models/gemini-1.5-pro
models/gemini-1.5-pro-exp-0801
models/gemini-1.5-pro-exp-0827
models/gemini-1.5-flash-latest
models/gemini-1.5-flash-001
models/gemini-1.5-flash-001-tuning
models/gemini-1.5-flash
models/gemini-1.5-flash-exp-0827
models/gemini-1.5-flash-002
models/gemini-1.5-flash-8b
models/gemini-1.5-flash-8b-001
models/gemini-1.5-flash-8b-latest
models/gemini-1.5-flash-8b-exp-0827
models/gemini-1.5-flash-8b-exp-0924
models/learnlm-1.5-pro-experimental
models/gemini-exp-1114
models/gemini-exp-1121
models/gemini-exp-1206


In [28]:
model = GenAi.GenerativeModel('models/gemini-1.0-pro')
answer = model.generate_content(prompt)

In [29]:
Markdown(answer.text)

Ferdinand Verbiest created a small-scale steam-powered vehicle in 1672.