In [1]:
%pip install -U -q "google-generativeai>=0.8.3" chromadb

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
kfp 2.5.0 requires google-cloud-storage<3,>=2.2.1, but you have google-cloud-storage 1.44.0 which is incompatible.
kfp 2.5.0 requires kubernetes<27,>=8.0.0, but you have kubernetes 31.0.0 which is incompatible.[0m[31m
[0mNote: you may need to restart the kernel to use updated packages.


Obtain your API Key from Google AI Studio

In [2]:
import google.generativeai as genai

In [3]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
GOOGLE_GEMINI_API_KEY = user_secrets.get_secret("GOOGLE_GEMINI_API_KEY")
genai.configure(api_key=GOOGLE_GEMINI_API_KEY)

In [4]:
# Checking Available Embedding Models supported by Google
for m in genai.list_models():
    if "embedContent" in m.supported_generation_methods:
        print(m.name)

models/embedding-001
models/text-embedding-004


# Creating Embedding Database

In [5]:
# Adding Documents for creating Embedding Database
DOCUMENT1 = "Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that focuses on the interaction between computers and humans using natural language. The ultimate goal of NLP is to enable machines to understand, interpret, and generate human language in a way that is both meaningful and useful. NLP combines computational linguistics, machine learning, and deep learning to process and analyze large amounts of natural language data"
DOCUMENT2 = "Tokenization is one of the first steps in preprocessing text for natural language processing tasks. It involves splitting a string of text into individual words or tokens. Tokenization is a crucial step for text analysis as it enables algorithms to handle smaller, more manageable chunks of text. There are different approaches to tokenization, including word-level tokenization and sentence-level tokenization."
DOCUMENT3 = "Named Entity Recognition (NER) is a key NLP task that involves identifying and classifying entities in text into predefined categories such as names of people, organizations, locations, dates, and more. NER is useful in many applications, including information extraction, question answering, and content summarization. Advanced techniques like deep learning and transformers have significantly improved the performance of NER systems."
DOCUMENT4 = "Sentiment analysis is a process of determining the emotional tone behind a series of words, used to understand the attitude or opinion expressed in text. It's widely applied in customer service, marketing, and social media analysis. Sentiment analysis typically involves classifying text as positive, negative, or neutral. Techniques such as supervised learning, lexicon-based methods, and deep learning are commonly employed for sentiment classification."
DOCUMENT5 = "Part-of-speech (POS) tagging is a process in which each word in a sentence is labeled with its corresponding part of speech (e.g., noun, verb, adjective). POS tagging is essential for many NLP applications, such as syntactic parsing, information extraction, and machine translation. Common methods for POS tagging include rule-based, statistical, and machine learning-based approaches."
DOCUMENT6 = "Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation. Unlike traditional one-hot encoding, word embeddings use dense vectors of real numbers that capture semantic relationships between words. Popular word embedding models include Word2Vec, GloVe, and FastText, all of which are used to learn word representations from large text corpora."
DOCUMENT7 = "Transformers are a type of neural network architecture that has revolutionized the field of NLP. Unlike traditional recurrent neural networks (RNNs), transformers use self-attention mechanisms to process words in parallel rather than sequentially. This architecture is highly effective for tasks such as machine translation, text generation, and question answering. Prominent transformer models include BERT, GPT, and T5"
DOCUMENT8 = "Machine translation (MT) is the task of automatically converting text from one language to another. Early MT systems relied on rule-based methods, but modern systems use statistical and neural machine translation techniques. Neural machine translation (NMT) is based on deep learning models, particularly sequence-to-sequence architectures, and has significantly improved the quality of automatic translations."
DOCUMENT9 = "Text classification is the task of categorizing text into predefined labels or classes. It's widely used for applications such as spam detection, topic modeling, and sentiment analysis. Common techniques for text classification include Naive Bayes, support vector machines (SVM), and deep learning methods. The goal is to build a model that can automatically assign labels to unseen text based on patterns learned from labeled data."
DOCUMENT10 = "Question answering (QA) systems aim to answer questions posed in natural language. QA systems can be categorized into two types: extractive and abstractive. Extractive QA systems retrieve answers directly from a given text, while abstractive QA systems generate new answers based on their understanding of the text. Recent advancements in QA systems have been driven by deep learning techniques, particularly transformer-based models like BERT and T5"

documents = [DOCUMENT1, DOCUMENT2, DOCUMENT3, DOCUMENT4, DOCUMENT5, DOCUMENT6, DOCUMENT7, DOCUMENT8,
            DOCUMENT9, DOCUMENT10]

In [6]:
from chromadb import Documents, EmbeddingFunction, Embeddings
from google.api_core import retry

class GeminiEmbeddingFunction(EmbeddingFunction):
    # Specifiying to generate embeddings for document or queries
    document_mode = True
    def __call__(self, input):
        if self.document_mode:
            embedding_task = "retrieval_document"
        else:
            embedding_task = "retrieval_query"

        retry_policy = {"retry": retry.Retry(predicate=retry.if_transient_error)}
        response = genai.embed_content(
            model = "models/text-embedding-004",
            content = input,
            task_type = embedding_task,
            request_options = retry_policy
        )
        # print("response: ", response)
        return response["embedding"]

In [7]:
import chromadb
DB_NAME = "informationonnlp"
embed_fun = GeminiEmbeddingFunction()
embed_fun.document_mode = True
chroma_client = chromadb.Client()
db = chroma_client.get_or_create_collection(name = DB_NAME, embedding_function = embed_fun)
db.add(documents = documents, ids = [str(i) for i in range(len(documents))])

In [8]:
print("Total Number of Documents in DB: ", db.count())
print("The First Documents is: ", db.peek(1))

Total Number of Documents in DB:  10
The First Documents is:  {'ids': ['0'], 'embeddings': array([[-4.71840911e-02,  2.97533516e-02, -1.76581424e-02,
        -9.62279737e-03, -3.14820074e-02,  4.40917425e-02,
         2.00140141e-02,  3.33735719e-02, -3.76888039e-03,
        -2.50776000e-02, -4.44155633e-02, -4.27163253e-03,
         7.05401003e-02,  7.13116955e-03,  4.92972136e-02,
        -5.22420779e-02,  3.71682532e-02,  4.91188951e-02,
        -1.16843678e-01,  3.05019207e-02,  2.45447713e-03,
        -3.36538292e-02,  1.00485270e-03, -5.13570383e-02,
        -4.45667142e-03,  1.80294793e-02,  1.89410988e-02,
         8.93730577e-03, -3.79313540e-04, -2.77177542e-02,
         5.55811040e-02,  5.23680449e-02,  2.86156517e-02,
        -7.42548555e-02, -1.83122158e-02, -2.06953194e-02,
         1.93992481e-02, -1.21836467e-02,  6.18599579e-02,
        -4.01744843e-02, -4.15999554e-02, -2.03616992e-02,
        -6.36188909e-02,  5.02354205e-02, -4.20208760e-02,
         1.54954158e-02,

# Retrieval

In [9]:
embed_fun.document_mode = True 
query = "How does word embedding helps us?"
result = db.query(query_texts = [query], n_results = 1)
print("Result: ", result)
print("###############")
print("Documents Retrived: ", result["documents"])
print("###############")
[[passage]] = result["documents"]
print([[passage]])

Result:  {'ids': [['5']], 'embeddings': None, 'documents': [['Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation. Unlike traditional one-hot encoding, word embeddings use dense vectors of real numbers that capture semantic relationships between words. Popular word embedding models include Word2Vec, GloVe, and FastText, all of which are used to learn word representations from large text corpora.']], 'uris': None, 'data': None, 'metadatas': [[None]], 'distances': [[0.2926563024520874]], 'included': [<IncludeEnum.distances: 'distances'>, <IncludeEnum.documents: 'documents'>, <IncludeEnum.metadatas: 'metadatas'>]}
###############
Documents Retrived:  [['Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation. Unlike traditional one-hot encoding, word embeddings use dense vectors of real numbers that capture semantic relationships between words. Popular wo

# Augmented Generation

In [10]:
print("passage: ", passage.replace("\n", " "))
print("query: ", query.replace("\n", " "))
print("################")
passage_oneline = passage.replace("\n", " ")
query_oneline = query.replace("\n", " ")
prompt = f"""You are a helpful and informative bot that answers questions using text from the reference passage included below. 
Be sure to respond in a complete sentence, being comprehensive, including all relevant background information. 
However, you are talking to a non-technical audience, so be sure to break down complicated concepts and 
strike a friendly and converstional tone. If the passage is irrelevant to the answer, you may ignore it.

QUESTION: {query_oneline}
PASSAGE: {passage_oneline}
"""
print(prompt)

passage:  Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation. Unlike traditional one-hot encoding, word embeddings use dense vectors of real numbers that capture semantic relationships between words. Popular word embedding models include Word2Vec, GloVe, and FastText, all of which are used to learn word representations from large text corpora.
query:  How does word embedding helps us?
################
You are a helpful and informative bot that answers questions using text from the reference passage included below. 
Be sure to respond in a complete sentence, being comprehensive, including all relevant background information. 
However, you are talking to a non-technical audience, so be sure to break down complicated concepts and 
strike a friendly and converstional tone. If the passage is irrelevant to the answer, you may ignore it.

QUESTION: How does word embedding helps us?
PASSAGE: Word embeddings are a type of wo

In [11]:
model = genai.GenerativeModel("gemini-1.5-flash-latest")
answer = model.generate_content(prompt)
print("Answer: ", answer)
print(answer.text)

Answer:  response:
GenerateContentResponse(
    done=True,
    iterator=None,
    result=protos.GenerateContentResponse({
      "candidates": [
        {
          "content": {
            "parts": [
              {
                "text": "Word embeddings are a really helpful way to represent words in a way that computers can understand.  Imagine you want a computer to understand that \"happy\" and \"joyful\" are similar words.  Instead of just treating them as completely separate, word embeddings give each word a kind of \"fingerprint\" \u2013 a set of numbers \u2013 and words with similar meanings have similar fingerprints. This lets computers understand the relationships between words better, which is crucial for many tasks like machine translation and text analysis, because it allows them to capture the nuances of language more effectively than older methods.\n"
              }
            ],
            "role": "model"
          },
          "finish_reason": "STOP",
          "av