In [1]:
import chromadb
# from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers import GPT2TokenizerFast
from uuid import uuid4
from langchain_core.documents import Document
import os
from groq import Groq
import pandas as pd
from langchain.retrievers.self_query.base import SelfQueryRetriever

## 1 - Write a function that gets content, chunk size and returns a list of chunks

In [182]:
def chunk_data(content,meta_data=None):
    """Takes content(str) and optional meta_data(dict).
    Returns a list of chunks (Document objects)."""
    # avoid a mutable default argument; fall back to the default source tag
    if meta_data is None:
        meta_data = {"source":"copy_paste"}
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=100,
        chunk_overlap=20,
        # measure chunk length in GPT-2 tokens rather than characters
        length_function = lambda text: len(tokenizer.encode(text)),
        # separators=["\n\n", "\n", ".", "!", "?", ",", " "],
        is_separator_regex=False,
    )
    # one metadata dict per input text; every chunk of that text inherits it
    return text_splitter.create_documents([content],metadatas=[dict(meta_data)])
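
Because of the custom `length_function`, `chunk_size` and `chunk_overlap` above are counted in tokens. As a rough illustration of what the overlap does, here is a minimal sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to cut at separators); it assumes whitespace-separated words as stand-in tokens, and `sliding_window_chunks` is a hypothetical helper name.

```python
def sliding_window_chunks(text, chunk_size=100, chunk_overlap=20):
    """Split text into windows of at most chunk_size words, where each
    window repeats the last chunk_overlap words of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()  # assumption: whitespace words stand in for GPT-2 tokens
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


demo = sliding_window_chunks("a b c d e f g h i j", chunk_size=4, chunk_overlap=1)
print(demo)
```

Each window starts `chunk_size - chunk_overlap` words after the previous one, so neighbouring chunks share some context.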


In [3]:
def create_uuids(chunks):
    uuid_list = [str(uuid4()) for _ in range(len(chunks))]
    return uuid_list

In [180]:
def store(db,doc_list,metadata_list,uuid_list):
    """
    stores data in the vector db, automatically embedding the input 
    with the default embedding function of chromadb
    """
    # print("doc_list elements: ",doc_list,"\nmetadata_list elements: ",metadata_list,"\nuuid_list elements ",uuid_list)
    db.add(documents = doc_list,ids=uuid_list,metadatas= metadata_list)

In [5]:
client = chromadb.PersistentClient(path="./test_db")
tag_db = client.get_or_create_collection(name="tag_collection",metadata={"hnsw:space": "cosine"})        
content_db = client.get_or_create_collection(name="content_collection",metadata={"hnsw:space": "cosine"})
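
Both collections are created with `"hnsw:space": "cosine"`, so the distances returned by queries are cosine distances, that is, 1 minus the cosine similarity of the two embeddings. A small pure-Python sketch of that quantity may help when reading the distance values and choosing a threshold later on; `cosine_distance` is a hypothetical helper written for illustration, not part of the chromadb API.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

A distance near 0 means the vectors point the same way, while 1 means they are orthogonal.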

In [133]:
chunks = chunk_data("""What are LLMs?
Large language models (LLMs) are a category of foundation models trained on immense amounts of data making them capable of understanding and generating natural language and other types of content to perform a wide range of tasks.

LLMs have become a household name thanks to the role they have played in bringing generative AI to the forefront of the public interest, as well as the point on which organizations are focusing to adopt artificial intelligence across numerous business functions and use cases.

Outside of the enterprise context, it may seem like LLMs have arrived out of the blue along with new developments in generative AI. However, many companies, including IBM, have spent years implementing LLMs at different levels to enhance their natural language understanding (NLU) and natural language processing (NLP) capabilities. This has occurred alongside advances in machine learning, machine learning models, algorithms, neural networks and the transformer models that provide the architecture for these AI systems.

LLMs are a class of foundation models, which are trained on enormous amounts of data to provide the foundational capabilities needed to drive multiple use cases and applications, as well as resolve a multitude of tasks. This is in stark contrast to the idea of building and training domain specific models for each of these use cases individually, which is prohibitive under many criteria (most importantly cost and infrastructure), stifles synergies and can even lead to inferior performance.

LLMs represent a significant breakthrough in NLP and artificial intelligence, and are easily accessible to the public through interfaces like Open AI’s Chat GPT-3 and GPT-4, which have garnered the support of Microsoft. Other examples include Meta’s Llama models and Google’s bidirectional encoder representations from transformers (BERT/RoBERTa) and PaLM models. IBM has also recently launched its Granite model series on watsonx.ai, which has become the generative AI backbone for other IBM products like watsonx Assistant and watsonx Orchestrate. 

In a nutshell, LLMs are designed to understand and generate text like a human, in addition to other forms of content, based on the vast amount of data used to train them. They have the ability to infer from context, generate coherent and contextually relevant responses, translate to languages other than English, summarize text, answer questions (general conversation and FAQs) and even assist in creative writing or code generation tasks. 

They are able to do this thanks to billions of parameters that enable them to capture intricate patterns in language and perform a wide array of language-related tasks. LLMs are revolutionizing applications in various fields, from chatbots and virtual assistants to content generation, research assistance and language translation.

As they continue to evolve and improve, LLMs are poised to reshape the way we interact with technology and access information, making them a pivotal part of the modern digital landscape.

Ebook
Generative AI + ML for the enterprise
Learn how organizations can confidently incorporate generative AI and machine learning into their business to gain a significant competitive advantage.

Related content
Register for the ebook on AI data stores

How large language models work 
LLMs operate by leveraging deep learning techniques and vast amounts of textual data. These models are typically based on a transformer architecture, like the generative pre-trained transformer, which excels at handling sequential data like text input. LLMs consist of multiple layers of neural networks, each with parameters that can be fine-tuned during training, which are enhanced further by a numerous layer known as the attention mechanism, which dials in on specific parts of data sets.

During the training process, these models learn to predict the next word in a sentence based on the context provided by the preceding words. The model does this through attributing a probability score to the recurrence of words that have been tokenized— broken down into smaller sequences of characters. These tokens are then transformed into embeddings, which are numeric representations of this context.

To ensure accuracy, this process involves training the LLM on a massive corpora of text (in the billions of pages), allowing it to learn grammar, semantics and conceptual relationships through zero-shot and self-supervised learning. Once trained on this training data, LLMs can generate text by autonomously predicting the next word based on the input they receive, and drawing on the patterns and knowledge they've acquired. The result is coherent and contextually relevant language generation that can be harnessed for a wide range of NLU and content generation tasks.

Model performance can also be increased through prompt engineering, prompt-tuning, fine-tuning and other tactics like reinforcement learning with human feedback (RLHF) to remove the biases, hateful speech and factually incorrect answers known as “hallucinations” that are often unwanted byproducts of training on so much unstructured data. This is one of the most important aspects of ensuring enterprise-grade LLMs are ready for use and do not expose organizations to unwanted liability, or cause damage to their reputation. 

LLM use cases 
LLMs are redefining an increasing number of business processes and have proven their versatility across a myriad of use cases and tasks in various industries. They augment conversational AI in chatbots and virtual assistants (like IBM watsonx Assistant and Google’s BARD) to enhance the interactions that underpin excellence in customer care, providing context-aware responses that mimic interactions with human agents. 

LLMs also excel in content generation, automating content creation for blog articles, marketing or sales materials and other writing tasks. In research and academia, they aid in summarizing and extracting information from vast datasets, accelerating knowledge discovery. LLMs also play a vital role in language translation, breaking down language barriers by providing accurate and contextually relevant translations. They can even be used to write code, or “translate” between programming languages.

Moreover, they contribute to accessibility by assisting individuals with disabilities, including text-to-speech applications and generating content in accessible formats. From healthcare to finance, LLMs are transforming industries by streamlining processes, improving customer experiences and enabling more efficient and data-driven decision making. 

Most excitingly, all of these capabilities are easy to access, in some cases literally an API integration away. 

Here is a list of some of the most important areas where LLMs benefit organizations:

Text generation: language generation abilities, such as writing emails, blog posts or other mid-to-long form content in response to prompts that can be refined and polished. An excellent example is retrieval-augmented generation (RAG). 

Content summarization: summarize long articles, news stories, research reports, corporate documentation and even customer history into thorough texts tailored in length to the output format.

AI assistants: chatbots that answer customer queries, perform backend tasks and provide detailed information in natural language as a part of an integrated, self-serve customer care solution. 

Code generation: assists developers in building applications, finding errors in code and uncovering security issues in multiple programming languages, even “translating” between them. 

Sentiment analysis: analyze text to determine the customer’s tone in order understand customer feedback at scale and aid in brand reputation management. 

Language translation: provides wider coverage to organizations across languages and geographies with fluent translations and multilingual capabilities.  

LLMs stand to impact every industry, from finance to insurance, human resources to healthcare and beyond, by automating customer self-service, accelerating response times on an increasing number of tasks as well as providing greater accuracy, enhanced routing and intelligent context gathering. 

 

LLMs and governance  
Organizations need a solid foundation in governance practices to harness the potential of AI models to revolutionize the way they do business. This means providing access to AI tools and technology that is trustworthy, transparent, responsible and secure. AI governance and traceability are also fundamental aspects of the solutions IBM brings to its customers, so that activities that involve AI are managed and monitored to allow for tracing origins, data and models in a way that is always auditable and accountable. 

""")



In [134]:
chunks

[Document(metadata={}, page_content='What are LLMs?\nLarge language models (LLMs) are a category of foundation models trained on immense amounts of data making them capable of understanding and generating natural language and other types of content to perform a wide range of tasks.'),
 Document(metadata={}, page_content='LLMs have become a household name thanks to the role they have played in bringing generative AI to the forefront of the public interest, as well as the point on which organizations are focusing to adopt artificial intelligence across numerous business functions and use cases.'),
 Document(metadata={}, page_content='Outside of the enterprise context, it may seem like LLMs have arrived out of the blue along with new developments in generative AI. However, many companies, including IBM, have spent years implementing LLMs at different levels to enhance their natural language'),
 Document(metadata={}, page_content='many companies, including IBM, have spent years implementin

## 2 - Take an individual chunk and return embedding for it

## 3 - Function takes a chunk and returns tags

In [7]:
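# note: this rebinds `client`, which previously held the chromadb client
client = Groq(
    # read the key from the environment instead of hard-coding a secret in the notebook
    api_key=os.environ.get("GROQ_API_KEY"),
)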


In [None]:

def create_tags(chunk):
    """Takes one chunk and returns a string of tags. 
    Chunk is of type Document(page_content,metadata)"""
    
    question = """Based on the given content, generate 10 or fewer tags as a comma-separated list.
    You should return the TAGS ONLY and nothing else.
    Your output should be SORTED LEXICOGRAPHICALLY, IN LOWERCASE. 
    It MUST look like - <tag1>, <tag2>, <tag3>"""
    content = chunk.page_content
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": f"""{question}\n\n {content} """,
            }
        ],
        model="llama3-8b-8192",
        temperature=0
    )
    tags = chat_completion.choices[0].message.content
    return tags
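
The prompt asks for lowercase, lexicographically sorted tags, but the model does not always comply: one of the recorded outputs further down lists `bedrooms` before `bathroom`. A minimal post-processing sketch, assuming the reply is a plain comma-separated string, could normalise the tags before they are stored. `parse_tags` is a hypothetical helper and not part of the original notebook.

```python
def parse_tags(raw_tags, max_tags=10):
    """Turn a comma-separated tag string into a sorted, de-duplicated list."""
    cleaned = {tag.strip().lower() for tag in raw_tags.split(",")}
    cleaned.discard("")  # drop empty entries caused by stray commas
    return sorted(cleaned)[:max_tags]


tags = parse_tags("bedrooms, bathroom, Englewood, house,, house")
print(tags)
```

Joining the result with `", ".join(tags)` gives back a string in the format the prompt requests.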


In [200]:
def rewrite_information(old_content,new_content):
    """Takes old_content(str) and new_content(str) and returns the merged, rewritten content(str)"""
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": f"""Change the contents of the old content:{old_content} based on the new content:{new_content} and rewrite everything without losing any information.
                If there is a contradiction, overwrite the old content with the new content. 
                Don't lose any information from the old content.
                The remaining relevant information from the new content should be kept.
                Structure the result in a manner that makes the most sense.
                JUST GIVE ME THE REWRITTEN CONTENT, without any other text.""",
            }
        ],
        model="llama3-8b-8192",
        temperature=0
    )
    return chat_completion.choices[0].message.content

In [208]:
rewrite_information("The 118 Englewood house is built with concrete and has 10 floors. It has 4 bedrooms, 1 kitchen and 1 bathroom. The house is wild life friendly, you can frequent sightings of rats here. It is not spacious.","118 englewood has 2 floors, one floor has kitchen and the other floor has the bedrooms and bathroom. The house is built with wood and is spacious. It is on a cross road.")

'The 118 Englewood house is built with wood and has 2 floors. The first floor features a kitchen, while the second floor has 4 bedrooms and 1 bathroom. The house is spacious and wildlife friendly, with frequent sightings of rats. It is located on a cross road.'

In [100]:
def store_logic(text):
    chunks = chunk_data(text)
    doc_list = []
    uuid_list = create_uuids(chunks)
    content_metadata_list = []
    tag_metadata_list = []
    tag_list = []
    for chunk in chunks:
        doc_list.append(chunk.page_content)
        tags = create_tags(chunk)
        tag_list.append(tags)
        tag_metadata_list.append({k:v for k,v in chunk.metadata.items()})
        temp_meta = {k:v for k,v in chunk.metadata.items()}
        temp_meta["tags"] = tags
        content_metadata_list.append(temp_meta)
    print(tag_metadata_list) 
    print(uuid_list) 
    store(tag_db,tag_list,tag_metadata_list,uuid_list)
    store(content_db,doc_list,content_metadata_list,uuid_list)

In [151]:
text = """LLMs (Large Language Models) are advanced machine learning models designed to understand and generate human-like text. These models are based on deep learning techniques, specifically neural networks, and are trained on vast amounts of text data to learn the intricacies of language, grammar, context, and even some reasoning.

Key Characteristics of LLMs:
Size:

They are typically trained on billions or even trillions of parameters, hence the term "large." The more parameters, the more complex the model and its ability to generate nuanced and context-aware responses.
Architecture:

Most modern LLMs are based on the Transformer architecture, introduced in the "Attention Is All You Need" paper. This architecture allows the model to focus on relevant parts of the input sequence (using attention mechanisms) and can handle large contexts more efficiently than previous models like RNNs and LSTMs.
Training Data:

LLMs are trained on diverse text data, including books, websites, academic papers, and more, allowing them to generalize well across different topics and tasks. Training involves predicting the next word or phrase based on previous context, helping the model learn patterns in language.
Pretraining and Fine-tuning:

Pretraining: LLMs are pretrained on massive datasets to learn general language representations.
Fine-tuning: These models are then fine-tuned for specific tasks (e.g., summarization, question answering) using smaller, task-specific datasets.
Popular LLMs:
GPT (Generative Pretrained Transformer):

Developed by OpenAI, GPT models (including GPT-3, GPT-4, etc.) are some of the most widely known LLMs. They are designed to generate human-like text and perform tasks such as translation, summarization, and more.
BERT (Bidirectional Encoder Representations from Transformers):

Developed by Google, BERT focuses on understanding context in a bidirectional manner (looking at both sides of a word in a sentence) and excels at tasks like text classification and question answering.
T5 (Text-to-Text Transfer Transformer):

Developed by Google, T5 treats all NLP tasks as text-to-text tasks, making it highly flexible for a range of applications.
LLaMA (Large Language Model Meta AI):

Developed by Meta (formerly Facebook), LLaMA is a model designed to be smaller and more efficient than GPT-3, while maintaining strong performance across various NLP tasks.
Applications of LLMs:
Text generation: Creating human-like text for various contexts, from creative writing to code generation.
Chatbots and virtual assistants: Powering conversational agents that can respond intelligently to user queries.
Text summarization: Condensing large texts into concise summaries.
Machine translation: Translating text from one language to another.
Sentiment analysis: Identifying emotions or opinions in text data.
Question answering: Providing answers to user queries by understanding the context and retrieving relevant information.
Code generation: Writing code based on user inputs in natural language.
Challenges:
Bias: LLMs can learn and perpetuate biases present in the training data.
Energy consumption: Training LLMs requires significant computational resources, which can be costly and environmentally impactful.
Interpretability: LLMs are complex models, making it difficult to understand how they make decisions or predictions.
In summary, LLMs are powerful tools in natural language processing, capable of performing a wide range of tasks by leveraging their large-scale training on diverse datasets. However, they also present challenges in terms of bias, resource usage, and interpretability."""

In [224]:
store_logic("The 118 Englewood house is built with concrete and has 10 floors. It has 4 bedrooms, 1 kitchen and 1 bathroom. The house is wild life friendly, you can frequent sightings of rats here. It is not spacious.")



[Document(metadata={'source': 'copy_paste'}, page_content='The 118 Englewood house is built with concrete and has 10 floors. It has 4 bedrooms, 1 kitchen and 1 bathroom. The house is wild life friendly, you can frequent sightings of rats here. It is not spacious.')]
[{'source': 'copy_paste'}]
['250b2c09-ceb9-41cb-b546-dc9ebfb8621a']
doc_list elements:  ['concrete, englewood, house, kitchen, rats, wildlife'] 
metadata_list elements:  [{'source': 'copy_paste'}] 
uuid_list elements  ['250b2c09-ceb9-41cb-b546-dc9ebfb8621a']
doc_list elements:  ['The 118 Englewood house is built with concrete and has 10 floors. It has 4 bedrooms, 1 kitchen and 1 bathroom. The house is wild life friendly, you can frequent sightings of rats here. It is not spacious.'] 
metadata_list elements:  [{'source': 'copy_paste', 'tags': 'concrete, englewood, house, kitchen, rats, wildlife'}] 
uuid_list elements  ['250b2c09-ceb9-41cb-b546-dc9ebfb8621a']


In [166]:
def retrieve_logic(query,threshold = 0.7):
    """Takes a single query string and returns the chromadb results whose distance is
    within the threshold, or None if even the closest hit exceeds it"""
    queries = [query]
    results = content_db.query(query_texts=queries)
    find_similarities = True
    for q_no in range(len(queries)):
        for ind in range(len(results['distances'][q_no])):
            dist = results['distances'][q_no][ind]
            if dist > threshold:
                if ind == 0:
                    find_similarities = False
                results['ids'][q_no] = results['ids'][q_no][:ind]
                results['distances'][q_no] = results['distances'][q_no][:ind]
                results['metadatas'][q_no] = results['metadatas'][q_no][:ind]
                results['documents'][q_no] = results['documents'][q_no][:ind]
                break
    if find_similarities:
        return results
    else:
        return None
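
The truncation in `retrieve_logic` relies on the distances arriving sorted in ascending order, so everything after the first hit above the threshold can be cut. The same idea can be sketched without slicing in place, on a plain dictionary shaped like the query output shown in this notebook. `keep_within_threshold` and the sample data are hypothetical and only illustrate the filtering step for a single query.

```python
def keep_within_threshold(results, threshold=0.7):
    """Return a filtered copy of a query result, keeping hits with
    distance <= threshold. Returns None when no hit of the first query passes."""
    keys = ("ids", "distances", "metadatas", "documents")
    filtered = dict(results)  # shallow copy so the input dict is not modified
    for key in keys:
        filtered[key] = []
    for q_no, distances in enumerate(results["distances"]):
        keep = [i for i, dist in enumerate(distances) if dist <= threshold]
        for key in keys:
            filtered[key].append([results[key][q_no][i] for i in keep])
    if not filtered["ids"] or not filtered["ids"][0]:
        return None
    return filtered


# hypothetical result with one close hit and one distant hit
sample = {
    "ids": [["a", "b"]],
    "distances": [[0.2, 0.9]],
    "metadatas": [[{"source": "x"}, {"source": "y"}]],
    "documents": [["doc a", "doc b"]],
}
kept = keep_within_threshold(sample)
print(kept["ids"], kept["distances"])
```

With cosine distance, a lower threshold is stricter, so 0.7 keeps fairly loose matches.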


In [225]:
retrieve_logic("118 Englewood house")

{'ids': [['250b2c09-ceb9-41cb-b546-dc9ebfb8621a']],
 'distances': [[0.2117686844864899]],
 'metadatas': [[{'source': 'copy_paste',
    'tags': 'concrete, englewood, house, kitchen, rats, wildlife'}]],
 'embeddings': None,
 'documents': [['The 118 Englewood house is built with concrete and has 10 floors. It has 4 bedrooms, 1 kitchen and 1 bathroom. The house is wild life friendly, you can frequent sightings of rats here. It is not spacious.']],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents', 'distances']}

In [219]:
retrieve_logic("118 Englewood house")

{'ids': [['33ede387-b584-4b87-b2e7-9896c29f5338']],
 'distances': [[0.19575874675142635]],
 'metadatas': [[{'source': 'copy_paste',
    'tags': 'bedrooms, bathroom, englewood, house, kitchen, rats, road, wildlife, wood'}]],
 'embeddings': None,
 'documents': [['The 118 Englewood house is built with wood and has 2 floors. The first floor features a kitchen, while the second floor has 4 bedrooms and 1 bathroom. The house is spacious and wildlife friendly, with frequent sightings of rats. It is located on a cross road.']],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents', 'distances']}

In [228]:
k = retrieve_logic("118 Englewood house")
k

{'ids': [['34642a21-593c-4575-84ef-d196303568c8']],
 'distances': [[0.19575874675142635]],
 'metadatas': [[{'source': 'copy_paste',
    'tags': 'bedrooms, bathroom, englewood, house, kitchen, rats, road, wildlife, wood'}]],
 'embeddings': None,
 'documents': [['The 118 Englewood house is built with wood and has 2 floors. The first floor features a kitchen, while the second floor has 4 bedrooms and 1 bathroom. The house is spacious and wildlife friendly, with frequent sightings of rats. It is located on a cross road.']],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents', 'distances']}

In [174]:
def remove(db,ids):
    db.delete(ids=ids)

In [216]:
def remove_logic(ids):
    remove(tag_db,ids)
    remove(content_db,ids)

In [217]:
def insert_logic(information):
    chunks = chunk_data(information)
    queries = [chunk.page_content for chunk in chunks]
    for query in queries:
        similar_info = retrieve_logic(query)
        # retrieve_logic returns None when nothing is close enough, so check before indexing
        if similar_info is not None and similar_info['ids'][0]:
            similar_info_ids = similar_info['ids'][0]
            print(similar_info['documents'][0])
            similar_info_content = "\n".join(similar_info['documents'][0])
            remove_logic(similar_info_ids)
            new_information = rewrite_information(similar_info_content,query)
            store_logic(new_information)
        else:
            # nothing similar is stored yet, so keep the new chunk as-is
            store_logic(query)


In [226]:
insert_logic("118 englewood has 2 floors, one floor has kitchen and the other floor has the bedrooms and bathroom. The house is built with wood and is spacious. It is on a cross road.")

['The 118 Englewood house is built with concrete and has 10 floors. It has 4 bedrooms, 1 kitchen and 1 bathroom. The house is wild life friendly, you can frequent sightings of rats here. It is not spacious.']
[Document(metadata={'source': 'copy_paste'}, page_content='The 118 Englewood house is built with wood and has 2 floors. The first floor features a kitchen, while the second floor has 4 bedrooms and 1 bathroom. The house is spacious and wildlife friendly, with frequent sightings of rats. It is located on a cross road.')]
[{'source': 'copy_paste'}]
['34642a21-593c-4575-84ef-d196303568c8']
doc_list elements:  ['bedrooms, bathroom, englewood, house, kitchen, rats, road, wildlife, wood'] 
metadata_list elements:  [{'source': 'copy_paste'}] 
uuid_list elements  ['34642a21-593c-4575-84ef-d196303568c8']
doc_list elements:  ['The 118 Englewood house is built with wood and has 2 floors. The first floor features a kitchen, while the second floor has 4 bedrooms and 1 bathroom. The house is sp

## 4 - Driver code (also stores tags against chunks to a dataframe)

In [13]:
class DataHandler:
    _instance = None

    def __new__(cls, *args, **kwargs):
        if cls._instance is None:
            # object.__new__ accepts no extra arguments, so do not forward them
            cls._instance = super().__new__(cls)
        return cls._instance
    
    def __init__(self) -> None:
        columns = ['uuid', 'content','tags']
        self.content_df = pd.DataFrame(columns=columns)
        self.client = chromadb.PersistentClient(path="./our_db")
        self.tag_db = self.client.get_or_create_collection(name="tag_collection")        
        self.content_db = self.client.get_or_create_collection(name="content_collection")
    
    def get_tag_db(self):
        return self.tag_db
    
    def get_content_db(self):
        return self.content_db
    
    def get_content_df(self):
        return self.content_df
    
    def store(self,db,doc_list,metadata_list,uuid_list):
        """
        stores data in the vector db, automatically embedding the input 
        with the default embedding function of chromadb
        """
        db.add(documents = doc_list,ids=uuid_list,metadatas= metadata_list)

    def create_uuids(self,chunks):
        uuid_list = [str(uuid4()) for _ in range(len(chunks))]  # chromadb ids must be strings
        return uuid_list
    
    def store_content(self, content,metadatas,uuid_list):
        self.store(self.content_db, content, metadatas, uuid_list)
    
    def store_tags(self, tags,metadatas,uuid_list):
        self.store(self.tag_db, tags, metadatas, uuid_list)
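
One caveat with a singleton `__new__` like the one above: Python still calls `__init__` on every `DataHandler()` call, so `content_df` would be reset to an empty frame each time. A minimal sketch of one possible guard follows; the `Registry` class and its `_initialized` flag are hypothetical stand-ins, not part of the notebook.

```python
class Registry:
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        if getattr(self, "_initialized", False):
            return  # already set up, keep the existing state
        self.items = []
        self._initialized = True


first = Registry()
first.items.append("chunk-1")
second = Registry()  # same object, and __init__ does not wipe items
print(second is first, second.items)
```

Without the guard, the second call would replace `items` with a fresh empty list.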



In [14]:
class SearchData:
    def __init__(self,persist_data_obj):
        self.obj = persist_data_obj
        self.content_df = self.obj.get_content_df()
        self.tag_db = self.obj.get_tag_db()
        self.content_db = self.obj.get_content_db()

    def _search(self, db, query, k=4):
        """Queries a chromadb collection and joins the hits with content_df on uuid"""
        results = db.query(query_texts=[query], n_results=k)
        # chromadb returns one list per query text; a single query is sent here
        score_df = pd.DataFrame({'uuid': results['ids'][0], 'score': results['distances'][0]})
        result = pd.merge(self.content_df, score_df, on='uuid', how='inner')
        # scores are distances, so the smallest score is the closest match
        return result.sort_values(by='score')

    def search_tags(self, query):
        return self._search(self.tag_db, query)

    def search_content(self,query):
        return self._search(self.content_db, query)



In [15]:
obj = DataHandler()

In [None]:
obj.store_data("""Foreign, economic and strategic relations
Main articles: Foreign relations of India and Indian Armed Forces

During the 1950s and 60s, India played a pivotal role in the Non-Aligned Movement.[269] From left to right: Gamal Abdel Nasser of United Arab Republic (now Egypt), Josip Broz Tito of Yugoslavia and Jawaharlal Nehru in Belgrade, September 1961.
In the 1950s, India strongly supported decolonisation in Africa and Asia and played a leading role in the Non-Aligned Movement.[270] After initially cordial relations with neighbouring China, India went to war with China in 1962 and was widely thought to have been humiliated.[271] This was followed by another military conflict in 1967 in which India successfully repelled Chinese attack.[272] India has had tense relations with neighbouring Pakistan; the two nations have gone to war four times: in 1947, 1965, 1971, and 1999. Three of these wars were fought over the disputed territory of Kashmir, while the third, the 1971 war, followed from India's support for the independence of Bangladesh.[273] In the late 1980s, the Indian military twice intervened abroad at the invitation of the host country: a peace-keeping operation in Sri Lanka between 1987 and 1990; and an armed intervention to prevent a 1988 coup d'état attempt in the Maldives. After the 1965 war with Pakistan, India began to pursue close military and economic ties with the Soviet Union; by the late 1960s, the Soviet Union was its largest arms supplier.[274]

Aside from its ongoing special relationship with Russia,[275] India has wide-ranging defence relations with Israel and France. In recent years, it has played key roles in the South Asian Association for Regional Cooperation and the World Trade Organization. The nation has provided 100,000 military and police personnel to serve in 35 UN peacekeeping operations across four continents. It participates in the East Asia Summit, the G8+5, and other multilateral forums.[276] India has close economic ties with countries in South America,[277] Asia, and Africa; it pursues a "Look East" policy that seeks to strengthen partnerships with the ASEAN nations, Japan, and South Korea that revolve around many issues, but especially those involving economic investment and regional security.[278][279]


The Indian Air Force contingent marching at the 221st Bastille Day military parade in Paris, on 14 July 2009. The parade at which India was the foreign guest was led by India's oldest regiment, the Maratha Light Infantry, founded in 1768.[280]
China's nuclear test of 1964, as well as its repeated threats to intervene in support of Pakistan in the 1965 war, convinced India to develop nuclear weapons.[281] India conducted its first nuclear weapons test in 1974 and carried out additional underground testing in 1998. Despite criticism and military sanctions, India has signed neither the Comprehensive Nuclear-Test-Ban Treaty nor the Nuclear Non-Proliferation Treaty, considering both to be flawed and discriminatory.[282] India maintains a "no first use" nuclear policy and is developing a nuclear triad capability as a part of its "Minimum Credible Deterrence" doctrine.[283][284] It is developing a ballistic missile defence shield and, a fifth-generation fighter jet.[285][286] Other indigenous military projects involve the design and implementation of Vikrant-class aircraft carriers and Arihant-class nuclear submarines.[287]

Since the end of the Cold War, India has increased its economic, strategic, and military co-operation with the United States and the European Union.[288] In 2008, a civilian nuclear agreement was signed between India and the United States. Although India possessed nuclear weapons at the time and was not a party to the Nuclear Non-Proliferation Treaty, it received waivers from the International Atomic Energy Agency and the Nuclear Suppliers Group, ending earlier restrictions on India's nuclear technology and commerce. As a consequence, India became the sixth de facto nuclear weapons state.[289] India subsequently signed co-operation agreements involving civilian nuclear energy with Russia,[290] France,[291] the United Kingdom,[292] and Canada.[293]


Prime Minister Narendra Modi of India (left, background) in talks with President Enrique Peña Nieto of Mexico during a visit to Mexico, 2016
The President of India is the supreme commander of the nation's armed forces; with 1.45 million active troops, they compose the world's second-largest military. It comprises the Indian Army, the Indian Navy, the Indian Air Force, and the Indian Coast Guard.[294] The official Indian defence budget for 2011 was US$36.03 billion, or 1.83% of GDP.[295] Defence expenditure was pegged at US$70.12 billion for fiscal year 2022–23, an increase of 9.8% over the previous fiscal year.[296][297] India is the world's second-largest arms importer; between 2016 and 2020, it accounted for 9.5% of the total global arms imports.[298] Much of the military expenditure was focused on defence against Pakistan and countering growing Chinese influence in the Indian Ocean.[299] In May 2017, the Indian Space Research Organisation launched the South Asia Satellite, a gift from India to its neighbouring SAARC countries.[300] In October 2018, India signed a US$5.43 billion (over ₹400 billion) agreement with Russia to procure four S-400 Triumf surface-to-air missile defence systems, Russia's most advanced long-range missile defence system.[301]

Economy
Main article: Economy of India

A farmer in northwestern Karnataka ploughs his field with a tractor even as another in a field beyond does the same with a pair of oxen. In 2019, 43% of India's total workforce was employed in agriculture.[302]

India is the world's largest producer of milk, with the largest population of cattle. In 2018, nearly 80% of India's milk was sourced from small farms with herd size between one and two, the milk harvested by hand milking.[304]

Women tend to a recently planted rice field in Junagadh district in Gujarat. 55% of India's female workforce was employed in agriculture in 2019.[303]
According to the International Monetary Fund (IMF), the Indian economy in 2024 was nominally worth $3.94 trillion; it was the fifth-largest economy by market exchange rates and is, at around $15.0 trillion, the third-largest by purchasing power parity (PPP).[17] With its average annual GDP growth rate of 5.8% over the past two decades, and reaching 6.1% during 2011–2012,[305] India is one of the world's fastest-growing economies.[306] However, the country ranks 136th in the world in nominal GDP per capita and 125th in GDP per capita at PPP.[307] Until 1991, all Indian governments followed protectionist policies that were influenced by socialist economics. Widespread state intervention and regulation largely walled the economy off from the outside world. An acute balance of payments crisis in 1991 forced the nation to liberalise its economy;[308] since then, it has moved increasingly towards a free-market system[309][310] by emphasising both foreign trade and direct investment inflows.[311] India has been a member of World Trade Organization since 1 January 1995.[312]

The 522-million-worker Indian labour force is the world's second-largest, as of 2017.[294] The service sector makes up 55.6% of GDP, the industrial sector 26.3% and the agricultural sector 18.1%. India's foreign exchange remittances of US$100 billion in 2022,[313] the highest in the world, were contributed to its economy by 32 million Indians working in foreign countries.[314] Major agricultural products include rice, wheat, oilseed, cotton, jute, tea, sugarcane, and potatoes.[13] Major industries include textiles, telecommunications, chemicals, pharmaceuticals, biotechnology, food processing, steel, transport equipment, cement, mining, petroleum, machinery, and software.[13] In 2006, the share of external trade in India's GDP stood at 24%, up from 6% in 1985.[309] In 2008, India's share of world trade was 1.7%;[315] in 2021, India was the world's ninth-largest importer and the sixteenth-largest exporter.[316] Major exports include petroleum products, textile goods, jewellery, software, engineering goods, chemicals, and manufactured leather goods.[13] Major imports include crude oil, machinery, gems, fertiliser, and chemicals.[13] Between 2001 and 2011, the contribution of petrochemical and engineering goods to total exports grew from 14% to 42%.[317] India was the world's second-largest textile exporter after China in the 2013 calendar year.[318]

Averaging an economic growth rate of 7.5% for several years prior to 2007,[309] India has more than doubled its hourly wage rates during the first decade of the 21st century.[319] Some 431 million Indians have left poverty since 1985; India's middle classes are projected to number around 580 million by 2030.[320] Though ranking 68th in global competitiveness,[321] as of 2010, India ranks 17th in financial market sophistication, 24th in the banking sector, 44th in business sophistication, and 39th in innovation, ahead of several advanced economies.[322] With seven of the world's top 15 information technology outsourcing companies based in India, as of 2009, the country is viewed as the second-most favourable outsourcing destination after the United States.[323] India is ranked 40th in the Global Innovation Index in 2023.[324] As of 2023, India's consumer market was the world's fifth-largest.[325]

Driven by growth, India's nominal GDP per capita increased steadily from US$308 in 1991, when economic liberalisation began, to US$1,380 in 2010, to an estimated US$2,731 in 2024. It is expected to grow to US$3,264 by 2026.[17] However, it has remained lower than those of other Asian developing countries such as Indonesia, Malaysia, Philippines, Sri Lanka, and Thailand, and is expected to remain so in the near future.


A panorama of Bangalore, the centre of India's software development economy. In the 1980s, when the first multinational corporations began to set up centres in India, they chose Bangalore because of the large pool of skilled graduates in the area, in turn due to the many science and engineering colleges in the surrounding region.[326]
According to a 2011 PricewaterhouseCoopers (PwC) report, India's GDP at purchasing power parity could overtake that of the United States by 2045.[327] During the next four decades, Indian GDP is expected to grow at an annualised average of 8%, making it potentially the world's fastest-growing major economy until 2050.[327] The report highlights key growth factors: a young and rapidly growing working-age population; growth in the manufacturing sector because of rising education and engineering skill levels; and sustained growth of the consumer market driven by a rapidly growing middle-class.[327] The World Bank cautions that, for India to achieve its economic potential, it must continue to focus on public sector reform, transport infrastructure, agricultural and rural development, removal of labour regulations, education, energy security, and public health and nutrition.[328]

According to the Worldwide Cost of Living Report 2017 released by the Economist Intelligence Unit (EIU) which was created by comparing more than 400 individual prices across 160 products and services, four of the cheapest cities were in India: Bangalore (3rd), Mumbai (5th), Chennai (5th) and New Delhi (8th).[329]

Industries

A tea garden in Sikkim. India, the world's second-largest producer of tea, is a nation of one billion tea drinkers, who consume 70% of India's tea output.
India's telecommunication industry is the second-largest in the world with over 1.2 billion subscribers. It contributes 6.5% to India's GDP.[330] After the third quarter of 2017, India surpassed the US to become the second-largest smartphone market in the world after China.[331]

The Indian automotive industry, the world's second-fastest growing, increased domestic sales by 26% during 2009–2010,[332] and exports by 36% during 2008–2009.[333] In 2022, India became the world's third-largest vehicle market after China and the United States, surpassing Japan.[334] At the end of 2011, the Indian IT industry employed 2.8 million professionals, generated revenues close to US$100 billion equalling 7.5% of Indian GDP, and contributed 26% of India's merchandise exports.[335]

The pharmaceutical industry in India has emerged as a global player. As of 2021, with 3,000 pharmaceutical companies and 10,500 manufacturing units, India is the world's third-largest pharmaceutical producer and the largest producer of generic medicines, and it supplies up to 50–60% of global vaccine demand. Together these contribute up to US$24.44 billion in exports, and India's local pharmaceutical market is estimated at up to US$42 billion.[336][337] India is among the top 12 biotech destinations in the world.[338][339] The Indian biotech industry grew by 15.1% in 2012–2013, increasing its revenues from ₹204.4 billion (Indian rupees) to ₹235.24 billion (US$3.94 billion at June 2013 exchange rates).[340]

Energy
Main articles: Energy in India and Energy policy of India
India's capacity to generate electrical power is 300 gigawatts, of which 42 gigawatts is renewable.[341] The country's usage of coal is a major cause of greenhouse gas emissions by India but its renewable energy is competing strongly.[342] India emits about 7% of global greenhouse gas emissions. This equates to about 2.5 tons of carbon dioxide per person per year, which is half the world average.[343][344] Increasing access to electricity and clean cooking with liquefied petroleum gas have been priorities for energy in India.[345]

Socio-economic challenges

Health workers about to begin another day of immunisation against infectious diseases in 2006. Eight years later, and three years after India's last case of polio, the World Health Organization declared India to be polio-free.[346]
Despite economic growth during recent decades, India continues to face socio-economic challenges. In 2006, India contained the largest number of people living below the World Bank's international poverty line of US$1.25 per day.[347] The proportion decreased from 60% in 1981 to 42% in 2005.[348] Under the World Bank's later revised poverty line, it was 21% in 2011.[p][350] 30.7% of India's children under the age of five are underweight.[351] According to a Food and Agriculture Organization report in 2015, 15% of the population is undernourished.[352][353] The Midday Meal Scheme attempts to lower these rates.[354]

A 2018 Walk Free Foundation report estimated that nearly 8 million people in India were living in different forms of modern slavery, such as bonded labour, child labour, human trafficking, and forced begging, among others.[355] According to the 2011 census, there were 10.1 million child labourers in the country, a decline of 2.6 million from 12.6 million in 2001.[356]

Since 1991, economic inequality between India's states has consistently grown: the per-capita net state domestic product of the richest states in 2007 was 3.2 times that of the poorest.[357] Corruption in India is perceived to have decreased. According to the Corruption Perceptions Index, India ranked 78th out of 180 countries in 2018 with a score of 41 out of 100, an improvement from 85th in 2014.[358][359]""")
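
The `chunk_data` call above splits this long text into pieces of roughly 100 tokens with a 20-token overlap. As a rough illustration of what an overlapping window does, here is a minimal sketch that counts whitespace-separated words instead of GPT-2 tokens. The function `chunk_words` is hypothetical and is not how `RecursiveCharacterTextSplitter` works internally (the real splitter also respects separators such as paragraph breaks); it only shows the overlap idea.

```python
def chunk_words(text, chunk_size=100, overlap=20):
    """Split text into word windows of `chunk_size` that share `overlap` words.

    Hypothetical helper for illustration only: it counts whitespace-separated
    words, not GPT-2 tokens, and ignores separators.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic input so the window boundaries are easy to check.
sample = " ".join(f"w{i}" for i in range(250))
pieces = chunk_words(sample)
print(len(pieces), [len(p.split()) for p in pieces])
```

The overlap means the tail of one chunk is repeated at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.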

In [None]:
obj.store_data("""What is Retrieval-Augmented Generation (RAG)?
RAG (Retrieval-Augmented Generation) is an AI framework that combines the strengths of traditional information retrieval systems (such as databases) with the capabilities of generative large language models (LLMs).  By combining this extra knowledge with its own language skills, the AI can write text that is more accurate, up-to-date, and relevant to your specific needs.

How does Retrieval-Augmented Generation work?
RAGs operate with a few main steps to help enhance generative AI outputs: 

Retrieval and Pre-processing: RAGs leverage powerful search algorithms to query external data, such as web pages, knowledge bases, and databases. Once retrieved, the relevant information undergoes pre-processing, including tokenization, stemming, and removal of stop words.
Generation: The pre-processed retrieved information is then seamlessly incorporated into the pre-trained LLM. This integration enhances the LLM's context, providing it with a more comprehensive understanding of the topic. This augmented context enables the LLM to generate more precise, informative, and engaging responses. 
RAG operates by first retrieving relevant information from a database using a query generated by the LLM. This retrieved information is then integrated into the LLM's query input, enabling it to generate more accurate and contextually relevant text. RAG leverages vector databases, which store data in a way that facilitates efficient search and retrieval.

Why Use RAG?
RAG offers several advantages over traditional methods of text generation, especially when dealing with factual information or data-driven responses. Here are some key reasons why using RAG can be beneficial:

Access to updated information
Traditional LLMs are often limited to their pre-trained knowledge and data. This could lead to potentially outdated or inaccurate responses. RAG overcomes this by granting LLMs access to external information sources, ensuring accurate and up-to-date answers.

Factual grounding
LLMs are powerful tools for generating creative and engaging text, but they can sometimes struggle with factual accuracy. This is because LLMs are trained on massive amounts of text data, which may contain inaccuracies or biases.

RAG helps address this issue by providing LLMs with access to a curated knowledge base, ensuring that the generated text is grounded in factual information. This makes RAG particularly valuable for applications where accuracy is paramount, such as news reporting, scientific writing, or customer service.

Note: RAG may also assist in preventing hallucinations being sent to the end user. The LLM will still generate solutions from time to time where its training is incomplete but the RAG technique helps improve the user experience.

Contextual relevance
The retrieval mechanism in RAG ensures that the retrieved information is relevant to the input query or context.

By providing the LLM with contextually relevant information, RAG helps the model generate responses that are more coherent and aligned with the given context.

This contextual grounding helps to reduce the generation of irrelevant or off-topic responses.

Factual consistency
RAG encourages the LLM to generate responses that are consistent with the retrieved factual information.

By conditioning the generation process on the retrieved knowledge, RAG helps to minimize contradictions and inconsistencies in the generated text.

This promotes factual consistency and reduces the likelihood of generating false or misleading information.

Utilizes vector databases
RAGs leverage vector databases to efficiently retrieve relevant documents. Vector databases store documents as vectors in a high-dimensional space, allowing for fast and accurate retrieval based on semantic similarity.

Improved response accuracy
RAGs complement LLMs by providing them with contextually relevant information. LLMs can then use this information to generate more coherent, informative, and accurate responses, even multi-modal ones.

RAGs and chatbots
RAGs can be integrated into a chatbot system to enhance its conversational abilities. By accessing external information, RAG-powered chatbots leverage external knowledge to provide more comprehensive, informative, and context-aware responses, improving the overall user experience.""")
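
The stored text describes retrieval by semantic similarity over vectors, and the collections in this notebook are configured with cosine distance. The sketch below illustrates that ranking step in plain Python, under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, and `retrieve` is a hypothetical helper, not the Chroma API. A vector database would do the same kind of ranking with learned embeddings and an approximate index.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': lowercase word counts. A real system uses a trained model."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query (hypothetical helper)."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine_similarity(q, embed(d)), reverse=True)
    return ranked[:k]


docs = [
    "The Indian labour force is the world's second-largest.",
    "RAG combines retrieval with generative language models.",
    "India is the world's largest producer of milk.",
]
top = retrieve("Indian labour force", docs, k=1)

# The retrieved text is then placed into the prompt that is sent to the LLM.
prompt = f"Context:\n{top[0]}\n\nQuestion: Indian labour force"
print(prompt)
```

Word counts only match exact tokens ("India" and "Indian" do not match here), which is one reason real systems rely on learned embeddings rather than raw counts.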

In [None]:
query = "Indian labour force"

In [None]:
search = SearchData(obj)
search.search_tags(query)

In [None]:
search.search_content(query)