In [26]:
import cohere
import uuid
import hnswlib
from typing import List, Dict
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

# Cohere API key
api_key = 'BfTJNyke3jghhXpgRy6R1SSl8fRpso5o02i78moX'

# Set up Cohere client
co = cohere.Client(api_key)

In [27]:
from unstructured.partition.text import partition_text

elements = partition_text(filename="/Users/rohanchandra/Desktop/git/GSAHackathonTeam49/KnowledgeBase/can_order_food_safety_publications.txt")

with open("/Users/rohanchandra/Desktop/git/GSAHackathonTeam49/KnowledgeBase/can_order_food_safety_publications.txt", "r") as f:
  elements = partition_text(file=f)

with open("/Users/rohanchandra/Desktop/git/GSAHackathonTeam49/KnowledgeBase/can_order_food_safety_publications.txt", "r") as f:
  text = f.read()
elements = partition_text(text=text)

# Create the VectorStore Component

The Vectorstore class handles the ingestion of documents into embeddings (or vectors) and the retrieval of relevant documents given a query.

As an example, we’ll use the contents from LLM University: What are Large Language Models? which explains the architecture of large language models. It consists of four web pages, each in the Python list raw_documents below. Each entry is identified by its title and URL.


In [28]:
raw_documents = [
    {
        "title": "Food Supply Chains",
        "url": "https://www.usda.gov/coronavirus/food-supply-chain"},
    {
        "title": "USDA Programs",
        "url": "https://www.rd.usda.gov/programs-services/single-family-housing-programs/single-family-housing-direct-home-loans"},
    {
        "title": "Funding Opportunities",
        "url": "https://www.usda.gov/media/press-releases/2022/08/24/usda-announces-550-million-american-rescue-plan-funding-projects"},
    # {
    #     "title": "Single Family Loan Program",
    #     "url": "https://www.rd.usda.gov/programs-services/single-family-housing-programs/single-family-housing-guaranteed-loan-program"},
    # {
    #     "title": "Farm Loan Program",
    #     "url": "https://www.fsa.usda.gov/programs-and-services/farm-loan-programs/"},
    # {
    #     "title": "Transformer Models",
    #     "url": "https://docs.cohere.com/docs/transformer-models"}
]

We implement this in the Vectorstore class below, which takes the raw_documents list as input.

We also initialize a few instance attributes and methods. The attributes include self.raw_documents to represent the raw documents, self.docs to represent the chunked version of the documents, self.docs_embs to represent the embeddings of the chunked documents, and a couple of top_k parameters to be used for retrieval and reranking.

Meanwhile, the methods include load_and_chunk(), embed(), and index() for ingesting raw documents. As you’ll see, we will also specify a retrieve() method to retrieve relevant document chunks given a query.

## Load and Chunk the Documents
The load_and_chunk() method loads the raw documents from the URL and breaks them into smaller chunks. Chunking for information retrieval is a broad topic in and of itself, with many strategies being discussed within the AI community. For our example, we’ll utilize the partition_html method from the unstructured library. Read its documentation for more information about its chunking approach.

Each chunk is turned into a dictionary with three fields:

title: The web page’s title
text: The textual content of the chunk
url: The web page’s URL
This information will eventually be passed to the chatbot’s prompt for generating the response, so it’s crucial to populate relevant information into this dictionary. Note that we are not limited to these three fields. At a minimum, the Chat endpoint requires the text field, but beyond that, we can add custom fields that can provide more context about the document, such as subtitles, snippets, tags, and others.

The resulting dictionaries are stored in the self.docs attribute.

## Embed the Document Chunks
The embed() method generates embeddings of the chunked documents. We use the Embed endpoint and Cohere's embed-english-v3.0 model. Since the endpoint has a limit of 96 documents per call, we send them in batches.

With the Embed v3 model, we need to define an input_type, of which there are four options depending on the type of task. Using these input types ensures the highest possible quality for the respective tasks. Since our document chunks will be used for retrieval, we use search_document as the input_type.

The resulting chunk embeddings are stored in the self.docs_embs attribute.

## Index Document Chunks
The index() method indexes the document chunk embeddings. We build an index to store the embeddings in a structured and organized way in order to ensure efficient similarity search during retrieval.

There are many options available for building an index. For production environments, typically a vector database (like Weaviate or MongoDB) is required to handle the continuous process of indexing documents and maintaining the index.

In our example, however, we’ll keep it simple and use a vector library instead. We can choose from many open-source projects, such as Faiss, Annoy, ScaNN, or Hnswlib, which is the one we’ll use. These libraries store embeddings in in-memory indexes and implement approximate nearest neighbor (ANN) algorithms to make similarity search efficient.

The resulting document chunk embeddings are stored in the self.idx attribute.

## Implement Retrieval
The retrieve() method uses semantic search to retrieve relevant document chunks given a query, and it has two steps: (1) dense retrieval, (2) reranking.
### Dense Retrieval
We implement a dense retrieval system that leverages embeddings to retrieve document chunks, offering significant improvements over basic keyword-matching approaches. Embeddings can capture the contextual meaning of a document, thus enabling the retrieval of highly relevant results to the given query.

We embed the query using the same embed-english-v3.0 model that we used to embed the document chunks, but this time, we set input_type=”search_query”.

Search is performed by the knn_query() method from the hnswlib library. Given a query, it returns the document chunks most similar to the query. We define the number of document chunks to return using the attribute self.retrieve_top_k=10.

### Reranking
After dense retrieval, we implement a reranking step. While our dense retrieval component is already highly capable of retrieving relevant sources, the Rerank endpoint provides an additional boost to the quality of the search results, especially for complex and domain-specific queries. It takes the search results and sorts them according to their relevance to the query.

We call the Rerank endpoint with co.rerank() and pass the query and the list of document chunks to be reranked. We also define the number of top reranked document chunks to retrieve using the attribute self.rerank_top_k=3. The model we use is rerank-english-v3.0, which lets you rerank documents that contain multiple fields, in the form of JSON objects. In our case, we'll use the title and text fields for reranking.

This method returns the top retrieved document chunks as a Python list docs_retrieved, so that they can be passed to the chatbot, which we’ll implement next.

In [29]:
class Vectorstore:
    def __init__(self, raw_documents: List[Dict[str, str]]):
        self.raw_documents = raw_documents
        self.docs = []
        self.docs_embs = []
        self.retrieve_top_k = 10
        self.rerank_top_k = 3
        self.load_and_chunk()
        self.embed()
        self.index()
        
    def load_and_chunk(self) -> None:
        """
        Loads the text from the sources and chunks the HTML content.
        """
        print("Loading documents...")

        for raw_document in self.raw_documents:
            elements = partition_html(url=raw_document["url"])
            chunks = chunk_by_title(elements)
            for chunk in chunks:
                self.docs.append(
                    {
                        "title": raw_document["title"],
                        "text": str(chunk),
                        "url": raw_document["url"],
                    }
                )

    def embed(self) -> None:
        """
        Embeds the document chunks using the Cohere API.
        """
        print("Embedding document chunks...")

        batch_size = 90
        self.docs_len = len(self.docs)
        for i in range(0, self.docs_len, batch_size):
            batch = self.docs[i : min(i + batch_size, self.docs_len)]
            texts = [item["text"] for item in batch]
            docs_embs_batch = co.embed(
                texts=texts, model="embed-english-v3.0", input_type="search_document"
            ).embeddings
            self.docs_embs.extend(docs_embs_batch)


    def index(self) -> None:
        """
        Indexes the documents for efficient retrieval.
        """
        print("Indexing documents...")

        self.idx = hnswlib.Index(space="ip", dim=1024)
        self.idx.init_index(max_elements=self.docs_len, ef_construction=512, M=64)
        self.idx.add_items(self.docs_embs, list(range(len(self.docs_embs))))

        print(f"Indexing complete with {self.idx.get_current_count()} documents.")

    def retrieve(self, query: str) -> List[Dict[str, str]]:
        """
        Retrieves document chunks based on the given query.
    
        Parameters:
        query (str): The query to retrieve document chunks for.
    
        Returns:
        List[Dict[str, str]]: A list of dictionaries representing the retrieved document chunks, with 'title', 'text', and 'url' keys.
        """

        # Dense retrieval
        query_emb = co.embed(
            texts=[query], model="embed-english-v3.0", input_type="search_query"
        ).embeddings

        #print ("Raw query results:")
        #print (self.idx.knn_query(query_emb, k=self.retrieve_top_k))
        doc_ids = self.idx.knn_query(query_emb, k=self.retrieve_top_k)[0][0]
        
        #print ("Doc IDs:")
        #print (doc_ids)

        # Reranking
        rank_fields = ["title", "text"] # We'll use the title and text fields for reranking

        docs_to_rerank = [self.docs[doc_id] for doc_id in doc_ids]

        #print ("Documents to rerank:")
        #print (docs_to_rerank)

        rerank_results = co.rerank(
            query=query,
            documents=docs_to_rerank,
            top_n=self.rerank_top_k,
            model="rerank-english-v3.0",
            rank_fields=rank_fields
        )
    
        doc_ids_reranked = [doc_ids[result.index] for result in rerank_results.results]
        #print("doc_ids_reranked:")
        #print(doc_ids_reranked)
        #print(rerank_results.results[0].index)
        #print(rerank_results.results[0].relevance_score)
        #print ("Re-ranked Documents:")
        #print (rerank_results)
        
        docs_retrieved = []
        for doc_id in doc_ids_reranked:
            docs_retrieved.append(
                {
                    "title": self.docs[doc_id]["title"],
                    "text": self.docs[doc_id]["text"],
                    "url": self.docs[doc_id]["url"],
                }
            )

        return docs_retrieved
        

## Process the Documents
We can now process the raw documents. We do that by creating an instance of Vectorstore. In our case, we get a total of 136 documents, chunked from the four web URLs.

In [30]:
test_vectorstore = Vectorstore(raw_documents)

Loading documents...
Embedding document chunks...
Indexing documents...
Indexing complete with 105 documents.


## Test Retrieval
Before going further, we first test the document retrieval part of the system. First, we create an instance of the Vectorstore with the raw documents that we have defined. Then, we use the retrieve method to retrieve the most relevant documents to the query multi-head attention definition.

In [31]:
test_vectorstore.retrieve("multi-head attention definition")

[{'title': 'Food Supply Chains',
  'text': 'considered in accordance with the hierarchy.',
  'url': 'https://www.usda.gov/coronavirus/food-supply-chain'},
 {'title': 'Food Supply Chains',
  'text': 'worker health and safety.',
  'url': 'https://www.usda.gov/coronavirus/food-supply-chain'},
 {'title': 'Food Supply Chains',
  'text': 'A: USDA is monitoring the situation closely in collaboration with our federal and state partners. FNS is ready to assist in the government-wide effort to ensure all Americans have access to food in times of need. In the event of an emergency or disaster situation, FNS programs are just one part of a much larger government-wide coordinated response. All our programs, including SNAP, WIC, and the National School Lunch and Breakfast Programs, have flexibilities and contingencies built-in to allow',
  'url': 'https://www.usda.gov/coronavirus/food-supply-chain'}]

In [32]:
test_vectorstore.retrieve("strawberry")

[{'title': 'Funding Opportunities',
  'text': 'help ensure underserved producers have the resources, tools, programs, and technical support they need to succeed.',
  'url': 'https://www.usda.gov/media/press-releases/2022/08/24/usda-announces-550-million-american-rescue-plan-funding-projects'},
 {'title': 'Food Supply Chains',
  'text': 'facilitate ongoing operations and support the food supply, while also mitigating the risk of spreading COVID-19.',
  'url': 'https://www.usda.gov/coronavirus/food-supply-chain'},
 {'title': 'Food Supply Chains',
  'text': 'considered in accordance with the hierarchy.',
  'url': 'https://www.usda.gov/coronavirus/food-supply-chain'}]

## Create the Chatbot Component
The Chatbot class handles the interaction between the user and the chatbot. It also handles the logic of the chatbot, including generating search queries based on a user message, and retrieving documents.

The Chatbot class takes an instance of the Vectorstore class. We initialize a self.vectorstore attribute for that instance, as well as a unique conversation ID that we’ll need for each conversation.  

### Get the User Message
Next, we create a run() method that will be used to run the chatbot application. It begins with the logic for getting the user message, along with a way for the user to end the conversation.

In [33]:
class Chatbot:
    def __init__(self, vectorstore: Vectorstore):
        """
        Initializes an instance of the Chatbot class.

        Parameters:
        vectorstore (Vectorstore): An instance of the Vectorstore class.

        """
        self.vectorstore = vectorstore
        self.conversation_id = str(uuid.uuid4())
        
    def run(self):
            """
            :param self: 
            :return: 
            
            Runs the chatbot application
            """
            while True:
                # Get the user message
                message = input ("User: ")
                
                # Typing "quit" ends the conversation
                if message.lower() == "quit":
                    print ("Ending chat.")
                    break
                else:
                    print (f"User: {message}")

                # Generate search queries, if any
                response = co.chat(message=message, search_queries_only=True)
                
                # if there are search queries, retrieve document chunks and respond
                if response.search_queries:
                    print ("Retrieving information...", end="")
                    
                    # Retrieve document chunks for each query
                    documents = []
                    for query in response.search_queries:
                        documents.extend(self.vectorstore.retrieve(query.text))
                    
                    # Use document chunks to respond
                    response = co.chat_stream(
                        message=message,
                        model='command-r',
                        documents=documents,
                        conversation_id=self.conversation_id,
                    )
                # If there is no search query, directly respond
                else:
                    response = co.chat_stream(
                        message=message,
                        model="command-r",
                        conversation_id=self.conversation_id,
                    )

                # Print the chatbot response, citations, and documents
                print("\nChatbot:")
                citations = []
                cited_documents = []

                # Display response
                for event in response:
                    if event.event_type == "text-generation":
                        print(event.text, end="")
                    elif event.event_type == "citation-generation":
                        citations.extend(event.citations)
                    elif event.event_type == "search-results":
                        cited_documents = event.documents

                # Display citations and source documents
                if citations:
                    print("\n\nCITATIONS:")
                    for citation in citations:
                        print(citation)

                    print("\nDOCUMENTS:")
                    for document in cited_documents:
                        print(document)

                print(f"\n{'-'*100}\n")

In [34]:
chatbot = Chatbot(test_vectorstore)

chatbot.run()

User: I want to know about funding opportunities, technical assistance, and USDA programs that support rural communities
Retrieving information...
Chatbot:
There are various funding opportunities and programs offered by the USDA to support rural communities.

The U.S. Department of Agriculture (USDA) has announced up to $550 million in funding to support diverse projects for underserved producers. The projects aim to help them access land, capital, and markets and train a new generation of agricultural professionals. This funding is provided through the American Rescue Plan Act (ARPA).
USDA also provides technical assistance and support for underserved producers including veterans, limited-resource producers, beginning farmers and ranchers, and those in high poverty areas. The support is offered through topics like business development, leadership, and management training. Additionally, there are partnership agreements worth at least $25 million in technical assistance for which applic

BadRequestError: status_code: 400, body: {'message': 'invalid request: message must be at least 1 token long or tool results must be specified.'}