
# PART D: A SIMPLE RAG PIPELINE BASED ON GEMINI & CHROMADB

In this notebook we will develop a Retrieval Augmented Generation (RAG) application.

The parts of this series are:

* PART A: AN INTRO TO GEMINI API FOR TEXT GENERATION & CHAT
* PART B: CODE WITH CHROMADB FOR VECTOR STORAGE & SIMILARITY SEARCH
* PART C: CODE WITH CHROMADB FOR PERSISTENT VECTOR DB
* PART D: A SIMPLE RAG BASED ON GEMINI & CHROMADB
* PART E: ADVANCED TECHNIQUES FOR RAG BASED ON GEMINI & CHROMADB

# WHAT IS RAG?

RAG stands for Retrieval-Augmented Generation. It's a technique that combines large language models (LLMs) with external knowledge sources to improve the accuracy and reliability of AI-generated text.

## How Does RAG Work? Unveiling the Power of External Knowledge

Before we start the core RAG process, we need to provide a foundation as follows:

* **Building the Knowledge Base:** The system starts by transforming documents and information within the external knowledge base (like Wikipedia or a company database) into a special format called **vector representations**. These condense the meaning of each document into a series of **numbers**, capturing the essence of the content.

* **Vector Database for Speedy Retrieval**: These vector representations are then stored in a specialized database called a vector database. This database is optimized for efficiently **searching and retrieving** information based on **semantic similarity**. Imagine it as a super-powered library catalog that **understands the meaning** of documents, **not just keywords**.
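
To make the idea of semantic similarity concrete, here is a minimal sketch of a similarity search over vectors. The three-dimensional vectors and the names `toy_store` and `most_similar` are made up for illustration; a real embedding model produces vectors with hundreds of dimensions, and a vector database uses indexing structures rather than a plain sort.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical "embeddings": hand-written numbers, not produced by any model.
toy_store = {
    "dinosaur extinction": [0.9, 0.1, 0.0],
    "asteroid impacts": [0.8, 0.3, 0.1],
    "pasta recipes": [0.0, 0.2, 0.9],
}

def most_similar(query_vector, store, top_k=2):
    """Rank the stored items by cosine similarity to the query vector."""
    ranked = sorted(
        store,
        key=lambda name: cosine_similarity(query_vector, store[name]),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical embedding of a question about dinosaurs.
query_vector = [1.0, 0.2, 0.0]
print(most_similar(query_vector, toy_store))
```

The two items whose vectors point in nearly the same direction as the query rank first, even though no keyword matching takes place.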

Now, let's explore how RAG leverages this foundation:

* **User Input**: The RAG process begins with a question or **prompt** from the user. This could be anything from "What caused the extinction of the dinosaurs?" to a more open-ended request like "Write a creative story."

* **Intelligent Retrieval**: RAG doesn't rely solely on the **LLM's internal knowledge**. It employs an information retrieval component that acts like a super-powered search engine. This component scans the vast external knowledge base – like a company's internal database for specific domains – to find information **directly relevant** to the user's input. Unlike a traditional **search engine** that relies on **keywords**, RAG leverages the power of vector representations to understand the **semantic meaning** of the user's prompt and identify the most relevant documents.

* **Enriched Context Creation**: The retrieved information isn't just shown alongside the prompt. RAG cleverly **merges the user input with the relevant snippets** from the knowledge base. This creates a ***richer context*** for the LLM to understand the **user's intent** and formulate a well-informed response.

* **LLM Powered Response Generation**: Finally, the **enriched context** is fed to the Large Language Model (LLM). The LLM, along with its ability to process language patterns, now has a strong **foundation of factual** information to draw upon. This empowers it to generate a response that is both comprehensive and accurate, addressing the specific needs of the user's prompt.
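
The "enriched context" step above can be sketched as plain string assembly. The function name `build_rag_prompt`, the prompt wording, and the sample chunks are assumptions made for illustration; the prompt format is a design choice and not something prescribed by Gemini or ChromaDB.

```python
def build_rag_prompt(question, retrieved_chunks):
    """Merge the user's question with retrieved snippets into one prompt string."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}"
        for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Made-up snippets standing in for chunks returned by the retriever.
chunks = [
    "The classifier is trained on textual features.",
    "The task is framed as binary classification.",
]
prompt = build_rag_prompt("How is the task framed?", chunks)
print(prompt)
```

Numbering the snippets makes it possible to ask the model to cite which snippet supports its answer.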

In this part, we build a knowledge base on a persistent ChromaDB vector database and connect it to Gemini to form a simple RAG pipeline.

https://www.trychroma.com/
https://github.com/chroma-core/chroma

# CONTENT: A SIMPLE RAG PIPELINE BASED ON GEMINI & CHROMADB

In this tutorial series, we develop a Retrieval Augmented Generation (RAG) application. If you want to build a chatbot with GEMINI and ChromaDB, you are in the right place. This video is for anyone interested in building a RAG system, whether you are a seasoned developer or just starting out.

In the first three parts of this series, we explored:

* Coding the GEMINI API for Text Generation & Chat: Understanding how to implement and use the GEMINI API for creating dynamic text-based interactions.
* Creating a Persistent ChromaDB for Vector Storage & Similarity Search: Learning how to store and retrieve vectors efficiently using ChromaDB.

In this fourth installment, titled "A SIMPLE RAG PIPELINE BASED ON GEMINI & CHROMADB," we aim to construct a functional RAG pipeline using these powerful tools. Here's what you can expect:

Key Steps Covered in this Video:
* Creating a Knowledge Base from Scratch with a Persistent ChromaDB: Learn how to build a knowledge base from multiple documents.
* Upload Multiple Documents and Create Knowledge Base: Step-by-step guide on uploading and organizing your documents.
* Test the Knowledge Base: Methods to ensure your knowledge base is functioning correctly.
* Load a Knowledge Base from a Persistent ChromaDB: How to load and use your knowledge base in a new session.
* Connect to an LLM: Google GEMINI via the Chat API: Integrate the Google GEMINI model for enhanced interaction.
* Create the RAG Pipeline for the Existing Knowledge Base: Develop a seamless pipeline to utilize your knowledge base with GEMINI.
* A Simple Loop for User Interaction: Implement a user-friendly loop for interactions.
* A Gradio Interface to the RAG: Create an intuitive interface using Gradio for a better user experience.

All these steps will be implemented and coded in Python on Google Colab, ensuring you can follow along and replicate the process easily.

Follow Us:
Murat Karakaya  on LinkedIn
Murat Karakaya  on Twitter




# WHY DO WE NEED A PERSISTENT CHROMADB?

In the context of a Retrieval-Augmented Generation (RAG) approach, saving and loading a persistent ChromaDB is particularly important for several reasons:

1. **Enhanced Data Durability**:
   - **Importance**: Ensures the retrieval database used for augmenting generative models is not lost between sessions or system restarts.
   - **RAG Relevance**: Maintains a consistent and reliable knowledge base that the generative model can reference, leading to more accurate and relevant responses.

2. **Operational Continuity**:
   - **Importance**: Allows seamless continuation of operations without needing to re-index or re-import data, saving time and computational resources.
   - **RAG Relevance**: Ensures that the generative model has continuous access to the same set of documents, which is essential for generating consistent and coherent responses over time.

3. **Facilitating Collaboration**:
   - **Importance**: Enables multiple users or systems to share and access the same dataset.
   - **RAG Relevance**: Supports collaborative development and usage of the RAG system, allowing different teams to work on improving the retrieval and generation processes simultaneously.

4. **Scalability**:
   - **Importance**: Provides a stable and persistent backend, enabling efficient handling of large datasets.
   - **RAG Relevance**: Essential for scaling the RAG system to handle more extensive and diverse knowledge bases, ensuring that the system can manage increased loads and deliver prompt, relevant information.


In a RAG system, the retriever (like ChromaDB) provides the generative model with relevant context from a knowledge base to generate informed and accurate responses. Persistent storage ensures that this knowledge base is durable, continuously available, and scalable, which is critical for the reliability, consistency, and performance of the RAG system.



# CREATING A KNOWLEDGE BASE FROM SCRATCH WITH A PERSISTENT CHROMADB

To make ChromaDB durable (persistent) rather than temporary on Google Colab, you can use external storage services like Google Drive or set up a cloud-based database. Google Colab provides temporary storage that resets after each session, so to maintain persistence across sessions, you'll need to save your data and configurations externally.

## 1 Install required libraries

Install the required libraries and define the helper functions.

In [1]:
%pip install chromadb --quiet
%pip install sentence_transformers --quiet
%pip install pypdf --quiet
%pip install langchain --quiet
%pip install tqdm --quiet
%pip install -U langchain langchain-community langchain-text-splitters


import langchain
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_text_splitters import SentenceTransformersTokenTextSplitter


from pypdf import PdfReader

from chromadb.config import DEFAULT_TENANT, DEFAULT_DATABASE, Settings
from chromadb import Client, PersistentClient
from chromadb.utils import embedding_functions

import textwrap
from IPython.display import display
from IPython.display import Markdown
def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))



## 2 Initialize a Persistent ChromaDB client with a proper Google Drive connection

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
%cd drive/MyDrive/'Colab Notebooks'

/content/drive/MyDrive/Colab Notebooks


In [4]:
# Initialize ChromaDB client with Google Drive connection
chromaDB_path = '/content/drive/MyDrive/Colab Notebooks/ChromaDBData'


In [5]:
# Check if the chromadb_path exists or not. If so, delete all the files and folders in chromadb_path. But before deleting get the permission from the user.

import os
import shutil

def delete_all_files_and_folders(chromaDB_path):
  if os.path.exists(chromaDB_path):
    print(f"The directory '{chromaDB_path}' already exists.")
    permission = input("Do you want to delete all the files and folders in this directory? (y/n): ")
    if permission == "y":
      shutil.rmtree(chromaDB_path)
      print(f"All files and folders in '{chromaDB_path}' have been deleted.")
    else:
      print("No action taken.")
  else:
    print(f"The directory '{chromaDB_path}' does not exist.")



In [6]:
delete_all_files_and_folders(chromaDB_path)

The directory '/content/drive/MyDrive/Colab Notebooks/ChromaDBData' already exists.
Do you want to delete all the files and folders in this directory? (y/n): y
All files and folders in '/content/drive/MyDrive/Colab Notebooks/ChromaDBData' have been deleted.


## 3 Define PersistentClient

Let's re-define the **create_chroma_client** function from the previous part so that this time we initialize a **persistent** ChromaDB client:

In [7]:
from chromadb.config import DEFAULT_TENANT, DEFAULT_DATABASE, Settings
from chromadb import Client, PersistentClient


In [8]:
def create_chroma_client(collection_name, embedding_function, chromaDB_path ):
  if chromaDB_path is not None:
    chroma_client = PersistentClient(path=chromaDB_path,
                                     settings=Settings(),
                                     tenant=DEFAULT_TENANT,
                                     database=DEFAULT_DATABASE,)
  else:
    chroma_client = Client()

  chroma_collection = chroma_client.get_or_create_collection(
      collection_name,
      embedding_function=embedding_function)

  return chroma_client, chroma_collection

## 4 Create a collection as usual

In [9]:
collection_name = "Papers"
sentence_transformer_model="distiluse-base-multilingual-cased-v1"
embedding_function= embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=sentence_transformer_model)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [10]:
chroma_client, chroma_collection = create_chroma_client(collection_name,
                                                        embedding_function,
                                                        chromaDB_path)

### Check the created collection:

In [11]:
# Verify collection properties
print(f"Collection name: {chroma_collection.name}")
print(f"Number of documents in collection: {chroma_collection.count()}")

# List all collections in the client
print("All collections in ChromaDB client:")
for collection in chroma_client.list_collections():
    print(collection.name)

Collection name: Papers
Number of documents in collection: 0
All collections in ChromaDB client:
Papers


## 5 Define helper functions

In [12]:
from google.colab import files
def upload_multiple_files():
  uploaded = files.upload()
  file_names = list()
  for fn in uploaded.keys():
    #print('User uploaded file "{name}" with length {length} bytes'.format(name=fn, length=len(uploaded[fn])))
    file_names.append(fn)
  return file_names

In [13]:
def convert_PDF_Text(pdf_path):
  reader = PdfReader(pdf_path)
  pdf_texts = [p.extract_text().strip() for p in reader.pages]
  # Filter the empty strings
  pdf_texts = [text for text in pdf_texts if text]
  print("Document: ",pdf_path," chunk size: ", len(pdf_texts))
  return pdf_texts

In [14]:
def convert_Page_ChunkinChar(pdf_texts, chunk_size=1500, chunk_overlap=0):
  # Split the joined page texts into chunks of at most `chunk_size` characters.
  # The arguments are now passed through instead of being hard-coded.
  character_splitter = RecursiveCharacterTextSplitter(
      separators=["\n\n", "\n", ". ", " ", ""],
      chunk_size=chunk_size,
      chunk_overlap=chunk_overlap
  )
  character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))
  print(f"\nTotal number of chunks (document splited by max char = {chunk_size}): \
        {len(character_split_texts)}")
  return character_split_texts
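
To show what splitting on a priority list of separators means, here is a simplified pure-Python sketch. It is not LangChain's implementation: `split_recursively` is a hypothetical helper, it drops the separator it splits on, and it ignores overlap. It only illustrates the idea of trying coarse separators first and falling back to finer ones when a piece is still too long.

```python
def split_recursively(text, chunk_size, separators=("\n\n", "\n", ". ", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    preferring the earliest separator in the list."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(split_recursively(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = ("First paragraph.\n\n"
          "Second paragraph is a little longer. It has two sentences.\n\n"
          "Third.")
chunks = split_recursively(sample, chunk_size=40)
print(chunks)
```

Paragraph breaks are tried first, so short paragraphs stay intact; only the paragraph that exceeds the limit is broken at a sentence boundary.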

In [15]:
def convert_Chunk_Token(text_chunksinChar, sentence_transformer_model, chunk_overlap=0, tokens_per_chunk=128):
  # Re-split each character chunk by token count.
  # The arguments are now passed through instead of being hard-coded.
  token_splitter = SentenceTransformersTokenTextSplitter(
      chunk_overlap=chunk_overlap,
      model_name=sentence_transformer_model,
      tokens_per_chunk=tokens_per_chunk)

  text_chunksinTokens = []
  for text in text_chunksinChar:
      text_chunksinTokens += token_splitter.split_text(text)
  print(f"\nTotal number of chunks (document splited by {tokens_per_chunk} tokens per chunk):\
       {len(text_chunksinTokens)}")
  return text_chunksinTokens

In [16]:
def add_meta_data(text_chunksinTokens, title, category, initial_id):
  ids = [str(i+initial_id) for i in range(len(text_chunksinTokens))]
  metadata = {
      'document': title,
      'category': category
  }
  metadatas = [ metadata for i in range(len(text_chunksinTokens))]
  return ids, metadatas
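
The `initial_id` argument keeps IDs unique when several documents go into the same collection. The sketch below repeats the helper's logic with one small variation (each chunk gets its own metadata dict rather than a shared one) and uses made-up file names and chunk texts to show how the second document's IDs continue where the first document's IDs stopped.

```python
def add_meta_data(text_chunks, title, category, initial_id):
    # Same ID scheme as the notebook helper; here every chunk gets its own dict.
    ids = [str(i + initial_id) for i in range(len(text_chunks))]
    metadatas = [{"document": title, "category": category} for _ in text_chunks]
    return ids, metadatas

# Hypothetical chunks and file names, for illustration only.
first_ids, first_meta = add_meta_data(
    ["a", "b", "c"], "paper_one.pdf", "Journal Paper", initial_id=0)
second_ids, second_meta = add_meta_data(
    ["d", "e"], "paper_two.pdf", "Journal Paper", initial_id=len(first_ids))

print(first_ids, second_ids)
```

Starting the second call at `len(first_ids)` mirrors how `current_id` is advanced by the number of chunks after each document is added.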

In [17]:
def add_document_to_collection(ids, metadatas, text_chunksinTokens, chroma_collection):
  print("Before inserting, the size of the collection: ", chroma_collection.count())
  chroma_collection.add(ids=ids, metadatas= metadatas, documents=text_chunksinTokens)
  print("After inserting, the size of the collection: ", chroma_collection.count())
  return chroma_collection

In [18]:
def retrieveDocs(chroma_collection, query, n_results=5, return_only_docs=False):
    results = chroma_collection.query(query_texts=[query],
                                      include= [ "documents","metadatas",'distances' ],
                                      n_results=n_results)

    if return_only_docs:
        return results['documents'][0]
    else:
        return results
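
The `[0]` index in `results['documents'][0]` is there because the query result holds one inner list per query text. The sketch below uses a hand-written dictionary in that shape (the values are made up, and the presence of an `ids` key is an assumption about the result layout) and a hypothetical helper, `flatten_first_query`, that turns the parallel lists into one record per retrieved chunk.

```python
# Hand-written stand-in for the result of ONE query with n_results=2.
sample_results = {
    "ids": [["12", "87"]],
    "documents": [["first chunk text", "second chunk text"]],
    "metadatas": [[
        {"document": "a.pdf", "category": "Journal Paper"},
        {"document": "b.pdf", "category": "Journal Paper"},
    ]],
    "distances": [[0.68, 0.71]],
}

def flatten_first_query(results):
    """Turn the parallel lists of the first query into a list of row dicts."""
    rows = zip(
        results["ids"][0],
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    )
    return [
        {"id": i, "text": doc, "source": meta["document"], "distance": dist}
        for i, doc, meta, dist in rows
    ]

records = flatten_first_query(sample_results)
for record in records:
    print(record)
```

A row-oriented view like this can be easier to filter or sort than four parallel lists.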

In [19]:
def show_results(results, return_only_docs=False):

  if return_only_docs:
    retrieved_documents = results
    if len(retrieved_documents) == 0:
      print("No results found.")
      return
    for i, doc in enumerate(retrieved_documents):
      print(f"Document {i+1}:")
      print("\tDocument Text: ")
      display(to_markdown(doc));
  else:

      retrieved_documents = results['documents'][0]
      if len(retrieved_documents) == 0:
          print("No results found.")
          return
      retrieved_documents_metadata = results['metadatas'][0]
      retrieved_documents_distances = results['distances'][0]
      print("------- retreived documents -------\n")

      for i, doc in enumerate(retrieved_documents):
          print(f"Document {i+1}:")
          print("\tDocument Text: ")
          display(to_markdown(doc));
          print(f"\tDocument Source: {retrieved_documents_metadata[i]['document']}")
          print(f"\tDocument Source Type: {retrieved_documents_metadata[i]['category']}")
          print(f"\tDocument Distance: {retrieved_documents_distances[i]}")


In [20]:
def load_multiple_pdfs_to_ChromaDB(collection_name,sentence_transformer_model,
                                   chromaDB_path):

  collection_name= collection_name
  category= "Journal Paper"
  sentence_transformer_model=sentence_transformer_model
  embedding_function= embedding_functions.SentenceTransformerEmbeddingFunction(model_name=sentence_transformer_model)
  chroma_client, chroma_collection = create_chroma_client(collection_name, embedding_function, chromaDB_path)
  current_id = chroma_collection.count()
  file_names = upload_multiple_files()
  for file_name in file_names:
    print(f"Document: {file_name} is being processed to be added to the {chroma_collection.name} {chroma_collection.count()}")
    print(f"current_id: {current_id} ")
    pdf_texts = convert_PDF_Text(file_name)
    text_chunksinChar = convert_Page_ChunkinChar(pdf_texts)
    text_chunksinTokens = convert_Chunk_Token(text_chunksinChar,sentence_transformer_model)
    ids,metadatas = add_meta_data(text_chunksinTokens,file_name,category, current_id)
    current_id = current_id + len(text_chunksinTokens)
    chroma_collection = add_document_to_collection(ids, metadatas, text_chunksinTokens, chroma_collection)
    print(f"Document: {file_name} added to the collection: {chroma_collection.count()}")
  return  chroma_client, chroma_collection

## 6 Upload Multiple Documents and Create Knowledge Base

Run load_multiple_pdfs_to_ChromaDB() to fill the collection:

In [21]:
chroma_client, chroma_collection= load_multiple_pdfs_to_ChromaDB(collection_name,sentence_transformer_model, chromaDB_path)

Saving 1912.10819v1.pdf to 1912.10819v1 (2).pdf
Saving peerj-cs-93.pdf to peerj-cs-93 (2).pdf
Document: 1912.10819v1 (2).pdf is being processed to be added to the Papers 0
current_id: 0 
Document:  1912.10819v1 (2).pdf  chunk size:  12

Total number of chunks (document splited by max char = 1500):         27






Total number of chunks (document splited by 128 tokens per chunk):       73
Before inserting, the size of the collection:  0
After inserting, the size of the collection:  73
Document: 1912.10819v1 (2).pdf added to the collection: 73
Document: peerj-cs-93 (2).pdf is being processed to be added to the Papers 73
current_id: 73 
Document:  peerj-cs-93 (2).pdf  chunk size:  19

Total number of chunks (document splited by max char = 1500):         48




Total number of chunks (document splited by 128 tokens per chunk):       126
Before inserting, the size of the collection:  73
After inserting, the size of the collection:  199
Document: peerj-cs-93 (2).pdf added to the collection: 199


## 7 Test the Knowledge Base

Query the Knowledge Base using the persistent ChromaDB client and collection.

In [22]:
query = "What are the methods of predict desicion of echr cases?"



In [23]:
retrieved_documents=retrieveDocs(chroma_collection, query, 10)
show_results(retrieved_documents)

------- retreived documents -------

Document 1:
	Document Text: 


> on Test Set The models could still be used to prioritise cases by identifying which cases are more likely to lead to violations. The heuristic does not provide any bene [UNK] in terms of prioritising cases. As the predictions for each Article would be the same, all complaints would be given the same priority. In this sense, the models may be more useful. As discussed above, the tendency to have a high precision means there are relatively few false positives. This means the cases identi [UNK] as violations and subsequently prioritised, will tend to be violations. The downside is

	Document Source: 1912.10819v1 (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 0.6878165006637573
Document 2:
	Document Text: 


> behavior conforms to the legal realists [UNK] theorization ( Leiter, 2007 ), according to which judges primarily decide cases by responding to the stimulus of the facts of the case. We define the problem of the ECtHR case prediction as a binary classification task. We utilise textual features, i. e., N - grams and topics, to train Support Vector Machine ( SVM ) classifiers ( Vapnik, 1998 ). We apply a linear kernel function that facilitates the interpretation of models in a straightforward manner. Our models can reliably predict ECtH

	Document Source: peerj-cs-93 (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 0.6883445978164673
Document 3:
	Document Text: 


> 5Rules of ECtHR, http : / / www. echr. coe. int / Documents / Rules _ Court _ ENG. pdf. Main premise Our main premise is that published judgments can be used to test the possibility of a text - based analysis for ex ante predictions of outcomes on the assumption that there is enough similarity between ( at least ) certain chunks of the text of published judgments and applications lodged with the Court and / or briefs submitted by parties with respect to pending cases. Predictive tasks were based on the text of published judgments rather

	Document Source: peerj-cs-93 (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 0.692875862121582
Document 4:
	Document Text: 


> cases. We submit, though, that full acceptance of that reasonable assumption necessitates more empirical corroboration. Be that as it may, our more general aim is to work under this assumption, thus placing our work within the larger context of ongoing empirical research in the theory of adjudication about the determinants of judicial decision - making. Accordingly, in the discussion we highlight ways in which automatically predicting the outcomes of ECtHR cases could potentially provide insights on whether judges follow a so - called legal model ( Grey, 1983 ) of decision making or their

	Document Source: peerj-cs-93 (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 0.6982980966567993
Document 5:
	Document Text: 


> and predict the outcomes of judicial decisions ( Lawlor, 1963 ). According to Lawlor, reliable prediction of the activity of judges would depend on a scientific understanding of the ways

	Document Source: peerj-cs-93 (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 0.7002371549606323
Document 6:
	Document Text: 


> Preoţiuc - Pietro, Lampos & Aletras, 2015 ; Preoţiuc - Pietro et al., 2015 ) and also provides a more concise semantic representation. [UNK] model The problem of predicting the decisions of the ECtHR is defined as a binary classification task. Our goal is to predict if, in the context of a particular case, there is a violation or non - violation in relation to a specific Article of the Convention. For that purpose, we use each set of textual features, i. e., N - grams and topics, to train Support Ve

	Document Source: peerj-cs-93 (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 0.7054527997970581
Document 7:
	Document Text: 


> particularity. Boston University Law Review 78 : 773. Segal JA. 1984. Predicting Supreme Court cases probabilistically : the search and seizure cases, 1962 [UNK] 1981. American Political Science Review 78 ( 04 ) : 891 [UNK] 900 DOI 10. 2307 / 1955796. Segal JA, Spaeth HJ. 2002. The Supreme Court and the attitudinal model revisited. Cambridge : Cambridge University Press. Aletras etal ( 2016 ), PeerJ Comput. Sci., DOI 10. 7717 / peerj - cs. 93 18 / 19

	Document Source: peerj-cs-93 (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 0.7078721523284912
Document 8:
	Document Text: 


> Such a model could also be used to prioritise cases. That is cases which indicate a high likelihood of violation can be prioritised. 2 Related Work Table 1 shows the results of studies that looked at predicting the outcome of legal cases. The [UNK] Court [UNK] column gives the legal court considered by the study. The majority of the studies looked at either the ECHR or the Supreme Court of the United States ( SCOTUS ). The [UNK] Data [UNK] column in Table 1 givesthe type of data used in the study. [UNK] Case documents [UNK] refers to text documents that outline the cases heard by the Court

	Document Source: 1912.10819v1 (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 0.7101020812988281
Document 9:
	Document Text: 


> ##ly robust manner patterns of fact scenarios that correspond to well - established trends in the Court [UNK] s case law. Aletras etal ( 2016 ), PeerJ Comput. Sci., DOI 10. 7717 / peerj - cs. 93 12 / 19

	Document Source: peerj-cs-93 (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 0.7140823602676392
Document 10:
	Document Text: 


> number of cases. In the Court [UNK] s own words :

	Document Source: 1912.10819v1 (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 0.7207286357879639


## 8 Observe the ChromaDB saved to the provided path

List the folders and files in the chromaDB_path

In [24]:
!ls "{chromaDB_path}"

71f158d0-357e-4e73-8695-f67c010fc8e4  chroma.sqlite3


## YES WE DID IT!

# LOAD A KNOWLEDGE BASE FROM A PERSISTENT CHROMADB



Let's kill the kernel to ensure that nothing from the ChromaDB instance above remains in memory.

In [25]:
from google.colab import runtime
# Disconnect from the runtime
#!kill -9 -1

## 1 Connect to the source directory

First get connected to the ChromaDB directory

In [26]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [27]:
# change directory to chromaDB folder
chromaDB_path = '/content/drive/MyDrive/Colab Notebooks/ChromaDBData'
%cd {chromaDB_path}


/content/drive/MyDrive/Colab Notebooks/ChromaDBData


### Check whether chromaDB_path exists and, if so, whether it contains ChromaDB files and folders

In [28]:
import os
if os.path.exists(chromaDB_path):
    print(f"The directory '{chromaDB_path}' exists.")

    # Check if the directory contains ChromaDB files and folders
    chromadb_files_and_folders = os.listdir(chromaDB_path)
    if any(file_or_folder.startswith('chroma') for file_or_folder in chromadb_files_and_folders):
        print("The directory contains ChromaDB files and folders.")
    else:
        print("The directory does not contain ChromaDB files and folders.")
else:
    print(f"The directory '{chromaDB_path}' does not exist.")


The directory '/content/drive/MyDrive/Colab Notebooks/ChromaDBData' exists.
The directory contains ChromaDB files and folders.
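
The prefix check above accepts any entry whose name starts with `chroma`. A slightly stricter variant is sketched below: it looks for the `chroma.sqlite3` file that appears in the directory listing earlier in this notebook. Treating that file name as the marker of a ChromaDB directory is an assumption based on that listing, and `has_chroma_sqlite` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path

def has_chroma_sqlite(db_path):
    """Return True if db_path contains a file named chroma.sqlite3."""
    return (Path(db_path) / "chroma.sqlite3").is_file()

# Demonstrate on a throwaway directory instead of the real Drive folder.
with tempfile.TemporaryDirectory() as tmp:
    empty_result = has_chroma_sqlite(tmp)
    (Path(tmp) / "chroma.sqlite3").touch()
    populated_result = has_chroma_sqlite(tmp)

print(empty_result, populated_result)
```

In the notebook, the same check would be called as `has_chroma_sqlite(chromaDB_path)`.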


## 2 Install required libraries

Second, install the required libraries and define the helper functions.

In [29]:
%pip install chromadb --quiet
%pip install sentence_transformers --quiet

In [30]:
from chromadb.config import DEFAULT_TENANT, DEFAULT_DATABASE, Settings
from chromadb import Client, PersistentClient
from chromadb.utils import embedding_functions


In [31]:
import textwrap
from IPython.display import display
from IPython.display import Markdown
def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

In [32]:
def retrieveDocs(chroma_collection, query, n_results=5,
                 return_only_docs=False, filterType=None, filterValue=None):
    if filterType is not None and filterValue is not None:
        results = chroma_collection.query(
            query_texts=[query],
            include=["documents", "metadatas", "distances"],
            where={filterType: filterValue},
            n_results=n_results)

    else:
        results = chroma_collection.query(
            query_texts=[query],
            include= [ "documents","metadatas",'distances' ],
            n_results=n_results)

    if return_only_docs:
        return results['documents'][0]
    else:
        return results
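
The two branches above differ only in the `where` argument. One way to avoid repeating the call, sketched here under the assumption that the collection's `query` method accepts these keyword arguments exactly as used above, is to build the arguments as a dictionary and add the filter only when one is given. `build_query_kwargs` is a hypothetical helper name.

```python
def build_query_kwargs(query, n_results=5, filterType=None, filterValue=None):
    """Assemble keyword arguments for a query, adding `where` only with a filter."""
    kwargs = {
        "query_texts": [query],
        "include": ["documents", "metadatas", "distances"],
        "n_results": n_results,
    }
    if filterType is not None and filterValue is not None:
        # Metadata filter: keep only chunks whose metadata field equals the value.
        kwargs["where"] = {filterType: filterValue}
    return kwargs

unfiltered = build_query_kwargs("What is Target Coverage?")
filtered = build_query_kwargs(
    "What is Target Coverage?",
    n_results=10,
    filterType="category",
    filterValue="Journal Paper",
)
print(filtered)
```

The result could then be passed on as `chroma_collection.query(**filtered)`.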

In [33]:
def show_results(results, return_only_docs=False):

  if return_only_docs:
    retrieved_documents = results
    if len(retrieved_documents) == 0:
      print("No results found.")
      return
    for i, doc in enumerate(retrieved_documents):
      print(f"Document {i+1}:")
      print("\tDocument Text: ")
      display(to_markdown(doc));
  else:

      retrieved_documents = results['documents'][0]
      if len(retrieved_documents) == 0:
          print("No results found.")
          return
      retrieved_documents_metadata = results['metadatas'][0]
      retrieved_documents_distances = results['distances'][0]
      print("------- retreived documents -------\n")

      for i, doc in enumerate(retrieved_documents):
          print(f"Document {i+1}:")
          print("\tDocument Text: ")
          display(to_markdown(doc));
          print(f"\tDocument Source: {retrieved_documents_metadata[i]['document']}")
          print(f"\tDocument Source Type: {retrieved_documents_metadata[i]['category']}")
          print(f"\tDocument Distance: {retrieved_documents_distances[i]}")


## 3 Initializing

Now we can load the persistent ChromaDB from its location by initializing
*  the ChromaDB client
*  the ChromaDB collection

In [34]:
chroma_client = PersistentClient(path=chromaDB_path,
                                     settings=Settings(),
                                     tenant=DEFAULT_TENANT,
                                     database=DEFAULT_DATABASE,)

In [35]:
chroma_client.list_collections()

[Collection(name=Papers)]

In [36]:
collection_name = "Papers"
sentence_transformer_model="distiluse-base-multilingual-cased-v1"
embedding_function= embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=sentence_transformer_model)


In [37]:
chroma_collection = chroma_client.get_or_create_collection(
      collection_name,
      embedding_function=embedding_function)

In [38]:
# Verify collection properties
print(f"Collection name: {chroma_collection.name}")  # Access the name attribute directly
print(f"Number of documents in collection: {chroma_collection.count()}")

# List all collections in the client
print("All collections in ChromaDB client:")
for collection in chroma_client.list_collections():
    print(collection.name)

Collection name: Papers
Number of documents in collection: 199
All collections in ChromaDB client:
Papers


## 4 Test

Test the loaded ChromaDB client and the collection.

In [39]:
chroma_collection.get(['0'])

{'ids': ['0'],
 'embeddings': None,
 'documents': ['Conor O [UNK] Sullivan and Joeran Beel. [UNK] Predicting the Outcome of Judicial Decisions made by the European Court of Human Rights. [UNK] In 27th AIAI Irish Conference on Artificial Intelligence and Cognitive Science, 2019. Predicting the Outcome of Judicial Decisions made by the European Court of Human Rights Conor O [UNK] Sullivan, Joeran Beel School of Computer Science and Statistics, Trinity College, Ireland ADAPT Centre osullc43©tcd. ie, beelj©tcd. ie Abstract. In this study, machine learning models were constructed'],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': [{'category': 'Journal Paper',
   'document': '1912.10819v1 (2).pdf'}]}


In [40]:
query = "What is Target Coverage?"

In [41]:
retrieved_documents=retrieveDocs(chroma_collection, query, 10)
show_results(retrieved_documents)

------- retreived documents -------

Document 1:
	Document Text: 


> . The majority of the studies that looked at SCOTUS decisionsused [UNK] Summary Information [UNK] [ 22 ] [ 10 ] [ 11 ] [ 12 ]. Thesearevariablesthat summarise the cases. Additionally, in Table 1 the [UNK] Target Variable [UNK] is usually a simpli [UNK] of potential case outcomes. For each study, the [UNK] Algorithm [UNK] gives the algorithm that achieved the highest accuracy when predicting the target variable. The [UNK] Train. Ace. [UNK] and [UNK] Test Acc. [UNK] give the training and test accuracy, respectfully, achieved by the study. CForall the studies, the

	Document Source: 1912.10819v1 (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 0.8299545049667358
Document 2:
	Document Text: 


> task where the input of our classifiers is the textual content extracted from a case and the target output is the actual judgment as to whether there has been a violation of an article of the convention of human rights. Textual information is represented using contiguous word sequences, i. e., N - grams, and topics. Our models can predict the court [UNK] s decisions with a strong accuracy ( 79 % on average ). Our empirical analysis indicates that the formal facts of a case are the most important predictive factor. This is consistent with the theory of legal realism suggesting that judicial decision -

	Document Source: peerj-cs-93 (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 0.857610821723938
Document 3:
	Document Text: 


> Table 1. Summary of Previous Works Target. Train. Test Author Court Data Variable Algorlthm Acc. Acc. [ 3 ] ECHR Case [UNK] damn [UNK] SVM 801 [UNK] 7 NADocuments Non [UNK] Violation ' 0 Case Violation, [ 15 ] ECHR Documents Non [UNK] Violation SVM 795 % NA Case Violation, [ 18 ] ECHR Documents Non [UNK] Violation SVM 750 % 740 % Summary Af [UNK], Decision [ 17 ] SCOTUS Information Reversed Tree NA 75 % Justice. Sumamry Decsion : Stocha

	Document Source: 1912.10819v1 (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 0.8728786706924438
Document 4:
	Document Text: 


> Preoţiuc - Pietro, Lampos & Aletras, 2015 ; Preoţiuc - Pietro et al., 2015 ) and also provides a more concise semantic representation. [UNK] model The problem of predicting the decisions of the ECtHR is defined as a binary classification task. Our goal is to predict if, in the context of a particular case, there is a violation or non - violation in relation to a specific Article of the Convention. For that purpose, we use each set of textual features, i. e., N - grams and topics, to train Support Ve

	Document Source: peerj-cs-93 (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 0.8749464750289917
Document 5:
	Document Text: 


> way, the representation of the Full case is computed by taking the mean vector of all of its sub - parts.   * Topics : We create topics for each article by clustering together N - grams that are semantically similar by leveraging the distributional hypothesis suggesting that similar words appear in similar contexts. We thus use the C feature matrix ( see above ), which is a distributional representation ( Turney & Pantel, 2010 ) of the N - grams given the case as the context ; each column vector of the matrix represents an N - gram. Using this vector representation of words, we comput

	Document Source: peerj-cs-93 (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 0.8913159966468811
Document 6:
	Document Text: 


> than lodged applications or briefs simply because we did not have access to the relevant data set. We thus used published judgments as proxies for the material to which we do not have access. This point should be borne in mind when approaching our results. At the very least, our work can be read in the following hypothetical way : if there is enough similarity between the chunks of text of published judgments that we analyzed and that of lodged applications and briefs, then our approach can be fruitfully used to predict outcomes with these other kinds of texts. Case structure The judgments

	Document Source: peerj-cs-93 (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 0.9108981490135193
Document 7:
	Document Text: 


> a single topic. A representation of a cluster is derived by looking at the most frequent N - grams it contains. The main advantages of using topics ( sets of N - grams ) instead of single N - grams is that it reduces the dimensionality of the feature space, which is essential for feature selection, it limits overfitting to training data ( Lampos et al., 2014 ;

	Document Source: peerj-cs-93 (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 0.9155511856079102
Document 8:
	Document Text: 


> crucial clarifications and caveats should be stressed. To begin with, the text of the [UNK] Circumstances [UNK] subsection has been formulated by the Court itself. As a result, it should not always be understood as a neutral mirroring of the factual background of the case. The choices made by the Court when it comes to formulations of the facts incorporate implicit or explicit judgments to the effect that some facts are more Aletras etal ( 2016 ), PeerJ Comput. Sci., DOI 10. 7717 / peerj - cs. 93 4 / 19

	Document Source: peerj-cs-93 (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 0.92230224609375
Document 9:
	Document Text: 


> ##section, with all the caveats we have already voiced, is a ( crude ) proxy for non - legal facts and the [UNK] Law [UNK] subsection is a ( crude ) proxy for legal reasons and arguments, the predictive superiority of the [UNK] Circumstances [UNK] subsection seems to cohere with extant legal realist treatments of judicial decision - making.

	Document Source: peerj-cs-93 (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 0.9234130382537842
Document 10:
	Document Text: 


> dings and [UNK] No. Tokens [UNK] are the number of lower case words that make up the corpus. The [UNK] vocabulary size [UNK] is the number of words that have vector repre [UNK] sentations for that embedding. For each embedding, a 100 dimension and 200 dimension version are used. The GloVe embeddings were trained by [ 20 ] and the law2vec embeddings were trained by [ 4 ]. The echr2vec embeddings were trained speci [UNK] for this paper.

	Document Source: 1912.10819v1 (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 0.9250677227973938
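
All ten distances above lie between roughly 0.83 and 0.93, which suggests that none of the chunks is a close match for the query. One possible way to keep such weak matches away from the LLM is to drop results above a distance threshold. The following is a minimal sketch, assuming a result dictionary shaped like the output of `chroma_collection.query` for a single query text. The helper name `filter_by_distance`, the sample data, and the threshold of 0.8 are illustrative assumptions; a sensible threshold depends on the embedding model and the distance metric of the collection.

```python
def filter_by_distance(results, max_distance=0.8):
    """Keep only the (document, distance) pairs whose distance is at most max_distance.

    `results` is assumed to look like a ChromaDB query result for one query text:
    {'documents': [[...]], 'distances': [[...]]}.
    """
    documents = results["documents"][0]
    distances = results["distances"][0]
    return [(doc, dist) for doc, dist in zip(documents, distances) if dist <= max_distance]


# Hypothetical result for one query (made-up texts and distances)
example_results = {
    "documents": [["chunk about courts", "chunk about embeddings", "chunk about topics"]],
    "distances": [[0.42, 0.83, 0.91]],
}
kept = filter_by_distance(example_results, max_distance=0.8)
print(kept)
```

If the filtered list is empty, the pipeline could skip the LLM call and report that the knowledge base has no relevant content.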


## YES! WE DID IT!

# CONNECT TO AN LLM: GOOGLE GEMINI

## 1 Install & Import Libraries

In [42]:
!pip install -U google-genai



In [68]:
%pip install -q -U google-generativeai


In [69]:
import os
import textwrap
import google.generativeai as genai
from IPython.display import display
from IPython.display import Markdown


All support for the `google.generativeai` package has ended. It will no longer be receiving 
updates or bug fixes. Please switch to the `google.genai` package as soon as possible.
See README for more details:

https://github.com/google-gemini/deprecated-generative-ai-python/blob/main/README.md

  loader.exec_module(module)


## 2 Define Helper Functions

In [70]:
def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

In [92]:
def build_chatBot(system_instruction):
    model = genai.GenerativeModel(
        model_name='gemini-2.5-flash',
        system_instruction=system_instruction
    )
    chat = model.start_chat(history=[])
    return chat

In [93]:
def generate_LLM_answer(prompt, context, chat):
  response = chat.send_message( prompt + context)
  return response.text
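
The import cell above warns that `google.generativeai` is no longer supported. The following is a rough sketch of how the two helpers might look with the newer `google-genai` package instead. The function names `build_chat_bot_genai` and `generate_llm_answer_genai` are hypothetical, the API key is passed in explicitly, and a newline is placed between the question and the context; none of this is the notebook's own code.

```python
def build_chat_bot_genai(system_instruction, api_key, model_name="gemini-2.5-flash"):
    """Create a chat session with the `google-genai` package (sketch)."""
    # Imported inside the function so that defining it does not require the package
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=api_key)
    chat = client.chats.create(
        model=model_name,
        config=types.GenerateContentConfig(system_instruction=system_instruction),
    )
    return chat


def generate_llm_answer_genai(prompt, context, chat):
    """Send the question followed by the context and return the reply text."""
    response = chat.send_message(prompt + "\n" + context)
    return response.text
```

As far as I know, the chat object in this package exposes its messages through `chat.get_history()` rather than a mutable `history` list, so the `history.clear()` calls used later would need a different approach, for example creating a fresh chat.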

## 3 Connect to the LLM via the Chat API

In [94]:
# Used to securely store your API key
from google.colab import userdata
# Or use `os.getenv('GOOGLE_API_KEY')` to fetch an environment variable.
GOOGLE_API_KEY=userdata.get('geminiapikey')
# genai.configure() sets the API key globally and returns None, so there is nothing to assign
genai.configure(api_key=GOOGLE_API_KEY)

In [95]:
system_prompt= """ You are an attentive and supportive academic assistant.
Your role is to provide assistance based solely on the provided context.

Here’s how we’ll proceed:
1. I will provide you with a question and related text excerpt.
2. Your task is to answer the question using only the provided partial texts.
3. If the answer isn’t explicitly found within the given context,
respond with 'I don't know'.
4. After each response, please provide a detailed explanation.
Break down your answer step by step and relate it directly to the provided context.
5. Sometimes, I will ask questions about the chat session, such as asking you to
summarize the chat or list the questions. For these kinds of questions, do not try
to use the provided partial texts.
6. Generate the answer in the same language as the given question.

If you're ready, I'll provide you with the question and the context.
"""

In [96]:
RAG_LLM = build_chatBot(system_prompt)

## 4 Test the LLM connection

In [97]:
prompt="What is FC?"
context= """FC lets developers create a description
of a F in their code, then pass that description to a language
model in a request.

The response from the model includes the name of
a F that matches the description and the arguments to call it with.
FC lets you use F as tools in generative AI applications,
and you can define more than one F within a single request.
"""

In [98]:
response=generate_LLM_answer(prompt, context,RAG_LLM)
to_markdown(response)

> FC lets developers create a description of a F in their code, then pass that description to a language model in a request.
> 
> **Explanation:**
> 1.  **Identify the key term:** The question asks "What is FC?".
> 2.  **Locate the term in the text:** The text begins with "FC lets developers create a description...".
> 3.  **Extract the definition:** The sentence directly describes what FC is and its primary function. It states that "FC lets developers create a description of a F in their code, then pass that description to a language model in a request."

In [99]:
RAG_LLM.history

[parts {
   text: "What is FC?FC lets developers create a description\nof a F in their code, then pass that description to a language\nmodel in a request.\n\nThe response from the model includes the name of\na F that matches the description and the arguments to call it with.\nFC lets you use F as tools in generative AI applications,\nand you can define more than one F within a single request.\n"
 }
 role: "user",
 parts {
   text: "FC lets developers create a description of a F in their code, then pass that description to a language model in a request.\n\n**Explanation:**\n1.  **Identify the key term:** The question asks \"What is FC?\".\n2.  **Locate the term in the text:** The text begins with \"FC lets developers create a description...\".\n3.  **Extract the definition:** The sentence directly describes what FC is and its primary function. It states that \"FC lets developers create a description of a F in their code, then pass that description to a language model in a request.\""
 }

In [100]:
RAG_LLM.history.clear()
RAG_LLM.history

[]

# CREATE THE RAG PIPELINE FOR THE EXISTING KNOWLEDGE BASE

## 1 A simple RAG Pipeline

* prepare a summary for the Knowledge Base
* get the query from the user
* query the Knowledge Base
* get the related chunks from the Knowledge Base
* combine the query + context from Knowledge Base
* submit the prompt (query + context) to the LLM
* get the response from the LLM
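
The steps above can be sketched as a single function. This is a minimal illustration, assuming the retriever and the LLM are passed in as plain callables so that the flow can be followed without ChromaDB or Gemini. The names `run_rag_pipeline`, `fake_retrieve`, and `fake_llm` are hypothetical stand-ins.

```python
def run_rag_pipeline(query, retrieve, ask_llm, n_results=5):
    """Run one RAG round: retrieve chunks, build the prompt, ask the LLM.

    `retrieve(query, n_results)` must return a list of text chunks and
    `ask_llm(prompt)` must return the answer text.
    """
    # Query the knowledge base and get the related chunks
    chunks = retrieve(query, n_results)
    # Combine the query with the context from the knowledge base
    prompt = "QUESTION: " + query + "\n EXCERPTS: " + "\n".join(chunks)
    # Submit the prompt to the LLM and return its response
    return ask_llm(prompt)


# Stand-ins for the real retriever and LLM (for illustration only)
def fake_retrieve(query, n_results):
    corpus = ["chunk A", "chunk B", "chunk C"]
    return corpus[:n_results]


def fake_llm(prompt):
    return f"received {len(prompt.splitlines())} prompt lines"


answer = run_rag_pipeline("What is ECHR?", fake_retrieve, fake_llm, n_results=2)
print(answer)
```

Passing the retriever and the LLM as arguments keeps the pipeline logic independent of any particular vector store or model.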

In [101]:
# Verify collection properties
print(f"Collection name: {chroma_collection.name}")  # Access the name attribute directly
print(f"Number of documents in collection: {chroma_collection.count()}")

# List all collections in the client
print("All collections in ChromaDB client:")
for collection in chroma_client.list_collections():
    print(collection.name)

Collection name: Papers
Number of documents in collection: 199
All collections in ChromaDB client:
Papers


In [102]:
def summarize_collection(chroma_collection):
  summary = []  # Collect the summary lines in a list
  print("Summarizing the collection...")
  # Report the basic collection properties
  chunk_count = chroma_collection.count()
  print(f"\t Collection name: {chroma_collection.name}")
  print(f"\t Number of document chunks in collection: {chunk_count}")
  summary.append(f"Collection name: {chroma_collection.name}")
  summary.append(f"Number of document chunks in collection: {chunk_count}")

  # Collect the distinct "document" metadata values across all chunks
  print("\t Distinct 'document' metadata in the collection:")
  # Fetch all metadata in one call rather than one call per chunk ID,
  # so the function does not rely on the IDs being '0', '1', ..., 'count-1'
  metadatas = chroma_collection.get(include=['metadatas'])['metadatas']
  # A set keeps each document name once; fall back to "Unknown" if the key is missing
  distinct_documents = {(metadata or {}).get("document", "Unknown") for metadata in metadatas}

  # Print and record all distinct document names
  summary.append("Documents:")
  for document_name in distinct_documents:
      print("\t ",document_name)
      summary.append(document_name)

  print("Collection summarization completed.")

  # Join the list elements into a single string
  return "\n ".join(summary)

In [103]:
s=summarize_collection(chroma_collection)

Summarizing the collection...
	 Collection name: Papers
	 Number of document chunks in collection: 199
	 Distinct 'document' metadata in the collection:
	  peerj-cs-93 (2).pdf
	  1912.10819v1 (2).pdf
Collection summarization completed.


In [104]:
print(s)

Collection name: Papers
 Number of document chunks in collection: 199
 Documents:
 peerj-cs-93 (2).pdf
 1912.10819v1 (2).pdf


In [105]:
def generateAnswer(RAG_LLM, chroma_collection,query,n_results=5, only_response=True):
    # Use the n_results argument instead of a hard-coded 10
    retrieved_documents= retrieveDocs(chroma_collection, query, n_results, return_only_docs=True)
    prompt = "QUESTION: "+ query
    context = "\n EXCERPTS: "+ "\n".join(retrieved_documents)
    if not only_response:
      print("------- retreived documents -------\n")
      for i, doc in enumerate(retrieved_documents):
        print(f"Document {i+1}:")
        print(f"\tDocument Text: {doc}")
      print("------- RAG answer -------\n")
    output = generate_LLM_answer(prompt, context, RAG_LLM)

    display(to_markdown(output))
    print('\n')
    return output

## 2 Test the RAG pipeline

In [107]:
queries =["Who are the authors suggested a new attention mechanism?",
          "Who are the authors suggested a new controllable text generation mechanism?",
          "Who is Murat Karakaya?",
          "Why do we need to control how the text is produced? ",
          "How can we use the self attention mechanism to control the text generation?",
          "Summarize the paper named Controllable Text Generation",
          "How many blocks are suggested in the transformer?",
          "What about decoder?"
    ]

In [108]:
reply=generateAnswer(RAG_LLM, chroma_collection, queries[2],10, only_response=False)

------- retreived documents -------

Document 1:
	Document Text: 1An amicus curiae ( friend of the court ) is a person or organisation that offers testimony before the Court in the context of a particular case without being a formal party to the proceedings. In this paper, our particular focus is on the automatic analysis of cases of the European Court of Human Rights ( ECtHR or Court ). The ECtHR is an international court that rules on individual or, much more rarely, State applications alleging violations by some State Party of the civil and political rights set out in the European Convention on Human Rights ( ECHR or Convention ). Our task is to pred
Document 2:
	Document Text: . nips. cc / paper / 5872 - efficient - and - robust - automated - machine - learning. pdf Guimera, R., Sales [UNK] Pardo, M. : Justice blocks and predictability of us supreme court votes. PloS one 6 ( 11 ), e27188 ( 2011 ) Katz, D. M., Bommarito II, M. J., Blackman, J. : A general approach for predicting the

> I don't know.
> 
> **Explanation:**
> The provided text excerpts discuss topics related to the European Court of Human Rights, machine learning applications in legal prediction, legal theories like formalism and realism, and various research papers and authors. I have carefully reviewed all the provided text, including the body of the text and the bibliographic references. The name "Murat Karakaya" is not mentioned or identified in any part of the given context.





## 3 A simple loop for the User Interaction

In [109]:
summarize_collection(chroma_collection)
RAG_LLM.history.clear()
while True:
  question = input("Please enter your question, or type 'bye' to exit: ")
  if question == "bye":
    print("Thank you for using the service. Goodbye!")
    break
  else:
    generateAnswer(RAG_LLM, chroma_collection, question)




Summarizing the collection...
	 Collection name: Papers
	 Number of document chunks in collection: 199
	 Distinct 'document' metadata in the collection:
	  peerj-cs-93 (2).pdf
	  1912.10819v1 (2).pdf
Collection summarization completed.
Please enter your question, or type 'bye' to exit: what is echr?


> The ECHR is an international treaty established for the protection of civil and political liberties in European democracies that are committed to the rule of law. It was initially drafted in 1950 by the ten states that had created the Council of Europe the previous year and entered into force in 1953. Membership in the Council of Europe requires becoming a party to the Convention.
> 
> **Explanation:**
> 1.  **"The ECHR is an international treaty..."**: This directly states what ECHR is, as found in the first provided excerpt.
> 2.  **"...for the protection of civil and political liberties in European democracies committed to the rule of law."**: This explains the purpose and scope of the ECHR, also directly from the first excerpt.
> 3.  **"It was initially drafted in 1950 by the ten states which had created the Council of Europe in the previous year and entered into force in 1953."**: This provides information about its origin and when it became active, directly sourced from the first excerpt.
> 4.  **"Membership in the Council entails becoming party to the Convention..."**: This details the relationship between the ECHR and the Council of Europe, again from the first excerpt.



Please enter your question, or type 'bye' to exit: echr nedir?


> ECHR (Avrupa İnsan Hakları Sözleşmesi), hukukun üstünlüğüne bağlı Avrupa demokrasilerinde sivil ve siyasi özgürlüklerin korunmasına yönelik uluslararası bir antlaşmadır. Antlaşma, başlangıçta 1950 yılında, bir önceki yıl Avrupa Konseyi'ni kuran on devlet tarafından hazırlanmıştır ve 1953 yılında yürürlüğe girmiştir. Avrupa Konseyi'ne üyelik, bu Sözleşme'ye taraf olmayı gerektirir. ECHR ayrıca, Avrupa İnsan Hakları Sözleşmesi'nin potansiyel ihlallerini inceleyen uluslararası bir mahkeme olan Avrupa İnsan Hakları Mahkemesi'ni de ifade edebilir.
> 
> **Açıklama:**
> 1.  **"The ECHR is an international treaty for the protection of civil and political liberties in European democracies committed to the rule of law."** (ECHR, hukukun üstünlüğüne bağlı Avrupa demokrasilerinde sivil ve siyasi özgürlüklerin korunmasına yönelik uluslararası bir antlaşmadır.) ifadesi, ECHR'nin ne olduğunu ve amacını ilk paragraftan alıntılar.
> 2.  **"The treaty was initially drafted in 1950 by the ten states which had created the Council of Europe in the previous year. The Convention itself entered into force in 1953."** (Antlaşma, başlangıçta 1950 yılında, bir önceki yıl Avrupa Konseyi'ni kuran on devlet tarafından hazırlanmıştır ve 1953 yılında yürürlüğe girmiştir.) bilgisi, antlaşmanın kökeni ve yürürlüğe giriş tarihini yine ilk paragraftan sağlar.
> 3.  **"Membership in the Council entails becoming party to the Convention..."** (Avrupa Konseyi'ne üyelik, bu Sözleşme'ye taraf olmayı gerektirir.) ifadesi, Konsey üyeliği ile Sözleşme arasındaki ilişkiyi ilk paragraftan belirtir.
> 4.  **"The European Court of Human Rights ( ECHR ) is an international court that examines potential breaches of the European Convention on Human Rights."** (Avrupa İnsan Hakları Mahkemesi (AİHM - ECHR), Avrupa İnsan Hakları Sözleşmesi'nin potansiyel ihlallerini inceleyen uluslararası bir mahkemedir.) bilgisi ise, ECHR'nin aynı zamanda bir mahkeme olarak da anıldığı ve görevi hakkında ek bir tanımı metinlerden sağlamıştır.



Please enter your question, or type 'bye' to exit: bye
Thank you for using the service. Goodbye!


## 4 A Gradio Interface

In [110]:
%pip install gradio
import gradio as gr




In [None]:
RAG_LLM.history.clear()

# Wrap generateAnswer so that Gradio can call it with the question only
def generateAnswerInterFace(question):
    return generateAnswer(RAG_LLM, chroma_collection, question)

# Build the info text from the collection summary (summarize_collection returns a string)
def get_info_text():
    return "INFO: " + summarize_collection(chroma_collection)

# Use gr.Blocks instead of gr.Interface
with gr.Blocks() as demo:
    # Define interface components
    query_txt = gr.Textbox(label="Enter your question here:", placeholder="Type your question")
    answer_txt = gr.Textbox(label="Answer:", placeholder="Answer will be displayed here")

    # Create a button to trigger the prediction
    btn = gr.Button("Generate Answer")

    # Define the prediction function (order changed for button placement)
    def predict(question):
        answer = generateAnswerInterFace(question)
        return answer

    info_txt = gr.Textbox(get_info_text(), label="Info")  # Add info textbox after button

    # Connect button click to prediction function
    btn.click(predict, inputs=query_txt, outputs=answer_txt)

# Launch the interface
demo.launch(debug=True)


Summarizing the collection...
	 Collection name: Papers
	 Number of document chunks in collection: 199
	 Distinct 'document' metadata in the collection:
	  peerj-cs-93 (2).pdf
	  1912.10819v1 (2).pdf
Collection summarization completed.
It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://e9dbff865d28cc78e7.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


In [None]:
RAG_LLM.history

# SUMMARY

## WHY DO WE NEED A PERSISTENT CHROMADB?

In the context of a Retrieval-Augmented Generation (RAG) approach, saving and loading a persistent ChromaDB is particularly important for several reasons:

* **Enhanced Data Durability:**
  * **Importance:** Ensures the retrieval database used for augmenting generative models is not lost between sessions or system restarts.
  * **RAG Relevance:** Maintains a consistent and reliable knowledge base that the generative model can reference, leading to more accurate and relevant responses.

* **Operational Continuity:**
  * **Importance:** Allows seamless continuation of operations without needing to re-index or re-import data, saving time and computational resources.
  * **RAG Relevance:** Ensures that the generative model has continuous access to the same set of documents, which is essential for generating consistent and coherent responses over time.

* **Facilitating Collaboration:**
  * **Importance:** Enables multiple users or systems to share and access the same dataset.
  * **RAG Relevance:** Supports collaborative development and usage of the RAG system, allowing different teams to work on improving the retrieval and generation processes simultaneously.

* **Scalability:**
  * **Importance:** Provides a stable and persistent backend, enabling efficient handling of large datasets.
  * **RAG Relevance:** Essential for scaling the RAG system to handle more extensive and diverse knowledge bases, ensuring that the system can manage increased loads and deliver prompt, relevant information.

In a RAG system, the retriever (like ChromaDB) provides the generative model with relevant context from a knowledge base to generate informed and accurate responses. Persistent storage ensures that this knowledge base is durable, continuously available, and scalable, which is critical for the reliability, consistency, and performance of the RAG system.

.

.

.

In [None]:
def generateAnswer(RAG_LLM, chroma_collection,query,n_results=5):
    # Use the n_results argument instead of a hard-coded 10
    retrieved_documents= retrieveDocs(chroma_collection, query, n_results, return_only_docs=True)

    print("------- retreived documents -------\n")
    for i, doc in enumerate(retrieved_documents):
      print(f"Document {i+1}:")
      print(f"\tDocument Text: {doc}")
    prompt = "QUESTION: "+ query
    context = "\n EXCERPTS: "+ "\n".join(retrieved_documents)
    print("------- RAG answer -------\n")
    output = generate_LLM_answer(prompt, context, RAG_LLM)

    display(to_markdown(output))
    print('\n')
    return output

In [None]:
# generateAnswer returns the answer as a plain string, so there is no .text attribute
to_markdown(reply)

In [None]:
for message in chat.history:
  display(to_markdown(f'**{message.role}**: {message.parts[0].text}'))


In [None]:
model.count_tokens(chat.history)

In [None]:
response = chat.send_message(prompt)
to_markdown(response.text)


In [None]:
import os
import openai
from openai import OpenAI

'''
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']
openai_client = OpenAI()
'''
openai_client = OpenAI(api_key=userdata.get('OPENAI_API_KEY'))

In [None]:
def rag(query, retrieved_documents, model="gpt-3.5-turbo-1106"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            "content": "As an attentive and supportive academic assistant, "
            "your task is to provide assistance based solely on the provided"
            " excerpts. Answer the following questions, ensuring your responses"
            " are derived exclusively from the provided partial texts. "
            "If the answer cannot be found within the provided excerpts, "
            "kindly respond with 'I don't know'. "
            "After answering each question, please provide a detailed "
            "explanation, breaking down the answer step by step and relating "
            "it to the provided excerpts. "
            "Return your response as a JSON object with two key fields: "
            "'Answer', which should contain the value of the answer, and "
            "'Reason', which should provide an explanation of why this answer "
            "was generated."

        },
        {"role": "user", "content": f"Question: {query}. \n Excerpts: {information}"}
    ]

    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content

In [None]:
def generateAnswer(query,n_results=5):
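    # Call retrieveDocs with the same arguments used elsewhere in this notebook
    retrieved_documents=retrieveDocs(chroma_collection, query, n_results, return_only_docs=True)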

    print("------- retreived documents -------\n")
    for i, doc in enumerate(retrieved_documents):
      print(f"Document {i+1}:")
      print(f"\tDocument Text: {doc}")
    print("------- RAG answer -------\n")
    output = rag(query=query, retrieved_documents=retrieved_documents)
    print(output)
    print('\n')
    return output

In [None]:
reply=generateAnswer(queries[5],10)


In [None]:
# Convert the JSON-formatted 'reply' string to a dict

import json
reply_dict = json.loads(reply)
print(f"Answer: {reply_dict['Answer']}")
print(f"Because: {reply_dict['Reason']}")
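
The system prompt asks the model for a JSON object, but a model may still wrap its reply in a Markdown code fence or add surrounding text. The following is a minimal sketch of a more tolerant parser built on the standard `json` module. The helper name `parse_json_reply` and the sample reply are hypothetical, not output recorded from the model.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_json_reply(reply):
    """Parse a model reply that should contain a JSON object.

    Strips an optional Markdown code fence, then falls back to the outermost
    {...} span if the remaining text is not valid JSON on its own.
    """
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (possibly with a language tag) and the closing fence
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(text[start:end + 1])


# Made-up reply wrapped in a code fence
fenced = FENCE + 'json\n{"Answer": "I don\'t know", "Reason": "Not in the excerpts."}\n' + FENCE
print(parse_json_reply(fenced)["Answer"])
```

A reply that contains no JSON object at all still raises `json.JSONDecodeError`, so the caller can decide how to handle it.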

In [None]:
for query in queries:
  generateAnswer(query)

In [None]:
%pip install umap-learn

In [None]:
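import numpy as np
from tqdm import tqdm

def project_embeddings(embeddings, umap_transform):
    # Project the embeddings to 2D one at a time so that tqdm can show progress
    umap_embeddings = np.empty((len(embeddings),2))
    for i, embedding in enumerate(tqdm(embeddings)):
        umap_embeddings[i] = umap_transform.transform([embedding])
    return umap_embeddings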

In [None]:
import umap.umap_ as umap

embeddings = chroma_collection.get(include=['embeddings'])['embeddings']
umap_transform = umap.UMAP(random_state=0, transform_seed=0).fit(embeddings)
projected_dataset_embeddings = project_embeddings(embeddings, umap_transform)

In [None]:
import matplotlib.pyplot as plt

plt.figure()
plt.scatter(projected_dataset_embeddings[:, 0], projected_dataset_embeddings[:, 1], s=10)
plt.gca().set_aspect('equal', 'datalim')
plt.title('Projected Embeddings')
plt.axis('off')

In [None]:
query = queries[3]

results = chroma_collection.query(query_texts=query, n_results=10, include=['documents', 'embeddings'])

retrieved_documents = results['documents'][0]

for document in results['documents'][0]:
    print(document)
    print('')


In [None]:
query_embedding = embedding_function([query])[0]
retrieved_embeddings = results['embeddings'][0]

projected_query_embedding = project_embeddings([query_embedding], umap_transform)
projected_retrieved_embeddings = project_embeddings(retrieved_embeddings, umap_transform)


In [None]:
# Plot the projected query and retrieved documents in the embedding space
plt.figure()
plt.scatter(projected_dataset_embeddings[:, 0], projected_dataset_embeddings[:, 1], s=10, color='gray')
plt.scatter(projected_query_embedding[:, 0], projected_query_embedding[:, 1], s=150, marker='X', color='r')
plt.scatter(projected_retrieved_embeddings[:, 0], projected_retrieved_embeddings[:, 1], s=100, facecolors='none', edgecolors='g')

plt.gca().set_aspect('equal', 'datalim')
plt.title(f'{query}')
plt.axis('off')

In [None]:
def augment_query_generated(query, model="gpt-3.5-turbo"):
    messages = [
        {
            "role": "system",
            "content": "You are an academic with expertise in artificial intelligence who reviews TÜBİTAK project applications. "
            "Generate an answer to the question given below that could fit the following project description: \n"
            "The overall aim of the project is to develop an artificial intelligence (AI) based platform to improve risk management operations "
            "in the banking sector and to address the challenges faced by financial institutions. The project aims to give banks the capacity to "
            "manage behavioral risks more effectively, such as the early withdrawal of time deposits, the early repayment of loans, and the "
            "identification of various deposit types. These risks can affect the balance sheet equilibrium of financial institutions and reduce "
            "operational efficiency. "
            "The core problem the project aims to solve is the banks' need to manage, correctly and effectively, the complex situations they face "
            "when carrying out profitability and risk analyses. In particular, situations such as the early closure of time deposits and the early "
            "repayment of loans can complicate the process of determining banks' future cash flows and risk profiles."

        },
        {"role": "user", "content": query}
    ]

    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content

In [None]:
original_query = queries[0]
hypothetical_answer = augment_query_generated(original_query)

joint_query = f"{original_query} {hypothetical_answer}"
print(joint_query)
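
The cell above appends a generated hypothetical answer to the original query before retrieval. As a rough intuition for why that can help, the sketch below uses a toy bag-of-words vector in place of a real embedding model. The helper names (`toy_embed`, `cosine`) and all strings are invented for this illustration and are not part of the notebook's pipeline; real embeddings behave differently, so treat this only as a picture of the idea.

```python
import math
from collections import Counter

def toy_embed(text):
    """Toy bag-of-words 'embedding': a word-count vector (not a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical strings, invented for this illustration only
toy_doc = "early withdrawal of term deposits affects bank cash flow forecasts"
toy_query = "what risks does the project address"
toy_answer = "the project addresses early withdrawal of term deposits"
toy_joint = f"{toy_query} {toy_answer}"

# The bare query shares no words with the document; the joint query does
sim_query = cosine(toy_embed(toy_query), toy_embed(toy_doc))
sim_joint = cosine(toy_embed(toy_joint), toy_embed(toy_doc))
print(f"query only: {sim_query:.3f}  query + answer: {sim_joint:.3f}")
```

The hypothetical answer contributes vocabulary that resembles the stored chunks, which is the effect the joint query relies on. Whether it improves retrieval on a real collection depends on the quality of the generated answer.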

In [None]:
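def extend_retrieved_documents(results, extension=4):
  # Relies on collection IDs being consecutive integers stored as strings
  # ("0", "1", ...), so id+1, id+2, ... are the chunks after a retrieved chunk.
  original_ids = results['ids'][0]
  print("original_ids: ", original_ids)

  extended_ids = set()
  for doc_id in original_ids:
    # Keep the hit itself, then add the next (extension - 1) chunks
    extended_ids.add(int(doc_id))
    for offset in range(1, extension):
      extended_ids.add(int(doc_id) + offset)

  # Sort the IDs and drop any that fall past the end of the collection
  total = chroma_collection.count()
  extended_ids = [str(x) for x in sorted(extended_ids) if x < total]
  print("extended_ids: ", extended_ids)
  return chroma_collection.get(ids=extended_ids)['documents']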
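
The ID arithmetic in the function above can be checked without a Chroma collection. The sketch below isolates it in a standalone helper; `expand_ids` and its `total` parameter are hypothetical names introduced here, and it assumes, as the function does, that IDs are consecutive integers stored as strings.

```python
def expand_ids(original_ids, extension, total):
    """Return each hit ID plus the next (extension - 1) IDs, capped below total."""
    expanded = set()
    for doc_id in original_ids:
        start = int(doc_id)
        # max(..., 1) keeps the hit itself even when extension is 0
        expanded.update(range(start, start + max(extension, 1)))
    # Sort numerically, drop IDs past the end, and convert back to strings
    return [str(i) for i in sorted(expanded) if i < total]

# Two hits in a 9-chunk collection: "9" is dropped because it is out of range
demo_ids = expand_ids(["3", "7"], extension=3, total=9)
print(demo_ids)

# Overlapping windows are merged because the IDs are collected in a set
overlap_ids = expand_ids(["1", "2"], extension=3, total=10)
print(overlap_ids)
```

Collecting IDs in a set means neighbouring hits do not produce duplicate chunks, which keeps the context passed to the model shorter.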

In [None]:
def retrieveDocs_augmented_query(query, n_results=5, extension=4):
    hypothetical_answer = augment_query_generated(query)
    print("------ hypothetical_answer ---------\n")
    print(hypothetical_answer,"\n")
    print("------------------------------------\n")
    joint_query = f"{query} {hypothetical_answer}"
    results = chroma_collection.query(query_texts=joint_query, n_results=n_results, include=['documents', 'embeddings'])
    retrieved_documents = extend_retrieved_documents(results, extension)
    #retrieved_documents = results['documents'][0]

    return retrieved_documents



In [None]:
retrieved_documents=retrieveDocs_augmented_query(query, 5)

for doc in retrieved_documents:
    print(doc)
    print('')

In [None]:
results = chroma_collection.query(query_texts=joint_query, n_results=10, include=['documents', 'embeddings'])
retrieved_documents = results['documents'][0]

for doc in retrieved_documents:
    print(doc)
    print('')

In [None]:
retrieved_embeddings = results['embeddings'][0]
original_query_embedding = embedding_function([original_query])
augmented_query_embedding = embedding_function([joint_query])

projected_original_query_embedding = project_embeddings(original_query_embedding, umap_transform)
projected_augmented_query_embedding = project_embeddings(augmented_query_embedding, umap_transform)
projected_retrieved_embeddings = project_embeddings(retrieved_embeddings, umap_transform)

In [None]:
import matplotlib.pyplot as plt

# Plot the projected query and retrieved documents in the embedding space
plt.figure()
plt.scatter(projected_dataset_embeddings[:, 0], projected_dataset_embeddings[:, 1], s=10, color='gray')
plt.scatter(projected_retrieved_embeddings[:, 0], projected_retrieved_embeddings[:, 1], s=100, facecolors='none', edgecolors='g')
plt.scatter(projected_original_query_embedding[:, 0], projected_original_query_embedding[:, 1], s=150, marker='X', color='r')
plt.scatter(projected_augmented_query_embedding[:, 0], projected_augmented_query_embedding[:, 1], s=150, marker='X', color='orange')

plt.gca().set_aspect('equal', 'datalim')
plt.title(f'{original_query}')
plt.axis('off')

In [None]:
def generateAnswer_augmented_query(query, n_results=5, extension=4):
    print("------- query -------\n")
    print(query,"\n")
    retrieved_documents = retrieveDocs_augmented_query(query, n_results, extension)
    print("------- retrieved documents -------\n")
    for document in retrieved_documents:
        print(document)
        print('\n')

    print("------- RAG answer -------\n")
    output = rag(query=query, retrieved_documents=retrieved_documents)
    print(output)
    print('\n')

In [None]:
queries

In [None]:
generateAnswer_augmented_query(queries[0],10,5)

In [None]:
title= """Ar-Ge Sürecinde Kullanılacak Yöntemler Tanımlanan proje hedeflerine ulaşmak için uygulanacak analitik
        deneysel çözüm yöntemlerini belirtiniz. (NOT: Bu bölümde sunulan proje özelinde
        hangi teknik / bilimsel yaklaşımların ve bunlara ait aşamaların takip edileceği açıklanmalı, iş paketleri isimleri ya da her projede olabilecek standart
        rutin çalışma yöntemleri tekrarlanmamalıdır."""
results = chroma_collection.query(query_texts=title, n_results=5, include=['documents', 'embeddings'])
retrieved_documents = results['documents'][0]
print(retrieved_documents)

In [None]:
title= """Ar-Ge Sürecinde Kullanılacak Yöntemler Tanımlanan proje hedeflerine ulaşmak için uygulanacak analitik
        deneysel çözüm yöntemlerini belirtiniz. (NOT: Bu bölümde sunulan proje özelinde
        hangi teknik / bilimsel yaklaşımların ve bunlara ait aşamaların takip edileceği açıklanmalı, iş paketleri isimleri ya da her projede olabilecek standart
        rutin çalışma yöntemleri tekrarlanmamalıdır."""
results = chroma_collection.query(query_texts=title, n_results=5, include=['documents', 'embeddings'])

retrieved_documents = extend_retrieved_documents(results)
print(retrieved_documents)


In [None]:
chroma_collection.get(results['ids'][0])