Notebook created by [Nikolaos Tsopanidis](https://github.com/tSopermon)

# RAG System

Data ingestion is the first step in building a RAG (retrieval-augmented generation) system: it gives the LLM a reliable data source and extends what a pre-trained model can answer.

For this task we import text-based documents about concert tours in 2025-2026 and store their contents. The user can then ask the LLM questions and receive answers grounded in the knowledge source we created.

## Data ingestion process
In this process we load external documents and store them in a vector database for retrieval. The steps are:
 1. **Load**: Loading data into documents  
 2. **Split**: Splitting data into manageable chunks  
 3. **Embed**: Creating document embeddings  
 4. **Store**: Storing embeddings into a vector database 

<div style="text-align: center;">

![Alt](https://miro.medium.com/v2/resize:fit:720/format:webp/1*wHqtILSjqYsF6RnDq2CJDA.png)

Source: [medium.com - Amina Javaid](https://medium.com/@aminajavaid30/building-a-rag-system-the-data-ingestion-pipeline-d04235fd17ea)

</div>
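
The four steps above can be illustrated with a toy, standard-library-only sketch. Everything here is a hypothetical stand-in rather than the notebook's actual implementation: `embed` counts words instead of calling an embedding model, `store` is a plain Python list instead of a vector database, and the sample text is invented.

```python
from collections import Counter

def load(raw_text: str) -> str:
    """Load: a stand-in for reading an uploaded document."""
    return raw_text.strip()

def split(text: str, chunk_size: int = 45, overlap: int = 5) -> list[str]:
    """Split: fixed-size character windows that overlap slightly."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def embed(text: str) -> Counter:
    """Embed: toy word-count vector; a real system would call an embedding model."""
    return Counter(text.lower().split())

def store(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Store: keep each chunk next to its vector (a vector database does this at scale)."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index, query: str, k: int = 1) -> list[str]:
    """Return the k chunks whose word counts overlap most with the query."""
    q = embed(query)

    def score(item):
        _, vec = item
        return sum(q[word] * vec[word] for word in q)

    ranked = sorted(index, key=score, reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Invented sample text, used only to exercise the pipeline
raw = "The band plays Athens on 12 June 2025. Tickets for the London show go on sale in May."
index = store(split(load(raw)))

top_athens = retrieve(index, "Athens")
top_london = retrieve(index, "London show")
print(top_athens[0])
print(top_london[0])
```

The fixed-size windows can cut words in half at chunk borders, which is one reason real splitters prefer natural boundaries such as paragraphs and spaces.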

## LangChain Implementation
**LangChain** is an open-source framework commonly used in GenAI applications. Besides building apps with it, we can use LangSmith, a companion platform from the LangChain team, to monitor, debug, and evaluate LLM applications.

* First, we will install the required packages for the project, shown in the `requirements` section below:
```bash
    streamlit==1.44.1
    langchain-ollama==0.3.2
    langchain-chroma==0.2.3
    transformers==4.51.3
    langchain-core==0.3.52
    langchain-text-splitters==0.3.8
    langchain==0.3.23
    torch==2.6.0
    serpapi==0.1.5
    google-search-results==2.4.2
```

## 1. Importing necessary libraries for text processing and vector storage
- `re`: Regular expressions for text manipulation.
- `os`: Access to environment variables (used later for the API key).
- `Document`: Class for handling documents in the LangChain framework.
- `RecursiveCharacterTextSplitter`: Class for splitting text into smaller chunks, trying larger separators (paragraphs, lines) before smaller ones.
- `Chroma`: Class for storing and managing vector embeddings.

In [12]:
import re, os
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma

## 2. Importing the library to create flexible chat prompts for LLMs

In [None]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

## 3. Importing libraries to create a retrieval chain
Chains combine multiple components into a single workflow: they encode sequences of calls to LLMs, tools, and other chains.
Here we create a **retrieval chain**, which uses a retriever to find **relevant documents** in a vector store and then passes them to a language model to generate a response.
The **history-aware retriever** takes the conversation history into account when forming the search query, so that follow-up questions still retrieve the right documents.
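
To make the flow concrete, here is a plain-Python sketch of what such a chain does conceptually. It does not use LangChain: `condense_question`, `retrieve`, `stuff_documents`, `retrieval_chain`, and `stub_llm` are hypothetical names, the retriever ranks by shared words instead of embeddings, and in the real chain an LLM (not string concatenation) rewrites the question using the history. The documents are invented.

```python
def condense_question(question: str, history: list[str]) -> str:
    """History-aware step (stand-in): fold earlier turns into the search query."""
    return " ".join(history + [question])

def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Stand-in retriever: rank documents by how many lowercase words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def stuff_documents(question: str, docs: list[str]) -> str:
    """'Stuff' step: place all retrieved documents into a single prompt."""
    context = "\n".join(docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

def retrieval_chain(question, history, documents, llm):
    """Condense the question, retrieve, build the prompt, then call the model."""
    query = condense_question(question, history)
    docs = retrieve(query, documents)
    return {"context": docs, "answer": llm(stuff_documents(question, docs))}

def stub_llm(prompt: str) -> str:
    """Placeholder for a real chat model call."""
    return f"[stub LLM received {len(prompt)} characters]"

documents = [
    "Ticket refunds take ten days.",
    "Another Artist plays Rome in July 2025.",
    "Example Band plays Athens in June 2025.",
]
history = ["Tell me about Example Band."]

with_history = retrieval_chain("Where do they play?", history, documents, stub_llm)
without_history = retrieval_chain("Where do they play?", [], documents, stub_llm)
print(with_history["context"])
print(without_history["context"])
```

On its own, "Where do they play?" shares no words with any document, so the retriever has nothing to go on; with the earlier turn folded in, the query mentions the band and the matching document is found.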

In [3]:
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain, create_history_aware_retriever

## 4. Importing Streamlit to create the web app UI for handling user input, and SerpApi for web search and scraping events from the Google Events engine

In [4]:
import streamlit as st
from serpapi import GoogleSearch

## 5. Importing LangChain Ollama and Transformers libraries
This service uses the Ollama and Transformers libraries to create a chatbot that can answer questions based on a given document.
The chatbot uses the Ollama library to generate embeddings for the document and the Transformers library to generate responses based on the embeddings.
It also uses the LangChain library to create a retrieval chain that retrieves relevant documents based on the user's query.

In [None]:
from langchain_ollama import OllamaEmbeddings, ChatOllama
from transformers import pipeline

## 6. Setting the SerpApi API key to use the engine and retrieve events
SerpApi requires an API key to access its Google search engines. You can get your own API key from https://serpapi.com/manage-api-key. Make sure to set the environment variable `SERPAPI_API_KEY` to your own API key before running this code.

In [None]:
# Read the key from the environment so that no real secret is stored in the notebook
SERPAPI_API_KEY = os.environ.get("SERPAPI_API_KEY", "YOUR_SERPAPI_API_KEY") # Replace the placeholder with your own API key
os.environ["SERPAPI_API_KEY"] = SERPAPI_API_KEY

## 7. Configuring models for the embeddings and LLM

### 7.1. Embeddings
The embeddings model converts text into vector representations, which is what makes semantic search and retrieval possible: texts with similar meaning end up with similar vectors.
The `OllamaEmbeddings` class creates embeddings with a model served by Ollama. The `model` parameter specifies the name of the model to use. In this case, we are using the `nomic-embed-text` model, which is designed for text embeddings.
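
As a rough illustration of how vectors enable semantic search, the sketch below ranks texts by cosine similarity to a query vector. The three-dimensional vectors are made up for the example; a real embedding model returns vectors with hundreds of dimensions, and the helper name `cosine_similarity` is ours, not part of any library used in this notebook.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for illustration only
vectors = {
    "concert in Athens": [0.9, 0.1, 0.0],
    "live show in Greece": [0.8, 0.2, 0.1],
    "tax form deadline": [0.0, 0.1, 0.9],
}
query_vector = [0.85, 0.15, 0.05]  # pretend this is the embedding of "gigs in Greece"

# Sort texts from most to least similar to the query
ranked = sorted(
    vectors,
    key=lambda text: cosine_similarity(query_vector, vectors[text]),
    reverse=True,
)
for text in ranked:
    print(f"{cosine_similarity(query_vector, vectors[text]):.3f}  {text}")
```

The two concert-related texts score close to 1, while the unrelated one scores near 0, even though none of them shares a word with the pretend query.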

In [8]:
embeddings = OllamaEmbeddings(model="nomic-embed-text")

### 7.2. LLM
The LLM (large language model) generates text based on the input provided. The `ChatOllama` class creates a chat model backed by a model served by Ollama. The `model` parameter specifies the name of the model to use. In this case, we are using `llama3.1:8b`, the 8-billion-parameter variant of Llama 3.1. The code also passes `grounding="strict"`, which is intended to keep the model focused on the input context. Note that `grounding` does not appear among the documented `ChatOllama` parameters, so check that it has an effect with your installed version.
The LLM will also be used to generate concise summaries of the uploaded documents and scraped events. These summaries are then saved in the Chroma vector store and used for retrieval.

In [9]:
llm = ChatOllama(model="llama3.1:8b", grounding="strict")

**Important**: `nomic-embed-text` and `llama3.1:8b` must be downloaded first. Install [Ollama](https://ollama.com/download) on your system and run the following commands:
```bash
ollama pull nomic-embed-text
ollama pull llama3.1:8b
```

## 8. Functions for Document Processing and Vector Base Creation

### 8.1. Function to check if the text is concert-related
This function checks whether the text contains any concert-related keywords. If the text is not related to concerts, the app will not load the document, ensuring that only relevant information is processed.
This keeps the application focused and helps the user receive accurate and relevant information.
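
A keyword check of this kind is typically a plain substring test, which can produce false positives: "band" occurs inside "husband" and "show" inside "shower". The sketch below compares a substring test with a hypothetical whole-word variant based on word boundaries. The function names and the shortened keyword list are ours, chosen for illustration, and the optional trailing `s` is a simplistic way to accept plurals.

```python
import re

# Shortened, illustrative keyword list (not the full list used by the app)
KEYWORDS = ["concert", "band", "show", "tour", "live music"]

def is_concert_related_substring(text: str, keywords: list[str]) -> bool:
    """Substring test: True if any keyword appears anywhere, even inside another word."""
    text_lower = text.lower()
    return any(keyword in text_lower for keyword in keywords)

def is_concert_related_whole_word(text: str, keywords: list[str]) -> bool:
    """Whole-word test: keywords must stand alone, optionally followed by a plural 's'."""
    alternatives = "|".join(re.escape(keyword) for keyword in keywords)
    pattern = rf"\b(?:{alternatives})s?\b"
    return re.search(pattern, text.lower()) is not None

unrelated = "My husband took a shower."
related = "Two concerts were cancelled."

print(is_concert_related_substring(unrelated, KEYWORDS))   # True (false positive)
print(is_concert_related_whole_word(unrelated, KEYWORDS))  # False
print(is_concert_related_whole_word(related, KEYWORDS))    # True
```

The whole-word variant is stricter, so it trades some recall for fewer false positives; which behaviour is preferable depends on how noisy the uploaded documents are.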

In [None]:
def is_concert_related(text, CONCERT_RELATED_KEYWORDS):
    text_lower = text.lower() # Convert to lowercase for case-insensitive matching
    return any(keyword in text_lower for keyword in CONCERT_RELATED_KEYWORDS)

CONCERT_RELATED_KEYWORDS = ["concert", "music", "band", "performance", "stage", 
                            "ticket", "venue", "artist", "festival", "tour", "gig", 
                            "show", "orchestra", "symphony", "recital", "live music", 
                            "audience", "encore", "setlist", "soundcheck"]

### 8.2. Function to create a Chroma vector store
This function takes a text input and an embeddings model as arguments. It collapses runs of whitespace, removes non-alphanumeric characters, splits the text into smaller chunks, and creates a Chroma vector store from those chunks using the provided embeddings model.
We use `RecursiveCharacterTextSplitter` because the concert documents consist of multiple paragraphs: it tries to split on paragraph breaks first, then on line breaks, then on spaces, so that chunks stay as semantically intact as the size limit allows. Note that the whitespace cleanup runs before splitting, so by the time the splitter sees the text the paragraph breaks have already been collapsed into single spaces and the splits fall on word boundaries.

In [None]:
def get_vector_store(text, embeddings):
    # Collapse runs of whitespace (including newlines) into single spaces
    processed_text = re.sub(r'\s+', ' ', text)
    # Drop everything except letters, digits, and whitespace
    processed_text = re.sub(r'[^a-zA-Z0-9\s]', '', processed_text)
    documents = [Document(page_content=processed_text)] # list of Document objects
    # Split into chunks of at most 200 characters, with up to 20 characters of overlap
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=200,
                                                   chunk_overlap=20,
                                                   length_function=len)
    chunks = text_splitter.split_documents(documents)
    chunk_texts = [chunk.page_content for chunk in chunks]
    # Embed the chunks and persist them in a local Chroma collection
    vector_store = Chroma.from_texts(texts=chunk_texts,
                                     embedding=embeddings,
                                     persist_directory="chroma_db",     # directory to persist the database
                                     collection_name="my_collection")   # name of the collection

    return vector_store
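
To see what the two regular-expression substitutions do to typical concert text, the sketch below applies them to an invented sample line. The third pattern, `gentler`, is a hypothetical alternative of ours (not part of the function above) that keeps a few punctuation marks which carry meaning in dates and prices.

```python
import re

# Invented sample, chosen to contain a date, a price, and a paragraph break
sample = "Tour dates:\n\n  Athens, 12/06/2025 - tickets from 45.50 EUR!"

# Step 1: collapse every run of whitespace (newlines included) into one space
collapsed = re.sub(r"\s+", " ", sample)

# Step 2: remove everything that is not a letter, digit, or whitespace
stripped = re.sub(r"[^a-zA-Z0-9\s]", "", collapsed)

# Hypothetical alternative: also keep . , : / and - so dates and prices stay readable
gentler = re.sub(r"[^a-zA-Z0-9\s.,:/-]", "", collapsed)

print(repr(collapsed))  # 'Tour dates: Athens, 12/06/2025 - tickets from 45.50 EUR!'
print(repr(stripped))   # 'Tour dates Athens 12062025  tickets from 4550 EUR'
print(repr(gentler))    # 'Tour dates: Athens, 12/06/2025 - tickets from 45.50 EUR'
```

After the strict cleanup the date and the price are no longer recognisable as such, which may matter when users ask about specific dates or ticket prices; whether to relax the pattern is a design choice to weigh against noise in the source documents.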