# Building a Local RAG Application using Ollama and LangChain

<sub style="display:none">This tutorial was prepared based on https://medium.com/@himanshushukla.shukla3/build-a-local-rag-application-42c06a9051e4</sub>

Local LLMs (Large Language Models) are AI models that run on your personal devices (like a computer or smartphone) instead of relying on cloud services. Running LLMs locally brings several key advantages:   

* Your data stays with you, minimizing concerns about sensitive information being sent to and stored on external servers.   
* You have complete control over the model and its parameters, allowing for customization and experimentation.   
* Use the LLM anytime, anywhere, even without an internet connection.
* Potentially avoid ongoing cloud usage fees or API call charges.   

This emerging trend empowers users with greater autonomy and flexibility, opening up new possibilities for personalized AI applications while addressing growing concerns about data privacy and security.   

## Ollama

Ollama is a tool that simplifies the process of running LLMs locally. Think of it as a user-friendly platform specifically designed for managing and interacting with local LLMs. 

Essentially, Ollama lowers the barrier to entry for running LLMs locally. By simplifying the technical aspects, it allows users to focus on exploring the capabilities of these models and developing innovative applications.

Here's what makes it important in the context of local LLMs:

* Ollama streamlines the often complex process of installing and configuring LLMs, making them accessible to a wider audience.
* It provides a centralized interface to download, organize, and run various LLMs, eliminating the need for manual configuration of each model.
* Ollama helps manage system resources effectively, ensuring smooth performance even on devices with limited hardware capabilities.

> To learn how to install and use Ollama, please refer to the Ollama guide in OdtuClass.

## Installing the packages

Please run the following cell to install the required Python packages. Since we will use LangChain, we install the LangChain integrations for Ollama and Chroma (a vector database).

> Installation may take a while depending on the compute power of your computer.

In [14]:
# Document loading, retrieval methods and text splitting
!pip install -qU langchain langchain_community

# Local vector store via Chroma
!pip install -qU langchain_chroma

# Local inference and embeddings via Ollama
!pip install -qU langchain_ollama

# Web Loader
!pip install -qU beautifulsoup4

## Implementing the Retrieval part of RAG

### Importing the Libraries

We need to import two classes:

* `WebBaseLoader`: This tool from the `langchain_community` library allows us to easily fetch content from a website.
* `RecursiveCharacterTextSplitter`: This class from `langchain_text_splitters` is used to divide the text into smaller, more manageable chunks. This is crucial for working with LLMs that have limitations on the amount of text they can process at once.

In [15]:
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

### Fetching and Processing the Information
Create a `WebBaseLoader` instance pointed at a specific blog post and then use the `.load()` method to fetch the content from that URL. The loaded content is stored in the `data` variable.

In [16]:
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()

Now, we can split the loaded webpage content into smaller chunks using `RecursiveCharacterTextSplitter`.
* `chunk_size=500`:  Each chunk will be approximately 500 characters long.
* `chunk_overlap=50`: Consecutive chunks will overlap by 50 characters. 

The overlap helps maintain context between chunks, which can be important for the LLM to understand the text as a whole.

In [17]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
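
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries to break on separators such as paragraphs and sentences). The helper name `split_with_overlap` and the sample text are made up for illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# The last 3 characters of each chunk reappear at the start of the next one.
```

With real text, the repeated characters mean that a sentence cut at a chunk boundary is still partly visible in the neighbouring chunk.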


Call the `split_documents()` method of the `text_splitter` object  created earlier. This method does the actual work of dividing the text into chunks according to the `chunk_size` and `chunk_overlap` parameters we specified.

Here, we're passing the `data` variable (which holds the loaded webpage content) as input to the `split_documents()` method. This tells the splitter what text it needs to split up.

In [18]:
all_splits = text_splitter.split_documents(data)

### Storing the Data with ChromaDB

Now we will embed our text chunks and store them in a vector database. This is a crucial step for efficient information retrieval.

We'll use `OllamaEmbeddings` to generate these embeddings locally with the `"nomic-embed-text"` model.

In [19]:
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings

local_embeddings = OllamaEmbeddings(model="nomic-embed-text")

To use the `local_embeddings` defined above, we actually need to download this embedding model to your local computer. 

To do this, run the following command in a terminal:

> ollama pull nomic-embed-text

Now you should create your vectorstore using the `Chroma.from_documents()` method. This is where you'll combine the text splits (`all_splits`) with your chosen embeddings (`local_embeddings`) and store them in Chroma.

In [20]:
vectorstore = Chroma.from_documents(documents=all_splits, embedding=local_embeddings)
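
Conceptually, a vector store keeps one vector per chunk and answers a query by ranking the chunks by how close their vectors are to the query vector. The following toy sketch shows that idea with hand-made 3-dimensional vectors and cosine similarity. The texts, vectors, and helper names are invented for illustration. Real embeddings have many more dimensions, and the metric and index Chroma actually uses depend on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k entries most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# (text, made-up embedding) pairs standing in for embedded chunks
toy_store = [
    ("Task decomposition splits a goal into subgoals.", [0.9, 0.1, 0.0]),
    ("Vector stores index embeddings.", [0.1, 0.9, 0.1]),
    ("Chain of thought prompts step-by-step reasoning.", [0.7, 0.2, 0.3]),
]

query_vec = [1.0, 0.0, 0.1]  # pretend this is the embedded question
results = top_k(query_vec, toy_store, k=2)
print(results)
```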

And now we have a working vector store! Test that similarity search is working:

In [21]:
question = "What are the approaches to Task Decomposition?"
docs = vectorstore.similarity_search(question)
print(len(docs))
print(docs[0])

4
page_content='Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.' metadata={'description': 'Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-crit

### Connecting with a Local LLM using Ollama

Now you should initialize the `ChatOllama` model. This sets up your connection to the Ollama server and specifies which language model you'll be using.

We use Ollama with `llama3.1:8b` here, but you can explore other providers or model options depending on your hardware setup.

Before running the following cell you should execute the following command in Ollama terminal:

> ollama pull llama3.1:8b

In [22]:
from langchain_ollama import ChatOllama

model = ChatOllama(
    model="llama3.1:8b",  # the model pulled with `ollama pull` above
)

Let's test it to make sure you've set everything up properly. In the code below, we use `IPython.display` to render the LLM's response nicely within your Jupyter Notebook environment.

Responses from LLMs are likely to contain Markdown formatting. The `display()` function, when used with `Markdown(...)`, tells Jupyter Notebook to interpret and render the text as Markdown. This ensures that any formatting instructions in the LLM's response (like bolding, lists, or headings) are displayed correctly.

In [23]:
from IPython.display import Markdown, display
response_message = model.invoke(
    "how omni helps in Audio Language Modeling"
)

# Compile the output to markdown
markdown_output = response_message.content

# Print or display the markdown output
display(Markdown(markdown_output))

 Omni is a versatile and powerful AI model that can be used in various applications, including audio language modeling. Here's how it can help:

1. Speech Recognition: Omni can be trained to transcribe speech from audio files into text, thereby enabling voice-to-text functionality. This is the first step in processing spoken language and can be crucial for developing virtual assistants, call center solutions, or any application where automatic speech recognition is required.

2. Speech Synthesis: On the other hand, Omni can also be used to convert text into spoken language, a process known as text-to-speech synthesis. This feature can be useful for applications like audio books, language learning platforms, or even for generating voice prompts in software interfaces.

3. Language Understanding: Once the speech is transcribed into text, Omni's natural language processing (NLP) capabilities can help understand the intent and context of the spoken words. This is essential for building intelligent conversational systems like chatbots or virtual assistants that can respond appropriately to user queries.

4. Sentiment Analysis: In the context of audio data, Omni can analyze the tone, pitch, and speed of speech to determine the speaker's emotional state, which can help in customer service applications for gauging customer satisfaction.

5. Speaker Identification: By analyzing various acoustic features of the voice, such as frequency, intensity, and duration, Omni can help identify different speakers in a given audio file. This is useful for applications like forensics or call center management where multiple people might be speaking at once.

In summary, Omni, with its advanced AI capabilities, can greatly facilitate various aspects of audio language modeling, making it easier to process, understand, and generate spoken language data.

## Creating a Summarization Chain

We will create a summarization chain which will produce a summary of the retrieved documents based on the search.

First, you'll need to import the necessary modules and define how you want to structure your prompt to the language model.

You should create a template for the prompt you'll send to the LLM by using the `from_template` method of the `ChatPromptTemplate`. 

In this case, the template is *"Summarize the main themes in these retrieved docs: {docs}"*. The `{docs}` part is a placeholder that will be filled with the actual documents retrieved from your vectorstore.

In [24]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    "Summarize the main themes in these retrieved docs: {docs}"
)

Next, you need a way to format the documents retrieved from your vectorstore so they can be included in the prompt. This `format_docs` function below takes a list of documents and joins their `page_content` together with double newlines (\n\n) as separators. 

This creates a single string containing the content of all retrieved documents.

In [25]:
# Convert loaded documents into strings by concatenating their content
# and ignoring metadata
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
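
To see what `format_docs` produces without touching the vector store, the sketch below runs the same function on two stand-in documents. `ToyDocument` is a hypothetical placeholder that only mimics the `page_content` and `metadata` attributes; the real objects returned by the vector store are LangChain documents.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Hypothetical stand-in exposing the attributes format_docs relies on."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    # Same logic as the cell above: join the contents, ignore the metadata
    return "\n\n".join(doc.page_content for doc in docs)


toy_docs = [
    ToyDocument("First chunk.", {"source": "example"}),
    ToyDocument("Second chunk."),
]
formatted = format_docs(toy_docs)
print(formatted)
```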

Now, let's assemble a processing pipeline using LangChain's concise pipe (`|`) syntax. The dictionary `{"docs": format_docs}` applies `format_docs` to the chain's input and stores the result under the `docs` key, which the prompt then fills in.

In [26]:
chain = {"docs": format_docs} | prompt | model | StrOutputParser()
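
If the `|` syntax looks like magic, the following toy class shows the general idea behind it: each step overloads Python's `|` operator so that the output of one step becomes the input of the next. This is a minimal sketch and not LangChain's actual implementation. `Step`, the fake model, and the strings are all invented for illustration.

```python
class Step:
    """Toy pipeline step; `a | b` builds a step that runs a, then b."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Compose: feed this step's output into the next step
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


to_prompt = Step(lambda docs: f"Summarize: {docs}")     # plays the prompt
fake_model = Step(lambda prompt: f"  echo: {prompt}  ")  # plays the LLM
parse = Step(str.strip)                                  # plays the parser

toy_chain = to_prompt | fake_model | parse
result = toy_chain.invoke("chunk one")
print(result)
```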

Finally, you can use your setup to answer a question.

First search your vectorstore for documents relevant to the `question` and store them in the `docs` variable.

In [27]:
question = "What are the approaches to Task Decomposition?"
docs = vectorstore.similarity_search(question)

The following code `invoke`s the `chain` with the retrieved `docs` as **input**. The chain formats the documents, creates the prompt, sends it to the LLM, and parses the response.

In [28]:
markdown_output = chain.invoke(docs)

Let's display the final outcome as Markdown.

In [29]:
display(Markdown(markdown_output))

 In summary, task decomposition can be achieved in three ways:

1. Using simple prompts like "Steps for XYZ.\n1." or "What are the subgoals for achieving XYZ?" to guide the Large Language Model (LLM) to break down complex tasks into smaller and more manageable steps.
2. Employing task-specific instructions, such as using "Write a story outline" when writing a novel, which helps the model understand the specific requirements of the task at hand.
3. Incorporating human inputs, where a human guides or provides information to the LLM throughout the task decomposition process.

Chain of thought (CoT) is a popular method for enhancing the performance of models on complex tasks by instructing them to "think step by step". CoT helps to decompose hard tasks into smaller and simpler steps, making big tasks more manageable. Additionally, it offers insights into the model's thinking process.

### Question & Answering 

Instead of summarization, you can also perform question-answering with your local model and vector store. Here’s an example with a simple string prompt. 

Pay attention to the content and structure of the prompt. `{context}` and `{question}` are the placeholders to later insert the context information (retrieved from the vector database) and the question asked by the user. Right now they are not known.


In [30]:
RAG_TEMPLATE = """
You are an assistant for question-answering tasks. Use the following pieces of context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.

Context : ```{context}```

Answer the following question:

{question}"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)
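
Filling the placeholders works much like Python's own `str.format`. The sketch below uses a shortened, made-up template (`TOY_TEMPLATE`) rather than `RAG_TEMPLATE`, and plain strings rather than a `ChatPromptTemplate`, so it only illustrates the substitution step.

```python
TOY_TEMPLATE = (
    "Context: {context}\n\n"
    "Answer the following question:\n\n"
    "{question}"
)

# Both placeholders must be supplied before the prompt can be built
filled = TOY_TEMPLATE.format(
    context="Paris is the capital of France.",
    question="What is the capital of France?",
)
print(filled)

# Leaving one placeholder out fails, which is why the chain has to
# provide both `context` and `question`
try:
    TOY_TEMPLATE.format(context="only the context")
    missing_key = None
except KeyError as error:
    missing_key = error.args[0]
print("missing:", missing_key)
```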

We will follow two approaches to create the chain. 

#### First Approach

In the first approach, the chain will be invoked via 

`chain.invoke({"context": context, "question": question})`. 

This means, we need to provide the context along with the question.


**What is RunnablePassthrough?**

Now that we have the prompt template, we need to define a chain of operations to answer questions using relevant context. This chain should 

1. receive the context and the question from the input dictionary, 
2. generate an answer using a language model, and 
3. then extract the answer as a string.

In the chain below, you may notice the `RunnablePassthrough` object. 

`RunnablePassthrough` is a LangChain component that forwards its input to the next step unchanged. Its `assign` method additionally adds (or overwrites) keys in the input dictionary, computing each value from the input with the function you supply.

In the chain below, `assign` simply copies the `context` and `question` values from the input dictionary, so the prompt receives both keys.

You can think of it this way: at the moment, the question and the context are not known. Once they are known, they will be passed through the chain. Even without them, we can still build our pipeline (chain) now.

In [31]:
from langchain_core.runnables import RunnablePassthrough

chain = (
    RunnablePassthrough.assign(
        context=lambda input: input["context"],
        question=lambda input: input["question"],  # Explicitly pass question
    )
    | rag_prompt
    | model
    | StrOutputParser()
)

Indeed, we could just use `RunnablePassthrough()` and the chain would still work. 

`RunnablePassthrough` receives the input (a dictionary containing the `question` and `context` keys) and passes it along unchanged.

These two keys are then fed to `rag_prompt`, which has the matching placeholders (`{question}` and `{context}`).

All these interactions between these components happen automatically in LangChain. 

In [32]:
from langchain_core.runnables import RunnablePassthrough


chain = (
    RunnablePassthrough()  # this still works
    | rag_prompt
    | model
    | StrOutputParser()
)

Now we can test our chain. Remember that we will invoke the chain with `chain.invoke({"context": context, "question": question})`, which means we need to define the question and the context.

The question is defined for you; just execute the following cell:

In [33]:
question = "What strategies can be employed to ensure the consistency and alignment between various system specifications, such as requirements, design, and test during development? Provide examples of how these strategies might be applied in a real-world scenario.?"

As we did before, now do a `similarity_search` to find the docs similar to the `question`.

In [34]:

docs = vectorstore.similarity_search(question)
context = format_docs(docs)

Now, please execute the following cell to `invoke` the chain to obtain the final response:

In [35]:
markdown_output = chain.invoke({"context": context, "question": question})

display(Markdown(markdown_output))

 To ensure consistency and alignment between various system specifications during development, following strategies can be employed:

1. Requirements Traceability Matrix (RTM): This tool helps link the high-level requirements to their corresponding lower-level designs and test cases. By doing this, developers can easily track the progress of each requirement throughout the development life cycle and ensure they are all met.

Example: In a real-world scenario, an RTM for a self-driving car system might link the high-level requirement "The car should safely navigate through traffic" to its corresponding lower-level designs (e.g., lane detection algorithms) and test cases (e.g., simulated traffic scenarios).

2. Design Review: Periodic meetings where designers discuss, review, and critique each other's work can help ensure that the design is coherent and aligned with requirements. This process also allows early identification of potential issues before they become critical during implementation.

Example: In a software development project, weekly design reviews might be held to discuss and critique the progress made on different modules or features.

3. Test-Driven Development (TDD): Developing tests for new functionalities before actually writing the code can help ensure that the system meets its requirements. This approach also helps in catching issues early and fosters a better design as the focus is on meeting specific behavioral expectations.

Example: In a self-driving car project, TDD might involve creating tests to verify the performance of the lane detection algorithms before actually implementing them.

By employing these strategies, developers can ensure that their system meets its requirements, has a coherent design, and is thoroughly tested throughout the development life cycle.

Alternatively, you could format the retrieved docs within the chain. Here is how you could do it:

In [36]:
chain = (
    RunnablePassthrough.assign(context=lambda input: format_docs(input["context"]))
    | rag_prompt
    | model
    | StrOutputParser()
)

question = "What strategies can be employed to ensure the consistency and alignment between various system specifications, such as requirements, design, and test during development? Provide examples of how these strategies might be applied in a real-world scenario.?"

docs = vectorstore.similarity_search(question)


markdown_output = chain.invoke({"context": docs, "question": question})

display(Markdown(markdown_output))

1) **Requirements Traceability Matrix (RTM):** This tool helps ensure that each requirement is addressed by a corresponding design or test case, thus maintaining consistency. For instance, if a requirement states "The system should be able to process images quickly," the RTM would link this requirement to its implementation in the design phase (e.g., using a specific algorithm) and testing phase (e.g., performance benchmark tests).

2) **Model-Based Development (MBD):** MBD uses formal models of the system, such as UML diagrams or statecharts, to describe requirements, design, and test cases in a unified manner. This approach allows for easy tracking of changes and ensures consistency across different stages of development. For example, if a change is made to the design during the implementation phase, the corresponding models can be updated automatically, ensuring that the new design adheres to the original requirements.

3) **Test-Driven Development (TDD):** TDD emphasizes writing tests before the actual code, thereby ensuring that the system meets its specified requirements. As the development progresses, the tests are continuously updated and rerun to verify that no unintended changes have affected the system's behavior. For instance, if a new feature is introduced in the design phase, corresponding test cases would be written first before any implementation work begins.

Note that in the code above we did not explicitly assign `question` inside `RunnablePassthrough.assign`: it is still part of the input dictionary and passes through unchanged. We only assigned `context` because we had to apply `format_docs` to the retrieved documents before feeding them to `rag_prompt`.
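
The behaviour described here can be mimicked with plain dictionaries. The sketch below is a toy model of "pass everything through, but recompute some keys"; `assign` and the sample inputs are hypothetical and only meant to show why `question` survives untouched while `context` is replaced.

```python
def assign(**computed):
    """Return a step that copies its input dict and recomputes some keys."""
    def step(inputs):
        outputs = dict(inputs)  # every existing key passes through
        for key, func in computed.items():
            outputs[key] = func(inputs)  # add or overwrite the computed keys
        return outputs
    return step


inputs = {
    "context": ["chunk A", "chunk B"],  # stands in for retrieved documents
    "question": "What is RAG?",
}

format_context = assign(context=lambda data: "\n\n".join(data["context"]))
prompt_inputs = format_context(inputs)
print(prompt_inputs)
```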

#### Second Approach

As the second approach, instead of manually passing in docs, you can automatically retrieve them from our vector store based on the user question:

In [37]:
retriever = vectorstore.as_retriever()

qa_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | model
    | StrOutputParser()
)

**How the Chain Executes:**

* You invoke `qa_chain` with the question as a plain string, for example `qa_chain.invoke("What is the capital of France?")`.
* The `{"context": ..., "question": ...}` step receives this string and passes it to each of its branches.
* `RunnablePassthrough()` returns the string unchanged, so it is assigned to the `"question"` key.
* The `retriever` receives the same string and uses it as the query for its similarity search. (A retriever expects a string query, which is why the chain is not invoked with a dictionary.)
* The retrieved documents are then formatted by `format_docs` and assigned to the `"context"` key.
* Finally, the resulting dictionary `{"context": formatted_docs, "question": "What is the capital of France?"}` is passed to `rag_prompt`.
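
The fan-out performed by the dictionary step can be sketched in plain Python, assuming the chain is invoked with the question string as in the cell below. Everything here is a toy: `fake_retriever` does naive keyword matching instead of a similarity search, and the corpus and helper names are made up.

```python
TOY_CORPUS = [
    "Task decomposition can use simple prompting.",
    "Chroma stores embeddings.",
    "Human inputs can guide decomposition.",
]


def fake_retriever(query):
    """Return corpus entries sharing at least one word with the query."""
    query_words = set(query.lower().split())
    return [
        text for text in TOY_CORPUS
        if query_words & set(text.lower().replace(".", "").split())
    ]


def join_chunks(chunks):
    return "\n\n".join(chunks)


def run_branches(branches, value):
    """Give the same input to every branch and collect the results by key."""
    return {key: branch(value) for key, branch in branches.items()}


branches = {
    "context": lambda q: join_chunks(fake_retriever(q)),  # retriever | format
    "question": lambda q: q,                              # passthrough
}
prompt_inputs = run_branches(branches, "task decomposition")
print(prompt_inputs)
```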

In [40]:
question = "What are the approaches to Task Decomposition?"

display(Markdown(qa_chain.invoke(question)))

1. Chain of Thought (CoT) prompting, where the model is instructed to break down complex tasks into smaller steps.
2. Using task-specific instructions that guide the model towards a particular goal, such as writing a story outline for a novel.
3. Human inputs can also be used in task decomposition to provide additional guidance or context to the model.

### Using Llama.cpp

Practical applications of LLMs can be limited by the need for high-powered computing or the necessity for quick response times. These models typically require sophisticated hardware and extensive dependencies, which can make their adoption difficult in more constrained environments.

This is where LLaMa.cpp (or LLaMa C++) comes to the rescue, providing a lighter, more portable alternative to the heavyweight frameworks.

Llama.cpp was developed by Georgi Gerganov. It implements Meta's LLaMa architecture in efficient C/C++, and it has one of the most dynamic open-source communities around LLM inference, with more than 900 contributors, 69000+ stars on the official GitHub repository, and 2600+ releases.

Llama.cpp's backbone is the original Llama models, which are based on the transformer architecture. The authors of Llama leveraged various improvements that were subsequently proposed and used in different models such as PaLM.

Let's first install the Python binding for llama.cpp.

In [41]:
!pip install llama-cpp-python

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.3.5.tar.gz (64.5 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.3.5-cp311-cp311-macosx_15_0_arm64.whl size=3049304 sha256=3ee067e8004f8f785a682fdaba8e77c4f986f6ac5c199e722720e4d0e1405ce0
  Stored in directory: /Users/dogukanince/Library/Caches/pip/wh

Then you can import it:

In [42]:
from langchain_community.llms import LlamaCpp

The `LlamaCpp` class imported above is LangChain's wrapper around Llama.cpp. It takes several parameters, including but not limited to the ones below. The complete list of parameters is provided in the official documentation:

* `model_path`: The path to the Llama model file being used
* `prompt`: The input prompt to the model. This text is tokenized and passed to the model. With the LangChain wrapper, the prompt is supplied when the model is invoked rather than in the constructor.
* `device`: The device to use for running the Llama model; such a device can be either CPU or GPU. In the LangChain wrapper, GPU offloading is controlled through `n_gpu_layers` rather than a `device` argument.
* `max_tokens`: The maximum number of tokens to be generated in the model’s response
* `stop`: A list of strings that will cause the model generation process to stop
* `temperature`: This value ranges between 0 and 1. The lower the value, the more deterministic the end result. On the other hand, a higher value leads to more randomness, hence more diverse and creative output.
* `top_p`: Controls the diversity of the predictions through nucleus sampling: at each step, only the smallest set of most probable tokens whose cumulative probability reaches the given threshold is considered. A value of 1 keeps the full vocabulary, while lower values restrict sampling to the most likely tokens.
* `echo`: A boolean used to determine whether the model includes the original prompt at the beginning (True) or does not include it (False)
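
To build some intuition for `temperature` and `top_p`, the sketch below applies both ideas to three made-up logits in plain Python. It is a textbook-style illustration, not llama.cpp's sampling code; the function names and numbers are assumptions chosen for readability.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; temperature must be greater than 0."""
    scaled = [value / temperature for value in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def nucleus(probs, top_p):
    """Indices of the most probable tokens whose total mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens

sharp = softmax_with_temperature(logits, 0.5)  # low temperature
flat = softmax_with_temperature(logits, 2.0)   # high temperature
print(sharp[0] > flat[0])  # low temperature favours the top token more

probs = softmax_with_temperature(logits, 1.0)
kept = nucleus(probs, top_p=0.9)
print(kept)  # indices of the tokens that remain eligible for sampling
```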

When you use Ollama to download LLMs, they are stored in a specific directory on your system. On Windows, this is `C:/Users/<<your_username>>/.ollama/models/`.

**Important:** You'll need to replace `<<your_username>>` with your actual Windows username.

For Mac users, the default storage location for these models is within the user's home directory, specifically under `~/.ollama/models`.

The `model_path` parameter must point to a model file stored under this directory. To ensure your code functions correctly, update the `model_path` variable in your scripts accordingly.
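
Before loading a model, it can help to check that the path you are about to use really points to a file. The helper below is a small sketch using only `pathlib`; `check_model_path` is a hypothetical name, and the default directory is simply the location described above, which may differ if you changed Ollama's storage settings.

```python
from pathlib import Path

# Default Ollama model directory as described above (an assumption if you
# configured a different storage location)
models_dir = Path.home() / ".ollama" / "models"


def check_model_path(path):
    """Return a short status message for a candidate model path."""
    candidate = Path(path)
    if candidate.is_file():
        return "ok"
    if candidate.is_dir():
        return "this is a directory, not a model file"
    return "path does not exist"


print(models_dir)
print(check_model_path(models_dir))
```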



In [1]:
# Make sure the model path is correct for your system!
llm_llamacpp = LlamaCpp(
    model_path="PATH_TO_YOUR_MODEL",
    temperature=1,
    max_tokens=2500,
    top_p=1,
    verbose=True,  # Verbose is required to pass to the callback manager
)

Build the same chain as before but replace the model.

In [55]:
qa_chain_llamacpp = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm_llamacpp
    | StrOutputParser()
)

Write the code to invoke the model with a question:

In [56]:
question = "What are the approaches to Task Decomposition?"

display(Markdown(qa_chain_llamacpp.invoke(question)))

llama_perf_context_print:        load time =   23052.11 ms
llama_perf_context_print: prompt eval time =       0.00 ms /   369 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /    30 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   25731.14 ms /   399 tokens


 

Assistant: Task decomposition can be approached in three ways: using simple prompting for LLMs, applying task-specific instructions, or incorporating human inputs.

### Speed Gains with LLaMa.cpp

Have you noticed a difference in speed when using LLaMa.cpp? Its optimized C/C++ implementation is designed for speed and can lead to faster responses than going through Ollama, which can be particularly important for applications that require real-time interactions or rapid responses.

Keep in mind that the actual gain depends on your hardware, the model's quantization, and settings such as GPU offloading. Ollama itself builds on llama.cpp, so it is worth measuring both on your own machine before drawing conclusions.
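
One way to check the speed difference on your own machine is to time the same call several times and keep the best run. The sketch below uses only the standard library; `time_call` is a hypothetical helper and `fake_generate` is a stand-in so the example runs without any model.

```python
import time


def time_call(func, *args, repeats=3):
    """Run func(*args) several times and return the fastest duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        durations.append(time.perf_counter() - start)
    return min(durations)


def fake_generate(prompt):
    # Stand-in for a model call so the sketch is runnable on its own
    return prompt[::-1]


best = time_call(fake_generate, "What are the approaches to Task Decomposition?")
print(f"best of 3 runs: {best:.6f} s")
```

To compare the two setups from this tutorial, you could pass `qa_chain.invoke` and `qa_chain_llamacpp.invoke` in place of `fake_generate`, using the same question for both.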