# Create AI apps with LLMs

This notebook shows ways to experiment quickly with Large Language Models (LLMs) and 
Retrieval Augmented Generation (RAG), which can then be integrated into a UI or app.

To put things together, LangChain is a very useful framework that integrates many different providers 
(LLMs, vector databases, agents...).

## RAG: Retrieval Augmented Generation

LLMs only know about things they were trained on. They cannot know about everything, especially not 
about documents and data from private sources, or content published after the model was trained.

To generate accurate answers about specific content, that content needs to be passed to the LLM as part of the prompt.
However, there is a major problem: although LLMs have a large context window compared to other types of NLP models, the window 
is not unlimited. Typical window sizes are 1024, 4096, and up to 32000 tokens, which is often too small for even medium-sized documents.

The solution is to index the content, and provide only the relevant context to the LLM.

To do this, the content is chunked into small pieces of text. For each piece, an embedding vector is created and stored in
a vector store. When querying the data, an embedding of the query is created, and the vector store is searched for similar content.
The top N pieces of relevant content are retrieved and plugged into the prompt for the LLM to answer the query.
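
To make the flow concrete before bringing in any library, here is a minimal sketch of the retrieval step in plain Python. It is only an illustration: the "embedding" is a toy bag-of-words count vector standing in for a real embedding model, and the names `embed`, `cosine`, and `top_n` are made up for this example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_n(query, chunks, n=2):
    """Return the n chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:n]

chunks = [
    "Retrieval augmented generation passes retrieved context to the model.",
    "The context window limits how many tokens fit in a prompt.",
    "Bananas are rich in potassium.",
]
query = "What is retrieval augmented generation?"
context = top_n(query, chunks, n=1)

# Only the retrieved chunk goes into the prompt, not the whole corpus
prompt = f"Context: {context[0]}\nQuestion: {query}"
```

A real embedding model captures meaning rather than word overlap, but the pipeline has the same shape: embed the chunks, embed the query, rank by similarity, and build the prompt from the top matches.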

### Import some data to query

In this example, we retrieve a recent article from the web.

In [None]:
from langchain_community.document_loaders import WebBaseLoader

# for other types of documents, use:
# from langchain_community.document_loaders import UnstructuredHTMLLoader
# from langchain_community.document_loaders import TextLoader
# from langchain_community.document_loaders.csv_loader import CSVLoader
# from langchain_community.document_loaders import JSONLoader
# from langchain_community.document_loaders import UnstructuredMarkdownLoader
# from langchain_community.document_loaders import PyPDFLoader
# see: https://python.langchain.com/docs/modules/data_connection/document_loaders/ for more info

url = "https://www.techtarget.com/searchenterpriseai/tip/9-top-AI-and-machine-learning-trends"
loader = WebBaseLoader(url)
data = loader.load()

In [None]:
# data

### Chunk the article into smaller manageable pieces

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, 
    chunk_overlap=0, 
    length_function=len, 
)
all_splits = text_splitter.split_documents(data)

In [None]:
len(all_splits)
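
The cell above uses `chunk_overlap=0`, so adjacent chunks share no text. To see what the two parameters control, here is a rough sketch using a plain fixed-size sliding window. This is a simplification and an assumption for illustration: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraph breaks, which this hypothetical `split_with_overlap` helper does not do.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-size sliding window; a simplification of what a real splitter does."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "abcdefghijklmnopqrstuvwxyz"
no_overlap = split_with_overlap(sample, chunk_size=10, chunk_overlap=0)
with_overlap = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)

print(no_overlap)
print(with_overlap)
```

With an overlap, the end of each chunk is repeated at the start of the next one, which can help when a sentence straddles a chunk boundary, at the cost of storing more chunks.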

### Load an embedding model

Note that the same embedding model must be used to embed both the pieces of text from the article and, later, the query to be answered. Otherwise the vectors are not comparable and retrieval will not work.

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings
import os

# On a Mac M1/M2, the MPS backend may fail with:
#   NotImplementedError: The operator 'aten::cumsum.out' is not currently implemented for the MPS device.
# See https://github.com/pytorch/pytorch/issues/77764.
# As a temporary fix, setting the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1`
# uses the CPU as a fallback for this op.
# WARNING: this will be slower than running natively on MPS.

os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'mps'}  # 'mps' targets Apple Silicon; use 'cuda' or 'cpu' on other hardware
encode_kwargs = {'normalize_embeddings': False}
embedding = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
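
The `normalize_embeddings` flag above is left at `False`. As a small illustration of what normalization changes, the sketch below uses two made-up 2D vectors: the raw dot product depends on vector length, while the dot product of unit-length vectors equals the cosine similarity. The helper names are hypothetical, and which distance metric the vector store actually uses is outside the scope of this sketch.

```python
import math

def dot(a, b):
    """Dot product of two dense vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, independent of their length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
raw_dot = dot(a, b)                          # depends on the vector lengths
unit_dot = dot(normalize(a), normalize(b))   # matches the cosine similarity

print(raw_dot, unit_dot, cosine_similarity(a, b))
```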

### Ingest the data into the vector store

The LangChain vector store interface takes care of embedding each piece of text and storing it in the DB.

Here we use ChromaDB, a local vector database based on SQLite. 

Note that we pass both the split texts and the embedding function.

In [None]:
from langchain_community.vectorstores import Chroma
import time
vectorstore = Chroma.from_documents(
    collection_name=str(time.time()), # !!! if we re-run with the same collection name, we end up with duplicates in the DB!
    documents=all_splits,
    embedding=embedding,
    persist_directory="./"  # if you want to persist the DB locally, and not have to reindex each time
)
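
The cell above avoids duplicates by using a fresh collection name on every run. Another possible approach, sketched here in plain Python, is to derive a stable id from each chunk's content so that re-ingesting the same text maps to the same key. `chunk_id` and `dedupe` are hypothetical helpers, and the dictionary only stands in for a store; whether and how such ids can be handed to the vector store depends on the library version, so treat this as an idea rather than a recipe.

```python
import hashlib

def chunk_id(text):
    """Stable id derived from the chunk content (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

def dedupe(texts):
    """Keep the first occurrence of each distinct chunk, keyed by its content id."""
    seen = {}
    for text in texts:
        seen.setdefault(chunk_id(text), text)
    return seen

first_run = ["chunk one", "chunk two"]
second_run = ["chunk one", "chunk two", "chunk three"]

# Ingesting both runs leaves a single copy of each chunk
store = dedupe(first_run + second_run)
print(len(store))
```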

### Test the vector store retrieval on some question

In [None]:
question = "What is RAG?"
docs = vectorstore.similarity_search_with_score(question, k=5)


In [None]:
docs
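
`similarity_search_with_score` returns `(document, score)` pairs. The sketch below shows one way such pairs could be filtered and ordered. It assumes the score is a distance, where lower means more similar, which should be checked for the vector store in use. `Doc` is a minimal stand-in for a LangChain document, and the scores are made up.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (hypothetical)."""
    page_content: str

def keep_close(results, max_distance):
    """Keep (doc, score) pairs whose distance is at most max_distance, closest first."""
    kept = [(doc, score) for doc, score in results if score <= max_distance]
    return sorted(kept, key=lambda pair: pair[1])

# Made-up results in the same (document, score) shape
results = [(Doc("about RAG"), 0.42), (Doc("unrelated"), 1.37), (Doc("also RAG"), 0.35)]
relevant = keep_close(results, max_distance=1.0)

for doc, score in relevant:
    print(f"{score:.2f}  {doc.page_content}")
```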

### Set up the LLM

### To use LlamaCpp

LlamaCpp lets you run a model as a local LLM engine, so everything runs locally. 
I use a small version of the model (13B params) quantized to 4 bits, which takes a lot less space than the full model.

However, even then it requires a decent Nvidia GPU or an M1/M2 Mac. Alternatively, you can use a service like OpenAI.

Note that on Mac, it needs to be compiled with special flags. See the README for more details.

In [None]:
from langchain_community.llms import LlamaCpp
model_file = '/Users/emmanuel/workspace/models/llms/llama-2-13b-chat.Q4_0.gguf'
# downloaded from https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF
# you may also try a smaller model:
# https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF
llm = LlamaCpp(
    model_path=model_file,
    n_ctx=4096, # context window
    #verbose=True,
    # `device` is not a LlamaCpp argument; GPU offloading (Metal on Mac) is set via n_gpu_layers
    n_gpu_layers=1
)

### To use OpenAI

In [None]:
#!pip install langchain-openai

In [None]:
# from langchain_openai import OpenAI
# # llm = OpenAI(openai_api_key="...")
# llm = OpenAI(openai_api_key=os.environ.get("OPENAI_API_KEY"))

### To use Ollama

First set up Ollama: install the app and run it, which also installs the command line tool.

Then run a model with 
```ollama run <model>```


In [None]:
# from langchain.callbacks.manager import CallbackManager
# from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
# from langchain_community.llms import Ollama

# llm = Ollama(
#     model="llama2",
#     # callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
# )

### Set up a retrieval chain

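Chains are another useful LangChain abstraction: `RetrievalQA` combines the retriever and the LLM into a single question-answering step.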

In [None]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(llm,
                                       retriever=vectorstore.as_retriever(search_kwargs={'k':2}, search_type="mmr"),
                                       return_source_documents=True)

In [None]:
qa_chain.invoke({'query': question})

### Improving results

We might be able to improve the results with a more specific prompt.

In [None]:
from langchain.prompts import PromptTemplate
template = """
[INST]
You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question, and only this context. 
If you don't know the answer, from the provided context, just say that you don't know. 
Do not attempt to define what acronyms stand for unless the definition was explicitly provided in the context.
Use three sentences maximum and keep the answer concise.
Question: {question} 
Context: {context} 
Answer:
[/INST]
"""
rag_prompt = PromptTemplate.from_template(template)
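
To see what ends up in front of the model, here is a plain-Python sketch of the substitution step: the retrieved chunks are joined into one context string and placed into the `{question}` and `{context}` placeholders. It uses `str.format` on a shortened toy template as a stand-in, and `build_prompt` is a hypothetical helper, not part of LangChain.

```python
toy_template = "Question: {question}\nContext: {context}\nAnswer:"

def build_prompt(question, chunks):
    """Join retrieved chunks and substitute them into the template placeholders."""
    context = "\n\n".join(chunks)
    return toy_template.format(question=question, context=context)

toy_prompt = build_prompt("What is RAG?", ["First chunk.", "Second chunk."])
print(toy_prompt)
```

Printing the final prompt like this is a cheap way to check that the context is what we expect before blaming the model for a bad answer.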

In [None]:
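from langchain.schema.runnable import RunnablePassthrough

# Feed the prompt from the retriever directly: using `qa_chain` as the context
# would call the LLM twice and pass its whole output dict as context.
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={'k': 2})

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)  # one context string

rag_chain = (
    {'context': retriever | format_docs, 'question': RunnablePassthrough()}
    | rag_prompt
    | llm
)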

In [None]:
rag_chain.invoke(question)

### More tuning?

Sometimes, this is not enough, and we need to revisit our strategy from the start:

- Chunking: how is the document split up? Here we have text from the web, and we know the article is split into several independent paragraphs. It might be useful to have larger chunks that include a whole paragraph. If that is too much to fit in the context window, we can reduce the number of matches provided, since with larger chunks, other paragraphs may not be relevant anyway. Setting the chunk size to 5000 in the chunking phase helps improve results.

- Search method: we already use MMR (Maximal Marginal Relevance). This option may reduce redundant chunks.

- LLM model: choosing a model trained more closely on the task is always a good strategy. If you're dealing with code, use an instruct model trained on code.

- Prompt: prompt engineering is the best option to try before resorting to fine-tuning the model. More precise instructions and directions in prompts help weed out bad answers.

- Fine-tuning: if nothing else helps, the last resort might be to fine-tune a model. That is expensive and time-consuming, and should be weighed against all the other options above.
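
As an illustration of the search method mentioned in the list, here is a minimal sketch of greedy MMR in plain Python. It is not LangChain's implementation, which may differ in detail. The vectors are made up, and `lambda_mult` is a parameter of this sketch that trades relevance (1.0) against diversity (0.0).

```python
import math

def cos(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR: return indices of k documents balancing relevance and diversity."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cos(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.1], [1.0, 0.1], [0.6, -0.8]]  # documents 0 and 1 are duplicates

picked = mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5)        # skips the duplicate
by_relevance = mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0)  # pure relevance ranking
print(picked, by_relevance)
```

With pure relevance ranking, both duplicate chunks are returned and the second one adds nothing to the prompt. With the diversity penalty, the second slot goes to a different chunk instead.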

