# Create AI apps with LLMs

This notebook shows ways to experiment quickly with Large Language Models (LLMs) and 
Retrieval Augmented Generation (RAG), which can then be integrated into a UI or app.

To put things together, LangChain is a very useful framework that integrates many different providers 
(LLMs, vector databases, agents...).

## RAG: Retrieval Augmented Generation

LLMs only know about things they were trained on. They cannot know about everything, especially not 
about documents and data from private sources, or content published after the model was trained.

To generate accurate answers about specific content, that content needs to be passed to the LLM as part of the prompt.
However, there is a major problem: although LLMs have a large context window compared to other types of NLP models, the window 
is not unlimited. Typical window sizes are 1024, 4096, and up to 32000 tokens, which is often too small for even medium-size documents.
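
As a rough illustration of the size problem, the sketch below estimates whether a document fits in a given context window. The ratio of 4 characters per token and the 256 tokens reserved for the answer are assumptions made for this example only; real counts depend on the model's tokenizer.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate (assumed ratio; real counts depend on the tokenizer)."""
    return -(-len(text) // chars_per_token)  # ceiling division


def fits_in_context(text, context_window, reserved_for_answer=256):
    """True if the estimated prompt size plus room for the answer fits in the window."""
    return estimate_tokens(text) + reserved_for_answer <= context_window


document = "word " * 40_000  # hypothetical 200,000-character document

for window in (1024, 4096, 32000):
    print(window, fits_in_context(document, window))
```

With these assumptions, the hypothetical document does not fit in any of the three window sizes, which is why only the relevant parts should be sent to the model.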
                                                                                                     
The solution is to index the content, and provide only relevant context to the LLM.

To do this, the content is chunked into small pieces of text. For each piece, an embedding vector is created and stored in
a vector store. Upon querying the data, an embedding of the query is created, and the vector store is searched for similar content.
The top N pieces of relevant content are retrieved and plugged into the prompt for the LLM to answer the query.
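
The steps above can be sketched with the standard library alone. This is a toy illustration, not what LangChain does internally: the `embed` function is a bag-of-words stand-in for a real embedding model, the "vector store" is a plain list, and the chunks are made up for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical chunks standing in for the pieces of an indexed document.
chunks = [
    "RAG blends text generation with information retrieval.",
    "Open source models are becoming more capable.",
    "Multimodal models can process text, images and audio.",
]

# Index step: store each chunk together with its vector.
index = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query, top_n=2):
    """Return the top_n chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]


query = "What is retrieval augmented generation?"
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```

A real embedding model captures meaning rather than word overlap, but the flow is the same: embed the chunks, embed the query, rank by similarity, and build the prompt from the best matches.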

### Import some data to query

In this example, we retrieve a recent article from the web.

In [49]:
from langchain_community.document_loaders import WebBaseLoader

# for other types of documents, use:
# from langchain_community.document_loaders import UnstructuredHTMLLoader
# from langchain_community.document_loaders import TextLoader
# from langchain_community.document_loaders.csv_loader import CSVLoader
# from langchain_community.document_loaders import JSONLoader
# from langchain_community.document_loaders import UnstructuredMarkdownLoader
# from langchain_community.document_loaders import PyPDFLoader
# see: https://python.langchain.com/docs/modules/data_connection/document_loaders/ for more info

url = "https://www.techtarget.com/searchenterpriseai/tip/9-top-AI-and-machine-learning-trends"
loader = WebBaseLoader(url)
data = loader.load()

### Chunk the article into smaller manageable pieces

In [50]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100, 
    chunk_overlap=10, 
    length_function=len, 
)
all_splits = text_splitter.split_documents(data)

In [51]:
len(all_splits)

357
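
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window chunker. It is only a sketch: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraphs, lines and spaces, so its chunks are usually shorter than `chunk_size` and do not all have the same length. The function name `chunk_text` is made up for this example.

```python
def chunk_text(text, chunk_size=100, chunk_overlap=10):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


# Hypothetical 250-character document made of repeating digits.
sample = "".join(str(i % 10) for i in range(250))
pieces = chunk_text(sample, chunk_size=100, chunk_overlap=10)
print([len(p) for p in pieces])
print(pieces[0][-10:] == pieces[1][:10])
```

The overlap means that a sentence cut at a chunk boundary still appears partly in the next chunk, at the cost of storing some text twice.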

### Load an embedding model

Note that the same embedding model must be used to embed both the pieces of text from the article and, later, the query to be answered. Otherwise the vectors are not comparable and the similarity search does not work.

In [52]:
from langchain_community.embeddings import HuggingFaceEmbeddings
import os

# On a Mac M1/M2, set the environment variable below. Without it, PyTorch may raise:
# NotImplementedError: The operator 'aten::cumsum.out' is not currently implemented for the MPS device.
# (see https://github.com/pytorch/pytorch/issues/77764)
# Setting `PYTORCH_ENABLE_MPS_FALLBACK=1` uses the CPU as a fallback for this op.
# WARNING: this will be slower than running natively on MPS.

os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'mps'}  # 'mps' is for Apple Silicon; use 'cuda' or 'cpu' on other hardware
encode_kwargs = {'normalize_embeddings': False}
embedding = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
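
The `normalize_embeddings` flag controls whether the vectors are scaled to unit length. The small sketch below, using made-up 2D vectors, shows why this can matter: the dot product of unnormalized vectors depends on their length, while for unit-length vectors it equals the cosine similarity. Whether it changes the ranking in practice depends on the distance metric the vector store uses.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [6.0, 8.0]   # same direction as a, twice as long
c = [4.0, -3.0]  # perpendicular to a

print(dot(a, b), cosine(a, b))               # raw dot product grows with vector length
print(dot(normalize(a), normalize(b)))       # matches the cosine similarity
print(cosine(a, c))
```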

### Ingest the data into the vector store

LangChain's vector store interface takes care of embedding each piece of text and storing it in the DB.

Here we use ChromaDB, a local vector database based on SQLite. 

Note that we pass both the split texts and the embedding function.

In [53]:
from langchain_community.vectorstores import Chroma

vectorstore = Chroma.from_documents(
    documents=all_splits,
    embedding=embedding,
    # persist_directory="./"  # if you want to persist the DB locally, and not have to reindex each time
)

### Test the vector store retrieval on a question

In [54]:
question = "What is RAG?"
docs = vectorstore.similarity_search_with_score(question, k=3)


In [55]:
docs

[(Document(page_content='"You can use RAG to go gather a ton of unstructured information, documents, etc., [and] feed it into a model without having to fine-tune or custom-train a model," Barrington said.\nThese benefits are particularly enticing for enterprise applications where up-to-date factual knowledge is crucial. For example, businesses can use RAG with foundation models to create more efficient and informative chatbots and virtual assistants.', metadata={'description': 'Discover the top 10 machine learning and AI trends for 2024 that are shaping technology and business, including multimodal, open source and customization.', 'language': 'en', 'source': 'https://www.techtarget.com/searchenterpriseai/tip/9-top-AI-and-machine-learning-trends', 'title': '10 top AI and machine learning trends for 2024 | TechTarget'}),
  0.8618921637535095),
 (Document(page_content='"You can use RAG to go gather a ton of unstructured information, documents, etc., [and] feed it', metadata={'description
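
The score returned with each document is a distance, so lower values mean closer matches. A simple way to use it is to drop results that are too far from the query. The sketch below uses made-up `(text, distance)` pairs standing in for the `(Document, score)` tuples, and the cutoff of 1.0 is an arbitrary value chosen for the example, not a recommended setting.

```python
# Hypothetical (text, distance) pairs standing in for (Document, score) tuples.
results = [
    ("chunk about multimodal AI", 1.52),
    ("chunk about RAG", 0.86),
    ("chunk about open source models", 1.35),
]


def filter_by_distance(pairs, max_distance):
    """Keep pairs whose distance is at most max_distance, closest first."""
    kept = [(text, dist) for text, dist in pairs if dist <= max_distance]
    return sorted(kept, key=lambda pair: pair[1])


relevant = filter_by_distance(results, max_distance=1.0)
print(relevant)
```

A suitable cutoff depends on the embedding model and the distance metric, so it is best chosen by looking at scores for a few known good and bad matches.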

### Set up the LLM

Below I use LlamaCpp as a local LLM engine so everything runs locally. 
I use the 13B-parameter chat version of Llama 2, quantized to 4 bits, which takes a lot less space than the full-precision model.

However, even then it requires a decent Nvidia GPU or an M1/M2 Mac. Alternatively, you can use a service like OpenAI.

Note that on Mac, llama-cpp-python needs to be compiled with special flags. See the README for more details.

In [56]:
from langchain_community.llms import LlamaCpp
model_file = '/Users/emmanuel/workspace/models/llms/llama-2-13b-chat.Q4_0.gguf'
# downloaded from https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF
# you may also try a smaller model:
# https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF
llm = LlamaCpp(
    model_path=model_file,
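    n_ctx=4096,  # context window
    # verbose=True,
    device='mps',  # not a LlamaCpp field: it is moved to model_kwargs (see the warning below)
    # model_kwargs={'device':'mps'},
    n_gpu_layers=1  # number of layers to offload to the GPU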
)

                device was transferred to model_kwargs.
                Please confirm that device is what you intended.
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /Users/emmanuel/workspace/models/llms/llama-2-13b-chat.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   4:                          llama.block_count u32              = 40
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   6: 

### To use OpenAI

In [57]:
#!pip install langchain-openai

In [58]:
#from langchain_openai import OpenAI
#llm = OpenAI(openai_api_key="...")

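### Set up a retrieval chain

`RetrievalQA` is another useful LangChain abstraction: it retrieves the most relevant chunks from the vector store and passes them to the LLM together with the question.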

In [59]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(llm,
                                       retriever=vectorstore.as_retriever(search_kwargs={'k':10}),
                                       return_source_documents=True)
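
Conceptually, such a chain does three things: retrieve chunks, stuff them into a prompt, and call the model. The sketch below mimics that flow with plain functions. It is not LangChain's implementation: `make_qa_chain`, `fake_retrieve` and `fake_llm` are hypothetical names, and the prompt wording is made up rather than LangChain's default prompt.

```python
def make_qa_chain(retrieve, llm, k=3):
    """Build a function that retrieves k chunks, stuffs them into a prompt and calls llm.

    retrieve(query, k) returns a list of strings; llm(prompt) returns a string.
    """
    def run(query):
        docs = retrieve(query, k)
        context = "\n\n".join(docs)
        prompt = (
            "Use the following context to answer the question.\n\n"
            f"{context}\n\nQuestion: {query}\nAnswer:"
        )
        return {"query": query, "result": llm(prompt), "source_documents": docs}

    return run


# Stand-ins so the sketch runs without a model or a vector store.
def fake_retrieve(query, k):
    corpus = ["RAG blends generation with retrieval.", "Unrelated chunk."]
    return corpus[:k]


def fake_llm(prompt):
    return f"(model output for a prompt of {len(prompt)} characters)"


qa = make_qa_chain(fake_retrieve, fake_llm, k=1)
answer = qa("What is RAG?")
print(answer["source_documents"])
```

Returning the source documents alongside the result, as `return_source_documents=True` does, makes it possible to check which chunks the answer was based on.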

In [60]:
qa_chain.invoke({'query': question})


llama_print_timings:        load time =     791.79 ms
llama_print_timings:      sample time =       3.77 ms /    36 runs   (    0.10 ms per token,  9559.21 tokens per second)
llama_print_timings: prompt eval time =   25280.58 ms /   366 tokens (   69.07 ms per token,    14.48 tokens per second)
llama_print_timings:        eval time =    3600.72 ms /    35 runs   (  102.88 ms per token,     9.72 tokens per second)
llama_print_timings:       total time =   28985.01 ms /   401 tokens


{'query': 'What is RAG?',
 'result': ' RAG stands for "Read, Ask, Generate" and is a technique for reducing hallucinations in language models by blending text generation with information retrieval.',
 'source_documents': [Document(page_content='"You can use RAG to go gather a ton of unstructured information, documents, etc., [and] feed it into a model without having to fine-tune or custom-train a model," Barrington said.\nThese benefits are particularly enticing for enterprise applications where up-to-date factual knowledge is crucial. For example, businesses can use RAG with foundation models to create more efficient and informative chatbots and virtual assistants.', metadata={'description': 'Discover the top 10 machine learning and AI trends for 2024 that are shaping technology and business, including multimodal, open source and customization.', 'language': 'en', 'source': 'https://www.techtarget.com/searchenterpriseai/tip/9-top-AI-and-machine-learning-trends', 'title': '10 top AI an

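### Improving results

The answer above expands the acronym incorrectly: RAG stands for Retrieval Augmented Generation, not "Read, Ask, Generate".
We might be able to improve the results with a more specific prompt.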

In [61]:
from langchain.prompts import PromptTemplate
template = """
You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 
Do not attempt to define what acronyms stand for unless the definition was explicitly provided in the context.
Use three sentences maximum and keep the answer concise.
Question: {question} 
Context: {context} 
Answer:
"""
rag_prompt = PromptTemplate.from_template(template)

In [62]:
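from langchain_core.runnables import RunnablePassthrough

# Note: 'context' is filled by running the whole qa_chain (retrieval plus a first LLM call),
# which is why two generations appear in the log below.
# Passing a retriever, e.g. vectorstore.as_retriever(), would supply only the retrieved chunks.
rag_chain = (
    {'context': qa_chain, 'question': RunnablePassthrough()}
    | rag_prompt
    | llm
)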

ggml_metal_free: deallocating


In [63]:
rag_chain.invoke(question)

Llama.generate: prefix-match hit

llama_print_timings:        load time =     791.79 ms
llama_print_timings:      sample time =       6.53 ms /    73 runs   (    0.09 ms per token, 11186.03 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =    6984.72 ms /    73 runs   (   95.68 ms per token,    10.45 tokens per second)
llama_print_timings:       total time =    7085.44 ms /    74 tokens
Llama.generate: prefix-match hit

llama_print_timings:        load time =     791.79 ms
llama_print_timings:      sample time =       6.85 ms /    74 runs   (    0.09 ms per token, 10806.07 tokens per second)
llama_print_timings: prompt eval time =  131379.85 ms /  1704 tokens (   77.10 ms per token,    12.97 tokens per second)
llama_print_timings:        eval time =    9563.49 ms /    73 runs   (  131.01 ms per token,     7.63 tokens per second)
llama_print_timings:       to

'RAG stands for Relevance-Aware Generation, a technique used to reduce hallucinations in AI-generated content by blending text generation with information retrieval. It enhances the accuracy and relevance of AI-generated content, enables LLMs to access external information, and reduces model size, increasing speed and lowering costs.'