# LLaMA.CPP-QA

This project performs document-based question answering with llama.cpp. Given a collection of documents in PDF or plain-text format, the model answers questions and returns the source passages behind each answer, using the tools provided by LangChain.

-------------------------------------------------------------------------------------------------------------------------------

Install the required Python packages.

**Note**: Restart the runtime after this cell has run so that the changes take effect. The restart is needed because of conflicting numpy versions.

In [None]:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade llama-cpp-python
!pip install --upgrade transformers langchain chromadb sentence_transformers huggingface_hub PyMuPDF numpy

Download the quantized model from Hugging Face.

In [None]:
!mkdir -p models

## The following are the models that have been tested and produced good results.

# # PuddleJumper 13B (detailed answers)
# !wget -P models/ https://huggingface.co/TheBloke/PuddleJumper-13B-GGUF/resolve/main/puddlejumper-13b.Q6_K.gguf

# # llama-2-13B-chat-limarp-v2-merged-GGUF (shorter answers)
# !wget -P models/ https://huggingface.co/TheBloke/llama-2-13B-chat-limarp-v2-merged-GGUF/resolve/main/llama-2-13b-chat-limarp-v2-merged.Q6_K.gguf

# # MythoMax-L2-Kimiko-v2-13B-GGUF (shorter answers - similar to llama-2-13B-chat-limarp-v2-merged-GGUF)
# !wget -P models/ https://huggingface.co/TheBloke/MythoMax-L2-Kimiko-v2-13B-GGUF/resolve/main/mythomax-l2-kimiko-v2-13b.Q6_K.gguf

# # Athena-v1-GGUF (shorter answers, but more detailed than previous two - details not always correct)
# !wget -P models/ https://huggingface.co/TheBloke/Athena-v1-GGUF/resolve/main/athena-v1.Q6_K.gguf

# # TheBloke/Camel-Platypus2-13B-GGUF (similar to PuddleJumper - accurate numbers (maybe))
# !wget -P models/ https://huggingface.co/TheBloke/Camel-Platypus2-13B-GGUF/resolve/main/camel-platypus2-13b.Q6_K.gguf

# # EverythingLM-13b-V2-16K-GGUF
# !wget -P models/ https://huggingface.co/TheBloke/EverythingLM-13b-V2-16K-GGUF/resolve/main/everythinglm-13b-v2-16k.Q6_K.gguf

# # Chronohermes-Grad-L2-13B-GGUF (similar to PuddleJumper)
# !wget -P models/ https://huggingface.co/TheBloke/Chronohermes-Grad-L2-13B-GGUF/resolve/main/chronohermes-grad-l2-13b.Q6_K.gguf

# # Redmond-Puffin-13B-GGUF
# !wget -P models/ https://huggingface.co/TheBloke/Redmond-Puffin-13B-GGUF/resolve/main/redmond-puffin-13b.Q6_K.gguf

# # Pygmalion-2-13B-GGUF
# !wget -P models/ https://huggingface.co/TheBloke/Pygmalion-2-13B-GGUF/resolve/main/pygmalion-2-13b.Q6_K.gguf

# # Mythalion-13B-GGUF
# !wget -P models/ https://huggingface.co/TheBloke/Mythalion-13B-GGUF/resolve/main/mythalion-13b.Q6_K.gguf

!wget -P models/ https://huggingface.co/TheBloke/llama2-22B-daydreamer-v2-GGUF/resolve/main/llama2-22b-daydreamer-v2.Q4_K_S.gguf


Import the required packages.

In [None]:
from langchain.llms import LlamaCpp
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.document_loaders import PyMuPDFLoader, TextLoader
from langchain.chains import RetrievalQA
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

from multiprocessing import Pool
from tqdm import tqdm
from typing import List
import os
import glob
import shutil

Mount Google Drive in Colab. This step is optional; it is needed only if you load the documents from Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


Specify the paths of the documents directory (source_dir), the Chroma database directory (persist_directory), which is created once the documents are loaded, and the quantized LLaMA model (model_path).

In [None]:
source_dir = "drive/MyDrive/docs"
persist_directory = "db"
model_path = "/content/models/llama2-22b-daydreamer-v2.Q4_K_S.gguf"

Load the documents from the specified directory and store them in a Chroma database. The supported file formats are .pdf and .txt.

In [None]:
LOADER_MAPPING = {
    ".pdf": (PyMuPDFLoader, {}),
    ".txt": (TextLoader, {"encoding": "utf8"}),
}

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

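def load_single_document(file_path: str) -> List[Document]:
    """Load one file with the loader registered for its extension."""
    # os.path.splitext copes with paths that have no extension or dots in directory names.
    ext = os.path.splitext(file_path)[1].lower()
    if ext in LOADER_MAPPING:
        loader_class, loader_args = LOADER_MAPPING[ext]
        loader = loader_class(file_path, **loader_args)
        return loader.load()

    raise ValueError(f"Unsupported file extension '{ext}'")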


all_files = []
for ext in LOADER_MAPPING:
    all_files.extend(
        glob.glob(os.path.join(source_dir, f"**/*{ext}"), recursive=True)
    )

with Pool(processes=os.cpu_count()) as pool:
    documents = []
    with tqdm(
        total=len(all_files), ncols=80
    ) as pbar:
        for docs in pool.imap_unordered(load_single_document, all_files):
            documents.extend(docs)
            pbar.update()



if not documents:
    print("No documents to load")
else:
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500, chunk_overlap=50
    )
    texts = text_splitter.split_documents(documents)

    # Start from a clean database so chunks from earlier runs are not mixed in.
    if os.path.exists(persist_directory):
        shutil.rmtree(persist_directory)


    db = Chroma.from_documents(
        texts,
        embeddings,
        persist_directory=persist_directory,
    )
    db.persist()

100%|█████████████████████████████████████████████| 5/5 [00:00<00:00, 10.87it/s]
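
To make the effect of chunk_size and chunk_overlap concrete, here is a minimal sketch of overlapping chunking in plain Python. It is a simplification: `chunk_text` is a hypothetical helper that cuts fixed-size character windows, whereas RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and newlines, so the real chunk boundaries will differ.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Toy input: 1200 characters of repeating digits (illustrative only).
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
print(chunks[0][-50:] == chunks[1][:50])
```

With the values used above, each chunk repeats the last 50 characters of the previous one, so a sentence that straddles a boundary is still seen whole in at least one chunk.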


Open the persisted database as a vector store and wrap it in an index.

In [None]:
vectorstore = Chroma(embedding_function=embeddings, persist_directory=persist_directory)
index = VectorStoreIndexWrapper(vectorstore=vectorstore)
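
Conceptually, the vector store answers a query by embedding it and returning the stored chunks whose vectors are closest to it. The sketch below illustrates that idea with toy 3-dimensional vectors and cosine similarity. The helper names and the vectors are made up for illustration, and Chroma's actual distance metric and index structure may differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          chunk_vecs: List[Sequence[float]],
          k: int = 4) -> List[Tuple[int, float]]:
    """Return (chunk index, similarity) pairs for the k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings; real all-MiniLM-L6-v2 vectors have many more dimensions.
toy_chunks = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.0, 0.0]
best = top_k(toy_query, toy_chunks, k=2)
print([index for index, _ in best])
```

The two chunks pointing in nearly the same direction as the query come back first; the retriever used later in this notebook plays the same role for the real embeddings.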

Load the model with LlamaCpp. Adjust the n_gpu_layers and n_batch parameters to match your GPU. The following values work well with the default GPU provided by Google Colab.

**Note**: To disable token-wise streaming, comment out the **callbacks** parameter of the model.

In [None]:
n_gpu_layers = 40  # Change this value based on your model and your GPU VRAM pool.
n_batch = 64  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

llm = LlamaCpp(
  model_path=model_path,
  max_tokens=8000,
  n_batch=n_batch,
  n_gpu_layers=n_gpu_layers,
  callbacks=[StreamingStdOutCallbackHandler()],
  n_ctx=8000,
)
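
Keep in mind that the prompt (question plus retrieved chunks) and the generated answer share the same context window of n_ctx tokens, so max_tokens=8000 is an upper bound that can never be reached in practice. The arithmetic can be sketched as follows; `generation_budget` is a hypothetical helper, and how llama-cpp-python itself reacts to an oversized request depends on the installed version.

```python
def generation_budget(n_ctx: int, prompt_tokens: int, max_tokens: int) -> int:
    """Return how many tokens can still be generated once the prompt is in the context."""
    remaining = n_ctx - prompt_tokens
    if remaining <= 0:
        raise ValueError("The prompt alone fills the context window")
    return min(max_tokens, remaining)


# 600 is a hypothetical prompt size (question plus retrieved chunks), not a measured value.
print(generation_budget(n_ctx=8000, prompt_tokens=600, max_tokens=8000))
print(generation_budget(n_ctx=8000, prompt_tokens=600, max_tokens=256))
```

In the first call the answer is capped at 7400 tokens by the context window; in the second, max_tokens is the binding limit.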

Create a chain with LangChain. The **chain_type** parameter can be "stuff", "refine", "map_reduce", or "map_rerank". For document-based question answering, the model produces better results with "stuff", which places all retrieved chunks into a single prompt.

For more information, please refer to the LangChain documentation: https://python.langchain.com/docs/modules/chains/document/

In [None]:
qa = RetrievalQA.from_chain_type(
  llm=llm,
  chain_type="stuff",
  retriever=vectorstore.as_retriever(),
  return_source_documents=True,
)
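
As a rough illustration of what the "stuff" chain type does, the sketch below joins the retrieved chunks into one context block and fills a prompt template with it. The template text and the `build_stuff_prompt` helper are assumptions made for this example; LangChain's built-in prompt is worded differently.

```python
from typing import List

# Hypothetical template, for illustration only.
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_stuff_prompt(question: str, chunks: List[str]) -> str:
    """Place every retrieved chunk into a single prompt, separated by blank lines."""
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Toy chunks standing in for the retriever's output.
toy_chunks = ["Chunk A about solar panels.", "Chunk B about recycling."]
prompt = build_stuff_prompt("What is recycled?", toy_chunks)
print(prompt)
```

Because everything goes into one prompt, the model is called only once per question, but the retrieved chunks and the question must fit within n_ctx together with the answer.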

Query the model.

In [None]:
question = "What are some interesting facts about photovoltaic recycling?"

res = qa(question)
answer, docs = (
    res["result"],
    res["source_documents"],
)

print("\n\n> Question:")
print(question)
print("\n> Answer:")
print(answer)

for document in docs:
    print("\n> " + document.metadata["source"] + ":")
    print(document.page_content)
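
Several retrieved chunks often come from the same file, so the loop above can print the same source more than once. If you only need the list of files behind an answer, a small helper along these lines may be handy. `unique_sources` is a hypothetical name, and the sketch assumes each metadata dictionary carries a "source" key, as in the loop above.

```python
from typing import Dict, Iterable, List


def unique_sources(metadatas: Iterable[Dict[str, str]]) -> List[str]:
    """Return each source path once, in the order it first appears."""
    seen: List[str] = []
    for metadata in metadatas:
        source = metadata.get("source", "<unknown>")
        if source not in seen:
            seen.append(source)
    return seen


# Toy metadata standing in for document.metadata of the retrieved chunks.
toy_metadata = [
    {"source": "docs/a.pdf"},
    {"source": "docs/b.txt"},
    {"source": "docs/a.pdf"},
]
print(unique_sources(toy_metadata))
```

With the query result from the previous cell, it could be called as `unique_sources(d.metadata for d in docs)`.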
