<a href="https://colab.research.google.com/github/smthomas1704/restoration-rag/blob/main/search_with_local_llama2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!git clone https://github.com/smthomas1704/restoration-rag.git

In [None]:
!pip install -r restoration-rag/requirements.txt

In [None]:
# Build llama-cpp-python with cuBLAS support so that model layers can be offloaded to the GPU.
# The versions are pinned; 0.1.78 is used because the model downloaded below is a GGML file.
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.78 numpy==1.23.4 --force-reinstall --upgrade --no-cache-dir --verbose
!pip install huggingface_hub


In [None]:
model_name_or_path = "TheBloke/Llama-2-7B-chat-GGML"
model_basename = "llama-2-7b-chat.ggmlv3.q5_1.bin" # the model is in bin format

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)

In [11]:
from langchain.llms import LlamaCpp
from llama_cpp import Llama
from langchain.chains import LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.prompts import PromptTemplate

# Stream tokens to stdout so the answer appears as Llama generates it
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

llm = LlamaCpp(
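    model_path=model_path, # local GGML file downloaded above from https://huggingface.co/TheBloke/Llama-2-7B-chat-GGML
    temperature=0.7, # sampling temperature; lower values give more deterministic answers
    top_p=0.1, # low nucleus-sampling cutoff, keeps only the most likely tokens
    n_ctx=6000, # context window in tokens; note that Llama 2 was trained with a 4096-token context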
    callback_manager=callback_manager,
    verbose=True,
)

# lcpp_llm = Llama(
#     model_path=model_path,
#     n_threads=2, # CPU cores
#     n_batch=512, # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
#     n_gpu_layers=32 # Change this value based on your model and your GPU VRAM pool.
#   )

# See the number of layers in GPU
# print(lcpp_llm.params.n_gpu_layers)


# question = "who wrote the book Innovator's dilemma?"
# This runs very slowly because the model generates the answer one token at a time.
# No n_gpu_layers is passed to LlamaCpp above, so inference likely stays on the CPU.
# Last time it took more than 10 minutes.
# answer = llm(question)

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
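
The `temperature` and `top_p` values passed to `LlamaCpp` above interact with each other, so here is a small pure-Python sketch of what they do. It is a simplified illustration, not llama.cpp's actual sampler, and the toy logits are made up. Temperature rescales the scores before the softmax, and top-p keeps only the most likely tokens until their cumulative probability reaches the cutoff.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Return indices of the most likely tokens whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four candidate tokens
probs = softmax_with_temperature(toy_logits, temperature=0.7)
candidates = top_p_filter(probs, top_p=0.1)
print(candidates)
```

In this toy case the most likely token alone already exceeds the 0.1 cutoff, so only one candidate survives. With such a low `top_p`, sampling is close to greedy and the temperature has little visible effect.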


This next section splits our PDF files into smaller chunks and generates embeddings for them. The embeddings are stored in a vector database (FAISS) and queried through LangChain.

Let's start by loading the PDFs and splitting them into chunks.

In [12]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

DATA_PATH = 'restoration-rag/data' #Your root data folder path
DB_FAISS_PATH = 'vectorstore/db_faiss'

loader = PyPDFDirectoryLoader(DATA_PATH)
documents = loader.load()

print(len(documents))
print(documents[0].page_content[0:100])

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=10)
splits = text_splitter.split_documents(documents)

print(splits[31])


38
Ecol Solut Evid.  2023;4:e12246.	 		 	 | 1 of 11
https://doi.org/10.1002/2688-8319.12246
wileyonline
page_content='production model of knowledge production assumes reciprocal knowledge flow between science and practice; conservation practitioners, \nscientists and other stakeholders (i.e. the people invested in and affected by conservation decisions) jointly create actionable knowledge \nby working together to define research needs, set research agendas, implement research and generate products (e.g. data, publications,' metadata={'source': 'restoration-rag/data/20230400982.pdf', 'page': 2}
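
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-window chunking with overlap. This is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and newlines, which this toy function (`chunk_with_overlap`, a hypothetical helper) does not do.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows; each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=2)
print(pieces)
```

Each piece starts with the last two characters of the previous one. The overlap of 10 characters used above plays the same role: it keeps a little shared text across chunk boundaries.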


In [13]:
embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2',
                                       model_kwargs={'device': 'cpu'})

db = FAISS.from_documents(splits, embeddings)
db.save_local(DB_FAISS_PATH)


# If we ever need to load this back up from the files, this is the code
# embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2",
#                                        model_kwargs={'device': 'cpu'})
# db = FAISS.load_local(DB_FAISS_PATH, embeddings)
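
Conceptually, querying the vector store means embedding the question and returning the chunks whose vectors lie closest to it. The sketch below shows that idea with toy 3-dimensional vectors and Euclidean (L2) distance. The vectors and texts are made up, `top_k` is a hypothetical helper, and the assumption that the index compares vectors by L2 distance should be checked against the LangChain and FAISS versions in use. Real `all-MiniLM-L6-v2` embeddings have 384 dimensions.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vec, indexed, k):
    """Return the k (text, distance) pairs closest to query_vec, nearest first."""
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Toy "index": (chunk text, embedding) pairs with made-up values
toy_index = [
    ("restoration priorities", [0.9, 0.1, 0.0]),
    ("cost-effectiveness", [0.7, 0.3, 0.1]),
    ("unrelated topic", [0.0, 0.2, 0.9]),
]
toy_query = [1.0, 0.0, 0.0]
nearest = top_k(toy_query, toy_index, k=2)
print(nearest)
```

A brute-force scan like this is fine for a few dozen PDFs; FAISS exists to do the same lookup efficiently on much larger collections.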

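
The next cell wires the retriever and the LLM into a `RetrievalQA` chain. With the default "stuff" strategy, the retrieved chunks are joined together and substituted into the prompt template along with the question. Here is a minimal sketch of that assembly step, using a shortened stand-in template and made-up chunks; `build_stuff_prompt` is a hypothetical helper, not a LangChain function.

```python
# Shortened stand-in for the real prompt template (an assumption for illustration)
toy_template = (
    "[INST]Use the following context to answer the question.\n"
    "{context}\n"
    "Question: {question} [/INST]"
)

def build_stuff_prompt(chunks, question, template, separator="\n\n"):
    """Join the retrieved chunks and fill the template's {context} and {question} slots."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)

toy_chunks = [
    "Chunk about prioritization models.",
    "Chunk about cost-effectiveness.",
]
prompt = build_stuff_prompt(
    toy_chunks,
    "How to prioritize areas for ecological restoration",
    toy_template,
)
print(prompt)
```

Because every retrieved chunk lands in a single prompt, the number of chunks (`k`) times the chunk size has to fit inside the model's context window together with the question and the answer.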
In [None]:
import langchain
from langchain.chains import RetrievalQA
from langchain.prompts.prompt import PromptTemplate

# Print every intermediate chain step (inputs and outputs) while the query runs
langchain.debug = True

template = """
[INST]Use the following pieces of context to answer the question. If no context is provided, answer like an AI assistant.
{context}
Question: {question} [/INST]
"""

retriever = db.as_retriever(
        search_kwargs={"k": 6}
    )

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type_kwargs={
        "prompt": PromptTemplate(
            template=template,
            input_variables=["context", "question"],
        ),
    }
)

result = qa_chain({"query": "How to prioritize areas for ecological restoration"})
print(result)

[chain/start] [1:chain:RetrievalQA] Entering Chain run with input:
{
  "query": "How to prioritize areas for ecological restoration"
}
[chain/start] [1:chain:RetrievalQA > 3:chain:StuffDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [1:chain:RetrievalQA > 3:chain:StuffDocumentsChain > 4:chain:LLMChain] Entering Chain run with input:
{
  "question": "How to prioritize areas for ecological restoration",
  "context": "Climent-  Gil, E., Derak, M., López, G., Bonet, A., Aledo, A., & \nCortina-  Segarra, J. (2023). Prioritizing areas for ecological \nrestoration: A participatory approach based on cost-  \neffectiveness. Journal of Applied Ecology , 60, 1194–1205. \nhttps://doi.org/10.1111/1365-2664.14395\n\nLandscape-  scale prioritization models reflect alternative ap -\nproaches to assessing the effectiveness of restoration actions. \nThese models have used multiple criteria to define priority