<a href="https://colab.research.google.com/github/smthomas1704/restoration-rag/blob/main/functional_trait_rag_with_jina_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Download
Clone the repository, install the dependencies, and download the zipped literature file from Google Drive.

In [None]:
!git clone https://github.com/smthomas1704/restoration-rag.git

!pip install -r restoration-rag/requirements.txt
!pip install huggingface_hub
!pip install llama-cpp-python==0.1.78
!pip install numpy==1.23.4
!pip install gdown==v4.6.3
!pip install openai
!pip install langchain_experimental

!gdown "https://drive.google.com/file/d/10_inKhFuY5O8Sel88ZvlTqODsbbh4ula/view?usp=drive_link" -O /content/functional_trait_literature_zipped.zip --fuzzy

!unzip /content/functional_trait_literature_zipped.zip

# Chunk and store

In this portion, we'll chunk all the files into smaller paragraphs and use those chunks to generate embeddings. We'll store the chunks separately so we can reuse them later without having to re-download all the literature.
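As a rough illustration of the chunk-and-store idea, here is a minimal sketch using only the standard library. It is not the pipeline that produced the uploaded chunks: the fixed-size character windows, the `chunk_size` and `overlap` values, and the `"<paper>.<chunk>"` id format are all assumptions made for the example. The record fields (`id`, `title`, `page_content`) mirror the ones read later in this notebook.

```python
import json
import tempfile
from pathlib import Path


def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into overlapping character windows (sizes are illustrative)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


def build_records(title, paper_index, text, chunk_size=200, overlap=40):
    """Wrap each chunk in a dict; the id format here is hypothetical."""
    return [
        {"id": f"{paper_index}.{i}", "title": title, "page_content": chunk}
        for i, chunk in enumerate(chunk_text(text, chunk_size, overlap))
    ]


def save_chunks(records, path):
    """Store all records as one JSON array so they can be reloaded later."""
    Path(path).write_text(json.dumps(records, ensure_ascii=False), encoding="utf-8")


def load_chunks(path):
    """Reload records written by save_chunks."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Example: chunk one made-up "paper" and round-trip it through a file.
records = build_records("Example paper", 1, "Tropical dry forests " * 30)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "chunks.json"
    save_chunks(records, path)
    restored = load_chunks(path)
print(len(restored), restored[0]["id"])
```

Storing the chunks once means the embedding step can be re-run with different models without touching the PDFs again.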

### References/Notes related to chunking
1. https://openai.com/blog/new-and-improved-embedding-model
2. At the time of that post, OpenAI recommended text-embedding-ada-002 as its general-purpose model for text embedding generation.
3. https://www.pinecone.io/learn/chunking-strategies/
4. https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf#using-pypdf

### Possible strategies for chunking
1. LaTeX: LaTeX is a document preparation system and markup language often used for academic papers and technical documents. By parsing the LaTeX commands and environments, you can create chunks that respect the logical organization of the content (e.g., sections, subsections, and equations), leading to more accurate and contextually relevant results.
2. A LaTeX-based splitter takes a string as input, so we will need to read each document into a string first.
3. Most of these academic papers are written in LaTeX and then converted to PDF. Our best bet would be to convert the PDF to LaTeX format first and then use the LaTeX-based chunker. This way, paragraphs and related information stay together and in context.
4. On the other hand, there is no guarantee that a given document was originally written in LaTeX.
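To make the LaTeX strategy above concrete, here is a minimal sketch that splits a LaTeX string at sectioning commands. It is a simple regex-based stand-in written for illustration, not LangChain's `LatexTextSplitter` and not what this notebook actually uses; the function name `split_latex_by_section` and the sample text are made up.

```python
import re

# Matches \section{...}, \subsection{...} and \subsubsection{...},
# including the starred (unnumbered) variants.
SECTION_RE = re.compile(r"\\(?:sub){0,2}section\*?\{([^}]*)\}")


def split_latex_by_section(source):
    """Return (heading, body) pairs; text before the first heading gets heading None."""
    matches = list(SECTION_RE.finditer(source))
    chunks = []

    # Anything before the first sectioning command is kept as its own chunk.
    first_start = matches[0].start() if matches else len(source)
    preamble = source[:first_start].strip()
    if preamble:
        chunks.append((None, preamble))

    # Each section body runs until the next sectioning command (or end of input).
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(source)
        chunks.append((match.group(1), source[match.end():end].strip()))
    return chunks


sample = r"""Intro text.
\section{Methods}
We planted seedlings.
\subsection{Site}
Dry forest.
\section*{Results}
Survival varied."""

for heading, body in split_latex_by_section(sample):
    print(heading, "->", body)
```

Because each chunk carries its heading, related paragraphs stay together. The sketch ignores nested braces in headings and other LaTeX environments, which a real splitter would need to handle.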

In [28]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from huggingface_hub import hf_hub_download
from langchain.docstore.document import Document
import pandas as pd
import json

REPO_ID = "collaborativeearth/functional_trait_papers"
FILENAME = "function_trait_paper_small_chunks.jsonl"

# Download the pre-generated chunks from Hugging Face. We generated the chunks
# separately and uploaded them as a dataset.
file_name = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset", local_dir="/content/")

print(file_name)

# json.load parses the whole file as a single JSON document (a list of chunk
# records), even though the file carries a .jsonl extension.
with open(file_name, "r") as f:
  chunks = json.load(f)

# Wrap each chunk in a LangChain Document, keeping the paper title and chunk id as metadata.
prod_splits = [
  Document(
      page_content=chunk["page_content"],
      metadata={
          "source": chunk["title"],
          "id": chunk["id"]
      }
  )
  for chunk in chunks
]

print(prod_splits[0])
print(prod_splits[1])

/content/function_trait_paper_small_chunks.jsonl
page_content='Tropical landscapes have been extensively degraded and deforested, but large-scale passive and active restoration projects have catalysed secondary forest regeneration over the last few decades (Chazdon, 2014).Tropical dry forests (TDFs) have a strong dry season of at least 3-4 months where little to no rain falls (Murphy & Lugo, 1986), which distinguishes TDFs from tropical wet forests, and offers a unique hurdle to restoration projects.Globally, 97% of TDFs are threatened by anthropogenic processes (Miles et al., 2006), and in Central America it is estimated that only 1.7% of the original extent of TDF exists (Griscom & Ashton, 2011;Miles et al., 2006).Despite the fact that TDFs are critically endangered (Janzen, 2002), the restoration of TDFs has been studied minimally compared to wetter tropical forests (Meli, 2003).Following agricultural abandonment in Central America in the 1990s, large tracts of land became available

## Generate embeddings with different models for comparison
The first one is jina-embedding-l-en-v1.

In [29]:
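# -p avoids an error when the directory already exists (e.g. on re-runs).
!mkdir -p "/content/vectorstore"
DB_JINA_EMBEDDING_PATH = 'vectorstore/db_jina-embedding-l-en-v1'

# NOTE: this loads the base ("b") model, while the save path above is named after the
# large ("l") model. Use 'jinaai/jina-embedding-l-en-v1' here if the large model is intended.
embeddings = HuggingFaceEmbeddings(model_name='jinaai/jina-embedding-b-en-v1',
                                   model_kwargs={'device': 'cuda'})

# Embed every chunk and build an in-memory FAISS index.
db = FAISS.from_documents(prod_splits, embeddings)

# Persist the index twice: as raw serialized bytes and in FAISS's local folder format.
serialized_bytes = db.serialize_to_bytes()
with open("/content/vectorstore/serialized_db.txt", "wb") as binary_file:
    binary_file.write(serialized_bytes)

db.save_local(DB_JINA_EMBEDDING_PATH)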



In [30]:
search_query = "I am designing a tropical dry forest restoration in an open field with no remaining tree cover. Should I plant species with higher or lower wood density to maximize initial survival?"

# Top-level await works in a notebook cell. With no k given, the search returns the 4 closest chunks.
docs = await db.asimilarity_search(search_query)
print(len(docs))
for doc in docs:
  print(doc.metadata)

4
{'source': 'Below-ground traits mediate tree survival in a tropical dry forest restoration', 'id': '1.7.0'}
{'source': 'Species wood density and the location of planted seedlings drive early‐stage seedling survival during tropical forest restoration', 'id': '6.1.0'}
{'source': 'Below-ground traits mediate tree survival in a tropical dry forest restoration', 'id': '1.1.3'}
{'source': 'Using large‐scale tropical dry forest restoration to test successional theory', 'id': '11.17.1'}
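
One possible way to compare embedding models, as the heading of this section intends, is to check how often an expected chunk shows up in the top results for a set of test questions. The sketch below is an assumption about how that comparison could be done, not something this notebook implements. It uses plain cosine similarity on toy vectors as a stand-in for the FAISS index (which may use a different distance), and the names `top_k` and `hit_rate` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, doc_vecs, k=4):
    """Return the ids of the k documents most similar to the query vector."""
    ranked = sorted(doc_vecs, key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]), reverse=True)
    return ranked[:k]


def hit_rate(labelled_queries, doc_vecs, k=4):
    """Fraction of (query_vec, expected_id) pairs whose expected id is in the top k."""
    if not labelled_queries:
        return 0.0
    hits = sum(1 for vec, expected in labelled_queries if expected in top_k(vec, doc_vecs, k))
    return hits / len(labelled_queries)


# Toy 2-d "embeddings"; real vectors would come from each embedding model.
toy_docs = {"1.7.0": [1.0, 0.0], "6.1.0": [0.9, 0.1], "11.17.1": [0.0, 1.0]}
toy_queries = [([1.0, 0.0], "1.7.0"), ([0.0, 1.0], "1.7.0")]
print(top_k([1.0, 0.0], toy_docs, k=2))
print(hit_rate(toy_queries, toy_docs, k=1))
```

Running the same labelled questions against each model's vectors gives one number per model, which is easier to compare than eyeballing the metadata of individual results.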


In [25]:
from google.colab import userdata
from openai import OpenAI
from langchain.chains import ConversationalRetrievalChain, RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.vectorstores import DocArrayInMemorySearch, FAISS
from IPython.display import display, Markdown
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings, HuggingFaceInstructEmbeddings
from langchain.memory import ConversationBufferMemory
from langchain.indexes import VectorstoreIndexCreator
from langchain_experimental.agents.agent_toolkits.csv.base import create_csv_agent
from langchain.agents.agent_types import AgentType
from langchain.prompts import PromptTemplate
import tiktoken
import os

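# Read the API key from Colab's secret store and expose it to the OpenAI client.
OPENAI_API_KEY = userdata.get('COLABORATIVE_EARTH_KEY')
llm_model = "gpt-3.5-turbo"

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

llm = ChatOpenAI(temperature=0.7, model_name=llm_model)

# Keep the running chat history so follow-up questions have context.
memory = ConversationBufferMemory(
  memory_key='chat_history',
  return_messages=False
)
# Retrieve the 20 closest chunks for each question.
retriever = db.as_retriever(
    search_kwargs={"k": 20}
)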

custom_template = """You are an AI assistant for assisted restoration papers.
You are given the following extracted parts of a long document.
=========
{context}
=========
Provide a conversational answer for the following question
Question: {question}.
If you don't know the answer, just say "Hmm, I'm not sure." Don't try to make up an answer.
Answer in Markdown:"""

# The template only uses {context} and {question}, so only those are declared.
custom_prompt = PromptTemplate(
    template=custom_template,
    input_variables=["context", "question"],
)


# chain_type="stuff" concatenates all retrieved chunks into a single prompt;
# chain_type="refine" instead walks through the chunks one at a time, updating the answer.
conversation_chain_with_reference_prompt = ConversationalRetrievalChain.from_llm(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        verbose=True,
        memory=memory,
        combine_docs_chain_kwargs={"prompt": custom_prompt}
)


# NOTE: both chains share the same memory object, so their chat histories are mixed.
conversation_chain_without_reference = ConversationalRetrievalChain.from_llm(
        llm=llm,
        chain_type="refine",
        retriever=retriever,
        # return_source_documents=True,
        verbose=True,
        memory=memory,
        # combine_docs_chain_kwargs={"question_prompt": custom_prompt, "refine_prompt": custom_prompt}
)

### TODO
1. Add a cross-encoder after the retrieval stage to re-rank the results before feeding them to the API.
2. Alternatively, or perhaps in addition, combine several chunks to send as context, depending on the context length.
3. Or chunk into smaller portions and combine the results to send to OpenAI. This way, answers can draw on chunks from different papers and different sections.
4. Update this RAG to cite sources from the context provided. Reference: https://blog.langchain.dev/langchain-chat/
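
For item 1 above, the re-ranking step could look roughly like the sketch below. The scorer here is a toy word-overlap function used only so the example runs without extra dependencies; it is not a cross-encoder. In practice `score_fn` would presumably wrap a real cross-encoder model that scores each (question, passage) pair. The names `rerank` and `overlap_score` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, passage):
    """Toy relevance score: fraction of query tokens that also appear in the passage."""
    query_tokens = tokenize(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & tokenize(passage)) / len(query_tokens)


def rerank(query, passages, score_fn=overlap_score, top_n=5):
    """Re-order retrieved passages by score_fn(query, passage), best first.

    Ties keep the original retrieval order.
    """
    scored = [(score_fn(query, passage), index, passage) for index, passage in enumerate(passages)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [passage for _, _, passage in scored[:top_n]]


retrieved = [
    "Soil nitrogen was measured.",
    "Wood density predicted seedling survival.",
    "Survival was low in open fields.",
]
print(rerank("wood density survival", retrieved, top_n=2))
```

The retriever would fetch a generous number of candidates (such as the `k=20` used above), and only the top few after re-ranking would be sent to the model.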

https://towardsdatascience.com/4-ways-of-question-answering-in-langchain-188c6707cc5a
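
For items 2 and 4 in the TODO list, one option is to assemble the context by hand: label each chunk with its id and source so the model can cite it, and stop adding chunks once a size budget is reached. The sketch below is an assumption about how that might look. It counts characters as a simple stand-in for tokens, and `build_context` and `max_chars` are hypothetical names.

```python
def build_context(chunks, max_chars=6000):
    """Join labelled chunks, best first, until the character budget is used up.

    Each chunk is a dict with 'id', 'source' and 'page_content'.
    Returns the context string and the list of chunks that were included.
    """
    parts, used, total = [], [], 0
    for chunk in chunks:
        entry = f"[{chunk['id']}] {chunk['source']}\n{chunk['page_content']}"
        cost = len(entry) + (2 if parts else 0)  # 2 characters for the "\n\n" separator
        if total + cost > max_chars:
            break  # stop here so that only the highest-ranked chunks are kept
        parts.append(entry)
        used.append(chunk)
        total += cost
    return "\n\n".join(parts), used


example_chunks = [
    {"id": "1.7.0", "source": "Paper A", "page_content": "x" * 50},
    {"id": "6.1.0", "source": "Paper B", "page_content": "y" * 50},
]
context, included = build_context(example_chunks, max_chars=100)
print(len(included), len(context))
```

With the `[id] source` labels in place, the prompt could ask the model to quote the bracketed ids it relied on, and `included` tells you which sources to list alongside the answer.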

In [None]:
query = "How can plant functional traits be useful when designing a restoration project?"
result = conversation_chain_with_reference_prompt({"question": query})
# result = conversation_chain_without_reference({"question": query})
answer = result["answer"]

answer