<a href="https://colab.research.google.com/github/smthomas1704/restoration-rag/blob/main/functional_trait_rag_with_jina_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Download
Install the dependencies, then download the zipped literature file from the Google Drive location used in the `gdown` command below.

In [3]:
!git clone https://github.com/smthomas1704/restoration-rag.git

!pip install -r restoration-rag/requirements.txt
!pip install huggingface_hub
!pip install llama-cpp-python==0.1.78
!pip install numpy==1.23.4
!pip install gdown==v4.6.3
!pip install openai
!pip install langchain_experimental

!gdown https://drive.google.com/file/d/10_inKhFuY5O8Sel88ZvlTqODsbbh4ula/view?usp=drive_link -O /content/functional_trait_literature_zipped.zip --fuzzy

!unzip /content/functional_trait_literature_zipped.zip

# Chunk and store

In this portion, we'll chunk all the files into smaller paragraphs and use those chunks to generate embeddings. We'll store the chunks separately so we can reuse them later without re-downloading all the literature.
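As a minimal sketch of the "store the chunks separately" step, the snippet below writes chunks to a JSON Lines file and reads them back. The helper names, the file name, and the toy records are assumptions for illustration; only the keys (`id`, `title`, `page_content`) mirror what the notebook reads later.

```python
import json
import os
import tempfile


def save_chunks(chunks, path):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(json.dumps(chunk, ensure_ascii=False) + "\n")


def load_chunks(path):
    """Read the chunks back, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Toy records (hypothetical content) using the keys the notebook expects.
example_chunks = [
    {"id": "1.0", "title": "Example paper A", "page_content": "First paragraph."},
    {"id": "1.1", "title": "Example paper A", "page_content": "Second paragraph."},
]

path = os.path.join(tempfile.mkdtemp(), "example_chunks.jsonl")
save_chunks(example_chunks, path)
restored = load_chunks(path)
print(len(restored), "chunks restored")
```

Keeping one record per line means a single malformed chunk can be skipped without discarding the rest of the file.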

### References/Notes related to chunking
1. https://openai.com/blog/new-and-improved-embedding-model
2. text-embedding-ada-002, announced in the post above, was considered the strongest OpenAI model for generating text embeddings when these notes were written.
3. https://www.pinecone.io/learn/chunking-strategies/
4. https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf#using-pypdf

### Possible strategies for chunking
1. LaTeX: LaTeX is a document preparation system and markup language often used for academic papers and technical documents. By parsing the LaTeX commands and environments, you can create chunks that respect the logical organization of the content (e.g., sections, subsections, and equations), leading to more accurate and contextually relevant results.
2. The LaTeX-based chunker takes a string as input, so we will need to read each document into a string first.
3. Most of these academic papers are written in LaTeX and then converted to PDF. Our best bet would be to convert each PDF back to LaTeX first and then use the LaTeX-based chunker. This way paragraphs and related information stay together and keep their context.
4. On the other hand, there is no guarantee that a given document was originally written in LaTeX.
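To make the LaTeX idea concrete, here is a small standard-library sketch that splits a LaTeX string at sectioning commands so each chunk starts at a heading. It is not the LangChain LaTeX splitter and not what this notebook runs; the function name and the sample text are assumptions for illustration.

```python
import re

# Zero-width split point in front of \section, \subsection, \subsubsection
# (with or without a trailing *), so the heading stays with its text.
SECTION_PATTERN = re.compile(r"(?=\\(?:sub)*section\*?\{)")


def split_latex_by_section(latex_source):
    """Return non-empty chunks, each beginning at a sectioning command."""
    parts = SECTION_PATTERN.split(latex_source)
    return [part.strip() for part in parts if part.strip()]


# Hypothetical sample document.
example = r"""
\section{Introduction}
Functional traits describe plant strategies.
\subsection{Scope}
We focus on restoration.
\section{Methods}
Traits were measured in the field.
"""

latex_chunks = split_latex_by_section(example)
for chunk in latex_chunks:
    print(chunk.splitlines()[0])
```

A real pipeline would still need the PDF-to-LaTeX conversion step from item 3, which this sketch does not attempt.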

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from huggingface_hub import hf_hub_download
from langchain.docstore.document import Document
import pandas as pd
import json

REPO_ID = "collaborativeearth/functional_trait_papers"
FILENAME = "all_afr_carbon_abstracts_only.jsonl"

# Download the pre-generated chunks from Hugging Face (we generated them earlier and uploaded them as a dataset)
file_name = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset", local_dir="/content/")

print(file_name)
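chunks = []

# The file has a .jsonl extension, so fall back to line-by-line parsing
# if it does not parse as a single JSON document.
with open(file_name, "r") as final:
  raw = final.read()
try:
  chunks = json.loads(raw)
except json.JSONDecodeError:
  chunks = [json.loads(line) for line in raw.splitlines() if line.strip()]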

prod_splits=[]

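for chunk in chunks:
  try:
    # Wrap each stored chunk in a LangChain Document, keeping title and id as metadata
    prod_splits.append(Document(
      page_content=chunk['page_content'],
      metadata={
          "source": chunk['title'],
          "id": chunk['id']
      }
    ))
  except (KeyError, TypeError) as err:
    # Skip malformed records, but report what was wrong with them
    print(f"Exception for chunk ({err!r}):")
    print(chunk)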

print(prod_splits[0])
print(prod_splits[1])

## Now we will generate embeddings using different models for comparison
The first one is jina-embedding-b-en-v1.

In [None]:
!mkdir -p "/content/vectorstore"
DB_JINA_EMBEDDING_PATH = 'vectorstore/db_jina-embedding-b-en-v1'

embeddings = HuggingFaceEmbeddings(model_name='jinaai/jina-embedding-b-en-v1',
                                   model_kwargs={'device': 'cuda'})

db = FAISS.from_documents(prod_splits, embeddings)
serialized_bytes = db.serialize_to_bytes()
with open("/content/vectorstore/serialized_db.txt", "wb") as binary_file:
    # Write bytes to file
    binary_file.write(serialized_bytes)

db.save_local(DB_JINA_EMBEDDING_PATH)

In [10]:
docs = await db.asimilarity_search("How can plant functional traits be useful when designing a restoration project?")
print(len(docs))
print(docs[0].metadata)
print(docs[1].metadata)
print(docs[2].metadata)
print(docs[3].metadata)

4
{'source': 'Using large‐scale tropical dry forest restoration to test successional theory', 'id': '9.18'}
{'source': 'Restoring Ecosystem Services Tool (REST): a program for selecting species for restoration projects using a functional-trait approach', 'id': '6.18'}
{'source': 'Restoring Ecosystem Services Tool (REST): a program for selecting species for restoration projects using a functional-trait approach', 'id': '6.4'}
{'source': 'Using soil amendments and plant functional traits to select native tropical dry forest species for the restoration of degraded Vertisols', 'id': '7.30'}
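The similarity search above returns the chunks whose embeddings lie closest to the query embedding. The sketch below shows that idea as a brute-force nearest-neighbor search in plain Python. The Euclidean metric, the three-dimensional vectors, and the chunk ids are assumptions for illustration; the real index and embeddings differ.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_chunks(query_vector, indexed_chunks, k=4):
    """Return the ids of the k chunks closest to the query vector."""
    ranked = sorted(indexed_chunks, key=lambda item: squared_l2(query_vector, item[1]))
    return [chunk_id for chunk_id, _ in ranked[:k]]


# Hypothetical (id, embedding) pairs; real embeddings have hundreds of dimensions.
toy_index = [
    ("chunk-a", [0.9, 0.1, 0.0]),
    ("chunk-b", [0.8, 0.2, 0.1]),
    ("chunk-c", [0.1, 0.9, 0.0]),
    ("chunk-d", [0.0, 0.1, 0.9]),
    ("chunk-e", [0.5, 0.5, 0.5]),
]
toy_query = [1.0, 0.0, 0.0]

closest = nearest_chunks(toy_query, toy_index, k=2)
print(closest)
```

A vector store does the same ranking with an optimized index instead of a full sort.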


In [19]:
from google.colab import userdata
from openai import OpenAI
from langchain.chains import ConversationalRetrievalChain, RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.vectorstores import DocArrayInMemorySearch, FAISS
from IPython.display import display, Markdown
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings, HuggingFaceInstructEmbeddings
from langchain.memory import ConversationBufferMemory
from langchain.indexes import VectorstoreIndexCreator
from langchain_experimental.agents.agent_toolkits.csv.base import create_csv_agent
from langchain.agents.agent_types import AgentType
from langchain.prompts import PromptTemplate
import tiktoken
import os

OPENAI_API_KEY = userdata.get('COLABORATIVE_EARTH_KEY')
llm_model = "gpt-3.5-turbo"

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

llm = ChatOpenAI(temperature=0.7, model_name=llm_model)

memory = ConversationBufferMemory(
  memory_key='chat_history',
  return_messages=False
)
retriever = db.as_retriever(
    search_kwargs={"k": 5}
)

custom_template = """You are an AI assistant for assisted restoration papers.
You are given the following extracted parts of a long document.
=========
{context}
=========
Provide a conversational answer for the following question
Question: {question}.
If you don't know the answer, just say "Hmm, I'm not sure." Don't try to make up an answer.
Answer in Markdown:"""

custom_prompt = PromptTemplate(
    template=custom_template,
    # Only list variables that actually appear in the template
    input_variables=["context", "question"],
)


conversation_chain_with_reference_prompt = ConversationalRetrievalChain.from_llm(
        llm=llm,
        # chain_type="stuff" places all retrieved chunks into a single prompt.
        # chain_type="refine" would instead update the answer one chunk at a time.
        chain_type="stuff",
        retriever=retriever,
        verbose=True,
        memory=memory,
        combine_docs_chain_kwargs={"prompt": custom_prompt}
)


conversation_chain_without_reference = ConversationalRetrievalChain.from_llm(
        llm=llm,
        # chain_type="refine" answers from the first chunk, then revises with each remaining chunk.
        chain_type="refine",
        # chain_type="stuff",
        retriever=retriever,
        # return_source_documents=True,
        verbose=True,
        memory=memory,
        # combine_docs_chain_kwargs={"question_prompt": custom_prompt, "refine_prompt": custom_prompt}
)

query = "What are some strategies for carbon sequestration?"
# result = conversation_chain_with_reference_prompt({"question": query})
result = conversation_chain_without_reference({"question": query})
answer = result["answer"]

answer



> Entering new RefineDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: Context information is below.
------------

rapidly in the initial 20 years after planting. In contrast, SOC sequestration increased rapidly after 20 years. In addition, evergreen species had higher carbon density in both biomass and soil than deciduous species and economic species (fruit trees). Carbon sequestration in evergreen and deciduous species is greater than in economic species. Our findings provide new evidence on the divergent responses of biomass and soil to carbon sequestration after reforestation with respect to stand ages and vegetation types. This study provides relevant information for ecosystem management as well as for carbon sequestration and global climate change policies.

------------
Given the context information and not prior knowledge, answer any questions
Human: What are some strategies for carbon sequestration?



'In addition to reforestation initiatives and the restoration of mangrove forests, another effective strategy for carbon sequestration is the protection and conservation of existing forests, especially those within protected areas like the Atlantic Forest. These forests have been shown to have a greater potential for carbon sequestration compared to unprotected areas. By preserving these ecosystems, we can maximize their ability to store carbon and contribute to global climate change mitigation efforts. Furthermore, promoting the natural regeneration of forests in close proximity to reforestation projects or old-growth forests can also enhance carbon sequestration rates. By combining these strategies with nature-based solutions, we can optimize carbon storage potential while supporting biodiversity and ecosystem health. Additionally, active site engineering, as demonstrated in the study in arid regions of Western Australia, can enhance carbon sequestration rates in such areas, providin
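The two chain types used above differ in how retrieved chunks reach the model. The sketch below contrasts them with a stand-in `fake_llm` function; the prompt wording and helper names are assumptions for illustration and are not LangChain's actual prompts.

```python
def stuff_answer(llm, question, chunks):
    """'stuff': put every retrieved chunk into one prompt, one model call."""
    context = "\n\n".join(chunks)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")


def refine_answer(llm, question, chunks):
    """'refine': answer from the first chunk, then revise once per extra chunk."""
    answer = llm(f"Context:\n{chunks[0]}\n\nQuestion: {question}")
    for chunk in chunks[1:]:
        answer = llm(
            f"Existing answer: {answer}\n"
            f"New context:\n{chunk}\n\n"
            f"Refine the answer to: {question}"
        )
    return answer


calls = []


def fake_llm(prompt):
    """Stand-in for a chat model: records the prompt and returns a counter."""
    calls.append(prompt)
    return f"answer #{len(calls)}"


toy_chunks = ["chunk one", "chunk two", "chunk three"]

stuff_result = stuff_answer(fake_llm, "q?", toy_chunks)
stuff_calls = len(calls)

calls.clear()
refine_result = refine_answer(fake_llm, "q?", toy_chunks)
refine_calls = len(calls)

print(stuff_calls, refine_calls)
```

With `refine`, the number of model calls grows with the number of retrieved chunks, which is why the log above shows a separate LLMChain step per chunk.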

### TODO
1. Add a cross-encoder after the retrieval stage to re-rank the results before feeding them to the API.
2. Alternatively, or perhaps in addition, we may want to combine several chunks to send as context, depending on the context length.
3. Or perhaps we chunk in smaller portions and combine the results to send to OpenAI. This way answers can draw on chunks from different papers and different sections.
4. Update this RAG to cite sources from the context provided. Reference: https://blog.langchain.dev/langchain-chat/
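For item 1, a re-ranking stage could look roughly like the sketch below: score each retrieved chunk against the query and keep the best few. The `toy_overlap_score` function is only a stand-in so the example runs without a model; in practice `score_fn` would wrap a cross-encoder, for instance one from a library such as sentence-transformers. All names and texts here are assumptions for illustration.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Score every candidate against the query and keep the top_n best."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def toy_overlap_score(query, text):
    """Stand-in scorer: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)


# Hypothetical retrieved chunks, in the order the retriever returned them.
retrieved = [
    "Seed dispersal by birds in dry forest",
    "Reforestation strategies increase carbon sequestration",
    "Soil carbon changes after planting",
]

reranked = rerank("carbon sequestration strategies", retrieved, toy_overlap_score, top_n=2)
print(reranked)
```

Because the scorer is passed in as a function, the retrieval code does not need to change when the toy scorer is swapped for a real model.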

https://towardsdatascience.com/4-ways-of-question-answering-in-langchain-188c6707cc5a
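
TODO items 2 and 4 could be approached together: pack as many chunks as fit into a context budget and label each one with its source so the model can cite it. The sketch below uses a character budget as a rough proxy for tokens; the budget, the label format, and the toy records are assumptions for illustration.

```python
def build_context(chunks, max_chars=1000):
    """Concatenate source-labeled chunks until the character budget is reached."""
    parts = []
    used = 0
    for chunk in chunks:
        block = f"[{chunk['id']}] {chunk['title']}\n{chunk['page_content']}"
        if used + len(block) > max_chars:
            break  # Stop at the first chunk that would exceed the budget
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)


# Hypothetical chunks with the same keys the notebook stores.
toy_chunks = [
    {"id": "a.1", "title": "Paper A", "page_content": "x" * 50},
    {"id": "b.1", "title": "Paper B", "page_content": "y" * 50},
    {"id": "c.1", "title": "Paper C", "page_content": "z" * 50},
]

context = build_context(toy_chunks, max_chars=130)
print(context)
```

The bracketed ids give the prompt something concrete to ask for, such as "cite the ids of the passages you used".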