<a href="https://colab.research.google.com/github/smthomas1704/restoration-rag/blob/main/functional_trait_rag_with_jina_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Download
Clone the repository, install the dependencies, and download the zipped literature file from Google Drive.

In [None]:
!git clone https://github.com/smthomas1704/restoration-rag.git

!pip install -r restoration-rag/requirements.txt
!pip install huggingface_hub
!pip install llama-cpp-python==0.1.78
!pip install numpy==1.23.4
!pip install gdown==v4.6.3
!pip install openai
!pip install langchain_experimental

!gdown "https://drive.google.com/file/d/10_inKhFuY5O8Sel88ZvlTqODsbbh4ula/view?usp=drive_link" -O /content/functional_trait_literature_zipped.zip --fuzzy

!unzip /content/functional_trait_literature_zipped.zip

# Chunk and store

In this section, we chunk all the files into smaller paragraphs and use those chunks to generate embeddings. We store the chunks separately so we can reuse them later without having to download all the literature again.
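
As a minimal sketch of the chunk-and-store idea, the snippet below packs blank-line-separated paragraphs into chunks and writes them to a JSON file. The function names, the `max_chars` limit, and the output path are assumptions made for illustration; the record fields (`page_content`, `title`, `id`) follow the fields read back later in this notebook, and the `<document>.<chunk>` id pattern is inferred from the ids in the printed outputs. It is not necessarily how the published chunks were produced.

```python
import json
from pathlib import Path


def chunk_paragraphs(text, max_chars=1000):
    """Greedily pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def save_chunks(chunks, title, doc_id, path):
    """Write chunks as a JSON list of records and return the records."""
    records = [
        {"page_content": c, "title": title, "id": f"{doc_id}.{i}"}
        for i, c in enumerate(chunks)
    ]
    Path(path).write_text(json.dumps(records), encoding="utf-8")
    return records
```

Storing the records once means later runs only need to read the JSON file instead of re-parsing every PDF.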

### References/Notes related to chunking
1. https://openai.com/blog/new-and-improved-embedding-model
2. The OpenAI post above introduces text-embedding-ada-002, which OpenAI recommended for text embedding generation when it was released.
3. https://www.pinecone.io/learn/chunking-strategies/
4. https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf#using-pypdf

### Possible strategies for chunking
1. LaTeX: LaTeX is a document preparation system and markup language often used for academic papers and technical documents. By parsing the LaTeX commands and environments, you can create chunks that respect the logical organization of the content (e.g., sections, subsections, and equations), leading to more accurate and contextually relevant results.
2. The LaTeX-based splitter takes a string as input, so we will need to read each document into a string first.
3. Many of these academic papers are written in LaTeX and then converted to PDF. Our best bet would be to convert each PDF back to LaTeX first and then chunk it with the LaTeX-based splitter. This way paragraphs and related information stay together and keep their context.
4. On the other hand, there is no guarantee that a document was originally written in LaTeX.
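
To make the LaTeX strategy concrete, here is a small sketch that splits a LaTeX string at sectioning commands using only the standard library. It is an illustration under assumptions, not the splitter used in this notebook: the function name and the regular expression are hypothetical, and only `\section`, `\subsection`, and `\subsubsection` (optionally starred) are treated as boundaries.

```python
import re

# Matches \section{...}, \subsection{...}, \subsubsection{...}, optionally starred.
SECTION_RE = re.compile(r"\\(?:sub){0,2}section\*?\{[^}]*\}")


def split_latex_sections(latex):
    """Split a LaTeX string into chunks that each start at a sectioning command."""
    starts = [m.start() for m in SECTION_RE.finditer(latex)]
    # Any text before the first sectioning command becomes its own chunk.
    boundaries = [0] + [s for s in starts if s != 0] + [len(latex)]
    pieces = (latex[a:b].strip() for a, b in zip(boundaries, boundaries[1:]))
    return [p for p in pieces if p]


sample = (
    "Intro text.\n"
    "\\section{Methods}\nWe measured.\n"
    "\\subsection{Sites}\nTwo sites.\n"
    "\\section*{Results}\nDone."
)
for piece in split_latex_sections(sample):
    print(repr(piece))
```

Because each chunk begins at a heading, the heading text travels with the paragraphs it introduces, which is the contextualization benefit described in the list above.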

In [2]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from huggingface_hub import hf_hub_download
from langchain.docstore.document import Document
import pandas as pd
import json

REPO_ID = "collaborativeearth/functional_trait_papers"
FILENAME = "all_afr_carbon_large_chunks.jsonl"

# Download the chunks from Hugging Face. We generated the chunks earlier and uploaded them as a dataset.
file_name = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset", local_dir="/content/")

print(file_name)
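chunks = []

# Despite the .jsonl extension, the file is read as a single JSON document (a list of chunk records).
with open(file_name, "r") as final:
  chunks = json.load(final)

prod_splits = []

# Wrap each chunk record in a LangChain Document, keeping the paper title and chunk id as metadata.
for chunk in chunks:
  try:
    prod_splits.append(Document(
      page_content=chunk['page_content'],
      metadata={
          "source": chunk['title'],
          "id": chunk['id']
      }
    ))
  except Exception:
    # Report records that cannot be converted (for example, a missing field) and keep going.
    print("Exception for chunk:")
    print(chunk)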

print(prod_splits[0])
print(prod_splits[1])

/content/all_afr_carbon_large_chunks.jsonl
page_content='Tropical forests are important carbon pools, comprising approximately 40% of terrestrial carbon storage (Dixon et al., 1994). The tropical forests are among the most productive ecosystems on the earth, estimated to account for above one-third of global net primary productivity (NPP) (Gaston et al., 1998;Field et al., 1998), but have been relatively under-sampled compared with their importance. Tropical deciduous forests are forests occurring in tropical regions characterized by pronounced seasonality in rainfall distribution, with few months of drought period.\nThe aboveground biomass (AGB) of a forest ecosystem is one of the fundamental parameters describing its functioning. Studies on biomass of forest vegetation are essential for determining storage of the carbon in the dominant tree component and computing the carbon cycling at regional as well as global level. Measurement of AGB of dominant tree species in different forest c

## Now we will generate embeddings using different models for comparison
The first one is a Jina v1 English embedding model (the code below loads jinaai/jina-embedding-b-en-v1).

In [3]:
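!mkdir -p "/content/vectorstore"
# Note: the path name says "l" (large), but the model loaded below is the "b" (base) variant.
DB_JINA_EMBEDDING_PATH = 'vectorstore/db_jina-embedding-l-en-v1'

# Requires a GPU runtime; use {'device': 'cpu'} if no GPU is available.
embeddings = HuggingFaceEmbeddings(model_name='jinaai/jina-embedding-b-en-v1',
                                   model_kwargs={'device': 'cuda'})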

db = FAISS.from_documents(prod_splits, embeddings)
serialized_bytes = db.serialize_to_bytes()
with open("/content/vectorstore/serialized_db.txt", "wb") as binary_file:
    # Write bytes to file
    binary_file.write(serialized_bytes)

db.save_local(DB_JINA_EMBEDDING_PATH)

In [4]:
docs = await db.asimilarity_search("How can plant functional traits be useful when designing a restoration project?")
print(len(docs))
print(docs[0].metadata)
print(docs[1].metadata)
print(docs[2].metadata)
print(docs[3].metadata)

4
{'source': 'Restoring Ecosystem Services Tool (REST): A Computer Program for Selecting Species for Restoration Projects Using a Functional-Trait Approach', 'id': '267.20'}
{'source': 'Restoring Ecosystem Services Tool (REST): A Computer Program for Selecting Species for Restoration Projects Using a Functional-Trait Approach', 'id': '267.18'}
{'source': 'Using large-scale tropical dry forest restoration to test successional theory', 'id': '202.18'}
{'source': 'Restoring Ecosystem Services Tool (REST): A Computer Program for Selecting Species for Restoration Projects Using a Functional-Trait Approach', 'id': '267.4'}


In [8]:
from google.colab import userdata
from openai import OpenAI
from langchain.chains import ConversationalRetrievalChain, RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.vectorstores import DocArrayInMemorySearch, FAISS
from IPython.display import display, Markdown
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings, HuggingFaceInstructEmbeddings
from langchain.memory import ConversationBufferMemory
from langchain.indexes import VectorstoreIndexCreator
from langchain_experimental.agents.agent_toolkits.csv.base import create_csv_agent
from langchain.agents.agent_types import AgentType
from langchain.prompts import PromptTemplate
import tiktoken
import os

OPENAI_API_KEY = userdata.get('COLABORATIVE_EARTH_KEY')
llm_model = "gpt-3.5-turbo"

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

llm = ChatOpenAI(temperature=0.7, model_name=llm_model)

memory = ConversationBufferMemory(
  memory_key='chat_history',
  return_messages=False
)

retriever = db.as_retriever(
    search_kwargs={"k": 5}
)

custom_template = """You are an AI assistant for assisted restoration papers.
You are given the following extracted parts of a long document.
=========
{context}
=========
Provide a conversational answer for the following question
Question: {question}.
If you don't know the answer, just say "Hmm, I'm not sure." Don't try to make up an answer.
Answer in Markdown:"""


conversation_chain_without_reference = ConversationalRetrievalChain.from_llm(
        llm=llm,
        # chain_type="stuff" places all retrieved documents into a single prompt.
        # chain_type="refine",
        chain_type="stuff",
        retriever=retriever,
        # return_source_documents=True,
        verbose=True,
        memory=memory,
        # combine_docs_chain_kwargs={"question_prompt": custom_prompt, "refine_prompt": custom_prompt}
)

query = "What are some strategies for carbon sequestration?"

docs = db.similarity_search(query)
for doc in docs:
  print(doc)

result = conversation_chain_without_reference({"question": query})
answer = result["answer"]

answer

page_content='One management goal may be to maximize carbon storage across the landscape.\nTraits that are associated with plant growth and nutrient cycling are included here.' metadata={'source': 'Restoring Ecosystem Services Tool (REST): A Computer Program for Selecting Species for Restoration Projects Using a Functional-Trait Approach', 'id': '267.22'}
page_content='Estimation of C-stocks in MMW revealed the high potential of mangroves for sequestering carbon. The plantations can ' metadata={'source': 'Carbon stocks in natural and planted mangrove forests of Mahanadi Mangrove Wetland, East Coast of India', 'id': '58.10'}
page_content='In conclusion, forest is the land use pattern with the highest capacity for carbon sequestration, thus playing an extremely important role in alleviating global warming. The effect of different land use patterns on soil carbon sequestration was evaluated in this study, which will provide baseline data for forest management and land use planning, especi

'Some strategies for carbon sequestration include:\n\n1. **Afforestation and Reforestation:** Planting trees and restoring forests can help increase carbon storage capacity.\n\n2. **Soil Management:** Improving soil health through practices like no-till agriculture, cover cropping, and organic farming can enhance carbon sequestration in the soil.\n\n3. **Wetland Restoration:** Restoring wetlands like mangroves can help sequester carbon due to their high carbon storage capacity.\n\n4. **Bioenergy with Carbon Capture and Storage (BECCS):** This technology involves capturing carbon dioxide emissions from bioenergy production and storing it underground.\n\n5. **Carbon Capture and Storage (CCS):** This involves capturing carbon dioxide emissions from industrial processes or power plants and storing it underground to prevent it from entering the atmosphere.\n\n6. **Improved Forest Management:** Sustainable forest management practices can help maintain or increase carbon stocks in forests.\n\

### TODO
1. Add a cross-encoder after the retrieval stage to re-rank the results before feeding them to the API.
2. Alternatively, or perhaps in addition, we may also want to combine several chunks to send as context, depending on the context length.
3. Or perhaps we also chunk in smaller portions and combine the results to send to OpenAI. This way answers can be formed from chunks from different papers and different sections.
4. Update this RAG to cite sources from the context provided. Reference: https://blog.langchain.dev/langchain-chat/
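
The TODO items above could be prototyped along these lines. This is only a sketch under assumptions: `rerank`, `pack_context`, and `word_overlap` are hypothetical helpers, the character budget stands in for a real token count, and the toy word-overlap score stands in for a cross-encoder (for example, one from the sentence-transformers library), which would be passed in as `score_fn` instead.

```python
def rerank(query, chunks, score_fn):
    """Sort chunks by score_fn(query, text), highest score first."""
    return sorted(chunks, key=lambda c: score_fn(query, c["page_content"]), reverse=True)


def pack_context(chunks, max_chars):
    """Take chunks in order until the character budget is used up.

    Each chunk is numbered so the answer can cite it, and a matching
    source line is produced for every chunk that made it into the context.
    """
    parts, sources, used = [], [], 0
    for chunk in chunks:
        text = chunk["page_content"]
        if used + len(text) > max_chars:
            break
        n = len(parts) + 1
        parts.append(f"[{n}] {text}")
        sources.append(f"[{n}] {chunk['source']} (chunk {chunk['id']})")
        used += len(text)
    return "\n\n".join(parts), sources


def word_overlap(query, text):
    """Toy relevance score: number of distinct query words that appear in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


# Made-up records, shaped like the metadata used earlier in this notebook.
chunks = [
    {"page_content": "Soil pH varied across plots.", "source": "Paper A", "id": "1.0"},
    {"page_content": "Mangroves have high carbon storage.", "source": "Paper B", "id": "2.3"},
]
ranked = rerank("carbon storage in mangroves", chunks, word_overlap)
context, sources = pack_context(ranked, max_chars=200)
print(context)
print(sources)
```

The numbered context can be placed in the prompt's context slot, with an instruction to cite the bracketed numbers, and the `sources` list can be shown alongside the answer.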

https://towardsdatascience.com/4-ways-of-question-answering-in-langchain-188c6707cc5a