<a href="https://colab.research.google.com/github/smthomas1704/restoration-rag/blob/main/functional_trait_rag_with_jina_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Download
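Clone the repository and install the dependencies. The zipped literature file is hosted on Google Drive; the `gdown` and `unzip` commands are commented out below, since the notebook instead downloads pre-generated chunks from Hugging Face in the next section.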

In [None]:
!git clone https://github.com/smthomas1704/restoration-rag.git

!pip install -r restoration-rag/requirements.txt
!pip install huggingface_hub
!pip install llama-cpp-python==0.1.78
!pip install numpy==1.23.4
!pip install gdown==v4.6.3
!pip install openai
!pip install langchain_experimental

# !gdown https://drive.google.com/file/d/10_inKhFuY5O8Sel88ZvlTqODsbbh4ula/view?usp=drive_link -O /content/functional_trait_literature_zipped.zip --fuzzy

# !unzip /content/functional_trait_literature_zipped.zip

# Chunk and store

In this section, we chunk all the files into smaller paragraphs and use those chunks to generate embeddings. We store the chunks separately so we can reuse them later without re-downloading all the literature.
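To make the idea concrete, here is a minimal sketch of chunking and storing, using only the standard library. It is a simplified stand-in for a real text splitter, not the code that produced the chunks used in this notebook. The helper names, the `max_chars` budget, and the `paper_id.chunk_index` id format are assumptions for illustration; the `page_content`, `title`, and `id` keys match the fields the loading cell reads.

```python
import json
import tempfile
from pathlib import Path


def chunk_paragraphs(text, max_chars=1000):
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # The next paragraph would overflow the budget: close this chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_records(title, paper_id, text, max_chars=1000):
    """Wrap chunks in dicts with page_content/title/id keys (id format assumed)."""
    return [
        {"page_content": chunk, "title": title, "id": f"{paper_id}.{i}"}
        for i, chunk in enumerate(chunk_paragraphs(text, max_chars))
    ]


sample_text = "First paragraph.\n\nSecond paragraph.\n\n" + "x" * 50
records = build_records("Example paper", 1, sample_text, max_chars=40)

# Store the chunks once so later runs can skip the raw literature.
out_path = Path(tempfile.gettempdir()) / "example_chunks.json"
out_path.write_text(json.dumps(records))
reloaded = json.loads(out_path.read_text())
```

Keeping paragraphs whole trades a strict size limit for chunks that read coherently, which tends to matter more for retrieval over academic text.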

### References/Notes related to chunking
1. https://openai.com/blog/new-and-improved-embedding-model
2. At the time of writing, `text-embedding-ada-002` (announced in the post above) was OpenAI's recommended model for generating text embeddings.
3. https://www.pinecone.io/learn/chunking-strategies/
4. https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf#using-pypdf

### Possible strategies for chunking
1. LaTeX: LaTeX is a document preparation system and markup language often used for academic papers and technical documents. By parsing the LaTeX commands and environments, you can create chunks that respect the logical organization of the content (e.g., sections, subsections, and equations), leading to more accurate and contextually relevant results.
2. The LaTeX splitter takes a string as input, so we will need to read each document's contents into a string first.
3. Most of these academic papers are written in LaTeX and then converted to PDF. Our best bet would be to convert each PDF back to LaTeX first and then use the LaTeX-based chunker. This way, paragraphs and related information stay together and in context.
4. On the other hand, there is no guarantee that a given document was originally written in LaTeX.
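As a rough illustration of the LaTeX strategy, the sketch below splits a LaTeX string at `\section` and `\subsection` headings with a regular expression. It is a simplified stand-in for a full LaTeX-aware splitter, assuming the source is already available as a string; the function name and sample document are hypothetical.

```python
import re

# Zero-width split points: the start of a line that opens a \section,
# \subsection, or \subsubsection (starred or not).
SECTION_RE = re.compile(r"^(?=\\(?:sub)*section\*?\{)", flags=re.MULTILINE)


def split_latex_sections(latex_source):
    """Split a LaTeX string into chunks that each start at a section heading."""
    parts = SECTION_RE.split(latex_source)
    return [part.strip() for part in parts if part.strip()]


sample = r"""\title{Example}
Preamble text.
\section{Introduction}
Intro text.
\subsection{Background}
Background text.
\section*{Methods}
Methods text.
"""

sections = split_latex_sections(sample)
for section in sections:
    print(repr(section))
```

Each chunk keeps its heading together with the text beneath it, which is the property that makes structure-aware chunking attractive here.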

In [1]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from huggingface_hub import hf_hub_download
from langchain.docstore.document import Document
import pandas as pd
import json

REPO_ID = "collaborativeearth/functional_trait_papers"
FILENAME = "all_afr_carbon_large_chunks.jsonl"

# Download the pre-generated chunks from Hugging Face (we generated them earlier and uploaded them there).
file_name = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset", local_dir="/content/")

print(file_name)

# Despite the .jsonl extension, the file is read as a single JSON document (a list of chunk dicts).
with open(file_name, "r") as chunk_file:
  chunks = json.load(chunk_file)

prod_splits = []

for chunk in chunks:
  try:
    prod_splits.append(Document(
      page_content=chunk['page_content'],
      metadata={
          "source": chunk['title'],
          "id": chunk['id']
      }
    ))
  except Exception:
    # Report and skip malformed chunks (e.g. a missing field) instead of aborting.
    print("Exception for chunk:")
    print(chunk)

print(prod_splits[0])
print(prod_splits[1])

/content/all_afr_carbon_large_chunks.jsonl
page_content='\nThe German Federal State of Saxony aims to increase forest cover, supported by the implementation of afforestation programs. To analyze consequences of an increase in forest cover, this study investigates possible trade-offs between carbon storage and plant biodiversity caused by afforestation. Six afforestation scenarios with total forest cover ranging from 27.7% to 46% were generated in the Mulde river basin in Saxony with regard to different forest types. Carbon storage was calculated by the process-based Dynamic Vegetation Model LPJ-GUESS while random forest models were used to predict changes in plant species richness. We used eight different plant groups as responses: total number of plant species, endangered species, as well as species grouped by native status (three groups) and pollination traits (three groups). Afforestation led to an increase in carbon storage that was slightly stronger in coniferous forests as compar

## Now we will generate embeddings using different models for comparison
The first model is a Jina embedding model (`jinaai/jina-embedding-b-en-v1` in the cell below).

In [2]:
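!mkdir -p "/content/vectorstore"
# NOTE: the directory name says "l" (large), but the model loaded below is the base ("b") variant.
DB_JINA_EMBEDDING_PATH = 'vectorstore/db_jina-embedding-l-en-v1'

# Requires a GPU runtime; use {'device': 'cpu'} otherwise.
embeddings = HuggingFaceEmbeddings(model_name='jinaai/jina-embedding-b-en-v1',
                                   model_kwargs={'device': 'cuda'})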

db = FAISS.from_documents(prod_splits, embeddings)
serialized_bytes = db.serialize_to_bytes()
with open("/content/vectorstore/serialized_db.txt", "wb") as binary_file:
    # Write bytes to file
    binary_file.write(serialized_bytes)

db.save_local(DB_JINA_EMBEDDING_PATH)
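Since the goal is to compare embedding models, one simple yardstick is the hit rate at k: the fraction of test queries for which the known relevant chunk appears among the top k retrieved chunks. The sketch below is a minimal version under that assumption; the function name, the chunk ids, and the per-model results are all hypothetical and only show the bookkeeping.

```python
def hit_rate_at_k(retrieved_ids_per_query, relevant_id_per_query, k=5):
    """Fraction of queries whose relevant chunk id appears in the top-k results."""
    hits = sum(
        1
        for retrieved, relevant in zip(retrieved_ids_per_query, relevant_id_per_query)
        if relevant in retrieved[:k]
    )
    return hits / len(relevant_id_per_query)


# Hypothetical data: chunk ids returned by two models for the same 3 queries.
relevant = ["12.3", "40.1", "7.0"]
model_a = [["12.3", "5.5"], ["9.9", "40.1"], ["1.1", "2.2"]]
model_b = [["5.5", "6.6"], ["40.1", "9.9"], ["1.1", "2.2"]]

scores = {
    "model_a": hit_rate_at_k(model_a, relevant, k=2),
    "model_b": hit_rate_at_k(model_b, relevant, k=2),
}
print(scores)
```

In practice the retrieved ids would come from each model's vector store, using the `id` field stored in the chunk metadata.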

In [11]:
query = "What are the implications of the high above- and below-ground carbon stocks in terms of overall ecosystem carbon storage?"
docs = await db.asimilarity_search(query)
for doc in docs:
  print(doc)

page_content='The developed relationships were used to predict the above-ground carbon stocks per ha in 2008 at the respective sub-block levels.' metadata={'source': 'Estimation of carbon stocks in the forest plantations of Sri Lanka †', 'id': '568.9'}
page_content='Forest succession on former grasslands in the Alps and Thuringia caused increasing ecosystem carbon stocks mainly because of the stock development in the tree stems. Losses of carbon in the mineral soil are partially compensated for by the increasing carbon stock in the organic layers. Thus, the distribution of carbon within the ecosystem changes such that the vegetation is replacing the mineral soil as the largest carbon reservoir.\nIn the following, the processes leading to changed carbon stocks in the organic layers and the mineral soil are discussed in more detail, because soil carbon stocks play a vital role as long-term carbon stocks given that they are generally regarded as more stable (Garten & Ashwood, 2002) than c
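Conceptually, the similarity search above embeds the query and returns the chunks whose vectors lie closest to it. The sketch below shows that idea as a brute-force nearest-neighbour search in plain Python, assuming a flat, exhaustive index with Euclidean (L2) distance; the actual index type and metric depend on how the vector store was built. The toy three-dimensional vectors and function names are made up for illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k):
    """Return the k (distance, text) pairs closest to query_vec, nearest first."""
    scored = [(l2_distance(query_vec, vec), text) for text, vec in indexed]
    return sorted(scored)[:k]


# Toy "index": (chunk text, embedding) pairs with hand-written vectors.
toy_index = [
    ("above-ground carbon stocks", [0.9, 0.1, 0.0]),
    ("soil organic carbon", [0.7, 0.3, 0.1]),
    ("pollination traits", [0.0, 0.2, 0.9]),
]

results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
for distance, text in results:
    print(f"{distance:.3f}  {text}")
```

A real embedding model produces vectors with hundreds of dimensions, but the ranking step works the same way.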

In [12]:
from google.colab import userdata
from openai import OpenAI
from langchain.chains import ConversationalRetrievalChain, RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.vectorstores import DocArrayInMemorySearch, FAISS
from IPython.display import display, Markdown
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings, HuggingFaceInstructEmbeddings
from langchain.memory import ConversationBufferMemory
from langchain.indexes import VectorstoreIndexCreator
from langchain_experimental.agents.agent_toolkits.csv.base import create_csv_agent
from langchain.agents.agent_types import AgentType
from langchain.prompts import PromptTemplate
import tiktoken
import os

# Read the OpenAI API key from Colab's secret store.
OPENAI_API_KEY = userdata.get('COLABORATIVE_EARTH_KEY')
llm_model = "gpt-3.5-turbo"

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

llm = ChatOpenAI(temperature=0.7, model_name=llm_model)

memory = ConversationBufferMemory(
  memory_key='chat_history',
  return_messages=False
)

retriever = db.as_retriever(
    search_kwargs={"k": 5}
)

custom_template = """You are an AI assistant for assisted restoration papers.
You are given the following extracted parts of a long document.
=========
{context}
=========
Provide a conversational answer for the following question
Question: {question}.
If you don't know the answer, just say "Hmm, I'm not sure." Don't try to make up an answer.
Answer in Markdown:"""


conversation_chain_without_reference = ConversationalRetrievalChain.from_llm(
        llm=llm,
        # chain_type="stuff" puts all retrieved chunks into a single prompt;
        # "refine" would instead iterate over the chunks one at a time.
        chain_type="stuff",
        retriever=retriever,
        # return_source_documents=True,
        verbose=True,
        memory=memory,
        # custom_template above is not wired in yet, so the chain uses its default prompts.
        # combine_docs_chain_kwargs={"question_prompt": custom_prompt, "refine_prompt": custom_prompt}
)

result = conversation_chain_without_reference({"question": query})
answer = result["answer"]

answer



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following pieces of context to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
The developed relationships were used to predict the above-ground carbon stocks per ha in 2008 at the respective sub-block levels.

Forest succession on former grasslands in the Alps and Thuringia caused increasing ecosystem carbon stocks mainly because of the stock development in the tree stems. Losses of carbon in the mineral soil are partially compensated for by the increasing carbon stock in the organic layers. Thus, the distribution of carbon within the ecosystem changes such that the vegetation is replacing the mineral soil as the largest carbon reservoir.
In the following, the processes leading to changed carbon stocks in the organic layers and the mineral soil are

'The implications of high above- and below-ground carbon stocks in terms of overall ecosystem carbon storage are significant. Forest ecosystems with high carbon stocks play a crucial role in sequestering carbon from the atmosphere, thereby helping to mitigate the impacts of anthropogenic CO2 emissions. The distribution of carbon within the ecosystem, with a large proportion stored in both aboveground vegetation and soil, contributes to the overall resilience and stability of the ecosystem. Additionally, the long-term storage of carbon in soil organic matter provides a more stable carbon reservoir compared to vegetation, which can be easily disturbed or harvested. Overall, high above- and below-ground carbon stocks in forest ecosystems contribute to overall ecosystem health, biodiversity, and the potential to act as carbon sinks to help combat climate change.'
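The verbose log above shows what the "stuff" strategy does: every retrieved chunk is concatenated into one context block and placed in a single prompt. It also shows that the chain used its default system prompt rather than `custom_template`. The sketch below fills that template by hand to show what the custom prompt would look like; `stuff_prompt` is a hypothetical helper, not a LangChain function, and wiring the template into the chain itself is not shown.

```python
CUSTOM_TEMPLATE = """You are an AI assistant for assisted restoration papers.
You are given the following extracted parts of a long document.
=========
{context}
=========
Provide a conversational answer for the following question
Question: {question}.
If you don't know the answer, just say "Hmm, I'm not sure." Don't try to make up an answer.
Answer in Markdown:"""


def stuff_prompt(template, chunks, question):
    """Join all retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


retrieved_chunks = [
    "Forest succession caused increasing ecosystem carbon stocks.",
    "Soil carbon stocks are generally regarded as more stable.",
]
prompt = stuff_prompt(
    CUSTOM_TEMPLATE,
    retrieved_chunks,
    "Why do soil carbon stocks matter for long-term storage?",
)
print(prompt)
```

Because everything goes into one prompt, the number and size of retrieved chunks are bounded by the model's context window.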

### TODO
1. Add a cross-encoder after the retrieval stage to re-rank the results before feeding them to the OpenAI API.
2. Alternatively, or perhaps in addition, we may want to combine several chunks to send as context, depending on the context length.
3. We could also chunk into smaller portions and combine the results before sending them to OpenAI. This way, answers can draw on chunks from different papers and different sections.
4. Update this RAG to cite sources from the context provided. Reference: https://blog.langchain.dev/langchain-chat/
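The items above could fit together roughly as follows: re-rank the retrieved chunks, pack as many as fit into a context budget, and number them so the answer can cite its sources. This is only a sketch under stated assumptions. `overlap_score` is a toy word-overlap scorer standing in for a real cross-encoder, the budget is counted in characters rather than tokens, and all function names and sample chunks are hypothetical.

```python
def overlap_score(query, text):
    """Toy stand-in for a cross-encoder: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def rerank(query, chunks, score_fn, top_n=3):
    """Re-order retrieved chunks by score_fn(query, text), highest first."""
    ranked = sorted(
        chunks, key=lambda c: score_fn(query, c["page_content"]), reverse=True
    )
    return ranked[:top_n]


def pack_context(chunks, max_chars):
    """Add chunks in ranked order while they fit in the character budget."""
    picked, used = [], 0
    for chunk in chunks:
        size = len(chunk["page_content"])
        if used + size > max_chars:
            continue  # this chunk does not fit; a later, smaller one still might
        picked.append(chunk)
        used += size
    return picked


def format_with_sources(chunks):
    """Number each chunk and append its source so an answer can cite [n]."""
    return "\n".join(
        f"[{i}] {c['page_content']} (source: {c['source']})"
        for i, c in enumerate(chunks, start=1)
    )


sample_chunks = [
    {"page_content": "Pollination traits vary by species.", "source": "Paper A"},
    {"page_content": "Soil carbon stocks are more stable than vegetation carbon.", "source": "Paper B"},
    {"page_content": "Above-ground carbon stocks rose after afforestation.", "source": "Paper C"},
]

top = rerank("carbon stocks in soil", sample_chunks, overlap_score, top_n=2)
context = format_with_sources(pack_context(top, max_chars=200))
print(context)
```

Swapping `overlap_score` for a real cross-encoder only requires a function with the same `(query, text) -> score` shape.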

https://towardsdatascience.com/4-ways-of-question-answering-in-langchain-188c6707cc5a