## Setup

Install the packages required for the LlamaIndex embedding and LLM integrations.

In [3]:
%pip install llama-index-embeddings-openai
%pip install llama-index-embeddings-huggingface
%pip install llama-index-llms-openai

Note: you may need to restart the kernel to use updated packages.



In [4]:
# Enable automatic reloading of modules
%load_ext autoreload
%autoreload 2

In [5]:
# Use the %pip magic so the package is installed into the running kernel's environment
%pip install llama-index



In [6]:
import os
import openai
from dotenv import load_dotenv

# Load environment variables from a .env file
load_dotenv()

True

In [7]:
# Set the OpenAI API key from the environment variable
openai.api_key = os.getenv("OPENAI_API_KEY")
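
`os.getenv` returns `None` when the variable is missing, so a misconfigured `.env` file only surfaces later as an authentication error. A minimal sketch of an earlier check follows; the helper name `require_env` is hypothetical and not part of any library used here.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration; not part of openai or llama-index.
    """
    value = os.getenv(name)
    if not value:
        # Covers both a missing variable (None) and an empty string
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value
```

In the cell above, this could be used as `openai.api_key = require_env("OPENAI_API_KEY")`.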

In [8]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.node_parser import SentenceSplitter

# Create the sentence window node parser (the values shown match the library defaults)
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

# base node parser is a sentence splitter
text_splitter = SentenceSplitter()
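
To make `window_size=3` more concrete, here is a rough pure-Python sketch of the idea behind a sentence window: each sentence becomes its own node, and its metadata stores the surrounding sentences. It assumes the window covers `window_size` sentences on each side and uses a naive regex splitter; the function `build_sentence_windows` is hypothetical, and the real parser's sentence splitting is more careful than this.

```python
import re


def build_sentence_windows(text: str, window_size: int = 3) -> list[dict]:
    """Toy illustration of sentence-window nodes (not the LlamaIndex implementation)."""
    # Naive splitter: break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        # Clamp the window to the bounds of the document
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append(
            {
                "text": sentence,
                "metadata": {
                    "window": " ".join(sentences[start:end]),
                    "original_text": sentence,
                },
            }
        )
    return nodes


# Made-up sample text, used only to show the shape of the result
sample = "A one. B two. C three. D four. E five."
nodes_demo = build_sentence_windows(sample, window_size=1)
print(nodes_demo[2]["metadata"]["window"])  # B two. C three. D four.
```

The single sentence is what gets embedded and matched against the query, while the wider window is kept aside in metadata for later use.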

We'll define two models:

LLM (large language model): OpenAI's "gpt-3.5-turbo" with a temperature of 0.1. This model generates responses from the user query and the retrieved context.

Embedding model: HuggingFaceEmbedding with the model "sentence-transformers/all-mpnet-base-v2" and a maximum sequence length of 512. This model creates the vector embeddings for individual text chunks.

In [9]:
# Select Embedding Model and LLM
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-mpnet-base-v2", max_length=512
)

from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embed_model
Settings.text_splitter = text_splitter

## Data Loading, Index Building

In this section, we load the data and construct the vector index.

In [10]:
from llama_index.core import SimpleDirectoryReader

# List of PDF files to be read
input_files = [
    "./data/Grasslan_carbon_sequ.pdf",
    "./data/Rot_gas_serre.pdf"
]

# Read the PDFs and load them as Document objects
documents = SimpleDirectoryReader(input_files=input_files).load_data()

## Extract Nodes

We extract the two sets of nodes that will be stored in the vector indexes: the nodes produced by the sentence window parser and the "base" nodes produced by the standard sentence splitter.

In [11]:
nodes = node_parser.get_nodes_from_documents(documents)

In [12]:
base_nodes = text_splitter.get_nodes_from_documents(documents)

for idx, node in enumerate(base_nodes):
    node.id_ = f"node-{idx}"
    print(node)

Node ID: node-0
Text: Animal (2010), 4:3, pp 334–350 &The Animal Consortium 2009
doi:10.1017/S1751731109990784animal Mitigating the greenhouse gas
balance of ruminant production systems through carbon sequestration in
grasslands J. F . Soussana1-, T. Tallec1and V. Blanfort1,2 1INRA
UR0874, UREP Grassland Ecosystem Research, 234, Avenue du Bre ´zet,
Clermont-Ferrand, ...
Node ID: node-1
Text: This will require avoidingland use changes that reduce ecosystem
soil C stocks (e.g.deforestation, ploughing up long-term grasslands)
and a cautious management of pastures, aiming at preserving
andrestoring soils and their soil organic matter content. Combinedwith
other mitigation measures, such as a reduction in the useof N
fertilisers, of foss...
Node ID: node-2
Text: and provide a variety of goods and services to support ﬂora,
fauna, and human populations worldwide. On aglobal scale, livestock
use 3.4 billion hectares of grazingland (i.e. grasslands and
rangelands), in addition to animal feed pr

We build both the sentence window index and the "base" index.

In [13]:
from llama_index.core import VectorStoreIndex

# Create a VectorStoreIndex instance using the 'nodes' dataset
sentence_index = VectorStoreIndex(nodes)

In [14]:
# Create another VectorStoreIndex instance using the 'base_nodes' dataset
base_index = VectorStoreIndex(base_nodes)
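
As a rough picture of what a vector index does at query time, the sketch below ranks stored embeddings by cosine similarity to a query embedding and keeps the top `k`. The vectors are made-up two-dimensional toys (real embeddings from all-mpnet-base-v2 have many more dimensions), and the helper names are hypothetical. This illustrates the general technique under those assumptions, not the internals of `VectorStoreIndex`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (assumes neither is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, node_vecs, k=2):
    """Return the k (node_id, score) pairs most similar to the query."""
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in node_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings, invented for illustration only
toy_vectors = {
    "node-0": [1.0, 0.0],
    "node-1": [0.7, 0.7],
    "node-2": [0.0, 1.0],
}
top_two = top_k([1.0, 0.1], toy_vectors, k=2)
print([node_id for node_id, _ in top_two])  # ['node-0', 'node-1']
```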

## Querying

In [15]:
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# Create a QueryEngine instance
query_engine = sentence_index.as_query_engine(
    similarity_top_k=2,
    # the target key defaults to `window` to match the node_parser's default
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)

# Query the engine with a prompt
window_response = query_engine.query(
    "What is SOC?"
)
print(window_response)

SOC refers to soil organic carbon, which is a key component of the carbon cycle in ecosystems. It includes organic carbon stored in soil organic matter, such as plant residues, roots, and microbial biomass.
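
The metadata replacement step used in the query engine above can be pictured with a small pure-Python sketch: after retrieval, each node's single-sentence text is swapped for the wider window stored in its metadata, so the LLM sees more context than was embedded. The dictionary layout and the function `replace_with_window` are hypothetical stand-ins for the library's node objects, and the fallback to the original text when the key is missing is an assumption about sensible behavior.

```python
def replace_with_window(retrieved, target_metadata_key="window"):
    """Return copies of the retrieved nodes with text replaced by the stored window.

    Toy illustration only; the input dictionaries are left unmodified.
    """
    replaced = []
    for node in retrieved:
        # Fall back to the node's own text if no window was stored
        new_text = node["metadata"].get(target_metadata_key, node["text"])
        replaced.append({**node, "text": new_text})
    return replaced


# Made-up retrieval results, used only to show the transformation
retrieved_demo = [
    {
        "text": "Sentence B.",
        "metadata": {
            "window": "Sentence A. Sentence B. Sentence C.",
            "original_text": "Sentence B.",
        },
    },
    {"text": "No window here.", "metadata": {}},
]
expanded = replace_with_window(retrieved_demo)
print(expanded[0]["text"])  # Sentence A. Sentence B. Sentence C.
```

Retrieval stays precise because only the single sentence is embedded, while the answer is synthesized from the larger window.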


In [16]:
window = window_response.source_nodes[0].node.metadata["window"]
sentence = window_response.source_nodes[0].node.metadata["original_text"]

print(f"Window: {window}")
print("------------------")
print(f"Original Sentence: {sentence}")

Window: Root litter transfor-mation is also an important determinant of the C cycle in
grassland ecosystems, which is affected both by the root
litter quality and by the rhizosphere activity (Personeni andLoiseau, 2004 and 2005).
 Below-ground C generally has slower turnover rates than
above-ground C, as most of the organic C in soils (humic
substances) is produced by the transformation of plant litterinto more persistent organic comp ounds (Jones and Donnelly,
2004).  Coarse soil organic matter fractions (above 0.2 mm)
have a fast turnover in soils, and the mean residence time
of C in these fractions is reduced by intensive compared toextensive management (Klumpp
et al ., 2007).  SOC may
persist because it is bound to soil minerals and exists in
forms that microbial decomposers cannot access (Baldockand Skjemstad, 2000).  Therefore, SOC accumulation is oftenincreased in clayey compared to sandy soils.
 Sequestred SOC can, if undisturbed, remain in the soil
for centuries.  In native pr