# Sentence Window Vector Index

The sentence window vector index splits documents into individual sentences and creates one embedding per sentence. Then, during retrieval, each retrieved sentence is replaced with a window containing its surrounding sentences before being passed to the LLM.

By default, the sentence window size is 5.

With this index, chunk size settings are ignored during index construction; the window settings are used instead.
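
As a rough illustration of the idea, the sketch below builds per-sentence records in plain Python. It is not the library's implementation: the regex splitter, the `build_sentence_windows` helper, and the reading of `window_size` as "this many sentences on each side" are all assumptions made for the example. Only the `window` and `original_text` keys mirror the metadata fields shown later in this notebook.

```python
import re
from typing import Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_sentence_windows(text: str, window_size: int = 5) -> List[Dict[str, str]]:
    """Return one record per sentence, each carrying its surrounding window.

    Assumption: the window holds up to `window_size` sentences before and
    after the sentence itself, clipped at the document boundaries.
    """
    sentences = split_sentences(text)
    records = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        records.append(
            {
                # The single sentence is what would be embedded and searched.
                "original_text": sentence,
                # The window is what would replace it before the LLM call.
                "window": " ".join(sentences[start:end]),
            }
        )
    return records


records = build_sentence_windows("A one. B two. C three. D four.", window_size=1)
for record in records:
    print(record["original_text"], "->", record["window"])
```

The point of the design is that the embedding is computed over a short, specific sentence, while the LLM still receives enough neighboring text to answer with context.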

## Setup

In [1]:
import os
import openai

os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]

In [2]:
from llama_index import ServiceContext, set_global_service_context
from llama_index.llms import OpenAI
from llama_index.embeddings import OpenAIEmbedding

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
set_global_service_context(ServiceContext.from_defaults(llm=llm, embed_model="local"))

# If you want to use OpenAI embeddings instead, you should also increase the batch size,
# since it involves many more calls to the API.
# set_global_service_context(ServiceContext.from_defaults(llm=llm, embed_model=OpenAIEmbedding(embed_batch_size=50)))

## Build the index

Here, we build an index using chapter 3 of the IPCC AR6 Working Group II climate report.

In [3]:
!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf

In [3]:
from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["./IPCC_AR6_WGII_Chapter03.pdf"]
).load_data()

In [4]:
from llama_index import SentenceWindowVectorIndex

sentence_index = SentenceWindowVectorIndex.from_documents(documents)

## Querying

### With SentenceWindowVectorIndex

In [13]:
query_engine = sentence_index.as_query_engine(similarity_top_k=2)
response = query_engine.query("What are the concerns surrounding the AMOC?")
print(response)

The concerns surrounding the Atlantic Meridional Overturning Circulation (AMOC) include low confidence in quantifying AMOC changes in the 20th century due to disagreement in quantitative reconstructed and simulated trends. Additionally, direct observational records since the mid-2000s are too short to determine the relative contributions of internal variability, natural forcing, and anthropogenic forcing to AMOC change. However, it is very likely that the AMOC will decline for all SSP scenarios over the 21st century, but an abrupt collapse before 2100 is not expected.


We can also check the original sentence that was retrieved for each node, as well as the actual window of sentences that was sent to the LLM.

In [14]:
window = response.source_nodes[0].node.metadata["window"]
sentence = response.source_nodes[0].node.metadata["original_text"]

print(f"Window: {window}")
print("------------------")
print(f"Original Sentence: {sentence}")

Window: WGI AR6 assessed that only the California Current system has 
undergone large-scale upwelling-favourable wind intensification since 
the 1980s (medium confidence) (WGI AR6 Section  9.2.1.5; García-
Reyes and Largier, 2010; Seo et al., 2012; Fox-Kemper et al., 2021). While no consistent pattern of contemporary changes in upwelling-
favourable winds emerges from observation-based studies, numerical 
and theoretical work projects that summertime winds near poleward 
boundaries of upwelling zones will intensify, while winds near 
equatorward boundaries will weaken (high confidence) (WGI AR6 
Section  9.2.3.5; García-Reyes et  al., 2015; Rykaczewski et  al., 2015; 
Wang et  al., 2015; Aguirre et  al., 2019; Fox-Kemper et  al., 2021). Nevertheless, projected future annual cumulative upwelling wind 
changes at most locations and seasons remain within ±10–20% of 
present-day values (medium confidence) (WGI AR6 Section  9.2.3.5; 
Fox-Kemper et al., 2021). Continuous observation of the A
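
To compare the retrieved sentence and its window for every source node rather than only the first, a small helper along these lines may be useful. The `describe_sources` function and its `max_chars` parameter are hypothetical; the helper relies only on the `.node.metadata["window"]` and `.node.metadata["original_text"]` access pattern used in the cell above. The stand-in objects at the bottom exist purely so the sketch runs without an index.

```python
from types import SimpleNamespace


def describe_sources(source_nodes, max_chars=200):
    """Return printable lines pairing each retrieved sentence with its window.

    Assumes each item exposes `.node.metadata` with the "original_text" and
    "window" keys, as in the notebook cell above. Long strings are truncated
    to `max_chars` characters for readability.
    """
    lines = []
    for rank, source in enumerate(source_nodes, start=1):
        metadata = source.node.metadata
        sentence = metadata.get("original_text", "")
        window = metadata.get("window", "")
        lines.append(f"[{rank}] sentence: {sentence[:max_chars]}")
        lines.append(f"[{rank}] window:   {window[:max_chars]}")
    return lines


# Stand-in objects (hypothetical) that mimic the attribute layout above.
fake_nodes = [
    SimpleNamespace(
        node=SimpleNamespace(metadata={"original_text": "B.", "window": "A. B. C."})
    )
]
for line in describe_sources(fake_nodes):
    print(line)
```

With a real response, the call would be `describe_sources(response.source_nodes)`.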

### Contrast with normal VectorStoreIndex

In [6]:
from llama_index import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents)

In [7]:
query_engine = vector_index.as_query_engine(similarity_top_k=2)
response = query_engine.query("What are the concerns surrounding the AMOC?")
print(response)

I'm sorry, but the concerns surrounding the AMOC (Atlantic Meridional Overturning Circulation) are not mentioned in the provided context.


Well, that didn't work. Let's bump up the top k! This will be slower and use more tokens compared to the sentence window index.

In [8]:
query_engine = vector_index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What are the concerns surrounding the AMOC?")
print(response)

The context information does not provide any specific concerns surrounding the AMOC (Atlantic Meridional Overturning Circulation).
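
The remark above about a larger top k using more tokens can be made concrete with back-of-the-envelope arithmetic. The figures below are hypothetical placeholders, not measurements from this notebook: `CHUNK_TOKENS` and `WINDOW_TOKENS` are assumed per-node sizes chosen only to show the shape of the comparison.

```python
def context_tokens(top_k: int, tokens_per_node: int) -> int:
    """Rough upper bound on retrieved-context tokens sent to the LLM per query."""
    return top_k * tokens_per_node


# Hypothetical per-node sizes (assumptions, not measured values).
CHUNK_TOKENS = 1024   # assumed size of one ordinary chunk
WINDOW_TOKENS = 300   # assumed size of one sentence window

chunk_cost = context_tokens(top_k=5, tokens_per_node=CHUNK_TOKENS)
window_cost = context_tokens(top_k=2, tokens_per_node=WINDOW_TOKENS)

print(f"chunk-based retrieval, top_k=5: ~{chunk_cost} tokens")
print(f"sentence windows, top_k=2:      ~{window_cost} tokens")
```

Actual numbers depend on the configured chunk size, the window size, and the sentence lengths in the document, so it is worth measuring them on your own data.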
