# Sentence Window Vector Index

The sentence window vector index splits documents into individual sentences and creates an embedding for each one. Then, during retrieval, each retrieved sentence is replaced with a window containing its surrounding sentences before being passed to the LLM.

By default, the sentence window is 5.

With this index, chunk size settings are ignored during index construction; the window settings are used instead.
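
To make the idea concrete, here is a minimal, library-free sketch of the sentence window technique described above. It is not the `SentenceWindowVectorIndex` implementation: the helper names `split_sentences` and `build_windows` are hypothetical, the regex splitter is deliberately naive, and `window_size` is assumed to mean "number of sentences on each side of the match", which may differ from how the index interprets its window setting.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_windows(sentences, window_size=5):
    """Pair each sentence with the window of text surrounding it.

    Assumption: window_size counts sentences on EACH side of the sentence.
    """
    records = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        records.append(
            {
                "sentence": sentence,  # what would be embedded and matched
                "window": " ".join(sentences[start:end]),  # what the LLM would see
            }
        )
    return records


text = "A one. B two. C three. D four. E five."
records = build_windows(split_sentences(text), window_size=1)

print(records[2]["sentence"])  # C three.
print(records[2]["window"])    # B two. C three. D four.
```

In this sketch only the `sentence` field would be embedded and searched, while the `window` field is what would be substituted in before the LLM call. Matching on a single sentence keeps the embedding focused, and the window restores the context the sentence was written in.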

## Setup

In [1]:
import os
import openai

os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]

In [2]:
from llama_index import ServiceContext, set_global_service_context
from llama_index.llms import OpenAI

# Use gpt-3.5-turbo for responses and a local embedding model for the index.
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
set_global_service_context(ServiceContext.from_defaults(llm=llm, embed_model="local"))


## Build the index

Here, we build an index using chapter 3 of the recent IPCC climate report.

In [3]:
!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf


In [3]:
from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["./IPCC_AR6_WGII_Chapter03.pdf"]
).load_data()

In [4]:
from llama_index import SentenceWindowVectorIndex

sentence_index = SentenceWindowVectorIndex.from_documents(documents)

## Querying

### With SentenceWindowVectorIndex

In [5]:
query_engine = sentence_index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What are the concerns surrounding the AMOC?")
print(response)

The concerns surrounding the Atlantic Meridional Overturning Circulation (AMOC) include the limited confidence in quantifying AMOC changes in the 20th century due to low agreement in reconstructed and simulated trends. Furthermore, the short duration of direct observational records since the mid-2000s makes it difficult to determine the contributions of internal variability, natural forcing, and anthropogenic forcing to AMOC change. However, it is highly likely that the AMOC will decline in all SSP scenarios throughout the 21st century, although an abrupt collapse before 2100 is not anticipated.


### Contrast with normal VectorStoreIndex

In [6]:
from llama_index import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents)

Since the default chunk size (1024 tokens) is much bigger than a sentence, we leave the top k at the default of 2.

In [7]:
query_engine = vector_index.as_query_engine()
response = query_engine.query("What are the concerns surrounding the AMOC?")
print(response)

I'm sorry, but the concerns surrounding the AMOC (Atlantic Meridional Overturning Circulation) are not mentioned in the provided context.


Well, that didn't work. Let's bump up the top k to 5! This will be slower and use more tokens than the sentence window index.

In [8]:
query_engine = vector_index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What are the concerns surrounding the AMOC?")
print(response)

The context information does not provide any specific concerns surrounding the AMOC (Atlantic Meridional Overturning Circulation).
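
The token cost of raising the top k can be estimated with some rough arithmetic. The sketch below is only illustrative: the 1024-token chunk size comes from the text above, but the figures of about 30 tokens per sentence and 11 sentences per window (5 on each side plus the matched sentence) are assumptions, not measurements from this notebook, and `context_tokens` is a hypothetical helper.

```python
def context_tokens(top_k: int, tokens_per_node: int) -> int:
    """Rough upper bound on retrieved-context tokens sent to the LLM."""
    return top_k * tokens_per_node


CHUNK_TOKENS = 1024        # default chunk size mentioned above
TOKENS_PER_SENTENCE = 30   # assumption: average sentence length in tokens
SENTENCES_PER_WINDOW = 11  # assumption: 5 sentences on each side plus the match

chunk_budget = context_tokens(top_k=5, tokens_per_node=CHUNK_TOKENS)
window_budget = context_tokens(
    top_k=5, tokens_per_node=TOKENS_PER_SENTENCE * SENTENCES_PER_WINDOW
)

print(chunk_budget)   # 5120
print(window_budget)  # 1650
```

Under these assumptions, five full chunks carry roughly three times as much context as five sentence windows, which is consistent with the higher token usage noted above.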
