In [None]:
%pip install -q llama-index-embeddings-openai llama-index-llms-openai

Note: you may need to restart the kernel to use updated packages.


In [1]:
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'pg_essay.txt'

--2025-09-28 12:57:08--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8001::154, 2606:50c0:8002::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8001::154|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2025-09-28 12:57:08 ERROR 404: Not Found.



### Ref: https://developers.llamaindex.ai/python/examples/node_parsers/semantic_chunking/

### Semantic Chunking

Instead of chunking text with a fixed chunk size, the semantic splitter adaptively picks the breakpoints between sentences using embedding similarity. This ensures that a "chunk" contains sentences that are semantically related to each other.
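The idea can be illustrated with a small standalone sketch. This is a simplified illustration, not LlamaIndex's implementation: the embeddings are hand-made 2-D toy vectors rather than model output, and `cosine_distance`, `percentile`, and `semantic_chunks` are hypothetical helper names introduced here. It measures the distance between each pair of adjacent sentences and starts a new chunk wherever that distance exceeds a chosen percentile of all the distances.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values, pct):
    """Percentile with linear interpolation; pct is in the range 0..100."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_chunks(sentences, embeddings, threshold_pct=95):
    """Group sentences, breaking where the adjacent distance exceeds the cutoff."""
    if len(sentences) < 2:
        return [list(sentences)]
    # Distance between each sentence and the one that follows it.
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    cutoff = percentile(distances, threshold_pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > cutoff:  # large semantic jump: close the current chunk
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks


# Toy data (assumed for illustration): two topics, two sentences each.
sentences = [
    "Insurance covers risk.",
    "Policies have premiums.",
    "Pasta needs salted water.",
    "Simmer the sauce slowly.",
]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

chunks = semantic_chunks(sentences, embeddings, threshold_pct=95)
for chunk in chunks:
    print(chunk)
```

With these toy vectors the only large jump is between the second and third sentences, so the sketch yields two chunks of two sentences each. A real embedding model would supply the vectors in practice.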

In [17]:
from llama_index.core import SimpleDirectoryReader

# load documents
documents = SimpleDirectoryReader(input_files=["overview.md"]).load_data()

In [18]:
from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
)
from llama_index.embeddings.openai import OpenAIEmbedding

import os
import pandas as pd
from dotenv import load_dotenv

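# Load OPENAI_API_KEY from key.env into the environment;
# OpenAIEmbedding reads the key from the environment by default.
load_dotenv("key.env")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")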

In [20]:
embed_model = OpenAIEmbedding()
splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)

# baseline splitter with a fixed chunk size, for comparison
base_splitter = SentenceSplitter(chunk_size=512)

In [21]:
nodes = splitter.get_nodes_from_documents(documents)

In [22]:
base_nodes = base_splitter.get_nodes_from_documents(documents)
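Before inspecting individual nodes, it may help to compare how many chunks each splitter produced and how long they are. The following is a minimal sketch that works on plain strings; `summarize_chunks` is a hypothetical helper introduced here, not part of LlamaIndex, and the sample strings are made up.

```python
from statistics import mean


def summarize_chunks(texts):
    """Return chunk count and character-length statistics for a list of strings."""
    lengths = [len(t) for t in texts]
    if not lengths:
        return {"count": 0, "min_chars": 0, "max_chars": 0, "mean_chars": 0.0}
    return {
        "count": len(lengths),
        "min_chars": min(lengths),
        "max_chars": max(lengths),
        "mean_chars": mean(lengths),
    }


# Made-up sample input; replace with real chunk texts.
sample = ["short chunk", "a somewhat longer chunk of text"]
summary = summarize_chunks(sample)
print(summary)
```

In this notebook it could be called as `summarize_chunks([n.get_content() for n in nodes])` and again with `base_nodes` to put the two splitters side by side.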

In [23]:
nodes[0].get_content()

"# Overview of Insurellm\nInsurellm was founded by Avery Lancaster in 2015 as an insurance tech startup designed to disrupt an industry in need of innovative products. It's first product was Markellm, the marketplace connecting consumers with insurance providers. It rapidly expanded, adding new products and clients, reaching 200 emmployees by 2024 with 12 offices across the US.\n\n"

In [24]:
base_nodes[0].get_content()

"# Overview of Insurellm\nInsurellm was founded by Avery Lancaster in 2015 as an insurance tech startup designed to disrupt an industry in need of innovative products. It's first product was Markellm, the marketplace connecting consumers with insurance providers. It rapidly expanded, adding new products and clients, reaching 200 emmployees by 2024 with 12 offices across the US.\n\nInsurellm is hiring! We are looking for talented software engineers, data scientists and account executives to join our growing team. Come be a part of our movement to disrupt the insurance sector.\n\nInsurellm is an innovative insurance tech firm with 200 employees across the US.\nInsurellm offers 4 insurance software products:\n- Carllm, a portal for auto insurance companies\n- Homellm, a portal for home insurance companies\n- Rellm, an enterprise platform for the reinsurance sector\n- Marketllm, a marketplace for connecting consumers with insurance providers\n  \nInsurellm has more than 300 clients worldwi

In [25]:
from llama_index.core import VectorStoreIndex
from llama_index.core.response.notebook_utils import display_source_node

In [26]:
vector_index = VectorStoreIndex(nodes)
query_engine = vector_index.as_query_engine()

In [27]:
base_vector_index = VectorStoreIndex(base_nodes)
base_query_engine = base_vector_index.as_query_engine()

In [28]:
response = query_engine.query(
    "What is the marketplace for connecting insurance?"
)
print(str(response))

Marketllm is the marketplace for connecting consumers with insurance providers.


In [29]:
response = base_query_engine.query(
    "What is the marketplace for connecting insurance?"
)
print(str(response))

Marketllm is the marketplace for connecting consumers with insurance providers.


In [30]:
for n in response.source_nodes:
    display_source_node(n, source_length=20000)

**Node ID:** 51620384-e13f-478a-ad40-45da0d8d39c7<br>**Similarity:** 0.8074712743562875<br>**Text:** # Overview of Insurellm
Insurellm was founded by Avery Lancaster in 2015 as an insurance tech startup designed to disrupt an industry in need of innovative products. It's first product was Markellm, the marketplace connecting consumers with insurance providers. It rapidly expanded, adding new products and clients, reaching 200 emmployees by 2024 with 12 offices across the US.

Insurellm is hiring! We are looking for talented software engineers, data scientists and account executives to join our growing team. Come be a part of our movement to disrupt the insurance sector.

Insurellm is an innovative insurance tech firm with 200 employees across the US.
Insurellm offers 4 insurance software products:
- Carllm, a portal for auto insurance companies
- Homellm, a portal for home insurance companies
- Rellm, an enterprise platform for the reinsurance sector
- Marketllm, a marketplace for connecting consumers with insurance providers
  
Insurellm has more than 300 clients worldwide.<br>