The idea is to split the document into chunks and have an LLM generate "propositions" for each chunk: short, self-contained factual statements derived from the chunk's text.

We then grade each proposition and embed only those that pass a quality bar.
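
As a rough map of the steps that follow, the flow can be sketched as one function that takes the three stages as callables. The names `build_propositions`, `split`, `propose`, `grade`, and `min_score` are hypothetical placeholders, not part of LlamaIndex or LangChain; the cells below fill these roles with a text splitter and two LLM calls.

```python
from typing import Callable, Dict, List, Tuple


def build_propositions(
    text: str,
    split: Callable[[str], List[str]],
    propose: Callable[[str], List[str]],
    grade: Callable[[str, str], Dict[str, int]],
    min_score: int = 7,
) -> List[Tuple[int, str]]:
    """Return (chunk_id, proposition) pairs that pass the quality bar."""
    kept = []
    for chunk_id, chunk in enumerate(split(text), start=1):
        for proposition in propose(chunk):
            scores = grade(proposition, chunk)
            # Keep a proposition only if every criterion meets the bar.
            if all(score >= min_score for score in scores.values()):
                kept.append((chunk_id, proposition))
    return kept


# Toy stand-ins so the sketch runs without any API calls.
demo = build_propositions(
    "Alpha is first. Beta is second.\n\nGamma is third.",
    split=lambda t: t.split("\n\n"),
    propose=lambda chunk: [s.strip() + "." for s in chunk.split(".") if s.strip()],
    grade=lambda prop, chunk: {"accuracy": 10, "clarity": 4 if "Beta" in prop else 9},
)
print(demo)
```

The toy grader gives the "Beta" sentence a low clarity score, so it is dropped while the other two are kept along with the id of the chunk they came from.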

In [1]:
import os
from dotenv import load_dotenv
import openai

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

In [2]:
sample_content = """Paul Graham's essay "Founder Mode," published in September 2024, challenges conventional wisdom about scaling startups, arguing that founders should maintain their unique management style rather than adopting traditional corporate practices as their companies grow.
Conventional Wisdom vs. Founder Mode
The essay argues that the traditional advice given to growing companies—hiring good people and giving them autonomy—often fails when applied to startups.
This approach, suitable for established companies, can be detrimental to startups where the founder's vision and direct involvement are crucial. "Founder Mode" is presented as an emerging paradigm that is not yet fully understood or documented, contrasting with the conventional "manager mode" often advised by business schools and professional managers.
Unique Founder Abilities
Founders possess unique insights and abilities that professional managers do not, primarily because they have a deep understanding of their company's vision and culture.
Graham suggests that founders should leverage these strengths rather than conform to traditional managerial practices. "Founder Mode" is an emerging paradigm that is not yet fully understood or documented, with Graham hoping that over time, it will become as well-understood as the traditional manager mode, allowing founders to maintain their unique approach even as their companies scale.
Challenges of Scaling Startups
As startups grow, there is a common belief that they must transition to a more structured managerial approach. However, many founders have found this transition problematic, as it often leads to a loss of the innovative and agile spirit that drove the startup's initial success.
Brian Chesky, co-founder of Airbnb, shared his experience of being advised to run the company in a traditional managerial style, which led to poor outcomes. He eventually found success by adopting a different approach, influenced by how Steve Jobs managed Apple.
Steve Jobs' Management Style
Steve Jobs' management approach at Apple served as inspiration for Brian Chesky's "Founder Mode" at Airbnb. One notable practice was Jobs' annual retreat for the 100 most important people at Apple, regardless of their position on the organizational chart
. This unconventional method allowed Jobs to maintain a startup-like environment even as Apple grew, fostering innovation and direct communication across hierarchical levels. Such practices emphasize the importance of founders staying deeply involved in their companies' operations, challenging the traditional notion of delegating responsibilities to professional managers as companies scale.
"""

### Chunking

In [8]:
from llama_index.core import Document, Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from langchain.text_splitter import RecursiveCharacterTextSplitter
from llama_index.core.node_parser import LangchainNodeParser

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

docs_list = [Document(text=sample_content, metadata={"Title": "Paul Graham's Founder Mode Essay", "Source": "https://www.perplexity.ai/page/paul-graham-s-founder-mode-ess-t9TCyvkqRiyMQJWsHr0fnQ"})]

parser = LangchainNodeParser(RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=200, chunk_overlap=50))
nodes = parser.get_nodes_from_documents(docs_list)
print(len(nodes))

3


`RecursiveCharacterTextSplitter` tries to split text at coarse boundaries first, such as `\n\n`, and falls back to finer boundaries only for pieces that are still larger than the chunk size.
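
To make the recursion concrete, here is a simplified sketch of the idea in plain Python. It is not LangChain's implementation: the function name `recursive_split` and its parameters are made up for illustration, and it leaves out chunk overlap and some of the merging the real splitter does.

```python
from typing import Callable, List, Sequence


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " "),
    length: Callable[[str], int] = len,
) -> List[str]:
    """Split on the coarsest separator first, recursing only into oversized pieces."""
    if length(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut by character count.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if length(candidate) <= chunk_size:
            current = candidate  # piece still fits, keep merging
            continue
        if current:
            chunks.append(current)
        if length(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer, length))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = (
    "Intro paragraph.\n\n"
    "A second paragraph that is quite a bit longer than the first one.\n\n"
    "End."
)
chunks = recursive_split(text, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

The short paragraphs survive intact, and only the long middle paragraph is broken further, at word boundaries.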

The `from_tiktoken_encoder` constructor measures length in tokens, so `chunk_size` and `chunk_overlap` are given in tokens, not characters.
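
The unit matters a lot for how much text fits into a chunk. The sketch below compares a character count with a crude token count. `approx_token_count` is a hypothetical stand-in that counts words and punctuation marks; it is not tiktoken's encoding and will give different numbers, but it shows why the same limit behaves so differently in the two units.

```python
import re


def approx_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: count words and punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


sentence = "Founders possess unique insights and abilities."
char_len = len(sentence)
token_len = approx_token_count(sentence)
print(f"characters={char_len}, approx tokens={token_len}")

# The same numeric limit is far stricter when it counts characters.
limit = 20
print("fits as characters:", char_len <= limit)
print("fits as tokens:", token_len <= limit)
```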

`LangchainNodeParser` is an adapter that lets LangChain text splitters be used as LlamaIndex node parsers.

In [9]:
for i, doc in enumerate(nodes):
    doc.metadata["chunk_id"] = i+1

print(nodes[0].metadata)

{'Title': "Paul Graham's Founder Mode Essay", 'Source': 'https://www.perplexity.ai/page/paul-graham-s-founder-mode-ess-t9TCyvkqRiyMQJWsHr0fnQ', 'chunk_id': 1}


In [11]:
from pydantic import BaseModel, Field
from typing import List

class GeneratePropositions(BaseModel):

    propositions: List[str] = Field(
        description="List of propositions (factual, self-contained, and concise information)"
    )

In [34]:

from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.core import ChatPromptTemplate

structured_llm = OpenAI(model="gpt-4o-mini").as_structured_llm(output_cls=GeneratePropositions)

proposition_examples = [
    {"document": 
        "In 1969, Neil Armstrong became the first person to walk on the Moon during the Apollo 11 mission.", 
     "propositions": 
        "['Neil Armstrong was an astronaut.', 'Neil Armstrong walked on the Moon in 1969.', 'Neil Armstrong was the first person to walk on the Moon.', 'Neil Armstrong walked on the Moon during the Apollo 11 mission.', 'The Apollo 11 mission occurred in 1969.']"
    },
]
proposition_prompts = [[ChatMessage(role=MessageRole.USER, content=example["document"]), ChatMessage(role=MessageRole.ASSISTANT, content=example["propositions"])] for example in proposition_examples]
flattened_proposition_prompts = [x for sublist in proposition_prompts for x in sublist]
print([*flattened_proposition_prompts])
system = """Please break down the following text into simple, self-contained propositions. Ensure that each proposition meets the following criteria:

    1. Express a Single Fact: Each proposition should state one specific fact or claim.
    2. Be Understandable Without Context: The proposition should be self-contained, meaning it can be understood without needing additional context.
    3. Use Full Names, Not Pronouns: Avoid pronouns or ambiguous references; use full entity names.
    4. Include Relevant Dates/Qualifiers: If applicable, include necessary dates, times, and qualifiers to make the fact precise.
    5. Contain One Subject-Predicate Relationship: Focus on a single subject and its corresponding action or attribute, without conjunctions or multiple clauses."""
prompt = ChatPromptTemplate([
    ChatMessage(role=MessageRole.SYSTEM, content=system),
    *flattened_proposition_prompts,
    ChatMessage(role=MessageRole.USER, content="{document}")
])

print(*[flattened_proposition_prompts])

[ChatMessage(role=<MessageRole.USER: 'user'>, content='In 1969, Neil Armstrong became the first person to walk on the Moon during the Apollo 11 mission.', additional_kwargs={}), ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, content="['Neil Armstrong was an astronaut.', 'Neil Armstrong walked on the Moon in 1969.', 'Neil Armstrong was the first person to walk on the Moon.', 'Neil Armstrong walked on the Moon during the Apollo 11 mission.', 'The Apollo 11 mission occurred in 1969.']", additional_kwargs={})]
[ChatMessage(role=<MessageRole.USER: 'user'>, content='In 1969, Neil Armstrong became the first person to walk on the Moon during the Apollo 11 mission.', additional_kwargs={}), ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, content="['Neil Armstrong was an astronaut.', 'Neil Armstrong walked on the Moon in 1969.', 'Neil Armstrong was the first person to walk on the Moon.', 'Neil Armstrong walked on the Moon during the Apollo 11 mission.', 'The Apollo 11 mission occur

With `ChatPromptTemplate.from_messages`, I would pass a list of `(role, content)` tuples instead of `ChatMessage` objects.
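
As a sketch of what that tuple form could look like for the few-shot prompt, the helper below builds the same message order (system, example pairs, then the live input) as plain tuples. The helper name `few_shot_tuples` and the short example are hypothetical, and the roles are written as plain strings for illustration; the grader cell later in this notebook passes `MessageRole` members in that position.

```python
from typing import Dict, List, Tuple


def few_shot_tuples(system: str, examples: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Build (role, content) pairs: system, user/assistant per example, then the live input."""
    messages = [("system", system)]
    for example in examples:
        messages.append(("user", example["document"]))
        messages.append(("assistant", example["propositions"]))
    # Placeholder that is filled in when the template is formatted.
    messages.append(("user", "{document}"))
    return messages


examples = [
    {"document": "Ada wrote a program.", "propositions": "['Ada wrote a program.']"},
]
message_tuples = few_shot_tuples("Break the text into propositions.", examples)
for role, content in message_tuples:
    print(role, "->", content)
```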

In [36]:
propositions = []

for i, node in enumerate(nodes):
    response = structured_llm.chat(prompt.format_messages(document=node.text))

    for proposition in response.raw.propositions:
        propositions.append(Document(text=proposition, metadata={"Title": "Paul Graham's Founder Mode Essay", "Source": "https://www.perplexity.ai/page/paul-graham-s-founder-mode-ess-t9TCyvkqRiyMQJWsHr0fnQ", "chunk_id": i+1}))

print(propositions)

[Document(id_='4c2e93fe-e3eb-4ddf-b863-05ea1c27146b', embedding=None, metadata={'Title': "Paul Graham's Founder Mode Essay", 'Source': 'https://www.perplexity.ai/page/paul-graham-s-founder-mode-ess-t9TCyvkqRiyMQJWsHr0fnQ', 'chunk_id': 1}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Paul Graham published the essay "Founder Mode" in September 2024.', mimetype='text/plain', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), Document(id_='54179c96-fbda-403d-9ea7-eec1ae773150', embedding=None, metadata={'Title': "Paul Graham's Founder Mode Essay", 'Source': 'https://www.perplexity.ai/page/paul-graham-s-founder-mode-ess-t9TCyvkqRiyMQJWsHr0fnQ', 'chunk_id': 1}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Paul Graham\'s essay "Founder Mode" challenges conventional wisdom about scaling startups.', mimetype='text/p

### Quality Check

In [37]:
class GradePropositions(BaseModel):
    """Grade a given proposition on accuracy, clarity, completeness, and conciseness"""

    accuracy: int = Field(
        description="Rate from 1-10 based on how well the proposition reflects the original text."
    )
    
    clarity: int = Field(
        description="Rate from 1-10 based on how easy it is to understand the proposition without additional context."
    )

    completeness: int = Field(
        description="Rate from 1-10 based on whether the proposition includes necessary details (e.g., dates, qualifiers)."
    )

    conciseness: int = Field(
        description="Rate from 1-10 based on whether the proposition is concise without losing important information."
    )

# LLM with function call
proposition_grader_llm = OpenAI(model="gpt-4o-mini").as_structured_llm(output_cls=GradePropositions)


# Prompt
evaluation_prompt_template = """
Please evaluate the following proposition based on the criteria below:
- **Accuracy**: Rate from 1-10 based on how well the proposition reflects the original text.
- **Clarity**: Rate from 1-10 based on how easy it is to understand the proposition without additional context.
- **Completeness**: Rate from 1-10 based on whether the proposition includes necessary details (e.g., dates, qualifiers).
- **Conciseness**: Rate from 1-10 based on whether the proposition is concise without losing important information.

Example:
Docs: In 1969, Neil Armstrong became the first person to walk on the Moon during the Apollo 11 mission.

Proposition_1: Neil Armstrong was an astronaut.
Evaluation_1: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Proposition_2: Neil Armstrong walked on the Moon in 1969.
Evaluation_2: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Proposition_3: Neil Armstrong was the first person to walk on the Moon.
Evaluation_3: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Proposition_4: Neil Armstrong walked on the Moon during the Apollo 11 mission.
Evaluation_4: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Proposition_5: The Apollo 11 mission occurred in 1969.
Evaluation_5: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Format:
Proposition: "{proposition}"
Original Text: "{original_text}"
"""
grader_prompt = ChatPromptTemplate.from_messages([
    (MessageRole.SYSTEM, evaluation_prompt_template),
    (MessageRole.USER, "{proposition}, {original_text}"),
])

In [42]:
thresholds = {
    "accuracy": 7,
    "clarity": 7,
    "completeness": 7,
    "conciseness": 7,
}

def evaluate_proposition(proposition, original_text):
    response = proposition_grader_llm.chat(grader_prompt.format_messages(proposition=proposition, original_text=original_text))
    scores = {"accuracy": response.raw.accuracy, "clarity": response.raw.clarity, "completeness": response.raw.completeness, "conciseness": response.raw.conciseness}
    return scores

def passes_quality_check(scores):
    for category, score in scores.items():
        if score < thresholds[category]:
            return False
    return True

passing_propositions = []

for i, proposition in enumerate(propositions):
    print(f"Evaluating Proposition {i+1}, chunk {proposition.metadata['chunk_id']}, {len(nodes)}")
    scores = evaluate_proposition(proposition.text, nodes[proposition.metadata["chunk_id"] - 1].text)
    if passes_quality_check(scores):
        passing_propositions.append(proposition)
    else:
        print(f"{i+1}) Proposition: {proposition.text} \n Scores: {scores}")
        print("Failed")


Evaluating Proposition 1, chunk 1, 3
Evaluating Proposition 2, chunk 1, 3
Evaluating Proposition 3, chunk 1, 3
Evaluating Proposition 4, chunk 1, 3
Evaluating Proposition 5, chunk 1, 3
Evaluating Proposition 6, chunk 1, 3
6) Proposition: The traditional approach is suitable for established companies. 
 Scores: {'accuracy': 8, 'clarity': 7, 'completeness': 6, 'conciseness': 8}
Failed
Evaluating Proposition 7, chunk 1, 3
Evaluating Proposition 8, chunk 1, 3
Evaluating Proposition 9, chunk 1, 3
Evaluating Proposition 10, chunk 1, 3
Evaluating Proposition 11, chunk 1, 3
Evaluating Proposition 12, chunk 1, 3
Evaluating Proposition 13, chunk 2, 3
Evaluating Proposition 14, chunk 2, 3
Evaluating Proposition 15, chunk 2, 3
Evaluating Proposition 16, chunk 2, 3
Evaluating Proposition 17, chunk 2, 3
Evaluating Proposition 18, chunk 2, 3
Evaluating Proposition 19, chunk 2, 3
Evaluating Proposition 20, chunk 2, 3
Evaluating Proposition 21, chunk 2, 3
21) Proposition: Traditional manager mode is curr
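
The log only says that a proposition failed, not which criterion caused it. A small variation on the check, sketched below, returns the names of the criteria that fell below their thresholds. The helper name `failing_criteria` is hypothetical; the thresholds and the scores for proposition 6 are copied from the cells above.

```python
from typing import Dict, List

THRESHOLDS = {"accuracy": 7, "clarity": 7, "completeness": 7, "conciseness": 7}


def failing_criteria(scores: Dict[str, int], thresholds: Dict[str, int] = THRESHOLDS) -> List[str]:
    """Return the criteria whose score is below its threshold (empty list means pass)."""
    return [name for name, minimum in thresholds.items() if scores.get(name, 0) < minimum]


# Scores reported above for proposition 6.
scores_6 = {"accuracy": 8, "clarity": 7, "completeness": 6, "conciseness": 8}
failed = failing_criteria(scores_6)
print(failed)
```

Proposition 6 is rejected on completeness alone: a clarity score of 7 meets the threshold, because the comparison is strictly "less than".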

### Embedding propositions in a vectorstore

In [43]:
import faiss

from llama_index.core import (
    SimpleDirectoryReader,
    load_index_from_storage,
    VectorStoreIndex,
    StorageContext,
)
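# text-embedding-3-small produces 1536-dimensional vectors by default
d = 1536
# IndexFlatL2 does exact (brute-force) search using L2 distance
faiss_index = faiss.IndexFlatL2(d)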


In [47]:
from llama_index.vector_stores.faiss import FaissVectorStore

vector_store = FaissVectorStore(faiss_index=faiss_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    passing_propositions, storage_context=storage_context
)

Create a retriever to query the index:

In [50]:
retriever = index.as_retriever(similarity_top_k=5)
query = "Whose management approach served as inspiration for Brian Chesky's \"Founder Mode\" at Airbnb?"
res_proposition = retriever.retrieve(query)
print(res_proposition)

[NodeWithScore(node=TextNode(id_='e966791e-4047-4364-8814-57189461517d', embedding=None, metadata={'Title': "Paul Graham's Founder Mode Essay", 'Source': 'https://www.perplexity.ai/page/paul-graham-s-founder-mode-ess-t9TCyvkqRiyMQJWsHr0fnQ', 'chunk_id': 3}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='c339a9e4-1e76-4edb-8b1f-ad06313aa43a', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'Title': "Paul Graham's Founder Mode Essay", 'Source': 'https://www.perplexity.ai/page/paul-graham-s-founder-mode-ess-t9TCyvkqRiyMQJWsHr0fnQ', 'chunk_id': 3}, hash='0d07ee0e080a17966c631b21cad633785db8d55b9d9c4dbe3449a41a3163de8a')}, text='Brian Chesky was advised to run Airbnb in a traditional managerial style.', mimetype='text/plain', start_char_idx=0, end_char_idx=73, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.5033988356590271), NodeWithSco

In [51]:
for i, r in enumerate(res_proposition):
    print(f"{i+1} Content: {r.text} --- Chunk id: {r.metadata['chunk_id']}")

1 Content: Brian Chesky was advised to run Airbnb in a traditional managerial style. --- Chunk id: 3
2 Content: Brian Chesky is the co-founder of Airbnb. --- Chunk id: 3
3 Content: Brian Chesky's approach was influenced by Steve Jobs' management style at Apple. --- Chunk id: 3
4 Content: Brian Chesky found success by adopting a different approach. --- Chunk id: 3
5 Content: The traditional managerial style led to poor outcomes for Airbnb. --- Chunk id: 3
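
The five results above come from a flat FAISS index, which compares the query vector against every stored vector and keeps the closest ones. The sketch below shows that brute-force search in plain Python on toy 2-D vectors. The function names are hypothetical, and it is only meant to illustrate what `similarity_top_k` selects, not how FAISS is implemented.

```python
import heapq
from typing import List, Sequence, Tuple


def l2_squared(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(
    query: Sequence[float], vectors: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, squared distance) for the k nearest vectors, closest first."""
    distances = [(i, l2_squared(query, v)) for i, v in enumerate(vectors)]
    return heapq.nsmallest(k, distances, key=lambda pair: pair[1])


# Toy "embeddings"; real ones here would have 1536 dimensions.
vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
nearest = top_k((0.9, 0.1), vectors, k=2)
print(nearest)
```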


### Comparing performance with larger chunk sizes

In [61]:
# Build a second FAISS-backed index directly from the original chunks (`nodes`).
# The VectorStoreIndex constructor embeds the nodes and adds them to the vector store.
d = 1536
faiss_index2 = faiss.IndexFlatL2(d)

vectorstore_larger = FaissVectorStore(faiss_index=faiss_index2)
vectorstore_larger_storage_context = StorageContext.from_defaults(vector_store=vectorstore_larger)
vectorstore_larger_index = VectorStoreIndex(nodes, storage_context=vectorstore_larger_storage_context)
retriever_larger = vectorstore_larger_index.as_retriever(similarity_top_k=5)

Now we test the performance on the index without proposition chunking:

In [63]:
res_larger = retriever_larger.retrieve(query)
for i, r in enumerate(res_larger):
    print(f"{i+1} Content: {r.text} --- Chunk id: {r.metadata['chunk_id']}")

1 Content: Brian Chesky, co-founder of Airbnb, shared his experience of being advised to run the company in a traditional managerial style, which led to poor outcomes. He eventually found success by adopting a different approach, influenced by how Steve Jobs managed Apple.
Steve Jobs' Management Style
Steve Jobs' management approach at Apple served as inspiration for Brian Chesky's "Founder Mode" at Airbnb. One notable practice was Jobs' annual retreat for the 100 most important people at Apple, regardless of their position on the organizational chart
. This unconventional method allowed Jobs to maintain a startup-like environment even as Apple grew, fostering innovation and direct communication across hierarchical levels. Such practices emphasize the importance of founders staying deeply involved in their companies' operations, challenging the traditional notion of delegating responsibilities to professional managers as companies scale. --- Chunk id: 3
2 Content: Unique Founder Abilit

### Testing

In [65]:
test_query_1 = "what is the essay \"Founder Mode\" about?"
res_proposition = retriever.retrieve(test_query_1)
res_larger = retriever_larger.retrieve(test_query_1)


In [66]:
for i, r in enumerate(res_proposition):
    print(f"{i+1}) Content: {r.text} --- Chunk_id: {r.metadata['chunk_id']}")

1) Content: "Founder Mode" contrasts with the conventional "manager mode" advised by business schools. --- Chunk_id: 1
2) Content: "Founder Mode" is an emerging paradigm that is not yet fully understood or documented. --- Chunk_id: 1
3) Content: "Founder Mode" is an emerging paradigm. --- Chunk_id: 2
4) Content: Paul Graham's essay "Founder Mode" challenges conventional wisdom about scaling startups. --- Chunk_id: 1
5) Content: Graham hopes that "Founder Mode" will become well-understood over time. --- Chunk_id: 2


In [67]:
for i, r in enumerate(res_larger):
    print(f"{i+1}) Content: {r.text} --- Chunk_id: {r.metadata['chunk_id']}")

1) Content: Paul Graham's essay "Founder Mode," published in September 2024, challenges conventional wisdom about scaling startups, arguing that founders should maintain their unique management style rather than adopting traditional corporate practices as their companies grow.
Conventional Wisdom vs. Founder Mode
The essay argues that the traditional advice given to growing companies—hiring good people and giving them autonomy—often fails when applied to startups.
This approach, suitable for established companies, can be detrimental to startups where the founder's vision and direct involvement are crucial. "Founder Mode" is presented as an emerging paradigm that is not yet fully understood or documented, contrasting with the conventional "manager mode" often advised by business schools and professional managers.
Unique Founder Abilities
Founders possess unique insights and abilities that professional managers do not, primarily because they have a deep understanding of their company's v