This code implements the proposition chunking method, based on research from Tony Chen et al. The system breaks the input text down into propositions that are atomic, factual, self-contained, and concise, then encodes the propositions into a vector store that can later be used for retrieval.

In [1]:
import os
import sys
from dotenv import load_dotenv
load_dotenv()
from langchain_groq import ChatGroq
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma



In [2]:
sample_content = """Paul Graham's essay "Founder Mode," published in September 2024, challenges conventional wisdom about scaling startups, arguing that founders should maintain their unique management style rather than adopting traditional corporate practices as their companies grow.
Conventional Wisdom vs. Founder Mode
The essay argues that the traditional advice given to growing companies—hiring good people and giving them autonomy—often fails when applied to startups.
This approach, suitable for established companies, can be detrimental to startups where the founder's vision and direct involvement are crucial. "Founder Mode" is presented as an emerging paradigm that is not yet fully understood or documented, contrasting with the conventional "manager mode" often advised by business schools and professional managers.
Unique Founder Abilities
Founders possess unique insights and abilities that professional managers do not, primarily because they have a deep understanding of their company's vision and culture.
Graham suggests that founders should leverage these strengths rather than conform to traditional managerial practices. "Founder Mode" is an emerging paradigm that is not yet fully understood or documented, with Graham hoping that over time, it will become as well-understood as the traditional manager mode, allowing founders to maintain their unique approach even as their companies scale.
Challenges of Scaling Startups
As startups grow, there is a common belief that they must transition to a more structured managerial approach. However, many founders have found this transition problematic, as it often leads to a loss of the innovative and agile spirit that drove the startup's initial success.
Brian Chesky, co-founder of Airbnb, shared his experience of being advised to run the company in a traditional managerial style, which led to poor outcomes. He eventually found success by adopting a different approach, influenced by how Steve Jobs managed Apple.
Steve Jobs' Management Style
Steve Jobs' management approach at Apple served as inspiration for Brian Chesky's "Founder Mode" at Airbnb. One notable practice was Jobs' annual retreat for the 100 most important people at Apple, regardless of their position on the organizational chart
. This unconventional method allowed Jobs to maintain a startup-like environment even as Apple grew, fostering innovation and direct communication across hierarchical levels. Such practices emphasize the importance of founders staying deeply involved in their companies' operations, challenging the traditional notion of delegating responsibilities to professional managers as companies scale.
"""

In [25]:
from langchain_core.documents import Document
docs_list = [Document(
    page_content=sample_content,
    metadata={
        "Title": "Paul Graham's Founder Mode Essay",
        "Source": "https://www.perplexity.ai/page/paul-graham-s-founder-mode-ess-t9TCyvkqRiyMQJWsHr0fnQ",
    },
)]

splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=50
)

docs_splitted=splitter.split_documents(docs_list)
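
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-width chunking with overlap. It is a simplification for illustration only: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraph and line breaks, so its real output will differ. The helper name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters, each sharing
    chunk_overlap characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

demo = "abcdefghijklmnopqrstuvwxyz"
print(sliding_chunks(demo, chunk_size=10, chunk_overlap=4))
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each window repeats the last four characters of the previous one, which is the role `chunk_overlap=50` plays for the 200-character chunks above.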

In [26]:
#Adding chunk id to the documents list
for i, doc in enumerate(docs_splitted):
    doc.metadata['chunk_id']=i+1

In [43]:
docs_splitted[2]

Document(metadata={'Title': "Paul Graham's Founder Mode Essay", 'Source': 'https://www.perplexity.ai/page/paul-graham-s-founder-mode-ess-t9TCyvkqRiyMQJWsHr0fnQ', 'chunk_id': 3}, page_content='Conventional Wisdom vs. Founder Mode\nThe essay argues that the traditional advice given to growing companies—hiring good people and giving them autonomy—often fails when applied to startups.')

In [30]:
#Generate proposition
from pydantic import Field, BaseModel
from typing import List
from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate
from langchain_openai import ChatOpenAI

#Schema for LLM output (Proposition)
class GenerateProposition(BaseModel):
    """ List of all the propositions in a given document"""
    propositions: List[str] = Field(description="List of propositions (factual, self-contained, and concise information)")

groq_api_key=os.getenv("GROQ_API_KEY")
llm=ChatGroq(groq_api_key=groq_api_key,model_name="Llama3-8b-8192")
#llm=ChatOpenAI(temperature=0, model_name="gpt-4-turbo-preview")

structured_llm=llm.with_structured_output(GenerateProposition)

#Few shot prompting

proposition_examples = [
    {"document": 
        "In 1969, Neil Armstrong became the first person to walk on the Moon during the Apollo 11 mission.", 
     "propositions": 
        "['Neil Armstrong was an astronaut.', 'Neil Armstrong walked on the Moon in 1969.', 'Neil Armstrong was the first person to walk on the Moon.', 'Neil Armstrong walked on the Moon during the Apollo 11 mission.', 'The Apollo 11 mission occurred in 1969.']"
    },
]

proposition_prompt_example = ChatPromptTemplate.from_messages(
    [
        ('human','{document}'),
        ('ai','{propositions}')
    ]
)

few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=proposition_prompt_example,
    examples=proposition_examples
)

#System prompt

system_prompt = """Break down the given document into simple, self-contained propositions. Ensure that each proposition meets the following criteria:

    1. Express a Single Fact: Each proposition should state one specific fact or claim.
    2. Be Understandable Without Context: The proposition should be self-contained, meaning it can be understood without needing additional context.
    3. Use Full Names, Not Pronouns: Avoid pronouns or ambiguous references; use full entity names.
    4. Include Relevant Dates/Qualifiers: If applicable, include necessary dates, times, and qualifiers to make the fact precise.
    5. Contain One Subject-Predicate Relationship: Focus on a single subject and its corresponding action or attribute, without conjunctions or multiple clauses.
    
    Use the few shot prompting examples given to generate the propositions
    """

prompt = ChatPromptTemplate.from_messages(
    [
        ('system',system_prompt),
        (few_shot_prompt),
        ('human','{document}')
    ]
)

proposition_generator = prompt | structured_llm
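
In the few-shot example, the `propositions` value is a single string that looks like a Python list, not an actual list. As a sketch of how that string maps onto the `List[str]` the schema expects, it can be parsed with the standard library's `ast.literal_eval`. The helper `parse_proposition_string` is hypothetical and not part of the pipeline, which relies on `with_structured_output` instead.

```python
import ast

def parse_proposition_string(raw: str) -> list[str]:
    """Parse a string such as "['A.', 'B.']" into a list of non-empty, stripped strings."""
    value = ast.literal_eval(raw)  # safely evaluates literals only, never arbitrary code
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise ValueError("expected a list of strings")
    return [v.strip() for v in value if v.strip()]

example = "['Neil Armstrong was an astronaut.', 'The Apollo 11 mission occurred in 1969.']"
print(parse_proposition_string(example))
# ['Neil Armstrong was an astronaut.', 'The Apollo 11 mission occurred in 1969.']
```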


In [31]:
propositions_list = []  #store all propositions in this list

for doc in docs_splitted:
    print("Entered")
    response = proposition_generator.invoke({'document': doc.page_content})
    for prop in response.propositions:
        # Each proposition inherits Title, Source, and chunk_id from its source chunk
        propositions_list.append(Document(page_content=prop, metadata=dict(doc.metadata)))
        

Entered
Entered
Entered
Entered
Entered
Entered
Entered
Entered
Entered
Entered
Entered
Entered
Entered
Entered
Entered
Entered
Entered
Entered
Entered
Entered
Entered


In [47]:
len(propositions_list)

82
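
Rule 3 of the system prompt asks for full entity names instead of pronouns. Before spending an LLM call on grading, a cheap heuristic could flag propositions that open with a pronoun. This is a minimal sketch: the pronoun list and the helper `starts_with_pronoun` are assumptions for illustration, not part of the original method.

```python
import re

# Assumed, non-exhaustive list of words that usually signal a missing referent.
PRONOUNS = {"he", "she", "it", "they", "his", "her", "its", "their", "this", "these", "those"}

def starts_with_pronoun(proposition: str) -> bool:
    """Return True when the first word of the proposition is in PRONOUNS."""
    match = re.match(r"[A-Za-z']+", proposition.strip())
    return bool(match) and match.group(0).lower() in PRONOUNS

samples = [
    "They maintain a unique management style.",
    "Brian Chesky is a co-founder of Airbnb.",
]
for s in samples:
    print(starts_with_pronoun(s), "-", s)
# True - They maintain a unique management style.
# False - Brian Chesky is a co-founder of Airbnb.
```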

### Quality Check

In [34]:
#Schema for llm output
class GradePropositions(BaseModel):
    """ Grade a given proposition on accuracy, clarity, completeness and conciseness"""
    accuracy : int = Field(
        description="Rate from 1-10 based on how well the proposition reflects the original text"
    )

    clarity : int = Field(
        description="Rate from 1-10 based on how easy it is to understand the proposition without additional context."
    )

    completeness : int = Field(
        description = "Rate from 1-10 based on whether the proposition includes necessary details (e.g., dates, qualifiers)."
    )

    conciseness : int = Field(
        description = "Rate from 1-10 based on whether the proposition is concise without losing important information."
    )



In [35]:
#LLM for function call
llm=ChatGroq(groq_api_key=groq_api_key,model_name="Llama3-8b-8192")
structured_llm = llm.with_structured_output(GradePropositions)

#Evaluation prompt
evaluation_system_prompt = """Please evaluate the following proposition based on the criteria below:
- **Accuracy**: Rate from 1-10 based on how well the proposition reflects the original text.
- **Clarity**: Rate from 1-10 based on how easy it is to understand the proposition without additional context.
- **Completeness**: Rate from 1-10 based on whether the proposition includes necessary details (e.g., dates, qualifiers).
- **Conciseness**: Rate from 1-10 based on whether the proposition is concise without losing important information.

Example:
Docs: In 1969, Neil Armstrong became the first person to walk on the Moon during the Apollo 11 mission.

Proposition_1: Neil Armstrong was an astronaut.
Evaluation_1: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Proposition_2: Neil Armstrong walked on the Moon in 1969.
Evaluation_2: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Proposition_3: Neil Armstrong was the first person to walk on the Moon.
Evaluation_3: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Proposition_4: Neil Armstrong walked on the Moon during the Apollo 11 mission.
Evaluation_4: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Proposition_5: The Apollo 11 mission occurred in 1969.
Evaluation_5: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Format:
Proposition: "{proposition}"
Original Text: "{original_text}"
"""

prompt = ChatPromptTemplate.from_messages(
    [
        ('system',evaluation_system_prompt),
        ('user','{proposition}, {original_text}')
    ]
)

proposition_evaluator = prompt | structured_llm

In [44]:
#Define evaluation categories and thresholds
evaluation_categories = ["accuracy", "clarity", "completeness", "conciseness"]
thresholds = {"accuracy": 7, "clarity": 7, "completeness": 7, "conciseness": 7}

#Function to evaluate propositions
def evaluate_propositions(original_text,proposition):
    response = proposition_evaluator.invoke({'original_text':original_text,'proposition':proposition})

    #parse the response to have the scores
    scores = {
        "accuracy":response.accuracy,
        "clarity": response.clarity, 
        "completeness": response.completeness, 
        "conciseness": response.conciseness
    }

    return scores

# Check whether a proposition meets the threshold in every category
def quality_check(scores):
    for category,score in scores.items():
        if score < thresholds[category]:
            return False
    return True

evaluated_propositions = []

#Evaluate each proposition against the chunk it was generated from
for idx, proposition in enumerate(propositions_list):  # idx avoids shadowing the built-in id()
    scores = evaluate_propositions(docs_splitted[proposition.metadata['chunk_id'] - 1].page_content, proposition.page_content)
    if quality_check(scores):
        #Proposition passes the quality check, keep it
        evaluated_propositions.append(proposition)
    else:
        #Proposition fails the quality check, discard it
        print(f"{idx+1} Propositions Fail : {proposition.page_content} \n {scores}")




5 Propositions Fail : They maintain a unique management style. 
 {'accuracy': 8, 'clarity': 9, 'completeness': 9, 'conciseness': 6}
6 Propositions Fail : They do not adopt traditional corporate practices. 
 {'accuracy': 0, 'clarity': 0, 'completeness': 0, 'conciseness': 0}
7 Propositions Fail : Their companies grow. 
 {'accuracy': 8, 'clarity': 6, 'completeness': 7, 'conciseness': 5}
8 Propositions Fail : Their management style remains unique as their companies grow. 
 {'accuracy': 10, 'clarity': 10, 'completeness': 10, 'conciseness': 5}
19 Propositions Fail : "Mode" is not yet fully documented. 
 {'accuracy': 5, 'clarity': 7, 'completeness': 8, 'conciseness': 6}
20 Propositions Fail : "Manager mode" is a conventional paradigm. 
 {'accuracy': 8, 'clarity': 7, 'completeness': 8, 'conciseness': 6}
21 Propositions Fail : Business schools often advise "manager mode". 
 {'accuracy': 7, 'clarity': 8, 'completeness': 6, 'conciseness': 4}
23 Propositions Fail : Business schools advise. 
 {'acc

In [46]:
len(evaluated_propositions)

56
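
One of the failures printed above scores 0 on every criterion, although the prompt asks for ratings from 1 to 10. The sketch below shows one way such out-of-range scores could be separated from genuinely low ratings, and how the failing criteria could be listed by name. The helper names are hypothetical, and treating out-of-range scores as invalid is an assumption rather than something the notebook does.

```python
THRESHOLDS = {"accuracy": 7, "clarity": 7, "completeness": 7, "conciseness": 7}

def invalid_scores(scores: dict[str, int], low: int = 1, high: int = 10) -> list[str]:
    """Criteria whose score falls outside the requested low..high range."""
    return [name for name, value in scores.items() if not low <= value <= high]

def failed_criteria(scores: dict[str, int]) -> list[str]:
    """Criteria whose score is below its threshold."""
    return [name for name, value in scores.items() if value < THRESHOLDS[name]]

# Both score dictionaries are copied from the printed failures above.
low_conciseness = {"accuracy": 8, "clarity": 9, "completeness": 9, "conciseness": 6}
all_zero = {"accuracy": 0, "clarity": 0, "completeness": 0, "conciseness": 0}

print(failed_criteria(low_conciseness))  # ['conciseness']
print(invalid_scores(all_zero))          # ['accuracy', 'clarity', 'completeness', 'conciseness']
```

A proposition flagged by `invalid_scores` might be worth re-grading rather than discarding, since the scores say more about the grader's output than about the proposition.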

### Embedding propositions into vectorstore

In [48]:
#Embeddings
embeddings=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
from langchain_community.vectorstores import FAISS
vectorstore = FAISS.from_documents(evaluated_propositions,embeddings)

#Retriever
retriever = vectorstore.as_retriever(
    search_type = "similarity",
    search_kwargs = {"k":4}
)



In [49]:
query = "Whose management approach served as inspiration for Brian Chesky's \"Founder Mode\" at Airbnb?"
retrieved_docs = retriever.invoke(query)

In [51]:
for i, r in enumerate(retrieved_docs):
    print(f"{i+1}) Content: {r.page_content} --- chunk id : {r.metadata['chunk_id']}")

1) Content: Brian Chesky adopted a "Founder Mode" at Airbnb. --- chunk id : 17
2) Content: Brian Chesky is a co-founder of Airbnb. --- chunk id : 14
3) Content: Brian Chesky was advised to run Airbnb in a traditional managerial style. --- chunk id : 14
4) Content: Steve Jobs' management approach inspired Brian Chesky. --- chunk id : 17
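
Conceptually, a similarity retriever with `k = 4` embeds the query, scores it against every stored vector, and returns the four best matches. The sketch below illustrates that idea with hand-written toy vectors and cosine similarity. It is not how FAISS is implemented, and the vectors, texts, and the `top_k` helper are made up for illustration.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query: list[float], store: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the k stored texts most similar to the query, best first."""
    ranked = sorted(store, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy 2-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
store = [
    ("doc about Airbnb", [1.0, 0.0]),
    ("doc about Apple", [0.0, 1.0]),
    ("doc about both", [1.0, 1.0]),
]
print(top_k([1.0, 0.1], store, k=2))
# ['doc about Airbnb', 'doc about both']
```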


### Comparing performance with larger chunk size

In [52]:
vectorstore_larger = FAISS.from_documents(docs_splitted,embeddings)

#Retriever
retriever_larger = vectorstore_larger.as_retriever(
    search_type = "similarity",
    search_kwargs = {"k":4}
)

In [53]:
retrieved_docs_large = retriever_larger.invoke(query)

In [54]:
for i, r in enumerate(retrieved_docs_large):
    print(f"{i+1}) Content: {r.page_content} --- chunk id : {r.metadata['chunk_id']}")

1) Content: Brian Chesky, co-founder of Airbnb, shared his experience of being advised to run the company in a traditional managerial style, which led to poor outcomes. He eventually found success by adopting a --- chunk id : 14
2) Content: Steve Jobs' management approach at Apple served as inspiration for Brian Chesky's "Founder Mode" at Airbnb. One notable practice was Jobs' annual retreat for the 100 most important people at Apple, --- chunk id : 17
3) Content: manager mode, allowing founders to maintain their unique approach even as their companies scale. --- chunk id : 10
4) Content: This approach, suitable for established companies, can be detrimental to startups where the founder's vision and direct involvement are crucial. "Founder Mode" is presented as an emerging paradigm --- chunk id : 4
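
One way to read the two result lists side by side is to compare which source chunks each retriever drew from. The chunk ids below are copied from the two outputs above; the helper `unique_in_order` is a hypothetical convenience.

```python
def unique_in_order(ids: list[int]) -> list[int]:
    """Drop repeated ids while keeping first-seen order."""
    seen: set[int] = set()
    ordered = []
    for i in ids:
        if i not in seen:
            seen.add(i)
            ordered.append(i)
    return ordered

# Chunk ids in rank order, taken from the printed results.
proposition_hits = [17, 14, 14, 17]
larger_chunk_hits = [14, 17, 10, 4]

proposition_sources = unique_in_order(proposition_hits)
shared = [i for i in larger_chunk_hits if i in proposition_sources]
only_larger = [i for i in larger_chunk_hits if i not in proposition_sources]

print("proposition sources:", proposition_sources)  # [17, 14]
print("shared with larger chunks:", shared)         # [14, 17]
print("only in larger chunks:", only_larger)        # [10, 4]
```

For this query, every proposition hit traces back to chunks 14 and 17, while the larger-chunk retriever spends two of its four slots on chunks 10 and 4.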


### Testing

#### Test-1

In [55]:
test_query_1 = "what is the essay \"Founder Mode\" about?"
retrieved_proposition = retriever.invoke(test_query_1)
retrieved_larger = retriever_larger.invoke(test_query_1)

In [56]:
for i, r in enumerate(retrieved_proposition):
    print(f"{i+1}) Content: {r.page_content} --- chunk id : {r.metadata['chunk_id']}")

1) Content: "Founder Mode" is an emerging paradigm. --- chunk id : 4
2) Content: "Founder Mode" is an emerging paradigm. --- chunk id : 8
3) Content: Paul Graham wrote the essay "Founder Mode". --- chunk id : 1
4) Content: Paul Graham's essay "Founder Mode" challenges conventional wisdom about scaling startups. --- chunk id : 1


In [57]:
for i, r in enumerate(retrieved_larger):
    print(f"{i+1}) Content: {r.page_content} --- chunk id : {r.metadata['chunk_id']}")

1) Content: This approach, suitable for established companies, can be detrimental to startups where the founder's vision and direct involvement are crucial. "Founder Mode" is presented as an emerging paradigm --- chunk id : 4
2) Content: Graham suggests that founders should leverage these strengths rather than conform to traditional managerial practices. "Founder Mode" is an emerging paradigm that is not yet fully understood or --- chunk id : 8
3) Content: Paul Graham's essay "Founder Mode," published in September 2024, challenges conventional wisdom about scaling startups, arguing that founders should maintain their unique management style rather than --- chunk id : 1
4) Content: Conventional Wisdom vs. Founder Mode
The essay argues that the traditional advice given to growing companies—hiring good people and giving them autonomy—often fails when applied to startups. --- chunk id : 3
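
The proposition results for this query contain the same sentence twice, once from chunk 4 and once from chunk 8, because the source text repeats it. If such repeats are unwanted, propositions could be de-duplicated on normalized text before indexing. A minimal sketch, using plain dictionaries in place of `Document` objects; the helper `dedupe_propositions` and the choice to keep the first occurrence are assumptions.

```python
def dedupe_propositions(props: list[dict]) -> list[dict]:
    """Keep the first proposition for each distinct text,
    comparing case-insensitively and ignoring extra whitespace."""
    seen: set[str] = set()
    unique = []
    for p in props:
        key = " ".join(p["text"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique

props = [
    {"text": '"Founder Mode" is an emerging paradigm.', "chunk_id": 4},
    {"text": '"Founder Mode" is an emerging paradigm.', "chunk_id": 8},
    {"text": 'Paul Graham wrote the essay "Founder Mode".', "chunk_id": 1},
]
for p in dedupe_propositions(props):
    print(p["chunk_id"], p["text"])
# 4 "Founder Mode" is an emerging paradigm.
# 1 Paul Graham wrote the essay "Founder Mode".
```

Dropping the duplicate frees a slot in the top-`k` results for a different proposition, at the cost of losing the link to the second source chunk.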


#### Test-2

In [58]:
test_query_2 = "who is the co-founder of Airbnb?"
retrieved_proposition = retriever.invoke(test_query_2)
retrieved_larger = retriever_larger.invoke(test_query_2)

In [59]:
for i, r in enumerate(retrieved_proposition):
    print(f"{i+1}) Content: {r.page_content} --- chunk id : {r.metadata['chunk_id']}")

1) Content: Brian Chesky is a co-founder of Airbnb. --- chunk id : 14
2) Content: Brian Chesky adopted a "Founder Mode" at Airbnb. --- chunk id : 17
3) Content: Brian Chesky was advised to run Airbnb in a traditional managerial style. --- chunk id : 14
4) Content: In startups, the founder's direct involvement is crucial. --- chunk id : 4


In [60]:
for i, r in enumerate(retrieved_larger):
    print(f"{i+1}) Content: {r.page_content} --- chunk id : {r.metadata['chunk_id']}")

1) Content: Brian Chesky, co-founder of Airbnb, shared his experience of being advised to run the company in a traditional managerial style, which led to poor outcomes. He eventually found success by adopting a --- chunk id : 14
2) Content: Steve Jobs' management approach at Apple served as inspiration for Brian Chesky's "Founder Mode" at Airbnb. One notable practice was Jobs' annual retreat for the 100 most important people at Apple, --- chunk id : 17
3) Content: This approach, suitable for established companies, can be detrimental to startups where the founder's vision and direct involvement are crucial. "Founder Mode" is presented as an emerging paradigm --- chunk id : 4
4) Content: Unique Founder Abilities
Founders possess unique insights and abilities that professional managers do not, primarily because they have a deep understanding of their company's vision and culture. --- chunk id : 7
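
Because every proposition carries the `chunk_id` of the chunk it came from, retrieved propositions can be mapped back to their source chunks when more surrounding context is needed. The notebook does not do this. The sketch below is one possible extension, with plain dictionaries standing in for `Document` objects and placeholder strings standing in for the chunk text; the helper `parent_chunks` is hypothetical.

```python
def parent_chunks(hits: list[dict], chunks_by_id: dict[int, str]) -> list[str]:
    """Return the source chunk text for each hit, in rank order, without repeats."""
    seen: set[int] = set()
    parents = []
    for hit in hits:
        cid = hit["chunk_id"]
        if cid not in seen and cid in chunks_by_id:
            seen.add(cid)
            parents.append(chunks_by_id[cid])
    return parents

# The first three proposition hits for test_query_2, with their chunk ids.
hits = [
    {"text": "Brian Chesky is a co-founder of Airbnb.", "chunk_id": 14},
    {"text": 'Brian Chesky adopted a "Founder Mode" at Airbnb.', "chunk_id": 17},
    {"text": "Brian Chesky was advised to run Airbnb in a traditional managerial style.", "chunk_id": 14},
]
# Placeholder text; in the notebook this would be docs_splitted[chunk_id - 1].page_content.
chunks_by_id = {14: "<text of chunk 14>", 17: "<text of chunk 17>"}

print(parent_chunks(hits, chunks_by_id))
# ['<text of chunk 14>', '<text of chunk 17>']
```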
