In [21]:
import os
from dotenv import load_dotenv
load_dotenv("../../config/local.env")


True

In [63]:
sample_content = """Paul Graham's essay "Founder Mode," published in September 2024, challenges conventional wisdom about scaling startups, arguing that founders should maintain their unique management style rather than adopting traditional corporate practices as their companies grow.
Conventional Wisdom vs. Founder Mode
The essay argues that the traditional advice given to growing companies—hiring good people and giving them autonomy—often fails when applied to startups.
This approach, suitable for established companies, can be detrimental to startups where the founder's vision and direct involvement are crucial. "Founder Mode" is presented as an emerging paradigm that is not yet fully understood or documented, contrasting with the conventional "manager mode" often advised by business schools and professional managers.
Unique Founder Abilities
Founders possess unique insights and abilities that professional managers do not, primarily because they have a deep understanding of their company's vision and culture.
Graham suggests that founders should leverage these strengths rather than conform to traditional managerial practices. "Founder Mode" is an emerging paradigm that is not yet fully understood or documented, with Graham hoping that over time, it will become as well-understood as the traditional manager mode, allowing founders to maintain their unique approach even as their companies scale.
Challenges of Scaling Startups
As startups grow, there is a common belief that they must transition to a more structured managerial approach. However, many founders have found this transition problematic, as it often leads to a loss of the innovative and agile spirit that drove the startup's initial success.
Brian Chesky, co-founder of Airbnb, shared his experience of being advised to run the company in a traditional managerial style, which led to poor outcomes. He eventually found success by adopting a different approach, influenced by how Steve Jobs managed Apple.
Steve Jobs' Management Style
Steve Jobs' management approach at Apple served as inspiration for Brian Chesky's "Founder Mode" at Airbnb. One notable practice was Jobs' annual retreat for the 100 most important people at Apple, regardless of their position on the organizational chart. This unconventional method allowed Jobs to maintain a startup-like environment even as Apple grew, fostering innovation and direct communication across hierarchical levels. Such practices emphasize the importance of founders staying deeply involved in their companies' operations, challenging the traditional notion of delegating responsibilities to professional managers as companies scale.
"""

In [64]:
### Build Index
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OllamaEmbeddings

In [99]:
embedding_model = OllamaEmbeddings(model='nomic-embed-text:v1.5', show_progress=True)

In [66]:
docs_list = [Document(page_content=sample_content, metadata={"Title": "Paul Graham's Founder Mode Essay", "Source": "https://www.perplexity.ai/page/paul-graham-s-founder-mode-ess-t9TCyvkqRiyMQJWsHr0fnQ"})]

`RecursiveCharacterTextSplitter` in LangChain is a **chunking algorithm** that splits text into pieces no larger than `chunk_size`, while trying to **preserve natural boundaries** (paragraphs → lines → words → characters) as much as possible.

---

## ✅ Core idea (why “recursive”?)

It tries a list of separators **in order**, from the coarsest boundary to the finest. The default separators are:

```python
["\n\n", "\n", " ", ""]
```

Meaning:

1. Split by **paragraphs** (`\n\n`)
2. If still too large, split by **newlines** (`\n`)
3. If still too large, split by **spaces** (` `)
4. If still too large, split by **characters** (`""`) ✅ last resort

That is the “recursive” part:
**if a chunk is too big → split it again using the next separator.**
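
The fallback order above can be sketched in plain Python. This is a simplified illustration, not LangChain's actual implementation: the function name `recursive_split` is hypothetical, and the sketch deliberately skips the merging and overlap steps, so it only shows how an oversized piece is handed to the next separator.

```python
from typing import List, Optional

DEFAULT_SEPARATORS = ["\n\n", "\n", " ", ""]


def recursive_split(
    text: str, chunk_size: int, separators: Optional[List[str]] = None
) -> List[str]:
    """Split text into pieces of at most chunk_size characters,
    trying coarse separators before fine ones (hypothetical sketch)."""
    if separators is None:
        separators = DEFAULT_SEPARATORS
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces: List[str] = []
    for part in text.split(sep):
        if len(part) <= chunk_size:
            if part:
                pieces.append(part)
        else:
            # Still too big: recurse with the next, finer separator.
            pieces.extend(recursive_split(part, chunk_size, finer))
    return pieces


print(recursive_split("aaaa bbbb\n\ncc", chunk_size=5))
# ['aaaa', 'bbbb', 'cc']
print(recursive_split("abcdefghij", chunk_size=4))
# ['abcd', 'efgh', 'ij']
```

The second call has no separators to work with, so it falls all the way through to the character-level cut.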

---

## ✅ What does it output?

It produces chunks with:

* `chunk_size` (max size)
* `chunk_overlap` (overlap between chunks so context is not lost)

Example:

```python
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
```

So each chunk is at most 1000 characters long, and up to the last 200 characters of one chunk are repeated at the start of the next.

---

## ✅ Step-by-step example

Suppose you have:

```
[Paragraph A: 1500 chars]

[Paragraph B: 800 chars]
```

With `chunk_size=1000`:

### Step 1: split by `\n\n`

It gets:

* Chunk candidate A (1500 chars) ❌ too big
* Chunk candidate B (800 chars) ✅ fine

### Step 2: A is too big → split A using next separator `\n`

Now paragraph A becomes smaller line blocks.
If still too big…

### Step 3: split using `" "` (word boundaries)

If still too big…

### Step 4: split using `""` (character level)

Guaranteed to fit.

---

## ✅ How chunk_overlap is applied

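The overlap is not added in a separate pass. While the splitter merges pieces into chunks, it carries the trailing pieces of one chunk (up to `chunk_overlap` in total length) over to the start of the next. Conceptually, this behaves like a **sliding window**: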

Example:

* chunk1 ends at position 1000
* chunk2 starts at position `1000 - overlap`

So overlap gives continuity like:

```
chunk1: [0..1000]
chunk2: [800..1800]
```
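
The positions above can be reproduced with a small sketch. It assumes an idealized fixed-size window that cuts at exact character offsets; the real splitter cuts at separators, so its chunk boundaries will usually not land on these exact numbers. The helper name `window_spans` is hypothetical.

```python
from typing import List, Tuple


def window_spans(text_len: int, chunk_size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of an idealized sliding window.

    Each window starts `chunk_size - overlap` characters after the previous one.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    spans: List[Tuple[int, int]] = []
    start = 0
    while start < text_len:
        end = min(start + chunk_size, text_len)
        spans.append((start, end))
        if end == text_len:
            break  # the last window reached the end of the text
        start += step
    return spans


print(window_spans(2500, chunk_size=1000, overlap=200))
# [(0, 1000), (800, 1800), (1600, 2500)]
```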

---

## ✅ Why it’s used so much in RAG

Because it prevents ugly cuts like this:

❌ bad split:

```
...the model was trained using transfo
rmer architecture...
```

✅ good split:

* prefers paragraph, line, and word boundaries, and only cuts mid-word as a last resort

This improves:

* retrieval quality
* embedding meaning
* answer coherence

---

## ✅ Important practical details

### 1) It splits by **characters**, not tokens

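By default, length is measured with `len`, so `chunk_size=1000` means 1000 characters, not tokens.

To measure chunks in tokens instead, use a token-aware constructor such as `RecursiveCharacterTextSplitter.from_tiktoken_encoder` (used in the split cell below). In that case `chunk_size` and `chunk_overlap` are counted in tokens.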
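
A small sketch of why the unit matters. The splitter accepts a `length_function` that decides how "size" is measured; here a crude regex-based counter stands in for a real tokenizer. Both `crude_token_count` and `fits` are hypothetical helpers for illustration, and a real tokenizer such as tiktoken would give different counts.

```python
import re
from typing import Callable


def crude_token_count(text: str) -> int:
    """Hypothetical stand-in for a tokenizer: counts words and punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def fits(text: str, chunk_size: int, length_function: Callable[[str], int] = len) -> bool:
    """Check whether text fits in chunk_size under the given length measure."""
    return length_function(text) <= chunk_size


sentence = "Founders should stay deeply involved as their companies scale."

# Measured in characters, the sentence is too long for chunk_size=20 ...
print(fits(sentence, 20))                     # False
# ... but measured in (crude) tokens, it fits easily.
print(fits(sentence, 20, crude_token_count))  # True
```

The same `chunk_size` value therefore yields much larger chunks when the length is counted in tokens rather than characters.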

---

### 2) Separator selection is “best effort”

It will try to keep structure, but if your text contains none of the separators (for example, a minified JSON blob without whitespace), it will eventually fall back to character-level splitting.

---

### 3) It also “merges” smaller pieces

After splitting, it tries to **combine adjacent small parts** for as long as the combined length stays within `chunk_size`.

So it’s not just “split everything blindly”.
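
The merge step can be sketched as a greedy loop. This is a simplified illustration under stated assumptions, not the library's code: the name `merge_pieces` is hypothetical, pieces are re-joined with a single space, and overlap handling is left out.

```python
from typing import List


def merge_pieces(pieces: List[str], chunk_size: int, joiner: str = " ") -> List[str]:
    """Greedily combine adjacent pieces while the result stays within chunk_size."""
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + joiner + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep growing the current chunk
        else:
            if current:
                chunks.append(current)
            # Start a new chunk; a piece longer than chunk_size stays oversized here.
            current = piece
    if current:
        chunks.append(current)
    return chunks


print(merge_pieces(["aaaa", "bbbb", "cc", "dddddd"], chunk_size=10))
# ['aaaa bbbb', 'cc dddddd']
```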

---

In [67]:
# Split
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=200, chunk_overlap=50
)

In [68]:
doc_splits = text_splitter.split_documents(docs_list)

In [81]:
len(doc_splits)  

3

In [69]:
for i, doc in enumerate(doc_splits):
    doc.metadata['chunk_id'] = i + 1  # add a 1-based chunk id

In [70]:
from typing import List
from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

In [71]:
# Data model
class GeneratePropositions(BaseModel):
    """List of all the propositions in a given document"""

    propositions: List[str] = Field(
        description="List of propositions (factual, self-contained, and concise information)"
    )

In [72]:
llm = ChatOpenAI(model="gpt-4o", temperature=0)

In [73]:
structured_llm = llm.with_structured_output(GeneratePropositions)

In [74]:
proposition_examples = [
    {"document": 
        "In 1969, Neil Armstrong became the first person to walk on the Moon during the Apollo 11 mission.", 
     "propositions": 
        "['Neil Armstrong was an astronaut.', 'Neil Armstrong walked on the Moon in 1969.', 'Neil Armstrong was the first person to walk on the Moon.', 'Neil Armstrong walked on the Moon during the Apollo 11 mission.', 'The Apollo 11 mission occurred in 1969.']"
    },
]

In [75]:
example_proposition_prompt = ChatPromptTemplate.from_messages(
    [
        ("human", "{document}"),
        ("ai", "{propositions}"),
    ]
)

In [76]:
few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt = example_proposition_prompt,
    examples = proposition_examples,
)

In [77]:
# Prompt
system = """Please break down the following text into simple, self-contained propositions. Ensure that each proposition meets the following criteria:

    1. Express a Single Fact: Each proposition should state one specific fact or claim.
    2. Be Understandable Without Context: The proposition should be self-contained, meaning it can be understood without needing additional context.
    3. Use Full Names, Not Pronouns: Avoid pronouns or ambiguous references; use full entity names.
    4. Include Relevant Dates/Qualifiers: If applicable, include necessary dates, times, and qualifiers to make the fact precise.
    5. Contain One Subject-Predicate Relationship: Focus on a single subject and its corresponding action or attribute, without conjunctions or multiple clauses."""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        few_shot_prompt,
        ("human", "{document}"),
    ]
)

In [78]:
proposition_generator = prompt | structured_llm

In [80]:
propositions = [] # Store all the propositions from the document

for i in range(len(doc_splits)):
    response = proposition_generator.invoke({"document": doc_splits[i].page_content}) # Creating proposition
    for proposition in response.propositions:
        propositions.append(Document(page_content=proposition, metadata={"Title": "Paul Graham's Founder Mode Essay", "Source": "https://www.perplexity.ai/page/paul-graham-s-founder-mode-ess-t9TCyvkqRiyMQJWsHr0fnQ", "chunk_id": i+1}))

In [82]:
# Data model
class GradePropositions(BaseModel):
    """Grade a given proposition on accuracy, clarity, completeness, and conciseness"""

    accuracy: int = Field(
        description="Rate from 1-10 based on how well the proposition reflects the original text."
    )
    
    clarity: int = Field(
        description="Rate from 1-10 based on how easy it is to understand the proposition without additional context."
    )

    completeness: int = Field(
        description="Rate from 1-10 based on whether the proposition includes necessary details (e.g., dates, qualifiers)."
    )

    conciseness: int = Field(
        description="Rate from 1-10 based on whether the proposition is concise without losing important information."
    )

# LLM with function call
llm = ChatOpenAI(model="gpt-4o", temperature=0)
structured_llm = llm.with_structured_output(GradePropositions)

# Prompt
evaluation_prompt_template = """
Please evaluate the following proposition based on the criteria below:
- **Accuracy**: Rate from 1-10 based on how well the proposition reflects the original text.
- **Clarity**: Rate from 1-10 based on how easy it is to understand the proposition without additional context.
- **Completeness**: Rate from 1-10 based on whether the proposition includes necessary details (e.g., dates, qualifiers).
- **Conciseness**: Rate from 1-10 based on whether the proposition is concise without losing important information.

Example:
Docs: In 1969, Neil Armstrong became the first person to walk on the Moon during the Apollo 11 mission.

Proposition_1: Neil Armstrong was an astronaut.
Evaluation_1: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Proposition_2: Neil Armstrong walked on the Moon in 1969.
Evaluation_2: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Proposition_3: Neil Armstrong was the first person to walk on the Moon.
Evaluation_3: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Proposition_4: Neil Armstrong walked on the Moon during the Apollo 11 mission.
Evaluation_4: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Proposition_5: The Apollo 11 mission occurred in 1969.
Evaluation_5: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Format:
Proposition: "{proposition}"
Original Text: "{original_text}"
"""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", evaluation_prompt_template),
        ("human", "{proposition}, {original_text}"),
    ]
)

proposition_evaluator = prompt | structured_llm

In [94]:
# Define evaluation categories and thresholds
evaluation_categories = ["accuracy", "clarity", "completeness", "conciseness"]
thresholds = {"accuracy": 7, "clarity": 7, "completeness": 7, "conciseness": 7}

# Function to evaluate proposition
def evaluate_proposition(proposition, original_text):
    response = proposition_evaluator.invoke({"proposition": proposition, "original_text": original_text})
    
    # Parse the response to extract scores
    scores = {
        "accuracy": response.accuracy,
        "clarity": response.clarity,
        "completeness": response.completeness,
        "conciseness": response.conciseness,
    }
    return scores

# Check if the proposition passes the quality check
def passes_quality_check(scores):
    for category, score in scores.items():
        if score < thresholds[category]:
            return False
    return True

evaluated_propositions = []  # Store only the propositions that pass the quality check

# Loop through generated propositions and evaluate them
for idx, proposition in enumerate(propositions):
    scores = evaluate_proposition(proposition.page_content, doc_splits[proposition.metadata['chunk_id'] - 1].page_content)
    if passes_quality_check(scores):
        # Proposition passes quality check, keep it
        evaluated_propositions.append(proposition)
    else:
        # Proposition fails, discard or flag for further review
        print(f"{idx+1}) Proposition: {proposition.page_content} \n Scores: {scores}")
        print("Fail")

1) Proposition: Paul Graham wrote an essay titled 'Founder Mode'. 
 Scores: {'accuracy': 10, 'clarity': 10, 'completeness': 5, 'conciseness': 10}
Fail
2) Proposition: The essay 'Founder Mode' was published in September 2024. 
 Scores: {'accuracy': 10, 'clarity': 10, 'completeness': 5, 'conciseness': 10}
Fail
39) Proposition: Steve Jobs managed Apple. 
 Scores: {'accuracy': 10, 'clarity': 10, 'completeness': 5, 'conciseness': 10}
Fail


In [102]:
vectorstore_propositions = FAISS.from_documents(evaluated_propositions, embedding_model)

OllamaEmbeddings: 100%|██████████| 44/44 [00:02<00:00, 17.03it/s]


In [105]:
vectorstore_propositions.embeddings

OllamaEmbeddings(base_url='http://localhost:11434', model='nomic-embed-text:v1.5', embed_instruction='passage: ', query_instruction='query: ', mirostat=None, mirostat_eta=None, mirostat_tau=None, num_ctx=None, num_gpu=None, num_thread=None, repeat_last_n=None, repeat_penalty=None, temperature=None, stop=None, tfs_z=None, top_k=None, top_p=None, show_progress=True, headers=None, model_kwargs=None)

In [106]:
retriever_propositions = vectorstore_propositions.as_retriever(
                search_type="similarity",
                search_kwargs={'k': 4}, # number of documents to retrieve
            )

In [110]:
query = "Whose management approach served as inspiration for Brian Chesky's \"Founder Mode\" at Airbnb?"
res_proposition = retriever_propositions.invoke(query)
res_proposition

OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  1.40it/s]


[Document(id='c9e75959-39c3-4773-bfbc-71f562c7f864', metadata={'Title': "Paul Graham's Founder Mode Essay", 'Source': 'https://www.perplexity.ai/page/paul-graham-s-founder-mode-ess-t9TCyvkqRiyMQJWsHr0fnQ', 'chunk_id': 3}, page_content='Brian Chesky was advised to run Airbnb in a traditional managerial style.'),
 Document(id='2b04ffb1-09ef-48a0-ab32-b7c1b82de0b0', metadata={'Title': "Paul Graham's Founder Mode Essay", 'Source': 'https://www.perplexity.ai/page/paul-graham-s-founder-mode-ess-t9TCyvkqRiyMQJWsHr0fnQ', 'chunk_id': 3}, page_content="Steve Jobs' management style influenced Brian Chesky's approach at Airbnb."),
 Document(id='1a989229-9a63-46f9-b62b-805d5d835c1c', metadata={'Title': "Paul Graham's Founder Mode Essay", 'Source': 'https://www.perplexity.ai/page/paul-graham-s-founder-mode-ess-t9TCyvkqRiyMQJWsHr0fnQ', 'chunk_id': 3}, page_content='Brian Chesky found success by adopting a different management approach.'),
 Document(id='3b3648cb-16c7-4423-8e11-cc52a44b18b8', metadata=

In [111]:
for i, r in enumerate(res_proposition):
    print(f"{i+1}) Content: {r.page_content} --- Chunk_id: {r.metadata['chunk_id']}")

1) Content: Brian Chesky was advised to run Airbnb in a traditional managerial style. --- Chunk_id: 3
2) Content: Steve Jobs' management style influenced Brian Chesky's approach at Airbnb. --- Chunk_id: 3
3) Content: Brian Chesky found success by adopting a different management approach. --- Chunk_id: 3
4) Content: Brian Chesky is a co-founder of Airbnb. --- Chunk_id: 3
