### Propositions Chunking
#### Overview
###### This code implements the proposition chunking method. The system breaks the input text down into propositions that are atomic, factual, self-contained, and concise, then encodes them into a vector store that can later be used for retrieval.

#### Key Components
###### Document Chunking: Splitting a document into manageable pieces for analysis.
###### Proposition Generation: Using LLMs to break down document chunks into factual, self-contained propositions.
###### Proposition Quality Check: Evaluating generated propositions based on accuracy, clarity, completeness, and conciseness.
###### Embedding and Vector Store: Embedding both propositions and larger chunks of the document into a vector store for efficient retrieval.
###### Retrieval and Comparison: Testing the retrieval system with different query sizes and comparing results from the proposition-based model with the larger chunk-based model.

#### Motivation
###### The motivation behind the propositions chunking method is to build a system that breaks down a text document into concise, factual propositions for more granular and precise information retrieval. Using propositions allows for finer control and better handling of specific queries, particularly for extracting knowledge from detailed or complex texts. The comparison between using smaller proposition chunks and larger document chunks aims to evaluate the effectiveness of granular information retrieval.

#### Method Details
###### Loading Environment Variables: The code begins by loading environment variables (e.g., API keys for the LLM service) to ensure that the system can access the necessary resources.

##### Document Chunking:

###### The input document is split into smaller pieces (chunks) using RecursiveCharacterTextSplitter. This ensures that each chunk is of manageable size for the LLM to process.
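To make the roles of the chunk size and chunk overlap concrete, here is a minimal sketch of overlapping-window chunking. It is not the recursive algorithm used by RecursiveCharacterTextSplitter: it counts whitespace-separated words rather than tokens, and the function name `split_with_overlap` is a hypothetical helper introduced only for illustration.

```python
def split_with_overlap(text: str, chunk_size: int = 200, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of `chunk_size` words.

    Consecutive windows share `chunk_overlap` words. Words stand in for
    tokens here (an assumption made to keep the sketch dependency-free).
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text.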

##### Proposition Generation:

###### Propositions are generated from each chunk using an LLM (in this case, "llama-3.1-70b-versatile"). The output is structured as a list of factual, self-contained statements that can be understood without additional context.

##### Quality Check:

###### A second LLM evaluates the quality of the propositions by scoring them on accuracy, clarity, completeness, and conciseness. Propositions that meet the required thresholds in all categories are retained.
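The filtering step described above could look roughly like the sketch below. The 1-10 scale, the threshold values, and the names `THRESHOLDS`, `passes_quality_check`, and `filter_propositions` are all assumptions made for illustration; the scores themselves would come from the grading LLM.

```python
# Hypothetical thresholds: the 1-10 scale and the cut-off of 7 are assumptions.
THRESHOLDS = {"accuracy": 7, "clarity": 7, "completeness": 7, "conciseness": 7}


def passes_quality_check(scores: dict[str, int], thresholds: dict[str, int] = THRESHOLDS) -> bool:
    """Return True only if every category meets or exceeds its threshold.

    A category missing from `scores` is treated as a failure.
    """
    return all(scores.get(category, 0) >= minimum for category, minimum in thresholds.items())


def filter_propositions(scored: list[tuple[str, dict[str, int]]]) -> list[str]:
    """Keep the propositions whose scores pass in all categories."""
    return [proposition for proposition, scores in scored if passes_quality_check(scores)]
```

Requiring every category to pass, rather than averaging the scores, prevents a high score in one category from hiding a low score in another.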

##### Embedding Propositions:

###### Propositions that pass the quality check are embedded into a vector store using the OllamaEmbeddings model. This allows for similarity-based retrieval of propositions when queries are made.
##### Retrieval and Comparison:

###### Two retrieval systems are built: one using the proposition-based chunks and another using larger document chunks. Both are tested with several queries to compare their performance and the precision of the returned results.
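As a rough illustration of similarity-based retrieval over two stores, the sketch below ranks texts by cosine similarity to a query. The word-count `embed` function is a toy stand-in for a real embedding model, and the in-memory lists stand in for vector stores; the helper names and the example texts (paraphrased from the test document) are assumptions made for this sketch.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: counts of lowercase alphanumeric words (not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, texts: list[str], k: int = 1) -> list[str]:
    """Return the k texts most similar to the query, best match first."""
    query_vector = embed(query)
    ranked = sorted(texts, key=lambda t: cosine_similarity(query_vector, embed(t)), reverse=True)
    return ranked[:k]


# Two toy "stores": short propositions versus one larger chunk.
propositions = [
    "Brian Chesky is a co-founder of Airbnb.",
    "Steve Jobs held an annual retreat for the 100 most important people at Apple.",
]
larger_chunks = [
    "Brian Chesky, co-founder of Airbnb, was advised to run the company in a "
    "traditional managerial style, which led to poor outcomes. He later adopted "
    "a different approach, influenced by how Steve Jobs managed Apple.",
]

query = "Who is a co-founder of Airbnb?"
print(retrieve(query, propositions))
print(retrieve(query, larger_chunks))
```

Running the same query against both lists shows the trade-off: the proposition store returns a single focused statement, while the chunk store returns a longer passage with more surrounding context.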

#### Benefits
###### Granularity: By breaking the document into small factual propositions, the system allows for highly specific retrieval, making it easier to extract precise answers from large or complex documents.
###### Quality Assurance: The use of a quality-checking LLM ensures that the generated propositions meet specific standards, improving the reliability of the retrieved information.
###### Flexibility in Retrieval: The comparison between proposition-based and larger chunk-based retrieval allows for evaluating the trade-offs between granularity and broader context in search results.

#### Implementation
###### Proposition Generation: The LLM is used in conjunction with a custom prompt to generate factual statements from the document chunks.
###### Quality Checking: The generated propositions are passed through a grading system that evaluates accuracy, clarity, completeness, and conciseness.
###### Vector Store Integration: Propositions are stored in a FAISS vector store after being embedded using a pre-trained embedding model, allowing for efficient similarity-based search and retrieval.
###### Query Testing: Multiple test queries are made to the vector stores (proposition-based and larger chunks) to compare the retrieval performance.

#### Summary
###### This code presents a robust method for breaking down a document into self-contained propositions using LLMs. The system performs a quality check on each proposition, embeds them in a vector store, and retrieves the most relevant information based on user queries. The ability to compare granular propositions against larger document chunks provides insight into which method yields more accurate or useful results for different types of queries. The approach emphasizes the importance of high-quality proposition generation and retrieval for precise information extraction from complex documents.

In [None]:
### LLMs
import os
from dotenv import load_dotenv

# Load environment variables from a '.env' file
load_dotenv()

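# Fail early with a clear message: assigning None to os.environ raises a TypeError
groq_api_key = os.getenv('GROQ_API_KEY')
if groq_api_key is None:
    raise RuntimeError("GROQ_API_KEY is not set; add it to your .env file")
os.environ['GROQ_API_KEY'] = groq_api_key # for LLM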

### Test Document

In [None]:
sample_content = """Paul Graham's essay "Founder Mode," published in September 2024, challenges conventional wisdom about scaling startups, arguing that founders should maintain their unique management style rather than adopting traditional corporate practices as their companies grow.
Conventional Wisdom vs. Founder Mode
The essay argues that the traditional advice given to growing companies—hiring good people and giving them autonomy—often fails when applied to startups.
This approach, suitable for established companies, can be detrimental to startups where the founder's vision and direct involvement are crucial. "Founder Mode" is presented as an emerging paradigm that is not yet fully understood or documented, contrasting with the conventional "manager mode" often advised by business schools and professional managers.
Unique Founder Abilities
Founders possess unique insights and abilities that professional managers do not, primarily because they have a deep understanding of their company's vision and culture.
Graham suggests that founders should leverage these strengths rather than conform to traditional managerial practices. "Founder Mode" is an emerging paradigm that is not yet fully understood or documented, with Graham hoping that over time, it will become as well-understood as the traditional manager mode, allowing founders to maintain their unique approach even as their companies scale.
Challenges of Scaling Startups
As startups grow, there is a common belief that they must transition to a more structured managerial approach. However, many founders have found this transition problematic, as it often leads to a loss of the innovative and agile spirit that drove the startup's initial success.
Brian Chesky, co-founder of Airbnb, shared his experience of being advised to run the company in a traditional managerial style, which led to poor outcomes. He eventually found success by adopting a different approach, influenced by how Steve Jobs managed Apple.
Steve Jobs' Management Style
Steve Jobs' management approach at Apple served as inspiration for Brian Chesky's "Founder Mode" at Airbnb. One notable practice was Jobs' annual retreat for the 100 most important people at Apple, regardless of their position on the organizational chart.
This unconventional method allowed Jobs to maintain a startup-like environment even as Apple grew, fostering innovation and direct communication across hierarchical levels. Such practices emphasize the importance of founders staying deeply involved in their companies' operations, challenging the traditional notion of delegating responsibilities to professional managers as companies scale.
"""

### Chunking

In [None]:
### Build Index
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OllamaEmbeddings

# Set embeddings
embedding_model = OllamaEmbeddings(model='nomic-embed-text:v1.5', show_progress=True)

# Wrap the sample text in a Document with title and source metadata
docs_list = [Document(page_content=sample_content, metadata={"Title": "Paul Graham's Founder Mode Essay", "Source": "https://www.perplexity.ai/page/paul-graham-s-founder-mode-ess-t9TCyvkqRiyMQJWsHr0fnQ"})]

# Split into chunks of up to 200 tokens with a 50-token overlap
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=200, chunk_overlap=50
)

doc_splits = text_splitter.split_documents(docs_list)

In [None]:
# Tag each chunk with a 1-based chunk id
for i, doc in enumerate(doc_splits, start=1):
    doc.metadata['chunk_id'] = i

### Generate Propositions

In [None]:
from typing import List
from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_groq import ChatGroq

# Data Model
class GeneratePropositions(BaseModel):
    """List of all the propositions in a given document"""

    propositions: List[str] = Field(
        description="List of propositions (factual, self-contained and concise information)"
    )

# LLM with function call
llm = ChatGroq(model="llama-3.1-70b-versatile", temperature=0)
structured_llm = llm.with_structured_output(GeneratePropositions)

