In [28]:
data = []
with open("/Users/uzair/langchain/text_data.txt","r") as file:
    item = file.read()
    data.append(item)

data

['Harper Lee, believed to be one of the most influential authors to have ever existed, famously published only a single novel (up until its controversial sequel was published in 2015 just before her death). Lee’s To Kill a Mockingbird was published in 1960 and became an immediate classic of literature. The novel examines racism in the American South through the innocent wide eyes of a clever young girl named Jean Louise (“Scout”) Finch. Its iconic characters, most notably the sympathetic and just lawyer and father Atticus Finch, served as role models and changed perspectives in the United States at a time when tensions regarding race were high. To Kill a Mockingbird earned the Pulitzer Prize for fiction in 1961 and was made into an Academy Award-winning film in 1962, giving the story and its characters further life and influence over the American social sphere.']

## TEXT SPLITTER TECHNIQUES IN LANGCHAIN

The goal of a text splitter is to give the LLM only the information it needs.
Types of text splitter:
- Character 
- Recursive Character
- Document Specific
- Semantic
- Agentic splitting

##  1 - Character Splitter

The process of simply dividing the text into chunks of N characters, regardless of content or form. It is controlled by two parameters:
- chunk size
- chunk overlap
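
To make the two parameters concrete, here is a minimal plain-Python sketch of a sliding character window. It only illustrates the idea and is not LangChain's implementation; the function name `split_by_characters` is made up for this example.

```python
def split_by_characters(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "Harper Lee, believed to be one of the most influential authors"
chunks = split_by_characters(sample, chunk_size=30, chunk_overlap=5)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk repeats the last 5 characters of the previous one, so a word cut at a boundary still appears whole in at least one chunk when it is short enough.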

In [2]:
from langchain.text_splitter import CharacterTextSplitter

In [3]:
text_splitter = CharacterTextSplitter(chunk_size = 30, chunk_overlap = 5, separator="", strip_whitespace = False)

In [4]:
text_splitter.create_documents(data)

[Document(page_content='Harper Lee, believed to be one'),
 Document(page_content='e one of the most influential '),
 Document(page_content='tial authors to have ever exis'),
 Document(page_content=' existed, famously published o'),
 Document(page_content='hed only a single novel (up un'),
 Document(page_content='up until its controversial seq'),
 Document(page_content='l sequel was published in 2015'),
 Document(page_content=' 2015 just before her death). '),
 Document(page_content='th). Lee’s To Kill a Mockingbi'),
 Document(page_content='ingbird was published in 1960 '),
 Document(page_content='1960 and became an immediate c'),
 Document(page_content='ate classic of literature. The'),
 Document(page_content='. The novel examines racism in'),
 Document(page_content='sm in the American South throu'),
 Document(page_content='through the innocent wide eyes'),
 Document(page_content=' eyes of a clever young girl n'),
 Document(page_content='irl named Jean Louise (“Scout”'),
 Document(page

## 2 - Splitting Using LlamaIndex

In [5]:
from llama_index.core.text_splitter import SentenceSplitter
from llama_index.core import SimpleDirectoryReader

In [6]:
splitter = SentenceSplitter(
    chunk_size=100,
    chunk_overlap=5,
    
)

In [7]:
docs = SimpleDirectoryReader(
    input_files=["/Users/uzair/langchain/text_data.txt"]
).load_data()

In [8]:
docs

[Document(id_='17c0fcae-e0e2-4b67-821e-4a10f3b87bc9', embedding=None, metadata={'file_path': '/Users/uzair/langchain/text_data.txt', 'file_name': 'text_data.txt', 'file_type': 'text/plain', 'file_size': 877, 'creation_date': '2024-02-24', 'last_modified_date': '2024-02-24', 'last_accessed_date': '2024-02-24'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='Harper Lee, believed to be one of the most influential authors to have ever existed, famously published only a single novel (up until its controversial sequel was published in 2015 just before her death). Lee’s To Kill a Mockingbird was published in 1960 and became an immediate classic of literature. The novel examines racism in the American South through the innocent wide eyes of a clever young girl 

In [9]:
nodes = splitter.get_nodes_from_documents(docs)

A Node represents a “chunk” of a source Document, whether that is a text chunk, an image, or other. Similar to Documents, they contain metadata and relationship information with other nodes.



In [10]:
nodes

[TextNode(id_='5671b982-45f4-4e95-8c35-aeb6da98a2e2', embedding=None, metadata={'file_path': '/Users/uzair/langchain/text_data.txt', 'file_name': 'text_data.txt', 'file_type': 'text/plain', 'file_size': 877, 'creation_date': '2024-02-24', 'last_modified_date': '2024-02-24', 'last_accessed_date': '2024-02-24'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='17c0fcae-e0e2-4b67-821e-4a10f3b87bc9', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': '/Users/uzair/langchain/text_data.txt', 'file_name': 'text_data.txt', 'file_type': 'text/plain', 'file_size': 877, 'creation_date': '2024-02-24', 'last_modified_date': '2024-02-24', 'last_accessed_date': '2024-02-24'}, hash='9b34d74f34138e910ee96b501940aa3

In [11]:
nodes[0].text

'Harper Lee, believed to be one of the most influential authors to have ever existed, famously published only a single novel (up until its controversial sequel was published in 2015 just before her death). Lee’s To Kill a Mockingbird was published in 1960 and became an immediate classic of literature.'

##  3 - Recursive Character Splitting

We specify a series of separators that are tried in order to split the documents, such as paragraph breaks, new lines, spaces, and finally individual characters.
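
The idea can be sketched in plain Python as below. This is a simplified illustration, not LangChain's code: the function name `recursive_split` and the separator tuple are assumptions for the example, and chunk overlap is left out to keep it short.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters,
    preferring the coarsest separator that works."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = ("First paragraph is short.\n\n"
        "Second paragraph is a bit longer and must be split on spaces.")
chunks = recursive_split(text, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

The first paragraph fits in one chunk, so it is kept whole. Only the second paragraph falls through to the space separator.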

In [12]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [13]:
ts = RecursiveCharacterTextSplitter(
    chunk_size = 100, chunk_overlap = 0
)

In [14]:
ts.create_documents(data)

[Document(page_content='Harper Lee, believed to be one of the most influential authors to have ever existed, famously'),
 Document(page_content='published only a single novel (up until its controversial sequel was published in 2015 just before'),
 Document(page_content='her death). Lee’s To Kill a Mockingbird was published in 1960 and became an immediate classic of'),
 Document(page_content='literature. The novel examines racism in the American South through the innocent wide eyes of a'),
 Document(page_content='clever young girl named Jean Louise (“Scout”) Finch. Its iconic characters, most notably the'),
 Document(page_content='sympathetic and just lawyer and father Atticus Finch, served as role models and changed'),
 Document(page_content='perspectives in the United States at a time when tensions regarding race were high. To Kill a'),
 Document(page_content='Mockingbird earned the Pulitzer Prize for fiction in 1961 and was made into an Academy'),
 Document(page_content='Award-winnin

## 4 - Document Specific Splitting

In [15]:
from langchain.text_splitter import MarkdownTextSplitter

In [16]:
markdown_sample =""" 
# Markdown Example

This is a paragraph of text. You can use *italics*, **bold**, or even create [links](https://www.example.com).

## Lists

- Item 1
- Item 2
  - Subitem 1
  - Subitem 2

## Code

You can include inline code like `print("Hello, world!")` or code blocks:

```python
def hello():
    print("Hello, world!")

hello()
"""

In [17]:
splitter = MarkdownTextSplitter(
    chunk_size = 50,
    chunk_overlap = 0,
)

In [18]:
splitter.create_documents([markdown_sample])

[Document(page_content='# Markdown Example'),
 Document(page_content='This is a paragraph of text. You can use'),
 Document(page_content='*italics*, **bold**, or even create'),
 Document(page_content='[links](https://www.example.com).'),
 Document(page_content='## Lists'),
 Document(page_content='- Item 1\n- Item 2\n  - Subitem 1\n  - Subitem 2'),
 Document(page_content='## Code'),
 Document(page_content='You can include inline code like `print("Hello,'),
 Document(page_content='world!")` or code blocks:'),
 Document(page_content='```python\ndef hello():'),
 Document(page_content='print("Hello, world!")'),
 Document(page_content='hello()')]

In [19]:
from langchain.text_splitter import PythonCodeTextSplitter

In [20]:
py_code = """
def is_prime(n):
    if n <= 1:
        return False
    if n <= 3:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    i = 5
    while i * i <= n:
        if n % i == 0 or n % (i + 2) == 0:
            return False
        i += 6
    return True
"""

In [21]:
python_splitter = PythonCodeTextSplitter(
    chunk_size = 45,
    chunk_overlap = 5
)

In [22]:
python_splitter.create_documents([py_code])

[Document(page_content='def is_prime(n):\n    if n <= 1:'),
 Document(page_content='return False\n    if n <= 3:'),
 Document(page_content='return True'),
 Document(page_content='if n % 2 == 0 or n % 3 == 0:'),
 Document(page_content='return False\n    i = 5'),
 Document(page_content='while i * i <= n:'),
 Document(page_content='if n % i == 0 or n % (i + 2) == 0:'),
 Document(page_content='return False\n        i += 6'),
 Document(page_content='return True')]

In [23]:
# use recursive character splitter for python code
from langchain.text_splitter import Language
rcs_py_splitter = RecursiveCharacterTextSplitter.from_language(
    chunk_size = 30,chunk_overlap = 0,language=Language.PYTHON
    
)

In [24]:
rcs_py_splitter.create_documents([py_code])

[Document(page_content='def is_prime(n):'),
 Document(page_content='if n <= 1:'),
 Document(page_content='return False'),
 Document(page_content='if n <= 3:'),
 Document(page_content='return True'),
 Document(page_content='if n % 2 == 0 or n % 3 =='),
 Document(page_content='0:'),
 Document(page_content='return False'),
 Document(page_content='i = 5'),
 Document(page_content='while i * i <= n:'),
 Document(page_content='if n % i == 0 or n %'),
 Document(page_content='(i + 2) == 0:'),
 Document(page_content='return False'),
 Document(page_content='i += 6'),
 Document(page_content='return True')]

## PDF with Tables Using Unstructured Library

In [25]:
pdf_path ="/Users/uzair/Downloads/HDFC_MF_Factsheet__July_2022.pdf"

In [26]:
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import elements_to_json


In [27]:
element  = partition_pdf(filename=pdf_path,
                         strategy="hi_res",
                         infer_table_structure=True,
                         model_name = "yolox")


In [36]:
element

[<unstructured.documents.elements.Title at 0x2a48050d0>,
 <unstructured.documents.elements.Header at 0x2a56fbc70>,
 <unstructured.documents.elements.Image at 0x2a56fb580>,
 <unstructured.documents.elements.Text at 0x2a63b1250>,
 <unstructured.documents.elements.Title at 0x2a63b1490>,
 <unstructured.documents.elements.NarrativeText at 0x2a63b1d30>,
 <unstructured.documents.elements.Image at 0x2a5adc1f0>,
 <unstructured.documents.elements.NarrativeText at 0x2a6399bb0>,
 <unstructured.documents.elements.NarrativeText at 0x2a6399ca0>,
 <unstructured.documents.elements.Table at 0x2a5adc790>,
 <unstructured.documents.elements.NarrativeText at 0x2a5adcbb0>,
 <unstructured.documents.elements.Title at 0x2a39cd880>,
 <unstructured.documents.elements.Title at 0x2a38a4280>,
 <unstructured.documents.elements.NarrativeText at 0x2a45c3340>,
 <unstructured.documents.elements.NarrativeText at 0x2a4a77bb0>,
 <unstructured.documents.elements.NarrativeText at 0x2a4a77be0>,
 <unstructured.documents.element

In [37]:
element[9].metadata.text_as_html

'<table><thead><th>Name of Scheme</th><th>This product is suitable for investors who are seeking*:</th><th></th><th>Riskometer#</th></thead><thead><th>HDFC SILVER ETF</th><th>e Returns that are commensurate with the performance of silver,</th><th></th><th>pote My</th></thead><tr><td>(An open ended Exchange Traded Fund (ETF) replicating / tracking performance of Silver)</td><td>subject to tracking errors, over long term ¢ Investment in Silver bullion of 0.999 fineness</td><td></td><td>KS 2) ~S RISKOMETER Investors understand that their principal will be at</td></tr></table>'
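
The HTML string is easier to work with once it is turned into rows of cells. Below is a small sketch using only the standard library's `html.parser`. The class `TableRows`, the helper `html_table_to_rows`, and the shortened `sample_html` are made up for this example; in the notebook you would pass `element[9].metadata.text_as_html` instead.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = []
        self._cell = None  # list of text fragments while inside a cell

    def _flush_row(self):
        if self._row:
            self.rows.append(self._row)
        self._row = []

    def handle_starttag(self, tag, attrs):
        if tag in ("tr", "thead"):
            self._flush_row()  # a new row starts, close any open one
        elif tag in ("th", "td"):
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag in ("tr", "thead", "table"):
            self._flush_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def html_table_to_rows(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Shortened, hypothetical table in the same shape as the output above.
sample_html = ("<table><thead><th>Name of Scheme</th><th>Riskometer#</th></thead>"
               "<tr><td>HDFC SILVER ETF</td><td></td></tr></table>")
rows = html_table_to_rows(sample_html)
for row in rows:
    print(row)
```

Treating `<thead>` as a row boundary is a choice made here because the extracted HTML above puts `<th>` cells directly inside `<thead>` without a `<tr>`.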

## Extract Images + text from PDF

In [39]:
from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

In [40]:
visual_pdf = "/Users/uzair/Downloads/visual_arts.pdf"

In [41]:
raw_pdf_element = partition_pdf(
    filename=visual_pdf,
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters = 1000,
    new_after_n_chars = 800,
    combine_text_under_n_chars = 600,
    image_output_dir_path= "/Users/uzair/langchain/images_dir"
)

A CLIP model can provide embeddings for both text and images.

Alternatively, we can use GPT-4 Vision to get a text summary of each image and use that summary for semantic search.

## Split by tokens
Language models have a token limit. You should not exceed the token limit. When you split your text into chunks it is therefore a good idea to count the number of tokens. There are many tokenizers. When you count tokens in your text you should use the same tokenizer as used in the language model.
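
As a rough sketch of what counting tokens means for chunking, the snippet below groups tokens into chunks of a fixed token count. The tokenizer here (`toy_tokenize`, one token per whitespace-delimited word) is only a stand-in assumed for illustration. A real pipeline would plug in the tokenizer of the target model, since token counts differ between tokenizers.

```python
import re


def toy_tokenize(text):
    """Stand-in tokenizer: one token per whitespace-delimited word.
    Each token keeps its trailing whitespace so chunks can be re-joined."""
    return re.findall(r"\S+\s*", text)


def split_by_token_count(text, max_tokens, tokenize=toy_tokenize):
    """Group consecutive tokens into chunks of at most max_tokens tokens."""
    tokens = tokenize(text)
    return ["".join(tokens[i:i + max_tokens]).strip()
            for i in range(0, len(tokens), max_tokens)]


text = "Language models have a token limit. You should not exceed the token limit."
chunks = split_by_token_count(text, max_tokens=5)
for chunk in chunks:
    print(len(toy_tokenize(chunk)), repr(chunk))
```

Because the limit is enforced on tokens rather than characters, chunks can have very different character lengths while all staying within the token budget.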

In [42]:
from langchain.text_splitter import CharacterTextSplitter

with open("./sample.txt") as f:
    state_of_the_union = f.read()


text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)


print(texts[0])


Created a chunk of size 109, which is longer than the specified 100
Created a chunk of size 123, which is longer than the specified 100
Created a chunk of size 157, which is longer than the specified 100


Strangers have come to the remote area of the Two Rivers; strangers the likes of which Rand, Mat and Perrin have never seen—a Lady named Moiraine like out of a gleeman’s tale, and her Warder, Lan, and then the gleeman to go with it. Fireworks, a gleeman and a Lady, all in time for Winternight Festival—not even the presence of a dark figure haunting the woods, a figure the wind does not seem to touch, can scare away their excitement.


## Semantic Chunking

Splits the text based on semantic similarity.
At a high level, this splits the text into sentences, groups them into windows of 3 sentences, and then merges the groups that are similar in the embedding space.

Suppose you want to classify books by genre and topic rather than by the size of the book.

We can employ embedding-based semantic chunking. It is slow but powerful.

So, take embeddings at successive positions in the document and compare neighbouring embeddings to decide where to split.

In [45]:
with open("./essay.txt","r") as f:
    essay = f.read()


In [46]:
import re
single_sentences_list = re.split(r'(?<=[.?!])\s+',essay)

In [47]:
single_sentences_list

['Women empowerment is a multifaceted concept that has both positive and negative aspects.',
 'On the positive side, women empowerment can lead to greater gender equality, improved economic outcomes, and a more inclusive and equitable society.',
 'However, there are also challenges and drawbacks associated with women empowerment, including resistance from traditional gender roles, backlash from conservative groups, and the risk of tokenism or superficial change.',
 'One of the key positive aspects of women empowerment is its potential to promote gender equality.',
 'By empowering women to participate fully in all aspects of society, including education, employment, and politics, gender disparities can be reduced, leading to a more equitable and just society.',
 'Empowered women are also more likely to advocate for the rights of other women, further advancing the cause of gender equality.',
 'Another positive aspect of women empowerment is its impact on economic outcomes.',
 "Studies ha

In [48]:
sentences = [{"sentences" : x , "index": i} for i,x in enumerate(single_sentences_list)]

In [49]:
sentences

[{'sentences': 'Women empowerment is a multifaceted concept that has both positive and negative aspects.',
  'index': 0},
 {'sentences': 'On the positive side, women empowerment can lead to greater gender equality, improved economic outcomes, and a more inclusive and equitable society.',
  'index': 1},
 {'sentences': 'However, there are also challenges and drawbacks associated with women empowerment, including resistance from traditional gender roles, backlash from conservative groups, and the risk of tokenism or superficial change.',
  'index': 2},
 {'sentences': 'One of the key positive aspects of women empowerment is its potential to promote gender equality.',
  'index': 3},
 {'sentences': 'By empowering women to participate fully in all aspects of society, including education, employment, and politics, gender disparities can be reduced, leading to a more equitable and just society.',
  'index': 4},
 {'sentences': 'Empowered women are also more likely to advocate for the rights of o

In [63]:
def combine_sentences(sentences, buffer_size=1):
    """Attach to each sentence a window made of itself and its neighbours."""
    for i in range(len(sentences)):
        comb_sentences = ""
        # Sentences before the current one (skip indices before the start).
        for j in range(i - buffer_size, i):
            if j >= 0:
                comb_sentences += sentences[j]['sentences'] + " "
        comb_sentences += sentences[i]['sentences']
        # Sentences after the current one (skip indices past the end).
        for j in range(i + 1, i + 1 + buffer_size):
            if j < len(sentences):
                comb_sentences += " " + sentences[j]['sentences']
        sentences[i]['combined_sentence'] = comb_sentences

    return sentences

In [64]:
sentences = combine_sentences(sentences=sentences)

In [65]:
sentences

[{'sentences': 'Women empowerment is a multifaceted concept that has both positive and negative aspects.',
  'index': 0,
  'combined_sentence': 'Women empowerment is a multifaceted concept that has both positive and negative aspects. On the positive side, women empowerment can lead to greater gender equality, improved economic outcomes, and a more inclusive and equitable society.'},
 {'sentences': 'On the positive side, women empowerment can lead to greater gender equality, improved economic outcomes, and a more inclusive and equitable society.',
  'index': 1,
  'combined_sentence': 'Women empowerment is a multifaceted concept that has both positive and negative aspects. On the positive side, women empowerment can lead to greater gender equality, improved economic outcomes, and a more inclusive and equitable society. However, there are also challenges and drawbacks associated with women empowerment, including resistance from traditional gender roles, backlash from conservative groups, and the risk of tokenism or superficial change.'},
 {'sentences': 'However, there are als

In [66]:
from langchain.embeddings import OpenAIEmbeddings

In [67]:
openai_embeddings = OpenAIEmbeddings()

In [68]:
embeddings = openai_embeddings.embed_documents([x['combined_sentence'] for x in sentences])

In [69]:
for i,sentence in enumerate(sentences):
    sentence['comb_sentence_embed'] = embeddings[i]

In [70]:
sentences[:3]

[{'sentences': 'Women empowerment is a multifaceted concept that has both positive and negative aspects.',
  'index': 0,
  'combined_sentence': 'Women empowerment is a multifaceted concept that has both positive and negative aspects. On the positive side, women empowerment can lead to greater gender equality, improved economic outcomes, and a more inclusive and equitable society.',
  'comb_sentence_embed': [-0.04521797575369909,
   -0.021128713981647892,
   0.010108887168670097,
   -0.03750030305586092,
   -0.010513748922365996,
   -0.018054295243246483,
   -0.018775455562159434,
   -0.005098725644235756,
   -0.010697201773542213,
   -0.043092455504916555,
   0.021002193810503014,
   0.032540749599884525,
   -0.0052885045039692165,
   -0.004877316741716075,
   -0.022406557465662884,
   0.016751147725002517,
   0.04412991159507885,
   -0.02254572872259968,
   0.004573670659274795,
   -0.04197908544942219,
   -0.01115899713859202,
   -0.0075721768293279925,
   0.00033369450667286766,
   

Now calculate the distances between the embeddings of consecutive sentence groups and split the document wherever the distance is large.
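
A minimal sketch of that last step, in plain Python with made-up two-dimensional embeddings. The helper names (`cosine_distance`, `split_on_distance`), the toy data, and the fixed threshold of 0.5 are all assumptions for illustration. With real embeddings the threshold is often derived from the observed distances (for example a high percentile) rather than hard-coded.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def split_on_distance(sentences, embeddings, threshold):
    """Start a new chunk wherever neighbouring embeddings are farther
    apart than threshold."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            chunks.append(" ".join(current))  # topic shift: close the chunk
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks


# Toy data: the first two sentences point one way, the last two another.
toy_sentences = ["Cats purr.", "Cats nap a lot.", "Stocks fell today.", "Bonds rallied."]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
chunks = split_on_distance(toy_sentences, toy_embeddings, threshold=0.5)
for chunk in chunks:
    print(chunk)
```

In the notebook, the same function could be fed the `sentences` text and the `comb_sentence_embed` vectors computed above.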

## Agent Based Chunking

This chunking strategy explores using an LLM to determine how much and what text should be included in a chunk, based on the context.

To generate the initial chunks, it uses the concept of propositions, based on a paper that extracts standalone statements from a raw piece of text. LangChain provides a propositional-retrieval template to implement this.

After the propositions are generated, they are fed to an LLM-based agent. This agent determines whether a proposition should be included in an existing chunk or whether a new chunk should be created.
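
The control flow of the agent can be sketched without any LLM. In the snippet below the decision is delegated to a `choose_chunk` callback. The `keyword_chooser` shown is a hypothetical stand-in that only compares first words; in a real pipeline that decision would come from an LLM call.

```python
def agentic_chunk(propositions, choose_chunk):
    """Assign each proposition to an existing chunk or open a new one.

    choose_chunk(proposition, chunks) returns the index of the chunk that
    should receive the proposition, or None to start a new chunk.
    """
    chunks = []
    for proposition in propositions:
        index = choose_chunk(proposition, chunks)
        if index is None:
            chunks.append([proposition])
        else:
            chunks[index].append(proposition)
    return chunks


def keyword_chooser(proposition, chunks):
    """Hypothetical stand-in for the LLM: match on the first word."""
    subject = proposition.split()[0].lower()
    for i, chunk in enumerate(chunks):
        if chunk[0].split()[0].lower() == subject:
            return i
    return None


props = [
    "Lee published a novel in 1960.",
    "Atticus is a lawyer.",
    "Lee won the Pulitzer Prize in 1961.",
]
result = agentic_chunk(props, keyword_chooser)
for chunk in result:
    print(chunk)
```

Keeping the decision behind a callback makes it easy to swap the toy rule for an LLM prompt later without touching the loop.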

https://miro.medium.com/v2/resize:fit:1400/format:webp/1*aHXJ5wuWuh1faf_BF7i4og.png

In [71]:
from langchain.output_parsers.openai_tools import JsonOutputKeyToolsParser
from langchain_community.chat_models import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda
from langchain.chains import create_extraction_chain
from typing import Optional, List
from langchain.chains import create_extraction_chain_pydantic
from langchain_core.pydantic_v1 import BaseModel
from langchain import hub

In [72]:
import os

In [75]:
obj = hub.pull("wfh/proposal-indexing")
llm = ChatOpenAI(model="gpt-4-1106-preview", openai_api_key=os.environ.get("GPT4_API_KEY"))


In [76]:
runnable = obj | llm

In [77]:
class Sentences(BaseModel):
    sentences : List[str]

# extraction chain
extraction_chain = create_extraction_chain_pydantic(pydantic_schema=Sentences,llm=llm)

In [78]:
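def get_preposition(text):
    """Return the list of propositions (standalone statements) found in text."""
    # Ask the LLM to rewrite the text as propositions.
    runnable_output = runnable.invoke({"input": text}).content

    # Parse the LLM output into the Sentences schema and return its list.
    propositions = extraction_chain.run(runnable_output)[0].sentences
    return propositions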