### Chunking Strategies
> Chunking typically means splitting the reference content (the knowledge base) into manageable, meaningful pieces before it is vectorised.  
> If the split is done purely by fixed length, a chunk may not hold meaningful, logically complete content.  
> Specific chunking techniques help each chunk (piece) hold logical, coherent content.
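
To make the fixed-length problem concrete, here is a minimal sketch in plain Python. The helper name `fixed_length_chunks` and the sample sentence are assumptions made for illustration; they are not part of LangChain.

```python
def fixed_length_chunks(text: str, size: int) -> list[str]:
    """Cut text into consecutive pieces of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical sample text, used only for this illustration
sample = "Chunking splits a knowledge base. Each chunk is embedded separately."
chunks = fixed_length_chunks(sample, 20)

for chunk in chunks:
    # repr() makes the cut points visible, including cuts in the middle of a word
    print(repr(chunk))
```

The first piece ends in the middle of the word "knowledge", which is the kind of unnatural boundary the strategies below try to avoid.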

**Relevant Library Imports**

In [19]:
# IMPORT bs4 FIRST - This is critical for langchain_text_splitters to detect BeautifulSoup
from bs4 import BeautifulSoup
import requests

# Text splitter functionality is provided by LangChain framework
# Import AFTER bs4 so langchain can detect BeautifulSoup
from langchain_text_splitters import HTMLHeaderTextSplitter, RecursiveCharacterTextSplitter

from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings

#### Utilities
Helper functions that are reused in multiple places, like a small library

**Get Main Content**  
Given a URL, this function returns only the main content of the web page (leaving out side panels, navigation, etc.).  
The content can be fetched with its HTML tags or as plain text.

In [7]:
def get_main_content(url, content_type):
    """Return the main content of a web page as 'html' or 'text'."""

    # Fail early on timeouts and HTTP errors instead of parsing an error page
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Remove layout elements
    for tag in soup(["nav", "header", "footer", "aside", "script", "style"]):
        tag.decompose()

    # Check for and get the main section of the page
    main = soup.find("main")

    if not main:
        # Fallback if the page has no 'main' section: take the div with the most text
        candidates = soup.find_all("div")
        main = max(candidates, key=lambda c: len(c.get_text(strip=True)), default=soup.body)

    # If HTML content is required, return it with the tags retained
    if content_type == "html":
        return str(main)

    # If text is required, return only the text content
    if content_type == "text":
        return main.get_text(separator="\n", strip=True)

    # Any other value is a caller error; do not silently return None
    raise ValueError(f"content_type must be 'html' or 'text', got {content_type!r}")


#### Sentence / Paragraph Chunking
> This strategy splits the text at natural split points such as sentence and paragraph boundaries.  
> Text is naturally organised into logical pieces, and the splitter takes advantage of that.  
> It avoids splitting chunks by fixed length at unnatural points.  
> This helps each chunk retain the meaningful content of a natural block of text.
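
The idea of trying coarse separators first and finer ones only when a piece is still too long can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: the function name `recursive_split` and the sample text are hypothetical, and separators are dropped at the cut points.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the first separator, recursing with finer ones for oversized pieces."""
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No natural split point left: fall back to a fixed-length cut
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        piece = piece.strip()
        if not piece:
            continue
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbouring pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry it with the next, finer separator
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample text, used only for this illustration
sample = (
    "Paragraph one is short.\n\n"
    "Paragraph two is longer. It has several sentences. Each one ends with a period.\n\n"
    "Paragraph three closes the text."
)
chunks = recursive_split(sample, ["\n\n", "\n", "."], chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

Short paragraphs survive as whole chunks, while the long middle paragraph is broken at sentence boundaries rather than at an arbitrary character offset.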

In [8]:
# Text splitter based on the recursive character splitter from the library

# Define the separators to be considered, in priority order. The library has defaults of its own
separators = ["\n\n", "\n", "."]

# Splitter object based on the separators and the length criteria
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=0,
                                               length_function=len, is_separator_regex=False,
                                               keep_separator=False,
                                               separators=separators,
                                               )

**Sentence / Paragraph Chunking**  
Get the main content of a web page in plain text  
Further split the content into chunks based on the length criteria and the natural separators  

In [9]:
# url = "https://cloud.google.com/learn/what-is-cloud-computing?hl=en"
url = "https://www.ibm.com/think/topics/history-of-artificial-intelligence"

# Get the main content of the web page
text_Content = get_main_content (url, "text")

with open ('content.txt', mode='w', encoding="utf-8") as f:
    print (text_Content, file=f)

# Use the text splitter object to split the text based on the schema defined
# Each chunk is a string element in a list
docs = text_splitter.split_text(text_Content)

with open ('chunks.txt', mode='w', encoding="utf-8") as f:
    
    for doc in docs :

        print (doc, "\n---")
        print (doc, "\n---", file=f)


The history of AI
Authors
Tim   Mucci
IBM Writer
Gather
The history of artificial intelligence 
---
Humans have dreamed of creating thinking machines from ancient times. Folklore and historical attempts to build programmable devices reflect this long-standing ambition and fiction abounds with the possibilities of intelligent machines, imagining their benefits and dangers 
---
It's no wonder that when OpenAI released the first version of 
---
GPT
(Generative Pretrained Transformer), it quickly gained widespread attention, marking a significant step toward realizing this ancient dream.
GPT-3 was a landmark moment in
AI 
---
due to its unprecedented size, featuring 175 billion parameters, which enabled it to perform a wide range of natural language tasks without extensive fine-tuning. This model was trained using big data, allowing it to generate human-like text and engage in conversations 
---
It also had the ability to perform few-shot learning, significantly improving its versatility a

#### Content Aware Chunking
> This type of chunking splits the content based on the document structure.  
> Documents are normally organised into chapters or sections with headings.  
> The library provides methods to split text along that structure.  
> Because the split follows the document structure, each chunk retains complete, cohesive content that is meaningful on its own.
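
As a rough illustration of structure-based splitting, the sketch below groups text under the most recent heading using only the standard library's `html.parser`. The class name `HeaderSectionParser` and the sample HTML are assumptions for this example. It tracks a single current heading rather than a heading hierarchy, so it is a simplification of the idea and not a description of how `HTMLHeaderTextSplitter` works internally.

```python
from html.parser import HTMLParser


class HeaderSectionParser(HTMLParser):
    """Collect text into sections, starting a new section at each heading tag."""

    HEADINGS = {"h1", "h2", "h3", "h4"}

    def __init__(self) -> None:
        super().__init__()
        self.sections: list[dict[str, str]] = []
        self._in_heading = False
        self._heading_parts: list[str] = []
        self._body_parts: list[str] = []
        self._current_heading = ""

    def _flush(self) -> None:
        # Store the text gathered so far under the heading it belongs to
        body = " ".join(self._body_parts).strip()
        if body:
            self.sections.append({"heading": self._current_heading, "content": body})
        self._body_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._flush()  # close the section under the previous heading
            self._in_heading = True
            self._heading_parts = []

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False
            self._current_heading = " ".join(self._heading_parts).strip()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        target = self._heading_parts if self._in_heading else self._body_parts
        target.append(text)

    def close(self):
        super().close()
        self._flush()  # emit the final section


# Hypothetical sample HTML, used only for this illustration
html = """
<h1>Cloud</h1><p>Intro text.</p>
<h2>Benefits</h2><p>Lower cost.</p><p>Elastic scale.</p>
"""
parser = HeaderSectionParser()
parser.feed(html)
parser.close()
for section in parser.sections:
    print(section)
```

Each section keeps its heading alongside its content, similar in spirit to the metadata that accompanies each chunk produced by a header-aware splitter.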

In [17]:
# levels of header tags in html to split on
header_levels = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

# Define a splitter object for HTML content from the library
# The library also provides splitters for Markdown, JSON, etc.
html_splitter = HTMLHeaderTextSplitter(header_levels)

In [16]:

url = "https://www.ibm.com/think/topics/history-of-artificial-intelligence"

HTML_Content = get_main_content (url, "html")

docs = html_splitter.split_text (HTML_Content)

# Write with an explicit encoding so non-ASCII characters do not fail on write
with open('chunks.txt', mode='w', encoding="utf-8") as f:

    for doc in docs:

        # Each doc carries the headers it falls under as metadata
        print(doc.metadata)
        print("Content : ", doc.page_content, "\n---")

        print("Heading : ", doc.metadata, file=f)
        print("Content : ", doc.page_content, "\n---", file=f)

ImportError: Unable to import BeautifulSoup. Please install via `pip install bs4`.

#### Semantic Chunking
> This strategy relies on the semantics (meaning) of the sentences and groups them by similarity in meaning.  
> Each chunk is therefore a meaningful part of the text, not just the output of a text splitter.  
> Embedding models are used as the mechanism to capture the meaning.
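
The mechanism can be sketched without any model download: embed each sentence, compare neighbouring sentences, and start a new chunk where similarity drops. Everything in this sketch is an assumption made for illustration. The bag-of-words `embed` function stands in for a real embedding model, the fixed `threshold` is arbitrary, and `SemanticChunker` uses its own breakpoint logic.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy embedding: bag-of-words counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text: str, threshold: float = 0.2) -> list[str]:
    """Group consecutive sentences; start a new chunk when similarity drops."""
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(curr)) < threshold:
            groups.append([curr])    # meaning shifted: open a new chunk
        else:
            groups[-1].append(curr)  # similar meaning: extend the current chunk
    return [" ".join(group) for group in groups]


# Hypothetical sample text, used only for this illustration
sample = (
    "Cloud servers scale on demand. Cloud servers bill per use. "
    "Ancient myths describe thinking machines. "
    "Myths of thinking machines inspired engineers."
)
chunks = semantic_chunks(sample)
for chunk in chunks:
    print(repr(chunk))
```

Word overlap is a crude proxy for meaning; a real embedding model can also group sentences that share a topic without sharing vocabulary.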

In [18]:
# Choose an embedding model
Embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Prepare a chunker; it offers various options depending on the need
Semantic_Splitter = SemanticChunker (Embedder, min_chunk_size=100, 
                                     buffer_size = 1, 
                                    #  breakpoint_threshold_amount = 85.0,
                                    #  sentence_split_regex = '(?<=[.?!])\\s+|\\n+'
                                     )

In [20]:

url = "https://www.ibm.com/think/topics/cloud-computing"

# Pull out the main content in text format from a URL
Text_Content = get_main_content (url, "text")

with open ('content.txt', mode='w', encoding="utf-8") as f:
    print (Text_Content, file=f)

# Split as documents by the Chunker
docs = Semantic_Splitter.create_documents([Text_Content])

print (len(docs))
with open ('chunks.txt', mode='w', encoding="utf-8") as f:

    for doc in docs :

        print (doc.page_content,"\n---")
        print (doc.page_content,"\n---",file=f)


8
What is cloud computing? Authors
Stephanie  Susnjara
Staff Writer
IBM Think
Ian Smalley
Staff Editor
IBM Think
What is cloud computing? Cloud computing is on-demand access to computing resources—physical or virtual servers, data storage,
networking
capabilities, application development tools, software, AI-powered analytic platforms and more—over the internet with pay-per-use pricing. Think Newsletter
Join over 100,000 subscribers who read the latest news in tech
Stay up to date on the most important—and intriguing—industry trends on AI, automation, data and beyond with the Think newsletter. 
---
See the
IBM Privacy Statement
. Thank you! You are subscribed. Your subscription will be delivered in English. You will find an unsubscribe link in every newsletter. You can manage your subscriptions or unsubscribe
here
. Refer to our
IBM Privacy Statement
for more information. https://www.ibm.com/us-en/privacy
In simpler terms, the "cloud" doesn't refer to something floating in the sky. Inst

In [21]:
# Another example with plain text content itself
with open ('AI_History.txt', mode='r', encoding="utf-8") as f:
    Text_Content = f.read ()

docs = Semantic_Splitter.create_documents([Text_Content])

print (len(docs))
with open ('chunks.txt', mode='w', encoding="utf-8") as f:

    for doc in docs :

        print (doc.page_content,"\n---")
        print (doc.page_content,"\n---",file=f)    

5
The Evolution of Artificial Intelligence: From Ancient Myths to the Age of Machines That Learn

The dream of creating artificial intelligence is older than electricity, older even than science itself. For as long as humans have imagined gods and monsters, they have also imagined machines that could think. In ancient Greece, the blacksmith god Hephaestus was said to have forged Talos, a bronze giant who patrolled the shores of Crete, hurling boulders at invaders. In Jewish folklore, there was the Golem, a clay figure brought to life through secret incantations. In ancient China, the craftsman Yan Shi supposedly presented King Mu of Zhou with a life-sized mechanical man who could sing and move like a living person. All of these stories spoke of the same yearning — the wish to breathe intelligence into the inanimate. 
---
For centuries, it remained myth. But slowly, that myth began to take on form through reason, mathematics, and engineering. By the 1600s, the Age of Enlightenment was d