# Demonstrating the Knowledge Pipeline

DeepDoc --> Partitioner --> Enrichment --> Global Entities --> Chunker --> Embedder, in phases

✅ Phase 1: DeepDoc – Loader + Preprocessor
🎯 Objective:
Load a document (PDF, Word, or plain text) and extract clean text plus basic metadata (page numbers, headings, and so on).

Tools we are going to use:

| Task                | Tool                                      |
| ------------------- | ----------------------------------------- |
| PDF Reading         | `pdfplumber` (more accurate) or `PyMuPDF` |
| DOCX Reading        | `python-docx`                             |
| Text Cleaning       | `re` (regex), `unicodedata`, etc.         |
| Metadata Extraction | Basic heuristics                          |


In [2]:
# Step 0: Install the libraries required for Phase 1
!pip install pdfplumber python-docx unstructured




In [44]:
# Step 1: Load a PDF File and Extract Text (Using pdfplumber)
import pdfplumber

pdf_path = "sample.pdf"  # Replace with your local file path

text_pages = []
with pdfplumber.open(pdf_path) as pdf:
    for i, page in enumerate(pdf.pages):
        text = page.extract_text()
        if text:
            text_pages.append({
                "page_number": i + 1,
                "text": text.strip()
            })

print(f"Total Pages: {len(text_pages)}")
print(text_pages[0])  # Print first page preview


Total Pages: 98
{'page_number': 1, 'text': 'CloudContactCenter\nSoftware\nTechnical Requirements\nReference Guide\nAugust 2021\nThis guide contains deployment, configuration, and troubleshooting\ninformation to assist customers and partners with Five9 applications.\nFive9andtheFive9logoareregisteredtrademarksofFive9anditssubsidiariesintheUnitedStatesandothercountries.Othermarksandbrandsmaybe\nclaimedasthepropertyofothers.Theproductplans,specifications,anddescriptionshereinareprovidedforinformationonlyandsubjecttochange\nwithoutnotice,andareprovidedwithoutwarrantyofanykind,expressorimplied.Copyright © 2021Five9,Inc.'}
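
The objective also mentions plain-text input, which the PDF step above does not cover. A minimal sketch of a text loader that returns the same `{"page_number", "text"}` dictionaries is shown below. The helper name `load_text_pages` and the assumption that pages are separated by form-feed characters are illustrative, not part of the original notebook.

```python
from pathlib import Path

def load_text_pages(path, page_break="\f"):
    """Hypothetical helper: read a UTF-8 text file into page dicts.

    Assumes pages are separated by form-feed characters; a file without
    form feeds is treated as a single page.
    """
    raw = Path(path).read_text(encoding="utf-8")
    pages = []
    for i, chunk in enumerate(raw.split(page_break)):
        if chunk.strip():  # skip empty pages, as the PDF loop does
            pages.append({"page_number": i + 1, "text": chunk.strip()})
    return pages
```

Because the output has the same shape as `text_pages`, the later cleaning and partitioning steps should work on it unchanged.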


In [45]:
# 🧽 Step 2: Basic Preprocessing

import re
import unicodedata

def clean_text(text):
    text = unicodedata.normalize("NFKC", text)  # Normalize Unicode (compatibility forms)
    text = re.sub(r'\s+', ' ', text)  # Collapse all whitespace runs, including newlines, to one space
    text = re.sub(r'[^\x00-\x7F]+', '', text)  # Remove non-ASCII chars (e.g., ©)
    return text.strip()

for page in text_pages:
    page["cleaned_text"] = clean_text(page["text"])
    print(page)





{'page_number': 1, 'text': 'CloudContactCenter\nSoftware\nTechnical Requirements\nReference Guide\nAugust 2021\nThis guide contains deployment, configuration, and troubleshooting\ninformation to assist customers and partners with Five9 applications.\nFive9andtheFive9logoareregisteredtrademarksofFive9anditssubsidiariesintheUnitedStatesandothercountries.Othermarksandbrandsmaybe\nclaimedasthepropertyofothers.Theproductplans,specifications,anddescriptionshereinareprovidedforinformationonlyandsubjecttochange\nwithoutnotice,andareprovidedwithoutwarrantyofanykind,expressorimplied.Copyright © 2021Five9,Inc.', 'cleaned_text': 'CloudContactCenter Software Technical Requirements Reference Guide August 2021 This guide contains deployment, configuration, and troubleshooting information to assist customers and partners with Five9 applications. Five9andtheFive9logoareregisteredtrademarksofFive9anditssubsidiariesintheUnitedStatesandothercountries.Othermarksandbrandsmaybe claimedasthepropertyofothers.T
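
Note that `clean_text` replaces newlines with spaces, so any later step that relies on line boundaries sees one long line. A possible variant that keeps line breaks is sketched below; the name `clean_text_keep_lines` is hypothetical, and the sample string is made up for illustration.

```python
import re
import unicodedata

def clean_text_keep_lines(text):
    """Like clean_text, but preserves line breaks (hypothetical variant)."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r'[^\x00-\x7F]+', '', text)  # drop non-ASCII, as clean_text does
    # Collapse spaces and tabs inside each line, then drop empty lines
    lines = [re.sub(r'[ \t]+', ' ', ln).strip() for ln in text.split('\n')]
    return '\n'.join(ln for ln in lines if ln)

print(clean_text_keep_lines("1. Introduction\n\n  This   guide \u00a9 2021\n"))
```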

In [46]:
# 🧾 Optional: Extract Basic Metadata Heuristically

# Example: Use the first non-empty line as the title
def get_doc_title(text):
    for line in text.split('\n'):
        if line.strip():
            return line.strip()
    return "Untitled Document"

document_title = get_doc_title(text_pages[0]["text"])
print("Document Title:", document_title)


Document Title: CloudContactCenter
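
The same heuristic style can pull other metadata. As a rough sketch, the function below looks for the first "Month YYYY" string, which would pick up a date like the one on the first page shown earlier. The name `guess_doc_date` is an assumption for illustration, and the sample string is an abbreviated copy of that page.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

def guess_doc_date(text):
    """Hypothetical heuristic: return the first 'Month YYYY' string, or None."""
    match = re.search(rf'\b({MONTHS})\s+(\d{{4}})\b', text)
    return match.group(0) if match else None

first_page = ("CloudContactCenter\nSoftware\nTechnical Requirements\n"
              "Reference Guide\nAugust 2021")
print("Document Date:", guess_doc_date(first_page))
```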


🚀 Phase 2 – Partitioner

🎯 Objective:
Split your document into meaningful semantic blocks like:

- Title
- Section headers (e.g., 1. Introduction, 2. Overview of Systems)
- Bullet points, paragraphs, tables, etc.

Why Partition?
Raw page-level text is too coarse. We need fine-grained segments to:

- Assign meaning
- Enrich in the next phase
- Chunk effectively later

📦 Two Partitioning Approaches

🔹 Option A: Simple Heuristics (Best to Start With)

Based on patterns like:

- Line starts with number: ^\d+\. → Section heading
- Bullet points: starts with •, -, or o
- Paragraph: blocks separated by double newline
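
The three patterns above can be sketched as a small block classifier. This is only an illustration under stated assumptions: the function names are hypothetical, blocks are split on blank lines, and each block is labeled by its first line.

```python
import re

HEADING_RE = re.compile(r'^\d+\.\s+\S')     # e.g. "1. Introduction"
BULLET_RE = re.compile(r'^(?:•|-|o)\s+\S')  # bullet glyph followed by text

def classify_block(block):
    """Label a block as heading, bullet, or paragraph by its first line."""
    first_line = block.strip().split('\n')[0]
    if HEADING_RE.match(first_line):
        return "heading"
    if BULLET_RE.match(first_line):
        return "bullet"
    return "paragraph"

def partition_blocks(text):
    """Split text on blank lines and classify each non-empty block."""
    blocks = re.split(r'\n\s*\n', text)
    return [{"type": classify_block(b), "text": b.strip()}
            for b in blocks if b.strip()]

sample = ("1. Introduction\n\nThis guide covers setup.\nIt has two lines.\n\n"
          "• First item\n\n- Second item\n\no Third item")
for block in partition_blocks(sample):
    print(block["type"], "->", block["text"][:30])
```

This only works on text that still contains its line breaks, so it would need the raw page text rather than a whitespace-collapsed version.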

🔹 Option B: Use unstructured.partition (Optional, layout-aware)

- Automatically splits documents into typed elements: Title, Narrative Text, List Items, Tables, Headers
- Handles multi-column or noisy layouts better

# Option A First: Simple Heuristics

In [47]:
# ✅ Step 1: Extract Sections by Numbered Headers

import re

def partition_by_headings(text):
    # Split by headers like: 1. Introduction, 2. Overview of Systems, etc.
    pattern = r'(?=\d+\.\s+[A-Z])'
    sections = re.split(pattern, text)
    structured_blocks = []

    for idx, section in enumerate(sections):
        section = section.strip()
        if section:
            lines = section.split('\n')
            title_line = lines[0].strip()
            body = '\n'.join(lines[1:]).strip() if len(lines) > 1 else ""
            structured_blocks.append({
                "section_id": idx + 1,
                "heading": title_line,
                "content": body
            })
    
    return structured_blocks



In [68]:
# Step 2: Apply to Your First Page (or Entire Doc)
all_sections = []
for page in text_pages:
    sections = partition_by_headings(page["cleaned_text"])
    all_sections.extend(sections)

print(f"Total Sections Found: {len(all_sections)}")
for sec in all_sections[:6]:  # Print first 6 sections
    print(f"\nSection: {sec['heading']}\nContent: {sec['content'][:300]}...")



Total Sections Found: 136

Section: CloudContactCenter Software Technical Requirements Reference Guide August 2021 This guide contains deployment, configuration, and troubleshooting information to assist customers and partners with Five9 applications. Five9andtheFive9logoareregisteredtrademarksofFive9anditssubsidiariesintheUnitedStatesandothercountries.Othermarksandbrandsmaybe claimedasthepropertyofothers.Theproductplans,specifications,anddescriptionshereinareprovidedforinformationonlyandsubjecttochange withoutnotice,andareprovidedwithoutwarrantyofanykind,expressorimplied.Copyright  2021Five9,Inc.
Content: ...

Section: About Five9 Five9istheleadingproviderofcloudcontactcentersoftware,bringingthepowerof thecloudtothousandsofcustomersandfacilitatingmorethanthreebillioncustomer interactionsannually.Since2001,Five9hasledthecloudrevolutionincontactcenters, deliveringsoftwaretohelporganizationsofeverysizetransitionfrompremise-based softwaretothecloud.Withitsextensiveexpertise,technology,and

In [69]:
for section in sections:
    print(f"\n🔍 Checking Section: {section['heading']}")  # instead of 'title'
    print(f"Content Length: {len(section['content'])}")
    print(f"Content: {section['content'][:300]}")  # preview only




🔍 Checking Section: References ReferenceDocuments Term/Acronym Definition destinationportnumberintheirpacketheaders.Aportnumberisa 16-bitunsignedinteger,rangingfrom0to65535.Aprocessassociates itsinputoroutputchannelsviaInternetsockets,atypeoffile descriptors,withatransportprotocol,aportnumberandanIP address.Thisprocessisknownasbinding,whichenablessendingand receivingdataviathenetwork. TLS TransportLayerSecurityprotocolthatprovidesdatasecrecyand integritybetweenapplications. VoIP VoiceoverInternetProtocol.Voicesignalsaretransmittedoverthe Internetratherthanoverthepublicswitchedtelephonenetwork (PSTN). VPN VirtualPrivateNetworkextendsaprivatenetworksothatthe resourcesthatbelongtothatnetworkareavailablebycontrolled remoteaccess. WebRTC WebReal-TimeCommunicationisanAPIdefinitiondraftedbythe WorldWideWebConsortium(W3C)tosupportbrowser-to-browser applicationsforvoice,videochat,andP2Pfilesharingwithouteither internalorexternalplug-ins. WSS SecureWebSocketprotocol.Providessecure,full-duplex c

🚀 Phase 3: Enrichment — Contextual Intelligence Layer

🎯 Objective:
Enhance each section with structured metadata such as:

- Named Entities (organizations, products, tech terms, dates, etc.)
- Keyword highlights
- (Optional) Intent or section classification (e.g., "architecture", "integration", "security")
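
For the optional section classification, one lightweight option is keyword matching. The sketch below is an assumption-heavy illustration: the labels come from the examples above, but the trigger words in `TOPIC_KEYWORDS` and the function name are invented for the example and would need tuning for a real document.

```python
import re

# Hypothetical labels and trigger words
TOPIC_KEYWORDS = {
    "security": {"tls", "vpn", "firewall", "encryption"},
    "integration": {"api", "crm", "webrtc", "plugin"},
    "architecture": {"server", "network", "bandwidth", "port"},
}

def classify_section(content):
    """Return the label whose trigger words overlap the content most."""
    words = set(re.findall(r'[a-z0-9]+', content.lower()))
    scores = {label: len(words & kws) for label, kws in TOPIC_KEYWORDS.items()}
    best = max(scores, key=scores.get)  # ties go to the first label listed
    return best if scores[best] > 0 else "unclassified"

print(classify_section("TLS provides data secrecy; a VPN extends a private network."))
```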

🧰 Tools We'll Use: spaCy (lightweight & offline)

We’ll use a pre-trained spaCy NLP model for:

- Tokenization
- Part-of-speech tagging
- Named Entity Recognition (NER)



In [19]:
#📦 Step 1: Install and Load spaCy

!pip install spacy
!python -m spacy download en_core_web_sm



In [49]:
import spacy
nlp = spacy.load("en_core_web_sm")


In [50]:
#🧠 Step 2: Enrich Sections with Named Entities

def enrich_with_entities(section):
    doc = nlp(section["content"])
    entities = [{"text": ent.text, "label": ent.label_} for ent in doc.ents]
    return {
        **section,
        "entities": entities
    }

enriched_sections = [enrich_with_entities(sec) for sec in all_sections]


In [51]:
#🧪 Step 3: Preview Enriched Output

for sec in enriched_sections[:3]:
    print(f"\n🧩 Section: {sec['heading']}")
    print(f"🔍 Entities: {[e['text'] + ' (' + e['label'] + ')' for e in sec['entities']]}")
    print(f"📄 Content Preview: {sec['content'][:200]}...\n")



🧩 Section: CloudContactCenter Software Technical Requirements Reference Guide August 2021 This guide contains deployment, configuration, and troubleshooting information to assist customers and partners with Five9 applications. Five9andtheFive9logoareregisteredtrademarksofFive9anditssubsidiariesintheUnitedStatesandothercountries.Othermarksandbrandsmaybe claimedasthepropertyofothers.Theproductplans,specifications,anddescriptionshereinareprovidedforinformationonlyandsubjecttochange withoutnotice,andareprovidedwithoutwarrantyofanykind,expressorimplied.Copyright  2021Five9,Inc.
🔍 Entities: []
📄 Content Preview: ...


🧩 Section: About Five9 Five9istheleadingproviderofcloudcontactcentersoftware,bringingthepowerof thecloudtothousandsofcustomersandfacilitatingmorethanthreebillioncustomer interactionsannually.Since2001,Five9hasledthecloudrevolutionincontactcenters, deliveringsoftwaretohelporganizationsofeverysizetransitionfrompremise-based softwaretothecloud.Withitsextensiveexpertise,technology
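
Per-section entities can also be rolled up across the whole document, which may be useful as input to the Global Entities stage named in the pipeline overview. The sketch below assumes each section carries the `entities` list produced by `enrich_with_entities`; the function name and the demo data are made up for illustration.

```python
from collections import Counter

def aggregate_entities(sections):
    """Count (text, label) pairs across all sections."""
    counts = Counter()
    for sec in sections:
        for ent in sec.get("entities", []):
            counts[(ent["text"], ent["label"])] += 1
    return counts

# Stand-in data in the same shape as the enriched sections (not real output)
demo_sections = [
    {"heading": "A", "content": "", "entities": [{"text": "Five9", "label": "ORG"}]},
    {"heading": "B", "content": "", "entities": [{"text": "Five9", "label": "ORG"},
                                                 {"text": "2021", "label": "DATE"}]},
]

for (text, label), n in aggregate_entities(demo_sections).most_common():
    print(f"{text} ({label}): {n}")
```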

# Option 2: Keyword Extraction (Easy + Useful)

In [24]:
#Install and use KeyBERT (it uses BERT embeddings under the hood):

!pip install keybert


Collecting keybert
  Downloading keybert-0.9.0-py3-none-any.whl.metadata (15 kB)
Downloading keybert-0.9.0-py3-none-any.whl (41 kB)
Installing collected packages: keybert
Successfully installed keybert-0.9.0


In [53]:
#🧠 Step 2: Extract Keywords per Section

from keybert import KeyBERT

kw_model = KeyBERT(model='all-MiniLM-L6-v2')  # Light and powerful

def enrich_with_keywords(section):
    keywords = kw_model.extract_keywords(
        section["content"],
        keyphrase_ngram_range=(1, 2),
        stop_words='english',
        top_n=5
    )
    return {
        **section,
        "keywords": [kw[0] for kw in keywords]
    }

# Build on the entity-enriched sections so the "entities" field is kept
enriched_sections = [enrich_with_keywords(sec) for sec in enriched_sections]




In [54]:
#🔍 Step 3: Preview Output

for sec in enriched_sections[:3]:
    print(f"\n🧩 Section: {sec['heading']}")
    print(f"🔑 Keywords: {sec['keywords']}")
    print(f"📄 Content Preview: {sec['content'][:200]}...")




🧩 Section: CloudContactCenter Software Technical Requirements Reference Guide August 2021 This guide contains deployment, configuration, and troubleshooting information to assist customers and partners with Five9 applications. Five9andtheFive9logoareregisteredtrademarksofFive9anditssubsidiariesintheUnitedStatesandothercountries.Othermarksandbrandsmaybe claimedasthepropertyofothers.Theproductplans,specifications,anddescriptionshereinareprovidedforinformationonlyandsubjecttochange withoutnotice,andareprovidedwithoutwarrantyofanykind,expressorimplied.Copyright  2021Five9,Inc.
🔑 Keywords: []
📄 Content Preview: ...

🧩 Section: About Five9 Five9istheleadingproviderofcloudcontactcentersoftware,bringingthepowerof thecloudtothousandsofcustomersandfacilitatingmorethanthreebillioncustomer interactionsannually.Since2001,Five9hasledthecloudrevolutionincontactcenters, deliveringsoftwaretohelporganizationsofeverysizetransitionfrompremise-based softwaretothecloud.Withitsextensiveexpertise,technology,

❓ Why Is KeyBERT Returning Empty Results?

| Root Cause                               | What’s Happening                                                                                             |
| ---------------------------------------- | ------------------------------------------------------------------------------------------------------------ |
| 🔹 Section content is too short          | If there's not enough substance (e.g., just a heading or list), BERT can’t extract contextually rich phrases |
| 🔹 Punctuation/formatting issues         | If sections are just a flat stream of bullet points or merged lines, models struggle                         |
| 🔹 `keyphrase_ngram_range` is too narrow | We may miss 3-word technical terms like `"Agent Assist Engine"` or `"call flow integration"`                 |
| 🔹 Token filtering too aggressive        | Stopword removal + short token length + flat formatting = no keywords                                        |



🛠 Let's Fix It Step-by-Step



In [55]:
#✅ Step 1: Try With Larger N-Gram Range
# Allow bigger phrases (e.g., 1 to 3 words):

def enrich_with_keywords(section):
    keywords = kw_model.extract_keywords(
        section["content"],
        keyphrase_ngram_range=(1, 3),  # increased range
        stop_words='english',
        top_n=5
    )
    return {
        **section,
        "keywords": [kw[0] for kw in keywords]
    }


In [56]:
#✅ Step 2: Sanity Check One Section Directly
#Pick a section that has some text, not just headings:

sample_section = all_sections[1]  # second section; pick any index whose content is non-empty
print(sample_section["content"])





In [57]:
kw_model.extract_keywords(sample_section["content"], 
                          keyphrase_ngram_range=(1, 3), 
                          stop_words='english',
                          top_n=10)


[]

In [58]:
#✅ Step 3: Add a Fallback for Very Short Sections
# Skip keyword extraction if content is < 50 characters:

def enrich_with_keywords(section):
    content = section["content"]
    if len(content) < 50:
        return {**section, "keywords": []}
    
    keywords = kw_model.extract_keywords(
        content,
        keyphrase_ngram_range=(1, 3),
        stop_words='english',
        top_n=5
    )
    return {
        **section,
        "keywords": [kw[0] for kw in keywords]
    }



In [59]:
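# 🔎 Test the Output Again

# Re-run enrichment so the preview reflects the updated enrich_with_keywords
enriched_sections = [enrich_with_keywords(sec) for sec in enriched_sections]

for sec in enriched_sections[:3]:
    print(f"\n🧩 Section: {sec['heading']}")
    print(f"🔑 Keywords: {sec['keywords']}")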



🧩 Section: CloudContactCenter Software Technical Requirements Reference Guide August 2021 This guide contains deployment, configuration, and troubleshooting information to assist customers and partners with Five9 applications. Five9andtheFive9logoareregisteredtrademarksofFive9anditssubsidiariesintheUnitedStatesandothercountries.Othermarksandbrandsmaybe claimedasthepropertyofothers.Theproductplans,specifications,anddescriptionshereinareprovidedforinformationonlyandsubjecttochange withoutnotice,andareprovidedwithoutwarrantyofanykind,expressorimplied.Copyright  2021Five9,Inc.
🔑 Keywords: []

🧩 Section: About Five9 Five9istheleadingproviderofcloudcontactcentersoftware,bringingthepowerof thecloudtothousandsofcustomersandfacilitatingmorethanthreebillioncustomer interactionsannually.Since2001,Five9hasledthecloudrevolutionincontactcenters, deliveringsoftwaretohelporganizationsofeverysizetransitionfrompremise-based softwaretothecloud.Withitsextensiveexpertise,technology,andecosystemof partners

In [65]:
for section in sections:
    print(f"\n🔍 Checking Section: {section['heading']}")  # sections use 'heading', not 'title'
    print(f"Content Length: {len(section['content'])}")
    print(f"Content: {section['content'][:300]}")  # Just first 300 chars

#  Use RAKE for Keyword Extraction

In [60]:
#🔧 Step 1: Install rake-nltk

!pip install rake-nltk





In [61]:
import nltk
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /home/koyas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [62]:
from rake_nltk import Rake

rake = Rake()  # Defaults to NLTK's English stopwords (downloaded above)

def enrich_with_rake_keywords(section):
    content = section["content"]
    if len(content.strip()) < 30:
        return {**section, "keywords": []}
    
    rake.extract_keywords_from_text(content)
    keywords = rake.get_ranked_phrases()[:5]
    
    return {
        **section,
        "keywords": keywords
    }

# Reuse the already-enriched sections so earlier fields (e.g., "entities") are kept
enriched_sections = [enrich_with_rake_keywords(sec) for sec in enriched_sections]


In [63]:
for sec in enriched_sections[:3]:
    print(f"\n🧩 Section: {sec['heading']}")
    print(f"🔑 Keywords: {sec['keywords']}")



🧩 Section: CloudContactCenter Software Technical Requirements Reference Guide August 2021 This guide contains deployment, configuration, and troubleshooting information to assist customers and partners with Five9 applications. Five9andtheFive9logoareregisteredtrademarksofFive9anditssubsidiariesintheUnitedStatesandothercountries.Othermarksandbrandsmaybe claimedasthepropertyofothers.Theproductplans,specifications,anddescriptionshereinareprovidedforinformationonlyandsubjecttochange withoutnotice,andareprovidedwithoutwarrantyofanykind,expressorimplied.Copyright  2021Five9,Inc.
🔑 Keywords: []

🧩 Section: About Five9 Five9istheleadingproviderofcloudcontactcentersoftware,bringingthepowerof thecloudtothousandsofcustomersandfacilitatingmorethanthreebillioncustomer interactionsannually.Since2001,Five9hasledthecloudrevolutionincontactcenters, deliveringsoftwaretohelporganizationsofeverysizetransitionfrompremise-based softwaretothecloud.Withitsextensiveexpertise,technology,andecosystemof partners

In [41]:
import nltk
nltk.download('punkt_tab')

sample_text = """
The Agent Assist Application integrates directly with PBX systems like Asterisk and Avaya.
It leverages CRM APIs to deliver real-time customer insights. This seamless integration helps agents reduce call time and improve customer satisfaction.
"""

print("RAKE:")
rake.extract_keywords_from_text(sample_text)
print(rake.get_ranked_phrases()[:5])

print("\nKeyBERT:")
kw_model.extract_keywords(sample_text, keyphrase_ngram_range=(1, 3), stop_words='english', top_n=5)


[nltk_data] Downloading package punkt_tab to /home/koyas/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


RAKE:
['seamless integration helps agents reduce call time', 'agent assist application integrates directly', 'pbx systems like asterisk', 'time customer insights', 'leverages crm apis']

KeyBERT:


[('agent assist application', 0.6803),
 ('assist application integrates', 0.5592),
 ('agent assist', 0.5513),
 ('integration helps agents', 0.5197),
 ('assist application', 0.5173)]
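
The two outputs differ because RAKE scores phrases from word co-occurrence alone, while KeyBERT ranks candidates by embedding similarity. To make the RAKE side concrete, here is a minimal sketch of degree-to-frequency scoring. It is not the `rake-nltk` implementation: the tiny `STOPWORDS` set and the function names are assumptions for illustration.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; rake-nltk uses NLTK's full list)
STOPWORDS = {"the", "with", "and", "it", "to", "this", "a", "of"}

def candidate_phrases(text):
    """Split text into candidate phrases at stopwords and punctuation."""
    phrases = []
    for fragment in re.split(r'[^\w\s-]+', text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake_scores(text):
    """Score each phrase as the sum of degree/frequency over its words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences in the phrase, itself included
    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

text = ("The Agent Assist Application integrates directly with "
        "PBX systems like Asterisk and Avaya.")
for phrase, score in rake_scores(text):
    print(f"{score:5.1f}  {phrase}")
```

Longer phrases accumulate higher scores under this scheme, which is why RAKE tends to surface long word runs while KeyBERT returns shorter, overlapping n-grams.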
