# Notebook for experimenting with Text Splitting and Vector Embedding

When a document feeds a RAG (Retrieval-Augmented Generation) chatbot powered by a Large Language Model (LLM), the semantic coherence of its chunks matters for several reasons:

- Retrieval quality depends on the document representations. If the text splitter segments the document poorly, or the embeddings do not capture the meaning of each segment, the retrieved chunks may be irrelevant or incoherent, and the chatbot's responses suffer.
- Coherent, relevant chunks help keep the chatbot's answers on topic for the user's query, which supports trust in the chatbot.
- In a conversational system, coherent context helps conversations flow naturally, which improves user engagement and satisfaction.

This notebook therefore compares several text splitters and two embedding models by the semantic coherence of the chunks they produce.

### Import Libraries

In [128]:
pip install nltk




In [153]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import PyPDFDirectoryLoader
import re
import nltk
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize

### Load PDF

In [130]:
import os

directory = "C:/Users/York Yong/OneDrive - Singapore Management University/Desktop/Langchain Project/groq/pokemon guide"
print("Directory exists:", os.path.exists(directory))
print("Directory contents:", os.listdir(directory))

Directory exists: True
Directory contents: ['Pokemon Scarlet_Violet Walkthrough.pdf']


In [131]:

loader = PyPDFDirectoryLoader(directory)  # Data ingestion from the PDF folder
print(vars(loader))

{'path': 'C:/Users/York Yong/OneDrive - Singapore Management University/Desktop/Langchain Project/groq/pokemon guide', 'glob': '**/[!.]*.pdf', 'load_hidden': False, 'recursive': False, 'silent_errors': False, 'extract_images': False}


In [132]:
docs = loader.load()  # Document loading: one Document per PDF page
print(docs)

[Document(page_content='Pokemo\x00\nScarle\x00\nAn\x00\nViole\x00\nWalkthroug\x00\nScarlet\nplayers\nwill\nhave\nthe\nopportunity\nto\ncatch\nand\nride\nK o\nraidon\n.\nViolet\nplayers\nwill\nbe\nable\nto\ncatch\nand\nride\nM i\nraidon\n.\nFirst\nthings\nﬁrst,\npick\nyour\nstarter!\n', metadata={'source': 'C:\\Users\\York Yong\\OneDrive - Singapore Management University\\Desktop\\Langchain Project\\groq\\pokemon guide\\Pokemon Scarlet_Violet Walkthrough.pdf', 'page': 0}), Document(page_content='This\npokemon\ngame\nis\nset\nin\nthe\nregion\nof\nPaldea,\nafter\na\ngood\namount\nof\ntalking\nthrough\nthe\nbeginning\nof\nthe\ngame\nyou\nwill\nfind\nthere\nare\nthree\nmain\nstory-lines\nor\nQuests\nto\naccomplish,\nand\nthey\ncan\nbe\ndone\nin\nany\norder\nat\nany\ntime.\nThe\ngame\nallows\nyou\nto\njust\ngo\nexplore\nthe\nworld\nat\nyour\nown\npace.\nThis\ncan\nbe\na\nlittle\nconfusing\nthough\nsince\nit\ndoesn’t\ngive\nyou\na\nlot\nof\ndirection\nabout\nwhere\nto\ngo\nnext\nor\nwho\nto\n

#### Test with RecursiveCharacterTextSplitter

In [133]:
# Chunk creation: up to 2000 characters per chunk, with 200 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
RCTS_final_documents = text_splitter.split_documents(docs[:10])  # first 10 pages only
RCTS_final_documents


[Document(page_content='Pokemo\x00\nScarle\x00\nAn\x00\nViole\x00\nWalkthroug\x00\nScarlet\nplayers\nwill\nhave\nthe\nopportunity\nto\ncatch\nand\nride\nK o\nraidon\n.\nViolet\nplayers\nwill\nbe\nable\nto\ncatch\nand\nride\nM i\nraidon\n.\nFirst\nthings\nﬁrst,\npick\nyour\nstarter!', metadata={'source': 'C:\\Users\\York Yong\\OneDrive - Singapore Management University\\Desktop\\Langchain Project\\groq\\pokemon guide\\Pokemon Scarlet_Violet Walkthrough.pdf', 'page': 0}),
 Document(page_content='This\npokemon\ngame\nis\nset\nin\nthe\nregion\nof\nPaldea,\nafter\na\ngood\namount\nof\ntalking\nthrough\nthe\nbeginning\nof\nthe\ngame\nyou\nwill\nfind\nthere\nare\nthree\nmain\nstory-lines\nor\nQuests\nto\naccomplish,\nand\nthey\ncan\nbe\ndone\nin\nany\norder\nat\nany\ntime.\nThe\ngame\nallows\nyou\nto\njust\ngo\nexplore\nthe\nworld\nat\nyour\nown\npace.\nThis\ncan\nbe\na\nlittle\nconfusing\nthough\nsince\nit\ndoesn’t\ngive\nyou\na\nlot\nof\ndirection\nabout\nwhere\nto\ngo\nnext\nor\nwho\nto\nf

#### Test with NLTK for Document Splitting

In [134]:
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import namedtuple

# sent_tokenize and word_tokenize need NLTK's Punkt tokenizer data,
# which is downloaded once with nltk.download(...) if it is missing

# Define the Document named tuple
Document = namedtuple('Document', ['page_content', 'metadata'])

# Split text into chunks of at most chunk_size characters.
# Sentences are tokenized first, but chunks are filled word by word,
# so a chunk boundary can still fall in the middle of a sentence.
def split_text_nltk(text, chunk_size):
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = []
    current_size = 0

    for sentence in sentences:
        words = word_tokenize(sentence)
        for word in words:
            # Flush once the next word would exceed the limit; skip the flush
            # when the chunk is empty so an oversized word never emits ''
            if current_chunk and current_size + len(word) + 1 > chunk_size:  # +1 for the joining space
                chunks.append(' '.join(current_chunk))
                current_chunk = []
                current_size = 0
            current_chunk.append(word)
            current_size += len(word) + 1  # +1 for the joining space

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks

# Process documents
NLTK_final_documents = []
num_docs = 10  # Number of documents (PDF pages) to process
num_chunks_per_doc = 10  # Maximum number of chunks kept per document

for doc in docs[:num_docs]:  # the slice already limits the loop to num_docs documents
    text = doc.page_content
    chunks = split_text_nltk(text, chunk_size=2000)[:num_chunks_per_doc]

    # Append Document objects with page_content and metadata
    for chunk in chunks:
        NLTK_final_documents.append(Document(page_content=chunk, metadata=doc.metadata))

NLTK_final_documents



[Document(page_content='Pokemo\x00 Scarle\x00 An\x00 Viole\x00 Walkthroug\x00 Scarlet players will have the opportunity to catch and ride K o raidon . Violet players will be able to catch and ride M i raidon . First things ﬁrst , pick your starter !', metadata={'source': 'C:\\Users\\York Yong\\OneDrive - Singapore Management University\\Desktop\\Langchain Project\\groq\\pokemon guide\\Pokemon Scarlet_Violet Walkthrough.pdf', 'page': 0}),
 Document(page_content='This pokemon game is set in the region of Paldea , after a good amount of talking through the beginning of the game you will find there are three main story-lines or Quests to accomplish , and they can be done in any order at any time . The game allows you to just go explore the world at your own pace . This can be a little confusing though since it doesn ’ t give you a lot of direction about where to go next or who to fight in what order , so I ’ m going to break it down by each Quest line by the order in which they might be ea
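
The NLTK splitter above tokenizes the text into sentences but then fills chunks word by word, so a chunk boundary can land inside a sentence. As a point of comparison, here is a minimal sketch of packing whole sentences into chunks instead. The name `pack_sentences` is hypothetical, and a simple regex split stands in for `sent_tokenize` so the example runs without NLTK data; it is an illustration under those assumptions, not the method used for the scores in this notebook.

```python
import re

def pack_sentences(sentences, chunk_size):
    """Greedily pack whole sentences into chunks of at most chunk_size characters."""
    chunks, current, current_size = [], [], 0
    for sentence in sentences:
        # A joining space is only needed when the chunk already holds a sentence.
        added = len(sentence) + (1 if current else 0)
        if current and current_size + added > chunk_size:
            chunks.append(" ".join(current))
            current, current_size = [], 0
            added = len(sentence)
        current.append(sentence)
        current_size += added
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = "Pick your starter! Then explore Paldea. Quests can be done in any order."
# Stand-in sentence splitter (assumption): break after ., ! or ? followed by whitespace.
sample_sentences = re.split(r"(?<=[.!?])\s+", sample)
packed = pack_sentences(sample_sentences, chunk_size=40)
print(packed)
```

With a 40-character limit the first two sentences share a chunk and the third gets its own. A single sentence longer than `chunk_size` is kept whole in this sketch, so the limit is a target rather than a hard cap.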

#### Test with SpaCy for Document Splitting

In [135]:
# pip install spaCy

In [136]:
# !python -m spacy download en_core_web_sm


In [137]:
import spacy

# Load the pipeline once; reloading it inside the function for every page is slow
nlp = spacy.load("en_core_web_sm")

def split_text_spacy(text, chunk_size):
    doc = nlp(text)
    chunks = []
    current_chunk = []
    current_size = 0

    for sent in doc.sents:
        for token in sent:
            # Newline characters in the page text come through as whitespace tokens,
            # which is why '\n' appears between words in the output below
            if current_chunk and current_size + len(token.text) + 1 > chunk_size:  # +1 for the joining space
                chunks.append(' '.join(current_chunk))
                current_chunk = []
                current_size = 0
            current_chunk.append(token.text)
            current_size += len(token.text) + 1  # +1 for the joining space

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks

# Process documents
spacy_final_documents = []
num_docs = 10  # Number of documents (PDF pages) to process
num_chunks_per_doc = 10  # Maximum number of chunks kept per document

for doc in docs[:num_docs]:  # the slice already limits the loop to num_docs documents
    text = doc.page_content
    chunks = split_text_spacy(text, chunk_size=2000)[:num_chunks_per_doc]

    # Append Document objects with page_content and metadata
    for chunk in chunks:
        spacy_final_documents.append(Document(page_content=chunk, metadata=doc.metadata))

spacy_final_documents


[Document(page_content='Pokemo\x00 \n Scarle\x00 \n An\x00 \n Viole\x00 \n Walkthroug\x00 \n Scarlet \n players \n will \n have \n the \n opportunity \n to \n catch \n and \n ride \n K o \n raidon \n . \n Violet \n players \n will \n be \n able \n to \n catch \n and \n ride \n M i \n raidon \n . \n First \n things \n ﬁrst , \n pick \n your \n starter ! \n', metadata={'source': 'C:\\Users\\York Yong\\OneDrive - Singapore Management University\\Desktop\\Langchain Project\\groq\\pokemon guide\\Pokemon Scarlet_Violet Walkthrough.pdf', 'page': 0}),
 Document(page_content='This \n pokemon \n game \n is \n set \n in \n the \n region \n of \n Paldea , \n after \n a \n good \n amount \n of \n talking \n through \n the \n beginning \n of \n the \n game \n you \n will \n find \n there \n are \n three \n main \n story - lines \n or \n Quests \n to \n accomplish , \n and \n they \n can \n be \n done \n in \n any \n order \n at \n any \n time . \n The \n game \n allows \n you \n to \n just \n go \n 

#### Test with Custom Character Splitter

In [138]:
def split_text_custom(text, chunk_size, chunk_overlap):
    # The window must advance on every iteration, otherwise the loop never ends
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    chunks = []
    start = 0
    text_length = len(text)

    while start < text_length:
        end = min(start + chunk_size, text_length)
        chunks.append(text[start:end])
        start += chunk_size - chunk_overlap  # move to the next chunk start considering the overlap

    return chunks

# Process documents
CCS_final_documents = []
num_docs = 10  # Number of documents (PDF pages) to process
num_chunks_per_doc = 10  # Maximum number of chunks kept per document

for doc in docs[:num_docs]:  # the slice already limits the loop to num_docs documents
    text = doc.page_content
    chunks = split_text_custom(text, chunk_size=2000, chunk_overlap=200)[:num_chunks_per_doc]

    # Append Document objects with page_content and metadata
    for chunk in chunks:
        CCS_final_documents.append(Document(page_content=chunk, metadata=doc.metadata))

CCS_final_documents


[Document(page_content='Pokemo\x00\nScarle\x00\nAn\x00\nViole\x00\nWalkthroug\x00\nScarlet\nplayers\nwill\nhave\nthe\nopportunity\nto\ncatch\nand\nride\nK o\nraidon\n.\nViolet\nplayers\nwill\nbe\nable\nto\ncatch\nand\nride\nM i\nraidon\n.\nFirst\nthings\nﬁrst,\npick\nyour\nstarter!\n', metadata={'source': 'C:\\Users\\York Yong\\OneDrive - Singapore Management University\\Desktop\\Langchain Project\\groq\\pokemon guide\\Pokemon Scarlet_Violet Walkthrough.pdf', 'page': 0}),
 Document(page_content='This\npokemon\ngame\nis\nset\nin\nthe\nregion\nof\nPaldea,\nafter\na\ngood\namount\nof\ntalking\nthrough\nthe\nbeginning\nof\nthe\ngame\nyou\nwill\nfind\nthere\nare\nthree\nmain\nstory-lines\nor\nQuests\nto\naccomplish,\nand\nthey\ncan\nbe\ndone\nin\nany\norder\nat\nany\ntime.\nThe\ngame\nallows\nyou\nto\njust\ngo\nexplore\nthe\nworld\nat\nyour\nown\npace.\nThis\ncan\nbe\na\nlittle\nconfusing\nthough\nsince\nit\ndoesn’t\ngive\nyou\na\nlot\nof\ndirection\nabout\nwhere\nto\ngo\nnext\nor\nwho\nto\
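
To make the sliding-window behaviour of the custom character splitter easier to see, here is a minimal sketch on a ten-character toy string. The name `sliding_window_chunks` is hypothetical; it is assumed to produce the same windows as the `while` loop above, written as a list comprehension.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Fixed-size character windows whose starts are chunk_size - chunk_overlap apart."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

toy_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(toy_chunks)  # ['abcd', 'defg', 'ghij', 'j']
```

Each window repeats the last character of the previous one. The final window `'j'` is already fully contained in `'ghij'`: the loop keeps emitting windows until the start index passes the end of the text, so a short trailing chunk made only of overlap can appear.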

#### Using Hugging Face 'Transformers'

In [139]:
from transformers import GPT2Tokenizer

# Load the tokenizer once instead of on every call
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def split_text_transformers(text, chunk_size):
    # Note: truncation=True with max_length=chunk_size keeps only the first chunk_size tokens,
    # so the loop below yields at most one chunk per page and any remaining text is dropped.
    # return_tensors="pt" requires PyTorch to be installed.
    tokens = tokenizer(text, return_tensors="pt", max_length=chunk_size, truncation=True)

    chunks = []
    for i in range(0, len(tokens['input_ids'][0]), chunk_size):
        chunk_tokens = tokens['input_ids'][0][i:i + chunk_size]
        chunk_text = tokenizer.decode(chunk_tokens, skip_special_tokens=True)
        chunks.append(chunk_text)

    return chunks

# Process documents
HF_final_documents = []
num_docs = 10  # Number of documents (PDF pages) to process
num_chunks_per_doc = 10  # Maximum number of chunks kept per document

for doc in docs[:num_docs]:  # the slice already limits the loop to num_docs documents
    text = doc.page_content
    chunks = split_text_transformers(text, chunk_size=512)[:num_chunks_per_doc]  # Adjust chunk size based on token limit

    # Append Document objects with page_content and metadata
    for chunk in chunks:
        HF_final_documents.append(Document(page_content=chunk, metadata=doc.metadata))

HF_final_documents




[Document(page_content='Pokemo\x00\nScarle\x00\nAn\x00\nViole\x00\nWalkthroug\x00\nScarlet\nplayers\nwill\nhave\nthe\nopportunity\nto\ncatch\nand\nride\nK o\nraidon\n.\nViolet\nplayers\nwill\nbe\nable\nto\ncatch\nand\nride\nM i\nraidon\n.\nFirst\nthings\nﬁrst,\npick\nyour\nstarter!\n', metadata={'source': 'C:\\Users\\York Yong\\OneDrive - Singapore Management University\\Desktop\\Langchain Project\\groq\\pokemon guide\\Pokemon Scarlet_Violet Walkthrough.pdf', 'page': 0}),
 Document(page_content='This\npokemon\ngame\nis\nset\nin\nthe\nregion\nof\nPaldea,\nafter\na\ngood\namount\nof\ntalking\nthrough\nthe\nbeginning\nof\nthe\ngame\nyou\nwill\nfind\nthere\nare\nthree\nmain\nstory-lines\nor\nQuests\nto\naccomplish,\nand\nthey\ncan\nbe\ndone\nin\nany\norder\nat\nany\ntime.\nThe\ngame\nallows\nyou\nto\njust\ngo\nexplore\nthe\nworld\nat\nyour\nown\npace.\nThis\ncan\nbe\na\nlittle\nconfusing\nthough\nsince\nit\ndoesn’t\ngive\nyou\na\nlot\nof\ndirection\nabout\nwhere\nto\ngo\nnext\nor\nwho\nto\
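
The tokenizer call in `split_text_transformers` passes `truncation=True` together with `max_length=chunk_size`, so only the first `chunk_size` tokens of each page reach the windowing loop. The sketch below illustrates the effect with plain integer lists standing in for token ids (an assumption: no real tokenizer is involved, and `window_token_ids` is a hypothetical helper).

```python
def window_token_ids(token_ids, chunk_size):
    """Cut a sequence of token ids into consecutive windows of chunk_size."""
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]

page_token_ids = list(range(10))      # pretend a page tokenizes to 10 ids
truncated_ids = page_token_ids[:4]    # what truncation to max_length=4 leaves behind

windows_after_truncation = window_token_ids(truncated_ids, chunk_size=4)
windows_without_truncation = window_token_ids(page_token_ids, chunk_size=4)

print(len(windows_after_truncation))    # 1
print(len(windows_without_truncation))  # 3
```

In the truncated case the page contributes a single chunk and the last six ids are lost. Windowing the untruncated ids would cover the whole page, but it would also change the chunks, and therefore the scores, reported for the HF method below.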

### Evaluating the Semantic Coherence of Each Text Splitter with OllamaEmbeddings

- Semantic coherence is used here as an evaluation metric for how well both the text splitter and the embedding method capture the underlying semantic structure of the text.
- By assessing how semantically related the different chunks are, we can infer how well the text splitter has divided the text into meaningful segments and how accurately the embeddings represent the content of those segments.
- In this notebook the score is the mean cosine similarity over all pairs of chunk embeddings, so a higher score means the chunks sit closer together in embedding space.
- A higher score is read as a sign that the text splitter has segmented the text into coherent units and that the embeddings capture the semantic relationships between these units.
- Semantic coherence therefore serves as a measure of the combined quality of the text splitter and the embeddings.
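
As a small worked example of the metric described above, the sketch below computes the mean pairwise cosine similarity for three toy 2-D vectors using NumPy only. The helper name `mean_pairwise_cosine` and its `include_diagonal` flag are hypothetical; the flag is there to show how much the self-similarities on the diagonal contribute when the mean is taken over the full matrix.

```python
import numpy as np

def mean_pairwise_cosine(vectors, include_diagonal=True):
    """Mean cosine similarity over all pairs of row vectors."""
    vectors = np.asarray(vectors, dtype=float)
    # Normalise each row to unit length so the dot product equals the cosine.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    if include_diagonal:
        return float(sims.mean())
    # Drop the diagonal: each vector compared with itself is always 1.0.
    off_diagonal = sims[~np.eye(len(sims), dtype=bool)]
    return float(off_diagonal.mean())

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
with_diagonal = mean_pairwise_cosine(toy_vectors)
without_diagonal = mean_pairwise_cosine(toy_vectors, include_diagonal=False)
print(round(with_diagonal, 4), round(without_diagonal, 4))  # 0.6476 0.4714
```

The diagonal raises the toy score from about 0.47 to about 0.65. Its weight is 1/n of the matrix, so with few chunks it shifts the score noticeably, which is worth keeping in mind when comparing methods that produce different numbers of chunks.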

In [155]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Initialize OllamaEmbeddings model
ollama_embedding = OllamaEmbeddings(model="llama3")

# Function to convert final documents to embedded vectors using OllamaEmbeddings
def embed_final_documents(final_documents):
    print("Embedding final documents...")
    embedded_vectors = [ollama_embedding.embed_query(doc.page_content) for doc in final_documents]
    print("Embedding complete.")
    return np.array(embedded_vectors)

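# Function to compute semantic coherence using cosine similarity
def compute_semantic_coherence(embedded_vectors):
    print("Computing semantic coherence...")
    similarities = cosine_similarity(embedded_vectors)  # n x n matrix of pairwise similarities
    # The mean is taken over the full matrix, so it includes the diagonal
    # (each chunk compared with itself, always 1.0) and counts every pair twice
    mean_similarity = np.mean(similarities)
    print("Coherence computation complete.")
    return mean_similarity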

# Evaluate semantic coherence for each set of final documents
methods = ["RCTS", "NLTK", "spacy", "CCS", "HF"]
coherence_scores_ollama = {}

for method, final_documents in zip(methods, [RCTS_final_documents, NLTK_final_documents, spacy_final_documents, CCS_final_documents, HF_final_documents]):
    print(f"Processing documents for {method}...")
    embedded_vectors = embed_final_documents(final_documents)
    coherence_scores_ollama[method] = compute_semantic_coherence(embedded_vectors)
    print(f"Semantic Coherence ({method}): {coherence_scores_ollama[method]}")

# Print the computed coherence scores
print("Coherence scores:")
for method, score in coherence_scores_ollama.items():
    print(f"{method}: {score}")


Processing documents for RCTS...
Embedding final documents...
Embedding complete.
Computing semantic coherence...
Coherence computation complete.
Semantic Coherence (RCTS): 0.6080926294592386
Processing documents for NLTK...
Embedding final documents...
Embedding complete.
Computing semantic coherence...
Coherence computation complete.
Semantic Coherence (NLTK): 0.5594600037288328
Processing documents for spacy...
Embedding final documents...
Embedding complete.
Computing semantic coherence...
Coherence computation complete.
Semantic Coherence (spacy): 0.5034041292097423
Processing documents for CCS...
Embedding final documents...
Embedding complete.
Computing semantic coherence...
Coherence computation complete.
Semantic Coherence (CCS): 0.5433581095054762
Processing documents for HF...
Embedding final documents...
Embedding complete.
Computing semantic coherence...
Coherence computation complete.
Semantic Coherence (HF): 0.430632162393928
Coherence scores:
RCTS: 0.6080926294592386
NL

### Evaluating the Semantic Coherence of Each Text Splitter with OpenAIEmbeddings


In [156]:
import os
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from langchain_openai import OpenAIEmbeddings

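# OpenAIEmbeddings reads the API key from the OPENAI_API_KEY environment variable;
# fail early with a clear message if it is not set
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("Set the OPENAI_API_KEY environment variable before running this cell")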

# Initialize OpenAIEmbeddings model
OpenAIembeddings = OpenAIEmbeddings()


# Function to convert final documents to embedded vectors using OpenAI embeddings
def embed_final_documents(final_documents):
    print("Embedding final documents...")
    embedded_vectors = [OpenAIembeddings.embed_query(doc.page_content) for doc in final_documents]
    print("Embedding complete.")
    return np.array(embedded_vectors)

# Function to compute semantic coherence using cosine similarity
def compute_semantic_coherence(embedded_vectors):
    print("Computing semantic coherence...")
    similarities = cosine_similarity(embedded_vectors)  # n x n matrix of pairwise similarities
    # Mean over the full matrix, including the diagonal of self-similarities (always 1.0)
    mean_similarity = np.mean(similarities)
    print("Coherence computation complete.")
    return mean_similarity

# Evaluate semantic coherence for each set of final documents
methods = ["RCTS", "NLTK", "spacy", "CCS", "HF"]
coherence_scores_openai = {}

for method, final_documents in zip(methods, [RCTS_final_documents, NLTK_final_documents, spacy_final_documents, CCS_final_documents, HF_final_documents]):
    print(f"Processing documents for {method}...")
    embedded_vectors = embed_final_documents(final_documents)
    coherence_scores_openai[method] = compute_semantic_coherence(embedded_vectors)
    print(f"Semantic Coherence ({method}): {coherence_scores_openai[method]}")

# Print the computed coherence scores
print("Coherence scores:")
for method, score in coherence_scores_openai.items():
    print(f"{method}: {score}")


Processing documents for RCTS...
Embedding final documents...
Embedding complete.
Computing semantic coherence...
Coherence computation complete.
Semantic Coherence (RCTS): 0.8312896419886426
Processing documents for NLTK...
Embedding final documents...
Embedding complete.
Computing semantic coherence...
Coherence computation complete.
Semantic Coherence (NLTK): 0.8217621444530948
Processing documents for spacy...
Embedding final documents...
Embedding complete.
Computing semantic coherence...
Coherence computation complete.
Semantic Coherence (spacy): 0.8285675823169973
Processing documents for CCS...
Embedding final documents...
Embedding complete.
Computing semantic coherence...
Coherence computation complete.
Semantic Coherence (CCS): 0.8305530635617879
Processing documents for HF...
Embedding final documents...
Embedding complete.
Computing semantic coherence...
Coherence computation complete.
Semantic Coherence (HF): 0.826821073674062
Coherence scores:
RCTS: 0.8312896419886426
NL

In [158]:
# Create a DataFrame
df = pd.DataFrame({
    "Method": methods,
    "Coherence Score (OllamaEmbeddings)": [coherence_scores_ollama[method] for method in methods],
    "Coherence Score (OpenAI Embeddings)": [coherence_scores_openai[method] for method in methods]
})

# Display the DataFrame
df

| | Method | Coherence Score (OllamaEmbeddings) | Coherence Score (OpenAI Embeddings) |
|---|---|---|---|
| 0 | RCTS | 0.608093 | 0.83129 |
| 1 | NLTK | 0.55946 | 0.821762 |
| 2 | spacy | 0.503404 | 0.828568 |
| 3 | CCS | 0.543358 | 0.830553 |
| 4 | HF | 0.430632 | 0.826821 |
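
One way to read these results is to look at the top-scoring method and the spread between the highest and lowest score for each embedding model. The sketch below does that arithmetic on the values copied from the results above; the `scores` and `summary` names are hypothetical, and this is a reading aid rather than part of the original evaluation.

```python
# Scores copied from the results above, rounded as displayed.
scores = {
    "OllamaEmbeddings": {
        "RCTS": 0.608093, "NLTK": 0.55946, "spacy": 0.503404,
        "CCS": 0.543358, "HF": 0.430632,
    },
    "OpenAI Embeddings": {
        "RCTS": 0.83129, "NLTK": 0.821762, "spacy": 0.828568,
        "CCS": 0.830553, "HF": 0.826821,
    },
}

summary = {}
for embedding_name, by_method in scores.items():
    best_method = max(by_method, key=by_method.get)
    # Spread: gap between the highest and lowest score for this embedding model.
    spread = max(by_method.values()) - min(by_method.values())
    summary[embedding_name] = (best_method, spread)
    print(f"{embedding_name}: best={best_method}, spread={spread:.4f}")
```

RCTS scores highest under both embedding models. The spread is roughly 0.18 with OllamaEmbeddings but only about 0.01 with OpenAI Embeddings, so the OpenAI scores separate the splitters far less.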
