## Objectives

1. Read .docx files using python-docx
2. Split text into chunks using LangChain's RecursiveCharacterTextSplitter
3. Generate embeddings using HuggingFaceEmbeddings with sentence-transformers/all-MiniLM-L6-v2
4. Store and retrieve embeddings using FAISS

## Environment

Created a venv using Python 3.9.23.

To access the venv:

1. `cd "AI SOP Assistant"`
2. `source venv/bin/activate`

### Dependencies

- Python 3.9 (currently using 3.9.23)
- NumPy version must be >1.21, <2 (currently using 1.26.4)
- torch must be >=1.0.1 for sentence-transformers
- python-docx, langchain, langchain-community, sentence-transformers, faiss-cpu
- for more details, see the requirements.txt file
- installed via pip in the Unix command line: `pip install ...` (langchain-community was installed with `pip install -U langchain-community`)
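
A quick way to confirm that the venv matches the constraints above is to print the installed versions. This is a minimal sketch using only the standard library; the package list is copied from the bullets above and the helper name `installed_versions` is made up for illustration.

```python
from importlib import metadata

# Distribution names as listed in the dependencies above; adjust to match requirements.txt.
PACKAGES = [
    "numpy",
    "torch",
    "python-docx",
    "langchain",
    "langchain-community",
    "sentence-transformers",
    "faiss-cpu",
]

def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

Run it inside the activated venv so that it reports the packages the notebook actually uses.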

## Code
### Library Import

In [1]:
#Importing libraries
import os
from docx import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.schema import Document as LCDocument

### Loading and Reading Files

In [2]:
#Folder where the .docx SOP files are stored
folder_path='SOPs'
#List to store text and metadata from each document
docs=[]

#Creating a loop that goes through every file in the SOPs folder
for filename in os.listdir(folder_path):
    if filename.endswith('.docx'):
        full_path=os.path.join(folder_path,filename) #Getting full path of the file
        doc = Document(full_path) #Opens the Word document
        #Joining all non-empty paragraphs into one string
        full_text='\n'.join([para.text for para in doc.paragraphs if para.text.strip() != ""])
        #Storing the text along with its filename as metadata
        docs.append({"text":full_text,"source":filename})
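
One possible refinement of the loop above, offered as a sketch rather than what the notebook ran: `os.listdir` returns files in arbitrary order, and Word typically leaves `~$name.docx` lock files in a folder while a document is open, which are not valid .docx files. The hypothetical helper `list_docx_files` below returns a sorted list of paths and skips those lock files.

```python
import os

def list_docx_files(folder):
    """Return sorted paths of .docx files in `folder`, skipping Word lock files."""
    paths = []
    for name in sorted(os.listdir(folder)):
        # "~$..." files are temporary lock files created while a document is open.
        if name.startswith("~$"):
            continue
        # Compare in lowercase so that "FILE.DOCX" is also picked up.
        if name.lower().endswith(".docx"):
            paths.append(os.path.join(folder, name))
    return paths
```

The reading loop could then iterate over `list_docx_files(folder_path)` and use `os.path.basename(path)` as the `source` metadata.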

In [12]:
#Total number of documents read
print(f"Total documents: {len(docs)}")

#Preview of the first document
print(f"Filename: {docs[0]['source']}")
print(docs[0]['text'][:1000])  #First 1000 characters only

Total documents: 26
Filename: SOP Termination of Employee.docx
Termination of Employee SOP
Purpose
To define a secure, standardized process for handling IT tasks associated with employee termination. This ensures timely revocation of access, data protection, and operational continuity in compliance with company policy and data security standards. This procedure applies to all El Sol NEC employees, contractors, interns, and external users whose employment or engagement has ended, either voluntarily or involuntarily.
Pre-Termination (For Planned Separations Only)
Inventory all devices assigned to the user, review active accounts, and list any shared access or privileged roles.
Open an internal IT ticket with the due date/time to match the employee’s termination date/time.
Day of Termination
Immediately revoke access exactly at termination time.
Disable Microsoft 365 account
Terminate VPN access and revoke tokens
Disable access to internal systems, SaaS platforms, and admin panels.
Disabl

### Splitting Text into Chunks

In [13]:
#Setting up the text splitter to break text into chunks of ~500 characters with some overlap (~100 characters)
text_splitter=RecursiveCharacterTextSplitter(
    chunk_size=500, #Max size
    chunk_overlap=100, #Max overlap to preserve context
    separators=["\n\n","\n","."," "] #Preferred places to split text
)

#List of final chunked documents
documents=[]

#For loop to chunk text for each SOP
for d in docs:
    #Splitting text into chunks
    chunks=text_splitter.split_text(d['text'])
    #Creating LangChain document with metadata
    for chunk in chunks:
        documents.append(LCDocument(page_content=chunk,metadata={"source":d["source"]}))
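
To see how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified fixed-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (the real splitter breaks at the listed separators, so its chunks are usually shorter than the maximum); it only illustrates that consecutive chunks share `chunk_overlap` characters. The function name and sample text are made up for this illustration.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=100):
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# Synthetic 1200-character text, just to show the window arithmetic.
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
# The tail of one chunk is repeated at the head of the next one.
print(chunks[0][-100:] == chunks[1][:100])
```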

In [14]:
chunk_lengths=[len(doc.page_content) for doc in documents]
avg_len=sum(chunk_lengths)/len(chunk_lengths)
print(f"Average chunk length: {avg_len:.2f} characters")

Average chunk length: 402.31 characters
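
The average alone can hide very short chunks. A small sketch of a fuller summary, using only the standard library; the helper name `summarize_lengths`, the 100-character threshold, and the example values are illustrative and not taken from the SOP data.

```python
import statistics

def summarize_lengths(lengths, short_threshold=100):
    """Summarize chunk lengths, including how many fall below short_threshold."""
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
        "short": sum(1 for n in lengths if n < short_threshold),
    }

# Made-up lengths for illustration; in the notebook, pass chunk_lengths instead.
example_lengths = [80, 480, 495, 500, 310]
summary = summarize_lengths(example_lengths)
print(summary)
```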


### Generating Embeddings

An embedding model converts each chunk of text into a numerical vector, called an embedding, that captures its meaning. For `all-MiniLM-L6-v2`, each vector should have 384 dimensions. We don't need to set a seed when generating embeddings because embedding models are deterministic during inference: for a given input text and model version, the results are always the same (no randomness involved). A seed may be needed when fine-tuning the embedding model, doing train/test splits for ML tasks, randomly sampling chunks, or using LLMs that involve sampling.

In [15]:
#Loading a sentence embedding model from Hugging Face
embedding_model=HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
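
Both claims above (384-dimensional vectors, same output on repeated runs) can be checked directly. The sketch below works with any object that exposes `embed_query(text)`, which `HuggingFaceEmbeddings` does; the helper `check_embedding` and the stub class are hypothetical and exist only so the example runs without downloading a model.

```python
import math

def check_embedding(embedder, text, expected_dim=384):
    """Embed the same text twice; report the vector length and whether both runs agree."""
    first = embedder.embed_query(text)
    second = embedder.embed_query(text)
    repeatable = len(first) == len(second) and all(
        math.isclose(a, b, abs_tol=1e-6) for a, b in zip(first, second)
    )
    return {
        "dim": len(first),
        "dim_ok": len(first) == expected_dim,
        "repeatable": repeatable,
    }

class _StubEmbedder:
    """Stand-in for a real model, for illustration only."""
    def embed_query(self, text):
        return [float(len(text))] * 384

report = check_embedding(_StubEmbedder(), "How to set up a device?")
print(report)
```

In the notebook, the call would be `check_embedding(embedding_model, "How to set up a device?")`.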

### Storing Embeddings in FAISS

Saving the FAISS index locally lets us load it later instead of rebuilding it from the documents.

In [16]:
#Creating a FAISS vector index from list of documents and their embeddings
vectorstore=FAISS.from_documents(documents,embedding_model)
#Saving the FAISS index locally
vectorstore.save_local("faiss_sop_index")
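
To reuse the saved index in a later session, it can be loaded with `FAISS.load_local` instead of being rebuilt. This is a sketch under some assumptions: it imports `FAISS` from `langchain_community` (the notebook imports it from `langchain.vectorstores`), it assumes the default file names `index.faiss` and `index.pkl` written by `save_local`, and it assumes that only recent langchain-community releases accept the `allow_dangerous_deserialization` flag. The helper name `load_sop_index` is made up.

```python
import os

def load_sop_index(index_dir, embedder):
    """Load a FAISS index that was previously written with save_local()."""
    # save_local() writes these two files when the default index name is used.
    for required in ("index.faiss", "index.pkl"):
        if not os.path.isfile(os.path.join(index_dir, required)):
            raise FileNotFoundError(f"{required} not found in {index_dir!r}")

    # Imported here so the file check above also works without LangChain installed.
    from langchain_community.vectorstores import FAISS

    try:
        # index.pkl is a pickle file, so recent releases ask for an explicit opt-in.
        # Only load index folders that you created yourself.
        return FAISS.load_local(
            index_dir, embedder, allow_dangerous_deserialization=True
        )
    except TypeError:
        # Assumption: older releases do not know this keyword argument.
        return FAISS.load_local(index_dir, embedder)
```

In the notebook, this would be called as `vectorstore = load_sop_index("faiss_sop_index", embedding_model)`, using the same embedding model that built the index.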

Below, we run a test query against the SOP embeddings. This confirms that embeddings were generated, that FAISS was able to store them, and that semantic similarity search works.

In [17]:
#Recreating FAISS index temporarily in memory
vectorstore=FAISS.from_documents(documents,embedding_model)

#Running a test similarity search with max 3 sources
results=vectorstore.similarity_search("How to set up a device?",k=3)

#Printing results (match #, source SOP, and snippet of 300 characters)
for i, doc in enumerate(results):
    print(f"\nMatch {i+1}")
    print(f"Source: {doc.metadata['source']}")
    print(doc.page_content[:300])


Match 1
Source: SOP Asset Management and Inventory Tracking.docx
Store purchase and warranty documentation in a secure, access-controlled folder.
Follow the Device Setup and Configuration SOP to configure and assign devices. Ensure the asset record is updated with deployment date, assigned user name and department, technician name, and setup confirmation. Require

Match 2
Source: SOP Device Setup and Configuration.docx
Document successful setup in the ticketing system and close the setup ticket.
Provide user with device, power supply, and other peripherals, deliver a quick orientation, and require signature on Device Receipt Acknowledgement Form.
Update inventory records with assigned user name and email, device s

Match 3
Source: SOP Hardware Repair and Replacement.docx
Upon return, inspect and test the device thoroughly. Reconfigure and setup the device as needed (see SOP on Device Setup and Configuration).
Replacement Procedure
If the device is beyond repair or unserviceable, docum
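
The output above shows which chunks matched but not how close each match was. LangChain vector stores also provide `similarity_search_with_score`, which returns `(document, score)` pairs; with the default FAISS settings the score is a distance, so lower means closer. The formatting helper below is a hypothetical sketch that accepts any store exposing that method.

```python
def format_matches(store, query, k=3, snippet_chars=300):
    """Return one formatted line per match: rank, score, source file, and a snippet."""
    lines = []
    results = store.similarity_search_with_score(query, k=k)
    for rank, (doc, score) in enumerate(results, start=1):
        # Collapse newlines so that each match stays on one line.
        snippet = doc.page_content[:snippet_chars].replace("\n", " ")
        lines.append(f"{rank}. [{float(score):.3f}] {doc.metadata['source']}: {snippet}")
    return lines
```

In the notebook, `print("\n".join(format_matches(vectorstore, "How to set up a device?")))` would list the same three matches along with their scores.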

### Balancing Chunk Size and Semantic Precision

Pros of smaller chunks:

- Smaller chunks are less likely to contain multiple topics, which makes their embeddings more semantically precise
- The search query is more likely to find a chunk that directly addresses it
- Core meaning isn't averaged out across unrelated content

Cons of smaller chunks:

- Small chunks <100 characters can miss necessary surrounding info
- More chunks = more embeddings = bigger vector index and slower searches (unless optimized)
- Some procedures or instructions can get split mid-thought, reducing interpretability on retrieval

Smaller chunks generally lead to more accurate semantic similarity, but going too small might hurt performance by cutting off useful context. The sweet spot for SOPs is often a chunk size of 500 characters with an overlap of 100 characters.
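
The "more chunks = bigger index" trade-off can be put into rough numbers. The sketch below assumes fixed-size windows (the real splitter produces somewhat more, shorter chunks), 384-dimensional vectors stored as 4-byte floats in a flat index, and a made-up corpus size of 100,000 characters. It ignores the stored text and metadata, and both function names are hypothetical.

```python
import math

def estimate_chunks(total_chars, chunk_size, chunk_overlap):
    """Approximate number of chunks for fixed-size windows with overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if total_chars <= chunk_size:
        return 1 if total_chars > 0 else 0
    step = chunk_size - chunk_overlap
    return 1 + math.ceil((total_chars - chunk_size) / step)

def flat_index_bytes(n_chunks, dim=384, bytes_per_value=4):
    """Approximate size of the raw vectors: one float32 per dimension per chunk."""
    return n_chunks * dim * bytes_per_value

total_chars = 100_000  # hypothetical corpus size, not measured from the SOPs
for size, overlap in [(250, 50), (500, 100), (1000, 200)]:
    n = estimate_chunks(total_chars, size, overlap)
    kib = flat_index_bytes(n) / 1024
    print(f"chunk_size={size}, overlap={overlap}: ~{n} chunks, ~{kib:.0f} KiB of vectors")
```

Under these assumptions, halving the chunk size roughly doubles both the number of embeddings and the size of the index.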

https://www.reddit.com/r/LangChain/comments/1bgqc2o/optimal_way_to_chunk_word_document_for/