# Veritas 
**Company Policy Analyzer**

Veritas is a **Policy Analyzer Chatbot** that uncovers the true meaning and intent behind policy language. It is built on a vector store retriever: a vector store fetches policy documents from X company, and an LLM generates context-aware answers to the user's query. The retriever uses the search methods implemented by the vector store, such as similarity search and Maximum Marginal Relevance (MMR), to query the texts stored in it.

## Setup


For this project, I used the following libraries:

*   [`ibm-watsonx-ai`](https://ibm.github.io/watsonx-ai-python-sdk/index.html) for using LLMs from IBM's watsonx.ai.
*   [`langchain`, `langchain-ibm`, `langchain-community`](https://www.langchain.com/) for using relevant features from LangChain.
*   [`pypdf`](https://pypi.org/project/pypdf/) is an open-source pure Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files.
*   [`chromadb`](https://www.trychroma.com/) is an open-source vector database used to store embeddings.
*   [`lark`](https://pypi.org/project/lark/) is a general-purpose parsing library for Python. It is necessary for a Self-Querying Retriever.


In [3]:
!pip install "ibm-watsonx-ai==1.1.2" | tail -n 1
!pip install "langchain==0.2.1" | tail -n 1
!pip install "langchain-ibm==0.1.11" | tail -n 1
!pip install "langchain-community==0.2.1" | tail -n 1
!pip install "chromadb==0.4.24" | tail -n 1
!pip install "pypdf==4.3.1" | tail -n 1
!pip install "lark==1.1.9" | tail -n 1
!pip install 'posthog<6.0.0' | tail -n 1

Successfully installed ibm-cos-sdk-2.13.6 ibm-cos-sdk-core-2.13.6 ibm-cos-sdk-s3transfer-2.13.6 ibm-watsonx-ai-1.1.2 jmespath-1.0.1 lomond-0.3.3 numpy-1.26.4 pandas-2.1.4 requests-2.32.2 tabulate-0.9.0 tzdata-2025.2
Successfully installed langchain-0.2.1 langchain-core-0.2.43 langchain-text-splitters-0.2.4 langsmith-0.1.147 orjson-3.11.3 requests-toolbelt-1.0.0 tenacity-8.5.0
Successfully installed langchain-ibm-0.1.11
Successfully installed dataclasses-json-0.6.7 langchain-community-0.2.1 marshmallow-3.26.1 mypy-extensions-1.1.0 typing-inspect-0.9.0
Successfully installed asgiref-3.9.2 backoff-2.2.1 bcrypt-5.0.0 build-1.3.0 cachetools-5.5.2 chroma-hnswlib-0.7.3 chromadb-0.4.24 click-8.3.0 coloredlogs-15.0.1 durationpy-0.10 fastapi-0.118.0 filelock-3.19.1 flatbuffers-25.9.23 fsspec-2025.9.0 google-auth-2.40.3 googleapis-common-protos-1.70.0 grpcio-1.75.1 hf-xet-1.1.10 httptools-0.6.4 huggingface-hub-0.35.3 humanfriendly-10.0 kubernetes-33.1.0 markdown-it-py-4.0.0 mdurl-0.1.2 mmh3-5.2.0

## Defining helper functions

The following code defines helper functions that reduce repeated work in the notebook:


In [5]:
# You can use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

## Creating a retriever model


Creating a retriever model with LangChain involves the following steps:

- Building LLMs
- Splitting documents into chunks
- Building an embedding model
- Retrieving related knowledge from text


### Build the LLM
Develop or select a pre-trained language model that can understand and generate human-like text. This model serves as the foundation for processing and interpreting language data.


In [4]:
from ibm_watsonx_ai.foundation_models import ModelInference
from ibm_watsonx_ai.metanames import GenTextParamsMetaNames as GenParams
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models.extensions.langchain import WatsonxLLM

In [7]:
def llm():
    model_id = 'mistralai/mistral-small-3-1-24b-instruct-2503'
    
    parameters = {
        GenParams.MAX_NEW_TOKENS: 256,  # maximum number of tokens in the generated output
        GenParams.TEMPERATURE: 0.5,  # controls the randomness (creativity) of the model's responses
    }
    
    credentials = {
        "url": "https://us-south.ml.cloud.ibm.com"
    }
    
    
    project_id = "skills-network"
    
    model = ModelInference(
        model_id=model_id,
        params=parameters,
        credentials=credentials,
        project_id=project_id
    )
    
    # Wrap the watsonx.ai model so it can be used as a LangChain LLM
    mistral_llm = WatsonxLLM(model=model)
    return mistral_llm

### Create the embedding model


Create or utilize an embedding model to convert chunks of text into numerical vectors. These vectors represent the semantic meaning of the text, enabling the model to compare and retrieve relevant information based on similarity.
The following code demonstrates how to build an embedding model using the `watsonx.ai` package.

For this project, the `ibm/slate-125m-english-rtrvr` embedding model is used.


In [10]:
from ibm_watsonx_ai.metanames import EmbedTextParamsMetaNames
from langchain_ibm import WatsonxEmbeddings

In [11]:
def watsonx_embedding():
    embed_params = {
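        # Truncates each input to at most 3 tokens before embedding;
        # increase this value to embed more of each chunk
        EmbedTextParamsMetaNames.TRUNCATE_INPUT_TOKENS: 3,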
        EmbedTextParamsMetaNames.RETURN_OPTIONS: {"input_text": True},
    }
    
    watsonx_embedding = WatsonxEmbeddings(
        model_id="ibm/slate-125m-english-rtrvr",
        url="https://us-south.ml.cloud.ibm.com",
        project_id="skills-network",
        params=embed_params,
    )
    return watsonx_embedding
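
To make "compare and retrieve relevant information based on similarity" concrete, here is a minimal sketch of cosine similarity between embedding vectors. The three-dimensional vectors are invented toy values, not outputs of `ibm/slate-125m-english-rtrvr`, and the distance metric a vector store actually uses depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values for illustration only)
query_vec = [1.0, 0.0, 1.0]
related_chunk = [0.9, 0.1, 0.8]
unrelated_chunk = [0.0, 1.0, 0.1]

related_score = cosine_similarity(query_vec, related_chunk)
unrelated_score = cosine_similarity(query_vec, unrelated_chunk)
print(related_score > unrelated_score)
```

A chunk whose vector points in nearly the same direction as the query vector scores close to 1, while an unrelated chunk scores close to 0.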

#### Load Policy Dataset


Before working on the policy analyzer, we need to load a company policy document as text. A `companypolicies.txt` dataset is provided in the data directory.


I am using `TextLoader` from `langchain_community` to load the document.


In [None]:
from langchain_community.document_loaders import TextLoader

In [None]:
loader = TextLoader("../data/companypolicies.txt")
txt_data = loader.load()

Split `txt_data` into chunks with `chunk_size = 200` and `chunk_overlap = 20`.


In [16]:
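# text_splitter(data, chunk_size, chunk_overlap) is a helper that must be defined before this cell runs
chunks_txt = text_splitter(txt_data, 200, 20)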
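
To show what `chunk_size` and `chunk_overlap` mean, here is a minimal sketch of fixed-size character chunking in plain Python. The function `chunk_text` is hypothetical and is not the notebook's `text_splitter` helper; LangChain's splitters also try to break on separators such as paragraphs and sentences, which this sketch ignores.

```python
def chunk_text(text, chunk_size=200, chunk_overlap=20):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between chunk start positions
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# A 500-character sample string (digits repeated) stands in for the policy text
sample = "".join(str(i % 10) for i in range(500))
chunks = chunk_text(sample, 200, 20)

print([len(c) for c in chunks])            # [200, 200, 140]
print(chunks[0][-20:] == chunks[1][:20])   # True
```

The last 20 characters of each chunk reappear at the start of the next one, so a sentence cut at a chunk boundary is still partly visible in both chunks.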

Embed the chunks and store the embeddings in a Chroma vector database.


In [None]:
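# Chroma lives in langchain_community; importing it from langchain.vectorstores is deprecated
from langchain_community.vectorstores import Chroma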

In [18]:
vectordb = Chroma.from_documents(chunks_txt, watsonx_embedding())

##### Simple similarity search


Here is an example of a simple similarity search based on the vector database.

For this demonstration, the query has been set to "email policy".


In [None]:
query = "email policy"
retriever = vectordb.as_retriever()

In [20]:
docs = retriever.invoke(query)

By default, the retriever returns four results, ranked by similarity. I am specifying `search_kwargs` such as `k` to limit the number of retrieval results.



In [23]:
retriever = vectordb.as_retriever(search_kwargs={"k": 1})
docs = retriever.invoke(query)
docs

[Document(metadata={'source': 'companypolicies.txt'}, page_content='This policy serves as a framework for handling discipline and termination. The organization recognizes the importance of fairness and consistency in these processes, and decisions will be made after')]

##### MMR search


MMR in vector stores is a technique used to balance the relevance and diversity of retrieved results. It selects documents that are both highly relevant to the query and minimally similar to previously selected documents. This approach helps to avoid redundancy and ensures a more comprehensive coverage of different aspects of the query.


The following code shows how to conduct an MMR search in a vector database. We just need to specify `search_type="mmr"`.


In [24]:
retriever = vectordb.as_retriever(search_type="mmr")
docs = retriever.invoke(query)
docs

[Document(metadata={'source': 'companypolicies.txt'}, page_content='This policy serves as a framework for handling discipline and termination. The organization recognizes the importance of fairness and consistency in these processes, and decisions will be made after'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='Employee Referrals: We encourage and appreciate employee referrals as they contribute to building a strong and engaged team.'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='any potential violations of this code and support the investigation of such matters.'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='Smoking Restrictions: Smoking inside company buildings, offices, meeting rooms, and other enclosed spaces is strictly prohibited. This includes electronic cigarettes and vaping devices.')]
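
The selection rule behind MMR can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the vector store's implementation: the two-dimensional vectors are toy values, the helper names `cosine` and `mmr` are hypothetical, and `lambda_mult` borrows the name LangChain uses in `search_kwargs` for the relevance/diversity trade-off.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k documents picked by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))

    def mmr_score(i):
        relevance = cosine(query_vec, doc_vecs[i])
        # Similarity to the closest already-selected document (0 if none yet)
        redundancy = max(
            (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.0]
doc_vecs = [
    [1.0, 0.1],    # 0: most relevant to the query
    [1.0, 0.12],   # 1: near-duplicate of document 0
    [0.6, -0.8],   # 2: less relevant, but covers a different direction
]

print(mmr(query_vec, doc_vecs, k=2))                   # [0, 2]
print(mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # [0, 1]
```

With the balanced setting, the near-duplicate is skipped in favor of the more distinct document; with `lambda_mult=1.0` the diversity penalty disappears and the ranking falls back to pure similarity.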

##### Similarity score threshold retrieval


We can also set a retrieval method that defines a similarity score threshold, returning only documents with a score above that threshold.


In [25]:
retriever = vectordb.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.4}
)
docs = retriever.invoke(query)
docs

[Document(metadata={'source': 'companypolicies.txt'}, page_content='This policy serves as a framework for handling discipline and termination. The organization recognizes the importance of fairness and consistency in these processes, and decisions will be made after'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='This policy aims to maintain a safe, healthy, and productive workplace.')]
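
The filtering step can be sketched as follows, assuming each chunk already has a relevance score between 0 and 1. The function `filter_by_threshold`, the chunk labels, and the scores are hypothetical values chosen for illustration; they are not the scores produced by the retriever above. The cap `k` reflects the retriever's default of four results.

```python
def filter_by_threshold(scored_chunks, score_threshold=0.4, k=4):
    """Keep chunks scoring at least score_threshold, best first, at most k."""
    kept = [pair for pair in scored_chunks if pair[1] >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# Hypothetical (chunk label, relevance score) pairs
scored_chunks = [
    ("chunk A", 0.46),
    ("chunk B", 0.31),
    ("chunk C", 0.41),
    ("chunk D", 0.22),
]

result = filter_by_threshold(scored_chunks, score_threshold=0.4)
print(result)  # [('chunk A', 0.46), ('chunk C', 0.41)]
```

Unlike a fixed `k`, a threshold can return fewer results than requested, or none at all, when no chunk is similar enough to the query.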