# Contract Q&A RAG Evaluation Using spaCy and NLI Models

This notebook evaluates a contract Q&A system built on retrieval-augmented generation (RAG). The system reads contract documents, processes them, and answers queries about their content. The steps are:
1. Load contract documents
2. Chunk the documents
3. Create a document store
4. Perform evaluations using BLEU and Hallucination scores

In [6]:
# Ensure the spaCy model is installed
import spacy

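def ensure_spacy_model(model_name="en_core_web_sm"):
    try:
        spacy.load(model_name)
    except OSError:
        # Model not installed: download it with the interpreter running this notebook
        import subprocess
        import sys
        subprocess.run([sys.executable, "-m", "spacy", "download", model_name], check=True)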

ensure_spacy_model()

## Load Required Libraries

Import the required libraries and initialize the spaCy model and the NLI model.


In [7]:
import os
from docx import Document as DocxDocument
import spacy
from transformers import pipeline
import sacrebleu
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from dotenv import load_dotenv

# Load spaCy model for entity recognition
nlp = spacy.load("en_core_web_sm")

# Load NLI model
nli_model = pipeline("text-classification", model="roberta-large-mnli")

Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Defining Helper Functions

The helper functions below do the following:
1. Read .docx files
2. Load evaluation data
3. Calculate BLEU scores
4. Extract entities
5. Calculate the hallucination score

In [8]:
# Function to read text from .docx files
def read_docx(file_path):
    doc = DocxDocument(file_path)
    return "\n".join([para.text for para in doc.paragraphs])

# Function to load evaluation data
def load_evaluation_data(file_path):
    data = read_docx(file_path)
    qa_pairs = []
    lines = data.split('\n')
    current_question = None
    for line in lines:
        line = line.strip()
        if line.startswith('Q') and ':' in line:
            # Start a new question; a previous question with no answer is dropped
            # so that every returned pair has both "question" and "answer" keys
            current_question = {"question": line.split(':', 1)[1].strip()}
        elif line.startswith('A') and ':' in line and current_question:
            current_question["answer"] = line.split(':', 1)[1].strip()
            qa_pairs.append(current_question)
            current_question = None
    return qa_pairs

# Function to calculate BLEU score using sacrebleu
def bleu(pred, ref):
    return sacrebleu.sentence_bleu(pred, [ref]).score

# Hallucination scoring functions
def extract_entities(text):
    doc = nlp(text)
    return [ent.text for ent in doc.ents]

def calculate_hallucination_score(generated_text, reference_text):
    # Extract entities
    gen_entities = extract_entities(generated_text)
    ref_entities = extract_entities(reference_text)
    
    # Calculate entity overlap score
    if gen_entities:
        common_entities = set(gen_entities) & set(ref_entities)
        entity_score = 1 - (len(common_entities) / len(set(gen_entities)))
    else:
        entity_score = 1  # Maximum hallucination if no entities are found in the generated text
    
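    # Calculate NLI entailment score (reference as premise, generated text as hypothesis).
    # The pair is passed as text/text_pair so the tokenizer encodes it as two segments.
    nli_scores = nli_model({"text": reference_text, "text_pair": generated_text}, top_k=None, truncation=True)
    top_prediction = max(nli_scores, key=lambda s: s['score'])
    entailment_score = top_prediction['score'] if top_prediction['label'] == 'ENTAILMENT' else 0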
    
    # Combine scores
    combined_score = (entity_score + (1 - entailment_score)) / 2
    return combined_score
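
To make the combined score concrete, the sketch below reproduces the arithmetic of `calculate_hallucination_score` on hand-written inputs, with no spaCy or NLI model involved. The function name `combine_hallucination_score` and the sample entities are hypothetical and serve only to illustrate the formula.

```python
def combine_hallucination_score(gen_entities, ref_entities, nli_label, nli_score):
    """Mirror the formula above using precomputed entities and an NLI prediction."""
    gen_set, ref_set = set(gen_entities), set(ref_entities)
    if gen_set:
        # Fraction of generated entities that do not appear in the reference
        entity_score = 1 - len(gen_set & ref_set) / len(gen_set)
    else:
        # No entities in the generated text counts as maximum hallucination
        entity_score = 1
    # Only an ENTAILMENT prediction contributes a non-zero entailment score
    entailment_score = nli_score if nli_label == "ENTAILMENT" else 0
    return (entity_score + (1 - entailment_score)) / 2


# Hypothetical inputs: one of two generated entities is supported by the reference
example_score = combine_hallucination_score(
    gen_entities=["Jack Robinson", "Acme Corp"],
    ref_entities=["Jack Robinson", "Cloud Investments Ltd."],
    nli_label="NEUTRAL",
    nli_score=0.9,
)
print(example_score)  # 0.75
```

The score ranges from 0 (all generated entities appear in the reference and the NLI model is fully confident of entailment) to 1 (no entity overlap and no entailment).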

## Define Evaluation Function Using BLEU and Hallucination Scores

In [9]:
# Function to evaluate RAG system
def evaluate_rag_system(query_function, queries, references):
    results = []
    total_bleu_score = 0
    total_hallucination_score = 0
    num_samples = len(queries)
    
    for query, reference in zip(queries, references):
        generated_answer = query_function(query)
        bleu_score_value = bleu(generated_answer, reference)
        hallucination_score_value = calculate_hallucination_score(generated_answer, reference)
        
        total_bleu_score += bleu_score_value
        total_hallucination_score += hallucination_score_value
        
        results.append({
            "query": query,
            "reference": reference,
            "generated_answer": generated_answer,
            "bleu_score": bleu_score_value,
            "hallucination_score": hallucination_score_value
        })
    
    avg_bleu_score = total_bleu_score / num_samples
    avg_hallucination_score = total_hallucination_score / num_samples
    
    return results, {
        "average_bleu_score": avg_bleu_score,
        "average_hallucination_score": avg_hallucination_score
    }
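
For intuition about why a long retrieved passage can score poorly against a short reference, the following sketch computes clipped unigram precision, one of the ingredients of BLEU. It is a deliberate simplification and not sacrebleu's implementation: real BLEU also uses higher-order n-grams, a brevity penalty, and its own tokenization, and sacrebleu reports scores on a 0 to 100 scale. The function name `unigram_precision` is hypothetical.

```python
from collections import Counter


def unigram_precision(prediction, reference):
    """Clipped unigram precision: share of predicted tokens also found in the reference."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    # Each predicted token is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref_counts[token]) for token, count in pred_counts.items())
    return matched / total


short_answer = unigram_precision("the advisor is jack robinson", "the advisor is jack robinson")
padded_answer = unigram_precision(
    "the advisor is jack robinson and this agreement may not be assigned",
    "the advisor is jack robinson",
)
print(short_answer, padded_answer)
```

The padded prediction contains every reference token but still loses precision, because the extra words count against it. A raw retrieved chunk compared with a one-line reference answer behaves the same way.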

## Define Functions for Document Chunking and Creating Document Store

In [10]:
# Function to chunk contracts
def chunk_contracts(contracts):
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
    documents = [Document(page_content=chunk) for contract in contracts for chunk in splitter.split_text(contract)]
    return documents

# Function to create a document store
def create_docstore(documents, openai_api_key):
    embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
    docstore = Chroma.from_documents(documents, embeddings)
    return docstore
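
To illustrate what `chunk_size` and `chunk_overlap` control, here is a simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks); the function `sliding_window_chunks` is a hypothetical stand-in that shows only the size and overlap arithmetic.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=20):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "x" * 1200
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])  # [500, 500, 240]
```

Each window starts 480 characters after the previous one, so the last 20 characters of a chunk are repeated at the start of the next. The overlap reduces the chance that a sentence cut at a boundary is lost to retrieval.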

## Load Environment Variables

Load the OpenAI API key from the environment variables.

In [11]:
# Load environment variables from .env file
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")
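
`os.getenv` returns `None` when the variable is missing, so a missing key would surface only later, when the key is first used. A small guard along the lines of the sketch below fails early instead. The helper name `require_env` is hypothetical.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell environment")
    return value
```

With this helper, the assignment above could be written as `openai_api_key = require_env("OPENAI_API_KEY")`.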

## Load and Process Contracts

Load and process the contract documents from the specified directory.

In [12]:
# Directory containing the contract .docx files
# (read_docx is already defined in the helper functions cell above)
CONTRACTS_DIR = "../data/contracts"

def load_contracts():
    contracts = []
    for filename in os.listdir(CONTRACTS_DIR):
        if filename.endswith(".docx"):
            file_path = os.path.join(CONTRACTS_DIR, filename)
            contracts.append(read_docx(file_path))
    return contracts

def read_docx_from_file(file):
    doc = DocxDocument(file)
    return "\n".join([para.text for para in doc.paragraphs])

def load_contracts_from_uploaded_files(uploaded_files):
    contracts = [read_docx_from_file(file) for file in uploaded_files]
    return contracts

# Load and process contracts
contracts = load_contracts()
documents = chunk_contracts(contracts)
docstore = create_docstore(documents, openai_api_key)

## Load Evaluation Data

Load the evaluation data from a .docx file.

In [13]:
# Load evaluation data
EVAL_FILE = "../data/qna/Robinson Q&A.docx"
evaluation_data = load_evaluation_data(EVAL_FILE)
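
Judging from `load_evaluation_data`, the evaluation file is expected to contain question lines starting with `Q` and answer lines starting with `A`, each with a colon before the text. The sketch below applies a similar rule to an in-memory list of lines. The function `parse_qa_lines` and the sample lines are hypothetical illustrations, not the contents of the actual file.

```python
def parse_qa_lines(lines):
    """Pair each 'Q...: text' line with the 'A...: text' line that follows it."""
    qa_pairs = []
    question = None
    for raw in lines:
        line = raw.strip()
        if ":" not in line:
            continue
        label, text = line.split(":", 1)
        if label.startswith("Q"):
            # A new question replaces any earlier question that never got an answer
            question = text.strip()
        elif label.startswith("A") and question is not None:
            qa_pairs.append({"question": question, "answer": text.strip()})
            question = None
    return qa_pairs


# Hypothetical sample lines in the assumed "Q<n>: ..." / "A<n>: ..." layout
sample_lines = [
    "Q1: Who are the parties to the Agreement?",
    "A1: Cloud Investments Ltd. and Jack Robinson",
    "",
    "Q2: What is the termination notice?",
    "A2: 14 days for convenience by both parties.",
]
pairs = parse_qa_lines(sample_lines)
print(pairs[0]["question"])
```

One caveat of this prefix-based rule: any continuation line that happens to start with `Q` or `A` and contains a colon is treated as a new question or answer, so multi-line answers may be cut short or misread.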

## Define Query Function

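Define the query function. It searches the document store for the chunks most similar to the query and returns the text of the top-ranked chunk, which serves as the generated answer in this evaluation.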

In [14]:
def query_function(query):
    results = docstore.similarity_search(query, k=5)
    return results[0].page_content if results else ""
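
Conceptually, a similarity search embeds the query and ranks the stored chunks by vector similarity. The sketch below shows that ranking step with hand-written toy vectors and cosine similarity. It is an illustration under stated assumptions, not Chroma's implementation (which may use a different distance metric and an approximate index), and the names `cosine_similarity` and `top_k_chunks` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vector, indexed_chunks, k=5):
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy 2-dimensional "embeddings"; real embeddings have many more dimensions
toy_index = [
    ("termination clause", [0.9, 0.1]),
    ("payment terms", [0.1, 0.9]),
    ("notice period", [0.8, 0.3]),
]
best = top_k_chunks([1.0, 0.0], toy_index, k=2)
print(best)  # ['termination clause', 'notice period']
```

In the notebook's `query_function`, only the first of the `k=5` results is returned, so the remaining four retrieved chunks do not affect the scores.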

## Prepare Queries and References

Prepare the queries and references from the evaluation data.

In [15]:
# Prepare queries and references from evaluation data
queries = [qa['question'] for qa in evaluation_data]
references = [qa['answer'] for qa in evaluation_data]

## Run Evaluation

Run the evaluation and display the results for each query along with the summary.

In [16]:
# Run evaluation
results, evaluation_summary = evaluate_rag_system(query_function, queries, references)

# Display detailed evaluation results
print("Evaluation Results")
for result in results:
    print(f"Query: {result['query']}")
    print(f"Reference: {result['reference']}")
    print(f"Generated Answer: {result['generated_answer']}")
    print(f"BLEU Score: {result['bleu_score']:.2f}")
    print(f"Hallucination Score: {result['hallucination_score']:.2f}")
    print()

# Display summary evaluation results
print("Summary Evaluation Results")
print(f"Average BLEU Score: {evaluation_summary['average_bleu_score']:.2f}")
print(f"Average Hallucination Score: {evaluation_summary['average_hallucination_score']:.2f}")

Evaluation Results
Query: Who are the parties to the Agreement and what are their defined names?
Reference: Cloud Investments Ltd. (“Company”) and Jack Robinson (“Advisor”)
Generated Answer: Entire Agreement; No Waiver or Assignment: This Agreement together with the Exhibits, which are attached hereto and incorporated herein, set forth the entire Agreement between the parties and shall supersede all previous communications and agreements between the parties, either oral or written. This Agreement may be modified only by a written amendment executed by both parties. This Agreement may not be assigned, sold, delegated or transferred in any manner by Advisor for any reason whatsoever.
BLEU Score: 0.50
Hallucination Score: 0.88

Query: What is the termination notice?
Reference: According to section 4:14 days for convenience by both parties. The Company may terminate without notice if the Advisor refuses or cannot perform the Services or is in breach of any provision of this Agreement.
Gene