## Simple Question-and-Answer RAG Using LangChain

### Loading environment variables

In [1]:
# Load environment variables from .env file
import os
from dotenv import load_dotenv

load_dotenv()

# Get the OpenAI API key from the environment
openai_api_key = os.getenv("OPENAI_API_KEY")
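
`os.getenv` returns `None` when the variable is missing, and that would only surface later as an authentication error. A minimal sketch of a fail-fast lookup follows; the helper name `require_env` is hypothetical and not part of the notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Covers both "not set" and "set to an empty string"
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return value
```

With this helper, the lookup above could be written as `openai_api_key = require_env("OPENAI_API_KEY")`.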


### Reading the contract files in the data directory

In [2]:
# Function to read text from .docx files
from docx import Document as DocxDocument

# Directory containing contract files
CONTRACTS_DIR = "../data/contracts"

def read_docx(file_path):
    doc = DocxDocument(file_path)
    return "\n".join([para.text for para in doc.paragraphs])

# Load contract data
contracts = []
for filename in os.listdir(CONTRACTS_DIR):
    if filename.endswith(".docx"):
        file_path = os.path.join(CONTRACTS_DIR, filename)
        contracts.append(read_docx(file_path))

# Display loaded contracts
for i, contract in enumerate(contracts):
    print(f"Contract {i + 1}:\n{contract[:500]}\n")


Contract 1:
ADVISORY SERVICES AGREEMENT

This Advisory Services Agreement is entered into as of June 15th, 2023 (the “Effective Date”), by and between Cloud Investments Ltd., ID 51-426526-3, an Israeli company (the "Company"), and Mr. Jack Robinson, Passport Number 780055578, residing at 1 Rabin st, Tel Aviv, Israel, Email: jackrobinson@gmail.com ("Advisor").

Whereas,	Advisor has expertise and/or knowledge and/or relationships, which are relevant to the Company’s business and the Company has asked Advisor 
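
`os.listdir` returns entries in arbitrary order, so which file ends up as "Contract 1" can differ between machines. A possible sketch of a deterministic listing, using only the standard library, is shown here. The helper name `list_docx_files` is hypothetical, and skipping `~$` files (the lock files Word creates while a document is open) is an added assumption.

```python
from pathlib import Path


def list_docx_files(directory):
    """Return .docx paths in `directory`, sorted by lowercase file name."""
    return sorted(
        (
            p
            for p in Path(directory).iterdir()
            # Match the extension case-insensitively and skip Word lock files
            if p.is_file()
            and p.suffix.lower() == ".docx"
            and not p.name.startswith("~$")
        ),
        key=lambda p: p.name.lower(),
    )
```

Each returned path can be passed straight to `read_docx`.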



### Chunking contracts using different strategies

#### Recursive Chunking

In [13]:
# Split contracts into smaller chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Chunks of at most 500 characters with 20 characters of overlap;
# the splitter prefers paragraph, line, and word boundaries, in that order
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)

# Split every contract and wrap each chunk in a Document
documents = [
    Document(page_content=chunk)
    for contract in contracts
    for chunk in splitter.split_text(contract)
]

# Display the first five chunks (at most 500 characters each)
for i, doc in enumerate(documents[:5]):
    print(f"Chunk {i + 1}:\n{doc.page_content[:500]}\n")

Chunk 1:
ADVISORY SERVICES AGREEMENT

This Advisory Services Agreement is entered into as of June 15th, 2023 (the “Effective Date”), by and between Cloud Investments Ltd., ID 51-426526-3, an Israeli company (the "Company"), and Mr. Jack Robinson, Passport Number 780055578, residing at 1 Rabin st, Tel Aviv, Israel, Email: jackrobinson@gmail.com ("Advisor").

Chunk 2:
Whereas,	Advisor has expertise and/or knowledge and/or relationships, which are relevant to the Company’s business and the Company has asked Advisor to provide it with certain Advisory services, as described in this Agreement; and
Whereas, 	Advisor has agreed to provide the Company with such services, subject to the terms set forth in this Agreement.

NOW THEREFORE THE PARTIES AGREE AS FOLLOWS:

Chunk 3:
Services:  
Advisor shall provide to the Company, as an independent contractor, software development services, and / or any other services as agreed by the parties from time to time (the “Services”). Advisor shall not appoi
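
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is only an illustration: `RecursiveCharacterTextSplitter` splits on separators first, so its chunks will not match these. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=20):
    """Cut `text` into windows of `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so neighbouring chunks share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

With these settings, the last 20 characters of one chunk are repeated at the start of the next, which helps keep a sentence that straddles a boundary retrievable from either side.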

### Creating embeddings and populating the Chroma vector store

In [14]:
# Create embeddings
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Embedding model, authenticated with the API key loaded earlier
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

# Embed every chunk and index the resulting vectors in a Chroma vector store
docstore = Chroma.from_documents(documents, embeddings)
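
Conceptually, the vector store answers a query by comparing the query's embedding with each stored chunk embedding and returning the closest ones. The following sketch shows that idea with cosine similarity on made-up 3-dimensional vectors. The names, the vectors, and the choice of metric are all assumptions for illustration; Chroma's actual index and distance function may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the names of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Toy "index": hypothetical chunk names mapped to made-up embeddings
toy_index = {
    "ip_clause": [0.9, 0.1, 0.0],
    "fees_clause": [0.0, 1.0, 0.1],
    "termination_clause": [0.1, 0.0, 1.0],
}

print(top_k([1.0, 0.2, 0.0], toy_index, k=2))
```

Real embeddings have hundreds or thousands of dimensions, but the ranking step works the same way.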

### Load and Process Evaluation Data

In [17]:
# Load evaluation data from a .docx file
EVAL_FILE = "../data/qna/Robinson Q&A.docx"

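def load_evaluation_data(file_path):
    """Parse a Q&A .docx file into a list of {"question": ..., "answer": ...} dicts.

    A line that starts with "Q" or "A" and contains a colon is treated as a
    question or an answer; every other line is ignored.
    """
    data = read_docx(file_path)
    qa_pairs = []
    lines = data.split('\n')
    current_question = None
    for line in lines:
        line = line.strip()
        if line.startswith('Q') and ':' in line:
            # A new question while one is still pending: keep the pending one without an answer
            if current_question:
                qa_pairs.append(current_question)
            current_question = {"question": line.split(':', 1)[1].strip()}
        elif line.startswith('A') and ':' in line and current_question:
            # Attach the answer to the pending question and close the pair
            current_question["answer"] = line.split(':', 1)[1].strip()
            qa_pairs.append(current_question)
            current_question = None
    return qa_pairs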

# Load evaluation data
evaluation_data = load_evaluation_data(EVAL_FILE)

# Display the extracted QA pairs
for i, qa in enumerate(evaluation_data):
    # .get() guards against questions that were stored without an answer
    print(f"QA Pair {i + 1}:\nQuestion: {qa['question']}\nAnswer: {qa.get('answer', 'N/A')}\n")

QA Pair 1:
Question: Who are the parties to the Agreement and what are their defined names?
Answer: Cloud Investments Ltd. (“Company”) and Jack Robinson (“Advisor”)

QA Pair 2:
Question: What is the termination notice?
Answer: According to section 4:14 days for convenience by both parties. The Company may terminate without notice if the Advisor refuses or cannot perform the Services or is in breach of any provision of this Agreement.

QA Pair 3:
Question: What are the payments to the Advisor under the Agreement?
Answer: According to section 6: 1. Fees of $9 per hour up to a monthly limit of $1,500, 2. Workspace expense of $100 per month, 3. Other reasonable and actual expenses if approved by the company in writing and in advance.

QA Pair 4:
Question: Can the Agreement or any of its obligations be assigned?
Answer: 1. Under section 1.1 the Advisor can’t assign any of his obligations without the prior written consent of the Company, 2. Under section 9  the Advisor may not assign the Agr
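
The loader treats any line that starts with `Q` or `A` and contains a colon as a question or an answer, so a heading such as "Questions: ..." would be picked up as well. A stricter variant could look like the sketch below. It assumes labels of the form `Q1:` / `A1:`, which is a guess about the file layout rather than something the notebook confirms; the name `parse_qa_lines` and the sample lines are hypothetical.

```python
import re

# "Q" or "A", an optional number, then a colon and the text
QA_LINE = re.compile(r"^(?P<kind>[QA])\d*\s*:\s*(?P<text>.+)$")


def parse_qa_lines(lines):
    """Pair each question line with the answer line that follows it."""
    pairs = []
    pending_question = None
    for raw in lines:
        match = QA_LINE.match(raw.strip())
        if not match:
            continue  # not a Q/A line
        if match["kind"] == "Q":
            pending_question = match["text"]
        elif pending_question is not None:
            pairs.append({"question": pending_question, "answer": match["text"]})
            pending_question = None
    return pairs


sample = [
    "Q1: first question",
    "A1: first answer",
    "Question list: not a pair",
    "Q2: second question",
]
print(parse_qa_lines(sample))
```

Unlike the loader, this version drops a question that has no answer, so every returned dict is guaranteed to have both keys.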

### Load Necessary Models for Evaluation

In [18]:
import spacy
from transformers import pipeline

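# Ensure the spaCy model is installed
def ensure_spacy_model(model_name="en_core_web_sm"):
    try:
        spacy.load(model_name)
    except OSError:
        import subprocess
        import sys
        # Use the interpreter running this notebook so the model is installed into the same environment
        subprocess.run([sys.executable, "-m", "spacy", "download", model_name], check=True)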

ensure_spacy_model()

# Load spaCy model for entity recognition
nlp = spacy.load("en_core_web_sm")

# Load NLI model
nli_model = pipeline("text-classification", model="roberta-large-mnli")

Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Define Evaluation Functions

In [19]:
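def extract_entities(text):
    """Return the named-entity strings that spaCy finds in the text."""
    doc = nlp(text)
    return [ent.text for ent in doc.ents]

def calculate_hallucination_score(generated_text, reference_text):
    """Return a score from 0 (well supported) to 1 (unsupported by the reference)."""
    # Extract entities
    gen_entities = set(extract_entities(generated_text))
    ref_entities = set(extract_entities(reference_text))

    # Entity mismatch: share of generated entities that are missing from the reference.
    # With no generated entities there is nothing to mismatch, so use 0 (this also avoids a ZeroDivisionError).
    if gen_entities:
        entity_score = 1 - len(gen_entities & ref_entities) / len(gen_entities)
    else:
        entity_score = 0.0

    # NLI entailment score: the classifier's confidence when its top label is ENTAILMENT, otherwise 0
    nli_result = nli_model(f"premise: {reference_text} hypothesis: {generated_text}")
    entailment_score = nli_result[0]['score'] if nli_result[0]['label'] == 'ENTAILMENT' else 0

    # Combine scores: the average of entity mismatch and lack of entailment
    combined_score = (entity_score + (1 - entailment_score)) / 2
    return combined_score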
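
The arithmetic behind the score can be checked without loading any model. The sketch below reproduces the two components on hand-written inputs. The entity lists and the entailment value are made up, the function names are hypothetical, and treating an empty entity list as zero mismatch is an assumption.

```python
def entity_mismatch(gen_entities, ref_entities):
    """Share of generated entities that do not appear in the reference."""
    gen, ref = set(gen_entities), set(ref_entities)
    if not gen:
        return 0.0  # assumption: nothing generated, nothing to mismatch
    return 1 - len(gen & ref) / len(gen)


def combine(entity_score, entailment_score):
    """Average of entity mismatch and lack of entailment."""
    return (entity_score + (1 - entailment_score)) / 2


# Two of the three generated entities appear in the reference: mismatch = 1 - 2/3
mismatch = entity_mismatch(["Company", "Advisor", "Israel"], ["Company", "Advisor"])

# Made-up entailment confidence of 0.8: (1/3 + 0.2) / 2
print(round(combine(mismatch, 0.8), 4))
```

Under this formula, a text that shares no entities with the reference and is not classified as entailment receives the maximum score of 1.0, and a text whose entities all match and is entailed with full confidence receives 0.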

### Running Query and Evaluation

In [20]:
def run_query(query, evaluation_data):
    # Retrieve the five chunks most similar to the query
    results = docstore.similarity_search(query, k=5)

    # Look up the reference answer once: the first QA pair whose question contains the query text
    reference_text = ""
    for qa_pair in evaluation_data:
        if query in qa_pair["question"]:
            reference_text = qa_pair.get("answer", "")
            break

    # Score each retrieved chunk against the reference answer
    for i, result in enumerate(results):
        generated_text = result.page_content
        hallucination_score = calculate_hallucination_score(generated_text, reference_text)

        print(f"Result {i + 1}:\n{generated_text[:500]}\n")
        print(f"Hallucination Score: {hallucination_score}\n")

# Example query
query = "Who owns the IP?"
run_query(query, evaluation_data)

Result 1:
IP: Any Work Product, upon creation, shall be fully and exclusively owned by the Company. The Advisor, immediately upon Company’s request, shall sign any document and/or perform any action needed to formalize such ownership. The Advisor shall not obtain any rights in the Work Product, including moral rights and/or rights for royalties or other consideration under any applicable law (including Section 134 of the Israeli Patent Law – 1967 if applicable), and shall not be entitled to any

Hallucination Score: 0.9

Result 2:
a Confidentiality, Non-Competition and IP Ownership Undertaking in the form attached hereto as Exhibit A.

Hallucination Score: 1.0

Result 3:
Confidentiality, Non-Competition and IP Ownership Undertaking: In connection with the performance of Advisor’s obligations under this Agreement, the Advisor shall execute a Confidentiality, Non-Competition and IP Ownership Undertaking in the form attached hereto as Exhibit A.

Hallucination Score: 0.7680196017026901

R