## Simple Question and Answer RAG using LangChain

### Loading environment variables

In [1]:
# Load environment variables from .env file
import os
from dotenv import load_dotenv

load_dotenv()

# Get the OpenAI API key from the environment
openai_api_key = os.getenv("OPENAI_API_KEY")
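
`os.getenv` returns `None` when the variable is missing, so a misconfigured `.env` file only shows up later, when the first OpenAI call fails. A minimal sketch of a fail-fast check follows; the helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper: treats both a missing and an empty variable as an error.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value
```

In the cell above, this could replace the plain lookup as `openai_api_key = require_env("OPENAI_API_KEY")`, called after `load_dotenv()`.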


### Reading the contract files found in the data directory

In [2]:
# Directory containing contract files
CONTRACTS_DIR = "../data/contracts"

# Function to read text from .docx files
from docx import Document as DocxDocument

def read_docx(file_path):
    doc = DocxDocument(file_path)
    return "\n".join([para.text for para in doc.paragraphs])

# Load contract data
contracts = []
for filename in os.listdir(CONTRACTS_DIR):
    if filename.endswith(".docx"):
        file_path = os.path.join(CONTRACTS_DIR, filename)
        contracts.append(read_docx(file_path))

# Display loaded contracts
for i, contract in enumerate(contracts):
    print(f"Contract {i + 1}:\n{contract[:500]}\n")


Contract 1:
ADVISORY SERVICES AGREEMENT

This Advisory Services Agreement is entered into as of June 15th, 2023 (the “Effective Date”), by and between Cloud Investments Ltd., ID 51-426526-3, an Israeli company (the "Company"), and Mr. Jack Robinson, Passport Number 780055578, residing at 1 Rabin st, Tel Aviv, Israel, Email: jackrobinson@gmail.com ("Advisor").

Whereas,	Advisor has expertise and/or knowledge and/or relationships, which are relevant to the Company’s business and the Company has asked Advisor 
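
The loading loop above depends on the order `os.listdir` happens to return, which is not guaranteed, and it would also try to open the temporary `~$...docx` lock files that Word creates while a document is open. A small sketch of a more defensive listing step, assuming a hypothetical helper named `list_docx_files`:

```python
import os


def list_docx_files(directory: str) -> list[str]:
    """Return sorted paths of .docx files in `directory`.

    Hypothetical helper: skips Word lock files (names starting with "~$")
    and returns an empty list if the directory does not exist.
    """
    if not os.path.isdir(directory):
        return []
    paths = []
    for name in sorted(os.listdir(directory)):
        if name.lower().endswith(".docx") and not name.startswith("~$"):
            paths.append(os.path.join(directory, name))
    return paths
```

Each returned path could then be passed to `read_docx` as before, which keeps the contract numbering stable between runs.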



### Chunking contracts using different strategies

#### Recursive Chunking

In [4]:
# Split contracts into smaller chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Splitter producing chunks of at most 500 characters, with up to 20 characters of overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)

# Split every contract and wrap each resulting chunk in a Document
documents = [
    Document(page_content=chunk)
    for contract in contracts
    for chunk in splitter.split_text(contract)
]

# Display the first five chunks, printing at most 500 characters of each
for i, doc in enumerate(documents[:5]):
    print(f"Chunk {i + 1}:\n{doc.page_content[:500]}\n")

Chunk 1:
ADVISORY SERVICES AGREEMENT

This Advisory Services Agreement is entered into as of June 15th, 2023 (the “Effective Date”), by and between Cloud Investments Ltd., ID 51-426526-3, an Israeli company (the "Company"), and Mr. Jack Robinson, Passport Number 780055578, residing at 1 Rabin st, Tel Aviv, Israel, Email: jackrobinson@gmail.com ("Advisor").

Chunk 2:
Whereas,	Advisor has expertise and/or knowledge and/or relationships, which are relevant to the Company’s business and the Company has asked Advisor to provide it with certain Advisory services, as described in this Agreement; and
Whereas, 	Advisor has agreed to provide the Company with such services, subject to the terms set forth in this Agreement.

NOW THEREFORE THE PARTIES AGREE AS FOLLOWS:

Chunk 3:
Services:  
Advisor shall provide to the Company, as an independent contractor, software development services, and / or any other services as agreed by the parties from time to time (the “Services”). Advisor shall not appoi
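
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break on separators such as paragraph and line breaks and treats `chunk_size` as an upper bound, which is why Chunk 1 above is shorter than 500 characters. The function name `fixed_size_chunks` is an assumption made for illustration.

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 20) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters. Illustrative only;
    it ignores sentence and paragraph boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


demo = fixed_size_chunks("abcdefghij", chunk_size=5, chunk_overlap=2)
print(demo)  # ['abcde', 'defgh', 'ghij']
```

The overlap means that a clause cut at a chunk boundary still appears partly in the next chunk, which can help retrieval when the relevant sentence straddles two chunks.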

### Creating embeddings and populating the Chroma vector store

In [5]:
# Create embeddings
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Create an instance of OpenAIEmbeddings using the provided OpenAI API key
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

# Embed every chunk and index the vectors in a Chroma vector store named docstore
docstore = Chroma.from_documents(documents, embeddings)
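
A vector store answers a query by comparing the query's embedding with the stored chunk embeddings and returning the closest ones. The sketch below illustrates that idea with cosine similarity on toy two-dimensional vectors. It is an illustration under assumptions, not Chroma's implementation: the distance metric Chroma actually applies depends on how the collection is configured, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 5) -> list[int]:
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy vectors standing in for real embeddings
toy_docs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], toy_docs, k=2))  # [1, 2]
```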

### Testing with a variable query input

In [18]:
# Query used to retrieve similar chunks
query = "Who owns the IP?"

# Perform similarity search using the raw query string
results = docstore.similarity_search(query, k=5)  # Retrieve top 5 similar documents

# Display the search results
for i, result in enumerate(results):
    print(f"Result {i + 1}:\n{result.page_content[:500]}\n")


Result 1:
except for Intellectual Property exclusively licensed to the Company pursuant to an Inbound IP Contract, the Company owns all worldwide rights, titles, and interests in and to each item of Company Intellectual Property, free and clear of any Encumbrance other than Permitted Encumbrances and licenses granted in the Outbound IP Contracts identified on Schedule 3.11(f) and the Company is the owner of record of all Company Registrations; and

Result 2:
except for Intellectual Property exclusively licensed to the Company pursuant to an Inbound IP Contract, the Company owns all worldwide rights, titles, and interests in and to each item of Company Intellectual Property, free and clear of any Encumbrance other than Permitted Encumbrances and licenses granted in the Outbound IP Contracts identified on Schedule 3.11(f) and the Company is the owner of record of all Company Registrations; and

Result 3:
IP: Any Work Product, upon creation, shall be fully and exclusively owned by the Com
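
Results 1 and 2 above are identical, which suggests the same chunk was indexed more than once. Possible causes include the indexing cell being run repeatedly against the same collection, or two contract files sharing the same text, but the output alone does not say which. A minimal sketch of filtering duplicates after retrieval follows. `StandInDoc` is a hypothetical stand-in used only to keep the example self-contained; the function relies only on a `page_content` attribute, which the LangChain `Document` objects used above also have.

```python
from dataclasses import dataclass


@dataclass
class StandInDoc:
    """Hypothetical stand-in for a retrieved document with a page_content field."""
    page_content: str


def unique_by_content(results):
    """Drop results whose text was already seen, keeping the original ranking order."""
    seen = set()
    unique = []
    for doc in results:
        key = doc.page_content.strip()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


sample = [StandInDoc("clause A"), StandInDoc("clause A"), StandInDoc("clause B")]
deduped = unique_by_content(sample)
print([doc.page_content for doc in deduped])  # ['clause A', 'clause B']
```

In the notebook this could be applied as `unique_by_content(results)` before printing. Requesting a larger `k` first leaves room for duplicates to be dropped while still returning five distinct chunks.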