The cells below read the lease agreement from a .docx file and prepare its text for the later steps. The preparation includes:
- Extracting the non-empty paragraphs from the document
- Collapsing whitespace and removing non-ASCII characters
- Splitting the cleaned text into overlapping chunks of up to 400 characters

The resulting chunks are kept in memory and passed to the fact extractor.

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
import faiss
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
import json
import re
from docx import Document
from transformers import T5Tokenizer, T5ForConditionalGeneration




In [6]:

def read_docx(file_path):
    """
    Extract text from a .docx file.
    """
    doc = Document(file_path)
    text = []
    for paragraph in doc.paragraphs:
        if paragraph.text.strip():  # Only include non-empty paragraphs
            text.append(paragraph.text.strip())
    return "\n".join(text)

def clean_text(text):
    """
    Clean text for better embedding generation.
    """
    # Remove unnecessary whitespace
    text = re.sub(r'\s+', ' ', text)
    # Remove non-ASCII characters
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    return text.strip()

def chunk_text_semantically(text, chunk_size=400, chunk_overlap=50):
    """
    Chunk text using RecursiveCharacterTextSplitter to retain semantic structure.
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ".", " ", ""],  # Semantic priority
    )
    return text_splitter.split_text(text)

raw_text = read_docx("test_sample.docx")
cleaned_text = clean_text(raw_text)
chunks = chunk_text_semantically(cleaned_text, chunk_size=400, chunk_overlap=50)
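One thing to note about the cell above: `clean_text` replaces every whitespace run, newlines included, with a single space, so by the time the text reaches the splitter there are no `"\n\n"` or `"\n"` separators left to split on. The non-ASCII filter also turns typographic apostrophes into spaces, which is a likely source of strings such as "cashier s check" in the later output. A minimal sketch of an alternative follows; `clean_text_keep_paragraphs` is a hypothetical helper, not part of the original notebook, and it assumes that keeping one line per paragraph is acceptable for the splitter.

```python
import re

# Hypothetical mapping of common typographic quotes to ASCII equivalents
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
})

def clean_text_keep_paragraphs(text):
    """Clean text line by line so that newlines survive for the splitter."""
    # Convert curly quotes first so they are not lost with other non-ASCII
    text = text.translate(QUOTE_MAP)
    # Replace any remaining non-ASCII runs with a space
    text = re.sub(r"[^\x00-\x7F]+", " ", text)
    # Collapse spaces and tabs inside each line, but keep the line breaks
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    # Drop lines that are empty after cleaning
    return "\n".join(line for line in lines if line)

sample = "Rent is due   monthly.\n\nPay by cashier\u2019s   check."
print(clean_text_keep_paragraphs(sample))
```

With this variant the `"\n"` separator in `chunk_text_semantically` has something to match, so chunk boundaries can fall on paragraph breaks before falling back to sentences and words.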

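This cell loads the fact extractor, a pretrained propositionizer model (no training happens here), and runs it on each chunk. The model returns a JSON list of propositions per chunk, and the lists are collected into a single `propositions` list. That list is saved to a file in the following cell.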

In [None]:

#when creating the vector db remove the cache
model_name = "chentong00/propositionizer-wiki-flan-t5-large"
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast = False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
propositions = []
title = "Rental Lease Agreement"
for idx, chunk in enumerate(chunks):
    input_text = f"Title: {title}. Section: Chunk {idx + 1}. Content: {chunk}"
    input_ids = tokenizer(input_text, return_tensors="pt", truncation=True).input_ids.to(device)
    
    # Generate propositions
    outputs = model.generate(input_ids, max_new_tokens=512).cpu()
    output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Parse JSON output
    try:
        prop_list = json.loads(output_text)
        print(prop_list)
        propositions.extend(prop_list)
    except json.JSONDecodeError:
        print(f"[ERROR] Failed to parse output for Chunk {idx + 1}")
        
print(json.dumps(propositions, indent=2))

cuda
['EX1A-6 MAT CTRCT 11 ark7_ex6-10.htm is a location for the Rental Lease Agreement.', 'EXHIBIT 6.10 Exhibit 6.10 is a Residential Lease Agreement.', 'The Residential Lease Agreement is made and entered on January 1, 2025.', 'The Effective Date of the Residential Lease Agreement is January 1, 2025.', 'ARK7 PROPERTIES LLC is the Landlord.', 'John Doe, Jane Smith is the Tenant.', 'The Residential Lease Agreement is made and entered on January 1, 2025.']
['The parties agree to pay rent using personal check, money order, or cashier s check.', 'Make your check payable to ARK7 INC.', 'Mail your check to the company address listed below.', 'Pay your rent before the due date each month.']
['The address is 535 Mission St, 14th Floor, San Francisco, CA 94105.', 'If any payment is returned for non-sufficient funds or because Tenant stops payments, then Landlord may require Tenant to pay Rent in cash for three months.', 'After the return of a payment, Landlord may require Tenant to pay Rent by
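The first output list above contains the same proposition twice ("The Residential Lease Agreement is made and entered on January 1, 2025."). Because chunks overlap by 50 characters, repeats across chunks are also plausible. A minimal sketch of order-preserving deduplication is shown below; `dedupe_preserving_order` is a hypothetical helper, and treating strings as equal after lowercasing and whitespace normalization is an assumption that may be too aggressive for some contracts.

```python
def dedupe_preserving_order(items):
    """Return items without repeats, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for item in items:
        # Normalize whitespace and case so trivial variants count as repeats
        key = " ".join(item.split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

sample = [
    "The Residential Lease Agreement is made and entered on January 1, 2025.",
    "ARK7 PROPERTIES LLC is the Landlord.",
    "The Residential Lease Agreement is made and entered on January 1, 2025.",
]
print(dedupe_preserving_order(sample))
```

Applied as `propositions = dedupe_preserving_order(propositions)` after the loop, this would shrink the list before it is embedded.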

In [8]:
import json
from docx import Document

# Assuming `propositions` is the JSON object you want to save

# Define the file path
file_path = "propositions_output.docx"

# Create a new Word document
doc = Document()

# Add a title to the document (optional)
doc.add_heading("Propositions Output", level=1)

# Add the JSON content to the document
doc.add_paragraph(json.dumps(propositions, indent=2, ensure_ascii=False))

# Save the document
doc.save(file_path)

print(f"Propositions successfully saved to {file_path}")

Propositions successfully saved to propositions_output.docx
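The cell above writes the propositions into a Word document, while the loading cell later in the notebook reads `propositions_output.json`. A minimal sketch for writing the JSON file directly is given below; `save_propositions` is a hypothetical helper, and the round trip runs in a temporary directory only so that the example does not overwrite an existing file.

```python
import json
import os
import tempfile

def save_propositions(propositions, file_path):
    """Write the list of proposition strings to a UTF-8 JSON file."""
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(propositions, f, indent=2, ensure_ascii=False)

# Round-trip check with two propositions taken from the output above
demo = ["ARK7 PROPERTIES LLC is the Landlord.", "Make your check payable to ARK7 INC."]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "propositions_output.json")
    save_propositions(demo, path)
    with open(path, "r", encoding="utf-8") as f:
        restored = json.load(f)

print(restored == demo)
```

In the notebook itself the call would be `save_propositions(propositions, "propositions_output.json")`.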


This is the end of the fact extraction step.
Next, we load the saved facts and create embeddings for them.

In [3]:
import json

# Function to load the propositions from the JSON file
def load_propositions(file_path):
    with open(file_path, "r", encoding="utf-8") as file:
        propositions = json.load(file)
    return propositions

# Load the propositions from the file
file_path = "propositions_output.json"
loaded_propositions = load_propositions(file_path)

# Check the loaded propositions
loaded_propositions[0]


'EX1A-6 MAT CTRCT 11 ark7_ex6-10.htm EXHIBIT 6.10 Exhibit 6.10 RESIDENTIAL LEASE AGREEMENT This Lease Agreement is made and entered on [CONTRACT_DATE].'

In [4]:
numbered_propositions = [(i + 1, chunk) for i, chunk in enumerate(loaded_propositions)]
numbered_propositions

[(1,
  'EX1A-6 MAT CTRCT 11 ark7_ex6-10.htm EXHIBIT 6.10 Exhibit 6.10 RESIDENTIAL LEASE AGREEMENT This Lease Agreement is made and entered on [CONTRACT_DATE].'),
 (2, 'The Effective Date of the Lease Agreement is [CONTRACT_DATE].'),
 (3, 'ARK7 PROPERTIES LLC is the Landlord.'),
 (4, 'TENANT1 is the Tenant.'),
 (5, 'TENANT2 is the Tenant.'),
 (6,
  'The parties agree to pay rent using personal check, money order, or cashier s check.'),
 (7, 'Make your check payable to ARK7 INC.'),
 (8, 'Mail your check to the company address listed below.'),
 (9, 'Pay your rent before the due date each month.'),
 (10, 'The address is 535 Mission St, 14th Floor, San Francisco, CA 94105.'),
 (11,
  'If any payment is returned for non-sufficient funds or because Tenant stops payments, then Landlord may require Tenant to pay Rent in cash for three months.'),
 (12,
  'After the return of a payment, Landlord may require Tenant to pay Rent by cashier s check or money order.'),
 (13,
  'In the event of roommate

Now we use ALeaseBERT to embed the propositions and retrieve the clauses that may be red flags in the contract. We then pass the retrieved clauses to the LLM and use chain-of-thought prompting to generate the answers.

In [5]:
# Load the embedding model and define the embedding function
import numpy as np
import torch
import faiss
from transformers import AlbertTokenizer, AlbertModel
# Specify the directory containing the model files
model_dir = r"C:\Users\SAR\Desktop\Hassan\Hackathon\models"

# Load the tokenizer
tokenizer = AlbertTokenizer.from_pretrained(model_dir)
model = AlbertModel.from_pretrained(model_dir)

print("Model and tokenizer loaded successfully!")

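def embed_clause(clause):
    # Pass max_length explicitly: the tokenizer has no predefined maximum, so
    # truncation=True alone does nothing. 512 assumes the usual ALBERT limit.
    inputs = tokenizer(clause, return_tensors="pt", truncation=True, max_length=512, padding=True)
    with torch.no_grad():  # inference only, so skip gradient tracking
        outputs = model(**inputs)
    # Use pooler output for fixed-size embeddings
    return outputs.pooler_output.detach().numpy().flatten()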

# Extract the clauses (propositions) from the numbered propositions
chunks = [chunk for _, chunk in numbered_propositions]

# Create embeddings for the propositions
embeddings_aleaseBert = np.array([embed_clause(chunk) for chunk in chunks], dtype="float32")

# Create FAISS index
aleaseBert_index = faiss.IndexFlatL2(embeddings_aleaseBert.shape[1])  # Using L2 distance for flat index
aleaseBert_index.add(embeddings_aleaseBert)

print(f"Number of vectors in index: {aleaseBert_index.ntotal}")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Model and tokenizer loaded successfully!
Number of vectors in index: 275
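For reference, `faiss.IndexFlatL2` performs an exhaustive search and reports squared Euclidean distances, which is what the `Distance` values in the search output further down represent. The pure-Python sketch below mirrors that behavior on toy vectors; `flat_l2_search` is a hypothetical helper written for illustration and is not how FAISS is implemented internally.

```python
def flat_l2_search(vectors, query, k):
    """Brute-force nearest neighbors by squared L2 distance."""
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared Euclidean distance, with no square root applied
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    # Smallest distance first; ties are broken by the lower index
    scored.sort()
    top = scored[:k]
    distances = [d for d, _ in top]
    indices = [i for _, i in top]
    return distances, indices

vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
distances, indices = flat_l2_search(vectors, [0.0, 0.0], k=2)
print(distances, indices)
```

Because the distances are squared and unnormalized, very small gaps between neighbors are hard to interpret on their own; comparing ranks is usually safer than comparing raw values.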


In [6]:
query = "What are the problems with the contract?"

# Step 1: Embed the query
query_embedding = embed_clause(query)

# Step 2: Reshape the query embedding for FAISS search
query_embedding = query_embedding.reshape(1, -1)

# Step 3: Perform the search in the FAISS index
D, I = aleaseBert_index.search(query_embedding, k=20)  # Retrieve the 20 closest embeddings

# Step 4: Retrieve the top matching chunks
retrieved_chunks = [chunks[i] for i in I[0]]

# Step 5: Print the results
print("Query:", query)
print("\nTop Matching Chunks:")
for idx, chunk in enumerate(retrieved_chunks):
    print(f"{idx + 1}. {chunk} (Distance: {D[0][idx]:.2f})")
    print("---")


Query: What are the problems with the contract?

Top Matching Chunks:
1. The Tenant remains strictly liable for any injury or damage to persons or property caused by the satellite dish. (Distance: 0.14)
---
2. The bi-monthly basis is every two months. (Distance: 0.15)
---
3. The landlord and tenant must provide notice for changing their addresses. (Distance: 0.15)
---
4. Chunk 8 also discusses cleaning the Premises, if necessary, upon termination of the tenancy. (Distance: 0.15)
---
5. The payment application is not affected by any dates or directions provided by the Tenant that accompanies a payment. (Distance: 0.16)
---
6. The abatement is according to the extent to which the Premises have been rendered untenantable. (Distance: 0.17)
---
7. The payment shall be made every two months. (Distance: 0.17)
---
8. The chunk 35 is about the number of units sharing utilities on the property. (Distance: 0.17)
---
9. Burning candles in the apartment is prohibited. (Distance: 0.18)
---
10. Prepa
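Results 4 and 8 above describe the chunk itself ("Chunk 8 also discusses...", "The chunk 35 is about...") rather than the lease. These probably come from the `Section: Chunk {idx + 1}` label in the input given to the fact extractor. A minimal sketch for filtering such propositions before indexing follows; `drop_chunk_references` and its pattern are hypothetical, and the pattern assumes the lease text never uses the word "chunk" followed by a number.

```python
import re

# Matches phrases such as "Chunk 8" or "chunk 35", case-insensitively
CHUNK_REFERENCE = re.compile(r"\bchunk\s+\d+\b", re.IGNORECASE)

def drop_chunk_references(propositions):
    """Keep only propositions that do not mention a numbered chunk."""
    return [p for p in propositions if not CHUNK_REFERENCE.search(p)]

sample = [
    "Burning candles in the apartment is prohibited.",
    "Chunk 8 also discusses cleaning the Premises, if necessary, upon termination of the tenancy.",
    "The chunk 35 is about the number of units sharing utilities on the property.",
]
print(drop_chunk_references(sample))
```

An alternative that avoids filtering is to leave the chunk number out of the extractor input, at the cost of losing that label in the prompt.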

Note: the cells below are where the LLM is run to analyze the clauses retrieved from the FAISS index.

After generating the red flags, we will perform named entity extraction to get the entities involved in the contract.

In [10]:
prompt = f"""
You are a contract analysis expert. 

{query}

I will provide you with several clauses from a lease agreement. Your task is to:
1. Review each clause carefully and identify any potential problems or issues in the contract.
2. Point out any red flags or clauses that might be considered unfair, risky, or unbalanced.
3. Based on your analysis, provide recommendations for improving the contract.

Please explain your reasoning step by step for each issue you identify.

Here are the clauses from the lease agreement:

{chr(10).join([f"{i+1}. {chunk}" for i, chunk in enumerate(retrieved_chunks)])}
"""

In [None]:
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small", device_map="auto", torch_dtype=torch.float16)
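# Encode the prompt and move the input IDs to wherever device_map placed the model
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# Generate the model's output, setting the output length explicitly
# instead of relying on the library default
outputs = model.generate(input_ids, max_new_tokens=256)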

# Decode the generated output back into text
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Print the generated response
print(response)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


KeyboardInterrupt: 

In [1]:
import torch
torch.cuda.empty_cache()
