## 1. Automated Metadata Extraction and Evaluation from Legal Rental Documents and Images

### AIM:

To develop a system that automatically extracts key metadata fields (rent value, agreement dates, parties involved, and so on) from rental agreements in `.docx` and `.png` formats, using Optical Character Recognition (OCR) and a Question-Answering (QA) model, and to evaluate its accuracy against ground truth using per-field recall.

### DATASET:

Each rental agreement is described by the following fields, which are also the column names of the ground-truth CSV. The spelling `Aggrement` is kept because the code looks the columns up by these exact names.

1. **File Name**: The name of the rental agreement file, identifying the document being processed.
2. **Aggrement Value**: The monthly rent in rupees, as an integer.
3. **Aggrement Start Date**: The date on which the rental period begins.
4. **Aggrement End Date**: The date on which the rental period ends.
5. **Renewal Notice (Days)**: The advance notice, in days, required to renew or terminate the agreement.
6. **Party One**: The name of the owner (lessor) of the property.
7. **Party Two**: The name of the tenant (lessee) renting the property.

### Workflow
1. **Text Extraction**: Extract text from `.docx` files using `python-docx` and from `.png` files using Tesseract OCR.
2. **Metadata Extraction**: Use a BERT-based QA model to extract metadata by answering predefined questions.
3. **Post-Processing**: Normalize extracted values (e.g., convert text to numbers, format dates).
4. **Evaluation**: Compute per-field recall scores by comparing predictions with ground truth.
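
The date part of step 3 can be sketched in isolation as below. This is a minimal illustration using only the standard library; the helper name `normalize_date` is an assumption made for this example, and it only covers dates written in "day month year" form with English month names.

```python
import re
from datetime import datetime


def normalize_date(raw):
    """Return the date as DD.MM.YYYY, or the input unchanged if it cannot be parsed."""
    # Drop ordinal suffixes: "1st" -> "1", "22nd" -> "22"
    cleaned = re.sub(r'(\d+)(st|nd|rd|th)\b', r'\1', raw)
    # Remove filler words and punctuation commonly found in legal phrasing
    cleaned = cleaned.replace(' of', '').replace(',', '').strip()
    try:
        return datetime.strptime(cleaned, '%d %B %Y').strftime('%d.%m.%Y')
    except ValueError:
        # Not in "day month year" form: keep the raw answer
        return raw


print(normalize_date("1st of April, 2021"))  # 01.04.2021
print(normalize_date("not a date"))          # not a date
```

Returning the raw string on failure keeps the unparsed answer visible in the output instead of silently replacing it.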

#### 1. Import Libraries and Setup

In [1]:
# Import necessary libraries
import pandas as pd
import docx
import pytesseract
from PIL import Image
from transformers import pipeline
from datetime import datetime
import os
from word2number import w2n
import re

# Configure Tesseract path
pytesseract.pytesseract.tesseract_cmd = r'Tesseract-OCR\tesseract.exe'



#### 2. Define Helper Functions

##### 2a. Text Extraction Functions

In [2]:
# Function to extract text from .docx files
def extract_text_from_docx(file_path):
    try:
        doc = docx.Document(file_path)
        full_text = [para.text for para in doc.paragraphs if para.text.strip()]
        return "\n".join(full_text)
    except Exception as e:
        print(f"Error reading .docx file {file_path}: {e}")
        return ""

# Function to extract text from .png files using OCR
def extract_text_from_png(file_path):
    try:
        image = Image.open(file_path).convert('L')
        text = pytesseract.image_to_string(image)
        return text.replace('OOO', '000').replace('‘', "'")
    except Exception as e:
        print(f"Error reading .png file {file_path}: {e}")
        return ""

##### 2b. Initialize QA Model

In [3]:
# Load the QA model (BERT for Question Answering)
qa_pipeline = pipeline("question-answering", model="bert-large-uncased-whole-word-masking-finetuned-squad")

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


##### 2c. Define Metadata Questions

In [4]:
# Define questions for each metadata field
questions = {
    "Aggrement Value": "What is the monthly rent amount in rupees?",
    "Aggrement Start Date": "When does the agreement start?",
    "Aggrement End Date": "When does the agreement end?",
    "Renewal Notice (Days)": "How many days in advance should notice be given to renew or end the agreement?",
    "Party One": "Who is the owner of the agreement?",
    "Party Two": "Who is the resident in the agreement?"
}

##### 2d. Metadata Extraction Function
Extract metadata using the QA model.

In [5]:
# Extract metadata using the QA model
def extract_metadata(text, questions):
    metadata = {}
    for field, question in questions.items():
        result = qa_pipeline(question=question, context=text)
        metadata[field] = result['answer']
    return metadata
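
The question-answering pipeline returns a confidence `score` next to each `answer`, which `extract_metadata` discards. The sketch below shows one possible way to keep it and to blank out weak answers. The function name `extract_metadata_with_scores`, the `min_score` threshold of 0.1, and the `fake_qa` stand-in are assumptions made for this example; `fake_qa` only imitates the shape of the pipeline's result so the snippet runs without downloading a model.

```python
def extract_metadata_with_scores(text, questions, qa, min_score=0.1):
    """Run each question through `qa` and drop answers scored below `min_score`."""
    # Nothing to search in: report every field as missing instead of calling the model
    if not text.strip():
        return {field: {"answer": None, "score": 0.0} for field in questions}

    results = {}
    for field, question in questions.items():
        result = qa(question=question, context=text)
        confident = result["score"] >= min_score
        results[field] = {
            "answer": result["answer"] if confident else None,
            "score": result["score"],
        }
    return results


# Stand-in for the real pipeline (hypothetical scores and answers)
def fake_qa(question, context):
    if "rent" in question:
        return {"answer": "6500", "score": 0.92, "start": 8, "end": 12}
    return {"answer": "the", "score": 0.03, "start": 0, "end": 3}


demo_questions = {
    "Aggrement Value": "What is the monthly rent amount in rupees?",
    "Party One": "Who is the owner of the agreement?",
}
print(extract_metadata_with_scores("Rent is 6500 per month.", demo_questions, fake_qa))
```

With the objects defined in this notebook, the call would be `extract_metadata_with_scores(text, questions, qa_pipeline)`. A suitable threshold would have to be chosen by inspecting scores on real documents.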


#### 3. Process Test Files and Save Predictions
Process all files in the test folder, extract metadata, and save to CSV.

In [6]:
# Process all files in the test folder
def process_test_files(test_folder, output_csv):
    test_files = os.listdir(test_folder)
    predictions = []

    for file_name in test_files:
        file_path = os.path.join(test_folder, file_name)
        text = extract_text_from_docx(file_path) if file_name.endswith('.docx') else extract_text_from_png(file_path)
        metadata = extract_metadata(text, questions)
        metadata['File Name'] = file_name.split('.')[0]

        # Value: try spelled-out numbers first, then fall back to the digits in the answer
        value_text = metadata['Aggrement Value']
        try:
            metadata['Aggrement Value'] = w2n.word_to_num(value_text)
        except ValueError:
            digits = re.sub(r'[^0-9]', '', value_text)
            # Leave the value empty when the answer contains no digits at all
            metadata['Aggrement Value'] = int(digits) if digits else None

        # Notice period: convert "<n> month(s)" or "<n> day(s)" to days (1 month = 30 days)
        input_text = metadata["Renewal Notice (Days)"].strip().lower()
        days = 0
        if match := re.match(r'(\d+|\w+)\s*(month|day)', input_text):
            multiplier = 30 if match.group(2) == 'month' else 1
            token = match.group(1)
            try:
                count = int(token) if token.isdigit() else w2n.word_to_num(token)
                days = count * multiplier
            except ValueError:
                days = 0  # the leading token is not a number, e.g. "the month"
        metadata["Renewal Notice (Days)"] = days

        # Dates: strip ordinal suffixes and filler, then reformat as DD.MM.YYYY
        for key in ['Aggrement Start Date', 'Aggrement End Date']:
            date_str = re.sub(r'(\d+)(st|nd|rd|th)', r'\1', metadata[key])
            date_str = date_str.replace(' of', '').replace('*', '').replace(',', '').strip()
            try:
                parsed_date = datetime.strptime(date_str, '%d %B %Y')
                metadata[key] = parsed_date.strftime('%d.%m.%Y')
            except ValueError:
                pass  # keep the raw answer if it is not in "day month year" form
        
        predictions.append(metadata)
    # Save predictions to CSV
    df = pd.DataFrame(predictions)
    df = df[['File Name', 'Aggrement Value', 'Aggrement Start Date', 'Aggrement End Date', 'Renewal Notice (Days)', 'Party One', 'Party Two']]
    df.to_csv(output_csv, index=False)
    return df
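
The notice-period rule inside `process_test_files` only matches when the answer starts with the number. A standalone variant that searches anywhere in the answer might look like the sketch below. The function `notice_to_days` and the small `WORD_NUMBERS` table are assumptions made for this example (the notebook itself relies on `word2number`), and one month is approximated as 30 days, as in the code above.

```python
import re

# Small illustrative vocabulary; extend as needed
WORD_NUMBERS = {
    "one": 1, "two": 2, "three": 3, "six": 6,
    "fifteen": 15, "thirty": 30, "sixty": 60, "ninety": 90,
}


def notice_to_days(answer):
    """Return the notice period in days, or None if no '<n> day/month' phrase is found."""
    text = answer.strip().lower()
    # Digits may touch the unit ("30days"); number words must be followed by whitespace
    for match in re.finditer(r'\b(\d+\s*|[a-z]+\s+)(day|month)', text):
        token = match.group(1).strip()
        count = int(token) if token.isdigit() else WORD_NUMBERS.get(token)
        if count is not None:
            return count * 30 if match.group(2) == 'month' else count
    return None


print(notice_to_days("30 days"))              # 30
print(notice_to_days("at least two months"))  # 60
print(notice_to_days("as soon as possible"))  # None
```

Returning `None` rather than `0` makes it possible to tell "no notice period found" apart from a genuine value.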

#### 4. Compute Recall
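Evaluate the predictions by computing per-field recall against ground truth. For each field, the score is the fraction of compared files whose prediction matches the ground truth: numeric fields by equality, dates after parsing both sides day-first, and party names by case-insensitive string comparison. Files without a ground-truth row, with missing values, or with unparseable dates are left out of the count.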

In [7]:
# Compute per-field Recall
def compute_recall(pred_df, gt_df):
    scores = {}
    for col in ['Aggrement Value', 'Aggrement Start Date', 'Aggrement End Date', 'Renewal Notice (Days)', 'Party One', 'Party Two']:
        correct = 0
        total = 0
        for _, row in pred_df.iterrows():
            gt_row = gt_df[gt_df['File Name'] == row['File Name']]
            if gt_row.empty: continue
            gt_val = gt_row.iloc[0][col]
            pred_val = row[col]
            # Handle potential missing values
            if pd.isna(gt_val) or pd.isna(pred_val): continue
            # Compare values
            if col in ['Aggrement Value', 'Renewal Notice (Days)']:
                if gt_val == pred_val:
                    correct += 1
            elif col in ['Aggrement Start Date', 'Aggrement End Date']:
                try:
                    if pd.to_datetime(gt_val, dayfirst=True) == pd.to_datetime(pred_val, dayfirst=True):
                        correct += 1
                except (ValueError, TypeError):
                    # Unparseable dates are skipped and do not count towards the total
                    continue
            else:
                # For Party One and Party Two, compare strings (case-insensitive)
                if str(gt_val).strip().lower() == str(pred_val).strip().lower():
                    correct += 1
            total += 1
        scores[col] = correct / total if total > 0 else 0
    return scores
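
Because `compute_recall` skips rows with a missing value instead of counting them as misses, the score can change depending on how the denominator is defined. The pure-Python sketch below illustrates the difference; the function `exact_match_rate`, its `skip_missing` flag, and the sample values are made up for this example.

```python
def exact_match_rate(predictions, ground_truth, skip_missing=True):
    """Fraction of files whose prediction equals the ground truth."""
    correct = 0
    total = 0
    for file_name, truth in ground_truth.items():
        predicted = predictions.get(file_name)
        if predicted is None and skip_missing:
            continue  # mirrors compute_recall: the file is dropped from the total
        total += 1
        if predicted == truth:
            correct += 1
    return correct / total if total else 0.0


# Made-up values for illustration only
ground_truth = {"doc_a": 6500, "doc_b": 12000, "doc_c": 9000}
predictions = {"doc_a": 6500, "doc_b": None, "doc_c": 8000}

print(exact_match_rate(predictions, ground_truth, skip_missing=True))   # 0.5
print(exact_match_rate(predictions, ground_truth, skip_missing=False))  # 1 of 3 correct
```

Counting missing predictions as misses gives the stricter number and is closer to the usual meaning of recall.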

#### 5. Read the Dataset

In [8]:
test_folder = "data/test"
test_csv = "data/test.csv"
output_csv = "predictions.csv"

# Load ground truth
gt_df = pd.read_csv(test_csv)

#### 6. Run the Pipeline and Compute Recall Scores

In [9]:
# Process test files and get predictions
pred_df = process_test_files(test_folder, output_csv)

# Compute Recall scores
recall_scores = compute_recall(pred_df, gt_df)

# Print Recall scores
print("Per-field Recall Scores:")
for field, score in recall_scores.items():
    print(f"{field}: {score:.2f}")

Per-field Recall Scores:
Aggrement Value: 1.00
Aggrement Start Date: 1.00
Aggrement End Date: 0.50
Renewal Notice (Days): 0.75
Party One: 0.25
Party Two: 0.00
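
Party One and Party Two are scored by exact, case-insensitive string equality, so an answer that differs from the ground truth only by a title or stray punctuation counts as a miss. One way to inspect such near misses is a softer comparison, sketched below with `difflib` from the standard library. This is not the metric used above; the helper names, the list of titles, the 0.8 threshold, and the sample names are all assumptions made for this example.

```python
import re
from difflib import SequenceMatcher


def normalize_name(name):
    """Lowercase, drop common titles and punctuation, and collapse whitespace."""
    name = name.lower()
    name = re.sub(r'\b(mr|mrs|ms|dr|shri|smt)\b\.?', ' ', name)
    name = re.sub(r'[^a-z\s]', ' ', name)
    return ' '.join(name.split())


def names_match(predicted, truth, threshold=0.8):
    """True when the normalized names are at least `threshold` similar."""
    a, b = normalize_name(predicted), normalize_name(truth)
    return SequenceMatcher(None, a, b).ratio() >= threshold


# Hypothetical names for illustration only
print(names_match("Mr. John Smith,", "John Smith"))  # True
print(names_match("John Smith", "Jane Doe"))         # False
```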
