# 📊 Data Exploration for OpenNyAI

This notebook explores Indian legal documents to understand:
- Document structure and characteristics
- Entity distributions (courts, statutes, parties)
- Text statistics (length, vocabulary)
- Quality assessment for training data

In [None]:
# Setup: put the parent directory on the import path so that `src` resolves
import sys
sys.path.insert(0, '..')

import json
import re
from pathlib import Path
from collections import Counter

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from src.utils.regex_patterns import IndianLegalPatterns
from src.data.preprocessor import LegalTextPreprocessor

# Configure display
pd.set_option('display.max_colwidth', 100)
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

## 1. Load Sample Legal Documents

In [None]:
# Sample legal document for exploration
SAMPLE_JUDGMENT = """
IN THE SUPREME COURT OF INDIA
CIVIL APPELLATE JURISDICTION
CIVIL APPEAL NO. 1234 OF 2023

Kesavananda Bharati                                    ... Petitioner
                    Versus
State of Kerala & Ors.                                ... Respondents

JUDGMENT

Hon'ble Mr. Justice D.Y. Chandrachud, CJI
Hon'ble Mr. Justice Sanjiv Khanna

1. This appeal arises from the judgment dated 15th January 2023 passed by the 
High Court of Kerala in WP(C) No. 4567/2022.

FACTS:

2. The petitioner is the head of the Edneer Mutt in Kerala. The dispute concerns 
certain lands belonging to the religious institution.

3. On 24th April 1973, this Court in AIR 1973 SC 1461 laid down the basic 
structure doctrine while examining Article 368 of the Constitution of India.

ISSUE:

4. The primary issue for consideration is whether the impugned legislation 
violates the fundamental rights guaranteed under Articles 14, 19 and 25 of 
the Constitution.

ARGUMENTS OF PETITIONER:

5. Learned Senior Counsel Shri Fali S. Nariman appearing for the petitioner 
submitted that Section 3 of the impugned Act violates Article 14 as it creates 
unreasonable classification.

6. It was further argued relying on (2019) 5 SCC 123 that the right to property 
under Article 300A cannot be taken away without authority of law.

ARGUMENTS OF RESPONDENT:

7. Per contra, learned Additional Solicitor General appearing for the State 
contended that the legislation is a reasonable restriction under Article 19(5).

ANALYSIS:

8. We have carefully considered the submissions made by the learned counsel 
for both parties and perused the material on record.

9. In Minerva Mills v. Union of India (1980) 3 SCC 625, this Court held that 
judicial review is a basic feature of the Constitution.

10. Applying the principles laid down in Section 302 IPC and Section 34 of the 
Indian Penal Code read with Section 120B of the CrPC, we find that...

RATIO DECIDENDI:

11. The doctrine of proportionality requires that any restriction on fundamental 
rights must satisfy the tests of necessity, suitability and proportionality.

ORDER:

12. In view of the above, the appeal is allowed. The impugned judgment of the 
High Court is set aside. The respondents shall pay costs of Rs. 50,000/-.

                                        ........................................J.
                                        (D.Y. CHANDRACHUD, CJI)

                                        ........................................J.
                                        (SANJIV KHANNA)

New Delhi,
Dated: 22nd January 2024
"""

print(f"Sample document length: {len(SAMPLE_JUDGMENT)} characters")
# Rough rule of thumb (about 4 characters per token); real counts depend on the tokenizer
print(f"Approximate tokens: {len(SAMPLE_JUDGMENT) // 4}")

## 2. Extract Legal Entities Using Regex Patterns

In [None]:
# Initialize regex patterns
patterns = IndianLegalPatterns()

# Extract all legal terms
extracted = patterns.extract_legal_terms(SAMPLE_JUDGMENT)

print("=" * 60)
print("EXTRACTED LEGAL ENTITIES")
print("=" * 60)

for category, items in extracted.items():
    if items:
        print(f"\n{category.upper()} ({len(items)} found):")
        for item in items[:5]:  # Show first 5
            print(f"  • {item}")
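
The internals of `IndianLegalPatterns` are not shown in this notebook. As a rough illustration of the kind of regex-based extraction the cell above relies on, here is a minimal, self-contained sketch. The names `SKETCH_PATTERNS` and `extract_terms_sketch` are hypothetical, invented for this example, and the patterns are assumptions rather than the ones shipped in the project.

```python
import re

# Hypothetical patterns for illustration only; not the project's actual patterns.
SKETCH_PATTERNS = {
    # "Article 14", "Article 300A", "Article 19(5)"
    "articles": re.compile(r"\bArticles?\s+\d+[A-Z]?(?:\(\d+\))?"),
    # "Section 3", "Section 120B"
    "sections": re.compile(r"\bSection\s+\d+[A-Z]?"),
    # "Supreme Court of India", "High Court of Kerala"
    "courts": re.compile(r"\b(?:Supreme Court of India|High Court of [A-Z][a-z]+)"),
}


def extract_terms_sketch(text):
    """Return every match of each pattern, keyed by category name."""
    return {name: pattern.findall(text) for name, pattern in SKETCH_PATTERNS.items()}


sample = (
    "The High Court of Kerala examined Section 3 of the Act. "
    "Article 19(5) and Article 300A of the Constitution were relied upon."
)
sketch_result = extract_terms_sketch(sample)
for category, items in sketch_result.items():
    print(category, items)
```

Because every group in these patterns is non-capturing, `findall` returns the full matched strings. Note that this sketch only picks up the first number in a list such as "Articles 14, 19 and 25"; handling enumerations would need an extra pass.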

## 3. Extract Case Parties

In [None]:
# Extract petitioner vs respondent
case_title = "Kesavananda Bharati v. State of Kerala & Ors."
parties = patterns.extract_case_parties(case_title)

print("Case Parties:")
print(f"  Petitioner: {parties.get('petitioner', 'N/A')}")
print(f"  Respondent: {parties.get('respondent', 'N/A')}")
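
For readers without access to the project module, the party split can be approximated by cutting the title at the first "v.", "vs." or "versus". The sketch below is an assumption about how such a helper might work, not the implementation of `extract_case_parties`; `PARTY_SEPARATOR` and `split_parties_sketch` are hypothetical names.

```python
import re

# Separators commonly seen in case titles, matched case-insensitively.
PARTY_SEPARATOR = re.compile(r"\s+(?:v\.?|vs\.?|versus)\s+", flags=re.IGNORECASE)


def split_parties_sketch(title):
    """Split a case title into petitioner and respondent at the first separator."""
    parts = PARTY_SEPARATOR.split(title.strip(), maxsplit=1)
    if len(parts) != 2:
        return {}  # no recognizable separator
    petitioner, respondent = (part.strip() for part in parts)
    return {"petitioner": petitioner, "respondent": respondent}


print(split_parties_sketch("Kesavananda Bharati v. State of Kerala & Ors."))
```

One known limitation of this approach: a standalone initial "V" inside a party name would be mistaken for the separator, so real titles may need a stricter rule.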

## 4. Citation Analysis

In [None]:
# Find all citations in the document
citations = patterns.extract_all_citations(SAMPLE_JUDGMENT)

print("Citations Found:")
for cite in citations:
    if cite:  # Filter empty matches
        normalized = patterns.normalize_citation(str(cite))
        print(f"  • {normalized}")
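
The sample judgment contains two reporter formats: `AIR 1973 SC 1461` and `(2019) 5 SCC 123`. The following sketch shows one possible way to find both and normalize their spacing. It is not the project's `extract_all_citations` or `normalize_citation`; the pattern names and `find_citations_sketch` are hypothetical.

```python
import re

# AIR <year> <court> <page>, for example "AIR 1973 SC 1461"
AIR_CITATION = re.compile(r"\bAIR\s+(\d{4})\s+([A-Za-z]+)\s+(\d+)")
# (<year>) <volume> SCC <page>, for example "(2019) 5 SCC 123"
SCC_CITATION = re.compile(r"\((\d{4})\)\s+(\d+)\s+SCC\s+(\d+)")


def find_citations_sketch(text):
    """Return citations with single-space formatting, in order of appearance."""
    found = []
    for match in AIR_CITATION.finditer(text):
        year, court, page = match.groups()
        found.append((match.start(), f"AIR {year} {court} {page}"))
    for match in SCC_CITATION.finditer(text):
        year, volume, page = match.groups()
        found.append((match.start(), f"({year}) {volume} SCC {page}"))
    # Sort by character offset so the output follows the document order.
    return [citation for _, citation in sorted(found)]


messy = "See Minerva Mills (1980)  3 SCC 625 and AIR  1973   SC 1461."
print(find_citations_sketch(messy))
# ['(1980) 3 SCC 625', 'AIR 1973 SC 1461']
```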

## 5. Text Statistics

In [None]:
# Preprocess and analyze
preprocessor = LegalTextPreprocessor()
cleaned_text = preprocessor.preprocess(SAMPLE_JUDGMENT)

# Sentence segmentation
sentences = preprocessor.segment_sentences(cleaned_text)

# Word statistics
words = cleaned_text.split()
word_lengths = [len(w) for w in words]

print("Text Statistics:")
print(f"  Total characters: {len(cleaned_text)}")
print(f"  Total words: {len(words)}")
print(f"  Total sentences: {len(sentences)}")
# max(..., 1) guards against division by zero on empty input
print(f"  Avg words per sentence: {len(words) / max(len(sentences), 1):.1f}")
print(f"  Avg word length: {sum(word_lengths) / max(len(word_lengths), 1):.1f}")
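
The statistics above cover length but not vocabulary. A minimal sketch of vocabulary size and type-token ratio using `collections.Counter` follows; the function name `vocabulary_stats_sketch` and the simple lowercase-letters tokenization are assumptions made for this example.

```python
import re
from collections import Counter


def vocabulary_stats_sketch(text, top_n=5):
    """Count alphabetic tokens and report vocabulary size and type-token ratio."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    ratio = len(counts) / len(tokens) if tokens else 0.0
    return {
        "tokens": len(tokens),
        "vocabulary": len(counts),
        "type_token_ratio": ratio,
        "top": counts.most_common(top_n),
    }


stats = vocabulary_stats_sketch("The Court held that the Court may review the order.", top_n=2)
print(stats)
```

In the notebook itself this could be called on `cleaned_text`. Stop words such as "the" will dominate the top counts unless they are filtered out first.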

## 6. Rhetorical Structure Identification

In [None]:
# Identify rhetorical sections based on headers
section_headers = [
    'JUDGMENT', 'FACTS', 'ISSUE', 'ARGUMENTS OF PETITIONER',
    'ARGUMENTS OF RESPONDENT', 'ANALYSIS', 'RATIO DECIDENDI', 'ORDER'
]

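# Match a header only when it stands alone on a line (optionally followed by a
# colon); a plain substring test would also hit words such as "judgment" or
# "issue" in running text.
found_sections = []
for header in section_headers:
    header_line = rf"^[ \t]*{re.escape(header)}[ \t]*:?[ \t]*$"
    if re.search(header_line, SAMPLE_JUDGMENT, flags=re.MULTILINE):
        found_sections.append(header)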

print("Rhetorical Sections Found:")
for section in found_sections:
    print(f"  ✓ {section}")
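
Knowing which headers exist is a first step; for rhetorical role labelling it is usually more useful to have the text under each header. The sketch below splits a document at header lines. `split_sections_sketch` is a hypothetical helper written for this example, and it assumes headers appear in upper case on their own line, as they do in the sample judgment.

```python
import re


def split_sections_sketch(text, headers):
    """Map each header found on its own line to the text that follows it."""
    # Longer headers first so a short header never shadows a longer one.
    alternation = "|".join(re.escape(h) for h in sorted(headers, key=len, reverse=True))
    header_line = re.compile(rf"^[ \t]*({alternation})[ \t]*:?[ \t]*$", flags=re.MULTILINE)
    matches = list(header_line.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        # A section runs until the next header line or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1)] = text[match.end():end].strip()
    return sections


demo = """FACTS:
2. The petitioner is the head of the Mutt.

ISSUE:
4. The primary issue is whether the Act is valid.

ORDER:
12. The appeal is allowed.
"""
demo_sections = split_sections_sketch(demo, ["FACTS", "ISSUE", "ORDER"])
for name, body in demo_sections.items():
    print(f"{name}: {body}")
```

If a header occurs more than once, this sketch keeps only the last occurrence; collecting bodies in a list per header would avoid that.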

## 7. PII Detection (for Anonymization)

In [None]:
# Test PII detection patterns
test_text_with_pii = """
The accused can be contacted at 9876543210 or email: accused@example.com.
His Aadhaar number is 1234 5678 9012 and PAN is ABCDE1234F.
"""

anonymized = patterns.anonymize_pii(test_text_with_pii)

print("Original:")
print(test_text_with_pii)
print("\nAnonymized:")
print(anonymized)
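
How `anonymize_pii` works internally is not shown here. As an illustration of the general technique, the sketch below replaces four PII types with placeholders using regular expressions. The patterns, the placeholder strings, and the names `PII_SKETCH` and `anonymize_sketch` are all assumptions for this example and are deliberately simple.

```python
import re

# (placeholder, pattern) pairs, applied in order. Hypothetical and simplified.
PII_SKETCH = [
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    # Aadhaar written as three space-separated groups of four digits
    ("[AADHAAR]", re.compile(r"\b\d{4}\s\d{4}\s\d{4}\b")),
    # PAN: five letters, four digits, one letter
    ("[PAN]", re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b")),
    # Ten-digit mobile number starting with 6 to 9
    ("[PHONE]", re.compile(r"\b[6-9]\d{9}\b")),
]


def anonymize_sketch(text):
    """Replace each recognized PII span with its placeholder."""
    for placeholder, pattern in PII_SKETCH:
        text = pattern.sub(placeholder, text)
    return text


pii_demo = "Call 9876543210 or email accused@example.com. Aadhaar 1234 5678 9012, PAN ABCDE1234F."
print(anonymize_sketch(pii_demo))
# Call [PHONE] or email [EMAIL]. Aadhaar [AADHAAR], PAN [PAN].
```

Regex-based detection of this kind misses names and addresses and can produce false positives on other digit sequences, so it is best treated as a first pass before manual review.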

## 8. Visualization: Document Structure

In [None]:
# Visualize sentence lengths
sentence_lengths = [len(s.split()) for s in sentences]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram of sentence lengths
axes[0].hist(sentence_lengths, bins=20, color='steelblue', edgecolor='white')
axes[0].set_xlabel('Words per Sentence')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Sentence Lengths')
# Compute the mean once; max(..., 1) avoids division by zero on empty input
mean_length = sum(sentence_lengths) / max(len(sentence_lengths), 1)
axes[0].axvline(x=mean_length, color='red',
                linestyle='--', label=f'Mean: {mean_length:.1f}')
axes[0].legend()

# Entity type distribution
entity_counts = {k: len(v) for k, v in extracted.items() if v}
if entity_counts:
    axes[1].bar(entity_counts.keys(), entity_counts.values(), color='teal')
    axes[1].set_xlabel('Entity Type')
    axes[1].set_ylabel('Count')
    axes[1].set_title('Distribution of Legal Entities')
    axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## 9. Next Steps

Based on this exploration:

1. **Data Collection**: Use the scraper to collect documents from Indian Kanoon
2. **Preprocessing**: Apply regex patterns for cleaning and entity extraction
3. **Annotation**: Create training data for NER and RRL models
4. **Model Training**: Fine-tune InLegalBERT on annotated data
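
One possible bridge between the regex step and the annotation step is to turn regex matches into character-offset spans, which annotators can then correct instead of labelling from scratch. The sketch below illustrates that idea only; the label names, the patterns, and `weak_label_sketch` are hypothetical and do not reflect a particular annotation schema.

```python
import re

# Hypothetical label set and patterns, for illustration only.
LABEL_PATTERNS_SKETCH = {
    "PROVISION": re.compile(r"\b(?:Section|Article)\s+\d+[A-Z]?"),
    "COURT": re.compile(r"\b(?:Supreme Court of India|High Court of [A-Z][a-z]+)"),
}


def weak_label_sketch(text):
    """Return regex matches as span dictionaries sorted by start offset."""
    spans = []
    for label, pattern in LABEL_PATTERNS_SKETCH.items():
        for match in pattern.finditer(text):
            spans.append({
                "start": match.start(),
                "end": match.end(),
                "label": label,
                "text": match.group(),
            })
    return sorted(spans, key=lambda span: span["start"])


sentence = "The High Court of Kerala upheld Section 3 under Article 14."
spans = weak_label_sketch(sentence)
for span in spans:
    print(span)
```

Each span satisfies `sentence[span["start"]:span["end"]] == span["text"]`, which makes the output easy to validate before it is handed to an annotation tool.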