VERDICT is an AI-powered contract analysis platform that automates legal contract management using Google Gemini 2.5 Pro. It helps legal and procurement teams reduce manual effort, minimize errors, and accelerate contract review, and it surfaces actionable insights from each contract. It supports contracts in English and Arabic with full RAG-powered Q&A capabilities.
- 📄 Secure Contract Upload — Support for PDF and DOCX files
- 🤖 AI-Powered Analysis — Comprehensive contract analysis using Gemini 2.5 Pro
- 💬 RAG Q&A System — TF-IDF + MMR-powered question answering
- 🌐 Arabic Translation — Full contract translation with RTL display
- 📊 Contract Summarization — Automatic summary generation
- ⚠️ Risk & Compliance Analysis — Identify potential issues and missing clauses
- 🎨 Professional UI — VERDICT-branded Streamlit interface
| Component | Technology | Purpose |
|---|---|---|
| Frontend | Streamlit | Multi-page web application |
| LLM | Google Gemini 2.5 Pro Preview | Contract analysis and generation |
| Document Parsing | doc_reader.py (PDF/DOCX) | Text extraction from contracts |
| RAG System | TF-IDF + MMR Diversification | Question answering with citations |
| Embeddings | Sentence-Transformers | Semantic search capabilities |
| Translation | Deep Translator (Google) | Arabic translation with chunking |
| Storage | File-based (JSON/TXT) | Persistent analysis storage |
| API Backend | Google Gemini API | AI processing (250 req/day free tier) |
UNSUPERVISED-CLM/
│
├── 🏠 main.py # Main Streamlit app (Home page)
│ │ # - Full contract analysis pipeline
│ │ # - 2-tab interface (Full Analysis + Arabic Translation)
│ │ # - RAG-powered Q&A chatbot
│ │ # - Saved analyses sidebar
│ │ # - Session state management
│
├── 📱 app/ # UI components & styling
│ ├── theme.css # VERDICT design (Aurora Indigo→Teal gradient)
│ │ # - Custom header with navigation
│ │ # - Section cards & hero blocks
│ │ # - Chat interface styling
│ │ # - Responsive design system
│ ├── config.py # App configuration & settings
│ ├── shared.py # Shared utilities (header, theme loader)
│ ├── utils.py # Helper functions
│ ├── parse_utils.py # Document parsing utilities
│ └── assets/ # Images, logos, VERDICT branding
│
├── 📄 pages/ # Streamlit multi-page app structure
│ ├── 1_Upload.py # Upload page with full analysis
│ │ # - Same functionality as main page
│ │ # - Parse → Analyze → Display → RAG → Translate
│ ├── 2_Create.py # Create new contracts page
│ ├── 3_Edit.py # Edit existing contracts page
│ └── 4_Results.py # Results display page
│
├── 🧠 src/ # Core AI logic & processing
│ ├── doc_reader.py # Document parsing (PDF/DOCX → text)
│ │ # - Language detection (Arabic/English)
│ │ # - Text extraction with metadata
│ │
│ ├── llm_handler.py # LLM wrapper for Gemini API
│ │ # - API configuration management
│ │ # - Request/response handling
│ │
│ ├── analysis.py # Main contract analysis logic
│ │ # - analyze_document() function
│ │ # - 7-section structured analysis
│ │ # - Prompt engineering for contracts
│ │
│ ├── analysis_metrics.py # Performance & quality metrics
│ │ # - ContractAnalysisMetrics class
│ │ # - Session tracking
│ │ # - Quality assessment scores
│ │ # - Performance monitoring
│ │
│ ├── rag_model.py # RAG system implementation
│ │ # - TfidfIndex class
│ │ # - TF-IDF vectorization
│ │ # - MMR diversification
│ │ # - Extractive + generative answers
│ │ # - LEGAL_PATTERNS for contracts
│ │
│ ├── metrics_rag.py # RAG performance metrics
│ │ # - Answer quality evaluation
│ │ # - Retrieval accuracy tracking
│ │
│ ├── contract_summary_generator.py # Summary generation system
│ │ # - ContractSummaryGenerator class
│ │ # - Multi-document grouping
│ │ # - Automatic summary creation
│ │
│ ├── translation_utils.py # Arabic translation module
│ │ # - translate_to_arabic() with chunking
│ │ # - save_arabic_translation()
│ │ # - load_arabic_translation()
│ │ # - Rate limiting (0.5s delay)
│ │
│ ├── translation_handler.py # Translation orchestration
│ ├── bilingual_analyzer.py # Bilingual analysis support
│ ├── bilingual_contract_summary_generator.py
│ ├── data_extraction.py # Contract information extraction
│ ├── risk_analysis.py # Risk assessment logic
│ ├── summarization.py # Text summarization
│ └── generate_summaries.py # Batch summary generation
│
├── 💬 prompts/ # Prompt engineering templates
│ ├── extraction_prompt.txt # Information extraction prompts
│ ├── risk_prompt.txt # Risk analysis prompts
│ └── summarization_prompt.txt # Summarization prompts
│
├── 📊 data/ # Data storage directories
│ ├── parsed/ # Parsed contract JSON files
│ │ # Format: filename_YYYYMMDD_HHMMSS.json
│ │ # Contains: {text, language, metadata}
│ │
│ ├── analysed_documents/ # Gemini analysis output files
│ │ # Format: filename_analysis_YYYYMMDD_HHMMSS.txt
│ │ # Contains: Full contract analysis text
│ │ # Note: Files with -arabic.txt are translations
│ │
│ ├── contract_summaries/ # Generated summaries
│ │ # Auto-generated by ContractSummaryGenerator
│ │
│ ├── english/ # English contract samples
│ │ ├── sample_contract_1.pdf
│ │ └── sample_contract_2.docx
│ │
│ ├── arabic/ # Arabic contract samples
│ │ ├── sample_contract_1.pdf
│ │ └── sample_contract_2.docx
│ │
│ └── old_runs/ # Historical analysis runs
│
├── 🧪 tests/ # Test suite
│ │ # Contains unit & integration tests
│ │ # - test_imports.py
│ │ # - test_data_extraction.py
│ │ # - test_risk_analysis.py
│ │ # - test_gemini_api.py
│ │ # Run with: pytest tests/
│
├── 📋 requirements.txt # Python dependencies
├── 🔒 .env # Environment variables (GEMINI_API_KEY)
├── .gitignore # Git ignore file
└── 📖 README.md # This file
Follow these steps to set up VERDICT on a new machine without a virtual environment:
- Clone the repository
git clone https://github.com/thm-msror/Unsupervised-CLM.git
cd Unsupervised-CLM
- Install Python dependencies (uses the global interpreter)
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
python -m pip install deep-translator python-magic-bin
deep-translator (Arabic translation) and python-magic-bin (file type detection on Windows) are required at runtime; add them to the requirements file if your environment does not already include them.
- Create the .env file in the project root with your own credentials
GEMINI_API_KEY=AIza...your_key_here...
STREAMLIT_SERVER_PORT=8501
- Launch the Streamlit app
streamlit run main.py
- (Optional) Verify setup
pytest
python tests/test_env.py
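As a quick sanity check on the .env step, a sketch along these lines can confirm that the file defines a non-empty key. It uses only the standard library. `read_env_file` and `has_gemini_key` are hypothetical helpers, not part of the repository, and the parser deliberately ignores quoting and multi-line values that a full dotenv parser would handle.

```python
from pathlib import Path


def read_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def has_gemini_key(env: dict) -> bool:
    """True when GEMINI_API_KEY is present and non-empty."""
    return bool(env.get("GEMINI_API_KEY"))


if __name__ == "__main__":
    # Reads .env from the current working directory (assumed to be the repo root).
    print("GEMINI_API_KEY set:", has_gemini_key(read_env_file()))
```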
┌─────────────────────────────────────────────────────────────────┐
│ 1. DOCUMENT UPLOAD │
│ - User uploads PDF/DOCX file │
│ - File stored in session state │
└─────────────────┬───────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 2. DOCUMENT PARSING (doc_reader.py) │
│ - Extract text from PDF/DOCX │
│ - Detect language (Arabic/English) │
│ - Save to data/parsed/filename_YYYYMMDD_HHMMSS.json │
│ Output: {text: str, language: str, metadata: dict} │
└─────────────────┬───────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 3. GEMINI ANALYSIS (analyze_text_with_gemini) │
│ - Create temporary JSON with parsed data │
│ - Call analysis.analyze_document() │
│ - Use Gemini 2.5 Pro Preview model │
│ - Generate 7-section structured analysis: │
│ • Contract Overview │
│ • Key Parties and Roles │
│ • Important Dates and Deadlines │
│ • Legal Framework and Governing Law │
│ • Main Obligations and Responsibilities │
│ • Financial Terms and Conditions │
│ • Risk Analysis and Potential Issues │
│ - Save to data/analysed_documents/filename_analysis_*.txt │
└─────────────────┬───────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 4. PARALLEL PROCESSING (Threading) │
│ ┌─────────────────────┐ ┌──────────────────────────┐ │
│ │ RAG Index Building │ │ Summary Generation │ │
│ │ (build_rag_index) │ │ (generate_summary) │ │
│ │ │ │ │ │
│ │ • Split analysis │ │ • Group related docs │ │
│ │ into sections │ │ • Call Gemini for │ │
│ │ • Build TF-IDF │ │ comprehensive summary │ │
│ │ vectors │ │ • Save to │ │
│ │ • Enable MMR search │ │ contract_summaries/ │ │
│ │ • Save with joblib │ │ │ │
│ └─────────────────────┘ └──────────────────────────┘ │
└─────────────────┬───────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 5. DISPLAY RESULTS (2-Tab Interface) │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Tab 1: Full Analysis │ │
│ │ - Dark blue styled text box (#0d1933 background) │ │
│ │ - Complete Gemini analysis with all 7 sections │ │
│ │ - Scrollable with max-height 600px │ │
│ └─────────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Tab 2: Arabic Translation │ │
│ │ - "Translate to Arabic" button (if not exists) │ │
│ │ - Uses translation_utils.translate_to_arabic() │ │
│ │ - Chunks text (4500 chars) with rate limiting │ │
│ │ - Saves as filename-arabic.txt │ │
│ │ - RTL display with Arabic styling │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────┬───────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 6. Q&A CHATBOT (answer_contract_question) │
│ - User asks question about contract │
│ - Load RAG index from joblib file │
│ - TF-IDF search for relevant sections (top_k=3) │
│ - Apply MMR diversification │
│ - Generate answer using Gemini with context │
│ - Display in chat interface with history │
└─────────────────────────────────────────────────────────────────┘
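The chunked translation in Tab 2 of step 5 (4500-character chunks with a delay between requests) can be sketched roughly as below. This is an illustration under assumptions, not the code in translation_utils.py: `split_into_chunks` and `translate_in_chunks` are hypothetical names, the split prefers line breaks, and the translator is passed in as a plain callable so the sketch runs without network access.

```python
import time
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int = 4500) -> List[str]:
    """Split text into pieces of at most max_chars, preferring line breaks."""
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n"):
        # Hard-split any single line that is longer than the limit.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = paragraph if not current else current + "\n" + paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def translate_in_chunks(
    text: str,
    translate: Callable[[str], str],
    delay_s: float = 0.5,
    max_chars: int = 4500,
) -> str:
    """Translate chunk by chunk, sleeping between requests as simple rate limiting."""
    chunks = split_into_chunks(text, max_chars)
    translated = []
    for i, chunk in enumerate(chunks):
        translated.append(translate(chunk))
        if i < len(chunks) - 1:
            time.sleep(delay_s)
    return "\n".join(translated)
```

With deep-translator installed, the `translate` argument could be, for example, `GoogleTranslator(source="auto", target="ar").translate`; whether the repository wires it exactly this way is an assumption.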
Streamlit uses st.session_state to persist data across reruns:
# Main page session state
st.session_state.analysis_results # List of all analyzed contracts
st.session_state.current_analysis # Currently displayed analysis
st.session_state.rag_index # RAG index for Q&A
st.session_state.rag_built # Flag: RAG ready
st.session_state.chat_history # Q&A conversation history
# Upload page session state (separate namespace)
st.session_state.upload_analysis_results
st.session_state.upload_current_analysis
st.session_state.upload_rag_index
st.session_state.upload_rag_built
st.session_state.upload_chat_history
Upload → Parse → Analyze → Translate
- Upload: contract.pdf
- Parse: parsed/contract_20250126_120000.json
- Analyze: analysed_documents/contract_analysis_20250126_120030.txt
- Translate: analysed_documents/contract_analysis_20250126_120030-arabic.txt
Loading Logic (main.py load_saved_analyses_simple()):
- Scans the data/analysed_documents/ directory
- Loads only files matching the pattern *_analysis_*.txt
- Skips Arabic translations (files ending with -arabic.txt)
- Sorts by timestamp (newest first)
- Displays in sidebar with "✅ Loaded Contract Analysis #X from disk"
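The scan, filter, and sort steps of the loading logic can be sketched as follows. This is a minimal illustration, not the body of load_saved_analyses_simple(): the function name `find_saved_analyses` and the regular expression are assumptions based on the filename format documented for data/analysed_documents/.

```python
import re
from pathlib import Path
from typing import List

# Matches the documented suffix: <name>_analysis_YYYYMMDD_HHMMSS.txt
TIMESTAMP_RE = re.compile(r"_analysis_(\d{8}_\d{6})\.txt$")


def find_saved_analyses(directory: str = "data/analysed_documents") -> List[Path]:
    """Return analysis files, newest first, skipping Arabic translations."""
    base = Path(directory)
    if not base.is_dir():
        return []
    candidates = []
    for path in base.glob("*_analysis_*.txt"):
        if path.name.endswith("-arabic.txt"):
            continue  # translations are not listed as separate analyses
        match = TIMESTAMP_RE.search(path.name)
        if match:
            candidates.append((match.group(1), path))
    # YYYYMMDD_HHMMSS strings sort chronologically as plain text.
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [path for _, path in candidates]
```

Sorting on the timestamp embedded in the filename, rather than on file modification time, keeps the order stable if files are copied between machines.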
Comprehensive metrics tracking for LLM performance and quality assessment.
{
'session_id': '20250126_120000', # Timestamp-based unique ID
'session_start': '2025-01-26T12:00:00', # ISO timestamp
'session_end': '2025-01-26T12:05:30', # ISO timestamp
'session_duration_minutes': 5.5 # Total elapsed time
}
{
'documents_processed': 10, # Total attempted
'successful_analyses': 9, # Successfully completed
'failed_analyses': 1, # Failed processing
'total_processing_time': 450.2 # Cumulative seconds
}
API Response Times:
- Tracks all Gemini API call latencies
- Calculates average, median, p95, p99
- Identifies performance bottlenecks
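The latency statistics listed above could be computed along these lines. The nearest-rank percentile used here is one of several common definitions and is an assumption; the names `percentile` and `summarize_latencies` are hypothetical and do not come from analysis_metrics.py.

```python
import math
import statistics
from typing import Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_latencies(latencies_s: List[float]) -> Dict[str, float]:
    """Average, median, p95 and p99 of API call latencies in seconds."""
    if not latencies_s:
        raise ValueError("no latencies recorded")
    return {
        "average": statistics.fmean(latencies_s),
        "median": statistics.median(latencies_s),
        "p95": percentile(latencies_s, 95),
        "p99": percentile(latencies_s, 99),
    }
```

With only a handful of calls per session, p95 and p99 collapse to the slowest observations, so they become informative only once a session contains many API calls.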
Throughput:
throughput_docs_per_minute = successful_analyses / (total_time / 60)
Success Rate:
success_rate = (successful_analyses / documents_processed) * 100
Structure Compliance:
- Checks if analysis follows 7-section format
- Validates presence of all required sections
- Calculation: sections_found / 7
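A sketch of the sections_found / 7 calculation, using the seven section titles from the analysis pipeline. Detecting a section by case-insensitive substring match is an assumption made for illustration; the actual check in analysis_metrics.py may look at headings or numbering instead.

```python
from typing import Iterable

REQUIRED_SECTIONS = (
    "Contract Overview",
    "Key Parties and Roles",
    "Important Dates and Deadlines",
    "Legal Framework and Governing Law",
    "Main Obligations and Responsibilities",
    "Financial Terms and Conditions",
    "Risk Analysis and Potential Issues",
)


def structure_compliance(
    analysis_text: str, sections: Iterable[str] = REQUIRED_SECTIONS
) -> float:
    """Fraction of required section titles found in the analysis text."""
    titles = tuple(sections)
    lowered = analysis_text.lower()
    found = sum(1 for title in titles if title.lower() in lowered)
    return found / len(titles)
```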
Completeness Scores:
- Measures analysis depth and thoroughness
- Checks for minimum content length per section
- Penalties for missing information
Extraction Accuracy:
- Validates extracted entities (parties, dates, amounts)
- Cross-references with source text
- Pattern matching for legal terms
Consistency Scores:
- Checks internal logical consistency
- Validates date sequences (effective < expiration)
- Cross-references between sections
Document Sizes:
{
'document_sizes': [2500, 3200, 1800], # Character counts
'avg_document_size': 2500,
'median_document_size': 2500
}
Language Distribution:
{
'languages_detected': {
'en': 7, # 7 English contracts
'ar': 3 # 3 Arabic contracts
}
}
Complexity Scores:
- Based on document length, legal jargon density, clause count
- Formula: (char_count / 1000) * jargon_ratio * clause_density
Each analyzed document gets detailed tracking:
{
'filename': 'contract.pdf',
'text_length': 2500,
'language': 'en',
'processing_time': 45.2,
'api_calls': 3,
'tokens_used': 5000,
'errors': [],
'analysis_quality': {
'structure_compliance': 1.0,
'completeness': 0.95,
'extraction_accuracy': 0.92,
'consistency': 0.98
},
'success': True,
'timestamp': '2025-01-26T12:01:30'
}
avg_response_time = sum(api_response_times) / len(api_response_times)
quality_score = (
structure_compliance * 0.25 +
completeness * 0.30 +
extraction_accuracy * 0.25 +
consistency * 0.20
)
tokens_per_char = total_tokens / total_chars_processed
efficiency_score = 1.0 / tokens_per_char # Higher is better
JSON Export:
metrics.save_metrics('data/metrics/session_20250126_120000.json')
Console Dashboard:
=== VERDICT Contract Analysis Session ===
Session ID: 20250126_120000
Duration: 5.5 minutes
Processing:
- Documents: 10 processed
- Success: 9 (90.0%)
- Failed: 1 (10.0%)
Performance:
- Avg Response Time: 4.2s
- Throughput: 1.8 docs/min
- Total Tokens: 50,000
Quality (avg):
- Structure: 0.95
- Completeness: 0.92
- Accuracy: 0.89
- Consistency: 0.96
- Overall: 0.93
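As a worked check, the "Overall" line in the dashboard follows from the weighted quality formula (0.25, 0.30, 0.25, 0.20) applied to the four averages shown. The helper name `overall_quality` is hypothetical; the weights and scores are the ones given in this document.

```python
QUALITY_WEIGHTS = {
    "structure_compliance": 0.25,
    "completeness": 0.30,
    "extraction_accuracy": 0.25,
    "consistency": 0.20,
}


def overall_quality(scores: dict) -> float:
    """Weighted sum of the four quality dimensions (the weights sum to 1.0)."""
    return sum(scores[name] * weight for name, weight in QUALITY_WEIGHTS.items())


# Averages taken from the console dashboard example.
dashboard_scores = {
    "structure_compliance": 0.95,
    "completeness": 0.92,
    "extraction_accuracy": 0.89,
    "consistency": 0.96,
}

# 0.2375 + 0.2760 + 0.2225 + 0.1920 = 0.928, displayed as 0.93
print(round(overall_quality(dashboard_scores), 2))
```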
- TF-IDF Vectorization:
  - Converts text to numerical vectors
  - Term frequency × Inverse document frequency
  - Captures importance of terms in corpus
- MMR Diversification:
  - Maximal Marginal Relevance algorithm
  - Balances relevance with diversity
  - Prevents redundant results
  - Formula: MMR = λ × Similarity(q, d) - (1-λ) × max Similarity(d, r) over r in R, where q is the query, d is a candidate section, and R is the set of sections already selected
- LEGAL_PATTERNS:
  - Pre-defined regex patterns for contract entities
  - Extracts: parties, dates, amounts, obligations, termination, renewal
  - Pattern-based extractive answers
- Hybrid Search:
  - Extractive: Direct text extraction with patterns
  - Generative: Gemini-generated answers with context
  - Combined approach for best results
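To make the TF-IDF and MMR steps concrete, here is a small standard-library sketch. It is not the TfidfIndex class from rag_model.py: the tokenizer, the smoothed IDF variant, and the function names are assumptions chosen so the example runs without scikit-learn.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: List[str]) -> Tuple[List[Dict[str, float]], Dict[str, float]]:
    """Return (document vectors, idf) using raw term frequency x smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]
    return vectors, idf


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def mmr_search(query: str, docs: List[str], top_k: int = 3, lam: float = 0.7) -> List[int]:
    """Return indices of up to top_k documents chosen by Maximal Marginal Relevance."""
    vectors, idf = tfidf_vectors(docs)
    q_vec = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(query)).items()}
    relevance = [cosine(q_vec, v) for v in vectors]
    selected: List[int] = []
    remaining = list(range(len(docs)))
    while remaining and len(selected) < top_k:
        def mmr_score(i: int) -> float:
            # Penalise similarity to anything already selected.
            redundancy = max((cosine(vectors[i], vectors[j]) for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Setting `lam=1.0` reduces this to plain relevance ranking; lower values trade relevance for diversity, so a near-duplicate of an already selected section is pushed down the list.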
build_index(documents):
- Splits documents into searchable chunks
- Builds TF-IDF matrix
- Saves index with joblib for persistence
search(query, top_k=5, use_mmr=True):
- Vectorizes query
- Computes cosine similarity
- Applies MMR if enabled
- Returns ranked results with scores
ask(question, use_extractive=True, use_generative=True):
- Searches relevant context
- Tries extractive answer first (pattern matching)
- Falls back to generative (Gemini) if needed
- Returns answer with source citations
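The extractive-first, generative-fallback flow of ask() might look roughly like the sketch below. The two regular expressions and the keyword hints are invented for illustration and are not the repository's LEGAL_PATTERNS; the `generate` callable stands in for a Gemini call so the example runs offline.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical patterns for illustration; the real LEGAL_PATTERNS may differ.
EXAMPLE_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "amount": re.compile(r"\$\s?\d[\d,]*(?:\.\d+)?(?:\s?/\s?\w+)?"),
    "date": re.compile(
        r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? \d{4}\b"
    ),
}

# Crude keyword routing from question wording to a pattern (also illustrative).
QUESTION_HINTS = {
    "amount": ("rate", "price", "fee", "cost", "pay"),
    "date": ("when", "date", "effective"),
}


def extractive_answer(question: str, passages: List[str]) -> Optional[dict]:
    """Try to answer by pattern matching; return None when nothing matches."""
    q = question.lower()
    for kind, hints in QUESTION_HINTS.items():
        if any(hint in q for hint in hints):
            for idx, passage in enumerate(passages):
                match = EXAMPLE_PATTERNS[kind].search(passage)
                if match:
                    return {"answer": match.group(0), "sources": [idx], "method": "extractive"}
    return None


def ask(question: str, passages: List[str], generate: Callable[[str], str]) -> dict:
    """Extractive answer first; fall back to the generative callable with context."""
    hit = extractive_answer(question, passages)
    if hit is not None:
        return hit
    context = "\n\n".join(passages)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return {
        "answer": generate(prompt),
        "sources": list(range(len(passages))),
        "method": "generative",
    }
```

Returning the passage indices alongside the answer is what allows the chat interface to show source citations for either path.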
- 📄 Secure Contract Upload & Processing
  Multi-page Streamlit app with dedicated Upload page and Results display
- 🔍 AI-Powered Comprehensive Analysis
  7-section structured analysis using Gemini 2.5 Pro Preview
- 💬 RAG-Powered Q&A
  TF-IDF + MMR search with Gemini-generated answers
- 🌐 Arabic Translation
  Full translation with chunking, rate limiting, and RTL display
- 📊 Contract Summaries
  Automatic summary generation for contract groups
- 💾 Persistent Storage
  File-based storage with automatic loading on app restart
- Python 3.8+ — Main runtime
- Internet connection — For Google Gemini API access
- Google AI API Key — Free tier available at Google AI Studio
# Install Python dependencies
pip install -r requirements.txt
# Verify installation
python --version
pip list | grep google-generativeai
Step 1: Clone & Install
git clone https://github.com/thm-msror/Unsupervised-CLM.git
cd Unsupervised-CLM
pip install -r requirements.txt
Step 2: Get Google Gemini API Key
# 1. Go to Google AI Studio: https://makersuite.google.com/app/apikey
# 2. Sign in with Google account
# 3. Click "Create API Key"
# 4. Copy the API key (starts with AIza...)
Step 3: Configure Environment
# Create a .env file
New-Item -ItemType File -Path .env -Force
# Add your Gemini API key:
echo "GEMINI_API_KEY=your_api_key_here" >> .env
💡 Free Tier: Google Gemini offers generous free limits: 15 req/min, 1500 req/day, 1M tokens/month
Step 4: Run Application
# Run from repository root
python -m streamlit run .\main.py --server.port 8501
# Open your browser at: http://localhost:8501
🔑 API Key Security: Your Gemini API key is used client-side and not stored permanently
📁 Structure Note: The main entry point is main.py at the repository root for simple deployment
For hackathon demos, deploy on Render using cloud-hosted LLMs:
Step 1: Connect Repository
- Link your GitHub repository to Render
Step 2: Configure Environment
- Add environment variables from .env to the Render dashboard
- Run python -m dotenv
- Include API keys, model endpoints, and configuration
Step 3: Set Start Command
streamlit run main.py --server.port $PORT --server.address 0.0.0.0
🔒 Security: Add your GEMINI_API_KEY as an environment variable in Render dashboard
# Run all tests
pytest tests/
# Run specific test modules
pytest tests/test_data_extraction.py -v
pytest tests/test_risk_analysis.py -v
# Install development dependencies
pip install -r requirements.txt
This application uses Google Gemini 2.5 Flash for contract analysis:
| Feature | Specification |
|---|---|
| Model | gemini-2.5-flash (latest) |
| Context Window | 1M tokens (input) |
| Speed | Optimized for low-latency |
| Free Tier | 15 req/min, 1500 req/day, 1M tokens/month |
Get Your Free API Key:
- Visit Google AI Studio
- Sign in with your Google account
- Click "Create API Key"
- Copy the key (starts with
AIza...)
Add to Environment:
# Create .env file
echo "GEMINI_API_KEY=AIza..." >> .env
Quick Test Script:
# Run the test script
python tests/test_gemini_api.py
Manual Test with Python:
import google.generativeai as genai
# Configure API
genai.configure(api_key="your_api_key_here")
model = genai.GenerativeModel('gemini-2.5-flash')
# Test contract analysis
contract_text = "Agreement between ABC Corp and XYZ Ltd, effective Jan 2025, $150/hour"
prompt = f"Extract key information from this contract: {contract_text}"
response = model.generate_content(prompt)
print(response.text)
- 🆓 Free Tier: Generous limits for development and demos
- ⚡ Fast: Optimized for quick responses
- 🧠 Smart: Excellent at structured data extraction
- 📄 Contract-Ready: Great performance on legal document analysis
{
"session_id": "20251025_232811",
"session_duration_minutes": 4.75,
"documents_processed": 10,
"successful_analyses": 10,
"failed_analyses": 0,
"success_rate": "100%"
}
| Metric | Value |
|---|---|
| Total Processing Time | 239.96 seconds (~4 minutes) |
| Average Response Time | 23.71 seconds per contract |
| Throughput | 2.5 documents/minute |
| Fastest Analysis | 17.68 seconds (21.5K chars, English) |
| Slowest Analysis | 33.11 seconds (26.6K chars, Arabic) |
Languages:
- 🇸🇦 Arabic: 5 contracts (50%)
- 🇺🇸 English: 5 contracts (50%)
Contract Types Detected:
- Employment Agreements: 6 (60%)
- Purchase Agreements: 2 (20%)
- Software Development: 1 (10%)
Document Size Range:
- Smallest: 13,034 characters
- Largest: 35,482 characters
- Average: 24,285 characters
- Median: 24,922 characters
| Quality Dimension | Score | Performance |
|---|---|---|
| Structure Compliance | 1.00 | ✅ Perfect (100%) |
| Completeness | 1.00 | ✅ Perfect (100%) |
| Extraction Accuracy | 0.53 | 53% |
| Consistency | 1.00 | ✅ Perfect (100%) |
Per-Document Quality Breakdown:
Document #1 (Arabic, 17K chars): Structure ✅ | Complete ✅ | Accuracy 44% | Consistent ✅
Document #2 (Arabic, 28K chars): Structure ✅ | Complete ✅ | Accuracy 56% | Consistent ✅
Document #3 (Arabic, 27K chars): Structure ✅ | Complete ✅ | Accuracy 67% | Consistent ✅
Document #4 (Arabic, 13K chars): Structure ✅ | Complete ✅ | Accuracy 33% | Consistent ✅
Document #5 (English, 22K chars): Structure ✅ | Complete ✅ | Accuracy 44% | Consistent ✅
Document #6 (English, 35K chars): Structure ✅ | Complete ✅ | Accuracy 67% | Consistent ✅
Document #7 (English, 33K chars): Structure ✅ | Complete ✅ | Accuracy 78% | Consistent ✅
Document #8 (English, 16K chars): Structure ✅ | Complete ✅ | Accuracy 33% | Consistent ✅
Document #9 (English, 28K chars): Structure ✅ | Complete ✅ | Accuracy 56% | Consistent ✅
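As a quick arithmetic check, the aggregate Extraction Accuracy of 0.53 is the mean of the nine per-document values listed above, and the 33% to 78% spread is their minimum and maximum.

```python
import statistics

# Per-document extraction accuracy (%), Documents #1 to #9 as listed above.
accuracy_pct = [44, 56, 67, 33, 44, 67, 78, 33, 56]

mean_accuracy = statistics.fmean(accuracy_pct) / 100  # 478 / 9 / 100
print(f"mean accuracy: {mean_accuracy:.2f}")
print(f"range: {min(accuracy_pct)}% to {max(accuracy_pct)}%")
```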
| API Metric | Value |
|---|---|
| Total Tokens Processed | 68,837 tokens |
| Average Tokens/Document | 7,648 tokens |
| Smallest Request | 4,721 tokens |
| Largest Request | 10,543 tokens |
| API Success Rate | 100% (9/9 API calls)* |
*One document processed without API call (cached or alternative method)
Average Contract Complexity: 0.577 (Medium-High)
Complexity Distribution:
- High (>0.7): 1 contract (Arabic MSP Agreement - 0.746)
- Medium (0.5-0.7): 6 contracts
- Low (<0.5): 3 contracts
Complexity Formula:
complexity = (char_count / 1000) * jargon_ratio * clause_density
Fastest: 17.677s ━━━━━━━━━━━━━━━━━━░░░░░░░░
Average: 23.707s ━━━━━━━━━━━━━━━━━━━━━━━━░░
Slowest: 33.105s ━━━━━━━━━━━━━━━━━━━━━━━━━━
Distribution:
15-20s: ███ (3 contracts)
20-25s: ████ (4 contracts)
25-30s: ░ (0 contracts)
30-35s: ██ (2 contracts)
✅ Strengths:
- Perfect structure compliance - all analyses follow 7-section format
- 100% success rate - no failures or errors
- Consistent quality across English and Arabic contracts
- Good throughput (2.5 docs/min) for comprehensive analysis
⚠️ Areas for Improvement:
- Extraction accuracy varies (33% - 78% range)
- Arabic contracts slightly slower (avg 25.5s vs 21.2s for English)
- Some contracts have lower information density detection
🎯 Recommendations:
- Optimize extraction accuracy through better prompt engineering
- Implement caching for repeated contract patterns
- Consider parallel processing for batch operations
- Fine-tune information density detection algorithms
Based on the latest session metrics:
- Uptime: 100% (no crashes or system failures)
- API Reliability: 100% (all API calls successful)
- Data Integrity: 100% (all files saved correctly)
- Processing Consistency: Stable response times across session
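- Bilingual Support: Identical structure, completeness, and consistency scores on English and Arabic contracts; Arabic analyses average about 4 seconds slower (25.5s vs 21.2s)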