A Python command-line tool that uses LLMs to extract structured information from contract documents and outputs the data as JSON.
This tool processes contract text files and extracts key information including:
- Contract summary and type
- Parties involved (with addresses and roles)
- Effective dates and duration
- Contract scope
- Important clauses (categorized by type)
- Governing law
- Total contract value
The extraction uses Azure AI Foundry's GPT-oss-120b model via LangChain with structured output to ensure consistent, schema-validated results.
- Structured Extraction: Uses Pydantic models to ensure consistent output format
- Smart Date Parsing: Handles various date formats and converts to ISO 8601
- Clause Categorization: Automatically categorizes clauses into predefined types
- ISO Standards: Uses ISO 3166-1 alpha-2 country codes and ISO 8601 duration format
- Token Counting: Reports token usage for each contract processed
- Automatic Directory Creation: Creates output directory if it doesn't exist
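The structured-extraction flow above can be sketched with Pydantic alone. The model and field names below mirror the sample output later in this README but are illustrative, not the tool's exact schema:

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class Location(BaseModel):
    address: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None
    country: str = Field(description="ISO 3166-1 alpha-2 country code")


class Party(BaseModel):
    name: str
    location: Location
    role: str


class Contract(BaseModel):
    summary: str
    contract_type: str
    parties: List[Party]
    effective_date: Optional[str] = Field(
        None, description="ISO 8601 date (YYYY-MM-DD)"
    )
    duration: Optional[str] = Field(
        None, description="ISO 8601 duration, e.g. P1Y"
    )


# Validating a dict (such as a parsed LLM reply) enforces the schema:
contract = Contract.model_validate({
    "summary": "Annual hosting services agreement.",
    "contract_type": "Hosting",
    "parties": [
        {"name": "Acme Corp", "location": {"country": "US"}, "role": "Provider"}
    ],
    "effective_date": "2023-01-01",
    "duration": "P1Y",
})
print(contract.parties[0].location.country)
```

In the tool itself, a schema like this is handed to LangChain's `.with_structured_output(...)` so the model's reply is parsed and validated automatically.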
- Clone this repository or download the files
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up your environment variables:
  - Copy `.env.example` to `.env`
  - Fill in your Azure AI Foundry credentials
Create a `.env` file with the following variables:

```
FOUNDRY_MODEL_ENDPOINT=https://your-foundry-endpoint.azure.com/openai/v1
OPENAI_API_KEY=your-api-key-here
DEPLOYMENT_NAME=gpt-oss-120b
```

See `.env.example` for a template.
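For illustration only, the `KEY=value` format that `python-dotenv` reads can be parsed with a few lines of standard-library code. This is a sketch of the file format, not the tool's actual loading logic (the tool uses `load_dotenv()`):

```python
def parse_env(text):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split at the first "=" so values may themselves contain "="
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


env = parse_env(
    "# Azure AI Foundry credentials\n"
    "FOUNDRY_MODEL_ENDPOINT=https://your-foundry-endpoint.azure.com/openai/v1\n"
    "OPENAI_API_KEY=your-api-key-here\n"
    "DEPLOYMENT_NAME=gpt-oss-120b\n"
)
print(env["DEPLOYMENT_NAME"])
```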
Run the tool with the path to a contract text file:

```bash
python import_contract.py --input path/to/contract.txt
```

For example:

```bash
python import_contract.py --input "contracts/my_contract.txt"
```

Run `python import_contract.py --help` to see all available options.

The tool generates JSON files in the `output/` directory with the naming pattern:

```
<input_filename>_extracted.json
```
```json
{
  "summary": "Contract summary...",
  "contract_type": "Service",
  "parties": [
    {
      "name": "Company Name",
      "location": {
        "address": "123 Street",
        "city": "City",
        "state": "State",
        "country": "US"
      },
      "role": "Provider"
    }
  ],
  "effective_date": "2023-01-01",
  "contract_scope": "Description of what the contract covers...",
  "duration": "P1Y",
  "end_date": "2024-01-01",
  "total_amount": 100000.00,
  "governing_law": {
    "country": "US"
  },
  "clauses": [
    {
      "summary": "Clause summary...",
      "clause_type": "Renewal & Termination"
    }
  ],
  "file_id": "contract_filename"
}
```

The tool categorizes clauses into the following types:
- Renewal & Termination: Contract duration, renewal terms, termination conditions
- Confidentiality & Non-Disclosure: Protection of confidential information, NDAs
- Non-Compete & Exclusivity: Non-compete agreements, exclusivity provisions
- Liability & Indemnification: Liability limits, indemnification, warranties, penalties
- Service-Level Agreements: Performance metrics, service standards, quality requirements
Supported contract types include:
- Affiliate Agreement
- Development
- Distributor
- Endorsement
- Franchise
- Hosting
- IP
- Joint Venture
- License Agreement
- Maintenance
- Manufacturing
- Marketing
- Non Compete/Solicit
- Outsourcing
- Promotion
- Reseller
- Service
- Sponsorship
- Strategic Alliance
- Supply
- Transportation
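The clause and contract categories above can be pinned down as enums so the structured output cannot drift outside the allowed values. A minimal sketch (class and member names assumed; only a few contract types shown):

```python
from enum import Enum


class ClauseType(str, Enum):
    RENEWAL_TERMINATION = "Renewal & Termination"
    CONFIDENTIALITY = "Confidentiality & Non-Disclosure"
    NON_COMPETE = "Non-Compete & Exclusivity"
    LIABILITY = "Liability & Indemnification"
    SLA = "Service-Level Agreements"


class ContractType(str, Enum):
    AFFILIATE = "Affiliate Agreement"
    HOSTING = "Hosting"
    SERVICE = "Service"
    SUPPLY = "Supply"
    # ...remaining types from the list above


# Looking up by value raises ValueError for anything outside the enum:
clause = ClauseType("Renewal & Termination")
print(clause.name)
```

When fields of a Pydantic model are typed with enums like these, any out-of-vocabulary category returned by the LLM fails validation instead of silently passing through.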
- Python 3.8+
- Azure AI Foundry access with GPT-oss-120b deployment
- See `requirements.txt` for Python package dependencies
If you see authentication errors, verify:
- Your `.env` file exists and contains valid credentials
- The `FOUNDRY_MODEL_ENDPOINT` URL is correct
- The `OPENAI_API_KEY` is valid and not expired

Ensure the input file path is correct and the file exists. Use quotes around paths with spaces:

```bash
python import_contract.py --input "contracts/file with spaces.txt"
```

For better results:
- Ensure the input file is clean text (not scanned images)
- Remove excessive formatting or special characters
- Verify the contract is in English (or adjust prompts for other languages)
This project is provided as-is for contract analysis purposes.
To improve extraction quality, you can:
- Adjust field descriptions in the Pydantic models
- Refine the extraction prompt in the `process_contract()` function
- Add additional clause types or contract types to the enums
The code in this example repo is based on an approach to processing CUAD_v1 described by Tomaz Bratanic on his blog.