A comprehensive Python-based system that fully automates Third Normal Form (3NF) database normalization using LangGraph workflow orchestration.
- ✅ Load any number of CSV and JSON files from the input folder
- ✅ Extract comprehensive metadata (datatypes, cardinality, nulls, uniqueness)
- ✅ Auto-detect Primary Keys (single & composite)
- ✅ Auto-detect Foreign Keys using pattern matching and metadata analysis
- ✅ Enforce 1NF (atomic values, no repeating groups)
- ✅ Enforce 2NF (eliminate partial dependencies)
- ✅ Enforce 3NF (eliminate transitive dependencies)
- ✅ Generate normalized table structures
- ✅ Export normalized tables as CSV/JSON
- ✅ Generate Oracle SQL DDL scripts with constraints
- ✅ Generate ERD diagrams (Graphviz/Mermaid)
- ✅ Oracle-compatible datatypes (VARCHAR2, NUMBER, TIMESTAMP, etc.)
- ✅ CREATE TABLE statements
- ✅ PRIMARY KEY constraints
- ✅ FOREIGN KEY constraints
- ✅ INDEX creation for foreign keys
- ✅ Reserved keyword sanitization
- ✅ Proper NULL/NOT NULL handling
- ✅ Syntax validated for Oracle SQL Developer
The system uses LangGraph to orchestrate a 9-node workflow (a wiring sketch follows the list):
- `load_files_node` - Load CSV/JSON files
- `extract_metadata_node` - Extract column metadata
- `profile_node` - Detect dependencies and patterns
- `detect_primary_keys_node` - Identify primary keys
- `detect_foreign_keys_node` - Detect FK relationships
- `normalize_3nf_node` - Perform normalization
- `generate_sql_node` - Generate SQL DDL
- `validate_sql_node` - Validate SQL syntax
- `export_outputs_node` - Export ERD and outputs
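A minimal wiring sketch, assuming stub nodes and an illustrative state schema (the real node bodies and state keys live in `langgraph_app.py`):

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

# Illustrative state schema -- an assumption, not the project's actual keys.
class PipelineState(TypedDict, total=False):
    tables: dict        # table name -> DataFrame
    metadata: dict      # per-column metadata
    primary_keys: dict  # table name -> key columns
    foreign_keys: list  # detected FK relationships
    sql: str            # generated DDL

NODES = ["load_files_node", "extract_metadata_node", "profile_node",
         "detect_primary_keys_node", "detect_foreign_keys_node",
         "normalize_3nf_node", "generate_sql_node", "validate_sql_node",
         "export_outputs_node"]

def make_stub(name):
    # Each real node reads the shared state and returns an updated copy.
    def node(state: PipelineState) -> PipelineState:
        print(f"running {name}")
        return state
    return node

graph = StateGraph(PipelineState)
for name in NODES:
    graph.add_node(name, make_stub(name))
graph.set_entry_point(NODES[0])
for a, b in zip(NODES, NODES[1:]):
    graph.add_edge(a, b)  # strictly linear pipeline
graph.add_edge(NODES[-1], END)

app = graph.compile()
app.invoke({})  # runs all nine nodes in order
```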
```
Data_Modelling_3NF/
├── input_files/           # Place your CSV/JSON files here
├── normalized_output/     # Generated normalized tables (CSV/JSON)
├── sql_output/            # Generated SQL DDL scripts
├── erd/                   # Generated ERD diagrams
├── main.py                # Main entry point
├── langgraph_app.py       # LangGraph workflow orchestration
├── metadata_extractor.py  # Metadata extraction module
├── auto_profiler.py       # Data profiling and dependency detection
├── fk_detector.py         # Foreign key detection
├── normalizer.py          # 3NF normalization engine
├── sql_generator.py       # SQL DDL generation
├── utils.py               # Utility functions (ERD, sanitization)
└── requirements.txt       # Python dependencies
```
```
pip install -r requirements.txt
```

Note: For ERD diagram generation with Graphviz, you need to install Graphviz separately:
- Windows: Download from https://graphviz.org/download/ and add it to PATH
- macOS: `brew install graphviz`
- Linux: `sudo apt-get install graphviz`
Place your CSV or JSON files in the `input_files/` folder:

```
# Example
input_files/
├── customers.csv
├── orders.csv
├── products.json
└── order_items.csv
```

Then run:

```
python main.py
```

The system will automatically:
- Load all files from `input_files/`
- Analyze and profile the data
- Detect keys and relationships
- Normalize to 3NF
- Generate SQL scripts
- Export normalized tables
- Create ERD diagrams
After execution, you'll find:
**Normalized tables**
- Location: `normalized_output/`
- Formats: CSV and JSON
- One file per normalized table

**SQL script**
- Location: `sql_output/normalized_schema.sql`
- Contains:
  - DROP TABLE statements (commented out)
  - CREATE TABLE statements
  - PRIMARY KEY constraints
  - FOREIGN KEY constraints
  - INDEX definitions
  - COMMIT statement

**ERD diagram**
- Location: `erd/normalized_erd.png` or `erd/normalized_erd.mmd`
- Visualizes table relationships
- Shows primary keys, foreign keys, column names, and datatypes
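Emitting the Mermaid variant needs nothing beyond string building. Below is a sketch assuming a hypothetical `tables` shape (name → list of `(column, datatype, is_pk)` tuples); the actual logic lives in `ERDGenerator`:

```python
def to_mermaid(tables: dict) -> str:
    """Render tables as a Mermaid erDiagram (illustrative sketch)."""
    lines = ["erDiagram"]
    for name, cols in tables.items():
        lines.append(f"    {name} {{")
        for col, dtype, is_pk in cols:
            suffix = " PK" if is_pk else ""
            lines.append(f"        {dtype} {col}{suffix}")
        lines.append("    }")
    return "\n".join(lines)

mmd = to_mermaid({
    "customers": [("customer_id", "NUMBER", True), ("name", "VARCHAR2", False)],
})
print(mmd)  # save as erd/normalized_erd.mmd or paste into a Mermaid renderer
```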
**Data profiling**
- Uniqueness profiles (candidate key detection)
- NULL ratio analysis
- Cardinality measurement
- Multivalued column detection
**Dependency detection** (a pandas sketch follows this list)
- Functional dependencies (X → Y)
- Partial dependencies (violate 2NF)
- Transitive dependencies (violate 3NF)
- Composite key patterns
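The core check is simple: X → Y holds when every distinct X value maps to exactly one Y value. A minimal pandas sketch (not the exact `find_functional_dependencies()` implementation):

```python
import pandas as pd

def holds_fd(df: pd.DataFrame, x: list, y: str) -> bool:
    # X -> Y holds iff each distinct X combination determines a single Y value
    return (df.groupby(x)[y].nunique(dropna=False) <= 1).all()

df = pd.DataFrame({
    "customer_id": [1, 2, 2],
    "city":        ["NY", "LA", "LA"],
    "state":       ["NY", "CA", "CA"],
})
print(holds_fd(df, ["customer_id"], "city"))  # True
print(holds_fd(df, ["city"], "state"))        # True: a transitive dependency
```

Partial and transitive violations are then found by testing FDs whose determinant is a proper subset of the candidate key (2NF) or a non-key attribute (3NF).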
**Primary key detection** (sketched below)
- High-uniqueness columns (>95%)
- Composite key combinations
- Surrogate key generation when needed
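A simplified candidate-key scan along those lines (an illustrative sketch; only the >95% uniqueness rule comes from the description above):

```python
import pandas as pd
from itertools import combinations

def candidate_keys(df: pd.DataFrame, threshold: float = 0.95, max_cols: int = 2):
    """Yield column combinations whose uniqueness ratio exceeds the threshold."""
    n = len(df)
    for r in range(1, max_cols + 1):
        for cols in combinations(df.columns, r):
            subset = df[list(cols)]
            if subset.isnull().any().any():
                continue  # key columns must be non-null
            if subset.drop_duplicates().shape[0] / n > threshold:
                yield cols

df = pd.DataFrame({"order_id": [1, 1, 2], "line_no": [1, 2, 1], "qty": [5, 5, 5]})
print(list(candidate_keys(df)))  # [('order_id', 'line_no')] -- a composite key
```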
**Foreign key detection** uses multiple strategies (a scoring sketch follows the list):
- Name similarity matching
- Value overlap analysis
- Cardinality pattern detection
- Hierarchical relationship detection
- Metadata-based matching
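The first two strategies can be folded into a single 0-100 score. The 40/60 weights below are invented for illustration; only the 50.0 default threshold appears in the real `detect_all_foreign_keys()`:

```python
from difflib import SequenceMatcher
import pandas as pd

def fk_score(child: pd.Series, parent: pd.Series) -> float:
    """Score a potential FK from a child column to a parent column (0-100)."""
    name_sim = SequenceMatcher(None, str(child.name).lower(),
                               str(parent.name).lower()).ratio()
    child_vals = set(child.dropna())
    overlap = len(child_vals & set(parent.dropna())) / max(len(child_vals), 1)
    return 100 * (0.4 * name_sim + 0.6 * overlap)  # illustrative weights

orders = pd.DataFrame({"customer_id": [1, 2, 2]})
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
print(fk_score(orders["customer_id"], customers["customer_id"]))  # 100.0
```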
**1NF enforcement** (worked example below)
- Splits multivalued columns into separate tables
- Ensures atomic values
- Eliminates repeating groups
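For example, a comma-separated `skills` column can be exploded into a child table roughly like this (a sketch, not the actual `enforce_1nf()` code):

```python
import pandas as pd

employees = pd.DataFrame({
    "emp_id": [1, 2],
    "skills": ["Python,SQL,Java", "Excel,Communication"],
})

# 1NF: one atomic value per row -> move the multivalued column to a child table
skills = (employees.assign(skill=employees["skills"].str.split(","))
                   .explode("skill")[["emp_id", "skill"]])
employees = employees.drop(columns=["skills"])
print(skills)  # (1, Python), (1, SQL), (1, Java), (2, Excel), (2, Communication)
```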
**2NF enforcement** (decomposition sketch below)
- Detects partial dependencies
- Extracts dependent attributes
- Creates new tables for partial keys
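Conceptually, given a composite key, any attribute determined by only part of it is moved to a new table keyed by that part (a hypothetical helper, not the actual `enforce_2nf()`):

```python
import pandas as pd

def split_partial_dependency(df, partial_key, dependents):
    """Move columns that depend on part of a composite key into their own table."""
    new_table = df[partial_key + dependents].drop_duplicates()
    remaining = df.drop(columns=dependents)
    return remaining, new_table

order_items = pd.DataFrame({
    "order_id":     [1, 1, 2],
    "product_id":   [10, 20, 10],
    "qty":          [2, 1, 5],
    "product_name": ["Pen", "Pad", "Pen"],  # depends on product_id alone
})
items, products = split_partial_dependency(
    order_items, ["product_id"], ["product_name"])
print(products)  # (10, Pen), (20, Pad) -- keyed by product_id
```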
- **Generalized Semantic Rules** - No hardcoded domain logic; works on any dataset
- **Functional Dependency Driven** - Creates tables based on PK→A→B chains, not value repetition
- **Semantic Entity Detection** (confidence-score sketch after this list):
  - Analyzes cardinality patterns (min 2% uniqueness, 10+ unique values)
  - Checks attribute diversity (requires 2+ stable functional dependencies)
  - Detects contact/structural information (email, phone, address)
  - Calculates confidence scores (40%+ threshold for entity creation)
  - Classifies entity types: master_entity, reference_entity, lookup_entity
- **Multi-Row Pattern Detection**:
  - Event/history tables (temporal columns + duplicate IDs)
  - Status-change tables (state columns varying per ID)
  - Line-item tables (sequence columns + parent references)
- **Structured Field Atomization**:
  - Detects concatenated addresses → splits into street, city, state, zip
  - Identifies JSON columns → extracts key-value pairs
  - Recognizes full names → separates first/middle/last
- **Primary Key Intelligence**:
  - Prefers natural keys when unique and non-null
  - Never uses foreign keys as primary keys
  - Generates surrogate keys only when necessary
  - Validates functional dependencies before key assignment
- **Attribute Placement Validation**:
  - Ensures columns belong to a table based on functional dependency, not duplication frequency
  - Identifies alternative keys when attributes are misplaced
  - Preserves all original attributes in the final schema
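A loose sketch of the confidence-score idea; the individual weights below are invented for illustration, and only the 2%/10+/2+/40% thresholds come from the description above:

```python
import pandas as pd

CONTACT_HINTS = ("email", "phone", "address")  # structural-information signals

def entity_confidence(df: pd.DataFrame, col: str, stable_fds: int) -> float:
    """Score how likely `col` identifies a standalone entity (0-100)."""
    uniq = df[col].nunique()
    score = 0.0
    if uniq / len(df) >= 0.02 and uniq >= 10:  # cardinality gate
        score += 30
    if stable_fds >= 2:                        # attribute-diversity gate
        score += 40
    if any(h in c.lower() for c in df.columns for h in CONTACT_HINTS):
        score += 30
    return score

df = pd.DataFrame({"customer_id": range(100), "email": ["x"] * 100})
print(entity_confidence(df, "customer_id", stable_fds=2))  # 100.0 -> entity
```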
The generated DDL uses Oracle-compatible datatypes (a mapping sketch follows):
- `NUMBER(10)` for integers
- `NUMBER(15,2)` for decimals
- `VARCHAR2(n)` for strings
- `TIMESTAMP` for datetimes
- `DATE` for dates
- `CHAR(1)` for booleans
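A rough dtype-driven sketch of such a mapping (illustrative; `DatatypeMapper` is the project's actual implementation, and the DATE vs. TIMESTAMP distinction is omitted):

```python
import pandas as pd

def oracle_type(series: pd.Series) -> str:
    """Map a pandas column to a rough Oracle datatype (illustrative)."""
    if pd.api.types.is_bool_dtype(series):
        return "CHAR(1)"
    if pd.api.types.is_integer_dtype(series):
        return "NUMBER(10)"
    if pd.api.types.is_float_dtype(series):
        return "NUMBER(15,2)"
    if pd.api.types.is_datetime64_any_dtype(series):
        return "TIMESTAMP"
    width = int(series.astype(str).str.len().max())  # size VARCHAR2 to the data
    return f"VARCHAR2({width})"

print(oracle_type(pd.Series([1, 2, 3])))     # NUMBER(10)
print(oracle_type(pd.Series(["NY", "LA"])))  # VARCHAR2(2)
```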
Automatically sanitizes Oracle reserved words:
- `SELECT` → `SELECT_col`
- `DATE` → `DATE_col`
- `TABLE` → `TABLE_col`
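Roughly (a sketch; the real `KeywordSanitizer` covers the full reserved-word list):

```python
# Tiny excerpt -- Oracle reserves hundreds of words
ORACLE_RESERVED = {"SELECT", "DATE", "TABLE", "ORDER", "GROUP", "LEVEL"}

def sanitize_identifier(name: str) -> str:
    """Append a suffix when a column name collides with a reserved word."""
    return f"{name}_col" if name.upper() in ORACLE_RESERVED else name

print(sanitize_identifier("date"))   # date_col
print(sanitize_identifier("email"))  # email
```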
Example input (`customers.csv`):

```
customer_id,name,email,city,state,country
1,John Doe,john@example.com,New York,NY,USA
2,Jane Smith,jane@example.com,Los Angeles,CA,USA
```

Normalized output (`customers.csv`):

```
customer_id,name,email,city_id
1,John Doe,john@example.com,1
2,Jane Smith,jane@example.com,2
```

Normalized output (`customers_city_ref.csv`):

```
city,state,country,city_id
New York,NY,USA,1
Los Angeles,CA,USA,2
```
Generated DDL:

```sql
CREATE TABLE customers (
    customer_id NUMBER(10) NOT NULL,
    name VARCHAR2(100) NOT NULL,
    email VARCHAR2(200),
    city_id NUMBER(10),
    CONSTRAINT pk_customers PRIMARY KEY (customer_id)
);

CREATE TABLE customers_city_ref (
    city VARCHAR2(100) NOT NULL,
    state VARCHAR2(50),
    country VARCHAR2(100),
    city_id NUMBER(10) NOT NULL,
    CONSTRAINT pk_customers_city_ref PRIMARY KEY (city_id)
);

ALTER TABLE customers
    ADD CONSTRAINT fk_customers_1
    FOREIGN KEY (city_id)
    REFERENCES customers_city_ref (city_id);
```

Edit the relevant modules to adjust detection sensitivity:
`fk_detector.py`:

```python
# Adjust FK detection threshold (default: 50.0)
foreign_keys = fk_detector.detect_all_foreign_keys(threshold=60.0)
```

`auto_profiler.py`:

```python
# Adjust uniqueness threshold for PK detection (default: 0.95)
if uniqueness_ratio > 0.90:  # Lower threshold
    cardinality = "high"
```

To support additional file formats, extend `metadata_extractor.py`:
```python
# Inside the file-loading dispatch; .xlsx needs openpyxl, .parquet needs pyarrow
elif file_path.suffix.lower() == '.xlsx':
    df = pd.read_excel(file_path)
elif file_path.suffix.lower() == '.parquet':
    df = pd.read_parquet(file_path)
```

Troubleshooting:
- **No input files found**: Ensure CSV/JSON files are in the `input_files/` folder
- **ERD generation fails**: Install the Graphviz system package or use the generated Mermaid file
- **SQL validation errors**: Check the validation output and verify datatype compatibility
- **Memory issues with large files**: Process files in batches or increase the Python memory limit
- `MetadataExtractor` - Main class for metadata extraction
  - `extract_all_metadata()` - Process all files in the folder
  - `infer_datatype()` - Infer Oracle datatypes
- `AutoProfiler` - Data profiling and dependency detection
  - `find_functional_dependencies()` - Detect FDs
  - `detect_partial_dependencies()` - Find 2NF violations
  - `detect_transitive_dependencies()` - Find 3NF violations
- `ForeignKeyDetector` - Foreign key relationship detection
  - `detect_all_foreign_keys()` - Scan all tables for FKs
  - `calculate_name_similarity()` - Name-based FK detection
- `Normalizer` - 3NF normalization engine
  - `enforce_1nf()` - First normal form
  - `enforce_2nf()` - Second normal form
  - `enforce_3nf()` - Third normal form
- `SQLGenerator` - SQL DDL script generation
  - `generate_create_table_script()` - CREATE TABLE statements
  - `generate_foreign_key_constraints()` - FK constraints
  - `sanitize_identifier()` - Reserved keyword handling
- `ERDGenerator` - Entity-relationship diagram generation
- `KeywordSanitizer` - SQL keyword sanitization
- `SurrogateKeyGenerator` - Surrogate key generation
- `DatatypeMapper` - Datatype conversion utilities
Create test files in `input_files/`:

`employees.csv`:

```
emp_id,name,dept,manager_id,skills
1,Alice,IT,NULL,"Python,SQL,Java"
2,Bob,HR,1,"Excel,Communication"
```

`departments.csv`:

```
dept_name,location,budget
IT,Building A,100000
HR,Building B,50000
```

Run the system and verify:
- The `skills` column is split out (1NF violation)
- Department details are extracted (potential 3NF violation)
- An FK is detected between employees and departments
This project is provided as-is for educational and commercial use.
To extend this system:
- Add new detection algorithms in `fk_detector.py`
- Implement additional normalization rules in `normalizer.py`
- Support new SQL dialects in `sql_generator.py`
- Add visualization options in `utils.py`
For issues or questions:
- Check the troubleshooting section
- Review the module documentation
- Examine the generated logs in console output
- Fully Automated: No manual normalization required
- Scalable: Handles large numbers of files (200+ tested)
- Production-Ready: Generates executable SQL
- Extensible: Modular architecture for customization
- Well-Documented: Comprehensive inline documentation
- Best Practices: Follows database normalization theory
Ready to normalize your data? Just run `python main.py`! 🚀