A simple command-line ETL (Extract, Transform, Load) tool that processes JSON data and outputs CSV format.
This project was developed using minimal AI assistance:
- Code Development: Hand-written with GitHub Copilot autocomplete assistance for efficiency
- Initial Documentation: Generated using GitHub Copilot as a starting point
- Final Implementation: All code logic, architecture decisions, and documentation were reviewed, enhanced, and manually edited to ensure quality and accuracy
- Approach: AI was used as a productivity tool while maintaining full control over the code
This approach reflects real-world development practices where AI tools enhance developer productivity while requiring human expertise for architecture, validation, and quality assurance.
- Python 3.7+ (uses type hints)
- No external dependencies required (uses only Python standard library)
This tool was developed in about an hour for the interview task. Key design decisions:
- Decision: Default to US format (MM/DD/YYYY) based on sample data containing "12/31/2025"
- Flexible: Supports both US and EU formats with the --date-format option (see the sketch below)
- Limitation: For production, would implement more robust date parsing with locale detection or configuration files
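A minimal sketch of how that preference might be applied (function and argument names here are illustrative, not necessarily those used in etl_tool.py):

from datetime import datetime

def parse_date(date_str: str, preference: str = "us") -> datetime:
    """Try the preferred pattern first for ambiguous MM/DD vs DD/MM dates."""
    formats = ["%m/%d/%Y", "%d/%m/%Y"] if preference == "us" else ["%d/%m/%Y", "%m/%d/%Y"]
    for fmt in formats:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {date_str}")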
- Decision: Implemented basic structural validation for required fields
- Rationale: Time constraints favored essential validation over comprehensive schema validation or performance optimization
- Production Enhancement: Would use libraries like jsonschema or pydantic for more robust validation
- Decision: No unit tests included in this submission
- Rationale: Internal task with time constraints; focused on core functionality and error handling
- Production Enhancement: Would implement a comprehensive test suite covering edge cases, date parsing, and validation scenarios before external use or integration into automated critical processes
- Decision: Used simple dictionary-based data handling instead of formal data models
- Rationale:
- Time constraints favored direct JSON-to-dict approach
- Avoided external dependencies for easier deployment
- Single-use transformation didn't justify model complexity
- Trade-offs: Less type safety and validation but simpler implementation
- Production Enhancement: Would implement proper data models for larger systems
Benefits of data models for production:
- Type Safety: Static error detection via type checkers
- Automatic Validation: Built-in field validation and type conversion
- Documentation: Self-documenting code structure
- IDE Support: Better autocomplete and refactoring
- Serialization: Easy JSON/API integration
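For illustration, a pydantic model for one output row might look like this (a sketch only; the current tool deliberately avoids this dependency):

from pydantic import BaseModel

class SampleRow(BaseModel):
    customerId: str
    customerName: str
    sampleId: str
    latitude: float
    longitude: float

# Construction validates and converts types automatically; passing
# latitude="not a number" would raise a ValidationError at creation time.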
python etl_tool.py --input /path/to/input.json
This will output the CSV data to stdout (console).
python etl_tool.py --input /path/to/input.json --output /path/to/output.csv
python etl_tool.py --input sample_input.json
Or save to file:
python etl_tool.py --input sample_input.json --output output.csv
The tool processes the JSON input and produces CSV output with the following columns:
- customerId
- customerName
- projectId
- projectName
- sampleId
- customerSampleId
- latitude
- longitude
- sampleCollectedDateTime (ISO 8601 format)
- sampleProcessedDateTime (ISO 8601 format)
- processingTimeDays (calculated difference)
# Default US format (MM/DD/YYYY for ambiguous dates)
python etl_tool.py --input sample_input.json
# US format explicit
python etl_tool.py --input sample_input.json --date-format us
# European format (DD/MM/YYYY for ambiguous dates)
python etl_tool.py --input sample_input.json --date-format eu
- Simple and Extensible: Clean separation of Extract, Transform, Load phases
- JSON Structure Validation: Validates input JSON matches expected schema
- Error Handling: Clear error messages for missing files, invalid JSON, and malformed data
- Flexible Date Parsing: Handles various date formats including ordinal suffixes
- Date Format Preference: Configurable EU/US date format preference for ambiguous dates
- Coordinate Parsing: Extracts latitude/longitude from coordinate strings
- Processing Time Calculation: Automatically calculates days between collection and processing (see the sketch below)
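A sketch of how the coordinate parsing and processing-time features might be implemented (helper names are illustrative):

from datetime import datetime
from typing import Tuple

def parse_coordinates(raw: str) -> Tuple[float, float]:
    """Split a "lat,lon" string into float latitude and longitude."""
    lat, lon = raw.split(",")
    return float(lat.strip()), float(lon.strip())

def processing_time_days(collected: datetime, processed: datetime) -> int:
    """Whole days between collection and processing."""
    return (processed - collected).days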
The tool follows a simple ETL pattern:
- Extract: Read and parse JSON file
- Transform: Convert nested JSON structure to flat CSV rows
- Load: Output CSV data to stdout or file
Each phase is implemented as a separate function for maintainability and extensibility.
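In outline, the three phases could look like this (a sketch of the structure, not the exact signatures in etl_tool.py; only a few columns are shown):

import csv
import json
import sys
from typing import Any, Dict, List

def extract(path: str) -> Dict[str, Any]:
    """Extract: read and parse the input JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def transform(data: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Transform: flatten the nested customer/project/sample structure into rows."""
    result = data["result"]
    rows = []
    for project in result["projects"]:
        for sample in project["samples"]:
            rows.append({
                "customerId": result["customerId"],
                "projectId": project["projectId"],
                "sampleId": sample["sampleId"],
                # ...remaining columns omitted in this sketch
            })
    return rows

def load(rows: List[Dict[str, Any]], out=sys.stdout) -> None:
    """Load: write the rows as CSV to stdout or an open file."""
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)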
The tool handles common errors gracefully:
- File not found
- Invalid JSON format
- Invalid date/time formats
- Missing required data fields
- Malformed JSON structure (validates against expected schema)
- Empty or invalid data arrays
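As a sketch of that behavior (reusing the phase functions from the previous section; the messages are illustrative, not the tool's exact output):

import json
import sys

def run(input_path: str) -> int:
    """Map common failures to clear messages and a non-zero exit code."""
    try:
        data = extract(input_path)  # extract/transform/load as sketched above
        rows = transform(data)
        load(rows)
    except FileNotFoundError:
        print(f"Error: file not found: {input_path}", file=sys.stderr)
        return 1
    except json.JSONDecodeError as exc:
        print(f"Error: invalid JSON: {exc}", file=sys.stderr)
        return 1
    except (KeyError, ValueError) as exc:
        print(f"Error: malformed or missing data: {exc}", file=sys.stderr)
        return 1
    return 0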
The tool validates that the input JSON follows the expected structure:
{
  "result": {
    "customerId": "required",
    "customerName": "required",
    "projects": [
      {
        "projectId": "required",
        "projectName": "required",
        "samples": [
          {
            "sampleId": "required",
            "customerSampleId": "required",
            "location": {
              "coordinates": "required"
            },
            "sampleCollectedDate": "required",
            "sampleCollectedTime": "required",
            "sampleProcessedDate": "required",
            "sampleProcessedTime": "required"
          }
        ]
      }
    ]
  }
}
Clear error messages are provided when required fields are missing or have incorrect types.
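A minimal sketch of how the top level of that check might look (the messages mirror the test expectations below, but the real implementation may differ):

from typing import Any, Dict

def validate_structure(data: Dict[str, Any]) -> None:
    """Raise ValueError with a clear message when required fields are missing."""
    if not isinstance(data.get("result"), dict):
        raise ValueError("Missing required field: result")
    result = data["result"]
    for field in ("customerId", "customerName", "projects"):
        if field not in result:
            raise ValueError(f"Missing required field: result.{field}")
    if not isinstance(result["projects"], list) or not result["projects"]:
        raise ValueError("result.projects must be a non-empty array")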
For production environments with large files, several enhancements would be implemented:
Current Limitations:
- Loads entire JSON into memory at once
- Processes all samples in memory before writing output
- Single-threaded processing
Additional Optimizations:
- Parallel Processing: Use multiprocessing for CPU-intensive transformations (see the sketch after this list)
- Progress Indicators: Add progress bars for long-running operations
- Memory Monitoring: Implement memory usage tracking and warnings
- Error Recovery: Checkpointing for resumable processing
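As an illustration of the parallel-processing idea (transform_sample is a hypothetical per-record function, not part of the current tool):

from multiprocessing import Pool
from typing import Any, Dict, List

def transform_sample(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical CPU-bound per-record transformation."""
    return {"sampleId": sample["sampleId"]}  # ...remaining columns

def transform_parallel(samples: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Fan the per-sample work out across worker processes."""
    with Pool() as pool:
        return pool.map(transform_sample, samples)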
Date Handling Enhancements:
- Locale Detection: Auto-detect date format from data patterns
- Configuration Files: YAML/JSON config for date format rules
- Multiple Format Support: Handle mixed date formats within same file
- Timezone Support: Parse and convert timezone information
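For the timezone item, the standard library's zoneinfo module is one option (a sketch; zoneinfo needs Python 3.9+, above the tool's stated 3.7 baseline, or the backports.zoneinfo package):

from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

def to_utc(naive_local: datetime, tz_name: str) -> datetime:
    """Attach a source timezone to a naive datetime and convert to UTC."""
    return naive_local.replace(tzinfo=ZoneInfo(tz_name)).astimezone(ZoneInfo("UTC"))

# e.g. to_utc(datetime(2025, 1, 13, 14, 32), "Europe/London")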
# Example test structure that would be implemented:
import pytest

# Assumes these helpers are importable from the tool module
from etl_tool import parse_coordinates, parse_datetime, transform_data

class TestETLTool:
    def test_parse_coordinates_valid(self):
        assert parse_coordinates("57.889,-5.182") == (57.889, -5.182)

    def test_parse_datetime_us_format(self):
        result = parse_datetime("12/31/2025", "14:32", "us")
        assert result == "2025-12-31T14:32:00Z"

    def test_validation_missing_field(self):
        with pytest.raises(ValueError, match="Missing required field"):
            transform_data({"invalid": "data"})
- Schema Validation: Use jsonschema for comprehensive validation (see the sketch after this list)
- Data Quality Checks: Implement business rule validation
- Flexible Schema: Support multiple JSON schema versions
- Custom Validators: Pluggable validation system
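A sketch of what the jsonschema-based check could look like (the schema here covers only the top level of the structure shown earlier):

from jsonschema import ValidationError, validate

SCHEMA = {
    "type": "object",
    "required": ["result"],
    "properties": {
        "result": {
            "type": "object",
            "required": ["customerId", "customerName", "projects"],
            "properties": {
                "projects": {"type": "array"},
            },
        },
    },
}

def validate_input(data: dict) -> None:
    """Translate jsonschema failures into the tool's ValueError convention."""
    try:
        validate(instance=data, schema=SCHEMA)
    except ValidationError as exc:
        raise ValueError(f"Invalid input JSON: {exc.message}") from exc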
The current design makes it easy to extend for:
- Different input formats (XML, YAML, etc.)
- Different output formats (JSON, Excel, database, etc.)
- Additional transformations and business logic
- Configuration files for field mappings
- Batch processing of multiple files
- Plugin architecture for custom transformations
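For instance, output formats could be made pluggable with a simple registry (a hypothetical sketch, not current behavior):

import csv
import json
import sys
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
WRITERS: Dict[str, Callable[[List[Row]], None]] = {}

def writer(name: str):
    """Register an output format under a name selectable from the CLI."""
    def decorator(fn):
        WRITERS[name] = fn
        return fn
    return decorator

@writer("csv")
def write_csv(rows: List[Row]) -> None:
    w = csv.DictWriter(sys.stdout, fieldnames=list(rows[0].keys()))
    w.writeheader()
    w.writerows(rows)

@writer("json")
def write_json(rows: List[Row]) -> None:
    json.dump(rows, sys.stdout, indent=2)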
Test with the provided sample data:
python etl_tool.py --input sample_input.json
Expected output:
customerId,customerName,projectId,projectName,sampleId,customerSampleId,latitude,longitude,sampleCollectedDateTime,sampleProcessedDateTime,processingTimeDays
C01234,Nature Metrics UK Digital,PROJ01234,Example Project Name,55191_3,Rocky JunKyard A,57.889304,-5.182286,2025-01-13T14:32:00Z,2025-01-20T14:32:00Z,7
C01234,Nature Metrics UK Digital,PROJ01234,Example Project Name,55191_2,Rocky JunKyard B,57.789304,-5.182286,2025-01-14T12:32:00Z,2025-01-25T12:32:00Z,11