A simple command-line ETL (Extract, Transform, Load) tool that processes JSON data and outputs CSV format.
This project was developed using minimal AI assistance:
- Code Development: Hand-written with GitHub Copilot autocomplete assistance for efficiency
- Initial Documentation: Generated using GitHub Copilot as a starting point
- Final Implementation: All code logic, architecture decisions, and documentation were reviewed, enhanced, and manually edited to ensure quality and accuracy
- Approach: AI was used as a productivity tool while maintaining full control over the code
This approach reflects real-world development practices where AI tools enhance developer productivity while requiring human expertise for architecture, validation, and quality assurance.
- Python 3.7+ (uses type hints)
- No external dependencies required (uses only Python standard library)
This tool was developed in about an hour for the interview task. Key design decisions:
- Decision: Default to US format (MM/DD/YYYY) based on sample data containing "12/31/2025"
- Flexible: Supports both US and EU formats with the --date-format option (see the sketch below)
- Limitation: For production, would implement more robust date parsing with locale detection or configuration files
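A minimal sketch of how that preference might be applied (function and argument names here are illustrative, not necessarily those used in etl_tool.py):

from datetime import datetime

def parse_date(date_str: str, preference: str = "us") -> datetime:
    """Try the preferred pattern first for ambiguous MM/DD vs DD/MM dates."""
    formats = ["%m/%d/%Y", "%d/%m/%Y"] if preference == "us" else ["%d/%m/%Y", "%m/%d/%Y"]
    for fmt in formats:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {date_str}")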
- Decision: Implemented basic structural validation for required fields
- Rationale: Time constraints favored essential validation over comprehensive schema validation or performance optimization
- Production Enhancement: Would use libraries like jsonschema or pydantic for more robust validation
- Decision: No unit tests included in this submission
- Rationale: Internal task with time constraints; focused on core functionality and error handling
- Production Enhancement: Would implement a comprehensive test suite covering edge cases, date parsing, and validation scenarios before external use or integration into automated critical processes
- Decision: Used simple dictionary-based data handling instead of formal data models
- Rationale:
- Time constraints favored direct JSON-to-dict approach
- Avoided external dependencies for easier deployment
- Single-use transformation didn't justify model complexity
- Trade-offs: Less type safety and validation but simpler implementation
- Production Enhancement: Would implement proper data models for larger systems
Benefits of data models for production:
- Type Safety: Static error detection via type checkers
- Automatic Validation: Built-in field validation and type conversion
- Documentation: Self-documenting code structure
- IDE Support: Better autocomplete and refactoring
- Serialization: Easy JSON/API integration
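For illustration, a pydantic model for one output row might look like this (a sketch only; the current tool deliberately avoids this dependency):

from pydantic import BaseModel

class SampleRow(BaseModel):
    customerId: str
    customerName: str
    sampleId: str
    latitude: float
    longitude: float

# Construction validates and converts types automatically; passing
# latitude="not a number" would raise a ValidationError at creation time.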
python etl_tool.py --input /path/to/input.json
This will output the CSV data to stdout (console).
python etl_tool.py --input /path/to/input.json --output /path/to/output.csv
python etl_tool.py --input sample_input.json
Or save to file:
python etl_tool.py --input sample_input.json --output output.csv
The tool processes the JSON input and produces CSV output with the following columns:
- customerId
- customerName
- projectId
- projectName
- sampleId
- customerSampleId
- latitude
- longitude
- sampleCollectedDateTime (ISO 8601 format)
- sampleProcessedDateTime (ISO 8601 format)
- processingTimeDays (calculated difference)
# Default US format (MM/DD/YYYY for ambiguous dates)
python etl_tool.py --input sample_input.json
# US format explicit
python etl_tool.py --input sample_input.json --date-format us
# European format (DD/MM/YYYY for ambiguous dates)
python etl_tool.py --input sample_input.json --date-format eu
- Simple and Extensible: Clean separation of Extract, Transform, Load phases
- JSON Structure Validation: Validates input JSON matches expected schema
- Error Handling: Clear error messages for missing files, invalid JSON, and malformed data
- Flexible Date Parsing: Handles various date formats including ordinal suffixes
- Date Format Preference: Configurable EU/US date format preference for ambiguous dates
- Coordinate Parsing: Extracts latitude/longitude from coordinate strings
- Processing Time Calculation: Automatically calculates days between collection and processing (see the sketch below)
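A sketch of how the coordinate parsing and processing-time features might be implemented (helper names are illustrative):

from datetime import datetime
from typing import Tuple

def parse_coordinates(raw: str) -> Tuple[float, float]:
    """Split a "lat,lon" string into float latitude and longitude."""
    lat, lon = raw.split(",")
    return float(lat.strip()), float(lon.strip())

def processing_time_days(collected: datetime, processed: datetime) -> int:
    """Whole days between collection and processing."""
    return (processed - collected).days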
The tool follows a simple ETL pattern:
- Extract: Read and parse JSON file
- Transform: Convert nested JSON structure to flat CSV rows
- Load: Output CSV data to stdout or file
Each phase is implemented as a separate function for maintainability and extensibility.
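In outline, the three phases could look like this (a sketch of the structure, not the exact signatures in etl_tool.py; only a few columns are shown):

import csv
import json
import sys
from typing import Any, Dict, List

def extract(path: str) -> Dict[str, Any]:
    """Extract: read and parse the input JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def transform(data: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Transform: flatten the nested customer/project/sample structure into rows."""
    result = data["result"]
    rows = []
    for project in result["projects"]:
        for sample in project["samples"]:
            rows.append({
                "customerId": result["customerId"],
                "projectId": project["projectId"],
                "sampleId": sample["sampleId"],
                # ...remaining columns omitted in this sketch
            })
    return rows

def load(rows: List[Dict[str, Any]], out=sys.stdout) -> None:
    """Load: write the rows as CSV to stdout or an open file."""
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)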
The tool handles common errors gracefully:
- File not found
- Invalid JSON format
- Invalid date/time formats
- Missing required data fields
- Malformed JSON structure (validates against expected schema)
- Empty or invalid data arrays
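As a sketch of that behavior (reusing the phase functions from the previous section; the messages are illustrative, not the tool's exact output):

import json
import sys

def run(input_path: str) -> int:
    """Map common failures to clear messages and a non-zero exit code."""
    try:
        data = extract(input_path)  # extract/transform/load as sketched above
        rows = transform(data)
        load(rows)
    except FileNotFoundError:
        print(f"Error: file not found: {input_path}", file=sys.stderr)
        return 1
    except json.JSONDecodeError as exc:
        print(f"Error: invalid JSON: {exc}", file=sys.stderr)
        return 1
    except (KeyError, ValueError) as exc:
        print(f"Error: malformed or missing data: {exc}", file=sys.stderr)
        return 1
    return 0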
The tool validates that the input JSON follows the expected structure:
{
  "result": {
    "customerId": "required",
    "customerName": "required",
    "projects": [
      {
        "projectId": "required",
        "projectName": "required",
        "samples": [
          {
            "sampleId": "required",
            "customerSampleId": "required",
            "location": {
              "coordinates": "required"
            },
            "sampleCollectedDate": "required",
            "sampleCollectedTime": "required",
            "sampleProcessedDate": "required",
            "sampleProcessedTime": "required"
          }
        ]
      }
    ]
  }
}
Clear error messages are provided when required fields are missing or have incorrect types.
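A minimal sketch of how the top level of that check might look (the messages mirror the test expectations below, but the real implementation may differ):

from typing import Any, Dict

def validate_structure(data: Dict[str, Any]) -> None:
    """Raise ValueError with a clear message when required fields are missing."""
    if not isinstance(data.get("result"), dict):
        raise ValueError("Missing required field: result")
    result = data["result"]
    for field in ("customerId", "customerName", "projects"):
        if field not in result:
            raise ValueError(f"Missing required field: result.{field}")
    if not isinstance(result["projects"], list) or not result["projects"]:
        raise ValueError("result.projects must be a non-empty array")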
For production environments with large files, several enhancements would be implemented:
Current Limitations:
- Loads entire JSON into memory at once
- Processes all samples in memory before writing output
- Single-threaded processing
Additional Optimizations:
- Parallel Processing: Use multiprocessing for CPU-intensive transformations (see the sketch after this list)
- Progress Indicators: Add progress bars for long-running operations
- Memory Monitoring: Implement memory usage tracking and warnings
- Error Recovery: Checkpointing for resumable processing
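As an illustration of the parallel-processing idea (transform_sample is a hypothetical per-record function, not part of the current tool):

from multiprocessing import Pool
from typing import Any, Dict, List

def transform_sample(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical CPU-bound per-record transformation."""
    return {"sampleId": sample["sampleId"]}  # ...remaining columns

def transform_parallel(samples: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Fan the per-sample work out across worker processes."""
    with Pool() as pool:
        return pool.map(transform_sample, samples)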
Date Handling Enhancements:
- Locale Detection: Auto-detect date format from data patterns
- Configuration Files: YAML/JSON config for date format rules
- Multiple Format Support: Handle mixed date formats within same file
- Timezone Support: Parse and convert timezone information
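For the timezone item, the standard library's zoneinfo module is one option (a sketch; zoneinfo needs Python 3.9+, above the tool's stated 3.7 baseline, or the backports.zoneinfo package):

from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

def to_utc(naive_local: datetime, tz_name: str) -> datetime:
    """Attach a source timezone to a naive datetime and convert to UTC."""
    return naive_local.replace(tzinfo=ZoneInfo(tz_name)).astimezone(ZoneInfo("UTC"))

# e.g. to_utc(datetime(2025, 1, 13, 14, 32), "Europe/London")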
# Example test structure that would be implemented:
import pytest

# Assumes these helpers are importable from the tool module
from etl_tool import parse_coordinates, parse_datetime, transform_data

class TestETLTool:
    def test_parse_coordinates_valid(self):
        assert parse_coordinates("57.889,-5.182") == (57.889, -5.182)

    def test_parse_datetime_us_format(self):
        result = parse_datetime("12/31/2025", "14:32", "us")
        assert result == "2025-12-31T14:32:00Z"

    def test_validation_missing_field(self):
        with pytest.raises(ValueError, match="Missing required field"):
            transform_data({"invalid": "data"})
- Schema Validation: Use jsonschema for comprehensive validation (see the sketch after this list)
- Data Quality Checks: Implement business rule validation
- Flexible Schema: Support multiple JSON schema versions
- Custom Validators: Pluggable validation system
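A sketch of what the jsonschema-based check could look like (the schema here covers only the top level of the structure shown earlier):

from jsonschema import ValidationError, validate

SCHEMA = {
    "type": "object",
    "required": ["result"],
    "properties": {
        "result": {
            "type": "object",
            "required": ["customerId", "customerName", "projects"],
            "properties": {
                "projects": {"type": "array"},
            },
        },
    },
}

def validate_input(data: dict) -> None:
    """Translate jsonschema failures into the tool's ValueError convention."""
    try:
        validate(instance=data, schema=SCHEMA)
    except ValidationError as exc:
        raise ValueError(f"Invalid input JSON: {exc.message}") from exc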
The current design makes it easy to extend for:
- Different input formats (XML, YAML, etc.)
- Different output formats (JSON, Excel, database, etc.)
- Additional transformations and business logic
- Configuration files for field mappings
- Batch processing of multiple files
- Plugin architecture for custom transformations
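For instance, output formats could be made pluggable with a simple registry (a hypothetical sketch, not current behavior):

import csv
import json
import sys
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
WRITERS: Dict[str, Callable[[List[Row]], None]] = {}

def writer(name: str):
    """Register an output format under a name selectable from the CLI."""
    def decorator(fn):
        WRITERS[name] = fn
        return fn
    return decorator

@writer("csv")
def write_csv(rows: List[Row]) -> None:
    w = csv.DictWriter(sys.stdout, fieldnames=list(rows[0].keys()))
    w.writeheader()
    w.writerows(rows)

@writer("json")
def write_json(rows: List[Row]) -> None:
    json.dump(rows, sys.stdout, indent=2)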
Test with the provided sample data:
python etl_tool.py --input sample_input.json
Expected output:
customerId,customerName,projectId,projectName,sampleId,customerSampleId,latitude,longitude,sampleCollectedDateTime,sampleProcessedDateTime,processingTimeDays
C01234,Nature Metrics UK Digital,PROJ01234,Example Project Name,55191_3,Rocky JunKyard A,57.889304,-5.182286,2025-01-13T14:32:00Z,2025-01-20T14:32:00Z,7
C01234,Nature Metrics UK Digital,PROJ01234,Example Project Name,55191_2,Rocky JunKyard B,57.789304,-5.182286,2025-01-14T12:32:00Z,2025-01-25T12:32:00Z,11