ETL Tool - JSON to CSV Converter

A simple command-line ETL (Extract, Transform, Load) tool that processes JSON data and outputs CSV format.

AI Usage Disclosure

This project was developed using minimal AI assistance:

  • Code Development: Hand-written with GitHub Copilot autocomplete assistance for efficiency
  • Initial Documentation: Generated using GitHub Copilot as a starting point
  • Final Implementation: All code logic, architecture decisions, and documentation were reviewed, enhanced, and manually edited to ensure quality and accuracy
  • Approach: AI was used as a productivity tool, with the author retaining full control over all code

This approach reflects real-world development practices where AI tools enhance developer productivity while requiring human expertise for architecture, validation, and quality assurance.

Requirements

  • Python 3.7+ (uses type hints)
  • No external dependencies required (uses only Python standard library)

Design Decisions & Trade-offs

This tool was developed in about an hour for the interview task. Key design decisions:

Date Format Handling

  • Decision: Default to US format (MM/DD/YYYY) based on sample data containing "12/31/2025"
  • Flexible: Supports both US and EU formats with the --date-format option (see the sketch after this list)
  • Limitation: For production, would implement more robust date parsing with locale detection or configuration files
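
A minimal sketch of how the ambiguous-date preference might work (parse_date and its exact behavior are illustrative, not the tool's actual API):

from datetime import datetime

def parse_date(date_str: str, preference: str = "us") -> datetime:
    """Try the preferred slash-delimited format first, then the other."""
    formats = ("%m/%d/%Y", "%d/%m/%Y") if preference == "us" else ("%d/%m/%Y", "%m/%d/%Y")
    for fmt in formats:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {date_str}")

parse_date("12/31/2025")  # datetime(2025, 12, 31) - only valid as month-first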

Validation Approach

  • Decision: Implemented basic structural validation for required fields
  • Rationale: The time constraint favored essential validation over comprehensive schema validation or performance tuning
  • Production Enhancement: Would use libraries like jsonschema or pydantic for more robust validation
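
A hedged sketch of what such basic structural checks might look like (the function name is illustrative; field names come from the schema shown later in this README):

def validate_structure(data: dict) -> None:
    """Check that the top-level required fields are present."""
    result = data.get("result")
    if not isinstance(result, dict):
        raise ValueError("Missing required field: result")
    for field in ("customerId", "customerName", "projects"):
        if field not in result:
            raise ValueError(f"Missing required field: result.{field}")
    if not isinstance(result["projects"], list) or not result["projects"]:
        raise ValueError("result.projects must be a non-empty list")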

Testing Strategy

  • Decision: No unit tests included in this submission
  • Rationale: Internal task with time constraints; focused on core functionality and error handling
  • Production Enhancement: Would implement a comprehensive test suite covering edge cases, date parsing, and validation scenarios for external use or critical automated processes

Data Model Approach

  • Decision: Used simple dictionary-based data handling instead of formal data models
  • Rationale:
    • Time constraints favored direct JSON-to-dict approach
    • Avoided external dependencies for easier deployment
    • Single-use transformation didn't justify model complexity
  • Trade-offs: Less type safety and validation but simpler implementation
  • Production Enhancement: Would implement proper data models for larger systems

Benefits of data models for production:

  • Type Safety: Compile-time error detection
  • Automatic Validation: Built-in field validation and type conversion
  • Documentation: Self-documenting code structure
  • IDE Support: Better autocomplete and refactoring
  • Serialization: Easy JSON/API integration
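
For illustration, a minimal sketch of what such a model could look like using the standard-library dataclasses module (field names taken from the CSV columns; pydantic would add automatic validation on top):

from dataclasses import dataclass

@dataclass
class Sample:
    """Illustrative data model for one flattened CSV row."""
    sampleId: str
    customerSampleId: str
    latitude: float
    longitude: float
    sampleCollectedDateTime: str  # ISO 8601, e.g. "2025-01-13T14:32:00Z"
    sampleProcessedDateTime: str  # ISO 8601
    processingTimeDays: int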

Usage

Basic Usage

python etl_tool.py --input /path/to/input.json

This will output the CSV data to stdout (console).

Save to File

python etl_tool.py --input /path/to/input.json --output /path/to/output.csv

Example with Provided Sample Data

python etl_tool.py --input sample_input.json

Or save to file:

python etl_tool.py --input sample_input.json --output output.csv

Expected Output

The tool processes the JSON input and produces CSV output with the following columns:

  • customerId
  • customerName
  • projectId
  • projectName
  • sampleId
  • customerSampleId
  • latitude
  • longitude
  • sampleCollectedDateTime (ISO 8601 format)
  • sampleProcessedDateTime (ISO 8601 format)
  • processingTimeDays (calculated number of whole days between collection and processing)

Date Format Options

# Default US format (MM/DD/YYYY for ambiguous dates)
python etl_tool.py --input sample_input.json

# US format explicit
python etl_tool.py --input sample_input.json --date-format us

# European format (DD/MM/YYYY for ambiguous dates)
python etl_tool.py --input sample_input.json --date-format eu

Features

  • Simple and Extensible: Clean separation of Extract, Transform, Load phases
  • JSON Structure Validation: Validates input JSON matches expected schema
  • Error Handling: Clear error messages for missing files, invalid JSON, and malformed data
  • Flexible Date Parsing: Handles various date formats including ordinal suffixes
  • Date Format Preference: Configurable EU/US date format preference for ambiguous dates
  • Coordinate Parsing: Extracts latitude/longitude from coordinate strings (see the sketch after this list)
  • Processing Time Calculation: Automatically calculates days between collection and processing
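
Hedged sketches of the last two features (parse_coordinates appears in the test examples further down; processing_time_days is an illustrative name, with behavior inferred from the expected output below):

from datetime import datetime
from typing import Tuple

def parse_coordinates(value: str) -> Tuple[float, float]:
    """Split a "lat,lon" string into a (latitude, longitude) pair."""
    lat, lon = value.split(",")
    return float(lat), float(lon)

def processing_time_days(collected_iso: str, processed_iso: str) -> int:
    """Whole days between collection and processing (ISO 8601 inputs)."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return (datetime.strptime(processed_iso, fmt)
            - datetime.strptime(collected_iso, fmt)).days

parse_coordinates("57.889,-5.182")                                    # (57.889, -5.182)
processing_time_days("2025-01-13T14:32:00Z", "2025-01-20T14:32:00Z")  # 7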

Architecture

The tool follows a simple ETL pattern:

  1. Extract: Read and parse JSON file
  2. Transform: Convert nested JSON structure to flat CSV rows
  3. Load: Output CSV data to stdout or file

Each phase is implemented as a separate function for maintainability and extensibility.
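
A sketch of that shape (function names and the elided details are illustrative):

import csv
import json
import sys
from typing import Dict, List

def extract(path: str) -> Dict:
    """Extract: read and parse the JSON input file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def transform(data: Dict) -> List[Dict]:
    """Transform: flatten the nested customer/project/sample structure."""
    rows = []
    result = data["result"]
    for project in result["projects"]:
        for sample in project["samples"]:
            rows.append({
                "customerId": result["customerId"],
                "customerName": result["customerName"],
                "projectId": project["projectId"],
                "projectName": project["projectName"],
                "sampleId": sample["sampleId"],
                # ... coordinates, date parsing, processing time, etc.
            })
    return rows

def load(rows: List[Dict], output=sys.stdout) -> None:
    """Load: write the flat rows as CSV to stdout or a file handle."""
    writer = csv.DictWriter(output, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)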

Error Handling

The tool handles common errors gracefully:

  • File not found
  • Invalid JSON format
  • Invalid date/time formats
  • Missing required data fields
  • Malformed JSON structure (validates against expected schema)
  • Empty or invalid data arrays
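
A hedged sketch of how the first two cases are commonly caught with the standard library (exact messages are illustrative):

import json
import sys

def read_json(path: str) -> dict:
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        sys.exit(f"Error: input file not found: {path}")
    except json.JSONDecodeError as e:
        sys.exit(f"Error: invalid JSON in {path}: {e}")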

JSON Structure Validation

The tool validates that the input JSON follows the expected structure:

{
  "result": {
    "customerId": "required",
    "customerName": "required",
    "projects": [
      {
        "projectId": "required",
        "projectName": "required",
        "samples": [
          {
            "sampleId": "required",
            "customerSampleId": "required",
            "location": {
              "coordinates": "required"
            },
            "sampleCollectedDate": "required",
            "sampleCollectedTime": "required",
            "sampleProcessedDate": "required",
            "sampleProcessedTime": "required"
          }
        ]
      }
    ]
  }
}

Clear error messages are provided when required fields are missing or have incorrect types.

Scalability Considerations

Large File Handling (>1GB)

For production environments with large files, several enhancements would be implemented:

Current Limitations:

  • Loads entire JSON into memory at once
  • Processes all samples in memory before writing output
  • Single-threaded processing
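
One way the first limitation could be addressed is incremental parsing, for example with the third-party ijson package (a sketch under that assumption; the current tool deliberately stays stdlib-only):

import ijson  # third-party; not a dependency of the current tool

def stream_projects(path: str):
    """Yield one project at a time instead of loading the whole file."""
    with open(path, "rb") as f:
        # "result.projects.item" walks each element of the projects array
        for project in ijson.items(f, "result.projects.item"):
            yield project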

Additional Optimizations:

  • Parallel Processing: Use multiprocessing for CPU-intensive transformations
  • Progress Indicators: Add progress bars for long-running operations
  • Memory Monitoring: Implement memory usage tracking and warnings
  • Error Recovery: Checkpointing for resumable processing

Production Enhancements

Date Handling Improvements

  • Locale Detection: Auto-detect date format from data patterns
  • Configuration Files: YAML/JSON config for date format rules
  • Multiple Format Support: Handle mixed date formats within same file
  • Timezone Support: Parse and convert timezone information
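
For the timezone point, a minimal sketch using zoneinfo (standard library from Python 3.9, so it would raise the tool's current 3.7 floor):

from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

def to_utc_iso(local: datetime, tz_name: str) -> str:
    """Attach a named timezone, convert to UTC, format as ISO 8601."""
    aware = local.replace(tzinfo=ZoneInfo(tz_name))
    return aware.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

to_utc_iso(datetime(2025, 1, 13, 14, 32), "Europe/London")  # "2025-01-13T14:32:00Z"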

Testing Strategy

# Example test structure that would be implemented
# (assumes the helpers below are importable from etl_tool):
import pytest

from etl_tool import parse_coordinates, parse_datetime, transform_data


class TestETLTool:
    def test_parse_coordinates_valid(self):
        assert parse_coordinates("57.889,-5.182") == (57.889, -5.182)

    def test_parse_datetime_us_format(self):
        result = parse_datetime("12/31/2025", "14:32", "us")
        assert result == "2025-12-31T14:32:00Z"

    def test_validation_missing_field(self):
        with pytest.raises(ValueError, match="Missing required field"):
            transform_data({"invalid": "data"})

Validation Enhancements

  • Schema Validation: Use jsonschema for comprehensive validation
  • Data Quality Checks: Implement business rule validation
  • Flexible Schema: Support multiple JSON schema versions
  • Custom Validators: Pluggable validation system
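
A sketch of the jsonschema route (a third-party package; the schema here is abbreviated to the top level):

from jsonschema import validate, ValidationError  # third-party

SCHEMA = {
    "type": "object",
    "required": ["result"],
    "properties": {
        "result": {
            "type": "object",
            "required": ["customerId", "customerName", "projects"],
        }
    },
}

def check(data: dict) -> None:
    try:
        validate(instance=data, schema=SCHEMA)
    except ValidationError as e:
        raise ValueError(f"Invalid input structure: {e.message}") from e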

Future Extensions

The current design makes it easy to extend for:

  • Different input formats (XML, YAML, etc.)
  • Different output formats (JSON, Excel, database, etc.)
  • Additional transformations and business logic
  • Configuration files for field mappings
  • Batch processing of multiple files
  • Plugin architecture for custom transformations

Testing

Test with the provided sample data:

python etl_tool.py --input sample_input.json

Expected output:

customerId,customerName,projectId,projectName,sampleId,customerSampleId,latitude,longitude,sampleCollectedDateTime,sampleProcessedDateTime,processingTimeDays
C01234,Nature Metrics UK Digital,PROJ01234,Example Project Name,55191_3,Rocky JunKyard A,57.889304,-5.182286,2025-01-13T14:32:00Z,2025-01-20T14:32:00Z,7
C01234,Nature Metrics UK Digital,PROJ01234,Example Project Name,55191_2,Rocky JunKyard B,57.789304,-5.182286,2025-01-14T12:32:00Z,2025-01-25T12:32:00Z,11
