Data Processor CLI - Python

A professional command-line tool for data processing, analysis, and transformation. Built with Python, featuring CSV/JSON/Excel/Parquet support, data validation, statistical analysis, and automated reporting.

Features

  • Multi-format support (CSV, JSON, Excel, Parquet)
  • Data cleaning and transformation
  • Statistical analysis and aggregations
  • Data validation with custom rules
  • Automated report generation
  • Export to multiple formats
  • Parallel processing for large datasets
  • Logging and error handling
  • Configuration file support
  • Interactive mode

Tech Stack

  • Language: Python 3.9+
  • CLI Framework: Click
  • Data Processing: Pandas, NumPy
  • Validation: Pydantic
  • Visualization: Matplotlib, Seaborn
  • Excel Support: openpyxl
  • Testing: pytest

Installation

# Clone the repository
git clone https://github.com/simonbrizuela/data-processor-cli-python.git
cd data-processor-cli-python

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install in development mode
pip install -e .

Usage

Basic Commands

# Process CSV file
dataproc process input.csv --output output.csv --clean

# Convert formats
dataproc convert input.csv output.json --format json

# Generate statistics
dataproc stats data.csv --output report.txt

# Validate data
dataproc validate data.csv --schema schema.json

# Merge multiple files
dataproc merge file1.csv file2.csv --output combined.csv

# Filter data
dataproc filter data.csv --column age --operator gt --value 25

Advanced Examples

# Clean and transform data with multiple operations
dataproc process sales.csv \
  --clean \
  --remove-duplicates \
  --fill-missing mean \
  --output clean_sales.csv

# Generate comprehensive analysis report
dataproc analyze data.csv \
  --output report.html \
  --visualize \
  --include-correlations

# Batch processing with config file
dataproc batch --config pipeline.yaml

# Interactive mode
dataproc interactive data.csv

Configuration

Create a config.yaml file:

input:
  format: csv
  encoding: utf-8
  delimiter: ','

processing:
  remove_duplicates: true
  handle_missing: 'drop'
  normalize: true

output:
  format: csv
  compression: gzip

logging:
  level: INFO
  file: processor.log
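
A config file like the one above can be loaded with PyYAML's `safe_load`; this is a plausible sketch of how the tool might read it, not its actual loader:

```python
import yaml

# Parse a fragment of the config shown above. In the real tool this
# would come from a file: yaml.safe_load(open("config.yaml"))
config_text = """
processing:
  remove_duplicates: true
  handle_missing: drop
output:
  format: csv
"""
config = yaml.safe_load(config_text)
print(config["processing"]["remove_duplicates"])  # True
print(config["output"]["format"])                 # csv
```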

Commands Reference

process

Process and clean data files

Options:

  • --clean - Remove invalid rows
  • --remove-duplicates - Remove duplicate entries
  • --fill-missing - Fill missing values (mean, median, mode, forward, backward)
  • --normalize - Normalize numerical columns
  • --output - Output file path
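
The `--fill-missing` strategies map naturally onto pandas operations. A minimal sketch, assuming pandas; the helper name `fill_missing` is illustrative, not the tool's internal API:

```python
import pandas as pd

def fill_missing(df: pd.DataFrame, strategy: str) -> pd.DataFrame:
    """Fill missing values using one of the strategies listed above."""
    if strategy == "mean":
        return df.fillna(df.mean(numeric_only=True))
    if strategy == "median":
        return df.fillna(df.median(numeric_only=True))
    if strategy == "mode":
        return df.fillna(df.mode().iloc[0])
    if strategy == "forward":
        return df.ffill()
    if strategy == "backward":
        return df.bfill()
    raise ValueError(f"unknown strategy: {strategy}")

df = pd.DataFrame({"age": [25, None, 35]})
print(fill_missing(df, "mean")["age"].tolist())     # [25.0, 30.0, 35.0]
print(fill_missing(df, "forward")["age"].tolist())  # [25.0, 25.0, 35.0]
```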

convert

Convert between file formats

Supported formats: CSV, JSON, Excel, Parquet

stats

Generate statistical analysis

Includes: mean, median, mode, std, min, max, quartiles
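
These statistics correspond closely to pandas built-ins; a quick sketch of the underlying calls (sample data is illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 2, 4])
print(s.mean())      # 2.25
print(s.median())    # 2.0
print(s.mode()[0])   # 2
print(s.describe())  # count, mean, std, min, quartiles, max in one call
```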

validate

Validate data against schema

filter

Filter data based on conditions

Operators: eq, ne, gt, lt, ge, le, contains, startswith
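
One way the operator names above could translate to pandas predicates; this mapping is a sketch, not the tool's actual source:

```python
import operator
import pandas as pd

# Comparison operators apply element-wise to a Series; string operators
# use the .str accessor. na=False treats missing values as non-matches.
OPS = {
    "eq": operator.eq, "ne": operator.ne,
    "gt": operator.gt, "lt": operator.lt,
    "ge": operator.ge, "le": operator.le,
    "contains": lambda s, v: s.str.contains(v, na=False),
    "startswith": lambda s, v: s.str.startswith(v, na=False),
}

def filter_df(df, column, op, value):
    return df[OPS[op](df[column], value)]

df = pd.DataFrame({"age": [20, 30, 40], "name": ["Ann", "Bob", "Al"]})
print(filter_df(df, "age", "gt", 25)["age"].tolist())             # [30, 40]
print(filter_df(df, "name", "startswith", "A")["name"].tolist())  # ['Ann', 'Al']
```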

merge

Merge multiple datasets

Options:

  • --on - Column to merge on
  • --how - Merge type (inner, outer, left, right)
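
The `--on` and `--how` options mirror pandas `DataFrame.merge`; a sketch with hypothetical data showing the difference between `inner` and `outer`:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"]})
right = pd.DataFrame({"id": [2, 3], "score": [90, 80]})

inner = left.merge(right, on="id", how="inner")  # rows present in both
outer = left.merge(right, on="id", how="outer")  # all rows, NaN where absent

print(inner["id"].tolist())          # [2]
print(sorted(outer["id"].tolist()))  # [1, 2, 3]
```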

aggregate

Perform aggregations

Functions: sum, mean, count, min, max, std
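
An aggregation like this typically reduces to a pandas `groupby`; the column names below are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "north", "south"],
    "sales": [100, 200, 50],
})

# Apply several of the functions listed above per group at once.
summary = df.groupby("region")["sales"].agg(["sum", "mean", "count"])
print(summary)
```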

Project Structure

data-processor-cli-python/
├── src/
│   └── dataproc/
│       ├── __init__.py
│       ├── cli.py
│       ├── processor.py
│       ├── validator.py
│       ├── analyzer.py
│       ├── converter.py
│       ├── utils.py
│       └── config.py
├── tests/
│   ├── __init__.py
│   ├── test_processor.py
│   ├── test_validator.py
│   └── test_analyzer.py
├── examples/
│   ├── sample_data.csv
│   └── config.yaml
├── requirements.txt
├── setup.py
└── README.md

Data Validation Schema

Example schema.json:

{
  "columns": {
    "name": {
      "type": "string",
      "required": true,
      "min_length": 2
    },
    "age": {
      "type": "integer",
      "required": true,
      "min": 0,
      "max": 120
    },
    "email": {
      "type": "string",
      "pattern": "^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+$"
    },
    "salary": {
      "type": "float",
      "min": 0
    }
  }
}
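
A minimal sketch of checking one record against a schema in this shape; the actual validator (Pydantic-based, per the tech stack) is richer, and `validate_record` is a hypothetical name:

```python
import re

# Subset of the schema.json shown above, as a Python dict.
SCHEMA = {
    "columns": {
        "age": {"type": "integer", "required": True, "min": 0, "max": 120},
        "email": {"type": "string",
                  "pattern": r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"},
    }
}

def validate_record(record: dict, schema: dict) -> list:
    """Return a list of human-readable rule violations (empty if valid)."""
    errors = []
    for col, rules in schema["columns"].items():
        value = record.get(col)
        if value is None:
            if rules.get("required"):
                errors.append(f"{col}: missing required value")
            continue
        if rules.get("min") is not None and value < rules["min"]:
            errors.append(f"{col}: below min {rules['min']}")
        if rules.get("max") is not None and value > rules["max"]:
            errors.append(f"{col}: above max {rules['max']}")
        if "pattern" in rules and not re.match(rules["pattern"], str(value)):
            errors.append(f"{col}: does not match pattern")
    return errors

print(validate_record({"age": 30, "email": "a@b.co"}, SCHEMA))  # []
print(validate_record({"age": 150}, SCHEMA))  # age above max; email optional
```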

Error Handling

The tool provides detailed error messages and logging:

# Enable debug logging
dataproc process data.csv --log-level DEBUG

# Save logs to file
dataproc process data.csv --log-file processing.log

Performance

  • Chunked processing for large files
  • Parallel processing with multiprocessing
  • Memory-efficient operations
  • Progress bars for long operations

# Process large file in chunks
dataproc process large_file.csv --chunk-size 10000

# Use parallel processing
dataproc process data.csv --parallel --workers 4
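
Under the hood, chunked processing of this kind usually builds on pandas' `chunksize` reader; a sketch with an in-memory buffer standing in for a large CSV on disk:

```python
import io
import pandas as pd

# Ten rows of data; on disk this would be the large CSV file.
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))

total = 0
for chunk in pd.read_csv(csv_data, chunksize=4):  # read 4 rows at a time
    total += chunk["value"].sum()                 # process each chunk

print(total)  # 45
```

Because each chunk is processed and discarded, memory use stays bounded by the chunk size rather than the file size.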

Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=dataproc --cov-report=html

# Run specific test file
pytest tests/test_processor.py

Examples

Clean Sales Data

from dataproc import DataProcessor

processor = DataProcessor('sales.csv')
processor.remove_duplicates()
processor.fill_missing('mean')
processor.filter_by_date('date', '2024-01-01', '2024-12-31')
processor.save('clean_sales.csv')

Generate Report

from dataproc import Analyzer

analyzer = Analyzer('data.csv')
report = analyzer.generate_report()
analyzer.create_visualizations()
analyzer.export_report('report.html')

Contributing

Contributions are welcome! Please read CONTRIBUTING.md for details.

License

MIT License

Author

Simon Brizuela
