A professional command-line tool for data processing, analysis, and transformation. Built with Python, it features multi-format I/O (CSV, JSON, Excel, Parquet), data validation, statistical analysis, and automated reporting.

## Features

- Multi-format support (CSV, JSON, Excel, Parquet)
- Data cleaning and transformation
- Statistical analysis and aggregations
- Data validation with custom rules
- Automated report generation
- Export to multiple formats
- Parallel processing for large datasets
- Logging and error handling
- Configuration file support
- Interactive mode

## Tech Stack

- Language: Python 3.9+
- CLI Framework: Click
- Data Processing: Pandas, NumPy
- Validation: Pydantic
- Visualization: Matplotlib, Seaborn
- Excel Support: openpyxl
- Testing: pytest

## Installation

```bash
# Clone the repository
git clone https://github.com/simonbrizuela/data-processor-cli-python.git
cd data-processor-cli-python

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install in development mode
pip install -e .
```

## Quick Start

```bash
# Process CSV file
dataproc process input.csv --output output.csv --clean

# Convert formats
dataproc convert input.csv output.json --format json

# Generate statistics
dataproc stats data.csv --output report.txt

# Validate data
dataproc validate data.csv --schema schema.json

# Merge multiple files
dataproc merge file1.csv file2.csv --output combined.csv

# Filter data
dataproc filter data.csv --column age --operator gt --value 25
```

## Advanced Usage

```bash
# Clean and transform data with multiple operations
dataproc process sales.csv \
  --clean \
  --remove-duplicates \
  --fill-missing mean \
  --output clean_sales.csv
```
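Internally, a cleaning pipeline like the one above boils down to a few pandas operations. The sketch below is an illustrative guess at the underlying steps, not the tool's actual implementation; the column names are made up:

```python
# Hypothetical sketch of --remove-duplicates and --fill-missing mean
# using pandas (illustrative only).
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "north", "south", "west"],
    "revenue": [100.0, 100.0, None, 250.0],
})

df = df.drop_duplicates()                                   # --remove-duplicates
df["revenue"] = df["revenue"].fillna(df["revenue"].mean())  # --fill-missing mean

print(df)
```

Note that the mean is computed over the *remaining* non-null values after deduplication, so the order of operations matters.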
```bash
# Generate comprehensive analysis report
dataproc analyze data.csv \
  --output report.html \
  --visualize \
  --include-correlations
```
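A correlation step like `--include-correlations` presumably computes a pairwise correlation matrix; a minimal sketch with pandas' built-in Pearson correlation (illustrative data, not the tool's code):

```python
# Sketch: pairwise correlations over the numeric columns of a dataset.
import pandas as pd

df = pd.DataFrame({
    "price": [1.0, 2.0, 3.0, 4.0],
    "units": [8, 6, 4, 2],
})

corr = df.corr()  # Pearson correlation matrix; here price and units are
print(corr)       # perfectly anti-correlated, so corr is -1.0 off-diagonal
```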
```bash
# Batch processing with config file
dataproc batch --config pipeline.yaml

# Interactive mode
dataproc interactive data.csv
```

## Configuration

Create a `config.yaml` file:

```yaml
input:
  format: csv
  encoding: utf-8
  delimiter: ','
processing:
  remove_duplicates: true
  handle_missing: 'drop'
  normalize: true
output:
  format: csv
  compression: gzip
logging:
  level: INFO
  file: processor.log
```

## Commands

### `process`

Process and clean data files.
Options:

- `--clean` - Remove invalid rows
- `--remove-duplicates` - Remove duplicate entries
- `--fill-missing` - Fill missing values (mean, median, mode, forward, backward)
- `--normalize` - Normalize numerical columns
- `--output` - Output file path
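The five fill strategies map naturally onto pandas operations. The correspondence below is an assumption about what each option does, sketched on a small Series:

```python
# Assumed mapping of the --fill-missing strategies onto pandas
# (illustrative; the actual tool may differ).
import pandas as pd

s = pd.Series([1.0, None, 3.0, None, 3.0])

filled = {
    "mean": s.fillna(s.mean()),      # average of the known values
    "median": s.fillna(s.median()),  # middle known value
    "mode": s.fillna(s.mode()[0]),   # most frequent known value
    "forward": s.ffill(),            # propagate the last seen value
    "backward": s.bfill(),           # pull the next value backwards
}

print(filled["forward"].tolist())  # [1.0, 1.0, 3.0, 3.0, 3.0]
```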
### `convert`

Convert between file formats.

Supported formats: CSV, JSON, Excel, Parquet
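Conversion amounts to pairing a pandas reader with a writer. A minimal CSV-to-JSON sketch (in-memory data for illustration; the real command reads from disk and supports more options):

```python
# Sketch of a CSV -> JSON conversion step with pandas.
import io
import pandas as pd

csv_text = "name,age\nAda,36\nGrace,45\n"
df = pd.read_csv(io.StringIO(csv_text))

# orient="records" yields one JSON object per row
json_text = df.to_json(orient="records")
print(json_text)
```

Excel and Parquet follow the same pattern via `read_excel`/`to_excel` (openpyxl) and `read_parquet`/`to_parquet`.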
### `stats`

Generate statistical analysis.

Includes: mean, median, mode, std, min, max, quartiles
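Each listed statistic has a direct pandas equivalent; the dictionary below is an assumed mapping, not the tool's output format:

```python
# Computing the statistics the stats command lists, via pandas built-ins.
import pandas as pd

s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

summary = {
    "mean": s.mean(),
    "median": s.median(),
    "mode": s.mode()[0],
    "std": s.std(),  # sample standard deviation (ddof=1)
    "min": s.min(),
    "max": s.max(),
    "quartiles": s.quantile([0.25, 0.5, 0.75]).tolist(),
}
print(summary)
```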
### `validate`

Validate data against a schema.
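The example `schema.json` later in this README uses rule keys like `required`, `min`, `max`, `min_length`, and `pattern`. The function below is a hypothetical sketch of how such per-value rule checks might be applied; it is not the tool's actual validator:

```python
# Hypothetical per-value rule checker for schema keys like those in the
# example schema.json (illustrative only).
import re

def check_value(value, rules):
    """Return a list of violated rule names for a single value."""
    errors = []
    if value is None:
        if rules.get("required"):
            errors.append("required")
        return errors  # remaining rules only apply to present values
    if "min" in rules and value < rules["min"]:
        errors.append("min")
    if "max" in rules and value > rules["max"]:
        errors.append("max")
    if "min_length" in rules and len(value) < rules["min_length"]:
        errors.append("min_length")
    if "pattern" in rules and not re.match(rules["pattern"], value):
        errors.append("pattern")
    return errors

age_rules = {"type": "integer", "required": True, "min": 0, "max": 120}
print(check_value(150, age_rules))   # ['max']
print(check_value(None, age_rules))  # ['required']
```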
### `filter`

Filter data based on conditions.

Operators: eq, ne, gt, lt, ge, le, contains, startswith
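The comparison operators map cleanly onto Python's `operator` module applied to a pandas column, with the string operators handled via the `.str` accessor. This is an assumed correspondence, sketched below:

```python
# Sketch: mapping the filter operators onto pandas boolean masks.
import operator
import pandas as pd

OPS = {
    "eq": operator.eq, "ne": operator.ne,
    "gt": operator.gt, "lt": operator.lt,
    "ge": operator.ge, "le": operator.le,
}

df = pd.DataFrame({"name": ["Ada", "Bo", "Cy"], "age": [36, 22, 51]})

def filter_df(df, column, op, value):
    if op in OPS:
        mask = OPS[op](df[column], value)
    elif op == "contains":
        mask = df[column].str.contains(value)
    elif op == "startswith":
        mask = df[column].str.startswith(value)
    else:
        raise ValueError(f"unknown operator: {op}")
    return df[mask]

# Equivalent to: dataproc filter data.csv --column age --operator gt --value 25
print(filter_df(df, "age", "gt", 25)["name"].tolist())  # ['Ada', 'Cy']
```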
### `merge`

Merge multiple datasets.

Options:

- `--on` - Column to merge on
- `--how` - Merge type (inner, outer, left, right)
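The `--on`/`--how` options mirror the parameters of `pandas.DataFrame.merge`; a sketch of the presumed mapping on toy data:

```python
# Sketch of what `merge --on id --how inner` could map to in pandas.
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["Ada", "Bo", "Cy"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [88, 92, 75]})

# inner: keep only ids present in both frames
merged = left.merge(right, on="id", how="inner")

# outer: keep every id from either frame, with NaN where data is missing
outer = left.merge(right, on="id", how="outer")

print(merged)
```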
### Aggregation

Perform aggregations.

Functions: sum, mean, count, min, max, std
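All six functions are accepted by pandas' `groupby(...).agg`, which is presumably what backs this command; an illustrative sketch with made-up data:

```python
# Sketch of the listed aggregation functions via pandas groupby.
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "sales": [10, 30, 20, 40],
})

# One row per region, one column per aggregation function
agg = df.groupby("region")["sales"].agg(["sum", "mean", "count", "min", "max", "std"])
print(agg)
```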
## Project Structure

```
data-processor-cli-python/
├── src/
│   └── dataproc/
│       ├── __init__.py
│       ├── cli.py
│       ├── processor.py
│       ├── validator.py
│       ├── analyzer.py
│       ├── converter.py
│       ├── utils.py
│       └── config.py
├── tests/
│   ├── __init__.py
│   ├── test_processor.py
│   ├── test_validator.py
│   └── test_analyzer.py
├── examples/
│   ├── sample_data.csv
│   └── config.yaml
├── requirements.txt
├── setup.py
└── README.md
```
## Data Validation

Example `schema.json`:

```json
{
  "columns": {
    "name": {
      "type": "string",
      "required": true,
      "min_length": 2
    },
    "age": {
      "type": "integer",
      "required": true,
      "min": 0,
      "max": 120
    },
    "email": {
      "type": "string",
      "pattern": "^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+$"
    },
    "salary": {
      "type": "float",
      "min": 0
    }
  }
}
```

## Error Handling and Logging

The tool provides detailed error messages and logging:
```bash
# Enable debug logging
dataproc process data.csv --log-level DEBUG

# Save logs to file
dataproc process data.csv --log-file processing.log
```

## Performance

- Chunked processing for large files
- Parallel processing with multiprocessing
- Memory-efficient operations
- Progress bars for long operations

```bash
# Process large file in chunks
dataproc process large_file.csv --chunk-size 10000

# Use parallel processing
dataproc process data.csv --parallel --workers 4
```
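Chunked processing like `--chunk-size` typically rests on pandas' `chunksize` reader, which yields the file as a sequence of small DataFrames so the whole dataset never sits in memory at once. A sketch of that pattern (in-memory CSV for illustration):

```python
# Sketch of chunked CSV processing: read and aggregate in fixed-size pieces.
import io
import pandas as pd

csv_text = "value\n" + "\n".join(str(i) for i in range(10))

total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # each chunk is a small DataFrame; aggregate incrementally
    total += int(chunk["value"].sum())

print(total)  # 45
```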
## Testing

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=dataproc --cov-report=html

# Run specific test file
pytest tests/test_processor.py
```

## Python API

```python
from dataproc import DataProcessor

processor = DataProcessor('sales.csv')
processor.remove_duplicates()
processor.fill_missing('mean')
processor.filter_by_date('date', '2024-01-01', '2024-12-31')
processor.save('clean_sales.csv')
```

```python
from dataproc import Analyzer

analyzer = Analyzer('data.csv')
report = analyzer.generate_report()
analyzer.create_visualizations()
analyzer.export_report('report.html')
```

## Contributing

Contributions are welcome! Please read CONTRIBUTING.md for details.
## License

MIT License

## Author

Simon Brizuela

- GitHub: @simonbrizuela
- Email: simonbrizuela08@gmail.com