
Commit d95e5e6

docs: add comprehensive documentation, sample data, docstrings, and fix NaN handling
Add complete README with usage examples, create sample dataset with realistic data, document all functions with Google-style docstrings, and filter out NaN target values
1 parent 492ce84 commit d95e5e6

File tree: 4 files changed (+656 / -33 lines)


README.md

Lines changed: 312 additions & 33 deletions
# ML Training and Prediction CLI

A command-line tool for training Linear Regression models with automated preprocessing, comprehensive reporting, and easy prediction capabilities. Built with modern Python tooling using `uv` and scikit-learn.

## Overview

This CLI tool simplifies the end-to-end machine learning workflow for regression tasks. It handles data loading, preprocessing (missing value imputation, feature scaling, categorical encoding), model training, evaluation, and prediction, all through simple command-line commands. Training produces detailed HTML reports with visualizations to help you understand model performance and feature importance.

**Key Use Cases:**
- Quick prototyping and baseline models for regression problems
- Automated preprocessing pipelines with consistent train/test handling
- Model training with reproducible results and comprehensive reports
- Easy deployment of trained models for batch predictions

## Features

- **Training (`train` command)**:
  - Trains Linear Regression models on CSV data
  - Automatically handles numeric and categorical features
  - Missing value imputation (mean for numeric, most frequent for categorical)
  - Feature scaling with StandardScaler
  - One-hot encoding for categorical variables
  - Saves the trained model with its preprocessing pipeline for reproducibility

- **Prediction (`predict` command)**:
  - Loads saved models and applies the same preprocessing automatically
  - Validates input data against the training schema
  - Generates predictions with summary statistics
  - Outputs a CSV with the original data plus predictions

- **Automated Reporting**:
  - Generates self-contained HTML reports with embedded visualizations
  - Performance metrics (R², MSE, RMSE, MAE)
  - Feature importance (model coefficients)
  - Actual vs Predicted scatter plot
  - Residuals plot for model diagnostics
  - Coefficients bar chart

- **Data Quality Handling**:
  - Automatic detection and imputation of missing values
  - Validation of feature names and data types
  - Clear error messages for data issues
## Installation

1. **Install uv** (if you haven't already):
   ```bash
   # macOS/Linux
   curl -LsSf https://astral.sh/uv/install.sh | sh

   # Windows
   powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

   # Or via pip
   pip install uv
   ```

2. **Clone the repository**:
   ```bash
   git clone <repository-url>
   cd ml-cli-tool
   ```

3. **Sync dependencies** (creates virtual environment automatically):
   ```bash
   uv sync
   ```

## Usage

### Training a Model

Train a model on your CSV data with the `train` command:

```bash
uv run python main.py train \
  --input sample_data.csv \
  --target price \
  --output-model model.joblib \
  --report report.html
```

**Example with Sample Data:**

The repository includes `sample_data.csv` with house price data (80 samples, 4 features):
- `sqft`: Square footage (numeric)
- `bedrooms`: Number of bedrooms (numeric)
- `age`: Age of house in years (numeric)
- `location_score`: Location quality score 1-10 (numeric)
- `price`: Sale price in dollars (target variable)
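
If you want a quick look at the dataset before training, a short pandas snippet (illustrative only, not part of the CLI) can summarize it:

```python
import pandas as pd

# Load the bundled example dataset and summarize it
df = pd.read_csv("sample_data.csv")
print(df.head())        # first few rows
print(df.describe())    # count, mean, std, min/max for each numeric column
print(df.isna().sum())  # missing values per column
```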

```bash
# Train on the sample data
uv run python main.py train \
  --input sample_data.csv \
  --target price \
  --output-model house_model.joblib \
  --report house_report.html
```

**Expected Output:**
```
=== Train Command ===
Input CSV: sample_data.csv
Target Column: price
Output Model Path: house_model.joblib
Report Path: house_report.html

Step 1: Loading and validating data...
✓ Data loaded successfully: 80 samples, 4 features
  - Numeric features: 4
  - Target column: 'price'

Step 2: Creating and fitting preprocessing pipeline...
✓ Preprocessing complete: 4 transformed features

Step 3: Training model...
✓ Model trained successfully
  - R² Score: 0.9234
  - MSE: 123456789.0
  - RMSE: 11111.11
  - MAE: 8888.88

Step 4: Generating visualizations...
✓ Visualizations created successfully

Step 5: Saving model...
✓ Model saved to: house_model.joblib

Step 6: Generating HTML report...
✓ Report saved to: house_report.html

============================================================
🎉 Training completed successfully!
============================================================

📊 Model Performance:
  - R² Score: 0.9234
  - MSE: 123456789.0
  - RMSE: 11111.11
  - MAE: 8888.88

💾 Output Files:
  - Model: house_model.joblib
  - Report: house_report.html
```
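
Internally, the saved `.joblib` file bundles the fitted preprocessing pipeline, the fitted model, and training metadata so later predictions reproduce the exact same transformations. The snippet below is only a rough sketch of that idea using plain scikit-learn on the sample data; the artifact keys and layout are assumptions, not the tool's actual format (see `model.py` for that):

```python
from datetime import datetime

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Fit a simple scale-then-regress pipeline on the sample data
df = pd.read_csv("sample_data.csv")
X, y = df.drop(columns=["price"]), df["price"]
pipeline = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

# Bundle the fitted pipeline with metadata (illustrative layout)
artifact = {
    "pipeline": pipeline,
    "target": "price",
    "feature_names": list(X.columns),
    "trained_at": datetime.now().isoformat(),
}
joblib.dump(artifact, "house_model_sketch.joblib")
```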

### Making Predictions

Use a trained model to make predictions on new data:

```bash
uv run python main.py predict \
  --model model.joblib \
  --input new_data.csv \
  --output predictions.csv
```

**Example with Saved Model:**

```bash
# Make predictions using the trained model
uv run python main.py predict \
  --model house_model.joblib \
  --input sample_data.csv \
  --output predictions.csv
```

**Expected Output:**
```
=== Predict Command ===
Model Path: house_model.joblib
Input CSV: sample_data.csv
Output Path: predictions.csv

Step 1: Loading model...
✓ Model loaded successfully
  - Target variable: 'price'
  - Expected features: 4
  - Trained on: 2024-01-15T10:30:45.123456

Step 2: Loading input data...
✓ Input data loaded successfully: 80 samples, 4 features

Step 3: Making predictions...
✓ Predictions generated successfully: 80 predictions

Step 4: Creating output file...
✓ Output DataFrame created with column 'predicted_price'

Step 5: Saving predictions...
✓ Predictions saved to: predictions.csv

Step 6: Calculating summary statistics...
✓ Summary statistics calculated
  - Count: 80
  - Mean: 315000.0000
  - Median: 320000.0000
  - Std Dev: 85000.0000
  - Min: 185000.0000
  - Max: 485000.0000

============================================================
🎉 Prediction completed successfully!
============================================================

📊 Prediction Summary:
  - Output file: predictions.csv
  - Number of predictions: 80
  - Prediction column: 'predicted_price'

📈 Statistics:
  - Mean: 315000.0000
  - Median: 320000.0000
  - Range: [185000.0000, 485000.0000]
```

The output CSV will contain all original columns plus a new `predicted_price` column.
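
If you want to reuse a saved model outside the CLI, you can load it with joblib and apply it to a DataFrame yourself. A rough sketch, assuming an artifact laid out like the training sketch above (the key names are illustrative, not the tool's guaranteed format):

```python
import joblib
import pandas as pd

artifact = joblib.load("house_model_sketch.joblib")
new_data = pd.read_csv("sample_data.csv").drop(columns=["price"])

# Validate that the prediction data has every column seen during training
missing = set(artifact["feature_names"]) - set(new_data.columns)
if missing:
    raise ValueError(f"Missing feature columns: {sorted(missing)}")

# Reorder columns to match training, predict, and save alongside the inputs
predictions = artifact["pipeline"].predict(new_data[artifact["feature_names"]])
out = new_data.assign(predicted_price=predictions)
out.to_csv("predictions_sketch.csv", index=False)
print(out["predicted_price"].describe())
```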

## Preprocessing Details

The tool automatically applies a preprocessing pipeline to ensure clean, standardized data for the model:

### For Numeric Features:
1. **Missing Value Imputation**: Replaces missing values (NaN) with the mean of that feature calculated from the training data
2. **Standardization**: Applies StandardScaler to normalize features to zero mean and unit variance (z-score normalization)

### For Categorical Features:
1. **Missing Value Imputation**: Replaces missing values with the most frequent category from the training data
2. **One-Hot Encoding**: Converts categorical variables into binary indicator columns (handles unknown categories gracefully during prediction)
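
The pipeline itself is built in `preprocessing.py`. As a rough illustration of how such a pipeline is typically assembled with scikit-learn's `ColumnTransformer` (a sketch with a made-up categorical column, since `sample_data.csv` happens to be all numeric):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative frame with one numeric and one categorical column
df = pd.DataFrame({"sqft": [1200.0, np.nan, 2000.0], "zone": ["A", "B", np.nan]})

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),           # fill NaN with the column mean
    ("scale", StandardScaler()),                          # zero mean, unit variance
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),  # fill NaN with the modal category
    ("encode", OneHotEncoder(handle_unknown="ignore")),   # unseen categories encode as all zeros
])

preprocessor = ColumnTransformer([
    ("num", numeric, ["sqft"]),
    ("cat", categorical, ["zone"]),
])
print(preprocessor.fit_transform(df))
```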

### Why This Matters:
- **Consistency**: The exact same transformations are applied during both training and prediction
- **No Data Leakage**: Imputation and scaling statistics are learned from the training data only
- **Reproducibility**: The preprocessing pipeline is saved with the model, ensuring identical transformations
- **Robustness**: Missing values in new data are handled automatically using training statistics

## Report Contents

After training, an HTML report is generated containing:

### Performance Metrics
- **R² Score**: Coefficient of determination (typically between 0 and 1, higher is better). Measures the proportion of variance explained by the model
- **MSE**: Mean Squared Error. Average of squared differences between actual and predicted values
- **RMSE**: Root Mean Squared Error. Square root of MSE, in the same units as the target variable
- **MAE**: Mean Absolute Error. Average absolute difference between actual and predicted values
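
All four metrics come from the standard definitions, and scikit-learn provides them directly. A standalone sketch with made-up numbers, just to show how they relate:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([250_000, 310_000, 405_000])  # actual prices (illustrative values)
y_pred = np.array([260_000, 300_000, 415_000])  # model predictions (illustrative values)

mse = mean_squared_error(y_true, y_pred)
print("R²  :", r2_score(y_true, y_pred))
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))                    # same units as the target
print("MAE :", mean_absolute_error(y_true, y_pred))
```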

### Visualizations
1. **Actual vs Predicted Plot**: Scatter plot showing how well predictions match actual values. Points close to the diagonal line indicate good predictions
2. **Residuals Plot**: Shows prediction errors (actual - predicted) vs predicted values. Random scatter around zero indicates good model fit; patterns suggest issues
3. **Feature Coefficients**: Horizontal bar chart showing each feature's impact on the target. Green bars indicate positive correlation, red bars indicate negative correlation
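
For intuition, the coefficients chart described above can be reproduced with a few lines of matplotlib (a sketch with invented coefficient values; the tool's own plotting code lives in `visualizations.py`):

```python
import matplotlib.pyplot as plt

# Invented coefficients, purely for illustration
features = ["sqft", "location_score", "bedrooms", "age"]
coefs = [45_000.0, 22_000.0, 8_000.0, -15_000.0]

colors = ["green" if c >= 0 else "red" for c in coefs]
fig, ax = plt.subplots()
ax.barh(features, coefs, color=colors)  # horizontal bars, colored by sign
ax.axvline(0, linewidth=0.8)
ax.set_xlabel("Coefficient (effect on the target per standardized unit)")
ax.set_title("Feature Coefficients")
fig.savefig("coefficients_sketch.png", dpi=150)
```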

### Model Metadata
- Model type (Linear Regression)
- Training date and time
- Number of features used
- Imputation method (Mean for numeric, Most Frequent for categorical)
- Scaling method (Standard Scaler)

The report is **completely self-contained** (all images embedded as base64) and can be opened in any browser or shared via email.
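
Embedding the images as base64 data URIs is what makes the report a single portable file. The technique, in miniature (an illustrative sketch; the tool's own implementation lives in `report.py` / `visualizations.py` and the Jinja2 template):

```python
import base64
import io

import matplotlib.pyplot as plt

# Render a figure into an in-memory PNG and encode it as a data URI
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 6])
buf = io.BytesIO()
fig.savefig(buf, format="png")
encoded = base64.b64encode(buf.getvalue()).decode("ascii")

# The data URI can be dropped straight into an <img> tag in the HTML template
html = f'<img src="data:image/png;base64,{encoded}" alt="example plot">'
print(html[:80])
```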

## Example End-to-End Workflow

Here's a complete example showing the full workflow from training to prediction to interpretation:

```bash
# 1. Train a model on your data
uv run python main.py train \
  --input sample_data.csv \
  --target price \
  --output-model house_model.joblib \
  --report house_report.html

# 2. Open the HTML report in your browser to evaluate performance
open house_report.html             # macOS
# or: xdg-open house_report.html   # Linux
# or: start house_report.html      # Windows

# 3. If satisfied with the model, use it to make predictions on new data
uv run python main.py predict \
  --model house_model.joblib \
  --input new_houses.csv \
  --output predicted_prices.csv

# 4. View the predictions
cat predicted_prices.csv           # or open in Excel/spreadsheet software
```

**Tips for Best Results:**
- Ensure your CSV has a header row with column names
- The target column should be numeric for regression
- Remove or encode categorical features with too many unique values (see the check sketched below)
- Check the HTML report to identify important features
- Use the residuals plot to diagnose model issues (non-linearity, heteroscedasticity)
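
As a quick pre-flight check for the tips above (numeric target, high-cardinality categoricals), a short pandas snippet (illustrative, not part of the CLI) can be run before training:

```python
import pandas as pd

df = pd.read_csv("sample_data.csv")

print(df.dtypes)  # the target column should be numeric
categorical = df.select_dtypes(include=["object", "category"])
# High unique-value counts here suggest columns to drop or re-encode
print(categorical.nunique().sort_values(ascending=False))
```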

## Project Structure

```
ml-cli-tool/
├── main.py                   # Entry point - runs the CLI
├── cli.py                    # CLI command definitions (train, predict)
├── data_loader.py            # Data loading and validation
├── preprocessing.py          # Preprocessing pipeline creation
├── model.py                  # Model training, saving, loading, prediction
├── visualizations.py         # Matplotlib plotting functions
├── report.py                 # HTML report generation
├── templates/
│   └── report_template.html  # Jinja2 template for reports
├── sample_data.csv           # Example dataset (house prices)
├── pyproject.toml            # Project configuration and dependencies
├── .python-version           # Pinned Python version (3.13)
└── README.md                 # This file
```

## Requirements

- Python 3.13+
- Dependencies (automatically installed by `uv sync`):
  - scikit-learn >= 1.3
  - pandas >= 2.0
  - matplotlib >= 3.7
  - jinja2 >= 3.1
  - click >= 8.1

## Troubleshooting

**Issue**: "Target column 'X' not found in CSV file"
- **Solution**: Check the column name spelling and ensure the CSV has a header row

**Issue**: "Expected features: [...], got: [...]. Missing: [...]"
- **Solution**: Ensure the prediction data has all the same feature columns as the training data (order doesn't matter)

**Issue**: All predictions are NaN
- **Solution**: Check that the input data contains valid numeric values (not all NaN after preprocessing)

**Issue**: Model file is corrupted or incompatible
- **Solution**: Retrain the model if the file was modified or created with an incompatible scikit-learn version

## License

See the LICENSE file for details.
