
Simple-ML-Project - Loan Amount Prediction System

A comprehensive machine learning system designed to predict how much a user can borrow based on their personal and financial profile.

🎯 Business Objective

The goal is to build a model that predicts how much a user can borrow. The system analyzes key user characteristics, including:

  • Marital Status: Married vs. single applicants
  • Education Level: Graduate vs. non-graduate education
  • Number of Dependents: Family size impact on loan capacity
  • Employment Situation: Self-employed vs. salaried employment stability

The model combines these factors with financial data (income, credit history, property area) to determine the optimal loan amount for each applicant. The system is built with modern software engineering practices and advanced ML techniques for accurate, reliable predictions.

🚀 Features

Core Capabilities

  • Multi-Model Support: Logistic Regression, Random Forest, XGBoost, LightGBM, SVM, Naive Bayes, KNN
  • Advanced Feature Engineering: Polynomial features, interaction features, feature selection, clustering features
  • Comprehensive Data Processing: Smart missing value handling, outlier detection, data validation
  • Production Architecture: Modular design, configuration management, logging system
  • Model Evaluation: Cross-validation, comprehensive metrics, visualization
  • Experiment Tracking: SwanLab integration for experiment management

Technical Highlights

  • Clean Codebase: Removed redundant files and optimized structure
  • OOP Design: Clean, maintainable, and extensible codebase
  • Configuration-Driven: YAML-based configuration management
  • Comprehensive Logging: Multi-level logging with experiment tracking
  • Data Validation: Automated data quality checks and recommendations
  • Feature Engineering: Advanced feature creation and selection
  • Model Persistence: Save/load trained models with metadata
  • Visualization: Rich plots and evaluation reports

📁 Project Structure

Simple-ML-Project/
├── src/                          # Source code
│   ├── core/                     # Core base classes and interfaces
│   │   ├── interfaces.py        # Abstract base classes and interfaces
│   │   ├── config.py            # Configuration management
│   │   ├── logger.py            # Logging system
│   │   ├── constants.py         # Constants and enums
│   │   ├── exceptions.py        # Custom exceptions
│   │   └── validators.py        # Validation utilities
│   ├── data/                     # Data processing modules
│   │   └── processor.py         # Data processing and feature engineering
│   ├── models/                   # Machine learning models
│   │   ├── base_model.py        # Base model class
│   │   └── sklearn_models.py    # Sklearn model implementations
│   ├── training/                 # Training modules
│   │   ├── trainer.py           # Model trainer
│   │   └── optimizer.py         # Hyperparameter optimizer
│   ├── evaluation/               # Evaluation modules
│   │   ├── evaluator.py         # Model evaluator
│   │   └── visualizer.py        # Result visualizer
│   └── utils/                    # Utility functions
│       ├── helpers.py           # Helper functions
│       └── visualization.py     # Visualization utilities
├── data/                         # Data directory
│   ├── train_u6lujuX_CVtuZ9i.csv # Training data
│   └── test_Y3wMUE5_7gLdaTN.csv  # Test data
├── outputs/                      # Output directory
│   ├── models/                   # Saved models
│   ├── curves/                   # Plots and visualizations
│   └── reports/                  # Evaluation reports
├── example/                      # Example scripts
│   ├── run_logistic_regression.sh    # Logistic Regression script
│   ├── run_random_forest.sh          # Random Forest script
│   ├── run_xgboost.sh               # XGBoost script
│   ├── run_lightgbm.sh             # LightGBM script
│   ├── run_svm.sh                   # SVM script
│   ├── run_naive_bayes.sh           # Naive Bayes script
│   ├── run_knn.sh                   # KNN script
│   └── run_all_models_comparison.sh # All models comparison
├── logs/                         # Log files
├── swanlog/                      # SwanLab experiment logs
├── config.yaml                   # Configuration file
├── main.py                       # Main entry point
├── test_system.py               # System test suite
├── requirements.txt              # Dependencies
└── README.md                     # This file

🛠️ Installation

Prerequisites

  • Python 3.8+
  • pip or conda

Setup

  1. Clone the repository
git clone <repository-url>
cd Simple-ML-Project
  2. Create virtual environment
# Using conda (recommended)
conda create -n simple-ml python=3.9
conda activate simple-ml

# Or using venv
python -m venv simple-ml_env
source simple-ml_env/bin/activate  # On Windows: simple-ml_env\Scripts\activate
  3. Install dependencies
pip install -r requirements.txt
  4. Verify installation
python test_system.py

If all tests pass, the installation is successful! 🎉

🚀 Quick Start

Using Model Scripts (Recommended)

The easiest way to run different models is using the provided bash scripts:

# Run individual models
./example/run_logistic_regression.sh    # Logistic Regression
./example/run_random_forest.sh          # Random Forest
./example/run_xgboost.sh               # XGBoost
./example/run_lightgbm.sh              # LightGBM
./example/run_svm.sh                   # Support Vector Machine
./example/run_naive_bayes.sh           # Naive Bayes
./example/run_knn.sh                   # K-Nearest Neighbors

# Compare all models at once
./example/run_all_models_comparison.sh

Manual Usage

  1. Run the complete pipeline
python main.py --data data/train_u6lujuX_CVtuZ9i.csv --config config.yaml
  2. Train a specific model
# Edit config.yaml to set model_name to your preferred model
python main.py --data data/train_u6lujuX_CVtuZ9i.csv
  3. Run with a custom configuration
python main.py --data data/train_u6lujuX_CVtuZ9i.csv --config my_config.yaml
  4. Test system functionality
python test_system.py

Programmatic Usage

from src.core import ConfigManager, Logger
from src.data import LoanDataProcessor
from src.models import RandomForestModel, XGBoostModel
from src.training import ModelTrainer
from src.evaluation import ModelEvaluator

# Initialize components
config_manager = ConfigManager("config.yaml")
logger = Logger(config_manager.get_logger_config())

# Load and preprocess data
processor = LoanDataProcessor(config_manager.get_data_config(), logger)
df = processor.load_data("data/train_u6lujuX_CVtuZ9i.csv")

# Preprocess data (includes feature engineering and validation)
X, y = processor.preprocess(df)
X_train, X_val, y_train, y_val = processor.split_data(X, y)

# Train model
model = XGBoostModel(config_manager.get_model_config(), logger)
trainer = ModelTrainer(config_manager.get_train_config(), logger)
trainer.set_model(model)
trainer.train(X_train, y_train, X_val, y_val)

# Evaluate model
evaluator = ModelEvaluator(config_manager.get_metrics_config(), logger)
predictions = model.predict(X_val)
probabilities = model.predict_proba(X_val)
metrics = evaluator.compute_metrics(y_val, predictions, probabilities)

print(f"Model accuracy: {metrics['accuracy']:.4f}")
print(f"Model F1-score: {metrics['f1_score']:.4f}")

⚙️ Configuration

The system is fully configurable through config.yaml. Key configuration sections:

Model Configuration

Model:
  model_name: "XGBoost"  # or "RandomForest", "LightGBM", etc.
  model_type: "classification"
  model_params:
    n_estimators: 100
    learning_rate: 0.1
    max_depth: 6
    random_state: 42

Data Processing

Data:
  test_size: 0.2
  random_seed: 42
  missing_strategy: "smart"  # "smart", "drop", "fill"
  scale_features: true
  scaling_method: "standard"  # "standard", "minmax", "none"
  create_features: true
  feature_selection: true
  n_features_select: 10

Training Configuration

Train:
  cv_folds: 5
  random_seed: 42
  verbose: 1

Evaluation Metrics

Metrics:
  metrics_list:
    - "accuracy"
    - "precision"
    - "recall"
    - "f1_score"
    - "roc_auc"
    - "confusion_matrix"
    - "classification_report"

📊 Data Format

The system analyzes user profiles to predict loan amounts. The input data includes key personal and financial factors:

| Column | Type | Description | Impact on Loan Amount |
|---|---|---|---|
| Loan_ID | String | Unique loan identifier | - |
| Gender | Categorical | Male/Female | Affects risk assessment |
| Married | Categorical | Yes/No | Key factor: married applicants often qualify for higher amounts |
| Dependents | Categorical | Number of dependents | Key factor: more dependents may reduce the available loan amount |
| Education | Categorical | Graduate/Not Graduate | Key factor: higher education typically increases loan eligibility |
| Self_Employed | Categorical | Yes/No | Key factor: employment stability affects loan approval and amount |
| ApplicantIncome | Numeric | Applicant's income | Primary factor for loan amount calculation |
| CoapplicantIncome | Numeric | Co-applicant's income | Additional income source for loan capacity |
| LoanAmount | Numeric | Loan amount requested | Target variable for prediction |
| Loan_Amount_Term | Numeric | Loan term in months | Affects monthly payment capacity |
| Credit_History | Numeric | Credit history (0/1) | Critical for loan approval and amount |
| Property_Area | Categorical | Urban/Rural/Semiurban | Location-based risk assessment |
| Loan_Status | Categorical | Y/N (approval status) | Historical approval data for training |
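
For orientation, here is a minimal pandas sketch that loads the training CSV and inspects these columns; the project's real preprocessing lives in src/data/processor.py, so treat this as an illustration only.

import pandas as pd

df = pd.read_csv("data/train_u6lujuX_CVtuZ9i.csv")

categorical_cols = ["Gender", "Married", "Dependents", "Education",
                    "Self_Employed", "Property_Area", "Loan_Status"]
numeric_cols = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount",
                "Loan_Amount_Term", "Credit_History"]

# Quick look at missing values and category levels before any modeling.
print(df[categorical_cols + numeric_cols].isna().sum())
for col in categorical_cols:
    print(col, df[col].dropna().unique())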

🎯 Model Performance

The system supports multiple models with different strengths. Here are the latest performance results:

| Model | Accuracy | F1-Score | Precision | Recall | ROC-AUC | Training Time | Best For |
|---|---|---|---|---|---|---|---|
| LogisticRegression | 86.18% | 85.04% | 87.60% | 86.18% | 80.90% | 0.44s | Interpretability, baseline |
| RandomForest | 84.55% | 83.72% | 84.63% | 84.55% | 86.32% | 3.79s | General purpose, feature importance |
| XGBoost | ~85% | ~84% | ~85% | ~84% | ~86% | ~10s | High performance, competitions |
| LightGBM | 85.37% | 84.83% | 85.19% | 85.37% | 86.44% | 1.35s | Large datasets, fast training |
| SVM | 85.37% | 84.04% | 86.96% | 85.37% | 83.53% | 0.65s | Small datasets, high-dimensional data |
| NaiveBayes | 84.55% | 83.51% | 85.01% | 84.55% | 80.99% | 0.44s | Fast training, baseline |
| KNN | 84.55% | 83.90% | 84.38% | 84.55% | 77.09% | 0.77s | Simple, non-parametric |

Results are from the latest test runs; performance may vary with data and configuration.

🏆 Model Comparison Summary

  • Best Overall Performance: LogisticRegression (86.18% accuracy)
  • Fastest Training: NaiveBayes & LogisticRegression (0.44s)
  • Best for Large Datasets: LightGBM (fast + good performance)
  • Most Interpretable: LogisticRegression
  • Best for Feature Importance: RandomForest

🔧 Advanced Features

Feature Engineering

  • Polynomial Features: Create interaction terms and polynomial combinations
  • Interaction Features: Automatic feature interaction creation
  • Feature Selection: Automatic feature importance selection using mutual information
  • Clustering Features: Add cluster-based features using K-means
  • Power Transformations: Yeo-Johnson and quantile transformations
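
A minimal scikit-learn sketch of the ideas listed above (polynomial/interaction terms, mutual-information feature selection, and K-means cluster features). The project's actual implementation in src/data/processor.py may differ in details; the synthetic data here is just for illustration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import PolynomialFeatures

X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Polynomial / interaction features (degree 2, interactions only).
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X)

# Keep the 10 features with the highest mutual information w.r.t. the target.
selector = SelectKBest(mutual_info_classif, k=10)
X_selected = selector.fit_transform(X_poly, y)

# Add a K-means cluster id as an extra feature.
clusters = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
X_final = np.column_stack([X_selected, clusters])
print(X_final.shape)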

Data Validation

  • Missing Value Analysis: Comprehensive missing data assessment with quality scoring
  • Outlier Detection: Multiple outlier detection methods (IQR, Z-score)
  • Data Consistency: Logical constraint validation and business rule checking
  • Quality Scoring: Automated data quality assessment (0-1 scale)
  • Comprehensive Reports: Detailed validation reports with recommendations
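
As an illustration of the IQR and Z-score checks and the 0-1 quality score mentioned above, here is a minimal sketch; the project's own checks live in src/core/validators.py and may be more involved.

import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Boolean mask of values more than `threshold` standard deviations from the mean."""
    z = (series - series.mean()) / series.std()
    return z.abs() > threshold

df = pd.read_csv("data/train_u6lujuX_CVtuZ9i.csv")
income = df["ApplicantIncome"]
print("IQR outliers:", iqr_outliers(income).sum())
print("Z-score outliers:", zscore_outliers(income).sum())

# A crude 0-1 quality score: the share of non-missing cells.
print("Quality score:", round(1.0 - df.isna().mean().mean(), 3))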

Model Evaluation

  • Cross-Validation: K-fold cross-validation with stratified sampling
  • Comprehensive Metrics: Accuracy, Precision, Recall, F1, ROC-AUC, Average Precision
  • Visualization: Confusion matrix, ROC curves, Precision-Recall curves, feature importance
  • Model Comparison: Side-by-side model performance comparison
  • Hyperparameter Optimization: Grid search and random search support
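
A minimal sketch of stratified K-fold cross-validation and grid search with scikit-learn; the model and parameter grid below are illustrative, not the project's defaults.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validated F1 for a fixed model.
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y,
                         cv=cv, scoring="f1")
print("CV F1:", scores.mean())

# Small grid search over a couple of hyperparameters.
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid={"n_estimators": [100, 200], "max_depth": [4, 6]},
                    cv=cv, scoring="f1")
grid.fit(X, y)
print("Best params:", grid.best_params_)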

📈 Outputs

The system generates comprehensive outputs in the outputs/ directory:

Models (outputs/models/)

  • Trained model files (.pkl format)
  • Model metadata and parameters
  • Feature importance scores
  • Cross-validation results
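
For reference, a saved .pkl model can be written and reloaded with joblib along these lines; the file name and metadata layout here are illustrative, and the project's trainer controls the actual naming.

import os
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X, y)

# Persist the estimator together with simple metadata.
os.makedirs("outputs/models", exist_ok=True)
joblib.dump({"model": model, "params": model.get_params()},
            "outputs/models/random_forest.pkl")

# Later: reload and predict.
bundle = joblib.load("outputs/models/random_forest.pkl")
print(bundle["model"].predict(X[:5]))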

Visualizations (outputs/curves/)

  • Confusion matrices
  • ROC curves
  • Precision-Recall curves
  • Feature importance plots
  • Model comparison charts
  • Comprehensive evaluation dashboards

Reports (outputs/reports/)

  • Evaluation metrics summary (evaluation_results.yaml)
  • Data quality report
  • Model performance comparison
  • Training history logs
  • Summary report (summary_report.txt)

Logs (logs/ and swanlog/)

  • Detailed training logs
  • SwanLab experiment tracking
  • System performance metrics

🧪 Testing

Run the comprehensive test suite:

# Run system tests
python test_system.py

# Run specific module tests (if available)
python -m pytest tests/ -v

# Run with coverage
python -m pytest tests/ --cov=src --cov-report=html

The test_system.py script performs comprehensive testing of:

  • ✅ All module imports
  • ✅ Configuration loading
  • ✅ Data processing pipeline
  • ✅ Model creation and training
  • ✅ Evaluation metrics
  • ✅ File I/O operations
  • ✅ System integration

📝 Logging

The system provides comprehensive logging:

  • Console Logging: Real-time progress updates
  • File Logging: Detailed logs saved to files
  • Experiment Tracking: SwanLab integration for experiment management
  • Structured Logging: JSON-formatted logs for analysis
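
A minimal sketch of JSON-formatted logging with Python's standard logging module; the project's Logger in src/core/logger.py wraps this kind of setup, and the field names here are illustrative.

import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "name": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("simple_ml")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Training started")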

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Setup

# Install development dependencies
pip install -r requirements.txt
pip install black flake8 mypy pytest

# Format code
black src/ tests/

# Lint code
flake8 src/ tests/

# Type checking
mypy src/

🚀 Recent Updates

Version 2.2.0 (Latest)

  • Model Scripts: Added individual bash scripts for all 7 models
  • Model Comparison: Added comprehensive model comparison script
  • Performance Testing: All model scripts tested and working
  • Documentation: Updated README with model scripts usage
  • Model Registration: Fixed missing model registrations (SVM, NaiveBayes, KNN)
  • Parameter Fixes: Fixed model parameter issues (NaiveBayes var_smoothing)

Version 2.1.0

  • Code Cleanup: Removed redundant files and optimized project structure
  • Import Fixes: Fixed all import issues and module dependencies
  • Configuration Updates: Updated config.yaml for better compatibility
  • System Testing: Added comprehensive test suite (test_system.py)
  • Performance Optimization: Improved model training and evaluation
  • Documentation: Updated README with latest features and usage

Version 2.0.0

  • Complete System Overhaul: Refactored entire codebase with modern OOP design
  • Advanced Feature Engineering: Added polynomial features, clustering features, and interaction terms
  • Comprehensive Data Validation: Implemented automated data quality assessment
  • Multi-Model Support: Added support for 7+ machine learning models
  • Enhanced Evaluation: Added comprehensive metrics and visualization
  • Configuration Management: YAML-based configuration system
  • Logging System: Multi-level logging with experiment tracking
  • SwanLab Integration: Experiment tracking and visualization
  • Modular Architecture: Clean, maintainable, and extensible design
  • Documentation: Comprehensive documentation and examples

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Scikit-learn team for the excellent ML library
  • XGBoost and LightGBM teams for gradient boosting libraries
  • Pandas and NumPy teams for data manipulation tools
  • Matplotlib and Seaborn teams for visualization tools

📞 Support

For questions, issues, or contributions:

  1. Check the Issues page
  2. Create a new issue with detailed description
  3. Contact the maintainers

Built with ❤️ for the machine learning community
