
Simple-ML-Project - Loan Amount Prediction System

A comprehensive machine learning system designed to predict how much a user can borrow based on their personal and financial profile.

🎯 Business Objective

The goal is to build a model that predicts how much a user can borrow. The system analyzes key user characteristics, including:

  • Marital Status: Married vs. single applicants
  • Education Level: Graduate vs. non-graduate education
  • Number of Dependents: Family size impact on loan capacity
  • Employment Situation: Self-employed vs. salaried employment stability

The model combines these factors with financial data (income, credit history, property area) to determine the optimal loan amount for each applicant. The system is built with modern software engineering practices and advanced ML techniques for accurate, reliable predictions.

🚀 Features

Core Capabilities

  • Multi-Model Support: Logistic Regression, Random Forest, XGBoost, LightGBM, SVM, Naive Bayes, KNN
  • Advanced Feature Engineering: Polynomial features, interaction features, feature selection, clustering features
  • Comprehensive Data Processing: Smart missing value handling, outlier detection, data validation
  • Production Architecture: Modular design, configuration management, logging system
  • Model Evaluation: Cross-validation, comprehensive metrics, visualization
  • Experiment Tracking: SwanLab integration for experiment management

Technical Highlights

  • Clean Codebase: Removed redundant files and optimized structure
  • OOP Design: Clean, maintainable, and extensible codebase
  • Configuration-Driven: YAML-based configuration management
  • Comprehensive Logging: Multi-level logging with experiment tracking
  • Data Validation: Automated data quality checks and recommendations
  • Feature Engineering: Advanced feature creation and selection
  • Model Persistence: Save/load trained models with metadata
  • Visualization: Rich plots and evaluation reports

📁 Project Structure

Simple-ML-Project/
├── src/                          # Source code
│   ├── core/                     # Core base classes and interfaces
│   │   ├── interfaces.py        # Abstract base classes and interfaces
│   │   ├── config.py            # Configuration management
│   │   ├── logger.py            # Logging system
│   │   ├── constants.py         # Constants and enums
│   │   ├── exceptions.py        # Custom exceptions
│   │   └── validators.py        # Validation utilities
│   ├── data/                     # Data processing modules
│   │   └── processor.py         # Data processing and feature engineering
│   ├── models/                   # Machine learning models
│   │   ├── base_model.py        # Base model class
│   │   └── sklearn_models.py    # Sklearn model implementations
│   ├── training/                 # Training modules
│   │   ├── trainer.py           # Model trainer
│   │   └── optimizer.py         # Hyperparameter optimizer
│   ├── evaluation/               # Evaluation modules
│   │   ├── evaluator.py         # Model evaluator
│   │   └── visualizer.py        # Result visualizer
│   └── utils/                    # Utility functions
│       ├── helpers.py           # Helper functions
│       └── visualization.py     # Visualization utilities
├── data/                         # Data directory
│   ├── train_u6lujuX_CVtuZ9i.csv # Training data
│   └── test_Y3wMUE5_7gLdaTN.csv  # Test data
├── outputs/                      # Output directory
│   ├── models/                   # Saved models
│   ├── curves/                   # Plots and visualizations
│   └── reports/                  # Evaluation reports
├── example/                      # Example scripts
│   ├── run_logistic_regression.sh    # Logistic Regression script
│   ├── run_random_forest.sh          # Random Forest script
│   ├── run_xgboost.sh               # XGBoost script
│   ├── run_lightgbm.sh             # LightGBM script
│   ├── run_svm.sh                   # SVM script
│   ├── run_naive_bayes.sh           # Naive Bayes script
│   ├── run_knn.sh                   # KNN script
│   └── run_all_models_comparison.sh # All models comparison
├── logs/                         # Log files
├── swanlog/                      # SwanLab experiment logs
├── config.yaml                   # Configuration file
├── main.py                       # Main entry point
├── test_system.py               # System test suite
├── requirements.txt              # Dependencies
└── README.md                     # This file

🛠️ Installation

Prerequisites

  • Python 3.8+
  • pip or conda

Setup

  1. Clone the repository
git clone <repository-url>
cd Simple-ML-Project
  2. Create virtual environment
# Using conda (recommended)
conda create -n simple-ml python=3.9
conda activate simple-ml

# Or using venv
python -m venv simple-ml_env
source simple-ml_env/bin/activate  # On Windows: simple-ml_env\Scripts\activate
  3. Install dependencies
pip install -r requirements.txt
  4. Verify installation
python test_system.py

If all tests pass, the installation is successful! 🎉

🚀 Quick Start

Using Model Scripts (Recommended)

The easiest way to run different models is using the provided bash scripts:

# Run individual models
./example/run_logistic_regression.sh    # Logistic Regression
./example/run_random_forest.sh          # Random Forest
./example/run_xgboost.sh               # XGBoost
./example/run_lightgbm.sh              # LightGBM
./example/run_svm.sh                   # Support Vector Machine
./example/run_naive_bayes.sh           # Naive Bayes
./example/run_knn.sh                   # K-Nearest Neighbors

# Compare all models at once
./example/run_all_models_comparison.sh

Manual Usage

  1. Run the complete pipeline
python main.py --data data/train_u6lujuX_CVtuZ9i.csv --config config.yaml
  2. Train a specific model
# Edit config.yaml to set model_name to your preferred model
python main.py --data data/train_u6lujuX_CVtuZ9i.csv
  3. Run with a custom configuration
python main.py --data data/train_u6lujuX_CVtuZ9i.csv --config my_config.yaml
  4. Test system functionality
python test_system.py

Programmatic Usage

from src.core import ConfigManager, Logger
from src.data import LoanDataProcessor
from src.models import RandomForestModel, XGBoostModel
from src.training import ModelTrainer
from src.evaluation import ModelEvaluator

# Initialize components
config_manager = ConfigManager("config.yaml")
logger = Logger(config_manager.get_logger_config())

# Load and preprocess data
processor = LoanDataProcessor(config_manager.get_data_config(), logger)
df = processor.load_data("data/train_u6lujuX_CVtuZ9i.csv")

# Preprocess data (includes feature engineering and validation)
X, y = processor.preprocess(df)
X_train, X_val, y_train, y_val = processor.split_data(X, y)

# Train model
model = XGBoostModel(config_manager.get_model_config(), logger)
trainer = ModelTrainer(config_manager.get_train_config(), logger)
trainer.set_model(model)
trainer.train(X_train, y_train, X_val, y_val)

# Evaluate model
evaluator = ModelEvaluator(config_manager.get_metrics_config(), logger)
predictions = model.predict(X_val)
probabilities = model.predict_proba(X_val)
metrics = evaluator.compute_metrics(y_val, predictions, probabilities)

print(f"Model accuracy: {metrics['accuracy']:.4f}")
print(f"Model F1-score: {metrics['f1_score']:.4f}")

⚙️ Configuration

The system is fully configurable through config.yaml. Key configuration sections:

Model Configuration

Model:
  model_name: "XGBoost"  # or "RandomForest", "LightGBM", etc.
  model_type: "classification"
  model_params:
    n_estimators: 100
    learning_rate: 0.1
    max_depth: 6
    random_state: 42

Data Processing

Data:
  test_size: 0.2
  random_seed: 42
  missing_strategy: "smart"  # "smart", "drop", "fill"
  scale_features: true
  scaling_method: "standard"  # "standard", "minmax", "none"
  create_features: true
  feature_selection: true
  n_features_select: 10

Training Configuration

Train:
  cv_folds: 5
  random_seed: 42
  verbose: 1

Evaluation Metrics

Metrics:
  metrics_list:
    - "accuracy"
    - "precision"
    - "recall"
    - "f1_score"
    - "roc_auc"
    - "confusion_matrix"
    - "classification_report"

📊 Data Format

The system analyzes user profiles to predict loan amounts. The input data includes key personal and financial factors:

| Column | Type | Description | Impact on Loan Amount |
|---|---|---|---|
| Loan_ID | String | Unique loan identifier | - |
| Gender | Categorical | Male/Female | Affects risk assessment |
| Married | Categorical | Yes/No | Key factor: married applicants often qualify for higher amounts |
| Dependents | Categorical | Number of dependents | Key factor: more dependents may reduce the available loan amount |
| Education | Categorical | Graduate/Not Graduate | Key factor: higher education typically increases loan eligibility |
| Self_Employed | Categorical | Yes/No | Key factor: employment stability affects loan approval and amount |
| ApplicantIncome | Numeric | Applicant's income | Primary factor for loan amount calculation |
| CoapplicantIncome | Numeric | Co-applicant's income | Additional income source for loan capacity |
| LoanAmount | Numeric | Loan amount requested | Target variable for prediction |
| Loan_Amount_Term | Numeric | Loan term in months | Affects monthly payment capacity |
| Credit_History | Numeric | Credit history (0/1) | Critical for loan approval and amount |
| Property_Area | Categorical | Urban/Rural/Semiurban | Location-based risk assessment |
| Loan_Status | Categorical | Y/N (approval status) | Historical approval data for training |
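
For orientation, here is a minimal pandas sketch that loads the training CSV and inspects these columns; the project's real preprocessing lives in src/data/processor.py, so treat this as an illustration only.

import pandas as pd

df = pd.read_csv("data/train_u6lujuX_CVtuZ9i.csv")

categorical_cols = ["Gender", "Married", "Dependents", "Education",
                    "Self_Employed", "Property_Area", "Loan_Status"]
numeric_cols = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount",
                "Loan_Amount_Term", "Credit_History"]

# Quick look at missing values and category levels before any modeling.
print(df[categorical_cols + numeric_cols].isna().sum())
for col in categorical_cols:
    print(col, df[col].dropna().unique())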

🎯 Model Performance

The system supports multiple models with different strengths. Here are the latest performance results:

| Model | Accuracy | F1-Score | Precision | Recall | ROC-AUC | Training Time | Best For |
|---|---|---|---|---|---|---|---|
| LogisticRegression | 86.18% | 85.04% | 87.60% | 86.18% | 80.90% | 0.44s | Interpretability, baseline |
| RandomForest | 84.55% | 83.72% | 84.63% | 84.55% | 86.32% | 3.79s | General purpose, feature importance |
| XGBoost | ~85% | ~84% | ~85% | ~84% | ~86% | ~10s | High performance, competitions |
| LightGBM | 85.37% | 84.83% | 85.19% | 85.37% | 86.44% | 1.35s | Large datasets, fast training |
| SVM | 85.37% | 84.04% | 86.96% | 85.37% | 83.53% | 0.65s | Small datasets, high-dimensional data |
| NaiveBayes | 84.55% | 83.51% | 85.01% | 84.55% | 80.99% | 0.44s | Fast training, baseline |
| KNN | 84.55% | 83.90% | 84.38% | 84.55% | 77.09% | 0.77s | Simple, non-parametric |

Results are from the latest test runs; performance may vary with data and configuration.

🏆 Model Comparison Summary

  • Best Overall Performance: LogisticRegression (86.18% accuracy)
  • Fastest Training: NaiveBayes & LogisticRegression (0.44s)
  • Best for Large Datasets: LightGBM (fast + good performance)
  • Most Interpretable: LogisticRegression
  • Best for Feature Importance: RandomForest

🔧 Advanced Features

Feature Engineering

  • Polynomial Features: Create interaction terms and polynomial combinations
  • Interaction Features: Automatic feature interaction creation
  • Feature Selection: Automatic feature importance selection using mutual information
  • Clustering Features: Add cluster-based features using K-means
  • Power Transformations: Yeo-Johnson and quantile transformations
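
A minimal scikit-learn sketch of the ideas listed above (polynomial/interaction terms, mutual-information feature selection, and K-means cluster features). The project's actual implementation in src/data/processor.py may differ in details; the synthetic data here is just for illustration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import PolynomialFeatures

X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Polynomial / interaction features (degree 2, interactions only).
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X)

# Keep the 10 features with the highest mutual information w.r.t. the target.
selector = SelectKBest(mutual_info_classif, k=10)
X_selected = selector.fit_transform(X_poly, y)

# Add a K-means cluster id as an extra feature.
clusters = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
X_final = np.column_stack([X_selected, clusters])
print(X_final.shape)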

Data Validation

  • Missing Value Analysis: Comprehensive missing data assessment with quality scoring
  • Outlier Detection: Multiple outlier detection methods (IQR, Z-score)
  • Data Consistency: Logical constraint validation and business rule checking
  • Quality Scoring: Automated data quality assessment (0-1 scale)
  • Comprehensive Reports: Detailed validation reports with recommendations
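
As an illustration of the IQR and Z-score checks and the 0-1 quality score mentioned above, here is a minimal sketch; the project's own checks live in src/core/validators.py and may be more involved.

import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Boolean mask of values more than `threshold` standard deviations from the mean."""
    z = (series - series.mean()) / series.std()
    return z.abs() > threshold

df = pd.read_csv("data/train_u6lujuX_CVtuZ9i.csv")
income = df["ApplicantIncome"]
print("IQR outliers:", iqr_outliers(income).sum())
print("Z-score outliers:", zscore_outliers(income).sum())

# A crude 0-1 quality score: the share of non-missing cells.
print("Quality score:", round(1.0 - df.isna().mean().mean(), 3))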

Model Evaluation

  • Cross-Validation: K-fold cross-validation with stratified sampling
  • Comprehensive Metrics: Accuracy, Precision, Recall, F1, ROC-AUC, Average Precision
  • Visualization: Confusion matrix, ROC curves, Precision-Recall curves, feature importance
  • Model Comparison: Side-by-side model performance comparison
  • Hyperparameter Optimization: Grid search and random search support
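
A minimal sketch of stratified K-fold cross-validation and grid search with scikit-learn; the model and parameter grid below are illustrative, not the project's defaults.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validated F1 for a fixed model.
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y,
                         cv=cv, scoring="f1")
print("CV F1:", scores.mean())

# Small grid search over a couple of hyperparameters.
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid={"n_estimators": [100, 200], "max_depth": [4, 6]},
                    cv=cv, scoring="f1")
grid.fit(X, y)
print("Best params:", grid.best_params_)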

📈 Outputs

The system generates comprehensive outputs in the outputs/ directory:

Models (outputs/models/)

  • Trained model files (.pkl format)
  • Model metadata and parameters
  • Feature importance scores
  • Cross-validation results
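
For reference, a saved .pkl model can be written and reloaded with joblib along these lines; the file name and metadata layout here are illustrative, and the project's trainer controls the actual naming.

import os
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X, y)

# Persist the estimator together with simple metadata.
os.makedirs("outputs/models", exist_ok=True)
joblib.dump({"model": model, "params": model.get_params()},
            "outputs/models/random_forest.pkl")

# Later: reload and predict.
bundle = joblib.load("outputs/models/random_forest.pkl")
print(bundle["model"].predict(X[:5]))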

Visualizations (outputs/curves/)

  • Confusion matrices
  • ROC curves
  • Precision-Recall curves
  • Feature importance plots
  • Model comparison charts
  • Comprehensive evaluation dashboards

Reports (outputs/reports/)

  • Evaluation metrics summary (evaluation_results.yaml)
  • Data quality report
  • Model performance comparison
  • Training history logs
  • Summary report (summary_report.txt)

Logs (logs/ and swanlog/)

  • Detailed training logs
  • SwanLab experiment tracking
  • System performance metrics

🧪 Testing

Run the comprehensive test suite:

# Run system tests
python test_system.py

# Run specific module tests (if available)
python -m pytest tests/ -v

# Run with coverage
python -m pytest tests/ --cov=src --cov-report=html

The test_system.py script performs comprehensive testing of:

  • ✅ All module imports
  • ✅ Configuration loading
  • ✅ Data processing pipeline
  • ✅ Model creation and training
  • ✅ Evaluation metrics
  • ✅ File I/O operations
  • ✅ System integration

📝 Logging

The system provides comprehensive logging:

  • Console Logging: Real-time progress updates
  • File Logging: Detailed logs saved to files
  • Experiment Tracking: SwanLab integration for experiment management
  • Structured Logging: JSON-formatted logs for analysis
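
A minimal sketch of JSON-formatted logging with Python's standard logging module; the project's Logger in src/core/logger.py wraps this kind of setup, and the field names here are illustrative.

import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "name": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("simple_ml")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Training started")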

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Setup

# Install development dependencies
pip install -r requirements.txt
pip install black flake8 mypy pytest

# Format code
black src/ tests/

# Lint code
flake8 src/ tests/

# Type checking
mypy src/

🚀 Recent Updates

Version 2.2.0 (Latest)

  • Model Scripts: Added individual bash scripts for all 7 models
  • Model Comparison: Added comprehensive model comparison script
  • Performance Testing: All model scripts tested and working
  • Documentation: Updated README with model scripts usage
  • Model Registration: Fixed missing model registrations (SVM, NaiveBayes, KNN)
  • Parameter Fixes: Fixed model parameter issues (NaiveBayes var_smoothing)

Version 2.1.0

  • Code Cleanup: Removed redundant files and optimized project structure
  • Import Fixes: Fixed all import issues and module dependencies
  • Configuration Updates: Updated config.yaml for better compatibility
  • System Testing: Added comprehensive test suite (test_system.py)
  • Performance Optimization: Improved model training and evaluation
  • Documentation: Updated README with latest features and usage

Version 2.0.0

  • Complete System Overhaul: Refactored entire codebase with modern OOP design
  • Advanced Feature Engineering: Added polynomial features, clustering features, and interaction terms
  • Comprehensive Data Validation: Implemented automated data quality assessment
  • Multi-Model Support: Added support for 7+ machine learning models
  • Enhanced Evaluation: Added comprehensive metrics and visualization
  • Configuration Management: YAML-based configuration system
  • Logging System: Multi-level logging with experiment tracking
  • SwanLab Integration: Experiment tracking and visualization
  • Modular Architecture: Clean, maintainable, and extensible design
  • Documentation: Comprehensive documentation and examples

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Scikit-learn team for the excellent ML library
  • XGBoost and LightGBM teams for gradient boosting libraries
  • Pandas and NumPy teams for data manipulation tools
  • Matplotlib and Seaborn teams for visualization tools

📞 Support

For questions, issues, or contributions:

  1. Check the Issues page
  2. Create a new issue with detailed description
  3. Contact the maintainers

Built with ❤️ for the machine learning community
