A comprehensive machine learning system designed to predict how much a user can borrow based on their personal and financial profile.
The goal is to build a model that classifies how much a user can borrow. The system analyzes key user characteristics, including:
- Marital Status: Married vs. single applicants
- Education Level: Graduate vs. non-graduate education
- Number of Dependents: Family size impact on loan capacity
- Employment Situation: Self-employed vs. salaried employment stability
The model combines these factors with financial data (income, credit history, property area) to determine optimal loan amounts for each applicant. Built with modern software engineering practices and advanced ML techniques for accurate and reliable predictions.
- Multi-Model Support: Logistic Regression, Random Forest, XGBoost, LightGBM, SVM, Naive Bayes, KNN
- Advanced Feature Engineering: Polynomial features, interaction features, feature selection, clustering features
- Comprehensive Data Processing: Smart missing value handling, outlier detection, data validation
- Production Architecture: Modular design, configuration management, logging system
- Model Evaluation: Cross-validation, comprehensive metrics, visualization
- Experiment Tracking: SwanLab integration for experiment management
- Clean Codebase: Removed redundant files and optimized structure
- OOP Design: Clean, maintainable, and extensible codebase
- Configuration-Driven: YAML-based configuration management
- Comprehensive Logging: Multi-level logging with experiment tracking
- Data Validation: Automated data quality checks and recommendations
- Feature Engineering: Advanced feature creation and selection
- Model Persistence: Save/load trained models with metadata
- Visualization: Rich plots and evaluation reports
The project is organized as follows:

```text
Simple-ML-Project/
├── src/ # Source code
│ ├── core/ # Core base classes and interfaces
│ │ ├── interfaces.py # Abstract base classes and interfaces
│ │ ├── config.py # Configuration management
│ │ ├── logger.py # Logging system
│ │ ├── constants.py # Constants and enums
│ │ ├── exceptions.py # Custom exceptions
│ │ └── validators.py # Validation utilities
│ ├── data/ # Data processing modules
│ │ └── processor.py # Data processing and feature engineering
│ ├── models/ # Machine learning models
│ │ ├── base_model.py # Base model class
│ │ └── sklearn_models.py # Sklearn model implementations
│ ├── training/ # Training modules
│ │ ├── trainer.py # Model trainer
│ │ └── optimizer.py # Hyperparameter optimizer
│ ├── evaluation/ # Evaluation modules
│ │ ├── evaluator.py # Model evaluator
│ │ └── visualizer.py # Result visualizer
│ └── utils/ # Utility functions
│ ├── helpers.py # Helper functions
│ └── visualization.py # Visualization utilities
├── data/ # Data directory
│ ├── train_u6lujuX_CVtuZ9i.csv # Training data
│ └── test_Y3wMUE5_7gLdaTN.csv # Test data
├── outputs/ # Output directory
│ ├── models/ # Saved models
│ ├── curves/ # Plots and visualizations
│ └── reports/ # Evaluation reports
├── example/ # Example scripts
│ ├── run_logistic_regression.sh # Logistic Regression script
│ ├── run_random_forest.sh # Random Forest script
│ ├── run_xgboost.sh # XGBoost script
│ ├── run_lightgbm.sh # LightGBM script
│ ├── run_svm.sh # SVM script
│ ├── run_naive_bayes.sh # Naive Bayes script
│ ├── run_knn.sh # KNN script
│ └── run_all_models_comparison.sh # All models comparison
├── logs/ # Log files
├── swanlog/ # SwanLab experiment logs
├── config.yaml # Configuration file
├── main.py # Main entry point
├── test_system.py # System test suite
├── requirements.txt # Dependencies
└── README.md # This file
```
- Python 3.8+
- pip or conda

- Clone the repository

```bash
git clone <repository-url>
cd Simple-ML-Project
```

- Create a virtual environment

```bash
# Using conda (recommended)
conda create -n simple-ml python=3.9
conda activate simple-ml

# Or using venv
python -m venv simple-ml_env
source simple-ml_env/bin/activate  # On Windows: simple-ml_env\Scripts\activate
```

- Install dependencies

```bash
pip install -r requirements.txt
```

- Verify the installation

```bash
python test_system.py
```
If all tests pass, the installation is successful! 🎉
The easiest way to run different models is to use the provided bash scripts:

```bash
# Run individual models
./example/run_logistic_regression.sh   # Logistic Regression
./example/run_random_forest.sh         # Random Forest
./example/run_xgboost.sh               # XGBoost
./example/run_lightgbm.sh              # LightGBM
./example/run_svm.sh                   # Support Vector Machine
./example/run_naive_bayes.sh           # Naive Bayes
./example/run_knn.sh                   # K-Nearest Neighbors

# Compare all models at once
./example/run_all_models_comparison.sh
```

You can also drive the pipeline directly with `main.py`:

- Run the complete pipeline

```bash
python main.py --data data/train_u6lujuX_CVtuZ9i.csv --config config.yaml
```

- Train a specific model

```bash
# Edit config.yaml to set model_name to your preferred model
python main.py --data data/train_u6lujuX_CVtuZ9i.csv
```

- Run with a custom configuration

```bash
python main.py --data data/train_u6lujuX_CVtuZ9i.csv --config my_config.yaml
```

- Test system functionality

```bash
python test_system.py
```
You can also use the project's Python API directly:

```python
from src.core import ConfigManager, Logger
from src.data import LoanDataProcessor
from src.models import RandomForestModel, XGBoostModel
from src.training import ModelTrainer
from src.evaluation import ModelEvaluator
# Initialize components
config_manager = ConfigManager("config.yaml")
logger = Logger(config_manager.get_logger_config())
# Load and preprocess data
processor = LoanDataProcessor(config_manager.get_data_config(), logger)
df = processor.load_data("data/train_u6lujuX_CVtuZ9i.csv")
# Preprocess data (includes feature engineering and validation)
X, y = processor.preprocess(df)
X_train, X_val, y_train, y_val = processor.split_data(X, y)
# Train model
model = XGBoostModel(config_manager.get_model_config(), logger)
trainer = ModelTrainer(config_manager.get_train_config(), logger)
trainer.set_model(model)
trainer.train(X_train, y_train, X_val, y_val)
# Evaluate model
evaluator = ModelEvaluator(config_manager.get_metrics_config(), logger)
predictions = model.predict(X_val)
probabilities = model.predict_proba(X_val)
metrics = evaluator.compute_metrics(y_val, predictions, probabilities)
print(f"Model accuracy: {metrics['accuracy']:.4f}")
print(f"Model F1-score: {metrics['f1_score']:.4f}")
The system is fully configurable through `config.yaml`. Key configuration sections:

```yaml
Model:
  model_name: "XGBoost"        # or "RandomForest", "LightGBM", etc.
  model_type: "classification"
  model_params:
    n_estimators: 100
    learning_rate: 0.1
    max_depth: 6
    random_state: 42

Data:
  test_size: 0.2
  random_seed: 42
  missing_strategy: "smart"    # "smart", "drop", "fill"
  scale_features: true
  scaling_method: "standard"   # "standard", "minmax", "none"
  create_features: true
  feature_selection: true
  n_features_select: 10

Train:
  cv_folds: 5
  random_seed: 42
  verbose: 1

Metrics:
  metrics_list:
    - "accuracy"
    - "precision"
    - "recall"
    - "f1_score"
    - "roc_auc"
    - "confusion_matrix"
    - "classification_report"
```
The system analyzes user profiles to predict loan amounts. The input data includes key personal and financial factors:
Column | Type | Description | Impact on Loan Amount |
---|---|---|---|
Loan_ID | String | Unique loan identifier | - |
Gender | Categorical | Male/Female | Affects risk assessment |
Married | Categorical | Yes/No | Key factor: Married applicants often qualify for higher amounts |
Dependents | Categorical | Number of dependents | Key factor: More dependents may reduce available loan amount |
Education | Categorical | Graduate/Not Graduate | Key factor: Higher education typically increases loan eligibility |
Self_Employed | Categorical | Yes/No | Key factor: Employment stability affects loan approval and amount |
ApplicantIncome | Numeric | Applicant's income | Primary factor for loan amount calculation |
CoapplicantIncome | Numeric | Co-applicant's income | Additional income source for loan capacity |
LoanAmount | Numeric | Loan amount requested | Target variable for prediction |
Loan_Amount_Term | Numeric | Loan term in months | Affects monthly payment capacity |
Credit_History | Numeric | Credit history (0/1) | Critical for loan approval and amount |
Property_Area | Categorical | Urban/Rural/Semiurban | Location-based risk assessment |
Loan_Status | Categorical | Y/N (approval status) | Historical approval data for training |
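To inspect this schema yourself, here is a quick pandas sketch (column names follow the table above; the path points to the training CSV shipped in `data/`):

```python
import pandas as pd

# Load the training data
df = pd.read_csv("data/train_u6lujuX_CVtuZ9i.csv")

# Column types and missing values (handled later by the "smart" strategy)
print(df.dtypes)
print(df.isna().sum().sort_values(ascending=False))

# Basic distribution of the main numeric drivers
print(df[["ApplicantIncome", "CoapplicantIncome", "LoanAmount"]].describe())
```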
The system supports multiple models with different strengths. Here are the latest performance results:
Model | Accuracy | F1-Score | Precision | Recall | ROC-AUC | Training Time | Best For |
---|---|---|---|---|---|---|---|
LogisticRegression | 86.18% | 85.04% | 87.60% | 86.18% | 80.90% | 0.44s | Interpretability, Baseline |
RandomForest | 84.55% | 83.72% | 84.63% | 84.55% | 86.32% | 3.79s | General purpose, Feature importance |
XGBoost | ~85% | ~84% | ~85% | ~84% | ~86% | ~10s | High performance, Competitions |
LightGBM | 85.37% | 84.83% | 85.19% | 85.37% | 86.44% | 1.35s | Large datasets, Fast training |
SVM | 85.37% | 84.04% | 86.96% | 85.37% | 83.53% | 0.65s | Small datasets, High-dimensional |
NaiveBayes | 84.55% | 83.51% | 85.01% | 84.55% | 80.99% | 0.44s | Fast training, Baseline |
KNN | 84.55% | 83.90% | 84.38% | 84.55% | 77.09% | 0.77s | Simple, Non-parametric |
Results from latest test runs. Performance may vary based on data and configuration.
- Best Overall Performance: LogisticRegression (86.18% accuracy)
- Fastest Training: NaiveBayes & LogisticRegression (0.44s)
- Best for Large Datasets: LightGBM (fast + good performance)
- Most Interpretable: LogisticRegression
- Best for Feature Importance: RandomForest
- Polynomial Features: Create interaction terms and polynomial combinations (see the sketch after this list)
- Interaction Features: Automatic feature interaction creation
- Feature Selection: Automatic feature importance selection using mutual information
- Clustering Features: Add cluster-based features using K-means
- Power Transformations: Yeo-Johnson and quantile transformations
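As an illustration only, here is a minimal scikit-learn sketch of how several of these steps can fit together. This is not the project's `processor.py` implementation; `n_features` mirrors `n_features_select` in `config.yaml`, and missing values are assumed to be imputed already:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import PolynomialFeatures


def engineer_features(X: pd.DataFrame, y: pd.Series, n_features: int = 10) -> pd.DataFrame:
    """Add polynomial/interaction terms and a K-means cluster feature, then keep the best columns."""
    # Assumes missing values were already imputed upstream
    numeric = X.select_dtypes("number")

    # Polynomial and interaction terms for the numeric columns
    poly = PolynomialFeatures(degree=2, include_bias=False)
    poly_df = pd.DataFrame(
        poly.fit_transform(numeric),
        columns=poly.get_feature_names_out(numeric.columns),
        index=X.index,
    )

    # Cluster membership as an extra feature
    poly_df["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(numeric)

    # Keep the most informative columns according to mutual information
    selector = SelectKBest(mutual_info_classif, k=min(n_features, poly_df.shape[1]))
    selector.fit(poly_df, y)
    return poly_df.loc[:, selector.get_support()]
```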
- Missing Value Analysis: Comprehensive missing data assessment with quality scoring (see the sketch after this list)
- Outlier Detection: Multiple outlier detection methods (IQR, Z-score)
- Data Consistency: Logical constraint validation and business rule checking
- Quality Scoring: Automated data quality assessment (0-1 scale)
- Comprehensive Reports: Detailed validation reports with recommendations
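A minimal sketch of the kind of checks listed above (illustrative only, not the project's `validators.py` API):

```python
import numpy as np
import pandas as pd


def validate(df: pd.DataFrame) -> dict:
    """Simple data-quality report: missing values, IQR outliers, and a 0-1 quality score."""
    report = {}

    # Missing-value analysis: fraction of missing cells per column
    report["missing_ratio"] = df.isna().mean().to_dict()

    # IQR-based outlier detection for numeric columns
    outliers = {}
    for col in df.select_dtypes("number"):
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        outliers[col] = int(mask.sum())
    report["outliers"] = outliers

    # Crude 0-1 quality score: penalize missing cells and outliers
    penalty = df.isna().sum().sum() + sum(outliers.values())
    report["quality_score"] = float(np.clip(1 - penalty / df.size, 0, 1))
    return report
```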
- Cross-Validation: K-fold cross-validation with stratified sampling (see the sketch after this list)
- Comprehensive Metrics: Accuracy, Precision, Recall, F1, ROC-AUC, Average Precision
- Visualization: Confusion matrix, ROC curves, Precision-Recall curves, feature importance
- Model Comparison: Side-by-side model performance comparison
- Hyperparameter Optimization: Grid search and random search support
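For reference, the same ideas expressed directly with scikit-learn (a standalone sketch, assuming `X` and `y` come from the preprocessing step shown earlier):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# X, y: preprocessed features and labels from the pipeline above (assumed)

# Stratified 5-fold cross-validation, mirroring cv_folds: 5 in config.yaml
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1_weighted")
print(f"CV F1 (weighted): {scores.mean():.4f} ± {scores.std():.4f}")

# Grid search over a small hyperparameter space
grid = GridSearchCV(
    model,
    param_grid={"max_depth": [4, 6, 8], "n_estimators": [100, 300]},
    cv=cv,
    scoring="roc_auc",
)
grid.fit(X, y)
print("Best params:", grid.best_params_)
```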
The system generates comprehensive outputs in the `outputs/` directory:
- Trained model files (`.pkl` format)
- Model metadata and parameters
- Feature importance scores
- Cross-validation results
- Confusion matrices
- ROC curves
- Precision-Recall curves
- Feature importance plots
- Model comparison charts
- Comprehensive evaluation dashboards
- Evaluation metrics summary (`evaluation_results.yaml`)
- Data quality report
- Model performance comparison
- Training history logs
- Summary report (`summary_report.txt`)
- Detailed training logs
- SwanLab experiment tracking
- System performance metrics
Run the comprehensive test suite:
```bash
# Run system tests
python test_system.py

# Run specific module tests (if available)
python -m pytest tests/ -v

# Run with coverage
python -m pytest tests/ --cov=src --cov-report=html
```
The `test_system.py` script performs comprehensive testing of:
- ✅ All module imports
- ✅ Configuration loading
- ✅ Data processing pipeline
- ✅ Model creation and training
- ✅ Evaluation metrics
- ✅ File I/O operations
- ✅ System integration
The system provides comprehensive logging:
- Console Logging: Real-time progress updates
- File Logging: Detailed logs saved to files
- Experiment Tracking: SwanLab integration for experiment management (see the sketch below)
- Structured Logging: JSON-formatted logs for analysis
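For illustration, experiment tracking with SwanLab typically looks like the following (a standalone sketch assuming the standard `swanlab.init` / `swanlab.log` / `swanlab.finish` API; project name and metric values are placeholders):

```python
import swanlab

# Start a run; project and config values are illustrative
swanlab.init(
    project="Simple-ML-Project",
    config={"model_name": "XGBoost", "cv_folds": 5},
)

# Log metrics during or after training (placeholder values)
swanlab.log({"accuracy": 0.85, "f1_score": 0.84})

# Close the run
swanlab.finish()
```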
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
```bash
# Install development dependencies
pip install -r requirements.txt
pip install black flake8 mypy pytest

# Format code
black src/ tests/

# Lint code
flake8 src/ tests/

# Type checking
mypy src/
```
- ✅ Model Scripts: Added individual bash scripts for all 7 models
- ✅ Model Comparison: Added comprehensive model comparison script
- ✅ Performance Testing: All model scripts tested and working
- ✅ Documentation: Updated README with model scripts usage
- ✅ Model Registration: Fixed missing model registrations (SVM, NaiveBayes, KNN)
- ✅ Parameter Fixes: Fixed model parameter issues (NaiveBayes var_smoothing)
- ✅ Code Cleanup: Removed redundant files and optimized project structure
- ✅ Import Fixes: Fixed all import issues and module dependencies
- ✅ Configuration Updates: Updated config.yaml for better compatibility
- ✅ System Testing: Added comprehensive test suite (`test_system.py`)
- ✅ Performance Optimization: Improved model training and evaluation
- ✅ Documentation: Updated README with latest features and usage
- ✅ Complete System Overhaul: Refactored entire codebase with modern OOP design
- ✅ Advanced Feature Engineering: Added polynomial features, clustering features, and interaction terms
- ✅ Comprehensive Data Validation: Implemented automated data quality assessment
- ✅ Multi-Model Support: Added support for 7+ machine learning models
- ✅ Enhanced Evaluation: Added comprehensive metrics and visualization
- ✅ Configuration Management: YAML-based configuration system
- ✅ Logging System: Multi-level logging with experiment tracking
- ✅ SwanLab Integration: Experiment tracking and visualization
- ✅ Modular Architecture: Clean, maintainable, and extensible design
- ✅ Documentation: Comprehensive documentation and examples
This project is licensed under the MIT License - see the LICENSE file for details.
- Scikit-learn team for the excellent ML library
- XGBoost and LightGBM teams for gradient boosting libraries
- Pandas and NumPy teams for data manipulation tools
- Matplotlib and Seaborn teams for visualization tools
For questions, issues, or contributions:
- Check the Issues page
- Create a new issue with detailed description
- Contact the maintainers
Built with ❤️ for the machine learning community