### Model Training To-Do List

1. **Setup & Data Preparation**
   - [ ] Install required packages: `mlflow`, `scikit-learn`, `pandas`, `numpy`, `matplotlib`, `seaborn`
   - [ ] Load `data/processed/final_customer_data_with_risk.csv`
   - [ ] Split data into features (X) and target (y = 'is_high_risk')
   - [ ] Split into train/validation/test sets (80/10/10)

2. **Model Training**
   - [ ] Set up MLflow experiment tracking
   - [ ] Train baseline models:
     - [ ] Logistic Regression
     - [ ] Random Forest
     - [ ] XGBoost (optional)
   - [ ] Log all experiments with parameters and metrics

3. **Hyperparameter Tuning**
   - [ ] Tune the best-performing model using `GridSearchCV`/`RandomizedSearchCV`
   - [ ] Log best parameters and retrain model

4. **Model Evaluation**
   - [ ] Evaluate on validation set:
     - [ ] Accuracy, Precision, Recall, F1
     - [ ] ROC-AUC score
     - [ ] Confusion matrix
   - [ ] Generate feature importance plots

5. **Final Model**
   - [ ] Train final model on train+validation data
   - [ ] Evaluate on test set
   - [ ] Save the best model

6. **Documentation**
   - [ ] Add markdown cells explaining each step
   - [ ] Include visualizations
   - [ ] Document key findings and model performance

7. **Cleanup**
   - [ ] Remove any temporary code
   - [ ] Ensure all cells run in sequence
   - [ ] Save and commit changes



### 1. **Setup & Data Preparation**
   - **Install required packages**: Install all necessary libraries for data processing, model training, and visualization.
   - **Load the dataset**: Read `data/processed/final_customer_data_with_risk.csv` into a pandas DataFrame.
   - **Split into features and target**: 
     - Features (X): All columns except `is_high_risk` (e.g., RFM metrics, transaction history).
     - Target (y): The `is_high_risk` column (0 or 1).
   - **Train/Validation/Test Split**:
     - 80% for training, 10% for validation, and 10% for testing.
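     - `train_test_split` produces only two subsets per call, so call it twice: first hold out 20%, then split that hold-out evenly into validation and test. Pass `stratify` each time to maintain the class distribution.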
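
The 80/10/10 stratified split described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the data here is a synthetic stand-in generated with `make_classification` (an assumption, since the real CSV is not available in this example), and the `feature_i` column names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for final_customer_data_with_risk.csv (assumption:
# in the notebook this frame would come from pd.read_csv instead).
features, labels = make_classification(
    n_samples=1000, n_features=6, weights=[0.8, 0.2], random_state=42
)
df = pd.DataFrame(features, columns=[f"feature_{i}" for i in range(6)])
df["is_high_risk"] = labels

# Separate features and target.
X = df.drop(columns=["is_high_risk"])
y = df["is_high_risk"]

# First split: 80% train, 20% held out.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
# Second split: divide the held-out 20% evenly into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, stratify=y_hold, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Fixing `random_state` keeps the three subsets reproducible across notebook runs.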

---

### 2. **Model Training**
   - **Set up MLflow**: Initialize MLflow to log experiments, parameters, and metrics.
   - **Train baseline models**:
     - **Logistic Regression**: A simple, interpretable model to establish a baseline.
     - **Random Forest**: Handles non-linear relationships and feature interactions.
     - **XGBoost (optional)**: A gradient-boosted tree model that often outperforms the two baselines on tabular data.
   - **Log experiments**: Track model parameters, metrics, and artifacts (e.g., plots, feature importance) in MLflow.

---

### 3. **Hyperparameter Tuning**
   - **Select the best-performing model** (e.g., Random Forest).
   - **Define a hyperparameter grid** (e.g., `n_estimators`, `max_depth`).
   - **Use `GridSearchCV` or `RandomizedSearchCV`** to find the best hyperparameters.
   - **Log the best parameters** and retrain the model on the full training set (with the default `refit=True`, the search object does this retraining automatically).
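
A small grid search over the two hyperparameters named above might look like this. The grid values, the `roc_auc` scoring choice, and the 3-fold setting are illustrative assumptions, and the data is again a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: replaces the real customer dataset).
X, y = make_classification(
    n_samples=500, n_features=6, weights=[0.8, 0.2], random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Illustrative grid; real ranges depend on the dataset size and time budget.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",  # assumption: rank candidates by ROC-AUC
    cv=3,
    refit=True,  # refit the best configuration on all of X_train
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

The `search.best_params_` dictionary can be passed straight to `mlflow.log_params` inside an active run.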

---

### 4. **Model Evaluation**
   - **Evaluate on the validation set**:
     - **Metrics**: Calculate accuracy, precision, recall, F1-score, and ROC-AUC.
     - **Confusion Matrix**: Visualize true/false positives/negatives.
     - **ROC Curve**: Plot the trade-off between true positive rate and false positive rate.
   - **Feature Importance**: Identify which features most influence the model's predictions.
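
The validation metrics listed above can be computed in a few lines. This sketch uses a Random Forest on synthetic data as a stand-in for the tuned model (an assumption), with hypothetical `feature_i` names, and leaves plotting out so that it runs without a display.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical feature names.
X, y = make_classification(
    n_samples=500, n_features=6, weights=[0.8, 0.2], random_state=42
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_val)               # hard labels for threshold metrics
y_proba = model.predict_proba(X_val)[:, 1]  # scores for ROC-AUC

metrics = {
    "accuracy": accuracy_score(y_val, y_pred),
    "precision": precision_score(y_val, y_pred, zero_division=0),
    "recall": recall_score(y_val, y_pred),
    "f1": f1_score(y_val, y_pred),
    "roc_auc": roc_auc_score(y_val, y_proba),
}

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_val, y_pred)

# Impurity-based importances, sorted from most to least influential.
importances = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

print(metrics)
print(cm)
print(importances[:3])
```

For the plots, scikit-learn's `ConfusionMatrixDisplay.from_predictions` and `RocCurveDisplay.from_predictions` accept the same `y_val`, `y_pred`, and `y_proba` arrays.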

---

### 5. **Final Model**
   - **Combine training and validation sets** for the final training.
   - **Retrain the best model** on this combined dataset.
   - **Evaluate on the test set** to get an unbiased estimate of performance.
   - **Save the model** (e.g., using `joblib` or `pickle`) for future use.
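
The final-model steps could be sketched as below, assuming a Random Forest was selected and using `joblib` for persistence. The file name `final_model.joblib`, the temporary output directory, and the synthetic data are hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, split 80/10/10 as in the earlier steps.
X, y = make_classification(
    n_samples=1000, n_features=6, weights=[0.8, 0.2], random_state=42
)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=42
)

# Combine train and validation data for the final fit.
X_trainval = np.vstack([X_train, X_val])
y_trainval = np.concatenate([y_train, y_val])

final_model = RandomForestClassifier(n_estimators=100, random_state=42)
final_model.fit(X_trainval, y_trainval)

# The test set is touched only once, after all model selection is done.
test_auc = roc_auc_score(y_test, final_model.predict_proba(X_test)[:, 1])
print(f"Test ROC-AUC: {test_auc:.3f}")

# Hypothetical output location; a project would likely use a models/ folder.
model_path = Path(tempfile.mkdtemp()) / "final_model.joblib"
joblib.dump(final_model, model_path)

# Reload to confirm the saved artifact reproduces the same predictions.
reloaded = joblib.load(model_path)
```

Reloading right after saving is a cheap sanity check that the persisted file is usable.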

---

### 6. **Documentation**
   - **Add markdown cells** to explain each step clearly.
   - **Include visualizations** (e.g., ROC curves, confusion matrices, feature importance plots).
   - **Summarize findings**: Note which model performed best, key insights, and potential improvements.

---

### 7. **Cleanup**
   - **Remove any temporary or redundant code** to keep the notebook clean.
   - **Ensure all cells run in sequence** without errors.
   - **Save the notebook** and commit changes to your Git repository.

---

### Next Steps:
1. **Start with the first step** (Setup & Data Preparation) and run each cell to ensure everything loads correctly.
2. **Proceed incrementally**, checking outputs at each stage.
3. **Use MLflow** to track experiments and compare models.



## Check all dependencies and write them to the requirements file

In [4]:
import subprocess
import sys
import importlib.metadata  # the submodule must be imported explicitly (Python 3.8+)
from pathlib import Path

def get_installed_packages():
    """Return the lowercase names of all installed distributions."""
    names = set()
    for dist in importlib.metadata.distributions():
        name = dist.metadata['Name']
        if name:  # skip distributions with missing or broken metadata
            # Treat "_" and "-" as equivalent, e.g. scikit_learn vs scikit-learn
            names.add(name.lower().replace('_', '-'))
    return names

def update_requirements(requirements_path='requirements.txt'):
    # List of required packages
    required_packages = [
        'pandas',
        'numpy',
        'scikit-learn',
        'matplotlib',
        'seaborn',
        'mlflow',
        'xgboost',
        'ipykernel',
        'jupyter',
        'scipy',
        'imbalanced-learn',
        'pytest',
        'pytest-cov'
    ]
    
    # Read existing requirements
    req_file = Path(requirements_path)
    if req_file.exists():
        with open(req_file, 'r') as f:
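            # Keep only the package name; skip blank lines and comments
            existing_packages = {
                line.split('==')[0].lower().strip()
                for line in f
                if line.strip() and not line.lstrip().startswith('#')
            }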
    else:
        existing_packages = set()
    
    # Get installed packages
    installed_packages = get_installed_packages()
    
    # Packages not yet listed in requirements.txt
    unlisted_packages = [pkg for pkg in required_packages
                         if pkg.lower() not in existing_packages]
    # Packages not installed in the current environment
    uninstalled_packages = [pkg for pkg in required_packages
                            if pkg.lower() not in installed_packages]

    # Install first, so that a version can be pinned for every package
    if uninstalled_packages:
        print("Installing missing packages...")
        subprocess.check_call([sys.executable, "-m", "pip", "install"] + uninstalled_packages)
        importlib.invalidate_caches()
        print("✓ Installation complete!")

    # Update requirements.txt if needed
    if unlisted_packages:
        print("Adding missing packages to requirements.txt:")
        with open(requirements_path, 'a') as f:
            for pkg in unlisted_packages:
                try:
                    # Pin the version that is actually installed
                    version = importlib.metadata.version(pkg)
                    f.write(f"{pkg}=={version}\n")
                    print(f"✓ Added {pkg}=={version}")
                except importlib.metadata.PackageNotFoundError:
                    print(f"⚠ {pkg} is not installed; skipped")
    else:
        print("All required packages are already in requirements.txt")

# Run the function
update_requirements()

All required packages are already in requirements.txt
