### Model Training To-Do List

1. **Setup & Data Preparation**
   - [ ] Install required packages: `mlflow`, `scikit-learn`, `pandas`, `numpy`, `matplotlib`, `seaborn`
   - [ ] Load `final_customer_data_with_risk.csv`
   - [ ] Split data into features (X) and target (y = 'is_high_risk')
   - [ ] Split into train/validation/test sets (80/10/10)

2. **Model Training**
   - [ ] Set up MLflow experiment tracking
   - [ ] Train baseline models:
     - [ ] Logistic Regression
     - [ ] Random Forest
     - [ ] XGBoost (optional)
   - [ ] Log all experiments with parameters and metrics

3. **Hyperparameter Tuning**
   - [ ] Tune best performing model using GridSearchCV/RandomizedSearchCV
   - [ ] Log best parameters and retrain model

4. **Model Evaluation**
   - [ ] Evaluate on validation set:
     - [ ] Accuracy, Precision, Recall, F1
     - [ ] ROC-AUC score
     - [ ] Confusion matrix
   - [ ] Generate feature importance plots

5. **Final Model**
   - [ ] Train final model on train+validation data
   - [ ] Evaluate on test set
   - [ ] Save the best model

6. **Documentation**
   - [ ] Add markdown cells explaining each step
   - [ ] Include visualizations
   - [ ] Document key findings and model performance

7. **Cleanup**
   - [ ] Remove any temporary code
   - [ ] Ensure all cells run in sequence
   - [ ] Save and commit changes



### 1. **Setup & Data Preparation**
   - **Install required packages**: Install all necessary libraries for data processing, model training, and visualization.
   - **Load the dataset**: Read `final_customer_data_with_risk.csv` into a pandas DataFrame.
   - **Split into features and target**: 
     - Features (X): All columns except `is_high_risk` (e.g., RFM metrics, transaction history).
     - Target (y): The `is_high_risk` column (0 or 1).
   - **Train/Validation/Test Split**:
     - 80% for training, 10% for validation, and 10% for testing.
     - Use `train_test_split` with `stratify=y` to maintain class distribution.
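
The 80/10/10 stratified split described above can be done with two calls to `train_test_split`. The following is a minimal sketch on synthetic data; the helper name `three_way_split` and the toy frame are assumptions for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def three_way_split(X, y, val_size=0.1, test_size=0.1, random_state=42):
    """Split into train/validation/test while preserving class proportions."""
    # First carve off the combined validation + test portion (20% by default).
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y,
        test_size=val_size + test_size,
        random_state=random_state,
        stratify=y,
    )
    # Then divide that held-out portion between validation and test.
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp,
        test_size=test_size / (val_size + test_size),
        random_state=random_state,
        stratify=y_temp,
    )
    return X_train, X_val, X_test, y_train, y_val, y_test

# Synthetic stand-in for the real data: 1,000 rows, 20% positives.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({"feature": rng.normal(size=1000)})
y_demo = pd.Series(np.repeat([0, 1], [800, 200]), name="target")

X_tr, X_va, X_te, y_tr, y_va, y_te = three_way_split(X_demo, y_demo)
print(len(X_tr), len(X_va), len(X_te))
print(y_tr.mean(), y_va.mean(), y_te.mean())
```

The training portion must be the larger output of the first call. Assigning the 20% side to the test set and then halving the 80% side yields a 40/40/20 split instead.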

---

### 2. **Model Training**
   - **Set up MLflow**: Initialize MLflow to log experiments, parameters, and metrics.
   - **Train baseline models**:
     - **Logistic Regression**: A simple, interpretable model to establish a baseline.
     - **Random Forest**: Handles non-linear relationships and feature interactions.
     - **XGBoost (optional)**: A powerful gradient-boosted tree model for better performance.
   - **Log experiments**: Track model parameters, metrics, and artifacts (e.g., plots, feature importance) in MLflow.

---

### 3. **Hyperparameter Tuning**
   - **Select the best-performing model** (e.g., Random Forest).
   - **Define a hyperparameter grid** (e.g., `n_estimators`, `max_depth`).
   - **Use `GridSearchCV` or `RandomizedSearchCV`** to find the best hyperparameters.
   - **Log the best parameters** and retrain the model on the full training set.

---

### 4. **Model Evaluation**
   - **Evaluate on the validation set**:
     - **Metrics**: Calculate accuracy, precision, recall, F1-score, and ROC-AUC.
     - **Confusion Matrix**: Visualize true/false positives/negatives.
     - **ROC Curve**: Plot the trade-off between true positive rate and false positive rate.
   - **Feature Importance**: Identify which features most influence the model's predictions.
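
The evaluation steps above can be sketched as follows. Because the target loaded later in this notebook (`Risk_Label`) has three classes rather than a binary flag, the example uses a three-class synthetic problem and the one-vs-rest form of ROC-AUC. The data and the choice of a random forest are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the real features.
X, y = make_classification(
    n_samples=1500, n_features=10, n_informative=6, n_classes=3, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
y_pred = clf.predict(X_va)
y_proba = clf.predict_proba(X_va)  # one probability column per class

# Per-class precision, recall and F1, plus overall accuracy.
print(classification_report(y_va, y_pred))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_va, y_pred)
print(cm)

# Multiclass ROC-AUC needs class probabilities and an averaging scheme.
auc = roc_auc_score(y_va, y_proba, multi_class="ovr", average="macro")
print(f"Macro one-vs-rest ROC-AUC: {auc:.3f}")

# Impurity-based feature importances, most important first.
order = np.argsort(clf.feature_importances_)[::-1]
for idx in order[:5]:
    print(f"feature_{idx}: {clf.feature_importances_[idx]:.3f}")
```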

---

### 5. **Final Model**
   - **Combine training and validation sets** for the final training.
   - **Retrain the best model** on this combined dataset.
   - **Evaluate on the test set** to get an unbiased estimate of performance.
   - **Save the model** (e.g., using `joblib` or `pickle`) for future use.
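
A minimal sketch of these final steps, assuming a random forest was selected and using synthetic arrays in place of the real splits. The file name `final_model.joblib` and the temporary directory are hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the three splits (800 / 100 / 100 rows).
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)
X_train, X_val, X_test = X[:800], X[800:900], X[900:]
y_train, y_val, y_test = y[:800], y[800:900], y[900:]

# Combine train and validation once model selection is finished.
X_trainval = np.vstack([X_train, X_val])
y_trainval = np.concatenate([y_train, y_val])

final_model = RandomForestClassifier(n_estimators=50, random_state=42)
final_model.fit(X_trainval, y_trainval)

# The test set is touched only once, here.
test_accuracy = final_model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")

# Persist the fitted model and check that it round-trips.
model_path = Path(tempfile.mkdtemp()) / "final_model.joblib"
joblib.dump(final_model, model_path)
reloaded = joblib.load(model_path)
same_predictions = np.array_equal(
    reloaded.predict(X_test), final_model.predict(X_test)
)
print("Reloaded model matches:", same_predictions)
```

If preprocessing is part of the workflow, saving a single fitted `Pipeline` (preprocessor plus model) avoids having to keep two artifacts in sync.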

---

### 6. **Documentation**
   - **Add markdown cells** to explain each step clearly.
   - **Include visualizations** (e.g., ROC curves, confusion matrices, feature importance plots).
   - **Summarize findings**: Note which model performed best, key insights, and potential improvements.

---

### 7. **Cleanup**
   - **Remove any temporary or redundant code** to keep the notebook clean.
   - **Ensure all cells run in sequence** without errors.
   - **Save the notebook** and commit changes to your Git repository.

---

### Next Steps:
1. **Start with the first step** (Setup & Data Preparation) and run each cell to ensure everything loads correctly.
2. **Proceed incrementally**, checking outputs at each stage.
3. **Use MLflow** to track experiments and compare models.



## Check all dependencies and write them to the requirements file

In [10]:
import subprocess
import sys
import importlib.metadata  # Python 3.8+; a bare `import importlib` does not load this submodule
from pathlib import Path

def get_installed_packages():
    """Get a set of lowercase package names that are currently installed."""
    return {
        dist.metadata['Name'].lower()
        for dist in importlib.metadata.distributions()
        if dist.metadata['Name']  # skip distributions with broken metadata
    }

def update_requirements(requirements_path='requirements.txt'):
    # List of required packages
    required_packages = [
        'pandas',
        'numpy',
        'scikit-learn',
        'matplotlib',
        'seaborn',
        'mlflow',
        'xgboost',
        'ipykernel',
        'jupyter',
        'scipy',
        'imbalanced-learn',
        'pytest',
        'pytest-cov'
    ]
    
    # Read existing requirements
    req_file = Path(requirements_path)
    if req_file.exists():
        with open(req_file, 'r') as f:
            existing_packages = {line.split('==')[0].lower().strip() for line in f if line.strip()}
    else:
        existing_packages = set()
    
    # Get installed packages
    installed_packages = get_installed_packages()
    
    # Packages that still need installing, and packages not yet listed in requirements.txt
    to_install = [pkg for pkg in required_packages if pkg.lower() not in installed_packages]
    to_record = [pkg for pkg in required_packages if pkg.lower() not in existing_packages]
    
    # Install first so that the installed versions can be pinned afterwards
    if to_install:
        print("Installing missing packages...")
        subprocess.check_call([sys.executable, "-m", "pip", "install"] + to_install)
        print("✓ Installation complete!")
    
    # Update requirements.txt if needed
    if to_record:
        print("Adding missing packages to requirements.txt:")
        with open(requirements_path, 'a') as f:
            for pkg in to_record:
                try:
                    # Pin the version that is actually installed
                    version = importlib.metadata.version(pkg)
                    f.write(f"{pkg}=={version}\n")
                    print(f"✓ Added {pkg}=={version}")
                except importlib.metadata.PackageNotFoundError:
                    print(f"⚠ {pkg} is not installed; skipping it in requirements.txt")
    else:
        print("All required packages are already in requirements.txt")

# Run the function
update_requirements()

All required packages are already in requirements.txt


## Import the necessary libraries


In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
import mlflow
import mlflow.sklearn
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Set plot style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Load the data

In [12]:
# Load the processed data
data_path = '../data/processed/final_customer_data_with_risk.csv'
df = pd.read_csv(data_path)

# Display basic info
print("Data shape:", df.shape)
print("\nFirst 5 rows:")
display(df.head())

Data shape: (95662, 18)

First 5 rows:


| | TransactionId | BatchId | AccountId | SubscriptionId | CustomerId | CurrencyCode | CountryCode | ProviderId | ProductId | ProductCategory | ChannelId | Amount | Value | TransactionStartTime | PricingStrategy | FraudResult | Risk_Label | Cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TransactionId_76871 | BatchId_36123 | AccountId_3957 | SubscriptionId_887 | CustomerId_4406 | UGX | 256 | ProviderId_6 | ProductId_10 | airtime | ChannelId_3 | 1000.0 | 1000 | 2018-11-15T02:18:49Z | 2 | 0 | Medium Risk | 1 |
| 1 | TransactionId_73770 | BatchId_15642 | AccountId_4841 | SubscriptionId_3829 | CustomerId_4406 | UGX | 256 | ProviderId_4 | ProductId_6 | financial_services | ChannelId_2 | -20.0 | 20 | 2018-11-15T02:19:08Z | 2 | 0 | Medium Risk | 1 |
| 2 | TransactionId_26203 | BatchId_53941 | AccountId_4229 | SubscriptionId_222 | CustomerId_4683 | UGX | 256 | ProviderId_6 | ProductId_1 | airtime | ChannelId_3 | 500.0 | 500 | 2018-11-15T02:44:21Z | 2 | 0 | Low Risk | 0 |
| 3 | TransactionId_380 | BatchId_102363 | AccountId_648 | SubscriptionId_2185 | CustomerId_988 | UGX | 256 | ProviderId_1 | ProductId_21 | utility_bill | ChannelId_3 | 20000.0 | 21800 | 2018-11-15T03:32:55Z | 2 | 0 | Medium Risk | 1 |
| 4 | TransactionId_28195 | BatchId_38780 | AccountId_4841 | SubscriptionId_3829 | CustomerId_988 | UGX | 256 | ProviderId_4 | ProductId_6 | financial_services | ChannelId_2 | -644.0 | 644 | 2018-11-15T03:34:21Z | 2 | 0 | Medium Risk | 1 |


In [13]:
# Display all column names
print("Available columns in the DataFrame:")
print(df.columns.tolist())

# Display the first few rows to see the data
print("\nFirst 5 rows of the DataFrame:")
display(df.head())

Available columns in the DataFrame:
['TransactionId', 'BatchId', 'AccountId', 'SubscriptionId', 'CustomerId', 'CurrencyCode', 'CountryCode', 'ProviderId', 'ProductId', 'ProductCategory', 'ChannelId', 'Amount', 'Value', 'TransactionStartTime', 'PricingStrategy', 'FraudResult', 'Risk_Label', 'Cluster']

First 5 rows of the DataFrame:


| | TransactionId | BatchId | AccountId | SubscriptionId | CustomerId | CurrencyCode | CountryCode | ProviderId | ProductId | ProductCategory | ChannelId | Amount | Value | TransactionStartTime | PricingStrategy | FraudResult | Risk_Label | Cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TransactionId_76871 | BatchId_36123 | AccountId_3957 | SubscriptionId_887 | CustomerId_4406 | UGX | 256 | ProviderId_6 | ProductId_10 | airtime | ChannelId_3 | 1000.0 | 1000 | 2018-11-15T02:18:49Z | 2 | 0 | Medium Risk | 1 |
| 1 | TransactionId_73770 | BatchId_15642 | AccountId_4841 | SubscriptionId_3829 | CustomerId_4406 | UGX | 256 | ProviderId_4 | ProductId_6 | financial_services | ChannelId_2 | -20.0 | 20 | 2018-11-15T02:19:08Z | 2 | 0 | Medium Risk | 1 |
| 2 | TransactionId_26203 | BatchId_53941 | AccountId_4229 | SubscriptionId_222 | CustomerId_4683 | UGX | 256 | ProviderId_6 | ProductId_1 | airtime | ChannelId_3 | 500.0 | 500 | 2018-11-15T02:44:21Z | 2 | 0 | Low Risk | 0 |
| 3 | TransactionId_380 | BatchId_102363 | AccountId_648 | SubscriptionId_2185 | CustomerId_988 | UGX | 256 | ProviderId_1 | ProductId_21 | utility_bill | ChannelId_3 | 20000.0 | 21800 | 2018-11-15T03:32:55Z | 2 | 0 | Medium Risk | 1 |
| 4 | TransactionId_28195 | BatchId_38780 | AccountId_4841 | SubscriptionId_3829 | CustomerId_988 | UGX | 256 | ProviderId_4 | ProductId_6 | financial_services | ChannelId_2 | -644.0 | 644 | 2018-11-15T03:34:21Z | 2 | 0 | Medium Risk | 1 |


In [14]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())


# Separate the target from the features; the target column is 'Risk_Label'
y = df['Risk_Label']
X = df.drop('Risk_Label', axis=1)

# Class distribution check
print("\nClass distribution:")
print(df['Risk_Label'].value_counts(normalize=True))
# Basic statistics
print("\nNumerical features statistics:")
display(df.describe())

Missing values per column:
TransactionId           0
BatchId                 0
AccountId               0
SubscriptionId          0
CustomerId              0
CurrencyCode            0
CountryCode             0
ProviderId              0
ProductId               0
ProductCategory         0
ChannelId               0
Amount                  0
Value                   0
TransactionStartTime    0
PricingStrategy         0
FraudResult             0
Risk_Label              0
Cluster                 0
dtype: int64

Class distribution:
Risk_Label
Medium Risk    0.761661
High Risk      0.208975
Low Risk       0.029364
Name: proportion, dtype: float64

Numerical features statistics:


| | CountryCode | Amount | Value | PricingStrategy | FraudResult | Cluster |
|---|---|---|---|---|---|---|
| count | 95662.0 | 95662.0 | 95662.0 | 95662.0 | 95662.0 | 95662.0 |
| mean | 256.0 | 6717.846 | 9900.584 | 2.255974 | 0.002018 | 1.179612 |
| std | 0.0 | 123306.8 | 123122.1 | 0.732924 | 0.044872 | 0.453961 |
| min | 256.0 | -1000000.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| 25% | 256.0 | -50.0 | 275.0 | 2.0 | 0.0 | 1.0 |
| 50% | 256.0 | 1000.0 | 1000.0 | 2.0 | 0.0 | 1.0 |
| 75% | 256.0 | 2800.0 | 5000.0 | 2.0 | 0.0 | 1.0 |
| max | 256.0 | 9880000.0 | 9880000.0 | 4.0 | 1.0 | 2.0 |


## Train/validation/test split

In [15]:
from sklearn.model_selection import train_test_split

# First split: hold out 20% as the test set; the other 80% goes to X_temp
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=RANDOM_STATE,
    stratify=y
)

# Split X_temp 50/50 into train and validation (40% each of the total, not the planned 80/10/10)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp,
    test_size=0.5,
    random_state=RANDOM_STATE,
    stratify=y_temp
)

# Print the shapes
print("Training set:", X_train.shape, y_train.shape)
print("Validation set:", X_val.shape, y_val.shape)
print("Test set:", X_test.shape, y_test.shape)

# Check class distribution in each set
print("\nClass distribution in training set:")
print(y_train.value_counts(normalize=True))
print("\nClass distribution in validation set:")
print(y_val.value_counts(normalize=True))
print("\nClass distribution in test set:")
print(y_test.value_counts(normalize=True))

Training set: (38264, 17) (38264,)
Validation set: (38265, 17) (38265,)
Test set: (19133, 17) (19133,)

Class distribution in training set:
Risk_Label
Medium Risk    0.761656
High Risk      0.208969
Low Risk       0.029375
Name: proportion, dtype: float64

Class distribution in validation set:
Risk_Label
Medium Risk    0.761662
High Risk      0.208990
Low Risk       0.029348
Name: proportion, dtype: float64

Class distribution in test set:
Risk_Label
Medium Risk    0.761668
High Risk      0.208958
Low Risk       0.029373
Name: proportion, dtype: float64


In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
import os

# Set random seed for reproducibility
RANDOM_STATE = 42

# File paths
input_file = r"C:\Users\My Device\Desktop\Week-4_KAIM\data\processed\final_customer_data_with_risk.csv"
output_dir = r"C:\Users\My Device\Desktop\Week-4_KAIM\data\splits"

# Create output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

def load_and_split_data():
    print("Loading data...")
    df = pd.read_csv(input_file)
    print(f"Original data shape: {df.shape}")
    
    # Check if target column exists
    target_col = 'Risk_Label'  # Change this if your target column has a different name
    if target_col not in df.columns:
        raise ValueError(f"Target column '{target_col}' not found in the data")
    
    # Separate features and target
    X = df.drop(columns=[target_col])
    y = df[target_col]
    
    # First split: 70% training, 30% temp
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, 
        test_size=0.3, 
        random_state=RANDOM_STATE,
        stratify=y
    )
    
    # Split temp into validation and test (50/50 of temp = 15% each of total)
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp,
        test_size=0.5,
        random_state=RANDOM_STATE,
        stratify=y_temp
    )
    
    # Add target back to features for saving
    train_data = X_train.copy()
    train_data[target_col] = y_train
    
    val_data = X_val.copy()
    val_data[target_col] = y_val
    
    test_data = X_test.copy()
    test_data[target_col] = y_test
    
    # Save the splits
    print("\nSaving split datasets...")
    train_data.to_csv(os.path.join(output_dir, "train.csv"), index=False)
    val_data.to_csv(os.path.join(output_dir, "val.csv"), index=False)
    test_data.to_csv(os.path.join(output_dir, "test.csv"), index=False)
    
    # Print dataset sizes
    print("\nDataset sizes:")
    print(f"Training set: {len(train_data):,} samples")
    print(f"Validation set: {len(val_data):,} samples")
    print(f"Test set: {len(test_data):,} samples")
    
    # Print class distribution
    print("\nClass distribution:")
    for name, data in [('Training', train_data), ('Validation', val_data), ('Test', test_data)]:
        print(f"\n{name} set:")
        print(data[target_col].value_counts(normalize=True).sort_index())
    
    print(f"\nAll datasets saved to: {output_dir}")

if __name__ == "__main__":
    load_and_split_data()

Loading data...
Original data shape: (95662, 18)

Saving split datasets...

Dataset sizes:
Training set: 66,963 samples
Validation set: 14,349 samples
Test set: 14,350 samples

Class distribution:

Training set:
Risk_Label
High Risk      0.208981
Low Risk       0.029359
Medium Risk    0.761659
Name: proportion, dtype: float64

Validation set:
Risk_Label
High Risk      0.208934
Low Risk       0.029410
Medium Risk    0.761656
Name: proportion, dtype: float64

Test set:
Risk_Label
High Risk      0.208990
Low Risk       0.029338
Medium Risk    0.761672
Name: proportion, dtype: float64

All datasets saved to: C:\Users\My Device\Desktop\Week-4_KAIM\data\splits


## Load the splits

In [6]:
import pandas as pd
import numpy as np
import os
from pathlib import Path

# Load the split data
data_dir = Path('../data/splits/')
X_train = pd.read_csv(data_dir / 'train.csv')
X_val = pd.read_csv(data_dir / 'val.csv')
X_test = pd.read_csv(data_dir / 'test.csv')

# The target column is 'Risk_Label' in each file
y_train = X_train.pop('Risk_Label')
y_val = X_val.pop('Risk_Label')
y_test = X_test.pop('Risk_Label')

print("Data loaded successfully!")
print(f"Training set: {X_train.shape[0]} samples")
print(f"Validation set: {X_val.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

Data loaded successfully!
Training set: 66963 samples
Validation set: 14349 samples
Test set: 14350 samples


## Set up MLflow

In [8]:
import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature

# Set up MLflow
mlflow.set_tracking_uri("file:./mlruns")
mlflow.set_experiment("Credit_Risk_Prediction")

<Experiment: artifact_location=('file:c:/Users/My '
 'Device/Desktop/Week-4_KAIM/notebooks/mlruns/675144119365642279'), creation_time=1765893232336, experiment_id='675144119365642279', last_update_time=1765893232336, lifecycle_stage='active', name='Credit_Risk_Prediction', tags={}>

In [9]:
print("Training set shape:", X_train.shape)
print("Validation set shape:", X_val.shape)
print("Test set shape:", X_test.shape)

print("\nClass distribution in training set:")
print(y_train.value_counts(normalize=True))

Training set shape: (66963, 17)
Validation set shape: (14349, 17)
Test set shape: (14350, 17)

Class distribution in training set:
Risk_Label
Medium Risk    0.761659
High Risk      0.208981
Low Risk       0.029359
Name: proportion, dtype: float64


## Preprocess the data

In [16]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Identify column types
categorical_cols = X_train.select_dtypes(include=['object']).columns.tolist()
numeric_cols = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()

print("Categorical columns:", categorical_cols)
print("Numeric columns:", numeric_cols)

# Create transformers
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Create column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Fit and transform the data
print("\nFitting and transforming data...")
X_train_processed = preprocessor.fit_transform(X_train)
X_val_processed = preprocessor.transform(X_val)
X_test_processed = preprocessor.transform(X_test)

print("\nProcessed data shapes:")
print("Training:", X_train_processed.shape)
print("Validation:", X_val_processed.shape)
print("Test:", X_test_processed.shape)

Categorical columns: ['TransactionId', 'BatchId', 'AccountId', 'SubscriptionId', 'CustomerId', 'CurrencyCode', 'ProviderId', 'ProductId', 'ProductCategory', 'ChannelId', 'TransactionStartTime']
Numeric columns: ['CountryCode', 'Amount', 'Value', 'PricingStrategy', 'FraudResult', 'Cluster']

Fitting and transforming data...


MemoryError: Unable to allocate 105. GiB for an array with shape (66963, 210020) and data type float64
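
The error follows from one-hot encoding identifier columns. `TransactionId`, `BatchId`, `TransactionStartTime` and similar columns are almost unique per row, so the encoder produces about 210,020 output columns, and `sparse_output=False` forces them into a dense `float64` array. The sketch below reproduces the size arithmetic and shows one possible way to keep only low-cardinality categoricals. The helper `low_cardinality_columns`, the threshold of 50, and the toy frame are assumptions for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Size of the dense array from the error message: rows x columns x 8 bytes per float64.
dense_gib = 66963 * 210020 * 8 / 2**30
print(f"{dense_gib:.1f} GiB")

def low_cardinality_columns(df, max_unique=50):
    """Return non-numeric columns with at most `max_unique` distinct values (hypothetical helper)."""
    non_numeric = df.select_dtypes(exclude=["number"]).columns
    return [col for col in non_numeric if df[col].nunique() <= max_unique]

# Toy frame mimicking the notebook's columns: one ID-like column, one real category.
toy = pd.DataFrame({
    "TransactionId": [f"TransactionId_{i}" for i in range(200)],
    "ProductCategory": ["airtime", "financial_services"] * 100,
    "Amount": np.linspace(-500.0, 5000.0, 200),
})

categorical_cols = low_cardinality_columns(toy)
numeric_cols = toy.select_dtypes(include=["number"]).columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),
        # Leaving sparse_output at its default keeps the encoder output sparse.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    remainder="drop",  # identifier columns are dropped rather than encoded
)
X_toy = preprocessor.fit_transform(toy)
print(categorical_cols, X_toy.shape)
```

The computed size is about 104.8 GiB, which matches the "105. GiB" in the traceback. For the real data, `TransactionStartTime` would probably be better parsed into date parts (hour, weekday, month) than dropped, and `CustomerId` could be kept aside for grouping instead of being used as a feature.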

## Encode the target variable

In [28]:
from sklearn.preprocessing import LabelEncoder

# Encode the target variable
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_val_encoded = le.transform(y_val)
y_test_encoded = le.transform(y_test)

# Print class mapping
print("Class mapping:", dict(zip(le.classes_, le.transform(le.classes_))))

Class mapping: {'High Risk': np.int64(0), 'Low Risk': np.int64(1), 'Medium Risk': np.int64(2)}


## Train a baseline model (Logistic Regression)

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Train the model
print("Training Logistic Regression...")
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_processed, y_train_encoded)

# Evaluate on validation set
y_val_pred = model.predict(X_val_processed)
print("\nValidation Results:")
print(classification_report(y_val_encoded, y_val_pred, target_names=le.classes_))

Training Logistic Regression...


NameError: name 'X_train_processed' is not defined

## Train and evaluate multiple models


In [17]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

models = {
    "Logistic_Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random_Forest": RandomForestClassifier(random_state=42, n_jobs=-1),
    "Gradient_Boosting": GradientBoostingClassifier(random_state=42)
}

results = {}

for name, model in models.items():
    print(f"\nTraining {name}...")
    model.fit(X_train_processed, y_train_encoded)
    
    # Predict on validation set
    y_pred = model.predict(X_val_processed)
    
    # Calculate metrics
    results[name] = {
        'accuracy': accuracy_score(y_val_encoded, y_pred),
        'precision': precision_score(y_val_encoded, y_pred, average='weighted'),
        'recall': recall_score(y_val_encoded, y_pred, average='weighted'),
        'f1': f1_score(y_val_encoded, y_pred, average='weighted')
    }
    
    print(f"\n{name} Results:")
    print(classification_report(y_val_encoded, y_pred, target_names=le.classes_))

# Compare models
print("\nModel Comparison (Weighted F1 Score):")
for name, metrics in results.items():
    print(f"{name}: {metrics['f1']:.4f}")


Training Logistic_Regression...


NameError: name 'X_train_processed' is not defined

## Hyperparameter tuning (example with Random Forest)


In [1]:
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Initialize grid search
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    verbose=1
)

# Fit the model
print("Performing grid search...")
grid_search.fit(X_train_processed, y_train_encoded)

# Best model
best_model = grid_search.best_estimator_
print("\nBest parameters:", grid_search.best_params_)

# Evaluate on validation set
y_val_pred = best_model.predict(X_val_processed)
print("\nBest Model Validation Results:")
print(classification_report(y_val_encoded, y_val_pred, target_names=le.classes_))

NameError: name 'RandomForestClassifier' is not defined
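
Before re-running the search with `RandomForestClassifier` imported in the same kernel session, it helps to know how much work the grid implies. The short sketch below counts the candidate combinations for the grid defined above using scikit-learn's `ParameterGrid`; the variable names `n_candidates` and `total_fits` are introduced only for this example.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above.
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

cv_folds = 5
n_candidates = len(ParameterGrid(param_grid))  # 3 * 4 * 3 combinations
total_fits = n_candidates * cv_folds           # one fit per combination per fold

print(f"{n_candidates} candidates x {cv_folds} folds = {total_fits} fits")
```

That is 180 cross-validation fits on roughly 67,000 rows, plus one final refit on the full training set. If this is too slow, `RandomizedSearchCV` with a fixed `n_iter` caps the number of candidates. Setting `n_jobs=-1` on both the forest and the search can also oversubscribe CPU cores, so parallelising at only one level is usually enough.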

## Evaluate the best model

In [None]:
# Final evaluation on test set
y_test_pred = best_model.predict(X_test_processed)
print("\nTest Set Results:")
print(classification_report(y_test_encoded, y_test_pred, target_names=le.classes_))