# Model Training Notebook

In this notebook, we experiment with several machine learning models on the processed dataset.
We cluster the data with K-Means, and we train two classifiers (a Random Forest and a Neural Network) for anomaly detection.

Before running the cells, make sure the processed data file (`smart_meter_data_features.csv`) exists in the `data/processed/` directory.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

# Import functions from modules
from src.model_kmeans import perform_kmeans_clustering, evaluate_clustering, plot_clusters
from src.model_random_forest import train_random_forest
from src.model_neural_network import train_neural_network

# Load the processed data
data_path = '../data/processed/smart_meter_data_features.csv'
data = pd.read_csv(data_path)
print('Processed data loaded successfully.')
data.head()

## K-Means Clustering Experiment

Here we perform K-Means clustering on selected features and evaluate the result with the silhouette score, which ranges from -1 to 1; higher values indicate tighter, better-separated clusters.

In [2]:
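# Select features for clustering (adjust these to match your dataset)
if 'energy_consumption' in data.columns and 'hour' in data.columns:
    clustering_features = ['energy_consumption', 'hour']
else:
    # Fall back to the first two numeric columns; K-Means cannot use text or timestamp columns
    clustering_features = data.select_dtypes(include='number').columns.tolist()[:2]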

# Perform clustering
data_clustered, kmeans_model = perform_kmeans_clustering(data.copy(), clustering_features, n_clusters=3)

# Evaluate clustering performance
score = evaluate_clustering(data_clustered, clustering_features)

# Visualize the clusters
plot_clusters(data_clustered, clustering_features, kmeans_model)
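
The internals of `evaluate_clustering` are not shown in this notebook, so the following is only a minimal sketch of what a silhouette score measures, written with the standard library alone. For each point it compares the mean distance to its own cluster (`a`) with the mean distance to the nearest other cluster (`b`) and averages `(b - a) / max(a, b)` over all points. The function name `silhouette_sketch` and the toy points are hypothetical; in practice `sklearn.metrics.silhouette_score` computes the same quantity.

```python
import math


def silhouette_sketch(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    def mean_distance(i, members):
        # Mean distance from point i to the given members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        a = mean_distance(i, own)
        b = min(
            mean_distance(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy data (hypothetical): two tight, well-separated groups.
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_sketch(toy_points, [0, 0, 1, 1])
bad = silhouette_sketch(toy_points, [0, 1, 0, 1])
print(f"good labelling: {good:.3f}, bad labelling: {bad:.3f}")
```

A labelling that matches the two groups scores close to 1, while a labelling that mixes them scores below 0, which is why the score is a useful check on the choice of `n_clusters`.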

## Random Forest Classification Experiment

Next, we train a Random Forest classifier for anomaly detection. The cell below expects the dataset to contain an `anomaly` column with the labels and skips training if that column is missing.

In [3]:
target = 'anomaly'
if target in data.columns:
    # Exclude non-feature columns
    feature_columns = [col for col in data.columns if col not in ['timestamp', target, 'cluster']]
    # Train the Random Forest model
    rf_model = train_random_forest(data.copy(), feature_columns, target)
else:
    print(f"Column '{target}' not found in data. Skipping Random Forest training.")
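
The body of `train_random_forest` lives in `src/model_random_forest.py` and is not reproduced here. As a rough sketch of what such a helper typically does, the code below splits the data, fits a scikit-learn `RandomForestClassifier`, and reports per-class metrics. The function name `train_random_forest_sketch`, its parameters, the `class_weight="balanced"` setting, and the synthetic data are all assumptions for illustration, not the project's actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


def train_random_forest_sketch(df, feature_columns, target, test_size=0.2, random_state=42):
    """Hypothetical stand-in for train_random_forest: split, fit, report."""
    X = df[feature_columns]
    y = df[target]
    # Stratify so the (usually rare) anomaly class appears in both splits.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    # class_weight="balanced" is an assumption; it offsets class imbalance.
    model = RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=random_state
    )
    model.fit(X_train, y_train)
    report = classification_report(y_test, model.predict(X_test), zero_division=0)
    return model, report


# Synthetic data for demonstration only; not the smart meter dataset.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "energy_consumption": rng.normal(1.0, 0.2, 200),
    "hour": rng.integers(0, 24, 200),
})
demo["anomaly"] = (demo["energy_consumption"] > 1.2).astype(int)

demo_model, demo_report = train_random_forest_sketch(
    demo, ["energy_consumption", "hour"], "anomaly"
)
print(demo_report)
```

Because anomalies are usually a small fraction of the rows, per-class precision and recall from the report are more informative than overall accuracy.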

## Neural Network Classification Experiment

Now we train a Neural Network model for anomaly detection and plot its training and validation loss per epoch.

In [4]:
if target in data.columns:
    # Use the same feature columns as for Random Forest
    feature_columns = [col for col in data.columns if col not in ['timestamp', target, 'cluster']]
    # Train the Neural Network (epochs=20 keeps the demonstration short; adjust as needed)
    nn_model, history, X_test, y_test, scaler = train_neural_network(data.copy(), feature_columns, target, epochs=20)
    
    # Plot the training and validation loss
    plt.figure(figsize=(8,6))
    plt.plot(history.history['loss'], label='Train Loss')
    plt.plot(history.history['val_loss'], label='Validation Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('Neural Network Training and Validation Loss')
    plt.legend()
    plt.show()
else:
    print(f"Column '{target}' not found in data. Skipping Neural Network training.")
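
The cell above returns `X_test` and `y_test` but does not use them. A minimal sketch of one way to evaluate on that held-out set follows, assuming the model outputs one anomaly probability per row (the notebook does not show the output layer, so this is an assumption). The helper `precision_recall` and the example values are hypothetical.

```python
def precision_recall(y_true, y_prob, threshold=0.5):
    """Precision and recall for the positive (anomaly) class at a given threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels and probabilities. With the real model, the probabilities
# might come from something like nn_model.predict(X_test).ravel(); that call and
# any scaling X_test needs are assumptions about code not shown here.
example_true = [0, 0, 0, 1, 1]
example_prob = [0.1, 0.6, 0.2, 0.9, 0.4]
precision, recall = precision_recall(example_true, example_prob)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Lowering `threshold` trades precision for recall, which is often the relevant knob when missed anomalies cost more than false alarms.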