<img src="Figures/top_ML.png" alt="Drawing" style="width: 1000px;"/>

# EXAMPLE

Develop a supervised machine learning model to classify the customers of an electricity retail company according to their hourly electricity consumption profile over one day. This classification will allow the company's marketing staff to send personalized offers to two types of customer: users with a **high consumption profile** and users with a **non-high consumption profile**.

The columns are (0) CUPs, (1) cluster and (2-25) hourly consumption (24 columns, from h-0 to h-23).

# 1. Import libraries

In [None]:
import pandas as pd  # data handling
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # statistical plots
import warnings

warnings.filterwarnings('ignore')  # silence library warnings to keep the output readable

# 2. Load dataset

<div class="alert alert-success">
    <b>Load the dataset </b>
</div>


In [None]:
## your code here
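As a hint for this step, here is a minimal sketch of loading tabular data with `pd.read_csv`. The real file name is not given in this notebook, so the sketch reads a small in-memory CSV instead; the CUPs values, the numbers and the variable name `toy` are all made up for illustration.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for the real file (its name is not given here).
csv_text = """CUPs,cluster,h-0,h-1,h-2
ES001,0,0.12,0.10,0.09
ES002,1,0.85,0.80,0.78
ES003,0,0.20,0.18,0.15
"""

# pd.read_csv accepts a path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)   # (rows, columns)
print(toy.head())  # first rows, to check that the columns were parsed as expected
```

With the real dataset, the path to the file would replace `io.StringIO(csv_text)`.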

<div class="alert alert-success">
    <b> Look for the unique cluster classes </b>
</div>

In [None]:
## your code here
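One possible way to inspect the classes of a label column is sketched below on a toy DataFrame (`toy_labels` is a hypothetical name, not part of the dataset). `unique()` lists the distinct values and `value_counts()` also shows how often each one appears.

```python
import pandas as pd

# Toy label column, for illustration only.
toy_labels = pd.DataFrame({'cluster': [0, 1, 0, 0, 1]})

# Distinct values of the label column, sorted for readability.
classes = sorted(toy_labels['cluster'].unique().tolist())
print(classes)

# Number of rows per class.
print(toy_labels['cluster'].value_counts())
```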

In [None]:
# Dataset shape
consumption.shape

In [None]:
consumption.describe()

<div class="alert alert-success">
    <b> Are there missing values? </b>
</div>

In [None]:
## your code here
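A common way to check for missing values, sketched here on a toy DataFrame with one deliberately missing entry (`toy_gaps` is a hypothetical name), is to count the `NaN` entries per column with `isna().sum()`.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in column 'h-0', for illustration only.
toy_gaps = pd.DataFrame({'h-0': [0.1, np.nan, 0.3],
                         'h-1': [0.2, 0.2, 0.4]})

# Number of missing values in each column.
missing_per_column = toy_gaps.isna().sum()
print(missing_per_column)

# Single True/False answer for the whole table.
has_missing = bool(toy_gaps.isna().any().any())
print("Any missing value?", has_missing)
```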

#### Let's see how many cases we have in each of the clusters. Do we have a balanced dataset?

In [None]:
# cluster==0
print("Number cluster 0:", consumption[consumption['cluster'] == 0]['cluster'].count())
# cluster=1
print("Number cluster 1:", consumption[consumption['cluster'] == 1]['cluster'].count())

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Get the value counts
x = consumption['cluster'].value_counts()

# Create the barplot
sns.barplot(x=x.index, y=x.values)

# Set the title
plt.title('Value counts target')

# Show the plot
plt.show()

<div class="alert alert-success">
    <b> Create two dataframes (one for each cluster) to analyze them separately </b>
</div>

In [None]:
clients_0 = ## your code here
clients_1 = ## your code here
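The split by cluster can be done with boolean filtering. The sketch below shows the idea on a toy DataFrame; the names `toy`, `toy_0` and `toy_1` and all values are hypothetical.

```python
import pandas as pd

# Toy data with the same column layout as the exercise, for illustration only.
toy = pd.DataFrame({
    'CUPs': ['A', 'B', 'C', 'D'],
    'cluster': [0, 1, 0, 1],
    'h-0': [0.10, 0.80, 0.20, 0.90],
    'h-1': [0.12, 0.75, 0.18, 0.95],
})

# Keep only the rows whose label matches; .copy() avoids modifying a view of the original.
toy_0 = toy[toy['cluster'] == 0].copy()
toy_1 = toy[toy['cluster'] == 1].copy()

print(len(toy_0), len(toy_1))
```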

In [None]:
# Average hourly consumption comparison
print("Average hourly power cluster 0: ", clients_0.drop(['CUPs','cluster'], axis=1).mean(axis=1).mean(), 'kW')
print("Average hourly power cluster 1: ", clients_1.drop(['CUPs','cluster'], axis=1).mean(axis=1).mean(), 'kW')

**Remove the 'cluster' column in order to plot the different load curves.**

In [None]:
df_0 = clients_0.drop(['cluster'], axis=1)


**Set the 'CUPs' column as the index (this makes sense, since each row has a different value and identifies the SM).**

In [None]:
df_0.set_index(['CUPs'], inplace=True)

# Transpose the matrix, for ease of plotting
df_0 = df_0.T

# We change the name of the index to "hour".
df_0.index.name = 'hour'
df_0.head()

In [None]:
df_1 = clients_1.drop(['cluster'], axis=1)
df_1.set_index(['CUPs'], inplace=True)
df_1 = df_1.T
df_1.index.name = 'hour'
df_1.head()

**We obtain a list with the columns of the two dfs to have the CUPs of cluster 0 and cluster 1.**

In [None]:
cups_0 = df_0.columns
cups_1 = df_1.columns

print(cups_0)

**Create a plot**

In [None]:

plt.figure(figsize=(20,8))

# Create a loop where cups takes each of the strings in the cups_0 list.
for cups in cups_0:
    # 'lightcoral' indicates the color (https://matplotlib.org/2.1.1/gallery/color/named_colors.html)
    # linewidth sets the line width and alpha the transparency
    plt.plot(df_0[cups], 'lightcoral', linewidth=1, alpha=0.4)
for cups in cups_1:
    plt.plot(df_1[cups], 'green', linewidth=1, alpha=0.4)

# X axis displays the hours
plt.xticks(df_0.index)
plt.xlabel('Hours', fontsize=16)
plt.ylabel('Consumers consumption [kWh]', fontsize=16)

plt.margins(x=0, y=0)
plt.show()  

<div class="alert alert-success">
    <b> Add average consumption to distinguish more clearly the differences between the clusters. </b>
</div>

In [None]:
df_0['mean'] = ## your code here
df_1['mean'] = ## your code here
df_0.head()
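After the transpose, each row is an hour and each column a customer, so the average load curve is a row-wise mean. A minimal sketch on a toy transposed frame follows (`toy_T` and its values are hypothetical). Note that the customer columns are captured before the `'mean'` column is added, which is also why `cups_0` and `cups_1` were stored earlier.

```python
import pandas as pd

# Toy transposed frame: rows are hours, columns are customers (illustration only).
toy_T = pd.DataFrame(
    {'A': [0.1, 0.2, 0.3], 'B': [0.3, 0.4, 0.5]},
    index=pd.Index(['h-0', 'h-1', 'h-2'], name='hour'),
)

# Capture the customer columns first, so the new column is not averaged into itself later.
customers = list(toy_T.columns)

# axis=1 averages across columns, giving one value per hour.
toy_T['mean'] = toy_T[customers].mean(axis=1)
print(toy_T)
```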

**We create the same graphs as before, adding the average curves of the two clusters with more opacity (alpha)**

In [None]:

plt.figure(figsize=(20,8))
for cups in cups_0:
    plt.plot(df_0[cups], 'lightcoral', linewidth=1, alpha=0.2)

for cups in cups_1:
    plt.plot(df_1[cups], 'green', linewidth=1, alpha=0.2)

plt.plot(df_0['mean'], 'tomato', linestyle='dashed', linewidth=4, alpha=1)    
plt.plot(df_1['mean'], 'green', linestyle='dashed', linewidth=4, alpha=1)

plt.xticks(df_0.index)
plt.margins(x=0, y=0)
plt.xlabel('Hours', fontsize=16)
plt.ylabel('Consumers consumption [kWh]', fontsize=16)
plt.show()  

**Correlation between features and target**

In [None]:

plt.figure(figsize=(18, 10))

# Create the correlation matrix after eliminating the CUPs column since it does not provide information in this case.
corr = consumption.drop(['CUPs'],axis=1).corr()

# Create a heat map to visually detect the correlation between the columns.
sns.heatmap(corr, cmap="coolwarm")
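Besides the heat map, it can be useful to read the correlation with the target as a sorted list. The sketch below uses synthetic data (the frame `toy`, the seed and the column names are assumptions made for illustration) in which one hour is built to depend on the cluster and the other is not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cluster = rng.integers(0, 2, size=200)

# Synthetic data: 'h-18' depends on the cluster, 'h-0' does not.
toy = pd.DataFrame({
    'cluster': cluster,
    'h-0': rng.normal(0.2, 0.05, size=200),
    'h-18': 0.3 + 0.5 * cluster + rng.normal(0, 0.05, size=200),
})

# Take the 'cluster' column of the correlation matrix, drop its self-correlation and sort.
target_corr = toy.corr()['cluster'].drop('cluster').sort_values(ascending=False)
print(target_corr)
```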

**Let's now create some Boxplots to detect the variability within each cluster.**

Clients_0: 'non-high consumption'

In [None]:
# Creating boxplot
plt.subplots(figsize=(15, 8))
bp = clients_0.drop(['CUPs'],axis=1).boxplot(column=list(clients_0.drop(['CUPs'],axis=1).columns))
plt.show()

Clients_1: 'high consumption'

In [None]:
plt.subplots(figsize=(15, 8))
bp = clients_1.drop(['CUPs'],axis=1).boxplot(column=list(clients_1.drop(['CUPs'],axis=1).columns))
plt.show()

### Feature engineering
Create new features that may help reduce the dimensionality of the problem and improve the performance of the algorithm. The new features are derived from the hourly consumption: mean, max, min, std, the min/max ratio and the mean over the 13h-21h window.

<div class="alert alert-success">
    <b> Complete the "max" and "min" features in the cell below, computed from the hourly consumption columns. </b>
</div>

In [None]:

hours = list(consumption.drop(['CUPs', 'cluster'], axis=1))

# Basic examples (please note that some of these characteristics may have a high correlation between them)
consumption['average'] = consumption[hours].mean(axis=1)
consumption['max'] = ## your code here
consumption['min'] = ## your code here
consumption['std'] = consumption[hours].std(axis=1)

# Example minmax
minmax = []
# Iterate over the DataFrame row by row
for index, row in consumption.iterrows():
    # If the minimum is 0, set minmax to 0 to avoid the indeterminate form 0/0
    if row['min'] == 0:
        minmax.append(0)
    else:
        minmax.append(row['min']/row['max'])
consumption['minmax'] = minmax
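The row-by-row loop is easy to follow but slow on large tables. As an alternative, here is a vectorized sketch of the same min/max ratio on a toy DataFrame. The data and the helper name `safe_max` are hypothetical, and the guard is placed on the maximum, which gives the same result as the loop as long as consumption is never negative.

```python
import numpy as np
import pandas as pd

# Toy hourly data: the last row is all zeros to exercise the 0/0 case.
toy = pd.DataFrame({'h-0': [0.0, 0.2, 0.0],
                    'h-1': [0.5, 0.4, 0.0],
                    'h-2': [1.0, 0.8, 0.0]})
hours = list(toy.columns)

toy['max'] = toy[hours].max(axis=1)
toy['min'] = toy[hours].min(axis=1)

# Replace a zero maximum by NaN so the division yields NaN instead of 0/0, then map NaN to 0.
safe_max = toy['max'].replace(0, np.nan)
toy['minmax'] = (toy['min'] / safe_max).fillna(0.0)

print(toy[['min', 'max', 'minmax']])
```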


In [None]:
# Example: average over a period of time. Between 13h and 21h there is a greater difference between clusters.
# Note: range(13, 21) selects h-13 through h-20, because the upper bound is exclusive.
peak_hours = ['h-' + str(x) for x in range(13,21)]
consumption['peak_hours'] = consumption[peak_hours].mean(axis=1)

consumption.head()

**Check the correlation matrix again**

In [None]:
f, ax = plt.subplots(figsize=(10, 8))
corr = consumption.drop(['CUPs'],axis=1).corr()
sns.heatmap(corr, cmap="coolwarm")

# Negative correlation (close to -1) is also interesting, as may be the case for minmax and cluster.

## Split the data

The seed ***random_state=0*** is used for all exercises so that the results are reproducible. ***shuffle=True*** shuffles the rows before they are split into training and test sets, so the split does not depend on the order of the rows in the file.

In [None]:
X = consumption.drop(['CUPs', 'cluster'], axis=1)  # CUPs is an identifier, not a feature
y = consumption['cluster']

In [None]:
from sklearn.model_selection import train_test_split

test_size = 0.2  # fraction of the data held out at each split
random_state = 0
# Split the data into training, validation and test sets.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state,
                                                    shuffle=True)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=test_size, random_state=random_state,
                                                    shuffle=True)
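Applying a 0.2 split twice leaves 64% of the rows for training, 16% for validation and 20% for testing. The sketch below checks this on synthetic data (`X_toy` and `y_toy` are hypothetical). It also passes `stratify`, which is an addition not used in the cell above: it keeps the class proportions the same in every subset, which can help when the classes are unbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 70% of class 0 and 30% of class 1.
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([0] * 700 + [1] * 300)

# First split: hold out 20% as the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, shuffle=True, stratify=y_toy)

# Second split: hold out 20% of the remainder as the validation set.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tr, y_tr, test_size=0.2, random_state=0, shuffle=True, stratify=y_tr)

print(len(X_tr), len(X_va), len(X_te))
print(y_tr.mean(), y_va.mean(), y_te.mean())  # share of class 1 in each subset
```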

<div class="alert alert-success">
    <b> Add more Classification algorithms </b>
</div>

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

num_folds = 15
# Lists keep a fixed iteration order (sets do not guarantee one)
error_metrics = ['balanced_accuracy']
models = [('LR', LogisticRegression()),
          ('RF', RandomForestClassifier())]

results = [] # stores the results of the evaluation metrics
names = [] # name of each algorithm
msg = [] # print the summary of the cross-validation method


In [None]:
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.model_selection import StratifiedShuffleSplit

# Train with cross-validation
for scoring in error_metrics:
    print('Classification evaluation metric: ', scoring)
    for name, model in models:
        print('Model ', name)
        cross_validation = StratifiedShuffleSplit(n_splits=num_folds, random_state=0)
        cv_results = cross_val_score(model, X_train, y_train, cv=cross_validation, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        resume = (name, cv_results.mean(), cv_results.std())
        msg.append(resume)
    print(msg)

    # Compare results across algorithms
    fig = plt.figure()
    fig.suptitle('Comparison of algorithms with evaluation metrics: %s' %scoring)
    ax = fig.add_subplot(111)
    ax.set_xlabel('Candidate models')
    ax.set_ylabel('%s' %scoring)
    plt.boxplot(results)
    ax.set_xticklabels(names)
    plt.show()

    results = []

## *Hyperparameter tuning*

Steps to tune the hyperparameters:

* Metric to optimize: *balanced_accuracy*
* Define the search ranges for the parameters: *params*
* Assign a validation method: *StratifiedShuffleSplit* (n_splits = 10)
* Fit the search on the validation data: *X_val*

In [None]:
#RandomForestClassifier
model = RandomForestClassifier()
params = {
     'n_estimators': [100, 600], #default=100
     'min_samples_split': [2,5] #default=2
 }
scoring='balanced_accuracy'
cross_validation = StratifiedShuffleSplit(n_splits=10, random_state=0)
my_cv = cross_validation.split(X_val, y_val)
gsearch = GridSearchCV(estimator=model, param_grid=params, scoring=scoring, cv=my_cv)
gsearch.fit(X_val, y_val)

print("Best results: %f using the following hyperparameters %s" % (gsearch.best_score_, gsearch.best_params_))
means = gsearch.cv_results_['mean_test_score']
stds = gsearch.cv_results_['std_test_score']
params = gsearch.cv_results_['params']
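The `mean_test_score`, `std_test_score` and `params` entries of `cv_results_` can be combined into one table to compare every hyperparameter combination, not only the best one. The following is a self-contained sketch on synthetic data from `make_classification`; the smaller grid, the data and the names `search` and `summary` are assumptions chosen to keep it fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# Synthetic binary classification data, for illustration only.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

cv = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [10, 30], 'min_samples_split': [2, 5]},
    scoring='balanced_accuracy',
    cv=cv,
)
search.fit(X_toy, y_toy)

# One row per hyperparameter combination, best mean score first.
summary = pd.DataFrame({
    'params': search.cv_results_['params'],
    'mean': search.cv_results_['mean_test_score'],
    'std': search.cv_results_['std_test_score'],
}).sort_values('mean', ascending=False)

print(summary.to_string(index=False))
```

Looking at the standard deviation next to the mean helps to judge whether the difference between two combinations is larger than the variation between folds.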

## Final evaluation of the model.
Evaluation metrics:
  * 1. Confusion matrix
  * 2. Matthews Coefficient (MCC)
  * 3. ROC / AUC curve


In [None]:

# max_features='sqrt' is what 'auto' meant for classifiers; 'auto' was removed in scikit-learn 1.3
clf_model = RandomForestClassifier(max_features='sqrt', min_samples_split=2, n_estimators=600)
clf_model.fit(X_train,y_train)  # The RF model is trained
y_predict = clf_model.predict(X_test)  # Predictions are calculated


### **Print the feature ranking importance**

In [None]:
import numpy as np

# Get the feature importance from the best RF model
attribute_importance = gsearch.best_estimator_.feature_importances_
feature_names = X.columns.tolist()


# Create a DataFrame for feature importances
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': attribute_importance})

# Sort the DataFrame by importance
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)



In [None]:

# Plotting the feature importances
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df, palette='rocket')
plt.title('Feature Importances from Random Forest Model')
plt.xlabel('Importance Score')
plt.ylabel('Feature')
plt.show()

# Print the most relevant features
print("Most Relevant Features:")
print(feature_importance_df.head(10))  # Show top 10 features


**1. Confusion matrix**

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

cm = confusion_matrix(y_test, y_predict)  # a new name, so the imported function is not overwritten
print(classification_report(y_test, y_predict))
print(cm)

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay


# clf_model is your trained classifier, and X_test, y_test are the test data and labels
disp = ConfusionMatrixDisplay.from_estimator(
    clf_model, X_test, y_test,
    cmap=plt.cm.Blues
)
plt.show()

**2. Matthews Correlation Coefficient (MCC)**

The MCC is a correlation coefficient between the observed and the predicted classes, ranging from -1 to +1.
* +1 represents a perfect prediction.
* 0 represents a prediction no better than random.
* -1 represents total disagreement between prediction and observation.
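For a binary problem the MCC can be computed directly from the four cells of the confusion matrix as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below checks this formula against `matthews_corrcoef` on a small made-up example; the labels `y_true_toy` and `y_pred_toy` are hypothetical.

```python
import math
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Made-up labels and predictions, for illustration only.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true_toy, y_pred_toy).ravel())

numerator = tp * tn - fp * fn
denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc_manual = numerator / denominator

print("Manual MCC: ", mcc_manual)
print("sklearn MCC:", matthews_corrcoef(y_true_toy, y_pred_toy))
```

Because all four cells enter the formula, the MCC stays informative when one class is much more frequent than the other.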

In [None]:
from sklearn.metrics import matthews_corrcoef

matthews_corrcoef(y_test, y_predict)

**3. ROC curve / AUC**.

* ROC curve: Curve of the true positive rate versus false positive rate at different classification thresholds.

* AUC (area under the curve): The area under the ROC curve is the probability that the classifier assigns a higher score to a randomly chosen positive example than to a randomly chosen negative example.
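The ranking interpretation of the AUC can be verified by hand: compare every positive example with every negative one and count how often the positive gets the higher score (ties count as one half). The sketch below does this on made-up scores and compares the result with `roc_auc_score`; the names `y_true_toy` and `scores_toy` are hypothetical.

```python
import itertools
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted scores, for illustration only.
y_true_toy = [0, 0, 1, 1, 0, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

pos = [s for s, t in zip(scores_toy, y_true_toy) if t == 1]
neg = [s for s, t in zip(scores_toy, y_true_toy) if t == 0]

# Fraction of (positive, negative) pairs in which the positive is ranked higher.
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p, n in itertools.product(pos, neg))
auc_pairwise = wins / (len(pos) * len(neg))

print("Pairwise AUC:", auc_pairwise)
print("sklearn AUC: ", roc_auc_score(y_true_toy, scores_toy))
```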

In [None]:
from sklearn.metrics import roc_auc_score, RocCurveDisplay, roc_curve


# Plot ROC curve using RocCurveDisplay
RocCurveDisplay.from_estimator(clf_model, X_test, y_test)
plt.show()

# Calculate AUC score
# Use predicted probabilities for the positive class
y_prob = clf_model.predict_proba(X_test)[:, 1]
auc_score = roc_auc_score(y_test, y_prob)
print('AUC: %.3f' % auc_score)