<img src="Figures/top_ML.png" alt="Drawing" style="width: 1100px;"/>

# 4. EXERCISE: CLASSIFYING USERS BY THEIR CONSUMPTION

Develop a supervised machine learning model to classify the users of an electricity retailer according to their hourly electricity consumption profile over one day. This classification will let the company's marketing staff send tailored offers to two types of customer profile: users with a **high consumption** profile and users with a **non-high consumption** profile.

The columns are:

* (0) CUPs
* (1) etiqueta (the label)
* (2-25) hourly consumption (from h-0 to h-23).
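
To make the layout concrete, the following sketch builds a small synthetic table with the same column structure. The values are random, and the number of users and the identifiers (`CUPS_0`, `CUPS_1`, ...) are invented for illustration; none of it comes from the course dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)              # fixed seed so the toy data is reproducible
n_users = 5                                 # hypothetical number of users
hour_cols = [f"h-{h}" for h in range(24)]   # h-0 ... h-23, as in the real file

# 24 hourly consumption columns filled with random values
toy = pd.DataFrame(rng.uniform(0.0, 2.0, size=(n_users, 24)), columns=hour_cols)
toy.insert(0, "CUPs", [f"CUPS_{i}" for i in range(n_users)])   # column 0: identifier
toy.insert(1, "etiqueta", rng.integers(0, 2, size=n_users))    # column 1: label (0 or 1)

print(toy.shape)               # rows, and 2 + 24 columns
print(list(toy.columns[:4]))   # first four column names
```

With one identifier column, one label column and 24 hourly columns, the table has 26 columns, indexed 0 to 25.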


# Steps to build a machine learning model

<img src="Figures/Fases.png" alt="Drawing" style="width: 800px;"/>

# 1. Import libraries

In [None]:
import numpy as np  # numerical arrays (used later to sort the feature importances)
import pandas as pd  # data loading and manipulation
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # statistical plots
import warnings

warnings.filterwarnings('ignore')  # hide library warnings to keep the output readable

# 2. Load the dataset and understand the data

<div class="alert alert-success">
    <b> Load the dataset </b>
</div>


In [None]:
consumption = pd.read_excel('Data/S4-clasificacion-consumos.xlsx')
consumption.head()

<div class="alert alert-success">
    <b> Check how many groups (labels) there are </b>
</div>

In [None]:
consumption['etiqueta'].unique()


In [None]:
consumption['CUPs'].nunique()

In [None]:
# Dataset shape
consumption.shape

In [None]:
# Data types of each column
consumption.dtypes

In [None]:
consumption.describe()

In [None]:
consumption.info()

<div class="alert alert-success">
    <b> Is any data missing? </b>
</div>

In [None]:
consumption.isna().sum()

#### Let's see how many users there are in each class. Is the dataset balanced?
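
One quick way to judge balance is to look at class proportions rather than raw counts. The sketch below uses a made-up label column (not the course data) and `value_counts(normalize=True)`; the 70/30 split is an assumption chosen only for illustration.

```python
import pandas as pd

# Hypothetical labels: 7 users of class 0 and 3 users of class 1
toy_labels = pd.Series([0] * 7 + [1] * 3, name="etiqueta")

# Share of each class, ordered by label value
proportions = toy_labels.value_counts(normalize=True).sort_index()
print(proportions)

# Ratio between the largest and the smallest class; 1.0 means perfectly balanced
imbalance_ratio = proportions.max() / proportions.min()
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

A ratio well above 1 is one reason to prefer a metric such as balanced accuracy over plain accuracy.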

In [None]:
# label == 0
print("Number of users with label 0:", consumption[consumption['etiqueta'] == 0]['etiqueta'].count())
# label == 1
print("Number of users with label 1:", consumption[consumption['etiqueta'] == 1]['etiqueta'].count())

In [None]:
# sort_index() orders the counts by label (0, then 1) instead of by frequency
x = consumption['etiqueta'].value_counts().sort_index().values
x

In [None]:
# Name the labels for plotting
labels = ['Etiqueta 0', 'Etiqueta 1']
values = x

In [None]:
# Create a DataFrame
df = pd.DataFrame({'Labels': labels, 'Values': x})

In [None]:
# Create the plot
plt.figure(figsize=(8, 6))
sns.barplot(x='Labels', y='Values', data=df, palette='viridis')
plt.title('Number of consumers in each class')
print('Number of users per label: ', x)

<div class="alert alert-success">
    <b> Create two DataFrames (one per class) to analyze them separately </b>
</div>

In [None]:
clients_0 = consumption[consumption['etiqueta']==0]
clients_1 = consumption[consumption['etiqueta']==1]

In [None]:
clients_1.head()

In [None]:
# Average hourly consumption comparison
print("Mean hourly power, CLASS/LABEL 0: ", clients_0.drop(['CUPs','etiqueta'], axis=1).mean(axis=1).mean(), 'kW')
print("Mean hourly power, CLASS/LABEL 1: ", clients_1.drop(['CUPs','etiqueta'], axis=1).mean(axis=1).mean(), 'kW')


<div class="alert alert-success">
    <b> Drop the 'etiqueta' column to plot the individual consumption curves. </b>
</div>


In [None]:
df_0 = clients_0.drop(['etiqueta'], axis=1)



<div class="alert alert-success">
    <b> Make the 'CUPs' column the index (each row has a different value, which identifies the SM). </b>
</div>


In [None]:
df_0.set_index(['CUPs'], inplace=True)

# Transpose the matrix, for ease of plotting
df_0 = df_0.T

# We change the name of the index to "hour".
df_0.index.name = 'hour'
df_0.head()

In [None]:
df_1 = clients_1.drop(['etiqueta'], axis=1)
df_1.set_index(['CUPs'], inplace=True)
df_1 = df_1.T
df_1.index.name = 'hour'
df_1.head()


<div class="alert alert-success">
    <b> Get the column list of each of the two DataFrames, which gives the CUPs of class 0 and class 1. </b>
</div>


In [None]:
cups_0 = df_0.columns
cups_1 = df_1.columns

print(cups_0)


<div class="alert alert-success">
    <b> Plot the curves. </b>
</div>


In [None]:

plt.figure(figsize=(20,8))

# Create a loop where cups takes each of the strings in the cups_0 list.
for cups in cups_0:
    # 'lightcoral' indicates the color (https://matplotlib.org/2.1.1/gallery/color/named_colors.html)
    # linewidth sets the line width and alpha the transparency
    plt.plot(df_0[cups], 'lightcoral', linewidth=1, alpha=0.4)
for cups in cups_1:
    plt.plot(df_1[cups], 'green', linewidth=1, alpha=0.4)

# The x axis displays the hours
plt.xticks(df_0.index)
plt.xlabel('Hours', fontsize=16)
plt.ylabel('Consumption per consumer [kWh]', fontsize=16)

plt.margins(x=0, y=0)
plt.show()  

<div class="alert alert-success">
    <b> Add the average consumption to show the differences between the classes more clearly </b>
</div>

In [None]:
df_0['mean'] = df_0.mean(axis=1)
df_1['mean'] = df_1.mean(axis=1)
df_1.head(10)

<div class="alert alert-success">
    <b> Create the same plot as before, adding the average curves of the two classes (0/1) with higher opacity (alpha). </b>
</div>


In [None]:

plt.figure(figsize=(20,8))
for cups in cups_0:
    plt.plot(df_0[cups], 'lightcoral', linewidth=1, alpha=0.2)

for cups in cups_1:
    plt.plot(df_1[cups], 'green', linewidth=1, alpha=0.2)

plt.plot(df_0['mean'], 'tomato', linestyle='dashed', linewidth=4, alpha=1)    
plt.plot(df_1['mean'], 'green', linestyle='dashed', linewidth=4, alpha=1)

plt.xticks(df_0.index)
plt.margins(x=0, y=0)
plt.xlabel('Hours', fontsize=16)
plt.ylabel('Consumption per consumer [kWh]', fontsize=16)
plt.show()  

<div class="alert alert-success">
    <b> Correlation between the features and the label. </b>
</div>


<img src="Figures/coef-correlacion.jpg" alt="Drawing" style="width: 400px;"/>


<img src="Figures/correlation.png" alt="Drawing" style="width: 400px;"/>
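
The heat map in the next cell relies on the Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. As a minimal sketch with invented numbers, the coefficient can be computed by hand from its definition, the covariance of the two variables divided by the product of their standard deviations. The helper name `pearson` and the sample values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # The 1/n factors cancel between numerator and denominator, so plain sums suffice
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Invented example: consumption at one hour versus a binary label
consumption_h18 = [0.2, 0.4, 0.5, 1.5, 1.8, 2.1]
label = [0, 0, 0, 1, 1, 1]

print(round(pearson(consumption_h18, label), 3))
print(pearson([1, 2, 3], [2, 4, 6]))   # perfectly linear and increasing
print(pearson([1, 2, 3], [6, 4, 2]))   # perfectly linear and decreasing
```

Values near +1 or -1 indicate a strong linear relationship with the label, while values near 0 indicate little linear relationship.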

In [None]:

plt.figure(figsize=(18, 10))

# Create the correlation matrix after eliminating the CUPs column since it does not provide information in this case.
corr = consumption.drop(['CUPs'],axis=1).corr()

# Create a heat map to visually detect the correlation between the columns.
sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1)

<div class="alert alert-success">
    <b> Create a boxplot to see the variability within each class. </b>
</div>



### Clients_0: 'non-high consumption'

In [None]:
# Create the boxplot of the hourly columns only
# ('etiqueta' is constant within a class, so it is dropped together with 'CUPs')
plt.subplots(figsize=(15, 8))
bp = clients_0.drop(['CUPs', 'etiqueta'], axis=1).boxplot()
plt.show()

### Clients_1: 'high consumption'

In [None]:
# Same boxplot for class 1, again without the identifier and label columns
plt.subplots(figsize=(15, 8))
bp = clients_1.drop(['CUPs', 'etiqueta'], axis=1).boxplot()
plt.show()

# 3. Prepare the data
## Feature selection / engineering
Create new features that may help reduce the dimensionality of the problem and improve the performance of the algorithm. The new features are based on the hourly consumption: mean, maximum, minimum, standard deviation, min/max ratio and mean over the 13h-21h period.

<div class="alert alert-success">
    <b> Create the new features, for example the "maximum" and the "minimum". </b>
</div>

In [None]:

hours = list(consumption.drop(['CUPs', 'etiqueta'], axis=1))

# Basic examples (please note that some of these characteristics may have a high correlation between them)
consumption['average'] = consumption[hours].mean(axis=1)
consumption['max'] = consumption[hours].max(axis=1)
consumption['min'] = consumption[hours].min(axis=1)
consumption['std'] = consumption[hours].std(axis=1)

# Example: min/max ratio
# Where the minimum is 0 the ratio is set to 0; this also covers rows whose
# maximum is 0, which would otherwise give an indeterminate 0/0
ratio = consumption['min'] / consumption['max']
consumption['minmax'] = ratio.where(consumption['min'] != 0, 0)


In [None]:
# Example: average over a period of time. The plots above show that the difference
# between the classes is largest between 13h and 21h (columns h-13 to h-20).
peak_hours = ['h-' + str(x) for x in range(13,21)]
consumption['peak_hours'] = consumption[peak_hours].mean(axis=1)

consumption.head()

<div class="alert alert-success">
    <b> Check the correlation matrix again. </b>
</div>


In [None]:
f, ax = plt.subplots(figsize=(20, 10))
corr = consumption.drop(['CUPs'],axis=1).corr()
sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1)

# A negative correlation (close to -1) is also informative, as may be the case for minmax and the label.

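# 4. Split the data

`shuffle=True` shuffles the rows before they are split into training and test sets, so each subset is a random sample and any ordering in the file (for example, rows sorted by label) does not bias the split. Evaluating on data held out from training is what lets us detect overfitting.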
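
As a rough sketch of what the split does, the example below applies `train_test_split` twice to synthetic data to obtain training, validation and test subsets. The data, the 20 % fractions and the use of `stratify` (which keeps the class proportions similar in each subset and is an addition for illustration) are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))        # 100 synthetic samples, 4 features
y_toy = np.array([0] * 70 + [1] * 30)    # labels stored in sorted order

# First split: hold out 20 % of the data as the test set
X_tmp, X_test_toy, y_tmp, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, shuffle=True, stratify=y_toy)

# Second split: hold out 20 % of the remainder as the validation set
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_tmp, y_tmp, test_size=0.2, random_state=0, shuffle=True, stratify=y_tmp)

print(len(X_train_toy), len(X_val_toy), len(X_test_toy))
print(y_test_toy.mean())   # share of class 1 in the test set
```

Because the toy labels are sorted, splitting without shuffling would place only class 1 samples in the test set, which is the kind of bias shuffling avoids.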

<img src="Figures/train-val-test.png" alt="Drawing" style="width: 800px;"/>


<img src="Figures/X_y.png" alt="Drawing" style="width: 800px;"/>

In [None]:
# 'CUPs' is an identifier, not a feature, so it is dropped together with the label
X = consumption.drop(['CUPs', 'etiqueta'], axis=1)
y = consumption['etiqueta']

In [None]:
y.head()

In [None]:
from sklearn.model_selection import train_test_split

test_size = 0.2  # fraction of the data held out at each split
random_state = 0
# Split the data into training, validation and test sets.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state,
                                                    shuffle=True)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=test_size, random_state=random_state,
                                                    shuffle=True)

# 5. Building and evaluating algorithms

<div class="alert alert-success">
    <b> Add classification algorithms </b>
</div>

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

num_folds = 15
# Lists keep a fixed iteration order, unlike sets
error_metrics = ['balanced_accuracy']
models = [('LR', LogisticRegression()),
          ('RF', RandomForestClassifier())]

results = [] # stores the results of the evaluation metrics
names = [] # name of each algorithm
msg = [] # print the summary of the cross-validation method


In [None]:
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.model_selection import StratifiedShuffleSplit

# Train with cross-validation
for scoring in error_metrics:
    print('Classification evaluation metric: ', scoring)
    for name, model in models:
        print('Model ', name)
        cross_validation = StratifiedShuffleSplit(n_splits=num_folds, random_state=0)
        cv_results = cross_val_score(model, X_train, y_train, cv=cross_validation, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        resume = (name, cv_results.mean(), cv_results.std())
        msg.append(resume)
    print(msg)

    # Compare results across algorithms
    fig = plt.figure()
    fig.suptitle('Comparison of algorithms with evaluation metrics: %s' %scoring)
    ax = fig.add_subplot(111)
    ax.set_xlabel('Candidate models')
    ax.set_ylabel('%s' %scoring)
    plt.boxplot(results)
    ax.set_xticklabels(names)
    plt.show()

    results = []

# 6. Hyperparameter tuning of the best model(s).

Steps for hyperparameter tuning:

* Metric to optimize: balanced accuracy
* Define the search ranges for the parameters: params
* Choose a validation method: StratifiedShuffleSplit (n_splits = 10).
* Run the search on the validation data: X_val
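
The steps above can be sketched as follows on synthetic data. The dataset from `make_classification`, the reduced grid and the number of splits are assumptions chosen to keep the example fast; they are not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# Synthetic binary problem standing in for the validation data
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Search ranges: 2 x 2 = 4 combinations
param_grid = {"n_estimators": [10, 50], "min_samples_split": [2, 5]}

# Validation method: repeated stratified random splits
cv = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="balanced_accuracy",   # metric to optimize
    cv=cv,                         # the splitter object can be passed directly
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Every combination in the grid is fitted once per split, so the cost grows with the product of the grid size and `n_splits`.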

In [None]:
# RandomForestClassifier
model = RandomForestClassifier()
params = {
    'n_estimators': [100, 600, 1000],  # default=100
    'min_samples_split': [2, 5],       # default=2
}
scoring = 'balanced_accuracy'
cross_validation = StratifiedShuffleSplit(n_splits=10, random_state=0)
# Pass the splitter itself; GridSearchCV calls its split() method internally
gsearch = GridSearchCV(estimator=model, param_grid=params, scoring=scoring, cv=cross_validation)
gsearch.fit(X_val, y_val)

print("Best result: %f using the following hyperparameters %s" % (gsearch.best_score_, gsearch.best_params_))
# Mean and standard deviation of the score for every parameter combination
means = gsearch.cv_results_['mean_test_score']
stds = gsearch.cv_results_['std_test_score']
tested_params = gsearch.cv_results_['params']
for mean, std, combo in zip(means, stds, tested_params):
    print("%f (%f) with: %r" % (mean, std, combo))

# 7. Final model evaluation.
Evaluation metrics:
  * 1. Confusion matrix
  * 2. Matthews Coefficient (MCC)
  * 3. ROC / AUC curve


In [None]:

# Hyperparameters are set by hand here; gsearch.best_params_ holds the values found by the search
clf_model = RandomForestClassifier(min_samples_split=2, n_estimators=600)
clf_model.fit(X_train,y_train)  # The RF model is trained
y_predict = clf_model.predict(X_test)  # Predictions are calculated


In [None]:
y_predict

<div class="alert alert-success">
    <b> Print the ranking of input feature importances. </b>
</div>


In [None]:

# Feature importances extraction
feature_importances = clf_model.feature_importances_

# Use the column names of the training data as feature names
feature_names = X_train.columns


In [None]:
feature_names

In [None]:
# Sorting the feature importances in descending order
sorted_idx = np.argsort(feature_importances)[::-1]

# Creating the plot
plt.figure(figsize=(10, 8))
plt.title('Feature Importances')
plt.bar(range(len(feature_importances)), feature_importances[sorted_idx], align='center')
plt.xticks(range(len(feature_importances)), np.array(feature_names)[sorted_idx], rotation=90)
plt.xlim([-1, X_train.shape[1]])  # set x limits to feature range
plt.tight_layout()
plt.show()

<div class="alert alert-success">
    <b> 1. Confusion matrix </b>
</div>


In [None]:
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay

# Store the result under a different name so the confusion_matrix function is not overwritten
cm = confusion_matrix(y_test, y_predict)
print(classification_report(y_test, y_predict))
print(cm)

In [None]:
# Non-normalized plot of the confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

<div class="alert alert-success">
    <b> 2. Matthews Coefficient (MCC) </b>
</div>

MCC is a correlation coefficient between the true and the predicted labels, ranging from -1 to +1.
* +1 represents a perfect prediction.
* 0 represents a prediction no better than random.
* -1 represents an inverse prediction (total disagreement).
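
For a binary problem, MCC can be computed directly from the four entries of the confusion matrix as `(TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))`. The sketch below applies that formula to invented counts; the helper name `mcc_from_counts` is hypothetical, and returning 0 when the denominator is zero is a common convention assumed here.

```python
import math

def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: return 0 when a marginal sum is zero (undefined case)
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(mcc_from_counts(tp=50, tn=50, fp=0, fn=0))     # perfect prediction
print(mcc_from_counts(tp=0, tn=0, fp=50, fn=50))     # fully inverted prediction
print(mcc_from_counts(tp=25, tn=25, fp=25, fn=25))   # no better than chance
print(round(mcc_from_counts(tp=40, tn=45, fp=5, fn=10), 3))
```

Because all four counts enter the formula, MCC stays informative when the classes are imbalanced.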

In [None]:
from sklearn.metrics import matthews_corrcoef

matthews_corrcoef(y_test, y_predict)

<div class="alert alert-success">
    <b> 3. ROC/AUC curve </b>
</div>



<img src="Figures/rocteoria.png" alt="Drawing" style="width: 800px;"/>
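
A ROC curve is built by sweeping a decision threshold over continuous scores, so it is usually computed from predicted probabilities (for example from `predict_proba`) rather than from hard 0/1 predictions, which yield a single operating point. A small sketch with invented scores illustrates the difference; the numbers and the 0.5 threshold are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# Invented probabilities of class 1, as a classifier's predict_proba might return
y_score = np.array([0.1, 0.3, 0.4, 0.6, 0.35, 0.7, 0.8, 0.9])
y_hard = (y_score >= 0.5).astype(int)   # hard labels at a 0.5 threshold

fpr_s, tpr_s, _ = roc_curve(y_true, y_score)
fpr_h, tpr_h, _ = roc_curve(y_true, y_hard)

auc_scores = roc_auc_score(y_true, y_score)
auc_hard = roc_auc_score(y_true, y_hard)

print("Points on the curve (scores):     ", len(fpr_s))
print("Points on the curve (hard labels):", len(fpr_h))
print("AUC from scores:     ", auc_scores)
print("AUC from hard labels:", auc_hard)
```

With hard labels the curve collapses to the two corners plus one intermediate point, so the resulting AUC describes only that single threshold.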

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score

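# Use the predicted probability of class 1 rather than hard 0/1 predictions,
# so the curve is traced over all decision thresholds
y_score = clf_model.predict_proba(X_test)[:, 1]

# Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_score)

# Compute the AUC
auc = roc_auc_score(y_test, y_score)
print("AUC: {:.2f}".format(auc))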

In [None]:
# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.00])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.show()