<a href="https://colab.research.google.com/github/Lenna888/Classifier_ML_Dsmatallana_Lelatorre_802/blob/main/Machine_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**SPAM/HAM Email Classifier**
---
This notebook builds a Logistic Regression model that classifies emails as SPAM or HAM. The features are analyzed before training, and the model is evaluated with the F1 score and confusion matrices.


@authors: Duvan Santiago Matallana Jiménez - Laura Estefania Latorre Pachon Group: 802  
@date: 2025-09-10


In [None]:
from google.colab import drive
drive.mount('/content/drive')

##**Library imports**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, f1_score, accuracy_score
from sklearn.preprocessing import StandardScaler

print("Successful imports")

###Loading the data and encoding the dominio_remitente and dominio_respuesta features as dominio_coincide (text to numeric)

In [None]:
try:
  # Load the dataset (semicolon-separated CSV stored on Google Drive)
  df = pd.read_csv('/content/drive/MyDrive/Machine_learning/Dataset_linear_model/dataset_con_clase.csv', sep=';')
  print("Dataset loaded successfully. Preview:")
  print(df.head())
except FileNotFoundError:
  print("Error: file not found or wrong directory")
  df = None

if df is not None:
    # Compare the sender domain with the reply domain.
    # Create a new column 'dominio_coincide':
    # 1 if the two domains are the same, 0 if they are different.
    df['dominio_coincide'] = (df['dominio_remitente'] == df['dominio_respuesta']).astype(int)
    print("\nThe column 'dominio_coincide' has been created.")
    df = df.drop(columns=['dominio_remitente', 'dominio_respuesta'])
    print("\nThe columns 'dominio_remitente' and 'dominio_respuesta' have been removed.")

    X = df.drop('clase', axis=1)
    y = df['clase']
    print("\nThe variables X and y have been created.")
    print(X.columns.tolist())
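As a minimal illustration of the encoding above, the sketch below applies the same comparison to a small made-up DataFrame. The rows and domain names are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; the real dataset is loaded from Drive above.
toy = pd.DataFrame({
    "dominio_remitente": ["example.com", "shop.example", "mail.example"],
    "dominio_respuesta": ["example.com", "other.example", "mail.example"],
})

# The element-wise comparison yields booleans; astype(int) maps True/False to 1/0.
toy["dominio_coincide"] = (
    toy["dominio_remitente"] == toy["dominio_respuesta"]
).astype(int)

print(toy["dominio_coincide"].tolist())  # [1, 0, 1]
```

Note that missing values (NaN) compare as unequal in pandas, so a row where either domain is missing would be encoded as 0.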




The **correlations** between the features are reviewed as a starting point for the different tests, with the goal of finding the **best model**.

In [None]:
if df is not None:
    # Calculate the correlation matrix
    correlation_matrix = df.corr()

    plt.figure(figsize=(14, 10))
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f",
                linewidths=.5)
    plt.title('Feature Correlation Heatmap')
    plt.show()
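The heatmap can be complemented by ranking the features by their absolute correlation with the target, which makes it easier to apply thresholds such as 0.1 or 0.15. The sketch below is one possible way to do this on synthetic data; the columns `strong_feature` and `noise_feature` are hypothetical stand-ins for the real features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
clase = rng.integers(0, 2, size=n)

# Synthetic stand-in data: one feature tied to the class, one pure noise.
toy = pd.DataFrame({
    "clase": clase,
    "strong_feature": clase + rng.normal(0, 0.5, size=n),
    "noise_feature": rng.normal(0, 1, size=n),
})

# Absolute Pearson correlation of every feature with the target, strongest first.
corr_with_target = (
    toy.corr()["clase"].drop("clase").abs().sort_values(ascending=False)
)

threshold = 0.15  # same cut-off idea as in the tests that follow
kept = corr_with_target[corr_with_target > threshold].index.tolist()

print(corr_with_target)
print("Features above threshold:", kept)
```

Pearson correlation only captures linear association, so a feature with low correlation is not necessarily uninformative; the thresholds are a heuristic for choosing which subsets to test.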

The dataset is split into **_training, validation, and final test_** sets.

In [None]:
if 'X' in locals() and 'y' in locals():
  """
   The data is first split 70/30 into a training set and a temporary set; the
   temporary set is then split in half to obtain the validation and test sets
   (15% each), giving the train/validation/test structure.
  """
  x_train, x_temp, y_train, y_temp = train_test_split(
      X, y, test_size = 0.3, random_state=42,
      stratify=y
  )

  x_val, x_test, y_val, y_test = train_test_split(
      x_temp, y_temp, test_size=0.5, random_state=42,
      stratify=y_temp
  )

  print(f"Training set size: {x_train.shape[0]} samples")
  print(f"Validation set size: {x_val.shape[0]} samples")
  print(f"Test set size: {x_test.shape[0]} samples")
else:
  print("Error: Variables X and y are not defined.")
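To make the resulting proportions concrete, the following sketch repeats the same two-step split on a made-up array of 100 samples (70 of class 0 and 30 of class 1, both hypothetical) and checks the sizes and the class balance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 30% positive class.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 70 + [1] * 30)

# Step 1: 70% training, 30% temporary.
x_tr, x_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
# Step 2: split the temporary set in half -> 15% validation, 15% test.
x_v, x_te, y_v, y_te = train_test_split(
    x_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

print(len(x_tr), len(x_v), len(x_te))  # 70 15 15
print(f"Positive rate in training: {y_tr.mean():.2f}")  # 0.30
```

Because `stratify` is passed in both steps, each subset keeps approximately the class ratio of the full dataset.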

##**Start of the tests**

The model is first trained with all features to verify that the data loading, the model input, and the hyperparameter configuration work correctly.

In [None]:
if 'x_train' in locals():
  """
    The model is trained with a convergence tolerance of 0.0001 and a maximum
    of 10,000 iterations so that the solver has enough room to converge. The
    seed is fixed at 42 so that the results are reproducible.
  """
  print("Test with all features")
  logistic_model_1 = LogisticRegression(tol=0.0001, random_state=42, max_iter=10000)
  logistic_model_1.fit(x_train, y_train)
  y_pred_val_1 = logistic_model_1.predict(x_val)
  accuracy_1 = accuracy_score(y_val, y_pred_val_1)
  f1_1 = f1_score(y_val, y_pred_val_1)
  print(f"Accuracy: {accuracy_1:.5f}")
  print(f"F1 Score: {f1_1:.5f}")

  cm_1 = confusion_matrix(y_val, y_pred_val_1)
  sns.heatmap(cm_1, annot=True, fmt='d', cmap='Blues')
  plt.title('Confusion Matrix - Test 1')
  plt.xlabel('Prediction')
  plt.ylabel('Actual Value')
  plt.show()

else:
  print("Error: Variable x_train is not defined.")
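Since every test reports the F1 score next to a confusion matrix, it can help to see how the two relate. The sketch below uses made-up labels (not results from this notebook) to derive precision, recall, and F1 from the four cells of the matrix and compares the result with `f1_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels for illustration: 0 = HAM, 1 = SPAM.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of predicted SPAM that is really SPAM
recall = tp / (tp + fn)      # share of real SPAM that was detected
f1_manual = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)        # 3 1 2 4
print(f"{f1_manual:.4f}")    # 0.7273
print(f"{f1_score(y_true, y_pred):.4f}")  # 0.7273
```

Unlike accuracy, F1 ignores the true negatives, which is why it is the more informative metric when the HAM class dominates.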


_Test with highly correlated features_

In [None]:
if 'x_train' in locals():
  """
    The same parameters are maintained, except that training is performed with
    the best features according to the correlation analysis.
  """
  selected_features =[
      'cantidad_exclamaciones',
      'cantidad_urls',
      'javascript_embebido',
      'adjuntos_ejecutables',
      'adjuntos_sospechosos',
      'lenguaje_imperativo'
  ]

  print("Test with some features")
  x_train_2 = x_train[selected_features]
  x_val_2 = x_val[selected_features]

  logistic_model_2 = LogisticRegression(tol=0.0001, random_state=42, max_iter=10000)
  logistic_model_2.fit(x_train_2, y_train)

  y_pred_val_2 = logistic_model_2.predict(x_val_2)
  accuracy_2 = accuracy_score(y_val, y_pred_val_2)
  f1_2 = f1_score(y_val, y_pred_val_2)

  print(f"Accuracy: {accuracy_2:.5f}")
  print(f"F1 Score: {f1_2:.5f}")

  cm_2 = confusion_matrix(y_val, y_pred_val_2)
  sns.heatmap(cm_2, annot=True, fmt='d', cmap='Blues')
  plt.title('Confusion Matrix - Test 2')
  plt.xlabel('Prediction')
  plt.ylabel('Actual Value')
  plt.show()





_Test with noisy features removed_

In [None]:
if 'x_train' in locals():
  """
    Here, features with correlations below 0.1 are removed.
  """
  excepted_features = [
      'cantidad_interrogaciones',
      'cantidad_dominios_urls',
      'dominio_coincide'
  ]
  without_excepted_features = [col for col in X.columns if col not in excepted_features]
  logistic_model_3 = LogisticRegression(tol=0.0001, random_state=42, max_iter=10000)
  logistic_model_3.fit(x_train[without_excepted_features], y_train)
  y_pred_val_3 = logistic_model_3.predict(x_val[without_excepted_features])
  accuracy_3 = accuracy_score(y_val, y_pred_val_3)
  f1_3 = f1_score(y_val, y_pred_val_3)
  print(f"Accuracy: {accuracy_3:.5f}")
  print(f"F1 Score: {f1_3:.5f}")

  cm_3 = confusion_matrix(y_val, y_pred_val_3)
  sns.heatmap(cm_3, annot=True, fmt='d', cmap='Blues')
  plt.title('Confusion Matrix - Test 3')
  plt.xlabel('Prediction')
  plt.ylabel('Actual Value')
  plt.show()


_Only features with a correlation above 0.15_

In [None]:
if 'x_train' in locals():
  """
    Here, on the contrary, only those greater than 0.15 are taken to validate
    good correlations.
  """
  features_over = [
        'cantidad_urls',
        'lenguaje_imperativo',
        'adjuntos_sospechosos',
        'javascript_embebido',
        'adjuntos_ejecutables',
        'cantidad_exclamaciones',
        'cantidad_destinatarios',
        'ip_en_url',
        'idioma_diferente_usuario'
  ]
  logistic_model_4 = LogisticRegression(tol=0.0001, random_state=42, max_iter=1000)
  logistic_model_4.fit(x_train[features_over], y_train)
  y_pred_val_4 = logistic_model_4.predict(x_val[features_over])
  accuracy_4 = accuracy_score(y_val, y_pred_val_4)
  f1_4 = f1_score(y_val, y_pred_val_4)
  print(f"Accuracy: {accuracy_4:.5f}")
  print(f"F1 Score: {f1_4:.5f}")

  cm_4 = confusion_matrix(y_val, y_pred_val_4)
  sns.heatmap(cm_4, annot=True, fmt='d', cmap='Blues')
  plt.title('Confusion Matrix - Test 4')
  plt.xlabel('Prediction')
  plt.ylabel('Actual Value')
  plt.show()
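The four tests above follow the same pattern: select a feature subset, fit, predict on the validation set, and score. One possible way to factor that pattern out is sketched below. The helper `evaluate_subset`, the synthetic data, and the column names `f0` to `f5` are all hypothetical and only serve to keep the example self-contained.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split


def evaluate_subset(x_tr, y_tr, x_va, y_va, features, max_iter=10000):
    """Fit a logistic regression on `features` and return (accuracy, f1)."""
    model = LogisticRegression(tol=0.0001, random_state=42, max_iter=max_iter)
    model.fit(x_tr[features], y_tr)
    y_hat = model.predict(x_va[features])
    return accuracy_score(y_va, y_hat), f1_score(y_va, y_hat)


# Synthetic stand-in data; the real notebook would pass x_train / x_val.
X_arr, y_arr = make_classification(n_samples=300, n_features=6, random_state=42)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
x_tr, x_va, y_tr, y_va = train_test_split(
    X_demo, y_arr, test_size=0.3, random_state=42, stratify=y_arr
)

# Each entry plays the role of one "test" with its own feature subset.
results = {
    "all": evaluate_subset(x_tr, y_tr, x_va, y_va, list(X_demo.columns)),
    "first_three": evaluate_subset(x_tr, y_tr, x_va, y_va, ["f0", "f1", "f2"]),
}
for name, (acc, f1) in results.items():
    print(f"{name}: accuracy={acc:.5f}, f1={f1:.5f}")
```

Collecting the scores in one dictionary makes it straightforward to compare subsets side by side before choosing the model for the final test.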

###Standardization before repeating the tests
The data is now standardized so that features with large value ranges do not dominate the fit, and the results are compared against training without scaling.

In [None]:
if 'x_train' in locals():
  """
    The subdivisions of the dataset are standardized to improve the accuracy
    of the models.
  """
  scaler = StandardScaler()
  x_train_scaled_ss = scaler.fit_transform(x_train)
  x_val_scaled_ss = scaler.transform(x_val)
  x_test_scaled_ss = scaler.transform(x_test)
  x_train_scaled = pd.DataFrame(x_train_scaled_ss, columns=X.columns)
  x_val_scaled = pd.DataFrame(x_val_scaled_ss, columns=X.columns)
  x_test_scaled = pd.DataFrame(x_test_scaled_ss, columns=X.columns)

  print("Data converted to DataFrame. The variables are ready for testing.")
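The key point in the cell above is that the scaler is fitted on the training set only and then reused for validation and test. The sketch below shows, on a tiny made-up array, that `transform` applies the training mean and standard deviation, so the validation data is scaled without leaking its own statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: two features on very different scales.
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
val = np.array([[2.0, 400.0]])

scaler = StandardScaler().fit(train)   # statistics come from train only
train_scaled = scaler.transform(train)
val_scaled = scaler.transform(val)

# Same result by hand: z = (x - mean_train) / std_train (population std).
manual = (val - train.mean(axis=0)) / train.std(axis=0)

print(train_scaled.mean(axis=0))  # approximately [0. 0.]
print(train_scaled.std(axis=0))   # approximately [1. 1.]
print(np.allclose(val_scaled, manual))  # True
```

Standardization rescales the features but does not remove outliers: an extreme value remains extreme after scaling, as the second column of `val_scaled` shows.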

####Tests with scaled data

The same tests are repeated with the standardized data. Confusion matrices are omitted here because **_accuracy_** and **_F1 score_** are enough to compare the models, and the best one is chosen for the **final tests**.

In [None]:
"""
  The process is the same as with the previous tests, only the standardized
  subdivisions are used and the new results are obtained.
"""

# --- Test 1   ---
print("Test 1 (All features) with scaled data")
logistic_model_1s = LogisticRegression(random_state=42, max_iter=1000)
logistic_model_1s.fit(x_train_scaled, y_train)
y_pred_val_1s = logistic_model_1s.predict(x_val_scaled)
print(f"Accuracy: {accuracy_score(y_val, y_pred_val_1s):.5f}")
print(f"F1 Score: {f1_score(y_val, y_pred_val_1s):.5f}")

# --- Test 2 ---
selected_features =[
      'cantidad_exclamaciones',
      'cantidad_urls',
      'javascript_embebido',
      'adjuntos_ejecutables',
      'adjuntos_sospechosos',
      'lenguaje_imperativo'
  ]
print(f"\n Test 2 (High correlation) with scaled data")
logistic_model_2s = LogisticRegression(random_state=42, max_iter=1000)
logistic_model_2s.fit(x_train_scaled[selected_features], y_train)
y_pred_val_2s = logistic_model_2s.predict(x_val_scaled[selected_features])
print(f"Accuracy: {accuracy_score(y_val, y_pred_val_2s):.5f}")
print(f"F1 Score: {f1_score(y_val, y_pred_val_2s):.5f}")

# --- Test 3 ---
excepted_features = [
      'cantidad_interrogaciones',
      'cantidad_dominios_urls',
      'dominio_coincide'
  ]
features_3 = [col for col in X.columns if col not in excepted_features]
print(f"\n Test 3 (Noise exclusion) with scaled data")
logistic_model_3s = LogisticRegression(random_state=42, max_iter=1000)
logistic_model_3s.fit(x_train_scaled[features_3], y_train)
y_pred_val_3s = logistic_model_3s.predict(x_val_scaled[features_3])
print(f"Accuracy: {accuracy_score(y_val, y_pred_val_3s):.5f}")
print(f"F1 Score: {f1_score(y_val, y_pred_val_3s):.5f}")

# --- Test 4 ---
features_over = [
        'cantidad_urls',
        'lenguaje_imperativo',
        'adjuntos_sospechosos',
        'javascript_embebido',
        'adjuntos_ejecutables',
        'cantidad_exclamaciones',
        'cantidad_destinatarios',
        'ip_en_url',
        'idioma_diferente_usuario'
    ]
print(f"\n Test 4 (Balanced) with scaled data ")
logistic_model_4s = LogisticRegression(random_state=42, max_iter=1000)
logistic_model_4s.fit(x_train_scaled[features_over], y_train)
y_pred_val_4s = logistic_model_4s.predict(x_val_scaled[features_over])
print(f"Accuracy: {accuracy_score(y_val, y_pred_val_4s):.5f}")
print(f"F1 Score: {f1_score(y_val, y_pred_val_4s):.5f}")

##Final test
Below, the correlations of the features selected for the best model are shown, and the model is evaluated on the held-out 15% of the dataset.

###Test with unnormalized data

In [None]:

"""
  The best trained model is taken, the correlation of its features is sought,
  and then it is tested with the last subdivision, which is the testing,
  and its true accuracy is found.
"""
features_del_mejor_modelo = features_3
mejor_modelo = logistic_model_3

features_para_mapa = features_del_mejor_modelo + ['clase']

correlation_matrix_optima = df[features_para_mapa].corr()

plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix_optima, annot=True, cmap='viridis', fmt=".2f")
plt.title('Correlation Map of the Best Model Features')
plt.show()

x_test_final = x_test[features_del_mejor_modelo]
y_pred_final = mejor_modelo.predict(x_test_final)
print("--- FINAL RESULTS OF THE BEST MODEL ON THE TEST SET ---")
final_accuracy = accuracy_score(y_test, y_pred_final)
final_f1 = f1_score(y_test, y_pred_final)
print(f"\nFinal accuracy: {final_accuracy:.5f}")
print(f"Final F1 score: {final_f1:.5f}")

print("\n--- Detailed classification report ---")
print(classification_report(y_test, y_pred_final, target_names=['HAM (class 0)', 'SPAM (class 1)']))

print("\n--- Final confusion matrix ---")
cm_final = confusion_matrix(y_test, y_pred_final)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_final, annot=True, fmt='d', cmap='Greens',
            xticklabels=['Predicted HAM', 'Predicted SPAM'],
            yticklabels=['Actual HAM', 'Actual SPAM'])
plt.title('Confusion Matrix on the Test Set')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()

###Test with normalized data



In [None]:
"""
  The test is repeated but with normalized data in order to make a better
  assessment.
"""
features_del_mejor_modelo = features_3
mejor_modelo = logistic_model_3s

features_para_mapa = features_del_mejor_modelo + ['clase']

correlation_matrix_optima = df[features_para_mapa].corr()

plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix_optima, annot=True, cmap='viridis', fmt=".2f")
plt.title('Correlation Map of the Best Model Features')
plt.show()

x_test_final = x_test_scaled[features_del_mejor_modelo]
y_pred_final = mejor_modelo.predict(x_test_final)
print("--- FINAL RESULTS OF THE BEST MODEL ON THE TEST SET ---")
final_accuracy = accuracy_score(y_test, y_pred_final)
final_f1 = f1_score(y_test, y_pred_final)
print(f"\nFinal accuracy: {final_accuracy:.5f}")
print(f"Final F1 score: {final_f1:.5f}")

print("\n--- Detailed classification report ---")
print(classification_report(y_test, y_pred_final, target_names=['HAM (class 0)', 'SPAM (class 1)']))

print("\n--- Final confusion matrix ---")
cm_final = confusion_matrix(y_test, y_pred_final)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_final, annot=True, fmt='d', cmap='Greens',
            xticklabels=['Predicted HAM', 'Predicted SPAM'],
            yticklabels=['Actual HAM', 'Actual SPAM'])
plt.title('Confusion Matrix on the Test Set')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()

##Outlier analysis

Finally, we look for the errors in which the model was confidently wrong and compute each sample's distance from the decision boundary, which indicates how severe the error was.

In [None]:
"""
  The decision function of the best model is evaluated on the validation set.
  Its value is proportional to the signed distance of each sample from the
  decision boundary, so for misclassified samples its absolute value indicates
  how confident the model was in the wrong prediction.
"""
features_del_mejor_modelo = features_3
mejor_modelo = logistic_model_3s

y_pred_val_optimo = mejor_modelo.predict(x_val_scaled[features_del_mejor_modelo])
distancia_optima = mejor_modelo.decision_function(x_val_scaled[features_del_mejor_modelo])

# Build a DataFrame to analyze the errors
analisis_df_optimo = pd.DataFrame({
    'actual_value': y_val.values,
    'prediction': y_pred_val_optimo,
    'decision distance': distancia_optima
})

# Keep only the misclassified samples
errores_df_optimo = analisis_df_optimo[analisis_df_optimo['actual_value'] != analisis_df_optimo['prediction']].copy()
errores_df_optimo['confianza_error'] = np.abs(errores_df_optimo['decision distance'])
errores_df_optimo = errores_df_optimo.sort_values(by='confianza_error', ascending=False)

print("\n Analysis of outliers in the optimal model ")
print("The errors in which the model was most ‘certain’ of its incorrect prediction:")
print(errores_df_optimo.head())
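To clarify what the "decision distance" column measures, the sketch below fits a logistic regression on synthetic data (a stand-in for the notebook's data, generated with `make_classification`) and checks three relationships: the score is the linear function `w·x + b`, the prediction is its sign, and the predicted probability of class 1 is the sigmoid of the score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data used only for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=42)
model = LogisticRegression(random_state=42, max_iter=1000).fit(X_demo, y_demo)

scores = model.decision_function(X_demo)

# 1) The score is the linear function of the features.
manual_scores = X_demo @ model.coef_.ravel() + model.intercept_[0]
# 2) The predicted class is 1 exactly when the score is positive.
pred_from_scores = (scores > 0).astype(int)
# 3) The probability of class 1 is the sigmoid of the score.
sigmoid = 1.0 / (1.0 + np.exp(-scores))

print(np.allclose(scores, manual_scores))                   # True
print(np.array_equal(pred_from_scores, model.predict(X_demo)))  # True
print(np.allclose(sigmoid, model.predict_proba(X_demo)[:, 1]))  # True
```

Strictly speaking, the score equals the geometric distance to the boundary multiplied by the norm of the coefficient vector, so it ranks errors by confidence correctly but is not a distance in feature units.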