## Model Training SVM

Para la selección de modelos, decidí entrenar un modelo de Linear Support Vector Machine y otro de Regresión Logística

En este notebook se encuentra el entrenamiento de SVM.

A continuación, aplicaremos CountVectorizer y TF-IDF Transformer.

CountVectorizer convierte el texto de cada registro a una matriz, en la cuál cada renglón representa un documento (que es un registro de la columna de quejas) y cada columna es una palabra del vocabulario del mismo.

TF-IDF Transformer convierte el conteo de palabras a un score de TF-IDF, que normaliza la importancia de cada palabra basada en su frecuencia en cada documento y a través de todos los documentos. Nos ayuda a enfocarnos en las palabras más importantes, que no son tan comunes, pero sobre todo convierte el texto en formato numérico para poder entrenar nuestro modelo.

In [46]:
import pandas as pd
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score
import dagshub
import mlflow
import mlflow.sklearn
from hyperopt import fmin, tpe, hp, Trials
from hyperopt.pyll import scope
import pickle
import pathlib

In [42]:
df = pd.read_csv('../data/processed.csv')
X = df.complaint_what_happened
y = df.ticket_classification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42)

In [43]:
dagshub.init(url="https://dagshub.com/zapatacc/final-exam-pcd2024-autumn", mlflow=True)

MLFLOW_TRACKING_URI = mlflow.get_tracking_uri()

print(MLFLOW_TRACKING_URI)

mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
mlflow.set_experiment(experiment_name="erick-machuca-SVM")

https://dagshub.com/zapatacc/final-exam-pcd2024-autumn.mlflow


2024/11/21 12:57:42 INFO mlflow.tracking.fluent: Experiment with name 'erick-machuca-SVM' does not exist. Creating a new experiment.


<Experiment: artifact_location='mlflow-artifacts:/626151d2cb9a4b17b840cd442c8160e4', creation_time=1732215462967, experiment_id='20', last_update_time=1732215462967, lifecycle_stage='active', name='erick-machuca-SVM', tags={}>

In [44]:
# Start logging the experiment
with mlflow.start_run() as run:
    # Log model parameters
    mlflow.log_param("loss", "hinge")
    mlflow.log_param("penalty", "l2")
    mlflow.log_param("alpha", 1e-3)
    mlflow.log_param("max_iter", 5)
    mlflow.log_param("class_weight", "balanced")

    # Define and train the SGD pipeline
    sgd = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None, class_weight='balanced')),
    ])
    sgd.fit(X_train, y_train)

    # Make predictions
    y_pred = sgd.predict(X_test)

    # Calculate and log metrics
    accuracy = accuracy_score(y_pred, y_test)
    recall = recall_score(y_pred, y_test, average='macro')
    precision = precision_score(y_pred, y_test, average='macro')

    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("recall", recall)
    mlflow.log_metric("precision", precision)

    # Print a success message
    print(f"SGD Classifier logged in MLflow with accuracy: {accuracy:.2f}, recall: {recall:.2f}, precision: {precision:.2f}")

SGD Classifier logged in MLflow with accuracy: 0.48, recall: 0.28, precision: 0.42
🏃 View run capable-robin-924 at: https://dagshub.com/zapatacc/final-exam-pcd2024-autumn.mlflow/#/experiments/20/runs/6ea19ae69e0943399c91ee5a491a08dd
🧪 View experiment at: https://dagshub.com/zapatacc/final-exam-pcd2024-autumn.mlflow/#/experiments/20


In [48]:
# 1. Define the objective function for SVM hyperparameter tuning
def objective(params):
    # Extract parameters from the search space
    alpha = params['alpha']
    max_iter = int(params['max_iter'])

    # Build and train the SVM pipeline
    sgd = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', SGDClassifier(
            loss='hinge',
            penalty='l2',
            alpha=alpha,
            random_state=42,
            max_iter=max_iter,
            tol=None,
            class_weight='balanced'))
    ])
    sgd.fit(X_train, y_train)

    # Make predictions and calculate the objective metric (negative accuracy for minimization)
    y_pred = sgd.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    return -accuracy  # Return negative because fmin minimizes by default

# 2. Set up the search space for hyperparameters
search_space = {
    'alpha': hp.loguniform('alpha', -5, 1),  # Alpha (learning rate) on a log scale
    'max_iter': scope.int(hp.quniform('max_iter', 5, 100, 5))  # Iterations
}

# 3. Start MLflow run for hyperparameter optimization
with mlflow.start_run(run_name="SVM Hyper-parameter Optimization", nested=True):
    # 4. Optimize SVM parameters using hyperopt
    trials = Trials()
    best_params = fmin(
        fn=objective,
        space=search_space,
        algo=tpe.suggest,
        max_evals=10,  # Adjust for more evaluations
        trials=trials
    )

    # Convert parameters to usable types
    best_params['alpha'] = float(best_params['alpha'])
    best_params['max_iter'] = int(best_params['max_iter'])

    # Log the best parameters to MLflow
    mlflow.log_params(best_params)

    # 5. Set experiment tags for tracking
    mlflow.set_tags({
        "project": "Text Classification with SVM",
        "optimizer_engine": "hyper-opt",
        "model_family": "Linear SVM",
        "feature_set_version": 1,
    })

    # 6. Train the SVM model with the best parameters
    sgd = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', SGDClassifier(
            loss='hinge',
            penalty='l2',
            alpha=best_params['alpha'],
            random_state=42,
            max_iter=best_params['max_iter'],
            tol=None,
            class_weight='balanced'))
    ])
    sgd.fit(X_train, y_train)

    # Make predictions and calculate metrics
    y_pred = sgd.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred, average='macro')
    precision = precision_score(y_test, y_pred, average='macro')

    # Log metrics to MLflow
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("recall", recall)
    mlflow.log_metric("precision", precision)

    # 7. Save the trained SVM pipeline using mlflow.sklearn
    mlflow.sklearn.log_model(sgd, artifact_path="model")

    # Print out a success message
    print(f"Best SVM model logged with accuracy: {accuracy:.2f}, recall: {recall:.2f}, precision: {precision:.2f}")



100%|██████████| 10/10 [00:33<00:00,  3.37s/trial, best loss: -0.47913749649957993]


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Best SVM model logged with accuracy: 0.48, recall: 0.26, precision: 0.31
🏃 View run SVM Hyper-parameter Optimization at: https://dagshub.com/zapatacc/final-exam-pcd2024-autumn.mlflow/#/experiments/20/runs/73938a23abad4b7dbff08e3136aa10ce
🧪 View experiment at: https://dagshub.com/zapatacc/final-exam-pcd2024-autumn.mlflow/#/experiments/20
