Fitting a support vector machine (SVM) model follows on from the preprocessing and vectorisation stages.

### Important Instructions for Working with the SVM Model

**This model is computationally intensive, especially for larger datasets and when using complex kernels like Radial Basis Function (RBF). To ensure efficient execution and avoid unnecessary delays, please follow these guidelines:**

1. **Consider Dataset Size:**
   - If you are working with a large dataset, it is advisable to start with a smaller subset to test the model. You can do this by selecting a random sample of the data.
   - For example, use only a portion of the data for initial testing and parameter tuning before scaling up to the full dataset.

2. **Kernel Selection:**
   - The SVM model with the RBF kernel (`kernel='rbf'`) can be slow. If your data is (approximately) linearly separable, consider using the linear kernel (`kernel='linear'`) instead, which is usually much faster.
   - Evaluate the nature of your data before deciding on the kernel. Use the RBF kernel only if non-linearity is essential for your model.

3. **Parameter Tuning:**
   - If you are using grid search to optimise hyperparameters like `C` and `gamma`, be mindful that this process can be very time-consuming. Limit the range and number of parameter combinations to speed up the process.
   - Consider using random search as an alternative to grid search to reduce computational overhead while still exploring the parameter space effectively.

4. **Dimensionality Reduction:**
   - If your dataset has a large number of features, consider using dimensionality reduction techniques like PCA (Principal Component Analysis) to reduce the feature space before applying the SVM model.
   - This can significantly reduce the computational load and speed up the model fitting process.

5. **Parallel Processing:**
   - Ensure you are utilising all available computational resources. If you are using `GridSearchCV` or `RandomizedSearchCV`, set `n_jobs=-1` to enable parallel processing, which will speed up the hyperparameter tuning process.
   - Be aware that parallel processing can still be resource-intensive and may not fully mitigate long run times on very large datasets.

6. **Monitor Resource Usage:**
   - Keep an eye on your system’s CPU and memory usage during the execution of the model. If you notice that your system is becoming unresponsive, consider stopping the process and revisiting the above points.
   - It may be necessary to use a more powerful machine or cloud computing resources for very large datasets.

7. **Expect Longer Execution Times:**
   - Be prepared for longer execution times when working with SVMs, particularly with large datasets and complex kernels. It is normal for this model to take significantly longer than algorithms such as Logistic Regression or Decision Trees.

8. **Checkpointing:**
   - If possible, save intermediate results or checkpoints during the process to avoid losing progress in case of unexpected interruptions.

9. **Consider Alternatives:**
   - If the execution time is too long or computational resources are limited, consider using alternative models such as Random Forests or Gradient Boosting Machines, which often provide comparable performance with faster runtimes.

By following these guidelines, you can effectively manage the computational demands of the SVM model and make informed decisions about its application to your data.
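
As a rough illustration of guidelines 1 and 4, the sketch below draws a stratified random subset and places PCA in front of a linear-kernel `SVC`. The synthetic data from `make_classification`, the 25% subset size and the choice of 10 components are assumptions made for illustration, not values taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real feature matrix (assumption).
X, y = make_classification(
    n_samples=2000, n_features=50, n_informative=10, random_state=0
)

# Guideline 1: keep a stratified random 25% subset for quick experiments.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.25, stratify=y, random_state=0
)

# Guideline 4: scale, reduce to 10 principal components, then fit the SVM.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=10, random_state=0),
    SVC(kernel="linear"),
)
pipeline.fit(X_small, y_small)

print(X_small.shape)
print(pipeline.named_steps["pca"].n_components_)
```

Wrapping the steps in a pipeline means the scaler and PCA are refitted inside each cross-validation fold if the pipeline is later passed to a search object, which avoids leaking information from the validation folds.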

In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
import h5py
import joblib
import json




In [7]:
# Load a subset of the training and testing datasets.
# Note: these are the first rows of each array (contiguous slices), not random samples.
with h5py.File('../data/splits/train_test_split.h5', 'r') as f:
    X_train = f['X_train'][:10000]  # First 10,000 training rows
    X_test = f['X_test'][:2000]     # First 2,000 testing rows
    y_train = f['y_train'][:10000]
    y_test = f['y_test'][:2000]

print("Sample of training and testing datasets loaded successfully.")


Sample of training and testing datasets loaded successfully.
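
The cell above takes the first rows of each array. If the stored rows are ordered in some way, a random subset may be more representative. The sketch below shows one possible way to build the row indices; the helper name `sample_row_indices` and the stand-in NumPy arrays are hypothetical. The indices are sorted because h5py's fancy indexing expects them in increasing order, so the same `idx` could be used as `f['X_train'][idx]`.

```python
import numpy as np


def sample_row_indices(n_rows, n_samples, seed=0):
    """Return sorted, unique row indices for a random sample without replacement."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(n_rows, size=min(n_samples, n_rows), replace=False)
    return np.sort(idx)  # increasing order, as h5py index lists require


# Stand-in arrays (assumption); in the notebook these would be HDF5 datasets.
X_all = np.arange(100).reshape(50, 2)
y_all = np.arange(50) % 2

idx = sample_row_indices(len(X_all), 10)
X_sample, y_sample = X_all[idx], y_all[idx]
print(X_sample.shape, y_sample.shape)
```

Using the same `idx` for features and labels keeps the rows aligned. Reading many scattered rows from an HDF5 file can be slower than reading one contiguous slice, so this trades some load time for a less biased sample.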


In [8]:
# Instantiate an SVM classifier with default settings
model = SVC()


In [9]:
# Setting hyperparameters
params = {
    'C': [0.1, 1, 10],             # Regularisation parameter (smaller C means stronger regularisation)
    'kernel': ['linear', 'rbf'],   # Kernel type; 'linear' is usually quicker to fit
    'gamma': ['scale', 'auto']     # Kernel coefficient (used by 'rbf', ignored by 'linear')
}
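
Because `gamma` has no effect on the linear kernel, a single flat dictionary contains candidates that differ only in an ignored parameter. One possible alternative, sketched below, is a list of dictionaries, which both `GridSearchCV` and `RandomizedSearchCV` accept. The names `flat_grid` and `split_grid` are illustrative only; `ParameterGrid` is used here just to count the distinct candidates.

```python
from sklearn.model_selection import ParameterGrid

# Flat grid: every kernel is crossed with every gamma value.
flat_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto'],
}

# Split grid: gamma is only varied for the RBF kernel.
split_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10], 'gamma': ['scale', 'auto']},
]

print(len(ParameterGrid(flat_grid)))   # 12 candidates
print(len(ParameterGrid(split_grid)))  # 9 candidates
```

With only nine distinct candidates, an exhaustive search over `split_grid` may be simpler than random sampling.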

In [10]:
# Use RandomizedSearchCV rather than GridSearchCV to limit the number of fits
search = RandomizedSearchCV(
    estimator=model,
    param_distributions=params,
    n_iter=10,                    # Number of parameter settings sampled (out of 12 possible combinations)
    cv=3,                         # Cross-validation splits
    scoring='accuracy',
    verbose=3,                    # Increased verbosity for progress tracking
    n_jobs=-1                     # Utilise all available cores
)

In [11]:
# Fit the model to the training data
search.fit(X_train, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


In [12]:
# Output the best parameters and best score
best_model = search.best_estimator_
print(f"Best Parameters: {search.best_params_}")
print(f"Best Cross-Validation Accuracy: {search.best_score_:.4f}")

Best Parameters: {'kernel': 'linear', 'gamma': 'scale', 'C': 10}
Best Cross-Validation Accuracy: 0.9365
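
Beyond the single best candidate, the fitted search object keeps per-candidate scores and timings in `cv_results_`. The sketch below shows one way to tabulate them. It fits a small search on synthetic data so that it runs on its own; `demo_search`, the synthetic dataset and the reduced parameter lists are assumptions, and in the notebook the same DataFrame could be built from `search.cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Small synthetic problem (assumption) so the example is self-contained.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

demo_search = RandomizedSearchCV(
    SVC(),
    param_distributions={'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']},
    n_iter=4,
    cv=3,
    scoring='accuracy',
    random_state=0,  # fixes which candidates are sampled
)
demo_search.fit(X_demo, y_demo)

# One row per sampled candidate, best-ranked first.
columns = ['params', 'mean_test_score', 'std_test_score', 'mean_fit_time', 'rank_test_score']
summary = pd.DataFrame(demo_search.cv_results_)[columns].sort_values('rank_test_score')
print(summary.to_string(index=False))
```

Comparing `mean_fit_time` across kernels gives a direct view of the speed difference discussed in the guidelines, and writing `summary` to disk is one simple form of checkpointing.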


In [14]:
# Evaluate the best model on the test set
y_pred = best_model.predict(X_test)


In [15]:
# Calculate the evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.941
Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.94      0.94       960
           1       0.94      0.94      0.94      1040

    accuracy                           0.94      2000
   macro avg       0.94      0.94      0.94      2000
weighted avg       0.94      0.94      0.94      2000



In [16]:
# Save the SVM model to a .pkl file using joblib
model_save_path = '../models/svm_model.pkl'
joblib.dump(best_model, model_save_path)
print(f"SVM model saved to {model_save_path}.")


SVM model saved to ../models/svm_model.pkl.
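
A saved model can be restored with `joblib.load`. The sketch below is a minimal round trip on synthetic data, written to a temporary directory rather than the project's `../models/` folder; the dataset and variable names are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Fit a small model on synthetic data (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
fitted = SVC(kernel='linear').fit(X_demo, y_demo)

# Save and reload inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'svm_model.pkl')
    joblib.dump(fitted, path)
    reloaded = joblib.load(path)

# The reloaded model should reproduce the original predictions.
same_predictions = np.array_equal(fitted.predict(X_demo), reloaded.predict(X_demo))
print(same_predictions)
```

Pickled models are generally only reliable when loaded with the same scikit-learn version that wrote them, and pickle files should only be loaded from trusted sources.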


In [17]:
# Create a dictionary to store the metrics
metrics = {
    "accuracy": accuracy,
    "classification_report": classification_report(y_test, y_pred, output_dict=True)
}

# Specify the path where the metrics will be saved
metrics_save_path = '../models/svm_metrics.json'

# Save the metrics dictionary to a JSON file
with open(metrics_save_path, 'w') as f:
    json.dump(metrics, f)

print(f"Model performance metrics saved to {metrics_save_path}.")

Model performance metrics saved to ../models/svm_metrics.json.
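
For later comparison with other models, the saved metrics can be read back as a plain dictionary. The sketch below uses a tiny hand-made set of labels (an assumption) and an in-memory JSON round trip instead of the file above. After the round trip, the per-class entries of the report are keyed by the class label as a string, such as `"1"`.

```python
import json

from sklearn.metrics import accuracy_score, classification_report

# Tiny illustrative labels (assumption), not the notebook's test set.
y_true = [0, 0, 1, 1, 1, 0]
y_hat = [0, 1, 1, 1, 0, 0]

metrics_demo = {
    "accuracy": accuracy_score(y_true, y_hat),
    "classification_report": classification_report(y_true, y_hat, output_dict=True),
}

# Serialise and parse again, as json.dump / json.load would do with a file.
restored = json.loads(json.dumps(metrics_demo))

print(sorted(restored["classification_report"].keys()))
print(restored["classification_report"]["1"]["recall"])
```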
