In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Ingest the Dataset

In [2]:
# Load the dataset once and reuse it for both the features and the target
data = load_breast_cancer()
# feature_names holds the 30 column labels ('mean radius' ... 'worst fractal dimension')
features = pd.DataFrame(data['data'], columns=data['feature_names'])
target = pd.DataFrame(data['target'])

# Create Training, Validation, and Testing Datasets
Apply the train_test_split() function to create training, validation, and testing datasets (use test_size = 0.2 for each split).

*HINT: You may need to use train_test_split() twice*

In [3]:
X, X_test, y, y_test = train_test_split(features, target, test_size=0.2, random_state=0)

In [4]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
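
Because the second split is applied to the 80% left over from the first, the final proportions are roughly 64% training, 16% validation, and 20% testing rather than 60/20/20. The sketch below works through the arithmetic. It assumes the held-out count is rounded up, which is our understanding of how scikit-learn treats a fractional test_size, and it uses 569 as the row count of scikit-learn's copy of the dataset. The helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size=0.2):
    """Return (n_train, n_val, n_test) after two successive splits.

    Assumption: each held-out set gets ceil(fraction * n) rows and the
    remainder stays in the larger set.
    """
    n_test = math.ceil(n_samples * test_size)   # first split: hold out the test set
    n_rest = n_samples - n_test                 # rows passed to the second split
    n_val = math.ceil(n_rest * test_size)       # second split: hold out the validation set
    n_train = n_rest - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(569)
print(n_train, n_val, n_test)
print(round(n_train / 569, 2), round(n_val / 569, 2), round(n_test / 569, 2))
```

If the row counts printed here differ from `len(X_train)`, `len(X_val)`, and `len(X_test)`, the rounding assumption does not hold for your scikit-learn version.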

# Hyperparameter Tuning
Model parameters are quantities that are estimated from data during training. For example, the coefficients of a linear regression model are parameters. Hyperparameters, on the other hand, are quantities that the programmer chooses before training. They may control how strongly the model guards against overfitting (regularization parameter) or how quickly the model updates in response to data (learning rate).
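
To make the distinction concrete, here is a minimal sketch that is separate from the SVM exercise. It fits the slope `w` of a line through the origin with gradient descent. The slope is a parameter, learned from the data, while `learning_rate` is a hyperparameter that we pick by hand. The function name `fit_slope` and the toy data are illustrative assumptions.

```python
def fit_slope(xs, ys, learning_rate, n_steps=200):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0  # parameter: starts at an arbitrary value and is learned from the data
    n = len(xs)
    for _ in range(n_steps):
        # Gradient of mean((w*x - y)^2) with respect to w
        grad = 2 * sum(x * (w * x - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad  # hyperparameter: controls the step size
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated with a true slope of 2

w_small = fit_slope(xs, ys, learning_rate=0.01)  # converges toward 2
w_large = fit_slope(xs, ys, learning_rate=0.2)   # step too large: the updates diverge
print(w_small, w_large)
```

The same data yields a good or a useless model depending on a value that is never learned, which is why hyperparameters are tuned on a separate validation set.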

For this exercise, we will explore how changes in the regularization parameter affect the performance of a support vector machine (SVM) classification model. At a high level, regularization is strong when the model is heavily penalized for fitting the training data too closely, which guards against overfitting (overfitting = learning the specifics of the dataset itself rather than trends within the data that generalize to new data points).

The regularization parameter C is inversely proportional to the strength of the regularization. Thus, when C is high, the regularization is weak: misclassified training examples are penalized heavily, so the model tries hard to classify every training example correctly, at the risk of overfitting. When C is low, the regularization is strong: the model tolerates more misclassified training examples in exchange for a wider margin, which helps prevent overfitting.
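
For reference, the role of C can be read off the standard soft-margin objective for a linear SVM. This is the textbook formulation rather than something stated in the exercise. The labels $y_i \in \{-1, +1\}$, the weight vector $w$, and the bias $b$ are notation introduced here.

```latex
\min_{w,\,b} \;\; \frac{1}{2}\lVert w \rVert^2 \;+\; C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\,(w^\top x_i + b)\bigr)
```

The first term favors a wide margin, and the second term sums the margin violations over the training set. A large C makes the violation term dominate, so the model works to classify every training example correctly. A small C makes the margin term dominate, so some violations are tolerated.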

# Run Model
Build 7 support vector machines (using a linear kernel) that predict whether a given tumor is malignant or benign. Each support vector machine should use a different value of the regularization parameter C: 0.001, 0.01, 0.1, 1, 10, 100, 1000. Then, calculate the F1-score of each model on the validation dataset. Lastly, identify the regularization parameter that yields the best model. For context, the F1-score is the harmonic mean of precision and recall and ranges from 0 to 1, with higher values being better.
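
As a rough illustration of what f1_score() computes for binary labels, the sketch below derives precision, recall, and F1 from counts of true positives, false positives, and false negatives. The function name `f1_from_labels` and the toy labels are hypothetical. It treats the label 1 as the positive class, which we believe matches the default of scikit-learn's f1_score(). In scikit-learn's copy of this dataset, 1 encodes the benign class.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Compute (precision, recall, f1) for binary labels from scratch."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of predicted positives that are correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of actual positives that are found
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Toy labels: 2 true positives, 1 false positive, 2 false negatives
y_true_toy = [1, 1, 1, 1, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0]
precision, recall, f1 = f1_from_labels(y_true_toy, y_pred_toy)
print(precision, recall, f1)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot reach a high F1-score by maximizing precision alone or recall alone.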

*HINT: Use the following link (under the header "Generating Model") to see how to implement an SVM. Make use of scikit-learn documentation to identify how to change the regularization parameter.*

https://www.datacamp.com/community/tutorials/svm-classification-scikit-learn-python

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

In [5]:
C = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
for reg_param in C: 
    model = SVC(kernel='linear', C=reg_param)
    # .values.ravel() flattens the single-column DataFrame into the 1D NumPy array that fit() expects
    model.fit(X_train, y_train.values.ravel())
    y_pred = model.predict(X_val)
    print("Regularization Parameter: {}, F1-Score: {}".format(reg_param, f1_score(y_val, y_pred)))

Regularization Parameter: 0.001, F1-Score: 0.9532710280373831
Regularization Parameter: 0.01, F1-Score: 0.9532710280373831
Regularization Parameter: 0.1, F1-Score: 0.9532710280373831
Regularization Parameter: 1, F1-Score: 0.9444444444444444
Regularization Parameter: 10, F1-Score: 0.9629629629629629
Regularization Parameter: 100, F1-Score: 0.9245283018867925
Regularization Parameter: 1000, F1-Score: 0.9245283018867925
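
Rather than reading the best value off the printout by eye, the selection can be done in code. The sketch below copies the validation F1-scores from the output above into a dictionary (the name `val_f1` is an assumption) and picks the C with the highest score. In the loop itself, you could fill such a dictionary instead of only printing.

```python
# Validation F1-scores copied from the output above, keyed by C
val_f1 = {
    0.001: 0.9532710280373831,
    0.01: 0.9532710280373831,
    0.1: 0.9532710280373831,
    1: 0.9444444444444444,
    10: 0.9629629629629629,
    100: 0.9245283018867925,
    1000: 0.9245283018867925,
}

# max() with key=val_f1.get returns the dictionary key whose score is largest
best_c = max(val_f1, key=val_f1.get)
print("Best C: {}, validation F1-Score: {}".format(best_c, val_f1[best_c]))
```

Note that the gaps between these scores are small relative to the size of the validation set, so a different random_state could plausibly favor a different C.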


# Testing
Now that you've performed validation and identified the hyperparameter value that yields the best-performing SVM, test that model on the testing dataset by generating a set of predictions and calculating the F1-score.

In [6]:
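# Refit with the C that scored best on the validation set (C=10)
final_model = SVC(kernel='linear', C=10)
final_model.fit(X_train, y_train.values.ravel())
# Evaluate once on the held-out test set, which played no part in choosing C
final_pred = final_model.predict(X_test)
print("F1-Score: {}".format(f1_score(y_test, final_pred)))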

F1-Score: 0.9624060150375939
