# Hands-On Activity: Support Vector Machines (SVM) for Breast Cancer Classification

---

## Introduction:
Welcome to today's hands-on session! We'll be exploring the application of Support Vector Machines (SVM) for classifying breast cancer. By the end of this activity, you'll have a solid understanding of SVMs, data preprocessing, and model evaluation.

## Instructions:

### Step 1: Importing Libraries

Before we dive into the code, take a moment to look at the libraries we're importing. Why do you think each library is crucial for our task? What role might `train_test_split`, `StandardScaler`, and `SVC` play in this context?

In [43]:
# Importing required libraries
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


### Step 2: Load and Prepare the Data

In this step, we load the dataset and split it into features and target. Why is it essential to split the data in this way? What do you think are the features in our dataset, and what is the target variable?

In [44]:
# Load the Breast Cancer dataset into a Pandas DataFrame
cancer = datasets.load_breast_cancer()
data = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
data['target'] = cancer.target

# Split the data into features (X) and target (y)
X = data.drop('target', axis=1)
y = data['target']

### Step 3: Split the Data into Training and Testing Sets

Explore the purpose of splitting the data into training and testing sets. What does the test_size parameter represent, and why do we set a random_state?

In [45]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


### Step 4: Standardize the Data

We're standardizing our data here. Why is standardization important for SVMs? What does the fit_transform method do, and why do we only use transform on the test set?

In [46]:
X_test.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
204,12.47,18.6,81.09,481.9,0.09965,0.1058,0.08005,0.03821,0.1925,0.06373,...,14.97,24.64,96.05,677.9,0.1426,0.2378,0.2671,0.1015,0.3014,0.0875
70,18.94,21.31,123.6,1130.0,0.09009,0.1029,0.108,0.07951,0.1582,0.05461,...,24.86,26.58,165.9,1866.0,0.1193,0.2336,0.2687,0.1789,0.2551,0.06589
131,15.46,19.48,101.7,748.9,0.1092,0.1223,0.1466,0.08087,0.1931,0.05796,...,19.26,26.0,124.9,1156.0,0.1546,0.2394,0.3791,0.1514,0.2837,0.08019
431,12.4,17.68,81.47,467.8,0.1054,0.1316,0.07741,0.02799,0.1811,0.07102,...,12.88,22.91,89.61,515.8,0.145,0.2629,0.2403,0.0737,0.2556,0.09359
540,11.54,14.44,74.65,402.9,0.09984,0.112,0.06737,0.02594,0.1818,0.06782,...,12.26,19.68,78.78,457.8,0.1345,0.2118,0.1797,0.06918,0.2329,0.08134


In [47]:
# Standardize the data

numerical_columns = X_train.select_dtypes(include=['float64', 'int64']).columns

# Create a StandardScaler object
scaler = StandardScaler()

# Apply standard scaling to selected columns
X_train[numerical_columns] = scaler.fit_transform(X_train[numerical_columns])
X_test[numerical_columns] = scaler.fit_transform(X_test[numerical_columns])
# Display the scaled dataset
X_train


Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
68,-1.440753,-0.435319,-1.362085,-1.139118,0.780573,0.718921,2.823135,-0.119150,1.092662,2.458173,...,-1.232861,-0.476309,-1.247920,-0.973968,0.722894,1.186732,4.672828,0.932012,2.097242,1.886450
181,1.974096,1.733026,2.091672,1.851973,1.319843,3.426275,2.013112,2.665032,2.127004,1.558396,...,2.173314,1.311279,2.081617,2.137405,0.761928,3.265601,1.928621,2.698947,1.891161,2.497838
63,-1.399982,-1.249622,-1.345209,-1.109785,-1.332645,-0.307355,-0.365558,-0.696502,1.930333,0.954379,...,-1.295284,-1.040811,-1.245220,-0.999715,-1.438693,-0.548564,-0.644911,-0.970239,0.597602,0.057894
248,-0.981797,1.416222,-0.982587,-0.866944,0.059390,-0.596788,-0.820203,-0.845115,0.313264,0.074041,...,-0.829197,1.593530,-0.873572,-0.742947,0.796624,-0.729392,-0.774950,-0.809483,0.798928,-0.134497
60,-1.117700,-1.010259,-1.125002,-0.965942,1.269511,-0.439002,-0.983341,-0.930600,3.394436,0.950213,...,-1.085129,-1.334616,-1.117138,-0.896549,-0.174876,-0.995079,-1.209146,-1.354582,1.033544,-0.205732
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71,-1.480675,-1.066580,-1.362085,-1.157451,0.149987,0.944057,-0.035754,-0.514485,0.331474,3.755073,...,-1.352920,-1.628421,-1.336108,-1.045037,-0.469795,-0.059039,-0.627221,-1.016366,-1.032028,1.376025
106,-0.701497,-0.200650,-0.687880,-0.682204,1.327033,-0.036619,-0.229252,-0.353247,-0.036372,0.339253,...,-0.644011,0.614731,-0.647704,-0.626555,1.616328,0.085623,0.060743,0.116740,-0.156974,0.398365
270,0.048802,-0.555001,-0.065125,-0.061423,-2.261627,-1.466613,-1.028567,-1.105515,-1.103492,-1.249242,...,-0.275720,-0.806427,-0.379841,-0.339278,-1.989065,-1.307006,-1.127968,-1.239034,-0.708639,-1.271455
435,-0.038969,0.102073,-0.031374,-0.154780,0.737432,0.184701,0.298585,0.430059,-0.517123,0.372579,...,0.167478,0.868921,0.203878,-0.013556,1.291049,0.672020,0.632532,1.050012,0.434322,1.213362


### Step 5: Build an SVM Classifier

Now, we're building our SVM classifier. What does the choice of a linear kernel signify? What is the significance of the hyperparameter C in SVM, and why do we set it to 1?

#### Choice of Linear Kernel:
The choice of a linear kernel in an SVM signifies that the algorithm will use a linear decision boundary to separate the classes in the feature space. In simpler terms, it assumes that the relationship between features and the target variable is approximately linear. This is suitable when the classes can be effectively separated by a straight line.

#### Significance of Hyperparameter C in SVM:
The hyperparameter C in SVM is a regularization parameter that controls the trade-off between achieving a low training error and a low testing error. Here's what it signifies:

* Low C (e.g., C=0.1): Allows for a softer margin. The SVM classifier will be more tolerant of misclassifications during training, prioritizing a wider margin even if it means some training points are misclassified.

* High C (e.g., C=100): Enforces a hard margin. The SVM classifier aims to classify all training points correctly, even if it means having a narrower margin. It is less tolerant of misclassifications and may lead to overfitting.

#### Why Set C to 1?
Setting C to 1 strikes a balance between having a reasonably wide margin and minimizing misclassifications during training. It's often considered a default or starting point. This choice is useful when you don't have strong prior knowledge about the data, and you want the SVM to find a balance between generalization and fitting the training data.

In practice, the optimal value for C depends on the specific dataset and the problem at hand. It's common to use techniques like grid search or cross-validation to find the best value for C based on the performance of the model on a validation set.

In [49]:
# Build an SVM classifier
svm_classifier = SVC(C=1, kernel='linear')
svm_classifier.fit(X_train, y_train)


### Step 6: Make Predictions and Evaluate the Model

In the final step, we're making predictions and evaluating our model. What does accuracy signify, and what insights can we gain from the classification report?

#### Accuracy:
Significance:
Accuracy is a straightforward measure that indicates the proportion of correctly classified instances out of the total instances. 

* **High Accuracy**: Indicates that a large portion of instances is correctly classified. However, it might not be sufficient on its own, especially if the classes are imbalanced.
* **Low Accuracy**: Suggests that the model is struggling to correctly classify instances. It's crucial to investigate further using other metrics, especially in scenarios where classes are imbalanced.

#### Classification Report:
* The classification report provides a more detailed overview, especially in scenarios where there are multiple classes.

##### Key Metrics:

* Precision:
  * Precision is the ratio of correctly predicted positive observations to the total predicted positives. It is a measure of the accuracy of positive predictions.
  * High Precision:
    * Indicates a low false positive rate, meaning that when the model predicts a positive outcome, it's likely to be correct.
 
* Recall (Sensitivity or True Positive Rate):
  * Recall is the ratio of correctly predicted positive observations to the all observations in actual class. It is a measure of the model's ability to capture all the relevant instances.
  * High Recall:
    * Suggests that the model is effectively capturing most of the positive instances present in the dataset.
 
* F1-Score:
  * F1-Score is the weighted average of Precision and Recall. It's a balance between precision and recall and is especially useful when the classes are imbalanced.
  * High F1-Score:
    * Indicates a good balance between precision and recall. This is often desirable, especially when there's an uneven distribution of classes.
 
* Support:
  * Support is the number of actual occurrences of the class in the specified dataset. It gives an idea of the number of instances that contribute to the metrics.
  * Imbalanced Classes:
    * In scenarios where classes are imbalanced, accuracy alone may not be informative. Focusing on precision, recall, and the F1-Score becomes crucial for a more nuanced evaluation.

* Macro Average:
  * The macro-average is a way to calculate the average performance across all classes without considering class imbalance. It treats all classes equally, giving each class the same weight.
  * Treats each class equally, providing a balanced view of the model's performance across different classes.

* Weighted Average:
  * The weighted-average considers class imbalance by assigning weights based on the number of instances in each class. It gives more importance to classes with more instances.
  * Accounts for class imbalance by giving more weight to classes with larger support.
  * Reflects the influence of classes with more instances on the overall performance.

* Macro vs. Weighted
  * Macro-average is suitable when you want each class to contribute equally, while weighted-average is useful when considering the impact of class imbalance.




In [50]:
# Make predictions on the test set
y_pred = svm_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}\n")
print(f"Confusion Matrix:\n {conf_matrix}\n")
print(f"Classification Report:\n {class_report}")


Accuracy: 0.9736842105263158

Confusion Matrix:
 [[40  3]
 [ 0 71]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.93      0.96        43
           1       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

