Q1. The mathematical formula for a linear Support Vector Machine (SVM) can be represented as follows:

Given a dataset of N data points with features xᵢ (i = 1 to N) and corresponding labels yᵢ (yᵢ = ±1 for binary classification), the linear SVM aims to find a hyperplane that maximizes the margin between the two classes while minimizing classification errors. The equation of this hyperplane can be represented as:

w^T * x + b = 0

Where:

w is the weight vector that defines the orientation of the hyperplane.
b is the bias or intercept term.
x represents the feature vector.

The decision boundary is the set of points where w^T * x + b = 0, and a point x is assigned to class +1 or -1 according to the sign of w^T * x + b. The goal is to find the w and b that best separate the data points into two classes.
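The classification rule above can be sketched in a few lines. This is a minimal illustration, assuming hypothetical values for w and b that were picked by hand rather than learned from data:

```python
import numpy as np

# Hypothetical hyperplane parameters, chosen only for illustration
w = np.array([2.0, -1.0])
b = -1.0

def decision_value(x):
    """Signed score w^T x + b; it is zero exactly on the hyperplane."""
    return float(w @ x + b)

def predict(x):
    """Class label +1 or -1, taken from the sign of the score."""
    return 1 if decision_value(x) >= 0 else -1

points = np.array([[2.0, 1.0], [0.0, 1.0], [1.0, 1.0]])
scores = [decision_value(p) for p in points]
labels = [predict(p) for p in points]
print(scores, labels)
```

The third point scores exactly zero, so it lies on the decision boundary; assigning it to +1 is a convention of this sketch.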

Q2. The objective function of a linear SVM is to maximize the margin while minimizing classification errors. Mathematically, it can be expressed as:

Minimize 1/2 * ||w||² subject to yᵢ * (w^T * xᵢ + b) ≥ 1 for all data points (for linearly separable data)

In this objective function:

"w" is the weight vector.
"b" is the bias term.
"xᵢ" is a data point.
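"yᵢ" is the corresponding class label.

The margin, the distance between the planes w^T * x + b = +1 and w^T * x + b = -1, equals 2/||w||, so minimizing 1/2 * ||w||² is equivalent to maximizing the margin. The goal is to find the values of "w" and "b" that satisfy every constraint with the smallest possible ||w||.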
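To make the link between the constraints, the objective, and the margin concrete, here is a small sketch on a toy dataset. The data and the two candidate (w, b) pairs are assumptions made up for the example; both pairs describe the same boundary x₁ = 1 at different scales.

```python
import numpy as np

# Toy, linearly separable data (hypothetical values for illustration)
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])

def is_feasible(w, b):
    """True if every point satisfies y_i * (w^T x_i + b) >= 1."""
    return bool(np.all(y * (X @ w + b) >= 1))

def objective(w):
    """The quantity being minimized: 1/2 * ||w||^2."""
    return 0.5 * float(w @ w)

def band_width(w):
    """Distance between the planes w^T x + b = +1 and -1: 2 / ||w||."""
    return 2.0 / float(np.linalg.norm(w))

# Same orientation, different scale
w_a, b_a = np.array([1.0, 0.0]), -1.0
w_b, b_b = np.array([2.0, 0.0]), -2.0

for name, w, b in [("a", w_a, b_a), ("b", w_b, b_b)]:
    print(name, is_feasible(w, b), objective(w), band_width(w))
```

Both candidates satisfy the constraints, but candidate "a" has the smaller objective and the wider band (2.0 versus 1.0). Minimizing ||w|| widens the band until it touches the closest points of each class.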

Q3. The kernel trick in SVM lets the model work in a higher-dimensional feature space, where data that is not linearly separable in the original space may become separable, without ever computing the coordinates of the data in that space. It does so by replacing each inner product (dot product) between data points with a kernel function K(xᵢ, xⱼ). The kernel function returns the inner product of the two points' images in the higher-dimensional space and can be read as a measure of similarity between them.

Common kernel functions include:

Linear Kernel: K(xᵢ, xⱼ) = xᵢ^T * xⱼ
Polynomial Kernel: K(xᵢ, xⱼ) = (γ * xᵢ^T * xⱼ + r)^d
Radial Basis Function (RBF) Kernel: K(xᵢ, xⱼ) = exp(-γ * ||xᵢ - xⱼ||²)

Here γ is a scale parameter, r is a constant offset, and d is the polynomial degree. The kernel trick allows SVM to find a hyperplane in this higher-dimensional space, even when the data is not linearly separable in the original space.
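The three kernels listed above can be written directly in NumPy. This is a minimal sketch; the function names and the default parameter values (γ, r, d) are assumptions chosen for the example, not values taken from any library.

```python
import math
import numpy as np

def linear_kernel(x, z):
    """K(x, z) = x^T z"""
    return float(x @ z)

def polynomial_kernel(x, z, gamma=1.0, r=1.0, d=2):
    """K(x, z) = (gamma * x^T z + r)^d"""
    return float((gamma * (x @ z) + r) ** d)

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    diff = x - z
    return math.exp(-gamma * float(diff @ diff))

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

k_lin = linear_kernel(x, z)       # 1*3 + 2*4 = 11
k_poly = polynomial_kernel(x, z)  # (11 + 1)^2 = 144
k_rbf = rbf_kernel(x, z)          # exp(-0.5 * 8) = exp(-4)
print(k_lin, k_poly, k_rbf)
```

Each function takes two points in the original space and returns a single number, which is all the SVM optimizer needs from the higher-dimensional space.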

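Q4. The support vectors in SVM are the data points that are closest to the hyperplane and play a crucial role in defining the decision boundary. In a hard-margin SVM they are the points that satisfy yᵢ * (w^T * xᵢ + b) = 1, that is, they lie exactly on the margin. In a soft-margin SVM, points that fall inside the margin or on the wrong side of the boundary are support vectors as well. The main roles of support vectors are: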

They help define the position and orientation of the decision boundary or hyperplane.
They have non-zero Lagrange multipliers (αᵢ) in the SVM dual problem, indicating their importance in the final solution.
They represent the most challenging data points that are closest to the decision boundary.

Example: Suppose you have a binary classification problem with two classes (A and B). The support vectors are the data points from class A and class B that are nearest to the decision boundary, together with any points that violate the margin when the classes overlap. These support vectors define the margin and determine the optimal hyperplane; removing any other training point and retraining would leave the hyperplane unchanged.
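A short scikit-learn sketch can make the role of support vectors visible. The six points below are hypothetical and linearly separable, and a very large C is used as an approximation of a hard margin (an assumption of this example). In scikit-learn, `support_` holds the indices of the support vectors and `dual_coef_` holds their non-zero yᵢ * αᵢ values.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical separable data: only (0, 0) and (2, 0) are closest to the gap
X = np.array([[0.0, 0.0], [-1.0, 1.0], [-1.0, -1.0],
              [2.0, 0.0], [3.0, 1.0], [3.0, -1.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates a hard margin on separable data
clf = SVC(kernel="linear", C=1e6).fit(X, y)

sv_indices = sorted(clf.support_.tolist())
sv_scores = clf.decision_function(clf.support_vectors_)
print("support vector indices:", sv_indices)
print("scores at support vectors:", sv_scores)
print("y_i * alpha_i:", clf.dual_coef_)

# Refit without a non-support vector (index 1): the hyperplane should not move
mask = np.ones(len(X), dtype=bool)
mask[1] = False
clf_reduced = SVC(kernel="linear", C=1e6).fit(X[mask], y[mask])
same_boundary = bool(
    np.allclose(clf.coef_, clf_reduced.coef_, atol=1e-3)
    and np.allclose(clf.intercept_, clf_reduced.intercept_, atol=1e-3)
)
print("boundary unchanged:", same_boundary)
```

The scores at the support vectors come out close to -1 and +1, which is the "exactly on the margin" condition, up to the solver's tolerance.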

Q5. Illustrating Hyperplane, Marginal plane, Soft margin, and Hard margin in SVM:

Hyperplane: The hyperplane is the decision boundary that separates two classes in a linear SVM. It is represented by the equation w^T * x + b = 0.

Marginal Plane: The marginal planes are the two hyperplanes parallel to the decision boundary, w^T * x + b = +1 and w^T * x + b = -1, that pass through the support vectors of each class. The decision boundary lies midway between them, and the distance between them, 2/||w||, is the margin.

Soft Margin: In a soft-margin SVM, some data points are allowed to fall inside the margin or be misclassified, which lets the model handle noisy or overlapping data. It introduces a parameter "C" to control the trade-off between maximizing the margin and minimizing classification errors. A smaller "C" allows for a wider margin but tolerates more misclassifications, while a larger "C" penalizes violations more heavily and narrows the margin.

Hard Margin: In a hard-margin SVM, no misclassification is allowed: every data point must be correctly classified and lie on or outside the margin. This is only possible for linearly separable data, so the hard-margin problem has no solution for real-world datasets with noise or overlapping classes.
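The effect of "C" on the margin can be observed with a small experiment. The sketch below assumes synthetic, overlapping two-class data generated with a fixed seed; the two C values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs (synthetic data, for illustration only)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(1.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

results = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / float(np.linalg.norm(clf.coef_))  # margin width 2 / ||w||
    n_sv = int(clf.n_support_.sum())                 # total support vectors
    results[C] = (margin, n_sv)
    print(f"C={C}: margin={margin:.2f}, support vectors={n_sv}")
```

With the small C the fitted margin is wider, and more points typically end up inside it as support vectors; the large C behaves more like a hard margin.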



Q6. Polynomial functions and kernel functions in machine learning are related in the context of kernel methods, including Support Vector Machines (SVM). A polynomial function can serve as a kernel function that implicitly maps data into a higher-dimensional space, where it may become linearly separable. The relationship can be summarized as follows:

Polynomial Kernel: The polynomial kernel is a specific kernel function used in SVM and other kernel methods. It takes the form K(xᵢ, xⱼ) = (γ * xᵢ^T * xⱼ + r)^d, where γ, r, and d are kernel parameters.

Kernel Trick: The kernel trick is a technique that allows SVM to operate in this higher-dimensional space without explicitly calculating the transformation of data points. It replaces the inner product of data points in the higher-dimensional space with the result of the kernel function, K(xᵢ, xⱼ).

Relationship: When you choose a polynomial kernel with a suitable degree (d), it can perform a polynomial transformation on the data. For example, a polynomial kernel with degree 2 can transform the data into a quadratic feature space, making it possible to find a quadratic decision boundary in the original feature space. The choice of γ and r in the kernel function further influences the behavior of the kernel.

In summary, polynomial kernels are a specific type of kernel function that can be used with SVM to transform data into higher-dimensional spaces, making them suitable for handling non-linearly separable data.
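The degree-2 case can be verified by hand: for 2-D inputs with γ = 1 and r = 1, the kernel value equals the ordinary dot product of explicitly expanded quadratic features. The feature map `phi` below is written out for this specific case only and is an illustration, not a general-purpose implementation.

```python
import numpy as np

def poly_kernel(x, z, gamma=1.0, r=1.0, d=2):
    """K(x, z) = (gamma * x^T z + r)^d, computed in the original space."""
    return (gamma * (x @ z) + r) ** d

def phi(x):
    """Explicit quadratic feature map for 2-D input (gamma=1, r=1, d=2)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

k_implicit = float(poly_kernel(x, z))   # (1*3 + 2*4 + 1)^2 = 144
k_explicit = float(phi(x) @ phi(z))     # dot product in the 6-D feature space
print(k_implicit, k_explicit)
```

Both routes give 144, but the kernel needs only a 2-D dot product, while the explicit route builds 6-D vectors first. That saving grows quickly with the input dimension and the degree.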

Q8. In Support Vector Regression (SVR), increasing the value of epsilon (ε) generally decreases the number of support vectors. Epsilon sets the width of the ε-insensitive tube around the regression function, and only training points that lie on or outside the tube become support vectors. A wider tube contains more of the training points, so fewer of them remain as support vectors and the model becomes sparser, at the cost of a coarser fit.

Epsilon defines a range within which errors are considered negligible and do not contribute to the loss function. Points outside this range are penalized based on their distance from the predicted values.

Besides ε, the number of support vectors also depends on the noise in the data, the choice of the kernel function, and the value of the C parameter, which balances the trade-off between fitting the training data and regularization. The effect of C on the count is data-dependent, so it is best checked empirically.
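How ε relates to the number of support vectors can be checked empirically. The sketch below assumes a synthetic noisy sine curve and three arbitrary ε values; the exact counts depend on the data and the seed.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Synthetic 1-D regression data: a sine curve plus Gaussian noise (std 0.1)
X = np.sort(rng.uniform(0.0, 5.0, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0.0, 0.1, size=80)

sv_counts = {}
for eps in (0.01, 0.1, 0.5):
    model = SVR(kernel="rbf", C=1.0, epsilon=eps).fit(X, y)
    sv_counts[eps] = len(model.support_)  # indices of the support vectors
    print(f"epsilon={eps}: {sv_counts[eps]} support vectors out of {len(X)}")
```

With a tube much narrower than the noise, nearly every point falls outside it and becomes a support vector; with a wide tube, most points sit inside it and drop out of the solution.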

Q9. In Support Vector Regression (SVR), the choice of kernel function, C parameter, epsilon parameter, and gamma parameter can significantly affect the model's performance:

Kernel Function: The kernel function determines how data is transformed into a higher-dimensional space. For non-linear data, choosing the appropriate kernel function is crucial. Common choices include linear, polynomial, and radial basis function (RBF) kernels. The choice of the kernel should be guided by the data's characteristics.

C Parameter: The C parameter controls the trade-off between fitting the training data and regularization. A smaller C value yields a flatter, more strongly regularized function that tolerates larger deviations from the training data. A larger C value penalizes deviations beyond ε more heavily and fits the training data more closely, which raises the risk of overfitting. The choice of C depends on the problem's tolerance for errors and the dataset's noise.

Epsilon Parameter: The epsilon (ε) parameter defines the width of the ε-insensitive tube. A smaller ε means that the model is less tolerant of errors and keeps more support vectors, while a larger ε ignores more small errors and yields a sparser, coarser model. The choice of ε depends on the desired level of flexibility in the model and the noise in the data.

Gamma Parameter: The gamma (γ) parameter is specific to some kernel functions, such as the RBF kernel. It controls the shape and scale of the kernel function. A smaller γ results in a wider, smoother kernel, while a larger γ leads to a narrower, more localized kernel. The choice of γ should be based on the data's characteristics, and tuning it correctly is essential for model performance.

The impact of these parameters on performance is problem-specific, and they often require tuning through techniques like cross-validation to find the best combination for your specific regression problem.
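A cross-validated search over C, ε, and γ might look like the following sketch. The synthetic data and the grid values are assumptions chosen so the example runs quickly; a real problem would need a grid suited to its own scale.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Synthetic 1-D regression data, for illustration only
X = rng.uniform(0.0, 5.0, size=(120, 1))
y = np.sin(X).ravel() + rng.normal(0.0, 0.1, size=120)

# Hypothetical grid; each combination is scored with 5-fold cross-validation
param_grid = {
    "C": [0.1, 1.0, 10.0],
    "epsilon": [0.01, 0.1, 0.5],
    "gamma": [0.1, 1.0, 10.0],
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated MSE:", -search.best_score_)
```

scikit-learn maximizes the score, which is why the MSE is passed as `neg_mean_squared_error` and negated again when printed.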

Q10. If your goal is to predict the actual price of a house as accurately as possible, using Mean Squared Error (MSE) as the evaluation metric would be more appropriate. MSE measures the average squared difference between the predicted values and the actual prices. Minimizing MSE results in the model providing predictions that are as close as possible to the actual prices, making it a suitable choice for regression tasks like predicting house prices.

R-squared (R²) measures the proportion of the variance in the dependent variable that is explained by the independent variables. While R-squared can be useful for understanding how well your features explain the variance, it may not be the best choice if the primary goal is to minimize prediction errors, as MSE directly quantifies prediction accuracy.

Q11. In a dataset with a significant number of outliers, Mean Absolute Error (MAE) would be the most appropriate regression metric to use with your SVM model. MAE is less sensitive to outliers compared to Mean Squared Error (MSE) and Root Mean Squared Error (RMSE).

MAE calculates the average absolute difference between the predicted values and the actual target values. Outliers can have a substantial impact on squared error metrics like MSE and RMSE because they square the differences. MAE, on the other hand, treats all errors equally and provides a robust measure of model performance in the presence of outliers.
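The difference in sensitivity is easy to see on a tiny example. The numbers below are made up for illustration: the same predictions are scored once against clean targets and once with a single extreme target value.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))

y_pred = np.array([11.0, 11.0, 12.0, 12.0, 13.0])
y_clean = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_outlier = y_clean.copy()
y_outlier[-1] = 112.0  # a single extreme target value

print("clean:   MAE =", mae(y_clean, y_pred), " MSE =", mse(y_clean, y_pred))
print("outlier: MAE =", mae(y_outlier, y_pred), " MSE =", mse(y_outlier, y_pred))
```

One outlier raises MAE from 1.0 to 20.6, but it raises MSE from 1.0 to 1961.0, because the error of 99 is squared before averaging.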

Q12. If both the Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) values are very close, it is typically recommended to choose RMSE as the evaluation metric for your SVM regression model. RMSE is the square root of MSE, so the two always rank models in the same order, but RMSE has the advantage of having the same units as the target variable, making it more interpretable. It can be read as the typical size of a prediction error (it equals the standard deviation of the errors when they average to zero) and is commonly used in regression tasks for practical understanding.

While MSE and RMSE provide similar information, RMSE is preferred in scenarios where you want to express the model's performance in the same units as the target variable, allowing for easier interpretation of prediction errors.
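The units argument can be shown with a few hypothetical house prices in dollars (the values are invented for the example):

```python
import numpy as np

# Hypothetical house prices in dollars
y_true = np.array([300_000.0, 450_000.0, 520_000.0])
y_pred = np.array([310_000.0, 430_000.0, 550_000.0])

mse = float(np.mean((y_true - y_pred) ** 2))  # units: dollars squared
rmse = float(np.sqrt(mse))                    # units: dollars

print(f"MSE  = {mse:.0f} (dollars squared)")
print(f"RMSE = {rmse:.2f} (dollars)")
```

The RMSE of roughly 21,600 dollars can be compared directly with the prices, whereas the MSE of about 4.7e8 has no intuitive scale.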

Q13. When comparing the performance of different SVM regression models using different kernels (linear, polynomial, and RBF), and the goal is to measure how well the model explains the variance in the target variable, the most appropriate metric is the coefficient of determination, often denoted as R-squared (R²).

R-squared measures the proportion of the variance in the target variable that is explained by the model. A higher R² value indicates that the model accounts for more variance in the data. It is a suitable metric for assessing how well each model captures the variation in the target variable, making it a good choice for this scenario.
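A comparison of this kind could be sketched as follows. The data is a synthetic noisy sine curve and every model uses scikit-learn's default SVR settings, both of which are assumptions of the example rather than a recommendation.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Synthetic non-linear regression data, for illustration only
X = rng.uniform(0.0, 6.0, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0.0, 0.1, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

r2_by_kernel = {}
for kernel in ("linear", "poly", "rbf"):
    model = SVR(kernel=kernel).fit(X_train, y_train)
    # R² on held-out data: 1.0 is a perfect fit, 0.0 matches predicting the mean
    r2_by_kernel[kernel] = float(r2_score(y_test, model.predict(X_test)))
    print(f"{kernel:>6}: R² = {r2_by_kernel[kernel]:.3f}")
```

Scoring on a held-out split matters here: R² computed on the training data would favor whichever kernel overfits most.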


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib_inline
import seaborn as sns

In [4]:
from scipy.sparse import load_npz
# Load a sparse matrix whose last column holds the class labels
X_y = load_npz("class_X_y.npz")

In [5]:
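# Split features (all but the last column) from labels (the last column)
X = X_y[:, :-1]
y = X_y[:, -1]
# Convert both from sparse to dense NumPy arrays
X = X.toarray()
y = y.toarray()  # dense column vector of shape (n_samples, 1)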

In [6]:
from sklearn.model_selection import train_test_split
# Hold out 33% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [7]:
from sklearn.svm import SVC

In [8]:
svc=SVC(kernel='linear')

In [9]:
# ravel() flattens the (n, 1) label column to 1-D, which SVC expects
svc.fit(X_train,y_train.ravel())


In [10]:
svc.coef_

array([[-0.05519804,  0.00593964, -0.00305198, ...,  0.00277915,
         0.02077447,  0.        ]])

In [11]:
y_pred=svc.predict(X_test)

In [12]:
y_pred

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [13]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.98      0.99      0.99      1024
           1       0.94      0.88      0.91       161

    accuracy                           0.98      1185
   macro avg       0.96      0.94      0.95      1185
weighted avg       0.98      0.98      0.98      1185

[[1015    9]
 [  19  142]]
0.9763713080168777
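The figures in the report above can be recomputed from the confusion matrix alone. In scikit-learn's layout the rows are true classes and the columns are predicted classes, so the sketch below reads the four cells and derives the class-1 metrics and the overall accuracy:

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions
cm = np.array([[1015, 9],
               [19, 142]])
tn, fp, fn, tp = cm.ravel()

accuracy = float((tp + tn) / cm.sum())
precision_1 = float(tp / (tp + fp))  # of the points predicted as 1, share truly 1
recall_1 = float(tp / (tp + fn))     # of the true class-1 points, share found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.4f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
```

The recall of 0.88 on class 1 is the weakest number: 19 of the 161 minority-class points are missed, which the 0.98 overall accuracy tends to hide because class 0 dominates the test set.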


## Hyperparameter tuning with SVC

In [20]:
from sklearn.model_selection import GridSearchCV
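# Note: gamma is ignored by the linear kernel, so the two gamma values
# below describe the same model and give identical cross-validation scores.
parameter={
    'C':[0.1],
    'gamma':[1,0.1],
    'kernel':['linear']
}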

In [21]:
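# 2-fold cross-validated grid search; verbose=3 prints the score of every fit
grid=GridSearchCV(SVC(),param_grid=parameter,cv=2,verbose=3)
# Flatten the (n, 1) label column to 1-D, as scikit-learn expects
y_train = y_train.ravel()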

In [22]:
grid.fit(X_train,y_train)

Fitting 2 folds for each of 2 candidates, totalling 4 fits
[CV 1/2] END .....C=0.1, gamma=1, kernel=linear;, score=0.956 total time= 2.3min
[CV 2/2] END .....C=0.1, gamma=1, kernel=linear;, score=0.950 total time= 1.6min
[CV 1/2] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.956 total time=  57.6s
[CV 2/2] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.950 total time= 2.0min


In [23]:
grid.best_params_

{'C': 0.1, 'gamma': 1, 'kernel': 'linear'}
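The two candidates in the log above score identically on both folds, which is consistent with gamma having no effect on a linear kernel. A quick way to confirm this is to fit two linear SVCs that differ only in gamma and compare their coefficients. The sketch uses small synthetic data (an assumption, since the notebook's dataset is not reproduced here) and hypothetical variable names.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Small synthetic two-class dataset, for illustration only
X_demo = np.vstack([rng.normal(-1.0, 1.0, size=(40, 3)),
                    rng.normal(1.0, 1.0, size=(40, 3))])
y_demo = np.array([0] * 40 + [1] * 40)

# Same kernel and C, different gamma
model_a = SVC(kernel="linear", C=0.1, gamma=1).fit(X_demo, y_demo)
model_b = SVC(kernel="linear", C=0.1, gamma=0.1).fit(X_demo, y_demo)

same_model = bool(
    np.allclose(model_a.coef_, model_b.coef_)
    and np.allclose(model_a.intercept_, model_b.intercept_)
)
print("identical hyperplanes:", same_model)
```

To make the search meaningful, gamma would be tuned together with a kernel that uses it (such as 'rbf'), and C would be given more than one value.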

In [24]:
y_pred=grid.predict(X_test)
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))


              precision    recall  f1-score   support

           0       0.98      0.99      0.99      1024
           1       0.96      0.84      0.90       161

    accuracy                           0.97      1185
   macro avg       0.97      0.92      0.94      1185
weighted avg       0.97      0.97      0.97      1185

[[1018    6]
 [  25  136]]
0.9738396624472574
