# Q1. What is the relationship between polynomial functions and kernel functions in machine learning algorithms?

Polynomial functions and kernel functions are closely related in machine learning algorithms. A kernel function is a mathematical function that takes two data points as input and outputs a measure of similarity between them. One of the most common types of kernel functions used in machine learning is the polynomial kernel function, which is defined as:

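K(x, y) = (x · y + c)^d

where x and y are the input feature vectors, x · y is their dot product, c ≥ 0 is a constant, and d is the degree of the polynomial. Scikit-learn uses the slightly more general form (gamma * (x · y) + coef0)^degree.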
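As a quick check of the formula, the following minimal sketch computes the polynomial kernel matrix by hand with NumPy and compares it with scikit-learn's `polynomial_kernel`. The sample points and the values of `c` and `d` are made up for illustration; `gamma=1.0` is passed explicitly so that both computations use the same formula.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

# Three illustrative 2-D points (made-up values).
X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
c, d = 1.0, 3

# Manual kernel matrix: (x . y + c)^d for every pair of rows.
K_manual = (X @ X.T + c) ** d

# scikit-learn computes (gamma * x . y + coef0)^degree.
K_sklearn = polynomial_kernel(X, X, degree=d, gamma=1.0, coef0=c)

print(K_manual)
print(np.allclose(K_manual, K_sklearn))
```

The two matrices agree only because `gamma` is fixed to 1 here; by default `polynomial_kernel` uses `gamma = 1 / n_features`.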

Polynomial functions are a class of mathematical functions that can be used to model complex relationships between input and output variables. In machine learning, polynomial functions can be used to model non-linear relationships between input features and output labels.

The polynomial kernel corresponds to an inner product in a higher-dimensional feature space whose coordinates are the monomials of the input features up to degree d (for c > 0). A linear model in that space is therefore a polynomial function of the original features. Because the kernel returns this inner product directly from x and y, the transformation never has to be computed explicitly. This is known as the kernel trick, and it allows linear models, such as Support Vector Machines (SVMs), to model non-linear relationships between the input features and output labels.
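The kernel trick can be illustrated with a small sketch for 2-D inputs, d = 2 and c = 1. The helper names `poly_kernel` and `phi` are hypothetical and exist only for this example; `phi` is the explicit degree-2 feature map whose inner product reproduces the kernel value.

```python
import numpy as np

def poly_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel evaluated directly in the input space."""
    return (np.dot(x, y) + c) ** d

def phi(x):
    """Explicit feature map for 2-D input, degree 2, c = 1."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1**2, x2**2, s * x1 * x2, s * x1, s * x2, 1.0])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

k_implicit = poly_kernel(x, y)          # no feature map needed
k_explicit = np.dot(phi(x), phi(y))     # inner product in 6-D space

print(k_implicit, k_explicit)
```

Both values equal 4 (up to floating-point rounding), but the kernel needs only one dot product in two dimensions, whereas the explicit route builds two six-dimensional vectors first.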

In summary, the polynomial kernel is the kernel function whose implicit feature space consists of polynomial terms of the inputs. Using it lets a linear algorithm fit a polynomial function of the original features without ever constructing those polynomial features explicitly.

# Q2. How can we implement an SVM with a polynomial kernel in Python using Scikit-learn?

To implement an SVM with a polynomial kernel in Python using Scikit-learn, we can follow these steps:

In [1]:
# Import the necessary modules:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler


In [2]:
# Load the iris dataset and split it into training and testing sets:

iris = datasets.load_iris()
X = iris.data[:, [0, 2]]  # select the first and third feature
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)


In [3]:
# Scale the features:

sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)


In [4]:
# Create an SVM model with a polynomial kernel and fit the training data:

svm = SVC(kernel='poly', degree=3, C=1.0, random_state=1)
svm.fit(X_train_std, y_train)


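Here, we set the kernel parameter to 'poly' to use a polynomial kernel, and we set the degree parameter to 3 to use a cubic polynomial. The C parameter controls the trade-off between maximizing the margin and minimizing the classification error on the training data. The remaining kernel parameters keep their defaults (gamma='scale', coef0=0.0), and random_state only has an effect when probability=True.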

In [6]:
# Evaluate the model on the test data:

y_pred = svm.predict(X_test_std)
print('Accuracy:', svm.score(X_test_std, y_test))


Accuracy: 0.8888888888888888


# Q3. How does increasing the value of epsilon affect the number of support vectors in SVR?

In Support Vector Regression (SVR), epsilon is a hyperparameter that sets the half-width of the epsilon-insensitive tube around the regression function. Training points whose residuals are smaller than epsilon lie inside the tube, incur no loss, and do not become support vectors. As epsilon increases, the tube becomes wider, more points fall inside it and are ignored by the model, and the number of support vectors generally decreases.

In other words, a larger epsilon value yields a sparser and usually smoother model that tolerates larger errors. However, it's important to note that increasing epsilon too much can lead to underfitting, because the model starts ignoring meaningful variation in the targets, while a very small epsilon turns almost every training point into a support vector. An appropriate value is best found through experimentation and cross-validation.
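The effect of epsilon on the number of support vectors can be checked empirically. The sketch below uses a synthetic noisy sine curve (an assumption made purely for illustration) and counts the support vectors of an RBF-kernel `SVR` for several epsilon values.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic 1-D regression problem: noisy sine curve (illustrative only).
rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 5, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

counts = {}
for eps in [0.01, 0.1, 0.5, 1.0]:
    model = SVR(kernel="rbf", C=1.0, epsilon=eps).fit(X, y)
    # support_ holds the indices of the training points used as support vectors.
    counts[eps] = len(model.support_)
    print(f"epsilon={eps:<4} support vectors: {counts[eps]}")
```

On data like this, the count should drop sharply as epsilon grows, because more points end up inside the tube.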

# Q4. How does the choice of kernel function, C parameter, epsilon parameter, and gamma parameter affect the performance of Support Vector Regression (SVR)? Can you explain how each parameter works and provide examples of when you might want to increase or decrease its value?

The performance of Support Vector Regression (SVR) is affected by several parameters such as the choice of kernel function, C parameter, epsilon parameter, and gamma parameter.

1. Choice of kernel function: The kernel function implicitly maps the input data into a higher-dimensional feature space, in which a linear regression function can capture non-linear relationships. The choice of kernel function depends on the data and the problem at hand. Some popular kernel functions include linear, polynomial, RBF (Radial Basis Function), and sigmoid. For example, a linear kernel can be useful when the target is approximately a linear function of the features, while a polynomial or RBF kernel can be useful when the relationship is non-linear.

2. C parameter: The C parameter controls the trade-off between keeping the regression function flat (strong regularization) and penalizing training points whose errors exceed epsilon. A smaller value of C tolerates more and larger deviations outside the tube and gives a simpler, smoother model, while a larger value of C penalizes those deviations heavily. Increasing the value of C will increase the complexity of the model and can result in a better fit to the training data, but may not generalize well to unseen data.

3. Epsilon parameter: The epsilon parameter defines the width of the tube around the regression function within which errors are not penalized. Training points inside the tube do not become support vectors. Increasing the value of epsilon leaves fewer training points as support vectors and can lead to a smoother, sparser regression function, but may result in less accurate predictions.

4. Gamma parameter: The gamma parameter applies to the RBF, polynomial, and sigmoid kernels and controls how far the influence of a single training point reaches. A smaller value of gamma will result in a smoother regression function, while a larger value of gamma will result in a more complex, more local fit. Increasing the value of gamma can result in overfitting and may not generalize well to unseen data.

The choice of these parameters can have a significant impact on the performance of the SVR model. It is essential to fine-tune the parameters to achieve the best possible results. Here are some scenarios when you might want to increase or decrease each parameter:

1. Choice of kernel function:
* Increase polynomial kernel degree when the relationship is non-linear and more complex.
* Use RBF kernel for data with non-linear relationships of unknown form.
* Use a linear kernel when the target is approximately linear in the features.

2. C parameter:
* Increase C when you want the model to fit the training data more closely.
* Decrease C when you want to reduce overfitting and prioritize a flatter, smoother function.

3. Epsilon parameter:
* Increase epsilon when the targets are noisy and you want a sparser model with fewer support vectors.
* Decrease epsilon when you want to reduce the tolerated error and prioritize accurate predictions.

4. Gamma parameter:
* Increase gamma when you want to fit the model to the training data more closely.
* Decrease gamma when you want to reduce overfitting and prioritize a smoother regression function.
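In practice, these parameters are usually tuned together rather than one at a time. The following is a minimal sketch of a joint search over C, epsilon, and gamma with `GridSearchCV`. The synthetic data, the grid values, and the choice of mean squared error as the score are all assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic noisy sine data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(120, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=120)

# Scaling lives inside the pipeline so it is refit on every CV fold.
pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
param_grid = {
    "svr__C": [0.1, 1, 10],
    "svr__epsilon": [0.01, 0.1, 0.5],
    "svr__gamma": [0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

print(search.best_params_)
print("CV MSE:", -search.best_score_)
```

The `svr__` prefix is how `make_pipeline` exposes the parameters of the `SVR` step to the search.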

# Q5. Assignment:

* Import the necessary libraries and load the dataset

* Split the dataset into training and testing sets

* Preprocess the data using any technique of your choice (e.g. scaling, normalization)

* Create an instance of the SVC classifier and train it on the training data

* Use the trained classifier to predict the labels of the testing data

* Evaluate the performance of the classifier using any metric of your choice (e.g. accuracy, precision, recall, F1-score)

* Tune the hyperparameters of the SVC classifier using GridSearchCV or RandomizedSearchCV to improve its performance

* Train the tuned classifier on the entire dataset

* Save the trained classifier to a file for future use.

#### Note :
You can use any dataset of your choice for this assignment, but make sure it is suitable for classification and has a sufficient number of features and samples.

In [7]:
# Import the necessary libraries and load the dataset

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [12]:
df = pd.read_csv('diabetes.csv')
df.head()


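|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |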


* Pregnancies: Number of times pregnant (integer)
* Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test (integer)
* BloodPressure: Diastolic blood pressure (mm Hg) (integer)
* SkinThickness: Triceps skin fold thickness (mm) (integer)
* Insulin: 2-Hour serum insulin (mu U/ml) (integer)
* BMI: Body mass index (weight in kg/(height in m)^2) (float)
* DiabetesPedigreeFunction: Diabetes pedigree function (a function which scores likelihood of diabetes based on family history) (float)
* Age: Age in years (integer)
* Outcome: Class variable (0 if non-diabetic, 1 if diabetic) (integer)

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [17]:
df.describe()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [18]:
df.head(2)

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |


In [24]:
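# Some columns contain physiologically impossible zeros: BMI, blood pressure, glucose,
# insulin, and skin thickness cannot be 0, so these zeros are really missing values.
# Replace the zeros with the column mean. Note that this mean is computed with the zeros
# still included and over the full dataset, before the train/test split.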
df['BMI'] = df['BMI'].replace(0,df['BMI'].mean())
df['BloodPressure'] = df['BloodPressure'].replace(0,df['BloodPressure'].mean())
df['Glucose'] = df['Glucose'].replace(0,df['Glucose'].mean())
df['Insulin'] = df['Insulin'].replace(0,df['Insulin'].mean())
df['SkinThickness'] = df['SkinThickness'].replace(0,df['SkinThickness'].mean())
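One possible refinement, sketched here under assumptions: mark the zeros as missing (`NaN`) first, so that they do not pull the replacement mean down. The small frame below reuses three columns from the five rows shown by `df.head()`; the names `toy`, `cols`, and `cleaned` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Three columns from the five rows shown by df.head() above.
toy = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
})
cols = ["Glucose", "SkinThickness", "Insulin"]

cleaned = toy.copy()
# Treat zeros as missing so they are excluded from the mean.
cleaned[cols] = cleaned[cols].replace(0, np.nan)
# DataFrame.mean() skips NaN by default, so only real measurements count.
cleaned[cols] = cleaned[cols].fillna(cleaned[cols].mean())

print("Mean with zeros included:", toy["Insulin"].mean())
print("Mean of real measurements:", cleaned["Insulin"].iloc[0])
print(cleaned)
```

For the insulin column, the mean with zeros included is 52.4, while the mean of the two real measurements is 131.0. In a full workflow the fill values would ideally be computed from the training split only, for example with `sklearn.impute.SimpleImputer` inside a pipeline.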

In [45]:
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

In [46]:
# Split the dataset into training and testing sets
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=10)

In [47]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [48]:
# Create an instance of the SVC classifier and train it on the training data

svc=SVC(kernel='linear')
svc.fit(X_train_scaled,y_train)

In [49]:
svc.coef_

array([[ 0.22092198,  0.96781284, -0.09334562,  0.01246615, -0.0839724 ,
         0.42980332,  0.12734569,  0.20048527]])

In [50]:
# Use the trained classifier to predict the labels of the testing data


## Prediction
y_pred=svc.predict(X_test_scaled)

In [51]:
y_pred

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

In [52]:
# Evaluate the performance of the classifier using any metric of your choice (e.g. accuracy, precision, recall, F1-score)

print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.74      0.89      0.81       121
           1       0.72      0.46      0.56        71

    accuracy                           0.73       192
   macro avg       0.73      0.68      0.69       192
weighted avg       0.73      0.73      0.72       192

[[108  13]
 [ 38  33]]
0.734375
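The figures in the report can be reproduced by hand from the confusion matrix, which helps when reading them. The sketch below takes the matrix printed above and recomputes accuracy, precision, recall, and F1 for class 1. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predictions.
cm = np.array([[108, 13],
               [38, 33]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()          # 141 / 192
precision_1 = tp / (tp + fp)             # 33 / 46
recall_1 = tp / (tp + fn)                # 33 / 71
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.6f}")
print(f"precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
```

The recall of 0.46 for class 1 means that more than half of the diabetic cases in the test set are missed, which the overall accuracy of 0.73 does not reveal on its own.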


In [60]:
# Tune the hyperparameters of the SVC classifier using GridSearchCV or RandomizedSearchCV to improve its performance

from sklearn.model_selection import GridSearchCV
 
# defining parameter range
param_grid = {'C': [0.1, 1, 10],
              'gamma': [1, 0.1, 0.01],
              'kernel':['linear', 'poly', 'rbf']
              }

In [61]:
grid=GridSearchCV(SVC(),param_grid=param_grid,refit=True,cv=2,verbose=1)

In [62]:
grid.fit(X_train_scaled,y_train)

Fitting 2 folds for each of 27 candidates, totalling 54 fits


In [63]:
grid.best_params_

{'C': 1, 'gamma': 1, 'kernel': 'linear'}
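A variation on this search, sketched under assumptions: put the scaler and the classifier in one `Pipeline` so that scaling is refit inside every cross-validation fold, and use a separate grid per kernel so that `gamma` is only searched where it has an effect. The synthetic dataset from `make_classification`, the F1 scoring, and `cv=5` are illustrative choices and not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with 8 features (illustrative only).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# gamma is ignored by the linear kernel, so it gets its own grid without gamma.
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf", "poly"], "svc__C": [0.1, 1, 10],
     "svc__gamma": [1, 0.1, 0.01]},
]

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_tr, y_tr)

print(search.best_params_)
print("Test F1:", search.score(X_te, y_te))
```

Splitting the grid this way reduces the number of candidates from 27 to 21, since the nine linear-kernel combinations that differ only in `gamma` collapse into three.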

In [64]:
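# Refit the classifier with the best parameters found by the grid search.
# It is fitted on the training split here, not the entire dataset, so the test metrics below stay valid.
# gamma is ignored by the linear kernel, and C=1 is the default, so this model is
# identical to the one trained earlier and the results below are unchanged.
svc=SVC(kernel='linear',C=1, gamma=1)
svc.fit(X_train_scaled,y_train)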

In [65]:
svc.coef_

array([[ 0.22092198,  0.96781284, -0.09334562,  0.01246615, -0.0839724 ,
         0.42980332,  0.12734569,  0.20048527]])

In [66]:
y_pred=svc.predict(X_test_scaled)

In [67]:
y_pred

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

In [68]:
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.74      0.89      0.81       121
           1       0.72      0.46      0.56        71

    accuracy                           0.73       192
   macro avg       0.73      0.68      0.69       192
weighted avg       0.73      0.73      0.72       192

[[108  13]
 [ 38  33]]
0.734375


In [69]:
# Save the trained classifier to a file for future use.

import pickle

with open('svc_model.pkl', 'wb') as file:
    pickle.dump(svc, file)

In [70]:
with open('standard_scaler.pkl', 'wb') as file:
    pickle.dump(scaler, file)
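To use the saved objects later, both the scaler and the model need to be loaded and applied in the same order as during training. The following self-contained sketch shows a save and load round trip on synthetic data. The names `scaler_demo`, `model_demo`, and `bundle.pkl` are hypothetical, and bundling both objects in one dictionary is just one possible layout.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (illustrative only).
X_demo, y_demo = make_classification(n_samples=100, n_features=8, random_state=0)
scaler_demo = StandardScaler().fit(X_demo)
model_demo = SVC(kernel="linear").fit(scaler_demo.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "bundle.pkl"
    # Save scaler and model together so they cannot get out of sync.
    with open(path, "wb") as f:
        pickle.dump({"scaler": scaler_demo, "model": model_demo}, f)
    # Load them back, as a later session would.
    with open(path, "rb") as f:
        bundle = pickle.load(f)

# New samples must be scaled with the loaded scaler before prediction.
restored = bundle["model"].predict(bundle["scaler"].transform(X_demo[:5]))
original = model_demo.predict(scaler_demo.transform(X_demo[:5]))
print(restored, original)
```

Pickle files should only be loaded from trusted sources, and they are generally tied to the scikit-learn version that created them.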