## Assignment no 61 (Support Vector Regressor) (7.4.23)

### Q1. What is the relationship between polynomial functions and kernel functions in machine learning algorithms?

**Ans-**
- A kernel function computes the similarity (inner product) between two samples as if they had been mapped into a higher-dimensional feature space, without constructing that mapping explicitly. This "kernel trick" lets algorithms such as support vector machines (SVMs) learn non-linear decision boundaries while working only with pairwise similarities.
- The polynomial kernel is one such function. Its feature space consists of polynomial combinations of the original variables up to a chosen degree, so similarity depends not only on the individual input features but also on their products (interaction terms). In short, a polynomial function defines the feature space, and the polynomial kernel is the shortcut for computing inner products in it.
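
The relationship can be checked numerically. The sketch below assumes a degree-2 kernel of the form `(x . z + 1) ** 2` on 2-D inputs; the helper names `poly_kernel` and `phi` are illustrative and not part of any library. It compares the kernel value with the dot product of the explicitly expanded polynomial features. Note that Scikit-learn's `'poly'` kernel additionally scales the dot product by `gamma` and uses `coef0=0` by default.

```python
import numpy as np

def poly_kernel(x, z, degree=2, coef0=1.0):
    """Polynomial kernel K(x, z) = (x . z + coef0) ** degree."""
    return (np.dot(x, z) + coef0) ** degree

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (matches coef0 = 1)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_implicit = poly_kernel(x, z)        # evaluated in the original 2-D space
k_explicit = np.dot(phi(x), phi(z))   # dot product in the 6-D feature space

print(k_implicit, k_explicit)         # both are 4 (up to floating-point rounding)
```

The kernel needs only one 2-D dot product, while the explicit route builds six features per sample; the gap grows quickly with the degree and the number of input variables.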

### Q2. How can we implement an SVM with a polynomial kernel in Python using Scikit-learn?

**Ans-**
- We can implement an SVM with a polynomial kernel in Python using the Scikit-learn library. Here is an example code snippet that shows how to do this:

from sklearn.svm import SVC

classifier = SVC(kernel='poly', degree=4)

classifier.fit(x_train, y_train)

- In this example, we create an instance of the SVC class from Scikit-learn, specifying that we want to use a polynomial kernel by setting the kernel parameter to 'poly'. We also set the degree of the polynomial kernel using the degree parameter. Then, we fit the model to our training data using the fit method.
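
The snippet above assumes `x_train` and `y_train` already exist. A self-contained sketch is given here on synthetic data; the concentric-circles dataset, the split sizes, and the choice of `degree=2` with `coef0=1` are assumptions made for illustration only.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, non-linearly separable data: two concentric circles
X, y = make_circles(n_samples=300, noise=0.05, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A degree-2 polynomial kernel can separate the circles through squared terms
classifier = SVC(kernel='poly', degree=2, coef0=1)
classifier.fit(X_train, y_train)

test_accuracy = classifier.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.2f}")
```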

### Q3. How does increasing the value of epsilon affect the number of support vectors in SVR?

**Ans-**
In Support Vector Regression (SVR), epsilon sets the half-width of the epsilon-insensitive tube around the regression function: training points whose prediction error is smaller than epsilon incur no loss. Only points that lie on or outside the tube become support vectors. A larger epsilon widens the tube, so more points fall strictly inside it and are ignored. As a result, increasing epsilon generally reduces the number of support vectors, giving a sparser but coarser model.
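
The effect can be observed by counting support vectors for several epsilon values. This is a minimal sketch on a synthetic noisy sine curve; the data, `C=10`, and the epsilon values are assumptions chosen for illustration, and exact counts will differ on other data.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic data: a sine curve with Gaussian noise (standard deviation 0.1)
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

n_support = {}
for epsilon in [0.01, 0.1, 0.5]:
    model = SVR(kernel='rbf', C=10.0, epsilon=epsilon).fit(X, y)
    # support_ holds the indices of the training points used as support vectors
    n_support[epsilon] = len(model.support_)
    print(f"epsilon={epsilon}: {n_support[epsilon]} support vectors out of {len(X)}")
```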

### Q4. How does the choice of kernel function, C parameter, epsilon parameter, and gamma parameter affect the performance of Support Vector Regression (SVR)? Can you explain how each parameter works and provide examples of when you might want to increase or decrease its value?

**Ans-**

The choice of kernel function, C parameter, epsilon parameter, and gamma parameter can all affect the performance of Support Vector Regression (SVR). Here's how each parameter works and some examples of when you might want to increase or decrease its value:

- **Kernel function**: The kernel function defines the similarity measure between samples and, implicitly, the higher-dimensional feature space in which the regression function is fitted. Common kernel functions include linear, polynomial, radial basis function (RBF), and sigmoid. The choice of kernel can have a significant impact on the performance of the SVR model. For example, if the target depends roughly linearly on the features, a linear kernel may work well. If the relationship is non-linear, a kernel such as RBF or polynomial may be more appropriate.

- **C parameter**: The C parameter controls the trade-off between model complexity and the degree to which deviations larger than epsilon are tolerated. A larger value of C will result in a more complex model that fits the training data more closely, while a smaller value of C will result in a simpler model that allows for more deviations. If the model is overfitting, you may want to decrease the value of C. If the model is underfitting, you may want to increase the value of C.

- **Epsilon parameter**: The epsilon parameter sets the width of the epsilon-insensitive tube around the regression function; errors smaller than epsilon are not penalized. A larger value of epsilon will result in a wider tube and fewer support vectors, while a smaller value of epsilon will result in a narrower tube and more support vectors. If the model is overfitting (for example, by following noise in the targets), you may want to increase the value of epsilon. If the model is underfitting, you may want to decrease the value of epsilon.

- **Gamma parameter**: The gamma parameter applies only to the RBF, polynomial, and sigmoid kernels; the linear kernel ignores it. For the RBF kernel it controls how far the influence of a single training sample reaches, and therefore the flexibility of the fitted function. A larger value of gamma makes each sample's influence more local and the fitted function more flexible, while a smaller value of gamma gives a smoother, less flexible fit. If the model is overfitting, you may want to decrease the value of gamma. If the model is underfitting, you may want to increase the value of gamma.
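
In practice these four settings are usually tuned together with cross-validation rather than one at a time. The following is a sketch under stated assumptions: the data is a synthetic noisy sine curve, the grid values are arbitrary starting points, and the step name `svr` comes from `make_pipeline`'s automatic naming.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic non-linear regression problem
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(150, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=150)

pipeline = make_pipeline(StandardScaler(), SVR())

# gamma is searched only for the RBF kernel, since the linear kernel ignores it
param_grid = [
    {'svr__kernel': ['linear'],
     'svr__C': [0.1, 1, 10],
     'svr__epsilon': [0.01, 0.1, 0.5]},
    {'svr__kernel': ['rbf'],
     'svr__C': [0.1, 1, 10],
     'svr__epsilon': [0.01, 0.1, 0.5],
     'svr__gamma': [0.1, 1, 10]},
]

search = GridSearchCV(pipeline, param_grid, cv=5,
                      scoring='neg_mean_squared_error')
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated MSE: {-search.best_score_:.3f}")
```

On this curved target the search would be expected to prefer the RBF kernel; on data with a roughly linear relationship the linear candidates may win instead.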

### Q5. Assignment:
- Import the necessary libraries and load the dataset.
- Split the dataset into training and testing sets.
- Preprocess the data using any technique of your choice (e.g. scaling, normalization).
- Create an instance of the SVC classifier and train it on the training data.
- Use the trained classifier to predict the labels of the testing data.
- Evaluate the performance of the classifier using any metric of your choice (e.g. accuracy, precision, recall, F1-score).
- Tune the hyperparameters of the SVC classifier using GridSearchCV or RandomizedSearchCV to improve its performance.
- Train the tuned classifier on the entire dataset.
- Save the trained classifier to a file for future use.

In [1]:
# Import the necessary libraries and load the dataset.

import pandas as pd
import numpy as np
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [2]:
df_tips = sns.load_dataset("tips")
df_tips.head()

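| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |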


In [3]:
df_tips.describe()

| | total_bill | tip | size |
|---|---|---|---|
| count | 244.0 | 244.0 | 244.0 |
| mean | 19.785943 | 2.998279 | 2.569672 |
| std | 8.902412 | 1.383638 | 0.9511 |
| min | 3.07 | 1.0 | 1.0 |
| 25% | 13.3475 | 2.0 | 2.0 |
| 50% | 17.795 | 2.9 | 2.0 |
| 75% | 24.1275 | 3.5625 | 3.0 |
| max | 50.81 | 10.0 | 6.0 |


In [4]:
df_tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


In [5]:
df_tips.isnull().sum()

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64

In [6]:
df_tips.shape

(244, 7)

In [7]:
df_tips.duplicated().sum()

1

In [8]:
df_tips['sex'].unique()

['Female', 'Male']
Categories (2, object): ['Male', 'Female']

In [9]:
df_tips['smoker'].unique()

['No', 'Yes']
Categories (2, object): ['Yes', 'No']

In [10]:
df_tips['day'].unique()

['Sun', 'Sat', 'Thur', 'Fri']
Categories (4, object): ['Thur', 'Fri', 'Sat', 'Sun']

In [11]:
df_tips['time'].unique()

['Dinner', 'Lunch']
Categories (2, object): ['Lunch', 'Dinner']

In [12]:
# Features: every column except the target; target: the meal time (Lunch/Dinner)
X = df_tips.drop('time', axis=1)
y = df_tips['time']

In [13]:
# Convert categorical variables to dummy variables
X = pd.get_dummies(X)

X.head()

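| | total_bill | tip | size | sex_Male | sex_Female | smoker_Yes | smoker_No | day_Thur | day_Fri | day_Sat | day_Sun |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 10.34 | 1.66 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 21.01 | 3.50 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 23.68 | 3.31 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 24.59 | 3.61 | 4 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |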


In [14]:
X.shape

(244, 11)

In [15]:
y.head()

0    Dinner
1    Dinner
2    Dinner
3    Dinner
4    Dinner
Name: time, dtype: category
Categories (2, object): ['Lunch', 'Dinner']

In [16]:
y.shape

(244,)

In [17]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [18]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((195, 11), (49, 11), (195,), (49,))
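
The assignment lists a preprocessing step, which the cells here do not carry out. Scaling matters for SVMs because kernels such as RBF depend on distances, so a feature with a large numeric range can dominate the others. The sketch below shows one possible way to do it, using a `StandardScaler` inside a pipeline so that the scaler is fitted on the training split only. It runs on synthetic stand-in data rather than the tips dataset, and all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not the tips dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_demo[:, 0] *= 1000  # put one feature on a much larger scale than the rest

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# The pipeline fits the scaler on the training data only, then trains the SVC
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_tr, y_tr)
scaled_accuracy = scaled_svc.score(X_te, y_te)

# Same model without scaling, for comparison
unscaled_accuracy = SVC().fit(X_tr, y_tr).score(X_te, y_te)

print(f"with scaling:    {scaled_accuracy:.2f}")
print(f"without scaling: {unscaled_accuracy:.2f}")
```

With the notebook's own variables, the same pipeline could be fitted on `X_train` and `y_train`; whether it changes the results reported here has not been verified.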

In [19]:
from sklearn.svm import SVC

svc = SVC()

In [20]:
svc.fit(X_train, y_train)

In [21]:
y_pred = svc.predict(X_test)
y_pred

array(['Dinner', 'Dinner', 'Dinner', 'Dinner', 'Dinner', 'Dinner',
       'Dinner', 'Dinner', 'Dinner', 'Dinner', 'Dinner', 'Dinner',
       'Dinner', 'Dinner', 'Dinner', 'Dinner', 'Dinner', 'Dinner',
       'Dinner', 'Dinner', 'Dinner', 'Dinner', 'Dinner', 'Dinner',
       'Dinner', 'Dinner', 'Dinner', 'Dinner', 'Dinner', 'Dinner',
       'Dinner', 'Dinner', 'Dinner', 'Dinner', 'Dinner', 'Dinner',
       'Dinner', 'Dinner', 'Dinner', 'Dinner', 'Dinner', 'Dinner',
       'Dinner', 'Dinner', 'Dinner', 'Dinner', 'Dinner', 'Dinner',
       'Dinner'], dtype=object)

In [22]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [23]:
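# Use the built-in round(): accuracy_score may return a plain Python float, which has no .round() method
round(accuracy_score(y_test, y_pred), 2)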

0.71

In [24]:
# Rows are true labels and columns are predicted labels, in the order (Dinner, Lunch)
confusion_matrix(y_test, y_pred)

array([[35,  0],
       [14,  0]])

In [25]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

      Dinner       0.71      1.00      0.83        35
       Lunch       0.00      0.00      0.00        14

    accuracy                           0.71        49
   macro avg       0.36      0.50      0.42        49
weighted avg       0.51      0.71      0.60        49



## Hyperparameter Tuning with SVC

In [26]:
from sklearn.model_selection import GridSearchCV

In [27]:
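# Define the hyperparameter grid.
# Note: gamma is ignored by the linear kernel, so only C distinguishes these candidates.
parameters = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['linear']
              }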

In [28]:
svc_cv = GridSearchCV(estimator=svc, param_grid=parameters, cv=3)
svc_cv

In [29]:
svc_cv.fit(X_train, y_train)

In [30]:
svc_cv.best_params_

{'C': 0.1, 'gamma': 1, 'kernel': 'linear'}
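
Because the linear kernel does not use gamma, the five gamma values in this grid give identical scores for each C, and the reported `gamma: 1` is simply the first tied candidate. A sketch of a grid in which gamma is searched only where it has an effect is shown below. It uses synthetic stand-in data and hypothetical names (`pipe`, `grid`, `X_demo`), and is one possible setup rather than the notebook's method.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not the tips dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

# A list of dicts lets each kernel have its own set of relevant parameters
param_grid = [
    {'svc__kernel': ['linear'],
     'svc__C': [0.1, 1, 10, 100]},
    {'svc__kernel': ['rbf'],
     'svc__C': [0.1, 1, 10, 100],
     'svc__gamma': [1, 0.1, 0.01, 0.001]},
]

grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_demo, y_demo)

n_candidates = len(grid.cv_results_['params'])  # 4 linear + 16 RBF combinations
print(n_candidates)
print(grid.best_params_)
```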

In [31]:
svc = SVC(C=0.1, gamma=1, kernel='linear')

In [32]:
svc.fit(X_train, y_train)

In [33]:
y_pred = svc.predict(X_test)
y_pred

array(['Dinner', 'Dinner', 'Dinner', 'Dinner', 'Dinner', 'Dinner',
       'Dinner', 'Dinner', 'Dinner', 'Dinner', 'Dinner', 'Dinner',
       'Dinner', 'Lunch', 'Dinner', 'Dinner', 'Lunch', 'Dinner', 'Lunch',
       'Dinner', 'Dinner', 'Dinner', 'Lunch', 'Dinner', 'Lunch', 'Lunch',
       'Dinner', 'Lunch', 'Dinner', 'Dinner', 'Dinner', 'Lunch', 'Dinner',
       'Dinner', 'Dinner', 'Dinner', 'Dinner', 'Lunch', 'Dinner',
       'Dinner', 'Dinner', 'Lunch', 'Dinner', 'Dinner', 'Dinner',
       'Dinner', 'Dinner', 'Dinner', 'Dinner'], dtype=object)

In [34]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [35]:
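# Use the built-in round(): accuracy_score may return a plain Python float, which has no .round() method
round(accuracy_score(y_test, y_pred), 2)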

0.88

In [36]:
# Rows are true labels and columns are predicted labels, in the order (Dinner, Lunch)
confusion_matrix(y_test, y_pred)

array([[34,  1],
       [ 5,  9]])

In [37]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

      Dinner       0.87      0.97      0.92        35
       Lunch       0.90      0.64      0.75        14

    accuracy                           0.88        49
   macro avg       0.89      0.81      0.83        49
weighted avg       0.88      0.88      0.87        49



In [38]:
import pickle

# Save the trained classifier; the with-block closes the file automatically
with open('svc_classifier.pkl', 'wb') as file:
    pickle.dump(svc, file)
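
The assignment also asks for the tuned classifier to be trained on the entire dataset before saving, whereas the cell above saves a model fitted on the training split. The sketch below shows that final step together with loading the file back. It uses synthetic stand-in data and a temporary file path, both assumptions for illustration; with the notebook's variables the fit would be on `X` and `y`.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the full dataset (an assumption, not the tips data)
X_all, y_all = make_classification(n_samples=200, n_features=6, random_state=0)

# Refit the tuned configuration on all available data
final_model = SVC(C=0.1, kernel='linear')
final_model.fit(X_all, y_all)

# Save to a temporary location (hypothetical file name)
model_path = os.path.join(tempfile.gettempdir(), 'svc_classifier_demo.pkl')
with open(model_path, 'wb') as f:
    pickle.dump(final_model, f)

# Load the classifier back and confirm it predicts the same labels
with open(model_path, 'rb') as f:
    loaded_model = pickle.load(f)

same_predictions = bool(
    (loaded_model.predict(X_all) == final_model.predict(X_all)).all()
)
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and a saved model is generally expected to be loaded with the same Scikit-learn version that created it.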