# <font color = 'paleyellow'><b>Support Vector Machines</b></font>

Support vector machine for binary classification
A support vector machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression problems.

However, it is mostly used for classification. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate.
Then, we perform classification by finding the hyperplane that best separates the two classes.
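
To make the hyperplane idea concrete, here is a minimal sketch on a made-up four-point dataset (`X_toy` and `y_toy` are hypothetical and are not part of the insurance or iris data used later). It fits a linear `SVC` and checks that the predicted class is simply the sign of `w . x + b`.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two points per class in a 2-dimensional feature space
X_toy = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 4.0], [5.0, 4.0]])
y_toy = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear").fit(X_toy, y_toy)

# For a linear kernel the separating hyperplane is w . x + b = 0
w = clf.coef_[0]
b = clf.intercept_[0]

# The side of the hyperplane a point falls on decides its class
scores = X_toy @ w + b
manual_pred = (scores > 0).astype(int)
print(manual_pred, clf.predict(X_toy))
```

With a non-linear kernel the same idea applies, but the hyperplane lives in the kernel's feature space, so `coef_` is not available.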

---

## <font color = 'paleyellow'><b>Support vector Regression</b></font>

Support vector regression (SVR) is the regression variant of the support vector machine and supports both linear and non-linear regression. In ordinary regression we try to minimise the error between the predicted and the actual values.

In SVR we instead try to keep the error within a certain threshold, called the margin of tolerance (epsilon). The fitted function is a hyperplane (in the kernel's feature space) surrounded by a tube that extends epsilon on either side, and points that fall inside the tube contribute no loss. The support vectors are the data points that lie on or outside the boundary of the tube, and they alone determine the position and orientation of the hyperplane. Unlike in classification, the goal is not to separate two classes: it is to find the flattest function that keeps as many points as possible inside the tube while penalising the points that fall outside it.
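
The tolerance threshold described above is usually formalised as the epsilon-insensitive loss. The sketch below is only an illustration of that loss on made-up numbers; the helper name `epsilon_insensitive_loss` is hypothetical and is not a scikit-learn function.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Errors within +/- epsilon cost nothing; beyond that the cost grows linearly."""
    residual = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.maximum(0.0, residual - epsilon)

# Made-up targets and predictions with absolute errors of 0.5, 1.0 and 3.0
y_true_demo = np.array([10.0, 10.0, 10.0])
y_pred_demo = np.array([10.5, 11.0, 13.0])

losses = epsilon_insensitive_loss(y_true_demo, y_pred_demo, epsilon=1.0)
print(losses)  # [0. 0. 2.]
```

Only the third prediction is penalised, because it is the only one whose error exceeds the tolerance of 1.0.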

In [4]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.svm import SVR

In [6]:
# Raw string so that the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r'D:\Python work\Machine learning revision\Classification\insurance.csv')

In [7]:
df.sample(5)

,age,sex,bmi,children,smoker,region,charges
788,29,male,22.515,3,no,northeast,5209.57885
1168,32,male,35.2,2,no,southwest,4670.64
893,47,male,38.94,2,yes,southeast,44202.6536
1127,35,female,35.86,2,no,southeast,5836.5204
709,36,female,27.74,0,no,northeast,5469.0066


In [8]:
df.shape

(1338, 7)

In [9]:
df.describe()

,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [11]:
from sklearn.preprocessing import LabelEncoder

In [12]:
le = LabelEncoder()

In [13]:
df['sex'] = le.fit_transform(df['sex'])
df['smoker'] = le.fit_transform(df['smoker'])

In [17]:
df['region'] = le.fit_transform(df['region'])
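
A note on the encoding above: `LabelEncoder` maps the regions to the integers 0 to 3, which gives them an arbitrary order that a distance-based model such as an SVM may treat as meaningful. One common alternative is one-hot encoding. The sketch below uses a small made-up frame (`demo` is hypothetical, not the insurance data) to show the idea with `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical mini-frame with a categorical 'region' column
demo = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# One indicator column per region, so no artificial ordering is introduced
encoded = pd.get_dummies(demo, columns=["region"], dtype=int)
print(encoded.columns.tolist())
```

For the two-valued columns (`sex`, `smoker`) label encoding and one-hot encoding carry the same information, so the difference only matters for `region`.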

In [18]:
df.sample(5)

,age,sex,bmi,children,smoker,region,charges
866,18,1,37.29,0,0,2,1141.4451
389,24,0,30.21,3,0,1,4618.0799
616,56,0,28.595,0,0,0,11658.11505
1192,58,0,32.395,1,0,0,13019.16105
1016,19,0,24.605,1,0,1,2709.24395


In [19]:
df.region.unique()

array([3, 2, 1, 0])

In [21]:
X = df.drop('charges', axis=1)
y = df['charges']

In [22]:
X.shape, y.shape

((1338, 6), (1338,))

In [23]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=2)

In [35]:
svr = SVR(kernel='rbf')

In [36]:
svr.fit(Xtrain, ytrain)

In [37]:
ypred = svr.predict(Xtest)

In [38]:
# R^2 (coefficient of determination) on the test set; for a regressor, score() is not an accuracy
svr.score(Xtest, ytest)

-0.09373369145895727

In [39]:
from sklearn.metrics import mean_squared_error, r2_score
mean_squared_error(ytest, ypred)

176789510.50744787

In [40]:
r2_score(ytest, ypred)

-0.09373369145895727

### <font color = 'red'><b>I will improve the model later</b></font>
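
One likely reason for the negative R^2 is that SVR is sensitive to scale: the features are on very different scales and the target is in the tens of thousands, while the default `C` and `epsilon` are tuned for values of order one. The sketch below illustrates this on synthetic data that only loosely imitates the insurance columns (the data, the coefficients and the names `X_syn`, `y_syn` are assumptions, not the real file), so the numbers it prints say nothing about the actual dataset.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the insurance data (NOT the real file)
rng = np.random.default_rng(0)
n = 500
age = rng.uniform(18, 64, n)
bmi = rng.normal(30, 6, n)
smoker = (rng.random(n) < 0.2).astype(float)
X_syn = np.column_stack([age, bmi, smoker])
y_syn = 250 * age + 300 * (bmi - 30) + 20000 * smoker + rng.normal(0, 1000, n)

Xtr, Xte, ytr, yte = train_test_split(X_syn, y_syn, test_size=0.3, random_state=2)

# Baseline: SVR on raw features and a raw target
raw_r2 = SVR(kernel="rbf").fit(Xtr, ytr).score(Xte, yte)

# Same model, but features and target are standardised before fitting
scaled_model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    transformer=StandardScaler(),
)
scaled_r2 = scaled_model.fit(Xtr, ytr).score(Xte, yte)

print(f"raw R^2: {raw_r2:.3f}, scaled R^2: {scaled_r2:.3f}")
```

`TransformedTargetRegressor` converts predictions back to the original units, so both scores are comparable. On the real data, tuning `C`, `epsilon` and `gamma` (for example with `GridSearchCV`) would be a natural next step.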

---

## <font color = 'paleyellow'><b>Support vector classifier</b></font>

In [None]:
# A support vector classifier is, at its core, a binary classifier. scikit-learn's SVC handles
# more than two classes (such as the three iris species) by combining binary classifiers one-vs-one.
# It uses the concept of a margin: the distance between the hyperplane and the nearest data point of either class.
# The goal is to choose the hyperplane with the largest possible margin; the nearest points are the support vectors.
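
As a rough illustration of the margin and the support vectors, the sketch below restricts iris to two classes so the problem is truly binary (this restriction, and the names `X_bin` and `y_bin`, are choices made for the example only). For a linear kernel the margin width is `2 / ||w||`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris_demo = load_iris()

# Keep only classes 0 and 1 and the first two features
mask = iris_demo.target < 2
X_bin, y_bin = iris_demo.data[mask, :2], iris_demo.target[mask]

clf_bin = SVC(kernel="linear", C=1.0).fit(X_bin, y_bin)

# Support vectors: the training points that lie on or inside the margin
print("support vectors per class:", clf_bin.n_support_)

# For a linear kernel, the margin width follows from the weight vector
w_bin = clf_bin.coef_[0]
margin_width = 2.0 / np.linalg.norm(w_bin)
print("margin width:", margin_width)
```

A smaller `C` tolerates more points inside the margin, which typically widens it and increases the number of support vectors.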

In [None]:
from sklearn.datasets import load_iris
from sklearn.svm import SVC

In [None]:
iris = load_iris()

In [None]:
X, y = iris.data[:, :2], iris.target
# taking only the first two features for simplicity

In [None]:
# splitting the data into train and test sets
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=42)

In [None]:
svc = SVC(kernel='linear').fit(Xtrain, ytrain)

In [None]:
ypred = svc.predict(Xtest)

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(ytest, ypred)
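
Accuracy is a single number, so it can hide which classes get confused with each other. The sketch below repeats the same split and model in a self-contained form (variable names such as `Xtr_i` are chosen for the example) and prints the confusion matrix. The share of predictions on its diagonal equals the accuracy.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)
X_iris = X_iris[:, :2]  # first two features, as above

Xtr_i, Xte_i, ytr_i, yte_i = train_test_split(X_iris, y_iris, random_state=42)
pred_i = SVC(kernel="linear").fit(Xtr_i, ytr_i).predict(Xte_i)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(yte_i, pred_i)
print(cm)

# Correct predictions sit on the diagonal
diag_share = np.trace(cm) / cm.sum()
print(diag_share, accuracy_score(yte_i, pred_i))
```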

In [None]:
def plot_decision_boundary(X, y, model):
    plt.figure(figsize=(10, 6))
    plt.scatter(X[:, 0], X[:, 1], c = y, s = 30, cmap = 'viridis')
    
    x_min, x_max = X[:, 0].min() -1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() -1, X[:, 1].max() + 1
    
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
    
    # np.c_ stacks the flattened meshgrid coordinates column-wise into a 2D array
    # in which each row is one point of the feature space. The predictions are
    # then reshaped back to the grid shape so they can be drawn as filled contours.
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    
    plt.contourf(xx, yy, Z, alpha=0.3, cmap = 'viridis')
    plt.xlabel('Sepal Length')
    plt.ylabel('Sepal Width')
    plt.title('SVC on Iris')
    plt.show()    
    

In [None]:
plot_decision_boundary(Xtrain, ytrain, svc)