#Objective: To understand support vector machines for multi-class classification and regression problems.

##Multiclass classification dataset:
This is the Glass Identification Data Set from UCI. It contains 10 attributes, including the id. The response is the glass type (7 discrete values).
###Attribute Information:
1.	Id number: 1 to 214 (removed from CSV file)
2.	RI: refractive index
3.	Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)
4.	Mg: Magnesium
5.	Al: Aluminum
6.	Si: Silicon
7.	K: Potassium
8.	Ca: Calcium
9.	Ba: Barium
10.	Fe: Iron
### Target class
Type of glass: (class attribute)
-- 1 building_windows_float_processed
-- 2 building_windows_non_float_processed
-- 3 vehicle_windows_float_processed
-- 4 vehicle_windows_non_float_processed (none in this database)
-- 5 containers
-- 6 tableware
-- 7 headlamps

##Regression dataset:
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.


###Data fields
1.	MSSubClass: The building class
2.	MSZoning: The general zoning classification
3.	LotFrontage: Linear feet of street connected to property
4.	LotArea: Lot size in square feet
5.	Street: Type of road access
6.	Alley: Type of alley access
7.	LotShape: General shape of property
8.	LandContour: Flatness of the property
9.	Utilities: Type of utilities available
10.	LotConfig: Lot configuration
11.	LandSlope: Slope of property
12.	Neighborhood: Physical locations within Ames city limits
13.	Condition1: Proximity to main road or railroad
14.	Condition2: Proximity to main road or railroad (if a second is present)
15.	BldgType: Type of dwelling
16.	HouseStyle: Style of dwelling
17.	OverallQual: Overall material and finish quality
18.	OverallCond: Overall condition rating
19.	YearBuilt: Original construction date
20.	YearRemodAdd: Remodel date
21.	RoofStyle: Type of roof
22.	RoofMatl: Roof material
23.	Exterior1st: Exterior covering on house
24.	Exterior2nd: Exterior covering on house (if more than one material)
25.	MasVnrType: Masonry veneer type
26.	MasVnrArea: Masonry veneer area in square feet
27.	ExterQual: Exterior material quality
28.	ExterCond: Present condition of the material on the exterior
29.	Foundation: Type of foundation
30.	BsmtQual: Height of the basement
31.	BsmtCond: General condition of the basement
32.	BsmtExposure: Walkout or garden level basement walls
33.	BsmtFinType1: Quality of basement finished area
34.	BsmtFinSF1: Type 1 finished square feet
35.	BsmtFinType2: Quality of second finished area (if present)
36.	BsmtFinSF2: Type 2 finished square feet
37.	BsmtUnfSF: Unfinished square feet of basement area
38.	TotalBsmtSF: Total square feet of basement area
39.	Heating: Type of heating
40.	HeatingQC: Heating quality and condition
41.	CentralAir: Central air conditioning
42.	Electrical: Electrical system
43.	1stFlrSF: First Floor square feet
44.	2ndFlrSF: Second floor square feet
45.	LowQualFinSF: Low quality finished square feet (all floors)
46.	GrLivArea: Above grade (ground) living area square feet
47.	BsmtFullBath: Basement full bathrooms
48.	BsmtHalfBath: Basement half bathrooms
49.	FullBath: Full bathrooms above grade
50.	HalfBath: Half baths above grade
51.	BedroomAbvGr: Number of bedrooms above basement level
52.	KitchenAbvGr: Number of kitchens
53.	KitchenQual: Kitchen quality
54.	TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
55.	Functional: Home functionality rating
56.	Fireplaces: Number of fireplaces
57.	FireplaceQu: Fireplace quality
58.	GarageType: Garage location
59.	GarageYrBlt: Year garage was built
60.	GarageFinish: Interior finish of the garage
61.	GarageCars: Size of garage in car capacity
62.	GarageArea: Size of garage in square feet
63.	GarageQual: Garage quality
64.	GarageCond: Garage condition
65.	PavedDrive: Paved driveway
66.	WoodDeckSF: Wood deck area in square feet
67.	OpenPorchSF: Open porch area in square feet
68.	EnclosedPorch: Enclosed porch area in square feet
69.	3SsnPorch: Three season porch area in square feet
70.	ScreenPorch: Screen porch area in square feet
71.	PoolArea: Pool area in square feet
72.	PoolQC: Pool quality
73.	Fence: Fence quality
74.	MiscFeature: Miscellaneous feature not covered in other categories
75.	MiscVal: Dollar value of miscellaneous feature
76.	MoSold: Month Sold
77.	YrSold: Year Sold
78.	SaleType: Type of sale
79.	SaleCondition: Condition of sale

###Target:
SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.


Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data


# Task 1: Multi-class Support vector machine (SVM)
1.	Load multi-class dataset
2.	Apply pre-processing techniques
3.	Divide dataset into training and testing sets (fraction of your choice)
4.	Build multi-class SVM model (use sklearn)
5.	Evaluate precision and recall
6.	Play with hyper-parameters and find best combination


# Task 2: Support vector regression (SVR)
1.	Load regression dataset
2.	Apply pre-processing techniques
3.	Divide dataset into training and testing sets (fraction of your choice)
4.	Build SVR model (use sklearn)
5.	Evaluate root mean square error
6.	Play with hyper-parameters and find best combination


# Task 3: Play with various SVM kernels such as polynomial, rbf, sigmoid tanh, etc.

###For more details: 
https://scikit-learn.org/stable/modules/svm.html
https://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html




## Task 1: Multi-class Support vector machine (SVM) 

In [1]:
# Load the libraries
from sklearn.svm import SVC,SVR
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder,MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score,classification_report,mean_squared_error

In [3]:
# Load the dataset 
data = pd.read_csv('data/glass.csv')
data.head()

Unnamed: 0,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
0,1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0.0,0.0,1
1,1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0.0,0.0,1
2,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.0,0.0,1
3,1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.0,0.0,1
4,1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.0,0.0,1


In [4]:
data.describe()

Unnamed: 0,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
count,214.0,214.0,214.0,214.0,214.0,214.0,214.0,214.0,214.0,214.0
mean,1.518365,13.40785,2.684533,1.444907,72.650935,0.497056,8.956963,0.175047,0.057009,2.780374
std,0.003037,0.816604,1.442408,0.49927,0.774546,0.652192,1.423153,0.497219,0.097439,2.103739
min,1.51115,10.73,0.0,0.29,69.81,0.0,5.43,0.0,0.0,1.0
25%,1.516523,12.9075,2.115,1.19,72.28,0.1225,8.24,0.0,0.0,1.0
50%,1.51768,13.3,3.48,1.36,72.79,0.555,8.6,0.0,0.0,2.0
75%,1.519157,13.825,3.6,1.63,73.0875,0.61,9.1725,0.0,0.1,3.0
max,1.53393,17.38,4.49,3.5,75.41,6.21,16.19,3.15,0.51,7.0


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   RI      214 non-null    float64
 1   Na      214 non-null    float64
 2   Mg      214 non-null    float64
 3   Al      214 non-null    float64
 4   Si      214 non-null    float64
 5   K       214 non-null    float64
 6   Ca      214 non-null    float64
 7   Ba      214 non-null    float64
 8   Fe      214 non-null    float64
 9   Type    214 non-null    int64  
dtypes: float64(9), int64(1)
memory usage: 16.8 KB


In [6]:
data.isna().sum()

RI      0
Na      0
Mg      0
Al      0
Si      0
K       0
Ca      0
Ba      0
Fe      0
Type    0
dtype: int64

In [9]:
# Preprocessing
# Encoding categorical variables (if any)
# Feature Scaling
# Filling missing values (if any)

In [7]:
X = data.drop(columns=['Type'])
y = data['Type']

In [8]:
# sanity check
print("X shape : ", X.shape)
print("y shape : ", y.shape)

X shape :  (214, 9)
y shape :  (214,)


In [11]:
# scaling the values
cols = X.columns
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

In [13]:
X = pd.DataFrame(X, columns = cols)
X.head()

Unnamed: 0,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe
0,0.432836,0.437594,1.0,0.252336,0.351786,0.009662,0.30855,0.0,0.0
1,0.283582,0.475188,0.801782,0.333333,0.521429,0.077295,0.223048,0.0,0.0
2,0.220808,0.421053,0.790646,0.389408,0.567857,0.062802,0.218401,0.0,0.0
3,0.285777,0.372932,0.821826,0.311526,0.5,0.091787,0.259294,0.0,0.0
4,0.275241,0.381955,0.806236,0.29595,0.583929,0.088567,0.245353,0.0,0.0
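
Note that the cells above fit `MinMaxScaler` on all 214 rows before the train/test split, so the test rows contribute to the minimum and maximum used for scaling. One possible alternative is to split first and let a pipeline fit the scaler on the training rows only. The following is a minimal sketch under assumptions: it uses scikit-learn's built-in wine dataset as a stand-in for `glass.csv`, and the variable names (`X_demo`, `clf`, and so on) are illustrative, not part of the original notebook.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Stand-in multi-class data (assumption: replaces data/glass.csv so the example is self-contained)
X_demo, y_demo = load_wine(return_X_y=True)

# Split first, so the scaler never sees the test rows
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The pipeline fits MinMaxScaler on the training rows only and then
# reuses that min/max for anything passed to predict() or score()
clf = make_pipeline(MinMaxScaler(), SVC())
clf.fit(Xd_train, yd_train)

test_accuracy = clf.score(Xd_test, yd_test)
print("Test accuracy:", test_accuracy)
```

The same pipeline object can be passed to cross-validation utilities, which then refit the scaler inside each fold.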


In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
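
Some glass types have very few samples, so a purely random split can leave a rare class under-represented in one of the two sets. Passing `stratify` to `train_test_split` keeps the class proportions the same in both sets. The sketch below uses made-up labels (an assumption, not the glass data) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels (assumption: made-up counts, not the glass data)
y_demo = np.array([1] * 70 + [2] * 20 + [3] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy single feature

# stratify=y_demo keeps each class's share the same in both splits
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

classes, test_counts = np.unique(ys_test, return_counts=True)
print(dict(zip(classes.tolist(), test_counts.tolist())))
```

For the notebook's own split, the equivalent change would be adding `stratify=y` to the existing call.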

In [16]:
# sanity check
print("X train shape : ", X_train.shape)
print("y train shape : ", y_train.shape)
print("X test shape : ", X_test.shape)
print("y test shape : ", y_test.shape)

X train shape :  (149, 9)
y train shape :  (149,)
X test shape :  (65, 9)
y test shape :  (65,)


In [17]:
# Build SVM model 
model=SVC()
model.fit(X_train,y_train)

SVC()
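
A support vector classifier is a binary method at its core. scikit-learn's `SVC` handles more than two classes by training one classifier per pair of classes (one-vs-one); the `decision_function_shape` parameter only changes how the scores are reported. The sketch below illustrates this on synthetic 4-class data (an assumption; the data and variable names are not from the notebook).

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic 4-class problem (assumption: stand-in data for illustration)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

ovr_view = SVC(decision_function_shape="ovr").fit(X_demo, y_demo)
ovo_view = SVC(decision_function_shape="ovo").fit(X_demo, y_demo)

ovr_shape = ovr_view.decision_function(X_demo).shape
ovo_shape = ovo_view.decision_function(X_demo).shape
print(ovr_shape)  # one column per class
print(ovo_shape)  # one column per pair of classes: 4 * 3 / 2 = 6
```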

In [18]:
# Evaluate the built model on the training and test sets
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)
print('Training Accuracy : ',accuracy_score(y_train, y_pred_train))
print('Testing Accuracy : ',accuracy_score(y_test, y_pred_test))

Training Accuracy :  0.7181208053691275
Testing Accuracy :  0.6461538461538462


In [20]:
# Per-class precision, recall and F1 on the training and test sets

print(classification_report(y_train, y_pred_train))
print(classification_report(y_test, y_pred_test))

              precision    recall  f1-score   support

           1       0.62      0.86      0.72        51
           2       0.73      0.75      0.74        53
           3       0.00      0.00      0.00        13
           5       1.00      0.57      0.73         7
           6       1.00      0.33      0.50         6
           7       1.00      0.89      0.94        19

    accuracy                           0.72       149
   macro avg       0.72      0.57      0.61       149
weighted avg       0.69      0.72      0.69       149

              precision    recall  f1-score   support

           1       0.61      0.74      0.67        19
           2       0.57      0.74      0.64        23
           3       0.00      0.00      0.00         4
           5       1.00      0.33      0.50         6
           6       0.00      0.00      0.00         3
           7       0.90      0.90      0.90        10

    accuracy                           0.65        65
   macro avg       0.51
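
In the reports above, class 3 has precision and recall of 0.00 on both sets, so the default `SVC()` gets no sample of that class right. Task 1 also asks for a hyper-parameter search, which the cells above do not include. The following is a minimal sketch of one way to do it, assuming `GridSearchCV` with a macro-averaged F1 score so that small classes count as much as large ones. The wine dataset stands in for `glass.csv`, and the grid values are illustrative guesses rather than tuned settings.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Stand-in multi-class data (assumption: replaces data/glass.csv)
X_demo, y_demo = load_wine(return_X_y=True)
Xg_train, Xg_test, yg_train, yg_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

pipe = Pipeline([("scale", MinMaxScaler()), ("svc", SVC())])

# Illustrative grid (assumption: values picked for demonstration only)
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", 0.1, 1],
    "svc__kernel": ["rbf", "poly"],
}

# 5-fold cross-validation on the training set; the test set is untouched until the end
search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=5)
search.fit(Xg_train, yg_train)

yg_pred = search.predict(Xg_test)
macro_precision = precision_score(yg_test, yg_pred, average="macro", zero_division=0)
macro_recall = recall_score(yg_test, yg_pred, average="macro", zero_division=0)

print("Best parameters:", search.best_params_)
print("Macro precision:", macro_precision)
print("Macro recall:", macro_recall)
```

`zero_division=0` makes the score 0 for a class that is never predicted, instead of emitting a warning.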

## Task 2: Implement support vector regression (SVR)


In [65]:
# Load training and testing datasets
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
data = pd.concat([train, test])  # DataFrame.append was removed in pandas 2.0
data.drop(columns=['Id'], inplace=True)

In [66]:
obj_cols = data.select_dtypes(include='object').columns
data[obj_cols] = data[obj_cols].astype('str')

In [67]:
# Apply pre-processing techniques
# Apply feature selection techniques of your choice to reduce the feature set

# label encoding the categorical values
le = LabelEncoder()

for i in train.select_dtypes(include='object').columns:
    data[i] = le.fit_transform(data[i])
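
`LabelEncoder` maps each category to an integer in sorted order, which gives nominal features such as `Neighborhood` or `MSZoning` an arbitrary numeric ordering that a kernel will treat as real distances. One common alternative is one-hot encoding. The sketch below shows it with `pd.get_dummies` on a tiny hypothetical frame (the rows are illustrative, not taken from `train.csv`).

```python
import pandas as pd

# Tiny hypothetical frame (assumption: illustrative values only)
demo = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
    "MSZoning": ["RL", "RM", "RL"],
})

# Columns to encode are listed explicitly for the sketch
categorical_cols = ["Street", "MSZoning"]

# One 0/1 indicator column per category; numeric columns pass through unchanged
encoded = pd.get_dummies(demo, columns=categorical_cols, dtype=int)
print(encoded.columns.tolist())
```

With 43 categorical columns in the Ames data, one-hot encoding widens the feature matrix considerably, so it is a trade-off rather than a drop-in improvement.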

In [68]:
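# filling nan values by mean
# note: null_cols also contains SalePrice (NaN for the rows from test.csv),
# so the missing targets are replaced by the mean price as well
null_cols = data.columns[data.isna().any()].tolist()

imp = SimpleImputer()
data[null_cols] = imp.fit_transform(data[null_cols])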
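
Because `test.csv` has no `SalePrice`, the combined frame holds NaN targets for those rows, and the imputer above fills them with the mean price. The `data.describe()` output further down reflects this: the median and 75th percentile of `SalePrice` both equal the mean, 180921.19589. A minimal sketch of one way to avoid imputing the target, using small hypothetical frames in place of the two CSV files (all values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-ins for train.csv and test.csv (assumption: made-up values)
train_demo = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0],
    "LotArea": [8450, 9600, 11250],
    "SalePrice": [208500.0, 181500.0, 223500.0],
})
test_demo = pd.DataFrame({
    "LotFrontage": [80.0, np.nan],
    "LotArea": [11622, 14267],
})

feature_cols = ["LotFrontage", "LotArea"]

# Fit the imputer on the labeled training features only; SalePrice is never imputed
imputer = SimpleImputer(strategy="mean")
X_labeled = pd.DataFrame(
    imputer.fit_transform(train_demo[feature_cols]), columns=feature_cols
)
y_labeled = train_demo["SalePrice"]

# Unlabeled rows reuse the training means and are kept apart, for prediction only
X_unlabeled = pd.DataFrame(
    imputer.transform(test_demo[feature_cols]), columns=feature_cols
)

print(X_labeled["LotFrontage"].tolist())
print(X_unlabeled["LotFrontage"].tolist())
```

Only `X_labeled` and `y_labeled` would then go into `train_test_split`.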

In [69]:
cols = list(data.columns)
print(cols)

['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fen

In [70]:
data.describe()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
count,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,...,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0
mean,57.137718,3.03049,69.305795,10168.11408,0.995889,1.891059,1.947585,2.776978,0.001713,3.055841,...,2.251799,2.993148,3.493662,3.923604,50.825968,6.213087,2007.792737,7.491607,3.779034,180921.19589
std,42.517628,0.662386,21.312345,7886.996359,0.063996,0.423503,1.409721,0.704391,0.05551,1.604472,...,35.663946,0.128073,1.091376,0.405566,567.402211,2.714762,1.314964,1.593719,1.078241,56174.332503
min,20.0,0.0,21.0,1300.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,0.0,0.0,34900.0
25%,20.0,3.0,60.0,7478.0,1.0,2.0,0.0,3.0,0.0,2.0,...,0.0,3.0,4.0,4.0,0.0,4.0,2007.0,8.0,4.0,163000.0
50%,50.0,3.0,69.305795,9453.0,1.0,2.0,3.0,3.0,0.0,4.0,...,0.0,3.0,4.0,4.0,0.0,6.0,2008.0,8.0,4.0,180921.19589
75%,70.0,3.0,78.0,11570.0,1.0,2.0,3.0,3.0,0.0,4.0,...,0.0,3.0,4.0,4.0,0.0,8.0,2009.0,8.0,4.0,180921.19589
max,190.0,5.0,313.0,215245.0,1.0,2.0,3.0,3.0,2.0,4.0,...,800.0,3.0,4.0,4.0,17000.0,12.0,2010.0,9.0,5.0,755000.0


In [72]:
# scaling the numerical values
scaler = MinMaxScaler()
X = data.drop(columns=['SalePrice'])
X = scaler.fit_transform(X)

# note: this rebuilds `data` from the unscaled frame; the scaled array X is not used afterwards
data = pd.DataFrame(data, columns = cols)
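
In the cell above, `scaler.fit_transform(X)` returns a new NumPy array, while `data` itself is left unchanged. The `data.head()` output that follows therefore still shows raw values such as `LotArea` 8450, and the later split works on unscaled features. A minimal sketch of carrying the scaled values forward, on a small hypothetical frame (illustrative values and names, not the notebook's data):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical frame (assumption: illustrative values only)
demo = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "YrSold": [2008, 2007, 2006],
    "SalePrice": [208500.0, 181500.0, 223500.0],
})

feature_cols = [c for c in demo.columns if c != "SalePrice"]

# Wrap the scaled array in a DataFrame with the original column names and index
scaled_features = pd.DataFrame(
    MinMaxScaler().fit_transform(demo[feature_cols]),
    columns=feature_cols,
    index=demo.index,
)

# Re-attach the untouched target so the frame can be split as before
scaled = pd.concat([scaled_features, demo["SalePrice"]], axis=1)
print(scaled)
```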

In [73]:
data.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,60,3,65.0,8450,1,2,3,3,0,4,...,0,3,4,4,0,2,2008,8,4,208500.0
1,20,3,80.0,9600,1,2,3,3,0,2,...,0,3,4,4,0,5,2007,8,4,181500.0
2,60,3,68.0,11250,1,2,0,3,0,4,...,0,3,4,4,0,9,2008,8,4,223500.0
3,70,3,60.0,9550,1,2,0,3,0,0,...,0,3,4,4,0,2,2006,8,0,140000.0
4,60,3,84.0,14260,1,2,0,3,0,2,...,0,3,4,4,0,12,2008,8,4,250000.0


In [74]:
X_train, X_test, y_train, y_test = train_test_split(data.drop('SalePrice',axis=1), data['SalePrice'], test_size=0.3, random_state=42)

In [75]:
# Train SVR model
model=SVR()
model.fit(X_train,y_train)

SVR()

In [76]:
# Evaluate training and testing root mean square error
print('Training RMSE : ', mean_squared_error(y_train, model.predict(X_train)) ** 0.5)
print('Testing RMSE : ', mean_squared_error(y_test, model.predict(X_test)) ** 0.5)

Training RMSE :  54224.110865502094
Testing RMSE :  60448.57044189207
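
The training error above (about 54,224) is close to the standard deviation of `SalePrice` reported by `data.describe()` (about 56,174), which is roughly what a near-constant prediction would give. A plausible cause is that the default `SVR()` (`C=1.0`, `epsilon=0.1`) is being fit to unscaled features and a target in the hundreds of thousands. The sketch below shows one way to scale both sides, assuming scikit-learn's `TransformedTargetRegressor`. It runs on synthetic data shifted to a price-like range (an assumption, not the Ames data), so its numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVR

# Synthetic regression data shifted to a price-like range (assumption: not the Ames data)
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
y_demo = 180000 + 500 * y_demo

Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Baseline: default SVR on the raw target
plain = SVR().fit(Xr_train, yr_train)

# Features scaled inside a pipeline, target standardized and mapped back automatically
scaled_model = TransformedTargetRegressor(
    regressor=make_pipeline(MinMaxScaler(), SVR()),
    transformer=StandardScaler(),
)
scaled_model.fit(Xr_train, yr_train)

rmse_plain = rmse(yr_test, plain.predict(Xr_test))
rmse_scaled = rmse(yr_test, scaled_model.predict(Xr_test))
print("RMSE, default SVR on raw target:", rmse_plain)
print("RMSE, scaled features and target:", rmse_scaled)
```

Predictions from `scaled_model` come back in the original units, so the RMSE values are directly comparable.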



## Task 3: Play with various SVM kernels such as polynomial, rbf, sigmoid tanh, etc.


In [77]:
#Play with various SVM kernels such as polynomial, rbf, sigmoid tanh, etc.

In [78]:
model=SVR(kernel='poly')
model.fit(X_train,y_train)
print('Training RMSE : ', mean_squared_error(y_train, model.predict(X_train)) ** 0.5)
print('Testing RMSE : ', mean_squared_error(y_test, model.predict(X_test)) ** 0.5)

Training RMSE :  54120.00486564621
Testing RMSE :  60104.22059250853


In [79]:
model=SVR(kernel='rbf')
model.fit(X_train,y_train)
print('Training RMSE : ', mean_squared_error(y_train, model.predict(X_train)) ** 0.5)
print('Testing RMSE : ', mean_squared_error(y_test, model.predict(X_test)) ** 0.5)

Training RMSE :  54224.110865502094
Testing RMSE :  60448.57044189207


In [80]:
model=SVR(kernel='sigmoid')
model.fit(X_train,y_train)
print('Training RMSE : ', mean_squared_error(y_train, model.predict(X_train)) ** 0.5)
print('Testing RMSE : ', mean_squared_error(y_test, model.predict(X_test)) ** 0.5)

Training RMSE :  54222.51545124184
Testing RMSE :  60447.11169364456
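
The three kernels above give almost the same error, which suggests the kernel choice has little effect in this setup. A kernel comparison tends to be more informative when every kernel runs in the same scaled pipeline and is scored by cross-validation. The following is a minimal sketch on synthetic data; the data, the `C=10.0` setting, and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data (assumption: stand-in for the housing features)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

kernel_rmse = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    # Same preprocessing and C for every kernel, so only the kernel differs
    model = make_pipeline(StandardScaler(), SVR(kernel=kernel, C=10.0))
    scores = cross_val_score(
        model, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
    )
    # The scorer returns negative RMSE, so flip the sign
    kernel_rmse[kernel] = float(-scores.mean())

# Print kernels from lowest to highest cross-validated RMSE
for kernel, value in sorted(kernel_rmse.items(), key=lambda item: item[1]):
    print(f"{kernel:8s} RMSE = {value:.2f}")
```

The same loop works for classification by swapping `SVR` for `SVC` and choosing a classification scorer such as `"f1_macro"`.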
