<a href="https://colab.research.google.com/github/shounakk05/Hands-On-ML-Journey/blob/main/Chapter-02/Exercise_01_SVM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this exercise I try out an SVM regressor (SVR), experimenting with its hyperparameters to see how this ML model performs on the California Housing data.

GridSearchCV selected C=3000.0 and kernel='linear' as the best parameters. Its RMSE, computed from the cross-validated `best_score_`, is 69939.82642470924.

In [7]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit

housing = pd.read_csv("https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.csv")

train_set, test_set = train_test_split(housing, test_size = 0.2, random_state = 42)
housing = train_set.drop(['median_house_value'], axis = 1)
housing_labels = train_set['median_house_value'].copy()

housing['income_cat'] = pd.cut(housing['median_income'], bins = [0, 1.5, 3.0, 4.5, 6.0, np.inf], labels = [1, 2, 3, 4, 5])

# Stratified split on income_cat, so that both subsets keep the income distribution of the full set
# (the cells below still fit on the full `housing` frame, not on these subsets)
split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 42)
for train_index, test_index in split.split(housing, housing['income_cat']):
  strat_train_set = housing.iloc[train_index]
  strat_test_set = housing.iloc[test_index]

# Pipeline creation for data cleaning and transformation
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

num_attr = list(housing.drop('ocean_proximity', axis = 1))
cat_attr = ['ocean_proximity']

full_pipeline = ColumnTransformer([
    ('num', Pipeline([('imputer', SimpleImputer(strategy = 'median')), ('std_scaler', StandardScaler())]), num_attr),
    ('cat', OneHotEncoder(), cat_attr)
])

housing_prep = full_pipeline.fit_transform(housing)
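
The `income_cat` column in the cell above comes from `pd.cut`, which by default uses right-closed intervals: an income of exactly 1.5 lands in category 1, and anything above 6.0 lands in category 5. A minimal pure-Python sketch of that binning rule follows; the helper name `income_category` is hypothetical and not part of the notebook.

```python
from bisect import bisect_left

# Upper edges of the first four bins used in the notebook:
# (0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]
# The fifth bin, (6.0, inf], catches everything above the last edge.
UPPER_EDGES = [1.5, 3.0, 4.5, 6.0]

def income_category(median_income):
    """Return the 1-based income category, mimicking right-closed bins."""
    if median_income <= 0:
        raise ValueError("median_income must be positive to fall into a bin")
    # bisect_left counts the edges strictly below the value, so a value
    # equal to an edge stays in the lower bin (right-closed interval).
    return bisect_left(UPPER_EDGES, median_income) + 1

for income in (0.5, 1.5, 1.6, 4.5, 6.0, 15.0):
    print(income, income_category(income))
```

Stratifying on this category, rather than on the raw income, gives the splitter a handful of groups that are each large enough to sample from.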

In [10]:
print(housing_prep)

[[ 1.27258656 -1.3728112   0.34849025 ...  0.          0.
   1.        ]
 [ 0.70916212 -0.87669601  1.61811813 ...  0.          0.
   1.        ]
 [-0.44760309 -0.46014647 -1.95271028 ...  0.          0.
   1.        ]
 ...
 [ 0.59946887 -0.75500738  0.58654547 ...  0.          0.
   0.        ]
 [-1.18553953  0.90651045 -1.07984112 ...  0.          0.
   0.        ]
 [-1.41489815  0.99543676  1.85617335 ...  0.          1.
   0.        ]]

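
In the array above, the leading columns are the imputed and standardized numeric attributes, and the trailing 0/1 columns are the one-hot encoded `ocean_proximity` categories. As a rough sketch of what the numeric branch does to a single column, assuming a made-up toy column rather than real housing values:

```python
import numpy as np

# Toy column with one missing value (made-up numbers, not from the housing data)
col = np.array([1.0, 2.0, np.nan, 4.0, 9.0])

# Step 1: median imputation, the same idea as SimpleImputer(strategy='median')
median = np.nanmedian(col)                      # median of the observed values
imputed = np.where(np.isnan(col), median, col)  # fill the gap with that median

# Step 2: standardization, the same idea as StandardScaler
# (subtract the mean, divide by the population standard deviation, ddof=0)
scaled = (imputed - imputed.mean()) / imputed.std()

print(imputed)
print(scaled.mean(), scaled.std())
```

After this step every numeric column has mean 0 and unit variance, which matters for SVMs because they are sensitive to feature scales.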

Using GridSearchCV to find the best hyperparameter values for the model

In [41]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = [
    {'kernel':['linear'], 'C':[10., 30., 100., 300., 1000., 3000., 10000., 30000.0]},
    {'kernel':['rbf'], 'C':[1.0, 3.0, 10., 30., 100., 300., 1000.0], 'gamma':[0.01, 0.03, 0.1, 0.3, 1.0, 3.0]}
]

svm_reg = SVR()

grid_search = GridSearchCV(svm_reg, param_grid, cv = 5, scoring = 'neg_mean_squared_error', verbose = 2)

grid_search.fit(housing_prep, housing_labels)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV] END ..............................C=10.0, kernel=linear; total time=  16.5s
[CV] END ..............................C=10.0, kernel=linear; total time=  10.6s
[CV] END ..............................C=10.0, kernel=linear; total time=  10.7s
[CV] END ..............................C=10.0, kernel=linear; total time=  10.5s
[CV] END ..............................C=10.0, kernel=linear; total time=  10.5s
[CV] END ..............................C=30.0, kernel=linear; total time=   9.9s
[CV] END ..............................C=30.0, kernel=linear; total time=  10.2s
[CV] END ..............................C=30.0, kernel=linear; total time=  10.7s
[CV] END ..............................C=30.0, kernel=linear; total time=  10.5s
[CV] END ..............................C=30.0, kernel=linear; total time=  10.5s
[CV] END .............................C=100.0, kernel=linear; total time=  11.0s
[CV] END .............................C=100.0, 
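
The "50 candidates, totalling 250 fits" line in the log can be checked by hand: each dictionary in `param_grid` contributes the product of its list lengths, and every candidate is fitted once per fold. A small sketch using only the standard library (the variable names other than `param_grid` are illustrative):

```python
from math import prod

# Same grid as in the notebook cell
param_grid = [
    {'kernel': ['linear'], 'C': [10., 30., 100., 300., 1000., 3000., 10000., 30000.0]},
    {'kernel': ['rbf'], 'C': [1.0, 3.0, 10., 30., 100., 300., 1000.0],
     'gamma': [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},
]
cv = 5  # number of folds passed to GridSearchCV

# Each dict contributes the product of the lengths of its value lists
candidates_per_grid = [prod(len(values) for values in grid.values()) for grid in param_grid]
n_candidates = sum(candidates_per_grid)
n_fits = n_candidates * cv

print(candidates_per_grid, n_candidates, n_fits)
```

With roughly 10 seconds per fit for the linear kernel in the log above, this count also gives a quick estimate of how long the whole search will run.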

In [44]:
import numpy as np

# best_score_ is the mean negative MSE over the 5 folds, so negate it before taking the root
rmse_best_score = np.sqrt(-grid_search.best_score_)
print(f"RMSE from best_score_: {rmse_best_score}")

RMSE from best_score_: 69939.82642470924
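
Because the search was scored with `neg_mean_squared_error`, `best_score_` is the mean of the negated per-fold MSEs, and the RMSE above is the square root of that mean. This is close to, but not the same as, averaging the RMSE of each fold. The sketch below illustrates the difference with hypothetical fold scores; the numbers are made up and are not the notebook's actual folds.

```python
from math import sqrt

# Hypothetical per-fold scores, shaped like scoring='neg_mean_squared_error' output
# (made-up values chosen so the arithmetic is easy to follow)
neg_mse_per_fold = [-4.0, -9.0, -16.0, -25.0, -36.0]

# Mean of the negated MSEs, which is what best_score_ would hold for this candidate
mean_neg_mse = sum(neg_mse_per_fold) / len(neg_mse_per_fold)

# What the notebook computes: root of the mean MSE
rmse_from_mean = sqrt(-mean_neg_mse)

# Alternative: average the per-fold RMSEs instead
mean_of_fold_rmse = sum(sqrt(-s) for s in neg_mse_per_fold) / len(neg_mse_per_fold)

print(rmse_from_mean, mean_of_fold_rmse)
```

The root of the mean MSE is never smaller than the mean of the per-fold RMSEs, so the figure reported here is the slightly more pessimistic of the two.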
