Q1

a. Dimension reduction
b. Clustering
c. Regression
d. Classification

**SVM-C with Gaussian kernel**

Target output type: class label (one integer or string per sample)

Parameters: support vectors, their coefficients, and intercepts

Hyperparameters: C (regularization), gamma (kernel coefficient), tol (tolerance)

Hyperparameter range: 0.5 to 2, 'scale' or 'auto', 1e-3 to 1e-5

sklearn commands: s=sklearn.svm.SVC(); s.fit(X, y); s.predict(X_)
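
A minimal sketch of these commands on synthetic data. The `make_blobs` dataset and the specific hyperparameter values are assumptions chosen for illustration, not part of the original answer.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data (an assumption for illustration only)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)

# The RBF (Gaussian) kernel is the default; values are spelled out for clarity
svc = SVC(kernel="rbf", C=1.0, gamma="scale", tol=1e-3)
svc.fit(X, y)

# predict returns one class label per sample
labels = svc.predict(X[:5])

# Learned parameters: support vectors, their dual coefficients, and the intercept
print(svc.support_vectors_.shape, svc.dual_coef_.shape, svc.intercept_.shape)
```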


**SVM-R with Gaussian kernel**

Target output type: float

Parameters: support vectors, their coefficients, and intercepts

Hyperparameters: C (regularization), gamma (kernel coefficient), tol (tolerance), epsilon

Hyperparameter range: 0.5 to 2, 'scale' or 'auto', 1e-3 to 1e-5, 0.1 to 0.01

sklearn commands: r=sklearn.svm.SVR(); r.fit(X, y); r.predict(X_)
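
The regression variant can be sketched the same way. The noisy sine data and the hyperparameter values below are assumptions for illustration; `epsilon` sets the width of the tube inside which errors are not penalized.

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sine curve (synthetic data, assumed for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

svr = SVR(kernel="rbf", C=1.0, gamma="scale", tol=1e-3, epsilon=0.1)
svr.fit(X, y)

# predict returns one float per sample
y_pred = svr.predict(X[:5])
print(y_pred.dtype, svr.score(X, y))
```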

**NN with one hidden layer (C)**

Target output type: class label (binarized internally by MLPClassifier)

Parameters: weight matrices and bias vectors for each layer

Hyperparameters: hidden_layer_sizes, activation, alpha (regularization)

Hyperparameter range: 20 to 500, 'relu' or 'logistic' or 'tanh', 0.0001 to 1

sklearn commands: clf = sklearn.neural_network.MLPClassifier(); clf.fit(X, y); clf.predict(X_)

**NN with one hidden layer (R)**

Target output type: float

Parameters: weight matrices and bias vectors for each layer

Hyperparameters: hidden_layer_sizes, activation, alpha (regularization)

Hyperparameter range: 20 to 500, 'relu' or 'logistic' or 'tanh', 0.0001 to 1

sklearn commands: clf = sklearn.neural_network.MLPRegressor(); clf.fit(X, y); clf.predict(X_)
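
To make the parameter description concrete, the following sketch fits a small one-hidden-layer network and inspects the shapes of its learned weights and biases. The synthetic data from `make_regression` and the layer size of 20 are assumptions for illustration.

```python
import warnings
from sklearn.datasets import make_regression
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor

# 5 input features, 1 output (synthetic data, assumed for illustration)
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

mlp = MLPRegressor(hidden_layer_sizes=(20,), activation="relu",
                   alpha=1e-4, max_iter=200, random_state=0)
with warnings.catch_warnings():
    # The shapes below do not depend on whether training has converged
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    mlp.fit(X, y)

# One weight matrix and one bias vector per layer (hidden, then output)
weight_shapes = [w.shape for w in mlp.coefs_]
bias_shapes = [b.shape for b in mlp.intercepts_]
print(weight_shapes)  # [(5, 20), (20, 1)]
print(bias_shapes)    # [(20,), (1,)]
```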

**Random forest (R)**

Target output type: float

Parameters: A group of decision trees and their features/splits

Hyperparameters: n_estimators (number of trees), min_impurity_decrease, max_features (number of features considered for each split)

Hyperparameter range: 5 to 1000, 0 to 1, at most the number of features n (typically well below n, e.g. 'sqrt')

sklearn commands: r = sklearn.ensemble.RandomForestRegressor(); r.fit(X, y); r.predict(X_)

**Random forest (C)**

Target output type: class label

Parameters: A group of decision trees and their features/splits

Hyperparameters: n_estimators (number of trees), min_impurity_decrease, max_features (number of features considered for each split)

Hyperparameter range: 5 to 1000, 0 to 1, at most the number of features n (typically well below n, e.g. 'sqrt')

sklearn commands: c = sklearn.ensemble.RandomForestClassifier(); c.fit(X, y); c.predict(X_)
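
A short sketch of the random forest commands with the three hyperparameters set explicitly. The synthetic data and the chosen values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data with 10 features (assumed)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

rf = RandomForestClassifier(
    n_estimators=50,            # number of trees
    max_features="sqrt",        # features considered at each split
    min_impurity_decrease=0.0,  # split only if impurity drops by at least this much
    random_state=0,
)
rf.fit(X, y)

# The fitted model is a list of decision trees plus per-feature importances
print(len(rf.estimators_), rf.feature_importances_.shape)
```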

**k-means**

Target output type: integer(index of cluster)

Parameters: cluster centers

Hyperparameters: n_clusters, init(initial center positions)

Hyperparameter range: 2 to 20, depends on data

sklearn commands: k = sklearn.cluster.KMeans(); k.fit(X); k.predict(X_)
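
As an illustration, the sketch below clusters six hand-made points into two groups and then assigns new points to the nearest learned center. The toy points and the explicit `n_init=10` are assumptions for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of three points each (toy data, assumed)
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
km.fit(X)

# Learned parameters: one center per cluster
print(km.cluster_centers_)

# New points get the index of the nearest center
new_labels = km.predict(np.array([[0.2, 0.2], [4.8, 4.9]]))
print(new_labels)
```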

**DBSCAN**

Target output type: integer(cluster label and -1 for noisy output)

Parameters: core points

Hyperparameters: eps(max distance between samples to classify them as neighbors), min_samples(min number of neighbors to classify point as a core point)

Hyperparameter range: 0.1 to 5, depends on dataset

sklearn commands: d = sklearn.cluster.DBSCAN(); labels = d.fit_predict(X) (DBSCAN has no predict method; the labels are also stored in d.labels_)
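
The sketch below runs DBSCAN on a toy dataset with two tight groups and one distant outlier. The points and the `eps` and `min_samples` values are assumptions chosen so that the result is easy to verify by hand.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups and one far-away outlier (toy data, assumed)
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
              [20.0, 20.0]])

db = DBSCAN(eps=0.5, min_samples=2)

# DBSCAN labels only the data it was fit on, so fit and label in one call
labels = db.fit_predict(X)
print(labels)                    # [ 0  0  0  1  1  1 -1]
print(db.core_sample_indices_)   # [0 1 2 3 4 5]
```

The outlier receives the label -1 because it has no neighbor within `eps`.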

**PCA**

Target output type: none

Parameters: basis of the new vector space (on which current space will be projected)

Hyperparameters: n_components

Hyperparameter range: 1 to 3 for data visualization; on the order of n/10 (n = number of features) when another model is to be fit on the reduced data

sklearn commands: pca=sklearn.decomposition.PCA(n_components=2); X_train_ = pca.fit_transform(X_train); X_test_ = pca.transform(X_test);
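
A runnable version of these commands on random data (the data itself is an assumption for illustration). It also checks that the learned basis vectors are orthonormal, which is what makes them a basis for the projected space.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random 5-dimensional data standing in for a real dataset (assumed)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
X_test = rng.normal(size=(20, 5))

pca = PCA(n_components=2)
X_train_ = pca.fit_transform(X_train)  # learn the basis on the training data
X_test_ = pca.transform(X_test)        # project test data onto the same basis

# Learned parameters: one basis vector per component, each of length n_features
print(pca.components_.shape)           # (2, 5)
print(pca.explained_variance_ratio_)
```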

**Kernel PCA**

Target output type: none

Parameters: basis of the new vector space (on which current space will be projected)

Hyperparameters: n_components, gamma

Hyperparameter range: order of n/10, order of 1/num_features

sklearn commands: kpca=sklearn.decomposition.KernelPCA(n_components=2, kernel='rbf'); X_train_ = kpca.fit_transform(X_train); X_test_ = kpca.transform(X_test);
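
A sketch of kernel PCA with a Gaussian kernel on random data (the data is an assumption for illustration). The default kernel of `KernelPCA` is linear, for which `gamma` is ignored, so the kernel is set to `"rbf"` explicitly here.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Random 5-dimensional data standing in for a real dataset (assumed)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
X_test = rng.normal(size=(20, 5))

# gamma on the order of 1 / number of features
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=1.0 / X_train.shape[1])
X_train_ = kpca.fit_transform(X_train)
X_test_ = kpca.transform(X_test)

print(X_train_.shape, X_test_.shape)  # (100, 2) (20, 2)
```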

In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv("/content/drive/My Drive/ds203_assignment/SeoulBikeData.csv", encoding='latin1')

In [None]:
df.head()

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Holiday,Functioning Day
0,01/12/2017,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
1,01/12/2017,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
2,01/12/2017,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,Winter,No Holiday,Yes
3,01/12/2017,107,3,-6.2,40,0.9,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
4,01/12/2017,78,4,-6.0,36,2.3,2000,-18.6,0.0,0.0,0.0,Winter,No Holiday,Yes


In [None]:
# this is a supervised learning (regression) problem: a count variable (non-negative integer) has to be predicted
# the target is Rented Bike Count
# performance will be measured by r2_score
# we will drop Date, as Seasons accounts for most of its effect
# we will also drop dew point temperature, as it is largely redundant (it is determined by temperature and humidity)
# all other variables will be used in the prediction after appropriate preprocessing

In [None]:
df.isnull().sum()

Date                         0
Rented Bike Count            0
Hour                         0
Temperature(°C)              0
Humidity(%)                  0
Wind speed (m/s)             0
Visibility (10m)             0
Dew point temperature(°C)    0
Solar Radiation (MJ/m2)      0
Rainfall(mm)                 0
Snowfall (cm)                0
Seasons                      0
Holiday                      0
Functioning Day              0
dtype: int64

In [None]:
for col in df.columns:
  print(col, len(df[col].unique()))

Date 365
Rented Bike Count 2166
Hour 24
Temperature(°C) 546
Humidity(%) 90
Wind speed (m/s) 65
Visibility (10m) 1789
Dew point temperature(°C) 556
Solar Radiation (MJ/m2) 345
Rainfall(mm) 61
Snowfall (cm) 51
Seasons 4
Holiday 2
Functioning Day 2


In [None]:
df["Holiday"]=df["Holiday"]=="Holiday"

In [None]:
df["Functioning Day"]=df["Functioning Day"]=="Yes"

In [None]:
for s in df["Seasons"].unique():
  df["is_"+s] = df["Seasons"]==s
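
The loop above builds one boolean indicator column per season. A roughly equivalent sketch with `pd.get_dummies` is shown here on a small stand-in frame (the `toy` frame is an assumption; in the notebook it would be `df`). Note that `get_dummies` orders the new columns alphabetically rather than by first appearance.

```python
import pandas as pd

# Small stand-in for the Seasons column (assumed for illustration)
toy = pd.DataFrame({"Seasons": ["Winter", "Spring", "Summer", "Autumn", "Winter"]})

# One indicator column per distinct season, named is_<season>
dummies = pd.get_dummies(toy["Seasons"], prefix="is")
toy = pd.concat([toy, dummies], axis=1)

print(list(dummies.columns))  # ['is_Autumn', 'is_Spring', 'is_Summer', 'is_Winter']
```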

In [None]:
df=df.drop(["Date", "Seasons", "Dew point temperature(°C)"], axis=1)

In [None]:
df.head()

Unnamed: 0,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Holiday,Functioning Day,is_Winter,is_Spring,is_Summer,is_Autumn
0,254,0,-5.2,37,2.2,2000,0.0,0.0,0.0,False,True,True,False,False,False
1,204,1,-5.5,38,0.8,2000,0.0,0.0,0.0,False,True,True,False,False,False
2,173,2,-6.0,39,1.0,2000,0.0,0.0,0.0,False,True,True,False,False,False
3,107,3,-6.2,40,0.9,2000,0.0,0.0,0.0,False,True,True,False,False,False
4,78,4,-6.0,36,2.3,2000,0.0,0.0,0.0,False,True,True,False,False,False


In [None]:
df_x = df.drop("Rented Bike Count", axis=1)
df_y=df["Rented Bike Count"]
data_x = df_x.to_numpy()
data_y = df_y.to_numpy()

In [None]:
from sklearn.preprocessing import StandardScaler as SS
ss=SS()
data_x=ss.fit_transform(data_x)

In [None]:
data_x

array([[-1.66132477, -1.51395724, -1.04248288, ..., -0.58051386,
        -0.58051386, -0.57629575],
       [-1.51686175, -1.53907415, -0.99336999, ..., -0.58051386,
        -0.58051386, -0.57629575],
       [-1.37239873, -1.58093567, -0.94425709, ..., -0.58051386,
        -0.58051386, -0.57629575],
       ...,
       [ 1.37239873, -0.86091752, -0.94425709, ..., -0.58051386,
        -0.58051386,  1.73522016],
       [ 1.51686175, -0.90277904, -0.8460313 , ..., -0.58051386,
        -0.58051386,  1.73522016],
       [ 1.66132477, -0.91952365, -0.74780551, ..., -0.58051386,
        -0.58051386,  1.73522016]])

In [None]:
# framework1: Ridge Regression
# framework2: NN with one hidden layer

In [None]:
l=data_x.shape[0]
indices = np.random.permutation(l)
idx_train, idx_val, idx_test = indices[:int(l*0.7)], indices[int(l*0.7):int(l*0.85)], indices[int(l*0.85):]
x_train, x_val, x_test = data_x[idx_train,:], data_x[idx_val,:], data_x[idx_test,:]
y_train, y_val, y_test = data_y[idx_train], data_y[idx_val], data_y[idx_test]
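
The same 70/15/15 split can also be sketched with `train_test_split`. In this variant the scaler is fit on the training split only, so that validation and test statistics do not influence the preprocessing; this differs from the notebook, which scales the full dataset before splitting. The random placeholder data is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder features and target (assumed; stands in for data_x and data_y before scaling)
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
y = rng.normal(size=1000)

# 70% train, then split the remaining 30% evenly into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=0)

# Fit the scaler on the training split only, then apply it to all three splits
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```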

In [None]:
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import RandomizedSearchCV

In [None]:
ridge_alpha_vals=[0, 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000]
distributions = dict(alpha=ridge_alpha_vals)

In [None]:
lr=Ridge()
clf = RandomizedSearchCV(lr, distributions, random_state=0, cv=5)

In [None]:
x_train_main = np.concatenate((x_train, x_val))
y_train_main = np.concatenate((y_train, y_val))
search = clf.fit(x_train_main, y_train_main)

In [None]:
search.best_params_

{'alpha': 10}

In [None]:
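# note: x_val is part of x_train_main, so this score is computed on data the model was trained on
# the cross-validated estimate is available as search.best_score_
search.score(x_val, y_val)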

0.5573249502349399
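
Since the search is fit on the training and validation data together, a less optimistic estimate can be read from the cross-validation itself. The sketch below shows this on synthetic data; the dataset, the shortened alpha grid, and the variable names `X_dev` and `X_test` are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic regression data (assumed for illustration)
X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

search = RandomizedSearchCV(Ridge(), {"alpha": [0.1, 1, 10, 100]},
                            n_iter=4, cv=5, random_state=0)
search.fit(X_dev, y_dev)

# Mean R^2 over the 5 held-out folds for the best alpha
print(search.best_params_, search.best_score_)

# R^2 on data that played no part in the search
print(search.score(X_test, y_test))
```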

In [None]:
nn_hl_sizes=[(i*20,) for i in range(1, 20)]
nn_alpha_vals=[0.01, 0.03, 0.1, 0.3, 1, 3]

In [None]:
distributions = dict(alpha=nn_alpha_vals,
                     hidden_layer_sizes=nn_hl_sizes)

In [None]:
nn=MLPRegressor(tol=1, learning_rate_init=1) # loose tolerance and a large initial learning rate to shorten training during the search
clf = RandomizedSearchCV(nn, distributions, random_state=0, cv=5, n_iter=20)

In [None]:
search = clf.fit(x_train_main, y_train_main)

In [None]:
print(search.best_params_)
# x_val is part of x_train_main, so this score is optimistic
search.score(x_val, y_val)

{'hidden_layer_sizes': (140,), 'alpha': 0.1}


0.8371386866965053

The one-hidden-layer neural network performs better than Ridge regression (R² of about 0.84 versus 0.56 on the validation split).

The optimal hyperparameters were found to be:

  alpha=0.1

  hidden_layer_sizes=(140,)

In [None]:
search.score(x_test, y_test)

0.823498820645469

In [None]:
y_test_pred=search.predict(x_test)

In [None]:
# compare the first ten true and predicted counts
print(y_test[:10], y_test_pred[:10])
# mean squared error on the test set
mse=np.mean((y_test-y_test_pred)**2)
print(mse)

[ 227  476 2333 1144  486  712 1929  353  267 1235] [ 299.99962489  703.2486672  1790.23887933 1144.83126417  385.68951648
  320.75897395 1346.00153988  361.07952352  321.52937985 1229.18138188]
74241.10723999186
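
Taking the square root of the reported MSE gives an RMSE of roughly 272 bikes, which is easier to interpret because it is in the units of the target. The same metrics are available in `sklearn.metrics`; the sketch below uses small made-up arrays (an assumption for illustration) so the values can be checked by hand.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted values (assumed for illustration)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

mse = mean_squared_error(y_true, y_pred)  # mean of squared errors
rmse = np.sqrt(mse)                       # same units as the target
r2 = r2_score(y_true, y_pred)             # 1 - SS_res / SS_tot

print(mse)           # 250.0
print(round(r2, 4))  # 0.98
```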


In [None]:
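# with a test R^2 of about 0.82 and a test MSE of about 74241 (RMSE of roughly 272 bikes),
# the model is not highly accurate, but its predictions roughly track the observed counts
# we conclude that the model is usable as a rough estimator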