<a href="https://colab.research.google.com/github/sk-ruban/dsa5208/blob/main/kernel_ridge_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We will split the notebook into the following six sections:
1. Data preparation
2. Set up MPI processes
3. Applying distributed kernel ridge regression
4. Obtaining predicted median value using kernel function
5. Evaluating model performance
6. Cross Validation / Model Tuning


In [12]:
# Install dependencies
# !pip3 install wheel
# !pip3 install mpi4py

# Import Dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# from mpi4py import MPI

## **Data Preparation**

In [5]:
#Import Data
df = pd.read_csv('data/housing.tsv', sep='\t', header=None)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20433 entries, 0 to 20432
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   longitude         20433 non-null  float64
 1   latitude          20433 non-null  float64
 2   housingMedianAge  20433 non-null  int64  
 3   totalRooms        20433 non-null  int64  
 4   totalBedrooms     20433 non-null  int64  
 5   population        20433 non-null  int64  
 6   households        20433 non-null  int64  
 7   medianIncome      20433 non-null  float64
 8   oceanProximity    20433 non-null  int64  
 9   medianHouseValue  20433 non-null  int64  
dtypes: float64(3), int64(7)
memory usage: 1.6 MB
None

First few rows:
   longitude  latitude  housingMedianAge  totalRooms  totalBedrooms  \
0    -122.23     37.88                41         880            129   
1    -122.22     37.86                21        7099           1106   
2    -122.24     37.85                52  

In [None]:
# Assign column names (the TSV file is read without a header row)

df.columns = ['longitude', 'latitude', 'housingMedianAge', 'totalRooms',
              'totalBedrooms', 'population', 'households', 'medianIncome',
              'oceanProximity', 'medianHouseValue']

# Display basic information
print(df.info())
print("\nFirst few rows:")
print(df.head())


In [10]:
#Normalize Data

features = ['longitude', 'latitude', 'housingMedianAge', 'totalRooms',
              'totalBedrooms', 'population', 'households', 'medianIncome',
              'oceanProximity', 'medianHouseValue']

scaler = MinMaxScaler()
df[features] = scaler.fit_transform(df[features])

print("\nDataset after preprocessing:")
print(df.head())


Dataset after preprocessing:
   longitude  latitude  housingMedianAge  totalRooms  totalBedrooms  \
0   0.211155  0.567481          0.784314    0.022331       0.019863   
1   0.212151  0.565356          0.392157    0.180503       0.171477   
2   0.210159  0.564293          1.000000    0.037260       0.029330   
3   0.209163  0.564293          1.000000    0.032352       0.036313   
4   0.209163  0.564293          1.000000    0.041330       0.043296   

   population  households  medianIncome  oceanProximity  medianHouseValue  
0    0.008941    0.020556      0.539668            0.75          0.902266  
1    0.067210    0.186976      0.538027            0.75          0.708247  
2    0.013818    0.028943      0.466028            0.75          0.695051  
3    0.015555    0.035849      0.354699            0.75          0.672783  
4    0.015752    0.042427      0.230776            0.75          0.674638  


In [15]:
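#Split Data

# NOTE: `features` also contains 'medianHouseValue', so the target is one of the columns of X.
# The RMSE values reported further down are affected by this leakage; drop the target from X before relying on them.
X = df[features]
y = df['medianHouseValue']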

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

Training set shape: (14303, 10)
Test set shape: (6130, 10)


## **Set up MPI Processes**

In [None]:
# Placeholder: initialise MPI and record the number of processes and the rank of this process
# Placeholder: split the training data across the processes
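
One way the row split could look is sketched below. It only computes which rows each process would own; `row_block` is a hypothetical helper, not part of this notebook. In an actual MPI run, `rank` and `size` would presumably come from `MPI.COMM_WORLD.Get_rank()` and `MPI.COMM_WORLD.Get_size()` in mpi4py; here four processes are simulated with a plain loop.

```python
def row_block(n_rows, rank, size):
    """Return the half-open row range [start, stop) owned by `rank` out of `size` processes."""
    base, extra = divmod(n_rows, size)
    # The first `extra` ranks each take one additional row so the split stays balanced.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

# Simulate 4 processes sharing the 14303 training rows (assumed process count).
n_train = 14303
blocks = [row_block(n_train, r, 4) for r in range(4)]
for r, (start, stop) in enumerate(blocks):
    print(f"rank {r}: rows {start}..{stop - 1} ({stop - start} rows)")
```

The blocks are contiguous and differ in size by at most one row, which keeps the kernel computation per process roughly equal.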

## **Applying Kernel Ridge Regression**


In [None]:
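# A) Compute the kernel matrix K

# A.1) Define the kernel computation function
def compute_gaussian_kernel(X1, X2, sigma):
    # Accept DataFrames as well as arrays
    X1, X2 = np.asarray(X1, dtype=float), np.asarray(X2, dtype=float)
    # Squared distances via |x1 - x2|^2 = |x1|^2 + |x2|^2 - 2 (x1 . x2)
    dists = np.sum(X1**2, axis=1).reshape(-1, 1) + np.sum(X2**2, axis=1) - 2 * np.dot(X1, X2.T)
    dists = np.maximum(dists, 0)  # clip tiny negative values caused by round-off
    # Gaussian kernel: exp(-|x1 - x2|^2 / (2 * sigma^2))
    return np.exp(-dists / (2 * sigma ** 2))


# A.2) Apply the kernel computation function to each process's share of the data (the MPI step)

# A.3) Collect the rows computed by each process to assemble the matrix K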
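
Steps A.2 and A.3 can be illustrated without MPI by treating each row chunk as the share of one process. This is a minimal sketch on random toy data: the local `gaussian_kernel` helper, the chunk count, and `sigma=1.0` are assumptions for illustration. In an MPI version, the final stacking would presumably be replaced by a gather onto one rank.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """Gaussian kernel matrix with entries exp(-|x1 - x2|^2 / (2 * sigma^2))."""
    sq = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.random((10, 3))          # toy stand-in for the training features
size = 3                         # assumed number of processes

# Each "process" owns a chunk of rows and computes those rows of K against all of X.
row_chunks = np.array_split(np.arange(X.shape[0]), size)
blocks = [gaussian_kernel(X[idx], X, sigma=1.0) for idx in row_chunks]

# Stacking the row blocks in rank order reproduces the full kernel matrix.
K = np.vstack(blocks)
print(K.shape)
```

Each process needs all of `X` but only stores its own rows of `K`, so the memory for `K` is divided across processes.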

In [None]:
# Form the matrix A = K + lambda * I

# Solve A alpha = y for alpha using the conjugate gradient method

# The predictor is f(x) = sum_{i=1}^{N} alpha_i * k(x_i, x), where x_i are the training points.
# Use it to predict the median house value.
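
The three steps above can be sketched serially as follows. This is a minimal, non-distributed illustration on random toy data, not the notebook's final implementation: `rbf`, `conjugate_gradient`, and the values of `sigma`, `lam`, `tol` and `max_iter` are assumptions chosen for the example.

```python
import numpy as np

def rbf(X1, X2, sigma):
    """Gaussian kernel matrix with entries exp(-|x1 - x2|^2 / (2 * sigma^2))."""
    sq = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x                 # residual
    p = r.copy()                  # search direction
    rs_old = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs_old) < tol: # stop once the residual norm is small enough
            break
        Ap = A @ p
        step = rs_old / (p @ Ap)
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

rng = np.random.default_rng(0)
X_tr, y_tr = rng.random((20, 3)), rng.random(20)   # toy training data
X_new = rng.random((5, 3))                         # toy points to predict
sigma, lam = 1.0, 0.1

A = rbf(X_tr, X_tr, sigma) + lam * np.eye(len(X_tr))   # A = K + lambda * I
alpha = conjugate_gradient(A, y_tr)                    # solve A alpha = y
y_pred = rbf(X_new, X_tr, sigma) @ alpha               # f(x) = sum_i alpha_i * k(x_i, x)
print(y_pred)
```

Conjugate gradient only touches `A` through matrix-vector products, which is why it suits the distributed setting: each process can multiply its own rows of `A` by `p` and the partial results can then be combined.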

In [17]:
# Baseline: solve with scikit-learn's KernelRidge
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

krr = KernelRidge(kernel='rbf')
param_grid = {
    'alpha': [1],
    'gamma': [1]
}

grid_search = GridSearchCV(krr, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
print("Best parameters:", best_params)

best_model = KernelRidge(kernel='rbf', alpha=best_params['alpha'], gamma=best_params['gamma'])
best_model.fit(X_train, y_train)

y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))

print("Train RMSE:", train_rmse)
print("Test RMSE:", test_rmse)

Best parameters: {'alpha': 1, 'gamma': 1}
Train RMSE: 0.00835074344332795
Test RMSE: 0.008312276512706872
