
Activity to complete:

* Implement locally weighted regression (LWR) on the housing dataset using two additional features.
* Implement the same algorithm for the Diabetes dataset available in sklearn.datasets.
* Compare KNN regression and LWR using multiple features on both the housing dataset and the Diabetes dataset.

In [2]:
import pandas as pd
import numpy as np

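def locally_weighted_regression(X, Y, tau, x_query):
    """Predict at x_query with a Gaussian-weighted linear fit (bandwidth tau)."""
    # Gaussian kernel weight for each training row, from its squared distance to the query
    weights = np.exp(-np.sum((X - x_query)**2, axis=1) / (2 * tau**2))
    # Add an intercept column to the training data and to the query
    X_augmented = np.c_[np.ones(X.shape[0]), X]
    x_query_augmented = np.concatenate(([1], x_query)).reshape(1, -1)
    # Weighted normal equations: theta = pinv(X^T W X) X^T W Y
    W = np.diag(weights)  # dense n x n matrix; memory grows quadratically with the row count
    X_transpose_W = X_augmented.T @ W
    theta = np.linalg.pinv(X_transpose_W @ X_augmented) @ X_transpose_W @ Y
    return x_query_augmented @ theta  # array of shape (1,)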

# Load the housing dataset
db = pd.read_csv('housing.csv')

# Select features: housing_median_age, total_rooms, population
X = db[['housing_median_age', 'total_rooms', 'population']].values
Y = db['median_house_value'].values

# Example query point
X_Query = np.array([41, 2500, 1200])  # Example values for the 3 features
tau = 10  # Adjust bandwidth as needed

# Predict
y_query = locally_weighted_regression(X, Y, tau, X_Query)
print(f"Predicted median house value: {y_query[0]}")

Predicted median house value: 205990.04694565106


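# Observation:
This is the median house value that Locally Weighted Regression predicts for the query point (housing_median_age = 41, total_rooms = 2500, population = 1200). It is a single point estimate from a linear fit in which training rows closer to the query receive larger Gaussian weights. The three features are used unscaled, so the distance is dominated by total_rooms and population, whose values run into the thousands, and tau = 10 is a narrow bandwidth on that scale.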
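The function above builds `W = np.diag(weights)`, a dense n × n matrix. If housing.csv has on the order of 20,000 rows, that matrix alone would need a few gigabytes in float64. The sketch below is one possible way to get the same weighted fit without it, by scaling the columns of the transposed design matrix directly. The name `lwr_predict` and the synthetic check data are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np

def lwr_predict(X, y, tau, x_query):
    """LWR prediction that avoids building the n x n matrix np.diag(weights)."""
    # Gaussian kernel weights, same formula as locally_weighted_regression
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    Xa = np.c_[np.ones(X.shape[0]), X]      # add intercept column
    XtW = Xa.T * w                          # broadcasting; equals Xa.T @ np.diag(w)
    theta = np.linalg.pinv(XtW @ Xa) @ (XtW @ y)
    return float(np.r_[1.0, x_query] @ theta)

# Synthetic check data (made up for this sketch, not the housing set):
# an exactly linear target with intercept 3.0
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 3.0
x_q = np.zeros(3)

# At the origin the prediction should be close to the intercept
pred = lwr_predict(X_demo, y_demo, tau=1.0, x_query=x_q)
print(pred)
```

Unlike the original function, this variant returns a plain float rather than a length-1 array, so callers would not index the result with `[0]`.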

In [4]:
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
import numpy as np

# Load and preprocess the diabetes dataset
diabetes = load_diabetes()
X_d = diabetes.data
Y_d = diabetes.target
scaler = StandardScaler()
X_d_scaled = scaler.fit_transform(X_d)

# Use the same locally_weighted_regression function from Activity 1

# Example query point (mean of dataset)
X_Query_d = np.mean(X_d_scaled, axis=0)
tau = 0.5  # Adjust bandwidth as needed

# Predict
y_query_d = locally_weighted_regression(X_d_scaled, Y_d, tau, X_Query_d)
print(f"Predicted disease progression value: {y_query_d[0]}")

Predicted disease progression value: 170.90977125258212


This is the disease progression value that Locally Weighted Regression predicts at the mean of the standardized features, which is a query vector of zeros up to rounding. As in Activity 1, it is a single point estimate driven by the training rows closest to the query point; how many rows count as close is set by tau.
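How many training rows actually influence a prediction depends on tau. One rough diagnostic is the Kish effective sample size of the kernel weights, (Σw)² / Σw². The sketch below computes it for a few bandwidths. It uses synthetic standard-normal data with the same shape as the standardized Diabetes features (442 rows, 10 columns) as a stand-in; the helper name `effective_sample_size` and the tau grid are assumptions chosen for illustration.

```python
import numpy as np

def effective_sample_size(X, x_query, tau):
    """Kish effective sample size of the Gaussian LWR weights: (sum w)^2 / sum w^2."""
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    return w.sum() ** 2 / np.sum(w ** 2)

# Synthetic stand-in for standardized features (NOT the real Diabetes data)
rng = np.random.default_rng(0)
X_syn = rng.standard_normal((442, 10))
x_q = X_syn.mean(axis=0)  # query at the feature mean, as in the cell above

taus = [0.5, 1.0, 2.0, 5.0]  # illustrative bandwidths
ess = [effective_sample_size(X_syn, x_q, t) for t in taus]
for t, e in zip(taus, ess):
    print(f"tau={t}: effective sample size ~ {e:.1f} of {X_syn.shape[0]} rows")
```

A small effective sample size means the local fit rests on only a few rows and can be noisy, while a value near the row count means the fit is close to ordinary linear regression.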

In [6]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

# ... (locally_weighted_regression function from Activity 1) ...

# Prepare data (housing and diabetes) - using the same data preparation as in previous activities
# ...

# Split data into training and testing sets
X_h_train, X_h_test, Y_h_train, Y_h_test = train_test_split(X, Y, test_size=0.2, random_state=42)
X_d_train, X_d_test, Y_d_train, Y_d_test = train_test_split(X_d_scaled, Y_d, test_size=0.2, random_state=42)

# Train KNN models
knn_h = KNeighborsRegressor(n_neighbors=5)  # Housing
knn_d = KNeighborsRegressor(n_neighbors=5)  # Diabetes
knn_h.fit(X_h_train, Y_h_train)
knn_d.fit(X_d_train, Y_d_train)

# Make predictions with KNN
knn_preds_h = knn_h.predict(X_h_test)
knn_preds_d = knn_d.predict(X_d_test)

# Make predictions with LWR (first 50 test rows only, because every query solves its own weighted fit)
lwr_preds_h = [locally_weighted_regression(X_h_train, Y_h_train, tau=10, x_query=x) for x in X_h_test[:50]]
lwr_preds_d = [locally_weighted_regression(X_d_train, Y_d_train, tau=0.5, x_query=x) for x in X_d_test[:50]]
lwr_preds_h = np.array(lwr_preds_h).reshape(-1)  # Reshape to 1D array
lwr_preds_d = np.array(lwr_preds_d).reshape(-1)  # Reshape to 1D array


# Evaluate using Mean Squared Error (MSE); note KNN is scored on the full test set, LWR on the first 50 rows
mse_knn_h = mean_squared_error(Y_h_test, knn_preds_h)
mse_lwr_h = mean_squared_error(Y_h_test[:50], lwr_preds_h)
mse_knn_d = mean_squared_error(Y_d_test, knn_preds_d)
mse_lwr_d = mean_squared_error(Y_d_test[:50], lwr_preds_d)

print(f"Housing Dataset - KNN MSE: {mse_knn_h}, LWR MSE: {mse_lwr_h}")
print(f"Diabetes Dataset - KNN MSE: {mse_knn_d}, LWR MSE: {mse_lwr_d}")

Housing Dataset - KNN MSE: 11278255164.991802, LWR MSE: 15916847976.84166
Diabetes Dataset - KNN MSE: 3019.075505617978, LWR MSE: 8543.22691353124


# Observation:
* Housing Dataset: KNN has a lower MSE than LWR (about 1.13e10 vs 1.59e10), so in this run KNN predicts median house values more accurately.
* Diabetes Dataset: KNN again has a lower MSE than LWR (about 3019 vs 8543).
* Overall: KNN has the lower MSE on both datasets in this run. The comparison is not like-for-like, though: KNN is scored on the full test set and LWR on only the first 50 test rows, the housing features are unscaled, and neither k = 5 nor tau was tuned. These numbers alone therefore do not show that KNN generalizes better than LWR.
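The comparison above scores KNN on the full test set and LWR on the first 50 test rows. A like-for-like version could score both models on the same rows and try more than one bandwidth. The sketch below does this for the Diabetes data only, since it ships with scikit-learn. It is an illustration under assumptions: the helper `lwr_predict`, the tau grid, and fitting the scaler on the training rows only are choices made here, not taken from the notebook.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

def lwr_predict(X, y, tau, x_query):
    """Gaussian-weighted linear fit evaluated at x_query (hypothetical helper)."""
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    Xa = np.c_[np.ones(X.shape[0]), X]      # add intercept column
    XtW = Xa.T * w                          # equals Xa.T @ np.diag(w)
    theta = np.linalg.pinv(XtW @ Xa) @ (XtW @ y)
    return float(np.r_[1.0, x_query] @ theta)

X_all, y_all = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)

# Assumption: fit the scaler on training rows only, then apply it to the test rows
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Score both models on the same 50 test rows
X_eval, y_eval = X_te_s[:50], y_te[:50]

knn = KNeighborsRegressor(n_neighbors=5).fit(X_tr_s, y_tr)
mse_knn = mean_squared_error(y_eval, knn.predict(X_eval))

mse_lwr = {}
for tau in (0.5, 1.0, 2.0, 5.0):  # illustrative bandwidths, not tuned values
    preds = np.array([lwr_predict(X_tr_s, y_tr, tau, x) for x in X_eval])
    mse_lwr[tau] = mean_squared_error(y_eval, preds)

print(f"KNN (k=5) MSE on 50 rows: {mse_knn:.1f}")
for tau, mse in mse_lwr.items():
    print(f"LWR (tau={tau}) MSE on 50 rows: {mse:.1f}")
```

Choosing tau or k by looking at test-set MSE would leak information, so a stricter version of this sketch would select them on a separate validation split or with cross-validation.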