**R^2 Score**

The R^2 score is calculated on the test set because the test data were not used to fit the model, so the score estimates how well the model will perform on new, unseen data. That makes it the appropriate basis for model evaluation and selection.



### Question 4: Differentiate between generalization, overfitting, and underfitting.

- **Generalization**:
  - **Definition**: The ability of a machine learning model to perform well on new, unseen data.
  - **Significance**: A well-generalized model can make accurate predictions on data it hasn't seen during training.
  
- **Overfitting**:
  - **Definition**: Occurs when a model learns the noise and details in the training data to the extent that it negatively impacts the model's ability to generalize.
  - **Significance**: Overfitted models have high accuracy on training data but poor accuracy on new data.

- **Underfitting**:
  - **Definition**: Occurs when a model is too simple to capture the underlying pattern of the data.
  - **Significance**: Underfitted models have poor performance on both training and new data.
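
One way to tell the three cases apart is to compare the score on the training data with the score on held-out data. The sketch below is a minimal, dependency-free illustration rather than the notebook's own code: the synthetic sine data, the helper names `r2` and `knn_predict`, and the values of k are all assumptions made for the example. With k = 1 the model memorises the training set (training R^2 of 1), and with k equal to the training-set size it always predicts the training mean (training R^2 of 0, an extreme case of underfitting).

```python
import math
import random

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def knn_predict(x_train, y_train, x_query, k):
    """Predict each query point as the mean target of its k nearest training points."""
    preds = []
    for q in x_query:
        nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - q))[:k]
        preds.append(sum(y_train[i] for i in nearest) / k)
    return preds

# Hypothetical noisy data: y = sin(x) + Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-3, 3) for _ in range(80)]
ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
x_train, y_train = xs[:60], ys[:60]
x_test, y_test = xs[60:], ys[60:]

scores = {}
for k in (1, 9, len(x_train)):
    train_r2 = r2(y_train, knn_predict(x_train, y_train, x_train, k))
    test_r2 = r2(y_test, knn_predict(x_train, y_train, x_test, k))
    scores[k] = (train_r2, test_r2)
    print(f"k={k:2d}  train R^2={train_r2:.3f}  test R^2={test_r2:.3f}")
```

A large gap between the training and test scores points to overfitting, low scores on both point to underfitting, and similar, reasonably high scores on both suggest the model generalizes.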

### Question 6: Study the impact of dataset variation on R^2 score.

- **Impact of Dataset Variation**:
  - **R^2 Score**: Measures how well the regression model fits the observed data.
  - **Impact**:
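    - Clean datasets in which the features actually determine the target typically yield higher \( R^2 \) scores.
    - Noisy datasets yield lower \( R^2 \) scores, because noise adds variance to the target that no model can explain.
    - \( R^2 \) also depends on the sample size and the train/test split. On the 40-sample wave data, the kNN score (k = 3) changes from 0.834 to 0.818 when only the test split size changes from the default 0.25 to 0.2.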
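
The effect of noise can be sketched with a small experiment. This is an illustrative example under stated assumptions, not part of the original assignment: the data follow a hypothetical rule y = 2x + 1 plus Gaussian noise, the helpers `fit_line`, `r2`, and `r2_for_noise` are names introduced here, and R^2 is measured on the same data the line was fitted to, for simplicity.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def r2_for_noise(noise_sd, n=200, seed=0):
    """Generate y = 2x + 1 + noise, fit a line, and return its R^2 on the same data."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, noise_sd) for x in xs]
    slope, intercept = fit_line(xs, ys)
    return r2(ys, [slope * x + intercept for x in xs])

# The underlying relationship is identical in all three runs; only the noise level changes.
for sd in (0.1, 2.0, 5.0):
    print(f"noise sd={sd}: R^2={r2_for_noise(sd):.3f}")
```

The fitted line is close to the true one in every run, yet R^2 falls as the noise grows, which shows that a low score can reflect the data as much as the model.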

### Question 7: Discuss the strengths and weaknesses of linear and kNN regressions.

- **Linear Regression**:
  - **Strengths**: Simple to implement and interpret, works well when the target depends approximately linearly on the features, and its coefficients provide insight into relationships between variables.
  - **Weaknesses**: Assumes a linear relationship between variables, sensitive to outliers, may underperform when the relationship is non-linear.

- **k-Nearest Neighbors (kNN) Regression**:
  - **Strengths**: Non-parametric, does not assume any underlying data distribution, flexible and can capture complex patterns.
  - **Weaknesses**: Computationally expensive at prediction time on large datasets, sensitive to the choice of k and to feature scaling, and degrades on high-dimensional data (the curse of dimensionality).
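
The trade-off can be illustrated on a deliberately contrived case. The sketch below assumes a hypothetical target y = x^2 on a symmetric grid, which is chosen to favour kNN: the best straight line through such data is flat, while averaging the two nearest neighbours tracks the curve closely. The helper names are introduced for this example only.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def knn_predict(x_train, y_train, x_query, k):
    """Predict each query point as the mean target of its k nearest training points."""
    preds = []
    for q in x_query:
        nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - q))[:k]
        preds.append(sum(y_train[i] for i in nearest) / k)
    return preds

# Hypothetical non-linear data: train on integers -5..5, test on the half-integers between them.
x_train = [float(x) for x in range(-5, 6)]
y_train = [x ** 2 for x in x_train]
x_test = [x + 0.5 for x in range(-5, 5)]
y_test = [x ** 2 for x in x_test]

slope, intercept = fit_line(x_train, y_train)
linear_r2 = r2(y_test, [slope * x + intercept for x in x_test])
knn_r2 = r2(y_test, knn_predict(x_train, y_train, x_test, k=2))
print(f"linear R^2={linear_r2:.3f}  kNN (k=2) R^2={knn_r2:.3f}")
```

On data with a genuinely linear trend the comparison would tilt the other way, since the line needs only two parameters and extrapolates beyond the training range, which kNN cannot do.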

### Question 8: Analyze the impact of \( R^2 \) on the mean relative error for wave and Boston Housing datasets.

- **Wave Dataset**:
  - **KNN**:
    - \( R^2 \) Score: 0.8183
    - Mean Relative Error: 0.1345
  - **Linear Regression**:
    - \( R^2 \) Score: 0.6608
    - Mean Relative Error: 0.4714

- **California Housing Dataset**:
  - **KNN**:
    - \( R^2 \) Score: 0.8206
    - Mean Relative Error: 0.2105
  - **Linear Regression**:
    - \( R^2 \) Score: 0.8411
    - Mean Relative Error: 0.2360

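- **Impact of \( R^2 \) on Mean Relative Error**: On the wave dataset, the model with the higher \( R^2 \) (kNN) also has the lower mean relative error. On the California Housing dataset the two metrics disagree: linear regression has the slightly higher \( R^2 \) (0.8411 vs. 0.8206) but also the higher mean relative error (0.2360 vs. 0.2105). A higher \( R^2 \) therefore often, but not always, accompanies a lower mean relative error.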

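**R^2 Score**
It is at most 1, where 1 indicates that the model perfectly predicts the target variable from the independent variables. A score of 0 corresponds to always predicting the mean of the target, and the score can be negative on a test set when the model does worse than that.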
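
As a concrete check of the range of the score, including the fact that it can fall below 0, the sketch below computes R^2 directly from its definition, 1 - SS_res / SS_tot. The function name `r2_score_manual` and the three toy prediction vectors are assumptions introduced for illustration. The `score` method of scikit-learn regressors, used in the code cells of this notebook, reports the same quantity.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - (sum of squared residuals) / (total sum of squares around the mean)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0]
perfect = r2_score_manual(y_true, [1.0, 2.0, 3.0])    # predictions equal the targets
mean_only = r2_score_manual(y_true, [2.0, 2.0, 2.0])  # always predict the mean
reversed_ = r2_score_manual(y_true, [3.0, 2.0, 1.0])  # worse than predicting the mean
print(perfect, mean_only, reversed_)  # prints 1.0 0.0 -3.0
```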

**Mean Relative Error**
Lower mean relative error indicates better model performance, as it means the model's predictions are closer to the actual values.
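
The metric can be sketched in a few lines, assuming the common definition in which each absolute error is divided by the absolute actual value. The second function is a hypothetical variant, named here only for comparison, that divides by the signed actual value. It shows why the absolute value in the denominator matters when targets can be negative, as they can in the wave data.

```python
def mean_relative_error(y_true, y_pred):
    """Mean of |y_pred - y_true| / |y_true|; undefined if any actual value is 0."""
    return sum(abs(p - t) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

def signed_denominator_mre(y_true, y_pred):
    """Variant that divides by y_true without abs(), shown only for comparison."""
    return sum(abs(p - t) / t for t, p in zip(y_true, y_pred)) / len(y_true)

# Positive targets: both definitions agree.
positive_case = mean_relative_error([2.0, 4.0], [1.0, 5.0])  # (0.5 + 0.25) / 2 = 0.375

# Mixed-sign targets: every prediction is off by 50%, but the signed variant cancels to 0.
y_true, y_pred = [-2.0, 2.0], [-1.0, 1.0]
with_abs = mean_relative_error(y_true, y_pred)        # 0.5
without_abs = signed_denominator_mre(y_true, y_pred)  # 0.0
print(positive_case, with_abs, without_abs)
```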


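R^2 and mean relative error are complementary metrics for evaluating regression models: R^2 measures the share of the target's variance that the model explains, while mean relative error measures the typical size of an error relative to the actual value.
Higher R^2 scores are often associated with lower mean relative error, but because the two metrics weight errors differently they can rank models differently, as the California Housing results above show.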
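
Whether the two metrics pick the same model can be checked mechanically. The sketch below uses only the test-set figures reported in the answer to Question 8; the dictionary layout and the helper name `metrics_agree` are assumptions made for this example.

```python
# Test-set results reported in this notebook: model -> (R^2, mean relative error).
results = {
    "wave": {"kNN": (0.8183, 0.1345), "linear": (0.6608, 0.4714)},
    "california": {"kNN": (0.8206, 0.2105), "linear": (0.8411, 0.2360)},
}

def metrics_agree(models):
    """True if the model with the highest R^2 also has the lowest mean relative error."""
    best_by_r2 = max(models, key=lambda name: models[name][0])
    best_by_mre = min(models, key=lambda name: models[name][1])
    return best_by_r2 == best_by_mre

agreement = {dataset: metrics_agree(models) for dataset, models in results.items()}
for dataset, agrees in agreement.items():
    print(f"{dataset}: metrics pick the same model -> {agrees}")
```

The metrics agree on the wave data and disagree on the California Housing data, so it is worth reporting both rather than inferring one from the other.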

In [None]:

pip install mglearn

Collecting mglearn
  Downloading mglearn-0.2.0-py2.py3-none-any.whl (581 kB)
Installing collected packages: mglearn
Successfully installed mglearn-0.2.0


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


In [None]:


# KNN using built-in dataset
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
import mglearn
X, y = mglearn.datasets.make_wave(n_samples=40)
X_train, X_test, y_train, y_test= train_test_split(X, y, random_state=0)
knn_reg= KNeighborsRegressor(n_jobs=-1, n_neighbors=3)
knn_reg.fit(X_train, y_train)

# to make predictions on X_test
prediction_X= knn_reg.predict(X_test)
print('the prediction on testing feature is:'+ str(prediction_X))
# the amount of variance in the target vector explained by the model
r_square= knn_reg.score(X_test, y_test)
print('The R Score is '+ str(r_square))

# mean_relative_error = np.mean(np.abs((y_pred - y) / y))
# mean_relative_error

the prediction on testing feature is:[-0.05396539  0.35686046  1.13671923 -1.89415682 -1.13881398 -1.63113382
  0.35686046  0.91241374 -0.44680446 -1.13881398]
The R Score is 0.8344172446249605


In [None]:
#Question 2:Compute the mean relative error of actual and predicted dependent variable using the following formula. Write the code and output.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

import mglearn

X, y= mglearn.datasets.make_wave(n_samples= 40)
X_train, X_test, y_train, y_test= train_test_split(X, y, random_state=0, test_size= 0.2)


lr_reg= LinearRegression().fit(X_train, y_train)

lr_reg_score_train=lr_reg.score(X_train, y_train)
lr_reg_score_test=lr_reg.score(X_test, y_test)

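y_pred= lr_reg.predict(X_test)
# Relative error compares predictions with the actual targets (y_test), not with the features (X_test).
# Note: the wave targets can be negative, so dividing by y_test rather than |y_test| lets terms partly cancel.
mean_relative_error= np.mean(np.abs(y_pred - y_test)/y_test)
mean_relative_error

0.4713975184565494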

In [None]:
# X, y= mglearn.datasets.load_extended_boston()
# print(X.shape)
# print(y.shape)


# X_train, X_test, y_train, y_test= train_test_split(X, y, random_state=0, test_size= 0.2)
# lr_reg= LinearRegression().fit(X_train, y_train)

# lr_reg_score= lr_reg.score(X_test, y_test)
# lr_reg_score



In [None]:
df= pd.read_csv('/content/sample_data/california_housing_train.csv')
X= df[["total_rooms"]]
y= df["total_bedrooms"]

X_train, X_test, y_train, y_test= train_test_split(X, y, random_state=0, test_size= 0.2)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
lr_reg= LinearRegression().fit(X_train, y_train)
y_pred= lr_reg.predict(X_test)


lr_reg_score_rsquare= lr_reg.score(X_test, y_test)
print(lr_reg_score_rsquare)
# Relative error is measured against the actual values (y_test), not the predictions.
mean_relative_error= np.mean(np.abs((y_pred - y_test)/y_test))
print(mean_relative_error)
mse = np.mean((y_pred - y_test)**2)
print(mse)

(13600, 1)
(13600,)
(3400, 1)
(3400,)
0.8411231646860435
0.2360237740000491
26231.403924160844


In [None]:
#Question 3: Compare the linear and kNN regressions on the basis of R^2 and mean relative error for wave and Boston Housing datasets.


from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import mglearn


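def mean_relative_error(y_pred, y_test):
  # Mean of |prediction - actual| / actual. The denominator is not wrapped in abs(),
  # so negative targets (as in the wave data) contribute negative terms that partly cancel.
  return np.mean(np.abs((y_pred - y_test))/y_test)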

# KNN using built-in dataset
X, y = mglearn.datasets.make_wave(n_samples=40)
X_train, X_test, y_train, y_test= train_test_split(X, y, random_state=0, test_size= 0.2)
knn_reg= KNeighborsRegressor(n_jobs=-1, n_neighbors=3)
knn_reg.fit(X_train, y_train)
# to make predictions on X_test
y_pred= knn_reg.predict(X_test)
# the amount of variance in the target vector explained by the model
r_square= knn_reg.score(X_test, y_test)
print('The R^2 Score of KNN on wave dataset'+ str(r_square))
print('the mean relative error of KNN on wave dataset' + str(mean_relative_error(y_pred, y_test)))

# Linear Regression Using Built-in Dataset
X, y= mglearn.datasets.make_wave(n_samples= 40)
X_train, X_test, y_train, y_test= train_test_split(X, y, random_state=0, test_size= 0.2)
lr_reg= LinearRegression().fit(X_train, y_train)
y_pred= lr_reg.predict(X_test)
lr_reg_score_test=lr_reg.score(X_test, y_test)
print('the R^2 score of Linear Regression testing wave dataset'+ str(lr_reg_score_test))
print('the mean relative error of Linear Regression wave dataset' + str(mean_relative_error(y_pred, y_test)))

df= pd.read_csv('/content/sample_data/california_housing_train.csv')
X= df[["total_rooms"]]
y= df["total_bedrooms"]


# KNN using California Housing Training dataset
X_train, X_test, y_train, y_test= train_test_split(X, y, random_state=0, test_size= 0.2)
knn_reg= KNeighborsRegressor(n_neighbors= 5, n_jobs= -1)
knn_reg= knn_reg.fit(X_train, y_train)
y_pred= knn_reg.predict(X_test)
r_square= knn_reg.score(X_test, y_test)
print('The R^2 Score is of KNN on California Housing dataset'+ str(r_square))
print('the mean relative error of KNN on California Housing dataset' + str(mean_relative_error(y_pred, y_test)))




# Linear Regression using California Housing Training Dataset
X_train, X_test, y_train, y_test= train_test_split(X, y, random_state=0, test_size= 0.2)
lr_reg= LinearRegression().fit(X_train, y_train)
y_pred= lr_reg.predict(X_test)

lr_reg_score_train=lr_reg.score(X_train, y_train)
lr_reg_score_test=lr_reg.score(X_test, y_test)
print('the R^2 score is of Linear Regression on California Housing dataset'+ str(lr_reg_score_test))
print('the mean relative error of of Linear Regression on California Housing dataset' + str(mean_relative_error(y_pred, y_test)))



The R^2 Score of KNN on wave dataset0.8183022897768604
the mean relative error of KNN on wave dataset0.13446818817673806
the R^2 score of Linear Regression testing wave dataset0.6607869057739273
the mean relative error of Linear Regression wave dataset0.4713975184565494
The R^2 Score is of KNN on California Housing dataset0.8205980343355469
the mean relative error of KNN on California Housing dataset0.2104500750003855
the R^2 score is of Linear Regression on California Housing dataset0.8411231646860435
the mean relative error of of Linear Regression on California Housing dataset0.2360237740000491


In [None]:
#Question 5: Analyze the kNN regression with k=1, 3, and 9 for wave and Boston Housing datasets.

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

#KNN using Built-in dataset
X, y= mglearn.datasets.make_wave(n_samples= 60)
X_train, X_test, y_train, y_test= train_test_split(X, y, random_state= 0, test_size= 0.2)

k = [1, 3, 9]
for neighbor in k:
  knn= KNeighborsRegressor(n_neighbors=neighbor)
  knn= knn.fit(X_train, y_train)
  y_pred= knn.predict(X_test)
  print(f'For Built-in Dataset "wave", the value of k is {neighbor} and the predicted output is {np.round(y_pred, 3)}.')


print()
print()


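#KNN using California housing dataset
# Note: the feature and target are swapped relative to the earlier cells (total_bedrooms now predicts total_rooms).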
X= df[["total_bedrooms"]]
y= df["total_rooms"]

X_train, X_test, y_train, y_test= train_test_split(X, y, random_state= 42, test_size= 0.2)


k= [1, 3, 9]
for neighbor in k:
  knnh= KNeighborsRegressor(n_neighbors=neighbor)
  knnh= knnh.fit(X_train, y_train)
  y_pred= knnh.predict(X_test)
  print(f'For Built-in Dataset "California Housing", the value of k is {neighbor} and the predicted output is {np.round(y_pred, 3)}.')








For Built-in Dataset "wave", the value of k is 1 and the predicted output is [-1.547  0.799 -0.026  0.652  0.731  0.45   0.731 -0.081 -0.752 -0.746
 -2.374 -0.447].
For Built-in Dataset "wave", the value of k is 3 and the predicted output is [-1.481  0.708 -0.183  0.706  0.647  0.504  0.647 -0.41  -1.13  -0.423
 -1.401 -0.41 ].
For Built-in Dataset "wave", the value of k is 9 and the predicted output is [-1.291  0.878 -0.629  0.449  0.95   0.48   0.95  -0.99  -1.277 -0.661
 -1.309 -0.99 ].


For Built-in Dataset "California Housing", the value of k is 1 and the predicted output is [5284. 2644. 3218. ...  199.  299.  247.].
For Built-in Dataset "California Housing", the value of k is 3 and the predicted output is [5507.    2181.667 3526.333 ...  199.333  330.667  407.333].
For Built-in Dataset "California Housing", the value of k is 9 and the predicted output is [5029.    2129.111 3504.111 ...  190.667  477.444  369.556].
