### 2.1. (Linear regression, 25 points)

We have a labeled dataset $ D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} $, where $ x_i \in \mathbb{R}^d $ is the $ d $-dimensional feature vector of the $ i $-th sample and $ y_i \in \mathbb{R} $ is its real-valued target (label).

A linear regression model is given by

$$
f_{w_0, w_1, \ldots, w_d}(x) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d,
$$

where, in this formula, $ x_j $ denotes the $ j $-th component of the feature vector $ x $, $ w_0 $ is often called the bias, and $ w_1, w_2, \ldots, w_d $ are often called the coefficients.
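
As a minimal sketch of the model definition above, the prediction for a single feature vector is the bias plus a dot product. The function name `predict` and the numeric values are hypothetical, chosen only for illustration with $ d = 3 $.

```python
import numpy as np

def predict(w0, w, x):
    """Evaluate f(x) = w0 + w_1 x_1 + ... + w_d x_d for one feature vector."""
    return w0 + np.dot(w, x)

# Hypothetical values for illustration only (d = 3).
w0 = 0.5
w = np.array([1.0, -2.0, 0.25])
x = np.array([2.0, 1.0, 4.0])

y_hat = predict(w0, w, x)  # 0.5 + (2.0 - 2.0 + 1.0)
print(y_hat)
```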

Now we want to use the dataset $ D $ to fit a linear regression model. We provide a training set $ D_{\text{train}} $ of 2024 labeled samples with 11 features (see `linear_regression_train.txt`) to fit the model, and a test set $ D_{\text{test}} $ of 10 unlabeled samples with 11 features (see `linear_regression_test.txt`) on which to run the fitted model.

1. Use the `LinearRegression` class from scikit-learn to obtain the bias $ w_0 $ and the coefficients $ w_1, w_2, \ldots, w_{11} $, then compute $ \hat{y} = f(x) $ for each sample in the test set $ D_{\text{test}} $ using the trained model. (Include the estimates of $ w_0, w_1, \ldots, w_{11} $ and the values of $ \hat{y} $ in your answers.)

2. Implement linear regression yourself to obtain the bias $ w_0 $ and the coefficients $ w_1, w_2, \ldots, w_{11} $, then compute $ \hat{y} = f(x) $ for each sample in the test set $ D_{\text{test}} $. (Include the estimates of $ w_0, w_1, \ldots, w_{11} $ and the values of $ \hat{y} $ in your answers. You may use an existing Python package to compute the inverse of a matrix.)


> **Hint:** Note that for `linear_regression_train.txt`, there are 2024 rows with 12 columns where the first 11 columns are features $ x $ and the last column is target $ y $. The `linear_regression_test.txt` only contains 10 rows with 11 columns (features). Both of these tasks require the submission of code and results. Put all the code in a “HW1_yourID_Q1.ipynb” Jupyter notebook file. (A “.py” file is also acceptable.)
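
The column layout described in the hint can be checked before fitting. The sketch below is an illustration only: it builds a small in-memory stand-in for the training file (three rows of made-up numbers) and assumes the columns are separated by single spaces, which the hint does not state.

```python
import io
import pandas as pd

# Hypothetical stand-in for the training file: 3 rows, 12 space-separated columns.
sample_text = "\n".join(
    " ".join(str(float(r * 12 + c)) for c in range(12)) for r in range(3)
)
sample = pd.read_csv(io.StringIO(sample_text), sep=" ", header=None)

X = sample.iloc[:, :-1]  # first 11 columns: features
y = sample.iloc[:, -1]   # last column: target

# Fail early if the file does not have the expected layout.
assert X.shape[1] == 11, f"expected 11 feature columns, got {X.shape[1]}"
print(X.shape, y.shape)
```

With the real files, the same check would be applied to the result of reading `linear_regression_train.txt`, where the row count should be 2024.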


---

### 1.

In [1]:
import pandas as pd

# Load the train data
train_data = pd.read_csv('linear_regression_train.txt', sep=' ', header=None)

# Load the test data
test_data = pd.read_csv('linear_regression_test.txt', sep=' ', header=None)

# Initialize training variables
X_train = train_data.iloc[:, :-1]  
y_train = train_data.iloc[:, -1]  

# Initialize testing variables
X_test = test_data

In [2]:
from sklearn.linear_model import LinearRegression

# Initialise the linear model and fit it to the training data
linreg = LinearRegression()
linreg.fit(X_train, y_train)

In [4]:
# Bias of fitted model
print('Bias:', linreg.intercept_)

# Coefficients of fitted model
for i, coef in enumerate(linreg.coef_, start=1):
    print(f'w_{i}: {coef}')

Bias: 3.6136460266690635
w_1: 0.015325690386246947
w_2: 0.0002522764063409011
w_3: 0.000720386849964616
w_4: 0.9991635645789793
w_5: 0.9997402360310115
w_6: 1.0006233993583626
w_7: 0.9988323588429893
w_8: 1.0000013015236555
w_9: 1.0002245483558736
w_10: 0.9990396151482012
w_11: 0.9993448460859441


In [6]:
# Predict on test data
y_hat = linreg.predict(X_test)

# Output predictions with labels
for i, y in enumerate(y_hat, start=1):
    print(f'y_hat_{i}: {y}')


y_hat_1: -56.11129687958417
y_hat_2: -173.5165197093181
y_hat_3: -6.770877912226868
y_hat_4: 209.51709044187203
y_hat_5: 116.89029785098518
y_hat_6: -100.29084527235771
y_hat_7: -310.12783900144865
y_hat_8: 501.3863019426078
y_hat_9: 244.11476780856393
y_hat_10: 18.566393254550345


In [8]:
# Create dataframe for test data, add in column headers from feature 1-11
df = pd.DataFrame(X_test)
df.columns = [f'feature_{i}' for i in range(1, 12)]

# Add in the predicted values
df['predicted'] = y_hat

df

| | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | feature_10 | feature_11 | predicted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.958165 | 36.019125 | -31.786659 | 83.365958 | -8.600305 | -44.034774 | -68.927666 | -24.71759 | 75.467763 | -98.755544 | 25.905601 | -56.111297 |
| 1 | -83.581906 | -80.202289 | -55.363995 | 54.275198 | -74.347354 | -75.905742 | -24.127742 | 24.046764 | -82.404782 | 49.650006 | -46.894782 | -173.51652 |
| 2 | 84.241254 | -39.832637 | -89.555024 | -55.594047 | 69.746813 | -38.047707 | 97.234721 | -88.253101 | 40.840461 | 5.026011 | -42.477354 | -6.770878 |
| 3 | -75.902745 | 51.805697 | -90.406799 | 65.480477 | 96.574654 | 37.581815 | 14.725089 | -22.905153 | 64.97539 | -29.075757 | -20.21986 | 209.51709 |
| 4 | 93.580876 | 62.174052 | -86.4197 | 95.165741 | 12.784854 | -63.811706 | 79.181736 | 47.505544 | -54.118494 | 63.500403 | -68.075415 | 116.890298 |
| 5 | -38.536146 | 86.274667 | -75.406284 | -63.433126 | 67.366338 | -23.90929 | 80.107992 | -43.579923 | -44.396491 | -56.729703 | -18.690957 | -100.290845 |
| 6 | 60.415815 | 45.947195 | -77.057367 | 9.909093 | -30.667548 | -82.719948 | -79.452907 | 20.684428 | -98.66395 | 6.542145 | -60.306773 | -310.127839 |
| 7 | -82.56096 | -65.584762 | -78.43361 | 83.671333 | 82.313412 | 51.260561 | 98.766245 | 53.27588 | 69.00278 | 27.717866 | 33.310546 | 501.386302 |
| 8 | 53.798119 | -2.079414 | 19.894084 | 84.800805 | 97.561526 | 49.684502 | 12.306095 | 95.116051 | -84.758244 | 4.39063 | -19.348425 | 244.114768 |
| 9 | 96.623825 | -1.8869 | 69.575141 | -77.278018 | 74.902899 | -71.225067 | -24.293523 | 51.774044 | -18.352882 | 90.773252 | -12.824746 | 18.566393 |


---

### 2.
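
The Normal Equation used in this part follows from minimizing the squared error over the training set. A brief sketch of the standard derivation is given here; the matrix notation is an assumption of this note and does not appear in the problem statement. Let $ X \in \mathbb{R}^{n \times (d+1)} $ be the training features with a leading column of ones, $ y \in \mathbb{R}^n $ the vector of targets, and $ w = (w_0, w_1, \ldots, w_d)^\top $.

$$
L(w) = \lVert Xw - y \rVert_2^2,
\qquad
\nabla_w L(w) = 2 X^\top (Xw - y) = 0
\;\Longrightarrow\;
X^\top X \, w = X^\top y
\;\Longrightarrow\;
\hat{w} = (X^\top X)^{-1} X^\top y,
$$

where the last step assumes that $ X^\top X $ is invertible.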

In [12]:
import numpy as np

# Add a column of 1's to X_train for the bias term
X_train_b = np.c_[np.ones((X_train.shape[0], 1)), X_train] 

# Calculate the optimal weights using the Normal Equation
theta_best = np.linalg.inv(X_train_b.T.dot(X_train_b)).dot(X_train_b.T).dot(y_train)

# Extract bias and coefficients
w_0_manual = theta_best[0]
w_manual = theta_best[1:]

X_test = test_data.values  # Test features

# Add a column of 1's to X_test for the bias term
X_test_b = np.c_[np.ones((X_test.shape[0], 1)), X_test]

# Predict the target values for the test set
y_hat_manual = X_test_b.dot(theta_best)
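
The explicit inverse is permitted by the assignment, but it is not the only way to solve the same least-squares problem. The sketch below compares it with `np.linalg.solve` and `np.linalg.lstsq`, which avoid forming the inverse. It runs on synthetic data with made-up weights, not on the assignment files, so every value in it is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 200 samples, 11 features, known weights.
n, d = 200, 11
X = rng.uniform(-100, 100, size=(n, d))
true_w = np.concatenate(([3.5], np.ones(d)))  # [w_0, w_1, ..., w_d]
X_b = np.c_[np.ones((n, 1)), X]               # prepend the bias column
y = X_b @ true_w + rng.normal(scale=0.1, size=n)

# (a) Normal Equation with an explicit inverse, as in the cell above.
w_inv = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# (b) Solve the linear system X^T X w = X^T y without forming the inverse.
w_solve = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

# (c) Least-squares solver, which also copes with a rank-deficient X_b.
w_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print(np.allclose(w_inv, w_solve), np.allclose(w_inv, w_lstsq))
```

On well-conditioned data such as this, the three approaches should agree closely; the alternatives mainly matter when $ X^\top X $ is close to singular.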


In [13]:
# Output bias
print(f"Manual Bias (w_0): {w_0_manual}")

Manual Bias (w_0): 3.6136460266689294


In [14]:
# Output the coefficients
for i, coef in enumerate(w_manual, start=1):
    print(f'Manual Coefficient w_{i}: {coef}')

Manual Coefficient w_1: 0.015325690386246597
Manual Coefficient w_2: 0.00025227640634090803
Manual Coefficient w_3: 0.0007203868499653398
Manual Coefficient w_4: 0.9991635645789786
Manual Coefficient w_5: 0.999740236031012
Manual Coefficient w_6: 1.0006233993583618
Manual Coefficient w_7: 0.9988323588429896
Manual Coefficient w_8: 1.0000013015236555
Manual Coefficient w_9: 1.0002245483558734
Manual Coefficient w_10: 0.9990396151482013
Manual Coefficient w_11: 0.9993448460859444


In [15]:
# Output the y_hats
for i, y in enumerate(y_hat_manual, start=1):
    print(f'Manual Predicted value y_hat_{i}: {y}')

Manual Predicted value y_hat_1: -56.11129687958437
Manual Predicted value y_hat_2: -173.5165197093183
Manual Predicted value y_hat_3: -6.770877912226993
Manual Predicted value y_hat_4: 209.51709044187183
Manual Predicted value y_hat_5: 116.89029785098501
Manual Predicted value y_hat_6: -100.29084527235779
Manual Predicted value y_hat_7: -310.1278390014488
Manual Predicted value y_hat_8: 501.3863019426076
Manual Predicted value y_hat_9: 244.11476780856376
Manual Predicted value y_hat_10: 18.56639325455041
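
The manual results above match the scikit-learn results from part 1 up to floating-point rounding (the printed values differ only around the 13th significant digit). A check of this kind can be automated. The sketch below does so on synthetic data with hypothetical weights, since the assignment files are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic data (hypothetical): 300 samples, 11 features.
X = rng.uniform(-100, 100, size=(300, 11))
y = 2.0 + X.sum(axis=1) + rng.normal(scale=0.5, size=300)

# Fit with scikit-learn and stack [w_0, w_1, ..., w_11] into one vector.
model = LinearRegression().fit(X, y)
theta_sklearn = np.concatenate(([model.intercept_], model.coef_))

# Fit with the Normal Equation on the bias-augmented design matrix.
X_b = np.c_[np.ones((X.shape[0], 1)), X]
theta_manual = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Report the largest absolute disagreement between the two weight vectors.
max_diff = np.max(np.abs(theta_sklearn - theta_manual))
print(f"max |difference| = {max_diff:.2e}")
```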


In [17]:
# Create dataframe for test data, add in column headers from feature 1-11
df = pd.DataFrame(X_test)
df.columns = [f'feature_{i}' for i in range(1, 12)]

# Add in the predicted values
df['predicted'] = y_hat_manual

# Shift the predicted column to the left
df = df[['predicted'] + [col for col in df.columns if col != 'predicted']]

df

| | predicted | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | feature_10 | feature_11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -56.111297 | 32.958165 | 36.019125 | -31.786659 | 83.365958 | -8.600305 | -44.034774 | -68.927666 | -24.71759 | 75.467763 | -98.755544 | 25.905601 |
| 1 | -173.51652 | -83.581906 | -80.202289 | -55.363995 | 54.275198 | -74.347354 | -75.905742 | -24.127742 | 24.046764 | -82.404782 | 49.650006 | -46.894782 |
| 2 | -6.770878 | 84.241254 | -39.832637 | -89.555024 | -55.594047 | 69.746813 | -38.047707 | 97.234721 | -88.253101 | 40.840461 | 5.026011 | -42.477354 |
| 3 | 209.51709 | -75.902745 | 51.805697 | -90.406799 | 65.480477 | 96.574654 | 37.581815 | 14.725089 | -22.905153 | 64.97539 | -29.075757 | -20.21986 |
| 4 | 116.890298 | 93.580876 | 62.174052 | -86.4197 | 95.165741 | 12.784854 | -63.811706 | 79.181736 | 47.505544 | -54.118494 | 63.500403 | -68.075415 |
| 5 | -100.290845 | -38.536146 | 86.274667 | -75.406284 | -63.433126 | 67.366338 | -23.90929 | 80.107992 | -43.579923 | -44.396491 | -56.729703 | -18.690957 |
| 6 | -310.127839 | 60.415815 | 45.947195 | -77.057367 | 9.909093 | -30.667548 | -82.719948 | -79.452907 | 20.684428 | -98.66395 | 6.542145 | -60.306773 |
| 7 | 501.386302 | -82.56096 | -65.584762 | -78.43361 | 83.671333 | 82.313412 | 51.260561 | 98.766245 | 53.27588 | 69.00278 | 27.717866 | 33.310546 |
| 8 | 244.114768 | 53.798119 | -2.079414 | 19.894084 | 84.800805 | 97.561526 | 49.684502 | 12.306095 | 95.116051 | -84.758244 | 4.39063 | -19.348425 |
| 9 | 18.566393 | 96.623825 | -1.8869 | 69.575141 | -77.278018 | 74.902899 | -71.225067 | -24.293523 | 51.774044 | -18.352882 | 90.773252 | -12.824746 |
