<a href="https://colab.research.google.com/github/yeb2Binfang/ECE-GY-6233-optimization/blob/main/HW/LR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Coding Problem
<br>
In this problem you will analyze data using linear regression. The goal is to predict the miles per gallon a car will get from six quantities (features) describing that car. The data is split into training and testing sets. Each row in both “X” files contains the six features for a single car (plus a 1 in the last dimension), and the same row in the corresponding “y” file contains the miles per gallon for that car.
<br>


See https://archive.ics.uci.edu/ml/datasets/Auto+MPG for more details on this dataset. We will use the 2nd to 7th attributes as features to predict the 1st attribute in the data set. The data has been preprocessed, so you must use the data provided with this homework.

### Training Part

Using the training data only, write code to solve the linear regression problem

$$
\min_{\omega} \mathcal{L}=	\sum_{i=1}^{N_{train}}\|y_i-x_i\omega\|^2=(Y-X\omega)^T(Y-X\omega)
$$

In [None]:
# Load training data from X_train.csv and y_train.csv
import pandas as pd
X_train_url = 'https://raw.githubusercontent.com/yeb2Binfang/ECE-GY-6233-optimization/main/HW/X_train.csv'
y_train_url = 'https://raw.githubusercontent.com/yeb2Binfang/ECE-GY-6233-optimization/main/HW/y_train.csv'

df_X_train = pd.read_csv(X_train_url, header=None)
df_y_train = pd.read_csv(y_train_url, header=None)
df_X_train.head()


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 0.30957 | -0.36707 | 0.45545 | -0.20083 | -0.73992 | -0.80885 | 1 |
| 1 | 0.30957 | 0.3592 | -0.11611 | -0.038361 | 0.16625 | -0.80885 | 1 |
| 2 | -0.86291 | -0.99778 | -0.89551 | -1.2251 | -0.55868 | -0.26592 | 1 |
| 3 | -0.86291 | -0.69198 | -0.42787 | -0.56226 | -0.15997 | 1.0914 | 1 |
| 4 | -0.86291 | -0.92133 | -0.63571 | -1.251 | -0.41369 | 0.81993 | 1 |


$$
\begin{aligned}
\min_{\omega} \mathcal{L}&=\sum_{i=1}^{N_{train}}\|x_i\omega-y_i\|^2\\
&=(X\omega-Y)^T(X\omega-Y)\\
&=\omega^TX^TX\omega-\omega^TX^TY-Y^TX\omega+Y^TY\\
&=\omega^TX^TX\omega-2Y^TX\omega+Y^TY
\end{aligned}
$$

The two cross terms combine in the last line because $\omega^TX^TY$ is a scalar and therefore equals its transpose $Y^TX\omega$.
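As a quick sanity check on the expansion above, the sketch below evaluates the loss in three ways (row-by-row sum, residual inner product, expanded quadratic form) and compares them. It uses small synthetic data rather than the homework files, and the names `loss_sum`, `loss_matrix`, and `loss_expanded` are illustrative only.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the homework data)
rng = np.random.default_rng(0)
N, d = 20, 7
X = rng.normal(size=(N, d))
X[:, -1] = 1.0                      # constant 1 in the last dimension
Y = rng.normal(size=(N, 1))
w = rng.normal(size=(d, 1))         # an arbitrary weight vector

# 1) Sum of squared residuals, one row at a time
loss_sum = sum(((X[i] @ w - Y[i]) ** 2).item() for i in range(N))

# 2) Matrix form (X w - Y)^T (X w - Y)
residual = X @ w - Y
loss_matrix = (residual.T @ residual).item()

# 3) Expanded form w^T X^T X w - 2 Y^T X w + Y^T Y
loss_expanded = (w.T @ X.T @ X @ w - 2 * Y.T @ X @ w + Y.T @ Y).item()

print(loss_sum, loss_matrix, loss_expanded)
```

All three printed values should agree up to floating-point rounding.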

We let $f(\omega)=\omega^TX^TX\omega-2Y^TX\omega+Y^TY$. Taking the gradient with respect to $\omega$ and setting it to zero gives

$$
\begin{aligned}
\nabla_{\omega}f(\omega)&=\frac{\partial}{\partial \omega}(\omega^TX^TX\omega-2Y^TX\omega+Y^TY)\\
&=(X^TX+X^TX)\omega-2(Y^TX)^T\\
&=2X^TX\omega - 2X^TY=0\\
&\Downarrow \\
\omega &=(X^TX)^{-1}X^TY
\end{aligned}
$$

The last step assumes that $X^TX$ is invertible.
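The gradient formula $2X^TX\omega - 2X^TY$ can be checked numerically. The sketch below compares it against central finite differences on synthetic data and confirms that it vanishes at the closed-form solution. The helpers `loss` and `grad` are hypothetical names introduced here, not part of the homework code.

```python
import numpy as np

def loss(X, Y, w):
    """Squared-error loss (X w - Y)^T (X w - Y) as a Python float."""
    r = X @ w - Y
    return (r.T @ r).item()

def grad(X, Y, w):
    """Analytic gradient 2 X^T X w - 2 X^T Y."""
    return 2 * X.T @ X @ w - 2 * X.T @ Y

# Synthetic data (an assumption for illustration, not the homework data)
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
Y = rng.normal(size=(30, 1))
w = rng.normal(size=(4, 1))

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(w.shape[0]):
    step = np.zeros_like(w)
    step[j] = eps
    numeric[j] = (loss(X, Y, w + step) - loss(X, Y, w - step)) / (2 * eps)

# Closed-form minimizer; the gradient there should be numerically zero
w_star = np.linalg.solve(X.T @ X, X.T @ Y)

print(np.max(np.abs(numeric - grad(X, Y, w))))
print(np.max(np.abs(grad(X, Y, w_star))))
```

Both printed values should be close to zero.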



In [None]:
# Compute the vector omega and print your result
import numpy as np

X_train = df_X_train.to_numpy()
y_train = df_y_train.to_numpy()
# Closed-form least-squares solution: omega = (X^T X)^{-1} X^T Y
w = np.linalg.inv(X_train.T @ X_train) @ X_train.T @ y_train
print(w)


[[-0.45626139]
 [ 0.7301673 ]
 [-0.28461867]
 [-5.58558944]
 [ 0.28957774]
 [ 2.78139798]
 [ 0.01015709]]
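Forming the explicit inverse works here, but it is not the only way to evaluate the closed-form solution. As an alternative sketch (not the approach used in this notebook), `np.linalg.solve` solves the normal equations $X^TX\omega = X^TY$ directly, and `np.linalg.lstsq` solves the least-squares problem without forming $X^TX$ at all. The comparison below uses synthetic data, so the numbers are not the homework results.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the homework data)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 7))
X[:, -1] = 1.0                      # constant 1 in the last dimension
Y = rng.normal(size=(50, 1))

# Explicit inverse, as in the cell above
w_inv = np.linalg.inv(X.T @ X) @ X.T @ Y

# Solve the normal equations X^T X w = X^T Y without inverting
w_solve = np.linalg.solve(X.T @ X, X.T @ Y)

# Least-squares solver applied to X directly
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.max(np.abs(w_inv - w_solve)))
print(np.max(np.abs(w_inv - w_lstsq)))
```

On a well-conditioned problem like this one the three results should agree to rounding error; the solver-based variants are generally preferred when $X^TX$ is close to singular.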


<br>
3. Having learned the vector $\omega$, we can predict the value $y_i$ for each $x_i$ in the testing set as
<br>
\begin{align*}
    y^{pred}_i = \omega_1 x_{i1} + \omega_2 x_{i2} + \dots + \omega_6 x_{i6} + \omega_0= x_i\omega
\end{align*}
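Because the last entry of every row is 1, the last component of the weight vector plays the role of the intercept $\omega_0$. The sketch below illustrates this with the $\omega$ values printed in this notebook and the first row of the test-set preview, copied by hand at the displayed precision; the variable names are illustrative.

```python
import numpy as np

# Weight vector as printed in this notebook (last entry multiplies the constant 1)
w = np.array([-0.45626139, 0.7301673, -0.28461867, -5.58558944,
              0.28957774, 2.78139798, 0.01015709])

# First row of the test-set preview: six features followed by the constant 1
x = np.array([0.30957, 0.34964, 0.013785, 0.52674, 0.49248, 0.277, 1.0])

# Matrix form: x_i w
y_matrix = x @ w

# Explicit form: w_1 x_1 + ... + w_6 x_6 + w_0, with w_0 stored last
y_explicit = sum(w[j] * x[j] for j in range(6)) + w[6]

print(y_matrix, y_explicit)
```

The two printed values should match, which is why no separate intercept term appears in the code.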

Using only the testing data, compute the root mean squared error (RMSE) between the predicted value $y^{pred}$ and the real value $y^{test}$.

$$
RMSE = \sqrt{\frac{1}{N_{test}}\sum_{i=1}^{N_{test}}(y_i^{pred}-y_i^{test})^2}
$$
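RMSE is the square root of the mean of the squared differences between predictions and targets. A minimal sketch of a reusable helper follows; `rmse` is a hypothetical name, and the three-point example is made up for illustration. Flattening both inputs guards against a common pitfall: subtracting an `(N, 1)` array from an `(N,)` array broadcasts to an `(N, N)` matrix.

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root mean squared error between two arrays with the same number of elements."""
    y_pred = np.ravel(y_pred)       # flatten so (N, 1) and (N,) inputs both work
    y_true = np.ravel(y_true)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Toy example: errors are 1, 0, -2, so the mean squared error is 5/3
y_pred_toy = np.array([[2.0], [4.0], [6.0]])   # column vector, like X_test.dot(w)
y_true_toy = np.array([1.0, 4.0, 8.0])         # flat vector
value = rmse(y_pred_toy, y_true_toy)
print(value)                                    # sqrt(5/3), about 1.291
```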


In [None]:
# Load testing data from X_test.csv and y_test.csv
X_test_url = 'https://raw.githubusercontent.com/yeb2Binfang/ECE-GY-6233-optimization/main/HW/X_test.csv'
y_test_url = 'https://raw.githubusercontent.com/yeb2Binfang/ECE-GY-6233-optimization/main/HW/y_test.csv'

df_X_test = pd.read_csv(X_test_url, header=None)
df_y_test = pd.read_csv(y_test_url, header=None)
df_X_test.head()


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 0.30957 | 0.34964 | 0.013785 | 0.52674 | 0.49248 | 0.277 | 1 |
| 1 | 1.4821 | 1.4868 | 1.5726 | 0.84226 | -1.4649 | -1.6232 | 1 |
| 2 | -0.86291 | -0.7111 | -0.66169 | -0.4151 | 1.1087 | 1.6343 | 1 |
| 3 | 0.30957 | 0.53121 | -0.11611 | 0.94586 | 0.52872 | -0.53739 | 1 |
| 4 | 1.4821 | 1.0281 | 0.89711 | 0.26774 | -1.5736 | 0.54847 | 1 |


In [None]:
# Compute the RMSE and print the result
X_test = df_X_test.to_numpy()
y_test = df_y_test.to_numpy()
y_pred = X_test.dot(w)
RMSE = np.sqrt(np.sum((y_pred-y_test)**2)/len(y_test))
print(RMSE)

2.633643577969158
