# Module 10 – Regression via Linear Algebra

## DSC 40A, Summer 2023

In [None]:
import pandas as pd
import numpy as np


We'll work with the same dataset of data scientist salaries. As before, we've cleaned the data and omitted anyone with a salary of 1 million dollars or more.

In [None]:
# Run this cell to load in our dataset. We've already cleaned it for you.
salaries = pd.read_csv('data/data_scientist_salaries_for_regression.csv')
# drop=True discards the old index instead of keeping it as an extra 'index' column.
salaries = salaries[salaries['Salary'] < 1_000_000].reset_index(drop=True)
salaries

### Design matrix

We're trying to find a linear prediction rule where the only feature is `'YearsExperience'`. Our design matrix would then look like this:

In [None]:
# Don't worry about the details of this code; it builds the design matrix as a DataFrame.
X_as_df = pd.DataFrame()
X_as_df['1'] = np.ones(salaries.shape[0]).astype(int)
X_as_df['YearsExperience'] = salaries['YearsExperience']
X_as_df

It looks nice as a DataFrame, but we need the design matrix as a NumPy array to do matrix computations with it.

In [None]:
# Converting to a numpy array
X = X_as_df.values
X
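
The same kind of design matrix can also be built directly in NumPy, without going through a DataFrame. The following is a minimal sketch on made-up data; `years` and `X_toy` are hypothetical names, and the values are not taken from the salaries dataset.

```python
import numpy as np

# Hypothetical feature values, standing in for salaries['YearsExperience'].
years = np.array([1, 3, 5, 10])

# Stack a column of ones (for the intercept) next to the feature column.
X_toy = np.column_stack([np.ones(len(years)), years])
print(X_toy.shape)  # (4, 2)
```

Each row corresponds to one person: a 1, followed by that person's years of experience.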

### Observation vector

The observation vector represents each person's actual salary (as they reported it). We'll represent salaries in thousands of dollars. Our observation vector would look like this:

In [None]:
y = salaries['Salary']/1000
y

Again, we'll need this as a NumPy array.

In [None]:
# Converting to a numpy array
y = y.values
y

### Making predictions

For any vector $\vec{w} \in \mathbb{R}^{2}$, we can make predictions using

$$\vec{h} = X \vec{w}$$

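Let's test it out with $\vec{w} = \begin{bmatrix} 80 \\ 3 \end{bmatrix}$, which predicts a base salary of 80 thousand dollars plus 3 thousand dollars per year of experience.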

In [None]:
# The @ operator performs matrix multiplication.
X @ np.array([80, 3])

Our goal is to get the above array as close to `y` as possible. In other words, we want our predicted salaries to be as close to the actual salaries as possible.

In [None]:
y
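
One common way to measure how close the predictions are to the actual values is mean squared error: the average of the squared differences between them. Below is a minimal sketch on made-up data; `X_toy`, `y_toy`, and `mean_squared_error` are hypothetical names, and the numbers are not from the salaries dataset.

```python
import numpy as np

# Hypothetical toy data: years of experience and salaries in thousands of dollars.
years = np.array([1, 3, 5, 10])
y_toy = np.array([85.0, 90.0, 100.0, 105.0])

# Design matrix: a column of ones, then the feature column.
X_toy = np.column_stack([np.ones(len(years)), years])

def mean_squared_error(X, y, w):
    """Average squared difference between the actual values y and the predictions X @ w."""
    return np.mean((y - X @ w) ** 2)

w_guess = np.array([80, 3])
mse_guess = mean_squared_error(X_toy, y_toy, w_guess)
print(mse_guess)  # 13.75
```

A smaller value means the predictions sit closer to the actual values, so different choices of $\vec{w}$ can be compared by this number.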

### Implementing the solution

We claimed that the vector $\vec{w}$ that minimizes

$$R_{sq}(\vec{w}) = \frac{1}{n} || \vec{y} - X \vec{w} ||^2$$

is

$$\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$$

In [None]:
# Given a design matrix X and an observation vector y, return the optimal parameter vector w*.
# Assumes X.T @ X is invertible.
def least_squares(X, y):
    return np.linalg.inv(X.T @ X) @ X.T @ y

In [None]:
w_star = least_squares(X, y)

In [None]:
w_star
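
As a sanity check, the formula can be compared against NumPy's built-in solvers. The sketch below uses made-up data (`X_toy` and `y_toy` are hypothetical, not the salaries dataset) and computes the parameter vector three ways: with the explicit inverse, with `np.linalg.solve` on the normal equations $X^TX\vec{w} = X^T\vec{y}$, and with `np.linalg.lstsq`.

```python
import numpy as np

# Hypothetical toy data, not the salaries dataset.
X_toy = np.array([[1.0, 1.0],
                  [1.0, 3.0],
                  [1.0, 5.0],
                  [1.0, 10.0]])
y_toy = np.array([85.0, 90.0, 100.0, 105.0])

# The formula from above, using an explicit inverse.
w_inverse = np.linalg.inv(X_toy.T @ X_toy) @ X_toy.T @ y_toy

# Solve the normal equations (X^T X) w = X^T y without forming the inverse.
w_solve = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ y_toy)

# NumPy's least squares routine, which minimizes ||y - Xw||^2 directly.
w_lstsq, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

print(np.allclose(w_inverse, w_solve), np.allclose(w_inverse, w_lstsq))
```

All three should agree up to floating-point error. In practice, `np.linalg.solve` or `np.linalg.lstsq` is often preferred over an explicit inverse for numerical stability, though for a small, well-behaved design matrix the difference is negligible.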

What salary should someone with 10 years of experience expect? Since `y` is measured in thousands of dollars, the prediction is too.

In [None]:
np.array([1, 10]) @ w_star

### Comparing to the formulas

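We get the same results using linear algebra as we got using the formulas from last time. Here $r$ is the correlation between years of experience ($x$) and salary ($y$), and $\sigma_x$ and $\sigma_y$ are their standard deviations: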

$$w_1^* = r \frac{\sigma_y}{\sigma_x}$$

$$w_0^* = \bar{y} - w_1^* \bar{x}$$

In [None]:
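def correlation(x, y):
    """Correlation coefficient r: the mean of the product of x and y in standard units."""
    x = np.array(x)
    y = np.array(y)
    
    # Convert each variable to standard units.
    x_su = (x - np.mean(x)) / np.std(x)
    y_su = (y - np.mean(y)) / np.std(y)
    
    return np.mean(x_su * y_su)

def slope(x, y):
    """Optimal slope, w1* = r * sigma_y / sigma_x."""
    return correlation(x, y) * np.std(y) / np.std(x)

def intercept(x, y):
    """Optimal intercept, w0* = mean(y) - w1* * mean(x)."""
    return np.mean(y) - slope(x, y) * np.mean(x)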

In [None]:
intercept(X[:, 1], y)

In [None]:
slope(X[:, 1], y)
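
Rather than comparing the numbers by eye, the agreement can be checked programmatically. The following is a minimal, self-contained sketch on made-up data; `x_toy`, `y_toy`, and the other names are hypothetical, and it uses `np.corrcoef` for $r$ in place of the `correlation` function above.

```python
import numpy as np

# Hypothetical toy data, not the salaries dataset.
x_toy = np.array([1.0, 3.0, 5.0, 10.0])
y_toy = np.array([85.0, 90.0, 100.0, 105.0])

# Linear algebra approach: w* = (X^T X)^{-1} X^T y.
X_toy = np.column_stack([np.ones(len(x_toy)), x_toy])
w_normal = np.linalg.inv(X_toy.T @ X_toy) @ X_toy.T @ y_toy

# Formula approach: slope = r * sigma_y / sigma_x, intercept = mean(y) - slope * mean(x).
r = np.corrcoef(x_toy, y_toy)[0, 1]
w1 = r * np.std(y_toy) / np.std(x_toy)
w0 = np.mean(y_toy) - w1 * np.mean(x_toy)

# np.allclose compares up to floating-point tolerance.
print(np.allclose(w_normal, [w0, w1]))
```

`np.allclose` is used instead of `==` because the two computations take different arithmetic paths and may differ in the last few decimal places.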