## Linear algebra matrices lab

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

### Read the Iris data

In [2]:
df = pd.read_csv("data/iris.csv")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [4]:
# look at some random rows
print(df.sample(6))

     sepal_length  sepal_width  petal_length  petal_width     species
93            5.0          2.3           3.3          1.0  versicolor
65            6.7          3.1           4.4          1.4  versicolor
56            6.3          3.3           4.7          1.6  versicolor
124           6.7          3.3           5.7          2.1   virginica
103           6.3          2.9           5.6          1.8   virginica
36            5.5          3.5           1.3          0.2      setosa


### Build a model to predict petal length from sepal length and sepal width

In [5]:
predictors = ['sepal_length', 'sepal_width']
X = df[predictors].values
y = df['petal_length'].values

In [6]:
regr = LinearRegression()
regr.fit(X, y)

print('Model parameters:')
print('intercept: {:.3f}'.format(regr.intercept_))
print('coefficients: {}'.format(regr.coef_))

Model parameters:
intercept: -2.525
coefficients: [ 1.77559255 -1.33862329]


### Lab problems

Using the model, we want to predict petal length for a given sepal length and sepal width.

We can do it like this:

In [7]:
# model coefficients
b = np.array([regr.intercept_, regr.coef_[0], regr.coef_[1]])

# predictor values
sepal_length = 4.5
sepal_width = 2.9

y_pred = b[0] + b[1]*sepal_length + b[2]*sepal_width
print(y_pred)

1.5833974099886166


We can also do it like this:

In [8]:
# the model was fit on a NumPy array, so pass a 2-D NumPy array (one row per prediction)
X_new = np.array([[sepal_length, sepal_width]])
regr.predict(X_new)

array([1.58339741])

#### Problem 1

We can also use linear algebra to make predictions.  Let's create a vector x of the predictor values.

In [9]:
x = np.array([1.0, sepal_length, sepal_width])

Can you combine this vector x with the vector b of model coefficients to make a prediction?  The leading 1.0 in x lines up with the intercept b[0], so a single operation handles the intercept along with the two coefficients.

In [65]:
# Write code to make a prediction using arrays x and b.  You only need one linear algebra operation

x.dot(b)  #solution

1.5833974099886168
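
The dot product in Problem 1 is the same sum of products that was written out term by term earlier. The sketch below checks that several equivalent NumPy spellings agree. It assumes coefficient values rounded from the printed model output, so the result may differ slightly from the notebook's value in the last digits.

```python
import numpy as np

# Assumed values: coefficients rounded from the printed model output
b = np.array([-2.525, 1.77559255, -1.33862329])

# Leading 1.0 pairs with the intercept b[0]
x = np.array([1.0, 4.5, 2.9])

# Term-by-term sum of products
explicit = b[0]*x[0] + b[1]*x[1] + b[2]*x[2]

# Three equivalent ways to write the dot product
via_method = x.dot(b)
via_operator = x @ b
via_function = np.dot(x, b)

print(np.allclose([via_method, via_operator, via_function], explicit))
```

For 1-D arrays the order of the operands does not matter, so `b.dot(x)` gives the same value.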

#### Problem 2

Continuing the previous problem, suppose we want to make several predictions at once.  Instead of a vector x, we have a matrix X, where each row of X holds the predictor values for one prediction (with the same leading 1.0).

In [23]:
X = np.array([
    [1.0, 6.5, 2.8],
    [1.0, 6.0, 2.5],
    [1.0, 4.5, 2.9]
])

In [66]:
# Write code to make three predictions using X and b.  You only need one linear algebra operation.

X.dot(b) # solution

array([5.26844483, 4.78223555, 1.58339741])
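
In practice the predictor values usually arrive without the column of ones. The sketch below shows one way to prepend that column with `np.column_stack` and confirms that the matrix product matches adding the intercept separately. The names `features` and `X_design` are hypothetical, and the coefficients are assumed values rounded from the printed model output.

```python
import numpy as np

# Assumed values: coefficients rounded from the printed model output
b = np.array([-2.525, 1.77559255, -1.33862329])

# Raw predictor values: one row per case, columns are sepal length and sepal width
features = np.array([
    [6.5, 2.8],
    [6.0, 2.5],
    [4.5, 2.9]
])

# Prepend a column of ones so the intercept is handled by the matrix product
X_design = np.column_stack([np.ones(len(features)), features])

# One matrix-vector product gives all predictions
preds = X_design @ b

# Same predictions with the intercept added separately
preds_alt = b[0] + features @ b[1:]

print(np.allclose(preds, preds_alt))
```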

#### Problem 3

Continuing the previous problem, if we have test data X_test and y_test, can we compute the difference between the predicted values and y_test using linear algebra?  Don't create any intermediate variables; just write a NumPy expression that computes the predicted values minus the test values.

In [67]:
X_test = np.array([
    [1.0, 4.3, 2.8],
    [1.0, 4.6, 3.5],
    [1.0, 4.5, 2.9]
])
y_test = np.array([6.0, 1.5, 1.5])

In [68]:
# Write code to compute the error on three predictions from X_test.  Write a single NumPy expression.

X_test.dot(b) - y_test  #solution


array([-4.63785877, -0.54221731,  0.08339741])

#### Problem 4

Continuing the previous problem, now write an expression that gives the sum of the squared errors.  Again, do not use any intermediate variables.  Use X_test, y_test, and b as before.

In [69]:
# Write code to compute the sum of the squared errors.  Write a single NumPy expression.
# YOUR CODE HERE
(X_test.dot(b) - y_test).dot(X_test.dot(b) - y_test)  # solution


21.810688...

In [71]:
# compute the answer another way to make sure your answer is correct
# YOUR CODE HERE

error = (X_test.dot(b) - y_test)
SSE = np.sum(error**2)
print(SSE)

21.810688...
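
The sum of squared errors is the dot product of the residual vector with itself, which is also the squared Euclidean norm of that vector. The sketch below checks three equivalent forms on the test data from Problem 3. It assumes coefficient values rounded from the printed model output, and it introduces the variable `r` purely for readability, which the lab problem itself asks you to avoid.

```python
import numpy as np

# Assumed values: coefficients rounded from the printed model output
b = np.array([-2.525, 1.77559255, -1.33862329])

X_test = np.array([
    [1.0, 4.3, 2.8],
    [1.0, 4.6, 3.5],
    [1.0, 4.5, 2.9]
])
y_test = np.array([6.0, 1.5, 1.5])

# Residual vector: predicted minus observed
r = X_test @ b - y_test

sse_dot = r @ r                    # dot product of r with itself
sse_sum = np.sum(r**2)             # square each element, then sum
sse_norm = np.linalg.norm(r)**2    # squared Euclidean norm

print(np.allclose([sse_dot, sse_sum], sse_norm))
```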
