# DX 601 Week 11 Homework

## Introduction

In this homework, you will practice working with systems of linear equations and review previous weeks' material.

## Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx500-examples
* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Shared Imports

Do not install or use any additional modules.
Installing additional modules may cause the autograder to fail, resulting in zero points for some or all problems.

In [43]:
import math
import sys

In [44]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats
import sklearn.linear_model

## Shared Data

### Vineyard Data

This data set attempts to predict yields for a small vineyard in Lake Erie in 1991 based on the yields in the previous years.
Each row of the data set represents the yields of a row of the vineyard.
See https://github.com/EpistasisLab/pmlb/blob/master/datasets/192_vineyard/metadata.yaml for more information.

In [45]:
vineyard = pd.read_csv("https://github.com/EpistasisLab/pmlb/raw/refs/heads/master/datasets/192_vineyard/192_vineyard.tsv.gz", sep="\t")
vineyard.head()

|   | lugs_1989 | lugs_1990 | target |
|---:|---:|---:|---:|
| 0 | 1.0 | 5.0 | 9.5 |
| 1 | 3.0 | 8.0 | 17.5 |
| 2 | 3.0 | 11.0 | 18.0 |
| 3 | 3.0 | 9.0 | 20.0 |
| 4 | 5.0 | 9.5 | 20.5 |


In [46]:
vineyard_inputs = vineyard[["lugs_1989", "lugs_1990"]]
vineyard_inputs.head()

|   | lugs_1989 | lugs_1990 |
|---:|---:|---:|
| 0 | 1.0 | 5.0 |
| 1 | 3.0 | 8.0 |
| 2 | 3.0 | 11.0 |
| 3 | 3.0 | 9.0 |
| 4 | 5.0 | 9.5 |


In [47]:
vineyard_target = vineyard["target"]

## Problems

### Problem 1

Set `p1` to the value of $x$ after solving the following system of linear equations.

\begin{array}{rcl}
3x & = & 4.2 \\
\end{array}


In [8]:
# YOUR CHANGES HERE

p1 = 4.2/3

In [9]:
p1

1.4000000000000001
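
The trailing `1` in the output above comes from binary floating-point rounding, not from the algebra. A minimal sketch of comparing such a result against an expected value with a tolerance; the names `x` and `expected` are illustrative and not part of the assignment.

```python
import math

x = 4.2 / 3
expected = 1.4

# Exact equality can fail for floats, so compare with a tolerance instead.
print(x == expected)
print(math.isclose(x, expected))
```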

### Problem 2

Set `p2` to be a tuple of `(x, y)` where $x$ and $y$ are the solution to the following system of linear equations.

\begin{array}{rcl}
3x + 2y & = & 8.6 \\
2x + 5y & = & 13.8 \\
\end{array}


Hint: Just do this by hand.

In [10]:
# YOUR CHANGES HERE

# 3x + 2y = 8.6
# 2y = 8.6 - 3x
# y = 4.3 - (3/2)(x)

# 2x + 5(4.3 - (3/2)(x)) = 13.8
# 2x + 21.5 - 7.5x = 13.8
# 21.5 = 13.8 + 5.5x
# 5.5x = 7.7
# x = 1.4

# y = 4.3 - (3/2)(1.4)
# y = 4.3 - 2.1
# y = 2.2

p2 = (1.4, 2.2)

In [11]:
p2

(1.4, 2.2)
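
A hand calculation like this one can be cross-checked numerically. The sketch below assumes NumPy's `np.linalg.solve` is an acceptable way to check the work; the names `A`, `b`, and `solution` are illustrative.

```python
import numpy as np

# Coefficients and constants of 3x + 2y = 8.6 and 2x + 5y = 13.8.
A = np.array([[3.0, 2.0],
              [2.0, 5.0]])
b = np.array([8.6, 13.8])

solution = np.linalg.solve(A, b)
print(solution)

# Substituting the solution back should reproduce the right-hand side.
print(np.allclose(A @ solution, b))
```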

### Problem 3

Set `p3` to be the x intercept of the following equation.

\begin{array}{rcl}
4x + 2y + 3z & = & 12 \\
\end{array}

In [12]:
# YOUR CHANGES HERE

p3 = 3

In [13]:
p3

3

### Problem 4

Set `p4` to be the sum of the 5 axis intercepts of the following equation.

\begin{array}{rcl}
9a + 4b + 27c + 6d + 3e & = & 36 \\
\end{array}

In [14]:
# YOUR CHANGES HERE

# a = 4
# b = 9
# c = 1.33
# d = 6
# e = 12

p4 = 4 + 9 + 36/27 + 6 + 12

In [15]:
p4

32.333333333333336
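
Each axis intercept comes from setting every other variable to zero, which leaves `coefficient * variable = constant`. A short sketch of that arithmetic, with illustrative variable names:

```python
import numpy as np

coefficients = np.array([9.0, 4.0, 27.0, 6.0, 3.0])  # a, b, c, d, e
constant = 36.0

# With the other variables at zero, each intercept is constant / coefficient.
intercepts = constant / coefficients
print(intercepts)
print(intercepts.sum())
```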

### Problem 5

Set `p5` to the augmented matrix of the following system of linear equations.

\begin{array}{rcl}
3x + 2y + 13z = 10 \\
7x + 2y - 13z = 23 \\
\end{array}

In [16]:
# YOUR CHANGES HERE

# coefficients
A = np.array([[3, 2, 13],
              [7, 2, -13]])

# constants
B = np.array([[10],
              [23]])

augmented_matrix = np.concatenate((A, B), axis=1)

p5 = augmented_matrix

In [17]:
p5

array([[  3,   2,  13,  10],
       [  7,   2, -13,  23]])

### Problem 6

Set `p6` to the rank of the following system of linear equations.

\begin{array}{rcl}
3x + 2y + 0z = 3 \\
2x + 3y + 1z = 5 \\
5x + 5y + 5z = 20 \\
\end{array}

In [18]:
# YOUR CHANGES HERE

A = np.array([[3, 2, 0],
              [2, 3, 1],
              [5, 5, 5]])

B = np.array([[3],
              [5],
              [20]])

augmented_matrix = np.concatenate((A, B), axis=1)
rank = np.linalg.matrix_rank(augmented_matrix)

p6 = rank

In [19]:
p6

np.int64(3)
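
Comparing the rank of the coefficient matrix with the rank of the augmented matrix also shows whether a system is consistent and whether its solution is unique (the Rouché-Capelli theorem). A sketch for the system above, with illustrative variable names:

```python
import numpy as np

A = np.array([[3, 2, 0],
              [2, 3, 1],
              [5, 5, 5]])
b = np.array([[3], [5], [20]])

rank_A = np.linalg.matrix_rank(A)
rank_aug = np.linalg.matrix_rank(np.hstack((A, b)))
n_unknowns = A.shape[1]

# Equal ranks mean at least one solution; rank equal to the number of
# unknowns means that solution is unique.
consistent = rank_A == rank_aug
unique = consistent and rank_A == n_unknowns
print(rank_A, rank_aug, consistent, unique)
```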

### Problem 7

Consider the following system of linear equations.

\begin{array}{rcl}
3x + 2y + 0z = 3 \\
2x + 3y + 1z = 5 \\
5x + 5y + 5z = 20 \\
\end{array}

This system could be rewritten as follows.

\begin{array}{rcl}
\mathbf{A}
\begin{bmatrix}
x \\ y \\ z \\
\end{bmatrix}
& = &
\begin{bmatrix}
3 \\ 5 \\ 20 \\
\end{bmatrix}
\end{array}

Set `p7` to $\mathbf{A}$.

In [30]:
# YOUR CHANGES HERE

p7 = np.array([[3, 2, 0],
               [2, 3, 1],
               [5, 5, 5]])

In [31]:
p7

array([[3, 2, 0],
       [2, 3, 1],
       [5, 5, 5]])
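
In the matrix form, $\mathbf{A}$ holds only the coefficients of $x$, $y$, and $z$, while the constants form the right-hand side vector. A sketch of splitting an augmented matrix into the two parts by slicing; the names `augmented`, `A`, and `b` are illustrative.

```python
import numpy as np

augmented = np.array([[3, 2, 0, 3],
                      [2, 3, 1, 5],
                      [5, 5, 5, 20]])

A = augmented[:, :-1]   # coefficient matrix: every column except the last
b = augmented[:, -1]    # constants: the last column

# Solving A @ x = b recovers the values of x, y, z.
x = np.linalg.solve(A, b)
print(A)
print(x)
```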

### Problem 8

Set `p8` to the number of free variables in the following system of linear equations.

\begin{array}{rcl}
x + 3y + 4z = 3 \\
0x + 0y + 1z = 2 \\
x + 3y + 5z = 5 \\
\end{array}

In [28]:
# YOUR CHANGES HERE

A = np.array([[1, 3, 4],
              [0, 0, 1],
              [1, 3, 5]])

B = np.array([[3],
              [2],
              [5]])

augmented_matrix = np.concatenate((A, B), axis=1)

augmented_matrix_copy = augmented_matrix.astype(float).copy()

n_rows, n_cols = augmented_matrix_copy.shape
row = 0
for col in range(n_cols):
    if row >= n_rows:
        break

    # Partial pivoting: among rows `row` through the end, find the entry of
    # largest magnitude in this single column.
    np_abs = np.abs(augmented_matrix_copy[row:, col])
    max_row = row + np.argmax(np_abs)
    if np.abs(augmented_matrix_copy[max_row, col]) < 1e-12:
        # No pivot in this column: move to the next column, stay on the same row.
        continue

    augmented_matrix_copy[[row, max_row]] = augmented_matrix_copy[[max_row, row]]

    augmented_matrix_copy[row] = augmented_matrix_copy[row] / augmented_matrix_copy[row, col]

    for j in range(n_rows):
        if j != row:
            augmented_matrix_copy[j] = augmented_matrix_copy[j] - augmented_matrix_copy[j, col] * augmented_matrix_copy[row]

    row += 1

augmented_matrix_copy = np.where(np.isclose(augmented_matrix_copy, 0.0, atol=1e-12), 0.0, augmented_matrix_copy)
print(augmented_matrix_copy)

# The pivots are in the x and z columns, so y is the only free variable.
p8 = 1

[[ 1.  3.  0. -5.]
 [ 0.  0.  1.  2.]
 [ 0.  0.  0.  0.]]


In [29]:
p8

1
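
For a consistent system, the number of free variables equals the number of unknowns minus the rank of the coefficient matrix. A sketch of that count using `np.linalg.matrix_rank`, with illustrative variable names:

```python
import numpy as np

A = np.array([[1, 3, 4],
              [0, 0, 1],
              [1, 3, 5]])
b = np.array([[3], [2], [5]])

rank_A = np.linalg.matrix_rank(A)
rank_aug = np.linalg.matrix_rank(np.hstack((A, b)))

# The count only makes sense when the system has at least one solution.
assert rank_A == rank_aug
free_variables = A.shape[1] - rank_A
print(free_variables)
```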

### Problem 9

Set `p9` to any solution `(x, y, z)` to the following system of linear equations.

\begin{array}{rcl}
2x + 4y + 0z = 16 \\
1x + 3y + 1z = 16 \\
3x + 0y + 0z = 6 \\
\end{array}

In [36]:
# YOUR CHANGES HERE

A = np.array([[2, 4, 0],
              [1, 3, 1],
              [3, 0, 0]])

B = np.array([[16],
              [16],
              [6]])

augmented_matrix = np.concatenate((A, B), axis=1)

augmented_matrix_copy = augmented_matrix.astype(float).copy()

n = augmented_matrix_copy.shape[0]
col = 0
for i in range(n):
    if col >= augmented_matrix_copy.shape[1]:
        break

    # augmented_matrix_copy[i:, col] selects rows i through the end (a range)
    # from the single column col.
    np_abs = np.abs(augmented_matrix_copy[i:, col])
    max_row = i + np.argmax(np_abs)
    if np.abs(augmented_matrix_copy[max_row, col]) < 1e-12:
        col += 1
        continue

    augmented_matrix_copy[[i, max_row]] = augmented_matrix_copy[[max_row, i]]

    augmented_matrix_copy[i] = augmented_matrix_copy[i] / augmented_matrix_copy[i, col]

    for j in range(n):
        if i != j:
            augmented_matrix_copy[j] = augmented_matrix_copy[j] - augmented_matrix_copy[j, col] * augmented_matrix_copy[i]

    col += 1

augmented_matrix_copy = np.where(np.isclose(augmented_matrix_copy, 0.0, atol=1e-12), 0.0, augmented_matrix_copy)
print(augmented_matrix_copy)

p9 = (2, 3, 5)

[[1. 0. 0. 2.]
 [0. 1. 0. 3.]
 [0. 0. 1. 5.]]


In [34]:
p9

(2, 3, 5)
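
A solution read from the last column of the reduced matrix can be checked by substituting it into the original equations. A minimal sketch; the name `candidate` is illustrative.

```python
import numpy as np

A = np.array([[2, 4, 0],
              [1, 3, 1],
              [3, 0, 0]])
b = np.array([16, 16, 6])

candidate = np.array([2, 3, 5])  # read from the last column of the reduced matrix

# A candidate solves the system when A @ candidate reproduces b.
print(A @ candidate)
print(np.array_equal(A @ candidate, b))
```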

### Problem 10

Set `p10` to any solution `(x, y, z)` to the following system of linear equations.

\begin{array}{rcl}
x + 3y + 0z = 3 \\
0x + 0y + 1z = 2 \\
\end{array}

Hint: these equations are in reduced row echelon form, so there are shortcuts to picking solutions.

In [41]:
# YOUR CHANGES HERE

# x + 3y = 3
# x = 3 - 3y
#
# z = 2/1
# z = 2

p10 = (3, 0, 2)

In [42]:
p10

(3, 0, 2)
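
Because $y$ is a free variable here, every choice of $y$ gives a valid solution with $x = 3 - 3y$ and $z = 2$. A sketch that generates a few of them and checks each one; `solution_for` is a hypothetical helper written for this illustration.

```python
import numpy as np

A = np.array([[1, 3, 0],
              [0, 0, 1]])
b = np.array([3, 2])

def solution_for(y):
    """Return (x, y, z) for a chosen value of the free variable y."""
    return np.array([3 - 3 * y, y, 2])

for y in (0, 1, -2.5):
    candidate = solution_for(y)
    print(candidate, np.allclose(A @ candidate, b))
```

Choosing `y = 0` is the shortcut the hint refers to: the pivot variables then equal the constants column.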

### Problem 11

Set `p11` to be a tuple or list of the average yields in the vineyard data set for 1989, 1990, and 1991 in that order.

In [51]:
# YOUR CHANGES HERE

p11 = (vineyard["lugs_1989"].mean(), vineyard["lugs_1990"].mean(), vineyard["target"].mean())

In [52]:
p11

(np.float64(3.2788461538461537),
 np.float64(9.653846153846153),
 np.float64(18.08653846153846))

### Problem 12

Set `p12` to the 95th percentile of the data in `q12`.

In [53]:
# DO NOT CHANGE

q12 = np.array([3.44857705, 2.09151799, 4.98803337, 3.8649001 , 1.20265499,
       3.89903439, 3.05276698, 0.92826333, 3.20371215, 1.81124845,
       3.53150155, 2.32418747, 1.81826697, 3.50670706, 1.37181554,
       2.95770001, 3.80008758, 2.65923837, 2.83248683, 2.91306525,
       2.18314379, 2.17931002, 2.9086665 , 3.26098354, 3.24755896,
       1.01129371, 4.56540725, 3.05517241, 2.32079938, 3.39392893,
       3.3886077 , 3.38112083, 3.88523072, 3.13214221, 3.73298754,
       4.11129171, 2.74133096, 2.4825709 , 3.21885293, 4.08327916,
       2.82768517, 2.1188981 , 3.45886466, 4.20440619, 2.25038228,
       1.59150786, 2.24486543, 3.49914959, 3.72254599, 1.84068517])

In [54]:
# YOUR ANSWER HERE

p12 = np.percentile(q12, 95)

In [55]:
p12

np.float64(4.162504674)
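
By default, `np.percentile` interpolates linearly between the two nearest sorted values. A sketch of that calculation done by hand on a small illustrative array (not `q12`), assuming the default interpolation method:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])  # illustrative values
q = 95

ordered = np.sort(data)
position = (len(ordered) - 1) * q / 100   # fractional index into the sorted data
lower = int(np.floor(position))
upper = min(lower + 1, len(ordered) - 1)
fraction = position - lower

# Interpolate between the two neighbouring sorted values.
manual = ordered[lower] + fraction * (ordered[upper] - ordered[lower])
print(manual, np.percentile(data, q))
```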

### Problem 13

Set `p13` to the average $L_1$ loss using the average of 1989 and 1990 vineyard yields per row to predict 1991 yields per row.

In [56]:
# YOUR CHANGES HERE

a = vineyard["lugs_1989"]
b = vineyard["lugs_1990"]
# prediction: per-row average of the 1989 and 1990 yields
y_pred = (a + b) / 2
# actual 1991 yields
y_true = vineyard["target"]

def l1_loss(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

p13 = l1_loss(y_true, y_pred)

In [57]:
p13

np.float64(11.620192307692308)
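
The average $L_1$ loss is the mean of the absolute differences between actual and predicted values. A sketch on small illustrative arrays (not the vineyard data); `mean_absolute_error` is a hypothetical helper name.

```python
import numpy as np

actual = np.array([3.0, 5.0, 2.0])      # illustrative values
predicted = np.array([2.5, 5.0, 4.0])

def mean_absolute_error(actual, predicted):
    return np.mean(np.abs(actual - predicted))

# The absolute value makes the loss the same in either argument order.
print(mean_absolute_error(actual, predicted))
print(mean_absolute_error(predicted, actual))
```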

### Problem 14

Build a linear regression trained with `vineyard_inputs` as its input and `vineyard_target` as its target output. Set `p14` as the output of that regression with `vineyard_inputs` as its input.

In [60]:
# YOUR CHANGES HERE

from sklearn.linear_model import LinearRegression

X_train = vineyard_inputs
y_train = vineyard_target

model = LinearRegression()
model.fit(X_train, y_train)

p14 = model.predict(X_train)

In [61]:
p14

array([12.46642263, 16.68370318, 18.66216936, 17.34319191, 19.91175065,
       18.56238423, 19.32165808, 19.68179142, 21.2307281 , 20.24149501,
       22.02039093, 20.24149501, 19.45183218, 21.3609022 , 20.90098374,
       22.58009452, 20.11132091, 22.67987965, 19.65140244, 23.33936837,
       20.90098374, 24.25920531, 20.34128014, 21.100554  , 19.78157655,
       23.89907197, 20.44106527, 22.87944991, 18.00268063, 22.77966478,
       16.68370318, 15.23455163, 15.56429599, 15.56429599, 14.90480727,
       18.00268063, 14.90480727, 16.88327344, 15.56429599, 16.88327344,
       16.22378472, 16.32356985, 15.0045924 , 16.98305857, 15.76386625,
       13.78540008, 12.46642263, 13.78540008, 13.88518521, 12.56620776,
       12.99573725,  9.69829362])
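
Linear regression ties back to this week's topic: it is the least-squares solution of an overdetermined linear system. The sketch below assumes `np.linalg.lstsq` as a stand-in for the scikit-learn model and uses only the five rows shown by `vineyard.head()`, so its numbers will not match `p14`.

```python
import numpy as np

# First five rows of the vineyard data as shown by vineyard.head() above.
inputs = np.array([[1.0, 5.0],
                   [3.0, 8.0],
                   [3.0, 11.0],
                   [3.0, 9.0],
                   [5.0, 9.5]])
target = np.array([9.5, 17.5, 18.0, 20.0, 20.5])

# Prepend a column of ones so the first weight acts as the intercept.
X = np.column_stack((np.ones(len(inputs)), inputs))
weights, *_ = np.linalg.lstsq(X, target, rcond=None)

predictions = X @ weights
print(weights)
print(predictions)
```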

### Problem 15

Given the following data, set `p15` to the weighted variance of the scores, using the probabilities as weights.

| Color | Shape | Score | Probability |
|---|---|---|---:|
| red | square | 3 | 0.250 |
| blue | circle | 4 | 0.125 |
| purple | line | 2 | 0.125 |
| purple | diamond | 5 | 0.250 |
| blue | triangle | 3 | 0.250 |

In [64]:
# YOUR CHANGES HERE

scores = np.array([3, 4, 2, 5, 3])
probabilities = np.array([0.250, 0.125, 0.125, 0.25, 0.25])

weighted_mean = np.average(scores, weights=probabilities)
weighted_variance = np.average((scores - weighted_mean) ** 2, weights=probabilities)

p15 = weighted_variance

In [65]:
p15

np.float64(1.0)
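
The same variance can be cross-checked with the identity $\mathrm{Var}(X) = E[X^2] - E[X]^2$. A short sketch with illustrative variable names:

```python
import numpy as np

scores = np.array([3, 4, 2, 5, 3])
probabilities = np.array([0.25, 0.125, 0.125, 0.25, 0.25])

mean = np.sum(probabilities * scores)                  # E[X]
mean_of_squares = np.sum(probabilities * scores ** 2)  # E[X^2]
variance = mean_of_squares - mean ** 2
print(mean, mean_of_squares, variance)
```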

### Problem 16

Set `p16` to be the correlation between the 1989 and 1990 yields in the vineyard data set.

In [62]:
# YOUR CHANGES HERE

p16 = vineyard["lugs_1989"].corr(vineyard["lugs_1990"])

In [63]:
round(float(p16), 6)

0.722479


### Problem 17

Compute the sample mean and variance of the 1990 vineyard yields.
Assuming that the yields follow a normal distribution with your computed parameters, set `p17` to the one-sided p-value of a yield of 13 lugs.

Hint: use the [SciPy stats module](https://docs.scipy.org/doc/scipy/reference/stats.html) to calculate the p-values from the distribution.

In [66]:
# YOUR CHANGES HERE

from scipy.stats import norm

sample_mean = vineyard["lugs_1990"].mean()
sample_variance = vineyard["lugs_1990"].var()
sample_std = sample_variance ** 0.5

p_value = 1 - norm.cdf(13, loc=sample_mean, scale=sample_std)

p17 = p_value

In [67]:
p17

np.float64(0.07618255849601963)
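
The one-sided p-value here is the upper-tail probability of the fitted normal distribution. The sketch below uses the standard library's `statistics.NormalDist` as a cross-check rather than SciPy, with hypothetical parameters chosen for easy arithmetic (not the vineyard values).

```python
from statistics import NormalDist

# Hypothetical parameters chosen for easy arithmetic, not the vineyard values.
mean = 10.0
std = 2.0
observed = 13.0

# Upper-tail probability P(X >= observed) under Normal(mean, std).
upper_tail = 1 - NormalDist(mu=mean, sigma=std).cdf(observed)
print(upper_tail)
```

In SciPy, `scipy.stats.norm.sf(observed, loc=mean, scale=std)` computes the same upper tail directly.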

### Problem 18

Set `p18` to be the $2 \times 3$ matrix full of question marks below, filled in with the following information.
1. Each serving of noodles requires 1/2 cup of flour.
2. Each serving of noodles requires 1/8 cup of water.
3. Noodles do not need sugar.
4. Each serving of cake requires 1/4 cup of flour.
5. Each serving of cake requires 1/4 cup of sugar.
6. Cake does not need water.

\begin{array}{rcl}
\begin{bmatrix}
\text{servings of noodles} & \text{pieces of cake} \\
\end{bmatrix}
\begin{bmatrix}
\text{??} & \text{??} & \text{??} \\
\text{??} & \text{??} & \text{??} \\
\end{bmatrix}
& = &
\begin{bmatrix}
\text{flour needed} & \text{sugar needed} & \text{water needed} \\
\end{bmatrix}
\end{array}

In [68]:
# YOUR CHANGES HERE

p18 = np.array([[1/2, 0, 1/8],
                [1/4, 1/4, 0]])

In [69]:
p18

array([[0.5  , 0.   , 0.125],
       [0.25 , 0.25 , 0.   ]])
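
Multiplying a row vector of servings by this matrix gives the total ingredients needed. A sketch with a hypothetical order of 4 servings of noodles and 2 of cake:

```python
import numpy as np

# Rows: noodles, cake. Columns: flour, sugar, water (cups per serving).
recipe = np.array([[1/2, 0, 1/8],
                   [1/4, 1/4, 0]])

servings = np.array([4, 2])  # hypothetical order: 4 noodles, 2 cake

# Result order matches the columns: flour, sugar, water.
needed = servings @ recipe
print(needed)
```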

### Problem 19

Set `p19` to be the cosine similarity of the vectors `x19` and `y19`.


In [70]:
# DO NOT CHANGE

x19 = [0.4, 0.2, -0.5]
x19

[0.4, 0.2, -0.5]

In [71]:
# DO NOT CHANGE

y19 = [-0.3, -0.2, 0.4]
y19

[-0.3, -0.2, 0.4]

In [77]:
# YOUR CHANGES HERE

def cosine_similarity(a, b):
    dot_product = a @ b
    magnitude_a = np.linalg.norm(a)
    magnitude_b = np.linalg.norm(b)
    return dot_product / (magnitude_a * magnitude_b)

x_np = np.array(x19)
y_np = np.array(y19)

p19 = cosine_similarity(x_np, y_np)

In [78]:
p19

np.float64(-0.9965457582448795)
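
Cosine similarity depends only on direction: scaling either vector by a positive factor leaves it unchanged, and reversing one vector flips its sign. A sketch illustrating both properties with the same vectors; the scale factors are arbitrary.

```python
import numpy as np

def cosine_similarity(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([0.4, 0.2, -0.5])
y = np.array([-0.3, -0.2, 0.4])

original = cosine_similarity(x, y)
rescaled = cosine_similarity(10 * x, 0.5 * y)   # positive scale factors
flipped = cosine_similarity(-x, y)              # reversing one vector flips the sign
print(original, rescaled, flipped)
```

A value near -1, as here, means the two vectors point in nearly opposite directions.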

### Problem 20

Set `p20` to the reduced row echelon form of `q20`.


In [79]:
# DO NOT CHANGE

q20 = np.array([[2., 5., -3., 2.0],
                [-2, 1, 3, -2],
                [ 4.,  1.,  0., 16.]])

In [83]:
# YOUR CHANGES HERE

q20_copy = q20.astype(float).copy()
print("Before:")
print(q20_copy)

n = q20_copy.shape[0]
col = 0
for i in range(n):
    if col >= q20_copy.shape[1]:
        break

    # q20_copy[i:, col] selects rows i through the end (a range)
    # from the single column col.
    np_abs = np.abs(q20_copy[i:, col])
    max_row = i + np.argmax(np_abs)
    if np.abs(q20_copy[max_row, col]) < 1e-12:
        col += 1
        continue

    q20_copy[[i, max_row]] = q20_copy[[max_row, i]]

    q20_copy[i] = q20_copy[i] / q20_copy[i, col]

    for j in range(n):
        if i != j:
            q20_copy[j] = q20_copy[j] - q20_copy[j, col] * q20_copy[i]

    col += 1

q20_copy = np.where(np.isclose(q20_copy, 0.0, atol=1e-12), 0.0, q20_copy)

p20 = q20_copy

Before:
[[ 2.  5. -3.  2.]
 [-2.  1.  3. -2.]
 [ 4.  1.  0. 16.]]


In [84]:
p20

array([[1., 0., 0., 4.],
       [0., 1., 0., 0.],
       [0., 0., 1., 2.]])
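
Since the same elimination loop is useful in several problems, it can be wrapped in a reusable function. The sketch below assumes a hypothetical helper named `rref`; when a column has no usable pivot it moves to the next column without advancing the row, so any zero rows end up at the bottom.

```python
import numpy as np

def rref(matrix, tol=1e-12):
    """Return the reduced row echelon form of `matrix` as a float array."""
    result = np.array(matrix, dtype=float)
    n_rows, n_cols = result.shape
    row = 0
    for col in range(n_cols):
        if row >= n_rows:
            break
        # Partial pivoting: largest-magnitude entry at or below `row`.
        pivot = row + np.argmax(np.abs(result[row:, col]))
        if abs(result[pivot, col]) < tol:
            continue  # no pivot here; try the next column on the same row
        result[[row, pivot]] = result[[pivot, row]]
        result[row] = result[row] / result[row, col]
        # Clear this column in every other row.
        for other in range(n_rows):
            if other != row:
                result[other] = result[other] - result[other, col] * result[row]
        row += 1
    # Replace tiny round-off values (and negative zeros) with exact zeros.
    return np.where(np.isclose(result, 0.0, atol=tol), 0.0, result)

example = np.array([[2., 5., -3., 2.],
                    [-2., 1., 3., -2.],
                    [4., 1., 0., 16.]])
print(rref(example))
```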

### Generative AI Usage

If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the [generative AI policy](https://www.bu.edu/cds-faculty/culture-community/gaia-policy/).
If you did not use any generative AI tools, simply write NONE below.

YOUR CHANGES HERE

Problem 9
* I knew I had to find the augmented matrix and then the reduced row echelon form from there, but didn't know what to do after that.
* https://chatgpt.com/share/6908c30f-ce88-800d-9126-eff92082ad34

Problem 13
* I thought I had to average 1989 and 1990 separately which would mean we would have two y_true values.
* https://chatgpt.com/share/690a0b16-a56c-800d-aa75-2130dd2c424f

Problem 17
* I only knew how to find the p-value if given the population mean and using the t-test. I learned about norm.cdf and how to find the p-value given the sample_mean and sample_variance.
* https://chatgpt.com/share/690a1410-8954-800d-9bc1-a4b56fd0456d