# Homework 6: Correlations.

*Instructions:*
Please answer the following questions and submit your work
by editing this jupyter notebook and submitting it on Canvas.
Questions may involve math, programming, or neither,
but you should make sure to *explain your work*:
i.e., you should usually have a cell with at least a few sentences
explaining what you are doing.

Also, please be sure to always specify units of any quantities that have units,
and label axes of plots (again, with units when appropriate).

In [2]:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng()

# 1. Ever Upwards

You are part of a team aiming to predict future costs for a coffee shop,
and are given the following model.
Let $X_0 = \$1.50$ be the price (to the shop) of a cup of coffee today,
and model the price $n$ weeks from now as $X_n = X_{n-1} + Z_n$,
where each $Z_n$ has a Normal distribution with mean \\$0.10 and standard deviation \\$0.10,
and is independent of other $Z$.
We want to see how well we can predict prices for the next 10 weeks under this model.

*(a)* If we define $Z = (Z_1, Z_2, \ldots, Z_{10})$,
    and $X = (X_1, X_2, \ldots, X_{10})$,
    then (taking $X$ and $Z$ to be column vectors)
    we can write $X = X_0 + AZ$ for some matrix $A$.
    What is that matrix?

*(b)* What are the mean and covariance matrix of $X$?
    Explain, and check by simulation.

### a
Unrolling the recursion $X_n = X_{n-1} + Z_n$ gives:
* $X_1 = X_0 + Z_1$
* $X_2 = X_0 + Z_1 + Z_2$
* $X_3 = X_0 + Z_1 + Z_2 + Z_3$
* ...
* $X_n = X_0 + \sum_{i=1}^{n} Z_i$

So $A$ is the $10 \times 10$ lower triangular matrix in which row $i$ has 1s in columns 1 through $i$ and 0s after that.
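As a quick sanity check on this description (a sketch, not part of the assigned solution): multiplying by $A$ should be the same as taking a running sum, so `A @ z` should match `np.cumsum(z)` for any vector `z`. The names `rng_check` and `z` and the fixed seed are illustrative choices.

```python
import numpy as np

n = 10
A = np.tril(np.ones((n, n)))  # row i has 1s in columns 1..i, 0s after

rng_check = np.random.default_rng(0)  # fixed seed, only for reproducibility
z = rng_check.normal(loc=0.10, scale=0.10, size=n)  # one draw of (Z_1, ..., Z_10)

# A @ z gives the partial sums Z_1, Z_1 + Z_2, ..., which is X - X_0
partial_sums = A @ z
assert np.allclose(partial_sums, np.cumsum(z))
```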

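### b
Mean of X:

$\mathbb{E}[X] = X_0 + A \, \mathbb{E}[Z]$

Since $\mathbb{E}[Z_i] = 0.10$ for each $i$, we have $\mathbb{E}[Z] = 0.10 \cdot \mathbf{1}$ (a 10-dimensional vector of 0.10s), so

$\mathbb{E}[X] = 1.5 + 0.10 \cdot A \mathbf{1}$, i.e. $\mathbb{E}[X_n] = 1.5 + 0.10 \, n$ dollars.

Covariance of X:

Adding the constant $X_0$ does not change the covariance. Since the $Z_i$ are independent, each with variance $0.10^2 = 0.01$ (dollars squared), $\text{Cov}(Z) = 0.01 \cdot I$, and so:

$\text{Cov}(X) = A \, \text{Cov}(Z) \, A^T = 0.01 \cdot A A^T$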
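To make the structure of $AA^T$ concrete: entry $(i, j)$ counts the $Z_k$ shared by $X_i$ and $X_j$, which is $\min(i, j)$, so $\text{Cov}(X_i, X_j) = 0.01 \cdot \min(i, j)$ dollars squared. Here is a short check of that identity; the names `idx`, `min_ij`, and `cov_X` are illustrative.

```python
import numpy as np

n = 10
sigma_Z = 0.10
A = np.tril(np.ones((n, n)))

idx = np.arange(1, n + 1)
min_ij = np.minimum.outer(idx, idx)  # min_ij[i-1, j-1] = min(i, j)

# A @ A.T should equal the matrix of min(i, j)
assert np.allclose(A @ A.T, min_ij)

cov_X = sigma_Z**2 * (A @ A.T)
# Var[X_10] = 10 * 0.01 = 0.10 dollars squared
assert np.isclose(cov_X[9, 9], 0.10)
```

In particular, the variance grows linearly with the number of weeks, so the standard deviation of the week-10 price is $\sqrt{0.10} \approx 0.32$ dollars.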

In [14]:
# Question 1a)
# Create the A matrix
n = 10
A = np.tril(np.ones((n, n)))
print(A)


# Question 1b)
# Parameters
X0 = 1.5
mu_Z = 0.10
sigma_Z = 0.10
n_samples = 10000
# Simulate Z and compute X = X0 + A @ Z
Z_samples = rng.normal(loc=mu_Z, scale=sigma_Z, size=(10, n_samples))
X_samples = X0 + A @ Z_samples

# Sample mean and covariance
mean_X_sim = np.mean(X_samples, axis=1)
cov_X_sim = np.cov(X_samples)

# Theoretical mean and covariance
mean_X_theory = X0 + A @ np.full(10, mu_Z)
cov_X_theory = (sigma_Z**2) * A @ A.T

# Show results
print("Simulated Mean:\n", mean_X_sim)
print("\nTheoretical Mean:\n", mean_X_theory)

print("\nSimulated Covariance Matrix:\n", np.round(cov_X_sim, 4))
print("\nTheoretical Covariance Matrix:\n", np.round(cov_X_theory, 4))


[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [1. 1. 1. 1. 0. 0. 0. 0. 0. 0.]
 [1. 1. 1. 1. 1. 0. 0. 0. 0. 0.]
 [1. 1. 1. 1. 1. 1. 0. 0. 0. 0.]
 [1. 1. 1. 1. 1. 1. 1. 0. 0. 0.]
 [1. 1. 1. 1. 1. 1. 1. 1. 0. 0.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
Simulated Mean:
 [1.59983032 1.69835342 1.79885885 1.89797832 1.99807887 2.0966859
 2.19595211 2.29614483 2.39640544 2.49700212]

Theoretical Mean:
 [1.6 1.7 1.8 1.9 2.  2.1 2.2 2.3 2.4 2.5]

Simulated Covariance Matrix:
 [[0.0102 0.0101 0.0101 0.01   0.01   0.0099 0.0098 0.0098 0.0099 0.0098]
 [0.0101 0.0199 0.02   0.0199 0.0201 0.02   0.02   0.0201 0.0205 0.0205]
 [0.0101 0.02   0.03   0.0301 0.0303 0.03   0.0298 0.0301 0.0304 0.0303]
 [0.01   0.0199 0.0301 0.0403 0.0403 0.0399 0.0397 0.0399 0.0405 0.0404]
 [0.01   0.0201 0.0303 0.0403 0.0504 0.0501 0.05   0.0503 0.0507 0.0505]
 [0.0099 0.02   0.03   0.0399 0.0501 0.0598 0.0596 0.0599 0.0604 0.0603]
 [0.0098 

# 2. Books by a different name

In class, we did PCA on word count data from passages from three books. The passages are in the file [data/passages.txt](https://uodsci.github.io/dsci345/class_material/fall_2022/homeworks/data/passages.txt) and the sources of each passage are in [data/passage_sources.tsv](https://uodsci.github.io/dsci345/class_material/fall_2022/homeworks/data/passage_sources.tsv). Repeat the analysis. You may use the same code from class to read in and process the data,
but you should *use [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)* to do the PCA.  Your results should be similar but not the same as those from class, since scikit-learn's implementation differs somewhat. Also, you don't need to show everything that we did in class
(use your judgement) but we encourage you to explore.

*Note:* part of this question is to figure out how the outputs of another method map onto what we discussed in class. Big clues are provided by the sizes of the various outputs.
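One way to start on that mapping is to compare the shapes of scikit-learn's outputs with an SVD of the centered data. The sketch below is not the assigned analysis: it uses a made-up Poisson count matrix (`counts`, hypothetical, standing in for the passages-by-words matrix) and assumes the class analysis was built on an SVD or eigendecomposition of centered counts.

```python
import numpy as np
from sklearn.decomposition import PCA

rng_demo = np.random.default_rng(1)
# Hypothetical stand-in for the word-count matrix (NOT the real passages)
counts = rng_demo.poisson(lam=3.0, size=(50, 8)).astype(float)

pca = PCA(n_components=3, svd_solver="full")
scores = pca.fit_transform(counts)  # (n_passages, 3): each passage's position on each PC
loadings = pca.components_          # (3, n_words): each word's weight in each PC

# SVD of the column-centered data, for comparison
centered = counts - counts.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

print(scores.shape, loadings.shape)
# The sign of each component is arbitrary, so compare absolute values
assert np.allclose(np.abs(scores), np.abs(U[:, :3] * S[:3]))
assert np.allclose(np.abs(loadings), np.abs(Vt[:3]))
assert np.allclose(pca.explained_variance_, S[:3] ** 2 / (counts.shape[0] - 1))
```

Note that `PCA` centers each column but does not rescale it, which is one reason results may differ from an analysis that normalizes the counts first.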

# 3. The Matrix

The secret vault can only be unlocked by a stream of numbers satisfying certain statistical properties.
You can pass in 5 floating-point numbers at a time,
and each set of 5 must be related to each other in the following way:
they should be Normally distributed with mean zero and the ($5 \times 5$) covariance matrix:
$$\begin{aligned}
    M_{ij} = (1+i+j) \times 2^{-|i-j|} \qquad \text{for } 1 \le j \le 5, \quad 1 \le i \le 5 .
\end{aligned}$$
Write a function to produce a random set of 5 numbers of this form,
and test the result by verifying that (a) $\text{var}[X_2] = 5$ and
(b) $\text{cov}[X_3,X_5] = 2.25$.
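One possible approach, sketched here under the assumption that $M$ is positive definite (so that it has a Cholesky factor): if $M = LL^T$ and $Z$ is a vector of independent standard Normals, then $LZ$ has mean zero and covariance $M$. The function names `vault_covariance` and `vault_numbers`, the seed, and the sample size are illustrative choices.

```python
import numpy as np

def vault_covariance(n=5):
    """Return M with M[i-1, j-1] = (1 + i + j) * 2**(-|i - j|) for 1-based i, j."""
    idx = np.arange(1, n + 1)
    i, j = np.meshgrid(idx, idx, indexing="ij")
    return (1 + i + j) * 2.0 ** (-np.abs(i - j))

def vault_numbers(rng, size=1):
    """Draw `size` sets of 5 numbers from Normal(0, M) using a Cholesky factor."""
    M = vault_covariance()
    L = np.linalg.cholesky(M)       # M = L @ L.T; raises if M is not positive definite
    Z = rng.normal(size=(5, size))  # independent standard Normals
    return (L @ Z).T                # each row is one set of 5 numbers

rng_demo = np.random.default_rng(123)
X = vault_numbers(rng_demo, size=200_000)
C_hat = np.cov(X, rowvar=False)     # empirical 5 x 5 covariance matrix
print("var[X_2] ~", C_hat[1, 1], " cov[X_3, X_5] ~", C_hat[2, 4])
```

The empirical values should be close to 5 and 2.25, up to sampling error; `rng.multivariate_normal` would be an alternative to the explicit Cholesky step.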

In [15]:
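# Build the target covariance matrix C from the formula for M (1-based i, j)
idx = np.arange(1, 6)
C = (1 + idx[:, None] + idx[None, :]) * 2.0 ** (-np.abs(idx[:, None] - idx[None, :]))

# (b) cov[X_3, X_5] should equal 2.25
i, j = 3, 5
print(C[i-1, j-1], (1 + i + j) * (2 ** (-1.0*np.abs(i-j))), 2.25)
assert np.isclose(C[i-1, j-1], (1 + i + j) * (2 ** (-1.0*np.abs(i-j))))
assert np.isclose(C[i-1, j-1], 2.25)
# (a) var[X_2] should equal 5
i = j = 2
print(C[i-1, j-1], (1 + i + j) * (2 ** (-1.0*np.abs(i-j))), 5.0)
assert np.isclose(C[i-1, j-1], 5)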