# Example: Computing the Eigendecomposition of a Covariance Matrix using the QR Algorithm
In this example, we will compute the eigendecomposition of a covariance matrix using the QR algorithm, which relies on the Gram-Schmidt process for orthogonalization. The covariance matrix will be computed from the daily log growth rate of stock prices.

> __Learning Objectives__
> 
> By the end of this example, you should be able to:
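> build a growth rate matrix from daily price data; compute the empirical covariance matrix from a centered data matrix and check it against `cov(...)`; and compute the eigendecomposition of that matrix with the QR algorithm and verify it against `eigen(...)`.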

Let's get started!
___

## Setup, Data, and Prerequisites
First, we set up the computational environment by including the `Include.jl` file and loading any needed resources.

> The [`include(...)` command](https://docs.julialang.org/en/v1/base/base/#include) evaluates the contents of the input source file, `Include.jl`, in the notebook's global scope. The `Include.jl` file sets paths, loads required external packages, etc. For additional information on functions and types used in this material, see the [Julia programming language documentation](https://docs.julialang.org/en/v1/). 

Let's set up our code environment:

In [1]:
include(joinpath(@__DIR__, "Include.jl")); # include the Include.jl file

In addition to standard Julia libraries, we'll also use [the `VLDataScienceMachineLearningPackage.jl` package](https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl). Check out [the documentation](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/) for more information on the functions, types, and data used in this material.

### Data
We gathered a daily open-high-low-close dataset for each firm in the S&P 500 from `01-03-2014` until `12-31-2024`, along with data for a few exchange-traded funds and volatility products during that time. 

Let's load the `original_dataset::DataFrame` by calling [the `MyTrainingMarketDataSet()` function](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/data/#VLDataScienceMachineLearningPackage.MyTrainingMarketDataSet) and remove firms that do not have the maximum number of trading days. The cleaned dataset $\mathcal{D}$ will be stored in the `dataset` variable.

In [2]:
original_dataset = MyTrainingMarketDataSet() |> x-> x["dataset"];

Not all tickers in our dataset have the maximum number of trading days for various reasons, e.g., acquisition or de-listing events. Let's collect only those tickers with the maximum number of trading days.

First, let's compute the number of records for a firm that we know has a maximum value, e.g., `AAPL`, and save that value in the `maximum_number_trading_days::Int64` variable:

In [3]:
maximum_number_trading_days = original_dataset["AAPL"] |> nrow;

Now, let's iterate through our data and collect only tickers with `maximum_number_trading_days` records. Save that data in the `dataset::Dict{String,DataFrame}` variable:

In [4]:
dataset = let

    dataset = Dict{String,DataFrame}();
    for (ticker,data) ∈ original_dataset
        if (nrow(data) == maximum_number_trading_days)
            dataset[ticker] = data;
        end
    end
    dataset
end;

Next, let's get a list of the firms in our cleaned dataset (and sort them alphabetically). We store the sorted firm ticker symbols in the `list_of_tickers::Array{String,1}` variable.

In [5]:
list_of_tickers = keys(dataset) |> collect |> sort # list of firm "ticker" symbols in alphabetical order

424-element Vector{String}:
 "A"
 "AAL"
 "AAP"
 "AAPL"
 "ABBV"
 "ABT"
 "ACN"
 "ADBE"
 "ADI"
 "ADM"
 ⋮
 "WYNN"
 "XEL"
 "XOM"
 "XRAY"
 "XYL"
 "YUM"
 "ZBRA"
 "ZION"
 "ZTS"

Finally, let's set up a ticker map that holds the index of each ticker value. We'll save this in the `tickerindexmap::Dict{String,Int}` dictionary:

In [6]:
tickerindexmap = let

    # initialize -
    tickerindexmap = Dict{String,Int}();
    for i ∈ eachindex(list_of_tickers)
        tickerindexmap[list_of_tickers[i]] = i;
    end

    tickerindexmap;
end

Dict{String, Int64} with 424 entries:
  "EMR"  => 132
  "CTAS" => 101
  "HSIC" => 187
  "KIM"  => 217
  "PLD"  => 310
  "IEX"  => 194
  "BAC"  => 48
  "CBOE" => 69
  "EXR"  => 144
  "NCLH" => 271
  "CVS"  => 103
  "DRI"  => 119
  "DTE"  => 120
  "ZION" => 423
  "AVY"  => 43
  "EW"   => 140
  "EA"   => 124
  "NWSA" => 289
  "CAG"  => 65
  ⋮      => ⋮

### Compute the return matrix
Next, let's compute the growth rate array, which contains, for each time step and each firm in our dataset, the growth rate between time index $j-1$ and $j$.

>  __Continuously Compounded Growth Rate (CCGR)__
>
> Let's assume a model of the share price of firm $i$ is governed by an expression of the form:
>$$
\begin{align*}
S^{(i)}_{j} &= S^{(i)}_{j-1}\;\exp\left(g^{(i)}_{j,j-1}\Delta{t}_{j}\right)
\end{align*}
$$
> where $S^{(i)}_{j-1}$ denotes the share price of firm $i$ at time index $j-1$, $S^{(i)}_{j}$ denotes the share price of firm $i$ at time index $j$, and $\Delta{t}_{j} = t_{j} - t_{j-1}$ denotes the length of a time step (units: years) between time index $j-1$ and $j$. The value we are going to estimate is the growth rate $g^{(i)}_{j,j-1}$ (units: inverse years) for each firm $i$, and each time step in the dataset.

We've implemented [the `log_growth_matrix(...)` function](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/data/#VLDataScienceMachineLearningPackage.log_growth_matrix) which takes the cleaned dataset and a list of ticker symbols, and returns the growth rate array. Each row of the growth rate array is a time step, while each column corresponds to a firm from the `list_of_tickers::Array{String,1}` array.
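For intuition, the growth rate model above can be inverted for a single price series. The following is a minimal sketch, not the package's implementation (which also accepts a risk-free rate); the prices in `S` are hypothetical values chosen for illustration.

```julia
# Hypothetical closing prices for one firm (illustrative values, not from the dataset)
S = [100.0, 101.0, 99.5, 102.0]
Δt = 1.0  # one time step, matching the Δt = 1.0 used in this example

# Invert S_j = S_{j-1} * exp(g * Δt) to get g = (1/Δt) * log(S_j / S_{j-1})
g = [(1 / Δt) * log(S[j] / S[j-1]) for j in 2:length(S)]

# Round trip: compounding the growth rates recovers the final price
S_final = S[1] * exp(sum(g) * Δt)
```

A series of `N` prices yields `N - 1` growth rates, one per time step.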

We save the growth rate array in the `X::Array{Float64,2}` variable:

In [7]:
X = let

    # initialize -
    r̄ = 0.0; # assume the risk-free rate is 0

    # compute the growth matrix -
    growth_rate_array = log_growth_matrix(dataset, list_of_tickers, Δt = 1.0, 
        risk_free_rate = r̄); # other optional parameters are at their defaults

    growth_rate_array; # return
end;

___

## Task 1: Compute the Empirical Covariance Matrix
In this task, let's compute the empirical covariance matrix $\hat{\mathbf{\Sigma}}$ for our dataset $\mathcal{D}$ using code that we write ourselves (we'll never do this in practice, but it's a good exercise). The empirical covariance matrix is given by:
$$
\hat{\mathbf{\Sigma}} = \frac{1}{n-1}\tilde{\mathbf{X}}^{\top}\tilde{\mathbf{X}}
$$
where $\tilde{\mathbf{X}}$ is the centered data matrix:
$$
\tilde{\mathbf{X}} = \mathbf{X} - \mathbf{1}\mathbf{m}^{\top}
$$
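where $\mathbf{1} \in \mathbb{R}^{n}$ is a vector of ones, $\mathbf{m} \in \mathbb{R}^{m}$ is the vector of column means (the mean return of each firm), and $\mathbf{1}\mathbf{m}^{\top}$ is an $n \times m$ matrix in which every row is identical and contains the __mean returns__ of the columns. Here $n$ is the number of time periods and the scalar $m$ is the number of firms.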

> __Outer product:__ The $\mathbf{1}\mathbf{m}^{\top}$ is an example of an outer product. The [outer product](https://en.wikipedia.org/wiki/Outer_product) of two vectors $\mathbf{a} \in \mathbb{R}^{n}$ and $\mathbf{b} \in \mathbb{R}^{m}$ is the $n \times m$ matrix $\mathbf{a}\mathbf{b}^{\top}$. Each element of the outer product is computed as $(\mathbf{a}\mathbf{b}^{\top})_{ij} = a_i b_j$. 

The data matrix $\mathbf{X} \in\mathbb{R}^{n \times m}$, which we computed above using [the `log_growth_matrix(...)` function](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/data/#VLDataScienceMachineLearningPackage.log_growth_matrix), holds in each row $k$ the __returns__ for all $m$ firms at time period $k$. A return is the growth rate multiplied by the time step $\Delta{t}$; because we called the function with `Δt = 1.0`, the two coincide numerically, so we can use $\mathbf{X}$ directly.
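The centering formula and the outer product described above can be checked on a tiny matrix. This is a minimal sketch with made-up numbers; it uses plain `*` with a transpose rather than the `⊗` operator used in this notebook, and the names `A`, `μ`, and `M` are illustrative.

```julia
using Statistics  # for mean

# Small illustrative data matrix: 3 observations (rows) × 2 variables (columns)
A = [1.0 2.0; 3.0 6.0; 5.0 10.0]
n = size(A, 1)

μ = vec(mean(A, dims=1))   # column means, here [3.0, 6.0]
ones_n = ones(n)

# Outer product 1 * μᵀ: an n × 2 matrix whose rows all equal μᵀ
M = ones_n * μ'

# Centering with the explicit outer product and with broadcasting agree
A_centered = A - M
A_centered_broadcast = A .- mean(A, dims=1)
```

After centering, every column of `A_centered` sums to zero, which is a quick way to confirm the means were removed.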

First, let's compute the mean returns for each firm and store them in the `m::Array{Float64,1}` variable:

In [8]:
m = mean(X, dims=1) |> vec # mean returns for each firm

424-element Vector{Float64}:
  0.00043152690397797527
 -0.0001469541078208755
 -0.00031710102271930243
  0.0009238140812477853
  0.00044105742821765466
  0.00038868965269282677
  0.0005289761752874106
  0.0007284547060446172
  0.0005265024709424277
  5.618535648601576e-5
  ⋮
 -0.00029807586991059844
  0.00032461074815436536
  2.6916182694995763e-5
 -0.0003392763144159303
  0.00043857557703725285
  0.0002064503070884351
  0.0007130714183989561
  0.00021701413350653834
  0.0005858730432623267

Now, let's form the centered data matrix $\tilde{\mathbf{X}}$ by subtracting the mean returns from each row of the data matrix $\mathbf{X}$. We store the centered data in the `X_centered::Array{Float64,2}` variable:

In [9]:
r, c = size(X)
ones_vector = ones(r)
⊗(ones_vector, m) # outer product of ones_vector and m

2766×424 Matrix{Float64}:
 0.000431527  -0.000146954  -0.000317101  …  0.000217014  0.000585873
 0.000431527  -0.000146954  -0.000317101     0.000217014  0.000585873
 0.000431527  -0.000146954  -0.000317101     0.000217014  0.000585873
 0.000431527  -0.000146954  -0.000317101     0.000217014  0.000585873
 0.000431527  -0.000146954  -0.000317101     0.000217014  0.000585873
 0.000431527  -0.000146954  -0.000317101  …  0.000217014  0.000585873
 0.000431527  -0.000146954  -0.000317101     0.000217014  0.000585873
 0.000431527  -0.000146954  -0.000317101     0.000217014  0.000585873
 0.000431527  -0.000146954  -0.000317101     0.000217014  0.000585873
 0.000431527  -0.000146954  -0.000317101     0.000217014  0.000585873
 ⋮                                        ⋱               
 0.000431527  -0.000146954  -0.000317101     0.000217014  0.000585873
 0.000431527  -0.000146954  -0.000317101     0.000217014  0.000585873
 0.000431527  -0.000146954  -0.000317101     0.000217014  0.000585873
 0.00

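Each row of the outer product is the mean vector $\mathbf{m}^{\top}$. Subtracting this matrix from $\mathbf{X}$ elementwise gives the centered data matrix: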

In [10]:
X_centered = let 
    r, c = size(X)
    ones_vector = ones(r)
    X̃ = X .- ⊗(ones_vector, m);
end

2766×424 Matrix{Float64}:
 -0.00391389    0.0250717    -0.0110757    …   0.000758758  -0.00457502
  0.0107441     0.00439892    0.00584244      -0.00340269    0.00332868
  0.0127155     0.00354218    0.000338403      0.00450918   -0.0108297
  0.00213365    0.0686387     0.00703199       0.0123199    -0.0020471
  0.00677517    0.0103835     0.0134887       -0.00882298    0.0168867
  0.00200431   -0.0155826    -0.00282885   …  -0.00779546   -0.0129519
  0.0109205    -0.00177269    0.0195462       -0.00726801   -0.00490968
  0.00769032    0.0041688     0.00788889       0.0174411    -0.00113277
  0.00477837    0.00679031    0.000742733     -0.00869703    0.00511985
  0.00441036    0.0244706     0.00401781       0.011203     -0.00628531
  ⋮                                        ⋱                
 -0.0177684     0.0154026    -0.0091056       -0.0366419    -0.0162462
 -0.0103991    -0.0102059    -0.0398453       -0.0282911    -0.02892
  0.00835239    0.0166178     0.0291932        0.0147388 

Finally, let's compute the empirical covariance matrix $\hat{\mathbf{\Sigma}}$, annualize it by multiplying by $T = 252$ trading days per year, and store it in the `Σ̂::Array{Float64,2}` variable:

In [11]:
Σ̂ = let 

    # initialize -
    T = 252; # number of trading days in a year
    (r,c) = size(X_centered)
    Σ = (1/(r-1)) * (X_centered' * X_centered)
    Σ*T; # return the annualized empirical covariance matrix
end

424×424 Matrix{Float64}:
 0.0535771   0.032397    0.0212982  …  0.0355662   0.0280938   0.0266666
 0.032397    0.207093    0.0431437     0.051429    0.0689041   0.0280481
 0.0212982   0.0431437   0.117391      0.0308539   0.037867    0.019717
 0.0229702   0.0312627   0.0163174     0.032633    0.0201827   0.022609
 0.0179206   0.0176175   0.0158407     0.0157326   0.0168026   0.0185472
 0.0244078   0.0195811   0.0145372  …  0.023159    0.0167416   0.0221966
 0.025643    0.0334402   0.0218931     0.0313218   0.0280694   0.023824
 0.0294242   0.02952     0.0185233     0.0369478   0.0198333   0.0262802
 0.0297618   0.0458278   0.0231343     0.0433011   0.0335821   0.0233085
 0.0169025   0.0325409   0.0206063     0.0219846   0.0327002   0.0131991
 ⋮                                  ⋱                          
 0.0339199   0.0921199   0.035255   …  0.0503196   0.0581754   0.0308858
 0.00888225  0.00814595  0.0118987     0.00616705  0.00592554  0.0112662
 0.0161836   0.0376862   0.0190661    

__Check__: Let's check our covariance matrix against [the `cov(...)` function from the Julia standard library](https://docs.julialang.org/en/v1/stdlib/Statistics/#Statistics.cov). Compute the covariance matrix using the built-in function and compare it to your result:

> __Test__ We'll compare the two covariance matrices by computing the Frobenius norm of their difference. The Frobenius norm of a matrix $\mathbf{A} \in \mathbb{R}^{n \times m}$ is defined as:
> $$
\|\mathbf{A}\|_{F} = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{m} |a_{ij}|^{2}}
> $$
> where $a_{ij}$ is the element in the $i^{th}$ row and $j^{th}$ column of matrix $\mathbf{A}$. If the Frobenius norm of the difference between the two covariance matrices is very small (close to zero), it indicates that they are nearly identical, confirming the correctness of our implementation. We'll use the [`@assert` macro](https://docs.julialang.org/en/v1/base/base/#Base.@assert) to enforce this check.

So what do we see?


In [12]:
let

    # initialize -
    ϵ = 1e-8; # tolerance for the Frobenius norm comparison
    T = 252; # number of trading days in a year
    Σ_builtin = cov(X)*T; # annualized empirical covariance matrix using the built-in cov function
    Δ = Σ̂ - Σ_builtin;
    frobenius_norm = norm(Δ, 2); # p = 2 is the Frobenius norm for a matrix
    test = frobenius_norm < ϵ

    # if test fails, throw an error -
    @assert test "Covariance matrices do not match within tolerance!"
end

Ok! So if we get here without an error, our covariance matrix implementation is correct!
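One detail in the check above is worth confirming: in Julia, `norm` applied to a matrix treats it as a vector of entries, so the default `p = 2` gives the Frobenius norm, while the operator (spectral) norm is `opnorm`. A minimal sketch on a small illustrative matrix:

```julia
using LinearAlgebra

A = [3.0 0.0; 0.0 4.0]

frobenius_by_definition = sqrt(sum(abs2, A))  # sqrt(3² + 4²)
frobenius_norm = norm(A)    # default p = 2, entrywise: the Frobenius norm
spectral_norm = opnorm(A)   # largest singular value of A
```

For this diagonal matrix the Frobenius norm is 5 and the spectral norm is 4, so the two are easy to tell apart.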

___

## Task 2: Compute the Eigendecomposition using the QR Algorithm
In this task, we will compute the eigendecomposition of the empirical covariance matrix $\hat{\mathbf{\Sigma}}$ using our implementation of the QR algorithm in [the `qriteration(...)` function in the `Compute.jl` file](../src/Compute.jl).
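For intuition, the idea behind QR iteration can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the `qriteration(...)` implementation in `Compute.jl`: the names `gram_schmidt_qr` and `qr_iteration_sketch` are hypothetical, the matrix is assumed symmetric with full rank, and no shifts or deflation are used.

```julia
using LinearAlgebra

# QR factorization via modified Gram-Schmidt (assumes A has full column rank)
function gram_schmidt_qr(A::AbstractMatrix{<:Real})
    n, m = size(A)
    Q = zeros(n, m)
    R = zeros(m, m)
    for j in 1:m
        v = float.(A[:, j])
        for i in 1:(j-1)
            R[i, j] = dot(Q[:, i], v)   # component of v along q_i
            v -= R[i, j] * Q[:, i]      # remove that component
        end
        R[j, j] = norm(v)
        Q[:, j] = v / R[j, j]           # normalize what is left
    end
    return Q, R
end

# Unshifted QR iteration for a symmetric matrix
function qr_iteration_sketch(A::AbstractMatrix{<:Real}; maxiter::Int = 500, tolerance::Float64 = 1e-10)
    Ak = float.(Matrix(A))
    V = Matrix{Float64}(I, size(A)...)
    for _ in 1:maxiter
        Q, R = gram_schmidt_qr(Ak)
        Ak = R * Q     # similar to the previous iterate: R*Q = Qᵀ*Ak*Q
        V = V * Q      # accumulate the orthogonal transformations
        # stop once the off-diagonal part is negligible
        norm(Ak - Diagonal(diag(Ak))) < tolerance && break
    end
    λ = diag(Ak)
    p = sortperm(λ)    # ascending order of eigenvalue
    return λ[p], V[:, p]
end

A = [2.0 1.0; 1.0 3.0]
λ, V = qr_iteration_sketch(A)
```

Each iterate is orthogonally similar to the original matrix, so the eigenvalues are preserved while the iterates approach a diagonal matrix; the accumulated product of the `Q` factors approximates the eigenvectors.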

We'll save the eigenvalues in the `λ̂::Array{Float64,1}` variable and the eigenvectors, as columns, in the `V̂::Array{Float64,2}` variable. The eigenvalues and eigenvectors are sorted in ascending order of eigenvalue.

In [13]:
(λ̂,V̂) = let

    # initialize -
    maxiter = 1000; # max number of iterations
    tolerance = 1e-9; # tolerance for convergence

    # call our QR iteration function -
    result = qriteration(Σ̂; maxiter = maxiter, tolerance = tolerance);

    λ = result[1]; # eigenvalues
    tmpdict = result[2]; # eigenvectors
    number_of_rows = length(λ);
    V = zeros(number_of_rows, number_of_rows);
    for i ∈ 1:number_of_rows
        V[:,i] = tmpdict[i];
    end

    (λ,V); # return
end

([0.00020091649390589135, 0.0008004082795630774, 0.0008197993962321032, 0.0011683475444739903, 0.0023239405869121253, 0.0024374860556287833, 0.0029043419685604965, 0.0029450466978901834, 0.0030032234323237325, 0.003141200093345577  …  0.40109541051295494, 0.42410094575995183, 0.4433158835074257, 0.5110538829534128, 0.7218934238657446, 0.8821003299986302, 0.9913463704158854, 1.1983172323103428, 1.595819459020517, 11.438058870655444], [0.002294131234357588 -0.003749383941944555 … 0.05785052459201625 -0.042413850294204546; -0.0009776709554531712 0.000697604059543025 … -0.05786956337900864 -0.08444908198854581; … ; 0.00809290103175616 0.001978653491762944 … -0.07600622038332525 -0.06871769022820234; 0.0019203225901283023 -0.006274549177533917 … 0.06134369996581131 -0.037098632606339624])

In [14]:
i = findall(isnan, V̂) # sanity check: indices of any NaN entries in V̂ (we expect none)
V̂[i]

Float64[]

Before we think about what the eigenvalues and eigenvectors mean, let's verify that our implementation is correct by checking the values against the built-in Julia [`eigen(...)` function](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.eigen). 

In [15]:
let

    # compute the eigendecomposition using the built-in function -
    F = eigen(Σ̂);
    λ = F.values; # grab the eigenvalues
    V = F.vectors; # grab the eigenvectors

    # sort the eigenpairs in ascending order of eigenvalue (to match λ̂) -
    p = sortperm(λ)
    λ = λ[p]
    V = V[:,p]

    # let's compare the eigenvalues -
    ϵ = 1e-4; # tolerance for comparison
    maximum_eigenvalue_delta = maximum(abs.(λ̂ - λ))
    @assert maximum_eigenvalue_delta < ϵ "Eigenvalues do not match within tolerance!"

    # let's compare the eigenvectors (up to a sign) -
    ΔV = similar(V̂)
    for i ∈ 1:length(λ̂)
        v1 = V̂[:,i] / norm(V̂[:,i])
        v2 = V[:,i] / norm(V[:,i])
        if dot(v1, v2) < 0
            v2 = -v2
        end
        ΔV[:,i] = v1 - v2
    end
    ϵv = 1e-3; # tolerance for eigenvector comparison
    maximum_eigenvector_delta = maximum(abs.(ΔV))
    @assert maximum_eigenvector_delta < ϵv "Eigenvectors do not match within tolerance!"

    (maximum_eigenvalue_delta, maximum_eigenvector_delta)
end

(2.585973623539517e-5, 1.8581128760630783e-8)
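A complementary check, which does not depend on matching a reference solver column by column and so sidesteps the sign ambiguity, is to measure how well each pair satisfies $\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$. The sketch below assumes a hypothetical helper named `max_eigenpair_residual` and demonstrates it on a small symmetric matrix using the built-in solver.

```julia
using LinearAlgebra

# Largest residual ‖A*v - λ*v‖₂ over all eigenpairs (columns of V paired with entries of λ)
function max_eigenpair_residual(A::AbstractMatrix, λ::AbstractVector, V::AbstractMatrix)
    return maximum(norm(A * V[:, i] - λ[i] * V[:, i]) for i in eachindex(λ))
end

# Illustration on a small symmetric matrix
A = [4.0 1.0 0.0; 1.0 3.0 1.0; 0.0 1.0 2.0]
F = eigen(Symmetric(A))
residual = max_eigenpair_residual(A, F.values, F.vectors)

# Eigenvectors of a symmetric matrix should be orthonormal: VᵀV ≈ I
orthogonality_error = norm(F.vectors' * F.vectors - I)
```

In this notebook, the same helper could be called as `max_eigenpair_residual(Σ̂, λ̂, V̂)` to test our QR iteration result directly.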

## Summary
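In this example, we computed the annualized empirical covariance matrix of daily growth rates for 424 firms and recovered its eigendecomposition using our own implementation of the QR algorithm.

> __Key Takeaways__
> 
> The covariance matrix we built from the centered data matrix matched the built-in `cov(...)` result to within a Frobenius norm tolerance of `1e-8`.
>
> The eigenvalues from our QR iteration agreed with the built-in `eigen(...)` function to within about `2.6e-5`, and the eigenvectors to within about `1.9e-8`.
>
> Eigenvectors are only determined up to a sign, so the signs must be aligned before two sets of eigenvectors are compared.

With a verified eigendecomposition in hand, we can now turn to what the eigenvalues and eigenvectors of the covariance matrix mean.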
___