# Example: Computing the Eigendecomposition of a Covariance Matrix Using the QR Algorithm
In this example, we will compute the eigendecomposition of a covariance matrix using the QR algorithm, which relies on the Gram-Schmidt process for orthogonalization. The covariance matrix will be computed from the daily log growth rates of stock prices.

> __Learning Objectives__
> 
> By the end of this example, you should be able to:
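> * __Compute an empirical covariance matrix:__ center a matrix of daily log growth rates and form the empirical covariance matrix from the centered data.
> * __Compute an eigendecomposition with the QR algorithm:__ apply QR iteration to the covariance matrix and verify the eigenvalues and eigenvectors against the built-in `eigen(...)` function.
> * __Interpret the eigenpairs:__ check that the eigenvectors are orthonormal and read the eigenvector associated with the largest eigenvalue as a market factor.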

Let's get started!
___

## Setup, Data, and Prerequisites
First, we set up the computational environment by including the `Include.jl` file and loading any needed resources.

> The [`include(...)` command](https://docs.julialang.org/en/v1/base/base/#include) evaluates the contents of the input source file, `Include.jl`, in the notebook's global scope. The `Include.jl` file sets paths, loads required external packages, etc. For additional information on functions and types used in this material, see the [Julia programming language documentation](https://docs.julialang.org/en/v1/). 

Let's set up our code environment:

In [1]:
include(joinpath(@__DIR__, "Include.jl")); # include the Include.jl file

[32m[1m    Updating[22m[39m git-repo `https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl.git`
[32m[1m    Updating[22m[39m registry at `~/.julia/registries/General.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m   Installed[22m[39m Imath_jll ──────── v3.2.2+0
[32m[1m   Installed[22m[39m XZ_jll ─────────── v5.8.2+0
[32m[1m   Installed[22m[39m OpenEXR_jll ────── v3.4.4+0
[32m[1m   Installed[22m[39m WoodburyMatrices ─ v1.1.0
[32m[1m   Installed[22m[39m StaticArrays ───── v1.9.16
[32m[1m   Installed[22m[39m SciMLPublic ────── v1.0.1
[32m[1m   Installed[22m[39m NNlib ──────────── v0.9.33
[32m[1m   Installed[22m[39m ForwardDiff ────── v1.3.1
[32m[1m   Installed[22m[39m Graphs ─────────── v1.13.3
[32m[1m  Installing[22m[39m 3 artifacts
[32m[1m   Installed[22m[39m artifact XZ          724.9 KiB
[32m[1m   Installed[22m[39m artifact Imath       180.2 KiB
[32m[1m   Installed[22m[39m artifact OpenEXR     

LoadError: LoadError: expected package `VLDataScienceMachineLearningPackage [24b76065]` to be registered
in expression starting at /Users/jdv27/Desktop/julia_work/CHEME-5820-Instances/Spring-2026/CHEME-5820-Lectures-Spring-2026/lectures/week-2/L2c/Include.jl:8

In addition to standard Julia libraries, we'll also use [the `VLDataScienceMachineLearningPackage.jl` package](https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl). Check out [the documentation](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/) for more information on the functions, types, and data used in this material.

### Data
We gathered a daily open-high-low-close dataset for each firm in the S&P 500 from `01-03-2014` until `12-31-2024`, along with data for a few exchange-traded funds and volatility products during that time. 

Let's load the `original_dataset::DataFrame` by calling [the `MyTrainingMarketDataSet()` function](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/data/#VLDataScienceMachineLearningPackage.MyTrainingMarketDataSet) and remove firms that do not have the maximum number of trading days. The cleaned dataset $\mathcal{D}$ will be stored in the `dataset` variable.

In [2]:
original_dataset = MyTrainingMarketDataSet() |> x-> x["dataset"];

UndefVarError: UndefVarError: `MyTrainingMarketDataSet` not defined in `Main`
Suggestion: check for spelling errors or missing imports.

Not all tickers in our dataset have the maximum number of trading days for various reasons, e.g., acquisition or de-listing events. Let's collect only those tickers with the maximum number of trading days.

First, let's compute the number of records for a firm that we know has a maximum value, e.g., `AAPL`, and save that value in the `maximum_number_trading_days::Int64` variable:

In [3]:
maximum_number_trading_days = original_dataset["AAPL"] |> nrow;

UndefVarError: UndefVarError: `original_dataset` not defined in `Main`
Suggestion: add an appropriate import or assignment. This global was declared but not assigned.

Now, let's iterate through our data and collect only tickers with `maximum_number_trading_days` records. Save that data in the `dataset::Dict{String,DataFrame}` variable:

In [4]:
dataset = let

    dataset = Dict{String,DataFrame}();
    for (ticker,data) ∈ original_dataset
        if (nrow(data) == maximum_number_trading_days)
            dataset[ticker] = data;
        end
    end
    dataset
end;

UndefVarError: UndefVarError: `DataFrame` not defined in `Main`
Suggestion: check for spelling errors or missing imports.

Next, let's get a list of the firms in our cleaned dataset, sorted alphabetically. We store the sorted ticker symbols in the `list_of_tickers::Array{String,1}` variable.

In [5]:
list_of_tickers = keys(dataset) |> collect |> sort # list of firm "ticker" symbols in alphabetical order

UndefVarError: UndefVarError: `dataset` not defined in `Main`
Suggestion: add an appropriate import or assignment. This global was declared but not assigned.

Finally, let's set up a ticker map that holds the index of each ticker value. We'll save this in the `tickerindexmap::Dict{String,Int}` dictionary:

In [6]:
tickerindexmap = let

    # initialize -
    tickerindexmap = Dict{String,Int}();
    for i ∈ eachindex(list_of_tickers)
        tickerindexmap[list_of_tickers[i]] = i;
    end

    tickerindexmap;
end

UndefVarError: UndefVarError: `list_of_tickers` not defined in `Main`
Suggestion: add an appropriate import or assignment. This global was declared but not assigned.

### Compute the return matrix
Next, let's compute the return array, which contains, for each day and each firm in our dataset, the growth rate between time index $j-1$ and $j$.

>  __Continuously Compounded Growth Rate (CCGR)__
>
> Let's assume a model of the share price of firm $i$ is governed by an expression of the form:
>$$
\begin{align*}
S^{(i)}_{j} &= S^{(i)}_{j-1}\;\exp\left(g^{(i)}_{j,j-1}\Delta{t}_{j}\right)
\end{align*}
$$
> where $S^{(i)}_{j-1}$ denotes the share price of firm $i$ at time index $j-1$, $S^{(i)}_{j}$ denotes the share price of firm $i$ at time index $j$, and $\Delta{t}_{j} = t_{j} - t_{j-1}$ denotes the length of a time step (units: years) between time index $j-1$ and $j$. The value we are going to estimate is the growth rate $g^{(i)}_{j,j-1}$ (units: inverse years) for each firm $i$, and each time step in the dataset.
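
Solving the price model above for the growth rate gives $g^{(i)}_{j,j-1} = \frac{1}{\Delta{t}_{j}}\ln\left(S^{(i)}_{j}/S^{(i)}_{j-1}\right)$. The snippet below is a minimal sketch of that calculation for a single firm. The prices are made-up numbers, and the choice of $\Delta{t} = 1/252$ (one trading day, in years) is an assumption for illustration; it is not the `log_growth_matrix(...)` implementation.

```julia
# Hypothetical closing prices for one firm (made-up numbers, for illustration only)
S = [100.0, 101.0, 99.5, 102.0];
Δt = 1.0/252; # assumed step: one trading day in years (252 trading days per year)

# growth rate for each step: g[j] = (1/Δt) * ln(S[j+1]/S[j])
g = [(1/Δt)*log(S[j+1]/S[j]) for j ∈ 1:(length(S)-1)];

# sanity check: plugging g back into the price model reproduces the next price
S_check = [S[j]*exp(g[j]*Δt) for j ∈ eachindex(g)];
```

A price series of length $n+1$ yields $n$ growth rates, and the sign of each growth rate follows the direction of the price move over that step.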

We've implemented [the `log_growth_matrix(...)` function](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/data/#VLDataScienceMachineLearningPackage.log_growth_matrix) which takes the cleaned dataset and a list of ticker symbols, and returns the growth rate array. Each row of the growth rate array is a time step, while each column corresponds to a firm from the `list_of_tickers::Array{String,1}` array.

We save the growth rate array in the `X::Array{Float64,2}` variable:

In [7]:
X = let

    # initialize -
    r̄ = 0.0; # assume the risk-free rate is 0

    # compute the growth matrix -
    growth_rate_array = log_growth_matrix(dataset, list_of_tickers, Δt = 1.0, 
        risk_free_rate = r̄); # other optional parameters are at their defaults

    growth_rate_array; # return
end;

UndefVarError: UndefVarError: `log_growth_matrix` not defined in `Main`
Suggestion: check for spelling errors or missing imports.

___

## Task 1: Compute the Empirical Covariance Matrix
In this task, let's compute the empirical covariance matrix $\hat{\mathbf{\Sigma}}$ for our dataset $\mathcal{D}$ using code that we write ourselves (we'll never do this in practice, but it's a good exercise). The empirical covariance matrix is given by:
$$
\hat{\mathbf{\Sigma}} = \frac{1}{n-1}\tilde{\mathbf{X}}^{\top}\tilde{\mathbf{X}}
$$
where $\tilde{\mathbf{X}}$ is the centered data matrix:
$$
\tilde{\mathbf{X}} = \mathbf{X} - \mathbf{1}\mathbf{m}^{\top}
$$
where $\mathbf{1} \in \mathbb{R}^{n}$ is a vector of ones, $\mathbf{m} \in \mathbb{R}^{m}$ is the vector of column means (the mean return of each firm), and $\mathbf{1}\mathbf{m}^{\top}$ is an $n \times m$ matrix in which every row is identical and equals $\mathbf{m}^{\top}$.

> __Outer product:__ The $\mathbf{1}\mathbf{m}^{\top}$ is an example of an outer product. The [outer product](https://en.wikipedia.org/wiki/Outer_product) of two vectors $\mathbf{a} \in \mathbb{R}^{n}$ and $\mathbf{b} \in \mathbb{R}^{m}$ is the $n \times m$ matrix $\mathbf{a}\mathbf{b}^{\top}$. Each element of the outer product is computed as $(\mathbf{a}\mathbf{b}^{\top})_{ij} = a_i b_j$. 
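
To make the centering step concrete, here is a minimal sketch on a small made-up data matrix. It builds the outer product with plain matrix multiplication (`ones(n) * m'`) rather than the `⊗` operator used later in this notebook, and the `_demo` variable names are hypothetical.

```julia
using Statistics

# small made-up data matrix: 4 observations (rows), 2 variables (columns)
X_demo = [1.0 2.0; 3.0 6.0; 5.0 4.0; 7.0 8.0];
n_demo = size(X_demo, 1);

m_demo = vec(mean(X_demo, dims = 1));  # column means
outer_demo = ones(n_demo) * m_demo';   # outer product: every row equals the column means
Xc_demo = X_demo - outer_demo;         # centered data matrix
Σ_demo = (1/(n_demo - 1)) * (Xc_demo' * Xc_demo); # empirical covariance matrix
```

After centering, each column of `Xc_demo` sums to zero, and `Σ_demo` should agree with `cov(X_demo)` up to round-off.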

We already constructed the data matrix $\mathbf{X} \in\mathbb{R}^{n \times m}$ in the setup section using [the `log_growth_matrix(...)` function](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/data/#VLDataScienceMachineLearningPackage.log_growth_matrix), where each row $k$ contains the growth rates for all $m$ firms at time period $k$. Under the growth model above, the __return__ over a period is the growth rate multiplied by the time step $\Delta{t}$. We set $\Delta{t} = 1$ in that call, so we read the entries of $\mathbf{X}$ directly as per-period returns.

First, let's compute the mean returns for each firm and store them in the `m::Array{Float64,1}` variable:

In [8]:
m = mean(X, dims=1) |> vec # mean returns for each firm

UndefVarError: UndefVarError: `mean` not defined in `Main`
Suggestion: check for spelling errors or missing imports.

Now, let's form the centered data matrix $\tilde{\mathbf{X}}$ by subtracting the mean returns from each row of the data matrix $\mathbf{X}$. We store the centered data in the `X_centered::Array{Float64,2}` variable:

In [9]:
r, c = size(X)
ones_vector = ones(r)
⊗(ones_vector, m) # outer product of ones_vector and m

UndefVarError: UndefVarError: `X` not defined in `Main`
Suggestion: add an appropriate import or assignment. This global was declared but not assigned.

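The outer product above is an $n \times m$ matrix in which every row equals $\mathbf{m}^{\top}$. Subtracting it from $\mathbf{X}$ gives the centered data matrix: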

In [10]:
X_centered = let 
    r, c = size(X)
    ones_vector = ones(r)
    X̃ = X .- ⊗(ones_vector, m);
end

UndefVarError: UndefVarError: `X` not defined in `Main`
Suggestion: add an appropriate import or assignment. This global was declared but not assigned.

Finally, let's compute the empirical covariance matrix $\hat{\mathbf{\Sigma}}$ and store it in the `Σ̂::Array{Float64,2}` variable:

In [11]:
Σ̂ = let 

    # initialize -
    T = 252; # number of trading days in a year
    (r,c) = size(X_centered)
    Σ = (1/(r-1)) * (X_centered' * X_centered)
    Σ*T; # return the annualized empirical covariance matrix
end

UndefVarError: UndefVarError: `X_centered` not defined in `Main`
Suggestion: add an appropriate import or assignment. This global was declared but not assigned.

__Check__: Let's check our covariance matrix against [the `cov(...)` function from the Julia standard library](https://docs.julialang.org/en/v1/stdlib/Statistics/#Statistics.cov). Compute the covariance matrix using the built-in function and compare it to your result:

> __Test__ We'll compare the two covariance matrices by computing the Frobenius norm of their difference. The Frobenius norm of a matrix $\mathbf{A} \in \mathbb{R}^{n \times m}$ is defined as:
> $$
\|\mathbf{A}\|_{F} = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{m} |a_{ij}|^{2}}
> $$
> where $a_{ij}$ is the element in the $i^{th}$ row and $j^{th}$ column of matrix $\mathbf{A}$. If the Frobenius norm of the difference between the two covariance matrices is very small (close to zero), it indicates that they are nearly identical, confirming the correctness of our implementation. We'll use the [`@assert` macro](https://docs.julialang.org/en/v1/base/base/#Base.@assert) to enforce this check.
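
As a quick illustration of the definition above, the following sketch computes the Frobenius norm of a small made-up matrix two ways: directly from the definition, and with `norm` from the `LinearAlgebra` standard library, which treats a matrix as a flat vector and defaults to $p = 2$. The operator 2-norm (`opnorm`) is included only to show that it is a different quantity.

```julia
using LinearAlgebra

A_demo = [1.0 2.0; 3.0 4.0];               # small made-up matrix
by_definition = sqrt(sum(abs2, A_demo));   # square root of the sum of squared entries
by_norm = norm(A_demo);                    # norm defaults to p = 2: the Frobenius norm for a matrix
spectral = opnorm(A_demo);                 # operator 2-norm (largest singular value), a different quantity
```

The first two values agree, while the operator 2-norm is never larger than the Frobenius norm.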

So what do we see?


In [12]:
let

    # initialize -
    ϵ = 1e-8; # tolerance for the Frobenius norm comparison
    T = 252; # number of trading days in a year
    Σ_builtin = cov(X)*T; # annualized empirical covariance matrix using the built-in cov(...) function
    Δ = Σ̂ - Σ_builtin;
    frobenius_norm = norm(Δ, 2); # p = 2 is the Frobenius norm for a matrix
    test = frobenius_norm < ϵ

    # if test fails, throw an error -
    @assert test "Covariance matrices do not match within tolerance!"
end

UndefVarError: UndefVarError: `cov` not defined in `Main`
Suggestion: check for spelling errors or missing imports.

Ok! So if we get here without an error, our covariance matrix implementation is correct!

___

## Task 2: Compute the Eigendecomposition using the QR Algorithm
In this task, we will compute the eigendecomposition of the empirical covariance matrix $\hat{\mathbf{\Sigma}}$ using our implementation of the QR algorithm in [the `qriteration(...)` function in the `Compute.jl` file](../src/Compute.jl).

We'll save the eigenvalues in the `λ̂::Array{Float64,1}` variable and the eigenvectors in the `V̂::Array{Float64,2}` variable. The eigenvalues and eigenvectors are sorted in descending order of eigenvalue.
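
For intuition, here is a minimal sketch of unshifted QR iteration built on a modified Gram-Schmidt factorization. It is not the `qriteration(...)` implementation from `Compute.jl`: the function names `gram_schmidt_qr` and `qr_iteration_sketch`, the stopping rule, and the return format are all assumptions made for illustration, and the test matrix is made up.

```julia
using LinearAlgebra

# QR factorization by modified Gram-Schmidt (assumes linearly independent columns)
function gram_schmidt_qr(A::Matrix{Float64})
    n, m = size(A)
    Q = zeros(n, m); R = zeros(m, m)
    for j ∈ 1:m
        v = A[:, j]
        for i ∈ 1:(j-1)
            R[i, j] = dot(Q[:, i], v)  # component of the current vector along q_i
            v -= R[i, j] * Q[:, i]     # remove that component
        end
        R[j, j] = norm(v)
        Q[:, j] = v / R[j, j]          # normalize what is left
    end
    return Q, R
end

# unshifted QR iteration: factor A_k = Q_k R_k, then set A_{k+1} = R_k Q_k
function qr_iteration_sketch(A::Matrix{Float64}; maxiter::Int = 1000, tolerance::Float64 = 1e-9)
    Ak = copy(A)
    V = Matrix{Float64}(I, size(A)...) # accumulates the product Q_1 Q_2 ...
    for _ ∈ 1:maxiter
        Q, R = gram_schmidt_qr(Ak)
        Ak = R * Q
        V = V * Q
        # stop when the off-diagonal part of A_k is negligible
        norm(Ak - Diagonal(diag(Ak))) < tolerance && break
    end
    return diag(Ak), V # eigenvalue estimates and eigenvector estimates (as columns)
end

S_demo = [4.0 1.0 0.5; 1.0 3.0 0.2; 0.5 0.2 1.0]; # small symmetric positive definite test matrix
λ_demo, V_demo = qr_iteration_sketch(S_demo);
```

For a symmetric matrix with distinct eigenvalue magnitudes, the iterates approach a diagonal matrix whose entries are the eigenvalues, and the accumulated product of the $\mathbf{Q}$ factors approaches the eigenvector matrix. Production code typically adds shifts and a preliminary reduction to speed up convergence, which this sketch omits.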

In [13]:
(λ̂,V̂) = let

    # initialize -
    maxiter = 1000; # max number of iterations
    tolerance = 1e-9; # tolerance for convergence

    # call our QR iteration function -
    result = qriteration(Σ̂; maxiter = maxiter, tolerance = tolerance);

    λ = result[1]; # eigenvalues
    tmpdict = result[2]; # eigenvectors
    number_of_rows = length(λ);
    V = zeros(number_of_rows, number_of_rows);
    for i ∈ 1:number_of_rows
        V[:,i] = tmpdict[i];
    end

    # sort the eigenpairs by eigenvalue magnitude -
    p = sortperm(λ, rev=true); # indices that would sort λ in descending order
    λ = λ[p]
    V = V[:,p]

    (λ,V); # return
end

UndefVarError: UndefVarError: `qriteration` not defined in `Main`
Suggestion: check for spelling errors or missing imports.

Before we think about what the eigenvalues and eigenvectors mean, let's verify that our implementation is correct by checking the values against the built-in Julia [`eigen(...)` function](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.eigen). 

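> __Check__: We sort the eigenpairs returned by `eigen(...)` in descending order of eigenvalue so that they line up with ours. We first compare the eigenvalues: the maximum absolute difference must be below a tolerance of $1 \times 10^{-4}$. We then compare the eigenvectors column by column: each pair is normalized to unit length and sign-aligned (an eigenvector is only defined up to a sign), and the maximum absolute elementwise difference must be below $1 \times 10^{-3}$.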

So what do we see?

In [15]:
let

    # compute the eigendecomposition using the built-in function -
    F = eigen(Σ̂);
    λ = F.values; # grab the eigenvalues
    V = F.vectors; # grab the eigenvectors

    # sort the eigenpairs by eigenvalue magnitude -
    p = sortperm(λ, rev=true); # indices that would sort λ in descending order
    λ = λ[p]
    V = V[:,p]

    # Test 1: let's compare the eigenvalues -
    ϵ = 1e-4; # tolerance for comparison
    maximum_eigenvalue_delta = maximum(abs.(λ̂ - λ))
    @assert maximum_eigenvalue_delta < ϵ "Eigenvalues do not match within tolerance!"

    # Test 2: let's compare the eigenvectors (up to a sign) -
    ΔV = similar(V̂)
    for i ∈ 1:length(λ̂)
        v1 = V̂[:,i] / norm(V̂[:,i])
        v2 = V[:,i] / norm(V[:,i])
        if dot(v1, v2) < 0
            v2 = -v2
        end
        ΔV[:,i] = v1 - v2
    end
    ϵv = 1e-3; # tolerance for eigenvector comparison
    maximum_eigenvector_delta = maximum(abs.(ΔV))
    @assert maximum_eigenvector_delta < ϵv "Eigenvectors do not match within tolerance!"
end

UndefVarError: UndefVarError: `eigen` not defined in `Main`
Suggestion: check for spelling errors or missing imports.
Hint: a global variable of this name also exists in LinearAlgebra.

Ok, so if we get here without an error, our QR algorithm implementation produced correct eigenvalues and eigenvectors!

___

## Task 3: What do the Eigenvalues and Eigenvectors Mean?
In this task, we'll interpret the eigenvalues and eigenvectors we computed from the empirical covariance matrix $\hat{\mathbf{\Sigma}}$. For a symmetric matrix such as $\hat{\mathbf{\Sigma}}$, the eigendecomposition is a matrix factorization of the form:
$$
\begin{align*}
\hat{\mathbf{\Sigma}} &= \mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}
\end{align*}
$$
where $\mathbf{V} \in \mathbb{R}^{m \times m}$ is a matrix whose columns are the orthonormal eigenvectors of $\hat{\mathbf{\Sigma}}$, and $\mathbf{\Lambda} \in \mathbb{R}^{m \times m}$ is a diagonal matrix whose diagonal elements are the eigenvalues of $\hat{\mathbf{\Sigma}}$. One way to interpret the eigenvalues and eigenvectors is through the lens of market factor models.
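
The factorization can be checked numerically on a small example. The sketch below uses the built-in `eigen(...)` function on a made-up $2 \times 2$ symmetric matrix (not our covariance matrix) and rebuilds the matrix from its eigenpairs; the `_small` variable names are hypothetical.

```julia
using LinearAlgebra

S_small = [2.0 1.0; 1.0 2.0];             # small symmetric matrix, made up for illustration
F_small = eigen(Symmetric(S_small));      # eigenvalues are returned in ascending order
V_small = F_small.vectors;                # columns are orthonormal eigenvectors
Λ_small = Diagonal(F_small.values);       # eigenvalues on the diagonal
S_rebuilt = V_small * Λ_small * V_small'; # reproduces S_small up to round-off
```

Because the columns of `V_small` are orthonormal, its transpose acts as its inverse, which is why the transpose appears in the factorization in place of an explicit inverse.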

> __Market Factor:__ The eigenvector $\mathbf{v}_1$ corresponding to the largest eigenvalue $\lambda_{1}$ can be interpreted as the market factor, while the other eigenvectors $\mathbf{v}_2, \mathbf{v}_3, \ldots, \mathbf{v}_m$ correspond to sector or idiosyncratic factors (where we assume the eigenvalues are sorted in descending order).

Let's start by verifying that the eigenvectors we computed are orthonormal. We can do this by checking that the matrix product of the transpose of the eigenvector matrix $\mathbf{V}^{\top}$ and the eigenvector matrix $\mathbf{V}$ yields the identity matrix $\mathbf{I}$.

> __Check:__ In the check below, we compute the product $\mathbf{V}^{\top}\mathbf{V}$ and verify that it is close to the identity matrix within a specified tolerance. In particular, we compute the maximum absolute difference between the elements of the computed product and the identity matrix, and assert that this difference is less than a small tolerance value (e.g., $1 \times 10^{-6}$).

So what do we see?

In [16]:
let
    I_test = transpose(V̂)*V̂ |> Matrix; # compute the identity test matrix
    I_true = Matrix(I, size(I_test)...); # true identity matrix
    ϵ = 1e-6; # tolerance for comparison
    ΔI = I_test - I_true;
    maximum_identity_delta = maximum(abs.(ΔI))
    @assert maximum_identity_delta < ϵ "Eigenvectors are not orthonormal within tolerance!" 
end

UndefVarError: UndefVarError: `V̂` not defined in `Main`
Suggestion: check for spelling errors or missing imports.

If we get here without an error, the eigenvectors are orthonormal! Now, let's examine the leading eigenvector $\mathbf{v}_1$ (the one associated with the largest eigenvalue) and interpret it as the market factor. In particular, let's compute the absolute values of the components of $\mathbf{v}_1$, scaled so that they sum to one, to understand the relative influence of each firm on the market factor.

In [17]:
let
    
    # initialize -
    number_of_firms = length(list_of_tickers);
    number_of_firms_to_display = 10; # number of firms to display
    df = DataFrame(); # initialize the DataFrame to hold the results

    # get the largest eigenvector, and scale it appropriately
    v₁ = V̂[:,1]; # get the largest eigenvector (associated with the largest eigenvalue)
    σ = sum(abs.(v₁)) |> T -> (1/T)*abs.(v₁)
    sortperm_indices = sortperm(σ, rev=true); # indices that would sort σ in descending order
    for i ∈ 1:number_of_firms_to_display
        firm_index = sortperm_indices[i];
        ticker = list_of_tickers[firm_index];
        influence = σ[firm_index];
        push!(df, (Ticker = ticker, Influence = influence))
    end

    # make a table display of the results -
    pretty_table(
        df;
        backend = :text,
        table_format = TextTableFormat(borders = text_table_borders__compact)
    );

end

UndefVarError: UndefVarError: `list_of_tickers` not defined in `Main`
Suggestion: add an appropriate import or assignment. This global was declared but not assigned.

In [18]:
list_of_tickers[271]

UndefVarError: UndefVarError: `list_of_tickers` not defined in `Main`
Suggestion: add an appropriate import or assignment. This global was declared but not assigned.

## Summary
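In this example, we computed the eigendecomposition of an empirical covariance matrix of daily log growth rates using the QR algorithm and checked the result against Julia's built-in routines.

> __Key Takeaways__
> 
> * __Covariance from centered data:__ The empirical covariance matrix is $\frac{1}{n-1}\tilde{\mathbf{X}}^{\top}\tilde{\mathbf{X}}$, where $\tilde{\mathbf{X}}$ is the data matrix with its column means removed. We can check a hand-written version against `cov(...)` using the Frobenius norm of the difference.
> * __QR iteration gives the eigenpairs:__ The eigenvalues and eigenvectors from the QR algorithm can be checked against `eigen(...)`, provided we sort both sets consistently and account for the sign ambiguity of eigenvectors.
> * __Eigenvectors as factors:__ The eigenvectors of $\hat{\mathbf{\Sigma}}$ are orthonormal, and the eigenvector associated with the largest eigenvalue can be interpreted as the market factor.

Together, these steps take us from raw price data to a factor view of the covariance structure of returns.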
___