# Activity: Estimating Single Index Models (SIMs) from Historical Data
In this activity, students will learn how to estimate single index models from historical data. Single index models are widely used in finance to model the returns of assets based on their relationship with a market index.

> __Learning Objectives:__
>
> By the end of this activity, you will be able to:
> * **Compute continuously compounded growth rates from historical price data** - We will transform daily S&P500 stock price data into growth rate matrices and learn how to structure financial time series data for econometric analysis, gaining practical experience with real market data spanning multiple years.
>
> * **Estimate single index model parameters using ordinary least squares** - We will formulate and solve the linear regression problem to estimate alpha (firm-specific return) and beta (market sensitivity) parameters for individual stocks, learning how to set up data matrices and apply closed-form solutions for parameter estimation.
>
> * **Quantify parameter uncertainty through confidence interval construction** - We will compute standard errors, estimate residual variance, and construct confidence intervals for our parameter estimates, learning essential techniques for statistical inference and uncertainty quantification in financial modeling.

Let's go!
___

## Background
Single index models are factor models that consider only the return (growth) of the market factor. These models were originally developed by Sharpe, 1963: [Sharpe, William F. (1963). "A Simplified Model for Portfolio Analysis". Management Science, 9(2): 277-293. doi:10.1287/mnsc.9.2.277.](https://pubsonline.informs.org/doi/abs/10.1287/mnsc.9.2.277)

Suppose the growth of firm $i$ at time $t$ is denoted by $\mu^{(t)}_{i}$. Then, the single index model of the return (growth rate) is given by:
$$
\mu^{(t)}_{i} = \alpha_{i} + \beta_{i}\;\mu^{(t)}_{M} + \epsilon^{(t)}_{i},
$$
where $\alpha_{i}$ is the _idiosyncratic (firm-specific) growth_, $\beta_{i}$ is the component of the growth rate of firm $i$ explained by the market (it is also a measure of risk), and $\epsilon^{(t)}_{i}$ denotes an error model associated with firm $i$ (it describes the growth rate not captured by the firm or market factors).


### Parameters
The parameters of the single index model have some interesting interpretations.

* The $\alpha_{i}$ parameter is the idiosyncratic (firm-specific) growth, which captures the growth rate of firm $i$ that is __not__ explained by the market index. 
* The $\beta_{i}$ parameter has two meanings: it is a measure of the growth rate of firm $i$ explained by the market index, and it is also a measure of risk. A higher $\beta_{i}$ indicates that the growth rate of firm $i$ is more sensitive to changes in the market index, and thus the firm is riskier.

Let's dig into the meaning of the $\beta_{i}$ parameter a little more, starting with the growth interpretation. Assuming $\mu^{(t)}_{M} \neq 0$, we can rearrange the SIM as:
$$
\begin{align*}
\mu^{(t)}_{i} &= \alpha_{i} + \beta_{i}\;\mu^{(t)}_{M} + \epsilon^{(t)}_{i}\\
\mu^{(t)}_{i} - \alpha_{i} - \epsilon^{(t)}_{i} &= \beta_{i}\;\mu^{(t)}_{M}\\
\underbrace{\frac{\mu^{(t)}_{i} - \alpha_{i} - \epsilon^{(t)}_{i}}{\mu^{(t)}_{M}}}_{\text{fraction explained by market}} &= \beta_{i}\quad\blacksquare\\
\end{align*}
$$
The risk interpretation of $\beta$ is more subtle. To understand this, let's start by taking the variance of both sides of the SIM:
$$
\begin{align*}
\text{Var}\left(\mu^{(t)}_{i}\right) &= \text{Var}\left(\alpha_{i} + \beta_{i}\;\mu^{(t)}_{M} + \epsilon^{(t)}_{i}\right)\\
&= \text{Var}\left(\alpha_{i}\right) + \text{Var}\left(\beta_{i}\;\mu^{(t)}_{M}\right) + \text{Var}\left(\epsilon^{(t)}_{i}\right)\\
&= 0 + \beta_{i}^{2}\;\text{Var}\left(\mu^{(t)}_{M}\right) + \text{Var}\left(\epsilon^{(t)}_{i}\right)\\
\sigma_{i}^{2} &= \beta_{i}^{2}\;\sigma_{M}^{2} + \sigma_{\epsilon,i}^{2}\quad\blacksquare
\end{align*}
$$
where we used the fact that $\alpha_{i}$ is a constant (variance is zero), $\beta_{i}$ is a constant that can be factored out of the variance, and we assume that the error term $\epsilon^{(t)}_{i}$ is uncorrelated with the market growth $\mu^{(t)}_{M}$. 

> __Risk__: The total risk of firm $i$ (measured by $\sigma_{i}^{2}$) consists of two components:  __Systematic risk__: $\beta_{i}^{2}\;\sigma_{M}^{2}$ and __Idiosyncratic risk__: $\sigma_{\epsilon,i}^{2}$.
> The systematic risk is the risk that comes from exposure to market movements, while the idiosyncratic risk is the firm-specific risk that is independent of the market.
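
To make the risk decomposition concrete, here is a minimal sketch that checks it numerically on synthetic data. The parameter values, the sample size, and the Gaussian draws are all assumptions chosen for illustration; nothing here comes from the market dataset.

```julia
using Random, Statistics

rng = MersenneTwister(1234);       # fixed seed so the sketch is reproducible
T = 100_000;                       # number of synthetic observations (assumed)
α, β = 0.1, 1.2;                   # hypothetical SIM parameters
μM = 0.5 .+ 2.0 .* randn(rng, T);  # synthetic market growth with σ_M = 2.0
ϵ = 1.5 .* randn(rng, T);          # synthetic noise, drawn independently of μM
μi = α .+ β .* μM .+ ϵ;            # firm growth generated by the SIM

total_risk = var(μi);              # σᵢ²
systematic_risk = β^2 * var(μM);   # β² σ_M²
idiosyncratic_risk = var(ϵ);       # σ_ϵ²
println((total_risk, systematic_risk + idiosyncratic_risk));
```

The two printed values agree only approximately, because the sample covariance between the noise and the market is small but not exactly zero in a finite sample.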

Now, to derive the formula for $\beta_{i}$, we need to use the covariance relationship. Taking the covariance of both sides of the SIM with the market growth $\mu^{(t)}_{M}$:
$$
\begin{align*}
\text{Cov}\left(\mu^{(t)}_{i}, \mu^{(t)}_{M}\right) &= \text{Cov}\left(\alpha_{i} + \beta_{i}\;\mu^{(t)}_{M} + \epsilon^{(t)}_{i}, \mu^{(t)}_{M}\right)\\
&= \text{Cov}\left(\alpha_{i}, \mu^{(t)}_{M}\right) + \text{Cov}\left(\beta_{i}\;\mu^{(t)}_{M}, \mu^{(t)}_{M}\right) + \text{Cov}\left(\epsilon^{(t)}_{i}, \mu^{(t)}_{M}\right)\\
&= 0 + \beta_{i}\;\text{Cov}\left(\mu^{(t)}_{M}, \mu^{(t)}_{M}\right) + 0\\
&= \beta_{i}\;\text{Var}\left(\mu^{(t)}_{M}\right)\\
\text{Cov}\left(\mu^{(t)}_{i}, \mu^{(t)}_{M}\right) &= \beta_{i}\;\sigma_{M}^{2}\quad\Longrightarrow\text{solve for }\beta_{i}\\
\beta_{i} &= \frac{\text{Cov}\left(\mu^{(t)}_{i}, \mu^{(t)}_{M}\right)}{\text{Var}\left(\mu^{(t)}_{M}\right)} = \frac{\text{Cov}\left(\mu_{i}, \mu_{M}\right)}{\text{Var}\left(\mu_{M}\right)}\quad\blacksquare
\end{align*}
$$

> __Beta:__
> The $\beta_{i}$ parameter measures how much systematic risk the firm carries relative to the market. 
> * If $\beta_{i} = 1$, the firm moves in lockstep with the market. 
> * If $\beta_{i} > 1$, the firm is more volatile than the market (amplifies market movements). 
> * If $\beta_{i} < 1$, the firm is less volatile than the market (dampens market movements).
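
As a quick check of the covariance formula, the sketch below computes $\beta$ as $\text{Cov}(\mu_{i},\mu_{M})/\text{Var}(\mu_{M})$ on synthetic data and compares it with the slope of a least-squares line. The data and the "true" parameter values are assumptions made up for this example.

```julia
using Random, Statistics

rng = MersenneTwister(42);                       # fixed seed for reproducibility
T = 5_000;                                       # synthetic sample size (assumed)
μM = randn(rng, T);                              # synthetic market growth
μi = 0.05 .+ 0.8 .* μM .+ 0.3 .* randn(rng, T);  # hypothetical α = 0.05, β = 0.8

β_cov = cov(μi, μM) / var(μM);  # covariance formula for β
X = [ones(T) μM];               # intercept column and market column
θ = X \ μi;                     # least-squares fit of the SIM
β_ols = θ[2];                   # slope from the regression
println((β_cov, β_ols));
```

The two estimates coincide up to floating-point error, which is why estimating a SIM by ordinary least squares recovers the same $\beta$ as the covariance formula.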

Wow! That's pretty cool! But, how do we estimate the parameters of the SIM? Let's take a look at that next.
___

## Setup, Data, and Prerequisites
First, we set up the computational environment by including the `Include.jl` file and loading any needed resources.

> __Include:__ The [`include(...)` command](https://docs.julialang.org/en/v1/base/base/#include) evaluates the contents of the input source file, `Include.jl`, in the notebook's global scope. The `Include.jl` file sets paths, loads required external packages, etc. For additional information on functions and types used in this material, see the [Julia programming language documentation](https://docs.julialang.org/en/v1/). 

Let's set up our code environment:

In [1]:
include(joinpath(@__DIR__, "Include.jl")); # include the Include.jl file

In addition to standard Julia libraries, we'll also use [the `VLDataScienceMachineLearningPackage.jl` package](https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl). Check out [the documentation](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/) for more information on the functions, types, and data used in this material.

### Data
We gathered a daily open-high-low-close dataset for each firm in the [S&P500](https://en.wikipedia.org/wiki/S%26P_500) from `01-03-2014` until `12-31-2024`, along with data for a few exchange-traded funds and volatility products during that time. 

Let's load the `original_dataset::DataFrame` by calling [the `MyTrainingMarketDataSet()` function](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/data/#VLDataScienceMachineLearningPackage.MyTrainingMarketDataSet) and remove firms that do not have the maximum number of trading days. The cleaned dataset $\mathcal{D}$ will be stored in the `dataset` variable.

In [2]:
original_dataset = MyTrainingMarketDataSet() |> x-> x["dataset"];

Not all tickers in our dataset have the maximum number of trading days for various reasons, e.g., acquisition or de-listing events. Let's collect only those tickers with the maximum number of trading days.

First, let's compute the number of records for a firm that we know has the maximum number of trading days, e.g., `AAPL`, and save that value in the `maximum_number_trading_days::Int64` variable:

In [3]:
maximum_number_trading_days = original_dataset["AAPL"] |> nrow;

Now, let's iterate through our data and collect only tickers with `maximum_number_trading_days` records. Save that data in the `dataset::Dict{String,DataFrame}` variable:

In [4]:
dataset = Dict{String,DataFrame}();
for (ticker,data) ∈ original_dataset
    if (nrow(data) == maximum_number_trading_days)
        dataset[ticker] = data;
    end
end
dataset;
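
The same cleaning step can be sketched without hard-coding a reference ticker. The toy dictionary below uses plain vectors in place of data frames, and the ticker names are hypothetical; the sketch assumes that the longest series in the collection defines the full trading history.

```julia
# toy stand-in for the original dataset: ticker => price series (hypothetical values)
original = Dict("AAA" => [1.0, 2.0, 3.0], "BBB" => [1.0, 2.0], "CCC" => [4.0, 5.0, 6.0]);

max_len = maximum(length, values(original));                     # longest series in the collection
cleaned = filter(kv -> length(kv.second) == max_len, original);  # keep only full-length series
tickers = sort(collect(keys(cleaned)));                          # alphabetical ticker list
println(tickers);
```

Here `filter` applied to a dictionary passes each `key => value` pair to the predicate, so the loop and the explicit `push` into a new dictionary are not needed.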

How many firms do we have with the full number of trading days? Let's use [the `length(...)` method](https://docs.julialang.org/en/v1/base/collections/#Base.length) - notice this works for dictionaries, in addition to arrays, sets, and other collections.

In [5]:
length(dataset) # tells us how many keys are in the dictionary (how many firms in our dataset?)

424

Finally, let's get a list of the firms in our cleaned dataset (and sort them alphabetically). We store the sorted firm ticker symbols in the `list_of_tickers::Array{String,1}` variable.

In [6]:
list_of_tickers = keys(dataset) |> collect |> sort; # list of firm "ticker" symbols in alphabetical order

___

## Task 1: Compute the growth rate matrix
In this task, we compute the growth rate array which contains, for each day and each firm in our dataset, the value of the growth rate between time index $j-1$ and $j$.

>  __Continuously Compounded Growth Rate (CCGR)__
>
> Let's assume a model of the share price of firm $i$ is governed by an expression of the form:
>$$
\begin{align*}
S^{(i)}_{j} &= S^{(i)}_{j-1}\;\exp\left(\mu^{(i)}_{j,j-1}\Delta{t}_{j}\right)
\end{align*}
> $$
> where $S^{(i)}_{j-1}$ denotes the share price of firm $i$ at time index $j-1$, $S^{(i)}_{j}$ denotes the share price of firm $i$ at time index $j$, and $\Delta{t}_{j} = t_{j} - t_{j-1}$ denotes the length of a time step (units: years) between time index $j-1$ and $j$. The value we are going to estimate is the growth rate $\mu^{(i)}_{j,j-1}$ (units: inverse years) for each firm $i$, and each time step in the dataset.
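
Solving the price model for the growth rate gives $\mu^{(i)}_{j,j-1} = \frac{1}{\Delta{t}_{j}}\ln\left(S^{(i)}_{j}/S^{(i)}_{j-1}\right)$. The sketch below applies this to a short, made-up price series. It is not the package implementation, which may differ in detail (for example, in which price it uses or how it handles the risk-free rate).

```julia
S = [100.0, 101.0, 99.5, 102.0];  # hypothetical daily prices for one firm
Δt = 1/252;                       # one trading day in units of years

# growth rate between consecutive time indices j-1 and j (units: 1/year)
μ = [(1/Δt) * log(S[j]/S[j-1]) for j ∈ 2:length(S)];

# sanity check: compounding the growth rates forward recovers the prices
S_reconstructed = S[1] .* exp.(cumsum(μ .* Δt));
println(μ);
```

A series with $T$ prices yields $T-1$ growth rates, which is why the growth rate array has one fewer row than the number of trading days.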

We've implemented [the `log_growth_matrix(...)` function](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/data/#VLDataScienceMachineLearningPackage.log_growth_matrix) which takes the cleaned dataset and a list of ticker symbols, and returns the growth rate array. Each row of the growth rate array is a time step, while each column corresponds to a firm from the `list_of_tickers::Array{String,1}` array.

In [7]:
growth_rate_array = let

    # initialize -
    τ = (1/252); # time-step one-day in units of years (trading year is 252 days)
    r̄ = 0.0; # assume the risk-free rate is 0

    # compute the growth matrix -
    growth_rate_array = log_growth_matrix(dataset, list_of_tickers, Δt = τ, 
        risk_free_rate = r̄); # other optional parameters are at their defaults

    growth_rate_array; # return
end

2766×424 Matrix{Float64}:
 -0.877554    6.28105    -2.87097     …  -0.755391   0.245894  -1.00527
  2.81626     1.07149     1.39239         2.13832   -0.80279    0.986468
  3.31305     0.855597    0.00536803      0.109877   1.191     -2.58144
  0.646425   17.2599      1.69215         0.274716   3.1593    -0.368228
  1.81609     2.57961     3.31924         0.621677  -2.1687     4.40309
  0.61383    -3.96384    -0.79278     …  -0.862739  -1.90977   -3.11624
  2.86071    -0.483751    4.84573         1.7657    -1.77685   -1.0896
  2.04671     1.0135      1.90809         1.67597    4.44984   -0.137819
  1.31289     1.67413     0.107259       -1.50708   -2.13696    1.43784
  1.22016     6.12957     0.932578       -1.53202    2.87784   -1.43626
  ⋮                                   ⋱                        
 -4.36889     3.84443    -2.37452        -4.26011   -9.17906   -3.94641
 -2.51182    -2.60891   -10.1209         -3.03895   -7.07468   -7.14019
  2.21355     4.15066     7.27678         3.

> **Growth Rate Matrix Structure**
>
> The `growth_rate_array` is a matrix $\mathbf{G} \in \mathbb{R}^{m \times n}$ where each **row** represents a trading day (time step) in our dataset, each **column** represents a firm from the S&P500, and each **element** $G_{i,j}$ contains the continuously compounded growth rate for firm $j$ on day $i$.

The matrix has 424 firms (columns) and there are $T-1$ = 2,766 trading days (rows), capturing the daily growth rate dynamics of the S&P500 components from 2014 to 2024. Is there redundancy in the data?

Let's check the rank of the growth rate array using [the `rank(...)` function](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.rank):

In [8]:
rank(growth_rate_array) # tells us the rank of the growth rate array

424

The growth rate matrix has **full column rank** (rank = 424), which means that all 424 firms contribute independent information to the dataset. 

> __Why is this significant?__ This tells us that no firm's growth rate series can be written exactly as a linear combination of the others. Each stock brings unique behavior to the market, even though there may be strong correlations between them.
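
To see what a rank-deficient matrix would look like, here is a small sketch with synthetic data. We build a random matrix, append a column that is an exact linear combination of two existing columns, and compare the ranks. The matrix sizes and weights are arbitrary choices for illustration.

```julia
using LinearAlgebra, Random

rng = MersenneTwister(7);  # fixed seed for reproducibility
G = randn(rng, 50, 4);     # 50 synthetic "days" by 4 synthetic "firms"

# append a fifth "firm" that is an exact linear combination of firms 1 and 2
G_redundant = hcat(G, 0.5 .* G[:, 1] .+ 2.0 .* G[:, 2]);

println((rank(G), rank(G_redundant)));
```

The extra column adds no new information, so the rank stays at 4 even though the matrix now has 5 columns.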

Now, let's build a single index model for a single firm, to see how this works.
___

## Task 2: Build a single index model for a test firm
In this task, let's build a single index model for a test firm, so that we can see how to set up this calculation and test that it works. Let's start by picking the ticker symbol for our test firm; we'll save this in the `my_test_ticker::String` variable:

In [9]:
my_test_ticker = "AAPL"; # ticker symbol for our test firm

Next, we need to pull out the excess growth (return) of the market portfolio from `growth_rate_array::Array{Float64,2}`. To do this, look up the index for our market portfolio surrogate `SPY`, then store the growth rate (column from the growth rate array) in the `Rₘ::Array{Float64,1}` variable:

In [10]:
Rₘ = findfirst(x->x=="SPY", list_of_tickers) |> i -> growth_rate_array[:,i];

Then, we need to formulate the data matrix $\hat{\mathbf{X}}$ and the response vector $\mathbf{y}$ for our test firm. The data matrix $\hat{\mathbf{X}} \in \mathbb{R}^{(T-1) \times 2}$ will have two columns: a column of ones (to account for the intercept term $\alpha$) and a column containing the market excess growth $R_{m}(t)$ values. The response vector $\mathbf{y} \in \mathbb{R}^{(T-1)}$ will contain the excess growth values for our test firm.

In [28]:
X̂,y = let

    # get the growth values for our test firm -
    Rᵢ = findfirst(x-> x== my_test_ticker, list_of_tickers) |> j-> growth_rate_array[:, j];

    max_length = length(Rᵢ);
    y = Rᵢ;
    X̂ = [ones(max_length) Rₘ];
    
    X̂,y # return
end;

Now, we can estimate the single index model parameters $\theta=(\alpha, \beta)$ for our test firm by solving the ordinary least squares problem:
$$
\begin{equation*}
\min_{\alpha, \beta} \sum_{t=1}^{T-1} \left(R_{i}(t) - \alpha - \beta R_{m}(t)\right)^2
\end{equation*}
$$
which has the closed form solution:
$$
\begin{align*}
\hat{\mathbf{\theta}} &= \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{y}
\end{align*}
$$
if $\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}$ is invertible, which holds exactly when $\hat{\mathbf{X}}$ has full column rank. In our case, $\hat{\mathbf{X}}$ has two columns, so we need to check that $\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}$ has rank 2.


In [30]:
@assert rank(transpose(X̂) * X̂) == 2 # check that X'X is invertible

If we get here without an error, then we know that $\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}$ is invertible, and we can compute the single index model parameters for our test firm. Let's save these values in the $\hat{\mathbf{\theta}}$ variable:

In [31]:
θ̂ = inv(transpose(X̂) * X̂) * transpose(X̂) * y; # compute the SIM parameters for our test firm
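
Forming the inverse explicitly mirrors the closed-form expression, but Julia's backslash operator solves the same least-squares problem without computing the inverse. The sketch below compares the two on synthetic data; the sample size and "true" parameters are assumptions for illustration only.

```julia
using LinearAlgebra, Random

rng = MersenneTwister(2024);                  # fixed seed for reproducibility
n = 500;                                      # synthetic sample size (assumed)
x = randn(rng, n);                            # synthetic market growth
y = 0.1 .+ 1.2 .* x .+ 0.5 .* randn(rng, n);  # hypothetical α = 0.1, β = 1.2

X = [ones(n) x];                                     # data matrix with intercept column
θ_normal = inv(transpose(X) * X) * transpose(X) * y; # normal equations, as in the closed form
θ_backslash = X \ y;                                 # least squares via the backslash operator
println((θ_normal, θ_backslash));
```

For a well-conditioned two-column problem like this one the two answers agree closely; the backslash form is usually preferred when the columns of the data matrix are nearly collinear.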

## Task 3: Uncertainty quantification of the single index model parameters
In this task, we'll compute the uncertainty in our single index model parameters $\hat{\mathbf{\theta}}$ for our test firm. To do this, we need to compute the standard errors of the parameter estimates $\mathrm{SE}(\hat{\mathbf{\theta}})$, which requires us to estimate the variance of the error terms $\hat{\sigma}^{2}$. 

Let's start there.

> __Theory:__ Since the __true__ error variance $\sigma^2$ is unknown, we estimate it with $\hat{\sigma}^2$, computed from the residuals $\mathbf{r} = \mathbf{y} - \hat{\mathbf{X}}\hat{\mathbf{\theta}}$ as:
> $$
\begin{align*}
\hat{\sigma}^{2} &= \frac{\lVert~\mathbf{r}~\rVert^{2}_{2}}{n-p} = \frac{1}{n-p}\sum_{i=1}^{n}r_i^2
\end{align*}
> $$
> where $n$ is the number of observations, $p$ is the number of parameters, $\lVert\star\rVert_{2}^{2}$ denotes the $\ell_2$ norm squared, and $r_i = y_i - \hat{\mathbf{x}}_i^{\top}\hat{\mathbf{\theta}}$ is the $i$-th residual, i.e., the difference between the observed and predicted value for observation $i$.

We implement this computation in the code block below and save the result in the `training_variance::Float64` variable:

In [34]:
training_variance = let

    # initialize -
    p = length(θ̂); # number of parameters
    n = size(X̂,1); # number of training observations

    # compute the residual -
    r = y .- X̂*θ̂; # residual vector
    
    # variance -
    my_variance = (1/(n-p))*norm(r)^2

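    # for comparison: Julia's var(...) with corrected=true divides by (n-1), not (n-p),
    # so the two estimates differ slightly when p > 1
    built_in_variance = var(r, corrected=true); # sample variance of the residuals
    @show built_in_variance, my_variance; # show both values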

    my_variance; # return
end;

(built_in_variance, my_variance) = (7.293793994188719, 7.296432848745242)


Next, let's compute the standard error. The standard error of the parameter estimates $\hat{\mathbf{\theta}}$ quantifies the uncertainty in the estimated parameters due to the variability in the data. Let's compute the standard error for each parameter $\hat{\theta}_j$ that we estimated in Task 2.

We save the standard errors in the `SE::Vector{Float64}` variable, where element $j$ corresponds to the standard error of a parameter estimate $\text{SE}(\hat{\theta}_j)$.

In [35]:
SE = let
    
    # initialize -
    p = length(θ̂); # number of parameters
    n = size(X̂,1); # number of training samples

    # compute the standard error -
    SE = sqrt.(diag(inv(transpose(X̂)*X̂))*training_variance);

    SE; # return
end;

Now that we have the standard error for each of the model parameters, we can compute the uncertainty in the parameter estimates $\hat{\mathbf{\theta}}$. Let's compute confidence intervals for each parameter estimate.
> __Confidence Intervals:__ A $(1-\alpha) \times 100\%$ confidence interval for each parameter $\hat{\theta}_j$ is given by:
> $$
\begin{align*}
\hat{\theta}_j \pm t_{1-\alpha/2,\nu}\; \hat{\sigma}\; \sqrt{\bigl[(\hat{\mathbf X}^\top\hat{\mathbf X})^{-1}\bigr]_{jj}}
\end{align*}
> $$
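> where $t_{1-\alpha/2,\nu}$ is the $(1-\alpha/2)$-quantile of a Student $t$ distribution with $\nu = n - p$ degrees of freedom, and $\alpha$ here denotes the significance level (not the SIM intercept $\alpha_{i}$). When $\nu$ is large, as it is here with thousands of daily observations, the $t$ quantile is close to the standard normal quantile: for a 95% confidence interval, $\alpha = 0.05$ and $t_{1-\alpha/2,\nu} \approx 1.96$; for a 99.9% confidence interval, $\alpha = 0.001$ and $t_{1-\alpha/2,\nu} \approx 3.291$.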
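
The estimate, residual variance, standard error, and interval steps can be collected into one small function. The sketch below does this on synthetic data; `ols_confidence_intervals` is a hypothetical helper written for this example (it is not part of the course package), and the default multiplier `t = 1.96` assumes a large sample.

```julia
using LinearAlgebra, Random

# Hypothetical helper: OLS estimate with symmetric confidence intervals
function ols_confidence_intervals(X::Matrix{Float64}, y::Vector{Float64}; t::Float64 = 1.96)
    n, p = size(X);                               # observations, parameters
    θ = X \ y;                                    # least-squares estimate
    r = y .- X * θ;                               # residual vector
    s2 = sum(abs2, r) / (n - p);                  # residual variance estimate
    SE = sqrt.(s2 .* diag(inv(transpose(X) * X))); # standard error of each parameter
    return (θ = θ, lower = θ .- t .* SE, upper = θ .+ t .* SE);
end

rng = MersenneTwister(11);                    # fixed seed for reproducibility
n = 2_000;                                    # synthetic sample size (assumed)
x = randn(rng, n);                            # synthetic market growth
y = 0.1 .+ 1.2 .* x .+ 0.5 .* randn(rng, n);  # hypothetical α = 0.1, β = 1.2
ci = ols_confidence_intervals([ones(n) x], y);
println(ci);
```

Passing a different `t` value, e.g., `t = 3.291`, widens the intervals to the 99.9% level.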

Let's build a table that shows the parameter ranges for a 95.0% confidence interval using [the `PrettyTables.jl` package](https://github.com/ronisbr/PrettyTables.jl). (You can adjust this to show another confidence interval if you like).

In [36]:
let

    # initialize -
    t = 1.96; # for a 95% confidence interval
    df = DataFrame(); # hold the data (rows) for the table

    # build features of the table -
    feature_labels = Array{String,1}();
    push!(feature_labels, "α");
    push!(feature_labels, "β");

    for i ∈ eachindex(θ̂)

        center = θ̂[i];
        lower_bound = θ̂[i] - t*SE[i];
        upper_bound = θ̂[i] + t*SE[i];

        row_df = (
            i = i,
            feature = feature_labels[i],
            p = round(center, digits=4),
            l = round(lower_bound, digits=4),
            u = round(upper_bound, digits=4),
            cz = (lower_bound <= 0.0 <= upper_bound ? "yes" : "no")
        ) # data for the row

        push!(df, row_df) # add the row to the dataframe
    end

    # show the table -
    pretty_table(df, backend = :text,
        table_format = TextTableFormat(borders = text_table_borders__simple)) # simple text borders (newer PrettyTables table-format API)
end

      i   feature         p         l         u       cz
  Int64    String   Float64   Float64   Float64   String
      1         α    0.1061    0.0053    0.2068       no
      2         β    1.1946    1.1476    1.2415       no


___

## Disclaimer and Risks

__This content is offered solely for training and informational purposes__. No offer or solicitation to buy or sell securities or derivative products or any investment or trading advice or strategy is made, given, or endorsed by the teaching team.

__Trading involves risk__. Carefully review your financial situation before investing in securities, futures contracts, options, or commodity interests. Past performance, whether actual or indicated by historical tests of strategies, is no guarantee of future performance or success. Trading is generally inappropriate for someone with limited resources, investment or trading experience, or a low-risk tolerance. Only risk capital that is not required for living expenses should be used.

__You are fully responsible for any investment or trading decisions you make__. Such decisions should be based solely on evaluating your financial circumstances, investment or trading objectives, risk tolerance, and liquidity needs.