# Asset Allocation with Unsupervised Machine Learning Techniques

Suppose we have $M$ assets with prices $\left\{ S_t^{(k)} \right\}_{k=1}^M$, where the quantity of the $k$-th asset at time $t$ is denoted by $x_t^{(k)}$. The investment in the $k$-th asset at time $t$ is given by $V_t^{(k)} = S_t^{(k)} \cdot x_t^{(k)}$. Since a portfolio is a linear combination of financial assets, the value of the portfolio $\Pi_{t}$ at each instant is determined by the specific combination of assets held at that moment:

$$
\Pi_{t} = \sum_{k=1}^M V_t^{(k)}
$$

The value of a portfolio can change either due to a change in the prices of the assets it contains, a rebalancing of its composition, or both. Generally, the returns of the assets over a given investment horizon are represented by an $M$-dimensional random vector $\mathbf{R}_{t}$, defined as:

$$
\mathbf{R}_{t} = \left[ R_t^{(1)}, R_t^{(2)}, \dots, R_t^{(M)} \right]^T
$$

where $R_t^{(k)} = \frac{S_{t+1}^{(k)} - S_t^{(k)}}{S_t^{(k)}}, \,\,\, \forall k = 1, 2, \dots, M$. The expected return of the assets over an investment horizon is then a real (deterministic) vector given by the expected value of the random return vector, i.e.,

$$
\boldsymbol{\mu}_{t} = \mathbb{E}\left[\mathbf{R}_{t}\right] \in \mathbb{R}^M
$$
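
As a minimal sketch of these definitions, the snippet below computes discrete returns from a small price array with NumPy. The array `prices` and its values are made up for illustration, and using the sample mean over time as an estimate of $\boldsymbol{\mu}_t$ is an assumption of this sketch, not something fixed by the definitions above.

```python
import numpy as np

# Hypothetical prices: rows are times t = 0..3, columns are M = 2 assets
prices = np.array([
    [100.0, 50.0],
    [110.0, 45.0],
    [121.0, 54.0],
    [121.0, 54.0],
])

# R_t^(k) = (S_{t+1}^(k) - S_t^(k)) / S_t^(k), one row per time t
returns = (prices[1:] - prices[:-1]) / prices[:-1]

# Sample mean over time, used here as a simple estimate of the expected return vector
mu_hat = returns.mean(axis=0)

print(returns)
print(mu_hat)
```

Each row of `returns` is one realization of the random vector $\mathbf{R}_t$, and `mu_hat` has one entry per asset.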

The return of a portfolio is the $\text{P&L}$ expressed as a percentage of an initial reference level, which can be taken as the size of the initial position. The $\text{P&L}$ (Profit and Loss) is the economic outcome of a trading operation over a certain period. It can come from capital appreciation of the portfolio $\Pi_{t}$, the receipt of dividends from stocks, coupon payments from bonds, etc. In discrete time, we can express the $\text{P&L}$ as $\text{P&L}_{t+1} = \Delta\Pi_{t+1} = \Pi_{t+1} - \Pi_t$. Thus, the return of a portfolio over an investment horizon $\Delta t$ is given by:

$$
\begin{split}
R_{\mathbf{w}} & = \frac{\Pi_{t+1} - \Pi_t}{\Pi_{t}} \\
               & = \frac{\sum_{k=1}^M V_{t+1}^{(k)} - \sum_{k=1}^M V_t^{(k)}}{\Pi_t} \\
               & = \frac{\sum_{k=1}^M \left( V_{t+1}^{(k)} - V_t^{(k)}\right)}{\Pi_t} \\
               & = \frac{\sum_{k=1}^M \Delta V_{t+1}^{(k)}}{\Pi_t} \\
               & = \sum_{k=1}^M \frac{\Delta V_{t+1}^{(k)}}{\Pi_t} \\
               & = \sum_{k=1}^M \frac{V_t^{(k)}}{\Pi_t} \frac{\Delta V_{t+1}^{(k)}}{V_t^{(k)}} \\
               & = \sum_{k=1}^M \frac{V_t^{(k)}}{\sum_{j=1}^M V_t^{(j)}} \frac{\Delta V_{t+1}^{(k)}}{V_t^{(k)}} \\
\end{split}
$$

Assuming the quantity of the $k$-th asset $x_t^{(k)}$ remains fixed during the investment horizon, we have $\Delta V_{t+1}^{(k)} = x_t^{(k)}\left( S_{t+1}^{(k)} - S_t^{(k)} \right)$, so the term $\frac{\Delta V_{t+1}^{(k)}}{V_t^{(k)}}$ equals the return $R_t^{(k)}$ of the $k$-th asset at time $t$. Moreover, defining the portfolio weight of the $k$-th asset at time $t$ as:

$$
w_t^{(k)} = \frac{V_t^{(k)}}{\Pi_{t}} = \frac{V_t^{(k)}}{\sum_{j=1}^M V_t^{(j)}}
$$

we find that the discrete return of the entire portfolio can be written as the weighted average of the discrete returns of its constituent assets:

$$
R_{\mathbf{w}} = \sum_{k=1}^M w_t^{(k)} R_t^{(k)}
$$

In finance, if $w_t^{(k)} > 0$, the $k$-th asset is said to be "Long", and if $w_t^{(k)} < 0$, the $k$-th asset is said to be "Short".

To simplify the notation, we can define the portfolio composition as an $M$-component vector where each component represents the individual weights of each asset:

$$
\mathbf{w}_t = \mathbf{w}_t(\mathbf{S}_t) = \left[ w_t^{(1)}, w_t^{(2)}, \dots, w_t^{(M)} \right]^T
$$

Therefore, the discrete return of the entire portfolio over the holding period is given by:

$$
R_{\mathbf{w}} = \mathbf{w}_t^T\mathbf{R}_t
$$

and the expected return of the portfolio $\Pi$ is:

$$
\mu_{\Pi} = \mathbb{E}\left[ R_{\mathbf{w}} \right] = \mathbb{E}\left[ \mathbf{w}_t^T\mathbf{R}_t \right] = \mathbf{w}_t^T \mathbb{E}\left[ \mathbf{R}_t \right] = \mathbf{w}_t^T \boldsymbol{\mu}_{t} \in \mathbb{R}
$$
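
The identity between the directly computed portfolio return and the weighted average of asset returns can be checked numerically. The following is a small sketch with made-up prices and holdings (`S_t`, `S_next` and `x` are hypothetical values), assuming the quantities stay fixed over the horizon as in the derivation above.

```python
import numpy as np

# Hypothetical prices at t and t+1 for M = 3 assets
S_t = np.array([100.0, 50.0, 20.0])
S_next = np.array([110.0, 45.0, 21.0])

# Quantities held, assumed fixed over the investment horizon
x = np.array([10.0, 20.0, 50.0])

# Position values V^(k) = S^(k) * x^(k) and portfolio values
V_t = S_t * x
V_next = S_next * x
Pi_t, Pi_next = V_t.sum(), V_next.sum()

# Portfolio weights and individual asset returns
w = V_t / Pi_t
R = (S_next - S_t) / S_t

# Portfolio return computed directly and as the weighted average w^T R
R_direct = (Pi_next - Pi_t) / Pi_t
R_weighted = w @ R

print(R_direct, R_weighted)
```

Both numbers agree up to floating-point error, and the weights sum to one by construction.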

Another important quantity to measure is risk, which can be defined as the variation between what is expected and what is observed. Given an expected return for an asset, it is reasonable to think of risk as the fluctuation of the realized return around that expected value. There are several metrics to measure this fluctuation, but in this project we will use **variance**. We start by computing the variance of the return vector:

$$
\begin{split}
\text{Var}\left[ \mathbf{R}_{t} \right] 
& = \mathbb{E}\left[ \left(\mathbf{R}_{t} - \mathbb{E}\left[ \mathbf{R}_{t} \right] \right)^2 \right] \\
& = \mathbb{E}\left[ \left(\mathbf{R}_{t} - \boldsymbol{\mu}_{t} \right)\left(\mathbf{R}_{t} - \boldsymbol{\mu}_{t} \right)^T \right] \\ \\
& = \mathbb{E}
\left[  
\begin{pmatrix}
R_t^{(1)} - \mu_t^{(1)} \\ \\
\vdots \\
R_t^{(M)} - \mu_t^{(M)}
\end{pmatrix}
\left( R_t^{(1)} - \mu_t^{(1)}, \dots, R_t^{(M)} - \mu_t^{(M)} \right)
\right] \\ \\
& = 
\begin{pmatrix}
\mathbb{E}\left[ \left(R_t^{(1)} - \mu_t^{(1)}\right)\left(R_t^{(1)} - \mu_t^{(1)}\right) \right] & \cdots & \mathbb{E}\left[ \left(R_t^{(1)} - \mu_t^{(1)}\right)\left(R_t^{(M)} - \mu_t^{(M)}\right)\right]\\
\vdots & \ddots & \vdots \\
\mathbb{E}\left[\left(R_t^{(M)} - \mu_t^{(M)}\right)\left(R_t^{(1)} - \mu_t^{(1)}\right)\right] & \cdots & \mathbb{E}\left[ \left(R_t^{(M)} - \mu_t^{(M)}\right)\left(R_t^{(M)} - \mu_t^{(M)}\right)\right]
\end{pmatrix} \\ \\
& =
\begin{pmatrix}
\text{Var}\left[ R_t^{(1)} \right] & \cdots & \text{Cov}\left[ R_t^{(1)}, R_t^{(M)} \right]\\
\vdots & \ddots & \vdots \\
\text{Cov}\left[ R_t^{(M)}, R_t^{(1)} \right] & \cdots & \text{Var}\left[ R_t^{(M)} \right]
\end{pmatrix} \\ \\
& = \boldsymbol{\Sigma}_t
\end{split}
$$

In other words, the variance of the random return vector is the covariance matrix. Using this expression, we can calculate the portfolio risk as follows:

$$
\begin{split}
\text{Var}\left[ R_{\mathbf{w}} \right] 
& = \text{Var}\left[ \mathbf{w}_t^T\mathbf{R}_t \right] \\
& = \mathbb{E}\left[ \left( \mathbf{w}_t^T\mathbf{R}_t - \mathbf{w}_t^T\boldsymbol{\mu}_{t} \right)\left( \mathbf{w}_t^T\mathbf{R}_t - \mathbf{w}_t^T\boldsymbol{\mu}_{t} \right)^T \right] \\
& = \mathbb{E}\left[ \mathbf{w}_t^T \left(\mathbf{R}_t - \boldsymbol{\mu}_{t}\right)\left(\mathbf{R}_t - \boldsymbol{\mu}_{t}\right)^T \mathbf{w}_t \right] \\
& = \mathbf{w}_t^T \mathbb{E}\left[\left(\mathbf{R}_t - \boldsymbol{\mu}_{t}\right)\left(\mathbf{R}_t - \boldsymbol{\mu}_{t}\right)^T \right]\mathbf{w}_t \\ \\
& = \mathbf{w}_t^T \boldsymbol{\Sigma}_t\mathbf{w}_t \in \mathbb{R}
\end{split}
$$
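
The quadratic form can be verified on simulated data. This sketch assumes normally distributed returns and an arbitrary weight vector, both chosen only for illustration. It compares $\mathbf{w}^T \boldsymbol{\Sigma} \mathbf{w}$, using the sample covariance matrix, with the sample variance of the portfolio return series.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated returns: N = 500 observations of M = 3 assets (illustrative only)
returns = rng.normal(loc=0.001, scale=0.02, size=(500, 3))

# Arbitrary portfolio weights that sum to one
w = np.array([0.5, 0.3, 0.2])

# Sample covariance matrix of the assets (columns are variables)
Sigma = np.cov(returns, rowvar=False)

# Portfolio variance via the quadratic form w^T Sigma w
var_quadratic = w @ Sigma @ w

# Portfolio variance computed directly from the portfolio return series
var_direct = np.var(returns @ w, ddof=1)

print(var_quadratic, var_direct)
```

The two values coincide because `np.cov` and `np.var(..., ddof=1)` use the same normalization, so the identity holds exactly for sample estimates as well.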

## Principal Portfolios Approach

So far, we have established the following:

- **Portfolio composition** $\Pi$ at each time $t$: $\mathbf{w}_t$
- **Portfolio return**: $R_{\mathbf{w}} = \mathbf{w}_t^T\mathbf{R}_t$
- **Expected portfolio return**: $\mu_\Pi = \mathbb{E}\left[ R_{\mathbf{w}} \right] = \mathbf{w}_t^T \boldsymbol{\mu}_{t}$
- **Portfolio risk**: $\text{Var}\left[ R_{\mathbf{w}} \right] = \sigma_\Pi^2 = \mathbf{w}_t^T \boldsymbol{\Sigma}_t\mathbf{w}_t$

At this stage, it's relevant to ask whether it's possible to choose the portfolio composition $\mathbf{w}_t$ in such a way that it provides an optimal balance between risk and return. There are various approaches to tackle this problem, with perhaps the most well-known being the **Modern Portfolio Theory** developed by **Harry Markowitz** in 1952. However, in this project, we will take a different approach based on a widely recognized unsupervised machine learning technique called **Principal Component Analysis (PCA)**.

### Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a widely used statistical technique for dimensionality reduction in datasets. This method identifies the directions in which the data exhibit the greatest variability and transforms the data into a new coordinate system based on these directions. PCA was developed by Karl Pearson in 1901 and later refined by Harold Hotelling in the 1930s.

PCA is an unsupervised method, meaning it relies solely on a set of features $X_1, X_2, \dots, X_M$ without considering any associated response variable $Y$. Although its primary use is for dimensionality reduction, it is also applied in supervised learning problems, for example, by using the principal components as predictors in machine learning algorithms instead of the larger original set of variables. Additionally, PCA is a powerful tool for data visualization.

PCA re-expresses the data in a new coordinate system whose axes, the principal components, are ordered by the amount of variability they capture. The components form a sequence of orthogonal unit vectors: the first is the direction of the line that best fits the data, in the sense of minimizing the average squared perpendicular distance from the points to the line, and each subsequent one is the best-fitting direction orthogonal to those before it. These directions define lines and subspaces that closely approximate the data cloud.

One of the main advantages of PCA is its ability to summarize a large set of correlated variables into a smaller number of representative variables that, together, explain most of the variability of the original set. This is especially useful in contexts with a large number of variables, where visualizing the data becomes a challenge. For example, if you have $N$ observations and a set of $M$ features $X_1, X_2, \dots, X_M$, you could generate $\binom{M}{2} = \frac{M(M - 1)}{2}$ scatterplots to visualize the data. This means that for $M = 10$, there would be 45 possible scatterplots, which can become impractical if $M$ is large. Moreover, it's likely that none of these plots alone would be informative, as each captures only a fraction of the total information. Clearly, a more efficient method is needed to visualize the $N$ observations when $M$ is large, and this is where PCA is particularly useful.

PCA allows for finding a low-dimensional representation of the data that captures as much variation as possible. While each of the $N$ observations resides in an $M$-dimensional space, not all of these dimensions are equally relevant. PCA identifies a small number of dimensions that are the most interesting, where the degree of interest is measured by the amount of variation in the observations along each dimension. Each of these dimensions is a linear combination of the original $M$ features.

Therefore, PCA provides an effective solution for dimensionality reduction and data visualization in a low-dimensional space. For example, if two principal components can be derived that capture most of the variability in the data, the observations can be represented in a two-dimensional scatterplot, making it easier to identify patterns and clusters of related data.


### Recipe

In what follows we explain the procedure for finding the principal components. Before doing so, it is worth noting that in PCA the variables must be centered so that they have mean zero. In addition, the results of PCA depend on whether the variables have been individually scaled (each multiplied by a different constant).

We are looking for a mapping that transforms the original features $X_1, X_2, \dots, X_M$ into new features $Z_1, Z_2, \dots, Z_M$ such that the variance is maximal, subject to the constraint that the column vectors of the matrix representing the transformation have unit norm. Mathematically, this means that the first principal component is the normalized linear combination of the features:

$$
Z_1 = R_{11}X_1 + R_{21}X_2 + \dots + R_{M1}X_M = \sum_{j=1}^M R_{j1}X_j = \mathbf{R}_1^T \mathbf{X} = 
\left( R_{11}, R_{21}, \dots, R_{M1} \right)
\begin{bmatrix}
X_1 \\
X_2 \\
\vdots \\
X_M
\end{bmatrix}
$$

We refer to the elements $R_{11}, R_{21}, \dots, R_{M1}$ as the loadings of the first principal component; together, they form the loading vector of the principal component, $\mathbf{R}_1 = (R_{11}, R_{21}, \dots, R_{M1})^T$. We constrain the loadings so that their sum of squares equals one, since otherwise letting these elements be arbitrarily large in absolute value could yield an arbitrarily large variance. Therefore, to obtain the first principal component we must find the linear combination of the feature values that has the largest sample variance subject to the constraint $\lVert \mathbf{R}_1 \rVert^2 = \sum_{j=1}^M R_{j1}^2 = 1$. In other words, the loading vector of the first principal component solves the constrained optimization problem:

$$
\max_{R_{11}, \dots, R_{M1}} \left\{ \text{Var}\left[ Z_1 \right] \right\}, \,\,\, \text{subject to} \,\,\, \lVert \mathbf{R}_1 \rVert^2 = 1
$$

For convenience, we write the variance of $Z_1$, which is the objective function, in matrix form:

$$
\text{Var}\left[ Z_1 \right] = \text{Var}\left[ \mathbf{R}_1^T \mathbf{X} \right] = \mathbf{R}_1^T \boldsymbol{\Sigma} \mathbf{R}_1
$$

Therefore, the problem takes the form:

$$
\max_{R_{11}, \dots, R_{M1}} \left\{ \mathbf{R}_1^T \boldsymbol{\Sigma} \mathbf{R}_1 \right\}, \,\,\, \text{subject to} \,\,\, \mathbf{R}_1^T \mathbf{R}_1 = 1
$$

A well-known method for solving constrained optimization problems is the method of Lagrange multipliers. To apply it to our problem we introduce the Lagrangian, defined as the objective function minus a multiplier times each of the constraints. In our case, the Lagrangian takes the form:

$$
\mathcal{L}(\mathbf{R}_1) = \mathbf{R}_1^T\boldsymbol{\Sigma} \mathbf{R}_1 - \alpha_1\left( \mathbf{R}_1^T\mathbf{R}_1 - 1 \right)
$$

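To maximize the Lagrangian we look for its critical points. Differentiating with respect to $\mathbf{R}_1$, using the symmetry of $\boldsymbol{\Sigma}$, and setting the gradient to zero gives:

$$
\frac{\partial \mathcal{L}}{\partial \mathbf{R}_1} = 2\boldsymbol{\Sigma}\mathbf{R}_1 - 2\alpha_1\mathbf{R}_1 = \mathbf{0} \quad \Longrightarrow \quad \boldsymbol{\Sigma}\mathbf{R}_1 = \alpha_1\mathbf{R}_1
$$

so $\mathbf{R}_1$ is an eigenvector of $\boldsymbol{\Sigma}$ with eigenvalue $\alpha_1$. Left-multiplying by $\mathbf{R}_1^T$ and using the constraint $\mathbf{R}_1^T\mathbf{R}_1 = 1$ yields $\text{Var}\left[ Z_1 \right] = \mathbf{R}_1^T\boldsymbol{\Sigma}\mathbf{R}_1 = \alpha_1$, so the variance is maximized by choosing the eigenvector associated with the largest eigenvalue.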

To summarize the procedure: when performing PCA, the first principal component of a set of $M$ variables is the derived variable, formed as a linear combination of the original variables, that explains the most variance. The second principal component explains the most variance in what remains once the effect of the first component is removed, and we can proceed through $M$ iterations until all the variance is explained. PCA is most commonly used when many of the variables are highly correlated with each other and it is desirable to reduce them to a smaller set of uncorrelated variables. Equivalently, the first principal component can be defined as a direction that maximizes the variance of the projected data. The $i$-th principal component can be taken as a direction, orthogonal to the first $i-1$ principal components, that maximizes the variance of the projected data.
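
A minimal sketch of this procedure with NumPy, relying on the standard result that the loading vectors are the eigenvectors of the covariance matrix $\boldsymbol{\Sigma}$, ordered by decreasing eigenvalue. The data are synthetic, and names such as `latent`, `loadings` and `explained` are illustrative choices, not part of the project code.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: N = 1000 observations of M = 3 correlated features
latent = rng.normal(size=(1000, 1))
X = latent @ np.array([[2.0, 1.0, 0.5]]) + 0.3 * rng.normal(size=(1000, 3))

# Step 1: center each feature so that it has mean zero
Xc = X - X.mean(axis=0)

# Step 2: sample covariance matrix of the features
Sigma = np.cov(Xc, rowvar=False)

# Step 3: eigen-decomposition (eigh is for symmetric matrices, eigenvalues ascending)
eigvals, eigvecs = np.linalg.eigh(Sigma)

# Step 4: reorder so the first component has the largest variance
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
loadings = eigvecs[:, order]  # columns play the role of R_1, ..., R_M

# Step 5: principal component scores Z = Xc R and share of variance explained
Z = Xc @ loadings
explained = eigvals / eigvals.sum()

print(explained)
```

The sample variance of each column of `Z` equals the corresponding eigenvalue, and the columns of `loadings` are orthonormal, which matches the unit-norm and orthogonality constraints described above.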

## Dataset Description

In [1]:
# Import libraries
import yfinance as yf
import pandas as pd
import datetime

In [2]:
# Define date range
start_date = "2022-01-01"
end_date = "2023-12-31"

# Get the symbols of S&P 500 companies
sp500_tickers = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')[0]
tickers = sp500_tickers['Symbol'].tolist()

# Collect one DataFrame per ticker and concatenate once at the end
frames = []

# Iterate over each ticker to get relevant data
for ticker in tickers:
    try:
        # Download historical data from Yahoo Finance
        ticker_data = yf.download(ticker, start=start_date, end=end_date)
    except Exception as e:
        print(f"Could not download data for {ticker}: {e}")
        continue
    # Skip tickers for which no price data came back
    if ticker_data.empty:
        continue
    ticker_data['Ticker'] = ticker
    frames.append(ticker_data)

# Stack all tickers into a single long-format DataFrame
historical_data = pd.concat(frames)

# Drop rows with missing data
historical_data.dropna(inplace=True)

[download progress bars omitted]

$BF.B: possibly delisted; No price data found  (1d 2022-01-01 -> 2023-12-31)

1 Failed download:
['SOLV']: YFChartError("%ticker%: Data doesn't exist for startDate = 1641013200, endDate = 1703998800")

In [3]:
# Reset index
historical_data.reset_index(inplace=True)

# Display the first few rows of the processed DataFrame
display(historical_data.head())

|   | Date | Open | High | Low | Close | Adj Close | Volume | Ticker |
|---|------|------|------|-----|-------|-----------|--------|--------|
| 0 | 2022-01-03 | 149.096985 | 149.740799 | 147.023407 | 148.612045 | 131.330872 | 2309117.0 | MMM |
| 1 | 2022-01-04 | 149.230774 | 151.555191 | 148.854507 | 150.693985 | 133.170715 | 3016551.0 | MMM |
| 2 | 2022-01-05 | 148.102005 | 151.98996 | 147.993317 | 150.075256 | 132.623947 | 3531070.0 | MMM |
| 3 | 2022-01-06 | 151.237457 | 151.571899 | 148.444809 | 148.829437 | 131.52301 | 2996458.0 | MMM |
| 4 | 2022-01-07 | 148.938126 | 150.911377 | 148.177261 | 150.459869 | 132.963837 | 3349039.0 | MMM |


In [4]:
historical_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 248472 entries, 0 to 248471
Data columns (total 8 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   Date       248472 non-null  datetime64[ns]
 1   Open       248472 non-null  float64       
 2   High       248472 non-null  float64       
 3   Low        248472 non-null  float64       
 4   Close      248472 non-null  float64       
 5   Adj Close  248472 non-null  float64       
 6   Volume     248472 non-null  float64       
 7   Ticker     248472 non-null  object        
dtypes: datetime64[ns](1), float64(6), object(1)
memory usage: 15.2+ MB


In [None]:
# Save the data to a CSV file (optional)
historical_data.to_csv('sp500_historical_data.csv', index=False)
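
The data above are in long format (one row per date and ticker), whereas the return vector $\mathbf{R}_t$ and the covariance matrix $\boldsymbol{\Sigma}_t$ are defined per asset. One possible next step, sketched here on a tiny hypothetical frame `toy` with the same column names as `historical_data`, is to pivot to one column per ticker and compute discrete returns. The choice of `Adj Close` as the price series is an assumption of this sketch.

```python
import pandas as pd

# Small hypothetical frame with the same column names as historical_data
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-03", "2022-01-04", "2022-01-05"] * 2),
    "Adj Close": [100.0, 110.0, 99.0, 50.0, 50.0, 55.0],
    "Ticker": ["AAA"] * 3 + ["BBB"] * 3,
})

# Wide format: one row per date, one column per ticker
prices = toy.pivot(index="Date", columns="Ticker", values="Adj Close")

# Discrete returns (S_{t+1} - S_t) / S_t; the row dated t+1 holds the
# return earned between t and t+1, and the first row is dropped as undefined
returns = (prices / prices.shift(1) - 1).dropna()

print(returns)
```

Applied to the full dataset, the same two steps would give an $N \times M$ matrix of returns whose columns are the assets.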