# Asset Pricing II - Problem Set #3

## University of Chicago - Booth School of Business

**Name**: Trent Potter

**Date**: 3/5/2025

---

## Embedding Cosine Similarities

Taking the cosine similarity of the embedded company names and ranking the most similar companies, we find:

| Rank | APPLE INC                | CITIGROUP INC                        | WALMART INC                   |
| ---- | ------------------------ | ------------------------------------ | ----------------------------- |
| 1    | _0.668_ - TESLA INC      | _0.716_ - CITI TRENDS INC            | _0.666_ - PRICESMART INC      |
| 2    | _0.631_ - INTEL CORP     | _0.666_ - CINTAS CORP                | _0.634_ - PEPSICO INC         |
| 3    | _0.627_ - NIKE INC       | _0.640_ - CIVISTA BANCSHARES INC     | _0.630_ - DOW INC             |
| 4    | _0.625_ - ALPHABET INC   | _0.612_ - GOLDMAN SACHS GROUP INC    | _0.624_ - HOME DEPOT INC      |
| 5    | _0.620_ - ADOBE INC      | _0.584_ - CITIUS PHARMACEUTICALS INC | _0.619_ - AMAZON COM INC      |
| 6    | _0.618_ - MICROSOFT CORP | _0.568_ - CIGNA CORP NEW             | _0.613_ - MACYS INC           |
| 7    | _0.600_ - VISA INC       | _0.565_ - COMERICA INC               | _0.596_ - EBAY INC            |
| 8    | _0.597_ - AMAZON COM INC | _0.561_ - MASTERCARD INC             | _0.588_ - LOWES COMPANIES INC |
| 9    | _0.588_ - EBAY INC       | _0.561_ - BANK OF AMERICA CORP       | _0.581_ - APPLE INC           |
| 10   | _0.586_ - NETFLIX INC    | _0.560_ - C P I CARD GROUP INC       | _0.579_ - WAYFAIR INC         |

Apple and Walmart both give results that roughly align with company characteristics. We might expect Apple to have a higher cosine similarity with Alphabet or Microsoft than with Tesla, given that they compete in the same markets, but on balance there appears to be real information here. Walmart's results have a similar structure: a balance of direct competitors and similar but non-competing companies.

Citigroup, however, suffers from many matches that are syntactically (token-level) similar but not semantically similar: 'CI' and 'CITI' tokens recur among its nearest neighbors (Citi Trends, Cintas, Civista, Citius, Cigna). Ranks 3, 4, 7, 8, 9, and 10 are financial services companies, but there are notable omissions such as Visa.
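The ranking described above can be sketched with a few lines of NumPy. This is a minimal illustration on made-up two-dimensional vectors, not the Cohere embeddings; the helper name `top_k_neighbors` and the toy data are assumptions introduced here. It drops the query's own row, which always has similarity 1.

```python
import numpy as np

def top_k_neighbors(embeddings, names, query, k=3):
    """Return the k names most cosine-similar to `query`, excluding itself."""
    E = np.asarray(embeddings, dtype=float)
    # Normalise rows so that a dot product equals cosine similarity.
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    q = names.index(query)
    sims = unit @ unit[q]
    order = np.argsort(-sims)  # indices sorted by descending similarity
    return [(names[i], float(sims[i])) for i in order if i != q][:k]

# Toy 2-d "embeddings" (made-up numbers, not Cohere output).
toy_names = ["A", "B", "C", "D"]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neighbors = top_k_neighbors(toy_embeddings, toy_names, "A", k=2)
print(neighbors)
```

Because the similarity depends only on the angle between name embeddings, names that share tokens can rank highly even when the underlying businesses differ, which is the pattern seen for Citigroup.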

---

## Reducing Embedding Dimensions with Principal Components

First, we generate the principal components from the embeddings of different company names. With $N$ companies, $H$ original embedding dimension, and $K$ principal components we can write:

$$\mathbf{X} = \mathbf{Z} \mathbf{V}^\top$$

where:

- $\mathbf{Z}$ are the original Cohere embeddings ($N \times H$),
- $\mathbf{V}^\top$ are the principal component weights ($H \times K$),
- $\mathbf{X}$ is the matrix of transformed embeddings ($N \times K$).
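As a rough sketch of this projection, the snippet below computes principal component scores with a plain SVD on random stand-in data. The sizes and the random matrix are assumptions for illustration only. One detail the formula leaves implicit is that PCA works on column-demeaned data (scikit-learn's `PCA` centers the input as well), so the sketch demeans $\mathbf{Z}$ first.

```python
import numpy as np

rng = np.random.default_rng(0)
N, H, K = 50, 8, 3                     # toy sizes, not the real data
Z = rng.normal(size=(N, H))            # stand-in for the embedding matrix

Z_centered = Z - Z.mean(axis=0)        # PCA operates on demeaned columns
# Rows of Vt are principal directions, ordered by explained variance.
_, singular_values, Vt = np.linalg.svd(Z_centered, full_matrices=False)
V = Vt[:K]                             # K x H, so V.T is H x K
X = Z_centered @ V.T                   # N x K matrix of PC scores

print(X.shape)
```

The columns of `X` are mutually uncorrelated by construction, which keeps the regression coefficients on individual PCs from depending on which other PCs are included.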

Next, we run a regression on the transformed, lower-dimension embeddings. We use the first few principal components as the independent variables in the regression model. The regression equation can be written as:

$$
\mathbf{m} = \beta_0 + \mathbf{X} \boldsymbol{\beta}_1 + \boldsymbol{\varepsilon}
$$

where:

- $\mathbf{m}$ is the log market equity,
- $\boldsymbol{\beta}_1$ is the vector of regression coefficients,
- $\beta_0$ is an intercept term,
- $\boldsymbol{\varepsilon}$ is the regression error.
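The regression above can be illustrated with ordinary least squares on simulated data. This is only a sketch: the coefficients, noise level, and sample size are made up, and it uses `np.linalg.lstsq` rather than the `statsmodels` fit used for the reported results.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 200, 3
X = rng.normal(size=(N, K))                   # stand-in PC scores
true_beta = np.array([1.0, -2.0, 0.5])        # made-up coefficients
m = 6.0 + X @ true_beta + rng.normal(scale=0.1, size=N)

design = np.column_stack([np.ones(N), X])     # prepend the intercept column
coef, *_ = np.linalg.lstsq(design, m, rcond=None)
beta_0, beta_1 = coef[0], coef[1:]

fitted = design @ coef
r_squared = 1.0 - np.sum((m - fitted) ** 2) / np.sum((m - m.mean()) ** 2)
print(beta_0, beta_1, r_squared)
```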

The OLS regression results indicate that the first four principal components (PC1, PC2, PC3, and PC4) are statistically significant predictors of the log market equity (LNme). The fifth principal component (PC5) is not statistically significant (p-value = 0.247).

The R-squared value of 0.122 suggests that approximately 12.2% of the variance in the log market equity is explained by the first five principal components. The F-statistic of 118.9 with a p-value of 4.89e-118 indicates that the model is statistically significant overall.
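As a rough consistency check (a sketch using the rounded $R^2 = 0.122$ printed in the regression output, so the result is approximate), the overall F-statistic of a regression with an intercept, $K = 5$ regressors, and $N = 4264$ observations can be recovered from $R^2$:

$$
F = \frac{R^2 / K}{(1 - R^2) / (N - K - 1)} \approx \frac{0.122 / 5}{0.878 / 4258} \approx 118
$$

This agrees with the reported 118.9 up to the rounding of $R^2$.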

The coefficients for the principal components ($\beta_{1,1}$ to $\beta_{1,5}$) show the direction and magnitude of their relationship with the log market equity. Specifically:

- PC1 has a negative coefficient (-4.2068)
- PC2 has a positive coefficient (1.0102)
- PC3 has a negative coefficient (-2.6108)
- PC4 has a negative coefficient (-3.9875)
- PC5 is not statistically significant.

Without an interpretation of the PCs, these coefficients have little economic meaning (see the bonus section at the end for a rough interpretation of the PCs).

Overall, the results support the idea that the low-dimensional PCA representation contains information relevant to modeling the log market equity.

---

## Embedding Information Orthogonal to Book Equity

We first generate residuals $m_a^\perp$ by regressing log market equity on log book equity. We can then regress these residuals on our PCs to understand the information content beyond book equity.

$$ m_a = \alpha_0 + \alpha_1b_a + m_a^\perp $$

Next, we regress the residuals on our PCs:

$$ m_a^\perp = \beta_0 + \mathbf{X}\beta_1 + \varepsilon $$

The regression results indicate that the model explains only 2.1% of the variance in the residuals. The F-statistic is 16.24 with a p-value of 7.01e-16, suggesting that the model is statistically significant overall. However, only PC2, PC3, and PC4 are statistically significant predictors of the residuals, with PC2 having a positive coefficient (1.1131), and PC3 (-0.5724) and PC4 (-0.5853) having negative coefficients. The constant term and the coefficients for PC1 and PC5 are not statistically significant.

The reduction in $R^2$ and the lower significance imply that some of the explanatory power of the original model simply reflected information already present in book equity. Even with this loss, the result suggests that the embeddings capture additional cross-sectional information about market equity. Again, without an interpretation of the PCs, there is little economic intuition for the fitted slopes.
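The two-step procedure in this section can be sketched as follows. All data here are simulated and the helper `ols` is a hypothetical convenience function, so the numbers it prints are unrelated to the results reported above. The sketch shows that the first-stage residuals are orthogonal to book equity by construction, so anything the PCs explain in the second stage is not linearly attributable to book equity.

```python
import numpy as np

def ols(y, X):
    """Fit y on X with an intercept; return (coefficients, residuals, R^2)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return coef, resid, r2

rng = np.random.default_rng(2)
N = 500
b = rng.normal(size=N)                    # stand-in for log book equity
pcs = rng.normal(size=(N, 2))             # stand-in for PC scores
m = 0.5 + 0.9 * b + 0.3 * pcs[:, 0] + rng.normal(scale=0.5, size=N)

_, m_perp, r2_book = ols(m, b)            # step 1: m on b, keep residuals
coef_pc, _, r2_resid = ols(m_perp, pcs)   # step 2: residuals on PCs
print(r2_book, r2_resid)
```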

---

## Bonus: Interpretation of the PC Loadings

1. Get the component loadings in the original 1024 embedding space
2. Find the companies with highest (and lowest) cos-similarity to the PC loading

- PC 1 - Medical Research to Banks
  - `+ Therapeutics`
  - `- Financial, Bank, Banc` (Dollar General gets looped in here)
- PC 2 - Industrials to Banks
  - `+ Technology, Industrial, Automation` (Bias toward acronyms as well?)
  - `- Bancshares, Bancorp, First, Community`
- PC 3 - Acquisitions to Banks
  - `+ Group, Holdings, Acquisitions`
  - `- Bancorp, Bank, Bancsystem`
- PC 4 - Tech to CPG
  - `+ Future, Quantum, Technologies`
  - `- Consumer Packaged Goods`
- PC 5 - Energy to Travel
  - `+ Energy, Resources`
  - `- Expedia, Booking, Tripadvisor, Paypal`


In [112]:
import os
import cohere
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
import statsmodels.api as sm

file_path = 'cohere.key'
if os.path.exists(file_path):
  with open(file_path, 'r') as file:
    cohere_key = file.read().strip()
else:
  cohere_key = os.getenv('COHERE_KEY')
co = cohere.Client(api_key=cohere_key)
df = pd.read_parquet('data_jkp.parquet')
df.drop_duplicates(subset=['comnam'], inplace=True)

In [113]:
# Generate embeddings & cosine similarities (cache to file)
embeddings_file_path = 'embeddings.parquet'
if os.path.exists(embeddings_file_path):
  embeddings_df = pd.read_parquet(embeddings_file_path)
else:
  comnam_list = df['comnam'].tolist()
  embeddings = co.embed(texts=comnam_list, model='embed-english-v3.0', input_type='clustering').embeddings
  embeddings_df = pd.DataFrame(embeddings, index=df.index)
  embeddings_df.to_parquet(embeddings_file_path)
df['embedding'] = embeddings_df.apply(lambda row: np.array(row), axis=1)
embeddings = np.stack(df['embedding'].values)
cosine_sim_file_path = 'cosine_sim.parquet'
if os.path.exists(cosine_sim_file_path):
  cosine_sim_df = pd.read_parquet(cosine_sim_file_path)
else:
  cosine_sim_matrix = cosine_similarity(embeddings)
  cosine_sim_df = pd.DataFrame(cosine_sim_matrix, index=df['comnam'], columns=df['comnam'])
  cosine_sim_df.to_parquet(cosine_sim_file_path)

In [114]:
test_companies = ['APPLE INC', 'CITIGROUP INC', 'WALMART INC']
top_n = 11
result_dfs = []

for company in test_companies:
  if company in cosine_sim_df.columns:
    top_similarities = cosine_sim_df[company].sort_values(ascending=False).head(top_n)
    top_similarities = top_similarities.apply(lambda x: f"{x:.3f}")
    result_dfs.append(pd.Series([f"{sim}-{idx}" for idx, sim in zip(top_similarities.index, top_similarities)], name=company))
similar_companies_df = pd.concat(result_dfs, axis=1)
similar_companies_df

Unnamed: 0,APPLE INC,CITIGROUP INC,WALMART INC
0,1.000-APPLE INC,1.000-CITIGROUP INC,1.000-WALMART INC
1,0.668-TESLA INC,0.716-CITI TRENDS INC,0.666-PRICESMART INC
2,0.631-INTEL CORP,0.666-CINTAS CORP,0.634-PEPSICO INC
3,0.627-NIKE INC,0.640-CIVISTA BANCSHARES INC,0.630-DOW INC
4,0.625-ALPHABET INC,0.612-GOLDMAN SACHS GROUP INC,0.624-HOME DEPOT INC
5,0.620-ADOBE INC,0.584-CITIUS PHARMACEUTICALS INC,0.619-AMAZON COM INC
6,0.618-MICROSOFT CORP,0.568-CIGNA CORP NEW,0.613-MACYS INC
7,0.600-VISA INC,0.565-COMERICA INC,0.596-EBAY INC
8,0.597-AMAZON COM INC,0.561-MASTERCARD INC,0.588-LOWES COMPANIES INC
9,0.588-EBAY INC,0.561-BANK OF AMERICA CORP,0.581-APPLE INC


In [115]:
pca = PCA(n_components=5)
low_dim_embeddings = pca.fit_transform(embeddings)
df['low_dim_embedding'] = list(low_dim_embeddings)

In [116]:
# Regress Market Equity on PCA'd embeddings
y = df['LNme']
X = np.stack(df['low_dim_embedding'])
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                   LNme   R-squared:                       0.122
Model:                            OLS   Adj. R-squared:                  0.121
Method:                 Least Squares   F-statistic:                     118.9
Date:                Tue, 04 Mar 2025   Prob (F-statistic):          4.89e-118
Time:                        14:09:11   Log-Likelihood:                -9530.5
No. Observations:                4264   AIC:                         1.907e+04
Df Residuals:                    4258   BIC:                         1.911e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          6.2807      0.035    181.196      0.0

In [117]:
# Regress Market Equity on Book Equity, grab the residuals
df_cleaned = df.dropna().copy()
y = df_cleaned['LNme']
X = df_cleaned['LNbe']
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
print(model.summary())
df_cleaned['residual'] = model.resid

                            OLS Regression Results                            
Dep. Variable:                   LNme   R-squared:                       0.773
Model:                            OLS   Adj. R-squared:                  0.773
Method:                 Least Squares   F-statistic:                 1.274e+04
Date:                Tue, 04 Mar 2025   Prob (F-statistic):               0.00
Time:                        14:09:11   Log-Likelihood:                -5878.2
No. Observations:                3734   AIC:                         1.176e+04
Df Residuals:                    3732   BIC:                         1.177e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.4800      0.056      8.573      0.0

In [118]:
# Regress (Market Equity~Book Equity) residuals  on PCA'd embeddings
y = df_cleaned['residual']
X = np.stack(df_cleaned['low_dim_embedding'])
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:               residual   R-squared:                       0.021
Model:                            OLS   Adj. R-squared:                  0.020
Method:                 Least Squares   F-statistic:                     16.24
Date:                Tue, 04 Mar 2025   Prob (F-statistic):           7.01e-16
Time:                        14:09:11   Log-Likelihood:                -5838.0
No. Observations:                3734   AIC:                         1.169e+04
Df Residuals:                    3728   BIC:                         1.173e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0112      0.019     -0.582      0.5

In [144]:
### PCA Interpretation
# For fun, which companies load highest on each of the PCs?
# We can also check out the 'negative' PC interpretation to get some interpretation of axes of variation for each component

components = pca.components_
pc_similarity = cosine_similarity(components, embeddings)
pc_similarity_df = pd.DataFrame(pc_similarity, index=[f'PC{i}' for i in range(1, 6)], columns=df['comnam'])

result_dfs = []
for component in range(len(components)):
  similarities = pc_similarity_df.iloc[component].sort_values(ascending=False)
  positive_similarities = similarities.head(top_n).apply(lambda x: f"{x:.3f}")
  negative_similarities = similarities.tail(top_n).apply(lambda x: f"{x:.3f}")
  result_dfs.append(pd.Series([f"{idx}" for idx, sim in zip(positive_similarities.index, positive_similarities)], name=f'PC{component + 1}'))
  result_dfs.append(pd.Series([f"{idx}" for idx, sim in zip(negative_similarities.index, negative_similarities)], name=f'-PC{component + 1}'))

similar_companies_df = pd.concat(result_dfs, axis=1)
for column in similar_companies_df.columns:
  print('\n\n'+column)
  for i,value in enumerate(similar_companies_df[column][:-1]):
    if i % 3 == 0 and i != 0:
      print('\t')
    print(f"{value}", end=", ")



PC1
AGILE THERAPEUTICS INC, VIVOS THERAPEUTICS INC, AGEX THERAPEUTICS INC, 	
BIORA THERAPEUTICS INC, VIKING THERAPEUTICS INC, BIOXCEL THERAPEUTICS INC, 	
VOYAGER THERAPEUTICS INC, XORTX THERAPEUTICS INC, OMEGA THERAPEUTICS INC, 	
PROTAGENIC THERAPEUTICS INC, 

-PC1
DOLLAR GENERAL CORP NEW, BANK SOUTH CAROLINA CORP, C N O FINANCIAL GROUP INC, 	
UNITED COMMUNITY BANKS INC GA, FIRST CITIZENS BANCSHARES INC NC, UNION PACIFIC CORP, 	
CAPITAL CITY BANK GROUP, CITIZENS FINANCIAL GROUP INC, FARMERS NATIONAL BANC CORP, 	
FIRST REPUBLIC BANK S F NEW, 

PC2
APPLIED INDUSTRIAL TECHS INC, TACTILE SYSTEMS TECHNOLOGY INC, MASON INDUSTRIAL TECHNOLOGY INC, 	
ROCKWELL AUTOMATION INC, E S S TECH INC, G S I TECHNOLOGY INC, 	
INDUSTRIAL TECH ACQ II INC, O S I SYSTEMS INC, T E S S C O TECHNOLOGIES INC, 	
E S C O TECHNOLOGIES INC, 

-PC2
TEXAS COMMUNITY BANCSHARES INC, COMMUNITY WEST BANCSHARES, FIRST FINANCIAL BANKSHARES INC, 	
COMMUNITY FINANCIAL CORP MD, P C B BANCORP, KENTUCKY FIRST FEDERAL BANCORP, 	
