# Text processing

In [None]:
# hide
%load_ext autoreload
%autoreload 2
%matplotlib inline

from datetime import date
from pathlib import Path

import numpy as np
import pandas as pd
from matplotlib import cm
from matplotlib import pyplot as plt
from skfin.plot import bar, line
from tqdm.auto import tqdm

def plot_document_embeddings(X): 
    fig, ax = plt.subplots(1, 1, figsize=(8, 7))
    years = [str(y) for y in X.index.year.unique()]
    colors = cm.RdBu(np.linspace(0, 1, len(years)))
    for i, y in enumerate(years):
        ax.scatter(x=X.loc[y][0], y=X.loc[y][1], color=colors[i])
    ax.legend(years, loc="center left", bbox_to_anchor=(1, 0.5))
    ax.set_xlabel("PC 0")
    ax.set_ylabel("PC 1")

    d = "2020-03-03"
    ax.text(x=X.loc[d][0], y=X.loc[d][1], s=d);
    
def plot_word_embeddings(H, n=6): 
    fig, ax = plt.subplots(int(n/2), 2, figsize=(20, 16), sharex=True)
    plt.subplots_adjust(wspace=0.5)
    ax = ax.ravel()
    for i in range(n):
        bar(
            H[i].sort_values(ascending=False).head(10),
            horizontal=True,
            ax=ax[i],
            title=i,
        )

In this section, we introduce several techniques to analyse a corpus of documents, using the statements of the Federal Open Market Committee (FOMC) as a running example.

## Loading the FOMC statements 

In [None]:
from skfin.datasets import load_fomc_statements
from skfin.text import show_text

statements = load_fomc_statements(force_reload=False)

In [None]:
show_text(statements)

In [None]:
special_days = ["2008-01-22", "2010-05-09", "2020-03-15"]

In [None]:
show_text(statements.loc[special_days])

## Text representation and vectorization

The progress of Natural Language Processing has been based on coming up with progressively better representations of text. In this section, we first discuss "word counting" in a given corpus and then embeddings derived from pretrained language models.

### Word counting and TFIDF 

The simplest way to extract features from text is to count words. In `scikit-learn`, this is done with the class `CountVectorizer`. A slightly more advanced approach weights words by a `TFIDF` score, defined as the product of the term frequency (`TF`) and the inverse document frequency (`IDF`). More precisely, the `TFIDF` score trades off:
- the terms that are frequent within a document and are therefore likely to be important for it;
- the terms that appear in almost all documents and therefore do not help to discriminate across documents.
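
As a minimal sketch of this trade-off, the score can be computed by hand on a toy corpus. The helper `tfidf` and the three example sentences below are hypothetical; the smoothed `IDF` formula and the L2 normalisation are assumed to mirror the `scikit-learn` defaults, but this is an illustration rather than the library's implementation.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) with L2-normalised TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (assumption: same form as scikit-learn's smooth_idf=True).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


docs = [
    "inflation remains elevated",
    "labor market remains strong",
    "inflation expectations remain stable",
]
vocab, rows = tfidf(docs)
weights = dict(zip(vocab, rows[0]))
# "remains" occurs in two documents and "elevated" in one,
# so "remains" receives the lower weight in the first document.
print(weights["remains"] < weights["elevated"])
```

Both terms occur once in the first document, so the difference in weight comes entirely from the `IDF` factor.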

In `TfidfVectorizer`, terms can additionally be filtered with:
- a `stop word` list;
- minimum and maximum document frequencies or counts;
- a token pattern (e.g. one that eliminates short tokens).

In [None]:
from sklearn.decomposition import NMF, PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

In [None]:
vectorizer = TfidfVectorizer(
    stop_words="english",
    min_df=5,
    max_df=0.8,
    ngram_range=(1, 3),
    token_pattern=r"\b[a-zA-Z]{3,}\b",
)
X_ = vectorizer.fit_transform(statements["text"].values)

In [None]:
cols = vectorizer.get_feature_names_out()
print(len(cols))
list(cols)[:10]

In what follows, to reduce the impact of large `tfidf` coefficients, we use the log transformation $x \mapsto \log(1+x)$.
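
To get a feel for how strong this compression is, the short sketch below applies the transformation to a few illustrative values. The values are hypothetical, chosen in $[0, 1]$ on the assumption that the tfidf rows are L2-normalised.

```python
import math

# Illustrative coefficients in [0, 1] (hypothetical values).
for x in (0.05, 0.2, 1.0):
    print(f"x = {x:.2f} -> log(1 + x) = {math.log1p(x):.3f}")

# Ratio between the largest and smallest coefficient, before and after.
ratio_raw = 1.0 / 0.05
ratio_log = math.log1p(1.0) / math.log1p(0.05)
print(f"ratio before: {ratio_raw:.1f}, ratio after: {ratio_log:.1f}")
```

Small coefficients are left almost unchanged while large ones are shrunk, so the gap between them narrows.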

In [None]:
X_tfidf = pd.DataFrame(np.log1p(X_.toarray()), index=statements["text"].index, columns=cols)

In [None]:
bar(X_tfidf.mean().sort_values(ascending=False).head(30), horizontal=True, title='Largest (log1p) tfidf scores') 

### Deep-learning embeddings: sentence transformers

Deep-learning models have been used heavily to power NLP applications, in particular those based on the `transformers` architecture, such as BERT, introduced by Devlin et al. (2018): "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". Sentence Transformers are language models fine-tuned from pretrained language models specifically to generate meaningful text representations (as embeddings).

- https://www.sbert.net/

In [None]:
from sentence_transformers import SentenceTransformer


def count_trainable_parameters(model):
    """Count the parameters of a torch model that require gradients."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

Here we use a specific pretrained model to derive embeddings for the corpus of Fed statements.

In [None]:
lm_name = "all-distilroberta-v1"
m = SentenceTransformer(lm_name, device="cpu", trust_remote_code=True)
X_sbert = m.encode(statements["text"].values, batch_size=2)

In [None]:
print(f'Model card:\n - model name: {lm_name}\n - number of parameters: {count_trainable_parameters(m)/1e6:.1f}m\n - embedding size: {m.get_sentence_embedding_dimension()}')

## Low-rank decomposition and visualisation

In terms of text representation, one key difference between the `tfidf` and `sentence transformer` representations is that the former is very sparse (with many zeros) while the latter is dense. For the `tfidf` representation $X$ (where each row is a document and each column is an n-gram term from the tfidf vocabulary), the idea of a low-rank approximation is to find $\hat{X}$ and $H$ such that $$ X \approx \hat{X} H^T, $$

where $\hat{X}$ is a "denser" matrix than $X$ because, intuitively, some columns have been combined.

### Principal component exploration

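We first perform Principal Component Analysis (`PCA`) using the singular-value decomposition (`svd`) function in `numpy`. Strictly speaking, the features are not centered before the decomposition, so this is a truncated SVD rather than a textbook `PCA`.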

In [None]:
u, s, w = np.linalg.svd(X_tfidf)

A singular value decomposition yields: $$X = U \times \mathrm{Diag}(s) \times W^T, $$

where $s$ is the vector of singular values and $U$ and $W$ are the matrices of left and right singular vectors (note that `np.linalg.svd` returns $W^T$, stored above in `w`). For a number of modes $n$, define $s_n$ as the vector of the first (largest) $n$ singular values, and $U_n$ and $W_n$ as the first $n$ columns of $U$ and $W$.

Then with $\hat{X} = U_n \times \mathrm{Diag}(\sqrt{s_n})$ and $H_n = W_n \times \mathrm{Diag}(\sqrt{s_n})$, we have:

$$X \approx \hat{X} H_n^T.$$ 
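
The construction can be checked on a small random matrix. The sketch below is an illustration with `numpy` only; the toy matrix `X` and the choice `n = 2` are assumptions and stand in for the tfidf features.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 5))  # toy stand-in for the (documents x terms) matrix

# full_matrices=False returns the compact factors; vt holds W^T,
# i.e. its rows are the right singular vectors.
u, s, vt = np.linalg.svd(X, full_matrices=False)

n = 2
sqrt_s = np.sqrt(s[:n])
X_hat = u[:, :n] * sqrt_s  # U_n Diag(sqrt(s_n))
H = vt[:n, :].T * sqrt_s   # W_n Diag(sqrt(s_n))

approx = X_hat @ H.T
err = np.linalg.norm(X - approx, ord="fro")

# The approximation error is the norm of the discarded singular values.
print(np.isclose(err, np.sqrt(np.sum(s[n:] ** 2))))
```

Splitting $\sqrt{s_n}$ evenly between the two factors is a convention: any rescaling of the columns of $\hat{X}$ that is undone in $H_n$ gives the same product.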

In [None]:
n_modes = 6

signed_sqrt_eigv = np.diag(np.sqrt(s[:n_modes])* np.sign(np.mean(u[:, :n_modes], axis=0)))
X_pca = pd.DataFrame(u[:, :n_modes].dot(signed_sqrt_eigv), index=statements.index)
H_ = pd.DataFrame(w[:n_modes, :].T.dot(signed_sqrt_eigv), index=cols)

We can compute a distance between the features $X$ and the approximation $\hat{X} H_n^T$. More precisely, the Frobenius norm is the square root of the sum of squared coefficients of the matrix, and it can be computed with `scipy`.

In [None]:
import scipy.linalg  # import the submodule explicitly so that scipy.linalg.norm resolves
norm_pca = scipy.linalg.norm(X_pca.dot(H_.T).sub(X_tfidf), ord='fro')

In [None]:
np.allclose(norm_pca, np.sqrt(X_pca.dot(H_.T).sub(X_tfidf).pow(2).sum().sum()))

In [None]:
def func(x, n=5):
    return pd.concat([x.nlargest(n=n), x.sort_values(ascending=False).tail(n)])

fig, ax = plt.subplots(int(n_modes/2), 2, figsize=(20, 16))
ax = ax.ravel()
plt.subplots_adjust(wspace=0.5)
for i in range(n_modes):
    bar(H_[i].pipe(func, n=5), horizontal=True, ax=ax[i], title=f'PC {i}')

The plot above shows that the first principal component `PC0` is related to the labor market and the second principal component `PC1` is related to economic growth. The graph below shows the loadings on these factors over time, by document.

- in the earlier years (1999-2009), the statements talk more about economic growth (high loading on `PC1`, low loading on `PC0`), but there is a switch in later years (2010-2020).
- interestingly, the last part of the sample (2021-2023) lies in the middle and does not seem well explained by these loadings.

In [None]:
plot_document_embeddings(X_pca)

### Non-negative matrix factorization

It is often informative to group tokens into topics that explain differences across documents. A powerful algorithm is the non-negative matrix factorisation (`NMF`): for a non-negative matrix $X$ (such as the one with tfidf scores), `NMF` finds two other non-negative matrices $\hat{X}_n$ and $H_n$ such that $$ X \approx \hat{X}_n H_n^T. $$

The number of topics (called `n_components` in the `scikit-learn` implementation) determines the number of columns in both $\hat{X}_n$ and $H_n$ (equivalently, the number of rows in $H_n^T$).
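
To make the factorisation concrete, here is a minimal sketch using the classical multiplicative updates for the Frobenius loss. This is a swapped-in technique chosen for brevity, not the coordinate-descent solver used in the notebook, and the function name `nmf_multiplicative` and the toy matrix are hypothetical.

```python
import numpy as np


def nmf_multiplicative(X, n_components, n_iter=200, eps=1e-10, seed=0):
    """Factorise non-negative X (docs x terms) as W @ H.T with W, H >= 0."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], n_components))
    H = rng.random((X.shape[1], n_components))
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so entries stay >= 0.
        W *= (X @ H) / (W @ (H.T @ H) + eps)
        H *= (X.T @ W) / (H @ (W.T @ W) + eps)
    return W, H


rng = np.random.default_rng(1)
X = rng.random((10, 6))  # toy non-negative matrix
W, H = nmf_multiplicative(X, n_components=3)
err = np.linalg.norm(X - W @ H.T, ord="fro")
print(W.shape, H.shape, round(float(err), 3))
```

Here `W` plays the role of $\hat{X}_n$ (documents by topics) and `H` the role of $H_n$ (terms by topics), so both have `n_components` columns.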

In [None]:
n_components = 8
m = NMF(
    n_components=n_components,
    init="nndsvd",
    solver="cd",
    beta_loss="frobenius",
    random_state=1,
    alpha_W=0,
    l1_ratio=0,
    max_iter=500,
).fit(X_tfidf)

In [None]:
H2_ = pd.DataFrame(m.components_.T, index=cols)
X_nmf = pd.DataFrame(m.transform(X_tfidf), index=statements.index)

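With these settings, the non-negative matrix factorization provides a slightly better approximation as measured by the Frobenius norm. Note that the comparison is not like-for-like: `NMF` uses 8 components here against 6 modes for the `PCA`, and for an equal number of components the truncated SVD has the smallest Frobenius error of any factorization of that rank.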

In [None]:
norm_nmf = scipy.linalg.norm(X_nmf.dot(H2_.T).sub(X_tfidf), ord='fro')
print(f'The norms of the approximation errors are: pca = {norm_pca:.2f}; nmf = {norm_nmf:.2f}.')

In [None]:
plot_word_embeddings(H2_)

Are these topics interesting? This is a matter of interpretation, but the graph below shows at least that these topics capture a strong element of time-clustering, which makes them a bit less useful.

In [None]:
line(X_nmf.resample("B").last().ffill(), cumsum=True, title="Cumulative topic loadings")

### UMAP

Uniform Manifold Approximation and Projection (UMAP) is a non-linear dimensionality reduction technique. It works by constructing a graph representation of the data in the original high-dimensional space and then optimizing a low-dimensional layout whose graph is as structurally similar as possible. UMAP is designed to preserve local neighbourhoods while retaining part of the global structure, which makes it well suited to visualizing clusters and relationships in high-dimensional datasets.

- https://umap-learn.readthedocs.io/en/latest/

In [None]:
from umap import UMAP

In [None]:
embedding_ = UMAP().fit_transform(X_tfidf)
X_umap = pd.DataFrame(embedding_, index=statements.index)

In [None]:
plot_document_embeddings(X_umap)

## Clustering

In [None]:
from sklearn.cluster import KMeans

In [None]:
m = KMeans(n_clusters=6).fit(X_tfidf)
X_kmeans = pd.get_dummies(pd.Series(m.labels_, index=statements.index))

H3_ = pd.DataFrame(m.cluster_centers_.T, index=cols)
norm_kmeans = scipy.linalg.norm(X_kmeans.dot(H3_.T).sub(X_tfidf), ord='fro')
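
The cell above treats clustering as another low-rank approximation: the one-hot assignment matrix plays the role of $\hat{X}$ and the cluster centers the role of $H$. The sketch below illustrates this on toy data with a plain Lloyd iteration written in `numpy`; the helper `kmeans` and the two-blob data are hypothetical and are not the `scikit-learn` implementation.

```python
import numpy as np


def kmeans(X, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd iterations; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign each row to its nearest center (squared Euclidean distance).
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster is empty.
        centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
    return labels, centers


rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 3)), rng.normal(1, 0.1, (20, 3))])
labels, centers = kmeans(X, n_clusters=2)

Z = np.eye(2)[labels]  # one-hot assignment matrix (rows x clusters)
approx = Z @ centers   # each row is replaced by its cluster center
err = np.linalg.norm(X - approx, ord="fro")
baseline = np.linalg.norm(X - X.mean(axis=0), ord="fro")
print(err < baseline)
```

Because each row of `Z` has a single non-zero entry, this factorisation is far more constrained than `PCA` or `NMF` with the same number of components.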

In [None]:
print(f'The norms of the approximation errors are: pca = {norm_pca:.2f}; nmf = {norm_nmf:.2f}; kmeans = {norm_kmeans:.2f}.')

In [None]:
plot_word_embeddings(H3_)

In [None]:
line(X_kmeans.resample("B").last().ffill(), cumsum=True, title="Cumulative cluster assignments (tfidf)")

Do the `sentence transformers` embeddings yield a decomposition that is less clustered in time?

In [None]:
m = KMeans(n_clusters=6).fit(X_sbert)
X_kmeans_ = pd.get_dummies(pd.Series(m.labels_, index=statements.index))

In [None]:
line(X_kmeans_.resample("B").last().ffill(), cumsum=True, title="Cumulative cluster assignments (sbert)")

## Supervised learning: vector representation + Elastic net

In this section, we use the corpus of FOMC statements for supervised learning. More precisely, we match the text of the statements to the decision of the committee to raise rates, decrease rates or do nothing.

In practice, this is implemented with `scikit-learn` pipelines, chaining the `TfidfVectorizer` with a logistic regression.

In [None]:
import numpy as np
from skfin.datasets import load_fomc_change_date

fomc_change_up, fomc_change_dw = load_fomc_change_date()

In [None]:
fomc_change_up, fomc_change_dw

In [None]:
other = {
    "other_dt_change": ["2003-01-09", "2008-03-16", "2011-06-22"],
    "statements_dt_change_other": ["2007-08-16"],
    "qe1": ["2008-11-25", "2008-12-01", "2008-12-16", "2009-03-18"],
    "qe2": ["2010-11-03"],
    "twist": ["2011-09-21", "2012-06-20"],
    "qe3": ["2012-09-13", "2012-12-12", "2013-12-13"],
    "corona": ["2020-03-20"],
}

In [None]:
dates = {
    "up": fomc_change_up,
    "dw": fomc_change_dw,
    "other": [d for c in other.values() for d in c],
}
dates["no change"] = statements.index.difference([d for c in dates.values() for d in c])

In [None]:
from skfin.text import coefs_plot, show_text
from sklearn.linear_model import ElasticNet, LogisticRegression
from sklearn.preprocessing import FunctionTransformer

In [None]:
est = Pipeline(
    [
        (
            "tfidf",
            TfidfVectorizer(
                vocabulary=None,
                ngram_range=(1, 3),
                max_features=500,
                stop_words="english",
                token_pattern=r"\b[a-zA-Z]{3,}\b",
            ),
        ),
        ("log1p", FunctionTransformer(np.log1p)), 
        (
            "reg",
            LogisticRegression(
                C=1, l1_ratio=0.35, penalty="elasticnet", solver="saga", max_iter=500
            ),
        ),
    ]
)
X, y = pd.concat(
    [
        statements.loc[fomc_change_up].assign(change=1),
        statements.loc[fomc_change_dw].assign(change=-1),
    ]
).pipe(lambda df: (df["text"], df["change"]))
est.fit(X, y)
vocab_ = pd.Series(est.named_steps["tfidf"].vocabulary_).sort_values().index

In [None]:
interpret_coef = pd.DataFrame(np.transpose(est.named_steps["reg"].coef_), index=vocab_)
coefs_plot(interpret_coef, title="Interpreted coefficients for trained model")

A useful trick is to replace the logistic regression with a linear regression (e.g. `ElasticNet`) on the labels $\pm 1$: it is typically faster to fit and often performs as well (sometimes even better).
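
The idea of regressing directly on $\pm 1$ labels can be illustrated on synthetic data. The sketch below is an assumption-laden toy: it uses a closed-form ridge penalty in `numpy` instead of `ElasticNet` to stay self-contained, and the features, `beta_true` and `alpha` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5])  # hypothetical coefficients
# Binary labels coded as +1 / -1, as for rate increases and decreases.
y = np.where(X @ beta_true + 0.1 * rng.normal(size=n) > 0, 1.0, -1.0)

# Penalised least squares on the +/-1 labels (ridge closed form).
alpha = 1.0
beta = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

# The fitted value is a continuous score; its sign gives the predicted class.
score = X @ beta
accuracy = np.mean(np.sign(score) == y)
print(f"in-sample accuracy: {accuracy:.2f}")
```

The continuous score is also what makes the regression output convenient to plot over time, which a hard class label would not allow.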

In [None]:
est = Pipeline(
    [
        (
            "tfidf",
            TfidfVectorizer(
                vocabulary=None,
                ngram_range=(1, 3),
                max_features=500,
                stop_words="english",
                token_pattern=r"\b[a-zA-Z]{3,}\b",
            ),
        ),
         ("log1p", FunctionTransformer(np.log1p)), 
        ("reg", ElasticNet(alpha=0.01)),
    ]
)
X, y = pd.concat(
    [
        statements.loc[fomc_change_up].assign(change=1),
        statements.loc[fomc_change_dw].assign(change=-1),
    ]
).pipe(lambda df: (df["text"], df["change"]))
est.fit(X, y)
vocab_ = pd.Series(est.named_steps["tfidf"].vocabulary_).sort_values().index

In [None]:
interpret_coef = pd.DataFrame(np.transpose(est.named_steps["reg"].coef_), index=vocab_)
coefs_plot(interpret_coef, title="Interpreted coefficients for trained model")

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))
pred_tfidf = (
    pd.Series(est.predict(statements["text"]), index=statements.index)
    .resample("B")
    .last()
    .ffill()
)
line(
    pred_tfidf.rename("implied rate")
    .to_frame()
    .join(
        pd.Series(1, index=fomc_change_up)
        .reindex(pred_tfidf.index)
        .fillna(0)
        .rename("up")
    )
    .join(
        pd.Series(-1, index=fomc_change_dw)
        .reindex(pred_tfidf.index)
        .fillna(0)
        .rename("dw")
    ),
    sort=False,
    ax=ax,
    title="Implied interest rate (with forward information)",
)
# Use a separate name so that `cols` (the tfidf feature names) is not overwritten.
events = ["corona", "twist", "qe1", "qe2", "qe3"]
for c in events:
    ax.plot(pred_tfidf.loc[other[c]], marker="*", ms=10)
ax.legend(
    ["implied rate", "up", "down"] + events, loc="center left", bbox_to_anchor=(1, 0.5)
);

In [None]:
lexica = {
    "positive": interpret_coef.squeeze().nlargest(n=10),
    "negative": interpret_coef.squeeze().nsmallest(n=10),
}

In [None]:
idx_ = (
    pd.Series(est.predict(X), index=X.index)
    .sort_values()
    .pipe(lambda x: [x.index[0], x.index[-1]])
)
show_text(statements.loc[idx_], lexica=lexica, n=None)

### Comparison with sentence transformer embeddings

To test the usefulness of these `SentenceTransformer` embeddings, we run a regression of the rate decision on the embeddings. Warning: this is a full-sample regression, so this is just an illustration, not a statistical test.

In [None]:
df = pd.DataFrame(X_sbert, index=statements.index)
m = ElasticNet(alpha=0.01)
X_, y_ = pd.concat(
    [df.loc[fomc_change_up].assign(change=1), df.loc[fomc_change_dw].assign(change=-1)]
).pipe(lambda df: (df.drop("change", axis=1), df["change"]))
m.fit(X_, y_);

pred_sbert = (
    pd.Series(m.predict(df), index=statements.index).resample("B").last().ffill()
)

In [None]:
corr_tfidf_sbert = pd.concat({"sbert": pred_sbert, "tfidf": pred_tfidf}, axis=1).corr().iloc[0, 1]
print(f'The correlation of the in-sample predictions for the decisions of the Fed for the two text representations (tfidf and sbert) is {corr_tfidf_sbert:.2f}.')

In [None]:
line(
    pd.concat({"sbert": pred_sbert, "tfidf": pred_tfidf}, axis=1).pipe(
        lambda x: x.div(x.std())
    )
)