# Quantifying Firm-Level Exposure to a Climate Theme with Natural Language Processing: a Simple Framework

Following the approach of Sautner et al. (2022), one can use text mining to measure a firm's exposure to a specific theme (climate risk, in Sautner et al.).

More specifically, we define firm-level exposure to a theme as *the percentage of keywords extracted from the firm's text (in this proposal, earnings call transcripts or business descriptions) that are related (i.e. semantically similar) to a list of theme-specific keywords.*


## Keyword Extraction and Theme-Specific Keywords

We first need to extract, for each firm, a set of keywords $K_{i,t}$ from the firm-specific text. We also need to define a set $S$ of theme-specific keywords against which the firm's keywords will be compared.

In [None]:
!pip install keyphrase-vectorizers

In [None]:
from keyphrase_vectorizers import KeyphraseCountVectorizer

# Keyphrases are extracted with part-of-speech patterns; English stop words are removed
vectorizer = KeyphraseCountVectorizer(stop_words='english')

firm_text = "The profitable growth in the gas and low carbon electricity integrated value chains is one of the key axes of Total's strategy. In order to give more visibility to these businesses, a new reporting structure for the business segments’ financial information has been put in place, effective January 1, 2019 and organized around four business segments: Exploration & Production (EP), Integrated Gas, Renewables & Power segment (iGRP), Refining & Chemicals (RC) and Marketing & Services (MS). The iGRP segment spearheads Total’s ambitions in integrated gas (including LNG, liquefied natural gas) and low carbon electricity businesses. It consists of the upstream and midstream LNG activity that was previously reported in the EP segment (refer to the indicative list of assets in the Annex) and the activity previously reported in the Gas Renewables & Power segment. The new EP segment is adjusted accordingly"


In [None]:
K = vectorizer.fit([firm_text]).get_feature_names_out()
K

array(['key axes', 'services', 'natural gas', 'new ep segment',
       'exploration', 'power segment', 'ambitions', 'activity',
       'low carbon electricity businesses', 'business segments',
       'refining', 'ms', 'value chains', 'lng', 'gas renewables', 'ep',
       'indicative list', 'profitable growth', 'place', 'renewables',
       'ep segment', 'businesses', 'integrated gas', 'order', 'strategy',
       'more visibility', 'rc', 'effective january', 'production',
       'low carbon electricity', 'annex', 'financial information',
       'midstream lng activity', 'chemicals', 'igrp segment', 'marketing',
       'new reporting structure', 'gas', 'assets', 'total'], dtype=object)

In [None]:
S = ['solar pv', 'wind technology','renewables equipment']
S

['solar pv', 'wind technology', 'renewables equipment']

## Embeddings

Once the firm's keywords $K_{i,t}$ are extracted and the set of theme-specific keywords $S$ is defined, we need to transform this unstructured data (text) into a numerical representation. To do so, we use a Sentence Transformer model to create a vector representation of the meaning of each keyword $k_{i,t} \in K_{i,t}$ and each keyword $s \in S$:

\begin{equation}
Emb^{k_{i,t}} = ST(k_{i,t})
\end{equation}

and 

\begin{equation}
Emb^{s} = ST(s)
\end{equation}

where $Emb^{k_{i,t}}$ and $Emb^{s}$ are the vector representations (embeddings) of the keywords $k_{i,t}$ and $s$, computed with the Sentence Transformer model $ST(\cdot)$.

In [None]:
!pip install sentence_transformers

In [None]:
from sentence_transformers import SentenceTransformer, util
import torch

ST = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
emb_K = ST.encode(K, convert_to_tensor=True)
print(emb_K[0].shape)
print(emb_K[0])

torch.Size([384])
tensor([ 6.0319e-04, -4.7906e-02, -1.3035e-01, -7.0738e-02, -5.9981e-02,
        -2.3404e-02,  3.2279e-02, -8.2857e-03,  7.3527e-02,  9.1478e-02,
         1.1488e-01,  6.6241e-02,  1.1704e-02, -6.7941e-03, -1.5293e-02,
         2.6211e-04, -9.4862e-02, -2.5783e-02, -5.4984e-02,  1.9077e-02,
        -1.7698e-02, -8.5624e-02,  3.7961e-02, -9.3274e-03, -2.8578e-02,
         4.0039e-02,  6.8037e-02,  9.6512e-02,  4.1532e-02, -3.9542e-02,
        -2.8297e-02,  7.7898e-03, -8.0047e-03,  3.8723e-02, -1.3231e-01,
         1.9708e-02,  5.8352e-02,  2.9194e-02, -5.1628e-02,  2.8699e-02,
        -6.5272e-03,  4.1512e-02,  8.5519e-02,  3.4931e-02,  4.3499e-02,
         1.0456e-01, -6.8079e-02, -4.3619e-02,  3.9288e-03, -7.3304e-04,
        -1.4330e-02, -4.3029e-03, -7.0102e-02, -2.6189e-02,  2.5443e-02,
         1.6860e-02, -1.9855e-02,  1.1232e-03,  8.1846e-02, -3.4303e-02,
         2.7013e-02,  5.2045e-02, -7.3718e-03, -2.5660e-02, -5.9664e-03,
         2.7971e-02, -5.2767e-02,

In [None]:
emb_S = ST.encode(S, convert_to_tensor=True)

## Identifying Theme-Related Keywords with Cosine Similarity

We want to determine whether a keyword $k_{i,t}$ is related to our theme by comparing it with the theme-specific keywords.

Now that we have the vector representations $Emb^{k_{i,t}}$ and $Emb^s$ of our keywords, we can apply a principle from semantic search and measure how close two vectors are. Because the embeddings encode the meaning of the keywords, the closeness of two vectors reflects the semantic similarity of the corresponding keywords.

One way to do so is to compute the cosine similarity, which is the cosine of the angle between the two vectors. The closer the cosine similarity is to 1, the more related the keywords are:

\begin{equation}
cos(\theta_{k_{i,t},s}) = \frac{Emb^{k_{i,t}} \cdot Emb^s}{||Emb^{k_{i,t}}|| ||Emb^s||}
\end{equation}
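As an illustration of the formula above, here is a minimal sketch of cosine similarity in plain Python. The function name `cosine_similarity` and the 3-dimensional toy vectors are made up for this example; real embeddings from the model have 384 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical values, for illustration only)
emb_k = [1.0, 2.0, 0.0]
emb_s_close = [2.0, 4.0, 0.0]  # same direction as emb_k
emb_s_far = [0.0, 0.0, 3.0]    # orthogonal to emb_k

sim_close = cosine_similarity(emb_k, emb_s_close)
sim_far = cosine_similarity(emb_k, emb_s_far)
print(sim_close, sim_far)
```

Vectors pointing in the same direction give a similarity of 1 regardless of their length, while orthogonal vectors give 0.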

For each keyword $k_{i,t}$ extracted from the firm's text, the cosine similarity is computed against each theme-specific keyword $s \in S$. The maximum cosine similarity is retained and compared to a predefined threshold (0.6) to determine whether the keyword is related to our theme:


\begin{equation}
  \tau(k_{i,t})=
  \begin{cases}
    0, & \text{if}\ \max_{s \in S} cos(\theta_{k_{i,t},s}) < 0.6 \\
    1, & \text{otherwise}
  \end{cases}
\end{equation}

where $\tau(k_{i,t})$ is an indicator equal to 1 if the keyword is related to our theme and 0 otherwise.
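The thresholding rule can be sketched in a few lines of plain Python. The helper `theme_indicator`, the keyword labels and the similarity values below are hypothetical and only illustrate the rule: take the best match over the theme keywords, then compare it to the threshold.

```python
def theme_indicator(similarities, threshold=0.6):
    """Return 1 if the best similarity over theme keywords reaches the threshold, else 0."""
    if not similarities:
        raise ValueError("at least one theme keyword similarity is required")
    return 1 if max(similarities) >= threshold else 0


# Hypothetical similarities of two firm keywords against three theme keywords
toy_scores = {
    "keyword_a": [0.10, 0.72, 0.31],  # best match 0.72, above the threshold
    "keyword_b": [0.05, 0.20, 0.59],  # best match 0.59, below the threshold
}

tau = {keyword: theme_indicator(scores) for keyword, scores in toy_scores.items()}
print(tau)
```

A similarity exactly equal to the threshold counts as related, which matches the "otherwise" branch of the definition.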

In [None]:
hits = util.semantic_search(emb_K, emb_S, top_k=1)
hits

[[{'corpus_id': 1, 'score': 0.08907477557659149}],
 [{'corpus_id': 2, 'score': 0.2804577648639679}],
 [{'corpus_id': 2, 'score': 0.2923046946525574}],
 [{'corpus_id': 0, 'score': 0.11122070252895355}],
 [{'corpus_id': 0, 'score': 0.10033018887042999}],
 [{'corpus_id': 0, 'score': 0.2162817418575287}],
 [{'corpus_id': 0, 'score': 0.09399345517158508}],
 [{'corpus_id': 0, 'score': 0.2687630355358124}],
 [{'corpus_id': 2, 'score': 0.39827126264572144}],
 [{'corpus_id': 2, 'score': 0.12914493680000305}],
 [{'corpus_id': 2, 'score': 0.15724506974220276}],
 [{'corpus_id': 1, 'score': 0.1424086093902588}],
 [{'corpus_id': 2, 'score': 0.25599735975265503}],
 [{'corpus_id': 2, 'score': 0.14899525046348572}],
 [{'corpus_id': 2, 'score': 0.6965174674987793}],
 [{'corpus_id': 0, 'score': 0.18591859936714172}],
 [{'corpus_id': 2, 'score': 0.08812528103590012}],
 [{'corpus_id': 2, 'score': 0.19480028748512268}],
 [{'corpus_id': 0, 'score': 0.15173721313476562}],
 [{'corpus_id': 2, 'score': 0.7917177

In [None]:
hits[1][0]['score']

0.2804577648639679
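Each entry of `hits` stores the position of the best-matching theme keyword (`corpus_id`) and its score. The sketch below pairs each firm keyword with its best theme keyword; it uses a three-row excerpt of the outputs shown above (positions 0, 1 and 14 of `K`), and the helper name `best_matches` is an assumption made for this example.

```python
# Excerpt of the keyword array and search output shown above (positions 0, 1 and 14)
K_excerpt = ["key axes", "services", "gas renewables"]
S = ["solar pv", "wind technology", "renewables equipment"]
hits_excerpt = [
    [{"corpus_id": 1, "score": 0.08907477557659149}],
    [{"corpus_id": 2, "score": 0.2804577648639679}],
    [{"corpus_id": 2, "score": 0.6965174674987793}],
]


def best_matches(keywords, theme_keywords, hits):
    """Pair each firm keyword with its closest theme keyword and the score."""
    rows = []
    for keyword, keyword_hits in zip(keywords, hits):
        best = keyword_hits[0]  # top_k=1, so there is one hit per keyword
        rows.append((keyword, theme_keywords[best["corpus_id"]], best["score"]))
    return rows


for keyword, theme_keyword, score in best_matches(K_excerpt, S, hits_excerpt):
    print(f"{keyword!r} -> {theme_keyword!r} ({score:.3f})")
```

Inspecting these pairs is a quick way to check whether the matches above the threshold are actually meaningful for the theme.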

In [None]:
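# Indices of the firm keywords whose best theme match reaches the 0.6 threshold
T = [i for i, hit in enumerate(hits) if hit[0]['score'] >= 0.6]

# K is a NumPy array, so it can be indexed with the list of positions
print(K[T])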

['gas renewables' 'renewables']


## Computing Theme Exposure

Finally, we compute the theme exposure as the share of the firm's keywords that are related to the theme:

\begin{equation}
Exposure_{i,t} = \frac{1}{|K_{i,t}|}\sum_{k_{i,t} \in K_{i,t}}\tau(k_{i,t})
\end{equation}

In [None]:
Exposure = len(T) / len(K)
print(Exposure)

0.05
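The last two steps can be wrapped in a small reusable function. The sketch below is one possible way to do it, not part of the original notebook: the name `theme_exposure`, the toy scores, and the choice to return 0.0 for a firm with no extracted keywords are all assumptions.

```python
def theme_exposure(best_scores, threshold=0.6):
    """Share of keywords whose best theme similarity reaches the threshold.

    best_scores holds, for each firm keyword, its maximum cosine similarity
    over the theme-specific keywords.
    """
    if not best_scores:
        return 0.0  # assumption: no keywords means no measurable exposure
    related = sum(1 for score in best_scores if score >= threshold)
    return related / len(best_scores)


# Hypothetical best-match scores for a firm with four keywords
toy_best_scores = [0.70, 0.80, 0.10, 0.20]
toy_exposure = theme_exposure(toy_best_scores)
print(toy_exposure)
```

With the notebook's variables, the same function could be called as `theme_exposure([hit[0]['score'] for hit in hits])`.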
