ECON0127: Statistical Learning for Public Policy
# Assignment 8

**Instructions**: This assignment is voluntary and does not count towards your final assessment. It will be discussed in your tutorial session on March 20. 

##### **General Feedback**: 

To be updated - sorry!

##### **Part 0: Load and Explore Data**

In this assignment, we continue to use the text data from the **10-K reports** filed by publicly traded firms in the U.S. in 2019. Recall that in Assignment 4 we built boosted trees to predict sector membership from word counts. Now that we have learned about word embeddings and large language models, can they help us predict sector membership better?

The raw data of 10-K reports has a total of 1,744,131 sentences for 4,033 firms. However, to decrease training time, we will work with a subset of 22 firms from 4 different economic sectors.

In [1]:
import pandas as pd
import numpy as np

from transformers import BertTokenizer, AutoModel
import torch

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

In [2]:
# read data
file_id = "1eQB8rwSklyVD3u8sZImFII74b7jBeUIL"
df = pd.read_parquet(f"https://drive.google.com/uc?export=download&id={file_id}&authuser=0&export=download")
print(df.shape)
df.head()

(6798, 9)


| | sentences | cik | year | sent_no | sent_id | naics2 | naics2_name | sentence_len | keep_sent |
|---|---|---|---|---|---|---|---|---|---|
| 0 | The following discussion sets forth the materi... | 19617 | 2019 | 0 | 19617_0 | 52 | Finance and Insurance | 18 | True |
| 1 | Readers should not consider any descriptions o... | 19617 | 2019 | 1 | 19617_1 | 52 | Finance and Insurance | 23 | True |
| 2 | Any of the risk factors discussed below could ... | 19617 | 2019 | 2 | 19617_2 | 52 | Finance and Insurance | 53 | True |
| 3 | JPMorgan Chase's businesses are highly regulat... | 19617 | 2019 | 4 | 19617_4 | 52 | Finance and Insurance | 25 | True |
| 4 | JPMorgan Chase is a financial services firm wi... | 19617 | 2019 | 5 | 19617_5 | 52 | Finance and Insurance | 10 | True |


In [3]:
df.shape

(6798, 9)

In [4]:
# read some observations
df.loc[1001, "sentences"]

'If spam and fake accounts increase on Twitter, this could hurt our reputation for delivering relevant content or reduce user growth rate and user engagement and result in continuing operational cost to us.'

In [3]:
# explore the economic sectors covered by the data
df.groupby("naics2_name").size()

naics2_name
Finance and Insurance    1824
Information              2341
Manufacturing            1588
Retail Trade             1045
dtype: int64

In [6]:
# how many firms are in the data?
df["cik"].nunique()

22

In [4]:
# Add labels as Y
sector_labels = df["naics2"].astype("category").cat.codes
df["sector_labels"] = sector_labels
df["sector_labels"].value_counts()

sector_labels
2    2341
3    1824
0    1588
1    1045
Name: count, dtype: int64

##### **Part 1: BERT Base Model**

**Q1**. Tokenize the text data from 10-K filings using the pre-trained model `bert-base-uncased`. Randomly select one sentence, print out the original text and the tokenized version.

*Instructions*: 

- We initialize a tokenizer using `BertTokenizer.from_pretrained()` and put in the model name.
- In Stephen's lecture, we tokenized one sentence at a time. Alternatively, you can batch-tokenize by converting the values in the column `sentences` into a list and feeding that list to the tokenizer at once.
- Specify several hyperparameters: `truncation` (whether or not to truncate the sequences longer than specified length), `max_length` (maximum number of tokens per sequence), `padding` (pad all sequences to the same size), `return_tensors` (data type of results).
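To make the effect of `truncation`, `max_length`, and `padding` concrete, here is a toy sketch in plain Python. It is not the Hugging Face implementation: `encode` and `TOY_VOCAB` are hypothetical names, and words outside the toy vocabulary are mapped to `[UNK]`, whereas the real tokenizer would split them into word pieces. The special-token IDs (`[PAD]`=0, `[UNK]`=100, `[CLS]`=101, `[SEP]`=102) are those of `bert-base-uncased`.

```python
# Toy illustration of truncation and padding (not the real tokenizer).
PAD, UNK, CLS, SEP = 0, 100, 101, 102          # bert-base-uncased special-token IDs
TOY_VOCAB = {"any": 2151, "of": 1997, "these": 2122}  # tiny hypothetical vocabulary


def encode(sentence, max_length, truncation=True, padding="max_length"):
    """Return input_ids and attention_mask for one whitespace-split sentence."""
    ids = [TOY_VOCAB.get(word, UNK) for word in sentence.lower().split()]
    if truncation:
        ids = ids[: max_length - 2]            # leave room for [CLS] and [SEP]
    ids = [CLS] + ids + [SEP]
    mask = [1] * len(ids)                      # 1 marks a real token
    if padding == "max_length":
        n_pad = max(0, max_length - len(ids))
        ids += [PAD] * n_pad
        mask += [0] * n_pad                    # 0 marks padding
    return {"input_ids": ids, "attention_mask": mask}


long_enc = encode("Any of these events could negatively impact", max_length=5)
short_enc = encode("Any of", max_length=5)
print(long_enc)   # truncated to 5 positions, no padding needed
print(short_enc)  # padded with one [PAD] at the end
```

With `max_length=5`, only three word tokens survive next to `[CLS]` and `[SEP]`, so a small `max_length` saves compute but discards most of each sentence.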

In [6]:
# Tokenize the sentences
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer

encoded_sentences = tokenizer(list(df["sentences"].values),     # list of sequences we want to tokenize
                              truncation=True,                  # truncate sequences longer than specified length
                              max_length=5,                    # maximum number of tokens per sequence (kept small to reduce compute cost)
                              padding="max_length",             # pad all sequences to the same size
                              return_tensors='pt'               # data type of results
                              )

In [7]:
# examine BERT's tokenization in detail for a random sentence
i = np.random.randint(0, len(df)) # pick a random sentence
print("Original sentence:")
print(df.loc[i, "sentences"]) # print the original sentence
print("\n------------------------------------------\n")
print("Tokens:")
temp_tokens = encoded_sentences["input_ids"][i] # token ID sequence for this sentence
print(tokenizer.convert_ids_to_tokens(temp_tokens)) # print the tokens that correspond to the IDs
print("\n------------------------------------------\n")
print("Tokens IDs:")
print(temp_tokens)
# The token IDs are the numeric sequence that the BERT tokenizer produces from the original sentence as model input.
# Each number is the unique index of a token (a word or subword) in BERT's vocabulary.

Original sentence:
Any of these events could negatively impact Cat Financial's business, as well as our and Cat Financial's results of operations and financial condition.

------------------------------------------

Tokens:
['[CLS]', 'any', 'of', 'these', '[SEP]']

------------------------------------------

Tokens IDs:
tensor([ 101, 2151, 1997, 2122,  102])


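<small>The embedding of [CLS] serves as the sentence representation in BERT's sentence-level classification tasks.
[SEP] separates sentences or marks the end of a sentence.</small>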

**Q2**. Load the model `bert-base-uncased` and get sequence embedding by passing the tokenized sentences to the model.

*Instructions*: 

- Initialize the model using `AutoModel.from_pretrained()`.
- Pass the tokens through the model. 
- You can follow the code from Stephen's lecture.
- It may take a while to run this step.


In [8]:
# Load BERT model
model = AutoModel.from_pretrained("bert-base-uncased",
                                  output_hidden_states=True,
                                  output_attentions=True,
                                  attn_implementation="eager"
                                  )
# Pass through model
with torch.no_grad():
    model_output = model(**encoded_sentences)

**Q3**. Extract the embeddings for predicting sector membership in the following two ways: (i) use the average of the token embeddings, as taught in Stephen's lecture, or (ii) use the embedding of the `[CLS]` token.

In [9]:
# Extract token embeddings from the last hidden state
token_embeddings = model_output.last_hidden_state  # per-token vectors output by BERT, shape: (batch_size, sequence_length, hidden_size)

# Extract attention mask to avoid averaging over padding tokens
attention_mask = encoded_sentences["attention_mask"]  # 1 for real tokens, 0 for [PAD], shape: (batch_size, sequence_length)

# Expand attention mask to match embedding dimensions
attention_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()

# Compute mean embedding (excluding padding tokens)
sum_embeddings = torch.sum(token_embeddings * attention_mask_expanded, dim=1)
sum_mask = attention_mask_expanded.sum(dim=1)
mean_embeddings = sum_embeddings / sum_mask  # Shape: (batch_size, hidden_size)
mean_embeddings.shape

torch.Size([6798, 768])

In [10]:
# Extract CLS embeddings
cls_embeddings = model_output.last_hidden_state[:, 0, :].numpy() # all sentences, first token of each sentence ([CLS]), all embedding dimensions
print(cls_embeddings.shape)

(6798, 768)
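To see why the attention mask matters when averaging, the following toy sketch compares masked mean pooling, a naive mean that includes padding, and the `[CLS]` vector. It uses plain Python lists so the arithmetic is visible; the numbers are made up, and `masked_mean` and `naive_mean` are hypothetical helper names, not functions from the lecture.

```python
# One toy "sentence": 4 token positions, hidden size 2 (made-up values).
token_embeddings = [
    [1.0, 0.0],  # [CLS]
    [3.0, 2.0],  # a word token
    [2.0, 4.0],  # [SEP]
    [9.0, 9.0],  # [PAD], should not influence the sentence vector
]
attention_mask = [1, 1, 1, 0]  # 1 = real token, 0 = padding


def masked_mean(vectors, mask):
    """Average only the positions whose mask value is 1."""
    n_real = sum(mask)
    dim = len(vectors[0])
    return [sum(v[d] * m for v, m in zip(vectors, mask)) / n_real for d in range(dim)]


def naive_mean(vectors):
    """Average over every position, padding included."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


mean_vec = masked_mean(token_embeddings, attention_mask)
naive_vec = naive_mean(token_embeddings)
cls_vec = token_embeddings[0]

print("masked mean:", mean_vec)
print("naive mean: ", naive_vec)
print("CLS vector: ", cls_vec)
```

Note that the masked mean, like the code in the cell above, still includes the `[CLS]` and `[SEP]` positions, because their attention-mask value is 1.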


**Q4**. Now we build multinomial logistic regression models to predict the sector membership.

*Instructions*:

- Create the sector label as numbers 0 to 3, corresponding to the sector classification based on `naics2`.
- Train **two** multinomial logistic regression models:
  1. One using mean embeddings as input features X.
  2. One using CLS embeddings as input features X.
- Evaluate both models using `classification_report` to compare their performance.
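As a reminder of what a multinomial logistic regression computes: each class gets a linear score of the features, and a softmax turns the scores into class probabilities. A minimal sketch in plain Python follows; the weights, biases, and feature vector are made-up numbers for illustration, not fitted values from this assignment, and `predict_proba` here is a hypothetical helper rather than the scikit-learn method.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    top = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(x, weights, biases):
    """One linear score per class, followed by a softmax."""
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    return softmax(scores)


# Hypothetical parameters: 4 classes (sectors), 2 features.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
biases = [0.0, 0.0, 0.0, 0.0]
x = [2.0, 0.5]

probs = predict_proba(x, weights, biases)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```

In the assignment, the features are the 768-dimensional embeddings, so each of the four classes has 768 weights plus an intercept.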

In [11]:
# Add labels as Y
sector_labels = df["naics2"].astype("category").cat.codes
df["sector_labels"] = sector_labels
df["sector_labels"].value_counts()

sector_labels
2    2341
3    1824
0    1588
1    1045
Name: count, dtype: int64

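<small>You do not need to merge mean_embeddings into df, because the order is already aligned:

Each row of mean_embeddings follows the same order as the sentences passed to BERT (that is, the order of df["sentences"]).</small>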

In [12]:
# Convert embeddings to NumPy for ML models
X = mean_embeddings.cpu().numpy() # convert the PyTorch tensor to a NumPy array for use with sklearn
y = np.array(sector_labels) # convert the labels to a NumPy array

# Train classification models
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X, y)

# Predictions
log_reg_preds = log_reg.predict(X)

# Evaluation
print("Logistic Regression Performance:")
print(classification_report(y, log_reg_preds))

Logistic Regression Performance:
              precision    recall  f1-score   support

           0       0.60      0.48      0.53      1588
           1       0.56      0.33      0.42      1045
           2       0.53      0.72      0.61      2341
           3       0.59      0.56      0.57      1824

    accuracy                           0.56      6798
   macro avg       0.57      0.52      0.53      6798
weighted avg       0.57      0.56      0.55      6798



In [13]:
# Use CLS embeddings to train models
X_cls = cls_embeddings
log_reg.fit(X_cls, y)

# Predictions
log_reg_preds_cls = log_reg.predict(X_cls)

# Evaluation
print("Logistic Regression Performance (CLS embeddings):")
print(classification_report(y, log_reg_preds_cls))

# Accuracy
print("Accuracy (mean embeddings):", accuracy_score(y, log_reg_preds))
print("Accuracy (CLS embeddings):", accuracy_score(y, log_reg_preds_cls))

Logistic Regression Performance (CLS embeddings):
              precision    recall  f1-score   support

           0       0.56      0.43      0.49      1588
           1       0.59      0.32      0.42      1045
           2       0.51      0.71      0.59      2341
           3       0.57      0.54      0.56      1824

    accuracy                           0.54      6798
   macro avg       0.56      0.50      0.51      6798
weighted avg       0.55      0.54      0.53      6798

Accuracy (mean embeddings): 0.5613415710503089
Accuracy (CLS embeddings): 0.5406001765225066
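The precision, recall, F1, and support columns in the reports above can be reproduced by hand from true and predicted labels. The sketch below does this for a small made-up label set (not the assignment's data); `per_class_metrics` is a hypothetical helper name.

```python
# Made-up labels for three classes, for illustration only.
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 2, 1, 2, 2, 2, 2, 0, 2]


def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, F1, and support for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn


accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = {label: per_class_metrics(y_true, y_pred, label) for label in sorted(set(y_true))}

for label, (precision, recall, f1, support) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
print(f"accuracy={accuracy:.2f}")
```

Low recall for a class, as for class 1 in the reports above, means many sentences from that sector are assigned to other sectors.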


**Q5. [Bonus]** Compare the above models with boosted trees with word counts as input.
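For intuition about the word-count input, here is a rough sketch of the kind of matrix a bag-of-words model builds. It approximates the default behaviour of `CountVectorizer` (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary) with the standard library only; the two example sentences are made up, and `word_counts` is a hypothetical helper.

```python
import re
from collections import Counter

# Same pattern as CountVectorizer's default token_pattern: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def word_counts(sentence):
    """Lowercase the sentence and count each token."""
    return Counter(TOKEN_PATTERN.findall(sentence.lower()))


docs = [
    "Spam accounts hurt user growth.",
    "User growth drives user engagement.",
]

# Vocabulary: every distinct token across documents, sorted alphabetically.
vocab = sorted(set().union(*(word_counts(d) for d in docs)))

# Document-term matrix: one row per sentence, one column per vocabulary word.
matrix = [[word_counts(d)[w] for w in vocab] for d in docs]

print(vocab)
for row in matrix:
    print(row)
```

Unlike the BERT embeddings, these features ignore word order and context entirely, and each column corresponds to one specific word.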

In [None]:
# Build an XGBoost model with word counts as input
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

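# This code classifies the raw sentences with a bag-of-words model (CountVectorizer) plus an XGBoost classifier and evaluates the result.
# Build a pipeline: CountVectorizer() turns sentences into word-count features (bag of words), then XGBClassifier() classifies them.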
pipe = make_pipeline(CountVectorizer(), XGBClassifier()) 

# Train model without train-test split
pipe.fit(df["sentences"], df["sector_labels"])

# Predictions
xgb_preds = pipe.predict(df["sentences"])

# Evaluation
print("XGBoost Performance:")
print(classification_report(y, xgb_preds))
print("Accuracy:", accuracy_score(y, xgb_preds))

# Try an out-of-sample test
# Train model with train-test split
X_train, X_test, y_train, y_test = train_test_split(df["sentences"], y, test_size=0.2, random_state=42)
pipe.fit(X_train, y_train)

# Predictions
xgb_preds = pipe.predict(X_test)

# Evaluation
print("XGBoost Performance:")
print(classification_report(y_test, xgb_preds))
print("Accuracy:", accuracy_score(y_test, xgb_preds))

XGBoost Performance:
              precision    recall  f1-score   support

           0       0.95      0.83      0.89      1588
           1       0.97      0.73      0.84      1045
           2       0.79      0.96      0.87      2341
           3       0.91      0.88      0.90      1824

    accuracy                           0.87      6798
   macro avg       0.91      0.85      0.87      6798
weighted avg       0.89      0.87      0.87      6798

Accuracy: 0.8746690203000883
XGBoost Performance:
              precision    recall  f1-score   support

           0       0.76      0.61      0.68       309
           1       0.71      0.48      0.57       219
           2       0.61      0.80      0.69       441
           3       0.76      0.76      0.76       391

    accuracy                           0.69      1360
   macro avg       0.71      0.66      0.68      1360
weighted avg       0.71      0.69      0.69      1360

Accuracy: 0.6919117647058823


##### **Part 2: Domain-specific Models**

So far, we have used the base version of BERT to generate features. However, one concern with this approach is that the language in the training data used for base BERT (i.e. Wikipedia and books) might be very different from the language in 10-K reports. To alleviate this concern, we will use a different version of the model trained on 260,773 10-K filings from 1993-2019.

This family of models is called `SEC-BERT` and was developed by the Natural Language Processing Group at the Athens University of Economics and Business. 

**Q6.** Repeat the process in Part 1, but tokenize and embed with the model `nlpaueb/sec-bert-base` instead. Compare the modeling performance.

In [None]:
sec_tokenizer = BertTokenizer.from_pretrained("nlpaueb/sec-bert-base")

sec_sentences = sec_tokenizer(list(df["sentences"].values),     # list of sequences we want to tokenize
                              truncation=True,                  # truncate sequences longer than specified length
                              max_length=60,                    # maximum number of tokens per sequence
                              padding="max_length",             # pad all sequences to the same size
                              return_tensors='pt'               # data type of results
                              )

sec_model = AutoModel.from_pretrained("nlpaueb/sec-bert-base",
                                  output_hidden_states=True,
                                  output_attentions=True,
                                  attn_implementation="eager"
                                  )
# Pass through model
with torch.no_grad():
    sec_model_output = sec_model(**sec_sentences)

# Extract the cls embeddings
sec_cls_embeddings = sec_model_output.last_hidden_state[:, 0, :].numpy()



In [None]:
# Use CLS embeddings to train models
X_cls_sec = sec_cls_embeddings
y = np.array(sector_labels)

# Train classification models
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_cls_sec, y)

# Predictions
log_reg_preds_cls = log_reg.predict(X_cls_sec)

# Evaluation
print("Logistic Regression Performance (CLS embeddings):")
print(classification_report(y, log_reg_preds_cls))
print("Accuracy (CLS embeddings):", accuracy_score(y, log_reg_preds_cls))

Logistic Regression Performance (CLS embeddings):
              precision    recall  f1-score   support

           0       0.84      0.79      0.82      1588
           1       0.78      0.73      0.75      1045
           2       0.80      0.85      0.83      2341
           3       0.87      0.87      0.87      1824

    accuracy                           0.83      6798
   macro avg       0.82      0.81      0.82      6798
weighted avg       0.83      0.83      0.82      6798

Accuracy (CLS embeddings): 0.8252427184466019


##### **Part 3: Exploring Google’s Chinese BERT [Bonus]**

While multilingual BERT (mBERT) is designed to handle multiple languages, using a language-specific BERT model (e.g., Chinese BERT, Arabic BERT, German BERT) often leads to better embeddings and performance for tasks in that particular language. 

In English, changing the order of certain words may still preserve the meaning (though it may sound unnatural). However, in Chinese, word order is often strictly required for grammatical correctness and meaning preservation. Here we test the impact of word order on Chinese BERT embeddings. I provide one example below, but you can experiment with different sentences.

In English: 

- "I ate a cake at restaurant yesterday." 
- "Yesterday at restaurant I ate a cake." 

In Chinese: 

- "昨天我在餐厅吃了蛋糕。"
- "我吃了蛋糕在餐厅昨天。" (incorrect grammar in Chinese)

Use `bert-base-uncased` and `bert-base-chinese` to tokenize and embed these two sentences and compare the **cosine similarity**.
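As a quick reminder of what cosine similarity measures, the sketch below evaluates it on small made-up vectors in plain Python. It depends only on the angle between two vectors, not on their length: 1 means the same direction, 0 means orthogonal, and -1 means opposite directions.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = [1.0, 2.0, 2.0]
same_direction = [2.0, 4.0, 4.0]   # a scaled by 2
orthogonal = [2.0, -1.0, 0.0]      # dot product with a is zero
opposite = [-1.0, -2.0, -2.0]      # a with the sign flipped

sim_same = cosine_similarity(a, same_direction)
sim_orth = cosine_similarity(a, orthogonal)
sim_opp = cosine_similarity(a, opposite)
print(sim_same, sim_orth, sim_opp)
```

When comparing the two models, keep in mind that similarities from different models live in different embedding spaces, so the raw numbers are only loosely comparable.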

In [None]:
# Load Chinese BERT tokenizer & model
MODEL_NAME = "bert-base-chinese"
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

# Define sentences (correct & incorrect order)
sentences = [
    "昨天我在餐厅吃了蛋糕。",  
    "我吃了蛋糕在餐厅昨天。" 
]

# Tokenize input
tokenized_inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Pass through model
with torch.no_grad():  
    outputs = model(**tokenized_inputs)

# Extract CLS embeddings
cls_embeddings = outputs.last_hidden_state[:, 0, :].numpy()  # Shape: (2, 768)

# Compute cosine similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

similarity = cosine_similarity(cls_embeddings[0], cls_embeddings[1])

# Print results
print(f"Cosine Similarity Between Correct & Incorrect Sentence: {similarity:.4f}")

Cosine Similarity Between Correct & Incorrect Sentence: 0.9130


In [None]:
# Load English BERT tokenizer & model
MODEL_NAME = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)  # BertTokenizer is already imported; AutoTokenizer was not
model = AutoModel.from_pretrained(MODEL_NAME)

# Define sentences (correct & incorrect order)
sentences = [
    "I ate a cake at restaurant yesterday.",  
    "Yesterday at restaurant I ate a cake."  
]

# Tokenize input
tokenized_inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Pass through model
with torch.no_grad():  
    outputs = model(**tokenized_inputs)

# Extract CLS embeddings
cls_embeddings = outputs.last_hidden_state[:, 0, :].numpy()  # Shape: (2, 768)

# Compute cosine similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

similarity = cosine_similarity(cls_embeddings[0], cls_embeddings[1])

# Print results
print(f"Cosine Similarity Between Correct & Incorrect Sentence: {similarity:.4f}")

Cosine Similarity Between Correct & Incorrect Sentence: 0.9672
