ECON0127: Statistical Learning for Public Policy
# Assignment 9

**Instructions**: This assignment is voluntary and does not count towards your final assessment. It will be discussed in your tutorial session on March 27.

#### **Part 0. Introduction and Setup in Colab**

In the previous assignment, we explored how to use a pre-trained model to extract word embeddings and then build a multinomial logistic regression model to predict a firm’s sector membership. We experimented with both bert-base and a domain-specific model. However, we observed that these embeddings did not consistently outperform a simple word count approach — especially on the test set, where accuracy dropped significantly. This suggests the pre-trained models, without adaptation, may not generalize well to our task.

In this part, we’ll take the next step and explore **fine-tuning**, which involves training (usually with a small learning rate) all of the model’s parameters for a specific task. This approach is one of the most powerful ways to adapt large language models to downstream tasks, such as classification or question answering.

##### Using Colab for GPU Acceleration

Modern models like BERT are very large. For example:

- `bert-base` has 110 million parameters
- Even DistilBERT has 66 million parameters
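
The 110 million figure can be checked by hand. The sketch below is a rough tally, assuming the published `bert-base-uncased` configuration (vocabulary of 30,522 tokens, hidden size 768, 12 layers, feed-forward size 3,072, 512 positions, 2 segment types). These values are typed in as assumptions rather than read from the checkpoint.

```python
# Configuration values assumed from the published bert-base-uncased config.
vocab_size = 30522
hidden = 768
num_layers = 12
intermediate = 3072
max_positions = 512
type_vocab = 2


def linear(n_in, n_out):
    """Number of weights plus biases in a dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # one scale and one shift vector

# Word, position and segment embedding tables, followed by a LayerNorm.
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

per_layer = (
    4 * linear(hidden, hidden)      # query, key, value and attention-output projections
    + layer_norm
    + linear(hidden, intermediate)  # feed-forward expansion
    + linear(intermediate, hidden)  # feed-forward contraction
    + layer_norm
)

pooler = linear(hidden, hidden)     # dense layer applied to the [CLS] position

total = embeddings + num_layers * per_layer + pooler
print(f"{total:,} parameters")
```

Most of the parameters sit in the 12 encoder layers (about 85 million), with the embedding tables contributing roughly another 24 million.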

Training such models is computationally intensive, especially on CPUs. To make this process feasible, we’ll move our work to **Google Colab**, which provides access to **free GPUs** like Tesla T4s.

Before you begin, here are some important tips to ensure you use it properly:

- Make sure your Colab environment is set to use a GPU: Go to Runtime → Change runtime type → Hardware accelerator → T4 GPU

- Colab monitors resource usage. If you stay idle for too long, or keep a GPU allocated without using it, you may lose access or be temporarily blocked from using GPUs. **Always shut down the kernel when you’re done**: Go to Runtime → Manage sessions → Terminate

In [1]:
import nbformat
import json

# Read your notebook file
with open('assignment9_soln.ipynb', 'r', encoding='utf-8') as f:
    notebook = json.load(f)

# Fix the metadata.widgets issue
if 'metadata' in notebook and 'widgets' in notebook['metadata']:
    # If widgets exists but has no state, add an empty state
    if isinstance(notebook['metadata']['widgets'], dict):
        if 'state' not in notebook['metadata']['widgets']:
            notebook['metadata']['widgets']['state'] = {}
    # Alternatively, delete widgets entirely (if it is not needed)
    # del notebook['metadata']['widgets']

# Save the repaired file
with open('assignment9_soln_fixed.ipynb', 'w', encoding='utf-8') as f:
    json.dump(notebook, f, indent=2)

print("File repaired and saved as assignment9_soln_fixed.ipynb")

File repaired and saved as assignment9_soln_fixed.ipynb


In [None]:
# install required libraries
# !pip3 install transformers                  # HuggingFace library for interacting with BERT (and multiple other models)
# !pip3 install accelerate                    # fast optimization with transformers
!pip3 install datasets                      # HuggingFace library to process dataframes
# !pip3 install ipywidgets
!pip3 install evaluate                      # HuggingFace library to evaluate models



In [None]:
#### import libraries

# basic libraries
import pandas as pd
import numpy as np
import torch
import random
from IPython.core.display import HTML
from scipy.special import softmax
from tqdm import tqdm  # import tqdm

# libraries for plots and figures
import seaborn as sns
import matplotlib.pyplot as plt

# HuggingFace relevant classes
from transformers import AutoModel, BertModel, BertForSequenceClassification, AutoTokenizer, AutoModelForSequenceClassification, pipeline, TrainingArguments, Trainer, utils
from transformers import TextClassificationPipeline
from transformers.pipelines.base import KeyDataset
from datasets import load_dataset, Dataset, DatasetDict
from torch.utils.data import DataLoader
import evaluate


# scikit-learn relevant classes
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

# test GPU
print(f"GPU: {torch.cuda.is_available()}")

GPU: True


In [None]:
# read data
file_id = "1eQB8rwSklyVD3u8sZImFII74b7jBeUIL"
df = pd.read_parquet(f"https://drive.google.com/uc?export=download&id={file_id}&authuser=0&export=download")
df.head()

Unnamed: 0,sentences,cik,year,sent_no,sent_id,naics2,naics2_name,sentence_len,keep_sent
0,The following discussion sets forth the materi...,19617,2019,0,19617_0,52,Finance and Insurance,18,True
1,Readers should not consider any descriptions o...,19617,2019,1,19617_1,52,Finance and Insurance,23,True
2,Any of the risk factors discussed below could ...,19617,2019,2,19617_2,52,Finance and Insurance,53,True
3,JPMorgan Chase's businesses are highly regulat...,19617,2019,4,19617_4,52,Finance and Insurance,25,True
4,JPMorgan Chase is a financial services firm wi...,19617,2019,5,19617_5,52,Finance and Insurance,10,True


#### **Part 1. Fine-tuning**

Fine-tuning is the process of taking a pre-trained language model (like BERT) and continuing its training on a specific downstream task, such as classification, using task-specific labeled data. Instead of training from scratch, we start with a model that already understands general language patterns, and adapt it to our domain or objective. This typically involves training all (or most) of the model’s parameters with a small learning rate, allowing it to adjust to the new task without forgetting what it has already learned.

Q1. Fine-tune the domain-specific language model `sec-bert-base` to predict a firm’s sector membership.

*Instructions*:
- Coding should be very similar to what you saw in Stephen's notebook.
- Training should take approximately 4 minutes when using a GPU.
- Save the fine-tuned model in your Google Drive using the following sample code, so you can reload it directly next time without re-training.

In [None]:
# get only the NAICS2 code for each sentence to use as the labels for our classifier
labels = df[["naics2"]]

# create list with all the indexes of available sentences
sent_idxs = list(range(0, len(labels)))

# perform a train/test split
train_idxs, test_idxs = train_test_split(sent_idxs, test_size=0.2, random_state=92)
print(f" Train sentences: {len(train_idxs)}\n", f"Test sentences: {len(test_idxs)}")

 Train sentences: 5438
 Test sentences: 1360
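
The two counts above follow from the 20% test share. A small check, assuming that scikit-learn rounds the test count up when `test_size` is given as a fraction and assigns the remaining rows to the training set:

```python
import math

n_sentences = 5438 + 1360   # total number of sentences reported above
test_size = 0.2

# Assumed rounding rule: the test count is rounded up, the rest is used for training.
n_test = math.ceil(test_size * n_sentences)
n_train = n_sentences - n_test

print(f"Train sentences: {n_train}, Test sentences: {n_test}")
```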


In [None]:
# format the train data adequately
df_finetune = df.loc[train_idxs].copy() # explicitly create an independent copy of the sliced data

df_finetune = df_finetune[["sentences", "naics2"]]
df_finetune.columns = ["sentences", "label"]

# transform labels into integers
df_finetune["label"] = df_finetune["label"].astype(int)

# map labels from original sector code to ints from 0 to num_sectors - 1
num_sectors = len(df_finetune.groupby('label').size())
naics2id = {k:v for k,v in zip(df_finetune.groupby('label').size().index.values, range(0, num_sectors))}
id2naics = {v:k for k,v in naics2id.items()}
df_finetune["label"] = df_finetune["label"].apply(lambda x: naics2id[x])
df_finetune

Unnamed: 0,sentences,label
3639,We experienced a work stoppage in 2008 when a ...,0
2680,"Finally, holders of the Tesla Convertible Note...",0
1507,There can be significant differences between o...,3
911,We also rely on other companies to maintain re...,2
621,The techniques used for attacks by third parti...,0
...,...,...
5007,Our revenues and cash requirements are affecte...,1
710,"As is common in our industry, our advertisers ...",2
6162,Global markets for the Company's products and ...,0
4138,"Longer payment cycles in some countries, incre...",2
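
The label mapping in the cell above can be illustrated without pandas. This is a minimal sketch on a hypothetical list of sector codes; only the four two-digit NAICS codes themselves (33, 45, 51, 52) come from this dataset.

```python
# Hypothetical sequence of sector codes, one per sentence.
codes_in_data = [52, 33, 51, 45, 52, 33]

# Sort the distinct codes so that the mapping is deterministic.
unique_codes = sorted(set(codes_in_data))

naics2id = {code: idx for idx, code in enumerate(unique_codes)}
id2naics = {idx: code for code, idx in naics2id.items()}

# Contiguous integer labels, as expected by the classification head.
labels = [naics2id[code] for code in codes_in_data]

print(naics2id)
print(labels)
```

Keeping `id2naics` around is what later allows a prediction such as `LABEL_2` to be translated back into a sector code.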


In [None]:
# format the test data adequately
df_test = df.loc[test_idxs].copy()

df_test = df_test[["sentences", "naics2"]]
df_test.columns = ["sentences", "label"]

# transform labels into integers
df_test["label"] = df_test["label"].astype(int)

# map labels from original sector code to ints from 0 to num_sectors - 1,
# reusing the mapping built on the training data
df_test["label"] = df_test["label"].apply(lambda x: naics2id[x])
df_test

Unnamed: 0,sentences,label
1236,• integration of the acquired company's accou...,2
474,Competition for qualified personnel within the...,3
3418,Any reduction in our and our subsidiaries' cre...,3
6564,Natural disasters or other catastrophes could ...,1
2646,For the battery and drive unit on our current ...,0
...,...,...
1950,Regulatory requirements in the U.S. and in non...,3
6648,If personal information of our customers or em...,2
6529,The evolution of retailing in online and mobil...,1
6452,"Our success depends, in part, on our ability t...",1


In [None]:
# transform data into Dataset class
finetune_dataset = Dataset.from_pandas(df_finetune)
test_dataset = Dataset.from_pandas(df_test)
finetune_dataset[0]

In [None]:
# load a tokenizer using the name of the model we want to use
sec_tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-base")

# tokenize the dataset
def tokenize_function(examples):
    return sec_tokenizer(examples["sentences"], max_length=60, padding="max_length", truncation=True)

tokenized_ft = finetune_dataset.map(tokenize_function, batched=True)    # batched=True is key for training
tokenized_test = test_dataset.map(tokenize_function, batched=True)

tokenized_ft

Dataset({
    features: ['sentences', 'label', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 5438
})
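
To see what `max_length=60`, `padding="max_length"` and `truncation=True` do to each sentence, here is a simplified sketch of the idea in plain Python. The function name and the token ids are hypothetical, and the sketch ignores one detail of the real tokenizer, which keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_length."""
    kept = token_ids[:max_length]              # truncation: drop tokens past the limit
    n_pad = max_length - len(kept)             # padding: number of filler tokens needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sentence is padded, a long one is cut (hypothetical token ids).
short_ids, short_mask = pad_or_truncate([101, 2057, 2024, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every row ends up with the same length, the sentences can be stacked into rectangular tensors, and the attention mask tells the model which positions to ignore.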

In [None]:
# load the model for fine-tuning.
# NOTE that we use a different class from the transformers library:
# AutoModelForSequenceClassification instead of AutoModel
num_labels = len(df_finetune.groupby('label').size())
model_ft = AutoModelForSequenceClassification.from_pretrained("nlpaueb/sec-bert-base",
                                                              num_labels=num_labels,
                                                              output_hidden_states=False)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at nlpaueb/sec-bert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# define the metrics reported whenever the Trainer evaluates or predicts
# (with one label per sentence, micro-averaged precision, recall and F1 coincide with accuracy)
def compute_metrics(eval_pred):
    metric1 = evaluate.load("precision")
    metric2 = evaluate.load("recall")
    metric3 = evaluate.load("f1")
    metric4 = evaluate.load("accuracy")

    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    precision = metric1.compute(predictions=predictions, references=labels, average="micro")["precision"]
    recall = metric2.compute(predictions=predictions, references=labels, average="micro")["recall"]
    f1 = metric3.compute(predictions=predictions, references=labels, average="micro")["f1"]
    accuracy = metric4.compute(predictions=predictions, references=labels)["accuracy"]

    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

In [None]:
# define the main arguments for training
training_args = TrainingArguments("./",                               # path to save model
                                  learning_rate=5e-5,                 # we use a very small learning rate
                                  num_train_epochs=3,                 # number of iterations through the corpus
                                  per_device_train_batch_size=8,      # defined by the capacity of our GPU
                                  per_device_eval_batch_size=8,       # defined by the capacity of our GPU
                                  evaluation_strategy="no",
                                  save_strategy="no")



In [None]:
# the Trainer relies on the loss computed by the model itself: for sequence classification
# with integer labels and more than one class this is cross-entropy (MSE is used for regression)
trainer = Trainer(
    model=model_ft,
    args=training_args,
    train_dataset=tokenized_ft,
    eval_dataset=tokenized_test,
    compute_metrics=compute_metrics
)

In [None]:
# train model (should take around 4 minutes with GPU)
import wandb
wandb.init(mode="disabled")
trainer.train()



Step,Training Loss
500,0.8058
1000,0.5156
1500,0.3766
2000,0.2175


TrainOutput(global_step=2040, training_loss=0.47313914088641895, metrics={'train_runtime': 237.3304, 'train_samples_per_second': 68.74, 'train_steps_per_second': 8.596, 'total_flos': 503023926150720.0, 'train_loss': 0.47313914088641895, 'epoch': 3.0})
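
The step count and throughput in the training log can be reproduced by hand. The sketch assumes a single GPU, no gradient accumulation, and that the last, smaller batch of each epoch is kept rather than dropped.

```python
import math

n_train = 5438          # training sentences
batch_size = 8          # per_device_train_batch_size
epochs = 3              # num_train_epochs
train_runtime = 237.3304  # seconds, taken from the log above

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

samples_per_second = n_train * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 2), round(steps_per_second, 3))
```

Doubling the batch size would halve the number of optimizer steps, which is one reason the learning rate and batch size are usually tuned together.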

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
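# save final version of the model
# (to keep it across Colab sessions, point this path at a folder inside your mounted Drive,
#  e.g. somewhere under /content/drive/MyDrive)
trainer.save_model("./ft_model/")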

# Load fine-tuned model
final_model = BertForSequenceClassification.from_pretrained("./ft_model/",
                                                            output_hidden_states=False,
                                                            output_attentions=False)

final_model.eval()
print("Model ready")

Model ready


In [None]:
# evaluate final model on the test dataset
results = trainer.predict(tokenized_test)
final_metrics = results.metrics  # PredictionOutput fields: predictions, label_ids, metrics
print(final_metrics)

{'test_loss': 0.9357860684394836, 'test_precision': 0.7941176470588235, 'test_recall': 0.7941176470588235, 'test_f1': 0.7941176470588235, 'test_accuracy': 0.7941176470588235, 'test_runtime': 8.2971, 'test_samples_per_second': 163.912, 'test_steps_per_second': 20.489}
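
Precision, recall, F1 and accuracy are identical in the output above. This is expected: when every sentence has exactly one true label and one predicted label, each mistake counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the micro-averaged scores collapse to accuracy. A small sketch with hypothetical labels:

```python
# Hypothetical true and predicted labels for eight sentences.
y_true = [0, 1, 2, 3, 2, 1, 0, 3]
y_pred = [0, 2, 2, 3, 1, 1, 0, 0]

classes = sorted(set(y_true) | set(y_pred))
pairs = list(zip(y_true, y_pred))

# Pool true positives, false positives and false negatives over all classes.
tp = sum(1 for c in classes for t, p in pairs if t == c and p == c)
fp = sum(1 for c in classes for t, p in pairs if p == c and t != c)
fn = sum(1 for c in classes for t, p in pairs if t == c and p != c)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)

print(precision, recall, f1, accuracy)
```

To see differences between classes, `average="macro"` or a per-class report (as in `classification_report`) is more informative.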


Q2. Compare the classification accuracy of the following approaches for predicting sector membership: (i) word count; (ii) `bert-base-uncased` without fine-tuning; (iii) `sec-bert-base` without fine-tuning; (iv) `sec-bert-base` with fine-tuning.

*Instructions*:

- The first three approaches should follow Assignment 8.

- If you did not do a train-test split last time, please refer to my updated solution (*Q4 CLS part*) and make sure to include one now.


In [None]:
##### XGBoost with word count
# See assignment 8 solution: accuracy 0.69

In [None]:
##### bert-base-uncased without fine-tuning

# Tokenize the sentences
base_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Define a tokenization function for bert base
def tokenize_function_base(examples):
    return base_tokenizer(examples["sentences"],
                     truncation=True,
                     padding="max_length",
                     max_length=60,
                     return_tensors="pt")

# Tokenize in batches
tokenized_dataset = finetune_dataset.map(tokenize_function_base, batched=True, batch_size=64)

# Load BERT model
model_base = AutoModel.from_pretrained("bert-base-uncased",
                                  output_hidden_states=True, # return the hidden states from all encoder layers, not just the last layer
                                  output_attentions=True, # return attention (similarity) scores for all self-attention layers
                                  attn_implementation="eager"
                                  )

# To pass the dataset to the model without training
# Set format to PyTorch tensors
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])

# Create DataLoader
dataloader = DataLoader(tokenized_dataset, batch_size=64)

# Move the model to GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_base.to(device)

# Extract embeddings
all_cls_embeddings = []
with torch.no_grad(): # turns off gradient computation
    for batch in tqdm(dataloader, desc="Extracting embeddings"):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        outputs = model_base(input_ids=input_ids, attention_mask=attention_mask)

        # Get the pooled [CLS] representation (final [CLS] hidden state after BERT's dense + tanh pooler)
        cls_embeddings = outputs.pooler_output
        all_cls_embeddings.append(cls_embeddings)

# Concatenate all embeddings
final_embeddings = torch.cat(all_cls_embeddings, dim=0)  # shape: (n_sentences, 768)

Extracting embeddings: 100%|██████████| 85/85 [00:17<00:00,  4.89it/s]


In [None]:
# Tokenize test data
tokenized_dataset_test = test_dataset.map(tokenize_function_base, batched=True, batch_size=64)

# Set format to PyTorch tensors
tokenized_dataset_test.set_format(type="torch", columns=["input_ids", "attention_mask"])

# Create DataLoader
dataloader_test = DataLoader(tokenized_dataset_test, batch_size=64)

# Extract embeddings
test_cls_embeddings = []
with torch.no_grad():
    for batch in tqdm(dataloader_test, desc="Extracting embeddings"):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        outputs = model_base(input_ids=input_ids, attention_mask=attention_mask)

        # Get the pooled [CLS] representation (final [CLS] hidden state after BERT's dense + tanh pooler)
        cls_embeddings = outputs.pooler_output
        test_cls_embeddings.append(cls_embeddings)

# Concatenate all embeddings
test_embeddings = torch.cat(test_cls_embeddings, dim=0)  # shape: (n_sentences, 768)


Extracting embeddings: 100%|██████████| 22/22 [00:04<00:00,  5.04it/s]


In [None]:
# Train logistic regression model
log_reg_base = LogisticRegression(max_iter=1000)
log_reg_base.fit(final_embeddings.cpu().numpy(), df_finetune['label'].values)

# Predictions
log_reg_base_preds = log_reg_base.predict(test_embeddings.cpu().numpy())

# Evaluation
print("Logistic Regression Performance:")
print(classification_report(df_test['label'], log_reg_base_preds))
print("Accuracy:", accuracy_score(df_test['label'], log_reg_base_preds))

Logistic Regression Performance:
              precision    recall  f1-score   support

           0       0.63      0.57      0.60       300
           1       0.62      0.37      0.46       228
           2       0.61      0.72      0.66       485
           3       0.62      0.68      0.65       347

    accuracy                           0.62      1360
   macro avg       0.62      0.59      0.59      1360
weighted avg       0.62      0.62      0.61      1360

Accuracy: 0.6191176470588236


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [None]:
##### sec-bert-base without fine-tuning

# Load sec bert
sec_model = AutoModel.from_pretrained("nlpaueb/sec-bert-base",
                                  output_hidden_states=True,
                                  output_attentions=True,
                                  attn_implementation="eager"
                                  )
sec_model.to(device)

## Training Data

# Set format to PyTorch tensors
tokenized_ft.set_format(type="torch", columns=["input_ids", "attention_mask"])

# Create DataLoader
dataloader_sec = DataLoader(tokenized_ft, batch_size=64)

all_sec_cls_embeddings = []
with torch.no_grad():
    for batch in tqdm(dataloader_sec, desc="Extracting embeddings"):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        outputs = sec_model(input_ids=input_ids, attention_mask=attention_mask)

        # Get the pooled [CLS] representation (final [CLS] hidden state after BERT's dense + tanh pooler)
        cls_embeddings = outputs.pooler_output
        all_sec_cls_embeddings.append(cls_embeddings)

# Concatenate all embeddings
final_sec_embeddings = torch.cat(all_sec_cls_embeddings, dim=0)  # shape: (n_sentences, 768)


# Test Data
tokenized_test.set_format(type="torch", columns=["input_ids", "attention_mask"])

# Create DataLoader
dataloader_sec_test = DataLoader(tokenized_test, batch_size=64)

test_sec_cls_embeddings = []
with torch.no_grad():
    for batch in tqdm(dataloader_sec_test, desc="Extracting embeddings"):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        outputs = sec_model(input_ids=input_ids, attention_mask=attention_mask)

        # Get the pooled [CLS] representation (final [CLS] hidden state after BERT's dense + tanh pooler)
        cls_embeddings = outputs.pooler_output
        test_sec_cls_embeddings.append(cls_embeddings)

# Concatenate all embeddings
test_sec_embeddings = torch.cat(test_sec_cls_embeddings, dim=0)  # shape: (n_sentences, 768)


# Train logistic regression model
log_reg_sec = LogisticRegression(max_iter=1000)
log_reg_sec.fit(final_sec_embeddings.cpu().numpy(), df_finetune['label'].values)


# Predictions
log_reg_sec_preds = log_reg_sec.predict(test_sec_embeddings.cpu().numpy())

# Evaluation
print("Logistic Regression Performance:")
print(classification_report(df_test['label'], log_reg_sec_preds))
print("Accuracy:", accuracy_score(df_test['label'], log_reg_sec_preds))


Extracting embeddings: 100%|██████████| 85/85 [00:17<00:00,  4.91it/s]
Extracting embeddings: 100%|██████████| 22/22 [00:04<00:00,  4.83it/s]


Logistic Regression Performance:
              precision    recall  f1-score   support

           0       0.65      0.68      0.66       300
           1       0.65      0.51      0.57       228
           2       0.71      0.75      0.73       485
           3       0.76      0.77      0.76       347

    accuracy                           0.70      1360
   macro avg       0.69      0.68      0.68      1360
weighted avg       0.70      0.70      0.70      1360

Accuracy: 0.7


In [None]:
# Table to compare performance
results = pd.DataFrame({"Model": ["Word Count", "BERT Base without FT", "Sec BERT without FT", "Sec BERT with FT"],
                        "Accuracy": [0.67, accuracy_score(df_test['label'], log_reg_base_preds),
                                     accuracy_score(df_test['label'], log_reg_sec_preds), final_metrics['test_accuracy']]})
results

Unnamed: 0,Model,Accuracy
0,Word Count,0.67
1,BERT Base without FT,0.619118
2,Sec BERT without FT,0.7
3,Sec BERT with FT,0.794118


#### **Part 2. Explore the sector membership probability**

Q3. Using the fine-tuned model, we can compute the probability that a given sentence belongs to each sector. Since each firm may have multiple sentences, we can estimate the firm-level sector probability by averaging predictions across all its sentences.

*Instructions*:
- First, initialize a text classification pipeline (code provided). This pipeline is used to make predictions on given text data for a classification task.
- Second, write a loop that (i) iterates through all unique firms in the dataset, (ii) for each firm, passes all of its sentences through the classifier, (iii) collects the predicted sector probabilities for each sentence, (iv) averages the probabilities **across all sentences** to obtain the final **firm-level sector prediction**

In [None]:
# use a text classification pipeline
classifier = TextClassificationPipeline(model=model_ft,
                                        tokenizer=sec_tokenizer,
                                        device="cuda",
                                        return_all_scores=True)
# sample code for using the classifier
classifier(df["sentences"][0])[0] # return a list of dictionaries containing the probability of the text belonging to each sector

Device set to use cuda


[{'label': 'LABEL_0', 'score': 0.00017543153080623597},
 {'label': 'LABEL_1', 'score': 0.00023250767844729125},
 {'label': 'LABEL_2', 'score': 0.00015906291082501411},
 {'label': 'LABEL_3', 'score': 0.9994329810142517}]
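
The scores above sum to one because, for a single-label model with several classes, the pipeline is understood to apply a softmax to the raw outputs (logits) of the classification head. A minimal sketch of that transformation, using hypothetical logits rather than values taken from the model:

```python
import math


def softmax(logits):
    """Turn a list of real-valued logits into probabilities that sum to one."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the four sector labels.
logits = [-2.1, -1.8, -2.2, 6.5]
probs = softmax(logits)

# Mimic the pipeline's output format.
scores = [{"label": f"LABEL_{i}", "score": p} for i, p in enumerate(probs)]
for entry in scores:
    print(entry)
```

A large gap between the top logit and the others translates into a probability very close to one, which is why fine-tuned classifiers often look extremely confident.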

In [None]:
# get firms identifier
firms = df["cik"].unique()
firms

array([  19617,   70858,   37996, 1418091,  927628,   83246, 1318605,
         18230,    4962,   12927, 1467858, 1652044,  104169,  769397,
         27419,  794367, 1634117, 1341439, 1754301,  320193,  909832,
       1744489])

In [None]:
# get an average of the probability for each sector across all sentences of a firm
# (takes around 2 minutes)

firms_composition = {}
for firm in firms:

    print(f"Processing firm: {firm}")
    # calculate probabilities for all the text of each firm
    df_firm = df.loc[df["cik"] == firm]
    all_probs = []
    for text in df_firm["sentences"].values:
        probs = classifier(text)[0]
        all_probs.append(probs)

    # sum the probabilities for each label across all sentences of the firm
    results_firm = {k: 0.0 for k in range(num_labels)}
    for p in all_probs:
        for label in range(num_labels):
            results_firm[label] += p[label]["score"]

    # divide by the number of sentences to obtain the firm-level average
    results_firm = {k: v / len(all_probs) for k, v in results_firm.items()}
    firms_composition[firm] = results_firm

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Processing firm: 19617
Processing firm: 70858
Processing firm: 37996
Processing firm: 1418091
Processing firm: 927628
Processing firm: 83246
Processing firm: 1318605
Processing firm: 18230
Processing firm: 4962
Processing firm: 12927
Processing firm: 1467858
Processing firm: 1652044
Processing firm: 104169
Processing firm: 769397
Processing firm: 27419
Processing firm: 794367
Processing firm: 1634117
Processing firm: 1341439
Processing firm: 1754301
Processing firm: 320193
Processing firm: 909832
Processing firm: 1744489
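
The averaging step can be tested in isolation on a toy example. The sketch below assumes the same output format as the classifier (one list of `{'label', 'score'}` dictionaries per sentence); the function name and the scores are hypothetical.

```python
def average_probabilities(sentence_probs):
    """Average per-sentence label scores into one probability per label."""
    n_labels = len(sentence_probs[0])
    totals = [0.0] * n_labels
    for probs in sentence_probs:
        for entry in probs:
            # Read the label index from names such as "LABEL_2".
            label_id = int(entry["label"].split("_")[1])
            totals[label_id] += entry["score"]
    return {label: total / len(sentence_probs) for label, total in enumerate(totals)}


def as_pipeline_output(scores):
    """Wrap a plain list of scores in the pipeline's dictionary format."""
    return [{"label": f"LABEL_{i}", "score": s} for i, s in enumerate(scores)]


# Two hypothetical sentences from the same firm.
toy_firm = [
    as_pipeline_output([0.1, 0.1, 0.1, 0.7]),
    as_pipeline_output([0.3, 0.1, 0.1, 0.5]),
]
firm_level = average_probabilities(toy_firm)
print(firm_level)
```

Since each sentence's scores sum to one, the firm-level averages also sum to one and can be read as a sector composition.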


Q4. Now we load the firms’ metadata to map each firm ID to its corresponding real-world firm name. Using the sector membership probabilities you computed in Q3, create a dataframe with the following columns:

| cik     | firm_name       | naics2_name     | manufacturing | retail_trade | information | finance |
|---------|------------------|------------------|----------------|----------------|--------------|----------|

The last four columns should contain the firm-level sector membership probabilities (one per sector) as predicted in Q3.

Explore the data!

In [None]:
# read firms metadata
file_id = "1vFcFJYdLD0sE_fbhGES8KQmvMcLuuCdc"
df_cov = pd.read_csv(f"https://drive.google.com/uc?export=download&id={file_id}&authuser=0&export=download")
df_cov = df_cov.loc[df_cov["cik"].isin(firms)]
df_cov.head()

Unnamed: 0,gvkey,datadate,fyear,indfmt,consol,popsrc,datafmt,tic,conm,curcd,act,at,emp,cik,costat,naics,naics2,naics2_name
26,1447,20191231,2019,FS,C,D,STD,AXP,AMERICAN EXPRESS CO,USD,,198321.0,64.5,4962,A,522210,52,Finance and Insurance
27,1447,20191231,2019,INDL,C,D,STD,AXP,AMERICAN EXPRESS CO,USD,,198321.0,64.5,4962,A,522210,52,Finance and Insurance
41,1690,20190930,2019,INDL,C,D,STD,AAPL,APPLE INC,USD,162819.0,338516.0,137.0,320193,A,334220,33,Manufacturing
51,1878,20200131,2019,INDL,C,D,STD,ADSK,AUTODESK INC,USD,2659.3,6179.3,10.1,769397,A,519130,51,Information
70,2285,20191231,2019,INDL,C,D,STD,BA,BOEING CO,USD,102229.0,133625.0,161.1,12927,A,336411,33,Manufacturing


In [None]:
# create a dataframe
df_firms = pd.DataFrame()
df_firms["cik"] = list(firms_composition.keys())
df_firms["firm_name"] = [df_cov.loc[df_cov["cik"] == cik, "conm"].iloc[0] for cik in df_firms["cik"]]
df_firms["naics2_name"] = [df.loc[df["cik"] == cik, "naics2_name"].iloc[0] for cik in df_firms["cik"]]

# add text-based data
df_firms["manufacturing"] = [comp[0] for comp in firms_composition.values()]
df_firms["retail_trade"] = [comp[1] for comp in firms_composition.values()]
df_firms["information"] = [comp[2] for comp in firms_composition.values()]
df_firms["finance"] = [comp[3] for comp in firms_composition.values()]
df_firms

Unnamed: 0,cik,firm_name,naics2_name,manufacturing,retail_trade,information,finance
0,19617,JPMORGAN CHASE & CO,Finance and Insurance,0.000325,0.000321,0.025188,0.974167
1,70858,BANK OF AMERICA CORP,Finance and Insurance,0.003721,0.009101,0.003795,0.983383
2,37996,FORD MOTOR CO,Manufacturing,0.975149,0.008444,0.014564,0.001843
3,1418091,TWITTER INC,Information,0.010071,0.027584,0.955946,0.006399
4,927628,CAPITAL ONE FINANCIAL CORP,Finance and Insurance,0.013472,0.017098,0.035348,0.934082
5,83246,HSBC USA INC,Finance and Insurance,0.012621,0.017527,0.021886,0.947967
6,1318605,TESLA INC,Manufacturing,0.906424,0.023257,0.059465,0.010854
7,18230,CATERPILLAR INC,Manufacturing,0.843913,0.040235,0.096545,0.019307
8,4962,AMERICAN EXPRESS CO,Finance and Insurance,0.01201,0.025349,0.040315,0.922326
9,12927,BOEING CO,Manufacturing,0.922003,0.03039,0.044741,0.002866
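
One simple way to explore this table is to check whether the sector with the highest text-based probability matches the firm's official sector. The sketch below does this in plain Python for four rows copied from the output above; the dictionary `sector_columns` is a helper introduced here for illustration.

```python
# Four rows copied from the table above.
rows = [
    {"firm": "JPMORGAN CHASE & CO", "naics2_name": "Finance and Insurance",
     "manufacturing": 0.000325, "retail_trade": 0.000321, "information": 0.025188, "finance": 0.974167},
    {"firm": "FORD MOTOR CO", "naics2_name": "Manufacturing",
     "manufacturing": 0.975149, "retail_trade": 0.008444, "information": 0.014564, "finance": 0.001843},
    {"firm": "TWITTER INC", "naics2_name": "Information",
     "manufacturing": 0.010071, "retail_trade": 0.027584, "information": 0.955946, "finance": 0.006399},
    {"firm": "TESLA INC", "naics2_name": "Manufacturing",
     "manufacturing": 0.906424, "retail_trade": 0.023257, "information": 0.059465, "finance": 0.010854},
]

# Helper mapping from probability column to official sector name.
sector_columns = {
    "manufacturing": "Manufacturing",
    "retail_trade": "Retail Trade",
    "information": "Information",
    "finance": "Finance and Insurance",
}

for row in rows:
    top_column = max(sector_columns, key=lambda col: row[col])
    row["predicted"] = sector_columns[top_column]
    row["matches"] = row["predicted"] == row["naics2_name"]
    print(f"{row['firm']}: predicted {row['predicted']}, official {row['naics2_name']}")
```

The second-largest probability is often just as interesting: Tesla, for example, receives a noticeably larger Information share than Ford.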


Q5. Now try passing a sentence not from the original dataset to the classifier, and inspect the predicted sector probabilities.

*Instructions*: Choose any sentence that could plausibly appear in a firm’s 10-K filing.

In [None]:
# define a target sentence
outside_target = "We are worried about misinformation and fake news."

# get predicted probabilities for each label
probs = classifier(outside_target)[0]
probs

[{'label': 'LABEL_0', 'score': 0.001261436496861279},
 {'label': 'LABEL_1', 'score': 0.002438503550365567},
 {'label': 'LABEL_2', 'score': 0.9961152076721191},
 {'label': 'LABEL_3', 'score': 0.00018487875058781356}]

In [None]:
# clean the labels
for pred_dict in probs:

    # extract the ID from the label (avoid shadowing the built-in `id`)
    label_id = int(pred_dict["label"].split("_")[1])
    # convert to NAICS code
    naics = id2naics[label_id]
    naics_name = df.loc[df["naics2"] == naics, "naics2_name"].iloc[0]
    print(f"Probability for NAICS sector {naics_name} (code {naics}): {pred_dict['score']}")

# wrap process in a function
def print_predictions(text):
    # get predicted probabilities for each label
    probs = classifier(text)[0]

    # clean the labels
    for pred_dict in probs:

        # extract the ID from the label
        label_id = int(pred_dict["label"].split("_")[1])
        # convert to NAICS code
        naics = id2naics[label_id]
        naics_name = df.loc[df["naics2"] == naics, "naics2_name"].iloc[0]
        print(f"Probability for NAICS sector {naics_name} (code {naics}): {pred_dict['score']}")

In [None]:
print_predictions("Our production of cars is affected by the price of steel.")

Probability for NAICS sector Manufacturing (code 33): 0.9992215633392334
Probability for NAICS sector Retail Trade (code 45): 0.0002250309771625325
Probability for NAICS sector Information (code 51): 0.00028313795337453485
Probability for NAICS sector Finance and Insurance (code 52): 0.00027031206991523504


In [None]:
print_predictions("The decisions from the federal reserve board can affect us greatly.")

Probability for NAICS sector Manufacturing (code 33): 0.0002547609037719667
Probability for NAICS sector Retail Trade (code 45): 0.00022182043176144361
Probability for NAICS sector Information (code 51): 0.0002959604898933321
Probability for NAICS sector Finance and Insurance (code 52): 0.9992274045944214
