<a href="https://colab.research.google.com/github/tsengcc2023/Financial-Big-Data-Analysis/blob/main/week10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

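Build a BERT model with the Hugging Face framework and fine-tune it for sentiment analysis on financial text.
You can use the dataset provided below or choose another suitable financial text dataset.
Dataset URL: https://huggingface.co/datasets/takala/financial_phrasebank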

In [1]:
!pip install datasets


# Load the dataset

In [2]:
from datasets import load_dataset

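# Load the financial sentiment dataset, selecting the sentences_allagree configuration
dataset = load_dataset("takala/financial_phrasebank", "sentences_allagree", split="train")
dataset = dataset.shuffle(seed=42)  # Shuffle with a fixed seed so the example order is random but reproducible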
print(dataset[0])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


The repository for takala/financial_phrasebank contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/takala/financial_phrasebank.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Generating train split:   0%|          | 0/2264 [00:00<?, ? examples/s]

{'sentence': 'Indigo and Somoncom serve 377,000 subscribers and had a market share of approximately 27 % as of May 2007 .', 'label': 1}
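The printed example has `label: 1`, which is the neutral class in this dataset (0 = negative, 1 = neutral, 2 = positive). Before training, it can help to check how the classes are distributed, since an imbalanced dataset affects how accuracy should be read. The following is a minimal sketch using only the standard library; the helper name `label_distribution` and the toy labels are made up for illustration, and with the real data the input would be `dataset["label"]`.

```python
from collections import Counter

# Label ids as used by financial_phrasebank: 0 = negative, 1 = neutral, 2 = positive.
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}

def label_distribution(labels):
    """Return {label_name: (count, share)} for a non-empty list of integer label ids."""
    counts = Counter(labels)
    total = len(labels)
    return {ID2LABEL[i]: (counts[i], counts[i] / total) for i in sorted(ID2LABEL)}

# Made-up labels for illustration only; with the real data use dataset["label"].
toy_labels = [1, 1, 2, 0, 1, 2, 1, 1]
distribution = label_distribution(toy_labels)
for name, (count, share) in distribution.items():
    print(f"{name:>8}: {count} ({share:.1%})")
```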


# Data preprocessing

In [3]:
from transformers import AutoTokenizer

# Use BERT's pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Define the tokenization function (pads every sentence to the tokenizer's maximum length)
def tokenize_function(examples):
    return tokenizer(examples["sentence"], padding="max_length", truncation=True)

# Apply the tokenization to the whole dataset with map
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/2264 [00:00<?, ? examples/s]
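With `padding="max_length"`, every sentence is padded to the tokenizer's maximum length, which is 512 tokens for `bert-base-uncased`, even though financial headlines are usually much shorter. A common alternative is to pad each batch only to its longest sentence (in `transformers`, `DataCollatorWithPadding` does this). The sketch below only illustrates the arithmetic; the token counts are hypothetical, not measured from the dataset, and the function names are made up.

```python
MAX_LENGTH = 512  # model_max_length of bert-base-uncased

def padded_tokens_fixed(lengths, max_length=MAX_LENGTH):
    """Tokens processed when every sequence is padded (or truncated) to max_length."""
    return len(lengths) * max_length

def padded_tokens_dynamic(lengths, batch_size, max_length=MAX_LENGTH):
    """Tokens processed when each batch is padded to its own longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = [min(n, max_length) for n in lengths[start:start + batch_size]]
        total += len(batch) * max(batch)
    return total

# Hypothetical token counts for eight short sentences (not measured from the dataset).
toy_lengths = [18, 25, 31, 12, 40, 22, 27, 35]
fixed = padded_tokens_fixed(toy_lengths)
dynamic = padded_tokens_dynamic(toy_lengths, batch_size=8)
print(f"fixed padding: {fixed} tokens, dynamic padding: {dynamic} tokens")
```

For this toy batch the model would process 4096 token positions with fixed padding and 320 with dynamic padding, which is one plausible reason the training run is slow.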

# Load the pretrained BERT model for fine-tuning

In [4]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Define the model (BERT with a 3-class classification head: positive, neutral, negative)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Training setup

In [5]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none"  # Disable external logging integrations such as WandB
)

# Define the evaluation function
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    acc = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="weighted")
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}
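To make `average="weighted"` concrete: each class's precision, recall, and F1 is computed separately and then averaged with weights proportional to the number of true examples of that class. One consequence is that weighted recall always equals accuracy, which is why the Recall and Accuracy columns in the training log are identical. The following hand-written sketch reproduces the calculation without scikit-learn; the function name and the toy labels are made up for illustration.

```python
from collections import Counter

def weighted_scores(labels, preds):
    """Support-weighted precision, recall and F1, plus accuracy, computed by hand."""
    support = Counter(labels)
    total = len(labels)
    precision = recall = f1 = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        n_pred = sum(1 for p in preds if p == cls)
        cls_precision = tp / n_pred if n_pred else 0.0
        cls_recall = tp / n_true
        denom = cls_precision + cls_recall
        cls_f1 = 2 * cls_precision * cls_recall / denom if denom else 0.0
        weight = n_true / total  # share of true examples belonging to this class
        precision += weight * cls_precision
        recall += weight * cls_recall
        f1 += weight * cls_f1
    accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / total
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels and predictions for illustration.
toy_labels = [0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
toy_preds = [0, 1, 1, 1, 1, 2, 2, 2, 2, 0]
scores = weighted_scores(toy_labels, toy_preds)
print(scores)
```

Because weighted averages are dominated by the largest class, a macro average (`average="macro"`) may be worth reporting alongside it when the classes are imbalanced.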



# Train the model with Trainer

In [6]:
# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_datasets,  # Note: this is the same data used for training, so the metrics are not a held-out estimate
    compute_metrics=compute_metrics,
)

# Start training
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | No log | 0.08755 | 0.979682 | 0.979712 | 0.980125 | 0.979682 |
| 2 | 0.250800 | 0.016463 | 0.995583 | 0.995588 | 0.995597 | 0.995583 |
| 3 | 0.250800 | 0.007229 | 0.998675 | 0.998676 | 0.99868 | 0.998675 |


TrainOutput(global_step=849, training_loss=0.16116823886111992, metrics={'train_runtime': 886.889, 'train_samples_per_second': 7.658, 'train_steps_per_second': 0.957, 'total_flos': 1787066333208576.0, 'train_loss': 0.16116823886111992, 'epoch': 3.0})
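The `global_step=849` value and the "No log" entry for epoch 1 both follow from simple arithmetic. The sketch below assumes a single device and no gradient accumulation, and relies on the `Trainer` default of `logging_steps=500`, since the training arguments above do not set it.

```python
import math

num_examples = 2264      # size of the sentences_allagree train split
batch_size = 8           # per_device_train_batch_size
num_epochs = 3           # num_train_epochs
logging_steps = 500      # Trainer default when logging_steps is not set

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Steps at which the Trainer writes a training-loss entry.
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))

# For each epoch, report whether any loss entry exists by the end of that epoch.
for epoch in range(1, num_epochs + 1):
    end_step = epoch * steps_per_epoch
    has_log = any(step <= end_step for step in logged_at)
    print(f"epoch {epoch}: ends at step {end_step}, loss logged so far: {has_log}")
```

Epoch 1 ends at step 283, before the first log at step 500, so no training loss is available yet. The only logged value (0.2508, at step 500) is then shown for both epoch 2 and epoch 3. Setting a smaller `logging_steps` would give a finer-grained loss curve.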

In [8]:
# Evaluate the model
eval_result = trainer.evaluate()
print(f"Evaluation result: {eval_result}")

# Save the fine-tuned model
model.save_pretrained("./financial_sentiment_model")
tokenizer.save_pretrained("./financial_sentiment_model")

Evaluation result: {'eval_loss': 0.007228520233184099, 'eval_accuracy': 0.9986749116607774, 'eval_f1': 0.9986757035059867, 'eval_precision': 0.9986799281293525, 'eval_recall': 0.9986749116607774, 'eval_runtime': 71.8594, 'eval_samples_per_second': 31.506, 'eval_steps_per_second': 3.938, 'epoch': 3.0}
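Note that this evaluation runs on the same 2,264 sentences the model was trained on, so the 99.87% accuracy measures how well the model fits the training data rather than how well it generalizes. A held-out split would give a more honest estimate. The following is a minimal sketch of a stratified split using only the standard library; the function name, the 20% test share, and the toy labels are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Split example indices into train/test lists, keeping class shares similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Made-up labels for illustration; with the real data use dataset["label"].
toy_labels = [0] * 10 + [1] * 30 + [2] * 20
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```

With a `datasets.Dataset`, the two index lists could be passed to `Dataset.select` to build separate `train_dataset` and `eval_dataset` objects for the `Trainer`. The library's own `Dataset.train_test_split` method is a shorter route to the same result.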


('./financial_sentiment_model/tokenizer_config.json',
 './financial_sentiment_model/special_tokens_map.json',
 './financial_sentiment_model/vocab.txt',
 './financial_sentiment_model/added_tokens.json',
 './financial_sentiment_model/tokenizer.json')

# Testing

In [9]:
# Import the required modules
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Select the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the fine-tuned model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("./financial_sentiment_model")
model = AutoModelForSequenceClassification.from_pretrained("./financial_sentiment_model").to(device)

# Test sentences
test_texts = [
    "The company’s profit has increased significantly this quarter.",  # Expected sentiment: positive
    "The increase in costs negatively affected the revenue.",           # Expected sentiment: negative
    "The company’s performance remained stable."                        # Expected sentiment: neutral
]

# Tokenize the test sentences
test_encodings = tokenizer(test_texts, truncation=True, padding=True, return_tensors="pt").to(device)

# Run inference (gradients are not needed here)
with torch.no_grad():
    outputs = model(**test_encodings)

# Get the predicted class ids
preds = torch.argmax(outputs.logits, dim=1).cpu().numpy()

# Map numeric labels to text labels
label_map = {0: "Negative", 1: "Neutral", 2: "Positive"}
predicted_labels = [label_map[pred] for pred in preds]

# Print the predicted labels
print(predicted_labels)

['Positive', 'Negative', 'Positive']
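The third sentence was expected to be neutral but is predicted as positive. `argmax` alone does not show how confident the model was, so converting the logits to probabilities with softmax can help judge borderline cases like this one. The sketch below uses only the standard library; the logits are hypothetical and were not produced by the model above.

```python
import math

LABELS = ["Negative", "Neutral", "Positive"]  # index = class id, as in label_map

def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_prediction(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

# Hypothetical logits for one sentence; real values come from outputs.logits.
toy_logits = [-1.2, 1.9, 2.1]
probs = softmax(toy_logits)
label, confidence = top_prediction(toy_logits)
print(label, round(confidence, 3))
```

On the real model output, `torch.softmax(outputs.logits, dim=1)` gives the same probabilities for the whole batch at once.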
