# **Financial Applications with ML & AI**

<img style="float: right;" src="https://github.com/torreblanca99/course_financial_applications/blob/develop/docs/img/logo_bourbaki.png?raw=1" width="100"/>

## **Module III:** AI & Finance
#### Topic: Using an LLM

##### Name: Julio César Avila Torreblanca

- **Objective**: fine-tune an LLM (BERT).
- **Contents**:
    - Notes:
        - Large Language Models: BERT
    - Code:
        1. Libraries and parameters
        2. Reading data & Model
        3. Preprocessing data
        4. Fine Tuning
        5. Evaluation
----

# Large Language Models (LLMs) in Financial Applications

## Introduction to LLMs
Large Language Models (LLMs) are advanced deep learning models designed to understand and generate human language. They are trained on extensive text corpora, making them useful for a variety of natural language processing (NLP) applications due to their ability to effectively model context and semantics.

## What is a Corpus?
A **corpus** (plural: corpora) is a large and structured set of texts that are used for training models in natural language processing. It is essentially the raw text data that LLMs learn from. The quality, diversity, and representativeness of a corpus significantly affect a model's performance and accuracy.

## BERT (Bidirectional Encoder Representations from Transformers)

### Description
- **Architecture:** BERT is based on the Transformer architecture and is notable for its bidirectional capability, processing text by considering the full context both from the left and the right side.
- **Functionality:** It is built to handle tasks such as sentiment analysis, question answering, and entity recognition.
  
### Training Corpus
- **Corpora Used:** The original BERT was pre-trained on English Wikipedia and the BooksCorpus, two large and diverse text collections that expose the model to a wide range of linguistic patterns.

### Mathematical Function
BERT is pre-trained using two tasks:
1. **Masked Language Modeling (MLM):** Predicts masked words in a sentence from the surrounding context.
   $$ P(\text{Masked Word} \mid \text{Context}) $$
2. **Next Sentence Prediction (NSP):** Determines whether a given sentence follows another.
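
As a rough illustration of the MLM objective, the sketch below turns a vector of made-up scores over a toy vocabulary into the distribution $P(\text{Masked Word} \mid \text{Context})$ with a softmax. The vocabulary, the scores, and the helper `softmax` are illustrative assumptions, not outputs of BERT.

```python
import numpy as np

def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Toy setup for the sentence "Operating profit [MASK] to EUR 13.1 mn".
vocab = ["rose", "fell", "table", "the"]   # hypothetical 4-word vocabulary
logits = np.array([3.0, 2.0, -1.0, 0.5])   # made-up scores for the masked position

probs = softmax(logits)
predicted = vocab[int(np.argmax(probs))]

for word, p in zip(vocab, probs):
    print(f"P({word!r} | context) = {p:.3f}")
print("Most likely masked word:", predicted)
```

In a real model the scores come from the Transformer encoder and the vocabulary has tens of thousands of entries, but the final step is the same normalization.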

### Advantages
- Effectively captures the bidirectional context of words.
- Greatly enhances performance in various NLP tasks by pre-training on large volumes of data.

## FinBERT

### Description
FinBERT is a pre-trained NLP model for analyzing the sentiment of financial text. It is built by further training the BERT language model on a large financial corpus and then fine-tuning it for financial sentiment classification. The Financial PhraseBank by Malo et al. (2014) is used for fine-tuning.

- **Specialization:** FinBERT is a domain-specific adaptation of BERT optimized for financial contexts.
- **Functionality:** It is trained on financial corpora, allowing it to accurately comprehend the specific language used in finance, such as market reports and risk analyses.

### Training Corpus
- **Corpora Used:** FinBERT is trained on financial datasets that include financial reports, news articles, and other finance-related documents to ensure domain-specific understanding.

### Advantages
- **Domain-Specificity:** FinBERT is better suited for financial contexts than standard BERT due to its training on domain-specific data.
- **Applications:** It improves financial text analysis tasks like sentiment classification in financial documents and extracting relevant information.

## Comparison Between BERT and FinBERT

- **Domain:** BERT is a general-purpose model, while FinBERT is specialized for financial language.
- **Training:** Both models share the same base architecture, but FinBERT is further pre-trained on financial text to improve its accuracy in this specific field.

## Conclusion
LLMs, particularly BERT and its derivative FinBERT, represent significant advancements in NLP, offering highly accurate and tailored models for various applications. In the financial realm, FinBERT provides a powerful tool for efficiently analyzing and extracting useful information from vast amounts of financial text.

-----

# 1. Libraries and parameters

In [26]:
import numpy as np
import pandas as pd
from datasets import Dataset

from transformers import Trainer, TrainingArguments
from transformers import BertTokenizer, BertForSequenceClassification, pipeline

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [3]:
import torch
import transformers
torch.__version__, transformers.__version__

('2.6.0', '4.51.2')

In [4]:
torch.cuda.is_available()

False

# 2. Reading Data & Model
### [[Dataset](https://arxiv.org/pdf/1908.10063.pdf)]

## 2.1 Data

In [5]:
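# Each line of the file has the form `sentence@label`.
# A separator longer than one character is treated as a regular expression
# (here: a space, any single character, then '@'), which requires the Python engine.
data = pd.read_csv(
    'Sentences_AllAgree.txt',
    sep=' .@',
    names=['text', 'label'],
    encoding='utf-8',
    engine='python'
)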

data.shape

(2264, 2)
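
Because the separator `' .@'` is read as a regular expression, only lines that contain a space, one arbitrary character, and then `@` are split into two fields. The sketch below uses made-up lines to show which lines that pattern splits and which it leaves whole, which may be why some rows end up without a label. It also shows an alternative that splits on the last `@` only. The sample lines and the helper `split_on_last_at` are illustrative assumptions, not part of the dataset or the notebook.

```python
import re

# Made-up lines in the `sentence@label` layout (not taken from the dataset).
lines = [
    "Net sales increased by 5 % .@positive",     # ends with " ." before "@"
    "The company has no plans to move@neutral",  # no " <char>@" sequence
]

pattern = re.compile(r" .@")  # same pattern as sep=' .@'

for line in lines:
    # A line the pattern does not match comes back as a single field.
    print(pattern.split(line))

def split_on_last_at(line):
    """Split a line into (text, label) at the final '@' (hypothetical helper)."""
    text, _, label = line.rpartition("@")
    return text.strip(), label.strip()

parsed = [split_on_last_at(line) for line in lines]
print(parsed)
```

Splitting on the last `@` keeps the trailing period in the text, whereas the regular-expression separator consumes it.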

In [6]:
data.head()

| | text | label |
|---|---|---|
| 0 | According to Gran , the company has no plans t... | neutral |
| 1 | For the last quarter of 2010 , Componenta 's n... | positive |
| 2 | In the third quarter of 2010 , net sales incre... | positive |
| 3 | Operating profit rose to EUR 13.1 mn from EUR ... | positive |
| 4 | Operating profit totalled EUR 21.1 mn , up fro... | positive |


## 2.2 Reading Model
### [[Model](https://huggingface.co/yiyanghkust/finbert-tone)]

In [7]:
# Load the pre-trained FinBERT model with a 3-class classification head, and its tokenizer
finbert = BertForSequenceClassification.from_pretrained(
    'yiyanghkust/finbert-tone',
    num_labels=3
)
tokenizer = BertTokenizer.from_pretrained(
    'yiyanghkust/finbert-tone'
)

# 3. Preprocessing data

## 3.1 Data cleaning

In [8]:
data['label'].value_counts(dropna=False)

label
neutral     1363
positive     561
negative     296
None          44
Name: count, dtype: int64

In [9]:
data.isnull().sum()

text      0
label    44
dtype: int64

In [10]:
data = data.dropna()
data.shape

(2220, 2)

## 3.2 Split Data

In [11]:
df_train, df_test = train_test_split(data, stratify=data['label'], test_size=0.1, random_state=42)
df_train, df_val = train_test_split(df_train, stratify=df_train['label'], test_size=0.1, random_state=42)
print(df_train.shape, df_test.shape, df_val.shape)

(1798, 2) (222, 2) (200, 2)
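
The split sizes can be reproduced by hand. The sketch below assumes that `train_test_split` rounds the test size up when `test_size` is a fraction and gives the remaining rows to the training part. The helper `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size (hypothetical helper)."""
    n_test = math.ceil(n_samples * test_size)  # assumed rounding rule
    return n_samples - n_test, n_test

n_total = 2220                               # rows left after dropping missing labels
n_rest, n_test = split_sizes(n_total, 0.1)   # first split: hold out the test set
n_train, n_val = split_sizes(n_rest, 0.1)    # second split: hold out the validation set

print(n_train, n_test, n_val)
```

Note that the validation set is 10% of the remaining 1998 rows, so the final proportions are roughly 81% train, 9% validation, and 10% test.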


## 3.3 Tokenizer & Tensors Creation

In [35]:
dataset_train = Dataset.from_pandas(df_train)
dataset_val = Dataset.from_pandas(df_val)
dataset_test = Dataset.from_pandas(df_test)

# Tokenize each split: truncate or pad every sentence to exactly 128 tokens
dataset_train = dataset_train.map(lambda e: tokenizer(e['text'], 
                                                      truncation=True, 
                                                      padding='max_length', 
                                                      max_length=128), 
                                  batched=True
                                  )

dataset_val = dataset_val.map(lambda e: tokenizer(e['text'],
                                                  truncation=True, 
                                                  padding='max_length', 
                                                  max_length=128), 
                              batched=True
                              )

dataset_test = dataset_test.map(lambda e: tokenizer(e['text'],
                                                    truncation=True, 
                                                    padding='max_length' , 
                                                    max_length=128), 
                                batched=True
                                )

dataset_train.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
dataset_val.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
dataset_test.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])

Map: 100%|██████████| 1798/1798 [00:00<00:00, 2810.17 examples/s]
Map: 100%|██████████| 200/200 [00:00<00:00, 1721.88 examples/s]
Map: 100%|██████████| 222/222 [00:00<00:00, 3799.93 examples/s]
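
To make the effect of `truncation=True`, `padding='max_length'`, and `max_length` concrete, here is a minimal pure-Python sketch of the same idea. The helper `pad_or_truncate` and the token ids are made up for illustration. The real tokenizer also handles special tokens such as `[CLS]` and `[SEP]`, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_length."""
    ids = token_ids[:max_length]                    # truncation: drop extra tokens
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad   # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, attention_mask

# A short sequence is padded on the right.
short = [3, 6, 37, 5674, 4]                         # made-up token ids
input_ids, attention_mask = pad_or_truncate(short, max_length=8)
print(input_ids)
print(attention_mask)

# A long sequence is cut to max_length and needs no padding.
long_ids = list(range(1, 13))
ids_long, mask_long = pad_or_truncate(long_ids, max_length=8)
print(ids_long)
print(mask_long)
```

The attention mask tells the model which positions hold real tokens, so the padding zeros do not influence the prediction.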


In [36]:
dataset_train

Dataset({
    features: ['text', 'label', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 1798
})

In [14]:
type(dataset_train)

datasets.arrow_dataset.Dataset

In [19]:
df_train.head(1)

| | text | label |
|---|---|---|
| 1403 | The company 's equipment rental portfolio cons... | neutral |


#### Understanding Tokenization

In [20]:
#@title before preprocessing
Dataset.from_pandas(df_train)[0]

{'text': "The company 's equipment rental portfolio consists primarily of tools for small and mid-sized customers in the construction industry",
 'label': 'neutral',
 '__index_level_0__': 1403}

In [21]:
#@title Processed
dataset_train[0]

{'label': 'neutral',
 'input_ids': tensor([    3,     6,    37,  5674,    58,   442,  1313,   318,  1534,   106,
             7,  2166,    14,   856,     8,  2473,    85, 12779,   175,    10,
             6,   672,   228,     4,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,   

# 4. Fine Tuning

## 4.1 Configuration

In [37]:
def compute_metrics(eval_pred):
    """
    Compute evaluation metrics for the model's predictions.

    Args:
        eval_pred (tuple): A tuple containing two elements:
            - predictions (numpy.ndarray): The raw predictions from the model.
            - labels (numpy.ndarray): The true labels for the evaluation dataset.

    Returns:
        dict: A dictionary containing the computed metrics. Currently, it includes:
            - 'accuracy': The accuracy score of the predictions compared to the true labels.
    """
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    # accuracy_score expects (y_true, y_pred)
    return {'accuracy': accuracy_score(labels, predictions)}
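
The following sketch walks through what `compute_metrics` does on a tiny made-up batch: the model returns one row of scores (logits) per sentence, `argmax` picks the highest-scoring class, and accuracy is the share of matches. The logits and labels are invented for illustration, and the accuracy is computed with plain NumPy rather than `accuracy_score`.

```python
import numpy as np

# Made-up logits for 4 sentences and 3 classes (rows: sentences, columns: classes).
logits = np.array([
    [2.1, 0.3, -1.0],
    [0.2, 1.7, 0.1],
    [-0.5, 0.4, 1.9],
    [1.2, 1.5, 0.0],
])
labels = np.array([0, 1, 2, 0])  # made-up true class ids

predicted_classes = np.argmax(logits, axis=1)            # highest score per row
accuracy = float(np.mean(predicted_classes == labels))   # fraction of correct rows

print(predicted_classes, accuracy)
```

Three of the four predictions match their labels here, so the accuracy is 0.75.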

In [38]:
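args = TrainingArguments(
    output_dir='temp/',
    eval_strategy='epoch',             # evaluate on the validation set after every epoch
    save_strategy='epoch',             # save a checkpoint after every epoch
    learning_rate=1e-3,                # fairly high for fine-tuning BERT-style models; 2e-5 to 5e-5 is more common
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,       # reload the checkpoint with the best validation metric
    metric_for_best_model='accuracy',
)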

trainer = Trainer(
        model=finbert, # the instantiated 🤗 Transformers model to be trained
        args=args,     # training arguments, defined above
        train_dataset=dataset_train, # training dataset
        eval_dataset=dataset_val, # evaluation dataset
        compute_metrics=compute_metrics
)
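
As a quick sanity check on this configuration, the number of optimizer steps can be estimated from the training set size, the batch size, and the number of epochs. The sketch assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept. The variable names are illustrative.

```python
import math

n_train = 1798     # rows in the training split
batch_size = 32    # per_device_train_batch_size
epochs = 5         # num_train_epochs

steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch is smaller
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```

With `eval_strategy='epoch'` and `save_strategy='epoch'`, evaluation and checkpointing would then happen once per epoch rather than every fixed number of steps.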

# 5. Evaluation

In [40]:

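# NOTE: as written, this call raises the ValueError shown below, most likely because
# the `label` column still contains strings ('neutral', 'positive', 'negative')
# rather than integer class ids, so it cannot be converted to a tensor.
finbert.eval()
trainer.predict(dataset_test).metrics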

ValueError: too many dimensions 'str'

In [41]:
y_pred = trainer.predict(dataset_test).predictions

ValueError: too many dimensions 'str'
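
The error `too many dimensions 'str'` is consistent with the `label` column still holding strings when the batches are converted to tensors. One possible fix is to map the labels to integer class ids before building the `Dataset` objects, sketched below on a made-up DataFrame. The mapping `label2id` is an assumption about the class order of the model. It should be checked against `finbert.config.id2label` before use.

```python
import pandas as pd

# Assumed mapping; confirm it against `finbert.config.id2label` before use.
label2id = {"neutral": 0, "positive": 1, "negative": 2}

# Made-up rows standing in for df_train, df_val, or df_test.
df = pd.DataFrame({
    "text": ["Operating profit rose", "Sales fell sharply", "The company has no plans"],
    "label": ["positive", "negative", "neutral"],
})

df["label"] = df["label"].map(label2id)

# Fail early if a label is missing from the mapping (map() would leave NaN).
if df["label"].isna().any():
    raise ValueError("found a label that is not in label2id")

df["label"] = df["label"].astype("int64")
print(df)
```

Applying the same mapping to all three splits before `Dataset.from_pandas` would give the `Trainer` integer labels for both training and prediction.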

In [66]:
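# loss function
# NOTE: `log_loss`, `clf`, `X_test`, and `y_test` are not defined in the cells shown in this notebook
log_loss(y_test, clf.predict_proba(X_test))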

0.5805673289689371