
In [None]:
!pip install datasets
!pip install transformers


import numpy as np
import pandas as pd

from datasets import (Dataset, DatasetDict, ClassLabel,
                      concatenate_datasets, load_dataset)
from transformers import AutoTokenizer

from huggingface_hub import notebook_login
notebook_login()

**Define helper functions for tokenizing the cleaned and preprocessed 8-K text entries in the dataset_stacked_{2/3}_labels datasets created in 08_create_and_push_datasets:**

        def shape_tokenize_function(examples):

        def base_tokenize_function(examples):

* **Each function's input parameter is:**
         # A batch of rows from the desired stacked dataset, which contains
         #   event_id, label, both text columns, and the additional
         #   descriptor details added in the previous script
         examples

* **Each function returns:**
        # The output of the tokenizer that the name of the function indicates
        #     (sec-bert-shape for shape_tokenize_function, sec-bert-base for
        #     base_tokenize_function), applied to the matching text column
        tokenized encodings produced by shape_tokenizer / base_tokenizer

These functions are then applied to the stacked datasets via the dataset map functionality, which efficiently creates the tokenized data for both versions of the text input included as columns.



In [None]:
shape_tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-shape")
base_tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-base")


def shape_tokenize_function(examples):
    # Tokenize the sec-bert-shape version of the text, padding or truncating
    # every entry to the tokenizer's maximum length.
    return shape_tokenizer(examples["text_8k_sec_bert_shape"],
                           padding="max_length",
                           truncation=True)


def base_tokenize_function(examples):
    # Same as above, for the sec-bert-base version of the text.
    return base_tokenizer(examples["text_8k_sec_bert_base"],
                          padding="max_length",
                          truncation=True)
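
To illustrate what a batched tokenize function receives and hands back, here is a minimal sketch that uses a toy whitespace tokenizer in place of the real sec-bert tokenizers. The names toy_tokenize, MAX_LEN, and PAD_ID are hypothetical and the vocabulary is invented for the example. The only point is the shape of the result: one fixed-length input_ids list and one attention_mask list per text, which is the effect that padding="max_length" together with truncation=True is meant to have.

```python
# Toy stand-in for a Hugging Face tokenizer (assumption: illustration only;
# the real sec-bert tokenizers use their own vocabulary and special tokens).
MAX_LEN = 6   # hypothetical maximum length; the real models allow far more
PAD_ID = 0    # hypothetical id reserved for padding


def toy_tokenize(examples, text_column="text"):
    """Mimic a batched tokenize function: dict of lists in, dict of lists out."""
    vocab = {}
    input_ids, attention_mask = [], []
    for text in examples[text_column]:
        # Give each new word the next free id (ids start at 1; 0 is padding).
        ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]
        ids = ids[:MAX_LEN]                 # analogous to truncation=True
        pad = MAX_LEN - len(ids)            # analogous to padding="max_length"
        input_ids.append(ids + [PAD_ID] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = {"text": ["revenue rose sharply",
                  "the company announced a new share buyback program today"]}
encoded = toy_tokenize(batch)
print(encoded["input_ids"][0])       # [1, 2, 3, 0, 0, 0]
print(encoded["attention_mask"][0])  # [1, 1, 1, 0, 0, 0]
```

The short text is padded out to MAX_LEN and the long one is cut down to it, so every row ends up with the same length.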

**Create several datasets from subsets of the overall 2 label and 3 label stacked datasets with different characteristics.**

**NOTE:** Every subset described below is generated for both the 2 label and the 3 label data.
* The two must be handled separately because their class labels are defined differently (2 bins vs. 3 bins), so rows from one cannot be mixed with rows from the other.


We will fine-tune models on several of the final datasets created in this process.

* This will allow us to observe whether fine-tuning performance shows any clear patterns or differences tied to the contrasting characteristics of the datasets (or, conversely, whether the results are inconclusive).

* Datasets covering different combinations of the following characteristics will be generated in this process and pushed to the Hub:

    * **Text tokenized using sec-bert-base vs. text tokenized using sec-bert-shape** (these cannot remain in the same Dataset because the model accepts specifically formatted Datasets as training and evaluation data)
    * Data with **labels representing 2 bins** vs. data with **labels representing 3 bins**
    * Labels based on the **non-standardized CAR values** associated with the events vs. labels based on the **standardized CAR values** associated with the events
    * Labels generated from the results of event studies with a **long event window (starting 5 days before the event date, ending 5 days after)** vs. labels generated from the results of event studies with a **short event window (starting 1 day before the event, ending 1 day after)**
        * **NOTE:** in the prior work, we generated event study results based on many different event windows.
            * However, due to time constraints, rather than testing all possible permutations (of which there are 72), we will only fine-tune on the data corresponding to the longest and shortest symmetric windows we used.
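
As a quick check on how many datasets the four characteristics above produce, the combinations can be enumerated directly. This is a minimal sketch: the naming pattern is assumed from the variable names used later in this notebook, and the list names (tokenizers, windows, label_counts, metrics) are introduced only for the example.

```python
from itertools import product

# One list per characteristic described in the bullets above.
tokenizers = ["shape", "base"]
windows = ["long", "short"]      # (-5, +5) and (-1, +1) event windows
label_counts = [2, 3]
metrics = ["car", "scar"]        # non-standardized vs. standardized CAR

# Assumed naming pattern: {tokenizer}_{window}_window_{n}_labels_{metric}
dataset_names = [
    f"{tok}_{win}_window_{n}_labels_{metric}"
    for tok, win, n, metric in product(tokenizers, windows, label_counts, metrics)
]

print(len(dataset_names))  # 16
```

Two options for each of four characteristics gives 2 x 2 x 2 x 2 = 16 datasets.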

In [None]:
dataset_2_labels = load_dataset("thowley824/dataset_stacked_2_labels")
dataset_3_labels = load_dataset("thowley824/dataset_stacked_3_labels")

**Step 1: Convert the text columns to tokenized datasets using the tokenizer functions above**

* To do so, pass the tokenizer function to the dataset map function as an argument.

In [None]:
shape_2_labels = dataset_2_labels.map(
    shape_tokenize_function, batched=True)

shape_3_labels = dataset_3_labels.map(
    shape_tokenize_function, batched=True)

base_2_labels = dataset_2_labels.map(
    base_tokenize_function, batched=True)

base_3_labels = dataset_3_labels.map(
    base_tokenize_function, batched=True)

**Step 2: Create long and short window datasets from the tokenized datasets**

* To do so, pass a lambda function to the Dataset filter function so that only the rows with the desired event window start and end are retained.
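
The filtering condition can be sketched on plain Python dictionaries to show which rows survive. The helper name in_event_window and the sample rows are hypothetical; only the column names event_window_start and event_window_end come from the datasets themselves.

```python
def in_event_window(row, start, end):
    """Return True when a row belongs to the (start, end) event window."""
    return row["event_window_start"] == start and row["event_window_end"] == end


# Hypothetical rows standing in for dataset examples.
rows = [
    {"event_id": 1, "event_window_start": -5, "event_window_end": 5},
    {"event_id": 1, "event_window_start": -1, "event_window_end": 1},
    {"event_id": 2, "event_window_start": -5, "event_window_end": 1},
]

long_window = [r for r in rows if in_event_window(r, -5, 5)]
short_window = [r for r in rows if in_event_window(r, -1, 1)]

print(len(long_window), len(short_window))  # 1 1
```

A row is kept only when both the start and the end match, so the third row falls into neither subset. With a real Dataset, the same predicate could be passed as `dataset.filter(lambda x: in_event_window(x, -5, 5))`.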

In [None]:
shape_long_window_2_labels = shape_2_labels.filter(
    lambda x: x['event_window_start'] == -5 and x['event_window_end'] == 5)
shape_long_window_3_labels = shape_3_labels.filter(
    lambda x: x['event_window_start'] == -5 and x['event_window_end'] == 5)

shape_short_window_2_labels = shape_2_labels.filter(
    lambda x: x['event_window_start'] == -1 and x['event_window_end'] == 1)
shape_short_window_3_labels = shape_3_labels.filter(
    lambda x: x['event_window_start'] == -1 and x['event_window_end'] == 1)

base_long_window_2_labels = base_2_labels.filter(
    lambda x: x['event_window_start'] == -5 and x['event_window_end'] == 5)
base_long_window_3_labels = base_3_labels.filter(
    lambda x: x['event_window_start'] == -5 and x['event_window_end'] == 5)

base_short_window_2_labels = base_2_labels.filter(
    lambda x: x['event_window_start'] == -1 and x['event_window_end'] == 1)
base_short_window_3_labels = base_3_labels.filter(
    lambda x: x['event_window_start'] == -1 and x['event_window_end'] == 1)

**Step 3: Create CAR-label-based and SCAR-label-based datasets from the long and short window / base and shape tokenized data created in Step 2.**

In [None]:
shape_long_window_2_labels_car = shape_long_window_2_labels.filter(
    lambda x: (x['abnormal_return_metric']=='car'))

shape_long_window_3_labels_car = shape_long_window_3_labels.filter(
    lambda x: (x['abnormal_return_metric']=='car'))

shape_long_window_2_labels_scar = shape_long_window_2_labels.filter(
    lambda x: (x['abnormal_return_metric']=='scar'))

shape_long_window_3_labels_scar = shape_long_window_3_labels.filter(
    lambda x: (x['abnormal_return_metric']=='scar'))

base_long_window_2_labels_car = base_long_window_2_labels.filter(
    lambda x: (x['abnormal_return_metric']=='car'))

base_long_window_2_labels_scar = base_long_window_2_labels.filter(
    lambda x: (x['abnormal_return_metric']=='scar'))

base_long_window_3_labels_car = base_long_window_3_labels.filter(
    lambda x: (x['abnormal_return_metric']=='car'))

base_long_window_3_labels_scar = base_long_window_3_labels.filter(
    lambda x: (x['abnormal_return_metric']=='scar'))

shape_short_window_2_labels_car = shape_short_window_2_labels.filter(
    lambda x: (x['abnormal_return_metric']=='car'))

shape_short_window_2_labels_scar = shape_short_window_2_labels.filter(
    lambda x: (x['abnormal_return_metric']=='scar'))

shape_short_window_3_labels_car = shape_short_window_3_labels.filter(
    lambda x: (x['abnormal_return_metric']=='car'))

shape_short_window_3_labels_scar = shape_short_window_3_labels.filter(
    lambda x: (x['abnormal_return_metric']=='scar'))

base_short_window_2_labels_car = base_short_window_2_labels.filter(
    lambda x: (x['abnormal_return_metric']=='car'))

base_short_window_2_labels_scar = base_short_window_2_labels.filter(
    lambda x: (x['abnormal_return_metric']=='scar'))

base_short_window_3_labels_car = base_short_window_3_labels.filter(
    lambda x: (x['abnormal_return_metric']=='car'))

base_short_window_3_labels_scar = base_short_window_3_labels.filter(
    lambda x: (x['abnormal_return_metric']=='scar'))

**Step 4: Clean up the dataset formatting for each of the created datasets so that they can be used in model fine-tuning and testing.**

- Remove all columns from every dataset created in Step 3 except for those created by the tokenizer and the label column.

- Rename the "label" column to "labels" to conform with model input requirements.
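
To make the effect of this step concrete, the sketch below models a dataset as a plain dictionary of columns and applies the same two operations. The helper drop_and_rename and the sample values are hypothetical, and the presence of a token_type_ids column is an assumption about what a BERT-style tokenizer adds.

```python
def drop_and_rename(columns, drop, old="label", new="labels"):
    """Drop the listed columns, then rename one column, on a dict of lists."""
    kept = {name: values for name, values in columns.items() if name not in drop}
    kept[new] = kept.pop(old)
    return kept


# Hypothetical single-row table; all values are invented for illustration.
table = {
    "event_id": [101],
    "label": [1],
    "text_8k_sec_bert_base": ["placeholder text"],
    "text_8k_sec_bert_shape": ["placeholder text"],
    "event_window_start": [-5],
    "event_window_end": [5],
    "abnormal_return_metric": ["car"],
    "input_ids": [[5, 6, 7]],
    "token_type_ids": [[0, 0, 0]],   # assumed tokenizer output column
    "attention_mask": [[1, 1, 1]],
}

to_drop = [
    "event_id", "text_8k_sec_bert_base", "text_8k_sec_bert_shape",
    "event_window_start", "event_window_end", "abnormal_return_metric",
]

cleaned = drop_and_rename(table, to_drop)
print(sorted(cleaned))
# ['attention_mask', 'input_ids', 'labels', 'token_type_ids']
```

On an actual Dataset, the remove_columns and rename_column methods play these two roles.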

In [None]:
remove_columns = [
    'event_id','text_8k_sec_bert_base','text_8k_sec_bert_shape',
    'event_window_start','event_window_end','abnormal_return_metric']


shape_long_window_2_labels_car = shape_long_window_2_labels_car.map(
    remove_columns = remove_columns)
shape_long_window_2_labels_car = shape_long_window_2_labels_car.rename_column(
    "label", "labels")


shape_long_window_3_labels_car = shape_long_window_3_labels_car.map(
    remove_columns = remove_columns)
shape_long_window_3_labels_car = shape_long_window_3_labels_car.rename_column(
    "label", "labels")


shape_long_window_2_labels_scar = shape_long_window_2_labels_scar.map(
    remove_columns = remove_columns)
shape_long_window_2_labels_scar = shape_long_window_2_labels_scar.rename_column(
    "label", "labels")


shape_long_window_3_labels_scar = shape_long_window_3_labels_scar.map(
    remove_columns = remove_columns)
shape_long_window_3_labels_scar = shape_long_window_3_labels_scar.rename_column(
    "label", "labels")


base_long_window_2_labels_car = base_long_window_2_labels_car.map(
    remove_columns = remove_columns)
base_long_window_2_labels_car = base_long_window_2_labels_car.rename_column(
    "label", "labels")


base_long_window_2_labels_scar = base_long_window_2_labels_scar.map(
    remove_columns = remove_columns)
base_long_window_2_labels_scar = base_long_window_2_labels_scar.rename_column(
    "label", "labels")


base_long_window_3_labels_car = base_long_window_3_labels_car.map(
    remove_columns = remove_columns)
base_long_window_3_labels_car = base_long_window_3_labels_car.rename_column(
    "label", "labels")


base_long_window_3_labels_scar = base_long_window_3_labels_scar.map(
    remove_columns = remove_columns)
base_long_window_3_labels_scar = base_long_window_3_labels_scar.rename_column(
    "label", "labels")


shape_short_window_2_labels_car = shape_short_window_2_labels_car.map(
    remove_columns = remove_columns)
shape_short_window_2_labels_car = shape_short_window_2_labels_car.rename_column(
    "label", "labels")


shape_short_window_2_labels_scar = shape_short_window_2_labels_scar.map(
    remove_columns = remove_columns)
shape_short_window_2_labels_scar = shape_short_window_2_labels_scar.rename_column(
    "label", "labels")


shape_short_window_3_labels_car = shape_short_window_3_labels_car.map(
    remove_columns = remove_columns)
shape_short_window_3_labels_car = shape_short_window_3_labels_car.rename_column(
    "label", "labels")


shape_short_window_3_labels_scar = shape_short_window_3_labels_scar.map(
    remove_columns = remove_columns)
shape_short_window_3_labels_scar = shape_short_window_3_labels_scar.rename_column(
    "label", "labels")


base_short_window_2_labels_car = base_short_window_2_labels_car.map(
    remove_columns = remove_columns)
base_short_window_2_labels_car = base_short_window_2_labels_car.rename_column(
    "label", "labels")


base_short_window_2_labels_scar = base_short_window_2_labels_scar.map(
    remove_columns = remove_columns)
base_short_window_2_labels_scar = base_short_window_2_labels_scar.rename_column(
    "label", "labels")

base_short_window_3_labels_car = base_short_window_3_labels_car.map(
    remove_columns = remove_columns)
base_short_window_3_labels_car = base_short_window_3_labels_car.rename_column(
    "label", "labels")

base_short_window_3_labels_scar = base_short_window_3_labels_scar.map(
    remove_columns = remove_columns)
base_short_window_3_labels_scar = base_short_window_3_labels_scar.rename_column(
    "label", "labels")

**Step 5: Push the final version of the datasets to the Hugging Face Hub.**

In [None]:
shape_long_window_2_labels_car.push_to_hub('shape_long_window_2_labels_car')
shape_long_window_3_labels_car.push_to_hub('shape_long_window_3_labels_car')
base_long_window_2_labels_car.push_to_hub('base_long_window_2_labels_car')
base_long_window_3_labels_car.push_to_hub('base_long_window_3_labels_car')
shape_short_window_2_labels_car.push_to_hub('shape_short_window_2_labels_car')
shape_short_window_3_labels_car.push_to_hub('shape_short_window_3_labels_car')
base_short_window_2_labels_car.push_to_hub('base_short_window_2_labels_car')
base_short_window_3_labels_car.push_to_hub('base_short_window_3_labels_car')

shape_long_window_2_labels_scar.push_to_hub('shape_long_window_2_labels_scar')
shape_long_window_3_labels_scar.push_to_hub('shape_long_window_3_labels_scar')
base_long_window_2_labels_scar.push_to_hub('base_long_window_2_labels_scar')
base_long_window_3_labels_scar.push_to_hub('base_long_window_3_labels_scar')
shape_short_window_2_labels_scar.push_to_hub('shape_short_window_2_labels_scar')
shape_short_window_3_labels_scar.push_to_hub('shape_short_window_3_labels_scar')
base_short_window_2_labels_scar.push_to_hub('base_short_window_2_labels_scar')
base_short_window_3_labels_scar.push_to_hub('base_short_window_3_labels_scar')