# Deep Learning for Business Applications course

## TOPIC 15: Text classification with Transformers

Submission:
* There is no late submission for this assignment.
* It is individual work.
* Each student should submit an .ipynb file to the Teams Assignment.
* Grading: 0-15% of the course.

### <font color='red'>HOME ASSIGNMENT</font>
Here is a <a href="https://huggingface.co/datasets/banking77">dataset</a> composed of online banking queries annotated with their corresponding intents. You can download it by running the cell below. The data will be saved in the working directory.

DO NOT USE the function ```datasets.load_dataset('banking77')```! Run the cell below and work with the raw data.

What I expect you to do:
* Explore data: shape, number of classes, balance of classes. __2 points__.
* Solve a classification problem for a dataset using any <a href="https://huggingface.co/docs/transformers/tasks/sequence_classification">transformer</a> from the huggingface library. __3 points__.  
The tutorial at the link might be helpful.
* Justify your choice of metric. __3 points__.
* Split the training dataset into train and valid datasets. Train the model on the train dataset and evaluate it on the valid dataset during training. Evaluate the model on the test dataset after training. DO NOT USE the test dataset for validation!  __2 points__.
* Come up with 3 or more queries on banking topics and predict their intents using your model. __2 points__.
* Comment your code and describe your actions in the notebook. __1 point__.
* You must achieve a metric value of at least 90%. __2 points__.
* Attach this file to the Teams Assignment. If you do not attach the file, you will get __0 points__.

The final score is calculated as a sum of all points.

If any of the tasks is completed with errors, the number of points for it may be reduced. For example, if you write only one query for the model instead of at least 3, you will get __1 point__ instead of __2 points__.

### Notes
* Feel free to ask questions.
* Google Colab and Kaggle provide some free CPU and GPU time. Feel free to use them or university resources.
* Good luck!

### 1. Libraries

In [None]:
DATASPHERE = False

if DATASPHERE:
    # !!! ATTENTION for DataSphere !!!
    # You will need to restart the kernel
    # after the libraries are installed
    %pip install transformers==4.49.0 evaluate accelerate
else:
    !pip install tensorflow-cpu==2.16.1
    !pip install tf-keras==2.16.0 --no-dependencies
    !pip install transformers evaluate accelerate

### 2. Data

In [None]:
%%bash
wget -q -nc "https://raw.githubusercontent.com/PolyAI-LDN/task-specific-datasets/master/banking_data/train.csv"
wget -q -nc "https://raw.githubusercontent.com/PolyAI-LDN/task-specific-datasets/master/banking_data/test.csv"

### 3. Solutions
#### Task #1: Start with the data
* Read train and test data.
* Explore data: shape, number of classes, balance of classes. Bar chart from `plotly.express` can be helpful.
* Is there a class imbalance?
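
The exploration steps above can be sketched on a toy frame. This is only an illustration under stated assumptions: the rows and the `toy` name are made up, and only the `text` and `category` column names match the real files.

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical rows, real column names).
toy = pd.DataFrame({
    "text": ["q1", "q2", "q3", "q4", "q5", "q6"],
    "category": ["card_arrival", "card_arrival", "card_arrival",
                 "lost_card", "lost_card", "top_up_failed"],
})

n_rows, n_cols = toy.shape                      # shape of the data
class_counts = toy["category"].value_counts()   # rows per class, most frequent first
n_classes = class_counts.size                   # number of distinct classes
imbalance_ratio = class_counts.max() / class_counts.min()

print(f"rows={n_rows}, columns={n_cols}, classes={n_classes}")
print(class_counts)
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```

A ratio close to 1 suggests balanced classes; a large ratio suggests imbalance. The same counts can be plotted by passing `x=class_counts.index` and `y=class_counts.values` to `px.bar`.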

In [None]:
import pandas as pd
import plotly.express as px

In [None]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [None]:
# YOUR CODE HERE

#### Task #2: Prepare datasets to train a model.
* Create `label2id` and `id2label` mappings.
* Create a `label` column in train and test datasets.
* Split the train dataset into train and valid parts. Choose the proportion yourself. Don't forget to fix the `seed`. Use stratification if needed.
* For model training, it will be useful to wrap your dataframes in `Dataset` and `DatasetDict` objects of the `datasets` library.
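
A minimal sketch of the label mapping and the stratified split, assuming toy data. The `toy_*` names, the seed value, and the 80/20 proportion are illustrative choices, not requirements of the assignment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42  # arbitrary; any fixed integer makes the split reproducible

# Toy frame: 3 classes with 10 rows each (hypothetical data).
toy = pd.DataFrame({
    "text": [f"query {i}" for i in range(30)],
    "category": ["a", "b", "c"] * 10,
})

# Sorting the class names keeps the mapping identical across runs.
toy_classes = sorted(toy["category"].unique())
toy_label2id = {c: i for i, c in enumerate(toy_classes)}
toy_id2label = {i: c for c, i in toy_label2id.items()}
toy["label"] = toy["category"].map(toy_label2id)

# `stratify` keeps the class proportions the same in both parts.
toy_train, toy_valid = train_test_split(
    toy, test_size=0.2, random_state=SEED, stratify=toy["label"]
)

print(toy_train["label"].value_counts().to_dict())
print(toy_valid["label"].value_counts().to_dict())
```

The resulting frames could then be wrapped with `Dataset.from_pandas` and collected into a `DatasetDict`, for example under the keys `train`, `valid`, and `test`.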

In [None]:
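# Sort the class names so that the mapping is the same on every run
# (iteration order of a set of strings can change between Python sessions).
id2label = {i: c for i, c in enumerate(sorted(train['category'].unique()))}
label2id = {c: i for i, c in id2label.items()}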

In [None]:
for i, c in id2label.items():
    assert label2id[c] == i

In [None]:
train['label'] = train['category'].map(label2id)
test['label'] = test['category'].map(label2id)

In [None]:
from torch.utils.data import TensorDataset
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
import evaluate
import numpy as np

from tqdm.auto import tqdm
tqdm.pandas()

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# YOUR CODE HERE

In [None]:
from datasets import Dataset, DatasetDict

In [None]:
# YOUR CODE HERE

#### Task #3: Fit the model
* Choose a model, for example, `roberta-base`.
* Download the appropriate tokenizer and apply it to the dataset. Do not forget to truncate long sequences.
* Create a `DataCollatorWithPadding` and a metric.

In [None]:
tokenizer = AutoTokenizer.from_pretrained('roberta-base')

In [None]:
# `dataset` is the DatasetDict built in Task #2
tokenized_dataset = dataset.map(lambda data: tokenizer(data['text'], truncation=True), batched=True)

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
f1 = evaluate.load('f1')

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    f1_macro = f1.compute(predictions=predictions, references=labels, average='macro')
    f1_weighted = f1.compute(predictions=predictions, references=labels, average='weighted')
    metrics = dict(f1_macro=f1_macro['f1'], f1_weighted=f1_weighted['f1'])
    return metrics
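
To see why `compute_metrics` reports both averages, here is a small sketch on made-up labels. It calls scikit-learn's `f1_score` directly as a stand-in for `evaluate.load('f1')`; the label arrays are hypothetical and chosen to be imbalanced.

```python
from sklearn.metrics import f1_score

# Hypothetical imbalanced labels: 8 samples of class 0, 2 samples of class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# The model misses one of the two rare-class samples.
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Macro: plain mean of per-class F1, so every class counts equally.
macro = f1_score(y_true, y_pred, average="macro")
# Weighted: per-class F1 weighted by class frequency in y_true.
weighted = f1_score(y_true, y_pred, average="weighted")

print(f"macro F1:    {macro:.4f}")
print(f"weighted F1: {weighted:.4f}")
```

Here the per-class scores are 16/17 for class 0 and 2/3 for class 1, so the macro average (about 0.80) is pulled down by the rare class more than the weighted average (about 0.89). Which one to treat as the main metric depends on how much the rare intents matter for the task.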

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# YOUR CODE HERE

In [None]:
# YOUR CODE HERE

trainer.train()