## Appeals classification task

The task is to train a model that finds, among all customer feedback, the appeals from bank customers that describe a specific type of fraud. We are interested in situations where a potential impostor calls the customer and introduces himself as a member of the bank's customer service. The impostor then ***tells the customer the actual balance of the customer's card account*** to gain the victim's trust. The texts are in Russian.

The task is difficult because many messages mention problems with the card account balance or describe other types of fraud, so regular expressions and similar rule-based approaches struggle to isolate the messages we need.
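To make the difficulty concrete, here is a minimal sketch of a keyword rule applied to three made-up messages. The messages are hypothetical English stand-ins (the real data is in Russian and is not reproduced here), and the pattern `balance|fraud` is only an illustrative assumption about what a simple rule might look like.

```python
import re

# Hypothetical English stand-ins for the Russian appeals; label 1 marks the
# target fraud type (the caller states the real balance), 0 marks anything else.
messages = [
    ("A caller posing as bank support told me my exact card balance", 1),
    ("My card balance is wrong after yesterday's transfer", 0),
    ("A fraudster called and asked for the code from the SMS", 0),
]

# Illustrative keyword rule: flag any message that mentions a balance or fraud.
keyword_rule = re.compile(r"balance|fraud", re.IGNORECASE)

flagged = [bool(keyword_rule.search(text)) for text, _ in messages]
labels = [label for _, label in messages]

# Share of flagged messages that really belong to the target class.
true_positives = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
precision = true_positives / sum(flagged)
print(flagged, precision)
```

The rule flags all three messages although only one of them is the target type. Tightening the pattern would help on these toy examples, but it would quickly turn into a long list of special cases on real feedback.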

We will use the Transformers and PyTorch libraries to fine-tune a BERT model, since BERT performs well on a wide range of tasks and is relatively easy to fine-tune.

### Installing Packages

In [2]:
!pip install -r /content/requirements.txt

Collecting torch==1.8.1
  Downloading https://files.pythonhosted.org/packages/56/74/6fc9dee50f7c93d6b7d9644554bdc9692f3023fa5d1de779666e6bf8ae76/torch-1.8.1-cp37-cp37m-manylinux1_x86_64.whl (804.1MB)
Collecting sentencepiece==0.1.95
  Downloading https://files.pythonhosted.org/packages/f5/99/e0808cb947ba10f575839c43e8fafc9cc44e4a7a2c8f79c60db48220a577/sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl (1.2MB)
Collecting transformers==4.5.1
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2.1MB)
Collecting scikit-learn==0.22.2
  Downloading https://files.pythonhosted.org/packages/71/b0/471bfdb7741523dfbddd038cb5f7cc9e21d8aaa1987839af6f17238254c0/scikit_learn-0.22.2

In [2]:
import pandas as pd

In [2]:
import torch

### Data

We have 1141 manually labeled short texts. **1** marks the type of message we are looking for and **0** marks any other message.
We will use 700 for training, 225 for validation and 216 for testing and evaluating the model.
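The notebook does not show how the three files were produced. As a minimal sketch, assuming a simple random split with a fixed seed (the function name `split_indices` and the seed are hypothetical), the stated sizes could be obtained like this:

```python
import random


def split_indices(n_total, n_train, n_valid, seed=42):
    """Shuffle row indices and cut them into train / validation / test parts."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # everything that is left over
    return train, valid, test


train_idx, valid_idx, test_idx = split_indices(1141, 700, 225)
print(len(train_idx), len(valid_idx), len(test_idx))  # 700 225 216
```

Because the positive class is probably much rarer than the negative one, a stratified split that preserves the class ratio in each part may be a better choice in practice.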

Importing Data

In [3]:
train_data = pd.read_csv('train.csv', sep="\t")
valid_data = pd.read_csv('valid.csv', sep="\t")
test_data  = pd.read_csv('test.csv', sep="\t")

For this task we created ***two helper classes***: one for the **classifier** and one for **data processing**.

The ***CustomDataset*** class processes the input texts and prepares them for PyTorch's DataLoader class. More specifically, it tokenizes the input texts with the previously defined tokenizer, applying *padding*, and converts the target data into *tensors*. It is based on [this tutorial](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html).
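The source of `CustomDataset` is not shown here, so the following is only a library-free illustration of what padding does to a batch of tokenized texts. The function name `pad_batch`, the token ids, and `pad_id=0` are assumptions made for the example; in the notebook this step is presumably handled by the tokenizer itself.

```python
def pad_batch(token_ids, max_len, pad_id=0):
    """Pad (or truncate) each id sequence to max_len and build attention masks."""
    input_ids, attention_mask = [], []
    for ids in token_ids:
        ids = ids[:max_len]                 # truncate sequences that are too long
        n_pad = max_len - len(ids)          # number of padding positions needed
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Two made-up tokenized texts of different lengths
batch_ids, batch_mask = pad_batch([[101, 7, 8, 102], [101, 9, 102]], max_len=5)
print(batch_ids)   # [[101, 7, 8, 102, 0], [101, 9, 102, 0, 0]]
print(batch_mask)  # [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0]]
```

After padding, every sequence in the batch has the same length, which is what allows the batch to be stacked into a single tensor.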

***BertClassifier*** is our main class; it trains and evaluates the model. It takes as input the *path to the model*, the *path to the tokenizer*, the *number of classes to predict*, the *number of epochs*, and the path where the best model is saved.
- **preparation** initializes the dataloaders using our *CustomDataset* class, the *optimizer parameters*, and a *loss function*
- **fit** defines the training loop and performs the optimization steps
- **eval** is the evaluation method; it returns the loss and accuracy on the validation dataset
- **train** runs **fit** for the configured number of epochs and saves the best model
- **predict** takes a text and outputs the prediction of the trained model that was saved by **train**
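The interplay between **train**, **fit**, and **eval** can be sketched as follows. This is not the code of `BertClassifier`: the function name, the callbacks, and the choice of validation accuracy as the "best model" criterion are all assumptions. The stub values are the validation numbers from the training log later in this notebook, rounded.

```python
def train_with_checkpointing(fit_epoch, evaluate, save_checkpoint, epochs):
    """Run fit once per epoch and keep the checkpoint with the best validation accuracy."""
    best_acc = float("-inf")
    best_epoch = -1
    for epoch in range(epochs):
        fit_epoch(epoch)                      # one pass over the training data
        val_loss, val_acc = evaluate(epoch)   # metrics on the validation data
        if val_acc > best_acc:                # strict '>' keeps the earlier epoch on ties
            best_acc, best_epoch = val_acc, epoch
            save_checkpoint(epoch)
    return best_epoch


# Stubs standing in for the real model: (validation loss, validation accuracy) per epoch
val_history = [(0.162, 0.9689), (0.195, 0.9689)]
saved_epochs = []

best = train_with_checkpointing(
    fit_epoch=lambda epoch: None,
    evaluate=lambda epoch: val_history[epoch],
    save_checkpoint=saved_epochs.append,
    epochs=2,
)
print(best, saved_epochs)  # 0 [0]
```

With these numbers the accuracy is tied, so the first epoch is kept. A criterion based on validation loss would also pick the first epoch here, since the loss is lower there.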

In [4]:
from bert_dataset import CustomDataset
from bert_classifier import BertClassifier

### Initialize BERT classifier
Here we initialize an object of our BertClassifier class. The model used is **RuBERT**, a popular BERT model for the Russian language. You can find it on [HuggingFace](https://huggingface.co/DeepPavlov/rubert-base-cased).

In [5]:
classifier = BertClassifier(
        model_path='rubert_cased_L-12_H-768_A-12_v1',
        tokenizer_path='rubert_cased_L-12_H-768_A-12_v1/vocab.txt',
        n_classes=2,
        epochs=2,
        model_save_path='bertmodel_.pt'
)

Some weights of the model checkpoint at rubert_cased_L-12_H-768_A-12_v1 were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model

Prepare the data and helpers for training and evaluation. Our training data looks like this:

In [6]:
train_data

| Unnamed: 0 | text | value |
|---|---|---|
| 0 | дк не согласен с ответом с ее слов никаких пар... | 0 |
| 1 | клиент хочет получить на электронную почту так... | 0 |
| 2 | звонила служба безоп. сбербанка запрашивали но... | 0 |
| 3 | с этого номера тоже звонили | 0 |
| 4 | треб решить его вопрос говорит. что на момент ... | 1 |
| ... | ... | ... |
| 695 | клиент сам попросит зафиксировать обращение | 0 |
| 696 | сообщили о переводе назвали кол-во карт клиент... | 1 |
| 697 | клиент категоричен и настаивает на том чтобы б... | 1 |
| 698 | при обращении клиента выяснилось что баланс по... | 0 |


In [7]:
classifier.preparation(
        X_train=list(train_data['text']),
        y_train=list(train_data['value']),
        X_valid=list(valid_data['text']),
        y_valid=list(valid_data['value'])
    )

In [8]:
if torch.cuda.is_available():
    # Index 1 selects the second GPU; use index 0 on a single-GPU machine
    device = torch.device("cuda", 1)
    print("GPU available")
else:
    device = torch.device("cpu")
    print("GPU unavailable")

GPU available


Train our model

In [9]:
classifier.train()

Epoch 1/2
Train loss 0.5305999353025774 accuracy 0.87
Val loss 0.1620393122280577 accuracy 0.9688888888888889
----------
Epoch 2/2
Train loss 0.21008016967630413 accuracy 0.9528571428571428
Val loss 0.1951964583280867 accuracy 0.9688888888888889
----------
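The accuracies in the log can be converted back into counts of correctly classified texts, using the split sizes stated earlier (700 training and 225 validation texts). The helper name `correct_count` is made up for this short check.

```python
def correct_count(accuracy, n_samples):
    """Convert an accuracy into the number of correctly classified samples."""
    return round(accuracy * n_samples)


train_size, valid_size = 700, 225

# Accuracies copied from the training log above
print(correct_count(0.87, train_size))                # 609 (epoch 1, train)
print(correct_count(0.9528571428571428, train_size))  # 667 (epoch 2, train)
print(correct_count(0.9688888888888889, valid_size))  # 218 (both epochs, validation)
```

The model therefore gets 218 of the 225 validation texts right in both epochs. The validation loss rises from about 0.162 to 0.195 in the second epoch while the accuracy stays the same, which may be an early sign of overfitting, although two epochs are too few to be sure.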


Check test data

In [10]:
texts = list(test_data['text'])
labels = list(test_data['value'])

predictions = [classifier.predict(t) for t in texts]

In [12]:
from sklearn.metrics import precision_recall_fscore_support

precision, recall, f1score = precision_recall_fscore_support(labels, predictions, average='macro')[:3]

print(f'precision: {precision}, recall: {recall}, f1score: {f1score}')

precision: 0.9161256228295334, recall: 0.9680706521739131, f1score: 0.9396334890406036
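To clarify what `average='macro'` computes, here is a minimal pure-Python sketch: precision, recall, and F1 are calculated separately for each class and then averaged with equal weight. The function name and the toy labels are made up for the illustration, and scikit-learn's handling of edge cases is only approximated (undefined ratios are set to 0).

```python
def macro_precision_recall_f1(y_true, y_pred):
    """Unweighted mean of per-class precision, recall, and F1."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels, not taken from the test set
toy_true = [1, 1, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 1]
print(macro_precision_recall_f1(toy_true, toy_pred))  # (0.625, 0.625, 0.625)
```

Note that the macro F1 is the mean of the per-class F1 scores, not the harmonic mean of the macro precision and macro recall. This is why the reported F1 of 0.9396 differs slightly from the harmonic mean of 0.9161 and 0.9681, which is about 0.941.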


We got a macro-averaged F1 score of 0.94 on the 216 test samples, which is a good result. The model can now be used to gather additional training data and improve the results further.