<a href="https://colab.research.google.com/github/victor-roris/NLPlearning/blob/master/text_classification/NLPModel_MultiClass_SimpleTransformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Transformers - Text classification

The Simple Transformers library is built on top of the excellent Transformers library by Hugging Face. 

[Explanation](https://medium.com/swlh/simple-transformers-multi-class-text-classification-with-bert-roberta-xlnet-xlm-and-8b585000ce3a)

[GitHub](https://github.com/ThilinaRajapakse/simpletransformers)

## Installation

In [0]:
! pip install simpletransformers

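Simple Transformers, as used in this notebook, also needs NVIDIA Apex for mixed-precision training. Clone the repository and install it from source: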

In [0]:
! git clone https://github.com/NVIDIA/apex

In [0]:
! pip install -v --no-cache-dir ./apex

## Data

Simple Transformers expects the input DataFrame to have the following structure:
- The first column contains the text and is of type str.
- The second column contains the labels and is of type int.

For multiclass classification, the labels should be integers starting from 0. 
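As a minimal sketch of that layout, the toy frame below encodes string tags as integers starting from 0 with `pd.factorize`. The rows, the tag names, and the column names `text` and `labels` are made up for illustration.

```python
import pandas as pd

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "text": ["how do I parse a date", "center a div",
             "list comprehension syntax", "flexbox gap"],
    "label_name": ["python", "css", "python", "css"],
})

# factorize assigns 0, 1, 2, ... in order of first appearance
# and also returns the unique names in that same order.
codes, names = pd.factorize(toy["label_name"])

# First column: text (str). Second column: labels (int).
toy_df = pd.DataFrame({"text": toy["text"], "labels": codes})

print(toy_df.dtypes)
print(list(names))
```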

In [0]:
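import pandas as pd

# Load the Stack Overflow posts and drop rows that have no tag.
df = pd.read_csv('https://storage.googleapis.com/tensorflow-workshop-examples/stack-overflow-data.csv')
df = df[pd.notnull(df['tags'])]
# Keep the tag names in order of first appearance, then replace each tag with its integer id.
categories = df["tags"].unique()
df['tags'] = pd.factorize(df['tags'])[0]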

print(f'Number of examples : {len(df)}')
print(f'Number of words in the data: {df["post"].apply(lambda x: len(x.split(" "))).sum()}')
print(f'Number of categories : {len(categories)}')
print()
print(df.head(10))
print()
print(f'Categories: {categories}')

Number of examples : 40000
Number of words in the data: 10286120
Number of categories : 20

                                                post tags
0  what is causing this behavior  in our c# datet...    0
1  have dynamic html load as if it was in an ifra...    1
2  how to convert a float value in to min:sec  i ...    2
3  .net framework 4 redistributable  just wonderi...    3
4  trying to calculate and print the mean and its...    4
5  how to give alias name for my website  i have ...    1
6  window.open() returns null in angularjs  it wo...    5
7  identifying server timeout quickly in iphone  ...    6
8  unknown method key  error in rails 2.3.8 unit ...    7
9  from the include  how to show and hide the con...    5

Categories: ['c#' 'asp.net' 'objective-c' '.net' 'python' 'angularjs' 'iphone'
 'ruby-on-rails' 'ios' 'c' 'sql' 'java' 'jquery' 'css' 'c++' 'php'
 'android' 'mysql' 'javascript' 'html']
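
The evaluation cell further down expects an `eval_df` that should not overlap with the training data. One possible way to build it is a random hold-out split, sketched here on a small stand-in frame. The names `demo_df`, `train_df`, and `eval_df` and the 80/20 ratio are assumptions for illustration, not something the notebook specifies.

```python
import pandas as pd

# Stand-in for the real df: 10 rows, two classes.
demo_df = pd.DataFrame({
    "post": [f"example post {i}" for i in range(10)],
    "tags": [i % 2 for i in range(10)],
})

# Hold out 20% of the rows for evaluation; the rest is used for training.
eval_df = demo_df.sample(frac=0.2, random_state=42)
train_df = demo_df.drop(eval_df.index)

print(len(train_df), len(eval_df))
```

With the real data, the same two lines would be applied to `df`, and `train_df` would be passed to `train_model` in place of `df`.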


## The model

The model_type may be one of ['bert', 'xlnet', 'xlm', 'roberta', 'distilbert']

A ClassificationModel takes a dict, args, whose entries control the hyperparameters and other training options.
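
For illustration, a few keys are sketched below. The values are examples rather than recommendations. Only the first four keys appear in this notebook; `train_batch_size`, `max_seq_length`, and `output_dir` are assumed names that should be checked against the installed version of the library. The batch size of 8 is inferred from the training log further down (40000 examples, 5000 iterations per epoch).

```python
import math

# Example values only. Keys marked "assumed" are not used elsewhere in this
# notebook and should be verified against the installed library version.
args = {
    "learning_rate": 1e-5,
    "num_train_epochs": 2,
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "train_batch_size": 8,      # assumed key; value inferred from the log
    "max_seq_length": 128,      # assumed key and value
    "output_dir": "outputs/",   # assumed key and value
}

# Iterations (batches) per epoch follow from dataset size and batch size.
n_examples = 40000
steps_per_epoch = math.ceil(n_examples / args["train_batch_size"])
print(steps_per_epoch)  # 5000
```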



In [0]:
from simpletransformers.classification import ClassificationModel

# Create a ClassificationModel
model = ClassificationModel('roberta', 'roberta-base',
                            num_labels=len(categories),
                            args={'learning_rate':1e-5, 'num_train_epochs': 2, 'reprocess_input_data': True, 'overwrite_output_dir': True})


## Train model

In [0]:
model.train_model(df)

Converting to features started. Cache is not used.


  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."


[Progress bar: max=40000]


Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic


[Progress bar: Epoch, max=2]

[Progress bar: Current iteration, max=5000]

Running loss: 3.000385
Running loss: 2.896610
Running loss: 3.082296
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Running loss: 2.648316
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
Running loss: 2.031017
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
Running loss: 1.371356
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
Running loss: 0.925076
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
Running loss: 0.230277

[Progress bar: Current iteration, max=5000]

Running loss: 0.693092

## Evaluation

In [0]:
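# The model above was trained on all of df, so this 20% sample is only a stand-in and its scores will be optimistic.
eval_df = df.sample(frac=0.2, random_state=42)
result, model_outputs, wrong_predictions = model.eval_model(eval_df)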

To evaluate the model, just call eval_model. This method has three return values.

* **result**: The evaluation result in the form of a dict. By default, only the Matthews correlation coefficient (MCC) is calculated for multiclass classification.

* **model_outputs**: A list of model outputs for each item in the evaluation dataset. This is useful if you need probabilities for each class rather than a single prediction: applying a softmax function over the outputs gives the probabilities, and the predicted class is the one with the highest output.

* **wrong_predictions**: A list of InputFeature of each incorrect prediction. The text may be obtained from the InputFeature.text_a attribute. (The InputFeature class can be found in the utils.py file in the repo)
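
To turn `model_outputs` into probabilities, a softmax can be applied row by row. The sketch below uses made-up scores for three classes in place of real model outputs, and assumes `model_outputs` is array-like with one row per example.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Made-up raw scores for two examples and three classes.
fake_outputs = [[2.0, 0.5, -1.0],
                [0.1, 0.2, 3.0]]

probs = softmax(fake_outputs)
preds = probs.argmax(axis=1)  # same classes as the argmax of the raw scores

print(probs.sum(axis=1))  # each row sums to 1
print(preds)
```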

You can also include additional metrics in the evaluation. Simply pass the metric functions as keyword arguments to the eval_model method; each function is called with the true labels and the predictions.

In [0]:
from sklearn.metrics import f1_score, accuracy_score


def f1_multiclass(labels, preds):
    return f1_score(labels, preds, average='micro')
    
result, model_outputs, wrong_predictions = model.eval_model(eval_df, f1=f1_multiclass, acc=accuracy_score)


## Predictions

In [0]:
predictions, raw_outputs = model.predict(['how to convert a string value to float.'])

In [0]:
predictions
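
`predictions` holds integer class ids. Because the labels were produced with `pd.factorize`, each id is an index into `categories`. A small sketch of the lookup, using a shortened stand-in list and a hypothetical prediction rather than real model output:

```python
# Shortened stand-in for the `categories` array built in the data section.
categories = ["c#", "asp.net", "objective-c", ".net", "python"]

# Hypothetical value of `predictions` for a single input text.
predictions = [4]

# Each predicted id indexes into categories.
predicted_tags = [categories[i] for i in predictions]
print(predicted_tags)
```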