<a href="https://colab.research.google.com/github/samhiggs/journal-title-text-classifier/blob/main/journal_title_conference_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
! pip install transformers datasets

Installing collected packages: xxhash, fsspec, datasets
Successfully installed datasets-1.8.0 fsspec-2021.6.1 xxhash-2.0.2


## Load Dataset
This notebook uses a clean, labeled dataset of paper titles. The task is to predict the conference a paper appeared in, based only on its title.

In [None]:
import pandas as pd

paper_url = "https://raw.githubusercontent.com/susanli2016/NLP-with-Python/master/data/title_conference.csv"
papers_df = pd.read_csv(paper_url)

In [None]:
papers_df.head()

Unnamed: 0,Title,Conference
0,Innovation in Database Management: Computer Sc...,VLDB
1,High performance prime field multiplication fo...,ISCAS
2,enchanted scissors: a scissor interface for su...,SIGGRAPH
3,Detection of channel degradation attack by Int...,INFOCOM
4,Pinning a Complex Network through the Betweenn...,ISCAS


## Split Dataset

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(papers_df, test_size=0.3, stratify=papers_df.Conference)
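
The `stratify` argument keeps the share of each conference roughly equal in the train and test sets. The sketch below illustrates the idea in plain Python on a toy label list. It is not scikit-learn's implementation, and the class counts and the helper name `stratified_split` are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.3, seed=0):
    """Return (train_idx, test_idx), splitting every class at about test_size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class only
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy, imbalanced labels (hypothetical counts, not the real dataset).
toy_labels = ["VLDB"] * 100 + ["WWW"] * 50 + ["ISCAS"] * 10
train_idx, test_idx = stratified_split(toy_labels)

test_counts = Counter(toy_labels[i] for i in test_idx)
print(test_counts)  # 30 VLDB, 15 WWW, 3 ISCAS: each class contributes 30%
```

Without stratification, a small class such as the toy `ISCAS` group could end up almost entirely in one of the two sets.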

## Transform DataFrame to Huggingface Dataset

In [None]:
from datasets import Dataset, DatasetDict

papers_datasets = DatasetDict({
    "train": Dataset.from_pandas(train),
    "test": Dataset.from_pandas(test)
})

In [None]:
papers_datasets = papers_datasets.rename_column("Conference", "label")
papers_datasets = papers_datasets.rename_column("Title", "text")
papers_datasets = papers_datasets.rename_column("__index_level_0__", "idx")
papers_datasets = papers_datasets.class_encode_column("label")

In [None]:
papers_datasets["train"].features

{'idx': Value(dtype='int64', id=None),
 'label': ClassLabel(num_classes=5, names=['INFOCOM', 'ISCAS', 'SIGGRAPH', 'VLDB', 'WWW'], names_file=None, id=None),
 'text': Value(dtype='string', id=None)}

In [None]:
papers_datasets["train"][0]

{'idx': 2213, 'label': 2, 'text': 'True 3D display.'}
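
`class_encode_column` replaced the conference names with integer ids, so the example above carries `label: 2` rather than a name. A minimal sketch of mapping ids back to names, using the class order printed in the `features` output; the dictionaries `id2label` and `label2id` are illustrative names introduced here, not variables from the notebook.

```python
# Class names in the order reported by papers_datasets["train"].features.
label_names = ["INFOCOM", "ISCAS", "SIGGRAPH", "VLDB", "WWW"]

id2label = dict(enumerate(label_names))
label2id = {name: i for i, name in id2label.items()}

example = {"idx": 2213, "label": 2, "text": "True 3D display."}
print(id2label[example["label"]])  # SIGGRAPH
```

On the real dataset, `papers_datasets["train"].features["label"].int2str(2)` should return the same name without a hand-built mapping.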

## Tokenize

In [None]:
from transformers import AutoTokenizer

# Use the same checkpoint (v2) as the model loaded below
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-distilroberta-base-v2")

In [48]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = papers_datasets.map(tokenize_function, batched=True)

In [49]:
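# With padding="max_length", every example is padded to the model maximum
max_len = max(len(x["input_ids"]) for x in tokenized_datasets["train"])
print(f"Max length (should be 512): {max_len}")

Max length (should be 512): 512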
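
Every example is 512 tokens long because `padding="max_length"` pads to the model maximum, even though paper titles are short. The sketch below estimates how much of a batch is padding under that setting compared with padding to the longest title in the batch. The token counts are hypothetical and were not measured on this dataset.

```python
# Hypothetical token counts for one batch of 8 titles.
token_counts = [6, 12, 9, 15, 11, 8, 14, 10]
MAX_LENGTH = 512


def padded_size(lengths, target):
    """Total number of token slots when every sequence is padded to `target`."""
    return len(lengths) * target


real_tokens = sum(token_counts)
fixed = padded_size(token_counts, MAX_LENGTH)           # padding="max_length"
dynamic = padded_size(token_counts, max(token_counts))  # pad to longest in batch

print(real_tokens, fixed, dynamic)  # 85 4096 120
```

In `transformers`, `padding="longest"` in the tokenizer call or a padding collator such as `DataCollatorWithPadding` are the usual ways to get the second behaviour; whether the saving matters here depends on the real title lengths.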

In [50]:
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["test"]

In [37]:
import torch
torch.cuda.is_available()

True

<a id='trainer'></a>

In [39]:
!nvidia-smi

Mon Jun 28 03:15:30 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   57C    P0    29W /  70W |   1084MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Load Model

In [45]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "sentence-transformers/paraphrase-distilroberta-base-v2",
    num_labels=len(train.Conference.unique()),
    output_attentions=False,
    output_hidden_states=False,
)

loading configuration file https://huggingface.co/sentence-transformers/paraphrase-distilroberta-base-v2/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/9a6ba3616fce237954d0662fc5bb8c63062b4366a8ee3c8a2409eb4c3eaa87d9.59ff880ec4bbee9c546413210dc333e77481a1a523ce4697b8fbb24778c88886
Model config RobertaConfig {
  "_name_or_path": "old_models/paraphrase-distilroberta-base-v2/0_Transformer",
  "architectures": [
    "RobertaModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings"

In [51]:
from transformers import TrainingArguments

training_args = TrainingArguments("test_trainer", per_device_train_batch_size=8, per_device_eval_batch_size=8)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [52]:
from transformers import Trainer

trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=full_train_dataset, 
    eval_dataset=full_eval_dataset
)

In [53]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: idx, text.
***** Running training *****
  Num examples = 1754
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 660


Step,Training Loss
500,0.4533


Saving model checkpoint to test_trainer/checkpoint-500
Configuration saved in test_trainer/checkpoint-500/config.json
Model weights saved in test_trainer/checkpoint-500/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=660, training_loss=0.3722421039234508, metrics={'train_runtime': 273.5668, 'train_samples_per_second': 19.235, 'train_steps_per_second': 2.413, 'total_flos': 1327494921799680.0, 'train_loss': 0.3722421039234508, 'epoch': 3.0})
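
The numbers in the training log can be reproduced by hand. The sketch below recomputes the 660 optimization steps from the example count, batch size and epoch count reported above. It assumes `save_steps` and `logging_steps` were left at 500, which would explain why only step 500 was logged and checkpointed.

```python
import math

num_examples = 1754  # "Num examples" in the log
batch_size = 8       # per_device_train_batch_size, single GPU
num_epochs = 3

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Assumption: checkpoints and loss logs are written every 500 steps.
save_every = 500
checkpoint_steps = list(range(save_every, total_steps + 1, save_every))

print(steps_per_epoch, total_steps, checkpoint_steps)  # 220 660 [500]
```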

In [None]:
!ls test_trainer

checkpoint-1000  checkpoint-500  runs


In [None]:
print(model)

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerN

In [None]:
!nvidia-smi

Mon Jun 28 02:06:56 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   69C    P0    31W /  70W |   5766MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [None]:
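# Put the model in evaluation mode (disables dropout).
# Trainer.evaluate() does this itself and runs without gradient tracking,
# so this call is optional but makes the intent explicit.
model.eval()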

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=full_train_dataset,
    eval_dataset=full_eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: idx, text.
***** Running Evaluation *****
  Num examples = 753
  Batch size = 8


{'eval_accuracy': 0.8857901726427623,
 'eval_loss': 0.45794352889060974,
 'eval_runtime': 11.5177,
 'eval_samples_per_second': 65.378,
 'eval_steps_per_second': 8.248}
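
To put the accuracy figure in context, it helps to convert it back into a count and attach a rough uncertainty. The sketch below uses the normal approximation to the binomial, which assumes the 753 test titles are independent draws; it is a back-of-the-envelope check, not part of the original evaluation.

```python
import math

n = 753                        # "Num examples" in the evaluation log
accuracy = 0.8857901726427623  # eval_accuracy reported above

correct = round(accuracy * n)
std_err = math.sqrt(accuracy * (1 - accuracy) / n)
low = accuracy - 1.96 * std_err
high = accuracy + 1.96 * std_err

print(f"{correct} of {n} titles classified correctly")
print(f"approx. 95% interval: {low:.3f} to {high:.3f}")
```

Because the split is random and no seed is fixed, a rerun would likely land on a slightly different value.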

An accuracy of 88.6% on the held-out test set is strong for a five-class task, given the domain knowledge needed to tell the conferences apart from a title alone. The classifier may still fall flat on misspelt titles or on kinds of titles it has not seen during training (the one-shot problem). One way to address this is text data augmentation.
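
As a rough illustration of what character-level augmentation does, the sketch below simulates typos by swapping adjacent letters. It is a standard-library stand-in, not the `nlpaug` API, and the function name `swap_adjacent_chars` is hypothetical.

```python
import random


def swap_adjacent_chars(text, n_swaps=1, seed=None):
    """Simulate typos by swapping pairs of adjacent letters."""
    rng = random.Random(seed)
    chars = list(text)
    # Positions where both this character and the next one are letters.
    candidates = [
        i for i in range(len(chars) - 1)
        if chars[i].isalpha() and chars[i + 1].isalpha()
    ]
    for i in rng.sample(candidates, min(n_swaps, len(candidates))):
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


title = "True 3D display."
augmented = swap_adjacent_chars(title, n_swaps=2, seed=0)
print(augmented)

# The augmented title keeps the label of the original (2, i.e. SIGGRAPH).
augmented_row = {"text": augmented, "label": 2}
```

Augmented rows would be added to the training set only, so that the test set still measures performance on unmodified titles.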

In [None]:
!pip install nlpaug

Collecting nlpaug
  Downloading https://files.pythonhosted.org/packages/ec/ba/3354ed4339d775660ab018b62f9222da1060a9facfbf0aba495c89f6d96c/nlpaug-1.1.4-py3-none-any.whl (398kB)

<a id='keras'></a>