Multilingual Joint Fine-tuning of Transformer models for identifying Trolling, Aggression and Cyberbullying at TRAC 2020

Models and predictions for our submission to TRAC 2020, the Second Workshop on Trolling, Aggression and Cyberbullying.

Our trained models, as well as the evaluation metrics recorded during training, are available at https://databank.illinois.edu/datasets/IDB-8882752. We also make a few of our models available in Hugging Face's model repository at https://huggingface.co/socialmediaie/; these models can be further fine-tuned on your dataset of choice, as sketched below.
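As a concrete example, here is a minimal fine-tuning sketch using the transformers Trainer API. The checkpoint name follows the naming pattern of our released models; the toy texts, labels, and training arguments are illustrative placeholders, not our actual training setup.

import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# One of the released checkpoints (Sub-task B has two labels: GEN, NGEN)
model_name = "socialmediaie/TRAC2020_ALL_B_bert-base-multilingual-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Toy data for illustration only; replace with your own texts and label ids
texts = ["example comment one", "example comment two"]
labels = [0, 1]

class ToyDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned_model", num_train_epochs=1),
    train_dataset=ToyDataset(texts, labels),
)
trainer.train()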

Our approach is described in our paper titled:

Mishra, Sudhanshu, Shivangi Prasad, and Shubhanshu Mishra. 2020. "Multilingual Joint Fine-Tuning of Transformer Models for Identifying Trolling, Aggression and Cyberbullying at TRAC 2020." In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2020).

The source code for training this model and more details can be found on our code repository: https://github.com/socialmediaie/TRAC2020

NOTE: These models were retrained for upload after our submission, so the evaluation measures may differ slightly from those reported in the paper.

Video presentation and slides

[Model overview figure]

Slides are available as PDF and PPTX (links in the repository).

Citation

If you plan to use the dataset, please cite the following resources:

  • Mishra, Sudhanshu, Shivangi Prasad, and Shubhanshu Mishra. 2020. "Multilingual Joint Fine-Tuning of Transformer Models for Identifying Trolling, Aggression and Cyberbullying at TRAC 2020." In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2020).
  • Mishra, Sudhanshu, Shivangi Prasad, and Shubhanshu Mishra. 2020. "Trained Models for Multilingual Joint Fine-Tuning of Transformer Models for Identifying Trolling, Aggression and Cyberbullying at TRAC 2020." University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-8882752_V1.
@inproceedings{Mishra2020TRAC,
  author = {Mishra, Sudhanshu and Prasad, Shivangi and Mishra, Shubhanshu},
  booktitle = {Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2020)},
  title = {{Multilingual Joint Fine-tuning of Transformer models for identifying Trolling, Aggression and Cyberbullying at TRAC 2020}},
  year = {2020}
}

@data{illinoisdatabankIDB-8882752,
  author = {Mishra, Sudhanshu and Prasad, Shivangi and Mishra, Shubhanshu},
  doi = {10.13012/B2IDB-8882752_V1},
  publisher = {University of Illinois at Urbana-Champaign},
  title = {{Trained models for Multilingual Joint Fine-tuning of Transformer models for identifying Trolling, Aggression and Cyberbullying at TRAC 2020}},
  url = {https://doi.org/10.13012/B2IDB-8882752{\_}V1},
  year = {2020}
}

Usage

The models can be used via the code below.

A notebook with this code is included in the repository as an .ipynb file and can be opened directly in Google Colab.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from pathlib import Path
from scipy.special import softmax
import numpy as np
import pandas as pd

# Label sets for each TRAC 2020 sub-task:
# OAG = Overtly Aggressive, NAG = Non-Aggressive, CAG = Covertly Aggressive
# GEN = Gendered, NGEN = Non-Gendered
TASK_LABEL_IDS = {
    "Sub-task A": ["OAG", "NAG", "CAG"],
    "Sub-task B": ["GEN", "NGEN"],
    "Sub-task C": ["OAG-GEN", "OAG-NGEN", "NAG-GEN", "NAG-NGEN", "CAG-GEN", "CAG-NGEN"]
}

model_version = "databank"  # the other option is "huggingface", which loads from the Hugging Face model hub
if model_version == "databank":
    # Make sure you have downloaded the required model file from https://databank.illinois.edu/datasets/IDB-8882752
    # and unzipped it at some model_path (we are using: "databank_model").
    # This assumes the following type of structure inside "databank_model":
    # 'databank_model/ALL/Sub-task C/output/bert-base-multilingual-uncased/model'
    model_path = next(Path("databank_model").glob("*/*/output/*/model"))
    _, lang, task, _, base_model, _ = model_path.parts
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
else:
    lang, task, base_model = "ALL", "Sub-task C", "bert-base-multilingual-uncased"
    base_model = f"socialmediaie/TRAC2020_{lang}_{task.split()[-1]}_{base_model}"
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(base_model)

# For inference, set the model to eval mode
model.eval()
# If you want to further fine-tune the model, switch back with model.train()

task_labels = TASK_LABEL_IDS[task]

sentence = "This is a good cat and this is a bad dog."
# Prepend the CLS token manually, since convert_tokens_to_ids does not add special tokens
processed_sentence = f"{tokenizer.cls_token} {sentence}"
tokens = tokenizer.tokenize(processed_sentence)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)
tokens_tensor = torch.tensor([indexed_tokens])

with torch.no_grad():
    outputs = model(tokens_tensor)
# Indexing with [0] returns the logits for both older tuple outputs
# and newer ModelOutput objects across transformers versions
logits = outputs[0]


preds = logits.detach().cpu().numpy()
preds_probs = softmax(preds, axis=1)
preds = np.argmax(preds_probs, axis=1)
preds_labels = np.array(task_labels)[preds]
print(dict(zip(task_labels, preds_probs[0])), preds_labels)
"""You should get an output as follows:

({'CAG-GEN': 0.06762535,
  'CAG-NGEN': 0.03244293,
  'NAG-GEN': 0.6897794,
  'NAG-NGEN': 0.15498641,
  'OAG-GEN': 0.034373745,
  'OAG-NGEN': 0.020792078},
 array(['NAG-GEN'], dtype='<U8'))

"""

Visualize models in exbert

You can find the exbert link on each model's Hugging Face page. E.g., if you want to visualize socialmediaie/TRAC2020_ALL_C_bert-base-multilingual-uncased, you can use the exbert link shown on that model's page.
