# Preprocessing

I save all pre-trained Hugging Face models, tokenizers, and datasets locally in a unified format so they are easier to load later.

## Models and Tokenizers

In [2]:
from transformers import AutoTokenizer
from transformers import BertForSequenceClassification

BERT_MODEL = "bert-base-cased"

path_model = f'./Models/Pretrained/{BERT_MODEL}.pt'
path_tokenizer = f'./Tokenizers/Pretrained/{BERT_MODEL}.pt'

tokenizer = AutoTokenizer.from_pretrained(BERT_MODEL,
    do_lower_case = True ## NOTE: requests lowercasing although bert-base-cased is a cased checkpoint
    )

tokenizer.save_pretrained(path_tokenizer)

# Load BertForSequenceClassification, the pretrained BERT model with a single 
# linear classification layer on top. 
model = BertForSequenceClassification.from_pretrained(
    BERT_MODEL, # Use the 12-layer BERT model, with a cased vocab.
    num_labels = 2, # The number of output labels--2 for binary classification.
                    # You can increase this for multi-class tasks.   
    output_attentions = False, # Whether the model returns attention weights.
    output_hidden_states = False, # Whether the model returns all hidden states.
)

model.save_pretrained(path_model)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model

In [36]:
from transformers import AutoTokenizer
from transformers import RobertaForSequenceClassification

BASE_MODEL = "roberta-base"

path_model = f'./Models/Pretrained/{BASE_MODEL}.pt'
path_tokenizer = f'./Tokenizers/Pretrained/{BASE_MODEL}.pt'

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL,
    do_lower_case = True ## Copied from the BERT cell; roberta-base has a cased vocab
    )

tokenizer.save_pretrained(path_tokenizer)

# Load RobertaForSequenceClassification, the pretrained RoBERTa model with a 
# classification head on top. 
model = RobertaForSequenceClassification.from_pretrained(
    BASE_MODEL, # Use the 12-layer roberta-base checkpoint.
    num_labels = 2, # The number of output labels--2 for binary classification.
                    # You can increase this for multi-class tasks.   
    output_attentions = False, # Whether the model returns attention weights.
    output_hidden_states = False, # Whether the model returns all hidden states.
)

model.save_pretrained(path_model)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.weight']
You should pr

In [7]:
from transformers import AutoTokenizer
from transformers import ElectraForSequenceClassification

BASE_MODEL = "google/electra-base-discriminator"
BASE_MODEL_OUT = "electra-base"

path_model = f'./Models/Pretrained/{BASE_MODEL_OUT}.pt'
path_tokenizer = f'./Tokenizers/Pretrained/{BASE_MODEL_OUT}.pt'

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL,
    do_lower_case = True ## ELECTRA uses an uncased vocab
    )

tokenizer.save_pretrained(path_tokenizer)

# Load ElectraForSequenceClassification, the pretrained ELECTRA discriminator with a 
# classification head on top. 
model = ElectraForSequenceClassification.from_pretrained(
    BASE_MODEL, # Use the 12-layer ELECTRA base discriminator, with an uncased vocab.
    num_labels = 2, # The number of output labels--2 for binary classification.
                    # You can increase this for multi-class tasks.   
    output_attentions = False, # Whether the model returns attention weights.
    output_hidden_states = False, # Whether the model returns all hidden states.
)

model.save_pretrained(path_model)

Some weights of the model checkpoint at google/electra-base-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-base-discriminator and are newly initialized: ['classifier.d
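
Hub ids such as `google/electra-base-discriminator` contain a slash, which is presumably why the cell above introduces a separate `BASE_MODEL_OUT` for the local name. A minimal sketch of deriving the local paths automatically, assuming a hypothetical helper `local_paths` that is not part of this notebook. Unlike the hand-written `electra-base`, it keeps the full checkpoint name after the organisation prefix:

```python
from pathlib import Path


def local_paths(hub_id: str, root: str = ".") -> dict:
    """Return local save locations for a Hub checkpoint id (hypothetical helper)."""
    # Drop the organisation prefix so the name is a single path component,
    # e.g. "google/electra-base-discriminator" -> "electra-base-discriminator".
    short_name = hub_id.split("/")[-1]
    base = Path(root)
    return {
        "model": base / "Models" / "Pretrained" / f"{short_name}.pt",
        "tokenizer": base / "Tokenizers" / "Pretrained" / f"{short_name}.pt",
    }


paths = local_paths("google/electra-base-discriminator")
print(paths["model"])
print(paths["tokenizer"])
```

Since `save_pretrained` writes a directory rather than a single file, the `.pt` suffix here is only a naming convention carried over from the cells above.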

In [5]:
from transformers import AutoTokenizer
from transformers import GPT2ForSequenceClassification
from transformers import GPT2Config


BASE_MODEL = "gpt2"

path_model = f'./Models/Pretrained/{BASE_MODEL}.pt'
path_tokenizer = f'./Tokenizers/Pretrained/{BASE_MODEL}.pt'
path_config = f'./Config/Pretrained/{BASE_MODEL}.pt'


tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL,
    do_lower_case = False
    )


tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'


# ## See info about adding tokens here: https://www.depends-on-the-definition.com/how-to-add-new-tokens-to-huggingface-transformers/
# SPECIAL_TOKENS = {
#     "pad_token": "[PAD]", # Does not originally come with padding token
#     #"additional_special_tokens": ["[SYS]", "[USR]", "[KG]", "[SUB]", "[PRED]", "[OBJ]", "[TRIPLE]", "[SEP]", "[Q]","[DOM]", 'frankie_and_bennys', 'cb17dy']
# }
# tokenizer.add_special_tokens(SPECIAL_TOKENS)

# tokenizer.padding_side = 'left'

# tokenizer.save_pretrained(path_tokenizer)

model = GPT2ForSequenceClassification.from_pretrained(
    BASE_MODEL, # Use the 12-layer GPT-2 (small) checkpoint.
    num_labels = 2, # The number of output labels--2 for binary classification.
                    # You can increase this for multi-class tasks.   
    output_attentions = False, # Whether the model returns attention weights.
    output_hidden_states = False, # Whether the model returns all hidden states.
)
# Keep the embedding matrix in sync with the tokenizer. No tokens were added above
# (the pad token reuses EOS), so the vocabulary size stays the same.
model.resize_token_embeddings(len(tokenizer)) # must match vocabulary size
model.config.pad_token_id = tokenizer.pad_token_id # the model needs a pad token id for batched inputs

# model.save_pretrained(path_model)


# config = GPT2Config(BASE_MODEL, num_labels=2)
# config.save_pretrained(path_config)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
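
The GPT-2 cell above reuses the EOS token as the pad token and pads on the left. A minimal plain-Python sketch of what that does to a batch, assuming made-up token ids and a hypothetical `pad_left` helper (the tokenizer does this internally when called with padding enabled):

```python
def pad_left(batch, pad_id):
    """Left-pad lists of token ids to a common length and build attention masks."""
    width = max(len(ids) for ids in batch)
    padded, masks = [], []
    for ids in batch:
        n_pad = width - len(ids)
        padded.append([pad_id] * n_pad + list(ids))   # padding goes in front
        masks.append([0] * n_pad + [1] * len(ids))    # 0 marks padded positions
    return padded, masks


EOS_ID = 50256  # GPT-2's end-of-text id, reused as the pad id as in the cell above

input_ids, attention_mask = pad_left([[11, 12, 13], [21]], pad_id=EOS_ID)
print(input_ids)
print(attention_mask)
```

With left padding the last position of every row holds a real token, which is the property a decoder-only model relies on when the final position is used for prediction.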


In [8]:
model

GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=2, bias=False)
)

In [25]:
BASE_MODEL = "roberta-base"

# Load RobertaForSequenceClassification, the pretrained RoBERTa model with a 
# classification head on top. 
model2 = RobertaForSequenceClassification.from_pretrained(
    BASE_MODEL, # Use the 12-layer roberta-base checkpoint.
    num_labels = 2, # The number of output labels--2 for binary classification.
                    # You can increase this for multi-class tasks.   
    output_attentions = False, # Whether the model returns attention weights.
    output_hidden_states = False, # Whether the model returns all hidden states.
)

tokenizer2 = AutoTokenizer.from_pretrained(BASE_MODEL,
    do_lower_case = True ## Copied from the BERT cell; roberta-base has a cased vocab
    )



Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.weight']
You should pr

In [10]:
from transformers import AutoTokenizer
from transformers import OPTForSequenceClassification
from transformers import OPTConfig
import torch


BASE_MODEL = "facebook/opt-350m"
BASE_MODEL_OUT = "opt"

# path_model = f'./Models/Pretrained/{BASE_MODEL_OUT}.pt'
# path_tokenizer = f'./Tokenizers/Pretrained/{BASE_MODEL_OUT}.pt'
# path_config = f'./Config/Pretrained/{BASE_MODEL_OUT}.pt'


tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL,
    do_lower_case = False
    )


tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'


# ## See info about adding tokens here: https://www.depends-on-the-definition.com/how-to-add-new-tokens-to-huggingface-transformers/
# SPECIAL_TOKENS = {
#     "pad_token": "[PAD]", # Does not originally come with padding token
#     #"additional_special_tokens": ["[SYS]", "[USR]", "[KG]", "[SUB]", "[PRED]", "[OBJ]", "[TRIPLE]", "[SEP]", "[Q]","[DOM]", 'frankie_and_bennys', 'cb17dy']
# }
# tokenizer.add_special_tokens(SPECIAL_TOKENS)

# tokenizer.padding_side = 'left'

# tokenizer.save_pretrained(path_tokenizer)

# model = OPTForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.float16, attn_implementation="flash_attention_2")


model = OPTForSequenceClassification.from_pretrained(
    BASE_MODEL, # Use the OPT-350M checkpoint.
    num_labels = 2, # The number of output labels--2 for binary classification.
                    # You can increase this for multi-class tasks.   
    output_attentions = False, # Whether the model returns attention weights.
    output_hidden_states = False, # Whether the model returns all hidden states.
    # torch_dtype=torch.float16, attn_implementation="flash_attention_2" ### Speeds up inference
)
# # add new, random embeddings for the new tokens
# model.resize_token_embeddings(len(tokenizer)) # must match vocabulary size
# model.config.pad_token_id = tokenizer.pad_token_id # must ensure model has pad_token

# model.save_pretrained(path_model)


# config = GPT2Config(BASE_MODEL, num_labels=2)
# config.save_pretrained(path_config)

Some weights of OPTForSequenceClassification were not initialized from the model checkpoint at facebook/opt-350m and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
model.model.decoder?

Signature:      model.model.decoder(*args, **kwargs)
Type:           OPTDecoder
String form:   
OPTDecoder(
           (embed_tokens): Embedding(50272, 512, padding_idx=1)
           (embed_positions): OPTLearne <...> rue)
           (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
           )
           )
           )
File:           ~/anaconda3/envs/noise-paper-flashattn/lib/python3.9/site-packages/transformers/models/opt/modeling_opt.py
Docstring:     
Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`OPTDecoderLayer`]

Args:
    config: OPTConfig
Init docstring: Initializes internal Module state, shared by both nn.Module and ScriptModule.

## SemEval-2013 Task 2

In [8]:
import pandas as pd
from datasets import Dataset, DatasetDict



In [9]:
category_codes = {'positive':0, 'negative': 1} ## Ensure same mapping between all

train = pd.read_csv("Data/SemEval/train/allTrainingData.tsv", sep='\t', header=None)
train.columns = ['id', 'user', 'text_label', 'text']
train.index.name='index'

## Open question: do we want neutral examples?
## For now they are excluded; only positive and negative tweets are kept.

train = train[train.text_label.isin(['positive', 'negative'])]
train['label'] = train.text_label.map(category_codes)

train = Dataset.from_pandas(train[['text','label']])

dev = pd.read_csv("Data/SemEval/dev/twitdata_dev.tsv", sep='\t', header=None)
dev.columns = ['id', 'user', 'text_label', 'text']
dev.index.name='index'

## Open question: do we want neutral examples?
## For now they are excluded; only positive and negative tweets are kept.

dev = dev[dev.text_label.isin(['positive', 'negative'])]
dev['label'] = dev.text_label.map(category_codes)

dev = Dataset.from_pandas(dev[['text','label']])

test = pd.read_csv("Data/SemEval/test/gold/twitdata_TEST.tsv", sep='\t', header=None, on_bad_lines='skip')
test.columns = ['id', 'user', 'text_label', 'text']
test.index.name='index'

test = test[test.text_label.isin(['positive', 'negative'])]
test['label'] = test.text_label.map(category_codes)

test = Dataset.from_pandas(test[['text','label']])

full_dataset = DatasetDict({
    'train': train,
    'test': test,
    'validation': dev})

path_dataset = "./Data/SemEval"

full_dataset.cache_files

full_dataset.save_to_disk(path_dataset)
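
The three splits above go through the same steps: read four tab-separated columns, keep only positive and negative rows, and map the text label to an integer. A minimal plain-Python sketch of that shared logic, assuming a hypothetical `read_split` helper and made-up sample rows (the notebook itself uses pandas):

```python
import csv
import io

CATEGORY_CODES = {"positive": 0, "negative": 1}  # same mapping as in the cell above


def read_split(handle):
    """Read id/user/label/text rows and keep only positive and negative tweets."""
    rows = []
    for record in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(record) != 4:
            continue  # skip malformed lines, similar in spirit to on_bad_lines='skip'
        _id, _user, text_label, text = record
        if text_label in CATEGORY_CODES:  # neutral and other labels are dropped
            rows.append({"text": text, "label": CATEGORY_CODES[text_label]})
    return rows


sample = io.StringIO(
    "1\tu1\tpositive\tgreat game tonight\n"
    "2\tu2\tneutral\tthe match is at 8\n"
    "3\tu3\tnegative\tworst service ever\n"
)
rows = read_split(sample)
print(rows)
```

Factoring the steps into one function keeps the label mapping identical across train, dev, and test.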

                                                                                               

In [10]:
import pandas as pd
from datasets import Dataset, DatasetDict
import numpy as np

category_codes = {'positive':0, 'negative': 1} ## Ensure same mapping between all

lowlevel_anno = pd.read_csv("Data/SemEval/test/gold/twitter-test-gold-A.tsv", sep='\t', header=None, on_bad_lines='skip')
# lowlevel_anno.columns = ['id', '_', 'start_span', 'end_span', 'span_annotation']
test_text = pd.read_csv("Data/SemEval/test/gold/twitdata_TEST.tsv", sep='\t', header=None, on_bad_lines='skip')
# test_text.columns = [['id', '_', 'label', 'text']]

all_anno = lowlevel_anno[[0,2,3,4]].merge(test_text[[0,2,3]], how='inner', on=0)
all_anno.columns = ['id', 'start_span', 'end_span', 'span_annotation', 'label', 'text']

# One zero per whitespace token; each span covers the inclusive token range start_span..end_span
all_anno['annotations'] = all_anno.text.map(lambda x: np.zeros(len(x.split())))
all_anno['indices'] = all_anno.apply(lambda x: np.arange(x.start_span, x.end_span+1), axis=1)

df = all_anno[['id', 'span_annotation', 'label', 'text', 'annotations', 'indices']].groupby(['id', 'label', 'text']).agg(({
        'annotations': lambda x: x.tolist()[0],
        'span_annotation': lambda x: x.tolist(),
        'indices': lambda x: x.tolist()}
                                           ))
df = df.reset_index()

def update_annos(annotations, span_labels, indices, text):
    for i,span in enumerate(indices):
        for j in span:
            try:
                if span_labels[i] == 'positive':
                    annotations[j] = 1
                elif span_labels[i] == 'negative':
                    annotations[j] = -1
            except IndexError:
                # print(text)
                return np.nan
                
    return annotations

df['annotations'] = df.apply(lambda x: update_annos(x.annotations, x.span_annotation, x.indices, x.text), axis=1)

df['label'] = df['label'].map(category_codes)
df.index = df['id']
df.index.name = 'index'
out_df = df[['label', 'text', 'annotations']].dropna()
test = Dataset.from_pandas(out_df[['label', 'text', 'annotations']])
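
The cell above turns span-level sentiment annotations into one value per whitespace token: 1 for tokens in a positive span, -1 for a negative span, 0 otherwise. A minimal plain-Python sketch of the same idea, assuming a hypothetical `spans_to_vector` helper and a made-up sentence. Unlike `update_annos`, it checks the span bounds up front instead of catching `IndexError`:

```python
def spans_to_vector(text, spans):
    """Build a per-token sentiment vector.

    spans: iterable of (start, end, label) with inclusive whitespace-token indices.
    Returns None when a span does not fit the tokenization.
    """
    n_tokens = len(text.split())
    vector = [0] * n_tokens
    sign = {"positive": 1, "negative": -1}
    for start, end, label in spans:
        if start < 0 or end >= n_tokens:
            return None  # caller can drop the row, as dropna() does above
        for j in range(start, end + 1):
            # labels other than positive/negative leave the token unchanged
            vector[j] = sign.get(label, vector[j])
    return vector


vector = spans_to_vector(
    "i love this awful phone",
    [(1, 1, "positive"), (3, 3, "negative")],
)
print(vector)
```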



In [11]:
test

Dataset({
    features: ['label', 'text', 'annotations', 'index'],
    num_rows: 1659
})

In [12]:
full_dataset = DatasetDict({
    'test': test,})

full_dataset

DatasetDict({
    test: Dataset({
        features: ['label', 'text', 'annotations', 'index'],
        num_rows: 1659
    })
})

In [13]:
path_dataset = "./Data/Clean/SemEval"

full_dataset.cache_files

full_dataset.save_to_disk(path_dataset)

                                                                                              

## SST-2

In [1]:
from datasets import Dataset, load_dataset

path_dataset = "./Data/SST-2"

dataset = load_dataset("sst2")
dataset.cache_files

dataset.save_to_disk(path_dataset)

Found cached dataset sst2 (/home/fvd442/.cache/huggingface/datasets/sst2/default/2.0.0/9896208a8d85db057ac50c72282bcb8fe755accc671a57dd8059d4e130961ed5)
100%|██████████| 3/3 [00:00<00:00, 181.69it/s]
                                                                                                 

In [1]:
from datasets import load_from_disk

path_dataset = "./Data/SST-2"
dataset = load_from_disk(path_dataset)
print(dataset)



DatasetDict({
    train: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 872
    })
    test: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 1821
    })
})


In [8]:
### Determine max length

lens = [len(x.split()) for x in dataset['train']['sentence']]
max(lens)

52

### Get subset of Hummingbird in SST-2

In [12]:
import pandas as pd
from datasets import load_from_disk, Dataset, DatasetDict


DATA = "SST-2"

### Load SST test data
SST_dataset = load_from_disk(f"./Data/{DATA}")['validation'] #.map(tokenize_function, batched=True) #.map(reference_ids)

In [13]:
SST_dataset[:5] ### Positive = 1, Negative = 0

{'idx': [0, 1, 2, 3, 4],
 'sentence': ["it 's a charming and often affecting journey . ",
  'unflinchingly bleak and desperate ',
  'allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker . ',
  "the acting , costumes , music , cinematography and sound are all astounding given the production 's austere locales . ",
  "it 's slow -- very , very slow . "],
 'label': [1, 0, 1, 1, 0]}

In [14]:
hum_dataset = pd.read_csv("./Data/Hummingbird/sentiment.tsv", sep='\t')
hum_dataset.head()

# Positive = 0, Negative = 1?

Unnamed: 0,human_label,avg_label_score,orig_text,processed_text,perception_scores
0,0.0,1.0,"An artful , intelligent film that stays within...",an artful intelligent film that stays within t...,0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0....
1,1.0,1.0,"Utterly lacking in charm , wit and invention ,...",utterly lacking in charm wit and invention rob...,-0.3333 -1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0...
2,0.5,1.0,"Generally, I'd anticipate a code bug. Can you ...",generally i'd anticipate a code bug can you pa...,0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0....
3,0.0,1.0,Thanks very much indeed! :-) I don't suppose y...,thanks very much indeed :-) i don't suppose yo...,0.6667 0.0 0.0 0.6667 0.0 0.0 0.0 0.0 0.0 0.66...
4,1.0,1.0,"Don't ever kiss the other girl in a 3some, don...",don't ever kiss the other girl in a 3some don'...,-0.3333 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -0.3333 -0...


In [15]:
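hum_sentences = hum_dataset.orig_text.map(lambda x: x.lower() + ' ').to_list()

# Build the lookup set once instead of re-reading the column on every iteration
sst_sentences = set(SST_dataset['sentence'])

# True where the lower-cased Hummingbird sentence also occurs in the SST-2 validation split
mask = [h in sst_sentences for h in hum_sentences]

print(sum(mask))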

67


In [19]:
masked = hum_dataset[mask].copy() # .copy() so the assignments below do not trigger SettingWithCopyWarning
masked.index.name = "index"
masked.columns = ['label', 'x', 'sentence', 'text', 'annotations'] # match column names to the ones I chose for SemEval
masked['sentence'] = masked['sentence'].map(lambda x: x.lower() + ' ') ## To match SST dataset

### Labels appear to be reversed between SST and Hummingbird, so swap 0 and 1
category_codes = {1:0, 0: 1}
masked['label'] = masked['label'].map(category_codes)
masked.dropna(inplace=True) # Neutral labels (e.g. 0.5) are not in the mapping and become NaN. Just drop them.

mask_final = Dataset.from_pandas(masked[['label', 'text', 'annotations']]) ### only use processed sentence as we need that for annotations
path_dataset = "./Data/Interim/Hummingbird"

# mask_final.cache_files
# mask_final.save_to_disk(path_dataset)


In [20]:
dataset = DatasetDict({
    'test': mask_final})

dataset

DatasetDict({
    test: Dataset({
        features: ['label', 'text', 'annotations', 'index'],
        num_rows: 63
    })
})
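
Of the 67 sentences matched earlier, 63 remain in this split, which would be consistent with four rows being removed by the `dropna` step. A minimal plain-Python sketch of the flip-and-drop logic, assuming made-up rows; `FLIP` mirrors `category_codes` from the cell above:

```python
# Hypothetical rows: (human_label, processed_text)
rows = [
    (0.0, "an artful film"),
    (1.0, "utterly lacking in charm"),
    (0.5, "a code bug"),  # neutral label, not covered by the mapping
]

FLIP = {1.0: 0, 0.0: 1}  # swap the two classes

# Keep only rows whose label is in the mapping, then apply the swap
kept = [{"label": FLIP[label], "text": text} for label, text in rows if label in FLIP]
dropped = len(rows) - len(kept)

print(kept)
print(dropped)
```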

In [21]:
dataset.cache_files
dataset.save_to_disk(path_dataset)

                                                                                         

## HateXplain

In [1]:
from datasets import Dataset, load_dataset, DatasetDict
import pandas as pd
from statistics import mode
import numpy as np

# 0 Hatespeech, 1 Normal, 2 Offensive (label order of the Hugging Face `hatexplain` dataset)
dataset = load_dataset("hatexplain")
dataset

Found cached dataset hatexplain (/home/fvd442/.cache/huggingface/datasets/hatexplain/plain_text/1.0.0/df474d8d8667d89ef30649bf66e9c856ad8305bef4bc147e8e31cbdf1b8e0249)
100%|██████████| 3/3 [00:00<00:00, 80.14it/s]


DatasetDict({
    train: Dataset({
        features: ['id', 'annotators', 'rationales', 'post_tokens'],
        num_rows: 15383
    })
    validation: Dataset({
        features: ['id', 'annotators', 'rationales', 'post_tokens'],
        num_rows: 1922
    })
    test: Dataset({
        features: ['id', 'annotators', 'rationales', 'post_tokens'],
        num_rows: 1924
    })
})

In [2]:
train = pd.DataFrame(dataset["train"])
train["text"] = train["post_tokens"].map(lambda example: ' '.join(example))

annos = pd.DataFrame.from_dict(dataset["train"]['annotators'])
annos["label"] = annos["label"].map(lambda x: mode(x))

out = train[['text']].merge(annos["label"], left_index=True, right_index=True).reset_index()

train = Dataset.from_pandas(out)

validation = pd.DataFrame(dataset["validation"])
validation["text"] = validation["post_tokens"].map(lambda example: ' '.join(example))

annos = pd.DataFrame.from_dict(dataset["validation"]['annotators'])
annos["label"] = annos["label"].map(lambda x: mode(x))

out = validation[['text']].merge(annos["label"], left_index=True, right_index=True).reset_index()
validation = Dataset.from_pandas(out)

validation

Dataset({
    features: ['index', 'text', 'label'],
    num_rows: 1922
})
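
The cells above collapse each post's annotator labels into one label with `statistics.mode`. On Python 3.8 and later, `mode` returns the first of several equally common values instead of raising, so a full disagreement silently resolves to the first annotator's label. A minimal sketch that makes ties explicit, assuming a hypothetical `majority_label` helper that is not part of this notebook:

```python
from collections import Counter
from statistics import mode


def majority_label(labels):
    """Return the label with a strict majority of votes, or None on a tie."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes > len(labels) / 2 else None


clear_vote = [0, 0, 2]   # two of three annotators agree
split_vote = [2, 0, 1]   # every annotator disagrees

print(mode(clear_vote), majority_label(clear_vote))
print(mode(split_vote), majority_label(split_vote))
```

Whether ties occur at all depends on the dataset release; the sketch only shows how to detect them.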

In [3]:
# dataset = load_dataset("hatexplain")["train"]

# max_l = 0

# for p in dataset["post_tokens"]:
#     if len(p)> max_l:
#         max_l = len(p)
        
# max_l


In [3]:
def merge_annotations(r, p):
    try:
        if len(r) > 1:
            return np.mean(r, axis=0)
        else:
            return  [0.0]*len(p)
    except ValueError:
        return np.nan
    
    
test = pd.DataFrame(dataset["test"])
test["text"] = test["post_tokens"].map(lambda example: ' '.join(example))
test["annotations"] = test.apply(lambda x: merge_annotations(x['rationales'], x['post_tokens']), axis=1)

annos = pd.DataFrame.from_dict(dataset["test"]['annotators'])
annos["label"] = annos["label"].map(lambda x: mode(x))

out = test[['text', 'annotations']].merge(annos["label"], left_index=True, right_index=True).reset_index()

test = Dataset.from_pandas(out)
test

Dataset({
    features: ['index', 'text', 'annotations', 'label'],
    num_rows: 1924
})
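
`merge_annotations` averages the per-annotator rationale masks token by token and falls back to all zeros when there is at most one rationale. A minimal plain-Python sketch of the same rule, assuming a hypothetical `average_rationales` helper and made-up masks. It returns `None` for ragged input, where the notebook returns `np.nan`:

```python
def average_rationales(rationales, n_tokens):
    """Average binary rationale masks position by position."""
    if len(rationales) <= 1:  # mirrors the `len(r) > 1` check in the cell above
        return [0.0] * n_tokens
    if any(len(r) != len(rationales[0]) for r in rationales):
        return None  # masks of different lengths cannot be averaged
    return [sum(column) / len(rationales) for column in zip(*rationales)]


averaged = average_rationales([[1, 0, 0], [1, 1, 0]], n_tokens=3)
print(averaged)
```

Note that under this rule a post with exactly one rationale gets an all-zero vector, so that single rationale is ignored.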

In [10]:
full_dataset = DatasetDict({
    'train': train,
    'test': test,
    'validation': validation})

path_dataset = "./Data/hatexplain"

full_dataset.cache_files

full_dataset.save_to_disk(path_dataset)

                                                                                                

In [11]:
# clean_out = out[out['label'] != 1]

# test = Dataset.from_pandas(clean_out)
# path_dataset = "./Data/Clean/hatexplain"

# test.cache_files

# test.save_to_disk(path_dataset)

                                                                                              



In [24]:
clean_out = out[out['label'] != 1].copy() # drop the "normal" class; .copy() makes the edits below act on an independent frame

#### Corrupted at this datapoint
clean_out.at[331,'text'] = 'i forgot all about my steak in the damn sink 🤦 ion even wanna cook the bitch now'
clean_out.at[331,'annotations'] =  [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]


clean_out.at[1828,'text'] = 'i get on twitter just talk shit about you cause u a scary ass hoe 😂 amp i want your head hoe 🤷 but im off this shit good night 😘 ❗'
clean_out.at[1828,'annotations'] = [0. , 0. , 0. , 0. , 0. , 0. , 0.5, 0. , 0. , 0. , 0. , 0. , 0. ,
       1. , 1. , 0. , 0. , 0.5, 0.5, 0.5, 0.5, 1. , 0. , 0. , 0. , 0. ,
       0. , 0. , 0. , 0. , 0. , 0.]

clean_out.at[932,'text'] = 'peter kopteros peter kopteros <number> hours ago i sure hope that nappy headed lowlife never gets to play professional football ever again unless it in the ghetto they grew up in lol'
clean_out.at[932,'annotations'] =  clean_out.loc[932]['annotations'][:-1]

clean_out.at[407,'text'] = 'am not trying to be racist some of them mmg rude especially arabs mabuk tu takyah cerita la dah mabuk blh pulak drive 🤦'
clean_out.at[407,'annotations'] =  clean_out.loc[407]['annotations'][:-3]

clean_out.at[304,'text'] = 'recordbreakerr recordbreakerrrrrrrrrrrrrr5 hours ago this was a racist killing against a defenseless white woman the shooter was a somalian muzzie mandingo rapefugee who only got hired because of the affirmative action'
clean_out.at[304,'annotations'] =  clean_out.loc[304]['annotations'][:-1]

In [25]:
test = Dataset.from_pandas(clean_out)
test = test.map(remove_columns=['__index_level_0__'])

path_dataset = "./Data/Clean/hatexplain"

full_dataset = DatasetDict({
    'test': test})

full_dataset.cache_files

full_dataset.save_to_disk(path_dataset)

                                                                                             

In [15]:
from datasets import load_from_disk
test = load_from_disk('Data/Noise/hatexplain')
for i, idx in enumerate(test['index']):
    if idx == 407:
        break
    
text_label = 'text'
print(test[text_label][i], len(test[text_label][i].split()))
text_label = 'random_synonym_25'
print(test[text_label][i], len(test[text_label][i].split()))
text_label = 'annotations'
print(test[text_label][i], len(test[text_label][i]))


am not trying to be racist some of them mmg rude especially arabs mabuk tu takyah cerita la dah mabuk blh pulak drive 🤦 24
am not trying to constitute racist some of them mmg unmannered especially arabs mabuk tu takyah cerita la dah mabuk blh pulak drive 🤦 24
[0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 24


In [16]:
## Identify invalid characters

clean_out = load_from_disk("Data/Clean/hatexplain/")['test']
for i,t in enumerate(clean_out['text']):
    if '\u200d' in t or '\u200f' in t or " \U0001f3fd" in t or " \U0001f3fe" in t or " \ufe0f" in t:
        print(i, t)

80 <user> ‚Äç daily reminder that nick the spic is a faggot who gets off on catboy porn pathetic and sad
109 <user> lets the degenerate kike <user> ‚Äç fuck him in the ass and cries piteously during the torrid act but deep down he very much aroused at the idea of a filthy jew dominating him that a real radical agenda next week tune in for the threesome with paul nehlen
395 <user> ‚Äç i agree the catholic church loves their muzzie infiltration at any cost they give and give and give to the sand niggers at the cost of white catholics shame on them


In [49]:
clean_out = load_from_disk("Data/Clean/esnli/")['test']
len(clean_out)

9824

In [32]:
print(df.loc[395, "annotations"])

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]


In [33]:
df = pd.DataFrame(clean_out)
df.at[80, "text"] = "<user> daily reminder that nick the spic is a faggot who gets off on catboy porn pathetic and sad"
df.at[80, "annotations"] = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.3333333333333333, 0.3333333333333333, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

df.at[109, "text"] = "<user> lets the degenerate kike <user> fuck him in the ass and cries piteously during the torrid act but deep down he very much aroused at the idea of a filthy jew dominating him that a real radical agenda next week tune in for the threesome with paul nehlen"
df.at[109, "annotations"] = [0.0, 0.0, 0.0, 0.3333333333333333, 1.0, 0.0, 0.3333333333333333, 0.3333333333333333, 0.0, 0.0, 0.3333333333333333, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6666666666666666, 0.6666666666666666, 0.3333333333333333, 0.3333333333333333, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

df.at[395, "text"] = "<user> i agree the catholic church loves their muzzie infiltration at any cost they give and give and give to the sand niggers at the cost of white catholics shame on them"
df.at[395, "annotations"] = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Unnamed: 0,index,text,annotations,label
80,141,<user> daily reminder that nick the spic is a ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.33333333...",0
109,198,<user> lets the degenerate kike <user> fuck hi...,"[0.0, 0.0, 0.0, 0.3333333333333333, 1.0, 0.0, ...",0
395,688,<user> i agree the catholic church loves their...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, ...",0


In [36]:
for i, row in df.iterrows():
    if len(row["text"].split()) != len(row["annotations"]):
        print(i)

In [43]:
from datasets import Dataset, load_dataset, DatasetDict
import pandas as pd
from statistics import mode
import numpy as np

test = Dataset.from_pandas(df.loc[[80,109,395]])
test = test.map(remove_columns=['__index_level_0__'])

path_dataset = "./Data/Clean/new_hatexplain"

full_dataset = DatasetDict({
    'test': test})

full_dataset.cache_files

full_dataset.save_to_disk(path_dataset)

                                                                                      

### Hatexplain_extra

In [17]:
from datasets import load_from_disk, Dataset, DatasetDict
import pandas as pd

test = load_from_disk("Data/hatexplain")['test']

df = pd.DataFrame(test)
df.set_index("index", inplace=True)
df_out = df[df.label == 1].copy()  # explicit copy, so the assignment below does not trigger SettingWithCopyWarning
# Strip the zero-width joiner, right-to-left mark, emoji skin-tone modifiers and variation selectors
for junk in ["\u200d ", "\u200f ", " \U0001F3FD", " \U0001F3FE", " \ufe0f ", " \ufe0f"]:
    df_out["text"] = df_out["text"].str.replace(junk, "", regex=False)


In [18]:
# \u200d # \ufeff

test = Dataset.from_pandas(df_out)

path_dataset = "./Data/Clean/hatexplain_neutral"

full_dataset = DatasetDict({
    'test': test})

full_dataset.cache_files

full_dataset.save_to_disk(path_dataset)

                                                                                            

In [19]:
# test = load_from_disk("./Data/Clean/hatexplain_neutral")['test']

# for i, idx in enumerate(test["index"]):
#     if idx == 959:
#         break
# test["text"][i]

'breaking911 verified account <user> 2 4 h24 hours ago more flashback video michigan democrat rep john conyers allegedly reading playboy magazine on a packed airplane in <number> allegations of sexual harassment have surfaced tonight'

In [15]:
from datasets import load_from_disk
import pandas as pd
test = load_from_disk("./Data/Clean/hatexplain_neutral")['test']
df = pd.DataFrame(test)
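# NOTE: "\uffef" may be a typo for the byte-order mark "\ufeff" mentioned in an earlier cell;
# as written, this searches for U+FFEF
df[df['text'].str.contains("\uffef")]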

Unnamed: 0,text,annotations,label,index
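
A more general way to look for leftover invisible characters is to test for the Unicode "format" category (`Cf`), which covers U+200D, U+200F and U+FEFF. This is only a sketch: `has_format_chars` is a hypothetical helper and the sample strings are made up, not taken from the dataset.

```python
import unicodedata

def has_format_chars(text):
    """True if the text contains any Unicode 'Cf' (format) character."""
    return any(unicodedata.category(ch) == "Cf" for ch in text)

# Hypothetical toy strings, not taken from the dataset
samples = ["plain text", "zero\u200dwidth joiner", "bom\ufeff here"]

# Positions of the strings that still contain an invisible format character
leftover = [i for i, s in enumerate(samples) if has_format_chars(s)]
print(leftover)
```

On the DataFrame above, the same check would be `df[df["text"].map(has_format_chars)]`. Emoji skin-tone modifiers and the variation selector U+FE0F fall in other categories, so they would need a separate test.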


## eSNLI

In [1]:
from datasets import Dataset, load_dataset, DatasetDict
import pandas as pd
from statistics import mode
import numpy as np

# 0 entailment 1 neutral 2 contradiction
dataset = load_dataset("esnli")
dataset

  from .autonotebook import tqdm as notebook_tqdm


KeyboardInterrupt: 

In [14]:
def to_clean_split(split):
    """Drop the explanation columns and expose the row number as an `index` column."""
    frame = pd.DataFrame(dataset[split])
    frame.drop(columns=['explanation_1', 'explanation_2', 'explanation_3'], inplace=True)
    frame.reset_index(inplace=True)
    frame.columns = ["index", "text_1", "text_2", "label"]
    return Dataset.from_pandas(frame)

train = to_clean_split("train")
validation = to_clean_split("validation")
test = to_clean_split("test")

full_dataset = DatasetDict({
    'train': train,
    'test': test,
    'validation': validation})

path_dataset = "./Data/esnli"

full_dataset.cache_files

full_dataset.save_to_disk(path_dataset)

                                                                                                   

In [15]:
# Longest premise and longest hypothesis, counted in whitespace-separated words
max_1 = max(len(t.split()) for t in train["text_1"])
max_2 = max(len(t.split()) for t in train["text_2"])

max_1 + max_2
# Make max 254, is_lower = False

134

In [16]:
real_test = pd.read_csv( "./Data/esnli/esnli_test.csv")
real_test.columns

Index(['pairID', 'gold_label', 'Sentence1', 'Sentence2', 'Explanation_1',
       'Sentence1_marked_1', 'Sentence2_marked_1', 'Sentence1_Highlighted_1',
       'Sentence2_Highlighted_1', 'Explanation_2', 'Sentence1_marked_2',
       'Sentence2_marked_2', 'Sentence1_Highlighted_2',
       'Sentence2_Highlighted_2', 'Explanation_3', 'Sentence1_marked_3',
       'Sentence2_marked_3', 'Sentence1_Highlighted_3',
       'Sentence2_Highlighted_3'],
      dtype='object')

In [17]:
category_codes = {  'entailment': 0,
                    'neutral': 1,
                    'contradiction':2}

real_test["label"] = real_test["gold_label"].map(category_codes)
real_test.rename(columns={'Sentence1': 'text_1', 'Sentence2': "text_2"}, inplace=True)
# Take an explicit copy so the column assignments below do not trigger SettingWithCopyWarning
clean_test = real_test[["label", "text_1", "text_2"] + [c for c in real_test.columns if 'marked' in c]].copy()

# A word containing '*' was highlighted by the annotator: 1 if highlighted, else 0
for sent in (1, 2):
    for annotator in (1, 2, 3):
        clean_test[f'anno_{sent}_{annotator}'] = clean_test[f'Sentence{sent}_marked_{annotator}'].map(
            lambda x: np.array([1 if '*' in w else 0 for w in x.split()]))

# clean_test['len_2'] = clean_test['text_2'].map(lambda x: len(x.split()))
clean_test.drop(columns=[c for c in real_test.columns if 'marked' in c], inplace=True)
# Average the three annotators' masks word by word
clean_test["annotation_1"] = np.array(clean_test[["anno_1_1", "anno_1_2", "anno_1_3"]]).mean(axis=1)
clean_test["annotation_2"] = np.array(clean_test[["anno_2_1", "anno_2_2", "anno_2_3"]]).mean(axis=1)
clean_test.drop(columns=[c for c in clean_test.columns if 'anno_' in c], inplace=True)
clean_test.reset_index(inplace=True)

clean_test

Unnamed: 0,index,label,text_1,text_2,annotation_1,annotation_2
0,0,1,This church choir sings to the masses as they ...,The church has cracks in the ceiling.,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]"
1,1,0,This church choir sings to the masses as they ...,The church is filled with song.,"[0.0, 0.3333333333333333, 0.6666666666666666, ...","[0.0, 0.3333333333333333, 0.0, 0.6666666666666..."
2,2,2,This church choir sings to the masses as they ...,A choir singing at a baseball game.,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3333333333333...","[0.0, 0.0, 0.3333333333333333, 0.0, 0.0, 1.0, ..."
3,3,1,"A woman with a green headscarf, blue shirt and...",The woman is young.,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 1.0]"
4,4,0,"A woman with a green headscarf, blue shirt and...",The woman is very happy.,"[0.3333333333333333, 0.3333333333333333, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 1.0]"
...,...,...,...,...,...,...
9819,9819,2,Two women are observing something together.,Two women are standing with their eyes closed.,"[0.0, 0.0, 0.0, 1.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0]"
9820,9820,0,Two women are observing something together.,Two girls are looking at something.,"[0.0, 0.3333333333333333, 0.0, 0.6666666666666...","[0.0, 0.0, 0.0, 0.6666666666666666, 0.0, 0.666..."
9821,9821,2,A man in a black leather jacket and a book in ...,A man is flying a kite.,"[0.0, 0.3333333333333333, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 1.0, 0.6666666666666666, 1.0]"
9822,9822,0,A man in a black leather jacket and a book in ...,A man is speaking in a classroom.,"[0.0, 0.3333333333333333, 0.0, 0.0, 0.33333333...","[0.0, 0.6666666666666666, 0.0, 0.6666666666666..."
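
To make the fractional values in `annotation_1` and `annotation_2` concrete, here is a small worked example of the averaging step. The three marked sentences are hypothetical (not rows from `esnli_test.csv`), and `highlight_mask` is an illustrative helper that mirrors the lambda used in the cell above.

```python
import numpy as np

def highlight_mask(marked_sentence):
    """Return 1 for each whitespace-separated word containing '*', else 0."""
    return np.array([1 if "*" in w else 0 for w in marked_sentence.split()])

# Hypothetical highlights from three annotators for the same sentence
marked = [
    "The *woman* is *young.*",
    "The woman is *young.*",
    "The woman is *young.*",
]

# Word-wise mean over annotators: the fraction of annotators who highlighted each word
annotation = np.mean([highlight_mask(m) for m in marked], axis=0)
print(annotation.tolist())
```

Each entry is the share of the three annotators who marked that word, which is why the stored values are multiples of one third.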


In [18]:
test_out = Dataset.from_pandas(clean_test)
test_out

Dataset({
    features: ['index', 'label', 'text_1', 'text_2', 'annotation_1', 'annotation_2'],
    num_rows: 9824
})

In [19]:
test_out = Dataset.from_pandas(clean_test)
full_out = DatasetDict({
    'test': test_out})
path_dataset = "./Data/Clean/esnli"

full_out.cache_files

full_out.save_to_disk(path_dataset)

                                                                                              

In [6]:
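# `data` is assumed to be the DatasetDict loaded from ./Data/Clean/esnli
# Find the row position whose `index` value is 459, then look up by position
for pos, idx in enumerate(data['test']['index']):
    if idx == 459:
        break
data['test'][pos]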

{'index': 459,
 'label': 2,
 'text_1': 'Two roadside workers with lime green safety jackets, white hard hats and gloves on with construction cones in the background',
 'text_2': 'A woman chastises another.',
 'annotation_1': [0.3333333333333333,
  0.6666666666666666,
  1.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0],
 'annotation_2': [0.0, 0.6666666666666666, 0.3333333333333333, 0.0]}

## Debugging

In [14]:
# download the dataset
!wget -q -nc http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
# unzip it
!tar -zxf /content/aclImdb_v1.tar.gz

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
tar (child): /content/aclImdb_v1.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now


In [15]:
# unzip it
!tar -zxf aclImdb_v1.tar.gz

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
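
The fork warning above names its own fix: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming the variable is set at the top of the notebook before any tokenizer is used (the value `"false"` is one of the two options the message lists):

```python
import os

# Set this before the tokenizer is first used, so that shell commands run
# with `!` later on do not fork a process that already uses parallelism.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Setting it to `"true"` would also silence the message, but keeps the deadlock risk that the warning describes.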
