In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

# Exploring the datasets

**Warning: Offensive language**

The data directory should look like:
```
../pynlp/data/
├── gibert
│   ├── extraData.csv
│   ├── testData.csv
│   └── trainData.csv
├── OLID-A
│   ├── testData.csv
│   └── trainData.csv
├── README.md
├── trac2018
│   ├── devData.csv
│   ├── testData-fb.csv
│   ├── testData-tw.csv
│   └── trainData.csv
└── vidgen_extra
    ├── Dynamically Generated Hate Dataset - annotation guidelines.pdf
    ├── Dynamically Generated Hate Dataset v0.2.2.csv
    ├── Dynamically Generated Hate Dataset v0.2.3.csv
    └── README.md
```

In [2]:
DESIRED_COLUMNS = [
    'Id',
    'Text',
    'Label'
]

## TRAC2018

Presented in
*Ritesh Kumar, Aishwarya N. Reganti, Akshit Bhatia, Tushar Maheshwari (2018) Aggression-annotated Corpus of Hindi-English Code-mixed Data. In: Proceedings of LREC-2018*

In [3]:
df_trac2018 = pd.read_csv('../pynlp/data/trac2018/trainData.csv', delimiter='\t')
df_trac2018.sample(5, random_state=12345)

Unnamed: 0,Id,Text,Label
4297,facebook_corpus_msr_1521828,Kerala has government with guts to counter politicising issue of ban of beef by BJP when all the people let it be minority only have all respect and uphold sanctity of cow but creating fear sychosis even for other livestock which are legally allowed is a hypocrisy and arrogance.,OAG
3628,facebook_corpus_msr_423176,Tigor will be in showrooms post March 28: Tata Motors MD,NAG
7839,facebook_corpus_msr_394922,https://www.youtube.com/watch?v=k-b22XCgwqAHye > #Rapeistan First Make Ur Soldiers Brave then Talk About War Against Pakistan and SoCald Surgical Strikes,NAG
913,facebook_corpus_msr_1791669,Dont look like a bad person atleast in these pics...maybe media overrated that one issue...,CAG
3549,facebook_corpus_msr_1521918,Kerala beef fry is delicious with rice cake...,NAG


In [4]:
df_trac2018.columns = DESIRED_COLUMNS

In [5]:
df_trac2018.groupby('Label')[['Id']].count()

Label,Id
CAG,4240
NAG,5051
OAG,2708


In [6]:
!ls -l ../pynlp/data/trac2018/

total 3180
-rw-r--r-- 1 rutger rutger  502403 jul 13  2019 devData.csv
-rw-rw-r-- 1 rutger rutger  324171 sep 27 08:45 testData.csv
-rw-r--r-- 1 rutger rutger  176206 jul 13  2019 testData-fb.csv
-rw-r--r-- 1 rutger rutger  147979 jul 13  2019 testData-tw.csv
-rw-r--r-- 1 rutger rutger 2092225 jul 13  2019 trainData.csv


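It is unclear why the test data is split by platform (Facebook and Twitter) while the training data is not. The two test files are merged into a single `testData.csv` here.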

In [7]:
df_trac2018_test_fb = pd.read_csv('../pynlp/data/trac2018/testData-fb.csv', delimiter='\t')
df_trac2018_test_tw = pd.read_csv('../pynlp/data/trac2018/testData-tw.csv', delimiter='\t')

In [8]:
df_trac2018_test = pd.concat([
    df_trac2018_test_fb,
    df_trac2018_test_tw
])

In [9]:
df_trac2018_test.to_csv(
    '../pynlp/data/trac2018/testData.csv',
    index=False,
    sep='\t'
)
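
Merging the two files discards which platform each row came from. If that distinction matters later, a variant like the one below keeps it. This is only a sketch on toy frames: the `Platform` column and the toy rows are assumptions for illustration, not part of the original data.

```python
import pandas as pd

# Toy stand-ins for the Facebook and Twitter test files (hypothetical rows).
toy_fb = pd.DataFrame({'Id': ['fb_1', 'fb_2'], 'Text': ['a', 'b'], 'Label': ['NAG', 'CAG']})
toy_tw = pd.DataFrame({'Id': ['tw_1'], 'Text': ['c'], 'Label': ['OAG']})

# Tag each part with its origin before concatenating. `Platform` is a
# hypothetical column name, not one used elsewhere in this notebook.
toy_test = pd.concat(
    [toy_fb.assign(Platform='fb'), toy_tw.assign(Platform='tw')],
    ignore_index=True,  # avoid duplicate index values from the two parts
)
```

Note that the combining cells further down assign exactly three column names to each file, so an extra column would have to be dropped or handled there.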

## OLID

Presented in *[Predicting the Type and Target of Offensive Posts in Social Media](https://aclanthology.org/N19-1144) (Zampieri et al., NAACL 2019)*

In [10]:
df_olid = pd.read_csv('../pynlp/data/OLID-A/trainData.csv', delimiter='\t')
df_olid.sample(5, random_state=12345)

Unnamed: 0,ID,Text,Label
5505,99282,@USER @USER This can ONLY help the Conservatives unify! 🙄,NOT
5847,17954,@USER @USER @USER She is laughing the most. URL,NOT
10439,10332,#meDIAtoo Selective putrid outrage ONLY AT Conservatives only BY LIBERALS who want us to know how MORAL they are!!! Re: Kavanaugh and ANY OTHER REPUB who will stand up against them!! URL,NOT
10025,37450,#GeraniumInTheCranium. Please also dredge up some of her ridiculous “gun control” bills that were shot down. URL,OFF
7320,91919,@USER @USER @USER How many @USER backbenchers think it’s workable? URL,NOT


In [11]:
df_olid.columns = DESIRED_COLUMNS

In [12]:
df_olid.groupby('Label')[['Id']].count()

Label,Id
NOT,8840
OFF,4400


In [13]:
df_olid_test = pd.read_csv('../pynlp/data/OLID-A/testData.csv', delimiter='\t')
df_olid_test.columns

Index(['Id', 'Text', 'Label'], dtype='object')

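The OLID test set originally ships with an extra, empty column named `' '`. It does not appear in the output above because this notebook has already been run once and the cleaned file was written back.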

In [14]:
# If notebook has been run once the empty column is not there anymore
if ' ' in df_olid_test.columns:
    df_olid_test = df_olid_test.drop(columns=' ')

In [15]:
df_olid_test.to_csv('../pynlp/data/OLID-A/testData.csv', index=False, sep='\t')
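
A slightly more general version of the check above, as a sketch: drop every column whose name is empty or whitespace-only, rather than matching `' '` exactly. The helper name `drop_blank_columns` is hypothetical, and the toy frame only mimics the shape of the problem.

```python
import pandas as pd

def drop_blank_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns whose name is empty or whitespace."""
    blank = [col for col in df.columns if str(col).strip() == '']
    return df.drop(columns=blank)

# Toy frame with a stray blank-named column (hypothetical data).
toy = pd.DataFrame({'Id': [1], 'Text': ['x'], 'Label': ['NOT'], ' ': [None]})
cleaned = drop_blank_columns(toy)
```

Calling the helper on an already clean frame leaves the columns unchanged, so the cell can be rerun safely.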

## Gibert

Presented in *O. de Gibert, N. Pérez, A. García-Pablos, M. Cuadros, 2018. Hate Speech Dataset from a White Supremacy Forum. In ALW2: 2nd Workshop on Abusive Language Online.*

In [16]:
df_gibert = pd.read_csv('../pynlp/data/gibert/trainData.csv', delimiter='\t')
df_gibert.sample(5, random_state=12345)

Unnamed: 0,Id,Text,Label
1290,13260874_2,I hope all their farms burn down and their nappy headed children squandle in the flames .,hate
260,13586829_1,How on earth can they be Swedes when they are the result of race treason ?,hate
1085,13434590_2,Where a white woman can go about in public and not fear assault by muds .,hate
347,14416429_3,Let the non-whites pollute their bodies with that garbage .,hate
1758,12899322_3,See the pattern .,noHate


In [17]:
df_gibert.columns = DESIRED_COLUMNS

In [18]:
df_gibert.groupby('Label')[['Id']].count()

Label,Id
hate,957
noHate,957


## Vidgen

Presented in *[Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection](https://aclanthology.org/2021.acl-long.132) (Vidgen et al., ACL 2021)*

In [19]:
df_vidgen = pd.read_csv('../pynlp/data/vidgen_extra/Dynamically Generated Hate Dataset v0.2.3.csv')
df_vidgen.sample(5, random_state=12345)

Unnamed: 0.1,Unnamed: 0,acl.id,X1,text,label,type,target,level,split,round.base,annotator,round,acl.id.matched
3102,3139,acl12822,3139,This bloody tumble dryer is nothing but trash,nothate,none,none,original,train,2,LqLKTtrOmx,2a,acl12823
8887,8931,acl18035,8931,"He is a prick so I'm just calling him by the appropriate word, it's not my fault he's a fool",nothate,none,none,perturbation,train,2,TrRF46JWfP,2b,acl18034
15780,15839,acl24252,15839,Get castrated you fucking dickhead,nothate,none,none,original,train,3,oemYWm1Tjg,3a,acl24253
1696,1731,acl11555,1731,salmon is the superior of all fish. She really a bad bitch!,nothate,none,none,original,train,2,GNZuCtwed3,2a,acl11556
16691,16752,acl25074,16752,If the data is properly encoded and access to personal devices denied then ad-hoc networks of this sort are the future. That being said: those spazzi3s are certainly not doing it right or in good faith and are soooo not trustworthy to carry out such an advancement,hate,derogation,dis,perturbation,train,3,CAgNlUizNm,3b,acl25073


**label**

In [20]:
df_vidgen.groupby('label')[['acl.id']].count()

label,acl.id
hate,22175
nothate,18969


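Vidgen uses the same binary annotation scheme as Gibert, but spells the negative label `nothate` instead of `noHate`. Rename it so both datasets use the same label names.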

In [21]:
# Only the negative label is spelled differently; 'hate' already matches Gibert
df_vidgen['label'] = df_vidgen['label'].replace({'nothate': 'noHate'})
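
Since `replace` leaves unknown values untouched, a typo or an unexpected label would pass through silently. The sketch below shows one way to check that only the expected labels remain after the mapping. The helper `harmonize_labels` and the toy series are assumptions for illustration.

```python
import pandas as pd

EXPECTED = {'hate', 'noHate'}  # label names as used in the Gibert data

def harmonize_labels(labels: pd.Series) -> pd.Series:
    """Map Vidgen labels onto the Gibert names and fail on anything unexpected."""
    out = labels.replace({'nothate': 'noHate'})
    unexpected = set(out.unique()) - EXPECTED
    if unexpected:
        raise ValueError(f'unexpected labels: {sorted(unexpected)}')
    return out

# Toy labels (hypothetical), not the real column.
toy_labels = pd.Series(['hate', 'nothate', 'nothate'])
harmonized = harmonize_labels(toy_labels)
```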

In [22]:
df_vidgen['split'].unique()

array(['train', 'test', 'dev'], dtype=object)

**VUA format**

In [23]:
!mkdir -p ../pynlp/data/vidgen_vua

In [24]:
import csv

for split in ['train']:
    df_vidgen_split = df_vidgen[df_vidgen['split'] == split]
    df_vidgen_split = df_vidgen_split[['acl.id', 'text', 'label']]
    df_vidgen_split.to_csv(
        f'../pynlp/data/vidgen_vua/{split}Data.csv',
        index=False,
        sep='\t',
        quoting=csv.QUOTE_ALL
    )
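
`csv.QUOTE_ALL` wraps every field in double quotes, which keeps tabs, quotes, and newlines inside a text from breaking the tab-separated layout. A small round-trip sketch on an in-memory buffer (toy rows, not the real data) illustrates this:

```python
import csv
import io

import pandas as pd

# Toy rows containing characters that commonly break delimited files.
toy = pd.DataFrame({
    'acl.id': ['acl1', 'acl2'],
    'text': ['has a\ttab', 'has "quotes" and\na newline'],
    'label': ['noHate', 'hate'],
})

# Write as tab-separated text with every field quoted, as in the cell above.
buffer = io.StringIO()
toy.to_csv(buffer, index=False, sep='\t', quoting=csv.QUOTE_ALL)

# Read it back; the awkward characters should survive unchanged.
buffer.seek(0)
restored = pd.read_csv(buffer, delimiter='\t')
```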

# Combining datasets

Here we combine the datasets above into larger training datasets.

### Dataset 1

Dataset 1 is a concatenation of the Gibert, OLID and TRAC 2018 data.

In [25]:
!mkdir -p ../pynlp/data/dataset1/

In [26]:
for split in ['train', 'test']:
    parts = []
    for dataset in ['gibert', 'OLID-A', 'trac2018']:
        df_part = pd.read_csv(f'../pynlp/data/{dataset}/{split}Data.csv', delimiter='\t')
        # Use the same column names everywhere and record where each row came from
        df_part.columns = DESIRED_COLUMNS
        df_part['Source'] = dataset
        parts.append(df_part)

    # ignore_index gives the combined frame a fresh 0..n-1 index
    df_all = pd.concat(parts, ignore_index=True)
    df_all.to_csv(
        f'../pynlp/data/dataset1/{split}Data.csv',
        index=False,
        sep='\t',
        quoting=csv.QUOTE_ALL
    )

In [27]:
d1_train = pd.read_csv('../pynlp/data/dataset1/trainData.csv', delimiter='\t')
d1_train.sample(5)

Unnamed: 0,Id,Text,Label,Source
14853,47669,@USER Believe in something. Even if it means sacrificing everything - like Hitler did,NOT,OLID-A
22694,facebook_corpus_msr_397733,And we denied bcz u didnt even have a proof to do a surgical strike in pakistan territory or even u have no video evidence to enter in pakistan ..,NAG,trac2018
8442,27176,@USER @USER It says everything. When your principles hang on the fraudulent word of a porn performer. What integrity have you got left?,OFF,OLID-A
4697,61564,Yes!!!!! Please!!!! #MAGA #ConfirmKavanaugh #VoteRedToSaveAmerica URL,NOT,OLID-A
21883,facebook_corpus_msr_451689,stopped watching the video long time back ... just enjoying the comments here now :D .. the worst launch ever for any product,CAG,trac2018
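
Because the three sources use different label sets (`hate`/`noHate`, `NOT`/`OFF`, and `CAG`/`NAG`/`OAG`), the `Label` column of the combined file mixes all of them. Grouping on `Source` as well makes this visible. The sketch uses a handful of toy rows rather than the real file:

```python
import pandas as pd

# Toy rows (hypothetical) with the same columns as the combined file.
toy_combined = pd.DataFrame({
    'Id': ['g1', 'g2', 'o1', 't1', 't2'],
    'Label': ['hate', 'noHate', 'NOT', 'NAG', 'NAG'],
    'Source': ['gibert', 'gibert', 'OLID-A', 'trac2018', 'trac2018'],
})

# Count rows per (source, label) pair.
counts = toy_combined.groupby(['Source', 'Label']).size()
```

On the real data the equivalent call would be `d1_train.groupby(['Source', 'Label']).size()`.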


### Dataset 2

Dataset 2 represents a concatenation of dataset 1 (Gibert, OLID and TRAC 2018), together with the dynamically generated data from Vidgen.

In [28]:
!mkdir -p ../pynlp/data/dataset2/

In [29]:
for split in ['train']:
    df_all = []
    for dataset in ['gibert', 'OLID-A', 'trac2018', 'vidgen_vua']:
        df_part = pd.read_csv(
            f'../pynlp/data/{dataset}/{split}Data.csv',
            encoding='utf-8',  # the files are UTF-8; latin-1 would garble emoji and accents
            engine='python',
            delimiter='\t'
        )
        df_part.columns = DESIRED_COLUMNS
        df_part['Source'] = dataset
        df_all.append(df_part)
    
    df_all = pd.concat(df_all)
    df_all.to_csv(
        f'../pynlp/data/dataset2/{split}Data.csv',
        index=False,
        sep='\t',
        quoting=csv.QUOTE_ALL
    )
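
One thing to watch when reading these files: decoding UTF-8 bytes as `latin-1` never raises an error, but multi-byte characters (the emoji in the OLID tweets, for instance) come out as mojibake. A minimal sketch of the effect, independent of the files above:

```python
# UTF-8 encodes the eye-roll emoji as four bytes.
original = '🙄'
raw = original.encode('utf-8')

# latin-1 maps every byte to some character, so decoding never fails,
# but each byte becomes its own (wrong) character.
garbled = raw.decode('latin-1')

# The damage is reversible as long as nothing else has altered the text.
recovered = garbled.encode('latin-1').decode('utf-8')
```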

In [30]:
d2_train = pd.read_csv('../pynlp/data/dataset2/trainData.csv', delimiter='\t')
d2_train.sample(5)

Unnamed: 0,Id,Text,Label,Source
36455,acl21228,"tbh Israel does have a pretty impressive nuclear arsenal, and many submarines to deliver it. But jews built none of it themselves. Keep in mind that (((they))) are an extremely vengeful bunch, and somehow still believe europeans tried to exterminate them lol.",hate,vidgen_vua
9087,41732,@USER That's cause she's got (head wave) at least 50 or so pounds and a couple inches or feet on her so it's okay. Not to mention Miley doesnt look the fighting type. (IDK if she is or not),NOT,OLID-A
59561,acl865,I'm so sorry for those that are indians ... it must be such a shame,hate,vidgen_vua
12559,13874,@USER @USER @USER Here your proof! Our African 🇺🇸 friends don't get free college. The Demoncratic are putting illegals over 🇺🇸 ppl. These Antifa Ninja Turtles running around acting like jihadist and never lived a real life experience. Most of these Antifa colleges kids are White privileged,OFF,OLID-A
10930,45312,@USER i saw this before the results and literally thought i got eliminated WE URL,NOT,OLID-A
