# Hate-Speech detection data description

In [1]:
import pandas as pd
import datasets

In [2]:
dataset = datasets.load_dataset('ucberkeley-dlab/measuring-hate-speech', 'binary')   
hate_speech_ucb = dataset['train'].to_pandas()

Using custom data configuration ucberkeley-dlab--measuring-hate-speech-f91f636a830ad73c
Reusing dataset parquet (C:\Users\UTKARSH\.cache\huggingface\datasets\parquet\ucberkeley-dlab--measuring-hate-speech-f91f636a830ad73c\0.0.0\0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901)



In [3]:
hate_speech_github = pd.read_csv("../data/github/hate_speech.csv", index_col=0)

## Dataset description

We plan to work on two datasets for this project:
1. [t-davidson_hate-speech-and-offensive-language](https://github.com/t-davidson/hate-speech-and-offensive-language/tree/master/data)
2. [ucberkeley-dlab_measuring-hate-speech](https://huggingface.co/datasets/ucberkeley-dlab/measuring-hate-speech)

Both datasets are publicly available.
- The first is hosted on GitHub. It is in raw form, so it needs pre-processing.
- The second is hosted on Hugging Face and can be loaded with the `datasets` library. It is lightly processed but still needs further pre-processing.

Here is a brief description of the two datasets:

---

### [t-davidson_hate-speech-and-offensive-language](https://github.com/t-davidson/hate-speech-and-offensive-language/tree/master/data)

This dataset is available on GitHub and consists of the following columns (as described in the dataset [README](https://github.com/t-davidson/hate-speech-and-offensive-language/blob/master/data/readme.md)):
- `tweet`: the tweet text
- `count`: number of users who annotated each tweet _(numerical - 5 unique values - 3, 4, 6, 7, 9)_
- `hate_speech`: number of users who judged the tweet to be hate speech _(numerical - 8 unique values observed, between 0 and 7)_
- `offensive_language`: number of users who judged the tweet to be offensive _(numerical - 10 unique values, between 0 and 9)_
- `neither`: number of users who judged the tweet to be neither hate speech nor offensive _(numerical - 10 unique values, between 0 and 9)_
- `class`: class label chosen by the majority of annotators: `0 - hate speech`, `1 - offensive language` and `2 - neither`
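
The relationship between the vote columns and `class` can be checked directly. The sketch below uses a small made-up frame with the same column names (the rows are illustrative, not taken from the dataset) and assumes that `count` is the sum of the three vote columns and that `class` is the index of the column with the most votes.

```python
import pandas as pd

# Toy rows that mimic the vote columns described above (not real data).
toy = pd.DataFrame(
    {
        "count": [3, 3, 6],
        "hate_speech": [0, 2, 1],
        "offensive_language": [0, 1, 5],
        "neither": [3, 0, 0],
        "class": [2, 0, 1],
    }
)

vote_cols = ["hate_speech", "offensive_language", "neither"]

# Assumption: the three vote columns add up to the number of annotators.
votes_match_count = toy[vote_cols].sum(axis=1).eq(toy["count"])

# Assumption: the majority label is the position of the largest vote
# column (0, 1 or 2). argmax picks the first column on ties.
majority = toy[vote_cols].to_numpy().argmax(axis=1)
class_matches_majority = pd.Series(majority, index=toy.index).eq(toy["class"])

print(votes_match_count.all(), class_matches_majority.all())
```

Running the same two checks on `hate_speech_github` would show whether these assumptions hold for every row; tied votes are the most likely source of mismatches.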

__Genre__: It is a multi-class Twitter dataset containing the raw tweet text, the annotator vote counts, and the class of each tweet.

__Categories__: A tweet has exactly one of three possible classes: `hate speech`, `offensive language` or `neither`

__Size__: This corpus has __24,783 rows__ and __6 columns__: the tweet text, four annotation-count columns, and the class label. Every column has __24,783 non-null__ values.

__Language__: The tweet text is in English.

__Style__: The style of text is informal, as is usually the case with social media text. It has emojis and emoticons, abbreviations, hashtags and mentions, spelling errors and a lot of swear words.
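
Given this style, a first cleaning pass might normalise mentions, links and retweet markers. The following is a minimal sketch using only the standard library; the function name `clean_tweet`, the `@user` placeholder and the specific rules are assumptions chosen for illustration, not the pre-processing this project has settled on.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
RT_RE = re.compile(r"\bRT\b:?")
SPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Apply a few illustrative normalisation rules to one tweet."""
    text = URL_RE.sub(" ", text)           # drop links
    text = MENTION_RE.sub("@user", text)   # anonymise mentions
    text = RT_RE.sub(" ", text)            # drop the retweet marker
    text = text.replace("#", "")           # keep the hashtag word, drop '#'
    return SPACE_RE.sub(" ", text).strip().lower()


example = "!!! RT @some_user: check this out http://example.com #Monday"
print(clean_tweet(example))
```

Emojis, emoticons and swear words are deliberately left untouched here, since they may carry signal for the classification task.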

__Tweet length__: Tweets range from __5__ to __754__ characters, with mean and median lengths of about __85__ and __81__ characters respectively.

__Annotator count__: Each tweet was annotated by at least 3 annotators (more when the initial judgments were considered unreliable).

#### Stats about the data

In [4]:
hate_speech_github.head()

|   | count | hate_speech | offensive_language | neither | class | tweet |
|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 0 | 3 | 2 | !!! RT @mayasolovely: As a woman you shouldn't... |
| 1 | 3 | 0 | 3 | 0 | 1 | !!!!! RT @mleew17: boy dats cold...tyga dwn ba... |
| 2 | 3 | 0 | 3 | 0 | 1 | !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby... |
| 3 | 3 | 0 | 2 | 1 | 1 | !!!!!!!!! RT @C_G_Anderson: @viva_based she lo... |
| 4 | 6 | 0 | 6 | 0 | 1 | !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you... |


In [5]:
hate_speech_github.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24783 entries, 0 to 25296
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   count               24783 non-null  int64 
 1   hate_speech         24783 non-null  int64 
 2   offensive_language  24783 non-null  int64 
 3   neither             24783 non-null  int64 
 4   class               24783 non-null  int64 
 5   tweet               24783 non-null  object
dtypes: int64(5), object(1)
memory usage: 1.3+ MB


In [6]:
for col in hate_speech_github.iloc[:, :5]:
    values = hate_speech_github[col].value_counts()
    print(f"For {col} - {len(values)} unique values:\n{'-' * 20}")
    display(values)
    print("-" * 50)

For count - 5 unique values:
--------------------


3    22807
6     1571
4      211
9      167
7       27
Name: count, dtype: int64

--------------------------------------------------
For hate_speech - 8 unique values:
--------------------


0    19790
1     3419
2     1251
3      287
4       21
5        7
6        5
7        3
Name: hate_speech, dtype: int64

--------------------------------------------------
For offensive_language - 10 unique values:
--------------------


3    13383
2     4246
0     3475
1     2066
6      857
5      369
4      251
9       66
8       37
7       33
Name: offensive_language, dtype: int64

--------------------------------------------------
For neither - 10 unique values:
--------------------


0    18892
3     2790
1     1694
2     1200
6      103
5       54
4       35
9        5
8        5
7        5
Name: neither, dtype: int64

--------------------------------------------------
For class - 3 unique values:
--------------------


1    19190
2     4163
0     1430
Name: class, dtype: int64

--------------------------------------------------
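
The `class` counts printed above point to a strong class imbalance. As a quick worked example, the snippet below turns those three counts (copied from the output, with the label names taken from the column description) into shares and an imbalance ratio.

```python
# Class counts copied from the value_counts() output for `class`.
class_counts = {
    "offensive language": 19190,  # class 1
    "neither": 4163,              # class 2
    "hate speech": 1430,          # class 0
}

total = sum(class_counts.values())
shares = {label: n / total for label, n in class_counts.items()}

for label, share in shares.items():
    print(f"{label:<20}{share:6.1%}")

# Ratio between the largest and the smallest class.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())
print(f"largest / smallest class: {imbalance_ratio:.1f}")
```

Roughly 77% of tweets are labelled offensive language, about 17% neither, and under 6% hate speech, so any model trained on this data will likely need some form of class weighting or resampling.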


In [7]:
min(hate_speech_github["tweet"].str.len())

5

In [8]:
max(hate_speech_github["tweet"].str.len())

754

In [9]:
hate_speech_github["tweet"].str.len().mean()

85.43606504458701

In [10]:
hate_speech_github["tweet"].str.len().median()

81.0

---

### [ucberkeley-dlab_measuring-hate-speech](https://huggingface.co/datasets/ucberkeley-dlab/measuring-hate-speech)

This is a public release of the dataset described in [Kennedy et al. (2020)](https://aclanthology.org/2020.acl-main.483.pdf), consisting of 39,565 comments annotated by 7,912 annotators. The dataset comprises the following columns, as described by the authors [here](https://huggingface.co/datasets/ucberkeley-dlab/measuring-hate-speech#key-dataset-columns):
- `hate_speech_score` - continuous hate speech measure, where higher means more hateful and lower means less _(numerical - between -8.34 and 6.30)_
- `text` - lightly processed text of a social media post
- `comment_id` - unique ID for each comment _(categorical - 39565 unique values)_
- `annotator_id` - unique ID for each annotator _(categorical - 7912 unique values)_
- `platform` - id of the platform the text was taken from _(categorical - 4 possible values)_
- `sentiment` - sentiment score of the text _(ordinal - 5 values, from 0 to 4)_
- `respect` - respect score of the text _(ordinal - 5 values, from 0 to 4)_
- `insult` - insult score of the text _(ordinal - 5 values, from 0 to 4)_
- `humiliate` - humiliation score of the text _(ordinal - 5 values, from 0 to 4)_
- `status` - __status__ score of the text _(ordinal - 5 values, from 0 to 4)_ --??
- `dehumanize` - dehumanization score of the text _(ordinal - 5 values, from 0 to 4)_
- `violence` - violence score of the text _(ordinal - 5 values, from 0 to 4)_
- `genocide` - genocide score of the text _(ordinal - 5 values, from 0 to 4)_
- `attack_defend` - attack/defend score of the text _(ordinal - 5 values, from 0 to 4)_
- `hatespeech` - __hate speech__ score of the text _(ordinal - 3 values, from 0 to 2)_ --??
- `annotator_severity` - annotator's estimated survey interpretation bias _(numerical - between -1.82 and 1.36)_
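
Because each row is one annotator's rating of one comment, a per-comment view needs an aggregation step. The sketch below shows one way to do that on a made-up frame that reuses a few of the column names above. It assumes that `hate_speech_score` is constant within a comment, and the `0.5` cut-off used to derive a binary label is purely an illustrative assumption.

```python
import pandas as pd

# Toy annotation rows (made-up values): two annotators rated comment 1,
# one annotator rated comment 2.
annotations = pd.DataFrame(
    {
        "comment_id": [1, 1, 2],
        "annotator_id": [10, 11, 10],
        "text": ["first post", "first post", "second post"],
        "insult": [3.0, 4.0, 0.0],
        "hate_speech_score": [1.2, 1.2, -2.5],
    }
)

# Collapse to one row per comment.
per_comment = (
    annotations.groupby("comment_id")
    .agg(
        text=("text", "first"),
        hate_speech_score=("hate_speech_score", "first"),
        mean_insult=("insult", "mean"),
        n_annotations=("annotator_id", "nunique"),
    )
    .reset_index()
)

# Hypothetical cut-off for a binary label; 0.5 is an assumption.
HATE_THRESHOLD = 0.5
per_comment["is_hate"] = per_comment["hate_speech_score"] > HATE_THRESHOLD

print(per_comment)
```

The same pattern should carry over to `hate_speech_ucb`, with the remaining ordinal labels averaged in the same way as `insult`.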

__Genre__: It is a dataset of social media posts, each annotated with several constituent labels (such as sentiment, (dis)respect, insult, etc.) and a floating-point hate-speech score.

__Size__:

- This corpus has __131 columns__ and __135,556 rows__, one per annotation; every column has __135,556 non-null__ values.
- Of these 131 columns:
    - 2 columns hold the IDs of comments and annotators
    - 10 columns hold the constituent labels (such as sentiment, respect, insult, etc.) of the posts
    - 1 column holds the lightly processed text of the post
    - 1 column holds the source platform
    - 1 column holds the hate speech score
    - The remaining columns cover 8 target identity groups (race/ethnicity, religion, national origin/citizenship, gender, sexual orientation, age, disability, political ideology) and 42 identity subgroups, along with annotator attributes such as the `annotator_religion_*` and `annotator_sexuality_*` columns.
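
With this many columns, it can help to bucket them by name before inspecting them. The following sketch groups a short, hand-written list of column names by prefix. The `annotator_*` names appear in the `head()` output of this notebook, while `target_race` and the `target_` prefix are assumptions about how the identity-group columns are named.

```python
from collections import Counter

# Columns treated as "core" even though two of them start with "annotator_".
CORE = {
    "comment_id",
    "annotator_id",
    "platform",
    "text",
    "hate_speech_score",
    "annotator_severity",
}


def column_group(name: str) -> str:
    """Assign a column name to a coarse group based on its prefix."""
    if name in CORE:
        return "core"
    for prefix in ("annotator_", "target_"):
        if name.startswith(prefix):
            return prefix.rstrip("_")
    return "label"


# Illustrative subset; in the notebook this would be hate_speech_ucb.columns.
columns = [
    "comment_id",
    "annotator_id",
    "platform",
    "sentiment",
    "insult",
    "hate_speech_score",
    "text",
    "target_race",  # hypothetical name for a target identity column
    "annotator_religion_hindu",
    "annotator_sexuality_gay",
]

group_sizes = Counter(column_group(c) for c in columns)
print(group_sizes)
```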

__Language__: The post text is in English.

__Style__: The style of text is informal, as is usually the case with social media text. It has emojis and emoticons, abbreviations, hashtags and mentions, spelling errors and a lot of swear words.

__Text length__: Posts range from __7__ to __603__ characters, with mean and median lengths of about __151__ and __115__ characters respectively, computed over all 135,556 annotation rows.
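
Since the same post appears once per annotator (some `comment_id` values occur several hundred times), length statistics computed over all rows give heavily annotated posts more weight. A minimal sketch of the difference, on made-up data, is shown below; deduplicating on `comment_id` is an assumption about how one might obtain per-post statistics.

```python
import pandas as pd

# Toy data: post 1 was annotated three times, post 2 once.
toy = pd.DataFrame(
    {
        "comment_id": [1, 1, 1, 2],
        "text": ["aaaa", "aaaa", "aaaa", "bbbbbbbbbb"],
    }
)

stats = ["min", "max", "mean", "median"]

# Statistics over every annotation row (repeated posts count repeatedly).
per_row = toy["text"].str.len().agg(stats)

# Statistics over unique posts only.
unique_posts = toy.drop_duplicates("comment_id")
per_post = unique_posts["text"].str.len().agg(stats)

print(per_row, per_post, sep="\n\n")
```

In this toy case the mean length moves from 5.5 to 7.0 characters once duplicates are dropped, so the two views can differ noticeably.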

#### Stats about the data

In [11]:
hate_speech_ucb.iloc[:, :22].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 135556 entries, 0 to 135555
Data columns (total 22 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   comment_id          135556 non-null  int32  
 1   annotator_id        135556 non-null  int32  
 2   platform            135556 non-null  int8   
 3   sentiment           135556 non-null  float64
 4   respect             135556 non-null  float64
 5   insult              135556 non-null  float64
 6   humiliate           135556 non-null  float64
 7   status              135556 non-null  float64
 8   dehumanize          135556 non-null  float64
 9   violence            135556 non-null  float64
 10  genocide            135556 non-null  float64
 11  attack_defend       135556 non-null  float64
 12  hatespeech          135556 non-null  float64
 13  hate_speech_score   135556 non-null  float64
 14  text                135556 non-null  object 
 15  infitms             135556 non-nul

In [12]:
hate_speech_ucb.head()

|   | comment_id | annotator_id | platform | sentiment | respect | insult | humiliate | status | dehumanize | violence | ... | annotator_religion_hindu | annotator_religion_jewish | annotator_religion_mormon | annotator_religion_muslim | annotator_religion_nothing | annotator_religion_other | annotator_sexuality_bisexual | annotator_sexuality_gay | annotator_sexuality_straight | annotator_sexuality_other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47777 | 10873 | 3 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | ... | False | False | False | False | False | False | False | False | True | False |
| 1 | 39773 | 2790 | 2 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | ... | False | False | False | False | False | False | False | False | True | False |
| 2 | 47101 | 3379 | 3 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 0.0 | ... | False | False | False | False | True | False | False | False | True | False |
| 3 | 43625 | 7365 | 3 | 2.0 | 3.0 | 2.0 | 1.0 | 2.0 | 0.0 | 0.0 | ... | False | False | False | False | False | False | False | False | True | False |
| 4 | 12538 | 488 | 0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | ... | False | False | False | False | False | False | False | False | True | False |


In [13]:
for col in hate_speech_ucb.iloc[:, :13]:
    values = hate_speech_ucb[col].value_counts()
    print(f"For {col} - {len(values)} unique values:\n{'-' * 20}")
    display(values)
    print("-" * 50)

For comment_id - 39565 unique values:
--------------------


20011    815
20002    801
20001    799
20007    796
20014    793
        ... 
32639      1
23062      1
45993      1
10714      1
37304      1
Name: comment_id, Length: 39565, dtype: int64

--------------------------------------------------
For annotator_id - 7912 unique values:
--------------------


1410    26
4997    26
3887    26
4521    26
8565    26
        ..
8466     6
8564     6
8003     6
7558     6
6695     6
Name: annotator_id, Length: 7912, dtype: int64

--------------------------------------------------
For platform - 4 unique values:
--------------------


1    43227
0    37164
2    35027
3    20138
Name: platform, dtype: int64

--------------------------------------------------
For sentiment - 5 unique values:
--------------------


4.0    61435
3.0    35243
2.0    19785
0.0     9658
1.0     9435
Name: sentiment, dtype: int64

--------------------------------------------------
For respect - 5 unique values:
--------------------


4.0    58473
3.0    30720
2.0    23895
0.0    12839
1.0     9629
Name: respect, dtype: int64

--------------------------------------------------
For insult - 5 unique values:
--------------------


4.0    44153
3.0    39673
2.0    18587
0.0    18476
1.0    14667
Name: insult, dtype: int64

--------------------------------------------------
For humiliate - 5 unique values:
--------------------


3.0    39382
4.0    30278
2.0    25035
0.0    21306
1.0    19555
Name: humiliate, dtype: int64

--------------------------------------------------
For status - 5 unique values:
--------------------


2.0    61154
3.0    36413
4.0    32500
1.0     4261
0.0     1228
Name: status, dtype: int64

--------------------------------------------------
For dehumanize - 5 unique values:
--------------------


0.0    31855
3.0    28653
1.0    28434
2.0    25292
4.0    21322
Name: dehumanize, dtype: int64

--------------------------------------------------
For violence - 5 unique values:
--------------------


0.0    67922
1.0    30727
4.0    13404
2.0    12241
3.0    11262
Name: violence, dtype: int64

--------------------------------------------------
For genocide - 5 unique values:
--------------------


0.0    90058
1.0    22838
4.0     9252
2.0     8107
3.0     5301
Name: genocide, dtype: int64

--------------------------------------------------
For attack_defend - 5 unique values:
--------------------


3.0    44883
2.0    38201
4.0    33468
1.0    11046
0.0     7958
Name: attack_defend, dtype: int64

--------------------------------------------------
For hatespeech - 3 unique values:
--------------------


0.0    80624
2.0    46021
1.0     8911
Name: hatespeech, dtype: int64

--------------------------------------------------


In [14]:
min(hate_speech_ucb["hate_speech_score"])

-8.34

In [15]:
max(hate_speech_ucb["hate_speech_score"])

6.3

In [16]:
min(hate_speech_ucb["text"].str.len())

7

In [17]:
max(hate_speech_ucb["text"].str.len())

603

In [18]:
hate_speech_ucb["text"].str.len().mean()

151.24074183363334

In [19]:
hate_speech_ucb["text"].str.len().median()

115.0