### Dataset overview

The SemEval GitHub repository contains two main datasets: edos_labelled_aggregated and edos_labelled_individual_annotations. Both contain, among other things, text snippets and their classification as sexist or not sexist. You can find a short overview of the datasets here.

Original Task: https://codalab.lisn.upsaclay.fr/competitions/7124#learn_the_details

Git: https://github.com/rewire-online/edos/tree/main/data

In [24]:
import pandas as pd
import matplotlib.pyplot as plt

import os

In [25]:
# The CSV files are expected in the current working directory.
data_folder = os.getcwd()

df_aggr = pd.read_csv(os.path.join(data_folder, "edos_labelled_aggregated.csv"))
df_individual = pd.read_csv(os.path.join(data_folder, "edos_labelled_individual_annotations.csv"))


def describe_dataset(df: pd.DataFrame, name: str) -> None:
    """Print the shape, column names, and summary statistics of a dataset."""
    print(f"#########Name of Dataset: {name}#########")
    print(f"Shape: {df.shape}")
    print(f"Columns: {df.columns}")
    print()
    # Note: describe() summarizes only the numeric columns if any exist.
    print(f"Overview of contents and data types: {df.describe()}")

### Initial overview of both datasets

In [26]:
df_individual

Unnamed: 0,rewire_id,text,annotator,label_sexist,label_category,label_vector,split
0,sexism2022_english-0,[USER] I wonder what keeps that witch looking ...,17,sexist,2. derogation,2.2 aggressive and emotive attacks,train
1,sexism2022_english-0,[USER] I wonder what keeps that witch looking ...,2,sexist,2. derogation,2.2 aggressive and emotive attacks,train
2,sexism2022_english-0,[USER] I wonder what keeps that witch looking ...,6,not sexist,none,none,train
3,sexism2022_english-1,"What do you guys think about female ""incels""? ...",17,not sexist,none,none,train
4,sexism2022_english-1,"What do you guys think about female ""incels""? ...",15,not sexist,none,none,train
...,...,...,...,...,...,...,...
59995,sexism2022_english-9998,"Since 1973, America has sanctioned the murder ...",2,sexist,4. prejudiced discussions,4.2 supporting systemic discrimination against...,test
59996,sexism2022_english-9998,"Since 1973, America has sanctioned the murder ...",16,not sexist,none,none,test
59997,sexism2022_english-9999,Laura Bates Lol she looks like she'd suck the ...,2,sexist,2. derogation,2.2 aggressive and emotive attacks,train
59998,sexism2022_english-9999,Laura Bates Lol she looks like she'd suck the ...,16,not sexist,none,none,train
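
The table shows that each `rewire_id` appears several times, once per annotator, and that annotators can disagree (for example, `sexism2022_english-0` has two `sexist` votes and one `not sexist` vote). A minimal sketch of how that agreement could be summarized per text is given below. The toy rows are invented for illustration, and the majority vote is an assumption used here for exploration only; it is not necessarily how the labels in the aggregated file were derived.

```python
import pandas as pd

# Toy stand-in for df_individual; these rows are invented for illustration.
toy = pd.DataFrame({
    "rewire_id": ["a", "a", "a", "b", "b", "b"],
    "annotator": [17, 2, 6, 17, 15, 3],
    "label_sexist": ["sexist", "sexist", "not sexist",
                     "not sexist", "not sexist", "not sexist"],
})

# One row per text: number of votes, number of distinct labels, most common label.
per_item = toy.groupby("rewire_id")["label_sexist"].agg(
    n_annotations="count",
    n_distinct_labels="nunique",
    majority_label=lambda s: s.mode().iloc[0],  # ties resolve alphabetically
)
per_item["unanimous"] = per_item["n_distinct_labels"] == 1

print(per_item)
```

Replacing `toy` with `df_individual` should give the same summary for the real data, assuming the column names shown in the table above.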


In [27]:
df_aggr

Unnamed: 0,rewire_id,text,label_sexist,label_category,label_vector,split
0,sexism2022_english-9609,"In Nigeria, if you rape a woman, the men rape ...",not sexist,none,none,dev
1,sexism2022_english-16993,"Then, she's a keeper. 😉",not sexist,none,none,train
2,sexism2022_english-13149,This is like the Metallica video where the poo...,not sexist,none,none,train
3,sexism2022_english-13021,woman?,not sexist,none,none,train
4,sexism2022_english-966,I bet she wished she had a gun,not sexist,none,none,dev
...,...,...,...,...,...,...
19995,sexism2022_english-5228,girls really get fucked almost every weekend ?,not sexist,none,none,train
19996,sexism2022_english-10140,The hatred for moslems are Reasonable and Just...,not sexist,none,none,train
19997,sexism2022_english-9726,Now this is a woman who gets it. 👆,not sexist,none,none,train
19998,sexism2022_english-13365,“American Idol” finalist [USER] said nothing i...,not sexist,none,none,train


### Shapes, columns, and summary statistics

In [28]:
describe_dataset(df_individual, "labeled individual")

#########Name of Dataset: labeled individual#########
Shape: (60000, 7)
Columns: Index(['rewire_id', 'text', 'annotator', 'label_sexist', 'label_category',
       'label_vector', 'split'],
      dtype='object')

Overview of contents and data types:           annotator
count  60000.000000
mean       8.603583
std        5.247291
min        0.000000
25%        4.000000
50%        8.000000
75%       14.000000
max       18.000000
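
The summary above covers only `annotator` because `DataFrame.describe()` restricts itself to numeric columns whenever the frame contains any. A small sketch of how the text columns could be included as well, using `include="all"`; the toy frame is invented for illustration.

```python
import pandas as pd

# Toy frame with one numeric and one text column (invented values).
toy = pd.DataFrame({
    "annotator": [17, 2, 6],
    "label_sexist": ["sexist", "sexist", "not sexist"],
})

numeric_only = toy.describe()               # default: numeric columns only
everything = toy.describe(include="all")    # numeric and non-numeric columns

print(list(numeric_only.columns))
print(list(everything.columns))
print(everything.loc[["count", "unique", "top"], "label_sexist"])
```

With `include="all"`, rows such as `unique` and `top` are filled for text columns and `NaN` for numeric ones, while `mean` and `std` behave the other way round.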


In [29]:
describe_dataset(df_aggr, "labeled aggregated")

#########Name of Dataset: labeled aggregated#########
Shape: (20000, 6)
Columns: Index(['rewire_id', 'text', 'label_sexist', 'label_category', 'label_vector',
       'split'],
      dtype='object')

Overview of contents and data types:                        rewire_id  \
count                      20000   
unique                     20000   
top     sexism2022_english-17659   
freq                           1   

                                                     text label_sexist  \
count                                               20000        20000   
unique                                              20000            2   
top     This is easily the dumbest thing ever written....   not sexist   
freq                                                    1        15146   

       label_category label_vector  split  
count           20000        20000  20000  
unique              5           12      3  
top              none         none  train  
freq            15146        15146  
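
The summary shows that 15146 of the 20000 aggregated rows are labeled `not sexist`, which is roughly 75.7%, so the classes are imbalanced. A minimal sketch of how the label distribution could be inspected overall and per split; the toy frame is invented for illustration and only borrows the column names from the output above.

```python
import pandas as pd

# Toy stand-in for df_aggr; these rows are invented for illustration.
toy = pd.DataFrame({
    "label_sexist": ["not sexist", "not sexist", "not sexist", "sexist"],
    "split": ["train", "train", "dev", "train"],
})

counts = toy["label_sexist"].value_counts()                 # absolute counts
shares = toy["label_sexist"].value_counts(normalize=True)   # relative shares
by_split = pd.crosstab(toy["split"], toy["label_sexist"])   # counts per split

print(counts)
print(shares)
print(by_split)
```

Running the same three calls on `df_aggr` would show whether the imbalance is similar across the `train`, `dev`, and `test` splits.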