<a href="https://colab.research.google.com/github/visumania/TFM-AMM/blob/main/cuadernos/baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 6: Sexism Categorization in Memes

This subtask is a multi-label classification problem: each sexist meme is assigned one or more of the categories defined for Task 3:
1. Ideological and inequality
2. Stereotyping and dominance
3. Objectification
4. Sexual violence
5. Misogyny and non-sexual violence
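
Because a meme can belong to several categories at once, a common way to represent the target is a 0/1 vector with one position per category. The following is a minimal sketch of that encoding; the helper name `to_multi_hot` and the example labels are assumptions made for illustration, while the category strings are the ones listed for `labels_task6` in the dataset description.

```python
# Category names as they appear in the labels_task6 field of the dataset.
CATEGORIES = [
    "IDEOLOGICAL-INEQUALITY",
    "STEREOTYPING-DOMINANCE",
    "OBJECTIFICATION",
    "SEXUAL-VIOLENCE",
    "MISOGYNY-NON-SEXUAL-VIOLENCE",
]


def to_multi_hot(labels):
    """Return a 0/1 vector with one position per category (hypothetical helper)."""
    return [1 if category in labels else 0 for category in CATEGORIES]


# Hypothetical meme tagged with two categories at once.
example = to_multi_hot(["OBJECTIFICATION", "MISOGYNY-NON-SEXUAL-VIOLENCE"])
print(example)  # [0, 0, 1, 0, 1]
```

A meme with no category simply maps to a vector of zeros, which is what distinguishes multi-label from multi-class classification.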

## EXIST 2024 Memes Dataset

The EXIST 2024 Memes Dataset contains more than 5,000 labeled memes in English and Spanish. The training set contains 4,044 memes and the test set contains 1,053 memes. The distribution between the two languages is balanced.

The data are provided in **JSON format**. Each meme is represented as a JSON object with the following attributes:
1. **id_EXIST**: a unique identifier for the meme.
2. **lang**: the language of the meme ("en" or "es").
3. **text**: the text automatically extracted from the meme.
4. **meme**: the name of the file that contains the meme.
5. **path_memes**: the path to the file that contains the meme.
6. **number_annotators**: the number of people who annotated the meme.
7. **annotators**: a unique identifier for each annotator.
8. **gender_annotators**: the gender of the different annotators. Possible values are "F" and "M", for female and male respectively.
9. **age_annotators**: the age group of the different annotators. Possible values are: 18-22, 23-45, and 46+.
10. **ethnicities_annotators**: the self-reported ethnicity of the different annotators. Possible values are: "Black or African America", "Hispano or Latino", "White or Caucasian", "Multiracial", "Asia", "Asian Indian" and "Middle Eastern".
11. **study_levels_annotators**: the self-reported level of study achieved by the different annotators. Possible values are: "Less than high school diploma", "High school degree or equivalent", "Bachelor's degree" and "Doctorate".
12. **countries_annotators**: the self-reported country where the different annotators live.
13. **labels_task4**: a set of labels (one for each annotator) indicating whether the meme contains sexist expressions or refers to sexist behaviours. Possible values are "YES" and "NO".
14. **labels_task5**: a set of labels (one for each annotator) recording the intention of the person who created the meme. Possible labels are: "DIRECT", "JUDGEMENTAL", "-", and "UNKNOWN".
15. **labels_task6**: a set of arrays of labels (one array for each annotator) indicating the type or types of sexism found in the meme. Possible labels are: "IDEOLOGICAL-INEQUALITY", "STEREOTYPING-DOMINANCE", "OBJECTIFICATION", "SEXUAL-VIOLENCE", "MISOGYNY-NON-SEXUAL-VIOLENCE", "-", and "UNKNOWN".
16. **split**: subset within the dataset the meme belongs to ("TRAIN-MEME", "TRAIN-MEME" + "EN"/"ES").
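
To make the structure concrete, the sketch below builds a single record and loads it into a DataFrame. Every value in it is invented for illustration, only a subset of the attributes is shown, and it assumes that the file is a JSON object keyed by meme identifier with six annotators per meme. `pd.DataFrame.from_dict` is used here in place of reading a file so that the example is self-contained.

```python
import pandas as pd

# Hypothetical record: all values are invented, and only a few of the
# attributes described above are included.
records = {
    "000001": {
        "id_EXIST": "000001",
        "lang": "en",
        "text": "example text extracted from a meme",
        "labels_task4": ["YES", "YES", "NO", "YES", "NO", "YES"],
        "labels_task6": [
            ["OBJECTIFICATION"],
            ["OBJECTIFICATION", "SEXUAL-VIOLENCE"],
            ["-"],
            ["STEREOTYPING-DOMINANCE"],
            ["-"],
            ["OBJECTIFICATION"],
        ],
    }
}

# orient="index" turns each top-level key into one row of the DataFrame.
sample = pd.DataFrame.from_dict(records, orient="index")

# labels_task6 holds one array per annotator; this is the second annotator's.
print(sample.loc["000001", "labels_task6"][1])
```

Note that each cell of `labels_task6` is a list of lists, which is why it has to be aggregated before it can serve as a training target.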

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd

In [3]:
dataset = pd.read_json('/content/drive/MyDrive/I2C/Adrián Moreno/EXIST2024/Datasets/training/EXIST2024_training.json', orient='index')

# Drop the columns that carry no relevant information for this task
dataset = dataset.drop(columns=[
  'meme', 'path_memes', 'number_annotators', 'annotators',
  'gender_annotators', 'age_annotators', 'ethnicities_annotators',
  'study_levels_annotators', 'countries_annotators',
  'labels_task4', 'labels_task5', 'split', 'lang',
])

# Reset the default index, since it is not sequential
dataset = dataset.reset_index(drop=True)

# Add one column per sexism category, initialized to 0
dataset['IDEOLOGICAL-INEQUALITY'] = 0
dataset['STEREOTYPING-DOMINANCE'] = 0
dataset['OBJECTIFICATION'] = 0
dataset['SEXUAL-VIOLENCE'] = 0
dataset['MISOGYNY-NON-SEXUAL-VIOLENCE'] = 0

In [4]:
rango = dataset.shape[0]

# Category names; each one matches a column created in the previous cell
categories = [
  'IDEOLOGICAL-INEQUALITY',
  'STEREOTYPING-DOMINANCE',
  'OBJECTIFICATION',
  'SEXUAL-VIOLENCE',
  'MISOGYNY-NON-SEXUAL-VIOLENCE',
]

# Iterate over every meme in the dataset
for i in range(rango):
  tmp_row = dataset.iloc[i]['labels_task6']
  for category in categories:
    # Count how many of the six annotators assigned this category
    count = sum(1 for j in range(6) if category in tmp_row[j])
    # A category chosen by two or more annotators is marked as positive
    if count > 1:
      dataset.loc[i, category] = 1

# The raw per-annotator labels are no longer needed
dataset = dataset.drop('labels_task6', axis=1)
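
The aggregation rule used in this cell (a category is positive when at least two annotators chose it) can also be written as a small function and checked on toy data. This is only a sketch: `majority_labels`, `min_votes` and the toy annotations are assumptions made for illustration, and unlike the cell above it counts every annotator array in the row instead of exactly six.

```python
import pandas as pd

CATEGORIES = [
    "IDEOLOGICAL-INEQUALITY",
    "STEREOTYPING-DOMINANCE",
    "OBJECTIFICATION",
    "SEXUAL-VIOLENCE",
    "MISOGYNY-NON-SEXUAL-VIOLENCE",
]


def majority_labels(annotations, min_votes=2):
    """Map one meme's per-annotator label arrays to a 0/1 flag per category."""
    return {
        category: int(sum(category in labels for labels in annotations) >= min_votes)
        for category in CATEGORIES
    }


# Toy data (invented): two memes, six annotators each.
toy = pd.DataFrame({
    "labels_task6": [
        [["OBJECTIFICATION"], ["OBJECTIFICATION"], ["-"], ["-"], ["SEXUAL-VIOLENCE"], ["-"]],
        [["-"], ["-"], ["-"], ["-"], ["IDEOLOGICAL-INEQUALITY"], ["-"]],
    ]
})

# One row of flags per meme, one column per category.
flags = pd.DataFrame(toy["labels_task6"].apply(majority_labels).tolist(), index=toy.index)
print(flags)
```

In the first toy meme only OBJECTIFICATION reaches two votes, so it is the only positive flag; a single vote, as for SEXUAL-VIOLENCE, is not enough.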

In [5]:
dataset

Unnamed: 0,id_EXIST,text,IDEOLOGICAL-INEQUALITY,STEREOTYPING-DOMINANCE,OBJECTIFICATION,SEXUAL-VIOLENCE,MISOGYNY-NON-SEXUAL-VIOLENCE
0,110001,2+2=5 MITO Albert Einstein tenía bajo rendimie...,1,0,0,0,0
1,110002,CUANDO UNA MUJER VA A LUCHAR POR SUS DERECHOS,1,0,0,0,1
2,110003,ІЯ ЕГЕЯ Е MOA ¿El Partido Republicano busca pe...,0,1,0,0,0
3,110004,"Paises que ""apoyan"" los derechos de la mujer A...",1,0,0,0,0
4,110005,Ya verás como este 8 de marzo hay uno que te s...,1,0,0,0,0
...,...,...,...,...,...,...,...
4039,212006,u gon act like a bitch u gon die like a bitch,0,0,0,1,1
4040,212007,SHE LOOKS LIKE EVERY OTHER BITCH LIKE makeamem...,0,0,1,0,1
4041,212008,YOURE A BASIC BITCH CASE DISMISSED,0,1,1,0,0
4042,212009,WHEN YOU'RE AUNT HAS THIS WEIRD ASS MAN AND SH...,0,0,0,0,0


In [6]:
dataset.to_csv('/content/drive/MyDrive/I2C/Adrián Moreno/EXIST2024/Datasets/training/EXIST2024_training_formated.csv', index=False)

## Statistics on the generated dataset

In [10]:
print('Distribution of class IDEOLOGICAL-INEQUALITY')
print(dataset['IDEOLOGICAL-INEQUALITY'].value_counts())

print('Distribution of class STEREOTYPING-DOMINANCE')
print(dataset['STEREOTYPING-DOMINANCE'].value_counts())

print('Distribution of class OBJECTIFICATION')
print(dataset['OBJECTIFICATION'].value_counts())

print('Distribution of class SEXUAL-VIOLENCE')
print(dataset['SEXUAL-VIOLENCE'].value_counts())

print('Distribution of class MISOGYNY-NON-SEXUAL-VIOLENCE')
print(dataset['MISOGYNY-NON-SEXUAL-VIOLENCE'].value_counts())

Distribution of class IDEOLOGICAL-INEQUALITY
IDEOLOGICAL-INEQUALITY
0    2945
1    1099
Name: count, dtype: int64
Distribution of class STEREOTYPING-DOMINANCE
STEREOTYPING-DOMINANCE
0    2732
1    1312
Name: count, dtype: int64
Distribution of class OBJECTIFICATION
OBJECTIFICATION
0    2833
1    1211
Name: count, dtype: int64
Distribution of class SEXUAL-VIOLENCE
SEXUAL-VIOLENCE
0    3475
1     569
Name: count, dtype: int64
Distribution of class MISOGYNY-NON-SEXUAL-VIOLENCE
MISOGYNY-NON-SEXUAL-VIOLENCE
0    3590
1     454
Name: count, dtype: int64
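
The raw counts are easier to compare as proportions. The sketch below copies the positive counts from the output above and divides them by the 4,044 training memes; the variable names are assumptions made for illustration.

```python
# Positive counts copied from the value_counts() output above.
TOTAL = 4044
positives = {
    "IDEOLOGICAL-INEQUALITY": 1099,
    "STEREOTYPING-DOMINANCE": 1312,
    "OBJECTIFICATION": 1211,
    "SEXUAL-VIOLENCE": 569,
    "MISOGYNY-NON-SEXUAL-VIOLENCE": 454,
}

# Share of memes labelled positive for each category.
rates = {name: count / TOTAL for name, count in positives.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

Every category is a minority class, and SEXUAL-VIOLENCE and MISOGYNY-NON-SEXUAL-VIOLENCE are noticeably rarer than the other three.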
