In [1]:
import pandas as pd
import re
import string
import emoji
import spacy

nlp = spacy.load("es_core_news_sm")

For this project, we have two different datasets: 

* [MEX-A3T](https://sites.google.com/view/mex-a3t/) 
* Comments from Facebook

In this notebook, we add new features to both datasets and combine them into a single dataset. The new features let us label the different types of comments so that the data records explicitly whether a comment contains hate speech and whether that hate speech targets women. The following table shows the features we add to the combined dataset (MEX-A3T and Comments from Facebook):

| Comment type      |  Hate_Speech  |  Hate_Speech_Women |
|-------------------|:-------------:|:------------------:|
| No hate speech    |      0        |         0          |
| Hate speech       |      1        |         0          |
| Hate speech women |      1        |         1          |

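The encoding in the table can be sketched as a simple lookup. This is only an illustration: the label strings and the `encode_label` helper are hypothetical names and are not part of either dataset.

```python
# Hypothetical label names mirroring the rows of the table above.
LABEL_TO_FEATURES = {
    "No hate speech": {"Hate_Speech": 0, "Hate_Speech_Women": 0},
    "Hate speech": {"Hate_Speech": 1, "Hate_Speech_Women": 0},
    "Hate speech women": {"Hate_Speech": 1, "Hate_Speech_Women": 1},
}


def encode_label(label):
    """Return a copy of the two binary features for a comment type."""
    return dict(LABEL_TO_FEATURES[label])


for name in LABEL_TO_FEATURES:
    print(name, encode_label(name))
```

Note that the combination `Hate_Speech = 0`, `Hate_Speech_Women = 1` never occurs: hate speech against women is treated as a special case of hate speech.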
In [2]:
data_dir = '../data/'

In [3]:
df_inv = pd.read_csv(data_dir + 'clean_train_aggressiveness.csv')

In [4]:
df_fb = pd.read_csv(data_dir + 'clean_comentarios_facebook.csv', encoding='utf-8')

In [5]:
df_inv.Category.value_counts()

0    5222
1    2110
Name: Category, dtype: int64

In [6]:
df_fb.Category.value_counts()

0    1671
1     297
Name: Category, dtype: int64

In [7]:
df_inv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7332 entries, 0 to 7331
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Id        7332 non-null   int64 
 1   Category  7332 non-null   int64 
 2   Text      7323 non-null   object
dtypes: int64(2), object(1)
memory usage: 172.0+ KB


In [8]:
df_fb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1968 entries, 0 to 1967
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Text      1959 non-null   object
 1   Category  1968 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 30.9+ KB


We create the new features `Hate_Speech` and `Hate_Speech_Women`. For the MEX-A3T dataset, we copy `Category` to `Hate_Speech`, since in this dataset `Category` expresses whether the comment contains hate speech in general. We set `Hate_Speech_Women` to 0 for every entry, since this dataset does not contain hate speech targeted specifically at women.

In [9]:
df_inv['Hate_Speech'] = df_inv['Category']

In [10]:
df_inv['Hate_Speech_Women'] = 0

The dataset with comments from Facebook, on the other hand, does contain hate speech against women, and this is what its `Category` feature expresses. Hence, both new features are equivalent to `Category`.

In [11]:
df_fb['Hate_Speech'] = df_fb['Category']

In [12]:
df_fb['Hate_Speech_Women'] = df_fb['Category']

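As an optional sanity check, one could verify that every row flagged as `Hate_Speech_Women` is also flagged as `Hate_Speech`. The sketch below uses small invented frames (`toy_inv`, `toy_fb`) as stand-ins for `df_inv` and `df_fb`; the helper `labels_are_consistent` is a hypothetical name and is not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for df_inv and df_fb; texts and labels are invented.
toy_inv = pd.DataFrame({"Text": ["a", "b"], "Category": [0, 1]})
toy_fb = pd.DataFrame({"Text": ["c", "d"], "Category": [1, 0]})

# Same assignments as in the cells above.
toy_inv["Hate_Speech"] = toy_inv["Category"]
toy_inv["Hate_Speech_Women"] = 0
toy_fb["Hate_Speech"] = toy_fb["Category"]
toy_fb["Hate_Speech_Women"] = toy_fb["Category"]


def labels_are_consistent(frame):
    """True if every row with Hate_Speech_Women == 1 also has Hate_Speech == 1."""
    flagged = frame[frame["Hate_Speech_Women"] == 1]
    return bool((flagged["Hate_Speech"] == 1).all())


print(labels_are_consistent(toy_inv), labels_are_consistent(toy_fb))
```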
Finally, we combine the two datasets into one.

In [14]:
# Concatenate the two datasets
df = pd.concat([df_fb, df_inv])

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9300 entries, 0 to 7331
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Text               9282 non-null   object 
 1   Category           9300 non-null   int64  
 2   Hate_Speech        9300 non-null   int64  
 3   Hate_Speech_Women  9300 non-null   int64  
 4   Id                 7332 non-null   float64
dtypes: float64(1), int64(3), object(1)
memory usage: 435.9+ KB

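Two details of the output above follow from how `pd.concat` works. The Facebook frame has no `Id` column, so its rows receive `NaN` there and the column is upcast to `float64`. Each frame also keeps its own row labels, so the combined index runs from 0 to 7331 with repeated values. The sketch below reproduces both effects on invented toy frames and shows two optional variants (`ignore_index=True` and the nullable `Int64` dtype); these variants are suggestions, not something the notebook does.

```python
import pandas as pd

# Toy stand-ins; only toy_inv has an Id column, as in the real data.
toy_fb = pd.DataFrame({"Text": ["c", "d"], "Category": [1, 0]})
toy_inv = pd.DataFrame(
    {"Id": [10, 11, 12], "Category": [0, 1, 0], "Text": ["a", "b", "e"]}
)

# Plain concatenation, as in the notebook.
combined = pd.concat([toy_fb, toy_inv])
print(combined["Id"].dtype)      # missing Ids become NaN, so the column is float
print(combined.index.is_unique)  # original row labels are kept, so they repeat

# Optional variants (assumptions, not used in the notebook):
# renumber the rows and keep Id as a nullable integer column.
clean = pd.concat([toy_fb, toy_inv], ignore_index=True)
clean["Id"] = clean["Id"].astype("Int64")
print(clean.dtypes)
```

Since the notebook saves the result with `index=False`, the repeated index labels do not reach the CSV file.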

In [16]:
df.to_csv(data_dir + "join_clean_train_aggressiveness_comentarios_facebook_new_categories.csv", index=False)