# Merge Datasets
### This notebook merges three individual datasets on sexist hate speech into a single labeled dataset.

#### Importing Required Libraries

In [1]:
import pandas as pd

#### Loading Individual Datasets

In [2]:
df_initial = pd.read_csv("/content/drive/MyDrive/markov/Initial_Dataset.csv")
df_isep = pd.read_excel("/content/drive/MyDrive/markov/ISEP Sexist Data labeling.xlsx")
df_sexism = pd.read_csv("/content/drive/MyDrive/markov/sexism_data.csv")

#### Analyzing the features of each dataset

In [3]:
df_initial.head()

Unnamed: 0,text,label_sexist
0,"Damn, this writing was pretty chaotic",not sexist
1,"Yeah, and apparently a bunch of misogynistic v...",not sexist
2,How the FUCK is this woman still an MP!!!???,not sexist
3,Understand. Know you're right. At same time I ...,not sexist
4,Surprized they didn't stop and rape some women,not sexist


In [4]:
df_isep.head()

Unnamed: 0,Sentences,Label
0,"For a woman, that is good",1
1,Just dress sexy and you will sign any contract.,1
2,Are you having period ?,1
3,Stop being bitchy,1
4,I leave early because of her children instead ...,1


In [5]:
df_sexism.head()

Unnamed: 0,id,dataset,text,toxicity,sexist,of_id
0,0,other,MENTION3481 i didn't even know random was an o...,0.11818,False,-1
1,1,other,Bottom two should've gone! #mkr,0.25185,False,-1
2,2,callme,MENTION3111 MENTION3424 ladyboner deserves so ...,0.113331,False,-1
3,3,other,She shall now be known as Sourpuss #MKR #KatAn...,0.531153,False,-1
4,4,other,Tarah W threw a bunch of women under the bus s...,0.118718,False,-1


#### Normalizing the datasets
We now normalize each dataset to the same structure as `df_initial`: a `text` column and a `label_sexist` column whose values are `sexist` or `not sexist`.

In [6]:
# normalizing df_isep
df_isep = df_isep.rename(columns={'Sentences': 'text', 'Label': 'label_sexist'})
df_isep['label_sexist'] = df_isep['label_sexist'].map({0: 'not sexist', 1: 'sexist'})
df_isep

Unnamed: 0,text,label_sexist
0,"For a woman, that is good",sexist
1,Just dress sexy and you will sign any contract.,sexist
2,Are you having period ?,sexist
3,Stop being bitchy,sexist
4,I leave early because of her children instead ...,sexist
...,...,...
1132,Opportunities are like night owls. They like t...,not sexist
1133,It's not just a question of doing what you lov...,not sexist
1134,A genius does what he masters. An ordinary man...,not sexist
1135,If you want to know the real reasons behind th...,not sexist


In [7]:
# Normalize df_sexism: keep only the text and label columns, then map boolean labels to strings
df_sexism = df_sexism.rename(columns={'sexist': 'label_sexist'})
df_sexism = df_sexism.drop(columns=['id', 'dataset', 'toxicity', 'of_id'])
df_sexism['label_sexist'] = df_sexism['label_sexist'].map({True: 'sexist', False: 'not sexist'})
df_sexism

Unnamed: 0,text,label_sexist
0,MENTION3481 i didn't even know random was an o...,not sexist
1,Bottom two should've gone! #mkr,not sexist
2,MENTION3111 MENTION3424 ladyboner deserves so ...,not sexist
3,She shall now be known as Sourpuss #MKR #KatAn...,not sexist
4,Tarah W threw a bunch of women under the bus s...,not sexist
...,...,...
13626,this reminds me of the MENTION3079 situation; ...,not sexist
13627,#mkr I love Annie and loyld there like a real ...,not sexist
13628,No u. http://t.co/zOr0eWahSS,not sexist
13629,#mkr the way kat looks at Annie is like she's ...,not sexist
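
Both normalization cells follow the same pattern: select the text and label columns, rename them to match `df_initial`, and map the raw label values to strings. A minimal sketch of that pattern as a reusable helper is shown here. The function name `normalize_labels` and the toy frame are hypothetical and not part of the notebook. Since `Series.map` returns `NaN` for any value missing from the mapping, the sketch raises an error when a label fails to map, rather than letting missing labels pass silently into the merge.

```python
import pandas as pd


def normalize_labels(df, text_col, label_col, label_map):
    """Return a two-column frame (text, label_sexist) with string labels.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # Copy only the two columns of interest so the input frame is untouched.
    out = df[[text_col, label_col]].copy()
    out.columns = ["text", "label_sexist"]
    # Series.map yields NaN for any value that is not a key of label_map.
    out["label_sexist"] = out["label_sexist"].map(label_map)
    unmapped = int(out["label_sexist"].isna().sum())
    if unmapped:
        raise ValueError(f"{unmapped} rows have labels missing from label_map")
    return out


# Toy data shaped like the ISEP sheet (values are made up for the example).
toy = pd.DataFrame({"Sentences": ["a", "b"], "Label": [1, 0]})
normalized = normalize_labels(
    toy, "Sentences", "Label", {0: "not sexist", 1: "sexist"}
)
print(normalized)
```

The same helper would accept the boolean mapping `{True: 'sexist', False: 'not sexist'}` for a frame shaped like `df_sexism`.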


#### Merging all datasets

In [13]:
df_master = pd.concat([df_initial, df_isep, df_sexism])
df_master.to_csv("/content/drive/MyDrive/markov/master_dataset.csv", sep=',', index=False)
df_master

Unnamed: 0,text,label_sexist
0,"Damn, this writing was pretty chaotic",not sexist
1,"Yeah, and apparently a bunch of misogynistic v...",not sexist
2,How the FUCK is this woman still an MP!!!???,not sexist
3,Understand. Know you're right. At same time I ...,not sexist
4,Surprized they didn't stop and rape some women,not sexist
...,...,...
13626,this reminds me of the MENTION3079 situation; ...,not sexist
13627,#mkr I love Annie and loyld there like a real ...,not sexist
13628,No u. http://t.co/zOr0eWahSS,not sexist
13629,#mkr the way kat looks at Annie is like she's ...,not sexist
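
By default `pd.concat` keeps each frame's original index, which appears to be why the merged output above ends at index 13629 even though it holds rows from all three datasets. The saved CSV is unaffected because it is written with `index=False`, but the in-memory `df_master` carries repeated index values. A minimal sketch on made-up toy frames shows the difference that `ignore_index=True` makes, along with a `value_counts` check of the class balance after merging. The frames `a` and `b` are hypothetical stand-ins for the normalized datasets.

```python
import pandas as pd

# Hypothetical stand-ins for two normalized datasets.
a = pd.DataFrame({"text": ["x", "y"], "label_sexist": ["sexist", "not sexist"]})
b = pd.DataFrame({"text": ["z"], "label_sexist": ["not sexist"]})

# Default behaviour: each frame keeps its own index, so values repeat.
kept = pd.concat([a, b])
# ignore_index=True discards the old indices and builds a fresh running index.
fresh = pd.concat([a, b], ignore_index=True)

print(kept.index.tolist())
print(fresh.index.tolist())

# Count rows per class to see how balanced the merged data is.
print(fresh["label_sexist"].value_counts())
```

A unique index matters mainly for later label-based lookups such as `df_master.loc[0]`, which would otherwise return one row per source dataset.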
