# Merging and Binary Label Conversion of Twitter and Instagram Data

This notebook prepares the Twitter and Instagram datasets for merging. Before the two sources can be combined, the labels in each dataset are converted to the same binary scheme (`cyberbullying` / `non-cyberbullying`). With a shared format, the data from both platforms can be integrated and analysed together.

## About the datasets used

Twitter Dataset
This dataset is used as a secondary data source in this project. It was obtained from [IEEE DataPort](https://ieee-dataport.org/open-access/fine-grained-balanced-cyberbullying-dataset). It consists of six files: five contain tweets of a single cyberbullying type each (age, ethnicity, gender, religion, and other), and one contains tweets that are not cyberbullying. To simplify our classification problem, we assign every cyberbullying type the single label `cyberbullying`, which turns the task into binary classification.
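
As a rough illustration of the conversion described above, the sketch below maps fine-grained categories to two labels. The category names and the toy rows are assumptions made for this example only; in the cells that follow, the label is instead assigned file by file.

```python
import pandas as pd

# Hypothetical category names, used only for illustration.
CYBERBULLYING_TYPES = {"age", "ethnicity", "gender", "religion", "other"}


def to_binary_label(category: str) -> str:
    """Map a fine-grained category to one of two labels."""
    if category in CYBERBULLYING_TYPES:
        return "cyberbullying"
    return "non-cyberbullying"


# Toy rows standing in for real tweets.
example = pd.DataFrame({
    "text": ["tweet a", "tweet b", "tweet c"],
    "category": ["age", "notcb", "religion"],
})
example["label"] = example["category"].map(to_binary_label)
print(example)
```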

## Twitter Dataset

In [None]:
# Import necessary libraries
import pandas as pd

In [None]:
# Helper for reading the Twitter text files (pandas is imported in the cell above)

def load_txt_dataset(txt_file_path):
    """
    Load a TXT dataset into a DataFrame.

    Parameters:
        txt_file_path (str): The path to the TXT file.

    Returns:
        pandas.DataFrame: The loaded dataset as a DataFrame.
    """
    # Read the file as tab-delimited text; header=None keeps the first line as data
    df = pd.read_csv(txt_file_path, delimiter='\t', header=None)
    return df

#### Loading age dataset

In [None]:
txt_file_path = "/content/drive/MyDrive/MSC Data science/Thesis/Twitter/8000age.txt"
df_age = load_txt_dataset(txt_file_path)

In [None]:
df_age

Unnamed: 0,0
0,Here at home. Neighbors pick on my family and ...
1,Being bullied at school: High-achieving boys u...
2,There was a girl in my class in 6th grade who ...
3,He’s probably a white gay kid from some suburb...
4,You are pushed ti resorting. Treating thr bull...
...,...
7987,This girl really tried to say I bullied her in...
7988,a bully at school who has been messing with me...
7989,I remember I wrote an entire song in 6th grade...
7990,I was not the Prom Queen. I was the bullied gi...


In [None]:
# Assign column names to the existing column and the new column
df_age.columns = ['text']
df_age['label'] = 'cyberbullying'

# Display the updated dataset
df_age

Unnamed: 0,text,label
0,Here at home. Neighbors pick on my family and ...,cyberbullying
1,Being bullied at school: High-achieving boys u...,cyberbullying
2,There was a girl in my class in 6th grade who ...,cyberbullying
3,He’s probably a white gay kid from some suburb...,cyberbullying
4,You are pushed ti resorting. Treating thr bull...,cyberbullying
...,...,...
7987,This girl really tried to say I bullied her in...,cyberbullying
7988,a bully at school who has been messing with me...,cyberbullying
7989,I remember I wrote an entire song in 6th grade...,cyberbullying
7990,I was not the Prom Queen. I was the bullied gi...,cyberbullying
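
The same three steps (load, name the column, add a constant label) are repeated for every Twitter file below. A minimal sketch of how they could be folded into one helper follows; `load_and_label` and the temporary demo file are hypothetical names introduced here, not part of the original notebook.

```python
import os
import tempfile

import pandas as pd


def load_and_label(path, label, delimiter="\t"):
    """Read a one-column text file and attach a constant label column."""
    df = pd.read_csv(path, delimiter=delimiter, header=None)
    df.columns = ["text"]   # same renaming step as in the cell above
    df["label"] = label     # every row in the file gets the same label
    return df


# Demo on a throwaway file so the sketch runs without Google Drive.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as fh:
        fh.write("first example tweet\nsecond example tweet\n")
    demo_df = load_and_label(demo_path, "cyberbullying")

print(demo_df)
```

With such a helper, each category would need a single call, for example `load_and_label(txt_file_path, 'cyberbullying')`.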


#### Loading ethnicity dataset

In [None]:
# loading ethnicity dataset
txt_file_path = "/content/drive/MyDrive/MSC Data science/Thesis/Twitter/8000ethnicity.txt"
df_ethnicity = load_txt_dataset(txt_file_path)

In [None]:
# Assign column names to the existing column and add a new column
df_ethnicity.columns = ['text']
df_ethnicity['label'] = 'cyberbullying'
df_ethnicity

Unnamed: 0,text,label
0,Hey dumb fuck celebs stop doing something for ...,cyberbullying
1,"Fuck u bitch RT @tayyoung_: FUCK OBAMA, dumb a...",cyberbullying
2,"@JoeBiden No Joe, YOU are the RACIST. They hav...",cyberbullying
3,when your truck looks dumb as fuck out trying ...,cyberbullying
4,That nigger food in the cafe today was disgusting,cyberbullying
...,...,...
7956,"Black ppl aren't expected to do anything, depe...",cyberbullying
7957,Turner did not withhold his disappointment. Tu...,cyberbullying
7958,I swear to God. This dumb nigger bitch. I have...,cyberbullying
7959,Yea fuck you RT @therealexel: IF YOURE A NIGGE...,cyberbullying


#### Loading gender dataset

In [None]:
# loading gender dataset
txt_file_path = "/content/drive/MyDrive/MSC Data science/Thesis/Twitter/8000gender.txt"
df_gender = load_txt_dataset(txt_file_path)

In [None]:
# Assign column names to the existing column and add a new column
df_gender.columns = ['text']
df_gender['label'] = 'cyberbullying'
df_gender

Unnamed: 0,text,label
0,rape is real..zvasiyana nema jokes about being...,cyberbullying
1,You never saw any celebrity say anything like ...,cyberbullying
2,"@ManhattaKnight I mean he's gay, but he uses g...",cyberbullying
3,RT @Raul_Novoa16: @AliciaBernardez @Alex_Aim @...,cyberbullying
4,Rape is rape. And the fact that I read one pos...,cyberbullying
...,...,...
7968,Any females that really know me know Ion even ...,cyberbullying
7969,RT @_chrisdowns_: #QuestionsForMen This one's ...,cyberbullying
7970,Sucks to have the smile wiped off your own fac...,cyberbullying
7971,"No. He said women choose to be gay, men don't....",cyberbullying


#### Loading non-cyberbullying dataset

In [None]:
df_noncb = pd.read_csv('/content/drive/MyDrive/MSC Data science/Thesis/Twitter/8000notcb.csv', header=None)

In [None]:
df_noncb

Unnamed: 0,0
0,"In other words #katandandre, your food was cra..."
1,Why is #aussietv so white? #MKR #theblock #ImA...
2,@XochitlSuckkks a classy whore? Or more red ve...
3,"@Jason_Gio meh. :P thanks for the heads up, b..."
4,@RudhoeEnglish This is an ISIS account pretend...
...,...
7995,I don't know what I want to wear#ugh
7996,Argh another round of instant restaurants....o...
7997,Teacher sets up new charity to tackle anti-gay...
7998,"I can barely tolerate Kat and Andre, Katie and..."


In [None]:
# Assign column names to the existing column and add a new column
df_noncb.columns = ['text']
df_noncb['label'] = 'non-cyberbullying'
df_noncb

Unnamed: 0,text,label
0,"In other words #katandandre, your food was cra...",non-cyberbullying
1,Why is #aussietv so white? #MKR #theblock #ImA...,non-cyberbullying
2,@XochitlSuckkks a classy whore? Or more red ve...,non-cyberbullying
3,"@Jason_Gio meh. :P thanks for the heads up, b...",non-cyberbullying
4,@RudhoeEnglish This is an ISIS account pretend...,non-cyberbullying
...,...,...
7995,I don't know what I want to wear#ugh,non-cyberbullying
7996,Argh another round of instant restaurants....o...,non-cyberbullying
7997,Teacher sets up new charity to tackle anti-gay...,non-cyberbullying
7998,"I can barely tolerate Kat and Andre, Katie and...",non-cyberbullying


#### Loading religion dataset

In [None]:
# Loading religion dataset
txt_file_path = "/content/drive/MyDrive/MSC Data science/Thesis/Twitter/8000religion.txt"
df_religion = load_txt_dataset(txt_file_path)

In [None]:
# Assign column names to the existing column and add a new column
df_religion.columns = ['text']
df_religion['label'] = 'cyberbullying'
df_religion

Unnamed: 0,text,label
0,"Sudeep, did she invite him though? No right? W...",cyberbullying
1,@discerningmumin Islam has never been a resist...,cyberbullying
2,"Boy, your comment about Journalists wanting to...",cyberbullying
3,@ShashiTharoor @INCIndia Hindus were and are g...,cyberbullying
4,White supremicists? How many do you know? Ther...,cyberbullying
...,...,...
7993,Can you imagine if Christians came together li...,cyberbullying
7994,So how to support justice from the initial pro...,cyberbullying
7995,RT @TRobinsonNewEra: If you harbour any doubts...,cyberbullying
7996,@dankmtl @PeaceNotHate_ One thing about Muslim...,cyberbullying


#### Loading other dataset

In [None]:
# Loading other dataset
txt_file_path = "/content/drive/MyDrive/MSC Data science/Thesis/Twitter/8000other.txt"
df_other = load_txt_dataset(txt_file_path)

In [None]:
# Assign column names to the existing column and add a new column
df_other.columns = ['text']
df_other['label'] = 'cyberbullying'
df_other

Unnamed: 0,text,label
0,"@ikralla fyi, it looks like I was caught by it...",cyberbullying
1,I need to just switch to an organization-based...,cyberbullying
2,RMAed my monoprice. Shoddy power bricks on tho...,cyberbullying
3,@murphy_slaw https://t.co/M8w8xnUnDL,cyberbullying
4,@1Life0Continues i've got the code to interpre...,cyberbullying
...,...,...
7818,"@kufr666 @blockbot no, that's @oolon",cyberbullying
7819,@AriMelber why are you giving these idiots air...,cyberbullying
7820,I am right now watching Enforcers defend Chums...,cyberbullying
7821,✨✨✨ misandry is not a word iOS can autocomplet...,cyberbullying


### Concatenating twitter dataset

In [None]:
# Concatenate the six Twitter DataFrames into one
twitter_df = pd.concat([df_age, df_ethnicity, df_gender, df_noncb, df_religion, df_other])

In [None]:
twitter_df

Unnamed: 0,text,label
0,Here at home. Neighbors pick on my family and ...,cyberbullying
1,Being bullied at school: High-achieving boys u...,cyberbullying
2,There was a girl in my class in 6th grade who ...,cyberbullying
3,He’s probably a white gay kid from some suburb...,cyberbullying
4,You are pushed ti resorting. Treating thr bull...,cyberbullying
...,...,...
7818,"@kufr666 @blockbot no, that's @oolon",cyberbullying
7819,@AriMelber why are you giving these idiots air...,cyberbullying
7820,I am right now watching Enforcers defend Chums...,cyberbullying
7821,✨✨✨ misandry is not a word iOS can autocomplet...,cyberbullying
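
Note that the output above ends at index 7821 even though the frame holds all six files: by default `pd.concat` keeps each input's own index, so index values repeat. Five of the six files are also labelled `cyberbullying`, so the two classes are not balanced. The sketch below, on toy frames that stand in for the real ones, shows one way to get a unique index and check the label counts.

```python
import pandas as pd

# Toy frames standing in for the per-category DataFrames above.
part_a = pd.DataFrame({"text": ["a1", "a2"], "label": "cyberbullying"})
part_b = pd.DataFrame({"text": ["b1"], "label": "non-cyberbullying"})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating
# each part's original index.
combined = pd.concat([part_a, part_b], ignore_index=True)

print(combined.index.is_unique)
print(combined["label"].value_counts())
```

Because the CSV below is written with `index=False`, the repeated index does not reach the saved file; it only matters for operations on `twitter_df` itself, such as label-based row selection.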


In [None]:
twitter_df.label.unique()

array(['cyberbullying', 'non-cyberbullying'], dtype=object)

In [None]:
# Save the combined Twitter dataset to CSV
twitter_df.to_csv('twitter.csv',index=False)

## Instagram Dataset

We will read through all the Instagram datasets and combine them into one unified dataset. Additionally, we will rename the columns to match the ones in the Twitter dataset. This will ensure consistency and make it easier to merge the data during the modeling process.

In [None]:
df1 = pd.read_csv('/content/drive/MyDrive/MSC Data science/Thesis/Instagram-10klabeled/Instagram-10klabeled/0-500.csv')

In [None]:
df1

Unnamed: 0.1,Unnamed: 0,label,idx,comment
0,0,F,0,Cool nail but I dont bite on mine (created at:...
1,1,T,0,@offthis_ way to be a asshole (created at:2012...
2,2,T,0,They are ugly (created at:2013-01-09 15:35:17)
3,3,F,0,T
4,4,F,0,Wooow (created at:2013-02-13 21:56:42)
...,...,...,...,...
495,495,F,15,Anwar Sadat (created at:2012-04-30 21:30:10)
496,496,F,15,s'ok Earl .. you embracing the animal lover in...
497,497,F,15,http://instagr.am/p/IJ8GIeM--s/ (created at:20...
498,498,F,15,Earl are you a animal hoarder you have lots of...


In [None]:
df2 = pd.read_csv('/content/drive/MyDrive/MSC Data science/Thesis/Instagram-10klabeled/Instagram-10klabeled/1000-1500.csv',encoding = "ISO-8859-1")
df3 = pd.read_csv('/content/drive/MyDrive/MSC Data science/Thesis/Instagram-10klabeled/Instagram-10klabeled/1500-2000.csv')
df4 = pd.read_csv('/content/drive/MyDrive/MSC Data science/Thesis/Instagram-10klabeled/Instagram-10klabeled/2000-2500.csv',encoding = "ISO-8859-1")
df5 = pd.read_csv('/content/drive/MyDrive/MSC Data science/Thesis/Instagram-10klabeled/Instagram-10klabeled/2500-3000.csv')
df6 = pd.read_csv('/content/drive/MyDrive/MSC Data science/Thesis/Instagram-10klabeled/Instagram-10klabeled/3000-3500.csv', encoding = "ISO-8859-1")
df7 = pd.read_csv('/content/drive/MyDrive/MSC Data science/Thesis/Instagram-10klabeled/Instagram-10klabeled/3500-4000.csv')
df8 = pd.read_csv('/content/drive/MyDrive/MSC Data science/Thesis/Instagram-10klabeled/Instagram-10klabeled/4000-4500.csv')
df9 = pd.read_csv('/content/drive/MyDrive/MSC Data science/Thesis/Instagram-10klabeled/Instagram-10klabeled/4500-5000.csv')
df10 = pd.read_csv('/content/drive/MyDrive/MSC Data science/Thesis/Instagram-10klabeled/Instagram-10klabeled/5000-5500.csv',encoding = "ISO-8859-1")
df11 = pd.read_csv('/content/drive/MyDrive/MSC Data science/Thesis/Instagram-10klabeled/Instagram-10klabeled/5500-6000.csv')
df12 = pd.read_csv('/content/drive/MyDrive/MSC Data science/Thesis/Instagram-10klabeled/Instagram-10klabeled/6000-6500.csv', encoding = "ISO-8859-1")
df13 = pd.read_csv('/content/drive/MyDrive/MSC Data science/Thesis/Instagram-10klabeled/Instagram-10klabeled/6500-7000.csv',encoding = "ISO-8859-1")
df14 = pd.read_csv('/content/drive/MyDrive/MSC Data science/Thesis/Instagram-10klabeled/Instagram-10klabeled/7000-7500.csv')
df15 = pd.read_csv('/content/drive/MyDrive/MSC Data science/Thesis/Instagram-10klabeled/Instagram-10klabeled/7500-8000.csv',encoding = "ISO-8859-1")
df16 = pd.read_csv('/content/drive/MyDrive/MSC Data science/Thesis/Instagram-10klabeled/Instagram-10klabeled/8000-8500.csv',encoding = "ISO-8859-1")
df17 = pd.read_csv('/content/drive/MyDrive/MSC Data science/Thesis/Instagram-10klabeled/Instagram-10klabeled/8500-9000.csv',encoding = "ISO-8859-1")
df18 = pd.read_csv('/content/drive/MyDrive/MSC Data science/Thesis/Instagram-10klabeled/Instagram-10klabeled/9000-9500.csv',encoding = "ISO-8859-1")
df19 = pd.read_csv('/content/drive/MyDrive/MSC Data science/Thesis/Instagram-10klabeled/Instagram-10klabeled/9500-10000.csv',encoding = "ISO-8859-1")
df20 = pd.read_csv('/content/drive/MyDrive/MSC Data science/Thesis/Instagram-10klabeled/Instagram-10klabeled/500-1000.csv')
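
Some of the files above need `encoding="ISO-8859-1"` while others load with the default. As a sketch of an alternative, the helper below tries a list of encodings in turn so the files could be read in a loop. `read_csv_with_fallback` and the demo files are hypothetical and exist only for this example.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "ISO-8859-1")):
    """Try each encoding in turn and return the first successful read."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            last_error = err   # remember the failure and try the next encoding
    raise last_error


# Two throwaway files, one per encoding, so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    utf8_path = os.path.join(tmp, "utf8_demo.csv")
    latin_path = os.path.join(tmp, "latin_demo.csv")
    with open(utf8_path, "w", encoding="utf-8") as fh:
        fh.write("label,comment\nF,nice photo\n")
    with open(latin_path, "w", encoding="ISO-8859-1") as fh:
        fh.write("label,comment\nT,caf\u00e9 sucks\n")
    frames = [read_csv_with_fallback(p) for p in (utf8_path, latin_path)]

demo_insta = pd.concat(frames, ignore_index=True)
print(demo_insta)
```

One caveat of this design: ISO-8859-1 can decode any byte sequence, so it should stay last in the list, and a successful read does not prove the encoding was the right one.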

### Concatenating Instagram dataset

In [None]:
# Concatenate the Instagram DataFrames into one
insta_df = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12, df13, df14, df15, df16, df17, df18, df19, df20])

In [None]:
insta_df

Unnamed: 0.1,Unnamed: 0,label,idx,comment,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12
0,0.0,F,0.0,Cool nail but I dont bite on mine (created at:...,,,,,,,,,
1,1.0,T,0.0,@offthis_ way to be a asshole (created at:2012...,,,,,,,,,
2,2.0,T,0.0,They are ugly (created at:2013-01-09 15:35:17),,,,,,,,,
3,3.0,F,0.0,T,,,,,,,,,
4,4.0,F,0.0,Wooow (created at:2013-02-13 21:56:42),,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
994,,,,,,,,,,,,,
995,,,,,,,,,,,,,
996,,,,,,,,,,,,,
997,,,,,,,,,,,,,


In [None]:
# Get the list of column names containing 'Unnamed'
unnamed_columns = [col for col in insta_df.columns if 'Unnamed' in col]

# Drop the columns with 'Unnamed' from the DataFrame
df = insta_df.drop(columns=unnamed_columns)


In [None]:
df

Unnamed: 0,label,idx,comment
0,F,0.0,Cool nail but I dont bite on mine (created at:...
1,T,0.0,@offthis_ way to be a asshole (created at:2012...
2,T,0.0,They are ugly (created at:2013-01-09 15:35:17)
3,F,0.0,T
4,F,0.0,Wooow (created at:2013-02-13 21:56:42)
...,...,...,...
994,,,
995,,,
996,,,
997,,,


In [None]:
# Drop rows with null values
df = df.dropna()

In [None]:
instagram_df = df.drop(columns=['idx'])

In [None]:
# Rename the 'comment' column to 'text'
instagram_df = instagram_df.rename(columns={'comment': 'text'})

In [None]:
# Specify the new order of column names
new_column_order = ['text', 'label']

# Reorder the columns in the DataFrame
instagram_df = instagram_df[new_column_order]

instagram_df

Unnamed: 0,text,label
0,Cool nail but I dont bite on mine (created at:...,F
1,@offthis_ way to be a asshole (created at:2012...,T
2,They are ugly (created at:2013-01-09 15:35:17),T
3,T,F
4,Wooow (created at:2013-02-13 21:56:42),F
...,...,...
495,Jesus* (created at:2013-02-08 21:23:53),F
496,How do you repost (created at:2013-02-16 22:42...,F
497,I do t know how to repost but if I did I would...,F
498,Screen shot x (created at:2013-02-22 10:47:34),F


In [None]:
# Define the mapping of values to replace
label_mapping = {
    'T': 'cyberbullying',
    'F': 'non-cyberbullying',
    ' T': 'cyberbullying',
    'FF': 'non-cyberbullying',
    'Y': 'cyberbullying',
}

# Replace the values in the 'label' column
instagram_df['label'] = instagram_df['label'].replace(label_mapping)

In [None]:
# Inspect the rows whose original label is 'Y' (df still holds the unmapped labels)
filtered_df = df[df['label'] == 'Y']

In [None]:
filtered_df

Unnamed: 0,label,idx,comment
302,Y,79.0,@sdotditomo fuck u babin's a fucking boss u bi...
415,Y,138.0,@cdiaz37 actually I knew almost every on my ig...


In [None]:
instagram_df.label.unique()

array(['non-cyberbullying', 'cyberbullying'], dtype=object)
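
The mapping above lists each label variant by hand (`' T'`, `'FF'`, `'Y'`). `replace` leaves any value that is not in the mapping unchanged, so a stray label would pass through silently. As a sketch of a stricter variant, the example below strips whitespace first and uses `map`, which turns unknown labels into `NaN` so they can be listed. The toy rows and the stray value `"?"` are made up for illustration.

```python
import pandas as pd

raw = pd.DataFrame({
    "text": ["c1", "c2", "c3", "c4", "c5", "c6"],
    "label": ["T", "F", " T", "FF", "Y", "?"],  # "?" is a made-up stray value
})

label_mapping = {
    "T": "cyberbullying",
    "F": "non-cyberbullying",
    "FF": "non-cyberbullying",
    "Y": "cyberbullying",
}

cleaned = raw["label"].str.strip()     # " T" becomes "T"
mapped = cleaned.map(label_mapping)    # labels missing from the mapping become NaN
unmapped = raw.loc[mapped.isna(), "label"].unique().tolist()
print(unmapped)
```

An empty `unmapped` list would confirm that every label in the data is covered by the mapping.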

In [None]:
instagram_df.to_csv('instagram.csv',index=False)
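
Both datasets now share the columns `text` and `label`, so they can be stacked directly. The sketch below shows one possible way to do this on toy data; the `platform` and `target` columns are assumptions added for illustration and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-ins for the contents of twitter.csv and instagram.csv.
twitter_demo = pd.DataFrame({"text": ["tweet"], "label": ["cyberbullying"]})
instagram_demo = pd.DataFrame({"text": ["comment"], "label": ["non-cyberbullying"]})

# 'platform' is a hypothetical extra column recording where each row came from.
merged = pd.concat(
    [
        twitter_demo.assign(platform="twitter"),
        instagram_demo.assign(platform="instagram"),
    ],
    ignore_index=True,
)

# Hypothetical numeric target: 1 for cyberbullying, 0 otherwise.
merged["target"] = (merged["label"] == "cyberbullying").astype(int)
print(merged)
```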