The repository contains two Nepali datasets: a hate-word lexicon and a collection of social media comments.

The file "nepali_hate_lexicon.csv" contains Nepali words, each with categorical labels. We will use this file for data cleaning.

We will then apply transformations to "HateSpeechDatasetNep.csv", which contains Nepali comments from social media.

In [1]:
import pandas as pd

lexicon = pd.read_csv("../Data/nepali_hate_lexicon.csv")
comments = pd.read_csv("../Data/HateSpeechDatasetNep.csv")

print(lexicon.head())
print(comments.head())

             RawRom          RawNep         NormNep           NormRom  \
0         adhinayak         अधिनायक         अधिनायक         adhinayak   
1  adhinayak tantra  अधिनायक तन्त्र  अधिनायक तन्त्र  adhinayak tantra   
2     adhinayakatwa      अधिनायकत्व      अधिनायकत्व     adhinayakatwa   
3      adhinayakbad      अधिनायकवाद      अधिनायकवाद      adhinayakbad   
4     adhinayakbadi     अधिनायकवादी     अधिनायकवादी     adhinayakbadi   

   Offensiveness  IsTaboo     Class  
0              1        0  Politics  
1              1        0  Politics  
2              1        0  Politics  
3              1        0  Politics  
4              1        0  Politics  
                        Nepali_Text  binary_label
0                       ‡§≠‡§æ‡§ï ‡§≠‡•ã‡§∏‡§°‡•Ä‡§ï‡•á             1
1        

Let's start with the lexicon dataset.

We first identify all unique values in the Class column, then keep only the NormNep and Class columns to build a new, smaller dataset.

In this dataset, each word is assigned to one of five categories: [Insult, Violence, Sexual, Racism, Religious].

In [2]:
unique_classes = lexicon["Class"].unique()
print(unique_classes)

['Politics' 'Disability' 'Other' 'Marriage' 'Relation' 'Animal' 'Vulgar'
 'Gender' 'Action' 'Excretion' 'Idiom' 'Plant' 'Race' 'Disease' 'Cloth'
 'Body Part' 'Death' 'Weapon' 'Religion' 'Geographic Location']


In [3]:
category_map = {
    # Insults
    "Politics": "Insult",
    "Disability": "Insult",
    "Other": "Insult",
    "Animal": "Insult",
    "Vulgar": "Insult",
    "Excretion": "Insult",
    "Disease": "Insult",
    "Cloth": "Insult",
    "Body Part": "Insult",
    "Idiom": "Insult",

    # Sexual
    "Marriage": "Sexual",
    "Relation": "Sexual",
    "Gender": "Sexual",

    # Violence
    "Action": "Violence",
    "Death": "Violence",
    "Weapon": "Violence",

    # Racism
    "Race": "Racism",
    "Geographic Location": "Racism",
    "Plant": "Racism",

    # Religious
    "Religion": "Religious"
}


In [4]:
lexicon_small = lexicon[["NormNep", "Class"]].copy()
lexicon_small.head()


Unnamed: 0,NormNep,Class
0,अधिनायक,Politics
1,अधिनायक तन्त्र,Politics
2,अधिनायकत्व,Politics
3,अधिनायकवाद,Politics
4,अधिनायकवादी,Politics


In [5]:
lexicon_small["Category"] = lexicon_small["Class"].map(category_map)


In [6]:
lexicon_small["Category"].value_counts(dropna=False)


Category
Insult       802
Racism        99
Sexual        88
Violence      76
Religious     12
Name: count, dtype: int64
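
`Series.map` returns NaN for any value that has no key in the dictionary, which is why the counts above are taken with `dropna=False`. No NaN row appears, so all 20 classes were mapped. If the lexicon gains a new class later, a small check like the one below could report it before any words are silently dropped. This is only a sketch: the helper name `unmapped_classes` and the class "Food" are hypothetical and do not come from the lexicon.

```python
def unmapped_classes(classes, mapping):
    """Return the sorted class names that have no entry in `mapping`."""
    return sorted(set(classes) - set(mapping))


# Toy subset of the mapping above, plus one hypothetical class name ("Food")
toy_map = {"Politics": "Insult", "Weapon": "Violence", "Religion": "Religious"}
toy_classes = ["Politics", "Weapon", "Food", "Religion", "Politics"]

print(unmapped_classes(toy_classes, toy_map))  # ['Food']
```

In the notebook, `unmapped_classes(lexicon["Class"], category_map)` would be expected to return an empty list.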

In [7]:
# Drop the 'Class' column
lexicon_small = lexicon_small.drop(columns=["Class"])



In [8]:
# Preview the result
lexicon_small.head()

Unnamed: 0,NormNep,Category
0,अधिनायक,Insult
1,अधिनायक तन्त्र,Insult
2,अधिनायकत्व,Insult
3,अधिनायकवाद,Insult
4,अधिनायकवादी,Insult


Now we turn to the comments dataset. Here, we identify all rows with [binary_label = 1], meaning they are labelled as hate speech. For those rows, each category column is set to 1 if the comment contains a lexicon word from that category and to 0 otherwise. Rows with binary_label = 0 keep 0 in every category column.

In [9]:
# Create new dataset
NepaliComments = pd.DataFrame()
NepaliComments["Comment"] = comments["Nepali_Text"].str.strip()  # remove extra spaces
NepaliComments["Hate/NoHate"] = comments["binary_label"]

In [10]:
# Initialize category columns to 0
categories = ["Insult", "Violence", "Sexual", "Racism", "Religious"]
for cat in categories:
    NepaliComments[cat] = 0

In [11]:
# Create dictionary of category: set of words for faster lookup
category_words = {}
for cat in categories:
    words = lexicon_small[lexicon_small["Category"] == cat]["NormNep"].tolist()
    category_words[cat] = set(words)

In [12]:
# Function to tokenize and check for presence of lexicon words
def flag_categories(comment):
    tokens = comment.strip().split()  # split by whitespace
    flags = {cat: 0 for cat in categories}
    for cat, words_set in category_words.items():
        for token in tokens:
            if token in words_set:
                flags[cat] = 1
                break  # mark only once per category
    return pd.Series(flags)
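
One limitation of comparing single tokens against the lexicon: some entries contain more than one word (for example `adhinayak tantra` in the lexicon preview), and a whitespace-split token can never equal such an entry. A possible workaround is to also test runs of consecutive tokens. The sketch below is an assumption, not part of the original pipeline; `contains_entry` and `max_words` are hypothetical names.

```python
def contains_entry(tokens, entries, max_words=3):
    """True if any run of 1..max_words consecutive tokens is a lexicon entry."""
    for n in range(1, max_words + 1):
        for start in range(len(tokens) - n + 1):
            if " ".join(tokens[start:start + n]) in entries:
                return True
    return False


entries = {"अधिनायक तन्त्र"}  # two-word entry taken from the lexicon preview

print(contains_entry("अधिनायक तन्त्र".split(), entries))  # True
print(contains_entry(["अधिनायक"], entries))               # False
```

Inside `flag_categories`, the inner token loop could then be replaced by a single call such as `contains_entry(tokens, words_set)` per category.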


In [13]:
# Apply the function ONLY to comments where Hate/NoHate = 1
hate_mask = NepaliComments["Hate/NoHate"] == 1

# Flag categories for the hate-labelled rows only and write the results back
NepaliComments.loc[hate_mask, categories] = NepaliComments.loc[hate_mask, "Comment"].apply(flag_categories)

# Preview the final dataset
NepaliComments.head()

Unnamed: 0,Comment,Hate/NoHate,Insult,Violence,Sexual,Racism,Religious
0,भाक भोसडीके,1,0,0,0,0,0
1,पिक्चर स्पीक्स वाक मुजी,1,1,0,0,0,0
2,हे मुजी....,1,0,0,0,0,0
3,लस्तैइ ह्यान्डसम के मुजी 🤣❤️❤️❤️,1,1,0,0,0,0
4,आयो एमजी को कडा पर्सुस्तति♥️♥️♥️,1,0,0,0,0,0


In [14]:
# Category columns
categories = ["Insult", "Violence", "Sexual", "Racism", "Religious"]

# Number of non-null values in each column
print("Count in each column:")
print(NepaliComments.count())

# Number of 1s in each label column
print("Count of 1s in each column:")
print(NepaliComments[["Hate/NoHate"] + categories].sum())

# Rows where Hate/NoHate = 0 but at least one category = 1
rows_hateno_category_yes = NepaliComments[
    (NepaliComments["Hate/NoHate"] == 0) & (NepaliComments[categories].sum(axis=1) > 0)
]
print(f"\nNumber of rows where Hate/NoHate=0 but at least one category=1: {len(rows_hateno_category_yes)}")
rows_hateno_category_yes.head()

# Rows where Hate/NoHate = 1 but all categories = 0
rows_hate_yes_category_no = NepaliComments[
    (NepaliComments["Hate/NoHate"] == 1) & (NepaliComments[categories].sum(axis=1) == 0)
]
print(f"\nNumber of rows where Hate/NoHate=1 but all categories=0: {len(rows_hate_yes_category_no)}")
rows_hate_yes_category_no.head()


Count in each column:
Comment        11681
Hate/NoHate    11681
Insult         11681
Violence       11681
Sexual         11681
Racism         11681
Religious      11681
dtype: int64
Count of 1s in each column:
Hate/NoHate    3542
Insult         2218
Violence         55
Sexual          636
Racism           87
Religious        10
dtype: int64

Number of rows where Hate/NoHate=0 but at least one category=1: 0

Number of rows where Hate/NoHate=1 but all categories=0: 1080


Unnamed: 0,Comment,Hate/NoHate,Insult,Violence,Sexual,Racism,Religious
0,भाक भोसडीके,1,0,0,0,0,0
2,हे मुजी....,1,0,0,0,0,0
4,आयो एमजी को कडा पर्सुस्तति♥️♥️♥️,1,0,0,0,0,0
7,बोका🤣😂🤣,1,0,0,0,0,0
12,जिन्दगीमा भेज म:म पनि मिठो मानेर खाने दिन आयो😂😆😅,1,0,0,0,0,0
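
Of the 3542 hate-labelled comments, 1080 received no category. In several of the rows previewed above (indices 2, 4, 7 and 12), punctuation or emoji is attached directly to a word, so `split()` yields a token that cannot equal any lexicon entry. Whether the cleaned words are actually in the lexicon is not something this output shows, but a tokenizer that strips such characters would at least rule this cause out. The sketch below is an assumption rather than the notebook's method; `tokenize` is a hypothetical helper. It uses Unicode categories instead of the regex `\w`, because Devanagari vowel signs are combining marks and `\w` may split words at them.

```python
import unicodedata


def tokenize(comment):
    """Split on whitespace after turning punctuation, symbols and emoji into spaces."""
    chars = []
    for ch in comment:
        # Emoji variation selectors are category Mn, so exclude them explicitly
        is_variation_selector = "\ufe00" <= ch <= "\ufe0f"
        # Keep letters (L*), combining marks (M*) and digits (N*)
        if unicodedata.category(ch)[0] in "LMN" and not is_variation_selector:
            chars.append(ch)
        else:
            chars.append(" ")
    return "".join(chars).split()


print(tokenize("बोका🤣😂🤣"))      # ['बोका']
print(tokenize("दिन आयो😂😆😅"))  # ['दिन', 'आयो']
```

Using `tokenize(comment)` in place of `comment.strip().split()` inside `flag_categories` would be the only change needed to try this out.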


Now we save the dataset we created.

In [15]:
# Save the NepaliComments dataframe as Excel
output_path = "../Data/Nepali_With_labels.xlsx"
NepaliComments.to_excel(output_path, index=False)

print(f"NepaliComments dataset saved successfully at: {output_path}")

NepaliComments dataset saved successfully at: ../Data/Nepali_With_labels.xlsx


Now that "Nepali_With_labels.xlsx" and "Multi-labelBengali_Hate_Specch_Dataset.xlsx" both exist, let's merge them.

In [16]:
# Load Nepali dataset
nepali_df = pd.read_excel("../Data/Nepali_With_labels.xlsx")

# Load Bengali dataset
bengali_df = pd.read_excel("../Data/Multi-labelBengali_Hate_Specch_Dataset.xlsx")

# Preview first few rows
print("Nepali dataset:")
display(nepali_df.head())

print("Bengali dataset:")
display(bengali_df.head())

Nepali dataset:


Unnamed: 0,Comment,Hate/NoHate,Insult,Violence,Sexual,Racism,Religious
0,भाक भोसडीके,1,0,0,0,0,0
1,पिक्चर स्पीक्स वाक मुजी,1,1,0,0,0,0
2,हे मुजी....,1,0,0,0,0,0
3,लस्तैइ ह्यान्डसम के मुजी 🤣❤️❤️❤️,1,1,0,0,0,0
4,आयो एमजी को कडा पर्सुस्तति♥️♥️♥️,1,0,0,0,0,0


Bengali dataset:


Unnamed: 0,Comment,Hate/NoHate,Insult,Violence,Sexual,Racism,Religious
0,ভারতের ষড়যন্ত্রের শিকার সাকিববাংলাদেশ ক্রিকেট ...,1,1,0,0,1,0
1,টিম ডংসো হইছে সাল থেকেওয়ার্ল্ড কাপে তার প্রমান...,0,0,0,0,0,0
2,এক এক যুগে বাংলাদেশ একেকটা হিরোকে ধ্বংস করছে,0,0,0,0,0,0
3,নটির পোলা পাপন সব তোর খেলা,1,1,0,1,0,0
4,বাংলাদেশের মানুষ সবাই গর্জে ওঠো পাপনের বিরুদ্ধ...,0,0,0,0,0,0


In [17]:
# Each dataset
print("Columns in Nepali dataset:", list(nepali_df.columns))
print(f"Total in Nepali Dataset: {len(nepali_df)}")
print("Columns in Bengali dataset:", list(bengali_df.columns))
print(f"Total in Bengali Dataset: {len(bengali_df)}")

# Check if columns are identical
if list(nepali_df.columns) == list(bengali_df.columns):
    print("\n✅ Columns match exactly. Safe to merge.")
else:
    print("\n⚠️ Columns do not match. We may need to reorder or rename columns.")


Columns in Nepali dataset: ['Comment', 'Hate/NoHate', 'Insult', 'Violence', 'Sexual', 'Racism', 'Religious']
Total in Nepali Dataset: 11681
Columns in Bengali dataset: ['Comment', 'Hate/NoHate', 'Insult', 'Violence', 'Sexual', 'Racism', 'Religious']
Total in Bengali Dataset: 35001

✅ Columns match exactly. Safe to merge.
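
The list comparison above also fails when the two files have the same columns in a different order, even though `pd.concat` aligns columns by name and would still merge them correctly. What does cause trouble is a column present in only one file, since the merged frame then gets NaN values for the other file's rows. A helper along these lines could separate the two cases; it is a sketch, and the name `align_columns` is hypothetical.

```python
import pandas as pd


def align_columns(df, reference_columns):
    """Return df with columns ordered like reference_columns.

    Raises ValueError if the two sets of column names differ.
    """
    missing = set(reference_columns) - set(df.columns)
    extra = set(df.columns) - set(reference_columns)
    if missing or extra:
        raise ValueError(
            f"Column mismatch: missing={sorted(missing)}, extra={sorted(extra)}"
        )
    return df[list(reference_columns)]


# Toy frame with the same columns in a different order
reference = ["Comment", "Hate/NoHate", "Insult"]
toy = pd.DataFrame({"Insult": [1], "Comment": ["x"], "Hate/NoHate": [1]})

print(list(align_columns(toy, reference).columns))
```

In the notebook, `align_columns(bengali_df, nepali_df.columns)` could be called before the merge.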


In [18]:
# Concatenate both datasets
hate_speech_df = pd.concat([nepali_df, bengali_df], ignore_index=True)

# Preview merged dataset
print("Merged dataset preview:")
display(hate_speech_df.head())
print(f"Total rows after merge: {len(hate_speech_df)}")


Merged dataset preview:


Unnamed: 0,Comment,Hate/NoHate,Insult,Violence,Sexual,Racism,Religious
0,भाक भोसडीके,1,0,0,0,0,0
1,पिक्चर स्पीक्स वाक मुजी,1,1,0,0,0,0
2,हे मुजी....,1,0,0,0,0,0
3,लस्तैइ ह्यान्डसम के मुजी 🤣❤️❤️❤️,1,1,0,0,0,0
4,आयो एमजी को कडा पर्सुस्तति♥️♥️♥️,1,0,0,0,0,0


Total rows after merge: 46682


In [19]:
# Save the merged dataset
output_path = "../Data/HateSpeechData.xlsx"
hate_speech_df.to_excel(output_path, index=False)

print(f"Merged dataset saved successfully at: {output_path}")

Merged dataset saved successfully at: ../Data/HateSpeechData.xlsx


The Data folder now contains "HateSpeechData.xlsx", which can be used for our project!