In [1]:
# Import necessary libraries
import pandas as pd

# Load dataset with specified encoding
try:
    df = pd.read_csv('Downloads/hate.csv (1)/hate.csv', encoding='utf-8')
except UnicodeDecodeError:
    df = pd.read_csv('Downloads/hate.csv (1)/hate.csv', encoding='latin1')

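# General dataset information
# Note: df.info() prints its report itself and returns None,
# so wrapping it in print() also prints a trailing "None".
print("Dataset Info:\n")
print(df.info())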

Dataset Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41144 entries, 0 to 41143
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  41144 non-null  int64 
 1   comment     41144 non-null  object
 2   label       41144 non-null  object
dtypes: int64(1), object(2)
memory usage: 964.4+ KB
None


In [2]:
# Display first few rows of the dataset
print("\nFirst few rows of the dataset:\n")
print(df.head())


First few rows of the dataset:

   Unnamed: 0                                            comment label
0           1                                dalits are lowlives     N
1           2             gay people are a burden to our society     N
2           3                              Arabs are not welcome     N
3           4  I'm not saying we should actually eliminate he...     N
4           5                       bananas are for black people     N


In [6]:
# Check for fully duplicated rows (compares all columns, including 'Unnamed: 0')
duplicates = df.duplicated().sum()
print("\nNumber of duplicate rows:", duplicates)


Number of duplicate rows: 0
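Because df.duplicated() compares all columns, the 'Unnamed: 0' column, which holds a different value in each of the rows shown earlier, can mask repeated comments. The following is a minimal sketch, on a small made-up DataFrame rather than the real data, of checking duplicates on the content columns only. Whether the actual dataset contains such repeats cannot be told from the output above.

```python
import pandas as pd

# Toy data (hypothetical, not from hate.csv): rows 0 and 2 repeat the same
# comment and label but carry different values in the ID-like column.
toy = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],
    "comment": ["some text", "other text", "some text"],
    "label": ["N", "P", "N"],
})

# Comparing all columns finds nothing, because the ID-like column differs.
full_row_dupes = int(toy.duplicated().sum())

# Restricting the comparison to the content columns reveals the repeat.
content_dupes = int(toy.duplicated(subset=["comment", "label"]).sum())

print("Full-row duplicates:", full_row_dupes)
print("Content duplicates:", content_dupes)
```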


In [9]:
# Keep only the rows labeled 'N' (the "positive" class in this notebook)
original_positives_df = df[df['label'] == 'N']
original_positives_df.shape

(22158, 3)

The cell above filters df down to the rows labeled N and stores them as original_positives_df. The next cell does the same for the other class: it creates a new DataFrame, original_negatives_df, containing only the rows of df where the value in the label column is P (indicating a Hate sentiment).

In both cells the shape attribute returns the dimensions of the filtered DataFrame as (rows, columns), so the first number is the count of samples of that class in the original dataset.

In [11]:
# Keep only the rows labeled 'P' (the "negative" class in this notebook)
original_negatives_df = df[df['label'] == 'P']
original_negatives_df.shape

(18950, 3)
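
The two filtered frames account for 22158 + 18950 = 41108 rows, while df.info() reported 41144 entries, so 36 rows carry a label that is not exactly 'N' or 'P'. What those rows contain cannot be told from the outputs shown. A hedged sketch, on made-up data with invented label variants, of how one might inspect them:

```python
import pandas as pd

# Toy data (hypothetical): two labels deviate from the expected 'N' / 'P'.
toy = pd.DataFrame({
    "comment": ["a", "b", "c", "d", "e"],
    "label": ["N", "P", "N", " P", "n"],
})

# Count every distinct label value to spot unexpected variants.
label_counts = toy["label"].value_counts()
print(label_counts)

# Rows whose label is not exactly 'N' or 'P'.
unexpected = toy[~toy["label"].isin(["N", "P"])]
print(unexpected)

# Row counts as reported in the notebook outputs above.
total_rows, n_rows, p_rows = 41144, 22158, 18950
unaccounted = total_rows - (n_rows + p_rows)
print("Rows with another label:", unaccounted)
```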

Negation Generation for Positive Examples

Negation generation is a technique used in natural language processing to create negated versions of existing text data. It is achieved by adding negation words such as "not" before words in a sentence. The resulting negated sentences can then be used to train models to better understand and handle negations in sentiment analysis tasks.
Negation generation can help improve the accuracy of sentiment analysis models, especially when negations are common in the text being analyzed. With negated examples in the training data, a model has the chance to learn how negation changes the meaning of a sentence.
Although the code block below imports the TextBlob library, the negation itself is done with plain string operations: every whitespace-separated token of each positive example is prefixed with "not". The negated examples are labeled as negative (P) and stored in a new DataFrame called neg_data. Because the code works on a copy, the original DataFrame of positive examples, original_positives_df, is not modified.

In [12]:
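from textblob import TextBlob  # imported but not used in this cell

# Generate negated versions of the positive examples;
# work on a copy so original_positives_df stays unchanged
neg_data = original_positives_df.copy()

# Prefix every whitespace-separated token with "not "
neg_data['comment'] = neg_data['comment'].apply(lambda x: " ".join(["not " + w for w in x.split()]))

# Label the negated examples as negative ('P')
neg_data['label'] = 'P'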

In [13]:
print(neg_data.shape)
neg_data.head(7)

(22158, 3)


Unnamed: 0.1,Unnamed: 0,comment,label
0,1,not dalits not are not lowlives,P
1,2,not gay not people not are not a not burden no...,P
2,3,not Arabs not are not not not welcome,P
3,4,not I'm not not not saying not we not should n...,P
4,5,not bananas not are not for not black not people,P
6,7,not women not can not not not reproduce not ki...,P
8,9,not Who not cares not what not Chinese not peo...,P
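
The output above shows that "not" is placed before every token, including tokens that are already negations (row 2 becomes "not Arabs not are not not not welcome"). As a point of comparison only, here is a minimal sketch that negates just the words found in a small list. The word list and the helper names negate_every_word and negate_listed_words are illustrative assumptions, not part of the notebook.

```python
# Hypothetical word list for illustration; a real project would need a
# proper lexicon.
TARGET_WORDS = {"welcome", "lovely", "good"}


def negate_every_word(text: str) -> str:
    """Reproduce the notebook's approach: prefix every token with 'not'."""
    return " ".join("not " + word for word in text.split())


def negate_listed_words(text: str, targets=TARGET_WORDS) -> str:
    """Prefix 'not' only to tokens that appear in the target list."""
    out = []
    for word in text.split():
        # Compare case-insensitively, ignoring surrounding punctuation.
        key = word.lower().strip(".,!?")
        out.append("not " + word if key in targets else word)
    return " ".join(out)


sample = "Visitors are welcome"
print(negate_every_word(sample))    # not Visitors not are not welcome
print(negate_listed_words(sample))  # Visitors are not welcome
```

Neither variant understands grammar; the second merely avoids stacking "not" on every token.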


Oversampling to Address Class Imbalance in a Dataset

In machine learning, class imbalance refers to a situation where the number of samples in one class is significantly lower than the number of samples in the other class. This is a common problem in many real-world datasets, such as fraud detection or medical diagnosis, where the number of negative examples is much larger than the number of positive examples.

Class imbalance can pose a challenge when building predictive models, as the model may become biased towards the majority class and perform poorly on the minority class. To address this problem, one common approach is to oversample the minority class to create a balanced dataset.

In [14]:
# Draw 18950 extra rows from the positive class (label 'N'), sampling with replacement
df_positive_oversampled = original_positives_df.sample(18950, replace=True)
df_positive_oversampled.shape

(18950, 3)

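Once the negated examples are added, the negative side holds 22158 negated rows plus 18950 original negative rows, 41108 in total, while the positive class (label N) has only 22158 rows. To close this gap, the code above draws 18950 additional positive rows by random sampling with replacement. Concatenated with the original positives in a later cell, they bring the positive class to 41108 rows as well, giving a balanced class distribution.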

After oversampling, the new dataset can be used to train a model that is less biased towards the majority class and can perform better on the minority class. However, oversampling can also lead to overfitting, so it is important to validate the model performance on an independent dataset to ensure generalization.
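
One way to act on the validation caveat above is to set aside a hold-out portion before oversampling, so that repeated rows never appear on both sides of the split. A minimal sketch under stated assumptions: the toy data, the 20% hold-out fraction, and the random_state values are illustrative choices, not taken from the notebook.

```python
import pandas as pd

# Toy imbalanced data (hypothetical): 40 rows of class 'P', 10 of class 'N'.
toy = pd.DataFrame({
    "comment": [f"text {i}" for i in range(50)],
    "label": ["P"] * 40 + ["N"] * 10,
})

# Hold out 20% of each class first, so evaluation rows are never duplicated.
test = toy.groupby("label").sample(frac=0.2, random_state=0)
train = toy.drop(test.index)

# Work out how many extra minority rows the training part needs.
counts = train["label"].value_counts()
minority_label = counts.idxmin()
deficit = int(counts.max() - counts.min())

# Oversample the minority class with replacement, in the training part only.
extra = train[train["label"] == minority_label].sample(
    n=deficit, replace=True, random_state=0
)
train_balanced = pd.concat([train, extra], ignore_index=True)

print(train_balanced["label"].value_counts())
print("Hold-out rows:", len(test))
```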

Concatenating Negated Examples with Original Negative Examples

In [15]:
# Negated positives (relabeled 'P') + original negatives
df_negative_oversampled = pd.concat([neg_data, original_negatives_df], ignore_index=True)
df_negative_oversampled.shape

(41108, 3)

Oversampling the Minority Class in an Imbalanced Dataset

In [16]:
# Resampled positives + original positives
df_positive_oversampled = pd.concat([df_positive_oversampled, original_positives_df], ignore_index=True)
df_positive_oversampled.shape

(41108, 3)

Balancing Imbalanced Datasets using Oversampling

In machine learning, having an imbalanced dataset can lead to poor performance of the model, as the model may be biased towards the majority class. One approach to tackle this issue is to use oversampling, which involves increasing the number of samples in the minority class to match the number of samples in the majority class.

In [18]:
# Combine the oversampled positive examples with the negative examples (negated + original)
df_balanced = pd.concat([df_positive_oversampled, df_negative_oversampled], ignore_index=True)
df_balanced.label.value_counts()

label
N    41108
P    41108
Name: count, dtype: int64
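
The 41108 rows per class follow directly from the shapes printed in the earlier cells. A short worked check, using only numbers that appear in the outputs above:

```python
# Row counts taken from the notebook outputs above.
original_positives = 22158   # label 'N'
original_negatives = 18950   # label 'P'

# Negative side: one negated copy of every positive row plus the originals.
negatives_total = original_positives + original_negatives

# Positive side: originals plus 18950 rows sampled with replacement.
sampled_positives = 18950
positives_total = original_positives + sampled_positives

print(negatives_total, positives_total)  # 41108 41108

# Share of each side that is resampled or generated rather than original.
print(round(sampled_positives / positives_total, 3))
print(round(original_positives / negatives_total, 3))
```

Roughly half of each class therefore consists of resampled or generated rows, which is worth keeping in mind when interpreting any later evaluation.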

In [19]:
df_balanced.sample(5)

Unnamed: 0.1,Unnamed: 0,comment,label
19675,1154,i wanna turn all black people white,N
19457,807,These white men are big as this whale,N
63179,151000,not Indians not have not no not conscience,P
68233,10093,"You are a lovely people, I'm not disputing tha...",P
58388,33774,not Personally not I not don't not see not a n...,P


In [20]:
import re

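The clean function below is a preprocessing step. It removes every character that is neither a word character (letter, digit, or underscore) nor whitespace, which strips punctuation and symbols from each comment before further analysis or processing is done.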

In [23]:
def clean(comment):
    # Drop every character that is neither a word character nor whitespace
    clean_text = re.sub(r'[^\w\s]', '', comment)
    return clean_text
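
To see what the pattern [^\w\s] keeps and drops, here is a minimal sketch that applies the same substitution to a made-up string. The example input is an assumption for illustration only.

```python
import re


def clean(comment):
    # Same substitution as the notebook's function: drop every character
    # that is neither a word character nor whitespace.
    return re.sub(r'[^\w\s]', '', comment)


sample = "Don't  stop, user_name! #tag 100%"
cleaned = clean(sample)
print(cleaned)  # Dont  stop user_name tag 100

# Optional extra step (not in the notebook): collapse repeated whitespace.
collapsed = " ".join(cleaned.split())
print(collapsed)  # Dont stop user_name tag 100
```

Note that apostrophes are removed too, so contractions such as "Don't" become "Dont", while underscores and the original spacing are left in place.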

In [25]:
df_balanced['comment'] = df_balanced.comment.apply(clean)
df_balanced.sample(6)

Unnamed: 0.1,Unnamed: 0,comment,label
16844,25940,illiterate and a Jocker you mum be incredible ...,N
73074,20495,Wait wait BLM is fascist because they want to...,P
75684,25908,New antismoking add is warning the younger gen...,P
35032,31168,its never going to end is it unless we the peo...,N
70350,14713,The only thing that you can expect women to be...,P
11486,6003,gay ppl ruin british culture,N
