We will follow these steps:

1. Load the dataset and inspect its contents.

2. Tokenize the text using nltk.

3. Encode the text data using TF-IDF.

1: Import Libraries and Load Dataset

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Load the dataset
file_path = '/content/HateSpeechDetection (preprocessed).csv'
df = pd.read_csv(file_path)

# Display the first few rows of the dataset
print("First few rows of the dataset:")
print(df.head())

# Display column names to verify
print("\nColumn names in the dataset:")
print(df.columns)

# Display basic information about the dataset
print("\nBasic information about the dataset:")
print(df.info())


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


First few rows of the dataset:
  Platform                                            Comment  Hateful
0   Reddit  Damn I thought they had strict gun laws in Ger...        0
1   Reddit  I dont care about what it stands for or anythi...        0
2   Reddit                  It's not a group it's an idea lol        0
3   Reddit                          So it's not just America!        0
4   Reddit  The dog is a spectacular dancer considering he...        0

Column names in the dataset:
Index(['Platform', 'Comment', 'Hateful'], dtype='object')

Basic information about the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Platform  3000 non-null   object
 1   Comment   3000 non-null   object
 2   Hateful   3000 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 70.4+ KB
None


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


2: Inspect Dataset and Check for Missing Values

In [4]:
# Display basic information about the dataset
print("\nBasic information about the dataset:")
print(df.info())

# Check for missing values
missing_values = df.isnull().sum()
print("\nMissing values in each column:\n", missing_values)

# Name the text and label columns, matching the df.columns output
text_column = 'Comment'
label_column = 'Hateful'

# Display the first few rows of the text column
print("\nFirst few rows of the text column:")
print(df[text_column].head())



Basic information about the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Platform  3000 non-null   object
 1   Comment   3000 non-null   object
 2   Hateful   3000 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 70.4+ KB
None

Missing values in each column:
 Platform    0
Comment     0
Hateful     0
dtype: int64

First few rows of the text column:
0    Damn I thought they had strict gun laws in Ger...
1    I dont care about what it stands for or anythi...
2                    It's not a group it's an idea lol
3                            So it's not just America!
4    The dog is a spectacular dancer considering he...
Name: Comment, dtype: object
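
Beyond missing values, it can help to check how the labels are distributed, because a skewed `Hateful` column changes how the split and any later metrics should be read. The following is a minimal sketch, assuming a toy DataFrame (`toy_df`, hypothetical) with the same three columns as the dataset; its values are made up and are not taken from the real file.

```python
import pandas as pd

# Hypothetical toy frame with the same columns as the dataset (values are made up)
toy_df = pd.DataFrame({
    "Platform": ["Reddit", "Reddit", "Reddit", "Reddit"],
    "Comment": ["first comment", "second comment", "third comment", "fourth comment"],
    "Hateful": [0, 0, 0, 1],
})

# Absolute and relative frequency of each label
label_counts = toy_df["Hateful"].value_counts()
label_share = toy_df["Hateful"].value_counts(normalize=True)

print(label_counts)
print(label_share)
```

On the real data, the same two calls on `df['Hateful']` would show whether one class dominates.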


3: Tokenization

In [6]:
# Tokenize the text data
df['tokenized_text'] = df[text_column].apply(word_tokenize)

# Display the first few rows of the tokenized text column
print("\nFirst few rows of the tokenized text column:")
print(df['tokenized_text'].head(10))



First few rows of the tokenized text column:
0    [Damn, I, thought, they, had, strict, gun, law...
1    [I, dont, care, about, what, it, stands, for, ...
2       [It, 's, not, a, group, it, 's, an, idea, lol]
3                  [So, it, 's, not, just, America, !]
4    [The, dog, is, a, spectacular, dancer, conside...
5    [If, ppl, dont, wear, masks, you, complain, .....
6      [We, should, send, them, All, back, to, africa]
7    [Checking, to, see, if, it, 's, whataboutism, ...
8            [As, a, european, ,, I, approve, this, .]
9    [Idk, which, of, these, groups, to, join, ,, t...
Name: tokenized_text, dtype: object
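
The token lists above still contain mixed case and standalone punctuation such as `!` and `,`. One possible follow-up step, which is not part of the original notebook, is to lowercase the tokens and drop those made up only of punctuation. A minimal sketch using only the standard library; `normalize_tokens` is a hypothetical helper name.

```python
import string

def normalize_tokens(tokens):
    """Lowercase tokens and drop those consisting only of punctuation."""
    punctuation = set(string.punctuation)
    return [
        token.lower()
        for token in tokens
        if not all(char in punctuation for char in token)
    ]

# Token list taken from row 3 of the tokenized output
tokens = ["So", "it", "'s", "not", "just", "America", "!"]
print(normalize_tokens(tokens))
```

Assuming the helper suits the data, it could be applied to the whole column with `df['tokenized_text'].apply(normalize_tokens)`. Note that clitics such as `'s` are kept, since they contain a letter.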


4: Split Dataset into Training and Testing Sets

In [9]:
# Features are the raw comments; the target is the binary 'Hateful' label
X = df['Comment']
y = df['Hateful']

# Hold out 20% of the rows for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("\nTraining set size:", X_train.shape)
print("Testing set size:", X_test.shape)



Training set size: (2400,)
Testing set size: (600,)
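
A plain random split does not guarantee that both subsets contain the same share of hateful comments. If that matters for the task, `train_test_split` accepts a `stratify` argument. The sketch below uses a hypothetical toy frame (`toy_df`, with made-up comments and a 25% positive rate) to show the effect; it is an illustration, not the split used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 comments, 5 of them labelled hateful (25%)
toy_df = pd.DataFrame({
    "Comment": [f"comment {i}" for i in range(20)],
    "Hateful": [1] * 5 + [0] * 15,
})

X_toy = toy_df["Comment"]
y_toy = toy_df["Hateful"]

# stratify=y_toy keeps the share of each label the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Train label share:", y_tr.mean())
print("Test label share:", y_te.mean())
```

For the notebook's own data, the equivalent change would be to pass `stratify=y` in the existing call.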


5: TF-IDF Encoding

In [10]:
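# Convert text data to TF-IDF features, keeping at most 5000 terms.
# TfidfVectorizer applies its own tokenizer to the raw comments, so the
# 'tokenized_text' column from step 3 is not used here.
# Fit on the training set only, then reuse that vocabulary for the test set.
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)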

print("\nTF-IDF feature matrix for training set:\n", X_train_tfidf.shape)
print("TF-IDF feature matrix for testing set:\n", X_test_tfidf.shape)



TF-IDF feature matrix for training set:
 (2400, 5000)
TF-IDF feature matrix for testing set:
 (600, 5000)
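
To make the numbers inside these matrices less opaque, the weighting can be reproduced by hand on a tiny corpus. The sketch below follows scikit-learn's documented defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalisation per document, lowercase tokens of two or more word characters). The function `tfidf_rows` and the three sample documents are hypothetical, and the `max_features` cap is not modelled.

```python
import math
import re
from collections import Counter

def tfidf_rows(documents):
    """Compute L2-normalised TF-IDF weights per document as {term: weight} dicts."""
    # Lowercase and keep tokens of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows

# Hypothetical three-document corpus
docs = ["gun laws are strict", "strict laws", "the dog is a dancer"]
for row in tfidf_rows(docs):
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the first document, `gun` receives a higher weight than `laws` because it occurs in only one of the three documents, which is the behaviour TF-IDF is meant to capture.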
