DATA PREPROCESSING:

Here is the outline of the preprocessing steps we will follow:

1. Load the dataset and inspect its structure.

2. Check for missing values and handle them if necessary.

3. Perform text preprocessing (e.g., removing punctuation, converting to lowercase, removing stop words).

4. Tokenize the text data.

5. Convert the text data to a suitable format for machine learning (e.g., using TF-IDF or word embeddings).





In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Load the dataset
file_path = '/content/HateSpeechDetection (cleaned data set).csv'
df = pd.read_csv(file_path)

# Display the first few rows of the dataset
df.head()


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Unnamed: 0,Platform,Comment,Hateful
0,Reddit,Damn I thought they had strict gun laws in Ger...,0
1,Reddit,I dont care about what it stands for or anythi...,0
2,Reddit,It's not a group it's an idea lol,0
3,Reddit,So it's not just America!,0
4,Reddit,The dog is a spectacular dancer considering he...,0


Next, let's inspect the structure of the dataset and look for any missing values:

In [2]:
# Display basic information about the dataset
df.info()

# Check for missing values
missing_values = df.isnull().sum()
print("Missing values in each column:\n", missing_values)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Platform  3000 non-null   object
 1   Comment   3000 non-null   object
 2   Hateful   3000 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 70.4+ KB
Missing values in each column:
 Platform    0
Comment     0
Hateful     0
dtype: int64
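
This dataset turns out to have no missing values, but the outline also calls for handling them when they occur. As a minimal sketch on a made-up DataFrame (the rows and the "Unknown" placeholder below are hypothetical, not taken from the dataset), rows without a comment could be dropped and a missing platform filled in like this:

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the dataset
toy = pd.DataFrame({
    "Platform": ["Reddit", "Reddit", None],
    "Comment": ["So it's not just America!", None, "group idea lol"],
    "Hateful": [0, 0, 0],
})

# A row with no comment text cannot be vectorized, so drop it
cleaned = toy.dropna(subset=["Comment"]).copy()

# A missing platform does not affect the text features; fill it with a placeholder
cleaned["Platform"] = cleaned["Platform"].fillna("Unknown")

print(cleaned.isnull().sum())
```

Dropping is only one option. Whether it is appropriate depends on how many rows would be lost.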


The df.info() output shows 3000 non-null entries in every column, so there are no missing values to handle and we can move on to text preprocessing. We'll remove punctuation, convert the text to lowercase, tokenize it, remove stop words, and lemmatize the remaining words.

In [6]:
# Build the stop-word set and the lemmatizer once, instead of on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Define text preprocessing function
def preprocess_comment(comment):
    # Remove punctuation (this also strips apostrophes, so "don't" becomes "dont")
    comment = re.sub(r'[^\w\s]', '', comment)
    # Convert text to lowercase
    comment = comment.lower()
    # Tokenize text
    tokens = word_tokenize(comment)
    # Remove stop words
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatize words
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Join the tokens back into a single string for the vectorizer
    return ' '.join(tokens)

# Apply text preprocessing to the dataset
df['processed_Comment'] = df['Comment'].apply(preprocess_comment)

# Display the first few rows of the dataset after preprocessing
df.head(10)


Unnamed: 0,Platform,Comment,Hateful,processed_Comment
0,Reddit,Damn I thought they had strict gun laws in Ger...,0,damn thought strict gun law germany
1,Reddit,I dont care about what it stands for or anythi...,0,dont care stand anything connected like shield
2,Reddit,It's not a group it's an idea lol,0,group idea lol
3,Reddit,So it's not just America!,0,america
4,Reddit,The dog is a spectacular dancer considering he...,0,dog spectacular dancer considering two left foot
5,Reddit,If ppl dont wear masks you complain... If ppl ...,0,ppl dont wear mask complain ppl wear mask u al...
6,Reddit,We should send them All back to africa,1,send back africa
7,Reddit,Checking to see if it's whataboutism from the ...,0,checking see whataboutism past yup indeed what...
8,Reddit,"As a european, I approve this.",0,european approve
9,Reddit,"Idk which of these groups to join, they're all...",0,idk group join theyre cool


After preprocessing the text, we'll split the dataset into training (80%) and testing (20%) sets:

In [9]:
# Split the dataset into training and testing sets
X = df['processed_Comment']
y = df['Hateful']  # Target variable: the 'Hateful' label column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)


Training set size: (2400,)
Testing set size: (600,)
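
The split above is purely random. If the labels turn out to be imbalanced (the class distribution is not checked in this notebook), one option is the stratify argument of train_test_split, which keeps the class proportions the same in both sets. A minimal sketch on hypothetical labels, assuming a binary target like Hateful:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 non-hateful (0) and 20 hateful (1) comments
texts = [f"comment {i}" for i in range(100)]
labels = [0] * 80 + [1] * 20

# stratify=labels preserves the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print("Train label counts:", Counter(y_tr))
print("Test label counts:", Counter(y_te))
```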


Next, we'll convert the text data into a format suitable for machine learning. We'll use the TF-IDF vectorizer for this purpose:

In [10]:
# Convert text data to TF-IDF features
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print("TF-IDF feature matrix for training set:\n", X_train_tfidf.shape)
print("TF-IDF feature matrix for testing set:\n", X_test_tfidf.shape)


TF-IDF feature matrix for training set:
 (2400, 4869)
TF-IDF feature matrix for testing set:
 (600, 4869)
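
Both matrices have 4869 columns rather than 5000 because max_features is only an upper bound: the training comments evidently contain fewer than 5000 distinct terms. The test matrix has the same width because transform reuses the vocabulary learned from the training set and ignores words it has not seen. The sketch below shows both effects on a hypothetical mini-corpus (not the actual training rows):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus for illustration only
train_docs = ["dog spectacular dancer", "group idea lol", "european approve"]
test_docs = ["idea nobody saw before"]

vectorizer = TfidfVectorizer(max_features=5000)
train_matrix = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vectorizer.transform(test_docs)        # reuses them; unseen words are ignored

print(train_matrix.shape)  # 8 distinct terms, far below the 5000 cap
print(test_matrix.shape)   # same column count as the training matrix
print(test_matrix.nnz)     # only "idea" is in the learned vocabulary
```

Fitting on the training set alone, as the notebook does, keeps information from the test comments out of the vocabulary and IDF weights.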
