
# Hate Speech and Offensive Language Detection Project

This project is part of the **Natural Language Processing (NLP)** coursework and focuses on detecting hate speech and offensive language in social media posts. The project leverages NLP techniques for preprocessing and feature extraction, alongside machine learning models for classification and performance evaluation.

## Objectives
1. Perform text preprocessing using NLP techniques.
2. Visualize data to understand key patterns and trends.
3. Apply multiple machine learning models for classification, including Logistic Regression, Random Forest, SVM, and Neural Networks.
4. Compare model performances using metrics like precision, recall, F1-score, and confusion matrix.
5. Gain insights into model strengths and identify the best-performing approach for text classification tasks.



## Dataset Loading and Exploration

We start by loading the dataset and performing an initial exploration to understand its structure, key attributes, and any preprocessing requirements.


In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset (Replace 'path_to_file.csv' with your actual dataset path)
data = pd.read_csv('path_to_file.csv')

# Display dataset overview
data.info()
data.head()
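
A quick count of missing values and duplicate posts makes the preprocessing requirements concrete. The sketch below is one possible check, not part of the original notebook: the helper name `summarize_quality`, the column names `text` and `label`, and the toy frame are all assumptions.

```python
import pandas as pd

def summarize_quality(df, text_col='text', label_col='label'):
    """Return basic data-quality counts for a labelled text dataset."""
    return {
        'rows': len(df),
        'missing_text': int(df[text_col].isna().sum()),
        'missing_label': int(df[label_col].isna().sum()),
        'duplicate_texts': int(df[text_col].duplicated().sum()),
    }

# Toy frame standing in for the real dataset (hypothetical values)
sample = pd.DataFrame({
    'text': ['hello world', 'hello world', None, 'another post'],
    'label': [0, 0, 1, None],
})
print(summarize_quality(sample))
```

On the real data the call would be `summarize_quality(data)`, with the column names adjusted to match the dataset.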



### Data Visualization

We plot the class distribution to see how many posts fall into each class and whether the dataset is imbalanced.


In [None]:

# Plot class distribution
sns.countplot(x='label', data=data)  # Replace 'label' with the actual label column name
plt.title('Class Distribution')
plt.xlabel('Class')
plt.ylabel('Count')
plt.show()
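
The count plot shows absolute numbers; the relative share of each class is often easier to compare. A minimal sketch, assuming the labels are available as a list or Series (the helper `class_proportions` and the toy class names are made up for illustration):

```python
import pandas as pd

def class_proportions(labels):
    """Return the share of each class, largest first."""
    return pd.Series(labels).value_counts(normalize=True)

# Toy labels standing in for data['label'] (hypothetical class names)
toy_labels = ['offensive', 'offensive', 'offensive', 'hate', 'neither']
proportions = class_proportions(toy_labels)
print(proportions.round(2))
```

If one class dominates, accuracy alone can be misleading, which is one reason to look at per-class precision, recall, and F1-score later on.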



## Data Preprocessing with NLP Techniques

This step involves cleaning the text data and preparing it for feature extraction. We will use NLTK for tokenization, stemming, and stopword removal.


In [None]:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# Download NLTK resources (recent NLTK releases load the tokenizer from 'punkt_tab')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')

# Build the stopword set and stemmer once instead of on every call
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

# Define preprocessing function
def preprocess_text(text):
    # Tokenize text
    tokens = word_tokenize(text)
    # Keep alphabetic tokens only, convert to lowercase, and remove stopwords
    filtered_tokens = [word.lower() for word in tokens if word.isalpha() and word.lower() not in stop_words]
    # Apply stemming
    stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
    return ' '.join(stemmed_tokens)

# Apply preprocessing to text column; missing values become empty strings so the tokenizer does not fail
data['cleaned_text'] = data['text'].fillna('').astype(str).apply(preprocess_text)  # Replace 'text' with actual text column name
data.head()
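
Social media posts often contain URLs, @mentions, and hashtags, and fragments of these (for example a user name) can survive tokenization as ordinary alphabetic tokens. The sketch below shows one way to strip them before calling `preprocess_text`. It is an optional extra step, not part of the original notebook; the function name `strip_social_noise` and the regular expressions are assumptions.

```python
import re

URL_RE = re.compile(r'https?://\S+|www\.\S+')
MENTION_RE = re.compile(r'@\w+')
HASHTAG_RE = re.compile(r'#(\w+)')

def strip_social_noise(text):
    """Remove URLs and @mentions, and keep hashtag words without the '#'."""
    text = URL_RE.sub(' ', text)
    text = MENTION_RE.sub(' ', text)
    text = HASHTAG_RE.sub(r'\1', text)
    # Collapse the whitespace left behind by the removals
    return re.sub(r'\s+', ' ', text).strip()

# Made-up example post
print(strip_social_noise('RT @user: check this https://t.co/abc #NoHate now'))
```

Whether to keep hashtag words is a judgment call: they can carry the sentiment of a post, so this sketch keeps the word and drops only the `#`.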



## Feature Extraction with TF-IDF

We will convert the cleaned text into numerical features using TF-IDF vectorization.


In [None]:

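from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Split the cleaned text first so the vectorizer never sees the test set
text_train, text_test, y_train, y_test = train_test_split(
    data['cleaned_text'], data['label'],  # Replace 'label' with the actual target column
    test_size=0.2, random_state=42,
    stratify=data['label'])  # keep class proportions equal in both splits

# Fit TF-IDF on the training split only, then reuse its vocabulary and IDF weights on the test split
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(text_train)  # sparse matrix; scikit-learn models accept it directly
X_test = vectorizer.transform(text_test)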
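
To make the vectorization step concrete, the toy example below fits a `TfidfVectorizer` on three made-up training sentences and then transforms one unseen sentence. It is only an illustration with invented data; the variable names are not from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences standing in for the cleaned posts
train_docs = ['you are great', 'you are awful', 'great day']
test_docs = ['awful day indeed']

toy_vectorizer = TfidfVectorizer()
train_matrix = toy_vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = toy_vectorizer.transform(test_docs)        # reuses them; unseen words are dropped

vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(train_matrix.shape, test_matrix.shape)
```

The word "indeed" never appears in the training sentences, so it has no column in the test matrix. This is the behavior to expect when a vectorizer fitted on training data is applied to new posts.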



## Model Training and Evaluation

We will train and evaluate multiple machine learning models and compare their performances.


In [None]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Logistic Regression
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)
y_pred_log_reg = log_reg.predict(X_test)

# Performance evaluation
print("Logistic Regression Performance:")
print(classification_report(y_test, y_pred_log_reg))
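sns.heatmap(confusion_matrix(y_test, y_pred_log_reg), annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix: Logistic Regression')
plt.xlabel('Predicted label')  # columns are predictions
plt.ylabel('True label')       # rows are true classes
plt.show()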


In [None]:

from sklearn.ensemble import RandomForestClassifier

# Random Forest
rf_model = RandomForestClassifier(random_state=42)  # fixed seed so results are reproducible
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

# Performance evaluation
print("Random Forest Performance:")
print(classification_report(y_test, y_pred_rf))
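sns.heatmap(confusion_matrix(y_test, y_pred_rf), annot=True, fmt='d', cmap='Greens')
plt.title('Confusion Matrix: Random Forest')
plt.xlabel('Predicted label')  # columns are predictions
plt.ylabel('True label')       # rows are true classes
plt.show()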
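
The objectives also name SVM and Neural Networks. The sketch below shows how they could be trained with scikit-learn's `LinearSVC` and `MLPClassifier`. It is not part of the original notebook: the hyperparameters are untuned assumptions, and synthetic data from `make_classification` stands in for the TF-IDF features so the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF features and labels (assumption)
X_toy, y_toy = make_classification(n_samples=300, n_features=20, n_informative=10,
                                   n_classes=3, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2,
                                      random_state=42, stratify=y_toy)

extra_models = {
    'Linear SVM': LinearSVC(random_state=42, max_iter=5000),
    'Neural Network (MLP)': MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                                          random_state=42),
}

# Fit each model and record its macro-averaged F1 on the held-out split
extra_scores = {}
for name, model in extra_models.items():
    model.fit(Xtr, ytr)
    extra_scores[name] = f1_score(yte, model.predict(Xte), average='macro')
    print(f'{name}: macro F1 = {extra_scores[name]:.3f}')
```

In the notebook, `X_train`, `y_train`, `X_test`, and `y_test` would take the place of the synthetic split. A linear SVM is used here because linear models tend to scale well to high-dimensional sparse text features; a kernel SVM would be a different design choice.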



## Conclusion and Insights

Comparing the models on precision, recall, F1-score, and their confusion matrices shows which algorithm performs best at detecting hate speech and offensive language. Future work may involve deep learning models for further improvement.
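
One way to make the comparison explicit is to collect the macro-averaged metrics of every model in a single table. The helper `compare_models` below is a hypothetical sketch, not the notebook's own code, and it runs on synthetic data so it is self-contained.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

def compare_models(models, X_train, y_train, X_test, y_test):
    """Fit each model and return macro-averaged metrics, best F1 first."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_test, model.predict(X_test), average='macro', zero_division=0)
        rows.append({'model': name, 'precision': precision, 'recall': recall, 'f1': f1})
    return pd.DataFrame(rows).sort_values('f1', ascending=False).reset_index(drop=True)

# Synthetic stand-in for the TF-IDF features and labels (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, n_informative=10,
                                     n_classes=3, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

summary = compare_models(
    {'Logistic Regression': LogisticRegression(max_iter=200),
     'Random Forest': RandomForestClassifier(random_state=42)},
    Xd_train, yd_train, Xd_test, yd_test)
print(summary.round(3))
```

Macro averaging weights every class equally, which matters when one class is much rarer than the others. Scores on the synthetic data say nothing about the real dataset; the table is only meaningful once the function is called with the notebook's own train and test splits.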
