PROJECT: Hate Speech Recognition in Online Group Chat Rooms

## Data Preprocessing
The first step in our hate speech detection model involves collecting and preprocessing the data. This step is crucial as the quality and cleanliness of the data significantly affect the model's performance.

We use the provided dataset (HateSpeechDetection.csv), which contains text data labeled as hate speech (1) or not hate speech (0).

In [18]:
import re
import pandas as pd
import numpy as np

data = pd.read_csv("HateSpeechDetection.csv")
data.head(15)


Unnamed: 0,Text,Label
0,Damn I thought they had strict gun laws in Ger...,0
1,I dont care about what it stands for or anythi...,0
2,It's not a group it's an idea lol,0
3,So it's not just America!,0
4,The dog is a spectacular dancer considering he...,0
5,If ppl dont wear masks you complain... If ppl ...,0
6,We should send them All back to africa,1
7,Checking to see if it's whataboutism from the ...,0
8,"As a european, I approve this.",0
9,"Idk which of these groups to join, they're all...",0


In [19]:
data.tail(10)

Unnamed: 0,Text,Label
17586,africans should be in a zoo with the other ani...,1
17587,I wouldn't want my daughter to marry a Chink,1
17588,not anymore. they are like the shitty utube no...,1
17589,"Yet still, better for a 750cc to seek a win (o...",0
17590,people this is new thread due to the impending...,0
17591,I find rats nicer and cleaner than most Chinese,1
17592,"Check out this niggar, they hit things like wi...",1
17593,"this country has become an absolute shambles, ...",0
17594,Me aged 16 = anti-Semitism is bad Me aged 18 =...,1
17595,so messed up saying blacks don't deserve rights,0


In [20]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17596 entries, 0 to 17595
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Text    17596 non-null  object
 1   Label   17596 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 275.1+ KB


After inspecting the dataset, we can see that the raw text needs to be cleaned and transformed into a format suitable for our model. This involves several sub-steps:

Removing Extra Spaces: Normalize the spacing in the text to remove any extra spaces.

In [21]:
def remove_extra_spaces(text):
    return re.sub(r'\s+', ' ', text) #the re.sub function replaces one or more whitespace characters (\s+) with a single space.
data['Text'] = data['Text'].apply(remove_extra_spaces)
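
As a small illustration of the substitution above, the sketch below applies the same pattern to a made-up string. The trailing `.strip()` is an addition that is not part of the notebook cell: `\s+` collapses each run of whitespace into one space, but a single leading or trailing space can remain. The name `normalize_spaces` is hypothetical.

```python
import re

def normalize_spaces(text):
    # Collapse every run of whitespace (spaces, tabs, newlines) into one space,
    # then drop the single space that may remain at either end.
    return re.sub(r"\s+", " ", text).strip()

sample = "  so   it's \t not\njust  America!  "
collapsed = re.sub(r"\s+", " ", sample)   # same call as in the cell above
cleaned = normalize_spaces(sample)        # with the extra strip
print(repr(collapsed))
print(repr(cleaned))
```

Whether the edge spaces matter depends on the later steps; a tokenizer ignores them, but exact string comparisons or duplicate detection would not.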



Remove usernames: As with URLs (removed in a later step), a username such as @someone is not a word that carries meaning for the model, so we remove it.

In [22]:
def remove_username(text):
    return re.sub(r"@\S+", "",text) 
# The pattern "@\S+" matches an '@' followed by one or more non-whitespace characters (\S); '+' means the preceding element is repeated one or more times.

data['Text'] = data['Text'].apply(remove_username)
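
The sketch below shows, on a made-up string, what the `@\S+` pattern removes in practice. It also deletes the domain part of e-mail addresses and any punctuation glued to the handle. The stricter pattern in `remove_username_strict` is a hypothetical alternative, not something the notebook uses.

```python
import re

def remove_username_loose(text):
    # Pattern from the cell above: '@' followed by one or more non-whitespace characters.
    return re.sub(r"@\S+", "", text)

def remove_username_strict(text):
    # Hypothetical variant: match '@name' only when '@' does not follow a word character,
    # so e-mail addresses stay intact, and stop at the first non-word character.
    return re.sub(r"(?<!\w)@\w+", "", text)

sample = "@user_1 thanks, mail bob@example.com or ask @mod!"
loose = remove_username_loose(sample)
strict = remove_username_strict(sample)
print(repr(loose))
print(repr(strict))
```

Both variants leave stray spaces behind, so it may be worth running the whitespace normalization again after this step.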



Remove Hashtags: Hashtags are hard to interpret, but they often carry useful information about the context and content of a text. The difficulty is that the words inside a hashtag run together without spaces. Here we strip only the '#' symbol and keep the term itself.

In [23]:
def remove_hashtags(text):
    return re.sub(r'#', '', text)
# Replace the '#' character with an empty string; the hashtag term itself is kept.

data['Text'] = data['Text'].apply(remove_hashtags)
data.tail()

Unnamed: 0,Text,Label
17591,I find rats nicer and cleaner than most Chinese,1
17592,"Check out this niggar, they hit things like wi...",1
17593,"this country has become an absolute shambles, ...",0
17594,Me aged 16 = anti-Semitism is bad Me aged 18 =...,1
17595,so messed up saying blacks don't deserve rights,0
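
The paragraph on hashtags notes that their words run together. One possible way to recover the individual words, sketched here under the assumption that hashtags are written in CamelCase, is to split at lowercase-to-uppercase boundaries before the text is lowercased. `expand_hashtags` and `split_hashtag` are hypothetical helpers; all-lowercase hashtags cannot be split this way.

```python
import re

def split_hashtag(match):
    tag = match.group(1)
    # Insert a space at each lowercase/digit -> uppercase boundary.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", tag)

def expand_hashtags(text):
    # Replace '#Tag' with the tag's words; this must run before lowercasing.
    return re.sub(r"#(\w+)", split_hashtag, text)

expanded = expand_hashtags("please #StopTheHate now #nohate")
print(expanded)
```

In this toy input `#StopTheHate` becomes three separate words, while `#nohate` only loses its `#`, which is the same result the notebook's simpler substitution gives.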


Lowercasing: Convert all text to lowercase to ensure uniformity, as the model should treat "Hate" and "hate" as the same word.

In [24]:
def text_lower(text):
    return text.lower()
data['Text'] = data['Text'].apply(text_lower)

Removing Punctuation: Strip out punctuation to focus on the words themselves.

In [25]:
def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)
    # \w matches any word character: letters, digits, and the underscore
    #    (for str patterns in Python 3 this includes non-ASCII letters, not only [a-zA-Z0-9_]).
    # \s matches any whitespace character, such as a space, tab, or newline.
    # [^\w\s] therefore matches any single character that is neither a word character nor whitespace
    #    ('^' negates the set); every such character is removed.

data['Text'] = data['Text'].apply(remove_punctuation)
data.tail()

Unnamed: 0,Text,Label
17591,i find rats nicer and cleaner than most chinese,1
17592,check out this niggar they hit things like wil...,1
17593,this country has become an absolute shambles t...,0
17594,me aged 16 antisemitism is bad me aged 18 an...,1
17595,so messed up saying blacks dont deserve rights,0
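
To make the effect of `[^\w\s]` concrete, the sketch below runs it on a made-up sentence. It shows three things: apostrophes and hyphens are deleted so the surrounding letters merge, underscores and accented letters survive, and standalone symbols leave a double space behind. The extra whitespace pass is an addition, not part of the notebook cell.

```python
import re

def remove_punctuation(text):
    # Delete every character that is neither a word character (\w) nor whitespace (\s).
    return re.sub(r"[^\w\s]", "", text)

sample = "me aged 16 = anti-semitism is bad, don't @ me_2 café!"
stripped = remove_punctuation(sample)
# A second whitespace pass removes the gaps left by standalone symbols such as '=' and '@'.
tidy = re.sub(r"\s+", " ", stripped).strip()
print(repr(stripped))
print(repr(tidy))
```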


Remove URLs: URLs do not give any information when we try to analyze text from words.

In [26]:
def remove_url(text):
    return re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
# Matches any run of non-whitespace characters that begins with 'http' (which also covers 'https') or 'www', and removes it.

data['Text'] = data['Text'].apply(remove_url)
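
Because the URL pattern has no word-boundary anchor, it can also eat parts of ordinary words. The sketch below compares it with a stricter, hypothetical pattern on a made-up string. The stricter pattern depends on `://` and `.` still being present, so it would have to run before punctuation removal, which is not the order used in this notebook.

```python
import re

def remove_url_loose(text):
    # Pattern from the cell above; 'https\S+' is omitted because 'http\S+' already covers it.
    return re.sub(r"http\S+|www\S+", "", text)

def remove_url_anchored(text):
    # Hypothetical stricter variant: require a word boundary plus '://' or 'www.'.
    # It relies on punctuation still being present in the text.
    return re.sub(r"\b(?:https?://|www\.)\S+", "", text)

sample = "awwww cute pic at www.example.com and https://t.co/abc"
loose = remove_url_loose(sample)
anchored = remove_url_anchored(sample)
print(repr(loose))
print(repr(anchored))
```

With the loose pattern the word "awwww" is cut down to "a", while the anchored pattern leaves it alone and still removes both links.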

In [14]:
data

In [27]:
from nltk.tokenize import word_tokenize
data['Text'] = data['Text'].apply(word_tokenize)
data

Unnamed: 0,Text,Label
0,"[damn, i, thought, they, had, strict, gun, law...",0
1,"[i, dont, care, about, what, it, stands, for, ...",0
2,"[its, not, a, group, its, an, idea, lol]",0
3,"[so, its, not, just, america]",0
4,"[the, dog, is, a, spectacular, dancer, conside...",0
...,...,...
17591,"[i, find, rats, nicer, and, cleaner, than, mos...",1
17592,"[check, out, this, niggar, they, hit, things, ...",1
17593,"[this, country, has, become, an, absolute, sha...",0
17594,"[me, aged, 16, antisemitism, is, bad, me, aged...",1


Data cleaning

In [None]:
import numpy as np

# Remove rows with missing values from the DataFrame
cleaned_data = train_data.dropna()

# Separate features and target variable
features = cleaned_data['text']
target = cleaned_data['label']

# Print the shape of the features and target arrays
print("Shape of features (X):", features.shape)
print("Shape of target (y):", target.shape)


In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
training_features, testing_features, training_labels, testing_labels = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target
)

# Print the shapes of the resulting datasets
print("Training features shape:", training_features.shape)
print("Testing features shape:", testing_features.shape)
print("Training labels shape:", training_labels.shape)
print("Testing labels shape:", testing_labels.shape)


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF Vectorizer with specified parameters
tfidf_vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))

# Fit and transform the training data
X_train_tfidf_matrix = tfidf_vectorizer.fit_transform(training_features)

# Transform the test data based on the fitted TF-IDF vectorizer
X_test_tfidf_matrix = tfidf_vectorizer.transform(testing_features)

# Print shapes of TF-IDF matrices for training and testing sets
print("Shape of training TF-IDF matrix:", X_train_tfidf_matrix.shape)
print("Shape of testing TF-IDF matrix:", X_test_tfidf_matrix.shape)
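
The cell above fits the vectorizer on the training split and only transforms the test split. The pure-Python sketch below illustrates why that distinction matters. It assumes whitespace tokenization, unigrams only, and the smoothed IDF and L2 normalization that scikit-learn's `TfidfVectorizer` uses by default; `fit_idf` and `transform` are hypothetical helpers, not the library's API.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    # Learn the vocabulary and smoothed IDF weights from the training documents only.
    n_docs = len(train_docs)
    doc_freq = Counter(term for doc in train_docs for term in set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}

def transform(doc, idf):
    # Weight raw term counts by IDF, skip terms unseen during fitting, then L2-normalise.
    counts = Counter(term for term in doc.split() if term in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}

train_docs = ["hate is bad", "love is good", "hate hate go away"]
idf = fit_idf(train_docs)
test_vector = transform("hate is new", idf)
print(idf["hate"], idf["is"], idf["bad"])
print(test_vector)
```

The word "new" never appeared in the training documents, so it receives no weight in the test vector. Fitting on all the data instead would let test-set vocabulary and document frequencies leak into the features.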


In [None]:
from imblearn.over_sampling import SMOTE

# Initialize SMOTE for handling class imbalance
smote_resampler = SMOTE(random_state=42)

# Apply SMOTE to the training data to generate synthetic samples
X_resampled_features, y_resampled_labels = smote_resampler.fit_resample(X_train_tfidf_matrix, training_labels)

# Display the distribution of the resampled labels
print("Resampled label distribution:\n", pd.Series(y_resampled_labels).value_counts())
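
SMOTE balances the classes by creating synthetic minority samples rather than copying existing ones: each new point is interpolated between a minority sample and one of its nearest minority neighbours. The toy sketch below illustrates only that interpolation step on made-up 2-D points. `smote_like_sample` is a hypothetical helper and is not how imbalanced-learn implements it.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    # Pick a random minority point and one of its k nearest minority neighbours,
    # then return a point on the line segment between them.
    if rng is None:
        rng = random.Random(0)
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

# Made-up minority-class points at the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rng = random.Random(42)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print(point)
```

Every synthetic point lies on an edge of the square, between two real minority points. On TF-IDF features the same idea produces blended documents that never existed, which is one reason to apply it to the training split only, as the cell above does.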


In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
logistic_regression_model = LogisticRegression()

# Train the model using the resampled training data
logistic_regression_model.fit(X_resampled_features, y_resampled_labels)

# Optional: Print a message confirming that the model has been trained
print("Logistic Regression model has been trained successfully.")


In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create an instance of the RandomForestClassifier
random_forest_clf = RandomForestClassifier(random_state=42)

# Optional: Print a message indicating the Random Forest model has been initialized
print("Random Forest Classifier has been initialized.")


In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest Classifier with a fixed random seed
rf_classifier = RandomForestClassifier(random_state=42)

# Train the Random Forest model using the resampled data
rf_classifier.fit(X_resampled_features, y_resampled_labels)

# Optional: Print a message confirming that the Random Forest model has been trained
print("Random Forest Classifier has been trained successfully.")


In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create an instance of the RandomForestClassifier with a fixed random seed for reproducibility
forest_classifier = RandomForestClassifier(random_state=42)

# Optional: Print a message confirming the classifier initialization
print("RandomForestClassifier instance has been created.")


In [None]:
# Evaluate the model's performance on the test set
test_accuracy = rf_classifier.score(X_test_tfidf_matrix, testing_labels)

# Print the accuracy of the Random Forest Classifier on the test data
print(f"Accuracy of the Random Forest Classifier on the test set: {test_accuracy:.4f}")


In [None]:
from sklearn.metrics import confusion_matrix, classification_report

# Make predictions using the trained Random Forest model
predicted_labels = rf_classifier.predict(X_test_tfidf_matrix)

# Compute and print the confusion matrix
conf_matrix = confusion_matrix(testing_labels, predicted_labels)
print("Confusion Matrix:\n", conf_matrix)

# Generate and print the classification report
class_report = classification_report(testing_labels, predicted_labels)
print("Classification Report:\n", class_report)
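
For reading the confusion matrix printed above, the sketch below derives the hate-speech-class metrics by hand. It assumes the scikit-learn layout for labels `[0, 1]`, where rows are true labels and columns are predictions. The counts are made up for illustration and are not results from this notebook.

```python
def binary_metrics(conf_matrix):
    # Layout for labels [0, 1]: [[TN, FP], [FN, TP]].
    (tn, fp), (fn, tp) = conf_matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Made-up counts for illustration only.
example_matrix = [[80, 20], [10, 40]]
metrics = binary_metrics(example_matrix)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy case accuracy and recall are both 0.8, yet one in three messages flagged as hate speech is a false alarm. That is why the per-class report is more informative than the single accuracy score.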


In [None]:
from sklearn.metrics import classification_report

# Generate predictions on the test set using the trained Random Forest model
predicted_labels = rf_classifier.predict(X_test_tfidf_matrix)

# Create a classification report for the predictions
classification_summary = classification_report(testing_labels, predicted_labels)

# Print the classification report
print("Classification Report:\n", classification_summary)


In [None]:
# Candidate hyperparameter grid (note that the next cell redefines param_grid)
param_grid = {
    'n_estimators': [100, 150, 200, 300, 350, 400],
    'max_features': [1, 2, 'sqrt', 'log2', None],
    'max_depth': [4, 6, 10, 15, 20],
    'max_leaf_nodes': [2, 4, 6, 12, 20]
}

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for GridSearchCV
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize GridSearchCV with the Random Forest Classifier and parameter grid
grid_search = GridSearchCV(estimator=rf_classifier, 
                           param_grid=param_grid, 
                           cv=5, 
                           verbose=2, 
                           n_jobs=-1)

# Optionally, fit GridSearchCV to the training data
grid_search.fit(X_resampled_features, y_resampled_labels)

# Print the best parameters found by GridSearchCV
print("Best parameters found:\n", grid_search.best_params_)
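
Before launching a grid search it helps to know how much work it implies. The sketch below counts the candidate combinations in the grid defined above and multiplies by the number of folds. It uses only the standard library and does not train anything.

```python
from itertools import product

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
cv_folds = 5

combinations = list(product(*param_grid.values()))
n_candidates = len(combinations)
n_fits = n_candidates * cv_folds  # one fit per candidate per fold
print(n_candidates, n_fits)
```

That is 108 candidates and 540 Random Forest fits, before the final refit on the full training data. If this proves too slow, a randomized search over the same ranges is a common alternative.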


In [None]:
from sklearn.metrics import classification_report, confusion_matrix

# Make predictions on the test set using the best model from GridSearchCV
predicted_labels_best = grid_search.predict(X_test_tfidf_matrix)

# Generate and print the classification report
report_summary = classification_report(testing_labels, predicted_labels_best)
print("Classification Report:\n", report_summary)

# Compute and print the confusion matrix
confusion_mat = confusion_matrix(testing_labels, predicted_labels_best)
print("Confusion Matrix:\n", confusion_mat)


Data balancing

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import RandomOverSampler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
import re

# Assume train_data is already loaded into a pandas DataFrame
# train_data = pd.read_csv('path_to_your_csv.csv')

# Preprocess text data: Remove NaNs, clean text, and split data
train_data = train_data.dropna(subset=['text', 'label'])

def clean_text(text):
    text = text.lower()
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'#\w+', '', text)
    # The rest of this function is cut off in the source; return the text cleaned so far
    return text


In [None]:
import pandas as pd
import re

# Load your DataFrame (df)
# df = pd.read_csv('path_to_your_csv.csv')

# Count the number of columns in the DataFrame
column_count = df.shape[1]
print(f"Number of columns: {column_count}")

# Function to preprocess text
def preprocess_text(text):
    # Remove special characters and numbers
    cleaned_text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    cleaned_text = cleaned_text.lower()
    return cleaned_text

# Apply preprocessing to the 'comment' column
df['processed_comment'] = df['comment'].apply(preprocess_text)

# Split the dataset into features (X) and target (y)
X = df['processed_comment']
y = df['label']


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Vectorize the text data (convert text to numerical features)
tfidf_vectorizer = TfidfVectorizer(max_features=1000)  # Adjust max_features as needed
X_tfidf = tfidf_vectorizer.fit_transform(X)

# Print the shape of the data before sampling
print(f"Shape of data before sampling: {X_tfidf.shape}, {y.shape}")

# Split the data into training and testing sets
X_train_tfidf, X_test_tfidf, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
import re

# Load your DataFrame (df)
# df = pd.read_csv('path_to_your_csv.csv')

# Count the number of columns in the DataFrame
column_count = df.shape[1]
print(f"Number of columns: {column_count}")

# Function to preprocess text
def preprocess_text(text):
    # Remove special characters and numbers
    cleaned_text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    cleaned_text = cleaned_text.lower()
    return cleaned_text

# Apply preprocessing to the 'comment' column
df['processed_comment'] = df['comment'].apply(preprocess_text)

# Split the dataset into features (X) and target (y)
X = df['processed_comment']
y = df['label']

# Vectorize the text data (convert text to numerical features)
tfidf_vectorizer = TfidfVectorizer(max_features=1000)  # Adjust max_features as needed
X_tfidf = tfidf_vectorizer.fit_transform(X)

# Print the shape of the data before sampling
print(f"Shape of data before sampling: {X_tfidf.shape}, {y.shape}")

# Split the data into training and testing sets
X_train_tfidf, X_test_tfidf, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Handle class imbalance using RandomOverSampler
oversampler = RandomOverSampler(random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X_train_tfidf, y_train)

# Train a K-Nearest Neighbors classifier
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(X_resampled, y_resampled)

# Predict on the test set
y_pred = knn_classifier.predict(X_test_tfidf)

# Evaluate the model after random oversampling
print("Evaluation after Random Oversampling:")
print(classification_report(y_test, y_pred))

# Print confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
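
Random oversampling, used in the cell above, is simpler than SMOTE: it duplicates randomly chosen minority rows until the classes are the same size. The sketch below shows the idea in plain Python on made-up data. `random_oversample` is a hypothetical helper, not the imbalanced-learn implementation.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=42):
    # Duplicate randomly chosen rows of each smaller class until every class
    # has as many rows as the largest class.
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target_size = max(len(rows) for rows in by_class.values())
    out_samples, out_labels = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target_size - len(rows))]
        out_samples.extend(rows + extra)
        out_labels.extend([label] * target_size)
    return out_samples, out_labels

texts = ["a", "b", "c", "d", "e", "f"]
labels = [0, 0, 0, 0, 1, 1]
resampled_texts, resampled_labels = random_oversample(texts, labels)
print(Counter(resampled_labels))
```

Because the added rows are exact copies, a nearest-neighbour classifier will see duplicated minority points as several identical neighbours. That effectively gives those points more votes, which is worth keeping in mind when interpreting the KNN results.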


Feature encoding

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score, f1_score
from sklearn.pipeline import Pipeline

# Load the new dataset
dataset = pd.read_csv('new_processed_dataset.csv')
print(dataset.head())

# Drop rows with NaN values in the 'tweet' column
dataset = dataset.dropna(subset=['tweet'])

# Define the input features and target variable
features = dataset['tweet']
target = dataset['class']

# Split the data into training and testing sets
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Function to display evaluation metrics
def display_metrics(actual, predicted):
    print("Accuracy:", accuracy_score(actual, predicted))
    print("Precision:", precision_score(actual, predicted, average='weighted'))
    print("Recall:", recall_score(actual, predicted, average='weighted'))
    print("F1 Score:", f1_score(actual, predicted, average='weighted'))


Logistic Regression encoding

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Create a pipeline for text classification with CountVectorizer and LogisticRegression
text_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(binary=True, max_features=1000)),
    ('classifier', LogisticRegression())
])

# Fit the pipeline on the training data
text_pipeline.fit(features_train, target_train)

# Predict on the test data
predictions = text_pipeline.predict(features_test)

# Output the evaluation metrics
print("Evaluation using Count Vectorization and Logistic Regression:")
print("Accuracy Score:", accuracy_score(target_test, predictions))
print("Classification Report:\n", classification_report(target_test, predictions))

# Function to display additional metrics
def show_metrics(actual, predicted):
    print("Accuracy:", accuracy_score(actual, predicted))
    print("Precision:", precision_score(actual, predicted, average='weighted'))
    print("Recall:", recall_score(actual, predicted, average='weighted'))
    print("F1 Score:", f1_score(actual, predicted, average='weighted'))

# Display additional metrics
show_metrics(target_test, predictions)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF Encoding Pipeline
pipeline_tfidf = Pipeline([
    ('vectorizer', TfidfVectorizer(max_features=1000)),
    ('classifier', LogisticRegression())
])

# Train and evaluate the model on the split created in the "Feature encoding" cell
pipeline_tfidf.fit(features_train, target_train)
y_pred_tfidf = pipeline_tfidf.predict(features_test)
print("TF-IDF Encoding")
print("Accuracy:", accuracy_score(target_test, y_pred_tfidf))
print("Classification Report:\n", classification_report(target_test, y_pred_tfidf))
display_metrics(target_test, y_pred_tfidf)

In [None]:
from gensim.models import Word2Vec
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler  # needed by the 'scaler' pipeline step below

class Word2VecTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vector_size=100, window=5, min_count=1):
        self.vector_size = vector_size
        self.window = window
        self.min_count = min_count
        self.model = None

    def fit(self, X, y=None):
        tokenized_X = [tweet.split() for tweet in X]
        self.model = Word2Vec(sentences=tokenized_X, vector_size=self.vector_size, window=self.window, min_count=self.min_count)
        return self

    def transform(self, X):
        def get_word2vec_features(text):
            words = text.split()
            feature_vector = np.mean([self.model.wv[word] for word in words if word in self.model.wv] or [np.zeros(self.vector_size)], axis=0)
            return feature_vector
        
        return np.array([get_word2vec_features(tweet) for tweet in X])

# Word2Vec Encoding Pipeline
pipeline_w2v = Pipeline([
    ('word2vec', Word2VecTransformer(vector_size=100)),  # We don't set max_features for Word2Vec
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Train and evaluate the model on the split created in the "Feature encoding" cell
pipeline_w2v.fit(features_train, target_train)
y_pred_w2v = pipeline_w2v.predict(features_test)
print("Word2Vec Encoding")
print("Accuracy:", accuracy_score(target_test, y_pred_w2v))
print("Classification Report:\n", classification_report(target_test, y_pred_w2v))
display_metrics(target_test, y_pred_w2v)
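
The transformer above turns each text into one vector by averaging the vectors of its known words. The sketch below reproduces just that averaging step with made-up 2-dimensional embeddings, so no trained model is needed. `mean_vector` and `toy_embeddings` are hypothetical names.

```python
def mean_vector(tokens, embeddings, dim):
    # Average the vectors of the tokens that have an embedding;
    # fall back to a zero vector when none of them do.
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(values) / len(known) for values in zip(*known)]

# Hypothetical 2-dimensional embeddings, made up for illustration.
toy_embeddings = {"good": [1.0, 0.0], "day": [0.0, 1.0], "bad": [-1.0, 0.0]}

vec_good_day = mean_vector("good day".split(), toy_embeddings, dim=2)
vec_good_bad = mean_vector("good bad".split(), toy_embeddings, dim=2)
vec_unknown = mean_vector("unknown words".split(), toy_embeddings, dim=2)
print(vec_good_day, vec_good_bad, vec_unknown)
```

In this toy example "good bad" and a text made only of unknown words both collapse to the zero vector. It is a reminder that averaging discards word order and can cancel out opposing words.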

Naive Bayes encoding

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score, f1_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec
import numpy as np

# Load and preprocess data
dataset = pd.read_csv('new_processed_dataset.csv')
print(dataset.head())

# Remove duplicates and handle missing values
dataset.drop_duplicates(inplace=True)
dataset.dropna(inplace=True)

# Define features (X) and target (y)
features = dataset['tweet']
target = dataset['class']

# Split the dataset into training and testing sets
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Function to print evaluation metrics
def display_metrics(actual, predicted):
    print("Accuracy:", accuracy_score(actual, predicted))
    print("Precision:", precision_score(actual, predicted, average='weighted'))
    print("Recall:", recall_score(actual, predicted, average='weighted'))
    print("F1 Score:", f1_score(actual, predicted, average='weighted'))

# Instantiate a Multinomial Naive Bayes classifier pipeline with CountVectorizer
nb_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(max_features=5000)),
    ('classifier', MultinomialNB())
])

# Train the pipeline on the training data
nb_pipeline.fit(features_train, target_train)

# Predict on the test data
predictions = nb_pipeline.predict(features_test)

# Display evaluation metrics
print("Evaluation Results:")
display_metrics(target_test, predictions)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score

# Create a pipeline for text classification with CountVectorizer and Multinomial Naive Bayes
text_pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer(binary=True, max_features=1000)),
    ('naive_bayes', MultinomialNB())
])

# Train the pipeline using the training dataset
text_pipeline.fit(features_train, target_train)

# Generate predictions on the test set
test_predictions = text_pipeline.predict(features_test)

# Print the classification report
print("Text Classification Evaluation with Naive Bayes:")
print("Classification Report:\n", classification_report(target_test, test_predictions))

# Define a function to print detailed performance metrics
def print_performance_metrics(true_labels, predicted_labels):
    print("Accuracy Score:", accuracy_score(true_labels, predicted_labels))
    print("Precision Score:", precision_score(true_labels, predicted_labels, average='weighted'))
    print("Recall Score:", recall_score(true_labels, predicted_labels, average='weighted'))
    print("F1 Score:", f1_score(true_labels, predicted_labels, average='weighted'))

# Output the detailed performance metrics
print_performance_metrics(target_test, test_predictions)


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Create a pipeline for text classification using TF-IDF and Multinomial Naive Bayes
tfidf_pipeline = Pipeline([
    ('tfidf_vectorizer', TfidfVectorizer(max_features=1000)),
    ('naive_bayes_classifier', MultinomialNB())
])

# Train the pipeline on the training data
tfidf_pipeline.fit(features_train, target_train)

# Predict the labels for the test set
predicted_labels_tfidf = tfidf_pipeline.predict(features_test)

# Print the classification report for TF-IDF encoding with Naive Bayes
print("Evaluation of TF-IDF Encoding with Naive Bayes Classifier:")
print("Classification Report:\n", classification_report(target_test, predicted_labels_tfidf))

# Function to display performance metrics
def show_performance_metrics(true_values, predicted_values):
    print("Accuracy:", accuracy_score(true_values, predicted_values))
    print("Precision:", precision_score(true_values, predicted_values, average='weighted'))
    print("Recall:", recall_score(true_values, predicted_values, average='weighted'))
    print("F1 Score:", f1_score(true_values, predicted_values, average='weighted'))

# Display additional metrics for the TF-IDF pipeline results
show_performance_metrics(target_test, predicted_labels_tfidf)

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from gensim.models import Word2Vec
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
import numpy as np
from sklearn.metrics import classification_report

# Custom Transformer for Word2Vec
class CustomWord2VecTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vector_size=100, window=5, min_count=1):
        self.vector_size = vector_size
        self.window = window
        self.min_count = min_count
        self.word2vec_model = None

    def fit(self, X, y=None):
        tokenized_texts = [text.split() for text in X]
        self.word2vec_model = Word2Vec(sentences=tokenized_texts, vector_size=self.vector_size, window=self.window, min_count=self.min_count)
        return self

    def transform(self, X):
        def compute_feature_vector(text):
            tokens = text.split()
            vectors = [self.word2vec_model.wv[token] for token in tokens if token in self.word2vec_model.wv]
            if not vectors:
                return np.zeros(self.vector_size)
            return np.mean(vectors, axis=0)
        
        return np.array([compute_feature_vector(text) for text in X])

# Define the pipeline with the Word2Vec transformer and Gaussian Naive Bayes
w2v_pipeline = Pipeline([
    ('word2vec_transformer', CustomWord2VecTransformer(vector_size=100)),  # Reduced vector size for quicker processing
    ('naive_bayes_classifier', GaussianNB())
])

# Fit the pipeline on the training split created at the top of this section
w2v_pipeline.fit(features_train, target_train)

# Predict the labels for the test set
predicted_labels_w2v = w2v_pipeline.predict(features_test)

# Print the classification report for the Word2Vec model
print("Word2Vec Encoding with Naive Bayes Classifier:")
print("Classification Report:\n", classification_report(target_test, predicted_labels_w2v))

# Function to display performance metrics
def display_metrics(true_labels, predicted_labels):
    print("Accuracy:", accuracy_score(true_labels, predicted_labels))
    print("Precision:", precision_score(true_labels, predicted_labels, average='weighted'))
    print("Recall:", recall_score(true_labels, predicted_labels, average='weighted'))
    print("F1 Score:", f1_score(true_labels, predicted_labels, average='weighted'))

# Display detailed performance metrics
display_metrics(target_test, predicted_labels_w2v)


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Define the pipeline for Term Frequency encoding with Naive Bayes
tf_pipeline = Pipeline([
    ('tf_vectorizer', CountVectorizer(max_features=1000)),  # Convert text data into term frequency features
    ('naive_bayes', MultinomialNB())  # Classifier
])

# Fit the pipeline on the training split created at the top of this section
tf_pipeline.fit(features_train, target_train)

# Predict the labels for the test data
predicted_labels_tf = tf_pipeline.predict(features_test)

# Print the classification report for the Term Frequency encoding model
print("Term Frequency Encoding with Naive Bayes Classifier:")
print("Classification Report:\n", classification_report(target_test, predicted_labels_tf))

# Function to display additional performance metrics
def display_performance_metrics(true_labels, predicted_labels):
    print("Accuracy:", accuracy_score(true_labels, predicted_labels))
    print("Precision:", precision_score(true_labels, predicted_labels, average='weighted'))
    print("Recall:", recall_score(true_labels, predicted_labels, average='weighted'))
    print("F1 Score:", f1_score(true_labels, predicted_labels, average='weighted'))

# Show detailed performance metrics
display_performance_metrics(target_test, predicted_labels_tf)


Embedding Techniques

In [None]:
# Word Tokenization with NLTK
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize

# Ensure text column is of type string
train_data['text'] = train_data['text'].astype(str)

# Download required NLTK resources (if not already downloaded)
nltk.download('punkt')

# Apply word tokenization to each text entry in the DataFrame
train_data['tokenized_text'] = train_data['text'].apply(word_tokenize)

# Display the DataFrame with the new tokenized_text column
print(train_data)


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Drop rows with missing values in the 'text' column
train_data.dropna(subset=['text'], inplace=True)

# Function to compute TF-IDF embeddings
def compute_tfidf_embeddings(corpus):
    tfidf_vectorizer = TfidfVectorizer(max_features=5000)
    tfidf_embeddings = tfidf_vectorizer.fit_transform(corpus)
    return tfidf_embeddings

# Extract text data and compute TF-IDF embeddings
text_data = train_data['text'].values
tfidf_embeddings = compute_tfidf_embeddings(text_data)

# Print the TF-IDF embeddings
print(tfidf_embeddings)


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Step 1: TF-IDF Encoding
def compute_tfidf_embeddings(corpus):
    tfidf_vectorizer = TfidfVectorizer(max_features=5000)
    tfidf_embeddings = tfidf_vectorizer.fit_transform(corpus)
    return tfidf_embeddings

# Extract text and target variable
texts = train_data['text'].values
target = train_data['hd'].values

# Compute TF-IDF embeddings
tfidf_embeddings = compute_tfidf_embeddings(texts)

# Step 2: Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(tfidf_embeddings, target, test_size=0.2, random_state=42)

# Step 3: Train Random Forest model
random_forest_model = RandomForestClassifier(random_state=42)
random_forest_model.fit(X_train, y_train)

# Step 4: Predictions and Evaluation
y_pred = random_forest_model.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

# Print evaluation metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
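
The cell above fits the vectorizer on the full corpus before splitting, so vocabulary and document frequencies from the test rows influence the training features. A common alternative is to split the raw text first and fit on the training portion only; in scikit-learn terms that means calling fit_transform on the training text and transform on the test text. The sketch below illustrates the idea with plain count vectors and no external dependencies; the function names and toy sentences are hypothetical.

```python
def fit_vocabulary(train_documents):
    """Learn the term-to-column mapping from the training documents only."""
    terms = sorted({token for doc in train_documents for token in doc.lower().split()})
    return {term: index for index, term in enumerate(terms)}


def transform_counts(documents, vocabulary):
    """Map documents to count vectors; terms unseen during fitting are ignored."""
    vectors = []
    for doc in documents:
        vector = [0] * len(vocabulary)
        for token in doc.lower().split():
            if token in vocabulary:
                vector[vocabulary[token]] += 1
        vectors.append(vector)
    return vectors


# Toy split (assumption): the test sentence contains a word never seen in training
toy_train = ["good morning everyone", "morning chat is quiet"]
toy_test = ["good evening everyone"]

vocab = fit_vocabulary(toy_train)                 # fitted on training text only
train_vectors = transform_counts(toy_train, vocab)
test_vectors = transform_counts(toy_test, vocab)  # reuses the training vocabulary

print(vocab)
print(test_vectors)
```

The word "evening" gets no column because it was never seen during fitting, which mirrors how unseen messages would be handled once the model is deployed.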


In [None]:
import pandas as pd
import numpy as np
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# Extract text data from the training DataFrame
text_data = train_data['text'].values

# Replace NaN values with empty strings
text_data = np.where(pd.isnull(text_data), '', text_data)

# Tokenize the text data
tokenized_texts = [word_tokenize(text) for text in text_data]

# Train a Word2Vec model on the tokenized texts
word2vec_model = Word2Vec(sentences=tokenized_texts, vector_size=100, window=5, min_count=1, workers=4)

# Function to compute average Word2Vec embeddings for each document
def average_word2vec(tokens, model, vocab, vector_dim):
    vec_sum = np.zeros((vector_dim,), dtype="float32")
    num_tokens = 0
    for token in tokens:
        if token in vocab:
            num_tokens += 1
            vec_sum = np.add(vec_sum, model.wv[token])
    if num_tokens > 0:
        vec_sum = np.divide(vec_sum, num_tokens)
    return vec_sum

# Generate average embeddings for each text in the training set
vocabulary_set = set(word2vec_model.wv.index_to_key)
text_embeddings = np.array([average_word2vec(tokens, word2vec_model, vocabulary_set, 100) for tokens in tokenized_texts])

# Print the resulting embeddings
print(text_embeddings)


In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Display the first two rows of the DataFrame and its info
print(train_data.head(2))
train_data.info()

# Ensure 'label' column is treated as string type
train_data['label'] = train_data['label'].astype(str)

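# Initialize OneHotEncoder
# Note: scikit-learn 1.2+ names this parameter sparse_output; older releases use sparse=False
encoder = OneHotEncoder(sparse_output=False, drop='first')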

# Fit and transform the 'label' column to one-hot encoded format
encoded_labels = encoder.fit_transform(train_data[['label']])

# Print the one-hot encoded array
print(encoded_labels)


Model Techniques

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import re

# Load your DataFrame (df)
# df = pd.read_csv('path_to_your_csv.csv')

# Count the number of columns in the DataFrame
column_count = df.shape[1]
print(f"Number of columns: {column_count}")

# Function to preprocess text
def preprocess_text(text):
    # Remove special characters and numbers
    cleaned_text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    cleaned_text = cleaned_text.lower()
    return cleaned_text

# Apply preprocessing to the 'comment' column
df['processed_comment'] = df['comment'].apply(preprocess_text)

# Split the dataset into features (X) and target (y)
X = df['processed_comment']
y = df['label']

# Vectorize the text data (convert text to numerical features)
tfidf_vectorizer = TfidfVectorizer(max_features=1000)  # Adjust max_features as needed
X_tfidf = tfidf_vectorizer.fit_transform(X)

# Print the shape of the data before sampling
print(f"Shape of data before sampling: {X_tfidf.shape}, {y.shape}")

# Split the data into training and testing sets
X_train_tfidf, X_test_tfidf, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Handle class imbalance using RandomOverSampler
oversampler = RandomOverSampler(random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X_train_tfidf, y_train)

# Train a Decision Tree classifier
decision_tree_classifier = DecisionTreeClassifier(random_state=42)
decision_tree_classifier.fit(X_resampled, y_resampled)

# Predict on the test set
y_pred = decision_tree_classifier.predict(X_test_tfidf)

# Evaluate the model
print("Evaluation after Random Oversampling:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))

# Print confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
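
For readers unfamiliar with random oversampling, the sketch below shows the basic idea in plain Python: rows of the smaller class are duplicated at random until every class is as large as the biggest one. It is a simplified illustration rather than imblearn's implementation, and the function name and toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=42):
    """Duplicate randomly chosen minority-class rows until all classes match the largest one."""
    rng = random.Random(seed)

    # Group rows by class label
    rows_by_class = {}
    for row, label in zip(features, labels):
        rows_by_class.setdefault(label, []).append(row)

    target_size = max(len(rows) for rows in rows_by_class.values())

    resampled_features, resampled_labels = [], []
    for label, rows in rows_by_class.items():
        # Draw extra rows with replacement to reach the target size
        extra = [rng.choice(rows) for _ in range(target_size - len(rows))]
        for row in rows + extra:
            resampled_features.append(row)
            resampled_labels.append(label)
    return resampled_features, resampled_labels


# Toy training split (assumption): four rows of class 0 and one row of class 1
toy_features = [[0.1], [0.2], [0.3], [0.4], [0.9]]
toy_labels = [0, 0, 0, 0, 1]

balanced_features, balanced_labels = random_oversample(toy_features, toy_labels)
class_counts = Counter(balanced_labels)
print(class_counts)
```

As in the cell above, resampling is applied to the training split only, so the test set keeps its original class distribution.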


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import re

# Load your DataFrame (data)
# data = pd.read_csv('path_to_your_csv.csv')

# Count the number of columns in the DataFrame
column_count = data.shape[1]
print(f"Number of columns: {column_count}")

# Assuming the last column is the label and the rest are TF-IDF features
X = data.iloc[:, :-1]  # All columns except the last one
y = data.iloc[:, -1]   # The last column

# Print the shape of the data before sampling
print(f"Shape of data before sampling: {X.shape}, {y.shape}")

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Handle class imbalance using RandomOverSampler
oversampler = RandomOverSampler(random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X_train, y_train)

# Train a Decision Tree classifier
decision_tree_classifier = DecisionTreeClassifier(random_state=42)
decision_tree_classifier.fit(X_resampled, y_resampled)

# Predict on the test set
y_pred = decision_tree_classifier.predict(X_test)

# Evaluate the model
print("Evaluation after Random Oversampling:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))

# Print confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from imblearn.over_sampling import RandomOverSampler

# Assuming 'data' is your DataFrame and it contains TF-IDF features and the label as the last column

# Count the number of columns in the DataFrame
column_count = data.shape[1]
print(f"Number of columns: {column_count}")

# Separate features and labels
X = data.iloc[:, :-1]  # All columns except the last one
y = data.iloc[:, -1]   # The last column

# Print the shape of the data before sampling
print(f"Shape of data before sampling: {X.shape}, {y.shape}")

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Handle class imbalance using RandomOverSampler
oversampler = RandomOverSampler(random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X_train, y_train)

# Train a Decision Tree classifier
decision_tree_classifier = DecisionTreeClassifier(random_state=42)
decision_tree_classifier.fit(X_resampled, y_resampled)

# Predict on the test set
y_pred = decision_tree_classifier.predict(X_test)

# Evaluate the model
print("Evaluation after Random Oversampling:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))

# Print confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from imblearn.over_sampling import RandomOverSampler

# Assuming 'data' is your DataFrame and it contains TF-IDF features and the label as the last column

# Count the number of columns in the DataFrame
column_count = data.shape[1]
print(f"Number of columns: {column_count}")

# Separate features and labels
X = data.iloc[:, :-1]  # All columns except the last one
y = data.iloc[:, -1]   # The last column

# Print the shape of the data before sampling
print(f"Shape of data before sampling: {X.shape}, {y.shape}")

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Handle class imbalance using RandomOverSampler
oversampler = RandomOverSampler(random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X_train, y_train)

# Instantiate and train a Decision Tree classifier
decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_resampled, y_resampled)


In [None]:
# Predict the labels for the test set
y_pred = decision_tree.predict(X_test)

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from imblearn.over_sampling import RandomOverSampler

# Assuming 'data' is your DataFrame and it contains TF-IDF features and the label as the last column

# Count the number of columns in the DataFrame
column_count = data.shape[1]
print(f"Number of columns: {column_count}")

# Separate features and labels
X = data.iloc[:, :-1]  # All columns except the last one
y = data.iloc[:, -1]   # The last column

# Print the shape of the data before sampling
print(f"Shape of data before sampling: {X.shape}, {y.shape}")

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Handle class imbalance using RandomOverSampler
oversampler = RandomOverSampler(random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X_train, y_train)

# Instantiate and train a Decision Tree classifier
decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_resampled, y_resampled)

# Predict on the test set
y_pred = decision_tree.predict(X_test)

# Evaluate the classifier
print("Evaluation after Random Oversampling:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("Classification Report:\n", classification_report(y_test, y_pred))

# Print confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)


Random Forest Classifier

In [None]:
import pandas as pd
import ast
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load dataset
data = pd.read_csv('new_processed_dataset.csv')
print(data.head())

# Handle any potential issues with data formats (e.g., strings that need to be converted)
# This assumes that some columns might need conversion from string representations of lists or dictionaries
data['features'] = data['features'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

# Define features and target variable
X = data['tweet']  # Assuming 'tweet' contains the text data
y = data['class']  # Assuming 'class' contains the labels

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline that includes TF-IDF vectorization and Random Forest classification
text_clf_pipeline = make_pipeline(
    TfidfVectorizer(max_features=1000),  # Convert text data to TF-IDF features
    RandomForestClassifier(random_state=42)  # Classifier
)

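# Train the pipeline and report performance on the held-out test set
text_clf_pipeline.fit(X_train, y_train)
y_pred = text_clf_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))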


In [None]:
import pandas as pd

# Load the dataset from the specified CSV file
df = pd.read_csv('/content/cleaned_dataset_combined (2).csv')

# Display the first few rows of the DataFrame to verify successful loading
print("Displaying the first few rows of the DataFrame:")
print(df.head())


In [None]:
import ast

# Convert string representation of list to actual list
df['tweet_tokens'] = df['tweet_tokens'].apply(ast.literal_eval)

# Display the first few rows to verify
print("\nDataFrame after converting tweet_tokens to lists:")
print(df.head())
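
As a brief illustration of why ast.literal_eval is a reasonable choice for this conversion: it parses Python literals only and raises an error on anything else, whereas eval would execute the string as code. The sketch below uses a hypothetical helper, safe_parse_tokens, and made-up input strings.

```python
import ast

# A token list as it typically appears after a round trip through CSV
serialized_tokens = "['hello', 'world']"
tokens = ast.literal_eval(serialized_tokens)


def safe_parse_tokens(value):
    """Parse a stringified token list; return an empty list if it is not a plain list literal."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    return parsed if isinstance(parsed, list) else []


# A string containing a function call is rejected instead of being executed
rejected = safe_parse_tokens("__import__('os').getcwd()")

# Tokens can then be joined back into a single string for vectorization
joined = " ".join(safe_parse_tokens(serialized_tokens))

print(tokens, rejected, joined)
```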

In [None]:
import ast
import pandas as pd

# Load the dataset
df = pd.read_csv('/content/cleaned_dataset_combined (2).csv')

# Define feature and target variables (copy so the assignment below does not act on a view of df)
X = df[['tweet_tokens']].copy()
y = df['class']

# Convert list of tokens into a single string for each row
# 'tweet_tokens' is read from CSV as a string representation of a list, so it is parsed
# with ast.literal_eval (safer than eval, which would execute arbitrary code)
X['tweet_tokens'] = X['tweet_tokens'].apply(lambda tokens: ' '.join(ast.literal_eval(tokens)) if isinstance(tokens, str) else ' '.join(tokens))

# Verify the transformation
print("Transformed DataFrame:")
print(X.head())


In [None]:
import ast
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer

# Load the dataset
df = pd.read_csv('/content/cleaned_dataset_combined (2).csv')

# Define feature and target variables (copy so the assignment below does not act on a view of df)
X = df[['tweet_tokens']].copy()
y = df['class']

# Convert list of tokens into a single string for each row
# If 'tweet_tokens' is a string representation of a list, parse it with ast.literal_eval
X['tweet_tokens'] = X['tweet_tokens'].apply(lambda tokens: ' '.join(ast.literal_eval(tokens)) if isinstance(tokens, str) else ' '.join(tokens))

# Create a ColumnTransformer to apply TfidfVectorizer to the 'tweet_tokens' column
column_transformer = ColumnTransformer(
    transformers=[
        ('tfidf', TfidfVectorizer(), 'tweet_tokens')  # Apply TF-IDF vectorization
    ],
    remainder='passthrough'  # Keep other columns unchanged (though there are none in this case)
)

# Transform the feature data
X_transformed = column_transformer.fit_transform(X)

# Output the shape of the transformed data to verify
print("Shape of the transformed feature data:")
print(X_transformed.shape)


In [None]:
import ast
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier

# Load the dataset
df = pd.read_csv('/content/cleaned_dataset_combined (2).csv')

# Define feature and target variables (copy so the assignment below does not act on a view of df)
X = df[['tweet_tokens']].copy()
y = df['class']

# Convert list of tokens into a single string for each row (parsed with ast.literal_eval instead of eval)
X['tweet_tokens'] = X['tweet_tokens'].apply(lambda tokens: ' '.join(ast.literal_eval(tokens)) if isinstance(tokens, str) else ' '.join(tokens))

# Create a ColumnTransformer to apply TfidfVectorizer to the 'tweet_tokens' column
column_transformer = ColumnTransformer(
    transformers=[
        ('tfidf', TfidfVectorizer(), 'tweet_tokens')  # Apply TF-IDF vectorization
    ],
    remainder='passthrough'  # Keep other columns unchanged (though there are none in this case)
)

# Create a Random Forest classifier pipeline
pipeline = make_pipeline(column_transformer, RandomForestClassifier(random_state=42))

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the first few rows of the training data and target
print("\nTraining Data:")
print(X_train.head())
print("\nTraining Target:")
print(y_train.head())

# Train the model
pipeline.fit(X_train, y_train)

# Evaluate the model
y_pred = pipeline.predict(X_test)

# Display classification report
from sklearn.metrics import classification_report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


In [None]:
import ast
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load the dataset
df = pd.read_csv('/content/cleaned_dataset_combined (2).csv')

# Define feature and target variables (copy so the assignment below does not act on a view of df)
X = df[['tweet_tokens']].copy()
y = df['class']

# Convert list of tokens into a single string for each row (parsed with ast.literal_eval instead of eval)
X['tweet_tokens'] = X['tweet_tokens'].apply(lambda tokens: ' '.join(ast.literal_eval(tokens)) if isinstance(tokens, str) else ' '.join(tokens))

# Create a ColumnTransformer to apply TfidfVectorizer to the 'tweet_tokens' column
column_transformer = ColumnTransformer(
    transformers=[
        ('tfidf', TfidfVectorizer(), 'tweet_tokens')  # Apply TF-IDF vectorization
    ],
    remainder='passthrough'  # Keep other columns unchanged (though there are none in this case)
)

# Create a Random Forest classifier pipeline
pipeline = make_pipeline(column_transformer, RandomForestClassifier(random_state=42))

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
pipeline.fit(X_train, y_train)

# Evaluate the model
accuracy = pipeline.score(X_test, y_test)
print(f'\nAccuracy: {accuracy:.4f}')

# Predict the test set results
y_pred = pipeline.predict(X_test)

# Print classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
