HATE SPEECH IDENTIFICATION - TRISHAA

PIPELINE FOR SENTIMENT ANALYSIS TASK


1. Data Collection: Gather labeled data (e.g., reviews, tweets) for sentiment analysis.
   
2. Data Preprocessing: Clean and preprocess data by removing noise, tokenizing, lowercasing, removing stopwords, and performing stemming/lemmatization.

3. Feature Extraction: Convert text data into numerical features using techniques like Bag-of-Words, TF-IDF, or word embeddings.

4. Model Selection: Choose a suitable model such as Logistic Regression, Naive Bayes, SVM, RNN, CNN, or Transformer-based models like BERT.

5. Model Training: Split data into training and testing sets, train the model on the training data, and tune hyperparameters if necessary.

6. Model Evaluation: Evaluate the trained model's performance on the testing data using metrics like accuracy, precision, recall, and F1-score.

7. Fine-tuning: Refine the model by adjusting hyperparameters or experimenting with different architectures.

8. Deployment: Deploy the trained model into production, integrating it with other systems or applications.
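
As a rough illustration of step 3, the sketch below builds Bag-of-Words counts and a plain TF-IDF weighting for a toy corpus using only the standard library. The corpus, the variable names and the unsmoothed `tf * log(N / df)` formula are assumptions chosen for illustration; library implementations usually add smoothing and normalization.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (a stand-in for preprocessed tweets)
docs = [
    ["good", "day", "good"],
    ["bad", "day"],
]

# Bag-of-Words: raw term counts per document
bow = [Counter(doc) for doc in docs]

# Document frequency: number of documents in which each term occurs
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc_counts, n_docs):
    """Plain tf * log(N / df); no smoothing or length normalization."""
    return {t: c * math.log(n_docs / df[t]) for t, c in doc_counts.items()}

weights = [tfidf(counts, len(docs)) for counts in bow]

print(bow[0])      # counts for the first document
print(weights[0])  # "day" occurs in every document, so its weight is 0.0
```

Terms that appear in every document get zero weight under this formula, which is the point of the IDF factor.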

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec, FastText
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dense


In [2]:

# Download NLTK resources
nltk.download('stopwords')
nltk.download('punkt')

# Load hate speech dataset (assuming it's in CSV format)
data = pd.read_csv('/labeled_data.csv')



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [4]:
# Data Cleaning and Preprocessing
STOP_WORDS = set(stopwords.words('english'))  # Build the stopword set once instead of once per token

def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    tokens = word_tokenize(text)  # Tokenization
    tokens = [word for word in tokens if word.isalnum()]  # Keep only purely alphanumeric tokens
    tokens = [word for word in tokens if word not in STOP_WORDS]  # Remove stopwords
    return tokens

data['clean_text'] = data['tweet'].apply(preprocess_text)



In [5]:
data['clean_text']

0        [rt, mayasolovely, woman, complain, cleaning, ...
1        [rt, mleew17, boy, dats, cold, tyga, dwn, bad,...
2        [rt, urkindofbrand, dawg, rt, 80sbaby4life, ev...
3                                 [rt, look, like, tranny]
4        [rt, shenikaroberts, shit, hear, might, true, ...
                               ...                        
24778    [muthaf, lie, 8220, lifeasking, right, tl, tra...
24779    [gone, broke, wrong, heart, baby, drove, redne...
24780    [young, buck, wan, na, eat, dat, nigguh, like,...
24781             [youu, got, wild, bitches, tellin, lies]
24782    [ntac, eileen, dahlia, beautiful, color, combi...
Name: clean_text, Length: 24783, dtype: object
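
Tokens such as 8220 and user names such as mayasolovely in the output above look like leftovers of HTML character references (for example `&#8220;`) and @-mentions whose punctuation was split off during tokenization. The sketch below shows one possible pre-cleaning step that could run before `word_tokenize`. The function name `strip_entities_and_handles` and the sample string are hypothetical, and whether the raw tweets actually contain such references is an assumption.

```python
import html
import re

def strip_entities_and_handles(text):
    """Hypothetical pre-cleaning step, intended to run before tokenization."""
    text = html.unescape(text)          # "&#8220;" becomes a quote character, "&amp;" becomes "&"
    text = re.sub(r"@\w+", " ", text)   # drop @user handles
    text = re.sub(r"\s+", " ", text)    # collapse runs of whitespace
    return text.strip()

sample = "RT @some_user: &#8220;so cold&#8221; &#128536;"
cleaned = strip_entities_and_handles(sample)
print(cleaned)
```

After this step the numeric codes no longer survive the `isalnum()` filter as standalone tokens, because they have been decoded into punctuation or emoji characters.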

In [6]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data['clean_text'], data['class'], test_size=0.2, random_state=42)



WORD2VEC

In [8]:
# Word2Vec: train 100-dimensional embeddings on the training tokens only
word2vec_model = Word2Vec(sentences=X_train, vector_size=100, window=5, min_count=1, workers=4)
# Represent each tweet as the mean of its word vectors (a zero vector if no token is in the vocabulary)
X_train_word2vec = np.array([np.mean([word2vec_model.wv[word] for word in words if word in word2vec_model.wv] or [np.zeros(100)], axis=0) for words in X_train])
X_test_word2vec = np.array([np.mean([word2vec_model.wv[word] for word in words if word in word2vec_model.wv] or [np.zeros(100)], axis=0) for words in X_test])

print(word2vec_model)

Word2Vec<vocab=25473, vector_size=100, alpha=0.025>
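
The feature construction above packs mean pooling into a single expression. The sketch below spells out the same idea on toy data. The dictionary `toy_vectors` stands in for `word2vec_model.wv`, and the helper name `mean_pool` is hypothetical.

```python
import numpy as np

# Toy stand-in for word2vec_model.wv: a plain dict of 4-dimensional vectors
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0, 0.0]),
    "day": np.array([3.0, 2.0, 0.0, 0.0]),
}

def mean_pool(tokens, vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

doc_a = mean_pool(["good", "day", "unseen"], toy_vectors, dim=4)
doc_b = mean_pool(["unseen"], toy_vectors, dim=4)
print(doc_a)  # [2. 1. 1. 0.]
print(doc_b)  # [0. 0. 0. 0.]
```

Averaging discards word order, so two tweets with the same words in a different order map to the same feature vector.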


In [25]:
from sklearn.metrics import confusion_matrix, classification_report

# Predict labels for Word2Vec + Logistic Regression
y_pred_word2vec = logreg_word2vec.predict(X_test_word2vec)

# Evaluate Word2Vec + Logistic Regression Accuracy
word2vec_accuracy = accuracy_score(y_test, y_pred_word2vec)
print("Word2Vec + Logistic Regression Accuracy:", word2vec_accuracy)


Word2Vec + Logistic Regression Accuracy: 0.8355860399435142
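
The cell that fits `logreg_word2vec` does not appear in this export. The sketch below shows how such a classifier is commonly fit with scikit-learn's `LogisticRegression`. The synthetic arrays `X_demo` and `y_demo` stand in for `X_train_word2vec` and `y_train`, and `max_iter=1000` is an assumption, not the setting used for the results above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for 100-dimensional pooled embeddings with three classes
X_demo = rng.normal(size=(60, 100))
y_demo = np.repeat([0, 1, 2], 20)
X_demo[np.arange(60), y_demo] += 3.0  # shift one feature per class so the classes are separable

# Assumed settings: scikit-learn defaults plus a higher iteration cap
logreg_demo = LogisticRegression(max_iter=1000)
logreg_demo.fit(X_demo, y_demo)

pred_demo = logreg_demo.predict(X_demo)
train_acc = float((pred_demo == y_demo).mean())
print("coefficient matrix shape:", logreg_demo.coef_.shape)
print("training accuracy on the synthetic data:", train_acc)
```

With the real features, the same two calls would be `fit(X_train_word2vec, y_train)` followed by `predict(X_test_word2vec)`.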


In [17]:
# Print Word Representations from Word2Vec
print("Word Representations from Word2Vec:")
for word, representation in zip(word2vec_model.wv.index_to_key[:10], word2vec_model.wv.vectors[:10]):
    print(word, representation)



Word Representations from Word2Vec:
bitch [-0.30385545  1.3490597   0.24555592  0.37775442 -0.24384387 -2.0314631
  1.1677946   2.4828095  -0.67582285 -0.7674058  -0.18649688 -1.5680487
 -0.3430366   0.53939945 -0.07178079 -1.1282406   0.6195518  -0.6651011
 -0.05511921 -2.1791525   0.5110197   0.7341875   1.1283576  -0.23003629
 -0.23346585  0.34712282 -0.62616545 -0.57454044 -1.2170774  -0.1798486
  0.990339    0.24647652  0.35013536 -1.3273172  -0.6314316   1.0967523
  0.48773274 -0.62207764 -1.0021445  -2.1685905   0.3221075  -0.5910369
 -0.5623844  -0.20302433  0.1809369  -0.7670203  -1.0036834  -0.10144753
  0.1408536   1.028045    0.42124674 -0.6486839  -0.1117289  -0.41965684
 -0.4530616   0.701973    1.0463613  -0.406821   -1.1287614   0.5009816
  0.07095814  0.3114     -0.04395135 -0.10771189 -1.0333594   1.2207897
  0.46354604  0.8860421  -2.0075004   1.3416872  -0.45429212  0.5074623
  1.1953722  -0.9097047   1.1476848   0.17832312  0.6698421   0.0241885
 -0.7836555   0.396

In [27]:
# Analyze misclassified instances for Word2Vec + Logistic Regression
wrong_word2vec = y_test != y_pred_word2vec  # True where the prediction differs from the label
misclassified_word2vec = X_test[wrong_word2vec]
true_labels_word2vec = y_test[wrong_word2vec]
predicted_labels_word2vec = y_pred_word2vec[wrong_word2vec]
misclassified_df_word2vec = pd.DataFrame({'Text': misclassified_word2vec, 'True Label': true_labels_word2vec, 'Predicted Label': predicted_labels_word2vec})
print("Misclassified instances for Word2Vec + Logistic Regression:")
print(misclassified_df_word2vec)




Misclassified instances for Word2Vec + Logistic Regression:
                                                    Text  True Label  \
18943  [rt, 1inkkofrosess, lol, credit, ai, near, goo...           2   
4273            [search, gay, redneck, episode, 1, play]           0   
15789  [rt, jsu, coach, omar, johnson, u, ball, u, th...           2   
11311  [tryna, get, sleep, birds, start, getting, rowdy]           2   
...                                                  ...         ...   
4767   [stevestockmantx, hes, friggin, idiot, say, an...           0   
10959                        [think, eat, brownie, pass]           2   
20979  [real, unreal, lol, yankees, worldseries, 27an...           2   
7339                       [xcorey21, uh, trash, 128536]           1   
3310   [grizzboadams, wyattnuckels, haha, ight, nig, ...           0   

       Predicted Label  
18943                1  
4273                 2  
3778                 1  
15789                1  
11311                1

In [28]:
# Generate confusion matrix and classification report for Word2Vec + Logistic Regression
print("Confusion Matrix for Word2Vec + Logistic Regression:")
print(confusion_matrix(y_test, y_pred_word2vec))
print("\nClassification Report for Word2Vec + Logistic Regression:")
print(classification_report(y_test, y_pred_word2vec))

Confusion Matrix for Word2Vec + Logistic Regression:
[[   0  220   70]
 [   0 3674  158]
 [   0  367  468]]

Classification Report for Word2Vec + Logistic Regression:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       290
           1       0.86      0.96      0.91      3832
           2       0.67      0.56      0.61       835

    accuracy                           0.84      4957
   macro avg       0.51      0.51      0.51      4957
weighted avg       0.78      0.84      0.80      4957



  _warn_prf(average, modifier, msg_start, len(result))
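
The per-class figures in the report can be re-derived by hand from the confusion matrix printed above. The sketch below does this in plain Python, assuming rows are true classes and columns are predicted classes (the scikit-learn convention). Class 0 is never predicted, so its precision is 0/0, which is shown here as 0.0.

```python
# Confusion matrix copied from the output above: rows = true class, columns = predicted class
cm = [
    [0, 220, 70],
    [0, 3674, 158],
    [0, 367, 468],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(n_classes)) / total

report = {}
for k in range(n_classes):
    tp = cm[k][k]
    predicted_k = sum(cm[i][k] for i in range(n_classes))  # column sum: times class k was predicted
    actual_k = sum(cm[k])                                  # row sum: true members of class k
    precision = tp / predicted_k if predicted_k else 0.0   # undefined 0/0 is shown as 0.0
    recall = tp / actual_k if actual_k else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    report[k] = (precision, recall, f1)

print(f"accuracy = {accuracy:.4f}")
for k, (p, r, f) in report.items():
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The first column of the matrix is all zeros, which is why the 0.84 accuracy hides the fact that no class 0 example is ever identified.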


FASTTEXT REPRESENTATION

In [9]:
# FastText: train 100-dimensional subword-aware embeddings on the training tokens only
fasttext_model = FastText(sentences=X_train, vector_size=100, window=5, min_count=1, workers=4)
# Represent each tweet as the mean of its word vectors (a zero vector if no vector is available)
X_train_fasttext = np.array([np.mean([fasttext_model.wv[word] for word in words if word in fasttext_model.wv] or [np.zeros(100)], axis=0) for words in X_train])
X_test_fasttext = np.array([np.mean([fasttext_model.wv[word] for word in words if word in fasttext_model.wv] or [np.zeros(100)], axis=0) for words in X_test])
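
FastText differs from Word2Vec in that a word vector is composed from character n-grams, so spelling variants share part of their representation. The sketch below only illustrates the n-gram extraction idea. The helper `char_ngrams` is hypothetical, and the 3 to 6 character range is assumed here to mirror the library defaults.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("cat", 3, 3))  # ['<ca', 'cat', 'at>']

# Spelling variants share many n-grams, so their vectors share components
shared = set(char_ngrams("tellin")) & set(char_ngrams("telling"))
print(sorted(shared))
```

This sharing is what lets informal spellings common in tweets land near their standard forms.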



In [18]:

# Print Word Representations from FastText
print("Word Representations from FastText:")
for word, representation in zip(fasttext_model.wv.index_to_key[:10], fasttext_model.wv.vectors[:10]):
    print(word, representation)

Word Representations from FastText:
bitch [-1.2081189   0.48140657 -0.45500606  1.2237934   1.1582677  -0.05025363
  0.7210156   0.9125101   0.8744453  -1.0732038  -0.8917567   0.20820488
 -0.921073    1.3633486   0.27999055  0.07162543  0.4194755   0.13796261
 -0.3434904  -1.5783442  -1.1715018   0.41978264 -0.30339175  0.68986714
 -0.99670464 -0.4082996  -0.22610721 -0.2613036   1.3433273   0.05678868
 -0.5205509  -0.47839132  0.5786245  -0.20522532  0.21177533  0.7629824
  0.15956216  0.18594545 -1.0617661   1.0013833   0.47804418 -1.0387074
 -0.5291364  -0.74315053 -0.1634434  -0.7592818  -0.9658449  -0.5481073
  0.5796927   0.2616041   0.23131043  0.05281986  1.3860446   0.3047956
 -0.5449317  -0.17371115  0.32753363  0.9444184  -0.83536357 -0.223976
 -0.31982696 -0.36510965 -1.398876    1.8386248   0.09219341  0.9639278
 -0.11818993 -0.3380057  -0.6112552   0.91946375  0.2665039  -0.15656362
  0.29125014 -0.46849912 -0.14190103  0.6191236   0.71286213  0.37444732
 -0.07689146  0.

In [29]:
# Evaluate FastText + Logistic Regression
y_pred_fasttext = logreg_fasttext.predict(X_test_fasttext)
fasttext_accuracy = accuracy_score(y_test, y_pred_fasttext)
print("FastText + Logistic Regression Accuracy:", fasttext_accuracy)

# Analyze misclassified instances for FastText + Logistic Regression
wrong_fasttext = y_test != y_pred_fasttext  # True where the prediction differs from the label
misclassified_fasttext = X_test[wrong_fasttext]
true_labels_fasttext = y_test[wrong_fasttext]
predicted_labels_fasttext = y_pred_fasttext[wrong_fasttext]
misclassified_df_fasttext = pd.DataFrame({'Text': misclassified_fasttext, 'True Label': true_labels_fasttext, 'Predicted Label': predicted_labels_fasttext})
print("Misclassified instances for FastText + Logistic Regression:")
print(misclassified_df_fasttext)

# Generate confusion matrix and classification report for FastText + Logistic Regression
print("Confusion Matrix for FastText + Logistic Regression:")
print(confusion_matrix(y_test, y_pred_fasttext))
print("\nClassification Report for FastText + Logistic Regression:")
print(classification_report(y_test, y_pred_fasttext))

FastText + Logistic Regression Accuracy: 0.8355860399435142
Misclassified instances for FastText + Logistic Regression:
                                                    Text  True Label  \
18943  [rt, 1inkkofrosess, lol, credit, ai, near, goo...           2   
4273            [search, gay, redneck, episode, 1, play]           0   
15789  [rt, jsu, coach, omar, johnson, u, ball, u, th...           2   
11311  [tryna, get, sleep, birds, start, getting, rowdy]           2   
...                                                  ...         ...   
10959                        [think, eat, brownie, pass]           2   
20979  [real, unreal, lol, yankees, worldseries, 27an...           2   
7339                       [xcorey21, uh, trash, 128536]           1   
20769  [unfollowed, said, cried, watching, dawn, apes...           2   
3310   [grizzboadams, wyattnuckels, haha, ight, nig, ...           0   

       Predicted Label  
18943                1  
4273                 2  
3778        

  _warn_prf(average, modifier, msg_start, len(result))


CNN AND RNN

In [10]:
# CNN
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
X_train_cnn = tokenizer.texts_to_sequences(X_train)
X_test_cnn = tokenizer.texts_to_sequences(X_test)
vocab_size = len(tokenizer.word_index) + 1
maxlen = 100
X_train_cnn = pad_sequences(X_train_cnn, padding='post', maxlen=maxlen)
X_test_cnn = pad_sequences(X_test_cnn, padding='post', maxlen=maxlen)
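
The padding step above turns variable-length id lists into fixed-length rows. The pure-Python sketch below imitates `padding='post'` together with the Keras default `truncating='pre'`. The helper `pad_post` is hypothetical and handles a single sequence. Tokenizer ids start at 1, so 0 is free to serve as the padding id, which is also why `vocab_size` adds 1.

```python
def pad_post(seq, maxlen, value=0):
    """Imitate pad_sequences(..., padding='post') with the default truncating='pre'."""
    seq = list(seq)[-maxlen:]                   # too long: keep the last maxlen ids
    return seq + [value] * (maxlen - len(seq))  # too short: append padding ids

print(pad_post([5, 8, 2], maxlen=5))           # [5, 8, 2, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```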


In [11]:

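# RNN: reuse the CNN's padded sequences (padding arrays that already have length maxlen leaves them unchanged)
X_train_rnn = pad_sequences(X_train_cnn, padding='post', maxlen=maxlen)
X_test_rnn = pad_sequences(X_test_cnn, padding='post', maxlen=maxlen)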


In [12]:

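# Define CNN model
cnn_model = Sequential()
cnn_model.add(Embedding(input_dim=vocab_size, output_dim=100, input_length=maxlen))
cnn_model.add(Conv1D(filters=64, kernel_size=3, activation='relu'))
cnn_model.add(MaxPooling1D(pool_size=2))
# No Flatten or global pooling layer follows, so the Dense layers are applied at every time step
cnn_model.add(Dense(10, activation='relu'))
# NOTE: a single sigmoid unit with binary cross-entropy models two classes, but 'class' takes
# the values 0, 1 and 2; a 3-unit softmax with sparse_categorical_crossentropy would match the labels
cnn_model.add(Dense(1, activation='sigmoid'))
cnn_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train CNN model (the test split doubles as validation data here)
cnn_model.fit(X_train_cnn, y_train, epochs=10, batch_size=64, validation_data=(X_test_cnn, y_test))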


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7c326e32feb0>

In [13]:

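# Define RNN model
rnn_model = Sequential()
rnn_model.add(Embedding(input_dim=vocab_size, output_dim=100, input_length=maxlen))
rnn_model.add(LSTM(100))
# NOTE: a single sigmoid unit with binary cross-entropy models two classes, but 'class' takes
# the values 0, 1 and 2; a 3-unit softmax with sparse_categorical_crossentropy would match the labels
rnn_model.add(Dense(1, activation='sigmoid'))
rnn_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train RNN model (the test split doubles as validation data here)
rnn_model.fit(X_train_rnn, y_train, epochs=10, batch_size=64, validation_data=(X_test_rnn, y_test))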


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7c326884ca00>

In [23]:



# Evaluate CNN
cnn_loss, cnn_accuracy = cnn_model.evaluate(X_test_cnn, y_test)
print("CNN Accuracy:", cnn_accuracy)

# Evaluate RNN
rnn_loss, rnn_accuracy = rnn_model.evaluate(X_test_rnn, y_test)
print("RNN Accuracy:", rnn_accuracy)


CNN Accuracy: 0.7720804214477539
RNN Accuracy: 0.7730482220649719
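
The RNN accuracy above coincides with the share of class 1 in the test set, which suggests, though it does not prove, that the network predicts class 1 for every tweet. The sketch below checks the arithmetic using the support counts from the Word2Vec classification report earlier in this notebook. The "binary ceiling" rests on the assumption that a thresholded sigmoid output can only yield 0 or 1, so class 2 is always counted as wrong.

```python
# Test-set support per class, taken from the classification report earlier in this notebook
support = {0: 290, 1: 3832, 2: 835}
n_test = sum(support.values())

# Accuracy of a model that always predicts the majority class (class 1)
majority_baseline = support[1] / n_test

# Best accuracy reachable when predictions can only be 0 or 1 (class 2 is always missed)
binary_ceiling = (support[0] + support[1]) / n_test

print(f"majority baseline: {majority_baseline:.7f}")
print(f"binary ceiling:    {binary_ceiling:.4f}")
```

Both neural models score below the 0.8356 reached by the logistic regression models, so their results are better read as a baseline than as a comparison of architectures.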


In [19]:
# Get Embedding Layer Output for CNN
embedding_output_cnn = cnn_model.layers[0](X_test_cnn)
print("Embedding Output Shape (CNN):", embedding_output_cnn.shape)


Embedding Output Shape (CNN): (4957, 100, 100)


In [20]:

# Get Embedding Layer Output for RNN
embedding_output_rnn = rnn_model.layers[0](X_test_rnn)
print("Embedding Output Shape (RNN):", embedding_output_rnn.shape)


Embedding Output Shape (RNN): (4957, 100, 100)
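
The shape (4957, 100, 100) reads as (tweets, token positions, embedding dimensions). An embedding layer is in effect a row lookup into a weight matrix, which the NumPy sketch below reproduces on toy data. The vocabulary size of 50, the batch of 4 sequences and the random weights are assumptions for illustration only.

```python
import numpy as np

vocab_size, embed_dim, maxlen = 50, 100, 100  # toy vocabulary; the two 100s mirror the notebook
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

# Four padded sequences of token ids
token_ids = rng.integers(0, vocab_size, size=(4, maxlen))

# Row lookup: one 100-dimensional vector per token position
embedded = embedding_matrix[token_ids]
print(embedded.shape)  # (4, 100, 100)
```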
