**Dataset**
Labeled dataset collected from Twitter (Hate Speech.tsv)

**Objective**
Distinguish tweets that contain hate speech from all other tweets. <br>
0 -> no hate speech <br>
1 -> contains hate speech <br>

**Evaluation metric**
Macro F1 score

**Steps**

To classify hate speech in tweets, follow these key steps:

1. **Data Preprocessing**: Clean text (remove punctuation, stopwords, etc.), lowercase, tokenize, and so on.
2. **Text Representation**: Use Bag of Words, TF-IDF, or word embeddings (e.g., GloVe, Word2Vec, or FastText).
3. **Modeling Approaches**:
   - **Traditional Models**: Logistic Regression, Naive Bayes, SVM, Random Forest.
   - **Deep Learning**: LSTM or RNN.
4. **Evaluation**
5. **Optimization**: Use hyperparameter tuning, regularization, and ensemble methods for better performance.


### Import used libraries

In [1]:
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 500)

### Load Dataset

###### Note: search how to load the data from tsv file

In [2]:
df = pd.read_csv("Hate Speech.tsv", sep= "\t")
df.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1,2,0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
4,5,0,factsguide: society now #motivation


In [3]:
len(df)

31535

### Data splitting

It is good practice to split the data before EDA. Doing so helps maintain the integrity of the machine learning process, prevents data leakage, simulates real-world scenarios more accurately, and ensures a reliable evaluation of model performance on unseen data.

In [4]:
from sklearn.model_selection import train_test_split

train_df, rest_df = train_test_split(df, test_size=0.3)
val_df, test_df = train_test_split(rest_df, test_size=0.5)

print(len(train_df), len(val_df), len(test_df))

22074 4730 4731
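
Because the positive class is rare, a purely random split can leave the three subsets with slightly different class ratios, and the result changes on every run. The following is a minimal sketch, not the notebook's own split: it uses a made-up `toy_df` frame and an arbitrary `random_state=42` seed (both assumptions) to show how the `stratify` and `random_state` arguments of `train_test_split` keep the label ratio equal across subsets and make the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the tweet data: 1000 rows, 7% labelled as hate speech (assumption)
toy_df = pd.DataFrame({
    "tweet": [f"tweet {i}" for i in range(1000)],
    "label": [1] * 70 + [0] * 930,
})

# 70% train, then split the remaining 30% evenly into validation and test,
# keeping the label ratio the same in every subset
toy_train, toy_rest = train_test_split(
    toy_df, test_size=0.3, stratify=toy_df["label"], random_state=42
)
toy_val, toy_test = train_test_split(
    toy_rest, test_size=0.5, stratify=toy_rest["label"], random_state=42
)

# Print the size and the share of positive labels for each subset
for name, part in [("train", toy_train), ("val", toy_val), ("test", toy_test)]:
    print(name, len(part), round(part["label"].mean(), 3))
```

Applied to the real data, the same two arguments could be passed with `df["label"]` and `rest_df["label"]` as the stratification targets.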


### EDA on training data

- check NaNs

In [5]:
print(f"{train_df.isna().sum()}\n{val_df.isna().sum()}\n{test_df.isna().sum()}")

id       0
label    0
tweet    0
dtype: int64
id       0
label    0
tweet    0
dtype: int64
id       0
label    0
tweet    0
dtype: int64


- check duplicates

In [6]:
df.duplicated().sum()

0

- show a representative sample of data texts to find out required preprocessing steps

In [7]:
sampled_data = df.groupby('label').apply(lambda x: x.sample(min(10, len(x)))).reset_index(drop=True)

sampled_data[['id', 'label', 'tweet']]


Unnamed: 0,id,label,tweet
0,23887,0,"@user the dog kind of way to say: âdon't worry, be !â #dogsarejoy"
1,26141,0,my little bro may have just beat me in basketball
2,4522,0,good for you || ms #grunge #nature #rad #awsome #sun #photo #iphone #nofilter #eahâ¦
3,25043,0,@user kids loved the drums. mom couldn`t close the deal with salesmen #bengaluru street market #wednesday
4,21251,0,happy feet. #throwback #holiday #solotrip #solooverseastrip #australia #ausboundâ¦
5,8541,0,the fun pa #sarcasm #moving #thetaylorway #chaos
6,898,0,"python27 and concurrency are not best friends, all the code i have 2 restructure to get concurrency with celery #developers #python"
7,23400,0,@user @user @user i had to #factcheck this cuz i was like #wtf can't b true but guess #cnn #media like #meth
8,4192,0,"ah w'd rather be happy than , cuss it all t' tarnation. varejao more &gt;&gt;"
9,31524,0,#life #love be #enjoy #appreciate #des'tee


- check dataset balancing

In [8]:
print("Normal speech: ", df.label.value_counts()[0]/len(df)*100)
print("Hate speech: ", df.label.value_counts()[1]/len(df)*100)

Normal speech:  92.98240050737276
Hate speech:  7.01759949262724


- The required cleaning and preprocessing steps are:
    - Lowercasing
    - Remove user mentions
    - Remove URLs
    - Remove special characters and punctuation
    - Tokenize text
    - Remove stopwords
    - Handling emojis
    - Lemmatization
    - Handle abbreviations and slang

### Cleaning and Preprocessing

In [9]:
import re
import emoji
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.base import BaseEstimator, TransformerMixin
import nltk

nltk.download('stopwords')
nltk.download('wordnet')

slang_dict = {
    "u": "you",
    "ur": "your",
    "r": "are",
    "y": "why",
    "pls": "please",
    "plz": "please",
    "thx": "thanks",
    "ty": "thank you",
    "dm": "direct message",
    "btw": "by the way",
    "brb": "be right back",
    "idk": "I don't know",
    "ikr": "I know right",
    "smh": "shaking my head",
    "imo": "in my opinion",
    "imho": "in my humble opinion",
    "omg": "oh my god",
    "lol": "laughing out loud",
    "lmao": "laughing my ass off",
    "rofl": "rolling on the floor laughing",
    "wtf": "what the fuck",
    "wth": "what the heck",
    "afaik": "as far as I know",
    "asap": "as soon as possible",
    "fyi": "for your information",
    "tbh": "to be honest",
    "np": "no problem",
    "bc": "because",
    "b/c": "because",
    "b4": "before",
    "cuz": "because",
    "gf": "girlfriend",
    "bf": "boyfriend",
    "bff": "best friends forever",
    "gr8": "great",
    "l8r": "later",
    "tho": "though",
    "thru": "through",
    "msg": "message",
    "txt": "text",
    "omw": "on my way",
    "fml": "fuck my life",
    "nvm": "never mind",
    "bday": "birthday",
    "tbt": "throwback Thursday",
    "icymi": "in case you missed it",
    "irl": "in real life",
    "ppl": "people",
    "bffl": "best friends for life",
    "jk": "just kidding",
    "xoxo": "hugs and kisses",
    "idc": "I don't care",
    "ily": "I love you",
    "ilu": "I love you",
    "omfg": "oh my fucking god",
    "srsly": "seriously",
    "ikr": "I know right",
    "fam": "family",
    "bae": "before anyone else",
    "hmu": "hit me up",
    "gg": "good game",
    "tmi": "too much information",
    "ftw": "for the win",
    "lit": "exciting or fun",
    "s/o": "shoutout",
    "irl": "in real life",
    "imo": "in my opinion",
    "hbd": "happy birthday",
    "atm": "at the moment",
    "qotd": "quote of the day",
    "rn": "right now",
    "tfw": "that feeling when",
    "yolo": "you only live once",
    "wyd": "what are you doing",
    "wya": "where you at",
    "bb": "baby",
    "luv": "love",
    "gr8": "great",
    "m8": "mate",
    "obvi": "obviously",
    "def": "definitely",
    "jk": "just kidding"
}

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\yara.mahfouz\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\yara.mahfouz\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
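
One detail of the slang dictionary is worth checking: keys such as `b/c` and `s/o` contain a slash, so they can only match if slang is expanded before punctuation is stripped. The sketch below uses a small excerpt of the dictionary and two hypothetical helpers (`expand_slang`, `strip_non_letters`) to illustrate how the order of the two steps changes the result. It is an illustration of the ordering issue only, not the notebook's preprocessing code.

```python
import re

# Excerpt of the slang dictionary above
slang_excerpt = {"b/c": "because", "bc": "because", "s/o": "shoutout", "u": "you"}

def expand_slang(text, mapping):
    """Replace whitespace-separated tokens that appear in the mapping."""
    return " ".join(mapping.get(token, token) for token in text.split())

def strip_non_letters(text):
    """Drop everything except ASCII letters and whitespace."""
    return re.sub(r"[^a-zA-Z\s]", "", text)

raw = "s/o to u b/c of the help"

# Expanding first lets the slash-containing keys match
expand_then_strip = strip_non_letters(expand_slang(raw, slang_excerpt))

# Stripping first turns "s/o" into "so", which is no longer a slang key;
# "b/c" becomes "bc", which still matches only because "bc" is also a key
strip_then_expand = expand_slang(strip_non_letters(raw), slang_excerpt)

print(expand_then_strip)
print(strip_then_expand)
```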


#### Use custom scikit-learn Transformers

Using custom transformers in scikit-learn provides flexibility, reusability, and control over the data transformation process, allowing you to seamlessly integrate with scikit-learn's pipelines, enabling you to combine multiple preprocessing steps and modeling into a single workflow. This makes your code more modular, readable, and easier to maintain.

##### link: https://www.andrewvillazon.com/custom-scikit-learn-transformers/

#### Example usage:

In [10]:
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.stop_words = set(stopwords.words('english'))
        self.lemmatizer = WordNetLemmatizer()
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # Define a function to handle the text preprocessing
        def preprocess_tweet(text):
            # Lowercasing
            text = text.lower()
            
            # Remove Mentions
            text = re.sub(r'@\w+', '', text)
            
            # Remove URLs
            text = re.sub(r'http\S+|www\.\S+', '', text)

            # Handle Emojis (convert to text) before non-letters are stripped,
            # otherwise the emojis are already deleted and nothing is converted
            text = emoji.demojize(text, delimiters=(" ", " ")).replace("_", " ")

            # Remove Special Characters and Punctuation
            text = re.sub(r'[^a-zA-Z\s]', '', text)

            # Tokenize Text
            words = text.split()

            # Remove Stop Words
            words = [word for word in words if word not in self.stop_words]
            
            # Lemmatization
            words = [self.lemmatizer.lemmatize(word) for word in words]
            
            # Replace Slang
            words = [slang_dict[word] if word in slang_dict else word for word in words]
            
            # Reconstruct the processed tweet
            return " ".join(words)
        
        # Apply the preprocessing to each tweet in the 'tweet' column
        X = X.copy()
        
        return X.apply(preprocess_tweet)

    
    def fit_transform(self, X, y=None):
        self.fit(X, y)
        return self.transform(X)


**You are doing great so far!**

### Modelling

#### Extra: use scikit-learn pipeline

##### link: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

Using pipelines in scikit-learn promotes better code organization, reproducibility, and efficiency in machine learning workflows.

#### Example usage:

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# Create the pipeline
pipeline = Pipeline(steps=[
    ('preprocessing', CustomTransformer()),
    ('vectorizing', TfidfVectorizer()),
    ('model', model),
])

# Split your data into inputs and labels
X_train = train_df['tweet']  # Raw tweet text as input
y_train = train_df['label']  # Labels for classification

X_val = val_df['tweet']
y_val = val_df['label']

# Fit the pipeline on training data
pipeline.fit(X_train, y_train)

# Predict on the validation set
val_predictions = pipeline.predict(X_val)

val_accuracy = pipeline.score(X_val, y_val)
print("Validation Accuracy:", val_accuracy)


Validation Accuracy: 0.9439746300211417


In [12]:
X_test = test_df['tweet']  
y_test = test_df['label']  

test_predictions = pipeline.predict(X_test)

test_accuracy = pipeline.score(X_test, y_test)
print("Test Accuracy:", test_accuracy)

Test Accuracy: 0.9492707672796449


#### Evaluation

**Evaluation metric:**
macro f1 score

Macro F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its size. This makes it a useful metric for evaluating the overall performance of a classification model, **particularly when the classes are imbalanced**.

![Calculation](https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/639c3d934e82c1195cdf3c60_macro-f1.webp)
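
To make the calculation concrete, here is a small pure-Python sketch with hypothetical helper names (`f1_for_class`, `macro_f1`). It uses made-up labels with the same 93/7 class ratio reported in the EDA above and a classifier that always predicts the majority class, which shows why accuracy alone is misleading on this dataset.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score when `positive` is treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# 93 normal tweets and 7 hateful ones (illustrative labels, not the real data)
y_true = [0] * 93 + [1] * 7
# A classifier that always predicts the majority class
y_pred_majority = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred_majority)) / len(y_true)
print("accuracy:", accuracy)
print("macro F1:", round(macro_f1(y_true, y_pred_majority), 3))
```

The accuracy looks high because the majority class dominates, while the macro F1 is pulled down by the zero F1 of the class that is never predicted.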

In [13]:
from sklearn.metrics import f1_score

test_predictions = pipeline.predict(X_test)

# Calculate the macro F1 score
test_macro_f1 = f1_score(y_test, test_predictions, average='macro')
print("Test Macro F1 Score:", test_macro_f1)


Test Macro F1 Score: 0.7211947393751289


### Enhancement

- Using different text representation or modeling techniques
- Hyperparameter tuning

- The basic implementation above produced a low macro F1 score (0.72) despite its high accuracy (0.95) because of class imbalance. I will try to fix this by adjusting the class weights.
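
For reference, scikit-learn documents `class_weight='balanced'` as weighting each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula with NumPy on made-up labels (`y_toy`, an assumption with a 93/7 ratio) to show how much more a minority-class error costs under this setting.

```python
import numpy as np

# Toy labels with a 93/7 class ratio (assumption for illustration)
y_toy = np.array([0] * 93 + [1] * 7)

n_samples = len(y_toy)
class_counts = np.bincount(y_toy)
n_classes = len(class_counts)

# Each class weight is inversely proportional to the class frequency
balanced_weights = n_samples / (n_classes * class_counts)
print(dict(enumerate(balanced_weights.round(3).tolist())))
```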

In [14]:
from sklearn.model_selection import GridSearchCV
model = LogisticRegression(class_weight='balanced')
pipeline = Pipeline(steps=[
    ('preprocessing', CustomTransformer()),
    ('vectorizing', TfidfVectorizer()),
    ('model', model),
])

param_grid = {
    'model__C': [0.01, 0.1, 1, 10],
    'model__penalty': ['l2'],
}

grid_search = GridSearchCV(pipeline, param_grid, scoring='f1_macro', cv=5, n_jobs=1)

grid_search.fit(X_train, y_train)

best_pipeline = grid_search.best_estimator_
print("Best parameters:", grid_search.best_params_)

test_predictions = best_pipeline.predict(X_test)
test_macro_f1 = f1_score(y_test, test_predictions, average='macro')
print("Test Macro F1 Score with best parameters:", test_macro_f1)


Best parameters: {'model__C': 10, 'model__penalty': 'l2'}
Test Macro F1 Score with best parameters: 0.8498494058768788


- The results are much better, but there is still room for improvement. Let's try a different model.

In [15]:
from sklearn.svm import SVC


svm_model = SVC(kernel='rbf', class_weight='balanced')

# Define the pipeline with TfidfVectorizer and SVM
pipeline = Pipeline(steps=[
    ('preprocessing', CustomTransformer()),
    ('vectorizing', TfidfVectorizer()),
    ('model', svm_model)
])

pipeline.fit(X_train, y_train)

test_predictions = pipeline.predict(X_test)

test_macro_f1 = f1_score(y_test, test_predictions, average='macro')
print("Test Macro F1 Score with SVM:", test_macro_f1)


Test Macro F1 Score with SVM: 0.8398827987241715


- Not better. Let's try deep learning.

In [16]:
custom_transformer = CustomTransformer()

# Transform the text data
X_train_cleaned = custom_transformer.fit_transform(X_train)
X_val_cleaned = custom_transformer.transform(X_val)
X_test_cleaned = custom_transformer.transform(X_test)

In [17]:
import tensorflow as tf
from tensorflow.keras.callbacks import Callback
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.optimizers import Adam
from collections import Counter
import numpy as np


MAX_NUM_WORDS = 10000

tokenizer = Tokenizer(num_words=MAX_NUM_WORDS, oov_token="<OOV>")
tokenizer.fit_on_texts(X_train_cleaned)

# Convert texts to sequences
X_train_seq = tokenizer.texts_to_sequences(X_train_cleaned)
X_val_seq = tokenizer.texts_to_sequences(X_val_cleaned)
X_test_seq = tokenizer.texts_to_sequences(X_test_cleaned)

all_sequences = X_train_seq + X_val_seq + X_test_seq

sequence_lengths = [len(seq) for seq in all_sequences]

# Get the counts of each sequence length
length_counts = Counter(sequence_lengths)

# Sort the lengths in descending order and get the top k values with their counts
k = 20
top_k_lengths = sorted(length_counts.items(), key=lambda x: x[0], reverse=True)[:k]

for length, count in top_k_lengths:
    print(f"Length: {length}, Count: {count}")


Length: 1293, Count: 1
Length: 583, Count: 1
Length: 364, Count: 1
Length: 193, Count: 1
Length: 162, Count: 1
Length: 157, Count: 1
Length: 130, Count: 1
Length: 122, Count: 1
Length: 107, Count: 1
Length: 82, Count: 1
Length: 66, Count: 1
Length: 54, Count: 1
Length: 51, Count: 1
Length: 46, Count: 1
Length: 23, Count: 1
Length: 21, Count: 3
Length: 20, Count: 15
Length: 19, Count: 19
Length: 18, Count: 20
Length: 17, Count: 86


I will drop the samples with the 15 longest sequence lengths to reduce the computational cost caused by heavy padding. Fifteen samples is a small number to drop, yet removing them gives a large reduction in computation. The maximum sequence length will then be set to 25.
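
An alternative to dropping the longest samples is to keep every sample and truncate it; `pad_sequences` supports this through its `maxlen` and `truncating` arguments. The sketch below is NumPy-only and uses hypothetical names (`toy_lengths`, `pad_or_truncate`) and made-up lengths. It shows one way a length cap could be chosen from a percentile, and what post-padding with post-truncation does to short and long sequences. It is not the approach used in this notebook.

```python
import numpy as np

# Hypothetical sequence lengths: mostly short tweets plus a few extreme outliers
toy_lengths = np.array([8] * 500 + [12] * 400 + [20] * 95 + [150, 364, 583, 1293, 90])

# Cap the length at the 99th percentile instead of the maximum
max_len = int(np.percentile(toy_lengths, 99))

def pad_or_truncate(seq, max_len, pad_value=0):
    """Post-pad short sequences and cut long ones to exactly max_len items."""
    clipped = list(seq)[:max_len]
    return clipped + [pad_value] * (max_len - len(clipped))

print("max length:", toy_lengths.max(), "-> cap:", max_len)
print(pad_or_truncate([5, 9, 2], 6))
print(pad_or_truncate(range(1, 100), 6))
```

With Keras, the equivalent would be to pass `truncating='post'` to `pad_sequences`, since its default truncates from the start of the sequence.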

In [18]:
# Get the lengths of each sequence
sequence_lengths = [len(seq) for seq in X_train_seq + X_val_seq + X_test_seq]

# Identify the top 15 longest lengths
from collections import Counter
length_counts = Counter(sequence_lengths)
top_15_lengths = sorted(length_counts.keys(), reverse=True)[:15]

# Define a function to filter sequences and their labels
def filter_sequences_and_labels(sequences, labels, lengths_to_remove):
    filtered_sequences = []
    filtered_labels = []
    for seq, label in zip(sequences, labels):
        if len(seq) not in lengths_to_remove:
            filtered_sequences.append(seq)
            filtered_labels.append(label)
    return filtered_sequences, filtered_labels

X_train_filtered, y_train_filtered = filter_sequences_and_labels(X_train_seq, y_train, top_15_lengths)
X_val_filtered, y_val_filtered = filter_sequences_and_labels(X_val_seq, y_val, top_15_lengths)
X_test_filtered, y_test_filtered = filter_sequences_and_labels(X_test_seq, y_test, top_15_lengths)

y_train = np.array(y_train_filtered)
y_val = np.array(y_val_filtered)
y_test = np.array(y_test_filtered)


In [19]:
MAX_SEQUENCE_LENGTH = 25

# Pad the sequences to ensure uniform length
X_train_padded = pad_sequences(X_train_filtered, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
X_val_padded = pad_sequences(X_val_filtered, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
X_test_padded = pad_sequences(X_test_filtered, maxlen=MAX_SEQUENCE_LENGTH, padding='post')

vocab_size = min(len(tokenizer.word_index) + 1, MAX_NUM_WORDS)

def create_lstm_model(input_length, vocab_size, loss_function, lr=0.001):
    model = Sequential([
        Embedding(input_dim=vocab_size, output_dim=128, input_length=input_length),
        LSTM(64),
        Dense(64, activation='relu'),
        Dense(1, activation='sigmoid') 
    ])
    model.compile(optimizer=Adam(learning_rate=lr), loss=loss_function, metrics=['accuracy'])
    return model

lstm_model = create_lstm_model(MAX_SEQUENCE_LENGTH, vocab_size, "binary_crossentropy")


lstm_model.fit(
    X_train_padded, y_train,
    epochs=15,
    batch_size=32,
    validation_data=(X_val_padded, y_val)
)



Epoch 1/15
690/690 ━━━━━━━━━━━━━━━━━━━━ 19s 20ms/step - accuracy: 0.9304 - loss: 0.2736 - val_accuracy: 0.9439 - val_loss: 0.1613
Epoch 2/15
690/690 ━━━━━━━━━━━━━━━━━━━━ 13s 19ms/step - accuracy: 0.9634 - loss: 0.1122 - val_accuracy: 0.9579 - val_loss: 0.1327
Epoch 3/15
690/690 ━━━━━━━━━━━━━━━━━━━━ 13s 19ms/step - accuracy: 0.9810 - loss: 0.0606 - val_accuracy: 0.9543 - val_loss: 0.1804
Epoch 4/15
690/690 ━━━━━━━━━━━━━━━━━━━━ 14s 20ms/step - accuracy: 0.9878 - loss: 0.0368 - val_accuracy: 0.9573 - val_loss: 0.1742
Epoch 5/15
690/690 ━━━━━━━━━━━━━━━━━━━━ 14s 20ms/step - accuracy: 0.9927 - loss: 0.0239 - val_accuracy: 0.9564 - val_loss: 0.2031
Epoch 6/15
690/690 ━━━━━━━━━━━━━━━━━━━━ 13s 19ms/step - accuracy: 0.9950 - loss: 0.0174 - val_accuracy: 0.9509 - val_loss: 0.2310
Epoch 7/15
...

<keras.src.callbacks.history.History at 0x20f819078e0>

In [20]:
y_test_pred = (lstm_model.predict(X_test_padded) > 0.5).astype("int32")
test_macro_f1 = f1_score(y_test, y_test_pred, average='macro')
print("Test Macro F1 Score:", test_macro_f1)

148/148 ━━━━━━━━━━━━━━━━━━━━ 2s 9ms/step
Test Macro F1 Score: 0.8295405228018558


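- The model is overfitting: the training loss keeps falling while the validation loss rises after epoch 2. I will address this with dropout regularization and early stopping, which monitors the validation loss and stops training before the model overfits.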

In [21]:
from tensorflow.keras.layers import Dropout
from tensorflow.keras.callbacks import EarlyStopping

def create_regularized_lstm_model(input_length, vocab_size, loss_function, lr=0.001):
    model = Sequential([
        Embedding(input_dim=vocab_size, output_dim=128, input_length=input_length),
        LSTM(64, return_sequences=True),
        Dropout(0.4), 
        LSTM(64),
        Dropout(0.4),
        Dense(64, activation='relu'),
        Dropout(0.4),
        Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer=Adam(learning_rate=lr), loss=loss_function, metrics=['accuracy'])
    return model

regularized_lstm_model = create_regularized_lstm_model(MAX_SEQUENCE_LENGTH, vocab_size, "binary_crossentropy")


early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)


regularized_lstm_model.fit(
    X_train_padded, y_train,
    epochs=15,
    batch_size=32,
    validation_data=(X_val_padded, y_val),
    callbacks=[early_stopping]
)


Epoch 1/15
690/690 ━━━━━━━━━━━━━━━━━━━━ 28s 33ms/step - accuracy: 0.9254 - loss: 0.2688 - val_accuracy: 0.9520 - val_loss: 0.1564
Epoch 2/15
690/690 ━━━━━━━━━━━━━━━━━━━━ 21s 30ms/step - accuracy: 0.9705 - loss: 0.0963 - val_accuracy: 0.9566 - val_loss: 0.1713
Epoch 3/15
690/690 ━━━━━━━━━━━━━━━━━━━━ 20s 29ms/step - accuracy: 0.9832 - loss: 0.0595 - val_accuracy: 0.9568 - val_loss: 0.1691
Epoch 4/15
690/690 ━━━━━━━━━━━━━━━━━━━━ 21s 31ms/step - accuracy: 0.9877 - loss: 0.0417 - val_accuracy: 0.9452 - val_loss: 0.2253


<keras.src.callbacks.history.History at 0x20f81d189d0>

In [22]:
y_test_pred = (regularized_lstm_model.predict(X_test_padded) > 0.5).astype("int32")
test_macro_f1 = f1_score(y_test, y_test_pred, average='macro')
print("Test Macro F1 Score:", test_macro_f1)

148/148 ━━━━━━━━━━━━━━━━━━━━ 3s 14ms/step
Test Macro F1 Score: 0.8307110685864973


- Regularization didn't help much, but there may be more room for improvement if the class imbalance is handled properly.

In [23]:
import numpy as np

unique_classes, class_counts = np.unique(y_train, return_counts=True)
total_samples = len(y_train)

# Compute weights as the inverse of class frequencies
class_weight_dict = {label: total_samples / count for label, count in zip(unique_classes, class_counts)}

# Normalize the weights
total_weight = sum(class_weight_dict.values())
class_weight_dict = {label: weight / total_weight for label, weight in class_weight_dict.items()}

print("Computed class weights:", class_weight_dict)

weights = [class_weight_dict[0], class_weight_dict[1]]


def weighted_binary_crossentropy(weights):
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, tf.float32)
        y_pred = tf.clip_by_value(y_pred, tf.keras.backend.epsilon(), 1 - tf.keras.backend.epsilon())
        
        # Calculate the binary cross-entropy loss
        bce = y_true * tf.math.log(y_pred) * weights[1] + (1 - y_true) * tf.math.log(1 - y_pred) * weights[0]
        return -tf.reduce_mean(bce)
    return loss

weights = [class_weight_dict[0], class_weight_dict[1]]  


weighted_lstm_model = create_regularized_lstm_model(MAX_SEQUENCE_LENGTH, vocab_size, weighted_binary_crossentropy(weights))

weighted_lstm_model.fit(
    X_train_padded, y_train,
    epochs=15,
    batch_size=32,
    validation_data=(X_val_padded, y_val),
    callbacks=[early_stopping]

)


Computed class weights: {0: 0.06867011150394343, 1: 0.9313298884960566}
Epoch 1/15
690/690 ━━━━━━━━━━━━━━━━━━━━ 29s 33ms/step - accuracy: 0.7444 - loss: 0.0781 - val_accuracy: 0.8894 - val_loss: 0.0541
Epoch 2/15
690/690 ━━━━━━━━━━━━━━━━━━━━ 21s 30ms/step - accuracy: 0.9142 - loss: 0.0321 - val_accuracy: 0.8856 - val_loss: 0.0436
Epoch 3/15
690/690 ━━━━━━━━━━━━━━━━━━━━ 21s 30ms/step - accuracy: 0.9520 - loss: 0.0209 - val_accuracy: 0.9190 - val_loss: 0.0605
Epoch 4/15
690/690 ━━━━━━━━━━━━━━━━━━━━ 23s 33ms/step - accuracy: 0.9725 - loss: 0.0139 - val_accuracy: 0.9253 - val_loss: 0.0515
Epoch 5/15
690/690 ━━━━━━━━━━━━━━━━━━━━ 22s 31ms/step - accuracy: 0.9825 - loss: 0.0089 - val_accuracy: 0.9279 - val_loss: 0.1225


<keras.src.callbacks.history.History at 0x20f87529150>

In [24]:
y_test_pred = (weighted_lstm_model.predict(X_test_padded) > 0.5).astype("int32")
test_macro_f1 = f1_score(y_test, y_test_pred, average='macro')
print("Test Macro F1 Score:", test_macro_f1)

148/148 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step
Test Macro F1 Score: 0.7279461417606394


- The results are poor. Let's try a more dynamic approach to weighting the loss.
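
To illustrate what "dynamic" means here, the NumPy sketch below compares plain binary cross-entropy with a focal loss of the same form as the `focal_loss` defined in the next cell, using the same `gamma=4` and `alpha=0.75` defaults. The two predictions are made up for illustration, and the function names are hypothetical. The point is that the focal term `(1 - p_t) ** gamma` shrinks the loss of examples the model already classifies confidently, so hard examples dominate the gradient.

```python
import numpy as np

def binary_cross_entropy(y_true, p):
    """Per-example binary cross-entropy."""
    p_t = np.where(y_true == 1, p, 1 - p)
    return -np.log(p_t)

def focal(y_true, p, gamma=4.0, alpha=0.75):
    """Per-example focal loss: cross-entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = np.where(y_true == 1, p, 1 - p)
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# Two hate-speech tweets: one classified confidently, one badly misclassified
y_true = np.array([1, 1])
p_pred = np.array([0.9, 0.1])

bce = binary_cross_entropy(y_true, p_pred)
fl = focal(y_true, p_pred)

# How much more the hard example contributes than the easy one under each loss
print("hard/easy loss ratio, cross-entropy:", bce[1] / bce[0])
print("hard/easy loss ratio, focal loss:   ", fl[1] / fl[0])
```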

In [25]:
import tensorflow as tf
from tensorflow.keras import backend as K

def focal_loss(gamma=4., alpha=0.75):
    def focal_loss_fixed(y_true, y_pred):
        y_true = tf.cast(y_true, tf.float32)
        alpha_t = y_true * alpha + (1 - y_true) * (1 - alpha)
        p_t = y_true * y_pred + (1 - y_true) * (1 - y_pred)
        fl = -alpha_t * K.pow(1. - p_t, gamma) * K.log(p_t + K.epsilon())
        return K.mean(fl)
    return focal_loss_fixed


focal_lstm_model = create_regularized_lstm_model(MAX_SEQUENCE_LENGTH, vocab_size, focal_loss(), 0.001)
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)


focal_lstm_model.fit(
    X_train_padded, y_train,
    epochs=20,
    batch_size=32,
    validation_data=(X_val_padded, y_val),
    callbacks=[early_stopping]

)


Epoch 1/20
690/690 ━━━━━━━━━━━━━━━━━━━━ 28s 31ms/step - accuracy: 0.9258 - loss: 0.0078 - val_accuracy: 0.9296 - val_loss: 0.0051
Epoch 2/20
690/690 ━━━━━━━━━━━━━━━━━━━━ 20s 29ms/step - accuracy: 0.9688 - loss: 0.0033 - val_accuracy: 0.9456 - val_loss: 0.0072
Epoch 3/20
690/690 ━━━━━━━━━━━━━━━━━━━━ 21s 30ms/step - accuracy: 0.9787 - loss: 0.0021 - val_accuracy: 0.9463 - val_loss: 0.0077
Epoch 4/20
690/690 ━━━━━━━━━━━━━━━━━━━━ 20s 29ms/step - accuracy: 0.9862 - loss: 0.0012 - val_accuracy: 0.9461 - val_loss: 0.0074
Epoch 5/20
690/690 ━━━━━━━━━━━━━━━━━━━━ 20s 29ms/step - accuracy: 0.9884 - loss: 0.0010 - val_accuracy: 0.9482 - val_loss: 0.0208
Epoch 6/20
690/690 ━━━━━━━━━━━━━━━━━━━━ 22s 31ms/step - accuracy: 0.9913 - loss: 6.6747e-04 - val_accuracy: 0.9507 - val_loss: 0.0152


<keras.src.callbacks.history.History at 0x20f89a43910>

In [26]:
y_test_pred = (focal_lstm_model.predict(X_test_padded) > 0.5).astype("int32")
test_macro_f1 = f1_score(y_test, y_test_pred, average='macro')
print("Test Macro F1 Score:", test_macro_f1)

148/148 ━━━━━━━━━━━━━━━━━━━━ 2s 13ms/step
Test Macro F1 Score: 0.7966024931983754


- Focal loss produced a much better outcome than the traditional class-weights approach, but it is still no better than the first basic LSTM model. Let's try resampling techniques instead.

In [27]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_padded, y_train)

In [29]:
smote_lstm_model = create_regularized_lstm_model(MAX_SEQUENCE_LENGTH, vocab_size, "binary_crossentropy", 0.001)

smote_lstm_model.fit(
    X_train_resampled, y_train_resampled,
    epochs=15,
    batch_size=32,
    validation_data=(X_val_padded, y_val),
    callbacks=[early_stopping]
)

Epoch 1/15
1285/1285 ━━━━━━━━━━━━━━━━━━━━ 47s 32ms/step - accuracy: 0.6755 - loss: 0.5877 - val_accuracy: 0.8310 - val_loss: 0.3883
Epoch 2/15
1285/1285 ━━━━━━━━━━━━━━━━━━━━ 41s 32ms/step - accuracy: 0.8315 - loss: 0.3855 - val_accuracy: 0.8551 - val_loss: 0.3655
Epoch 3/15
1285/1285 ━━━━━━━━━━━━━━━━━━━━ 39s 30ms/step - accuracy: 0.8949 - loss: 0.2576 - val_accuracy: 0.8659 - val_loss: 0.3119
Epoch 4/15
1285/1285 ━━━━━━━━━━━━━━━━━━━━ 40s 31ms/step - accuracy: 0.9353 - loss: 0.1641 - val_accuracy: 0.8655 - val_loss: 0.3929
Epoch 5/15
1285/1285 ━━━━━━━━━━━━━━━━━━━━ 41s 31ms/step - accuracy: 0.9615 - loss: 0.1039 - val_accuracy: 0.8487 - val_loss: 0.6023


<keras.src.callbacks.history.History at 0x21009ee8700>

In [30]:
y_test_pred = (smote_lstm_model.predict(X_test_padded) > 0.5).astype("int32")
test_macro_f1 = f1_score(y_test, y_test_pred, average='macro')
print("Test Macro F1 Score:", test_macro_f1)

148/148 ━━━━━━━━━━━━━━━━━━━━ 2s 14ms/step
Test Macro F1 Score: 0.6040511542361019


- SMOTE produced the worst results of all the approaches.
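
A likely explanation is that SMOTE builds synthetic samples by interpolating between a minority sample and one of its nearest neighbours in feature space, and here the features are padded sequences of integer token IDs. The sketch below mimics that interpolation step by hand on a made-up vocabulary (`index_to_word` and both sequences are assumptions) to show that a point "between" two token IDs is just a different, unrelated token.

```python
import numpy as np

# Hypothetical vocabulary: the IDs are arbitrary indices, not coordinates
index_to_word = {0: "<pad>", 1: "<OOV>", 2: "love", 3: "dog",
                 4: "hate", 5: "happy", 6: "people"}

# Two padded sequences from the same class
seq_a = np.array([2, 6, 0, 0])   # "love people"
seq_b = np.array([4, 6, 0, 0])   # "hate people"

# SMOTE-style synthetic point on the line segment between the two samples
lam = 0.5
synthetic = seq_a + lam * (seq_b - seq_a)

# An embedding lookup needs integer IDs, so the interpolated values
# end up being read as whatever token sits at that index
synthetic_ids = synthetic.astype(int)
print(synthetic, [index_to_word[i] for i in synthetic_ids.tolist()])
```

Resampling in a continuous representation (for example TF-IDF vectors or learned embeddings), or simply duplicating minority tweets, avoids this particular problem, although neither was tried in this notebook.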


### Conclusion and final results


### Machine Learning Models
- The logistic regression model performed best (test macro F1 of 0.85), in part because its hyperparameters were tuned. The SVM's parameters were not tuned due to time constraints, but with the right tuning it might perform slightly better than, or at least similarly to, logistic regression.
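
As a rough sketch of what tuning the SVM could look like, the block below runs the same kind of grid search used for logistic regression over `C` and `gamma`. The corpus is tiny and made up so the example runs quickly, the grid values are illustrative assumptions rather than recommended settings, and the preprocessing step is omitted for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Tiny made-up corpus so the sketch runs quickly; not the tweet dataset
toy_texts = ["i love this", "great day today", "so happy now", "lovely people here",
             "nice sunny morning", "what a fun trip",
             "i hate them", "they are awful people", "hate this group",
             "awful hateful words"] * 3
toy_labels = ([0] * 6 + [1] * 4) * 3

svm_pipeline = Pipeline(steps=[
    ("vectorizing", TfidfVectorizer()),
    ("model", SVC(kernel="rbf", class_weight="balanced")),
])

# Illustrative grid; the value ranges are assumptions, not tuned settings
svm_param_grid = {
    "model__C": [0.1, 1, 10],
    "model__gamma": ["scale", 0.1, 1],
}

# Same scoring metric as the logistic regression search
svm_search = GridSearchCV(svm_pipeline, svm_param_grid, scoring="f1_macro", cv=3)
svm_search.fit(toy_texts, toy_labels)
print("Best parameters:", svm_search.best_params_)
```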

### Deep Learning Models
- The very first deep learning model, which had neither regularization nor a weighted loss, performed better than the later models that used them. This was unexpected, given that the model was struggling with class imbalance and overfitting.
- When handling imbalance through a weighted loss, focal loss, which uses dynamic weights, was much better than weighted cross-entropy loss, although the gamma and alpha parameters required tuning (not shown in the notebook, but increasing both values improved results).
- SMOTE performed worst of all approaches. It interpolates between feature vectors, and here it was applied to padded sequences of integer token IDs, where interpolated values do not correspond to meaningful tokens. This suggests it is not well suited to this kind of text representation.

### General Insights
- The machine learning models outperformed the deep learning models, especially logistic regression. This is likely because the dataset is not large enough for the deep learning models' performance to scale. This experiment highlights the value of traditional machine learning models and shows that deep learning is not always the best option for a given problem.

#### Done!