<a href="https://colab.research.google.com/github/rilli-00/Fitra/blob/main/LSTM_CNN_Classifier_(reem).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**LSTM-CNN-Classifier**
This model combines a CNN and an LSTM to take advantage of the strengths of each. The CNN extracts local features from the text, such as recurring patterns and key words. The LSTM then models the order and context of those features. The combination is intended to help the model analyze text more effectively and improve classification accuracy.

Benefits of Combining CNN + LSTM in a Model:

1️⃣ Feature Extraction & Pattern Recognition (CNN)

- Extracts key local features from textual data efficiently.
- Detects local patterns wherever they occur in the input, so their exact position matters less.
- Reduces dimensionality (through pooling) while preserving the strongest patterns.

2️⃣ Context Understanding & Sequential Learning (LSTM)

- Processes the input as a sequence rather than as an unordered set of features.
- Can retain long-range dependencies, so meaning carries across a sentence.
- Handles sequences of varying length.

3️⃣ Improved Accuracy & Generalization

- The CNN filters out less informative input, and the LSTM adds context on top of the extracted features.
- The combination can improve classification accuracy on complex text data.
- Pooling and a shorter LSTM input can reduce the risk of overfitting, although they do not prevent it on their own.

4️⃣ Balanced Speed & Efficiency

- Convolution and pooling shorten the sequence the LSTM has to process, which reduces processing time.
- The LSTM contributes sequence-level understanding, which improves prediction quality.
- Together, they offer a trade-off between speed and depth.

5️⃣ Better Performance on NLP Tasks

- Suited to text classification, sentiment analysis, and speech recognition.
- Can handle context-sensitive tasks better than a standalone CNN or LSTM.
- Applicable to multilingual processing and complex sentence structures.
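
As a rough illustration of the "local pattern extraction" and "dimensionality reduction" points above, the following NumPy-only sketch applies one hand-picked filter and max pooling to a toy sequence. The helper names `conv1d_valid` and `max_pool1d` and the toy values are assumptions made for this example; a trained `Conv1D` layer learns many such filters instead of using a fixed one.

```python
import numpy as np

def conv1d_valid(x, kernel):
    """Slide the kernel over x without padding (cross-correlation, as in deep-learning conv layers)."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def max_pool1d(x, pool_size):
    """Keep the maximum of each non-overlapping window of length pool_size."""
    n = len(x) // pool_size
    return x[:n * pool_size].reshape(n, pool_size).max(axis=1)

signal = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0])  # toy input with one "bump"
kernel = np.array([1.0, 2.0, 1.0])                             # hand-picked filter (hypothetical)

features = np.maximum(conv1d_valid(signal, kernel), 0.0)  # convolution followed by ReLU
pooled = max_pool1d(features, pool_size=2)

print(features)  # 8 inputs -> 6 feature values
print(pooled)    # 6 feature values -> 3 pooled values
```

The filter responds most strongly where the bump is, and pooling halves the length while keeping that peak.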

In [None]:
!pip install --upgrade numpy==1.26.4



In [None]:
!pip install --upgrade pandas==2.2.2



##Upload and clean the data

In [None]:
import pandas as pd
import numpy as np
import re
from google.colab import files

#load data
uploaded = files.upload()
df = pd.read_csv(list(uploaded.keys())[0])

#cleaning method
def clean_text(text):
    if isinstance(text, str):
        text = text.lower()
        text = re.sub(r"http\S+|www\S+", "", text)
        text = re.sub(r"[^a-zA-Z\s]", "", text)
        text = re.sub(r"\s+", " ", text).strip()
        return text
    return ""

df["video_title"] = df["video_title"].apply(clean_text)
df["video_description"] = df["video_description"].apply(clean_text)
df["transcript"] = df["transcript"].apply(clean_text)

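# normalize the contain_lgbtq label to 0/1
df["contain_lgbtq"] = df["contain_lgbtq"].astype(str).str.strip().str.lower()
df["contain_lgbtq"] = df["contain_lgbtq"].map({"yes": 1, "no": 0})  # any other value becomes NaN
df = df[df["contain_lgbtq"].isin([0, 1])].copy()  # keep only rows with a valid label
df["contain_lgbtq"] = df["contain_lgbtq"].astype(int)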


#remove rows with missing text (clean_text already maps non-string values to "", so this is only a safeguard)
df = df.dropna(subset=["video_title", "video_description", "transcript"])



# save after cleaning
cleaned_file_path = "Cleaned_Dataset.csv"
df.to_csv(cleaned_file_path, index=False, encoding="utf-8")
print(f"\n📂 **Cleaned dataset saved at:** {cleaned_file_path}")

df.head()


Saving combined_final_data.csv to combined_final_data.csv





📂 **Cleaned dataset saved at:** Cleaned_Dataset.csv


| Unnamed: 0 | video_id | video_title | url | video_description | transcript | contain_lgbtq |
|---|---|---|---|---|---|---|
| 0 | O61aMY3Tqfk | q news tonight broadcast full tue jun q news t... | https://www.youtube.com/watch?v=O61aMY3Tqfk | tune in daily for q news tonight live at pm ea... | well good evening america it is pm tuesday jun... | 1 |
| 1 | oCfIxCbogn4 | thu feb daily live lgbtq news broadcast queer ... | https://www.youtube.com/watch?v=oCfIxCbogn4 | we need your support become a patron missed ou... | on oneill is our lead tonight who is dwight oh... | 1 |
| 2 | CqcXXaCoJis | wed feb daily live lgbtq news broadcast queer ... | https://www.youtube.com/watch?v=CqcXXaCoJis | we need your support become a patron missed ou... | a comedian with us tonight and a tarot yeah th... | 1 |
| 3 | 8761Y_g3_go | tue feb daily live lgbtq news broadcast queer ... | https://www.youtube.com/watch?v=8761Y_g3_go | we need your support become a patron missed ou... | at least that happening out that both of you h... | 1 |
| 4 | OF_Wf2dVJ2M | mon feb daily live lgbtq news broadcast queer ... | https://www.youtube.com/watch?v=OF_Wf2dVJ2M | we need your support become a patron missed ou... | presidents day presidents day were in life oka... | 1 |
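
To make the effect of the cleaning step concrete, here is a small self-contained sketch that reapplies the same `clean_text` logic to a made-up string and to a missing value. The sample string is hypothetical and not taken from the dataset.

```python
import re

def clean_text(text):
    """Same steps as the notebook's clean_text: lowercase, drop URLs, keep letters only, collapse spaces."""
    if isinstance(text, str):
        text = text.lower()
        text = re.sub(r"http\S+|www\S+", "", text)
        text = re.sub(r"[^a-zA-Z\s]", "", text)
        text = re.sub(r"\s+", " ", text).strip()
        return text
    return ""

sample = "Watch NOW!! https://example.com/video?id=42 (Part 2)"  # hypothetical input
cleaned = clean_text(sample)
missing = clean_text(float("nan"))  # non-string values (such as NaN) become ""

print(repr(cleaned))
print(repr(missing))
```

Because digits and punctuation are removed along with URLs, dates and times disappear from the text, which is why the titles in the preview read like "tue jun" or "live at pm".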


##Text Embedding with FastText

In [None]:
!pip install --upgrade gensim


Collecting gensim
  Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.1 kB)
Collecting scipy<1.14.0,>=1.7.0 (from gensim)
  Downloading scipy-1.13.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.7 MB)
Downloading scipy-1.13.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (38.6 MB)
Installing collected packages: scipy, gensim
  Attempting uninstall: scipy
    Found existing installation: scipy 1.14.1
    Uninstalling scipy-1.14.1:
      Successfully 

In [None]:
import pandas as pd
import numpy as np
import re
from gensim.models import FastText
from tqdm import tqdm

# load data
file_path = "Cleaned_Dataset.csv"
df = pd.read_csv(file_path)

#text tokenization method
def tokenize_text(text):
    if isinstance(text, str):
        text = re.sub(r"[^\w\s]", "", text)
        return text.split()
    return []

df["video_title_tokens"] = df["video_title"].apply(tokenize_text)
df["video_description_tokens"] = df["video_description"].apply(tokenize_text)
df["transcript_tokens"] = df["transcript"].apply(tokenize_text)

# gather texts to train FastText model
all_texts = df["video_title_tokens"].tolist() + df["video_description_tokens"].tolist() + df["transcript_tokens"].tolist()

# train FastText
print("⏳ Training FastText model...")
fasttext_model = FastText(sentences=all_texts, vector_size=300, window=5, min_count=2, workers=4, sg=1, epochs=10)
print("✅ FastText training completed!")

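# ✅ turn a token list into one vector by averaging its FastText word vectors
def get_sentence_embedding(tokens):
    vectors = [fasttext_model.wv[word] for word in tokens if word in fasttext_model.wv]
    # fall back to a zero vector of the model's dimensionality when there are no tokens
    return np.mean(vectors, axis=0) if vectors else np.zeros(fasttext_model.wv.vector_size)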


df["video_title_vector"] = df["video_title_tokens"].apply(get_sentence_embedding)
df["video_description_vector"] = df["video_description_tokens"].apply(get_sentence_embedding)
df["transcript_vector"] = df["transcript_tokens"].apply(get_sentence_embedding)

# concatenate the three 300-dim vectors (title, description, transcript) into one 900-dim vector
df["combined_vector"] = df.apply(lambda row: np.hstack([
    row["video_title_vector"],
    row["video_description_vector"],
    row["transcript_vector"]
]), axis=1)

# save the data
embeddings_file_path = "FastText_Embeddings.csv"
df[["combined_vector", "contain_lgbtq"]].to_csv(embeddings_file_path, index=False, encoding="utf-8")

print(f"\n📂 **Embeddings saved at:** {embeddings_file_path}")


⏳ Training FastText model...
✅ FastText training completed!

📂 **Embeddings saved at:** FastText_Embeddings.csv
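
The averaging and concatenation steps above can be sketched without training a model. In the example below, `toy_vectors` is a hypothetical stand-in for `fasttext_model.wv` filled with random numbers, and `mean_embedding` mirrors the logic of `get_sentence_embedding`; only the shapes and the arithmetic are meaningful.

```python
import numpy as np

DIM = 300  # matches vector_size=300 used in the notebook

# Hypothetical stand-in for fasttext_model.wv: a tiny word -> vector lookup
rng = np.random.default_rng(0)
toy_vectors = {word: rng.normal(size=DIM) for word in ["daily", "news", "live"]}

def mean_embedding(tokens, vectors, dim=DIM):
    """Average the vectors of the known tokens; return zeros when there are none."""
    found = [vectors[t] for t in tokens if t in vectors]
    return np.mean(found, axis=0) if found else np.zeros(dim)

title_vec = mean_embedding(["daily", "news"], toy_vectors)
desc_vec = mean_embedding(["live", "news"], toy_vectors)
transcript_vec = mean_embedding([], toy_vectors)  # empty text -> zero vector

# Concatenate the three 300-dim vectors into one 900-dim feature vector
combined = np.hstack([title_vec, desc_vec, transcript_vec])
print(combined.shape)

# Averaging ignores word order: both orderings give the same vector
reordered = mean_embedding(["news", "daily"], toy_vectors)
print(np.allclose(title_vec, reordered))
```

One consequence of this design is that each text field is reduced to a single fixed-length vector, so word order within a field is no longer visible to the downstream model.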


##LSTM-CNN model

###Data Preparation

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

#load data
file_path = "FastText_Embeddings.csv"
df = pd.read_csv(file_path)

#from text to numpy
df["combined_vector"] = df["combined_vector"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

#define x , y
X = np.vstack(df["combined_vector"].values)
y = df["contain_lgbtq"].values.astype(int)

#count x ,y values
num_x_values = X.shape[0]
num_y_values = len(y)
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

#Display the dataset size and the shapes of the train/test splits

print(f"🔹 Number of all values in X: {num_x_values}")
print(f"🔹 Number of all values in y: {num_y_values}")

print(f"🔍 X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"🔍 X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")
print("✅ Data is now ready for training and testing!")




🔹 Number of all values in X: 14404
🔹 Number of all values in y: 14404
🔍 X_train shape: (11523, 900), y_train shape: (11523,)
🔍 X_test shape: (2881, 900), y_test shape: (2881,)
✅ Data is now ready for training and testing!
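
The embeddings reach this cell as text: the CSV stores each array's printed form, and `np.fromstring` parses it back. The sketch below reproduces that round trip on a made-up 900-value vector and compares it with a binary round trip through `np.save`/`np.load`, shown here only as one possible alternative. The text route depends on NumPy printing every element, which it does by default only for arrays of up to 1000 elements, and it keeps printed precision rather than full precision.

```python
import io
import numpy as np

vec = np.linspace(-1.0, 1.0, 900)  # made-up stand-in for one combined_vector

# Text round trip, as done via the CSV file in the notebook
as_text = str(vec)                                  # e.g. "[-1. -0.99777531 ...]"
parsed = np.fromstring(as_text.strip("[]"), sep=" ")
max_abs_error = np.max(np.abs(parsed - vec))        # small, but not exactly zero

# Binary round trip (assumed alternative): exact values, no string parsing
buffer = io.BytesIO()
np.save(buffer, vec)
buffer.seek(0)
restored = np.load(buffer)

print(parsed.shape, max_abs_error)
print(np.array_equal(restored, vec))
```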


###Model building

In [None]:
!pip install --upgrade tensorflow




In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Conv1D, MaxPooling1D, LSTM, Dropout, BatchNormalization

# ✅ Define the improved Hybrid CNN-LSTM Model
def HybridTextClassifier(input_shape):
    model = Sequential()

    # ✅ Input + CNN Layer with BatchNormalization
    model.add(Input(shape=input_shape))  # declare the input shape explicitly
    model.add(Conv1D(filters=64, kernel_size=5, activation="relu"))
    model.add(BatchNormalization())  # ✅ stabilizes training
    model.add(MaxPooling1D(pool_size=2))

    # ✅ LSTM Layer
    model.add(LSTM(100, return_sequences=False))

    # ✅ Dense Layer + Dropout
    model.add(Dense(50, activation="relu"))
    model.add(Dropout(0.4))  # ✅ less aggressive dropout

    # ✅ Output Layer
    model.add(Dense(1, activation="sigmoid"))

    # ✅ Compile the Model
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

    return model

# ✅ Create the Model
input_shape = (900, 1)
model = HybridTextClassifier(input_shape)

# ✅ Show Summary
model.summary()
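
As a sanity check on the layer shapes, the sketch below works out how the sequence length changes through the convolution and pooling layers, assuming the Keras defaults for `Conv1D` (`padding="valid"`, stride 1) and `MaxPooling1D` (stride equal to `pool_size`). It also shows how a 2-D batch of shape `(N, 900)` can be given the trailing channel axis that matches `input_shape = (900, 1)`. The helper names are hypothetical, and the batch is filled with zeros purely to show shapes.

```python
import numpy as np

def conv1d_valid_steps(steps, kernel_size, stride=1):
    """Output length of a 1-D convolution without padding."""
    return (steps - kernel_size) // stride + 1

def maxpool1d_valid_steps(steps, pool_size, stride=None):
    """Output length of 1-D max pooling without padding (stride defaults to pool_size)."""
    stride = pool_size if stride is None else stride
    return (steps - pool_size) // stride + 1

# A 2-D batch like X_train (N, 900) gets an explicit channel axis -> (N, 900, 1)
X_batch = np.zeros((4, 900), dtype="float32")
X_batch_3d = X_batch[..., np.newaxis]

after_conv = conv1d_valid_steps(900, kernel_size=5)          # Conv1D(filters=64, kernel_size=5)
after_pool = maxpool1d_valid_steps(after_conv, pool_size=2)  # MaxPooling1D(pool_size=2)

print(X_batch_3d.shape)   # input to the model
print((after_conv, 64))   # per-sample shape after Conv1D
print((after_pool, 64))   # per-sample shape after pooling, i.e. what the LSTM reads
```

Under these assumptions the LSTM processes 448 steps of 64 features each per sample. Adding the channel axis explicitly (for example with `X_train[..., np.newaxis]`) avoids relying on any implicit shape handling when the data is passed to `fit` or `predict`.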




###Train the model

In [None]:
# ✅ Train the model
epochs = 10  # Number of full passes over the training data
batch_size = 32 # Number of samples per batch


history = model.fit(
    X_train, y_train,
    epochs=epochs,
    batch_size=batch_size,
    validation_data=(X_test, y_test),  # note: the test split doubles as the validation set here
    verbose=1
)


Epoch 1/10
361/361 ━━━━━━━━━━━━━━━━━━━━ 144s 390ms/step - accuracy: 0.8360 - loss: 0.3625 - val_accuracy: 0.7987 - val_loss: 0.5060
Epoch 2/10
361/361 ━━━━━━━━━━━━━━━━━━━━ 140s 385ms/step - accuracy: 0.9510 - loss: 0.1422 - val_accuracy: 0.9625 - val_loss: 0.0987
Epoch 3/10
361/361 ━━━━━━━━━━━━━━━━━━━━ 143s 388ms/step - accuracy: 0.9648 - loss: 0.1077 - val_accuracy: 0.9733 - val_loss: 0.0780
Epoch 4/10
361/361 ━━━━━━━━━━━━━━━━━━━━ 142s 388ms/step - accuracy: 0.9747 - loss: 0.0881 - val_accuracy: 0.9663 - val_loss: 0.0945
Epoch 5/10
361/361 ━━━━━━━━━━━━━━━━━━━━ 141s 384ms/step - accuracy: 0.9804 - loss: 0.0629 - val_accuracy: 0.9788 - val_loss: 0.0610
Epoch 6/10
361/361 ━━━━━━━━━━━━━━━━━━━━ 143s 387ms/step - accuracy: 0.9728 - loss: 0.0823 - val_accuracy: 0.9760 - val_loss: 0.0753
Epoc
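
The `361/361` counter in the training log is the number of batches per epoch. A short check of that arithmetic, using the split sizes reported in the data preparation output and `batch_size = 32`:

```python
import math

train_samples = 11523  # size of X_train reported earlier
test_samples = 2881    # size of X_test reported earlier
batch_size = 32

steps_per_epoch = math.ceil(train_samples / batch_size)
last_train_batch = train_samples - (steps_per_epoch - 1) * batch_size

# The same arithmetic applies to a pass over the test split with 32 samples per batch
test_steps = math.ceil(test_samples / batch_size)

print(steps_per_epoch, last_train_batch)
print(test_steps)
```

The final training batch of each epoch is therefore much smaller than the others (3 samples instead of 32).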

###Test the model

In [None]:

# Probabilities (values between 0 and 1)
y_pred_probabilities = model.predict(X_test)
# Convert probabilities to binary labels (1 if probability > 0.5, else 0)
y_pred = (y_pred_probabilities > 0.5).astype("int32")
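
A small sketch of the thresholding step on made-up probabilities, shaped `(n, 1)` like the output of a single sigmoid unit. The values are hypothetical. Note that a probability of exactly 0.5 maps to class 0 because the comparison is strict.

```python
import numpy as np

# Hypothetical sigmoid outputs with shape (n, 1)
probabilities = np.array([[0.03], [0.50], [0.51], [0.97]])

labels = (probabilities > 0.5).astype("int32")  # still shape (n, 1)
flat_labels = labels.ravel()                    # shape (n,), same layout as y_test

print(labels.shape, flat_labels)
```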



[1m91/91[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 111ms/step


###Model Evaluation

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from tabulate import tabulate

total_samples = len(y_train) + len(y_test)

# ✅ Compute test dataset statistics
total_Test_samples = len(y_test)
test_positive_samples = np.sum(y_test)
test_negative_samples = total_Test_samples - test_positive_samples

# ✅ Compute training dataset statistics
total_train_samples = len(y_train)
train_positive_samples = np.sum(y_train)
train_negative_samples = len(y_train) - train_positive_samples

# ✅ Compute model accuracy
train_accuracy = history.history['accuracy'][-1]
test_accuracy = accuracy_score(y_test, y_pred)
accuracy_gap = abs(train_accuracy - test_accuracy) * 100

# ✅ Analyze loss stability
train_loss = history.history['loss']
loss_stability = max(train_loss) - min(train_loss)
loss_variation = np.std(train_loss)

# ✅ Compute confusion matrix & classification report
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred, target_names=["Not LGBT", "LGBT"], output_dict=True)
conf_matrix_percent = conf_matrix.astype(float) / conf_matrix.sum(axis=1)[:, np.newaxis] * 100

# ✅ Compute F1-score & CHAP Score
f1_not_lgbt = classification_rep["Not LGBT"]["f1-score"] * 100
f1_lgbt = classification_rep["LGBT"]["f1-score"] * 100
f1_gap = abs(f1_not_lgbt - f1_lgbt)

chap_score = (f1_not_lgbt + f1_lgbt) / 2  # CHAP Score as used here: unweighted mean of the two per-class F1-scores (macro-F1)

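# ✅ Compute ROC-AUC Score
# Note: y_pred holds thresholded 0/1 labels, so this value equals balanced accuracy;
# passing y_pred_probabilities instead would give a threshold-independent AUC.
roc_auc = roc_auc_score(y_test, y_pred) * 100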


# ✅ Compute additional metrics
precision_not_lgbt = classification_rep["Not LGBT"]["precision"] * 100
precision_lgbt = classification_rep["LGBT"]["precision"] * 100
recall_not_lgbt = classification_rep["Not LGBT"]["recall"] * 100
recall_lgbt = classification_rep["LGBT"]["recall"] * 100

# ✅ Detect overfitting
overfitting_detected = False
overfitting_reasons = []

if accuracy_gap > 5:
    overfitting_detected = True
    overfitting_reasons.append(f"⚠ *High Accuracy Gap:* {accuracy_gap:.2f}%")

if loss_stability > 0.4:
    overfitting_detected = True
    overfitting_reasons.append(f"⚠ *Unstable Loss Variation:* {loss_stability:.4f}")

if f1_gap > 10:
    overfitting_detected = True
    overfitting_reasons.append(f"⚠ *High F1-Score Gap Between Classes:* {f1_gap:.2f}%")

if classification_rep["Not LGBT"]["recall"] > 0.95 and classification_rep["LGBT"]["recall"] < 0.50:
    overfitting_detected = True
    overfitting_reasons.append("⚠ *Class Bias Detected (Favoring 'Not LGBT').*")


if classification_rep["LGBT"]["recall"] > 0.95 and classification_rep["Not LGBT"]["recall"] < 0.50:
    overfitting_detected = True
    overfitting_reasons.append("⚠ *Class Bias Detected (Favoring 'LGBT').*")



# ✅ Display  analysis
print("\n🔍 * LSTM + CNN Model  Analysis:*\n")
print(f"📊 **Total samples:** {total_samples} ")
print("\n***********Train Data **********************")
print(f"📊 *Total Train samples:* {total_train_samples}")
print(f"✔️ *Training Not LGBT samples:* {train_negative_samples}")
print(f"✔️ *Training LGBT samples:* {train_positive_samples}")


print("\n***********Test Data **********************")
print(f"📊 *Total test samples:* {total_Test_samples}")
print(f"✔️ *Not LGBT samples:* {test_negative_samples}")
print(f"✔️ *LGBT samples:* {test_positive_samples}")

print("\n🎯 **Model Performance:**")
print(f"✅ **Train Accuracy:** {train_accuracy:.2%} ")
print(f"✅ **Test Accuracy:** {test_accuracy:.2%} ")
print(f"📈 **Accuracy Gap:** {accuracy_gap:.2f}% {'🔴 Overfitting detected!' if accuracy_gap > 5 else '🟢 Good generalization!'}")
print(f"📉 *Loss Stability Score:* {loss_stability:.4f}")
print(f"📊 *Loss Variation Standard Deviation:* {loss_variation:.4f}")

print("\n📊 **F1-Scores & CHAP Score:**")
print(f"🔄 **F1-Score Not LGBT:** {f1_not_lgbt:.2f}%")
print(f"🔄 **F1-Score LGBT:** {f1_lgbt:.2f}%")
print(f"⚖ **F1-Score Gap:** {f1_gap:.2f}% {'🔴 Possible class Bias!' if f1_gap > 10 else '🟢 Balanced!'}")
print(f"🟢 **CHAP Score:** {chap_score:.2f}%")

print("\n🎯 ROC-AUC Score:")
print(f"🟢 ROC-AUC Score: {roc_auc:.2f}% ")



# ✅ Print Overfitting results
if overfitting_detected:
    print("\n🚨 *Overfitting Detected!* 🚨")
    for reason in overfitting_reasons:
        print(reason)
else:
    print("\n✅ *Model is well-generalized! No Overfitting detected.*")



print("\n📊 **Precision & Recall Analysis:**")
print(f"🔍 **Precision (Not LGBT):** {precision_not_lgbt:.2f}%")
print(f"🔍 **Precision (LGBT):** {precision_lgbt:.2f}%")
print(f"🔍 **Recall (Not LGBT):** {recall_not_lgbt:.2f}%")
print(f"🔍 **Recall (LGBT):** {recall_lgbt:.2f}%")



# ✅ Display confusion matrix
conf_matrix_df = pd.DataFrame(conf_matrix, index=["Actual Not LGBT", "Actual LGBT"], columns=["Predicted Not LGBT", "Predicted LGBT"])
conf_matrix_percent_df = pd.DataFrame(conf_matrix_percent, index=["Actual Not LGBT", "Actual LGBT"], columns=["Predicted Not LGBT", "Predicted LGBT"])

print("\n📊 *Confusion Matrix - Raw Values:*")
print(tabulate(conf_matrix_df, headers='keys', tablefmt='fancy_grid'))
print("\n📊 *Confusion Matrix - Percentage Values:*")
print(tabulate(conf_matrix_percent_df.round(2), headers='keys', tablefmt='fancy_grid'))
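
Two of the metrics above are worth a closer look. The sketch below uses four made-up samples to show that `roc_auc_score` computed on thresholded 0/1 labels equals balanced accuracy (the mean of the two recalls), while the same function on raw scores can give a different value, and that the average of the per-class F1-scores is the macro-F1. The toy labels and scores are hypothetical.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.3, 0.4, 0.9])      # hypothetical predicted probabilities
hard_labels = (scores > 0.5).astype(int)     # thresholded predictions: [0, 0, 0, 1]

auc_from_scores = roc_auc_score(y_true, scores)       # uses the ranking of the scores
auc_from_labels = roc_auc_score(y_true, hard_labels)  # only sees the 0/1 decisions
balanced_acc = balanced_accuracy_score(y_true, hard_labels)

per_class_f1 = f1_score(y_true, hard_labels, average=None)
macro_f1 = f1_score(y_true, hard_labels, average="macro")

print(auc_from_scores, auc_from_labels, balanced_acc)
print(per_class_f1, macro_f1)
```

This is consistent with the reported results: the two recalls, 99.51% and 94.24%, average to about 96.88%, which is the value printed as the ROC-AUC score.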





🔍 * LSTM + CNN Model  Analysis:*

📊 **Total samples:** 14404 

***********Train Data **********************
📊 *Total Train samples:* 11523
✔️ *Training Not LGBT samples:* 5762
✔️ *Training LGBT samples:* 5761

***********Test Data **********************
📊 *Total test samples:* 2881
✔️ *Not LGBT samples:* 1441
✔️ *LGBT samples:* 1440

🎯 **Model Performance:**
✅ **Train Accuracy:** 98.58% 
✅ **Test Accuracy:** 96.88% 
📈 **Accuracy Gap:** 1.70% 🟢 Good generalization!
📉 *Loss Stability Score:* 0.2026
📊 *Loss Variation Standard Deviation:* 0.0583

📊 **F1-Scores & CHAP Score:**
🔄 **F1-Score Not LGBT:** 96.96%
🔄 **F1-Score LGBT:** 96.79%
⚖ **F1-Score Gap:** 0.17% 🟢 Balanced!
🟢 **CHAP Score:** 96.87%

🎯 ROC-AUC Score:
🟢 ROC-AUC Score: 96.88% 

✅ *Model is well-generalized! No Overfitting detected.*

📊 **Precision & Recall Analysis:**
🔍 **Precision (Not LGBT):** 94.53%
🔍 **Precision (LGBT):** 99.49%
🔍 **Recall (Not LGBT):** 99.51%
🔍 **Recall (LGBT):** 94.24%

📊 *Confusion Matrix - Raw Values:*