<div align="center" style="border: 2px solid #1C6EA4; background-color: #90EE90; padding: 15px;">
    <img src="https://www.colorado.edu/brand/sites/default/files/styles/medium/public/page/boulder_left_lockup_black.png?itok=4qMuKoBT" alt="Colorado Boulder University Logo" width="400" height="500">
    <h2 style="color: black; font-weight: bold;">
        <i class="fas fa-exclamation-triangle" style="color: #FF4500;"></i> NLP Disaster Tweets Kaggle Mini-Project
    </h2>
</div>


**Name**: _Willian Pina_

<div style="background-color: lightgrey; color: black; padding: 20px;border: 1px solid black;">
  <h3>Brief Description of the Problem</h3>
  <p>The challenge is to develop a machine learning model capable of classifying tweets as either related to a real-world disaster or not. This is a Natural Language Processing (NLP) problem where the objective is to understand the semantics and context of textual data—in this case, tweets. The aim is to automate a task that is easily done by humans but not so straightforward for machines: discerning the urgency or critical nature of a tweet based on its text content.</p>

  <h3>NLP (Natural Language Processing)</h3>
  <p>NLP is a subfield of artificial intelligence that focuses on the interaction between computers and humans through natural language. The ultimate objective is to enable computers to understand, interpret, and produce human languages in a way that is both meaningful and useful. NLP involves several challenges including language modeling, parsing, sentiment analysis, and machine translation.</p>

  <h3>Data Description</h3>
  <h4>Size:</h4>
  <ul>
    <li>The training set comprises 7,613 manually labeled tweets.</li>
    <li>The test set contains 3,263 tweets.</li>
  </ul>

  <h4>Dimension:</h4>
  <p>Each tweet is represented by several features, making the data multidimensional. Specifically, the features are:</p>
  <ul>
    <li><code>id</code>: A unique identifier for each tweet.</li>
    <li><code>text</code>: The text of the tweet itself.</li>
    <li><code>location</code>: The geographical location from which the tweet was sent (may be blank).</li>
    <li><code>keyword</code>: A keyword extracted from the tweet (may also be blank).</li>
    <li><code>target</code>: A label indicating whether the tweet is about a real disaster (1) or not (0).</li>
  </ul>

  <h4>Structure:</h4>
  <p>The data is structured in a tabular format, stored in CSV files. There are separate files for training (<code>train.csv</code>) and testing (<code>test.csv</code>). A sample submission file (<code>sample_submission.csv</code>) is also provided to guide how predictions should be formatted for submission.</p>

  <p>By understanding these aspects of the data and problem, we can proceed to design and train a machine learning model to make accurate predictions.</p>
</div>


In [None]:
# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from tqdm import tqdm
import random

# Importing necessary libraries for data preprocessing and model building
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, GRU, Bidirectional, BatchNormalization
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from keras_tuner import RandomSearch  # keras_tuner is the current package name; kerastuner is the legacy alias

np.random.seed(42)
random.seed(42)
tf.random.set_seed(42)


class CONFIG:
    train_path             = '/kaggle/input/nlp-getting-started/train.csv'
    test_path              = '/kaggle/input/nlp-getting-started/test.csv'
    sample_submission_path = '/kaggle/input/nlp-getting-started/sample_submission.csv'
    max_vocab_size         = 20000  # Maximum number of words to keep, based on word frequency
    max_sequence_length    = 100  # Maximum number of words in a sequence
    embedding_dim          = 100  # Dimensionality of the word embedding vectors
    epochs                 = 5
    batch_size             = 32

In [None]:
# Load Data
train_data = pd.read_csv(CONFIG.train_path)
test_data = pd.read_csv(CONFIG.test_path)

# Describe Size and Dimension
print(f"Train Data Dimensions: {train_data.shape}")
print(f"Test Data Dimensions: {test_data.shape}")

# Check Structure
print("\nTrain Data Structure:")
print(train_data.info())
print("\nTest Data Structure:")
print(test_data.info())

<div style="background-color: lightgrey; color: black; padding: 20px; border: 1px solid black;">
  <h4>Exploratory Data Analysis (EDA)</h4>
  
  <p>EDA is crucial for understanding the dataset's characteristics, identifying patterns and potential outliers, and informing subsequent data processing steps. Below are some insights gained from EDA, complete with visualizations and data cleaning procedures.</p>
  
  <h5>Data Visualizations</h5>
  
  <ul>
    <li><strong>Target Variable Distribution</strong>: A histogram of the <code>target</code> variable can show us how balanced or imbalanced our dataset is.</li>
    <li><strong>Missing Values</strong>: Bar plots can show the number of missing values in each feature.</li>
    <li><strong>Word Count in Tweets</strong>: A histogram can provide an understanding of the text length in the tweets.</li>
    <li><strong>Most Common Keywords</strong>: A bar plot can display the most frequently occurring keywords in the dataset.</li>
  </ul>
</div>




In [None]:
# Plot Histogram for 'target' distribution
sns.countplot(x='target', data=train_data)
plt.title('Distribution of Target Variable')
plt.show()

# Check for Missing Values
print("\nMissing Values in Train Data:")
print(train_data.isnull().sum())

print("\nMissing Values in Test Data:")
print(test_data.isnull().sum())

# Word Count Distribution
word_counts = train_data['text'].apply(lambda x: len(x.split()))
sns.histplot(word_counts, bins=30)
plt.title('Word Count Distribution in Tweets')
plt.show()

# Most Common Keywords (missing keywords are skipped here; missing values are filled in the cleaning cell)
common_keywords = Counter(train_data['keyword'].dropna())
common_keywords = common_keywords.most_common(10)
sns.barplot(x=[count for keyword, count in common_keywords], y=[keyword for keyword, count in common_keywords])
plt.title('Most Common Keywords')
plt.show()

<div style="background-color: lightgrey; color: black; padding: 20px; border: 1px solid black;">
  <h4>Data Cleaning Procedures</h4>
  
  <ul>
    <li><strong>Handling Missing Values</strong>: Fill missing <code>keyword</code> and <code>location</code> values with a placeholder like "None".</li>
    <li><strong>Text Cleaning</strong>: Remove URLs, special characters, and perform lowercase conversion for text normalization.</li>
  </ul>
</div>
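
The text-cleaning steps listed above (URL removal, special-character removal, lowercasing) can be sketched as a single helper. This is a minimal illustration, not the notebook's exact cleaning code: the function name `clean_tweet` and the three regular expressions are assumptions chosen for the example.

```python
import re

# Illustrative patterns (assumptions, not the notebook's exact rules).
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", flags=re.IGNORECASE)
NON_ALNUM_PATTERN = re.compile(r"[^a-z0-9\s]")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Remove URLs and special characters, lowercase, and collapse whitespace."""
    text = URL_PATTERN.sub(" ", text)        # drop links first so their punctuation is not left behind
    text = text.lower()                      # normalize case
    text = NON_ALNUM_PATTERN.sub(" ", text)  # replace anything that is not a letter, digit, or whitespace
    return WHITESPACE_PATTERN.sub(" ", text).strip()


print(clean_tweet("Forest FIRE near La Ronge!! http://t.co/abc123 #wildfire"))
# prints: forest fire near la ronge wildfire
```

A helper like this could be applied with `train_data['text'].apply(clean_tweet)`. Note that it also strips the `#` from hashtags and the `@` from mentions, which may or may not be desirable for this task.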


In [None]:
# Handling Missing Values
train_data.fillna('None', inplace=True)
test_data.fillna('None', inplace=True)

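# Text Cleaning: remove URLs (the Keras Tokenizer used later lowercases text and strips punctuation by default)
url_pattern = r'http\S+|www\.\S+'
train_data['text'] = train_data['text'].str.replace(url_pattern, '', case=False, regex=True)
test_data['text'] = test_data['text'].str.replace(url_pattern, '', case=False, regex=True)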

print("Train Data")
display(train_data.head())
print("\nTest Data")
display(test_data.head())

<div style="background-color: lightgrey; color: black; padding: 20px; border: 1px solid black;">
  <h4>Plan of Analysis</h4>
  
  <p>Based on the EDA:</p>
  
  <ul>
    <li><strong>Data Imbalance</strong>: The target variable is somewhat imbalanced, with more tweets being not about real disasters. Techniques like SMOTE or random undersampling could be considered.</li>
    <li><strong>Feature Engineering</strong>: Given the missing values in <code>keyword</code> and <code>location</code>, these columns could either be filled with a placeholder or used for feature engineering. Tokenization and padding will be essential for the <code>text</code> column.</li>
    <li><strong>Model Selection</strong>: The text nature of the data confirms that NLP models like LSTM or BERT would be beneficial for this problem.</li>
    <li><strong>Evaluation</strong>: F1-Score remains the metric of choice, aligning with the competition's requirements.</li>
  </ul>
</div>
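
To make the tokenization and padding step in the plan concrete, here is a minimal pure-Python sketch of what the two operations do conceptually. The helper names `build_word_index` and `texts_to_padded` are hypothetical. The notebook itself relies on the Keras `Tokenizer` and `pad_sequences`, whose handling of punctuation and unknown words differs in detail.

```python
from collections import Counter


def build_word_index(texts, max_vocab_size):
    """Map the most frequent words to integer ids starting at 1 (id 0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = counts.most_common(max_vocab_size - 1)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=1)}


def texts_to_padded(texts, word_index, max_len):
    """Convert texts to fixed-length id sequences, dropping unknown words and left-padding with 0."""
    padded = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        ids = ids[-max_len:]                             # truncate from the front, keeping the last max_len ids
        padded.append([0] * (max_len - len(ids)) + ids)  # pad at the front so all rows have equal length
    return padded


texts = ["fire fire fire smoke smoke help", "fire smoke"]
word_index = build_word_index(texts, max_vocab_size=3)   # keeps the 2 most frequent words
print(word_index)                                        # {'fire': 1, 'smoke': 2}
print(texts_to_padded(texts, word_index, max_len=4))     # [[1, 1, 2, 2], [0, 0, 1, 2]]
```

Padding at the front means the most recent tokens sit next to the final LSTM step. Padding at the back is an equally valid choice as long as training and test data are treated the same way.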


<div style="background-color: lightgrey; color: black; padding: 20px; border: 1px solid black;">
  <h3>Model Architecture</h3>
  
  <h4>Architecture Overview</h4>
  <p>The proposed architecture for this task is centered around Long Short-Term Memory (LSTM) layers. The model consists of the following components:</p>
  <ul>
    <li><strong>Embedding Layer</strong>: Utilized for converting tokenized words into vectors.</li>
    <li><strong>LSTM Layer</strong>: Employed to capture sequential dependencies in the text data.</li>
    <li><strong>Dropout Layer</strong>: Integrated for regularization purposes to mitigate overfitting.</li>
    <li><strong>Dense Layer</strong>: A fully connected layer designed for classification.</li>
    <li><strong>Output Layer</strong>: Equipped with a sigmoid activation function for binary classification.</li>
  </ul>
  
  <h4>Rationale</h4>
  <ul>
    <li><strong>Embedding Layer</strong>: Word embeddings are instrumental in capturing semantic relationships between words by providing a dense vector space representation.</li>
    <li><strong>LSTM Layer</strong>: LSTMs use gated memory cells that mitigate (though do not eliminate) the vanishing gradient problem, which makes them better suited to long sequences than plain RNNs.</li>
    <li><strong>Dropout Layer</strong>: This layer aids in model generalization by reducing the risk of overfitting.</li>
    <li><strong>Dense and Output Layer</strong>: These layers are standard in machine learning pipelines for classification tasks.</li>
  </ul>
  
  <h4>Word Embedding Strategy: GloVe (Global Vectors for Word Representation)</h4>
  <p>GloVe is recommended for word embedding due to its ability to capture both global and local semantic relationships between words.</p>
  
  <h5>Working Mechanism of GloVe</h5>
  <ol>
    <li>A co-occurrence matrix is constructed from a large text corpus, wherein each element \( X_{ij} \) counts how often word \( j \) appears within a fixed context window around word \( i \).</li>
    <li>Word vectors are then fitted with a weighted least-squares objective so that the dot product of two word vectors (plus bias terms) approximates the logarithm of their co-occurrence count, yielding dense vectors capable of capturing a variety of word relationships.</li>
  </ol>
  
  <h4>Building and Training Approach</h4>
  <ul>
    <li><strong>Data Preprocessing</strong>: Tokenization of the textual data followed by sequence padding.</li>
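    <li><strong>Word Embedding</strong>: An embedding layer maps token ids to dense vectors. It can be initialized from pre-trained GloVe vectors, although the code in this notebook learns the embedding weights from scratch.</li>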
    <li><strong>Model Construction</strong>: Implementation of the LSTM-based architecture as outlined above.</li>
    <li><strong>Model Compilation</strong>: Utilization of binary cross-entropy as the loss function and the Adam optimizer.</li>
    <li><strong>Model Training</strong>: Execution of model training on the dataset, designating a subset for validation.</li>
    <li><strong>Performance Evaluation</strong>: Application of the F1-Score metric, aligning with competition guidelines.</li>
  </ul>
  
  <h4>References</h4>
  <ul>
    <li>Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global Vectors for Word Representation.</li>
    <li>Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory.</li>
  </ul>
  
  <p>By combining word embeddings with LSTM layers, the model aims to capture both the semantic and the sequential structure of the text, and thereby to classify tweets more accurately.</p>
</div>
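
The first step of the GloVe mechanism described above, building the co-occurrence counts, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function name `cooccurrence_counts` is hypothetical, tokens are split on whitespace, and every pair inside the window counts equally. The original GloVe method down-weights more distant pairs, and the vector-fitting step is omitted here.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` tokens of each other."""
    counts = defaultdict(float)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # Look back at most `window` tokens from position i.
            for j in range(max(0, i - window), i):
                # Symmetric update: the pair contributes to both X[word, context] and X[context, word].
                counts[(word, tokens[j])] += 1.0
                counts[(tokens[j], word)] += 1.0
    return dict(counts)


X = cooccurrence_counts(["flood hits town", "flood hits coast"], window=1)
print(X[("flood", "hits")])       # 2.0
print(("flood", "town") in X)     # False, because the two words are 2 tokens apart
```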


In [None]:
# Data Preprocessing
# Tokenization
tokenizer = Tokenizer(num_words=CONFIG.max_vocab_size)
tokenizer.fit_on_texts(train_data['text'])
sequences = tokenizer.texts_to_sequences(train_data['text'])

# Padding sequences
data = pad_sequences(sequences, maxlen=CONFIG.max_sequence_length)

# Prepare labels
labels = np.array(train_data['target'])

# Splitting the data into training and validation sets
print("Splitting the data into training and validation sets...")
X_train, X_val, y_train, y_val = train_test_split(data, labels, test_size=0.2, random_state=42)

# Model Building
print("\nModel building...")
model = Sequential()
model.add(Embedding(CONFIG.max_vocab_size, CONFIG.embedding_dim, input_length=CONFIG.max_sequence_length))
model.add(LSTM(50, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(50))
model.add(Dense(1, activation='sigmoid'))
print("\n---------> OK")

# Model Compilation
print("\nModel compilation...")
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
print("\n---------> OK")

# Model Training
print("\nModel training...")
print("====================\n")
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=CONFIG.epochs, batch_size=CONFIG.batch_size)
print("\n---------> OK")

# Model Evaluation
print("\nModel evaluation...")
y_pred = (model.predict(X_val) > 0.5).astype("int32")
f1 = f1_score(y_val, y_pred)
print("\n---------> OK")

print("\n\nF1 Score:", f1)


<div style="background-color: lightgrey; color: black; padding: 20px; border: 1px solid black;">
  <h4>Hyperparameter Tuning</h4>
  
  <ul>
    <li><strong>Batch Size</strong>: Experiment with different batch sizes (32, 64, 128, etc.) to see how they affect training speed and model performance.</li>
    <li><strong>Learning Rate</strong>: Try varying learning rates with the Adam optimizer to observe its impact on model convergence.</li>
    <li><strong>LSTM Units</strong>: Experiment with the number of LSTM units to see if more complex representations help.</li>
  </ul>
  
  <h4>Different Architectures</h4>
  
  <ul>
    <li><strong>GRU vs LSTM</strong>: Compare the performance of Gated Recurrent Units (GRU) with LSTM to see if one outperforms the other for this task.</li>
    <li><strong>Bidirectional LSTM</strong>: Using bidirectional LSTMs can help the model capture context from both the past and the future tokens in the sequence.</li>
    <li><strong>Stacked LSTM</strong>: Experiment with stacking multiple LSTM layers on top of each other.</li>
  </ul>
  
  <h4>Techniques to Improve Training or Performance</h4>
  
  <ul>
    <li><strong>Dropout Rate</strong>: Try different dropout rates to see the model's robustness and its ability to generalize.</li>
    <li><strong>Batch Normalization</strong>: Incorporate batch normalization layers to stabilize and perhaps accelerate training.</li>
    <li><strong>Regularization</strong>: L1 or L2 regularization can also be added to the dense layer to prevent overfitting.</li>
  </ul>
  
  <h4>Evaluation Metrics</h4>
  
  <p>Document the F1-Score, Precision, Recall, and AUC-ROC for each experiment.</p>
</div>
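
As a reference for the four metrics named above, the following sketch computes them from first principles. The function names are hypothetical and the toy labels and scores are made up for illustration. In practice the scikit-learn functions would be used. The example also shows that AUC-ROC computed from thresholded 0/1 predictions generally differs from AUC-ROC computed from predicted probabilities.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def auc_roc(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.55, 0.3, 0.1]          # made-up model outputs
y_pred = [1 if p > 0.5 else 0 for p in y_prob]    # hard labels at a 0.5 threshold

print(precision_recall_f1(y_true, y_pred))  # precision, recall, and F1 are all 2/3 here
print(auc_roc(y_true, y_prob))              # 8/9, ranking quality of the probabilities
print(auc_roc(y_true, y_pred))              # 2/3, the mean of recall and specificity at the threshold
```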


In [None]:
# Initialize an empty list to store results
results_list = []

# Define a function to build, train, and evaluate a model
def run_experiment(model, model_name, X_train, y_train, X_val, y_val, results_list):
    # Compile and train the model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=CONFIG.epochs, batch_size=CONFIG.batch_size, validation_data=(X_val, y_val))
    
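    # Evaluate the model
    y_prob = model.predict(X_val).ravel()        # predicted probabilities
    y_pred = (y_prob > 0.5).astype("int32")      # hard labels at a 0.5 threshold
    f1 = f1_score(y_val, y_pred)
    precision = precision_score(y_val, y_pred)
    recall = recall_score(y_val, y_pred)
    auc_roc = roc_auc_score(y_val, y_prob)       # AUC-ROC is computed from scores, not thresholded labels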
    
    # Add the results to the list
    results_list.append({'Experiment': model_name, 'F1-Score': f1, 'Precision': precision, 'Recall': recall, 'AUC-ROC': auc_roc})

# Hyperparameter tuning for Base LSTM Model
def build_model(hp):
    model = Sequential()
    model.add(Embedding(CONFIG.max_vocab_size, hp.Int('embedding_dim', min_value=50, max_value=200, step=50), input_length=CONFIG.max_sequence_length))
    model.add(LSTM(hp.Int('lstm_units', min_value=32, max_value=128, step=32)))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

tuner = RandomSearch(
    build_model,
    objective='val_accuracy',
    max_trials=5,
    directory='output',
    project_name='NLP_Disaster_Tweets_Base_LSTM',
    seed=42
)


# Start Keras Tuning
print("==========================================")
print("========= Starting Keras Tuning ==========")
print("==========================================")
tuner.search(X_train, y_train, epochs=CONFIG.epochs, validation_data=(X_val, y_val), verbose=1)

# Get the optimal hyperparameters
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
embedding_dim_best = best_hps.get('embedding_dim')
lstm_units_best = best_hps.get('lstm_units')

# Base LSTM Model with best hyperparameters
print("\n\n===========================================")
print("====  Training Base LSTM Model (Tuned) ====")
print("===========================================")
model1 = Sequential([
    Embedding(CONFIG.max_vocab_size, embedding_dim_best, input_length=CONFIG.max_sequence_length),
    LSTM(lstm_units_best),
    Dense(1, activation='sigmoid')
])
run_experiment(model1, 'Base LSTM Model (Tuned)', X_train, y_train, X_val, y_val, results_list)

# LSTM with 100 Units
print("\n\n==========================================")
print("=====  Training LSTM with 100 Units  =====")
print("==========================================")
model2 = Sequential([
    Embedding(CONFIG.max_vocab_size, embedding_dim_best, input_length=CONFIG.max_sequence_length),
    LSTM(100),
    Dense(1, activation='sigmoid')
])
run_experiment(model2, 'LSTM with 100 Units', X_train, y_train, X_val, y_val, results_list)

# GRU Model
print("\n\n==========================================")
print("=========   Training GRU Model   =========")
print("==========================================")
model3 = Sequential([
    Embedding(CONFIG.max_vocab_size, embedding_dim_best, input_length=CONFIG.max_sequence_length),
    GRU(lstm_units_best),
    Dense(1, activation='sigmoid')
])
run_experiment(model3, 'GRU Model', X_train, y_train, X_val, y_val, results_list)

# Bidirectional LSTM
print("\n\n==========================================")
print("====== Training Bidirectional LSTM  ======")
print("==========================================")
model4 = Sequential([
    Embedding(CONFIG.max_vocab_size, embedding_dim_best, input_length=CONFIG.max_sequence_length),
    Bidirectional(LSTM(lstm_units_best)),
    Dense(1, activation='sigmoid')
])
run_experiment(model4, 'Bidirectional LSTM', X_train, y_train, X_val, y_val, results_list)

# Stacked LSTM
print("\n\n==========================================")
print("========= Training Stacked LSTM  =========")
print("==========================================")
model5 = Sequential([
    Embedding(CONFIG.max_vocab_size, embedding_dim_best, input_length=CONFIG.max_sequence_length),
    LSTM(lstm_units_best, return_sequences=True),
    LSTM(lstm_units_best),
    Dense(1, activation='sigmoid')
])
run_experiment(model5, 'Stacked LSTM', X_train, y_train, X_val, y_val, results_list)

# LSTM with Batch Normalization
print("\n\n================================================")
print("==== Training LSTM with Batch Normalization ====")
print("================================================")
model6 = Sequential([
    Embedding(CONFIG.max_vocab_size, embedding_dim_best, input_length=CONFIG.max_sequence_length),
    LSTM(lstm_units_best),
    BatchNormalization(),
    Dense(1, activation='sigmoid')
])

run_experiment(model6, 'LSTM with Batch Normalization', X_train, y_train, X_val, y_val, results_list)

# Print Centered
def print_centered(df_str, title, width=80, char='='):
    # Calculate the padding needed for each title
    padding = (width - len(title)) // 2
    
    # Print the titles and DataFrame, centered
    print(char * width)
    print(" " * padding + title + " " * padding)
    print(char * width)
    print(df_str)
    print(char * width)

# Create a DataFrame from the results list
results_df = pd.DataFrame(results_list)

# Display the results
results_str = results_df.to_string(index=False)
print_centered(results_str, "Results")


<div style="background-color: lightgrey; color: black; padding: 20px; border: 1px solid black;">
<h4>Results and Analysis</h4>
<table border="1">
  <thead>
    <tr>
      <th>Experiment</th>
      <th>F1-Score</th>
      <th>Precision</th>
      <th>Recall</th>
      <th>AUC-ROC</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Base LSTM Model (Tuned)</td>
      <td>0.708938</td>
      <td>0.703030</td>
      <td>0.714946</td>
      <td>0.745345</td>
    </tr>
    <tr>
      <td>LSTM with 100 Units</td>
      <td>0.712538</td>
      <td>0.707132</td>
      <td>0.718028</td>
      <td>0.748602</td>
    </tr>
    <tr>
      <td>GRU Model</td>
      <td>0.715180</td>
      <td>0.726550</td>
      <td>0.704160</td>
      <td>0.753682</td>
    </tr>
    <tr>
      <td>Bidirectional LSTM</td>
      <td>0.706635</td>
      <td>0.734219</td>
      <td>0.681048</td>
      <td>0.748991</td>
    </tr>
    <tr>
      <td>Stacked LSTM</td>
      <td>0.727273</td>
      <td>0.704185</td>
      <td>0.751926</td>
      <td>0.758686</td>
    </tr>
    <tr>
      <td>LSTM with Batch Normalization</td>
      <td>0.717760</td>
      <td>0.687853</td>
      <td>0.750385</td>
      <td>0.748762</td>
    </tr>
  </tbody>
</table>

  
  <h5>Summary</h5>
  
  <ol>
    <li><strong>F1-Score</strong>: The Stacked LSTM model has the highest F1-Score of approximately 0.727. Because F1 is the harmonic mean of precision and recall, this indicates the best balance between false positives and false negatives among the six experiments.</li>
    <li><strong>Precision</strong>: The Bidirectional LSTM model has the highest precision of about 0.734, making it the most reliable when it predicts the positive class. However, this comes with the lowest recall (about 0.681).</li>
    <li><strong>Recall</strong>: The Stacked LSTM model also has the highest recall of approximately 0.752, meaning it identifies the largest share of actual disaster tweets.</li>
    <li><strong>AUC-ROC</strong>: The Stacked LSTM model has the highest AUC-ROC score of about 0.759. The reported AUC-ROC values were computed from thresholded 0/1 predictions rather than predicted probabilities, so they equal the average of recall and specificity at the 0.5 threshold rather than a threshold-independent ranking score.</li>
    <li><strong>Overall Performance</strong>: The Stacked LSTM model leads on F1-Score, recall, and AUC-ROC, but not on precision (0.704, versus 0.734 for the Bidirectional LSTM). It is a strong candidate for scenarios where both precision and recall are important.</li>
  </ol>
  
  <p>It's important to consider the trade-offs between these metrics depending on the specific needs of your application. For example, if false positives are more costly, you might prioritize a model with higher precision. On the other hand, if missing a positive sample is more detrimental, a model with higher recall would be more appropriate.</p>
</div>
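
The per-metric ranking in the summary can be checked directly from the table. The sketch below copies the reported values into a dictionary (the variable names are illustrative) and picks the best experiment for each metric. It also confirms that a reported F1-Score is the harmonic mean of the reported precision and recall.

```python
# Values copied from the results table above.
results = {
    "Base LSTM Model (Tuned)":       {"F1-Score": 0.708938, "Precision": 0.703030, "Recall": 0.714946, "AUC-ROC": 0.745345},
    "LSTM with 100 Units":           {"F1-Score": 0.712538, "Precision": 0.707132, "Recall": 0.718028, "AUC-ROC": 0.748602},
    "GRU Model":                     {"F1-Score": 0.715180, "Precision": 0.726550, "Recall": 0.704160, "AUC-ROC": 0.753682},
    "Bidirectional LSTM":            {"F1-Score": 0.706635, "Precision": 0.734219, "Recall": 0.681048, "AUC-ROC": 0.748991},
    "Stacked LSTM":                  {"F1-Score": 0.727273, "Precision": 0.704185, "Recall": 0.751926, "AUC-ROC": 0.758686},
    "LSTM with Batch Normalization": {"F1-Score": 0.717760, "Precision": 0.687853, "Recall": 0.750385, "AUC-ROC": 0.748762},
}

metrics = ["F1-Score", "Precision", "Recall", "AUC-ROC"]

# For each metric, find the experiment with the highest reported value.
best_by_metric = {
    metric: max(results, key=lambda name: results[name][metric])
    for metric in metrics
}
for metric, name in best_by_metric.items():
    print(f"{metric}: {name} ({results[name][metric]:.3f})")


def harmonic_mean(precision, recall):
    """F1 is defined as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


stacked = results["Stacked LSTM"]
print(round(harmonic_mean(stacked["Precision"], stacked["Recall"]), 4))  # 0.7273
```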


<div style="background-color: lightgray; color: black; border: 1px solid black; padding: 20px; margin: 20px;">
    <h3>Discussion and Interpretation of Results</h3>
    <ol>
        <li><strong>Model Comparison</strong>: The Stacked LSTM model showed the best performance in terms of F1-Score and AUC-ROC. This suggests that stacking layers can be beneficial for this problem, although the margins are small (about 0.02 in F1-Score over the base model) and come from a single training run per model.</li>
        <li><strong>Trade-offs</strong>: Precision and recall were often inversely related across the models. For instance, the Bidirectional LSTM model had the highest precision but the lowest recall. This underlines the importance of understanding the specific needs of the application when choosing an evaluation metric.</li>
        <li><strong>Hyperparameter Tuning</strong>: The Base LSTM Model (Tuned) had balanced precision and recall (about 0.703 and 0.715), but its F1-Score (0.709) and AUC-ROC (0.745) were among the lowest of the six experiments, so tuning the embedding dimension and LSTM units alone gave a solid baseline rather than the best model.</li>
    </ol>
    <h3>Learnings and Takeaways</h3>
    <ol>
        <li><strong>Complexity vs. Performance</strong>: Adding depth through stacked LSTM layers improved performance, particularly in terms of F1-Score and AUC-ROC, whereas adding complexity through a Bidirectional LSTM did not.</li>
        <li><strong>Importance of Normalization</strong>: Despite having the lowest precision, the LSTM model with Batch Normalization had the second-highest recall and F1-Score, which suggests that normalizing the LSTM output before the classification layer can help training.</li>
        <li><strong>Metric Sensitivity</strong>: Different models excelled in different metrics, emphasizing the need to consider multiple evaluation metrics when assessing models.</li>
    </ol>
    <h3>What Worked and What Didn't</h3>
    <ul>
        <li><strong>Worked</strong>: Hyperparameter tuning, Stacked LSTM architecture, and GRU architecture.</li>
        <li><strong>Didn't Work</strong>: The Bidirectional LSTM had the highest precision but the lowest recall and F1-Score (about 0.707), so its added complexity did not pay off.</li>
    </ul>
    <h3>Future Improvements</h3>
    <ol>
        <li><strong>Ensemble Methods</strong>: Combining the predictions from models like Stacked LSTM and GRU could potentially lead to more robust results.</li>
        <li><strong>Data Augmentation</strong>: Resampling techniques such as SMOTE, which operates on numeric feature vectors rather than raw text, could be employed to balance the class distribution, which might improve model performance.</li>
        <li><strong>Feature Engineering</strong>: Additional text features like sentiment or text length could provide the model with more useful information.</li>
        <li><strong>Advanced Architectures</strong>: Experimenting with newer architectures like Transformers could potentially yield even better results.</li>
        <li><strong>Fine-Tuning</strong>: Using pre-trained embeddings for word representation could be another avenue for further improvements.</li>
    </ol>
    <p>In summary, the choice of model and evaluation metrics should align closely with the specific objectives of the application. Hyperparameter tuning remains a crucial step in model development, and it's essential to consider the trade-offs between different evaluation metrics.</p>
</div>
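
As a starting point for the pre-trained embedding idea listed under future improvements, the sketch below shows how GloVe-style vectors could be turned into an embedding matrix aligned with a tokenizer's word ids. Everything here is illustrative: the helper names are hypothetical, the tiny in-memory "file" stands in for a real GloVe text file (one word followed by its vector components per line), and the vector values are made up.

```python
import io


def load_vectors(file_obj):
    """Parse GloVe-style lines: a word followed by space-separated float components."""
    vectors = {}
    for line in file_obj:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, max_vocab_size, embedding_dim):
    """Row i holds the vector of the word with id i; words without a vector keep an all-zero row."""
    matrix = [[0.0] * embedding_dim for _ in range(max_vocab_size)]
    for word, idx in word_index.items():
        if idx < max_vocab_size and word in vectors:
            matrix[idx] = vectors[word]
    return matrix


# Tiny in-memory stand-in for a real GloVe file (values are made up).
fake_glove = io.StringIO("fire 0.1 0.2 0.3\nflood 0.4 0.5 0.6\n")
vectors = load_vectors(fake_glove)

word_index = {"fire": 1, "storm": 2, "flood": 3}  # hypothetical stand-in for tokenizer.word_index
matrix = build_embedding_matrix(word_index, vectors, max_vocab_size=4, embedding_dim=3)
for row in matrix:
    print(row)
```

With Keras, a matrix like this could be converted to a NumPy array and supplied to the `Embedding` layer through `embeddings_initializer=tf.keras.initializers.Constant(...)`, typically with `trainable=False` at first. Whether this improves the F1-Score on this dataset would need to be tested.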


## File Submission

In [None]:
# Importing necessary modules
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, BatchNormalization, Dropout
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import pandas as pd

# Extracting features and labels from the training data
X = train_data['text']
y = train_data['target']

# Configuration parameters for the submission model (this dict rebinds the name CONFIG, replacing the class defined earlier)
CONFIG = {
    'max_vocab_size': 5000,  # Maximum size of vocabulary
    'max_sequence_length': 100,  # Maximum length of sequence
    'embedding_dim': 100,  # Dimension of embedding
    'lstm_units': 64,  # LSTM units
    'batch_size': 32,  # Batch size
    'epochs': 100  # Number of epochs
}

# Initialize the tokenizer
tokenizer = Tokenizer(num_words=CONFIG['max_vocab_size'], oov_token='<OOV>')
tokenizer.fit_on_texts(X)

# Tokenize and pad sequences
X_sequences = tokenizer.texts_to_sequences(X)
X_padded = pad_sequences(X_sequences, maxlen=CONFIG['max_sequence_length'], padding='post', truncating='post')

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_padded, y, test_size=0.2, random_state=42)

# Improved Stacked LSTM Model
model = Sequential([
    Embedding(CONFIG['max_vocab_size'], CONFIG['embedding_dim'], input_length=CONFIG['max_sequence_length']),
    LSTM(CONFIG['lstm_units'], return_sequences=True),
    Dropout(0.2),  # Added dropout layer
    LSTM(CONFIG['lstm_units']),
    BatchNormalization(),  # Added BatchNormalization layer
    Dense(1, activation='sigmoid')
])

# Compile the model
optimizer = Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=CONFIG['epochs'], batch_size=CONFIG['batch_size'], validation_data=(X_val, y_val))

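# Evaluate the model on the validation data
y_prob = model.predict(X_val).ravel()        # predicted probabilities
y_pred = (y_prob > 0.5).astype("int32")      # hard labels at a 0.5 threshold
f1 = f1_score(y_val, y_pred)
precision = precision_score(y_val, y_pred)
recall = recall_score(y_val, y_pred)
auc_roc = roc_auc_score(y_val, y_prob)       # AUC-ROC is computed from scores, not thresholded labels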

# Output the evaluation metrics
print(f"F1-Score: {f1}, Precision: {precision}, Recall: {recall}, AUC-ROC: {auc_roc}")

# Prepare the test data
X_test_sequences = tokenizer.texts_to_sequences(test_data['text'])
X_test_padded = pad_sequences(X_test_sequences, maxlen=CONFIG['max_sequence_length'], padding='post', truncating='post')

# Make predictions on the test data
test_predictions = (model.predict(X_test_padded) > 0.5).astype("int32")

# Prepare the submission DataFrame
submission_df = pd.DataFrame({'id': test_data['id'], 'target': test_predictions.flatten()})

# Save the submission file
submission_df.to_csv("submission.csv", index=False)

