# Long Short-Term Memory (LSTM)

## Problem Type
**Long Short-Term Memory (LSTM)** networks are primarily used for:
- **Sequential Data Processing** (e.g., time series, text, speech)
- **Supervised** learning
- **Applications**: Language Modeling, Machine Translation, Speech Recognition, Time Series Forecasting, and more.

### How LSTMs Work
- **Memory cell:**
  - LSTMs introduce a memory cell that maintains information over long time periods, enabling the network to learn long-term dependencies.
- **Gates:**
  - LSTMs use three gates to regulate the flow of information:
    - **Forget Gate:** Decides what information to discard from the cell state.
    - **Input Gate:** Decides what new information to add to the cell state.
    - **Output Gate:** Decides what part of the cell state to output as the hidden state.
- **Cell state:**
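  - The cell state is the key component of LSTMs. It is updated additively (scaled by the forget gate, then incremented by the gated input), so information and gradients can pass through many time steps with little modification, which mitigates the vanishing gradient problem.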
- **Sequential processing:**
  - Like RNNs, LSTMs process sequences one element at a time, maintaining a hidden state that is updated at each time step.
- **Backpropagation Through Time (BPTT):**
  - LSTMs are trained using BPTT, which computes gradients through the entire sequence and updates the network weights accordingly.
- **Long-term dependencies:**
  - LSTMs are designed to remember information for long periods, making them suitable for tasks where long-term context is important.
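
The gate and cell-state updates described above can be sketched as a single time step in NumPy. This is a minimal illustration of the standard LSTM equations, not the implementation used by any particular library: the function name `lstm_step`, the stacked weight layout (forget, input, candidate, output), and the random weights are all assumptions made for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """Run one LSTM time step.

    Assumed (illustrative) layout: W is (4*hidden, input_dim), U is
    (4*hidden, hidden), b is (4*hidden,), with the four blocks ordered as
    forget gate, input gate, candidate values, output gate.
    """
    hidden = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b

    f = sigmoid(z[0:hidden])                 # forget gate: what to discard
    i = sigmoid(z[hidden:2 * hidden])        # input gate: what to add
    g = np.tanh(z[2 * hidden:3 * hidden])    # candidate cell values
    o = sigmoid(z[3 * hidden:4 * hidden])    # output gate: what to expose

    c_t = f * c_prev + i * g                 # additive cell-state update
    h_t = o * np.tanh(c_t)                   # hidden state passed onward
    return h_t, c_t


# Process a short random sequence one element at a time.
rng = np.random.default_rng(0)
input_dim, hidden = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden, input_dim))
U = rng.normal(scale=0.1, size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)

h = np.zeros(hidden)
c = np.zeros(hidden)
sequence = rng.normal(size=(5, input_dim))
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)
```

Because the new cell state is `f * c_prev + i * g`, a forget gate close to 1 carries the previous cell state forward almost unchanged, which is how the cell retains long-term context.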

### Key Tuning Hyperparameters
- **`hidden_size`:**
  - **Description:** Number of units in the LSTM’s hidden layer.
  - **Impact:** Larger hidden sizes allow the model to capture more complex patterns but increase computational cost and risk of overfitting.
  - **Default:** No universal default; values from `128` to `512` are common.
- **`num_layers`:**
  - **Description:** Number of stacked LSTM layers.
  - **Impact:** More layers can capture deeper temporal dependencies but may require more regularization to prevent overfitting.
  - **Default:** `1` (can be increased for deeper models).
- **`learning_rate`:**
  - **Description:** Step size for updating weights during training.
  - **Impact:** Higher values speed up training but may cause instability; lower values provide more stable convergence but slow down training.
  - **Default:** `0.001` (varies with optimizer).
- **`dropout_rate`:**
  - **Description:** Fraction of units to drop during training to prevent overfitting.
  - **Impact:** Regularizes the network; values that are too high can cause underfitting.
  - **Default:** `0.0` (no dropout); values from `0.2` to `0.5` are common in practice.
- **`sequence_length`:**
  - **Description:** Length of input sequences processed by the LSTM.
  - **Impact:** Longer sequences can capture more context but raise computational cost and make it harder for gradients to propagate back to early time steps.
  - **Default:** Varies depending on the problem.
- **`batch_size`:**
  - **Description:** Number of sequences processed in parallel during training.
  - **Impact:** Larger batch sizes give less noisy gradient estimates, which tends to stabilize training, but require more memory.
  - **Default:** Typically `32` or `64`.
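
To make the cost side of `hidden_size` and `num_layers` concrete, the sketch below counts trainable parameters for a stack of LSTM layers. It assumes the standard formulation with four weight blocks and one bias vector per gate, which matches the Keras `LSTM` layer's defaults; other libraries may count biases differently. The helper names `lstm_param_count` and `stacked_lstm_param_count` are hypothetical.

```python
def lstm_param_count(input_dim, hidden_size):
    """Parameters in one LSTM layer: 4 gates/blocks, each with input
    weights, recurrent weights, and a bias vector."""
    return 4 * (hidden_size * (input_dim + hidden_size) + hidden_size)


def stacked_lstm_param_count(input_dim, hidden_size, num_layers):
    """Parameters in num_layers stacked LSTM layers of equal width."""
    total = 0
    layer_input = input_dim
    for _ in range(num_layers):
        total += lstm_param_count(layer_input, hidden_size)
        layer_input = hidden_size  # later layers consume the previous hidden state
    return total


# Compare a few hidden sizes for a 2-layer stack fed by 128-dimensional inputs.
for hidden_size in (128, 256, 512):
    n = stacked_lstm_param_count(input_dim=128, hidden_size=hidden_size, num_layers=2)
    print(f"hidden_size={hidden_size}: {n:,} parameters")
```

The count grows roughly quadratically with `hidden_size` and linearly with `num_layers`, which is why wider layers are the faster route to both higher capacity and higher overfitting risk.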

### Pros vs Cons

| Pros                                                  | Cons                                                   |
|-------------------------------------------------------|--------------------------------------------------------|
| Capable of learning long-term dependencies in sequences | Computationally expensive, especially for long sequences or deep networks |
| Effective for tasks involving time series, text, speech, etc. | Training can be slow due to the complexity of the network |
| Mitigates vanishing gradient problem better than traditional RNNs | Prone to overfitting without careful regularization    |
| Can handle input sequences of varying lengths         | Requires significant hyperparameter tuning for optimal performance |
| Well-suited for tasks with complex temporal dynamics  | Difficult to interpret learned representations         |

### Evaluation Metrics
- **Accuracy (Classification):**
  - **Description:** Ratio of correct predictions to total predictions.
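  - **Good Value:** Higher is better; values above 0.85 are often considered strong, though the threshold depends on the task and class balance.
  - **Bad Value:** Values at or near chance level (0.5 for balanced binary classification) indicate the model has learned little.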
- **Precision (Classification):**
  - **Description:** Proportion of true positives among all positive predictions.
  - **Good Value:** Higher values indicate fewer false positives, especially important in imbalanced datasets.
  - **Bad Value:** Low values suggest many false positives.
- **Recall (Classification):**
  - **Description:** Proportion of actual positives correctly identified.
  - **Good Value:** Higher values indicate fewer false negatives, important in recall-sensitive applications.
  - **Bad Value:** Low values suggest many false negatives.
- **F1 Score (Classification):**
  - **Description:** Harmonic mean of Precision and Recall.
  - **Good Value:** Higher values indicate a good balance between Precision and Recall.
  - **Bad Value:** Low values mean that Precision, Recall, or both are low.
- **Perplexity (Language Modeling):**
  - **Description:** Measures the uncertainty in predicting the next word in a sequence; lower perplexity indicates better performance.
  - **Good Value:** Lower is better; absolute values depend on the dataset and vocabulary, but validation perplexity should fall steadily as training progresses.
  - **Bad Value:** High perplexity indicates poor predictive capability.
- **Mean Squared Error (MSE) (Regression):**
  - **Description:** Measures the average squared difference between predicted and actual values.
  - **Good Value:** Lower is better; values close to `0` indicate accurate predictions. MSE is expressed in squared target units, so judge it relative to the scale of the target.
  - **Bad Value:** Higher values suggest the model’s predictions deviate significantly from the actual values.
- **AUC-ROC (Classification):**
  - **Description:** Measures the model's ability to distinguish between classes across all thresholds.
  - **Good Value:** Values closer to 1 indicate strong separability between classes.
  - **Bad Value:** Values near 0.5 suggest random guessing.
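
The definitions above can be checked by hand on a small example. The sketch below computes accuracy, precision, recall, F1, perplexity, and MSE with NumPy; the function names and the toy labels are assumptions chosen for illustration, and in practice a library such as scikit-learn would normally be used instead.

```python
import numpy as np


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (0 or 1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def perplexity(true_token_probs):
    """exp(mean negative log-likelihood) of the probabilities the model
    assigned to the tokens that actually occurred."""
    probs = np.asarray(true_token_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(probs))))


def mean_squared_error(y_true, y_pred):
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(diff ** 2))


# Toy labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
metrics = classification_metrics(
    y_true=[1, 0, 1, 1, 0, 0, 1, 0],
    y_pred=[1, 0, 0, 1, 0, 1, 1, 0],
)
print(metrics)

# A model that spreads probability uniformly over 4 tokens has perplexity 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))

print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

The perplexity example shows the usual reading of the metric: a perplexity of `k` means the model is, on average, as uncertain as if it were choosing uniformly among `k` tokens.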



In [None]:
import os

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"  # Suppresses INFO and WARNING messages
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import tensorflow as tf
from sklearn.metrics import classification_report, confusion_matrix
from tensorflow.keras.datasets import imdb
from tensorflow.keras.layers import LSTM, Dense, Dropout, Embedding
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
# Set random seed for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Load and preprocess the IMDb dataset
max_features = 10000  # Number of words to consider as features
maxlen = 500  # Pad or truncate every review to this many tokens (pad_sequences keeps the last maxlen tokens by default)
batch_size = 64

# Load the data
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)

# Pad sequences to ensure consistent input length
X_train = pad_sequences(X_train, maxlen=maxlen)
X_test = pad_sequences(X_test, maxlen=maxlen)
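
With its default arguments (`padding="pre"`, `truncating="pre"`), `pad_sequences` pads short sequences on the left and drops tokens from the start of long ones. The pure-Python sketch below mimics that default behavior for a single sequence so the effect is easy to see; `pad_pre` is a hypothetical helper written for this illustration, not part of Keras, and it assumes `maxlen >= 1`.

```python
def pad_pre(sequence, maxlen, value=0):
    """Keep the last `maxlen` tokens and left-pad with `value`.

    Illustrative stand-in for the default pre-padding / pre-truncation
    behavior; assumes maxlen >= 1.
    """
    truncated = list(sequence[-maxlen:])
    return [value] * (maxlen - len(truncated)) + truncated


short_review = [5, 6, 7]
long_review = [1, 2, 3, 4, 5, 6, 7]

print(pad_pre(short_review, maxlen=5))  # [0, 0, 5, 6, 7]
print(pad_pre(long_review, maxlen=5))   # [3, 4, 5, 6, 7]
```

Left-padding places the real tokens at the end of the sequence, so the final hidden state that feeds the classifier is computed right after the actual content rather than after a run of padding.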

In [None]:
# Define the LSTM model
hidden_size = 128
num_layers = 2
dropout_rate = 0.3
learning_rate = 0.001

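model = Sequential()
# Declare the input shape explicitly; the Embedding `input_length` argument is deprecated in Keras 3
model.add(tf.keras.Input(shape=(maxlen,)))
# The embedding dimension is set equal to hidden_size here for simplicity
model.add(Embedding(max_features, hidden_size))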

# Add LSTM layers
for _ in range(num_layers - 1):
    model.add(LSTM(hidden_size, return_sequences=True))
    model.add(Dropout(dropout_rate))

# Last LSTM layer without return_sequences
model.add(LSTM(hidden_size))
model.add(Dropout(dropout_rate))

# Output layer
model.add(Dense(1, activation="sigmoid"))

# Compile the model
optimizer = Adam(learning_rate=learning_rate)
model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])

# Train the model
history = model.fit(
    X_train,
    y_train,
    epochs=5,
    batch_size=batch_size,
    validation_split=0.2,
)

In [None]:
# Evaluate the model on the test set
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {test_acc:.2f}")

In [None]:
# Predict positive-class probabilities, then threshold at 0.5 to get class labels
predictions = model.predict(X_test)
predictions = (predictions > 0.5).astype(int).flatten()

# Print the classification report
print("Classification Report:")
print(classification_report(y_test, predictions))

In [None]:
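# Plot the confusion matrix (IMDb labels: 0 = negative, 1 = positive)
cm = confusion_matrix(y_test, predictions)
class_names = ["Negative", "Positive"]

sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=class_names, yticklabels=class_names)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix")
plt.show()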