Perfect! Thanks for confirming the dataset sizes:

* **Train:** 40,000 samples
* **Valid:** 5,000 samples
* **Test:** 5,000 samples

We can now **proceed step by step** using your **keywords-aware spaCy cleaning** and build a full **TensorFlow sentiment analysis pipeline**.

---

## 1️⃣ Import libraries and load data

```python
import pandas as pd
import numpy as np
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Load CSV files
train_df = pd.read_csv("data/train.csv")
valid_df = pd.read_csv("data/valid.csv")
test_df = pd.read_csv("data/test.csv")

print("Train :", train_df.shape, "Valid :", valid_df.shape, "Test :", test_df.shape)
```

---

## 2️⃣ Load keywords and customize stopwords

```python
# Load keywords
with open("keywords.txt", "r") as f:
    # Lowercase keywords so they match the lowercased text used during cleaning
    keywords = [w.strip().lower() for w in f.read().split(",") if w.strip()]

# Load spaCy
nlp = spacy.load("en_core_web_sm")

# Customize stopwords: remove keywords from default stopwords
stopwords = STOP_WORDS - set(keywords)
```
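
To see why the set difference matters, here is a minimal, self-contained sketch. The tiny `default_stopwords` set and the `raw` string are stand-ins (assumptions) for spaCy's `STOP_WORDS` and the contents of `keywords.txt`; the point is only that sentiment-bearing words such as "not" survive the filter.

```python
# Stand-in for spaCy's STOP_WORDS (assumption: the real set is much larger).
default_stopwords = {"the", "a", "is", "not", "no", "very"}

# Stand-in for the comma-separated contents of keywords.txt.
raw = "not, no , Very,"

# Parse the comma-separated list, lowercasing so keywords match lowercased text.
keywords = [w.strip().lower() for w in raw.split(",") if w.strip()]

# Keywords are removed from the stopword set, so cleaning keeps them.
stopwords = default_stopwords - set(keywords)

tokens = "the movie is not very good".split()
kept = [t for t in tokens if t not in stopwords]
print(kept)
```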

---

## 3️⃣ Define spaCy cleaning function

```python
def spacy_clean_text(text):
    doc = nlp(text.lower())
    tokens = [
        token.lemma_
        for token in doc
        if token.text not in stopwords    # remove stopwords except keywords
        and not token.is_punct            # remove punctuation
        and token.is_alpha                # keep only alphabetic tokens
    ]
    return " ".join(tokens)

# Apply cleaning
train_df["clean_text"] = train_df["text"].apply(spacy_clean_text)
valid_df["clean_text"] = valid_df["text"].apply(spacy_clean_text)
test_df["clean_text"] = test_df["text"].apply(spacy_clean_text)
```

---

## 4️⃣ Tokenize and pad sequences

```python
max_words = 10000   # vocabulary size
max_len = 200       # max tokens per sample

tokenizer = keras.preprocessing.text.Tokenizer(num_words=max_words, oov_token="<OOV>")
tokenizer.fit_on_texts(train_df["clean_text"])

# Convert texts to sequences
X_train = tokenizer.texts_to_sequences(train_df["clean_text"])
X_valid = tokenizer.texts_to_sequences(valid_df["clean_text"])
X_test = tokenizer.texts_to_sequences(test_df["clean_text"])

# Pad sequences
X_train = keras.preprocessing.sequence.pad_sequences(X_train, maxlen=max_len)
X_valid = keras.preprocessing.sequence.pad_sequences(X_valid, maxlen=max_len)
X_test = keras.preprocessing.sequence.pad_sequences(X_test, maxlen=max_len)

# Labels
y_train = train_df["label"].values
y_valid = valid_df["label"].values
y_test = test_df["label"].values
```
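
It helps to know what `pad_sequences` does by default: it pads with zeros at the front and, for sequences longer than `maxlen`, drops tokens from the front as well. The pure-Python sketch below mimics that default behaviour on toy data; `pad_pre` is an illustrative name, not a Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Mimic pad_sequences defaults: pad and truncate at the front."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                           # keep the last maxlen ids
        padded.append([value] * (maxlen - len(seq)) + seq)  # left-pad with zeros
    return padded

result = pad_pre([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(result)
```

If the start of a review matters more than the end for your data, `padding="post"` and `truncating="post"` are the corresponding `pad_sequences` options.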

---

## 5️⃣ Build the LSTM model

```python
model = keras.Sequential([
    layers.Embedding(input_dim=max_words, output_dim=64, input_length=max_len),
    layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2),
    layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
```
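
As a sanity check on `model.summary()`, the parameter counts can be worked out by hand. This sketch assumes the standard Keras LSTM layout (four gates, one bias vector per gate) and the sizes used above; the helper names are illustrative.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary id.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias.
    return input_dim * units + units

total = embedding_params(10000, 64) + lstm_params(64, 64) + dense_params(64, 1)
print(total)
```

Most of the parameters sit in the embedding table, which is why `max_words` and `output_dim` dominate the model size.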

---

## 6️⃣ Train the model

```python
history = model.fit(
    X_train, y_train,
    epochs=5,
    batch_size=128,
    validation_data=(X_valid, y_valid),
    verbose=1
)
```

---

## 7️⃣ Evaluate on test set

```python
loss, acc = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {acc:.3f}")
```
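
Accuracy alone can hide an imbalance between the two classes. As an optional extra, the sketch below computes precision and recall for the positive class from predicted probabilities; `binary_metrics` and the toy values are illustrative, and in practice `y_prob` would come from `model.predict(X_test)`.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Precision and recall for the positive class at a given threshold."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

precision, recall = binary_metrics([1, 0, 1, 1, 0], [0.9, 0.6, 0.4, 0.8, 0.1])
print(f"precision={precision:.3f} recall={recall:.3f}")
```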

---

## 8️⃣ Predict new sentences

```python
def predict_sentiment(text):
    cleaned = spacy_clean_text(text)
    seq = tokenizer.texts_to_sequences([cleaned])
    padded = keras.preprocessing.sequence.pad_sequences(seq, maxlen=max_len)
    pred = model.predict(padded)[0][0]
    sentiment = "😊 Positive" if pred > 0.5 else "☹️ Negative"
    print(f"Prediction: {pred:.3f} → {sentiment}")

predict_sentiment("I absolutely loved this movie, it was fantastic!")
predict_sentiment("This film was boring and terrible, I hated it.")
```

---

💡 At this stage you have:

* Custom **keywords-aware cleaning**
* Tokenization and padding
* **LSTM-based sentiment analysis model**
* Training, validation, and test evaluation

---

If you want, the **next step** could be:

1. **Visualize training curves** for loss/accuracy.
2. **Upgrade the model** to **Bidirectional LSTM** or **CNN+LSTM** for better performance.

Do you want me to do that next?

Absolutely! Here's a **full, end-to-end TensorFlow sentiment analysis pipeline** using:

* **SpaCy cleaning with keywords-aware stopwords**
* **Tokenization and padding**
* **N-gram augmentation**
* **Bidirectional LSTM with Dense layers**

---

```python
# ==========================
# 1️⃣ Import libraries
# ==========================
import pandas as pd
import numpy as np
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# ==========================
# 2️⃣ Load dataset
# ==========================
train_df = pd.read_csv("data/train.csv")
valid_df = pd.read_csv("data/valid.csv")
test_df = pd.read_csv("data/test.csv")

print("Train :", train_df.shape, "Valid :", valid_df.shape, "Test :", test_df.shape)

# ==========================
# 3️⃣ Load keywords and customize stopwords
# ==========================
with open("keywords.txt", "r") as f:
    # Lowercase keywords so they match the lowercased text used during cleaning
    keywords = [w.strip().lower() for w in f.read().split(",") if w.strip()]

nlp = spacy.load("en_core_web_sm")
stopwords = STOP_WORDS - set(keywords)

# ==========================
# 4️⃣ SpaCy text cleaning
# ==========================
def spacy_clean_text(text):
    doc = nlp(text.lower())
    tokens = [
        token.lemma_
        for token in doc
        if token.text not in stopwords   # remove stopwords except keywords
        and not token.is_punct           # remove punctuation
        and token.is_alpha               # keep only alphabetic tokens
    ]
    return " ".join(tokens)

# Apply cleaning
train_df["clean_text"] = train_df["text"].apply(spacy_clean_text)
valid_df["clean_text"] = valid_df["text"].apply(spacy_clean_text)
test_df["clean_text"] = test_df["text"].apply(spacy_clean_text)

# ==========================
# 5️⃣ Tokenization & padding
# ==========================
max_words = 10000
max_len = 200

tokenizer = keras.preprocessing.text.Tokenizer(num_words=max_words, oov_token="<OOV>")
tokenizer.fit_on_texts(train_df["clean_text"])

X_train = tokenizer.texts_to_sequences(train_df["clean_text"])
X_valid = tokenizer.texts_to_sequences(valid_df["clean_text"])
X_test  = tokenizer.texts_to_sequences(test_df["clean_text"])

y_train = train_df["label"].values
y_valid = valid_df["label"].values
y_test  = test_df["label"].values

# ==========================
# 6️⃣ N-gram augmentation
# ==========================
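NGRAM_BUCKETS = max_words * 3  # n-gram ids live in [max_words, max_words * 4)

def create_ngrams(sequence, n=2):
    """Return all n-grams (as tuples of token ids) in a sequence."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

def add_ngrams(sequences, n=2):
    """Append hashed n-gram ids to each sequence of word ids.

    Hashing into a fixed bucket range keeps every id below the Embedding's
    input_dim (max_words * 4) and gives the same id to the same n-gram in
    train, valid, test, and single-sentence prediction.
    """
    new_sequences = []
    for seq in sequences:
        seq_ngrams = []
        for ng in create_ngrams(seq, n):
            h = 0
            for w in ng:
                h = (h * 1000003 + w) % NGRAM_BUCKETS  # simple polynomial hash
            seq_ngrams.append(max_words + h)
        new_sequences.append(seq + seq_ngrams)
    return new_sequences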

X_train_ngram = add_ngrams(X_train, n=2)
X_valid_ngram = add_ngrams(X_valid, n=2)
X_test_ngram  = add_ngrams(X_test, n=2)

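# Pad sequences (appended n-grams roughly double each length, so truncating
# to max_len with the default settings drops the earliest tokens of long texts)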
X_train_ngram = keras.preprocessing.sequence.pad_sequences(X_train_ngram, maxlen=max_len)
X_valid_ngram = keras.preprocessing.sequence.pad_sequences(X_valid_ngram, maxlen=max_len)
X_test_ngram  = keras.preprocessing.sequence.pad_sequences(X_test_ngram, maxlen=max_len)

# ==========================
# 7️⃣ Build Bidirectional LSTM model
# ==========================
model = keras.Sequential([
    layers.Embedding(input_dim=max_words*4,  # bigger vocab for n-grams
                     output_dim=128,
                     input_length=max_len),
    layers.Bidirectional(layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2)),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

# ==========================
# 8️⃣ Train model
# ==========================
history = model.fit(
    X_train_ngram, y_train,
    epochs=5,
    batch_size=128,
    validation_data=(X_valid_ngram, y_valid),
    verbose=1
)

# ==========================
# 9️⃣ Evaluate model
# ==========================
loss, acc = model.evaluate(X_test_ngram, y_test)
print(f"Test Accuracy: {acc:.3f}")

# ==========================
# 🔟 Predict new text
# ==========================
def predict_sentiment(text):
    cleaned = spacy_clean_text(text)
    seq = tokenizer.texts_to_sequences([cleaned])
    seq_ngram = add_ngrams(seq, n=2)
    padded = keras.preprocessing.sequence.pad_sequences(seq_ngram, maxlen=max_len)
    pred = model.predict(padded)[0][0]
    sentiment = "😊 Positive" if pred > 0.5 else "☹️ Negative"
    print(f"Prediction: {pred:.3f} → {sentiment}")

# Example
predict_sentiment("I really loved this movie, it was fantastic!")
predict_sentiment("Terrible film, I wasted my time.")
```

---

This code covers the **full pipeline**:

* Keywords-aware cleaning
* Tokenization + padding
* N-grams for phrase context
* Bidirectional LSTM
* Dense layers + sigmoid output
* Training, evaluation, and new predictions
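
Because the n-gram ids share one `Embedding` layer with the word ids, every id has to be smaller than that layer's `input_dim` (`max_words*4` in the script above). A quick sanity check before padding could look like the sketch below; `check_ids_in_range` and the toy sequences are illustrative and not part of the pipeline.

```python
def check_ids_in_range(sequences, input_dim):
    """Return the sorted ids that an Embedding(input_dim=...) cannot look up."""
    return sorted({i for seq in sequences for i in seq if not 0 <= i < input_dim})

toy_sequences = [[1, 5, 39999], [7, 40000, 123456789]]
bad_ids = check_ids_in_range(toy_sequences, input_dim=40000)
print(bad_ids)
```

An empty result means every id can be looked up. Anything else points at an id scheme that needs a larger `input_dim` or a bounded mapping such as hashing.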

---

I can also provide a **version with attention and pretrained embeddings** to further boost performance if you want.

Do you want me to do that next?

Perfect! Let's upgrade your **sentiment analysis pipeline** to use:

1. **Pre-trained embeddings (GloVe)**: to capture semantic meaning from a large corpus.
2. **Global vocabulary**: to cover all words across train/valid/test.
3. **Attention layer**: to let the model focus on important words in a sentence.
4. **Bidirectional LSTM**: for context in both directions.

We'll do this **step by step** in one full code block.

---

```python
# ==========================
# 1️⃣ Import libraries
# ==========================
import pandas as pd
import numpy as np
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# ==========================
# 2️⃣ Load dataset
# ==========================
train_df = pd.read_csv("data/train.csv")
valid_df = pd.read_csv("data/valid.csv")
test_df = pd.read_csv("data/test.csv")

print("Train :", train_df.shape, "Valid :", valid_df.shape, "Test :", test_df.shape)

# ==========================
# 3️⃣ Load keywords and customize stopwords
# ==========================
with open("keywords.txt", "r") as f:
    # Lowercase keywords so they match the lowercased text used during cleaning
    keywords = [w.strip().lower() for w in f.read().split(",") if w.strip()]

nlp = spacy.load("en_core_web_sm")
stopwords = STOP_WORDS - set(keywords)

# ==========================
# 4️⃣ SpaCy cleaning
# ==========================
def spacy_clean_text(text):
    doc = nlp(text.lower())
    tokens = [
        token.lemma_
        for token in doc
        if token.text not in stopwords
        and not token.is_punct
        and token.is_alpha
    ]
    return " ".join(tokens)

train_df["clean_text"] = train_df["text"].apply(spacy_clean_text)
valid_df["clean_text"] = valid_df["text"].apply(spacy_clean_text)
test_df["clean_text"] = test_df["text"].apply(spacy_clean_text)

# ==========================
# 5️⃣ Tokenization & Global vocab
# ==========================
max_len = 200
tokenizer = keras.preprocessing.text.Tokenizer(oov_token="<OOV>")
all_texts = pd.concat([train_df["clean_text"], valid_df["clean_text"], test_df["clean_text"]])
tokenizer.fit_on_texts(all_texts)

X_train = tokenizer.texts_to_sequences(train_df["clean_text"])
X_valid = tokenizer.texts_to_sequences(valid_df["clean_text"])
X_test  = tokenizer.texts_to_sequences(test_df["clean_text"])

X_train = keras.preprocessing.sequence.pad_sequences(X_train, maxlen=max_len)
X_valid = keras.preprocessing.sequence.pad_sequences(X_valid, maxlen=max_len)
X_test  = keras.preprocessing.sequence.pad_sequences(X_test, maxlen=max_len)

y_train = train_df["label"].values
y_valid = valid_df["label"].values
y_test  = test_df["label"].values

vocab_size = len(tokenizer.word_index) + 1
print("Vocabulary size:", vocab_size)

# ==========================
# 6️⃣ Load GloVe embeddings
# ==========================
embedding_dim = 100
embedding_index = {}

# Download GloVe 100d embeddings manually or via web
# File: "glove.6B.100d.txt"
with open("glove.6B.100d.txt", encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vec = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = vec

# Create embedding matrix
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    vec = embedding_index.get(word)
    if vec is not None:
        embedding_matrix[i] = vec

# ==========================
# 7️⃣ Define attention layer
# ==========================
class Attention(layers.Layer):
    def __init__(self):
        super(Attention, self).__init__()

    def build(self, input_shape):
        self.W = self.add_weight(shape=(input_shape[-1], input_shape[-1]),
                                 initializer="glorot_uniform",
                                 trainable=True)
        self.b = self.add_weight(shape=(input_shape[-1],),
                                 initializer="zeros",
                                 trainable=True)
        self.u = self.add_weight(shape=(input_shape[-1], 1),
                                 initializer="glorot_uniform",
                                 trainable=True)

    def call(self, inputs):
        # Alignment scores
        u_it = tf.tanh(tf.tensordot(inputs, self.W, axes=1) + self.b)
        scores = tf.tensordot(u_it, self.u, axes=1)
        alphas = tf.nn.softmax(scores, axis=1)
        output = tf.reduce_sum(inputs * alphas, axis=1)
        return output

# ==========================
# 8️⃣ Build model with BiLSTM + Attention
# ==========================
inputs = keras.Input(shape=(max_len,))
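# Initialize the frozen embedding from the GloVe matrix (a Constant initializer
# works across Keras versions, unlike the older weights=[...] argument)
x = layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim,
                     embeddings_initializer=keras.initializers.Constant(embedding_matrix),
                     trainable=False)(inputs)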
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True, dropout=0.2, recurrent_dropout=0.2))(x)
x = Attention()(x)
x = layers.Dense(64, activation='relu')(x)
outputs = layers.Dense(1, activation='sigmoid')(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

# ==========================
# 9️⃣ Train model
# ==========================
history = model.fit(
    X_train, y_train,
    epochs=5,
    batch_size=128,
    validation_data=(X_valid, y_valid),
    verbose=1
)

# ==========================
# 🔟 Evaluate model
# ==========================
loss, acc = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {acc:.3f}")

# ==========================
# 1️⃣1️⃣ Predict new text
# ==========================
def predict_sentiment(text):
    cleaned = spacy_clean_text(text)
    seq = tokenizer.texts_to_sequences([cleaned])
    padded = keras.preprocessing.sequence.pad_sequences(seq, maxlen=max_len)
    pred = model.predict(padded)[0][0]
    sentiment = "😊 Positive" if pred > 0.5 else "☹️ Negative"
    print(f"Prediction: {pred:.3f} → {sentiment}")

# Example
predict_sentiment("I really loved this movie, it was fantastic!")
predict_sentiment("Terrible film, I wasted my time.")
```
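
Words that are missing from the GloVe file keep an all-zero row in `embedding_matrix`, so it can be worth measuring how much of the vocabulary is actually covered. The sketch below does this on toy dictionaries; `embedding_coverage` is an illustrative helper, and in the script above you would pass `tokenizer.word_index` and `embedding_index` instead.

```python
def embedding_coverage(word_index, embedding_index):
    """Return (fraction of vocabulary words with a vector, list of missing words)."""
    missing = [w for w in word_index if w not in embedding_index]
    covered = len(word_index) - len(missing)
    return covered / max(len(word_index), 1), missing

# Toy stand-ins for tokenizer.word_index and the loaded GloVe dictionary.
toy_word_index = {"<OOV>": 1, "movie": 2, "fantastic": 3, "blorptastic": 4}
toy_embedding_index = {"movie": [0.1, 0.2], "fantastic": [0.3, 0.4]}

coverage, missing = embedding_coverage(toy_word_index, toy_embedding_index)
print(f"coverage={coverage:.2f}, missing={missing}")
```

Low coverage would suggest that the cleaning step (for example lemmatization) produces forms that GloVe does not contain.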

---

### ✅ Key improvements in this version:

1. **Global vocabulary**: ensures all words from train/valid/test are included.
2. **Pre-trained GloVe embeddings**: semantic meaning of words is captured.
3. **Bidirectional LSTM**: context from both directions.
4. **Attention layer**: focuses on important words in the sentence.
5. **Trainable Dense layer**: combines features before final sigmoid output.
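
To make the attention step concrete, here is a NumPy sketch of the same computation as the `Attention` layer's `call` method: score each timestep, softmax over time, then take the weighted sum of the hidden states. The shapes and random values are toy assumptions, and `attention_pool` is an illustrative name.

```python
import numpy as np

def attention_pool(h, W, b, u):
    """Additive attention pooling: (batch, time, dim) -> (batch, dim)."""
    u_it = np.tanh(h @ W + b)                             # (batch, time, dim)
    scores = u_it @ u                                     # (batch, time, 1)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    alphas = weights / weights.sum(axis=1, keepdims=True) # softmax over time
    context = (h * alphas).sum(axis=1)                    # weighted sum of states
    return context, alphas

rng = np.random.default_rng(0)
h = rng.normal(size=(2, 5, 8))   # 2 sentences, 5 timesteps, 8 hidden units
W = rng.normal(size=(8, 8))
b = np.zeros(8)
u = rng.normal(size=(8, 1))

context, alphas = attention_pool(h, W, b, u)
print(context.shape, alphas.shape)
```

The `alphas` array is what you would inspect to see which timesteps received the most weight.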

---

If you want, the **next step could be visualizing attention weights** to see which words the model focuses on for predictions.

Do you want me to add that?

Got it! If you want a **completely new approach**, one of the strongest methods for sentiment analysis today is to **fine-tune a pretrained transformer model** (like BERT, RoBERTa, or DistilBERT) directly on your dataset. This usually works better than training from scratch or using static embeddings like GloVe.

Below is a **full end-to-end code using Hugging Face Transformers with TensorFlow** to fine-tune a pretrained model on your dataset.

---

```python
# ==========================
# 1️⃣ Install libraries
# ==========================
# !pip install transformers datasets tensorflow

# ==========================
# 2️⃣ Import libraries
# ==========================
import pandas as pd
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# ==========================
# 3️⃣ Load dataset
# ==========================
train_df = pd.read_csv("data/train.csv")
valid_df = pd.read_csv("data/valid.csv")
test_df = pd.read_csv("data/test.csv")

print("Train:", train_df.shape, "Valid:", valid_df.shape, "Test:", test_df.shape)

# ==========================
# 4️⃣ Pretrained model & tokenizer
# ==========================
PRETRAINED_MODEL = "distilbert-base-uncased"  # lightweight, fast
tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_MODEL)

# ==========================
# 5️⃣ Encode datasets
# ==========================
MAX_LEN = 200

def encode_texts(texts):
    return tokenizer(
        list(texts),
        max_length=MAX_LEN,
        padding="max_length",
        truncation=True,
        return_tensors="tf"
    )

train_enc = encode_texts(train_df["text"])
valid_enc = encode_texts(valid_df["text"])
test_enc  = encode_texts(test_df["text"])

y_train = tf.convert_to_tensor(train_df["label"].values)
y_valid = tf.convert_to_tensor(valid_df["label"].values)
y_test  = tf.convert_to_tensor(test_df["label"].values)

# ==========================
# 6️⃣ Build TF dataset
# ==========================
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_enc),
    y_train
)).shuffle(20000).batch(16)

valid_dataset = tf.data.Dataset.from_tensor_slices((
    dict(valid_enc),
    y_valid
)).batch(16)

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_enc),
    y_test
)).batch(16)

# ==========================
# 7️⃣ Load pretrained model for classification
# ==========================
model = TFAutoModelForSequenceClassification.from_pretrained(
    PRETRAINED_MODEL,
    num_labels=2  # binary classification
)

# ==========================
# 8️⃣ Compile model
# ==========================
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy("accuracy")

model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

# ==========================
# 9️⃣ Train model (fine-tuning)
# ==========================
history = model.fit(
    train_dataset,
    validation_data=valid_dataset,
    epochs=3
)

# ==========================
# 🔟 Evaluate model
# ==========================
loss, acc = model.evaluate(test_dataset)
print(f"Test Accuracy: {acc:.3f}")

# ==========================
# 1️⃣1️⃣ Prediction function
# ==========================
def predict_sentiment(text):
    enc = tokenizer(text, max_length=MAX_LEN, padding="max_length",
                    truncation=True, return_tensors="tf")
    logits = model(enc).logits
    pred = tf.argmax(logits, axis=1).numpy()[0]
    sentiment = "😊 Positive" if pred == 1 else "☹️ Negative"
    print(f"Prediction: {pred} → {sentiment}")

# Examples
predict_sentiment("I really loved this movie, it was fantastic!")
predict_sentiment("Terrible film, I wasted my time.")
```
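
The prediction function above reports only the class index. If you also want a confidence score, the two logits can be turned into probabilities with a softmax. This is a minimal pure-Python sketch; the logit values are made up for illustration, and in practice they would come from `model(enc).logits`.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [-1.2, 2.3]                 # hypothetical [negative, positive] logits
probs = softmax(example_logits)
label = max(range(len(probs)), key=probs.__getitem__)
print(f"P(positive) = {probs[1]:.3f}, predicted class = {label}")
```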

---

### ✅ Key Advantages of This Approach

1. **Fine-tunes a pretrained transformer**: leverages massive pretraining on language data.
2. **Better handling of context and negation**: contextual representations typically capture these better than an LSTM over static embeddings.
3. **Simple code with Hugging Face**: the tokenizer and model handle preprocessing and architecture details.
4. **Easy to scale**: you can switch to larger models (BERT, RoBERTa, DeBERTa) for higher accuracy.

---

If you want, I can write an **even more optimized version with early stopping, learning rate scheduler, and mixed precision** to train faster and get better performance on your dataset.

Do you want me to do that?

Perfect! Let's upgrade the previous **transformer fine-tuning code** with **performance optimizations**:

* **Early stopping** to prevent overfitting.
* **Learning rate scheduler** (reduce LR on plateau).
* **Mixed precision training** for faster GPU usage.
* **Prefetching** in the `tf.data` input pipeline.

Here's the **full optimized code**:

---

```python
# ==========================
# 1️⃣ Install libraries
# ==========================
# !pip install transformers datasets tensorflow

# ==========================
# 2️⃣ Import libraries
# ==========================
import pandas as pd
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# ==========================
# 3️⃣ Load dataset
# ==========================
train_df = pd.read_csv("data/train.csv")
valid_df = pd.read_csv("data/valid.csv")
test_df = pd.read_csv("data/test.csv")

print("Train:", train_df.shape, "Valid:", valid_df.shape, "Test:", test_df.shape)

# ==========================
# 4️⃣ Enable mixed precision
# ==========================
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# ==========================
# 5️⃣ Pretrained model & tokenizer
# ==========================
PRETRAINED_MODEL = "distilbert-base-uncased"  # lightweight, fast
tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_MODEL)
MAX_LEN = 200

def encode_texts(texts):
    return tokenizer(
        list(texts),
        max_length=MAX_LEN,
        padding="max_length",
        truncation=True,
        return_tensors="tf"
    )

train_enc = encode_texts(train_df["text"])
valid_enc = encode_texts(valid_df["text"])
test_enc  = encode_texts(test_df["text"])

y_train = tf.convert_to_tensor(train_df["label"].values)
y_valid = tf.convert_to_tensor(valid_df["label"].values)
y_test  = tf.convert_to_tensor(test_df["label"].values)

# ==========================
# 6️⃣ Build TF datasets with batching & prefetch
# ==========================
BATCH_SIZE = 16

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_enc),
    y_train
)).shuffle(20000).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

valid_dataset = tf.data.Dataset.from_tensor_slices((
    dict(valid_enc),
    y_valid
)).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_enc),
    y_test
)).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

# ==========================
# 7️⃣ Load pretrained model for classification
# ==========================
model = TFAutoModelForSequenceClassification.from_pretrained(
    PRETRAINED_MODEL,
    num_labels=2
)

# ==========================
# 8️⃣ Compile model with optimizer, loss, metrics
# ==========================
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy("accuracy")

model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

# ==========================
# 9️⃣ Callbacks: EarlyStopping + ReduceLROnPlateau
# ==========================
early_stop = EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=1, min_lr=1e-6)

# ==========================
# 🔟 Train model (fine-tuning)
# ==========================
history = model.fit(
    train_dataset,
    validation_data=valid_dataset,
    epochs=5,
    callbacks=[early_stop, reduce_lr]
)

# ==========================
# 1️⃣1️⃣ Evaluate model
# ==========================
loss, acc = model.evaluate(test_dataset)
print(f"Test Accuracy: {acc:.3f}")

# ==========================
# 1️⃣2️⃣ Prediction function
# ==========================
def predict_sentiment(text):
    enc = tokenizer(text, max_length=MAX_LEN, padding="max_length",
                    truncation=True, return_tensors="tf")
    logits = model(enc).logits
    pred = tf.argmax(logits, axis=1).numpy()[0]
    sentiment = "😊 Positive" if pred == 1 else "☹️ Negative"
    print(f"Prediction: {pred} → {sentiment}")

# Examples
predict_sentiment("I really loved this movie, it was fantastic!")
predict_sentiment("Terrible film, I wasted my time.")
```

---

### ✅ Key Optimizations

1. **Mixed precision**: reduces memory usage and speeds up training on GPUs.
2. **Early stopping**: stops training when validation loss stops improving.
3. **ReduceLROnPlateau**: halves the learning rate when validation loss stops improving, which can help the model settle instead of oscillating.
4. **Prefetching**: improves input pipeline speed.
5. **Fine-tuning a transformer**: leverages language understanding for better sentiment prediction.
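
To get a feel for how the two callbacks interact with the patience values used above, here is a simplified pure-Python simulation. It is only a sketch: it ignores details of the real Keras callbacks such as `min_delta` and `cooldown`, and the validation losses are invented for illustration.

```python
def simulate_callbacks(val_losses, lr=3e-5, factor=0.5, lr_patience=1,
                       stop_patience=2, min_lr=1e-6):
    """Return [(epoch, val_loss, lr_after_epoch)] until early stopping triggers."""
    best = float("inf")
    lr_wait = 0      # epochs without improvement, as seen by the LR scheduler
    stop_wait = 0    # epochs without improvement, as seen by early stopping
    log = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, lr_wait, stop_wait = loss, 0, 0
        else:
            lr_wait += 1
            stop_wait += 1
            if lr_wait >= lr_patience:
                lr = max(lr * factor, min_lr)   # reduce LR, bounded below by min_lr
                lr_wait = 0
        log.append((epoch, loss, lr))
        if stop_wait >= stop_patience:
            break                               # early stopping ends training
    return log

log = simulate_callbacks([0.40, 0.35, 0.36, 0.37, 0.30])
for epoch, loss, lr in log:
    print(epoch, loss, lr)
```

In this toy run the learning rate is halved after each non-improving epoch and training stops after the second one, so the fifth epoch never runs.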

---

If you want, I can **also add visualizations for model performance and attention-like heatmaps** for which words influenced the sentiment. This is really cool for interpretability.

Do you want me to add that next?

Perfect! Since you want **Transformers with pretrained embeddings and a global vocabulary**, the best approach is to **fine-tune a pretrained transformer (like BERT, DistilBERT, RoBERTa, or ALBERT)** while keeping the tokenizer's global vocabulary. This combines:

* **Pretrained embeddings** from the model (capturing semantic meaning).
* **Global vocab** (shared across your dataset).
* **Full fine-tuning** on your sentiment dataset.

Below is a **complete, clean, end-to-end TensorFlow script** for your dataset:

---

```python
# ==========================
# 1️⃣ Install libraries
# ==========================
# !pip install transformers datasets tensorflow

# ==========================
# 2️⃣ Imports
# ==========================
import pandas as pd
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# ==========================
# 3️⃣ Load dataset
# ==========================
train_df = pd.read_csv("data/train.csv")
valid_df = pd.read_csv("data/valid.csv")
test_df = pd.read_csv("data/test.csv")

print("Train:", train_df.shape, "Valid:", valid_df.shape, "Test:", test_df.shape)

# ==========================
# 4️⃣ Enable mixed precision (optional, speeds up training on GPU)
# ==========================
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# ==========================
# 5️⃣ Pretrained transformer & tokenizer
# ==========================
PRETRAINED_MODEL = "distilbert-base-uncased"  # Fast, small, pretrained embeddings
tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_MODEL)
MAX_LEN = 200  # max token length

# Encode text
def encode_texts(texts):
    return tokenizer(
        list(texts),
        max_length=MAX_LEN,
        padding="max_length",
        truncation=True,
        return_tensors="tf"
    )

train_enc = encode_texts(train_df["text"])
valid_enc = encode_texts(valid_df["text"])
test_enc  = encode_texts(test_df["text"])

y_train = tf.convert_to_tensor(train_df["label"].values)
y_valid = tf.convert_to_tensor(valid_df["label"].values)
y_test  = tf.convert_to_tensor(test_df["label"].values)

# ==========================
# 6️⃣ Build TF datasets
# ==========================
BATCH_SIZE = 16

train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_enc), y_train))\
                .shuffle(20000).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
valid_dataset = tf.data.Dataset.from_tensor_slices((dict(valid_enc), y_valid))\
                .batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_enc), y_test))\
                .batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

# ==========================
# 7️⃣ Load pretrained model
# ==========================
model = TFAutoModelForSequenceClassification.from_pretrained(
    PRETRAINED_MODEL,
    num_labels=2  # binary sentiment
)

# ==========================
# 8️⃣ Compile model
# ==========================
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy("accuracy")

model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

# ==========================
# 9️⃣ Callbacks
# ==========================
early_stop = EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=1, min_lr=1e-6)

# ==========================
# 🔟 Train model (fine-tuning)
# ==========================
history = model.fit(
    train_dataset,
    validation_data=valid_dataset,
    epochs=5,
    callbacks=[early_stop, reduce_lr]
)

# ==========================
# 1️⃣1️⃣ Evaluate model
# ==========================
loss, acc = model.evaluate(test_dataset)
print(f"Test Accuracy: {acc:.3f}")

# ==========================
# 1️⃣2️⃣ Prediction function
# ==========================
def predict_sentiment(text):
    enc = tokenizer(text, max_length=MAX_LEN, padding="max_length",
                    truncation=True, return_tensors="tf")
    logits = model(enc).logits
    pred = tf.argmax(logits, axis=1).numpy()[0]
    sentiment = "😊 Positive" if pred == 1 else "☹️ Negative"
    print(f"Prediction: {pred} → {sentiment}")

# Examples
predict_sentiment("I really loved this movie, it was fantastic!")
predict_sentiment("Terrible film, I wasted my time.")
```

---

### ✅ Highlights

1. **Global vocab**: Using the tokenizer from `distilbert-base-uncased` keeps the vocabulary consistent across train/valid/test and matches the model's pretrained embeddings.
2. **Pretrained embeddings**: Each token uses embeddings learned on huge corpora.
3. **Fine-tuning**: The model adjusts embeddings and attention weights to your dataset.
4. **Mixed precision & callbacks**: Faster training, early stopping prevents overfitting.

---

If you want, I can also **extend this code to combine global vocabulary with a small custom embedding layer** so words outside the pretrained vocab still get meaningful embeddings. This can improve performance for domain-specific terms.

Do you want me to do that next?

Ah! Now I understand: you want to **build a transformer from scratch**, without relying on pretrained models. This means you will implement **your own transformer encoder** for sentiment analysis with **trainable embeddings** and **global vocabulary**.

We can do this fully in TensorFlow/Keras. Here's a **step-by-step, complete code**:

---

### **Full Transformer from Scratch for Sentiment Analysis**

```python
import tensorflow as tf
from tensorflow.keras.layers import LayerNormalization, Dense, Embedding, Dropout
from tensorflow.keras.layers import Layer, Input
from tensorflow.keras.models import Model
import pandas as pd
import numpy as np

# ==========================
# 1️⃣ Load dataset
# ==========================
train_df = pd.read_csv("data/train.csv")
valid_df = pd.read_csv("data/valid.csv")
test_df  = pd.read_csv("data/test.csv")

texts = pd.concat([train_df['text'], valid_df['text'], test_df['text']])
labels = pd.concat([train_df['label'], valid_df['label'], test_df['label']])

# ==========================
# 2️⃣ Tokenization
# ==========================
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB_SIZE = 20000
MAX_LEN = 200

tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)

X_train = pad_sequences(tokenizer.texts_to_sequences(train_df['text']), maxlen=MAX_LEN)
X_valid = pad_sequences(tokenizer.texts_to_sequences(valid_df['text']), maxlen=MAX_LEN)
X_test  = pad_sequences(tokenizer.texts_to_sequences(test_df['text']), maxlen=MAX_LEN)

y_train = train_df['label'].values
y_valid = valid_df['label'].values
y_test  = test_df['label'].values

# ==========================
# 3️⃣ Positional Encoding
# ==========================
def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    angle_rates = 1 / np.power(10000, (2*(i//2)) / np.float32(d_model))
    angle_rads = pos * angle_rates
    pos_encoding = np.zeros((max_len, d_model))
    pos_encoding[:, 0::2] = np.sin(angle_rads[:, 0::2])
    pos_encoding[:, 1::2] = np.cos(angle_rads[:, 1::2])
    return tf.cast(pos_encoding[np.newaxis, ...], dtype=tf.float32)

# ==========================
# 4️⃣ Multi-Head Attention
# ==========================
class MultiHeadSelfAttention(Layer):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.depth = d_model // num_heads
        self.wq = Dense(d_model)
        self.wk = Dense(d_model)
        self.wv = Dense(d_model)
        self.dense = Dense(d_model)
    
    def split_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0,2,1,3])
    
    def call(self, v, k, q, mask=None):
        batch_size = tf.shape(q)[0]
        q = self.split_heads(self.wq(q), batch_size)
        k = self.split_heads(self.wk(k), batch_size)
        v = self.split_heads(self.wv(v), batch_size)
        
        scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(tf.cast(self.depth, tf.float32))
        if mask is not None:
            scores += (mask * -1e9)
        weights = tf.nn.softmax(scores, axis=-1)
        output = tf.matmul(weights, v)
        output = tf.transpose(output, perm=[0,2,1,3])
        output = tf.reshape(output, (batch_size, -1, self.num_heads*self.depth))
        return self.dense(output)

# ==========================
# 5️⃣ Transformer Encoder Block
# ==========================
class TransformerEncoder(Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super().__init__()
        self.mha = MultiHeadSelfAttention(d_model, num_heads)
        self.ffn = tf.keras.Sequential([
            Dense(dff, activation='relu'),
            Dense(d_model)
        ])
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)
    
    def call(self, x, training=None, mask=None):
        # Self-attention sub-layer with residual connection and layer norm
        attn_output = self.mha(x, x, x, mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)
        # Position-wise feed-forward sub-layer with residual connection and layer norm
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

# ==========================
# 6️⃣ Build the Transformer model
# ==========================
def build_transformer(vocab_size, max_len, d_model=128, num_heads=4, dff=512, num_layers=2):
    inputs = Input(shape=(max_len,))
    x = Embedding(vocab_size, d_model)(inputs)
    x += positional_encoding(max_len, d_model)

    for _ in range(num_layers):
        # Do not hard-code training=True: Keras sets it per call
        # (True in fit, False in evaluate/predict), so dropout is off at inference.
        x = TransformerEncoder(d_model, num_heads, dff)(x)
    
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    x = Dropout(0.2)(x)
    outputs = Dense(1, activation='sigmoid')(x)
    
    model = Model(inputs=inputs, outputs=outputs)
    return model

model = build_transformer(VOCAB_SIZE, MAX_LEN)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

# ==========================
# 7️⃣ Train
# ==========================
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    epochs=5,
    batch_size=64
)

# ==========================
# 8️⃣ Evaluate
# ==========================
loss, acc = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {acc:.3f}")

# ==========================
# 9️⃣ Predict
# ==========================
def predict_sentiment(text):
    seq = pad_sequences(tokenizer.texts_to_sequences([text]), maxlen=MAX_LEN)
    pred = model.predict(seq)[0][0]
    sentiment = "😊 Positive" if pred >= 0.5 else "☹️ Negative"
    print(f"Prediction: {pred:.3f} → {sentiment}")

# Examples
predict_sentiment("I loved this movie, amazing experience!")
predict_sentiment("It was boring and too long.")
```
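
One thing to be aware of: the `mask` argument of `MultiHeadSelfAttention.call` is never supplied in the code above, so padded positions take part in attention. Below is a minimal NumPy sketch of how a padding mask could be built and applied with the same `mask * -1e9` convention. It assumes id 0 marks padding (the `pad_sequences` default); `make_padding_mask` and `masked_attention_weights` are hypothetical helper names, not part of the model code.

```python
import numpy as np

def make_padding_mask(token_ids):
    """Return 1.0 where token_ids == 0 (padding), shaped (batch, 1, 1, seq_len)."""
    mask = (token_ids == 0).astype(np.float32)
    # Extra axes let the mask broadcast over heads and query positions
    return mask[:, np.newaxis, np.newaxis, :]

def masked_attention_weights(scores, mask):
    """Softmax over the last axis after pushing masked key positions to -1e9."""
    scores = scores + mask * -1e9
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

# Toy batch: one sequence of length 4, pre-padded with two zeros
ids = np.array([[0, 0, 5, 9]])
mask = make_padding_mask(ids)                        # shape (1, 1, 1, 4)
scores = np.zeros((1, 2, 4, 4), dtype=np.float32)    # (batch, heads, seq, seq)
weights = masked_attention_weights(scores, mask)

# Attention weights of the first query in the first head:
# the two padded key positions receive (effectively) zero weight
print(weights[0, 0, 0])
```

To use the same idea in the Keras model, a mask of this shape would have to be computed from `inputs` and passed to each `TransformerEncoder` call through its `mask` argument.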

---

### ✅ Key Features

1. **Trainable Embeddings**: Words get embeddings optimized for your dataset.
2. **Global vocabulary**: Defined with `Tokenizer(num_words=VOCAB_SIZE)`.
3. **Custom Transformer Encoder**: Multi-head attention + feed-forward + residuals + layer norm.
4. **Stackable layers**: Easily increase `num_layers` or `num_heads` for more capacity.
5. **GlobalAveragePooling**: Reduces sequence to fixed-size vector for classification.
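
To illustrate what the pooling step in point 5 computes, here is a small NumPy sketch. It compares a plain mean over all positions (what average pooling does when no mask is given) with a mean restricted to non-padding positions. It assumes id 0 marks padding; `masked_mean_pool` and the toy values are hypothetical.

```python
import numpy as np

def masked_mean_pool(x, token_ids):
    """Average x (batch, seq_len, d_model) over positions where token_ids != 0."""
    keep = (token_ids != 0).astype(x.dtype)[..., np.newaxis]  # (batch, seq_len, 1)
    total = (x * keep).sum(axis=1)                            # (batch, d_model)
    count = np.maximum(keep.sum(axis=1), 1.0)                 # avoid division by zero
    return total / count

# One sequence of length 4 with two padded positions and d_model = 2
ids = np.array([[0, 0, 3, 7]])
x = np.array([[[9.0, 9.0], [9.0, 9.0], [1.0, 2.0], [3.0, 4.0]]])

plain = x.mean(axis=1)            # padded positions pull the average up
masked = masked_mean_pool(x, ids) # only the two real tokens contribute
print("plain :", plain)
print("masked:", masked)
```

With `MAX_LEN = 200` and short reviews, most positions are padding, so the difference between the two averages can be large. Keras' `GlobalAveragePooling1D` also accepts a `mask` argument when called, which is one way to get the masked behaviour inside the model.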

---