We'll simulate prompt tuning by:

1. Freezing a base model.

2. Adding trainable prompt embeddings before the input sequence.

3. Training only the prompt using data loaded from a .csv.
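
To make the idea concrete before any Keras code, here is a minimal NumPy sketch of the tensor shapes involved. The sizes 20, 64 and 5 match max_len, embed_dim and prompt_len used later in this notebook; the batch size of 4 and the random arrays are assumptions that only stand in for real embedded text.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, embed_dim, prompt_len = 4, 20, 64, 5

# Stand-in for the output of a frozen embedding layer: (batch, seq_len, embed_dim).
embedded = rng.normal(size=(batch_size, seq_len, embed_dim))

# The only tensor prompt tuning is supposed to train: one vector per prompt position.
prompt = rng.normal(size=(prompt_len, embed_dim))

# Repeat the same prompt for every example, then prepend it along the sequence axis.
prompt_batch = np.broadcast_to(prompt, (batch_size, prompt_len, embed_dim))
model_input = np.concatenate([prompt_batch, embedded], axis=1)

print(model_input.shape)  # (4, 25, 64)
print(prompt.size)        # 320 trainable values
```

The frozen model therefore sees sequences of length 25 instead of 20, and only the 5 x 64 prompt values would be updated during training.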

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# --- Step 1: Load Dataset ---
df = pd.read_csv("sentiment.csv")  # Replace with your path
df.head()


Unnamed: 0,text,label
0,Awful plot,0
1,Highly recommend,1
2,This is fantastic,1
3,Great acting,1
4,Best ever,1


In [2]:
texts = df['text'].astype(str).tolist()
labels = df['label'].astype(int).tolist()

texts

['Awful plot',
 'Highly recommend',
 'This is fantastic',
 'Great acting',
 'Best ever',
 'Terrible experience',
 'Superb!',
 'So boring',
 'Amazing experience',
 'Awful plot',
 'Brilliant plot',
 'So boring',
 'Would not recommend',
 'Thumbs up',
 'I hate this',
 'Best ever',
 'Horrible acting',
 'Awful plot',
 'Amazing experience',
 'Disappointing',
 'Brilliant plot',
 'Best ever',
 'Waste of time',
 'Horrible acting',
 'Highly recommend',
 'Waste of time',
 'Disappointing',
 'Thumbs down',
 'Great acting',
 'Would not recommend',
 'Really bad',
 'Thumbs up',
 'Highly recommend',
 'Thumbs up',
 'Absolutely loved it',
 'So boring',
 'Highly recommend',
 'Awful plot',
 'Absolutely loved it',
 'Thumbs down',
 'So boring',
 'Would not recommend',
 'Absolutely loved it',
 'Disappointing',
 'Amazing experience',
 'I love this movie',
 'Thumbs up',
 'Highly recommend',
 'I love this movie',
 'I hate this',
 'So boring',
 'Superb!',
 'Superb!',
 'Waste of time',
 'Terrible experience',
 'Abs

In [None]:
🧠 What is vocab_size?
In natural language processing (NLP), vocab_size (short for vocabulary size) refers to:

🔤 The total number of unique tokens (usually words or subwords) that your model is allowed to recognize or work with.

🎯 In Practice (e.g., Keras Tokenizer):
When you define:

tokenizer = keras.preprocessing.text.Tokenizer(num_words=10000)
You're telling Keras:

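“Use only the most frequent words when converting text to sequences: indices below 10,000, which means the 9,999 most frequent words. Ignore the rest.”

This means:

fit_on_texts still records every word it sees in word_index; the num_words cap is applied later, when texts_to_sequences converts text to integers.

Words outside the cap are mapped to an “OOV” (out-of-vocabulary) token if oov_token is set; otherwise they are silently dropped.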
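
The capping behaviour can be sketched in plain Python. This is not the Keras implementation, only an illustration under simplifying assumptions: whitespace splitting, lowercasing, index 0 left free for padding and index 1 reserved for OOV. The helper names build_vocab and encode are made up for this example.

```python
from collections import Counter

def build_vocab(texts, oov_token="<OOV>"):
    """Rank every word by frequency; index 1 is the OOV token, 0 is left for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def encode(text, word_index, num_words, oov_token="<OOV>"):
    """Replace unknown words, and words ranked at or beyond num_words, with the OOV index."""
    oov = word_index[oov_token]
    ids = []
    for word in text.lower().split():
        idx = word_index.get(word)
        ids.append(idx if idx is not None and idx < num_words else oov)
    return ids

texts = ["so boring", "so boring", "so good"]
word_index = build_vocab(texts)
print(word_index)                                   # {'<OOV>': 1, 'so': 2, 'boring': 3, 'good': 4}
print(encode("so good", word_index, num_words=4))   # [2, 1]
```

Note that "good" stays in word_index but is still encoded as OOV, because its index is not below num_words.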

💡 Why Limit the Vocabulary?
Limiting vocab_size is important because:

✅ Memory-efficient (smaller embedding matrices).

🚀 Faster training and inference.

📉 Reduces overfitting (especially on rare, noisy words).

🔢 How It Affects the Embedding Layer
If vocab_size = 10000 and embed_dim = 64, then:

Embedding(input_dim=10000, output_dim=64)
The embedding layer will create a matrix of shape (10000, 64):

Each word index from 0 to 9999 will map to a 64-dimensional vector.

⚠️ Note
The actual number of words in your data might be more or less than vocab_size. It's just a cap on how many to use.
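
As a rough illustration of the memory argument, the number of weights in an embedding matrix is simply vocab_size times embed_dim. The helper name embedding_params and the float32 (4 bytes per weight) assumption are mine, not part of the notebook.

```python
def embedding_params(vocab_size, embed_dim):
    """Number of weights in an embedding matrix of shape (vocab_size, embed_dim)."""
    return vocab_size * embed_dim

for vocab_size in (1_000, 10_000, 100_000):
    n = embedding_params(vocab_size, 64)
    # Assumes 4 bytes per weight (float32).
    print(f"vocab_size={vocab_size:>7,}: {n:>9,} weights, ~{n * 4 / 1e6:.2f} MB")
```

With vocab_size = 10000 and embed_dim = 64 this gives 640,000 weights, about 2.56 MB.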



In [None]:
Imagine this sentence:

sentence = "I love samosas"
We want to classify the sentiment of this sentence (positive or negative) using a neural network.

🧾 Step 1: Tokenization (Word to Index)
Let’s say we define this vocabulary:


word_to_index = {
    "i": 1,
    "love": 2,
    "samosas": 3,
    "<OOV>": 0  # Out-of-vocabulary
}
Now the sentence "I love samosas" becomes:


[1, 2, 3]

🧱 Step 2: Embedding Layer
We create an embedding layer like:


Embedding(input_dim=4, output_dim=3)
input_dim = 4 → we have 4 unique tokens (including OOV).

output_dim = 3 → each word is represented as a 3D vector.

Suppose the embedding matrix looks like:

Word	Index	Embedding Vector
<OOV>	0	[ 0.01, -0.02, 0.03]
"i"	1	[-0.10, 0.20, 0.30]
"love"	2	[ 0.50, 0.60, -0.40]
"samosas"	3	[ 0.25, -0.30, 0.75]

So, input [1, 2, 3] is converted to:

[
 [-0.10,  0.20,  0.30],   # "i"
 [ 0.50,  0.60, -0.40],   # "love"
 [ 0.25, -0.30,  0.75]    # "samosas"
]
✅ You now have a 3×3 matrix (3 words × 3 features each).
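
The lookup itself is just row indexing. A small NumPy sketch using the exact matrix from the table above:

```python
import numpy as np

embedding_matrix = np.array([
    [ 0.01, -0.02,  0.03],   # index 0: <OOV>
    [-0.10,  0.20,  0.30],   # index 1: "i"
    [ 0.50,  0.60, -0.40],   # index 2: "love"
    [ 0.25, -0.30,  0.75],   # index 3: "samosas"
])

token_ids = np.array([1, 2, 3])          # "I love samosas"
embedded = embedding_matrix[token_ids]   # pick one row per token

print(embedded.shape)  # (3, 3)
print(embedded)
```

An Embedding layer does the same thing, except that the matrix rows are weights that can be updated during training.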

🧠 Why This Matters
After training, words that are used in similar ways tend to end up close together in vector space.

Example: "awesome" and "great" may map to nearby vectors such as [0.6, 0.8, -0.1] and [0.58, 0.79, -0.12] (illustrative values).

The model learns these embeddings from data during training!
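
"Close together" is often measured with cosine similarity. The sketch below uses hypothetical 3-dimensional vectors chosen for illustration; they are not learned from this notebook's data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 for similar direction, negative for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, for illustration only.
awesome = [0.60, 0.80, -0.10]
great = [0.58, 0.79, -0.12]
awful = [-0.55, -0.70, 0.20]

print(cosine_similarity(awesome, great))  # close to 1
print(cosine_similarity(awesome, awful))  # negative
```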



In [None]:
max_len = 20
vocab_size = 10000

tokenizer = keras.preprocessing.text.Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded = keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_len, padding='post')

🔍 Explanation:
🔹 max_len = 20
This defines the maximum number of tokens per input sequence.

All sequences will be padded or truncated to exactly 20 tokens.

🔹 vocab_size = 10000
You're restricting the tokenizer to use the top 10,000 most frequent words.

Any words not in this set get replaced with the <OOV> (Out Of Vocabulary) token.

🔹 tokenizer = keras.preprocessing.text.Tokenizer(...)
Creates a tokenizer that will:

Index words (e.g., "love" → 53)

Replace rare words with <OOV>

🔹 tokenizer.fit_on_texts(texts)
Learns the vocabulary from your list of strings (texts).

Creates a word_index dictionary:


{'<OOV>': 1, 'i': 2, 'love': 3, 'samosas': 4, ...}
🔹 sequences = tokenizer.texts_to_sequences(texts)
Converts each string into a list of integers (word indices).


"I love samosas" → [2, 3, 4]
🔹 padded = keras.preprocessing.sequence.pad_sequences(...)
Pads each list to a uniform length of max_len (20).

Padding is done with 0s at the end (padding='post').


[2, 3, 4] → [2, 3, 4, 0, 0, 0, ..., 0]  # length = 20
📊 Resulting padded shape:
If texts contains 1000 sentences, you’ll get a matrix of shape:


(1000, 20)
Each row is a fixed-length representation of a sentence — ready to feed into an Embedding layer!

🧠 Why is this crucial?
Batches must be rectangular: every sequence in a batch needs the same length, which padding provides.

Word order is preserved in sequences (important for RNNs/LSTMs).

Embedding layers work with integer input, not raw text.
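
The padding and truncation rule can be sketched in a few lines of plain Python. The function pad_sequence below is a hypothetical stand-in, not the Keras function; its defaults mirror the call above (padding='post') together with the Keras default truncating='pre', which drops tokens from the start of an over-long sequence.

```python
def pad_sequence(seq, maxlen, padding="post", truncating="pre", value=0):
    """Return a copy of seq with exactly maxlen items."""
    if len(seq) > maxlen:
        # 'pre' keeps the last maxlen tokens, 'post' keeps the first maxlen tokens.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    return list(seq) + pad if padding == "post" else pad + list(seq)

print(pad_sequence([2, 3, 4], maxlen=6))              # [2, 3, 4, 0, 0, 0]
print(pad_sequence(list(range(1, 9)), maxlen=6))      # [3, 4, 5, 6, 7, 8]
```

If the beginning of a long sentence matters more than its end, passing truncating='post' to pad_sequences keeps the first max_len tokens instead.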



In [3]:
# --- Step 2: Tokenize Text ---
max_len = 20
vocab_size = 10000

tokenizer = keras.preprocessing.text.Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded = keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_len, padding='post')



In [6]:
padded.shape  # (2000, 20): 2000 sentences, each padded to max_len = 20

(2000, 20)

In [8]:
from sklearn.model_selection import train_test_split
import numpy as np

X_train, X_test, y_train, y_test = train_test_split(padded, labels, test_size=0.2, random_state=42)

X_train

array([[ 9,  0,  0, ...,  0,  0,  0],
       [32, 33,  0, ...,  0,  0,  0],
       [36, 37,  6, ...,  0,  0,  0],
       ...,
       [ 4, 16,  0, ...,  0,  0,  0],
       [12,  7,  0, ...,  0,  0,  0],
       [26, 27,  0, ...,  0,  0,  0]], dtype=int32)
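
A quick arithmetic check of what the split produces. The helpers split_sizes and steps_per_epoch are hypothetical names for this sketch, and the test share is passed as an integer percentage to avoid floating-point rounding; the batch size of 16 is the one used in the training cell further down.

```python
import math

def split_sizes(n_samples, test_percent):
    """Sizes of the train and test parts for a given test percentage (test part rounded up)."""
    n_test = math.ceil(n_samples * test_percent / 100)
    return n_samples - n_test, n_test

def steps_per_epoch(n_train, batch_size):
    """Number of batches needed to see every training row once."""
    return math.ceil(n_train / batch_size)

n_train, n_test = split_sizes(2000, 20)
print(n_train, n_test)               # 1600 400
print(steps_per_epoch(n_train, 16))  # 100
```

The result of 100 steps matches the 100/100 counter in the first training log of this notebook.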

In [12]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
import numpy as np

# --- Step 3: Build Base Model (Frozen) ---
embed_dim = 64
prompt_len = 5
max_len = 20
vocab_size = 10000 # Ensure vocab_size is defined

def build_base_model(input_shape):
    inputs = keras.Input(shape=input_shape)
    x = layers.Bidirectional(layers.LSTM(64))(inputs)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.trainable = False  # Freeze all layers
    return model

# --- Step 4: Prompt Tuning Model ---
class PromptTuningModel(keras.Model):
    def __init__(self, base_model, prompt_len, embed_dim, vocab_size):
        super().__init__()
        self.base_model = base_model
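        # The prompt is the only tensor meant to be trained.
        # Caveat (assumes Keras 3): a raw tf.Variable attribute may not be tracked as a
        # model weight, in which case nothing trains. self.add_weight(...) is the tracked
        # alternative; inspect prompt_model.trainable_weights to confirm.
        self.prompt_embeddings = tf.Variable(
            tf.random.normal([prompt_len, embed_dim]), trainable=True, name="prompt_embeddings"
        )
        # Frozen embedding layer: token ids -> vectors. Like the base model, it is
        # randomly initialized and never trained in this cell.
        self.embed = layers.Embedding(vocab_size, embed_dim, trainable=False)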


    def call(self, input_ids):
        embedded = self.embed(input_ids)
        batch_size = tf.shape(embedded)[0]
        prompt = tf.tile(tf.expand_dims(self.prompt_embeddings, 0), [batch_size, 1, 1])
        concat_input = tf.concat([prompt, embedded], axis=1)
        return self.base_model(concat_input)

# --- Step 5: Compile & Train ---
# Define input shape for the base model after concatenation (max_len + prompt_len, embed_dim)
base_model_input_shape = (max_len + prompt_len, embed_dim)
base_model = build_base_model(base_model_input_shape)
prompt_model = PromptTuningModel(base_model, prompt_len, embed_dim, vocab_size)

prompt_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Ensure X_train and y_train are numpy arrays
X_train = np.array(X_train)
y_train = np.array(y_train)
X_test = np.array(X_test)
y_test = np.array(y_test)


prompt_model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, batch_size=16)

# Optional: save the full model (this is not limited to the prompt embeddings)
tf.saved_model.save(prompt_model, "saved_prompt_model")

Epoch 1/100
100/100 ━━━━━━━━━━━━━━━━━━━━ 3s 17ms/step - accuracy: 0.4710 - loss: 0.6975 - val_accuracy: 0.4825 - val_loss: 0.6964
Epoch 2/100
100/100 ━━━━━━━━━━━━━━━━━━━━ 2s 10ms/step - accuracy: 0.4710 - loss: 0.6975 - val_accuracy: 0.4825 - val_loss: 0.6964
Epoch 3/100
100/100 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 0.4627 - loss: 0.6983 - val_accuracy: 0.4825 - val_loss: 0.6964
Epoch 4/100
100/100 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 0.4654 - loss: 0.6981 - val_accuracy: 0.4825 - val_loss: 0.6964
Epoch 5/100
100/100 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - accuracy: 0.4729 - loss: 0.6973 - val_accuracy: 0.4825 - val_loss: 0.6964
Epoch 6/100
100/100 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - accuracy: 0.4757 - loss: 0.6970 - val_accuracy: 0.4825 - val_loss: 0.6964
Epoch 7/100
100/100

In [13]:
def predict_sentences(prompt_model, tokenizer, sentences):
    # Step 1: Tokenize and pad the input sentences
    sequences = tokenizer.texts_to_sequences(sentences)
    padded = keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_len, padding='post')

    # Step 2: Predict using the prompt model
    predictions = prompt_model.predict(padded)

    # Step 3: Display results
    for sentence, pred in zip(sentences, predictions):
        label = "Positive" if pred[0] > 0.5 else "Negative"
        confidence = pred[0]
        print(f"Input: {sentence}")
        print(f"Prediction: {label} (Confidence: {confidence:.2f})\n")


In [14]:
# Example sentences
test_sentences = [
    "I love this product",
    "This is the worst experience ever",
    "Absolutely fantastic!",
    "I will never buy this again"
]

# Run inference
predict_sentences(prompt_model, tokenizer, test_sentences)


1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 518ms/step
Input: I love this product
Prediction: Negative (Confidence: 0.47)

Input: This is the worst experience ever
Prediction: Negative (Confidence: 0.48)

Input: Absolutely fantastic!
Prediction: Negative (Confidence: 0.48)

Input: I will never buy this again
Prediction: Negative (Confidence: 0.47)
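
One detail worth noting in these printouts: the number labelled "Confidence" is the raw sigmoid output, that is, the estimated probability of the positive class. For a "Negative" prediction, the confidence in that label is one minus this value. A minimal sketch, with label_with_confidence as a hypothetical helper name:

```python
def label_with_confidence(p_positive, threshold=0.5):
    """Turn P(positive) into a label and the probability assigned to that label."""
    if p_positive > threshold:
        return "Positive", p_positive
    return "Negative", 1.0 - p_positive

for p in (0.47, 0.48, 0.92):
    label, confidence = label_with_confidence(p)
    print(f"P(positive)={p:.2f} -> {label} (confidence {confidence:.2f})")
```

Read this way, the four predictions above are all close to 0.5, so the model is barely preferring either class.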



In [17]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# --- Step 1: Load Dataset ---
df = pd.read_csv("sentiment_10000.csv")
texts = df['text'].astype(str).tolist()
labels = df['label'].astype(int).tolist()

# --- Step 2: Tokenize Text ---
max_len = 20
vocab_size = 10000

tokenizer = keras.preprocessing.text.Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded = keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_len, padding='post')

# --- Step 3: Train-Test Split ---
X_train, X_test, y_train, y_test = train_test_split(padded, labels, test_size=0.2, random_state=42)
X_train = np.array(X_train)
X_test = np.array(X_test)
y_train = np.array(y_train)
y_test = np.array(y_test)

# --- Step 4: Base Model ---
embed_dim = 64
prompt_len = 5

def build_base_model(input_shape):
    inputs = keras.Input(shape=input_shape)
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=False, dropout=0.2))(inputs)
    x = layers.Dropout(0.3)(x)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.trainable = True  # Base model is trainable here, so this run is full fine-tuning, not pure prompt tuning
    return model

# --- Step 5: Prompt Tuning Model ---
class PromptTuningModel(keras.Model):
    def __init__(self, base_model, prompt_len, embed_dim, vocab_size):
        super().__init__()
        self.base_model = base_model
        self.prompt_embeddings = tf.Variable(
            tf.random.normal([prompt_len, embed_dim]), trainable=True, name="prompt_embeddings"
        )
        self.embed = layers.Embedding(vocab_size, embed_dim, trainable=True)  # Token embeddings are trained too in this version

    def call(self, input_ids):
        embedded = self.embed(input_ids)
        batch_size = tf.shape(embedded)[0]
        prompt = tf.tile(tf.expand_dims(self.prompt_embeddings, 0), [batch_size, 1, 1])
        concat_input = tf.concat([prompt, embedded], axis=1)
        return self.base_model(concat_input)

# --- Step 6: Compile and Train ---
base_model_input_shape = (max_len + prompt_len, embed_dim)
base_model = build_base_model(base_model_input_shape)
prompt_model = PromptTuningModel(base_model, prompt_len, embed_dim, vocab_size)

prompt_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
prompt_model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=30, batch_size=32)

# --- Step 7: Inference Function ---
def predict_sentences(prompt_model, tokenizer, sentences):
    sequences = tokenizer.texts_to_sequences(sentences)
    padded = keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_len, padding='post')
    predictions = prompt_model.predict(padded)
    for sentence, pred in zip(sentences, predictions):
        label = "Positive" if pred[0] > 0.5 else "Negative"
        print(f"Input: {sentence}")
        print(f"Prediction: {label} (Confidence: {pred[0]:.2f})\n")

# --- Example Inference ---
test_sentences = [
    "I love this product",
    "Worst experience ever",
    "Perfect in every way",
    "Would not recommend"
]
predict_sentences(prompt_model, tokenizer, test_sentences)


Epoch 1/30
250/250 ━━━━━━━━━━━━━━━━━━━━ 15s 41ms/step - accuracy: 0.7689 - loss: 0.3933 - val_accuracy: 1.0000 - val_loss: 6.0606e-05
Epoch 2/30
250/250 ━━━━━━━━━━━━━━━━━━━━ 19s 35ms/step - accuracy: 1.0000 - loss: 4.4268e-05 - val_accuracy: 1.0000 - val_loss: 1.7822e-05
Epoch 3/30
250/250 ━━━━━━━━━━━━━━━━━━━━ 11s 39ms/step - accuracy: 1.0000 - loss: 1.4709e-05 - val_accuracy: 1.0000 - val_loss: 8.2456e-06
Epoch 4/30
250/250 ━━━━━━━━━━━━━━━━━━━━ 10s 38ms/step - accuracy: 1.0000 - loss: 7.1523e-06 - val_accuracy: 1.0000 - val_loss: 4.7272e-06
Epoch 5/30
250/250 ━━━━━━━━━━━━━━━━━━━━ 9s 33ms/step - accuracy: 1.0000 - loss: 4.2320e-06 - val_accuracy: 1.0000 - val_loss: 3.0470e-06
Epoch 6/30
250/250 ━━━━━━━━━━━━━━━━━━━━ 10s 32ms/step - accuracy: 1.0000 - loss: 2.7672e-06 - val_accuracy: 1.0
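
A validation accuracy of exactly 1.0 from the first epoch deserves a sanity check. The sample rows printed at the top of this notebook repeat the same short phrases many times, so identical sentences may appear on both sides of the split. The sketch below measures that overlap; leaked_fraction is a hypothetical helper, and the two small lists only imitate the data.

```python
def leaked_fraction(train_texts, test_texts):
    """Share of test sentences that also occur verbatim (case-insensitive) in the training set."""
    if not test_texts:
        return 0.0
    seen = {t.strip().lower() for t in train_texts}
    hits = sum(1 for t in test_texts if t.strip().lower() in seen)
    return hits / len(test_texts)

# Toy stand-ins built from phrases shown earlier in the notebook.
train_texts = ["Awful plot", "Best ever", "So boring", "Thumbs up"]
test_texts = ["Awful plot", "So boring", "Superb!"]

print(leaked_fraction(train_texts, test_texts))  # 2 of 3 test sentences were seen in training
```

To run it on the real data, one option is to split the raw texts list with the same test_size and random_state as the padded array and pass the two resulting lists to the function. A fraction near 1.0 would mean the validation score mostly measures memorization.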

In [18]:
# --- Example Inference ---
test_sentences = [
    "I hacked the website but it help company to know volunabilities",
    "good thing to do but bad thing may come in life",
    "love and fight are life",
    "Would not recommended but need of it"
]
predict_sentences(prompt_model, tokenizer, test_sentences)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 112ms/step
Input: I hacked the website but it help company to know volunabilities
Prediction: Negative (Confidence: 0.00)

Input: good thing to do but bad thing may come in life
Prediction: Negative (Confidence: 0.00)

Input: love and fight are life
Prediction: Positive (Confidence: 1.00)

Input: Would not recommended but need of it
Prediction: Negative (Confidence: 0.00)
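
For free-form sentences like these, it can help to check how many words the tokenizer has never seen, since each of them is replaced by the same <OOV> index before reaching the model. A rough sketch, assuming plain whitespace splitting (the Keras tokenizer also strips punctuation) and a made-up known_words set; in the notebook the real vocabulary would be set(tokenizer.word_index).

```python
def oov_report(sentence, known_words):
    """Return the fraction of unknown words in a sentence and the unknown words themselves."""
    words = sentence.lower().split()
    if not words:
        return 0.0, []
    unknown = [w for w in words if w not in known_words]
    return len(unknown) / len(words), unknown

# Hypothetical vocabulary assembled from phrases shown earlier in the notebook.
known_words = {"i", "love", "hate", "this", "movie", "would", "not", "recommend", "so", "boring"}

fraction, unknown = oov_report("love and fight are life", known_words)
print(fraction, unknown)  # 0.8 ['and', 'fight', 'are', 'life']
```

When most of a sentence is out of vocabulary, the prediction rests on the one or two known words that remain.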

