# TensorFlow and Keras for LLMs: From Prototyping to Deployment

---

## Outline

1. **Introduction to TensorFlow and Keras**
2. **TensorFlow and Keras Essentials: A Refresher**

   * Tensors and Data Types
   * Data Loading and Preprocessing
   * The Significance of TensorFlow and Keras in ML
3. **Building Neural Networks in Keras**

   * Sequential API
   * Functional API
4. **LLMs in the TensorFlow Ecosystem**

   * Overview of LLMs
   * Hugging Face Transformers with TensorFlow
   * Other Compatible LLM Libraries
5. **Model Importing and Inference**

   * Importing Pretrained LLMs with TensorFlow
   * Running Inference Locally
   * Hardware Considerations
6. **Prototyping to Deployment**

   * TF SavedModel Format
   * TensorFlow Serving / TF Lite / TF.js
   * Integration into Applications (e.g., Flask, FastAPI)
7. **References and Resources**

---



## 1. Introduction to TensorFlow and Keras

* TensorFlow (TF): end-to-end open-source platform for machine learning developed by Google.
* TF provides tools for model building, training, serving, and deployment.
* Keras is its high-level API for building and training deep learning models with a user-friendly interface.

##### TensorFlow and Keras together provide:

* Scalable execution on CPUs, GPUs, TPUs
* Flexibility (eager and graph modes)
* Interoperability with the broader Python ML ecosystem
* Integration with powerful tools like TensorBoard, TFHub, and TFX

##### Note:
- PyTorch dominates the open-source LLM space.
- TensorFlow remains a robust option for enterprise-grade ML/LLM solutions.

---



## 2. TensorFlow and Keras Essentials: A Refresher

### Tensors and Data Types

- Tensors are the core data structure in TensorFlow
- Tensors are multi-dimensional arrays with a uniform data type.


In [None]:
!python --version

In [None]:
import tensorflow as tf
print(tf.__version__)

In [None]:
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
print(a)
print('---')
print(a.shape)      # (2, 2)
print('---')
print(a.dtype)      # tf.float32

Common dtypes:

* `tf.float32`, `tf.float64`
* `tf.int32`, `tf.int64`
* `tf.string`, `tf.bool`

### Data Loading and Preprocessing

Data ingestion in TensorFlow is handled via the `tf.data` API:

In [None]:
dataset = tf.data.TextLineDataset("./data/hawking.txt")  # forward slashes avoid backslash-escape issues
# The dataset now yields scalar tf.string tensors, one per line of the text file.
dataset = dataset.map(lambda x: tf.strings.split(x))
# Each element is now a 1-D tf.string tensor holding the whitespace-separated words ("tokens") of one line.


In [None]:
type(dataset)

##### Preview Dataset

In [None]:
# Preview first few lines
for i, line in enumerate(dataset.take(5)):
    tokens = tf.reshape(line, [-1]).numpy()  # ensure flat array
    print(f"Line {i+1} tokens:", [token.decode("utf-8") for token in tokens])



The `tf.data` API supports streaming, batching, caching, shuffling, and prefetching for performance.

For tokenized datasets (common in LLM workflows), Hugging Face Datasets can be converted to `tf.data.Dataset` objects, for example with `Dataset.to_tf_dataset()`.
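
To make the shuffling and batching mentioned above concrete, here is a plain-Python sketch of the idea behind `Dataset.shuffle(buffer_size)` followed by `Dataset.batch(batch_size)`. It only illustrates the behavior and is not TensorFlow's implementation; the helper names `buffered_shuffle` and `batched` are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(items: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Yield items in pseudo-random order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Buffer is full: emit a random element and replace it with the new item.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Input exhausted: drain the remaining elements in random order.
    rng.shuffle(buffer)
    yield from buffer


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group consecutive items into lists of at most batch_size elements."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


lines = [f"line {i}" for i in range(10)]
batches = list(batched(buffered_shuffle(lines, buffer_size=4), batch_size=3))
for batch in batches:
    print(batch)
```

A buffer smaller than the dataset only shuffles locally: with `buffer_size=1` the order is unchanged, which is why a buffer on the order of the dataset size is usually chosen when memory allows.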

### Why TensorFlow/Keras?

* **Production-readiness** with TFX and TensorFlow Serving
* **Tooling**: TensorBoard, Model Optimization Toolkit, TF Lite
* **Scalability**: GPU/TPU acceleration with XLA
* **Compatibility**: Hugging Face, ONNX, JAX interop

---



## 3. Building Neural Networks in Keras

### What is a Neural Network?

### Sequential API

Ideal for linear stacks of layers:


In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    tf.keras.Input(shape=(784,)),  # Explicit Input layer
    Dense(10, activation='relu'),
    Dense(10, activation='softmax')
])
# categorical_crossentropy expects one-hot labels; use sparse_categorical_crossentropy for integer labels
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()


### Functional API

Supports complex architectures (multiple inputs/outputs, residuals):


In [None]:
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, concatenate

input1 = Input(shape=(64,))
input2 = Input(shape=(32,))
x1 = Dense(128, activation='relu')(input1)
x2 = Dense(64, activation='relu')(input2)
combined = concatenate([x1, x2])
output = Dense(1, activation='sigmoid')(combined)

model = Model(inputs=[input1, input2], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy')

model.summary()
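
As a sanity check on the parameter counts that `model.summary()` reports, the totals for both example models can be worked out by hand. This sketch assumes only the layer sizes shown above; `dense_params` is a hypothetical helper, not a Keras function.

```python
def dense_params(inputs: int, units: int) -> int:
    """Trainable parameters of a Dense layer: a weight matrix plus one bias per unit."""
    return inputs * units + units


# Sequential example: 784 -> Dense(10) -> Dense(10)
sequential_total = dense_params(784, 10) + dense_params(10, 10)

# Functional example: two branches, concatenated, then a single sigmoid unit
branch1 = dense_params(64, 128)   # input1 -> Dense(128)
branch2 = dense_params(32, 64)    # input2 -> Dense(64)
head = dense_params(128 + 64, 1)  # concatenated features -> Dense(1)
functional_total = branch1 + branch2 + head

print(sequential_total)   # 7960
print(functional_total)   # 10625
```

The concatenation layer itself adds no parameters; it only widens the input of the final `Dense` layer to 128 + 64 features.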


---

## 4. LLMs in the TensorFlow Ecosystem

### Overview of LLMs

LLMs (e.g., GPT, BERT, T5) are transformer-based models trained on massive corpora. Tasks include:

* Text classification
* Named entity recognition (NER)
* Text summarization
* Question answering
* Code generation

### Hugging Face Transformers with TensorFlow

Many Hugging Face models ship both PyTorch and TensorFlow implementations; the TensorFlow classes carry a `TF` prefix:


In [None]:
from transformers import TFAutoModelForSequenceClassification, AutoTokenizer

model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


### Other Compatible Libraries

* **TFHub**: Pretrained models for text and vision
* **KerasNLP**: Native Keras tools for tokenization, embedding, transformer layers
* **T5X / Flax models**: Often require conversion but increasingly interoperable

---

## 5. Model Importing and Inference

### Importing Pretrained Models

>- DistilBERT for question answering works by encoding both the question and the context text together into token embeddings.
>- The model then predicts two positions in the combined input: the start and end tokens of the answer span within the context, allowing it to extract the most relevant answer directly from the passage.

In [None]:
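from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering

# Load the tokenizer from the same checkpoint as the model so vocabulary and special tokens match
checkpoint = "distilbert-base-uncased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForQuestionAnswering.from_pretrained(checkpoint)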

### Running Locally for Inference

In [None]:
question = "What is TensorFlow?"
context = "TensorFlow is an end-to-end open-source platform for machine learning developed by Google."

inputs = tokenizer(question, context, return_tensors="tf")
inputs.pop("token_type_ids", None)  # if model doesn't use them

outputs = model(**inputs)

In [None]:
print(inputs)

In [None]:
print(outputs)

In [None]:
# 1. Get start and end logits
start_logits = outputs.start_logits  # (1, seq_len)
end_logits = outputs.end_logits      # (1, seq_len)

# 2. Find the most likely start and end token positions
start_index = tf.argmax(start_logits, axis=1).numpy()[0]
end_index = tf.argmax(end_logits, axis=1).numpy()[0]

# 3. Get input tokens IDs
input_ids = inputs["input_ids"].numpy()[0]

# 4. Extract answer tokens IDs from input
answer_ids = input_ids[start_index : end_index + 1]

# 5. Decode tokens to text
answer = tokenizer.decode(answer_ids)

print("Answer:", answer)
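
Taking the argmax of the start and end logits independently can produce an end position that lies before the start, which yields an empty answer. A common remedy is to score start/end pairs jointly and keep the best valid one. The sketch below shows that idea on toy logits in plain Python; `best_span` and `max_answer_len` are hypothetical names, and the sketch does not exclude question tokens or special tokens as a full pipeline would.

```python
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 30) -> Tuple[int, int]:
    """Return the (start, end) pair with the highest summed logit, with start <= end."""
    best_score = float("-inf")
    best = (0, 0)
    for start, start_score in enumerate(start_logits):
        last = min(len(end_logits), start + max_answer_len)
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score, best = score, (start, end)
    return best


# Toy logits where independent argmax gives start=2 but end=1 (an invalid span)
start_logits = [0.1, 0.2, 5.0, 0.3, 0.1]
end_logits = [0.0, 4.0, 0.2, 3.5, 0.1]

naive_start = start_logits.index(max(start_logits))
naive_end = end_logits.index(max(end_logits))
print("independent argmax:", (naive_start, naive_end))
print("joint selection:", best_span(start_logits, end_logits))
```

With real model outputs, the two logit rows would first be converted to Python lists, for example with `.numpy()[0].tolist()`.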



Inference can be optimized using:

* **XLA compilation**
* **Mixed-precision** (FP16) for GPU
* **SavedModel export** for efficient serving

### Hardware Considerations

* CPU inference possible but slow
* GPU (CUDA) or Apple Silicon acceleration preferred
* TPU use supported in GCP or Colab





---

## 6. Prototyping to Deployment

#### Saving the Pretrained Model


In [None]:
model.save_pretrained("./saved_model")

#### Or export in the TensorFlow SavedModel format

In [None]:
model.save("./tf_model")


### Serving Options

* **TensorFlow Serving** for REST/gRPC interfaces
* **TFLite** for mobile devices
* **TF.js** for browser-based inference
* **ONNX** export (via `transformers.onnx`) for broad compatibility
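
For the TensorFlow Serving option, a client talks to the REST endpoint by posting JSON to `/v1/models/<name>:predict`. The sketch below only builds such a request with the standard library and does not send it. The host, the model name `tf_model`, the input names, and the token ids are illustrative assumptions; the actual input names depend on the exported model's serving signature.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_predict_request(host: str, model_name: str,
                          instances: List[Dict[str, Any]],
                          port: int = 8501) -> urllib.request.Request:
    """Build a POST request for TensorFlow Serving's REST predict API (row format)."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# One tokenized example; ids and input names are placeholders
instances = [{"input_ids": [101, 7592, 102], "attention_mask": [1, 1, 1]}]
req = build_predict_request("localhost", "tf_model", instances)

print(req.full_url)
print(req.data.decode("utf-8"))

# Sending requires a running server, so it is left commented out:
# with urllib.request.urlopen(req) as resp:
#     predictions = json.loads(resp.read())["predictions"]
```

Tokenization still happens on the client side in this setup, since the served model only receives tensors.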

### Integration into Applications

Using Flask or FastAPI for REST APIs:

```python
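from flask import Flask, jsonify, request
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering

app = Flask(__name__)

# Reload the question-answering model written by model.save_pretrained("./saved_model").
# model.save_pretrained does not store the tokenizer, so it is loaded by checkpoint name.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-distilled-squad")
model = TFAutoModelForQuestionAnswering.from_pretrained("./saved_model")

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()  # expects {"question": ..., "context": ...}
    tokens = tokenizer(data["question"], data["context"], return_tensors="tf")
    output = model(**tokens)
    # Model outputs are not JSON-serializable, so return the decoded answer span instead
    start = int(tf.argmax(output.start_logits, axis=1)[0])
    end = int(tf.argmax(output.end_logits, axis=1)[0])
    answer = tokenizer.decode(tokens["input_ids"].numpy()[0][start : end + 1])
    return jsonify({"answer": answer})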
```

---

## 7. References and Resources

* [TensorFlow Official Site](https://www.tensorflow.org/)
* [Keras Documentation](https://keras.io/)
* [Hugging Face Transformers](https://huggingface.co/transformers/)
* [TensorFlow Model Garden](https://github.com/tensorflow/models)
* [KerasNLP](https://keras.io/keras_nlp/)
* [TF Lite](https://www.tensorflow.org/lite)
* [TensorFlow Serving](https://www.tensorflow.org/tfx/guide/serving)
* [Hugging Face + TensorFlow Guide](https://huggingface.co/docs/transformers/index)

---

## Lab: Introduction to TensorFlow and Keras for LLMs

**Objectives:**

  * Understand basic TensorFlow data types and operations.
  * Load and preprocess text data for LLMs.
  * Utilize Hugging Face Transformers with TensorFlow/Keras for LLM tasks.
  * Generate BERT embeddings for text.
  * Perform text classification using a pre-trained LLM.
  * Experiment with basic generative LLMs.
  * (Optional Advanced) Understand the concepts and perform a lightweight fine-tuning of an LLM.

-----


**TensorFlow Basics:**

In [None]:
import tensorflow as tf
import numpy as np

In [None]:
# Create a constant tensor
my_tensor = tf.constant([[1, 2], [3, 4]], dtype=tf.float32)
print("My Tensor:\n", my_tensor)

In [None]:
# Basic operation
added_tensor = my_tensor + 5
print("Added Tensor:\n", added_tensor)

In [None]:
# Convert to NumPy
numpy_array = my_tensor.numpy()
print("NumPy Array:\n", numpy_array)

**Tokenization:**

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

sentences = ["Hello, how are you today?", "I love deploying LLMs!"]
tokenized_inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="tf")

print("Input IDs:\n", tokenized_inputs["input_ids"])
print("Attention Mask:\n", tokenized_inputs["attention_mask"])
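
To show what `padding=True` and the attention mask mean, here is a plain-Python sketch that right-pads a batch of token id lists to a common length and marks real tokens with 1 and padding with 0. The helper `pad_batch` is hypothetical and the token ids are placeholders rather than real tokenizer output; `pad_id=0` matches the `[PAD]` id of the BERT uncased vocabulary.

```python
from typing import List, Tuple


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


batch = [[101, 7592, 102], [101, 1045, 2293, 2009, 102]]  # placeholder ids
ids, mask = pad_batch(batch)
print(ids)   # [[101, 7592, 102, 0, 0], [101, 1045, 2293, 2009, 102]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The model uses the mask to ignore padded positions in attention, which is why both tensors are passed together.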

In [None]:
# token vocabulary
print("Tokenizer vocab (sample):")
for token, idx in list(tokenizer.vocab.items())[:20]:
    print(f"'{token}': {idx}")

**BERT Embeddings:**

> - Load a DistilBERT tokenizer and model to convert a batch of input sentences into token embeddings. The sentences are tokenized with padding and truncation, then passed through the model to obtain the last hidden states.
> - Last hidden state = the contextual encoding of each token (word piece) given all other tokens in the sentence.
> - For a simple summary of each whole sentence, take the vector at the `[CLS]` special token (position 0). These vectors can then be compared across sentences or fed into downstream models.

In [None]:
from transformers import TFAutoModel, AutoTokenizer
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = TFAutoModel.from_pretrained("distilbert-base-uncased")

sentences = ["Hello, how are you today?", "I love deploying LLMs!"]
tokenized_inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="tf")

# Get embeddings
outputs = model(tokenized_inputs)
last_hidden_state = outputs.last_hidden_state  # shape: (batch_size, seq_len, hidden_size)

# Use first token's embedding as pooled output (DistilBERT has no pooler_output)
pooled_output = last_hidden_state[:, 0, :]  # shape: (batch_size, hidden_size)

print("Last Hidden State shape:", last_hidden_state.shape)
print("Pooled Output shape:", pooled_output.shape)
print("First sentence embedding (pooled, first 5 dims):", pooled_output[0, :5].numpy())


#### compute similarity between sentences

In [None]:
from transformers import TFAutoModel, AutoTokenizer
import tensorflow as tf
import numpy as np

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = TFAutoModel.from_pretrained("distilbert-base-uncased")

# Example sentences
sentences = [
    "I love machine learning.",
    "Artificial intelligence is fascinating.",
    "I enjoy deploying LLMs!",
    "the quick brown fox jumps over the lazy dog",
    "Green tea has a soothing effect.",
    "I hate machine learning."
]

# Tokenize with padding and truncation
tokenized_inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="tf")

# Get model outputs
outputs = model(tokenized_inputs)
last_hidden_state = outputs.last_hidden_state  # (batch_size, seq_len, hidden_size)

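# Mean pooling: average the token embeddings of each sentence, ignoring padded positions
mask = tf.cast(tokenized_inputs["attention_mask"], last_hidden_state.dtype)[:, :, tf.newaxis]
pooled_embeddings = tf.reduce_sum(last_hidden_state * mask, axis=1) / tf.reduce_sum(mask, axis=1)  # (batch_size, hidden_size)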

# Normalize embeddings to unit vectors
normalized_embeddings = tf.math.l2_normalize(pooled_embeddings, axis=1)

# Compute cosine similarity matrix between all pairs
cosine_sim_matrix = tf.matmul(normalized_embeddings, normalized_embeddings, transpose_b=True)

# Convert to numpy for easy printing
cosine_sim_matrix_np = cosine_sim_matrix.numpy()

print("Cosine similarity matrix:")
print(np.round(cosine_sim_matrix_np, 3))
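
Because the batch is padded, the mean over token embeddings should skip padded positions, otherwise shorter sentences are pulled toward the padding vectors. The plain-Python sketch below shows mask-aware mean pooling and cosine similarity on toy 2-dimensional vectors; the helper names and numbers are made up for illustration.

```python
import math
from typing import List


def masked_mean(token_vectors: List[List[float]], mask: List[int]) -> List[float]:
    """Average only the token vectors whose mask entry is 1."""
    kept = [vec for vec, m in zip(token_vectors, mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


tokens_a = [[1.0, 0.0], [1.0, 2.0], [9.0, 0.0]]  # last vector sits at a padded position
mask_a = [1, 1, 0]
tokens_b = [[0.0, 1.0], [2.0, 1.0]]
mask_b = [1, 1]

emb_a = masked_mean(tokens_a, mask_a)            # [1.0, 1.0]
emb_a_unmasked = masked_mean(tokens_a, [1, 1, 1])
emb_b = masked_mean(tokens_b, mask_b)            # [1.0, 1.0]

print("masked similarity:  ", round(cosine(emb_a, emb_b), 3))
print("unmasked similarity:", round(cosine(emb_a_unmasked, emb_b), 3))
```

With the mask applied the two toy sentences come out identical, while including the padded vector lowers the similarity noticeably.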


**LLM Classification (Tiny Example):**

In [None]:
# import tf_keras as keras
from tensorflow import keras
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

In [None]:
import pandas as pd
imdb = pd.read_csv("./data/imdb_reviews.csv")  # forward slashes avoid backslash-escape issues
imdb.head()

In [None]:
# Alternative: a tiny hand-made dataset
#texts = ["This movie was great!", "Terrible film.", "Neutral review.", "Absolutely amazing!"]
#labels = [1, 0, 0, 1] # 1 for positive, 0 for negative/neutral

# Use the first 20 IMDB reviews; labels must be integer class ids (0/1) for the sparse loss used below
texts = list(imdb['review'].values)[:20]
labels = list(imdb['sentiment'].values)[:20]

# Tokenize
tokenized_data = tokenizer(texts, padding=True, truncation=True, return_tensors="tf")
tf_labels = tf.constant(labels)



In [None]:
labels

In [None]:
# Load model
classifier_model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2
)


# Compile model
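classifier_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),  # small learning rate, typical for fine-tuning
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"]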
)
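
The loss configured above takes raw logits and an integer class id. For one example it is the negative log of the softmax probability assigned to the true class. Here is a plain-Python sketch of that formula, with the hypothetical helper `sparse_ce_from_logits`; by default Keras additionally averages the result over the batch.

```python
import math
from typing import List


def sparse_ce_from_logits(logits: List[float], label: int) -> float:
    """Cross-entropy for one example: log-sum-exp of the logits minus the true-class logit."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


uniform = sparse_ce_from_logits([0.0, 0.0], label=1)       # no preference: ln(2)
confident_right = sparse_ce_from_logits([2.0, -1.0], label=0)
confident_wrong = sparse_ce_from_logits([2.0, -1.0], label=1)

print(round(uniform, 4))  # 0.6931
print(confident_right < uniform < confident_wrong)  # True
```

A freshly initialized two-class head typically starts near the `ln(2)` value, so a training loss that stays there indicates the model is not learning yet.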


In [None]:
# Train for a very small number of epochs with a tiny batch size
history = classifier_model.fit(
    x=dict(tokenized_data),
    y=tf_labels,
    epochs=5, # Very few epochs
    batch_size=2, # Tiny batch size
    verbose=1
)


In [None]:
# Prediction
#new_text = "This is a fantastic example!"
new_text = imdb['review'][20]  # a single review string that was not used for training
new_tokenized_input = tokenizer(new_text, padding=True, truncation=True, return_tensors="tf")
logits = classifier_model.predict(dict(new_tokenized_input)).logits
predicted_class_id = tf.argmax(logits, axis=1).numpy()[0]
# new_text is a string, so show its first 80 characters rather than new_text[0] (a single character)
print(f"Predicted class for '{new_text[:80]}...': {predicted_class_id}")

In [None]:
new_text

**Generative LLM:**

In [None]:
from transformers import TFAutoModelForCausalLM, AutoTokenizer
import tensorflow as tf

# Use matching model and tokenizer
model_name = "gpt2"  # You can change this to "distilgpt2" for a smaller model

tokenizer = AutoTokenizer.from_pretrained(model_name)
generator_model = TFAutoModelForCausalLM.from_pretrained(model_name)

**Generate text**


In [None]:
# Input prompt
input_text = "The quick brown fox jumps over the"
input_ids = tokenizer.encode(input_text, return_tensors="tf")

# Generate output
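generated_output = generator_model.generate(
    input_ids,
    max_length=50,           # total length in tokens, prompt included
    do_sample=True,          # sample instead of greedy decoding
    temperature=0.7,
    num_return_sequences=1,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id  # GPT-2 has no pad token; reuse EOS to avoid a warning
)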

# Decode and print
decoded_output = tokenizer.decode(generated_output[0], skip_special_tokens=True)
print("Generated Text:\n", decoded_output)



#### Experiment with Temperature and top_p
- try high temperature (> 1.5) and gradually decrease top_p from 1 to 0.95
- try a few different prompts
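
To build intuition for these knobs before experimenting, the sketch below applies temperature scaling and then top-k and top-p (nucleus) filtering to a toy distribution in plain Python. It illustrates the general technique and is not the Transformers implementation; the helper names and numbers are made up.

```python
import math
from typing import Dict, List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p(probs: List[float], top_k: int, top_p: float) -> Dict[int, float]:
    """Keep the top_k most likely tokens, then the smallest prefix reaching top_p; renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_total  # probability mass after the top-k cut
        if cumulative >= top_p:
            break
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]
sharp = softmax(logits, temperature=0.7)
flat = softmax(logits, temperature=1.5)
print("top token probability at T=0.7:", round(sharp[0], 3))
print("top token probability at T=1.5:", round(flat[0], 3))

probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_top_p(probs, top_k=3, top_p=0.8))  # only token ids 0 and 1 survive
```

High temperature flattens the distribution so unlikely tokens get sampled more often, while a lower `top_p` trims the tail again; the two settings pull in opposite directions.
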
#### Try a smaller generative LLM
- replace gpt2 (about 124M parameters) with distilgpt2 (about 82M parameters). Does this reduce latency, that is, does the model generate faster?