# Harry Potter

A step-by-step guide to fine-tuning a pre-trained language model from Hugging Face on the Harry Potter books, using TensorFlow and Keras.

The causal language model fine-tuned here is best suited to text generation, such as writing Harry Potter fan fiction. Other natural language processing (NLP) tasks, such as text analysis, sentiment analysis, question answering, topic modeling, and named entity recognition, generally need a task-specific model head or further adaptation.

## Step 1: Install Required Libraries

First, you need to install the necessary libraries. You can do this using pip.

```
!pip install transformers tensorflow datasets
```

These dependencies may already be installed. If Keras 3 is installed, importing the TensorFlow GPT-2 model fails with the following error:

```
RuntimeError: Failed to import transformers.models.gpt2.modeling_tf_gpt2 because of the following error (look up to see its traceback):
Your currently installed version of Keras is Keras 3, but this is not yet supported in Transformers. Please install the backwards-compatible tf-keras package with `pip install tf-keras`.
```

To prevent it, install the `tf-keras` package:

In [1]:
!pip install tf-keras
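
To confirm which versions ended up in the environment, a small check like the one below may help. It is only a sketch: it relies on the standard-library `importlib.metadata`, and the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


# Distribution names as published on PyPI for the libraries used in this guide.
names = ["tensorflow", "keras", "tf-keras", "transformers", "datasets"]
for name, ver in installed_versions(names).items():
    print(f"{name}: {ver or 'not installed'}")
```

If `keras` reports a 3.x version and `tf-keras` is missing, the import error shown above is the likely result.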



## Step 2: Import Libraries

Import the required libraries for the task.

In [2]:
import tensorflow as tf
from transformers import TFAutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

2025-03-22 15:35:52.490537: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
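
With Keras 3 installed, `tf.keras` and the `tf-keras` package used by Transformers can be two different libraries, which may cause trouble later when `tf.keras` objects such as optimizers are passed to the model. One commonly suggested way to keep them aligned is to set the `TF_USE_LEGACY_KERAS` environment variable before TensorFlow is first imported. The sketch below assumes TensorFlow 2.16 or later with `tf-keras` installed; whether it is needed at all depends on the installed versions.

```python
import os

# Assumption: TensorFlow 2.16+ with tf-keras installed. This must run before the
# first `import tensorflow`; setting it afterwards has no effect in that session.
os.environ["TF_USE_LEGACY_KERAS"] = "1"

print(os.environ["TF_USE_LEGACY_KERAS"])
```

In a fresh kernel, these lines would go above the imports in this step.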


## Step 3: Load Pre-trained Model and Tokenizer

Load a pre-trained model and tokenizer from Hugging Face. We'll use the GPT-2 model for this example.

In [3]:
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForCausalLM.from_pretrained(model_name)

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


## Step 4: Prepare the Dataset

Load the text of the Harry Potter books from a local file, `harrypotter.txt`.

In [13]:
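# Alternative: load the file with the datasets library instead.
# dataset = load_dataset('text', data_files={'train': 'harrypotter.txt'})

# Read the full text into a single string (assumes the file is UTF-8 encoded).
with open('harrypotter.txt', 'r', encoding='utf-8') as file:
    data = file.read()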
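
Before tokenizing, it can be worth checking that the file was read as expected. The following is a minimal sketch of such a check; `summarize_text` is a hypothetical helper, and the sample string is a stand-in so the snippet runs on its own.

```python
def summarize_text(text):
    """Return basic size statistics for a loaded text corpus."""
    return {
        "characters": len(text),
        "words": len(text.split()),
        "lines": text.count("\n") + 1 if text else 0,
    }


# Stand-in string; in the notebook you would pass `data` instead.
sample = "first line of text\nsecond line"
print(summarize_text(sample))  # {'characters': 30, 'words': 6, 'lines': 2}
```

An unexpectedly small character count usually points to a wrong path or a truncated file.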

## Step 5: Tokenize the Dataset

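Tokenize the text with the model's tokenizer. GPT-2 has no padding token, so the end-of-sequence token is reused for padding. Because the whole corpus is a single string, `return_overflowing_tokens=True` is needed to split it into consecutive 512-token sequences; with truncation alone, everything after the first 512 tokens would be discarded.

In [19]:
tokenizer.pad_token = tokenizer.eos_token
tokenized_datasets = tokenizer(data, return_tensors='tf', truncation=True, padding='max_length', max_length=512, return_overflowing_tokens=True)
# Drop the bookkeeping key added by return_overflowing_tokens so only the model inputs remain.
_ = tokenized_datasets.pop('overflow_to_sample_mapping', None)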
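
With `truncation=True` and `max_length=512`, each returned sequence holds at most 512 tokens, so a whole book has to be spread over many sequences to be used in full. The chunking idea can be sketched on a plain list of integers standing in for token IDs. `split_into_blocks` is a hypothetical helper for illustration, not a Transformers function.

```python
def split_into_blocks(token_ids, block_size, pad_id):
    """Split a flat list of token IDs into equal-length blocks.

    The last block is right-padded with pad_id, and a parallel attention mask
    marks real tokens with 1 and padding with 0.
    """
    blocks, masks = [], []
    for start in range(0, len(token_ids), block_size):
        chunk = token_ids[start:start + block_size]
        padding = block_size - len(chunk)
        blocks.append(chunk + [pad_id] * padding)
        masks.append([1] * len(chunk) + [0] * padding)
    return blocks, masks


# Toy example: ten "token IDs", blocks of four, pad ID 0.
blocks, masks = split_into_blocks(list(range(1, 11)), block_size=4, pad_id=0)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 0, 0]]
print(masks)   # [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 0, 0]]
```

The attention mask is what lets the model, and later the loss, tell real tokens from padding.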

## Step 6: Prepare Data for Training

Convert the tokenized tensors into a shuffled, batched `tf.data.Dataset` suitable for training.

In [20]:
tokenized_datasets.keys()

dict_keys(['input_ids', 'attention_mask'])

In [31]:
# tokenized_datasets

In [3]:
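# tokenized_datasets is a dict-like BatchEncoding of tensors, not a datasets.Dataset,
# so it has no 'train' split and no to_tf_dataset() method. Build the pipeline directly.
input_ids = tokenized_datasets['input_ids']
attention_mask = tokenized_datasets['attention_mask']
# For causal language modeling the labels are the input IDs (the model shifts them internally).
# Padded positions are set to -100 so that the loss ignores them.
labels = tf.where(attention_mask == 1, input_ids, tf.constant(-100, dtype=input_ids.dtype))
train_dataset = (
    tf.data.Dataset.from_tensor_slices(
        {'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': labels}
    )
    .shuffle(buffer_size=1000)
    .batch(8)
)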
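
For causal language modeling, the labels are a copy of the input IDs: at each position the model is trained to predict the next token. Hugging Face causal language models apply the one-position shift internally and skip label positions set to -100. The pure-Python sketch below shows how inputs and targets line up; `make_labels` and `next_token_pairs` are hypothetical helpers for illustration only.

```python
IGNORE_INDEX = -100  # label value that the Hugging Face loss functions skip


def make_labels(input_ids, attention_mask):
    """Copy input IDs to labels, masking padded positions with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]


def next_token_pairs(input_ids, labels):
    """Pair each input token with the label one position later (the shift)."""
    return [(inp, tgt) for inp, tgt in zip(input_ids[:-1], labels[1:])
            if tgt != IGNORE_INDEX]


input_ids = [11, 12, 13, 0, 0]
attention_mask = [1, 1, 1, 0, 0]
labels = make_labels(input_ids, attention_mask)
print(labels)                               # [11, 12, 13, -100, -100]
print(next_token_pairs(input_ids, labels))  # [(11, 12), (12, 13)]
```

Only the two pairs made of real tokens contribute to the loss; the padded tail is ignored.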

## Step 7: Compile the Model

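Compile the model with the Adam optimizer. No loss function is passed: Hugging Face TensorFlow models compute their own loss when each batch includes a `labels` entry.

In [4]:
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
# Omitting the loss argument makes the model use its built-in language modeling loss.
model.compile(optimizer=optimizer)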

## Step 8: Train the Model

Train the model on the prepared dataset for three epochs.

In [5]:
model.fit(train_dataset, epochs=3)

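If this cell raises `NameError: name 'model' is not defined`, the kernel was most likely restarted without re-running the earlier cells; run the cells from Step 2 onward first.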
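
How much work `fit` does follows from the dataset size: each epoch runs one step per batch. A rough calculation is sketched below, using a made-up number of training sequences (the real count depends on the tokenized text); `training_steps` is a hypothetical helper.

```python
import math


def training_steps(num_sequences, batch_size, epochs):
    """Return (steps per epoch, total steps) when the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_sequences / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# 3,000 sequences is an illustrative figure, not a measured one.
print(training_steps(num_sequences=3000, batch_size=8, epochs=3))  # (375, 1125)
```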

## Step 9: Save the Model

Save the fine-tuned model for later use.

In [6]:
model.save_pretrained('fine_tuned_gpt2_harry_potter')
tokenizer.save_pretrained('fine_tuned_gpt2_harry_potter')

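
Once saved, the model can be loaded back from the same directory and used to generate text. The function below is a minimal sketch, assuming the directory written in this step exists. `generate_text` is a hypothetical helper, and the sampling settings are illustrative rather than tuned.

```python
def generate_text(prompt, model_dir="fine_tuned_gpt2_harry_potter", max_new_tokens=50):
    """Load the saved model and tokenizer and sample a continuation of `prompt`."""
    # Imported here so the heavy libraries load only when generation is requested.
    from transformers import AutoTokenizer, TFAutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = TFAutoModelForCausalLM.from_pretrained(model_dir)

    inputs = tokenizer(prompt, return_tensors="tf")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,  # sample instead of greedy decoding
        top_k=50,        # illustrative sampling setting, not tuned
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_text("Harry looked at")` would then return the prompt followed by up to 50 sampled tokens.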