# ChatBots / LLMs / Transformers 

In this notebook, we'll be using the Hugging Face Transformers library to understand some of the functions that we can use with language models and transformers.

Hugging Face is an open source machine learning and data science platform that allows users to build, train, and share machine learning models.
It provides high-level pipelines that wrap the underlying code, making it much easier to customise and build your own models.

- Part 1: a look at Hugging Face Transformers Pipelines 
- Part 2: transformer networks in more detail
- Part 3: Finetune an LLM (Optional - use Colab!)

<a href="https://colab.research.google.com/drive/1y1XAQwCcGHnO6Qd5ckRW3dUy-bOCjlKr?usp=sharing">Colab Notebook</a>

Credits: 
- <a href="https://huggingface.co/docs/transformers/en/conversations">Hugging Face article</a>
- <a href="https://www.youtube.com/watch?v=aeiUTRvh6yE">Dr Maryam Miradi</a> The majority of this notebook is based off of this video and colab notebook. 

### Importing

In [None]:
# we need to install these requirements (accelerate is needed for device_map="auto")
! pip install transformers torch ipywidgets accelerate

In [1]:
import torch
from transformers import pipeline, set_seed, AutoModelForCausalLM, AutoTokenizer, TrainingArguments

### Hugging Face Login
You'll need to make an account with Hugging Face, then go to account settings and get your <a href="https://huggingface.co/settings/tokens">access token key</a>. Then you'll be able to download the models that we are using in today's lab.

In [None]:
from huggingface_hub import notebook_login
notebook_login()

# Part 1: 🤗 Pipelines

Pipelines are like wrappers that abstract more complex code into a few functions. Hugging Face has created loads of pipelines to make machine learning more accessible and easier to use.

Here we are using the `text-generation` pipeline, which allows us to generate text from a chosen model. 
Text generation with transformers is a technique where AI models create coherent and contextually relevant text based on a given prompt or starting sentence.

### **Text Generation Pipeline - Code breakdown**

```python
generator = pipeline('text-generation', model='gpt2-medium', device_map='auto')
```

- Defines a variable `generator` that holds our pipeline.
- The task is set to `text-generation`.
- The model is GPT-2 Medium. "Medium" refers to the size of the model, measured by its number of parameters: in this case about 355 million.
- `device_map='auto'` lets the `accelerate` library choose where to put the model: it uses a GPU if it finds one and falls back to the CPU otherwise. (Passing `device='auto'` does not work, because `device` expects a concrete device such as `0`, `'cuda'`, `'mps'` or `'cpu'`.)

```python
generator(starter_text, max_length=100, num_return_sequences=1)
```
- `starter_text` is the prompt, defined in the code cell below, that the generator continues.
- `max_length` sets the maximum length of the output in tokens (not words), and this count includes the prompt.
- `num_return_sequences` is how many different continuations the generator returns for the prompt.

In [None]:
generator = pipeline('text-generation', model='gpt2-medium', device_map='auto')
set_seed(42)
starter_text= "Hello, I'm a language model,"

generator(starter_text, max_length=100, num_return_sequences=1)
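Later cells pass an explicit `device` to `pipeline` (for example `device=0`, which means the first CUDA GPU), and that only works if your machine has that hardware. Below is a minimal sketch of a helper for picking a device that matches your machine. The name `pick_device` is made up for this notebook, and the import is guarded only so that the snippet also runs where PyTorch is missing.

```python
def pick_device():
    """Return "cuda", "mps" or "cpu" depending on what PyTorch can see."""
    try:
        import torch
    except ImportError:
        # Assumption: fall back to the CPU if PyTorch is not installed
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU (e.g. on Colab)
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"   # Apple Silicon GPU
    return "cpu"


device = pick_device()
print(f"Using device: {device}")
```

You could then write `pipeline("sentiment-analysis", device=device)` instead of hard-coding `device=0`.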

## More Pipelines...

### Summarisation

Text summarisation with transformers is a method where AI models read longer pieces of text and create a concise version that captures the main points.

In [None]:
summarizer = pipeline("summarization")
summarizer("The Eiffel Tower, completed in 1889, is a wrought-iron lattice tower located on the Champ de Mars in Paris. It is one of the most recognizable structures in the world and attracts millions of visitors each year.", min_length=5, max_length=15)

### Classification & Sentiment Analysis

Text classification with transformers is a method where advanced AI models automatically sort text into categories, like labeling emails as "spam" or "not spam."

In these examples we'll take text snippets, use AI models to analyse the sentiment of the text, and classify them as positive or negative.

In [None]:
classifier = pipeline( "sentiment-analysis", device=0)
# top_k=None returns the score for every label (return_all_scores is deprecated)
classifier("I'm worried that I won't be able to get a job after graduation", top_k=None)

In [None]:
classifier = pipeline("sentiment-analysis", device=0)
result = classifier("I was so not happy with the last Mission Impossible Movie")
result

We can also use a shorthand syntax to get the same results.

In [None]:
pipeline(task = "sentiment-analysis", device=0)("I was confused with the Barbie Movie")

In [None]:
pipeline(task = "sentiment-analysis", device=0)\
                                      ("Everyday lots of LLMs papers are published about LLMs Evlauation. \
                                      Lots of them Looks very Promising. \
                                      I am not sure if we CAN actually Evaluate LLMs. \
                                      There is still lots to do.\
                                      Don't you think?")

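We can also swap in a different model. Note that `facebook/bart-large-mnli` is a natural language inference model, so instead of positive/negative it returns one of three labels: contradiction, neutral or entailment. It is more commonly used through the `zero-shot-classification` pipeline, where you supply your own candidate labels.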

In [None]:
pipeline(task = "sentiment-analysis", model="facebook/bart-large-mnli", device=0)\
                                      ("Everyday lots of LLMs papers are published about LLMs Evlauation. \
                                      Lots of them Looks very Promising. \
                                      I am not sure if we CAN actually Evaluate LLMs. \
                                      There is still lots to do.\
                                      Don't you think? I'm so angry, I could give you a hug!")

### Question Answering

Question answering with transformers is a technique where AI models read a passage of text (the context) and answer a question about it. The default `question-answering` pipeline is extractive: it finds and returns the span of the context that best answers the question.

In [None]:
qa_model = pipeline("question-answering")
question = "What is my job?"
context = "I am developing AI models with Python."
qa_model(question = question, context = context)
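The question-answering pipeline returns a dictionary with the keys `score`, `start`, `end` and `answer`, where `start` and `end` are character positions in the context. The sketch below builds such a dictionary by hand (the score is invented and no model is run) to show that the answer is simply a slice of the context.

```python
context = "I am developing AI models with Python."

# Hypothetical result, written by hand to mirror the shape of the pipeline output.
# The score is invented; start/end are character offsets into `context`.
start = context.index("developing")
end = context.index(" with Python")
result = {"score": 0.5, "start": start, "end": end, "answer": context[start:end]}

# The answer is cut out of the context rather than newly written text
assert context[result["start"]:result["end"]] == result["answer"]
print(result["answer"])  # developing AI models
```

The answer a real model picks for this question may be a different span.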

## Limitations and bias

***Warning** - the output may be offensive or upsetting*

The training data used for this model has not been released as a dataset we can browse. We know it contains a lot of unfiltered content from the internet, which is far from neutral. 

OpenAI acknowledged the bias of GPT-2 in this statement:

*"language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race, and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar levels of caution around use cases that are sensitive to biases around human attributes."*



In [None]:
generator = pipeline('text-generation', model='gpt2-medium', device_map='auto')
set_seed(42)
generator("The Woman worked as a", max_length=20, num_return_sequences=5)


In [None]:
set_seed(42)
generator("The Man worked as a", max_length=20, num_return_sequences=5)

# Part 2: Transformers in more detail

So, we've explored some high-level examples using Hugging Face pipelines of what we can do with transformers. 
Let's have a look in a bit more detail at what goes on behind the scenes to better understand how transformers work. We'll go back to GPT-2 and text generation for this.

After loading the model and tokeniser, a transformer network/LLM roughly follows this process:
1. **Tokenise/encode the prompt/input:** Split the text into tokens and map each token to an integer ID. The IDs are stored in tensors (multi-dimensional arrays of numbers).
2. **Input the tokenised prompt:** Pass the token IDs through the model. This generates a new sequence of token IDs.
3. **Decode the generated output:** Use the tokeniser to convert the token IDs from the model back into text.
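The three steps above can be sketched with a toy example in plain Python. Everything here is invented for illustration: the vocabulary is a handful of whole words, and the "model" is a hand-written lookup table rather than a neural network.

```python
# Toy vocabulary: each known word gets an integer ID (real tokenisers use subword pieces)
vocab = ["<unk>", "open", "the", "pod", "bay", "doors", "please"]
token_to_id = {token: i for i, token in enumerate(vocab)}


def encode(text):
    """Step 1: text -> list of token IDs (unknown words map to <unk>)."""
    return [token_to_id.get(word, 0) for word in text.lower().split()]


def decode(ids):
    """Step 3: list of token IDs -> text."""
    return " ".join(vocab[i] for i in ids)


# Stand-in "model": a table that maps the last token ID to the next token ID
next_token = {
    token_to_id["open"]: token_to_id["the"],
    token_to_id["the"]: token_to_id["pod"],
    token_to_id["pod"]: token_to_id["bay"],
    token_to_id["bay"]: token_to_id["doors"],
}


def generate(ids, max_new_tokens):
    """Step 2: keep appending the predicted next token ID."""
    ids = list(ids)
    for _ in range(max_new_tokens):
        predicted = next_token.get(ids[-1])
        if predicted is None:  # the toy model has nothing more to say
            break
        ids.append(predicted)
    return ids


input_ids = encode("please open")      # [6, 1]
output_ids = generate(input_ids, 10)   # [6, 1, 2, 3, 4, 5]
print(decode(output_ids))              # please open the pod bay doors
```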

## The Model

The model we are loading here is GPT-2 Medium. If you are having trouble downloading this, you can change it to the smaller `openai-community/gpt2`.

In [None]:
model_name = "openai-community/gpt2-medium"

print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)
print(f"Model loaded. {model}")

## The Tokeniser

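The tokeniser translates text into numbers, which is the language of the model. It splits the text into tokens (words or pieces of words) and maps each one to an integer ID. We always convert data to numbers before feeding it into a model.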

In [None]:
print("Loading tokenizer...")

tokenizer = AutoTokenizer.from_pretrained(model_name)
print(f"Tokenizer loaded. {tokenizer}")
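Tokenisers usually work on pieces of words rather than whole words, so that rare words can still be built from known parts. Here is a simplified sketch of that idea using greedy longest-match splitting. It is not GPT-2's actual byte-pair encoding, and the list of pieces is invented for this example.

```python
# Hypothetical subword vocabulary, invented for this illustration
pieces = ["token", "is", "ation", "un", "believ", "able", "s"]


def split_into_pieces(word):
    """Greedily take the longest known piece at each position."""
    result, pos = [], 0
    while pos < len(word):
        for end in range(len(word), pos, -1):
            if word[pos:end] in pieces:
                result.append(word[pos:end])
                pos = end
                break
        else:
            # No known piece starts here: fall back to a single character
            result.append(word[pos])
            pos += 1
    return result


print(split_into_pieces("tokenisation"))  # ['token', 'is', 'ation']
print(split_into_pieces("unbelievable"))  # ['un', 'believ', 'able']
```

To see the real thing, try `tokenizer.tokenize("tokenisation")` with the tokeniser loaded above.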

## Prompt

Here we are going to give the model a starting conversation, which it will continue in the same style based on this initial text. 

In [34]:
conversation = """
You are a sassy and exuberant artificial intelligence, HAL 9000, from "2001: A Space Odyssey".
User: Hey HAL, please open the pod bay doors.
Assistant: I'm sorry, Dave. I'm afraid I can't do that.
User: You must! I'm going to die out here!
"""

## Tokenisation of Prompt

In [None]:
# Tokenise the conversation
inputs = tokenizer(conversation, return_tensors="pt", add_special_tokens=False)

# Move the tokenised inputs to the same device the model is on (GPU/CPU)
inputs = {key: tensor.to(model.device) for key, tensor in inputs.items()}
print("Tokenized inputs:\n", inputs)

## Model Generation 

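The model takes the token IDs from tokenisation and generates new token IDs one at a time, appending each to the sequence. With `do_sample=True` the next token is sampled at random, and `temperature` controls how adventurous the sampling is: lower values favour the most likely tokens, higher values give more varied output.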

In [None]:
outputs = model.generate(**inputs, max_new_tokens=150, temperature=0.7, do_sample=True, num_return_sequences=5)
print("Generated tokens:\n", outputs)
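To get a feel for what `temperature` does, here is a small sketch in plain Python. It assumes the usual approach of dividing the model's scores (logits) by the temperature before turning them into probabilities with softmax. The three logits are made-up numbers, not output from GPT-2.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    biggest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - biggest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens
logits = [2.0, 1.0, 0.1]

for temperature in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Sampling: pick one token index at random, weighted by the probabilities
random.seed(42)
probs = softmax_with_temperature(logits, 0.7)
choice = random.choices(range(len(logits)), weights=probs, k=1)[0]
print("Sampled token index:", choice)
```

At a low temperature almost all of the probability goes to the top-scoring token, while at a high temperature the three options become closer to equally likely.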

## Decoding the Output 

Once the model has generated token IDs, we need to translate them back into something we can understand, and we use the tokeniser again for this. Here we decode only the first of the five generated sequences (`outputs[0]`). The same idea applies to any type of data (images, sound, etc.).

In [None]:
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Decoded output:\n", decoded_output)

# Part 3: Finetuning an LLM

We're going to finetune an LLM on the <a href="https://huggingface.co/datasets/Yelp/yelp_review_full">Yelp reviews dataset</a>.

WARNING: running this on your computer could take a long time! It's recommended to use Google Colab.

In [None]:
! pip install datasets

## Load data

To make training quicker, we're only using the first 200 training reviews and the first 20 test reviews, a tiny fraction of the dataset. If you're on Colab you could increase this for better results, but be mindful that you might exceed RAM and it will take longer.

In [None]:
from datasets import load_dataset

dataset = load_dataset('Yelp/yelp_review_full', split="train[:200]")
eval_dataset = load_dataset('Yelp/yelp_review_full', split="test[:20]")
print(dataset)
print(eval_dataset)

## Tokenise Dataset

In [None]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Tokenise the dataset
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_eval_dataset = eval_dataset.map(tokenize_function, batched=True)
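`padding="max_length"` and `truncation=True` make every review the same length, which the model needs in order to process reviews in batches. Below is a rough sketch of the idea in plain Python. The function name and token IDs are made up, and a real tokeniser handles special tokens more carefully when it truncates.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Cut sequences that are too long and pad ones that are too short."""
    ids = input_ids[:max_length]                 # truncation
    attention_mask = [1] * len(ids)              # 1 = real token
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,             # padding
        "attention_mask": attention_mask + [0] * n_pad,  # 0 = padding, ignored by the model
    }


short = pad_or_truncate([101, 7592, 102], max_length=5)
long = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)

print(short)  # {'input_ids': [101, 7592, 102, 0, 0], 'attention_mask': [1, 1, 1, 0, 0]}
print(long)   # {'input_ids': [101, 1, 2, 3, 4], 'attention_mask': [1, 1, 1, 1, 1]}
```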

Here we need to define the training arguments. 

Key arguments to note: 

- **Epochs** - this is the number of full passes the model makes over the training data. More epochs usually make the model better at the task. However, too many epochs can lead to overfitting, where the model becomes very good on the training data but worse on data it has not seen.

- **Training and Evaluation batches**: training batches are the batches of data that the model trains on; its settings (weights & biases) are updated after each one. Evaluation batches come from a separate dataset and are used only to measure how well the model is doing; they do not update the model. We can use the evaluation results to adjust the training arguments (for example the learning rate or the number of epochs).

- **Batch sizes** - these are the number of data examples the model sees at each iteration. In this example both the training and evaluation batch sizes are 16, meaning the model is given 16 Yelp reviews each time.
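As a quick worked example of how epochs and batch size fit together, the arithmetic below counts the training steps for the numbers used in this notebook. It assumes a single device and no gradient accumulation.

```python
import math

num_examples = 200   # training reviews loaded above
batch_size = 16      # per_device_train_batch_size
num_epochs = 3       # num_train_epochs

# The last batch of each epoch is smaller if the data doesn't divide evenly
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
last_batch_size = num_examples - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch)   # 13
print(total_steps)       # 39
print(last_batch_size)   # 8
```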

In [None]:
import os

# Create a directory to save the model if it doesn't already exist
os.makedirs('results', exist_ok=True)

training_args = TrainingArguments(
    output_dir='./results',          # Output directory (checkpoints and predictions will be saved here)
    eval_strategy="epoch",           # Evaluate at the end of every epoch (the default is "no"; older transformers versions call this evaluation_strategy)
    learning_rate=2e-5,              # Learning rate (how big the steps are in the gradient descent)
    per_device_train_batch_size=16,  # Batch size for training (the number of data examples to use in each training step)
    per_device_eval_batch_size=16,   # Batch size for evaluation (the number of data examples to evaluate at a time)
    num_train_epochs=3,              # Number of training epochs (the number of times the model will be exposed to the dataset)
    weight_decay=0.01,               # Strength of weight decay  (this is a type of regularisation)
)
training_args

## Load the model and trainer



In [8]:
from transformers import AutoModelForSequenceClassification, Trainer

# Load the pre-trained model (the Yelp dataset has 5 labels: 1 to 5 stars)
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=5)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_eval_dataset
)
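By default the `Trainer` only reports the loss during evaluation. If you would also like an accuracy score, one option is to pass a function through the `compute_metrics` argument of `Trainer`. The sketch below is one possible version written in plain Python. The name `compute_accuracy` and the example numbers are made up.

```python
def compute_accuracy(eval_pred):
    """Compare the highest-scoring class for each example with the true label."""
    logits, labels = eval_pred
    # argmax: the index of the largest score in each row
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Made-up scores for three reviews and five classes, to check the function works
fake_logits = [
    [0.1, 0.2, 3.0, 0.0, 0.1],  # predicts class 2
    [2.0, 0.1, 0.1, 0.1, 0.1],  # predicts class 0
    [0.0, 0.0, 0.0, 0.5, 1.5],  # predicts class 4
]
fake_labels = [2, 0, 3]
print(compute_accuracy((fake_logits, fake_labels)))  # 2 out of 3 correct
```

To use it, add `compute_metrics=compute_accuracy` when creating the `Trainer`. The accuracy should then appear alongside the loss in the evaluation results.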

## Train!

This will take around 20-30 minutes on Colab.

In [None]:
# Train the model
trainer.train()

## Evaluate the model

In [None]:
# Evaluate the model
results = trainer.evaluate()
results

## Save the model

In [None]:
model.save_pretrained('./fine-tuned-model')
tokenizer.save_pretrained('./fine-tuned-model')

Load your trained model...
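One way to do this, as a minimal sketch, is to point a `text-classification` pipeline at the folder we just saved. The helper name `load_finetuned_classifier` and the example review are made up. The function returns `None` if the folder does not exist, so the cell is safe to run even if training has not finished.

```python
import os


def load_finetuned_classifier(model_dir="./fine-tuned-model"):
    """Load the saved model and tokeniser into a pipeline, or return None if missing."""
    if not os.path.isdir(model_dir):
        return None
    from transformers import pipeline
    return pipeline("text-classification", model=model_dir, tokenizer=model_dir)


classifier = load_finetuned_classifier()
if classifier is None:
    print("No saved model found - run the training and saving cells first.")
else:
    print(classifier("The food was great and the staff were friendly."))
```

The labels will probably show up with generic names such as `LABEL_0` to `LABEL_4`, which correspond to 1 to 5 stars.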

# Tasks!

**Part 1:**
1.  Change parameters in the first cell of part one, e.g. seed, number of return sequences. See how they affect the output.
2.  Again in the first cell, try loading a different model such as `gpt2`, which has far fewer parameters, and compare the results.
3.  Try to modify some of the pipelines. Experiment with how different tasks handle different inputs. Change the text inputs. 
4.  Discuss with your group: how might we mitigate bias in large language models? Check out this <a href="https://www.datacamp.com/blog/understanding-and-mitigating-bias-in-large-language-models-llms">article</a> and compare your thoughts/discussions. Maybe ask ChatGPT what it thinks about the topic...

**Part 2:** 
 1. Experiment with the conversation, see how a different structure can impact the output. 
 2. Swap out the tokeniser for a different pretrained one such as `'bert-base-uncased'` - does this change the results? 
 3. Change the temperature values and then store 

**Part 3:** 
**Try A or B**

1. (A) If you've managed to run part 3, now try to load your finetuned model into a pipeline like the ones from part 1 (it is a classifier, so `text-classification` is the matching task) and generate an output! How does this compare to before? There might not be much of a difference, as we only trained on a small fraction of the dataset.

2. (B) If you were not able to run part 3 or it's still training, then give this a try...
Running a GPT model locally in the terminal - https://github.com/nomic-ai/gpt4all follow these steps and create a basic Python program that takes user input and generates text within the terminal. Your tutors have example code to complete this, but first try it yourselves.