# State of GPT

## How to train GPT assistants

### Emerging recipes for training GPT

- keep in mind that this is all very new and rapidly evolving

![training gpt pipeline](assets/training_gpt_pipeline.png)

- we have 4 major stages, and they follow each other serially
    - pretraining (1 stage)
    - fine-tuning (3 stages)
        - supervised fine-tuning (SFT)
        - reward modeling
        - reinforcement learning
- in each stage we have a dataset, an algorithm, and a resulting model

### Pretraining Stage

- this stage is where all the computational work basically happens
    - this is 99% of the training and compute time
- this is where we deal with internet scale data, thousands of GPUs, and months of training
- the resulting model of this stage is called the `"base model"`

#### Dataset Collection

- we will gather a large amount of data
- an example of what is called a data mixture, taken from the paper released by Meta (which released the LLaMA model)

![image.png](assets/dataset_collection.png)

- we have common crawl (web scraping)
- C4 (also web scraping)
- then some high quality datasets as well
    - GitHub
    - Wikipedia
    - Books
    - and so on

- these are all mixed together and sampled according to some proportions (that forms the data mixture that is used for pretraining)
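
To make "sampled according to some proportions" concrete, here is a minimal sketch of drawing the source of each training document from a weighted mixture. The weights, the `ArXiv` and `StackExchange` entries, and the `sample_sources` helper are assumptions for illustration, not values read off the figure.

```python
import random
from collections import Counter

# Assumed sampling proportions, for illustration only (they sum to 1.0).
MIXTURE = {
    "CommonCrawl": 0.67,
    "C4": 0.15,
    "GitHub": 0.045,
    "Wikipedia": 0.045,
    "Books": 0.045,
    "ArXiv": 0.025,
    "StackExchange": 0.02,
}


def sample_sources(mixture, n, seed=0):
    """Pick the source dataset for each of the next n training documents."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


counts = Counter(sample_sources(MIXTURE, 10_000))
for name, count in counts.most_common():
    # The empirical share should land close to the configured proportion.
    print(f"{name:14s} {count / 10_000:.3f}")
```

With enough samples, the share of each source approaches its weight, so small but high-quality sources can be given more or less influence just by changing the weights.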


#### Tokenization

![image-2.png](assets/tokenization.png)


#### Hyperparameter examples for pretraining stage 
![image-3.png](assets/pre_training_examples.png)

- these are roughly the orders of magnitude that we are dealing with when we are pretraining these models
    - the vocabulary size is a few tens of thousands of tokens
    - the context length is around 2048 tokens (nowadays even 100,000); this governs the maximum length of the input that the model looks at when it is trying to predict the next token
    - notice that LLaMA has 65B parameters, far fewer than GPT-3's 175B, but it is significantly more powerful because it is trained for significantly longer and on more data (1.4T tokens instead of 300B tokens for GPT-3)
        - so, we should not judge the power of a model by the number of parameters alone
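
A quick way to see the point about parameters versus data is to compute training tokens per parameter from the figures quoted above. This is only arithmetic on those numbers; the `tokens_per_parameter` helper is a name made up for this sketch.

```python
# Figures quoted in the notes above.
models = {
    "GPT-3": {"params": 175e9, "tokens": 300e9},
    "LLaMA": {"params": 65e9, "tokens": 1.4e12},
}


def tokens_per_parameter(params, tokens):
    """How many training tokens the model saw for each of its parameters."""
    return tokens / params


for name, m in models.items():
    ratio = tokens_per_parameter(m["params"], m["tokens"])
    print(f"{name}: {ratio:.1f} training tokens per parameter")
# GPT-3: 1.7 training tokens per parameter
# LLaMA: 21.5 training tokens per parameter
```

So the smaller model saw roughly an order of magnitude more data per parameter.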
    

#### Pretraining

![image-4.png](assets/pre_training_batches.png)

- we take all the documents and append the `EOS` token at the end of each document, then concatenate all the documents into one long stream of tokens
- we then cut this stream into chunks of context length and create batches of them (just like what we did in the previous notebook)
- so we basically took these documents, packed them into rows, and delimited them with the `EOS` token, which tells the transformer where a new document begins

- the rest is as we studied in the previous notebook
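
The packing described above can be sketched in a few lines. This assumes the documents are already tokenized into integer ids and that `EOS` has id 0; both are assumptions for illustration, and dropping the leftover tail is just one simple choice.

```python
EOS = 0  # assumed id of the end-of-sequence token


def pack_documents(tokenized_docs, context_length):
    """Concatenate documents, each followed by EOS, then cut into fixed-length rows."""
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(EOS)
    # Drop the ragged tail so every row has exactly context_length tokens.
    n_rows = len(stream) // context_length
    return [stream[i * context_length:(i + 1) * context_length] for i in range(n_rows)]


docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
rows = pack_documents(docs, context_length=4)
print(rows)
# [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]]
```

Notice that the second row contains the end of one document and the start of the next; the `EOS` token is what tells the model that they are unrelated.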

![image-5.png](assets/training_example.png)


#### Training curve examples

- these are the kind of plots we look at when we are doing model pre-training (the loss curve)

![image-6.png](assets/training_curve_examples.png)

#### Model result (base model)

- the first thing that the field noticed is that the base model learns very powerful and general representations, and that it is possible to fine-tune them for any arbitrary task we want

- so if we are interested in sentiment classification
    - the old approach is to collect a bunch of positive and negative reviews and train an NLP model for that
    - the new approach is to do large language model pretraining, train a base model, and then fine-tune it on the sentiment classification task using a few examples
        - it works very well in practice because the base model was forced to multi-task a huge number of tasks within the language modeling task (it is forced to understand a lot about the structure of language and the different concepts just for the sake of predicting the next token)

- all that was GPT-1

- around the time of GPT-2, people noticed that, even better than fine-tuning, we can actually prompt these base models effectively
    - these are language models, and they want to complete documents, so we can trick them into performing tasks just by arranging the prompt in a specific way

    - for example, to get the base model to do question answering, we can give it a document containing a passage followed by multiple Q-A pairs (called a few-shot prompt), then a question, and it will complete the document by answering the question

- an example of prompt engineering a base model
![image-7.png](assets/prompt_engineering_base_model.png)
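
As a small illustration of the few-shot idea, the sketch below assembles a passage, a few solved Q-A pairs, and an open question into one document for a base model to complete. The `Q:`/`A:` layout and the `build_few_shot_prompt` helper are assumptions chosen for this example, not a format required by any model.

```python
def build_few_shot_prompt(passage, examples, question):
    """Lay out a passage, solved Q-A pairs, and one open question as a single document."""
    lines = [passage.strip(), ""]
    for q, a in examples:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
        lines.append("")
    lines.append(f"Q: {question}")
    lines.append("A:")  # left open so the model continues with the answer
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    passage="The Eiffel Tower is in Paris. It was completed in 1889.",
    examples=[
        ("Where is the Eiffel Tower?", "Paris"),
        ("Is the Eiffel Tower in London?", "No"),
    ],
    question="When was the Eiffel Tower completed?",
)
print(prompt)
```

The model is never told to "answer"; the document is simply arranged so that the most natural continuation is the answer.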

- this kicked off the era of prompting over fine-tuning, as people saw that this can work extremely well on a lot of problems even without any fine-tuning

- since then, we have seen a lot of base models being released, though not all of them are available; currently the best available base models are the LLaMA series from Meta (not commercially licensed)
    - that is because the GPT models most people interact with (such as ChatGPT and GPT-4) are not base models, they are GPT assistants (fine-tuned models)

- so, notice that the base models are not assistants; they don't want to answer our questions or do tasks for us, they just want to complete documents
    - so, if we tell them to write a poem about something, or to tell us a story about something, they will probably complete it with more questions or some other undefined behavior
    - but if we prompt them with a part of a poem or a story, they will complete it

![image-8.png](assets/base_models_are_not_assistants.png)

- we can even trick base models into being assistants
    - we can do this by creating a specific few-shot prompt that makes it look like there is a conversation going on between a human and an assistant who are exchanging information, then putting our query at the end; the base model will condition itself on the document (condition itself into being a helpful assistant and answer)
        - although this is not very reliable, and doesn't work super well in practice

![image-9.png](assets/few_shot_prompt_base_models.png)
- so, we have a different path to make actual GPT assistants, not just base models (document completers)
    - so that takes us into the 3 fine-tuning stages (mine: so do we do the fine-tuning if we want them to be assistants?)

### Supervised Fine-Tuning (SFT)

- in the supervised fine-tuning stage, we collect small but high-quality datasets
- we ask human contractors to gather data of the form prompt -> ideal response
    - typically 10K-100K of these are collected
- so, we still do language modeling; nothing changes algorithmically, we are just swapping out the training set
    - it used to be internet documents, which are high quantity and low quality
    - now it is high quality and low quantity (Q-A prompt-response data)

- after training on this stage we get an SFT model
    - we can deploy these models; they are actual assistants and they work to some extent
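
To show what "nothing changes algorithmically" means, here is a minimal sketch that turns one prompt -> response pair into an ordinary next-token-prediction example. The text template and the character-level "tokenizer" are assumptions for illustration; real projects pick their own template and use a real tokenizer.

```python
def format_example(prompt, response):
    """Join a prompt and its ideal response into one training document."""
    # The template below is an assumption; it is not taken from the talk.
    return f"### Prompt:\n{prompt}\n### Response:\n{response}"


def to_training_pair(token_ids):
    """Standard language-modeling shift: predict token t+1 from the tokens up to t."""
    return token_ids[:-1], token_ids[1:]


text = format_example("Name a primary color.", "Red.")
ids = [ord(ch) for ch in text]  # toy stand-in for a real tokenizer
inputs, targets = to_training_pair(ids)

print(len(ids), len(inputs), len(targets))
```

The `(inputs, targets)` pair has exactly the same shape as in pretraining; only the source of the text is different.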


#### SFT Dataset example

- here is something that a human contractor might come up with

![image.png](assets/sft_dataset_example.png)

in the image we see a prompt and the ideal response by a human following some instructions

### RLHF models: reinforcement learning from human feedback

- RLHF consists of stages 3 and 4
    - reward modeling
    - reinforcement learning


#### Reward Modeling
- in the reward modeling step, we shift the data collection to the form of comparisons
    - we have the same prompt, and then we take the SFT model (already trained on human data) and create multiple completions
    - then we ask people to rank these completions
        - this can take people hours even for a single prompt -> completions pair

    ![image.png](assets/RM_dataset.png)
    
    - then, we follow that with something that looks very much like binary classification on all the possible pairs between these completions
    ![image-2.png](assets/reward_modeling.png)
        - we lay out the rows so that the prompt is identical across them (same prompt in blue) with different completions (in yellow), then append a special reward token (in green) to each completion (mine: a reward token followed by the actual score)
            - and supervise the transformer only at the single green token (mine: that is why we said it is something like binary classification)
            - so the transformer will predict some reward for how good that completion is for that prompt
                - and we use the ranks given by people to make it predict higher rewards for the better completions (we formulate it into a loss function) and train the model to make reward predictions that are consistent with the ground truth coming from the comparisons of the contractors
        - now it will be able to score how good a completion is for a prompt
    - once trained, this is not a model that we deploy, it is a model that we use to generate rewards for the next stage
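
The notes only say the objective "looks very much like binary classification" over pairs. One common way to write such a loss, used here as an assumption rather than as the exact formula from the talk, is to penalize `-log(sigmoid(r_better - r_worse))` for every pair of completions, where `r` is the reward predicted at the green token.

```python
import math
from itertools import combinations


def pairwise_ranking_loss(rewards_best_first):
    """Average of -log(sigmoid(r_better - r_worse)) over all pairs.

    rewards_best_first: predicted rewards, ordered from the completion humans
    ranked best to the one they ranked worst.
    """
    losses = []
    # combinations() preserves list order, so `better` is always ranked above `worse`.
    for better, worse in combinations(rewards_best_first, 2):
        margin = better - worse
        losses.append(math.log1p(math.exp(-margin)))  # equals -log(sigmoid(margin))
    return sum(losses) / len(losses)


agrees = pairwise_ranking_loss([2.0, 0.5, -1.0])     # predictions follow the human ranking
disagrees = pairwise_ranking_loss([-1.0, 0.5, 2.0])  # predictions invert the human ranking
print(f"{agrees:.3f} < {disagrees:.3f}")
```

The loss is small when the predicted rewards agree with the human ordering and large when they contradict it, which is what pushes the reward model toward the contractors' comparisons.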


#### Reinforcement Learning


- we can now score the quality of any completion for any prompt
- so, we get a large collection of prompts, and now we do reinforcement learning with respect to the reward model

![image-3.png](assets/RL_learning.png)

- now we take the model initialized from the SFT model, take a single prompt, generate completions, and append the reward token
- then we read off the reward according to the reward model, which is now trained
    - it tells us the quality of every single completion for every prompt
- then we apply the same language modeling loss on the yellow tokens (completions), but weighted by the rewards given by the reward model
    - for example in the above image, the first completion is given a reward of 1.0 (high reward), so all the tokens we sampled in the first row will be reinforced and get higher probabilities in the future
    - the second completion was given a reward of -1.2 (bad reward), so every single token we sampled in the second row will get a lower probability in the future

- after training this way, the model's completions score high according to the reward model trained in the previous stage
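
The "weighted by rewards" idea can be illustrated with a toy policy-gradient step. This is a deliberately simplified sketch: the "policy" chooses between three whole completions instead of generating tokens, and the update is plain REINFORCE, whereas real RLHF pipelines use more elaborate algorithms (such as PPO) over token sequences. The rewards 1.0 and -1.2 are the ones from the example above; everything else is assumed.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.5):
    """One gradient-ascent step on reward * log p(action)."""
    probs = softmax(logits)
    # d/d logit_i of log p(action) is 1[i == action] - p_i
    return [
        logit + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


logits = [0.0, 0.0, 0.0]  # toy policy over three possible completions
before = softmax(logits)
logits = reinforce_step(logits, action=0, reward=1.0)   # completion 0 was sampled, reward 1.0
logits = reinforce_step(logits, action=1, reward=-1.2)  # completion 1 was sampled, reward -1.2
after = softmax(logits)

print([round(p, 3) for p in before])
print([round(p, 3) for p in after])
```

After the two updates, the rewarded completion becomes more likely and the penalized one becomes less likely, which is the behavior described in the bullets above.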

#### Why do we want to do RLHF?

- it just works better 
    - humans just prefer tokens that come from RLHF models, compared to SFT models, and also compared to base models that are prompted to be assistants

![image-4.png](assets/performance_vs_finetuning.png)

- it is easier to compare than to generate
    - a contractor might find it hard to write a good poem for the SFT dataset, but will find it much easier to compare poems that were generated (the dataset for the reward model)
    - so it is a better way to leverage human judgment to create better models


- note that there are some cases where base models are better than RLHF models
    - people noticed that RLHF models (and SFT) lose some diversity in the completions
        - in other words, they output samples with lower variation (since we aligned them towards some tasks)
    - so, if we have a case where we have n things and want to generate more things like them (makemore), the base models are much better

![image-5.png](assets/mode_collapse.png)

- so we have base models
- we have SFT models
    - like Vicuna-13B and so on
- we have RLHF models
    - like ChatGPT and Claude

so the best assistants are RLHF models

## How to use GPT assistants effectively for our applications

- we are done with the training, and now we want to see how to apply GPT assistants to our problems 

### Human Text Generation vs LLM Text Generation

- how would a human generate this sentence `California's population is 53 times that of Alaska`? 
    - first let's get their populations
        - I know that I don't know their populations off the top of my head, so I will look them up
            - [uses wikipedia]: ok California has 39.2M
            - [uses wikipedia]: ok Alaska has 0.74M
    - now I should divide one by the other
        - I know that I can't do that division in my head (as we have decimals), so I will rely on a calculator
            - [uses calculator]: 39.2/0.74 = 53
    - maybe I do some reflection on the result: does 53 make sense? it is quite a large ratio
        - but then California is the most populous state, and Alaska is one of the least populous states, so it makes sense
    - now we write the sentence, and we might delete and modify the phrasing (as we are writing we have this separate process of inspecting what we are writing and judging whether it is good or not, and if not we may delete or reframe it)


- how would a GPT generate the same sentence?

    ![image.png](assets/sentence_generation_llms.png)

    - it just goes "Token Token Token Token"
    - each token takes roughly the same amount of computational work (unlike us: we spend different amounts of time on different parts of the sentence, like the calculation of 53)
        - they are not very shallow when generating these tokens though; they have something like 80 layers of reasoning, but that is still not too much
    - they are just token simulators (even after RLHF, they are just simulating responses that are more pleasing to humans), and, unlike humans, they don't know when they don't have some information
        - they don't know what they are good at and what they are not good at, they just do their best to imitate the next token
    - they also don't reflect on their generations, apply sanity checks to them, or correct their mistakes by deleting some tokens
    - however, they do have cognitive advantages over humans (they have fact-based knowledge across a vast number of areas, as they have billions of parameters, so that is a lot of storage for a lot of facts)
        - they also have a relatively large and perfect working memory
        - whatever fits into the context window is immediately available to the transformer through the self-attention mechanism
            - so it is like a perfect memory where all the information is already accessible, but this memory is finite

- why did we make this comparison?
    - that is because prompting is just making up for the differences between LLMs and humans
    - if our tasks require reasoning, we can't expect the transformer to do too much reasoning per token, so we have to spread out the reasoning across more and more tokens
        - so we can't give the model a very complicated question and expect it to answer it in a single token (there is not enough time to think)
        - in other words, models need tokens to think
            - so we may use few-shot prompts that show the transformer how the reasoning should be done (give it examples of the reasoning and it will imitate it)
            
            ![image-2.png](assets/few_shot_cot.png)
            
            - we can tell it things like "let's think step by step", because this conditions the transformer into showing its work and generating more tokens, which eventually allows it to do more reasoning
            
            ![image-3.png](assets/encourage_the_model_to_think.png)



- another approach is something called `self-consistency`
    - when we write, we can reflect on what we wrote and correct our mistakes
    - so, we can get similar behaviour by sampling multiple answers from the model, then having some process for finding the good ones and keeping them, or doing some voting
        - mine: so it doesn't really correct its mistakes, but we sample a large number of answers so that if it makes a mistake, it is outvoted by the other answers (the majority of the answers)

    ![image-4.png](assets/ensemble_multiple_attempts.png)
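
The voting step can be sketched as a simple majority vote over sampled answers. The `noisy_model` function below is a hypothetical stand-in for sampling one final answer from an LLM (it is not a real API), and the probabilities in it are made up.

```python
import random
from collections import Counter


def majority_vote(answers):
    """Return the most common answer and the share of samples that agree with it."""
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)


def noisy_model(question, rng):
    """Hypothetical stand-in for sampling one final answer from an LLM."""
    # Assumed behaviour: right 60% of the time, otherwise one of two wrong answers.
    return rng.choices(["53", "56", "35"], weights=[0.6, 0.2, 0.2], k=1)[0]


def self_consistent_answer(question, n_samples=25, seed=0):
    rng = random.Random(seed)
    answers = [noisy_model(question, rng) for _ in range(n_samples)]
    return majority_vote(answers)


print(majority_vote(["53", "56", "53", "35", "53"]))
print(self_consistent_answer("How many times larger is California's population than Alaska's?"))
```

A single sample can be wrong, but as long as the correct answer is the most likely one, the vote over many samples tends to recover it.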


- another approach is to simply ask them for reflection :D
    - they actually know (especially later models like GPT-4) when they screw up
    - we have to prompt for it, because without us prompting them for reflection, they don't know to revisit their answers and correct them

![image-5.png](assets/self_reflection_in_later_llms.png)

- people are still playing with prompting and trying to figure out how to get the best out of these models (and bring back the abilities that we have over the LLMs in our thinking)
    - a paper suggested `tree of thought`
    - we maintain multiple completions for a given prompt, score them along the way, and keep the ones that make sense
    - generally, people are exploring more techniques
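
The "maintain multiple completions, score them, keep the good ones" idea can be sketched as a small beam search. This is not the algorithm from the tree-of-thought paper, only a simplified illustration: in practice `expand` and `score` would be calls to an LLM, while here they are toy functions (assumptions) that try to reach a target sum in three steps.

```python
def tree_search(root, expand, score, beam_width, depth):
    """Keep the best `beam_width` partial solutions at every step."""
    frontier = [root]
    for _ in range(depth):
        candidates = [child for state in frontier for child in expand(state)]
        candidates.sort(key=score, reverse=True)  # best-scoring candidates first
        frontier = candidates[:beam_width]
    return frontier[0]


TARGET = 10


def expand(state):
    """Propose next 'thoughts': extend the partial solution by one step."""
    return [state + [step] for step in (1, 3, 4)]


def score(state):
    """Higher is better: how close the running sum is to the target."""
    return -abs(TARGET - sum(state))


greedy = tree_search([], expand, score, beam_width=1, depth=3)
beam = tree_search([], expand, score, beam_width=2, depth=3)
print(greedy, sum(greedy))  # [4, 4, 1] 9
print(beam, sum(beam))      # [4, 3, 3] 10
```

Keeping only the single best candidate at each step misses the target, while keeping two candidates alive finds it; that is the benefit of not committing to one line of reasoning too early.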


- notice that LLMs don't want to succeed :D they want to imitate; we want to succeed, and we should ask for it
    - when transformers are trained, their training sets contain data of varying quality (for example, for a physics question there could be wrong answers as well as expert answers that are right)
        - so by default the transformer does not favor the high-quality solutions over the low-quality ones, it will just imitate all of them (it is a language model)
        - and so at test time, we have to ask for good performance
            - for example we tell it "let's look at it step by step" (for reasoning) to make sure we get the right answer
            - or we tell it that it is an expert on the topic, so that it gives the right answer

- `tool use / plugins`: when we are trying to solve problems, we lean on tools to get some information, so people thought about doing the same thing for LLMs
    - like giving them calculators, code interpreters, and so on
    - and the ability to search
    - and since they don't know when they lack some information, we can tell them specifically in the prompt
        - like "you are not very good at arithmetic, so whenever you are dealing with large numbers, use the calculator"
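
A minimal sketch of the calculator idea: the model is told to write a marker whenever it needs arithmetic, and a small program replaces the marker with the computed value. The `CALC(...)` marker format and both helper functions are assumptions invented for this example; real plugin systems define their own protocols.

```python
import ast
import operator
import re

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculate(expression):
    """Evaluate +, -, *, / on plain numbers without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)


def run_tools(model_output):
    """Replace every CALC(...) marker with the computed value, rounded to an integer."""
    return re.sub(
        r"CALC\(([^)]*)\)",
        lambda m: f"{calculate(m.group(1)):.0f}",
        model_output,
    )


print(run_tools("California's population is CALC(39.2 / 0.74) times that of Alaska."))
# California's population is 53 times that of Alaska.
```

The model only has to decide when to call the tool; the arithmetic itself is done outside the model.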

- `retrieval-augmented LLMs`

    ![image-6.png](assets/retreival_augmented_llms.png)
    - we went from a world of retrieval-only (searching for information on the internet ourselves, no LLMs) to a world of memory-only (LLMs)
    - but there is an entire space in between of retrieval-augmented models, and it works very well in practice
        - as we mentioned, the context window of the transformer is its working memory, and if we load it with any information that is relevant to the task, it will work very well
        - so, a lot of people are interested in retrieval-augmented generation

    - the example in the image is LlamaIndex (which is a data connector)
        - we can index all the relevant data and make it accessible to LLMs
            - we take relevant documents
            - split them into chunks
            - embed all of them
            - store them in a vector store
            - and at test time, we query the vector store, fetch the relevant chunks, stuff them into the prompt, and generate

        - it is similar to how we work (we don't do everything from memory, we reference some documents or documentation to help us)
            - we remember some of it, but it is better to look up the details, and that is the inspiration for wanting LLMs to do that too
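
The chunk, embed, store, query, and stuff-into-prompt steps listed above can be sketched end to end with a toy setup. Everything here is an assumption for illustration: the "embedding" is a bag-of-words count vector instead of a learned model, the "vector store" is a Python list, and the prompt layout is made up. It is not how LlamaIndex itself is implemented.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (real systems use a learned embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def split_into_chunks(document, words_per_chunk):
    words = document.split()
    return [" ".join(words[i:i + words_per_chunk]) for i in range(0, len(words), words_per_chunk)]


def build_store(documents, words_per_chunk=10):
    """The 'vector store': a list of (chunk, embedding) pairs."""
    chunks = [c for d in documents for c in split_into_chunks(d, words_per_chunk)]
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(store, query, k=1):
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(store, query, k=1):
    context = "\n".join(retrieve(store, query, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


store = build_store([
    "California has a population of about 39.2 million people.",
    "Alaska has a population of about 0.74 million people.",
])
print(build_prompt(store, "What is the population of Alaska?"))
```

The retrieved chunk lands in the context window, so the model can read the fact instead of having to recall it from its weights.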


- `constraint prompting`
    - that is a technique for forcing certain templates in the output of LLMs
    - like enforcing that the output of the LLM is JSON, as in the example below
    
    ![image-7.png](assets/constraint_prompting.png)
    ![image-8.png](assets/constraint_prompting2.png)

    - such tools go in and mess with the probabilities of all the different tokens that come out of the transformer: they clamp the template tokens so the transformer only fills in the blanks, and they can enforce additional restrictions on what can go into these blanks
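
"Messing with the probabilities" can be illustrated by masking out every token the template does not allow before picking the next token. The logits and token names below are made up, and real constrained-decoding libraries work on the model's full vocabulary; this is only a sketch of the idea.

```python
import math


def constrained_argmax(logits, allowed_tokens):
    """Pick the highest-scoring token among those the template allows."""
    masked = {
        tok: (score if tok in allowed_tokens else -math.inf)
        for tok, score in logits.items()
    }
    return max(masked, key=masked.get)


# Made-up next-token scores from a model that would rather start chatting.
logits = {"Sure": 3.1, "{": 1.2, "[": 0.4, "Hello": 2.0}

print(constrained_argmax(logits, allowed_tokens=set(logits)))  # Sure
print(constrained_argmax(logits, allowed_tokens={"{", "["}))   # {
```

Left unconstrained, the model would start with "Sure"; once only tokens that can open a JSON value are allowed, it is forced to start with "{".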

## Fine Tuning and default recommendations

- it is the case that we can get really far with prompt engineering (one-shot or few-shot prompt), but it is also possible to fine-tune the models (actually changing the weights, like SFT -supervised fine-tuning- and RLHF -reinforcement learning from human feedback-)

    ![image.png](assets/prompt_engineering_vs_fine_tuning.png)

    - it is becoming a lot more accessible to do fine-tuning in practice, thanks to a number of techniques that have been developed and have libraries available
        - for example, parameter-efficient fine-tuning (PEFT) methods like LoRA
        - these make sure that we only train small, sparse pieces of the model (most of the model is kept clamped or fixed at the base model, and some pieces of it are allowed to change)
        - this works pretty well empirically and makes it much cheaper, as we tune only small pieces of the model
            - because most of the model is fixed, we can use very low precision inference for computing those parts, as they will not be updated by gradient descent
        - we also have a number of high-quality base models available (like LLaMA), although it is not commercially licensed right now
    - fine-tuning is a lot more technically involved. It requires a lot more expertise to do right, and requires human contractors for datasets and/or synthetic data pipelines
        - however, SFT is achievable, as it is relatively straightforward (same language modeling task but with different, task-related data)
        - but RLHF is complicated and still research territory, and it is harder to get to work
            - so, it is not advisable to try to roll your own RLHF implementation, as these things are unstable and not beginner-friendly
            - they can also potentially change rapidly
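
To get a feel for why LoRA-style fine-tuning is cheap, the sketch below counts parameters for a single weight matrix. It assumes the usual LoRA setup, where a frozen `d_out x d_in` matrix gets a trainable low-rank update made of two factors with shapes `d_out x rank` and `rank x d_in`; the layer size and rank used here are illustrative assumptions.

```python
def lora_trainable_fraction(d_in, d_out, rank):
    """Trainable adapter parameters relative to the frozen weight matrix."""
    full = d_in * d_out              # parameters in the frozen matrix
    adapter = rank * (d_in + d_out)  # parameters in the two low-rank factors
    return adapter / full


fraction = lora_trainable_fraction(d_in=4096, d_out=4096, rank=8)
print(f"{fraction:.4%}")  # 0.3906%
```

For this layer, less than half a percent of the parameters receive gradient updates, which is where most of the savings come from.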

- so, these are the default recommendations (mine: for using them in applications)

![image-2.png](assets/default_recommendations.png)

- look up prompt engineering techniques online
    - after squeezing everything out of prompt engineering (which we should stick with for a while), look at potentially fine-tuning a model for your application, but expect it to be a lot slower

![image-3.png](assets/use_cases.png)