---
title: "Building an LLM"
author: "Vahram Poghosyan"
date: "2023-01-13"
categories: ["Machine Learning", "Large Language Models", "Deep Learning"]
format:
  html:
    toc: true
    toc-depth: 5
    code-fold: true
jupyter: python3
include-after-body:
  text: |
    <script type="application/javascript" src="../../javascript/light-dark.js"></script>
---

# Steps

## Step 1: Data Curation

GPT-3 was trained on roughly 0.5T tokens; today's leading models are often trained on 5T tokens or more.

### Web-Scraping

One option is to write a script that crawls and downloads as much of the public internet as possible. That includes Wikipedia articles (which are of particular value, as sources that have been referenced from Wikipedia are often given more weight in the final output of the model), forums, books, scientific articles, news articles, code bases, etc.

For a taste of web scraping, read about my project for [scraping prices of goods and sending notifications of price drops](../web_scraping/web_scraping.ipynb). 
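Once a page has been downloaded, the raw HTML still has to be turned into plain text before it can be used as training data. The block below is a minimal sketch of that step using only Python's standard library. The `TextExtractor` class, the `html_to_text` helper, and the `sample_page` string are hypothetical names made up for illustration, and the actual download (for example with `urllib.request`) is left out so the example runs offline.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page standing in for a downloaded document.
sample_page = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>Title</h1><p>Some <b>bold</b> text.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample_page))
```

A real pipeline would add many more steps on top of this, such as deduplication and quality filtering, which are not shown here.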

### Use Public Datasets

* [Common Crawl](https://commoncrawl.org/)
* [The Pile](https://pile.eleuther.ai/)
* [Hugging Face Datasets](https://huggingface.co/docs/datasets/en/index)

Private datasets also exist, like FinPile (used by BloombergGPT).

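### Auxiliary LLMs to Generate Synthetic Data

For example, Alpaca was fine-tuned from LLaMA on instruction-following examples that were themselves generated by another LLM.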

## Step 2: Training

### Pre-Training

### Post-Training

## Step 3: Architecture Choices

### Encoder

The output is an embedding, which is most useful for classification tasks.

### Decoder

A decoder does not see the future: tokens pay no attention to later tokens in the sequence. The output is a probability distribution over the entire vocabulary, which is effectively a prediction of the next token.
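To make the "no attention to future tokens" idea concrete, here is a minimal sketch of causal masking in pure Python. It assumes we already have a square matrix of raw attention scores (one row per query position); the function name `causal_softmax` and the toy `scores` matrix are made up for illustration and are not taken from any particular library.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Keep only positions j <= i. This is equivalent to setting the
        # scores of future positions to -infinity before the softmax.
        visible = scores[i][: i + 1]
        max_score = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - max_score) for s in visible]
        total = sum(exps)
        # Future positions receive exactly zero attention weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


# Toy scores: every position scores every other position equally.
scores = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
for row in causal_softmax(scores):
    print(row)
```

With equal scores, the first token attends only to itself, the second splits its attention over the first two tokens, and so on. Everything above the diagonal stays at zero.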

### Encoder-Decoder

The original Transformer is an encoder-decoder model.


### Tweaking Architecture

An LLM outputs a probability distribution over the entire vocabulary. So, how does such a device work for the explicit task of multiple-choice answering, for example? Well, we can tokenize and pass the entire question, along with the multiple-choice answers, as raw tokens and hope the model outputs "A," "B," "C," or "D" as the most probable next token. We can, of course, do much better by constraining the choices to those tokens and comparing which of them has the highest probability. We can also template our prompts: we can teach the LLM that every prompt follows a template of "question" and "answer choices." There are many options for tweaking our models in these ways as part of pre-training or post-training.
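The constrained-choice idea can be sketched in a few lines. The example below assumes we already have the model's next-token logits as a dictionary from token string to score; `pick_choice` and `toy_logits` are hypothetical names and values for illustration, since a real model would produce one logit per vocabulary entry.

```python
import math


def pick_choice(next_token_logits, choices=("A", "B", "C", "D")):
    """Renormalize next-token logits over the allowed answer tokens only."""
    choice_logits = {c: next_token_logits[c] for c in choices}
    max_logit = max(choice_logits.values())  # for numerical stability
    exps = {c: math.exp(v - max_logit) for c, v in choice_logits.items()}
    total = sum(exps.values())
    probs = {c: e / total for c, e in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Hypothetical logits for a tiny vocabulary.
toy_logits = {"The": 3.1, "A": 1.2, "B": 2.5, "C": 0.3, "D": -0.4, "I": 2.9}
best, probs = pick_choice(toy_logits)
print(best, probs)
```

In this toy example the unconstrained most probable token is "The", which is not a valid answer, while restricting the comparison to the four answer tokens selects "B".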

## Step 4: Evaluation

### Choice of Loss

### Benchmarking Datasets

Multiple-choice tasks: [ARC](https://github.com/fchollet/ARC-AGI), [HellaSwag](https://rowanzellers.com/hellaswag/), [MMLU](https://paperswithcode.com/dataset/mmlu), etc.

### Human Evaluation

### NLP Metrics

These quantify the quality of the output via metrics such as perplexity, [BLEU](https://en.wikipedia.org/wiki/BLEU), or [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) scores.

Perplexity measures, roughly, how many tokens (out of the entire vocabulary) our LLM was hesitating between when choosing the most probable next token. Formally...
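As a sketch of the commonly used definition, perplexity is the exponential of the average negative log-likelihood of the observed tokens: $\text{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)$. The code below computes this from a list of probabilities that a model assigned to each actual next token; the function name and the example probabilities are made up for illustration.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities assigned to each actual next token."""
    n = len(token_probs)
    avg_neg_log_likelihood = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_likelihood)


# A model that always spreads its belief evenly over four candidates.
uniform_over_four = [0.25, 0.25, 0.25, 0.25]
print(perplexity(uniform_over_four))

# A model that is fairly confident about each actual next token.
confident = [0.9, 0.8, 0.95]
print(perplexity(confident))
```

The uniform case gives a perplexity of about 4, matching the intuition of "hesitating between four tokens", while the confident model scores close to 1.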

Another option is to use auxiliary fine-tuned LLMs as judges that compare the output to some ground truth, as was done in the [TruthfulQA paper](https://arxiv.org/pdf/2109.07958).

