# Introduction

:::{admonition} Objective
:class: important
This workshop covers best practices for:

- <font color='purple'>**Selecting and Executing**</font> open source LLMs on Quest and on the Kellogg Linux Cluster (KLC)
- <font color='purple'>**Adapting**</font> models by using fine-tuning to improve performance and accuracy
- <font color='purple'>**Integrating**</font> with external resources at run-time to improve LLM knowledge and reduce hallucinations
:::

:::{admonition} Project Lifecycle

Every LLM project goes through at least some version of this lifecycle:

```{figure} ./images/project-lifecycle-1.png
---
width: 900px
name: project-lifecycle-1
---
```
(Diagram taken from [DeepLearning.AI](https://www.deeplearning.ai/), provided under the Creative Commons License)
:::

## Define the Use Case

:::{card}

One key to success is coming up with a well-defined use case that your LLM application will implement:

```{figure} ./images/project-lifecycle-2.png
---
width: 900px
name: project-lifecycle-2
---
```

Your plan should answer these questions:
* What data will I use to achieve my research goal?
* How much data do I need?
* How will I evaluate LLM output?
* What counts as good enough?
:::

:::{admonition} Types of Use Cases

LLMs support different [types](https://txt.cohere.com/llm-use-cases) of use cases, often with somewhat different underlying model architectures: 

```{figure} ./images/LLM-use-cases.png
---
width: 900px
name: LLM-use-cases
---
```
:::

## Select a Model

:::{card} 

There are many open source model choices available. Why choose open source over closed source models like GPT-4?

- __Reproducibility__
- __Data privacy__
- __Flexibility to adapt a model__
- __Flexibility to incorporate external resources__
- __Cost at inference time__

:::

:::{admonition} Leaderboards

We can use leaderboards to choose the best model for our use case, and model hubs to download models so we can run them locally.

```{figure} ./images/project-lifecycle-3-annotated.png
---
width: 900px
name: project-lifecycle-3
---
```
:::

:::{card} Models vs. Code


```{figure} ./images/model-v-code.drawio.png
---
width: 900px
name: model-v-code
---
```
:::

:::{card} Model Hubs

One widely used model hub is from [Hugging Face](https://huggingface.co/docs/hub/en/models-the-hub):

```{figure} ./images/model-hub.png
---
width: 900px
name: model-hub
---
```
:::

:::{card} Benchmarks and Leaderboards: Chatbot Arena

This is the [Chatbot Arena leaderboard](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) as of 2024-03-04:

```{figure} ./images/chatbot-leaderboard.png
---
width: 900px
name: chatbot-leaderboard
---
```
:::

:::{card} Benchmarks and Leaderboards: Others

There are many [other benchmarks](https://huggingface.co/collections/open-llm-leaderboard/the-big-benchmarks-collection-64faca6335a7fc7d4ffe974a):

```{figure} ./images/big-benchmarks-collection.png
---
width: 900px
name: big-benchmarks-collection
---
```
:::

:::{card} Benchmarks and Leaderboards: HELM

The growing capabilities of very large LLMs have inspired new and challenging benchmarks, like [HELM](https://crfm.stanford.edu/helm/lite/latest/):

```{figure} ./images/helm-benchmark.png
---
width: 900px
name: helm-benchmark
---
```
:::

:::{card} Executing an Open LLM

Executing LLMs on a [GPU](https://blogs.nvidia.com/blog/whats-the-difference-between-a-cpu-and-a-gpu/) is __much__ faster than executing them on a CPU. We will show you how to access GPUs for training and inference on [Quest/KLC](https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1112).

```{figure} ./images/gpu-v-cpu.jpg
---
width: 900px
name: gpu-v-cpu
---
```
:::
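As a minimal sketch of how a script might check for a GPU before running a model, the snippet below falls back to the CPU when no GPU is visible. It assumes PyTorch is the framework in use; `pick_device` is a hypothetical helper name, not part of the workshop code.

```python
def pick_device():
    """Return "cuda" if PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # assumption: PyTorch is installed in your environment
    except ImportError:
        # Without PyTorch we cannot query the GPU, so fall back to the CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running on: {device}")
```

On a login node this will typically report `cpu`; a GPU only becomes visible inside a job that has been allocated one.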

## Adapt the Model: Fine-tuning

:::{card}

We should always start by crafting good prompts to get the best performance we can from a model as it is. When prompting alone is not enough, it can be advantageous to adapt the model itself. Fine-tuning is one way to do this.

```{figure} ./images/project-lifecycle-4-annotated.png
---
width: 900px
name: project-lifecycle-4
---
```
:::

:::{admonition} Fine-tuning
:::

:::{admonition} Evaluation Metrics

Evaluation metrics depend on the type of task. For information extraction tasks, metrics such as [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) are appropriate.

```{figure} ./images/precision-recall.png
---
width: 400px
float: left
name: precision-recall
---
```
:::
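To make precision and recall concrete, here is a minimal sketch that scores a set of extracted items against a hand-labeled gold set. The function name `precision_recall` and the company names are hypothetical and chosen only for illustration.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of extracted items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: what fraction of the extracted items are correct?
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: what fraction of the correct items were extracted?
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: company names an LLM extracted from a document.
gold = {"Acme Corp", "Globex", "Initech", "Umbrella"}
predicted = {"Acme Corp", "Globex", "Hooli"}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.67 recall=0.50
```

Two of the three extracted names are correct (precision 2/3), and two of the four gold names were found (recall 2/4).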

## Application Integration

:::{card} 
LLMs are usually deployed as a component of a larger application. This larger application can make use of external resources, such as document collections or knowledge bases. Deployment must also take into account the computational resources that are available, in particular GPUs and sufficient memory.

```{figure} ./images/project-lifecycle-5-annotated.png
---
width: 900px
name: project-lifecycle-5
---
```
:::

:::{admonition} Model Optimization
Models can consume very large amounts of memory. The largest model you can currently run on Quest has to fit into four Nvidia A100 GPUs with 80 GB of memory each. That is a lot of memory, but you compete for these nodes with the rest of Northwestern. One way to tackle this challenge is to [quantize](https://huggingface.co/blog/4bit-transformers-bitsandbytes) your model weights, lowering their floating-point precision so that they consume less memory:

```{figure} ./images/FP8-scheme.png
---
width: 900px
name: FP8-schema
---
```
:::
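The effect of quantization on memory can be estimated with simple arithmetic: the number of parameters times the bytes used per parameter. The sketch below is a rough approximation that counts the weights only; it ignores activations, the key-value cache, and any overhead added by the quantization scheme. The 70-billion-parameter model is a hypothetical example, and `weight_memory_gb` is an illustrative helper name.

```python
import math

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}
GPU_MEMORY_GB = 80  # one A100 with 80 GB, as described above


def weight_memory_gb(n_params, precision):
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


n_params = 70e9  # hypothetical 70-billion-parameter model
for precision in BYTES_PER_PARAM:
    gb = weight_memory_gb(n_params, precision)
    gpus = math.ceil(gb / GPU_MEMORY_GB)
    print(f"{precision}: {gb:.0f} GB of weights, at least {gpus} GPU(s)")
```

Under these assumptions, the weights shrink from 280 GB at 32-bit precision to 35 GB at 4-bit precision, which is the difference between needing all four GPUs and fitting on one.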

:::{admonition} Retrieval Augmented Generation (RAG)

No model can "know" anything about events that occurred after its training cutoff date. One way to overcome this obstacle is to integrate external resources at run-time with a technique such as Retrieval Augmented Generation (RAG), which retrieves relevant documents and adds them to the prompt. RAG can result in better prompt completions and fewer "hallucinations".

:::
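The core idea of RAG can be sketched in a few lines: score each document against the question, keep the best matches, and place them in the prompt. This toy version uses word-count cosine similarity from the standard library, whereas real systems typically use learned embeddings and a vector store. The documents, the prompt wording, and all function names here are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        documents, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True
    )
    return ranked[:k]


def build_prompt(question, documents, k=1):
    """Augment the question with retrieved context before sending it to an LLM."""
    context = "\n".join(retrieve(question, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical toy documents standing in for an external knowledge source.
documents = [
    "The campus library opens at 9 am on weekdays.",
    "The cafeteria serves lunch from 11 am to 2 pm.",
    "Parking permits are renewed every September.",
]
question = "When does the library open?"
print(build_prompt(question, documents))
```

The resulting prompt contains the library sentence as context, so the model can answer from retrieved text rather than from what it memorized during training.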