<a href="https://colab.research.google.com/github/ryoungj/ObsScaling/blob/main/model_subset_selection_eval.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Model Subset Selection (Guideline)

This notebook provides guidelines, with minimal examples, for selecting a subset of the available models using optimal experimental design principles. The goal is to minimize evaluation cost while maintaining prediction performance.

## Preparation

Colab-specific setup: uncomment the following lines when running in Colab.

In [1]:
# ! git clone https://github.com/ryoungj/ObsScaling
# %cd ObsScaling
# ! pip install -r requirements.txt

In [2]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import json
import re
import copy

from utils import *

In [3]:
%load_ext autoreload 
%autoreload 2
%config InlineBackend.figure_format = 'retina'

## Guidelines

#### Step 1: Load benchmark eval results for LLMs to select from

In [4]:
base_llm_benchmark_eval = load_base_llm_benchmark_eval()

#### Step 2: Specify model selection arguments

In [5]:
## This is an illustrative example 
## Specify your own arguments based on your own data and needs
DEFAULT_SELECTION_KWARGS = {
    "num_model_budgets": [4, 8, 12, 16, 20, 24, 28, 32, 36],  # budgets to search, each given as a number of models
    "max_family_to_search": 10,  # maximum number of model families to brute-force search
    "include_family": "Llama-2",  # always include Llama-2 in the selection as it is the most widely used model family
    "num_pc_for_select": 3,  # number of principal components (PCs) used to compute the variance for model selection
    "all_families_for_select": EVAL_BASE_MODEL_FAMILIES,  # all model families to consider for model selection
}
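To make `num_pc_for_select` more concrete, here is a minimal sketch of how a variance-based design objective over the top principal components could be computed. It is an illustration under stated assumptions, not the code behind `search_subset`: the V-optimality-style criterion, the helper names `top_pcs` and `v_optimality_objective`, and the random toy scores are all hypothetical.

```python
from typing import Sequence

import numpy as np


def top_pcs(scores: np.ndarray, num_pc: int) -> np.ndarray:
    """Project mean-centered benchmark scores onto the top `num_pc` principal components."""
    centered = scores - scores.mean(axis=0, keepdims=True)
    # Rows of `vt` are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:num_pc].T


def v_optimality_objective(features: np.ndarray, subset: Sequence[int]) -> float:
    """Negative total prediction variance (up to noise scale) over all models
    for a linear fit trained only on `subset`. Higher is better.

    Assumption: a V-optimality-style criterion, chosen for illustration.
    """
    # Add an intercept column to the PC features.
    x_all = np.hstack([np.ones((features.shape[0], 1)), features])
    x_sub = x_all[list(subset)]
    # Covariance of the least-squares estimate is proportional to (X_sub^T X_sub)^-1.
    cov = np.linalg.pinv(x_sub.T @ x_sub)
    # Sum of x_i^T cov x_i over every candidate model i.
    return -float(np.einsum("ij,jk,ik->", x_all, cov, x_all))


# Toy data: 10 hypothetical models evaluated on 5 hypothetical benchmarks.
rng = np.random.default_rng(0)
toy_scores = rng.normal(size=(10, 5))
toy_pcs = top_pcs(toy_scores, num_pc=3)
print(v_optimality_objective(toy_pcs, subset=[0, 1, 2, 3, 4, 5]))
```

Under this criterion, adding models to a subset can only keep the objective the same or improve it, so larger budgets should not score worse than smaller ones.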

#### Step 3: Select a model set under the given budgets

Selecting models from all available models

In [6]:
run_results, select_results = search_subset(base_llm_benchmark_eval, **DEFAULT_SELECTION_KWARGS)

Total:  431909


431909it [00:17, 23995.38it/s] 


>>> Num model budegt: 4

### Best configs:
 	 Object value: -1365.30
 	 Model family (2): Mistral, Llama-2
 	 Models (4): meta-llama/Llama-2-7b-hf, meta-llama/Llama-2-13b-hf, meta-llama/Llama-2-70b-hf, mistralai/Mistral-7B-v0.1




>>> Num model budegt: 8

### Best configs:
 	 Object value: -37.43
 	 Model family (4): Mixtral, Phi, MPT, Llama-2
 	 Models (8): meta-llama/Llama-2-7b-hf, meta-llama/Llama-2-13b-hf, meta-llama/Llama-2-70b-hf, mistralai/Mixtral-8x7B-v0.1, microsoft/phi-2, microsoft/phi-1_5, mosaicml/mpt-30b, mosaicml/mpt-7b




>>> Num model budegt: 12

### Best configs:
 	 Object value: -16.93
 	 Model family (4): Llama-3, Falcon, DeepSeek-Coder, Llama-2
 	 Models (12): meta-llama/Llama-2-7b-hf, meta-llama/Llama-2-13b-hf, meta-llama/Llama-2-70b-hf, meta-llama/Meta-Llama-3-70B, meta-llama/Meta-Llama-3-8B, tiiuae/falcon-180B, tiiuae/falcon-40b, tiiuae/falcon-7b, tiiuae/falcon-rw-1b, deepseek-ai/deepseek-coder-1.3b-base, deepseek-ai/deepseek-coder-6.7b-base, deepseek-ai/deepse
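The printed configurations always contain whole model families (for example, the budget-4 result is the three Llama-2 models plus Mistral-7B), which is consistent with a brute-force search over family combinations. The following is a minimal sketch of such an enumeration, not the actual `search_subset` implementation. The helper `enumerate_family_subsets`, the rule that family sizes must sum exactly to the budget, and the toy family sizes are assumptions for illustration.

```python
from itertools import combinations


def enumerate_family_subsets(family_sizes, budget, include_family=None, max_families=10):
    """Yield tuples of families whose model counts sum exactly to `budget`.

    `family_sizes` maps a family name to its number of models.
    `include_family`, if given, is forced into every candidate.
    """
    required = (include_family,) if include_family is not None else ()
    optional = [f for f in family_sizes if f not in required]
    # Try every combination of optional families, up to the family cap.
    for k in range(0, max_families - len(required) + 1):
        for combo in combinations(optional, k):
            chosen = required + combo
            if sum(family_sizes[f] for f in chosen) == budget:
                yield chosen


# Toy family sizes (hypothetical, for illustration only).
toy_family_sizes = {"Llama-2": 3, "Mistral": 1, "Mixtral": 1, "Phi": 2, "MPT": 2}

for candidate in enumerate_family_subsets(toy_family_sizes, budget=8, include_family="Llama-2"):
    print(candidate)
```

Each candidate would then be scored with the selection objective and the best one kept per budget. The number of combinations grows quickly with the number of families, which is why a cap such as `max_family_to_search` is useful.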

Selecting models under additional budget constraints (e.g., sub-7B models)
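One way to impose such a constraint is to pre-filter the evaluation table before running the search. The sketch below uses a toy pandas DataFrame. It assumes the evaluation results are a DataFrame with a `Model Size (B)` column (inferred from the `cutoff_by_Model Size (B)` split method) and that the cutoff is inclusive. The helper `filter_by_size` and the toy rows with nominal sizes are hypothetical.

```python
import pandas as pd

SIZE_COLUMN = "Model Size (B)"  # assumed column name

# Toy table with nominal sizes, for illustration only.
toy_eval = pd.DataFrame(
    {
        "Model": [
            "meta-llama/Llama-2-7b-hf",
            "meta-llama/Llama-2-13b-hf",
            "meta-llama/Llama-2-70b-hf",
            "microsoft/phi-2",
        ],
        SIZE_COLUMN: [7.0, 13.0, 70.0, 2.7],
    }
)


def filter_by_size(df: pd.DataFrame, max_size_b: float, size_column: str = SIZE_COLUMN) -> pd.DataFrame:
    """Keep only rows whose model size is at most `max_size_b` billion parameters."""
    return df[df[size_column] <= max_size_b].reset_index(drop=True)


small_models = filter_by_size(toy_eval, max_size_b=7)
print(small_models)
```

The filtered table could then be passed to the search in place of the full one, without any cutoff arguments.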

In [7]:
## Specify cutoff kwargs to keep models under 7B parameters
## You can also do that by pre-filtering the `base_llm_benchmark_eval` based on your own needs
CUTOFF_KWARGS = {
    "split_method": "cutoff_by_Model Size (B)",
    "cutoff_threshold": 7,
}

SELECTION_KWARGS = {
    **DEFAULT_SELECTION_KWARGS,

    "cutoff_kwargs": CUTOFF_KWARGS,
}
run_results, select_results = search_subset(base_llm_benchmark_eval, **SELECTION_KWARGS)

Total:  89845


89845it [00:19, 4609.46it/s] 


>>> Num model budegt: 4

### Best configs:
 	 Object value: -25.61
 	 Model family (3): Phi, CodeLlama, Llama-2
 	 Models (4): meta-llama/Llama-2-7b-hf, microsoft/phi-2, microsoft/phi-1_5, codellama/CodeLlama-7b-hf




>>> Num model budegt: 8

### Best configs:
 	 Object value: -10.51
 	 Model family (6): Llama, Qwen, Phi, MPT, DeepSeek-Coder, Llama-2
 	 Models (8): meta-llama/Llama-2-7b-hf, huggyllama/llama-7b, Qwen/Qwen-7B, microsoft/phi-2, microsoft/phi-1_5, mosaicml/mpt-7b, deepseek-ai/deepseek-coder-1.3b-base, deepseek-ai/deepseek-coder-6.7b-base




>>> Num model budegt: 12

### Best configs:
 	 Object value: -7.04
 	 Model family (8): Llama, Qwen, Gemma, Falcon, Phi, MPT, DeepSeek-Coder, Llama-2
 	 Models (12): meta-llama/Llama-2-7b-hf, huggyllama/llama-7b, Qwen/Qwen-7B, google/gemma-7b, google/gemma-2b, tiiuae/falcon-7b, tiiuae/falcon-rw-1b, microsoft/phi-2, microsoft/phi-1_5, mosaicml/mpt-7b, deepseek-ai/deepseek-coder-1.3b-base, deepseek-ai/deepseek-coder-6.7b-base




>>> Nu