## LLM-Blender Usage examples

First, download our DeBERTa-v3-large PairRanker checkpoint to a local folder: [checkpoint link](https://drive.google.com/drive/folders/1E3qsZqja5IBaYEDRtVARU88mDl_nBqQ3?usp=sharing).
Then set `<your checkpoint path>` in the code below to that folder.

In [None]:
import llm_blender
ranker_config = llm_blender.RankerConfig
ranker_config.ranker_type = "pairranker"
ranker_config.model_type = "deberta"
ranker_config.model_name = "microsoft/deberta-v3-large"
ranker_config.load_checkpoint = "<your checkpoint path>"
ranker_config.cache_dir = "./hf_models"
ranker_config.source_max_length = 128
ranker_config.candidate_max_length = 128
ranker_config.n_tasks = 1
fuser_config = llm_blender.GenFuserConfig
fuser_config.model_name = "llm-blender/gen_fuser_3b"
fuser_config.cache_dir = "./hf_models"
fuser_config.max_length = 512
fuser_config.candidate_max_length = 128
blender_config = llm_blender.BlenderConfig
blender_config.device = "cuda"
blender = llm_blender.Blender(blender_config, ranker_config, fuser_config)

## Using LLM-Blender for ranking
The `rank` function ranks the candidates for each input through pairwise comparisons and returns the ranks. We then show that our ranker's ranks correlate with ranks derived from ChatGPT comparisons.
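
To make "ranking through pairwise comparisons" concrete, here is a minimal sketch of one way pairwise judgments can be aggregated into ranks. It is not the PairRanker implementation: the function `ranks_from_pairwise`, the win-count aggregation, and the toy length-based judge are all assumptions made for illustration, and rank 1 is assumed to be the best.

```python
def ranks_from_pairwise(prefer):
    """Turn a pairwise preference matrix into ranks (1 = best).

    prefer[i][j] is True when candidate i is judged better than candidate j.
    Aggregation is a simple win count, chosen only for illustration.
    """
    n = len(prefer)
    # Count how many head-to-head comparisons each candidate wins.
    wins = [sum(1 for j in range(n) if j != i and prefer[i][j]) for i in range(n)]
    # Most wins first; sorted() is stable, so ties keep their input order.
    order = sorted(range(n), key=lambda i: -wins[i])
    ranks = [0] * n
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return ranks


# Toy judge (hypothetical): the longer candidate wins each comparison.
toy_candidates = ["ok", "a detailed answer", "short one"]
toy_prefer = [[len(a) > len(b) for b in toy_candidates] for a in toy_candidates]
toy_ranks = ranks_from_pairwise(toy_prefer)
print(toy_ranks)
```

In the real ranker the judge is the trained model rather than a length heuristic, and `blender.rank` handles batching and aggregation for you.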

In [2]:
import datasets
from llm_blender.gpt_eval.cor_eval import COR_MAPS
from llm_blender.gpt_eval.utils import get_ranks_from_chatgpt_cmps
mixinstruct_test = datasets.load_dataset("llm-blender/mix-instruct", split="test", streaming=True)
few_examples = list(mixinstruct_test.take(8))
# Drop pairwise comparisons that have no ChatGPT result
for ex in few_examples:
    keys = list(ex['cmp_results'].keys())
    for key in keys:
        if not ex['cmp_results'][key]:
            del ex['cmp_results'][key]
few_examples = [x for x in few_examples if x['cmp_results']]
inputs = [x['input'] for x in few_examples]
candidates_texts = [[cand['text'] for cand in x['candidates']] for x in few_examples]
ranks = blender.rank(inputs, candidates_texts, return_scores=False, batch_size=2)


Ranking candidates: 100%|██████████| 4/4 [00:43<00:00, 11.00s/it]


In [3]:
import numpy as np
llm_ranks_map, gpt_cmp_results = get_ranks_from_chatgpt_cmps(few_examples)
gpt_ranks = np.array(list(llm_ranks_map.values())).T
print("Correlation with ChatGPT")
print("------------------------")
for cor_name, cor_func in COR_MAPS.items():
    print(cor_name, cor_func(ranks, gpt_ranks))

Correlation with ChatGPT
------------------------
corrcoef 0.502434644007648
spearman 0.35554809046205055
spearman_footrule 25.5
set_based 0.6422190656565656
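
For readers unfamiliar with these metrics, the sketch below shows the textbook definitions of Spearman's rank correlation (for rankings without ties) and Spearman's footrule on a single pair of rankings. The function names and toy rankings are ours, and the functions in `COR_MAPS` may aggregate over examples or handle ties differently, so treat this as an illustration of the idea rather than a reproduction of the numbers above.

```python
def spearman_no_ties(ranks_a, ranks_b):
    """Spearman's rho for two rankings of the same items, assuming no ties."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n * n - 1))


def footrule_distance(ranks_a, ranks_b):
    """Spearman's footrule: total absolute rank displacement (lower = closer)."""
    return sum(abs(a - b) for a, b in zip(ranks_a, ranks_b))


# Toy rankings (hypothetical): the two rankers disagree only on the top two items.
model_ranks = [1, 2, 3, 4]
judge_ranks = [2, 1, 3, 4]
rho = spearman_no_ties(model_ranks, judge_ranks)
distance = footrule_distance(model_ranks, judge_ranks)
print(rho, distance)
```

A rho of 1 means identical rankings and -1 means fully reversed rankings, while the footrule is a distance, so smaller values indicate closer agreement.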


## Using LLM-Blender for fused generation
We show that fusing the top-ranked candidates selected by the ranker can produce outputs of higher quality.
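
The selection step can be pictured with the following minimal sketch. It is a stand-in written for illustration, not the library's `get_topk_candidates_from_ranks`; the name `topk_by_rank` and the toy data are hypothetical, and we assume that a smaller rank means a better candidate.

```python
def topk_by_rank(ranks, candidates, top_k=3):
    """Keep the top_k best-ranked candidates for each example.

    ranks[i][j] is the rank of candidates[i][j]; rank 1 is assumed to be best.
    """
    selected = []
    for example_ranks, example_cands in zip(ranks, candidates):
        # Candidate indices sorted from best rank to worst rank.
        order = sorted(range(len(example_cands)), key=lambda j: example_ranks[j])
        selected.append([example_cands[j] for j in order[:top_k]])
    return selected


# Toy data (hypothetical): two examples with three candidates each.
toy_ranks = [[2, 1, 3], [3, 2, 1]]
toy_cands = [["a0", "a1", "a2"], ["b0", "b1", "b2"]]
toy_topk = topk_by_rank(toy_ranks, toy_cands, top_k=2)
print(toy_topk)
```

The selected candidates, together with the original input, are what the fuser receives.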

In [4]:
from llm_blender.blender.blender_utils import get_topk_candidates_from_ranks
topk_candidates = get_topk_candidates_from_ranks(ranks, candidates_texts, top_k=3)
fuse_generations = blender.fuse(inputs, topk_candidates, batch_size=2)


Fusing candidates: 100%|██████████| 4/4 [00:12<00:00,  3.11s/it]


In [5]:
# Or rank and fuse in a single call
# fuse_generations, ranks = blender.rank_and_fuse(inputs, candidates_texts, return_scores=False, batch_size=2, top_k=3)


Ranking candidates: 100%|██████████| 4/4 [00:44<00:00, 11.06s/it]
Fusing candidates: 100%|██████████| 4/4 [00:12<00:00,  3.12s/it]


In [15]:
from llm_blender.common.evaluation import overall_eval
metrics = ['bartscore']
targets = [x['output'] for x in few_examples]
scores = overall_eval(fuse_generations, targets, metrics)

print("Fusion Scores")
for key, value in scores.items():
    print("  ", key+":", np.mean(value))

print("LLM Scores")
llms = [x['model'] for x in few_examples[0]['candidates']]
llm_scores_map = {llm: {metric: [] for metric in metrics} for llm in llms}
for ex in few_examples:
    for cand in ex['candidates']:
        for metric in metrics:
            llm_scores_map[cand['model']][metric].append(cand['scores'][metric])
for i, (llm, scores_map) in enumerate(llm_scores_map.items()):
    print(f"{i} {llm}")
    # scores_map maps each metric name to this LLM's per-example scores
    for metric, llm_scores in scores_map.items():
        print("  ", metric+":", np.mean(llm_scores))


Evaluating bartscore: 100%|██████████| 8/8 [00:00<00:00, 41.08it/s]

Fusion Scores
   bartscore: -3.8043667674064636
LLM Scores
0 oasst-sft-4-pythia-12b-epoch-3.5
   bartscore: -3.807092547416687
1 koala-7B-HF
   bartscore: -4.550534904003143
2 alpaca-native
   bartscore: -4.206288725137711
3 llama-7b-hf-baize-lora-bf16
   bartscore: -3.9363586008548737
4 flan-t5-xxl
   bartscore: -4.934148460626602
5 stablelm-tuned-alpha-7b
   bartscore: -4.432858616113663
6 vicuna-13b-1.1
   bartscore: -4.20223930478096
7 dolly-v2-12b
   bartscore: -4.440025061368942
8 moss-moon-003-sft
   bartscore: -3.587637573480606
9 chatglm-6b
   bartscore: -3.7075400948524475
10 mpt-7b
   bartscore: -4.1352817714214325
11 mpt-7b-instruct
   bartscore: -4.282741814851761
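
To read the printed scores more easily, they can be sorted into a small leaderboard. The sketch below copies the mean BARTScore values from the output above, rounded to three decimals; the dictionary name and the `fusion` label are ours. BARTScore is a log-likelihood, so values closer to zero are better.

```python
# Mean BARTScore per system, copied from the output above (rounded to 3 decimals).
mean_bartscore = {
    "fusion": -3.804,
    "oasst-sft-4-pythia-12b-epoch-3.5": -3.807,
    "koala-7B-HF": -4.551,
    "alpaca-native": -4.206,
    "llama-7b-hf-baize-lora-bf16": -3.936,
    "flan-t5-xxl": -4.934,
    "stablelm-tuned-alpha-7b": -4.433,
    "vicuna-13b-1.1": -4.202,
    "dolly-v2-12b": -4.440,
    "moss-moon-003-sft": -3.588,
    "chatglm-6b": -3.708,
    "mpt-7b": -4.135,
    "mpt-7b-instruct": -4.283,
}

# Sort from best (closest to zero) to worst.
leaderboard = sorted(mean_bartscore.items(), key=lambda item: item[1], reverse=True)
for position, (name, score) in enumerate(leaderboard, start=1):
    print(f"{position:2d} {name}: {score:.3f}")
```

Keep in mind that these means come from only the few examples sampled above, so the ordering is indicative rather than conclusive.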



