## LLM-Blender usage examples

First, download our DeBERTa-v3-large PairRanker checkpoint to a local folder: [checkpoint link](https://drive.google.com/drive/folders/1E3qsZqja5IBaYEDRtVARU88mDl_nBqQ3?usp=sharing).
We refer to that folder as `<your checkpoint path>` below.

In [1]:
import llm_blender
ranker_config = llm_blender.RankerConfig
ranker_config.ranker_type = "pairranker"
ranker_config.model_type = "deberta"
ranker_config.model_name = "microsoft/deberta-v3-large" # ranker backbone
ranker_config.load_checkpoint = "checkpoint-best" # path to the downloaded ranker checkpoint, i.e. <your checkpoint path>
ranker_config.cache_dir = "./hf_models" # hugging face model cache dir
ranker_config.source_max_length = 128
ranker_config.candidate_max_length = 128
ranker_config.n_tasks = 1 # number of signals used to train the ranker. This checkpoint was trained with BARTScore only, so the value is 1.
fuser_config = llm_blender.GenFuserConfig
fuser_config.model_name = "llm-blender/gen_fuser_3b" # our pre-trained fuser
fuser_config.cache_dir = "./hf_models"
fuser_config.max_length = 512
fuser_config.candidate_max_length = 128
blender_config = llm_blender.BlenderConfig
blender_config.device = "cuda" # blender ranker and fuser device
blender = llm_blender.Blender(blender_config, ranker_config, fuser_config)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



Using DeBERTa model


Some weights of the model checkpoint at microsoft/deberta-v3-large were not used when initializing DebertaV2Model: ['mask_predictions.LayerNorm.weight', 'lm_predictions.lm_head.dense.weight', 'mask_predictions.classifier.weight', 'mask_predictions.LayerNorm.bias', 'mask_predictions.dense.bias', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.classifier.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.bias', 'mask_predictions.dense.weight']
- This IS expected if you are initializing DebertaV2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Using LLM-Blender for ranking
With the `rank` function, LLM-Blender ranks the candidates for each input through pairwise comparisons and returns the ranks. We then show that our ranker's ranks correlate positively with the ranks derived from ChatGPT comparisons.
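
To make the idea of ranking through pairwise comparisons concrete, here is a minimal sketch that turns pairwise decisions into ranks by counting wins. It is an illustration only: `rank_by_pairwise_wins` and the toy length-based comparator are hypothetical names, and the aggregation that PairRanker actually uses may differ.

```python
from itertools import combinations

def rank_by_pairwise_wins(candidates, is_better):
    """Rank candidates by the number of pairwise comparisons they win (1 = best).

    `is_better(a, b)` is a hypothetical comparator that returns True when
    candidate `a` is preferred over candidate `b`.
    """
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        if is_better(candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Sort indices by descending win count; ties keep their original order.
    order = sorted(range(len(candidates)), key=lambda k: -wins[k])
    ranks = [0] * len(candidates)
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return ranks

# Toy comparator (assumption for illustration): prefer the longer answer.
toy_candidates = ["ok", "a longer answer", "medium one"]
toy_ranks = rank_by_pairwise_wins(toy_candidates, lambda a, b: len(a) > len(b))
print(toy_ranks)  # [3, 1, 2]
```

In the real pipeline the comparator is the trained ranker rather than a length heuristic; the sketch only shows how pairwise outcomes can be reduced to one rank per candidate.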

In [3]:
import datasets
from llm_blender.gpt_eval.cor_eval import COR_MAPS
from llm_blender.gpt_eval.utils import get_ranks_from_chatgpt_cmps
mixinstruct_test = datasets.load_dataset("llm-blender/mix-instruct", split="test", streaming=True)
few_examples = list(mixinstruct_test.take(8))
# drop cmp_results entries whose comparison result is empty (None)
for ex in few_examples:
    keys = list(ex['cmp_results'].keys())
    for key in keys:
        if not ex['cmp_results'][key]:
            del ex['cmp_results'][key]
few_examples = [x for x in few_examples if x['cmp_results']]
inputs = [x['input'] for x in few_examples]
candidates_texts = [[cand['text'] for cand in x['candidates']] for x in few_examples]
print("Example:")
print("Input 1:", inputs[0])
print("Candidate 1 for input 1:")
print(candidates_texts[0][0])


Example:
Input 1: I've always wondered what the difference is between a skeptic and a denier.
Candidate 1 for input 1:
A skeptic is someone who doubts or expresses doubt about a claim or idea without being dismissive of it. They are open-minded and approach evidence with an open mind, searching for reasonable explanations and evidence to support their beliefs.

A denier, on the other hand, is someone who actively works to deny or ignore evidence that contradicts their beliefs. They are often characterized by a closed mind and an unwillingness to consider alternative perspectives. They may also use rhetoric or false claims to try to discredit the evidence.


In [6]:
ranks = blender.rank(inputs, candidates_texts, return_scores=False, batch_size=2)
print("Ranks for input 1:", ranks[0]) # ranks of candidates for input 1
# Ranks for input 1: [ 1 11  4  9 12  5  2  8  6  3 10  7]

Ranking candidates: 100%|██████████| 4/4 [00:43<00:00, 10.95s/it]

Ranks for input 1: [ 1 11  4  9 12  5  2  8  6  3 10  7]





In [4]:
import numpy as np
llm_ranks_map, gpt_cmp_results = get_ranks_from_chatgpt_cmps(few_examples)
gpt_ranks = np.array(list(llm_ranks_map.values())).T
print("Correlation with ChatGPT")
print("------------------------")
for cor_name, cor_func in COR_MAPS.items():
    print(cor_name, cor_func(ranks, gpt_ranks))

Correlation with ChatGPT
------------------------
pearson 0.502434644007648
spearman 0.35554809046205055
spearman_footrule 25.5
set_based 0.6422190656565656
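
For readers who want to see what such rank-agreement measures compute, here is a small NumPy-only sketch of a Pearson correlation and a Spearman footrule distance on toy rank vectors. The function names and toy data are ours; we do not claim that they reproduce the implementations behind `COR_MAPS`.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two (flattened) rank arrays."""
    return float(np.corrcoef(np.ravel(x), np.ravel(y))[0, 1])

def footrule(x, y):
    """Spearman footrule distance: sum of absolute rank differences."""
    return float(np.abs(np.asarray(x) - np.asarray(y)).sum())

# Toy data (assumption for illustration): two rankings of four candidates
# that disagree only on the two middle positions.
toy_ranker_ranks = np.array([1, 2, 3, 4])
toy_reference_ranks = np.array([1, 3, 2, 4])

toy_pearson = pearson(toy_ranker_ranks, toy_reference_ranks)
toy_footrule = footrule(toy_ranker_ranks, toy_reference_ranks)
print("pearson:", round(toy_pearson, 4))
print("footrule:", toy_footrule)
```

For the correlation a higher value means closer agreement, while for the footrule distance a lower value does, which is worth keeping in mind when reading the table above.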


## Using LLM-Blender to directly compare two candidates

In [5]:
candidates_A = [x['candidates'][0]['text'] for x in few_examples]
candidates_B = [x['candidates'][1]['text'] for x in few_examples]
comparison_results = blender.compare(inputs, candidates_A, candidates_B, batch_size=2)
print("comparison_results:", comparison_results)
# whether candidate A is better than candidate B for each input

Ranking candidates: 100%|██████████| 4/4 [00:00<00:00,  5.82it/s]

comparison_results: [ True  True False  True False  True  True  True]
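
As a quick way to summarise such a result, the boolean array can be reduced to a win rate. The sketch below copies the values printed above into a standalone array (`example_results` is our own name) so that it runs without the model.

```python
import numpy as np

# Values copied from the printed comparison_results above.
example_results = np.array([True, True, False, True, False, True, True, True])

# True means candidate A was judged better than candidate B for that input.
wins_a = int(example_results.sum())
win_rate_a = float(example_results.mean())
print(f"Candidate A preferred in {wins_a} of {example_results.size} inputs ({win_rate_a:.0%})")
```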





## Using LLM-Blender for generative fusion
We show that fusing the top-ranked candidates selected by the ranker can produce outputs of higher quality.

In [6]:
from llm_blender.blender.blender_utils import get_topk_candidates_from_ranks
topk_candidates = get_topk_candidates_from_ranks(ranks, candidates_texts, top_k=3)
fuse_generations = blender.fuse(inputs, topk_candidates, batch_size=2)
print("fuse_generations for input 1:", fuse_generations[0])

Fusing candidates: 100%|██████████| 4/4 [00:12<00:00,  3.19s/it]

fuse_generations for input 1: A skeptic is someone who questions the validity of a claim or idea, while a denier is someone who dismisses or ignores evidence that contradicts their beliefs. Skeptics approach claims with an open mind and seek evidence to support or refute them, while denier's often have a closed mind and refuse to consider evidence that contradicts their beliefs.
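
The top-k selection step can be pictured with the following sketch. It assumes that rank 1 marks the best candidate, and `topk_by_rank` is a hypothetical stand-in written for illustration, not the code of `get_topk_candidates_from_ranks`.

```python
import numpy as np

def topk_by_rank(ranks, candidates, top_k=3):
    """Pick the top_k candidates per input, assuming rank 1 is the best."""
    ranks = np.asarray(ranks)
    # argsort on the ranks lists candidate indices from best to worst.
    top_idx = np.argsort(ranks, axis=-1)[:, :top_k]
    return [[cands[i] for i in idx] for cands, idx in zip(candidates, top_idx)]

# Ranks for input 1 as printed earlier; candidate texts replaced by placeholders.
toy_ranks = [[1, 11, 4, 9, 12, 5, 2, 8, 6, 3, 10, 7]]
toy_candidates = [[f"candidate_{i}" for i in range(12)]]
toy_topk = topk_by_rank(toy_ranks, toy_candidates, top_k=3)
print(toy_topk)  # [['candidate_0', 'candidate_6', 'candidate_9']]
```

Under that assumption, the fuser for input 1 would receive the candidates at positions 0, 6, and 9, which are the ones ranked 1, 2, and 3.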





In [7]:
# Or rank and fuse in a single call
fuse_generations, ranks = blender.rank_and_fuse(inputs, candidates_texts, return_scores=False, batch_size=2, top_k=3)

Ranking candidates: 100%|██████████| 4/4 [00:44<00:00, 11.06s/it]
Fusing candidates: 100%|██████████| 4/4 [00:13<00:00,  3.28s/it]


In [8]:
from llm_blender.common.evaluation import overall_eval
metrics = ['bartscore']
targets = [x['output'] for x in few_examples]
scores = overall_eval(fuse_generations, targets, metrics)

print("Fusion Scores")
for key, value in scores.items():
    print("  ", key+":", np.mean(value))

print("LLM Scores")
llms = [x['model'] for x in few_examples[0]['candidates']]
llm_scores_map = {llm: {metric: [] for metric in metrics} for llm in llms}
for ex in few_examples:
    for cand in ex['candidates']:
        for metric in metrics:
            llm_scores_map[cand['model']][metric].append(cand['scores'][metric])
for i, (llm, scores_map) in enumerate(llm_scores_map.items()):
    print(f"{i} {llm}")
    for metric, llm_scores in scores_map.items():
        print("  ", metric + ":", "{:.4f}".format(np.mean(llm_scores)))


Evaluating bartscore: 100%|██████████| 8/8 [00:00<00:00, 41.73it/s]

Fusion Scores
   bartscore: -3.8043667674064636
LLM Scores
0 oasst-sft-4-pythia-12b-epoch-3.5
   bartscore: -3.8071
1 koala-7B-HF
   bartscore: -4.5505
2 alpaca-native
   bartscore: -4.2063
3 llama-7b-hf-baize-lora-bf16
   bartscore: -3.9364
4 flan-t5-xxl
   bartscore: -4.9341
5 stablelm-tuned-alpha-7b
   bartscore: -4.4329
6 vicuna-13b-1.1
   bartscore: -4.2022
7 dolly-v2-12b
   bartscore: -4.4400
8 moss-moon-003-sft
   bartscore: -3.5876
9 chatglm-6b
   bartscore: -3.7075
10 mpt-7b
   bartscore: -4.1353
11 mpt-7b-instruct
   bartscore: -4.2827
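
To read these numbers at a glance, the sketch below places the fusion score among the per-LLM scores. The values are copied from the output above (the fusion score rounded to four decimals), and it relies on BARTScore being a log-likelihood-based score, so a higher (less negative) value is better. Variable names are ours.

```python
# Mean BARTScore values copied from the printed output above.
fusion_score = -3.8044
llm_scores = {
    "oasst-sft-4-pythia-12b-epoch-3.5": -3.8071,
    "koala-7B-HF": -4.5505,
    "alpaca-native": -4.2063,
    "llama-7b-hf-baize-lora-bf16": -3.9364,
    "flan-t5-xxl": -4.9341,
    "stablelm-tuned-alpha-7b": -4.4329,
    "vicuna-13b-1.1": -4.2022,
    "dolly-v2-12b": -4.4400,
    "moss-moon-003-sft": -3.5876,
    "chatglm-6b": -3.7075,
    "mpt-7b": -4.1353,
    "mpt-7b-instruct": -4.2827,
}

# Higher BARTScore is better, so list the LLMs that still beat the fusion.
better_llms = sorted(name for name, score in llm_scores.items() if score > fusion_score)
fusion_position = len(better_llms) + 1
print(f"Fusion is position {fusion_position} of {len(llm_scores) + 1}; better: {better_llms}")
```

On these 8 examples the fused output scores above 10 of the 12 individual LLMs, with only `chatglm-6b` and `moss-moon-003-sft` scoring higher on average.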



