# Reproduce Llama2-13b on a single A100 40GB

This file includes four sections:
- [(Optional) Train the quantization parameters of Llama2-13B yourself](#optional-train-the-quantization-parameters-of-llama-2-13b-hf-by-yourself)
- [Download the prebuilt quantized model](#download-the-prebuilt-quantized-model)
- [Reproduce perplexity](#reproduce-result)
- [Reproduce zero-shot results](#reproduce-zeroshotusing-lm-evaluation-harness)

## (Optional) Train the quantization parameters of Llama-2-13b-hf by yourself.

This section shows how to train the quantization parameters of Llama-2-13b yourself. You can skip it, because a prebuilt quantized model is available in [Download the prebuilt quantized model](#download-the-prebuilt-quantized-model).

In [None]:
CUDA_VISIBLE_DEVICES=5 python main.py --model /PATH/TO/MODEL --epochs 20 --output_dir ./log/llama2-13b --eval_ppl --wbits 4 --abits 16 --quant_type mix --lwc \
--ckpt_path /PATH/TO/CKPT \
--percent 0.2

## Download the prebuilt quantized model:

We provide the prebuilt quantized model on Hugging Face. Downloading the large weight files requires `git lfs`.

In [None]:
!conda install git git-lfs
!git lfs install

In [None]:
!mkdir -p pre_quantized_models/

!git clone https://huggingface.co/ptq161/llama2-13b ./pre_quantized_models/llama2-13b
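
As an alternative to `git lfs`, the same repository can likely be fetched from Python. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; `local_model_dir` and `download_prebuilt` are hypothetical helper names used only for illustration and are not part of this repository.

```python
from pathlib import Path

REPO_ID = "ptq161/llama2-13b"  # repository used in the git clone command above


def local_model_dir(repo_id, root="pre_quantized_models"):
    """Mirror the layout of the git clone command: <root>/<repository name>."""
    return Path(root) / repo_id.split("/")[-1]


def download_prebuilt(repo_id=REPO_ID, root="pre_quantized_models"):
    """Download every file of the repository into the local model directory."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_model_dir(repo_id, root)
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling `download_prebuilt()` should leave the weights in `pre_quantized_models/llama2-13b`, the same directory the `git clone` command writes to.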

## Reproduce Result

Restrict execution to a single GPU.

In [1]:
import torch
torch.cuda.set_device(0)


### Import necessary packages

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
from datautils import get_loaders
from tqdm import tqdm
from torch import nn
import logging
import gc   
import time
torch.backends.cuda.matmul.allow_tf32 = True
logger = logging.getLogger(__name__)
def get_model(model_path):
    model = AutoModelForCausalLM.from_pretrained(
        model_path, device_map="cpu", torch_dtype=torch.float16
    )
    for n,p in model.named_parameters():
        p.requires_grad = False
    return model
def evaluate(model, tokenizer_path, logger):
    results = {}
    device = model.device
    seqlen = 2048
    seed = 42
    # for dataset in ["wikitext2", "ptb", "c4","ptb-new",'c4-new']:
    for dataset in ["wikitext2", "c4"]:
        dataloader, testloader = get_loaders(
            dataset,
            seed=seed,
            model=tokenizer_path,
            seqlen=seqlen,
        )
        if "c4" in dataset:
            testenc = testloader
        else:
            testenc = testloader.input_ids

        nsamples = testenc.numel() // seqlen
        use_cache = model.config.use_cache
        model.config.use_cache = False
        model.eval()
        nlls = []
        for i in tqdm(range(nsamples)):
            batch = testenc[:, (i * seqlen) : ((i + 1) * seqlen)].to(device)
            outputs = model.model(batch)
            logits = outputs[0]
            logits = model.lm_head(logits)
            shift_logits = logits[:, :-1, :]
            shift_labels = testenc[:, (i * seqlen) : ((i + 1) * seqlen)][
                :, 1:
            ].to(model.lm_head.weight.device)
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(
                shift_logits.view(-1, shift_logits.size(-1)),
                shift_labels.view(-1),
            )
            neg_log_likelihood = loss.float() * seqlen
            nlls.append(neg_log_likelihood)

        ppl = torch.exp(torch.stack(nlls).sum() / (nsamples * seqlen))
        logger.info(f'{dataset} : {ppl.item()}')
        model.config.use_cache = use_cache
        results[dataset] = ppl.item()
        print("dataset:", ppl.item())
    return results
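
The aggregation step at the end of `evaluate` can be looked at in isolation. The sketch below restates it in plain Python, assuming each window contributes one mean cross-entropy value; `perplexity_from_window_losses` is a hypothetical name introduced here. It mirrors the convention of the code above, which scales each window's mean loss by `seqlen` before summing.

```python
import math


def perplexity_from_window_losses(mean_losses, seqlen):
    """Combine per-window mean cross-entropy losses into one perplexity value."""
    # Scale each mean loss back to an (approximate) summed negative log-likelihood.
    nlls = [loss * seqlen for loss in mean_losses]
    # Average over all tokens of all windows, then exponentiate.
    return math.exp(sum(nlls) / (len(mean_losses) * seqlen))


# A model that assigns probability 1/2 to every target token has perplexity 2.
print(perplexity_from_window_losses([math.log(2)] * 3, seqlen=2048))
```

Because every window has the same length, the result equals the exponential of the average per-window loss.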




### Reproduce perplexity.

In [3]:
model_path = "/share/tmp/llama2-13b-mix-ptq2"
model = get_model(model_path)
model.eval()
device=torch.device("cuda:0")
model.to(device)
results = evaluate(model, model_path, logger)
print('perplexity result:')
for k,v in results.items():
    print(k, v)

Loading checkpoint shards: 100%|██████████| 6/6 [00:00<00:00,  9.00it/s]


get_wikitext2


100%|██████████| 166/166 [01:11<00:00,  2.34it/s]


dataset: 9.665534019470215
get_c4


100%|██████████| 256/256 [01:49<00:00,  2.33it/s]

dataset: 13.457582473754883
perplexity result:
wikitext2 9.665534019470215
c4 13.457582473754883





## Reproduce Zeroshot(Using lm-evaluation-harness)

In [None]:
TASK="hellaswag,winogrande,race,piqa,mmlu,arc_easy,arc_challenge,lambada,ceval-valid"
MODEL_PATH="/PATH/TO/MODEL"
CUDA_VISIBLE_DEVICES=0 lm_eval --model hf \
    --model_args pretrained=$MODEL_PATH \
    --tasks $TASK \
    --device cuda:0 \
    --batch_size 4 \
    --output_path ./results/ptq161/$MODEL_PATH

# Follow the lm-evaluation-harness installation instructions to set up the environment.
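
The same invocation can be assembled from Python, which makes it easier to keep the task list free of duplicates. This is only a sketch: `build_lm_eval_command` is a hypothetical helper, the flags are the ones used in the command above, and `--output_path` is assumed to be the spelling accepted by the installed lm-evaluation-harness version.

```python
import subprocess


def build_lm_eval_command(model_path, tasks, output_dir, batch_size=4, device="cuda:0"):
    """Return the lm_eval argument list for the given model and tasks."""
    # dict.fromkeys removes repeated task names while preserving their order.
    unique_tasks = list(dict.fromkeys(tasks))
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_path}",
        "--tasks", ",".join(unique_tasks),
        "--device", device,
        "--batch_size", str(batch_size),
        "--output_path", output_dir,
    ]


def run_lm_eval(model_path, tasks, output_dir):
    """Run the evaluation; requires lm-evaluation-harness to be installed."""
    command = build_lm_eval_command(model_path, tasks, output_dir)
    subprocess.run(command, check=True)
```

For example, `build_lm_eval_command("/PATH/TO/MODEL", ["hellaswag", "piqa", "hellaswag"], "./results/ptq161")` passes `hellaswag,piqa` as the task string.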