Step-by-Step

This document provides step-by-step instructions to run large language models (LLMs) on the 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids) with PyTorch and Intel® Extension for PyTorch.

The script run_clm_no_trainer.py currently supports quantization of GPT-J, OPT, LLaMA2, BLOOM, and Falcon, and validates last-word-prediction accuracy with lm_eval; more models are being added.

Prerequisite

1. Create Environment

# Installation
pip install -r requirements.txt
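
A fresh virtual environment is recommended before installing the requirements. A minimal sketch is shown below; the environment name llm-env is an illustrative assumption, and any venv or conda environment works.

# Optional: create and activate an isolated environment first (name is illustrative)
python -m venv llm-env
source llm-env/bin/activate
pip install -r requirements.txt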

Run

Here is how to run the scripts:

Causal Language Modeling (CLM)

run_clm_no_trainer.py quantizes large language models using the NeelNanda/pile-10k dataset for calibration and validates accuracy on lambada_openai, piqa, winogrande, hellaswag, and other tasks provided by lm_eval. Example commands are given below.

GPT-J-6b

Quantization

# "--sq" is used to enable smooth quant
# "--int8_bf16_mixed" is used to enable int8-bf16 mixed mode for platform that natively supports bf16
python run_clm_no_trainer.py \
    --model EleutherAI/gpt-j-6B \
    --quantize \
    --sq \
    --alpha 1.0 \
    --output_dir "saved_results" \
    --ipex 

Notes: Smooth quantization here is based on torch.jit. Without past key values in example_inputs, the quantized model cannot be used for text generation. For the text-generation task, please refer to link.
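
As a quick sanity check that the jit-based INT8 artifact was produced and loads, a hedged sketch is given below; the file name saved_results/best_model.pt is an assumption about how the script names its output and may differ in your setup.

# Hedged check: load the saved TorchScript model (file name is assumed)
python -c "import torch; m = torch.jit.load('saved_results/best_model.pt'); m.eval(); print(type(m))"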

# "--approach weight_only" is used to enable weight only quantization.
python run_clm_no_trainer.py \
    --model EleutherAI/gpt-j-6B \
    --quantize \
    --approach weight_only \
    --woq_bits 4 \
    --woq_group_size 128 \
    --woq_scheme asym  \
    --woq_algo RTN \
    --woq_enable_mse_search \
    --output_dir "saved_results"

# "--woq_algo GPTQ" is used to enable GPTQ algorithms
python run_clm_no_trainer.py \
    --model EleutherAI/gpt-j-6B \
    --dataset NeelNanda/pile-10k \
    --seed 0 \
    --quantize \
    --approach weight_only \
    --woq_algo GPTQ \
    --woq_bits 4 \
    --woq_scheme asym \
    --woq_group_size 128 \
    --gptq_pad_max_length 2048 \
    --gptq_use_max_length

Notes: Weight-only quantization based on fake quantization is supported in preview, including the RTN, GPTQ [1], AWQ [2], and TEQ algorithms. For more details, please refer to link. Our GPTQ API supports various CLMs, including GPT-J, OPT, BLOOM, LLaMA, Falcon, MPT, ChatGLM, etc. Simply replace the "--model" argument with another model to quantize a different CLM with GPTQ, as sketched below.
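
For instance, pointing the same GPTQ recipe at an MPT checkpoint only requires changing "--model"; the model id mosaicml/mpt-7b below is purely illustrative, and some architectures may need additional model-specific handling inside the script.

# Hedged sketch: GPTQ weight-only quantization of an MPT model (model id is illustrative)
python run_clm_no_trainer.py \
    --model mosaicml/mpt-7b \
    --dataset NeelNanda/pile-10k \
    --seed 0 \
    --quantize \
    --approach weight_only \
    --woq_algo GPTQ \
    --woq_bits 4 \
    --woq_scheme asym \
    --woq_group_size 128 \
    --gptq_pad_max_length 2048 \
    --gptq_use_max_length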

Accuracy with lm_eval

# INT8 accuracy: the command below loads the INT8 model saved in "saved_results" (see notes below for FP32)
python run_clm_no_trainer.py \
    --model EleutherAI/gpt-j-6B \
    --accuracy \
    --approach weight_only \
    --batch_size 112 \
    --tasks "lambada_openai"\
    --int8 \
    --output_dir "saved_results"  # load int8 model

# to validate the FP32 model, please remove "--int8" and "--output_dir"; a worked example follows.
# to validate the int8 model generated with `--sq`, please remove "--approach weight_only".
# to validate the int8 model quantized with ipex, please include "--ipex".
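
Applying the first note literally, a sketch of the FP32 evaluation (the same command with "--int8" and "--output_dir" dropped):

# FP32 accuracy: evaluate the original, unquantized model
python run_clm_no_trainer.py \
    --model EleutherAI/gpt-j-6B \
    --accuracy \
    --approach weight_only \
    --batch_size 112 \
    --tasks "lambada_openai"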

OPT-1.3b/2.7b/6.7b

Quantization

# "--sq" is used to enable smooth quant
# "--int8_bf16_mixed" is used to enable int8-bf16 mixed mode for platform that natively supports bf16
python run_clm_no_trainer.py \
    --model facebook/opt-2.7b \
    --quantize \
    --sq \
    --alpha 0.5 \
    --ipex \
    --output_dir "saved_results" \
    --int8_bf16_mixed 

# "--woq_algo GPTQ" is used to enable GPTQ algorithms
python run_clm_no_trainer.py \
    --model facebook/opt-1.3b \
    --dataset NeelNanda/pile-10k \
    --seed 0 \
    --quantize \
    --approach weight_only \
    --woq_algo GPTQ \
    --woq_bits 4 \
    --woq_scheme asym \
    --woq_group_size 128 \
    --gptq_pad_max_length 2048 \
    --gptq_use_max_length

Accuracy with lm_eval

python run_clm_no_trainer.py \
    --model facebook/opt-2.7b \
    --accuracy \
    --approach weight_only \
    --batch_size 112 \
    --tasks "lambada_openai" \
    --int8 \
    --output_dir "saved_results"  # load int8 model

# to validate the FP32 model, please remove "--int8" and "--output_dir".
# to validate the int8 model generated with `--sq`, please remove "--approach weight_only".
# to validate the int8 model quantized with ipex, please include "--ipex".

LLAMA2-7b/13b/70b

Note: LLaMA requires IPEX >= 2.1 for better accuracy; a quick version check is sketched below.
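
To confirm the installed Intel Extension for PyTorch meets this requirement, a hedged check (the upgrade line is a generic pip invocation, not a pinned recommendation):

# Print the installed IPEX version; expect >= 2.1 for LLaMA
python -c "import intel_extension_for_pytorch as ipex; print(ipex.__version__)"
# If it is older, upgrade (exact version pinning may depend on your PyTorch build)
pip install --upgrade intel-extension-for-pytorch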

Quantization

# "--sq" is used to enable smooth quant
# "--int8_bf16_mixed" is used to enable int8-bf16 mixed mode for platform that natively supports bf16
python run_clm_no_trainer.py \
    --model meta-llama/Llama-2-7b-hf \
    --quantize \
    --sq \
    --alpha 0.8 \
    --ipex \
    --output_dir "saved_results" \
    --int8_bf16_mixed 

# "--woq_algo GPTQ" is used to enable GPTQ algorithms
python run_clm_no_trainer.py \
    --model meta-llama/Llama-2-7b-hf \
    --dataset NeelNanda/pile-10k \
    --seed 0 \
    --quantize \
    --approach weight_only \
    --woq_algo GPTQ \
    --woq_bits 4 \
    --woq_scheme asym \
    --woq_group_size 128 \
    --gptq_pad_max_length 2048 \
    --gptq_use_max_length

Accuracy with lm_eval

python run_clm_no_trainer.py \
    --model meta-llama/Llama-2-7b-hf \
    --accuracy \
    --approach weight_only \
    --batch_size 112 \
    --tasks  "lambada_openai" \
    --int8 \
    --output_dir "saved_results"  # load int8 model

# to validate the FP32 model, please remove "--int8" and "--output_dir".
# to validate the int8 model generated with `--sq`, please remove "--approach weight_only".
# to validate the int8 model quantized with ipex, please include "--ipex".

BLOOM

Quantization

# "--sq" is used to enable smooth quant
python run_clm_no_trainer.py \
    --model bigscience/bloom-560m \
    --quantize \
    --ipex \
    --sq \
    --alpha 0.5 \
    --output_dir "saved_results"

# "--woq_algo GPTQ" is used to enable GPTQ algorithms
python run_clm_no_trainer.py \
    --model bigscience/bloom-560m \
    --dataset NeelNanda/pile-10k \
    --seed 0 \
    --quantize \
    --approach weight_only \
    --woq_algo GPTQ \
    --woq_bits 4 \
    --woq_scheme asym \
    --woq_group_size 128 \
    --gptq_pad_max_length 2048 \
    --gptq_use_max_length

Accuracy with lm_eval

python run_clm_no_trainer.py \
    --model bigscience/bloom-560m \
    --accuracy \
    --approach weight_only \
    --batch_size 112 \
    --tasks  "lambada_openai" \
    --int8 \
    --output_dir "saved_results"  # load int8 model

# to validate the FP32 model, please remove "--int8" and "--output_dir".
# to validate the int8 model generated with `--sq`, please remove "--approach weight_only".
# to validate the int8 model quantized with ipex, please include "--ipex".

Falcon-7b

Quantization

# "--sq" is used to enable smooth quant
python run_clm_no_trainer.py \
    --model tiiuae/falcon-7b-instruct \
    --quantize \
    --sq \
    --alpha 0.5 \
    --output_dir "saved_results"

# "--woq_algo GPTQ" is used to enable GPTQ algorithms
python run_clm_no_trainer.py \
    --model tiiuae/falcon-7b-instruct \
    --dataset NeelNanda/pile-10k \
    --seed 0 \
    --quantize \
    --approach weight_only \
    --woq_algo GPTQ \
    --woq_bits 4 \
    --woq_scheme asym \
    --woq_group_size 128 \
    --gptq_pad_max_length 2048 \
    --gptq_use_max_length

Accuracy with lm_eval

python run_clm_no_trainer.py \
    --model tiiuae/falcon-7b-instruct \
    --accuracy \
    --approach weight_only \
    --batch_size 112 \
    --tasks  "lambada_openai" \
    --int8 \
    --output_dir "saved_results"  # load int8 model

# to validate the FP32 model, please remove "--int8" and "--output_dir".

[1]. Frantar, Elias, et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." arXiv preprint arXiv:2210.17323 (2022).
[2]. Lin, Ji, et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv preprint arXiv:2306.00978 (2023).