First, install the dependencies below to get started. As these features are available on the `main` branches only, we need to install the libraries below from source.

## Basic usage

As with 8bit models, you can load and convert a model in 4bit by just adding the argument `load_in_4bit`! As simple as that!
Let's first try to load small models, by starting with `facebook/opt-350m`.

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-350m"

model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


RuntimeError: CUDA is required but not available for bitsandbytes. Please consider installing the multi-platform enabled version of bitsandbytes, which is currently a work in progress. Please check currently supported platforms and installation instructions at https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend

The model conversion technique is very similar to the one presented in the [8 bit integration blogpost](https://huggingface.co/blog/hf-bitsandbytes-integration) - it is based on module replacement. If you print the model, you will see that most of the `nn.Linear` layers are replaced by `bnb.nn.Linear4bit` layers!

In [3]:
print(model)

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 512, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
      (project_out): Linear4bit(in_features=1024, out_features=512, bias=False)
      (project_in): Linear4bit(in_features=512, out_features=1024, bias=False)
      (layers): ModuleList(
        (0-23): 24 x OPTDecoderLayer(
          (self_attn): OPTSdpaAttention(
            (k_proj): Linear4bit(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear4bit(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear4bit(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear4bit(in_features=1024, out_features=1024, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear4bit(in_features=1024, out_features=4096, bias=True)
       

Once loaded, run a prediction as you would with a regular model:

In [4]:
text = "Hello my name is"
device = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))



Hello my name is jimmy and I am a new member of the reddit clan. I am a new member of


## Advanced usage

In this section, let's review advanced usage of the 4bit integration. First, you need to understand the different arguments that can be tweaked and used.

All these parameters can be changed by using the `BitsAndBytesConfig` from `transformers` and passing it to the `quantization_config` argument when calling `from_pretrained`.

Make sure to pass `load_in_4bit=True` when using the `BitsAndBytesConfig`!

### Changing the compute dtype

The compute dtype sets the dtype used during computation. For example, hidden states could be in `float32` while computation is set to `bf16` for speedups.
By default, the compute dtype is set to `float32`.
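
As a minimal sketch (assuming `torch` is installed alongside a `bitsandbytes`-enabled `transformers`), the compute dtype can be set on its own through `bnb_4bit_compute_dtype`. The variable name `bf16_compute_config` is only illustrative.

```python
import torch
from transformers import BitsAndBytesConfig

# Weights are stored in 4bit; computation (e.g. matrix multiplications) runs in bfloat16.
bf16_compute_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```

The resulting object is passed to `from_pretrained` through `quantization_config`, in the same way as the other configs in this section.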


### Changing the quantization type

The 4bit integration comes with 2 different quantization types: FP4 and NF4. The NF4 dtype stands for Normal Float 4 and was introduced in the [QLoRA paper](https://arxiv.org/abs/2305.14314).

You can switch between these two dtypes using `bnb_4bit_quant_type` from `BitsAndBytesConfig`. By default, the FP4 quantization is used.

In [5]:
from transformers import BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)

In [6]:
outputs = model_nf4.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Hello my name is John and I am a very happy man. I am a very happy man. I am a very
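
To give some intuition for why the choice of 4bit levels matters, here is a toy sketch in plain Python. It assumes a simplified view: a block of weights is scaled by its absolute maximum, then each value is snapped to the nearest of 16 levels. The normal-quantile levels are built with `statistics.NormalDist` purely for illustration and are not the exact NF4 codebook used by `bitsandbytes`; the evenly spaced grid is a generic stand-in, not FP4 itself. All function names are hypothetical.

```python
import random
from statistics import NormalDist


def normal_quantile_levels(n_levels=16):
    """Toy codebook: quantiles of N(0, 1) at equal-probability bin midpoints, rescaled to [-1, 1]."""
    nd = NormalDist()
    # Bin midpoints avoid the infinite quantiles at probabilities 0 and 1.
    raw = [nd.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    top = max(abs(v) for v in raw)
    return [v / top for v in raw]


def uniform_levels(n_levels=16):
    """Toy codebook: evenly spaced values in [-1, 1]."""
    return [-1 + 2 * i / (n_levels - 1) for i in range(n_levels)]


def quantize_block(block, levels):
    """Absmax-scale a block, snap each value to its nearest level, then dequantize."""
    scale = max(abs(w) for w in block) or 1.0
    codes = [
        min(range(len(levels)), key=lambda k: abs(w / scale - levels[k]))
        for w in block
    ]
    dequantized = [levels[c] * scale for c in codes]
    return codes, dequantized


def mean_abs_error(block, levels):
    """Average absolute difference between the original and dequantized values."""
    _, dequantized = quantize_block(block, levels)
    return sum(abs(a - b) for a, b in zip(block, dequantized)) / len(block)


random.seed(0)
# One block of 64 normally distributed "weights".
weights = [random.gauss(0.0, 0.02) for _ in range(64)]
print("normal-quantile levels:", mean_abs_error(weights, normal_quantile_levels()))
print("uniform levels:        ", mean_abs_error(weights, uniform_levels()))
```

Each weight is stored as a 4bit code (an index from 0 to 15) plus one scale per block. The codebook decides where the 16 representable values sit.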


### Use nested quantization for more memory efficient inference and training

We also advise users to use the nested quantization technique. This saves more memory at no additional performance cost - from our empirical observations, this enables fine-tuning the llama-13b model on an NVIDIA-T4 16GB with a sequence length of 1024, batch size of 1 and gradient accumulation steps of 4.

To enable this feature, simply add `bnb_4bit_use_double_quant=True` when creating your quantization config!

In [7]:
from transformers import BitsAndBytesConfig

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

model_double_quant = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=double_quant_config)

In [8]:
outputs = model_double_quant.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Hello my name is jimmy and I am a new member of the reddit clan. I am a new member of
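
The memory saved by nested quantization can be estimated with a little arithmetic. The sketch below assumes the settings described in the QLoRA paper: one 32-bit quantization constant per block of 64 weights, and, with nesting, 8-bit constants that are themselves quantized in blocks of 256 with a 32-bit second-level constant. The defaults in a given `bitsandbytes` version may differ, and the function names are hypothetical.

```python
def overhead_bits_per_param(block_size=64, const_bits=32):
    """Extra bits per weight spent on one quantization constant per block."""
    return const_bits / block_size


def nested_overhead_bits_per_param(
    block_size=64, nested_const_bits=8, nested_block_size=256, const_bits=32
):
    """Extra bits per weight when the constants are themselves quantized."""
    first_level = nested_const_bits / block_size
    second_level = const_bits / (block_size * nested_block_size)
    return first_level + second_level


plain = overhead_bits_per_param()
nested = nested_overhead_bits_per_param()
saved_bits = plain - nested

n_params = 13e9  # roughly the size of a 13B-parameter model
saved_gb = saved_bits * n_params / 8 / 1e9

print(f"overhead without nesting: {plain:.3f} bits per parameter")
print(f"overhead with nesting:    {nested:.3f} bits per parameter")
print(f"approximate saving for 13B parameters: {saved_gb:.2f} GB")
```

Under these assumptions the overhead drops from 0.5 to about 0.127 bits per parameter.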


### Combining all the features together

Of course, the features are not mutually exclusive. You can combine these features together inside a single quantization config. Let us assume you want to run a model with `nf4` as the quantization type, with nested quantization and using `bfloat16` as the compute dtype:

In [9]:
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

In [10]:
outputs = model_4bit.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Hello my name is John and I am a very happy man. I am a very happy man. I am a very


## Pushing the limits of Google Colab

How far can we go using 4bit quantization? We'll see below that it is possible to load a 20B-scale model (40GB in half precision) entirely on the GPU using this quantization method! 🤯
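
The 40GB figure follows from a back-of-the-envelope calculation. This rough sketch counts only the weights; it ignores quantization constants, activations, and any layers that stay unquantized. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 20e9  # a 20B-scale model
for label, bits in [("half precision", 16), ("8bit", 8), ("4bit", 4)]:
    print(f"{label:>14}: {weight_memory_gb(n_params, bits):.0f} GB")
```

For 20B parameters this gives 40 GB in half precision and 10 GB in 4bit.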

Let's load the model with NF4 quantization type for better results, `bfloat16` compute dtype as well as nested quantization for a more memory efficient model loading.

In [11]:
del model, model_4bit

## My Experiments

In [4]:
import time
import torch
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Load model and tokenizer
model_name = "gpt2"  # Replace with your LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [6]:

# Load dataset
dataset = load_dataset("xsum", split="validation[:20]", trust_remote_code=True)  # Limit for quick testing


Downloading data: 100%|██████████| 2.72M/2.72M [00:00<00:00, 8.75MB/s]
Generating train split: 100%|██████████| 204045/204045 [00:23<00:00, 8832.67 examples/s]
Generating validation split: 100%|██████████| 11332/11332 [00:14<00:00, 801.44 examples/s]
Generating test split: 100%|██████████| 11334/11334 [00:14<00:00, 801.35 examples/s]


In [15]:
def calculate_perplexity(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
    with torch.no_grad():
        start_time = time.time()
        outputs = model(**inputs, labels=inputs["input_ids"])
        end_time = time.time()
    loss = outputs.loss.item()
    inference_time = end_time - start_time
    return np.exp(loss), inference_time

def generate_and_evaluate_bleu(input_text, reference_text, max_tokens=64):
    inputs = tokenizer.encode(input_text, return_tensors="pt", truncation=True, max_length=512).to(device)

    start_time = time.time()
    outputs = model.generate(inputs, max_new_tokens=max_tokens)
    end_time = time.time()

    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    smoothie = SmoothingFunction().method4
    reference = [reference_text.split()]
    candidate = generated_text.split()
    bleu = sentence_bleu(reference, candidate, smoothing_function=smoothie)

    inference_time = end_time - start_time
    return generated_text, bleu, inference_time
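
For reference, `calculate_perplexity` above exponentiates the model's loss, which for a causal language model is the mean negative log-likelihood per predicted token. The plain-Python sketch below shows the same relationship on hand-made log-probabilities; the function name `perplexity_from_token_logprobs` is hypothetical and no model is involved.

```python
import math


def perplexity_from_token_logprobs(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every observed token
# is as uncertain as a uniform choice among 4 options.
uniform_logprobs = [math.log(0.25)] * 10
print(perplexity_from_token_logprobs(uniform_logprobs))

# A more confident model gets a lower perplexity.
confident_logprobs = [math.log(0.9)] * 10
print(perplexity_from_token_logprobs(confident_logprobs))
```

Lower values mean the model finds the text less surprising, which is why perplexity is a convenient quantity to compare before and after quantization.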


In [16]:


# Run evaluation
for sample in dataset:
    input_text = sample["document"]
    reference_text = sample["summary"]

    perplexity, ppl_time = calculate_perplexity(input_text)
    generated, bleu, gen_time = generate_and_evaluate_bleu(input_text, reference_text)

    print("="*60)
    print(f"Input: {input_text[:200]}...")
    print(f"Reference: {reference_text}")
    print(f"Generated: {generated}")
    print(f"Perplexity: {perplexity:.2f} (Time: {ppl_time:.2f}s)")
    print(f"BLEU Score: {bleu:.4f} (Time: {gen_time:.2f}s)")


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Input: The ex-Reading defender denied fraudulent trading charges relating to the Sodje Sports Foundation - a charity to raise money for Nigerian sport.
Mr Sodje, 37, is jointly charged with elder brothers Ef...
Reference: Former Premier League footballer Sam Sodje has appeared in court alongside three brothers accused of charity fraud.
Generated: The ex-Reading defender denied fraudulent trading charges relating to the Sodje Sports Foundation - a charity to raise money for Nigerian sport.
Mr Sodje, 37, is jointly charged with elder brothers Efe, 44, Bright, 50 and Stephen, 42.
Appearing at the Old Bailey earlier, all four denied the offence.
The charge relates to offences which allegedly took place between 2008 and 2014.
Sam, from Kent, Efe and Bright, of Greater Manchester, and Stephen, from Bexley, are due to stand trial in July.
They were all released on bail.
The charges relate to the alleged trading of a £1,000,000 prize for a football match between the two sides in the summer of 



Input: Voges was forced to retire hurt on 86 after suffering the injury while batting during the County Championship draw with Somerset on 4 June.
Middlesex hope to have the Australian back for their T20 Bla...
Reference: Middlesex batsman Adam Voges will be out until August after suffering a torn calf muscle in his right leg.
Generated: Voges was forced to retire hurt on 86 after suffering the injury while batting during the County Championship draw with Somerset on 4 June.
Middlesex hope to have the Australian back for their T20 Blast game against Hampshire at Lord's on 3 August.
The 37-year-old has scored 230 runs in four first-class games this season at an average of 57.50.
"Losing Adam is naturally a blow as he contributes significantly to everything we do," director of cricket Angus Fraser said.
"His absence, however, does give opportunities to other players who are desperate to play in the first XI.
"In the past we have coped well without an overseas player and I expect us to do



Input: Seven photographs taken in the Norfolk countryside by photographer Josh Olins will appear in the June edition.
In her first sitting for a magazine, the duchess is seen looking relaxed and wearing casu...
Reference: The Duchess of Cambridge will feature on the cover of British Vogue to mark the magazine's centenary.
Generated: Seven photographs taken in the Norfolk countryside by photographer Josh Olins will appear in the June edition.
In her first sitting for a magazine, the duchess is seen looking relaxed and wearing casual clothes.
The shoot was in collaboration with the National Portrait Gallery, where two images are being displayed in the Vogue 100: A Century of Style exhibition.
The duchess, who has a keen interest in photography, has been patron of the National Portrait Gallery since 2012.
Nicholas Cullinan, director of the National Portrait Gallery, said: "Josh has captured the duchess exactly as she is - full of life, with a great sense of humour, thoughtful and intellig



Input: Four police officers were injured in the incident on Friday night.
A man, aged 19, and a boy, aged 16, have been charged with six counts of aggravated vehicle taking.
They are due to appear before Bel...
Reference: Two teenagers have been charged in connection with an incident in west Belfast in which a car collided with two police vehicles.
Generated: Four police officers were injured in the incident on Friday night.
A man, aged 19, and a boy, aged 16, have been charged with six counts of aggravated vehicle taking.
They are due to appear before Belfast Magistrates' Court on Monday.
The 19-year-old man has also been charged with driving while disqualified and using a motor vehicle without insurance.
The boy, aged 16, and the 16-year-old boy are both charged with driving while disqualified.
The 16-year-old boy has been charged with driving while disqualified.
A police spokesman said: "We are investigating the incident and are looking into the circumstances surrounding the inciden



Input: The injured pedestrian - a young man - is thought to have been walking with a group of people from a graduation ceremony at the Caird Hall.
The incident took place on High Street at about 18:00.
The m...
Reference: A pedestrian has been struck by a taxi in Dundee after it mounted the pavement.
Generated: The injured pedestrian - a young man - is thought to have been walking with a group of people from a graduation ceremony at the Caird Hall.
The incident took place on High Street at about 18:00.
The man's injuries are believed not to be life-threatening. The driver of the taxi is thought to be uninjured.
A police spokesman said: "We are investigating the incident and are treating the incident as a serious incident.
"We are appealing for anyone with information to contact us."
The incident is being treated as a serious incident.
A spokesman for the Caird Hall said: "We are appealing for anyone with information
Perplexity: 16.91 (Time: 0.01s)
BLEU Score: 0.0050 (Time: 0.64s)




Input: Barca will be investigated for alleged misappropriation of funds in the £48.6m (57m euros) deal with Santos.
The signing of Neymar has been correct and his signing has caused despair and envy in some ...
Reference: Barcelona football club chief Sandro Rosell has resigned following a Spanish court's decision to look into last year's signing of Brazil star Neymar.
Generated: Barca will be investigated for alleged misappropriation of funds in the £48.6m (57m euros) deal with Santos.
The signing of Neymar has been correct and his signing has caused despair and envy in some of our adversaries
Rosell, speaking at a news conference after a Barca board meeting, insisted he had "acted correctly".
Vice-president Josep Maria Bartomeu now takes over from the 49-year-old Rosell, who came to power in 2010.
Rosell's future has been a real source of concern ever since a Spanish national court judge accepted a lawsuit this week from Barcelona club member Jordi Cases, who alleged that the amount 



Input: The think tank said the city's 1,536 schools needed to save £360m in the first year if the government's National Funding Formula (NFF) plan goes ahead.
The amount is the equivalent of 12,857 qualified...
Reference: About 70% of London schools could face budget cuts under government plans to change how they are funded, according to London Councils.
Generated: The think tank said the city's 1,536 schools needed to save £360m in the first year if the government's National Funding Formula (NFF) plan goes ahead.
The amount is the equivalent of 12,857 qualified teachers, on an average salary of £28,000.
The government said London was the highest funded part of the country.
It added that under the plans, which are under consultation, inner-city schools would be allocated 30% more money per pupil than the national average.
But London Councils, which represents the city's 32 boroughs and the City, said no school would gain enough funding from the NFF to compensate for increased cost pres



Input: His 110 means he has scored 323 runs in a week after an unbeaten 93 against Glamorgan in the One-Day Cup and 120 not out against Kent in the T20 Blast.
Tim Murtagh (2-85) reduced Surrey to 23-2 inside...
Reference: Jason Roy continued his fine form with a second century in six days as Surrey made a strong start with the bat against Middlesex at Lord's.
Generated: His 110 means he has scored 323 runs in a week after an unbeaten 93 against Glamorgan in the One-Day Cup and 120 not out against Kent in the T20 Blast.
Tim Murtagh (2-85) reduced Surrey to 23-2 inside the first six overs, before Rory Burns (88) aided the recovery.
Burns and Roy put on a 118-run fourth wicket stand as Surrey closed on 384-8.
Roy's century was a fine retort against Division One leaders Middlesex, who dismissed the England limited-overs opener for a first-ball duck in the One-Day Cup on Tuesday.
After paceman Murtagh removed both Zafar Ansari and Dominic Sibley early on, Surrey's slump continued as James F



Input: Ms Kendall told the BBC Labour risked sending a "resignation letter to the British people as a serious party of government" by electing Mr Corbyn.
Separately, Ms Cooper warned there was a "serious ris...
Reference: Labour leadership hopefuls Liz Kendall and Yvette Cooper have said their supporters should back anyone other than Jeremy Corbyn in the contest.
Generated: Ms Kendall told the BBC Labour risked sending a "resignation letter to the British people as a serious party of government" by electing Mr Corbyn.
Separately, Ms Cooper warned there was a "serious risk the party will split" if the left-winger becomes its leader.
It comes as Labour begins sending out the first ballot papers to voters.
The result of the contest will be announced at a special conference on 12 September.
More than 600,000 people have signed up to vote in the four-way contest but Labour has said applications are still being verified.
610,753
total electorate, though this may fall as party removes those n



Input: The Association of School and College Leaders says England's schools have had to make more than £1bn savings this year, rising to £3bn by 2020.
The government says school funding is at a record £40bn,...
Reference: Head teachers say they are axing GCSE and A-level subjects, increasing class sizes and cutting support services as they struggle with school funding.
Generated: The Association of School and College Leaders says England's schools have had to make more than £1bn savings this year, rising to £3bn by 2020.
The government says school funding is at a record £40bn, with rises ahead.
Education Secretary Justine Greening will hear heads' cash grievances at Friday's ASCL conference in Birmingham.
She is due to address the union, which has published a survey of its members on the issue.
It suggests schools are finding it difficult to make savings without cutting provision and that things are predicted to get worse over the next two years.
Cost pressures are rising as greater pa



Input: Media playback is unsupported on your device
2 November 2014 Last updated at 10:11 GMT
The BBC's Ireland correspondent, Chris Buckler, reports ....
Reference: As the UK considers greater devolution in the aftermath of Scotland's independence referendum, should a troubled Northern Ireland Assembly push for more powers over its own affairs?
Generated: Media playback is unsupported on your device
2 November 2014 Last updated at 10:11 GMT
The BBC's Ireland correspondent, Chris Buckler, reports .

The BBC's Irish correspondent, Chris Buckler, reports that the IRA has been "shocked" by the news that the IRA has been "shocked" by the news that the IRA has been "shocked" by the news that the IRA has been "shocked" by the news that the IRA
Perplexity: 60.46 (Time: 0.01s)
BLEU Score: 0.0061 (Time: 0.65s)
Input: Ann Barnes dealt with an "ill-advised" TV documentary, a probe into her car insurance, and youth commissioners who both had to step away from their role.
But she said she judged he



Input: The man grabbed hold of a child's bag outside Heronsgate School in Lichfield Down, Walnut Tree, at about 08:20 GMT on Wednesday.
The man said, "you're coming with me" before the pupil broke free.
"The...
Reference: A man tried to abduct a boy outside a primary school in Milton Keynes, the school said.
Generated: The man grabbed hold of a child's bag outside Heronsgate School in Lichfield Down, Walnut Tree, at about 08:20 GMT on Wednesday.
The man said, "you're coming with me" before the pupil broke free.
"The incident has been reported to police who are now investigating," the school said. The offender is said to be white and in his 30s.
He had blonde hair and a scratch on his left cheek. He was wearing blue jeans, a blue-green t-shirt and Converse trainers.
He was wearing a black T-shirt and a black T-shirt with a white collar.
He was wearing a black T-shirt with a white collar and a black T-shirt with a white collar.
He was wearing a black T-shirt with a white collar and a bla



Input: Taylor, 25, joined County in May from Macclesfield, but has yet to start in the league.
The move has left Newport with only one goalkeeper in Joe Day, but manager John Sheridan is confident he will qu...
Reference: Newport County goalkeeper Rhys Taylor has joined Wrexham on loan until January.
Generated: Taylor, 25, joined County in May from Macclesfield, but has yet to start in the league.
The move has left Newport with only one goalkeeper in Joe Day, but manager John Sheridan is confident he will quickly fill the vacancy.
"Rhys is too good a goalkeeper to be kept on the bench and not playing football," said Sheridan.
"Financially it might enable me to bring someone else in, to try and fill in a different area."
On Saturday Newport host fourth-placed Northampton Town hoping to win their third game in a row for the first time since last December, 2014.
The Exiles are also seeking their first home win since March.
Taylor's move means County are currently without a second goalkeep



Input: The Derbyshire club, who play in the eighth-tier Northern Premier League Division One North, have lost all 19 league and cup games this season.
New Mills have conceded 68 goals while three managers ha...
Reference: If Chelsea boss Jose Mourinho thought he was having a bad time, he should spare a thought for New Mills.
Generated: The Derbyshire club, who play in the eighth-tier Northern Premier League Division One North, have lost all 19 league and cup games this season.
New Mills have conceded 68 goals while three managers have left since June.
"It's tough but we've got a new squad and the players are starting to gel," Millers boss Garry Brown told BBC Radio 5 live's Non League Football Show.
Former Norwich City midfielder Keith Briggs took over from Roy Soule, who stepped down in June, but resigned after just 23 days for a job with Sheffield United's academy.
Andy Fearn was put in charge in July and appointed former Manchester City striker Shaun Goater as his assistant.
But Fea



Input: The referendum will take place on 10 March, but Bath Conservative MP Ben Howlett said he was concerned about a "lack of awareness" about the issue.
Mr Howlett also said he is worried about the public'...
Reference: An MP has criticised "the level of misinformation" about a referendum on an elected mayor for Bath and North East Somerset.
Generated: The referendum will take place on 10 March, but Bath Conservative MP Ben Howlett said he was concerned about a "lack of awareness" about the issue.
Mr Howlett also said he is worried about the public's level of engagement.
Bath and North East Somerset Council said the referendum had been publicised in press releases and tweets.
It also said it was the subject of a two-page article in the winter edition of the council magazine which was distributed to all households in the region.
A further news release and polling cards will also be sent out to all households this week, the authority added.
Supporters of the referendum say Bath needs a



Input: More than a dozen men stormed into the Kiwi cafe in the Georgian capital on Sunday evening, the cafe said, shouting and throwing meat at patrons.
A brawl erupted but the attackers fled before police a...
Reference: A vegan cafe in Tbilisi has appealed for public solidarity after being invaded by ultra-nationalists wielding grilled meat and sausages.
Generated: More than a dozen men stormed into the Kiwi cafe in the Georgian capital on Sunday evening, the cafe said, shouting and throwing meat at patrons.
A brawl erupted but the attackers fled before police arrived.
Police are now investigating, and say they have questioned the attackers and cafe staff. Nobody has been arrested.
The cafe has appealed for public support, saying it was no prank but a case of intimidation by neo-Nazis.
The attackers wore strings of sausages round their necks and threw chunks of meat onto customers' plates, the BBC's Rayhan Demytrie reports from Tbilisi.
They are known as the Bergmann group, and a soc



Input: The New Zealander made only one unenforced switch, bringing in Rhys Webb for Gareth Davies at scrum-half.
Sam Warburton, Alun Wyn Jones and Alex Cuthbert miss out with injuries.
"We wanted to give the...
Reference: Wales coach Warren Gatland resisted making more changes to his team against Italy to give his men a chance to make up for their poor start at Twickenham.
Generated: The New Zealander made only one unenforced switch, bringing in Rhys Webb for Gareth Davies at scrum-half.
Sam Warburton, Alun Wyn Jones and Alex Cuthbert miss out with injuries.
"We wanted to give the players a chance to sort of put behind us a disappointing first half from last week," said Gatland.
Flanker Justin Tipuric, second row Luke Charteris and Hallam Amos are drafted in to the team in place of the injured players.
Dan Lydiate will captain the side in the absence of Warburton and regular stand-in Jones.
Gatland said he had shown faith in players who had performed well in earlier matches, a fact he 

In [17]:
import torch
import time
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from datasets import load_dataset


In [18]:

class QuantizedModelPipeline:
    def __init__(self, model_name, quantization_config=None, max_tokens=64):
        self.model_name = model_name
        self.quantization_config = quantization_config
        self.max_tokens = max_tokens
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = None
        self.tokenizer = None

    def load_quantized_model(self):
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name, use_fast=True)
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_name,
            quantization_config=self.quantization_config,
            device_map="auto",
            torch_dtype=torch.bfloat16
        )
        self.model.eval()
        print(f"Loaded model: {self.model_name} with 4-bit quantization")

    def calculate_perplexity(self, text):
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(self.device)
        with torch.no_grad():
            start_time = time.time()
            outputs = self.model(**inputs, labels=inputs["input_ids"])
            # .item() waits for the GPU to finish, so the timing covers the whole forward pass
            loss = outputs.loss.item()
            end_time = time.time()
        return np.exp(loss), end_time - start_time

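    def generate_and_evaluate_bleu(self, input_text, reference_text):
        # Call the tokenizer directly so the attention mask is returned alongside input_ids
        inputs = self.tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512).to(self.device)

        start_time = time.time()
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=self.max_tokens,
            pad_token_id=self.tokenizer.eos_token_id,  # same fallback generate() applies, set explicitly
        )
        end_time = time.time()

        # outputs[0] holds the prompt followed by the continuation, so BLEU below scores both
        generated_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)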

        smoothie = SmoothingFunction().method4
        reference = [reference_text.split()]
        candidate = generated_text.split()
        bleu_score = sentence_bleu(reference, candidate, smoothing_function=smoothie)

        return generated_text, bleu_score, end_time - start_time

    def evaluate_sample(self, input_text, reference_text):
        ppl, ppl_time = self.calculate_perplexity(input_text)
        generated_text, bleu, gen_time = self.generate_and_evaluate_bleu(input_text, reference_text)

        return {
            "generated": generated_text,
            "perplexity": ppl,
            "perplexity_time": ppl_time,
            "bleu_score": bleu,
            "generation_time": gen_time
        }


In [19]:

# Define quant config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Create pipeline
pipeline = QuantizedModelPipeline("gpt2", quantization_config=bnb_config)
pipeline.load_quantized_model()

# Load a dataset sample
dataset = load_dataset("xsum", split="validation[:5]")
sample = dataset[0]
input_text = sample["document"]
reference_text = sample["summary"]

# Evaluate
result = pipeline.evaluate_sample(input_text, reference_text)
print(result)


Loaded model: gpt2 with 4-bit quantization


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{'generated': 'The ex-Reading defender denied fraudulent trading charges relating to the Sodje Sports Foundation - a charity to raise money for Nigerian sport.\nMr Sodje, 37, is jointly charged with elder brothers Efe, 44, Bright, 50 and Stephen, 42.\nAppearing at the Old Bailey earlier, all four denied the offence.\nThe charge relates to offences which allegedly took place between 2008 and 2014.\nSam, from Kent, Efe and Bright, of Greater Manchester, and Stephen, from Bexley, are due to stand trial in July.\nThey were all released on bail.\nThe court heard that the two men were involved in a scheme to buy up the sport of football.\nThe court heard that the two men were involved in a scheme to buy up the sport of football.\nThe court heard that the two men were involved in a scheme to buy up the sport of football.\n', 'perplexity': np.float64(36.4367614031039), 'perplexity_time': 0.058586835861206055, 'bleu_score': 0.0038426860309137565, 'generation_time': 1.1299660205841064}
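
The `perplexity` field above is the exponential of the mean token-level cross-entropy that the model returns as `outputs.loss`. As a minimal sketch of that relationship, using made-up log-probabilities rather than real model output (the helper name `perplexity_from_log_probs` is hypothetical):

```python
import math


def perplexity_from_log_probs(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token (natural log)."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Made-up log-probabilities for four predicted tokens (illustrative values only).
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
ppl = perplexity_from_log_probs(log_probs)

# Going the other way: the mean loss implied by the perplexity reported above.
reported_ppl = 36.4367614031039
implied_loss = math.log(reported_ppl)

print(f"toy perplexity: {ppl:.4f}")
print(f"implied mean loss: {implied_loss:.4f}")
```

Because the perplexity is computed on the input document and not on the reference summary, it describes how well the model predicts the prompt and says nothing about summary quality.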


In [8]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch.nn as nn
import copy

# Load model and tokenizer
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Dummy input for loss evaluation
input_text = "The quick brown fox jumps over the lazy dog"
inputs = tokenizer(input_text, return_tensors="pt")
labels = inputs["input_ids"].clone()  # The causal LM shifts labels internally, so the inputs double as labels

def evaluate_loss(model):
    with torch.no_grad():
        outputs = model(**inputs, labels=labels)
        return outputs.loss.item()

# Baseline loss
original_loss = evaluate_loss(model)
print(f"Original loss: {original_loss:.4f}")

# Track loss after quantizing each layer
loss_after_quant = []

# Access the transformer blocks
transformer_blocks = model.transformer.h

# Loop through each layer
for i in range(len(transformer_blocks)):
    print(f"\nQuantizing layer {i}...")

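    # Deep copy the original model so that only layer i is quantized in this iteration
    model_quant = copy.deepcopy(model)

    # Apply dynamic quantization to the nn.Linear modules of this block.
    # Note: GPT-2 style blocks implement their projections with transformers' Conv1D,
    # not nn.Linear, so this call leaves the block unchanged (hence the identical losses).
    target_layer = model_quant.transformer.h[i]
    model_quant.transformer.h[i] = torch.quantization.quantize_dynamic(
        target_layer, {nn.Linear}, dtype=torch.qint8
    )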

    # Evaluate loss
    loss = evaluate_loss(model_quant)
    loss_after_quant.append((i, loss))
    print(f"Loss after quantizing layer {i}: {loss:.4f}")

print("\nSummary of loss after each layer's quantization:")
for i, l in loss_after_quant:
    print(f"Layer {i}: Loss = {l:.4f}")


Original loss: 6.8673

Quantizing layer 0...
Loss after quantizing layer 0: 6.8673

Quantizing layer 1...
Loss after quantizing layer 1: 6.8673

Quantizing layer 2...
Loss after quantizing layer 2: 6.8673

Quantizing layer 3...
Loss after quantizing layer 3: 6.8673

Quantizing layer 4...
Loss after quantizing layer 4: 6.8673

Quantizing layer 5...
Loss after quantizing layer 5: 6.8673

Summary of loss after each layer's quantization:
Layer 0: Loss = 6.8673
Layer 1: Loss = 6.8673
Layer 2: Loss = 6.8673
Layer 3: Loss = 6.8673
Layer 4: Loss = 6.8673
Layer 5: Loss = 6.8673
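
Every loss above is identical to the baseline because `quantize_dynamic` only replaces the module types it is given, and the distilgpt2 blocks contain no `nn.Linear`. The sketch below illustrates one possible workaround on a toy block: convert each projection into an equivalent `nn.Linear` first, then quantize. `Conv1DLike` is a hypothetical stand-in that assumes the GPT-2 projection layout (weight stored as `(in, out)`, output `x @ W + b`); it is not the transformers class itself, and `conv1d_like_to_linear` and `count_linear` are illustrative helpers.

```python
import torch
import torch.nn as nn


class Conv1DLike(nn.Module):
    """Toy stand-in for a GPT-2 style projection: y = x @ W + b, with W shaped (in, out)."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_features, out_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        return x @ self.weight + self.bias


def conv1d_like_to_linear(layer):
    """Build an nn.Linear that computes the same function as `layer`."""
    in_features, out_features = layer.weight.shape
    linear = nn.Linear(in_features, out_features)
    with torch.no_grad():
        linear.weight.copy_(layer.weight.t())  # nn.Linear stores its weight as (out, in)
        linear.bias.copy_(layer.bias)
    return linear


def count_linear(module):
    """Count float nn.Linear submodules, i.e. what quantize_dynamic({nn.Linear}) can target."""
    return sum(isinstance(m, nn.Linear) for m in module.modules())


torch.manual_seed(0)
block = nn.Sequential(Conv1DLike(8, 16), nn.GELU(), Conv1DLike(16, 8))
x = torch.randn(2, 8)
reference = block(x)

# Before conversion there is nothing for quantize_dynamic to replace.
linear_before = count_linear(block)

# Swap each toy projection for an equivalent nn.Linear.
converted = nn.Sequential(*[
    conv1d_like_to_linear(m) if isinstance(m, Conv1DLike) else m for m in block
])
linear_after = count_linear(converted)
max_abs_diff = (converted(x) - reference).abs().max().item()

# Dynamic quantization now finds nn.Linear modules and replaces them in a copy.
quantized = torch.quantization.quantize_dynamic(converted, {nn.Linear}, dtype=torch.qint8)
remaining_float_linear = count_linear(quantized)

print(linear_before, linear_after, remaining_float_linear, max_abs_diff)
```

Applied to a real model, the same idea would mean replacing the projections inside each transformer block before calling `quantize_dynamic`; whether the resulting loss changes are meaningful on a single short sentence is a separate question.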


In [34]:

class QuantizedModelPipeline:
    def __init__(self, model_name, quantization_config=None, max_tokens=64):
        self.model_name = model_name
        self.quantization_config = quantization_config
        self.max_tokens = max_tokens
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = None
        self.tokenizer = None

    def load_quantized_model(self):
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name, use_fast=True)
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_name,
            quantization_config=self.quantization_config,
            device_map="auto",
            torch_dtype=torch.bfloat16
        )
        self.model.eval()
        print(f"Loaded model: {self.model_name} with 4-bit quantization")

    def calculate_perplexity(self, text):
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(self.device)
        with torch.no_grad():
            start_time = time.time()
            outputs = self.model(**inputs, labels=inputs["input_ids"])
            # .item() waits for the GPU to finish, so the timing covers the whole forward pass
            loss = outputs.loss.item()
            end_time = time.time()
        return np.exp(loss), end_time - start_time

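    def generate_and_evaluate_bleu(self, input_text, reference_text):
        # Call the tokenizer directly so the attention mask is returned alongside input_ids
        inputs = self.tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512).to(self.device)

        start_time = time.time()
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=self.max_tokens,
            pad_token_id=self.tokenizer.eos_token_id,  # same fallback generate() applies, set explicitly
        )
        end_time = time.time()

        # outputs[0] holds the prompt followed by the continuation, so BLEU below scores both
        generated_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)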

        smoothie = SmoothingFunction().method4
        reference = [reference_text.split()]
        candidate = generated_text.split()
        bleu_score = sentence_bleu(reference, candidate, smoothing_function=smoothie)

        return generated_text, bleu_score, end_time - start_time

    def evaluate_sample(self, input_text, reference_text):
        ppl, ppl_time = self.calculate_perplexity(input_text)
        generated_text, bleu, gen_time = self.generate_and_evaluate_bleu(input_text, reference_text)

        return {
            "generated": generated_text,
            "perplexity": ppl,
            "perplexity_time": ppl_time,
            "bleu_score": bleu,
            "generation_time": gen_time
        }

    def compute_layerwise_importance(self, text):
        print("Computing layerwise importance...")
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(self.device)
        labels = inputs["input_ids"]

        with torch.no_grad():
            full_output = self.model(**inputs, labels=labels)
            base_loss = full_output.loss.item()

        importance_scores = {}

        # Adjust depending on model architecture
        try:
            layers = self.model.model.layers  # LLaMA-style models (OPT keeps its layers under model.model.decoder.layers)
        except AttributeError:
            layers = self.model.transformer.h  # GPT-2, distilGPT2

        for i, layer in enumerate(layers):
            def forward_hook(module, input, output):
                if isinstance(output, tuple):
                    return tuple(torch.zeros_like(t) if isinstance(t, torch.Tensor) else t for t in output)
                return torch.zeros_like(output) if isinstance(output, torch.Tensor) else output

            handle = layer.register_forward_hook(forward_hook)

            with torch.no_grad():
                ablated_output = self.model(**inputs, labels=labels)
                ablated_loss = ablated_output.loss.item()

            delta = ablated_loss - base_loss
            importance_scores[i] = delta

            handle.remove()
            print(f"Layer {i}: ΔLoss = {delta:.4f}")

        return importance_scores


    def compute_importance_by_gradient_diff(self, text):
        print("Computing gradient-difference-based importance...")
        self.model.eval()  # Gradients flow in eval mode too; keeping dropout off makes the two passes comparable

        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(self.device)
        labels = inputs["input_ids"]

        # 1. Get original gradients
        self.model.zero_grad()
        outputs = self.model(**inputs, labels=labels)
        loss = outputs.loss
        loss.backward()

        # Collect baseline gradients
        baseline_grads = {}
        for name, param in self.model.named_parameters():
            if param.grad is not None:
                baseline_grads[name] = param.grad.detach().clone()

        importance_scores = {}

        try:
            layers = self.model.model.layers
        except AttributeError:
            layers = self.model.transformer.h

        for i, layer in enumerate(layers):
            self.model.zero_grad()

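            # Define a perturbation hook that halves the layer's hidden-state output
            def hook_fn(module, input, output):
                if isinstance(output, tuple):
                    # Scale only the hidden states and keep any other outputs (e.g. cache, attentions)
                    return (output[0] * 0.5,) + tuple(output[1:])
                return output * 0.5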

            handle = layer.register_forward_hook(hook_fn)

            perturbed_outputs = self.model(**inputs, labels=labels)
            perturbed_loss = perturbed_outputs.loss
            perturbed_loss.backward()

            grad_diff = 0.0
            for name, param in self.model.named_parameters():
                if param.grad is not None and name in baseline_grads:
                    diff = param.grad - baseline_grads[name]
                    grad_diff += torch.norm(diff, p=2).item()

            importance_scores[i] = grad_diff
            handle.remove()
            print(f"Layer {i}: ∥Δgrad∥₂ = {grad_diff:.4f}")

        self.model.eval()
        return importance_scores

    def print_model_layers(self):
        print(f"\n📚 Model Layers in: {self.model_name}")
        print("=" * 50)
        for name, module in self.model.named_modules():
            print(name)



In [35]:
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)



In [36]:
pipeline = QuantizedModelPipeline("gpt2", quantization_config=quant_config)
pipeline.load_quantized_model()


Loaded model: gpt2 with 4-bit quantization


In [37]:

sample_text = "The theory of relativity revolutionized our understanding of space and time."
importance = pipeline.compute_layerwise_importance(sample_text)
print("\nLayerwise Importance:")
for layer_idx, score in importance.items():
    print(f"Layer {layer_idx}: Score = {score:.4f}")

Computing layerwise importance...
Layer 0: ΔLoss = 4.5209
Layer 1: ΔLoss = 5.0440
Layer 2: ΔLoss = 12.8126
Layer 3: ΔLoss = 13.0395
Layer 4: ΔLoss = 14.2315
Layer 5: ΔLoss = 8.0808
Layer 6: ΔLoss = 10.3122
Layer 7: ΔLoss = 6.7539
Layer 8: ΔLoss = 16.0767
Layer 9: ΔLoss = 8.7173
Layer 10: ΔLoss = 7.2698
Layer 11: ΔLoss = 4.9546

Layerwise Importance:
Layer 0: Score = 4.5209
Layer 1: Score = 5.0440
Layer 2: Score = 12.8126
Layer 3: Score = 13.0395
Layer 4: Score = 14.2315
Layer 5: Score = 8.0808
Layer 6: Score = 10.3122
Layer 7: Score = 6.7539
Layer 8: Score = 16.0767
Layer 9: Score = 8.7173
Layer 10: Score = 7.2698
Layer 11: Score = 4.9546
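
One way to read these scores is as a sensitivity ranking: layers whose ablation raises the loss the most are candidates for higher precision, and the rest are candidates for more aggressive quantization. The sketch below only sorts the ΔLoss values printed above. The helper names and the `keep_top=4` cut-off are assumptions made for illustration, and scores from a single sentence with zero-ablation are a rough signal at best.

```python
# ΔLoss values copied from the ablation output above (layer index -> score).
importance = {
    0: 4.5209, 1: 5.0440, 2: 12.8126, 3: 13.0395,
    4: 14.2315, 5: 8.0808, 6: 10.3122, 7: 6.7539,
    8: 16.0767, 9: 8.7173, 10: 7.2698, 11: 4.9546,
}


def rank_layers(scores):
    """Return layer indices sorted from most to least sensitive."""
    return sorted(scores, key=scores.get, reverse=True)


def split_by_sensitivity(scores, keep_top):
    """Hypothetical helper: split layers into the `keep_top` most sensitive and the rest."""
    ranked = rank_layers(scores)
    return sorted(ranked[:keep_top]), sorted(ranked[keep_top:])


ranking = rank_layers(importance)
sensitive, robust = split_by_sensitivity(importance, keep_top=4)

print("Most to least sensitive:", ranking)
print("Candidates for higher precision:", sensitive)
print("Candidates for aggressive quantization:", robust)
```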


As you can see, we were able to load and run the 4bit gpt-neo-x model entirely on the GPU