# Fine-tune Llama 2 in Google Colab
> 🗣️ Large Language Model Course

❤️ Created by [@maximelabonne](https://twitter.com/maximelabonne), based on Younes Belkada's [GitHub Gist](https://gist.github.com/younesbelkada/9f7f75c94bdc1981c8ca5cc937d4a4da). Special thanks to Tolga HOŞGÖR for his solution to empty the VRAM.

This notebook runs on a T4 GPU. (Last update: 24 Aug 2023)


In [1]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7


In [2]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

In [3]:
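import huggingface_hub
# A hardcoded access token leaks as soon as the notebook is shared.
# Read it from an environment variable instead (assumes HF_TOKEN is set in this runtime).
!huggingface-cli login --token $HF_TOKEN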

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


# **Float16**
Float16 is a half-precision floating-point format that uses 16 bits per number (1 sign bit, 5 exponent bits, 10 fraction bits). It can therefore represent only a limited range (the largest finite value is 65504) with limited precision (about 3 decimal digits). Float16 is typically used where memory footprint and bandwidth are major concerns, such as machine learning and graphics processing.
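The limits of float16 are easy to see by rounding ordinary Python floats to half precision. The sketch below uses only the standard library (`struct` supports the IEEE 754 half-precision format code `e`); the helper name `to_float16` is made up for this illustration.

```python
import struct

def to_float16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Limited precision: 0.1 has no exact float16 representation
print(to_float16(0.1))      # 0.0999755859375
# Above 2048, consecutive float16 values are 2 apart, so odd integers are lost
print(to_float16(2049.0))   # 2048.0
# Limited range: 65504 is the largest finite float16 value
print(to_float16(65504.0))  # 65504.0
try:
    to_float16(70000.0)
except OverflowError as err:
    print("out of range:", err)
```

This limited range is one reason mixed-precision training setups pay attention to overflow.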

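# **NF4**
NF4 (4-bit NormalFloat) is the data type used here for 4-bit quantization (`bnb_4bit_quant_type = "nf4"`). It was introduced with QLoRA and is designed for weights that are approximately normally distributed.

# 4-bit quantization
4-bit quantization is a technique that reduces the number of bits used to represent values in a neural network. In this notebook it is applied to the weights of the base model, while computation runs in `bnb_4bit_compute_dtype` (float16).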
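To make the idea concrete, here is a minimal sketch of 4-bit quantization using a simple uniform "absmax" scheme. This is deliberately not NF4 (which uses non-uniform levels and per-block constants); the function names are hypothetical and the scheme is only an assumption chosen for clarity.

```python
def quantize_4bit(weights):
    """Map floats to signed 4-bit integer codes in [-7, 7] using one shared scale (absmax)."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # fall back to 1.0 if all weights are zero
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize_4bit(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]

weights = [0.70, -0.35, 0.10, 0.02, -0.61]
codes, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale)

print("codes:   ", codes)
print("restored:", [round(r, 2) for r in restored])
print("max error:", max(abs(w - r) for w, r in zip(weights, restored)))
```

Each weight is stored as a small integer plus one shared scale, so the reconstruction error is at most half a quantization step.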

### **Example: how 4-bit quantization reduces the memory footprint**
Let's say we have a neural network with 1 million weights. If each weight is represented using 32 bits (4 bytes), the weights occupy 4 MB.
If we use 4-bit quantization, each weight takes half a byte, so the weights occupy about 0.5 MB (plus a small overhead for the quantization constants). This is a reduction of roughly 87.5% in memory usage.
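The arithmetic can be checked directly: the number of weights times the bits per weight gives bits, and dividing by 8 gives bytes. The helper below is a hypothetical one written for this note, and the 7-billion-parameter figure is only a rough estimate that ignores quantization constants and activations.

```python
def weights_memory_mb(num_weights: int, bits_per_weight: int) -> float:
    """Memory needed to store the weights, in megabytes (1 MB = 10**6 bytes)."""
    return num_weights * bits_per_weight / 8 / 1e6

n = 1_000_000
fp32 = weights_memory_mb(n, 32)
int4 = weights_memory_mb(n, 4)
print(fp32, int4, f"{1 - int4 / fp32:.1%}")  # 4.0 0.5 87.5%

# Rough estimate for a 7-billion-parameter model in 16-bit and 4-bit
print(weights_memory_mb(7_000_000_000, 16))  # 14000.0
print(weights_memory_mb(7_000_000_000, 4))   # 3500.0
```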



# **epoch**
An epoch is a single pass of the entire training dataset through the learning algorithm.

# **batch size**
The batch size is the number of training examples the model processes before it updates its parameters. With a batch size of 32, the model processes 32 training examples per update.

A larger batch size:

1.   usually leads to faster training, because the hardware is used more efficiently
2.   gives a more accurate estimate of the gradient, which guides the model's parameter updates
3.   requires more memory

**gradient_accumulation_steps**
By accumulating gradients over several small batches before each parameter update, gradient accumulation simulates a larger batch size without increasing the memory requirement: the effective batch size is `per_device_train_batch_size * gradient_accumulation_steps`. This is useful when only small batches fit in memory.

Decreasing `gradient_accumulation_steps`:

*   Each parameter update needs fewer forward and backward passes, so updates happen more often
*   The effective batch size shrinks, so the gradient estimate becomes noisier
*   Memory use is essentially unchanged, since it depends on the per-device batch size
In [4]:
model_name = "meta-llama/Llama-2-7b-chat-hf"
dataset_name = "cnn_dailymail"
new_model = "llama2_final_docu_summary"

# QLoRA parameters
# Rank of the LoRA update matrices: a higher value adds capacity but also more trainable parameters and memory
lora_r = 64
# Scaling factor for the LoRA update (the update is scaled by lora_alpha / lora_r)
lora_alpha = 16
# Dropout probability for the LoRA layers: reduces overfitting, but too high a value degrades performance
lora_dropout = 0.1

# bitsandbytes parameters
use_4bit = True
bnb_4bit_compute_dtype = "float16"
bnb_4bit_quant_type = "nf4"
use_nested_quant = False

# TrainingArguments parameters
output_dir = "./results"
num_train_epochs = 1
fp16 = False
bf16 = False
per_device_train_batch_size = 4
per_device_eval_batch_size = 4
gradient_accumulation_steps = 1
gradient_checkpointing = True
max_grad_norm = 0.3
learning_rate = 2e-4
weight_decay = 0.001
optim = "paged_adamw_32bit"
lr_scheduler_type = "cosine"
max_steps = 200  # Maximum number of training steps; a positive value overrides num_train_epochs
warmup_ratio = 0.03

# Group sequences into batches with same length
group_by_length = True
save_steps = 0
logging_steps = 25

# SFT parameters
max_seq_length = None
packing = False
device_map = {"": 0}
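To get a feel for what `lora_r` and `lora_alpha` mean, the sketch below counts the trainable parameters LoRA adds to a single square projection matrix and computes the scaling factor. The hidden size of 4096 is an assumption used only for illustration, and the helper name is hypothetical.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank matrices A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

d = 4096  # assumed hidden size, for illustration only
full = d * d                                  # parameters in the frozen weight matrix
lora = lora_trainable_params(d, d, r=64)      # lora_r = 64
scaling = 16 / 64                             # lora_alpha / lora_r

print(full, lora, f"{lora / full:.3%}", scaling)  # 16777216 524288 3.125% 0.25
```

Even with a fairly large rank of 64, the adapter for such a matrix trains only a few percent of the parameters of the original weight.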

# **epoch-**
* Number of times the entire training dataset is passed through the model during training
* Exposes the model to all the training data so it can learn its patterns and relationships

# **maxsteps-**
* Maximum number of training steps (parameter updates) allowed for training the model
* Controls the duration of the training process; when `max_steps` is positive it overrides `num_train_epochs`, so this notebook trains for 200 steps rather than a full epoch

# **warmup ratio**
The warmup ratio is the fraction of the training steps used for the warmup phase; a value of 0.1 or 0.2 means that 10% or 20% of the steps are used (this notebook uses 0.03). During this phase, the **learning rate is gradually increased from a small value to the target learning rate**. Taking small steps at the start keeps the early updates, whose gradients can be large and noisy, from destabilizing training.

For example, with 10,000 training steps and a warmup ratio of 0.2:

* The first 2,000 training steps are the warmup phase.
* The learning rate is gradually increased from a small value (e.g., 0.0001) to the target learning rate (e.g., 0.01).
* The remaining 8,000 training steps start from the target learning rate, which the scheduler then adjusts (with `lr_scheduler_type = "cosine"` it decays gradually).
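The shape of such a schedule can be sketched in a few lines. This is a simplified stand-in for a linear-warmup, cosine-decay scheduler, written from the description above rather than copied from the library; the defaults mirror this notebook's `learning_rate`, `max_steps`, and `warmup_ratio`, and the function name is hypothetical.

```python
import math

def lr_at_step(step, base_lr=2e-4, max_steps=200, warmup_ratio=0.03):
    """Learning rate at a given step: linear warmup, then cosine decay towards zero."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: grow linearly from 0 to base_lr
        return base_lr * step / warmup_steps
    # Decay: follow half a cosine period from base_lr down to 0
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 3, 25, 100, 200):
    print(step, f"{lr_at_step(step):.2e}")
```

With a warmup ratio this small, warmup covers only the first handful of steps; almost all of the run is spent on the decaying part of the curve.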

In [5]:
# Load the training split of the dataset (you can process it here)
dataset = load_dataset("cnn_dailymail", "3.0.0", split="train")

dataset

Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 287113
})

In [6]:
# dataset['train'][1]

In [7]:
# dataset['test'][10]

In [8]:
# dataset['test'][5]

# Gradient checkpointing
Gradient checkpointing is a memory optimization technique for training: instead of storing every intermediate activation for the backward pass, it stores only some of them and recomputes the rest when they are needed. It trades extra computation for lower memory use, which is particularly helpful for large models such as Llama 2.

In some setups this reduces memory use by up to 50%, at the cost of slower training steps. This can make it possible to train Llama 2 models on hardware with less memory, or to train larger models on the same hardware.

# The max_grad_norm
`max_grad_norm` sets the maximum gradient norm for gradient clipping, which prevents the gradients from becoming too large. The default in `TrainingArguments` is 1.0; this notebook uses 0.3. When the norm of the gradients exceeds this value, the gradients are scaled down so that their norm equals the maximum.
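Clipping by norm can be sketched as follows. This is a plain-Python illustration of the idea, not the library's implementation, and the function name is hypothetical.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale the gradients so that their L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already small enough: leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads]

# Norm is 5.0, so the gradients are scaled down to norm 0.3
clipped = clip_by_global_norm([3.0, 4.0], max_norm=0.3)
print(clipped, math.sqrt(sum(g * g for g in clipped)))

# Norm is about 0.22, below the threshold, so nothing changes
unchanged = clip_by_global_norm([0.1, 0.2], max_norm=0.3)
print(unchanged)
```

Note that clipping rescales the whole gradient vector, so the update direction is preserved and only its length changes.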

# learning rate-->
The learning rate controls the speed at which a model learns by determining the size of each parameter update.

A higher learning rate means larger steps and faster learning, but too high a value can make training unstable. A lower learning rate is more stable but learns more slowly.


# weight decay
Weight decay is a regularization technique used to prevent overfitting in neural networks. It works by penalizing large weights, which encourages the model to learn more compact and generalizable representations of the data.

# optim="paged_adamw_32bit"
This specifies the optimizer used for training: a paged variant of the AdamW optimizer from bitsandbytes that keeps 32-bit optimizer states and can page them to CPU memory when GPU memory runs short.

# lr_scheduler_type="cosine"
A cosine learning rate scheduler gradually decreases the learning rate over the course of training, following a cosine curve. After warmup, training starts with a high learning rate to explore the loss landscape quickly, then lowers it to refine the model's parameters.

# device_map = {"": 0}
`device_map` specifies which device (GPU or CPU) each model component is placed on. The mapping `{"": 0}` places the entire model on GPU 0.

# max_seq_length = None
`max_seq_length = None` leaves the maximum sequence length unset, so `SFTTrainer` falls back to its default rather than to an unlimited length. In the trl version pinned here, that default appears to be the smaller of the tokenizer's maximum length and 1024 tokens, so longer sequences are truncated. Set an explicit value when the task requires longer documents.

In [9]:
# dataset=dataset['train']


In [10]:
# dataset


A configuration object for a machine learning model defines the model's architecture, hyperparameters, and training parameters.

It provides a structured way to store and manage the model's settings, which makes the model easier to initialize, save, and load.

In [11]:
# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load the base model with the quantization config
# (AutoModelForCausalLM.from_pretrained loads a pre-trained causal language model)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training


In [12]:
generation_config = model.generation_config
generation_config.max_new_tokens = 200
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

In [13]:
report_column = dataset['article'][0]
# summary_column = dataset['highlights']
print(report_column)

LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don't think I'll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details of how

In [14]:
dataset['highlights'][0]

"Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .\nYoung actor says he has no plans to fritter his cash away .\nRadcliffe's earnings from first five Potter films have been held in trust fund ."

Vanilla model's response

In [15]:
prompt = """
<User>:LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don't think I'll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details of how he'll mark his landmark birthday are under wraps. His agent and publicist had no comment on his plans. "I'll definitely have some sort of party," he said in an interview. "Hopefully none of you will be reading about it." Radcliffe's earnings from the first five Potter films have been held in a trust fund which he has not been able to touch. Despite his growing fame and riches, the actor says he is keeping his feet firmly on the ground. "People are always looking to say 'kid star goes off the rails,'" he told reporters last month. "But I try very hard not to go that way because it would be too easy for them." His latest outing as the boy wizard in "Harry Potter and the Order of the Phoenix" is breaking records on both sides of the Atlantic and he will reprise the role in the last two films.  Watch I-Reporter give her review of Potter's latest » . There is life beyond Potter, however. 
The Londoner has filmed a TV movie called "My Boy Jack," about author Rudyard Kipling and his son, due for release later this year. He will also appear in "December Boys," an Australian film about four boys who escape an orphanage. Earlier this year, he made his stage debut playing a tortured teenager in Peter Shaffer's "Equus." Meanwhile, he is braced for even closer media scrutiny now that he's legally an adult: "I just think I'm going to be more sort of fair game," he told Reuters. E-mail to a friend . Copyright 2007 Reuters. All rights reserved.This material may not be published, broadcast, rewritten, or redistributed.\n
\n<Summarizer>:
""".strip()

In [16]:
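# Generate a response (the inputs must be on the same device as the model)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
generated_text = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=300,
    do_sample=True,  # temperature and top_p only take effect when sampling
    temperature=0.7,
    top_p=0.7,
    num_return_sequences=1,
)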

# Decode the generated text
print(tokenizer.decode(generated_text[0], skip_special_tokens=True))



<User>:LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don't think I'll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details

In [17]:
def preprocess(data_point):
  return {'document': f"""
<User>: {data_point['article']}
\n<Summarizer>:
""".strip(),
          'summary': f"{data_point['highlights']}"}

In [19]:
data = dataset.map(preprocess)

Map:   0%|          | 0/287113 [00:00<?, ? examples/s]
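`SFTTrainer` reads its training text from a single column (`dataset_text_field`), so for summarization it can help to put the prompt and the reference summary into one string. The sketch below shows one way to do that with the same `<User>` / `<Summarizer>` layout; the helper name `format_example` and the column name `text` are hypothetical choices for this illustration, not something the notebook defines.

```python
def format_example(data_point: dict) -> dict:
    """Build one training string: the prompt followed by the reference summary."""
    prompt = f"<User>: {data_point['article']}\n\n<Summarizer>:"
    return {"text": f"{prompt} {data_point['highlights']}"}

# Stand-in record with the same keys as a cnn_dailymail row
sample = {"article": "Example article text.", "highlights": "Example summary."}
print(format_example(sample)["text"])
```

With the `datasets` library, such a function would typically be applied with `dataset.map(format_example)`, and the resulting column name passed as `dataset_text_field`.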

In [20]:
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    # report_to="tensorboard"
)

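# Set supervised fine-tuning parameters
# Note: this trains on the raw "highlights" column of `dataset`; the prompt-formatted
# `data` built with preprocess() is not used here
trainer = SFTTrainer(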
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="highlights",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)



| Step | Training Loss |
|------|---------------|
| 25   | 2.6292        |
| 50   | 2.6606        |
| 75   | 2.2962        |
| 100  | 2.5428        |
| 125  | 2.1645        |
| 150  | 2.4586        |
| 175  | 2.2236        |
| 200  | 2.476         |


In [None]:
# %load_ext tensorboard
# %tensorboard --logdir results/runs

Model's response after training

In [21]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run the summarization pipeline with the fine-tuned model
prompt = 'WASHINGTON (CNN) -- The Air Force is returning F-15E Strike Eagle jets to service over Iraq and Afghanistan after grounding other F-15s, the Air Force said Wednesday. The Air Force grounded models of its F-15 fleet after the crash of an older model F-15C this month. The F-15s were grounded after a crash earlier this month in Missouri of an older model that disintegrated in flight. Each F-15E must pass an inspection of critical parts on the airframe before returning to flying missions, Air Force officials said. All U.S. Air Force 224 E-model aircraft will undergo a one-time inspection of hydraulic system lines, the Air Force statement said. The longerons -- molded, metal strips of the aircraft fuselage that run from front to rear -- will also be inspected, according to the Air Force. The straps and skin panels in and around the environmental control system bay will also be examined, officials said. The Air Force would not say whether the parts being inspected were part of the problem on the aircraft that crashed. The investigation into why that plane fell apart in flight is still ongoing and Air Force officials will not say what happened until the investigation is complete, an Air Force spokesperson said. Air Force officials said the rest of the almost 500 F-15s -- older airframes than the F-15Es -- will remain grounded until the investigation offers a solution to what happened. The E-model aircraft, the youngest and most sophisticated in the F-15 inventory, is heavily used by Central Command for ground support in the U.S.-led wars in Iraq and Afghanistan. It is also used for the homeland security mission over the United States known as Operation Noble Eagle. On November 3, the Air Force grounded all of its F-15s in response to a November 1 crash of a Missouri Air National Guard F-15C in Boss, Missouri. 
The grounding forced Central Command to use other Air Force, Navy and French fighters to fill the gaps, though Strike Eagles did fly to support troops in battle in Afghanistan as an emergency measure while they were still under grounding orders, according to Central Command reports. The plane that crashed, built in 1980, was one of the older F-15s in the fleet. The F-15E Strike Eagle is an air-to-ground and air-to-air fighter, making it more versatile than other F-15 models, which are used for only air-to-air missions. The Strike Eagle is used in Afghanistan and Iraq in its air-to-ground role, using its advanced sensors to drop bombs on targets. E-mail to a friend'
pipe = pipeline(task="summarization", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0])



{'summary_text': '[INST] WASHINGTON (CNN) -- The Air Force is returning F-15E Strike Eagle jets to service over Iraq and Afghanistan after grounding other F-15s, the Air Force said Wednesday. The Air Force grounded models of its F-15 fleet after the crash of an older model F-15C this month. The F-15s were grounded after a crash earlier this month in Missouri of an older model that disintegrated in flight. Each F-15E must pass an inspection of critical parts on the airframe before returning to flying missions, Air Force officials said. All U.S. Air Force 224 E-model aircraft will undergo a one-time inspection of hydraulic system lines, the Air Force statement said. The longerons -- molded, metal strips of the aircraft fuselage that run from front to rear -- will also be inspected, according to the Air Force. The straps and skin panels in and around the environmental control system bay will also be examined, officials said. The Air Force would not say whether the parts being inspected we

F-15s grounded after a November 1 crash in Missouri .\nF-15 is used for ground support in the wars in Iraq and Afghanistan .\nAll U.S. Air Force 224 E-model aircraft will undergo a one-time inspection .

 [/INST]  The Air Force has announced that it will return its F-15E Strike Eagle jets to service after grounding them following a crash of an older model F-15C in Missouri. The Air Force will inspect critical parts of the airframe before returning the jets to service. The F-15E is the youngest and most sophisticated model in the F-15 inventory and is heavily used by Central Command for ground support in the U.S.-led wars in Iraq and Afghanistan. The Air Force will inspect the aircraft's hydraulic system lines, longerons, skin panels, and environmental control system bay before returning them to service. The investigation into the crash of the older model F-15C is still ongoing, but the Air Force will not say what happened until the investigation is complete. The grounding of the F-15s has forced Central Command to use other Air Force,"}


In [None]:
# dataset['article'][78]

"ROME, Italy -- Mauro Camoranesi scored with 13 minutes left to earn Juventus a 1-1 home draw with Serie A leaders Inter Milan on Sunday. Julio Cruz is mobbed by team-mates after giving Inter the lead in their 1-1 draw at Juventus. Camoranesi picked up a headed knock-down from substitute Vincenzo Iaquinta before seeing his shot deflect off defender  Walter Samuel to leave goalkeeper Julio Cesar helpless. Inter took a first-half lead when Argentine striker Julio Cruz broke Juve's offside trap and latched onto Brazilian midfielder Cesar's through ball before firing past Gianluigi Buffon. The result means Inter retain their unbeaten record this season, despite injury problems that saw the likes of Patrick Vieira, Francesco Toldo, Marco Materazzi and Dejan Stankovic ruled out. The defending champions are now two points clear of Fiorentina at the top of the table, with Roma a point further behind and Juventus in fourth place. Earlier in the day, Roma missed out on the chance to close the ga

In [None]:
# dataset['highlights'][78]

'F-15s grounded after a November 1 crash in Missouri .\nF-15 is used for ground support in the wars in Iraq and Afghanistan .\nAll U.S. Air Force 224 E-model aircraft will undergo a one-time inspection .'

In [None]:
prompt = """
<text>: ROME, Italy -- Mauro Camoranesi scored with 13 minutes left to earn Juventus a 1-1 home draw with Serie A leaders Inter Milan on Sunday. Julio Cruz is mobbed by team-mates after giving Inter the lead in their 1-1 draw at Juventus. Camoranesi picked up a headed knock-down from substitute Vincenzo Iaquinta before seeing his shot deflect off defender  Walter Samuel to leave goalkeeper Julio Cesar helpless. Inter took a first-half lead when Argentine striker Julio Cruz broke Juve's offside trap and latched onto Brazilian midfielder Cesar's through ball before firing past Gianluigi Buffon. The result means Inter retain their unbeaten record this season, despite injury problems that saw the likes of Patrick Vieira, Francesco Toldo, Marco Materazzi and Dejan Stankovic ruled out. The defending champions are now two points clear of Fiorentina at the top of the table, with Roma a point further behind and Juventus in fourth place. Earlier in the day, Roma missed out on the chance to close the gap on Inter when a late collapse saw them throw away a two-goal lead to draw 2-2 at Empoli. First half goals from French winger Ludovic Giuly and Matteo Brighi had put the visiting Romans in charge and for more than an hour they looked set to cruise to victory. But with 23 minutes remaining Ighli Vannucchi reduced the deficit and Sebastian Giovinco snatched an injury time equaliser to deny Luciano Spaletti's injury-depleted team. Siena snatched a share of the spoils from Parma in a 2-2 draw as Daniele Galloppa scored in the last minute while Napoli needed an injury time goal from striker Ezequiel Lavezzi to deny rock-bottom Reggina their first win of the season, forcing them to settle for a 1-1 draw in the south. E-mail to a friend
<summarizer>:
""".strip()

In [None]:
# device_map is a dict, not a device, so move the inputs to the model's device instead
encoding = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
  outputs = model.generate(
      input_ids = encoding.input_ids,
      attention_mask = encoding.attention_mask,
      generation_config = generation_config
  )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

In [22]:
# Empty VRAM
del model
del pipe
del trainer
import gc
gc.collect()
gc.collect()

20931

In [23]:
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
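Conceptually, merging folds the low-rank LoRA update into the frozen base weight, so the merged model no longer needs the adapter at inference time: `W_merged = W + (lora_alpha / r) * B @ A`. The sketch below illustrates that arithmetic with tiny plain-Python matrices; it is an illustration of the formula under these assumptions, not the peft implementation, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A."""
    delta = matmul(B, A)
    s = alpha / r
    return [[W[i][j] + s * delta[i][j] for j in range(len(W[0]))] for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (2 x 2)
A = [[1.0, 2.0]]              # LoRA A, shape r x d_in with r = 1
B = [[0.5], [-0.5]]           # LoRA B, shape d_out x r

merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)  # [[2.0, 2.0], [-1.0, -1.0]]
```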

In [26]:
# !huggingface-cli login

model.push_to_hub("llama2_final_docu_summary", use_temp_dir=False)
# tokenizer.push_to_hub("llama2_final_doc_summary", use_temp_dir=False)

CommitInfo(commit_url='https://huggingface.co/soundarya2873/llama2_final_docu_summary/commit/b82f8552e02f2d69c7b5748c49001c0666f957df', commit_message='Upload LlamaForCausalLM', commit_description='', oid='b82f8552e02f2d69c7b5748c49001c0666f957df', pr_url=None, pr_revision=None, pr_num=None)

In [25]:
tokenizer.push_to_hub("llama2_final_docu_summary", use_temp_dir=False)

CommitInfo(commit_url='https://huggingface.co/soundarya2873/llama2_final_docu_summary/commit/02eae5579b611ba7b8e61dbe201205dca8d3a0b4', commit_message='Upload tokenizer', commit_description='', oid='02eae5579b611ba7b8e61dbe201205dca8d3a0b4', pr_url=None, pr_revision=None, pr_num=None)