# Introduction

In this project, we will use **Llama-3.1 8B**, a **Large Language Model (LLM)** developed by Meta, to automatically summarize customer reviews. The goal is to extract key insights for better marketing decisions and enhance understanding of customer needs.

Customer reviews provide valuable feedback but are often lengthy and unstructured. Analyzing them manually is time-consuming, so automating the process makes it practical to extract actionable insights at scale.

**Llama-3.1 8B** is a powerful pre-trained model capable of understanding, summarizing, and generating text. It helps identify sentiments, strengths, and weaknesses in customer feedback.

### Steps:
1. **Data Preparation**: Clean and structure the reviews.
2. **Model Training**: Fine-tune **Llama-3.1 8B** on customer reviews.
3. **Evaluation**: Compare the model-generated summaries with human ones.
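
The evaluation step can be sketched with a simple word-overlap score. The notebook does not say which metric is used to compare summaries, so the unigram F1 below (similar in spirit to ROUGE-1) is an assumption chosen for illustration, and the function name `unigram_f1` and the generated sentence are hypothetical.

```python
import re
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated and a human-written summary."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    # Multiset intersection counts each shared word at most min(count) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Reference taken from the first dataset record; the generated text is made up.
human = ("SoundPro earbuds offer excellent sound and battery life, "
         "but connectivity issues and high price are drawbacks.")
generated = ("SoundPro earbuds have excellent sound and battery life, "
             "but they are expensive and have connectivity issues.")
print(round(unigram_f1(generated, human), 3))
```

A score of 1.0 means the two summaries use exactly the same words; overlap metrics like this ignore meaning, so they are best read alongside human judgment.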

This project explores how **LLMs** can speed up customer review analysis, giving businesses quicker access to insights that inform their customer strategies.


In [1]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

In [2]:
!pip install bitsandbytes



In [3]:
!pip install unsloth_zoo accelerate torch



## Model Setup and Configuration for Llama-3.1 8B


In [4]:
from unsloth import FastLanguageModel
import torch
# Maximum input sequence length
max_seq_length = 2048
# Data type; None enables auto-detection
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
# Use 4-bit quantization to reduce memory usage
load_in_4bit = True

# Pre-quantized 4-bit models supported by Unsloth, for 4x faster downloads and fewer OOM errors
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

# Load the model and tokenizer with the settings above
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.50.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.5k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]
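
A rough back-of-the-envelope estimate suggests why `load_in_4bit = True` matters on this GPU. The sketch below counts weight storage only, and it assumes a rounded parameter count of 8.0e9 and that the banner's "14.741 GB" is measured in units of 1024³ bytes; activations, the KV cache, and optimizer state are ignored.

```python
GIB = 1024 ** 3


def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


num_params = 8.0e9        # rounded parameter count (assumption)
gpu_memory_gib = 14.741   # Tesla T4 memory reported in the banner above

fp16_gib = weight_memory_gib(num_params, 16)
four_bit_gib = weight_memory_gib(num_params, 4)

print(f"16-bit weights: {fp16_gib:.2f} GiB, fits on the T4: {fp16_gib < gpu_memory_gib}")
print(f" 4-bit weights: {four_bit_gib:.2f} GiB, fits on the T4: {four_bit_gib < gpu_memory_gib}")
```

The downloaded checkpoint (5.70G in the log above) is larger than the pure 4-bit estimate, presumably because some tensors stay in higher precision and the quantization constants are stored as well.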

## Fine-tuning with PEFT (Parameter-Efficient Fine-Tuning) for the Llama Model

We now add LoRA adapters so that only a small fraction of the parameters needs to be updated: about 0.52% with the settings below, according to the training log.

In [5]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.3.19 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
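
The number of trainable LoRA parameters can be checked by hand. Each adapted linear layer with `in_features` inputs and `out_features` outputs gains two low-rank matrices, for a total of `r * (in_features + out_features)` parameters. The layer dimensions below are assumptions taken from the published Llama-3.1 8B configuration (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 layers); they do not appear in the notebook itself.

```python
# Llama-3.1 8B dimensions (assumed from the published model config)
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024          # 8 key/value heads x head dimension 128
NUM_LAYERS = 32
R = 16                 # LoRA rank used in get_peft_model above

# (in_features, out_features) for every module listed in target_modules
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters in the A (r x in) and B (out x r) matrices of one adapter."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable parameters")             # 41,943,040
print(f"{total / 8.0e9:.2%} of roughly 8B weights")  # 0.52%
```

Under these assumptions the result matches the "Trainable parameters = 41,943,040" line printed by the trainer.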


# II. Data Preparation

In [6]:
import json

with open('dataLlama3.json') as f1, open('recommendation_dataset.json') as f2:
    summary_data = json.load(f1)
    recommendation_data = json.load(f2)

merged_data = summary_data + recommendation_data

with open('merged_dataset.json', 'w') as out:
    json.dump(merged_data, out, indent=2)


In [7]:
import json

with open('merged_dataset.json') as f:
    dataset = json.load(f)
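
Because the next cell reads fields with `entry.get(..., '')`, a record that lacks a field is silently turned into an empty string. A quick check such as the sketch below can surface incomplete records before training. The helper name `find_incomplete` and the sample records are hypothetical; if an empty `input` is legitimate for some tasks, drop it from `REQUIRED_KEYS`.

```python
REQUIRED_KEYS = ("instruction", "input", "output")


def find_incomplete(records):
    """Return (index, missing_keys) for records lacking a non-empty string field."""
    problems = []
    for index, record in enumerate(records):
        missing = [
            key for key in REQUIRED_KEYS
            if not isinstance(record.get(key), str) or not record[key].strip()
        ]
        if missing:
            problems.append((index, missing))
    return problems


# Hypothetical records; in the notebook this would be the loaded `dataset` list.
sample = [
    {"instruction": "Summarize customer reviews.",
     "input": "Product: Desk Lamp | Positive: Bright",
     "output": "A bright desk lamp."},
    {"instruction": "Summarize customer reviews.",
     "input": "Product: Mug | Negative: Chipped"},
]
print(find_incomplete(sample))  # [(1, ['output'])]
```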

## Data Preprocessing and Alpaca-Style Prompt Formatting


In [8]:
from datasets import Dataset

# Assuming 'dataset' is a list of dictionaries
dataset = Dataset.from_dict({
    "instruction": [entry.get('instruction', '') for entry in dataset],
    "input": [entry.get('input', '') for entry in dataset],
    "output": [entry.get('output', '') for entry in dataset]
})

# Define the Alpaca prompt
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN

# Function to format the prompts
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Append EOS_TOKEN so the model learns where a response ends
        text = alpaca_prompt.format(instruction=instruction, input=input_text, output=output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts }

# Apply the map function
dataset = dataset.map(formatting_prompts_func, batched=True)

Map:   0%|          | 0/1120 [00:00<?, ? examples/s]

In [9]:
dataset[0]

{'instruction': 'Summarize customer reviews.',
 'input': 'Product: Wireless Earbuds - SoundPro | Positive: Great sound quality, long battery life, comfortable fit | Negative: Expensive, connectivity issues | Neutral: Decent but overpriced.',
 'output': 'SoundPro earbuds offer excellent sound and battery life, but connectivity issues and high price are drawbacks.',
 'text': 'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nSummarize customer reviews.\n\n### Input:\nProduct: Wireless Earbuds - SoundPro | Positive: Great sound quality, long battery life, comfortable fit | Negative: Expensive, connectivity issues | Neutral: Decent but overpriced.\n\n### Response:\nSoundPro earbuds offer excellent sound and battery life, but connectivity issues and high price are drawbacks.<|eot_id|>'}

In [22]:
dataset[1119]

{'instruction': 'Generate recommendations based on the feedback summary.',
 'input': 'Book pages are too thin.',
 'output': '1. Use thicker paper quality\n2. Offer premium edition\n3. Highlight lightweight portability',
 'text': 'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nGenerate recommendations based on the feedback summary.\n\n### Input:\nBook pages are too thin.\n\n### Response:\n1. Use thicker paper quality\n2. Offer premium edition\n3. Highlight lightweight portability<|eot_id|>'}


# III. Train the model


In [23]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 2, # Ignored here: a positive max_steps takes precedence.
        max_steps = 60, # Caps training at 60 optimizer steps; remove to train for num_train_epochs.
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Disables experiment tracking; set to "wandb" etc. to enable it
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/1120 [00:00<?, ? examples/s]

In [24]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
5.516 GB of memory reserved.


In [25]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,120 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040/8,000,000,000 (0.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,2.6019
2,2.6887
3,2.5428
4,2.5607
5,2.1957
6,1.8573
7,1.5827
8,1.3496
9,1.1096
10,0.9702
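
The banner above reports 60 total steps even though `num_train_epochs = 2` was set, because a positive `max_steps` overrides the epoch count in `TrainingArguments`. The arithmetic below, using only the numbers printed in the banner, shows how much of the dataset those 60 steps actually cover; the variable names are illustrative.

```python
import math

num_examples = 1120       # "Num examples" in the banner
per_device_batch = 2      # per_device_train_batch_size
grad_accum = 4            # gradient_accumulation_steps
num_gpus = 1
max_steps = 60

effective_batch = per_device_batch * grad_accum * num_gpus    # 8
steps_per_epoch = math.ceil(num_examples / effective_batch)   # 140
examples_seen = max_steps * effective_batch                   # 480
fraction_of_epoch = examples_seen / num_examples
steps_for_two_epochs = 2 * steps_per_epoch                    # 280

print(f"Effective batch size: {effective_batch}")
print(f"Steps per epoch: {steps_per_epoch}")
print(f"Examples seen in {max_steps} steps: {examples_seen} ({fraction_of_epoch:.0%} of one epoch)")
print(f"Steps needed for 2 full epochs: {steps_for_two_epochs}")
```

In other words, this run sees less than half of the 1,120 examples once; training for the two epochs named in the arguments would take 280 steps.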


In [26]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

359.9937 seconds used for training.
6.0 minutes used for training.
Peak reserved memory = 6.514 GB.
Peak reserved memory for training = 0.998 GB.
Peak reserved memory % of max memory = 44.19 %.
Peak reserved memory for training % of max memory = 6.77 %.



### Inference


In [27]:
from unsloth import FastLanguageModel
from transformers import TextStreamer

# Enable fast inference
FastLanguageModel.for_inference(model)

# Define the ALPACA-style prompt only ONCE
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
"""

# Reusable function to generate output from instruction + input
def run_model(instruction, input_text, model, tokenizer, max_new_tokens=300):
    prompt = alpaca_prompt.format(instruction=instruction, input=input_text)
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        temperature=0.7,
        top_p=0.9,
        use_cache=True,
        eos_token_id=tokenizer.eos_token_id
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# ✅ Example: Summarization
summary_input = (
    "Product: Bluetooth Headphones - SoundBeats | Positive: 'Excellent sound quality at a great price.', "
    "'Fast delivery and helpful customer support during setup.', 'Great value for money.' | "
    "Negative: 'Could be cheaper for the build quality.', 'Delivery was delayed a few days, but customer service was responsive.'"
)

summary_result = run_model("Summarize customer reviews.", summary_input, model, tokenizer)
print("🔎 Summary Result:\n", summary_result)

# ✅ Example: Recommendation
recommend_input = "Great sound quality but uncomfortable for long use."
recommend_result = run_model("Generate recommendations based on the feedback summary.", recommend_input, model, tokenizer)
print("\n💡 Recommendation Result:\n", recommend_result)


🔎 Summary Result:
 Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Summarize customer reviews.

### Input:
Product: Bluetooth Headphones - SoundBeats | Positive: 'Excellent sound quality at a great price.', 'Fast delivery and helpful customer support during setup.', 'Great value for money.' | Negative: 'Could be cheaper for the build quality.', 'Delivery was delayed a few days, but customer service was responsive.'

### Response:
SoundBeats headphones are praised for their sound quality and price, with responsive customer support. Some feel the build quality and delivery timing could be improved.

💡 Recommendation Result:
 Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Generate recommendations based on the feedback summary.

### Inpu
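
As the output above shows, `run_model` returns the decoded prompt together with the answer. A small post-processing helper can keep only the generated part; the sketch below is one way to do it, assuming the Alpaca template's `### Response:` marker is present in the decoded text. The name `extract_response` is hypothetical.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded_text: str) -> str:
    """Return only the text after the last '### Response:' marker."""
    _, marker, tail = decoded_text.rpartition(RESPONSE_MARKER)
    # If the marker is missing, fall back to the whole text.
    return tail.strip() if marker else decoded_text.strip()


# Shortened stand-in for the decoded model output shown above.
echoed = (
    "### Instruction:\nSummarize customer reviews.\n\n"
    "### Input:\nProduct: Bluetooth Headphones - SoundBeats\n\n"
    "### Response:\nSoundBeats headphones are praised for their sound quality and price."
)
print(extract_response(echoed))
```

Splitting on the marker is simple but fragile if a review itself contains the string `### Response:`; slicing the output by the prompt's token count would be a sturdier alternative.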

In [30]:
from unsloth import FastLanguageModel
from transformers import TextStreamer

# Enable fast inference
FastLanguageModel.for_inference(model)

# Define the unified ALPACA-style prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
"""

def generate_response(instruction, input_text, model, tokenizer, max_new_tokens=300):
    """
    Unified function to handle both summarization and recommendation generation
    """
    prompt = alpaca_prompt.format(instruction=instruction, input=input_text)
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    # Text streamer for real-time output (optional)
    streamer = TextStreamer(tokenizer)

    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        temperature=0.7,
        top_p=0.9,
        use_cache=True,
        eos_token_id=tokenizer.eos_token_id,
        streamer=streamer,  # Remove if not needed
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example 1: Summarization Task
summary_input = (
    "Product: Bluetooth Headphones - SoundBeats | Positive: 'Excellent sound quality at a great price.', "
    "'Fast delivery and helpful customer support during setup.', 'Great value for money.' | "
    "Negative: 'Could be cheaper for the build quality.', 'Delivery was delayed a few days, but customer service was responsive.'"
)

summary_result = generate_response(
    instruction="Summarize these customer reviews highlighting key positives and negatives.",
    input_text=summary_input,
    model=model,
    tokenizer=tokenizer
)
print("\n📝 Summary Result:")
print(summary_result)

# Example 2: Recommendation Task
recommend_input = "Great sound quality but uncomfortable for long use."
recommend_result = generate_response(
    instruction="Generate 3 actionable product improvement recommendations based on this feedback:",
    input_text=recommend_input,
    model=model,
    tokenizer=tokenizer
)
print("\n💡 Recommendation Result:")
print(recommend_result)

# Example 3: Combined Task (Summary + Recommendations)
combined_input = (
    "Product: Wireless Earbuds | Reviews: 'Battery life is amazing but they fall out during workouts.', "
    "'Sound is crisp but the touch controls are too sensitive.', 'Comfortable for all-day wear but case is bulky.'"
)

combined_result = generate_response(
    instruction="First summarize the key feedback, then generate 3 product improvement recommendations:",
    input_text=combined_input,
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=400
)
print("\n🔍 Combined Summary & Recommendations:")
print(combined_result)

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request.

### Instruction:
Summarize these customer reviews highlighting key positives and negatives.

### Input:
Product: Bluetooth Headphones - SoundBeats | Positive: 'Excellent sound quality at a great price.', 'Fast delivery and helpful customer support during setup.', 'Great value for money.' | Negative: 'Could be cheaper for the build quality.', 'Delivery was delayed a few days, but customer service was responsive.'

### Response:
SoundBeats headphones offer great value for money with excellent sound quality. However, some customers feel the build quality could be better, and delivery can sometimes be delayed.<|eot_id|>

📝 Summary Result:
Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request.

### Instruction:
Sum

You can also use a `TextStreamer` for continuous inference, so you can watch the generation token by token instead of waiting for the full output.

In [None]:
from unsloth import FastLanguageModel
from transformers import TextStreamer

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# ALPACA prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

# Function to generate responses
def generate_response(instruction, input_text, max_new_tokens=128):
    formatted_prompt = alpaca_prompt.format(
        instruction=instruction,
        input=input_text,
        output=""  # Leave blank for generation
    )

    inputs = tokenizer([formatted_prompt], return_tensors="pt").to("cuda")
    text_streamer = TextStreamer(tokenizer)

    outputs = model.generate(
        **inputs,
        streamer=text_streamer,
        max_new_tokens=max_new_tokens,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id
    )

    # Extract just the generated text (after "### Response:")
    full_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    response = full_output.split("### Response:")[-1].strip()
    return response

# Example 1: Summarization
summary_input = "Product: Bluetooth Headphones - SoundBeats | Positive: 'Excellent sound quality at a great price.', 'Fast delivery and helpful customer support during setup.', 'Great value for money.' | Negative: 'Could be cheaper for the build quality.', 'Delivery was delayed a few days, but customer service was responsive.'"

print("\n📝 Summarization Example:")
summary_result = generate_response(
    instruction="Summarize these customer reviews highlighting key positives and negatives.",
    input_text=summary_input,
    max_new_tokens=150
)

# Example 2: Recommendations
recommendation_input = "Great sound quality but uncomfortable for long use."

print("\n💡 Recommendation Example:")
recommendation_result = generate_response(
    instruction="Generate 3 actionable product improvement recommendations based on this feedback:",
    input_text=recommendation_input,
    max_new_tokens=200
)

# Example 3: Combined Task
combined_input = "Product: Wireless Earbuds | Reviews: 'Battery life is amazing but they fall out during workouts.', 'Sound is crisp but the touch controls are too sensitive.', 'Comfortable for all-day wear but case is bulky.'"

print("\n🔍 Combined Summary + Recommendations Example:")
combined_result = generate_response(
    instruction="First summarize the key feedback, then generate 3 product improvement recommendations:",
    input_text=combined_input,
    max_new_tokens=300
)


📝 Summarization Example:
<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Summarize these customer reviews highlighting key positives and negatives.

### Input:
Product: Bluetooth Headphones - SoundBeats | Positive: 'Excellent sound quality at a great price.', 'Fast delivery and helpful customer support during setup.', 'Great value for money.' | Negative: 'Could be cheaper for the build quality.', 'Delivery was delayed a few days, but customer service was responsive.'

### Response:
SoundBeats headphones offer great sound quality and value for money, with fast delivery and helpful support. However, some customers feel the build quality could be better and delivery may be delayed.<|eot_id|>

💡 Recommendation Example:
<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a respo

<a name="Save"></a>
# VI. Saving, loading finetuned models
To save the final model as LoRA adapters, use either Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
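
A quick size estimate illustrates why the saved artifact is small compared with the base model. The sketch assumes the adapter weights are stored as 32-bit floats, which the notebook does not state; the parameter count comes from the training banner.

```python
trainable_params = 41_943_040   # LoRA parameter count from the training banner
bytes_per_param = 4             # assumption: adapter weights saved as float32

size_bytes = trainable_params * bytes_per_param
print(f"Estimated adapter file size: {size_bytes / 1e6:.0f} MB")  # 168 MB
```

Under that assumption the estimate agrees with the 168M reported for `adapter_model.safetensors` in the upload log, against 5.70G for the 4-bit base checkpoint.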

In [32]:
model.save_pretrained("MergedModels") # Local saving (LoRA adapters only, despite the directory name)
tokenizer.save_pretrained("MergedModels")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('MergedModels/tokenizer_config.json',
 'MergedModels/special_tokens_map.json',
 'MergedModels/tokenizer.json')

In [33]:
import os
from huggingface_hub import login
HF_TOKEN = os.environ.get('HF_TOKEN')

# Log in to Hugging Face Hub
login(token=HF_TOKEN)


In [34]:
model.push_to_hub("tessou/MergedModels", token = HF_TOKEN) # Online saving
tokenizer.push_to_hub("tessou/MergedModels", token = HF_TOKEN) # Online saving

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Saved model to https://huggingface.co/tessou/MergedModels


README.md:   0%|          | 0.00/5.18k [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

To load the LoRA adapters we just saved and use them for inference, run the following cell:

In [37]:
from unsloth import FastLanguageModel
from transformers import TextStreamer  # used by generate_response below
import torch

# Configuration
model_name = "tessou/MergedModels"  # Your fine-tuned model
max_seq_length = 2048  # Adjust based on your training
dtype = None  # Automatic detection
load_in_4bit = True  # Quantization

try:
    # Try loading with GPU first
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = model_name,
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
        device_map = "auto",  # Let HuggingFace handle device placement
    )
    FastLanguageModel.for_inference(model)

except ValueError as e:
    print(f"GPU loading failed: {e}")
    print("Attempting with CPU offloading...")

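    # Fallback with CPU offloading.
    # NOTE: as the traceback in the output shows, this call fails with a TypeError:
    # llm_int8_enable_fp32_cpu_offload is not accepted as a direct keyword here and
    # ends up in LlamaForCausalLM.__init__(). In transformers the flag is a field of
    # BitsAndBytesConfig; whether Unsloth's loader accepts such a config is untested here.
    # A plausible cause of the first failure is that the trained model still occupies
    # the GPU, so restarting the runtime before reloading may avoid the fallback.
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = model_name,
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
        device_map = "auto",
        llm_int8_enable_fp32_cpu_offload = True,  # Intended to enable CPU offloading
    )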
    FastLanguageModel.for_inference(model)

# ALPACA prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

# Check memory usage
print(f"\nGPU Memory allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB")
print(f"GPU Memory reserved: {torch.cuda.memory_reserved()/1024**3:.2f} GB")

def generate_response(instruction, input_text, max_new_tokens=128):
    try:
        formatted_prompt = alpaca_prompt.format(
            instruction=instruction,
            input=input_text,
            output=""
        )

        inputs = tokenizer([formatted_prompt], return_tensors="pt").to("cuda")
        text_streamer = TextStreamer(tokenizer)

        outputs = model.generate(
            **inputs,
            streamer=text_streamer,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id
        )

        return tokenizer.decode(outputs[0], skip_special_tokens=True).split("### Response:")[-1].strip()

    except RuntimeError as e:
        return f"Generation failed: {str(e)}"

# Example usage
print("\n📝 Testing model...")
test_output = generate_response(
    instruction="Summarize this review",
    input_text="Product: Wireless Earbuds - Great sound but battery life could be better",
    max_new_tokens=50
)
print(test_output)

==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.50.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
GPU loading failed: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details. 
Attempting with CPU offloading...
==((====))==  Unsloth 2025.3.19: Fast Llam

TypeError: LlamaForCausalLM.__init__() got an unexpected keyword argument 'llm_int8_enable_fp32_cpu_offload'

You can also use `AutoPeftModelForCausalLM` from Hugging Face's `peft` library. Only use this if you do not have `unsloth` installed. It can be very slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # Not recommended: use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "SummarizeReviews_model", # Directory or Hub ID of your saved LoRA adapters (this notebook saved them to "MergedModels")
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("SummarizeReviews_model")