# Introduction

In this project, we will use **Llama-3.1 8B**, a **Large Language Model (LLM)** developed by Meta, to automatically summarize customer reviews. The goal is to extract key insights for better marketing decisions and enhance understanding of customer needs.

Customer reviews provide valuable feedback but are often lengthy and unstructured. Manually analyzing them is time-consuming, which is why automating the process is essential for actionable insights.

**Llama-3.1 8B** is a powerful pre-trained model capable of understanding, summarizing, and generating text. It helps identify sentiments, strengths, and weaknesses in customer feedback.

### Steps:
1. **Data Preparation**: Clean and structure the reviews.
2. **Model Training**: Fine-tune **Llama-3.1 8B** on customer reviews.
3. **Evaluation**: Compare the model-generated summaries with human ones.

This project explores how **LLMs** can speed up customer review analysis, giving businesses faster access to insights and a stronger basis for their customer strategy.
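As one way the evaluation step could be approached, the sketch below scores a generated summary against a human-written one using unigram-overlap F1 (similar in spirit to ROUGE-1). The function name `unigram_f1` and the whitespace tokenization are assumptions made for illustration; a dedicated metric library would be more robust.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated and a reference summary."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = unigram_f1(
    "earbuds offer excellent sound but are expensive",
    "earbuds offer great sound but are expensive",
)
print(f"unigram F1 = {score:.3f}")
```

A score near 1 means the two summaries share most of their words; it says nothing about fluency or factual accuracy, so it complements rather than replaces human review.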


In [1]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

In [2]:
!pip install bitsandbytes



In [3]:
!pip install unsloth_zoo accelerate torch



## Model Setup and Configuration for Llama-3.1 8B


In [4]:
from unsloth import FastLanguageModel
import torch
# Maximum input sequence length
max_seq_length = 2048
# Data type; None enables automatic detection
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
# Use 4-bit quantization to reduce memory usage
load_in_4bit = True

# Pre-quantized 4-bit models we support, for 4x faster downloads and fewer OOM errors
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

# Load the model and tokenizer with the parameters specified above
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.50.0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.5k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]
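To see why 4-bit loading matters on a 14.741 GB Tesla T4, here is a back-of-the-envelope estimate of the memory taken by the weights alone. It assumes a nominal 8 billion parameters and ignores activations, quantization constants, and any tensors kept in higher precision, so it is a rough sketch rather than a measurement.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

N_PARAMS = 8e9  # ASSUMPTION: nominal parameter count of an "8B" model

fp16_gib = weight_memory_gib(N_PARAMS, 16)
int4_gib = weight_memory_gib(N_PARAMS, 4)
print(f"fp16: {fp16_gib:.1f} GiB, 4-bit: {int4_gib:.1f} GiB")
```

Under these assumptions the fp16 weights alone (about 14.9 GiB) would not fit on the T4, while the 4-bit weights (about 3.7 GiB) leave room for training. The real checkpoint is larger than this estimate (the download above shows 5.70G), presumably because not every tensor is stored in 4 bits.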

## Fine-tuning with PEFT (Parameter Efficient Fine-Tuning) for Llama Model

We now add LoRA adapters so that only a small fraction of the parameters needs updating: typically 1 to 10%, and about 0.5% with the settings below (see the training log).

In [5]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.3.19 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
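The number of trainable parameters follows directly from the LoRA settings: each adapted linear layer of shape `in x out` gains two low-rank matrices with `r * (in + out)` parameters in total. The sketch below reproduces that arithmetic. The layer shapes are an assumption taken from the published Llama-3.1 8B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of 128 dimensions); they are not read from the loaded model.

```python
R = 16           # LoRA rank, as passed to get_peft_model
NUM_LAYERS = 32  # from the "patched 32 layers" log line

# (in_features, out_features) per target module.
# ASSUMPTION: shapes from the published Llama-3.1 8B configuration.
MODULE_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

def lora_params(in_features: int, out_features: int, r: int) -> int:
    # Matrix A is (r x in_features), matrix B is (out_features x r).
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")
```

The result, 41,943,040, agrees with the trainable-parameter count Unsloth reports when training starts. Doubling `r` doubles this number.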


# II. Data Preparation

In [6]:
import json

with open('dataLlama3.json', encoding='utf-8') as f:
    dataset = json.load(f)

## Data Preprocessing and Prompt Formatting with the Alpaca Template


In [8]:
from datasets import Dataset

# Assuming 'dataset' is a list of dictionaries
dataset = Dataset.from_dict({
    "instruction": [entry.get('instruction', '') for entry in dataset],
    "input": [entry.get('input', '') for entry in dataset],
    "output": [entry.get('output', '') for entry in dataset]
})

# Define the Alpaca prompt
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN

# Function to format the prompts
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # input_text avoids shadowing Python's built-in input()
        text = alpaca_prompt.format(instruction=instruction, input=input_text, output=output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts }

# Apply the map function
dataset = dataset.map(formatting_prompts_func, batched=True)

Map:   0%|          | 0/1033 [00:00<?, ? examples/s]

In [9]:
dataset[0]

{'instruction': 'Summarize customer reviews.',
 'input': 'Product: Wireless Earbuds - SoundPro | Positive: Great sound quality, long battery life, comfortable fit | Negative: Expensive, connectivity issues | Neutral: Decent but overpriced.',
 'output': 'SoundPro earbuds offer excellent sound and battery life, but connectivity issues and high price are drawbacks.',
 'text': 'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nSummarize customer reviews.\n\n### Input:\nProduct: Wireless Earbuds - SoundPro | Positive: Great sound quality, long battery life, comfortable fit | Negative: Expensive, connectivity issues | Neutral: Decent but overpriced.\n\n### Response:\nSoundPro earbuds offer excellent sound and battery life, but connectivity issues and high price are drawbacks.<|eot_id|>'}

In [10]:
dataset[40]

{'instruction': 'Summarize customer reviews.',
 'input': 'Product: Lipstick - GlamMatte | Positive: Vibrant color, long-lasting, smooth application | Negative: Dries lips, limited shades | Neutral: Good quality but could be more moisturizing.',
 'output': 'GlamMatte offers bold, long-lasting color, but may dry out lips.',
 'text': 'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nSummarize customer reviews.\n\n### Input:\nProduct: Lipstick - GlamMatte | Positive: Vibrant color, long-lasting, smooth application | Negative: Dries lips, limited shades | Neutral: Good quality but could be more moisturizing.\n\n### Response:\nGlamMatte offers bold, long-lasting color, but may dry out lips.<|eot_id|>'}
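The `input` field in both records follows the same pipe-delimited layout (`Product: ... | Positive: ... | Negative: ... | Neutral: ...`). If the raw reviews are available as separate lists, a small helper can produce that layout consistently. The function `build_review_input` below is a hypothetical helper written for illustration; it is not part of the original data pipeline.

```python
def build_review_input(product, positive, negative, neutral=None):
    """Join review snippets into the pipe-delimited layout used in the dataset."""
    parts = [
        f"Product: {product}",
        "Positive: " + ", ".join(positive),
        "Negative: " + ", ".join(negative),
    ]
    # The Neutral section is optional; the inference prompts omit it.
    if neutral:
        parts.append("Neutral: " + ", ".join(neutral))
    return " | ".join(parts)

example_input = build_review_input(
    product="Wireless Earbuds - SoundPro",
    positive=["Great sound quality", "long battery life", "comfortable fit"],
    negative=["Expensive", "connectivity issues"],
    neutral=["Decent but overpriced."],
)
print(example_input)
```

Keeping the layout identical between training and inference matters, because the fine-tuned model has only seen inputs in this shape.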


# III. Train the model


In [12]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 2, # Ignored here: max_steps below takes precedence.
        max_steps = 60, # Remove this line to train for the full num_train_epochs.
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/1033 [00:00<?, ? examples/s]

In [13]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
5.516 GB of memory reserved.


In [14]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,033 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040/8,000,000,000 (0.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,2.6126
2,2.6509
3,2.6074
4,2.3311
5,2.1799
6,1.8963
7,1.503
8,1.3169
9,1.1332
10,1.0486
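The banner reports a total of 60 steps because `max_steps` caps the run. A quick calculation, using only the numbers shown in the banner, indicates how much of the dataset those steps cover. This is a rough sketch that ignores how the trainer handles the last partial batch.

```python
num_examples = 1033
per_device_batch = 2
grad_accum = 4
num_gpus = 1
max_steps = 60

# Examples consumed per optimizer step (the "Total batch size" in the banner).
effective_batch = per_device_batch * grad_accum * num_gpus
examples_seen = effective_batch * max_steps
epoch_fraction = examples_seen / num_examples

print(f"Effective batch size: {effective_batch}")
print(f"Examples seen in {max_steps} steps: {examples_seen}")
print(f"Fraction of one epoch: {epoch_fraction:.2f}")
```

With these settings the model sees 480 of the 1,033 examples, a little under half an epoch, so a longer run may be worth trying once the loss curve has been checked for overfitting.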


In [15]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

370.5822 seconds used for training.
6.18 minutes used for training.
Peak reserved memory = 7.182 GB.
Peak reserved memory for training = 1.666 GB.
Peak reserved memory % of max memory = 48.721 %.
Peak reserved memory for training % of max memory = 11.302 %.



### Inference


In [16]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        instruction="Summarize customer reviews.",  # instruction
        input="Product: Bluetooth Headphones - SoundBeats | Positive: 'Excellent sound quality at a great price.', 'Fast delivery and helpful customer support during setup.', 'Great value for money.' | Negative: 'Could be cheaper for the build quality.', 'Delivery was delayed a few days, but customer service was responsive.'",  # input
        output="" # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 300, eos_token_id=tokenizer.eos_token_id, use_cache = True)
tokenizer.batch_decode(outputs)

["<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nSummarize customer reviews.\n\n### Input:\nProduct: Bluetooth Headphones - SoundBeats | Positive: 'Excellent sound quality at a great price.', 'Fast delivery and helpful customer support during setup.', 'Great value for money.' | Negative: 'Could be cheaper for the build quality.', 'Delivery was delayed a few days, but customer service was responsive.'\n\n### Response:\nSoundBeats Bluetooth Headphones offer great sound quality at an affordable price, with quick delivery and helpful support. However, some customers feel the build quality could be better for the price, and delivery was delayed for a few.<|eot_id|>"]
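`batch_decode` returns the full prompt followed by the generated summary. To keep only the summary, one option is to split on the `### Response:` marker and strip the special tokens. The helper `extract_response` below is a hypothetical sketch that works on plain strings; the token names are the ones visible in the output above.

```python
RESPONSE_MARKER = "### Response:"
SPECIAL_TOKENS = ("<|begin_of_text|>", "<|eot_id|>")

def extract_response(decoded: str) -> str:
    """Return only the text generated after the last response marker."""
    # rpartition leaves the whole string in `tail` if the marker is missing.
    _, _, tail = decoded.rpartition(RESPONSE_MARKER)
    for token in SPECIAL_TOKENS:
        tail = tail.replace(token, "")
    return tail.strip()

sample = (
    "<|begin_of_text|>Below is an instruction...\n\n"
    "### Instruction:\nSummarize customer reviews.\n\n"
    "### Input:\nProduct: Demo\n\n"
    "### Response:\nGood sound, slow delivery.<|eot_id|>"
)
print(extract_response(sample))
```

An alternative is to decode with `skip_special_tokens=True` and slice off the prompt tokens before decoding, which avoids relying on a text marker.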

In [18]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
"""

prompt_text = alpaca_prompt.format(
    "Summarize customer reviews.",
    "Product: Bluetooth Headphones - SoundBeats | Positive: 'Excellent sound quality at a great price.', 'Fast delivery and helpful customer support during setup.', 'Great value for money.' | Negative: 'Could be cheaper for the build quality.', 'Delivery was delayed a few days, but customer service was responsive.'"
)

inputs = tokenizer([prompt_text], return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=255,
    use_cache=True,
    temperature=0.7,  # More diversity in the responses
    top_p=0.9  # Lets the model explore different options
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))


Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Summarize customer reviews.

### Input:
Product: Bluetooth Headphones - SoundBeats | Positive: 'Excellent sound quality at a great price.', 'Fast delivery and helpful customer support during setup.', 'Great value for money.' | Negative: 'Could be cheaper for the build quality.', 'Delivery was delayed a few days, but customer service was responsive.'

### Response:
SoundBeats headphones offer excellent sound quality at a reasonable price, with fast delivery and good customer support. However, some customers feel the build could be better for the price, and delivery was delayed for a few.
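The `temperature` and `top_p` arguments shape the distribution the next token is sampled from. The sketch below illustrates both ideas on a toy list of logits in plain Python. It is an illustration of the concepts only, not the implementation used inside `model.generate`, and the function names are made up for this example.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}

toy_logits = [2.0, 1.0, 0.0, -1.0]
probs = softmax_with_temperature(toy_logits, temperature=0.7)
nucleus = top_p_filter(probs, top_p=0.9)
print(probs)
print(nucleus)
```

Lower temperatures concentrate probability on the most likely tokens, while a smaller `top_p` removes the unlikely tail entirely. For summarization, fairly conservative values tend to keep the output close to the source reviews.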


You can also use a `TextStreamer` for continuous inference, so you can watch the generation token by token instead of waiting for the full output.

In [19]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Summarize customer reviews.", # instruction
        "Product: Bluetooth Headphones - SoundBeats | Positive: 'Excellent sound quality at a great price.', 'Fast delivery and helpful customer support during setup.', 'Great value for money.' | Negative: 'Could be cheaper for the build quality.', 'Delivery was delayed a few days, but customer service was responsive.'", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Summarize customer reviews.

### Input:
Product: Bluetooth Headphones - SoundBeats | Positive: 'Excellent sound quality at a great price.', 'Fast delivery and helpful customer support during setup.', 'Great value for money.' | Negative: 'Could be cheaper for the build quality.', 'Delivery was delayed a few days, but customer service was responsive.'

### Response:
SoundBeats headphones are praised for their sound quality and value, but some feel the build could be better and delivery was delayed in some cases.<|eot_id|>


<a name="Save"></a>
# IV. Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [20]:
model.save_pretrained("SummarizeReviews_model") # Local saving
tokenizer.save_pretrained("SummarizeReviews_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('SummarizeReviews_model/tokenizer_config.json',
 'SummarizeReviews_model/special_tokens_map.json',
 'SummarizeReviews_model/tokenizer.json')

In [22]:
import os
from huggingface_hub import login
HF_TOKEN = os.environ.get('HF_TOKEN')

# Log in to Hugging Face Hub
login(token=HF_TOKEN)

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [23]:
model.push_to_hub("Nourhen2001/SummarizeReviews_model_v2", token = HF_TOKEN) # Online saving
tokenizer.push_to_hub("Nourhen2001/SummarizeReviews_model_v2", token = HF_TOKEN) # Online saving

README.md:   0%|          | 0.00/610 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Saved model to https://huggingface.co/Nourhen2001/SummarizeReviews_model_v2


  0%|          | 0/1 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]
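The upload is small because only the adapters are pushed. As a sanity check, the size of `adapter_model.safetensors` can be estimated from the trainable-parameter count in the training banner. The 4 bytes per parameter is an assumption (adapters stored as float32), chosen because it agrees with the 168M shown in the progress bar.

```python
trainable_params = 41_943_040  # from the Unsloth training banner
bytes_per_param = 4            # ASSUMPTION: adapter weights stored as float32

adapter_mb = trainable_params * bytes_per_param / 1e6  # decimal megabytes
print(f"Estimated adapter size: ~{adapter_mb:.0f} MB")
```

Compare this with the 5.70G download for the 4-bit base model: sharing adapters is far cheaper than sharing merged weights, at the cost of needing the base model at load time.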

Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:

In [24]:
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "Nourhen2001/SummarizeReviews_model_v2", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!
# The alpaca_prompt redefined in In [18] uses positional `{}` placeholders,
# so the fields are passed positionally here. Passing them as keywords
# (instruction=..., input=..., output=...) raises
# "IndexError: Replacement index 0 out of range for positional args tuple".

inputs = tokenizer(
[
    alpaca_prompt.format(
        "Summarize customer reviews.",  # instruction
        "Product: Bluetooth Headphones - SoundBeats | Positive: 'Excellent sound quality at a great price.', 'Fast delivery and helpful customer support during setup.', 'Great value for money.' | Negative: 'Could be cheaper for the build quality.', 'Delivery was delayed a few days, but customer service was responsive.'",  # input
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.50.0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]
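This notebook defines `alpaca_prompt` twice: first with named fields (`{instruction}`, `{input}`, `{output}`) in In [8], then with positional `{}` placeholders in In [18]. Whichever definition ran last determines how `.format` must be called. The short templates below are simplified stand-ins for the real prompt, used only to show the behavior of `str.format`.

```python
named_template = "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"
positional_template = "### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n"

# Named fields need keyword arguments.
ok_named = named_template.format(instruction="Summarize.", input="Review text", output="")

# Positional fields need positional arguments; extra ones are silently ignored.
ok_positional = positional_template.format("Summarize.", "Review text", "")

# Keyword arguments with a positional template fail.
try:
    positional_template.format(instruction="Summarize.", input="Review text", output="")
    mismatch_error = None
except IndexError as exc:
    mismatch_error = str(exc)

print(ok_named == ok_positional)
print(mismatch_error)
```

Defining the template once, with named fields, and reusing it for both training and inference avoids this class of error.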

You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be very slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [25]:
if False:
    # Not recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "SummarizeReviews_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("SummarizeReviews_model")