## Install relevant packages

In [8]:
%%capture

!pip install unsloth # install unsloth
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git # Also get the latest version of Unsloth from GitHub

## Import all relevant packages throughout this walkthrough

In [10]:
# Modules for fine-tuning
from unsloth import FastLanguageModel
import torch # Import PyTorch
from trl import SFTTrainer # Trainer for supervised fine-tuning (SFT)
from unsloth import is_bfloat16_supported # Checks if the hardware supports bfloat16 precision
# Hugging Face modules
from huggingface_hub import login # Lets you login to API
from transformers import TrainingArguments # Defines training hyperparameters
from datasets import load_dataset # Lets you load fine-tuning datasets
# Import weights and biases
import wandb
# Import kaggle secrets
from kaggle_secrets import UserSecretsClient


Please restructure your imports with 'import unsloth' at the top of your file.
  from unsloth import FastLanguageModel


ImportError: cannot import name 'TensorifyScalarRestartAnalysis' from 'torch._dynamo.exc' (/usr/local/lib/python3.11/dist-packages/torch/_dynamo/exc.py)

## Create API keys and login to Hugging Face and Weights and Biases

In [6]:
# Initialize Hugging Face & WnB tokens
user_secrets = UserSecretsClient() # from kaggle_secrets import UserSecretsClient
hugging_face_token = user_secrets.get_secret("HF_TOKEN_DEEPSEEK")
wnb_token = user_secrets.get_secret("wnb_token")

# Login to Hugging Face
login(hugging_face_token) # from huggingface_hub import login

# Login to WnB
wandb.login(key=wnb_token) # import wandb
run = wandb.init(
    project='Fine-tune-DeepSeek-R1-Distill-Llama-8B on LawBot', 
    job_type="training", 
    anonymous="allow"
)

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mlucky-sntso[0m ([33mlucky-santoso[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


## Loading DeepSeek R1 and the Tokenizer

**What are we doing in this step?**

In this step, we **load the DeepSeek R1 model and its tokenizer** using `FastLanguageModel.from_pretrained()`. We also **configure key parameters** for efficient inference and fine-tuning. We will be using a distilled 8B version of R1 for faster computation.  

**Key parameters explained**
```py
max_seq_length = 2048  # Define the maximum sequence length a model can handle (i.e., number of tokens per input)
dtype = None  # Default data type (usually auto-detected)
load_in_4bit = True  # Enables 4-bit quantization – a memory-saving optimization
```

**Intuition behind 4-bit quantization**

Imagine compressing a **high-resolution image** to a smaller size—**it takes up less space but still looks good enough**. Similarly, **4-bit quantization reduces the precision of model weights**, making the model **smaller and faster while keeping most of its accuracy**. Instead of storing precise **32-bit or 16-bit numbers**, we compress them into **4-bit values**. This allows **large language models to run efficiently on consumer GPUs** without needing massive amounts of memory. 
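To make the saving concrete, here is a rough sketch in plain Python. `weight_memory_gb`, `quantize_absmax_4bit`, and `dequantize` are hypothetical helpers written for this illustration, the 8 billion parameter count is a round-number assumption, and the uniform absmax scheme shown is a simplification rather than the scheme used by the library behind `load_in_4bit`.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Idealized storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def quantize_absmax_4bit(weights):
    """Map floats to signed 4-bit integer codes in [-8, 7] using one shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 7 if largest > 0 else 1.0  # avoid dividing by zero
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


NUM_PARAMS = 8e9  # assumption: roughly 8 billion weights
for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")

# Toy weights (made up) to show the precision that is given up
weights = [0.70, -0.35, 0.10, 0.02]
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"max reconstruction error: {max_error:.3f}")
```

The reconstruction error stays within half a quantization step, which is the "looks good enough" part of the image analogy. A real 4-bit checkpoint is somewhat larger than the idealized figure, presumably because scaling constants and any tensors kept at higher precision add overhead.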

In [6]:
# Set parameters
max_seq_length = 2048 # Define the maximum sequence length a model can handle (i.e. how many tokens can be processed at once)
dtype = None # Set to default 
load_in_4bit = True # Enables 4 bit quantization — a memory saving optimization 

# Load the DeepSeek R1 model and tokenizer using unsloth — imported using: from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/DeepSeek-R1-Distill-Llama-8B",  # Load the pre-trained DeepSeek R1 model (8B parameter version)
    max_seq_length=max_seq_length, # Ensure the model can process up to 2048 tokens at once
    dtype=dtype, # Use the default data type (e.g., FP16 or BF16 depending on hardware support)
    load_in_4bit=load_in_4bit, # Load the model in 4-bit quantization to save memory
    token=hugging_face_token, # Use hugging face token
)

==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.51.1.
   \\   /|    Tesla P100-PCIE-16GB. Num GPUs = 1. Max memory: 15.888 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 6.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/236 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/53.0k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

## Testing DeepSeek R1 on a legal use case before fine-tuning


### Defining a system prompt 
To create a prompt style for the model, we will define a system prompt and include placeholders for the question and response generation. The prompt will guide the model to think step-by-step and provide a logical, accurate response.

In [7]:
prompt_style = """
Di bawah ini adalah instruksi yang menjelaskan tugas, dipasangkan dengan input yang memberikan konteks lebih lanjut.
Tuliskan respons yang menyelesaikan permintaan dengan tepat.
Sebelum menjawab, pikirkan dengan cermat pertanyaan tersebut dan buatlah rangkaian pemikiran langkah demi langkah untuk memastikan respons yang logis dan akurat.

### Instruksi:
Anda adalah seorang ahli hukum dengan pengetahuan tingkat lanjut dalam penalaran hukum, analisis kasus, dan penyusunan dokumen hukum. 
Jawablah pertanyaan hukum berikut ini dengan tepat, berdasarkan peraturan perundang-undangan yang berlaku dan preseden hukum yang relevan.

### Pertanyaan:
{}

### Jawaban:
<think>
{}
"""

### Running inference on the model

In this step, we **test the DeepSeek R1 model** by providing a **legal question** and generating a response.  
The process involves the following steps:

1. **Define a test question** related to a legal case.
2. **Format the question using the structured prompt (`prompt_style`)** to ensure the model follows a logical reasoning process.
3. **Tokenize the input and move it to the GPU (`cuda`)** for faster inference.
4. **Generate a response using the model**, specifying key parameters like `max_new_tokens=1200` (limits response length).
5. **Decode the output tokens back into text** to obtain the final readable answer.
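For step 5, note that the decoded string contains the prompt as well as the generated text. Because the template labels the answer section `### Jawaban:`, splitting on that marker isolates the model's part. The sketch below uses a hypothetical helper, `extract_answer`, and a shortened stand-in for the decoded output, so it runs without the model.

```python
ANSWER_MARKER = "### Jawaban:"


def extract_answer(decoded: str, marker: str = ANSWER_MARKER) -> str:
    """Return the text after the last answer marker, or the full text if the marker is absent."""
    _, found, tail = decoded.rpartition(marker)
    return tail.strip() if found else decoded.strip()


# Shortened stand-in for response[0]; the real string comes from tokenizer.batch_decode
decoded_example = (
    "### Pertanyaan:\nApa arti dari ...?\n\n"
    "### Jawaban:\n<think>\nlangkah penalaran\n</think>\njawaban akhir"
)
print(extract_answer(decoded_example))
```

Falling back to the full text when the marker is missing keeps the helper safe to use with other prompt templates.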

In [10]:
# Create a test legal question for inference
question = """Apa arti dari “berada di bawah Presiden” dalam konteks TNI?"""

# Enable optimized inference mode for Unsloth models (improves speed and efficiency)
FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!

# Format the question using the structured prompt (`prompt_style`) and tokenize it
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")  # Convert input to PyTorch tensor & move to GPU

# Generate a response using the model
outputs = model.generate(
    input_ids=inputs.input_ids, # Tokenized input question
    attention_mask=inputs.attention_mask, # Attention mask to handle padding
    max_new_tokens=1200, # Limit response length to 1200 tokens (to prevent excessive output)
    use_cache=True, # Enable caching for faster inference
)

# Decode the generated output tokens into human-readable text
response = tokenizer.batch_decode(outputs)

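# Print the decoded text. The template marks the answer with "### Jawaban:", not "### Response:",
# so this split matches nothing and the full prompt plus the generated text is printed
print(response[0].split("### Response:")[0])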

<｜begin▁of▁sentence｜>
Di bawah ini adalah instruksi yang menjelaskan tugas, dipasangkan dengan input yang memberikan konteks lebih lanjut.
Tuliskan respons yang menyelesaikan permintaan dengan tepat.
Sebelum menjawab, pikirkan dengan cermat pertanyaan tersebut dan buatlah rangkaian pemikiran langkah demi langkah untuk memastikan respons yang logis dan akurat.

### Instruksi:
Anda adalah seorang ahli hukum dengan pengetahuan tingkat lanjut dalam penalaran hukum, analisis kasus, dan penyusunan dokumen hukum. 
Jawablah pertanyaan hukum berikut ini dengan tepat, berdasarkan peraturan perundang-undangan yang berlaku dan preseden hukum yang relevan.

### Pertanyaan:
Apa arti dari “berada di bawah Presiden” dalam konteks TNI?

### Jawaban:
<think>

"Berada di bawah Presiden" dalam konteks TNI berarti seorang anggota TNI yang diberikan tugas khusus oleh Presiden untuk melaksanakan tindakan tertentu, biasanya dalam situasi darurat atau ketika keamanan negara terancang. Ini adalah gelaran khusus

>**Before starting fine-tuning — why are we fine-tuning in the first place?**
>
> Even without fine-tuning, our model successfully generated a chain of thought and provided reasoning before delivering the final answer. The reasoning process is encapsulated within the `<think>` `</think>` tags. So, why do we still need fine-tuning? The reasoning process, while detailed, was long-winded and not concise. Additionally, we want the final answer to be consistent in a certain style. 



## Fine-tuning step by step

### Step 1 — Update the system prompt 
We will slightly change the prompt style for processing the dataset: the second placeholder now holds the chain-of-thought column, followed by a closing `</think>` tag and a third placeholder for the final answer.

In [11]:
train_prompt_style = """
Di bawah ini adalah instruksi yang menjelaskan tugas, dipasangkan dengan input yang memberikan konteks lebih lanjut.
Tuliskan respons yang menyelesaikan permintaan dengan tepat.
Sebelum menjawab, pikirkan dengan cermat pertanyaan tersebut dan buatlah rangkaian pemikiran langkah demi langkah untuk memastikan respons yang logis dan akurat.

### Instruksi:
Anda adalah seorang ahli hukum dengan pengetahuan tingkat lanjut dalam penalaran hukum, analisis kasus, dan penyusunan dokumen hukum. 
Jawablah pertanyaan hukum berikut ini dengan tepat, berdasarkan peraturan perundang-undangan yang berlaku dan preseden hukum yang relevan.


### Pertanyaan:
{}

### Jawaban:
<think>
{}
</think>
{}
"""


### Step 2 — Download the fine-tuning dataset and format it for fine-tuning

We will use a small Indonesian law question-answering dataset, loaded from a local CSV file on Kaggle. Each row has a question (`Question`), a chain of thought (`Complex_cot`), and a final answer (`Output`). This structure mirrors reasoning datasets such as Medical O1 Reasoning SFT on [Hugging Face](https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT), which also pairs each question with a reasoning trace and a final response.

In [12]:
import pandas as pd

# Load dataset from local path
df = pd.read_csv("/kaggle/input/dataset-law-chatbot/sample_dataset_law_chatbot.csv")

# Optional: look at the first 5 rows to confirm the columns are as expected
df.head()


| Unnamed: 0 | Question | Complex_cot | Output |
|---|---|---|---|
| 0 | Apa pengertian dari Tentara Nasional Indonesia... | Berdasarkan UU No. 34 Tahun 2004, Tentara Nasi... | TNI adalah alat negara di bidang pertahanan ya... |
| 1 | Siapa yang dimaksud dengan prajurit dalam UU TNI? | Berdasarkan UU No. 34 Tahun 2004, prajurit ada... | Prajurit adalah warga negara Indonesia yang di... |
| 2 | Apa itu pertahanan negara menurut UU TNI? | Berdasarkan UU No. 34 Tahun 2004, pertahanan n... | Pertahanan negara adalah usaha untuk menegakka... |
| 3 | Bagaimana definisi sistem pertahanan negara me... | Berdasarkan UU No. 34 Tahun 2004, sistem perta... | Sistem pertahanan negara adalah sistem semesta... |
| 4 | Apa arti dari “berada di bawah Presiden” dalam... | Berdasarkan perubahan UU No. 34 Tahun 2004 pad... | “Berkedudukan di bawah Presiden” berarti TNI b... |


>**Next step is to structure the fine-tuning dataset according to train prompt style—why?**
>
> - Each question is paired with chain-of-thought reasoning and the final response.
> - Ensures every training example follows a consistent pattern.
> - Prevents the model from continuing beyond the expected response length by adding the EOS token.

In [13]:
# We need to format the dataset to fit our prompt training style 
EOS_TOKEN = tokenizer.eos_token  # Define EOS_TOKEN, which tells the model when to stop generating text
EOS_TOKEN

'<｜end▁of▁sentence｜>'

In [16]:
# Define formatting prompt function
def formatting_prompts_func(examples):  # Takes a batch of dataset examples as input
    questions = examples["Question"]    # Extracts the legal questions from the batch
    cots = examples["Complex_cot"]      # Extracts the chain-of-thought reasoning (logical step-by-step explanation)
    outputs = examples["Output"]        # Extracts the final answers
    
    texts = []  # Initializes an empty list to store the formatted prompts
    
    # Iterate over the batch, formatting each question, reasoning step, and response
    for question, cot, output in zip(questions, cots, outputs):  
        text = train_prompt_style.format(question, cot, output) + EOS_TOKEN  # Insert values into prompt template & append EOS token
        texts.append(text)  # Add the formatted text to the list

    return {
        "text": texts,  # Return the newly formatted dataset with a "text" column containing structured prompts
    }
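With `batched=True`, `Dataset.map` hands the function a dictionary of column lists rather than a single row, and expects a dictionary of equally long lists back. The following sketch reproduces that contract in plain Python. The shortened template, the placeholder EOS string, and the two rows are made up for illustration.

```python
EOS = "<eos>"  # placeholder; the notebook uses tokenizer.eos_token
TEMPLATE = "### Pertanyaan:\n{}\n\n### Jawaban:\n<think>\n{}\n</think>\n{}"


def format_batch(examples: dict) -> dict:
    """Turn a batch of columns into one formatted training string per row."""
    rows = zip(examples["Question"], examples["Complex_cot"], examples["Output"])
    return {"text": [TEMPLATE.format(q, cot, out) + EOS for q, cot, out in rows]}


# A batch is a dict of columns, each column being a list with one entry per row
batch = {
    "Question": ["Pertanyaan A", "Pertanyaan B"],
    "Complex_cot": ["Penalaran A", "Penalaran B"],
    "Output": ["Jawaban A", "Jawaban B"],
}
formatted = format_batch(batch)
print(formatted["text"][0])
```

Each output string ends with the EOS marker, which is what teaches the model where a complete answer stops.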

In [17]:
# Update dataset formatting
from datasets import Dataset

# Convert Pandas DataFrame to Hugging Face Dataset
dataset = Dataset.from_pandas(df)

dataset_finetune = dataset.map(formatting_prompts_func, batched = True)
dataset_finetune["text"][0]

Map:   0%|          | 0/20 [00:00<?, ? examples/s]

'\nDi bawah ini adalah instruksi yang menjelaskan tugas, dipasangkan dengan input yang memberikan konteks lebih lanjut.\nTuliskan respons yang menyelesaikan permintaan dengan tepat.\nSebelum menjawab, pikirkan dengan cermat pertanyaan tersebut dan buatlah rangkaian pemikiran langkah demi langkah untuk memastikan respons yang logis dan akurat.\n\n### Instruksi:\nAnda adalah seorang ahli hukum dengan pengetahuan tingkat lanjut dalam penalaran hukum, analisis kasus, dan penyusunan dokumen hukum. \nJawablah pertanyaan hukum berikut ini dengan tepat, berdasarkan peraturan perundang-undangan yang berlaku dan preseden hukum yang relevan.\n\n\n### Pertanyaan:\nApa pengertian dari Tentara Nasional Indonesia menurut undang-undang?\n\n### Jawaban:\n<think>\nBerdasarkan UU No. 34 Tahun 2004, Tentara Nasional Indonesia (TNI) adalah alat negara di bidang pertahanan yang terdiri dari Angkatan Darat, Angkatan Laut, dan Angkatan Udara. TNI berperan menegakkan kedaulatan negara, mempertahankan keutuhan 

### Step 3 — Setting up the model using LoRA

**An intuitive explanation of LoRA** 

Large language models (LLMs) have **millions or even billions of weights** that determine how they process and generate text. When fine-tuning a model, we usually update all these weights, which **requires massive computational resources and memory**.

LoRA (**Low-Rank Adaptation**) makes fine-tuning more efficient:

- Instead of modifying all weights, **LoRA adds small, trainable adapters** to specific layers.  
- These adapters **capture task-specific knowledge** while leaving the original model weights frozen.  
- This sharply reduces the number of trainable parameters, making fine-tuning **faster and more memory-efficient**. In this notebook, the training log reports 41,943,040 trainable parameters out of roughly 8 billion (0.52%).  

Think of an LLM as a **complex factory**. Instead of rebuilding the entire factory to produce a new product, LoRA **adds small, specialized tools** to existing machines. This allows the factory to adapt quickly **without disrupting its core structure**.
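In matrix terms, an adapted layer computes the frozen projection plus a scaled low-rank correction: `h = W x + (alpha / r) * B (A x)`. The pure-Python sketch below uses tiny made-up matrices; the sizes and values are assumptions chosen for readability, and real implementations use tensor libraries.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen output W x plus the scaled low-rank update (alpha / r) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen weights, shape 2 x 3
A = [[0.5, -0.5, 1.0]]                   # trainable, shape r x 3 with r = 1
B_zero = [[0.0], [0.0]]                  # trainable, shape 2 x r, zero at initialization
B_trained = [[0.2], [-0.4]]              # made-up values standing in for a trained adapter
x = [1.0, 2.0, 3.0]

print("base output:      ", matvec(W, x))
print("adapter at init:  ", lora_forward(W, A, B_zero, x, alpha=1, r=1))
print("adapter after fit:", lora_forward(W, A, B_trained, x, alpha=1, r=1))
```

Because `B` starts at zero, the adapted layer initially reproduces the base model exactly; training only moves `A` and `B`, never `W`.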

For a more technical explanation, check out this tutorial by [Sebastian Raschka](https://www.youtube.com/watch?v=rgmJep4Sb4&t).

Below, we use the `get_peft_model()` function (PEFT stands for Parameter-Efficient Fine-Tuning). It wraps the base model (`model`) with LoRA adapters so that only the adapter parameters are trained.

In [19]:
# Apply LoRA (Low-Rank Adaptation) fine-tuning to the model 
model_lora_lawbot = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank: Determines the size of the trainable adapters (higher = more parameters, lower = more efficiency)
    target_modules=[  # List of transformer layers where LoRA adapters will be applied
        "q_proj",   # Query projection in the self-attention mechanism
        "k_proj",   # Key projection in the self-attention mechanism
        "v_proj",   # Value projection in the self-attention mechanism
        "o_proj",   # Output projection from the attention layer
        "gate_proj",  # Used in feed-forward layers (MLP)
        "up_proj",    # Part of the transformer’s feed-forward network (FFN)
        "down_proj",  # Another part of the transformer’s FFN
    ],
    lora_alpha=16,  # Scaling factor for LoRA updates (higher values allow more influence from LoRA layers)
    lora_dropout=0,  # Dropout rate for LoRA layers (0 means no dropout, full retention of information)
    bias="none",  # Specifies whether LoRA layers should learn bias terms (setting to "none" saves memory)
    use_gradient_checkpointing="unsloth",  # Saves memory by recomputing activations instead of storing them (recommended for long-context fine-tuning)
    random_state=3407,  # Sets a seed for reproducibility, ensuring the same fine-tuning behavior across runs
    use_rslora=False,  # Whether to use Rank-Stabilized LoRA (disabled here, so the standard lora_alpha / r scaling is used)
    loftq_config=None,  # LoftQ (LoRA-Fine-Tuning-aware Quantization) initialization is disabled in this configuration
)
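As a sanity check on what this configuration trains, the adapter size can be estimated by hand: each adapted linear layer with input width `d_in` and output width `d_out` adds `r * (d_in + d_out)` parameters. The layer dimensions below are assumptions based on the published Llama 3.1 8B architecture (hidden size 4096, MLP width 14336, key/value width 1024, 32 layers); they are not read from the loaded model.

```python
R = 16            # LoRA rank, as in the configuration above
NUM_LAYERS = 32   # assumption: number of transformer blocks
HIDDEN, KV, MLP = 4096, 1024, 14336  # assumption: layer widths

# (d_in, d_out) for every module listed in target_modules
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the total comes to 41,943,040, which agrees with the trainable-parameter count that Unsloth prints when training starts.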

Now, we initialize `SFTTrainer`, a supervised fine-tuning trainer from `trl` (Transformer Reinforcement Learning), to fine-tune our model efficiently on a dataset.

In [20]:
# Initialize the fine-tuning trainer — Imported using from trl import SFTTrainer
trainer_lawbow = SFTTrainer(
    model=model_lora_lawbot,  # The model to be fine-tuned
    tokenizer=tokenizer,  # Tokenizer to process text inputs
    train_dataset=dataset_finetune,  # Dataset used for training
    dataset_text_field="text",  # Specifies which field in the dataset contains training text
    max_seq_length=max_seq_length,  # Defines the maximum sequence length for inputs
    dataset_num_proc=2,  # Uses 2 CPU threads to speed up data preprocessing

    # Define training arguments
    args=TrainingArguments(
        per_device_train_batch_size=2,  # Number of examples processed per device (GPU) at a time
        gradient_accumulation_steps=4,  # Accumulate gradients over 4 steps before updating weights
        num_train_epochs=1, # Number of epochs; ignored here because max_steps is set
        warmup_steps=5,  # Gradually increases learning rate for the first 5 steps
        max_steps=20,  # Limits training to 20 steps and overrides num_train_epochs (increase for full fine-tuning)
        learning_rate=2e-4,  # Learning rate for weight updates (tuned for LoRA fine-tuning)
        fp16=not is_bfloat16_supported(),  # Use FP16 (if BF16 is not supported) to speed up training
        bf16=is_bfloat16_supported(),  # Use BF16 if supported (better numerical stability on newer GPUs)
        logging_steps=10,  # Logs training progress every 10 steps
        optim="adamw_8bit",  # Uses memory-efficient AdamW optimizer in 8-bit mode
        weight_decay=0.01,  # Regularization to prevent overfitting
        lr_scheduler_type="linear",  # Uses a linear learning rate schedule
        seed=3407,  # Sets a fixed seed for reproducibility
        output_dir="outputs",  # Directory where fine-tuned model checkpoints will be saved
    ),
)


Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/20 [00:00<?, ? examples/s]
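The `warmup_steps`, `max_steps`, `learning_rate`, and `lr_scheduler_type="linear"` settings together describe a ramp up followed by a linear decay. The function below is a simplified sketch of that shape, written from the general definition of a linear warmup schedule rather than copied from the library; `linear_lr` is a hypothetical helper. The snippet also computes the effective batch size implied by the batch and accumulation settings.

```python
def linear_lr(step: int, base_lr: float = 2e-4, warmup_steps: int = 5, total_steps: int = 20) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


# per_device_train_batch_size x gradient_accumulation_steps x number of GPUs
effective_batch_size = 2 * 4 * 1
print("effective batch size:", effective_batch_size)

for step in (0, 1, 5, 10, 20):
    print(f"step {step:>2}: lr = {linear_lr(step):.2e}")
```

With only 20 steps, a quarter of the run is spent warming up, which is worth keeping in mind when `max_steps` is raised for a longer run.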

## Step 4 — Model training! 

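In this run, training took about six minutes (a `train_runtime` of roughly 353 seconds for 20 steps on the 20-example dataset); larger datasets or more steps will take longer. We can then check our training results on Weights and Biases.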

In [22]:
# Start the fine-tuning process
trainer_stats = trainer_lawbow.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 20 | Num Epochs = 10 | Total steps = 20
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040/8,000,000,000 (0.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|---|---|
| 10 | 1.5203 |
| 20 | 0.3146 |


In [23]:
# Finish the Weights and Biases run (flushes remaining logs and closes the run)
wandb.finish()

| Metric | Trend |
|---|---|
| train/epoch | ▁██ |
| train/global_step | ▁██ |
| train/grad_norm | █▁ |
| train/learning_rate | █▁ |
| train/loss | █▁ |

| Metric | Value |
|---|---|
| total_flos | 2077592306171904.0 |
| train/epoch | 6.8 |
| train/global_step | 20.0 |
| train/grad_norm | 0.40057 |
| train/learning_rate | 1e-05 |
| train/loss | 0.3146 |
| train_loss | 0.91747 |
| train_runtime | 353.4682 |
| train_samples_per_second | 0.453 |
| train_steps_per_second | 0.057 |


## Step 5 — Run model inference after fine-tuning

In [25]:
question = """Apa arti dari “berada di bawah Presiden” dalam konteks TNI?"""

# Load the inference model using FastLanguageModel (Unsloth optimizes for speed)
FastLanguageModel.for_inference(model_lora_lawbot)  # Unsloth has 2x faster inference!

# Tokenize the input question with a specific prompt format and move it to the GPU
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

# Generate a response using LoRA fine-tuned model with specific parameters
outputs = model_lora_lawbot.generate(
    input_ids=inputs.input_ids,          # Tokenized input IDs
    attention_mask=inputs.attention_mask, # Attention mask for padding handling
    max_new_tokens=1200,                  # Maximum length for generated response
    use_cache=True,                        # Enable cache for efficient generation
)

# Decode the generated response from tokenized format to readable text
response = tokenizer.batch_decode(outputs)

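# Print the decoded text. The template marks the answer with "### Jawaban:", not "### Response:",
# so this split matches nothing and the full prompt plus the generated text is printed
print(response[0].split("### Response:")[0])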

<｜begin▁of▁sentence｜>
Di bawah ini adalah instruksi yang menjelaskan tugas, dipasangkan dengan input yang memberikan konteks lebih lanjut.
Tuliskan respons yang menyelesaikan permintaan dengan tepat.
Sebelum menjawab, pikirkan dengan cermat pertanyaan tersebut dan buatlah rangkaian pemikiran langkah demi langkah untuk memastikan respons yang logis dan akurat.

### Instruksi:
Anda adalah seorang ahli hukum dengan pengetahuan tingkat lanjut dalam penalaran hukum, analisis kasus, dan penyusunan dokumen hukum. 
Jawablah pertanyaan hukum berikut ini dengan tepat, berdasarkan peraturan perundang-undangan yang berlaku dan preseden hukum yang relevan.

### Pertanyaan:
Apa arti dari “berada di bawah Presiden” dalam konteks TNI?

### Jawaban:
<think>

Berdasarkan UU No. 34 Tahun 2004, "berada di bawah Presiden" berarti anggota TNI yang menjalani pemeriksaan kasus keprajurit, dikasuskan oleh Presiden karena melakukan kesalahan dalam menjalani tugas, dengan maksud untuk memastikan pemeliharaan dis

In [None]:
# The adapter is saved in the "Save Model" section below via trainer_lawbow.save_model()

# ROUGE or BLEU evaluation
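This section is left as a placeholder in the notebook. As a starting point, the sketch below computes a simplified ROUGE-1 style F1 score (unigram overlap) in plain Python. It illustrates the idea and is not the official ROUGE implementation; the whitespace tokenizer, the `rouge1_f1` name, and the two example sentences are assumptions.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference answer and a generated answer."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())  # clipped matching unigram count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Made-up pair: a reference answer and a model answer that differs in one word
reference = "TNI berkedudukan di bawah Presiden"
candidate = "TNI berada di bawah Presiden"
print(f"ROUGE-1 F1: {rouge1_f1(reference, candidate):.2f}")
```

In practice the score would be averaged over held-out questions, comparing each generated answer with the `Output` column; with only 20 examples in this dataset, any such number should be read with caution.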

# Save Model

In [26]:
# After training finishes
trainer_lawbow.save_model("model_lora_lawbot")
tokenizer.save_pretrained("model_lora_lawbot") 

('model_lora_lawbot/tokenizer_config.json',
 'model_lora_lawbot/special_tokens_map.json',
 'model_lora_lawbot/tokenizer.json')

In [30]:
!zip -r model_lora_lawbot.zip model_lora_lawbot

  adding: model_lora_lawbot/ (stored 0%)
  adding: model_lora_lawbot/tokenizer_config.json (deflated 95%)
  adding: model_lora_lawbot/README.md (deflated 66%)
  adding: model_lora_lawbot/special_tokens_map.json (deflated 69%)
  adding: model_lora_lawbot/tokenizer.json (deflated 85%)
  adding: model_lora_lawbot/adapter_config.json (deflated 56%)
  adding: model_lora_lawbot/training_args.bin (deflated 51%)
  adding: model_lora_lawbot/adapter_model.safetensors (deflated 9%)


In [31]:
from IPython.display import FileLink

# Create a download link
FileLink("model_lora_lawbot.zip")

# Testing

In [None]:
from peft import PeftModel, PeftConfig
from unsloth import FastLanguageModel

# Set parameters
max_seq_length = 2048 # Define the maximum sequence length a model can handle (i.e. how many tokens can be processed at once)
dtype = None # Set to default 
load_in_4bit = True # Enables 4 bit quantization — a memory saving optimization 

# Load the DeepSeek R1 model and tokenizer using unsloth — imported using: from unsloth import FastLanguageModel
model_base, tokenizer_base = FastLanguageModel.from_pretrained(
    model_name="unsloth/DeepSeek-R1-Distill-Llama-8B",  # Load the pre-trained DeepSeek R1 model (8B parameter version)
    max_seq_length=max_seq_length, # Ensure the model can process up to 2048 tokens at once
    dtype=dtype, # Use the default data type (e.g., FP16 or BF16 depending on hardware support)
    load_in_4bit=load_in_4bit, # Load the model in 4-bit quantization to save memory
    token=hugging_face_token, # Use hugging face token
)

# 2. Load the LoRA adapter
model_lora_lawbot_load = PeftModel.from_pretrained(
    model_base,
    "",  # path to the adapter directory (left empty in the original; fill in before running)
    is_trainable = False    # True to continue fine-tuning, False for inference
)

# 3. Enable Unsloth inference optimizations
FastLanguageModel.for_inference(model_lora_lawbot_load)

In [None]:
prompt_style = """
Di bawah ini adalah instruksi yang menjelaskan tugas, dipasangkan dengan input yang memberikan konteks lebih lanjut.
Tuliskan respons yang menyelesaikan permintaan dengan tepat.
Sebelum menjawab, pikirkan dengan cermat pertanyaan tersebut dan buatlah rangkaian pemikiran langkah demi langkah untuk memastikan respons yang logis dan akurat.

### Instruksi:
Anda adalah seorang ahli hukum dengan pengetahuan tingkat lanjut dalam penalaran hukum, analisis kasus, dan penyusunan dokumen hukum. 
Jawablah pertanyaan hukum berikut ini dengan tepat, berdasarkan peraturan perundang-undangan yang berlaku dan preseden hukum yang relevan.


### Pertanyaan:
{}

### Jawaban:
<think>
{}
</think>
{}
"""


In [None]:
question = """"""

# Load the inference model using FastLanguageModel (Unsloth optimizes for speed)
FastLanguageModel.for_inference(model_lora_lawbot_load)  # Unsloth has 2x faster inference!

# Tokenize the input question with a specific prompt format and move it to the GPU
inputs = tokenizer_base([prompt_style.format(question, "", "")], return_tensors="pt").to("cuda")

# Generate a response using LoRA fine-tuned model with specific parameters
outputs = model_lora_lawbot_load.generate(
    input_ids=inputs.input_ids,          # Tokenized input IDs
    attention_mask=inputs.attention_mask, # Attention mask for padding handling
    max_new_tokens=1200,                  # Maximum length for generated response
    use_cache=True,                        # Enable cache for efficient generation
)

# Decode the generated response from tokenized format to readable text
response = tokenizer_base.batch_decode(outputs)

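# Print the decoded text. The template marks the answer with "### Jawaban:", not "### Response:",
# so this split matches nothing and the full prompt plus the generated text is printed
print(response[0].split("### Response:")[0])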