## Installation

In [1]:
%%capture
!pip install unsloth==2025.1.6

## Global Variables

In [4]:
dataset_name = "pauliusztin/second_brain_course_summarization_task"  # Dataset generated in Lesson 3. Keep this default if you haven't generated your own.
model_name = "pauliusztin/Meta-Llama-3.1-8B-Instruct-Second-Brain-Summarization"  # LLM fine-tuned in Lesson 4. Keep this default if you haven't fine-tuned your own.

## Load Fine-tuned LLM

In [5]:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=8192,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

==((====))==  Unsloth 2025.1.6: Fast Llama patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096, padding_idx=128004)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSN
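
The printed module tree gives enough information for a rough estimate of how much memory the 4-bit weights take. The sketch below is back-of-the-envelope arithmetic only, using the `Linear4bit` shapes shown above. It assumes exactly 4 bits per weight and deliberately ignores the embedding table, the norm layers, quantization constants, activations, and the KV cache, so the real footprint will be higher.

```python
# Linear4bit shapes copied from the printed module tree: (in_features, out_features).
LAYER_LINEAR_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32  # from "(0-31): 32 x LlamaDecoderLayer"
BITS_PER_WEIGHT = 4  # assumption: plain 4 bits per weight, no quantization metadata

# Weights in the quantized linear layers of one decoder layer.
params_per_layer = sum(rows * cols for rows, cols in LAYER_LINEAR_SHAPES.values())

# Weights in the quantized linear layers of the whole decoder stack.
total_linear_params = NUM_LAYERS * params_per_layer

# Approximate storage for those weights at 4 bits each.
approx_bytes = total_linear_params * BITS_PER_WEIGHT // 8

print(f"Linear params per layer: {params_per_layer:,}")
print(f"Linear params in all layers: {total_linear_params:,}")
print(f"Approx. 4-bit weight storage: {approx_bytes / 2**30:.2f} GiB")
```

Under these assumptions the quantized decoder weights fit well within the 14.741 GB the Tesla T4 reports, which is why the model loads on this GPU with `load_in_4bit=True`.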

## Prepare Input Samples

In [6]:
from datasets import load_dataset


alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are a helpful assistant specialized in summarizing documents. Generate a concise TL;DR summary in markdown format having a maximum of 512 characters of the key findings from the provided documents, highlighting the most significant insights

### Input:
{}

### Response:
{}"""
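
The template has two `{}` slots: the first receives the document to summarize and the second receives the response. As a minimal sketch of how the slots are filled, the example below uses a shortened, hypothetical stand-in for the template (`demo_prompt`) and made-up strings. At inference time the second slot stays empty so the model continues after `### Response:`; filling it with a reference answer is presumably how training examples were built, but that is an assumption here.

```python
# Shortened, hypothetical stand-in for the notebook's template (illustration only).
demo_prompt = """### Instruction:
Summarize the document.

### Input:
{}

### Response:
{}"""

document = "# Notes\n- Dataset\n- Trainer"  # made-up input

# Inference: leave the response slot empty so generation starts right after the marker.
inference_prompt = demo_prompt.format(document, "")

# Assumed training-style usage: fill both slots with the input and a reference answer.
reference_text = demo_prompt.format(document, "TL;DR: training components.")

print(inference_prompt)
```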

In [7]:
dataset = load_dataset(dataset_name, split="test")

Generating train split:   0%|          | 0/575 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/72 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/72 [00:00<?, ? examples/s]

In [8]:
dataset[0]["instruction"][:1000]

'# Notes\n\n\n\n<child_page>\n# Design Patterns\n\n# Training code\n\nThe most natural way of splitting the training code:\n- Dataset\n- DatasetLoader\n- Model\n- ModelFactory\n- Trainer (takes in the dataset and model)\n- Evaluator\n\n# Serving code\n\n[Infrastructure]Model (takes in the trained model)\n\t- register\n\t- deploy\n</child_page>\n\n\n---\n\n# Resources [Community]\n\n# Resources [Science]\n\n# Tools'

In [9]:
dataset[0]["answer"][:1000]

'```markdown\n# TL;DR Summary\n\n## Design Patterns\n- **Training Code Structure**: Key components include Dataset, DatasetLoader, Model, ModelFactory, Trainer, and Evaluator.\n- **Serving Code**: Infrastructure for Model registration and deployment.\n\n## Tags\n- Generative AI\n- LLMs\n```'
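
The instruction asks for summaries of at most 512 characters, so it can be useful to check reference answers and generated outputs against that limit. The helper below is a hypothetical convenience function, not part of the course code, and it counts raw characters including any markdown markup.

```python
MAX_SUMMARY_CHARS = 512  # limit stated in the instruction prompt


def check_summary_length(summary: str, max_chars: int = MAX_SUMMARY_CHARS) -> dict:
    """Report the character count of a summary and whether it respects the limit."""
    num_chars = len(summary)
    return {
        "chars": num_chars,
        "within_limit": num_chars <= max_chars,
        "excess": max(0, num_chars - max_chars),
    }


# Made-up summary used only to demonstrate the call.
report = check_summary_length("# TL;DR Summary\n- Training code: Dataset, Model, Trainer.")
print(report)
```

In the notebook, the same call could be applied to `dataset[0]["answer"]` or to a generated summary.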

## Inference

In [10]:
from transformers import TextStreamer


text_streamer = TextStreamer(tokenizer)


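def generate_text(instruction, streaming: bool = True, trim_input_message: bool = False):
    """Generate a summary for `instruction` using the Alpaca prompt template."""
    message = alpaca_prompt.format(
        instruction,
        "",  # Leave the response slot empty so the model fills it in.
    )
    inputs = tokenizer([message], return_tensors="pt").to("cuda")

    if streaming:
        # Tokens are printed to stdout as they are generated.
        return model.generate(**inputs, streamer=text_streamer, max_new_tokens=256, use_cache=True)

    output_tokens = model.generate(**inputs, max_new_tokens=256, use_cache=True)
    if trim_input_message:
        # Slice off the prompt tokens before decoding. Trimming the decoded string by
        # len(message) can be off when decoding does not reproduce the prompt exactly.
        output_tokens = output_tokens[:, inputs["input_ids"].shape[1] :]
    return tokenizer.batch_decode(output_tokens, skip_special_tokens=True)[0]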
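
When the full decoded text is returned (prompt plus completion), the summary can also be recovered by splitting on the template's `### Response:` marker. The helper below is a hypothetical sketch in plain Python. It assumes the marker appears in the decoded text exactly as in the template, and it takes the text after the last occurrence, so it would misbehave if the model itself emitted the marker again.

```python
RESPONSE_MARKER = "### Response:"  # marker used by the Alpaca-style template


def extract_response(decoded: str, marker: str = RESPONSE_MARKER) -> str:
    """Return the text after the last response marker, or the whole text if it is absent."""
    _, found, tail = decoded.rpartition(marker)
    # rpartition returns an empty separator when the marker is not found.
    return tail.strip() if found else decoded.strip()


# Made-up decoded output used only to demonstrate the call.
decoded_output = "### Input:\n# Notes\n\n### Response:\nTL;DR: design patterns for training code."
print(extract_response(decoded_output))
```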

In [12]:
_ = generate_text(dataset[0]["instruction"], streaming=True)

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are a helpful assistant specialized in summarizing documents. Generate a concise TL;DR summary in markdown format having a maximum of 512 characters of the key findings from the provided documents, highlighting the most significant insights

### Input:
# Notes



<child_page>
# Design Patterns

# Training code

The most natural way of splitting the training code:
- Dataset
- DatasetLoader
- Model
- ModelFactory
- Trainer (takes in the dataset and model)
- Evaluator

# Serving code

[Infrastructure]Model (takes in the trained model)
	- register
	- deploy
</child_page>


---

# Resources [Community]

# Resources [Science]

# Tools

### Response:
TL;DR Summary:
Design patterns for training and serving code:
- Training: Dataset, DatasetLoader, Model, ModelFactory, Trainer, Evaluator
- Serving: In