In [None]:
# Install PyTorch
%pip install "torch==2.3.1" tensorboard

# Install Hugging Face libraries
%pip install  --upgrade "transformers==4.45.1" "datasets==2.18.0" "accelerate==0.29.3" "evaluate==0.4.1" "bitsandbytes==0.43.1" "huggingface_hub==0.23.4" "trl==0.8.6" "peft==0.13.0"

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [None]:
dbutils.library.restartPython()

### 1. What are the training types of encoder and decoder models?

Pre-training Methods

- Encoder-decoder (full Transformer): sequence-to-sequence training with teacher forcing. The model is trained on pairs of input and output sequences, and the correct output tokens are fed into the decoder at each step during training.
- Encoder-only: masked language modeling (MLM). Some of the input tokens are masked, and the model learns to predict the masked tokens.
- Decoder-only: autoregressive training. The model predicts the next token in the sequence given all previous tokens, so it does not require paired sentences.
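
As a rough illustration of how the three objectives differ, the sketch below builds (input, label) pairs from toy token lists. The helper names (`causal_lm_pair`, `mlm_pair`, `seq2seq_pair`), the `IGNORE` marker, and the 15% masking rate are illustrative assumptions, not any library's API.

```python
import random

IGNORE = None  # hypothetical marker: this position does not contribute to the loss

def causal_lm_pair(tokens):
    """Decoder-only objective: the label at position t is the token at t + 1."""
    return tokens[:-1], tokens[1:]

def mlm_pair(tokens, mask_prob=0.15, seed=0):
    """Encoder-only objective: hide some tokens and predict only the hidden ones."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(IGNORE)
    return inputs, labels

def seq2seq_pair(source, target, bos="[BOS]"):
    """Encoder-decoder objective with teacher forcing:
    the decoder input is the gold target shifted right by one position."""
    decoder_inputs = [bos] + target[:-1]
    return source, decoder_inputs, target

sentence = ["the", "soup", "was", "very", "spicy"]
print(causal_lm_pair(sentence))
print(mlm_pair(sentence))
print(seq2seq_pair(["la", "soupe"], ["the", "soup"]))
```

Only the sequence-to-sequence helper needs two aligned sequences; the other two derive their labels from a single piece of text.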

## 2. Please pick one popular model of each above and indicate the number of dimensions of each layer.

### 2.1 Self-attention
- $$ Q = XW_Q, \quad K = XW_K, \quad V = XW_V $$
- and $$ A = \frac{QK^T}{\sqrt{d_k}} $$
- $$ \text{output} = \text{softmax}(A)\,V $$
- Example: if X has shape (3, 4) and each of W_Q, W_K, W_V has shape (4, 2), then Q, K, and V each have shape (3, 2).
- QK^T has shape (3, 3): a square matrix with one score per pair of tokens.
- The scaling term d_k is 2, since it is the number of features in Q and K.
- So A is also (3, 3), and the softmax is applied to each row. V is (3, 2), so the output is (3, 2): each output row is a weighted average of the rows of V.
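
The shape walk-through can be checked with a small pure-Python sketch. The matrix values are arbitrary toy numbers chosen for illustration, and the helper functions are written out by hand so that the block has no dependencies.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def shape(m):
    return (len(m), len(m[0]))

def self_attention(x, w_q, w_k, w_v):
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])  # number of features in Q and K
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    weights = softmax_rows(scores)
    return matmul(weights, v), weights

# X is (3, 4): three tokens with four features each
x = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 2.0],
     [1.0, 1.0, 1.0, 1.0]]
# Each projection matrix is (4, 2)
w_q = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
w_k = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
w_v = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]

output, weights = self_attention(x, w_q, w_k, w_v)
print(shape(weights), shape(output))  # (3, 3) (3, 2)
```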

### 2.2 Encoder layer - DistilBERT
#### Embedding Layer
- (word_embeddings): Embedding(30522, 768, padding_idx=0)
- (position_embeddings): Embedding(512, 768)
- (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
- (dropout): Dropout(p=0.1, inplace=False)

#### Multi-Head Attention Layer x 6
- (dropout): Dropout(p=0.1, inplace=False)
- (q_lin): Linear(in_features=768, out_features=768, bias=True)
- (k_lin): Linear(in_features=768, out_features=768, bias=True)
- (v_lin): Linear(in_features=768, out_features=768, bias=True)
- (out_lin): Linear(in_features=768, out_features=768, bias=True)

#### MLP(FFN) + Activation & Output
- LayerNorm((768,), eps=1e-12, elementwise_affine=True)
- Dropout(p=0.1, inplace=False)
- Linear(in_features=768, out_features=3072, bias=True)
- Linear(in_features=3072, out_features=768, bias=True)
- GELUActivation()
- LayerNorm((768,), eps=1e-12, elementwise_affine=True)

### 2.3 Decoder layer - Llama
#### Embedding Layer
- Embedding(128256, 2048)

#### Multi-Head Attention Layer x 16
- (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
- (k_proj): Linear(in_features=2048, out_features=512, bias=False)
- (v_proj): Linear(in_features=2048, out_features=512, bias=False)
- (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
- (rotary_emb): LlamaRotaryEmbedding()

#### MLP(FFN) + Activation & Output
- (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
- (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
- (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
- (act_fn): SiLU()
- (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
- (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
- (norm): LlamaRMSNorm((2048,), eps=1e-05)
- (rotary_emb): LlamaRotaryEmbedding()
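
The listed shapes are enough to estimate the size of the model. The sketch below counts weights from the dimensions above and 16 decoder layers. It assumes that the output head shares its weights with the embedding matrix (so it is not counted separately) and that the rotary embeddings have no trainable parameters. Both are assumptions about this checkpoint, not something the printed layer list shows.

```python
def linear_params(in_features, out_features, bias=False):
    """Number of weights in a Linear layer."""
    return in_features * out_features + (out_features if bias else 0)

hidden, kv_dim, ffn, vocab, n_layers = 2048, 512, 8192, 128256, 16

attention = (
    linear_params(hidden, hidden)    # q_proj
    + linear_params(hidden, kv_dim)  # k_proj
    + linear_params(hidden, kv_dim)  # v_proj
    + linear_params(hidden, hidden)  # o_proj
)
mlp = (
    linear_params(hidden, ffn)       # gate_proj
    + linear_params(hidden, ffn)     # up_proj
    + linear_params(ffn, hidden)     # down_proj
)
norms_per_layer = 2 * hidden         # input_layernorm + post_attention_layernorm
per_layer = attention + mlp + norms_per_layer

embedding = vocab * hidden
final_norm = hidden
total = embedding + n_layers * per_layer + final_norm

print(f"per layer: {per_layer:,}")   # per layer: 60,821,504
print(f"total:     {total:,}")       # total:     1,235,814,400
```

The k_proj and v_proj outputs (512) are a quarter of the hidden size (2048), which is consistent with grouped-query attention, where several query heads share one key/value head.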


## 3. Understand activation functions like GELU, dropout, softmax and optimizer like adam. Explain forward and backward propagation.

- GELU (Gaussian Error Linear Unit) is a smooth activation function that adds non-linearity, which lets the model learn complex functions.
- Dropout reduces overfitting by randomly deactivating some neurons during training. It is a regularization technique rather than an activation function.
- Softmax converts raw scores (logits) into probabilities that sum to 1. It is used in the final output layer and also inside attention to normalize the attention scores.
- Adam is an optimizer that updates the model's parameters during training using running averages of the gradient and of its square, which improves convergence speed and stability.

- Forward Propagation: In this step, the input tokens are converted into embeddings, passed through self-attention layers, and processed by feed-forward layers with activation functions like GELU. Dropout regularizes the model and layer normalization stabilizes training, while softmax turns the final logits into probabilities for the predictions.

- Backward Propagation: During this step, the gradients of the loss function with respect to each parameter are calculated with the chain rule, allowing the optimizer (Adam) to update the model's weights. This forward and backward cycle repeats for every batch during training, helping the model learn patterns from the data.
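
The forward/backward/update cycle can be sketched on a one-parameter model. This is a minimal illustration, assuming a toy dataset generated by y = 2x, a mean squared error loss, and a hand-written Adam step with common default values for beta1, beta2, and eps. A Transformer runs the same loop, only with many more parameters and automatic differentiation.

```python
import math

def forward(w, xs, ys):
    """Forward pass: predictions and mean squared error loss."""
    preds = [w * x for x in xs]
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    return preds, loss

def backward(preds, xs, ys):
    """Backward pass: dLoss/dw obtained with the chain rule."""
    return sum(2 * (p - y) * x for p, x, y in zip(preds, xs, ys)) / len(xs)

def adam_step(w, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update using bias-corrected running averages."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad         # average gradient
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad  # average squared gradient
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # toy data generated by y = 2x
w = 0.0
state = {"t": 0, "m": 0.0, "v": 0.0}
losses = []
for _ in range(200):
    preds, loss = forward(w, xs, ys)   # forward propagation
    losses.append(loss)
    grad = backward(preds, xs, ys)     # backward propagation
    w = adam_step(w, grad, state)      # optimizer update

print(f"w = {w:.3f}, first loss = {losses[0]:.3f}, last loss = {losses[-1]:.6f}")
```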



### 4. Gradient Descent in Transformers

- Gradient descent minimizes the loss function by repeatedly moving each parameter a small step against its gradient: $$ \theta \leftarrow \theta - \eta \nabla_\theta L $$ where $\eta$ is the learning rate. Variants such as SGD, mini-batch gradient descent, and Adam differ in how much data each step uses and in how the step size is adapted, so tuning the learning rate is essential for training models effectively.

- Cross-Entropy Loss: Typically used for classification tasks, such as predicting the next word in a sequence.
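
A minimal sketch of cross-entropy with plain gradient descent, assuming a toy vocabulary of four tokens. To keep it short, the logits themselves are treated as the trainable parameters, which is a simplification: in a real model the gradient flows further back into the network weights.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct token."""
    return -math.log(softmax(logits)[target])

def grad_wrt_logits(logits, target):
    """Gradient of the cross-entropy loss: softmax(logits) - one_hot(target)."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

logits = [0.0, 0.0, 0.0, 0.0]  # uniform start over a 4-token vocabulary
target = 2                     # index of the correct next token
learning_rate = 0.5
losses = []
for _ in range(20):
    losses.append(cross_entropy(logits, target))
    grads = grad_wrt_logits(logits, target)
    logits = [z - learning_rate * g for z, g in zip(logits, grads)]

print(f"first loss = {losses[0]:.4f}, last loss = {losses[-1]:.4f}")
```

With uniform logits the first loss equals log(4), the loss of a random guess among four tokens. Each step then raises the logit of the correct token and lowers the others.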



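### 8. QLoRA
- Quantization reduces the numerical precision of the data type used to store values such as weights, which lowers memory use and can speed up computation.
- Commonly used precisions are float16, float8, int8, or even 1-bit binary (0/1) values.
- QLoRA loads the base model in 4-bit precision and trains LoRA adapters on top of the frozen quantized weights.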
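
To make the idea concrete, here is a minimal sketch of absmax int8 quantization on a handful of toy weights. This is not the NF4 scheme used by QLoRA; it is a simpler method chosen only to show the round trip from floats to small integers and back.

```python
def quantize_int8(values):
    """Absmax quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 1.27, -1.0]  # toy values, chosen for illustration
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized)
print(f"scale = {scale:.4f}, max round-trip error = {max_error:.6f}")
```

Each value now needs one byte instead of the four used by float32, at the cost of a rounding error of at most half the scale.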

## Task
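3) Implement it simply in Python using a very simple dataset and explain how it works
4) Collect food-related data (wiki/reddit/quora, kaggle, etc.; the more the better: menus, restaurant introductions, restaurant descriptions, and so on, in very large quantity). At least 1 million menus and at least 1,000 restaurants. Create Yelp-style or food review texts
5) Build Q&A-style data from the collected dataset and fine-tune on it (simply) ( E
6) Understand prompt engineering, write question sets per role, and implement the chatbot feature ( B
7) Apply onnx/optimum[onnxruntime] so that it also runs fast on CPU/local ( [E]/B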



### List of special tokens

1. [PAD]:
- Purpose: Padding token used to make sequences the same length for batch processing.
- Usage: It fills in sequences that are shorter than the maximum length, ensuring uniform input size.

2. [CLS]:
- Purpose: Classification token.
- Usage: Added at the beginning of sequences in models like BERT. The final hidden state corresponding to this token is typically used for classification tasks.

3. [SEP]:
- Purpose: Separator token.
- Usage: Used to separate different segments within the same input, such as two sentences in a question-answering task or different contexts in a conversation.

4. [UNK]:
- Purpose: Unknown token.
- Usage: Represents any word that is not in the model's vocabulary. If an input token cannot be mapped to a known word, it is replaced with this token.

5. [MASK]:
- Purpose: Mask token.
- Usage: Used in tasks like masked language modeling, where certain tokens in the input are replaced with this token, and the model learns to predict them.

6. [INST] and [/INST]:
- Purpose: Instruction tokens.
- Usage: Used in instruction-following models to delineate commands or prompts, helping the model understand when a new instruction starts and ends.

7. [BOS]:
- Purpose: Beginning-of-sequence token.
- Usage: Indicates the start of a sequence, often used in generative models.

8. [EOS]:
- Purpose: End-of-sequence token.
- Usage: Indicates the end of a sequence. This is particularly important in tasks like text generation where the model needs to know when to stop generating output.
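
The sketch below shows how several of these tokens appear when a batch is encoded. The vocabulary, the whitespace splitting, and the `encode_batch` helper are hypothetical and exist only for illustration. A real tokenizer works on subwords and uses its own token IDs.

```python
def encode_batch(sentences, vocab):
    """Wrap each sentence in [CLS] ... [SEP], map unknown words to [UNK],
    then pad every sequence to the longest one in the batch."""
    unk, pad = vocab["[UNK]"], vocab["[PAD]"]
    encoded = [
        [vocab["[CLS]"]] + [vocab.get(word, unk) for word in s.split()] + [vocab["[SEP]"]]
        for s in sentences
    ]
    width = max(len(ids) for ids in encoded)
    input_ids = [ids + [pad] * (width - len(ids)) for ids in encoded]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in encoded]
    return input_ids, attention_mask

toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "pizza": 4, "is": 5, "good": 6}
input_ids, attention_mask = encode_batch(["pizza is good", "sushi"], toy_vocab)
print(input_ids)       # [[2, 4, 5, 6, 3], [2, 1, 3, 0, 0]]
print(attention_mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

The word "sushi" is not in the toy vocabulary, so it becomes [UNK], and the shorter sequence is filled with [PAD] up to the batch width.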

In [None]:
import torch
import transformers
from transformers import (
    AutoTokenizer,
    LlamaForSequenceClassification,
    LlamaTokenizer,
    TrainingArguments,
    pipeline,
)
model_id = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)



In [None]:
from huggingface_hub import notebook_login,login

### 1. Load Pretrained Model 

In [None]:
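import os

# Read the token from an environment variable instead of hardcoding it in the notebook.
# HF_TOKEN is an assumed variable name; set it (or use a secret store) before running this cell.
login(token=os.environ["HF_TOKEN"])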

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
import torch
import transformers
from transformers import (
    AutoTokenizer,
    LlamaForSequenceClassification,
    LlamaTokenizer,
    TrainingArguments,
    pipeline,
)
model_id = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.pad_token = tokenizer.eos_token

### 2. Data Loading

In [None]:
from datasets import load_dataset

# Alternative instruction dataset (not used here)
# guanaco_dataset = "mlabonne/guanaco-llama2-1k"
# dataset = load_dataset(guanaco_dataset, split="train")

# Load the AG News dataset
dataset = load_dataset("ag_news")

# Sample only 1000 records from the training set
sampled_dataset = dataset['train'].shuffle(seed=42).select(range(1000))

# Check the first example in the sampled dataset
print(sampled_dataset[0])



{'text': 'Bangladesh paralysed by strikes Opposition activists have brought many towns and cities in Bangladesh to a halt, the day after 18 people died in explosions at a political rally.', 'label': 0}


In [None]:
def limit_text_length(example):
    # Split the text into words and take the first 10 words
    example['text'] = ' '.join(example['text'].split()[:10])
    return example

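# Apply the function to every split of the full dataset.
# Note: limited_dataset is not passed to the trainer below, which uses sampled_dataset.
limited_dataset = dataset.map(limit_text_length)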

Map:   0%|          | 0/120000 [00:00<?, ? examples/s]

Map:   0%|          | 0/7600 [00:00<?, ? examples/s]

In [None]:
sampled_dataset['text']

['Bangladesh paralysed by strikes Opposition activists have brought many towns and cities in Bangladesh to a halt, the day after 18 people died in explosions at a political rally.',
 'Desiring Stability Redskins coach Joe Gibbs expects few major personnel changes in the offseason and wants to instill a culture of stability in Washington.',
 'Will Putin #39;s Power Play Make Russia Safer? Outwardly, Russia has not changed since the barrage of terrorist attacks that culminated in the school massacre in Beslan on Sept.',
 'U2 pitches for Apple New iTunes ads airing during baseball games Tuesday will feature the advertising-shy Irish rockers.',
 'S African TV in beheading blunder Public broadcaster SABC apologises after news bulletin shows footage of American beheaded in Iraq.',
 'A Cosmic Storm: When Galaxy Clusters Collide Astronomers have found what they are calling the perfect cosmic storm, a galaxy cluster pile-up so powerful its energy output is second only to the Big Bang.',
 'West 

### 3. Low Rank Adaptation (LoRA) Fine tuning for Sequence Classification

#### 3.1 Prepare sample dataset and split into train/test


#### 3.2 PEFT (Parameter-Efficient Fine-Tuning): LoRA and QLoRA
- Faster and less resource-intensive than full fine-tuning.
- It trains only a small set of additional parameters while the original LLM weights stay frozen, which speeds up training and reduces memory demands.
- $$ W = W_0 + BA $$ where $W_0$ is the original (frozen) weight matrix and $BA$ is the low-rank update. Only B and A are trainable, and together with $W_0$ they produce the adapted W.

- Fine tuning procedure reference : https://github.com/adidror005/youtube-videos/blob/main/LLAMA_3_Fine_Tuning_for_Sequence_Classification_Actual_Video.ipynb
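
The low-rank update can be checked by hand on tiny matrices. The sketch below uses arbitrary toy values with rank r = 1 and shows that applying the adapter as a separate branch, W_0 x + B(Ax), gives the same result as multiplying by the merged matrix W_0 + BA. It leaves out the scaling factor that LoRA implementations usually apply to the update.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

w0 = [[1.0, 2.0, 0.0],
      [0.0, 1.0, 3.0]]         # frozen weight, shape (2, 3)
b = [[0.5], [-1.0]]            # trainable, shape (2, r) with r = 1
a = [[1.0, 0.0, 2.0]]          # trainable, shape (r, 3)
x = [1.0, 1.0, 1.0]

# Merged form: build W = W0 + BA once, then apply it
merged = matvec(add(w0, matmul(b, a)), x)

# Adapter form: keep W0 untouched and add the low-rank branch
branch = matvec(b, matvec(a, x))
adapter = [h + d for h, d in zip(matvec(w0, x), branch)]

print(merged, adapter)  # [4.5, 1.0] [4.5, 1.0]
```

Here W_0 holds 6 values while B and A together hold 5, so nothing is saved at this size. The saving appears when the matrix dimensions are much larger than r.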

In [None]:
from datasets import Dataset, DatasetDict

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)

In [None]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit = True, # enable 4-bit quantization
    bnb_4bit_quant_type = 'nf4', # information theoretically optimal dtype for normally distributed weights
    bnb_4bit_use_double_quant = True, # quantize quantized weights //insert xzibit meme
    bnb_4bit_compute_dtype = torch.bfloat16 # optimized fp format for ML
)


In [None]:
from transformers import AutoModelForCausalLM,AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map={"": 0}
)
model.config.use_cache = False
model.config.pretraining_tp = 1

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [None]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

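1. LoraConfig
* Hyperparameters
- r is the rank of the low-rank matrices, i.e. the inner dimension shared by A and B
- target_modules: the modules into which the new LoRA layers are injected. q_proj, k_proj, and v_proj are the query, key, and value projections of the attention layer, and o_proj is the attention output projection.
- task_type: the task the adapter is used for, e.g. SEQ_CLS (sequence classification), TOKEN_CLS (token classification), QUESTION_ANS (question answering), CAUSAL_LM (text generation), SEQ_2_SEQ_LM (sequence-to-sequence), FEATURE_EXTRACTION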

Reference : https://huggingface.co/docs/peft/en/package_reference/lora
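
A back-of-the-envelope sketch of what these hyperparameters imply for this notebook. It takes r = 16 and lora_alpha = 8 from the config cell, the projection shapes from the Llama layer listing in section 2.3, and 16 decoder layers. It assumes the standard LoRA scaling of lora_alpha / r and that every listed projection in every layer receives an adapter.

```python
def lora_params(in_features, out_features, r):
    """A is (r, in_features) and B is (out_features, r)."""
    return r * (in_features + out_features)

r, lora_alpha, n_layers = 16, 8, 16
target_shapes = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 512),
    "v_proj": (2048, 512),
    "o_proj": (2048, 2048),
}

per_layer = sum(lora_params(i, o, r) for i, o in target_shapes.values())
trainable = n_layers * per_layer
scaling = lora_alpha / r  # factor applied to the low-rank update

print(f"trainable LoRA parameters: {trainable:,}")  # trainable LoRA parameters: 3,407,872
print(f"scaling factor: {scaling}")                 # scaling factor: 0.5
```

Under these assumptions, only about 3.4 million parameters are trained, a small fraction of the full model.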

In [None]:
lora_config = LoraConfig(
    r = 16, # the dimension of the low-rank matrices
    lora_alpha = 8, # scaling factor for LoRA activations vs pre-trained weight activations
    target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj'],
    lora_dropout = 0.05, # dropout probability of the LoRA layers
    bias = 'none', # whether to train bias weights, set to 'none' for attention layers
    task_type = 'CAUSAL_LM' #SEQ_CLS
)

In [None]:
model.config.pad_token_id = tokenizer.pad_token_id
model.config.use_cache = False
model.config.pretraining_tp = 1

### 4. Classification Prediction For Testing

In [None]:
training_params = TrainingArguments(
    output_dir="./results_exp",
    num_train_epochs=200,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)

In [None]:
from trl import SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=sampled_dataset,
    peft_config=lora_config,
    dataset_text_field="text",
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_params,
    packing=False,
)



Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
trainer.train()

Step,Training Loss
25,3.5253
50,3.6221
75,3.2039
100,3.4796
125,3.0776
150,3.4072
175,3.0635
200,3.3017
225,2.9941
250,3.3201


TrainOutput(global_step=50000, training_loss=0.3155220663881302, metrics={'train_runtime': 21254.1714, 'train_samples_per_second': 9.41, 'train_steps_per_second': 2.352, 'total_flos': 6.226696481852621e+16, 'train_loss': 0.3155220663881302, 'epoch': 200.0})
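
The reported figures can be cross-checked against the training arguments. The sketch below assumes a single device, so that one optimizer step consumes per_device_train_batch_size x gradient_accumulation_steps samples.

```python
import math

num_samples = 1000   # size of sampled_dataset
batch_size = 4       # per_device_train_batch_size
grad_accum = 1       # gradient_accumulation_steps
epochs = 200         # num_train_epochs
train_runtime = 21254.1714  # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_samples / (batch_size * grad_accum))
global_steps = steps_per_epoch * epochs
samples_per_second = round(num_samples * epochs / train_runtime, 2)
steps_per_second = round(global_steps / train_runtime, 3)

print(steps_per_epoch, global_steps)         # 250 50000
print(samples_per_second, steps_per_second)  # 9.41 2.352
```

These values match global_step, train_samples_per_second, and train_steps_per_second in the output above. The runtime of 21254 seconds is roughly 5.9 hours.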