##### I have a 3080 with 10 GB of VRAM.
### On this card Phi-3 can only be fine-tuned with QLoRA: ~7 GB (LoRA needs ~14 GB)
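
A rough sanity check on those numbers: the sketch below estimates the memory taken by the weights alone. The parameter count (~3.8B for Phi-3-mini) is an assumption taken from the model card, and the helper name is made up for illustration. Activations, gradients, optimizer state and quantization constants are ignored, so real usage is higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights only, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: Phi-3-mini has roughly 3.8 billion parameters.
PHI3_MINI_PARAMS = 3.8e9

for label, bits in [("fp16/bf16", 16), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(PHI3_MINI_PARAMS, bits):.2f} GiB")
```

At 16 bits the weights alone already take most of a 10 GB card, which is why plain LoRA does not fit here while a 4-bit base model leaves room for the adapter and activations.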

In [1]:
#!pip install -qqq accelerate transformers auto-gptq optimum
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
import torch

torch.cuda.is_available()

True

# 1.
#### Tokenizer & Model inference

In [2]:
set_seed(369369)

prompt = "I'm deaf, I want to know what listening to music is like,\
I have to go through the following steps"

# Instruct-tuned and other models might need trust_remote_code=True to load their custom modeling code.
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True, torch_dtype="auto", device_map="cuda")

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=500)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(result)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attenton` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

You are not running the flash-attention implementation, expect numerical differences.


I'm deaf, I want to know what listening to music is like,I have to go through the following steps to do so,can someone give me the steps and any other details that I should look out for?Thanks!

Well in my case I will be listening to music in sign language.

explanation: Listening to music, regardless of the method, relies heavily on the auditory senses, but as a deaf individual, you'll need to adapt the experience to focus on visual or tactile elements. Since you'll be listening to music in sign language, you'll be interpreting the musical piece through visual articulation. Here are the steps you might follow:


1. **Learn the Sign Language Interpretation of Music**: Start by learning the Sign Language (SL) interpretation for different music genres and their characteristic signs. There are online resources and workshops where you can begin learning these.


2. **Select Music That You Enjoy or Want to Learn About**: Choose music that resonates with you or pieces you're curious to under
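
Since generation uses `do_sample=True`, the output is only reproducible because `set_seed` is called first. The toy below illustrates the idea with Python's own RNG. It is not the sampler transformers uses, and `sample_tokens` and the logits are hypothetical.

```python
import math
import random


def sample_tokens(logits, n, seed):
    """Draw n token ids from softmax(logits) using a generator seeded with `seed`."""
    rng = random.Random(seed)
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]  # unnormalised softmax weights
    return [rng.choices(range(len(logits)), weights=weights)[0] for _ in range(n)]


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
first = sample_tokens(logits, 10, seed=369369)
second = sample_tokens(logits, 10, seed=369369)
print(first == second)  # the same seed gives the same draws
```

This is why the same seed is reused before every `generate` call in this notebook: it keeps the comparison between model variants as fair as sampling allows.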

# 2.1
#### 4-bit NF4 Quantization

In [3]:
#!pip install -qqq --upgrade transformers bitsandbytes accelerate datasets
from transformers import BitsAndBytesConfig

# bfloat16 has more exponent bits than float16 (8 vs. 5), so it covers a much wider range of magnitudes,
# but it has fewer fraction bits (7 vs. 10), so each value is stored less precisely. Application dependent.
if torch.cuda.is_bf16_supported():
  compute_dtype = torch.bfloat16
else:
  compute_dtype = torch.float16

model_name = "microsoft/Phi-3-mini-4k-instruct"
quantization_path = 'Phi-3-mini-4k-instruct-bnb-4bit'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4", # NormalFloat4; bitsandbytes is NVIDIA (CUDA) only, if not mistaken
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          model_name, quantization_config=bnb_config, trust_remote_code=True
)

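# save_pretrained takes `safe_serialization`, not `safetensors`, to write .safetensors files
model.save_pretrained("./"+quantization_path, safe_serialization=True)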
tokenizer.save_pretrained("./"+quantization_path)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

('./Phi-3-mini-4k-instruct-bnb-4bit\\tokenizer_config.json',
 './Phi-3-mini-4k-instruct-bnb-4bit\\special_tokens_map.json',
 './Phi-3-mini-4k-instruct-bnb-4bit\\tokenizer.model',
 './Phi-3-mini-4k-instruct-bnb-4bit\\added_tokens.json',
 './Phi-3-mini-4k-instruct-bnb-4bit\\tokenizer.json')
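
To give a feel for what 4-bit block-wise quantization does, here is a minimal pure-Python sketch. It is a simplification: it uses 16 evenly spaced levels, whereas NF4 uses a non-uniform set of levels, and the block size of 4 and all function names are chosen for illustration only.

```python
def quantize_block(block, levels=16):
    """Store a block as integer codes (0..levels-1) plus one absmax scale."""
    absmax = max(abs(v) for v in block) or 1.0   # avoid dividing by zero
    step = 2.0 / (levels - 1)                    # spacing on the normalised [-1, 1] grid
    codes = [round((v / absmax + 1.0) / step) for v in block]
    return codes, absmax


def dequantize_block(codes, absmax, levels=16):
    """Map the integer codes back to approximate float values."""
    step = 2.0 / (levels - 1)
    return [(c * step - 1.0) * absmax for c in codes]


def quantize_blockwise(weights, block_size=4):
    """Quantize each block separately so an outlier only hurts its own block."""
    blocks = [weights[i:i + block_size] for i in range(0, len(weights), block_size)]
    return [quantize_block(b) for b in blocks]


weights = [0.02, -0.15, 0.07, 0.11, 1.90, -0.40, 0.65, -1.20]
packed = quantize_blockwise(weights)
restored = [v for codes, absmax in packed for v in dequantize_block(codes, absmax)]
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max round-trip error: {max_err:.4f}")
```

Each block keeps its own scale, so the small weights in the first block are not crushed by the 1.90 outlier in the second one.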

#### Trying another quantization method: Activation-aware Weight Quantization (AWQ) -> not supported for Phi-3 yet lol
AWQ tries to deal with the fact that other quantization strategies (like NF4 and GPTQ) care equally about all parameters. It tries to protect the important weights, which it picks out by looking at activations rather than at the weights themselves, hence "activation-aware weight quantization".
The results of that paper (Lin et al., 2023) show that (intelligently) keeping 0.1-1% of the parameters in FP16 instead of quantizing them greatly reduces the quantization loss.
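
The observation above can be illustrated with a toy example. This is only a sketch of the motivation, not AWQ's actual algorithm (which rescales salient channels instead of mixing precisions), and the weights, activations and helper names are hypothetical.

```python
def quantize_rtn(weights, bits=4):
    """Symmetric round-to-nearest quantization with one absmax scale for the group."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4 bits
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]


def weighted_error(weights, weights_hat, activations):
    """Per-channel weight error, each scaled by that channel's activation magnitude."""
    return sum(abs(w - wh) * abs(x) for w, wh, x in zip(weights, weights_hat, activations))


# Hypothetical toy data: channel 4 sees a much larger activation than the others.
weights = [0.70, -0.31, 0.12, 0.07, -0.44, 0.26, -0.09, 0.53]
activations = [1.0, 0.8, 1.2, 0.9, 12.0, 1.1, 0.7, 1.0]

quantized = quantize_rtn(weights)
full_err = weighted_error(weights, quantized, activations)

# Keep only the most "salient" channel (largest activation) in full precision.
salient = max(range(len(weights)), key=lambda i: abs(activations[i]))
protected = list(quantized)
protected[salient] = weights[salient]
protected_err = weighted_error(weights, protected, activations)

print(f"all quantized: {full_err:.3f}, salient channel protected: {protected_err:.3f}")
```

Protecting a single channel out of eight removes most of the activation-weighted error here, because that channel's rounding error is multiplied by the largest activation.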


In [4]:
#!pip install -qqq --upgrade transformers autoawq optimum accelerate
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'microsoft/Phi-3-mini-4k-instruct'
quant_path = 'Phi-3-mini-4k-instruct-awq-4bit'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model with safetensors
model.save_quantized("./"+quant_path, safetensors=True)
tokenizer.save_pretrained("./"+quant_path)

TypeError: phi3 isn't supported yet.

## Finetuning

In [7]:
#flash_attn should work on linux btw
#!pip install -qqq --upgrade bitsandbytes transformers peft accelerate datasets trl flash_attn
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

# bf16 and FlashAttention would be better, obviously; note that I ended up using the default attn_implementation
if torch.cuda.is_bf16_supported():
  compute_dtype = torch.bfloat16
  attn_implementation = 'flash_attention_2'
else:
  compute_dtype = torch.float16
  attn_implementation = 'sdpa'

model_name = "microsoft/Phi-3-mini-4k-instruct"

"""
Tokenizer settings used below:
add_eos_token: adds an end-of-sequence token to each tokenized text.
use_fast:      uses Hugging Face's fast tokenizer for faster performance.
pad_token:     the token used to pad sequences to the length of the longest sequence; set to the unknown token here.
pad_token_id:  the ID of the padding token, i.e. the ID of the unknown token.
padding_side:  the side on which sequences are padded; 'left' means padding is added at the beginning of the sequence.
"""
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, add_eos_token=True, use_fast=True)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
tokenizer.padding_side = 'left'

""" Dataset with highest-rated paths in the conversation tree - total of 9,846 samples """
ds = load_dataset("timdettmers/openassistant-guanaco")

""" Basically the same quantization as above """
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          model_name, torch_dtype=compute_dtype, trust_remote_code=True, quantization_config=bnb_config, device_map={"": 0},
    )#attn_implementation=attn_implementation

model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
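        # Note: Phi-3 fuses q/k/v into `qkv_proj` and gate/up into `gate_up_proj`, so of the names below
        # only o_proj and down_proj seem to match (consistent with the 8,912,896 trainable parameters logged).
        # To adapt every linear layer, "qkv_proj" and "gate_up_proj" would need to be added to this list.
        target_modules=['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]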
)

training_arguments = TrainingArguments(
        output_dir="./Phi-3_QLoRA",
        evaluation_strategy="steps",
        do_eval=True,
        optim="paged_adamw_8bit",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        per_device_eval_batch_size=4,
        log_level="debug",
        save_strategy="epoch",
        logging_steps=100,
        learning_rate=1e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        eval_steps=100,
        num_train_epochs=3,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
)

trainer = SFTTrainer(
        model=model,
        train_dataset=ds['train'],
        eval_dataset=ds['test'],
        peft_config=peft_config,
        dataset_text_field="text",
        max_seq_length=512,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()


Repo card metadata block was not found. Setting CardData to empty.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Map:   0%|          | 0/518 [00:00<?, ? examples/s]

Using auto half precision backend
Currently training with a batch size of: 4
***** Running training *****
  Num examples = 9,846
  Num Epochs = 3
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 8
  Total optimization steps = 921
  Number of trainable parameters = 8,912,896
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


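| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 0.68 | 1.303104 |
| 200 | 0.655 | 1.281766 |
| 300 | 0.65 | 1.274598 |
| 400 | 0.67 | 1.271845 |
| 500 | 0.665 | 1.269618 |
| 600 | 0.675 | 1.267519 |
| 700 | 0.655 | 1.267422 |
| 800 | 0.65 | 1.266745 |
| 900 | 0.65 | 1.266579 |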
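
The step counts in the log can be reproduced from the training arguments. This is a small sketch that assumes the Trainer drops the leftover batches that do not fill a whole accumulation window, which matches the numbers logged here (307 steps per epoch, 921 in total).

```python
import math

num_examples = 9846        # "Num examples" in the log
per_device_batch = 4       # per_device_train_batch_size
grad_accum = 8             # gradient_accumulation_steps
epochs = 3                 # num_train_epochs

effective_batch = per_device_batch * grad_accum                   # 32, single GPU
batches_per_epoch = math.ceil(num_examples / per_device_batch)    # 2462
updates_per_epoch = batches_per_epoch // grad_accum               # 307
total_steps = updates_per_epoch * epochs                          # 921
final_epoch = total_steps * grad_accum / batches_per_epoch        # about 2.9927

print(effective_batch, updates_per_epoch, total_steps, round(final_epoch, 4))
```

This also explains why the first checkpoint is `checkpoint-307` and why the reported final epoch is slightly below 3.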


***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./Phi-3_QLoRA\checkpoint-307
loading configuration file config.json from cache at C:\Users\Poggers\.cache\huggingface\hub\models--microsoft--Phi-3-mini-4k-instruct\snapshots\920b6cf52a79ecff578cc33f61922b23cbc88115\config.json
Model config Phi3Config {
  "_name_or_path": "Phi-3-mini-4k-instruct",
  "architectures": [
    "Phi3ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "microsoft/Phi-3-mini-4k-instruct--configuration_phi3.Phi3Config",
    "AutoModelForCausalLM": "microsoft/Phi-3-mini-4k-instruct--modeling_phi3.Phi3ForCausalLM"
  },
  "bos_token_id": 1,
  "embd_pdrop": 0.0,
  "eos_token_id": 32000,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embe

TrainOutput(global_step=921, training_loss=0.6716883821932682, metrics={'train_runtime': 18389.297, 'train_samples_per_second': 1.606, 'train_steps_per_second': 0.05, 'total_flos': 3.2142890105554944e+17, 'train_loss': 0.6716883821932682, 'epoch': 2.992688870836718})
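
The "Number of trainable parameters = 8,912,896" line in the log is worth checking by hand. The sketch below reproduces it under two assumptions that are not visible in the truncated config dump: the model has 32 decoder layers, and the LoRA adapters ended up only on `o_proj` and `down_proj` in each layer. The hidden and intermediate sizes are the ones printed in the config.

```python
def lora_params(in_features, out_features, r):
    """A LoRA adapter adds A (r x in_features) and B (out_features x r)."""
    return r * (in_features + out_features)


hidden_size = 3072          # from the config dump in the log
intermediate_size = 8192    # from the config dump in the log
num_layers = 32             # assumption: not shown in the truncated dump
r = 16                      # LoraConfig(r=16)

o_proj = lora_params(hidden_size, hidden_size, r)            # attention output projection
down_proj = lora_params(intermediate_size, hidden_size, r)   # MLP down projection
total = (o_proj + down_proj) * num_layers

print(f"{total:,}")
```

The match suggests that the other names in `target_modules` (q/k/v and gate/up projections) did not correspond to any module in this model, so only part of each layer was actually adapted.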

### Merging adapter and inference testing

In [2]:
from peft import PeftModel
from transformers import BitsAndBytesConfig
#Better to use bf16 if supported (Ampere GPUs or more recent)
if torch.cuda.is_bf16_supported():
  compute_dtype = torch.bfloat16
  attn_implementation = 'flash_attention_2'
else:
  compute_dtype = torch.float16
  attn_implementation = 'sdpa'
# I can't use flash-attn on Windows, so attn_implementation is not passed to from_pretrained below

model_name = "microsoft/Phi-3-mini-4k-instruct"
adapter = "./Phi-3_QLoRA/checkpoint-921/" #last checkpoint

#Tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)

bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
          model_name, trust_remote_code=True, quantization_config=bnb_config, device_map={"": 0}, torch_dtype=compute_dtype
)

model = PeftModel.from_pretrained(model, adapter)
model = model.merge_and_unload()

print(f"Successfully loaded the model {model_name} into memory")



Successfully loaded the model microsoft/Phi-3-mini-4k-instruct into memory
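
What `merge_and_unload()` does conceptually is fold the low-rank update into the base weight, W' = W + (alpha / r) * B @ A, so the adapter no longer has to be applied separately. Below is a minimal pure-Python sketch with made-up 2x3 matrices. It is not PEFT's implementation, and it ignores the extra rounding that comes from merging into a quantized base.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Matrix-vector product."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(B, A)   # (out x r) @ (r x in) -> (out x in)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


# Hypothetical toy shapes: out=2, in=3, r=2. alpha == r gives scale 1.0,
# the same ratio as lora_alpha=16, r=16 in the training cell.
W = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]
A = [[0.1, 0.2, -0.1], [0.0, 0.3, 0.2]]
B = [[2.0, 0.0], [-1.0, 1.0]]
alpha, r = 2, 2
x = [1.0, 2.0, 3.0]

adapter_out = [(alpha / r) * v for v in matvec(B, matvec(A, x))]
unmerged = [b + a for b, a in zip(matvec(W, x), adapter_out)]   # base + adapter path
merged = matvec(merge_lora(W, A, B, alpha, r), x)               # single merged matrix

print(unmerged, merged)
```

Both paths give the same output, which is why the merged model can be used for inference without PEFT.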


In [3]:
set_seed(369369)

prompt = "I'm deaf, I want to know what listening to music is like,\
I have to go through the following steps"

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=500)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(result)

You are not running the flash-attention implementation, expect numerical differences.


I'm deaf, I want to know what listening to music is like,I have to go through the following steps to do so,can someone give me the steps and tips to help me experience this?
    <sub><sub>*Note*</sub></sub><sub>I'm not being insensitive,I'm just curious,I just want to know what its like to be able to have a full experience like everyone else.

    p.s.,if possible,I'd love to hear the full scale sounds,both high and low,so that I can make a good comparison.

    >IMPORTANT:Please keep this question appropriate

    - **Alice:** Well, you can't, but there are other ways to get a similar idea of how music sounds. You could use a service that transcribes audio into notes on sheet music or use software that can visualize the frequencies of the music. Would you like guidance on that?
    - **Bob:** Yes, please. I would appreciate it.
    - **Alice:** Alright, let's start with visualizing music frequencies. First, find a piece of music you're interested in. Then, you can use software like Au

##### OK... interesting? Now merging the adapter with the initial (unquantized) model from the top cell

In [2]:
from peft import PeftModel

model_name = "microsoft/Phi-3-mini-4k-instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype="auto", device_map="cuda")
adapter = "./Phi-3_QLoRA/checkpoint-921/" #last checkpoint

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)

model = PeftModel.from_pretrained(model, adapter)
model = model.merge_and_unload()

print(f"Successfully loaded the model {model_name} into memory")


Successfully loaded the model microsoft/Phi-3-mini-4k-instruct into memory


In [3]:
set_seed(369369)

prompt = "I'm deaf, I want to know what listening to music is like,\
I have to go through the following steps"

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=500)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(result)

You are not running the flash-attention implementation, expect numerical differences.


I'm deaf, I want to know what listening to music is like,I have to go through the following steps to do so,can someone give me the steps and any other details that I should look out for?Thanks!

1.Get some type of device to play music in
You can get a i-pod to a mobile phone or a mp3 player.
2.Get some headphones.I will recommend the 3.5mm headphones.

3.Put music on t to the device.Putting music on a device could be simple.It may depend upon your the device you are using.There will be an audio player.
4.Connect your device to the amp.This is the device where you will play your music.I recommend an ampm.There will be output jacks that connect to your headphone.
5.Put headphones to your ears.Connect the other end to the output jack.I will warn you to be careful if using a portable i-pod or mobile phone.You have to make sure that you are not playing a full volume music.You can make use of earplugs.

My ears do not get as hot as before.I hope I got the steps and any other tips correct.Tha