In [None]:
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Using a QLoRA Fine-Tuned Model for Inference

In the previous notebook, we learned how to fine-tune a model using QLoRA, which produced a directory containing the LoRA adapter weights. The adapter is typically much smaller than the full model, so it can be stored locally while the base model lives somewhere else. On demand, the base model can be downloaded and the adapter applied to it to obtain the fine-tuned model.
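
To make the size difference concrete, here is a rough back-of-the-envelope sketch. LoRA represents the update to a weight matrix of shape `d_out x d_in` with two low-rank factors of rank `r`, so each adapted matrix adds `r * (d_in + d_out)` parameters. The hidden size, layer count, rank, and choice of adapted matrices below are illustrative assumptions, not values taken from the previous notebook.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Illustrative assumptions (not taken from the previous notebook):
hidden_size = 4096      # width of each attention projection
num_layers = 32         # number of transformer blocks
adapted_per_layer = 2   # e.g. the query and value projections
rank = 8                # LoRA rank r

adapter_params = num_layers * adapted_per_layer * lora_param_count(
    hidden_size, hidden_size, rank
)
full_matrix_params = num_layers * adapted_per_layer * hidden_size * hidden_size

print(f"adapter parameters:      {adapter_params:,}")
print(f"same matrices, full:     {full_matrix_params:,}")
print(f"adapter / full matrices: {adapter_params / full_matrix_params:.4%}")
```

Under these assumptions the adapter holds about 4.2 million parameters, which is roughly 8 MB when stored in 16-bit precision.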

This composable design keeps storage requirements low and makes it quick to switch to a different adapter on demand.
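
As a minimal sketch of switching adapters on demand: a loaded PEFT model keeps its adapters in a dictionary keyed by name (`peft_config`) and exposes `load_adapter` and `set_adapter`. The helper name `switch_adapter`, the adapter names, and the paths are hypothetical; `peft_model` is assumed to be an already loaded PEFT model, such as the one created in the next section.

```python
def switch_adapter(peft_model, adapter_path: str, adapter_name: str):
    """Activate `adapter_name`, loading its weights from `adapter_path` if needed.

    The base model weights stay in memory; only the small adapter is loaded.
    """
    if adapter_name not in peft_model.peft_config:
        peft_model.load_adapter(adapter_path, adapter_name=adapter_name)
    peft_model.set_adapter(adapter_name)
    return peft_model


# Hypothetical usage with placeholder names and paths:
# model = switch_adapter(model, "path-to-another-lora-adapter", "another_task")
# model = switch_adapter(model, "path-to-lora-adapter", "default")
```

Adapters that are already registered are only re-activated, so switching back to a previously loaded adapter does not read anything from disk.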

### Loading the model using `peft`

A QLoRA adapter can be loaded along with the base model by using the `AutoPeftModelForCausalLM` class and passing in the path to your fine-tuned LoRA weights.

In [None]:
lora_path = "path-to-lora-adapter"
model = AutoPeftModelForCausalLM.from_pretrained(
    lora_path,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    load_in_4bit=True,
)
tokenizer = AutoTokenizer.from_pretrained(lora_path)
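
`AutoPeftModelForCausalLM` can locate the base model because the adapter directory written by PEFT contains an `adapter_config.json` file whose `base_model_name_or_path` field records it. The sketch below checks which base model an adapter expects before triggering a download. The helper name `describe_adapter` is hypothetical, and all fields are read defensively in case a given adapter omits them.

```python
import json
from pathlib import Path


def describe_adapter(adapter_dir: str) -> dict:
    """Return a few fields from the adapter_config.json in `adapter_dir`."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return {
        "base_model": config.get("base_model_name_or_path"),
        "peft_type": config.get("peft_type"),
        "rank": config.get("r"),
        "target_modules": config.get("target_modules"),
    }


# Example (uncomment once `lora_path` points at a real adapter directory):
# print(describe_adapter(lora_path))
```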

### Text Generation Using the Fine-Tuned Model

In the previous example, we fine-tuned the model to respond to a specific prompt format.  

```
### Instruction:
Use the following Input and come up with a structured response.

### Input:
{instruction}

### Response:
```

The idea behind fine-tuning is that when we pass in an instruction in this format, the fine-tuned model knows how to complete it with an appropriate response. The following function formats an instruction into this prompt so it can be fed to the model, which then generates the completion after `### Response:`.

In [None]:
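def format_prompt(instruction):
    # Convenience function to format our prompt correctly for the bot.
    # The lines are joined explicitly so that no source-code indentation
    # leaks into the prompt and it matches the template shown above.
    return (
        "### Instruction:\n"
        "Use the following Input and come up with a structured response.\n"
        "\n"
        "### Input:\n"
        f"{instruction}\n"
        "\n"
        "### Response:\n"
    )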

To generate new responses, we will

1. Format the instruction into the fine-tuned prompt format
2. Tokenize the input using the model's tokenizer
3. Generate text


In [None]:
instruction = "Tell me the phases of the moon."
prompted_instruction = format_prompt(instruction)
input_ids = tokenizer(
    prompted_instruction,
    return_tensors="pt", 
    truncation=True).input_ids.cuda()

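with torch.inference_mode():
    outputs = model.generate(
        input_ids=input_ids,
        max_new_tokens=100,
        do_sample=True,
        top_p=0.9,
        temperature=0.9,
    )

# Decode only the newly generated tokens. Slicing by token count is more
# robust than slicing the decoded string by the prompt's character length.
generated_ids = outputs[0][input_ids.shape[1]:]
response = tokenizer.decode(generated_ids, skip_special_tokens=True)

print(f"Prompt:\n{instruction}\n")
print(f"Generated Response:\n{response}")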