# Introduction

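This notebook implements a simple multi-turn chat loop. Each new prompt is prefixed with the conversation so far, so the model can refer to earlier turns for as long as the history fits in its context window. Prompts follow the *chat template* format of the tokenizer (`<|im_start|>` and `<|im_end|>` markers).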
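
To make the prompt layout concrete, here is a minimal sketch that renders a list of messages in the same layout as the template string used in the chat loop of this notebook. The helper name `format_chatml` and the message-dict shape are assumptions for illustration. If the fine-tuned tokenizer ships its own chat template (not shown here), `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` would be the usual way to get an equivalent string.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render messages as '<|im_start|> role\\ncontent <|im_end|>' blocks.

    Hypothetical helper: it mirrors the hand-written template in this
    notebook, not an official tokenizer API.
    """
    parts = [
        f"<|im_start|> {m['role']}\n{m['content']} <|im_end|>"
        for m in messages
    ]
    prompt = "\n".join(parts)
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        prompt += "\n<|im_start|> assistant\n"
    return prompt


messages = [{"role": "user", "content": "How are you?"}]
print(format_chatml(messages))
```

Keeping the turns as a list of dicts makes it easy to drop the oldest turns later if the prompt grows past the context window.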

The model is Phi 1.5 fine-tuned on the ORPO dataset (https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k). The fine-tuning notebook is in the `assistant_sft` directory.

**NOTE: The notebook uses a customized streamer for text streaming.**

In [1]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    logging
)

from streaming_utils import TextStreamer
from peft import PeftModel

In [2]:
model = AutoModelForCausalLM.from_pretrained(
    'microsoft/phi-1_5',
    # load_in_4bit=True
)
tokenizer = AutoTokenizer.from_pretrained('../assistant_sft/outputs/phi_1_5_chat_alpaca_orpo/best_model/')

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [3]:
model.resize_token_embeddings(len(tokenizer))

Embedding(50297, 2048)

In [4]:
model = PeftModel.from_pretrained(
    model, 
    '../assistant_sft/outputs/phi_1_5_chat_alpaca_orpo/best_model/'
).to('cuda')

In [5]:
streamer = TextStreamer(
    tokenizer,
    skip_prompt=True,
    skip_special_tokens=True,
    # truncate_before_pattern=['\[\/', 'Goodbye'],
    truncate=True
)
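
The commented-out `truncate_before_pattern` argument suggests that the custom streamer can stop printing once a marker such as `[/` appears. The implementation in `streaming_utils` is not shown here, so the following is only a minimal sketch of that idea on plain text chunks. The class name `TruncatingStream` and its `feed` and `flush` methods are hypothetical, and the sketch matches literal strings rather than regular expressions.

```python
class TruncatingStream:
    """Pass text chunks through until any stop string appears (hypothetical)."""

    def __init__(self, stop_strings):
        self.stop_strings = list(stop_strings)
        # Hold back enough characters that a stop string split across
        # two chunks is never partially shown.
        self.holdback = max(len(s) for s in self.stop_strings) - 1
        self.buffer = ""
        self.emitted = 0  # number of buffer characters already returned
        self.done = False

    def feed(self, chunk):
        """Return the part of the stream that is safe to display now."""
        if self.done:
            return ""
        self.buffer += chunk
        hits = [self.buffer.find(s) for s in self.stop_strings]
        cut = min((i for i in hits if i != -1), default=-1)
        if cut != -1:
            # A stop string was found: show everything before it, then stop.
            self.done = True
            out = self.buffer[self.emitted:cut]
            self.emitted = cut
            return out
        safe_end = max(self.emitted, len(self.buffer) - self.holdback)
        out = self.buffer[self.emitted:safe_end]
        self.emitted = safe_end
        return out

    def flush(self):
        """Return any held-back text once generation has finished."""
        if self.done:
            return ""
        out = self.buffer[self.emitted:]
        self.emitted = len(self.buffer)
        return out


stream = TruncatingStream(["[/", "Goodbye"])
chunks = ["Sure, here ", "you go. [", "/INST] extra"]
shown = "".join(stream.feed(c) for c in chunks) + stream.flush()
print(repr(shown))
```

The hold-back buffer delays output by a few characters, which is the price for never displaying the first half of a stop marker.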

In [6]:
eos_string = tokenizer.eos_token
history = None

In [7]:
while True:
    question = input("Question: ")

    template = """<|im_start|> user\n{question} <|im_end|>\n<|im_start|> assistant\n"""

    # Prepend the running history (if any) to the new user turn.
    turn = template.format(question=question)
    prompt = turn if history is None else history + '\n' + turn

    # print(prompt)
    
    prompt_tokenized = tokenizer(
        prompt, 
        return_tensors='pt', 
        padding=True, 
        truncation=True,
        return_attention_mask=True
    ).to('cuda')
    
    # Number of prompt tokens, used to slice the answer off the generated sequence.
    prompt_len = prompt_tokenized['input_ids'].shape[1]

    output_tokenized = model.generate(
        **prompt_tokenized,
        max_new_tokens=400,  # same budget as max_length = prompt length + 400
        temperature=0.8,
        top_k=40,
        top_p=0.1,
        do_sample=True,
        eos_token_id=tokenizer.eos_token_id,
        repetition_penalty=1.1,
        streamer=streamer
    )
    answer = tokenizer.decode(output_tokenized[0][prompt_len:]).strip()

    # Crude guardrails: cut the answer at the EOS token or at a stray '[/' tag.
    if eos_string in answer:
        answer = answer.split(eos_string)[0].strip()
    if '[/' in answer:
        answer = answer.split('[/')[0].strip()

    history = ' '.join([prompt, answer, eos_string])
    # print(f"ANSWER: {answer}\n")
    # print(f"HISTORY: {history}\n")
    print('#' * 50)

Question:  You have any good monitor recommendations for coding?


Setting `pad_token_id` to `eos_token_id`:50296 for open-end generation.


For coding, I would recommend the following monitors:
1. NVIDIA GeForce RTX 3080: This is a high-end graphics card that can handle complex 3D rendering tasks and has excellent color accuracy. It's also known for its fast refresh rate and low latency.
2. NVIDIA GeForce GTX 1080: This is a mid-range graphics card that offers decent performance at an affordable price point. It's great for general use cases like web development or casual gaming.
3. NVIDIA GeForce RTX 2080: This is another mid-range graphics card that provides excellent performance and is compatible with most popular programming languages. It's also known for its affordability compared to other options.
4. NVIDIA GeForce GTX 1650: This is a budget-friendly option that still performs well in terms of graphics capabilities and compatibility with various software. It's suitable for beginners or those on a tight budget.
5. NVIDIA GeForce RTX 3060: This is a premium graphics card that offers exceptional performance and is design

KeyboardInterrupt: Interrupted by user
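
The history handling in the loop above can be isolated from the model and tested on its own. The sketch below factors it into three small functions. The function names are hypothetical, and `EOS` is hard-coded to `<|im_end|>`, the value that `tokenizer.eos_token` prints in the Inference section.

```python
EOS = "<|im_end|>"  # assumed to match tokenizer.eos_token
TEMPLATE = "<|im_start|> user\n{question} <|im_end|>\n<|im_start|> assistant\n"


def build_prompt(history, question):
    """Append a new user turn to the running history (None on the first turn)."""
    turn = TEMPLATE.format(question=question)
    return turn if history is None else history + "\n" + turn


def clean_answer(answer, eos=EOS):
    """Cut the decoded answer at the EOS string or at a stray '[/' tag."""
    for marker in (eos, "[/"):
        answer = answer.split(marker)[0].strip()
    return answer


def update_history(prompt, answer, eos=EOS):
    """Store the full prompt plus the cleaned answer, closed with EOS."""
    return " ".join([prompt, answer, eos])


# Simulate two turns with a canned model reply instead of model.generate().
prompt = build_prompt(None, "Hi")
answer = clean_answer("Hello there! <|im_end|> [/INST] junk")
history = update_history(prompt, answer)
print(build_prompt(history, "Bye"))
```

Because the history is a single growing string, it is never trimmed here; a longer conversation would eventually exceed the model's context window.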

## Inference

In [1]:
from transformers import (
    AutoModelForCausalLM, 
    logging, 
    AutoTokenizer,
    TextStreamer
)

from peft import PeftModel

In [2]:
model = AutoModelForCausalLM.from_pretrained('microsoft/phi-1_5')
tokenizer = AutoTokenizer.from_pretrained('../assistant_sft/outputs/phi_1_5_chat_alpaca_orpo/best_model/')

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [3]:
model.resize_token_embeddings(len(tokenizer))

Embedding(50297, 2048)

In [4]:
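# Quantization flags such as load_in_4bit belong to the base model's
# from_pretrained call; here we only attach the adapter weights.
model = PeftModel.from_pretrained(
    model, 
    '../assistant_sft/outputs/phi_1_5_chat_alpaca_orpo/best_model/'
).to('cuda')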

In [5]:
streamer = TextStreamer(tokenizer)

In [6]:
print(tokenizer.eos_token)

<|im_end|>


In [7]:
prompt = """<|im_start|> user
How are you? <|im_end|> 
<|im_start|> assistant
"""

In [8]:
prompt_tokenized = tokenizer(
    prompt, 
    return_tensors='pt', 
    padding=True, 
    truncation=True,
    return_attention_mask=True
).to('cuda')

# Number of prompt tokens, used to slice the answer off the generated sequence.
prompt_len = prompt_tokenized['input_ids'].shape[1]

output_tokenized = model.generate(
    **prompt_tokenized,
    max_new_tokens=400,  # same budget as max_length = prompt length + 400
    temperature=0.7,
    top_k=40,
    top_p=0.95,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
    repetition_penalty=1.1,
    streamer=streamer
)

answer = tokenizer.decode(output_tokenized[0][prompt_len:]).strip()

Setting `pad_token_id` to `eos_token_id`:50296 for open-end generation.


<|im_start|> user
How are you? <|im_end|> 
<|im_start|> assistant
I'm doing well, thank you! How about yourself?


In [19]:
print(answer)

I'm doing well, thank you. How about yourself?
