# Introduction 

This notebook implements a simple multi-turn (continuous) chat loop. Each new prompt includes the previous turns, so the model can use earlier context up to the limit of its context window. Prompts are formatted with the tokenizer's *chat template* functionality.
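
Because the usable context is bounded by the model's context window, a long conversation eventually needs to drop old turns. The notebook itself does not do this; the following is only a minimal sketch of one possible approach. The names `trim_history`, `budget`, and `count_tokens` are hypothetical, and the default whitespace word count is a stand-in for a real tokenizer count.

```python
from typing import Callable, List


def trim_history(
    turns: List[str],
    budget: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Drop the oldest turns until the remaining ones fit in `budget` tokens.

    `count_tokens` is a stand-in (whitespace word count by default).
    With a real tokenizer one could pass, for example,
    `lambda s: len(tokenizer(s)['input_ids'])`.
    """
    kept: List[str] = []
    used = 0
    # Walk from newest to oldest so the most recent context is kept first.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


turns = [
    "[INST] Hello [/INST] Hello! How can I help you?",
    "[INST] Let's talk about deep learning. [/INST] Sure, what would you like to know?",
    "[INST] CNNs. [/INST] What are some examples of applications of CNNs?",
]
trimmed = trim_history(turns, budget=25)
print(len(trimmed))
```

Dropping whole turns from the oldest end keeps each remaining `[INST] ... [/INST]` pair intact, which is usually preferable to cutting a prompt in the middle of a turn.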

The model is Phi 1.5 fine-tuned on the ChatAlpaca dataset (https://huggingface.co/datasets/flpelerin/ChatAlpaca-10k). The fine-tuning notebook is in the `assistant_sft` directory.

**NOTE: The notebook uses a customized streamer for text streaming.**

In [1]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    pipeline,
    logging,
)

from streaming_utils import TextStreamer

In [2]:
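# Path to the fine-tuned checkpoint produced by the `assistant_sft` notebook.
checkpoint_dir = '../assistant_sft/outputs/phi_1_5_chat_alpaca/logs/checkpoint-1187/'

# Load the model in 4-bit precision. As the warning below notes, `load_in_4bit`
# is deprecated in favor of passing a `BitsAndBytesConfig` via `quantization_config`.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint_dir,
    load_in_4bit=True
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)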

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [3]:
streamer = TextStreamer(
    tokenizer,
    skip_prompt=True,
    skip_special_tokens=True,
    truncate_before_pattern=[r'\[\/', 'Goodbye'],  # raw string avoids invalid-escape warnings
    truncate=True
)
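
The internals of the custom `TextStreamer` from `streaming_utils` are not shown in this notebook. As a rough sketch, and assuming `truncate_before_pattern` means "cut the streamed text at the first match of any of these regular expressions", the behavior could look like the hypothetical helper `truncate_before` below.

```python
import re
from typing import List


def truncate_before(text: str, patterns: List[str]) -> str:
    """Return `text` cut at the earliest match of any regex in `patterns`."""
    cut = len(text)
    for pattern in patterns:
        match = re.search(pattern, text)
        if match is not None:
            # Keep the earliest cut position across all patterns.
            cut = min(cut, match.start())
    return text[:cut]


patterns = [r'\[\/', 'Goodbye']
print(truncate_before('Hello! How can I help you? [/INST] extra', patterns))
```

With these patterns, anything from a stray `[/` tag or the word `Goodbye` onward is hidden from the streamed output.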

In [4]:
eos_string = tokenizer.eos_token
history = None

In [5]:
chat = [
    {"from": "human", "value": 'Hello'}
]

tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

'<|endoftext|>[INST] Hello [/INST]'
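
To make the output above concrete, here is a plain-Python sketch of what a chat template like this one does. It is not the tokenizer's actual template: `render_chat` is a hypothetical function, the human-turn format is inferred from the output above, and the format used for model turns is an assumption.

```python
from typing import Dict, List


def render_chat(
    messages: List[Dict[str, str]],
    bos: str = '<|endoftext|>',
    eos: str = '<|endoftext|>',
) -> str:
    """Render a list of {"from", "value"} messages into one prompt string."""
    text = bos
    for message in messages:
        if message['from'] == 'human':
            # Matches the format seen in the notebook output.
            text += f"[INST] {message['value']} [/INST]"
        else:
            # Assumed format for model turns; the real template is not shown.
            text += f" {message['value']} {eos}"
    return text


rendered = render_chat([{"from": "human", "value": "Hello"}])
print(rendered)
```

For a single human turn this reproduces the string returned by `apply_chat_template` above.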

In [None]:
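while True:
    question = input("Question: ")

    # Format the new user turn with the tokenizer's chat template.
    chat = [
        {"from": "human", "value": question}
    ]
    template = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

    # Prepend the conversation so far so the model sees the earlier turns.
    prompt = history + ' ' + template if history is not None else template

    # print(prompt)

    # Only `input_ids` are passed to `generate`, which is why the
    # attention-mask warning appears in the output below.
    prompt_tokenized = tokenizer(
        prompt,
        return_tensors='pt',
    ).to('cuda')['input_ids']
    prompt_length = prompt_tokenized.shape[1]

    output_tokenized = model.generate(
        input_ids=prompt_tokenized,
        max_new_tokens=400,  # same limit as max_length = prompt length + 400
        temperature=0.7,
        top_k=40,
        top_p=0.1,
        do_sample=True,
        eos_token_id=tokenizer.eos_token_id,
        streamer=streamer
    )
    # Decode only the newly generated tokens, not the prompt.
    answer = tokenizer.decode(token_ids=output_tokenized[0][prompt_length:]).strip()

    # Puny guardrails: cut the answer at the EOS token or at a stray '[/' tag.
    if eos_string in answer:
        answer = answer.split(eos_string)[0].strip()
    if '[/' in answer:
        answer = answer.split('[/')[0].strip()

    # Store prompt + answer + EOS as the history for the next turn.
    history = ' '.join([prompt, answer, eos_string])
    # print(f"ANSWER: {answer}\n")
    # print(f"HISTORY: {history}\n")
    print('#' * 50)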

Question:  Hello.


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Hello! How can I help you? 
##################################################


Question:  Let's talk about deep learning.


Sure, what specifically would you like to know about deep learning? 
##################################################


Question:  CNNs.


What are some examples of applications of CNNs? 
##################################################


Question:  You tell me.


Sure, here are some examples of applications of CNNs:

1. Image recognition: CNNs are widely used in image recognition tasks, such as identifying objects in images, detecting faces, and recognizing handwritten digits.

2. Natural language processing: CNNs are used in natural language processing tasks, such as language translation, sentiment analysis, and machine translation.

3. Object detection: CNNs are used in object detection tasks, such as identifying cars, people, and other objects in images and videos.

4. Speech recognition: CNNs are used in speech recognition tasks, such as transcribing audio recordings and speech synthesis.

5. Autonomous driving: CNNs are used in autonomous driving tasks, such as object detection, image recognition, and language translation.

6. Medical imaging: CNNs are used in medical imaging tasks, such as segmentation of medical images, such as CT scans and MRIs.

7. Fraud detection: CNNs are used in fraud detection tasks, such as identifying fraudulent 