
Llama 3 streaming repeats the previous request's first token. #287

@mikutsky

Description

Hi! I'm running into a problem where, when streaming, each subsequent request's response is prefixed with the first token of every previous response. The prompt structure follows the Meta Llama 3 documentation (see the format self-check after the example code). Could you explain why this is happening?

The output of a simple chat example looks like this:

Model name is meta/meta-llama-3-70b-instruct

You: Hi!
Assistant: Hi! How can I help you today?

You: Recommend me a Hemingway novel, please.
Assistant: Hi
I'd recommend "The Old Man and the Sea". It's a classic, concise, and powerful novel that showcases Hemingway's unique writing style.

You: I read it, please recommend something else.
Assistant: Hi
I
How about "A Farewell to Arms"? It's a romantic and tragic novel set during WWI, and it's considered one of Hemingway's best works.

You: It's great! Thank you! Bye!
Assistant: Hi
I
How
You're welcome! I'm glad you enjoyed the recommendation. Have a great day and happy reading! Bye!
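
The repetition also shows up outside the chat loop. Below is a minimal two-request sketch; the two single-turn prompts are arbitrary ones I made up, and the model and sampling settings match the full example further down. If the bug reproduces, the second stream's first event repeats the first stream's first token:

import os
from replicate.client import Client

client = Client(api_token=os.environ["REPLICATE_API_TOKEN"])

SINGLE_TURN = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "{query}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)


def collect(query):
    # Gather every streamed event for one independent single-turn request.
    return [str(event) for event in client.stream(
        "meta/meta-llama-3-70b-instruct",
        input={"prompt": SINGLE_TURN.format(query=query),
               "max_tokens": 32, "temperature": 1e-6, "top_p": 1e-5})]


first = collect("Say OK and nothing else.")
second = collect("Name one Hemingway novel.")
# When the bug occurs, second[0] equals first[0] even though the
# two requests are unrelated.
print(repr(first[0]), repr(second[0]))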

Example code:

import os
from replicate.client import Client

replicate_api_key = os.getenv("REPLICATE_API_TOKEN", 'EMPTY')
replicate_model = os.getenv('REPLICATE_MODEL', 'meta/meta-llama-3-70b-instruct')
replicate_client = Client(api_token=replicate_api_key)

SYSTEM_PROMPT = 'You are a helpful assistant. Answer briefly!'
MESSAGES = []


def gen_llama3_prompt(sys_prompt=None, messages=None):
    """Build a raw Llama 3 chat prompt following Meta's documented format."""
    sys_prompt = '' if sys_prompt is None else sys_prompt
    messages = [] if messages is None else messages
    _result = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{sys_prompt}<|eot_id|>"
    for m in messages:
        if m['role'] == 'user':
            _result += f'<|start_header_id|>user<|end_header_id|>\n\n{m["content"]}<|eot_id|>'
        elif m['role'] == 'assistant':
            _result += f'<|start_header_id|>assistant<|end_header_id|>\n\n{m["content"]}<|eot_id|>'
    # The trailing assistant header cues the model to generate the next reply.
    _result += '<|start_header_id|>assistant<|end_header_id|>\n\n'
    return _result


def print_answer(query=''):
    """Stream one assistant reply, print it, and record both turns in MESSAGES."""
    message = {'role': 'user', 'content': query}
    answer = ''
    MESSAGES.append(message)
    for event in replicate_client.stream(
            replicate_model,  # same model that is printed at startup
            input={
                "top_p": 1e-5,  # with the tiny temperature: effectively greedy decoding
                "prompt": gen_llama3_prompt(SYSTEM_PROMPT, MESSAGES),
                "max_tokens": 512,
                "min_tokens": 0,
                "temperature": 1e-6
            }):
        token = str(event)  # each server-sent event carries one text chunk
        answer += token
        print(token, end='')
    message = {'role': 'assistant', 'content': answer}
    MESSAGES.append(message)


if __name__ == '__main__':
    print(f'Model name is {replicate_model}')
    while True:
        q = input('\nYou: ')
        print('Assistant: ', end='')
        print_answer(q)
        if 'bye' in q.lower():
            break
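
For completeness, here is a quick self-check of what gen_llama3_prompt produces for the first two turns of the transcript above. The expected string is my own transcription of the format in Meta's Llama 3 documentation, not authoritative:

history = [
    {'role': 'user', 'content': 'Hi!'},
    {'role': 'assistant', 'content': 'Hi! How can I help you today?'},
]
expected = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "You are a helpful assistant. Answer briefly!<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "Hi! How can I help you today?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
# Passes on my side, so the prompt itself appears to match the documented format.
assert gen_llama3_prompt(SYSTEM_PROMPT, history) == expected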

Thanks for your help!
