
Llama 3 streaming repeats the previous request's first token. #287

@mikutsky

Description

Hi! I'm running into a problem where, when streaming, each subsequent request's response is prefixed with the first token of every previous response. The prompt structure follows the Meta Llama 3 documentation (see the format self-check after the example code). Could you explain why this is happening?

The output of a simple chat example looks like this:

Model name is meta/meta-llama-3-70b-instruct

You: Hi!
Assistant: Hi! How can I help you today?

You: Recommend me a Hemingway novel, please.
Assistant: Hi
I'd recommend "The Old Man and the Sea". It's a classic, concise, and powerful novel that showcases Hemingway's unique writing style.

You: I read it, please recommend something else.
Assistant: Hi
I
How about "A Farewell to Arms"? It's a romantic and tragic novel set during WWI, and it's considered one of Hemingway's best works.

You: It's great! Thank you! Bye!
Assistant: Hi
I
How
You're welcome! I'm glad you enjoyed the recommendation. Have a great day and happy reading! Bye!
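
The repetition also shows up outside the chat loop. Below is a minimal two-request sketch; the two single-turn prompts are arbitrary ones I made up, and the model and sampling settings match the full example further down. If the bug reproduces, the second stream's first event repeats the first stream's first token:

import os
from replicate.client import Client

client = Client(api_token=os.environ["REPLICATE_API_TOKEN"])

SINGLE_TURN = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "{query}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)


def collect(query):
    # Gather every streamed event for one independent single-turn request.
    return [str(event) for event in client.stream(
        "meta/meta-llama-3-70b-instruct",
        input={"prompt": SINGLE_TURN.format(query=query),
               "max_tokens": 32, "temperature": 1e-6, "top_p": 1e-5})]


first = collect("Say OK and nothing else.")
second = collect("Name one Hemingway novel.")
# When the bug occurs, second[0] equals first[0] even though the
# two requests are unrelated.
print(repr(first[0]), repr(second[0]))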

Example code:

import os
from replicate.client import Client

replicate_api_key = os.getenv("REPLICATE_API_TOKEN", 'EMPTY')
replicate_model = os.getenv('REPLICATE_MODEL', 'meta/meta-llama-3-70b-instruct')
replicate_client = Client(api_token=replicate_api_key)

SYSTEM_PROMPT = 'You are a helpful assistant. Answer briefly!'
MESSAGES = []


def gen_llama3_prompt(sys_prompt=None, messages=None):
    """Build a raw Llama 3 chat prompt following Meta's documented format."""
    sys_prompt = '' if sys_prompt is None else sys_prompt
    messages = [] if messages is None else messages
    _result = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{sys_prompt}<|eot_id|>"
    for m in messages:
        if m['role'] == 'user':
            _result += f'<|start_header_id|>user<|end_header_id|>\n\n{m["content"]}<|eot_id|>'
        elif m['role'] == 'assistant':
            _result += f'<|start_header_id|>assistant<|end_header_id|>\n\n{m["content"]}<|eot_id|>'
    # The trailing assistant header cues the model to generate the next reply.
    _result += '<|start_header_id|>assistant<|end_header_id|>\n\n'
    return _result


def print_answer(query=''):
    """Stream one assistant reply, print it, and record both turns in MESSAGES."""
    message = {'role': 'user', 'content': query}
    answer = ''
    MESSAGES.append(message)
    for event in replicate_client.stream(
            replicate_model,  # same model that is printed at startup
            input={
                "top_p": 1e-5,  # with the tiny temperature: effectively greedy decoding
                "prompt": gen_llama3_prompt(SYSTEM_PROMPT, MESSAGES),
                "max_tokens": 512,
                "min_tokens": 0,
                "temperature": 1e-6
            }):
        token = str(event)  # each server-sent event carries one text chunk
        answer += token
        print(token, end='')
    message = {'role': 'assistant', 'content': answer}
    MESSAGES.append(message)


if __name__ == '__main__':
    print(f'Model name is {replicate_model}')
    while True:
        q = input('\nYou: ')
        print('Assistant: ', end='')
        print_answer(q)
        if 'bye' in q.lower():
            break
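
For completeness, here is a quick self-check of what gen_llama3_prompt produces for the first two turns of the transcript above. The expected string is my own transcription of the format in Meta's Llama 3 documentation, not authoritative:

history = [
    {'role': 'user', 'content': 'Hi!'},
    {'role': 'assistant', 'content': 'Hi! How can I help you today?'},
]
expected = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "You are a helpful assistant. Answer briefly!<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "Hi! How can I help you today?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
# Passes on my side, so the prompt itself appears to match the documented format.
assert gen_llama3_prompt(SYSTEM_PROMPT, history) == expected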

Thanks for your help!
