
Description
I have a fine-tuned Llama 2 7B Chat model that I am deploying to an endpoint using the DJL container. After deployment, when I tested the model, the output quality had degraded: the model seems to echo the same answer for several different questions.
Before switching to the DJL container I was using the TGI container, and the model worked absolutely fine there.
I understand the two containers may handle inference differently, but is there a way to override the inference code? Roughly along the lines of the sketch below.
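For clarity, this is the kind of override I have in mind: a minimal sketch of a custom handler for DJL Serving's Python engine. The `model.py` entry point, the `handle(inputs)` contract, the property names, and the generation parameters below are my assumptions from the DJL Serving docs, not something I have verified against this container.

```python
# model.py - minimal sketch of a custom DJL Serving (Python engine) handler.
# Assumes the fine-tuned weights are resolvable from the model directory or
# option.model_id; property names and defaults are assumptions.
import torch
from djl_python import Input, Output
from transformers import AutoModelForCausalLM, AutoTokenizer

model = None
tokenizer = None


def load_model(properties):
    global model, tokenizer
    model_location = properties.get("model_id") or properties.get("model_dir", ".")
    tokenizer = AutoTokenizer.from_pretrained(model_location)
    model = AutoModelForCausalLM.from_pretrained(
        model_location, torch_dtype=torch.float16, device_map="auto"
    )


def handle(inputs: Input):
    if model is None:
        load_model(inputs.get_properties())
    if inputs.is_empty():
        # Warm-up / ping request from the serving runtime
        return None

    data = inputs.get_as_json()
    prompt = data["inputs"]
    params = data.get("parameters", {})

    tokens = tokenizer(prompt, return_tensors="pt").to(model.device)
    generated = model.generate(
        **tokens,
        max_new_tokens=params.get("max_new_tokens", 128),
        do_sample=params.get("do_sample", False),
    )
    # Return only the newly generated tokens so the prompt is not echoed back
    new_tokens = generated[0][tokens["input_ids"].shape[-1]:]
    answer = tokenizer.decode(new_tokens, skip_special_tokens=True)
    return Output().add_as_json({"generated_text": answer})
```

My understanding is that the container would pick this up via something like `option.entryPoint=model.py` in `serving.properties`, but please correct me if that is not the supported mechanism.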
This is the sample prompt template I am using with the model:
"[INST] <>
Respond only with the answer and do not provide any explanation or additional text. If you don't know the answer to a question, please answer with 'I dont know'.Answer should be as short as possible.
<>
Below context is text extracted from a medical document. Answer the question asked based on the context given.
Context: {text}
Question: {question} [/INST]"
The model was fine-tuned on this exact prompt format, so inference needs to preserve the format so that the model comprehends it and returns the answer.
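For completeness, this is roughly how I build the prompt and call the endpoint. The boto3/SageMaker-style invocation, the endpoint name, and the `{"inputs": ..., "parameters": ...}` payload shape are assumptions on my side; the important part is that the full template reaches the container unchanged.

```python
# Sketch of prompt construction and endpoint invocation.
# Endpoint name is a placeholder; response shape depends on the container.
import json
import boto3

PROMPT_TEMPLATE = (
    "[INST] <<SYS>>\n"
    "Respond only with the answer and do not provide any explanation or additional text. "
    "If you don't know the answer to a question, please answer with 'I dont know'. "
    "Answer should be as short as possible.\n"
    "<</SYS>>\n"
    "Below context is text extracted from a medical document. "
    "Answer the question asked based on the context given.\n"
    "Context: {text}\n"
    "Question: {question} [/INST]"
)


def ask(text: str, question: str, endpoint_name: str = "my-llama2-endpoint"):
    prompt = PROMPT_TEMPLATE.format(text=text, question=question)
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": 64, "do_sample": False},
    }
    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())
```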
Any resources/suggestions would be really helpful.