@coolmian
Contributor

@coolmian coolmian commented Nov 8, 2024

Using ensure_ascii=False keeps Chinese characters readable in the formatted output instead of escaping them as \uXXXX sequences.

before:

[[ ## json_output ## ]]
[{"type": "narration", "content": "\u5c0f\u660e\u8d70\u51fa\u5bb6\u95e8\uff0c\u8ddf\u90bb\u5c45\u6253\u62db\u547c"}, {"type": "dialogue", "name": "\u5c0f\u660e", "reaction": "\u9ad8\u5174", "content": "\u4f60\u597d\u5440"}, {"type": "narration", "content": "\u90bb\u5c45\u5fae\u7b11\u671d\u4ed6\u70b9\u5934"}, {"type": "voiceover", "name": "\u90bb\u5c45", "reaction": "\u5185\u5fc3\u5947\u602a", "content": "\u8fd9\u5c0f\u5b50\u4eca\u5929\u600e\u4e48\u5bf9\u6211\u8fd9\u4e48\u6709\u793c\u8c8c"}]

[[ ## completed ## ]]

after:

[[ ## json_output ## ]]
[{"type": "narration", "content": "小明走出家门,跟邻居打招呼"}, {"type": "dialogue", "name": "小明", "reaction": "高兴", "content": "你好呀"}, {"type": "narration", "content": "邻居微笑朝他点头"}, {"type": "voiceover", "name": "邻居", "reaction": "内心奇怪", "content": "这小子今天怎么对我这么有礼貌"}]

[[ ## completed ## ]]
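
The before/after difference can be reproduced with the standard library alone. This is a minimal sketch, assuming the structured field is serialized with `json.dumps`; the `item` dictionary is illustrative data copied from the example above, not code from the patch itself.

```python
import json

# Illustrative record, taken from the example output in this comment.
item = {"type": "dialogue", "name": "小明", "content": "你好呀"}

# Default behaviour (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(item)

# With ensure_ascii=False the characters are written out unchanged.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings parse back to the same Python object; only the text shown
# to the model differs.
same = json.loads(escaped) == json.loads(readable) == item
print(same)
```

Both forms are valid JSON, so the change only affects what the LLM reads in the prompt, not how the result is parsed.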

Using ensure_ascii=False provides better support for Chinese characters directly
@coolmian
Contributor Author

coolmian commented Nov 8, 2024

My use case:

from typing import Literal

import dspy
from pydantic import BaseModel, Field

class Narrative(BaseModel):
    type: Literal["dialogue", "narration", "voiceover"] = Field()
    content: str = Field(default=None)
    name: str | None = Field(default=None)
    reaction: str | None = Field(default=None)

class StoryToJSON(dspy.Signature):
    """
    Convert story text into structured JSON format with specific fields for narration, dialogue, and voiceover.
    Make the performance more like a script or animation script style, help the performer better understand the character's emotions and reactions, and make the content more expressive and situational.
    NOTE: Convert each paragraph based on the story_text without skipping or omitting any content.
    """

    story_text = dspy.InputField()
    json_output: list[Narrative] = dspy.OutputField(desc="list of narratives")

# Define the predictor (assumes a language model was already set up with dspy.configure).
predictor = dspy.Predict(StoryToJSON)
example = dspy.Example(
    story_text = "小明走出家门,跟邻居打招呼:“你好呀”。邻居微笑朝他点头,内心奇怪这小子今天怎么对他这么有礼貌?",
    json_output = [
        {"type": "narration", "content": "小明走出家门,跟邻居打招呼"},
        {"type": "dialogue", "name": "小明", "reaction": "高兴", "content": "你好呀"},
        {"type": "narration", "content": "邻居微笑朝他点头"},
        {"type": "voiceover", "name":"邻居", "reaction": "内心奇怪", "content": "这小子今天怎么对我这么有礼貌"}
    ]
)

predictor.demos = [example]
with open("dataset/1.txt", "r", encoding="utf-8") as f:
    story_text = f.read()

# Call the predictor on a particular input.
pred = predictor(story_text=story_text)
print(f"Question: {story_text}")
for item in pred.json_output:
    print(item.model_dump())

If the Chinese strings in the demos are serialized as \uXXXX escape sequences, the LLM tends to reply with escaped strings as well, which lowers reply quality and requires extra decoding work.
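
To make the overhead concrete, here is a small sketch using only the standard library. It assumes nothing about DSPy internals; `text` is one of the strings from the demo above. It shows how much longer the escaped form is and the extra decoding step needed when a reply comes back escaped.

```python
import json

# One of the demo strings from the example above.
text = "这小子今天怎么对我这么有礼貌"

escaped = json.dumps(text)                       # \uXXXX escapes
readable = json.dumps(text, ensure_ascii=False)  # characters kept as-is

# Every character in this string turns into a six-character escape,
# so the escaped form is several times longer than the readable one.
print(len(readable), len(escaped))

# If the model echoes the escaped form, it has to be decoded before use.
decoded = json.loads(escaped)
print(decoded == text)
```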

@okhat okhat merged commit 4822d47 into stanfordnlp:main Nov 8, 2024
4 checks passed
@okhat
Collaborator

okhat commented Nov 8, 2024

Thanks a lot @coolmian !

@coolmian coolmian deleted the patch-1 branch November 9, 2024 04:09
2 participants