fix: prevent Chinese examples from being converted to Unicode encoding #1774

coolmian · 2024-11-08T09:00:15Z

Using ensure_ascii=False provides better support for Chinese characters directly

before:

[[ ## json_output ## ]]
[{"type": "narration", "content": "\u5c0f\u660e\u8d70\u51fa\u5bb6\u95e8\uff0c\u8ddf\u90bb\u5c45\u6253\u62db\u547c"}, {"type": "dialogue", "name": "\u5c0f\u660e", "reaction": "\u9ad8\u5174", "content": "\u4f60\u597d\u5440"}, {"type": "narration", "content": "\u90bb\u5c45\u5fae\u7b11\u671d\u4ed6\u70b9\u5934"}, {"type": "voiceover", "name": "\u90bb\u5c45", "reaction": "\u5185\u5fc3\u5947\u602a", "content": "\u8fd9\u5c0f\u5b50\u4eca\u5929\u600e\u4e48\u5bf9\u6211\u8fd9\u4e48\u6709\u793c\u8c8c"}]

[[ ## completed ## ]]

after:

[[ ## json_output ## ]]
[{"type": "narration", "content": "小明走出家门，跟邻居打招呼"}, {"type": "dialogue", "name": "小明", "reaction": "高兴", "content": "你好呀"}, {"type": "narration", "content": "邻居微笑朝他点头"}, {"type": "voiceover", "name": "邻居", "reaction": "内心奇怪", "content": "这小子今天怎么对我这么有礼貌"}]

[[ ## completed ## ]]

Passing ensure_ascii=False when the examples are serialized to JSON keeps Chinese characters as-is instead of escaping them as \uXXXX sequences.
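
For reference, the difference can be reproduced with the standard library alone. This is a minimal sketch, not the code changed in this PR; the `demo` record is a shortened, hypothetical version of the example above.

```python
import json

# Hypothetical one-item demo, shortened from the example in this PR.
demo = [{"type": "dialogue", "name": "小明", "reaction": "高兴", "content": "你好呀"}]

# Default behaviour (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(demo)

# With ensure_ascii=False the characters are written out unchanged.
readable = json.dumps(demo, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode back to the same Python object.
assert json.loads(escaped) == json.loads(readable) == demo
```

Both strings are valid JSON; only the readable one shows the model the text in the form it is expected to produce.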

coolmian · 2024-11-08T09:02:27Z

my case:

from typing import Literal

import dspy
from pydantic import BaseModel, Field

class Narrative(BaseModel):
    type: Literal["dialogue", "narration", "voiceover"] = Field()
    content: str = Field(default=None)
    name: str | None = Field(default=None)
    reaction: str | None = Field(default=None)

class StoryToJSON(dspy.Signature):
    """
    Convert story text into structured JSON format with specific fields for narration, dialogue, and voiceover.
    Make the performance more like a script or animation script style, help the performer better understand the character's emotions and reactions, and make the content more expressive and situational.
    NOTE: Convert each paragraph based on the story_text without skipping or omitting any content.
    """

    story_text = dspy.InputField()
    json_output: list[Narrative] = dspy.OutputField(desc="list of narratives")

# Define the predictor.
predictor = dspy.Predict(StoryToJSON)
example = dspy.Example(
    story_text = "小明走出家门，跟邻居打招呼：“你好呀”。邻居微笑朝他点头，内心奇怪这小子今天怎么对他这么有礼貌？",
    json_output = [
        {"type": "narration", "content": "小明走出家门，跟邻居打招呼"},
        {"type": "dialogue", "name": "小明", "reaction": "高兴", "content": "你好呀"},
        {"type": "narration", "content": "邻居微笑朝他点头"},
        {"type": "voiceover", "name":"邻居", "reaction": "内心奇怪", "content": "这小子今天怎么对我这么有礼貌"}
    ]
)

predictor.demos = [example]
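# Read as UTF-8 explicitly so the Chinese text decodes the same on every platform.
with open("dataset/1.txt", "r", encoding="utf-8") as f: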
    story_text = f.read()

# Call the predictor on a particular input.
pred = predictor(story_text=story_text)
print(f"Question: {story_text}")
for item in pred.json_output:
    print(item.model_dump())

If examples containing Chinese strings are serialized as \uXXXX escape sequences, the LLM tends to reply with escaped strings as well, which lowers reply quality and requires extra decoding work.
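
To illustrate the extra decoding step, here is a small sketch with two hypothetical replies (hand-written strings, not real model output). An escaped reply still parses to the same data, but it has to go through a JSON decoder before it is readable, and it is considerably longer in characters.

```python
import json

# Hypothetical replies, written by hand for illustration only.
escaped_reply = '[{"type": "dialogue", "name": "\\u5c0f\\u660e", "content": "\\u4f60\\u597d\\u5440"}]'
readable_reply = '[{"type": "dialogue", "name": "小明", "content": "你好呀"}]'

# json.loads resolves the \uXXXX escapes, so both replies yield the same objects.
parsed_escaped = json.loads(escaped_reply)
parsed_readable = json.loads(readable_reply)
assert parsed_escaped == parsed_readable

# Each escaped character takes six ASCII characters instead of one.
print(len(escaped_reply), len(readable_reply))
```

The character counts only show that the escaped form is more verbose; this sketch makes no claim about token counts for any particular model.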

okhat · 2024-11-08T14:26:18Z

Thanks a lot @coolmian !

fix: prevent Chinese examples from being converted to Unicode encoding

aa55af2

Using ensure_ascii=False provides better support for Chinese characters directly

okhat merged commit 4822d47 into stanfordnlp:main Nov 8, 2024
4 checks passed

coolmian deleted the patch-1 branch November 9, 2024 04:09
