-
Notifications
You must be signed in to change notification settings - Fork 550
fix dataset formatting and mapping for Magistral reasoning #136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Summary of ChangesHello @rolandtannous, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical data formatting incompatibility within the Magistral reasoning notebook. The previous implementation incorrectly passed a list of conversations to Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request correctly fixes the data format issue when calling tokenizer.apply_chat_template. The previous implementation passed a list of conversations, causing a TemplateError, and the new implementation correctly maps over each conversation individually. The changes in both the Jupyter notebook and the corresponding Python script are correct. I've added a couple of suggestions to improve code readability, mainly around formatting a list comprehension.
| "reasoning_conversations = [tokenizer.apply_chat_template(\n", | ||
| " conv[\"conversations\"],tokenize = False,)\n", | ||
| " for conv in reasoning_dataset.map(generate_conversation)\n", | ||
| "]" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The list comprehension is a bit difficult to read due to its formatting. The arguments to apply_chat_template are not clearly separated, and the for loop is on a separate line which is unusual for a list comprehension of this style. For better readability and adherence to common Python style (PEP 8), I suggest reformatting it.1
A more readable version would be:
reasoning_conversations = [
tokenizer.apply_chat_template(conv["conversations"], tokenize=False)
for conv in reasoning_dataset.map(generate_conversation)
]Style Guide References
Footnotes
-
PEP 8 provides guidelines on code layout, including how to format long lines and comprehensions, to improve readability. ↩
| reasoning_conversations = [ | ||
| tokenizer.apply_chat_template(conv["conversations"],tokenize = False,) for conv in reasoning_dataset.map(generate_conversation) | ||
| ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This list comprehension is written across multiple lines, but line 131 is very long and contains multiple parts of the expression, which harms readability. According to PEP 8, lines should be limited to 79 or 99 characters for readability.1 It's better to format long list comprehensions over multiple lines. This will make the code easier to read and maintain.
| reasoning_conversations = [ | |
| tokenizer.apply_chat_template(conv["conversations"],tokenize = False,) for conv in reasoning_dataset.map(generate_conversation) | |
| ] | |
| reasoning_conversations = [ | |
| tokenizer.apply_chat_template(conv["conversations"], tokenize=False) | |
| for conv in reasoning_dataset.map(generate_conversation) | |
| ] |
Style Guide References
Footnotes
-
PEP 8 suggests limiting all lines to a maximum of 79 characters (or 99 for some projects) to improve readability. ↩
Problem
reasoning_dataset.map(...)["conversations"]returns a list of conversationsBut
apply_chat_template()expects a single conversation (a list of message dicts), not a list of conversations.the Jinja template tries to iterate over what it thinks are "messages", but they're actually entire conversations (lists). When it tries to access
message['role'], it fails because a list doesn't have a'role'key, causing it to fall through to theelseclause that raises the exception:Solution
aligned formats of formatting , mapping functions with what
apply_chat_templateexpects.