
Is there a scientific paper? #23

Open
NatanFreeman opened this issue Mar 17, 2024 · 10 comments

Comments

@NatanFreeman

Is there a scientific paper accompanying this release? I've searched but couldn't find one. I find it odd that the weights would be released but not the research.

@Explosion-Scratch commented Mar 17, 2024

It would be nice to have:

  • Paper explaining the methodology
  • Benchmarks
  • Data this was trained on

@JudeDavis1

A model card would make sense. If there aren't any new techniques per se, I don't see the need for yet another paper. If there are, then sure!

@JudeDavis1

I'm happy as long as the code is up to date and the science is released, even if not in an academic setting.

@AlexanderPuckhaber

A model card would make sense. If there aren't any new techniques per se, I don't see the need for yet another paper. If there are, then sure!

Agree that a model card (saying what data Grok was trained on) is crucial for this being truly "open source".

It looks like Grok had a model card back from last November: https://x.ai/model-card/

Training data: The training data used for the release version of Grok-1 comes from both the Internet up to Q3 2023 and the data provided by our AI Tutors.

I doubt we'll get an answer any more detailed than "the Internet" and "whatever synthetic data our employees made"

@Qu3tzal commented Mar 18, 2024

Is there a scientific paper accompanying this release? I've searched but couldn't find one. I find it odd that the weights would be released but not the research.

Because there's no research underlying it? There's nothing new or surprising in the model so far; it's the same architecture as other MoE LLMs, just with different data and training compute.
Not every piece of software needs a paper that would be rejected at conferences and stay at the pre-print stage on arXiv. :)

@NatanFreeman (Author)
Is there a scientific paper accompanying this release? I've searched but couldn't find one. I find it odd that the weights would be released but not the research.

Because there's no research underlying it? There's nothing new or surprising in the model so far; it's the same architecture as other MoE LLMs, just with different data and training compute. Not every piece of software needs a paper that would be rejected at conferences and stay at the pre-print stage on arXiv. :)

Disagree. I think @Explosion-Scratch did a good job pointing out why a paper would be useful in this case.

@Qu3tzal commented Mar 18, 2024

That's a technical report at best, though.

@NatanFreeman (Author)
That's a technical report at best, though.

Call it what you want; the issue is that it doesn't exist.

@yzlnew commented Mar 26, 2024

This absolutely needs some experimental details on μTransfer for an MoE model this large, in case anyone else has noticed the several 'weird' multipliers here:

grok-1/run.py

Lines 31 to 47 in 7050ed2

output_multiplier_scale=0.5773502691896257,
embedding_multiplier_scale=78.38367176906169,
model=TransformerConfig(
    emb_size=48 * 128,
    widening_factor=8,
    key_size=128,
    num_q_heads=48,
    num_kv_heads=8,
    num_layers=64,
    attn_output_multiplier=0.08838834764831845,
    shard_activations=True,
    # MoE.
    num_experts=8,
    num_selected_experts=2,
    # Activation sharding.
    data_axis="data",
    model_axis="model",
@AsureDay

(image attachment)
