Hi, thanks for releasing the code and paper!
I am a newbie. I read your paper and experiments carefully and tried to follow the reported hyperparameters. I am currently reproducing the S4 results on the LRA ListOps task. I implemented the S4 network myself, where each S4 block consists of:
S4 layer → activation → linear layer → activation
I used 6 such blocks, followed by average pooling and then a classification head. However, after training, the training accuracy reaches about 59%, while the test accuracy is only 51%, which shows obvious overfitting. Additionally, when the dropout rate is raised to 0.2-0.3, the loss tends to explode (becomes NaN). It also seems that using tanh as the activation function performs slightly better.
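To make my setup concrete, here is a minimal PyTorch sketch of the architecture I described. The names (`S4Block`, `S4Classifier`, `make_s4_layer`, `d_model`) are just illustrative, and the `nn.Identity` placeholder stands in for my own S4 layer so the sketch runs on its own; it is not your implementation.

```python
import torch
import torch.nn as nn


class S4Block(nn.Module):
    """One block as described above: S4 layer -> activation -> linear layer -> activation.
    `s4_layer` stands in for my own S4 implementation (not shown here)."""
    def __init__(self, s4_layer, d_model, dropout=0.0):
        super().__init__()
        self.s4 = s4_layer
        self.linear = nn.Linear(d_model, d_model)
        self.act = nn.Tanh()  # tanh seemed to work slightly better for me
        self.drop = nn.Dropout(dropout)

    def forward(self, x):  # x: (batch, length, d_model)
        x = self.act(self.s4(x))
        x = self.act(self.linear(x))
        return self.drop(x)


class S4Classifier(nn.Module):
    """Six S4 blocks, then average pooling over the sequence, then a classification head."""
    def __init__(self, make_s4_layer, d_model, n_classes, n_blocks=6, dropout=0.0):
        super().__init__()
        self.blocks = nn.Sequential(
            *[S4Block(make_s4_layer(), d_model, dropout) for _ in range(n_blocks)]
        )
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):  # x: (batch, length, d_model)
        x = self.blocks(x)
        return self.head(x.mean(dim=1))  # average pooling over the length dimension


# Placeholder layer just so the sketch runs; my real model plugs in my S4 layer here.
model = S4Classifier(lambda: nn.Identity(), d_model=128, n_classes=10)
out = model(torch.randn(4, 2000, 128))  # ListOps: 10-way classification, long sequences
print(out.shape)  # torch.Size([4, 10])
```

Note that I am not using residual connections or layer normalization around the blocks; if the reference setup relies on them, that could explain the gap.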
Could you please share more details about the experimental setup that are important for reproducing the reported results?
Thanks a lot!