
How to fix the order of data in iterator during training step? #828

Open
seewoo5 opened this issue Jun 17, 2020 · 5 comments
seewoo5 commented Jun 17, 2020

❓ Questions and Help

Description

Currently, I'm running experiments with several datasets in torchtext, and I just found that I can't reproduce my experiments, even though I eliminated every source of randomness I could find, as follows:

import random

import numpy as np
import torch

torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
random.seed(seed)
np.random.seed(seed)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

I found that when the Iterator class is initialized, the RandomShuffler() defined in torchtext.data.utils is set as self.random_shuffler, and this is what shuffles the training data. However, although one can set the random state of a RandomShuffler by passing it as a constructor argument, the line self.random_shuffler = RandomShuffler() doesn't let us set that state manually. Am I right? Is there a way to fix the order of the data during the training step? Something like the sketch below is what I'm after.
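For context, a rough sketch of what I mean (I'm assuming the legacy torchtext API here; train_dataset and the batch size are placeholders). Since RandomShuffler(random_state) does accept an explicit state, one could in principle overwrite the iterator's shuffler after construction:

import random

from torchtext.data import Iterator
from torchtext.data.utils import RandomShuffler

# train_dataset is a placeholder for a torchtext Dataset built elsewhere.
train_iter = Iterator(train_dataset, batch_size=32, train=True)

# Overwrite the shuffler with one built from a fixed, seeded state so that
# the shuffle order is identical on every run.
random.seed(1)
train_iter.random_shuffler = RandomShuffler(random.getstate())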

zhangguanheng66 (Contributor) commented

@bencwallace mentioned this temporary fix. #522 (comment)

In the end, we will switch to torch.utils.data.DataLoader, which supports deterministic sampling. Could you tell me which datasets you are using now? I may be able to help you with our experimental datasets.
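For illustration, a minimal sketch of deterministic shuffling with DataLoader, assuming a PyTorch version where DataLoader accepts a generator argument; the TensorDataset is just a stand-in:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; any map-style Dataset works the same way.
dataset = TensorDataset(torch.arange(100))

# A manually seeded generator pins the shuffle order across runs.
g = torch.Generator()
g.manual_seed(1)
loader = DataLoader(dataset, batch_size=8, shuffle=True, generator=g)

for batch in loader:
    ...  # identical batch order on every run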

seewoo5 (Author) commented Jun 18, 2020

Thanks! I'm using Iterator instead of BucketIterator with the SST dataset (fine-grained classification). I just tried the workaround you mentioned, but it still doesn't work. Before calling Iterator.splits, I added the following three lines of code:

import random

random.seed(ARGS.random_seed)
rand_st = random.getstate()
random.setstate(rand_st)

where ARGS.random_seed is a fixed integer (1). I thought that fixing the random seed would also fix the random state, but this doesn't seem to be the right way. Could you help me further?

bencwallace commented

Strange; I'd expect the workaround to work for Iterator as well. You're calling Iterator.splits, right?

By the way, I'm pretty sure you're right that seeding and then calling getstate and setstate should have the same effect as just seeding. You should be able to fix the random state of the iterator just by calling random.seed. Maybe try a small, self-contained test to make sure there isn't some other source of randomness.
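For example, something along these lines (assuming the legacy torchtext API; the list of integers is just a stand-in for the examples being shuffled):

import random

from torchtext.data.utils import RandomShuffler

def shuffle_once(seed):
    # RandomShuffler() captures the current `random` state at construction,
    # so seeding beforehand should pin the shuffle order.
    random.seed(seed)
    return RandomShuffler()(list(range(10)))

# If seeding alone is enough, two runs with the same seed must agree.
assert shuffle_once(1) == shuffle_once(1)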

seewoo5 (Author) commented Jun 18, 2020

Yes. Here's a plot of train accuracy and train loss for the SST-5 data (obtained with wandb): the two experiments were run with exactly the same script. They are "almost" the same, but not identical.
[Screenshot, 2020-06-18: wandb plots of train accuracy and train loss for the two runs]

bencwallace commented

That's somewhat disconcerting. Please let me know if you track down the problem! I'm trying to maintain reproducibility in a project of my own.
