EarlyStopping's occurrence #33

Closed
alexzhang0825 opened this issue Feb 23, 2021 · 10 comments

@alexzhang0825

Hello,

I am currently trying to run your code to see how it works, but every time it terminates too soon because of EarlyStopping. The resulting MSE and MAE were also quite far off from the results shown here. I have not been involved with programming for a long time, so my knowledge is too limited to solve the problem myself. That said, I did try setting the EarlyStopping patience to 100, but the code still ended on its own even though the EarlyStopping counter was only at 3 out of 100. Also, at the start the code prints "Use GPU: cuda:0", which made me wonder whether the training was actually being done on the GPU; when I checked the Task Manager, GPU usage was at almost 100%, so I believed it was fine, but the fact that the code terminates too early every time still makes me wonder whether it is using the GPU properly. It would be great if you could provide me some help with this.

In case any information on specs is needed:
OS: Windows Server 2019 64-bit
Processor: Intel Xeon CPU @ 2.20GHz
Memory: 30GB
GPU: Nvidia Tesla V100

Thank you in advance. Let me know if there is any additional information you need.

@cookieminions
Collaborator

cookieminions commented Feb 24, 2021

Hi, please try setting a larger train_epochs (the default is 6), such as 20, and then set a larger EarlyStopping patience.
We add `args.use_gpu = True if torch.cuda.is_available() else False` in main_informer.py. If the program prints 'Use GPU: cuda:0', that means it is using the GPU.
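
For reference, a minimal, self-contained sketch of that device check (the actual main_informer.py wires the flag through its argparse args rather than plain variables):

```python
import torch

# Use the GPU only if CUDA is available (mirrors args.use_gpu in main_informer.py).
use_gpu = torch.cuda.is_available()

if use_gpu:
    device = torch.device('cuda:0')
    print('Use GPU: cuda:0')  # the message mentioned in this issue
else:
    device = torch.device('cpu')
    print('Use CPU')
```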

@zhouhaoyi
Owner

Hi.
You can upload some error logs so that we can do some basic analysis.

@alexzhang0825
Author

> Hi.
> You can upload some error logs so that we can do some basic analysis.

Hello. Thank you for your swift response. I did what cookieminions suggested, and the code made it to epoch 20 with the EarlyStopping counter at 18 out of 100 before terminating. The resulting MSE and MAE were 0.45 and 0.50 respectively. However, soon after terminating, the code ran the experiment again on its own, and it again only made it to epoch 20, this time with the EarlyStopping counter at 19 out of 100. The MSE and MAE were similar to those from the previous run. I tried to find the error log files but could not, and the code did not produce any error prompt anywhere during its run, so I was wondering if you could point me to a place where I can look for them.

Thank you

@cookieminions
Collaborator

Hi,
The program runs again because the default number of repeated experiments, itr, is set to 2. The EarlyStopping patience means that if the validation loss fails to drop for that many consecutive epochs, the experiment will stop. But if train_epochs is reached first, the experiment will also stop.
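
For illustration, a minimal sketch of the patience logic described here (not necessarily the exact EarlyStopping implementation in this repository, which also handles checkpoint saving):

```python
class EarlyStopping:
    """Minimal sketch of patience-based early stopping."""

    def __init__(self, patience=7):
        self.patience = patience
        self.counter = 0
        self.best_loss = None
        self.early_stop = False

    def __call__(self, val_loss):
        if self.best_loss is None or val_loss < self.best_loss:
            # Validation loss improved: remember it and reset the counter.
            self.best_loss = val_loss
            self.counter = 0
        else:
            # No improvement this epoch.
            self.counter += 1
            print(f'EarlyStopping counter: {self.counter} out of {self.patience}')
            if self.counter >= self.patience:
                # Stop only after `patience` consecutive epochs without improvement.
                self.early_stop = True
```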

@alexzhang0825
Author

alexzhang0825 commented Feb 25, 2021 via email

@cookieminions
Collaborator

If you use the default configuration, you will get a multivariate prediction result with a prediction length of 24, which corresponds to the upper-left corner of Figure 5.
When the validation loss stops dropping, it indicates that the model is overfitting, so we need to use the model parameters saved before the validation loss increased (this operation is already in the code).
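
As a rough illustration of that save-and-restore behaviour (the function name below is hypothetical; in the repository this happens inside its EarlyStopping/checkpoint handling):

```python
import torch

def maybe_save_checkpoint(model, val_loss, best_loss, path='checkpoint.pth'):
    """Save the model whenever the validation loss improves; return the best loss so far."""
    if best_loss is None or val_loss < best_loss:
        torch.save(model.state_dict(), path)  # keep the parameters from the best epoch
        return val_loss
    return best_loss

# After training, reload the parameters saved at the minimum validation loss:
# model.load_state_dict(torch.load('checkpoint.pth'))
```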

@alexzhang0825
Author

> If you use the default configuration, you will get a multivariate prediction result with a prediction length of 24, which corresponds to the upper-left corner of Figure 5.
> When the validation loss stops dropping, it indicates that the model is overfitting, so we need to use the model parameters saved before the validation loss increased (this operation is already in the code).

Oh, now I understand. I was looking at Figure 4 and was confused about why the results were so far off. Now that I am looking at the correct figure, everything seems to be working fine.
As for the overfitting, I am guessing that is something caused by the code itself as it looks for the most optimal result. Does that mean the only thing I can do about it is to increase the patience to prevent early termination?

@cookieminions
Collaborator

Maybe you can reduce d_model and d_ff by using `--d_model xxx --d_ff xxx` to prevent the model from overfitting too quickly. Increasing the patience will make the model continue to train, but eventually the model will load the parameters saved at the minimum validation error. If you do not need early stopping, you can comment out lines 198-200 (`if early_stopping.early_stop: print("Early stopping") break`) and line 205 (`self.model.load_state_dict(torch.load(best_model_path))`) of exp/exp_informer.py, and the model will be trained until the train_epochs you set is reached.
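
For clarity, those lines sit roughly as follows in the training loop of exp/exp_informer.py (the surrounding loop is paraphrased and line numbers may shift between versions):

```python
for epoch in range(self.args.train_epochs):
    # ... one epoch of training and validation happens here, producing vali_loss ...
    early_stopping(vali_loss, self.model, path)

    # Lines ~198-200: comment these out to disable early termination.
    if early_stopping.early_stop:
        print("Early stopping")
        break

# Line ~205: comment this out to keep the last-epoch weights instead of
# reloading the checkpoint saved at the minimum validation loss.
self.model.load_state_dict(torch.load(best_model_path))
```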

@zhouhaoyi
Owner

I suppose there is no more discussion. I will close this issue in 12h.

@alexzhang0825
Author

> I suppose there is no more discussion. I will close this issue in 12h.

Sorry, I forgot to check back on this thread. The issue is solved for now and I will close it.
