
Question - how long does training the TSP model take? #40

Open
LuciaTajuelo opened this issue May 13, 2021 · 7 comments

Comments

@LuciaTajuelo

Hi!

I'm executing the following command to train a TSP model:

python run.py --graph_size 20 --baseline rollout --run_name 'tsp20_rollout'
I've run the code for 1 epoch and it took 10 hours. Is this expected?

I'm running the code under Windows on an MSI laptop with a GPU and CUDA enabled.

Thanks in advance!

@wouterkool
Owner

Hi and thanks for trying the code! Off the top of my head, training a single epoch with default settings should take around 5 minutes on a 1080 Ti GPU. This does not sound as if you are actually using the GPU. Please verify GPU usage through the `nvidia-smi` command (not sure whether this exists under Windows) and check that tensors are actually moved to the GPU (e.g. add some print statements printing `{tensor}.device`).
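
For context, a minimal standalone check along these lines (a sketch, not code from this repository) shows whether PyTorch can see a CUDA device at all and where a tensor actually ends up:

```
import torch

# Is a CUDA device visible to PyTorch?
print(torch.cuda.is_available())

# Create a tensor and move it to the GPU if one is available.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
x = torch.randn(4, 4).to(device)
print(x.device)  # prints cuda:0 when the tensor actually lives on the GPU
```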

@LuciaTajuelo
Author

Hi!

Thanks for your quick reply.
I've checked `nvidia-smi`, but I don't really see any usage from Python when running the code. I've added some prints to the code. In `run.py`, line 58, I've added a `print(opts.device)`,
and I've added some prints in `nets/graph_encoder.py`, line 44:
```
print(self.W_query)
print(self.W_key)
print(self.W_val)
```

I see that `opts.device` is `cuda:0`, but the tensors are on the CPU. However, I'm not sure if this is the proper way to test it.
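
(Side note for readers checking the same thing: printing a parameter shows its values; a more direct check is to print each parameter's `.device` attribute. A sketch reusing the same names as the prints above:)

```
# Sketch: print where the attention projection parameters actually live.
print(self.W_query.device)  # expected cuda:0 once the model has been moved to the GPU
print(self.W_key.device)
print(self.W_val.device)
```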

To sum up, I'd say that I'm not actually using the GPU when running the code. How can I use it?

Thanks a lot!!

@XxwlW

XxwlW commented Mar 18, 2022

Hi! I have the same problem as you. Have you solved it?

@Hessen525

I have the same problem, too...

@Sinyo-Liu

Perhaps you can decrease `--epoch_size`, e.g. from 1280000 to 128000, 12800, or 1280; the time each epoch takes will be greatly reduced (see the example command below).
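
For reference, a quicker test run combining the command from the original post with a smaller epoch size (a sketch; the exact value is up to you) might look like:

```
python run.py --graph_size 20 --baseline rollout --run_name 'tsp20_rollout' --epoch_size 12800
```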

@zbh888

zbh888 commented Nov 11, 2022

Same here.

@wouterkool
Owner

Sorry this caused trouble. For people running into this, please have a look at #11; I'll copy it here for reference:

Hi, I had the same issue, and here is how I solved it:
Turns out this is related to `num_workers=1`. If you change it to `num_workers=0`, the code will run properly. This is in line 78 of `train.py`:
`training_dataloader = DataLoader(training_dataset, batch_size=opts.batch_size, num_workers=1)`
At first, I thought this was because of `enumerate`, but it is not.
I hope this solves it for you.
Cheers
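
For reference, the adjusted call (a sketch mirroring the quoted line, with only `num_workers` changed; `training_dataset` and `opts` come from the surrounding code in `train.py`) would be:

```
from torch.utils.data import DataLoader

# num_workers=0 loads batches in the main process, which avoids the
# worker-process stall described in the quoted fix above.
training_dataloader = DataLoader(training_dataset, batch_size=opts.batch_size, num_workers=0)
```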
