
Question - how long does training the TSP model take? #40

Open
LuciaTajuelo opened this issue May 13, 2021 · 7 comments

Comments

@LuciaTajuelo

Hi!

I'm executing the following command to train a TSP model:

python run.py --graph_size 20 --baseline rollout --run_name 'tsp20_rollout'
I've run the code for 1 epoch and it took 10 hours. Is this expected?

I'm running the code under Windows on an MSI laptop with a GPU and CUDA enabled.

Thanks in advance!

@wouterkool
Owner

Hi and thanks for trying the code! Off the top of my head, training a single epoch with default settings should take around 5 minutes on a 1080 Ti GPU. This does not sound as if you are actually using the GPU. Please verify GPU usage through the `nvidia-smi` command (not sure whether this exists under Windows) and check that tensors are actually moved to the GPU (e.g. add some print statements printing `{tensor}.device`).
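
For context, a minimal standalone check along these lines (a sketch, not code from this repository) shows whether PyTorch can see a CUDA device at all and where a tensor actually ends up:

```
import torch

# Is a CUDA device visible to PyTorch?
print(torch.cuda.is_available())

# Create a tensor and move it to the GPU if one is available.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
x = torch.randn(4, 4).to(device)
print(x.device)  # prints cuda:0 when the tensor actually lives on the GPU
```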

@LuciaTajuelo
Author

Hi!

Thanks for your quick reply.
I've checked `nvidia-smi`, but I don't really see any usage from Python when running the code. I've added some prints to the code. In `run.py`, line 58, I've added a `print(opts.device)`,
and I've added some prints in `nets/graph_encoder.py`, line 44:
```
print(self.W_query)
print(self.W_key)
print(self.W_val)
```

I see that `opts.device` is `cuda:0`, but the tensors are on the CPU. However, I'm not sure if this is the proper way to test it.
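
(Side note for readers checking the same thing: printing a parameter shows its values; a more direct check is to print each parameter's `.device` attribute. A sketch reusing the same names as the prints above:)

```
# Sketch: print where the attention projection parameters actually live.
print(self.W_query.device)  # expected cuda:0 once the model has been moved to the GPU
print(self.W_key.device)
print(self.W_val.device)
```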

To sum up, I'd say that I'm not actually using the GPU when running the code. How can I use it?

Thanks a lot!!

@XxwlW

XxwlW commented Mar 18, 2022

Hi! I have the same problem as you. Have you solved it?

@Hessen525

I have the same problem, too...

@Sinyo-Liu

Perhaps you can decrease `--epoch_size`, e.g. from 1280000 to 128000, 12800, or 1280; the time each epoch takes will be greatly reduced (see the example command below).
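
For reference, a quicker test run combining the command from the original post with a smaller epoch size (a sketch; the exact value is up to you) might look like:

```
python run.py --graph_size 20 --baseline rollout --run_name 'tsp20_rollout' --epoch_size 12800
```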

@zbh888

zbh888 commented Nov 11, 2022

Same here.

@wouterkool
Owner

Sorry this caused trouble. For people running into this, please have a look at #11; I'll copy it here for reference:

Hi, I had the same issue, and here is how I solved it:
Turns out this is related to `num_workers=1`. If you change it to `num_workers=0`, the code will run properly. This is in line 78 of `train.py`:
`training_dataloader = DataLoader(training_dataset, batch_size=opts.batch_size, num_workers=1)`
At first, I thought this was because of `enumerate`, but it is not.
I hope this solves it for you.
Cheers
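
For reference, the adjusted call (a sketch mirroring the quoted line, with only `num_workers` changed; `training_dataset` and `opts` come from the surrounding code in `train.py`) would be:

```
from torch.utils.data import DataLoader

# num_workers=0 loads batches in the main process, which avoids the
# worker-process stall described in the quoted fix above.
training_dataloader = DataLoader(training_dataset, batch_size=opts.batch_size, num_workers=0)
```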
