Issue recruiting GPU for AE training #27

Closed
maimechevee opened this issue Nov 24, 2021 · 4 comments

maimechevee commented Nov 24, 2021

Hello,
I would like to run the AE on my own videos but I cannot get it to work with my GPU.

The first problem:
(behavenet) C:\Users\cheveemf\Documents\GitHub\Maxime_tools\behavenet-master>python behavenet/fitting/ae_grid_search.py --data_config C:\Users\cheveemf.behavenet/Maxime_3120-210303-125248_params.json --model_config C:\Users\cheveemf/.behavenet/ae_model.json --training_config C:\Users\cheveemf/.behavenet/ae_training.json --compute_config C:\Users\cheveemf/.behavenet/ae_compute.json

(the output is two interleaved tracebacks, reconstructed here as one from the parent process and one from the spawned child)

Traceback (most recent call last):
  File "behavenet/fitting/ae_grid_search.py", line 181, in <module>
    hyperparams.optimize_parallel_gpu(main, gpu_ids=parallel_gpu_ids)
  File "C:\Users\cheveemf\Anaconda3\envs\behavenet\lib\site-packages\test_tube\argparse_hopt.py", line 348, in optimize_parallel_gpu
    self.pool = Pool(processes=nb_workers, initializer=init, initargs=(gpu_q,))
  File "C:\Users\cheveemf\Anaconda3\envs\behavenet\lib\multiprocessing\context.py", line 119, in Pool
    context=self.get_context())
  File "C:\Users\cheveemf\Anaconda3\envs\behavenet\lib\multiprocessing\pool.py", line 176, in __init__
    self._repopulate_pool()
  File "C:\Users\cheveemf\Anaconda3\envs\behavenet\lib\multiprocessing\pool.py", line 241, in _repopulate_pool
    w.start()
  File "C:\Users\cheveemf\Anaconda3\envs\behavenet\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "C:\Users\cheveemf\Anaconda3\envs\behavenet\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\cheveemf\Anaconda3\envs\behavenet\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\cheveemf\Anaconda3\envs\behavenet\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'HyperOptArgumentParser.optimize_parallel_gpu.<locals>.init'

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\cheveemf\Anaconda3\envs\behavenet\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\cheveemf\Anaconda3\envs\behavenet\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

I "fixed" this issue following advice from this post: #8. I moved

def init(local_gpu_q):
    global g_gpu_id_q
    g_gpu_id_q = local_gpu_q

out of the class HyperOptArgumentParser in argparse_hopt (test_tube). That seems to remove the error, but then training gets stuck somewhere and I can't find where:
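As a minimal illustration (not behavenet or test_tube code) of why moving init to module level helps: on Windows, multiprocessing uses the "spawn" start method, which must pickle the Pool initializer to send it to each worker, and Python cannot pickle functions defined inside another function or method:

```python
import pickle

def module_level_init(local_gpu_q):
    # Picklable: lives at module top level, so a spawned worker
    # can re-import it by name.
    global g_gpu_id_q
    g_gpu_id_q = local_gpu_q

def make_nested_init():
    # Mirrors the original test_tube layout: init defined inside
    # another scope (there, inside a method of HyperOptArgumentParser).
    def init(local_gpu_q):
        pass
    return init

pickle.dumps(module_level_init)  # fine: pickled by qualified name

try:
    pickle.dumps(make_nested_init())
except AttributeError as err:
    # Same failure mode as the traceback above:
    # "Can't pickle local object 'make_nested_init.<locals>.init'"
    print(err)
```

On Linux/macOS the default "fork" start method sidesteps this entirely (workers inherit the parent's memory, so nothing is pickled), which is why this bites Windows users first.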

(behavenet) C:\Users\cheveemf\Documents\GitHub\Maxime_tools\behavenet-master>python behavenet/fitting/ae_grid_search.py --data_config C:\Users\cheveemf.behavenet/Maxime_3120-210303-125248_params.json --model_config C:\Users\cheveemf/.behavenet/ae_model.json --training_config C:\Users\cheveemf/.behavenet/ae_training.json --compute_config C:\Users\cheveemf/.behavenet/ae_compute.json

DATA CONFIG:
lab: Maxime
expt: 3120-210303-125248
animal: 3120
session: 210303
n_input_channels: 1
y_pixels: 330
x_pixels: 370
use_output_mask: False
frame_rate: 20.0
neural_type: None
neural_bin_size: 0.05
approx_batch_size: 200

COMPUTE CONFIG:
device: cuda
n_parallel_gpus: 1
gpus_viz: 0
tt_n_gpu_trials: 128
tt_n_cpu_trials: 1000
tt_n_cpu_workers: 5
mem_limit_gb: 8.0

TRAINING CONFIG:
export_train_plots: True
export_latents: True
pretrained_weights_path: None
val_check_interval: 1
learning_rate: 0.0001
max_n_epochs: 1000
min_n_epochs: 500
enable_early_stop: False
early_stop_history: 10
rng_seed_train: None
as_numpy: False
batch_load: True
rng_seed_data: 0
train_frac: 1.0
trial_splits: 8;1;1;0

MODEL CONFIG:
experiment_name: ae-example
model_type: conv
n_ae_latents: 12
l2_reg: 0.0
rng_seed_model: 0
fit_sess_io_layers: False
ae_arch_json: None
model_class: ae

using data from following sessions:
F:\ISX videos to run through DLC\3120\3120-210303-125248\TDT video data\Behavenet\Maxime\3120-210303-125248\3120\210303
constructing data generator...done
Generator contains 1 SingleSessionDatasetBatchedLoad objects:
Maxime_3120-210303-125248_3120_210303
signals: ['images']
transforms: OrderedDict([('images', None)])
paths: OrderedDict([('images', 'F:\\ISX videos to run through DLC\\3120\\3120-210303-125248\\TDT video data\\Behavenet\Maxime\3120-210303-125248\3120\210303\data.hdf5')])

constructing model...Initializing with random weights
done

Autoencoder architecture

Encoder architecture:
00: ZeroPad2d(padding=(1, 2, 1, 2), value=0.0)
01: Conv2d(1, 32, kernel_size=(5, 5), stride=(2, 2))
02: LeakyReLU(negative_slope=0.05)
03: Conv2d(32, 64, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
04: LeakyReLU(negative_slope=0.05)
05: Conv2d(64, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
06: LeakyReLU(negative_slope=0.05)
07: ZeroPad2d(padding=(2, 2, 1, 2), value=0.0)
08: Conv2d(128, 256, kernel_size=(5, 5), stride=(2, 2))
09: LeakyReLU(negative_slope=0.05)
10: ZeroPad2d(padding=(0, 1, 2, 2), value=0.0)
11: Conv2d(256, 512, kernel_size=(5, 5), stride=(5, 5))
12: LeakyReLU(negative_slope=0.05)
13: Linear(in_features=12800, out_features=12, bias=True)
Decoder architecture:
00: Linear(in_features=12, out_features=12800, bias=True)
01: ConvTranspose2d(512, 256, kernel_size=(5, 5), stride=(5, 5))
02: LeakyReLU(negative_slope=0.05)
03: ConvTranspose2d(256, 128, kernel_size=(5, 5), stride=(2, 2))
04: LeakyReLU(negative_slope=0.05)
05: ConvTranspose2d(128, 64, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
06: LeakyReLU(negative_slope=0.05)
07: ConvTranspose2d(64, 32, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
08: LeakyReLU(negative_slope=0.05)
09: ConvTranspose2d(32, 1, kernel_size=(5, 5), stride=(2, 2))
10: Sigmoid()

epoch 0000/1000
0%| | 0/80 [00:00<?, ?it/s]

It never progresses.
A few notes:

  • I have checked that my GPU is available and it seems to work fine.
  • I have confirmed (via nvidia-smi) that it is not getting recruited.
  • I can run the AE fine with the "cpu" option.
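A minimal sanity check for the first two points, assuming torch is importable in the behavenet environment:

```python
import torch

# 1) Is the installed torch build CUDA-enabled, and can it see the GPU?
print(torch.__version__)          # a "+cpu" suffix means a CPU-only build
print(torch.cuda.is_available())  # should print True

# 2) Does work actually land on the GPU? While this runs, the python
#    process should appear under nvidia-smi.
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    x = torch.ones(1000, 1000, device="cuda")
    print(x.sum().item())
```

If is_available() returns False here, the hang would be consistent with a CPU-only torch build rather than a behavenet problem.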

Any ideas would be very welcome :)

@themattinthehatt (Owner) commented:

Very strange, sorry for your troubles :(

One thing you could try is running the integration test, which makes fake data and then runs a bunch of models on that data. It only runs each model for a couple epochs, so should be pretty fast. All you need to do is run the following command from the top level of the behavenet repo:

(behavenet) $: python tests/integration.py

Give that a try and let me know if you have the same issue.

@maimechevee (Author) commented:

Hey, thanks for getting back to me!

It gets stuck in the same way when I run integration.py; see the output below.

I will keep trying to find the problem... My best guess is that it results from fiddling with this:

"
I "fixed" this issue following advice from this post: #8. I moved the
def init(local_gpu_q):
global g_gpu_id_q
g_gpu_id_q = local_gpu_q
out of the class HyperOptArgumentParser in argparse_hopt (test tube)
"
Let me know if you think of something else, thanks!

Maxime


(behavenet) C:\Users\cheveemf\Documents\GitHub\Maxime_tools\behavenet-master>python tests/integration.py
creating temp data...done


model: ae
session: sess-0

DATA CONFIG:
lab: lab
expt: expt
animal: animal
session: sess-0
sessions_csv:
all_source: data
n_input_channels: 1
y_pixels: 64
x_pixels: 48
use_output_mask: False
use_label_mask: False
neural_bin_size: 25
neural_type: ca
approx_batch_size: 200
n_labels: 2

COMPUTE CONFIG:
device: cuda
n_parallel_gpus: 1
gpus_viz: 0
tt_n_gpu_trials: 1000
tt_n_cpu_trials: 1000
tt_n_cpu_workers: 2
mem_limit_gb: 8.0

TRAINING CONFIG:
export_train_plots: False
export_latents: True
pretrained_weights_path: None
val_check_interval: 1
learning_rate: 0.0001
max_n_epochs: 1
min_n_epochs: 1
enable_early_stop: False
early_stop_history: 10
rng_seed_train: None
as_numpy: False
batch_load: True
rng_seed_data: 0
train_frac: 0.5
trial_splits: 8;1;1;1
export_predictions: True

MODEL CONFIG:
experiment_name: ae-expt
model_type: conv
n_ae_latents: 6
l2_reg: 0.0
rng_seed_model: 0
fit_sess_io_layers: False
ae_arch_json: None
model_class: ae
conditional_encoder: False
msp.alpha: None
vae.beta: 1
vae.beta_anneal_epochs: 100
beta_tcvae.beta: 1
beta_tcvae.beta_anneal_epochs: 100
ps_vae.alpha: 1
ps_vae.beta: 1
ps_vae.gamma: 1
ps_vae.delta: 1
ps_vae.anneal_epochs: 100
n_background: 3
n_sessions_per_batch: 1

using data from following sessions:
C:\Users\cheveemf.behavenet_tmp_save_AaA\lab\expt\animal\sess-0
constructing data generator...done
Generator contains 1 SingleSessionDatasetBatchedLoad objects:
lab_expt_animal_sess-0
signals: ['images']
transforms: OrderedDict([('images', None)])
paths: OrderedDict([('images', 'C:\Users\cheveemf\.behavenet_tmp_data_AaA\lab\expt\animal\sess-0\data.hdf5')])

constructing model...Initializing with random weights
done

Autoencoder architecture

Encoder architecture:
00: ZeroPad2d(padding=(1, 2, 1, 2), value=0.0)
01: Conv2d(1, 32, kernel_size=(5, 5), stride=(2, 2))
02: LeakyReLU(negative_slope=0.05)
03: ZeroPad2d(padding=(1, 2, 1, 2), value=0.0)
04: Conv2d(32, 64, kernel_size=(5, 5), stride=(2, 2))
05: LeakyReLU(negative_slope=0.05)
06: ZeroPad2d(padding=(1, 2, 1, 2), value=0.0)
07: Conv2d(64, 128, kernel_size=(5, 5), stride=(2, 2))
08: LeakyReLU(negative_slope=0.05)
09: ZeroPad2d(padding=(1, 2, 1, 2), value=0.0)
10: Conv2d(128, 256, kernel_size=(5, 5), stride=(2, 2))
11: LeakyReLU(negative_slope=0.05)
12: ZeroPad2d(padding=(1, 1, 0, 1), value=0.0)
13: Conv2d(256, 512, kernel_size=(5, 5), stride=(5, 5))
14: LeakyReLU(negative_slope=0.05)
15: Linear(in_features=512, out_features=6, bias=True)
Decoder architecture:
00: Linear(in_features=6, out_features=512, bias=True)
01: ConvTranspose2d(512, 256, kernel_size=(5, 5), stride=(5, 5))
02: LeakyReLU(negative_slope=0.05)
03: ConvTranspose2d(256, 128, kernel_size=(5, 5), stride=(2, 2))
04: LeakyReLU(negative_slope=0.05)
05: ConvTranspose2d(128, 64, kernel_size=(5, 5), stride=(2, 2))
06: LeakyReLU(negative_slope=0.05)
07: ConvTranspose2d(64, 32, kernel_size=(5, 5), stride=(2, 2))
08: LeakyReLU(negative_slope=0.05)
09: ConvTranspose2d(32, 1, kernel_size=(5, 5), stride=(2, 2))
10: Sigmoid()

epoch 0/1
0%| | 0/4 [00:00<?, ?it/s]

@maimechevee (Author) commented:

While trying to reinstall everything, I got this error:
ERROR: Could not find a version that satisfies the requirement torch==1.3.1 (from versions: 1.7.0, 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0)
ERROR: No matching distribution found for torch==1.3.1

I did encounter it the first time around and ended up just commenting torch out of the environment text file and installing it on its own afterwards using pip. Could that be the issue?

@maimechevee (Author) commented:

Hello,
I fixed the problem by uninstalling PyTorch and reinstalling a later version using:

conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch

That seems to do the trick, thanks for your help!
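For anyone hitting the same thing: a quick way to confirm the reinstalled build is actually CUDA-enabled (these are standard torch attributes, not behavenet code):

```python
import torch

print(torch.__version__)          # e.g. a 1.10.x build after the reinstall
print(torch.version.cuda)         # "10.2" for the cudatoolkit=10.2 build; None means CPU-only
print(torch.cuda.is_available())  # should now print True
```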
