Train from scratch not working #2

Closed
amirhamidihd opened this issue Aug 18, 2021 · 13 comments

@amirhamidihd

I followed your installation steps one by one, but unfortunately the train-from-scratch bash script does not work. This is the log that the code wrote to the output file:

usage: main.py [-h] [--data DIR] [--data_train_root DATA_TRAIN_ROOT]
[--data_train_label DATA_TRAIN_LABEL]
[--data_val_root DATA_VAL_ROOT]
[--data_val_label DATA_VAL_LABEL] [--model MODEL]
[--pretrained] [--initial-checkpoint PATH] [--resume PATH]
[--eval_checkpoint PATH] [--no-resume-opt] [--num-classes N]
[--gp POOL] [--img-size N] [--crop-pct N]
[--mean MEAN [MEAN ...]] [--std STD [STD ...]]
[--interpolation NAME] [-b N] [-vb N] [--opt OPTIMIZER]
[--opt-eps EPSILON] [--opt-betas BETA [BETA ...]]
[--momentum M] [--weight-decay WEIGHT_DECAY] [--clip-grad NORM]
[--sched SCHEDULER] [--lr LR]
[--lr-noise pct, pct [pct, pct ...]] [--lr-noise-pct PERCENT]
[--lr-noise-std STDDEV] [--lr-cycle-mul MULT]
[--lr-cycle-limit N] [--warmup-lr LR] [--min-lr LR]
[--epochs N] [--start-epoch N] [--decay-epochs N]
[--warmup-epochs N] [--cooldown-epochs N] [--patience-epochs N]
[--decay-rate RATE] [--no-aug] [--scale PCT [PCT ...]]
[--ratio RATIO [RATIO ...]] [--hflip HFLIP] [--vflip VFLIP]
[--color-jitter PCT] [--aa NAME] [--aug-splits AUG_SPLITS]
[--jsd] [--reprob PCT] [--remode REMODE] [--recount RECOUNT]
[--resplit] [--mixup MIXUP] [--cutmix CUTMIX]
[--cutmix-minmax CUTMIX_MINMAX [CUTMIX_MINMAX ...]]
[--mixup-prob MIXUP_PROB]
[--mixup-switch-prob MIXUP_SWITCH_PROB]
[--mixup-mode MIXUP_MODE] [--mixup-off-epoch N]
[--smoothing SMOOTHING]
[--train-interpolation TRAIN_INTERPOLATION] [--drop PCT]
[--drop-connect PCT] [--drop-path PCT] [--drop-block PCT]
[--bn-tf] [--bn-momentum BN_MOMENTUM] [--bn-eps BN_EPS]
[--sync-bn] [--dist-bn DIST_BN] [--split-bn] [--model-ema]
[--model-ema-force-cpu] [--model-ema-decay MODEL_EMA_DECAY]
[--seed S] [--log-interval N] [--recovery-interval N] [-j N]
[--num-gpu NUM_GPU] [--save-images] [--amp] [--apex-amp]
[--native-amp] [--channels-last] [--pin-mem] [--no-prefetcher]
[--output PATH] [--eval-metric EVAL_METRIC] [--tta N]
[--local_rank LOCAL_RANK] [--use-multi-epochs-loader]
[--distributed DISTRIBUTED] [--port PORT]
[--repeated_aug REPEATED_AUG]
main.py: error: unrecognized arguments: main.py
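
The launch script is not shown in this thread, so the exact cause is not confirmed here. In general, argparse reports this error when a token it cannot match is left over after parsing, for instance when the script name itself reaches the parser as an argument. A minimal sketch, assuming a hypothetical one-flag parser rather than the real one in main.py:

```python
import argparse

# Hypothetical, reduced parser; the real main.py defines many more flags.
parser = argparse.ArgumentParser(prog="main.py")
parser.add_argument("--model", default="ps_vit_b_14")

# Simulates a launcher that passes the script name twice, so "main.py"
# arrives as a stray argument.
argv = ["main.py", "--model", "ps_vit_b_14"]

# parse_known_args() returns the leftovers instead of exiting,
# which is handy for seeing what the parser could not match.
args, extra = parser.parse_known_args(argv)

# parse_args() prints the usage text and an
# "error: unrecognized arguments: ..." message, then exits with status 2.
try:
    parser.parse_args(argv)
    exit_code = 0
except SystemExit as exc:
    exit_code = exc.code
```

Printing `extra` in a failing run shows which token was not consumed, which narrows down whether the launcher command line or the parser definition is at fault.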

@yuexy
Owner

yuexy commented Aug 18, 2021

Thanks for the report.
We have fixed the bug. Please fetch the latest version and try again.

@amirhamidihd
Author

Thanks for your support.
Is the train_distributed.sh file the same as before? I cannot find any differences.

@yuexy
Owner

yuexy commented Aug 18, 2021

I corrected the script. You can check the commit history for more details.

@amirhamidihd
Author

I have cloned it again, but the error still occurs.

@yuexy
Owner

yuexy commented Aug 18, 2021

The script works well for me. Maybe you should check your environment.

@amirhamidihd
Author

Thanks again. I cloned it twice and now get this error:
[TORCH] Training in distributed mode. Process 0, local 0, total 1.
Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total 1.
Traceback (most recent call last):
  File "main.py", line 803, in <module>
    main()
  File "main.py", line 327, in main
    checkpoint_path=args.initial_checkpoint
  File "/home/amir/anaconda3/envs/ps_vit/lib/python3.7/site-packages/timm/models/factory.py", line 59, in create_model
    raise RuntimeError('Unknown model (%s)' % model_name)
RuntimeError: Unknown model (ps_vit_b_14)
Traceback (most recent call last):
  File "/home/amir/anaconda3/envs/ps_vit/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/amir/anaconda3/envs/ps_vit/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/amir/anaconda3/envs/ps_vit/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/amir/anaconda3/envs/ps_vit/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/amir/anaconda3/envs/ps_vit/bin/python3', '-u', 'main.py', '--local_rank=0', '--config=/home/amir/Documents/GitHub/PS-ViT/configs/ps_vit_b_14.yaml', '--distributed=True']' returned non-zero exit status 1.

@yuexy
Owner

yuexy commented Aug 18, 2021

What is your timm version?

@amirhamidihd
Author

0.3.4

@amirhamidihd
Author

I have run your code multiple times and keep getting the following error.

The error is raised by create_model; tracing it further shows that it comes from "is_model" in the timm library.

PS-ViT/main.py

Line 315 in f888a29

model = create_model(

The error is as follows:

raise RuntimeError('Unknown model (%s)' % model_name)
RuntimeError: Unknown model (ps_vit_b_14)

This happens because "ps_vit_b_14" is not defined in the timm library, so timm does not recognize it.

I guess you have modified timm for your own code, or you missed something in the code you uploaded.

@yuexy
Owner

yuexy commented Aug 19, 2021

I have checked the installation pipeline in a new environment, and I cannot reproduce the issue.
The model "ps_vit_b_14" has been registered, please refer to code.
Did you make any modifications to the code or config?

@amirhamidihd
Author

To clone your repository, I changed git clone git@github.com:yuexy/PS-ViT.git to git clone https://github.com/yuexy/PS-ViT.git because of this error:

Cloning into 'PS-ViT'...
Warning: Permanently added the RSA host key for IP address '140.82.121.4' to the list of known hosts.
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

In the code, I only commented out import models at line 9, because it is not used in your main.py, and changed initial-checkpoint to initial_checkpoint at line 51.

When I debug the code, execution goes into the timm library after the line where create_model is called, and ps_vit_b_14 is not defined in timm.

Maybe I should do something in the timm library, such as copying the model file into the timm library path, or something else.

It is clear that timm version 0.3.4 does not include ps_vit.
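
The thread does not confirm the root cause, but the owner's remark that the model "has been registered" points at a registry lookup: a name is known to create_model only after the module that defines it has been imported. The sketch below is a simplified stand-in for that pattern, not timm's actual code; the function names mirror those seen in the thread, and import_models_module is a hypothetical helper that plays the role of an `import models` line.

```python
# Simplified stand-in for a model registry. NOT timm's implementation.
_model_entrypoints = {}


def register_model(fn):
    """Decorator: record the function under its own name."""
    _model_entrypoints[fn.__name__] = fn
    return fn


def is_model(name):
    return name in _model_entrypoints


def create_model(name, **kwargs):
    if not is_model(name):
        raise RuntimeError('Unknown model (%s)' % name)
    return _model_entrypoints[name](**kwargs)


def import_models_module():
    """Hypothetical stand-in for `import models`: running the module body
    executes the decorator, which performs the registration."""
    @register_model
    def ps_vit_b_14(**kwargs):
        # A dict stands in for the real network object.
        return {"name": "ps_vit_b_14", **kwargs}


# Before the defining module is imported, the lookup fails.
try:
    create_model("ps_vit_b_14")
    error_before = None
except RuntimeError as err:
    error_before = str(err)

# After the import, the same call succeeds.
import_models_module()
model = create_model("ps_vit_b_14", num_classes=1000)
```

Under this reading, an import that looks unused can still matter because of its side effect, so removing it would leave the name unregistered. Whether that applied here is an assumption; the reporter's fix was to set up ps_vit again.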

@amirhamidihd
Author

Thanks for your help. By setting up ps_vit again (build.py), I finally resolved my issue.

@yuexy
Owner

yuexy commented Aug 19, 2021

Thanks for your feedback. I guess I can close this issue then. :)

@yuexy yuexy closed this as completed Aug 19, 2021