
Digits notebook #149

Merged
merged 39 commits into main from digits-notebook on Jun 10, 2021
Conversation

bobturneruk (Collaborator)

Associated with #147

Description

The idea is to test the feasibility of adding an interactive notebook, e.g. with myBinder or Google Colab (or both).

Status

Work in progress


Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).

bobturneruk added the work-in-progress label ("Work in progress that should NOT be merged") on May 17, 2021
github-actions bot added this to "In progress" in v0.1.0 on May 17, 2021
bobturneruk (Collaborator, Author) commented May 17, 2021

Currently invoked with:

python main.py --cfg configs/QUICK-AND-DIRTY.yaml

Output:

DAN:
  METHOD: CDAN
  RANDOM_DIM: 1024
  USERANDOM: False
DATASET:
  DIMENSION: 784
  NAME: digits
  NUM_CLASSES: 10
  NUM_REPEAT: 10
  ROOT: ../data
  SIZE_TYPE: source
  SOURCE: svhn
  TARGET: usps
  WEIGHT_TYPE: natural
OUTPUT:
  DIR: ./outputs\digits_mnist2usps
  PB_FRESH: 0
  ROOT: ./outputs
  VERBOSE: False
SOLVER:
  AD_LAMBDA: True
  AD_LR: True
  BASE_LR: 0.001
  INIT_LAMBDA: 1
  MAX_EPOCHS: 3
  MIN_EPOCHS: 1
  MOMENTUM: 0.9
  NESTEROV: True
  SEED: 2020
  TEST_BATCH_SIZE: 200
  TRAIN_BATCH_SIZE: 150
  TYPE: SGD
  WEIGHT_DECAY: 0.0005
==> Building model for seed 2020 ......
Using downloaded and verified file: ../data\train_32x32.mat
Using downloaded and verified file: ../data\test_32x32.mat
C:\Users\bobturner\.conda\envs\pykale\lib\site-packages\pytorch_lightning\utilities\distributed.py:68: UserWarning: Checkpoint directory ./outputs\digits_mnist2usps\checkpoints\0536afa92da44db5948d0577964a4f6a\CDAN\seed_2020 exists and is not empty.
  warnings.warn(*args, **kwargs)
GPU available: False, used: False
TPU available: False, using: 0 TPU cores

  | Name              | Type                | Params
----------------------------------------------------------
0 | feat              | SmallCNNFeature     | 312 K
1 | classifier        | ClassNetSmallImage  | 24.4 K
2 | domain_classifier | DomainNetSmallImage | 128 K
----------------------------------------------------------
466 K     Trainable params
0         Non-trainable params
466 K     Total params
1.864     Total estimated model params size (MB)

It's taking a very long time to run (>half an hour) even with a small number of epochs.

codecov-commenter commented May 17, 2021

Codecov Report

Merging #149 (f2fc53b) into main (4bf5675) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##             main     #149   +/-   ##
=======================================
  Coverage   88.08%   88.08%           
=======================================
  Files          44       44           
  Lines        4156     4156           
=======================================
  Hits         3661     3661           
  Misses        495      495           

Continue to review full report at Codecov.


haipinglu (Member)

> It's taking a very long time to run (>half an hour) even with a small number of epochs.

To reduce it further:

  • Use MAX_EPOCHS=2 (a 1/3 reduction).
  • Use just 2 out of 10 classes, e.g. 1 and 2. We need to write code to take subsets of source and target (by class label); see the sketch after this comment.

Could you do some profiling to find the most costly part? We may brainstorm more ideas on Thursday too.
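For illustration, a minimal sketch of one way to take a per-class subset with plain PyTorch/torchvision (the dataset choice, root path, and kept classes are assumptions for the example, not PyKale's API):

# Illustrative sketch only: subset a torchvision dataset by class label.
from torch.utils.data import Subset
from torchvision.datasets import USPS

dataset = USPS(root="../data", train=True, download=True)  # assumed path

keep_classes = {1, 2}  # e.g. keep only digits 1 and 2
indices = [i for i, y in enumerate(dataset.targets) if y in keep_classes]
subset = Subset(dataset, indices)
print(f"Kept {len(subset)} of {len(dataset)} samples")

The same index filter can be applied to both the source and target datasets before they are wrapped for training.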

bobturneruk (Collaborator, Author)

Profiling indicates this takes about an hour to run on my PC as-is, so I'll look at reducing the number of classes.

bobturneruk (Collaborator, Author)

Is the envisioned approach to extract train and test sets from source and target, subset them, and then rebuild the DatasetAccess objects for subsequent use, e.g. at

dataset = MultiDomainDatasets(

Maybe I'm misunderstanding or not explaining myself well!

haipinglu (Member)

> Is the envisioned approach to extract train and test sets from source and target, subset them, and then rebuild the DatasetAccess objects for subsequent use, e.g. at
>
> dataset = MultiDomainDatasets(

Yes, the line above gets the source and target datasets, and this line does the splitting to construct a source-target domain dataset (hence "multidomain").

Oh, this question helps me spot another way to reduce the computational cost: setting val_split_ratio to a higher value. The default is 0.1, so 90% of the samples are used for training. We can set it higher, e.g. 0.5, so that only 50% of the samples are used for training, which will give lower accuracy but save quite some training time.

target,
config_weight_type=cfg.DATASET.WEIGHT_TYPE,
config_size_type=cfg.DATASET.SIZE_TYPE,
val_split_ratio=0.5,
Member:

Better to change this via the yaml config rather than hardcoding it, e.g. cfg.DATASET.VAL_RATIO.

Member:

To get it running first, you can set cfg.DATASET.VAL_RATIO=0.9 in the fast version, while still keeping 0.1 as the default in config.py.
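For illustration, assuming config.py follows the usual yacs pattern (a sketch with assumed names, not the project's actual file):

# Hypothetical sketch of adding the new key to a yacs-style config.py.
from yacs.config import CfgNode as CN

_C = CN()
_C.DATASET = CN()
_C.DATASET.VAL_RATIO = 0.1  # default: 90% of samples used for training

def get_cfg_defaults():
    # Return a clone so the defaults themselves stay untouched.
    return _C.clone()

The fast demo yaml would then override DATASET.VAL_RATIO to 0.9 while everything else keeps the defaults.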

Collaborator (Author):

Sure. I was confused by it having a different name.

Collaborator (Author):

Would DATASET.NUM_REPEAT be another target to reduce?

Member:

> Sure. I was confused by it having a different name.

What does your "it" refer to? VAL_SPLIT_RATIO? I was hoping to make it more compact, but if that causes confusion, just use cfg.DATASET.VAL_SPLIT_RATIO.

Member:

> Would DATASET.NUM_REPEAT be another target to reduce?

Oh, I forgot to mention that. That's right, and we are almost there!
That should be the first thing to reduce. Just change it to 1 and you will get a 10x improvement; there is no need to repeat runs for a notebook. Then the notebook should take only minutes to run for 10 classes, and maybe 1 minute for 2 classes.

If you think a flexible subset is useful, we can also make the number of classes in the subset a configurable variable. A fast config might then look like the sketch below.
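For illustration, the fast config could then look something like this (a sketch: VAL_RATIO is the assumed new key discussed above, and the other keys mirror the config dump earlier in this thread):

# Sketch of a fast demo yaml (illustrative values)
DATASET:
  NUM_REPEAT: 1   # a single run instead of 10 repeats: ~10x faster
  VAL_RATIO: 0.9  # assumed new key: train on only 10% of samples
SOLVER:
  MAX_EPOCHS: 2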

bobturneruk (Collaborator, Author)

Got it down to about 5 minutes on my machine. I'll put a notebook together and see how long it takes on myBinder without further modification.

Days              : 0
Hours             : 0
Minutes           : 4
Seconds           : 56
Milliseconds      : 189
Ticks             : 2961896145
TotalDays         : 0.00342812053819444
TotalHours        : 0.0822748929166667
TotalMinutes      : 4.936493575
TotalSeconds      : 296.1896145
TotalMilliseconds : 296189.6145

bobturneruk (Collaborator, Author) commented May 24, 2021

WIP badges: [Binder] [Open In Colab]

Colab needs !pip install pykale[extras].

bobturneruk (Collaborator, Author)

Currently getting this error locally and on myBinder:

AttributeError: 'NoneType' object has no attribute 'best_model_path'

Related: Lightning-AI/pytorch-lightning#2547

Colab needs work so that it clones the whole repo.

haipinglu (Member)

> Currently getting this error locally and on myBinder:
>
> AttributeError: 'NoneType' object has no attribute 'best_model_path'
>
> Related: PyTorchLightning/pytorch-lightning#2547
>
> Colab needs work so that it clones the whole repo.

Did it work previously (when you said it was down to 5 min)?
If yes, some config may be too minimal and behave unexpectedly (e.g. possibly due to MIN_EPOCHS=0?).

" progress_bar_refresh_rate=cfg.OUTPUT.PB_FRESH, # in steps\n",
" min_epochs=cfg.SOLVER.MIN_EPOCHS,\n",
" max_epochs=cfg.SOLVER.MAX_EPOCHS,\n",
" checkpoint_callback=checkpoint_callback,\n",
Member:

Is your PyTorch Lightning >= 1.3.0? Could you try replacing checkpoint_callback=checkpoint_callback with callbacks=[checkpoint_callback]? It can probably fix the above AttributeError. Reference: https://pytorch-lightning.readthedocs.io/en/stable/common/weights_loading.html
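A minimal sketch of the suggested change (illustrative only; the dirpath and monitor values are assumptions):

# PyTorch Lightning >= 1.3 expects checkpointing via the callbacks list.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(dirpath="checkpoints/", monitor="val_loss")

trainer = pl.Trainer(
    min_epochs=1,
    max_epochs=3,
    callbacks=[checkpoint_callback],  # was: checkpoint_callback=checkpoint_callback
)
# After trainer.fit(...), checkpoint_callback.best_model_path is populated.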

Collaborator (Author):

Thanks @sz144 - I'll check this out.

Collaborator (Author):

This has helped and the notebook now runs, with PyTorch Lightning >= 1.3.0 specified in setup.py for extras.

Member:

@haipinglu we need to check our previous examples for compatibility with the latest PyTorch Lightning.

Collaborator (Author):

Or we could pin a specific earlier version.

Member:

Ideally, this should be covered by tests rather than manual checks. Of course, examples are not covered by tests, but I think good tests should do this. The priority should be the API; then, if possible, we could cover the examples. But I do not think this is the priority now.

If we find any compatibility issues, we can add respective tests to cover them, or discuss how to prevent them from happening in the future.

bobturneruk (Collaborator, Author)

I suspect the tests are currently failing due to data download issues, but I'm not 100% sure.

bobturneruk (Collaborator, Author)

Resolves #148

haipinglu (Member) commented May 27, 2021

> Resolves #148

I've linked it on the right to automate closing. See above.

haipinglu (Member) commented May 28, 2021

@bobturneruk

  • The TDC server recovered and I've rerun the tests. All tests have now passed, so we can consider merging when ready.

  • I really like the progress bar now, but the accuracy shows nothing is learned. Please consider the previous suggestion (using the emailed example) to take a subset of 2 classes for the demo. In this case, we could use the saved 80%+ time for more epochs, to show a better-than-random accuracy.

  • Colab is much faster indeed, but it failed to find local modules: ModuleNotFoundError: No module named 'config'. Binder can run, but the waiting time is much longer (counting the installation), which is not good for a live demo.

  • If we still support Colab, then put !pip install pykale[extras] at the top with a comment saying to comment it out for Binder, or comment it out by default and say to uncomment it for Colab (see the sketch after this list).
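One possible way to avoid any manual commenting (an assumption about the approach, not necessarily what the notebook ends up doing) is to detect Colab at runtime in the first cell:

# Illustrative first notebook cell: install only when running on Colab.
import sys

if "google.colab" in sys.modules:
    # Binder builds its environment from the repo, so only Colab needs this.
    !pip install pykale[extras]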

bobturneruk (Collaborator, Author)

Thanks @haipinglu:

  • Thanks for rerunning!
  • I agree, if nothing is learned then it's perhaps merely "machine thinking"! I suggest that ideally this example is not the place to put code for splitting and reconstructing the example data into the right objects. Maybe we should cache the data we need in https://github.com/pykale/data? I'd welcome contributions from others on this task to speed things up.
  • Yes: Colab, by default, opens the notebook rather than building a container from the repository the way Binder does.
  • Yes. I'll have a think about whether there's a more elegant solution. With the changes in this PR, Colab may need to install from this branch rather than PyPI to run properly.

I'll prioritise cutting down the data and increasing the epochs to get myBinder to perform better, unless I hear from you that you'd prefer me to look again at Colab to deliver a speedup without altering the data, perhaps returning to myBinder afterwards.

haipinglu (Member)

> I suggest that ideally this example is not the place to put code for splitting and reconstructing the example data into the right objects. Maybe we should cache the data we need in https://github.com/pykale/data? I'd welcome contributions from others on this task to speed things up.

  • I think the best solution is to follow the example notebook I sent you via email to construct the subset. It should take less than 10 lines of code. Have you tried the notebook? Constructing a new dataset would be much more involved (see mnistm.py and usps.py).
  • I do not think we are allowed to host data not owned by us (even a subset) on pykale/data. We have to ask for permission/a license unless the owner explicitly allows it.
  • Subset extraction could be highly reusable for other examples and eventually become part of the core kale API.

> I'll prioritise cutting down the data and increasing the epochs to get myBinder to perform better, unless I hear from you that you'd prefer me to look again at Colab to deliver a speedup without altering the data, perhaps returning to myBinder afterwards.

Yes, let's get myBinder running first. For Colab, we can deal with it later, maybe by learning from other successful examples to see how they did it. Thanks!

bobturneruk (Collaborator, Author)

Another potential hosting option: [View in Deepnote]

bobturneruk (Collaborator, Author)

Can't currently get it to work: "Unable to duplicate the notebook."

haipinglu (Member)

> Can't currently get it to work: "Unable to duplicate the notebook."

I did not see it used by other ML/PyTorch packages, so I think we should stick to Colab and/or Binder, given the two weeks left until the deadline.

haipinglu (Member)

@bobturneruk: @mustafa1728 has got the 2-class version done, pending integration. We can discuss it later today.

bobturneruk (Collaborator, Author)

Saw the notebook - looks good @mustafa1728!

bobturneruk (Collaborator, Author)

I've learned a bit about PyTorch from reading it!

mustafa1728 (Contributor)

> Saw the notebook - looks good @mustafa1728!

Thanks @bobturneruk. Glad to be of help!

bobturneruk (Collaborator, Author)

Now runs (at least partially) on both Colab and myBinder without the user needing to comment anything in or out. No progress bar in either (but it works locally). Maybe related: Lightning-AI/pytorch-lightning#1112

bobturneruk (Collaborator, Author)

@haipinglu - I think this is now pretty close. The remaining issues from my perspective are:

  • Not sure if I've described "Domain Adaptation" or our outputs correctly.
  • Still pretty slow without a GPU, but the progress bars at least give the user some idea of what's going on. Do we need to profile to see if subsetting actually speeds things up?
  • Some links will need to be changed to point to the main branch when this is merged.
  • Is learning taking place (I think so)? Do we need more epochs?

haipinglu (Member)

> Not sure if I've described "Domain Adaptation" or our outputs correctly.

Adding a link to the Wikipedia article would be good. Then follow its top summary to describe it: https://en.wikipedia.org/wiki/Domain_adaptation

> Still pretty slow without a GPU, but the progress bars at least give the user some idea of what's going on. Do we need to profile to see if subsetting actually speeds things up?

Much more bearable with Colab. Yes, the progress bar is helpful. No need to profile now.

> Some links will need to be changed to point to the main branch when this is merged.

OK.

> Is learning taking place (I think so)? Do we need more epochs?

I saw the 3-class version got 44%, above the 33% random baseline. I tried 2 classes: just 55%, marginally above the 50% random baseline. I tried 2 classes with a 0.1 split: still just 55.7%. We'd better tweak it to a good accuracy.

  • Can we add Colab and Binder links (+ icons) at the top of the notebook like many projects do?

Many thanks. We can aim to merge in today's meeting.

haipinglu (Member)

@bobturneruk It would be useful (for us as well as users) to record and display the time taken for training, testing, and overall.
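For illustration, a minimal way to record these with the standard library (the trainer and model names are assumptions):

# Illustrative sketch: wall-clock timing of training, testing, and overall.
import time

train_start = time.time()
trainer.fit(model)  # assumed training call
train_time = time.time() - train_start

test_start = time.time()
trainer.test(model)  # assumed testing call
test_time = time.time() - test_start

print(f"Training: {train_time:.1f}s, testing: {test_time:.1f}s, "
      f"overall: {train_time + test_time:.1f}s")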

bobturneruk (Collaborator, Author)

> Can we add Colab and Binder links (+ icons) at the top of the notebook like many projects do?

Can you share an example, please, @haipinglu?

haipinglu (Member)

> > Can we add Colab and Binder links (+ icons) at the top of the notebook like many projects do?
>
> Can you share an example, please, @haipinglu?

https://stackoverflow.com/questions/1557571/how-do-i-get-time-of-a-python-programs-execution

bobturneruk (Collaborator, Author)

The button, not the timing. Timing is probably a bit different in IPython. Speak soon!
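In IPython/Jupyter, the built-in cell magic is probably the simplest option (illustrative; the training call is an assumption):

%%time
# Cell magic: prints wall time and CPU time for the whole cell when it finishes.
trainer.fit(model)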

bobturneruk marked this pull request as ready for review on June 10, 2021
bobturneruk merged commit 240c2cb into main on Jun 10, 2021
v0.1.0 automation moved this from In progress to Done on Jun 10, 2021
bobturneruk deleted the digits-notebook branch on June 10, 2021
github-actions bot mentioned this pull request on Jun 21, 2021