
Digits notebook #149

Merged
merged 39 commits into main from digits-notebook on Jun 10, 2021
Conversation

bobturneruk (Collaborator)

Associated with #147

Description

The idea is to test the feasibility of adding an interactive notebook, e.g. with myBinder or Google Colab (or both).

Status

Work in progress


Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).

bobturneruk added the work-in-progress label ("Work in progress that should NOT be merged") on May 17, 2021
github-actions bot added this to "In progress" in v0.1.0 on May 17, 2021
bobturneruk (Collaborator, Author) commented May 17, 2021

Currently invoked with:

python main.py --cfg configs/QUICK-AND-DIRTY.yaml

Output:

DAN:
  METHOD: CDAN
  RANDOM_DIM: 1024
  USERANDOM: False
DATASET:
  DIMENSION: 784
  NAME: digits
  NUM_CLASSES: 10
  NUM_REPEAT: 10
  ROOT: ../data
  SIZE_TYPE: source
  SOURCE: svhn
  TARGET: usps
  WEIGHT_TYPE: natural
OUTPUT:
  DIR: ./outputs\digits_mnist2usps
  PB_FRESH: 0
  ROOT: ./outputs
  VERBOSE: False
SOLVER:
  AD_LAMBDA: True
  AD_LR: True
  BASE_LR: 0.001
  INIT_LAMBDA: 1
  MAX_EPOCHS: 3
  MIN_EPOCHS: 1
  MOMENTUM: 0.9
  NESTEROV: True
  SEED: 2020
  TEST_BATCH_SIZE: 200
  TRAIN_BATCH_SIZE: 150
  TYPE: SGD
  WEIGHT_DECAY: 0.0005
==> Building model for seed 2020 ......
Using downloaded and verified file: ../data\train_32x32.mat
Using downloaded and verified file: ../data\test_32x32.mat
C:\Users\bobturner\.conda\envs\pykale\lib\site-packages\pytorch_lightning\utilities\distributed.py:68: UserWarning: Checkpoint directory ./outputs\digits_mnist2usps\checkpoints\0536afa92da44db5948d0577964a4f6a\CDAN\seed_2020 exists and is not empty.
  warnings.warn(*args, **kwargs)
GPU available: False, used: False
TPU available: False, using: 0 TPU cores

  | Name              | Type                | Params
----------------------------------------------------------
0 | feat              | SmallCNNFeature     | 312 K
1 | classifier        | ClassNetSmallImage  | 24.4 K
2 | domain_classifier | DomainNetSmallImage | 128 K
----------------------------------------------------------
466 K     Trainable params
0         Non-trainable params
466 K     Total params
1.864     Total estimated model params size (MB)

It's taking a very long time to run (>half an hour) even with a small number of epochs.

codecov-commenter commented May 17, 2021

Codecov Report

Merging #149 (f2fc53b) into main (4bf5675) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##             main     #149   +/-   ##
=======================================
  Coverage   88.08%   88.08%           
=======================================
  Files          44       44           
  Lines        4156     4156           
=======================================
  Hits         3661     3661           
  Misses        495      495           

Continue to review full report at Codecov.


haipinglu (Member)

> It's taking a very long time to run (>half an hour) even with a small number of epochs.

To reduce it further:

  • Use MAX_EPOCHS=2 (a 1/3 reduction).
  • Use just 2 out of 10 classes, e.g. 1 and 2. We need to write code to take subsets of source and target (by class label); see the sketch after this comment.

Could you do some profiling to find the most costly part? We may brainstorm more ideas on Thursday too.
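For illustration, a minimal sketch of one way to take a per-class subset with plain PyTorch/torchvision (the dataset choice, root path, and kept classes are assumptions for the example, not PyKale's API):

# Illustrative sketch only: subset a torchvision dataset by class label.
from torch.utils.data import Subset
from torchvision.datasets import USPS

dataset = USPS(root="../data", train=True, download=True)  # assumed path

keep_classes = {1, 2}  # e.g. keep only digits 1 and 2
indices = [i for i, y in enumerate(dataset.targets) if y in keep_classes]
subset = Subset(dataset, indices)
print(f"Kept {len(subset)} of {len(dataset)} samples")

The same index filter can be applied to both the source and target datasets before they are wrapped for training.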

bobturneruk (Collaborator, Author)

Profiling indicates this takes about an hour to run on my PC as-is, so I'll look at reducing the number of classes.

bobturneruk (Collaborator, Author)

Is the envisioned approach to extract train and test sets from source and target, subset them, and then rebuild the DatasetAccess objects for subsequent use, e.g. at

dataset = MultiDomainDatasets(

Maybe I'm misunderstanding or not explaining myself well!

haipinglu (Member)

> Is the envisioned approach to extract train and test sets from source and target, subset them, and then rebuild the DatasetAccess objects for subsequent use, e.g. at
>
> dataset = MultiDomainDatasets(

Yes, the line above gets the source and target datasets, and this line does the splitting to construct a source-target domain dataset (hence "multidomain").

Oh, this question helps me spot another way to reduce the computational cost: setting val_split_ratio to a higher value. The default is 0.1, so 90% of the samples are used for training. We can set it higher, e.g. 0.5, so that only 50% of the samples are used for training, which will give lower accuracy but save quite some training time.

target,
config_weight_type=cfg.DATASET.WEIGHT_TYPE,
config_size_type=cfg.DATASET.SIZE_TYPE,
val_split_ratio=0.5,
Member:

Better to change this via the yaml config rather than hardcoding it, e.g. cfg.DATASET.VAL_RATIO.

Member:

To get it running first, you can set cfg.DATASET.VAL_RATIO=0.9 in the fast version, while still keeping 0.1 as the default in config.py.
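For illustration, assuming config.py follows the usual yacs pattern (a sketch with assumed names, not the project's actual file):

# Hypothetical sketch of adding the new key to a yacs-style config.py.
from yacs.config import CfgNode as CN

_C = CN()
_C.DATASET = CN()
_C.DATASET.VAL_RATIO = 0.1  # default: 90% of samples used for training

def get_cfg_defaults():
    # Return a clone so the defaults themselves stay untouched.
    return _C.clone()

The fast demo yaml would then override DATASET.VAL_RATIO to 0.9 while everything else keeps the defaults.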

Collaborator (Author):

Sure. I was confused by it having a different name.

Collaborator (Author):

Would DATASET.NUM_REPEAT be another target to reduce?

Member:

> Sure. I was confused by it having a different name.

What does your "it" refer to? VAL_SPLIT_RATIO? I was hoping to make it more compact, but if that causes confusion, just use cfg.DATASET.VAL_SPLIT_RATIO.

Member:

> Would DATASET.NUM_REPEAT be another target to reduce?

Oh, I forgot to mention that. That's right, and we are almost there!
That should be the first thing to reduce. Just change it to 1 and you will get a 10x improvement; there is no need to repeat runs for a notebook. Then the notebook should take only minutes to run for 10 classes, and maybe 1 minute for 2 classes.

If you think a flexible subset is useful, we can also make the number of classes in the subset a configurable variable. A fast config might then look like the sketch below.
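For illustration, the fast config could then look something like this (a sketch: VAL_RATIO is the assumed new key discussed above, and the other keys mirror the config dump earlier in this thread):

# Sketch of a fast demo yaml (illustrative values)
DATASET:
  NUM_REPEAT: 1   # a single run instead of 10 repeats: ~10x faster
  VAL_RATIO: 0.9  # assumed new key: train on only 10% of samples
SOLVER:
  MAX_EPOCHS: 2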

bobturneruk (Collaborator, Author)

Got it down to about 5 minutes on my machine. I'll put a notebook together and see how long it takes on myBinder without further modification.

Days              : 0
Hours             : 0
Minutes           : 4
Seconds           : 56
Milliseconds      : 189
Ticks             : 2961896145
TotalDays         : 0.00342812053819444
TotalHours        : 0.0822748929166667
TotalMinutes      : 4.936493575
TotalSeconds      : 296.1896145
TotalMilliseconds : 296189.6145

bobturneruk (Collaborator, Author) commented May 24, 2021

WIP badges: [Binder] [Open In Colab]

Colab needs !pip install pykale[extras].

bobturneruk (Collaborator, Author)

Currently getting this error locally and on myBinder:

AttributeError: 'NoneType' object has no attribute 'best_model_path'

Related: Lightning-AI/pytorch-lightning#2547

Colab needs work so that it clones the whole repo.

haipinglu (Member)

> Currently getting this error locally and on myBinder:
>
> AttributeError: 'NoneType' object has no attribute 'best_model_path'
>
> Related: PyTorchLightning/pytorch-lightning#2547
>
> Colab needs work so that it clones the whole repo.

Did it work previously (when you said it was down to 5 min)?
If yes, some config may be too minimal and behave unexpectedly (e.g. possibly due to MIN_EPOCHS=0?).

" progress_bar_refresh_rate=cfg.OUTPUT.PB_FRESH, # in steps\n",
" min_epochs=cfg.SOLVER.MIN_EPOCHS,\n",
" max_epochs=cfg.SOLVER.MAX_EPOCHS,\n",
" checkpoint_callback=checkpoint_callback,\n",
Member:

Is your PyTorch Lightning >= 1.3.0? Could you try replacing checkpoint_callback=checkpoint_callback with callbacks=[checkpoint_callback]? It can probably fix the above AttributeError. Reference: https://pytorch-lightning.readthedocs.io/en/stable/common/weights_loading.html
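A minimal sketch of the suggested change (illustrative only; the dirpath and monitor values are assumptions):

# PyTorch Lightning >= 1.3 expects checkpointing via the callbacks list.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(dirpath="checkpoints/", monitor="val_loss")

trainer = pl.Trainer(
    min_epochs=1,
    max_epochs=3,
    callbacks=[checkpoint_callback],  # was: checkpoint_callback=checkpoint_callback
)
# After trainer.fit(...), checkpoint_callback.best_model_path is populated.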

Collaborator (Author):

Thanks @sz144 - I'll check this out.

Collaborator (Author):

This has helped and the notebook now runs, with PyTorch Lightning >= 1.3.0 specified in setup.py for extras.

Member:

@haipinglu we need to check our previous examples for compatibility with the latest PyTorch Lightning.

Collaborator (Author):

Or we could pin a specific earlier version.

Member:

Ideally, this should be covered by tests rather than manual checks. Of course, examples are not covered by tests, but I think good tests should do this. The priority should be the API; then, if possible, we could cover the examples. But I do not think this is the priority now.

If we find any compatibility issues, we can add respective tests to cover them, or discuss how to prevent them from happening in the future.

bobturneruk (Collaborator, Author)

I suspect the tests are currently failing due to data download issues, but I'm not 100% sure.

bobturneruk (Collaborator, Author)

Resolves #148

haipinglu (Member) commented May 27, 2021

> Resolves #148

I've linked it on the right to automate closing. See above.

haipinglu (Member) commented May 28, 2021

@bobturneruk

  • The TDC server recovered and I've rerun the tests. All tests have now passed, so we can consider merging when ready.

  • I really like the progress bar now, but the accuracy shows nothing is learned. Please consider the previous suggestion (using the emailed example) to take a subset of 2 classes for the demo. In this case, we could use the saved 80%+ time for more epochs, to show a better-than-random accuracy.

  • Colab is much faster indeed, but it failed to find local modules: ModuleNotFoundError: No module named 'config'. Binder can run, but the waiting time is much longer (counting the installation), which is not good for a live demo.

  • If we still support Colab, then put !pip install pykale[extras] at the top with a comment saying to comment it out for Binder, or comment it out by default and say to uncomment it for Colab (see the sketch after this list).
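One possible way to avoid any manual commenting (an assumption about the approach, not necessarily what the notebook ends up doing) is to detect Colab at runtime in the first cell:

# Illustrative first notebook cell: install only when running on Colab.
import sys

if "google.colab" in sys.modules:
    # Binder builds its environment from the repo, so only Colab needs this.
    !pip install pykale[extras]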

bobturneruk (Collaborator, Author)

Thanks @haipinglu:

  • Thanks for rerunning!
  • I agree, if nothing is learned then it's perhaps merely "machine thinking"! I suggest that ideally this example is not the place to put code for splitting and reconstructing the example data into the right objects. Maybe we should cache the data we need in https://github.com/pykale/data? I'd welcome contributions from others on this task to speed things up.
  • Yes: Colab, by default, opens the notebook rather than building a container from the repository the way Binder does.
  • Yes. I'll have a think about whether there's a more elegant solution. With the changes in this PR, Colab may need to install from this branch rather than PyPI to run properly.

I'll prioritise cutting down the data and increasing the epochs to get myBinder to perform better, unless I hear from you that you'd prefer me to look again at Colab to deliver a speedup without altering the data, perhaps returning to myBinder afterwards.

haipinglu (Member)

> I suggest that ideally this example is not the place to put code for splitting and reconstructing the example data into the right objects. Maybe we should cache the data we need in https://github.com/pykale/data? I'd welcome contributions from others on this task to speed things up.

  • I think the best solution is to follow the example notebook I sent you via email to construct the subset. It should take less than 10 lines of code. Have you tried the notebook? Constructing a new dataset would be much more involved (see mnistm.py and usps.py).
  • I do not think we are allowed to host data not owned by us (even a subset) on pykale/data. We have to ask for permission/a license unless the owner explicitly allows it.
  • Subset extraction could be highly reusable for other examples and eventually become part of the core kale API.

> I'll prioritise cutting down the data and increasing the epochs to get myBinder to perform better, unless I hear from you that you'd prefer me to look again at Colab to deliver a speedup without altering the data, perhaps returning to myBinder afterwards.

Yes, let's get myBinder running first. For Colab, we can deal with it later, maybe by learning from other successful examples to see how they did it. Thanks!

bobturneruk (Collaborator, Author)

Another potential hosting option: [View in Deepnote]

bobturneruk (Collaborator, Author)

Can't currently get it to work: "Unable to duplicate the notebook."

haipinglu (Member)

> Can't currently get it to work: "Unable to duplicate the notebook."

I did not see it used by other ML/PyTorch packages, so I think we should stick to Colab and/or Binder, given the two weeks left until the deadline.

haipinglu (Member)

@bobturneruk: @mustafa1728 has got the 2-class version done, pending integration. We can discuss it later today.

bobturneruk (Collaborator, Author)

Saw the notebook - looks good @mustafa1728!

bobturneruk (Collaborator, Author)

I've learned a bit about PyTorch from reading it!

mustafa1728 (Contributor)

> Saw the notebook - looks good @mustafa1728!

Thanks @bobturneruk. Glad to be of help!

bobturneruk (Collaborator, Author)

Now runs (at least partially) on both Colab and myBinder without the user needing to comment anything in or out. No progress bar in either (but it works locally). Maybe related: Lightning-AI/pytorch-lightning#1112

bobturneruk (Collaborator, Author)

@haipinglu - I think this is now pretty close. The remaining issues from my perspective are:

  • Not sure if I've described "Domain Adaptation" or our outputs correctly.
  • Still pretty slow without a GPU, but the progress bars at least give the user some idea of what's going on. Do we need to profile to see if subsetting actually speeds things up?
  • Some links will need to be changed to point to the main branch when this is merged.
  • Is learning taking place (I think so)? Do we need more epochs?

haipinglu (Member)

> Not sure if I've described "Domain Adaptation" or our outputs correctly.

Adding a link to the Wikipedia article would be good. Then follow its top summary to describe it: https://en.wikipedia.org/wiki/Domain_adaptation

> Still pretty slow without a GPU, but the progress bars at least give the user some idea of what's going on. Do we need to profile to see if subsetting actually speeds things up?

Much more bearable with Colab. Yes, the progress bar is helpful. No need to profile now.

> Some links will need to be changed to point to the main branch when this is merged.

OK.

> Is learning taking place (I think so)? Do we need more epochs?

I saw the 3-class version got 44%, above the 33% random baseline. I tried 2 classes: just 55%, marginally above the 50% random baseline. I tried 2 classes with a 0.1 split: still just 55.7%. We'd better tweak it to a good accuracy.

  • Can we add Colab and Binder links (+ icons) at the top of the notebook like many projects do?

Many thanks. We can aim to merge in today's meeting.

haipinglu (Member)

@bobturneruk It would be useful (for us as well as users) to record and display the time taken for training, testing, and overall.
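For illustration, a minimal way to record these with the standard library (the trainer and model names are assumptions):

# Illustrative sketch: wall-clock timing of training, testing, and overall.
import time

train_start = time.time()
trainer.fit(model)  # assumed training call
train_time = time.time() - train_start

test_start = time.time()
trainer.test(model)  # assumed testing call
test_time = time.time() - test_start

print(f"Training: {train_time:.1f}s, testing: {test_time:.1f}s, "
      f"overall: {train_time + test_time:.1f}s")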

bobturneruk (Collaborator, Author)

> Can we add Colab and Binder links (+ icons) at the top of the notebook like many projects do?

Can you share an example, please, @haipinglu?

haipinglu (Member)

> > Can we add Colab and Binder links (+ icons) at the top of the notebook like many projects do?
>
> Can you share an example, please, @haipinglu?

https://stackoverflow.com/questions/1557571/how-do-i-get-time-of-a-python-programs-execution

bobturneruk (Collaborator, Author)

The button, not the timing. Timing is probably a bit different in IPython. Speak soon!
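In IPython/Jupyter, the built-in cell magic is probably the simplest option (illustrative; the training call is an assumption):

%%time
# Cell magic: prints wall time and CPU time for the whole cell when it finishes.
trainer.fit(model)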

bobturneruk marked this pull request as ready for review on June 10, 2021
bobturneruk merged commit 240c2cb into main on Jun 10, 2021
v0.1.0 automation moved this from In progress to Done on Jun 10, 2021
bobturneruk deleted the digits-notebook branch on June 10, 2021
github-actions bot mentioned this pull request on Jun 21, 2021