
Vb/imagesource #221

Merged
merged 68 commits into from Jun 6, 2023

Conversation

vineetbansal
Collaborator

@vineetbansal vineetbansal commented Feb 10, 2023

I'm starting this new PR (not quite there yet, but close) that demonstrates a new cryodrgn.source.ImageSource class that should simplify calling code quite a bit.

tests/test_source.py is a good place to see how to use this. The basic idea is:

src = ImageSource.from_file(<mrc/star/cs/txt>, lazy=True)
im = src.images(<slice>)  # To get torch.Tensor, or
im = src[<slice>]         # To get torch.Tensor

# Exactly the same usage as above when lazy=False
src = ImageSource.from_file(<mrc/star/cs/txt>, lazy=False)
im = src.images(<slice>)  # To get torch.Tensor, or
im = src[<slice>]         # To get torch.Tensor
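To make the indexing contract above concrete, here is a hedged pure-Python sketch (ToyImageSource is a hypothetical stand-in, not the real cryodrgn.source.ImageSource): `src[indices]` is just sugar for `src.images(indices)`.

```python
class ToyImageSource:
    """Hypothetical stand-in for cryodrgn.source.ImageSource, showing only
    the indexing contract: src[indices] and src.images(indices) agree."""

    def __init__(self, data):
        self._data = data  # pretend these are images

    def images(self, indices):
        # The real ImageSource would return a torch.Tensor of images here.
        return [self._data[i] for i in indices]

    def __getitem__(self, indices):
        return self.images(indices)

src = ToyImageSource(["img0", "img1", "img2"])
assert src[[0, 2]] == src.images([0, 2]) == ["img0", "img2"]
```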

For utilizing this during training:

    data = dataset.ImageDataset(
        mrcfile=args.particles,
        tilt_mrcfile=args.tilt,
        lazy=args.lazy,
        ...
    )
    ...
    data_generator = DataLoader(
        data,
        num_workers=num_workers_per_gpu,
        sampler=BatchSampler(
            RandomSampler(data), batch_size=args.batch_size, drop_last=False
        ),
        batch_size=None,
    )
    ...
    for epoch in range(...):
        for minibatch in data_generator:
            ...

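The DataLoader settings above follow a standard PyTorch pattern: `BatchSampler(RandomSampler(...))` yields whole lists of indices, and `batch_size=None` disables the DataLoader's own batching so each list reaches the dataset as a single `__getitem__` call. A minimal pure-Python sketch of what the sampler produces (not the torch implementation):

```python
import random

def batched_random_indices(n, batch_size, drop_last=False):
    # Pure-Python sketch of BatchSampler(RandomSampler(range(n)), ...):
    # shuffle all indices once, then yield them in batch-sized lists.
    order = list(range(n))
    random.shuffle(order)
    for start in range(0, n, batch_size):
        batch = order[start:start + batch_size]
        if drop_last and len(batch) < batch_size:
            return
        yield batch

batches = list(batched_random_indices(10, batch_size=4))
assert [len(b) for b in batches] == [4, 4, 2]  # drop_last=False keeps the short tail
assert sorted(i for b in batches for i in b) == list(range(10))  # each index exactly once
```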
Quite a few refactorings have already been done in cryodrgn to use this new functionality, but I'm certain there are more. Until test coverage gets close to 100% and I'm sure I haven't overlooked any scripts/utilities, I will keep this PR in draft mode.

We're now also using torch.Tensor for all image data throughout the code (except in a few places where we save to .pkl or .mrc files), and torch.fft for all FFT functions.

Many of the changes in this PR exist because switching to torch.Tensor/torch.fft by default would not have worked otherwise: the codebase assumed np.ndarrays in lots of places.

There is hardly any documentation, and I need to address some newly introduced pyright complaints too.

Performance on lazy/eager datasets is comparable to the old implementation (actually just slightly faster). This is unsurprising because while the new API makes it easier for us to implement chunking, we haven't (yet) changed any logic to do so. I'll add some performance graphs to this PR as we proceed through the chunking logic.

Todo before merge:

  • Add DataShuffler support to all scripts that use a DataLoader
  • Add unit test for data shuffler (test that it returns all the elements; test that if you iterate twice it gives different orders)
  • Refactor TxtFileSource to inherit from _MRCDataFrameSource
  • Decide on whether to merge require_adjacent argument for DataShuffler's use #287 [YES]
  • Decide on whether to merge not supporting getitem on ImageSource anymore #288 [NO]
  • Set --num-workers 1 default everywhere
  • Clean up --max-threads argument in train_vae - decide whether to always use 1 thread, or keep the argument and change comment from FFT threads to data-loading threads.
  • Merge master
  • Test it on real data [blocked on cluster down]
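The shuffler unit test in the to-do list above could look roughly like this (ToyShuffler is a hypothetical stand-in for DataShuffler, sketched only to show the two properties being tested: every element is returned, and two iterations give different orders):

```python
import random

class ToyShuffler:
    # Hypothetical stand-in for DataShuffler: every iteration yields all
    # indices exactly once, in a fresh random order.
    def __init__(self, n, seed=0):
        self.n = n
        self.rng = random.Random(seed)

    def __iter__(self):
        order = list(range(self.n))
        self.rng.shuffle(order)
        return iter(order)

shuffler = ToyShuffler(1000)
first, second = list(shuffler), list(shuffler)
assert sorted(first) == list(range(1000))  # all elements are returned
assert first != second  # iterating twice gives a different order (overwhelmingly likely)
```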

@vineetbansal vineetbansal marked this pull request as draft February 10, 2023 19:43
@adamlerer
Collaborator

Hey @vineetbansal , this looks wild! Please let me know when I should start reviewing this PR, I expect it will take some time to review :)

@vineetbansal vineetbansal marked this pull request as ready for review February 20, 2023 16:23
@vineetbansal
Collaborator Author

Hi @adamlerer. I think the introduction of an ImageSource and its incorporation into the rest of cryodrgn passes basic sanity checks so that you and @zhonge can start looking at this. Yes, I suspect there will be some back-and-forth so I don't want to hide this PR anymore!

Collaborator

@adamlerer adamlerer left a comment


This is super cool! So many useful improvements being worked on here. I'm going to split the review into smaller chunks since there's so much to get through.

I'm seeing a lot of complexity and dangerous code in here that looks like the result of a premature optimization around doing the FFT in-place on the CPU. I think you just want to do the FFT on the GPU on the current batch, without worrying about in-place operations. If there's another way to accomplish this goal, I'd be interested in discussing it.
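For reference, the batched, out-of-place version of a centered 2-D FFT is short; this NumPy sketch (cryodrgn itself uses torch.fft, and the helper name here is assumed) transforms a whole batch along the last two axes without mutating the input:

```python
import numpy as np

def fft2_center(imgs):
    # Centered 2-D FFT over the last two axes of a (batch, D, D) stack.
    # Out-of-place: the input array is left untouched.
    shifted = np.fft.ifftshift(imgs, axes=(-2, -1))
    return np.fft.fftshift(np.fft.fft2(shifted, axes=(-2, -1)), axes=(-2, -1))

batch = np.ones((2, 4, 4))
spec = fft2_center(batch)
# A constant image concentrates all spectral energy in the center bin.
assert abs(spec[0, 2, 2] - 16) < 1e-9
assert abs(spec[0, 0, 0]) < 1e-9
assert (batch == 1).all()  # input was not modified in place
```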

@adamlerer
Collaborator

Btw, the use of max_threads in this PR is a bit concerning. If you look at what --max-threads previously meant (e.g. the help string in train_vae), it was the number of threads used for the FFT; now it's the number of threads used for data loading. --max-threads and --num-workers don't interact well, because --num-workers splits the creation of each batch across N processes, and then --max-threads tries to split each of those into multiple threads, each of which wants as large a sub-batch (from a single file) as possible.

I think we should default --num-workers to 1 everywhere.

@vineetbansal
Collaborator Author

Fair point. num_workers is now set to 1 (and decoupled from --max-threads). There's value in setting num_workers > 1, but nowhere near the gains from the upcoming DataShuffler.

@zhonge zhonge merged commit dcb30eb into master Jun 6, 2023
4 checks passed
@michal-g michal-g deleted the vb/imagesource branch February 22, 2024 19:02