
Speed upgrade - Refactor generate network #152

Merged — 99 commits merged into master from refactor_generate_network on Jun 22, 2022

Conversation

@zktuong (Owner) commented on Jun 7, 2022

Bug fixes and Improvements

  • Speed up generate_network
    • pairwise Hamming distances are now calculated per clone/clonotype, and only when more than 1 cell is assigned to that clone/clonotype.
    • the .distance slot is removed; the distance information is now stored in, and converted directly from, the .graph slot.
    • new options:
      • compute_layout: bool = True. If the dataset is too large, compute_layout can be switched to False, in which case only the networkx graph is returned. The data can still be visualised later with scirpy's plotting method (see the usage sketch after this list).
      • layout_method: Literal['sfdp', 'mod_fr'] = 'sfdp'. The new default uses the ultra-fast, C++-implemented sfdp_layout algorithm from graph-tool to generate the final layout. sfdp stands for Scalable Force Directed Placement.
        • A minor caveat is that the repulsion is not as good: when there are a lot of singleton nodes, they do not separate well unless you work out which sfdp_layout parameters to tweak for an effective separation; changing gamma alone does not seem to do much.
        • The original layout can still be generated by specifying layout_method = 'mod_fr'. graph-tool requires a separate installation via conda (not managed by pip) as it has several C++ dependencies.
        • pytest on macOS may also stall because a different backend is called; this is solved by making the tests that call generate_network run last.
    • added steps to reduce memory hogging.
    • min_size was previously doing the opposite of what was intended; this is now fixed. [BUG] min_size in generate_network #155
  • Speed up transfer
  • Fix [BUG] allow manual paths for germline #154
    • reorder the if-else statements.
  • Speed up filter_contigs
    • tree construction is simplified by replacing for-loops with dictionary updates.
  • Speed up initialise_metadata. Dandelion should now initialise and read faster.
    • Removed an unnecessary data sanitization step when loading data.
    • load_data now renames umi_count to duplicate_count.
    • Speed up Query
      • tree construction is simplified by replacing for-loops with dictionary updates.
      • the AIRR validator is no longer used as it slows things down.
  • data initialised by Dandelion is now ordered by productive status first, followed by umi count (largest to smallest).
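
A minimal usage sketch of the new generate_network options, assuming the usual dandelion workflow where vdj is a Dandelion object and adata is the matching AnnData; exact module paths may differ slightly:

```python
import dandelion as ddl

# Default: fast layout via graph-tool's sfdp_layout.
ddl.tl.generate_network(vdj)

# Very large dataset: skip the layout and only build the networkx graph.
# The network can still be visualised later, e.g. with scirpy's plotting
# functions after transferring the results to AnnData.
ddl.tl.generate_network(vdj, compute_layout=False)
ddl.tl.transfer(adata, vdj)

# Fall back to the original modified Fruchterman-Reingold layout.
ddl.tl.generate_network(vdj, layout_method="mod_fr")
```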

Breaking Changes

  • initialise_metadata/update_metadata/Dandelion
    • For-loops to initialise the object have been vectorized, resulting in a minor speed upgrade.
    • This results in the removal of some columns in the .metadata which were probably bloated and not used.
      • vdj_status and vdj_status_summary removed and replaced with rearrangement_VDJ_status and rearrange_VJ_status
      • constant_status and constant_summary removed and replaced with constant_VDJ_status and constant_VJ_status.
      • productive and productive_summary combined and replaced with productive_status.
      • locus_status and locus_status_summary combined and replaced with locus_status.
      • isotype_summary replaced with isotype_status.
  • Values that were previously unassigned or '' are now the string 'None' in .metadata (see the sketch after this list).
    • They are not changed to NoneType as there is quite a bit of internal text processing that gets messed up if swapped.
    • No_contig will still be populated after transfer to AnnData to reflect cells with no TCR/BCR info.
  • deprecate use of nxviz<0.7.4
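
Because unassigned values in .metadata are now the string 'None' rather than NoneType or '', downstream code should filter with a string comparison. A minimal sketch, assuming vdj is a Dandelion object with the updated .metadata (column name taken from the list above):

```python
# Cells without an assigned isotype carry the string "None" (not NaN/None),
# so use a plain string comparison rather than .isna().
has_isotype = vdj.metadata["isotype_status"] != "None"
print(vdj.metadata.loc[has_isotype, "isotype_status"].value_counts())
```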

Minor changes

  • Rename and deprecate read_h5/write_h5. Use of read_h5ddl/write_h5ddl will be enforced in the next update.
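
For example (a minimal sketch; the writer is assumed to be exposed as a method on the Dandelion object, mirroring the old write_h5, and the filename is purely illustrative):

```python
import dandelion as ddl

# Save and reload a Dandelion object with the new names;
# read_h5/write_h5 still work for now, but read_h5ddl/write_h5ddl
# will be enforced in the next update.
vdj.write_h5ddl("dandelion_results.h5ddl")
vdj = ddl.read_h5ddl("dandelion_results.h5ddl")
```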

@codecov — codecov bot commented on Jun 7, 2022

Codecov Report

Merging #152 (8a20407) into master (5dbd1ab) will increase coverage by 5.61%.
The diff coverage is 85.79%.

@@            Coverage Diff             @@
##           master     #152      +/-   ##
==========================================
+ Coverage   73.41%   79.03%   +5.61%     
==========================================
  Files          22       44      +22     
  Lines        5748     7245    +1497     
==========================================
+ Hits         4220     5726    +1506     
+ Misses       1528     1519       -9     
Flag        Coverage Δ
unittests   79.03% <85.79%> (+5.61%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                                        Coverage Δ
dandelion/logging/_badge.py                           100.00% <ø> (ø)
dandelion/preprocessing/_preprocessing.py             63.87% <ø> (-0.85%) ⬇️
tests/fixtures/fixtures.py                            100.00% <ø> (ø)
tests/fixtures/fixtures_mouse.py                      100.00% <ø> (ø)
dandelion/tools/_gini.py                              87.50% <60.00%> (+0.54%) ⬆️
dandelion/plotting/_plotting.py                       63.77% <64.39%> (-1.58%) ⬇️
dandelion/preprocessing/external/_preprocessing.py    65.28% <71.00%> (+0.14%) ⬆️
dandelion/tools/_network.py                           68.73% <74.81%> (+2.87%) ⬆️
dandelion/logging/_metadata.py                        80.00% <75.00%> (+5.00%) ⬆️
dandelion/tools/_tools.py                             81.15% <75.00%> (-1.02%) ⬇️
... and 58 more

@zktuong zktuong marked this pull request as draft June 7, 2022 17:38
@zktuong zktuong closed this Jun 7, 2022
@zktuong zktuong reopened this Jun 7, 2022

@zktuong zktuong linked an issue Jun 9, 2022 that may be closed by this pull request
@zktuong zktuong merged commit 7f38a03 into master Jun 22, 2022
@zktuong zktuong deleted the refactor_generate_network branch June 22, 2022 09:40