Updated seqFISH Decoding Method: CheckAll Decoder #1978

nickeener · 2022-05-17T20:29:53Z

This is an update to my previous PR for a new seqFISH decoding method. We discovered some problems with the method in regards to the false positive rate and have spent the past several months improving it to it's current state and think it is now fit to be published.

So recapping the previous PR, we found that starFISH's seqFISH decoding (PerRoundMaxChannel with NEAREST_NEIGHBOR or EXACT_MATCH TraceBuildingStrategy) was finding far fewer targets in our seqFISH datasets than we were expecting from previous results and the reason for this is the limited search space within which starFISH looks for barcodes. For both TraceBuildingStrategy's, spots in different rounds are only connected into barcodes if they are spatially close to a spot in an arbitrary anchor round which does not utilize the inherent error correction that can be done by comparing the barcodes found using each round as an anchor. Additionally, the EXACT_MATCH TraceBuildingStrategy requires spots in different rounds to be in the exact same pixel location in order to connected into barcodes while NEAREST_NEIGHBOR requires that they be nearest neighbors, which are both extremely sensitive to spot drift between rounds, requiring precise registration that isn't always possible. The PerRoundMaxChannel decoder also lacks the ability to find error-corrected barcodes (in codebook where all codes have at least a hamming distance of 2 each code can still be uniquely identified by a barcode that is missing one round).

To address this, I attempted to write a seqFISH decoding method that emulates the one used by the lab where seqFISH originated at Caltech. In this method, the spots in all rounds are separately used as anchors to search to spots within a search radius of the anchor spots. So for each spot, all the possible barcodes that could constructed using the spots in other rounds within the search radius are considered, which is a much larger search space than when only considering exact matches or nearest neighbors. Each anchor spot is then assigned a "best" barcode out of its total possible set (it "checks all" of the possible barcodes) based on the spatial variance and intensities of the spots that make it up and whether it has a match to codebook, though the order of those steps will vary.

This larger search space requires a few extra tricks to keep false positive calls down. One of these is a filter that removes barcodes that were not found as best for a certain number of the spots/rounds that make it up (usually one less than the total round number) which the Caltech group calls the seed filter. So if we had a 5 round barcode and that barcode was chosen as the best choice for only 3 of the spots that make it up, it would not pass the filter. I've also introduced a parameter I call strictness that sets a maximum number of possible barcodes a given spot is allowed to have before it is dropped as ambiguous. Barcodes in spot dense regions tend to be difficult to call accurately so it is useful to be able to restrict that if necessary. Another trick I found helpful was to run decoding in multiple nested stages that start off with stricter parameters which are loosened as decoding progresses. This includes the search radius (starts off with radius 0 and increments up to the user-specified value), the number of rounds that can be omitted from the barcode to make a match (only looks for full codes first then looks for partials) and the max allowed score (based on spatial variance and intensity of spots). Between each decoding, the spots that were found to be in decoded barcodes that pass all filters are removed from the original spot set before decoding the remaining spots. This allows you to call high confidence barcodes first and remove their spots, making calling adjacent barcodes easier. Running the whole decoding process repeatedly became quite time consuming so I have added in multiprocessing capability to allow users to speed things up using Python's standard multiprocessing library.

In addition to the filters and incremental decoding I also take advantage of two slightly different decoding algorithms with different precision/recall properties. In one method, which I call "filter-first", once all the possible barcodes for each spot are assembled, the "best" barcode for each spot is chosen based on a scoring function that uses the spatial variance of the spots and their intensities. The chosen barcodes are then matched to the codebook and those that have no match are dropped. The other method, called "decode-first", instead matches all the possible barcodes for each spot to the codebook first and then if there are multiple matches the best is chosen based on the distance score. The "filter-first" method tends to be higher accuracy but return fewer decoded targets (high precision/low recall) while the "decode-first" method finds more mRNA targets but at a cost to accuracy (low precision/high recall). In keeping with my incremental decoding strategy I described earlier, the first set of decoding is done using the high accuracy "filter-first" method followed by decodings done using the low accuracy version.

Below is a description of the required inputs and a step by step explanation of the algorithm.

Inputs:

spots - starFISH SpotFindingResults object
codebook - starFISH Codebook object
search_radius - pixel distance that spots can be from each other and be connected into a barcode
error_rounds - number of rounds that can be dropped from each code and still allow unique barcode matching (only values are 0 or 1 are currently allowed)
mode - accuracy mode, determines settings of several different parameters ("high", "med", or "low")
physical_coords - boolean for whether to using the physical coordinate values found in the spots object for calculating distance between spots or use pixel coordinates

Steps

For each spot in each round, calculate a scaled intensity value by dividing the spot's intensity by the l2 norm of all the spots found in the same channel and same 100 radius pixel neighborhood.
For each spot in each round, find all neighboring spots in different rounds within the search radius.
From each spot and their neighbors, build all possible combinations of spots that form barcodes using their channel labels.
Choose a "best" barcode for each spot using the scoring function and drop barcodes that don't have codebook matches (this is for the filter-first method and is switched for decode-first).
Drop barcodes that were not found as best for most of the spots that make it up (usually one less than the round total).
Choose between overlapping barcodes (barcodes that share spots) using their scores.
Remove spots found in passing barcodes from original spot set.
(2-7 is repeated, first using the filter-first method for each incremental radius and then using the decode-first method for each radius)
One final decoding is done if error_rounds parameter > 0, where partial barcodes are allowed. Uses filter-first method and strict parameters as the barcodes it finds are more error prone than full barcodes.
Turn overall barcode set into DecodedIntensityTable object and return

One final detail worth explaining here is the mode parameter. This parameter can take three values: "high", "med", or "low" and essentially simplifies a number of parameter choices into these three presets including the seed filter value at the end of each decoding step, the max allowed score allowed for a barcode, and the strictness value for the filter-first and decode-first methods. "high" corresponds to high accuracy (lower recall), "low" corresponds to low accuracy (higher recall), and "med" is a balance between the two.

To evaluate the results of our decodings and compare them with starFISH methods, we used two different kinds of qc measures. The first is correlation with orthogonal RNA counts (either smFISH or RNAseq) and the other is the proportion of false positive calls measured by placing "blank" codes in the codebook that adhere to the hamming distance rules of the codebook but don't actually correspond to any true mRNA. This false positive metric was lacking from my previous PR and when we ran it on the previous method our false positive values were quite high. Depending on the method used to calculate the false positive rate and the specific dataset, high false positive rates can be a problem with seqFISH data but the mode parameter allows you to easily adjust the decoding parameters to get the best results.

Below are some figures showing results from 3 different seqFISH datasets comparing the starFISH PerRoundMaxChannel + NEAREST_NEIGHBORS method with each mode choice for the CheckAll decoder. Each red/blue bar pair along the x-axis represent a single cell, FOV, or image chunk (don't have segmented cells yet for this dataset so it is broken into equal sized chunks), while the height of the blue bar represents the total number of on-target (real) barcodes found in that cell/fov/chunk and the red bar height is the off-target count (blank) with the black bar showing the median on-target count. Also printed on each graph is the overall false positive (FP) rate and the pearson correlation with smFISH or RNAseq where available. For each dataset the same set of spots was used for each run.

Intron seqFISH

Mouse ESC lines
10,421 real genes + ~9,000 blank genes
29 (close) z slices
This was the most difficult of the three datasets to get good results from. The large number of genes and close by z slices make finding true barcodes tricky.

RNA SPOTs seqFISH

No cells, just extracted RNA on a slide
10,421 real genes + ~9,000 blank genes
single z slice
Pretty easy dataset as there was no cellular autofluorescence

Mouse Atlas seqFISH

slices of mouse hippocampus
351 real genes, 351 blanks
6 (distant) z slices
Lower false positives than intron seqFISH probably from the lower gene count and distant z slices reducing confusion. Not sure why starFISH does so poorly here though but the FP rate is almost 3x the low accuracy rate for the new decoder.

From these, you can see that this updated decoder is capable of finding more targets than starFISH, with lower false positive rates, and higher smFISH/RNAseq correlations. The one area where the starFISH methods have us beat is speed. Being more thorough is costly so while starFISH can decode a set of spots in 2-3 minutes on my system, this new decoder can take up to 1.5 hours even when using 16 different simultaneous processes. That time is from the intron seqFISH dataset which is quite large (5 x 12 x 29 x 2048 x 2048) so the run times are more reasonable for smaller datasets.

Running the "make all" tests for my current repo throws a mypy error but in code that I haven't touched (like the ImageStack class code) so hopefully we can work together to get those kinks worked out so we can add this to starFISH so others might use it.

berl · 2022-05-18T04:45:34Z

@nickeener this is really cool to see- thanks for the contribution! I'll take a look at the test/CI situation and see if I can help

nickeener · 2022-06-13T20:50:53Z

@berl Any updates on this? It appears that it is erroring out trying to download the sphinx-bootstrap-theme dependency on Python 3.9 but is otherwise passing all the initial tests.

berl · 2022-06-18T22:40:00Z

@nickeener yeah, this has been fixed in sphinx-bootstrap-theme==0.8.1, so that should get you past the install tests.

add this line to the REQUIREMENTS.txt file:
sphinx-bootstrap-theme==0.8.1
and then run
make pin-all-requirements
which will propagate the new requirement into the REQUIREMENTS-*.txt files.
This results in a successful make install-dev on my mac, but I haven't done any other testing past this. if you commit back to this, we'll get the CI tests

nickeener · 2022-06-20T20:36:20Z

@berl Running into this error after I add the sphinx-bootstrap-theme==0.8.1 line to REQUIREMENTS.txt and then run make pin-all-requirements:

make starfish/REQUIREMENTS-STRICT.txt requirements/REQUIREMENTS-CI.txt requirements/REQUIREMENTS-NAPARI-CI.txt
make[1]: Entering directory '/home/nick/projects/spacetx/decoder/starfish'
[ ! -e .REQUIREMENTS.txt-env ] || exit 1
make[1]: *** [Makefile:103: starfish/REQUIREMENTS-STRICT.txt] Error 1
make[1]: Leaving directory '/home/nick/projects/spacetx/decoder/starfish'
make: *** [Makefile:98: pin-all-requirements] Error 2

Any advice on this?

berl · 2022-06-21T22:15:19Z

@nickeener did you run make pin-all-requirements from inside the starfish directory? my output doesn't show

make[1]: Entering directory '/home/nick/projects/spacetx/decoder/starfish'

but the error is in line 103 where it checks for a virtual environment. can you create a virtual environment with python=3.9 (using conda or venv ) and activate it before running make pin-all-requirements?

Also, what's your OS and conda version if applicable?

nickeener · 2022-06-22T00:32:46Z

@berl Yes I am in the top level starfish directory when I run make pin-all-requirements. I've tried creating a fresh python3.9 conda env and then installing everything with make install-dev but I get the same exact error.

My OS is Ubuntu 20.04.4 and my conda version is conda 4.13.0.

Would you be able to share the new correctly built requirements files with me so I could just replace the old versions with them?

ttung · 2022-06-22T04:57:15Z

@nickeener I suspect you previously tried to repin requirements, and there are stale files left over from the attempt. Can you remove $STARFISH/ .REQUIREMENTS.txt-env (rm -rf /home/nick/projects/spacetx/decoder/starfish/ .REQUIREMENTS.txt-env) and try again?

nickeener · 2022-06-22T06:28:42Z

@ttung That was exactly what was wrong thanks! Was able to run make pin-all-requirements and get the updated requirements files but now it is throwing several mypy errors, all in files that I haven't changed.

ttung · 2022-06-22T06:37:38Z

I'm cloned your changes into #1984. I have it mostly working.

nickeener · 2022-06-22T23:10:02Z

@ttung I've pushed a fix here that should solve the test error in the #1984

nickeener · 2022-06-30T15:28:20Z

@ttung hey just repinging this, the last commit I made here should fix the error message in #1984

nickeener · 2022-07-19T19:23:13Z

@ttung @berl
Repinging again. Would really like to get this pushed soon as we are currently preparing our manuscript that includes this project so it would be nice to be able to say it is already integrated into starFISH.

ttung · 2022-07-20T05:21:26Z

@nickeener I cherry picked most of 8752555 into my branch and fired off a build.

nickeener · 2022-07-20T20:23:46Z

@ttung @berl
Looks like it's passing all the tests now! Let me know what/if there is anything else you'd like from me before merging.

berl

@nickeener nice work! I noticed a few things to fix and then I think we should merge.

starfish/core/spots/DecodeSpots/check_all_decoder.py

starfish/core/spots/DecodeSpots/check_all_funcs.py

berl · 2022-07-22T05:08:58Z

@ttung thanks for your help with this! My review was pretty basic but after the edits above, I'm inclined to merge to add this functionality to starfish.

berl

thanks for the fixes!

nickeener and others added 26 commits October 10, 2021 21:42

Initial commit

e3c70bf

Updated comments

9386a03

Changed filter_tally to rounds_used in decoded_intensity_table

5d3e8d0

Fixed validation errors

b21b57f

added tests for checkAll decoder

548e4e5

Merge branch 'spacetx:master' into master

54f6fb6

Fixed imports order

c04f503

include test updates

15f5fa0

annotation fixes and speed improvements

bfd0715

added ray to requirements

102d554

More speed and memory improvements

900c6e2

Replaced ray w/ multiprocessing

fe9733e

Fix import order

002d25d

few more fixes

ef5cf5e

Changed test to account for randomness

68a3c5f

fix test

3a10c18

Added new filter step

7f74fb9

Added input error checking and condition for empty results

be39052

Fixed channel bug

0a8ced9

March 30 2022 big update

ea8a2e9

Updated comments

1d19407

Updated comments and fixed mypy errors

ec859a1

Merge branch 'spacetx:master' into ray_2022

475a8f8

Swapped ray multiprocessing w/ standard Python

027b7cf

fixed function arg bug

9c383d3

mypy fix

fe841db

nickeener added 2 commits May 26, 2022 21:43

bug fixes

290c62e

fix

1a6b9a6

Added sphinx update to reqs

f2987a1

fixed error correction bug

8752555

reqs fix

cde796b

berl reviewed Jul 22, 2022

View reviewed changes

starfish/core/spots/DecodeSpots/check_all_decoder.py Outdated Show resolved Hide resolved

starfish/core/spots/DecodeSpots/check_all_funcs.py Outdated Show resolved Hide resolved

Added better error catching

30e919f

berl approved these changes Jul 23, 2022

View reviewed changes

berl merged commit c8327a8 into spacetx:master Jul 23, 2022

nickeener mentioned this pull request Jul 27, 2022

New seqFISH Decoding Method #1960

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Updated seqFISH Decoding Method: CheckAll Decoder #1978

Updated seqFISH Decoding Method: CheckAll Decoder #1978

nickeener commented May 17, 2022

berl commented May 18, 2022

nickeener commented Jun 13, 2022

berl commented Jun 18, 2022

nickeener commented Jun 20, 2022

berl commented Jun 21, 2022

nickeener commented Jun 22, 2022

ttung commented Jun 22, 2022

nickeener commented Jun 22, 2022

ttung commented Jun 22, 2022

nickeener commented Jun 22, 2022 •

edited

nickeener commented Jun 30, 2022

nickeener commented Jul 19, 2022

ttung commented Jul 20, 2022

nickeener commented Jul 20, 2022

berl left a comment

berl commented Jul 22, 2022 •

edited

berl left a comment

Updated seqFISH Decoding Method: CheckAll Decoder #1978

Updated seqFISH Decoding Method: CheckAll Decoder #1978

Conversation

nickeener commented May 17, 2022

berl commented May 18, 2022

nickeener commented Jun 13, 2022

berl commented Jun 18, 2022

nickeener commented Jun 20, 2022

berl commented Jun 21, 2022

nickeener commented Jun 22, 2022

ttung commented Jun 22, 2022

nickeener commented Jun 22, 2022

ttung commented Jun 22, 2022

nickeener commented Jun 22, 2022 • edited

nickeener commented Jun 30, 2022

nickeener commented Jul 19, 2022

ttung commented Jul 20, 2022

nickeener commented Jul 20, 2022

berl left a comment

Choose a reason for hiding this comment

berl commented Jul 22, 2022 • edited

berl left a comment

Choose a reason for hiding this comment

nickeener commented Jun 22, 2022 •

edited

berl commented Jul 22, 2022 •

edited