Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated seqFISH Decoding Method: CheckAll Decoder #1978

Merged
merged 32 commits into from Jul 23, 2022

Conversation

nickeener
Copy link
Contributor

This is an update to my previous PR for a new seqFISH decoding method. We discovered some problems with the method in regards to the false positive rate and have spent the past several months improving it to it's current state and think it is now fit to be published.

So recapping the previous PR, we found that starFISH's seqFISH decoding (PerRoundMaxChannel with NEAREST_NEIGHBOR or EXACT_MATCH TraceBuildingStrategy) was finding far fewer targets in our seqFISH datasets than we were expecting from previous results and the reason for this is the limited search space within which starFISH looks for barcodes. For both TraceBuildingStrategy's, spots in different rounds are only connected into barcodes if they are spatially close to a spot in an arbitrary anchor round which does not utilize the inherent error correction that can be done by comparing the barcodes found using each round as an anchor. Additionally, the EXACT_MATCH TraceBuildingStrategy requires spots in different rounds to be in the exact same pixel location in order to connected into barcodes while NEAREST_NEIGHBOR requires that they be nearest neighbors, which are both extremely sensitive to spot drift between rounds, requiring precise registration that isn't always possible. The PerRoundMaxChannel decoder also lacks the ability to find error-corrected barcodes (in codebook where all codes have at least a hamming distance of 2 each code can still be uniquely identified by a barcode that is missing one round).

To address this, I attempted to write a seqFISH decoding method that emulates the one used by the lab where seqFISH originated at Caltech. In this method, the spots in all rounds are separately used as anchors to search to spots within a search radius of the anchor spots. So for each spot, all the possible barcodes that could constructed using the spots in other rounds within the search radius are considered, which is a much larger search space than when only considering exact matches or nearest neighbors. Each anchor spot is then assigned a "best" barcode out of its total possible set (it "checks all" of the possible barcodes) based on the spatial variance and intensities of the spots that make it up and whether it has a match to codebook, though the order of those steps will vary.

This larger search space requires a few extra tricks to keep false positive calls down. One of these is a filter that removes barcodes that were not found as best for a certain number of the spots/rounds that make it up (usually one less than the total round number) which the Caltech group calls the seed filter. So if we had a 5 round barcode and that barcode was chosen as the best choice for only 3 of the spots that make it up, it would not pass the filter. I've also introduced a parameter I call strictness that sets a maximum number of possible barcodes a given spot is allowed to have before it is dropped as ambiguous. Barcodes in spot dense regions tend to be difficult to call accurately so it is useful to be able to restrict that if necessary. Another trick I found helpful was to run decoding in multiple nested stages that start off with stricter parameters which are loosened as decoding progresses. This includes the search radius (starts off with radius 0 and increments up to the user-specified value), the number of rounds that can be omitted from the barcode to make a match (only looks for full codes first then looks for partials) and the max allowed score (based on spatial variance and intensity of spots). Between each decoding, the spots that were found to be in decoded barcodes that pass all filters are removed from the original spot set before decoding the remaining spots. This allows you to call high confidence barcodes first and remove their spots, making calling adjacent barcodes easier. Running the whole decoding process repeatedly became quite time consuming so I have added in multiprocessing capability to allow users to speed things up using Python's standard multiprocessing library.

In addition to the filters and incremental decoding I also take advantage of two slightly different decoding algorithms with different precision/recall properties. In one method, which I call "filter-first", once all the possible barcodes for each spot are assembled, the "best" barcode for each spot is chosen based on a scoring function that uses the spatial variance of the spots and their intensities. The chosen barcodes are then matched to the codebook and those that have no match are dropped. The other method, called "decode-first", instead matches all the possible barcodes for each spot to the codebook first and then if there are multiple matches the best is chosen based on the distance score. The "filter-first" method tends to be higher accuracy but return fewer decoded targets (high precision/low recall) while the "decode-first" method finds more mRNA targets but at a cost to accuracy (low precision/high recall). In keeping with my incremental decoding strategy I described earlier, the first set of decoding is done using the high accuracy "filter-first" method followed by decodings done using the low accuracy version.

Below is a description of the required inputs and a step by step explanation of the algorithm.

Inputs:

  • spots - starFISH SpotFindingResults object
  • codebook - starFISH Codebook object
  • search_radius - pixel distance that spots can be from each other and be connected into a barcode
  • error_rounds - number of rounds that can be dropped from each code and still allow unique barcode matching (only values are 0 or 1 are currently allowed)
  • mode - accuracy mode, determines settings of several different parameters ("high", "med", or "low")
  • physical_coords - boolean for whether to using the physical coordinate values found in the spots object for calculating distance between spots or use pixel coordinates

Steps

  1. For each spot in each round, calculate a scaled intensity value by dividing the spot's intensity by the l2 norm of all the spots found in the same channel and same 100 radius pixel neighborhood.
  2. For each spot in each round, find all neighboring spots in different rounds within the search radius.
  3. From each spot and their neighbors, build all possible combinations of spots that form barcodes using their channel labels.
  4. Choose a "best" barcode for each spot using the scoring function and drop barcodes that don't have codebook matches (this is for the filter-first method and is switched for decode-first).
  5. Drop barcodes that were not found as best for most of the spots that make it up (usually one less than the round total).
  6. Choose between overlapping barcodes (barcodes that share spots) using their scores.
  7. Remove spots found in passing barcodes from original spot set.
    (2-7 is repeated, first using the filter-first method for each incremental radius and then using the decode-first method for each radius)
  8. One final decoding is done if error_rounds parameter > 0, where partial barcodes are allowed. Uses filter-first method and strict parameters as the barcodes it finds are more error prone than full barcodes.
  9. Turn overall barcode set into DecodedIntensityTable object and return

One final detail worth explaining here is the mode parameter. This parameter can take three values: "high", "med", or "low" and essentially simplifies a number of parameter choices into these three presets including the seed filter value at the end of each decoding step, the max allowed score allowed for a barcode, and the strictness value for the filter-first and decode-first methods. "high" corresponds to high accuracy (lower recall), "low" corresponds to low accuracy (higher recall), and "med" is a balance between the two.

To evaluate the results of our decodings and compare them with starFISH methods, we used two different kinds of qc measures. The first is correlation with orthogonal RNA counts (either smFISH or RNAseq) and the other is the proportion of false positive calls measured by placing "blank" codes in the codebook that adhere to the hamming distance rules of the codebook but don't actually correspond to any true mRNA. This false positive metric was lacking from my previous PR and when we ran it on the previous method our false positive values were quite high. Depending on the method used to calculate the false positive rate and the specific dataset, high false positive rates can be a problem with seqFISH data but the mode parameter allows you to easily adjust the decoding parameters to get the best results.

Below are some figures showing results from 3 different seqFISH datasets comparing the starFISH PerRoundMaxChannel + NEAREST_NEIGHBORS method with each mode choice for the CheckAll decoder. Each red/blue bar pair along the x-axis represent a single cell, FOV, or image chunk (don't have segmented cells yet for this dataset so it is broken into equal sized chunks), while the height of the blue bar represents the total number of on-target (real) barcodes found in that cell/fov/chunk and the red bar height is the off-target count (blank) with the black bar showing the median on-target count. Also printed on each graph is the overall false positive (FP) rate and the pearson correlation with smFISH or RNAseq where available. For each dataset the same set of spots was used for each run.

Intron seqFISH

  • Mouse ESC lines
  • 10,421 real genes + ~9,000 blank genes
  • 29 (close) z slices
  • This was the most difficult of the three datasets to get good results from. The large number of genes and close by z slices make finding true barcodes tricky.

image

RNA SPOTs seqFISH

  • No cells, just extracted RNA on a slide
  • 10,421 real genes + ~9,000 blank genes
  • single z slice
  • Pretty easy dataset as there was no cellular autofluorescence

image

Mouse Atlas seqFISH

  • slices of mouse hippocampus
  • 351 real genes, 351 blanks
  • 6 (distant) z slices
  • Lower false positives than intron seqFISH probably from the lower gene count and distant z slices reducing confusion. Not sure why starFISH does so poorly here though but the FP rate is almost 3x the low accuracy rate for the new decoder.

image

From these, you can see that this updated decoder is capable of finding more targets than starFISH, with lower false positive rates, and higher smFISH/RNAseq correlations. The one area where the starFISH methods have us beat is speed. Being more thorough is costly so while starFISH can decode a set of spots in 2-3 minutes on my system, this new decoder can take up to 1.5 hours even when using 16 different simultaneous processes. That time is from the intron seqFISH dataset which is quite large (5 x 12 x 29 x 2048 x 2048) so the run times are more reasonable for smaller datasets.

Running the "make all" tests for my current repo throws a mypy error but in code that I haven't touched (like the ImageStack class code) so hopefully we can work together to get those kinks worked out so we can add this to starFISH so others might use it.

@berl
Copy link
Collaborator

berl commented May 18, 2022

@nickeener this is really cool to see- thanks for the contribution! I'll take a look at the test/CI situation and see if I can help

@nickeener
Copy link
Contributor Author

@berl Any updates on this? It appears that it is erroring out trying to download the sphinx-bootstrap-theme dependency on Python 3.9 but is otherwise passing all the initial tests.

@berl
Copy link
Collaborator

berl commented Jun 18, 2022

@nickeener yeah, this has been fixed in sphinx-bootstrap-theme==0.8.1, so that should get you past the install tests.

add this line to the REQUIREMENTS.txt file:
sphinx-bootstrap-theme==0.8.1
and then run
make pin-all-requirements
which will propagate the new requirement into the REQUIREMENTS-*.txt files.
This results in a successful make install-dev on my mac, but I haven't done any other testing past this. if you commit back to this, we'll get the CI tests

@nickeener
Copy link
Contributor Author

@berl Running into this error after I add the sphinx-bootstrap-theme==0.8.1 line to REQUIREMENTS.txt and then run make pin-all-requirements:

make starfish/REQUIREMENTS-STRICT.txt requirements/REQUIREMENTS-CI.txt requirements/REQUIREMENTS-NAPARI-CI.txt
make[1]: Entering directory '/home/nick/projects/spacetx/decoder/starfish'
[ ! -e .REQUIREMENTS.txt-env ] || exit 1
make[1]: *** [Makefile:103: starfish/REQUIREMENTS-STRICT.txt] Error 1
make[1]: Leaving directory '/home/nick/projects/spacetx/decoder/starfish'
make: *** [Makefile:98: pin-all-requirements] Error 2

Any advice on this?

@berl
Copy link
Collaborator

berl commented Jun 21, 2022

@nickeener did you run make pin-all-requirements from inside the starfish directory? my output doesn't show

make[1]: Entering directory '/home/nick/projects/spacetx/decoder/starfish'

but the error is in line 103 where it checks for a virtual environment. can you create a virtual environment with python=3.9 (using conda or venv ) and activate it before running make pin-all-requirements?

Also, what's your OS and conda version if applicable?

@nickeener
Copy link
Contributor Author

@berl Yes I am in the top level starfish directory when I run make pin-all-requirements. I've tried creating a fresh python3.9 conda env and then installing everything with make install-dev but I get the same exact error.

My OS is Ubuntu 20.04.4 and my conda version is conda 4.13.0.

Would you be able to share the new correctly built requirements files with me so I could just replace the old versions with them?

@ttung
Copy link
Collaborator

ttung commented Jun 22, 2022

@nickeener I suspect you previously tried to repin requirements, and there are stale files left over from the attempt. Can you remove $STARFISH/ .REQUIREMENTS.txt-env (rm -rf /home/nick/projects/spacetx/decoder/starfish/ .REQUIREMENTS.txt-env) and try again?

@nickeener
Copy link
Contributor Author

@ttung That was exactly what was wrong thanks! Was able to run make pin-all-requirements and get the updated requirements files but now it is throwing several mypy errors, all in files that I haven't changed.

@ttung
Copy link
Collaborator

ttung commented Jun 22, 2022

I'm cloned your changes into #1984. I have it mostly working.

@nickeener
Copy link
Contributor Author

nickeener commented Jun 22, 2022

@ttung I've pushed a fix here that should solve the test error in the #1984

@nickeener
Copy link
Contributor Author

@ttung hey just repinging this, the last commit I made here should fix the error message in #1984

@nickeener
Copy link
Contributor Author

@ttung @berl
Repinging again. Would really like to get this pushed soon as we are currently preparing our manuscript that includes this project so it would be nice to be able to say it is already integrated into starFISH.

@ttung
Copy link
Collaborator

ttung commented Jul 20, 2022

@nickeener I cherry picked most of 8752555 into my branch and fired off a build.

@nickeener
Copy link
Contributor Author

@ttung @berl
Looks like it's passing all the tests now! Let me know what/if there is anything else you'd like from me before merging.

Copy link
Collaborator

@berl berl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nickeener nice work! I noticed a few things to fix and then I think we should merge.

starfish/core/spots/DecodeSpots/check_all_decoder.py Outdated Show resolved Hide resolved
starfish/core/spots/DecodeSpots/check_all_funcs.py Outdated Show resolved Hide resolved
@berl
Copy link
Collaborator

berl commented Jul 22, 2022

@ttung thanks for your help with this! My review was pretty basic but after the edits above, I'm inclined to merge to add this functionality to starfish.

Copy link
Collaborator

@berl berl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the fixes!

@berl berl merged commit c8327a8 into spacetx:master Jul 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants