Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalization and gene selection by analytical Pearson residuals #1715

Merged
merged 106 commits into from
Mar 29, 2022

Conversation

jlause
Copy link
Contributor

@jlause jlause commented Mar 3, 2021

Hi everyone,

a while back, @giovp asked me to create a pull request that integrates analytical Pearson residuals in scanpy. We already discussed a bit with @LuckyMD and @ivirshup over at berenslab/umi-normalization#1 how to structure it, and now I made a version that should be ready for review.

As discussed earlier, this pull request implements two core methods:

  • sc.pp.normalize_pearson_residuals(), which applies the method to adata.X. Overall, the function is very similar in structure to sc.pp.normalize_total() (support for layers, inplace operation etc).
  • sc.pp.highly_variable_genes(flavor='pearson_residuals'), which selects genes based on Pearson residual variance. The "inner" function _highly_variable_pearson_residuals() is structured similarly to _highly_variable_seurat_v3() (support for multiple batches, median ranks for tie breaking). It includes the chunksize argument to allow for memory-efficient computation of the residual variance.

We discussed quite a lot how to implement a third function that would bundle gene selection, normalization by analytical residuals and PCA. This PR includes the two options that emerged at the end of that discussion, so now we have to choose ;)

  • sc.pp.recipe_pearson_residuals() which does HVG selection and normalization both via Pearson residuals prior to PCA
  • sc.pp.normalize_pearson_residuals_pca() which applies any HVG selection if the user previously added one to the adata object, and then normalizes via Pearson residuals and does PCA.

Both functions retain the raw input counts as adata.X and add fields for PCA/Normalization/HVG selection results (or return them) as applicable, most importantly the X_pca in adata.obsm['pearson_residuals_X_pca']. I hope this addresses some of the issues we discussed over at the other repo in a scanpy-y way.

Let me know what you think and where you think improvements are needed!

Cheers, Jan.

@giovp
Copy link
Member

giovp commented Mar 4, 2021

thanks @jlause ! Really excited for this!
shall be able to start reviewing during weekend. quick q to @ivirshup , does #1667 somehow potentially conflict with this?

thank you!

@jlause
Copy link
Contributor Author

jlause commented Mar 4, 2021

thanks @jlause ! Really excited for this!
shall be able to start reviewing during weekend. quick q to @ivirshup , does #1667 somehow potentially conflict with this?

thank you!

cool, looking forward to your feedback!
I had a brief look at #1667 - since I use the same layer management that is now to be changed in normalize_total(), it would make sense if I mirror the change in normalize_pearson_residuals(), right? I believe doing that will even simplify the function further. If @ivirshup agrees, I could quickly do that :)

@ivirshup
Copy link
Member

ivirshup commented Mar 5, 2021

it would make sense if I mirror the change in normalize_pearson_residuals(), right? I believe doing that will even simplify the function further. If ivirshup agrees, I could quickly do that :)

Sounds good to me!

@jlause
Copy link
Contributor Author

jlause commented Mar 5, 2021

I finished the edits to conform with #1667! Now looking forward to your thoughts :)

One note on the coding style checks (which I'm not very experienced with): When I activate pre-commit locally, it finds quite a number of style violations in parts of the code that I did not touch and automatically fixes them. This causes many changes that are unrelated to the code I wrote. That's why I disabled pre-commit again (so you don't have to go over all these changes), and tried to follow the style guide manually as good as possible. Hope that is ok for now..

Do you have any advice how to handle that? Should I just do one "style" commit (that fixes all these issues throughout the files I work on here) once you've checked the new parts of the code I wrote? Or should the style be ok in all "old" parts of the code, implying that I set up pre-commit wrong? I'm new to it so that could very well be the case as well..

@ivirshup
Copy link
Member

ivirshup commented Mar 7, 2021

One note on the coding style checks (which I'm not very experienced with): When I activate pre-commit locally, it finds quite a number of style violations in parts of the code that I did not touch and automatically fixes them

Could you share the output you're getting here? The output I'm seeing on the pre-commit CI looks pretty normal, so it'd be helpful to know more exactly what was happening locally. We've also just activated pre-commit, so it's a bit new to us too!

@jlause
Copy link
Contributor Author

jlause commented Mar 9, 2021

Hey @ivirshup,
below is the output after making a small test edit to _highly_variable_genes.py and trying to commit.
Could it be that the file has not been style-checked so far (or not on the branch/version I was forking from?), leading to all these automatic changes/style violations in the old part of the code?

hope this helps, let me know if you need anything else.

jlause@8b38045532aa:~/libs/scanpy/scanpy/preprocessing$ git commit -m "test commit"
black....................................................................Failed
- hook id: black
- files were modified by this hook

reformatted scanpy/preprocessing/_highly_variable_genes.py
All done! ✨ 🍰 ✨
1 file reformatted.

flake8...................................................................Failed
- hook id: flake8
- exit code: 1

scanpy/preprocessing/_highly_variable_genes.py:33:80: E501 line too long (84 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:37:80: E501 line too long (81 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:49:80: E501 line too long (102 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:51:80: E501 line too long (89 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:108:80: E501 line too long (82 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:123:80: E501 line too long (83 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:204:27: W291 trailing whitespace
scanpy/preprocessing/_highly_variable_genes.py:219:5: F821 undefined name 'view_to_actual'
scanpy/preprocessing/_highly_variable_genes.py:220:9: F821 undefined name '_get_obs_rep'
scanpy/preprocessing/_highly_variable_genes.py:244:80: E501 line too long (85 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:257:80: E501 line too long (85 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:282:80: E501 line too long (88 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:291:80: E501 line too long (85 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:299:80: E501 line too long (84 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:305:80: E501 line too long (85 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:335:80: E501 line too long (86 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:341:80: E501 line too long (80 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:350:80: E501 line too long (81 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:365:80: E501 line too long (84 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:392:80: E501 line too long (82 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:417:80: E501 line too long (83 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:424:80: E501 line too long (87 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:440:80: E501 line too long (86 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:449:80: E501 line too long (88 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:460:23: F541 f-string is missing placeholders
scanpy/preprocessing/_highly_variable_genes.py:463:80: E501 line too long (82 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:512:80: E501 line too long (81 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:518:80: E501 line too long (85 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:519:80: E501 line too long (81 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:520:80: E501 line too long (81 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:521:80: E501 line too long (87 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:529:80: E501 line too long (90 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:531:80: E501 line too long (85 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:531:86: W291 trailing whitespace
scanpy/preprocessing/_highly_variable_genes.py:534:80: E501 line too long (90 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:535:79: W291 trailing whitespace
scanpy/preprocessing/_highly_variable_genes.py:538:80: E501 line too long (90 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:539:79: W291 trailing whitespace
scanpy/preprocessing/_highly_variable_genes.py:542:80: E501 line too long (90 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:543:79: W291 trailing whitespace
scanpy/preprocessing/_highly_variable_genes.py:546:80: E501 line too long (90 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:547:79: W291 trailing whitespace
scanpy/preprocessing/_highly_variable_genes.py:550:80: E501 line too long (87 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:556:67: W291 trailing whitespace
scanpy/preprocessing/_highly_variable_genes.py:559:80: E501 line too long (87 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:560:80: E501 line too long (88 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:560:89: W291 trailing whitespace
scanpy/preprocessing/_highly_variable_genes.py:563:80: E501 line too long (90 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:564:1: W293 blank line contains whitespace
scanpy/preprocessing/_highly_variable_genes.py:565:80: E501 line too long (81 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:569:1: W293 blank line contains whitespace
scanpy/preprocessing/_highly_variable_genes.py:571:80: E501 line too long (89 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:572:80: E501 line too long (88 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:572:89: W291 trailing whitespace
scanpy/preprocessing/_highly_variable_genes.py:575:80: E501 line too long (83 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:576:80: E501 line too long (83 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:579:80: E501 line too long (83 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:584:80: E501 line too long (97 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:585:80: E501 line too long (86 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:586:80: E501 line too long (84 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:587:80: E501 line too long (88 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:588:80: E501 line too long (90 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:589:76: W291 trailing whitespace
scanpy/preprocessing/_highly_variable_genes.py:590:80: E501 line too long (89 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:592:1: W293 blank line contains whitespace
scanpy/preprocessing/_highly_variable_genes.py:596:80: E501 line too long (85 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:608:80: E501 line too long (84 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:613:80: E501 line too long (93 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:618:80: E501 line too long (80 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:621:80: E501 line too long (89 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:623:80: E501 line too long (93 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:656:80: E501 line too long (80 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:693:80: E501 line too long (80 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:713:80: E501 line too long (88 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:735:80: E501 line too long (82 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:737:80: E501 line too long (83 > 79 characters)
scanpy/preprocessing/_highly_variable_genes.py:742:80: E501 line too long (80 > 79 characters)

git status and git diff show the automatic changes pre-commit makes:

jlause@8b38045532aa:~/libs/scanpy/scanpy/preprocessing$ git status
On branch pearson_residuals_1.7
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	modified:   _highly_variable_genes.py

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   _highly_variable_genes.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	../../.pre-commit-config.yaml

jlause@8b38045532aa:~/libs/scanpy/scanpy/preprocessing$ git diff _highly_variable_genes.py 
diff --git a/scanpy/preprocessing/_highly_variable_genes.py b/scanpy/preprocessing/_highly_variable_genes.py
index 03b01940..e2851f50 100644
--- a/scanpy/preprocessing/_highly_variable_genes.py
+++ b/scanpy/preprocessing/_highly_variable_genes.py
@@ -15,7 +15,8 @@ from ._utils import _get_mean_var
 from ._distributed import materialize_as_ndarray
 from ._simple import filter_genes
 
-#testedit
+# testedit
+
 
 def _highly_variable_genes_seurat_v3(
     adata: AnnData,
@@ -98,7 +99,9 @@ def _highly_variable_genes_seurat_v3(
         else:
             clip_val_broad = np.broadcast_to(clip_val, batch_counts.shape)
             np.putmask(
-                batch_counts, batch_counts > clip_val_broad, clip_val_broad,
+                batch_counts,
+                batch_counts > clip_val_broad,
+                clip_val_broad,
             )
 
         if sp_sparse.issparse(batch_counts):
@@ -173,6 +176,7 @@ def _highly_variable_genes_seurat_v3(
             df = df.drop(['highly_variable_nbatches'], axis=1)
         return df
 
+
 def _highly_variable_pearson_residuals(
     adata: AnnData,
     layer: Optional[str] = None,
@@ -182,7 +186,7 @@ def _highly_variable_pearson_residuals(
     clip: Union[Literal['auto', 'none'], float] = 'auto',
     chunksize: int = 100,
     subset: bool = False,
-    inplace: bool = True
+    inplace: bool = True,
 ) -> Optional[pd.DataFrame]:
     """\
     See `highly_variable_genes`.
@@ -212,7 +216,7 @@ def _highly_variable_pearson_residuals(
         in all batches
     """
 
-    view_to_actual(adata)        
+    view_to_actual(adata)
     X = _get_obs_rep(adata, layer=layer)
     computed_on = layer if layer else 'adata.X'
 
@@ -328,8 +332,7 @@ def _highly_variable_pearson_residuals(
     df = df.loc[adata.var_names]
 
     if inplace or subset:
-        adata.uns['hvg'] = {'flavor': 'pearson_residuals',
-                            'computed_on':computed_on}
+        adata.uns['hvg'] = {'flavor': 'pearson_residuals', 'computed_on': computed_on}
         logg.hint(
             'added\n'
             '    \'highly_variable\', boolean vector (adata.var)\n'

...skipping 1 line
                 ['highly_variable_nbatches', 'highly_variable_intersection'], axis=1
             )
         return df
-    
-    
-    
+
 
 def _highly_variable_genes_single_batch(
     adata: AnnData,
@@ -479,7 +480,6 @@ def _highly_variable_genes_single_batch(
     return df
 
 
-
 def highly_variable_genes(
     adata: AnnData,
     layer: Optional[str] = None,
@@ -493,7 +493,9 @@ def highly_variable_genes(
     theta: float = 100,
     clip: Union[Literal['auto', 'none'], float] = 'auto',
     chunksize: int = 1000,
-    flavor: Literal['seurat', 'cell_ranger', 'seurat_v3', 'pearson_residuals'] = 'seurat',
+    flavor: Literal[
+        'seurat', 'cell_ranger', 'seurat_v3', 'pearson_residuals'
+    ] = 'seurat',
     subset: bool = False,
     inplace: bool = True,
     batch_key: Optional[str] = None,
@@ -651,21 +653,20 @@ def highly_variable_genes(
     if flavor == 'pearson_residuals':
         if n_top_genes is None:
             raise ValueError(
-            "`pp.highly_variable_genes` requires the argument `n_top_genes`"
-            " for `flavor='pearson_residuals'`"
+                "`pp.highly_variable_genes` requires the argument `n_top_genes`"
+                " for `flavor='pearson_residuals'`"
             )
         return _highly_variable_pearson_residuals(
             adata,
-            layer = layer,
-            n_top_genes = n_top_genes,
-            batch_key = batch_key,
-            theta = theta,
-            clip = clip,
-            chunksize= chunksize,
-            subset = subset,
-            inplace = inplace,
+            layer=layer,
+            n_top_genes=n_top_genes,
+            batch_key=batch_key,
+            theta=theta,
+            clip=clip,
+            chunksize=chunksize,
+            subset=subset,
+            inplace=inplace,
         )
-        
 
     if batch_key is None:
         df = _highly_variable_genes_single_batch(

...skipping 1 line
 
             # Add 0 values for genes that were filtered out
             missing_hvg = pd.DataFrame(
-                np.zeros((np.sum(~filt), len(hvg.columns))), columns=hvg.columns,
+                np.zeros((np.sum(~filt), len(hvg.columns))),
+                columns=hvg.columns,
             )
             missing_hvg['highly_variable'] = missing_hvg['highly_variable'].astype(bool)
             missing_hvg['gene'] = gene_list[~filt]

@LuckyMD
Copy link
Contributor

LuckyMD commented Mar 9, 2021

Just a quick note as I just read up on this... I think we have set up black with a line length of 88 chars, and not 79 at the moment.

@ivirshup
Copy link
Member

ivirshup commented Mar 10, 2021

Ah I think I see the issue! Feature branches should be based off master and directing the pull request there! I think what's happening is that a pre-commit hook was installed, but the config only exists on the master branch.

I think this should largely be manageable by rebasing onto master (e.g. git rebase --onto master 1.7.x) and changing the branch the PR is targeting via the github interface:

image


Side note: We're considering separating the highly_variable_genes interface into multiple functions, since the arguments to the different methods don't always overlap in meaningful or intuitive ways. There's nothing you need to do about this right now, but just a heads up to keep the logic for this method separate from the main function.

@jlause jlause changed the base branch from 1.7.x to master March 10, 2021 14:08
@jlause
Copy link
Contributor Author

jlause commented Mar 10, 2021

Ah I think I see the issue! Feature branches should be based off master and directing the pull request there! I think what's happening is that a pre-commit hook was installed, but the config only exists on the master branch.

I think this should largely be manageable by rebasing onto master (e.g. git rebase --onto master 1.7.x) and changing the branch the PR is targeting via the github interface:

Thanks a lot, I rebased and changed the PR target to master so I hope everything is on track now!
The pre-commit style checks were working as expected now (auto-edits only in the files / parts I edited).

Side note: We're considering separating the highly_variable_genes interface into multiple functions, since the arguments to the different methods don't always overlap in meaningful or intuitive ways. There's nothing you need to do about this right now, but just a heads up to keep the logic for this method separate from the main function.

Sounds good!

@codecov
Copy link

codecov bot commented Mar 10, 2021

Codecov Report

Merging #1715 (970b0fa) into master (0728d55) will increase coverage by 0.48%.
The diff coverage is 96.56%.

@@            Coverage Diff             @@
##           master    #1715      +/-   ##
==========================================
+ Coverage   71.46%   71.95%   +0.48%     
==========================================
  Files          92       98       +6     
  Lines       11305    11531     +226     
==========================================
+ Hits         8079     8297     +218     
- Misses       3226     3234       +8     
Impacted Files Coverage Δ
scanpy/preprocessing/__init__.py 100.00% <ø> (ø)
scanpy/preprocessing/_recipes.py 92.59% <ø> (ø)
scanpy/experimental/pp/_normalization.py 93.82% <93.82%> (ø)
scanpy/experimental/pp/_highly_variable_genes.py 97.08% <97.08%> (ø)
scanpy/__init__.py 100.00% <100.00%> (ø)
scanpy/experimental/__init__.py 100.00% <100.00%> (ø)
scanpy/experimental/_docs.py 100.00% <100.00%> (ø)
scanpy/experimental/pp/__init__.py 100.00% <100.00%> (ø)
scanpy/experimental/pp/_recipes.py 100.00% <100.00%> (ø)
scanpy/preprocessing/_highly_variable_genes.py 94.27% <100.00%> (+0.02%) ⬆️
... and 1 more

Copy link
Member

@giovp giovp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you @jlause for the PR! This is really exciting and super useful!
This is a first round of review, most comments are re types, args and function behviour. I think it looks really good overall and maybe it's time to start writing tests ?
please let me know if anything unclear and also thanks in advance for code explanations!

scanpy/preprocessing/_highly_variable_genes.py Outdated Show resolved Hide resolved
scanpy/preprocessing/_highly_variable_genes.py Outdated Show resolved Hide resolved
scanpy/preprocessing/_highly_variable_genes.py Outdated Show resolved Hide resolved
scanpy/preprocessing/_highly_variable_genes.py Outdated Show resolved Hide resolved
scanpy/preprocessing/_highly_variable_genes.py Outdated Show resolved Hide resolved
scanpy/preprocessing/_normalization.py Outdated Show resolved Hide resolved
def normalize_pearson_residuals(
adata: AnnData,
theta: float = 100,
clip: Union[Literal['auto', 'none'], float] = 'auto',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above re clipping

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

def normalize_pearson_residuals_pca(
adata: AnnData,
theta: float = 100,
clip: Union[Literal['auto', 'none'], float] = 'auto',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

scanpy/preprocessing/_normalization.py Outdated Show resolved Hide resolved
scanpy/preprocessing/_recipes.py Outdated Show resolved Hide resolved
@giovp giovp mentioned this pull request Mar 12, 2021
@jlause
Copy link
Contributor Author

jlause commented Mar 12, 2021

thank you @jlause for the PR! This is really exciting and super useful!
This is a first round of review, most comments are re types, args and function behviour. I think it looks really good overall and maybe it's time to start writing tests ?
please let me know if anything unclear and also thanks in advance for code explanations!

Hey @giovp ,
thanks a lot for the review, this looks very helpful! I'll address the single points above one-by-one and make the required changes over the next few days! Will also add some first tests - are there formal guidelines what you expect to be tested? After looking at the tests for highly_variable_genes, from my naive perspective I'd test the following:

  • tests that check if combinations of input arguments lead to expected output (in terms of returned shapes/columns/...) and don't break the function
  • tests that check if warnings/errors are raised for "common mistakes" (inappropriate data, nonsense input argument combinations..)

Any advice/ideas what else should be tested?

@giovp
Copy link
Member

giovp commented Mar 12, 2021

tests that check if combinations of input arguments lead to expected output (in terms of returned shapes/columns/...) and don't break the function
tests that check if warnings/errors are raised for "common mistakes" (inappropriate data, nonsense input argument combinations..)

yes both makes sense, it would also be useful to come up with a dummy example for which the actual output could be tested against. This is done in seurat_v3 for instance, but in that case it's kind of straightforward because the "expected" is the output computed with original implementation (and as you catched in #1732 it's still might not be enough 😄 ).
another random thing that comes to mind re this specific case is to make sure that indexing etc. is consistent and robust, as you seem to have to sort and resort a fair bit in the hvg implementation.

on another note, I was thinking if it makes sense to also release a short tutorial together with the PR (that would be on theislab/scanpy_tutorials) ? I think that for a lot of people the term "pearson residuals" could be alienating, and so they'd rather stick to normalize_total for comfort (but they shouldn't!). So maybe just something easy like pearson res norm + umap and hvg plots ? curious to hear what you and the others @ivirshup @LuckyMD think about it.

@giovp
Copy link
Member

giovp commented Feb 28, 2022

for normalize_pearson_residual, i think it makes sense to keep normalize in, as it's not the same type of transformation compared to log1p.

Isn't this quite similar to what log1p does though? In that it's a transformation of the matrix?

I think it should stay normalize_pearson_residuals because it mirrors normalize_total.

for the rest, I think we are at a good stage, I'd ask @jlause to build docs locally cd scanpy/docs and then make clean and make html see https://scanpy.readthedocs.io/en/stable/dev/documentation.html#building-the-docs
and check that:

  • arguments and doc params match
  • typo and other minor issues still present (e.g. difficult phrasing).

if this gets approval, before merging to master todo:

  • add release note
  • go over scanpy_tutorials and re run tutorial and merge
  • link tutorial in docs

p.s. docs are failing for reasons I have haven't figured out yet

@jlause
Copy link
Contributor Author

jlause commented Mar 11, 2022

Hi!

for normalize_pearson_residual, i think it makes sense to keep normalize in, as it's not the same type of transformation compared to log1p.

Isn't this quite similar to what log1p does though? In that it's a transformation of the matrix?

I think it should stay normalize_pearson_residuals because it mirrors normalize_total.

I agree.

for the rest, I think we are at a good stage, I'd ask @jlause to build docs locally cd scanpy/docs and then make clean and make html see https://scanpy.readthedocs.io/en/stable/dev/documentation.html#building-the-docs and check that:

* arguments and doc params match

* typo and other minor issues still present (e.g. difficult phrasing).

I started doing that and will finish up tomorrow - there Qs in advance if you happen to look at this before:

  • sometimes we have math expressions like var = mean * mean^2 etc. in the docs. Is there a convention for scanpy docs if those should be in code format or just plain text? e.g. in the adata docstring the matrix shape is described as n_obs x n_var, but elsewhere we say "clipping is done by sqrt(n). I can consistently format them into code if you agree.
  • I think the .._pca function is missing from the release note. should I add it there?
  • The ..pca function also did not use shared docs params yet. I started adding them and can commit tomorrow - is that okay if I just do it like that?

if this gets approval, before merging to master todo:

* [x]  add release note

* [ ]  go over scanpy_tutorials and re run tutorial and merge

I've looked at that and commented in the respective github :)

Very happy we are getting this wrapped up now :)
Best! Jan

@giovp
Copy link
Member

giovp commented Mar 11, 2022

sometimes we have math expressions like var = mean * mean^2 etc. in the docs. Is there a convention for scanpy docs if those should be in code format or just plain text?

I'm not sure, I think with math is nicer but not aware of any convention. @ivirshup ?

I think the .._pca function is missing from the release note. should I add it there?
The ..pca function also did not use shared docs params yet. I started adding them and can commit tomorrow - is that okay if I just do it like that?

must say I missed those sorry, feel free to add and I'll take a look again tomorrow and wrap it up.

@jlause
Copy link
Contributor Author

jlause commented Mar 12, 2022

Hi, I worked over the docs completely once but still need to do a few small things (release note, final spell check) but need to leave my desk now - will do the final bits either later today or tomorrow!
Best jan

@jlause
Copy link
Contributor Author

jlause commented Mar 13, 2022

Hi @giovp ,
I'm done from my side of things: I have re-worded some parts of the docstrings (hopefully to better readability ;) ), added the missing function to the release note and tried to make the returns sections of the docs a bit more consistent.

Also, it seems that building the docs is failing again on github (locally it works with some warnings). Again I'm not sure why / if it is even related to my changes 🤔

Let me know if I can help with fixing that or if anything else comes up!

Best, Jan.

@giovp
Copy link
Member

giovp commented Mar 16, 2022

ok tutorial is merged, you can have a look how it renders here: https://scanpy-tutorials.readthedocs.io/en/latest/tutorial_pearson_residuals.html

I've fixed the tutorial.rst page and the release note. To me it looks good, I'd like to get @ivirshup approval on this before merging.

I'm done from my side of things: I have re-worded some parts of the docstrings (hopefully to better readability ;) ), added the missing function to the release note and tried to make the returns sections of the docs a bit more consistent.

really clear and coincise btw, great job

@ivirshup
Copy link
Member

LGTM

Thanks for pushing and getting this through! And thanks for picking up the review on this @giovp!

@ivirshup ivirshup merged commit 636316d into scverse:master Mar 29, 2022
@giovp
Copy link
Member

giovp commented Mar 29, 2022

thank you @ivirshup !

@jlause
Copy link
Contributor Author

jlause commented Mar 30, 2022

nice! thanks to both of you @giovp @ivirshup :)

@Starlitnightly
Copy link

Starlitnightly commented Jun 6, 2023

Hi, @jlause

There's a issue when using normalize_pearson_residuals, it seems that we can't calculated the log2foldchange in rank_genes_groups will be failed. That's because np.expm1 can't restore the adata.X after normalize_pearson_residuals. Could you solve this issue that completed the downstream currently?

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants