New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Get errors when performing sc.pp.highly_variable_genes! #8
Comments
Hi, This is just a short reply as I am away from my laptop until Monday. There is at the moment quite a few issues with pandas 0.24.0. The latest scanpy and anndata 0.6.18 has fixed some of them for scanpy code, but the rpy2 integration still has issues that won’t be fixed until rpy2 3.0.0 is out. Could you check if you have the same error if you downgrade pandas to 0.23.4? Alternatively you could upgrade scanpy and anndata, but then there will be issues with the trajectory inference via rpy2 later I fear. As for the If this does not solve your issue, I will only be able to get back to this properly on Monday. You may have more luck posting a scanpy issue if you need a quicker response. |
Hi, adata.X /= adata.obs['size_factors'].values[:,None] This step transform the adata.X to a structure of matrix. But after performing this step, the adata.X is And this format of adata.X is caused error of sc.pp.highly_variable_genes. But I don't know how to fix it. |
Hi, I have fixed the issue. |
Hi, The workflow does not work for sparse matrices at several points. That is why there are notes in several places saying that dense matrices should be used. In some places rpy2 will cause problems for this as well. |
Hi, I was working my way through the excellently documented single-cell tutorial using the datasets that are described in the "Case study: Epithelial cells in different intestinal regions by 10x Chromium"but ran into a problem at section 2.4 "Highly Variable Genes". Up to that, I could reproduce the figures and output from the tutorial. But when invoking: # Check highly variable gene selection to look at the results
disp_filter = sc.pp.filter_genes_dispersion(adata.X, flavor='cell_ranger', n_top_genes=4000, log=False)
print('Number of highly variable genes: {:d}'.format(adata[:, disp_filter['gene_subset']].n_vars)) the script fails with: ValueError Traceback (most recent call last)
I am working within Jupyter lab under osx10.13 with the following modules:
I also tried: sc.pp.highly_variable_genes(adata, same parameters as above) on sparse matrix (generated by adata.X = csr_matrix(adata.X)) but that had the same result. I would appreciate any hints and suggestions on this. Thank you! |
Hi @JuHey It looks to me like the Are you working with sparse matrices the whole time? In that case combat won't work. Could you check your |
Hi @LuckyMD I process according to the https://nbviewer.jupyter.org/github/theislab/single-cell-tutorial/blob/master/Case-study_Mouse-intestinal-epithelium.ipynb. The data are all converted to dense matrix at the point of concatenation and stay dense. Hence, running combat works fine. The output of combat: adata.X is:
With NA values... |
I'm guessing that there are no A couple of possible sources of error in combat are:
Are you using your own data, or the case study dataset? |
Actually, the most likely reason for this is that you have constant gene expression values within a batch for a particular gene. You won't be able to fit a gene-specific within-batch variance, and you get 0. This should be fixed in the Alternatively, you can perform a more strict gene filtering to avoid this case. |
Hi @LuckyMD , Thank you for your feedback and your suggestions. I am just a bit puzzled as I am not aware that I am doing anything different from the tutorial script, and I am using the case dataset processed as in the script. But obviously there is a difference. |
If you are using the case dataset from the tutorial, that is indeed strange. I'm assuming the rest of the steps were exactly followed? I have reproduced the tutorial on quite a few systems and while I get slight differences in clustering output, I have not seen this before. Which tool versions are you using? |
Yes, I followed all steps as shown in the notebook. I can reproduce the figures up to the
I was wondering whether you could send me a .h5ad of the concatenated dataset processed up to block # 8, so before '2. Preprocessing and normalization'. I could then be sure that what I am starting with is not the issue. |
I tried now
Does this in any way correspond to what you find with |
This does indeed look like it has to do with the starting dataset. The inbuilt combat version has a check for zero-variance genes in the each batch. And it looks like you have over 10,000 of them in your dataset. Another possibility is an issue with the scran step. Here is a link to the loaded dataset: It should be available for 2 weeks. On my current system, I still have older versions of scanpy, anndata, and rpy2 running... but that shouldn't change anything up to the first scran step. |
Also, are you sure your version of scran is correct? I am using 1.6.9, and the notebook has been reproduced with 1.10.2 (which is the latest version to my understanding). |
Thank you for making the dataset available!!! I will test it tonight. Yes, you are correct the latest scran version is 1.10.2, which I am using. |
I just ran the tutorial with the dataset you provided and it worked fine including the |
Hi,
I am following this excellent workflow to analyze my single-cell sequencing data sets.
I have calculated the size factor using the scran package and did not perform the batch correction step as I have only one sample. Then, I intended to extract highly variable genes by using the function sc.pp.highly_variable_genes. Unfortunately, I got an error:
'LinAlgError: Last 2 dimensions of the array must be square'
LinAlgError Traceback (most recent call last)
in
----> 1 sc.pp.highly_variable_genes(adata)
~/miniconda3/lib/python3.6/site-packages/scanpy/preprocessing/highly_variable_genes.py in highly_variable_genes(adata, min_disp, max_disp, min_mean, max_mean, n_top_genes, n_bins, flavor, subset, inplace)
94 X = np.expm1(adata.X) if flavor == 'seurat' else adata.X
95
---> 96 mean, var = materialize_as_ndarray(_get_mean_var(X))
97 # now actually compute the dispersion
98 mean[mean == 0] = 1e-12 # set entries equal to zero to small value
~/miniconda3/lib/python3.6/site-packages/scanpy/preprocessing/utils.py in _get_mean_var(X)
16 mean_sq = np.multiply(X, X).mean(axis=0)
17 # enforece R convention (unbiased estimator) for variance
---> 18 var = (mean_sq - mean**2) * (X.shape[0]/(X.shape[0]-1))
19 else:
20 from sklearn.preprocessing import StandardScaler
~/miniconda3/lib/python3.6/site-packages/numpy/matrixlib/defmatrix.py in pow(self, other)
226
227 def pow(self, other):
--> 228 return matrix_power(self, other)
229
230 def ipow(self, other):
~/miniconda3/lib/python3.6/site-packages/numpy/linalg/linalg.py in matrix_power(a, n)
600 a = asanyarray(a)
601 _assertRankAtLeast2(a)
--> 602 _assertNdSquareness(a)
603
604 try:
~/miniconda3/lib/python3.6/site-packages/numpy/linalg/linalg.py in _assertNdSquareness(*arrays)
213 m, n = a.shape[-2:]
214 if m != n:
--> 215 raise LinAlgError('Last 2 dimensions of the array must be square')
216
217 def _assertFinite(*arrays):
This is my adata.X looks like right now:
matrix([[0. , 0. , 0. , ..., 0. , 0. , 0. ],
[0. , 0. , 1.203, ..., 0. , 0. , 0. ],
[0. , 1.096, 0. , ..., 0. , 0. , 0. ],
...,
[0. , 0. , 2.042, ..., 0. , 0. , 0. ],
[0. , 0. , 0. , ..., 0.926, 0. , 0. ],
[0. , 0. , 2.951, ..., 0. , 0. , 0. ]])
also versions of my modules:
scanpy==1.3.7 anndata==0.6.17 numpy==1.15.4 scipy==1.2.0 pandas==0.24.0 scikit-learn==0.20.2 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1
Looking forward your response!
Thank you !
The text was updated successfully, but these errors were encountered: