Get errors when performing sc.pp.highly_variable_genes! #8

Closed
jipeifeng opened this issue Jan 31, 2019 · 17 comments

@jipeifeng

jipeifeng commented Jan 31, 2019

Hi,
I am following this excellent workflow to analyze my single-cell sequencing data sets.
I have calculated the size factor using the scran package and did not perform the batch correction step as I have only one sample. Then, I intended to extract highly variable genes by using the function sc.pp.highly_variable_genes. Unfortunately, I got an error:

'LinAlgError: Last 2 dimensions of the array must be square'

LinAlgError Traceback (most recent call last)
in
----> 1 sc.pp.highly_variable_genes(adata)

~/miniconda3/lib/python3.6/site-packages/scanpy/preprocessing/highly_variable_genes.py in highly_variable_genes(adata, min_disp, max_disp, min_mean, max_mean, n_top_genes, n_bins, flavor, subset, inplace)
94 X = np.expm1(adata.X) if flavor == 'seurat' else adata.X
95
---> 96 mean, var = materialize_as_ndarray(_get_mean_var(X))
97 # now actually compute the dispersion
98 mean[mean == 0] = 1e-12 # set entries equal to zero to small value

~/miniconda3/lib/python3.6/site-packages/scanpy/preprocessing/utils.py in _get_mean_var(X)
16 mean_sq = np.multiply(X, X).mean(axis=0)
17 # enforece R convention (unbiased estimator) for variance
---> 18 var = (mean_sq - mean**2) * (X.shape[0]/(X.shape[0]-1))
19 else:
20 from sklearn.preprocessing import StandardScaler

~/miniconda3/lib/python3.6/site-packages/numpy/matrixlib/defmatrix.py in __pow__(self, other)
226
227 def __pow__(self, other):
--> 228 return matrix_power(self, other)
229
230 def __ipow__(self, other):

~/miniconda3/lib/python3.6/site-packages/numpy/linalg/linalg.py in matrix_power(a, n)
600 a = asanyarray(a)
601 _assertRankAtLeast2(a)
--> 602 _assertNdSquareness(a)
603
604 try:

~/miniconda3/lib/python3.6/site-packages/numpy/linalg/linalg.py in _assertNdSquareness(*arrays)
213 m, n = a.shape[-2:]
214 if m != n:
--> 215 raise LinAlgError('Last 2 dimensions of the array must be square')
216
217 def _assertFinite(*arrays):

This is what my adata.X looks like right now:
matrix([[0. , 0. , 0. , ..., 0. , 0. , 0. ],
[0. , 0. , 1.203, ..., 0. , 0. , 0. ],
[0. , 1.096, 0. , ..., 0. , 0. , 0. ],
...,
[0. , 0. , 2.042, ..., 0. , 0. , 0. ],
[0. , 0. , 0. , ..., 0.926, 0. , 0. ],
[0. , 0. , 2.951, ..., 0. , 0. , 0. ]])

also versions of my modules:
scanpy==1.3.7 anndata==0.6.17 numpy==1.15.4 scipy==1.2.0 pandas==0.24.0 scikit-learn==0.20.2 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1

Looking forward to your response!
Thank you!

@LuckyMD
Contributor

LuckyMD commented Jan 31, 2019

Hi,

This is just a short reply as I am away from my laptop until Monday.

There are quite a few issues with pandas 0.24.0 at the moment. The latest scanpy and anndata 0.6.18 have fixed some of them for the scanpy code, but the rpy2 integration still has issues that won’t be fixed until rpy2 3.0.0 is out. Could you check whether you get the same error if you downgrade pandas to 0.23.4? Alternatively, you could upgrade scanpy and anndata, but then I fear there will be issues with the trajectory inference via rpy2 later.

As for the highly_variable_genes() function call, I would set n_top_genes and flavor='cell_ranger'. But that is of course up to you.
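For what it's worth, a minimal sketch of that call (the n_top_genes value below is just a placeholder, not a recommendation):

import scanpy as sc

# Select highly variable genes with the cell_ranger flavor;
# adjust n_top_genes to your dataset.
sc.pp.highly_variable_genes(adata, flavor='cell_ranger', n_top_genes=4000)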

If this does not solve your issue, I will only be able to get back to this properly on Monday. You may have more luck posting a scanpy issue if you need a quicker response.

@jipeifeng
Author

Hi,
Thanks for the response!
I have downgraded pandas to 0.23.4; however, it does not work.
But I figured out where the problem lies.

adata.X /= adata.obs['size_factors'].values[:,None]

This step transforms adata.X into a numpy.matrix.
Before this step, adata.X is
<6242x15065 sparse matrix of type '<class 'numpy.float32'>'
with 19234986 stored elements in Compressed Sparse Row format>

But after performing this step, adata.X is
matrix([[0. , 0. , 0. , ..., 0. , 0. , 0. ],
[0. , 0. , 1.203, ..., 0. , 0. , 0. ],
[0. , 1.096, 0. , ..., 0. , 0. , 0. ],
...,
[0. , 0. , 2.042, ..., 0. , 0. , 0. ],
[0. , 0. , 0. , ..., 0.926, 0. , 0. ],
[0. , 0. , 2.951, ..., 0. , 0. , 0. ]]),

And this format of adata.X is what causes the error in sc.pp.highly_variable_genes. But I don't know how to fix it.

@jipeifeng
Author

Hi, I have fixed the issue.
It appears that adding, subtracting, or dividing numpy.ndarrays by scipy.sparse matrices returns a numpy.matrix. The command numpy_array /= scipy_sparse_matrix changed the type of numpy_array to numpy.matrix, which caused the downstream problems. So you have to convert the matrix back to sparse format for the downstream analysis.
I used the command adata.X = scipy.sparse.csr_matrix(adata.X) after dividing the measured counts by the size factors.
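For reference, a minimal sketch of the normalization step with that fix applied (assuming the size factors are stored in adata.obs['size_factors'] as in the workflow):

import scipy.sparse

# Divide the measured counts by the per-cell size factors (broadcast over genes).
# Depending on the types involved, the result can silently become a numpy.matrix.
adata.X /= adata.obs['size_factors'].values[:, None]

# Convert back to a CSR sparse matrix so that downstream scanpy functions
# receive a type they can handle (np.asarray(adata.X) would give a dense
# ndarray instead).
adata.X = scipy.sparse.csr_matrix(adata.X)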

@LuckyMD
Contributor

LuckyMD commented Feb 1, 2019

Hi,

The workflow does not work with sparse matrices at several points. That is why there are notes in several places saying that dense matrices should be used. In some places rpy2 will cause problems with sparse matrices as well.
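In case it helps, a minimal sketch of how adata.X can be kept as a plain dense ndarray before those steps (assuming it may be either a scipy sparse matrix or a numpy.matrix at that point):

import numpy as np
import scipy.sparse

# Ensure adata.X is a plain dense numpy.ndarray before the combat/rpy2 steps.
if scipy.sparse.issparse(adata.X):
    adata.X = adata.X.toarray()
else:
    adata.X = np.asarray(adata.X)  # also turns a numpy.matrix into an ndarray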

@JuHey

JuHey commented Feb 25, 2019

Hi,

I was working my way through the excellently documented single-cell tutorial using the datasets described in "Case study: Epithelial cells in different intestinal regions by 10x Chromium", but ran into a problem at section 2.4 "Highly Variable Genes". Up to that point, I could reproduce the figures and output from the tutorial. But when invoking:

# Check highly variable gene selection to look at the results
disp_filter = sc.pp.filter_genes_dispersion(adata.X, flavor='cell_ranger', n_top_genes=4000, log=False)
print('Number of highly variable genes: {:d}'.format(adata[:, disp_filter['gene_subset']].n_vars))

the script fails with:


ValueError Traceback (most recent call last)
in
----> 1 sc.pp.highly_variable_genes(adata, flavor='cell_ranger', n_top_genes=4000, inplace = False)

/usr/local/lib/python3.7/site-packages/scanpy/preprocessing/_highly_variable_genes.py in highly_variable_genes(adata, min_disp, max_disp, min_mean, max_mean, n_top_genes, n_bins, flavor, subset, inplace)
141 -np.inf,
142 np.percentile(df['mean'], np.arange(10, 105, 5)),
--> 143 np.inf
144 ])
145 disp_grouped = df.groupby('mean_bin')['dispersion']

/usr/local/lib/python3.7/site-packages/pandas/core/reshape/tile.py in cut(x, bins, right, labels, retbins, precision, include_lowest, duplicates)
232 include_lowest=include_lowest,
233 dtype=dtype,
--> 234 duplicates=duplicates)
235
236 return _postprocess_for_cut(fac, bins, retbins, x_is_series,

/usr/local/lib/python3.7/site-packages/pandas/core/reshape/tile.py in _bins_to_cuts(x, bins, right, labels, precision, include_lowest, dtype, duplicates)
330 raise ValueError("Bin edges must be unique: {bins!r}.\nYou "
331 "can drop duplicate edges by setting "
--> 332 "the 'duplicates' kwarg".format(bins=bins))
333 else:
334 bins = unique_bins

ValueError: Bin edges must be unique: array([-inf, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, inf]).
You can drop duplicate edges by setting the 'duplicates' kwarg

I am working within Jupyter lab under OS X 10.13 with the following modules:

scanpy==1.4 anndata==0.6.16 numpy==1.15.4 scipy==1.2.0 pandas==0.23.4 scikit-learn==0.20.2 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1

I also tried sc.pp.highly_variable_genes(adata, ...) with the same parameters as above on a sparse matrix (generated by adata.X = csr_matrix(adata.X)), but that had the same result.

I would appreciate any hints and suggestions on this. Thank you!

@LuckyMD
Contributor

LuckyMD commented Feb 25, 2019

Hi @JuHey

It looks to me like the c.combat call before the highly variable genes section is not working properly for you. If you get errors there, then they will propagate to the next point and fail there.

Are you working with sparse matrices the whole time? In that case combat won't work. Could you check your adata.X for NA values after the combat output?
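A quick way to check for NA values after the combat step might be something like this (assuming adata.X is a dense ndarray at that point):

import numpy as np

# Count NaN entries in the corrected matrix; anything above zero indicates
# that combat did not run cleanly.
print('NaN entries after combat:', np.isnan(adata.X).sum())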

@JuHey

JuHey commented Feb 26, 2019

Hi @LuckyMD

I am processing according to https://nbviewer.jupyter.org/github/theislab/single-cell-tutorial/blob/master/Case-study_Mouse-intestinal-epithelium.ipynb. The data are all converted to a dense matrix at the point of concatenation and stay dense. Hence, running combat works fine. After combat, adata.X is:

array([[ 3.69658247e-04, nan, nan, ...,
1.03650984e+00, nan, 1.95181963e-04],
[ 3.69658247e-04, nan, nan, ...,
-5.07257088e-03, nan, 1.95181963e-04],
[ 3.69658247e-04, nan, nan, ...,
-5.07257088e-03, nan, 1.95181963e-04],
...,
[-2.42223414e-04, nan, nan, ...,
5.98671474e-01, nan, -5.07604412e-04],
[-2.42223414e-04, nan, nan, ...,
-2.39794287e-02, nan, -5.07604412e-04],
[-2.42223414e-04, nan, nan, ...,
-2.39794287e-02, nan, -5.07604412e-04]])

With NA values...

@LuckyMD
Contributor

LuckyMD commented Feb 27, 2019

I'm guessing that there are no NaNs pre-combat? Could you check whether it works if you replace the c.combat run with a call to sc.pp.combat()?

A couple of possible sources of error in combat are (a quick check for both is sketched after this list):

  • non-categorical batch covariates
  • non-matching indices of the batch covariate and the data (I avoid this by resetting the batch indices)
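A minimal sketch of those two checks, assuming the batch covariate is stored in adata.obs['sample'] as in the tutorial:

# 1) Make sure the batch covariate is categorical.
adata.obs['sample'] = adata.obs['sample'].astype('category')

# 2) Give the observations a clean, unique string index so the batch covariate
#    and the data matrix line up (keep the original barcodes in a column first
#    if you still need them).
adata.obs_names = [str(i) for i in range(adata.n_obs)]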

Are you using your own data, or the case study dataset?

@LuckyMD
Contributor

LuckyMD commented Feb 27, 2019

Actually, the most likely reason for this is that you have constant gene expression values within a batch for a particular gene. You won't be able to fit a gene-specific within-batch variance, and you get 0. This should be fixed in the sc.pp.combat() version.

Alternatively, you can perform stricter gene filtering to avoid this case.
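One way to spot such genes before running combat might look like the following sketch (assuming the batch key is 'sample' and adata.X is dense):

import numpy as np
import scanpy as sc

# Flag genes whose expression has zero variance within at least one batch.
constant_in_a_batch = np.zeros(adata.n_vars, dtype=bool)
for batch in adata.obs['sample'].unique():
    X_b = np.asarray(adata[adata.obs['sample'] == batch].X)
    constant_in_a_batch |= (X_b.var(axis=0) == 0)

print('Genes with zero within-batch variance:', constant_in_a_batch.sum())

# Either drop them directly, or filter genes more strictly beforehand, e.g.:
adata = adata[:, ~constant_in_a_batch].copy()
# sc.pp.filter_genes(adata, min_cells=20)  # min_cells value is just an example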

@JuHey

JuHey commented Feb 27, 2019

Hi @LuckyMD ,

Thank you for your feedback and your suggestions.
I am using the case dataset.
There are no NaNs pre-combat.
I can try optimizing the filtering post 'scran'.
I will try sc.pp.combat() next.

I am just a bit puzzled as I am not aware that I am doing anything different from the tutorial script, and I am using the case dataset processed as in the script. But obviously there is a difference.

@LuckyMD
Contributor

LuckyMD commented Feb 27, 2019

If you are using the case dataset from the tutorial, that is indeed strange. I'm assuming the rest of the steps were followed exactly? I have reproduced the tutorial on quite a few systems, and while I get slight differences in clustering output, I have not seen this before.

Which tool versions are you using?

@JuHey

JuHey commented Feb 27, 2019

Yes, I followed all steps as shown in the notebook. I can reproduce the figures up to the sc.pp.highly_variable_genes call. I am working within Jupyter lab under OS X 10.13 with the following modules:
R is 3.5.2 (via CRAN), Python is 3.7.2 (via Homebrew), rpy2 is 3.0.0; scran: 3.9; Combat: from Maren's github master folder

scanpy==1.4 anndata==0.6.16 numpy==1.15.4 scipy==1.2.0 pandas==0.23.4 scikit-learn==0.20.2 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1

I was wondering whether you could send me a .h5ad of the concatenated dataset processed up to block #8, i.e. before '2. Preprocessing and normalization'. I could then be sure that what I am starting with is not the issue.

@JuHey

JuHey commented Feb 28, 2019

I have now tried sc.pp.combat(adata, key='sample', inplace=True) after the scran-based normalization as in the tutorial. The output is:

Standardizing Data across genes.
found 6 batches
found 0 categorical variables:
Found 10734 genes with zero variance.
Fitting L/S model and finding priors
Finding parametric adjustments
/usr/local/lib/python3.7/site-packages/scanpy/preprocessing/_combat.py:196: RuntimeWarning: invalid value encountered in true_divide
b_prior[i],
/usr/local/lib/python3.7/site-packages/numpy/core/_methods.py:28: RuntimeWarning: invalid value encountered in reduce
return umr_maximum(a, axis, None, out, keepdims, initial)
/usr/local/lib/python3.7/site-packages/scanpy/preprocessing/_combat.py:196: RuntimeWarning: divide by zero encountered in true_divide
b_prior[i],
Adjusting data

Does this in any way correspond to what you find with sc.pp.combat()?

@LuckyMD
Contributor

LuckyMD commented Feb 28, 2019

This does indeed look like it has to do with the starting dataset. The inbuilt combat version has a check for zero-variance genes in each batch. And it looks like you have over 10,000 of them in your dataset.

Another possibility is an issue with the scran step.

Here is a link to the loaded dataset:
https://gigamove.rz.rwth-aachen.de/d/id/wGwWERTU4Be3PU

It should be available for 2 weeks. On my current system, I still have older versions of scanpy, anndata, and rpy2 running... but that shouldn't change anything up to the first scran step.

@LuckyMD
Contributor

LuckyMD commented Feb 28, 2019

Also, are you sure your version of scran is correct? I am using 1.6.9, and the notebook has been reproduced with 1.10.2 (which is the latest version to my understanding).
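If in doubt, the installed scran version can also be queried through the same rpy2 bridge the notebook uses (a small sketch):

import rpy2.robjects as ro

# Ask R which scran version is installed.
print(ro.r('as.character(packageVersion("scran"))')[0])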

@JuHey

JuHey commented Feb 28, 2019

Thank you for making the dataset available! I will test it tonight. Yes, you are correct, the latest scran version is 1.10.2, which is what I am using.

@JuHey

JuHey commented Mar 1, 2019

I just ran the tutorial with the dataset you provided and it worked fine, including sc.pp.highly_variable_genes, using the sc.pp.combat function. Thanks a million!
