# Failure to reproduce the paper #5

Open · Starlitnightly opened this issue Feb 22, 2024 · 4 comments

@Starlitnightly commented:
Hi,

I am using the pancreas dataset with all parameters at their defaults; why am I unable to reproduce the results of the paper? The code I used consists entirely of functions from TFvelo_run_demo.py and TFvelo_analysis_demo.py.

Can you provide a jupyter tutorial to make it easy for us to see the intermediate analysis results?

Sincerely,

Zehua

[screenshot of results attached]
@Starlitnightly (Author) commented:

All the code can be found below:

```python
import numpy as np  # needed below when building the 'total' layer
import omicverse as ov
import scanpy as sc
import scvelo as scv
import TFvelo as Tfv

ov.plot_set()
adata = scv.datasets.pancreas()
adata
```
```
AnnData object with n_obs × n_vars = 3696 × 27998
    obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score'
    var: 'highly_variable_genes'
    uns: 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced'
    obsp: 'distances', 'connectivities'
```
if "spliced" in adata.layers:
    adata.layers["total"] = adata.layers["spliced"].todense() + adata.layers["unspliced"].todense()
elif "new" in adata.layers:
    adata.layers["total"] = np.array(adata.layers["total"].todense())
else:
    adata.layers["total"] = adata.X
adata.layers['count']=adata.X.copy()
adata.layers["total_raw"] = adata.layers["total"].copy()
n_cells, n_genes = adata.X.shape
sc.pp.filter_genes(adata, min_cells=int(n_cells/50))
sc.pp.filter_cells(adata, min_genes=int(n_genes/50))
Tfv.pp.filter_and_normalize(adata, min_shared_counts=20, 
                                  n_top_genes=2000, log=True,subset_highly_variable=False) #include the following steps
adata.X = adata.layers["total"].copy()
```
filtered out 17243 genes that are detected in less than 73 cells
Normalized count data: X, spliced, unspliced, total.
Extracted 2000 highly variable genes.
Logarithmized X.
```
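As a side note on the `total` layer built above: `.todense()` on a scipy sparse matrix returns an `np.matrix`, which some downstream tools handle poorly. A minimal sparse-safe sketch (my own variant, not part of the TFvelo demo):

```python
import numpy as np
import scipy.sparse as sp

def to_dense_array(layer):
    """Return a plain 2-D ndarray regardless of sparse/dense input."""
    return np.asarray(layer.todense()) if sp.issparse(layer) else np.asarray(layer)

adata.layers["total"] = to_dense_array(adata.layers["spliced"]) \
                      + to_dense_array(adata.layers["unspliced"])
```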
```python
adata.raw = adata.copy()
adata = adata[:, adata.var['highly_variable'] == True]
adata
```
```
View of AnnData object with n_obs × n_vars = 3696 × 2000
    obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score', 'n_genes', 'initial_size_spliced', 'initial_size_unspliced', 'initial_size_total', 'initial_size', 'n_counts'
    var: 'highly_variable_genes', 'n_cells', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced', 'total', 'count', 'total_raw'
    obsp: 'distances', 'connectivities'
```
```python
adata.var_names = [name.upper() for name in adata.var_names]
adata.var_names_make_unique()
adata.obs_names_make_unique()

Tfv.pp.moments(adata, n_pcs=30, n_neighbors=15)

Tfv.pp.get_TFs(adata, databases='ENCODE ChEA')
```
```
WARNING: The neighbor graph has an unexpected format (e.g. computed outside scvelo)
or is corrupted (e.g. due to subsetting). Consider recomputing with `pp.neighbors`.
computing moments based on connectivities
    finished (0:00:00) --> added
    'M_total', moments of total abundances (adata.layers)
Get TFs according to ENCODE ChEA
max_n_TF: 29
mean_n_TF: 12.3345
gene num of 0 TF: 109
total num of TFs: 46
```
```python
import os
import numpy as np

adata.uns['genes_pp'] = np.array(adata.var_names)

n_jobs_max = np.max([int(os.cpu_count()/2), 1])
n_jobs = 8
if n_jobs >= 1:
    n_jobs = np.min([n_jobs, n_jobs_max])
else:
    n_jobs = n_jobs_max
print('n_jobs:', n_jobs)

flag = Tfv.tl.recover_dynamics(adata, n_jobs=n_jobs, max_iter=20, var_names='all',
    WX_method='lsq_linear', WX_thres=20, max_n_TF=99, n_top_genes=2000,
    fit_scaling=True, use_raw=0, init_weight_method='correlation',
    n_time_points=1000)
adata.write('data/tfvelo_rc.h5ad', compression='gzip')
```
```python
def get_pseudotime(adata):
    Tfv.tl.velocity_graph(adata, basis=None, vkey='velocity', xkey='M_total')
    Tfv.tl.velocity_pseudotime(adata, vkey='velocity', modality='M_total')
    Tfv.pl.scatter(adata, basis='X_umap', color='velocity_pseudotime',
                   cmap='gnuplot', fontsize=12, save='pseudotime')
    return adata

def check_data_type(adata):
    for key in list(adata.var):
        if adata.var[key][0] in ['True', 'False']:
            print('Checking', key)
            adata.var[key] = adata.var[key].map({'True': True, 'False': False})
    return
```
```python
adata.var_names_make_unique()
check_data_type(adata)
print(adata)

losses = adata.varm['loss'].copy()
losses[np.isnan(losses)] = 1e6
adata.var['min_loss'] = losses.min(1)

n_cells = adata.shape[0]
expanded_scaling_y = np.expand_dims(np.array(adata.var['fit_scaling_y']), 0).repeat(n_cells, axis=0)
adata.layers['velocity'] = adata.layers['velo_hat'] / expanded_scaling_y
```
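The explicit `expand_dims(...).repeat(...)` materializes a full (n_cells × n_genes) matrix just to divide by a per-gene scalar; NumPy broadcasting does the same thing without the copy. An equivalent sketch (my rewrite, not from the demo):

```python
# Broadcast the per-gene scaling across cells; rows are cells, columns are genes.
scaling_y = np.asarray(adata.var['fit_scaling_y'])             # shape (n_genes,)
adata.layers['velocity'] = adata.layers['velo_hat'] / scaling_y[None, :]
```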

```python
if 'X_pca' not in adata.obsm.keys():
    print('PCA ing')
    sc.tl.pca(adata, n_comps=50, svd_solver='arpack')
scv.pp.moments(adata, n_pcs=5, n_neighbors=30)

adata = get_pseudotime(adata)
```
```
AnnData object with n_obs × n_vars = 3696 × 2000
    obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score', 'n_genes', 'initial_size_spliced', 'initial_size_unspliced', 'initial_size_total', 'initial_size', 'n_counts', 'velocity_self_transition', 'root_cells', 'end_points', 'velocity_pseudotime'
    var: 'highly_variable_genes', 'n_cells', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'highly_variable', 'n_TFs', 'fit_alpha', 'fit_beta', 'fit_omega', 'fit_theta', 'fit_gamma', 'fit_delta', 'fit_likelihood', 'fit_varx', 'fit_scaling_y', 'min_loss'
    uns: 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca', 'all_TFs', 'genes_pp', 'recover_dynamics', 'velocity_graph', 'velocity_graph_neg', 'velocity_params'
    obsm: 'X_pca', 'X_umap'
    varm: 'TFs', 'TFs_id', 'TFs_times', 'TFs_correlation', 'knockTF_Log2FC', 'fit_scaling', 'fit_weights', 'fit_weights_init', 'fit_weights_final', 'loss'
    layers: 'spliced', 'unspliced', 'total', 'count', 'total_raw', 'M_total', 'velo_hat', 'velo_t', 'y_t', 'velo_normed', 'filtered', 'WX', 'y', 'fit_t_raw', 'fit_t', 'velocity', 'Ms', 'Mu'
    obsp: 'distances', 'connectivities'
computing moments based on connectivities
    finished (0:00:00) --> added
    'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)
computing velocity graph (using 1/8 cores)
  0%|          | 0/3696 [00:00<?, ?cells/s]
    finished (0:00:06) --> added
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)
saving figure to file ./figures/TFvelo_pseudotime.png
```


```python
def data_type_tostr(adata, key=None):
    if key is None:
        for key in list(adata.var):
            if adata.var[key][0] in [True, False]:
                print('Transferring', key)
                adata.var[key] = adata.var[key].map({True: 'True', False: 'False'})
    elif key in adata.var.keys():
        if adata.var[key][0] in [True, False]:
            print('Transferring', key)
            adata.var[key] = adata.var[key].map({True: 'True', False: 'False'})
    return
```
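For context, this helper exists because `.write()` can choke on object-dtype `.var` columns that mix Python bools with other values; encoding them as strings sidesteps that. An alternative I'd consider (an assumption on my part, not from the demo) is to cast such columns to a proper boolean dtype instead:

```python
# Cast object-dtype boolean columns to real bools so h5ad can store them natively.
for key in adata.var.columns:
    if adata.var[key].dtype == object and set(adata.var[key].dropna()) <= {True, False}:
        adata.var[key] = adata.var[key].astype(bool)
```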
```python
def get_sort_positions(arr):
    positions = np.argsort(np.argsort(arr))
    positions_normed = positions / (len(arr) - 1)
    return positions_normed
```
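The double `argsort` maps each value to its rank, and dividing by `len(arr) - 1` puts the ranks on [0, 1]. A quick worked example:

```python
import numpy as np

arr = np.array([0.3, 0.1, 0.9])
ranks = np.argsort(np.argsort(arr))   # -> array([1, 0, 2]): 0.1 is smallest, 0.9 largest
ranks / (len(arr) - 1)                # -> array([0.5, 0. , 1. ])
```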


```python
def get_metric_pseudotime(adata, t_key='latent_t'):
    n_cells, n_genes = adata.shape
    adata.var['spearmanr_pseudotime'] = 0.0
    for i in range(n_genes):
        correlation, _ = scipy.stats.spearmanr(adata.layers[t_key][:, i],
                                               adata.obs['velocity_pseudotime'])
        adata.var['spearmanr_pseudotime'][i] = correlation
    return adata
```
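Since `spearmanr` is called once per gene, this loop is slow for 2,000 genes. Spearman correlation is just Pearson correlation on ranks, so a vectorized sketch (my own, assuming `latent_t` has no NaNs for the genes kept) would be:

```python
from scipy.stats import rankdata

ranks_t = rankdata(adata.layers['latent_t'], axis=0)           # per-gene ranks over cells
ranks_p = rankdata(adata.obs['velocity_pseudotime'].values)    # pseudotime ranks
ranks_t = (ranks_t - ranks_t.mean(0)) / ranks_t.std(0)         # standardize columns
ranks_p = (ranks_p - ranks_p.mean()) / ranks_p.std()
adata.var['spearmanr_pseudotime'] = (ranks_t * ranks_p[:, None]).mean(0)
```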
```python
def get_sort_t(adata):
    t = adata.layers['fit_t_raw'].copy()
    normed_t = adata.layers['fit_t_raw'].copy()
    n_bins = 20
    n_cells, n_genes = adata.shape
    sort_t = np.zeros([n_cells, n_genes])
    non_blank_gene = np.zeros(n_genes, dtype=int)
    hist_all, bins_all = np.zeros([n_genes, n_bins]), np.zeros([n_genes, n_bins+1])
    for i in range(n_genes):
        gene_name = adata.var_names[i]
        tmp = t[:, i].copy()
        if np.isnan(tmp).sum():
            non_blank_gene[i] = 1
            continue
        hist, bins = np.histogram(tmp, bins=n_bins)
        hist_all[i], bins_all[i] = hist, bins
        if not (0 in list(hist)):
            if (tmp.min() < 0.1) and (tmp.max() > 0.8):
                blank_start_bin_id = np.argmin(hist)
                blank_end_bin_id = blank_start_bin_id
                non_blank_gene[i] = 1
                blank_start_bin = bins[blank_start_bin_id]
                blank_end_bin = bins[blank_end_bin_id]
                tmp = (tmp < blank_start_bin)*1 + tmp
            else:
                blank_end_bin = tmp.min()
        else:
            blank_start_bin_id = list(hist).index(0)
            for j in range(blank_start_bin_id+1, len(hist)):
                if hist[j] > 0:
                    blank_end_bin_id = j
                    break
            blank_start_bin = bins[blank_start_bin_id]
            blank_end_bin = bins[blank_end_bin_id]
            tmp = (tmp < blank_start_bin)*1 + tmp

        t[:, i] = tmp
        tmp = tmp - blank_end_bin
        tmp = tmp/tmp.max()
        normed_t[:, i] = tmp
        sort_t[:, i] = get_sort_positions(tmp)

    adata.layers['latent_t'] = sort_t.copy()
    adata.var['non_blank_gene'] = non_blank_gene.copy()
    return adata
```
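The core trick here is unwrapping a circular per-gene time: the histogram locates an empty "gap" in [0, 1], and every cell below the gap is shifted up by one period so the times become monotone. A toy illustration of that gap-shift (hypothetical values, my own sketch):

```python
import numpy as np

t = np.array([0.9, 0.95, 0.05, 0.1, 0.15])    # phase wraps past 1.0
hist, bins = np.histogram(t, bins=20)
gap_start = bins[list(hist).index(0)]          # left edge of the first empty bin
t_unwrapped = (t < gap_start) * 1 + t          # -> [0.9, 0.95, 1.05, 1.1, 1.15]
```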
```python
adata_copy.var['non_blank_gene'].fillna(0)
```

```
NCOA2      0
SBSPON     0
MCM3       0
UGGT1      0
FHL2       0
          ..
RAI2       1
AP1S2      1
TMEM27     0
DDX3Y      1
EIF2S3Y    0
Name: non_blank_gene, Length: 2000, dtype: int64
```
```python
import anndata as ad
import scanpy as sc
import numpy as np
import scipy.stats  # scipy.stats must be imported explicitly for scipy.stats.spearmanr
import matplotlib

adata_copy = adata.copy()
adata_copy = get_sort_t(adata_copy)

adata_copy_1 = adata_copy.copy()
data_type_tostr(adata_copy_1)
adata_copy_1.var['non_blank_gene'] = adata_copy_1.var['non_blank_gene'].fillna(0)
print(adata_copy_1)
adata_copy_1.write('data/tfvelo_rc1.h5ad')
```

```python
thres_loss = np.percentile(adata_copy.var['min_loss'], 50)
adata_copy = adata_copy[:, adata_copy.var['min_loss'] < thres_loss]

thres_n_cells = adata_copy.X.shape[0] * 0.1
adata_copy = adata_copy[:, adata_copy.var['n_cells'] > thres_n_cells]

adata_copy = adata_copy[:, adata_copy.var['non_blank_gene'] == 0]

adata_copy = get_metric_pseudotime(adata_copy)
adata_copy = adata_copy[:, adata_copy.var['spearmanr_pseudotime'] > 0.8]
```
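Since the number of cells doesn't change here, the first three gene filters can equivalently be combined into a single boolean mask (a sketch of my own with the same thresholds; the Spearman filter still has to follow, because that metric is computed afterwards):

```python
mask = ((adata_copy.var['min_loss'] < thres_loss)
        & (adata_copy.var['n_cells'] > thres_n_cells)
        & (adata_copy.var['non_blank_gene'] == 0))
adata_copy = adata_copy[:, mask].copy()   # .copy() avoids working on a chained view
```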
```
Transferring highly_variable_genes
Transferring highly_variable
Transferring non_blank_gene
AnnData object with n_obs × n_vars = 3696 × 2000
    obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score', 'n_genes', 'initial_size_spliced', 'initial_size_unspliced', 'initial_size_total', 'initial_size', 'n_counts', 'velocity_self_transition', 'root_cells', 'end_points', 'velocity_pseudotime'
    var: 'highly_variable_genes', 'n_cells', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'highly_variable', 'n_TFs', 'fit_alpha', 'fit_beta', 'fit_omega', 'fit_theta', 'fit_gamma', 'fit_delta', 'fit_likelihood', 'fit_varx', 'fit_scaling_y', 'min_loss', 'non_blank_gene'
    uns: 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca', 'all_TFs', 'genes_pp', 'recover_dynamics', 'velocity_graph', 'velocity_graph_neg', 'velocity_params'
    obsm: 'X_pca', 'X_umap'
    varm: 'TFs', 'TFs_id', 'TFs_times', 'TFs_correlation', 'knockTF_Log2FC', 'fit_scaling', 'fit_weights', 'fit_weights_init', 'fit_weights_final', 'loss'
    layers: 'spliced', 'unspliced', 'total', 'count', 'total_raw', 'M_total', 'velo_hat', 'velo_t', 'y_t', 'velo_normed', 'filtered', 'WX', 'y', 'fit_t_raw', 'fit_t', 'velocity', 'latent_t'
    obsp: 'distances', 'connectivities'
```
```python
def show_adata(adata, save_name, show_all=0):
    if show_all:
        for i in range(int((len(adata.var_names)-1)/20) + 1):
            genes2show = adata.var_names[i*20: (i+1)*20]
            Tfv.pl.velocity(adata, genes2show, ncols=4, add_outline=True, layers='na',
                            dpi=300, fontsize=15, save='WX_y_'+save_name+'_'+str(i))  # layers='all'
    if len(adata.obs['clusters'].cat.categories) > 10:
        legend_loc = 'right margin'
    else:
        legend_loc = 'on data'
    cutoff_perc = 20
    Tfv.pl.velocity_embedding_stream(adata, vkey='velocity', use_derivative=False, density=2,
        basis='X_umap', cutoff_perc=cutoff_perc, smooth=0.5, fontsize=20, recompute=True,
        legend_loc=legend_loc, save='embedding_stream_'+save_name)
    return
```
```python
Tfv.tl.velocity_graph(adata_copy, basis=None, vkey='velocity', xkey='M_total')
adata_copy.uns['clusters_colors'] = adata.uns['clusters_colors']
show_adata(adata_copy, save_name='velo', show_all=1)

data_type_tostr(adata_copy)
print(adata_copy)
adata_copy.var['non_blank_gene'] = adata_copy.var['non_blank_gene'].fillna(0)
adata_copy.write('data/tfvelo_TFvelo.h5ad')
```
```
computing velocity graph (using 1/8 cores)
  0%|          | 0/3696 [00:00<?, ?cells/s]
    finished (0:00:00) --> added
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)
saving figure to file ./figures/TFvelo_WX_y_velo_0.png
saving figure to file ./figures/TFvelo_WX_y_velo_1.png
computing velocity embedding
    finished (0:00:00) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)
saving figure to file ./figures/TFvelo_embedding_stream_velo.png
AnnData object with n_obs × n_vars = 3696 × 28
    obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score', 'n_genes', 'initial_size_spliced', 'initial_size_unspliced', 'initial_size_total', 'initial_size', 'n_counts', 'velocity_self_transition', 'root_cells', 'end_points', 'velocity_pseudotime'
    var: 'highly_variable_genes', 'n_cells', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'highly_variable', 'n_TFs', 'fit_alpha', 'fit_beta', 'fit_omega', 'fit_theta', 'fit_gamma', 'fit_delta', 'fit_likelihood', 'fit_varx', 'fit_scaling_y', 'min_loss', 'non_blank_gene', 'spearmanr_pseudotime'
    uns: 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca', 'all_TFs', 'genes_pp', 'recover_dynamics', 'velocity_graph', 'velocity_graph_neg', 'velocity_params'
    obsm: 'X_pca', 'X_umap', 'velocity_umap'
    varm: 'TFs', 'TFs_id', 'TFs_times', 'TFs_correlation', 'knockTF_Log2FC', 'fit_scaling', 'fit_weights', 'fit_weights_init', 'fit_weights_final', 'loss'
    layers: 'spliced', 'unspliced', 'total', 'count', 'total_raw', 'M_total', 'velo_hat', 'velo_t', 'y_t', 'velo_normed', 'filtered', 'WX', 'y', 'fit_t_raw', 'fit_t', 'velocity', 'latent_t'
    obsp: 'distances', 'connectivities'
```

@hyg200100 commented:

Dear Starlitnightly,

I hope this message finds you well. I am writing to inquire about the installation process for the TFvelo code. I have been trying to import the TFvelo module using "import TFvelo as TFv" in my Python script, but it appears that the TFvelo module is not available for installation through standard channels like pip.

I have searched for the module on PyPI and other package repositories, but couldn't find a matching distribution. Could you please provide guidance on how to obtain and install the TFvelo module?

I have already obtained the TFvelo code from the repository [repository URL] and downloaded it to my local machine. However, I am unsure of the correct installation procedure. Could you kindly provide instructions on how to install TFvelo or any specific dependencies that need to be satisfied beforehand?

Thank you very much for your time and attention. I greatly appreciate any assistance you can provide in helping me install TFvelo and utilize its functionalities effectively.

Looking forward to your kind response.

Best regards, hyg

@lijc0804 (Collaborator) commented May 4, 2024:

Hi Zehua,

We have provided a Jupyter tutorial at https://github.com/xiaoyeye/TFvelo/blob/main/TFvelo_demo.ipynb for reproducing the results on the pancreas dataset in the paper.

@Starlitnightly (Author) commented:

> Hi Zehua,
>
> We have provided a Jupyter tutorial at https://github.com/xiaoyeye/TFvelo/blob/main/TFvelo_demo.ipynb for reproducing the results on the pancreas dataset in the paper.

Thanks for the reply, but you don't seem to have provided the h5ad file for the preprocessed pancreas data. I've carefully compared my code with the code in the ipynb you provided, and there doesn't seem to be any difference. Is it possible that different preprocessing could lead to very different results?

```python
import anndata as ad
import TFvelo as TFv

result_path = 'TFvelo_pancreas_demo2/'
n_jobs = 28
adata = ad.read_h5ad(result_path + 'pp.h5ad')

TFv.tl.recover_dynamics(adata, n_jobs=n_jobs, max_iter=20, var_names='all',
        WX_method='lsq_linear', WX_thres=20, max_n_TF=99, n_top_genes=2000,
        fit_scaling=True, use_raw=0, init_weight_method="correlation",
        n_time_points=1000)

adata.write(result_path + 'rc.h5ad')
```
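For reference, a quick sanity check to see whether two preprocessed objects actually differ (a sketch with placeholder paths; assumes the `total` layer is stored dense):

```python
import anndata as ad
import numpy as np

a = ad.read_h5ad('data/my_pp.h5ad')                    # hypothetical local preprocessing
b = ad.read_h5ad('TFvelo_pancreas_demo2/pp.h5ad')      # authors' preprocessed object

print(a.shape == b.shape)                              # same cells/genes kept?
print(a.var_names.equals(b.var_names))                 # same gene set and order?
if a.shape == b.shape:
    diff = np.abs(np.asarray(a.layers['total']) - np.asarray(b.layers['total']))
    print('max |total| difference:', diff.max())       # any numerical drift?
```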
