A network embedding based method for partial multi-omics integration in cancer subtyping

xuxiaohan/MSNE


MSNE

A Network Embedding Based Method for Partial Multi-Omics Integration in Cancer Subtyping.

Integrative analysis of multiple omics offers the opportunity to uncover coordinated cellular processes acting across different omics layers. The ever-increasing volume of multi-omics data provides comprehensive insight into cancer subtyping. Many multi-omics integrative methods have been developed, but few of them can deal with partial datasets, in which some samples have data for only a subset of the omics. In this study, we propose a partial multi-omics integrative method, MSNE (Multiple Similarity Network Embedding), for cancer subtyping. MSNE integrates multi-omics information by embedding the neighbor relations of samples defined by random walks on multiple similarity networks. We compared MSNE with five existing multi-omics integrative methods on twelve datasets in both full and partial scenarios. MSNE achieved the best results on the pan-cancer and image datasets. Furthermore, on ten cancer subtyping datasets, MSNE obtained the most enriched clinical parameters and comparable log-rank test P-values in survival analysis. In conclusion, MSNE is an effective and efficient integrative method for multi-omics data and is especially powerful on partial datasets.

Highlights

  • Imputation or filtration on partial datasets leads to worse integration performance.

  • We propose MSNE, a network embedding based integration method for partial omics data.

  • MSNE can capture the similarity of samples that do not appear in any common omics.

  • MSNE outperforms other integration methods on both full and partial datasets.

  • MSNE can be used as a feature extraction method for other downstream analyses.

Version

1.0.0

Requirements

gensim<4.0.0

Author

Han Xu, Lin Gao, Mingfeng Huang, Ran Duan.

Maintainer

Han Xu myxuxiaohan@outlook.com

If you have any questions, feel free to contact me.

How to use

MSNE(views, n_clusters=5, k=20, workers=4, walk_length=20, num_walks=100, embed_size=100, window_size=10)

MSNE is a multi-omics integrative clustering method for cancer subtyping, designed especially for partial multi-omics datasets (i.e. datasets in which some samples have only a subset of the omics data). MSNE constructs a similarity network for each omics data type and then embeds the multiple similarity networks into a d-dimensional vector space. Finally, K-means is used to cluster the samples.
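The first two steps described above (one similarity network per omics, then random walks over all networks) can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this repository: the helper names `knn_graph` and `random_walk`, the Euclidean distance, and the rule of picking uniformly among the networks that contain the current sample are all assumptions made for the sketch. The skip-gram embedding and K-means steps are omitted.

```python
import math
import random


def knn_graph(view, k):
    """Build a k-nearest-neighbor graph for one view.

    view: dict mapping sample name -> list of feature values.
    Returns a dict mapping sample name -> list of its k nearest samples.
    Euclidean distance is an assumption of this sketch.
    """
    graph = {}
    for a, xa in view.items():
        dists = sorted((math.dist(xa, xb), b) for b, xb in view.items() if b != a)
        graph[a] = [b for _, b in dists[:k]]
    return graph


def random_walk(graphs, start, walk_length, rng):
    """Walk over several graphs at once.

    At each step, choose uniformly among the graphs in which the current
    sample has neighbors, then move to a random neighbor in that graph.
    This transition rule is hypothetical, chosen only for illustration.
    """
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        candidates = [g for g in graphs if g.get(current)]
        if not candidates:
            break
        graph = rng.choice(candidates)
        walk.append(rng.choice(graph[current]))
    return walk


# Two toy views of a partial dataset: s1 is missing from view2, s4 from view1.
view1 = {"s1": [0.0, 0.0], "s2": [0.1, 0.0], "s3": [5.0, 5.0]}
view2 = {"s2": [1.0], "s3": [1.1], "s4": [9.0]}

graphs = [knn_graph(view1, k=1), knn_graph(view2, k=1)]
rng = random.Random(0)
samples = sorted(set(view1) | set(view2))
walks = [random_walk(graphs, s, walk_length=5, rng=rng)
         for s in samples for _ in range(20)]
```

In this toy data, s1 and s4 never appear in the same view, yet walks that start at s4 can reach s1 through s3 and s2. Such co-occurrences within walks are what a skip-gram model would then turn into nearby embedding vectors.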

:param views: list of pandas.DataFrame (i.e. the omics data). Each row is a sample and each column is a feature. The index of each DataFrame is used as the sample names.

:param n_clusters: int, default 5. The number of clusters for K-means.

:param k: int, default 20. The k nearest neighbors of each node are treated as its local neighbors.

:param workers: int, default 4. The number of parallel threads.

:param walk_length: int, default 20. The length of the sequences generated by random walks on the multiple networks.

:param num_walks: int, default 100. Starting from each node, MSNE generates 'num_walks' sequences.

:param embed_size: int, default 100. The dimension of the embedding vectors.

:param window_size: int, default 10. The window size in skip-gram.

:return: A dict with the elements:
    embeddings: pandas.DataFrame, the low-dimensional vector representation of each sample.
    group: pandas.DataFrame, the cluster assignment of each sample.

example:

    import pandas as pd
    from embedding import MSNE

    view1=pd.read_csv("../data/handwritten/mfeat-fou.csv", index_col=0)
    view2=pd.read_csv("../data/handwritten/mfeat-pix.csv", index_col=0)

    #apply MSNE on the multi-view dataset.
    result=MSNE([view1,view2],
                n_clusters=10, k=20, workers=4,
                walk_length=20, num_walks=20,
                embed_size=100, window_size=10)

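    #sort the samples by name (samples = all sample names that appear in any view)
    samples=sorted(set(view1.index)|set(view2.index))
    embeddings=result["embeddings"].reindex(samples)
    group=result["group"].reindex(samples).values.reshape(-1)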

Description of data

We have migrated all datasets to MSNE_dataset on figshare, because sharing large files on GitHub is inconvenient.

pancancer dataset

We downloaded omics data for samples of five cancer types from TCGA. The features in each omics were mapped to genes, and the samples of these cancers were concatenated. We have already removed duplicate measurements and samples that were partial in the raw data. The mRNA data are in log scale, and the top 5000 most variable features were selected. After this preprocessing, we obtained pancancer.zip, which contains the full multi-omics data. The z-score transformation was not applied to it, because z-scores should be computed after generating the simulated partial datasets. When integrating different omics, KNN imputation should be applied to each dataset (in our paper, to compare with other methods that cannot handle partial datasets, we performed imputation on the partial samples for those methods). Because there are many combinations of removal thresholds across the different omics, the simulated partial datasets total about 37GB, so we do not provide them. You can generate all of the simulated partial multi-omics datasets from pancancer.zip.
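The order of operations described above (select the most variable features on the full data, remove samples, then z-score what remains) can be illustrated with a small stdlib-only sketch. The helper names `top_variance_features` and `zscore` are hypothetical. The sketch also assumes that "variant features" means ranking by variance and that the z-score uses the sample standard deviation (as pandas does by default); the text above does not specify either.

```python
import statistics


def top_variance_features(columns, n):
    """Return the names of the n features with the largest variance.

    columns: dict mapping feature name -> list of values (one per sample).
    """
    ranked = sorted(columns,
                    key=lambda name: statistics.variance(columns[name]),
                    reverse=True)
    return ranked[:n]


def zscore(values):
    """Standardize values to mean 0 and sample standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Toy full dataset: three features measured on four samples.
columns = {
    "g1": [1.0, 1.0, 1.0, 1.0],
    "g2": [0.0, 5.0, 10.0, 15.0],
    "g3": [1.0, 2.0, 3.0, 4.0],
}

# 1. Feature selection on the full data.
selected = top_variance_features(columns, n=2)

# 2. Simulate a partial dataset by dropping the last sample.
kept_rows = [0, 1, 2]
partial = {name: [columns[name][i] for i in kept_rows] for name in selected}

# 3. z-score only after the samples were removed.
scaled = {name: zscore(values) for name, values in partial.items()}
```

Computing the z-score after removal means the mean and standard deviation come only from the samples that are still present in that omics.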

image dataset

The image dataset was downloaded from the multi view dataset. We did not apply any preprocessing to it, except for removing some samples to simulate partial datasets.

cancer subtyping

We used ten datasets from a review of multi-omics integrative methods. The datasets contain mRNA expression data, DNA methylation data, microRNA expression data, and clinical information for a total of 4941 samples from ten cancer types. All of these datasets are partial multi-omics data. They were downloaded from http://acgt.cs.tau.ac.il/multi_omic_benchmark/download.html. We regard the first three fields of a sample name as the unique ID of the sample, and removed duplicated measurements of the same sample (e.g. 'TCGA.E2.A15K.11', 'TCGA.E2.A15K.06' and 'TCGA.E2.A15K.01' were regarded as the same sample, and only the first one was retained).
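The deduplication rule above can be sketched as follows. The helper names `sample_id` and `drop_duplicate_measurements` are hypothetical, the sketch assumes "the first one" means the first in input order, and the last sample name in the list is made up for illustration.

```python
def sample_id(name, n_fields=3, sep="."):
    """Use the first n_fields fields of the sample name as the unique ID."""
    return sep.join(name.split(sep)[:n_fields])


def drop_duplicate_measurements(names):
    """Keep only the first measurement seen for each sample ID."""
    seen = set()
    kept = []
    for name in names:
        sid = sample_id(name)
        if sid not in seen:
            seen.add(sid)
            kept.append(name)
    return kept


names = [
    "TCGA.E2.A15K.11",
    "TCGA.E2.A15K.06",
    "TCGA.E2.A15K.01",
    "TCGA.XX.0000.01",  # made-up name, for illustration only
]
kept = drop_duplicate_measurements(names)
```

With pandas, a similar effect could be obtained by deriving the ID from the index and keeping the first row per ID.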

additional information

To make the results in our paper easier to reproduce, we used pandas.DataFrame.sample to generate the simulated partial data, with random_state set to hash(f"{omic}_{int(p*10)}")%1000.

For example:

    import pandas as pd

    # Randomly remove 30% of the samples from the mRNA data.
    df=pd.read_csv(....)  # the mRNA data; each row is a sample, each column is a feature
    df1=df.sample(frac=0.7,random_state=hash(f"mrna_3")%1000)
    df1.to_csv("the_partial_mrna_data.csv")

    # Randomly remove 30% of the samples from the pixel view in the image data.
    df=pd.read_csv(....)  # the pix view data; each row is a sample, each column is a feature
    df1=df.sample(frac=0.7,random_state=hash(f"mfeat-pix_3")%1000)
    df1.to_csv("mfeat-pix_partial_data.csv")
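One caveat when rerunning the snippets above: since Python 3.3, the built-in hash() of a str is salted per process by default, so hash("mrna_3")%1000 can differ between runs unless the PYTHONHASHSEED environment variable is set to a fixed value. If you only need seeds that are stable across runs (not the exact seeds used in the paper), a digest-based seed is one option. The helper `stable_seed` below is hypothetical and is not part of this repository.

```python
import hashlib


def stable_seed(key, modulus=1000):
    """Derive a run-independent seed from a string key.

    This is an alternative to hash(key) % modulus. It does NOT reproduce
    the seeds obtained from the built-in hash(); it only guarantees the
    same value on every run and every machine.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % modulus


# Same key format as above: "<omic>_<removed fraction in tenths>".
seed = stable_seed("mrna_3")

# The seed could then be passed to pandas, for example:
#     df1 = df.sample(frac=0.7, random_state=seed)
```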
