MSNE

A Network Embedding Based Method for Partial Multi-Omics Integration in Cancer Subtyping.

Integrative analysis of multiple omics offers the opportunity to uncover coordinated cellular processes acting across different omics layers. The ever-increasing volume of multi-omics data provides comprehensive insight into cancer subtyping. Many multi-omics integrative methods have been developed, but few of them can deal with partial datasets, in which some samples have data for only a subset of the omics. In this study, we propose a partial multi-omics integrative method, MSNE (Multiple Similarity Network Embedding), for cancer subtyping. MSNE integrates multi-omics information by embedding the neighbor relations of samples defined by random walks on multiple similarity networks. We compared MSNE with five existing multi-omics integrative methods on twelve datasets in both full and partial scenarios. MSNE achieved the best results on the pan-cancer and image datasets. Furthermore, on ten cancer subtyping datasets, MSNE obtained the most enriched clinical parameters and comparable log-rank test P-values in survival analysis. In conclusion, MSNE is an effective and efficient integrative method for multi-omics data and is especially powerful on partial datasets.

Highlights

Imputation or filtration on partial datasets leads to worse integration performance.
We propose MSNE, a network-embedding-based integration method for partial omics data.
MSNE can capture the similarity of samples that do not appear in any common omics.
MSNE outperforms other integration methods on both full and partial datasets.
MSNE can be used as a feature extraction method for other downstream analyses.

Version

1.0.0

Requirements

gensim<4.0.0

Author

Han Xu, Lin Gao, Mingfeng Huang, Ran Duan.

Maintainer

Han Xu myxuxiaohan@outlook.com

If you have any questions, feel free to contact me.

How to use

MSNE(views, n_clusters=5, k=20, workers=4, walk_length=20, num_walks=100, embed_size=100, window_size=10)

MSNE is a multi-omics integrative clustering method for cancer subtyping, intended especially for partial multi-omics datasets (i.e. datasets in which some samples have only a subset of the omics data). MSNE constructs a similarity network for each omics and then embeds the multiple similarity networks into a d-dimensional vector space. Finally, K-means is used to cluster the samples.

:param views: list of pandas.DataFrame (i.e. the omics data). Each row is a sample and each column is a feature. The index of each DataFrame is used as the sample names.

:param n_clusters: int, default 5. The number of clusters for K-means.

:param k: int, default 20. The top k neighbors of each node are treated as its local neighbors.

:param workers: int, default 4. The number of parallel threads.

:param walk_length: int, default 20. The length of the sequences generated by random walks on the multiple networks.

:param num_walks: int, default 100. Starting from each node, MSNE generates 'num_walks' sequences.

:param embed_size: int, default 100. The dimension of the embedding vectors.

:param window_size: int, default 10. The window size used in skip-gram.

:return: A dict with the elements:
    embeddings: pandas.DataFrame, the low-dimensional vector representation of each sample.
    group: pandas.DataFrame, the clustering of the samples.
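
The description above (one similarity network per omics, random walks across the networks, then skip-gram embedding) can be illustrated with a small sketch. This is not the MSNE implementation: the function names `knn_graph` and `random_walk`, the Euclidean distance, and the rule "pick uniformly among the networks that contain the current node" are all assumptions made for illustration. It only shows how two samples that share no omics (`s1` and `s4` here) can still end up in the same walk through samples that appear in both networks.

```python
import math
import random


def knn_graph(view, k):
    """Hypothetical helper: map each sample to its k nearest samples (Euclidean)."""
    graph = {}
    for name, vec in view.items():
        dists = [(math.dist(vec, other_vec), other)
                 for other, other_vec in view.items() if other != name]
        dists.sort()
        graph[name] = [other for _, other in dists[:k]]
    return graph


def random_walk(graphs, start, walk_length, rng):
    """Hypothetical helper: walk over several kNN graphs.

    At each step, one graph that contains the current node is chosen
    uniformly (an assumption), then one of its neighbors in that graph.
    """
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        candidates = [g for g in graphs if g.get(current)]
        if not candidates:
            break
        graph = rng.choice(candidates)
        walk.append(rng.choice(graph[current]))
    return walk


# Two toy views: s1 is only in view_a, s4 is only in view_b.
view_a = {"s1": [0.0, 0.0], "s2": [1.0, 0.0], "s3": [3.0, 0.0]}
view_b = {"s2": [0.0], "s3": [1.0], "s4": [5.0]}

graphs = [knn_graph(view_a, k=2), knn_graph(view_b, k=2)]
nodes = sorted(set(view_a) | set(view_b))
rng = random.Random(0)
walks = [random_walk(graphs, node, 5, rng) for node in nodes for _ in range(10)]
```

Sequences like `walks` are the kind of input a skip-gram model consumes; the embedding step itself is left out of this sketch.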

example:

    import pandas as pd
    from embedding import MSNE

    view1=pd.read_csv("../data/handwritten/mfeat-fou.csv", index_col=0)
    view2=pd.read_csv("../data/handwritten/mfeat-pix.csv", index_col=0)

    #apply MSNE on the multi-view dataset.
    result=MSNE([view1,view2],
                n_clusters=10, k=20, workers=4,
                walk_length=20, num_walks=20,
                embed_size=100, window_size=10)

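    # sort the samples by name (assumed here: the sorted union of all sample names)
    samples = sorted(set(view1.index) | set(view2.index))
    embeddings = result["embeddings"].reindex(samples)
    group = result["group"].reindex(samples).values.reshape(-1)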

Description of data

We have migrated all datasets to MSNE_dataset on figshare, because sharing large files on GitHub was not convenient.

pancancer dataset

We downloaded omics data for five cancer types from TCGA. The features in each omics were mapped to genes, and the samples of these cancers were concatenated. We have already removed the duplicate measurements and the samples that were partial in the raw data. The mRNA data are in log scale, and the top 5000 most variable features were selected. After this preprocessing, we obtained pancancer.zip, which contains the full multi-omics data. The z-score transformation was not applied to it, because z-scores should be computed after the simulated partial datasets are generated. When integrating different omics, KNN imputation should be applied to each dataset (in our paper, to allow comparison with methods that cannot be applied to partial datasets, we performed imputation of the partial samples for those methods). Because there are too many combinations of removal thresholds across the different omics, the total size of the simulated partial datasets was about 37GB, so we do not provide them. You can generate all of these simulated partial multi-omics datasets from pancancer.zip.
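
The order of operations described above (remove samples first, z-score afterwards) might look like the following minimal sketch. The function name `simulate_partial_view`, the toy data, the seed, and the use of the population standard deviation (`ddof=0`) are assumptions for illustration, not details taken from the paper.

```python
import pandas as pd


def simulate_partial_view(df, frac_kept, seed):
    """Hypothetical helper: keep a random fraction of samples (rows),
    then z-score each feature (column) on the remaining samples."""
    partial = df.sample(frac=frac_kept, random_state=seed)
    # Constant features would give a zero standard deviation; use 1 instead.
    std = partial.std(axis=0, ddof=0).replace(0, 1.0)
    return (partial - partial.mean(axis=0)) / std


# Toy full view: 10 samples, 2 features.
full = pd.DataFrame(
    {
        "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
        "gene_b": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0],
    },
    index=[f"sample_{i}" for i in range(10)],
)

# Remove 30% of the samples, then standardize what is left.
partial = simulate_partial_view(full, frac_kept=0.7, seed=0)
```

Standardizing after the removal means the mean and standard deviation are computed only on the samples that remain in the partial view.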

image dataset

The image dataset was downloaded from multi view dataset. We did not apply any preprocessing to it, except for removing some samples to simulate partial datasets.

cancer subtyping

We used ten datasets from a review of multi-omics integrative methods. The datasets contain mRNA expression data, DNA methylation data, microRNA expression data, and clinical information for a total of 4941 samples across ten cancer types. All of these datasets are partial multi-omics data. They were downloaded from http://acgt.cs.tau.ac.il/multi_omic_benchmark/download.html. We regard the first three fields of a sample name as the unique ID of the sample and removed duplicated measurements of the same sample (e.g. 'TCGA.E2.A15K.11', 'TCGA.E2.A15K.06', and 'TCGA.E2.A15K.01' were regarded as the same sample, and only the first one was retained).
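
The deduplication rule above can be sketched as follows. The helper names `sample_id` and `drop_duplicate_measurements` are hypothetical, and the last sample name in the toy list is made up for illustration; the sketch assumes the fields are separated by `.` as in the example names.

```python
def sample_id(name, sep="."):
    """Hypothetical helper: the first three fields of a sample name."""
    return sep.join(name.split(sep)[:3])


def drop_duplicate_measurements(names):
    """Hypothetical helper: keep only the first name seen for each sample ID."""
    seen = set()
    kept = []
    for name in names:
        key = sample_id(name)
        if key not in seen:
            seen.add(key)
            kept.append(name)
    return kept


names = [
    "TCGA.E2.A15K.11",
    "TCGA.E2.A15K.06",
    "TCGA.E2.A15K.01",
    "TCGA.XX.0000.01",  # made-up name, for illustration only
]
kept = drop_duplicate_measurements(names)
# kept == ["TCGA.E2.A15K.11", "TCGA.XX.0000.01"]
```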

Additional information

To make the results in our paper easier to reproduce, we use pandas.DataFrame.sample to generate the simulated partial data, with random_state set to hash(f"{omic}_{int(p*10)}")%1000, where omic is the name of the omics (or view) and p is the fraction of samples removed.

For example:

    from pandas import read_csv

    # randomly remove 30% of the samples from mrna.
    df = read_csv(....)  # the mrna data. Each row is a sample, and each column is a feature.
    df1 = df.sample(frac=0.7, random_state=hash("mrna_3") % 1000)
    df1.to_csv("the_partial_mrna_data.csv")

    # randomly remove 30% of the samples from the pixel view in the image data.
    df = read_csv(....)  # the pix view data. Each row is a sample, and each column is a feature.
    df1 = df.sample(frac=0.7, random_state=hash("mfeat-pix_3") % 1000)
    df1.to_csv("mfeat-pix_partial_data.csv")
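
One caveat when rerunning the snippet above: Python's built-in hash() for strings is salted per interpreter process unless the PYTHONHASHSEED environment variable is fixed, so the resulting random_state can differ from run to run. If only run-to-run stability is needed, a seed could be derived from a stable digest instead, as in the sketch below. The function `stable_seed` is hypothetical and is not part of MSNE, and because it uses a different hash it will not reproduce the seeds used in the paper.

```python
import hashlib


def stable_seed(omic, p, modulus=1000):
    """Hypothetical helper: a seed that is identical across Python processes.

    omic: name of the omics or view, e.g. "mrna".
    p: fraction of samples removed, e.g. 0.3.
    """
    key = f"{omic}_{int(p * 10)}"  # same key format as in the text above
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % modulus


seed_mrna = stable_seed("mrna", 0.3)
seed_pix = stable_seed("mfeat-pix", 0.3)
```

The returned integer can be passed as random_state to pandas.DataFrame.sample in place of the hash(...) % 1000 expression.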
