OncoMatch

-- Find the expert oncologist for personalized cancer therapies http://52.90.79.42

This project has 8 sections with code and detailed explanation in 8 jupyter notebooks.

Part I: Web_scraping
Obtaining data from Cancer.Net, Pubmed, scopus, clinicaltrials.gov, with methods including web scraping, API and Biopython.
Part II: Abstract_preprocessing
Obtaining cancer type information and gene information from abstracts. The aim is to annotate each abstract by cancer type and gene which are two critical features for the doctor recommendation system.
Part III: Abstract_cancertype_annotation_LSTM
Building LSTM models and training on the abstracts labeled with cancer type information. The aim is to predict cancer type for the abstracts with no cancer type information.
Part IV: Abstract_keywords
This notebook first generates dataframes and dictionaries to set up connections among oncologists, cancer types and abstract ids (pmid). Secondly, it visualizes keywords in abstracts for each cancer type.
Part V: Word2Vec
Training Word2Vec model to embed abstracts
Part VI: LDA
Training Latent Dirichlet Allocation (LDA) to embed abstracts.
Part VII: BioBERT
Using pre-trained weights from BioBERT to embed abstracts.
Part VIII: Webapp
The functions and steps used in the final webapp including both BioBERT and LDA methods.

Project Aim:

Match the needs of patients to abstracts published by oncologists based on similarity in semantic meanings which are encoded by word embedding.

Data Source:

Cancer.Net: https://www.cancer.net/
Pubmed: https://www.ncbi.nlm.nih.gov/pubmed/
ClinicalTrial: https://clinicaltrials.gov/

Here is a summary of some statistics of the data for this project:

Data type	Number of data
Abstracts	50930
Oncologists	1470
Cancer types	55
Genes	206
Cancer Centers	48

Approach:

Step 1 -- Annotate abstracts by cancer type

Around 83.4% of abstracts are annotated with cancer type by searching for cancer-related information and the remaining 16.6% are unlabeled.

Model -- LSTM

To annotate the unlabeled 16.6% abstracts, an LSTM model is built and trained on the labeled abstracts.

For each cancer type, an LSTM model is trained separately as a binary classification problem. Class_0 suggests the abstract doesn't contain information about the specific cancer type and Class_1 suggests that the abstract contains information about this cancer type.

Classification results

Here is the overall training performance for the top 29 most frequent cancer types

Since for each cancer type, the dataset is imbalanced (more Class_0 than Class_1), F1 score is used as the evaluation metric and the model with the highest F1 score is selected.

Here is the detailed training and prediction results for Breast Cancer:

After annotating the unlabeled abstracts, here is the overall cancer type information for the abstracts:

The keywords for each cancer type are visualized by word frequency and WordCloud.

Step 2 -- Word Embedding

Each abstract will be converted to a numeric vector that represents its semantic meanings.

Validation

To evaluate the word embedding performance, article titles are also converted to numeric vectors. The consine similarity scores between each pair of abstract and title are calculated and ranked. The better embedding performance suggests that the consine similarity score between each abstract and its corresponding title should have the highest ranking (percentile close to 0).

Embedding results

Here I tested Word2Vec, Latent Dirichlet Allocation (LDA) and BioBERT three models. The embedding performance is plotted as the distribution of cosine similarity score rankings between each abstract and its corresponding title (better performance corresponds to higher ranking and percentile closer to 0).

To better visualize the embedding results from BioBERT, here I generated TSNE plots of the embedded vectors of abstracts converted by BioBERT colored based on cancer types or gene mutation information.

Conclusion

BioBERT significantly outperforms Word2Vec and LDA models and I chose BioBERT to embed abstracts as well as the free-form text input from users.

Application

OncoMatch http://52.90.79.42

Users will provide information about Cancer type, Gene mutation (optional), Clinical Trials and a detailed description about the disease including medical records, family history and specific therapies, etc.

For example, a user is looking for oncologists specialized in melanoma treatment. Here if we entered the biography from a famous melanoma oncologist Dr. Jedd D. Wolchok. The number one ranked oncologist from OncoMatch is Dr. Jedd D. Wolchok.

OncoMatch will provide information about the oncologist's affiliation, location, a list of clinical trials and a list of most relevant publications based on the search record. OncoMatch will also give a list of top ranked oncoligsts that users can take a lookt at.

For visualization, the localization of abstracts about melanoma, abstracts published by Dr. Jedd D. Wolchok about melanoma and the text input are plotted in TSNE plot.

Name		Name	Last commit message	Last commit date
Latest commit History 179 Commits
bert_data		bert_data
data		data
images		images
models		models
models_lda		models_lda
models_w2v		models_w2v
webapp		webapp
wordcloud		wordcloud
PartIII_Abstract_cancertype_annotation_LSTM.ipynb		PartIII_Abstract_cancertype_annotation_LSTM.ipynb
PartII_Abstract_preprocessing.ipynb		PartII_Abstract_preprocessing.ipynb
PartIV_Abstract_keywords.ipynb		PartIV_Abstract_keywords.ipynb
PartI_Web_scraping.ipynb		PartI_Web_scraping.ipynb
PartVIII_Webapp.ipynb		PartVIII_Webapp.ipynb
PartVII_BioBERT.ipynb		PartVII_BioBERT.ipynb
PartVI_LDA.ipynb		PartVI_LDA.ipynb
PartV_Word2Vec.ipynb		PartV_Word2Vec.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

OncoMatch

Project Aim:

Match the needs of patients to abstracts published by oncologists based on similarity in semantic meanings which are encoded by word embedding.

Data Source:

Approach:

Step 1 -- Annotate abstracts by cancer type

Model -- LSTM

Classification results

Step 2 -- Word Embedding

Validation

Embedding results

Conclusion

Application

About

Releases

Packages

Languages

xiey1/OncoMatch

Folders and files

Latest commit

History

Repository files navigation

OncoMatch

Project Aim:

Match the needs of patients to abstracts published by oncologists based on similarity in semantic meanings which are encoded by word embedding.

Data Source:

Approach:

Step 1 -- Annotate abstracts by cancer type

Model -- LSTM

Classification results

Step 2 -- Word Embedding

Validation

Embedding results

Conclusion

Application

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages