-- Find the expert oncologist for personalized cancer therapies http://52.90.79.42
This project has 8 parts, each with code and a detailed explanation in its own Jupyter notebook.
- Part I: Web_scraping
  Obtaining data from Cancer.Net, Pubmed, scopus, clinicaltrials.gov, with methods including web scraping, API and Biopython.
- Part II: Abstract_preprocessing
  Obtaining cancer type information and gene information from abstracts. The aim is to annotate each abstract by cancer type and gene which are two critical features for the doctor recommendation system.
- Part III: Abstract_cancertype_annotation_LSTM
  Building LSTM models and training on the abstracts labeled with cancer type information. The aim is to predict cancer type for the abstracts with no cancer type information.
- Part IV: Abstract_keywords
  This notebook first generates dataframes and dictionaries to set up connections among oncologists, cancer types and abstract ids (pmid). Secondly, it visualizes keywords in abstracts for each cancer type.
- Part V: Word2Vec
  Training Word2Vec model to embed abstracts.
- Part VI: LDA
  Training Latent Dirichlet Allocation (LDA) to embed abstracts.
- Part VII: BioBERT
  Using pre-trained weights from BioBERT to embed abstracts.
- Part VIII: Webapp
  The functions and steps used in the final webapp including both BioBERT and LDA methods.
Match the needs of patients to abstracts published by oncologists, based on semantic similarity as encoded by word embeddings.
- Cancer.Net: https://www.cancer.net/
- Pubmed: https://www.ncbi.nlm.nih.gov/pubmed/
- ClinicalTrial: https://clinicaltrials.gov/
Here is a summary of some statistics of the data for this project:
Data type | Count |
---|---|
Abstracts | 50930 |
Oncologists | 1470 |
Cancer types | 55 |
Genes | 206 |
Cancer Centers | 48 |
Around 83.4% of abstracts are annotated with cancer type by searching for cancer-related information and the remaining 16.6% are unlabeled.
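The keyword-search annotation described above could look roughly like the sketch below. The keyword table and the function names (`CANCER_KEYWORDS`, `annotate_cancer_types`, `labeled_fraction`) are hypothetical and only illustrate the idea; the project's actual vocabulary and matching rules live in the preprocessing notebook.

```python
import re

# Hypothetical keyword table; the project's real vocabulary is not shown here.
CANCER_KEYWORDS = {
    "Breast Cancer": ["breast cancer", "breast carcinoma"],
    "Melanoma": ["melanoma"],
    "Lung Cancer": ["lung cancer", "nsclc"],
}


def annotate_cancer_types(abstract, keywords=CANCER_KEYWORDS):
    """Return the sorted list of cancer types whose keywords occur in the abstract."""
    text = abstract.lower()
    found = set()
    for cancer_type, terms in keywords.items():
        for term in terms:
            # Word boundaries avoid matching a term inside a longer word.
            if re.search(r"\b" + re.escape(term) + r"\b", text):
                found.add(cancer_type)
                break
    return sorted(found)


def labeled_fraction(abstracts):
    """Fraction of abstracts that received at least one cancer type label."""
    if not abstracts:
        return 0.0
    labeled = sum(1 for a in abstracts if annotate_cancer_types(a))
    return labeled / len(abstracts)
```

Abstracts for which `annotate_cancer_types` returns an empty list are the ones left for the LSTM models to label.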
To annotate the unlabeled 16.6% of abstracts, an LSTM model is built and trained on the labeled abstracts.
For each cancer type, an LSTM model is trained separately as a binary classification problem. Class_0 suggests the abstract doesn't contain information about the specific cancer type and Class_1 suggests that the abstract contains information about this cancer type.
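As an illustration of this one-model-per-cancer-type setup, the sketch below builds the binary targets (Class_0 / Class_1) for each cancer type from multi-label annotations. The helper names and the toy data are assumptions made for the example, and the LSTM training itself is omitted.

```python
def binary_labels(abstract_labels, cancer_type):
    """Map each abstract's label set to 1 (Class_1) or 0 (Class_0) for one cancer type."""
    return [1 if cancer_type in labels else 0 for labels in abstract_labels]


def one_vs_rest_datasets(abstract_labels, cancer_types):
    """Build one binary label vector per cancer type, keyed by cancer type."""
    return {ct: binary_labels(abstract_labels, ct) for ct in cancer_types}


# Toy annotations: each entry is the set of cancer types found in one abstract.
toy_labels = [
    {"Breast Cancer"},
    {"Melanoma"},
    {"Breast Cancer", "Melanoma"},
    {"Lung Cancer"},
]
datasets = one_vs_rest_datasets(toy_labels, ["Breast Cancer", "Melanoma"])
```

Each entry of `datasets` would then serve as the target vector for one binary classifier, so an abstract can end up positive for several cancer types.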
Here is the overall training performance for the 29 most frequent cancer types:
Because the dataset for each cancer type is imbalanced (more Class_0 than Class_1), the F1 score is used as the evaluation metric and the model with the highest F1 score is selected.
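To see why F1 is a better fit than accuracy here, the minimal sketch below computes F1 for the positive class and picks the candidate with the highest score. The function names and the toy predictions are illustrative assumptions, not the project's code.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (Class_1); returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_best(candidates, y_true):
    """Return the name of the candidate whose predictions give the highest F1."""
    return max(candidates, key=lambda name: f1_score(y_true, candidates[name]))


# Imbalanced toy data: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
candidates = {
    "always_class_0": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "model_b": [0, 0, 0, 1, 1, 0, 0, 0, 1, 1],
}
best = select_best(candidates, y_true)
```

Both toy candidates reach 80% accuracy, but the one that always predicts Class_0 has an F1 of 0, so selection by F1 favors the model that actually finds the positive abstracts.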
Here is the detailed training and prediction results for Breast Cancer:
After annotating the unlabeled abstracts, here is the overall cancer type information for the abstracts:
The keywords for each cancer type are visualized by word frequency and WordCloud.
Each abstract will be converted to a numeric vector that represents its semantic meanings.
To evaluate the word embedding performance, article titles are also converted to numeric vectors. The cosine similarity scores between each pair of abstract and title are calculated and ranked. With a better embedding, the cosine similarity score between each abstract and its corresponding title should rank highest (percentile close to 0).
Here I tested three models: Word2Vec, Latent Dirichlet Allocation (LDA) and BioBERT. The embedding performance is plotted as the distribution of cosine similarity score rankings between each abstract and its corresponding title (better performance corresponds to higher ranking and percentile closer to 0).
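One way to compute this ranking is sketched below. It assumes abstracts and titles are already embedded as equal-length vectors, and it defines the percentile as the share of titles that score strictly higher than the matching title; that exact definition is an assumption, as is the function name `title_rank_percentiles`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def title_rank_percentiles(abstract_vecs, title_vecs):
    """For abstract i, the percentage of titles more similar to it than its own title i.

    0.0 means the matching title is ranked first.
    """
    n = len(title_vecs)
    percentiles = []
    for i, abstract in enumerate(abstract_vecs):
        sims = [cosine(abstract, title) for title in title_vecs]
        # Count titles that score strictly higher than the matching title.
        better = sum(1 for s in sims if s > sims[i])
        percentiles.append(100.0 * better / n)
    return percentiles


# Toy 2-d embeddings: abstract i is paired with title i.
abstracts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
titles = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.0]]
percentiles = title_rank_percentiles(abstracts, titles)
```

A distribution of these values concentrated near 0 indicates that the embedding places each abstract closest to its own title.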
To better visualize the embedding results from BioBERT, I generated t-SNE plots of the abstract vectors produced by BioBERT, colored by cancer type or gene mutation information.
BioBERT significantly outperforms the Word2Vec and LDA models, so I chose BioBERT to embed both the abstracts and the free-form text input from users.
OncoMatch http://52.90.79.42
Users will provide information about Cancer type, Gene mutation (optional), Clinical Trials and a detailed description about the disease including medical records, family history and specific therapies, etc.
For example, suppose a user is looking for oncologists specialized in melanoma treatment. If we enter the biography of a famous melanoma oncologist, Dr. Jedd D. Wolchok, the number one ranked oncologist from OncoMatch is Dr. Jedd D. Wolchok.
OncoMatch will provide information about the oncologist's affiliation, location, a list of clinical trials and a list of the most relevant publications based on the search record. OncoMatch will also give a list of top-ranked oncologists that users can take a look at.
For visualization, the locations of abstracts about melanoma, abstracts published by Dr. Jedd D. Wolchok about melanoma, and the text input are shown in a t-SNE plot.
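The matching step could be sketched as follows. This assumes the user input and every abstract are already embedded as vectors, and that each oncologist is scored by their single best-matching abstract; that scoring rule, the function name `rank_oncologists`, and the toy data are assumptions for illustration, not a description of the deployed webapp.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_oncologists(query_vec, abstract_vecs, abstract_authors, top_k=3):
    """Rank oncologists by the similarity of their best-matching abstract to the query.

    abstract_vecs: dict mapping pmid -> embedding vector
    abstract_authors: dict mapping pmid -> list of oncologist names
    Returns a list of (name, score, pmid) tuples, best first.
    """
    best = {}
    for pmid, vec in abstract_vecs.items():
        score = cosine(query_vec, vec)
        for name in abstract_authors.get(pmid, []):
            # Keep only the highest-scoring abstract per oncologist.
            if name not in best or score > best[name][0]:
                best[name] = (score, pmid)
    ranked = sorted(best.items(), key=lambda item: item[1][0], reverse=True)
    return [(name, score, pmid) for name, (score, pmid) in ranked[:top_k]]


# Toy data with placeholder ids and names.
abstract_vecs = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [1.0, 1.0]}
abstract_authors = {"p1": ["Dr. A"], "p2": ["Dr. B"], "p3": ["Dr. A", "Dr. B"]}
results = rank_oncologists([1.0, 0.1], abstract_vecs, abstract_authors)
```

The `pmid` returned with each oncologist points at the publication that drove the match, which is the kind of information needed to show the most relevant publications next to each result.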