# Capstone Project: Classifying clinically actionable genetic mutations

## Problem Statement

I will build and train a classifier that proposes the class of a genetic variation, using an expert-annotated knowledge base of cancer mutation annotations and related biomedical terms, so that clinical pathologists can spend less effort manually reviewing medical literature to make the classification. Model performance will be measured by accuracy and AUC, and the model should improve on the baseline by at least 10%. The baseline is defined by the proportion of each variant class in the given training set.
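To make the success criterion concrete, here is a minimal sketch of how a class-proportion baseline could be computed. It assumes the baseline accuracy is that of always predicting the most frequent class, and that "10%" means ten percentage points; both are my interpretations rather than fixed definitions, and the labels below are made-up toy values.

```python
from collections import Counter


def class_proportions(labels):
    """Return {class: share of samples} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(class_proportions(labels).values())


# Hypothetical toy labels; the real ones would come from training_variants.
toy_labels = [7, 7, 7, 7, 4, 4, 1, 1, 2, 6]

proportions = class_proportions(toy_labels)
baseline = majority_baseline_accuracy(toy_labels)

# Assumption: "improve by at least 10%" is read as +10 percentage points.
target = baseline + 0.10

print(proportions)
print(f"baseline accuracy: {baseline:.2f}, target accuracy: {target:.2f}")
```

If the 10% is meant as a relative improvement instead, the target would be `baseline * 1.10`.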

Related topics that I've touched on:
- [Increasing weights of words in CountVectorizer](https://stackoverflow.com/questions/49687009/how-to-increase-weight-of-a-word-for-countvectorizer)
    - [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity)
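
The two topics above fit together: boosting the weight of a term in a bag-of-words vector changes how similar two documents look under cosine similarity. The sketch below uses only the standard library rather than `CountVectorizer`, and is not necessarily the approach taken in the linked answer; the documents and the `boost` dictionary are hypothetical.

```python
import math
from collections import Counter


def weighted_counts(text, weights=None):
    """Bag-of-words counts with optional per-term multipliers."""
    weights = weights or {}
    counts = Counter(text.lower().split())
    return {term: n * weights.get(term, 1.0) for term, n in counts.items()}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up example documents that share only the gene name.
doc_a = "brca1 mutation impairs dna repair"
doc_b = "brca1 deletion observed in tumour samples"
boost = {"brca1": 3.0}  # hypothetical weight for a term of interest

plain = cosine_similarity(weighted_counts(doc_a), weighted_counts(doc_b))
boosted = cosine_similarity(
    weighted_counts(doc_a, boost), weighted_counts(doc_b, boost)
)

print(f"plain: {plain:.3f}, boosted: {boosted:.3f}")
```

Because the shared term carries more weight after boosting, the boosted similarity is higher than the plain one.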

Topics to explore further:
- Pre-trained POS tagger (NLTK maxent_treebank_pos_tagger)
- Dependency Parse Trees (takes POS tags as inputs)
- Named Entity Detection (either use NLTK pre-trained model, or consider using BEST or ClinVar)
- Multi-layer Perceptron (basic neural network)
- [Dealing with imbalanced classes](https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18)
- Word Embeddings: a representation of text in which words with similar meanings have similar vector representations. In other words, words are placed in a coordinate system where related words, based on their relationships in a corpus, sit closer together.
- [Word Embedding & Sentiment Classification using Keras](https://towardsdatascience.com/machine-learning-word-embedding-sentiment-classification-using-keras-b83c28087456)
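
For the imbalanced-classes topic in the list above, one common option is to weight each class inversely to its frequency so that rare classes count more during training. The sketch below uses the formula `n_samples / (n_classes * class_count)`, the same heuristic scikit-learn applies for `class_weight="balanced"`. Whether this is the right remedy for this dataset is still open, and the labels are made-up toy values.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical toy labels: class 7 is common, class 1 is rare.
toy_labels = [7] * 6 + [4] * 3 + [1]

weights = balanced_class_weights(toy_labels)
for cls, weight in sorted(weights.items()):
    print(f"class {cls}: weight {weight:.3f}")
```

The resulting dictionary could be passed to any estimator that accepts per-class weights.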

## Step 1: Data Collection

### Data Sources

1. [Kaggle training datasets](https://www.kaggle.com/c/msk-redefining-cancer-treatment/data):
    - "training_text": a double-pipe (||) delimited file that contains 3,322 rows of clinical evidence (text) used to classify genetic mutations.
    - "training_variants": a comma-separated file containing 3,322 rows of descriptions of the genetic mutations used for training.


2. [Kaggle testing datasets](https://www.kaggle.com/c/msk-redefining-cancer-treatment/data):
    - "test_text": a double-pipe (||) delimited file that contains 3,322 rows of clinical evidence (text) used to classify genetic mutations.
    - "test_variants": a comma-separated file containing 2,954 rows of descriptions of the genetic mutations used for testing.


3. US National Center for Biotechnology Information (NCBI) [Entrez API](https://www.ncbi.nlm.nih.gov/clinvar/docs/maintenance_use/#web):
    - used to retrieve disease and severity information based on a given variant


4. [Biomedical Entity Search Tool (BEST)](http://best.korea.ac.kr/):
    - used to find biomedical terms (e.g. diseases, drugs, drug targets, transcription factors and miRNAs) related to specific genes and variants
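
The double-pipe text files listed above do not parse cleanly as ordinary CSV, so here is a minimal sketch of reading one. It assumes a single header line followed by rows of the form `ID||Text`, and splits only on the first `||` so that any later `||` inside the evidence text is preserved. The sample content and the function name `read_double_pipe` are made up for illustration.

```python
import io


def read_double_pipe(handle):
    """Yield (id, text) pairs from an `ID||Text` file, skipping the header."""
    next(handle, None)  # assumption: exactly one header line
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        # partition splits on the first "||" only
        row_id, _, text = line.partition("||")
        yield int(row_id), text


# Made-up stand-in for the real file contents.
sample = io.StringIO(
    "ID,Text\n"
    "0||first evidence text\n"
    "1||second || text\n"
)

rows = list(read_double_pipe(sample))
print(rows)
```

For the real data, the in-memory sample would be replaced by something like `open("training_text", encoding="utf-8")`, and the rows could then be joined to the variants file on the ID column.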