In [1]:
# Standard python helper libraries.
import os, sys, re, json, time, wget, csv, string, time, random
import itertools, collections
from importlib import reload
from IPython.display import display

# NumPy and SciPy for matrix ops
import numpy as np
import scipy.sparse

# NLTK for NLP utils
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

# Helper libraries
from w266_common import utils, vocabulary, tf_embed_viz
utils.require_package("wget")      # for fetching dataset

from keras.models import Sequential
from keras.layers import GaussianNoise, LSTM, Bidirectional, Dropout, Dense, Embedding, MaxPool1D, GlobalMaxPool1D, Conv1D
from keras.optimizers import Adam

from pymagnitude import *

[nltk_data] Downloading package punkt to /home/renzeer/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/renzeer/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Using TensorFlow backend.


# INTRODUCTION
## Phenotype Classification of Electronic Health Records

Electronic Health Record (EHR) data is a rapidly growing source of unstructured biomedical data. This data is extremely rich, often capturing a patient’s phenotype. In a clinical context, phenotype refers to the medical conditions, diseases, and disorders of a patient. These records can capture data in higher detail compared to structured encodings such as the International Classification of Diseases (ICD). Traditional methods for extracting phenotypes from this data typically relies on manual review or processing the data through rule-based expert systems. Both approaches are time intensive, rely heavily on human expertise, and scale poorly. This project proposes an automated approach to identifying phenotypes in EHR data through machine learning.


# DATA EXPLORATION
## Word Embedding

The foundation of this project is based on word embeddings, an approach that converts words into number vectors based on co-occurence. These vectors help capture word meanings and context in a format suitable for machine learning. 

Typically these vectors are trained on extremely large corpora, which can take a lot of time and resources. Thankfully, the word embedding space is quite mature and there exists pre-trained models, ready to use out of the box. One such model is Standford's GloVe vectors, which is trained on a corpus of 6B tokens from Wikipedia and Gigaword. These vectors are available at https://nlp.stanford.edu/projects/glove/. We will go through some exercises to explore word vectors.


In [2]:
import glove_helper; reload(glove_helper)
hands = glove_helper.Hands(ndim=100)

Loading vectors from data/glove/glove.6B.zip
Parsing file: data/glove/glove.6B.zip:glove.6B.100d.txt
Found 400,000 words.
Parsing vectors... Done! (W.shape = (400003, 100))


In [9]:
def find_nn_cos(v, Wv, k=10):
    """Find nearest neighbors of a given word, by cosine similarity.
    
    Returns two parallel lists: indices of nearest neighbors, and 
    their cosine similarities. Both lists are in descending order, 
    and inclusive: so nns[0] should be the index of the input word, 
    nns[1] should be the index of the first nearest neighbor, and so on.

    Args:
      v: (d-dimensional vector) word vector of interest
      Wv: (V x d matrix) word embeddings
      k: (int) number of neighbors to return
    
    Returns (nns, ds), where:
      nns: (k-dimensional vector of int), row indices of nearest neighbors, 
        which may include the given word.
      similarities: (k-dimensional vector of float), cosine similarity of each 
        neighbor in nns.
    """

    v_norm = np.linalg.norm(v)
    Wv_norm = np.linalg.norm(Wv, axis=1)

    dot = np.dot(v, Wv.T)

    cos_sim = dot / (v_norm * Wv_norm)

    nns = np.flipud(np.argsort(cos_sim)[-k:])
    ds = np.flipud(np.sort(cos_sim)[-k:])
    
    return [nns, ds]


def show_nns(hands, word, k=10):
    """Helper function to print neighbors of a given word."""
    word = word.lower()
    print("Nearest neighbors for '{:s}'".format(word))
    v = hands.get_vector(word)
    for i, sim in zip(*find_nn_cos(v, hands.W, k)):
        target_word = hands.vocab.id_to_word[i]
        print("{:.03f} : '{:s}'".format(sim, target_word))
    print("")

In [10]:
# Find the top 10 nearest neighbors for a few examples
show_nns(hands, "diabetes")
show_nns(hands, "cancer")
show_nns(hands, "depression")

Nearest neighbors for 'diabetes'
1.000 : 'diabetes'
0.848 : 'hypertension'
0.799 : 'obesity'
0.780 : 'arthritis'
0.779 : 'cancer'
0.774 : 'alzheimer'
0.765 : 'asthma'
0.756 : 'cardiovascular'
0.733 : 'disease'
0.730 : 'epilepsy'

Nearest neighbors for 'cancer'
1.000 : 'cancer'
0.821 : 'breast'
0.807 : 'prostate'
0.785 : 'disease'
0.779 : 'diabetes'
0.766 : 'cancers'
0.751 : 'patients'
0.749 : 'leukemia'
0.744 : 'alzheimer'
0.732 : 'lung'

Nearest neighbors for 'depression'
1.000 : 'depression'
0.706 : 'illness'
0.690 : 'anxiety'
0.679 : 'severe'
0.672 : 'onset'
0.670 : 'schizophrenia'
0.668 : 'disorder'
0.666 : 'alcoholism'
0.643 : 'psychosis'
0.641 : 'mental'



The results we see make sense and showcase the capability of word embeddings. However, we do run into a few issues. For one, 
loading the file into our workspace requires careful memory management. This can become a problem when dealing with larger models or when we want to tweak our models and reload the data. Another issue is that we have to build our own help functions for performing calculations on the word vectors. Not inherently an issue, but these calculations are fairly standard and it is always a good idea to work smarter, not harder.

As an alternative, we can look at third-party packages that offer fast and simple support for word vector operations. The package we will use for this project is Magnitude (https://github.com/plasticityai/magnitude). This package offers "lazy-loading for faster cold starts in development, LRU memory caching for performance in production, multiple key queries, direct featurization to the inputs for a neural network, performant similiarity calculations, and other nice to have features for edge cases like handling out-of-vocabulary keys or misspelled keys and concatenating multiple vector models together." These are all great features that we can leverage for this project.

## Working with Word Vectors - Magnitude

Going through a few simple comparisons and exercises, we can see the difference between working with the raw text file versus working with the magnitude file:
  - The zip file is ~4 times larger than the magnitude file. This is even more impressive consdering the text file still needs to be unpackaged.  
  - Load times are extremely quick for the magnitude file, far outperforming the standard file.  
  - Querying from the standard file outperforms the magnitude file, but querying from the magnitude file is simpler and offers additional functionality.  
  
While the increased query times is not ideal, especially when it comes to training, the portability and the increased functionality just makes life so much easier.

In [25]:
"""
Perform basic functions on our standard zip/txt file
to benchmark performance
"""
print('Standard Text File:')
print('\tFile Size: ', os.stat('data/glove/glove.6B.zip').st_size)

start = time.time()
glove_vectors_txt = glove_helper.Hands(ndim=100, quiet=True)
end = time.time()
print('\tFile Load Time: ', end - start)

start = time.time()
glove_vectors_txt.get_vector('diabetes')
glove_vectors_txt.get_vector('cancer')
glove_vectors_txt.get_vector('hypertension')
end = time.time()
print('\tQuery Time: ', end - start)

print('\tHandling out-of-vocabulary words:')
try:
    print('\t\t', glove_vectors_txt.get_vector('wordnotfoundinvocab'))
except AssertionError:
    print('\t\tWord not found in vocabulary')

"""
Perform basic functions on our magnitude file
to benchmark performance
"""
print('\nMagnitude File:')
print('\tFile Size: ', os.stat('data/glove-lemmatized.6B.100d.magnitude').st_size)

start = time.time()
glove_vectors_mag = Magnitude("data/glove-lemmatized.6B.100d.magnitude")
end = time.time()
print('\tFile Load Time: ', end - start)

start = time.time()
glove_vectors_mag.query("diabetes")
glove_vectors_mag.query("cancer")
glove_vectors_mag.query("hypertension")
end = time.time()
print('\tQuery Time: ', end - start)

print('\tHandling out-of-vocabulary words:')
try:
    print('\t\t', glove_vectors_mag.query('wordnotfoundinvocab'))
except AssertionError:
    print('\t\tWord not found in vocabulary')


Standard Text File:
	File Size:  862182613
	File Load Time:  18.083816289901733
	Query Time:  0.00012183189392089844
	Handling out-of-vocabulary words:
		Word not found in vocabulary

Magnitude File:
	File Size:  266366976
	File Load Time:  0.002287149429321289
	Query Time:  0.006619930267333984
	Handling out-of-vocabulary words:
		 [-0.04397694  0.08708267  0.05870734 -0.04722567 -0.03879925  0.21312321
  0.02859145 -0.03979973 -0.02670808  0.02556176 -0.07791763  0.0055145
 -0.03020298  0.06430179 -0.00551911  0.16249717 -0.06189246 -0.12206172
 -0.02767706 -0.05265569  0.13255737  0.02846519  0.0451067   0.11242716
  0.01290785 -0.04876954 -0.04612697 -0.03764525 -0.00251381  0.11269477
  0.11309229  0.09421328 -0.13763386 -0.02501031  0.01126506  0.06448203
  0.06115726 -0.12342421  0.02004041 -0.0443186  -0.02901474 -0.01431345
  0.05068584 -0.02549015 -0.08328359 -0.07138098  0.0835982  -0.03470181
 -0.00475797 -0.07226969  0.20147627 -0.02546141  0.16691468  0.15587942
 -0.10204

## Corpus Selection - Biomedical Text

With a framework that allows more freedom in corpus selection, we can move into much more larger word embedding models. The GloVe model we have been previously working with is actually on the smaller side. Of course, a larger corpus offers more data to train on, thus better capturing word contexts and meanings. However, another determininig factor in corpus selection is the source of the text. In general, these pre-trained models are based on general topic sources such as Wikipedia and Gigaword. However, since we know the domain we are working in, it may make sense to pull from relevant text sources. 

A Comparison of Word Embeddings for the Biomedical Natural Language Processing (https://arxiv.org/pdf/1802.00400.pdf) explores this idea. The paper concluded that "word embeddings trained on EHR and MedLit can capture the semantics of medical terms better and find semantically relevant medical terms closer to human experts’ judgments than those trained on GloVe and Google News." 

We can test these results ourselves by comparing GloVe against a biomedical based word embedding that was trained on text from PubMed and PubMed Central.

In [30]:
print('GloVe length: ', len(glove_vectors_mag))
print('GloVe dimensions: ', glove_vectors_mag.dim)

print('\nNearest Neighbor examples:')
print('10 NN for diabetes:\n', glove_vectors_mag.most_similar("diabetes", topn = 10))
print('10 NN for cancer:\n', glove_vectors_mag.most_similar("cancer", topn = 10))
print('10 NN for hyperlipidemia:\n', glove_vectors_mag.most_similar("hyperlipidemia", topn = 10))
print('10 NN for e119:\n', glove_vectors_mag.most_similar("e119", topn = 10))

GloVe length:  336951
GloVe dimensions:  100

Nearest Neighbor examples:
10 NN for diabetes:
 [('diabetic', 0.7566893059521317), ('diabetis', 0.7465886369685321), ('obesity', 0.619293770727618), ('hypertension', 0.6162751523182464), ('cardiovascular', 0.5791516470463346), ('asthma', 0.5689611839459698), ('arthriti', 0.5554265183541941), ('mellitu', 0.5439654800171492), ('allergy', 0.5393456576654493), ('alzheimer', 0.5297674264739546)]
10 NN for cancer:
 [('breast', 0.8210739), ('prostate', 0.8065967), ('disease', 0.78536785), ('diabetis', 0.7788438), ('patient', 0.75117147), ('leukemia', 0.7485109), ('alzheimer', 0.744444), ('lung', 0.73171055), ('diseasis', 0.729254), ('heart', 0.7241202)]
10 NN for hyperlipidemia:
 [('dyslipidemia', 0.6900931), ('hypercholesterolemia', 0.67991346), ('insulin-dependent', 0.6547221), ('insipidu', 0.61982), ('hyperglycemia', 0.6196113), ('vaginismu', 0.61709344), ('metformin', 0.6067523), ('pre-eclampsia', 0.60390294), ('prediabetis', 0.6029445), ('pol

In [3]:
med_vectors = Magnitude("data/wikipedia-pubmed-and-PMC-w2v.magnitude", pad_to_length=30)
print('Medical length: ', len(med_vectors))
print('Medical dimensions: ', med_vectors.dim)

print('\nNearest Neighbor examples:')
print('10 NN for diabetes:\n', med_vectors.most_similar("diabetes", topn = 10))
print('10 NN for cancer:\n', med_vectors.most_similar("cancer", topn = 10))
print('10 NN for hyperlipidemia:\n', med_vectors.most_similar("hyperlipidemia", topn = 10))
print('10 NN for e119:\n', med_vectors.most_similar("e119", topn = 10))

Medical length:  5443656
Medical dimensions:  200

Nearest Neighbor examples:
10 NN for diabetes:
 [('T2DM', 0.849025), ('T1DM', 0.8185854), ('prediabetes', 0.8093643), ('mellitus', 0.803476), ('pre-diabetes', 0.78467065), ('DM2', 0.7815228), ('hyperlipidemia', 0.7732315), ('(IDDM)1', 0.763704), ('dyslipidemia', 0.7627928), ('hyperlipidaemia', 0.75711805)]
10 NN for cancer:
 [('cancers', 0.85612094), ('caner', 0.8248045), ('CRC', 0.79751563), ('PCa', 0.7963365), ('cancer.4', 0.7522857), ('cancer.6', 0.7514417), ('cancer.5', 0.7498777), ('breast', 0.7488325), ('cancer.9', 0.74855363), ('cancer.10', 0.74781704)]
10 NN for hyperlipidemia:
 [('hyperlipidaemia', 0.94658345), ('dyslipidemia', 0.92323184), ('dyslipidaemia', 0.90003157), ('hypercholesterolemia', 0.8709867), ('hypertriglyceridemia', 0.83977485), ('dislipidemia', 0.8359443), ('dyslipemia', 0.8347689), ('dyslipidemias', 0.83325243), ('hypertension', 0.80633545), ('dyslipoproteinemia', 0.7960845)]
10 NN for e119:
 [('e19', 0.71280

## Training Data - Labeled Electronic Health Record Text

The goal of this project is to classify Eletronic Health Record (EHR) text. This of course means that we need to get our hands on some EHR data. This can be particularly difficult due to the strict rules and guidelines around healthcare data. The Health Insurance Portability and Accountability Act of 1996, or HIPAA, outlines a set of rules that help protect the privacy of our health information. These rules are vital for building a healthcare system where we can trust our healthcare providers and caregivers, so it is important that we adhere to the standards set by HIPAA. 

For this project, we will be using a dataset provided by MTSamples.com. They provide ~5,000 transcribed medical reports covering 40 specialty types. All of the notes have been de-identified of protected health information, making them HIPAA compliant. Below we will explore a few rows of the raw data.

In [4]:
ehr_notes = []
with open('data/ehr_samples.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        ehr_notes.append([row['Specialty'], row['Note']])
        
print('EHR Sentence Example:\n')
print(ehr_notes[0])
print(ehr_notes[1])

EHR Sentence Example:

['Bariatrics', 'PAST MEDICAL HISTORY:, He has difficulty climbing stairs, difficulty with airline seats, tying shoes, used to public seating, and lifting objects off the floor.  He exercises three times a week at home and does cardio.  He has difficulty walking two blocks or five flights of stairs.  Difficulty with snoring.  He has muscle and joint pains including knee pain, back pain, foot and ankle pain, and swelling.  He has gastroesophageal reflux disease.,PAST SURGICAL HISTORY:, Includes reconstructive surgery on his right hand 13 years ago.  ,SOCIAL HISTORY:, He is currently single.  He has about ten drinks a year.  He had smoked significantly up until several months ago.  He now smokes less than three cigarettes a day.,FAMILY HISTORY:, Heart disease in both grandfathers, grandmother with stroke, and a grandmother with diabetes.  Denies obesity and hypertension in other family members.,CURRENT MEDICATIONS:, None.,ALLERGIES:,  He is allergic to Penicillin.,M

## Text Processing - Pre-Processing the EHR Notes

With the EHR data now loaded, we could technically start applying Machine Learning operations as is. However, text can come in all forms and some models perform better when the input text is in a certain form. We discover a few of these optimizations throughout our ML notebooks, but it would be good to cover them here as well. 

The first obstacle is managing our text length. As our input text grows, so does the number of variables and the number of operations. Depending on our algorithm, these values can scale exponentially, causing runtime and resource usage to explode out of hand. To help manage the scope of our input text, we will be breaking up our notes into sentences. Limiting our text length also helps the model focus on the signals that matter. This should give us enough context to learn the more complex relationships between our words while minimizing runtime.

Another pre-processing step we can take is to apply basic natural language cleanup techniques that standardize the text and remove non-essential information. Thankfully, python has a package called the Natural Language Toolkit (NLTK) that provides a lot of these transformations as built-in functions. The operations we will use for this project are converting all text to lowercase, removing punctation, filtering out stop words, and removing blanks.

After all of the pre-processing, we can take a look at what the EHR notes now look like.

In [5]:
ehr_sentences = []
for record in ehr_notes:
    sent_text = nltk.sent_tokenize(record[1])
    for sent in sent_text:
        tokens = word_tokenize(sent)

        # convert to lower case
        tokens = [w.lower() for w in tokens]

        # remove punctuation from each word
        table = str.maketrans('', '', string.punctuation)
        tokens = [w.translate(table) for w in tokens]

        # filter out stop words
        stop_words = set(stopwords.words('english'))
        tokens = [w for w in tokens if not w in stop_words]

#         # stem words
#         porter = PorterStemmer()
#         tokens = [porter.stem(word) for word in tokens]

        # remove blanks
        tokens = [w for w in tokens if w != '']

        ehr_sentences.append([record[0], ' '.join(tokens)])

random.Random(4).shuffle(ehr_sentences)

In [6]:
print(ehr_sentences[:10])

specialties = ['Allergy', 'Autopsy', 'Bariatrics', 'Cardiovascular', 'Chart', 'Chiropractic', 'Consult'
               , 'Cosmetic', 'Dentistry', 'Dermatology', 'Diet', 'Discharge', 'Emergency', 'Endocrinology'
               , 'Gastroenterology', 'General', 'Gynecology', 'Hospice', 'IME', 'Letters', 'Nephrology', 'Neurology'
               , 'Neurosurgery', 'Office Notes', 'Oncology', 'Ophthalmology', 'Orthopedic', 'Otolaryngology'
               , 'Pain Management', 'Pathology', 'Pediatrics', 'Podiatry', 'Psychiatry', 'Radiology', 'Rehab'
               , 'Rheumatology', 'Sleep', 'Speech', 'Surgery', 'Urology']

[['Psychiatry', 'exhusband died 1980 acute pancreatitis secondary alcohol abuse'], ['Gynecology', 'patient taken post anesthesia care unit stable condition'], ['Consult', 'send pertussis pcr'], ['Discharge', 'admission diagnoses 1'], ['Surgery', 'time removed 12 mm broach proceeded implanting polyethylene liner within acetabulum'], ['General', 'peripheral vascular disease status post recent last week pta right lower extremity social history negative smoking drinking current home medications novolog 20 units meal lantus 30 units bedtime crestor 10 mg daily micardis 80 mg daily imdur 30 mg daily amlodipine 10 mg daily coreg 125 mg bid lasix 20 mg daily ecotrin 325 mg daily calcitriol 05 mcg daily review systems patient denies complaints states right hand left foot swollen painful came emergency room'], ['Surgery', 'estimated blood loss less 15 ml'], ['Surgery', 'base tumor fulgurated periphery normal mucosa surrounding base bladder tumor'], ['Cardiovascular', 'focal areas consolidation s

# METHODS AND APPROACHES
## Naive Nearest Neighbor

The first method we will explore will be to just leverage the word embedding space with no Machine Learning at all. We mentioned earlier that the word vectors capture context and meaning. Additionally position of these vectors in relation to eachother also convey word relationships. At the core of it, vectors clustered together are more similar in context and meaning. Using this principle, we can use our categories as anchors in our word embedding, calculate a similarity score for a sentence, and identify which category is the nearest neighbor to our sentence. 

This is a very naive approach but it will be a good exercise and can at least set a baseline for performance. 

In [139]:
print('Similarity between diabetes and mellitus: ', med_vectors.similarity("diabetes", "mellitus"))
print('Similarity between diabetes and breast: ', med_vectors.similarity("diabetes", "breast"))

print('\nSimilarity between cancer and mellitus: ', med_vectors.similarity("cancer", "mellitus"))
print('Similarity between cancer and breast: ', med_vectors.similarity("cancer", "breast"))

Similarity between diabetes and mellitus:  0.80347604
Similarity between diabetes and breast:  0.26328182

Similarity between cancer and mellitus:  0.13384798
Similarity between cancer and breast:  0.7488326


In [140]:
nn_results = []
for i, ehr_sent in enumerate(ehr_sentences[0:2000]):
#     print(ehr_sent)
    
    most_similar_specialty = []
    
    for specialty in specialties:
        spec_similarity_sum = 0
        for token in ehr_sent[1].split(' '):
#             print('\t', token, med_vectors.similarity(specialty, token))
            
            spec_similarity_sum += med_vectors.similarity(specialty, token)
        
        spec_similarity = spec_similarity_sum / len(ehr_sent[1].split(' '))
        
#         print(specialty, spec_similarity)

        if not most_similar_specialty:
            most_similar_specialty = [i, ehr_sent[0], specialty, spec_similarity]
        elif spec_similarity > most_similar_specialty[3]:
            most_similar_specialty = [i, ehr_sent[0], specialty, spec_similarity]
        
    nn_results.append(most_similar_specialty)

    
correct_results = [result for result in nn_results if result[1] == result[2]]
print('# of Correct Classifications: ', len(correct_results))
print('Accuracy: ', len(correct_results) / len(nn_results))

# of Correct Classifications:  98
Accuracy:  0.049


In [109]:
print('Example of correct classification:')

correct_example = correct_results[0]
example_sentence = ehr_sentences[correct_example[0]]
print('\tSentence: ', example_sentence)

print('\n\tTrue category:', correct_example[1])
print('\tPredicted category:', correct_example[2])

print('\n\tTrue/Predicted Similarities:')
for token in example_sentence[1].split(' '):
    print('\t\t', token, med_vectors.similarity(correct_example[1], token))
    spec_similarity_sum += med_vectors.similarity(correct_example[1], token)

spec_similarity = spec_similarity_sum / len(example_sentence.split(' '))
print('\t\tAverage similarity: ', spec_similarity)


Example of correct classification:
	Sentence:  ['Orthopedic', 'exposed vertebral bodies c2c3 c4c5 bridged plate']

	True category: Orthopedic
	Predicted category: Orthopedic

	Similarities:
		 exposed -0.03660483
		 vertebral 0.266623
		 bodies 0.068785824
		 c2c3 0.031137193347711197
		 c4c5 0.09221873311276046
		 bridged 0.057441555
		 plate 0.039083302
		Average similarity:  0.1750587690063761


In [113]:
print('Example of incorrect classification:')

incorrect_example = nn_results[0]
example_sentence = ehr_sentences[incorrect_example[0]]
print('\tSentence: ', example_sentence)

print('\n\tTrue category:', incorrect_example[1])
print('\tPredicted category:', incorrect_example[2])

print('\n\tTrue Similarities:')
for token in example_sentence[1].split(' '):
    print('\t\t', token, med_vectors.similarity(incorrect_example[1], token))
    spec_similarity_sum += med_vectors.similarity(incorrect_example[1], token)

spec_similarity = spec_similarity_sum / len(example_sentence[1].split(' '))
print('\t\tAverage similarity: ', spec_similarity)

print('\n\tPredicted Similarities:')
for token in example_sentence[1].split(' '):
    print('\t\t', token, med_vectors.similarity(incorrect_example[2], token))
    spec_similarity_sum += med_vectors.similarity(incorrect_example[2], token)

spec_similarity = spec_similarity_sum / len(example_sentence[1].split(' '))
print('\t\tAverage similarity: ', spec_similarity)

Example of incorrect classification:
	Sentence:  ['Neurology', 'see velocity measurements left carotid eca measurement 0938 msecond']

	True category: Neurology
	Predicted category: Endocrinology

	True Similarities:
		 see 0.034260437
		 velocity 0.07255719
		 measurements 0.015290075
		 left 0.07127067
		 carotid 0.07629804
		 eca 0.094461195
		 measurement -0.020491015
		 0938 0.10253124
		 msecond 0.030581191
		Average similarity:  0.48662400427592156

	Predicted Similarities:
		 see -0.006658733
		 velocity 0.05569666
		 measurements 0.05710432
		 left 0.05892444
		 carotid 0.04782131
		 eca 0.15068905
		 measurement 0.05482881
		 0938 0.14380054
		 msecond 0.061377887
		Average similarity:  0.5559111470957501


So as we can see, the results are pretty terrible with an accuracy of 5%. Looking at an example the classifier got right, it relied on words that are exclusively and very distinctly related. However, these strong signals are not always present in our sentences. Looking at an incorrect example, we see how the signals are being drowned out or offset by the other words. This emphasizes the need for some type of model that can learn and weigh the words that provide strong signals for particular categories.

## Convultional Neural Network

A neural network will allow us to build a model that can take in the word vectors as inputs and learn the complex relationships between those vectors to better classify the target sentence. This is a more holistic approach that tries to capture meaning from the entire sentence rather than token by token.

In this project directory, you can find all the different iterations of CNNs that have been trained.

## LSTM Neural Network

A LSTM is a type of neural network that tries to solve the problem of accurately capturing long term dependencies.

In this project directory, you can find all the different iterations of LSTMs that have been trained.