### Homework #4

This month we learned about cosine similarity, applied to a toy problem: choosing a fitting product label for a list of unlabeled product descriptions. Your homework is to take the code from the cosine similarity notebook, along with the training_data.json and test_data.json files, and use it to compute the cosine similarity between the test documents and the training categories. Each file is a JSON serialization of an array of arrays, where each element is an array of [nct #, cluster label, condition label, trial text]. For this exercise, use the cluster label as the label and ignore the condition label.
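
As a minimal sketch of that record layout, the snippet below pulls the trial text and cluster label out by position. The two records are made-up placeholders (hypothetical NCT numbers, labels, and text), not values from the real files.

```python
# Hypothetical records in the [nct #, cluster label, condition label, trial text] layout.
sample = [
    ["NCT00000001", "CANCER", "Breast Neoplasms", "A study of drug A in breast cancer."],
    ["NCT00000002", "EYES AND VISION", "Glaucoma", "A trial of eye drops for glaucoma."],
]

corpus = [rec[3] for rec in sample]  # trial text
labels = [rec[1] for rec in sample]  # cluster label; the condition label (index 2) is ignored

print(labels)
```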

The steps you should take:
1. Import the JSON data files
2. Define the training data corpus (list of strings) and training data labels. You should also define your test corpus and test labels. The test labels are for reference.  Next month we will talk about methods for assessing model accuracy using these reference labels.  For now, you will treat the test data as simply data with "unknown" labels.
3. Define your preprocessor & tokenization functions.
4. Instantiate your vectorizer (from scikit-learn) with these functions.
5. Train the vectorizer and reduce your document dimensionality. The order of these two operations depends on whether you use method 1 or method 2 from the cosine similarity notebook, or another method of your choice.
6. Transform the test corpus using the trained vectorizer.
7. Calculate the pairwise cosine similarity between each of the test documents and training categories.
8. Identify the most likely category matches.
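
The steps above can be sketched end to end on toy data. This is only an illustration under stated assumptions: the corpus and labels are invented, the vectorizer uses scikit-learn's default preprocessing and tokenization rather than custom functions, and dimensionality is reduced by averaging the document vectors within each category.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data standing in for the trial text and cluster labels.
train_corpus = ["tumor chemotherapy oncology", "chemotherapy tumor radiation",
                "retina glaucoma vision", "vision retina eye drops"]
train_labels = np.array(["CANCER", "CANCER", "EYES AND VISION", "EYES AND VISION"])
test_corpus = ["radiation therapy for tumor", "glaucoma eye exam"]

# Steps 4-6: fit on the training corpus, then map the test corpus into the same space.
vectorizer = TfidfVectorizer()  # default preprocessing/tokenization, for brevity
train_vecs = vectorizer.fit_transform(train_corpus).toarray()
test_vecs = vectorizer.transform(test_corpus).toarray()

# One reference vector per category: the mean of that category's document vectors.
categories = np.unique(train_labels)
reference = np.vstack([train_vecs[train_labels == c].mean(axis=0) for c in categories])

# Steps 7-8: pairwise cosine similarity, then the best-matching category per test document.
sims = cosine_similarity(test_vecs, reference)
predicted = categories[np.argmax(sims, axis=1)]
print(predicted)
```

With real data, the custom preprocessor and tokenizer from step 3 would be passed to the vectorizer in place of the defaults.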

Some questions to ask yourself:
1. Are my preprocessor/tokenizer functions working in a way that I agree with? There is a fair amount of wiggle room here.
2. Is my choice of document dimensionality a smart one? Is it slow? Does it do what I think it is doing? Are there additional problems with my choice?
3. What is my threshold for assigning category matches? Do I look at the top 5 choices, top 1 choice, top 20 choices? Why?

Remember, here is how to load JSON data:
```python
import json

with open('/path/to/json/data', 'rb') as infile:
    jdata = json.load(infile)
```

In [4]:
import re
import spacy
import pandas as pd
import numpy as np
import random
import json
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Regexes to be used for preprocessing
alpha_only = re.compile(r'[^a-zA-Z ,.!]+')
comma_find = re.compile(r'\,{1,1} *')
period_find = re.compile(r'\.{1,1} *')
nlp = spacy.load('en_core_web_sm')


In [11]:

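# This function will be passed to the vectorizer class and will be used to pre-process our text data in whatever way
# we choose. Here it lowercases the text, replaces runs of characters other than letters, spaces, and ,.! with a space,
# and normalizes the spacing after periods and commas.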
def my_preprocessor(doc):
    a = alpha_only.sub(' ',doc.lower())
    b = period_find.sub('. ', a)
    c = comma_find.sub(', ', b)
    return(c)


# This function will be passed to the vectorizer class and will be used to tokenize our text data in whatever way
# we choose.
def my_tokenizer(doc):
    my_nlp = nlp(doc)
    tokens = []
    for token in my_nlp:
        # Only include tokens longer than 3 characters that aren't stop words or whitespace
        # (digits were already removed by the preprocessor). Keep the lemmatized form of each token.
        if len(token.text) > 3 and not token.is_stop and not token.is_space:
            tokens.append(token.lemma_)
    return (tokens)

# This function is used to make the vectorized array easier to visualize.
# It creates a Pandas dataframe from the word matrix 
def wm2df(wm, feat_names):
    # create an index for each row
    doc_names = ['Doc{:d}'.format(idx) for idx, _ in enumerate(wm)]
    df = pd.DataFrame(data=wm.toarray(), index=doc_names,
                      columns=feat_names)
    return (df)

# This function averages together rows of a numpy array that share the same index values,
# in our case these index values are the cluster labels.
def f_numpy(names, values):
    result_names = np.unique(names)
    result_values = np.empty((result_names.shape[0], values.shape[1]))

    for i, name in enumerate(result_names):
        result_values[i,:] = np.mean(values[names == name], axis=0)

    return result_names, result_values 

# Given a cosine similarity matrix, this function sorts and returns the top 'N' largest 
# values and their locations in the original matrix.

def find_top_n(cs, top_n):
    bst_mtch = np.argsort(-cs)
    i = bst_mtch[:,:top_n]
    yind = np.indices(i.shape).flatten('C')
    top_vals = np.reshape(cs[yind[:-i.size],i.flatten('C')], i.shape)
    return i, top_vals*100
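
For comparison, the same top-N lookup can be sketched with `np.take_along_axis`, which gathers values using per-row index arrays. The name `find_top_n_alt` and the toy matrix are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def find_top_n_alt(cs, top_n):
    # Column indices of each row's top_n largest values, in descending order
    idx = np.argsort(-cs, axis=1)[:, :top_n]
    # Gather the matching similarity values row by row
    vals = np.take_along_axis(cs, idx, axis=1)
    return idx, vals * 100

# Hypothetical 2x3 similarity matrix
toy = np.array([[0.1, 0.7, 0.2],
                [0.5, 0.3, 0.9]])
idx, vals = find_top_n_alt(toy, 2)
print(idx)
print(vals)
```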

Load the test & training data for the clinical trials.

In [159]:
with open('/Users/rthombley/Box Sync/NLP/UCSF-nlp/classify-clinical-trials/training_data_v2.json', 'rb') as infile:
    train = json.load(infile)
   
with open('/Users/rthombley/Box Sync/NLP/UCSF-nlp/classify-clinical-trials/test_data_v2.json', 'rb') as infile:
    test = json.load(infile)

As an afterthought, I decided to add in the keywords from each trial. As a first pass at including them in the corpus, I'm just joining them with the descriptive text. There are probably better ways to do this that take advantage of the distinctive structure of the keywords.

Additionally, you'll see some commented out code below. We need a way to create a single vector representation for each cluster. As we saw in the other cosine similarity notebook, two options for this are a) merging all of the text for similar clusters and building our vector representation from this giant block of text or b) fit each of the documents separately and then average all of the resulting vectors by group. Option a is what is commented out below - there is too much text here for this to work, so we are forced to use option b.
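
For reference, option a amounts to the grouping step sketched below in plain Python. The labeled documents are made up for illustration. On the real data, the merged documents become too long to parse, which is why option b is used instead.

```python
from collections import defaultdict

# Hypothetical labeled documents.
docs = [("CANCER", "tumor chemotherapy"),
        ("EYES AND VISION", "retina glaucoma"),
        ("CANCER", "radiation oncology")]

# Option a: concatenate all text that shares a label into one large document per cluster.
merged = defaultdict(list)
for label, text in docs:
    merged[label].append(text)
merged_corpus = {label: " ".join(texts) for label, texts in merged.items()}

print(merged_corpus["CANCER"])
```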

In [172]:
tr_corpus = [' '.join([t[3], ' '.join(t[4])]) for t in train]
#tr_keywords = [' '.join(t[4]) for t in train]
tr_labels = [t[1] for t in train]

tst_corpus = [' '.join([t[3], ' '.join(t[4])]) for t in test]
#tst_corpus = [t[3] for t in test]
#tst_keywords = [' '.join(t[4]) for t in test]
tst_labels = [t[1] for t in test]


### 
# Convert the train data structure to a data frame.
#tr_df = pd.DataFrame(tr_labels, columns=['id'])
#tr_df['text'] = tr_corpus

# Group by product name (called 'id' here) and then join all of the individual product descriptions within these
# groups. THIS DOESN'T CURRENTLY WORK BECAUSE WE QUICKLY HAVE TOO MANY CHARACTERS TO PARSE
#un_tr_df = tr_df.groupby('id', as_index=False).agg(lambda x:' '.join(x))
#un_train_corpus = un_tr_df['text'].tolist()
#un_train_labels = un_tr_df['id'].tolist()

Build the vectorizer & train it on the training corpus. Additionally, we will transform each of the test_corpus documents to the training corpus' vector space. Finally - we average all vectors that have the same label together to create the cluster "reference" vectors.

In [174]:
# Initialize the vectorizer.
vectorizer = TfidfVectorizer(preprocessor=my_preprocessor, tokenizer=my_tokenizer, norm='l1', smooth_idf=True)

# Train our vectorizer
cwm = vectorizer.fit_transform(tr_corpus)

# Transform the test queries into the trained vector space
qm = vectorizer.transform(tst_corpus)

# Extract our tokens (get_feature_names() was removed in scikit-learn 1.2; use it only on versions before 1.0)
tokens = vectorizer.get_feature_names_out()

# Average all trained vectors with the same label together into reference vectors.
arr = cwm.toarray()
nms = np.array([t[1] for t in train])
nms, nm = f_numpy(nms, arr)

# Make the output pretty
df = pd.DataFrame(data=nm, index=nms,
                      columns=tokens)
df


Unnamed: 0,aact,aactg,aacumb,aadc,aall,aaml,aasd,aasld,aastrom,aatd,...,zortress,zostavax,zoster,zsfg,zubrod,zuckerberg,zuckerman,zuclomiphene,zuma,zytiga
"BLOOD, HEART, AND CIRCULATION",0.0,0.0,0.0,0.0,0.0,9.5e-05,0.0,0.0,2.9e-05,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"BONES, JOINTS, AND MUSCLES",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
BRAIN AND NERVES,0.0,0.0,0.0,0.000583,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,7.7e-05,0.0,0.0,0.0,0.0
CANCER,2.6e-05,0.0,0.0,0.0,8.3e-05,4.3e-05,2e-06,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,7.6e-05,0.0,2e-06,0.0,8e-06,5e-06
DIGESTIVE SYSTEM,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DRUG DEPENDENCE AND RELATED DISORDERS,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"EAR, NOSE, AND THROAT",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ENDOCRINE SYSTEM,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
EYES AND VISION,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
FETUS OR NEWBORN DISORDERS,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Build the cosine similarity matrix between the test data (qm) and the training reference vectors (nm).

In [175]:
# Now calculate cosine similarity between the unlabeled test trial "queries" and the cluster reference vectors.
cs = cosine_similarity(qm, nm)
# Make the output readable
cdf = pd.DataFrame(data=cs, index=tst_labels, columns=nms)
cdf


Unnamed: 0,"BLOOD, HEART, AND CIRCULATION","BONES, JOINTS, AND MUSCLES",BRAIN AND NERVES,CANCER,DIGESTIVE SYSTEM,DRUG DEPENDENCE AND RELATED DISORDERS,"EAR, NOSE, AND THROAT",ENDOCRINE SYSTEM,EYES AND VISION,FETUS OR NEWBORN DISORDERS,...,MENTAL HEALTH,METABOLIC PROBLEMS,MOUTH AND TEETH,MULTISYSTEM DISORDERS,ORGAN AND TISSUE TRANSPLANT,REPRODUCTIVE HEALTH,RESPIRATORY DISEASES,SKIN DISORDERS,SURGICAL STUDIES,TUMORS
DRUG DEPENDENCE AND RELATED DISORDERS,0.113186,0.072770,0.135822,0.075916,0.066933,0.498287,0.059696,0.085967,0.046308,0.047354,...,0.213410,0.113646,0.067800,0.027228,0.043877,0.109010,0.041784,0.029542,0.118424,0.070582
INFECTIOUS DISEASES,0.117791,0.077705,0.113454,0.121601,0.098808,0.038971,0.043170,0.081874,0.060318,0.076065,...,0.087594,0.115295,0.073522,0.041849,0.071305,0.297102,0.060991,0.042691,0.055477,0.094805
DIGESTIVE SYSTEM,0.181833,0.068358,0.136284,0.104212,0.330911,0.028550,0.066137,0.068830,0.036844,0.264707,...,0.125880,0.114373,0.047563,0.337418,0.173564,0.057473,0.056667,0.034686,0.038795,0.093794
CANCER,0.168165,0.118934,0.183840,0.298016,0.163615,0.035826,0.062196,0.089100,0.073602,0.065997,...,0.127578,0.114809,0.034051,0.067480,0.108370,0.098894,0.036056,0.075072,0.064596,0.244448
LUNGS AND BREATHING,0.252122,0.066394,0.096469,0.088791,0.249727,0.018785,0.029308,0.046011,0.027550,0.147803,...,0.072374,0.058134,0.035641,0.104642,0.077740,0.045686,0.039132,0.023544,0.032769,0.060889
CANCER,0.111102,0.062559,0.136197,0.290970,0.090440,0.023623,0.042362,0.085243,0.046957,0.039706,...,0.068104,0.091451,0.035143,0.034271,0.076563,0.064535,0.017810,0.041544,0.040446,0.274711
CANCER,0.166430,0.099687,0.174812,0.346107,0.127222,0.054985,0.065625,0.112351,0.060554,0.083754,...,0.101480,0.125680,0.043291,0.078535,0.116858,0.121716,0.039898,0.066874,0.050316,0.281748
CANCER,0.143035,0.079540,0.248120,0.336805,0.099103,0.026927,0.052636,0.061942,0.055811,0.047616,...,0.076303,0.073851,0.025699,0.023508,0.071814,0.057645,0.022370,0.042430,0.034349,0.318217
CANCER,0.107595,0.082766,0.140962,0.224089,0.098225,0.021956,0.043148,0.062266,0.051063,0.038692,...,0.087930,0.075178,0.027376,0.022696,0.058174,0.064412,0.021083,0.046535,0.036032,0.158429
BRAIN AND NERVES,0.120940,0.084650,0.262437,0.095426,0.100613,0.053418,0.058339,0.084177,0.057862,0.049958,...,0.366039,0.107867,0.049424,0.051855,0.053688,0.105722,0.031562,0.053692,0.057144,0.071293


Each cell holds a value between 0 and 1 indicating the degree of similarity between the test trial (row) and the reference vector (column). The column name associated with the maximum value in a row is the "most similar" reference cluster.

As displayed, the row names are the actual cluster type for the test trial. If we compare our selection of the "most similar" reference cluster to the row name, we can determine if we correctly guessed the trial cluster.

From here, we can count the number of times we guessed correctly, and how many times we guessed incorrectly. This gives us a percentage accuracy to determine how effective this model is at classifying the test trials.
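
A minimal sketch of that counting, using a made-up 3x3 similarity matrix rather than the real `cs` (the labels and values are hypothetical):

```python
import numpy as np

# Hypothetical similarity matrix: rows are test trials, columns are reference clusters.
col_labels = np.array(["BRAIN AND NERVES", "CANCER", "MENTAL HEALTH"])
true_labels = np.array(["CANCER", "BRAIN AND NERVES", "CANCER"])
sims = np.array([[0.18, 0.30, 0.13],
                 [0.26, 0.10, 0.37],
                 [0.14, 0.29, 0.07]])

# The predicted cluster is the column with the largest similarity in each row.
predicted = col_labels[np.argmax(sims, axis=1)]
# Percentage of rows where the prediction matches the actual label.
accuracy = 100 * np.mean(predicted == true_labels)
print(predicted, accuracy)
```

Here the second trial is misclassified, so two of the three guesses are correct.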

If we look at the most common cluster type, which is "Cancer", we could simply guess "Cancer" as the cluster for every trial. This would be correct ~40% of the time. This is the baseline accuracy that any model we build must beat. Doing better than this implies that our model is at least partially effective.
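
Such a majority-class baseline can be computed with `collections.Counter`. The label list here is a hypothetical stand-in. In the notebook it would be `tst_labels`.

```python
from collections import Counter

# Hypothetical label list standing in for the real test labels.
labels = ["CANCER", "CANCER", "BRAIN AND NERVES", "DIGESTIVE SYSTEM", "CANCER"]

# The most frequent label and how often always guessing it would be correct.
most_common_label, count = Counter(labels).most_common(1)[0]
baseline = 100 * count / len(labels)
print(most_common_label, baseline)
```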

In [176]:
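# Percentage of test documents whose true label appears among the n_depth most similar
# reference clusters. col_labels (nms) must be a NumPy array of the column (category) names.
def calc_percent_accuracy(n_depth, cs_matrix, tst_labels, col_labels):
    top_index, _ = find_top_n(cs_matrix, n_depth)
    til = top_index.tolist()
    cnt = 0
    for i, pred_ind in enumerate(til):
        # pred_ind holds the column indices of the top n_depth clusters for test document i
        if tst_labels[i] in col_labels[pred_ind]:
            cnt += 1
    return 100 * (cnt / len(til))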


So, what percent of the time does the trial cluster match that of the reference cluster with the highest similarity?

In [178]:
calc_percent_accuracy(1, cs, tst_labels, nms)

70.25139664804469

Not bad: we see 70% accuracy using this model and selecting just 1 value. Unfortunately, that leaves us with 30% error. Can we do better? Certainly we could try to tweak our model by using bigrams, changing how we tokenize or otherwise prepare our data, or even using a totally different classifier. But we can also go down a different pathway. If our goal is just to help a reviewer accurately classify a trial, then even if we can't present them with a perfectly accurate prediction, we can say "here are 3 or 4 candidates, and the correct one is almost certainly among them."
Let's see how many of the most similar clusters we need to present to get nearly 100% accuracy.

In [179]:
# Calculate % accuracy we get by choosing the top n similarities
for i in range(1,27):
    acc = calc_percent_accuracy(i, cs, tst_labels, nms)
    print("Correct answer in top {}: {}".format(i, acc))

Correct answer in top 1: 70.25139664804469
Correct answer in top 2: 91.34078212290503
Correct answer in top 3: 95.60055865921788
Correct answer in top 4: 97.41620111731844
Correct answer in top 5: 98.04469273743017
Correct answer in top 6: 98.39385474860335
Correct answer in top 7: 98.6731843575419
Correct answer in top 8: 98.88268156424581
Correct answer in top 9: 99.23184357541899
Correct answer in top 10: 99.51117318435753
Correct answer in top 11: 99.65083798882681
Correct answer in top 12: 99.65083798882681
Correct answer in top 13: 99.72067039106145
Correct answer in top 14: 99.72067039106145
Correct answer in top 15: 99.72067039106145
Correct answer in top 16: 99.86033519553072
Correct answer in top 17: 99.93016759776536
Correct answer in top 18: 99.93016759776536
Correct answer in top 19: 99.93016759776536
Correct answer in top 20: 100.0
Correct answer in top 21: 100.0
Correct answer in top 22: 100.0
Correct answer in top 23: 100.0
Correct answer in top 24: 100.0
Correct answer

That's great! If we present just 3 choices, the correct cluster is among them over 95% of the time. This would be a real performance gain over not using a model.