## Build classifier
Here we build a simple neural-network (NN) classifier. The dataset is quite small, so training does not require a GPU.

It's always worth checking any model against a simple baseline. In this case, we find that the baseline model performs just as well as the deep-learning classifier. This might change if we can get better training data, but for the purposes of this project, you could use either one.

In [54]:
import pandas as pd
import numpy as np
import pickle
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dense, Dropout
from sklearn.preprocessing import LabelBinarizer
import sklearn.datasets as skds
from pathlib import Path

In [55]:
import os
import json

In [41]:
data_dir = os.path.abspath(r'C:\Users\aday\OneDrive - SAGE Publishing\PROJECT_DATA\pubmed_case_report_classifier\data')
test_data_loc = os.path.join(data_dir,'test embeddings.json')
dev_data_loc = os.path.join(data_dir,'dev embeddings.json')
train_data_loc = os.path.join(data_dir,'train embeddings.json')

In [41]:
test = pd.read_csv(os.path.join(data_dir,'test.csv'), dtype=str)
train = pd.read_csv(os.path.join(data_dir,'train.csv'), dtype=str)
dev = pd.read_csv(os.path.join(data_dir,'dev.csv'), dtype=str)
train.shape, test.shape, dev.shape

((16974, 5), (2166, 5), (2140, 5))

In [42]:
with open(test_data_loc,'r') as f:
    test_embeddings = json.load(f)
with open(train_data_loc,'r') as f:
    train_embeddings  = json.load(f)
with open(dev_data_loc,'r') as f:
    dev_embeddings  = json.load(f)
len(train_embeddings), len(test_embeddings), len(dev_embeddings)

(16974, 2166, 2140)

## We have some embeddings missing, so let's ensure that we filter those out

In [59]:
train = train.drop_duplicates(subset=['doi'], keep = 'first')
test = test.drop_duplicates(subset=['doi'], keep = 'first')
dev = dev.drop_duplicates(subset=['doi'], keep = 'first')

In [60]:
dev = dev[dev['doi'].isin(dev_embeddings)]
test = test[test['doi'].isin(test_embeddings)]
train = train[train['doi'].isin(train_embeddings)]
train.shape, test.shape, dev.shape

((16974, 5), (2166, 5), (2140, 5))
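To make the filtering step concrete, here is a minimal sketch on toy data. The DOIs and vectors are made up for illustration; only the `drop_duplicates` and `isin` calls mirror the cells above. The dictionary is wrapped in `set(...)` to make it explicit that membership is tested against its keys.

```python
import pandas as pd

# Toy stand-ins for the real data (hypothetical DOIs and 2-d vectors).
toy_df = pd.DataFrame({
    "doi": ["10.1/a", "10.1/a", "10.1/b", "10.1/c"],
    "casereport": ["0", "0", "1", "0"],
})
toy_embeddings = {"10.1/a": [0.1, 0.2], "10.1/b": [0.3, 0.4]}  # no vector for 10.1/c

# Keep the first row for each DOI, then keep only rows that have an embedding.
toy_df = toy_df.drop_duplicates(subset=["doi"], keep="first")
toy_df = toy_df[toy_df["doi"].isin(set(toy_embeddings))]

print(toy_df["doi"].tolist())
```

In this toy case the duplicate row and the row without an embedding are both dropped, so the row count ends up equal to the number of embeddings.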

These assertions raise an `AssertionError` and stop the notebook if the number of embeddings differs from the number of rows of data.

In [61]:
assert train.shape[0] == len(train_embeddings)
assert dev.shape[0] == len(dev_embeddings)
assert test.shape[0] == len(test_embeddings)
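Equal counts are necessary but not sufficient: two collections can have the same size and still contain different DOIs. A stricter check, sketched below under that assumption, compares the sets directly. The helper name `alignment_report` and the example DOIs are hypothetical.

```python
def alignment_report(dois, embeddings):
    """Return DOIs lacking an embedding and embedding keys lacking a row."""
    doi_set = set(dois)
    key_set = set(embeddings)
    return sorted(doi_set - key_set), sorted(key_set - doi_set)

# Hypothetical example: same counts, different members.
rows = ["10.1/a", "10.1/b"]
vectors = {"10.1/a": [0.0], "10.1/z": [1.0]}

missing, orphaned = alignment_report(rows, vectors)
assert len(rows) == len(vectors)  # the count check passes...
print(missing, orphaned)          # ...but the two sets do not match
```

With the real data, this would be called as `alignment_report(train['doi'], train_embeddings)`, and both returned lists should be empty.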

## Build feature and label arrays for the train, dev and test sets

In [62]:
train.head(2)

Unnamed: 0,doi,articletitle,abstract,tiabs,casereport
0,10.1007/s10388-020-00767-0,Preoperative computed tomography predicts the ...,Recurrent laryngeal nerve paralysis (RLNP) aft...,Preoperative computed tomography predicts the ...,0
1,10.1167/19.7.2,Dynamic combination of position and motion inf...,"To accurately foveate a moving target, the ocu...",Dynamic combination of position and motion inf...,0


In [63]:
import numpy as np
y_train = np.array([float(x) for x in train['casereport'].values]).T
x_train = np.array([train_embeddings[doi] for doi in train.doi.tolist()])
y_dev = np.array([float(x) for x in dev['casereport'].values]).T
x_dev = np.array([dev_embeddings[doi] for doi in dev.doi.tolist()])
np.shape(y_train), np.shape(x_train),np.shape(y_dev), np.shape(x_dev)

((16974,), (16974, 768), (2140,), (2140, 768))

In [64]:
y_test = np.array([float(x) for x in test['casereport'].values]).T
x_test = np.array([test_embeddings[doi] for doi in test.doi.tolist()])
np.shape(y_test), np.shape(x_test)

((2166,), (2166, 768))
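The same two lines are repeated for each split, so they could be folded into one helper. The sketch below is one possible shape for it; `build_xy` is a hypothetical name and the toy embeddings are 3-dimensional rather than 768-dimensional.

```python
import numpy as np

def build_xy(dois, labels, embeddings):
    """Stack embeddings in row order and convert string labels to floats."""
    x = np.array([embeddings[doi] for doi in dois], dtype=float)
    y = np.array([float(label) for label in labels])
    return x, y

# Made-up data: the row order of x follows the order of the DOIs passed in.
toy_embeddings = {"10.1/a": [0.1, 0.2, 0.3], "10.1/b": [0.4, 0.5, 0.6]}
x_toy, y_toy = build_xy(["10.1/b", "10.1/a"], ["1", "0"], toy_embeddings)
print(x_toy.shape, y_toy.shape)
```

Looking each vector up by DOI, rather than relying on the order of the JSON file, keeps features and labels aligned row by row.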

## Start with a simple model to act as a baseline
- This is what we want to beat.
- It turns out that we get very good results from SVC without any parameter tuning.

In [65]:
%%time
from sklearn.svm import SVC
clf = SVC()
clf.fit(x_train,y_train)
clf.score(x_test,y_test)

Wall time: 21.9 s


0.9635272391505079
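An accuracy figure is easier to interpret next to the score of a trivial predictor that always outputs the most common class. The class balance of this dataset is not stated here, so the sketch below uses made-up labels; `majority_class_accuracy` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def majority_class_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    values, counts = np.unique(y_train, return_counts=True)
    majority = values[np.argmax(counts)]
    return float(np.mean(y_test == majority))

# Made-up labels, only to show the calculation.
y_train_toy = np.array([0.0, 0.0, 0.0, 1.0])
y_test_toy = np.array([0.0, 1.0, 0.0, 0.0, 1.0])
floor = majority_class_accuracy(y_train_toy, y_test_toy)
print(floor)
```

Running the same function on the real `y_train` and `y_test` would give the floor that any classifier has to clear.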

In [66]:
# %%time
# from sklearn.ensemble import RandomForestClassifier
# clf2 = RandomForestClassifier()
# clf2.fit(x_train,y_train)
# clf2.score(x_test,y_test)

Wall time: 1min 3s


0.9427516158818098

In [67]:
# clf3 is used in the comparison further down, so this cell must be run.
from xgboost import XGBClassifier
clf3 = XGBClassifier()
clf3.fit(x_train,y_train)
clf3.score(x_test,y_test)





0.9533702677746999

## Now create a Keras classifier

see: https://www.opencodez.com/how-to-guide/text-classification-using-keras.htm

Good accuracy with quite a shallow model. It seems probable that most of what we are detecting here is subject-area differences and NOT case-report status.

In [87]:
batch_size = 8
model = Sequential()

model.add(Dense(16, input_shape=(768,)))
model.add(Activation('relu'))
model.add(Dropout(0.25))

model.add(Dense(8))
model.add(Activation('relu'))
model.add(Dropout(0.25))

# this layer didn't make a big difference. Commenting out. 
# model.add(Dense(4))
# model.add(Activation('relu'))
# model.add(Dropout(0.25))

model.add(Dense(1))
model.add(Activation('sigmoid'))

model.build()
model.summary()
 
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
 

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_20 (Dense)             (None, 16)                12304     
_________________________________________________________________
activation_20 (Activation)   (None, 16)                0         
_________________________________________________________________
dropout_14 (Dropout)         (None, 16)                0         
_________________________________________________________________
dense_21 (Dense)             (None, 8)                 136       
_________________________________________________________________
activation_21 (Activation)   (None, 8)                 0         
_________________________________________________________________
dropout_15 (Dropout)         (None, 8)                 0         
_________________________________________________________________
dense_22 (Dense)             (None, 1)                
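The parameter counts in the summary follow from the usual formula for a fully connected layer: one weight per input-unit pair plus one bias per unit. A short check of the two values shown above (`dense_params` is a hypothetical helper):

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases for a fully connected layer."""
    return n_inputs * n_units + n_units

print(dense_params(768, 16))  # first Dense layer: 768 inputs, 16 units
print(dense_params(16, 8))    # second Dense layer: 16 inputs, 8 units
```

Activation and Dropout layers have no trainable parameters, which is why they show 0 in the summary.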

In [88]:
%%time
history = model.fit(x_train, 
                    y_train,
                    validation_data=(x_dev, y_dev),
                    batch_size=batch_size,
                    epochs=10, 
                    verbose=1,                  
                   )

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Wall time: 22.8 s


## Testing

In [89]:
score = model.evaluate(x_test, 
                       y_test,
                       batch_size=batch_size, 
                       verbose=1)

print('Test accuracy:', score[1])

Test accuracy: 0.9630655646324158


In [71]:
model.save('case_report_classifier.h5')

## Now compare baseline with new model

In [72]:
test['svc_pred'] = clf.predict(x_test)
test['xgb_pred'] = clf3.predict(x_test)
# model.predict returns shape (n, 1); flatten it before thresholding at 0.5
test['keras_pred'] = (model.predict(x_test).ravel() >= 0.5).astype(int)
# crude voting ensemble: predict 1 when the mean of the three votes is at least 0.5
test['ensemble'] = (test[['svc_pred', 'keras_pred', 'xgb_pred']].mean(axis=1) >= 0.5).astype(int)
test['casereport'] = pd.to_numeric(test['casereport'])
test.head(2)

Unnamed: 0,doi,articletitle,abstract,tiabs,casereport,svc_pred,xgb_pred,keras_pred,ensemble
0,10.1016/j.celrep.2018.06.048,A Virally Encoded DeSUMOylase Activity Is Requ...,A subset of viral genes is required for the lo...,A Virally Encoded DeSUMOylase Activity Is Requ...,0,0.0,0.0,0,0
1,10.3390/ijerph17239087,Is It Possible to Find Something Positive in B...,"In relation to COVID-19, little research has f...",Is It Possible to Find Something Positive in B...,0,0.0,0.0,0,0
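The voting rule is simple enough to check in isolation. The sketch below applies the same "mean of the votes is at least 0.5" rule to made-up predictions; `majority_vote` is a hypothetical helper name.

```python
import numpy as np

def majority_vote(*predictions):
    """Return 1 where the mean of the 0/1 predictions is at least 0.5, else 0."""
    stacked = np.vstack([np.asarray(p, dtype=float) for p in predictions])
    return (stacked.mean(axis=0) >= 0.5).astype(int)

# Made-up predictions from three classifiers on four samples.
svc_toy = [0.0, 1.0, 1.0, 0.0]
xgb_toy = [0.0, 0.0, 1.0, 1.0]
keras_toy = [0, 1, 0, 0]
votes = majority_vote(svc_toy, xgb_toy, keras_toy)
print(votes)
```

With three voters the rule amounts to "at least two agree on 1", so ties cannot occur.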


In [73]:
from sklearn.metrics import accuracy_score
accuracy_score(test['casereport'], test['ensemble'])

0.9621421975992613
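Accuracy does not distinguish false positives from false negatives, which the cells below inspect by hand. As a complement, the sketch below counts both on made-up labels and derives precision and recall. The helper names are hypothetical; scikit-learn's `confusion_matrix` would serve the same purpose.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    return {
        "tp": int(np.sum((y_true == 1) & (y_pred == 1))),
        "fp": int(np.sum((y_true == 0) & (y_pred == 1))),
        "fn": int(np.sum((y_true == 1) & (y_pred == 0))),
        "tn": int(np.sum((y_true == 0) & (y_pred == 0))),
    }

def precision_recall(counts):
    """Precision and recall, returning 0.0 when a denominator is zero."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels and predictions.
counts = confusion_counts([0, 0, 1, 1, 1, 0], [0, 1, 1, 0, 1, 0])
precision, recall = precision_recall(counts)
print(counts, precision, recall)
```

On the real data this would be called as `confusion_counts(test['casereport'], test['keras_pred'])` once `casereport` has been converted to numeric.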

In [74]:
# false +ves
for i, row in test[(test['keras_pred']==1) & (test['casereport']==0)].iterrows():
    print(row['doi'])
    print(row['articletitle'])
    print(row['abstract'])
    print()


10.1080/17843286.2018.1531616
Approach to and management of abnormalities in plasma sodium.
The differential diagnosis between hypertonic, isotonic and hypotonic hyponatremia are presented. The help of some usual serum (urea, uric acid and TCO2) and urine parameters (mainly osmolality and sodium concentration) are discussed and help to determine the best treatment. Morbidity associated with untreated hyponatremia and with the different treatment available is also discussed. Who to prevent and treat ODS (osmotic demyelating syndrome) is recalled. The pathophysiology and treatment of hypernatremia are also discussed.

10.1002/jca.21553
Successful treatment of pure red cell aplasia because of ABO major mismatched stem cell transplant.
Pure red cell aplasia (PRCA) is a well-documented potential side effect of ABO major mismatched allogeneic hematopoietic stem cell transplants. This side effect may be self-limiting, but is sometimes treated using modalities such as steroids, antithymocyte g

In [75]:
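# false -ves
# casereport was converted to numeric above, so compare with the integer 1;
# comparing with the string '1' matches no rows.
test[(test['keras_pred']==0) & (test['casereport']==1)]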

Unnamed: 0,doi,articletitle,abstract,tiabs,casereport,svc_pred,xgb_pred,keras_pred,ensemble
