# Chemical-Disease Relation (CDR) Tutorial

In this example, we'll write an application to extract *mentions of* **chemical-induced-disease relationships** from PubMed abstracts, as per the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/). This tutorial shows off some of Snorkel's more advanced features, so we assume you've followed the Intro tutorial.

Let's start by reloading from the last notebook.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import xgboost as xgb
import numpy as np
import matplotlib.pyplot as plt
from snorkel import SnorkelSession
from snorkel.learning.utils import MentionScorer
from snorkel.learning import RandomSearch, ListParameter, RangeParameter
from snorkel.learning.XGBoost_NoiseAware import XGBoost_NoiseAware

session = SnorkelSession()



In [None]:
from snorkel.models import candidate_subclass

ChemicalDisease = candidate_subclass('ChemicalDisease', ['chemical', 'disease'])

train = session.query(ChemicalDisease).filter(ChemicalDisease.split == 0).all()
dev = session.query(ChemicalDisease).filter(ChemicalDisease.split == 1).all()
test = session.query(ChemicalDisease).filter(ChemicalDisease.split == 2).all()

print 'Training set:\t{0} candidates'.format(len(train))
print 'Dev set:\t{0} candidates'.format(len(dev))
print 'Test set:\t{0} candidates'.format(len(test))    

In [2]:
# Load cached gold labels, feature matrices, and training marginals from disk
import cPickle as pickle
with open('test_labels.dat', 'rb') as infile:
    L_gold_test_dense_cp = pickle.load(infile)
with open('train_labels.dat', 'rb') as infile:
    L_gold_train_dense_cp = pickle.load(infile)
with open('dev_labels.dat', 'rb') as infile:
    L_gold_dev_dense_cp = pickle.load(infile)
with open('train_features.dat', 'rb') as infile:
    F_train = pickle.load(infile)
with open('dev_features.dat', 'rb') as infile:
    F_dev = pickle.load(infile)
with open('test_features.dat', 'rb') as infile:
    F_test = pickle.load(infile)
with open('train_marginals.dat', 'rb') as infile:
    train_marginals = pickle.load(infile)
with open('rounded_train_marginals.dat', 'rb') as infile:
    rounded_train_marginals = pickle.load(infile)
#dtrain = xgb.DMatrix( "train.buffer")
#dtrain2 = xgb.DMatrix("train2.buffer")
#dtest = xgb.DMatrix("test.buffer")
dtrain = xgb.DMatrix( F_train, label=train_marginals)
dtest = xgb.DMatrix( F_test, label=L_gold_test_dense_cp)
bst = xgb.Booster({'nthread': 4})  # initialize an empty booster
bst.load_model("250.model")  # load the previously saved model


In [None]:
from scipy.sparse import csr_matrix, find
F_train.get_key(session, 5253)

In [3]:
preds = bst.predict(dtest)
preds[preds > 0] = 1
preds[preds < 0] = 0
TP = sum(np.logical_and(preds,L_gold_test_dense_cp))
TN = sum(np.logical_and(np.logical_not(preds),np.logical_not(L_gold_test_dense_cp)))
FP = sum(np.logical_and(preds,np.logical_not(L_gold_test_dense_cp)))
FN = sum(np.logical_and(np.logical_not(preds),L_gold_test_dense_cp))
P = TP/float(TP+FP)
R = TP/float(FN+TP)
F = 2*(P*R)/(P+R)
F

0.54427294882209587
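
The precision, recall, and F1 arithmetic in the cell above is repeated later in the notebook, so it may be convenient to wrap it in a helper. The following is a minimal sketch, not part of Snorkel or XGBoost; the name `binary_prf` and the toy labels are made up for illustration, and it assumes both inputs are already 0/1 arrays. Unlike the inline version, it returns 0.0 instead of dividing by zero when there are no positive predictions.

```python
import numpy as np

def binary_prf(preds, gold):
    """Return (precision, recall, F1) for 0/1 predictions against 0/1 gold labels."""
    preds = np.asarray(preds).astype(bool)
    gold = np.asarray(gold).astype(bool)
    tp = np.sum(preds & gold)    # predicted positive, actually positive
    fp = np.sum(preds & ~gold)   # predicted positive, actually negative
    fn = np.sum(~preds & gold)   # predicted negative, actually positive
    # Guard against empty denominators (e.g. no positive predictions at all)
    precision = tp / float(tp + fp) if tp + fp > 0 else 0.0
    recall = tp / float(tp + fn) if tp + fn > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Toy example with made-up labels: 2 true positives, 1 false positive, 1 false negative
p, r, f1 = binary_prf([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

With the variables from this notebook, the call would look like `binary_prf(preds, L_gold_test_dense_cp)`.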

In [None]:
xgb.plot_importance(bst, height=.4, max_num_features=30)
#xgb.to_graphviz(bst, num_trees=2)
#plt.savefig('sample.pdf')


In [None]:
xgb.plot_tree(bst, num_trees=194)
#plt.savefig('tree249.png', dpi=1200)

In [None]:
F_test[:,143].sum()

In [None]:
from load_external_annotations import load_external_labels
from snorkel.annotations import load_gold_labels
load_external_labels(session, ChemicalDisease, split=1, annotator='gold')
#load_external_labels(session, ChemicalDisease, split=0, annotator='gold')
#load_external_labels(session, ChemicalDisease, split=2, annotator='gold')
L_gold_dev = load_gold_labels(session, annotator_name='gold', split=1)
#L_gold_test = load_gold_labels(session, annotator_name='gold', split=2)
#L_gold_train = load_gold_labels(session, annotator_name='gold', split=0)

In [None]:

L_gold_dev_dense_cp

In [None]:
import copy
L_gold_test_dense = L_gold_test.todense()
L_gold_test_dense = np.squeeze(np.asarray(L_gold_test_dense))
L_gold_test_dense_cp = copy.deepcopy(L_gold_test_dense)
L_gold_test_dense_cp[L_gold_test_dense_cp == -1]=0
L_gold_train_dense = L_gold_train.todense()
L_gold_train_dense = np.squeeze(np.asarray(L_gold_train_dense))
L_gold_train_dense_cp = copy.deepcopy(L_gold_train_dense)
L_gold_train_dense_cp[L_gold_train_dense_cp == -1]=0
L_gold_test_dense_cp
L_gold_dev_dense = L_gold_dev.todense()
L_gold_dev_dense = np.squeeze(np.asarray(L_gold_dev_dense))
L_gold_dev_dense_cp = copy.deepcopy(L_gold_dev_dense)
L_gold_dev_dense_cp[L_gold_dev_dense_cp == -1]=0
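
The three near-identical conversions above all map gold labels from {-1, 1} to {0, 1}. A compact way to express that step is sketched below. The helper name `to_binary_labels` and the sample array are hypothetical, and the sketch assumes the labels have already been made dense (for example with `.todense()` as above) and that every candidate has a gold label; if unlabeled candidates are stored as 0, they would become indistinguishable from negatives after this mapping, which is worth checking.

```python
import numpy as np

def to_binary_labels(labels):
    """Map labels in {-1, 1} to {0, 1} without modifying the input."""
    # Flatten an (n, 1) column (or a matrix) into a 1-D array
    labels = np.asarray(labels).ravel()
    # np.where builds a new array, so no explicit deepcopy is needed
    return np.where(labels == -1, 0, labels)

gold = np.array([[1], [-1], [1], [-1]])  # made-up column of gold labels
binary = to_binary_labels(gold)
```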

In [None]:
from snorkel.annotations import FeatureAnnotator
featurizer = FeatureAnnotator()
%time F_dev = featurizer.apply(split=1)
F_dev

In [None]:
#Apply Featurizer
F_train = featurizer.apply_existing(split=0)
F_dev  = featurizer.apply_existing(split=1)
F_test = featurizer.apply_existing(split=2)

In [None]:
F_train = featurizer.load_matrix(session, split=0)
F_dev   = featurizer.load_matrix(session, split=1)
F_test  = featurizer.load_matrix(session, split=2)

In [None]:
def logregobj(preds, dtrain):
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1.0-preds)
    return grad, hess

def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    # preds are raw margins (before the logistic transformation), so this is
    # the mean logistic loss against the (possibly fractional) labels
    errors = np.mean(labels*np.log(1+np.exp(-preds))+(1-labels)*np.log(1+np.exp(preds)))
    # return a (metric_name, value) pair
    return 'error', errors
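
As a sketch of where `grad` and `hess` in `logregobj` come from, assume the per-candidate training loss is the logistic loss on the raw margin $m$, with a label $y \in [0, 1]$ (a training marginal rather than a hard 0/1 label) and $p = \sigma(m) = 1 / (1 + e^{-m})$. The symbols $m$, $y$, and $p$ are introduced here for illustration only. Under that assumption, the loss is the quantity averaged in `evalerror`, and its first and second derivatives with respect to the margin are the two arrays returned by `logregobj`:

```latex
\begin{align}
\ell(m, y) &= -y \log p - (1 - y) \log (1 - p)
            = y \log\left(1 + e^{-m}\right) + (1 - y) \log\left(1 + e^{m}\right) \\
\frac{\partial \ell}{\partial m} &= p - y \\
\frac{\partial^2 \ell}{\partial m^2} &= p \, (1 - p)
\end{align}
```

Because $y$ only enters linearly, the same expressions hold for fractional labels, which is what lets the marginals be used directly as training targets.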

In [None]:
from scipy.sparse import csr_matrix
F_test2 = csr_matrix(np.zeros(F_test.shape))
dtest2 = xgb.DMatrix( F_test2, label=L_gold_test_dense_cp)

In [None]:
train_marginals_test = np.round(train_marginals)
dtrain2 = xgb.DMatrix( F_train, label=train_marginals_test)

In [None]:
F_train = F_train.tocsc()

In [None]:
F_dev = F_dev.tocsc()

In [None]:
F_dev

In [None]:
watchlist = [(dtrain, 'train')]
num_round = 10

#param = {'max_depth': 6, 'eta': 1, 'objective':'binary:logistic'}
#bst2 = xgb.train(param, dtrain2, num_round, watchlist)

#param = {'max_depth': 5, 'eta': 1, 'eval_metric': 'auc'}
#bst2 = xgb.train(param, dtrain2, num_round, watchlist, logregobj)

param = {'max_depth': 6, 'eta': 1, 'silent':1}
bst2 = xgb.train(param, dtrain, num_round, watchlist, logregobj, evalerror, verbose_eval = False)

In [None]:
# Threshold the raw margins from bst2 at 0 to get hard 0/1 predictions
preds2_prob = bst2.predict(dtest)
preds2 = np.zeros_like(preds2_prob)
preds2[preds2_prob > 0] = 1

In [None]:
bst2 = xgb.Booster({'nthread': 4})  # initialize an empty booster
bst2.load_model("XGBoost_73.model")  # load the previously saved model

In [None]:
preds2_prob = bst2.predict(dtest)
preds2 = np.round(preds2_prob)

In [None]:

TP = sum(np.logical_and(preds2,L_gold_test_dense_cp))
TN = sum(np.logical_and(np.logical_not(preds2),np.logical_not(L_gold_test_dense_cp)))
FP = sum(np.logical_and(preds2,np.logical_not(L_gold_test_dense_cp)))
FN = sum(np.logical_and(np.logical_not(preds2),L_gold_test_dense_cp))
P = TP/float(TP+FP)
R = TP/float(FN+TP)
F = 2*(P*R)/(P+R)
print(sum(preds2), TP, TN, FP, FN, F)

In [None]:
x = False
not(x)

In [None]:
# Save the dev labels to disk
import cPickle as pickle
with open('dev_labels.dat', 'wb') as outfile:
    pickle.dump(L_gold_dev_dense_cp, outfile, pickle.HIGHEST_PROTOCOL)

In [None]:
xgb.plot_tree(bst2, num_trees=1)
plt.savefig('tree1_bad.png', dpi=1200)

In [None]:
dtrain = xgb.DMatrix( F_train, label=train_marginals)
dtest = xgb.DMatrix( F_test, label=L_gold_test_dense_cp)

In [None]:
#Save Tree
bst2.save_model('250.model')

In [None]:
#Save Data
dtrain2.save_binary("train2.buffer")

In [None]:
from snorkel.learning.XGBoost_NoiseAware import XGBoost_NoiseAware
disc_model = XGBoost_NoiseAware()

eta_param = RangeParameter('eta', .05, .3, step=.05)
max_depth_param  = RangeParameter('max_depth', 2, 6, step=1)
min_child_weight_param  = RangeParameter('min_child_weight', 1, 6, step=1)
gamma_param = RangeParameter('gamma', 0, .5, step = .1)
num_rounds_param = RangeParameter('num_rounds', 10, 50, step=10)

searcher = RandomSearch(session, disc_model, F_train, train_marginals, [eta_param, max_depth_param, min_child_weight_param, gamma_param, num_rounds_param], n=100)


In [None]:
searcher.fit(F_dev, L_gold_dev_dense_cp)

In [None]:
from snorkel.learning.XGBoost_NoiseAware import XGBoost_NoiseAware
disc_model = XGBoost_NoiseAware()
disc_model.train(F_train, train_marginals)

In [None]:
#disc_model.train(F_test, L_gold_test_dense_cp)

In [None]:
TP, FP, TN, FN = disc_model.score(session, F_dev, L_gold_dev_dense_cp)

# Part V: Training an extraction model

In the intro tutorial, we automatically featurized the candidates and trained a linear model over these features. Here, we'll train a more complicated model for relation extraction: an LSTM network. You can read more about LSTMs [here](https://en.wikipedia.org/wiki/Long_short-term_memory) or [here](http://colah.github.io/posts/2015-08-Understanding-LSTMs/). An LSTM is a type of recurrent neural network that automatically generates a numerical representation of each candidate from the sentence text, so there is no need to featurize explicitly as in the intro tutorial. LSTMs take longer to train, and Snorkel doesn't currently support hyperparameter searches for them. We'll train a single model here, but feel free to try out other parameter sets. Just make sure to use the development set, not the test set, for model selection.
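
If you do try several parameter sets by hand, the selection loop might look like the sketch below. It is only an illustration of choosing on the development set: `select_on_dev`, `fit_threshold_model`, and the toy data are all hypothetical, and a trivial threshold classifier stands in for the LSTM so that the example runs on its own. None of these names are part of Snorkel.

```python
import numpy as np

def f1_score(preds, gold):
    """F1 for 0/1 predictions against 0/1 gold labels (0.0 if undefined)."""
    preds, gold = np.asarray(preds).astype(bool), np.asarray(gold).astype(bool)
    tp = np.sum(preds & gold)
    fp = np.sum(preds & ~gold)
    fn = np.sum(~preds & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / float(tp + fp), tp / float(tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_on_dev(candidates, fit, dev_X, dev_y):
    """Fit one model per parameter set and keep the one with the best dev F1."""
    best = None
    for params in candidates:
        model = fit(params)
        score = f1_score(model(dev_X), dev_y)
        if best is None or score > best[0]:
            best = (score, params, model)
    return best

def fit_threshold_model(params):
    # Stand-in for training a real model: classify by thresholding a score
    return lambda X: (np.asarray(X) > params['threshold']).astype(int)

# Made-up development data
dev_X = np.array([0.1, 0.4, 0.6, 0.9])
dev_y = np.array([0, 0, 1, 1])
candidates = [{'threshold': 0.05}, {'threshold': 0.5}, {'threshold': 0.95}]

best_score, best_params, best_model = select_on_dev(
    candidates, fit_threshold_model, dev_X, dev_y
)
```

Only `best_model` would then be scored on the test set, and only once, so that the test score stays an unbiased estimate.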

In [None]:
from snorkel.contrib.learning import reLSTM

lstm = reLSTM()
lstm.train(
    train, train_marginals, lr=0.005, dim=200, n_epochs=30,
    dropout_rate=0.5, rebalance=0.25, print_freq=5
)

### Scoring on the test set

Finally, we'll evaluate our performance on the blind test set of 500 documents. We'll load the labels the same way we did for the development set, then use the `score` method of our extraction model to see how we did.

In [None]:
from load_external_annotations import load_external_labels
load_external_labels(session, ChemicalDisease, split=2, annotator='gold')

In [None]:
from snorkel.annotations import load_gold_labels
L_gold_test = load_gold_labels(session, annotator_name='gold', split=2)
L_gold_test

In [None]:
_, _, _, _ = lstm.score(session, test, L_gold_test)