# Anomaly Detection

The `AdeftAnomalyDetector` is an experimental class for detecting texts that have been incorrectly grounded to a gene. The plan is to train a `OneClassSVM` on Entrez texts for a given gene and then use it to search for anomalous papers among those containing text that has been grounded to that gene. If a paper is anomalous, we have cause to believe the grounding is incorrect. Further work is needed to validate whether this method is effective. This notebook demonstrates how to use the `AdeftAnomalyDetector` on a small set of example data.

First, import the required packages and load the data.

In [1]:
import json
from adeft.modeling.investigate import AdeftAnomalyDetector


with open('data/example_training_data.json') as f:
    ir_data = json.load(f)

`ir_data` is a dictionary with two keys, `'texts'` and `'labels'`.

In [2]:
texts, labels = ir_data['texts'], ir_data['labels']

`texts` is a list of training texts and `labels` is a list of the corresponding labels.

We now split these into texts referring to the insulin receptor and texts referring to some other expansion of IR.

In [3]:
gene_texts = [text for text, label in zip(texts, labels) if label == 'HGNC:6091']
other_texts = [text for text, label in zip(texts, labels) if label != 'HGNC:6091']

This dataset contains 101 texts where IR refers to insulin receptor and 399 texts where it refers to something else.

In [4]:
print(len(gene_texts), len(other_texts))

101 399


Then we instantiate the anomaly detector. The first argument is the HGNC symbol for the gene of interest and the second argument is a list of synonyms. Here we use "IR" as the only synonym. **When we use the `AdeftAnomalyDetector` for real examples, the list of synonyms should be a filtered list of texts that are being grounded to the gene of interest.**

In [5]:
anomaly_detector = AdeftAnomalyDetector('INSR', ['IR'])

# Validation

As with the `AdeftClassifier`, we can use a `cv` method to perform a grid search. For now you'll have to enter a `param_grid` using the syntax for scikit-learn pipelines as shown below. In the future you will be able to use parameter names directly as keys. The parameter `nu` for the `OneClassSVM` is an upper bound on the fraction of training errors that can be made.

In [6]:
param_grid = {'max_features': [10, 50],
              'ngram_range': [(1, 1)],
              'nu': [0.2]}
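For reference, scikit-learn's pipeline syntax addresses a parameter as `<step>__<parameter>`. The following is a minimal sketch of that convention, not Adeft's implementation: it assumes a two-step pipeline whose steps are named `tfidf` and `oc_svm` (an assumption made for this sketch), and the toy texts are invented purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.svm import OneClassSVM

# Hypothetical stand-in for the detector's internal pipeline.
# The step names 'tfidf' and 'oc_svm' are assumptions for this sketch.
pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                     ('oc_svm', OneClassSVM())])

# The same grid as above, written with pipeline-style keys.
pipeline_param_grid = {'tfidf__max_features': [10, 50],
                       'tfidf__ngram_range': [(1, 1)],
                       'oc_svm__nu': [0.2]}

# A grid search tries every combination; here there are two.
combinations = list(ParameterGrid(pipeline_param_grid))
print(len(combinations))

# Apply the first combination and fit on invented toy texts.
pipeline.set_params(**combinations[0])
toy_texts = ['insulin receptor signaling in liver cells',
             'the insulin receptor binds insulin',
             'insulin receptor substrate phosphorylation',
             'glucose uptake depends on the insulin receptor',
             'mutations in the insulin receptor gene',
             'insulin receptor expression in muscle tissue']
pipeline.fit(toy_texts)

# Raw OneClassSVM output: +1 for inliers, -1 for outliers.
raw_predictions = pipeline.predict(toy_texts)
print(raw_predictions)
```

Each key routes a value to one named step, so `oc_svm__nu` sets `nu` on the `OneClassSVM` and `tfidf__max_features` sets `max_features` on the vectorizer.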

The cross-validation scheme divides `gene_texts` into folds in the typical fashion, with each test fold being augmented with a random sample of texts from `other_texts`. The parameter `k` controls the number of non-gene texts to add to each test fold. The following examples use three-fold cross-validation and add 5, 10, 50, and 100 negative texts per fold. This is useful for seeing how results vary depending on the level of ambiguity of a term.

In [18]:
anomaly_detector.cv(gene_texts, other_texts, param_grid, n_jobs=4, cv=5)
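The fold construction described above can be sketched roughly as follows. This is only an illustration of the idea under stated assumptions, not the code Adeft runs: `augmented_folds` is a hypothetical helper, the toy texts are placeholders, and the label convention (1.0 for anomalous, 0.0 for gene texts) is taken from the prediction convention used later in this notebook.

```python
import random

from sklearn.model_selection import KFold


def augmented_folds(gene_texts, other_texts, n_splits=3, k=5, seed=0):
    """Yield (train_texts, test_texts, test_labels) for each fold.

    Hypothetical helper: training folds contain gene texts only, and
    each test fold is padded with k randomly sampled non-gene texts.
    """
    rng = random.Random(seed)
    splitter = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in splitter.split(gene_texts):
        train_texts = [gene_texts[i] for i in train_idx]
        held_out = [gene_texts[i] for i in test_idx]
        # Sample k negatives without replacement for this fold.
        negatives = rng.sample(other_texts, k)
        test_texts = held_out + negatives
        # 0.0 marks a gene text, 1.0 marks an anomalous (non-gene) text.
        test_labels = [0.0] * len(held_out) + [1.0] * len(negatives)
        yield train_texts, test_texts, test_labels


toy_gene = ['gene text %d' % i for i in range(9)]
toy_other = ['other text %d' % i for i in range(20)]
folds = list(augmented_folds(toy_gene, toy_other, n_splits=3, k=5))
for train_texts, test_texts, test_labels in folds:
    # Sizes of the training fold, the augmented test fold, and the negatives.
    print(len(train_texts), len(test_texts), int(sum(test_labels)))
```

Because the model only ever trains on gene texts, increasing `k` changes the composition of the test folds but not what the model learns.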

In [8]:
cv_results = anomaly_detector.grid_search.cv_results_

In [9]:
cv_results

{'mean_fit_time': array([0.04379511, 0.03231721]),
 'std_fit_time': array([0.011137  , 0.00377295]),
 'mean_score_time': array([0.07023048, 0.06568398]),
 'std_score_time': array([0.00689565, 0.00586244]),
 'param_oc_svm__nu': masked_array(data=[0.2, 0.2],
              mask=[False, False],
        fill_value='?',
             dtype=object),
 'param_tfidf__max_features': masked_array(data=[10, 50],
              mask=[False, False],
        fill_value='?',
             dtype=object),
 'param_tfidf__ngram_range': masked_array(data=[(1, 1), (1, 1)],
              mask=[False, False],
        fill_value='?',
             dtype=object),
 'params': [{'oc_svm__nu': 0.2,
   'tfidf__max_features': 10,
   'tfidf__ngram_range': (1, 1)},
  {'oc_svm__nu': 0.2,
   'tfidf__max_features': 50,
   'tfidf__ngram_range': (1, 1)}],
 'split0_test_sens': array([0.8625, 0.7625]),
 'split1_test_sens': array([0.875 , 0.7125]),
 'split2_test_sens': array([0.7875, 0.6375]),
 'split3_test_sens': array([0.875, 0.7

In [19]:
anomaly_detector.specificity + anomaly_detector.sensitivity

1.533410141937465

In [11]:
sum(anomaly_detector.predict(gene_texts))/len(gene_texts)

0.19801980198019803

In [12]:
sum(anomaly_detector.predict(other_texts))/len(other_texts)

0.7017543859649122

In [20]:
anomaly_detector.confidence_interval(texts)

500 300
0.5564541227024271 0.6420210092052625


(0.7094138987738328, 0.869828708646096)

In [15]:
anomaly_detector.best_score

1.501704891572853

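The metric used is closely related to balanced accuracy, which is the average of the recall values for each class. Because the reported score exceeds 1, it appears to be the sum of sensitivity and specificity, that is, twice the balanced accuracy. As expected, this score does not vary with the number of anomalous examples in the test set.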
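To make the invariance claim concrete, here is a small sketch that computes per-class recall by hand. The function names are hypothetical, the toy labels are invented, and treating anomalous texts (1.0) as the positive class is an assumption made for this sketch rather than something the detector's attributes are known to follow.

```python
def class_recall(y_true, y_pred, label):
    """Fraction of examples with true class `label` predicted as `label`."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t == label]
    return sum(1 for t, p in pairs if p == label) / len(pairs)


def balanced_accuracy(y_true, y_pred):
    """Average of the recall on anomalous (1.0) and gene (0.0) texts."""
    sensitivity = class_recall(y_true, y_pred, 1.0)
    specificity = class_recall(y_true, y_pred, 0.0)
    return (sensitivity + specificity) / 2


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy detector: correct on 3 of 4 gene texts and 4 of 5 anomalous texts.
gene_true, gene_pred = [0.0] * 4, [0.0, 0.0, 0.0, 1.0]
anom_true, anom_pred = [1.0] * 5, [1.0, 1.0, 1.0, 1.0, 0.0]

# Same detector behaviour, but ten times as many anomalous examples.
small = (gene_true + anom_true, gene_pred + anom_pred)
large = (gene_true + anom_true * 10, gene_pred + anom_pred * 10)

ba_small, ba_large = balanced_accuracy(*small), balanced_accuracy(*large)
acc_small, acc_large = accuracy(*small), accuracy(*large)
print(ba_small, ba_large)
print(acc_small, acc_large)
```

The balanced accuracy is the same for both test sets, while plain accuracy shifts as the share of anomalous examples grows. The same holds for the sum of the two recalls, since it is just twice the average.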

# Training

If there are no negative examples available, you can train on only the positive examples using the `train` method.

In [14]:
anomaly_detector.train(gene_texts, nu=0.05, max_features=100, ngram_range=(1, 2))

Once trained, predictions can be made using the `predict` method. A prediction of 1.0 is made for examples the classifier believes to be anomalous, and a prediction of 0.0 is made for all other examples.

In [12]:
preds1 = anomaly_detector.predict(gene_texts)

In [13]:
preds2 = anomaly_detector.predict(other_texts)

In [14]:
print('%d of the %d gene texts were predicted to be anomalous while %d of the %d other texts were'
      ' predicted to be anomalous' % (sum(preds1), len(gene_texts), sum(preds2), len(other_texts)))

33 of the 101 gene texts were predicted to be anomalous while 358 of the 399 other texts were predicted to be anomalous
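scikit-learn's `OneClassSVM.predict` returns +1 for inliers and -1 for outliers, so the 1.0/0.0 convention used here implies a conversion somewhere. The sketch below shows one way such a mapping could look; `to_anomaly_labels` is a hypothetical helper, not a function from Adeft, and the raw predictions are invented.

```python
import numpy as np


def to_anomaly_labels(svm_predictions):
    """Map OneClassSVM output (+1 inlier, -1 outlier) to 0.0 / 1.0.

    Hypothetical helper: 1.0 marks an anomalous example.
    """
    svm_predictions = np.asarray(svm_predictions)
    return np.where(svm_predictions == -1, 1.0, 0.0)


raw = np.array([1, -1, 1, 1, -1])
anomaly_labels = to_anomaly_labels(raw)
print(anomaly_labels.tolist())   # [0.0, 1.0, 0.0, 0.0, 1.0]

# With this encoding, the sum counts the anomalous examples and the
# mean gives the fraction of examples flagged as anomalous.
print(int(anomaly_labels.sum()), anomaly_labels.mean())
```

This encoding is what makes expressions such as `sum(preds1)` usable as a count of flagged texts.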


In [14]:
399/500

0.798