![title](in.jpg)
# Innoplexus Online Hiring Hackathon: Saving lives with AI

## Problem Statement

Clinical studies often require detailed patient information documented in clinical narratives. **Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task that extracts entities of interest (e.g., disease names, medication names and lab tests) from clinical narratives to support clinical and translational research.** Clinical notes have been analyzed in greater detail to harness important information for clinical research and other healthcare operations, as they contain rich, detailed medical information.


In this challenge, hackers are invited to extract all disease names from the 20000 paragraphs/documents in the test set, given the labelled entities (diseases) for the 30000 documents in the train set.

For example, here is a sentence from a clinical report:

*We compared the inter-day reproducibility of post-occlusive **reactive hyperemia** (PORH) assessed by single-point laser Doppler flowmetry (LDF) and laser speckle contrast analysis (LSCI).*


In the sentence given, **reactive hyperemia (in bold)** is the named entity with the type disease/indication.

 

## Data Description
The train file has the following structure:
 
|Variable | Definition|
|---|---|
|id|	Unique ID for a token/word|
|Doc_ID	|Unique ID for a Document/Paragraph|
|Sent_ID|	Unique ID for a Sentence|
|Word	|Exact word/token|
|tag	(Target)| Named Entity Tag  |

The target 'tag' follows the **Inside-outside-beginning (IOB)** tagging format. The IOB format (short for inside, outside, beginning) is a common tagging format for tagging tokens in named-entity recognition.

- **B-indications** (beginning): the token is the beginning of a disease entity (a disease name in this case).
- **I-indications** (inside): the token is inside a disease entity.
- **O** (outside): the token is outside any disease entity.
 
**Example**
For more clarity, let's look at the same sample in the given tabular format, each row here corresponds to a word/token:

The disease **'reactive hyperemia'** is labelled using **'B-indications'** for the word **'reactive'** and **'I-indications'** for the word **'hyperemia'**. All the other words, which are outside **'reactive hyperemia'**, are labelled with **'O'.**
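
To make the tagging scheme concrete, here is a minimal sketch that turns a sequence of IOB tags back into entity strings. The helper name `iob_to_spans`, the tokenisation of the example sentence, and the choice to treat a stray `I-indications` tag as the start of an entity are assumptions made for illustration; they are not part of the contest material.

```python
def iob_to_spans(tags):
    """Return (start, end) token index pairs (end exclusive), one per entity."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B-indications":
            # A new entity begins; close the previous one if it is still open.
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "I-indications":
            # Assumption: an I tag with no open entity starts a new one.
            if start is None:
                start = i
        else:
            # An O tag closes any open entity.
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


# Part of the example sentence, tokenised by hand (assumed tokenisation).
tokens = ["reproducibility", "of", "post-occlusive", "reactive",
          "hyperemia", "(", "PORH", ")"]
tags = ["O", "O", "O", "B-indications", "I-indications", "O", "O", "O"]

spans = iob_to_spans(tags)
entities = [" ".join(tokens[s:e]) for s, e in spans]
print(entities)
```

Keeping entities as token index spans rather than strings makes it easy to compare predicted and true entities position by position.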


## Evaluation Metric

The evaluation for this contest is based on modified F1-Score as explained below:
Suppose the ground truth has the following entities (mentioned in square brackets) for the given sentence

**[Malaria] and [Yellow Fever] remain more deadly than [Hepatitis B] today**

This has 3 entities.
Supposing the actual prediction has the following

**[Malaria] [and] [Yellow] Fever remain more deadly than Hepatitis B [today]**

We have an exact match for "Malaria", false positives for "and" and "today", a false negative for "Hepatitis B", and a substring match for "Yellow". We compute precision and recall by first defining matching criteria. We also reward partial matches here, not just exact entity matches.

Here, true positives are of two types, exact match and partial match, with a weight of 1 given to an exact match and 0.5 to a partial match. The computations are as follows:

Exact Match = 1 ("Malaria"), Partial Match = 1 ("Yellow", which overlaps "Yellow Fever"), False Positives = 2 ("and" and "today"), False Negatives = 1 ("Hepatitis B")

**Precision** = (Exact Match + 0.5 * Partial Match) / (Exact Match + Partial Match + False Positives) = (1 + 0.5)/(1+1+2) = 0.375

**Recall** = (Exact Match + 0.5 * Partial Match) / (Exact Match + Partial Match + False Negatives) = (1 + 0.5)/(1+1+1) = 0.50

**F1 Score** = (2 * Precision * Recall)/(Precision + Recall) = (2 * 0.375 * 0.50)/(0.375 + 0.50) ≈ 0.429


The counts of exact matches, partial matches, false positives and false negatives are summed across all sentences in the test set, and the overall F1 Score computed from those totals is the leaderboard score.

Please find the script for the evaluation metric implemented in Python at this [link](https://gist.github.com/frenzy2106/3a12b7fefeb33941edea45d881d6f81a) 
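
The worked example above can be reproduced with a short sketch. This is not the official script linked above; it is an independent illustration under stated assumptions. Entities are represented as `(start, end)` token index spans with the end exclusive. A predicted span that overlaps a true span without equalling it counts as one partial match, and a true span touched by no prediction counts as a false negative. The function names `count_matches` and `modified_f1` are hypothetical.

```python
def count_matches(gold, pred):
    """Count exact matches, partial matches, false positives, false negatives."""
    gold_set = set(gold)
    matched = set()  # gold spans hit by at least one prediction
    exact = partial = false_pos = 0
    for p in pred:
        if p in gold_set:
            exact += 1
            matched.add(p)
            continue
        # Two half-open spans overlap when each starts before the other ends.
        overlaps = [g for g in gold if g[0] < p[1] and p[0] < g[1]]
        if overlaps:
            partial += 1
            matched.update(overlaps)
        else:
            false_pos += 1
    false_neg = len(gold_set - matched)
    return exact, partial, false_pos, false_neg


def modified_f1(exact, partial, false_pos, false_neg, partial_weight=0.5):
    """Precision, recall and F1 with partial matches weighted by partial_weight."""
    weighted_tp = exact + partial_weight * partial
    p_den = exact + partial + false_pos
    r_den = exact + partial + false_neg
    precision = weighted_tp / p_den if p_den else 0.0
    recall = weighted_tp / r_den if r_den else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Token indices: Malaria(0) and(1) Yellow(2) Fever(3) remain(4) more(5)
# deadly(6) than(7) Hepatitis(8) B(9) today(10)
gold = [(0, 1), (2, 4), (8, 10)]          # Malaria, Yellow Fever, Hepatitis B
pred = [(0, 1), (1, 2), (2, 3), (10, 11)]  # Malaria, and, Yellow, today

counts = count_matches(gold, pred)
precision, recall, f1 = modified_f1(*counts)
print(counts, precision, recall, round(f1, 3))
```

For the leaderboard score the four counts would be summed over all sentences first and `modified_f1` called once on the totals, rather than averaging per-sentence F1 values.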

In [55]:
import numpy as np
import pandas as pd

In [56]:
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')
s=pd.read_csv('sample_submission.csv')



train.head()

Unnamed: 0,id,Doc_ID,Sent_ID,Word,tag
0,1,1,1,Obesity,O
1,2,1,1,in,O
2,3,1,1,Low-,O
3,4,1,1,and,O
4,5,1,1,Middle-Income,O


**Fetching Data**

In [57]:
trainiob=pd.read_csv('train_treeiob.csv')

testiob=pd.read_csv('test_treeiob.csv')

trainpos=pd.read_csv('train_pos.csv')
testpos=pd.read_csv('test_pos.csv')




In [59]:
train['pos']=trainpos['pos']
train['tree_iob']=trainiob['tree_iob']


test['pos']=testpos['pos']
test['tree_iob']=testiob['tree_iob']

train.head()

Unnamed: 0,id,Doc_ID,Sent_ID,Word,tag,pos,tree_iob,pred_tag
0,1,1,1,Obesity,O,NN,B,O
1,2,1,1,in,O,IN,O,O
2,3,1,1,Low-,O,NNP,O,O
3,4,1,1,and,O,CC,O,O
4,5,1,1,Middle-Income,O,JJ,O,O


In [60]:

df=train.copy()
dftest=test.copy()
del train
del test

**Cleaning**

In [61]:
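import re
def clean(v):
    """Strip punctuation from tokens that end in '.' or contain ':', "'" or ','."""
    v = str(v)
    # Single-character tokens (e.g. a lone '.') are left untouched.
    if len(v) != 1 and (v[-1] == '.' or ':' in v or "'" in v or ',' in v):
        # Drop everything that is not a word character or whitespace.
        r = re.sub(r'[^\w\s]', '', v)
    else:
        r = v
    # A token made only of punctuation ends up empty; use ',' as a placeholder.
    if r == '':
        return ','
    return r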
    
df['Word']=df['Word'].apply(clean)
dftest['Word']=dftest['Word'].apply(clean)
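
To see what the cleaning rule does in practice, the sketch below applies the same logic to a few hand-picked tokens. The function body is repeated only so the snippet runs on its own, and the sample tokens are illustrative, not taken from the dataset.

```python
import re


def clean(v):
    # Same rule as the cleaning cell: only tokens that end in '.' or contain
    # ':', "'" or ',' are stripped of punctuation.
    v = str(v)
    if len(v) != 1 and (v[-1] == '.' or ':' in v or "'" in v or ',' in v):
        r = re.sub(r'[^\w\s]', '', v)
    else:
        r = v
    return ',' if r == '' else r


samples = ['hyperemia.', "Crohn's", '1,000', 'Low-', 'anti-TNF.', '.', '...']
cleaned = {tok: clean(tok) for tok in samples}
for tok, out in cleaned.items():
    print(repr(tok), '->', repr(out))
```

Note that a hyphen survives in `'Low-'` because nothing triggers the rule, but is removed from `'anti-TNF.'` because the trailing period causes all punctuation in that token to be stripped.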

In [9]:
df.head()

Unnamed: 0,id,Doc_ID,Sent_ID,Word,tag,pos,tree_iob
0,1,1,1,Obesity,O,NN,B
1,2,1,1,in,O,IN,O
2,3,1,1,Low-,O,NNP,O
3,4,1,1,and,O,CC,O
4,5,1,1,Middle-Income,O,JJ,O


In [10]:
from nltk.corpus import stopwords
from string import punctuation
st=stopwords.words('english')

In [62]:
df.head()

Unnamed: 0,id,Doc_ID,Sent_ID,Word,tag,pos,tree_iob,pred_tag
0,1,1,1,Obesity,O,NN,B,O
1,2,1,1,in,O,IN,O,O
2,3,1,1,Low-,O,NNP,O,O
3,4,1,1,and,O,CC,O,O
4,5,1,1,Middle-Income,O,JJ,O,O


In [77]:
class SentenceGetter(object):
    
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w,p,e, t) for w,p,e, t in zip(s['Word'].astype(str).values.tolist(), 
                                                       s['pos'].astype(str).values.tolist(), s['tree_iob'].values.tolist(),
                                                           s['tag'].values.tolist())]
        self.grouped = self.data.groupby(['Doc_ID','Sent_ID']).apply(agg_func)
        self.sentences = [s for s in self.grouped]
        
    def get_next(self):
        # Return sentences one at a time; None once they are exhausted.
        try:
            s = self.sentences[self.n_sent - 1]
            self.n_sent += 1
            return s
        except IndexError:
            return None
getter = SentenceGetter(df)
sentences = getter.sentences

In [111]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    tr_iob = sent[i][2]
    
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'word.isidentifier()':word.isidentifier(),
        'postag': postag,
        'postag[:2]': postag[:2], 
        'tree_iob':tr_iob,
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        tr_iob1 = sent[i-1][2]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
            '-1:tree_iob':tr_iob1,
        })
    else:
        features['BOS'] = True
        
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        tr_iob1=sent[i+1][2]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
            '+1:tree_iob':tr_iob1,
        })
    else:
        features['EOS'] = True
                
    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag,tiob, label in sent]

def sent2tokens(sent):
    return [token for token, postag,tiob, label in sent]
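
As an illustration of the shape of the CRF input, the sketch below builds feature dictionaries for a toy sentence using a reduced version of the feature function. `window_features` is a hypothetical, cut-down stand-in for `word2features`, and the POS and `tree_iob` values in the toy sentence are made up.

```python
def window_features(sent, i):
    """Reduced feature set: the token itself plus one neighbour on each side."""
    word, postag = sent[i][0], sent[i][1]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'postag': postag,
    }
    if i > 0:
        features['-1:word.lower()'] = sent[i - 1][0].lower()
        features['-1:postag'] = sent[i - 1][1]
    else:
        features['BOS'] = True  # first token of the sentence
    if i < len(sent) - 1:
        features['+1:word.lower()'] = sent[i + 1][0].lower()
        features['+1:postag'] = sent[i + 1][1]
    else:
        features['EOS'] = True  # last token of the sentence
    return features


# Each token is a (word, pos, tree_iob, tag) tuple, as produced by SentenceGetter.
toy = [
    ('Reactive', 'JJ', 'B', 'B-indications'),
    ('hyperemia', 'NN', 'I', 'I-indications'),
    ('improved', 'VBD', 'O', 'O'),
]

feats = [window_features(toy, i) for i in range(len(toy))]
labels = [tag for _, _, _, tag in toy]
for f, label in zip(feats, labels):
    print(label, sorted(f))
```

Each sentence becomes a list of dictionaries with one dictionary per token, paired with a list of labels of the same length, which is the input format `sklearn_crfsuite.CRF.fit` expects.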

In [112]:
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases
X = [sent2features(s) for s in tqdm(sentences)]

In [81]:
y = [sent2labels(s) for s in tqdm(sentences)]

In [90]:
# from sklearn.feature_extraction import DictVectorizer
# from sklearn.feature_extraction.text import HashingVectorizer
# from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
# from sklearn.linear_model import SGDClassifier
# from sklearn.linear_model import PassiveAggressiveClassifier
# from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
X_train, X_test, y_train, y_test = train_test_split(X[:600], y[:600], test_size=0.25, random_state=0)

In [69]:
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics
from collections import Counter
import scipy.stats
from sklearn.metrics import make_scorer
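# sklearn.cross_validation and sklearn.grid_search were removed in scikit-learn 0.20;
# both names now live in sklearn.model_selection.
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV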



## Testing and Evaluating Model

Using only the first 600 sentences (450 for training, 150 held out for evaluation)

In [109]:

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.001,
    c2=0.001,
    max_iterations=100,
    all_possible_transitions=True,verbose=True
)
crf.fit(X_train, y_train)

loading training data to CRFsuite: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 450/450 [00:01<00:00, 237.19it/s]



Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 10688
Seconds required: 0.100

L-BFGS optimization
c1: 0.001000
c2: 0.001000
num_memories: 6
max_iterations: 100
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

Iter 1   time=0.01  loss=1310.92  active=10663 feature_norm=1.00
Iter 2   time=0.01  loss=1221.20  active=10594 feature_norm=1.07
Iter 3   time=0.01  loss=1083.03  active=10530 feature_norm=1.23
Iter 4   time=0.01  loss=1034.19  active=10614 feature_norm=1.34
Iter 5   time=0.01  loss=979.07   active=10629 feature_norm=1.48
Iter 6   time=0.01  loss=873.32   active=10644 feature_norm=1.64
Iter 7   time=0.01  loss=647.20   active=10647 feature_norm=3.23
Iter 8   time=0.01  loss=586.06   active=10679 feature_norm=3.73
Iter 9   time=0.01  loss=572.11   active=10668 feature_norm=4.73
Iter 10  time=

CRF(algorithm='lbfgs', all_possible_states=None,
  all_possible_transitions=True, averaging=None, c=None, c1=0.001,
  c2=0.001, calibration_candidates=None, calibration_eta=None,
  calibration_max_trials=None, calibration_rate=None,
  calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
  gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
  max_linesearch=None, min_freq=None, model_filename=None,
  num_memories=None, pa_type=None, period=None, trainer_cls=None,
  variance=None, verbose=True)

In [92]:
# labels = list(crf.classes_)
# labels.remove('O')
labels=['B-indications','I-indications']

In [93]:
crf.classes_

['O', 'B-indications', 'I-indications']

In [110]:
from sklearn_crfsuite import metrics
y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(y_test, y_pred, labels = labels))
metrics.flat_f1_score(y_test, y_pred,average='weighted', labels=labels)

               precision    recall  f1-score   support

B-indications       0.75      0.42      0.54        43
I-indications       0.76      0.34      0.47        38

  avg / total       0.76      0.38      0.51        81



0.5070137527848971

In [114]:

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.01,
    c2=0.01,
    max_iterations=100,
    all_possible_transitions=True,verbose=True
)
crf.fit(X, y)

loading training data to CRFsuite: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 191282/191282 [09:53<00:00, 322.50it/s]



Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 519782
Seconds required: 65.979

L-BFGS optimization
c1: 0.010000
c2: 0.010000
num_memories: 6
max_iterations: 100
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

Iter 1   time=5.92  loss=660781.26 active=518830 feature_norm=1.00
Iter 2   time=3.19  loss=598779.72 active=515207 feature_norm=1.09
Iter 3   time=2.94  loss=541180.97 active=512454 feature_norm=1.26
Iter 4   time=2.80  loss=526749.81 active=516050 feature_norm=1.34
Iter 5   time=3.02  loss=517387.72 active=516786 feature_norm=1.42
Iter 6   time=2.96  loss=499076.29 active=517174 feature_norm=1.46
Iter 7   time=2.74  loss=303384.84 active=516533 feature_norm=5.88
Iter 8   time=6.34  loss=288576.05 active=517030 feature_norm=6.90
Iter 9   time=2.75  loss=284301.19 active=517161 feature_norm

CRF(algorithm='lbfgs', all_possible_states=None,
  all_possible_transitions=True, averaging=None, c=None, c1=0.01, c2=0.01,
  calibration_candidates=None, calibration_eta=None,
  calibration_max_trials=None, calibration_rate=None,
  calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
  gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
  max_linesearch=None, min_freq=None, model_filename=None,
  num_memories=None, pa_type=None, period=None, trainer_cls=None,
  variance=None, verbose=True)

In [83]:
import pickle
# Use a context manager so the file handle is closed after writing.
with open('crf_model.sav', 'wb') as f:
    pickle.dump(crf, f)

In [84]:
class SentenceGetterTest(object):
    
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w,p,e) for w,p,e in zip(s['Word'].astype(str).values.tolist(), 
                                                       s['pos'].astype(str).values.tolist(), s['tree_iob'].values.tolist())]
        self.grouped = self.data.groupby(['Doc_ID','Sent_ID']).apply(agg_func)
        self.sentences = [s for s in self.grouped]
        
    def get_next(self):
        # Return sentences one at a time; None once they are exhausted.
        try:
            s = self.sentences[self.n_sent - 1]
            self.n_sent += 1
            return s
        except IndexError:
            return None
getter_test = SentenceGetterTest(dftest)
test_sentences = getter_test.sentences

In [113]:
test_X = [sent2features(s) for s in tqdm(test_sentences)]

In [115]:
y_predtest = crf.predict(test_X)

In [116]:
z=[]
for zz in y_predtest:
    z.extend(zz)
print(len(z))

2994463


In [117]:
s=pd.read_csv('sample_submission.csv')
s['tag']=z
s['tag'].value_counts()

O                2935690
B-indications      30975
I-indications      27798
Name: tag, dtype: int64
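
Assigning the flattened predictions to `s['tag']` relies on two conditions that the cells above do not check explicitly. First, the number of predicted tags must equal the number of rows. Second, the sentence order produced by `groupby(['Doc_ID','Sent_ID'])` must match the row order of the file. pandas sorts group keys by default, so the orders agree only when the rows are already sorted by `(Doc_ID, Sent_ID)`. The stdlib-only sketch below shows one way to check both conditions on toy data. The helper name and the sample keys are hypothetical, and it is assumed that `sample_submission.csv` lists rows in the same order as `test.csv`.

```python
from itertools import chain


def grouped_order_matches_rows(keys):
    """True if (Doc_ID, Sent_ID) keys are non-decreasing in file order.

    In that case, concatenating groups in sorted-key order reproduces the
    original row order.
    """
    return all(a <= b for a, b in zip(keys, keys[1:]))


# One (Doc_ID, Sent_ID) key per row, in file order (toy data).
row_keys = [(1, 1), (1, 1), (1, 2), (2, 3), (2, 3)]

# One list of predicted tags per sentence, in sorted-key order (toy data).
per_sentence_preds = [['O', 'B-indications'], ['O'], ['O', 'O']]

flat = list(chain.from_iterable(per_sentence_preds))

lengths_agree = len(flat) == len(row_keys)
order_agrees = grouped_order_matches_rows(row_keys)
print(lengths_agree, order_agrees)
```

On the real data the keys could be taken from `zip(dftest['Doc_ID'], dftest['Sent_ID'])`. If the order check fails, the predicted tags would be attached to the wrong tokens even though the lengths match.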

In [118]:
s.to_csv('s11_mainCRFft2.csv',index=False)