# Identifying Entities in Healthcare Data

---
**Problem Statement**

A health-tech company, Be Healthy, aims to connect the medical community with millions of patients across the country. It has a web platform that allows doctors to list their services and manage patient interactions, and that provides services for patients such as booking interactions with doctors and ordering medicines online. On this platform, doctors can easily organize appointments, track past medical records, and provide e-prescriptions.

Build a custom NER model to extract the diseases and their treatments from the dataset, and list them in the form of a table or a dictionary.

---

**The work proceeds in the following steps:**

- Process the data into sentence format. This has to be done for the 'train_sent' and 'train_label' datasets and for the test datasets as well.
- Define the features used to build the CRF model.
- Apply these features to each sentence of the train and test datasets to get the feature values.
- Define the target variable and build the CRF model.
- Evaluate the model on the test dataset.
- Create a dictionary in which diseases are keys and treatments are values.

In [1]:
#import necessary libraries
import spacy
import sklearn_crfsuite
from sklearn_crfsuite import metrics
import pandas as pd
import numpy as np
import re
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
model = spacy.load("en_core_web_sm")

In [2]:
### Function to read a text file and convert it into sentences
def word2sentence(file_name):
    with open(file_name, "r") as f:
        data = f.readlines()
    for i in range(len(data)):
        # Lines holding a token lose their newline; blank lines stay as "\n" and mark sentence boundaries
        if re.search("^.+\n$", data[i]):
            data[i] = data[i].replace("\n", "")
    st = " ".join(data)
    return pd.DataFrame(st.split(" \n "))
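
To make the input format concrete, here is a minimal sketch of the same idea on a toy file. It assumes, as the function above implies, that each file holds one token per line with a blank line between sentences. The helper `tokens_to_sentences` and the sample text are hypothetical and are not part of the original notebook.

```python
import os
import tempfile

def tokens_to_sentences(path):
    """Hypothetical stand-in: one token per line, blank line = sentence break."""
    sentences, current = [], []
    with open(path, "r") as handle:
        for line in handle:
            token = line.strip()
            if token:
                current.append(token)
            elif current:
                # A blank line closes the sentence collected so far
                sentences.append(" ".join(current))
                current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

# Toy input (assumed format): two sentences separated by a blank line
sample = "Aspirin\nreduces\nfever\n\nRest\nhelps\n"
with tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt") as tmp:
    tmp.write(sample)
    sample_path = tmp.name

demo_sentences = tokens_to_sentences(sample_path)
os.remove(sample_path)
print(demo_sentences)
```

Unlike `word2sentence`, this variant returns a plain list and ignores repeated blank lines, which may or may not match the real data files.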

In [3]:
#### Reading data into DataFrames
train_sent = word2sentence("train_sent")
train_label = word2sentence("train_label")
test_sent = word2sentence("test_sent")
test_label = word2sentence("test_label")

In [4]:
print("Printing Training 5 Top rows \n ")
print(train_sent.head())
print("\n \n ")
print("Printing Training label 5 Top rows \n ")
print(train_label.head())

Printing Training 5 Top rows 
 
                                                   0
0  All live births > or = 23 weeks at the Univers...
1  The total cesarean rate was 14.4 % ( 344 of 23...
2  Abnormal presentation was the most common indi...
3  The `` corrected '' cesarean rate ( maternal-f...
4  Arrest of dilation was the most common indicat...

 
 
Printing Training label 5 Top rows 
 
                                                   0
0  O O O O O O O O O O O O O O O O O O O O O O O ...
1  O O O O O O O O O O O O O O O O O O O O O O O O O
2                      O O O O O O O O O O O O O O O
3  O O O O O O O O O O O O O O O O O O O O O O O ...
4        O O O O O O O O O O O O O O O O O O O O O O


In [5]:
print("Printing Test 5 Top rows \n ")
print(test_sent.head())
print("\n \n ")
print("Printing Test label 5 Top rows \n ")
print(test_label.head())

Printing Test 5 Top rows 
 
                                                   0
0  Furthermore , when all deliveries were analyze...
1  As the ambient temperature increases , there i...
2  The daily high temperature ranged from 71 to 1...
3  There was a significant correlation between th...
4  Fluctuations in ambient temperature are invers...

 
 
Printing Test label 5 Top rows 
 
                                                   0
0  O O O O O O O O O O O O O O O O O O O O O O O ...
1              O O O O O O O O O O O O O O O O O O O
2    O O O O O O O O O O O O O O O O O O O O O O O O
3  O O O O O O O O O O O O O O O O O O O O O O O ...
4                              O O O O O O O O O O O


In [6]:
print("The number of lines/sentences in training sentence file",len(train_sent))
print("The number of lines/sentences in test sentence file",len(test_sent))

The number of lines/sentences in training sentence file 2599
The number of lines/sentences in test sentence file 1056


In [7]:
print("The number of lines in training label file",len(train_label))
print("The number of lines in test label file",len(test_label))

The number of lines in training label file 2599
The number of lines in test label file 1056
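
Matching line counts do not by themselves guarantee that every sentence has exactly one label per token. A minimal sketch of a per-sentence check follows; the helper `find_misaligned` and the toy inputs are hypothetical, and the function expects plain lists of strings rather than DataFrames.

```python
def find_misaligned(sentences, label_lines):
    """Return indices where the token count and the label count differ."""
    return [
        index
        for index, (sentence, labels) in enumerate(zip(sentences, label_lines))
        if len(sentence.split()) != len(labels.split())
    ]

# Toy data (assumed): the second pair has three tokens but only two labels
toy_sentences = ["Aspirin reduces fever", "Rest helps recovery"]
toy_labels = ["T O D", "O O"]

misaligned = find_misaligned(toy_sentences, toy_labels)
print(misaligned)
```

An empty result on the real data would indicate that tokens and labels line up one to one.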


In [8]:
All_data=pd.concat([train_sent,test_sent],axis=0).reset_index(drop=True)

In [9]:
All_data

|      | 0 |
|------|---|
| 0    | All live births > or = 23 weeks at the Univers... |
| 1    | The total cesarean rate was 14.4 % ( 344 of 23... |
| 2    | Abnormal presentation was the most common indi... |
| 3    | The `` corrected '' cesarean rate ( maternal-f... |
| 4    | Arrest of dilation was the most common indicat... |
| ...  | ... |
| 3650 | Reduction of vasoreactivity and thrombogenicit... |
| 3651 | Effects of ultrasound energy on total peripher... |
| 3652 | High-dose chemotherapy with autologous stem-ce... |
| 3653 | `` Tandem '' high-dose chemoradiotherapy with ... |


In [10]:
# Build the feature list for the word at position `pos` of a tokenized sentence
def getFeaturesForOneWord(sentence, pos):
    word = sentence[pos]
    features = [
        'word.lower=' + word.lower(),  # word in lower case
        'word[-3:]=' + word[-3:],  # last 3 characters of the word
        'word[-2:]=' + word[-2:],  # last 2 characters of the word
        'word.isupper=%s' % word.isupper(),  # is the word all uppercase
        'word.isdigit=%s' % word.isdigit(),  # is the word a number
        'word.istitle=%s' % word.istitle(),  # is the word title-cased
        'words.startsWithCapital=%s' % word[0].isupper()  # does the word start with a capital letter
    ]

    if pos > 0:
        prev_word = sentence[pos - 1]
        features.extend([
            'prev_word.lower=' + prev_word.lower(),  # previous word in lower case
            'prev_word.isupper=%s' % prev_word.isupper(),  # is the previous word all uppercase
            'prev_word.isdigit=%s' % prev_word.isdigit(),  # is the previous word a number
            'prev_word.istitle=%s' % prev_word.istitle(),  # is the previous word title-cased
            'prev_word.startsWithCapital=%s' % prev_word[0].isupper()  # does the previous word start with a capital letter
        ])
    else:
        features.append('BEG')  # marks the first word of the sentence

    if pos == len(sentence) - 1:
        features.append('END')  # marks the last word of the sentence
    return features

In [11]:
# Get the features for every word in a sentence.

def getFeaturesForOneSentence(sentence):
    sentence_list = sentence.split()
    return [getFeaturesForOneWord(sentence_list, pos) for pos in range(len(sentence_list))]

In [12]:
# Get the labels for a sentence as a list.
def getLabelsInListForOneSentence(labels):
    return labels.split()

In [13]:
#sentences in list format
train_sent=train_sent[0].tolist()
train_label=train_label[0].tolist()
test_sent=test_sent[0].tolist()
test_label=test_label[0].tolist()

In [14]:
#check whether feature extraction works correctly
example_sentence = train_sent[5]
print(example_sentence)

features = getFeaturesForOneSentence(example_sentence)
features[2]

Cesarean rates at tertiary care hospitals should be compared with rates at community hospitals only after correcting for dissimilar patient groups or gestational age


['word.lower=at',
 'word[-3:]=at',
 'word[-2:]=at',
 'word.isupper=False',
 'word.isdigit=False',
 'word.istitle=False',
 'words.startsWithCapital=False',
 'prev_word.lower=rates',
 'prev_word.isupper=False',
 'prev_word.isdigit=False',
 'prev_word.istitle=False',
 'prev_word.startsWithCapital=False']

In [15]:
#get train and test X data
X_train = [getFeaturesForOneSentence(sentence) for sentence in train_sent]
X_test = [getFeaturesForOneSentence(sentence) for sentence in test_sent]

In [16]:
#get train and test Y data
Y_train = [getLabelsInListForOneSentence(labels) for labels in train_label]
Y_test = [getLabelsInListForOneSentence(labels) for labels in test_label]

In [17]:
#CRF
crf = sklearn_crfsuite.CRF(max_iterations=100)
crf.fit(X_train, Y_train)



CRF(keep_tempfiles=None, max_iterations=100)

In [18]:
#make prediction
y_pred = crf.predict(X_test)

In [19]:
#check f1 score
metrics.flat_f1_score(Y_test, y_pred, average='weighted')

0.9032506607059889
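
The weighted score above averages over every label, including `O`, which appears to make up most of the tokens (the label previews earlier are almost entirely `O`). A per-label breakdown shows how well the `D` and `T` tags themselves are predicted. The sketch below computes token-level F1 per label in plain Python; the helper `per_label_f1` and the toy sequences are hypothetical and are not results from this dataset.

```python
from collections import Counter

def per_label_f1(y_true, y_pred):
    """Token-level F1 for each label over lists of label sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_seq, pred_seq in zip(y_true, y_pred):
        for true_tag, pred_tag in zip(true_seq, pred_seq):
            if true_tag == pred_tag:
                tp[true_tag] += 1
            else:
                fp[pred_tag] += 1  # predicted label was wrong
                fn[true_tag] += 1  # true label was missed
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores

# Toy sequences (assumed), not taken from the dataset
toy_true = [["O", "D", "D", "O", "T"], ["O", "O", "T", "O"]]
toy_pred = [["O", "D", "O", "O", "T"], ["O", "O", "O", "O"]]

toy_scores = per_label_f1(toy_true, toy_pred)
for label in sorted(toy_scores):
    print(label, round(toy_scores[label], 3))
```

In this toy case `O` scores higher than `D` and `T`, which is the pattern a weighted average can hide. With `sklearn_crfsuite`, passing `labels=['D', 'T']` to `metrics.flat_f1_score` should restrict the score to the entity tags in a similar way.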

In [20]:
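#create a dictionary of disease and treatment: for each test sentence with both D and T tags predicted,
#the words tagged D (joined) become the key and the words tagged T (joined) become the value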
dict={}
for i in range(len(y_pred)):
    d=[]
    t=[]
    for word,tag in zip(test_sent[i].split(),y_pred[i]):
        if(tag=='T'):
            t.append(word)
        if(tag=='D'):
            d.append(word)
    if(len(d)>0 and len(t)>0):
        dict[" ".join(d)]=" ".join(t)
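
The loop above joins all `D` words of a sentence into a single key and keeps only the latest treatment when a disease appears more than once. One possible variant, shown here as a sketch and not as the notebook's method, groups contiguous tags into separate spans and collects every treatment seen for a disease in a list. The helpers `extract_spans` and `build_disease_treatments`, and the toy sentences and tags, are hypothetical.

```python
def extract_spans(words, tags, target):
    """Join runs of consecutive words carrying the target tag into phrases."""
    spans, current = [], []
    for word, tag in zip(words, tags):
        if tag == target:
            current.append(word)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

def build_disease_treatments(sentences, predictions):
    """Map each disease phrase to the list of treatment phrases seen with it."""
    mapping = {}
    for sentence, tags in zip(sentences, predictions):
        words = sentence.split()
        diseases = extract_spans(words, tags, "D")
        treatments = extract_spans(words, tags, "T")
        # Only sentences mentioning both a disease and a treatment contribute
        for disease in diseases:
            for treatment in treatments:
                known = mapping.setdefault(disease, [])
                if treatment not in known:
                    known.append(treatment)
    return mapping

# Toy sentences and tags (assumed), not model output
toy_sentences = [
    "radiotherapy for hereditary retinoblastoma",
    "chemotherapy and surgery for hereditary retinoblastoma",
]
toy_tags = [
    ["T", "O", "D", "D"],
    ["T", "O", "T", "O", "D", "D"],
]

toy_mapping = build_disease_treatments(toy_sentences, toy_tags)
print(toy_mapping)
```

Pairing every disease with every treatment in the same sentence is a simplification; it cannot tell which treatment belongs to which disease when a sentence mentions several of each.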

In [21]:
#check dictionary
dict

{'hereditary retinoblastoma': 'radiotherapy',
 'unstable angina or non-Q-wave myocardial infarction': 'roxithromycin',
 'coronary-artery disease': 'Antichlamydial antibiotics',
 'cellulitis': 'G-CSF therapy intravenous antibiotic treatment',
 'foot infection': 'G-CSF treatment',
 "early Parkinson 's disease": 'Ropinirole monotherapy',
 'female stress urinary incontinence': 'surgical treatment',
 'stress urinary incontinence': 'therapy',
 'preeclampsia ( proteinuric hypertension )': 'intrauterine insemination with donor sperm versus intrauterine insemination',
 'cancer': 'oral drugs chemotherapy',
 'major pulmonary embolism': 'Thrombolytic treatment right-side hemodynamics',
 'malignant pleural mesothelioma': 'thoracotomy , radiotherapy , and chemotherapy',
 'tumor markers pulmonary symptoms attributable': 'chemotherapy',
 'non-obstructive azoospermia': 'testicular fine needle aspiration ( TEFNA ) open biopsy and testicular sperm extraction ( TESE )',
 'colorectal cancer': 'potential cu

In [22]:
### Looking up a disease in the dictionary
dict['hereditary retinoblastoma']

'radiotherapy'

In [23]:
#create data frame
disease_treatment_df = pd.DataFrame([(k,v) for k, v in dict.items()], columns=['Disease', 'Predicted Treatment'])

In [24]:
#check created df
disease_treatment_df

|     | Disease | Predicted Treatment |
|-----|---------|---------------------|
| 0   | hereditary retinoblastoma | radiotherapy |
| 1   | unstable angina or non-Q-wave myocardial infar... | roxithromycin |
| 2   | coronary-artery disease | Antichlamydial antibiotics |
| 3   | cellulitis | G-CSF therapy intravenous antibiotic treatment |
| 4   | foot infection | G-CSF treatment |
| ... | ... | ... |
| 102 | postvitrectomy diabetic vitreous hemorrhage | Peripheral retinal cryotherapy |
| 103 | hepatitis B | vaccine containing MF59 |
| 104 | temporomandibular joint arthropathy | arthroscopic treatment |
| 105 | severe secondary peritonitis | Surgical management |


In [25]:
#keyword
key_word='inflammatory'

In [26]:
#check search using keyword
disease_treatment_df[disease_treatment_df.Disease.str.contains(key_word)]

|    | Disease | Predicted Treatment |
|----|---------|---------------------|
| 25 | inflammatory skin diseases | topical corticosteroids |
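
The search above is case-sensitive, so a keyword typed as `Inflammatory` would return nothing. A minimal sketch of a case-insensitive lookup over a disease-to-treatment dictionary follows; the helper `search_diseases` is hypothetical, and the sample entries are copied from the outputs shown earlier.

```python
def search_diseases(disease_to_treatment, keyword):
    """Return the entries whose disease name contains the keyword, ignoring case."""
    needle = keyword.lower()
    return {
        disease: treatment
        for disease, treatment in disease_to_treatment.items()
        if needle in disease.lower()
    }

# A few entries taken from the outputs above
sample_mapping = {
    "hereditary retinoblastoma": "radiotherapy",
    "foot infection": "G-CSF treatment",
    "inflammatory skin diseases": "topical corticosteroids",
}

matches = search_diseases(sample_mapping, "Inflammatory")
print(matches)
```

On the DataFrame itself, `str.contains(key_word, case=False)` should give the same case-insensitive behavior.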
