## "AI for Health" Computer Science 4th year


This project predicts the clinical specialty for a patient by analysing and classifying medical transcription text. It takes a patient's symptoms, history and current problems as input and outputs the most likely specialty.
The project uses natural language processing to clean and analyse the text, converts it to features with `TfidfVectorizer`, and trains a Random Forest classifier on them. The trained model reaches about 0.54 to 0.55 accuracy on the held-out test set over nine specialties and can suggest the related medical department from a patient's details.

Dataset (Google Drive): https://drive.google.com/drive/folders/1vOjR_Qs55BJI_DSaX56crr155hHiq6Go?usp=sharing

In [7]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import re
import string
from numpy import dot
from numpy.linalg import norm

import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.stem import WordNetLemmatizer
from imblearn.over_sampling import SMOTE
nltk.download('stopwords')
from nltk.corpus import stopwords
nltk.download('omw-1.4')

import spacy
import en_ner_bionlp13cg_md

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [8]:
#loading dataset
clinicaldf = pd.read_csv('transcription_dataset.csv')
clinicaldf.head(1)

Unnamed: 0.1,Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
0,0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller..."


In [9]:
clinicaldf.shape #(rows,columns)

(4999, 6)

### Data Cleaning

In [10]:
clinicaldf.isnull().sum() #checking null values

Unnamed: 0              0
description             0
medical_specialty       0
sample_name             0
transcription          33
keywords             1068
dtype: int64

In [11]:
#dropping rows with a missing transcription or missing keywords
clinicaldf = clinicaldf[clinicaldf['transcription'].notna()]
clinicaldf = clinicaldf[clinicaldf['keywords'].notna()]
clinicaldf.isnull().sum()

Unnamed: 0           0
description          0
medical_specialty    0
sample_name          0
transcription        0
keywords             0
dtype: int64

In [12]:
#checking duplicates
clinicaldf.duplicated().sum()

0

In [13]:
#only taking necessary columns/data for prediction
clinicaldf = clinicaldf[['transcription','medical_specialty']]

In [14]:
clinicaldf.medical_specialty.value_counts()

 Surgery                          1021
 Orthopedic                        303
 Cardiovascular / Pulmonary        280
 Radiology                         251
 Consult - History and Phy.        234
 Gastroenterology                  195
 Neurology                         168
 General Medicine                  146
 SOAP / Chart / Progress Notes     142
 Urology                           140
 Obstetrics / Gynecology           130
 ENT - Otolaryngology               84
 Neurosurgery                       81
 Ophthalmology                      79
 Discharge Summary                  77
 Nephrology                         63
 Hematology - Oncology              62
 Pain Management                    58
 Office Notes                       44
 Pediatrics - Neonatal              42
 Podiatry                           42
 Emergency Room Reports             31
 Dermatology                        25
 Dentistry                          25
 Cosmetic / Plastic Surgery         25
 Letters                 

In [15]:
#keeping only specialties with at least 80 samples, for a better model and prediction
counts = clinicaldf['medical_specialty'].value_counts()
clinicaldf = clinicaldf[~clinicaldf['medical_specialty'].isin(counts[counts < 80].index)]
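The filter in cell 15 drops every specialty whose row count is below the threshold. The same idea can be sketched in plain Python on a toy label list; the labels and the small `MIN_SAMPLES` value here are assumptions chosen for illustration (the notebook uses 80).

```python
from collections import Counter

MIN_SAMPLES = 3  # illustrative threshold; the notebook uses 80

# Toy labels (assumed data, not taken from the dataset)
labels = ["Surgery"] * 5 + ["Radiology"] * 3 + ["Letters"] * 2

# Count rows per class, then keep only classes that meet the threshold
counts = Counter(labels)
kept_classes = {name for name, n in counts.items() if n >= MIN_SAMPLES}
filtered = [lab for lab in labels if lab in kept_classes]

print(sorted(kept_classes), len(filtered))
```

Classes with very few rows give the classifier little to learn from, which is the motivation for removing them before training.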

In [16]:
clinicaldf.medical_specialty.value_counts() #remaining specialties all have at least 80 samples

 Surgery                          1021
 Orthopedic                        303
 Cardiovascular / Pulmonary        280
 Radiology                         251
 Consult - History and Phy.        234
 Gastroenterology                  195
 Neurology                         168
 General Medicine                  146
 SOAP / Chart / Progress Notes     142
 Urology                           140
 Obstetrics / Gynecology           130
 ENT - Otolaryngology               84
 Neurosurgery                       81
Name: medical_specialty, dtype: int64

In [17]:
clinicaldf.columns


Index(['transcription', 'medical_specialty'], dtype='object')

In [18]:
clinicaldf.shape

(3175, 2)

In [19]:
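#cleaning data by removing unwanted characters
#raw strings keep the regex escapes (\[ \] \|) from being read as string escapes
special_character_remover = re.compile(r'[/(){}\[\]\|@,;]')  #matches are replaced with a space
extra_symbol_remover = re.compile(r'[^0-9a-z #+_]')  #matches are deleted outright
STOPWORDS = set(stopwords.words('english'))
nlp = spacy.load("en_ner_bionlp13cg_md")  #scispaCy biomedical NER model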

def clean_text(text):
    text = text.lower()
    text = special_character_remover.sub(' ',text)
    text = extra_symbol_remover.sub('',text)
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)
    return text

def lemmatize_text(text):
    wordlist=[]
    lemmatizer = WordNetLemmatizer() 
    sentences=sent_tokenize(text)
    
    for sentence in sentences:
        words=word_tokenize(sentence)
        for word in words:
            wordlist.append(lemmatizer.lemmatize(word))    
    return ' '.join(wordlist) 

def process_Text(text):
    wordlist=[]
    doc = nlp(text)
    for ent in doc.ents:
        wordlist.append(ent.text)
    return ' '.join(wordlist) 
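To make the effect of the two regular expressions in `clean_text` concrete, the following standalone sketch applies them to a short sample sentence. The small `DEMO_STOPWORDS` set is an assumption standing in for NLTK's English stopword list, so the snippet runs without any downloads; the sample sentence is also made up for illustration.

```python
import re

# Same patterns as in the notebook cell above
special_character_remover = re.compile(r'[/(){}\[\]\|@,;]')
extra_symbol_remover = re.compile(r'[^0-9a-z #+_]')

# Assumed stand-in for NLTK's English stopword list
DEMO_STOPWORDS = {"the", "of", "to", "is", "and", "with", "an"}

def clean_text_demo(text):
    text = text.lower()
    text = special_character_remover.sub(' ', text)   # listed punctuation -> space
    text = extra_symbol_remover.sub('', text)         # anything else unusual is deleted
    return ' '.join(w for w in text.split() if w not in DEMO_STOPWORDS)

sample = "The aortic valve is calcified, with an ejection fraction of 70% to 75%."
cleaned = clean_text_demo(sample)
hyphenated = clean_text_demo("23-year-old")
print(cleaned)
print(hyphenated)
```

Because the second pattern deletes characters instead of replacing them with a space, hyphenated terms such as `23-year-old` are fused into a single token (`23yearold`).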

In [20]:
clinicaldf['transcription'].iloc[1]

'1.  The left ventricular cavity size and wall thickness appear normal.  The wall motion and left ventricular systolic function appears hyperdynamic with estimated ejection fraction of 70% to 75%.  There is near-cavity obliteration seen.  There also appears to be increased left ventricular outflow tract gradient at the mid cavity level consistent with hyperdynamic left ventricular systolic function.  There is abnormal left ventricular relaxation pattern seen as well as elevated left atrial pressures seen by Doppler examination.,2.  The left atrium appears mildly dilated.,3.  The right atrium and right ventricle appear normal.,4.  The aortic root appears normal.,5.  The aortic valve appears calcified with mild aortic valve stenosis, calculated aortic valve area is 1.3 cm square with a maximum instantaneous gradient of 34 and a mean gradient of 19 mm.,6.  There is mitral annular calcification extending to leaflets and supportive structures with thickening of mitral valve leaflets with mi

In [21]:
#cleaning the textual data of transcription column
clinicaldf['transcription']= clinicaldf['transcription'].apply(clean_text)
clinicaldf['transcription']= clinicaldf['transcription'].apply(lemmatize_text)


In [22]:
clinicaldf['transcription']= clinicaldf['transcription'].apply(process_Text)

In [23]:
clinicaldf['transcription'].iloc[1]

'1 left ventricular cavity wall wall left ventricular left ventricular left ventricular left atrium atrium ventricle aortic valve aortic valve aortic valve area leaflet mitral valve mitral regurgitation tricuspid valve tricuspid pulmonary artery pulmonary artery pulmonary valve pulmonary insufficiency 9 pericardial lipomatous'

In [24]:
#these labels are broad note types or overlap with narrower specialties (e.g. Surgery may include gynaecology cases), so they are dropped to avoid ambiguous classes
#labels in this dataset carry a leading space
overlapping_labels = [' Surgery', ' SOAP / Chart / Progress Notes', ' Emergency Room Reports',
                      ' Discharge Summary', ' Office Notes', ' General Medicine', ' Pain Management']
clinicaldf = clinicaldf[~clinicaldf['medical_specialty'].isin(overlapping_labels)]


In [25]:
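#merging closely related specialties into one label
clinicaldf.loc[clinicaldf.medical_specialty == ' Neurosurgery', "medical_specialty"] = ' Neurology'
#Nephrology (63 rows) was already removed by the count filter in cell 15, so this line has no effect here
clinicaldf.loc[clinicaldf.medical_specialty == ' Nephrology', "medical_specialty"] = " Urology"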

In [26]:
clinicaldf.medical_specialty.value_counts()

 Orthopedic                    303
 Cardiovascular / Pulmonary    280
 Radiology                     251
 Neurology                     249
 Consult - History and Phy.    234
 Gastroenterology              195
 Urology                       140
 Obstetrics / Gynecology       130
 ENT - Otolaryngology           84
Name: medical_specialty, dtype: int64

In [27]:

clinicaldf.shape

(1866, 2)

In [28]:
type(clinicaldf['medical_specialty'].iloc[0])

str

In [29]:
#applying tfidf to the 'transcription' column
tfv = TfidfVectorizer(min_df=5, max_features=1000, use_idf=True,
                      strip_accents='unicode',analyzer='word',smooth_idf=True,
                      ngram_range=(1,3), sublinear_tf=True,
                      stop_words='english' )
tfidf_mat= tfv.fit_transform(clinicaldf['transcription'])
feature_names= sorted(tfv.get_feature_names_out())
#skipping the first 35 sorted names for display only; the fitted vectorizer is unchanged
del feature_names[0:35]
print(feature_names)

['alcohol', 'alcohol patient', 'allograft', 'amniotic', 'amniotic fluid', 'ampulla', 'anesthesia', 'anesthesia patient', 'aneurysm', 'annular', 'antebrachial', 'anterior', 'anterior abdominal', 'anterior abdominal wall', 'anterior cervical', 'anterior cruciate', 'anterior descending', 'anterior descending artery', 'anterior posterior', 'anteriorly', 'anterolateral', 'antrum', 'aorta', 'aortic', 'aortic arch', 'aortic valve', 'apex', 'apical', 'appendix', 'arch', 'area', 'arm', 'arterial', 'arteriosus', 'artery', 'artery anterior', 'artery anterior descending', 'artery artery', 'artery carotid', 'artery carotid artery', 'artery circumflex', 'artery coronary', 'artery coronary artery', 'artery femoral', 'artery left', 'artery patient', 'artery posterior', 'artery pulmonary', 'artery renal', 'artery vessel', 'articular', 'articular surface', 'aspirin', 'atrial', 'atrium', 'axillary', 'b12', 'bacitracin', 'barium', 'barium patient', 'base', 'bed', 'benign', 'betadine', 'biceps', 'biceps te
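With `sublinear_tf=True` and `smooth_idf=True`, scikit-learn documents the weighting as: term frequency replaced by `1 + ln(tf)`, inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and each row scaled to unit L2 norm. The sketch below reproduces that arithmetic in plain Python on a toy corpus (assumed data); it ignores `min_df`, `max_features`, n-grams and stop words, so it is an illustration of the formula and not a replacement for the vectorizer.

```python
import math
from collections import Counter

# Toy corpus (assumed for illustration)
docs = [
    "aortic valve aortic stenosis",
    "mitral valve regurgitation",
    "knee joint pain",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf_row(tokens):
    """Sublinear tf * smoothed idf, followed by L2 normalisation."""
    counts = Counter(tokens)
    weights = {}
    for term, tf in counts.items():
        idf = math.log((1 + n_docs) / (1 + df[term])) + 1
        weights[term] = (1 + math.log(tf)) * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

row = tfidf_row(tokenized[0])
print(row)
```

In the first document, `stenosis` outweighs `valve` because it appears in fewer documents, and `aortic` outweighs `stenosis` because it occurs twice, although the logarithm dampens that advantage.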

In [30]:
tfidf_mat.shape

(1866, 1000)

In [31]:
from sklearn.model_selection import train_test_split
#splitting data to train and test the model, here data of transcription predicts medical specialty
X_train, X_test, y_train, y_test = train_test_split(tfidf_mat, clinicaldf.medical_specialty, random_state=1, stratify= clinicaldf.medical_specialty )

In [32]:
print('Train_Set_Size:'+str(X_train.shape))
print('Test_Set_Size:'+str(X_test.shape))

Train_Set_Size:(1399, 1000)
Test_Set_Size:(467, 1000)
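`train_test_split` is called without `test_size`, so scikit-learn's default of 0.25 applies, which matches the 467 test rows out of 1866. The `stratify` argument keeps each specialty's share roughly equal in both parts. The sketch below shows the idea in plain Python with a simplified per-class split; the function name `stratified_split` and the toy labels are assumptions, and this is not scikit-learn's exact algorithm.

```python
import math
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.25, seed=1):
    """Split indices so each class contributes about test_fraction to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = math.ceil(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (assumed data)
labels = ["Orthopedic"] * 12 + ["Urology"] * 8 + ["ENT"] * 4
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), dict(test_counts))
```

Without stratification, a small class such as ENT - Otolaryngology could end up under-represented in the test set purely by chance.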


In [33]:
#from sklearn.model_selection import train_test_split
#from sklearn.decomposition import PCA
#pca= PCA(n_components=0.95)
#tfidfmat_reduced = pca.fit_transform(tfidf_mat.toarray())
#labels= clinicaldf['medical_specialty'].tolist()
#category_list = clinicaldf.medical_specialty.unique()
#X_train, X_test, y_train, y_test = train_test_split(tfidfmat_reduced, labels, random_state=1, stratify= labels)

In [34]:
#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier

#Create a random forest classifier with 100 trees
clf=RandomForestClassifier(n_estimators=100)

#Train the model using the training set
clf.fit(X_train,y_train)

#Predict specialties for the test set
y_pred=clf.predict(X_test)


In [35]:
#clf = LogisticRegression(penalty= 'elasticnet', solver= 'saga', l1_ratio=0.5, random_state=1)
#clf.fit(X_train, y_train)
#y_pred= clf.predict(X_test)

In [36]:
from sklearn import metrics
# Model Accuracy
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.5438972162740899
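A single accuracy figure can hide weak classes when class sizes range from 84 to 303 rows. Per-class precision and recall give a fuller picture. The sketch below computes them in plain Python on toy labels (assumed data); in the notebook the inputs would be `y_test` and `y_pred`, and scikit-learn's `metrics.classification_report` reports the same quantities.

```python
from collections import Counter

def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} from two equal-length label lists."""
    tp, pred_count, true_count = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        true_count[t] += 1
        pred_count[p] += 1
        if t == p:
            tp[t] += 1

    report = {}
    for label in sorted(set(true_count) | set(pred_count)):
        precision = tp[label] / pred_count[label] if pred_count[label] else 0.0
        recall = tp[label] / true_count[label] if true_count[label] else 0.0
        report[label] = (precision, recall)
    return report

# Toy labels (assumed data)
y_true_demo = ["Neurology", "Neurology", "Urology", "Urology", "Radiology"]
y_pred_demo = ["Neurology", "Radiology", "Urology", "Urology", "Radiology"]
report = per_class_precision_recall(y_true_demo, y_pred_demo)
for label, (precision, recall) in report.items():
    print(label, precision, recall)
```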


In [37]:
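#oversampling with SMOTE; sampling_strategy='minority' resamples only the smallest class (ENT - Otolaryngology)
#note: resampling happens before the split, so synthetic rows can be derived from rows that land in the test set, which may inflate the accuracy below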
smote_over_sample = SMOTE(sampling_strategy='minority')
labels = clinicaldf['medical_specialty'].tolist()
X, y = smote_over_sample.fit_resample(tfidf_mat, labels)
X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, stratify=y,random_state=1)   
print('Train_Set_Size:'+str(X_train1.shape))
print('Test_Set_Size:'+str(X_test1.shape))

Train_Set_Size:(1563, 1000)
Test_Set_Size:(522, 1000)
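SMOTE builds each synthetic row by interpolating between a minority sample and one of its nearest minority neighbours. Since cell 37 resamples before splitting, synthetic rows can be built from rows that later land in the test set. One way to avoid this is to split first and resample the training portion only. The sketch below illustrates the interpolation step on training rows in plain Python; `smote_like` is a hypothetical, simplified stand-in (single nearest neighbour, dense lists, toy data) and not imblearn's implementation.

```python
import random

def nearest_neighbor(point, others):
    """Closest point by squared Euclidean distance."""
    return min(others, key=lambda o: sum((a - b) ** 2 for a, b in zip(point, o)))

def smote_like(samples, n_new, seed=1):
    """Create n_new points, each on the segment between a sample and its nearest neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        neighbor = nearest_neighbor(base, [s for s in samples if s is not base])
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Split first (toy minority-class rows, assumed data), then resample the training part only
minority_train = [[0.0, 1.0], [0.2, 0.8], [1.0, 0.0]]
minority_test = [[0.9, 0.1]]  # never passed to smote_like
new_points = smote_like(minority_train, n_new=4)
print(new_points)
```

With this ordering the test rows stay untouched, so the reported accuracy reflects only real, unseen transcriptions.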


In [38]:
#Create a random forest classifier with 100 trees
clf1=RandomForestClassifier(n_estimators=100)

#Train the model using the oversampled training set
clf1.fit(X_train1,y_train1)

#Predict specialties for the oversampled test set
y_pred1=clf1.predict(X_test1)


In [39]:
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Accuracy of model:",metrics.accuracy_score(y_test1, y_pred1))

Accuracy of model: 0.5498084291187739


In [40]:
X_tests_input= """Angelina is 19 years old girl 
from the past 7 days she is suffering from abdominal pain cramps,
have mood swings, lack of appetite and has frequent burps. 
her vitals are 92bpm with a temperature of 98.6 F"""
X_test_tf =  tfv.transform([X_tests_input])
y_output= clf1.predict(X_test_tf)


In [41]:
print(y_output)

[' Gastroenterology']
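The free-text input in cell 40 goes straight to `tfv.transform`, whereas the training text first passed through `clean_text`, `lemmatize_text` and `process_Text`. Applying the same steps at prediction time keeps the two consistent. The sketch below shows one way to wrap this; `predict_specialty` is a hypothetical helper, and the stub vectorizer and model are assumptions that only make the snippet runnable on its own.

```python
def predict_specialty(text, steps, vectorizer, model):
    """Run text through the preprocessing steps, vectorize it, and return one label."""
    for step in steps:
        text = step(text)
    features = vectorizer.transform([text])
    return model.predict(features)[0]

# Stand-ins for the fitted TfidfVectorizer and classifier (assumptions for this demo)
class StubVectorizer:
    def transform(self, texts):
        return [t.split() for t in texts]

class StubModel:
    def predict(self, rows):
        return [" Gastroenterology" if "abdominal" in row else " Orthopedic" for row in rows]

demo_steps = [str.lower, lambda t: t.replace(",", " ")]
with_steps = predict_specialty("Abdominal PAIN, cramps", demo_steps, StubVectorizer(), StubModel())
without_steps = predict_specialty("Abdominal PAIN, cramps", [], StubVectorizer(), StubModel())
print(with_steps, without_steps)
```

In the notebook the call would look like `predict_specialty(X_tests_input, [clean_text, lemmatize_text, process_Text], tfv, clf1)`. The stub example shows how skipping the steps can change the prediction, because the token no longer matches what the model saw in training.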


In [42]:
X_tests_input= """Hari - 49 yearls old male. 
Vitals : 110 bpm, Temp : 97F, Weight : 86 kg. 
Symptoms: Suffering from chest pain, heart musce ache near 
central artery region. Since past 3 days.
 Prior Diognosis: Probable heart disease. 
Previous Medicines: Angiotensin-converting enzyme 
(ACE) inhibitors."""
X_test_tf =  tfv.transform([X_tests_input])
#this example uses clf, the model trained before SMOTE oversampling
y_output= clf.predict(X_test_tf)

In [43]:
print(y_output)

[' Cardiovascular / Pulmonary']


In [44]:
X_tests_input= """he is 78 years old with high blood pressure, 
difficulty in breathing, frequent chest pain. 
Past he experienced mild heart attack."""
X_test_tf =  tfv.transform([X_tests_input])
y_output1= clf1.predict(X_test_tf)

In [45]:
print(y_output1)

[' Cardiovascular / Pulmonary']


In [46]:
X_tests_input= "joint pain, cant climb stairs well, short of breath"
X_test_tf =  tfv.transform([X_tests_input])
y_output1= clf1.predict(X_test_tf)

In [47]:
print(y_output1)

[' Orthopedic']
