# Transcription Prediction

## Description

## Libraries and data import

In [35]:
import pandas as pd
import re

In [28]:
data = pd.read_csv('mtsamples.csv')
print(data.shape)
data.head(2)

(4999, 6)


Unnamed: 0.1,Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
0,0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller..."
1,1,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh..."


In [30]:
data = data.drop('Unnamed: 0', axis = 1)

In [31]:
data.head(3)

Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller..."
1,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh..."
2,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...","bariatrics, laparoscopic gastric bypass, heart..."


$Description$ field represents the short description of full transcription

In [14]:
for i in range(10):
    print(data.iloc[i,0])

 A 23-year-old white female presents with complaint of allergies.
 Consult for laparoscopic gastric bypass.
 Consult for laparoscopic gastric bypass.
 2-D M-Mode. Doppler.  
 2-D Echocardiogram
 Morbid obesity.  Laparoscopic antecolic antegastric Roux-en-Y gastric bypass with EEA anastomosis.  This is a 30-year-old female, who has been overweight for many years.  She has tried many different diets, but is unsuccessful. 
 Liposuction of the supraumbilical abdomen, revision of right breast reconstruction, excision of soft tissue fullness of the lateral abdomen and flank.

 2-D Echocardiogram
 Suction-assisted lipectomy - lipodystrophy of the abdomen and thighs.
 Echocardiogram and Doppler


$Transcription$ represents the full text of description - the field we need

In [15]:
data.iloc[0,3]

'SUBJECTIVE:,  This 23-year-old white female presents with complaint of allergies.  She used to have allergies when she lived in Seattle but she thinks they are worse here.  In the past, she has tried Claritin, and Zyrtec.  Both worked for short time but then seemed to lose effectiveness.  She has used Allegra also.  She used that last summer and she began using it again two weeks ago.  It does not appear to be working very well.  She has used over-the-counter sprays but no prescription nasal sprays.  She does have asthma but doest not require daily medication for this and does not think it is flaring up.,MEDICATIONS: , Her only medication currently is Ortho Tri-Cyclen and the Allegra.,ALLERGIES: , She has no known medicine allergies.,OBJECTIVE:,Vitals:  Weight was 130 pounds and blood pressure 124/78.,HEENT:  Her throat was mildly erythematous without exudate.  Nasal mucosa was erythematous and swollen.  Only clear drainage was seen.  TMs were clear.,Neck:  Supple without adenopathy.,

$Medical speciality$ describes the kind of medical treatment needed for the case. Let's look at the values

In [17]:
data['medical_specialty'].value_counts()

 Surgery                          1103
 Consult - History and Phy.        516
 Cardiovascular / Pulmonary        372
 Orthopedic                        355
 Radiology                         273
 General Medicine                  259
 Gastroenterology                  230
 Neurology                         223
 SOAP / Chart / Progress Notes     166
 Obstetrics / Gynecology           160
 Urology                           158
 Discharge Summary                 108
 ENT - Otolaryngology               98
 Neurosurgery                       94
 Hematology - Oncology              90
 Ophthalmology                      83
 Nephrology                         81
 Emergency Room Reports             75
 Pediatrics - Neonatal              70
 Pain Management                    62
 Psychiatry / Psychology            53
 Office Notes                       51
 Podiatry                           47
 Dermatology                        29
 Dentistry                          27
 Cosmetic / Plastic Surge

We will use $transcription$ field as raw data, and $medical speciality$ as target to predict. Other fields we would not need in this task

In [32]:
data = data[['transcription', 'medical_specialty']]

In [33]:
data.head(2)

Unnamed: 0,transcription,medical_specialty
0,"SUBJECTIVE:, This 23-year-old white female pr...",Allergy / Immunology
1,"PAST MEDICAL HISTORY:, He has difficulty climb...",Bariatrics


## Test nltk

In [19]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Demi\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [36]:
REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def text_prepare(text):
    """
        text: a string
        
        return: modified initial string
        
    """
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub('', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    words = text.split()
    i = 0
    while i < len(words):
        if words[i] in STOPWORDS:
            words.pop(i)
        else:
            i += 1
    text = ' '.join(map(str, words))# delete stopwords from text
    
    return text

In [49]:
data.iloc[3, 0]

'2-D M-MODE: , ,1.  Left atrial enlargement with left atrial diameter of 4.7 cm.,2.  Normal size right and left ventricle.,3.  Normal LV systolic function with left ventricular ejection fraction of 51%.,4.  Normal LV diastolic function.,5.  No pericardial effusion.,6.  Normal morphology of aortic valve, mitral valve, tricuspid valve, and pulmonary valve.,7.  PA systolic pressure is 36 mmHg.,DOPPLER: , ,1.  Mild mitral and tricuspid regurgitation.,2.  Trace aortic and pulmonary regurgitation.'

In [40]:
text_prepare(data.iloc[0, 0])

'subjective 23yearold white female presents complaint allergies used allergies lived seattle thinks worse past tried claritin zyrtec worked short time seemed lose effectiveness used allegra also used last summer began using two weeks ago appear working well used overthecounter sprays prescription nasal sprays asthma doest require daily medication think flaring upmedications medication currently ortho tricyclen allegraallergies known medicine allergiesobjectivevitals weight 130 pounds blood pressure 12478heent throat mildly erythematous without exudate nasal mucosa erythematous swollen clear drainage seen tms clearneck supple without adenopathylungs clearassessment allergic rhinitisplan1 try zyrtec instead allegra another option use loratadine think prescription coverage might cheaper2 samples nasonex two sprays nostril given three weeks prescription written well'

In [38]:
print(test_text_prepare())

Basic tests are passed.


In [139]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [65]:
data = data.drop(data[data['transcription'].isna()].index)

In [140]:
corpus = [x for x in data['transcription'].values]
#corpus = [text_prepare(x) for x in data['transcription'].values]
#df = CountVectorizer(ngram_range = (1, 2)).fit_transform(corpus)
df = TfidfVectorizer(ngram_range = (1, 1), stop_words='english').fit_transform(corpus)

df.shape

y = data['medical_specialty'].values
x_train, x_test, y_train, y_test = train_test_split(df, y, stratify = y, test_size = 0.2, random_state = 42)

clf = MultinomialNB()
clf.fit(x_train, y_train)

print(f1_score(clf.predict(x_test), y_test, average = 'weighted'))
print(precision_score(clf.predict(x_test), y_test, average = 'weighted'))
print(recall_score(clf.predict(x_test), y_test, average = 'weighted'))
print(accuracy_score(clf.predict(x_test), y_test))

0.4756920948607732
0.9208667030302551
0.335010060362173
0.335010060362173


  'recall', 'true', average, warn_for)
  'recall', 'true', average, warn_for)
