# POS tagging and Document Classification

Author: Víctor J. Maciá Medina

Use the data set of 5000 medical cases "medicaltranscriptions.csv". Build a document classification
model.

1. Perform POS tagging on each medical transcript and create a dictionary of nouns only for all documents (i.e. NN, NNP, NNS and NNPS).
2. Create a document vector representation using word count.
3. Convert the "description" of each case to a vector (using the pretrained word to vector dictionary "PubMed-and-PMC-w2v.bin"). You may use the average of word vectors as the document vector representation.
4. Build a multi-class classification SVM model to classify a document into one of the "medical_specialty" categories. Use the document vector as your predictors and use "medical_specialty" as your target variable.
5. What are the in-sample recall rates for each document type?
6. Which "medical_specialty" has the highest recall rate?
7. Validate your SVM model by training it on 3000 cases and applying it to the remaining 1999 cases. Create the confusion matrix and calculate the recall rates for each document type.
8. How do the test recall rates compare to the in-sample recall rates?

1. Perform POS tagging on each medical transcript and create a dictionary of nouns only for all documents (i.e. NN, NNP, NNS and NNPS).

In [1]:
# Load the dataset as raw text, one list entry per line of the file

import nltk
with open("medicaltranscriptions.csv") as f:
    lines = f.read().splitlines() 



In [2]:
# Tokenize each line, lowercase it, and drop stop words and non-alphabetic tokens

import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize, wordpunct_tokenize
from nltk.corpus import stopwords 
import nltk 

stop_words = set(stopwords.words('english'))  # build the stop-word set once, outside the loop

word_tokens = len(lines)*[0]  # index 0 (the CSV header line) is left as a placeholder

for i in range(1,len(lines)):
    word_tokens[i] = [w.lower() for w in wordpunct_tokenize(lines[i])]
    word_tokens[i] = [w for w in word_tokens[i] if w.isalpha() and w not in stop_words]
    
    
    

In [3]:
word_tokens[1]



['year',
 'old',
 'white',
 'female',
 'presents',
 'complaint',
 'allergies',
 'allergy',
 'immunology',
 'allergic',
 'rhinitis',
 'subjective',
 'year',
 'old',
 'white',
 'female',
 'presents',
 'complaint',
 'allergies',
 'used',
 'allergies',
 'lived',
 'seattle',
 'thinks',
 'worse',
 'past',
 'tried',
 'claritin',
 'zyrtec',
 'worked',
 'short',
 'time',
 'seemed',
 'lose',
 'effectiveness',
 'used',
 'allegra',
 'also',
 'used',
 'last',
 'summer',
 'began',
 'using',
 'two',
 'weeks',
 'ago',
 'appear',
 'working',
 'well',
 'used',
 'counter',
 'sprays',
 'prescription',
 'nasal',
 'sprays',
 'asthma',
 'doest',
 'require',
 'daily',
 'medication',
 'think',
 'flaring',
 'medications',
 'medication',
 'currently',
 'ortho',
 'tri',
 'cyclen',
 'allegra',
 'allergies',
 'known',
 'medicine',
 'allergies',
 'objective',
 'vitals',
 'weight',
 'pounds',
 'blood',
 'pressure',
 'heent',
 'throat',
 'mildly',
 'erythematous',
 'without',
 'exudate',
 'nasal',
 'mucosa',
 'erythem

In [4]:
nltk.download('averaged_perceptron_tagger')



[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/victormaciamedina/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [5]:
medical_pos = len(lines)*[0]

for i in range(1,len(lines)):
    medical_pos[i]=nltk.pos_tag(word_tokens[i])
    
    
    

In [6]:
medical_pos[1] # I did the word classification of each medical transcript independently


[('year', 'NN'),
 ('old', 'JJ'),
 ('white', 'JJ'),
 ('female', 'NN'),
 ('presents', 'NNS'),
 ('complaint', 'NN'),
 ('allergies', 'NNS'),
 ('allergy', 'VBP'),
 ('immunology', 'NN'),
 ('allergic', 'JJ'),
 ('rhinitis', 'NN'),
 ('subjective', 'JJ'),
 ('year', 'NN'),
 ('old', 'JJ'),
 ('white', 'JJ'),
 ('female', 'NN'),
 ('presents', 'NNS'),
 ('complaint', 'NN'),
 ('allergies', 'NNS'),
 ('used', 'VBN'),
 ('allergies', 'NNS'),
 ('lived', 'VBD'),
 ('seattle', 'JJ'),
 ('thinks', 'NNS'),
 ('worse', 'JJR'),
 ('past', 'JJ'),
 ('tried', 'VBD'),
 ('claritin', 'NN'),
 ('zyrtec', 'NN'),
 ('worked', 'VBD'),
 ('short', 'JJ'),
 ('time', 'NN'),
 ('seemed', 'VBD'),
 ('lose', 'JJ'),
 ('effectiveness', 'NN'),
 ('used', 'VBN'),
 ('allegra', 'NN'),
 ('also', 'RB'),
 ('used', 'VBD'),
 ('last', 'JJ'),
 ('summer', 'NN'),
 ('began', 'VBD'),
 ('using', 'VBG'),
 ('two', 'CD'),
 ('weeks', 'NNS'),
 ('ago', 'RB'),
 ('appear', 'VBP'),
 ('working', 'VBG'),
 ('well', 'RB'),
 ('used', 'VBN'),
 ('counter', 'NN'),
 ('sprays'

In [7]:
medical_pos[1][2][1]

'JJ'

In [8]:
# Keep only the tokens tagged as nouns (NN, NNP, NNS, NNPS)

noun_tags = {'NN', 'NNP', 'NNS', 'NNPS'}
nouns = []

for i in range(1,len(medical_pos)):
    for word, tag in medical_pos[i]:
        if tag in noun_tags:
            nouns.append(word)

In [9]:
nouns

['year',
 'female',
 'presents',
 'complaint',
 'allergies',
 'immunology',
 'rhinitis',
 'year',
 'female',
 'presents',
 'complaint',
 'allergies',
 'allergies',
 'thinks',
 'claritin',
 'zyrtec',
 'time',
 'effectiveness',
 'allegra',
 'summer',
 'weeks',
 'counter',
 'sprays',
 'prescription',
 'nasal',
 'sprays',
 'medication',
 'medications',
 'medication',
 'cyclen',
 'allergies',
 'allergies',
 'vitals',
 'pounds',
 'blood',
 'pressure',
 'heent',
 'throat',
 'nasal',
 'swollen',
 'drainage',
 'neck',
 'supple',
 'lungs',
 'allergic',
 'rhinitis',
 'plan',
 'option',
 'use',
 'loratadine',
 'prescription',
 'coverage',
 'samples',
 'sprays',
 'weeks',
 'prescription',
 'immunology',
 'rhinitis',
 'allergies',
 'sprays',
 'allegra',
 'sprays',
 'consult',
 'laparoscopic',
 'bypass',
 'bariatrics',
 'bypass',
 'consult',
 'history',
 'difficulty',
 'stairs',
 'difficulty',
 'airline',
 'seats',
 'shoes',
 'objects',
 'floor',
 'times',
 'week',
 'home',
 'cardio',
 'difficulty',


2. Create a document vector representation using word count.
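
The cells under this heading load the data and the pretrained embeddings, and a count-based representation is not shown explicitly. As a rough sketch of what a word-count document vector could look like, the snippet below counts how often each vocabulary word occurs in each document. The helper `count_vectors` and the toy documents are hypothetical; in practice the vocabulary would be the noun dictionary from step 1.

```python
from collections import Counter

def count_vectors(documents, vocabulary):
    """Return one count vector per document, with one entry per vocabulary word."""
    vectors = []
    for doc in documents:
        counts = Counter(doc)  # word -> number of occurrences in this document
        vectors.append([counts[word] for word in vocabulary])
    return vectors

# Toy tokenized documents (hypothetical data)
toy_docs = [
    ["year", "allergies", "allergies"],
    ["consult", "bypass", "year"],
]
toy_vocabulary = sorted({w for doc in toy_docs for w in doc})
toy_counts = count_vectors(toy_docs, toy_vocabulary)

print(toy_vocabulary)
print(toy_counts)
```

Each row has the same length as the vocabulary, so the rows can be stacked into a document-term matrix.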

In [18]:
from gensim.models import Word2Vec
import pandas as pd
from gensim.models import KeyedVectors
import numpy as np

mt = pd.read_csv("medicaltranscriptions.csv")

In [20]:
mt

Unnamed: 0,Number,description,medical_specialty,sample_name,transcription,keywords,Unnamed: 6
0,0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller...",
1,1,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh...",
2,2,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...","bariatrics, laparoscopic gastric bypass, heart...",
3,3,2-D M-Mode. Doppler.,Cardiovascular / Pulmonary,2-D Echocardiogram - 1,"2-D M-MODE: , ,1. Left atrial enlargement wit...","cardiovascular / pulmonary, 2-d m-mode, dopple...",
4,4,2-D Echocardiogram,Cardiovascular / Pulmonary,2-D Echocardiogram - 2,1. The left ventricular cavity size and wall ...,"cardiovascular / pulmonary, 2-d, doppler, echo...",
...,...,...,...,...,...,...,...
4994,4994,Patient having severe sinusitis about two to ...,Allergy / Immunology,Chronic Sinusitis,"HISTORY:, I had the pleasure of meeting and e...",,
4995,4995,This is a 14-month-old baby boy Caucasian who...,Allergy / Immunology,Kawasaki Disease - Discharge Summary,"ADMITTING DIAGNOSIS: , Kawasaki disease.,DISCH...","allergy / immunology, mucous membranes, conjun...",
4996,4996,A female for a complete physical and follow u...,Allergy / Immunology,Followup on Asthma,"SUBJECTIVE: , This is a 42-year-old white fema...",,
4997,4997,Mother states he has been wheezing and coughing.,Allergy / Immunology,Asthma in a 5-year-old,"CHIEF COMPLAINT: , This 5-year-old male presen...",,


In [21]:
# Collect the nouns of every description into one flat list
noun_tags = {'NN', 'NNP', 'NNS', 'NNPS'}
mt_token_nouns=[]
for description in mt[" description"]:
    mt_token_pos=nltk.pos_tag(nltk.word_tokenize(description))
    mt_token_nouns.extend(word for word, pos in mt_token_pos if pos in noun_tags)
    

In [25]:
mt_token_nouns_lower=[i.lower() for i in mt_token_nouns]
model = KeyedVectors.load_word2vec_format("./pubmed2018_w2v_200D/pubmed2018_w2v_200D.bin", binary=True)


In [28]:
model.word_vec("allergy")

# Here we have the model. With this model we obtain a vector representation of each term.



array([ 0.19854781, -0.21359417,  0.49111918,  0.37198612,  0.02788076,
       -0.6640716 , -0.4250803 ,  0.28902203,  0.57635623, -0.22187796,
       -0.25903288, -0.7195324 ,  0.03803334,  0.3125623 ,  0.298701  ,
       -0.00505887, -0.35185388, -0.2890893 ,  0.20734306,  0.42626593,
       -0.28640017,  0.32212105,  0.02750463, -0.3123023 , -0.29028386,
       -0.03653466,  0.12400573,  0.09789123,  0.13039659, -0.4619009 ,
        0.27209795, -0.17449038,  0.23804636, -0.4312894 , -0.05130282,
       -0.32734397,  0.10837612, -0.3050455 ,  0.42598042, -0.04544117,
       -0.05461522,  0.09972726, -0.17839207,  0.091431  , -0.52221227,
        0.16877934,  0.23499149,  0.18054771,  0.16160291, -0.18967932,
       -0.22417451, -0.05054491,  0.307758  , -0.5508191 ,  0.19027798,
       -0.26564902, -0.33502126, -0.43869987,  0.11796953,  0.67479265,
       -0.10395984,  0.1835647 , -0.4174233 , -0.16568227, -0.1531111 ,
       -0.27356932,  0.4757376 ,  0.4521014 ,  0.11355381,  0.32

3. Convert the "description" of each case to a vector (using the pretrained word to vector dictionary "PubMed-and-PMC-w2v.bin"). You may use the average of word vectors as the document vector representation.



In [36]:
# The 'description' column (note the leading space in the column name)

mt[' description']

0        A 23-year-old white female presents with comp...
1                Consult for laparoscopic gastric bypass.
2                Consult for laparoscopic gastric bypass.
3                                  2-D M-Mode. Doppler.  
4                                      2-D Echocardiogram
                              ...                        
4994     Patient having severe sinusitis about two to ...
4995     This is a 14-month-old baby boy Caucasian who...
4996     A female for a complete physical and follow u...
4997     Mother states he has been wheezing and coughing.
4998     Acute allergic reaction, etiology uncertain, ...
Name:  description, Length: 4999, dtype: object

In [40]:
stop_words = set(stopwords.words('english'))  # build the stop-word set once, outside the loop

tokens = len(mt[' description'])*[0]

for i in range(0,len(mt[' description'])):
    # Lowercase, then keep alphabetic tokens that are not stop words
    tokens[i] = [w.lower() for w in wordpunct_tokenize(mt[' description'][i])]
    tokens[i] = [w for w in tokens[i] if w.isalpha() and w not in stop_words]

In [42]:
tokens

[['year', 'old', 'white', 'female', 'presents', 'complaint', 'allergies'],
 ['consult', 'laparoscopic', 'gastric', 'bypass'],
 ['consult', 'laparoscopic', 'gastric', 'bypass'],
 ['mode', 'doppler'],
 ['echocardiogram'],
 ['morbid',
  'obesity',
  'laparoscopic',
  'antecolic',
  'antegastric',
  'roux',
  'en',
  'gastric',
  'bypass',
  'eea',
  'anastomosis',
  'year',
  'old',
  'female',
  'overweight',
  'many',
  'years',
  'tried',
  'many',
  'different',
  'diets',
  'unsuccessful'],
 ['liposuction',
  'supraumbilical',
  'abdomen',
  'revision',
  'right',
  'breast',
  'reconstruction',
  'excision',
  'soft',
  'tissue',
  'fullness',
  'lateral',
  'abdomen',
  'flank'],
 ['echocardiogram'],
 ['suction', 'assisted', 'lipectomy', 'lipodystrophy', 'abdomen', 'thighs'],
 ['echocardiogram', 'doppler'],
 ['morbid',
  'obesity',
  'laparoscopic',
  'roux',
  'en',
  'gastric',
  'bypass',
  'antecolic',
  'antegastric',
  'mm',
  'eea',
  'anastamosis',
  'esophagogastroduodenos

In [None]:
# Keep only the tokens that are in the word2vec vocabulary
# (looking up a missing word with model.word_vec raises a KeyError)

for i in range(len(tokens)):
    tokens[i] = [j for j in tokens[i] if j in model]

In [58]:
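# Build one 200-dimensional vector per description: the mean of the word2vec
# vectors of its nouns, or a zero vector if none of them is in the vocabulary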

description_vec=np.empty((0, 200), float)

for i in range(len(mt[" description"])):
    word2vec_ls=np.empty((0, 200), float)
    mt_token_pos=nltk.pos_tag(nltk.word_tokenize(mt[" description"][i]))
    for word, pos in mt_token_pos:
        if (pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS'):
            try:
                word2vec_ls=np.vstack([word2vec_ls, model.word_vec(word)])
            except KeyError:  # word is not in the word2vec vocabulary
                pass
    if len(word2vec_ls)==0:
        desc_vec=np.zeros((1, 200))
    else:
        desc_vec = np.mean(word2vec_ls, axis=0).reshape(1, 200)
    description_vec=np.vstack([description_vec, desc_vec])
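
To make the averaging step concrete, here is a small self-contained sketch that uses a plain dictionary in place of the pretrained model. The name `average_vector` and the two-dimensional toy embeddings are assumptions for illustration only; the notebook uses 200-dimensional vectors from the pretrained file.

```python
def average_vector(words, embeddings, dim):
    """Average the vectors of the words found in `embeddings`.

    Words missing from the embeddings are skipped; if none is found,
    a zero vector of length `dim` is returned.
    """
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) groups the i-th component of every vector together
    return [sum(component) / len(vectors) for component in zip(*vectors)]

# Toy 2-dimensional embeddings (hypothetical values)
toy_embeddings = {
    "gastric": [1.0, 0.0],
    "bypass": [0.0, 1.0],
}

doc_vec = average_vector(["gastric", "bypass", "unknownword"], toy_embeddings, 2)
empty_vec = average_vector(["unknownword"], toy_embeddings, 2)
print(doc_vec)
print(empty_vec)
```

One consequence of this design is that every document with no in-vocabulary words gets the same all-zero vector, so the classifier cannot tell such documents apart.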

In [66]:
des = pd.DataFrame(description_vec)
des

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,190,191,192,193,194,195,196,197,198,199
0,0.060857,0.003464,0.233486,-0.013175,0.019383,-0.183592,-0.058973,-0.015472,0.245538,-0.032230,...,0.071063,-0.200934,0.042387,-0.198681,-0.314611,-0.072867,0.023844,-0.100806,0.247444,-0.232486
1,0.417814,0.528424,0.156247,0.314276,-0.028756,-0.003994,-0.286784,-0.238518,-0.251542,-0.158756,...,-0.388704,0.119747,-0.123295,-0.476210,-0.097604,-0.526065,0.843057,0.243901,0.396211,-0.175522
2,0.417814,0.528424,0.156247,0.314276,-0.028756,-0.003994,-0.286784,-0.238518,-0.251542,-0.158756,...,-0.388704,0.119747,-0.123295,-0.476210,-0.097604,-0.526065,0.843057,0.243901,0.396211,-0.175522
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4994,-0.104748,-0.060461,-0.052439,0.084831,-0.274088,-0.157743,-0.047372,0.080325,0.572402,-0.068053,...,0.058013,-0.263365,0.169904,-0.279541,-0.334117,-0.225967,0.158993,-0.102792,0.106748,0.047518
4995,-0.060467,0.012673,0.034065,0.005690,-0.305405,-0.240361,-0.073215,0.128778,0.404935,0.081726,...,-0.093086,-0.309709,-0.195536,-0.262030,-0.377853,-0.129781,0.019788,0.009593,0.171874,0.007599
4996,-0.090798,-0.157026,0.135663,0.357849,-0.090305,-0.012042,-0.088013,-0.153324,0.289086,-0.089122,...,0.090011,-0.462647,0.174449,-0.257873,-0.150761,-0.038270,0.123594,-0.161759,0.263964,0.062692
4997,-0.380668,0.425155,0.201445,-0.038759,-0.270541,0.086544,0.083630,-0.508574,0.049840,-0.290579,...,0.496502,-0.402577,-0.475059,-0.208941,0.263061,-0.545444,-0.014066,-0.176127,0.609830,0.329959


In [70]:
des.mean()

0     -0.017764
1      0.153647
2      0.039458
3      0.068274
4     -0.105049
         ...   
195   -0.191999
196    0.168874
197    0.030339
198    0.156830
199    0.060823
Length: 200, dtype: float64

4. Build a multi-class classification SVM model to classify a document into one of the "medical_specialty" categories. Use the document vector as your predictors and use "medical_specialty" as your target variable.

In [71]:
from sklearn import preprocessing
from sklearn import svm
from sklearn.metrics import confusion_matrix
Doc2VecRep=pd.DataFrame(description_vec)

In [72]:
(np.unique(mt['medical_specialty']))

array([' Allergy / Immunology', ' Autopsy', ' Bariatrics',
       ' Cardiovascular / Pulmonary', ' Chiropractic',
       ' Consult - History and Phy.', ' Cosmetic / Plastic Surgery',
       ' Dentistry', ' Dermatology', ' Diets and Nutritions',
       ' Discharge Summary', ' ENT - Otolaryngology',
       ' Emergency Room Reports', ' Endocrinology', ' Gastroenterology',
       ' General Medicine', ' Hematology - Oncology',
       ' Hospice - Palliative Care', ' IME-QME-Work Comp etc.',
       ' Lab Medicine - Pathology', ' Letters', ' Nephrology',
       ' Neurology', ' Neurosurgery', ' Obstetrics / Gynecology',
       ' Office Notes', ' Ophthalmology', ' Orthopedic',
       ' Pain Management', ' Pediatrics - Neonatal',
       ' Physical Medicine - Rehab', ' Podiatry',
       ' Psychiatry / Psychology', ' Radiology', ' Rheumatology',
       ' SOAP / Chart / Progress Notes', ' Sleep Medicine',
       ' Speech - Language', ' Surgery', ' Urology'], dtype=object)

In [73]:
le = preprocessing.LabelEncoder()
le.fit(mt['medical_specialty'])

LabelEncoder()

In [74]:
y=le.transform(mt['medical_specialty'])

In [75]:
np.unique(y)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39])

In [76]:
le.inverse_transform([7])

array([' Dentistry'], dtype=object)

In [77]:
Doc2VecRep["y"] = y

In [78]:
Doc2VecRep

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,191,192,193,194,195,196,197,198,199,y
0,0.060857,0.003464,0.233486,-0.013175,0.019383,-0.183592,-0.058973,-0.015472,0.245538,-0.032230,...,-0.200934,0.042387,-0.198681,-0.314611,-0.072867,0.023844,-0.100806,0.247444,-0.232486,0
1,0.417814,0.528424,0.156247,0.314276,-0.028756,-0.003994,-0.286784,-0.238518,-0.251542,-0.158756,...,0.119747,-0.123295,-0.476210,-0.097604,-0.526065,0.843057,0.243901,0.396211,-0.175522,2
2,0.417814,0.528424,0.156247,0.314276,-0.028756,-0.003994,-0.286784,-0.238518,-0.251542,-0.158756,...,0.119747,-0.123295,-0.476210,-0.097604,-0.526065,0.843057,0.243901,0.396211,-0.175522,2
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,3
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4994,-0.104748,-0.060461,-0.052439,0.084831,-0.274088,-0.157743,-0.047372,0.080325,0.572402,-0.068053,...,-0.263365,0.169904,-0.279541,-0.334117,-0.225967,0.158993,-0.102792,0.106748,0.047518,0
4995,-0.060467,0.012673,0.034065,0.005690,-0.305405,-0.240361,-0.073215,0.128778,0.404935,0.081726,...,-0.309709,-0.195536,-0.262030,-0.377853,-0.129781,0.019788,0.009593,0.171874,0.007599,0
4996,-0.090798,-0.157026,0.135663,0.357849,-0.090305,-0.012042,-0.088013,-0.153324,0.289086,-0.089122,...,-0.462647,0.174449,-0.257873,-0.150761,-0.038270,0.123594,-0.161759,0.263964,0.062692,0
4997,-0.380668,0.425155,0.201445,-0.038759,-0.270541,0.086544,0.083630,-0.508574,0.049840,-0.290579,...,-0.402577,-0.475059,-0.208941,0.263061,-0.545444,-0.014066,-0.176127,0.609830,0.329959,0


In [79]:
clf = svm.SVC(kernel='linear')
clf.fit(Doc2VecRep.iloc[:, : 200], Doc2VecRep["y"])

SVC(kernel='linear')

In [80]:
y_pred=pd.Series(clf.predict(Doc2VecRep.iloc[:, : 200]))

In [81]:
y_pred[0:10]

0     0
1     2
2     2
3    33
4    33
5     2
6    38
7    33
8     6
9    33
dtype: int64

In [82]:
y[0:10]

array([0, 2, 2, 3, 3, 2, 2, 3, 2, 3])

In [83]:
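# Note the argument order: the predictions are passed first, so rows are
# predicted labels and columns are true labels
ConfusionM=confusion_matrix(np.array(y_pred), np.array(Doc2VecRep["y"]))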

In [84]:
pd.DataFrame(ConfusionM)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,30,31,32,33,34,35,36,37,38,39
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,0,8,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,14,0,0,9,0,0,0,0,...,0,0,0,0,0,1,0,0,2,0
3,0,0,0,215,0,44,0,0,0,0,...,0,0,0,49,0,12,0,0,91,1
4,0,0,0,0,4,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
5,3,0,1,31,1,276,0,4,5,9,...,8,4,25,14,6,46,3,2,18,10
6,0,0,1,0,0,1,3,0,0,0,...,0,0,0,0,0,0,0,0,2,0
7,0,0,0,0,0,1,0,9,0,0,...,0,0,0,1,0,0,0,0,7,0
8,0,0,0,0,0,1,0,0,7,0,...,0,0,0,0,0,3,0,0,2,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [85]:
le.inverse_transform([38])

array([' Surgery'], dtype=object)

In [86]:
le.inverse_transform([27])

array([' Orthopedic'], dtype=object)

In [87]:
np.diag(ConfusionM)

array([  1,   8,  14, 215,   4, 276,   3,   9,   7,   0,   8,  35,  18,
         3, 113,  61,  32,   4,   3,   0,   4,  23,  92,   2,  75,   2,
        37, 189,  44,  11,   0,  14,  23,  89,   1,  35,   9,   6, 606,
        69])

In [88]:
np.sum(ConfusionM,  axis=0)

array([   7,    8,   18,  372,   14,  516,   27,   27,   29,   10,  108,
         98,   75,   19,  230,  259,   90,    6,   16,    8,   23,   81,
        223,   94,  160,   51,   83,  355,   62,   70,   21,   47,   53,
        273,   10,  166,   20,    9, 1103,  158])

In [89]:
np.diag(ConfusionM)/np.sum(ConfusionM,  axis=0)

array([0.14285714, 1.        , 0.77777778, 0.57795699, 0.28571429,
       0.53488372, 0.11111111, 0.33333333, 0.24137931, 0.        ,
       0.07407407, 0.35714286, 0.24      , 0.15789474, 0.49130435,
       0.23552124, 0.35555556, 0.66666667, 0.1875    , 0.        ,
       0.17391304, 0.28395062, 0.41255605, 0.0212766 , 0.46875   ,
       0.03921569, 0.44578313, 0.53239437, 0.70967742, 0.15714286,
       0.        , 0.29787234, 0.43396226, 0.32600733, 0.1       ,
       0.21084337, 0.45      , 0.66666667, 0.5494107 , 0.43670886])

5. What are the in-sample recall rates for each document type?



In [90]:
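# Recall per class: correct predictions (diagonal) divided by the number of
# true cases of that class (column sums, since columns hold the true labels)
np.diag(ConfusionM)/np.sum(ConfusionM,  axis=0)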

array([0.14285714, 1.        , 0.77777778, 0.57795699, 0.28571429,
       0.53488372, 0.11111111, 0.33333333, 0.24137931, 0.        ,
       0.07407407, 0.35714286, 0.24      , 0.15789474, 0.49130435,
       0.23552124, 0.35555556, 0.66666667, 0.1875    , 0.        ,
       0.17391304, 0.28395062, 0.41255605, 0.0212766 , 0.46875   ,
       0.03921569, 0.44578313, 0.53239437, 0.70967742, 0.15714286,
       0.        , 0.29787234, 0.43396226, 0.32600733, 0.1       ,
       0.21084337, 0.45      , 0.66666667, 0.5494107 , 0.43670886])
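
The recall of a class is the number of its cases that were predicted correctly divided by the number of cases that truly belong to it. The sketch below works this out on a toy 2x2 matrix laid out like `ConfusionM` (rows are predicted labels, columns are true labels); the function name and the toy numbers are assumptions for illustration.

```python
def recall_per_class(cm_pred_by_true):
    """Recall per class for a square matrix with rows = predicted, columns = true."""
    n = len(cm_pred_by_true)
    recalls = []
    for k in range(n):
        # Column sum: all cases whose true label is k
        true_total = sum(cm_pred_by_true[row][k] for row in range(n))
        correct = cm_pred_by_true[k][k]
        recalls.append(correct / true_total if true_total else 0.0)
    return recalls

# Toy matrix (hypothetical): 2 cases truly in class 0, 4 cases truly in class 1
toy_cm = [
    [2, 1],  # predicted class 0
    [0, 3],  # predicted class 1
]
toy_recall = recall_per_class(toy_cm)
print(toy_recall)
```

If the matrix were built with the true labels first, as in scikit-learn's documented `confusion_matrix(y_true, y_pred)` order, the row sums would be used instead of the column sums.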

6. Which "medical_specialty" has the highest recall rate?

In [92]:
max(np.diag(ConfusionM)/np.sum(ConfusionM,  axis=0))

1.0

The maximum, 1.0, is at index 1 of the recall array (counting from zero): all 8 cases of that class are classified correctly.

In [95]:
le.inverse_transform([1])

array([' Autopsy'], dtype=object)
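
Reading the position of the maximum off a printed array by eye is error-prone, so it may be safer to compute it. A minimal sketch with toy values follows; `best_class` and the toy names are hypothetical.

```python
def best_class(recalls, class_names):
    """Return (index, name, recall) of the class with the highest recall.

    If several classes tie, the first one is returned.
    """
    best_index = max(range(len(recalls)), key=recalls.__getitem__)
    return best_index, class_names[best_index], recalls[best_index]

# Toy values (hypothetical)
toy_recalls = [0.25, 1.0, 0.5]
toy_names = ["class A", "class B", "class C"]
best = best_class(toy_recalls, toy_names)
print(best)
```

In the notebook, the analogous lookup would presumably be `le.inverse_transform([np.argmax(recall)])`, assuming `recall` holds the per-class recall array.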

7. Validate your SVM model by training it on 3000 cases and applying it to the remaining 1999 cases. Create the confusion matrix and calculate the recall rates for each document type.
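
No code is given for this step, so what follows is only a sketch of how the hold-out validation could be set up, run on synthetic data rather than on the medical transcripts. The variable names, the synthetic data, the stratified split, and the random seed are all assumptions; with the real data one would pass the document vectors and encoded labels and use `train_size=3000`.

```python
import numpy as np
from sklearn import svm
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the document vectors and encoded labels (assumption)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(25, 5)),
               rng.normal(2.0, 1.0, size=(25, 5))])
y = np.array([0] * 25 + [1] * 25)

# Hold out part of the data; stratify keeps every class in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=30, stratify=y, random_state=0)

clf_holdout = svm.SVC(kernel='linear')
clf_holdout.fit(X_train, y_train)
y_test_pred = clf_holdout.predict(X_test)

# Here the true labels are passed first, so rows = true and columns = predicted
labels = np.unique(y)
cm_test = confusion_matrix(y_test, y_test_pred, labels=labels)

# Recall per class = diagonal / row sums (guarding against empty classes)
row_totals = cm_test.sum(axis=1)
test_recall = np.divide(np.diag(cm_test), row_totals,
                        out=np.zeros(len(labels)), where=row_totals > 0)

print(cm_test)
print(test_recall)
```

Note that this sketch uses the `(y_true, y_pred)` argument order, so recall is computed from row sums, unlike the in-sample matrix earlier in the notebook. The resulting `test_recall` array can then be placed next to the in-sample recall array to compare the two class by class.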



8. How do the test recall rates compare to the in-sample recall rates?