In [1]:
"""
In this notebook we split the data into cases based on the number of words in a jargon term.

We first analyze the single-word case and count how many of these jargon terms do not have a general definition.
Then we perform an error analysis on them.
"""

In [1]:
import pandas as pd
import numpy as np

In [3]:
df_j = pd.read_csv("unique_terms.csv")
df_j

Unnamed: 0.1,Unnamed: 0,ann_text
0,0,Virt - Vite
1,1,1
2,2,MG
3,3,Oral
4,4,Tablet
...,...,...
43324,43324,Clinical Laboratory Improvements Amendments of...
43325,43325,Cytopathology
43326,43326,authenticated
43327,43327,CLIA


In [4]:
df_j["ann_text"] = df_j["ann_text"].str.lower()

In [5]:
df_j = df_j.drop_duplicates("ann_text")
df_j

Unnamed: 0.1,Unnamed: 0,ann_text
0,0,virt - vite
1,1,1
2,2,mg
3,3,oral
4,4,tablet
...,...,...
43323,43323,con
43324,43324,clinical laboratory improvements amendments of...
43325,43325,cytopathology
43327,43327,clia


In [6]:
"""
Lowercasing merges case variants of the same term, so drop_duplicates removes some rows (for example, index 43326 no longer appears in the output above).
Now let us count the number of words in each jargon term.
"""
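The normalisation and counting steps can be tried on a small frame first. The sketch below is only an illustration: the `terms` frame and its values are made up, standing in for `unique_terms.csv`. It builds a new DataFrame at each step with `assign()`, which is one way to avoid assigning into a slice.

```python
import pandas as pd

# Toy data (hypothetical), standing in for unique_terms.csv.
terms = pd.DataFrame({"ann_text": ["MG", "mg", "Oral", "Folic Acid", "folic acid"]})

# Lowercase first so that case variants collapse into one term.
terms = terms.assign(ann_text=terms["ann_text"].str.lower())
terms = terms.drop_duplicates("ann_text").reset_index(drop=True)

# Whitespace tokenisation: number of words per term.
terms = terms.assign(word_count=terms["ann_text"].str.split().str.len())
print(terms)
```

Five raw rows collapse to three terms here, which is the same effect lowercasing has on the real data.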

In [7]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
df_j = df_j.assign(word_count=df_j["ann_text"].str.split().str.len())


In [8]:
df_j

Unnamed: 0.1,Unnamed: 0,ann_text,word_count
0,0,virt - vite,3.0
1,1,1,1.0
2,2,mg,1.0
3,3,oral,1.0
4,4,tablet,1.0
...,...,...,...
43323,43323,con,1.0
43324,43324,clinical laboratory improvements amendments of...,6.0
43325,43325,cytopathology,1.0
43327,43327,clia,1.0


In [9]:
count = df_j["word_count"].value_counts()
print(count)

1.0        21230
2.0         8058
3.0         2824
4.0          707
5.0          205
6.0           64
7.0           27
8.0            6
9.0            4
10.0           1
11123.0        1
1550.0         1
Name: word_count, dtype: int64


In [10]:
# Inspect the term with the implausibly high word count.
outlier = df_j[df_j["word_count"] == 11123.0]
outlier_text = outlier["ann_text"]
print(outlier_text)

39367    \t{do not define}\n1829312\talso gives a h / o...
Name: ann_text, dtype: object


In [11]:
"""
The outlier makes the data-quality issue clear: a single ann_text value contains tab- and newline-separated records, so several rows were merged into one field.
We therefore reload the raw data, clean it, and proceed.
"""
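One way to find such merged rows without relying on the word count is to look for tab or newline characters inside the term itself. This is a minimal sketch on a made-up frame (`toy`, `suspect`, and `clean` are illustrative names, not part of the notebook); the second value imitates the artifact printed above.

```python
import pandas as pd

# Hypothetical toy frame; the second value mimics the merged-rows artifact.
toy = pd.DataFrame(
    {"ann_text": ["mg", "\t{do not define}\n1829312\talso gives", "folic acid", None]}
)

# Flag values containing a tab or newline; missing values are not flagged.
has_separator = toy["ann_text"].str.contains(r"[\t\n]", regex=True, na=False)

suspect = toy[has_separator]   # rows to inspect or drop
clean = toy[~has_separator]    # rows that look like ordinary terms
print(suspect)
```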

In [16]:
# Skip malformed lines (rows with too many fields) instead of raising an error.
df = pd.read_csv("expert_data.tsv", sep="\t", on_bad_lines="skip")

df.head()


Unnamed: 0,id,text_to_annotate,start,end,ann_text,definition
0,98791999,Virt - Vite 2.5 - 25 - 1 MG Oral Tablet Vitami...,0.0,3,Virt - Vite,"A mix of vitamins. It provides vitamin B-6, vi..."
1,98791999,Virt - Vite 2.5 - 25 - 1 MG Oral Tablet Vitami...,7.0,8,1,{DO NOT DEFINE}
2,98791999,Virt - Vite 2.5 - 25 - 1 MG Oral Tablet Vitami...,8.0,9,MG,"A tiny amount of something, usually a drug."
3,98791999,Virt - Vite 2.5 - 25 - 1 MG Oral Tablet Vitami...,9.0,10,Oral,Taken by mouth.
4,98791999,Virt - Vite 2.5 - 25 - 1 MG Oral Tablet Vitami...,10.0,11,Tablet,A pill.


In [17]:
print(len(df))

348335
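Because `on_bad_lines='skip'` drops malformed lines silently, it can be useful to estimate how many rows are lost. The sketch below uses only the standard library and counts rows with more fields than the header, which is what pandas treats as a bad line. The function name `count_overlong_rows` and the inline sample are hypothetical, and the count is an approximation, since quoting is handled by Python's `csv` module rather than by the pandas parser.

```python
import csv
import io


def count_overlong_rows(tsv_text: str) -> int:
    """Count data rows that have more fields than the header row."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    return sum(1 for row in reader if len(row) > len(header))


# Hypothetical sample: the second data row has one field too many,
# the third has one too few (pandas would pad it with NaN, not skip it).
sample = (
    "id\tann_text\tdefinition\n"
    "1\tmg\tA tiny amount\n"
    "2\tpo\tBy mouth\textra\n"
    "3\toral\n"
)
print(count_overlong_rows(sample))
```

For the real file, the same function could be fed the contents of `expert_data.tsv` read as text.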


In [18]:
df_jt = df[['ann_text']]

In [19]:
df_jt

Unnamed: 0,ann_text
0,Virt - Vite
1,1
2,MG
3,Oral
4,Tablet
...,...
348330,clinical
348331,CLIA
348332,ASR
348333,laboratory


In [20]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
df_jt = df_jt.assign(ann_text=df_jt["ann_text"].str.lower())
df_jt = df_jt.drop_duplicates("ann_text")


In [21]:
df_jt

Unnamed: 0,ann_text
0,virt - vite
1,1
2,mg
3,oral
4,tablet
...,...
348284,con
348294,clinical laboratory improvements amendments of...
348309,cytopathology
348328,clia


In [22]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
df_jt = df_jt.assign(word_count=df_jt["ann_text"].str.split().str.len())


In [28]:
count_1 = df_jt["word_count"].value_counts()
print(count_1)

1.0     21200
2.0      8041
3.0      2813
4.0       707
5.0       205
6.0        64
7.0        28
8.0         6
9.0         4
10.0        3
Name: word_count, dtype: int64


In [31]:
# Sanity check: no term should have zero words after cleaning.
empty_terms = df_jt[df_jt["word_count"] == 0.0]
print(empty_terms["ann_text"])

Series([], Name: ann_text, dtype: object)


In [13]:
df_f = pd.read_pickle("case3.pickle")

In [14]:
df_f.head()

Unnamed: 0,id,text_to_annotate,start,end,ann_text_x,definition,ann_text_lower,sbert_def,split_list,split_def,case3_def
0,98791999,Virt - Vite 2.5 - 25 - 1 MG Oral Tablet Vitami...,0.0,3.0,Virt - Vite,"A mix of vitamins. It provides vitamin B-6, vi...",virt - vite,,"[virt, -, vite]","[not found in UMLS, not found in UMLS, not fou...",[The determination of the amount of Vitamin E ...
1,98791999,Virt - Vite 2.5 - 25 - 1 MG Oral Tablet Vitami...,8.0,9.0,MG,"A tiny amount of something, usually a drug.",mg,not found in UMLS,[mg],[not found in UMLS],[]
2,98580425,Treatment Received : ondansetron HCl oral 2 mg...,7.0,8.0,mg,"A tiny amount of something, usually a drug.",mg,not found in UMLS,[mg],[not found in UMLS],[]
3,98580425,Treatment Received : ondansetron HCl oral 2 mg...,44.0,45.0,mg,"A tiny amount of something, usually a drug.",mg,not found in UMLS,[mg],[not found in UMLS],[]
4,98580425,Treatment Received : ondansetron HCl oral 2 mg...,61.0,62.0,mg,"A tiny amount of something, usually a drug.",mg,not found in UMLS,[mg],[not found in UMLS],[]


In [15]:
df_f["word_count"] = df_f["ann_text_lower"].str.split().str.len()

In [16]:
df_single = df_f[df_f["word_count"] == 1.0]
df_single

Unnamed: 0,id,text_to_annotate,start,end,ann_text_x,definition,ann_text_lower,sbert_def,split_list,split_def,case3_def,word_count
1,98791999,Virt - Vite 2.5 - 25 - 1 MG Oral Tablet Vitami...,8.0,9.0,MG,"A tiny amount of something, usually a drug.",mg,not found in UMLS,[mg],[not found in UMLS],[],1
2,98580425,Treatment Received : ondansetron HCl oral 2 mg...,7.0,8.0,mg,"A tiny amount of something, usually a drug.",mg,not found in UMLS,[mg],[not found in UMLS],[],1
3,98580425,Treatment Received : ondansetron HCl oral 2 mg...,44.0,45.0,mg,"A tiny amount of something, usually a drug.",mg,not found in UMLS,[mg],[not found in UMLS],[],1
4,98580425,Treatment Received : ondansetron HCl oral 2 mg...,61.0,62.0,mg,"A tiny amount of something, usually a drug.",mg,not found in UMLS,[mg],[not found in UMLS],[],1
5,98580425,Treatment Received : ondansetron HCl oral 2 mg...,100.0,101.0,mg,"A tiny amount of something, usually a drug.",mg,not found in UMLS,[mg],[not found in UMLS],[],1
...,...,...,...,...,...,...,...,...,...,...,...,...
312151,1512121,EXAMINATION PERFORMED : CT L - SPINE WO CON .,4.0,7.0,L-SPINE,"The lower back, which is formed by vertebral b...",l-spine,,[l-spine],[],[],1
312152,1512121,EXAMINATION PERFORMED : CT L - SPINE WO CON .,8.0,9.0,CON,A fluid given to help make pictures of the ins...,con,Terminology subset about items of packaging th...,[con],[not found in UMLS],[Terminology subset about items of packaging t...,1
312154,1511435,Explanation of Primary Non - Gynecologic Cyto...,7.0,8.0,Cytopathology,Looking at cells from different parts of the b...,cytopathology,A branch of pathology that studies and diagnos...,[cytopathology],[],[A branch of pathology that studies and diagno...,1
312155,1511235,This laboratory is certified under the Clinica...,14.0,15.0,CLIA,Rules that make laboratory testing better.,clia,A Federal law establishing quality standards f...,[clia],[nan],[None],1


In [25]:


# df_single = df_single[df_single['sbert_def'] != "not found in UMLS"]

In [17]:
len(df_f)

312157

In [18]:
len(df_single)

255148

In [19]:
df_not = df_single[df_single['sbert_def'] == "not found in UMLS"]

In [20]:
df_not = df_not.drop(["split_list","split_def","case3_def"],axis = 1)

In [21]:
df_not.head(10)

Unnamed: 0,id,text_to_annotate,start,end,ann_text_x,definition,ann_text_lower,sbert_def,word_count
1,98791999,Virt - Vite 2.5 - 25 - 1 MG Oral Tablet Vitami...,8.0,9.0,MG,"A tiny amount of something, usually a drug.",mg,not found in UMLS,1
2,98580425,Treatment Received : ondansetron HCl oral 2 mg...,7.0,8.0,mg,"A tiny amount of something, usually a drug.",mg,not found in UMLS,1
3,98580425,Treatment Received : ondansetron HCl oral 2 mg...,44.0,45.0,mg,"A tiny amount of something, usually a drug.",mg,not found in UMLS,1
4,98580425,Treatment Received : ondansetron HCl oral 2 mg...,61.0,62.0,mg,"A tiny amount of something, usually a drug.",mg,not found in UMLS,1
5,98580425,Treatment Received : ondansetron HCl oral 2 mg...,100.0,101.0,mg,"A tiny amount of something, usually a drug.",mg,not found in UMLS,1
6,97213602,"p.r.n . , aspirin 81 mg daily , alovudine 150 ...",5.0,6.0,mg,"A tiny amount of something, usually a drug.",mg,not found in UMLS,1
7,97213602,"p.r.n . , aspirin 81 mg daily , alovudine 150 ...",10.0,11.0,mg,"A tiny amount of something, usually a drug.",mg,not found in UMLS,1
8,97213602,"p.r.n . , aspirin 81 mg daily , alovudine 150 ...",16.0,17.0,mg,"A tiny amount of something, usually a drug.",mg,not found in UMLS,1
9,97213602,"p.r.n . , aspirin 81 mg daily , alovudine 150 ...",22.0,23.0,mg,"A tiny amount of something, usually a drug.",mg,not found in UMLS,1
10,97213602,"p.r.n . , aspirin 81 mg daily , alovudine 150 ...",29.0,30.0,mg,"A tiny amount of something, usually a drug.",mg,not found in UMLS,1


In [22]:
df_not.to_csv("analysis.csv")

In [32]:
df_jmiss = df_not[["ann_text_lower"]]

In [33]:
df_jmiss = df_jmiss.drop_duplicates()

In [34]:
df_jmiss

Unnamed: 0,ann_text_lower
1,mg
1963,po
3176,pm
3770,ml
3906,hr
...,...
312141,cd43/
312143,cd5/
312145,cd7/4h9
312146,my31


In [35]:
df_jmiss.to_csv("jmissing.csv")

In [23]:
df_valid1 = df_single[df_single['sbert_def'] != "not found in UMLS"]

In [24]:
df_valid1 = df_valid1.drop(["split_list","split_def","case3_def"],axis = 1)

In [25]:
df_valid1.head(20)

Unnamed: 0,id,text_to_annotate,start,end,ann_text_x,definition,ann_text_lower,sbert_def,word_count
1145,98791999,Virt - Vite 2.5 - 25 - 1 MG Oral Tablet Vitami...,9.0,10.0,Oral,Taken by mouth.,oral,A substance intended for administration throug...,1
1146,98580425,Treatment Received : ondansetron HCl oral 2 mg...,5.0,6.0,oral,Taken by mouth.,oral,A substance intended for administration throug...,1
1147,98580425,Treatment Received : ondansetron HCl oral 2 mg...,8.0,9.0,Oral,Taken by mouth.,oral,A substance intended for administration throug...,1
1148,98580425,Treatment Received : ondansetron HCl oral 2 mg...,98.0,99.0,oral,Taken by mouth.,oral,A substance intended for administration throug...,1
1149,98580425,Treatment Received : ondansetron HCl oral 2 mg...,136.0,137.0,Oral,Taken by mouth.,oral,A substance intended for administration throug...,1
1150,96592550,PAST MEDICAL / SURGICAL HISTORY PMH Other hist...,210.0,211.0,ORAL,Taken by mouth.,oral,A substance intended for administration throug...,1
1151,96592550,PAST MEDICAL / SURGICAL HISTORY PMH Other hist...,238.0,239.0,ORAL,Taken by mouth.,oral,A substance intended for administration throug...,1
1152,96592550,PAST MEDICAL / SURGICAL HISTORY PMH Other hist...,285.0,286.0,ORAL,Taken by mouth.,oral,A substance intended for administration throug...,1
1153,96592550,PAST MEDICAL / SURGICAL HISTORY PMH Other hist...,361.0,362.0,ORAL,Taken by mouth.,oral,A substance intended for administration throug...,1
1154,96592550,PAST MEDICAL / SURGICAL HISTORY PMH Other hist...,392.0,393.0,ORAL,Taken by mouth.,oral,A substance intended for administration throug...,1


In [26]:
df_valid1.to_csv("single_word_final.csv")

In [36]:
"""
We have finished the analysis of single-word jargon terms and collected the terms whose definitions are missing.
Now we focus on the two-word case.
"""
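Since the same found/missing split is repeated for each word-count case, it could be wrapped in a small helper. The sketch below is an assumption about how that might look, not the notebook's own code: `split_by_lookup` is a hypothetical name and the toy rows use placeholder definitions. Unlike the cells above, this variant also treats empty `sbert_def` values as missing, which may or may not be what is wanted.

```python
import pandas as pd

NOT_FOUND = "not found in UMLS"  # sentinel string used in the sbert_def column


def split_by_lookup(frame, column="sbert_def"):
    """Return (found, missing) frames based on the sentinel or an empty value."""
    is_missing = frame[column].isna() | (frame[column] == NOT_FOUND)
    return frame[~is_missing].copy(), frame[is_missing].copy()


# Hypothetical toy rows; the definition text is a placeholder.
toy = pd.DataFrame(
    {
        "ann_text_lower": ["mg", "oral", "po", "l-spine"],
        "sbert_def": [NOT_FOUND, "placeholder definition", NOT_FOUND, None],
    }
)
found, missing = split_by_lookup(toy)
print(len(found), "found,", len(missing), "missing")
```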

In [37]:
df_double = df_f[df_f["word_count"] == 2.0]
df_double

Unnamed: 0,id,text_to_annotate,start,end,ann_text_x,definition,ann_text_lower,sbert_def,split_list,split_def,case3_def,word_count
1879,98791999,Virt - Vite 2.5 - 25 - 1 MG Oral Tablet Vitami...,15.0,17.0,folic acid,A B vitamin.,folic acid,A member of the vitamin B family that stimulat...,"[folic, acid]",[A member of the vitamin B family that stimula...,[A member of the vitamin B family that stimula...,2
1880,12720272,Medications on Admission : Medications : - Cam...,23.0,25.0,folic acid,A B vitamin.,folic acid,A member of the vitamin B family that stimulat...,"[folic, acid]",[A member of the vitamin B family that stimula...,[A member of the vitamin B family that stimula...,2
1881,9320471,FOLIC ACID 1MG PO [ # 30 R5 ] [ * * 2120 - 4 -...,0.0,2.0,FOLIC ACID,A B vitamin.,folic acid,A member of the vitamin B family that stimulat...,"[folic, acid]",[A member of the vitamin B family that stimula...,[A member of the vitamin B family that stimula...,2
1882,9320388,- Folic Acid 1 mg DAILY - Nephro - Vite 1 - 60...,1.0,3.0,Folic Acid,A B vitamin.,folic acid,A member of the vitamin B family that stimulat...,"[folic, acid]",[A member of the vitamin B family that stimula...,[A member of the vitamin B family that stimula...,2
1883,8075968,Disp : * 30 Tablet ( s ) * Refills : * 2 * 8 ....,43.0,45.0,folic acid,A B vitamin.,folic acid,A member of the vitamin B family that stimulat...,"[folic, acid]",[A member of the vitamin B family that stimula...,[A member of the vitamin B family that stimula...,2
...,...,...,...,...,...,...,...,...,...,...,...,...
312128,1515314,There is no Battle 's sign or raccoon eyes .,3.0,6.0,Battle 's,A type of skull fracture.,battle 's,,"[battle, 's]","[Domesticated bovine animals of the genus Bos,...",[],2
312129,1515053,Results Summary : Exon 10 of the Factor V gene...,14.0,16.0,INVADER ASSAY,A very sensitive test for changes in the DNA t...,invader assay,,"[invader, assay]","[Penetrated through tissue., A method of measu...",[],2
312133,1513539,There is no pneumothorax or mediastinal shift .,5.0,7.0,mediastinal shift,A build-up of pressure in the pleural cavity a...,mediastinal shift,Related to the mediastinum.,"[mediastinal, shift]",[A membrane in the midline of the THORAX of ma...,[A membrane in the midline of the THORAX of ma...,2
312134,1513380,"CARDIAC : Normal rate , regular rhythm .",2.0,3.0,Normal rate,The heart is beating as it should be beating.,normal rate,<p>No Corrective Action Needed</p>,"[normal, rate]","[In pathology, a term that is used to describe...","[In pathology, a term that is used to describe...",2


In [38]:
df_not_2 = df_double[df_double['sbert_def'] == "not found in UMLS"]
df_not_2 = df_not_2.drop(["split_list","case3_def"],axis = 1)

In [39]:
df_not_2.head(10)

Unnamed: 0,id,text_to_annotate,start,end,ann_text_x,definition,ann_text_lower,sbert_def,split_def,word_count
4759,98580425,Treatment Received : ondansetron HCl oral 2 mg...,62.0,64.0,IV Push,Quickly injecting a medicine into a vein.,iv push,not found in UMLS,"[not found in UMLS, The act of applying force ...",2
4760,98580425,Treatment Received : ondansetron HCl oral 2 mg...,101.0,103.0,IV Push,Quickly injecting a medicine into a vein.,iv push,not found in UMLS,"[not found in UMLS, The act of applying force ...",2
6294,98147807,surgical path : blood with scant fragments of ...,38.0,40.0,& gt,Greater than.,& gt,not found in UMLS,"[not found in UMLS, A country in CENTRAL AMERI...",2
6295,96592550,PAST MEDICAL / SURGICAL HISTORY PMH Other hist...,899.0,901.0,& gt,Greater than.,& gt,not found in UMLS,"[not found in UMLS, A country in CENTRAL AMERI...",2
6296,93995095,Having soft voice PAST MEDICAL HISTORY Essenti...,520.0,522.0,& gt,Greater than.,& gt,not found in UMLS,"[not found in UMLS, A country in CENTRAL AMERI...",2
6297,2878794,Results Review General results New results : R...,27.0,29.0,& gt,Greater than.,& gt,not found in UMLS,"[not found in UMLS, A country in CENTRAL AMERI...",2
10357,96775729,"[ * * PERSON * * ] , MD CT CHEST W / O CONTRAS...",161.0,163.0,without Contrast,Imaging was done with no usage of agents that ...,without contrast,not found in UMLS,"[not found in UMLS, not found in UMLS]",2
10358,96775729,"[ * * PERSON * * ] , MD CT CHEST W / O CONTRAS...",169.0,171.0,without contrast,Imaging was done with no usage of agents that ...,without contrast,not found in UMLS,"[not found in UMLS, not found in UMLS]",2
10359,10065350,- HD today - trend cardiac enzymes - consider ...,11.0,13.0,without contrast,Imaging was done with no usage of agents that ...,without contrast,not found in UMLS,"[not found in UMLS, not found in UMLS]",2
10360,3098343,"TECHNIQUE : Multiplanar , multi sequencebrain ...",7.0,9.0,without contrast,Imaging was done with no usage of agents that ...,without contrast,not found in UMLS,"[not found in UMLS, not found in UMLS]",2


In [40]:
"""
As we can see, the "gt" above is not Guatemala: "& gt" stands for "greater than", so looking up each word separately picks the wrong sense. Hence we need Sentence-BERT.

For a two-word term "abcd xyz", we can use Sentence-BERT as follows: for each word, score its candidate definitions (def 1-5) against the lay definition and keep the best one:
max(similarity_score(abcd(def 1-5), lay def)), max(similarity_score(xyz(def 1-5), lay def))
"""
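The selection rule above can be sketched independently of the embedding model. In the sketch below, `jaccard` is only a stand-in for the Sentence-BERT similarity so that the example runs without a model; in practice it would be replaced by cosine similarity between sentence embeddings through the `similarity` parameter. The function names and the candidate definitions are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Stand-in similarity: token overlap between two strings (not SBERT)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def best_definition(candidates, lay_definition, similarity=jaccard):
    """Return (score, definition) for the candidate closest to the lay definition."""
    if not candidates:
        return 0.0, None
    return max((similarity(c, lay_definition), c) for c in candidates)


# Hypothetical candidate definitions for the token "gt".
candidates = ["A country in Central America.", "A symbol meaning greater than."]
score, chosen = best_definition(candidates, "Greater than.")
print(score, chosen)
```

Applying `best_definition` once per word of a two-word term gives the two maxima described in the note above.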

In [41]:
print(len(df_not_2))

2942


In [42]:
ttt = df_not_2.drop_duplicates(subset = ["ann_text_lower"])

In [43]:
print(len(ttt))

688
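The gap between 2942 rows and 688 unique terms means a few terms account for many of the misses, so ranking terms by frequency may help prioritise the error analysis. A minimal sketch on made-up rows (the `toy` frame is hypothetical; on the real data the same call would be made on `df_not_2["ann_text_lower"]`):

```python
import pandas as pd

# Hypothetical rows standing in for df_not_2.
toy = pd.DataFrame(
    {"ann_text_lower": ["& gt", "iv push", "& gt", "without contrast", "& gt", "iv push"]}
)

# value_counts() sorts terms by how often they occur, most frequent first.
occurrences = toy["ann_text_lower"].value_counts()
print(len(toy), "rows,", occurrences.size, "unique terms")
print(occurrences.head(10))
```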


In [46]:
ttt = ttt.drop(["id","start","end","ann_text_x","text_to_annotate","word_count","sbert_def","split_def"],axis = 1)

In [47]:
for i in range(len(ttt)):
    print(ttt.iloc[i])

definition        Quickly injecting a medicine into a vein.
ann_text_lower                                      iv push
Name: 4759, dtype: object
definition        Greater than.
ann_text_lower             & gt
Name: 6294, dtype: object
definition        Imaging was done with no usage of agents that ...
ann_text_lower                                     without contrast
Name: 10357, dtype: object
definition        A doctor's degree from medical school.
ann_text_lower                                     m.d .
Name: 12360, dtype: object
definition        Less than.
ann_text_lower          & lt
Name: 38370, dtype: object
definition        Also referred to as Diabetes type 2. Diabetes ...
ann_text_lower                                               type 2
Name: 51167, dtype: object
definition        A mix of eight B vitamins required for good he...
ann_text_lower                                            b complex
Name: 60116, dtype: object
definition        In position.
ann_text_lower    