# Abbreviation Disambiguation in Medical Texts - Data Wrangling & EDA

This notebook documents:

1. How the data for the train and test sets was collected, i.e. the data sources.
2. How the data has been organized for training and testing the model.
3. Data cleaning.
4. Exploratory Data Analysis (EDA).

## Step# 1: Downloading Data from Data Sources.

The data used in this problem is downloaded with the Kaggle API from the location below:

https://www.kaggle.com/xhlulu/medal-emnlp

In [None]:
# Uncomment below lines to download the dataset directly from Kaggle
## To Install Kaggle package in case it is not already present
# !pip install kaggle
## Download dataset
# !kaggle datasets download -d xhlulu/medal-emnlp

### Once the dataset has been downloaded, let's check the directory contents

In [1]:
#Importing the Required Python Packages
import os
import shutil
import zipfile
import pandas as pd
import matplotlib.pyplot as plt

Printing the Current Working Directory

In [2]:
print(os.getcwd())

d:\Learning\Springboard\GitHub\Abbreviation-Disambiguation-


## Step# 2: Organizing the data into a logical folder structure from which the system will read and process it.


    The Data for this problem will have the following Folder structure:

    Work_Dir
    |
    |
    |----> Data
    |
    |
    |----> Train
    |
    |
    |----> Test
    |
    |
    |----> Validation
    |
    |
    |----> Images
    
The cells below create this folder structure.

In [3]:
#Reading the current path in a variable
path = os.getcwd()
print(path)

d:\Learning\Springboard\GitHub\Abbreviation-Disambiguation-


In [4]:
#Creating a folder by name Train
os.mkdir(os.path.join(path, 'Train'))

In [5]:
#Creating a folder by name Test
os.mkdir(os.path.join(path, 'Test'))

In [6]:
#Creating a folder by name Data
os.mkdir(os.path.join(path, 'Data'))

In [7]:
#Creating a folder by name Validation
os.mkdir(os.path.join(path, 'Validation'))

In [8]:
#Creating a folder by name Images
os.mkdir(os.path.join(path, 'Images'))
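
Note that `os.mkdir` raises `FileExistsError` if the notebook is re-run after the folders exist. A minimal sketch of a re-runnable alternative, assuming the same five folder names; the helper `create_folders` and the temporary demo directory are illustrative and not part of the original notebook:

```python
import os
import tempfile

# Folder names taken from the structure described above.
FOLDERS = ["Data", "Train", "Test", "Validation", "Images"]

def create_folders(base_dir, names=FOLDERS):
    """Create each folder under base_dir; do nothing if it already exists."""
    created = []
    for name in names:
        folder = os.path.join(base_dir, name)
        os.makedirs(folder, exist_ok=True)  # exist_ok avoids FileExistsError on re-runs
        created.append(folder)
    return created

# Demonstrated on a temporary directory so the real working directory is untouched.
with tempfile.TemporaryDirectory() as demo_dir:
    create_folders(demo_dir)
    create_folders(demo_dir)  # the second call is a no-op rather than an error
    demo_listing = sorted(os.listdir(demo_dir))
    print(demo_listing)
```

In the notebook itself, the call would be `create_folders(path)`.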

Next, move the NLP_Dataset.zip file into the Data folder and unzip it.

In [9]:
#Moving the file to Data folder
source = os.path.join(path, 'NLP_Dataset.zip')
destination = os.path.join(path, 'Data', 'NLP_Dataset.zip')
shutil.move(source, destination)

'd:\\Learning\\Springboard\\GitHub\\Abbreviation-Disambiguation-\\Data\\NLP_Dataset.zip'

In [10]:
#Check the contents of Data folder
print(os.listdir(os.path.join(path, 'Data')))

['NLP_Dataset.zip']


In [11]:
#Unzip the dataset file
with zipfile.ZipFile(destination, 'r') as zip_ref:
    zip_ref.extractall(os.path.join(path, 'Data'))
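
Before extracting a large archive, it can help to list its members first. The sketch below uses `ZipFile.namelist()` on a small archive built on the fly; the helper `list_archive` and the demo file names are assumptions for illustration only:

```python
import os
import tempfile
import zipfile

def list_archive(zip_path):
    """Return the member names of a zip archive without extracting it."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        return zip_ref.namelist()

with tempfile.TemporaryDirectory() as demo_dir:
    # Build a tiny stand-in archive; the real NLP_Dataset.zip is not needed here.
    demo_zip = os.path.join(demo_dir, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("full_data.csv", "a,b\n1,2\n")
        zf.writestr("pretrain_subset/train.csv", "a,b\n3,4\n")
    members = list_archive(demo_zip)
    print(members)
```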

In [12]:
#Check the contents of Data folder
print(os.listdir(os.path.join(path, 'Data')))

['full_data.csv', 'NLP_Dataset.zip', 'pretrain_subset']


The data was successfully extracted, and a new folder 'pretrain_subset' was also created. Let's check the contents of that folder.

In [13]:
#Check the contents of pretrain_subset folder
print(os.listdir(os.path.join(path, 'Data', 'pretrain_subset')))

['test.csv', 'train.csv', 'valid.csv']


The pretrain_subset folder contains the dataset already divided into 3 files: train, test and valid. We will use the data in this folder due to system memory restrictions.

In [14]:
# Deleting full_data.csv and moving train, test and valid files to Data folder.
source = os.path.join(path, 'Data', 'pretrain_subset')
destination = os.path.join(path, 'Data')
shutil.move(os.path.join(source, 'train.csv'), os.path.join(destination, 'train.csv'))
shutil.move(os.path.join(source, 'test.csv'), os.path.join(destination, 'test.csv'))
shutil.move(os.path.join(source, 'valid.csv'), os.path.join(destination, 'valid.csv'))
os.remove(os.path.join(destination, 'full_data.csv'))

#Check the contents of Data folder
print(os.listdir(os.path.join(path, 'Data')))

['NLP_Dataset.zip', 'pretrain_subset', 'test.csv', 'train.csv', 'valid.csv']


## Step# 3: Cleaning Data

### Let's have a look at our data.

In [15]:
#Creating a path variable directly to the dataset
data_path = os.path.join(path, 'Data', 'train.csv')
print(data_path)

d:\Learning\Springboard\GitHub\Abbreviation-Disambiguation-\Data\train.csv


In [16]:
# Loading the data in a dataframe.
textDF = pd.read_csv(data_path)

In [17]:
#Checking the shape of Dataframe
textDF.shape

(3000000, 4)

### The data contains 3 million rows, which is more than my system can handle, so only the first 1 million rows will be used for this project.

In [20]:
textDF.drop(textDF.index[1000000:], inplace = True)
textDF.shape

(1000000, 4)
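
If memory is tight, an alternative to loading everything and then dropping rows is to let `pd.read_csv` stop early with its `nrows` parameter. A minimal sketch on an in-memory CSV; the three demo rows are made up for illustration:

```python
import io
import pandas as pd

# Tiny in-memory stand-in for train.csv (hypothetical values).
demo_csv = io.StringIO(
    "ABSTRACT_ID,TEXT,LOCATION,LABEL\n"
    "1,alpha HD beta,1,hodgkins lymphoma\n"
    "2,gamma TAC delta,1,transverse aortic constriction\n"
    "3,epsilon WM zeta,1,western medicine\n"
)

# nrows limits how many data rows are parsed, so the rest is never loaded.
demo_df = pd.read_csv(demo_csv, nrows=2)
print(demo_df.shape)
```

For the real file, this would correspond to `pd.read_csv(data_path, nrows=1000000)`.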

In [21]:
#Checking the first 5 rows of the dataframe
textDF.head(5)

Unnamed: 0,ABSTRACT_ID,TEXT,LOCATION,LABEL
0,14145090,velvet antlers vas are commonly used in tradit...,63,transverse aortic constriction
1,1900667,the clinical features of our cases demonstrate...,85,hodgkins lymphoma
2,8625554,ceftobiprole bpr is an investigational cephalo...,90,methicillinsusceptible s aureus
3,8157202,we have taken a basic biologic RPA to elucidat...,26,parathyroid hormonerelated protein
4,6784974,lipoperoxidationderived aldehydes for example ...,157,lipoperoxidation


In [22]:
#Checking last 5 rows of the dataframe
textDF.tail(5)

Unnamed: 0,ABSTRACT_ID,TEXT,LOCATION,LABEL
999995,658615,in vivo flow p nmr spectroscopy of a microscop...,34,saturation transfer
999996,4297576,this study evaluated whether the number of gra...,52,severe aplastic anaemia
999997,500044,gmcsf is a major regulator of myelopoiesis rhG...,7,recombinant human gmcsf
999998,13583706,irritability is an aspect of the negative affe...,127,healthy comparison
999999,1540457,hypercalcemia is an uncommon complication of c...,27,congenital mesoblastic nephroma


In [23]:
#Checking the summary statistics of the dataframe
textDF.describe(include = 'all')

Unnamed: 0,ABSTRACT_ID,TEXT,LOCATION,LABEL
count,1000000.0,1000000,1000000.0,1000000
unique,,939852,,22490
top,,purpose to investigate the effect of coordinat...,,tibial nerve
freq,,7,,118
mean,7027032.0,,76.75036,
std,4594329.0,,67.603551,
min,6.0,,0.0,
25%,3057271.0,,22.0,
50%,6381622.0,,60.0,
75%,10827030.0,,114.0,


In [24]:
#Checking the datatypes of dataframe columns
textDF.dtypes

ABSTRACT_ID     int64
TEXT           object
LOCATION        int64
LABEL          object
dtype: object

In [25]:
# Checking the unique Abstract_id to see if Abstract_id can be converted to Index
textDF['ABSTRACT_ID'].nunique()

939860

The number of unique ABSTRACT_ID values (939,860) is lower than the number of rows (1,000,000), so the IDs are not all unique. Let's check for duplicate rows in the dataset.

In [26]:
duplicate = textDF[textDF.duplicated()]
duplicate.head(5)

Unnamed: 0,ABSTRACT_ID,TEXT,LOCATION,LABEL


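The result is empty, so no row is an exact duplicate of another. Rows that share an ABSTRACT_ID must therefore differ in at least one other column.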

In [27]:
# Let's check for null values, if any
textDF.isnull().values.any()

False

Thus, we don't have any Null values in the dataset.
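
When the check above returns `True`, a per-column count shows where the nulls are. A small sketch on a made-up dataframe (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical frame with one missing TEXT value.
demo_df = pd.DataFrame({
    "TEXT": ["some text", None, "more text"],
    "LABEL": ["a", "b", "c"],
})

# isnull() marks each cell; sum() counts the True values per column.
null_counts = demo_df.isnull().sum()
print(null_counts)
```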

### At this point the data wrangling is complete and the resulting data is ready for EDA

## Step# 4: EDA

In [28]:
# Let's look at one row of the dataset in detail
# None removes the column-width limit (passing -1 is deprecated in newer pandas versions)
pd.set_option('display.max_colwidth', None)
textDF.head(1)

Unnamed: 0,ABSTRACT_ID,TEXT,LOCATION,LABEL
0,14145090,velvet antlers vas are commonly used in traditional chinese medicine and invigorant and contain many PET components for health promotion the velvet antler peptide svap is one of active components in vas based on structural study the svap interacts with tgfβ receptors and disrupts the tgfβ pathway we hypothesized that svap prevents cardiac fibrosis from pressure overload by blocking tgfβ signaling SDRs underwent TAC tac or a sham operation T3 one month rats received either svap mgkgday or vehicle for an additional one month tac surgery induced significant cardiac dysfunction FB activation and fibrosis these effects were improved by treatment with svap in the heart tissue tac remarkably increased the expression of tgfβ and connective tissue growth factor ctgf ROS species C2 and the phosphorylation C2 of smad and ERK kinases erk svap inhibited the increases in reactive oxygen species C2 ctgf expression and the phosphorylation of smad and erk but not tgfβ expression in cultured cardiac fibroblasts angiotensin ii ang ii had similar effects compared to tac surgery such as increases in αsmapositive CFs and collagen synthesis svap eliminated these effects by disrupting tgfβ IB to its receptors and blocking ang iitgfβ downstream signaling these results demonstrated that svap has antifibrotic effects by blocking the tgfβ pathway in CFs,63,transverse aortic constriction


### As per the dataset specifications, the LOCATION column gives the zero-based position of the abbreviation among the space-separated words of TEXT, and the LABEL column gives its expansion.

In [29]:
# Let's check the abbreviations in the first 10 rows of the dataset along with their labels
split_text = [ t.split(' ') for t in textDF[:10]['TEXT']]
label = [t for t in textDF[:10]['LABEL']]
location = [t for t in textDF[:10]['LOCATION']]

In [30]:
for i in range(0,10):
    print(label[i], ' -- ', split_text[i][location[i]])

transverse aortic constriction  --  TAC
hodgkins lymphoma  --  HD
methicillinsusceptible s aureus  --  MSSA
parathyroid hormonerelated protein  --  PTHrP
lipoperoxidation  --  LPO
hepatitis g virus  --  HGV
radical neck dissection  --  RND
amplified spontaneous emission  --  ASE
portal blood  --  HPB
western medicine  --  WM


From the above analysis, the relationship between the LOCATION, LABEL and TEXT columns is clearly visible: the word at index LOCATION in TEXT is the abbreviation, and LABEL is its expansion.
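
The lookup used above can be wrapped in a small helper. This is a sketch only; the function name `abbreviation_at` and the shortened demo sentence are assumptions for illustration:

```python
def abbreviation_at(text, location):
    """Return the token at zero-based position `location` after splitting on spaces."""
    return text.split(" ")[location]

# Shortened, made-up sentence in the style of the dataset.
demo_text = "rats underwent TAC or a sham operation"
demo_abbr = abbreviation_at(demo_text, 2)
print(demo_abbr)
```

On the notebook's dataframe, the same helper could be applied row by row, for example with `textDF.apply(lambda r: abbreviation_at(r['TEXT'], r['LOCATION']), axis=1)`.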

### Let us again check the number of unique ABSTRACT_ID in Dataset

In [31]:
# Checking the unique Abstract_id
textDF['ABSTRACT_ID'].nunique()

939860

In [32]:
# Checking the shape of the Dataset
textDF.shape

(1000000, 4)

The dataset has 1,000,000 rows but only 939,860 unique ABSTRACT_ID values, so some IDs occur more than once. Let's find those abstracts and check how their rows differ.

In [33]:
duplicate = textDF[textDF['ABSTRACT_ID'].duplicated(keep = False)]

In [34]:
duplicate.sort_values(by = ['ABSTRACT_ID']).head(5)

Unnamed: 0,ABSTRACT_ID,TEXT,LOCATION,LABEL
11397,6,a doubleblind T0 with intraindividual comparisons was carried out to investigate the effects of mg of ralphahydroxyisopropylalphahtropanium bromidetropate sch mg sch mg OX mg oxazepam and placebo with p.o. in randomized CS on gastric juice volume amount of acid concentration and ph values in healthy volunteers the secretion parameters were measured during a h basal period and a h stimulation period the gastric juice was obtained in min portions via stomach tube stimulation was effected by mugkgh PG via drip infusion the friedman test was used for the comparative statistical evaluation and individual comparisons were carried out by means of the wilcoxon test pairdifferences rank the results show that sch and sch OX were equal in effect on basal and stimulated secretion volume as compared with PL it was not possible to establish an effect on secretion volume for oxazepam CT sch and sch oxazepam were found to be equipotent in reducing the amount of basal acid while oxazepam reduced this quantity only during the first min of basal secretion none of the three AS S9 was capable of inhibiting the stimulated acid although both sch preparations produced a clear trend towards lowered mean values during the basal secretion period all three test S9 had an inhibiting action on acid concentration but none of them had a significant effect during the stimulation period the ph value was savely increased only by sch and sch OX and this even only during the basal period the results are discussed,234,oxazepam
331391,6,a doubleblind T0 with intraindividual comparisons was carried out to investigate the effects of mg of ralphahydroxyisopropylalphahtropanium bromidetropate sch mg sch mg OX mg oxazepam and placebo with p.o. in randomized CS on gastric juice volume amount of acid concentration and ph values in healthy volunteers the secretion parameters were measured during a h basal period and a h stimulation period the gastric juice was obtained in min portions via stomach tube stimulation was effected by mugkgh PG via drip infusion the friedman test was used for the comparative statistical evaluation and individual comparisons were carried out by means of the wilcoxon test pairdifferences rank the results show that sch and sch OX were equal in effect on basal and stimulated secretion volume as compared with PL it was not possible to establish an effect on secretion volume for oxazepam CT sch and sch oxazepam were found to be equipotent in reducing the amount of basal acid while oxazepam reduced this quantity only during the first min of basal secretion none of the three AS S9 was capable of inhibiting the stimulated acid although both sch preparations produced a clear trend towards lowered mean values during the basal secretion period all three test S9 had an inhibiting action on acid concentration but none of them had a significant effect during the stimulation period the ph value was savely increased only by sch and sch OX and this even only during the basal period the results are discussed,112,oxazepam
627471,75,P2 fractions p derived from guineapig CBF cortex were incubated in the presence of mm kcl in a krebsglucose medium torpedo marmorata electric organs were stimulated electrically in vivo at pulsessec for min by electrodes placed on the electric lobe synaptic LDV were isolated from each source and the phospholipid compositions analysed and compared with LDV from unstimulated controls lysophosphatidylcholine was the only lysophosphoglyceride demonstrable in the SVs from either source and its low levels did not increase as a result of chemical or electircal stimulation in each case there was a close similarity of the phospholipid distributions in the vesicles taken from control and stimulated samples control experiments indicated extensive decreases in the acetylcholine content of the LDV from the stimulated electric organ and smaller decreases in the acetylcholine content of the SVs from stimulated crude synaptosomal fractions these fractions were found to respire linearly in the presence of mm glucose and the vesicle fractions were shown to have low levels of contaiminating membranes as judged by marker enzyme analyses P2 fractions from guineapig cerebral SC were incubated in a krebsglucose medium with labelled fatty acids and hglucose in the presence or absence of mm kcl subsynaptosomal fractionation was carried out and specific radioactivities of phosphatidylcholine phosphatidylethanolamine phosphatidylserine and phosphatidylinositol were determined in fractions d SVs e microsomes and h disrupted synaptosomes the release of neurotransmitter did not significantly enhance the labelling of phospholipids in any of the fractions studied as compared with phospholipids from unstimulated fractions this was found after two incubation times and using coleate carachidonate hpalmitate and hglucose,0,crude synaptosomal
742153,75,P2 fractions p derived from guineapig CBF cortex were incubated in the presence of mm kcl in a krebsglucose medium torpedo marmorata electric organs were stimulated electrically in vivo at pulsessec for min by electrodes placed on the electric lobe synaptic LDV were isolated from each source and the phospholipid compositions analysed and compared with LDV from unstimulated controls lysophosphatidylcholine was the only lysophosphoglyceride demonstrable in the SVs from either source and its low levels did not increase as a result of chemical or electircal stimulation in each case there was a close similarity of the phospholipid distributions in the vesicles taken from control and stimulated samples control experiments indicated extensive decreases in the acetylcholine content of the LDV from the stimulated electric organ and smaller decreases in the acetylcholine content of the SVs from stimulated crude synaptosomal fractions these fractions were found to respire linearly in the presence of mm glucose and the vesicle fractions were shown to have low levels of contaiminating membranes as judged by marker enzyme analyses P2 fractions from guineapig cerebral SC were incubated in a krebsglucose medium with labelled fatty acids and hglucose in the presence or absence of mm kcl subsynaptosomal fractionation was carried out and specific radioactivities of phosphatidylcholine phosphatidylethanolamine phosphatidylserine and phosphatidylinositol were determined in fractions d SVs e microsomes and h disrupted synaptosomes the release of neurotransmitter did not significantly enhance the labelling of phospholipids in any of the fractions studied as compared with phospholipids from unstimulated fractions this was found after two incubation times and using coleate carachidonate hpalmitate and hglucose,171,crude synaptosomal
583421,212,the beige mouse is an animal MM for the human chediakhigashi syndrome a disease characterized by giant lysosomes in most cell types in mice treatment with androgenic hormones causes a fold elevation in at least one kidney lysosomal enzyme betaglucuronidase beige mice treated with androgen had significantly higher kidney betaglucuronidase betagalactosidase and nacetylbetadglucosaminidase Hex C2 than normal mice other androgeninducible enzymes and enzyme markers for the cytosol mitochondria and peroxisomes were not increased in kidney of beige mice no significant lysosomal enzyme elevation was observed in five other organs of beige mice with or without androgen treatment nor in kidneys of beige females not treated with androgen histochemical staining for glucuronidase together with subcellular fractionation showed that the higher GUS content of beige mouse kidney is caused by a striking accumulation of giant glucuronidasecontaining lysosomes in tubule cells near the corticomedullary boundary in NM lysosomal enzymes are coordinately released into the lumen of the kidney tubules and appreciable amounts of lysosomal enzymes are present in the urine levels of urinary lysosomal enzymes are much lower in beige mice than in normal mice it appears that lysosomes may accumulate in beige mice because of defective exocytosis resulting either from decreased intracellular motility of lysosomes or from their improper F0 with the plasma membrane a similar defect could account for characteristics of the CHS,119,glucuronidase


### Based on the above results, rows that share an ABSTRACT_ID have the same TEXT but a different LOCATION: a single text can contain more than one abbreviation at different places, so there is one row per abbreviation.
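
One way to quantify this is to count the rows per ABSTRACT_ID. The sketch below runs on a toy dataframe whose IDs and locations mirror the rows shown above; it is an illustration, not a result computed on the full dataset:

```python
import pandas as pd

# Toy frame mirroring the ABSTRACT_ID / LOCATION pairs printed above.
demo_df = pd.DataFrame({
    "ABSTRACT_ID": [6, 6, 75, 75, 212],
    "LOCATION": [234, 112, 0, 171, 119],
})

# size() counts rows per group; keep only the IDs that occur more than once.
rows_per_abstract = demo_df.groupby("ABSTRACT_ID").size()
repeated = rows_per_abstract[rows_per_abstract > 1]
print(repeated)
```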

### Let's save the above textDF as a CSV file inside the Train folder for further use.

In [36]:
textDF.to_csv('Train/train.csv', index = False)

### Let's load valid.csv and test.csv as well

In [37]:
# Loading valid.csv
valid = pd.read_csv(os.path.join(path, 'Data', 'valid.csv'))
#Loading test.csv
test = pd.read_csv(os.path.join(path, 'Data', 'test.csv'))

In [38]:
# Check the shape of valid dataset
valid.shape

(1000000, 4)

### The validation data contains 1 million rows, so let's reduce it to 20% of the train data, i.e. 200k records.

In [39]:
valid.drop(valid.index[200000:], inplace = True)
valid.shape

(200000, 4)
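
The subset size follows directly from the train size: 20% of 1,000,000 rows is 200,000 rows. A minimal sketch of that arithmetic, with `head()` as a non-destructive alternative to `drop`; the variable names and the 10-row demo frame are illustrative:

```python
import pandas as pd

train_rows = 1_000_000
subset_percent = 20

# Integer arithmetic avoids any floating-point rounding.
subset_rows = train_rows * subset_percent // 100
print(subset_rows)

# Same idea on a tiny demo frame: keep the first 20% of the rows.
demo_valid = pd.DataFrame({"LOCATION": range(10)})
demo_subset = demo_valid.head(len(demo_valid) * subset_percent // 100)
print(demo_subset.shape)
```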

In [40]:
#Save this updated Valid.csv to Validation folder
valid.to_csv('Validation/valid.csv', index = False)

In [41]:
# Check the shape of test dataset
test.shape

(1000000, 4)

### The test data contains 1 million rows, so let's reduce it to 20% of the train data, i.e. 200k records.

In [42]:
test.drop(test.index[200000:], inplace = True)
test.shape

(200000, 4)

In [43]:
#Save this updated test.csv to Test folder
test.to_csv('Test/test.csv', index = False)