# Abbreviation Disambiguation in Medical Texts - Data Wrangling & EDA

This notebook documents:

1. How the data for the train and test sets was collected, i.e., the data sources.
2. How the data has been organized for Training and Testing the model.
3. Data Cleaning.
4. Exploratory Data Analysis (EDA).

## Step# 1: Downloading Data from Data Sources.

The data used in this problem is downloaded with the Kaggle API from the following location:

https://www.kaggle.com/xhlulu/medal-emnlp

In [2]:
# Uncomment below lines to download the dataset directly from Kaggle
## To Install Kaggle package in case it is not already present
# !pip install kaggle
## Download dataset
# !kaggle datasets download -d xhlulu/medal-emnlp



### Once the dataset has been downloaded, let's check the directory contents

In [1]:
#Importing the Required Python Packages
import os
import shutil
import zipfile
import pandas as pd
import matplotlib.pyplot as plt

Printing the Current Working Directory

In [2]:
print(os.getcwd())

d:\Learning\Springboard\GitHub\Abbreviation-Disambiguation-


Printing the content of the Working Directory

In [3]:
print(os.listdir())

['.git', 'NLP_Dataset.zip', 'Project Proposal.pdf', 'README.md', 'Step 1- Data Wrangling and EDA.ipynb']


## Step# 2: Organizing data into a logical folder structure from which the system will read and process it.


    The Data for this problem will have the following Folder structure:

    Work_Dir
    |
    |
    |----> Data
    |
    |
    |----> Train
    |
    |
    |----> Test
    |
    |
    |----> Validation
    |
    |
    |----> Images
    
    The cells below create this folder structure.

In [3]:
#Reading the current path in a variable
path = os.getcwd()
print(path)

d:\Learning\Springboard\GitHub\Abbreviation-Disambiguation-


In [5]:
#Creating a folder by name Train
os.mkdir(os.path.join(path, 'Train'))

In [6]:
#Creating a folder by name Test
os.mkdir(os.path.join(path, 'Test'))

In [7]:
#Creating a folder by name Data
os.mkdir(os.path.join(path, 'Data'))

In [8]:
#Creating a folder by name Validation
os.mkdir(os.path.join(path, 'Validation'))

In [9]:
#Creating a folder by name Images
os.mkdir(os.path.join(path, 'Images'))
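
A note on re-running the cells above: `os.mkdir` raises `FileExistsError` if the folder already exists, so the cells fail on a second run. The following is a minimal sketch of an idempotent alternative, assuming the same five folder names. The helper name `ensure_dirs` is hypothetical, and the demonstration runs in a temporary directory rather than in the project folder.

```python
import os
import tempfile

def ensure_dirs(base, names):
    """Create each folder in `names` under `base` if it is missing.

    `ensure_dirs` is a hypothetical helper, not part of the original notebook.
    """
    created = []
    for name in names:
        full_path = os.path.join(base, name)
        # exist_ok=True makes the call safe to repeat
        os.makedirs(full_path, exist_ok=True)
        created.append(full_path)
    return created

# Demonstration in a throwaway directory (an assumption made for safety)
with tempfile.TemporaryDirectory() as tmp:
    folders = ['Train', 'Test', 'Data', 'Validation', 'Images']
    ensure_dirs(tmp, folders)
    ensure_dirs(tmp, folders)  # second call does nothing and raises no error
    print(sorted(os.listdir(tmp)))
```

In the notebook itself, `ensure_dirs(path, folders)` would replace the five separate `os.mkdir` cells.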

Moving the NLP_Dataset.zip file into the Data folder and then unzipping the file

In [15]:
#Moving the file to Data folder
source = os.path.join(path, 'NLP_Dataset.zip')
destination = os.path.join(path, 'Data', 'NLP_Dataset.zip')
shutil.move(source, destination)

'd:\\Learning\\Springboard\\GitHub\\Abbreviation-Disambiguation-\\Data\\NLP_Dataset.zip'

In [16]:
#Check the contents of Data folder
print(os.listdir(os.path.join(path, 'Data')))

['NLP_Dataset.zip']


In [19]:
#Unzip the dataset file
with zipfile.ZipFile(destination, 'r') as zip_ref:
    zip_ref.extractall(os.path.join(path, 'Data'))

In [3]:
#Check the contents of Data folder
print(os.listdir(os.path.join(path, 'Data')))

['full_data.csv', 'NLP_Dataset.zip', 'pretrain_subset']


The data was extracted successfully, and a new folder 'pretrain_subset' was created. Let's check the contents of that folder.

In [4]:
#Check the contents of pretrain_subset folder
print(os.listdir(os.path.join(path, 'Data', 'pretrain_subset')))

['test.csv', 'train.csv', 'valid.csv']


The pretrain_subset folder contains the dataset already divided into 3 files: train, test and valid. We will use the data in this folder because of system memory restrictions.

In [5]:
# Deleting full_data.csv and moving train, test and valid files to Data folder.
source = os.path.join(path, 'Data', 'pretrain_subset')
destination = os.path.join(path, 'Data')
shutil.move(os.path.join(source, 'train.csv'), os.path.join(destination, 'train.csv'))
shutil.move(os.path.join(source, 'test.csv'), os.path.join(destination, 'test.csv'))
shutil.move(os.path.join(source, 'valid.csv'), os.path.join(destination, 'valid.csv'))
os.remove(os.path.join(destination, 'full_data.csv'))

#Check the contents of Data folder
print(os.listdir(os.path.join(path, 'Data')))

['NLP_Dataset.zip', 'pretrain_subset', 'test.csv', 'train.csv', 'valid.csv']


## Step# 3: Cleaning Data

### Let's have a look at our data.

In [4]:
#Creating a path variable directly to the dataset
data_path = os.path.join(path, 'Data', 'train.csv')
print(data_path)

d:\Learning\Springboard\GitHub\Abbreviation-Disambiguation-\Data\train.csv


In [5]:
# Loading the data in a dataframe.
textDF = pd.read_csv(data_path)

In [6]:
#Checking the shape of Dataframe
textDF.shape

(3000000, 4)

In [7]:
#Checking the first 5 rows of the dataframe
textDF.head(5)

Unnamed: 0,ABSTRACT_ID,TEXT,LOCATION,LABEL
0,14145090,velvet antlers vas are commonly used in tradit...,63,transverse aortic constriction
1,1900667,the clinical features of our cases demonstrate...,85,hodgkins lymphoma
2,8625554,ceftobiprole bpr is an investigational cephalo...,90,methicillinsusceptible s aureus
3,8157202,we have taken a basic biologic RPA to elucidat...,26,parathyroid hormonerelated protein
4,6784974,lipoperoxidationderived aldehydes for example ...,157,lipoperoxidation


In [8]:
#Checking last 5 rows of the dataframe
textDF.tail(5)

Unnamed: 0,ABSTRACT_ID,TEXT,LOCATION,LABEL
2999995,10674546,the results of a surveillance study conducted ...,99,argon plasma coagulation
2999996,15628733,approximately of patients with celiac disease ...,12,glutenfree diet
2999997,15419189,the LT survivorship and PET outcomes of the mo...,15,unicompartmental knee arthroplasty
2999998,2075862,previous work has demonstrated the presence of...,60,complete
2999999,532074,a hospital warm water system was monitored for...,18,legionella pneumophila


In [9]:
#Checking the summary statistics of the dataframe
textDF.describe(include = 'all')

Unnamed: 0,ABSTRACT_ID,TEXT,LOCATION,LABEL
count,3000000.0,3000000,3000000.0,3000000
unique,,2530992,,22555
top,,glycyrrhetinic acid and its salts and esters a...,,birth weights
freq,,12,,268
mean,7027529.0,,76.83654,
std,4591529.0,,67.5964,
min,6.0,,0.0,
25%,3065308.0,,22.0,
50%,6382561.0,,60.0,
75%,10822180.0,,115.0,


In [10]:
#Checking the datatypes of dataframe columns
textDF.dtypes

ABSTRACT_ID     int64
TEXT           object
LOCATION        int64
LABEL          object
dtype: object

In [11]:
# Checking the unique Abstract_id to see if Abstract_id can be converted to Index
textDF['ABSTRACT_ID'].nunique()

2531051

There are 2,531,051 unique ABSTRACT_ID values across 3,000,000 rows, so the IDs are not all unique and cannot serve as the index. Let's check for duplicate rows in the dataset.

In [12]:
duplicate = textDF[textDF.duplicated()]
duplicate.head(5)

Unnamed: 0,ABSTRACT_ID,TEXT,LOCATION,LABEL


The result is empty, so no row is an exact duplicate of another.

In [13]:
# Let's check for null values, if any
textDF.isnull().values.any()

False

Thus, we don't have any Null values in the dataset.
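
The two checks above each return a single verdict. The following is a minimal sketch, on a small hypothetical frame (the values are made up for illustration), of how the same pandas calls can report counts instead: the number of fully duplicated rows, the number of rows sharing an ABSTRACT_ID, and the number of nulls per column.

```python
import pandas as pd

# Hypothetical toy frame with the same columns as train.csv
toy = pd.DataFrame({
    'ABSTRACT_ID': [6, 6, 49],
    'TEXT': ['a b c', 'a b c', 'd e f'],
    'LOCATION': [1, 2, 0],
    'LABEL': ['x', 'x', None],
})

# Rows identical in every column (the first two rows differ in LOCATION)
full_dupes = int(toy.duplicated().sum())

# Rows whose ABSTRACT_ID occurs more than once; keep=False marks all of them
id_dupes = int(toy['ABSTRACT_ID'].duplicated(keep=False).sum())

# Null count per column, converted to plain ints for readable printing
nulls = {col: int(n) for col, n in toy.isnull().sum().items()}

print(full_dupes, id_dupes)
print(nulls)
```

Per-column counts make it easier to see which column needs attention if a later file (for example, the test or validation set) does contain missing values.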

### At this point the Data Wrangling has been completed and the resulting data is now ready for EDA

## Step# 4: EDA

In [14]:
# Let's look at one row of the dataset in detail
# None removes the column-width limit; the older value -1 is deprecated in pandas 1.0+
pd.set_option('display.max_colwidth', None)
textDF.head(1)

Unnamed: 0,ABSTRACT_ID,TEXT,LOCATION,LABEL
0,14145090,velvet antlers vas are commonly used in traditional chinese medicine and invigorant and contain many PET components for health promotion the velvet antler peptide svap is one of active components in vas based on structural study the svap interacts with tgfβ receptors and disrupts the tgfβ pathway we hypothesized that svap prevents cardiac fibrosis from pressure overload by blocking tgfβ signaling SDRs underwent TAC tac or a sham operation T3 one month rats received either svap mgkgday or vehicle for an additional one month tac surgery induced significant cardiac dysfunction FB activation and fibrosis these effects were improved by treatment with svap in the heart tissue tac remarkably increased the expression of tgfβ and connective tissue growth factor ctgf ROS species C2 and the phosphorylation C2 of smad and ERK kinases erk svap inhibited the increases in reactive oxygen species C2 ctgf expression and the phosphorylation of smad and erk but not tgfβ expression in cultured cardiac fibroblasts angiotensin ii ang ii had similar effects compared to tac surgery such as increases in αsmapositive CFs and collagen synthesis svap eliminated these effects by disrupting tgfβ IB to its receptors and blocking ang iitgfβ downstream signaling these results demonstrated that svap has antifibrotic effects by blocking the tgfβ pathway in CFs,63,transverse aortic constriction


### As per the dataset specification, the LOCATION column gives the number of words that precede the abbreviation in TEXT (its zero-based word index), and the LABEL column gives the abbreviation's expansion.

In [59]:
# Let's check the abbreviations in the first 10 rows of the dataset along with their labels
split_text = [ t.split(' ') for t in textDF[:10]['TEXT']]
label = [t for t in textDF[:10]['LABEL']]
location = [t for t in textDF[:10]['LOCATION']]

In [61]:
for i in range(0,10):
    print(label[i], ' -- ', split_text[i][location[i]])

transverse aortic constriction  --  TAC
hodgkins lymphoma  --  HD
methicillinsusceptible s aureus  --  MSSA
parathyroid hormonerelated protein  --  PTHrP
lipoperoxidation  --  LPO
hepatitis g virus  --  HGV
radical neck dissection  --  RND
amplified spontaneous emission  --  ASE
portal blood  --  HPB
western medicine  --  WM


The output above shows the relationship between the LOCATION, LABEL and TEXT columns: splitting TEXT on spaces and indexing at LOCATION returns the abbreviation, and LABEL holds its expansion.
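
The lookup used in the cells above can be wrapped in a small function and stored as a new column. The following is a minimal sketch on a hypothetical two-row frame; the texts are shortened, made-up examples, and the helper name `abbreviation_at` and the column name `ABBREVIATION` are assumptions, not part of the dataset.

```python
import pandas as pd

def abbreviation_at(text, location):
    """Return the token at zero-based index `location`.

    The text is split on single spaces, matching the split used above.
    """
    return text.split(' ')[location]

# Hypothetical, shortened rows for illustration only
toy = pd.DataFrame({
    'TEXT': ['rats underwent TAC or a sham operation',
             'patients with HD were treated'],
    'LOCATION': [2, 2],
    'LABEL': ['transverse aortic constriction', 'hodgkins lymphoma'],
})

# Pair each text with its location and extract the abbreviation
toy['ABBREVIATION'] = [
    abbreviation_at(text, loc)
    for text, loc in zip(toy['TEXT'], toy['LOCATION'])
]
print(toy[['ABBREVIATION', 'LABEL']])
```

On the full 3,000,000-row frame the same list comprehension should work unchanged, though it may take a while to run because every text is split in Python.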

### Let us again check the number of unique ABSTRACT_ID values in the dataset

In [62]:
# Checking the unique Abstract_id
textDF['ABSTRACT_ID'].nunique()

2531051

In [63]:
# Checking the shape of the Dataset
textDF.shape

(3000000, 4)

The dataset has 3,000,000 rows but only 2,531,051 unique ABSTRACT_ID values, so some IDs appear more than once. Let's find those abstracts and check how their rows differ.

In [70]:
duplicate = textDF[textDF['ABSTRACT_ID'].duplicated(keep = False)]

In [72]:
duplicate.sort_values(by = ['ABSTRACT_ID']).head(5)

Unnamed: 0,ABSTRACT_ID,TEXT,LOCATION,LABEL
331391,6,a doubleblind T0 with intraindividual comparisons was carried out to investigate the effects of mg of ralphahydroxyisopropylalphahtropanium bromidetropate sch mg sch mg OX mg oxazepam and placebo with p.o. in randomized CS on gastric juice volume amount of acid concentration and ph values in healthy volunteers the secretion parameters were measured during a h basal period and a h stimulation period the gastric juice was obtained in min portions via stomach tube stimulation was effected by mugkgh PG via drip infusion the friedman test was used for the comparative statistical evaluation and individual comparisons were carried out by means of the wilcoxon test pairdifferences rank the results show that sch and sch OX were equal in effect on basal and stimulated secretion volume as compared with PL it was not possible to establish an effect on secretion volume for oxazepam CT sch and sch oxazepam were found to be equipotent in reducing the amount of basal acid while oxazepam reduced this quantity only during the first min of basal secretion none of the three AS S9 was capable of inhibiting the stimulated acid although both sch preparations produced a clear trend towards lowered mean values during the basal secretion period all three test S9 had an inhibiting action on acid concentration but none of them had a significant effect during the stimulation period the ph value was savely increased only by sch and sch OX and this even only during the basal period the results are discussed,112,oxazepam
11397,6,a doubleblind T0 with intraindividual comparisons was carried out to investigate the effects of mg of ralphahydroxyisopropylalphahtropanium bromidetropate sch mg sch mg OX mg oxazepam and placebo with p.o. in randomized CS on gastric juice volume amount of acid concentration and ph values in healthy volunteers the secretion parameters were measured during a h basal period and a h stimulation period the gastric juice was obtained in min portions via stomach tube stimulation was effected by mugkgh PG via drip infusion the friedman test was used for the comparative statistical evaluation and individual comparisons were carried out by means of the wilcoxon test pairdifferences rank the results show that sch and sch OX were equal in effect on basal and stimulated secretion volume as compared with PL it was not possible to establish an effect on secretion volume for oxazepam CT sch and sch oxazepam were found to be equipotent in reducing the amount of basal acid while oxazepam reduced this quantity only during the first min of basal secretion none of the three AS S9 was capable of inhibiting the stimulated acid although both sch preparations produced a clear trend towards lowered mean values during the basal secretion period all three test S9 had an inhibiting action on acid concentration but none of them had a significant effect during the stimulation period the ph value was savely increased only by sch and sch OX and this even only during the basal period the results are discussed,234,oxazepam
2676992,49,"two rb erythrocyte casein kinases gtpcasein kinase i and gtpcasein kinase ii have been purified and fold respectively studies employing sucrose density gradient centrifugation indicate that kinase i has a molecular weight of about s and kinase ii about s these enzymes can utilize either atp or gtp as the phosphoryl donor among various protein substrates examined these kinases catalyze the phosphorylation of CS greater than dephosphorylated phosvitin congruent to dephosphorylated CS greater than phosvitin histones protamine and bovine serum Al are poor phosphoryl acceptors kinetic data indicate that both enzymes are inhibited by high casein substrate concentrations which may be partially relieved by nacl both phosphotransferases require mg for activity and are optimally AS at ph the enzymes have apparent km values of m for gtp m for atp and mgml for CS the incorporation of the terminal phosphate of gtp into CS as catalyzed by these enzymes is inhibited to varying degrees by atp itp adp and gdp but not by utp ctp gmp cAMP and guanosine cyclic monophosphate in addition naf and diphosphoglyceric acid are also found to inhibit the activity of both kinases the effect of 2,3-DPG is interesting and suggests that this metabolite may regulate the activity of the CK in the red blood cells",204,casein kinases
1848382,49,"two rb erythrocyte casein kinases gtpcasein kinase i and gtpcasein kinase ii have been purified and fold respectively studies employing sucrose density gradient centrifugation indicate that kinase i has a molecular weight of about s and kinase ii about s these enzymes can utilize either atp or gtp as the phosphoryl donor among various protein substrates examined these kinases catalyze the phosphorylation of CS greater than dephosphorylated phosvitin congruent to dephosphorylated CS greater than phosvitin histones protamine and bovine serum Al are poor phosphoryl acceptors kinetic data indicate that both enzymes are inhibited by high casein substrate concentrations which may be partially relieved by nacl both phosphotransferases require mg for activity and are optimally AS at ph the enzymes have apparent km values of m for gtp m for atp and mgml for CS the incorporation of the terminal phosphate of gtp into CS as catalyzed by these enzymes is inhibited to varying degrees by atp itp adp and gdp but not by utp ctp gmp cAMP and guanosine cyclic monophosphate in addition naf and diphosphoglyceric acid are also found to inhibit the activity of both kinases the effect of 2,3-DPG is interesting and suggests that this metabolite may regulate the activity of the CK in the red blood cells",190,diphosphoglycerate
2085819,57,the MICs of hybrid dimers of ALP phosphatase containing two chemically modified subunits have been investigated one hybrid species was prepared by dissociation and reconstitution of a mixture of two SCV produced by chemical modification of the native enzyme with SA and TNM respectively the succinylnitrotyrosyl hybrid was separated from the other members of the hybrid set by deaesephadex chromatography and then converted to a succinylaminotyrosyl hybrid by reduction of the modified IMT residues with sodium dithionite a comparison of the MICs of these two hybrids with the activities of the succinyl nitrotyrosyl and aminotyrosyl derivatives has shown that either the subunits of ALP phosphatase CF independently or if the subunits turnover alternately in a reciprocating mechanism then the intrinsic activity of each S1 must be strongly dependent on its partner S1,42,tetranitromethane


### So, based on the above results, a single TEXT can contain more than one abbreviation at different locations. Each abbreviation gets its own row, which is why the same ABSTRACT_ID and TEXT appear in multiple rows.
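
To quantify this, the rows can be grouped by ABSTRACT_ID. The following is a minimal sketch on a toy frame: the rows for IDs 6 and 49 reuse the LOCATION and LABEL values shown above, while ID 1000 is a hypothetical single-row abstract added for contrast. The column names `n_rows` and `n_labels` are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    'ABSTRACT_ID': [6, 6, 49, 49, 1000],
    'LOCATION': [112, 234, 204, 190, 5],
    'LABEL': ['oxazepam', 'oxazepam', 'casein kinases',
              'diphosphoglycerate', 'placeholder label'],
})

# One summary row per abstract: how many rows it has and
# how many distinct expansions those rows carry
per_abstract = toy.groupby('ABSTRACT_ID').agg(
    n_rows=('LOCATION', 'count'),
    n_labels=('LABEL', 'nunique'),
)
print(per_abstract)

# Abstracts that contribute more than one row
repeated = per_abstract[per_abstract['n_rows'] > 1]
print(len(repeated))
```

Abstract 6 illustrates one case (the same expansion at two locations) and abstract 49 the other (two different expansions in one text). Running the same grouping on `textDF` would show how common each case is.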