# Abbreviation Disambiguation in Medical Texts - Data Wrangling & EDA

This notebook documents:

1. How the data for the train and test sets was collected, i.e., the data sources.
2. How the data has been organized for training and testing the model.
3. Data cleaning.
4. Exploratory Data Analysis (EDA).

## Step# 1: Downloading Data from Data Sources.

            The data used in this problem is downloaded with the Kaggle API from the location below:

            https://www.kaggle.com/xhlulu/medal-emnlp

In [None]:
# Uncomment below lines to download the dataset directly from Kaggle
## To Install Kaggle package in case it is not already present
# !pip install kaggle
## Download dataset
# !kaggle datasets download -d xhlulu/medal-emnlp

### Once the dataset has been downloaded, let's check the directory contents

In [None]:
#Importing the Required Python Packages
import os
import shutil
import zipfile
import pandas as pd
import matplotlib.pyplot as plt

Printing the Current Working Directory

In [None]:
print(os.getcwd())

## Step# 2: Organizing data into Logical folder structure from where the system will read the data and process it.


    The Data for this problem will have the following Folder structure:

    Work_Dir
    |
    |
    |----> Data
    |
    |
    |----> Train
    |
    |
    |----> Test
    |
    |
    |----> Validation
    |
    |
    |----> Images
    
    The cells below create this folder structure.

In [None]:
#Reading the current path in a variable
path = os.getcwd()
print(path)

In [None]:
#Creating a folder by name Train
os.mkdir(os.path.join(path, 'Train'))

In [None]:
#Creating a folder by name Test
os.mkdir(os.path.join(path, 'Test'))

In [None]:
#Creating a folder by name Data
os.mkdir(os.path.join(path, 'Data'))

In [None]:
#Creating a folder by name Validation
os.mkdir(os.path.join(path, 'Validation'))

In [None]:
#Creating a folder by name Images
os.mkdir(os.path.join(path, 'Images'))
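
Note that `os.mkdir` raises `FileExistsError` when the folder is already present, so the cells above fail if the notebook is re-run. A minimal sketch of a re-runnable alternative follows; `make_work_dirs` is a hypothetical helper name, and the demo uses a temporary directory rather than the real working directory:

```python
import os
import tempfile


def make_work_dirs(base_dir, names=("Data", "Train", "Test", "Validation", "Images")):
    """Create each named folder under base_dir, skipping any that already exist."""
    created = []
    for name in names:
        folder = os.path.join(base_dir, name)
        os.makedirs(folder, exist_ok=True)  # no error if the folder is already there
        created.append(folder)
    return created


# Demonstrated on a throwaway directory so the real working directory is untouched.
with tempfile.TemporaryDirectory() as demo_dir:
    make_work_dirs(demo_dir)
    make_work_dirs(demo_dir)  # calling it a second time does not raise
    demo_listing = sorted(os.listdir(demo_dir))

print(demo_listing)
```

In the notebook itself, the same helper could be called as `make_work_dirs(path)`.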

Moving the NLP_Dataset.zip file into the Data folder and then unzipping the file

In [None]:
#Moving the file to Data folder
source = os.path.join(path, 'NLP_Dataset.zip')
destination = os.path.join(path, 'Data', 'NLP_Dataset.zip')
shutil.move(source, destination)

In [None]:
#Check the contents of Data folder
print(os.listdir(os.path.join(path, 'Data')))

In [None]:
#Unzip the dataset file
with zipfile.ZipFile(destination, 'r') as zip_ref:
    zip_ref.extractall(os.path.join(path, 'Data'))
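
Before extracting a large archive, it can help to list its members first. The sketch below shows this with `ZipFile.namelist()` on a small throwaway archive; `demo.zip` and its tiny CSV contents are invented stand-ins, not the real `NLP_Dataset.zip`:

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as demo_dir:
    archive = os.path.join(demo_dir, "demo.zip")

    # Build a tiny stand-in archive with two members.
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("pretrain_subset/train.csv", "a,b\n1,2\n")
        zf.writestr("full_data.csv", "a,b\n1,2\n")

    # List the members without extracting anything.
    with zipfile.ZipFile(archive, "r") as zf:
        members = zf.namelist()

print(members)
```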

In [None]:
#Check the contents of Data folder
print(os.listdir(os.path.join(path, 'Data')))

The data was extracted successfully and a new folder, 'pretrain_subset', was created. Let's check the contents of that folder.

In [None]:
#Check the contents of pretrain_subset folder
print(os.listdir(os.path.join(path, 'Data', 'pretrain_subset')))

The pretrain_subset folder contains the dataset already divided into three files: train, test and valid. We will use the data in this folder due to system memory restrictions.

In [None]:
# Moving the train, test and valid files to the Data folder, then deleting full_data.csv.
source = os.path.join(path, 'Data', 'pretrain_subset')
destination = os.path.join(path, 'Data')
shutil.move(os.path.join(source, 'train.csv'), os.path.join(destination, 'train.csv'))
shutil.move(os.path.join(source, 'test.csv'), os.path.join(destination, 'test.csv'))
shutil.move(os.path.join(source, 'valid.csv'), os.path.join(destination, 'valid.csv'))
os.remove(os.path.join(destination, 'full_data.csv'))

#Check the contents of Data folder
print(os.listdir(os.path.join(path, 'Data')))

## Step# 3: Cleaning Data

### Let's have a look at our data.

In [None]:
#Creating a path variable directly to the dataset
data_path = os.path.join(path, 'Data', 'train.csv')
print(data_path)

In [None]:
# Loading the data in a dataframe.
textDF = pd.read_csv(data_path)

In [None]:
#Checking the shape of Dataframe
textDF.shape

### The data contains 3 million rows, but my system cannot work with such a large dataset, so only the first 1 million rows are used for this project.

In [None]:
textDF.drop(textDF.index[1000000:], inplace = True)
textDF.shape
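
When memory is the constraint, an alternative is to avoid loading the full file at all: `pd.read_csv` accepts an `nrows` argument that reads only the first rows. A minimal sketch on an in-memory stand-in for `train.csv` (the rows are invented, and the column order is an assumption):

```python
import io

import pandas as pd

# Tiny stand-in for train.csv; the real file is read from disk in this notebook.
demo_csv = io.StringIO(
    "ABSTRACT_ID,TEXT,LOCATION,LABEL\n"
    "1,alpha beta gamma,1,label one\n"
    "2,delta epsilon,0,label two\n"
    "3,zeta eta theta,2,label three\n"
)

# Only the first two data rows are parsed; the rest of the file is never loaded.
demo_head = pd.read_csv(demo_csv, nrows=2)
print(demo_head.shape)
```

For the real data this would correspond to `pd.read_csv(data_path, nrows=1000000)`.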

In [None]:
#Checking the first 5 rows of the dataframe
textDF.head(5)

In [None]:
#Checking last 5 rows of the dataframe
textDF.tail(5)

In [None]:
#Checking the summary statistics of the dataframe
textDF.describe(include = 'all')

In [None]:
#Checking the datatypes of dataframe columns
textDF.dtypes

In [None]:
# Checking the unique Abstract_id to see if Abstract_id can be converted to Index
textDF['ABSTRACT_ID'].nunique()

Hence, the ABSTRACT_ID values are not all unique. So, let's check for duplicates in the dataset.

In [None]:
duplicate = textDF[textDF.duplicated()]
duplicate.head(5)

Hence, none of the rows are duplicates.
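
Looking at `head(5)` of the duplicated rows works, but a count is easier to read at a glance. The sketch below contrasts fully duplicated rows with repeated `ABSTRACT_ID` values on a small invented frame (`demo_df` is illustrative only, not the real data):

```python
import pandas as pd

# Invented rows: IDs repeat, but only the last two rows are identical in every column.
demo_df = pd.DataFrame({
    "ABSTRACT_ID": [1, 1, 2, 2],
    "TEXT": ["a b c", "a b c", "d e f", "d e f"],
    "LOCATION": [0, 2, 1, 1],
    "LABEL": ["x", "y", "z", "z"],
})

# Rows that repeat an earlier row in all columns.
n_full_duplicates = int(demo_df.duplicated().sum())

# Rows whose ABSTRACT_ID has already appeared earlier.
n_repeated_ids = int(demo_df["ABSTRACT_ID"].duplicated().sum())

print(n_full_duplicates, n_repeated_ids)
```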

In [None]:
# Let's check for null values, if any
textDF.isnull().values.any()

Thus, we don't have any Null values in the dataset.
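
`isnull().values.any()` gives a single yes/no answer. If nulls were present, a per-column count would show where they are. A minimal sketch on an invented frame (`demo_df` is illustrative only):

```python
import numpy as np
import pandas as pd

# Invented frame with one missing TEXT and one missing LOCATION.
demo_df = pd.DataFrame({
    "TEXT": ["a b", None, "c d"],
    "LOCATION": [0, 1, np.nan],
    "LABEL": ["x", "y", "z"],
})

# Number of missing values in each column.
null_counts = demo_df.isnull().sum()
print(null_counts)
```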

### At this point the Data Wrangling has been completed and the resulting data is now ready for EDA

## Step# 4: EDA

In [None]:
# Let's look at one row of the dataset in detail
# None means "no truncation"; passing -1 is deprecated in recent pandas versions
pd.set_option('display.max_colwidth', None)
textDF.head(1)

### As per the dataset specifications, the LOCATION column gives the number of words that precede the abbreviation in TEXT (i.e., its zero-based word index), and the abbreviation's expansion is provided in the LABEL column.

In [None]:
# Let's check the abbreviations in the first 10 rows of the dataset along with their labels
split_text = [ t.split(' ') for t in textDF[:10]['TEXT']]
label = [t for t in textDF[:10]['LABEL']]
location = [t for t in textDF[:10]['LOCATION']]

In [None]:
for i in range(0,10):
    print(label[i], ' -- ', split_text[i][location[i]])

From the above analysis, the relationship between the LOCATION, LABEL and TEXT columns is clearly visible.
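
The lookup used above can be wrapped in a small function and applied to every row. This is a sketch only: `abbreviation_at` is a hypothetical helper, the two example sentences are invented, and it assumes, as the cells above do, that `LOCATION` is a zero-based index into `TEXT` split on single spaces.

```python
import pandas as pd


def abbreviation_at(text, location):
    """Return the token at word index `location` after splitting on single spaces."""
    tokens = text.split(' ')
    if not 0 <= location < len(tokens):
        raise IndexError(f"location {location} is outside the text ({len(tokens)} tokens)")
    return tokens[location]


# Invented rows that mimic the TEXT / LOCATION / LABEL layout.
demo_df = pd.DataFrame({
    "TEXT": ["the patient received IV fluids", "elevated BP was noted"],
    "LOCATION": [3, 1],
    "LABEL": ["intravenous", "blood pressure"],
})

# Pull the abbreviation out of each row's text.
demo_df["ABBREVIATION"] = [
    abbreviation_at(text, loc)
    for text, loc in zip(demo_df["TEXT"], demo_df["LOCATION"])
]
print(demo_df[["ABBREVIATION", "LABEL"]])
```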

### Let us again check the number of unique ABSTRACT_ID in Dataset

In [None]:
# Checking the unique Abstract_id
textDF['ABSTRACT_ID'].nunique()

In [None]:
# Checking the shape of the Dataset
textDF.shape

It can be seen here that some ABSTRACT_ID values are not unique. Let's find those abstracts and check what the main differences are.

In [None]:
duplicate = textDF[textDF['ABSTRACT_ID'].duplicated(keep = False)]

In [None]:
duplicate.sort_values(by = ['ABSTRACT_ID']).head(5)

### So, based on the above results for duplicates, it can be seen that a single text might contain more than one abbreviation at different places. Thus, multiple rows are present, one per abbreviation.
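
To see how many abbreviation rows each abstract contributes, the rows can be counted per `ABSTRACT_ID`. A minimal sketch on invented data (`demo_df` is illustrative only):

```python
import pandas as pd

# Invented rows: abstract 10 has two annotated abbreviations, abstract 11 has one.
demo_df = pd.DataFrame({
    "ABSTRACT_ID": [10, 10, 11],
    "LOCATION": [3, 7, 2],
    "LABEL": ["x", "y", "z"],
})

# Number of rows (i.e. annotated abbreviations) for each abstract.
rows_per_abstract = demo_df.groupby("ABSTRACT_ID").size()
print(rows_per_abstract)
```

On the real data, `textDF.groupby('ABSTRACT_ID').size().describe()` would summarize that distribution.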

### Let's save the above textDF to a CSV file inside the Train folder for further use.

In [None]:
textDF.to_csv('Train/train.csv', index = False)

### Let's load valid.csv and test.csv as well

In [None]:
# Loading valid.csv
valid = pd.read_csv(os.path.join(path, 'Data', 'valid.csv'))
#Loading test.csv
test = pd.read_csv(os.path.join(path, 'Data', 'test.csv'))

In [None]:
# Check the shape of valid dataset
valid.shape

### The data contains 1 million rows, so let's reduce it to 20% of the train data, i.e., 200k records.

In [None]:
valid.drop(valid.index[200000:], inplace = True)
valid.shape

In [None]:
#Save this updated valid.csv to the Validation folder
valid.to_csv('Validation/valid.csv', index = False)

In [None]:
# Check the shape of test dataset
test.shape

### The data contains 1 million rows, so let's reduce it to 20% of the train data, i.e., 200k records.

In [None]:
test.drop(test.index[200000:], inplace = True)
test.shape

In [None]:
#Save this updated test.csv to Test folder
test.to_csv('Test/test.csv', index = False)