# Abbreviation Disambiguation in Medical Texts - Data Preprocessing

This notebook continues from the notebook 'Step 1- Data Wrangling and EDA' and covers:

1. Data preprocessing performed on the dataset.
2. Exploring the preprocessed data.

## Step# 1: Load the datasets

In [None]:
# Let's install the spaCy library and its small English model (uncomment if needed)
#!pip install spacy
#!python -m spacy download en_core_web_sm

In [None]:
#Importing the Required Python Packages
import os
import shutil
import string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
# Show full column contents; None (rather than the deprecated -1) means "no width limit"
pd.set_option('display.max_colwidth', None)

In [None]:
# Let's load the default English model of spaCy
nlp = spacy.load('en_core_web_sm')

In [None]:
# Let's load the train dataset.
train = pd.read_csv('Data/train.csv')

In [None]:
# Let's check the dataset
train.head()

## Step# 2a: Create a new feature 'ABV'

In [None]:
# Let's create a function that derives the new feature 'ABV' from the dataset
def createFeature(df):
    # LOCATION is the index of the abbreviation among the space-separated words of TEXT
    return [x.split(' ')[y] for x,y in zip(df['TEXT'], df['LOCATION'])]

In [None]:
train['ABV'] = createFeature(train)
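
As a quick illustration of the lookup that `createFeature` performs, here is a minimal sketch on two made-up rows (the sentences and positions are assumptions for illustration, not taken from the dataset). It assumes, as the function above does, that `LOCATION` is a 0-based index into the space-separated words of `TEXT`.

```python
# Toy rows, invented for illustration only.
rows = [
    {"TEXT": "the patient had an MI last year", "LOCATION": 4},
    {"TEXT": "CT of the chest was normal", "LOCATION": 0},
]

# Same logic as createFeature: split on single spaces, then index by LOCATION.
abv = [row["TEXT"].split(" ")[row["LOCATION"]] for row in rows]
print(abv)
```

Because the index refers to the raw text, the feature has to be extracted before lowercasing, punctuation removal, or any other step that could shift word positions.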

### Let's check which abbreviations occur most often and keep the top 20 for further processing, due to hardware limitations

In [None]:
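# Count the rows for each (abbreviation, label) pair
grouped = train.groupby(by=['ABV', 'LABEL'], as_index = False, sort = False).count()
# Sort the pairs by row count, most frequent first
grouped = grouped.sort_values(by='TEXT', ascending = False)
grouped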

In [None]:
# Let's take the top 20 abbreviations for further processing
topAbv = grouped['ABV'][:20]
topAbv
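
One thing to keep in mind: `grouped` counts rows per (ABV, LABEL) pair, so its first 20 rows can repeat the same abbreviation under different labels and yield fewer than 20 distinct abbreviations. The sketch below shows the difference on made-up pairs (the data is an assumption for illustration). In pandas, a per-abbreviation ranking could be obtained with `train['ABV'].value_counts().head(20).index`.

```python
from collections import Counter

# Made-up (abbreviation, label) observations, for illustration only.
pairs = (
    [("MI", "myocardial infarction")] * 3
    + [("MI", "mitral insufficiency")] * 2
    + [("CT", "computed tomography")]
)

# Ranking by (abbreviation, label) pair: the same abbreviation can appear twice.
pair_counts = Counter(pairs)
top_from_pairs = [abv for (abv, _label), _n in pair_counts.most_common(2)]

# Ranking by abbreviation alone: every entry is distinct.
abv_counts = Counter(abv for abv, _label in pairs)
top_abv = [abv for abv, _n in abv_counts.most_common(2)]

print(top_from_pairs)
print(top_abv)
```

Since the list is only used with `isin`, duplicates do not break the filter; they only reduce the number of distinct abbreviations kept.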

### Let's extract only the above 20 abbreviations from the train set for further processing.

In [None]:
train = train[train['ABV'].isin(topAbv)]
train.shape

## Step# 2: Data Preprocessing

The following preprocessing steps are performed on the dataset:

    a. Create a new feature 'ABV' for the abbreviation, derived directly from the Location and Text columns (already done above).
    b. Convert the text to lowercase.
    c. Remove punctuation from the Text column.
    d. Tokenize the Text column.
    e. Drop the Abstract_id, Location and Text columns.
    f. Remove stop words from the tokens.

So, let's start with the remaining data preprocessing sub-steps.

## Step# 2b: Convert the data to lowercase

In [None]:
# Let's create a function to convert all the text to lowercase
def tolower(df):
    return [t.lower() for t in df['TEXT']]

## Step# 2c: Remove Punctuation

In [None]:
# Let's create a function to remove all punctuation from the text
def removePunctuation(df):
    return [t.translate(str.maketrans('','',string.punctuation)) for t in df['TEXT']]
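
To see what lowercasing and punctuation removal do to a sentence, here is a minimal sketch on a made-up string (the sentence is an assumption, not from the dataset). Note that hyphenated terms are fused into one word and that a standalone hyphen leaves two consecutive spaces behind.

```python
import string

# Translation table that deletes every character in string.punctuation.
table = str.maketrans("", "", string.punctuation)

text = "Acute MI (non-ST elevation) - treated with PCI."
cleaned = text.lower().translate(table)

# repr() makes the double space after "elevation" visible.
print(repr(cleaned))
```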

## Step# 2d: Tokenize the text column and save the tokenized data in a new column 'TOKEN'.

In [None]:
# Let's create a function to tokenize the Text column of the dataset
def createTokens(df):
    return df['TEXT'].apply(lambda x: x.split(' '))
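
Because punctuation removal can leave consecutive spaces, `split(' ')` may produce empty-string tokens. The sketch below, on a made-up string, contrasts it with `split()` called without an argument, which collapses runs of whitespace. Swapping it into `createTokens` would be a design choice, not something this notebook does.

```python
# Made-up, already cleaned text with a double space in the middle.
cleaned = "acute mi nonst elevation  treated with pci"

# Splitting on a single space keeps an empty token where the double space was.
tokens_single_space = cleaned.split(" ")

# Splitting on any whitespace run drops it.
tokens_whitespace = cleaned.split()

print(tokens_single_space)
print(tokens_whitespace)
```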

## Step# 2e: Dropping the columns not needed

In [None]:
# Let's create a function to drop the "ABSTRACT_ID", "LOCATION" and "TEXT" columns from the dataset
def dropCols(df):
    return df.drop(columns=['ABSTRACT_ID', 'LOCATION', 'TEXT'])

## Step# 2f: Remove stop words

In [None]:
# Let's create a function to remove stop words from the TOKEN column
def removeStop(df):
    # Work on a copy so spaCy's shared STOP_WORDS set is not modified in place
    stopWords = set(spacy.lang.en.stop_words.STOP_WORDS)
    # Keep any stop word that is also one of this dataset's abbreviations
    stopWords -= set(df['ABV'].str.lower())
    return df['TOKEN'].apply(lambda x: [item for item in x if item not in stopWords])
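
The reason for exempting abbreviations is that some of them, once lowercased, collide with ordinary stop words. A minimal sketch of the idea, using a small hypothetical stop-word set as a stand-in for spaCy's list and made-up tokens:

```python
# Hypothetical stand-in for a stop-word list (not spaCy's actual set).
stop_words = {"the", "of", "was", "us", "a"}

# Made-up abbreviations; "US" lowercases to "us", which is in the stand-in set.
abbreviations = {"US", "CT"}

# Build a new set rather than mutating stop_words.
effective_stop_words = stop_words - {a.lower() for a in abbreviations}

tokens = ["the", "us", "of", "the", "abdomen", "was", "normal"]
kept = [t for t in tokens if t not in effective_stop_words]

print(kept)
```

Without the exemption, the token for the abbreviation itself would be dropped along with the other stop words.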

## Let's create a function to apply all the above preprocessing steps to the dataset

In [None]:
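def preProcessData(df):
    # Work on a copy so the caller's DataFrame (often a filtered slice) is left
    # unchanged and pandas does not raise SettingWithCopyWarning
    df = df.copy()
    df['TEXT'] = tolower(df)
    df['TEXT'] = removePunctuation(df)
    df['TOKEN'] = createTokens(df)
    df = dropCols(df)
    df['TOKEN'] = removeStop(df)
    return df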

### Apply the above preProcessData function to train set

In [None]:
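# Create the output folder if it does not exist, then save the processed train set
os.makedirs('Train', exist_ok=True)
preProcessData(train).to_csv('Train/train_final.csv', index = False)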

### Let's load the validation and test sets for preprocessing.

In [None]:
# Let's load the valid and test datasets as well.
valid = pd.read_csv('Data/valid.csv')
test = pd.read_csv('Data/test.csv')

In [None]:
valid.head(3)

In [None]:
test.head(3)

### Let's only use the Abbreviations for which we are training the model

In [None]:
# Create ABV feature for valid and test sets.
valid['ABV'] = createFeature(valid)
test['ABV'] = createFeature(test)

In [None]:
# Filter the valid and test datasets based on the topAbv list and check their shapes
valid = valid[valid['ABV'].isin(topAbv)]
test = test[test['ABV'].isin(topAbv)]
print('Valid:', valid.shape)
print('Test:', test.shape)

### Again, due to hardware limitations, let's use only the first 10K rows of the validation and test sets.

In [None]:
valid = valid[:10000]
test = test[:10000]

### Let's apply the preprocessing steps to the valid and test datasets

In [None]:
valid = preProcessData(valid)

In [None]:
test = preProcessData(test)

In [None]:
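# Create the output folder if it does not exist, then save the processed validation set
os.makedirs('Validation', exist_ok=True)
valid.to_csv('Validation/valid_final.csv', index = False)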

In [None]:
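# Create the output folder if it does not exist, then save the processed test set
os.makedirs('Test', exist_ok=True)
test.to_csv('Test/test_final.csv', index = False)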
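
One caveat when these files are read back: the `TOKEN` column holds Python lists, and a CSV stores each list as its string representation. The sketch below reproduces the effect with the standard `csv` module and made-up values, then restores the list with `ast.literal_eval`. Applying the same conversion to a column loaded with `pd.read_csv` is an assumption about how a later step might consume the files.

```python
import ast
import csv
import io

# Write one made-up row whose TOKEN cell is a list; csv stringifies it with str().
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["ABV", "TOKEN"])
writer.writerow(["US", ["us", "abdomen", "normal"]])

# Read the row back: the TOKEN cell is now plain text, not a list.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["TOKEN"]

# Safely parse the text back into a list of tokens.
tokens = ast.literal_eval(raw)

print(type(raw).__name__, raw)
print(type(tokens).__name__, tokens)
```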