#### Import libraries

In [19]:
!pip install numpy pandas cufflinks matplotlib nltk




In [20]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt  # data visualization
import re  # regular expressions
import string  # string constants such as string.punctuation
import math  # mathematical operations

import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

import nltk


Load the training and test datasets.
The training dataset contains 11,550 rows and 2 columns. Each row holds a single medical abstract together with the label of the class it belongs to. The medical abstracts describe patient health information.

The medical abstracts are classified into 5 classes:\
1 : Neoplasms\
2 : Digestive system diseases\
3 : Nervous system diseases\
4 : Cardiovascular diseases\
5 : General pathological conditions


In [21]:
# Importing datasets
df_train = pd.read_csv("https://raw.githubusercontent.com/sebischair/Medical-Abstracts-TC-Corpus/main/medical_tc_train.csv")
df_test = pd.read_csv("https://raw.githubusercontent.com/sebischair/Medical-Abstracts-TC-Corpus/main/medical_tc_test.csv")
labels = pd.read_csv("https://raw.githubusercontent.com/sebischair/Medical-Abstracts-TC-Corpus/main/medical_tc_labels.csv")
print("Training data shape : ", df_train.shape, "\nTesting data shape : ", df_test.shape, "\nLabels: \n", labels)


Training data shape :  (11550, 2) 
Testing data shape :  (2888, 2) 
Labels: 
    condition_label                   condition_name
0                1                        neoplasms
1                2        digestive system diseases
2                3          nervous system diseases
3                4          cardiovascular diseases
4                5  general pathological conditions
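
For more readable output later on, the numeric labels can be mapped to their names. The sketch below hard-codes the five rows printed above so that it runs on its own. The names `LABEL_NAMES` and `label_name` are illustrative and not part of the original notebook.

```python
# Label lookup copied from the labels table printed above.
# LABEL_NAMES and label_name are hypothetical helper names.
LABEL_NAMES = {
    1: "neoplasms",
    2: "digestive system diseases",
    3: "nervous system diseases",
    4: "cardiovascular diseases",
    5: "general pathological conditions",
}


def label_name(condition_label):
    """Return the condition name for a numeric label, or 'unknown' if absent."""
    return LABEL_NAMES.get(int(condition_label), "unknown")


print(label_name(5))
```

In the notebook itself, the same dictionary could be built from the loaded table with `dict(zip(labels["condition_label"], labels["condition_name"]))` and applied with `df_train["condition_label"].map(...)`.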


In [22]:
print("First few rows of training dataset: \n", df_train.head(10))
print("\nFirst few rows of testing dataset: \n", df_test.head(10))


First few rows of training dataset: 
    condition_label                                   medical_abstract
0                5  Tissue changes around loose prostheses. A cani...
1                1  Neuropeptide Y and neuron-specific enolase lev...
2                2  Sexually transmitted diseases of the colon, re...
3                1  Lipolytic factors associated with murine and h...
4                3  Does carotid restenosis predict an increased r...
5                3  The shoulder in multiple epiphyseal dysplasia....
6                2  The management of postoperative chylous ascite...
7                4  Pharmacomechanical thrombolysis and angioplast...
8                5  Color Doppler diagnosis of mechanical prosthet...
9                5  Noninvasive diagnosis of right-sided extracard...

First few rows of testing dataset: 
    condition_label                                   medical_abstract
0                3  Obstructive sleep apnea following topical orop...
1             

#### Check for missing values in the datasets

In [23]:
## checking for missing values (na's)
print("Missing values in training dataset \n", df_train.isnull().sum())
print("Missing values in testing dataset \n", df_test.isnull().sum())

Missing values in training dataset 
 condition_label     0
medical_abstract    0
dtype: int64
Missing values in testing dataset 
 condition_label     0
medical_abstract    0
dtype: int64


In [24]:
# drop rows with missing values, if any, from both datasets

df_train.dropna(inplace=True)
df_test.dropna(inplace=True)
df_train.isnull().sum()


condition_label     0
medical_abstract    0
dtype: int64
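
Note that `isnull()` only counts NaN/None values. An abstract stored as an empty or whitespace-only string is not reported as missing. A minimal sketch of that extra check on a toy list follows; the helper `count_blank` and the sample strings are assumptions made for illustration.

```python
def count_blank(texts):
    """Count entries that are None, empty, or whitespace-only."""
    return sum(1 for t in texts if t is None or not str(t).strip())


# Toy data: one real abstract title and three blank entries.
sample = ["Tissue changes around loose prostheses.", "", "   ", None]
print(count_blank(sample))
```

On the DataFrame, `df_train["medical_abstract"].str.strip().eq("").sum()` would count the blank strings in the same spirit.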

#### Show an example of a medical abstract for each class

In [25]:
# examples of each category

# 1 : Neoplasms
print("NEOPLASMS example :\n", df_train[df_train['condition_label'] == 1]['medical_abstract'].values[0])
# 2 : Digestive system diseases
print("DIGESTIVE SYSTEM disease example :\n", df_train[df_train['condition_label'] == 2]['medical_abstract'].values[0])
# 3 : Nervous system diseases
print("NERVOUS SYSTEM diseases:\n", df_train[df_train['condition_label'] == 3]['medical_abstract'].values[0])
# 4 : Cardiovascular diseases
print("CARDIOVASCULAR diseases\n", df_train[df_train['condition_label'] == 4]['medical_abstract'].values[0])
# 5 : General pathological conditions
print("GENERAL PATHOLOGICAL conditions :\n", df_train[df_train['condition_label'] == 5]['medical_abstract'].values[0])


NEOPLASMS example :
 Neuropeptide Y and neuron-specific enolase levels in benign and malignant pheochromocytomas. Neuron-specific enolase (NSE) is the isoform of enolase, a glycolytic enzyme found in the neuroendocrine system. Neuropeptide Y (NPY) is a peptide recently discovered in the peripheral and central nervous systems. Serum NSE and plasma NPY levels have been reported to be increased in some patients with pheochromocytoma. The authors evaluated whether the measurement of these molecules could help to discriminate between benign and malignant forms of pheochromocytoma. The NSE levels were normal in all patients with benign pheochromocytoma (n = 13) and elevated in one half of those with malignant pheochromocytoma (n = 13). Plasma NPY levels were on the average significantly higher in the malignant (177.1 +/- 38.9 pmol/l, n = 16) than in the benign forms of the disease (15.7 +/- 389 pmol/l, n = 24). However, there was no difference in the percentage of patients with elevated NPY 

#### Training data class frequency

In [26]:
# training set class frequency
df_train["condition_label"].value_counts()

condition_label
5    3844
1    2530
4    2441
3    1540
2    1195
Name: count, dtype: int64

In [27]:
# training set class proportions (fractions of 1, not percentages)
df_train["condition_label"].value_counts(normalize=True)

condition_label
5    0.332814
1    0.219048
4    0.211342
3    0.133333
2    0.103463
Name: proportion, dtype: float64

#### Test data class frequency

In [28]:
# test dataset class frequency
df_test["condition_label"].value_counts()

condition_label
5    961
1    633
4    610
3    385
2    299
Name: count, dtype: int64

In [29]:
# test dataset class proportions (fractions of 1, not percentages)
df_test["condition_label"].value_counts(normalize=True)

condition_label
5    0.332756
1    0.219183
4    0.211219
3    0.133310
2    0.103532
Name: proportion, dtype: float64
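
The proportions above are each class count divided by the total number of rows. The sketch below reproduces that arithmetic from the counts printed earlier and measures how closely the test split mirrors the training split. The variable and function names are illustrative, not from the original notebook.

```python
# Class counts copied from the value_counts() outputs above.
train_counts = {5: 3844, 1: 2530, 4: 2441, 3: 1540, 2: 1195}
test_counts = {5: 961, 1: 633, 4: 610, 3: 385, 2: 299}


def proportions(counts):
    """Divide each class count by the total number of rows."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


train_prop = proportions(train_counts)
test_prop = proportions(test_counts)

# Largest absolute gap between the train and test proportion of any class.
max_diff = max(abs(train_prop[c] - test_prop[c]) for c in train_counts)

for c in sorted(train_counts):
    print(c, round(train_prop[c], 6), round(test_prop[c], 6))
print("largest train/test gap:", round(max_diff, 6))
```

The gap stays below 0.001 for every class, so the two splits have nearly identical class distributions, with class 5 the largest and class 2 the smallest in both.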

#### Distribution of classes

In [30]:
# distribution of classes

df_train['condition_label'].value_counts(normalize=True).iplot(
    kind='bar',
    yTitle='Proportion',
    linecolor='black',
    opacity=0.7,
    color='red',
    theme='pearl',
    bargap=0.6,
    gridcolor='white',
    title='Distribution of Condition Classes in the Training Set')

In [31]:
# functions for preprocessing
def clean_text(text):
    """Lowercase the text and remove text in square brackets, links, HTML tags,
    punctuation, newlines, and words containing digits."""
    text = text.lower()
    # raw strings avoid invalid-escape warnings for patterns such as \[ and \S
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text


def text_preprocessing(text):
    """
    Cleaning and parsing the text.
    """
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    nopunc = clean_text(text)
    tokenized_text = tokenizer.tokenize(nopunc)
    # Stop-word removal is disabled. Enabling it requires `from nltk.corpus import stopwords`.
    # remove_stopwords = [w for w in tokenized_text if w not in stopwords.words('english')]
    combined_text = ' '.join(tokenized_text)
    return combined_text
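
To make the effect of these steps concrete, here is a self-contained sketch of the same cleaning pipeline applied to two short phrases. It repeats `clean_text` so that it runs on its own, and uses `re.findall(r'\w+', ...)` as a standard-library stand-in for the `RegexpTokenizer(r'\w+')` call. The name `preprocess` is hypothetical.

```python
import re
import string


def clean_text(text):
    """Same cleaning steps as in the notebook cell above."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                 # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # links
    text = re.sub(r'<.*?>+', '', text)                  # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)                # words containing digits
    return text


def preprocess(text):
    """Clean the text, then keep runs of word characters (tokenizer stand-in)."""
    return ' '.join(re.findall(r'\w+', clean_text(text)))


print(preprocess("Neuron-specific enolase (NSE)"))
print(preprocess("interleukin-1 and prostaglandin E2 activity"))
```

Two side effects are worth knowing. Because punctuation is deleted rather than replaced by a space, hyphenated terms are fused ("neuron-specific" becomes "neuronspecific"). Because the hyphen is removed before the digit filter runs, "interleukin-1" becomes "interleukin1" and is then dropped entirely, as is "E2".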

In [32]:
# Apply the preprocessing function to both the training and test datasets
df_train['text_clean'] = df_train['medical_abstract'].apply(str).apply(text_preprocessing)
df_test['text_clean'] = df_test['medical_abstract'].apply(str).apply(text_preprocessing)

In [33]:
df_train['text_clean'].head()


0    tissue changes around loose prostheses a canin...
1    neuropeptide y and neuronspecific enolase leve...
2    sexually transmitted diseases of the colon rec...
3    lipolytic factors associated with murine and h...
4    does carotid restenosis predict an increased r...
Name: text_clean, dtype: object

In [34]:

df_train['text_clean'].values[0]

'tissue changes around loose prostheses a canine model to investigate the effects of an antiinflammatory agent the aseptically loosened prosthesis provided a means for investigating the in vivo and in vitro activity of the cells associated with the loosening process in seven dogs the cells were isolated and maintained in culture for sufficient periods of time so that their biologic activity could be studied as well as the effect of different agents added to the cells in vivo or in vitro the biologic response as determined by and prostaglandin activity paralleled the roentgenographic appearance of loosening and the technetium images and observations made at the time of revision surgery the correlation between clinical roentgenographic histologic and biochemical loosening indicates that the canine model is suitable for investigating the mechanisms of prosthetic failure a canine model permits the study of possible nonsurgical therapeutic interventions with the ultimate hope of stopping or

In [35]:
# Compare the first raw abstract with its cleaned version (read from the DataFrame instead of hard-coding it)
print("ORIGINAL TEXT: \n" + df_train['medical_abstract'].values[0], "\n\n CLEAN TEXT:\n", df_train['text_clean'].values[0])

ORIGINAL TEXT: 
Tissue changes around loose prostheses. A canine model to investigate the effects of an antiinflammatory agent. The aseptically loosened prosthesis provided a means for investigating the in vivo and in vitro activity of the cells associated with the loosening process in seven dogs. The cells were isolated and maintained in culture for sufficient periods of time so that their biologic activity could be studied as well as the effect of different agents added to the cells in vivo or in vitro. The biologic response as determined by interleukin-1 and prostaglandin E2 activity paralleled the roentgenographic appearance of loosening and the technetium images and observations made at the time of revision surgery. The correlation between clinical, roentgenographic, histologic, and biochemical loosening indicates that the canine model is suitable for investigating the mechanisms of prosthetic failure. A canine model permits the study of possible nonsurgical therapeutic interventi

In [36]:
df_test['text_clean'].head()

0    obstructive sleep apnea following topical orop...
1    neutrophil function and pyogenic infections in...
2    a phase ii study of combined methotrexate and ...
3    flow cytometric dna analysis of parathyroid tu...
4    paraneoplastic vasculitic neuropathy a treatab...
Name: text_clean, dtype: object

In [37]:
# add word cloud
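
A word cloud is drawn from word frequencies, so a first step could be to count the most common words in the cleaned text. The sketch below does this with the standard library only. The helper `top_words`, its `min_len` cutoff, and the two sample strings are assumptions made for illustration.

```python
from collections import Counter


def top_words(texts, n=10, min_len=3):
    """Return the n most frequent words with at least min_len characters."""
    counts = Counter()
    for text in texts:
        counts.update(w for w in text.split() if len(w) >= min_len)
    return counts.most_common(n)


# Toy input in the same shape as the 'text_clean' column.
sample = [
    "tissue changes around loose prostheses a canine model",
    "the canine model is suitable for investigating loosening",
]
print(top_words(sample, n=3))
```

In the notebook, `top_words(df_train['text_clean'])` would give the overall counts, and filtering by `condition_label` first would give per-class counts.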