### Download the data

In [24]:
!wget https://raw.githubusercontent.com/suvigyajain0101/CaseStudies/main/AdverseEventClassification/Data/AE_Data.csv

--2022-08-18 20:39:46--  https://raw.githubusercontent.com/suvigyajain0101/CaseStudies/main/AdverseEventClassification/Data/AE_Data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5998096 (5.7M) [text/plain]
Saving to: ‘AE_Data.csv.2’


2022-08-18 20:39:47 (63.4 MB/s) - ‘AE_Data.csv.2’ saved [5998096/5998096]



### Import Libraries

In [25]:
import pandas as pd
import numpy as np
import re

In [26]:
WORDS_TO_REMOVE = ['##padding##', 'ti-', 'ti -']

In [27]:
df = pd.read_csv('/content/AE_Data.csv')
df.head()

Unnamed: 0,title,abstract,label
0,antimicrobial impacts of essential oils on foo...,the antimicrobial activity of twelve essential...,0
1,purification and characterization of a cystein...,antimicrobial peptide (amp) crustin is a type ...,0
2,telavancin activity tested against gram-positi...,objectives: to reassess the activity of telava...,0
3,the in vitro antimicrobial activity of cymbopo...,background: it is well known that cymbopogon (...,0
4,screening currency notes for microbial pathoge...,fomites are a well-known source of microbial i...,0


In [28]:
df['label'].value_counts()

0    3851
1     294
Name: label, dtype: int64

### Data Cleaning

1. Combine title and abstract
2. Lowercase the entire corpus
3. Remove newlines and tabs from the dataset
4. Remove the title identifier ('TI -'), the '##PADDING##' token, and all non-alphanumeric characters (brackets, #, colons, etc.)
5. Remove records with 10 or fewer words

In [29]:
df['text'] = df['title'] + ' ' + df['abstract']
df.head()

Unnamed: 0,title,abstract,label,text
0,antimicrobial impacts of essential oils on foo...,the antimicrobial activity of twelve essential...,0,antimicrobial impacts of essential oils on foo...
1,purification and characterization of a cystein...,antimicrobial peptide (amp) crustin is a type ...,0,purification and characterization of a cystein...
2,telavancin activity tested against gram-positi...,objectives: to reassess the activity of telava...,0,telavancin activity tested against gram-positi...
3,the in vitro antimicrobial activity of cymbopo...,background: it is well known that cymbopogon (...,0,the in vitro antimicrobial activity of cymbopo...
4,screening currency notes for microbial pathoge...,fomites are a well-known source of microbial i...,0,screening currency notes for microbial pathoge...


In [30]:
df.replace(r'\n','', regex=True).iloc[4140, :]['text']

'TI  - [AUTO-INFECTION (INTESTINAL) IN RADIATION SICKNESS AND ITS PREVENTION IN WISTAR WHITE RATS]. ##PADDING##'

In [31]:
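# Build one alternation pattern; re.escape keeps each entry a literal match
joined_words_to_remove = '|'.join(re.escape(word) for word in WORDS_TO_REMOVE)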


def clean_text(x):
  # Lower case the text
  lower_x = x.lower()

  # Remove line breaks and tabs
  no_break_x = re.sub("\n|\r|\t", " ", lower_x)

  # Remove specific words
  no_waste_words_x = re.sub(joined_words_to_remove, " ", no_break_x)

  # Remove everything except letters, digits and spaces
  alpha_x = re.sub('[^0-9a-zA-Z ]+', ' ', no_waste_words_x)

  # Replace runs of whitespace with a single space
  beta_x = ' '.join(alpha_x.split())

  return beta_x
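
One caveat with the pattern above: `'ti-'` is matched anywhere in the string, so it also fires inside hyphenated words such as `anti-inflammatory`. It also misses a marker written with extra spaces, like the `TI  - ` in the record printed earlier. Below is a minimal sketch of a stricter variant, assuming the title identifier only ever appears at the very start of a record. The names `clean_text_anchored`, `TITLE_MARKER` and `PADDING_TOKEN` are hypothetical and not part of the original notebook.

```python
import re

# Hypothetical stricter patterns.
# Assumption: the "TI -" marker only occurs at the start of a record.
TITLE_MARKER = re.compile(r'^\s*ti\s*-\s*')
PADDING_TOKEN = re.compile(re.escape('##padding##'))


def clean_text_anchored(x):
  # Lower case the text and replace line breaks and tabs with spaces
  text = re.sub(r'[\n\r\t]', ' ', x.lower())

  # Strip the leading title marker, tolerating any whitespace around the hyphen
  text = TITLE_MARKER.sub(' ', text)

  # Remove the padding token wherever it appears
  text = PADDING_TOKEN.sub(' ', text)

  # Keep only letters, digits and spaces, then collapse repeated whitespace
  text = re.sub(r'[^0-9a-z ]+', ' ', text)
  return ' '.join(text.split())


print(clean_text_anchored('TI  - [AUTO-INFECTION (INTESTINAL) IN RADIATION SICKNESS]. ##PADDING##'))
print(clean_text_anchored('anti-inflammatory and multi-drug therapy'))
```

The trade-off is that a title which genuinely begins with "ti-" would lose that prefix, so this variant is only appropriate if the assumption about the marker position holds for the dataset.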

Let's test the function on a few examples.

In [32]:
for sample_text in df.sample(5)['text'].values:
  print('ORIGINAL TEXT : ', sample_text)
  print('-'*100)
  print('CLEANED TEXT : ', clean_text(sample_text))
  print('\n')
  print('*'*100)

ORIGINAL TEXT :  successful treatment of chronic bone and joint infections with oral linezolid.
 linezolid is an attractive alternative for the treatment of chronic bone and joint infections because it is active against common pathogens including methicillin-resistant staphylococci and vancomycin-resistant enterococci and because its oral formulation is convenient for long-term administration. to evaluate the ability of linezolid to produce long-term remission, we prospectively monitored 11 consecutive adult patients who received linezolid for  osteomyelitis (n = 9) or prosthetic joint infection (n = 2). linezolid 600 mg was administered orally twice daily for a mean of 10 weeks (range, 6 to 19 weeks). pathogens were methicillin-resistantother_species (n = 5), methicillin-resistant coagulase-negative staphylococci (n = 4), vancomycin-resistantother_species (n = 1), and vancomycin-sensitiveother_species (n = 1). after a mean followup of 27 months (range, 17 to 41 months), all 11 patient

In [33]:
# Apply cleaning function to the text field
df['clean_text'] = df['text'].apply(clean_text)

# Get the word count and keep only records with more than 10 words
df['text_len'] = df['clean_text'].str.split().apply(len)

cleaned_df = df[df['text_len'] > 10][['clean_text', 'label']]
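
Note that the filter `text_len > 10` keeps records with 11 or more words, so a record with exactly 10 words is dropped. The small sketch below illustrates the boundary with made-up rows; the `toy` frame is hypothetical and not drawn from the dataset.

```python
import pandas as pd

# Hypothetical toy data: three records of 9, 10 and 11 words
toy = pd.DataFrame({
  'clean_text': [' '.join(['w'] * n) for n in (9, 10, 11)],
  'label': [0, 0, 1],
})

# Word count per record, computed the same way as in the cell above
toy['text_len'] = toy['clean_text'].str.split().apply(len)

# Same strict inequality as the notebook's filter
kept = toy[toy['text_len'] > 10]
print(kept['text_len'].tolist())
```

Only the 11-word record survives; switching to `>= 10` would be the change to make if 10-word records should be retained.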

In [34]:
cleaned_df.head()

Unnamed: 0,clean_text,label
0,antimicrobial impacts of essential oils on foo...,0
1,purification and characterization of a cystein...,0
2,telavancin activity tested against gram positi...,0
3,the in vitro antimicrobial activity of cymbopo...,0
4,screening currency notes for microbial pathoge...,0


In [35]:
print('Total records retained after data cleaning : ', cleaned_df.shape[0])
cleaned_df['label'].value_counts()

Total records retained after data cleaning :  4064


0    3770
1     294
Name: label, dtype: int64
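
Comparing these counts with the pre-cleaning totals shows how many records the length filter removed per class. The sketch below uses only the counts printed in this notebook; the variable names are illustrative.

```python
import pandas as pd

# Label counts copied from the notebook output before and after cleaning
before = pd.Series({0: 3851, 1: 294}, name='before')
after = pd.Series({0: 3770, 1: 294}, name='after')

# Side-by-side table with the number of records dropped per label
summary = pd.concat([before, after], axis=1)
summary['dropped'] = summary['before'] - summary['after']

print(summary)
print('Total dropped : ', summary['dropped'].sum())
```

With these numbers, all 81 dropped records belong to class 0, and the share of class 1 moves only slightly, from about 7.1% to about 7.2%.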