### Download the Data

In [1]:
!wget https://raw.githubusercontent.com/suvigyajain0101/CaseStudies/main/AdverseEventClassification/Data/AE_Data.csv

--2022-08-17 21:03:31--  https://raw.githubusercontent.com/suvigyajain0101/CaseStudies/main/AdverseEventClassification/Data/AE_Data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5998096 (5.7M) [text/plain]
Saving to: ‘AE_Data.csv’


2022-08-17 21:03:31 (221 MB/s) - ‘AE_Data.csv’ saved [5998096/5998096]



### Import Libraries

In [2]:
import pandas as pd
import numpy as np
import re

In [23]:
WORDS_TO_REMOVE = ['##padding##', 'ti-', 'ti -']

In [12]:
df = pd.read_csv('/content/AE_Data.csv')
df.head()

Unnamed: 0,title,abstract,label
0,antimicrobial impacts of essential oils on foo...,the antimicrobial activity of twelve essential...,0
1,purification and characterization of a cystein...,antimicrobial peptide (amp) crustin is a type ...,0
2,telavancin activity tested against gram-positi...,objectives: to reassess the activity of telava...,0
3,the in vitro antimicrobial activity of cymbopo...,background: it is well known that cymbopogon (...,0
4,screening currency notes for microbial pathoge...,fomites are a well-known source of microbial i...,0


In [4]:
df['label'].value_counts()

0    3851
1     294
Name: label, dtype: int64
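
The two counts above show that positive labels are rare: about 7% of records. As a quick check, the share of each class can be worked out from the counts alone. The names `label_counts` and `label_share` are illustrative and do not appear in the notebook.

```python
# Counts copied from the value_counts() output above
label_counts = {0: 3851, 1: 294}

total = sum(label_counts.values())

# Fraction of records carrying each label
label_share = {label: count / total for label, count in label_counts.items()}

for label, share in label_share.items():
    print(f"label {label}: {share:.1%}")
```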

### Data Cleaning

1. Combine title and abstract
2. Lowercase the entire corpus
3. Remove newlines and tabs from the dataset
4. Remove brackets, #, colons, 'TI' (title identifier), and '##PADDING##'
5. Remove records with fewer than 10 words

In [13]:
df['text'] = df['title'] + ' ' + df['abstract']
df.head()

Unnamed: 0,title,abstract,label,text
0,antimicrobial impacts of essential oils on foo...,the antimicrobial activity of twelve essential...,0,antimicrobial impacts of essential oils on foo...
1,purification and characterization of a cystein...,antimicrobial peptide (amp) crustin is a type ...,0,purification and characterization of a cystein...
2,telavancin activity tested against gram-positi...,objectives: to reassess the activity of telava...,0,telavancin activity tested against gram-positi...
3,the in vitro antimicrobial activity of cymbopo...,background: it is well known that cymbopogon (...,0,the in vitro antimicrobial activity of cymbopo...
4,screening currency notes for microbial pathoge...,fomites are a well-known source of microbial i...,0,screening currency notes for microbial pathoge...


In [18]:
df.replace(r'\n','', regex=True).iloc[4140, :]['text']

'TI  - [AUTO-INFECTION (INTESTINAL) IN RADIATION SICKNESS AND ITS PREVENTION IN WISTAR WHITE RATS]. ##PADDING##'

In [32]:
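# Escape each token and require that it does not start mid-word,
# so 'ti-' is not stripped out of words such as 'anti-'
joined_words_to_remove = r'(?<!\w)(?:' + '|'.join(map(re.escape, WORDS_TO_REMOVE)) + ')'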


def clean_text(x):
  # Lower case the text
  lower_x = x.lower()

  # Remove line breaks and tabs
  no_break_x = re.sub("\n|\r|\t", " ", lower_x)

  # Remove specific words
  no_waste_words_x = re.sub(joined_words_to_remove, " ", no_break_x)

  # Replace every run of characters that are not letters, digits, or spaces with a space
  alpha_x = re.sub('[^0-9a-zA-Z ]+', ' ', no_waste_words_x)

  # Collapse runs of whitespace into a single space
  beta_x = ' '.join(alpha_x.split())

  return beta_x
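
The raw record printed earlier begins with `'TI  - '`, with two spaces before the hyphen, which neither `'ti-'` nor `'ti -'` matches literally. The following is a minimal sketch of a whitespace-tolerant alternative, not the notebook's method. `TITLE_PREFIX`, `PADDING_TOKEN`, and `strip_markers` are hypothetical names, and the sketch assumes the title marker only ever appears at the very start of the combined text.

```python
import re

# Hypothetical patterns (assumptions, not from the notebook):
# 'ti', any amount of whitespace, then a hyphen, only at the start of the text
TITLE_PREFIX = re.compile(r'^\s*ti\s*-\s*')
# The literal padding token, wherever it occurs
PADDING_TOKEN = re.compile(re.escape('##padding##'))


def strip_markers(text):
    """Lowercase the text and drop the title prefix and padding token."""
    lowered = text.lower()
    no_prefix = TITLE_PREFIX.sub('', lowered)
    no_padding = PADDING_TOKEN.sub(' ', no_prefix)
    # Collapse whitespace left behind by the removals
    return ' '.join(no_padding.split())


sample = ('TI  - [AUTO-INFECTION (INTESTINAL) IN RADIATION SICKNESS '
          'AND ITS PREVENTION IN WISTAR WHITE RATS]. ##PADDING##')
print(strip_markers(sample))
```

Because the prefix pattern is anchored at the start, a word such as "anti-infective" elsewhere in the text is left intact. The result could then go through the remaining character-filtering steps of `clean_text`.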

Let's test the function on a few examples.

In [31]:
for sample_text in df.sample(5)['text'].values:
  print('ORIGINAL TEXT : ', sample_text)
  print('-'*100)
  print('CLEANED TEXT : ', clean_text(sample_text))
  print('\n')
  print('*'*100)

ORIGINAL TEXT :  rat model of experimental endocarditis.
 a simple model of infective endocarditis was produced in rats. with the aid of a  guide wire, polyethylene catheters were passed into the left ventricle through the right carotid artery of sprague-dawley rats weighing 300 to 350 g. a volume of 1 ml of an overnight culture ofother_species,other_species, orother_species was intravenously injected 1 to 2 days after catheterization. bacterial titers ofother_species in vegetations were about 10(4)-fold greater than in other tissues. blood cultures were always positive after 6 h. mortality was 19% at 1 week and 82% at 2 weeks. catheters were pulled 24 h after infection, and vegetation titers of greater than 7.0 log10 colony-forming units per g were sustained at 5 days. in intravenously infected rats without catheters, blood and tissues were sterile after 3 to 5 days. withother_species, vegetations had greater than 9.0 log10 colony-forming units and withother_species 8.8 +/- 0.3 log10 

In [11]:
df['text_len'] = df['text'].str.split().apply(len)

df[df['text_len'] < 10]

Unnamed: 0,title,abstract,label,text,text_len
607,pancreatic infections.\n,\n,0,pancreatic infections.\n \n,2
738,total synthesis and antibiotic activity of deh...,\n,0,total synthesis and antibiotic activity of deh...,7
905,at the heart of the matter.\n,\n,0,at the heart of the matter.\n \n,6
962,mrsa usa300 clone and vref--a u.s.-colombian c...,\n,0,mrsa usa300 clone and vref--a u.s.-colombian c...,7
1092,multidrug-resistant bacteria in southeastern a...,\n,0,multidrug-resistant bacteria in southeastern a...,5
...,...,...,...,...,...
4134,TI - PROPHYLAXIS OF DIARRHEA IN NEWBORN PIGS.\n,##PADDING##\n,0,TI - PROPHYLAXIS OF DIARRHEA IN NEWBORN PIGS....,9
4137,TI - AN ANTIBIOTIC-LIKE EFFECT OF LACTOBACILL...,##PADDING##\n,0,TI - AN ANTIBIOTIC-LIKE EFFECT OF LACTOBACILL...,9
4138,TI - [Treatment of intestinal radiation effec...,##PADDING##\n,0,TI - [Treatment of intestinal radiation effec...,8
4139,TI - [SYNCHRONIZATION OF CELL DIVISION].\n,##PADDING##\n,0,TI - [SYNCHRONIZATION OF CELL DIVISION].\n ##...,7
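
The cell above lists the records with fewer than 10 words but does not drop them. The filtering step from the cleaning list might look like the sketch below, written in plain Python so it runs on its own. `MIN_WORDS` and `has_min_words` are hypothetical names. The threshold of 10 comes from the cleaning steps, and the two sample strings are taken from the outputs shown earlier.

```python
MIN_WORDS = 10  # threshold taken from the cleaning steps listed earlier


def has_min_words(text, min_words=MIN_WORDS):
    """Return True if the text has at least `min_words` whitespace-separated words."""
    return len(text.split()) >= min_words


texts = [
    'pancreatic infections.\n \n',
    'a simple model of infective endocarditis was produced in rats.',
]

# Keep only the records that meet the minimum word count
kept = [t for t in texts if has_min_words(t)]
print(len(kept))
```

In the notebook itself, the same condition could be applied as `df = df[df['text_len'] >= 10]`, assuming `text_len` has been computed as in the cell above.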
