Preprocessing: This part loads the data and prepares it for vectorization and classification.

In [1]:
########## import all required libraries ############

# stop_words is used here; NLTK's stop-word list would also work, but NLTK resources could not be downloaded in this environment
from stop_words import get_stop_words
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from tqdm import tqdm
from os.path import join
import string
from unidecode import unidecode
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
import random

Global variables are defined here: a list of English stop words that may need to be filtered out, a string of punctuation symbols to remove, and a dictionary of contractions that can be expanded into their full words.
The list of the five possible labels is also created.

In [2]:
########## Global Variables ###########

# Load the English stop-word list
stop_words = get_stop_words('english')

# Define a string with all punctuation symbols (raw string, so the backslash is kept literally)
punctuations = r'''!()-[]{};:'"\,=<>./?@#$%^&*_~'''

# Define a dictionary of contractions
contraction_dict = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", 
"we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have"}

# create a list of the possible labels
labels = ["background", "objective", "methods", "results", "conclusions"]


Several functions are created to tailor the cleaning of the text. Depending on the end goal, here classifying which section of a scientific abstract a sentence comes from, you may want to keep or discard specific words or word parts. We focused on scientific symbols and numbers, which may occur more often in results or conclusions sections. We also included functions for stop words, lemmatization with the WordNetLemmatizer, contractions, punctuation and accented characters.

In [3]:
########## Functions used for data cleaning. ###########

# This function replaces stand-alone symbols that matter in a scientific context with words, so they are not removed later.
def special_symbol_replacer(sentence_list):
    # maps a token that consists solely of the symbol to its replacement word
    replacements = {
        '%': 'percentage',
        '>': 'larger',
        '<': 'smaller',
        '+': 'plus',
        '=': 'equals',
        'n': 'amount',
        '/': 'slash'}
    new_sentence = []
    for word in sentence_list:
        if word in replacements:
            word = replacements[word]
        new_sentence.append(word)
    return new_sentence

# Function that replaces contractions with their separate, full words.
def replace_contraction(sentence_list):
    new_sentence = []
    for word in sentence_list:
        # words without an entry in contraction_dict are kept unchanged
        new_sentence.append(contraction_dict.get(word, word))
    return new_sentence


# Function to handle numbers. Turns them into a string defining a specific category: 'integer', 'float', 'fraction'.
# Words that combine letters and digits are dropped from the output.
def handle_nums(sentence_list):
    sentence_list = list(filter(lambda word: len(word) != 0, sentence_list))
    output = []
    for word in sentence_list:
        if any(char.isdigit() for char in word):  # if there is a number in the word
            if any(char.isalpha() for char in word):   # if there is also a letter in the word, drop it
                continue
            if '.' in word:
                output.append('float')
            elif '/' in word:
                output.append('fraction')
            else:
                output.append('integer')
        else:
            output.append(word)
    return output


# Function to handle dashes. Splits every word at its dashes and returns the parts as separate words
def handle_dash(sentence_list):
    output = []
    for word in sentence_list:
        output += word.split('-')
    return output


# Function removes any single letter words from the text.
def remove_singles(sentence_list):
    return list(filter(lambda word: not (len(word) == 1 and word.isalpha()), sentence_list))


# Function to perform lemmatization on the text, using the WordNetLemmatizer from NLTK.
# Requires the NLTK 'wordnet' resource to be available.
def lemmatizer(sentence_list):
    my_lemmatizer = WordNetLemmatizer()
    output = []
    for word in sentence_list:
        new_word = my_lemmatizer.lemmatize(word)
        output.append(new_word)
    return output


# Function that removes all remaining punctuation characters (those in string.punctuation)
def remove_punctuation(sentence_list):
    output = []
    for word in sentence_list:
        new_word = ""
        for letter in word:
            if letter not in string.punctuation:
                new_word += letter
        output.append(new_word)
    return output


# Function that replaces accented characters with their closest ASCII equivalents.
def remove_accented_chars(sentence_list):
    output = []
    for word in sentence_list:
        output.append(unidecode(word))
    return output
        

# Function to join all the words of a sentence back into one single string.
def list_to_string(sentence):
    return ' '.join(sentence)
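
To make the number handling concrete, here is a minimal, self-contained sketch. It restates the logic of handle_nums in compact form under a hypothetical name (categorize_numbers is not part of the notebook) and runs it on a made-up token list.

```python
def categorize_numbers(tokens):
    """Compact restatement of handle_nums: map numeric tokens to a category word."""
    output = []
    for token in tokens:
        if not token:
            continue  # skip empty strings
        if any(ch.isdigit() for ch in token):
            if any(ch.isalpha() for ch in token):
                continue  # tokens mixing letters and digits (e.g. 'il6') are dropped
            if '.' in token:
                output.append('float')
            elif '/' in token:
                output.append('fraction')
            else:
                output.append('integer')
        else:
            output.append(token)
    return output


# Made-up example sentence, already lowercased and split on whitespace
example = "mean age was 54.3 years in 120 patients and 2/3 received il6 inhibitors".split()
result = categorize_numbers(example)
print(result)
```

In this sketch '54.3', '120' and '2/3' become 'float', 'integer' and 'fraction', while 'il6' disappears. Note that a token such as 'p<0.05' written without spaces also contains both a letter and digits, so it would be dropped as well.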



The preprocessing function is defined. It takes the lines of a text document as input and returns two lists of length n, containing the labels and the cleaned sentences in corresponding order.
The cleaning functions are applied in a specific order, and individual functions can be included or excluded. When the cleaning functions were tested with the baseline classifier, no cleaning at all gave the best weighted F1 scores.
Full cleaning reduces the number of unique words in the training text by roughly 11%.
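
A reduction in unique words like the one quoted above could be measured roughly as sketched below. This is only an illustration: vocabulary_size and relative_reduction are hypothetical helper names, and the two small sentence lists are made up, so the printed value says nothing about the real data.

```python
def vocabulary_size(sentences):
    """Count the distinct whitespace-separated tokens across all sentences."""
    return len({token for sentence in sentences for token in sentence.split()})


def relative_reduction(raw_sentences, cleaned_sentences):
    """Fraction by which cleaning shrinks the vocabulary."""
    raw_size = vocabulary_size(raw_sentences)
    cleaned_size = vocabulary_size(cleaned_sentences)
    return (raw_size - cleaned_size) / raw_size


# Toy data: the same two sentences before and after a made-up cleaning step
raw = ["the patients were randomized", "patients received the drug"]
cleaned = ["patient randomized", "patient received drug"]

reduction = relative_reduction(raw, cleaned)
print(f"vocabulary reduced by {reduction:.0%}")
```

On the real corpus, the raw list would hold the uncleaned training sentences and the cleaned list the output of the preprocessing function.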

In [4]:
# Function that reads whole text files, selects and splits labels and sentences, and cleans the sentences.
def preprocess_text(text):
    output_labels = []  # define an empty list to store the labels
    output_sentences = []  # define an empty list to store the sentences

    for line in tqdm(text):
        lowers = line.lower()  # puts all letters in text in lowercase
        splitted = lowers.split()  # splits the sentence in a list of words

        # select only the relevant parts of the text:
        # skip empty lines and lines that do not start with a predefined label
        if len(splitted) == 0 or splitted[0] not in labels:
            continue

        # split the line into its label and the sentence
        label = splitted[0]
        labelnum = labels.index(label)
        word_list = splitted[1:]

        # Cleaning functions
        word_list = replace_contraction(word_list)  # replaces contractions with full words
        word_list = handle_nums(word_list)  # handles the numbers in the text
        word_list = special_symbol_replacer(word_list)  # replaces symbols with text
        word_list = handle_dash(word_list)  # handles words with dashes
        word_list = remove_punctuation(word_list)  # removes punctuation
        word_list = remove_accented_chars(word_list)  # removes accents
        word_list = [word for word in word_list if word not in stop_words]  # removes stop words
        word_list = remove_singles(word_list)  # removes single letter words
        word_list = [word for word in word_list if len(word) != 0]  # removes empty strings
        word_list = lemmatizer(word_list)  # performs lemmatization

        # Put the obtained labels and processed text in corresponding lists.
        output_labels.append(labelnum)
        output_sentences.append(list_to_string(word_list))
    return output_labels, output_sentences
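
The label selection step can be illustrated in isolation. The sketch below assumes input lines of the form "LABEL<tab>sentence", with other lines (such as identifiers or blank separators) in between; the sample lines are invented, and split_label and LABEL_TO_INDEX are hypothetical names. It uses a dictionary lookup in place of labels.index, which gives the same label numbers.

```python
LABELS = ["background", "objective", "methods", "results", "conclusions"]
# Same numbering as LABELS.index(label), but with constant-time lookup
LABEL_TO_INDEX = {label: index for index, label in enumerate(LABELS)}


def split_label(line):
    """Return (label number, word list), or None if the line carries no known label."""
    tokens = line.lower().split()
    if not tokens or tokens[0] not in LABEL_TO_INDEX:
        return None
    return LABEL_TO_INDEX[tokens[0]], tokens[1:]


# Invented sample lines in the assumed file layout
sample_lines = [
    "###12345678\n",
    "OBJECTIVE\tTo compare two treatments .\n",
    "\n",
    "RESULTS\tPain scores decreased in both groups .\n",
]

parsed = [pair for pair in map(split_label, sample_lines) if pair is not None]
for label_number, words in parsed:
    print(label_number, words)
```

Only the two labeled lines survive; the identifier line and the empty line are skipped, mirroring the checks at the top of the loop in preprocess_text.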

The get_data function loads one of the data files and runs the preprocessing function on it line by line.

In [7]:

# Define the path that stores the text files.
path = "C:\\Users\\I0327140\\PubMed_200k_RCT\\PubMed_200k_RCT\\"


# Returns a (labels, sentences) pair. set_type: 'test', 'dev' or 'train'
def get_data(set_type='test'):
    with open(join(path, f'{set_type}.txt'), "r") as f:
        data = f.readlines()
    return preprocess_text(data)


Load the data of the train, test and dev files and save their labels and cleaned sentences as labels and corpus respectively.

In [6]:
labels_train, corpus_train = get_data('train')
labels_test, corpus_test = get_data('test')
labels_dev, corpus_dev = get_data('dev')

  0%|                                                                                      | 0/2593169 [00:00<?, ?it/s]


LookupError: 
**********************************************************************
  Resource wordnet not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('wordnet')

  Attempted to load corpora/wordnet

  Searched in:
    - 'C:\\Users\\I0327140/nltk_data'
    - 'C:\\ProgramData\\Anaconda3\\nltk_data'
    - 'C:\\ProgramData\\Anaconda3\\share\\nltk_data'
    - 'C:\\ProgramData\\Anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\I0327140\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************
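
The error above is raised because WordNetLemmatizer needs the NLTK 'wordnet' resource, which is not present in any of the searched directories. One possible way to guard against this is sketched below; ensure_nltk_resource is a hypothetical helper, and the download only works if the machine can reach the NLTK download server, which may not be the case in this environment.

```python
def ensure_nltk_resource(resource_path="corpora/wordnet", package="wordnet"):
    """Return True if the NLTK resource is available, trying a download if it is missing."""
    import nltk  # imported lazily, so defining the helper does not itself require NLTK

    try:
        nltk.data.find(resource_path)  # raises LookupError when the resource is absent
        return True
    except LookupError:
        # nltk.download reports success or failure as a boolean
        return bool(nltk.download(package))
```

Calling ensure_nltk_resource() before the get_data calls would attempt the download once. If downloading is blocked, the wordnet data can instead be placed manually in one of the directories listed in the error message. Depending on the NLTK version, additional resources may be requested in the same way.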
