# Logistic Regression on TfidfVectorizer (Baseline)
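A common baseline for text classification is to run logistic regression on vectorized text. In scikit-learn, TfidfVectorizer is equivalent to CountVectorizer (raw token counts) followed by TfidfTransformer, which reweights those counts by tf-idf (term frequency times inverse document frequency). Note that the liblinear solver used below fits one binary classifier per class (one-vs-rest) rather than a true multinomial model.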

"The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus." (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer)
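
To make the weighting concrete, here is a minimal pure-Python sketch of tf-idf on three made-up sentences. It assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The whitespace tokenizer and the `tfidf` helper are simplifications for illustration and are not what TfidfVectorizer does internally.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (idf, rows) for a list of strings, using smoothed idf and L2 norm."""
    # Simplified tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each token occur?
    df = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + count)) + 1.0
           for tok, count in df.items()}

    rows = []
    for doc in tokenized:
        tf = Counter(doc)  # raw term counts in this document
        weights = {tok: tf[tok] * idf[tok] for tok in tf}
        # L2-normalize the row; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({tok: w / norm for tok, w in weights.items()})
    return idf, rows


docs = [
    "the patients were randomized",
    "the results were significant",
    "the study enrolled patients",
]
idf, rows = tfidf(docs)
for tok in ("the", "patients", "randomized"):
    print(tok, round(idf[tok], 3), round(rows[0][tok], 3))
```

A token such as "the", which occurs in every document, receives the smallest idf, while a token that occurs in only one document is weighted up.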

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
import random
import os
from tqdm import tqdm
from sklearn.metrics import f1_score

For the baseline result, we ran TfidfVectorizer + LogisticRegression on both preprocessed and unprocessed data. We achieved better results with no preprocessing: even after varying the preprocessing steps, the results were best on the uncleaned dataset.

In [2]:
def get_unprocessed_data(filename): # train, dev, test
    labels = ['background', 'objective', 'methods', 'results', 'conclusions']
    data = []
    with open(os.path.join('./PubMed_200k_RCT', f'{filename}.txt'), 'r') as f:
        data = f.readlines()
    output_labels = []  # define an empty list to store the labels
    output_sentences = []  # define an empty list to store the sentences

    for line in tqdm(data):
        line = line.split()
        if len(line) >= 2:
            label = line[0].lower()
            if label not in labels:
                continue
            else:
                labelnum = labels.index(label)
                
                output_labels.append(labelnum)
                output_sentences.append(' '.join(line[1:]))
    return output_labels, output_sentences

In [13]:
def evaluate(y_pred, y):
    micro = f1_score(y, y_pred, average='micro')
    macro = f1_score(y, y_pred, average='macro')
    weighted = f1_score(y, y_pred, average='weighted')
    print(f'F1 Score: micro {micro}, macro {macro}, weighted {weighted}')

def run_and_evaluate_baseline():
    labels, corpus = get_unprocessed_data('train')
    labels_valid, corpus_valid = get_unprocessed_data('dev')
    labels_test, corpus_test = get_unprocessed_data('test')

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)
    scikit_log_reg = LogisticRegression(solver='liblinear', random_state=0, C=5, penalty='l2', max_iter=1000, verbose=1)
    model = scikit_log_reg.fit(X, labels)
    
    X_valid = vectorizer.transform(corpus_valid)
    X_test = vectorizer.transform(corpus_test)
    y_pred_valid = model.predict(X_valid)
    y_pred_test = model.predict(X_test)
    
    print('Result on validation set:')
    evaluate(y_pred_valid, labels_valid)
    print('Result on test set:')
    evaluate(y_pred_test, labels_test)

run_and_evaluate_baseline()

100%|████████████████████████████████████████████████████████████████████| 2593169/2593169 [00:05<00:00, 489593.99it/s]
100%|████████████████████████████████████████████████████████████████████████| 33932/33932 [00:00<00:00, 507834.49it/s]
100%|████████████████████████████████████████████████████████████████████████| 34493/34493 [00:00<00:00, 508313.40it/s]


[LibLinear]Result on validation set:
F1 Score: micro 0.8243121802848058, macro 0.7572239244299099, weighted 0.8210974554468179
Result on test set:
F1 Score: micro 0.8247041670904961, macro 0.7573541905307171, weighted 0.8214023347513191
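
The three averages printed by evaluate differ only in how the per-class F1 scores are combined. The following pure-Python sketch illustrates this on a small made-up label set (not the PubMed data); the helper name `f1_averages` is hypothetical, and in practice `f1_score` from sklearn.metrics is the function to use.

```python
from collections import Counter


def f1_averages(y_true, y_pred):
    """Return (micro, macro, weighted) F1 for single-label predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in classes}
    support = Counter(y_true)

    # micro: pool the counts of all classes, then compute one F1
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # macro: unweighted mean of the per-class F1 scores
    macro = sum(per_class.values()) / len(classes)
    # weighted: mean of the per-class F1 scores, weighted by class support
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    return micro, macro, weighted


# Imbalanced toy example: class 0 dominates, class 2 is never predicted.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 1, 0, 0]
micro, macro, weighted = f1_averages(y_true, y_pred)
print(f'micro {micro:.3f}, macro {macro:.3f}, weighted {weighted:.3f}')
```

When rare classes are predicted poorly, the macro average drops below the other two, which matches the pattern in the baseline results above.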


## Neural network with GRU layers
Recurrent neural networks (RNNs) process a sequence one token at a time and carry a hidden state forward, which lets them retain information from earlier tokens while incorporating each new input. (https://compstat-lmu.github.io/seminar_nlp_ss20/recurrent-neural-networks-and-their-applications-in-nlp.html)

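GRU stands for Gated Recurrent Unit. A GRU has two gates: the reset gate controls how much of the previous hidden state is used when computing the candidate state, and the update gate controls how much of the hidden state is replaced by that candidate.

Compared with an LSTM, a GRU controls the flow of information in a similar way but without a separate memory cell. GRUs are simpler, have fewer parameters, and are typically faster to train.

LSTMs have three gates (input, forget, and output) and a dedicated cell state, which makes them more expressive. In principle this can help an LSTM model long-distance dependencies better than a GRU, although which of the two performs better depends on the task.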
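
As an illustration of the two gates, here is a minimal sketch of a single GRU step with a scalar input and a scalar hidden state. The parameter names in `p` are hypothetical, and the final interpolation follows one common convention; some formulations (including the one Keras uses) swap the roles of z and 1 - z.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, p):
    """One GRU time step for scalar input x and scalar hidden state h_prev."""
    # Update gate: how much of the state is replaced by the candidate.
    z = sigmoid(p['w_z'] * x + p['u_z'] * h_prev + p['b_z'])
    # Reset gate: how much of the old state feeds into the candidate.
    r = sigmoid(p['w_r'] * x + p['u_r'] * h_prev + p['b_r'])
    # Candidate state, computed from the input and the reset-scaled old state.
    h_cand = math.tanh(p['w_h'] * x + p['u_h'] * (r * h_prev) + p['b_h'])
    # Interpolate between the old state and the candidate.
    return (1.0 - z) * h_prev + z * h_cand


def make_params(**overrides):
    """All-zero parameters with selected entries overridden (illustration only)."""
    names = ['w_z', 'u_z', 'b_z', 'w_r', 'u_r', 'b_r', 'w_h', 'u_h', 'b_h']
    params = {name: 0.0 for name in names}
    params.update(overrides)
    return params


# A strongly negative update-gate bias keeps the old state almost unchanged.
keep = make_params(b_z=-50.0, w_h=1.0)
# A strongly positive bias overwrites the state with the candidate.
overwrite = make_params(b_z=50.0, w_h=1.0)

h_keep = gru_step(1.0, 0.7, keep)
h_overwrite = gru_step(1.0, 0.7, overwrite)
print(h_keep, h_overwrite)
```

A real GRU layer applies the same equations with weight matrices and vector-valued states, and learns the parameters during training.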

In [4]:
from gensim.models import Word2Vec
import time
from tensorflow.keras.layers import Dense, GRU, Embedding, Bidirectional
from tensorflow.keras.models import Sequential
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
import tensorflow as tf



In [5]:
# get_data lives in the project's preprocessing module
# (TODO: define it earlier in the notebook and drop this import)
from preprocessing import get_data

# Load the train, dev, and test splits via get_data
labels, corpus = get_data('train')
labels_valid, corpus_valid = get_data('dev')
labels_test, corpus_test = get_data('test')

100%|█████████████████████████████████████████████████████████████████████| 2593169/2593169 [02:27<00:00, 17580.95it/s]
100%|█████████████████████████████████████████████████████████████████████████| 33932/33932 [00:02<00:00, 15943.31it/s]
100%|█████████████████████████████████████████████████████████████████████████| 34493/34493 [00:01<00:00, 17958.84it/s]


The following cell enables GPU memory growth, which we needed when running TensorFlow locally on a GPU.

In [6]:
# Let TensorFlow allocate GPU memory on demand instead of reserving it all up front
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

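We define our own f1_weighted metric, which compares the ground truth labels (as integer class indices) to the softmax prediction array. Keras evaluates it per batch and averages the results, so the reported value only approximates the weighted F1 over the full dataset.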

In [7]:
# Keras backend ops used by the custom metric
import tensorflow.keras.backend as K


def f1_weighted(label, pred):
    num_classes = 5
    label = K.cast(K.flatten(label), 'int32')
    true = K.one_hot(label, num_classes)
    pred_labels = K.argmax(pred, axis=-1)
    pred = K.one_hot(pred_labels, num_classes)

    ground_positives = K.sum(true, axis=0) + K.epsilon()  # = TP + FN
    pred_positives = K.sum(pred, axis=0) + K.epsilon()  # = TP + FP
    true_positives = K.sum(true * pred, axis=0) + K.epsilon()  # = TP

    precision = true_positives / pred_positives
    recall = true_positives / ground_positives

    f1 = 2 * (precision * recall) / (precision + recall + K.epsilon())

    weighted_f1 = f1 * ground_positives / K.sum(ground_positives)
    weighted_f1 = K.sum(weighted_f1)

    return weighted_f1

Our GRU model is relatively simple.
1. We start off with a text vectorization layer, which we adapt to the corpus to initialize it with the known vocabulary.
2. Our embedding layer is initialized with weights from our best performing Word2Vec model. Here we set mask_zero=True so that index 0, the padding value produced by the vectorization layer, is masked and ignored by the recurrent layers that follow.
3. Next come our GRU layers. We wrap each one in a Bidirectional layer, which runs the GRU over the sequence in both directions so that each position has context from both preceding and following tokens.
4. With the Dense layer of size 5, we model the output to predict the probability of the text belonging to each of the 5 classes.
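
One detail worth checking when initializing an embedding layer from pretrained vectors is that row i of the weight matrix belongs to the token with index i in the vectorization layer's vocabulary. The sketch below shows one way to build such an aligned matrix. It assumes a vocabulary list in the order the vectorizer uses (for TextVectorization in 'int' mode, `get_vocabulary()` returns the padding token `''` at index 0 and `'[UNK]'` at index 1) and a plain dict mapping tokens to vectors. The helper `build_embedding_matrix` and the toy data are hypothetical.

```python
def build_embedding_matrix(vocabulary, word_vectors, embedding_dim):
    """Return one row per vocabulary entry, in vocabulary order.

    Tokens without a pretrained vector (including the padding and
    out-of-vocabulary entries) get an all-zero row.
    """
    matrix = []
    for token in vocabulary:
        vector = word_vectors.get(token)
        if vector is None:
            vector = [0.0] * embedding_dim
        matrix.append(list(vector))
    return matrix


# Toy data: in the notebook these would come from the adapted vectorization
# layer and from the trained Word2Vec model, respectively (assumption).
vocabulary = ['', '[UNK]', 'the', 'patients', 'placebo']
word_vectors = {
    'patients': [0.1, 0.2, 0.3],
    'the': [0.4, 0.5, 0.6],
}

matrix = build_embedding_matrix(vocabulary, word_vectors, embedding_dim=3)
for token, row in zip(vocabulary, matrix):
    print(repr(token), row)
```

The resulting list can be converted to an array and passed to the Constant initializer, with the embedding layer's input dimension set to the vocabulary length.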

In [8]:
def get_gru_model():
    w2v = Word2Vec.load('trained_models/word2vec_100_7_15.model')
    # Pretrained embedding matrix of shape (vocab_size, embedding_dim)
    weight_matrix = w2v.wv.vectors
    vocab_size = weight_matrix.shape[0]
    embedding_dim = weight_matrix.shape[1]
    
    num_classes = 5
    vectorize_layer = TextVectorization(max_tokens=vocab_size, output_mode='int')
    vectorize_layer.adapt(corpus)

    model = Sequential([
        vectorize_layer,
        Embedding(vocab_size, embedding_dim, embeddings_initializer=Constant(weight_matrix), mask_zero=True),
        Bidirectional(GRU(embedding_dim, return_sequences=True)),
        Bidirectional(GRU(32)),
        Dense(num_classes, activation='softmax')
    ])
    acc = tf.keras.metrics.SparseCategoricalAccuracy()
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=[acc, f1_weighted])
    return model

The following code snippet trains and saves our GRU model.

In [9]:
def train_gru(model):
    timestr = time.strftime("%Y%m%d-%H%M%S")
    model_name = f'GRU_{timestr}'
    model_save_path = f'models/{model_name}'
    epochs = 20
    batch = 32
    tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=f'./logs/{model_name}', update_freq='batch')
    model.fit(corpus, labels, validation_data=(corpus_valid, labels_valid), epochs=epochs, batch_size=batch,
              callbacks=[tensorboard_callback])

    model.save(model_save_path)
    print(f'Model saved: {model_save_path}')

# train_gru(get_gru_model())

In [15]:
loaded_model = tf.keras.models.load_model('GRU_checkpoint', custom_objects={"f1_weighted": f1_weighted})
print('Model successfully loaded')
# print(f'Train loss, acc, f1: {loaded_model.evaluate(corpus, labels)}')
print(f'Valid loss, acc, f1: {loaded_model.evaluate(corpus_valid, labels_valid)}')
print(f'Test loss, acc, f1: {loaded_model.evaluate(corpus_test, labels_test)}')

Model successfully loaded
Valid loss, acc, f1: [0.4640914499759674, 0.8280428647994995, 0.8270767331123352]
Test loss, acc, f1: [0.47924497723579407, 0.826227068901062, 0.8251857757568359]
