In [1]:
import pandas as pd
import numpy as np
import spacy
from tqdm import tqdm
import re
import time
import datetime
import os
import pickle
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 200)

import tensorflow_hub as hub
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

In [2]:
train = pd.read_csv("../assets/train_prep.csv", keep_default_na=False)

In [3]:
train = train[['id', 'class', 'text']]

In [4]:
train.head()

,id,class,text
0,0,1,cyclin dependent kinase cdks regulate variety fundamental cellular process cdk stand one last orphan cdks activate cyclin identify kinase activity reveal previous work cdk silence increase ets v e...
1,1,2,abstract background non small lung nsclc heterogeneous group disorder number genetic proteomic alteration c cbl e ubiquitin ligase adaptor molecule important normal homeostasis determine genetic v...
2,2,2,abstract background non small lung nsclc heterogeneous group disorder number genetic proteomic alteration c cbl e ubiquitin ligase adaptor molecule important normal homeostasis determine genetic v...
3,3,3,recent evidence demonstrate acquire uniparental disomy aupd novel mechanism pathogenetic may reduce homozygosity help identify novel myeloproliferative neoplasm mpns perform genome wide single nuc...
4,4,4,oncogenic monomeric casitas b lineage lymphoma cbl gene found many significance remains largely unknown several human c cbl cbl structure recently solve depict protein different stage activation c...


In [5]:
X = train[['id', 'text']]
y = train['class']

In [6]:
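# Split each document into whitespace-delimited tokens
X_all_text = [text.split() for text in X['text']]
# The token lists have different lengths, so store them as an object array
# (recent NumPy versions refuse to build a ragged array without dtype=object)
X_all_text = np.array(X_all_text, dtype=object)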

We load the ELMo module (version 3) from TensorFlow Hub. The ELMo module computes contextualized word representations using character-based word representations and bidirectional LSTMs. 

The module accepts input either as raw text strings or as tokenized text strings. It outputs fixed embeddings at each LSTM layer, a learnable aggregation of the 3 layers, and a fixed mean-pooled vector representation of the input.
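To make the "learnable aggregation of the 3 layers" more concrete, here is a minimal sketch in plain Python of a softmax-weighted sum of layer vectors scaled by a single factor, which is how the ELMo paper describes the combination. The function name scalar_mix, the toy 2-dimensional vectors, and the weight values are assumptions for illustration only; the real module performs this step internally on 1024-dimensional tensors.

```python
import math

def scalar_mix(layers, weights, gamma=1.0):
    """Softmax-weighted sum of equally sized layer vectors, scaled by gamma."""
    # Normalise the raw weights with a softmax so that they sum to 1
    exps = [math.exp(w) for w in weights]
    total = sum(exps)
    probs = [e / total for e in exps]
    dim = len(layers[0])
    # Weighted sum across layers, one output value per dimension
    return [gamma * sum(p * layer[d] for p, layer in zip(probs, layers))
            for d in range(dim)]

# Toy example (made-up values): 3 layers, each a 2-dimensional vector for one token
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Equal raw weights reduce the mix to a plain average of the layers
mixed = scalar_mix(layers, weights=[0.0, 0.0, 0.0])
print(mixed)
```

With trainable=True, the weights and the scaling factor would be the parameters that training can adjust; in this sketch they are fixed numbers.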

In [7]:
elmo = hub.Module("https://tfhub.dev/google/elmo/3", trainable=True)

We define a function that extracts the ELMo vectors for a single sample and averages them over the token axis (axis 1 of the module's output).

In [8]:
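def elmo_vectors(x):
    """Return the ELMo embeddings of x, averaged over axis 1 (the token axis)."""
    # "elmo" is the weighted sum of the 3 layers, shape [batch_size, max_length, 1024]
    embeddings = elmo(x, signature="default", as_dict=True)["elmo"]
    # Note: each call adds new ops to the graph and opens a fresh session
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        # return average of ELMo features
        return sess.run(tf.reduce_mean(embeddings, 1))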
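The averaging step can be illustrated without TensorFlow. The sketch below assumes the module's output has shape [batch_size, max_length, 1024], as the TF Hub documentation states; the toy array uses a last dimension of 4 in place of 1024, and its values are made up.

```python
import numpy as np

# Stand-in for the "elmo" output: 2 input strings, 3 token positions, 4 features
embeddings = np.arange(24, dtype=float).reshape(2, 3, 4)

# Same reduction as tf.reduce_mean(embeddings, 1): average over the token axis
pooled = embeddings.mean(axis=1)

print(pooled.shape)  # one row per input string
```

Because the reduction keeps axis 0, the result has one row per string passed to the module, so the shape of the returned array depends on how many strings x contains.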

We now call the function for each sample in the X_all_text array created earlier. Due to memory constraints, we cannot hold the entire set of ELMo vectors in memory. Instead, we write them to disk in batches of 200 rows per file with the pickle module, using the highest pickle protocol available (pickle.HIGHEST_PROTOCOL).
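When several objects are written to one file with repeated pickle.dump calls, as the cell below does for each row, they have to be read back with repeated pickle.load calls. Here is a minimal sketch of that round trip, using toy lists in place of ELMo vectors. The helper load_all and the file name vectors_demo.pickle are hypothetical and not part of the notebook.

```python
import os
import pickle
import tempfile

def load_all(path):
    """Yield every object that was pickled into the file at path, in order."""
    with open(path, "rb") as pickle_in:
        while True:
            try:
                yield pickle.load(pickle_in)
            except EOFError:
                # pickle.load raises EOFError once the end of the file is reached
                return

# Toy data standing in for per-row ELMo vectors
rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

path = os.path.join(tempfile.mkdtemp(), "vectors_demo.pickle")
with open(path, "wb") as pickle_out:
    for row in rows:
        # One dump call per row, appended to the same open file
        pickle.dump(row, pickle_out, pickle.HIGHEST_PROTOCOL)

restored = list(load_all(path))
print(len(restored))
```

Reading through a generator means only one row needs to be in memory at a time, which matches the reason for writing the vectors to disk in the first place.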

In [None]:
%%time
now = datetime.datetime.now()
print("Started creating pickle file at: {}".format(now.strftime("%Y-%m-%d %H:%M:%S")))
num_rows = len(X_all_text)
batch_size = 200  # number of rows stored in each pickle file
# Ceiling division, so no empty trailing file is created when num_rows is a multiple of batch_size
num_batches = (num_rows + batch_size - 1) // batch_size
first_start = time.time()  # track the start of the whole pickling operation

for pickle_batch in range(num_batches):
    with open("../assets/elmo_vectors_" + str(pickle_batch) + ".pickle", "wb") as pickle_out:
        max_batch_row = min(num_rows, (pickle_batch + 1) * batch_size)
        for row in range(pickle_batch * batch_size, max_batch_row):
            row_start = time.time()
            # Each row is appended to the batch file as a separate pickle object
            pickle.dump(elmo_vectors(X_all_text[row]), pickle_out, pickle.HIGHEST_PROTOCOL)
            # Flush the contents of memory to disk as each row is processed, so that we can
            # better track the progress of the pickling operation
            pickle_out.flush()
            os.fsync(pickle_out.fileno())
            print("Row {} of {} appended, duration (h:m:s): {}, total elapsed time (h:m:s): {}".format(
                row, num_rows,
                str(datetime.timedelta(seconds=time.time() - row_start)),
                str(datetime.timedelta(seconds=time.time() - first_start))))

Started creating pickle file at: 2020-04-08 20:04:17
INFO:tensorflow:Saver not created because there are no variables in the graph to restore


INFO:tensorflow:Saver not created because there are no variables in the graph to restore


Row 0 of 3321 appended, duration (h:m:s): 0:00:12.152543, total elapsed time (h:m:s): 0:00:12.153543
INFO:tensorflow:Saver not created because there are no variables in the graph to restore


INFO:tensorflow:Saver not created because there are no variables in the graph to restore
