<h1><center>RNN-based Generation of arXiv Titles at Character Level</center></h1>

### Get data from arXiv

From [Wikipedia](https://en.wikipedia.org/wiki/ArXiv): "arXiv (pronounced 'archive') is a repository of electronic preprints (known as e-prints) approved for publication after moderation, that consists of scientific papers in the fields of mathematics, physics, astronomy, computer science, quantitative biology, statistics, and quantitative finance, which can be accessed online."

The code below uses the package [arxivpy](https://github.com/titipata/arxivpy), which is a wrapper for the [arXiv API](https://arxiv.org/help/api/index).

We use it to fetch titles, categories, and publication dates of papers submitted to the 'cs' (Computer Science) and 'stat' (Statistics) categories, in the following sub-categories:

- cs.CV: Computer Vision and Pattern Recognition
- cs.CL: Computation and Language
- cs.LG: Learning
- cs.AI: Artificial Intelligence
- cs.NE: Neural and Evolutionary Computing
- stat.ML: Machine Learning

#### Important:
There is no need to run the cell below, as the dataset is already provided in the 'data' folder. It takes a long time to run.

In [None]:
# import arxivpy
# import pandas as pd

# category_list = ['cs.CV', 'cs.CL', 'cs.LG', 'cs.AI', 'cs.NE', 'stat.ML']

# titles = []
# terms1 = []
# terms2 = []
# dates = []

# for category in category_list:
#     search_query = arxivpy.generate_query(terms=category, prefix='category')
#     articles = arxivpy.query(search_query, start_index=0, max_index=5000, results_per_iteration=1000, wait_time=5.0, sort_by='lastUpdatedDate')
#     for i in range(len(articles)):
#         titles.append(articles[i]['title'])
#         terms1.append(articles[i]['term'])
#         terms2.append(articles[i]['terms'])
#         dates.append(articles[i]['publish_date'])

# dataset = pd.DataFrame({'title': titles, 'term':terms1, 'terms':terms2, 'publish_date':dates})
# dataset.to_csv('data/arxiv_dataset.csv', index=False)

### Upgrade CNTK to the latest version

When this notebook was created, the default CNTK version for the Python runtimes in Azure Notebooks was 2.0, whereas the CNTK model in this notebook was created and trained with CNTK 2.4. Because of compatibility issues between CNTK versions, we upgrade CNTK to the latest version here.

In [1]:
!pip install --upgrade --no-deps cntk

Requirement already up-to-date: cntk in /home/nbuser/anaconda3_501/lib/python3.6/site-packages (2.5.1)


### Load required packages

In [2]:
import cntk as C
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import string
import re
import time

In [3]:
print(C.__version__)

2.5.1


### Load and prepare arXiv data

Here we load the provided dataset and build a list of all unique titles, which will be used for model training. We also preprocess the data: any title containing a character that is not an ASCII letter, a digit, a punctuation character, or a blank space is removed, runs of multiple spaces are collapsed into a single space, and the remaining titles are converted to lowercase.

Then we pad each title by preceding it with a 'Start of Sentence' token, given by the character '>', and appending an 'End of Sentence' token, given by the '\n' character. These special tokens facilitate the definition of input and output data for model training and inference.

In [4]:
data = pd.read_csv('data/arxiv_titles.csv', engine='python')
data = data.drop_duplicates().reset_index()

titles = data['title']
allowed_chars = string.ascii_letters + string.digits + string.punctuation + ' '
processed_titles = []

for i in range(len(titles)):
    invalids = [char for char in titles[i] if char not in allowed_chars]
    if len(invalids) == 0:
        title = re.sub(' +', ' ', titles[i])
        processed_titles.append(title.lower())

data_train = ['>' + title + '\n' for title in processed_titles]
np.random.shuffle(data_train)

print('Number of titles in the training dataset: %d' % len(data_train))

print('Some examples of sentences in the training data:')
for title in data_train[0:5]:
    print(title)

Number of titles in the training dataset: 36380
Some examples of sentences in the training data:
>recurrent batch normalization

>smooth loss functions for deep top-k classification

>simnets: a generalization of convolutional networks

>databright: towards a global exchange for decentralized data ownership and trusted computation

>distribution-dependent concentration inequalities for tighter generalization bounds



### Create vocabulary and lookup dictionaries

Here we first create a list containing each unique character from all titles available in the dataset.

From this list we then create 2 dictionaries: one mapping each character in the vocabulary to a numeric index, and the other mapping each numeric index back to the corresponding character in the vocabulary.

These dictionaries help to build the one-hot encoded representation of the data and also to map from the numeric outputs of the model back to the corresponding characters.
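As a quick illustration of how the two dictionaries work together, the sketch below builds them for a tiny, made-up list of titles and round-trips one title through its index representation. The names `toy_titles`, `toy_vocab`, `toy_char_to_index` and `toy_index_to_char` are hypothetical stand-ins for the notebook's variables, and the titles are invented rather than taken from the dataset.

```python
# Toy stand-in for data_train (invented examples, not taken from the dataset)
toy_titles = ['>deep nets\n', '>gru models\n']

# Same construction as in the notebook: sorted unique characters
toy_vocab = sorted(set(''.join(toy_titles)))
toy_char_to_index = {ch: i for i, ch in enumerate(toy_vocab)}
toy_index_to_char = {i: ch for i, ch in enumerate(toy_vocab)}

# Encode a title as a list of vocabulary indices, then decode it back
encoded = [toy_char_to_index[ch] for ch in toy_titles[0]]
decoded = ''.join(toy_index_to_char[i] for i in encoded)

print(encoded)
print(decoded == toy_titles[0])  # the round trip recovers the original title
```

Because the vocabulary is sorted before indexing, the mapping is the same on every run, which matters when a saved model is reloaded later.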

In [5]:
vocab = list(set(''.join(data_train)))

vocab_len = len(vocab)
char_to_index = { ch:i for i,ch in enumerate(sorted(vocab)) }
index_to_char = { i:ch for i,ch in enumerate(sorted(vocab)) }

avg_title_len = int(np.mean([len(x) for x in data_train]))
max_title_len = int(np.max([len(x) for x in data_train]))
min_title_len = int(np.min([len(x) for x in data_train]))

print('Average title length: %d' % avg_title_len)
print('Max title length: %d' % max_title_len)
print('Min title length: %d' % min_title_len)
print('Vocabulary Length: %d' % vocab_len)

Average title length: 70
Max title length: 219
Min title length: 8
Vocabulary Length: 70


### Prepare batch data

CNTK has a high-level API for efficiently reading and feeding data for model training, which is very useful when your data doesn't fit into memory and must be streamed from external storage into your training session. This API also takes care of efficient random sampling and batching, but has a proprietary data format. More details [here](https://cntk.ai/pythondocs/Manual_How_to_feed_data.html).

But in several situations, as in this one, your data is already in a tabular format (such as a CSV file) and is small enough to fit in memory. In this case you have to explicitly prepare your training data, usually as mini-batches, as described here.

Now we prepare the data that will be fed into the neural network during model training. We define 2 sets of data: the model input, in which each arXiv title is preceded by the '>' token, and the model output, in which the same title is followed by the '\n' token. The output is therefore the input shifted by one character, which allows the network to learn to predict the next token in a sequence, as illustrated by the following diagram:

<img src="./figures/fig4.png" alt="RNN for predicting the next token" style="width: 600px;"/>

First we need to map from characters to the corresponding numeric representation. Here we first map each character to its numeric vocabulary index. We then convert this index to the corresponding one-hot encoded vector.

Finally, we divide the data into mini-batches of 64 sequences each. Here each sequence corresponds to a given arXiv title from the dataset that will be unrolled for 'Backpropagation Through Time' during training.

In CNTK we can just define each minibatch as a list of NumPy arrays. Notice that these NumPy arrays can be of variable sizes, corresponding to the variable lengths of the arXiv titles.
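The input/output shift and the one-hot encoding described above can be seen on a single toy title. This is a minimal, dependency-free sketch: the names `toy_title`, `toy_vocab` and `one_hot` are made up for illustration, and the notebook's own cell performs the one-hot step with `np.eye` instead.

```python
toy_title = '>rnn\n'
toy_vocab = sorted(set(toy_title))   # ['\n', '>', 'n', 'r']
toy_char_to_index = {ch: i for i, ch in enumerate(toy_vocab)}

indices = [toy_char_to_index[ch] for ch in toy_title]
index_x = indices[:-1]   # input:  '>', 'r', 'n', 'n'
index_y = indices[1:]    # output: 'r', 'n', 'n', '\n'

def one_hot(index, size):
    """Return a list of length `size` with 1.0 at position `index`."""
    return [1.0 if j == index else 0.0 for j in range(size)]

x = [one_hot(i, len(toy_vocab)) for i in index_x]
y = [one_hot(i, len(toy_vocab)) for i in index_y]

# Each input character is paired with the character that follows it
for xi, yi in zip(index_x, index_y):
    print(repr(toy_vocab[xi]), '->', repr(toy_vocab[yi]))
```

Each title thus yields two arrays of the same length, one row per character and one column per vocabulary entry.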

In [6]:
data_x = []
data_y = []
for seq in data_train:
    index_x = [char_to_index[ch] for ch in ''.join(seq)][:-1]
    index_y = index_x[1:] + [char_to_index['\n']]
    index_x = np.eye(vocab_len, dtype=np.float32)[index_x]
    index_y = np.eye(vocab_len, dtype=np.float32)[index_y]
    data_x.append(index_x)
    data_y.append(index_y)

mb_size = 64 # number of sequences in each mini batch
data_len = len(data_x)
data_x = [data_x[s : s+mb_size] for s in range(0, data_len, mb_size)]
data_y = [data_y[s : s+mb_size] for s in range(0, data_len, mb_size)]

num_mb = len(data_x)
print('Number of minibatches: %d' % num_mb)

print('Shapes of the first 5 elements in the first minibatch of data_x:')
for i in range(5):
    print(data_x[0][i].shape)

Number of minibatches: 569
Shapes of the first 5 elements in the first minibatch of data_x:
(30, 70)
(52, 70)
(52, 70)
(95, 70)
(84, 70)


### Define input and output variables, and the network model

The neural network we build has the architecture shown in the following diagram:

<img src="./figures/fig5.png" alt="RNN model architecture" style="width: 600px;"/>

It takes as input a sequence of one-hot encoded vectors, one per character of a given arXiv title from the dataset. These one-hot encoded vectors go to a Recurrent Neural Network block, which comprises a Stabilization Layer, a Normalization Layer, and a GRU-based one-directional Recurrent Layer. The output of this Recurrent Neural Network block is then fed into a Dense Layer.

Notice that the Recurrent Neural Network block is constructed in a way that we can define the stacking of multiple such blocks, using the CNTK *For()* function.

Notice also that it is a common pattern in CNTK (and also in other Deep Learning frameworks) to define the last Dense Layer without an activation function when performing multinomial classification. The activation (Softmax in this case) is defined later, together with the model Cost (Loss) Function.

The placeholder for the input values is defined by the variable *X*. The variable *Y* defines the placeholder for the output values, which are used to "teach the network" the true labels in a supervised manner during training. Remember that in our setup a true label is defined as the next token in a sequence.

The actual RNN model is defined by the *create_model()* function. CNTK's [Layer library](https://docs.microsoft.com/en-us/python/cognitive-toolkit/layerref?view=cntk-py-2.5.1) allows you to build models in a pattern known as [Function Composition](https://en.wikipedia.org/wiki/Function_composition).

In this way we instantiate our RNN model just by calling the *create_model()* function. The returned object from this function is also a function which we finally use to pass the input variable *X* to the model. The model predicted output is defined by the variable *Z*.

In [7]:
X = C.sequence.input_variable(shape=(vocab_len))
Y = C.sequence.input_variable(shape=(vocab_len))

n_hidden = 1024
n_layers = 1

def create_model():
    with C.layers.default_options(initial_state=0):
        model = C.layers.Sequential([
                C.layers.For(range(n_layers), lambda:
                    C.layers.Sequential([C.layers.Stabilizer(),
                                         C.layers.LayerNormalization(),
                                         C.layers.Recurrence(C.layers.GRU(shape=n_hidden))])),
                C.layers.Dense(vocab_len)])
    return model

model = create_model()
Z = model(X)

### Define the error, cost function, learner and trainer objects

Next we define the CNTK objects needed for model training:

The model error, which is expressed as a classification error. We model the prediction of the next token as a classification problem, so given an input token the next token in the sequence is predicted at the output as one of the possible tokens in the vocabulary.

The model [cost function](https://en.wikipedia.org/wiki/Loss_function) (also known as the loss function), which is defined as the [cross-entropy](https://en.wikipedia.org/wiki/Cross_entropy) loss with [softmax](https://en.wikipedia.org/wiki/Softmax_function) (standard for classification problems).

The [learning rate](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Background) for the optimization algorithm, which is defined in a scheduled manner. This allows the specification of a decreasing learning rate as the optimization progresses.

The [momentum](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Momentum), which is used by some optimization algorithms. Here it is also defined through a schedule, which allows the momentum to decay as the optimization progresses (although the code below uses a single constant value of 0.999).

The gradient clipping threshold, used to help avoid the exploding gradient problem with RNNs.

The learner, which defines the optimization algorithm to be used during training. Here we use the [Adam](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Adam) optimization algorithm.

The progress_printer, which is an optional helper object that allows you to print learning statistics during model training.

And finally the trainer object, which wraps the model predicted output *Z*, the *cost* function, the model *error*, the *learner*, and optionally the *progress_printer*.
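To make the cost and error definitions above concrete, here is a small pure-Python sketch of softmax cross-entropy and classification error for a single time step. The logits and the 3-token vocabulary are invented for illustration, and the helper functions are assumptions written for this example; they are meant to show what the quantities measure, not how CNTK implements them.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_with_softmax(logits, one_hot_label):
    """Negative log-probability assigned to the true token."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs))

def classification_error(logits, one_hot_label):
    """1.0 if the highest-scoring token is not the true token, else 0.0."""
    predicted = logits.index(max(logits))
    return 0.0 if one_hot_label[predicted] == 1.0 else 1.0

logits = [2.0, 0.5, -1.0]   # made-up outputs of the final Dense layer
label = [1.0, 0.0, 0.0]     # the true next token is token 0

print(cross_entropy_with_softmax(logits, label))
print(classification_error(logits, label))
```

This also shows why the last Dense layer has no activation: the softmax is applied inside the cost, directly on the raw scores.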

In [8]:
error = C.classification_error(Z, Y)

cost = C.cross_entropy_with_softmax(Z, Y)

lr = C.learning_parameter_schedule([0.0005 * mb_size]*10 +
                                   [0.0001 * mb_size]*10 +
                                   [0.00005 * mb_size]*10 +
                                   [0.00001 * mb_size],
                                   minibatch_size=mb_size,
                                   epoch_size=num_mb*mb_size*avg_title_len)

m = C.momentum_schedule(0.999, minibatch_size=mb_size)

gc = 5.0

learner = C.adam(parameters=Z.parameters, lr=lr, momentum=m,
                 gradient_clipping_threshold_per_sample=gc,
                 gradient_clipping_with_truncation=True)

progress_printer = C.logging.ProgressPrinter(freq=num_mb, tag='Training')

trainer = C.Trainer(Z, (cost, error), learner, progress_printer)

C.logging.log_number_of_parameters(Z)

Training 3435731 parameters in 9 parameter tensors.


### Define a function to sample from the model at inference time

We use the function below to sample tokens, which are characters in this case, from the model output. After the model learns to predict the next token in a sequence, we can sample an entire sequence that looks very similar to sequences learned by the model.

The sampling process is represented in the following diagram:

<img src="./figures/fig6.png" alt="Sampling from an RNN model" style="width: 700px;"/>

We start by feeding the '>' token and a hidden state initialized as a vector of zeros to the network. Then, we sample from the softmax output according to one of the following strategies:

- argmax: we take the index of the maximum value in the model output as the vocabulary index of the sampled token.
- softmax: we apply the softmax function to the model output and optionally rescale it with the *temp* parameter, a value between 0 and 1, where 0 means sampling from the vocabulary uniformly at random and 1 means sampling from the vocabulary according to the unmodified softmax distribution.

We then feed the sampled token and the previous hidden state to the model and sample another token. We keep doing this until we sample an '\n' token.

Notice that a *temp* value between 0 and 1 smooths (flattens) the softmax distribution, which in turn causes the sampling of sequences that differ more from the learned distribution.
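The effect of *temp* is easy to see on a small made-up distribution. The sketch below mirrors the rescaling used in the `sample()` function (raise the softmax probabilities to the power *temp*, then renormalize). The function name `rescale` and the probabilities are assumptions for illustration only.

```python
def rescale(probs, temp):
    """Raise probabilities to the power `temp` and renormalize."""
    powered = [p ** temp for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

probs = [0.7, 0.2, 0.1]   # made-up softmax output over a 3-token vocabulary

# temp=1.0 leaves the distribution unchanged, temp=0.0 makes it uniform,
# and values in between move probability mass toward the less likely tokens
for temp in (1.0, 0.5, 0.0):
    print(temp, [round(p, 3) for p in rescale(probs, temp)])
```

Note that this power-based rescaling differs from the other common convention of dividing the logits by a temperature before the softmax; in that convention, small temperatures sharpen the distribution instead of flattening it.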

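A note about the *mask* argument: CNTK allows you to pass a mask together with the input and output values for training or inference. For each sequence you pass, the mask indicates whether it starts a new sequence (mask = True) or continues the sequence from the previous call (mask = False). This is why the sampling function below passes True for the first character only, while the training loop passes True for every title.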

In [9]:
def sample(n_chars, n_titles, seed='>', temp=1, use_argmax=False):
    output = []
    
    for i in range(n_titles):
        
        chars = []
        x = np.zeros((1, vocab_len), dtype=np.float32)
        idx = char_to_index[seed]
        x[0, idx] = 1 
            
        counter = 0
        eos = char_to_index['\n']
        if seed != '>':
            chars.append(seed)
        
        mask = [True]
        while (counter <= n_chars):
            arguments = ({X : [x]}, mask)
            mask = [False]
            p = Z.eval(arguments)[0][0]
            if use_argmax:
                p = C.argmax(p).eval()
                idx = int(p)
            else:
                p = C.softmax(p).eval()
                p = np.power(p, (temp))
                p = p / np.sum(p)
                idx = np.random.choice(range(vocab_len), p=p.ravel())
            c = index_to_char[idx]
            chars.append(c)
            x = np.zeros((1, vocab_len), dtype=np.float32)
            x[0, idx] = 1
                
            if idx == eos:
                counter = n_chars+1
            counter += 1
        
        if (chars[len(chars) - 1] != '\n'):
                chars.append('\n')
        output.append(''.join(chars))

    return(''.join(output))

### Train the model

CNTK also has a high-level API for model training, which is described [here](https://cntk.ai/pythondocs/Manual_How_to_train_using_declarative_and_imperative_API.html). It allows you to define a training session object that takes care of the training loop and is especially useful when training in a distributed fashion.

Another option, which we use here, is to explicitly control the training loop.

In this setup we define an outer loop that iterates over the entire training data. Each of these iterations is also known as an *epoch*. We also define an inner loop that iterates over the mini-batches. For each mini-batch, we feed the input and output values to the network in the *train_minibatch()* method of the *trainer* object. Finally we compute the loss for each mini-batch and for the entire epoch. For this model we also compute the [perplexity](https://en.wikipedia.org/wiki/Perplexity), which is a common evaluation metric for language models.
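Perplexity is the exponential of the average cross-entropy, so a model that guesses uniformly over a 70-character vocabulary has a perplexity of 70. The sketch below uses invented per-minibatch losses and computes an epoch perplexity in two ways: by averaging the per-minibatch perplexities, as the training loop in this notebook does, and by exponentiating the mean loss, which is the more common definition. The variable names and numbers are assumptions for illustration.

```python
import math

minibatch_costs = [1.30, 1.10, 1.25, 0.95]   # made-up average losses (in nats)
num_mb = len(minibatch_costs)

# Variant used by the training loop: mean of per-minibatch perplexities
perp_mean_of_exp = sum(math.exp(c) for c in minibatch_costs) / num_mb

# Common definition: exponential of the mean cross-entropy
perp_exp_of_mean = math.exp(sum(minibatch_costs) / num_mb)

print(perp_mean_of_exp, perp_exp_of_mean)

# Sanity check: a uniform guess over 70 characters has perplexity 70
uniform_loss = -math.log(1.0 / 70)
print(math.exp(uniform_loss))
```

The first variant is never smaller than the second (by Jensen's inequality), so the two values are close but not interchangeable when comparing against perplexities reported elsewhere.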

#### Important:
There is no need to run the cell below, as a trained model is already provided in the 'models' folder. It takes a long time to run.

In [None]:
# costs = []
# perps = []
# log = open('logs/generate_arxiv_char_log.txt' , 'a')

# iter = 100

# start_time = time.asctime()
# print('Start training time: ' + start_time)
# log.write('Start training time: ' + start_time + '\n\n')

# for i in range(iter):
    
#     print('Iteration: %i' % (i))
#     log.write('Iteration: %i' % (i) + '\n')
    
#     epoch_cost = 0
#     epoch_perp = 0
#     for k in range(num_mb):
#         mb_X, mb_Y = data_x[k], data_y[k]
#         masks = [True] * len(mb_X)
#         arguments = ({X : mb_X, Y : mb_Y}, masks)
#         trainer.train_minibatch(arguments)
#         minibatch_cost = trainer.previous_minibatch_loss_average
#         minibatch_error = trainer.previous_minibatch_evaluation_average
#         epoch_cost += minibatch_cost / num_mb
#         epoch_perp += np.exp(minibatch_cost) / num_mb
    
#     print("Cost after iteration %i: %f" % (i, epoch_cost))
#     log.write("Cost after iteration %i: %f" % (i, epoch_cost) + '\n')
#     costs.append(epoch_cost)
    
#     print('Perplexity after iteration %i: %f' % (i, epoch_perp))
#     log.write("Perplexity after iteration %i: %f" % (i, epoch_perp) + '\n')
#     perps.append(epoch_perp)
    
#     print('Sampling 5 titles from the model:\n')
#     log.write('Sampling 5 titles from the model:\n\n')
#     s = sample(max_title_len, 5, '>') + '\n\n'
#     print(s)
#     log.write(s)
    
# end_time = time.asctime()
# print('End training time: ' + end_time)
# log.write('End training time: ' + end_time + '\n')
    
# model_file = 'models/generate_arxiv_char_epoch_%d.dnn'  % (i+1)
# trainer.save_checkpoint(model_file)
    
# log.close()

# plt.figure()
# plt.plot(costs)

# plt.figure()
# plt.plot(perps)

### Generate some titles from the model

Here we sample some titles from the model. We sample 100 titles, each limited to the maximum title length found in the training data, using a value of 1 for the temperature parameter.

We then print the sampled titles that are not in the training data.

Before running the cell below you need to run all but the commented cells above.

In [10]:
trainer.restore_from_checkpoint('models/generate_arxiv_char_epoch_100.dnn')
Z = trainer.model

generated_titles = sample(max_title_len, 100, '>', temp=1).split('\n')
generated_titles = ['>' + title + '\n' for title in generated_titles[:-1]]
print('Number of generated titles: %d' % len(generated_titles))

# Keep only generated titles that do not appear verbatim in the training data
generated_titles = list(set(generated_titles) - set(data_train))
print('Number of generated titles not in the training data: %d' % len(generated_titles))

print('Examples of some generated titles not in the training data:\n')
titles = [w[1:-1] for w in generated_titles]
for title in titles[0:100]:
    print(title + '\n')

Number of generated titles: 100
Number of generated titles not in the training data: 100
Examples of some generated titles not in the training data:

activation regisel structure inference for diabetecting human-robot teams and mecons in iketse substrates

an extremal framework for drone squares

inhoduce-based deep ensemble reconstruction from road networks using monolinear dynamical recurrent two-stip method

learning multimodal word representation via pairwise spatial recurrent neural networks

reinforcement learning based visual saliency and geodesic distance estimation using gaussian mixture relevance

face-frizud, pseudo predictive modeling and latent factor analysis

learning limited memory based on clinical and quantitative inverses using context-aware valuation

unification of evolutionary algorithm work loop

exchangeable random minim: a simple youttunebcornel approach

stock truncation in a nonparametric model for the future

universal reinforcement learning algorithms: surf