In [2]:
import sys
import paddle.v2 as paddle

# PaddlePaddle init
paddle.init(use_gpu=True, trainer_count=1, log_error_clipping=True)


In [3]:
def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            print "\nPass %d, Batch %d, Cost %f, %s" % (
                event.pass_id, event.batch_id, event.cost, event.metrics)
        else:
            sys.stdout.write('.')
            sys.stdout.flush()
    if isinstance(event, paddle.event.EndPass):
        # tar archives are binary, so open the file in binary mode
        with open('./params_pass_%d.tar' % event.pass_id, 'wb') as f:
            parameters.to_tar(f)

        result = trainer.test(reader=test_reader, feeding=feeding)
        print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)



As noted in [Model Overview](#model-overview), this section provides implementations of both the Text CNN and the stacked bidirectional LSTM models.

### Text Convolutional Neural Network (Text CNN)

We define the network in the function `convolution_net`, shown in the following snippet.

Note: `paddle.networks.sequence_conv_pool` includes both convolution and pooling layer operations.
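
To make the note above concrete, here is a minimal pure-Python sketch of the idea behind a sequence convolution followed by pooling: slide a window of `context_len` embeddings over the sentence, project each window to `hidden_size` features, then take the element-wise maximum over all positions. The function names, the toy weights, and the choice of max pooling are assumptions made for illustration; padding at the sequence boundaries, biases, and the activation are left out, so this is not PaddlePaddle's implementation.

```python
def context_windows(seq, context_len):
    """Concatenate each run of `context_len` consecutive vectors into one vector."""
    return [sum(seq[i:i + context_len], [])
            for i in range(len(seq) - context_len + 1)]


def project(vec, weights):
    """Dense projection without bias: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def max_over_time(vectors):
    """Element-wise maximum across all window positions."""
    return [max(column) for column in zip(*vectors)]


def sequence_conv_pool_sketch(seq, context_len, weights):
    windows = context_windows(seq, context_len)
    features = [project(window, weights) for window in windows]
    return max_over_time(features)


# Toy sentence: 4 tokens with 2-dimensional embeddings (hypothetical values).
sentence = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [0.0, -1.0]]
# 2 hidden units over a window of 3 tokens (3 * 2 = 6 inputs per window).
toy_weights = [[1, 0, 0, 0, 0, 0],
               [0, 0, 0, 0, 0, 1]]
pooled = sequence_conv_pool_sketch(sentence, context_len=3, weights=toy_weights)
print(pooled)
```

The pooled vector has one entry per hidden unit regardless of sentence length, which is why the outputs of the `context_len=3` and `context_len=4` branches can both be fed to the same fully connected layer.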



In [4]:
def convolution_net(input_dim, class_dim=2, emb_dim=128, hid_dim=128):
    data = paddle.layer.data("word",
                             paddle.data_type.integer_value_sequence(input_dim))
    emb = paddle.layer.embedding(input=data, size=emb_dim)
    conv_3 = paddle.networks.sequence_conv_pool(
        input=emb, context_len=3, hidden_size=hid_dim)
    conv_4 = paddle.networks.sequence_conv_pool(
        input=emb, context_len=4, hidden_size=hid_dim)
    output = paddle.layer.fc(input=[conv_3, conv_4],
                             size=class_dim,
                             act=paddle.activation.Softmax())
    lbl = paddle.layer.data("label", paddle.data_type.integer_value(class_dim))
    cost = paddle.layer.classification_cost(input=output, label=lbl)
    return cost, output



1. Define input data and its dimension

    The parameter `input_dim` is the dictionary size, and `class_dim` is the number of categories. In `convolution_net`, the network input is declared with `paddle.layer.data`.

1. Define Classifier

    The Text CNN above extracts high-level features and maps them to a vector whose size equals the number of categories. The `paddle.activation.Softmax` activation then turns that vector into the probability that the sentence belongs to each category.

1. Define Loss Function

    In supervised learning, the labels of the training set are also declared with `paddle.layer.data`. During training, `paddle.layer.classification_cost` computes the cross-entropy loss, which serves as the network's output; during testing, the output is the set of probabilities computed by the classifier.
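
The classifier and loss steps above can be sketched in a few lines of plain Python. This is only an illustration of softmax followed by cross-entropy for a single example; the function names are made up here, and PaddlePaddle's own layers operate on batches.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[label])


# A model that cannot yet tell the two classes apart gives equal scores.
probs = softmax([0.0, 0.0])
loss = cross_entropy(probs, label=1)
print(probs, loss)
```

With two classes and equal probabilities the loss is ln 2, roughly 0.693, which is in line with the cost logged for the first batches of the training run at the end of this notebook.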

### Stacked Bidirectional LSTM

We define the network in the function `stacked_lstm_net` below.



In [5]:
def stacked_lstm_net(input_dim,
                     class_dim=2,
                     emb_dim=128,
                     hid_dim=512,
                     stacked_num=3):
    """
    A wrapper for the sentiment classification task.
    This network uses a bidirectional recurrent network
    consisting of `stacked_num` LSTM layers (three by default).
    This configuration is motivated by the following paper,
    but uses fewer layers.
        http://www.aclweb.org/anthology/P15-1109
    input_dim: size of the word dictionary.
    class_dim: number of categories.
    emb_dim: dimension of the word embedding.
    hid_dim: dimension of the hidden layers.
    stacked_num: number of stacked LSTM hidden layers (must be odd).
    """
    assert stacked_num % 2 == 1

    fc_para_attr = paddle.attr.Param(learning_rate=1e-3)
    lstm_para_attr = paddle.attr.Param(initial_std=0., learning_rate=1.)
    para_attr = [fc_para_attr, lstm_para_attr]
    bias_attr = paddle.attr.Param(initial_std=0., l2_rate=0.)
    relu = paddle.activation.Relu()
    linear = paddle.activation.Linear()

    data = paddle.layer.data("word",
                             paddle.data_type.integer_value_sequence(input_dim))
    emb = paddle.layer.embedding(input=data, size=emb_dim)

    fc1 = paddle.layer.fc(input=emb,
                          size=hid_dim,
                          act=linear,
                          bias_attr=bias_attr)
    lstm1 = paddle.layer.lstmemory(
        input=fc1, act=relu, bias_attr=bias_attr)

    inputs = [fc1, lstm1]
    for i in range(2, stacked_num + 1):
        fc = paddle.layer.fc(input=inputs,
                             size=hid_dim,
                             act=linear,
                             param_attr=para_attr,
                             bias_attr=bias_attr)
        lstm = paddle.layer.lstmemory(
            input=fc,
            reverse=(i % 2) == 0,
            act=relu,
            bias_attr=bias_attr)
        inputs = [fc, lstm]

    fc_last = paddle.layer.pooling(
        input=inputs[0], pooling_type=paddle.pooling.Max())
    lstm_last = paddle.layer.pooling(
        input=inputs[1], pooling_type=paddle.pooling.Max())
    output = paddle.layer.fc(input=[fc_last, lstm_last],
                             size=class_dim,
                             act=paddle.activation.Softmax(),
                             bias_attr=bias_attr,
                             param_attr=para_attr)

    lbl = paddle.layer.data("label", paddle.data_type.integer_value(class_dim))
    cost = paddle.layer.classification_cost(input=output, label=lbl)
    return cost, output



1. Define input data and its dimension

    The parameter `input_dim` is the dictionary size, and `class_dim` is the number of categories. In `stacked_lstm_net`, the network input is declared with `paddle.layer.data`.

1. Define Classifier

    The stacked bidirectional LSTM above extracts high-level features and maps them to a vector whose size equals the number of categories. The `paddle.activation.Softmax` activation then turns that vector into the probability that the sentence belongs to each category.

1. Define Loss Function

    In supervised learning, the labels of the training set are also declared with `paddle.layer.data`. During training, `paddle.layer.classification_cost` computes the cross-entropy loss, which serves as the network's output; during testing, the output is the set of probabilities computed by the classifier.
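
The "bidirectional" part of `stacked_lstm_net` comes from alternating the direction of successive LSTM layers: the loop passes `reverse=(i % 2) == 0`, so even-numbered layers read the sequence backwards. The sketch below only reproduces that indexing logic; it assumes the first layer runs forward because it is created without a `reverse` argument, and the helper name is hypothetical.

```python
def layer_directions(stacked_num):
    """Return the reading direction of each stacked LSTM layer (1-based order)."""
    directions = ["forward"]  # layer 1: no `reverse` argument, assumed forward
    for i in range(2, stacked_num + 1):
        # same test as `reverse=(i % 2) == 0` in stacked_lstm_net
        directions.append("reverse" if i % 2 == 0 else "forward")
    return directions


print(layer_directions(3))
```

With an odd `stacked_num`, which the `assert` in `stacked_lstm_net` enforces, the first and last layers both read in the same direction.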


To reiterate, we can invoke either `convolution_net` or `stacked_lstm_net` to build the model.



In [6]:
import urllib

ANNOTATED_COMMENTS_URL = 'https://ndownloader.figshare.com/files/7554634' 
ANNOTATIONS_URL = 'https://ndownloader.figshare.com/files/7554637' 


def download_file(url, fname):
    urllib.urlretrieve(url, fname)

# uncomment if redownloading is needed:
# download_file(ANNOTATED_COMMENTS_URL, 'attack_annotated_comments.tsv')
# download_file(ANNOTATIONS_URL, 'attack_annotations.tsv')

In [1]:
import pandas as pd
comments = pd.read_csv('attack_annotated_comments.tsv', sep = '\t', index_col = 0)
annotations = pd.read_csv('attack_annotations.tsv',  sep = '\t')

In [7]:
len(annotations['rev_id'].unique())

115864

In [8]:
# labels a comment as an attack if the majority of annotators did so
labels = annotations.groupby('rev_id')['attack'].mean() > 0.5
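
As a sanity check on the labeling rule above, here is a dependency-free sketch of what "mean of the `attack` votes per `rev_id`, greater than 0.5" computes. The toy rows and the helper name are invented for illustration; the notebook itself does this with a pandas `groupby`.

```python
from collections import defaultdict


def majority_labels(rows):
    """rows: iterable of (rev_id, attack) pairs, attack being 0 or 1."""
    votes = defaultdict(list)
    for rev_id, attack in rows:
        votes[rev_id].append(attack)
    # strict majority: a mean of exactly 0.5 is NOT labeled as an attack
    return {rev_id: (sum(v) / float(len(v))) > 0.5
            for rev_id, v in votes.items()}


toy_rows = [(1, 1), (1, 1), (1, 0),   # 2 of 3 annotators say attack
            (2, 1), (2, 0),           # tie
            (3, 0)]                   # no attack votes
toy_labels = majority_labels(toy_rows)
print(toy_labels)
```

Note that a tie between annotators yields `False`, because the comparison is strictly greater than 0.5.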

In [9]:
# join labels and comments
comments['attack'] = labels

In [10]:
# remove newline and tab tokens
comments['comment'] = comments['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))

In [11]:
import re
def strip_comment_to_word_list(comment):
    '''
    Utility function to clean comment text by removing @-mentions, links,
    and special characters with a simple regex, then lowercasing the result
    and splitting it on whitespace.
    '''
    return re.sub(r"(@[A-Za-z0-9]+)|([^'0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", comment).lower().split()
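
To see what the regex keeps and drops, the following standalone sketch applies the same pattern to a made-up comment. The sample sentence and the names `PATTERN` and `tokenize` are illustrative only.

```python
import re

# Same three alternatives as in strip_comment_to_word_list:
#   1. @-mentions, 2. any character that is not a letter, digit,
#   apostrophe, space or tab, 3. URLs of the form scheme://...
PATTERN = r"(@[A-Za-z0-9]+)|([^'0-9A-Za-z \t])|(\w+:\/\/\S+)"


def tokenize(comment):
    """Replace every match with a space, lowercase, split on whitespace."""
    return re.sub(PATTERN, " ", comment).lower().split()


tokens = tokenize("Don't ping @user1, see http://example.com/page NOW!")
print(tokens)
```

Apostrophes survive, so contractions such as "don't" stay a single token, while punctuation, mentions, and links disappear.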

In [12]:
comment_word_dict = {}
incr = 0

In [13]:
wiki_examples = []
for comment in comments["comment"].values:
    ls = strip_comment_to_word_list(comment)
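    for word in ls:
        if word not in comment_word_dict:
            # assign ids 0..N-1 so that every id is < len(comment_word_dict),
            # which is the dictionary size later passed to the network
            comment_word_dict[word] = incr
            incr += 1
    wiki_examples.append([comment_word_dict[word] for word in ls])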
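
An embedding layer declared with a dictionary size of N can only look up ids in the range 0 to N-1, so the word ids and the dictionary size have to agree. The sketch below builds a vocabulary that satisfies this by construction; the helper name and the toy sentences are assumptions for illustration.

```python
def build_vocab(token_lists):
    """Map each new word to the next free id and encode every sentence."""
    vocab = {}
    encoded = []
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)  # ids are 0, 1, 2, ...
        encoded.append([vocab[tok] for tok in tokens])
    return vocab, encoded


vocab, encoded = build_vocab([["you", "are", "great"],
                              ["you", "are", "not"]])
largest_id = max(i for sentence in encoded for i in sentence)
print(vocab, encoded, largest_id)
```

Because ids start at 0, the largest id is always one less than `len(vocab)`, so `len(vocab)` is a safe value for the input dimension.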

In [14]:
num_examples = len(wiki_examples)
num_examples

115864

In [15]:
# word_dict = paddle.dataset.imdb.word_dict()
word_dict = comment_word_dict

dict_dim = len(word_dict)
class_dim = 2

# option 1
[cost, output] = convolution_net(dict_dim, class_dim=class_dim)
# option 2
# [cost, output] = stacked_lstm_net(dict_dim, class_dim=class_dim, stacked_num=3)



## Model Training

### Define Parameters

First, we create the model parameters from the `cost` layer returned by the model configuration above.



In [16]:
# create parameters
parameters = paddle.parameters.create(cost)



### Create Trainer

Before creating the trainer, we need to configure the optimization algorithm.
Here we use the `Adam` optimizer from `paddle.optimizer`, together with L2 regularization and model averaging.



In [17]:
# create optimizer
adam_optimizer = paddle.optimizer.Adam(
    learning_rate=2e-3,
    regularization=paddle.optimizer.L2Regularization(rate=8e-4),
    model_average=paddle.optimizer.ModelAverage(average_window=0.5))

# create trainer
trainer = paddle.trainer.SGD(cost=cost,
                             parameters=parameters,
                             update_equation=adam_optimizer)
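
For readers unfamiliar with Adam, here is a minimal sketch of one update step for a single scalar parameter, following the standard formulation of the algorithm. The values of `beta1`, `beta2`, and `eps` are the commonly cited defaults and are assumptions here, not values read from PaddlePaddle; the L2 regularization and model averaging configured above are left out.

```python
import math


def adam_step(param, grad, m, v, t, lr=2e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


new_param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(new_param)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by almost exactly `lr`, independent of the gradient's magnitude.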



### Training

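`paddle.dataset.imdb.train()` yields records one at a time during each pass; the records are shuffled and then grouped into batches for training. This notebook first sets up that pipeline for IMDB and then overrides it with `custom_wiki_reader`, which draws random examples from the Wikipedia comments.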
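
The reader, shuffle, and batch stages can be pictured as nested generator factories: each stage takes a callable that returns an iterator and returns a new callable of the same kind. The sketch below mimics that pattern in plain Python; `shuffle_sketch`, `batch_sketch`, and `toy_reader` are hypothetical stand-ins and not PaddlePaddle's own functions.

```python
import random


def shuffle_sketch(reader, buf_size, rng):
    """Shuffle records within a buffer of up to buf_size items."""
    def shuffled():
        buf = []
        for item in reader():
            buf.append(item)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for x in buf:
                    yield x
                buf = []
        rng.shuffle(buf)  # flush whatever is left at the end
        for x in buf:
            yield x
    return shuffled


def batch_sketch(reader, batch_size):
    """Group consecutive records into lists of batch_size items."""
    def batched():
        batch = []
        for item in reader():
            batch.append(item)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:  # final, possibly smaller batch
            yield batch
    return batched


def toy_reader():
    """Yield (word_ids, label) records, like the readers in this notebook."""
    for i in range(10):
        yield ([i, i + 1], i % 2)


pipeline = batch_sketch(
    shuffle_sketch(toy_reader, buf_size=4, rng=random.Random(0)),
    batch_size=3)
batches = list(pipeline())
print([len(b) for b in batches])
```

Each stage receives the function itself (`toy_reader`, not `toy_reader()`), so the pipeline can be restarted from scratch at the beginning of every pass.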



In [18]:
import random
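def custom_wiki_reader():
    # One call yields num_examples random draws (with replacement) and then
    # stops, so a training pass and trainer.test() both terminate.
    for _ in range(num_examples):
        idx = random.randint(0, num_examples - 1)
        # read the label from `comments` so it stays aligned with wiki_examples
        yield (wiki_examples[idx], int(comments['attack'].values[idx]))

In [19]:
train_reader = paddle.batch(
    paddle.reader.shuffle(
        lambda: paddle.dataset.imdb.train(word_dict), buf_size=1000),
    batch_size=100)

test_reader = paddle.batch(
    lambda: paddle.dataset.imdb.test(word_dict), batch_size=100)

# ---- let's overwrite the above

train_reader = paddle.batch(paddle.reader.shuffle(custom_wiki_reader, buf_size=1000), batch_size=25)

# paddle.batch expects a callable reader, not a generator object
test_reader = paddle.batch(custom_wiki_reader, batch_size=10)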
# i know this is committing ML heresy by not separating train and test data, but this isn't a research project...


`feeding` specifies which field of each yielded record feeds which `paddle.layer.data` layer. For instance, the first column of each record produced by `paddle.dataset.imdb.train()` (and by `custom_wiki_reader` here) corresponds to the `word` feature.



In [20]:
feeding = {'word': 0, 'label': 1}
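
Conceptually, the `feeding` dictionary is a lookup from data-layer name to position inside each record. The following sketch shows that mapping applied to one record; `apply_feeding` is a hypothetical helper written for illustration, not part of PaddlePaddle.

```python
def apply_feeding(record, feeding):
    """Pick the field of `record` that belongs to each named data layer."""
    return {name: record[index] for name, index in feeding.items()}


feeding_sketch = {'word': 0, 'label': 1}
record = ([12, 7, 431], 1)  # (word-id sequence, label), made-up values
named = apply_feeding(record, feeding_sketch)
print(named)
```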



The callback function `event_handler`, defined at the top of this notebook, is invoked to track training progress whenever a predefined event occurs.
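
The callback pattern itself is simple: the training loop creates an event object and hands it to the handler, which decides what to do based on the event's type. The sketch below imitates this with stand-in event classes; `EndIteration` and `EndPass` here are local toy classes, not the ones in `paddle.event`, and the loop is a fake driver rather than a real trainer.

```python
class EndIteration(object):
    """Stand-in for an 'end of batch' event."""
    def __init__(self, pass_id, batch_id, cost):
        self.pass_id = pass_id
        self.batch_id = batch_id
        self.cost = cost


class EndPass(object):
    """Stand-in for an 'end of pass' event."""
    def __init__(self, pass_id):
        self.pass_id = pass_id


def make_handler(log, every=100):
    """Build a handler that records a line every `every` batches and per pass."""
    def handler(event):
        if isinstance(event, EndIteration):
            if event.batch_id % every == 0:
                log.append("pass %d batch %d cost %f" % (
                    event.pass_id, event.batch_id, event.cost))
        elif isinstance(event, EndPass):
            log.append("end of pass %d" % event.pass_id)
    return handler


log = []
handler = make_handler(log)
for batch_id in range(201):  # fake driver: 201 batches, then the pass ends
    handler(EndIteration(pass_id=0, batch_id=batch_id, cost=0.69))
handler(EndPass(pass_id=0))
print(log)
```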




Finally, we can invoke `trainer.train` to start training:



In [21]:
trainer.train(
    reader=train_reader,
    event_handler=event_handler,
    feeding=feeding,
    num_passes=10)



Pass 0, Batch 0, Cost 0.692342, {'classification_error_evaluator': 0.3199999928474426}
...................................................................................................
Pass 0, Batch 100, Cost 0.692865, {'classification_error_evaluator': 0.36000001430511475}
...................................................................................................
Pass 0, Batch 200, Cost nan, {'classification_error_evaluator': 1.0}
...................................................................................................
Pass 0, Batch 300, Cost nan, {'classification_error_evaluator': 1.0}
...................................................................................................
Pass 0, Batch 400, Cost nan, {'classification_error_evaluator': 1.0}
...................................................................................................
Pass 0, Batch 500, Cost nan, {'classification_error_evaluator': 1.0}
..............................................

KeyboardInterrupt: 