# Tutorial 2.  Sentence classification with word embeddings

# Colab link

https://colab.research.google.com/drive/1Dnr3wC3FBf4KS0GOVNlEbp5fg74f0FM1

This tutorial is intended to familiarize participants of the Conversational Intelligence Summer School-2019 with text classification in **DeepPavlov**.

We are going to implement a **multi-layer perceptron** in `Keras` with the `TensorFlow` backend. Preprocessed, tokenized texts are **padded and vectorized using GloVe word embeddings** and then fed to the neural network.

The tutorial has the following **structure**:

1. [Data preparation](#Data-preparation)

2. [Library and requirements installation](#Library-and-requirements-installation)

3. [Dataset Reader](#Dataset-Reader): [docs link](https://deeppavlov.readthedocs.io/en/latest/apiref/dataset_readers.html)

4. [Dataset Iterator](#Dataset-Iterator): [docs link](https://deeppavlov.readthedocs.io/en/latest/apiref/dataset_iterators.html)

5. [Preprocessor](#Preprocessor): [docs link](https://deeppavlov.readthedocs.io/en/latest/components/data_processors.html)

6. [Tokenizer](#Tokenizer): [docs link](https://deeppavlov.readthedocs.io/en/latest/components/data_processors.html)

7. [GloVe Embedder](#Embedder): [docs link](https://deeppavlov.readthedocs.io/en/latest/components/data_processors.html)
[pre-trained embeddings link](https://deeppavlov.readthedocs.io/en/latest/intro/pretrained_vectors.html)

8. [Vocabulary of classes](#Vocabulary-of-classes)

9. [Keras Classifier](#Classifier): [docs link](https://deeppavlov.readthedocs.io/en/latest/components/classifiers.html)

## Data preparation

This tutorial uses the Stanford Sentiment Treebank (SST) dataset from this [paper](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf).

The dataset contains unlabelled sentences divided into train/dev/test sets, and phrases labelled with a float sentiment value. Most of the sentences also appear in the labelled list of phrases. Therefore, we are going to extract the sentences that coincide with labelled phrases, convert their float sentiment to fine-grained classes (5 classes: very negative, negative, neutral, positive, very positive) and to binary classes (negative and positive only), and build two classifiers.

Let's download and extract the SST dataset.

In [1]:
!wget http://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip

--2019-06-24 12:57:45--  http://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip [following]
--2019-06-24 12:57:45--  https://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6372817 (6.1M) [application/zip]
Saving to: ‘stanfordSentimentTreebank.zip’


2019-06-24 12:57:47 (3.03 MB/s) - ‘stanfordSentimentTreebank.zip’ saved [6372817/6372817]



In [2]:
!unzip stanfordSentimentTreebank.zip

Archive:  stanfordSentimentTreebank.zip
   creating: stanfordSentimentTreebank/
  inflating: stanfordSentimentTreebank/datasetSentences.txt  
   creating: __MACOSX/
   creating: __MACOSX/stanfordSentimentTreebank/
  inflating: __MACOSX/stanfordSentimentTreebank/._datasetSentences.txt  
  inflating: stanfordSentimentTreebank/datasetSplit.txt  
  inflating: __MACOSX/stanfordSentimentTreebank/._datasetSplit.txt  
  inflating: stanfordSentimentTreebank/dictionary.txt  
  inflating: __MACOSX/stanfordSentimentTreebank/._dictionary.txt  
  inflating: stanfordSentimentTreebank/original_rt_snippets.txt  
  inflating: __MACOSX/stanfordSentimentTreebank/._original_rt_snippets.txt  
  inflating: stanfordSentimentTreebank/README.txt  
  inflating: __MACOSX/stanfordSentimentTreebank/._README.txt  
  inflating: stanfordSentimentTreebank/sentiment_labels.txt  
  inflating: __MACOSX/stanfordSentimentTreebank/._sentiment_labels.txt  
  inflating: stanfordSentimentTreebank/SOStr.txt  
  inflating: stanfo

In [0]:
import numpy as np
import pandas as pd

Read the dictionary of phrases that are labelled with sentiment (the labels are in another file).

In [4]:
dictionary = pd.read_csv("./stanfordSentimentTreebank/dictionary.txt", 
                         sep="|", header=None, names=["phrase", "id"]) 
print(dictionary.shape[0])
dictionary.head()

239232


Unnamed: 0,phrase,id
0,!,0
1,! ',22935
2,! '',18235
3,! Alas,179257
4,! Brilliant,22936


Read the file with sentiment labels of phrases.

In [5]:
labels = pd.read_csv("./stanfordSentimentTreebank/sentiment_labels.txt", sep="|") 
labels.set_index("phrase ids", inplace=True)
print(labels.shape[0])
labels.head()

239232


phrase ids,sentiment values
0,0.5
1,0.5
2,0.44444
3,0.5
4,0.42708


Read the sentences.

In [6]:
sentences = pd.read_csv("./stanfordSentimentTreebank/datasetSentences.txt", sep="\t")
print(sentences.shape[0])
sentences.head()

11855


Unnamed: 0,sentence_index,sentence
0,1,The Rock is destined to be the 21st Century 's...
1,2,The gorgeously elaborate continuation of `` Th...
2,3,Effective but too-tepid biopic
3,4,If you sometimes like to go to the movies to h...
4,5,"Emerges as something rare , an issue movie tha..."


Read the file with the split of sentences into train/dev/test parts.

In [7]:
split = pd.read_csv("./stanfordSentimentTreebank/datasetSplit.txt", sep=",") 
split.set_index("sentence_index", inplace=True)
print(split.shape[0])
split.head()

11855


sentence_index,splitset_label
1,1
2,1
3,2
4,2
5,2


Now we need to merge the dataframes, so we first rename the columns.

In [0]:
dictionary.rename(columns={"phrase": "text", "id": "phrase_id"}, inplace=True)
labels.rename(columns={"phrase ids": "phrase_id"}, inplace=True)
sentences.rename(columns={"sentence": "text", "sentence_index": "sent_id"}, inplace=True)
split.rename(columns={"sentence_index": "sent_id"}, inplace=True)

Let's merge them!

In [9]:
df = pd.merge(sentences, dictionary)
df = df.join(labels, on="phrase_id")
df = df.join(split, on="sent_id")

print(df.shape[0])
df.head()

11286


Unnamed: 0,sent_id,text,phrase_id,sentiment values,splitset_label
0,1,The Rock is destined to be the 21st Century 's...,226166,0.69444,1
1,2,The gorgeously elaborate continuation of `` Th...,226300,0.83333,1
2,3,Effective but too-tepid biopic,13995,0.51389,2
3,4,If you sometimes like to go to the movies to h...,14123,0.73611,2
4,5,"Emerges as something rare , an issue movie tha...",13999,0.86111,2


We have obtained a dataframe of 11286 rows: the sentences that are contained in the set of labelled phrases.
Now we need to convert the float sentiment values to classes.

In [0]:
def get_binary_label(x):
    """
    For binary classification we take only 
    negative sentences (sentiment <= 0.4)
    and positive sentences (sentiment > 0.6)
    """
    if x <= 0.4:
        return "negative"
    elif x > 0.6:
        return "positive"
    
def get_fine_grained_label(x):
    """
    For fine-grained classification we divide sentiment range [0, 1]
    into 5 intervals:
    [0, 0.2] - very negative
    (0.2, 0.4] - negative
    (0.4, 0.6] - neutral
    (0.6, 0.8] - positive
    (0.8, 1.] - very positive
    """
    if x <= 0.2:
        return "very_negative"
    elif x <= 0.4:
        return "negative"
    elif x <= 0.6:
        return "neutral"
    elif x <= 0.8:
        return "positive"
    else:
        return "very_positive"
    
df["binary_label"] = df["sentiment values"].apply(get_binary_label)
df["fine_grained_label"] = df["sentiment values"].apply(get_fine_grained_label)

In [11]:
df.head()

Unnamed: 0,sent_id,text,phrase_id,sentiment values,splitset_label,binary_label,fine_grained_label
0,1,The Rock is destined to be the 21st Century 's...,226166,0.69444,1,positive,positive
1,2,The gorgeously elaborate continuation of `` Th...,226300,0.83333,1,positive,very_positive
2,3,Effective but too-tepid biopic,13995,0.51389,2,,neutral
3,4,If you sometimes like to go to the movies to h...,14123,0.73611,2,positive,positive
4,5,"Emerges as something rare , an issue movie tha...",13999,0.86111,2,positive,very_positive
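
The interval boundaries listed in the `get_fine_grained_label` docstring are easy to get wrong in a chain of `elif` branches, so it can help to check them separately. The following is a minimal sketch, not part of the original notebook: the names `UPPER_BOUNDS`, `CLASS_NAMES` and `fine_grained_label` are introduced here for illustration, and the mapping is done with the standard `bisect` module instead of explicit comparisons.

```python
from bisect import bisect_left

# Upper bounds of the first four intervals; the fifth interval is (0.8, 1.0].
UPPER_BOUNDS = [0.2, 0.4, 0.6, 0.8]
CLASS_NAMES = ["very_negative", "negative", "neutral", "positive", "very_positive"]


def fine_grained_label(score):
    """Map a sentiment score in [0, 1] to one of five right-closed intervals."""
    # bisect_left returns the index of the first bound >= score,
    # so a score equal to a bound stays in the lower class.
    return CLASS_NAMES[bisect_left(UPPER_BOUNDS, score)]


# Scores taken from the first rows of the dataframe, plus two boundary values.
for score in [0.69444, 0.83333, 0.51389, 0.2, 0.8]:
    print(score, fine_grained_label(score))
```

A check like this makes it visible when a score such as 0.69444 does not land in the `(0.6, 0.8]` interval that the docstring promises.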


Hurray! We have datasets for classification on **fine-grained** and **binary** sentiment labels! Let's save them.

In [12]:
train_df = df.loc[df["splitset_label"] == 1, ["text", "fine_grained_label"]]
valid_df = df.loc[df["splitset_label"] == 3, ["text", "fine_grained_label"]]
test_df = df.loc[df["splitset_label"] == 2, ["text", "fine_grained_label"]]

train_df.to_csv("train_fine_grained.csv", index=False)
valid_df.to_csv("valid_fine_grained.csv", index=False)
test_df.to_csv("test_fine_grained.csv", index=False)

train_df.shape, valid_df.shape, test_df.shape

((8117, 2), (1044, 2), (2125, 2))

In [13]:
# we need to drop NaNs (they occur only in the binary_label column and mark neutral sentences)
df.dropna(inplace=True)

train_df = df.loc[df["splitset_label"] == 1, ["text", "binary_label"]]
valid_df = df.loc[df["splitset_label"] == 3, ["text", "binary_label"]]
test_df = df.loc[df["splitset_label"] == 2, ["text", "binary_label"]]

train_df.to_csv("train_binary.csv", index=False)
valid_df.to_csv("valid_binary.csv", index=False)
test_df.to_csv("test_binary.csv", index=False)

train_df.shape, valid_df.shape, test_df.shape

((6568, 2), (825, 2), (1749, 2))

## Library and requirements installation

We are going to implement an MLP in Keras over token-level GloVe embeddings.

Let's install the library and the dependencies for Keras.

In [14]:
!pip install deeppavlov

Collecting deeppavlov
  Downloading https://files.pythonhosted.org/packages/30/30/912a9ee9094140247718a08fd4461357864e2d13e9e153c9e454c2020747/deeppavlov-0.3.1-py3-none-any.whl (673kB)
     |████████████████████████████████| 675kB 2.9MB/s 
Collecting tqdm==4.23.4 (from deeppavlov)
  Downloading https://files.pythonhosted.org/packages/93/24/6ab1df969db228aed36a648a8959d1027099ce45fad67532b9673d533318/tqdm-4.23.4-py2.py3-none-any.whl (42kB)
     |████████████████████████████████| 51kB 14.7MB/s 
Collecting scikit-learn==0.19.1 (from deeppavlov)
  Downloading https://files.pythonhosted.org/packages/3d/2d/9fbc7baa5f44bc9e88ffb7ed32721b879bfa416573e85031e16f52569bc9/scikit_learn-0.19.1-cp36-cp36m-manylinux1_x86_64.whl (12.4MB)
     |████████████████████████████████| 12.4MB 43.8MB/s 
Collecting scipy==1.1.0 (from deeppavlov)
  Downloading https://files.pythonhosted.org/packages/a8/0b/f163da98d3a01b3e0ef1cab8dd2123c34aee2bafbb1c5bffa354cc8a173

In [0]:
!python -m deeppavlov install intents_snips

2019-06-19 11:15:02.245 INFO in 'deeppavlov.core.common.file'['file'] at line 30: Interpreting 'intents_snips' as '/usr/local/lib/python3.6/dist-packages/deeppavlov/configs/classifiers/intents_snips.json'
Collecting tensorflow==1.10.0
  Downloading https://files.pythonhosted.org/packages/ee/e6/a6d371306c23c2b01cd2cb38909673d17ddd388d9e4b3c0f6602bfd972c8/tensorflow-1.10.0-cp36-cp36m-manylinux1_x86_64.whl (58.4MB)
     |████████████████████████████████| 58.4MB 40.1MB/s 
Collecting tensorboard<1.11.0,>=1.10.0 (from tensorflow==1.10.0)
  Downloading https://files.pythonhosted.org/packages/c6/17/ecd918a004f297955c30b4fffbea100b1606c225dbf0443264012773c3ff/tensorboard-1.10.0-py3-none-any.whl (3.3MB)
     |████████████████████████████████| 3.3MB 40.7MB/s 
Collecting setuptools<=39.1.0 (from tensorflow==1.10.0)
  Downloading https://files.pythonhosted.org/packages/8c/10/79282747f9169f21c053c562a0baa21815a8c7879be97abd930dbcf862e8/setuptools-39.1.0-py2.py3-no

## Dataset Reader

DatasetReaders are components for reading datasets from files. DeepPavlov contains several DatasetReaders; one can either use an existing DatasetReader or build a custom one.

The only requirement concerns the output of a **DatasetReader**:
* the output must be a dictionary with three fields: "train", "valid" and "test",
* each dictionary value must be a list of the corresponding samples,
* each sample must be a tuple (x, y), where x, y or both can also be lists of several inputs.

**DOCS:** http://docs.deeppavlov.ai/en/latest/apiref/dataset_readers.html

In [0]:
from deeppavlov.dataset_readers.basic_classification_reader import BasicClassificationDatasetReader

In [0]:
reader = BasicClassificationDatasetReader()
data = reader.read(data_path="./", 
                   train="train_binary.csv", valid="valid_binary.csv", test="test_binary.csv",
                   x="text", y="binary_label")

In [17]:
data.keys()

dict_keys(['train', 'valid', 'test'])

For every sample we store the label(s) as a list because we don't know in advance whether the task is binary, multi-class or multi-label classification.

In [18]:
data["train"][0]

("The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
 ['positive'])
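
To make the required output format concrete, here is a minimal sketch of a hand-written reader that produces the same structure using only the standard library. It is an illustration under stated assumptions, not DeepPavlov's implementation: `read_split`, `read_dataset` and the in-memory toy files are hypothetical names and made-up rows, and the sketch does not inherit from any DeepPavlov base class.

```python
import csv
import io


def read_split(csv_file, x="text", y="binary_label"):
    """Read one CSV split into a list of (text, [label]) samples."""
    # Labels are wrapped in a list, mirroring the sample format of the reader.
    return [(row[x], [row[y]]) for row in csv.DictReader(csv_file)]


def read_dataset(files):
    """Build the {"train": ..., "valid": ..., "test": ...} dictionary.

    `files` maps a split name to an open text file; missing splits become [].
    """
    return {name: read_split(files[name]) if name in files else []
            for name in ("train", "valid", "test")}


# In-memory stand-ins for train_binary.csv and valid_binary.csv (made-up rows).
toy_files = {
    "train": io.StringIO('text,binary_label\n'
                         '"A lovely , warm film .",positive\n'
                         'Dull and slow .,negative\n'),
    "valid": io.StringIO("text,binary_label\nGreat acting .,positive\n"),
}
toy_data = read_dataset(toy_files)
print(toy_data["train"][0])
```

Returning an empty list for a missing split is a design choice of this sketch; it keeps all three keys present so that downstream code can index them unconditionally.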

## Dataset Iterator

DatasetIterators are components for iterating over datasets. DeepPavlov contains several DatasetIterators; one can either use an existing iterator or build a custom one.

A DatasetIterator must have the following methods:
* **gen_batches** generates batches of inputs and expected outputs for training neural networks. Each output is a tuple of a batch of inputs and a batch of expected outputs.
* **get_instances** returns all data for a selected data type ("train", "valid", "test"). The output is a tuple of all inputs and all expected outputs for that data type.
* **split** merges or splits the data of a selected data type from the DatasetReader ("train", "valid", "test").
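
The first two methods can be sketched in a few lines of plain Python. This is only an illustration of the contract described above, assuming the dictionary format produced by a DatasetReader; `ToyIterator` is a hypothetical class, it omits the `split` method, and it is not how DeepPavlov implements its iterators.

```python
import random


class ToyIterator:
    """Illustrative iterator over {"train": [(x, y), ...], ...} data."""

    def __init__(self, data, seed=None):
        self.data = data
        self.rng = random.Random(seed)

    def gen_batches(self, batch_size, data_type="train", shuffle=True):
        """Yield (batch_of_inputs, batch_of_outputs) tuples."""
        samples = list(self.data[data_type])
        if shuffle:
            self.rng.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            xs, ys = zip(*batch)
            yield xs, ys

    def get_instances(self, data_type="train"):
        """Return (all_inputs, all_outputs) for the given data type."""
        samples = self.data[data_type]
        if not samples:
            return (), ()
        return tuple(zip(*samples))


toy_data = {
    "train": [("good", ["positive"]), ("bad", ["negative"]), ("fine", ["positive"])],
    "valid": [],
    "test": [],
}
toy_iterator = ToyIterator(toy_data, seed=42)
toy_batches = list(toy_iterator.gen_batches(batch_size=2))
print([len(xs) for xs, _ in toy_batches])
```

Note that the last batch may be smaller than `batch_size` when the number of samples is not a multiple of it.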

**DOCS:** http://docs.deeppavlov.ai/en/latest/apiref/dataset_iterators.html

In [0]:
from deeppavlov.dataset_iterators.basic_classification_iterator import BasicClassificationDatasetIterator

In [0]:
iterator = BasicClassificationDatasetIterator(data, seed=42, shuffle=True)

## Preprocessor

We can preprocess the text according to our needs.
Let's define the simplest preprocessor: lower-casing.

**DOCS:** http://docs.deeppavlov.ai/en/latest/apiref/models/preprocessors.html

In [21]:
from deeppavlov.models.preprocessors.str_lower import StrLower

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package perluniprops to /root/nltk_data...
[nltk_data]   Unzipping misc/perluniprops.zip.
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping corpora/nonbreaking_prefixes.zip.


In [0]:
preprocessor = StrLower()

In [23]:
preprocessor(["The Rock is destined to be the 21st Century 's new `` Conan ''."])

["the rock is destined to be the 21st century 's new `` conan ''."]

## Tokenizer

We need to tokenize our texts because we are going to use word embeddings.
DeepPavlov contains several tokenizers; one can choose the most appropriate one.

**DOCS:** http://docs.deeppavlov.ai/en/latest/apiref/models/tokenizers.html

In [0]:
from deeppavlov.models.tokenizers.nltk_tokenizer import NLTKTokenizer

In [0]:
tokenizer = NLTKTokenizer()

In [26]:
tokenizer(["The Rock is destined to be the 21st Century 's new `` Conan ''."])

[['The',
  'Rock',
  'is',
  'destined',
  'to',
  'be',
  'the',
  '21st',
  'Century',
  "'",
  's',
  'new',
  '``',
  'Conan',
  "''."]]

## Embedder

We are planning to use non-trainable GloVe word embeddings.

**DOCS:** http://docs.deeppavlov.ai/en/latest/apiref/models/embedders.html

Now we need to download the GloVe embeddings file. One can download it from [here](https://nlp.stanford.edu/projects/glove/), but that download is more than 800 MB. To save time, you can download the GloVe embeddings file from DeepPavlov (a download of about 350 MB).

In [27]:
from deeppavlov.core.data.utils import download

download("./glove.6B.100d.txt", source_url="http://files.deeppavlov.ai/embeddings/glove.6B.100d.txt")

2019-06-24 13:01:03.109 INFO in 'deeppavlov.core.data.utils'['utils'] at line 63: Downloading from http://files.deeppavlov.ai/embeddings/glove.6B.100d.txt to /content/glove.6B.100d.txt
347MB [00:21, 16.2MB/s]


Now we can define the GloVeEmbedder. The `pad_zero` parameter, set to `True` here, makes the embedder pad every embedded sample in a batch up to the length of the longest sample.
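
What padding to the longest sample means can be shown with a framework-free sketch. Everything in it is an assumption made for illustration: `embed_and_pad` is a hypothetical helper, the 3-dimensional vectors are made up, unknown tokens are mapped to zero vectors, and padding is appended at the end of each sample.

```python
def embed_and_pad(batch_of_tokens, vectors, dim):
    """Embed tokens and pad each sample with zero vectors to the batch maximum."""
    zero = [0.0] * dim
    max_len = max(len(tokens) for tokens in batch_of_tokens)
    embedded = []
    for tokens in batch_of_tokens:
        # Unknown tokens are mapped to zero vectors in this sketch.
        sample = [list(vectors.get(token, zero)) for token in tokens]
        # Append zero vectors until the sample reaches the longest length.
        sample += [list(zero) for _ in range(max_len - len(tokens))]
        embedded.append(sample)
    return embedded


# Made-up 3-dimensional vectors; the GloVe vectors used here have 100 dimensions.
toy_vectors = {"the": [0.1, 0.2, 0.3], "rock": [0.4, 0.5, 0.6]}
toy_batch = [["the", "rock", "is", "destined"], ["the", "rock"]]
padded = embed_and_pad(toy_batch, toy_vectors, dim=3)
print(len(padded), len(padded[0]), len(padded[0][0]))
```

After padding, every sample in the batch has the same number of rows, so the batch can be stacked into a single 3-dimensional array of shape batch_size X number_of_tokens X embedding_size.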

In [28]:
from deeppavlov.models.embedders.glove_embedder import GloVeEmbedder

embedder = GloVeEmbedder(load_path="./glove.6B.100d.txt", 
                         pad_zero=True  # means whether to pad up to the longest sample in a batch
                        )

2019-06-24 13:01:25.653 INFO in 'deeppavlov.models.embedders.glove_embedder'['glove_embedder'] at line 52: [loading GloVe embeddings from `/content/glove.6B.100d.txt`]
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


In [29]:
embedder(["The Rock is destined to be the 21st Century 's new `` Conan ''.",
          "The Rock is destined..."])

array([[[ 0.      ,  0.      ,  0.      , ...,  0.      ,  0.      ,
          0.      ],
        [-0.20314 ,  0.50467 , -0.25223 , ..., -0.34618 , -0.18627 ,
         -0.31606 ],
        [-0.52606 , -0.066991, -0.17351 , ..., -0.79123 ,  0.047581,
          0.084428],
        ...,
        [-0.34562 , -0.24993 ,  0.58678 , ..., -1.3106  ,  1.0294  ,
         -0.058794],
        [-0.34562 , -0.24993 ,  0.58678 , ..., -1.3106  ,  1.0294  ,
         -0.058794],
        [-0.33979 ,  0.20941 ,  0.46348 , ..., -0.23394 ,  0.47298 ,
         -0.028803]],

       [[ 0.      ,  0.      ,  0.      , ...,  0.      ,  0.      ,
          0.      ],
        [-0.20314 ,  0.50467 , -0.25223 , ..., -0.34618 , -0.18627 ,
         -0.31606 ],
        [-0.52606 , -0.066991, -0.17351 , ..., -0.79123 ,  0.047581,
          0.084428],
        ...,
        [ 0.      ,  0.      ,  0.      , ...,  0.      ,  0.      ,
          0.      ],
        [ 0.      ,  0.      ,  0.      , ...,  0.      ,  0.      ,
   

In [30]:
embedder(["The Rock is destined to be the 21st Century 's new `` Conan ''.",
          "The Rock is destined..."]).shape

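(2, 63, 100)

Note that the second dimension, 63, equals the number of characters in the first string. The embedder expects a batch of token lists, so a raw string is iterated character by character. In the training pipeline, pass the output of the tokenizer to the embedder rather than raw strings.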

## Vocabulary of classes

By default, we assume that classes may be given as strings. Therefore, we need to convert them into a representation more appropriate for a classifier. For example, neural classifiers typically take a **one-hot** representation of classes. To get a one-hot representation, we have to collect a dictionary of all the classes that appear in the data (if needed, one can add an "unknown" class), map each sample's class to its index and convert the indices to one-hot vectors.

**DOCS:** http://docs.deeppavlov.ai/en/latest/apiref/core/data.html

In [0]:
from deeppavlov.core.data.simple_vocab import SimpleVocabulary

In [32]:
vocab = SimpleVocabulary(save_path="./binary_classes.dict")



In [0]:
vocab.fit(iterator.get_instances(data_type="train")[1])

In [34]:
list(vocab.items())

[('positive', 0), ('negative', 1)]

In [35]:
vocab(["positive", "positive", "negative"])

[0, 0, 1]

In [36]:
vocab([0, 0, 1])

['positive', 'positive', 'negative']

**One-hotter**

**DOCS:** http://docs.deeppavlov.ai/en/latest/apiref/models/preprocessors.html

In [0]:
from deeppavlov.models.preprocessors.one_hotter import OneHotter

In [0]:
one_hotter = OneHotter(depth=vocab.len, 
                       single_vector=True  # means we want to have one vector per sample
                      )

In [39]:
one_hotter(vocab(["positive", "positive", "negative"]))

[array([1., 0.], dtype=float32),
 array([1., 0.], dtype=float32),
 array([0., 1.], dtype=float32)]
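
The two steps, indexing classes and one-hot encoding, can also be written out by hand. The sketch below is only an illustration with hypothetical helper names (`build_class_index`, `one_hot`); it assigns indices in order of first appearance, which is an assumption of this sketch and not necessarily the ordering `SimpleVocabulary` uses.

```python
def build_class_index(label_lists):
    """Assign an index to every class in order of first appearance."""
    index = {}
    for labels in label_lists:
        for label in labels:
            index.setdefault(label, len(index))
    return index


def one_hot(class_ids, depth):
    """Turn class indices into one-hot vectors of length `depth`."""
    return [[1.0 if i == class_id else 0.0 for i in range(depth)]
            for class_id in class_ids]


toy_labels = [["positive"], ["negative"], ["positive"]]
class_index = build_class_index(toy_labels)
encoded = one_hot([class_index[labels[0]] for labels in toy_labels],
                  depth=len(class_index))
print(class_index, encoded)
```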

**Converting from probability to labels**

A neural model not only accepts a one-hot representation of classes but also returns, for every sample, a vector with the probability distribution over classes. Therefore, we need a component that converts the probability distribution to label indices.

The `Proba2Labels` component supports three different modes:
* if `max_proba` is true, it returns the indices of the highest probabilities,
* if `confident_threshold` is given, it returns the indices with probabilities higher than the threshold,
* if `top_n` is given, it returns the `top_n` indices with the highest probabilities.

**DOCS:** http://docs.deeppavlov.ai/en/latest/apiref/models/preprocessors.html

In [0]:
from deeppavlov.models.classifiers.proba2labels import Proba2Labels

prob2labels = Proba2Labels(max_proba=True)

In [41]:
prob2labels([[0.5, 0.2, 0.3], 
             [0.2, 0.4, 0.4]])

[[0], [1]]
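
The three conversion modes can be illustrated with a small stand-alone function. This is a sketch under stated assumptions, not the `Proba2Labels` source: `proba_to_labels` is a hypothetical name, and the precedence between the modes as well as the handling of ties are choices made for this example.

```python
def proba_to_labels(probas, max_proba=False, confident_threshold=None, top_n=None):
    """Convert rows of class probabilities to lists of class indices."""
    result = []
    for row in probas:
        if confident_threshold is not None:
            # Keep every class whose probability exceeds the threshold.
            result.append([i for i, p in enumerate(row) if p > confident_threshold])
        elif max_proba:
            # Keep the single most probable class (the first one on ties).
            result.append([max(range(len(row)), key=row.__getitem__)])
        elif top_n is not None:
            # Keep the `top_n` most probable classes, most probable first.
            ranked = sorted(range(len(row)), key=lambda i: -row[i])
            result.append(ranked[:top_n])
        else:
            raise ValueError("choose max_proba, confident_threshold or top_n")
    return result


toy_probas = [[0.5, 0.2, 0.3], [0.2, 0.4, 0.4]]
print(proba_to_labels(toy_probas, max_proba=True))
print(proba_to_labels(toy_probas, confident_threshold=0.25))
print(proba_to_labels(toy_probas, top_n=2))
```

The threshold mode is the one suited to multi-label tasks, because it can return zero, one or several classes per sample.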

## Classifier

DeepPavlov contains several classification components: sklearn classifiers, neural networks on [Keras](https://keras.io/), and a BERT classifier on TensorFlow. This tutorial demonstrates how to build a neural network classifier on Keras; specifically, we are going to build an MLP.

[Keras](https://keras.io/) is a high-level neural network framework that can run on top of `TensorFlow`, `Theano` and `CNTK`. In `DeepPavlov` we work with `Keras` on the `TensorFlow` backend.

`Keras` spares the user from building graphs and running sessions. It is user-friendly and convenient as long as you do not need heavily customized layers and train one model at a time (rather than several models in parallel).

A `Keras` neural network is nothing but a [`keras.Model`](https://keras.io/models/model/) instance defined by an input layer (`keras.layers.Input`) and an output layer (e.g. `keras.layers.Activation`). Between the input and output layers there are several layers from `keras.layers` (e.g. `keras.layers.Dense` or `keras.layers.Dropout`). Each layer instance is callable and returns a tensor.

Every `Keras` model should be compiled to set the loss and the optimizer for training:
```python
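from keras.layers import Input, Dense
from keras import Model

# `inputs` is the symbolic input tensor of the network
inputs = Input(shape=(784,))

output = Dense(64, activation='relu')(inputs)

model = Model(inputs=inputs, outputs=output)

model.compile(optimizer="Adam", 
              loss='categorical_crossentropy')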
```

Then a `keras.Model` can be trained with `Model.train_on_batch` or `Model.fit` (https://keras.io/models/model/) and used for inference with `Model.predict`.
In `DeepPavlov`, one can instead use `KerasClassificationModel.train_on_batch` for training, `KerasClassificationModel.__call__` (or `KerasClassificationModel.infer_on_batch`) for inference, and `KerasClassificationModel.save` to save the model.

`KerasClassificationModel` is a class that builds a Keras classifier. The network architecture is defined in a separate class method that returns an uncompiled `keras.Model` (compilation is done automatically) accepting tokenized, embedded texts as input.

**TASK:** Now you should implement a multi-layer perceptron containing several consecutive dense layers.

**DOCS:** http://docs.deeppavlov.ai/en/latest/apiref/models/classifiers.html

In [42]:
from keras.layers import Input, Dense, Activation, Dropout, Flatten, GlobalMaxPooling1D
from keras import Model

from deeppavlov.models.classifiers.keras_classification_model import KerasClassificationModel
from deeppavlov.metrics.accuracy import sets_accuracy

Using TensorFlow backend.


In [0]:
class MyKerasClassificationModel(KerasClassificationModel):
    
    def multi_layer_perceptron(self, *args, **kwargs):
        """
        Build a multi-layer perceptron network for text classification.
        
        Args:
            kwargs: dictionary with parameters which can be used below
            
        Returns:
            uncompiled Keras Model
        """
        inp = Input(shape=(None, embedder.dim))
        # `inp` is 3-dimensional: batch_size X number_of_tokens X embedding_size
        # `output` should be 2-dimensional: batch_size X number_of_classes
        
        # you may use `GlobalMaxPooling1D` to reduce dimensions,
        # you must use `softmax` activation (not a single sigmoid output)
        # because each label was converted to a two-dimensional one-hot vector,
        # you may use several consecutive `Dense` layers,
        # but note that the last layer should have `vocab.len` units (number of classes)
        
        # here is your code
        
        model = Model(inputs=inp, outputs=output)
        return model
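
Before writing the Keras layers, it may help to trace how the shapes change from a 3-dimensional input to a 2-dimensional output. The sketch below does this in plain Python with random weights; it is not the solution to the task and does not use Keras. All helper names (`global_max_pool`, `dense`, `relu`, `softmax`, `random_layer`) and the layer sizes are made up for illustration.

```python
import math
import random


def global_max_pool(sample):
    """Reduce tokens X embedding_size to embedding_size via a per-dimension max."""
    return [max(column) for column in zip(*sample)]


def dense(vector, weights, biases):
    """Fully connected layer: one output per (weight row, bias) pair."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def relu(vector):
    return [max(0.0, v) for v in vector]


def softmax(vector):
    shifted = [math.exp(v - max(vector)) for v in vector]
    total = sum(shifted)
    return [v / total for v in shifted]


def random_layer(rng, n_in, n_out):
    """Random weights and zero biases, only to make the sketch runnable."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
embedding_size, hidden_units, n_classes = 4, 3, 2
hidden_w, hidden_b = random_layer(rng, embedding_size, hidden_units)
out_w, out_b = random_layer(rng, hidden_units, n_classes)

# batch_size=2, number_of_tokens=5, embedding_size=4
batch = [[[rng.uniform(-1, 1) for _ in range(embedding_size)] for _ in range(5)]
         for _ in range(2)]
probabilities = [
    softmax(dense(relu(dense(global_max_pool(sample), hidden_w, hidden_b)),
                  out_w, out_b))
    for sample in batch
]
print(len(probabilities), len(probabilities[0]))
```

The pooling step removes the token dimension, which is why the network can accept batches with any number of tokens; the final softmax turns the last dense layer's outputs into one probability per class.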

In [0]:
model = MyKerasClassificationModel(
    # Don't forget to specify parameters which you used in MLP
    # start of your code
    units=[64, 32, 16, 8],
    dropout_rate=0.,
    # end of your code
    save_path="./mlp_model_v0", 
    load_path="./mlp_model_v0", 
    embedding_size=embedder.dim,
    n_classes=vocab.len,
    model_name="multi_layer_perceptron",  # HERE we put our new network-method name
    optimizer="Adam",
    learning_rate=0.001,
    learning_rate_decay=0.001,
    loss="categorical_crossentropy")

The MLP neural model was successfully defined. Now we are ready to train it!

**TASK:** You need to implement a training procedure consisting of the following steps.

In [0]:
# Method `get_instances` returns all the samples of a particular data field
x_valid, y_valid = iterator.get_instances(data_type="valid")
# You need to save the model only when the validation score is higher than the previous best.
# This variable will contain the highest accuracy score
best_score = 0.

# let's train for 10 epochs
for ep in range(10):
    # to iterate over `train` data you can use the `gen_batches` method;
    # don't forget to set `data_type` to `train` and to `shuffle` the dataset.
    
    # each batch of text samples should be passed sequentially through
    # preprocessor, tokenizer, embedder
    
    # each batch of classes should be passed sequentially through
    # vocab and one-hotter
    
    # the model has a `train_on_batch` method which
    # accepts two inputs:
    # embedded batch of texts and one-hot representation of classes
    
    # after iterating over the `train` dataset
    # you need to validate the obtained model:
    # you can `__call__` the model on embedded tokenized preprocessed `x_valid`
    # and store the predictions in `y_valid_pred` (used in the next cell),
    # then you should convert predictions to labels using `prob2labels` and `vocab`
    # and calculate `sets_accuracy` between `y_valid` and predicted labels
    
    # the last step is to compare the achieved score to `best_score` 
    # and save the model using the `save` method;
    # don't forget to update `best_score`
    
    # here is your code
    pass  # placeholder that keeps the loop valid until you add your code
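
The control flow of these steps (train on every batch, validate after each epoch, save only on improvement) can be sketched independently of DeepPavlov. The code below is a generic pattern with stub callables, not the solution to the task: `train_and_keep_best` and `sets_match_accuracy` are hypothetical names, and the latter is a simple stand-in written for this example. In the tutorial, the callables would presumably wrap `iterator.gen_batches`, `model.train_on_batch`, the model call and `model.save`.

```python
def sets_match_accuracy(y_true, y_pred):
    """Share of samples whose predicted label set equals the true label set."""
    hits = sum(set(t) == set(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true) if y_true else 0.0


def train_and_keep_best(n_epochs, train_batches, train_step, predict, save,
                        x_valid, y_valid):
    """Train for `n_epochs`, saving only when validation accuracy improves."""
    best_score = 0.0
    for epoch in range(n_epochs):
        for x_batch, y_batch in train_batches():
            train_step(x_batch, y_batch)
        score = sets_match_accuracy(y_valid, predict(x_valid))
        if score > best_score:
            best_score = score
            save()
    return best_score


# Stub "model" that memorises labels, only to exercise the loop.
memory = {}
saves = []
toy_train = [(["good", "bad"], [["positive"], ["negative"]])]
best = train_and_keep_best(
    n_epochs=3,
    train_batches=lambda: iter(toy_train),
    train_step=lambda xs, ys: memory.update(zip(xs, ys)),
    predict=lambda xs: [memory.get(x, ["negative"]) for x in xs],
    save=lambda: saves.append("saved"),
    x_valid=["good", "bad"],
    y_valid=[["positive"], ["negative"]],
)
print(best, len(saves))
```

Using a strict `>` comparison means the model is saved once per improvement, so an epoch that merely ties the best score does not overwrite the saved model.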
    

In [0]:
# Let's look at the resulting outputs
print("Text sample: {}".format(x_valid[0]))
print("True label: {}".format(y_valid[0]))
print("Predicted probability distribution: {}".format(dict(zip(vocab.keys(), 
                                                               y_valid_pred[0]))))
print("Predicted label: {}".format(vocab(prob2labels(y_valid_pred))[0]))

Text sample: It 's a lovely film with lovely performances by Buy and Accorsi .
True label: ['positive']
Predicted probability distribution: {'positive': 0.9770100116729736, 'negative': 0.02298995666205883}
Predicted label: ['positive']


# Fine-grained classification

The fine-grained labelled dataset corresponds to a multi-class classification task with 5 classes.
This classification is still not multi-label, so you do not need to change anything from the binary classification setup except the network or training parameters.

The **TASK** is to build a fine-grained classifier from scratch.