# Sequence classification model for IMDB sentiment analysis
(c) Deniz Yuret, 2019
* Objectives: Learn the structure of the IMDB dataset and train a simple RNN model.

In [1]:
# Set display width, load packages, import symbols
ENV["COLUMNS"] = 72
using Pkg; haskey(Pkg.installed(),"Knet") || Pkg.add("Knet")
using Statistics: mean
using Knet: Knet, AutoGrad, RNN, param, dropout, minibatch, nll, accuracy, progress!, adam, save, load, gc

In [2]:
# Set constants for the model and training
EPOCHS=3          # Number of training epochs
BATCHSIZE=64      # Number of instances in a minibatch
EMBEDSIZE=125     # Word embedding size
NUMHIDDEN=100     # Hidden layer size
MAXLEN=150        # Maximum word-sequence length: shorter sequences are padded, longer ones truncated
VOCABSIZE=30000   # Maximum vocabulary size: keep the 30K most frequent words, map the rest to the UNK token
NUMCLASS=2        # Number of output classes
DROPOUT=0.5       # Dropout rate
LR=0.001          # Learning rate
BETA_1=0.9        # Adam optimization parameter
BETA_2=0.999      # Adam optimization parameter
EPS=1e-08         # Adam optimization parameter

1.0e-8
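
The constants `LR`, `BETA_1`, `BETA_2` and `EPS` correspond to the usual hyperparameters of the Adam update rule. As a rough illustration only, one update of a single scalar weight might look like the sketch below. The helper `adam_step` is hypothetical and written in plain Julia; it follows the textbook form of Adam, and Knet's own `adam` may differ in details.

```julia
# Hypothetical helper, not part of Knet: one textbook Adam update for a scalar weight.
# w: weight, g: gradient, m/v: running first/second moment estimates, t: step count (1-based).
function adam_step(w, g, m, v, t; lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8)
    m = beta1 * m + (1 - beta1) * g        # update the first-moment estimate
    v = beta2 * v + (1 - beta2) * g^2      # update the second-moment estimate
    mhat = m / (1 - beta1^t)               # bias-corrected first moment
    vhat = v / (1 - beta2^t)               # bias-corrected second moment
    w -= lr * mhat / (sqrt(vhat) + eps)    # scaled step against the gradient
    return w, m, v
end

# First step from zero moments with weight 1.0 and gradient 0.5
w, m, v = adam_step(1.0, 0.5, 0.0, 0.0, 1)
```

On the first step the bias correction cancels the moment scaling, so the weight moves by roughly `lr` in the direction opposite to the gradient.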

## Load and view data

In [3]:
include(Knet.dir("data","imdb.jl"))   # defines the IMDB loader

imdb

In [4]:
@doc imdb

```
imdb()
```

Load the IMDB Movie reviews sentiment classification dataset from https://keras.io/datasets and return (xtrn,ytrn,xtst,ytst,dict) tuple.

# Keyword Arguments:

  * url=https://s3.amazonaws.com/text-datasets: where to download the data (imdb.npz) from.
  * dir=Pkg.dir("Knet/data"): where to cache the data.
  * maxval=nothing: max number of token values to include. Words are ranked by how often they occur (in the training set) and only the most frequent words are kept. nothing means keep all, equivalent to maxval = vocabSize + pad + stoken.
  * maxlen=nothing: truncate sequences after this length. nothing means do not truncate.
  * seed=0: random seed for sample shuffling. Use system seed if 0.
  * pad=true: whether to pad short sequences (padding is done at the beginning of sequences). pad_token = maxval.
  * stoken=true: whether to add a start token to the beginning of each sequence. start_token = maxval - pad.
  * oov=true: whether to replace words >= oov_token with oov_token (the alternative is to skip them). oov_token = maxval - pad - stoken.
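
The three formulas in the docstring fix the ids of the special tokens once `maxval` is chosen. As a small worked example, assuming `maxval=30000` and the defaults `pad=true`, `stoken=true`, the arithmetic can be checked directly (the variable names below are illustrative, not taken from the loader):

```julia
# Special token ids implied by the docstring formulas, assuming maxval=30000
# and the default flags pad=true, stoken=true. Bool values act as 0/1 in arithmetic.
maxval = 30000
pad, stoken = true, true

pad_token   = maxval                   # 30000
start_token = maxval - pad             # 29999
oov_token   = maxval - pad - stoken    # 29998
```

Under these assumptions the top three ids are reserved for the out-of-vocabulary, start and padding tokens, in that order.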


In [5]:
@time (xtrn,ytrn,xtst,ytst,imdbdict)=imdb(maxlen=MAXLEN,maxval=VOCABSIZE);

┌ Info: Loading IMDB...
└ @ Main /home/deniz/.julia/dev/Knet/data/imdb.jl:57


  6.811008 seconds (29.27 M allocations: 1.493 GiB, 7.70% gc time)


In [6]:
println.(summary.((xtrn,ytrn,xtst,ytst,imdbdict)))

25000-element Array{Array{Int32,1},1}
25000-element Array{Int8,1}
25000-element Array{Array{Int32,1},1}
25000-element Array{Int8,1}
Dict{String,Int32} with 88584 entries


(nothing, nothing, nothing, nothing, nothing)

In [7]:
# Words are encoded with integers
rand(xtrn)'

1×150 LinearAlgebra.Adjoint{Int32,Array{Int32,1}}:
 30000  30000  30000  30000  30000  …  1908  92  11  6  1  17  15  22

In [8]:
# Each word sequence is padded or truncated to length 150
length.(xtrn)'

1×25000 LinearAlgebra.Adjoint{Int64,Array{Int64,1}}:
 150  150  150  150  150  150  150  …  150  150  150  150  150  150
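
To make the padding and truncation concrete, here is a minimal sketch of how a sequence could be forced to a fixed length. The helper `padtrunc` is hypothetical and is not the loader's implementation. It assumes that pads are added at the beginning (as the docstring states) and that truncation keeps the first `maxlen` tokens, which is one reading of "truncate sequences after this length".

```julia
# Hypothetical helper (not Knet's implementation): force a sequence to length maxlen.
# Assumptions: pads go at the beginning; truncation keeps the first maxlen tokens.
function padtrunc(seq::Vector{<:Integer}, maxlen::Int, pad_token::Integer)
    if length(seq) >= maxlen
        return seq[1:maxlen]                       # drop everything past maxlen
    end
    npad = maxlen - length(seq)
    return vcat(fill(eltype(seq)(pad_token), npad), seq)   # left-pad with pad_token
end

short = padtrunc([5, 6, 7], 5, 9)            # [9, 9, 5, 6, 7]
long  = padtrunc([1, 2, 3, 4, 5, 6], 5, 9)   # [1, 2, 3, 4, 5]
```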

In [9]:
# Define a function that can print the actual words:
imdbvocab = Array{String}(undef,length(imdbdict))
for (k,v) in imdbdict; imdbvocab[v]=k; end
imdbvocab[VOCABSIZE-2:VOCABSIZE] = ["<unk>","<s>","<pad>"]
function reviewstring(x,y=0)
    x = x[x.!=VOCABSIZE] # remove pads
    """$(("Sample","Negative","Positive")[y+1]) review:\n$(join(imdbvocab[x]," "))"""
end

reviewstring (generic function with 2 methods)
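
The cell above inverts the word-to-id dictionary into an array indexed by id, so that a vector of ids can be turned back into words with a single indexing operation. The same idea on a toy dictionary (the names `toydict` and `toyvocab` and the ids are made up for illustration):

```julia
# Toy stand-in for imdbdict (word => id); the ids here are invented.
toydict = Dict("the" => 1, "movie" => 2, "great" => 3)

# Invert it: position v of the array holds the word whose id is v.
toyvocab = Array{String}(undef, length(toydict))
for (k, v) in toydict
    toyvocab[v] = k
end

# Indexing with a vector of ids recovers the words in order.
sentence = join(toyvocab[[1, 3, 2]], " ")   # "the great movie"
```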

In [10]:
# Hit Ctrl-Enter to see random reviews:
r = rand(1:length(xtrn))
println(reviewstring(xtrn[r],ytrn[r]))

Positive review:
who definitely needed a hug these evil people capture the yokai and throw them into a red pit along with unwanted objects like <unk> and other mechanical things and these meld into one horribly violent robotic monsters whose only job is to kill takashi a young boy is the one to become their saviour alongside a red man dragon a turtle man and a river princess as well as a cute little creature that if it had been america they could have turned it into a cuddly toy and sold it at all good toy stores the lines are good especially the don't try this at home kids and other gems that bring a smile to your lips suspend belief and watch this with a child or on your own and enjoy though i must admit that the end was a wee bit sad and not necessarily so cheers <unk>


In [11]:
# Here are the labels: 1=negative, 2=positive
ytrn'

1×25000 LinearAlgebra.Adjoint{Int8,Array{Int8,1}}:
 1  2  2  1  1  1  1  1  2  2  1  …  1  2  2  1  1  2  1  1  1  2  2

## Define the model

In [12]:
struct SequenceClassifier; input; rnn; output; pdrop; end

In [13]:
SequenceClassifier(input::Int, embed::Int, hidden::Int, output::Int; pdrop=0) =
    SequenceClassifier(param(embed,input), RNN(embed,hidden,rnnType=:gru), param(output,hidden), pdrop)

SequenceClassifier

In [14]:
function (sc::SequenceClassifier)(input)
    embed = sc.input[:, permutedims(hcat(input...))]
    embed = dropout(embed,sc.pdrop)
    hidden = sc.rnn(embed)
    hidden = dropout(hidden,sc.pdrop)
    return sc.output * hidden[:,:,end]
end

(sc::SequenceClassifier)(input,output) = nll(sc(input),output)
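
The embedding lookup in the forward pass relies on Julia's array indexing: indexing the columns of a matrix with a batch × time matrix of ids yields a three-dimensional features × batch × time array. A toy sketch in plain Julia (no Knet; the sizes and the names `W`, `batch`, `emb` are illustrative only):

```julia
W = reshape(collect(1.0:12.0), 3, 4)    # toy embedding matrix: 3 features × 4 vocabulary ids
batch = [[1, 2, 4], [3, 3, 1]]          # 2 sequences of 3 token ids each

idx = permutedims(hcat(batch...))       # 2×3 matrix: rows are sequences, columns are time steps
emb = W[:, idx]                         # 3×2×3 array: features × batch × time

size(emb)                               # (3, 2, 3)
emb[:, 2, 3] == W[:, 1]                 # true: sequence 2, step 3 holds token id 1
```

With this layout, slicing `[:,:,end]` selects the values at the last time step for every sequence in the batch, which is what the model passes to its output layer.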

## Experiment

In [15]:
dtrn = minibatch(xtrn,ytrn,BATCHSIZE;shuffle=true)
dtst = minibatch(xtst,ytst,BATCHSIZE)
length.((dtrn,dtst))

(390, 390)
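
The count of 390 is consistent with splitting 25000 examples into full batches of 64, assuming the incomplete final batch is left out. A quick check of the arithmetic (variable names are illustrative):

```julia
n, batchsize, epochs = 25000, 64, 3

nbatches   = div(n, batchsize)          # 390 full minibatches
leftover   = n - nbatches * batchsize   # 40 examples that do not fill a batch
iterations = nbatches * epochs          # 1170 updates over 3 epochs
```

The total of 1170 matches the iteration count shown in the training progress bar.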

In [16]:
# For running experiments
function trainresults(file,model; o...)
    if (print("Train from scratch? "); startswith(readline(),"y"))  # startswith avoids an error on empty input
        progress!(adam(model,repeat(dtrn,EPOCHS);lr=LR,beta1=BETA_1,beta2=BETA_2,eps=EPS))
        Knet.save(file,"model",model)
        Knet.gc() # To save gpu memory
    else
        isfile(file) || download("http://people.csail.mit.edu/deniz/models/tutorial/$file",file)
        model = Knet.load(file,"model")
    end
    return model
end

trainresults (generic function with 1 method)

In [17]:
model = SequenceClassifier(VOCABSIZE,EMBEDSIZE,NUMHIDDEN,NUMCLASS,pdrop=DROPOUT)
nll(model,dtrn), nll(model,dtst), accuracy(model,dtrn), accuracy(model,dtst)

(0.69312066f0, 0.69312423f0, 0.5135817307692307, 0.5096153846153846)
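
The initial loss of about 0.693 is what one would expect from an untrained two-class model: if both classes receive roughly equal scores, the negative log-likelihood is close to log(2). A plain-Julia sketch of the loss for a single example (`toy_nll` is a hypothetical helper, not Knet's `nll`):

```julia
# Hypothetical plain-Julia negative log-likelihood for one example.
# scores: unnormalized class scores, label: index of the correct class.
function toy_nll(scores::Vector{Float64}, label::Int)
    logz = log(sum(exp.(scores)))   # log of the softmax normalizer
    return logz - scores[label]     # equals -log(softmax(scores)[label])
end

loss = toy_nll([0.0, 0.0], 1)       # equal scores for both classes: log(2) ≈ 0.6931
```

The starting accuracy near 0.5 tells the same story: before training the model is essentially guessing.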

In [18]:
# 2.51e-01  100.00%┣████████████████████┫ 1170/1170 [00:16/00:16, 75.46i/s]
model = trainresults("imdbmodel113.jld2",model);

Train from scratch? stdin> y
1.53e-01  100.00%┣████████████████████┫ 1170/1170 [00:18/00:18, 64.14i/s]


In [19]:
# (0.059155148f0, 0.3877507f0, 0.9846153846153847, 0.8583733974358975)
nll(model,dtrn), nll(model,dtst), accuracy(model,dtrn), accuracy(model,dtst)

(0.05890469f0, 0.38913542f0, 0.9833733974358975, 0.8548477564102565)

## Playground

In [20]:
predictstring(x)="\nPrediction: " * ("Negative","Positive")[argmax(Array(vec(model([x]))))]
UNK = VOCABSIZE-2
str2ids(s::String)=[(i=get(imdbdict,w,UNK); i>=UNK ? UNK : i) for w in split(lowercase(s))]

str2ids (generic function with 1 method)
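
`str2ids` lowercases the input, splits it on whitespace, and maps any word that is missing from the dictionary, or whose id is at or above `UNK`, to `UNK`. The same logic on a toy dictionary (the words, the ids and the names `toydict`, `unk`, `toy_str2ids` are made up for illustration):

```julia
toydict = Dict("this" => 11, "movie" => 17, "zyzzyva" => 88000)   # made-up ids
unk = 29998                                                       # VOCABSIZE-2 for VOCABSIZE=30000

# Same pattern as str2ids: look up each word, fall back to unk,
# and clamp rare words (id >= unk) to unk as well.
toy_str2ids(s::String) = [(i = get(toydict, w, unk); i >= unk ? unk : i) for w in split(lowercase(s))]

ids = toy_str2ids("This movie zyzzyva rocks")   # [11, 17, 29998, 29998]
```

Here "zyzzyva" is in the dictionary but too rare to keep, and "rocks" is not in the dictionary at all, so both end up as the unknown token. Because the split is on whitespace only, punctuation attached to a word would also make it unknown.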

In [21]:
# Here we can see predictions for random reviews from the test set; hit Ctrl-Enter to sample:
r = rand(1:length(xtst))
println(reviewstring(xtst[r],ytst[r]))
println(predictstring(xtst[r]))

Negative review:
<s> this is an emperor's new clothes situation someone needs to say that's not a funny and original etc etc film that is an inferior film don't waste your money on it the film is trashy and the people in it are embarrassingly inferior trailer trash they are all too realistically only themselves they have no lines they don't act the american dream is not to create shoddy no quality films or anything else shoddy and of no quality it is to achieve something of quality and thereby success only people who are desperate to praise any film not made in hollywood it can't have been made in hollywood can it would try to any kind of quality to this film it's worse than ed woods another film about a film maker without standards these films shouldn't have been made and you shouldn't go see american movie

Prediction: Negative


In [22]:
# Here the user can enter their own reviews and classify them:
println(predictstring(str2ids(readline(stdin))))

stdin> this was not a great movie

Prediction: Negative
