# Data

Source: http://ai.stanford.edu/~amaas/data/sentiment/index.html \\
Description: \\
The core dataset contains 50,000 reviews split evenly into 25k train
and 25k test sets. The overall distribution of labels is balanced (25k
pos and 25k neg). We also include an additional 50,000 unlabeled
documents for unsupervised learning. 

In the entire collection, no more than 30 reviews are allowed for any
given movie because reviews for the same movie tend to have correlated
ratings. Further, the train and test sets contain a disjoint set of
movies, so no significant performance is obtained by memorizing
movie-unique terms and their associated with observed labels.  In the
labeled train/test sets, a negative review has a score <= 4 out of 10,
and a positive review has a score >= 7 out of 10. Thus reviews with
more neutral ratings are not included in the train/test sets. In the
unsupervised set, reviews of any rating are included and there are an
even number of reviews > 5 and <= 5.


In [0]:
import os
import re
import numpy as np
from keras import layers as L
from keras.models import Sequential
from sklearn.preprocessing import LabelBinarizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

In [0]:
!git clone https://github.com/vyhuholl/imdb_review_classification.git

fatal: destination path 'imdb_review_classification' already exists and is not an empty directory.


#Preprocessing

In [0]:
PATH = 'imdb_review_classification/data/'

corpus = []
X_train = []
X_test = []
y_train = []
y_test = []
for file in os.listdir(PATH + 'train/pos'):
  rating = int(re.match('\d*_(\d*)', file).group(1))
  with open(PATH + 'train/pos/' + file) as text:
    review = text.read()
    corpus += review
    X_train.append(review)
    y_train.append(rating)
for file in os.listdir(PATH + 'train/neg'):
  rating = int(re.match('\d*_(\d*)', file).group(1))
  with open(PATH + 'train/neg/' + file) as text:
    review = text.read()
    corpus += review
    X_train.append(review)
    y_train.append(rating)
for file in os.listdir(PATH + 'test/pos'):
  rating = int(re.match('\d*_(\d*)', file).group(1))
  with open(PATH + 'test/pos/' + file) as text:
    review = text.read()
    corpus += review
    X_test.append(review)
    y_test.append(rating)
for file in os.listdir(PATH + 'test/neg'):
  rating = int(re.match('\d*_(\d*)', file).group(1))
  with open(PATH + 'test/neg/' + file) as text:
    review = text.read()
    corpus += review
    X_test.append(review)
    y_test.append(rating)

In [0]:
# Превращаем метки класса в one-hot векторы
encoder = LabelBinarizer()
encoder.fit(range(11))
y_train = encoder.transform(y_train)
y_test = encoder.transform(y_test)
#токенизируем и векторизуем
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus) #для чистоты эксперимента обучать токенайзер будем только на обучающей выборке
vocab_size = len(tokenizer.word_index) + 1

#нейросеть, обученная на текстах, полученных с помощью texts_to_matrix(mode='tfidf'), давала очень низкую accuracy (около 0.2, поэтому я решила использовать text_to_sequences)
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

In [0]:
# padding sequences to len 100
X_train = pad_sequences(X_train, maxlen=100, padding='post')
X_test = pad_sequences(X_test, maxlen=100, padding='post')

#Training simple neural network

In [19]:
model = Sequential()
model.add(L.Dense(512, input_shape=(100,)))
model.add(L.Activation('sigmoid'))
model.add(L.BatchNormalization())
model.add(L.Dense(256))
model.add(L.Activation('relu'))
model.add(L.BatchNormalization())
model.add(L.Dense(11))
model.add(L.Activation('softmax'))
model.summary()

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
 
history = model.fit(X_train, y_train,
                    batch_size=100,
                    epochs=1000,
                    verbose=1,
                    validation_split=0.1)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_13 (Dense)             (None, 512)               51712     
_________________________________________________________________
activation_13 (Activation)   (None, 512)               0         
_________________________________________________________________
batch_normalization_9 (Batch (None, 512)               2048      
_________________________________________________________________
dense_14 (Dense)             (None, 256)               131328    
_________________________________________________________________
activation_14 (Activation)   (None, 256)               0         
_________________________________________________________________
batch_normalization_10 (Batc (None, 256)               1024      
_________________________________________________________________
dense_15 (Dense)             (None, 11)                2827      
__________

In [21]:
score = model.evaluate(X_test, y_test,
                       batch_size=100, verbose=1)
print('Test accuracy:', score[1])

Test accuracy: 0.1865200002193451


#Training neural network with GloVe embeddings

#Results

# References

* Potts, Christopher. 2011. On the negativity of negation. In Nan Li and
David Lutz, eds., Proceedings of Semantics and Linguistic Theory 20,
636-659. \\
* Learning Word Vectors for Sentiment Analysis
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts
Stanford University,
Stanford, CA 94305
* Howard J., Ruder S. (2018) Universal Language Model Fine-tuning for Text Classification
* https://github.com/stanfordnlp/GloVe