## RNNs

We will use recurrent neural networks (RNNs), and in particular LSTMs, to perform sentiment analysis in Keras. Conveniently, Keras has a built-in IMDb movie reviews dataset that we can use.

In [1]:
from keras.datasets import imdb

Using TensorFlow backend.


In [2]:
vocabulary_size = 90000

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = vocabulary_size)
print('Loaded dataset with {} training samples, {} test samples'.format(len(X_train), len(X_test)))

Loaded dataset with 25000 training samples, 25000 test samples


Inspect a sample review and its label

In [3]:
print('---review---')
print(X_train[6])
print('---label---')
print(y_train[6])

---review---
[1, 6740, 365, 1234, 5, 1156, 354, 11, 14, 5327, 6638, 7, 1016, 10626, 5940, 356, 44, 4, 1349, 500, 746, 5, 200, 4, 4132, 11, 16393, 9363, 1117, 1831, 7485, 5, 4831, 26, 6, 71690, 4183, 17, 369, 37, 215, 1345, 143, 32677, 5, 1838, 8, 1974, 15, 36, 119, 257, 85, 52, 486, 9, 6, 26441, 8564, 63, 271, 6, 196, 96, 949, 4121, 4, 74170, 7, 4, 2212, 2436, 819, 63, 47, 77, 7175, 180, 6, 227, 11, 94, 2494, 33740, 13, 423, 4, 168, 7, 4, 22, 5, 89, 665, 71, 270, 56, 5, 13, 197, 12, 161, 5390, 99, 76, 23, 77842, 7, 419, 665, 40, 91, 85, 108, 7, 4, 2084, 5, 4773, 81, 55, 52, 1901]
---label---
1


In [4]:
type(X_train[0])

list

Map word IDs back to words

In [5]:
word2id = imdb.get_word_index()
id2word = {i: word for word, i in word2id.items()}
print('---review with words---')
print([id2word.get(i, ' ') for i in X_train[6]])
print('---label---')
print(y_train[6])

---review with words---
['the', 'boiled', 'full', 'involving', 'to', 'impressive', 'boring', 'this', 'as', 'murdering', 'naschy', 'br', 'villain', 'council', 'suggestion', 'need', 'has', 'of', 'costumes', 'b', 'message', 'to', 'may', 'of', 'props', 'this', 'echoed', 'concentrates', 'concept', 'issue', 'skeptical', 'to', "god's", 'he', 'is', 'dedications', 'unfolds', 'movie', 'women', 'like', "isn't", 'surely', "i'm", 'rocketed', 'to', 'toward', 'in', "here's", 'for', 'from', 'did', 'having', 'because', 'very', 'quality', 'it', 'is', "captain's", 'starship', 'really', 'book', 'is', 'both', 'too', 'worked', 'carl', 'of', 'mayfair', 'br', 'of', 'reviewer', 'closer', 'figure', 'really', 'there', 'will', 'originals', 'things', 'is', 'far', 'this', 'make', 'mistakes', "kevin's", 'was', "couldn't", 'of', 'few', 'br', 'of', 'you', 'to', "don't", 'female', 'than', 'place', 'she', 'to', 'was', 'between', 'that', 'nothing', 'dose', 'movies', 'get', 'are', '498', 'br', 'yes', 'female', 'just', 'it
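
One caveat worth checking: by default `imdb.load_data` reserves the lowest IDs (0 for padding, 1 for start-of-sequence, 2 for out-of-vocabulary) and shifts every word index up by `index_from=3`, whereas `get_word_index()` returns the unshifted indices. Decoding without undoing that shift pairs each ID with the wrong word, which may be why the decoded review above reads as nonsense. A minimal sketch of offset-aware decoding follows; `toy_word2id`, `decode_review`, and the special-token labels are made up for illustration and stand in for the real index.

```python
# Hypothetical stand-in for imdb.get_word_index(); the real index is much larger.
toy_word2id = {'the': 1, 'and': 2, 'movie': 3, 'great': 4}

INDEX_FROM = 3  # default offset applied by imdb.load_data
# Labels for the reserved IDs are chosen here for readability only.
SPECIAL_TOKENS = {0: '<pad>', 1: '<start>', 2: '<oov>'}

def decode_review(encoded, word2id, index_from=INDEX_FROM):
    """Map integer IDs as produced by load_data back to words."""
    id2word = {i + index_from: word for word, i in word2id.items()}
    id2word.update(SPECIAL_TOKENS)
    return [id2word.get(i, '<unk>') for i in encoded]

# "the movie great", encoded with a start marker and the +3 shift
encoded = [1, 1 + INDEX_FROM, 3 + INDEX_FROM, 4 + INDEX_FROM]
decoded = decode_review(encoded, toy_word2id)
print(decoded)
```

With the real data, the same idea would be `decode_review(X_train[6], word2id)`, assuming the dataset was loaded with the default `index_from`.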

Maximum review length and minimum review length

In [6]:
print('Maximum review length: {}'.format(
len(max((X_train + X_test), key=len))))

Maximum review length: 2697


In [7]:
print('Minimum review length: {}'.format(
len(min((X_test + X_test), key=len))))

Minimum review length: 14
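
The two cells above combine the splits with `+`, and the second one adds `X_test` to itself. If the splits are NumPy object arrays of lists, which is how `imdb.load_data` typically returns them, `+` concatenates the reviews pairwise (review i of one array with review i of the other) instead of joining the two datasets, so the printed values may not be the true per-review extremes. The toy sketch below shows the difference; `make_object_array` and the sample data are hypothetical.

```python
import numpy as np

def make_object_array(rows):
    """Build a 1-D object array whose elements are Python lists."""
    arr = np.empty(len(rows), dtype=object)
    for i, row in enumerate(rows):
        arr[i] = row
    return arr

train = make_object_array([[1, 2, 3], [4]])
test = make_object_array([[5, 6], [7, 8, 9, 10]])

# Elementwise: train[i] + test[i] concatenates the two lists at position i.
paired = train + test
paired_lengths = [len(r) for r in paired]

# Joining the datasets first gives the per-review lengths we actually want.
lengths = [len(r) for r in list(train) + list(test)]

print('pairwise lengths:', paired_lengths)
print('max review length:', max(lengths))
print('min review length:', min(lengths))
```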


### Pad sequences

In order to feed this data into our RNN, all input documents must have the same length. We will limit the maximum review length to max_words by truncating longer reviews and padding shorter reviews with a null value (0). We can accomplish this using the pad_sequences() function in Keras. Here we set max_words to 1000.

In [8]:
from keras.preprocessing import sequence

max_words = 1000
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)
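
By default, pad_sequences() pads at the front of each sequence and also truncates from the front, so long reviews keep their last max_words tokens. The NumPy sketch below imitates that default behavior on toy data so the effect is easy to see; `pad_pre` is a hypothetical helper written for illustration, not the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype='int32')
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # nothing to copy; the row stays all padding
        trimmed = seq[-maxlen:]
        out[row, -len(trimmed):] = trimmed
    return out

toy = [[1, 2], [3, 4, 5, 6, 7]]
padded = pad_pre(toy, maxlen=4)
print(padded)
```

The first row gains two leading zeros, while the second row loses its first token.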

### TODO: Design an RNN model for sentiment analysis

Build our model architecture in the code cell below. We have imported some layers from Keras that you might need, but feel free to use any other layers or transformations you like.

Remember that our input is a sequence of words (technically, integer word IDs) of maximum length = max_words, and our output is a binary sentiment label (0 or 1).

In [9]:
from keras import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

embedding_size=32
model=Sequential()
model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))

print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1000, 32)          2880000   
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 2,933,301
Trainable params: 2,933,301
Non-trainable params: 0
_________________________________________________________________
None


To summarize, our model is a simple RNN with one embedding layer, one LSTM layer, and one dense layer. In total, 2,933,301 parameters need to be trained.
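
The per-layer counts in the summary can be reproduced by hand. The sketch below uses the standard formulas: an embedding table has one vector per vocabulary entry, an LSTM has four gates that each own an input kernel, a recurrent kernel, and a bias, and the dense layer has one weight per LSTM unit plus a bias. The variable names are chosen for illustration.

```python
vocabulary_size = 90000
embedding_size = 32
lstm_units = 100

# One embedding vector per vocabulary entry.
embedding_params = vocabulary_size * embedding_size

# Four gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embedding_size + lstm_units) * lstm_units + lstm_units)

# A single sigmoid output unit: one weight per LSTM unit plus one bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Almost all of the parameters sit in the embedding table, so reducing vocabulary_size is the most direct way to shrink the model.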

### Train and evaluate our model

We first need to compile our model by specifying the loss function and optimizer we want to use while training, as well as any evaluation metrics we'd like to measure. Specify the appropriate parameters, including at least one metric, 'accuracy'.

In [10]:
model.compile(loss='binary_crossentropy', 
             optimizer='adam', 
             metrics=['accuracy'])
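
For intuition about what these two settings measure, here is a small NumPy sketch of binary cross-entropy and of accuracy at a 0.5 threshold. It is an illustration of the formulas, not the Keras code; the function names and the clipping constant `eps` are assumptions made for this example.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))

def accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    y_true = np.asarray(y_true)
    predicted = (np.asarray(y_pred) > threshold).astype(int)
    return float(np.mean(predicted == y_true))

loss = binary_crossentropy([1, 0], [0.9, 0.1])
acc = accuracy([1, 0, 1], [0.9, 0.2, 0.4])
print(loss, acc)
```

The loss rewards confident correct probabilities, while accuracy only looks at which side of the threshold each prediction falls on.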

Once compiled, we can kick off the training process. There are two important training parameters that we have to specify - batch size and number of training epochs, which together with our model architecture determine the total training time.

Training may take a while, so grab a cup of coffee, or better, go for a run!

In [11]:
batch_size = 64
num_epochs = 3

X_valid, y_valid = X_train[:batch_size], y_train[:batch_size]
X_train2, y_train2 = X_train[batch_size:], y_train[batch_size:]

model.fit(X_train2, y_train2, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=num_epochs)

Train on 24936 samples, validate on 64 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.callbacks.History at 0x7f15ff687fd0>

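To make the link between batch size, epochs, and training time concrete, the sketch below counts the weight updates implied by the split used above. It is plain arithmetic on the numbers shown in the training log; the variable names are chosen for illustration.

```python
import math

n_total = 25000      # training samples loaded from the dataset
batch_size = 64
num_epochs = 3

n_valid = batch_size             # the first batch_size samples are held out
n_train = n_total - n_valid      # samples actually used for fitting

# The last batch of each epoch is smaller, hence the ceiling.
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * num_epochs

# With so few validation samples, each one moves accuracy by this much.
valid_accuracy_step = 1 / n_valid

print(n_train, steps_per_epoch, total_updates, valid_accuracy_step)
```

Because the validation set holds only 64 reviews, its accuracy can change only in steps of about 1.6 percentage points, so a larger hold-out would give a steadier signal.
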
In [12]:
model.save('imdb_{}.h5'.format(max_words))

scores[1] will correspond to accuracy if we pass metrics=['accuracy']

In [13]:
scores = model.evaluate(X_test, y_test, verbose=0)
print('Test accuracy:', scores[1])

Test accuracy: 0.8574000000953674


In [14]:
import glob
import os
import numpy as np
# Construction 2
from spacy.lang.en import English
nlp = English()
# Create a Tokenizer with the default settings for English
# including punctuation rules and exceptions
tokenizer = nlp.Defaults.create_tokenizer(nlp)

In [15]:
# load documents and tokenize
documents = [tokenizer(open(filename, 'r').read()) for filename in glob.glob(os.path.join('Webpages', '*'))]

In [16]:
# map each token to its IMDb word index (tokens missing from the index are dropped)
embedded_documents = [[word2id[str(token).lower()] for token in doc if str(token).lower() in word2id] for doc in documents]
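
Note that the cell above uses the raw indices from get_word_index(), while the model was trained on IDs that `imdb.load_data` shifts by `index_from=3` by default and caps at `num_words`. If that default applies here, new documents probably need the same shift before prediction. The sketch below shows one way to encode tokens consistently; `encode_tokens` and `toy_word2id` are hypothetical names, and prepending the start marker is an assumption made to mirror how load_data encodes reviews.

```python
# Hypothetical stand-in for imdb.get_word_index().
toy_word2id = {'the': 1, 'and': 2, 'movie': 3, 'great': 4}

def encode_tokens(tokens, word2id, vocabulary_size,
                  index_from=3, start_id=1, oov_id=2):
    """Encode tokens the way load_data encodes reviews (default settings)."""
    ids = [start_id]
    for token in tokens:
        rank = word2id.get(token.lower())
        if rank is None:
            continue  # drop words missing from the index, as the cell above does
        shifted = rank + index_from
        # IDs at or beyond the vocabulary cap become the out-of-vocabulary ID.
        ids.append(shifted if shifted < vocabulary_size else oov_id)
    return ids

full_vocab = encode_tokens(['The', 'movie', 'was', 'great'], toy_word2id,
                           vocabulary_size=90000)
small_vocab = encode_tokens(['The', 'movie', 'was', 'great'], toy_word2id,
                            vocabulary_size=7)
print(full_vocab, small_vocab)
```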

In [17]:
X_doc = sequence.pad_sequences(embedded_documents, maxlen=max_words)
Y_doc = label_documents = [1,1,0,0,0.5,0.5,0.5,0,1,0.5,1,0.5,0,0,1,0,1,1,0,0.5,1,0,1,\
                           0.5,1,1,0,0,1,0.5,1,1,0,0.5,1,0,0,0,1,1,1,1]

In [19]:
scores = model.predict(X_doc)
print('Doc accuracy:', scores)

Doc accuracy: [[0.9371836 ]
 [0.825713  ]
 [0.9344975 ]
 [0.10576358]
 [0.5777002 ]
 [0.9582455 ]
 [0.10163125]
 [0.34419048]
 [0.98698694]
 [0.12759072]
 [0.9827621 ]
 [0.68403924]
 [0.01969293]
 [0.02185491]
 [0.14388424]
 [0.04417193]
 [0.02762491]
 [0.42843536]
 [0.06281677]
 [0.9180878 ]
 [0.02717674]
 [0.03231892]
 [0.7974294 ]
 [0.7161536 ]
 [0.9604568 ]
 [0.954371  ]
 [0.0055702 ]
 [0.63643205]
 [0.99804217]
 [0.01298153]
 [0.9741243 ]
 [0.31941736]
 [0.07783806]
 [0.07875988]
 [0.98281205]
 [0.9710159 ]
 [0.00295758]
 [0.8450662 ]
 [0.87336814]
 [0.794011  ]
 [0.25328597]
 [0.89796776]]


In [24]:
avg_err = np.mean([abs(Y_doc[i] - scores[i]) for i in range(len(scores))])
score = 0
for i in range(len(scores)):
    if Y_doc[i] == 0 and scores[i] < 0.4:
        score += 1
    elif Y_doc[i] == 0.5 and scores[i] > 0.4 and scores[i] < 0.6:
        score += 1
    elif Y_doc[i] == 1 and scores[i] > 0.6:
        score += 1
    else:
        print(Y_doc[i], scores[i])
        print(documents[i])
print(score)
print(round(score/len(Y_doc)*100, 2))
        

0 [0.9344975]
Considerations for the Development of Shale Gas in the United Kingdom _ PSE _ Physicians, Scientists, and Engineers for Healthy Energy
The United States shale gas boom has precipitated global interest in the development of unconventional oil and gas resources. Recently, government ministers in the United Kingdom started granting licenses that will enable companies to begin initial exploration for shale gas. Meanwhile, concern is increasing among the scientific community about the potential impacts of shale gas and other types of unconventional natural gas development (UGD) on human health and the environment. Although significant data gaps remain, there has been a surge in the number of articles appearing in the scientific literature, nearly three-quarters of which has been published since the beginning of 2013. Important lessons can be drawn from the UGD experience in the United States. Here we explore these considerations and argue that shale gas development policies i