[Reference](http://blog.csdn.net/william_2015/article/details/72978387)

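# Data
The data used here comes from the Kaggle competition [UMICH SI650 - Sentiment Classification](https://www.kaggle.com/c/si650winter11/team).

Below is a sample of training.txt: **each sentence is paired with a 1 or a 0, marking its sentiment as positive or negative respectively**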

    1	Brokeback Mountain was so awesome.
    1	Brokeback Mountain was an AWESOME movie.
    1	man i loved brokeback mountain!
    1	dudeee i LOVED brokeback mountain!!!!
    1	I either LOVE Brokeback Mountain or think it's great that homosexuality is becoming more acceptable!:
    1	Anyway, thats why I love " Brokeback Mountain.
    1	Brokeback mountain was beautiful...
    0	da vinci code was a terrible movie.
    0	Then again, the Da Vinci code is super shitty movie, and it made like 700 million.
    0	The Da Vinci Code comes out tomorrow, which sucks.
    0	i thought the da vinci code movie was really boring.
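
Each record is a label and a sentence separated by a single tab, so one split per line is enough to parse it. A minimal sketch, using two lines from the sample above as inline data; `parse_line` is a hypothetical helper name, not part of the original code:

```python
def parse_line(line):
    """Split one 'label<TAB>sentence' record into (int label, sentence)."""
    # maxsplit=1 keeps any further tabs inside the sentence intact.
    label, sentence = line.rstrip("\n").split("\t", 1)
    return int(label), sentence

# Two records copied from the sample above.
sample = [
    "1\tBrokeback Mountain was so awesome.\n",
    "0\tda vinci code was a terrible movie.\n",
]
records = [parse_line(line) for line in sample]
for label, sentence in records:
    print(label, sentence)
```

Limiting the split to the first tab is a defensive choice here: a sentence that happens to contain a tab would otherwise break the two-value unpacking.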


# Imports

In [22]:
from keras.layers import Activation, Dense, Embedding, LSTM
from keras.models import Sequential
from keras.preprocessing import sequence
from sklearn.model_selection import train_test_split
import nltk  # tokenization
import collections  # word-frequency counting
import numpy as np
import csv

# Data preparation

In [11]:
maxlen = 0  # length of the longest sentence
word_freqs = collections.Counter()  # word frequencies
num_recs = 0  # number of samples
with open('files/data/python44-data/training.txt', 'r', encoding='utf-8') as file:
    for line in file:
        labels,sentence=line.strip().split("\t")
        words=nltk.word_tokenize(sentence.lower())
        if len(words)>maxlen:
            maxlen=len(words)
        for word in words:
            word_freqs[word]+=1
        num_recs+=1
print('max_len',maxlen)
print('nb_words',len(word_freqs))
print('num_recs',num_recs)

max_len 42
nb_words 2328
num_recs 7086


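So there are 2328 distinct tokens in total, punctuation included, and the longest sentence has 42 tokens. Based on the number of distinct tokens (nb_words), we can fix the vocabulary at a constant size and replace every word outside it with the pseudo-word UNK. Based on the maximum sentence length (maxlen), we can bring all sentences to the same length by padding the short ones with 0.

Following this, we set the vocabulary size (`vocab_size` in the code) to 2002: the 2000 most frequent words in the training data (`MAX_FEATURES`), plus the pseudo-word UNK and the padding token PAD (index 0). The maximum sentence length MAX_SENTENCE_LENGTH is set to 40, so sentences longer than that are truncated.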

In [12]:
MAX_FEATURES = 2000
MAX_SENTENCE_LENGTH = 40

Build two lookup tables, word2index and index2word, to convert between words and integer indices.

In [15]:
vocab_size = min(MAX_FEATURES, len(word_freqs)) + 2
word2index = {x[0]: i+2 for i, x in enumerate(word_freqs.most_common(MAX_FEATURES))}
word2index["PAD"] = 0
word2index["UNK"] = 1
index2word = {v:k for k, v in word2index.items()}
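
To see what the two tables do, here is a small self-contained sketch on a made-up corpus. The `toy_*` names, the tiny corpus, and the `encode` helper are assumptions for illustration; only the indexing scheme (PAD = 0, UNK = 1, real words from 2 upward) follows the code above:

```python
import collections

# Toy corpus standing in for the tokenized training sentences (assumption).
toy_sentences = [["i", "love", "it"], ["i", "hate", "it"], ["i", "love", "movies"]]
toy_freqs = collections.Counter(w for s in toy_sentences for w in s)

TOY_MAX_FEATURES = 3  # keep only the 3 most frequent words

# Real words start at index 2; 0 and 1 are reserved for PAD and UNK.
toy_word2index = {
    w: i + 2 for i, (w, _) in enumerate(toy_freqs.most_common(TOY_MAX_FEATURES))
}
toy_word2index["PAD"] = 0
toy_word2index["UNK"] = 1
toy_index2word = {v: k for k, v in toy_word2index.items()}

def encode(words):
    """Map words to indices, falling back to UNK for unseen words."""
    return [toy_word2index.get(w, toy_word2index["UNK"]) for w in words]

encoded = encode(["i", "love", "popcorn"])
decoded = [toy_index2word[i] for i in encoded]
print(encoded)
print(decoded)
```

"popcorn" never appears in the toy corpus, so it is encoded as index 1 and decodes back to "UNK": the round trip is lossy for out-of-vocabulary words by design.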

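Next, use the lookup table to convert each sentence into a sequence of indices and bring every sequence to length MAX_SENTENCE_LENGTH: shorter ones are padded with 0 and longer ones are truncated.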

In [16]:
X = np.empty(num_recs,dtype=list)
y = np.zeros(num_recs)
i=0
with open('files/data/python44-data/training.txt', 'r', encoding='utf-8') as f:
    for line in f:
        label, sentence = line.strip().split("\t")
        words = nltk.word_tokenize(sentence.lower())
        seqs = []
        for word in words:
            if word in word2index:
                seqs.append(word2index[word])
            else:
                seqs.append(word2index["UNK"])
        X[i] = seqs
        y[i] = int(label)
        i += 1
X = sequence.pad_sequences(X, maxlen=MAX_SENTENCE_LENGTH)
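
With its default arguments (`padding='pre'`, `truncating='pre'`), `pad_sequences` adds zeros at the front of short sequences and drops tokens from the front of overlong ones. The NumPy sketch below imitates that behavior for illustration; `pad_pre` is a hypothetical helper, not the Keras implementation:

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for row, seq in enumerate(seqs):
        if not seq:
            continue  # empty sequence: the row stays all padding
        trunc = seq[-maxlen:]            # drop items from the front if too long
        out[row, -len(trunc):] = trunc   # right-align so padding sits in front
    return out

padded = pad_pre([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```

The first row gains two leading zeros and the second row loses its first two items, so with a limit of 40 the 42-token sentence in this dataset would lose its first two tokens.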

Split the data: 80% for training and 20% for testing.

In [17]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=42)
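
The resulting set sizes can be checked by hand. The sketch below assumes the test share is rounded up and the remainder goes to training, which is consistent with the 5668 / 1418 split reported in the training log further down:

```python
import math

num_recs = 7086   # sample count printed during data preparation
test_size = 0.2

# Assumed rounding rule: the test share is rounded up, the rest is training data.
n_test = math.ceil(num_recs * test_size)
n_train = num_recs - n_test
print(n_train, n_test)
```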

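# Building the network
- The loss function is binary_crossentropy.
- The optimizer is adam.
- As for EMBEDDING_SIZE, HIDDEN_LAYER_SIZE, and the BATCH_SIZE and NUM_EPOCHS used in training, these hyperparameters are tuned empirically over several runs.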
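
For intuition about the loss, binary cross-entropy averages -[y·log(p) + (1-y)·log(1-p)] over the samples, where p is the predicted probability of the positive class. A minimal NumPy sketch follows; the function and the `eps` clipping value are illustrative assumptions, not the Keras internals:

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so log() never sees exactly 0 or 1.
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))

# Same labels, once with confident correct predictions, once with confident wrong ones.
confident = binary_crossentropy([1, 0], [0.9, 0.1])
wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident, wrong)
```

Confident correct predictions give a small loss, while confident wrong ones are penalized heavily, which is what pushes the sigmoid output toward the true label.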

In [18]:
EMBEDDING_SIZE = 128
HIDDEN_LAYER_SIZE = 64

model = Sequential()
model.add(Embedding(vocab_size, EMBEDDING_SIZE,input_length=MAX_SENTENCE_LENGTH))
model.add(LSTM(HIDDEN_LAYER_SIZE, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1))
model.add(Activation("sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="adam",metrics=["accuracy"])
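
As a sanity check on the model size, the trainable parameter counts can be worked out by hand from the hyperparameters above. The sketch assumes a standard LSTM layer with four weight blocks (input, forget, and output gates plus the cell candidate), each with input weights, recurrent weights, and a bias:

```python
vocab_size = 2002          # min(MAX_FEATURES, number of distinct tokens) + 2
EMBEDDING_SIZE = 128
HIDDEN_LAYER_SIZE = 64

# Embedding: one EMBEDDING_SIZE-dimensional vector per vocabulary entry.
embedding_params = vocab_size * EMBEDDING_SIZE

# LSTM: 4 weight blocks, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (
    HIDDEN_LAYER_SIZE * (EMBEDDING_SIZE + HIDDEN_LAYER_SIZE) + HIDDEN_LAYER_SIZE
)

# Dense(1): one weight per LSTM unit plus one bias.
dense_params = HIDDEN_LAYER_SIZE + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the parameters sit in the embedding table, so the vocabulary size has a larger effect on model size than the LSTM width in this setup.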

# Training the model
We train the network for 10 epochs with a batch size of 32. After each epoch, the test set doubles as the validation set.

In [19]:
BATCH_SIZE = 32
NUM_EPOCHS = 10
model.fit(Xtrain, ytrain, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS,validation_data=(Xtest, ytest))

Train on 5668 samples, validate on 1418 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x201ecb37fd0>

In [20]:
# Evaluate and show a few predictions
score, acc = model.evaluate(Xtest, ytest, batch_size=BATCH_SIZE)
print("\nTest score: %.3f, accuracy: %.3f" % (score, acc))
print('{}   {}      {}'.format('Pred', 'True', 'Sentence'))
for i in range(5):
    idx = np.random.randint(len(Xtest))
    xtest = Xtest[idx].reshape(1, MAX_SENTENCE_LENGTH)
    ylabel = ytest[idx]
    ypred = model.predict(xtest)[0][0]
    # Drop padding (index 0) when converting indices back to words
    sent = " ".join([index2word[x] for x in xtest[0] if x != 0])
    print(' {}      {}     {}'.format(int(round(ypred)), int(ylabel), sent))


Test score: 0.071, accuracy: 0.987
Pred   True      Sentence
 0      0     is it just me , or does harry potter suck ? ...
 1      1     brokeback mountain was beautiful .
 1      1     we 're gon na like watch mission impossible or hoot . (
 1      1     anyway , we both love harry potter , books , pirates of the caribbean , taking pictures , and writing and we have the same sarcastic and quirky sense of humor .
 1      1     brokeback mountain was an awesome movie .
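
The `int(round(ypred))` call above turns each sigmoid output into a class label by cutting at roughly 0.5. The same idea can be applied to a whole batch at once; the probabilities and labels below are made up for illustration and are not model output:

```python
import numpy as np

# Hypothetical sigmoid outputs and true labels, for illustration only.
probs = np.array([0.03, 0.97, 0.61, 0.42, 0.88])
y_true = np.array([0, 1, 1, 1, 1])

y_pred = (probs > 0.5).astype(int)            # threshold at 0.5
accuracy = float(np.mean(y_pred == y_true))   # fraction of matching labels
print(y_pred, accuracy)
```

Here the fourth sample (0.42) falls below the threshold and is misclassified, so four of the five predictions are correct.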


In [21]:
# Try the model on other sentences
INPUT_SENTENCES = ['I love reading.','You are so boring.']
XX = np.empty(len(INPUT_SENTENCES),dtype=list)
i=0
for sentence in  INPUT_SENTENCES:
    words = nltk.word_tokenize(sentence.lower())
    seq = []
    for word in words:
        if word in word2index:
            seq.append(word2index[word])
        else:
            seq.append(word2index['UNK'])
    XX[i] = seq
    i+=1

XX = sequence.pad_sequences(XX, maxlen=MAX_SENTENCE_LENGTH)
labels = [int(round(x[0])) for x in model.predict(XX) ]
label2word = {1: 'positive', 0: 'negative'}
for i in range(len(INPUT_SENTENCES)):
    print('{}   {}'.format(label2word[labels[i]], INPUT_SENTENCES[i]))

positive   I love reading.
negative   You are so boring.


In [26]:
# Predict on the test file and write the results to CSV
with open('files/data/python44-data/results.csv','a+',encoding='utf-8',newline='') as file:
    csv_writer=csv.writer(file)
    with open('files/data/python44-data/testdata.txt','r',encoding='utf-8') as f:
        INPUT_SENTENCES=f.readlines()
        XX = np.empty(len(INPUT_SENTENCES),dtype=list)
        i=0
        for sentence in  INPUT_SENTENCES:
            words = nltk.word_tokenize(sentence.lower())
            seq = []
            for word in words:
                if word in word2index:
                    seq.append(word2index[word])
                else:
                    seq.append(word2index['UNK'])
            XX[i] = seq
            i+=1
        XX = sequence.pad_sequences(XX, maxlen=MAX_SENTENCE_LENGTH)
        labels = [int(round(x[0])) for x in model.predict(XX) ]
        label2word = {1:'positive', 0:'negative'}
        for i in range(len(INPUT_SENTENCES)):
            csv_writer.writerow((label2word[labels[i]], INPUT_SENTENCES[i].strip()))