# Recognize named entities on Twitter with LSTMs

In this assignment, you will use a recurrent neural network to solve Named Entity Recognition (NER) problem. NER is a common task in natural language processing systems. It serves for extraction such entities from the text as persons, organizations, locations, etc. In this task you will experiment to recognize named entities from Twitter.

For example, we want to extract persons' and organizations' names from the text. Than for the input text:

    Ian Goodfellow works for Google Brain

a NER model needs to provide the following sequence of tags:

    B-PER I-PER    O     O   B-ORG  I-ORG

Where *B-* and *I-* prefixes stand for the beginning and inside of the entity, while *O* stands for out of tag or no tag. Markup with the prefix scheme is called *BIO markup*. This markup is introduced for distinguishing of consequent entities with similar types.

用Bi-LSTMS来进行解决

A solution of the task will be based on neural networks, particularly, on Bi-Directional Long Short-Term Memory Networks (Bi-LSTMs).

### Libraries

For this task you will need the following libraries:
 - [Tensorflow](https://www.tensorflow.org) — an open-source software library for Machine Intelligence.
 - [Numpy](http://www.numpy.org) — a package for scientific computing.
 
If you have never worked with Tensorflow, you would probably need to read some tutorials during your work on this assignment, e.g. [this one](https://www.tensorflow.org/tutorials/recurrent) could be a good starting point. 

### Data

The following cell will download all data required for this assignment into the folder `week2/data`.

In [2]:
"""
    课程设计：识别twitter中的命名实体
    数据量：约13万条数据
    采用的网络：双向LSTM
    准确率：
    召回率：
"""

'\n    课程设计：识别twitter中的命名实体\n    数据量：约13万条数据\n    采用的网络：双向LSTM\n    准确率：\n    召回率：\n'

In [3]:
"""
    用递归神经网络解决命名实体识别的问题．命名实体识别是自然语言系统中的一个常见的任务．可以用来提取人物，机构，地点等实体。
    这个实验是从推特中识别出实体
    如果我没有想错，这个命名实体识别的作业在面试的时候应该是可以拿出来讲的（算是NLP方面一个具体的任务）
    解决方案是通过双向LSTM来解决
"""

import sys
sys.path.append("..")
from common.download_utils import download_week2_resources

download_week2_resources()  #下载训练集，验证集，测试集

File data/train.txt is already downloaded.
File data/validation.txt is already downloaded.
File data/test.txt is already downloaded.


### Load the Twitter Named Entity Recognition corpus


We will work with a corpus, which contains twits with NE tags. Every line of a file contains a pair of a token (word/punctuation symbol) and a tag, separated by a whitespace. Different tweets are separated by an empty line.

The function *read_data* reads a corpus from the *file_path* and returns two lists: one with tokens and one with the corresponding tags. You need to complete this function by adding a code, which will replace a user's nickname to `<USR>` token and any URL to `<URL>` token. You could think that a URL and a nickname are just strings which start with *http://* or *https://* in case of URLs and a *@* symbol for nicknames.

In [4]:
 """
    语料中包含twitter和对应的标签．每一行包含一条twitter，每一个twitter有token/tag对组成，空格分隔，不同tweet之间有一个空行
    
    处理完毕之后返回一个token list 和 tag list
    将user's nickname　用<USR>标签替换，nicknname一般以 @ 开始
    将URL用<URL>标签替换，url一般以 https:// 或 http:// 开始
"""
def read_data(file_path):
    tokens = []
    tags = []
    
    tweet_tokens = []
    tweet_tags = []
    for line in open(file_path, encoding='utf-8'):
        line = line.strip()
        if not line:  #这里应该是指遇到空行的时候
            if tweet_tokens:
                tokens.append(tweet_tokens)     #应该是用append，形成一个列表的列表
                tags.append(tweet_tags)
            tweet_tokens = []
            tweet_tags = []
        else:
            token, tag = line.split()   #已经按照空格进行切分了
            # Replace all urls with <URL> token
            # Replace all users with <USR> token

            ######################################
            ######### YOUR CODE HERE #############
            ######################################
            if token.startswith("http://") or token.startswith("https://"):
                token = "<URL>"
            if token.startswith("@"):
                token = "<USR>"
            
            tweet_tokens.append(token)
            tweet_tags.append(tag)
            
    return tokens, tags

And now we can load three separate parts of the dataset:
 - *train* data for training the model;
 - *validation* data for evaluation and hyperparameters tuning;
 - *test* data for final evaluation of the model.

In [5]:
train_tokens, train_tags = read_data('data/train.txt')  #用来训练模型
validation_tokens, validation_tags = read_data('data/validation.txt')  #用来评估和调节超参数
test_tokens, test_tags = read_data('data/test.txt')    #用来评估最终的模型

You should always understand what kind of data you deal with. For this purpose, you can print the data running the following cell:

In [7]:
for i in range(1):
    for token, tag in zip(train_tokens[i], train_tags[i]):
        print('%s\t%s' % (token, tag))   #切分正确
        #print(type((token,tag)))
    print()

RT	O
<USR>	O
:	O
Online	O
ticket	O
sales	O
for	O
Ghostland	B-musicartist
Observatory	I-musicartist
extended	O
until	O
6	O
PM	O
EST	O
due	O
to	O
high	O
demand	O
.	O
Get	O
them	O
before	O
they	O
sell	O
out	O
...	O



### Prepare dictionaries

To train a neural network, we will use two mappings: 
- {token}$\to${token id}: address the row in embeddings matrix for the current token;
- {tag}$\to${tag id}: one-hot ground truth probability distribution vectors for computing the loss at the output of the network.

Now you need to implement the function *build_dict* which will return {token or tag}$\to${index} and vice versa. 

In [77]:
"""
    形成两种映射：
        词->词id：便于为当前的词定位到词嵌入矩阵中特定的行
        标签->标签id: 便于形成one-hot向量，便于为网络的输出计算损失
"""
from collections import defaultdict
from collections import Counter

In [95]:
"""
    添加 token or tag 到下标的映射
    read_data 将 用户昵称 和 超链接 替换完之后才放进 build_dict
"""
def build_dict(tokens_or_tags, special_tokens):
    """
        tokens_or_tags: a list of lists of tokens or tags
        special_tokens: some special tokens
    """
    # Create a dictionary with default value 0
    tok2idx = defaultdict(lambda: 0)
    idx2tok = []
    
    # Create mappings from tokens (or tags) to indices and vice versa.
    # Add special tokens (or tags) to the dictionaries.      将一些特殊的token或tags加入字典
    # The first special token must have index 0.　　          第一个特殊的token必须为零
    
    # Mapping tok2idx should contain each token or tag only once. 
    # To do so, you should extract unique tokens/tags from the tokens_or_tags variable
    # and then index them (for example, you can add them into the list idx2tok
    # and for each token/tag save the index into tok2idx).
    
    ######################################
    ######### YOUR CODE HERE #############
    ######################################
    #利用set去重
    tokens = []
    uni_tokens = []
    for token_oneTwitter in tokens_or_tags:
        tokens.extend(token_oneTwitter)
#     print(uni_tokens)
   
    #用于统计带标签的5000个高频词
    withTag_tokens = []
    for i in len(tokens_or_tags):
        
    
    uni_tokens =list(set(tokens))   #利用set去重(看看序列长度，只能保证词表大小个单词_5000)
    print("uni_tokens.length:", len(uni_tokens)) #有20505个不重复的字符串，不可能都建立词向量，那怎么办，可以建高频词
    
    #取最高频的5000个词(对tokens取，而不是uni_tokens),改为收集带标签的5000个高频词，效果应该会提升很多
    counter = Counter(tokens)     #这特么，去重了怎么统计高频词
    count_pairs = counter.most_common(5000-len(special_tokens))
    words,_ = list(zip(*count_pairs))  #4998个高频单词的集合
    words = list(words)    #元组转list
#     print("最高频的5000个词：", words)
     
    #对于tag,首先要为标签'0'去重，这几行代码专为tag
    if len(special_tokens)==1:
        tag = special_tokens[0]
        if tag in words:
            words.remove('O')
    #防止这个特殊tag早已经存在
    words = special_tokens + list(words) #5000收集完毕（还得思考低频词如何处理）
    
    
    #这里做一个判断，如果len(words)<50，就拼接到50，这一步是为了tags,不好做，tag映射成下标有问题
#     if len(words)<50:
#         for i in range(len(words),50):
#             words.append("0")
    print("len(words):", len(words))
   
    
    #为每个token赋予一个下标
    tok2idx = dict(zip(words, range(len(words) )))  #将两个长度相同的list按序打包成字典，前一个为键，后一个为值
    idx2tok = {}
    for key, val in tok2idx.items():  #字典反转，键值互换（首先要保证两者都是不重复的）
        idx2tok[val] = key
    print("len(tok2idx):", len(tok2idx))
    return tok2idx, idx2tok

After implementing the function *build_dict* you can make dictionaries for tokens and tags. Special tokens in our case will be:
 - `<UNK>` token for out of vocabulary tokens;
 - `<PAD>` token for padding sentence to the same length when we create batches of sentences.

In [96]:
#<UNK>表示未登录词
#<PAD>用于填充句子，使长度相同
special_tokens = ['<UNK>', '<PAD>']
special_tags = ['O']

# Create dictionaries （创建词映射和标签映射字典）
token2idx, idx2token = build_dict(train_tokens + validation_tokens, special_tokens)
tag2idx, idx2tag = build_dict(train_tags, special_tags)
print("tag2idx:", tag2idx)

uni_tokens.length: 20503
len(words): 5000
len(tok2idx): 5000
uni_tokens.length: 21
len(words): 21
len(tok2idx): 21
tag2idx: {'O': 0, 'B-geo-loc': 1, 'B-person': 2, 'I-other': 3, 'B-other': 4, 'B-company': 5, 'I-product': 6, 'I-person': 7, 'I-facility': 8, 'B-product': 9, 'B-facility': 10, 'B-musicartist': 11, 'I-geo-loc': 12, 'I-company': 13, 'B-sportsteam': 14, 'I-musicartist': 15, 'I-movie': 16, 'I-sportsteam': 17, 'B-movie': 18, 'B-tvshow': 19, 'I-tvshow': 20}


The next additional functions will help you to create the mapping between tokens and ids for a sentence. 

In [97]:
"""
    来了任何一个句子，都可以词转下标，标签转下标，或者是下标转词，下标转标签
"""
def words2idxs(tokens_list):     #如果不在我的词典中，直接赋值标签<UNK>,<UNK>对应的tag是"O"
    idxs = []
    for word in tokens_list:
        if word in token2idx:
            idxs.append(token2idx[word])
        else:
            idxs.append(0)    #0 比表示<UNK>对应的下标，说明没出现在词表中的词的处理方式
    return idxs

def tags2idxs(tags_list):
    return [tag2idx[tag] for tag in tags_list]

def idxs2words(idxs):
    return [idx2token[idx] for idx in idxs]

def idxs2tags(idxs):
#     tags=[]
#     for idx in idxs:
#         if idx>=len(idx2tag):
#             tags.append('0')
    return [idx2tag[idx] for idx in idxs]

### Generate batches

Neural Networks are usually trained with batches. It means that weight updates of the network are based on several sequences at every single time. The tricky part is that all sequences within a batch need to have the same length. So we will pad them with a special `<PAD>` token. It is also a good practice to provide RNN with sequence lengths, so it can skip computations for padding parts. We provide the batching function *batches_generator* readily available for you to save time. 


In [98]:
"""
    神经网络经常批量训练数据。每一次都是基于部分序列来更新网络的权重，但是要保证一个batch内所有的序列都有同样的长度。
    所以采用<pad>来进行了填充。
    给rnn提供序列长度是一个很好的方法，这样可以跳过填充部分的计算．
    
"""
def batches_generator(batch_size, tokens, tags,
                      shuffle=True, allow_smaller_last_batch=True):
    """Generates padded batches of tokens and tags."""
    
    n_samples = len(tokens)  #推特的数量
    if shuffle:
        order = np.random.permutation(n_samples)   #随机打乱，permutation不直接在原来的数组上进行操作，而是返回一个新的打乱顺序的数组，并不改变原来的数组
    else:
        order = np.arange(n_samples)

    n_batches = n_samples // batch_size   #看可以形成多少个batchs
    if allow_smaller_last_batch and n_samples % batch_size:
        n_batches += 1     #最后一部分不足一个batch_size个数的元素也形成一个batc，只是后续需要进行padding

    for k in range(n_batches):
        batch_start = k * batch_size
        batch_end = min((k + 1) * batch_size, n_samples) #主要是为最后一个考虑
        current_batch_size = batch_end - batch_start    #正常来说的等于batch_size的
        x_list = []
        y_list = []
        max_len_token = 0
        for idx in order[batch_start: batch_end]:   #遍历该batch中每一个下标
            x_list.append(words2idxs(tokens[idx]))  #根据下标挑选出对应的词！然后在词典里面找下标（一定是这样才行）,应该是通过这个方式将下标控制在0-5000之间
            y_list.append(tags2idxs(tags[idx]))      
            max_len_token = max(max_len_token, len(tags[idx]))    #最大长度的标签的长度
            
        # Fill in the data into numpy nd-arrays filled with padding indices.
        #保证token和tag的长度是一样的
        x = np.ones([current_batch_size, max_len_token], dtype=np.int32) * token2idx['<PAD>']
    
        y = np.ones([current_batch_size, max_len_token], dtype=np.int32) * tag2idx['O']
        
        lengths = np.zeros(current_batch_size, dtype=np.int32)
#         print("序列长度：", lengths.shape)
        for n in range(current_batch_size):    #用来记录每一个sequence的真实长度
            utt_len = len(x_list[n])
            x[n, :utt_len] = x_list[n]
            lengths[n] = utt_len
            y[n, :utt_len] = y_list[n]
        yield x, y, lengths

## Build a recurrent neural network

This is the most important part of the assignment. Here we will specify the network architecture based on TensorFlow building blocks. It's fun and easy as a lego constructor! We will create an LSTM network which will produce probability distribution over tags for each token in a sentence. To take into account both right and left contexts of the token, we will use Bi-Directional LSTM (Bi-LSTM). Dense layer will be used on top to perform tag classification.  



In [99]:
"""
    创建一个LSTM网络，为句子中的每个标记生成标签概率分布。要考虑token的左右侧的上下文。
    密集层将用于标签分类 
"""
import tensorflow as tf
import numpy as np


In [100]:
class BiLSTMModel():
    pass

First, we need to create [placeholders](https://www.tensorflow.org/versions/master/api_docs/python/tf/placeholder) to specify what data we are going to feed into the network during the execution time.  For this task we will need the following placeholders:
 - *input_batch* — sequences of words (the shape equals to [batch_size, sequence_len]);
 - *ground_truth_tags* — sequences of tags (the shape equals to [batch_size, sequence_len]);
 - *lengths* — lengths of not padded sequences (the shape equals to [batch_size]);
 - *dropout_ph* — dropout keep probability; this placeholder has a predefined value 1;
 - *learning_rate_ph* — learning rate; we need this placeholder because we want to change the value during training.

It could be noticed that we use *None* in the shapes in the declaration, which means that data of any size can be feeded. 

You need to complete the function *declare_placeholders*.

In [101]:
"""
    初始化一些赋值变量：即创建placeholders．在程序执行期间，才会给变量赋值
    词序列
    标签序列
    没有填充的序列长度：即 batch_size
    丢弃概率
    学习速率
"""
def declare_placeholders(self):
    """Specifies placeholders for the model."""

    # Placeholders for input and ground truth output.
    # 每一次处理batch_size个推特，每个推特的长度为sequence_len
    self.input_batch = tf.placeholder(dtype=tf.int32, shape=[None, None], name='input_batch')  #batch_size * sequence_len
    print("self.input_batch.shape:", self.input_batch.shape)
    self.ground_truth_tags = tf.placeholder(dtype=tf.int32, shape=[None, None], name="ground_truth_tags") #batch_size * sequence_len
  
    # Placeholder for lengths of the sequences.(序列长度)
    self.lengths = tf.placeholder(dtype=tf.int32, shape=[None], name='lengths') #batch_size
    
    # Placeholder for a dropout keep probability. If we don't feed
    # a value for this placeholder, it will be equal to 1.0.
    # tf.cast　改变一个张量的数据类型（保留概率）
    self.dropout_ph = tf.placeholder_with_default(tf.cast(1.0, tf.float32), shape=[])
    
    # Placeholder for a learning rate (tf.float32).
    self.learning_rate_ph = tf.placeholder(dtype=tf.float32)

In [102]:
BiLSTMModel.__declare_placeholders = classmethod(declare_placeholders)

Now, let us specify the layers of the neural network. First, we need to perform some preparatory steps: 
 
- Create embeddings matrix with [tf.Variable](https://www.tensorflow.org/api_docs/python/tf/Variable). Specify its name (*embeddings_matrix*), type  (*tf.float32*), and initialize with random values.
- Create forward and backward LSTM cells. TensorFlow provides a number of [RNN cells](https://www.tensorflow.org/api_guides/python/contrib.rnn#Core_RNN_Cells_for_use_with_TensorFlow_s_core_RNN_methods) ready for you. We suggest that you use *BasicLSTMCell*, but you can also experiment with other types, e.g. GRU cells. [This](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) blogpost could be interesting if you want to learn more about the differences.
- Wrap your cells with [DropoutWrapper](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/DropoutWrapper). Dropout is an important regularization technique for neural networks. Specify all keep probabilities using the dropout placeholder that we created before.
 
After that, you can build the computation graph that transforms an input_batch:

- [Look up](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup) embeddings for an *input_batch* in the prepared *embedding_matrix*.
- Pass the embeddings through [Bidirectional Dynamic RNN](https://www.tensorflow.org/api_docs/python/tf/nn/bidirectional_dynamic_rnn) with the specified forward and backward cells. Use the lengths placeholder here to avoid computations for padding tokens inside the RNN.
- Create a dense layer on top. Its output will be used directly in loss function.  
 
Fill in the code below. In case you need to debug something, the easiest way is to check that tensor shapes of each step match the expected ones. 
 

In [147]:
"""#表示神经网络的层
    创建词嵌入矩阵
    创建前向、后向 LSTM cell，类型选择Basic LSTMCell
    用DropoutWrapper对cell进行正则化
    之后可以将每个输入batch转换为一个计算图
        1.从之前准备好的embedding_matrix中为input_batch进行查找，组成embedding矩阵
        2.通过带有特定的前向，后向cells的双向动态RNN来处理这个embeddings(可以使用长度占位符来避免RNN对填充标记长度的计算)
        3.在顶部创建一个密集层，他的输出将会用于直接计算损失函数
"""
def build_layers(self, vocabulary_size, embedding_dim, n_hidden_rnn, n_tags):
    """Specifies bi-LSTM architecture and computes logits for inputs."""
    
    # Create embedding variable (tf.Variable) with dtype tf.float32
    # V * N
    # 词嵌入矩阵初始化
    initial_embedding_matrix = np.random.randn(vocabulary_size, embedding_dim) / np.sqrt(embedding_dim)
    embedding_matrix_variable = tf.Variable(initial_embedding_matrix, dtype=tf.float32, name="embeddings_matrix")   #这里的代码没验证正确性
    
    # Create RNN cells (for example, tf.nn.rnn_cell.BasicLSTMCell) with n_hidden_rnn number of units 
    # and dropout (tf.nn.rnn_cell.DropoutWrapper), initializing all *_keep_prob with dropout placeholder.
    forward_cell =  tf.contrib.rnn.BasicLSTMCell(n_hidden_rnn, forget_bias=1.0, state_is_tuple=True)   #正方向传播的RNN
    tf.contrib.rnn.DropoutWrapper(forward_cell)
    backward_cell =  tf.contrib.rnn.BasicLSTMCell(n_hidden_rnn, forget_bias=1.0, state_is_tuple=True)    #反方向传播的RNN
    tf.contrib.rnn.DropoutWrapper(backward_cell)
    
    # Look up embeddings for self.input_batch (tf.nn.embedding_lookup).
    # Shape: [batch_size, sequence_len, embedding_dim].
    embeddings =  tf.nn.embedding_lookup(embedding_matrix_variable, self.input_batch)
    print(embeddings.shape)    #(?,?,200)
    
    # Pass them through Bidirectional Dynamic RNN (tf.nn.bidirectional_dynamic_rnn).
    # Shape: [batch_size, sequence_len, 2 * n_hidden_rnn]. 
    # Also don't forget to initialize sequence_length as self.lengths and dtype as tf.float32.
    """yes, 原来错在sequence_length上面"""
    # 现在报的是TypeError错误，在文档中看到一句：TypeError: If cell_fw or cell_bw is not an instance of RNNCell.
    # 经过测试，错误在forget_bias = self.dropout_ph(self.dropout_ph这玩意现在根本就不能用，改成forget_bias就没事)
    (rnn_output_fw, rnn_output_bw), _ = tf.nn.bidirectional_dynamic_rnn(forward_cell, backward_cell, inputs=embeddings, sequence_length=self.lengths, dtype=tf.float32)
    rnn_output = tf.concat([rnn_output_fw, rnn_output_bw], axis=2)  #以前版本的形式

    # Dense layer on top.
    # Shape: [batch_size, sequence_len, n_tags].  形成全连接层 
    self.logits = tf.layers.dense(rnn_output, n_tags, activation=None)

In [148]:
BiLSTMModel.__build_layers = classmethod(build_layers)

To compute the actual predictions of the neural network, you need to apply [softmax](https://www.tensorflow.org/api_docs/python/tf/nn/softmax) to the last layer and find the most probable tags with [argmax](https://www.tensorflow.org/api_docs/python/tf/argmax).

In [149]:
#应用softmax函数到最后一层，用argmax来找到最可能的标签
def compute_predictions(self):
    """Transforms logits to probabilities and finds the most probable tags."""
    
    # Create softmax (tf.nn.softmax) function
    # tf.nn.softmax(logits, axis=None, name=None, dim=None)
    softmax_output = tf.nn.softmax(self.logits)
    
    # Use argmax (tf.argmax) to get the most probable tags
    # Don't forget to set axis=-1
    # otherwise argmax will be calculated in a wrong way
    print("softmax_output:", softmax_output)
    self.predictions = tf.argmax(softmax_output, axis=-1)    #概率最大的类别的下标

In [150]:
BiLSTMModel.__compute_predictions = classmethod(compute_predictions)

During training we do not need predictions of the network, but we need a loss function. We will use [cross-entropy loss](http://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html#cross-entropy), efficiently implemented in TF as 
[cross entropy with logits](https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits). Note that it should be applied to logits of the model (not to softmax probabilities!). Also note,  that we do not want to take into account loss terms coming from `<PAD>` tokens. So we need to mask them out, before computing [mean](https://www.tensorflow.org/api_docs/python/tf/reduce_mean).

In [151]:
#用交叉熵来做损失函数，在计算损失时，不需要将<PAD>token考虑在内
def compute_loss(self, n_tags, PAD_index):
    """Computes masked cross-entopy loss with logits."""
    
    # Create cross entropy function function (tf.nn.softmax_cross_entropy_with_logits)
    ground_truth_tags_one_hot = tf.one_hot(self.ground_truth_tags, n_tags)
    loss_tensor = tf.nn.softmax_cross_entropy_with_logits(logits = self.logits, labels= ground_truth_tags_one_hot)
    #返回逐个元素x!=y的布尔值 not_equal(x,y,name=None)
    mask = tf.cast(tf.not_equal(self.input_batch, PAD_index), tf.float32)
    # Create loss function which doesn't operate with <PAD> tokens (tf.reduce_mean)
    # Be careful that the argument of tf.reduce_mean should be
    # multiplication of mask and loss_tensor.
    self.loss =  tf.reduce_mean(tf.matmul(mask, loss_tensor,  transpose_a=False, transpose_b=True))

In [152]:
BiLSTMModel.__compute_loss = classmethod(compute_loss)

The last thing to specify is how we want to optimize the loss. 
We suggest that you use [Adam](https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) optimizer with a learning rate from the corresponding placeholder. 
You will also need to apply [clipping](https://www.tensorflow.org/versions/r0.12/api_docs/python/train/gradient_clipping) to eliminate exploding gradients. It can be easily done with [clip_by_norm](https://www.tensorflow.org/api_docs/python/tf/clip_by_norm) function. 

In [153]:
#优化损失函数2
#选择合适的优化函数和进行梯度剪切
def perform_optimization(self):
    """Specifies the optimizer and train_op for the model."""
    
    # Create an optimizer (tf.train.AdamOptimizer)
    # 千万别加上minimize(self.loss)，不然它就不是一个优化器了阿．．．．
    self.optimizer =  tf.train.AdamOptimizer(learning_rate=self.learning_rate_ph)
    # grads_and_vars: compute_gradients()函数返回的(gradient, variable)对的列表 
    # 卧槽，原始是可以用的，只是我为啥要加上minimize()!!!,终于通过了．．．．．．．．．．
    self.grads_and_vars = self.optimizer.compute_gradients(self.loss)  
    
    # Gradient clipping (tf.clip_by_norm) for self.grads_and_vars
    # Pay attention that you need to apply this operation only for gradients 
    # because self.grads_and_vars contains also variables.
    # list comprehension might be useful in this case.
    clip_norm = tf.cast(1.0, tf.float32)   #梯度剪切
    self.grads_and_vars =  [(tf.clip_by_norm( grad, clip_norm), var) for grad, var in self.grads_and_vars]
    
    self.train_op = self.optimizer.apply_gradients(self.grads_and_vars)   #system supply

In [154]:
BiLSTMModel.__perform_optimization = classmethod(perform_optimization)

Congratulations! You have specified all the parts of your network. You may have noticed, that we didn't deal with any real data yet, so what you have written is just recipes on how the network should function.
Now we will put them to the constructor of our Bi-LSTM class to use it in the next section. 

In [155]:
"""
    现在已经写完了我的网络的所有部分，但是还没有使用任何真实的数据，下一个模块会应用写好的Bi_LSTM
"""
def init_model(self, vocabulary_size, n_tags, embedding_dim, n_hidden_rnn, PAD_index):
    self.__declare_placeholders()   #声明需要后期赋值的变量
    self.__build_layers(vocabulary_size, embedding_dim, n_hidden_rnn, n_tags)  #建立我的整个网络层
    self.__compute_predictions()   #将全连接层的输出放进softmax函数，得出预测结果
    self.__compute_loss(n_tags, PAD_index)    #利用交叉熵损失函数，真实标签的one-hot,预测的y向量来计算平均损失函数
    self.__perform_optimization()   #利用Adam，梯度剪切来优化损失函数

In [156]:
BiLSTMModel.__init__ = classmethod(init_model)

## Train the network and predict tags

[Session.run](https://www.tensorflow.org/api_docs/python/tf/Session#run) is a point which initiates computations in the graph that we have defined. To train the network, we need to compute *self.train_op*, which was declared in *perform_optimization*. To predict tags, we just need to compute *self.predictions*. Anyway, we need to feed actual data through the placeholders that we defined before. 

In [157]:
"""
    利用x_batch和y_batch进行训练
"""
def train_on_batch(self, session, x_batch, y_batch, lengths, learning_rate, dropout_keep_probability):
    feed_dict = {self.input_batch: x_batch,
                 self.ground_truth_tags: y_batch,
                 self.learning_rate_ph: learning_rate,
                 self.dropout_ph: dropout_keep_probability,
                 self.lengths: lengths}
    
    session.run(self.train_op, feed_dict=feed_dict)

In [158]:
BiLSTMModel.train_on_batch = classmethod(train_on_batch)

Implement the function *predict_for_batch* by initializing *feed_dict* with input *x_batch* and *lengths* and running the *session* for *self.predictions*.

In [159]:
"""
    对测试集的x_batch进行预测
"""
def predict_for_batch(self, session, x_batch, lengths):
    ######################################
    ######### YOUR CODE HERE #############
    ######################################
    feed_dict = {
        self.input_batch:x_batch,
        self.lengths:lengths        
                }
    predictions = session.run(self.predictions, feed_dict=feed_dict)
    
    return predictions     #这里的返回值是一个numpy.arra,千万不能写self.predictions,那是还没有计算之前的值

In [160]:
BiLSTMModel.predict_for_batch = classmethod(predict_for_batch)

We finished with necessary methods of our BiLSTMModel model and almost ready to start experimenting.

### Evaluation 
To simplify the evaluation process we provide two functions for you:
 - *predict_tags*: uses a model to get predictions and transforms indices to tokens and tags;
 - *eval_conll*: calculates precision, recall and F1 for the results.

In [161]:
from evaluation import precision_recall_f1

In [162]:
#看传进来的token_idxs_batch是不是打乱的，如果是打乱的，下面的token,tag元组列表就有意义
def predict_tags(model, session, token_idxs_batch, lengths):
    """Performs predictions and transforms indices to tokens and tags."""
    
    #这里的tag_idxs_batch是一个Tensor
    tag_idxs_batch = model.predict_for_batch(session, token_idxs_batch, lengths) #预测出的tag下标列表
#     for tag_idx in tag_idxs_batch:
#         print("tag_idx:", tag_idx)
    
    tags_batch, tokens_batch = [], []
    for tag_idxs, token_idxs in zip(tag_idxs_batch, token_idxs_batch):  
        tags, tokens = [], []
        #这种写法可能是想少写一个循环，也就是一个循环循环了两个list
        for tag_idx, token_idx in zip(tag_idxs, token_idxs):  #标签和token的下标元组
            tags.append(idx2tag[tag_idx])        #下标对应的tag,这里出错，找不到下标对应的tag,因为实际tag的维度只有21,但是softmax输出的n_tag却有50维
            tokens.append(idx2token[token_idx])  #下标对应的token
        tags_batch.append(tags)     #真实字符串的list
        tokens_batch.append(tokens)
    return tags_batch, tokens_batch    #预测的tag列表   和   tokens列表(真实的字符串，不是下标)
    
    
def eval_conll(model, session, tokens, tags, short_report=True):
    """Computes NER quality measures using CONLL shared task script."""
    
    y_true, y_pred = [], []
    for x_batch, y_batch, lengths in batches_generator(1, tokens, tags):
#         print("x_batch:", x_batch)   #这里下标超出5000太多，无法进行embedding查找
        tags_batch, tokens_batch = predict_tags(model, session, x_batch, lengths)
        if len(x_batch[0]) != len(tags_batch[0]):     #一直没搞清楚token长度和tag长度有什么关系
            raise Exception("Incorrect length of prediction for the input, "
                            "expected length: %i, got: %i" % (len(x_batch[0]), len(tags_batch[0])))
        predicted_tags = []     #预测标签
        ground_truth_tags = []  #真实标签
        for gt_tag_idx, pred_tag, token in zip(y_batch[0], tags_batch[0], tokens_batch[0]): 
            if token != '<PAD>':
                ground_truth_tags.append(idx2tag[gt_tag_idx])   #真实标签的集合
                predicted_tags.append(pred_tag)    #预测标签的集合

        # We extend every prediction and ground truth sequence with 'O' tag
        # to indicate a possible end of entity.
        y_true.extend(ground_truth_tags + ['O'])    #用来表示一次序列的结束
        y_pred.extend(predicted_tags + ['O'])
        
    #求精确度，召回率，和F1值
    results = precision_recall_f1(y_true, y_pred, print_results=True, short_report=short_report)
    return results

## Run your experiment

Create *BiLSTMModel* model with the following parameters:
 - *vocabulary_size* — number of tokens;
 - *n_tags* — number of tags;
 - *embedding_dim* — dimension of embeddings, recommended value: 200;
 - *n_hidden_rnn* — size of hidden layers for RNN, recommended value: 200;
 - *PAD_index* — an index of the padding token (`<PAD>`).

Set hyperparameters. You might want to start with the following recommended values:
- *batch_size*: 32;
- 4 epochs;
- starting value of *learning_rate*: 0.005
- *learning_rate_decay*: a square root of 2;
- *dropout_keep_probability*: try several values: 0.1, 0.5, 0.9.

However, feel free to conduct more experiments to tune hyperparameters and earn extra points for the assignment.

In [165]:
tf.reset_default_graph()  #清除默认图形堆栈并重置全局默认图形

# model = BiLSTMModel( 5000, 50, 200, 200, 1)
model = BiLSTMModel( 5000, 21, 200, 200, 1)   #由于tag实际类别为21的问题，修改了下

batch_size = 32 ######### YOUR CODE HERE #############
n_epochs = 4 ######### YOUR CODE HERE #############
learning_rate = 0.005 ######### YOUR CODE HERE #############
learning_rate_decay =  np.sqrt(2)######### YOUR CODE HERE #############
dropout_keep_probability = 0.9 ######### YOUR CODE HERE #############

self.input_batch.shape: (?, ?)
(?, ?, 200)
softmax_output: Tensor("Reshape_1:0", shape=(?, ?, 21), dtype=float32)


If you got an error *"Tensor conversion requested dtype float64 for Tensor with dtype float32"* in this point, check if there are variables without dtype initialised. Set the value of dtype equals to *tf.float32* for such variables.

Finally, we are ready to run the training!

In [166]:
sess = tf.Session()    #建立会话
sess.run(tf.global_variables_initializer())   #对会话进行初始化

print('Start training... \n')
for epoch in range(n_epochs):
    # For each epoch evaluate the model on train and validation data
    print('-' * 20 + ' Epoch {} '.format(epoch+1) + 'of {} '.format(n_epochs) + '-' * 20)
          
    print('Train data evaluation:')
    eval_conll(model, sess, train_tokens, train_tags, short_report=True)
    print('Validation data evaluation:')
    eval_conll(model, sess, validation_tokens, validation_tags, short_report=True)
    
    # Train the model
    for x_batch, y_batch, lengths in batches_generator(batch_size, train_tokens, train_tags):
#         print("x_batch:", x_batch)
        model.train_on_batch(sess, x_batch, y_batch, lengths, learning_rate, dropout_keep_probability)
        
    # Decaying the learning rate
    learning_rate = learning_rate / learning_rate_decay
    
print('...training finished.')

Start training... 

-------------------- Epoch 1 of 4 --------------------
Train data evaluation:
processed 105778 tokens with 4489 phrases; found: 71725 phrases; correct: 171.

precision:  0.24%; recall:  3.81%; F1:  0.45

Validation data evaluation:
processed 12836 tokens with 537 phrases; found: 8655 phrases; correct: 27.

precision:  0.31%; recall:  5.03%; F1:  0.59

-------------------- Epoch 2 of 4 --------------------
Train data evaluation:
processed 105778 tokens with 4489 phrases; found: 1980 phrases; correct: 1000.

precision:  50.51%; recall:  22.28%; F1:  30.92

Validation data evaluation:
processed 12836 tokens with 537 phrases; found: 223 phrases; correct: 106.

precision:  47.53%; recall:  19.74%; F1:  27.89

-------------------- Epoch 3 of 4 --------------------
Train data evaluation:
processed 105778 tokens with 4489 phrases; found: 3224 phrases; correct: 1814.

precision:  56.27%; recall:  40.41%; F1:  47.04

Validation data evaluation:
processed 12836 tokens with 537

Now let us see full quality reports for the final model on train, validation, and test sets. To give you a hint whether you have implemented everything correctly, you might expect F-score about 40% on the validation set.

**The output of the cell below (as well as the output of all the other cells) should be present in the notebook for peer2peer review!**

In [167]:
print('-' * 20 + ' Train set quality: ' + '-' * 20)
train_results = eval_conll(model, sess, train_tokens, train_tags, short_report=False)

print('-' * 20 + ' Validation set quality: ' + '-' * 20)
validation_results = eval_conll(model, sess, validation_tokens, validation_tags, short_report=False )   #在开发集上的F1值要超过40%才行

print('-' * 20 + ' Test set quality: ' + '-' * 20)
test_results = eval_conll(model, sess, test_tokens, test_tags, short_report=False)

-------------------- Train set quality: --------------------
processed 105778 tokens with 4489 phrases; found: 3696 phrases; correct: 2729.

precision:  73.84%; recall:  60.79%; F1:  66.68

	     company: precision:   79.37%; recall:   67.03%; F1:   72.68; predicted:   543

	    facility: precision:   72.54%; recall:   68.15%; F1:   70.28; predicted:   295

	     geo-loc: precision:   86.75%; recall:   72.99%; F1:   79.28; predicted:   838

	       movie: precision:   33.33%; recall:   30.88%; F1:   32.06; predicted:    63

	 musicartist: precision:   61.50%; recall:   49.57%; F1:   54.89; predicted:   187

	       other: precision:   72.63%; recall:   54.69%; F1:   62.40; predicted:   570

	      person: precision:   71.73%; recall:   65.58%; F1:   68.51; predicted:   810

	     product: precision:   53.74%; recall:   49.69%; F1:   51.63; predicted:   294

	  sportsteam: precision:   73.56%; recall:   29.49%; F1:   42.11; predicted:    87

	      tvshow: precision:   44.44%; recall:  

### Conclusions

Could we say that our model is state of the art and the results are acceptable for the task? Definately, we can say so. Nowadays, Bi-LSTM is one of the state of the art approaches for solving NER problem and it outperforms other classical methods. Despite the fact that we used small training corpora (in comparison with usual sizes of corpora in Deep Learning), our results are quite good. In addition, in this task there are many possible named entities and for some of them we have only several dozens of trainig examples, which is definately small. However, the implemented model outperforms classical CRFs for this task. Even better results could be obtained by some combinations of several types of methods, e.g. see [this](https://arxiv.org/abs/1603.01354) paper if you are interested.

In [None]:
"""
    2018.6.3
    既然已经不是买的课了，就不把它当成一个作业了，仔细认真的搞完每个细节
    终于完成啦，接下来优化
"""