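## Character-level language model - Dinosaurus land
Welcome to Dinosaurus land! Dinosaurs existed 65 million years ago, and in this assignment they are back. You are in charge of a special task: leading biology researchers are creating new breeds of dinosaurs and plan to bring them to life on Earth, and your job is to name them. If a dinosaur does not like its name, it might go berserk, so choose wisely!

Luckily, you have learned some deep learning, and you will use it to save time. Your assistant has collected a list of all the dinosaur names they could find and compiled it into a dataset (`dinos.txt`, loaded below). To create new dinosaur names, you will build a character-level language model. Your algorithm will learn the patterns in the existing names and randomly generate new ones. Hopefully this algorithm will keep you and your team safe from the dinosaurs' wrath!

By completing this assignment you will learn:
- How to store text data for processing with an RNN
- How to synthesize data by sampling a prediction at each time step and passing it to the next RNN cell
- How to build a character-level text generation recurrent neural network
- Why clipping the gradients is important

We will begin by loading some functions provided for you in `utils`. In particular, you have access to functions such as `rnn_forward` and `rnn_backward`, which are equivalent to those you implemented in the previous assignment.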


```python
import numpy as np
from utils import *
import random
from random import shuffle
```

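### 1 Problem statement
#### 1.1 Dataset and preprocessing
Run the following cell to read the dataset of dinosaur names, create a list of unique characters (such as a-z), and compute the dataset size and the vocabulary size.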

```python
data = open('dinos.txt', 'r').read()
data = data.lower()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('There are %d total characters and %d unique characters in your data.' % (data_size, vocab_size))
```

```
There are 19909 total characters and 27 unique characters in your data.
```


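The characters are a-z (26 characters) plus the `"\n"` (newline) character, which in this assignment plays a role similar to the `<EOS>` (end of sentence) token discussed in lecture, except that here it marks the end of a dinosaur name rather than the end of a sentence. In the cell below, we create a Python dictionary (i.e., a hash table) that maps each character to an index from 0 to 26. We also create a second Python dictionary that maps each index back to the corresponding character. This will help you figure out which index corresponds to which character in the probability distribution output by the softmax layer. Below, `char_to_ix` and `ix_to_char` are these two dictionaries.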

```python
char_to_ix = { ch:i for i,ch in enumerate(sorted(chars)) }
ix_to_char = { i:ch for i,ch in enumerate(sorted(chars)) }
print(ix_to_char)
```

```
{0: '\n', 1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z'}
```
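
To make the mapping concrete, here is a minimal sketch of how a single name could be turned into indices and one-hot column vectors for the RNN. The vocabulary is rebuilt from `string.ascii_lowercase` so that the snippet runs on its own; the helper `one_hot` and the sample name are illustrative assumptions and not part of the assignment.

```python
import string
import numpy as np

# Rebuild the 27-character vocabulary ('\n' plus a-z) so the snippet is self-contained.
chars = sorted(set(string.ascii_lowercase + "\n"))
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

def one_hot(index, vocab_size):
    """Return a (vocab_size, 1) column vector with a 1 at `index` (hypothetical helper)."""
    x = np.zeros((vocab_size, 1))
    x[index] = 1
    return x

name = "rex\n"  # illustrative name; '\n' marks the end of the name
indices = [char_to_ix[ch] for ch in name]
vectors = [one_hot(i, len(chars)) for i in indices]

print(indices)            # [18, 5, 24, 0]
print(vectors[0].shape)   # (27, 1)
```

Each vector has exactly one non-zero entry, at the position given by `char_to_ix`.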


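#### 1.2 Overview of the model
Your model will have the following structure:

- Initialize parameters
- Run the optimization loop
    - Forward propagation to compute the loss function
    - Backward propagation to compute the gradients of the loss with respect to the parameters
    - Clip the gradients to avoid exploding gradients
    - Update the parameters using gradient descent
- Return the learned parameters

- My note: LSTMs mitigate the vanishing gradient problem but do not solve the exploding gradient problem. When gradients explode, each update changes the weights drastically, and the weights can grow exponentially.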
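
A small numerical sketch can show why gradients may explode or vanish when they are propagated back through many time steps. It assumes a simplified linear recurrence (the tanh derivative is ignored) and random weight matrices; the function name `backprop_norms` and the scales are illustrative choices, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a = 50

def backprop_norms(scale, steps=30):
    """Track the norm of a gradient vector that is multiplied by W^T at every step.

    Simplified setting (assumption): linear recurrence, so each backward step is
    just a multiplication by the transposed recurrent weight matrix.
    """
    W = rng.standard_normal((n_a, n_a)) * scale
    grad = np.ones((n_a, 1))
    norms = []
    for _ in range(steps):
        grad = W.T @ grad
        norms.append(float(np.linalg.norm(grad)))
    return norms

small = backprop_norms(scale=0.05)  # small weights: the norm shrinks step after step
large = backprop_norms(scale=0.5)   # large weights: the norm grows step after step

print(f"after 30 steps: small-weight norm {small[-1]:.2e}, large-weight norm {large[-1]:.2e}")
```

With large recurrent weights the norm grows roughly geometrically with the number of steps, which is the situation gradient clipping is meant to contain.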

![image.png](attachment:image.png)

![image.png](attachment:image.png)

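### 2 Building blocks of the model
In this part, you will build two important blocks of the overall model:

- Gradient clipping: to avoid exploding gradients
- Sampling: a technique used to generate characters

You will then apply these two functions to build the model.

#### 2.1 Clipping the gradients in the optimization loop
In this section you will implement the `clip` function that you will call inside your optimization loop. Recall that the overall loop usually consists of a forward pass, a loss computation, a backward pass, and a parameter update. Before updating the parameters, you will clip the gradients when needed, to make sure they do not "explode", meaning take on overly large values.

In the exercise below, you will implement a function `clip` that takes in a dictionary of gradients and returns a clipped version of the gradients if needed. There are different ways to clip gradients. We will use a simple element-wise clipping procedure, in which every element of the gradient vector is clipped to lie in the range [-N, N]. More generally, you will provide a `maxValue` (say 10). In this example, if any component of the gradient vector is greater than 10, it is set to 10; if any component is less than -10, it is set to -10. If it is between -10 and 10, it is left alone.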

![image.png](attachment:image.png)

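**Exercise**: Implement the function below to return the clipped gradients of your dictionary `gradients`. Your function takes in a maximum threshold and returns the clipped versions of the gradients. For a hint on how to clip in numpy, see the `numpy.clip` documentation. You will need to use the argument `out = ...`.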
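
As a quick illustration of the `out` argument, the sketch below clips a small made-up array in place. The array values and the name `alias` are only for demonstration.

```python
import numpy as np

grad = np.array([[-25.0, 3.0], [0.5, 40.0]])
alias = grad                       # second reference to the same array

# With out=grad the result is written back into grad instead of a new array.
np.clip(grad, -10, 10, out=grad)

print(grad.tolist())               # [[-10.0, 3.0], [0.5, 10.0]]
print(alias is grad)               # True
```

Because the clipping happens in place, every other reference to the same array (for example an entry in a gradients dictionary) sees the clipped values as well.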

```python
### GRADED FUNCTION: clip

def clip(gradients, maxValue):
    '''
    Clips the gradients' values between minimum and maximum.
    
    Arguments:
    gradients -- a dictionary containing the gradients "dWaa", "dWax", "dWya", "db", "dby"
    maxValue -- everything above this number is set to this number, and everything less than -maxValue is set to -maxValue
    
    Returns: 
    gradients -- a dictionary with the clipped gradients.
    '''
    
    dWaa, dWax, dWya, db, dby = gradients['dWaa'], gradients['dWax'], gradients['dWya'], gradients['db'], gradients['dby']
   
    ### START CODE HERE ###
    # clip to mitigate exploding gradients, loop over [dWax, dWaa, dWya, db, dby]. (≈2 lines)
    for gradient in [dWax, dWaa, dWya, db, dby]:
        np.clip(gradient, -maxValue, maxValue, out=gradient)
    ### END CODE HERE ###
    
    gradients = {"dWaa": dWaa, "dWax": dWax, "dWya": dWya, "db": db, "dby": dby}
    
    return gradients
```


```python
np.random.seed(3)
dWax = np.random.randn(5,3)*10
dWaa = np.random.randn(5,5)*10
dWya = np.random.randn(2,5)*10
db = np.random.randn(5,1)*10
dby = np.random.randn(2,1)*10
gradients = {"dWax": dWax, "dWaa": dWaa, "dWya": dWya, "db": db, "dby": dby}
gradients = clip(gradients, 10)
print("gradients[\"dWaa\"][1][2] =", gradients["dWaa"][1][2])
print("gradients[\"dWax\"][3][1] =", gradients["dWax"][3][1])
print("gradients[\"dWya\"][1][2] =", gradients["dWya"][1][2])
print("gradients[\"db\"][4] =", gradients["db"][4])
print("gradients[\"dby\"][1] =", gradients["dby"][1])
```

```
gradients["dWaa"][1][2] = 10.0
gradients["dWax"][3][1] = -10.0
gradients["dWya"][1][2] = 0.2971381536101662
gradients["db"][4] = [10.]
gradients["dby"][1] = [8.45833407]
```


**Expected output**:

gradients["dWaa"][1][2] = 10.0

gradients["dWax"][3][1] = -10.0

gradients["dWya"][1][2] = 0.2971381536101662

gradients["db"][4] = [10.]

gradients["dby"][1] = [8.45833407]
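
Element-wise clipping is only one of the clipping schemes mentioned above. For comparison, here is a hedged sketch of a different technique, clipping by global norm, which rescales all gradients together and therefore keeps their direction. This is not the method used in this assignment; the function name `clip_by_global_norm` and the toy gradients are assumptions made for illustration.

```python
import numpy as np

def clip_by_global_norm(gradients, max_norm):
    """Rescale all gradients so that their combined L2 norm is at most max_norm.

    Alternative to element-wise clipping; not the method used in this assignment.
    Returns the rescaled gradients and the norm measured before rescaling.
    """
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in gradients.values())))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return {name: g * scale for name, g in gradients.items()}, total_norm

grads = {"dW": np.array([[3.0, 0.0], [0.0, 4.0]]), "db": np.array([[12.0]])}
clipped, norm_before = clip_by_global_norm(grads, max_norm=6.5)

print(norm_before)  # 13.0
```

Here every entry is scaled by the same factor (about 0.5), whereas element-wise clipping would change only the entries that exceed the threshold.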



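#### 2.2 Sampling
Now assume that your model is trained. You would like to generate new text (characters). The generation process is illustrated in the figure below: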

![image.png](attachment:image.png)

**Exercise**: Implement the `sample` function below to sample characters. You need to carry out 4 steps:

![image.png](attachment:image.png)

```python
# GRADED FUNCTION: sample

def sample(parameters, char_to_ix, seed):
    """
    Sample a sequence of characters according to a sequence of probability distributions output of the RNN

    Arguments:
    parameters -- python dictionary containing the parameters Waa, Wax, Wya, by, and b. 
    char_to_ix -- python dictionary mapping each character to an index.
    seed -- used for grading purposes. Do not worry about it.

    Returns:
    indices -- a list of length n containing the indices of the sampled characters.
    """
    
    # Retrieve parameters and relevant shapes from "parameters" dictionary
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    vocab_size = by.shape[0]
    n_a = Waa.shape[1]
    
    ### START CODE HERE ###
    # Step 1: Create the one-hot vector x for the first character (initializing the sequence generation). (≈1 line)
    x = np.zeros((vocab_size, 1))
    # Step 1': Initialize a_prev as zeros (≈1 line)
    a_prev = np.zeros((n_a, 1))
    
    # Create an empty list of indices, this is the list which will contain the list of indices of the characters to generate (≈1 line)
    indices = []
    
    # Idx is a flag to detect a newline character, we initialize it to -1
    idx = -1 
    
    # Loop over time-steps t. At each time-step, sample a character from a probability distribution and append 
    # its index to "indices". We'll stop if we reach 50 characters (which should be very unlikely with a well 
    # trained model), which helps debugging and prevents entering an infinite loop. 
    counter = 0
    newline_character = char_to_ix['\n']
    
    while (idx != newline_character and counter != 50):
        
        # Step 2: Forward propagate x using the equations (1), (2) and (3)
        a = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + b)
        z = np.dot(Wya, a) + by
        y = softmax(z)
        
        # for grading purposes
        np.random.seed(counter + seed) 
        
        # Step 3: Sample the index of a character within the vocabulary from the probability distribution y
        idx = np.random.choice(range(len(y)), p=y.ravel())

        # Append the index to "indices"
        indices.append(idx)
        
        # Step 4: Overwrite the input character as the one corresponding to the sampled index.
        x = np.zeros((vocab_size, 1))
        x[idx] = 1
        
        # Update "a_prev" to be "a"
        a_prev = a
        
        # for grading purposes
        seed += 1
        counter += 1
        
    ### END CODE HERE ###

    if (counter == 50):
        indices.append(char_to_ix['\n'])
    
    return indices
```

```python
np.random.seed(2)
n, n_a = 20, 100
a0 = np.random.randn(n_a, 1)
i0 = 1 # first character is ix_to_char[i0]
Wax, Waa, Wya = np.random.randn(n_a, vocab_size), np.random.randn(n_a, n_a), np.random.randn(vocab_size, n_a)
b, by = np.random.randn(n_a, 1), np.random.randn(vocab_size, 1)
parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "b": b, "by": by}


indices = sample(parameters, char_to_ix, 0)
print("Sampling:")
print("list of sampled indices:", indices)
print("list of sampled characters:", [ix_to_char[i] for i in indices])
```

```
Sampling:
list of sampled indices: [18, 2, 26, 0]
list of sampled characters: ['r', 'b', 'z', '\n']
```


**Expected output**:

Sampling:

list of sampled indices: [18, 2, 26, 0]

list of sampled characters: ['r', 'b', 'z', '\n']

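- My note: if the concept of sampling is unclear, see this [article](https://www.modb.pro/db/381462). My understanding: at each time step, a character is drawn at random according to the output probabilities (so high-probability characters are chosen most often, but not always), and the drawn character becomes the input for the next time step.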
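
The difference between always picking the most likely character and sampling from the distribution can be seen in a small sketch. The logits below are made up, and the local `softmax` is a stand-in written for this example rather than the one from `utils`.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax for a 1-D array (stand-in for the utils version)."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.1])   # made-up scores for a 3-character vocabulary
p = softmax(logits)

# Greedy decoding always returns the same index.
greedy = int(np.argmax(p))

# Sampling draws each index with its probability, so the output varies.
draws = rng.choice(len(p), size=10_000, p=p)
freq = np.bincount(draws, minlength=len(p)) / len(draws)

print("probabilities:", p.round(3))
print("greedy choice:", greedy)
print("sampled frequencies:", freq.round(3))
```

The sampled frequencies approach the probabilities, which is why `sample` can produce different names from the same trained parameters, while greedy decoding would repeat one name.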

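### 3 Building the language model
It is time to build the character-level language model for text generation.

#### 3.1 Gradient descent
In this section you will implement a function that performs one step of stochastic gradient descent (with clipped gradients). You will go through the training examples one at a time, so the optimization algorithm is stochastic gradient descent. As a reminder, here are the steps of a common optimization loop for an RNN:

- Forward propagate through the RNN to compute the loss
- Backward propagate through time to compute the gradients of the loss with respect to the parameters
- Clip the gradients if necessary
- Update the parameters using gradient descent

**Exercise**: Implement this optimization process (one step of stochastic gradient descent).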

We provide you with the following functions:
```python
def rnn_forward(X, Y, a_prev, parameters):
    """ Performs the forward propagation through the RNN and computes the cross-entropy loss.
    It returns the loss' value as well as a "cache" storing values to be used in the backpropagation."""
    ...
    return loss, cache

def rnn_backward(X, Y, parameters, cache):
    """ Performs the backward propagation through time to compute the gradients of the loss with respect
    to the parameters. It returns also all the hidden states."""
    ...
    return gradients, a

def update_parameters(parameters, gradients, learning_rate):
    """ Updates parameters using the Gradient Descent Update Rule."""
    ...
    return parameters
```
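
The body of `rnn_forward` lives in `utils` and is not reproduced in this document. As a rough sketch, assuming it applies the same tanh and softmax equations used in `sample` and sums the cross-entropy over time steps, the forward pass could look like the following. All names ending in `_sketch` are hypothetical and this is not the actual `utils` implementation.

```python
import numpy as np

def softmax(z):
    """Column-wise numerically stable softmax (stand-in for the utils version)."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e, axis=0)

def rnn_step_forward_sketch(x, a_prev, parameters):
    """One time step: new hidden state and predicted distribution over characters."""
    a_next = np.tanh(parameters["Wax"] @ x + parameters["Waa"] @ a_prev + parameters["b"])
    y_hat = softmax(parameters["Wya"] @ a_next + parameters["by"])
    return a_next, y_hat

def sequence_loss_sketch(X, Y, a0, parameters, vocab_size=27):
    """Sum of the cross-entropy losses over all time steps of one example."""
    a, loss = a0, 0.0
    for x_ix, y_ix in zip(X, Y):
        x = np.zeros((vocab_size, 1))
        if x_ix is not None:          # None stands for the all-zero first input
            x[x_ix] = 1
        a, y_hat = rnn_step_forward_sketch(x, a, parameters)
        loss -= np.log(y_hat[y_ix, 0])
    return float(loss)

# With all-zero parameters every prediction is uniform over the 27 characters.
n_a, vocab_size = 4, 27
zero_params = {
    "Wax": np.zeros((n_a, vocab_size)), "Waa": np.zeros((n_a, n_a)),
    "Wya": np.zeros((vocab_size, n_a)),
    "b": np.zeros((n_a, 1)), "by": np.zeros((vocab_size, 1)),
}
loss = sequence_loss_sketch([None, 1, 2], [1, 2, 0], np.zeros((n_a, 1)), zero_params)
print(loss, 3 * np.log(27))
```

For a uniform prediction the loss is the sequence length times ln(27), which gives a useful reference point when reading the loss values printed during training.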

```python
# GRADED FUNCTION: optimize

def optimize(X, Y, a_prev, parameters, learning_rate = 0.01):
    """
    Execute one step of the optimization to train the model.
    
    Arguments:
    X -- list of integers, where each integer is a number that maps to a character in the vocabulary.
    Y -- list of integers, exactly the same as X but shifted one index to the left.
    a_prev -- previous hidden state.
    parameters -- python dictionary containing:
                        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        b --  Bias, numpy array of shape (n_a, 1)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
    learning_rate -- learning rate for the model.
    
    Returns:
    loss -- value of the loss function (cross-entropy)
    gradients -- python dictionary containing:
                        dWax -- Gradients of input-to-hidden weights, of shape (n_a, n_x)
                        dWaa -- Gradients of hidden-to-hidden weights, of shape (n_a, n_a)
                        dWya -- Gradients of hidden-to-output weights, of shape (n_y, n_a)
                        db -- Gradients of bias vector, of shape (n_a, 1)
                        dby -- Gradients of output bias vector, of shape (n_y, 1)
    a[len(X)-1] -- the last hidden state, of shape (n_a, 1)
    """
    
    ### START CODE HERE ###
    
    # Forward propagate through time (≈1 line)
    loss, cache = rnn_forward(X, Y, a_prev, parameters)
    
    # Backpropagate through time (≈1 line)
    gradients, a = rnn_backward(X, Y, parameters, cache)
    
    # Clip your gradients between -5 (min) and 5 (max) (≈1 line)
    gradients = clip(gradients, 5)
    
    # Update parameters (≈1 line)
    parameters = update_parameters(parameters, gradients, learning_rate)
    
    ### END CODE HERE ###
    
    return loss, gradients, a[len(X)-1]
```

```python
np.random.seed(1)
vocab_size, n_a = 27, 100
a_prev = np.random.randn(n_a, 1)
Wax, Waa, Wya = np.random.randn(n_a, vocab_size), np.random.randn(n_a, n_a), np.random.randn(vocab_size, n_a)
b, by = np.random.randn(n_a, 1), np.random.randn(vocab_size, 1)
parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "b": b, "by": by}
X = [12,3,5,11,22,3]
Y = [4,14,11,22,25, 26]

loss, gradients, a_last = optimize(X, Y, a_prev, parameters, learning_rate = 0.01)
print("Loss =", loss)
print("gradients[\"dWaa\"][1][2] =", gradients["dWaa"][1][2])
print("np.argmax(gradients[\"dWax\"]) =", np.argmax(gradients["dWax"]))
print("gradients[\"dWya\"][1][2] =", gradients["dWya"][1][2])
print("gradients[\"db\"][4] =", gradients["db"][4])
print("gradients[\"dby\"][1] =", gradients["dby"][1])
print("a_last[4] =", a_last[4])
```

```
Loss = 126.50397572165382
gradients["dWaa"][1][2] = 0.19470931534716163
np.argmax(gradients["dWax"]) = 93
gradients["dWya"][1][2] = -0.007773876032002922
gradients["db"][4] = [-0.06809825]
gradients["dby"][1] = [0.01538192]
a_last[4] = [-1.]
```


**Expected output**:

Loss = 126.50397572165363

gradients["dWaa"][1][2] = 0.19470931534719205

np.argmax(gradients["dWax"]) = 93

gradients["dWya"][1][2] = -0.007773876032003275

gradients["db"][4] = [-0.06809825]

gradients["dby"][1] = [0.01538192]

a_last[4] = [-1.]

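#### 3.2 Training the model
Given the dataset of dinosaur names, we use each line of the dataset (one name) as one training example. Every 2000 steps of stochastic gradient descent, you will sample 7 names (the default `dino_names`) to see how the algorithm is doing. Remember to shuffle the dataset so that stochastic gradient descent visits the examples in random order.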

**Exercise**: Follow the instructions and implement `model()`. When `examples[index]` contains one dinosaur name (a string), you can create an example (X, Y) as follows:
```python
index = j % len(examples)
X = [None] + [char_to_ix[ch] for ch in examples[index]]
Y = X[1:] + [char_to_ix["\n"]]
```
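
To see what this construction produces, here is a worked example on a single illustrative name. The vocabulary is rebuilt locally so that the snippet runs on its own; the name `"trex"` is an assumption chosen for the example.

```python
import string

# Same 27-character vocabulary as above: '\n' is index 0, 'a'..'z' are 1..26.
chars = sorted(set(string.ascii_lowercase + "\n"))
char_to_ix = {ch: i for i, ch in enumerate(chars)}

name = "trex"  # illustrative training example
X = [None] + [char_to_ix[ch] for ch in name]   # None marks the zero-vector first input
Y = X[1:] + [char_to_ix["\n"]]                 # targets: X shifted left, ending with '\n'

print(X)  # [None, 20, 18, 5, 24]
print(Y)  # [20, 18, 5, 24, 0]
```

At each time step the target `Y[t]` is the character that follows the input `X[t]`, and the final target is the newline that ends the name.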

![image.png](attachment:image.png)

```python
# GRADED FUNCTION: model

def model(data, ix_to_char, char_to_ix, num_iterations = 35000, n_a = 50, dino_names = 7, vocab_size = 27):
    """
    Trains the model and generates dinosaur names. 
    
    Arguments:
    data -- text corpus
    ix_to_char -- dictionary that maps the index to a character
    char_to_ix -- dictionary that maps a character to an index
    num_iterations -- number of iterations to train the model for
    n_a -- number of units of the RNN cell
    dino_names -- number of dinosaur names you want to sample at each iteration. 
    vocab_size -- number of unique characters found in the text, size of the vocabulary
    
    Returns:
    parameters -- learned parameters
    """
    
    # Retrieve n_x and n_y from vocab_size
    n_x, n_y = vocab_size, vocab_size
    
    # Initialize parameters
    parameters = initialize_parameters(n_a, n_x, n_y)
    
    # Initialize loss (this is required because we want to smooth our loss, don't worry about it)
    loss = get_initial_loss(vocab_size, dino_names)
    
    # Build list of all dinosaur names (training examples).
    with open("dinos.txt") as f:
        examples = f.readlines()
    examples = [x.lower().strip() for x in examples]
    
    # Shuffle list of all dinosaur names
    shuffle(examples)
    
    # Initialize the hidden state of your RNN
    a_prev = np.zeros((n_a, 1))
    
    # Optimization loop
    for j in range(num_iterations):
        
        ### START CODE HERE ###
        
        # Use the hint above to define one training example (X,Y) (≈ 2 lines)
        index = j % len(examples)
        X = [None] + [char_to_ix[ch] for ch in examples[index]]
        Y = X[1:] + [char_to_ix["\n"]]
        
        # Perform one optimization step: Forward-prop -> Backward-prop -> Clip -> Update parameters
        # Choose a learning rate of 0.01
        curr_loss, gradients, a_prev = optimize(X, Y, a_prev, parameters, learning_rate=0.01)
        
        ### END CODE HERE ###
        
        # Use a latency trick to keep the loss smooth. It happens here to accelerate the training.
        loss = smooth(loss, curr_loss)

        # Every 2000 iterations, generate "n" characters thanks to sample() to check if the model is learning properly
        if j % 2000 == 0:
            
            print('Iteration: %d, Loss: %f' % (j, loss) + '\n')
            
            # The number of dinosaur names to print
            seed = 0
            for name in range(dino_names):
                
                # Sample indices and print them
                sampled_indices = sample(parameters, char_to_ix, seed)
                print_sample(sampled_indices, ix_to_char)
                
                seed += 1  # To get the same result for grading purposes, increment the seed by one. 
      
            print('\n')
        
    return parameters
```
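
The helpers `get_initial_loss` and `smooth` come from `utils` and are not shown in this document. A plausible form, offered here as an assumption rather than the actual implementation, is an exponential moving average of the per-example loss that starts from the loss of a uniform guess. The names `initial_loss_sketch` and `smooth_sketch` and the factor 0.999 are hypothetical.

```python
import numpy as np

def initial_loss_sketch(vocab_size, seq_length):
    """Loss of a model that predicts every character uniformly, over seq_length steps."""
    return float(-np.log(1.0 / vocab_size) * seq_length)

def smooth_sketch(loss, cur_loss, beta=0.999):
    """Exponential moving average: keep most of the old value, mix in a little of the new."""
    return beta * loss + (1 - beta) * cur_loss

loss = initial_loss_sketch(vocab_size=27, seq_length=7)
print(round(loss, 2))  # 23.07

# A single noisy example moves the smoothed value only slightly.
smoothed = smooth_sketch(loss, cur_loss=40.0)
print(smoothed - loss)
```

Smoothing of this kind makes the printed loss easier to read, because the loss of individual names fluctuates strongly from one step to the next.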


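Run the following cell. You should observe your model outputting random-looking characters at the first iteration. After a few thousand iterations, your model should learn to generate reasonable-looking names.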



```python
parameters = model(data, ix_to_char, char_to_ix)
```

```
Iteration: 0, Loss: 23.080748

Nkzxwtdmfqoeyhsqwasjkjvu
Kneb
Kzxwtdmfqoeyhsqwasjkjvu
Neb
Zxwtdmfqoeyhsqwasjkjvu
Eb
Xwtdmfqoeyhsqwasjkjvu


Iteration: 2000, Loss: 27.927396

Livtos
Hlda
Hvtos
Lca
Xusbbldorawerntaremeseantixcebeeluedeandaiqaurlonm
Ba
Tos


Iteration: 4000, Loss: 25.976786

Onyushandoravhusbatglesjanthylebegctaldancaipatong
Kmcaadosaurus
Lytrodonapgveoptarangus
Olaafton
Wusodolosaurus
Caadosaurus
Toranesaurus


Iteration: 6000, Loss: 24.823285

Optrrong
Indeceropa
Kyrronlesaurus
Olaamrraedosaurus
Xusrcheronus
Daalrradaus
Trrareramuruscareratraroptheandosauros


Iteration: 8000, Loss: 23.955594

Meusourus
Inabaeshacaropteraxamazaiosaurus
Ivptongarhavergrasaurus
Mabaeropa
Xusogiaopaveritatinashanpewda
Daacisaurus
Troenatopteritathorosaurus


Iteration: 10000, Loss: 23.689744

Mivusiangronvesaurus
Kiecamosaurus
Kytrogomiontesaurus
Mecaisheechuslelotieevalosaurus
Yusodonosaurus
Edaersa
Trodomiontesaurus


Iteration: 12000, Loss: 23.569251

Miutus
Ingaagosaurus
Kysusaurus
Meaa
```

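### Conclusion
You can see that toward the end of training your algorithm has started to generate plausible dinosaur names. At first it generated random characters, but by the end you can see dinosaur names with cool endings. Feel free to run the algorithm longer and tune the hyperparameters to see if you can get even better results. Our implementation generated some really cool names like "maconucon", "marloralus" and "macingsersaurus". Your model hopefully also learned that dinosaur names tend to end in saurus, don, aura, tor, etc.

If your model generates some non-cool names, don't blame the model entirely: not all actual dinosaur names sound cool. (For example, dromaeosauroides is an actual dinosaur name and is also in the training set.) But this model should give you a set of candidate names to pick from!

This assignment used a relatively small dataset, so that you could train an RNN quickly on a CPU. Training a model of the English language requires a much bigger dataset and usually much more computation, and can run for many hours even on GPUs. We ran our dinosaur name model for quite some time, and so far our favorite name is the great, undefeatable, and fierce: Mangosaurus!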

### 4 Writing like Shakespeare (optional)

Omitted.