# Neural-Network-Based Natural Language Processing

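### The traditional bag-of-words model

The bag-of-words (BoW) model is a text representation method used in natural language processing (NLP). Its core idea is to turn text into numeric feature vectors that machine learning algorithms can process. Modern NLP has largely moved on to richer representations, such as TF-IDF weighting and Word2Vec embeddings, but BoW remains one of the basic concepts for understanding text feature extraction. It works in two steps:

1. **Vocabulary construction**: Collect every distinct word that appears in the document collection into a vocabulary (also called a dictionary), and assign each word a unique index.

2. **Document vectorization**: Convert each document into a fixed-length vector whose length equals the vocabulary size. For each word in the vocabulary, count how many times it occurs in the document and store that count at the word's index.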
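The two steps above can be sketched without any library. This is a minimal illustration, not how `CountVectorizer` is implemented: it assumes lowercase whitespace tokenization, and the helper names `build_vocab` and `vectorize` and the two toy documents are made up for the example.

```python
from collections import Counter


def build_vocab(docs):
    """Step 1: map each distinct token to an index (sorted for reproducibility)."""
    tokens = {tok for doc in docs for tok in doc.lower().split()}
    return {tok: i for i, tok in enumerate(sorted(tokens))}


def vectorize(doc, vocab):
    """Step 2: count each vocabulary word in `doc`; unknown words are ignored."""
    counts = Counter(doc.lower().split())
    # Iterating over the dict follows insertion order, which matches the indices.
    return [counts.get(tok, 0) for tok in vocab]


# Toy documents, chosen only for illustration.
docs = ["the cat sat", "the cat saw the dog"]
vocab = build_vocab(docs)
vectors = [vectorize(doc, vocab) for doc in docs]

print(vocab)    # {'cat': 0, 'dog': 1, 'sat': 2, 'saw': 3, 'the': 4}
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Both vectors have the same length as the vocabulary, and swapping the word order inside a document would leave its vector unchanged.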

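### Characteristics

- **Orderless**: The bag-of-words model discards word order and keeps only how often each word occurs.
- **Sparse**: The vocabulary can be very large while each document contains only a small part of it, so the resulting vectors are usually sparse (most entries are zero).
- **High-dimensional**: Document vectors have one dimension per vocabulary word, which can lead to the "curse of dimensionality" and increase computation and memory costs.

### Advantages

- Simple to use: it is relatively easy to implement and quickly turns text into feature vectors that can be computed on.
- Widely applicable: it suits many text classification tasks, such as sentiment analysis and topic classification.

Although simple, the bag-of-words model is still effective in many practical applications, especially during initial exploration and rapid prototyping. Tasks that need to capture more semantic information usually call for more sophisticated models and techniques.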

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

# Document collection
corpus = [
    'This is the first document. This is the second document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Create a CountVectorizer object
vectorizer = CountVectorizer()

# Learn the vocabulary and transform the documents
X = vectorizer.fit_transform(corpus)

# Get the vocabulary
vocab = vectorizer.get_feature_names_out()
print("Vocabulary:", vocab)

# Show the document vectors
print("Document vectors:\n", X.toarray())

Vocabulary: ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
Document vectors:
 [[0 2 1 2 0 1 2 0 2]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


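## Word embedding

### Starting from one-hot encoding

In machine learning tasks, categorical features are usually converted to a one-hot encoding before being fed to a model, for two main reasons:

1. It avoids implying an order among categories. Encoding categories directly as numbers (for example 0, 1, 2) suggests that they can be ranked against each other, whereas categories are merely different types with no inherent order.

2. It gives each category its own dimension. One-hot encoding expands a categorical feature into separate 0/1 feature dimensions, so a linear model can learn an independent weight for every category and the categories become linearly separable.

Concretely, a feature with N categories is converted into an N-dimensional 0/1 vector, with a 1 at the index of the category and 0 everywhere else.

For example, one-hot encoding the color feature ["red", "green", "blue"] gives:

red -> [1, 0, 0]

green -> [0, 1, 0]

blue -> [0, 0, 1]

The model can then tell the colors apart without assuming any order among them. This is why one-hot encoding is commonly used for categorical features in machine learning models.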

In [2]:
import pandas as pd

df = pd.DataFrame([
    ['green', 'Chevrolet', 2017],
    ['blue', 'BMW', 2015],
    ['yellow', 'Lexus', 2018],
])
df.columns = ['color', 'make', 'year']
df

| | color | make | year |
|---|---|---|---|
| 0 | green | Chevrolet | 2017 |
| 1 | blue | BMW | 2015 |
| 2 | yellow | Lexus | 2018 |


In [4]:
df_processed = pd.get_dummies(df, columns=['color', 'make'], dtype=int)
df_processed

| | year | color_blue | color_green | color_yellow | make_BMW | make_Chevrolet | make_Lexus |
|---|---|---|---|---|---|---|---|
| 0 | 2017 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 2015 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 2018 | 0 | 0 | 1 | 0 | 0 | 1 |


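### How word embeddings work

Word embedding is an important technique in natural language processing. The idea is to map each word to a point in a continuous, low-dimensional vector space so that semantically similar words lie close to each other in that space.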
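One common way to measure how close two word vectors are is cosine similarity. The sketch below uses it as one possible choice of distance measure. The three 3-dimensional vectors are hypothetical and hand-picked for illustration; they do not come from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 means same direction, 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, not learned from data.
toy_embeddings = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "apple": [0.10, 0.20, 0.95],
}

sim_king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_king_apple = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])
print(f"king vs queen: {sim_king_queen:.3f}")
print(f"king vs apple: {sim_king_apple:.3f}")
```

With these toy vectors the first similarity is much higher than the second, which is the pattern a trained embedding is expected to show for related and unrelated words.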

https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html

https://blog.csdn.net/raelum/article/details/125462028

In [15]:
from IPython.display import Image
Image(url= "30.jpg",width=800)


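## Word embeddings in PyTorch: nn.Embedding

nn.Embedding and nn.Linear are both commonly used PyTorch layers. They differ mainly as follows:

1. nn.Embedding handles discrete inputs (integer indices); nn.Linear handles continuous-valued inputs.

2. nn.Embedding maps each integer index to a dense vector of fixed dimension by looking up a row of its weight matrix; nn.Linear applies the affine map $y = xW^T + b$ to its input.

3. The output dimension of nn.Embedding is `embedding_dim`, the number of columns of the embedding matrix; the output dimension of nn.Linear is its `out_features` argument.

4. The input to nn.Embedding is usually a tensor of word indices; the input to nn.Linear is a floating-point tensor whose last dimension equals `in_features`.

In short, nn.Embedding turns discrete indices into word vectors, while nn.Linear applies a linear mapping to continuous features. The two are often combined in NLP models, with an embedding layer feeding one or more linear layers.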



In [6]:

Image(url= "25.png",width=400)

In [44]:
Image(url= "24.webp",width=500)

In [6]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import math
from torch.autograd import Variable
import matplotlib.pyplot as plt
import numpy as np
import copy

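* A minimal implementation. PyTorch does not build one-hot vectors here; `nn.Embedding` looks up rows of its weight matrix by index, which gives the same result as multiplying a one-hot vector by that matrix.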
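The equivalence between an index lookup and a one-hot multiplication can be checked directly. This is a small sketch under arbitrary assumptions: the seed, the sizes (3 words, 5 dimensions), and the chosen indices are for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embedding_dim = 3, 5
embeds = nn.Embedding(vocab_size, embedding_dim)

indices = torch.tensor([0, 2], dtype=torch.long)

# Route 1: direct table lookup, which is what nn.Embedding does.
looked_up = embeds(indices)

# Route 2: explicit one-hot rows multiplied by the same weight matrix.
one_hot = F.one_hot(indices, num_classes=vocab_size).float()
multiplied = one_hot @ embeds.weight

print(one_hot)
print(torch.allclose(looked_up, multiplied))  # True
```

The lookup avoids materializing the one-hot matrix, which matters once the vocabulary has tens of thousands of entries.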

In [7]:
word_to_ix = {"hello": 0, "world": 1, "pytorch": 2}
embeds = nn.Embedding(3, 5)  # 3 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"], word_to_ix["world"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)

tensor([[-0.1604,  0.7571,  0.0956, -0.5087,  2.5918],
        [-0.6639,  0.1369, -0.1471, -0.6938,  0.7911]],
       grad_fn=<EmbeddingBackward0>)


In [13]:
torch.tensor([word_to_ix["hello"], word_to_ix["world"]], dtype=torch.long)

tensor([0, 1])

In [14]:
embeds.weight

Parameter containing:
tensor([[-0.1604,  0.7571,  0.0956, -0.5087,  2.5918],
        [-0.6639,  0.1369, -0.1471, -0.6938,  0.7911],
        [-0.8335, -2.7482, -1.6768,  0.1432,  0.3586]], requires_grad=True)

In [12]:
embeds(torch.tensor([word_to_ix["hello"], word_to_ix["world"]], dtype=torch.long))

tensor([[-0.1604,  0.7571,  0.0956, -0.5087,  2.5918],
        [-0.6639,  0.1369, -0.1471, -0.6938,  0.7911]],
       grad_fn=<EmbeddingBackward0>)

In [16]:
embeds(torch.tensor([word_to_ix["hello"], word_to_ix["world"]], dtype=torch.long)).view((1, -1))

tensor([[-0.1604,  0.7571,  0.0956, -0.5087,  2.5918, -0.6639,  0.1369, -0.1471,
         -0.6938,  0.7911]], grad_fn=<ViewBackward0>)

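In [33]:
# Output from an earlier run in which `embeds` had only 2 rows, so the values differ from the cells above
hello_embed.view((1, -1))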

tensor([[ 0.8123,  0.4575, -0.9634,  0.3268, -0.0092,  0.8424, -0.2896, -0.2440,
         -0.5013, -0.4043]], grad_fn=<ViewBackward0>)

In [32]:
embeds.weight

Parameter containing:
tensor([[ 0.8123,  0.4575, -0.9634,  0.3268, -0.0092],
        [ 0.8424, -0.2896, -0.2440, -0.5013, -0.4043]], requires_grad=True)

In [10]:
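# nn.Embedding
embed = nn.Embedding(10, 64)  # vocabulary of 10 words, 64-dimensional embeddings
word_idxs = torch.LongTensor([1, 5, 8])  # input word indices
embed_vector = embed(word_idxs)  # map the word indices to word vectors, shape (3, 64)

# nn.Linear
fc = nn.Linear(32, 10)  # 32 input features, 10 output features
features = torch.randn(8, 32)  # 8 inputs with 32 features each
output = fc(features)  # the fully connected layer maps them to shape (8, 10)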

In [23]:
corpus = ["he is an old worker", "english is a useful tool", "the cinema is far away"]
word_list=[]
for i in corpus:
    #print(i)
    for j in i.split():
        word_list.append(j)
word_set=set(word_list)
print(word_set)

{'tool', 'the', 'away', 'useful', 'worker', 'a', 'an', 'old', 'english', 'he', 'is', 'far', 'cinema'}


In [25]:
word_to_ix = {}
for i,j in enumerate(word_set):
    word_to_ix.update({j:i})
word_to_ix

{'tool': 0,
 'the': 1,
 'away': 2,
 'useful': 3,
 'worker': 4,
 'a': 5,
 'an': 6,
 'old': 7,
 'english': 8,
 'he': 9,
 'is': 10,
 'far': 11,
 'cinema': 12}

### From corpus to features

In [26]:
# Define a corpus
corpus = ["he is an old worker", "english is a useful tool", "the cinema is far away"]
word_list = []
for i in corpus:
    # Split each sentence into words
    for j in i.split():
        # Append the word to the list
        word_list.append(j)
# Convert the list to a set to remove duplicate words
word_set = set(word_list)
# Map each word to an index
word_to_ix = {word: i for i, word in enumerate(word_set)}
word_to_ix

# Define an embedding layer: one row per word, 5 dimensions per embedding
embeds = nn.Embedding(len(word_set), 5)
# Tensor holding the indices of the words to look up
lookup_tensor = torch.tensor([word_to_ix["an"], word_to_ix["he"], word_to_ix["he"]], dtype=torch.long)
# Look up the embedding vectors of those words
hello_embed = embeds(lookup_tensor)
print(hello_embed)
# Flatten the embedding vectors into a single row
print(hello_embed.view((1, -1)))

tensor([[ 1.5263, -0.4624, -0.4639,  2.2752,  0.1722],
        [-0.1584, -0.3063, -0.8228, -0.3899,  0.8475],
        [-0.1584, -0.3063, -0.8228, -0.3899,  0.8475]],
       grad_fn=<EmbeddingBackward0>)
tensor([[ 1.5263, -0.4624, -0.4639,  2.2752,  0.1722, -0.1584, -0.3063, -0.8228,
         -0.3899,  0.8475, -0.1584, -0.3063, -0.8228, -0.3899,  0.8475]],
       grad_fn=<ViewBackward0>)


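## Classwork 1

* Define a DataFrame of products with fields such as product name, price, and category (choose the values yourself), one-hot encode the category field, and print the encoded DataFrame.

* Complete the corpus-to-features conversion above: (1) build the vocabulary and the index table; (2) define an embedding layer, feed in a feature made of several words, and print the embedding vectors of that feature.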

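## N-gram language models

The N-gram language model is an important language model in natural language processing. It models language through the probabilities of runs of N consecutive words in a sequence.

Specifically, an N-gram model assumes that a word depends only on the N-1 words before it. For example, in a bigram (N=2) model, the conditional probability of word $w_i$ is:

$P(w_i|w_{i-1})$

In a trigram (N=3) model, the conditional probability of word $w_i$ is:

$P(w_i|w_{i-1},w_{i-2})$

In general, the conditional probability of word $w_i$ in an N-gram model is:

$P(w_i|w_{i-1},...,w_{i-N+1})$

By the chain rule, the joint probability of a sequence is the product of each word's conditional probability given all the words before it. Applying the N-gram assumption to every factor gives the approximation:

$P(w_1, ..., w_M) \approx \prod_{i=1}^{M} P(w_i|w_{i-1},...,w_{i-N+1})$

where M is the length of the word sequence.

An N-gram model estimates the conditional probability $P(w_i|w_{i-1},...,w_{i-N+1})$ by counting how often N words occur together in a corpus. The maximum likelihood estimate is the count of the N-gram divided by the count of its (N-1)-word history. Because many valid N-grams never appear in the training corpus, smoothing techniques are usually added to handle this data sparsity.

N-gram models are simple to implement and capture local word-sequence patterns effectively, but they cannot capture long-range dependencies.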
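The count-based estimate can be sketched for the bigram case. This is an illustration only: the toy sentence is made up, and add-one (Laplace) smoothing is used as one simple example of a smoothing technique, not as the method of any particular toolkit.

```python
from collections import Counter

# Toy corpus, chosen only for illustration.
tokens = "the cat sat on the mat the cat ran".split()
vocab = sorted(set(tokens))

# How often each word appears as a history (the last token has no successor).
history_counts = Counter(tokens[:-1])
# How often each (previous word, current word) pair appears.
bigram_counts = Counter(zip(tokens, tokens[1:]))


def p_mle(word, prev):
    """Maximum likelihood estimate: count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / history_counts[prev]


def p_laplace(word, prev):
    """Add-one smoothing: every bigram gets one extra pseudo-count."""
    return (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + len(vocab))


print(p_mle("cat", "the"))      # 2/3: "the cat" occurs twice, "the" three times
print(p_mle("sat", "the"))      # 0.0: "the sat" never occurs
print(p_laplace("sat", "the"))  # 1/9: unseen bigram now has non-zero probability
```

The unsmoothed estimate assigns probability zero to any bigram missing from the corpus, which would zero out the probability of a whole sentence containing it. Smoothing trades a little probability mass from seen bigrams to avoid that.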

### Data preprocessing for the language model

In [22]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
torch.manual_seed(1)
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()

ngrams = [
    ([test_sentence[i - j - 1] for j in range(CONTEXT_SIZE)],test_sentence[i])
    for i in range(CONTEXT_SIZE, len(test_sentence))
]
print(ngrams[:3])

[(['forty', 'When'], 'winters'), (['winters', 'forty'], 'shall'), (['shall', 'winters'], 'besiege')]


In [23]:
ngrams[:3]

[(['forty', 'When'], 'winters'),
 (['winters', 'forty'], 'shall'),
 (['shall', 'winters'], 'besiege')]

In [25]:
vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}

In [26]:
word_to_ix

{'say,': 0,
 'winters': 1,
 'shame,': 2,
 'This': 3,
 'brow,': 4,
 'fair': 5,
 'his': 6,
 'the': 7,
 'And': 8,
 'Thy': 9,
 'much': 10,
 'asked,': 11,
 'Shall': 12,
 'deep': 13,
 'thy': 14,
 'own': 15,
 'so': 16,
 'couldst': 17,
 'all-eating': 18,
 'thine!': 19,
 'where': 20,
 'thriftless': 21,
 "feel'st": 22,
 'it': 23,
 'field,': 24,
 'answer': 25,
 'trenches': 26,
 'count,': 27,
 'worth': 28,
 'old': 29,
 'beauty': 30,
 'all': 31,
 'see': 32,
 'being': 33,
 "totter'd": 34,
 'Were': 35,
 'blood': 36,
 'by': 37,
 'proud': 38,
 'lusty': 39,
 'eyes,': 40,
 'within': 41,
 'weed': 42,
 "deserv'd": 43,
 'an': 44,
 'old,': 45,
 'of': 46,
 'when': 47,
 'livery': 48,
 'use,': 49,
 'succession': 50,
 'made': 51,
 "excuse,'": 52,
 'besiege': 53,
 'Where': 54,
 'held:': 55,
 'Proving': 56,
 'gazed': 57,
 'were': 58,
 'days;': 59,
 'lies,': 60,
 'treasure': 61,
 'thou': 62,
 'my': 63,
 'praise': 64,
 "beauty's": 65,
 'cold.': 66,
 'forty': 67,
 "'This": 68,
 'art': 69,
 'new': 70,
 'a': 71,
 'Will

In [36]:
log_probs = torch.randn(3, 2).log_softmax(dim=1)

In [37]:
log_probs

tensor([[-0.4371, -1.0383],
        [-0.1610, -1.9057],
        [-0.7871, -0.6073]])

In [24]:
ngrams[0]

(['forty', 'When'], 'winters')

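### The NLLLoss loss function

`nn.NLLLoss` is a PyTorch loss function that computes the negative log-likelihood (NLL) of the model output. It is typically used for classification, in particular when the model's output layer applies log-softmax (`log_softmax`).

Given a sample $x$ with correct label $y$, let $p$ be the model output after `log_softmax`, that is, a vector of log-probabilities. For a classification task with $C$ classes, the $i$-th element $p_i$ is the logarithm of the predicted probability of class $i$. The negative log-likelihood loss $L$ is then the negated log-probability of the correct class:

$$ L(x, y) = -p_y $$

For a batch of data, the loss is either the mean of the per-sample losses (mean reduction) or their sum (sum reduction):

$$ L_{\text{batch}} = \frac{1}{N} \sum_{n=1}^{N} L(x_n, y_n) \quad \text{(mean reduction)} $$

or

$$ L_{\text{batch}} = \sum_{n=1}^{N} L(x_n, y_n) \quad \text{(sum reduction)} $$

where $N$ is the batch size.

In the example below:
- `log_probs` is a tensor of shape `(3, 2)`: a batch of three samples, each with log-probabilities over two classes.
- `targets` is a one-dimensional tensor of shape `(3,)` holding the true label index of each sample.
- `loss.item()` returns the loss as a Python scalar.

This simple example shows how to create an `NLLLoss` object and use it to compute the loss for a classification task.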

In [38]:
import torch
import torch.nn as nn

# Create an NLLLoss object
criterion = nn.NLLLoss()

# Assume two classes, i.e. C = 2
# Create a random 3x2 tensor that simulates a batch of outputs after log_softmax
log_probs = torch.randn(3, 2).log_softmax(dim=1)

# Create a tensor with the true label of each sample
targets = torch.tensor([1, 0, 1], dtype=torch.long)  # 0 and 1 are class indices

# Compute the loss
loss = criterion(log_probs, targets)

# Print the loss value
print("Loss:", loss.item())

Loss: 0.5339012742042542
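The mean and sum reductions can also be checked by hand. The sketch below picks the log-probability of the correct class for each sample, negates it, and compares the result with `nn.NLLLoss`. The seed is an arbitrary assumption, so the numbers will not match the output above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
log_probs = torch.randn(3, 2).log_softmax(dim=1)
targets = torch.tensor([1, 0, 1], dtype=torch.long)

# Per-sample loss: the negated log-probability of the correct class.
per_sample = -log_probs[torch.arange(len(targets)), targets]

manual_mean = per_sample.mean()
manual_sum = per_sample.sum()

builtin_mean = nn.NLLLoss(reduction="mean")(log_probs, targets)
builtin_sum = nn.NLLLoss(reduction="sum")(log_probs, targets)

print(torch.allclose(manual_mean, builtin_mean))  # True
print(torch.allclose(manual_sum, builtin_sum))    # True
```

`reduction="mean"` is the default, which is why `nn.NLLLoss()` with no arguments returns the batch average.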


### Implementing the model in code

In [28]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
torch.manual_seed(1)
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.
# Each tuple is ([ word_i-CONTEXT_SIZE, ..., word_i-1 ], target word)
ngrams = [
    (
        [test_sentence[i - j - 1] for j in range(CONTEXT_SIZE)],
        test_sentence[i]
    )
    for i in range(CONTEXT_SIZE, len(test_sentence))
]
# Print the first 3, just so you can see what they look like.
print(ngrams[:3])

vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(20):
    total_loss = 0
    for context, target in ngrams:
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)
        model.zero_grad()
        log_probs = model(context_idxs)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)  # The loss decreased every iteration over the training data!

# To get the embedding of a particular word, e.g. "beauty"
print(model.embeddings.weight[word_to_ix["beauty"]])

[(['forty', 'When'], 'winters'), (['winters', 'forty'], 'shall'), (['shall', 'winters'], 'besiege')]
[523.6641716957092, 520.8901543617249, 518.1371247768402, 515.4032278060913, 512.6889622211456, 509.9923801422119, 507.3103280067444, 504.6435902118683, 501.99337005615234, 499.3561701774597, 496.73104524612427, 494.1173939704895, 491.5132586956024, 488.91930198669434, 486.3346767425537, 483.75973081588745, 481.193528175354, 478.6359317302704, 476.0870225429535, 473.54870867729187]
tensor([ 0.7903,  1.3658, -0.8506,  0.5156,  1.0474, -0.3156,  0.1405,  2.3403,
        -0.6116,  0.8145], grad_fn=<SelectBackward0>)


In [36]:
for context,target in ngrams:
    print(context,target)
    break

['forty', 'When'] winters


In [17]:
torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

tensor([35,  8, 30, 33])

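### Classwork 2

1. Given the test_sentence corpus, build the vocabulary and the index table, then construct the N-gram feature matrix and its target vector.

2. Train the neural network for the N-gram model.

3. Complete the data preprocessing and training for the CBOW model below.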

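## CBOW (Continuous Bag of Words) and skip-gram

CBOW (Continuous Bag of Words) and skip-gram are both algorithms for training word vectors. Their main differences are:

**CBOW**

- The input is the context words; the output is the target word.
- The projection from the input layer to the hidden layer is a continuous bag of words: the context word vectors are combined without regard to their order.
- The objective is to predict the current word from its context.

**skip-gram**

- The input is the center word; the output is the context words.
- The input vector of the center word is mapped to scores over the output vectors of candidate context words.
- The objective is to predict the context from the current word.

More specifically:

**CBOW**

Given a word sequence $(w_1, w_2, w_3, ..., w_T)$, CBOW maximizes the probability of the current word $w_t$ given its context words:

$P(w_t|w_{t-k},...,w_{t-1},w_{t+1},...,w_{t+k})$

where $w_{t-k},...,w_{t-1},w_{t+1},...,w_{t+k}$ are the words in a window of $k$ positions on each side of $w_t$.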
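The "bag" in CBOW can be made concrete by averaging the context embeddings before scoring the vocabulary. The class below is a minimal sketch under that assumption; the class name `CBOW`, the sizes, and the index values are hypothetical and chosen only for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CBOW(nn.Module):
    """Minimal CBOW sketch: average the context embeddings, then score every word."""

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.output = nn.Linear(embedding_dim, vocab_size)

    def forward(self, context_idxs):
        # context_idxs: (batch, 2k) integer tensor of context word indices
        pooled = self.embeddings(context_idxs).mean(dim=1)  # (batch, embedding_dim)
        return F.log_softmax(self.output(pooled), dim=1)    # (batch, vocab_size)


torch.manual_seed(0)
model = CBOW(vocab_size=20, embedding_dim=8)

context = torch.tensor([[3, 7, 11, 2]])   # one example with window k = 2
shuffled = torch.tensor([[11, 2, 3, 7]])  # the same words in a different order

log_probs = model(context)
print(log_probs.shape)  # torch.Size([1, 20])
print(torch.allclose(log_probs, model(shuffled), atol=1e-6))  # True
```

Because the embeddings are averaged, reordering the context words does not change the prediction. A model that concatenates the context embeddings instead would be sensitive to word order.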

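**skip-gram**

Given a word sequence $(w_1, w_2, w_3, ..., w_T)$, skip-gram maximizes the probability of each context word $w_j$ given the current word $w_t$:

$P(w_j|w_t)$

where $w_j$ is a word inside the context window of $w_t$.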
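Skip-gram training data consists of (center word, context word) pairs. The helper below is a sketch of how such pairs could be generated. The function name `skipgram_pairs` is hypothetical, and it assumes that windows are simply truncated at the edges of the sequence rather than dropping edge positions.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every neighbour within `window` positions."""
    pairs = []
    for t, center in enumerate(tokens):
        start = max(0, t - window)
        stop = min(len(tokens), t + window + 1)
        for j in range(start, stop):
            if j != t:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "we are about to study".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
# [('we', 'are'), ('are', 'we'), ('are', 'about'), ('about', 'are'),
#  ('about', 'to'), ('to', 'about'), ('to', 'study'), ('study', 'to')]
```

Each pair is one training example: the model receives the first element and is trained to assign high probability to the second.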

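CBOW predicts the current word from its context, so each training step pools information from the whole window. Skip-gram predicts the context from the current word, so each (center word, context word) pair is a separate training signal for the center word's vector. Both learn word vectors that capture semantic information, from complementary directions.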

In [19]:
Image(url= "27.png",width=800)

In [29]:
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(CONTEXT_SIZE, len(raw_text) - CONTEXT_SIZE):
    context = (
        [raw_text[i - j - 1] for j in range(CONTEXT_SIZE)]
        + [raw_text[i + j + 1] for j in range(CONTEXT_SIZE)]
    )
    target = raw_text[i]
    data.append((context, target))
print(data[:5])


[(['are', 'We', 'to', 'study'], 'about'), (['about', 'are', 'study', 'the'], 'to'), (['to', 'about', 'the', 'idea'], 'study'), (['study', 'to', 'idea', 'of'], 'the'), (['the', 'study', 'of', 'a'], 'idea')]


In [19]:
context_idxs

tensor([41, 15,  5, 27])

In [21]:
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
EMBEDDING_DIM = 10  # same value as in the N-gram cell, repeated so this cell runs on its own
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(CONTEXT_SIZE, len(raw_text) - CONTEXT_SIZE):
    context = (
        [raw_text[i - j - 1] for j in range(CONTEXT_SIZE)]
        + [raw_text[i + j + 1] for j in range(CONTEXT_SIZE)]
    )
    target = raw_text[i]
    data.append((context, target))

ngrams=data

class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE*2)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in ngrams:
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)
        model.zero_grad()
        log_probs = model(context_idxs)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)  # The loss decreased every iteration over the training data!

# Print the full embedding matrix (one row per vocabulary word)
print(model.embeddings.weight)

[226.42033410072327, 224.91781544685364, 223.42698192596436, 221.94492840766907, 220.47052597999573, 219.00309443473816, 217.54419946670532, 216.09297513961792, 214.6481363773346, 213.210031747818]
Parameter containing:
tensor([[-5.8585e-01, -1.3563e+00,  6.6131e-01,  2.8232e-02,  6.3295e-01,
          1.0595e+00,  1.0390e+00, -6.1170e-02,  7.2656e-01,  1.3601e-01],
        [-1.4624e+00, -1.0491e-01,  2.4525e-01,  1.8917e+00, -1.5939e-01,
          4.2361e-01,  3.2680e-01, -1.3162e-01,  6.4910e-01, -1.6663e+00],
        [ 5.7637e-01,  8.9270e-01, -1.2334e+00,  1.4321e+00, -1.0238e+00,
         -1.3552e+00,  6.8550e-01,  4.5344e-01, -6.2759e-01, -3.5659e-01],
        [-6.8798e-01,  2.1950e+00,  1.6118e+00, -9.2169e-01,  1.4742e+00,
          2.0857e+00,  7.5543e-01,  9.2551e-01,  1.6955e+00, -5.4772e-01],
        [ 1.0931e+00,  1.2244e+00, -5.8556e-01, -9.4666e-01, -7.2124e-01,
         -3.4694e-01, -2.8819e+00, -3.9432e-01,  4.3597e-02, -9.7642e-01],
        [-6.5408e-01,  6.8744e-01, 