# Preface

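Compared with CNNs, RNNs and LSTMs are used more often for text problems, so they are best studied alongside NLP.

For NLP tutorials, see the contents of the NLP folder.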


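The most important concept in NLP is context: a word's meaning, its synonyms, and its antonyms can all be inferred from the words around it.

Many different methods have been developed to represent context.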
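
As a minimal sketch of what "context" means in practice, the snippet below pairs each word with its neighbours inside a fixed window. The function name `context_windows` and the `window_size` parameter are illustrative choices, not something defined elsewhere in this notebook.

```python
def context_windows(tokens, window_size=1):
    """Pair each token with the tokens at most `window_size` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window_size):i]     # neighbours before the word
        right = tokens[i + 1:i + 1 + window_size]    # neighbours after the word
        pairs.append((center, left + right))
    return pairs


tokens = "you say goodbye and i say hello".split()
windows = context_windows(tokens, window_size=1)
for center, context in windows:
    print(center, "->", context)
```

Both occurrences of `say` appear with different neighbours, which is the kind of information the methods discussed here try to capture.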

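NLP background

Deep-learning-based NLP and count-based NLP differ in how they turn words into vectors, but both rest on the probability of a word appearing in a given context.

Only the one-hot and tf-idf representations of words are considered here.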
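
The notebook only shows one-hot encoding in code, so here is a small, hedged sketch of tf-idf for comparison. It assumes the plain textbook definition (term frequency times `log(N / df)`); library implementations such as scikit-learn's `TfidfVectorizer` use smoothed variants, so their numbers will differ. The helper name `tf_idf` and the two toy documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "you say goodbye".split(),
    "i say hello".split(),
]
scores = tf_idf(docs)
print(scores[0])
```

A word that occurs in every document (`say`) gets weight 0, while words unique to one document get a positive weight.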


---



Word2Vec

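word2vec can be regarded as an unsupervised learning method.

In essence, word2vec is also a neural network model. Its structure is no different from the earlier networks (for numeric data and for images), and it likewise involves two basic steps: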


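>Model training: input $\rightarrow$ forward() prediction $\rightarrow$ predicted value $\rightarrow$ minimize loss(predicted value, actual value) $\rightarrow$ backward() backpropagation $\rightarrow$ update weights and biases $\rightarrow$ optimized model

>Model application: input $\rightarrow$ optimized model forward() prediction $\rightarrow$ predicted value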
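
The two steps above can be sketched with a tiny linear model in NumPy. This is only an illustration of the forward / loss / backward / update cycle under made-up data (`X`, `true_w`); it is not the word2vec model itself.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))               # inputs: 32 instances, 3 attributes
true_w = np.array([[1.0], [-2.0], [0.5]])  # made-up target weights
y = X @ true_w                             # actual values

w = np.zeros((3, 1))                       # weights
b = np.zeros((1, 1))                       # bias
learning_rate = 0.1
losses = []
for step in range(200):
    pred = X @ w + b                       # forward(): predicted values
    loss = np.mean((pred - y) ** 2)        # loss(predicted value, actual value)
    losses.append(loss)
    grad_pred = 2 * (pred - y) / len(X)    # backward(): gradient of the loss w.r.t. pred
    grad_w = X.T @ grad_pred
    grad_b = grad_pred.sum(axis=0, keepdims=True)
    w -= learning_rate * grad_w            # update weights and bias
    b -= learning_rate * grad_b

# Model application: a forward pass with the optimized parameters
new_x = np.array([[1.0, 1.0, 1.0]])
print("final loss:", losses[-1])
print("prediction for new input:", new_x @ w + b)
```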

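The difference lies in the form of the inputs and outputs. In summary:
* Numeric data (multivariate function):
  * Input: matrix [rows = instances, columns = attributes] (without mini-batches); third-order tensor [mini-batch, rows = instances, columns = attributes]
  * Output: a single value (fully connected layer, out=1)
* Images:
  * Input: fourth-order tensor [mini-batch, channel, H, W]
  * Output: n-class classification (softmax layer, out=n)
* Text:
  * Input: matrix [mini-batch, vocabulary size] of one-hot word vectors
  * Output: one probability per word in the vocabulary (softmax layer, out = vocabulary size)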
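
To make the shapes concrete, the sketch below builds empty arrays with each layout. All sizes are arbitrary, and the text case assumes one-hot word vectors as used in the examples later in this notebook.

```python
import numpy as np

batch_size = 8

# Numeric data: 100 instances with 5 attributes, then split into 10 mini-batches
numeric = np.zeros((100, 5))
numeric_batches = numeric.reshape(10, 10, 5)   # [mini-batch, instances, attributes]

# Images: a mini-batch of 3-channel 32x32 images
images = np.zeros((batch_size, 3, 32, 32))     # [mini-batch, channel, H, W]

# Text: a mini-batch of word indices and the equivalent one-hot rows
vocab_size = 6
word_ids = np.array([5, 4, 1, 0, 3, 4, 2, 5])
one_hot = np.eye(vocab_size)[word_ids]         # [mini-batch, vocabulary size]

print(numeric_batches.shape, images.shape, one_hot.shape)
```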


---

One-hot encoding can be done directly with sklearn or with a custom implementation.

Many word2vec implementations exist; here it is implemented with PyTorch.

---

  

In [2]:
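import numpy as np
import pandas as pd
# import jieba  # only needed for the commented-out Chinese example below


def token2onehot(words) -> pd.DataFrame:
    """Build a one-hot matrix: one row per dictionary word, one column per token."""
    words_set = sorted(set(words))
    print("Token list converted to a set (deduplicated) and sorted:", words_set)
    diction = {}
    for index, value in enumerate(words_set):
        diction[index] = value
    print("Local dictionary after conversion:", diction)

    column = len(words)   # number of tokens in the sentence
    row = len(diction)    # number of distinct words
    onehotMatrix = np.zeros((row, column), dtype=float)
    print("one-hot matrix shape:", onehotMatrix.shape)
    for i in range(len(words)):
        for j in range(len(diction)):
            if words[i] == diction[j]:
                onehotMatrix[j, i] = 1
    df = pd.DataFrame(onehotMatrix)
    df.columns = words
    return df


if __name__ == "__main__":
    print("English one-hot; the dictionary words come from the original text")
    sents = "you say goodbye and i say hello."
    words = sents.split()
    df = token2onehot(words)
    # print(df)
    # print("Chinese one-hot; the dictionary words come from the original text")
    # sents = "中国国家统计局15日公布的70个大中城市房价数据显示"
    # words = list(jieba.cut(sents))
    # df2 = token2onehot(words)
    # print(df2)

English one-hot; the dictionary words come from the original text
Token list converted to a set (deduplicated) and sorted: ['and', 'goodbye', 'hello.', 'i', 'say', 'you']
Local dictionary after conversion: {0: 'and', 1: 'goodbye', 2: 'hello.', 3: 'i', 4: 'say', 5: 'you'}
one-hot matrix shape: (6, 7)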


1. One-hot encoding with sklearn

[onehot features](https://www.youtube.com/watch?v=NxLfpcfGzns&feature=youtu.be)
[onehot sklearn](https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/)

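2. Implement a basic word2vec with PyTorch, mirroring the NumPy implementation in the book

   1. Build the model
   2. Initialize the weights
   3. Feed the one-hot vectors into the model and run forward

3. Initialize the weights (randomly) to implement the embedding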

In [41]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
from scipy.special import softmax


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


if __name__ == '__main__':
    # The encoder returns a sparse matrix by default; ask for a dense array instead.
    # scikit-learn >= 1.2 names this argument sparse_output; older versions use sparse=False.
    enc = OneHotEncoder(sparse_output=False)
    print("English one-hot; the dictionary words come from the original text")
    sents = "you say goodbye and i say hello."
    words = sents.split()
    print(type(words))
    words = np.array(words).reshape(-1, 1)  # convert to a column matrix
    print(words)
    one_hot_word = enc.fit_transform(words)  # one-hot conversion
    print(one_hot_word.shape)
    print("one-hot dictionary positions and the corresponding feature names")
    feature_names = enc.get_feature_names_out()
    print(feature_names)
    # Initialize the weights and multiply through: forward pass only, no backward pass
    word_dim = 5  # size of the first layer

    W1 = np.random.rand(len(feature_names), word_dim)

    W2 = np.random.rand(word_dim, len(feature_names))
    print("Weight matrix shape:", W1.shape)
    # Take the first word of the sentence
    first_word = one_hot_word[0, :]
    embedding = first_word @ W1
    s1 = sigmoid(embedding)  # activation of the first layer (computed but not used below)
    embedding2 = embedding @ W2  # last layer
    s2 = sigmoid(embedding2)  # activation
    print("First hidden layer shape:", embedding.shape)
    print("Second hidden layer shape:", embedding2.shape)
    print(s2)
    out = softmax(s2)
    print(out)
    # Turn the most probable entry back into a one-hot vector
    max_index = np.argmax(out)
    one_hot = np.zeros_like(out)
    one_hot[max_index] = 1
    print(one_hot)

English one-hot; the dictionary words come from the original text
<class 'list'>
[['you']
 ['say']
 ['goodbye']
 ['and']
 ['i']
 ['say']
 ['hello.']]
(7, 6)
one-hot dictionary positions and the corresponding feature names
['x0_and' 'x0_goodbye' 'x0_hello.' 'x0_i' 'x0_say' 'x0_you']
Weight matrix shape: (6, 5)
First hidden layer shape: (5,)
Second hidden layer shape: (6,)
[0.68164513 0.67242392 0.73547645 0.71407802 0.69188172 0.77509685]
[0.16162037 0.16013688 0.17055903 0.16694811 0.1632833  0.17745231]
[0. 0. 0. 0. 0. 1.]




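Once every word in the sentence has a one-hot representation, it can be fed into a pretrained model to infer the word's context (the surrounding words).

1. The input is a one-hot vector of length $n$, where $n$ is the number of distinct words in the sentence (the dictionary size); each element stands for one word
2. It passes through the network, typically a fully connected layer followed by a softmax layer
3. The output is a vector of the same length $n$; each value is the probability that the corresponding word is a context word of the input word
4. Taking the max gives the most probable word, which completes the inference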



 <img src="figs\word embedding 1.jpg" height="50%" width="50%">

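As the figure shows, the basic structure of word2vec does not differ much from an ordinary neural network.

Suppose a sentence has length $l$ and its dictionary (lexicon) has length $n$. The lexicon removes the words repeated in the sentence and may be sorted alphabetically.

With one-hot encoding, each word in the sentence corresponds to a vector of length $n$, with a 1 at the word's position in the lexicon and 0 everywhere else.

The whole sentence can then be represented as an $l \times n$ matrix, one row per word.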

---

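The goal is now to predict the word that follows a given word: given the word at position $w_i$, predict the word at position $w_{i+1}$.

Assume the network has already been trained, so its weights are optimized. The forward pass is then:

1. The one-hot input vector $A_{1 \times n}$
2. The weight matrix of the fully connected layer, $W_{n \times m}$, where $m$ is the embedding dimension
3. The intermediate result $Z_{ 1 \times m}=A_{1 \times n} \cdot W_{n \times m}$, i.e. the embedding layer
4. Finally, the softmax output layer produces a vector of length $n$, with one element per lexicon entry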
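
The four steps above can be sketched in NumPy as follows. The weights are random rather than trained, and the second weight matrix `W_out` (shape $m \times n$) is an assumption added here so that the softmax layer has something to act on.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 3                          # lexicon size, embedding dimension
W = rng.normal(size=(n, m))          # first-layer weights, W_{n x m}
W_out = rng.normal(size=(m, n))      # assumed output-layer weights, m x n

word_index = 4
A = np.zeros((1, n))
A[0, word_index] = 1.0               # one-hot input, A_{1 x n}

Z = A @ W                            # embedding, Z_{1 x m}
scores = Z @ W_out                   # one score per lexicon entry, 1 x n
exp_scores = np.exp(scores - scores.max())
probs = exp_scores / exp_scores.sum()    # softmax
predicted_index = int(np.argmax(probs))

print("Z equals row", word_index, "of W:", np.allclose(Z[0], W[word_index]))
print("predicted lexicon index:", predicted_index)
```

Because $A$ is one-hot, the product $A \cdot W$ simply selects one row of $W$, which is why that row can be read as the word's embedding.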

---

The shapes of the layers in Word2Vec depend on the vocabulary size and the size of the hidden layer. Assuming a vocabulary size of V and a hidden layer size of N, the shapes of the layers in both the Continuous Bag of Words (CBOW) and Skip-gram models are as follows:

Input layer: The input layer is a one-hot encoded vector representing a word in the vocabulary. Its shape is (V, 1), where V is the size of the vocabulary.

Hidden layer: The hidden layer contains the word embeddings, which are the numerical representations of the input words. Its shape is (N, 1), where N is the size of the hidden layer.

Output layer: The output layer is a softmax function that produces the probability distribution of the words in the vocabulary given the context words or the target word. Its shape is (V, 1), where V is the size of the vocabulary.

During training, the weights of the neural network are adjusted using backpropagation to minimize the negative log-likelihood loss function of the output layer. The weight matrix connecting the input layer to the hidden layer has a shape of (N, V), and the weight matrix connecting the hidden layer to the output layer has a shape of (V, N).
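
As a hedged illustration of the column-vector shapes and the negative log-likelihood mentioned above, the sketch below uses random weights and arbitrary word indices (`input_index`, `target_index`); none of these values come from a trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
V, N = 8, 4                           # vocabulary size, hidden layer size
W_in = rng.normal(size=(N, V))        # input -> hidden weights, shape (N, V)
W_out = rng.normal(size=(V, N))       # hidden -> output weights, shape (V, N)

input_index, target_index = 3, 5      # arbitrary word indices
x = np.zeros((V, 1))
x[input_index, 0] = 1.0               # one-hot input, shape (V, 1)

h = W_in @ x                          # hidden layer, shape (N, 1)
u = W_out @ h                         # scores, shape (V, 1)
p = np.exp(u - u.max())
p /= p.sum()                          # softmax probabilities, shape (V, 1)

nll = -np.log(p[target_index, 0])     # negative log-likelihood of the target word

# Gradient of the loss with respect to the scores: softmax output minus one-hot target
target = np.zeros((V, 1))
target[target_index, 0] = 1.0
grad_u = p - target

print(h.shape, u.shape, p.shape, float(nll))
```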

In [4]:
# define training data
import torch
corpus = [
    'the quick brown fox',
    'jumped over the lazy dog'
]
tokens = []
for sentence in corpus:
    tokens.extend(sentence.split())
# sort the vocabulary so that the word indices are the same on every run
word2idx = {w: i for i, w in enumerate(sorted(set(tokens)))}
idx2word = {i: w for w, i in word2idx.items()}
data = torch.tensor([word2idx[w] for w in tokens], dtype=torch.long)
print(word2idx)
print(data)

{'brown': 0, 'dog': 1, 'fox': 2, 'jumped': 3, 'lazy': 4, 'over': 5, 'quick': 6, 'the': 7}
tensor([7, 6, 0, 2, 3, 5, 7, 4, 1])


In the Word2Vec model, the size of the input vector is typically equal to the size of the vocabulary, which is the total number of unique words in the corpus. Each word in the vocabulary is assigned a unique index, and the input vector is a one-hot encoded vector of size vocab_size, where the value at the index corresponding to the current word is 1 and all other values are 0.

For example, if the vocabulary contains 10,000 unique words, the input vector for a given word is a one-hot vector of size 10,000. Such vectors are large and sparse, so in practice the one-hot product is replaced by looking up the matching row of the embedding matrix. Computing a softmax over the whole vocabulary at every step is also expensive, so techniques such as subsampling of frequent words and negative sampling are used to improve training efficiency while preserving the quality of the word embeddings.

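
A rough sketch of negative sampling for one training pair is given below. Instead of scoring all 10,000 words, it scores the true context word and a handful of sampled "negative" words. The indices, the number of negatives, and the uniform sampling are simplifying assumptions; the original word2vec draws negatives from a smoothed unigram distribution.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(0)
vocab_size, embedding_dim, num_negatives = 10_000, 50, 5
in_embed = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))   # center-word vectors
out_embed = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))  # context-word vectors

center, context = 42, 7                                      # hypothetical word indices
negatives = rng.integers(0, vocab_size, size=num_negatives)  # uniform draw, for brevity

v = in_embed[center]                   # row lookup instead of a one-hot product
pos_score = out_embed[context] @ v     # score of the true context word
neg_scores = out_embed[negatives] @ v  # scores of the sampled negative words

# Push the true pair's score up and the sampled pairs' scores down
loss = -np.log(sigmoid(pos_score)) - np.log(sigmoid(-neg_scores)).sum()
print("scored", 1 + num_negatives, "words instead of", vocab_size, "- loss:", float(loss))
```
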
In the Word2Vec model, the parameters are the word embeddings and the weights of the linear layer that are learned during training. The shape of the parameters depends on the vocabulary size and the embedding dimension.

Specifically, the Word2Vec model learns a matrix of word embeddings, where each row corresponds to the embedding of a single word in the vocabulary. If the vocabulary size is vocab_size and the embedding dimension is embedding_dim, then the shape of the embedding matrix is (vocab_size, embedding_dim).

Additionally, the Word2Vec model also learns a weight matrix that maps the embedded center word to a predicted output distribution over all the words in the vocabulary. If the vocabulary size is vocab_size and the embedding dimension is embedding_dim, then the shape of the weight matrix is (embedding_dim, vocab_size).

During training, these parameters are updated using backpropagation to minimize the loss between the predicted output distribution and the true target distribution. The optimized parameters are then used to obtain the final word embeddings that capture the semantic and syntactic relationships between words in the vocabulary.
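
One common way to inspect such relationships is cosine similarity between embedding rows. The sketch below uses a made-up three-word vocabulary and hand-written two-dimensional vectors purely for illustration; `most_similar` is a hypothetical helper, not a function from any library used in this notebook.

```python
import numpy as np


def most_similar(word, word2idx, embeddings, top_k=3):
    """Return the top_k words whose embeddings have the highest cosine similarity."""
    idx2word = {i: w for w, i in word2idx.items()}
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit[word2idx[word]]           # cosine similarity to every word
    order = [int(i) for i in np.argsort(-sims) if int(i) != word2idx[word]]
    return [(idx2word[i], float(sims[i])) for i in order[:top_k]]


word2idx = {"king": 0, "queen": 1, "apple": 2}   # made-up vocabulary
embeddings = np.array([[0.9, 0.1],
                       [0.8, 0.2],
                       [0.1, 0.9]])              # made-up vectors
neighbors = most_similar("king", word2idx, embeddings, top_k=2)
print(neighbors)
```

With a trained model, the embedding matrix (for example the weights of an embedding layer) would take the place of the hand-written array.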




In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

# Define the Word2Vec model
class Word2Vec(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(Word2Vec, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, center_word):
        center_embed = self.embedding(center_word)
        center_hidden = center_embed.sum(dim=1)
        target_score = self.linear(center_hidden)
        return target_score

# Define the Word2Vec training function
def train_word2vec(corpus, embedding_dim, window_size, batch_size, learning_rate, num_epochs):
    # Build the vocabulary
    vocab = sorted(set(corpus))  # sorted so that word indices are the same on every run
    word2idx = {w: i for i, w in enumerate(vocab)}
    idx2word = {i: w for w, i in word2idx.items()}
    vocab_size = len(vocab)

    # Prepare the training data
    data = []
    for i in range(len(corpus)):
        center_word = corpus[i]
        for j in range(1, window_size + 1):
            if i - j >= 0:
                target_word = corpus[i - j]
                data.append((center_word, target_word))
            if i + j < len(corpus):
                target_word = corpus[i + j]
                data.append((center_word, target_word))

    # Define the model, loss function, and optimizer
    model = Word2Vec(vocab_size, embedding_dim)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    # Train the model
    for epoch in range(num_epochs):
        total_loss = 0.0
        for i in range(0, len(data), batch_size):
            batch = data[i:i+batch_size]
            center_word_batch = [word2idx[w[0]] for w in batch]
            target_word_batch = [word2idx[w[1]] for w in batch]
            center_word_batch = torch.tensor(center_word_batch).unsqueeze(1)
            target_word_batch = torch.tensor(target_word_batch)
            target_score = model(center_word_batch)
            loss = criterion(target_score, target_word_batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
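        # total_loss adds up one mean loss per batch, so average over the number of batches
        num_batches = (len(data) + batch_size - 1) // batch_size
        print('Epoch [{}/{}], Loss: {:.4f}'.format(epoch+1, num_epochs, total_loss/num_batches))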

    # Return the trained model
    return model, word2idx, idx2word

# Example usage
corpus = ['this', 'is', 'a', 'test', 'sentence', 'for', 'word2vec']
embedding_dim = 50
window_size = 2
batch_size = 16
learning_rate = 0.001
num_epochs = 100
model, word2idx, idx2word = train_word2vec(corpus, embedding_dim, window_size, batch_size, learning_rate, num_epochs)

# Get the word embeddings for a specific word
word = 'word2vec'
word_idx = word2idx[word]
word_embed = model.embedding(torch.tensor([word_idx]))
print('Embedding for "{}": {}'.format(word, word_embed.squeeze().detach().numpy()))

The NumPy example below defines a small vocabulary of four words and converts each word to a one-hot vector. It creates an embedding matrix of dimension 2 with random initial values, and a simple neural network with one hidden layer that predicts whether each word in the sentence is positive or negative. The model is trained with mean squared error loss and gradient descent, and the word embeddings are finally read out by multiplying each one-hot vector with the embedding matrix. The hidden and output layers use the sigmoid activation function.

In [7]:
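import numpy as np

embedding_dim = 2  # same value as in the full example below
vocabulary = {'hello': 0, 'world': 1, 'goodbye': 2, 'cruel': 3}
one_hot = np.eye(len(vocabulary))
inputs = np.array([one_hot[vocabulary[word]] for word in ['hello', 'world', 'hello', 'goodbye', 'cruel', 'world']])
print(inputs.shape)
# Reshaping to embedding_dim * len(vocabulary) = 8 columns merges every two
# one-hot rows into one, so the 6 words become 3 rows
inputs = inputs.reshape(-1, embedding_dim * len(vocabulary))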
print(inputs.shape)

(6, 4)
(3, 8)


In [1]:
import numpy as np

# Define the vocabulary and convert words to one-hot vectors
vocabulary = {'hello': 0, 'world': 1, 'goodbye': 2, 'cruel': 3}
one_hot = np.eye(len(vocabulary))

# Define the embedding dimension and create the embedding matrix
embedding_dim = 2
embedding_matrix = np.random.randn(len(vocabulary), embedding_dim)

# Define the model architecture: embedding -> hidden (sigmoid) -> output (sigmoid)
hidden_dim = 5
W1 = np.random.randn(embedding_dim, hidden_dim)
b1 = np.zeros((1, hidden_dim))
W2 = np.random.randn(hidden_dim, 1)
b2 = np.zeros((1, 1))

# Define the activation function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Training data: one one-hot row per word, shape (6, 4), and one label per word, shape (6, 1)
sentence = ['hello', 'world', 'hello', 'goodbye', 'cruel', 'world']
inputs = np.array([one_hot[vocabulary[word]] for word in sentence])
labels = np.array([1, 0, 1, 0, 1, 0]).reshape(-1, 1)

# Train the model
learning_rate = 0.1
for epoch in range(100):
    # Forward pass
    embedded = np.dot(inputs, embedding_matrix)    # (6, 2): one embedding per word
    hidden = sigmoid(np.dot(embedded, W1) + b1)    # (6, 5)
    outputs = sigmoid(np.dot(hidden, W2) + b2)     # (6, 1)

    # Backward pass (gradient of the squared error, up to a constant factor)
    d_outputs = (outputs - labels) * outputs * (1 - outputs)
    d_hidden = np.dot(d_outputs, W2.T) * hidden * (1 - hidden)
    d_W2 = np.dot(hidden.T, d_outputs)
    d_b2 = np.sum(d_outputs, axis=0, keepdims=True)
    d_W1 = np.dot(embedded.T, d_hidden)
    d_b1 = np.sum(d_hidden, axis=0, keepdims=True)
    d_embedded = np.dot(d_hidden, W1.T)
    d_embedding_matrix = np.dot(inputs.T, d_embedded)

    # Update the parameters, including the embedding matrix
    W2 -= learning_rate * d_W2
    b2 -= learning_rate * d_b2
    W1 -= learning_rate * d_W1
    b1 -= learning_rate * d_b1
    embedding_matrix -= learning_rate * d_embedding_matrix

    if epoch % 10 == 9:
        loss = np.mean(np.square(outputs - labels))
        print(f'Epoch {epoch+1}, loss: {loss}')

# Get the word embeddings
word_embeddings = embedding_matrix

# Print the word embeddings
for word, index in vocabulary.items():
    one_hot_vector = one_hot[index]
    embedding = np.dot(one_hot_vector, word_embeddings)
    print(f'{word}: {embedding}')