# Word Vectors

In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as tud
from torch.nn.parameter import Parameter

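## Tensor and tensor
In PyTorch, both `torch.Tensor` and `torch.tensor` create new tensors, but they are not the same thing.
- `torch.Tensor()` is a Python class. More precisely, it is an alias for the default tensor type, `torch.FloatTensor()`.
  - `torch.Tensor([1,2])` invokes the constructor of the Tensor class and produces a single-precision floating-point tensor.
- `torch.tensor()` is a Python function. Its (simplified) prototype is `torch.tensor(data, dtype=None, requires_grad=False)`.
  - `data` can be a list, tuple, NumPy array, scalar, and similar types.
  - `torch.tensor()` copies the values out of `data` (rather than referencing them) and, when `dtype` is not given, infers the tensor type from the original data type, for example:
    - torch.LongTensor
    - torch.FloatTensor
    - torch.DoubleTensor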
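
The copy behaviour described above can be checked directly. The following is a minimal sketch and not part of the original notebook: the names `arr`, `copied`, and `shared` are illustrative, and `torch.from_numpy` appears only as a contrast because it shares memory with the source array.

```python
import numpy as np
import torch

arr = np.array([1, 2, 3])        # hypothetical sample data
copied = torch.tensor(arr)       # torch.tensor copies the values
shared = torch.from_numpy(arr)   # from_numpy shares memory with arr

arr[0] = 100                     # mutate the source array in place

print(copied[0].item())          # 1: the copy is unaffected
print(shared[0].item())          # 100: the shared tensor sees the change

# With no dtype given, Python floats are inferred as the default float type.
float_dtype = torch.tensor([1.0, 2.0]).dtype
print(float_dtype)               # torch.float32
```

If sharing memory is what you want, `torch.from_numpy` avoids the copy; `torch.tensor` is the safer default when the source data may change later.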
 
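Food for thought:
If you are used to basic data types, a tensor such as `tensor([1,2])` looks odd at first.
Why not just write it as a list?
One helpful way to look at it: everything is an object. In terms of values, a tensor and a list hold the same data, but the tensor type defines methods that a list does not have. Deep learning needs a richer wrapper around the raw values, and that wrapper is the tensor.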
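
To make the point about extra behaviour concrete, here is a small sketch (the variable names are illustrative) that contrasts the same expression on a list and on a tensor:

```python
import torch

values = [1, 2]
t = torch.tensor(values)

print(values * 2)      # list repetition: [1, 2, 1, 2]
print(t * 2)           # element-wise arithmetic: tensor([2, 4])

# Methods and attributes that a plain list does not offer
print(t.sum().item())  # 3
print(t.shape)         # torch.Size([2])
print(t.dtype)         # torch.int64
```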

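Example (`x_train`, `y_train`, `x_valid`, and `y_valid` are assumed to be loaded already, for instance as NumPy arrays):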
```python
import torch
x_train, y_train, x_valid, y_valid = map(
    torch.tensor,
    (x_train, y_train, x_valid, y_valid)
)
x_train.shape
x_train.min()
x_train.max()
```
```python
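# Built-in signature, for reference: map(function, iterable, *iterables)
# It applies `function` to each item and returns a lazy iterator over the results.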

```
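
Because `map` returns a lazy iterator rather than a list, unpacking it (as the example above does) consumes it. A minimal sketch using only the standard library, with `str.upper` standing in for `torch.tensor`:

```python
pairs = map(str.upper, ("a", "b", "c"))
kind = type(pairs).__name__
print(kind)                    # map

first, second, third = pairs   # unpacking consumes the iterator
print(first, second, third)    # A B C

leftover = list(pairs)
print(leftover)                # []: the iterator is already exhausted
```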

In [3]:
a = torch.Tensor([1,2])
a

tensor([1., 2.])

In [4]:
a = torch.tensor([1,2])
a

tensor([1, 2])

In [10]:
a = torch.tensor([1,2])
a.type()

'torch.LongTensor'

In [9]:
b = torch.tensor([1.,2.])
b.type()

'torch.FloatTensor'

In [11]:
import numpy as np
c = np.zeros(2, dtype=np.float64)
c = torch.tensor(c)
c.type()

'torch.DoubleTensor'

## nn.functional
`import torch.nn.functional as F`
It provides the functional (stateless) form of the operations in `torch.nn`, including a large number of loss and activation functions.
```python
import torch.nn.functional as F

loss_func = F.cross_entropy
loss = loss_func(model(x), y)

loss.backward()
```
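
The snippet above assumes that `model`, `x`, and `y` already exist. A self-contained sketch, using a hypothetical linear classifier and random data purely for illustration, could look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

model = nn.Linear(4, 3)              # hypothetical model: 4 features, 3 classes
x = torch.randn(8, 4)                # batch of 8 random inputs
y = torch.randint(0, 3, (8,))        # integer class labels in [0, 3)

loss = F.cross_entropy(model(x), y)  # expects raw logits and class indices
loss.backward()                      # populates .grad on the model parameters

print(loss.item())
print(model.weight.grad.shape)       # torch.Size([3, 4])
```

Note that `F.cross_entropy` applies log-softmax internally, so the model should output raw logits rather than probabilities.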

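## What is the difference between nn and nn.functional
- `nn.functional.xxx` is a function interface.
- `nn.Xxx` is a class wrapper around `nn.functional.xxx`, and every `nn.Xxx` inherits from a common ancestor, `nn.Module`.
- Besides the functionality of `nn.functional.xxx`, `nn.Xxx` carries the attributes and methods of `nn.Module`, such as:
  - train()
  - eval()
  - load_state_dict
  - state_dict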
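
The relationship in the list above can be sketched as follows. This is an illustrative comparison, not an exhaustive one: `nn.ReLU` and `F.relu` compute the same result, while module classes such as `nn.Linear` additionally carry the `nn.Module` machinery.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-1.0, 0.0, 2.0])

relu_module = nn.ReLU()                        # class wrapper
same = torch.equal(relu_module(x), F.relu(x))  # functional form gives the same values
print(same)                                    # True

layer = nn.Linear(3, 2)                        # a module that owns parameters
param_names = list(layer.state_dict().keys())
print(param_names)                             # ['weight', 'bias']

layer.eval()                                   # nn.Module method: switch to evaluation mode
print(layer.training)                          # False
```

A common rule of thumb is to use `nn.Xxx` for layers that hold parameters or mode-dependent state, and `F.xxx` for stateless operations.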

## class torch.utils.data.Dataset
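> Purpose: create a dataset.
- `__getitem__(self, index)` returns the sample (for example, an image) and its label for the given index
- `__len__(self)` returns the size of the dataset

Custom dataset classes must be subclasses of torch.utils.data.Dataset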

In [19]:
from torch.utils.data import Dataset

class TensorDataset(Dataset):
    """
    Dataset wrapping data and target tensors
    
    Each sample will be retrieved by indexing both tensors along the first dimension.
    
    Arguments:
        data_tensor (Tensor): contains sample data
        target_tensor (Tensor): contains sample targets (labels)
    """
    def __init__(self, data_tensor, target_tensor):
        assert data_tensor.size(0) == target_tensor.size(0)
        self.data_tensor = data_tensor
        self.target_tensor = target_tensor
        
    def __getitem__(self, index):
        return self.data_tensor[index], self.target_tensor[index]
    
    def __len__(self):
        return self.data_tensor.size(0)
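
A dataset like the one above is typically consumed through a `DataLoader`, which calls `__len__` and `__getitem__` to assemble batches. A minimal sketch, where `PairDataset` is a hypothetical stand-in with the same two methods and the data is made up:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PairDataset(Dataset):
    """Hypothetical dataset that pairs each data row with a target."""

    def __init__(self, data, targets):
        assert data.size(0) == targets.size(0)
        self.data = data
        self.targets = targets

    def __getitem__(self, index):
        return self.data[index], self.targets[index]

    def __len__(self):
        return self.data.size(0)

# 5 samples with 2 features each, plus 5 integer targets
dataset = PairDataset(torch.arange(10).float().view(5, 2), torch.arange(5))
loader = DataLoader(dataset, batch_size=2, shuffle=False)

batch_shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in loader]
print(len(dataset))   # 5
print(batch_shapes)   # [((2, 2), (2,)), ((2, 2), (2,)), ((1, 2), (1,))]
```

The last batch is smaller because 5 is not divisible by the batch size; passing `drop_last=True` to `DataLoader` would discard it.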

In [4]:
from collections import Counter
import numpy as np
import random

import pandas as pd
import scipy
import sklearn
from sklearn.metrics.pairwise import cosine_similarity

## Counter

In [10]:
A=['a','b','b','c','d','b','a']
counter = Counter(A)
counter

Counter({'a': 2, 'b': 3, 'c': 1, 'd': 1})

In [12]:
B = counter.most_common(3)
B

[('b', 3), ('a', 2), ('c', 1)]

In [14]:
C = dict(B)
C

{'b': 3, 'a': 2, 'c': 1}

In [5]:
USE_CUDA = torch.cuda.is_available()

# To keep experimental results reproducible, we usually fix every random seed to a specific value
random.seed(53113)
np.random.seed(53113)
torch.manual_seed(53113)
if USE_CUDA:
    torch.cuda.manual_seed(53113)
    
# Set some hyperparameters
    
K = 100 # number of negative samples
C = 3 # nearby words threshold
NUM_EPOCHS = 2 # The number of epochs of training
MAX_VOCAB_SIZE = 30000 # the vocabulary size
BATCH_SIZE = 128 # the batch size
LEARNING_RATE = 0.2 # the initial learning rate
EMBEDDING_SIZE = 100
       
    
LOG_FILE = "word-embedding.log"
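
The effect of fixing a seed can be illustrated with the standard library alone. In this sketch the helper name `sample` is hypothetical; the same idea applies to `np.random.seed` and `torch.manual_seed`.

```python
import random

def sample(seed):
    """Reseed the generator, then draw five pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(5)]

run_a = sample(53113)
run_b = sample(53113)
print(run_a == run_b)   # True: the same seed reproduces the same sequence
```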

In [6]:
def word_tokenize(text):
    return text.split()

In [17]:
with open("/Users/liuwangxiang/Desktop/nlp5/第三阶段 预训练模型与机器翻译/01-词向量实战/text8/text8.train.txt", "r") as fin:
    text = fin.read()

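- read() reads the entire contents of the file as a single string
- readline() reads one line of the file
- readlines() reads all lines of the file and returns them as a list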
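
The three methods listed above can be compared on a small temporary file. This is an illustrative sketch; the file name and contents are made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")   # hypothetical file
    with open(path, "w", encoding="utf-8") as fout:
        fout.write("first line\nsecond line\n")

    with open(path, "r", encoding="utf-8") as fin:
        whole = fin.read()        # entire file as one string
    with open(path, "r", encoding="utf-8") as fin:
        one = fin.readline()      # only the first line, newline included
    with open(path, "r", encoding="utf-8") as fin:
        lines = fin.readlines()   # list with one string per line

print(repr(whole))   # 'first line\nsecond line\n'
print(repr(one))     # 'first line\n'
print(lines)         # ['first line\n', 'second line\n']
```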

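## Is there any benefit to using a list comprehension here?
```python
text = [w for w in word_tokenize(text.lower())]
```
This code could just as well be replaced by the following, because the comprehension only copies the list that `word_tokenize` already returns:
```python
text = text.lower().split()
```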
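
A quick check on a made-up string suggests that the comprehension adds nothing unless it also filters or transforms tokens. In the sketch below, `word_tokenize` is redefined as in this notebook so that the block stands on its own.

```python
def word_tokenize(text):
    return text.split()

raw = "The quick  brown\nFox"   # hypothetical sample text

via_comprehension = [w for w in word_tokenize(raw.lower())]
via_split = raw.lower().split()
print(via_comprehension == via_split)   # True: the two forms agree

# A comprehension becomes useful once it filters or transforms the tokens.
filtered = [w for w in word_tokenize(raw.lower()) if len(w) > 3]
print(filtered)                         # ['quick', 'brown']
```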

In [18]:
text = [w for w in word_tokenize(text.lower())]

In [26]:
vocab = dict(Counter(text).most_common(MAX_VOCAB_SIZE-1))
vocab["<unk>"] = len(text) - np.sum(list(vocab.values()))
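
The same vocabulary construction can be traced by hand on a toy corpus. In this sketch `tokens` and `max_vocab_size` are illustrative stand-ins for `text` and `MAX_VOCAB_SIZE`:

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()   # 9 tokens
max_vocab_size = 3                                       # hypothetical tiny limit

# Keep the (max_vocab_size - 1) most frequent words ...
vocab = dict(Counter(tokens).most_common(max_vocab_size - 1))
# ... and fold the count of every remaining token into "<unk>".
vocab["<unk>"] = len(tokens) - sum(vocab.values())

print(vocab)                 # {'the': 3, 'cat': 2, '<unk>': 4}
print(sum(vocab.values()))   # 9: counts still add up to the corpus size
```

One slot is reserved for `<unk>`, which is why only `max_vocab_size - 1` real words are kept.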

In [23]:
vocab.keys()



In [27]:
vocab.values()

dict_values([958035, 536684, 375233, 371796, 335503, 292250, 285093, 235406, 224705, 172079, 164575, 118931, 113412, 106452, 104935, 104416, 103344, 101939, 100587, 98710, 97719, 91897, 90940, 85741, 82022, 68769, 66093, 65738, 62113, 58234, 55741, 55471, 53073, 49838, 49636, 49410, 39778, 39726, 35792, 35190, 33922, 32114, 29018, 28410, 26355, 25976, 25969, 25522, 25219, 23659, 23598, 22954, 22880, 22853, 21830, 21770, 21703, 21281, 20450, 20253, 19196, 18852, 18698, 18512, 18449, 17796, 17550, 17324, 17101, 16971, 16214, 16105, 16081, 15668, 15514, 14730, 14233, 14089, 14045, 13556, 13525, 13388, 13266, 13217, 13176, 13162, 13035, 12992, 12666, 12550, 12122, 11623, 11621, 11573, 11525, 11446, 11368, 11156, 11156, 11074, 10975, 10846, 10750, 10706, 10667, 10643, 10629, 10562, 10553, 10354, 10351, 10277, 10211, 10185, 9919, 9881, 9800, 9693, 9594, 9556, 9538, 9461, 9372, 9353, 9287, 9205, 9119, 9085, 8981, 8888, 8860, 8824, 8733, 8667, 8630, 8606, 8580, 8570, 8566, 8495, 8472, 8402, ...])

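## Dictionary keys() and values()
What do these two methods return? A list? Not quite: they return view objects (`dict_keys` and `dict_values`).
They can, however, be converted to lists with the list() function.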
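
A short sketch on a made-up dictionary shows how these views differ from lists: they stay linked to the dictionary and reflect later changes, whereas `list()` takes a snapshot.

```python
vocab = {"the": 3, "cat": 2}   # hypothetical small vocabulary

values_view = vocab.values()
view_type = type(values_view).__name__
print(view_type)               # dict_values

snapshot = list(values_view)   # an independent list copy
vocab["<unk>"] = 4             # modify the dictionary afterwards

print(snapshot)                # [3, 2]: the list copy is unchanged
print(list(values_view))       # [3, 2, 4]: the view follows the dictionary
print(list(vocab.keys()))      # ['the', 'cat', '<unk>']
```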

In [28]:
type(vocab.values())

dict_values

In [None]:
type(vocab.keys())