- macOS
- python3
- Flask 2.0.2
- matplotlib 3.4.3
- nltk 3.6.5
- torch 1.10.0
- scikit-learn
- Implement word2vec for a set of text documents from PubMed.
- Choose one of the two basic neural network models to process the text set from the document collection.
- Continuous Bag of Words (CBOW): use a window of words to predict the middle word.
- Skip-gram (SG): use a word to predict the surrounding words in the window. The window size is not limited, and any programming language may be used.
Word2Vec is a model that learns semantics from large amounts of text in an unsupervised way and is widely used in NLP. It represents meaning as word vectors: words that are semantically similar end up close to each other in the vector space. An embedding is a mapping of words from their original space into a new multi-dimensional space, i.e., the original word space is embedded into a new one.
The CBOW and Skip-gram models are very similar; the main difference is that CBOW uses the surrounding words to predict the current word, while Skip-gram uses the current word to predict the surrounding words. The window size is the extent of the context (e.g., window size = 1 means taking one word before and one word after).
SkipGram_Model(
(embeddings): Embedding(14086, 600, max_norm=1)
(linear): Linear(in_features=600, out_features=14086, bias=True)
)
# Input Layer : 1 x 14,086
# Hidden Layer : 14,086 x 600
# Output Layer : 600 x 14,086
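A minimal PyTorch module consistent with this printout might look like the following sketch. The vocabulary size (14,086) and embedding dimension (600) are taken from the printout; everything else is an assumption.

import torch.nn as nn

class SkipGram_Model(nn.Module):
    def __init__(self, vocab_size=14086, embed_dim=600):
        super().__init__()
        # max_norm=1 rescales any embedding vector whose L2 norm exceeds 1
        self.embeddings = nn.Embedding(vocab_size, embed_dim, max_norm=1)
        self.linear = nn.Linear(embed_dim, vocab_size)

    def forward(self, center_ids):
        x = self.embeddings(center_ids)   # (batch, embed_dim)
        return self.linear(x)             # (batch, vocab_size) logits over context words

Printing an instance of this class reproduces the repr shown above.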
Read the 4,000 .xml files and extract the Title, Label, and AbstractText fields.
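A minimal sketch of this step, assuming the standard PubMed XML element names (ArticleTitle, AbstractText); the directory name is hypothetical, and joining all documents into a single string is one possible design.

import glob
import xml.etree.ElementTree as ET

docs = []
for path in glob.glob('pubmed_xml/*.xml'):           # hypothetical directory
    root = ET.parse(path).getroot()
    title = root.findtext('.//ArticleTitle') or ''
    # AbstractText elements may carry a Label attribute (e.g. "METHODS")
    abstract = ' '.join(e.text or '' for e in root.iter('AbstractText'))
    docs.append(title + ' ' + abstract)

text = ' '.join(docs)   # fed into the cleaning pipeline below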
import re
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(text)
# keep only lowercase letters, digits, and hyphens
sentences = [re.sub(r'[^a-z0-9-]', ' ', sent.lower()) for sent in sentences]
clean_words = []
for sent in sentences:
    # drop tokens that are purely numeric (including hyphenated numbers)
    words = [word for word in sent.split() if not word.replace('-', '').isnumeric()]
    words = stop_word(words)  # user-defined stop-word filter
    clean_words.append(' '.join(words))
tokens = [x.split() for x in clean_words]  # token lists, one per cleaned sentence
Each word is first POS-tagged; the tag is then used to lemmatize the word back to its base form.
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()
# lemmatize every token; iterate over the token lists, not the raw
# sentence strings (which would yield single characters)
lemma_word = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for sentence in tokens for w in sentence]
Build a vocabulary from all the words, assigning each word its own index.
{'map': 4314, 'html': 4315, 'interchange': 4316, 'vtm': 4317, 'restrictive': 4318, 'pre-analytic': 4319, 'disadvantageous': 4320, 'unidirectional': 4321, 'wiley': 4322, 'periodical': 4323, 'alternate': 4324, 'low-throughput': 4325}
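One way to build such a word-to-index table is to sort by frequency (a sketch; `lemma_word` comes from the previous step, and the variable names are assumptions):

from collections import Counter

# count word frequencies, then assign indices in descending frequency order
counts = Counter(lemma_word)
word2idx = {word: i for i, (word, _) in enumerate(counts.most_common())}
idx2word = {i: w for w, i in word2idx.items()}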
Build (center, context) pairs from the vocabulary indices, with window_size = 2.
[(0, 1), (0, 2), (1, 0), (1, 2), (1, 3), (2, 0), (2, 1), (2, 3), (2, 4), (3, 1), (3, 2), (3, 4), (3, 5), (4, 2), (4, 3), (4, 5), (4, 6), (5, 3), (5, 4), (5, 6)]
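A sketch of the pair-generation step; applied to the index sequence 0–6 with window = 2, it reproduces the 20 pairs above.

WINDOW_SIZE = 2

def make_pairs(token_ids, window=WINDOW_SIZE):
    # every (center, context) index pair within the window
    pairs = []
    for i, center in enumerate(token_ids):
        for j in range(max(0, i - window), min(len(token_ids), i + window + 1)):
            if j != i:
                pairs.append((center, token_ids[j]))
    return pairs

pairs = make_pairs([word2idx[w] for w in lemma_word])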
sklearn.decomposition.PCA is used to reduce the word vectors to two dimensions, where highly related words cluster together. Input word: covid-19
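A sketch of the projection step, assuming the `model` and `word2idx` objects from the steps above; the query words plotted here are placeholders:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

vectors = model.embeddings.weight.detach().numpy()   # (14086, 600)
coords = PCA(n_components=2).fit_transform(vectors)

for word in ['covid-19', 'vaccine', 'infection']:    # placeholder query words
    x, y = coords[word2idx[word]]
    plt.scatter(x, y)
    plt.annotate(word, (x, y))
plt.show()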
Clicking Skip Gram and entering a word lists the 15 words most related to it. Input word: covid-19
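One way to produce such a top-15 list is cosine similarity over the embedding matrix (a sketch; `model`, `word2idx`, and `idx2word` are assumed from the steps above):

import numpy as np

def most_similar(word, k=15):
    vecs = model.embeddings.weight.detach().numpy()
    v = vecs[word2idx[word]]
    # cosine similarity between the query vector and every vocabulary vector
    sims = vecs @ v / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(v) + 1e-9)
    top = np.argsort(-sims)[1:k + 1]    # index 0 is the query word itself
    return [(idx2word[i], float(sims[i])) for i in top]

print(most_similar('covid-19'))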