- macOS
- python3
- Flask 2.0.2
- matplotlib 3.4.3
- nltk 3.6.5
- torch 1.10.0
- scikit-learn
- Implement word2vec for a set of text documents from PubMed.
- Choose one of the two basic neural network models to process the text set from the document collection.
- Continuous Bag of Words (CBOW): use a window of words to predict the middle word.
- Skip-gram (SG): use a word to predict the surrounding words in the window. The window size is not limited, and any programming language may be used.
Word2Vec is a model that learns semantics from large amounts of text in an unsupervised way and is widely used in NLP. It represents meaning as word vectors: words that are semantically similar end up close to each other in the vector space. An embedding is a mapping of words from their original space into a new multi-dimensional space, i.e., the original word space is embedded into a new one.
The CBOW and Skip-gram models are very similar; the main difference is that CBOW uses the surrounding words to predict the current word, while Skip-gram uses the current word to predict the surrounding words. The window size is the extent of the context (e.g., window size = 1 means taking one word before and one word after).
SkipGram_Model(
(embeddings): Embedding(14086, 600, max_norm=1)
(linear): Linear(in_features=600, out_features=14086, bias=True)
)
# Input Layer : 1 x 14,086
# Hidden Layer : 14,086 x 600
# Output Layer : 600 x 14,086
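A minimal PyTorch module consistent with this printout might look like the following sketch. The vocabulary size (14,086) and embedding dimension (600) are taken from the printout; everything else is an assumption.

import torch.nn as nn

class SkipGram_Model(nn.Module):
    def __init__(self, vocab_size=14086, embed_dim=600):
        super().__init__()
        # max_norm=1 rescales any embedding vector whose L2 norm exceeds 1
        self.embeddings = nn.Embedding(vocab_size, embed_dim, max_norm=1)
        self.linear = nn.Linear(embed_dim, vocab_size)

    def forward(self, center_ids):
        x = self.embeddings(center_ids)   # (batch, embed_dim)
        return self.linear(x)             # (batch, vocab_size) logits over context words

Printing an instance of this class reproduces the repr shown above.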
Read the 4,000 .xml files and extract the Title, Label, and AbstractText fields.
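A minimal sketch of this step, assuming the standard PubMed XML element names (ArticleTitle, AbstractText); the directory name is hypothetical, and joining all documents into a single string is one possible design.

import glob
import xml.etree.ElementTree as ET

docs = []
for path in glob.glob('pubmed_xml/*.xml'):           # hypothetical directory
    root = ET.parse(path).getroot()
    title = root.findtext('.//ArticleTitle') or ''
    # AbstractText elements may carry a Label attribute (e.g. "METHODS")
    abstract = ' '.join(e.text or '' for e in root.iter('AbstractText'))
    docs.append(title + ' ' + abstract)

text = ' '.join(docs)   # fed into the cleaning pipeline below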
import re
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(text)
# keep only lowercase letters, digits, and hyphens
sentences = [re.sub(r'[^a-z0-9-]', ' ', sent.lower()) for sent in sentences]
clean_words = []
for sent in sentences:
    # drop tokens that are purely numeric (including hyphenated numbers)
    words = [word for word in sent.split() if not word.replace('-', '').isnumeric()]
    words = stop_word(words)  # user-defined stop-word filter
    clean_words.append(' '.join(words))
tokens = [x.split() for x in clean_words]  # token lists, one per cleaned sentence
Each word is first POS-tagged; the tag is then used to lemmatize the word back to its base form.
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()
# lemmatize every token; iterate over the token lists, not the raw
# sentence strings (which would yield single characters)
lemma_word = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for sentence in tokens for w in sentence]
Build a vocabulary from all the words, assigning each word its own index.
{'map': 4314, 'html': 4315, 'interchange': 4316, 'vtm': 4317, 'restrictive': 4318, 'pre-analytic': 4319, 'disadvantageous': 4320, 'unidirectional': 4321, 'wiley': 4322, 'periodical': 4323, 'alternate': 4324, 'low-throughput': 4325}
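One way to build such a word-to-index table is to sort by frequency (a sketch; `lemma_word` comes from the previous step, and the variable names are assumptions):

from collections import Counter

# count word frequencies, then assign indices in descending frequency order
counts = Counter(lemma_word)
word2idx = {word: i for i, (word, _) in enumerate(counts.most_common())}
idx2word = {i: w for w, i in word2idx.items()}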
Build (center, context) pairs from the vocabulary indices, with window_size = 2.
[(0, 1), (0, 2), (1, 0), (1, 2), (1, 3), (2, 0), (2, 1), (2, 3), (2, 4), (3, 1), (3, 2), (3, 4), (3, 5), (4, 2), (4, 3), (4, 5), (4, 6), (5, 3), (5, 4), (5, 6)]
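A sketch of the pair-generation step; applied to the index sequence 0–6 with window = 2, it reproduces the 20 pairs above.

WINDOW_SIZE = 2

def make_pairs(token_ids, window=WINDOW_SIZE):
    # every (center, context) index pair within the window
    pairs = []
    for i, center in enumerate(token_ids):
        for j in range(max(0, i - window), min(len(token_ids), i + window + 1)):
            if j != i:
                pairs.append((center, token_ids[j]))
    return pairs

pairs = make_pairs([word2idx[w] for w in lemma_word])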
sklearn.decomposition.PCA is used to reduce the word vectors to two dimensions, where highly related words cluster together. Input word: covid-19
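A sketch of the projection step, assuming the `model` and `word2idx` objects from the steps above; the query words plotted here are placeholders:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

vectors = model.embeddings.weight.detach().numpy()   # (14086, 600)
coords = PCA(n_components=2).fit_transform(vectors)

for word in ['covid-19', 'vaccine', 'infection']:    # placeholder query words
    x, y = coords[word2idx[word]]
    plt.scatter(x, y)
    plt.annotate(word, (x, y))
plt.show()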
Clicking Skip Gram and entering a word lists the 15 words most related to it. Input word: covid-19
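One way to produce such a top-15 list is cosine similarity over the embedding matrix (a sketch; `model`, `word2idx`, and `idx2word` are assumed from the steps above):

import numpy as np

def most_similar(word, k=15):
    vecs = model.embeddings.weight.detach().numpy()
    v = vecs[word2idx[word]]
    # cosine similarity between the query vector and every vocabulary vector
    sims = vecs @ v / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(v) + 1e-9)
    top = np.argsort(-sims)[1:k + 1]    # index 0 is the query word itself
    return [(idx2word[i], float(sims[i])) for i in top]

print(most_similar('covid-19'))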