## Vectorizing Noun Tokens

To turn the extracted nouns into vectors that can be used as model input, we train a FastText model on them.

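We use FastText rather than Word2Vec (W2V) because the dataset is fairly small. FastText learns not only from whole words but also from subword units (character n-grams) inside each word, so it extracts more training signal from a limited corpus.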
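As a rough illustration of the subword idea, the sketch below extracts character n-grams from a word wrapped in `<` and `>` boundary markers. The helper `char_ngrams` is hypothetical and written only for this example; it is not gensim's internal routine, which additionally hashes the n-grams into buckets. The default lengths of 3 to 6 mirror gensim's `min_n` and `max_n` defaults.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' markers.

    Hypothetical helper for illustration only; gensim additionally hashes
    each n-gram into a fixed number of buckets.
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# Each Hangul syllable block is a single character, so 3-grams span syllables.
print(char_ngrams("초등학교", min_n=3, max_n=3))
print(char_ngrams("고등학교", min_n=3, max_n=3))

# N-grams shared by two related words are one reason their vectors can end up close.
shared = set(char_ngrams("초등학교")) & set(char_ngrams("고등학교"))
print(sorted(shared))
```

Words that share n-grams share part of their representation, which is one plausible reason related school names can end up close together even when some of them are rare in the corpus.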

In [None]:
import logging
from gensim.models.fasttext import FastText

In [5]:
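logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

corpus_path = "./output/corpus.txt"  # one sentence per line, tokens separated by spaces

# Do not pass the corpus to the constructor: doing so already trains for `epochs`
# passes, and the explicit train() call below would then train a second time.
# `workers` is a constructor argument; train() silently ignores it.
model = FastText(vector_size=200, epochs=1000, window=5, workers=12)
model.build_vocab(corpus_file=corpus_path)

model.train(
    corpus_file=corpus_path,
    epochs=model.epochs,
    total_examples=model.corpus_count,
    total_words=model.corpus_total_words,
)

# Save the word vectors in word2vec text format and the full model for later reuse.
# Note: the loading cell further down reads from ./model/, so adjust the path if needed.
model.wv.save_word2vec_format('ft_1118.txt')
model.save('ft_1118.model')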

2023-11-18 13:38:55,943 : INFO : collecting all words and their counts
2023-11-18 13:38:55,945 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2023-11-18 13:38:56,272 : INFO : PROGRESS: at sentence #10000, processed 1090614 words, keeping 9895 word types
2023-11-18 13:38:56,549 : INFO : PROGRESS: at sentence #20000, processed 2278230 words, keeping 12912 word types
2023-11-18 13:38:56,558 : INFO : collected 13018 word types from a corpus of 2306285 raw words and 20171 sentences
2023-11-18 13:38:56,559 : INFO : Creating a fresh vocabulary
2023-11-18 13:38:56,608 : INFO : FastText lifecycle event {'msg': 'effective_min_count=5 retains 6223 unique words (47.80% of original 13018, drops 6795)', 'datetime': '2023-11-18T13:38:56.608328', 'gensim': '4.3.2', 'python': '3.8.18 | packaged by conda-forge | (default, Oct 10 2023, 15:44:36) \n[GCC 12.3.0]', 'platform': 'Linux-5.15.49-linuxkit-pr-x86_64-with-glibc2.10', 'event': 'prepare_vocab'}

In [7]:
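# Load into `model` so the queries below use the loaded model.
model = FastText.load('./model/ft_1118.model')
# Nearest neighbours of "초등학교" (elementary school) by cosine similarity.
similar = model.wv.most_similar(positive=["초등학교"])
for word, score in similar:
    print("%s - %f" % (word, score))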

2023-11-18 13:43:40,894 : INFO : loading FastText object from ./model/ft_1118.model
2023-11-18 13:43:40,984 : INFO : loading wv recursively from ./model/ft_1118.model.wv.* with mmap=None
2023-11-18 13:43:40,990 : INFO : loading vectors_ngrams from ./model/ft_1118.model.wv.vectors_ngrams.npy with mmap=None
2023-11-18 13:44:22,708 : INFO : setting ignored attribute vectors to None
2023-11-18 13:44:22,733 : INFO : setting ignored attribute buckets_word to None
2023-11-18 13:44:23,203 : INFO : setting ignored attribute cum_table to None
2023-11-18 13:44:23,384 : INFO : FastText lifecycle event {'fname': './model/ft_1118.model', 'datetime': '2023-11-18T13:44:23.383641', 'gensim': '4.3.2', 'python': '3.8.18 | packaged by conda-forge | (default, Oct 10 2023, 15:44:36) \n[GCC 12.3.0]', 'platform': 'Linux-5.15.49-linuxkit-pr-x86_64-with-glibc2.10', 'event': 'loaded'}


고등학교 - 0.870593
학교 - 0.859759
대학교 - 0.847543
중학교 - 0.806007
교회 - 0.785753
독산역 - 0.745466
철산역 - 0.741269
가락시장 - 0.739527
증산역 - 0.736069
당산역 - 0.735030


In [8]:
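# Query with three words at once: "어린이" (child), "보호" (protection), "구역" (zone).
similar = model.wv.most_similar(positive=["어린이", "보호", "구역"])
for word, score in similar:
    print("%s - %f" % (word, score))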

어린이집 - 0.669969
어린이대공원역 - 0.644940
동등 - 0.617614
청구역 - 0.559615
도모 - 0.541052
실종 - 0.538501
대입 - 0.537531
음란물 - 0.535708
숙대입구역 - 0.532322
흐름 - 0.521616
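
When `positive` holds several words, gensim's `most_similar` ranks the vocabulary by cosine similarity to the mean of the query vectors. The sketch below reproduces that idea in plain Python on made-up 2-D vectors. The function `rank_by_mean`, the labels `a` to `d`, and the vector values are all hypothetical, and the choice to unit-normalize each query vector before averaging is an assumption about the details rather than a copy of gensim's code.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def rank_by_mean(vectors, positive, topn=3):
    """Rank words by cosine similarity to the mean of the query vectors.

    Hypothetical helper for illustration; query words are excluded from the result.
    """
    query = [unit(vectors[w]) for w in positive]
    dim = len(query[0])
    mean = unit([sum(v[i] for v in query) / len(query) for i in range(dim)])
    scores = []
    for word, vec in vectors.items():
        if word in positive:
            continue
        # Cosine similarity equals the dot product of unit vectors.
        scores.append((word, sum(a * b for a, b in zip(mean, unit(vec)))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up toy vectors, not taken from the trained model.
toy = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
    "d": [-1.0, 0.0],
}
ranked = rank_by_mean(toy, ["a", "b"])
for word, score in ranked:
    print("%s - %f" % (word, score))
```

Because the query is an average, the ranking can be pulled toward neighbours of a single query word. That may be why words containing 어린이 lead the list above, instead of terms related to the combined phrase.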
