# Building a Custom Vocab

* English : uses **wordpiece**<br>
* Korean : uses the **mecab morphological analyzer**<br>
* Symbols : uses **+-/*÷=×±∓∘∙∩∪≅∀√%∄∃θπσ≠<>≤≥≡∼≈≢∝≪≫∈∋∉⊂⊃⊆⊇⋈∑∫∏∞x().,%#{}**.<br>
    Each symbol is added in two forms (e.g. +, ##+)
* Digits : uses **0 1 2 3 4 5 6 7 8 9**<br>
    Each digit is added in two forms (e.g. 0, ##0)
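
The two-form rule for symbols and digits can be sketched as below. The helper name `with_continuation_forms` is hypothetical and does not appear in the notebook; it only illustrates emitting each character once as a word-initial entry and once with the `##` prefix that WordPiece uses for word-internal pieces.

```python
def with_continuation_forms(chars):
    """Return each character followed by its '##'-prefixed form.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    entries = []
    for c in chars:
        entries.append(c)         # form used at the start of a word
        entries.append("##" + c)  # form used inside a word (continuation)
    return entries


print(with_continuation_forms("+0"))  # ['+', '##+', '0', '##0']
```

Each input character therefore contributes exactly two vocab entries.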

## How to Run
**Change the PATH values** before running.

In [24]:
# version
VERSION = "4"

# Vocab path
## input Vocab path
MecapVocab_path = "only_paper_data/paper_mecab_vocab_ko_v{}_sejin.txt".format(VERSION)
EnglishVocab_path = "only_paper_data/paper_wordpiece_only_eng_vocab_1000_v{}.txt".format(VERSION)

## output Vocab path
NoEnglishCustomVocab_path = "only_paper_data/ices_vocab_except_eng_{}.txt".format(VERSION)
CustomVocab_path = "only_paper_data/ices_vocab_v{}_sejin.txt".format(VERSION)

In [25]:
import re
import sys
# custom vocab dictionary
new_vocab = []

In [26]:
# Vocab dictionary extracted from the full corpus
f = open(MecapVocab_path, 'r')
lines = f.readlines()
print("Total Vocab size : ", len(lines))
f.close()

Total Vocab size :  244945


## Extracting Only Hangul Words from the Total Vocab Dictionary (Mecab Morphological Analyzer)

In [27]:
def isHangul(text):
    if text[:2] == "##": text = text[2:]
    #Check the Python Version
    pyVer3 =  sys.version_info >= (3, 0)

    if pyVer3 : # for Ver 3 or later
        encText = text
    else: # for Ver 2.x
        if type(text) is not unicode:
            encText = text.decode('utf-8')
        else:
            encText = text

    hanCount = len(re.findall(u'[\u3130-\u318F\uAC00-\uD7A3]+', encText))
    return hanCount > 0
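
As a quick check of what this test accepts, here is a Python 3 only sketch that uses the same two character ranges. The names `HANGUL_RE` and `contains_hangul` are hypothetical stand-ins, and the sample tokens are made up for illustration.

```python
import re

# Same character ranges as isHangul above:
# U+3130-U+318F (Hangul Compatibility Jamo) and U+AC00-U+D7A3 (Hangul Syllables).
HANGUL_RE = re.compile(u'[\u3130-\u318F\uAC00-\uD7A3]')


def contains_hangul(token):
    """Python 3 sketch: True if the token has at least one Hangul character."""
    if token.startswith("##"):
        token = token[2:]  # ignore the continuation prefix
    return HANGUL_RE.search(token) is not None


for token in ["형태소", "##분석", "wordpiece", "##ing", "3차원"]:
    print(token, contains_hangul(token))
```

The check passes any token that contains at least one Hangul character, so a mixed token such as `3차원` is kept as well; it does not require the token to be Hangul only.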

In [28]:
count = 0
for i in lines:
    if isHangul(i[:-1]):
        new_vocab.append(i)
        count += 1

In [29]:
print("Number of Hangul vocab : {}".format(count))
print("Current new_vocab size : {} (Hangul words added)".format(len(new_vocab)))

Number of Hangul vocab : 244945
Current new_vocab size : 244945 (Hangul words added)


## Adding Separator Tokens

In [30]:
new_vocab.insert(0,'[MASK]\n')
new_vocab.insert(0,'[SEP]\n')
new_vocab.insert(0,'[CLS]\n')
new_vocab.insert(0,'[UNK]\n')
new_vocab.insert(0,'[PAD]\n')

In [31]:
print("Number of separators : 5")
print("Current new_vocab size : {} (separators added)".format(len(new_vocab)))

Number of separators : 5
Current new_vocab size : 244950 (separators added)
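
Because each `insert(0, ...)` pushes the earlier entries back, the five tokens end up at indices 0 to 4 in the order `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`. The sketch below produces the same layout in one step; `SPECIAL_TOKENS` and `prepend_special_tokens` are hypothetical names, and the two sample vocab lines are invented for the demonstration.

```python
# Final order of the tokens after the five insert(0, ...) calls above.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def prepend_special_tokens(vocab_lines):
    """Return a new list with the special tokens placed at indices 0..4."""
    return [tok + "\n" for tok in SPECIAL_TOKENS] + list(vocab_lines)


demo = prepend_special_tokens(["한글\n", "##단어\n"])
for index, line in enumerate(demo):
    print(index, line.strip())
```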


## Adding Digits

In [32]:
count = 0
for i in range(10):
    new_vocab.append(str(i)+'\n')
    new_vocab.append("##{}\n".format(i))
    count += 2

In [33]:
print("Number of type of number : {}".format(count))
print("Current new_vocab size : {} (digits added)".format(len(new_vocab)))

Number of type of number : 20
Current new_vocab size : 244970 (digits added)


## Adding Special Characters

In [34]:
used_Special_Char = "+-/*÷=×±∓∘∙∩∪≅∀√%∄∃θπσ≠<>≤≥≡∼≈≢∝≪≫∈∋∉⊂⊃⊆⊇⋈∑∫∏∞x().,%#{}"
count = 0
for c in used_Special_Char:
    new_vocab.append(c+'\n')
    new_vocab.append("##{}\n".format(c))
    count+=2

In [35]:
print("Number of Special Characters : {}".format(count))
print("Current new_vocab size : {} (special characters added)".format(len(new_vocab)))

Number of Special Characters : 110
Current new_vocab size : 245080 (special characters added)
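
Note that `%` appears twice in `used_Special_Char`, so `%` and `##%` are each appended twice by the loop. If duplicates are not intended (an assumption; the notebook does not say), repeats could be dropped while keeping the original order, as in this sketch. The name `unique_chars` is hypothetical.

```python
used_Special_Char = "+-/*÷=×±∓∘∙∩∪≅∀√%∄∃θπσ≠<>≤≥≡∼≈≢∝≪≫∈∋∉⊂⊃⊆⊇⋈∑∫∏∞x().,%#{}"

# dict.fromkeys keeps the first occurrence of each key in insertion order
# (guaranteed for dict in Python 3.7+).
unique_chars = "".join(dict.fromkeys(used_Special_Char))

print(len(used_Special_Char), len(unique_chars))
print(used_Special_Char.count("%"), unique_chars.count("%"))
```

Deduplicating would change the entry counts reported above, so the printed sizes would no longer match.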


In [36]:
f = open(NoEnglishCustomVocab_path, 'w')
f.write("".join(new_vocab))
f.close()

## Adding English Words

In [37]:
f = open(EnglishVocab_path, 'r')
eng_lines = f.readlines()
print("Total English Vocab size : ", len(eng_lines))
f.close()

Total English Vocab size :  40986


In [38]:
count = 0
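# Skip the first 5 lines of the English vocab
# (presumably the same 5 separator tokens that were already inserted above)
for i in eng_lines[5:]: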
    new_vocab.append(i)
    count += 1

In [39]:
print("Number of english vocab : {}".format(len(eng_lines[5:])))
print("Current new_vocab size : {} (English added)".format(len(new_vocab)))

Number of english vocab : 40981
Current new_vocab size : 286061 (English added)


In [40]:
f = open(CustomVocab_path, 'w')
f.write("".join(new_vocab))
f.close()
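
The read and write cells call `open()` without an encoding, so in Python 3 the encoding depends on the platform default. A minimal Python 3 sketch that pins UTF-8 and uses `with` to close the file is shown below; `write_vocab`, `read_vocab`, and the temporary demo path are hypothetical and not part of the notebook.

```python
import os
import tempfile


def write_vocab(path, vocab_lines):
    """Write vocab lines (each already newline-terminated) as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(vocab_lines)


def read_vocab(path):
    """Read a UTF-8 vocab file back as a list of newline-terminated lines."""
    with open(path, "r", encoding="utf-8") as f:
        return f.readlines()


# Round-trip a tiny made-up vocab through a temporary file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_vocab.txt")
write_vocab(demo_path, ["[PAD]\n", "한글\n", "##+\n"])
print(read_vocab(demo_path))
```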

In [41]:
# Compare the number of entries with and without the ## prefix
f = open(CustomVocab_path, 'r')
test = f.readlines()
f.close()
count = 0
count2 = 0

for i in test[5:]:
    if i[:2] == '##': count += 1
    else: count2 += 1
print("With ## prefix : ", count)
print("Without ## prefix : ", count2)
count + count2

With ## prefix :  98393
Without ## prefix :  187663


286056
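
The same check can be extended to report repeated entries, which is useful because the sources are concatenated without deduplication. This is a sketch under assumptions: `summarize_vocab` is a hypothetical helper, and the sample list is invented rather than taken from the real vocab file.

```python
from collections import Counter


def summarize_vocab(vocab_lines, num_special=5):
    """Count '##' and non-'##' entries after the special tokens, and list repeats."""
    tokens = [line.rstrip("\n") for line in vocab_lines[num_special:]]
    with_prefix = sum(1 for t in tokens if t.startswith("##"))
    duplicates = sorted(t for t, n in Counter(tokens).items() if n > 1)
    return with_prefix, len(tokens) - with_prefix, duplicates


demo = ["[PAD]\n", "[UNK]\n", "[CLS]\n", "[SEP]\n", "[MASK]\n",
        "한글\n", "##글\n", "%\n", "##%\n", "%\n", "##%\n", "x\n"]
print(summarize_vocab(demo))  # (3, 4, ['##%', '%'])
```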

In [13]:
! python createCustomVocab.py rsc/my_conf/FinalMecabVocab.txt rsc/my_conf/ices_eng_vocab_1000.txt  --custom=customvocab.txt

Total Mecab Vocab size :  750650
Number of Hangul vocab : 282642
Current new_vocab size : 282642 (Hangul words added)
Total English Vocab size :  47874
Number of english vocab : 47869
Current new_vocab size : 330511 (English added)
Number of separators : 5
Current new_vocab size : 330516 (separators added)
Number of type of number : 20
Current new_vocab size : 330536 (digits added)
Number of Special Characters : 110
Current new_vocab size : 330646 (digits added)
With ## prefix :  137899
Without ## prefix :  192742
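
The source of `createCustomVocab.py` is not shown here. As a rough illustration only, a command line of the shape used above (two positional paths plus `--custom=`) could be parsed as follows; the argument names, help texts, and default value are assumptions, not the script's actual interface.

```python
import argparse


def build_parser():
    """Hypothetical parser matching the shape of the command shown above."""
    parser = argparse.ArgumentParser(
        description="Build a custom vocab (illustrative sketch).")
    parser.add_argument("mecab_vocab", help="path to the Mecab vocab file")
    parser.add_argument("english_vocab", help="path to the English WordPiece vocab file")
    # The default below is an assumption made for this sketch.
    parser.add_argument("--custom", default="customvocab.txt",
                        help="output path for the combined vocab")
    return parser


# Parse the same arguments as the command above, without touching sys.argv.
args = build_parser().parse_args([
    "rsc/my_conf/FinalMecabVocab.txt",
    "rsc/my_conf/ices_eng_vocab_1000.txt",
    "--custom=customvocab.txt",
])
print(args.mecab_vocab, args.english_vocab, args.custom)
```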
