# Vectorization
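* One-Hot Encoding
  * The vector dimension equals the size of the whole vocabulary (number of features = vector dimension = number of words in the vocabulary).
  * Assign each word a unique integer index (this mapping is the vocabulary), then build a vector whose element at that index is 1 and whose other elements are all 0.
* Document Term Matrix (DTM)
  * After assigning each word a unique index, store **for each document** the number of times that word occurs at that index. (How many times does a given word occur in document n?)
* TF-IDF (Term Frequency - Inverse Document Frequency)
  * The product of two values: TF (term frequency) and IDF (inverse document frequency).
  * It is used for tasks such as measuring document similarity and ranking results in a search system.
  * Because the result is already a vector, it can also be used as input to a neural network.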
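
To make the first two ideas concrete, here is a minimal sketch on a toy English corpus. The corpus, the whitespace tokenization, and the names `toy_vocab`, `word_index`, `one_hot`, and `dtm_row` are all assumptions made for illustration.

```python
# Toy corpus; whitespace tokenization is an assumption made for brevity.
corpus = ["apple banana apple", "banana cherry"]

# Vocabulary: each distinct word gets a unique integer index.
toy_vocab = sorted({word for doc in corpus for word in doc.split()})
word_index = {word: i for i, word in enumerate(toy_vocab)}

def one_hot(word):
    """Vector of vocabulary size with a 1 at the word's index."""
    vector = [0] * len(toy_vocab)
    vector[word_index[word]] = 1
    return vector

def dtm_row(doc):
    """Per-document counts, one column per vocabulary word."""
    row = [0] * len(toy_vocab)
    for word in doc.split():
        row[word_index[word]] += 1
    return row

dtm = [dtm_row(doc) for doc in corpus]

print(word_index)         # {'apple': 0, 'banana': 1, 'cherry': 2}
print(one_hot("banana"))  # [0, 1, 0]
print(dtm)                # [[2, 1, 0], [0, 1, 1]]
```

Each row of `dtm` is one document and each column is one vocabulary word, which is the layout the TF-IDF weighting starts from.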

## TF-IDF (Term Frequency - Inverse Document Frequency)
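TF-IDF builds on the DTM and gives extra weight to the important words. In short, a word that TF-IDF considers important gets a higher value, and a word that it considers unimportant gets a lower value.

The quantities behind tf-idf are defined as follows.

* $tf(d, t)$ : the number of times term $t$ occurs in document $d$, that is, the entry for that term in the DTM.
* $df(t)$ : the number of documents in which term $t$ occurs.
* $idf(t)$ : a value inversely related to $df(t)$.

The idf is defined as:

$$
idf(t) = \log\left(\frac{n}{1+df(t)}\right)
$$

Here $n$ is the total number of documents.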
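
As a quick numeric check of the idf formula, the sketch below evaluates it for every possible $df(t)$ when $n = 4$, the size of the four-verse corpus used later in this notebook. The helper name `idf` is only for this illustration.

```python
from math import log

def idf(n, df):
    """idf(t) = log(n / (1 + df(t))), using the natural logarithm."""
    return log(n / (1 + df))

n = 4  # assumed corpus size for this illustration
for df in range(1, n + 1):
    print(df, round(idf(n, df), 6))
# 1 0.693147
# 2 0.287682
# 3 0.0
# 4 -0.223144
```

Note the last line: because of the `1 +` in the denominator, a term that occurs in every document gets a slightly negative idf, so its tf-idf is negative as well.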

In [None]:
from math import log

# Number of times the term occurs in one document.
# Note: str.count counts substring matches, so '이' is also counted inside '백두산이'.
def term_frequency(term, document):
  return document.count(term)

# Number of documents (out of all `documents`) in which the term occurs
def document_frequency(term, documents):
  term_count = 0

  # Take the documents one at a time
  for document in documents:
    # `term in document` is a bool (substring test); adding it is the same as
    #   if term in document:
    #     term_count += 1
    term_count += term in document

  return term_count

def inverse_document_frequency(term, documents):
  N = len(documents) # total number of documents
  df = document_frequency(term, documents)

  # Compute and return the idf
  return log(N / (1 + df))

def tf_idf(term, documents, idx):
  # Pick the document of interest (document number idx)
  document = documents[idx]

  return term_frequency(term, document) * inverse_document_frequency(term, documents)
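
One thing to keep in mind: `document.count(term)` and `term in document` work on raw strings, so they match substrings rather than whole tokens. The sketch below contrasts that with counting exact tokens. `token_term_frequency` is a hypothetical helper that is not part of the notebook, and the whitespace split only stands in for a real tokenizer.

```python
def token_term_frequency(term, tokens):
    """Count exact token matches (hypothetical helper, not from the notebook)."""
    return tokens.count(term)

sentence = "백두산이 마르고 닳도록 하느님이 보우하사"
tokens = sentence.split()  # whitespace split stands in for a real tokenizer

print(sentence.count("이"))                # substring matches: 2
print(token_term_frequency("이", tokens))  # exact-token matches: 0
```

With the string-based functions, short terms such as '이' pick up counts from inside longer words, which affects both tf and df for those terms.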


# Analyzing the Aegukga (Korean national anthem) with TF-IDF

In [None]:
docs = [
  '동해 물과 백두산이 마르고 닳도록 하느님이 보우하사 우리나라 만세. 무궁화 삼천리 화려 강산 대한 사람, 대한으로 길이 보전하세. 동해 가고 싶다',
  '남산 위에 저 소나무, 철갑을 두른 듯 바람 서리 불변함은 우리 기상일세. 무궁화 삼천리 화려 강산 대한 사람, 대한으로 길이 보전하세. 소나무 이쁘다',
  '가을 하늘 공활한데 높고 구름 없이 밝은 달은 우리 가슴 일편단심일세. 무궁화 삼천리 화려 강산 대한 사람, 대한으로 길이 보전하세. 가을 하늘 보고 싶다.',
  '이 기상과 이 마음으로 충성을 다하여 괴로우나 즐거우나 나라 사랑하세. 무궁화 삼천리 화려 강산 대한 사람, 대한으로 길이 보전하세. 나라를 사랑하자'
] 

In [None]:
!pip install konlpy


In [None]:
from konlpy.tag import Okt
okt = Okt()

In [None]:
vocab = list(set(w for doc in docs for w in okt.nouns(doc)))
vocab

['바람',
 '데',
 '보우',
 '마르고',
 '사람',
 '길이',
 '기상',
 '화려',
 '불변',
 '일편단심',
 '저',
 '나라',
 '동해',
 '만세',
 '서리',
 '하사',
 '하늘',
 '사랑',
 '물',
 '함',
 '삼천리',
 '대한',
 '활',
 '강산',
 '가을',
 '우리나라',
 '무궁화',
 '철갑',
 '보고',
 '듯',
 '마음',
 '충성',
 '남산',
 '하느님',
 '구름',
 '이',
 '위',
 '백두산',
 '가슴',
 '우리',
 '달',
 '소나무',
 '보전']

In [None]:
# Sort in Korean alphabetical (ganada) order
vocab.sort()

In [None]:
vocab[:3]

['가슴', '가을', '강산']

Building the DTM
1. Compute the tf values for verse 1

In [None]:
# Verse 1
for vo in vocab:
  print(term_frequency(vo, docs[0]))

0
0
1
0
0
1
1
0
0
2
0
2
0
1
0
1
1
1
0
1
0
1
1
0
1
0
1
0
0
1
1
0
3
0
0
0
0
1
0
1
0
1
0


2. Collect the tf values of verse 1 into a list

In [None]:
tf_result = []

for vo in vocab:
  tf_result.append(term_frequency(vo, docs[0]))

print(tf_result)

[0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 0, 2, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 3, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0]


3. Collect the tf values of every verse

In [None]:
result = []

for doc in docs:
  tf_result = []
  for vo in vocab:
    tf_result.append(term_frequency(vo, doc))

  result.append(tf_result)

print(result)

[[0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 0, 2, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 3, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0], [0, 0, 1, 0, 1, 1, 0, 1, 0, 2, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 2, 1, 0, 1, 2, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0], [1, 2, 1, 1, 0, 1, 0, 0, 1, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 0, 2, 0, 0, 1, 1], [0, 0, 1, 0, 1, 1, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 2, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]]


In [None]:
import pandas as pd  # needed by this and the following DataFrame cells

pd.DataFrame(result, columns=vocab)

Unnamed: 0,가슴,가을,강산,구름,기상,길이,나라,남산,달,대한,데,동해,듯,마르고,마음,만세,무궁화,물,바람,백두산,보고,보우,보전,불변,사람,사랑,삼천리,서리,소나무,우리,우리나라,위,이,일편단심,저,철갑,충성,하느님,하늘,하사,함,화려,활
0,0,0,1,0,0,1,1,0,0,2,0,2,0,1,0,1,1,1,0,1,0,1,1,0,1,0,1,0,0,1,1,0,3,0,0,0,0,1,0,1,0,1,0
1,0,0,1,0,1,1,0,1,0,2,0,0,1,0,0,0,1,0,1,0,0,0,1,1,1,0,1,1,2,1,0,1,2,0,1,1,0,0,0,0,1,1,0
2,1,2,1,1,0,1,0,0,1,2,1,0,0,0,0,0,1,0,0,0,1,0,1,0,1,0,1,0,0,1,0,0,2,1,0,0,0,0,2,0,0,1,1
3,0,0,1,0,1,1,2,0,0,2,0,0,0,0,1,0,1,0,0,0,0,0,1,0,1,2,1,0,0,0,0,0,3,0,0,0,1,0,0,0,0,1,0


Checking the idf values in a DataFrame

In [None]:
result = []

for term in vocab:
  result.append(inverse_document_frequency(term, docs))

pd.DataFrame(result, index=vocab, columns=["IDF"])

Unnamed: 0,IDF
가슴,0.693147
가을,0.693147
강산,-0.223144
구름,0.693147
기상,0.287682
길이,-0.223144
나라,0.287682
남산,0.693147
달,0.693147
대한,-0.223144


tf-idf as a DataFrame

In [None]:
result = []

for i in range(len(docs)):
  tf_idf_result = []

  for vo in vocab:
    tf_idf_result.append(tf_idf(vo, docs, i))
  
  result.append(tf_idf_result)

pd.DataFrame(result, columns=vocab)

Unnamed: 0,가슴,가을,강산,구름,기상,길이,나라,남산,달,대한,데,동해,듯,마르고,마음,만세,무궁화,물,바람,백두산,보고,보우,보전,불변,사람,사랑,삼천리,서리,소나무,우리,우리나라,위,이,일편단심,저,철갑,충성,하느님,하늘,하사,함,화려,활
0,0.0,0.0,-0.223144,0.0,0.0,-0.223144,0.287682,0.0,0.0,-0.446287,0.0,1.386294,0.0,0.693147,0.0,0.693147,-0.223144,0.693147,0.0,0.693147,0.0,0.693147,-0.223144,0.0,-0.223144,0.0,-0.223144,0.0,0.0,0.0,0.693147,0.0,-0.669431,0.0,0.0,0.0,0.0,0.693147,0.0,0.693147,0.0,-0.223144,0.0
1,0.0,0.0,-0.223144,0.0,0.287682,-0.223144,0.0,0.693147,0.0,-0.446287,0.0,0.0,0.693147,0.0,0.0,0.0,-0.223144,0.0,0.693147,0.0,0.0,0.0,-0.223144,0.693147,-0.223144,0.0,-0.223144,0.693147,1.386294,0.0,0.0,0.693147,-0.446287,0.0,0.693147,0.693147,0.0,0.0,0.0,0.0,0.693147,-0.223144,0.0
2,0.693147,1.386294,-0.223144,0.693147,0.0,-0.223144,0.0,0.0,0.693147,-0.446287,0.693147,0.0,0.0,0.0,0.0,0.0,-0.223144,0.0,0.0,0.0,0.693147,0.0,-0.223144,0.0,-0.223144,0.0,-0.223144,0.0,0.0,0.0,0.0,0.0,-0.446287,0.693147,0.0,0.0,0.0,0.0,1.386294,0.0,0.0,-0.223144,0.693147
3,0.0,0.0,-0.223144,0.0,0.287682,-0.223144,0.575364,0.0,0.0,-0.446287,0.0,0.0,0.0,0.0,0.693147,0.0,-0.223144,0.0,0.0,0.0,0.0,0.0,-0.223144,0.0,-0.223144,1.386294,-0.223144,0.0,0.0,0.0,0.0,0.0,-0.669431,0.0,0.0,0.0,0.693147,0.0,0.0,0.0,0.0,-0.223144,0.0
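
Since TF-IDF vectors are often used to measure document similarity, here is a small sketch of cosine similarity between verses. To keep it short it uses only five columns (동해, 소나무, 가을, 나라, 사랑) copied from the table above, so the numbers are illustrative and not the similarity of the full 43-dimensional rows. The function name `cosine_similarity` and the zero-vector convention are choices made for this example.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Columns 동해, 소나무, 가을, 나라, 사랑 copied from the table above (a subset only).
verse_1 = [1.386294, 0.0, 0.0, 0.287682, 0.0]
verse_2 = [0.0, 1.386294, 0.0, 0.0, 0.0]
verse_4 = [0.0, 0.0, 0.0, 0.575364, 1.386294]

print(cosine_similarity(verse_1, verse_2))  # 0.0, no shared non-zero column
print(cosine_similarity(verse_1, verse_4))  # small positive value, both contain 나라
```

On these columns, verses 1 and 4 come out slightly similar because both have a non-zero weight for 나라, while verses 1 and 2 share nothing.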


# Implementing BoW with TensorFlow

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

t = Tokenizer()
t.fit_on_texts(docs)
print(t.word_index)

{'무궁화': 1, '삼천리': 2, '화려': 3, '강산': 4, '대한': 5, '사람': 6, '대한으로': 7, '길이': 8, '보전하세': 9, '동해': 10, '싶다': 11, '소나무': 12, '우리': 13, '가을': 14, '하늘': 15, '이': 16, '물과': 17, '백두산이': 18, '마르고': 19, '닳도록': 20, '하느님이': 21, '보우하사': 22, '우리나라': 23, '만세': 24, '가고': 25, '남산': 26, '위에': 27, '저': 28, '철갑을': 29, '두른': 30, '듯': 31, '바람': 32, '서리': 33, '불변함은': 34, '기상일세': 35, '이쁘다': 36, '공활한데': 37, '높고': 38, '구름': 39, '없이': 40, '밝은': 41, '달은': 42, '가슴': 43, '일편단심일세': 44, '보고': 45, '기상과': 46, '마음으로': 47, '충성을': 48, '다하여': 49, '괴로우나': 50, '즐거우나': 51, '나라': 52, '사랑하세': 53, '나라를': 54, '사랑하자': 55}


In [None]:
# Build the DTM
print(t.texts_to_matrix(docs, mode='count'))

[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1.
  1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 2. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 2. 2. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1.
  1. 1. 1. 1. 1. 1. 1. 1.]]


In [None]:
# Build the TF-IDF matrix
print(t.texts_to_matrix(docs, mode='tfidf'))

[[0.         0.58778666 0.58778666 0.58778666 0.58778666 0.58778666
  0.58778666 0.58778666 0.58778666 0.58778666 1.8601123  0.84729786
  0.         0.         0.         0.         0.         1.09861229
  1.09861229 1.09861229 1.09861229 1.09861229 1.09861229 1.09861229
  1.09861229 1.09861229 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.        ]
 [0.         0.58778666 0.58778666 0.58778666 0.58778666 0.58778666
  0.58778666 0.58778666 0.58778666 0.58778666 0.         0.
  1.8601123  0.84729786 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         1.09861229 1.09861229 1.09861229 1.09861229
  1.09861229 1.09861229 1.09861229 1.09861229 1.09861229 1.09861229
  1.09861229 0.    
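
These values are not the same as the hand-computed tf-idf from earlier, and none of them are negative. One reason is that `Tokenizer` splits on whitespace instead of extracting nouns, so the vocabulary differs. The other is the weighting itself. The sketch below assumes that the `tfidf` mode uses $tf = 1 + \ln(\text{count})$ and $idf = \ln\left(1 + \frac{n}{1 + df}\right)$. That is my reading of the Keras behavior and not something the notebook states, but it does reproduce the printed numbers. `keras_style_tfidf` is a name made up for this example.

```python
from math import log

def keras_style_tfidf(count, df, n):
    """Assumed weighting: (1 + ln(count)) * ln(1 + n / (1 + df)); 0 when count is 0."""
    if count == 0:
        return 0.0
    return (1 + log(count)) * log(1 + n / (1 + df))

n = 4  # four verses
print(keras_style_tfidf(1, 4, n))  # '무궁화': once in this verse, appears in all 4 verses
print(keras_style_tfidf(2, 1, n))  # '동해': twice in verse 1, appears in 1 verse
print(keras_style_tfidf(1, 2, n))  # '싶다': once in this verse, appears in 2 verses
print(keras_style_tfidf(1, 1, n))  # '물과': once in this verse, appears in 1 verse
```

Up to rounding, these correspond to the 0.58778666, 1.8601123, 0.84729786, and 1.09861229 entries in the first row of the printed matrix.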

In [None]:
pd.DataFrame([list(map(lambda x : term_frequency(x, doc), vocab)) for doc in docs], columns=vocab)

Unnamed: 0,가슴,가을,강산,구름,기상,길이,나라,남산,달,대한,데,동해,듯,마르고,마음,만세,무궁화,물,바람,백두산,보고,보우,보전,불변,사람,사랑,삼천리,서리,소나무,우리,우리나라,위,이,일편단심,저,철갑,충성,하느님,하늘,하사,함,화려,활
0,0,0,1,0,0,1,1,0,0,2,0,2,0,1,0,1,1,1,0,1,0,1,1,0,1,0,1,0,0,1,1,0,3,0,0,0,0,1,0,1,0,1,0
1,0,0,1,0,1,1,0,1,0,2,0,0,1,0,0,0,1,0,1,0,0,0,1,1,1,0,1,1,2,1,0,1,2,0,1,1,0,0,0,0,1,1,0
2,1,2,1,1,0,1,0,0,1,2,1,0,0,0,0,0,1,0,0,0,1,0,1,0,1,0,1,0,0,1,0,0,2,1,0,0,0,0,2,0,0,1,1
3,0,0,1,0,1,1,2,0,0,2,0,0,0,0,1,0,1,0,0,0,0,0,1,0,1,2,1,0,0,0,0,0,3,0,0,0,1,0,0,0,0,1,0
