### **TF-IDF = TF (Term Frequency) × IDF (Inverse Document Frequency)**  

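The TF value of a word grows the more often the word appears within a single document; the IDF value, conversely, grows the fewer documents in the whole collection contain that word.  
A word with a high IDF value is therefore a useful word for telling documents apart.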
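
To make the two factors concrete, here is a minimal sketch of the textbook definition, with TF as a raw count and IDF as `log(N / df)`. The toy corpus and the helper names `tf`, `idf`, and `tf_idf` are made up for illustration; scikit-learn uses a smoothed, normalized variant, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already split into tokens.
docs = [
    ["fruit", "vitamin", "vitamin"],
    ["fruit", "price"],
    ["fruit", "island"],
]

def tf(term, doc):
    """Raw term frequency: how many times `term` occurs in one document."""
    return Counter(doc)[term]

def idf(term, docs):
    """Textbook IDF: log(N / df), where df is the number of documents containing `term`."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "fruit" occurs in every document, so its IDF (and TF-IDF) is zero.
print(idf("fruit", docs))
# "vitamin" occurs twice in one document only, so it gets a high weight there.
print(tf_idf("vitamin", docs[0], docs))
```

A word that appears everywhere carries no weight, while a word concentrated in one document stands out, which is the behaviour described above.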

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
vectorizer = CountVectorizer()

In [3]:
documents = [
    "과일에는 비타민C가 다량 함유되어 있다.",
    "비타민C를 채우기 위해서는 다양한 방법이 있다. 비타오백을 마시는 방법, 비타민C가 함유된 건강보조식품을 먹는 방법 등.",
    "동남아에 가면 대체로 한국보다 과일을 저렴하게 살 수 있다.",
    "비타민은 여러 종류가 있다.  비타민A, 비타민B, 비타민C, 비타민D...",
    "한라봉은 제주 특산품으로, 많은 사람들이 즐겨 찾는 과일이다."
]

In [4]:
# Count how often each word occurs in each document (rows = documents, columns = words)
print(vectorizer.fit_transform(documents).toarray())

[[0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1
  0]
 [0 1 0 0 0 0 1 0 0 1 0 1 2 1 0 0 0 1 1 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0
  1]
 [1 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0
  0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0
  0]
 [0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 1 0 1 0 1 0
  0]]


In [5]:
# Print the column index that each word is mapped to
print(vectorizer.vocabulary_)

{'과일에는': 2, '비타민c가': 17, '다량': 5, '함유되어': 35, '있다': 25, '비타민c를': 18, '채우기': 31, '위해서는': 24, '다양한': 6, '방법이': 13, '비타오백을': 21, '마시는': 9, '방법': 12, '함유된': 36, '건강보조식품을': 1, '먹는': 11, '동남아에': 8, '가면': 0, '대체로': 7, '한국보다': 33, '과일을': 3, '저렴하게': 26, '비타민은': 20, '여러': 23, '종류가': 28, '비타민a': 14, '비타민b': 15, '비타민c': 16, '비타민d': 19, '한라봉은': 34, '제주': 27, '특산품으로': 32, '많은': 10, '사람들이': 22, '즐겨': 29, '찾는': 30, '과일이다': 4}
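
The vocabulary above contains `비타민c가` in lowercase and has no entries for the one-character words `살`, `수`, and `등`. That is consistent with the vectorizer's default settings: text is lowercased and the default `token_pattern` keeps only tokens of two or more word characters. The sketch below imitates that behaviour with the standard `re` module; `default_like_tokens` is a hypothetical helper, not a scikit-learn function.

```python
import re

# Default token_pattern documented for CountVectorizer / TfidfVectorizer.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def default_like_tokens(text):
    """Approximate the default analyzer: lowercase, then keep tokens of 2+ word characters."""
    return re.findall(TOKEN_PATTERN, text.lower())

sentence = "동남아에 가면 대체로 한국보다 과일을 저렴하게 살 수 있다."
tokens = default_like_tokens(sentence)
print(tokens)  # the one-character words "살" and "수" are dropped
```

Because tokens are split on whitespace and punctuation only, inflected forms such as `과일에는`, `과일을`, and `과일이다` remain separate vocabulary entries.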


In [8]:
# scikit-learn provides TfidfVectorizer, which computes TF-IDF automatically.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "과일에는 비타민C가 다량 함유되어 있다.",
    "비타민C를 채우기 위해서는 다양한 방법이 있다. 비타오백을 마시는 방법, 비타민C가 함유된 건강보조식품을 먹는 방법 등.",
    "동남아에 가면 대체로 한국보다 과일을 저렴하게 살 수 있다.",
    "비타민은 여러 종류가 있다.  비타민A, 비타민B, 비타민C, 비타민D...",
    "한라봉은 제주 특산품으로, 많은 사람들이 즐겨 찾는 과일이다."
]

tfidfv = TfidfVectorizer().fit(documents)
print(tfidfv.transform(documents).toarray())
print(tfidfv.vocabulary_)

[[0.         0.         0.50199209 0.         0.         0.50199209
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.40500406
  0.         0.         0.         0.         0.         0.
  0.         0.28281359 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.50199209
  0.        ]
 [0.         0.25847202 0.         0.         0.         0.
  0.25847202 0.         0.         0.25847202 0.         0.25847202
  0.51694403 0.25847202 0.         0.         0.         0.20853359
  0.25847202 0.         0.         0.25847202 0.         0.
  0.25847202 0.14561862 0.         0.         0.         0.
  0.         0.25847202 0.         0.         0.         0.
  0.25847202]
 [0.39786049 0.         0.         0.39786049 0.         0.
  0.         0.39786049 0.39786049 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.      
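
The weights printed above can be reproduced by hand. The following is a sketch under the assumption that `TfidfVectorizer` runs with its defaults: raw counts for TF, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document row. The names `tokenize` and `tfidf_row` are illustrative helpers, and the tokenizer is a regex approximation of the default analyzer.

```python
import math
import re
from collections import Counter

documents = [
    "과일에는 비타민C가 다량 함유되어 있다.",
    "비타민C를 채우기 위해서는 다양한 방법이 있다. 비타오백을 마시는 방법, 비타민C가 함유된 건강보조식품을 먹는 방법 등.",
    "동남아에 가면 대체로 한국보다 과일을 저렴하게 살 수 있다.",
    "비타민은 여러 종류가 있다.  비타민A, 비타민B, 비타민C, 비타민D...",
    "한라봉은 제주 특산품으로, 많은 사람들이 즐겨 찾는 과일이다.",
]

def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters (approximates the default analyzer).
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]
n_docs = len(tokenized)

# Document frequency: the number of documents in which each term occurs.
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed IDF, as used when smooth_idf=True.
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_row(tokens):
    """TF (raw count) times IDF, scaled so that the row has unit L2 norm."""
    counts = Counter(tokens)
    raw = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]
for term, weight in sorted(rows[0].items()):
    print(f"{term}: {weight:.8f}")
```

For the first document, `있다` occurs in four of the five documents and so receives the smallest weight, while words unique to that document receive the largest, matching the pattern in the first row of the array above.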

In [10]:
import pandas as pd

result = tfidfv.transform(documents).toarray()
# vocabulary_ is a dict in insertion order, not column order, so using it for the
# labels mislabels the columns; get_feature_names_out() returns the terms sorted by
# column index (scikit-learn >= 1.0; older releases expose get_feature_names()).
tfidf_ = pd.DataFrame(result, columns=tfidfv.get_feature_names_out())
tfidf_

,가면,건강보조식품을,과일에는,과일을,과일이다,다량,다양한,대체로,동남아에,마시는,...,제주,종류가,즐겨,찾는,채우기,특산품으로,한국보다,한라봉은,함유되어,함유된
0,0.0,0.0,0.501992,0.0,0.0,0.501992,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.501992,0.0
1,0.0,0.258472,0.0,0.0,0.0,0.0,0.258472,0.0,0.0,0.258472,...,0.0,0.0,0.0,0.0,0.258472,0.0,0.0,0.0,0.0,0.258472
2,0.39786,0.0,0.0,0.39786,0.0,0.0,0.0,0.39786,0.39786,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.39786,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.369676,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.353553,0.0,0.0,0.0,0.0,0.0,...,0.353553,0.0,0.353553,0.353553,0.0,0.353553,0.0,0.353553,0.0,0.0
