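### What is TfidfVectorizer?
* TfidfVectorizer is a scikit-learn tool that converts text data into numbers.
* Machine learning models cannot work with raw text directly, so it computes an importance weight for each word and represents the text numerically.
    * Input: a collection of documents (e.g., reviews, news articles)
    * Output: one numeric vector per document, where each entry is a word's importance weight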

#### Why use TfidfVectorizer?
* Text similarity (search, recommender systems)
* Spam filtering (detecting important words)
* Sentiment analysis (extracting positive/negative keywords)

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

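docs = [
    "강아지 고양이 반려동물",  # dog, cat, pet
    "강아지 우유 마시기",      # dog, milk, drinking
    "고양이 낮잠 자기"         # cat, nap, sleeping
]

# 1. Fit TfidfVectorizer and transform the documents
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs)

# 2. Result (weight per word); features are sorted by code point
print(tfidf.get_feature_names_out())
# Output: ['강아지' '고양이' '낮잠' '마시기' '반려동물' '우유' '자기']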

tfidf_matrix.toarray()

['강아지' '고양이' '낮잠' '마시기' '반려동물' '우유' '자기']


array([[0.51785612, 0.51785612, 0.        , 0.        , 0.68091856,
        0.        , 0.        ],
       [0.4736296 , 0.        , 0.        , 0.62276601, 0.        ,
        0.62276601, 0.        ],
       [0.        , 0.4736296 , 0.62276601, 0.        , 0.        ,
        0.        , 0.62276601]])

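#### Korean text + stop-word removal
* Goal: drop particles such as 를 and 은 so that the weights focus on nouns.
* Caveat: `stop_words` only removes whole tokens. Korean particles are attached to the preceding word (우유를, 고양이는), so the list below removes nothing, and the default tokenizer already discards one-character tokens. Splitting particles off requires a morphological analyzer or a custom tokenizer.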

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Korean documents
docs = [
    "강아지 고양이 반려동물",
    "강아지 우유를 마신다",
    "고양이는 낮잠을 잔다"
]

# Stop words (particles such as "를" and "은"); matched against whole tokens only
stop_words = ["를", "은", "는", "을"]

# TF-IDF transform with the stop-word list applied
tfidf = TfidfVectorizer(stop_words=stop_words)
tfidf_matrix = tfidf.fit_transform(docs)

print("Vocabulary:", tfidf.get_feature_names_out())
print("TF-IDF matrix:\n")
tfidf_matrix.toarray()

Vocabulary: ['강아지' '고양이' '고양이는' '낮잠을' '마신다' '반려동물' '우유를' '잔다']
TF-IDF matrix:



array([[0.4736296 , 0.62276601, 0.        , 0.        , 0.        ,
        0.62276601, 0.        , 0.        ],
       [0.4736296 , 0.        , 0.        , 0.        , 0.62276601,
        0.        , 0.62276601, 0.        ],
       [0.        , 0.        , 0.57735027, 0.57735027, 0.        ,
        0.        , 0.        , 0.57735027]])
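The vocabulary above still contains 고양이는, 낮잠을, and 우유를, because the particles are attached to the nouns. As a rough sketch of one workaround, the particles could be stripped with a small suffix rule before vectorizing. The `PARTICLES` tuple and the `strip_particle` and `tokenize` helpers are hypothetical names introduced here for illustration; a real pipeline would more likely rely on a morphological analyzer.

```python
# Hypothetical, deliberately small list of one-syllable particles to strip.
PARTICLES = ("를", "을", "은", "는")

def strip_particle(token):
    """Remove one trailing particle if at least 2 characters remain."""
    for particle in PARTICLES:
        if token.endswith(particle) and len(token) - len(particle) >= 2:
            return token[: -len(particle)]
    return token

def tokenize(doc):
    """Split on whitespace, then strip a trailing particle from each token."""
    return [strip_particle(tok) for tok in doc.split()]

docs = [
    "강아지 고양이 반려동물",
    "강아지 우유를 마신다",
    "고양이는 낮잠을 잔다",
]
for doc in docs:
    print(tokenize(doc))
```

A function like `tokenize` could be passed to `TfidfVectorizer` through its `tokenizer` parameter. Suffix rules are fragile, though: adding 이 to the list, for example, would truncate 고양이 to 고양.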

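#### Applying n-grams (grouping consecutive words)
* Word pairs such as "machine learning" and "love machine" are added as features.
* This captures some local word order and context that single words miss.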

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "I love machine learning",
    "I hate boring lectures",
    "machine learning is fun"
]

# Use 1-grams and 2-grams (single words plus pairs of adjacent words)
tfidf = TfidfVectorizer(ngram_range=(1, 2))
tfidf_matrix = tfidf.fit_transform(docs)

print("Unigrams/bigrams:", tfidf.get_feature_names_out())
print("TF-IDF matrix:\n")
tfidf_matrix.toarray()

Unigrams/bigrams: ['boring' 'boring lectures' 'fun' 'hate' 'hate boring' 'is' 'is fun'
 'learning' 'learning is' 'lectures' 'love' 'love machine' 'machine'
 'machine learning']
TF-IDF matrix:



array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.3935112 , 0.        , 0.        ,
        0.51741994, 0.51741994, 0.3935112 , 0.3935112 ],
       [0.4472136 , 0.4472136 , 0.        , 0.4472136 , 0.4472136 ,
        0.        , 0.        , 0.        , 0.        , 0.4472136 ,
        0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.41756662, 0.        , 0.        ,
        0.41756662, 0.41756662, 0.31757018, 0.41756662, 0.        ,
        0.        , 0.        , 0.31757018, 0.31757018]])
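To see where the 14 features come from, the n-gram step can be sketched in plain Python. This is an approximation of the default behavior (lowercasing, then keeping tokens of two or more word characters, which is why "I" never appears), not scikit-learn's actual implementation; `word_ngrams` is a hypothetical helper name.

```python
import re

def word_ngrams(text, ngram_range=(1, 2)):
    """Return word n-grams for every n in ngram_range (inclusive)."""
    # Lowercase, then keep tokens with at least two word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

docs = [
    "I love machine learning",
    "I hate boring lectures",
    "machine learning is fun",
]
print(word_ngrams(docs[0]))

# The vocabulary is the sorted union of n-grams over all documents.
vocabulary = sorted({gram for doc in docs for gram in word_ngrams(doc)})
print(len(vocabulary))
```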

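### What is TF-IDF?
##### TF (Term Frequency)
* How often a word appears within a single document.
    * Example: if "강아지" (dog) appears 5 times in a document, TF = 5
##### IDF (Inverse Document Frequency)
* Whether a word is common or rare across all documents.
    * Example: "그리고" (and) appears in nearly every document, so its weight goes down
    * "반려동물" (pet) appears only in certain documents, so its weight goes up
##### TF-IDF = TF × IDF
* Words that are frequent within a document but concentrated in only a few documents receive high scores.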
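The product can be worked through by hand. The sketch below assumes scikit-learn's default settings: raw counts for TF, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It also assumes plain whitespace tokenization, which happens to give the same tokens as the default tokenizer for these three documents. `tfidf_rows` is a hypothetical helper name.

```python
import math

def tfidf_rows(docs):
    """Compute smoothed, L2-normalized TF-IDF rows for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {term: sum(term in toks for toks in tokenized) for term in vocab}
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for toks in tokenized:
        raw = [toks.count(term) * idf[term] for term in vocab]
        norm = math.sqrt(sum(w * w for w in raw))
        rows.append([w / norm if norm else 0.0 for w in raw])
    return vocab, rows

docs = ["강아지 고양이 반려동물", "강아지 우유 마시기", "고양이 낮잠 자기"]
vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    print(term, round(weight, 4))
```

In the first document, 반려동물 appears in only one of the three documents, so it ends up with a higher weight than 강아지 and 고양이, which each appear in two.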

In [None]:
# Usable for review analysis, spam filtering, and similar tasks.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load a CSV file (e.g., review data)
data = pd.read_csv("data/reviews.csv")  # assumes the text is in the "text" column
texts = data["text"].tolist()
texts

['이 제품 정말 좋아요! 강력 추천합니다.',
 '배송이 느렸지만 제품은 괜찮아요.',
 '가격 대비 성능이 아쉽습니다.',
 '완벽한 제품이에요. 다음에도 구매할 거예요!',
 '고객 서비스가 별로였어요.']

In [5]:
# TF-IDF transform
tfidf = TfidfVectorizer(max_features=100)  # keep only the 100 most frequent terms in the corpus
tfidf_matrix = tfidf.fit_transform(texts)

# Convert to a DataFrame
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=tfidf.get_feature_names_out()
)
tfidf_df.head()

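| | 가격 | 강력 | 거예요 | 고객 | 괜찮아요 | 구매할 | 느렸지만 | 다음에도 | 대비 | 배송이 | ... | 서비스가 | 성능이 | 아쉽습니다 | 완벽한 | 정말 | 제품 | 제품은 | 제품이에요 | 좋아요 | 추천합니다 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.447214 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.447214 | 0.447214 | 0.0 | 0.0 | 0.447214 | 0.447214 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.5 | 0.0 | 0.0 | 0.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 |
| 2 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | ... | 0.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.447214 | 0.0 | 0.0 | 0.447214 | 0.0 | 0.447214 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.447214 | 0.0 | 0.0 | 0.0 | 0.447214 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.57735 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |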
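With only five short reviews, `max_features=100` does not actually drop anything. As a rough illustration of what the cap does when it applies, the sketch below ranks terms by their total count across the corpus and keeps the top few. The toy documents and the `top_terms` helper are made up for this example, and tie-breaking between equally frequent terms is not modeled.

```python
from collections import Counter

def top_terms(docs, max_features):
    """Keep the max_features most frequent terms, returned in sorted order."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    return sorted(term for term, _ in counts.most_common(max_features))

# Made-up toy corpus: "good" occurs 3 times, "price" twice, the rest once.
docs = ["good product good price", "good delivery", "bad price"]
print(top_terms(docs, 2))
```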


#### Similarity (cosine similarity)
* Useful for search engines and document clustering.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Compare the similarity of two documents
doc1 = "I love python programming"
doc2 = "Python programming is fun"

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform([doc1, doc2])

# Compute cosine similarity (1.0 = same direction, 0.0 = no shared terms)
similarity = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])
print("Document similarity:", similarity[0][0])  # about 0.41

Document similarity: 0.4112070550676187
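The score can be reproduced without scikit-learn, which makes it clearer what is being measured. The sketch below assumes the default smoothed IDF with two documents and writes out the unnormalized weights by hand; `cosine` is a hypothetical helper name. Only "python" and "programming" are shared, so only those two terms contribute to the dot product.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Smoothed IDF with n = 2 documents: ln((1 + 2) / (1 + df)) + 1
shared = math.log(3 / 3) + 1  # term appears in both documents (df = 2)
unique = math.log(3 / 2) + 1  # term appears in one document (df = 1)

# Weights over the vocabulary ['fun', 'is', 'love', 'programming', 'python'];
# "I" is dropped because one-character tokens are ignored by default.
doc1 = [0.0, 0.0, unique, shared, shared]
doc2 = [unique, unique, 0.0, shared, shared]

similarity = cosine(doc1, doc2)
print(round(similarity, 4))
```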
