## Concept references

TFIDF  
http://euriion.com/?p=411929

truncated SVD  
https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/04/06/pcasvdlsa/

## Function option references

TfidfVectorizer  
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

TruncatedSVD  
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

In [1]:
import os
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

In [3]:
train = pd.read_csv('./train/train.csv')
test = pd.read_csv('./test/test.csv')

## Handling missing values

In [4]:
train_desc = train.Description.fillna("none").values
test_desc = test.Description.fillna("none").values

## TFIDF settings

In [5]:
tfv = TfidfVectorizer(min_df=3,  max_features=None,
        strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
        ngram_range=(1, 3), use_idf=1, smooth_idf=1, sublinear_tf=1,
        stop_words = 'english')

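- min_df : when given an integer n, terms that appear in fewer than n documents are ignored; an integer is an absolute document count (document frequency), not a proportion.
- max_features : default None; when set to n, the vocabulary keeps only the top n terms ordered by term frequency across the corpus.
- token_pattern : the regular expression used for tokenization.  
A site for testing regular expressions:  
https://rubular.com/r/mpI2mrlXe3
- ngram_range : the (min_n, max_n) range of n-gram sizes to extract; (1, 3) covers unigrams, bigrams, and trigrams.
- use_idf : default True; enables inverse-document-frequency reweighting.
- smooth_idf : default True; if True, adds 1 to every document frequency, which smooths the idf weights and prevents division by zero.
- sublinear_tf : default False; if True, applies sublinear tf scaling, i.e. replaces tf with 1 + log(tf).

The call above passes the integer 1 for use_idf, smooth_idf, and sublinear_tf. The recorded run accepted this, but newer scikit-learn releases validate parameter types, so True is likely the safer spelling.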
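To make the effect of smooth_idf and sublinear_tf concrete, here is a minimal pure-Python sketch of the two weighting formulas as scikit-learn documents them (natural logarithm). The helper names and the document counts are hypothetical, chosen only for illustration. The sketch also leaves out the final `norm='l2'` row normalization that the vectorizer applies by default.

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """idf with smooth_idf=True: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0


def sublinear_tf(term_count):
    """tf with sublinear_tf=True: 1 + ln(tf) for tf > 0, otherwise 0."""
    return 1.0 + math.log(term_count) if term_count > 0 else 0.0


# Hypothetical corpus of 4 documents.
n_docs = 4
rare_idf = smoothed_idf(n_docs, doc_freq=1)    # term found in 1 document
common_idf = smoothed_idf(n_docs, doc_freq=4)  # term found in every document

# Unnormalized weight of a rare term that occurs 8 times in one document.
raw_weight = sublinear_tf(8) * rare_idf

print("rare idf:", rare_idf)
print("common idf:", common_idf)
print("raw weight:", raw_weight)
```

A term that occurs in every document still gets an idf of 1 rather than 0, so it is down-weighted but not discarded. The sublinear scaling keeps a term repeated 8 times from counting 8 times as much as a term seen once.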

N-gram example

"This is a sentence"

N=1  
this, is, a, sentence

N=2  
this is, is a, a sentence

N=3  
this is a, is a sentence
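The listing above can be reproduced with a few lines of plain Python. This is only a sketch of the idea, with a hypothetical helper `word_ngrams` that lowercases and splits on whitespace; it is not how TfidfVectorizer tokenizes internally.

```python
def word_ngrams(text, n):
    """Return the word n-grams of `text` as space-joined strings."""
    tokens = text.lower().split()
    # Slide a window of n tokens across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "This is a sentence"
for n in (1, 2, 3):
    print("N={}: {}".format(n, ", ".join(word_ngrams(sentence, n))))
```

With `stop_words='english'`, as in the settings above, the vectorizer removes stop words before it builds n-grams, so the features it extracts from this sentence would differ from the plain listing.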

## Applying TFIDF

In [6]:
# Fit TFIDF
tfv.fit(list(train_desc)) # Learn vocabulary and idf from training set.

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=3,
        ngram_range=(1, 3), norm='l2', preprocessor=None, smooth_idf=1,
        stop_words='english', strip_accents='unicode', sublinear_tf=1,
        token_pattern='\\w{1,}', tokenizer=None, use_idf=1,
        vocabulary=None)

In [7]:
X =  tfv.transform(train_desc) # Transform documents to document-term matrix.
X_test = tfv.transform(test_desc)
print("X (tfidf):", X.shape)
print("Terms: {:,}".format(X.shape[1]))

X (tfidf): (14993, 50192)
Terms: 50,192


## TruncatedSVD settings

In [10]:
TruncatedSVD_k = 10
svd = TruncatedSVD(n_components=TruncatedSVD_k)

## Applying TruncatedSVD

In [11]:
svd.fit(X) # Fit LSI model on training data X.
X = svd.transform(X) # Perform dimensionality reduction on X.
X_test = svd.transform(X_test)
print("X (svd):", X.shape)

X (svd): (14993, 10)


## Checking the explained variance

In [12]:
print(svd.explained_variance_ratio_.sum())
# int(svd.explained_variance_ratio_)

0.14247590492319592
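The value above says that the 10 components retain roughly 14% of the variance of the TFIDF matrix. To see how that share grows with the number of components, one option is to fit a larger TruncatedSVD once and look at the cumulative sum of `explained_variance_ratio_`. The sketch below does this on a small made-up corpus. The documents, `X_toy`, `svd_toy`, and the choice of 5 components are all assumptions for illustration, not part of the original run.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the Description column.
docs = [
    "friendly puppy looking for a home",
    "playful kitten looking for a home",
    "healthy puppy vaccinated and dewormed",
    "shy kitten vaccinated and healthy",
    "rescued dog needs a loving family",
    "rescued cat needs a loving family",
]

X_toy = TfidfVectorizer().fit_transform(docs)

# Fit once with the largest k of interest; random_state makes the run repeatable.
svd_toy = TruncatedSVD(n_components=5, random_state=0)
svd_toy.fit(X_toy)

# Cumulative share of variance retained by the first k components.
cumulative = np.cumsum(svd_toy.explained_variance_ratio_)
for k, ratio in enumerate(cumulative, start=1):
    print("k={}: cumulative explained variance ratio = {:.3f}".format(k, ratio))
```

On the real data the same loop, with a larger `n_components`, would show how many components are needed to reach a chosen share of the variance. Whether a higher share actually helps the downstream model is a separate question that only validation can answer.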


## Attaching the features to the data

In [13]:
X = pd.DataFrame(X, columns=['svd_{}'.format(i) for i in range(TruncatedSVD_k)])
train = pd.concat((train, X), axis=1)

X_test = pd.DataFrame(X_test, columns=['svd_{}'.format(i) for i in range(TruncatedSVD_k)])
test = pd.concat((test, X_test), axis=1)

print("train:", train.shape)

train: (14993, 34)
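The concatenation above lines rows up by index. That works here because `train` and the new SVD frame both carry the default 0..n-1 index. If rows had been filtered first, the indexes would no longer match and `pd.concat` would silently produce extra rows filled with NaN. A minimal sketch of the pitfall, using hypothetical frames `base` and `feats`:

```python
import pandas as pd

# Hypothetical frame whose index has gaps, as after filtering rows.
base = pd.DataFrame({"id": ["a", "b", "c"]}, index=[0, 2, 5])
# New features arrive with a fresh 0..n-1 index, like pd.DataFrame(X, ...).
feats = pd.DataFrame({"svd_0": [0.1, 0.2, 0.3]})

# Index labels differ, so concat takes their union and pads with NaN.
misaligned = pd.concat((base, feats), axis=1)

# Resetting the index first restores a row-by-row match.
aligned = pd.concat((base.reset_index(drop=True), feats), axis=1)

print("misaligned:", misaligned.shape)
print("aligned:", aligned.shape)
```

Comparing the row count before and after the concat, as the `print` of `train.shape` does above, is a cheap way to catch this.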
