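# 1. Singular Value Decomposition (SVD)
SVD factors an m x n matrix A into a product of three matrices.   
U (m x m) and V^T (n x n) are orthogonal matrices, and Σ (m x n) is a rectangular diagonal matrix.
$$A = U\Sigma V^{T}$$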

> **1) Transposed Matrix**   
> The rows and columns of the original matrix are swapped.
> $$M = \begin{bmatrix} 1&2\\3&4\\5&6\\ \end{bmatrix},\, M^{T} = \begin{bmatrix} 1&3&5\\2&4&6\\ \end{bmatrix}$$

> **2) Identity Matrix**   
> A square matrix whose main-diagonal entries are all 1 and whose other entries are all 0.   
> $$I = \begin{bmatrix} 1&0&0\\0&1&0\\0&0&1 \end{bmatrix}$$

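> **3) Inverse Matrix**   
> Multiplying a square matrix by its inverse gives the identity matrix. An inverse exists only when the matrix is non-singular (its determinant is not 0).
> $$A \times A^{-1} = I$$   
> $$\begin{bmatrix} 1&2\\3&4 \end{bmatrix} \times \begin{bmatrix} -2&1\\1.5&-0.5 \end{bmatrix} = \begin{bmatrix} 1&0\\0&1 \end{bmatrix}$$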
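
The definition can be checked numerically. The following is a minimal sketch, with illustrative variable names: it inverts a small 2 x 2 matrix and, for contrast, shows that the 3 x 3 matrix with entries 1 through 9 has rank 2, so it is singular and has no inverse.

```python
import numpy as np

# A small invertible matrix and its inverse
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
A_inv = np.linalg.inv(A)
print(A_inv)
print(np.allclose(A @ A_inv, np.eye(2)))  # A times its inverse is the identity

# The 3 x 3 matrix 1..9 is rank-deficient (rank 2 < 3), hence not invertible
B = np.arange(1, 10, dtype=float).reshape(3, 3)
rank_B = np.linalg.matrix_rank(B)
print(rank_B)
```

Checking the rank is used here instead of calling `np.linalg.inv(B)`, since floating-point round-off can make an inversion of a singular matrix return meaningless huge values rather than fail.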

> **4) Orthogonal Matrix**   
> A square matrix whose product with its own transpose, in either order, is the identity matrix. Its transpose is therefore its inverse.
> $$A \times A^{T} = I,\, A^{T} \times A = I$$   
> $$A^{-1} = A^{T}$$

> **5) Diagonal Matrix**   
> Every entry off the main diagonal is 0.   
> In SVD the diagonal entries are the singular values, sorted in descending order (a ≥ b ≥ c ≥ 0).
> $$\Sigma = \begin{bmatrix} a&0&0\\0&b&0\\0&0&c \end{bmatrix}$$
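
The properties listed above can be verified on any matrix. This is a minimal sketch on a hypothetical random 3 x 5 matrix (the seed and shape are arbitrary choices): it checks that U and VT are orthogonal, that the transpose of U equals its inverse, that the singular values are non-increasing, and that the three factors multiply back to A.

```python
import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed, for reproducibility only
A = rng.normal(size=(3, 5))         # hypothetical m x n matrix (m=3, n=5)

U, s, VT = np.linalg.svd(A, full_matrices=True)

# np.linalg.svd returns the singular values as a 1-D array,
# so the m x n matrix Sigma has to be assembled by hand.
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

u_orthogonal = np.allclose(U @ U.T, np.eye(3))
vt_orthogonal = np.allclose(VT @ VT.T, np.eye(5))
inverse_is_transpose = np.allclose(np.linalg.inv(U), U.T)
descending = bool(np.all(np.diff(s) <= 0))
reconstructs = np.allclose(A, U @ Sigma @ VT)

print(u_orthogonal, vt_orthogonal, inverse_is_transpose, descending, reconstructs)
```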

# 2. Truncated SVD
![Truncated SVD](https://miro.medium.com/max/398/0*dLoOJxagJw9Fwrfq.PNG "Truncated SVD")

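Truncated SVD keeps only the top t singular values of the diagonal matrix; because information is discarded, the original matrix can no longer be recovered exactly.      
A larger t retains more of the information in the original matrix, while a smaller t removes more noise.   
In image processing this amounts to noise removal; in natural language processing it amounts to dropping information with low explanatory power.   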
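
To make the trade-off in t concrete, here is a minimal sketch on a hypothetical random matrix (the helper name `truncated_reconstruction`, the seed, and the shape are all illustrative). It measures the reconstruction error for each t and compares it with the value predicted from the discarded singular values: in the Frobenius norm, the error of the rank-t approximation equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

def truncated_reconstruction(A, t):
    """Return the rank-t approximation of A and all singular values of A."""
    U, s, VT = np.linalg.svd(A, full_matrices=False)
    A_t = U[:, :t] @ np.diag(s[:t]) @ VT[:t, :]
    return A_t, s

rng = np.random.default_rng(42)   # arbitrary seed, for reproducibility only
A = rng.normal(size=(6, 8))       # hypothetical 6 x 8 matrix

errors, predicted = [], []
for t in range(1, 7):
    A_t, s = truncated_reconstruction(A, t)
    errors.append(np.linalg.norm(A - A_t))          # Frobenius norm of the residual
    predicted.append(np.sqrt(np.sum(s[t:] ** 2)))   # from the discarded singular values
    print(f"t={t}: error={errors[-1]:.4f}, predicted={predicted[-1]:.4f}")
```

The error shrinks as t grows and reaches (numerically) zero once all singular values are kept.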

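# 3. Latent Semantic Analysis (LSA)
BoW-based representations such as the DTM (document-term matrix) and TF-IDF only quantify word frequencies, so they cannot take word meaning into account.   
Latent Semantic Analysis (LSA) is used as an alternative that draws out latent meaning.   
LSA applies truncated SVD to the DTM or TF-IDF matrix to reduce its dimensionality and bring out the latent meaning of the words.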

Document|과일이 (fruit)|길고 (long)|노란 (yellow)|먹고 (eat)|바나나 (banana)|사과 (apple)|싶은 (want)|저는 (I)|좋아요 (like)
:--|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:
Doc 1|0|0|0|1|0|1|1|0|0
Doc 2|0|0|0|1|1|0|1|0|0
Doc 3|0|1|1|0|2|0|0|0|0
Doc 4|1|0|0|0|0|0|0|1|1

In [1]:
import numpy as np
A = np.array([ [0,0,0,1,0,1,1,0,0],
               [0,0,0,1,1,0,1,0,0],
               [0,1,1,0,2,0,0,0,0],
               [1,0,0,0,0,0,0,1,1] ])
np.shape(A)

(4, 9)

In [2]:
# Perform full SVD
U, s, VT = np.linalg.svd(A, full_matrices=True)

In [3]:
# Orthogonal matrix U
print(U.round(2))
np.shape(U)

[[ 0.24  0.75  0.    0.62]
 [ 0.51  0.44 -0.   -0.74]
 [ 0.83 -0.49 -0.    0.27]
 [ 0.   -0.    1.   -0.  ]]


(4, 4)

In [4]:
# Singular values (the diagonal of S), returned as a 1-D array
print(s.round(2))
np.shape(s)

[2.69 2.05 1.73 0.77]


(4,)

In [5]:
# Build the 4 x 9 diagonal matrix S from the singular values
S = np.zeros((4,9))
S[:4, :4] = np.diag(s)

print(S.round(2))
np.shape(S)

[[2.69 0.   0.   0.   0.   0.   0.   0.   0.  ]
 [0.   2.05 0.   0.   0.   0.   0.   0.   0.  ]
 [0.   0.   1.73 0.   0.   0.   0.   0.   0.  ]
 [0.   0.   0.   0.77 0.   0.   0.   0.   0.  ]]


(4, 9)

In [6]:
# Orthogonal matrix VT
print(VT.round(2))
np.shape(VT)

[[ 0.    0.31  0.31  0.28  0.8   0.09  0.28  0.    0.  ]
 [ 0.   -0.24 -0.24  0.58 -0.26  0.37  0.58 -0.   -0.  ]
 [ 0.58 -0.    0.    0.   -0.    0.   -0.    0.58  0.58]
 [-0.    0.35  0.35 -0.16 -0.25  0.8  -0.16  0.    0.  ]
 [-0.   -0.78 -0.01 -0.2   0.4   0.4  -0.2   0.    0.  ]
 [-0.29  0.31 -0.78 -0.24  0.23  0.23  0.01  0.14  0.14]
 [-0.29 -0.1   0.26 -0.59 -0.08 -0.08  0.66  0.14  0.14]
 [-0.5  -0.06  0.15  0.24 -0.05 -0.05 -0.19  0.75 -0.25]
 [-0.5  -0.06  0.15  0.24 -0.05 -0.05 -0.19 -0.25  0.75]]


(9, 9)

In [7]:
# Reconstruct the original matrix (no rounding, so the comparison is not masked)
np.allclose(A, np.dot(np.dot(U, S), VT))

True

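In truncated SVD, each row of the reduced U is a document vector that numerically encodes latent meaning.   
Each column of the reduced VT is a word vector that numerically encodes latent meaning.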

In [8]:
# Perform truncated SVD (keep the top t = 2 singular values)
S = S[:2, :2]
print(S.round(2))

[[2.69 0.  ]
 [0.   2.05]]


In [9]:
U = U[:, :2]
print(U.round(2))

VT = VT[:2, :]
print(VT.round(2))

[[ 0.24  0.75]
 [ 0.51  0.44]
 [ 0.83 -0.49]
 [ 0.   -0.  ]]
[[ 0.    0.31  0.31  0.28  0.8   0.09  0.28  0.    0.  ]
 [ 0.   -0.24 -0.24  0.58 -0.26  0.37  0.58 -0.   -0.  ]]


In [10]:
# Approximate reconstruction of the original matrix (not exact after truncation)
A_prime = np.dot(np.dot(U, S), VT)
print(A)
print(A_prime.round(2))

[[0 0 0 1 0 1 1 0 0]
 [0 0 0 1 1 0 1 0 0]
 [0 1 1 0 2 0 0 0 0]
 [1 0 0 0 0 0 0 1 1]]
[[ 0.   -0.17 -0.17  1.08  0.12  0.62  1.08 -0.   -0.  ]
 [ 0.    0.2   0.2   0.91  0.86  0.45  0.91  0.    0.  ]
 [ 0.    0.93  0.93  0.03  2.05 -0.17  0.03  0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.  ]]
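
As a follow-up to the document and word vectors described above, here is a minimal sketch that compares documents in the 2-dimensional latent space. Scaling the rows of U by the singular values is one common convention and is an assumption here, as is the `cosine` helper; the matrix is the same 4 x 9 DTM.

```python
import numpy as np

A = np.array([[0, 0, 0, 1, 0, 1, 1, 0, 0],
              [0, 0, 0, 1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 2, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 0, 0, 1, 1]])

U, s, VT = np.linalg.svd(A, full_matrices=False)
t = 2
doc_vecs = U[:, :t] * s[:t]        # one row per document, shape (4, 2)
word_vecs = VT[:t, :].T * s[:t]    # one row per word, shape (9, 2)

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_12 = cosine(doc_vecs[0], doc_vecs[1])   # documents 1 and 2 share two words
sim_13 = cosine(doc_vecs[0], doc_vecs[2])   # documents 1 and 3 share none
doc4_norm = np.linalg.norm(doc_vecs[3])     # document 4 after truncation

print(f"sim(doc1, doc2) = {sim_12:.2f}")
print(f"sim(doc1, doc3) = {sim_13:.2f}")
print(f"|doc4| = {doc4_norm:.2e}")
```

Document 4 shares no words with the others, and the direction that represents it is the third singular vector, which is dropped at t = 2. Its latent vector is therefore numerically zero, which matches the all-zero fourth row of the reconstruction.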


# 4. Hands-on Practice
> **1) Understanding the newsgroup data**

In [11]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data
len(documents)

11314

In [12]:
documents[1]

"\n\n\n\n\n\n\nYeah, do you expect people to read the FAQ, etc. and actually accept hard\natheism?  No, you need a little leap of faith, Jimmy.  Your logic runs out\nof steam!\n\n\n\n\n\n\n\nJim,\n\nSorry I can't pity you, Jim.  And I'm sorry that you have these feelings of\ndenial about the faith you need to get by.  Oh well, just pretend that it will\nall end happily ever after anyway.  Maybe if you start a new newsgroup,\nalt.atheist.hard, you won't be bummin' so much?\n\n\n\n\n\n\nBye-Bye, Big Jim.  Don't forget your Flintstone's Chewables!  :) \n--\nBake Timmons, III"

In [13]:
print(dataset.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


> **2) Text preprocessing**

In [14]:
news_df = pd.DataFrame({'document':documents})

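# Replace every non-letter character with a space (regex=True must be explicit in pandas >= 2.0)
news_df['clean_doc'] = news_df['document'].str.replace("[^a-zA-Z]", " ", regex=True)
# Drop words of three characters or fewer
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))
# Lowercase everything
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: x.lower())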


In [15]:
news_df['clean_doc'][1]

'yeah expect people read actually accept hard atheism need little leap faith jimmy your logic runs steam sorry pity sorry that have these feelings denial about faith need well just pretend that will happily ever after anyway maybe start newsgroup atheist hard bummin much forget your flintstone chewables bake timmons'

In [16]:
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
tokenized_doc = news_df['clean_doc'].apply(lambda x: x.split())
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])

In [17]:
tokenized_doc[1]

['yeah',
 'expect',
 'people',
 'read',
 'actually',
 'accept',
 'hard',
 'atheism',
 'need',
 'little',
 'leap',
 'faith',
 'jimmy',
 'logic',
 'runs',
 'steam',
 'sorry',
 'pity',
 'sorry',
 'feelings',
 'denial',
 'faith',
 'need',
 'well',
 'pretend',
 'happily',
 'ever',
 'anyway',
 'maybe',
 'start',
 'newsgroup',
 'atheist',
 'hard',
 'bummin',
 'much',
 'forget',
 'flintstone',
 'chewables',
 'bake',
 'timmons']

> **3) Building the TF-IDF matrix**   

In [18]:
# Detokenize (join the tokens back into one string per document)
detokenized_doc = []
for i in range(len(news_df)):
    t = ' '.join(tokenized_doc[i])
    detokenized_doc.append(t)
    
news_df['clean_doc'] = detokenized_doc

In [19]:
news_df['clean_doc'][1]

'yeah expect people read actually accept hard atheism need little leap faith jimmy logic runs steam sorry pity sorry feelings denial faith need well pretend happily ever anyway maybe start newsgroup atheist hard bummin much forget flintstone chewables bake timmons'

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english',
                             max_features=1000,
                             max_df=0.5, 
                             smooth_idf=True)

X = vectorizer.fit_transform(news_df['clean_doc'])
X.shape

(11314, 1000)

> **4) Topic Modeling**

In [21]:
from sklearn.decomposition import TruncatedSVD

# Assume the corpus has 20 topics
svd_model = TruncatedSVD(n_components=20, algorithm='randomized', n_iter=100, random_state=122)
svd_model.fit(X)
len(svd_model.components_)

20

In [22]:
np.shape(svd_model.components_) # corresponds to VT in LSA

(20, 1000)

In [23]:
terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

def get_topics(components, feature_names, n=5):
    for idx, topic in enumerate(components):
        print("Topic %d:" % (idx+1), [(feature_names[i], topic[i].round(5)) for i in topic.argsort()[:-n - 1:-1]])
get_topics(svd_model.components_, terms)

Topic 1: [('like', 0.21386), ('know', 0.20046), ('people', 0.19293), ('think', 0.17805), ('good', 0.15128)]
Topic 2: [('thanks', 0.32888), ('windows', 0.29088), ('card', 0.18069), ('drive', 0.17455), ('mail', 0.15111)]
Topic 3: [('game', 0.37064), ('team', 0.32443), ('year', 0.28154), ('games', 0.2537), ('season', 0.18419)]
Topic 4: [('drive', 0.53324), ('scsi', 0.20165), ('hard', 0.15628), ('disk', 0.15578), ('card', 0.13994)]
Topic 5: [('windows', 0.40399), ('file', 0.25436), ('window', 0.18044), ('files', 0.16078), ('program', 0.13894)]
Topic 6: [('chip', 0.16114), ('government', 0.16009), ('mail', 0.15625), ('space', 0.1507), ('information', 0.13562)]
Topic 7: [('like', 0.67086), ('bike', 0.14236), ('chip', 0.11169), ('know', 0.11139), ('sounds', 0.10371)]
Topic 8: [('card', 0.46633), ('video', 0.22137), ('sale', 0.21266), ('monitor', 0.15463), ('offer', 0.14643)]
Topic 9: [('know', 0.46047), ('card', 0.33605), ('chip', 0.17558), ('government', 0.1522), ('video', 0.14356)]
Topic 10

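# 5. Pros and Cons of LSA
LSA is easy and fast to implement and draws out the latent meaning of words.   
However, because of how SVD works, adding new data to an already computed LSA requires recomputing everything from scratch (it is hard to update with new information).   
This is one reason neural-network-based methods such as Word2Vec have recently drawn more attention than LSA.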
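
The update problem can be illustrated with a minimal sketch on a hypothetical toy corpus (all documents and names below are made up). A fitted model can still project new documents onto its existing components with `transform`, but this does not update the model: words that were not in the training vocabulary are ignored, so a document made only of new words maps to the zero vector.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training corpus
train_docs = ["apple banana fruit",
              "banana yellow fruit",
              "hockey team game",
              "team season game"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(X_train)

# New documents: the first mixes known and unknown words,
# the second contains only words the model has never seen.
new_docs = ["banana apple smoothie", "blockchain cryptocurrency"]
X_new = vectorizer.transform(new_docs)   # unseen words are silently dropped
Z_new = svd.transform(X_new)             # projection onto the existing components

print(X_new[1].nnz)   # number of known terms in the second document
print(Z_new.round(3))
```

Capturing the new vocabulary would require refitting both the vectorizer and the SVD on the enlarged corpus.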