# Singular Value Decomposition, SVD

특이값 분해는 행렬을 3개의 행렬의 곱의 형태로 분해하는 것이다.  
$ A = U\Sigma V^T$  (직교, 대각, 직교)  
직교행렬: n x n 행렬 A에 대하여 $ A \times A^T = I$이면서 $A^T\times A = I$인 행렬이다. 즉, $A^{-1} = A^T$를 만족한다



In [2]:
import numpy as np

In [3]:
A = np.array(
    [
        [0,0,0,1,0,1,1,0,0],
        [0,0,0,1,1,0,1,0,0],
        [0,1,1,0,2,0,0,0,0],
        [1,0,0,0,0,0,0,1,1]
    ]
)
print('DTM의 크기(shape) :', np.shape(A))

DTM의 크기(shape) : (4, 9)


Full SVD

In [4]:
# 특이값 분해
U, s, VT = np.linalg.svd(A, full_matrices=True)
print('행렬 U :')
print(U.round(2))
print('행렬 U의 크기(shape) :',np.shape(U))

행렬 U :
[[-0.24  0.75  0.   -0.62]
 [-0.51  0.44 -0.    0.74]
 [-0.83 -0.49 -0.   -0.27]
 [-0.   -0.    1.    0.  ]]
행렬 U의 크기(shape) : (4, 4)


In [5]:
print('특이값 벡터 :')
print(s.round(2))
print('특이값 벡터의 크기(shape) :',np.shape(s))

특이값 벡터 :
[2.69 2.05 1.73 0.77]
특이값 벡터의 크기(shape) : (4,)


Numpy의 linalg.svd()는 특이값 분해의 결과로 대각 행렬이 아니라 특이값의 리스트를 반환합니다.  
그러므로 앞서 본 수식의 형식으로 보려면 이를 다시 대각 행렬로 바꾸어 주어야 합니다.  
우선 특이값을 s에 저장하고 대각 행렬 크기의 행렬을 생성한 후에 그 행렬에 특이값을 삽입해도록 하겠습니다.

In [6]:
# 대각 행렬의 크기인 4 x 9의 임의의 행렬 생성
S = np.zeros((4, 9))

# 특이값을 대각행렬에 삽입
S[:4, :4] = np.diag(s) # diagonal 대각행렬을 만드는 함수

print('대각 행렬 S :')
print(S.round(2))

print('대각 행렬의 크기(shape) :')
print(np.shape(S))

대각 행렬 S :
[[2.69 0.   0.   0.   0.   0.   0.   0.   0.  ]
 [0.   2.05 0.   0.   0.   0.   0.   0.   0.  ]
 [0.   0.   1.73 0.   0.   0.   0.   0.   0.  ]
 [0.   0.   0.   0.77 0.   0.   0.   0.   0.  ]]
대각 행렬의 크기(shape) :
(4, 9)


In [7]:
np.diag(s) # 대각행렬의 값은 내림차순되어있

array([[2.68731789, 0.        , 0.        , 0.        ],
       [0.        , 2.04508425, 0.        , 0.        ],
       [0.        , 0.        , 1.73205081, 0.        ],
       [0.        , 0.        , 0.        , 0.77197992]])

In [8]:
print('직교행렬 VT :')
print(VT.round(2))

print('직교 행렬 VT의 크기(shape) :')
print(np.shape(VT))

직교행렬 VT :
[[-0.   -0.31 -0.31 -0.28 -0.8  -0.09 -0.28 -0.   -0.  ]
 [ 0.   -0.24 -0.24  0.58 -0.26  0.37  0.58 -0.   -0.  ]
 [ 0.58 -0.    0.    0.   -0.    0.   -0.    0.58  0.58]
 [ 0.   -0.35 -0.35  0.16  0.25 -0.8   0.16 -0.   -0.  ]
 [-0.   -0.78 -0.01 -0.2   0.4   0.4  -0.2   0.    0.  ]
 [-0.29  0.31 -0.78 -0.24  0.23  0.23  0.01  0.14  0.14]
 [-0.29 -0.1   0.26 -0.59 -0.08 -0.08  0.66  0.14  0.14]
 [-0.5  -0.06  0.15  0.24 -0.05 -0.05 -0.19  0.75 -0.25]
 [-0.5  -0.06  0.15  0.24 -0.05 -0.05 -0.19 -0.25  0.75]]
직교 행렬 VT의 크기(shape) :
(9, 9)


분해된 행렬을 곱하면 원래 행렬이 나와야한다.  
이 확인은 numpy의 allclose() 함수를 사용해서 알아보자. 두 행렬이 같으면 True를 리턴한다

In [9]:
np.allclose(A, np.dot(np.dot(U,S), VT).round(2))

True

# 절단된 SVD
절단된 SVD는 분해된 행렬의 t 크기만큼만 사용하게 되는데, 이 t값이 남길 정보량을 결정하는 하이퍼파라미터다.  
t에는 중요한 정보가 남고 나머지는 상대적으로 의미가 없는 정보가 버려지게된다. 

In [10]:
S = S[:2, :2]

print("절단된 대각 행렬 S: ")
print(S.round(2))

절단된 대각 행렬 S: 
[[2.69 0.  ]
 [0.   2.05]]


In [11]:
U = U[:, :2]
print("절단된 행렬 U :")
print(U.round(2))

절단된 행렬 U :
[[-0.24  0.75]
 [-0.51  0.44]
 [-0.83 -0.49]
 [-0.   -0.  ]]


In [12]:
VT = VT[:2, :]
print("절단된 직교행렬 VT:")
print(VT.round(2))

절단된 직교행렬 VT:
[[-0.   -0.31 -0.31 -0.28 -0.8  -0.09 -0.28 -0.   -0.  ]
 [ 0.   -0.24 -0.24  0.58 -0.26  0.37  0.58 -0.   -0.  ]]


In [13]:
A_prime = np.dot(np.dot(U, S), VT)
print(A)
print(A_prime.round(2))

[[0 0 0 1 0 1 1 0 0]
 [0 0 0 1 1 0 1 0 0]
 [0 1 1 0 2 0 0 0 0]
 [1 0 0 0 0 0 0 1 1]]
[[ 0.   -0.17 -0.17  1.08  0.12  0.62  1.08 -0.   -0.  ]
 [ 0.    0.2   0.2   0.91  0.86  0.45  0.91  0.    0.  ]
 [ 0.    0.93  0.93  0.03  2.05 -0.17  0.03  0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.  ]]


# 실습!


In [59]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Kyeul\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [16]:
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes')) # random_state로 매번 동일하게 shuffle한다.
documents = dataset.data
print('샘플 수: ', len(documents))

샘플 수:  11314


In [17]:
# 훈령용 데이터 샘플 살펴보기
documents[0]

"Well i'm not sure about the story nad it did seem biased. What\nI disagree with is your statement that the U.S. Media is out to\nruin Israels reputation. That is rediculous. The U.S. media is\nthe most pro-israeli media in the world. Having lived in Europe\nI realize that incidences such as the one described in the\nletter have occured. The U.S. media as a whole seem to try to\nignore them. The U.S. is subsidizing Israels existance and the\nEuropeans are not (at least not to the same degree). So I think\nthat might be a reason they report more clearly on the\natrocities.\n\tWhat is a shame is that in Austria, daily reports of\nthe inhuman acts commited by Israeli soldiers and the blessing\nreceived from the Government makes some of the Holocaust guilt\ngo away. After all, look how the Jews are treating other races\nwhen they got power. It is unfortunate.\n"

In [34]:
documents[1]

"\n\n\n\n\n\n\nYeah, do you expect people to read the FAQ, etc. and actually accept hard\natheism?  No, you need a little leap of faith, Jimmy.  Your logic runs out\nof steam!\n\n\n\n\n\n\n\nJim,\n\nSorry I can't pity you, Jim.  And I'm sorry that you have these feelings of\ndenial about the faith you need to get by.  Oh well, just pretend that it will\nall end happily ever after anyway.  Maybe if you start a new newsgroup,\nalt.atheist.hard, you won't be bummin' so much?\n\n\n\n\n\n\nBye-Bye, Big Jim.  Don't forget your Flintstone's Chewables!  :) \n--\nBake Timmons, III"

In [20]:
# 데이터 살펴보자
type(dataset)

sklearn.utils._bunch.Bunch

In [23]:
np.shape(dataset)

()

In [24]:
dataset.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [25]:
dataset.filenames

array(['C:\\Users\\Kyeul\\scikit_learn_data\\20news_home\\20news-bydate-train\\talk.politics.mideast\\76141',
       'C:\\Users\\Kyeul\\scikit_learn_data\\20news_home\\20news-bydate-train\\alt.atheism\\53281',
       'C:\\Users\\Kyeul\\scikit_learn_data\\20news_home\\20news-bydate-train\\talk.politics.mideast\\76350',
       ...,
       'C:\\Users\\Kyeul\\scikit_learn_data\\20news_home\\20news-bydate-train\\rec.sport.baseball\\105105',
       'C:\\Users\\Kyeul\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.sys.mac.hardware\\51575',
       'C:\\Users\\Kyeul\\scikit_learn_data\\20news_home\\20news-bydate-train\\rec.sport.baseball\\104908'],
      dtype='<U95')

In [26]:
dataset.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

# 전처리 하기

In [41]:
news_df = pd.DataFrame({'document':documents})
news_df

Unnamed: 0,document
0,Well i'm not sure about the story nad it did s...
1,"\n\n\n\n\n\n\nYeah, do you expect people to re..."
2,Although I realize that principle is not one o...
3,Notwithstanding all the legitimate fuss about ...
4,"Well, I will have to change the scoring on my ..."
...,...
11309,"Danny Rubenstein, an Israeli journalist, will ..."
11310,\n
11311,\nI agree. Home runs off Clemens are always m...
11312,I used HP DeskJet with Orange Micros Grappler ...


In [42]:
news_df.iloc[0]

document    Well i'm not sure about the story nad it did s...
Name: 0, dtype: object

In [45]:
# 특수 문자 제거 # clean_doc이라는 새로운 column에 저장
news_df['clean_doc'] = news_df['document'].str.replace("[^a-zA-Z]", " ") #오.. DataFrame에 replace로 치환할 수 있구나. 늘 re썼는데 몰랐네. #알파벳을 제외하고 모두 제거
news_df['clean_doc'] = news_df['document'].str.replace("^", "")
news_df['clean_doc']


0        Well i'm not sure about the story nad it did s...
1        \n\n\n\n\n\n\nYeah, do you expect people to re...
2        Although I realize that principle is not one o...
3        Notwithstanding all the legitimate fuss about ...
4        Well, I will have to change the scoring on my ...
                               ...                        
11309    Danny Rubenstein, an Israeli journalist, will ...
11310                                                   \n
11311    \nI agree.  Home runs off Clemens are always m...
11312    I used HP DeskJet with Orange Micros Grappler ...
11313                                          \nNo arg...
Name: clean_doc, Length: 11314, dtype: object

In [51]:
[w for w in 'heelp asdfasdf a'.split() if len(w) > 3]

['heelp', 'asdfasdf']

In [46]:
# 길이가 3이하인 단어는 제거 (길이가 짧은 단어 제거 ) # 이걸 왜 제거하는 거지?
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 3]))  # 낱말이 3개 미만인 것을 제거한다. 
news_df['clean_doc']

0        Well sure about story seem biased. What disagr...
1        Yeah, expect people read FAQ, etc. actually ac...
2        Although realize that principle your strongest...
3        Notwithstanding legitimate fuss about this pro...
4        Well, will have change scoring playoff pool. U...
                               ...                        
11309    Danny Rubenstein, Israeli journalist, will spe...
11310                                                     
11311    agree. Home runs Clemens always memorable. Kin...
11312    used DeskJet with Orange Micros Grappler Syste...
11313    argument with Murphy. scared hell when came la...
Name: clean_doc, Length: 11314, dtype: object

In [47]:
# 소문자로 변환
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: x.lower())
news_df['clean_doc']

0        well sure about story seem biased. what disagr...
1        yeah, expect people read faq, etc. actually ac...
2        although realize that principle your strongest...
3        notwithstanding legitimate fuss about this pro...
4        well, will have change scoring playoff pool. u...
                               ...                        
11309    danny rubenstein, israeli journalist, will spe...
11310                                                     
11311    agree. home runs clemens always memorable. kin...
11312    used deskjet with orange micros grappler syste...
11313    argument with murphy. scared hell when came la...
Name: clean_doc, Length: 11314, dtype: object

In [60]:
# nltk로부터 불용어를 받아온다.
stop_words = stopwords.words('english')


In [61]:
tokenized_doc = news_df['clean_doc'].apply(lambda x: x.split()) # 단어로 쪼개서 토큰화함
tokenized_doc

0        [well, sure, about, story, seem, biased., what...
1        [yeah,, expect, people, read, faq,, etc., actu...
2        [although, realize, that, principle, your, str...
3        [notwithstanding, legitimate, fuss, about, thi...
4        [well,, will, have, change, scoring, playoff, ...
                               ...                        
11309    [danny, rubenstein,, israeli, journalist,, wil...
11310                                                   []
11311    [agree., home, runs, clemens, always, memorabl...
11312    [used, deskjet, with, orange, micros, grappler...
11313    [argument, with, murphy., scared, hell, when, ...
Name: clean_doc, Length: 11314, dtype: object

In [62]:
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])
tokenized_doc

0        [well, sure, story, seem, biased., disagree, s...
1        [yeah,, expect, people, read, faq,, etc., actu...
2        [although, realize, principle, strongest, poin...
3        [notwithstanding, legitimate, fuss, proposal,,...
4        [well,, change, scoring, playoff, pool., unfor...
                               ...                        
11309    [danny, rubenstein,, israeli, journalist,, spe...
11310                                                   []
11311    [agree., home, runs, clemens, always, memorabl...
11312    [used, deskjet, orange, micros, grappler, syst...
11313    [argument, murphy., scared, hell, came, last, ...
Name: clean_doc, Length: 11314, dtype: object

In [63]:
# 샘플 살펴보기
tokenized_doc[1]

['yeah,',
 'expect',
 'people',
 'read',
 'faq,',
 'etc.',
 'actually',
 'accept',
 'hard',
 'atheism?',
 'need',
 'little',
 'leap',
 'faith,',
 'jimmy.',
 'logic',
 'runs',
 'steam!',
 'jim,',
 'sorry',
 "can't",
 'pity',
 'you,',
 'jim.',
 'sorry',
 'feelings',
 'denial',
 'faith',
 'need',
 'well,',
 'pretend',
 'happily',
 'ever',
 'anyway.',
 'maybe',
 'start',
 'newsgroup,',
 'alt.atheist.hard,',
 "bummin'",
 'much?',
 'bye-bye,',
 'jim.',
 'forget',
 "flintstone's",
 'chewables!',
 'bake',
 'timmons,']

# TF-IDF 행렬 만들기
tf-idf의 입력은 토큰이 아닌 문서이므로, 토큰화를 문서로 만드는 '역토큰화'를 수행

In [64]:
# 역토큰화 (토큰화 작업을 역으로 되돌림)
detokenized_doc = []
for i in range(len(news_df)):
    t = ' '.join(tokenized_doc[i]) # 토큰들을 join으로 하나의 문자열로 묶음
    detokenized_doc.append(t) 

In [65]:
# 확인
news_df['clean_doc'][1] # 불용어가 제거되고 하나의 문서가 되었다.

"yeah, expect people read faq, etc. actually accept hard atheism? need little leap faith, jimmy. your logic runs steam! jim, sorry can't pity you, jim. sorry that have these feelings denial about faith need well, just pretend that will happily ever after anyway. maybe start newsgroup, alt.atheist.hard, won't bummin' much? bye-bye, jim. don't forget your flintstone's chewables! bake timmons,"

In [67]:
# 1000개의 단어로 제한된 tf-idf 행렬 생성
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000, # 상위 1000개의 단어를 보존
                             max_df = 0.5, smooth_idf=True)

X = vectorizer.fit_transform(news_df['clean_doc']) #벡터화 실행

# 행렬 크기 확인
print('TF-IDF shape: ', X.shape)


TF-IDF shape:  (11314, 1000)


# 토픽 모델링
이제 행렬을 분해해보자! sklean에는 Truncated SVD가 있어서 편하다. 이를 이용하면 차원 축소를 할 수 있다.  
기존 뉴스그룹 데이터가 20개의 카테고리를 갖고 있었으므로, 20개의 토픽이 있다고 가정하고 진행하자.  
토픽의 숫자는 n_components의 파라미터로 지정이 가능하다. 

In [68]:
svd_model = TruncatedSVD(n_components=20, algorithm='randomized', n_iter=100, random_state=122) # 모델 생성
svd_model.fit(X) # 학습 시작
len(svd_model.components_) # componets_는 VT에 해당함

20

In [69]:
svd_model

In [74]:
terms = vectorizer.get_feature_names_out() # 단어 집합. 1000개의 단어가 저장되어 있다.

def get_topics(components, feature_names, n=5):
    for idx, topic in enumerate(components):
        print("Topic %d:" % (idx+1), [(feature_names[i], topic[i].round(5)) for i in topic.argsort()[:-n -1 :-1]])

get_topics(svd_model.components_, terms)

Topic 1: [('just', 0.20273), ('don', 0.19953), ('like', 0.19536), ('know', 0.1882), ('people', 0.1785)]
Topic 2: [('thanks', 0.31974), ('windows', 0.27738), ('card', 0.17369), ('drive', 0.15925), ('mail', 0.1488)]
Topic 3: [('game', 0.32005), ('team', 0.27984), ('year', 0.26594), ('games', 0.20692), ('drive', 0.17124)]
Topic 4: [('edu', 0.42975), ('thanks', 0.24951), ('mail', 0.17246), ('game', 0.12793), ('team', 0.12627)]
Topic 5: [('know', 0.41951), ('does', 0.30519), ('thanks', 0.26145), ('don', 0.2107), ('just', 0.1947)]
Topic 6: [('drive', 0.4569), ('edu', 0.21765), ('thanks', 0.18803), ('scsi', 0.15738), ('drives', 0.12184)]
Topic 7: [('just', 0.56786), ('edu', 0.43044), ('don', 0.22504), ('like', 0.20016), ('soon', 0.0953)]
Topic 8: [('chip', 0.2166), ('government', 0.20027), ('encryption', 0.1468), ('like', 0.14586), ('clipper', 0.14285)]
Topic 9: [('don', 0.32342), ('know', 0.31922), ('edu', 0.28739), ('does', 0.26292), ('think', 0.20036)]
Topic 10: [('does', 0.47787), ('card'

LSA의 단점은 명확하다. 문서에 따라 벡더값이 변화기 때문에, 데이터가 추가되었다면 다시 처음부터 계산해야 한다.  
상대적으로 최신인 Word2Vec은 이런 문제를 피해갈 수 있다.  

그리고 토픽을 뽑는다는 의미에서 명사만 사용하여 분석하면 어떨까 싶다.  