1) Latent Semantic Analysis (LSA)

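Singular Value Decomposition (SVD)

    A dimensionality reduction method

Factorizes an m x n matrix A into a product of three matrices, A = U Σ V^T: an m x m orthogonal matrix U, an m x n rectangular diagonal matrix Σ, and an n x n orthogonal matrix V

Keeping only the largest singular values reduces the dimensionality of the data while retaining its most important features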
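
The decomposition described above can be checked on a small matrix. The following is a minimal sketch using NumPy; the 4 x 3 matrix A and the choice k = 2 are arbitrary values picked for illustration and are not part of the notebook's data.

```python
import numpy as np

# Small 4 x 3 example matrix (arbitrary values, for illustration only).
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0],
              [4.0, 0.0, 5.0],
              [0.0, 6.0, 0.0]])

# Full SVD: U is m x m, Vt is n x n, s holds the min(m, n) singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the m x n rectangular diagonal matrix Sigma from s.
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

print(U.shape, Sigma.shape, Vt.shape)   # (4, 4) (4, 3) (3, 3)
print(np.allclose(A, U @ Sigma @ Vt))   # True

# Truncated SVD: keep only the k largest singular values.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(A_k.shape)                        # (4, 3)
```

A_k has the same shape as A but rank k, which is the sense in which truncating the SVD reduces dimensionality.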

In [2]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups  
# The first call downloads the dataset, so it can take a while
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))  
documents = dataset.data
len(documents)

11314

In [3]:
print(documents[0])

Well i'm not sure about the story nad it did seem biased. What
I disagree with is your statement that the U.S. Media is out to
ruin Israels reputation. That is rediculous. The U.S. media is
the most pro-israeli media in the world. Having lived in Europe
I realize that incidences such as the one described in the
letter have occured. The U.S. media as a whole seem to try to
ignore them. The U.S. is subsidizing Israels existance and the
Europeans are not (at least not to the same degree). So I think
that might be a reason they report more clearly on the
atrocities.
	What is a shame is that in Austria, daily reports of
the inhuman acts commited by Israeli soldiers and the blessing
received from the Government makes some of the Holocaust guilt
go away. After all, look how the Jews are treating other races
when they got power. It is unfortunate.



In [4]:
print(dataset.target_names) # newsgroup categories

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [5]:
news_df = pd.DataFrame({'document':documents})
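# Replace every non-alphabetic character with a space.
# regex=True is needed on pandas >= 2.0, where str.replace matches the pattern
# literally by default and leaves punctuation in place.
news_df['clean_doc'] = news_df['document'].str.replace("[^a-zA-Z]", " ", regex=True)
# Drop words of three characters or fewer
news_df['clean_doc'] = news_df['clean_doc'].apply(
    lambda x: ' '.join([w for w in x.split() if len(w)>3]))
# Convert to lowercase
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: x.lower())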
news_df['clean_doc'][0]

'well sure about story seem biased. what disagree with your statement that u.s. media ruin israels reputation. that rediculous. u.s. media most pro-israeli media world. having lived europe realize that incidences such described letter have occured. u.s. media whole seem ignore them. u.s. subsidizing israels existance europeans least same degree). think that might reason they report more clearly atrocities. what shame that austria, daily reports inhuman acts commited israeli soldiers blessing received from government makes some holocaust guilt away. after all, look jews treating other races when they power. unfortunate.'
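
The output above still contains punctuation such as "biased." and "u.s.". One likely cause, assuming the cell ran on pandas 2.0 or later, is that str.replace treats its pattern as a literal string unless regex=True is passed. A minimal sketch of the difference, using a made-up sample string rather than the newsgroup data:

```python
import pandas as pd

# Made-up sample text, for illustration only.
s = pd.Series(["U.S. media, pro-israeli!"])

# Literal match: the string "[^a-zA-Z]" never occurs, so nothing changes.
literal = s.str.replace("[^a-zA-Z]", " ", regex=False)

# Regex match: every non-alphabetic character becomes a space.
pattern = s.str.replace("[^a-zA-Z]", " ", regex=True)

print(literal[0])           # U.S. media, pro-israeli!
print(pattern[0].split())   # ['U', 'S', 'media', 'pro', 'israeli']
```

Passing regex explicitly, as in this sketch, gives the same result regardless of the installed pandas version.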

In [6]:
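from nltk.corpus import stopwords
# Requires the NLTK stopword list; run nltk.download('stopwords') once if it is missing
stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
# Tokenize on whitespace
tokenized_doc = news_df['clean_doc'].apply(lambda x: x.split())
# Remove stopwords
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])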

print(tokenized_doc[0])

['well', 'sure', 'story', 'seem', 'biased.', 'disagree', 'statement', 'u.s.', 'media', 'ruin', 'israels', 'reputation.', 'rediculous.', 'u.s.', 'media', 'pro-israeli', 'media', 'world.', 'lived', 'europe', 'realize', 'incidences', 'described', 'letter', 'occured.', 'u.s.', 'media', 'whole', 'seem', 'ignore', 'them.', 'u.s.', 'subsidizing', 'israels', 'existance', 'europeans', 'least', 'degree).', 'think', 'might', 'reason', 'report', 'clearly', 'atrocities.', 'shame', 'austria,', 'daily', 'reports', 'inhuman', 'acts', 'commited', 'israeli', 'soldiers', 'blessing', 'received', 'government', 'makes', 'holocaust', 'guilt', 'away.', 'all,', 'look', 'jews', 'treating', 'races', 'power.', 'unfortunate.']


In [7]:
# Join the tokens back into strings, because TfidfVectorizer expects raw text

detokenized_doc = [' '.join(tokens) for tokens in tokenized_doc]

news_df['clean_doc'] = detokenized_doc

news_df['clean_doc'][0]

'well sure story seem biased. disagree statement u.s. media ruin israels reputation. rediculous. u.s. media pro-israeli media world. lived europe realize incidences described letter occured. u.s. media whole seem ignore them. u.s. subsidizing israels existance europeans least degree). think might reason report clearly atrocities. shame austria, daily reports inhuman acts commited israeli soldiers blessing received government makes holocaust guilt away. all, look jews treating races power. unfortunate.'

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep only the 1,000 most frequent terms
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X = vectorizer.fit_transform(news_df['clean_doc'])
X.shape # check the shape of the TF-IDF matrix

(11314, 1000)

In [9]:
from sklearn.decomposition import TruncatedSVD
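# Truncated SVD of the TF-IDF matrix: each document is projected from
# 1,000 term dimensions onto 20 latent topics (n_components = number of topics)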
svd_model = TruncatedSVD(n_components=20)
svd_model.fit(X)
len(svd_model.components_)

20

In [10]:
import numpy as np
# number of topics x number of terms
np.shape(svd_model.components_)

(20, 1000)
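
components_ describes topics in terms of words; the matching document-by-topic matrix comes from transform or fit_transform. A minimal sketch on a hypothetical four-document corpus (the documents, n_components=2, and random_state=0 are assumptions made for illustration):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): two sports-like and two hardware-like documents.
toy_docs = [
    "game team season score",
    "team season player game",
    "drive card windows scsi",
    "windows card video drive",
]

X_toy = TfidfVectorizer().fit_transform(toy_docs)      # documents x terms

svd_toy = TruncatedSVD(n_components=2, random_state=0)
doc_topic = svd_toy.fit_transform(X_toy)               # documents x topics

print(X_toy.shape)                # (4, 10)
print(doc_topic.shape)            # (4, 2)
print(svd_toy.components_.shape)  # (2, 10): topics x terms
```

On the notebook's data, the same call, svd_model.transform(X), would give a matrix with one 20-dimensional topic vector per document.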

In [11]:
svd_model.components_

array([[ 0.02682194,  0.02779475,  0.00507776, ...,  0.04554116,
         0.01386712,  0.01758137],
       [ 0.03501249, -0.0150116 ,  0.00534304, ..., -0.02162846,
        -0.00912255, -0.01760022],
       [ 0.06487863,  0.0286971 ,  0.00452922, ..., -0.00207655,
         0.02335056,  0.02043433],
       ...,
       [ 0.08024134,  0.03438316,  0.00154817, ...,  0.01804123,
        -0.003448  ,  0.0050334 ],
       [ 0.03872506, -0.02064543, -0.00493834, ..., -0.03772561,
        -0.00945247, -0.0099365 ],
       [-0.07219908, -0.03848623,  0.00044259, ...,  0.01463146,
        -0.007438  , -0.0035805 ]])

In [12]:
# Vocabulary: the 1,000 retained terms
terms = vectorizer.get_feature_names_out()

# Print the n highest-weighted terms for each of the 20 extracted topics
def get_topics(components, feature_names, n=5):
    for idx, topic in enumerate(components):
        print("Topic %d:" % (idx+1),
              [(feature_names[i], topic[i].round(5)) for i in topic.argsort()[:-n - 1:-1]])
        
get_topics(svd_model.components_,terms)
# Extracts the key terms of each topic

Topic 1: [('like', 0.2085), ('know', 0.19656), ('people', 0.1912), ('think', 0.17523), ('good', 0.14902)]
Topic 2: [('thanks', 0.31342), ('windows', 0.27941), ('card', 0.17295), ('drive', 0.16146), ('mail', 0.14496)]
Topic 3: [('game', 0.36766), ('team', 0.31145), ('year', 0.28411), ('games', 0.2304), ('season', 0.16978)]
Topic 4: [('edu', 0.50329), ('thanks', 0.25183), ('mail', 0.17635), ('email', 0.11264), ('com', 0.11197)]
Topic 5: [('edu', 0.49887), ('drive', 0.253), ('sale', 0.10969), ('com', 0.10916), ('soon', 0.09206)]
Topic 6: [('drive', 0.39704), ('thanks', 0.3472), ('know', 0.27989), ('scsi', 0.1382), ('mail', 0.11401)]
Topic 7: [('chip', 0.22214), ('government', 0.20243), ('like', 0.16519), ('encryption', 0.15076), ('clipper', 0.14962)]
Topic 8: [('like', 0.65106), ('edu', 0.3098), ('know', 0.13353), ('think', 0.12132), ('bike', 0.1212)]
Topic 9: [('card', 0.3243), ('good', 0.27584), ('sale', 0.16749), ('00', 0.14953), ('video', 0.14545)]
Topic 10: [('card', 0.48524), ('peop