## Text Classification - News
---
- Classify the 20 Newsgroups dataset, loaded through scikit-learn


## [1] Data Preparation
---

In [1]:
from sklearn.datasets import fetch_20newsgroups

In [2]:
newsData=fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

In [3]:
newsData20 = newsData["data"]
type(newsData20), type(newsData20[0])

(list, str)

In [4]:
print(newsData20[0])



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!




## [2] Data Preprocessing
---
- Inspect the data structure
    * Check that the data fits the task
- Build the vocabulary
- Convert text data to integer sequences
- Convert to a binary encoding
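
The steps listed above can be sketched end to end on a toy corpus. This is a minimal pure-Python illustration, not the notebook's Keras code: the `docs` list and the `to_binary` helper are hypothetical, and it assumes "binary encoding" means a multi-hot vector with one slot per vocabulary index.

```python
from collections import Counter

# Toy corpus standing in for the cleaned news posts (hypothetical data).
docs = ["pens beat devils", "devils lose game", "pens rule"]

# 1) Build the vocabulary: the most frequent word gets index 1 (0 is reserved).
counts = Counter(word for doc in docs for word in doc.split())
ranked = sorted(counts, key=lambda w: -counts[w])  # ties keep first-seen order
word_index = {word: i for i, word in enumerate(ranked, start=1)}

# 2) Text -> integer sequences.
sequences = [[word_index[w] for w in doc.split()] for doc in docs]

# 3) Integer sequences -> binary (multi-hot) vectors.
def to_binary(seq, size):
    vec = [0] * size
    for idx in seq:
        vec[idx] = 1
    return vec

binary = [to_binary(seq, len(word_index) + 1) for seq in sequences]

print(word_index)
print(sequences)
print(binary[2])
```

Each binary vector only records whether a word occurs in a post, so word order and repeat counts are discarded.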

#### [2-1] Building the Vocabulary

##### [2-1-1] Stopword Removal

In [5]:
import nltk

In [6]:
# Download the stopword corpus
nltk.download('stopwords', quiet=True)

True

In [7]:
# Stopwords: words that contribute little to the analysis, e.g. i, my, to, and ...
stopwords=nltk.corpus.stopwords.words('english')

In [8]:
print(f'stopwords : {stopwords}')

stopwords : ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so

In [9]:
from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence
import numpy as np

In [10]:
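# Data cleaning: remove stopwords from every post (modifies newsData20 in place)
def cleaningData(removeData):
    removeSet = set(removeData)  # set lookup is faster than scanning a list
    for idx in range(len(newsData20)):
        # Lowercase, strip punctuation, and split into words
        news = text_to_word_sequence(newsData20[idx])
        _clear = [n for n in news if n not in removeSet]
        newsData20[idx] = ' '.join(_clear)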

In [11]:
cleaningData(stopwords)
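
Because `cleaningData` overwrites `newsData20`, re-running the cell works on already-cleaned text and the raw posts are lost. A non-mutating variant might look like the sketch below. The name `remove_stopwords` and its `tokenize` parameter are hypothetical; `str.split` is used here only as a stand-in so the example runs without TensorFlow, and in the notebook `text_to_word_sequence` could be passed instead.

```python
def remove_stopwords(texts, stop_words, tokenize=str.split):
    """Return new cleaned strings; the input list is left untouched."""
    stop_set = set(stop_words)  # constant-time membership tests
    return [
        " ".join(tok for tok in tokenize(text.lower()) if tok not in stop_set)
        for text in texts
    ]

sample = ["I am sure the Pens are going to beat Jersey"]
cleaned = remove_stopwords(sample, ["i", "am", "the", "are", "to"])
print(cleaned)
print(sample)  # original text is still available
```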

In [12]:
newsData20[0]

"sure bashers pens fans pretty confused lack kind posts recent pens massacre devils actually bit puzzled bit relieved however going put end non pittsburghers' relief bit praise pens man killing devils worse thought jagr showed much better regular season stats also lot fo fun watch playoffs bowman let jagr lot fun next couple games since pens going beat pulp jersey anyway disappointed see islanders lose final regular season game pens rule"

##### [2-1-2] Building the Vocabulary and Choosing Its Size

In [13]:
def makeWordVoca(numWord=0):
    if numWord>0:
        myToken=Tokenizer(num_words=numWord)
    else:
        myToken=Tokenizer()
    
    # Build the vocabulary (voca)
    myToken.fit_on_texts(newsData20)
    
    return myToken

In [14]:
# Tokenize the text data
myToken=makeWordVoca()
print(f'word_index : {len(myToken.word_index)} words')

word_index : 139318 words


In [15]:
# Choose a vocabulary size limit based on word frequency
def getNumWord(limit_num):
    # Count the words that occur exactly limit_num times,
    # then subtract that count from the full vocabulary size
    low_freq_cnt=0
    for k, v in myToken.word_counts.items():
        if v == limit_num: low_freq_cnt += 1

    return len(myToken.word_index) - low_freq_cnt

In [16]:
num_word = getNumWord(2)
print(f'num_word : {num_word} words')

num_word : 121055 words
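
Note that `getNumWord` counts only the words whose frequency is exactly `limit_num`, so words that occur fewer times are not subtracted. If the intent is to drop every word at or below a threshold, a `<=` comparison may be closer. The sketch below shows that variant on toy counts; `count_rare_words` and `toy_counts` are hypothetical names, and applying it to the real data would give a different figure from the one printed above.

```python
from collections import Counter

def count_rare_words(word_counts, max_count):
    """Count words whose frequency is at most max_count."""
    return sum(1 for freq in word_counts.values() if freq <= max_count)

# Toy frequencies: a=3, b=2, c=1, d=4, e=1
toy_counts = Counter("a a a b b c d d d d e".split())
vocab_size = len(toy_counts)
rare = count_rare_words(toy_counts, max_count=2)

print(vocab_size, rare, vocab_size - rare)
```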


In [24]:
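# num_words only limits which words texts_to_sequences keeps;
# word_index still lists every word seen during fitting
myToken=makeWordVoca(num_word)
print(f'word_index : {len(myToken.word_index)} words')

word_index : 139318 words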
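
The vocabulary size printed above is unchanged because the Keras `Tokenizer` applies `num_words` only when encoding text, keeping words whose index is below `num_words`. The following pure-Python sketch mimics that behavior so it can run without TensorFlow. `fit_word_index` and `encode` are hypothetical helpers, not Keras functions.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(w for t in texts for w in t.split())
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {w: i for i, w in enumerate(ranked, start=1)}

def encode(texts, word_index, num_words=None):
    """Encode texts, skipping unknown words and words with index >= num_words."""
    out = []
    for t in texts:
        seq = []
        for w in t.split():
            idx = word_index.get(w)
            if idx is None or (num_words and idx >= num_words):
                continue
            seq.append(idx)
        out.append(seq)
    return out

texts = ["pens pens pens devils devils jagr"]
word_index = fit_word_index(texts)

print(len(word_index))                         # full vocabulary, regardless of num_words
print(encode(texts, word_index, num_words=3))  # only indices 1 and 2 survive
```

With this rule, a limit of `num_words` keeps at most `num_words - 1` distinct words in the encoded sequences.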


#### [2-2] Text => Integer Encoding (based on the generated vocabulary)

In [25]:
# Text => integer sequences (using the vocabulary)
seq_news=myToken.texts_to_sequences(newsData20)

In [26]:
print(f'seq_news : {len(seq_news)}')

seq_news : 18846
