In [100]:
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(subset='all')
print(newsgroups.DESCR)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features       

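The data above is the 20 newsgroups document dataset. For each post it provides the text, the name of the newsgroup it was posted to, and a numeric label for that newsgroup.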

Ex) newsgroup name: comp.graphics, label: 1

The cell below prints the newsgroup categories.

In [101]:
from pprint import pprint
pprint(list(newsgroups.target_names))

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
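
The position of a name in this list is the numeric label stored in `newsgroups.target`. A minimal sketch of the lookup in both directions, using a copy of the list above so that it runs without downloading the dataset; `label_of` is a hypothetical helper dict, not part of scikit-learn.

```python
# Category names copied from the output above; the list index is the numeric label.
target_names = [
    'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
    'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
    'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
    'sci.space', 'soc.religion.christian', 'talk.politics.guns',
    'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc',
]

# Hypothetical helper (an assumption for illustration): newsgroup name -> label.
label_of = {name: label for label, name in enumerate(target_names)}

print(label_of['comp.graphics'])  # label for a given name
print(target_names[15])           # name for a given label
```

With the real dataset, `newsgroups.target_names[newsgroups.target[i]]` performs the second lookup for post `i`.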


The cell below prints the text of the post at a given index, followed by the name of the newsgroup that post belongs to.

In [102]:
print(newsgroups.data[1])
print("=" * 80)
print(newsgroups.target_names[newsgroups.target[1]])

From: mblawson@midway.ecn.uoknor.edu (Matthew B Lawson)
Subject: Which high-performance VLB video card?
Summary: Seek recommendations for VLB video card
Nntp-Posting-Host: midway.ecn.uoknor.edu
Organization: Engineering Computer Network, University of Oklahoma, Norman, OK, USA
Keywords: orchid, stealth, vlb
Lines: 21

  My brother is in the market for a high-performance video card that supports
VESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:

  - Diamond Stealth Pro Local Bus

  - Orchid Farenheit 1280

  - ATI Graphics Ultra Pro

  - Any other high-performance VLB card


Please post or email.  Thank you!

  - Matt

-- 
    |  Matthew B. Lawson <------------> (mblawson@essex.ecn.uoknor.edu)  |   
  --+-- "Now I, Nebuchadnezzar, praise and exalt and glorify the King  --+-- 
    |   of heaven, because everything he does is right and all his ways  |   
    |   are just." - Nebuchadnezzar, king of Babylon, 562 B.C.           |   

comp.sys.ibm.pc.hardware


In [104]:
newsgroups.target[300]

15

In [118]:
print(newsgroups.data[10])
print(set(newsgroups.data[10].split()))

From: sandvik@newton.apple.com (Kent Sandvik)
Subject: Re: 14 Apr 93   God's Promise in 1 John 1: 7
Organization: Cookamunga Tourist Bureau
Lines: 17

In article <1qknu0INNbhv@shelley.u.washington.edu>, > Christian:  washed in
the blood of the lamb.
> Mithraist:  washed in the blood of the bull.
> 
> If anyone in .netland is in the process of devising a new religion,
> do not use the lamb or the bull, because they have already been
> reserved.  Please choose another animal, preferably one not
> on the Endangered Species List.  

This will be a hard task, because most cultures used most animals
for blood sacrifices. It has to be something related to our current
post-modernism state. Hmm, what about used computers?

Cheers,
Kent
---
sandvik@newton.apple.com. ALink: KSAND -- Private activities on the net.

{'bull.', 'Promise', 'Bureau', 'cultures', 'Cheers,', 'the', 'If', 'has', '1:', '14', 'not', 'hard', 'new', 'Apr', 'From:', 'animal,', 'sacrifices.', 'ALink:', '17', 'lamb', 'preferably

In [119]:
print(newsgroups.data[9])
print(set(newsgroups.data[9].split()))

From: arromdee@jyusenkyou.cs.jhu.edu (Ken Arromdee)
Subject: Re: Christians above the Law? was Clarification of pe
Organization: Johns Hopkins University CS Dept.
Lines: 13

In article <C61Kow.E4z@mailer.cc.fsu.edu> dlecoint@garnet.acns.fsu.edu (Darius_Lecointe) writes:
>>Jesus was a JEW, not a Christian.

If a Christian means someone who believes in the divinity of Jesus, it is safe
to say that Jesus was a Christian.
--
"On the first day after Christmas my truelove served to me...  Leftover Turkey!
On the second day after Christmas my truelove served to me...  Turkey Casserole
    that she made from Leftover Turkey.
[days 3-4 deleted] ...  Flaming Turkey Wings! ...
   -- Pizza Hut commercial (and M*tlu/A*gic bait)

Ken Arromdee (arromdee@jyusenkyou.cs.jhu.edu)

{'Christian.', 'Casserole', 'Leftover', 'Wings!', 'above', 'first', 'it', 'say', 'the', 'from', 'Jesus', 'If', 'Arromdee)', 'me...', 'not', '...', 'From:', '<C61Kow.E4z@mailer.cc.fsu.edu>', 'served', '(and', '>>Jesus', '"On', '

In [120]:
print(newsgroups.data[8])
print(set(newsgroups.data[8].split()))

From: dchhabra@stpl.ists.ca (Deepak Chhabra)
Subject: Re: Goalie masks
Nntp-Posting-Host: stpl.ists.ca
Organization: Solar Terresterial Physics Laboratory, ISTS
Lines: 15

In article <C5sqz3.EG8@acsu.buffalo.edu> hammerl@acsu.buffalo.edu (Valerie S. Hammerl) writes:

>>[...] and I'll give Fuhr's new one an honourable mention, although I haven't
>>seen it closely yet (it looked good from a distance!).  

>This is the new Buffalo one, the second since he's been with the
>Sabres?  I recall a price tag of over $700 just for the paint job on
>that mask, and a total price of almost $1500.  Ouch.  

Yeah, it's the second one.  And I believe that price too.  I've been trying
to get a good look at it on the Bruin-Sabre telecasts, and wow! does it ever
look good.  Whoever did that paint job knew what they were doing.  And given
Fuhr's play since he got it, I bet the Bruins are wishing he didn't have it:)

--

{'S.', "haven't", 'get', 'just', 'although', 'give', 'price', 'closely', 'it', 'the', '
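
The sets printed above keep punctuation attached to words (for example `'bull.'` and `'Cheers,'`) and treat upper and lower case as different words. One possible normalization step is sketched below; `normalize_tokens` is a hypothetical helper, and the regular expression is only one of many reasonable tokenization choices.

```python
import re

def normalize_tokens(text):
    """Lowercase the text and return the set of alphabetic word tokens."""
    # [a-z']+ keeps apostrophes so that a word like "haven't" stays one token.
    return set(re.findall(r"[a-z']+", text.lower()))

# Toy excerpt taken from one of the posts printed above.
sample = "Christian:  washed in the blood of the lamb.\nCheers,\nKent"

raw = set(sample.split())         # what set(text.split()) produces
clean = normalize_tokens(sample)  # punctuation stripped, case folded

print(sorted(raw))
print(sorted(clean))
```

After normalization, `'lamb.'` and `'lamb'` count as the same word, which makes word-based comparisons between posts less noisy.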

In [125]:
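# print(newsgroups.data[8][0])
tokens = newsgroups.data[8].split()  # str.split() already returns a list
print(tokens)
num = tokens.index("Lines:")  # position of the last header field in this post
print()
# The loop was left unfinished in the original cell. As an assumption, iterate
# over the tokens after "Lines:" and its value, i.e. the message body.
for token in tokens[num + 2:]:
    print(token, end=" ")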

['From:', 'dchhabra@stpl.ists.ca', '(Deepak', 'Chhabra)', 'Subject:', 'Re:', 'Goalie', 'masks', 'Nntp-Posting-Host:', 'stpl.ists.ca', 'Organization:', 'Solar', 'Terresterial', 'Physics', 'Laboratory,', 'ISTS', 'Lines:', '15', 'In', 'article', '<C5sqz3.EG8@acsu.buffalo.edu>', 'hammerl@acsu.buffalo.edu', '(Valerie', 'S.', 'Hammerl)', 'writes:', '>>[...]', 'and', "I'll", 'give', "Fuhr's", 'new', 'one', 'an', 'honourable', 'mention,', 'although', 'I', "haven't", '>>seen', 'it', 'closely', 'yet', '(it', 'looked', 'good', 'from', 'a', 'distance!).', '>This', 'is', 'the', 'new', 'Buffalo', 'one,', 'the', 'second', 'since', "he's", 'been', 'with', 'the', '>Sabres?', 'I', 'recall', 'a', 'price', 'tag', 'of', 'over', '$700', 'just', 'for', 'the', 'paint', 'job', 'on', '>that', 'mask,', 'and', 'a', 'total', 'price', 'of', 'almost', '$1500.', 'Ouch.', 'Yeah,', "it's", 'the', 'second', 'one.', 'And', 'I', 'believe', 'that', 'price', 'too.', "I've", 'been', 'trying', 'to', 'get', 'a', 'good', 'look'
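
Searching the token list for `"Lines:"` works for this post, but it depends on which headers a post has and in what order. A sketch of an alternative, assuming the headers are separated from the body by the first blank line as in the posts printed above. The `post` string is a shortened stand-in for `newsgroups.data[8]` so the example runs without the dataset.

```python
# Toy stand-in for newsgroups.data[8], shortened from the output above.
post = (
    "From: dchhabra@stpl.ists.ca (Deepak Chhabra)\n"
    "Subject: Re: Goalie masks\n"
    "Organization: Solar Terresterial Physics Laboratory, ISTS\n"
    "Lines: 15\n"
    "\n"
    "Yeah, it's the second one.  And I believe that price too.\n"
)

# Assumption: headers end at the first blank line, so split there once.
header_text, _, body = post.partition("\n\n")

# Each header line has the form "Name: value"; split on the first ": " only.
headers = {}
for line in header_text.splitlines():
    name, _, value = line.partition(": ")
    headers[name] = value

print(headers["Subject"])
print(body.split())
```

If the headers are not needed at all, `fetch_20newsgroups` also accepts a `remove` argument, e.g. `remove=('headers', 'footers', 'quotes')`, which strips those parts when the data is loaded.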

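In the breast cancer and tree species examples, every feature was either numeric or categorical, so a split condition such as "above or below a threshold" or "is or is not this category" could divide the samples cleanly into YES or NO.

With text data, it is hard to find a condition that splits into YES or NO.  
-> Checking only whether a word is present or absent ignores homonyms and word meanings that shift with context.  
-> The topic has to be chosen after analyzing what the text expresses, but it is hard to decide which part of the text to use as the feature behind a YES/NO split.  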
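
The first limitation can be illustrated with a small sketch. The three posts below are invented for illustration and are not taken from the dataset; `contains` is a hypothetical helper that plays the role of a YES/NO split condition on a single word.

```python
import re

# Toy posts (invented for illustration, not taken from the dataset).
posts = [
    ("comp.os.ms-windows.misc", "My windows install crashes when I open the file manager."),
    ("misc.forsale", "House for sale, new windows and doors installed last year."),
    ("rec.autos", "The engine stalls whenever I open the throttle."),
]

def contains(word, text):
    """YES/NO split condition: does the text contain this exact word?"""
    return word in re.findall(r"[a-z]+", text.lower())

# Apply the same split condition to every post.
for label, text in posts:
    print(label, contains("windows", text))
```

The first two posts both answer YES to "contains `windows`" even though the word refers to an operating system in one and to part of a house in the other, so this single condition cannot separate the two topics. In scikit-learn, presence/absence features of this kind can be produced with `CountVectorizer(binary=True)`.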