In [8]:
import pandas as pd

## The 20 newsgroups dataset

The [20 Newsgroups data set](http://qwone.com/~jason/20Newsgroups/) is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

The data is organized into 20 different newsgroups, each corresponding to a different topic:

- `alt.atheism`
- `comp.graphics`
- `comp.os.ms-windows.misc`
- `comp.sys.ibm.pc.hardware`
- `comp.sys.mac.hardware`
- `comp.windows.x`
- `misc.forsale`
- `rec.autos`
- `rec.motorcycles`
- `rec.sport.baseball`
- `rec.sport.hockey`
- `sci.crypt`
- `sci.electronics`
- `sci.med`
- `sci.space`
- `soc.religion.christian`
- `talk.politics.guns`
- `talk.politics.mideast`
- `talk.politics.misc`
- `talk.religion.misc`

We will work on a subset of the dataset, using only 7 of the 20 available categories:

In [5]:
categories = ['alt.atheism', 'soc.religion.christian', 'rec.autos', 'sci.electronics',
               'comp.graphics', 'sci.med', 'sci.space']
categories

['alt.atheism',
 'soc.religion.christian',
 'rec.autos',
 'sci.electronics',
 'comp.graphics',
 'sci.med',
 'sci.space']

In [31]:
from sklearn.datasets import fetch_20newsgroups

In [32]:
data = fetch_20newsgroups(categories=categories)

In [33]:
X = pd.DataFrame(data.data, columns=['text'])
y = pd.Series(data.target)
X['label'] = y

In [34]:
len(X)

4035

In [35]:
data.target_names

['alt.atheism',
 'comp.graphics',
 'rec.autos',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian']

In [36]:
X.label = X.label.replace({0:'atheism',
                           1:'computer graphics',
                           2:'autos',
                           3:'electronics',
                           4:'medicine',
                           5:'space',
                           6:'christianity'})
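The mapping above relies on the integer codes following the order of `data.target_names`. A minimal sketch of a more defensive variant keys the readable names by newsgroup name and derives the integer mapping from that order. The `toy_target` series is a hypothetical stand-in for `data.target`, and `readable` and `index_to_label` are names introduced here for illustration only.

```python
import pandas as pd

# Newsgroup names in the order reported by data.target_names in this notebook.
target_names = ['alt.atheism', 'comp.graphics', 'rec.autos', 'sci.electronics',
                'sci.med', 'sci.space', 'soc.religion.christian']

# Readable names keyed by newsgroup name rather than by integer position.
readable = {
    'alt.atheism': 'atheism',
    'comp.graphics': 'computer graphics',
    'rec.autos': 'autos',
    'sci.electronics': 'electronics',
    'sci.med': 'medicine',
    'sci.space': 'space',
    'soc.religion.christian': 'christianity',
}

# Derive the index -> label mapping from the order of target_names.
index_to_label = {i: readable[name] for i, name in enumerate(target_names)}

# Hypothetical toy targets standing in for data.target.
toy_target = pd.Series([0, 2, 5, 6, 5])
toy_labels = toy_target.map(index_to_label)
print(toy_labels.tolist())
```

If the category list changes, only `readable` needs updating, and a missing entry raises a `KeyError` instead of silently mislabelling rows.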

In [37]:
X

| | text | label |
|---|---|---|
| 0 | From: keith@cco.caltech.edu (Keith Allan Schne... | atheism |
| 1 | From: rcg1597@zeus.tamu.edu (GUYNN, RICHARD CA... | autos |
| 2 | From: henry@zoo.toronto.edu (Henry Spencer)\nS... | space |
| 3 | From: wjhovi01@ulkyvx.louisville.edu\nSubject:... | christianity |
| 4 | From: tholen@galileo.ifa.hawaii.edu (Dave Thol... | space |
| ... | ... | ... |
| 4030 | From: scott@psy.uwa.oz.au (Scott Fisher)\nSubj... | autos |
| 4031 | From: markm@bigfoot.sps.mot.com (Mark Monninge... | autos |
| 4032 | From: steinly@topaz.ucsc.edu (Steinn Sigurdsso... | space |
| 4033 | From: dje@bmw535.NoSubdomain.NoDomain (Don Eil... | autos |


In [38]:
# save the labelled messages without writing the row index
X.to_csv('messages.csv', index=False)

In [39]:
# reload the saved 20 Newsgroups subset from disk
data = pd.read_csv('messages.csv')

In [40]:
data.head()

| | text | label |
|---|---|---|
| 0 | From: keith@cco.caltech.edu (Keith Allan Schne... | atheism |
| 1 | From: rcg1597@zeus.tamu.edu (GUYNN, RICHARD CA... | autos |
| 2 | From: henry@zoo.toronto.edu (Henry Spencer)\nS... | space |
| 3 | From: wjhovi01@ulkyvx.louisville.edu\nSubject:... | christianity |
| 4 | From: tholen@galileo.ifa.hawaii.edu (Dave Thol... | space |


In [41]:
# number of messages per topic
data.label.value_counts()

christianity         599
autos                594
medicine             594
space                593
electronics          591
computer graphics    584
atheism              480
Name: label, dtype: int64

In [18]:
# display one example of each topic
print(X[X.label=='christianity'].iloc[0].text) 

From: wjhovi01@ulkyvx.louisville.edu
Subject: Re: Hebrew grammar texts--choose English or German?
Organization: University of Louisville
Lines: 37

Phil Sells writes:

> Probably a tired old horse, but...  maybe with a slightly different
> twist.  I wanted to know if there are any good English-language texts
> for learning ancient Hebrew, and how these compare with German
> educational texts qualitywise, if anybody has an idea.  I can't figure
> out if I should buy one here for later study or wait until I get back to
> the U.S.

My impression is that *for advanced work* you will be much better off with
German reference works (lexicons, concordances especially).  For a first-time
encounter, my *personal* preference would be to deal with a textbook written in
my native language.  But if you know German and are in Germany, pick up all the
reference books you think you can handle.  (I only know these works by
reputation, since my German is most rusty, but I'd look at the following books:
K

In [19]:
print(X[X.label=='medicine'].iloc[0].text) 

From: geb@cs.pitt.edu (Gordon Banks)
Subject: Re: Need Info on RSD
Reply-To: geb@cs.pitt.edu (Gordon Banks)
Organization: Univ. of Pittsburgh Computer Science
Lines: 13

In article <1993Mar27.004627.21258@rmtc.Central.Sun.COM> lrd@rmtc.Central.Sun.COM writes:
>I just started working for a rehabilitation hospital and have seen RSD
>come up as a diagnosis several times.  What exactly is RSD and what is
>the nature of it?  If there is a FAQ on this subject, I'd really
>appreciate it if someone would mail it to me.  While any and all

Reflex sympathetic dystrophy.  I'm sure there's an FAQ, as I have
made at least 10 answers to questions on it in the last year or so.
-- 
----------------------------------------------------------------------------
Gordon Banks  N3JXP      | "Skepticism is the chastity of the intellect, and
geb@cadre.dsl.pitt.edu   |  it is shameful to surrender it too soon." 
----------------------------------------------------------------------------



In [20]:
print(X[X.label=='autos'].iloc[0].text) 

From: rcg1597@zeus.tamu.edu (GUYNN, RICHARD CARL)
Subject: Re: MGBs and the real world
Article-I.D.: zeus.5APR199321160020
Distribution: world
Organization: Texas A&M University, Academic Computing Services
Lines: 34
NNTP-Posting-Host: zeus.tamu.edu
News-Software: VAX/VMS VNEWS 1.41

In article <1993Apr5.181056.29411@mks.com>, mike@mks.com (Mike Brookbank) writes...
>My sister has an MGB.  She has one from the last year they were produced
>(1978? 1979?).  Its in very good shape.  I've been bugging her for years

	Last year produced: 1980.

>about selling it.  I've said over and over that she should sell it
>before the car is worthless while she maintains that the car may
>actually be increasing in value as a result of its limited availability.
> 
>Which one of us is right?  Are there MGB affectionados out there who are
>still willing to pay $6K to 8K for an old MG?  Are there a lot out in the 
>market?
>-- 

	Yes, there are still alot of MGBs out there.  The earlier cars (pre
 74-1/2) 

In [86]:
print(X[X.label=='space'].iloc[0].text) 

From: steinly@topaz.ucsc.edu (Steinn Sigurdsson)
Subject: Re: Commercial mining activities on the moon
Organization: Lick Observatory/UCO
Lines: 26
	<1993Apr20.204838.13217@cs.rochester.edu>	<STEINLY.93Apr20145301@topaz.ucsc.edu>	<1993Apr20.223807.16712@cs.rochester.edu>,<STEINLY.93Apr20160116@topaz.ucsc.edu>
	<1r46j3INN14j@mojo.eng.umd.edu>
NNTP-Posting-Host: topaz.ucsc.edu
In-reply-to: sysmgr@king.eng.umd.edu's message of 21 Apr 1993 19:16:51 GMT

In article <1r46j3INN14j@mojo.eng.umd.edu> sysmgr@king.eng.umd.edu (Doug Mohney) writes:

   In article <STEINLY.93Apr20160116@topaz.ucsc.edu>, steinly@topaz.ucsc.edu (Steinn Sigurdsson) writes:

   >Very cost effective if you use the right accounting method :-)

   Sherzer Methodology!!!!!!

Hell, yes. I'm not going to let a bunch of seven suits tell
me what the right way to estimate cost effectiveness is, at
least not until they can make their mind up long enough
to leave their scheme stable for a fiscal year or two.


Seriously though. I

In [42]:
print(X[X.label=='computer graphics'].iloc[0].text) 

From: schaefer@imag.imag.fr (Arno Schaefer)
Subject: Re: CView answers
Nntp-Posting-Host: silene
Organization: Institut Imag, Grenoble, France
Lines: 32

In article <C5LErr.1J3@rahul.net>, bryanw@rahul.net (Bryan Woodworth) writes:
|> In <1993Apr16.114158.2246@whiting.mcs.com> sean@whiting.mcs.com (Sean Gum) writes:
|> 
|> >A stupid question, but what will CView run on and where can I get it? I
|> >am still in need of a GIF viewer for Linux. (Without X-Windows.)
|> >Thanks!
|> > 
|> 
|> Ho boy. There is no way in HELL you are going to be able to view GIFs or do
|> any other graphics in Linux without X windows!  I love Linux because it is
|> so easy to learn..  You want text?  Okay.   Use Linux. You want text AND
|> graphics?  Use Linux with X windows.  Simple.  Painless.  REQUIRED to have
|> X Windows if you want graphics!  This includes fancy word processors like
|> doc, image viewers like xv, etc.
|> 

Sorry, Bryan, this is not quite correct. Remember the VGALIB package that comes
wi

In [43]:
print(X[X.label=='atheism'].iloc[0].text) 

From: keith@cco.caltech.edu (Keith Allan Schneider)
Subject: Re: <Political Atheists?
Organization: California Institute of Technology, Pasadena
Lines: 25
NNTP-Posting-Host: lloyd.caltech.edu

livesey@solntze.wpd.sgi.com (Jon Livesey) writes:

>> The probability that the "automobile system" will kill someone 
>> innocent in an accident goes asymptotically close to 1, just 
>> like the court system.
>However, anyone who doesn't like the "automobile system" can
>opt out, as I have.

This isn't true.  Many people are forced to use the "automobile system."
I certainly don't use it by choice.  If there were other ways of getting
around, I'd do it.

>Secondly, we do try to make the "automobile system" as safe
>as possible, because we *do* recognize the danger to the 
>innocent, whereas the US - the current example - is not trying
>to make the "Court System" safer, which it could fairly easily
>do by replacing fatal punishments with non-fatal punishments.

But I think that the Court system ha

In [44]:
print(X[X.label=='electronics'].iloc[0].text) 

From: Mike Diack <mike-d@staff.tc.umn.edu>
Subject: Anyone know about DATA I/O device proggers ?
X-Xxmessage-Id: <A7F5DAE6E6026550@dialup-slip-1-80.gw.umn.edu>
X-Xxdate: Sat, 17 Apr 93 16:03:50 GMT
Nntp-Posting-Host: dialup-slip-1-80.gw.umn.edu
Organization: persian cat & carpet co.
X-Useragent: Nuntius v1.1.1d7
Lines: 9

I keep finding these programmers in local junk shops. This may
mean that they are indeed junk - but i'd like to hear from anyone 
else that may have met up with them. The basic device is a
"Data I/O 29A universal programmer", and the usual pod is a 
"LogicPak 303A-Vo4" with a "303A-001" programming tester/
adapter. I'd really like to hear from anyone who knows whether
these monsters are worth bothering with. All i want to do is blast
PALCE22V10s. - Ideas, folks
Mike.



**Goal**: classify the messages in the dataset by topic.

In [45]:
X = data.text
y = data.label

In [46]:
from sklearn.model_selection import train_test_split
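# default split: 75% train / 25% test; no random_state is set, so the split differs between runs
X_train, X_test, y_train, y_test = train_test_split(X, y)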
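Because the split above is random, rerunning the notebook gives a different test set and slightly different scores. A minimal sketch of a reproducible, class-balanced split follows, using the `random_state` and `stratify` parameters of `train_test_split`. The toy corpus, the seed value `0`, and the `toy_`-prefixed names are assumptions for illustration, not part of the original notebook.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 8 short messages, 4 per topic.
toy_texts = [
    "my car needs new brakes", "engine oil change", "best tires for winter",
    "used sedan for sale", "shuttle launch delayed", "orbit of the moon",
    "mars rover images", "satellite telemetry data",
]
toy_topics = ['autos'] * 4 + ['space'] * 4

# random_state fixes the shuffle; stratify keeps class proportions in both parts.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_topics, test_size=0.25, random_state=0, stratify=toy_topics
)

print(len(toy_X_train), len(toy_X_test))
print(sorted(toy_y_test))
```

Stratifying matters most when classes are unbalanced. Here the topics are close to balanced, so the main benefit is the fixed seed.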

In [48]:
# initialize the vectorizer; ngram_range=(1, 1) (unigrams only) is already the default
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(ngram_range=(1,1))

In [49]:
# learn training vocabulary, then use it to create a document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)

In [50]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
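To make the fit/transform steps concrete, here is a small sketch on a toy corpus (the sentences and the `toy_`-prefixed names are assumptions for illustration). It shows that the vocabulary is learned from the training documents only, so words that appear only in the test document are ignored when it is transformed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents standing in for X_train and X_test.
toy_train_docs = ["the car is fast", "the rocket is fast"]
toy_test_docs = ["the shuttle is a fast rocket"]

toy_vect = CountVectorizer(ngram_range=(1, 1))
toy_vect.fit(toy_train_docs)

# Vocabulary terms ordered by their column index in the document-term matrix.
toy_terms = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(toy_terms)

# 'shuttle' was never seen during fit, and the default tokenizer drops
# single-character tokens such as 'a', so neither gets a column.
toy_test_dtm = toy_vect.transform(toy_test_docs)
print(toy_test_dtm.toarray())
```

The result of `transform` is a sparse matrix with one row per document and one column per vocabulary term, which is why the test matrix must be built with the vectorizer fitted on the training data.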

In [51]:
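from sklearn.linear_model import LogisticRegression

# fit a logistic regression classifier on the training document-term matrix,
# then predict the topic of each test message
log_clf = LogisticRegression(max_iter=1000)
log_clf.fit(X_train_dtm, y_train)
y_test_pred = log_clf.predict(X_test_dtm)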

In [52]:
# evaluate the model
from sklearn.metrics import accuracy_score, confusion_matrix

In [53]:
# rows are true labels, columns are predicted labels, both in the order of log_clf.classes_
confusion_matrix(y_test, y_test_pred)

array([[109,   0,   3,   0,   0,   1,   0],
       [  1, 135,   1,   2,   7,   2,   0],
       [  0,   1, 132,   3,   3,   4,   0],
       [  1,   2,   0, 134,   3,   1,   1],
       [  0,   8,   4,  10, 135,   1,   1],
       [  1,   3,   1,   6,   4, 138,   1],
       [  0,   1,   1,  10,   5,   1, 132]], dtype=int64)

In [55]:
log_clf.classes_

array(['atheism', 'autos', 'christianity', 'computer graphics',
       'electronics', 'medicine', 'space'], dtype=object)
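The raw array is easier to read once its rows and columns carry the class names. The sketch below copies the confusion matrix and class order printed above into a labelled DataFrame and derives overall accuracy and per-class recall from it. The variable names (`cm`, `cm_df`, `recall`) are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Confusion matrix and class order copied from the outputs above.
cm = np.array([
    [109,   0,   3,   0,   0,   1,   0],
    [  1, 135,   1,   2,   7,   2,   0],
    [  0,   1, 132,   3,   3,   4,   0],
    [  1,   2,   0, 134,   3,   1,   1],
    [  0,   8,   4,  10, 135,   1,   1],
    [  1,   3,   1,   6,   4, 138,   1],
    [  0,   1,   1,  10,   5,   1, 132],
])
classes = ['atheism', 'autos', 'christianity', 'computer graphics',
           'electronics', 'medicine', 'space']

# Rows: true topic. Columns: predicted topic.
cm_df = pd.DataFrame(cm, index=classes, columns=classes)
print(cm_df)

# Accuracy is the share of test messages on the diagonal.
accuracy = np.trace(cm) / cm.sum()
print(f"accuracy: {accuracy:.3f}")

# Recall per topic: correct predictions divided by the row total.
recall = pd.Series(np.diag(cm) / cm.sum(axis=1), index=classes)
print(recall.round(3))
```

For this particular split, 915 of the 1009 test messages are classified correctly (about 0.907), and `electronics` has the lowest recall, with most of its errors going to `computer graphics` and `autos`.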