## Working with text data example  
Based on [sklearn example](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html).

In [1]:
from sklearn.datasets import fetch_20newsgroups
import numpy as np

In [2]:
newsgroups_train = fetch_20newsgroups(subset='train')
all_topics = list(newsgroups_train.target_names)
print("All the topics:")
print("-" * 20)
for topic in all_topics:
    print(topic)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


All the topics:
--------------------
alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc


### Pick topics of interest

In [3]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics',
                  'sci.med']

In [4]:
twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                      shuffle=True, random_state=42)
# the documents are in the .data attribute, the file paths in .filenames
num_docs = len(twenty_train.data)
targets = sorted(list(set(twenty_train.target)))
target_names = twenty_train.target_names
print("There are {0} documents in the dataset.".format(num_docs))
print("The targets are: {0}.".format(targets))
print("The target names are {0}.".format(target_names))

There are 2257 documents in the dataset.
The targets are: [0, 1, 2, 3].
The target names are ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian'].


### Pick a document of interest

In [5]:
doi = 99
print("\nInvestigating document {0}.".format(doi))
print("Text in the document:\n")
print(twenty_train.data[doi])
print("\nIts topic:")
ind_target = twenty_train.target[doi]
print(twenty_train.target_names[ind_target])


Investigating document 99.
Text in the document:

From: bobbe@vice.ICO.TEK.COM (Robert Beauchaine)
Subject: Re: <<Pompous ass
Organization: Tektronix Inc., Beaverton, Or.
Lines: 20

In article <1ql6jiINN5df@gap.caltech.edu> keith@cco.caltech.edu (Keith Allan Schneider) writes:
>
>The "`little' things" above were in reference to Germany, clearly.  People
>said that there were similar things in Germany, but no one could name any.
>They said that these were things that everyone should know, and that they
>weren't going to waste their time repeating them.  Sounds to me like no one
>knew, either.  I looked in some books, but to no avail.

  If the Anne Frank exhibit makes it to your small little world,
  take an afternoon to go see it.  


/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\ 

Bob Beauchaine bobbe@vice.ICO.TEK.COM 

They said that Queens could stay, they blew the Bronx away,
and sank Manhattan out at sea.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

## Analysis
Start by turning the corpus (the collection of documents) into a bag-of-words.

**Bag of words**: A sentence/document is represented by the counts of words in it, disregarding word order.

To build it, assign a fixed integer to each word occurring in any document of the training set.  Build a dictionary whose keys are the words and whose values are the words' column indices.

Next, for each document **i**, count the number of occurrences of each word **w** and store it in the matrix X at X[i, j], where **j** is the column index for word **w**.
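
As a minimal sketch of the two steps above, the following builds the word-to-column dictionary and the count matrix by hand for a three-document toy corpus. The corpus, the `tokenize` helper, and the variable names are assumptions made for illustration; unlike the `CountVectorizer` call used in this notebook, the sketch does not remove stop words.

```python
import re

# Toy corpus (hypothetical, not taken from the 20 newsgroups data).
docs = ["the cat sat", "the cat saw the dog", "dogs bark"]


def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


# Step 1: assign each distinct word a column index (alphabetical order).
all_words = sorted({word for doc in docs for word in tokenize(doc)})
vocabulary = {word: j for j, word in enumerate(all_words)}

# Step 2: X[i][j] is the number of times word j occurs in document i.
X = [[0] * len(vocabulary) for _ in docs]
for i, doc in enumerate(docs):
    for word in tokenize(doc):
        X[i][vocabulary[word]] += 1

print(vocabulary)
for row in X:
    print(row)
```

Most entries of X are zero even in this tiny example, which is why the real count matrix is stored in a sparse format.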

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(lowercase=True, tokenizer=None, stop_words='english',
                             analyzer='word', max_df=1.0, min_df=1,
                             max_features=None)
count_vect.fit(twenty_train.data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [7]:
# here is the dictionary linking words to column indices
count_vect.vocabulary_

{'sd345': 28821,
 'city': 8638,
 'ac': 4015,
 'uk': 32993,
 'michael': 21514,
 'collier': 8972,
 'subject': 30852,
 'converting': 9745,
 'images': 17239,
 'hp': 16792,
 'laserjet': 19643,
 'iii': 17175,
 'nntp': 22956,
 'posting': 25466,
 'host': 16759,
 'hampton': 15973,
 'organization': 23731,
 'university': 33332,
 'lines': 20112,
 '14': 587,
 'does': 11985,
 'know': 19321,
 'good': 15468,
 'way': 34483,
 'standard': 30399,
 'pc': 24457,
 'application': 5257,
 'pd': 24483,
 'utility': 33646,
 'convert': 9741,
 'tif': 32137,
 'img': 17262,
 'tga': 31887,
 'files': 14191,
 'format': 14580,
 'like': 20057,
 'hpgl': 16803,
 'plotter': 25164,
 'email': 12756,
 'response': 27636,
 'correct': 9872,
 'group': 15729,
 'thanks': 31905,
 'advance': 4375,
 'programmer': 25978,
 'computer': 9279,
 'unit': 33307,
 'tel': 31687,
 '071': 177,
 '477': 2326,
 '8000': 3062,
 'x3769': 35116,
 'london': 20318,
 'fax': 13998,
 '8565': 3166,
 'ec1v': 12470,
 '0hb': 230,
 'ani': 5025,
 'ms': 22211,
 'uky':

In [8]:
for key in sorted(count_vect.vocabulary_.keys()):
    print("{0:<20s} {1}".format(key, count_vect.vocabulary_[key]))

00                   0
000                  1
0000                 2
0000001200           3
000005102000         4
0001                 5
000100255pixel       6
00014                7
000406               8
0007                 9
000usd               10
0010                 11
001004               12
0010580b             13
001125               14
001200201pixel       15
0014                 16
001642               17
00196                18
002                  19
0028                 20
003258u19250         21
0033                 22
0038                 23
0039                 24
004021809            25
004158               26
004627               27
0049                 28
00500                29
005148               30
00630                31
008561               32
0094                 33
00am                 34
00index              35
00pm                 36
01                   37
0100                 38
010116               39
010702               40
011255               41
01

508                  2409
509                  2410
50b                  2411
50mg                 2412
51                   2413
510                  2414
5103                 2415
5108                 2416
512                  2417
512k                 2418
512x512              2419
513                  2420
51351                2421
514                  2422
515                  2423
516                  2424
517                  2425
518                  2426
519                  2427
52                   2428
5200                 2429
52000                2430
5215                 2431
522                  2432
52223                2433
523                  2434
523296               2435
524                  2436
5245                 2437
525                  2438
5252                 2439
5254                 2440
525714               2441
5258                 2442
527                  2443
529                  2444
5298                 2445
5299                 2446
52nd        

achieved             4132
achievement          4133
achieves             4134
achieving            4135
achim                4136
aching               4137
achive               4138
achived              4139
achses               4140
acid                 4141
acidic               4142
acidophilis          4143
acidophilius         4144
acidophilous         4145
acidophilus          4146
acids                4147
acis                 4148
ack                  4149
acknosledge          4150
acknowledge          4151
acknowledged         4152
acknowledgement      4153
acknowledgements     4154
acknowledges         4155
acknowledging        4156
acknowledgment       4157
acknowleding         4158
aclimatized          4159
acm                  4160
acme                 4161
acn                  4162
acne                 4163
acns                 4164
acooper              4165
acording             4166
acorn                4167
acosta               4168
acoustical           4169
acoustique  

berry                6479
berry_               6480
berryh               6481
berryhill            6482
bert                 6483
berthe               6484
bertil               6485
bertrand             6486
beseeched            6487
beset                6488
besetting            6489
besieged             6490
besler               6491
besmith              6492
best                 6493
best24               6494
bestows              6495
bet                  6496
beta                 6497
betcha               6498
beth                 6499
bethesda             6500
bethke               6501
bethlehem            6502
bethulah             6503
betray               6504
betrayal             6505
betrayed             6506
betrayer             6507
betrothed            6508
bette                6509
better               6510
betts                6511
betty                6512
betweed              6513
bev                  6514
bevans               6515
bevelizes            6516
bevelled    

census               8099
cent                 8100
center               8101
centered             8102
centers              8103
centigram            8104
centimeter           8105
central              8106
centralia            8107
centralization       8108
centrally            8109
centre               8110
centres              8111
centric              8112
centrifuge           8113
centris              8114
centro               8115
centroid             8116
cents                8117
centure              8118
centuries            8119
centurion            8120
century              8121
ceo                  8122
cephas               8123
cept                 8124
cereal               8125
cereals              8126
cerebellum           8127
cerebrospinal        8128
ceredase             8129
ceremonial           8130
ceremonies           8131
ceremony             8132
cerermony            8133
cerl                 8134
cern                 8135
cernapo              8136
cerrina     

curvature            10408
curve                10409
curved               10410
curves               10411
cush                 10412
cusp                 10413
cusps                10414
cust                 10415
custer               10416
custody              10417
custom               10418
customary            10419
customer             10420
customers            10421
customizable         10422
customization        10423
customized           10424
customs              10425
cut                  10426
cute                 10427
cutoff               10428
cuts                 10429
cutting              10430
cuyler               10431
cuz                  10432
cv                   10433
cv4                  10434
cv7                  10435
cva                  10436
cvadrmaz             10437
cview                10438
cview097             10439
cvo                  10440
cvs                  10441
cvt                  10442
cvtstu               10443
cw                   10444
c

eching               12485
echo                 12486
echocardiography     12487
echoed               12488
eckart               12489
ecl                  12490
eclipse              12491
ecn                  12492
ecole                12493
ecological           12494
ecomplaint           12495
economic             12496
economical           12497
economically         12498
economics            12499
economos             12500
economy              12501
ecosystems           12502
ecpdsharmony         12503
ecr                  12504
ecs                  12505
ecst                 12506
ecstasy              12507
ecstatic             12508
ecsvax               12509
ecuador              12510
ecublens             12511
ecumenical           12512
ecumenism            12513
eczcaw               12514
ed                   12515
edb                  12516
edb9140              12517
eddie                12518
eden                 12519
eder                 12520
ederveen             12521
e

fred                 14713
freddie              14714
frederic             14715
frederick            14716
freds                14717
free                 14718
freebie              14719
freed                14720
freedom              14721
freedoms             14722
freeform             14723
freehand             14724
freely               14725
freeman              14726
freemant             14727
freemasonry          14728
freemasons           14729
freemont             14730
freenet              14731
freepost             14732
freethinker          14733
freethinkers         14734
freethought          14735
freeware             14736
freewill             14737
freeze               14738
freezing             14739
freind               14740
french               14741
frenchman            14742
frenzy               14743
freq                 14744
frequencies          14745
frequency            14746
frequent             14747
frequently           14748
fresa                14749
f

hesitations          16406
hess                 16407
hessian              16408
het                  16409
heterogeneity        16410
heterogeneous        16411
heteroorthodox       16412
heteropathic         16413
heterosexual         16414
hetersexual          16415
heuristic            16416
hew                  16417
hewlett              16418
hewn                 16419
hex                  16420
hexagon              16421
hexagonal            16422
hexagons             16423
hey                  16424
heydt                16425
heylighen            16426
hfs                  16427
hfsi                 16428
hhs                  16429
hhuang               16430
hi                   16431
hian                 16432
hibbard              16433
hickman              16434
hickory              16435
hicn                 16436
hicn610              16437
hicnet               16438
hicolor              16439
hicomb               16440
hid                  16441
hidden               16442
h

interperated         18081
interpolate          18082
interpolated         18083
interpolates         18084
interpolation        18085
interpoleerlineair   18086
interpret            18087
interpretation       18088
interpretationa      18089
interpretations      18090
interprete           18091
interpreted          18092
interpreter          18093
interpreters         18094
interpreting         18095
interpretor          18096
interprets           18097
interracial          18098
interrelate          18099
interrelated         18100
interresting         18101
interrogation        18102
interrogationum      18103
interrupt            18104
interruption         18105
interrupts           18106
intersect            18107
intersecting         18108
intersection         18109
intersections        18110
intersects           18111
intersting           18112
intertestamental     18113
intertwined          18114
interurban           18115
interval             18116
intervals            18117
i

lo                   20256
loa                  20257
load                 20258
loaded               20259
loading              20260
loads                20261
loan                 20262
loaned               20263
loans                20264
loasil               20265
loathe               20266
lobby                20267
lobbying             20268
lobe                 20269
lobo                 20270
lobotomy             20271
local                20272
locale               20273
localhost            20274
localized            20275
locally              20276
locals               20277
locate               20278
located              20279
locating             20280
location             20281
locations            20282
lock                 20283
locke                20284
locked               20285
lockheed             20286
locking              20287
locks                20288
locle                20289
locus                20290
locust               20291
locutions            20292
l

ninety               22904
nineveh              22905
nintendo             22906
ninth                22907
nirvana              22908
nis                  22909
nish                 22910
nishantha            22911
nishi                22912
nist                 22913
nistuk               22914
nites                22915
nitpick              22916
nitpicks             22917
nitrosamines         22918
nitrosiamines        22919
nitta                22920
nitty                22921
nity                 22922
niv                  22923
nive                 22924
niven                22925
nixdorf              22926
nixon                22927
nizoral              22928
nj                   22929
njbc                 22930
njit                 22931
njitgw               22932
njq                  22933
nk                   22934
nkjv                 22935
nl                   22936
nl__                 22937
nlm                  22938
nlp                  22939
nlpers               22940
n

persevere            24689
persia               24690
persians             24691
persist              24692
persistance          24693
persistant           24694
persistence          24695
persistent           24696
persisting           24697
persoanl             24698
person               24699
persona              24700
personable           24701
personage            24702
personages           24703
personal             24704
personalities        24705
personality          24706
personally           24707
personaly            24708
personification      24709
personified          24710
personify            24711
personnel            24712
persons              24713
perspective          24714
perspectives         24715
perspicacious        24716
persuade             24717
persuaded            24718
persuasions          24719
persuasive           24720
pertain              24721
pertaining           24722
pertains             24723
perterist            24724
perth                24725
p

reason               26903
reasonability        26904
reasonable           26905
reasonable_          26906
reasonableness       26907
reasonably           26908
reasonalby           26909
reasoned             26910
reasoning            26911
reasons              26912
reassurance          26913
reassure             26914
reastful             26915
reatil               26916
reattaching          26917
rebel                26918
rebelled             26919
rebelling            26920
rebellion            26921
rebirth              26922
reboots              26923
reborn               26924
rebound              26925
rebroadcast          26926
rebuild              26927
rebuilding           26928
rebuilt              26929
rebuke               26930
rebuked              26931
rebuking             26932
rebut                26933
rec                  26934
recall               26935
recalled             26936
recanted             26937
recapitulate         26938
reccommended         26939
r

shamed               29199
shameful             29200
shamim               29201
shamokin             29202
shamos               29203
shampoos             29204
shan                 29205
shank                29206
shankley             29207
shannon              29208
shanti               29209
shao                 29210
shapard              29211
shape                29212
shaped               29213
shapes               29214
shapiro              29215
shaprio              29216
shar                 29217
sharan               29218
shards               29219
share                29220
shared               29221
sharen               29222
shares               29223
shareware            29224
sharing              29225
sharnoff             29226
sharon               29227
sharp                29228
sharpen              29229
sharpening           29230
sharpimage           29231
sharply              29232
sharrar              29233
sharynk              29234
shatim               29235
s

substantiate         30902
substantiated        30903
substantiates        30904
substantiating       30905
substantiation       30906
substantive          30907
substantively        30908
substatiation        30909
substitute           30910
substituted          30911
substituting         30912
substitution         30913
substructures        30914
subsystem            30915
subsystems           30916
subtic               30917
subtitled            30918
subtle               30919
subtly               30920
subtopic             30921
subtract             30922
subtracted           30923
subtraction          30924
subtractions         30925
subtractive          30926
subtype              30927
subunit              30928
suburban             30929
suburbs              30930
subvert              30931
subviews             30932
subvolumes           30933
subway               30934
succeed              30935
succeeded            30936
succeeds             30937
succes               30938
s

twist                32856
twisted              32857
twisto               32858
twitching            32859
twixt                32860
twn                  32861
twncu865             32862
twong                32863
twosey               32864
twosies              32865
twpierce             32866
twu                  32867
tx                   32868
txt                  32869
ty                   32870
tycchow              32871
tychay               32872
tylenol              32873
tyndale              32874
typ                  32875
type                 32876
typed                32877
typeface             32878
types                32879
typewatch            32880
typhoid              32881
typical              32882
typically            32883
typing               32884
typingtutor          32885
typists              32886
typology             32887
typos                32888
tyrannic             32889
tyrannical           32890
tyranny              32891
tyrant               32892
t

wahlgren             34332
waikato              34333
wail                 34334
wailing              34335
wainwright           34336
wais                 34337
waistband            34338
wait                 34339
waited               34340
waiter               34341
waiting              34342
waitng               34343
waits                34344
waive                34345
wak                  34346
wake                 34347
wakefield            34348
wakfer               34349
waking               34350
waldensoftware       34351
wales                34352
walk                 34353
walked               34354
walker               34355
walking              34356
walks                34357
walkup               34358
wall                 34359
walla                34360
wallace              34361
wallach              34362
walled               34363
wallet               34364
wallets              34365
wallis               34366
wallpaper            34367
wallpapers           34368
w

In [9]:
X_train_counts = count_vect.transform(twenty_train.data)
print("The type of X_train_counts is {0}.".format(type(X_train_counts)))
print("The X matrix has {0} rows (documents) and {1} columns (words).".format(
        X_train_counts.shape[0], X_train_counts.shape[1]))

The type of X_train_counts is <class 'scipy.sparse.csr.csr_matrix'>.
The X matrix has 2257 rows (documents) and 35482 columns (words).


In [10]:
# back to the document of interest
print("Document of interest: {0}".format(doi))
nnz_doi = X_train_counts[doi].getnnz(axis=1)
nwords_doi = X_train_counts[doi].sum()
print("There are {0} non-zero word counts in document {1}.".format(nnz_doi[0], doi))
print("There are {0} words in document {1}.".format(nwords_doi, doi))

Document of interest: 99
There are 61 non-zero word counts in document 99.
There are 76 words in document 99.


In [11]:
print("\nSparse matrix representing the words in document {0}".
      format(doi))
words_doi = X_train_counts[doi]
print(words_doi)


Sparse matrix representing the words in document 99
  (0, 1282)	1
  (0, 1341)	1
  (0, 4472)	1
  (0, 4717)	1
  (0, 5049)	1
  (0, 5499)	1
  (0, 5568)	1
  (0, 5872)	1
  (0, 5916)	1
  (0, 6289)	2
  (0, 6295)	1
  (0, 6780)	1
  (0, 6873)	1
  (0, 6875)	2
  (0, 6958)	1
  (0, 7219)	1
  (0, 7677)	2
  (0, 8032)	1
  (0, 8725)	1
  (0, 9013)	2
  (0, 12555)	2
  (0, 13575)	1
  (0, 14688)	1
  (0, 14985)	1
  (0, 15204)	2
  :	:
  (0, 23731)	1
  (0, 24590)	1
  (0, 25336)	1
  (0, 26499)	1
  (0, 27110)	1
  (0, 27428)	1
  (0, 27997)	1
  (0, 28373)	3
  (0, 28454)	1
  (0, 28664)	1
  (0, 28843)	1
  (0, 29503)	1
  (0, 29759)	1
  (0, 30026)	1
  (0, 30477)	1
  (0, 30852)	1
  (0, 31676)	2
  (0, 31686)	1
  (0, 32006)	3
  (0, 32163)	1
  (0, 34001)	2
  (0, 34442)	1
  (0, 34605)	1
  (0, 34976)	1
  (0, 35050)	1


In [12]:
print("\nThese are the indices, words, and counts in doc. {0}:".
       format(doi))
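# get_feature_names_out() replaces get_feature_names(), which was removed in
# scikit-learn 1.2; look the names up once instead of on every iteration
feature_names = count_vect.get_feature_names_out()
for i in range(words_doi.count_nonzero()):
    word_index = words_doi.indices[i]
    word = feature_names[word_index]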
    count = words_doi.data[i]
    print("{0:<6d}  {1:<12s}  {2}".format(word_index, word, count))


These are the indices, words, and counts in doc. 99:
1282    1ql6jiinn5df  1
1341    20            1
4472    afternoon     1
4717    allan         1
5049    anne          1
5499    article       1
5568    ass           1
5872    avail         1
5916    away          1
6289    beauchaine    2
6295    beaverton     1
6780    blew          1
6873    bob           1
6875    bobbe         2
6958    books         1
7219    bronx         1
7677    caltech       2
8032    cco           1
8725    clearly       1
9013    com           2
12555   edu           2
13575   exhibit       1
14688   frank         1
14985   gap           1
15204   germany       2
15437   going         1
17076   ico           2
19081   keith         2
19307   knew          1
19321   know          1
20057   like          1
20112   lines         1
20201   little        2
20331   looked        1
20723   makes         1
20786   manhattan     1
23731   organization  1
24590   people        1
25336   pompous       1
26499   qu

`CountVectorizer` just records the number of times each word appears in each document.  To compare documents of different lengths, normalize each document's counts: divide them by the total number of words in the document (L1 norm) or by the root sum of squares of the counts (L2 norm). The result is the term (or word) frequency.
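
The two normalizations can be sketched on a single row of counts. The counts below are made up for illustration and the variable names are assumptions; the point is only that L1 makes the entries sum to 1, while L2 makes the vector have unit length.

```python
import math

# Hypothetical word counts for one document (one row of the count matrix).
counts = [1, 2, 0, 2]

# L1: divide by the total number of words, so the entries sum to 1.
l1_total = sum(counts)
tf_l1 = [c / l1_total for c in counts]

# L2: divide by the root sum of squares, so the vector has length 1.
l2_norm = math.sqrt(sum(c * c for c in counts))
tf_l2 = [c / l2_norm for c in counts]

print("L1:", [round(v, 3) for v in tf_l1])
print("L2:", [round(v, 3) for v in tf_l2])
```

Under either norm a word that occurs twice gets twice the value of a word that occurs once, which matches the 0.095 and 0.191 values printed for document 99 below.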

In [13]:
from sklearn.feature_extraction.text import TfidfTransformer
# just do term frequency at first, use_idf = False
tf_transformer = TfidfTransformer(use_idf=False)
tf_transformer.fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
words_doi_tf = X_train_tf[doi]
print("\nThese are the indices, words, and term frequencies in doc. {0}:".
       format(doi))
tf_lst = []
feature_names = count_vect.get_feature_names_out()  # get_feature_names() before scikit-learn 1.0
for i in range(words_doi_tf.count_nonzero()):
    word_index = words_doi_tf.indices[i]
    word = feature_names[word_index]
    count = words_doi_tf.data[i]
    tf_lst.append(count)
    print("{0:<6d}  {1:<12s}  {2:0.3f}".format(word_index, word, count))

mag = np.sqrt(np.sum([tf**2 for tf in tf_lst]))
print("\nThe magnitude of the tf vector for this document is {0:0.3f}".format(mag))
print("It used the L2 norm.")


These are the indices, words, and term frequencies in doc. 99:
1282    1ql6jiinn5df  0.095
1341    20            0.095
4472    afternoon     0.095
4717    allan         0.095
5049    anne          0.095
5499    article       0.095
5568    ass           0.095
5872    avail         0.095
5916    away          0.095
6289    beauchaine    0.191
6295    beaverton     0.095
6780    blew          0.095
6873    bob           0.095
6875    bobbe         0.191
6958    books         0.095
7219    bronx         0.095
7677    caltech       0.191
8032    cco           0.095
8725    clearly       0.095
9013    com           0.191
12555   edu           0.191
13575   exhibit       0.095
14688   frank         0.095
14985   gap           0.095
15204   germany       0.191
15437   going         0.095
17076   ico           0.191
19081   keith         0.191
19307   knew          0.095
19321   know          0.095
20057   like          0.095
20112   lines         0.095
20201   little        0.191
20331   look

We want to tell how similar (or different) the documents are from each other.
Words that appear in nearly every document don't differentiate them, while words
that appear in only a few do.  Downscale the importance of common words by
multiplying by the inverse document frequency (idf).  A term with a large document
frequency appears in most documents; a term with a large inverse document frequency
appears in only a few.  So large tf-idf values indicate terms that appear frequently
in a given document but in only a few documents overall.
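
A hand-rolled sketch of the idea, using a made-up 4-document, 3-term count matrix. The idf formula is the smoothed variant that scikit-learn documents for `TfidfTransformer(smooth_idf=True)`, namely ln((1 + n) / (1 + df)) + 1; the matrix, the `tfidf_row` helper, and the variable names are assumptions for illustration, not the library's implementation.

```python
import math

# Hypothetical document-term counts: 4 documents (rows), 3 terms (columns).
# Term 0 appears in every document; term 2 appears in only one.
X = [
    [1, 0, 0],
    [2, 1, 0],
    [1, 0, 3],
    [1, 1, 0],
]
n_docs = len(X)
n_terms = len(X[0])

# Document frequency: the number of documents that contain each term.
df = [sum(1 for row in X if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency: rare terms get larger weights.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]


def tfidf_row(row):
    """Scale one document's counts by idf, then L2-normalize the result."""
    weighted = [count * weight for count, weight in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm > 0 else weighted


print("df: ", df)
print("idf:", [round(w, 3) for w in idf])
print("tf-idf of document 2:", [round(v, 3) for v in tfidf_row(X[2])])
```

The term that occurs in every document keeps the minimum weight of 1, so after normalization it contributes far less to document 2 than the term that is unique to it.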


In [14]:
tfidf_transformer = TfidfTransformer(use_idf=True)
tfidf_transformer.fit(X_train_counts)
X_train_tfidf = tfidf_transformer.transform(X_train_counts)
words_doi_tfidf = X_train_tfidf[doi]
print("\nThese are the indices, words, and tf-idf values in doc. {0}:".
      format(doi))
feature_names = count_vect.get_feature_names_out()  # get_feature_names() before scikit-learn 1.0
for i in range(words_doi_tfidf.count_nonzero()):
    word_index = words_doi_tfidf.indices[i]
    word = feature_names[word_index]
    tfidf = words_doi_tfidf.data[i]
    print("{0:<6d}  {1:<12s}  {2:0.3f}".format(word_index, word, tfidf))


These are the indices, words, and tf-idf values in doc. 99:
1282    1ql6jiinn5df  0.173
1341    20            0.077
4472    afternoon     0.151
4717    allan         0.096
5049    anne          0.144
5499    article       0.040
5568    ass           0.123
5872    avail         0.151
5916    away          0.082
6289    beauchaine    0.234
6295    beaverton     0.123
6780    blew          0.126
6873    bob           0.108
6875    bobbe         0.234
6958    books         0.092
7219    bronx         0.127
7677    caltech       0.181
8032    cco           0.093
8725    clearly       0.099
9013    com           0.087
12555   edu           0.061
13575   exhibit       0.151
14688   frank         0.117
14985   gap           0.114
15204   germany       0.208
15437   going         0.074
17076   ico           0.232
19081   keith         0.171
19307   knew          0.102
19321   know          0.047
20057   like          0.049
20112   lines         0.023
20201   little        0.149
20331   looked 

## Training a model

In [15]:
from sklearn.naive_bayes import MultinomialNB

print('\nTraining a Naive Bayes model.')
nb_model = MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)
nb_model.fit(X_train_tfidf, twenty_train.target);


Training a Naive Bayes model.


### Predictions

In [16]:
docs_new = ['God is love', 'OpenGL on the GPU is fast', 
            'Two hands working can do more than a thousand clasped in prayer.']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predictions = nb_model.predict(X_new_tfidf)
print('Predictions')
for doc, category in zip(docs_new, predictions):
    print("{0} => {1}".format(doc, twenty_train.target_names[category]))

Predictions
God is love => soc.religion.christian
OpenGL on the GPU is fast => comp.graphics
Two hands working can do more than a thousand clasped in prayer. => soc.religion.christian


## Building a pipeline

In [17]:
from sklearn.pipeline import Pipeline
nb_pipeline = Pipeline([('vect', CountVectorizer()),
                        ('tfidf', TfidfTransformer()),
                        ('model', MultinomialNB()),
                        ])
nb_pipeline.fit(twenty_train.data, twenty_train.target); 

### Evaluating performance on the test set

In [18]:
twenty_test = fetch_20newsgroups(subset='test', categories=categories,
                                     shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted = nb_pipeline.predict(docs_test)
accuracy = np.mean(predicted == twenty_test.target)
print("\nThe accuracy on the test set is {0:0.3f}.".format(accuracy))


The accuracy on the test set is 0.835.


In [19]:
len(docs_test)


1502