# Bag of Words NLP Approach

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np

## Import the 20 Newsgroups dataset from scikit-learn
We focus on 4 of the 20 newsgroup categories:  
- Gun politics (`talk.politics.guns`)
- Christianity (`soc.religion.christian`)
- Computer graphics (`comp.graphics`)
- Medicine (`sci.med`)

In [2]:
categories = ['talk.politics.guns', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [22]:
twenty_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [24]:
twenty_train['filenames']

array(['/Users/hor/scikit_learn_data/20news_home/20news-bydate-train/talk.politics.guns/54238',
       '/Users/hor/scikit_learn_data/20news_home/20news-bydate-train/talk.politics.guns/54702',
       '/Users/hor/scikit_learn_data/20news_home/20news-bydate-train/soc.religion.christian/20673',
       ...,
       '/Users/hor/scikit_learn_data/20news_home/20news-bydate-train/sci.med/58113',
       '/Users/hor/scikit_learn_data/20news_home/20news-bydate-train/talk.politics.guns/54304',
       '/Users/hor/scikit_learn_data/20news_home/20news-bydate-train/soc.religion.christian/20855'],
      dtype='<U91')

In [27]:
twenty_train['data']

 'From: porges@beretta.camb.inmet.com (Don Porges)\nSubject: Re: JFFO has gone a bit too far\nNntp-Posting-Host: beretta\nOrganization: Intermetrics Inc.\nDistribution: usa\nLines: 80\n\n\nHaving read the posted long article by JPFO, I have some observations:\n\n1.  This article does NOT claim that the GCA of 1968 is a "verbatim \ntranslation" of a Nazi law.  What it says is that in another place --\nthe book they\'re talking about -- they compare the two things section\nby section.  The implication is that the similarities are devastating.\nIn the next sentence, they talk about how in that book they reproduce \nthe German text of the Nazi law, together with its translation.  Not \nsurprisingly, a reader could easily conflate these two things into a \nsingle idea:  that the American GCA is a literal translation of the Nazi\nlaw; and sure enough, that\'s what the whole thing has mutated into, \nurban-folklore style.\n\n2.  The article goes to great pains to establish that Senator Dodd h

In [3]:
print(twenty_train.target_names) # news categories used
print(len(twenty_train.data))
print(len(twenty_train.filenames))

['comp.graphics', 'sci.med', 'soc.religion.christian', 'talk.politics.guns']
2323
2323


#### Sample Input from Imported Data

In [5]:
# To print only the first three lines (the header) of a post instead:
# print("\n".join(twenty_train.data[0].split("\n")[:3]))
print(twenty_train.data[5])  # full text of the sixth training post

From: annick@cortex.physiol.su.oz.au (Annick Ansselin)
Subject: Re: Is MSG sensitivity superstition?
Nntp-Posting-Host: cortex.physiol.su.oz.au
Organization: Department of Physiology, University of Sydney, NSW, Australia
Lines: 29

In <C5nFDG.8En@sdf.lonestar.org> marco@sdf.lonestar.org (Steve Giammarco) writes:

>>
>>And to add further fuel to the flame war, I read about 20 years ago that
>>the "natural" MSG - extracted from the sources you mention above - does not
>>cause the reported aftereffects; it's only that nasty "artificial" MSG -
>>extracted from coal tar or whatever - that causes Chinese Restaurant
>>Syndrome.  I find this pretty hard to believe; has anyone else heard it?

MSG is mono sodium glutamate, a fairly straight forward compound. If it is
pure, the source should not be a problem. Your comment suggests that 
impurities may be the cause.
My experience of MSG effects (as part of a double blind study) was that the
pure stuff caused me some rather severe effects.

>I was 

#### Sample Label from Imported Data

In [6]:
print(twenty_train.target_names[twenty_train.target[5]])

sci.med


## Build tf-idf Document-Term Matrix
`CountVectorizer`: counts how often each vocabulary term occurs in each document. With the settings below it lowercases the text, strips accents and other non-ASCII characters (`strip_accents='ascii'`), removes common English stop words, and builds features from n-grams of 1 to 3 words.  
`TfidfTransformer`: converts the count matrix into a normalized tf-idf representation, which down-weights terms that appear in many documents.  

In [7]:
count_vect = CountVectorizer(strip_accents='ascii', lowercase=True, stop_words='english', ngram_range=(1, 3))
tfidf_transformer = TfidfTransformer()
x_train_counts = count_vect.fit_transform(twenty_train.data)
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)
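
For intuition, the weighting that `TfidfTransformer` applies with its default settings (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) can be reproduced by hand. The sketch below is pure Python on a made-up, pre-tokenized toy corpus. The helper name `tfidf_rows` is hypothetical, and this is not how scikit-learn implements the step internally.

```python
import math

# Toy corpus (made up, already tokenized) standing in for the newsgroup posts.
docs = [
    ["gun", "law", "gun"],
    ["doctor", "patient", "law"],
    ["opengl", "graphics", "gpu"],
]

def tfidf_rows(docs):
    """Return (vocabulary, rows) using smoothed idf and L2-normalized rows."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for doc in docs if t in doc) for t in vocab}
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        raw = [doc.count(t) * idf[t] for t in vocab]  # term count times idf
        norm = math.sqrt(sum(w * w for w in raw))     # L2 norm of the row
        rows.append([w / norm for w in raw])
    return vocab, rows

vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

In the first toy document, "gun" gets a larger weight than "law": it occurs twice and appears in fewer documents.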

In [28]:
# Both matrices are SciPy sparse matrices; .shape works without converting them to dense arrays.
print(x_train_counts.shape)
print(x_train_tfidf.shape)

(2323, 580346)
(2323, 580346)


In [32]:
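# Densify only row 1 rather than the whole matrix, then list the column indices with non-zero counts.
np.nonzero(x_train_counts[1].toarray()[0])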

(array([  9489,   9490,   9491,   9877,   9888,   9889,   9890,   9891,
          9898,   9900,  27374,  27406,  27407,  29482,  29511,  29512,
         35986,  36044,  36046,  36963,  36966,  36967,  37703,  37712,
         37713,  38878,  38895,  38896,  40931,  40966,  40967,  42136,
         42139,  42140,  49353,  49444,  49445,  49446,  49482,  49484,
         52301,  52462,  52473,  52819,  52936,  52937,  58059,  58108,
         58109,  59757,  60782,  60783,  60823,  60824,  60881,  60882,
         60933,  60934,  60996,  60998,  61283,  61284,  61785,  61832,
         61833,  61861,  62160,  62161,  62199,  62378,  62379,  62380,
         62382,  62630,  62645,  62647,  66830,  66831,  66832,  66868,
         66869,  66938,  66939,  66950,  66951,  69221,  69252,  69253,
         75246,  75247,  75248,  77654,  77657,  77658,  77731,  77734,
         77735,  77742,  77743,  80699,  80700,  80701,  81296,  81423,
         81424,  84826,  85124,  85125,  85156,  85157,  85185, 

In [35]:
# Compare raw counts and tf-idf weights for the same 20 features of document 1.
print(x_train_counts[1].toarray()[0][9480:9500])
print(x_train_tfidf[1].toarray()[0][9480:9500])

[0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0]
[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.02496655 0.02496655 0.02496655
 0.         0.         0.         0.         0.         0.
 0.         0.        ]


In [58]:
lookup = {v:k for k, v in count_vect.vocabulary_.items()}
[(i, lookup[i]) for i in range(9480, 9500)]

[(9480, '19260'),
 (9481, '19260 sun1x'),
 (9482, '19260 sun1x res'),
 (9483, '192644'),
 (9484, '192644 29219'),
 (9485, '192644 29219 clpd'),
 (9486, '1927a1'),
 (9487, '1927a1 avtomat'),
 (9488, '1927a1 avtomat kalashnikov'),
 (9489, '1928'),
 (9490, '1928 nazi'),
 (9491, '1928 nazi inspired'),
 (9492, '192947'),
 (9493, '192947 11230'),
 (9494, '192947 11230 sophia'),
 (9495, '1929ee'),
 (9496, '1929ee w78xh'),
 (9497, '1929ee w78xh s6'),
 (9498, '193'),
 (9499, '193 dave')]
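
The triplets in this listing (`1928`, `1928 nazi`, `1928 nazi inspired`) come from `ngram_range=(1, 3)`: every run of one, two, or three consecutive tokens becomes its own feature, and the listing above is in alphabetical order, so related n-grams sit next to each other. A minimal pure-Python sketch of that expansion follows. The helper `word_ngrams` and the fourth token `law` are made up for illustration, and the real vectorizer also handles tokenization and stop-word removal.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all space-joined n-grams of consecutive tokens, for min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

# Hand-tokenized input; "law" is a made-up continuation of the phrase.
tokens = ["1928", "nazi", "inspired", "law"]
features = sorted(set(word_ngrams(tokens)))
print(features)
```

This expansion is the main reason the matrix has several hundred thousand columns for only 2323 documents.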

## Fit Logistic Regression Model to predict news category

In [61]:
text_clf = LogisticRegression()
text_clf.fit(x_train_tfidf, twenty_train.target)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

#### Test our logistic regression classifier on sample texts
Try adding other strings to `docs_new` and see which category the model predicts.

In [62]:
docs_new = [
    'God is love',
    'OpenGL on the GPU is fast',
    'Gun violence is a real problem',
    'Doctors cure critical brain tumor in patient',
    'Space Radiation Doesnt Seem to Be Causing Astronauts to Die from Cancer, Study Finds',
]
# Use transform (not fit_transform): the vectorizer and transformer are already fitted to the training data.
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = text_clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
'Gun violence is a real problem' => talk.politics.guns
'Doctors cure critical brain tumor in patient' => sci.med
'Space Radiation Doesnt Seem to Be Causing Astronauts to Die from Cancer, Study Finds' => sci.med
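
The vectorizer, transformer, and classifier can also be chained with scikit-learn's `Pipeline`, so raw strings go straight into `fit` and `predict` and the `transform` calls cannot be forgotten. The sketch below trains on a tiny made-up corpus (not the newsgroup data) just to stay self-contained; `toy_texts`, `toy_labels`, and `toy_clf` are hypothetical names.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up training data for illustration only.
toy_texts = [
    "the gun law was debated in congress",
    "firearm owners oppose the new gun ban",
    "the doctor prescribed medicine to the patient",
    "the patient saw a doctor about the tumor",
]
toy_labels = ["guns", "guns", "med", "med"]

# Same three stages as in this notebook, wrapped into one estimator.
toy_clf = Pipeline([
    ("vect", CountVectorizer(lowercase=True, stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression()),
])
toy_clf.fit(toy_texts, toy_labels)

# predict() runs transform on each stage, then the classifier.
print(toy_clf.predict(["gun owners", "doctor and patient"]))
```

With the real data, the same pipeline would be fitted on `twenty_train.data` and `twenty_train.target`.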


#### Evaluate our logistic regression model on test data

In [63]:
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
test_counts = count_vect.transform(docs_test)
test_tfidf = tfidf_transformer.transform(test_counts)
predicted = text_clf.predict(test_tfidf)
np.mean(predicted == twenty_test.target)  # accuracy: fraction of test posts classified correctly

0.936005171299289
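
Overall accuracy hides which categories get confused with each other. As a sketch, scikit-learn's `confusion_matrix` and `classification_report` give a per-class view. The label arrays below are small made-up stand-ins for `twenty_test.target` and `predicted`, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

target_names = ["comp.graphics", "sci.med", "soc.religion.christian", "talk.politics.guns"]

# Made-up labels for illustration only.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 1, 1, 1, 2, 2, 3, 0])

accuracy = np.mean(y_pred == y_true)   # same computation as in the cell above
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(accuracy)
print(cm)
print(classification_report(y_true, y_pred, target_names=target_names))
```

In the notebook itself, the equivalent call would be `confusion_matrix(twenty_test.target, predicted)`.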

## Trying out other newsgroups:
We only used 4 of the newsgroups! To train on different ones, edit the `categories` list in the import step at the beginning of this notebook.  
Here are all 20 newsgroups available in the dataset.  
<img src="./screenshots/20news_groups.png" alt="drawing" width="500"/>

## Trying out more models:
Here are a few other models you can try out in your spare time:  
- Naive Bayes (`MultinomialNB`)
- Decision Tree
- Random Forest
- Gradient boosting (`GradientBoostingClassifier` in scikit-learn; XGBoost is a separate library)
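
Because scikit-learn classifiers share the same `fit`/`predict`/`score` interface, swapping models is mostly a one-line change. The sketch below loops over a few of them on a tiny made-up corpus so it runs without the dataset. It uses scikit-learn's own `GradientBoostingClassifier` as the boosting example, and the hyperparameters are arbitrary choices, not tuned values.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Made-up training data for illustration only.
toy_texts = [
    "the gun law was debated in congress",
    "firearm owners oppose the new gun ban",
    "gun control is a political issue",
    "the doctor prescribed medicine to the patient",
    "the patient saw a doctor about the tumor",
    "the medicine cured the patient",
]
toy_labels = [0, 0, 0, 1, 1, 1]

x_counts = CountVectorizer(stop_words="english").fit_transform(toy_texts)
x_toy = TfidfTransformer().fit_transform(x_counts)

models = {
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

train_accuracy = {}
for name, model in models.items():
    model.fit(x_toy, toy_labels)
    # Accuracy on the training data only; it says nothing about generalization.
    train_accuracy[name] = model.score(x_toy, toy_labels)

print(train_accuracy)
```

For a fair comparison on the newsgroup data, fit each model on `x_train_tfidf` and score it on `test_tfidf`, as done for logistic regression above. Expect the tree ensembles to train slowly on several hundred thousand n-gram features.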