From: https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html
Also, great explanation at: https://krakensystems.co/blog/2018/text-classification

In [68]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train_full = fetch_20newsgroups(subset='train')
#newsgroups_test = fetch_20newsgroups(subset='test')

In [69]:
from pprint import pprint
pprint(list(newsgroups_train_full.target_names))

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


In [70]:
from sklearn.feature_extraction.text import TfidfVectorizer 
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test',  categories=categories)
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors.shape

(2034, 34118)

In [71]:
vectors_test = vectorizer.transform(newsgroups_test.data)
vectors_test.shape

(1353, 34118)

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

In [72]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
clf = MultinomialNB()
clf.fit(vectors, newsgroups_train.target)
pred = clf.predict(vectors_test)
metrics.f1_score(newsgroups_test.target, pred, average='macro')

0.7924374490634268

In [73]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
clf = MultinomialNB(alpha=.01)
clf.fit(vectors, newsgroups_train.target)
pred = clf.predict(vectors_test)
metrics.f1_score(newsgroups_test.target, pred, average='macro')

0.8821359240272957

In [74]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
clf = MultinomialNB(alpha=0)
clf.fit(vectors, newsgroups_train.target)
pred = clf.predict(vectors_test)
metrics.f1_score(newsgroups_test.target, pred, average='macro')



0.8660426652791797
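
The three runs above differ only in alpha, the additive (Laplace/Lidstone) smoothing parameter. As a minimal sketch of what alpha does, the toy example below rebuilds MultinomialNB's per-class word log-probabilities by hand. The count matrix, labels, and the names X_toy, y_toy and clf_toy are made up for illustration and are not part of the newsgroups data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy count matrix: 4 documents, 3 features, 2 classes
X_toy = np.array([[2, 1, 0],
                  [1, 0, 0],
                  [0, 1, 3],
                  [0, 0, 2]])
y_toy = np.array([0, 0, 1, 1])

alpha = 0.01
clf_toy = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Sum the feature counts within each class, then add alpha to every count
counts = np.array([X_toy[y_toy == c].sum(axis=0) for c in (0, 1)])
smoothed = counts + alpha

# Normalise each class row to probabilities and take the log
manual_log_prob = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

# The hand-computed values should match what the classifier stored
matches = np.allclose(clf_toy.feature_log_prob_, manual_log_prob)
print(matches)
```

A larger alpha pulls every word's probability toward uniform; a very small alpha trusts the observed counts almost completely, which is why the score moves as alpha changes.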

In [75]:
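import numpy as np
def show_top10(classifier, vectorizer, categories):
    # get_feature_names() was deprecated in scikit-learn 1.0 and later removed;
    # on older versions, call vectorizer.get_feature_names() instead
    feature_names = np.asarray(vectorizer.get_feature_names_out())
    for i, category in enumerate(categories):
        # feature_log_prob_ holds the same per-class values that the
        # now-removed MultinomialNB.coef_ exposed in the multiclass case
        top10 = np.argsort(classifier.feature_log_prob_[i])[-10:]
        print("%s: %s" % (category, " ".join(feature_names[top10])))
show_top10(clf, vectorizer, newsgroups_train.target_names)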

alt.atheism: edu it and in you that is of to the
comp.graphics: edu in graphics it is for and of to the
sci.space: edu it that is in and space to of the
talk.religion.misc: not it you in is that and to of the


In [76]:
from sklearn.feature_extraction.text import TfidfVectorizer 
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),categories=categories)

newsgroups_test = fetch_20newsgroups(subset='test',
                                     remove=('headers', 'footers', 'quotes'),categories=categories)
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors.shape

(2034, 26879)

In [77]:
vectors_test = vectorizer.transform(newsgroups_test.data)
vectors_test.shape

(1353, 26879)

In [78]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
clf = MultinomialNB(alpha=.01)
clf.fit(vectors, newsgroups_train.target)
pred = clf.predict(vectors_test)
metrics.f1_score(newsgroups_test.target, pred, average='macro')

0.7699517518452172

In [79]:
show_top10(clf, vectorizer, newsgroups_train.target_names)

alt.atheism: not in and it you is that of to the
comp.graphics: graphics you in it is for of and to the
sci.space: for that it space is in and of to the
talk.religion.misc: not it in you is and that to of the


Homework assignment: try removing stopwords, as in https://krakensystems.co/blog/2018/text-classification, and try stemming too. What works best? Also try different values for alpha. Also, check out the notebook at: https://github.com/screddy1313/Text-Classification-with-20news-dataset/blob/master/code/MultiClassNB.ipynb for removing symbols: look at how symbols are handled in the preprocessing(words) function they define. How did that work for you?
Hint: try things like: CountVectorizer(lowercase=True,stop_words='english')
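
To see what the hint does before running it on the full dataset, here is a small sketch on a made-up three-sentence corpus. The sentences and the names docs, plain and no_stop are hypothetical stand-ins for newsgroups_train.data and your own vectorizers. It compares the vocabulary with and without scikit-learn's built-in English stop word list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus (stands in for newsgroups_train.data)
docs = [
    "The shuttle is in orbit and the launch was a success",
    "God and the church are discussed in this group",
    "The image file is in a 3d graphics format",
]

# Same settings except for the stop word list
plain = CountVectorizer(lowercase=True)
no_stop = CountVectorizer(lowercase=True, stop_words='english')

plain_vocab = set(plain.fit(docs).vocabulary_)
no_stop_vocab = set(no_stop.fit(docs).vocabulary_)

# Words that the stop word list filtered out
removed = sorted(plain_vocab - no_stop_vocab)
print(removed)
print(sorted(no_stop_vocab))
```

The same stop_words='english' argument also works with TfidfVectorizer, so you can swap it into the cells above and compare the F1 scores.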

Now let's try to understand what you got!  What were the 20 most important words overall?

In [80]:
from sklearn.feature_selection import chi2
# compute the chi2 statistic for each feature
chi2score = chi2(vectors, newsgroups_train.target)[0]
# get_feature_names_out() replaces get_feature_names() in recent scikit-learn
# releases; .tolist() keeps plain Python strings and floats for printing
wscores = zip(vectorizer.get_feature_names_out().tolist(), chi2score.tolist())

In case you were wondering (like I was) why we need the subscript [0] up above, it is because the chi2 function returns two arrays: the chi-squared statistic for each feature and the corresponding p-values. We only want the statistics here. See: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2
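
A quick way to convince yourself is to unpack both return values on a toy matrix. The data and the names X_small, y_small, scores and pvalues below are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical toy counts: 4 documents, 3 features, 2 classes.
# Feature 0 only appears in class 0, feature 1 only in class 1,
# and feature 2 appears equally in both classes.
X_small = np.array([[3, 0, 1],
                    [2, 0, 1],
                    [0, 4, 1],
                    [0, 3, 1]])
y_small = np.array([0, 0, 1, 1])

# chi2 returns a pair: (chi-squared statistics, p-values), one entry per feature
scores, pvalues = chi2(X_small, y_small)

print(scores.shape, pvalues.shape)
print(scores)
print(pvalues)
```

The feature that is spread evenly across the classes gets a statistic of zero, while the class-specific features score higher, which is the same ranking idea used on the newsgroups vocabulary.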

In [81]:
print("Overall")
wchi2 = sorted(wscores,key=lambda x:x[1],reverse = True) 
for i in range(0,20):
    print(wchi2[i])

Overall
('space', 50.84053191640039)
('graphics', 41.14620302341697)
('god', 30.14032128191578)
('jesus', 27.47411064248329)
('nasa', 22.268266439410848)
('file', 22.12202676510602)
('image', 21.881779476774796)
('files', 20.38056829326831)
('atheism', 19.690604452023212)
('windows', 19.627133314274825)
('christian', 19.263217764891344)
('thanks', 18.1012929444855)
('christians', 17.79840855627674)
('format', 17.78020756413214)
('orbit', 17.731665494258433)
('launch', 17.361423856933794)
('3d', 17.286349979436984)
('hi', 16.517898442191914)
('christ', 16.4871078748634)
('islam', 15.83777484687983)


Looking forward, there are some nice resources at: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html
like: https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py
and: https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py
