## Task 1: Investigate the impact of class imbalance on performance measures

Build a classifier that distinguishes between the categories 'talk.religion.misc' and 'soc.religion.christian' in the 20-newsgroups dataset.  
Evaluate it first on the training set.  
Then evaluate it on the test set.  
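
One way to run both evaluations is to wrap the vectorizer, feature selector and classifier in a pipeline and score the fitted model on each split. The sketch below is only an illustration: `evaluate_on_both_splits` is a hypothetical helper, and the tiny corpus is made up so the block runs without downloading the newsgroups data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score, recall_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def evaluate_on_both_splits(model, X_train, y_train, X_test, y_test):
    """Hypothetical helper: fit on the training split, then score both splits."""
    model.fit(X_train, y_train)
    scores = {}
    for split, X, y in (("train", X_train, y_train), ("test", X_test, y_test)):
        predicted = model.predict(X)
        scores[split] = {
            "accuracy": accuracy_score(y, predicted),
            "macro_recall": recall_score(y, predicted, average="macro"),
        }
    return scores


# Toy stand-in corpus (made up for illustration; NOT the newsgroups data)
train_docs = [
    "church bible prayer faith",
    "bible church jesus faith",
    "prayer jesus church bible",
    "atheism cult debate argument",
    "cult debate atheism claim",
]
train_labels = [0, 0, 0, 1, 1]
test_docs = ["church faith bible", "debate cult claim"]
test_labels = [0, 1]

# The pipeline fits the vectorizer and selector on the training split only
model = make_pipeline(TfidfVectorizer(),
                      SelectKBest(chi2, k=5),
                      MultinomialNB(alpha=.01))

scores = evaluate_on_both_splits(model, train_docs, train_labels,
                                 test_docs, test_labels)
for split, values in scores.items():
    print(split, values)
```

With the real data, `train_docs` and `test_docs` would be the `.data` attributes of the two `fetch_20newsgroups` subsets. A training score well above the test score would suggest overfitting.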

Add a markdown cell and include your observations on the following:
* the class distribution of the training set (evident from the classification report produced when testing on the training set; support = the number of instances of each class)
* the performance on the training set versus the performance on the test set, and what any difference between the two means
* which performance measure should be used.
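
To see why the choice of measure matters under class imbalance, it can help to score a trivial classifier that always predicts the majority class. The sketch below uses the test-set class counts (398 and 251) from the `support` column of the report printed later in this notebook. The training-set counts are not shown in the notebook, so for that split `np.bincount(Y)` would be the equivalent starting point.

```python
import numpy as np
from sklearn import metrics

# Test-set class counts, copied from the "support" column of the report
support = {"soc.religion.christian": 398, "talk.religion.misc": 251}

# Label vector with that distribution: 0 = majority class, 1 = minority class
y_true = np.repeat([0, 1], list(support.values()))
y_majority = np.zeros_like(y_true)   # baseline: always predict class 0

class_share = np.bincount(y_true) / len(y_true)
baseline_accuracy = metrics.accuracy_score(y_true, y_majority)
baseline_macro_recall = metrics.recall_score(y_true, y_majority,
                                             average="macro")

print("Class shares          =", np.round(class_share, 4))
print("Baseline accuracy     = %6.4f" % baseline_accuracy)
print("Baseline macro recall = %6.4f" % baseline_macro_recall)
```

The baseline reaches an accuracy equal to the majority-class share while its macro-averaged recall stays at 0.5, which is one argument for preferring a macro-averaged measure when the classes are unbalanced.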

Note that a Markdown cheatsheet for Jupyter notebooks is available at https://medium.com/ibm-data-science-experience/markdown-for-jupyter-notebooks-cheatsheet-386c05aeebed

In [2]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# Feature selection classes
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Classifiers
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# metrics
from sklearn import metrics

In [15]:
categories = ['talk.religion.misc', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train',
                                     categories=categories,
                                     remove=('headers', 'footers', 'quotes'),
                                     shuffle=True, random_state=42)

X, Y = newsgroups_train.data, newsgroups_train.target

vectorizer = TfidfVectorizer()

X_vec = vectorizer.fit_transform(X)   # fit the vectorizer on the training data and transform it

fs = SelectKBest(chi2, k=100)    # keep the top k features by chi-squared score
X_fs_vec = fs.fit_transform(X_vec, Y)  # fit the selector and reduce the training matrix to those features

newsgroups_test = fetch_20newsgroups(subset='test',     # get test data
                                     categories=categories,
                                     remove=('headers', 'footers', 'quotes'),
                                     shuffle=True,
                                     random_state=42)

vectors_test = vectorizer.transform(newsgroups_test.data)   #transform test data
fs_test = fs.transform(vectors_test)     # transform test data to reduced feature space

classifier = MultinomialNB(alpha=.01)
classifier.fit(X_fs_vec, Y)
predicted = classifier.predict(fs_test)

print(metrics.classification_report(newsgroups_test.target, predicted,
    target_names=newsgroups_train.target_names))

print("Accuracy = %6.4f" % metrics.accuracy_score(newsgroups_test.target, predicted))

print("Avg recall, micro = %6.4f" % metrics.recall_score(newsgroups_test.target, 
                                                         predicted, 
                                                         average='micro'))  # same as accuracy
      
print("Avg recall, macro = %6.4f" % metrics.recall_score(newsgroups_test.target, 
                                                         predicted, average='macro'))  # average of class recall

print("Avg precision, macro = %6.4f" % metrics.precision_score(newsgroups_test.target, 
                                                        predicted, average='macro'))  #average across class precision 

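# NOTE: scikit-learn's MultinomialNB has no show_most_informative_features method
# (that method belongs to NLTK's NaiveBayesClassifier), so this call raises the
# AttributeError shown at the end of the output.
classifier.show_most_informative_features(5)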

                        precision    recall  f1-score   support

soc.religion.christian       0.68      0.98      0.80       398
    talk.religion.misc       0.90      0.28      0.43       251

           avg / total       0.77      0.71      0.66       649

Accuracy = 0.7088
Avg recall, micro = 0.7088
Avg recall, macro = 0.6294
Avg precision, macro = 0.7902
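
The relationship between these averages can be checked by hand. The sketch below assumes a confusion matrix that was inferred from the rounded report (390 and 70 correct predictions for the two classes); the notebook does not print it, so treat it as a reconstruction that happens to match the reported figures rather than the actual result.

```python
import numpy as np
from sklearn import metrics

# ASSUMED confusion matrix (rows = true class, columns = predicted class),
# inferred from the rounded report; it is not printed by the notebook.
cm = np.array([[390,  8],    # soc.religion.christian
               [181, 70]])   # talk.religion.misc

# Expand the four cells into true/predicted label vectors
y_true = np.repeat([0, 0, 1, 1], cm.ravel())
y_pred = np.repeat([0, 1, 0, 1], cm.ravel())

accuracy = metrics.accuracy_score(y_true, y_pred)
micro_recall = metrics.recall_score(y_true, y_pred, average="micro")
macro_recall = metrics.recall_score(y_true, y_pred, average="macro")
macro_precision = metrics.precision_score(y_true, y_pred, average="macro")
per_class_recall = metrics.recall_score(y_true, y_pred, average=None)

print("Accuracy             = %6.4f" % accuracy)
print("Avg recall, micro    = %6.4f" % micro_recall)   # pooled over instances
print("Avg recall, macro    = %6.4f" % macro_recall)   # mean of class recalls
print("Avg precision, macro = %6.4f" % macro_precision)
```

Micro-averaged recall pools all instances and so equals accuracy in a single-label problem, whereas macro-averaged recall weights both classes equally and is pulled down by the weak recall on the minority class.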


AttributeError: 'MultinomialNB' object has no attribute 'show_most_informative_features'