## Task 1: Investigate the impact of class imbalance on performance measures

Build a classifier that predicts the categories 'talk.religion.misc' and 'soc.religion.christian' from the 20-newsgroups dataset.  
Test it first on the training set.
Test it also on the test set.  

Add a markdown cell and include your observations on the following:
* the class distribution of the training set (evident from the classification report of testing on the training set: support = the number of instances of each class)
* the performance on the training set versus the performance on the test set, and what any difference means
* which performance measure should be used.

Note that a Jupyter Notebook Markdown cheatsheet is available at https://medium.com/ibm-data-science-experience/markdown-for-jupyter-notebooks-cheatsheet-386c05aeebed

In [27]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# Feature selection classes
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Classifiers
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# metrics
from sklearn import metrics

def showMetrics(targets, predictions, names):
    print(metrics.classification_report(targets, predictions, target_names=names))
    print("Accuracy = %6.4f " % metrics.accuracy_score(targets, predictions))
    print("Avg recall, micro = %6.4f" % metrics.recall_score(targets, predictions, average = 'micro'))
    print("Avg recall, macro = %6.4f" % metrics.recall_score(targets, predictions, average = 'macro'))
    print("Avg precision, macro=%6.4f" % metrics.precision_score(targets, predictions, average = 'macro'))
    
def predict(categories, stop_words, dataset = 'test'):

    newsgroups_train = fetch_20newsgroups(subset='train',
                                         categories=categories,
                                         remove=('headers', 'footers', 'quotes'),
                                         shuffle=True, random_state=42)

    X, Y = newsgroups_train.data, newsgroups_train.target

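    # stop_words=True keeps stop words (default vectorizer); stop_words=False
    # removes English stop words and applies document-frequency thresholds.
    if stop_words:
        vectorizer = TfidfVectorizer()
    else:
        vectorizer = TfidfVectorizer(max_df=0.95, min_df=3, stop_words="english")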

    X_vec = vectorizer.fit_transform(X)   #transform training data

    fs = SelectKBest(chi2, k=100)    #get top k features 
    X_fs_vec= fs.fit_transform(X_vec, Y)  # fit and transform tdm to reduced feature space

    newsgroups_test = fetch_20newsgroups(subset='test',     # get test data
                                         categories=categories,
                                         remove=('headers', 'footers', 'quotes'),
                                         shuffle=True,
                                         random_state=42)

    classifier = MultinomialNB(alpha=.01)
    classifier.fit(X_fs_vec, Y)

    if dataset == 'test':
        vectors_test = vectorizer.transform(newsgroups_test.data)   # transform test data
        fs_set = fs.transform(vectors_test)     # reduce test data to the selected features
    else:
        vectors_train = vectorizer.transform(newsgroups_train.data)   # transform training data
        fs_set = fs.transform(vectors_train)     # reduce training data to the selected features
    
    return classifier.predict(fs_set)


In [28]:
categories = ['talk.religion.misc', 'soc.religion.christian']

# predict() keeps its datasets local, so load the true labels here with the
# same settings (and the same shuffle seed) to score the predictions.
fetch_args = dict(categories=categories,
                  remove=('headers', 'footers', 'quotes'),
                  shuffle=True, random_state=42)
newsgroups_train = fetch_20newsgroups(subset='train', **fetch_args)
newsgroups_test = fetch_20newsgroups(subset='test', **fetch_args)

print("-------------------------------")
print("Classifying over TEST dataset.")
print("------------------------------")

predicted = predict(categories, False, 'test')

showMetrics(newsgroups_test.target, predicted, newsgroups_train.target_names)

print("-------------------------------")
print("Classifying over TRAIN dataset.")
print("------------------------------")

predicted = predict(categories, False, 'train')

showMetrics(newsgroups_train.target, predicted, newsgroups_train.target_names)
      

-------------------------------
Classifying over TEST dataset.
------------------------------
                        precision    recall  f1-score   support

soc.religion.christian       0.69      0.96      0.80       398
    talk.religion.misc       0.84      0.31      0.45       251

           avg / total       0.75      0.71      0.67       649

Accuracy = 0.7088 
Avg recall, micro = 0.7088
Avg recall, macro = 0.6345
Avg precision, macro=0.7623
-------------------------------
Classifying over TRAIN dataset.
------------------------------
                        precision    recall  f1-score   support

soc.religion.christian       0.77      1.00      0.87       599
    talk.religion.misc       1.00      0.53      0.69       377

           avg / total       0.86      0.82      0.80       976

Accuracy = 0.8176 
Avg recall, micro = 0.8176
Avg recall, macro = 0.7639
Avg precision, macro=0.8855


### Answers

The sample is unbalanced. There are more documents in the "soc.religion.christian" category (599 in the training set) than in "talk.religion.misc" (377).
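
The class counts can also be read directly from the label array instead of the report. The following is a minimal sketch, assuming integer labels that index into a list of class names, as `fetch_20newsgroups` returns them. `class_distribution` is a hypothetical helper, and the toy labels reproduce the support column of the training report rather than loading the dataset.

```python
from collections import Counter


def class_distribution(targets, names):
    """Return {class name: (count, share of all instances)}."""
    counts = Counter(int(t) for t in targets)
    total = sum(counts.values())
    return {names[label]: (n, n / total) for label, n in sorted(counts.items())}


# Stand-in labels built from the support column of the training report;
# in the notebook this would be newsgroups_train.target (assumption).
names = ['soc.religion.christian', 'talk.religion.misc']
targets = [0] * 599 + [1] * 377

distribution = class_distribution(targets, names)
for name, (count, share) in distribution.items():
    print("%-24s %4d  %.1f%%" % (name, count, 100 * share))
```

With the real data, passing `newsgroups_train.target` and `newsgroups_train.target_names` should give the same counts as the support column.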

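Performance is better on the TRAIN dataset (accuracy 0.8176 versus 0.7088 on TEST), which is expected because the classifier was fitted on that data. The gap suggests that the model does not fully generalise to unseen documents; the recall for "talk.religion.misc" in particular drops from 0.53 to 0.31.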

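A mix of precision, accuracy and recall should be considered to assess the algorithm's performance. Because the classes are imbalanced, accuracy alone can look acceptable while the smaller class is poorly recognised, so the per-class recall and the macro averages deserve the most weight.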
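
To illustrate why several measures are needed on imbalanced data, consider a baseline that always predicts the larger class. This is a sketch in plain Python, not part of the original notebook: `accuracy` and `macro_recall` are hypothetical helpers intended to mirror `metrics.accuracy_score` and `metrics.recall_score(..., average='macro')`, and the labels are rebuilt from the test-set support column (398 and 251).

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Labels rebuilt from the test report: 398 christian (0), 251 misc (1).
y_true = [0] * 398 + [1] * 251
# Baseline that ignores the text and always predicts the majority class.
y_pred = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
baseline_macro_recall = macro_recall(y_true, y_pred)
print("Accuracy          = %6.4f" % baseline_accuracy)
print("Avg recall, macro = %6.4f" % baseline_macro_recall)
```

The baseline reaches an accuracy of about 0.61 without learning anything, while its macro-averaged recall is 0.5, the value expected from a classifier that never finds the smaller class.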

## Task 2: Investigate the impact of stopword removal and DF reduction on performance

Build a classifier on at least 3 categories of the 20-newsgroups dataset.  Measure the performance with stopword removal and at various levels of document frequency reduction.

Add a markdown cell and outline your results showing the number of features used by the different settings and the impact on performance, if any.      

Justify your choice of performance measure.
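
As an illustration of how stopword removal and document-frequency (DF) thresholds shrink the feature set, the sketch below counts the surviving terms on a toy corpus. It is a plain-Python approximation, not `TfidfVectorizer` itself: `filter_vocabulary`, the whitespace tokenisation, the toy documents and the three-word stop list are all assumptions made for the example. As in the vectorizer's settings used above, `min_df` is treated as an absolute document count and `max_df` as a proportion of documents.

```python
from collections import Counter


def filter_vocabulary(docs, min_df=1, max_df=1.0, stop_words=()):
    """Return the sorted terms that survive stop-word removal and DF limits.

    min_df is an absolute document count; max_df is a proportion of documents.
    """
    stop = set(stop_words)
    df = Counter()
    for doc in docs:
        # Count each term once per document (document frequency).
        df.update(set(doc.lower().split()) - stop)
    max_count = max_df * len(docs)
    return sorted(t for t, n in df.items() if min_df <= n <= max_count)


# Toy corpus and toy stop list (hypothetical, not sklearn's "english" list).
docs = [
    "the church and the faith",
    "the faith of the people",
    "the people and the law",
    "the law of the land",
]
stop = {"the", "and", "of"}

settings = {
    "no filtering": dict(),
    "stop words removed": dict(stop_words=stop),
    "stop words removed, min_df=2": dict(stop_words=stop, min_df=2),
    "max_df=0.75 only": dict(max_df=0.75),
}
feature_counts = {name: len(filter_vocabulary(docs, **kw))
                  for name, kw in settings.items()}
for name, n in feature_counts.items():
    print("%-30s %d features" % (name, n))
```

In the notebook itself the corresponding number is the column count of the vectorizer output (`X_vec.shape[1]`). Note that `predict` then applies `SelectKBest(chi2, k=100)`, so the classifier always sees 100 features regardless of the vectorizer settings.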

In [None]:
print('-------------------------------')
print("PREDICTIONS WITHOUT STOP WORDS")
print('-------------------------------')

predicted = predict(['talk.religion.misc', 'soc.religion.christian'], False, 'test')

showMetrics(newsgroups_test.target, predicted, newsgroups_train.target_names)

print('----------------------------')
print("PREDICTIONS WITH STOP WORDS")
print('---------------------------')

predicted = predict(['talk.religion.misc', 'soc.religion.christian'], True, 'test')

showMetrics(newsgroups_test.target, predicted, newsgroups_train.target_names)


-------------------------------
PREDICTIONS WITHOUT STOP WORDS
-------------------------------
                        precision    recall  f1-score   support

soc.religion.christian       0.69      0.96      0.80       398
    talk.religion.misc       0.84      0.31      0.45       251

           avg / total       0.75      0.71      0.67       649

Accuracy = 0.7088 
Avg recall, micro = 0.7088
Avg recall, macro = 0.6345
Avg precision, macro=0.7623
----------------------------
PREDICTIONS WITH STOP WORDS
---------------------------
