###### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2021 Semester 1

## Week 9 - Practical Workshop

Today, we are talking about feature selection and text processing in `scikit-learn`.

Remember that you can always consult the scikit-learn API reference (https://scikit-learn.org/stable/modules/classes.html#) for further examples and explanation.


### Exercise 1. 
scikit-learn has a built-in text dataset, the “20 newsgroups corpus” (https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html), which contains a number of documents, each labelled with a topic based on the newsgroup in which it was posted.

In [2]:
from sklearn.datasets import fetch_20newsgroups
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


#### Exercise 1.(a)
Choose a couple of newsgroups that you think it would be interesting to discriminate between, like `rec.autos` and `rec.motorcycles`.

We will use `alt.atheism` and `talk.religion.misc` as an example.

In [None]:
categories = ['alt.atheism','talk.religion.misc']

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=30027)
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=30027)

X_train_txt = data_train.data
y_train = data_train.target
X_test_txt = data_test.data
y_test = data_test.target

print(dir(data_train))

for file,target in zip(data_train.filenames,data_train.target):
    print('file:',file, 'label:',target)

#### Exercise 1.(b)
Examine a couple of documents by indexing into the list (e.g. `X_train_txt[0]`). Can you accurately predict the class (`y_train[0]`) based on the text alone?

In [None]:
print("\nInstance Name:",data_train.filenames[0])
print("\nInstance body:\n\n", X_train_txt[0])
print('\n Instance Label:', y_train[0], data_train.target_names[y_train[0]])
print('\n----------------------------------------------------------------------------------')
print("\nInstance Name:",data_train.filenames[3])
print("\nInstance body:\n\n", X_train_txt[3])
print('\n Instance Label:', y_train[3], data_train.target_names[y_train[3]])

### Exercise 2
Each document is currently a string, which scikit-learn can’t use directly. In order to feed predictive models with text data, one first needs to turn the text into vectors of numerical values suitable for statistical analysis. This can be achieved with the utilities in the `sklearn.feature_extraction.text` module.

#### Exercise 2.(a)
The class `DictVectorizer` can be used to convert feature arrays represented as lists of standard Python dict objects to the NumPy/SciPy representation used by scikit-learn estimators.

`DictVectorizer` implements what is called one-of-K or “one-hot” coding for categorical features. Recall that, in this method, we replace each categorical attribute having m values with m binary attributes.
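As a small illustration of the one-hot coding described above, here is a minimal sketch using `DictVectorizer` on a few made-up records. The records and the names `records`, `dict_vectoriser` and `X_onehot` are hypothetical and are not part of the workshop data.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy records: one categorical attribute and one numeric attribute.
records = [
    {"colour": "red", "length": 3.0},
    {"colour": "green", "length": 1.5},
    {"colour": "red", "length": 2.0},
]

# sparse=False returns a plain NumPy array, which is easier to read when printed.
dict_vectoriser = DictVectorizer(sparse=False)
X_onehot = dict_vectoriser.fit_transform(records)

# The categorical attribute 'colour' becomes one binary column per value;
# the numeric attribute 'length' is passed through unchanged.
print(dict_vectoriser.feature_names_)  # ['colour=green', 'colour=red', 'length']
print(X_onehot)
```

Here the attribute `colour` has two values, so it is replaced by two binary columns, while the numeric attribute keeps a single column.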

Refresh your knowledge of one-hot encoding and other possible discretisation methods from week 3. Then extract the words (or tokens) in the text and count them using `CountVectorizer`, which builds a dictionary associating each word (token) in a text document with its frequency in that document. The vocabulary is stored as a dict object in `vectoriser.vocabulary_`.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectoriser = CountVectorizer()

X_train = vectoriser.fit_transform(X_train_txt)
X_test = vectoriser.transform(X_test_txt)

print(len(vectoriser.vocabulary_))
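To see what `CountVectorizer` produces, the following sketch runs it on two made-up sentences. The names `toy_docs`, `toy_vectoriser` and `toy_counts` are illustrative only, and the default tokeniser settings are assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus of two short documents.
toy_docs = ["the cat sat on the mat", "the dog sat"]

toy_vectoriser = CountVectorizer()
toy_counts = toy_vectoriser.fit_transform(toy_docs)  # sparse matrix: documents x tokens

# vocabulary_ maps each token to its column index (columns are in alphabetical order).
vocab = toy_vectoriser.vocabulary_
for token in sorted(vocab, key=vocab.get):
    print(token, toy_counts[:, vocab[token]].toarray().ravel())

print(toy_counts.shape)  # (2, 6): two documents, six distinct tokens
```

Each row is a document and each column is a token, so the entry for "the" in the first document is 2, because that word occurs twice there.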

#### Exercise 2.(b)
After “vectorising” the data, what is the shape of `X_train` and `X_test`?

In [None]:
print(..., ...)

#### Exercise 2.(c)
Are there any documents in X_test whose values are all 0? Why might this happen?

In [None]:
print(...)
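One possible way to look for all-zero rows in a sparse matrix is sketched below on a made-up matrix, not on the workshop data. The names `toy`, `nonzeros_per_row` and `empty_rows` are hypothetical. The sketch assumes the matrix stores no explicit zeros, since `getnnz` counts stored entries.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy document-term matrix; the middle row has no non-zero counts.
toy = csr_matrix(np.array([[0, 2, 1],
                           [0, 0, 0],
                           [3, 0, 0]]))

# Number of stored (non-zero) entries in each row.
nonzeros_per_row = toy.getnnz(axis=1)

# Indices of the rows that contain no non-zero entries at all.
empty_rows = np.flatnonzero(nonzeros_per_row == 0)
print(empty_rows)  # [1]
```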

### Exercise 3
Now that we have "vectorised" our text, we want to choose the **best** attributes.

Feature Selection in `scikit-learn` can be done using `SelectKBest`.

#### Exercise 3.(a)
Find out what the best 10 features were for your dataset, according to $\chi^2$ .

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

x2 = SelectKBest(...,...)

X_train_x2 = x2.fit_transform(X_train,y_train)
X_test_x2 = x2.transform(X_test)

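# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
feature_names = vectoriser.get_feature_names_out()
for feat_num in x2.get_support(indices=True):
    print(feature_names[feat_num])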

Do they correspond to your intuitions? Is there any evidence of the biases inherent in $\chi^2$? What if you look deeper than the top 10?
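To make the $\chi^2$ ranking concrete, here is a minimal sketch on a made-up count matrix. The data and the names `toy_X`, `toy_y` and `selector` are hypothetical, and the value of `k` is chosen only for this illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy term counts: 4 documents x 3 terms.
# Term 0 occurs only in class 0, term 1 only in class 1,
# and term 2 occurs equally often in both classes.
toy_X = np.array([[3, 0, 1],
                  [2, 0, 1],
                  [0, 4, 1],
                  [0, 3, 1]])
toy_y = np.array([0, 0, 1, 1])

selector = SelectKBest(chi2, k=2).fit(toy_X, toy_y)

print(selector.scores_)                    # [5. 7. 0.]
print(selector.get_support(indices=True))  # [0 1]
```

The term that is spread evenly across both classes scores 0. Term 1 outranks term 0 only because its total count is higher, even though both terms separate the classes perfectly.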

#### Exercise 3.(b)
Do the same thing for Mutual Information, instead of $\chi^2$ (note that you want the classification version, not the regression version).

In [None]:
from sklearn.feature_selection import mutual_info_classif

mi = SelectKBest(...,...)

X_train_mi = mi.fit_transform(X_train,y_train)
X_test_mi = mi.transform(X_test)

print(X_test_mi.shape, X_train_mi.shape)

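# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
feature_names = vectoriser.get_feature_names_out()
for feat_num in mi.get_support(indices=True):
    print(feature_names[feat_num])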
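The sketch below shows how `mutual_info_classif` scores two binary features on made-up data. The names `toy_X`, `toy_y`, `mi_scores` and `toy_selector` are hypothetical. It passes `discrete_features=True` explicitly because the toy matrix is a dense array; for sparse input such as the output of `CountVectorizer`, the default `'auto'` setting already treats features as discrete.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Hypothetical toy data: feature 0 is identical to the label,
# feature 1 is independent of the label.
toy_X = np.array([[1, 0],
                  [1, 1],
                  [0, 0],
                  [0, 1]])
toy_y = np.array([1, 1, 0, 0])

# Scores are in nats: a perfectly informative binary feature scores ln(2).
mi_scores = mutual_info_classif(toy_X, toy_y, discrete_features=True)
print(mi_scores)  # approximately [0.693, 0.0]

# partial() fixes the extra keyword argument so the function can be used as a score_func.
toy_selector = SelectKBest(partial(mutual_info_classif, discrete_features=True), k=1)
toy_selector.fit(toy_X, toy_y)
print(toy_selector.get_support(indices=True))  # [0]
```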

### Exercise 4. 
Build a classifier on the training dataset, and evaluate its Accuracy on the test set. Consider k-NN, and perhaps Naive Bayes or Decision Trees.
#### Exercise 4.(a) 
It’s likely that the dataset is still small enough that you can build a model on the entire feature set (after the `CountVectorizer`, but before the `SelectKBest`) without crashing your computer. How well do these models predict the test data, using all of the features?

#### Exercise 4.(b)  
How does this compare with 1000 features, or just the top 10 features?

#### Exercise 4.(c) 
Try some different cut-off values for `SelectKBest`: is it possible to improve upon the Accuracy observed for the models that use the entire feature set? Is this more true for some learners than others? Does your choice between $\chi^2$ and Mutual Information make a difference?

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB

models = [GaussianNB(),
          MultinomialNB(),
          DecisionTreeClassifier(max_depth=1),
          KNeighborsClassifier(n_neighbors=1),
          KNeighborsClassifier(n_neighbors=5),
          DecisionTreeClassifier(max_depth=None)]
#          svm.LinearSVC(C=C),
#          svm.SVC(kernel='rbf', gamma=0.7, C=C),
#          svm.SVC(kernel='poly', degree=3, C=C)]
titles = ['GNB',
          'MNB',
          'one-r',
          '1-nearest neighbour',
          '5-nearest neighbour',
          'Decision Tree']
#          'LinearSVC',
#          'SVM with an RBF kernel',
#          'SVM with a cubic kernel']

k = ...

x2 = SelectKBest(...)
x2.fit(...,...)
X_train_x2 = x2.transform(...)
X_test_x2 = x2.transform(...)


mi = SelectKBest(...)
mi.fit(...,...)
X_train_mi = mi.transform(...)
X_test_mi = mi.transform(...)


Xs = [(X_train, X_test), (X_train_x2, X_test_x2), (X_train_mi, X_test_mi)]

X_names = ['complete', 'x2', 'mi']

for title, model in zip(titles, models):
    print('\n',title, '(with k=',k,'features):')
    for X_name, X in zip(X_names, Xs):
        X_train_t, X_test_t = X
        # toarray() gives a NumPy ndarray; todense() returns np.matrix, which recent scikit-learn rejects
        model.fit(X_train_t.toarray(), y_train)
        acc = model.score(X_test_t.toarray(), y_test)
        print(X_name, '\t acc',  acc)
