# Lab 4: Further Document Classification (Part 1)

### Overview
This topic builds on the activities of the previous topic on sentiment analysis. You will be focussing on the Amazon review corpus with a view to investigating the following issues.

- Evaluation metrics for classifier performance
- What is the impact of varying training data size? To what extent does increasing the quantity of training data improve classifier performance?
- What is the impact of changing domain (i.e. book, dvd, electronics, kitchen)?

By this stage, you should be very comfortable with Python's [list comprehensions](http://docs.python.org/tutorial/datastructures.html#list-comprehensions) and [slice](http://bergbom.blogspot.co.uk/2011/04/python-slice-notation.html) notation.
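
As a quick refresher, the sketch below shows both idioms on a made-up list of tokens. The variable names and data are purely illustrative and are not part of the lab code.

```python
# Illustrative data only: a made-up tokenised review
tokens = ["this", "dvd", "is", "really", "great", "fun"]

# List comprehension: keep only the tokens longer than three characters
long_tokens = [t for t in tokens if len(t) > 3]

# Slice notation: the first two tokens, the last two tokens, every second token
first_two = tokens[:2]
last_two = tokens[-2:]
every_second = tokens[::2]

print(long_tokens)   # ['this', 'really', 'great']
print(first_two)     # ['this', 'dvd']
print(last_two)      # ['great', 'fun']
print(every_second)  # ['this', 'is', 'great']
```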



### Preliminaries 

Let's set up Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

>To access functionality defined in previous notebooks, copy the functions defined in Week3Labs into a `utils.py` file and then import it into the notebook.  There is a `utils.py` file included with these resources which you can update.

In [None]:
#import code to setup training and testing data, wordlist classifiers and NB classifiers

import sys
sys.path.append('/content/drive/My Drive/NLENotebooks/resources/')
#from utils import *

## Evaluation Metrics for Classifier Performance

### Accuracy
Here is code for an evaluation function <code>evaluate_wordlist_classifier</code> which can be used to determine how well a word list classifier performs. This function returns the <b>accuracy</b> of a classifier. The accuracy metric is defined as the proportion of documents that were correctly classified.  Look at the code and make sure you understand what it is doing.

In [None]:
def evaluate_wordlist_classifier(cls, pos_test_data, neg_test_data):
  '''
  cls: an instance of a classifier object which has a classify method which returns "P" or "N"
  pos_test_data: a list or generator of positive Amazon review objects
  neg_test_data: a list or generator of negative Amazon review objects
  
  returns: floating point number which is the accuracy of the classifier on the test data provided 
  '''
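  correct = 0  # number of reviews classified correctly
  total = 0    # number of reviews seen; counted here because generators have no len()
  for review in pos_test_data:
    correct += 1 if cls.classify(review.words()) == "P" else 0
    total += 1
    
  for review in neg_test_data:
    correct += 1 if cls.classify(review.words()) == "N" else 0
    total += 1
    
  return correct / total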

#do not change the code in this cell
import random  # split_data uses random.sample
def split_data(data, ratio=0.7): # when the second argument is not given, it defaults to 0.7
    """
    Given a corpus generator and a ratio, partitions the corpus into training data
    and test data, where the proportion of documents placed in train is ratio.

    :param data: A corpus generator.
    :param ratio: The proportion of training documents (default 0.7)
    :return: a pair (tuple) of lists where the first element of the 
            pair is a list of the training data and the second is a list of the test data.
    """
    
    data = list(data)  
    n = len(data)  
    train_indices = random.sample(range(n), int(n * ratio))          
    test_indices = list(set(range(n)) - set(train_indices))    
    train = [data[i] for i in train_indices]           
    test = [data[i] for i in test_indices]             
    return (train, test)                       
 

Now we need some data to train and test our classifier on.  We are going to make use of the `AmazonReviewCorpusReader` to get a sample of dvd reviews and then split it into training and testing data (as before!).

In [None]:
#get an AmazonReviewCorpusReader for dvd reviews
from sussex_nltk.corpus_readers import AmazonReviewCorpusReader
import random
dvd_reader = AmazonReviewCorpusReader().category("dvd")

#The following two lines use the documents function on the Amazon corpus reader. 
#This returns a generator over reviews in the corpus. 
#Each review is an instance of a Python class called AmazonReview. 
#An AmazonReview object contains all the data about a review.
pos_train, pos_test = split_data(dvd_reader.positive().documents())
neg_train, neg_test = split_data(dvd_reader.negative().documents())

#You can also combine the training data
train = pos_train + neg_train

We are now going to use this function to start evaluating our classifiers from last week.  Let's first try the SimpleClassifier.

In [None]:
#here I am going to create an instance of a very simple classifier 
top_pos=[]
top_neg=[]
dvd_classifier = SimpleClassifier(top_pos, top_neg)


#Evaluate classifier
#The function requires three arguments:
# 1. Word list based classifier
# 2. A list (or generator) of positive AmazonReview objects
# 3. A list (or generator) of negative AmazonReview objects
score = evaluate_wordlist_classifier(dvd_classifier, pos_test, neg_test)  
print(score)

0.5


If you have run the cell above without updating the SimpleClassifier code you should see that the accuracy is 0.5 i.e., 50%. The original SimpleClassifier just assigns everything to the positive class.  Since it is a binary classification decision and the classes are balanced, it will get 50% of the decisions correct (those that are positive) and 50% of the decisions incorrect (those that are actually negative).  This is the **baseline** result for this kind of classification task.  We obviously want to build classifiers that do better than this.
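
The baseline argument can be checked on toy data. The following is a minimal sketch, assuming a hypothetical `AlwaysPositive` class as a stand-in for the untrained SimpleClassifier and a made-up, balanced list of gold labels; neither is part of the lab code.

```python
class AlwaysPositive:
    """Hypothetical stand-in for the untrained SimpleClassifier: labels everything "P"."""

    def classify(self, words):
        return "P"


# Made-up, balanced gold labels: 3 positive and 3 negative documents
gold_labels = ["P", "P", "P", "N", "N", "N"]

baseline = AlwaysPositive()
# The document content is irrelevant to this classifier, so an empty word list is passed
predictions = [baseline.classify([]) for _ in gold_labels]

correct = sum(1 for p, g in zip(predictions, gold_labels) if p == g)
baseline_accuracy = correct / len(gold_labels)
print(baseline_accuracy)  # 0.5
```

Only the three positive documents are labelled correctly, so the accuracy is 3/6.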

Now try one of the classifiers that you wrote that selects positive and negative words from the training data.  Hopefully, this will perform better than the baseline of 50%. 

In [None]:
#Create a new classifier
#Make sure you have updated the code in utils.py to contain your WordList Classifier
my_dvd_classifier = SimpleClassifier_mf(100)
#train it
my_dvd_classifier.train(pos_train,neg_train)
#evaluate it on the test data
score=evaluate_wordlist_classifier(my_dvd_classifier,pos_test,neg_test)
print(score)

0.5883333333333334


### Evaluating a Naïve Bayes classifier on test data
We are now ready to run our Naïve Bayes classifier on a set of test data. When we do this we want to return the accuracy of the classifier on that data, where accuracy is calculated as follows:

$$\frac{\mbox{number of test documents that the classifier classifies correctly}}
{\mbox{total number of test documents}}$$

In order to compute this accuracy score, we need to give the classifier **labelled** test data.
- This will be in the same format as the training data.

>In the cell below, we set up a small training set (3 `weather` and 2 `football` sentences) together with 5 test documents in the class `weather` and 5 test documents in the class `football`.

>Run this cell.

In [None]:
weather_sents_train = [
    "today it is raining",
    "looking cloudy today",
    "it is nice weather",
]

football_sents_train = [
    "city looking good",
    "advantage united",
]

weather_data_train = [({word: True for word in sent.split()}, "weather") for sent in weather_sents_train] 
football_data_train = [({word: True for word in sent.split()}, "football") for sent in football_sents_train]
train_data = weather_data_train + football_data_train

weather_sents_test = [
    "the weather today is nice",
    "it is raining cats and dogs",
    "the weather here is wet",
    "it was hot today",
    "rain due tomorrow",
]

football_sents_test = [
    "what a great goal that was",
    "poor defending by the city center back",
    "wow he missed a sitter",
    "united are a shambles",
    "shots raining down on the keeper",
]

weather_data_test = [({word: True for word in sent.split()}, "weather") for sent in weather_sents_test] 
football_data_test = [({word: True for word in sent.split()}, "football") for sent in football_sents_test]
test_data = weather_data_test + football_data_test



In [None]:
train_data

[({'is': True, 'it': True, 'raining': True, 'today': True}, 'weather'),
 ({'cloudy': True, 'looking': True, 'today': True}, 'weather'),
 ({'is': True, 'it': True, 'nice': True, 'weather': True}, 'weather'),
 ({'city': True, 'good': True, 'looking': True}, 'football'),
 ({'advantage': True, 'united': True}, 'football')]

In [None]:
test_data

[({'is': True, 'nice': True, 'the': True, 'today': True, 'weather': True},
  'weather'),
 ({'and': True,
   'cats': True,
   'dogs': True,
   'is': True,
   'it': True,
   'raining': True},
  'weather'),
 ({'here': True, 'is': True, 'the': True, 'weather': True, 'wet': True},
  'weather'),
 ({'hot': True, 'it': True, 'today': True, 'was': True}, 'weather'),
 ({'due': True, 'rain': True, 'tomorrow': True}, 'weather'),
 ({'a': True,
   'goal': True,
   'great': True,
   'that': True,
   'was': True,
   'what': True},
  'football'),
 ({'back': True,
   'by': True,
   'center': True,
   'city': True,
   'defending': True,
   'poor': True,
   'the': True},
  'football'),
 ({'a': True, 'he': True, 'missed': True, 'sitter': True, 'wow': True},
  'football'),
 ({'a': True, 'are': True, 'shambles': True, 'united': True}, 'football'),
 ({'down': True,
   'keeper': True,
   'on': True,
   'raining': True,
   'shots': True,
   'the': True},
  'football')]

### Exercise 1.1
In the cell below implement a `classifier_evaluate` function that returns the accuracy of a classifier on a set of labelled test data.
`classifier_evaluate` should take the following arguments:
- a (trained) classifier (e.g., an instance of NBClassifier)
- the labelled test data

If you have not implemented your own NBClassifier as a stand-alone class, you could implement a version of the `classifier_evaluate` function which makes use of the `classify` function, and takes the following arguments:
- the test data
- the class priors
- the conditional probabilities
- the known vocabulary (though this is redundant since it could be computed from the conditional probabilities)

In any case, `classifier_evaluate` should return the accuracy of the classifier on the test data.

Try out your `classifier_evaluate` function on the test data in the cell above.

In [None]:
def classifier_evaluate(classifier,test_data):
    
    docs,goldstandard=zip(*test_data) #note this neat pythonic way of turning a list of pairs into a pair of lists
    predictions=classifier.classify_many(docs)
    #print(predictions)
    correct=0
    for (prediction,gold) in zip(predictions,goldstandard):
        if prediction ==gold:
            correct+=1
    return correct/len(test_data)

    

In [None]:
myclassifier=NBClassifier()
myclassifier.train(train_data)

In [None]:
classifier_evaluate(myclassifier,test_data)

0.7

### Exercise 1.2
If you have written your classifier_evaluate() code in a fairly generic way, you should find that it is **not** specific to NB classification.  You should be able to pass it any classifier and test_data (formatted in the same way) and evaluate the accuracy.  
* Format the test_data for the Amazon reviews in the same way as the weather_football sentences (i.e., convert the list of documents into a list of (document, label) pairs)
* Make any updates necessary to your classifier_evaluate() code
* Use your function to evaluate the accuracy of the SimpleClassifier (and check it gives the same result as we saw earlier in the notebook!)

In [None]:
Amazon_test=[(doc,"P") for doc in pos_test]+[(doc,"N") for doc in neg_test]


In [None]:
classifier_evaluate(my_dvd_classifier,Amazon_test)

In [None]:
print(Amazon_test[3][0].words())

### Exercise 1.3
Now, we want to run your NB classifier on a real problem - the classification of Amazon reviews as positive or negative.
* use your feature extraction code from Lab_3_2 to convert the Amazon Review corpus training data into the same format that your NB_classifier expects.
* train a nb_classifier on this training data
* test it on the test data

In [None]:
def feature_extract(review):
    #print(review.words())
    return {word:True for word in review.words()}
    
Amazon_train=[(feature_extract(review),'P')for review in pos_train]+[(feature_extract(review),'N') for review in neg_train]
Amazon_test=[(feature_extract(review),'P')for review in pos_test]+[(feature_extract(review),'N') for review in neg_test]

In [None]:
nb_dvd_classifier=NBClassifier()
nb_dvd_classifier.train(Amazon_train)

In [None]:
classifier_evaluate(nb_dvd_classifier,Amazon_test)

In [None]:
classifier_evaluate(my_dvd_classifier,Amazon_test)

## Precision, Recall and F1 Score

When classes are unbalanced, evaluating classifiers in terms of accuracy can be misleading.  For example, if 10% of documents are relevant and 90% of documents are irrelevant, then a classifier which labels all documents as irrelevant will obtain an accuracy of 90%.  This sounds good but is actually useless. More useful metrics for evaluation of performance are precision, recall and F1 score.  These metrics allow us to distinguish the different types of errors our classifiers make.
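
The 10%/90% example can be reproduced in a few lines. This is a sketch on made-up labels (`"R"` for relevant, `"I"` for irrelevant); it also computes the proportion of relevant documents that were found, which is the recall defined later in this section.

```python
# Made-up gold labels: 10 relevant ("R") and 90 irrelevant ("I") documents
gold = ["R"] * 10 + ["I"] * 90
# A classifier that labels every document as irrelevant
predicted = ["I"] * len(gold)

# Accuracy: proportion of documents whose prediction matches the gold label
accuracy = sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Proportion of the relevant documents that the classifier actually found
relevant_found = sum(p == g == "R" for p, g in zip(predicted, gold))
recall_relevant = relevant_found / gold.count("R")

print(accuracy)         # 0.9
print(recall_relevant)  # 0.0
```

The accuracy looks high, yet not a single relevant document is retrieved.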

For each class, $c$, we need to keep a record of 
* True Positives: $TP=|\{i|\mbox{prediction}(i)=\mbox{label}(i)=c\}|$
* False Negatives: $FN=|\{i|\mbox{prediction}(i)\neq \mbox{label}(i)=c\}|$
* False Positives: $FP=|\{i|\mbox{label}(i) \neq \mbox{prediction}(i)=c\}|$
* True Negatives: $TN=|\{i|\mbox{prediction}(i)=\mbox{label}(i)\neq c\}|$

Note the symmetry in the binary classification task (the TN for one class are the TP for the other class and so on).  Therefore, in binary classification, we just record these values and compute the following evaluation metrics for a single class (e.g. "Relevant" or "Positive")
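
To make the four counts and their symmetry concrete, here is a small sketch that follows the set definitions above. The helper name `count_outcomes` and the toy labels are illustrative assumptions, not part of the lab code.

```python
def count_outcomes(predictions, gold, c):
    """Return (TP, FN, FP, TN) for class c, following the definitions above."""
    pairs = list(zip(predictions, gold))
    tp = sum(1 for p, g in pairs if p == g and g == c)  # prediction = label = c
    fn = sum(1 for p, g in pairs if p != g and g == c)  # prediction != label = c
    fp = sum(1 for p, g in pairs if g != p and p == c)  # label != prediction = c
    tn = sum(1 for p, g in pairs if p == g and g != c)  # prediction = label != c
    return tp, fn, fp, tn


# Made-up gold labels and predictions for six documents
gold = ["P", "P", "P", "N", "N", "N"]
predictions = ["P", "P", "N", "P", "P", "N"]

print(count_outcomes(predictions, gold, "P"))  # (2, 1, 2, 1)
print(count_outcomes(predictions, gold, "N"))  # (1, 2, 1, 2)
```

Comparing the two tuples shows the symmetry: the TP count for `"P"` equals the TN count for `"N"`, and the FN count for `"P"` equals the FP count for `"N"`.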

* Precision: 
\begin{eqnarray*}
P=\frac{TP}{TP+FP}
\end{eqnarray*}
* Recall: 
\begin{eqnarray*}
R=\frac{TP}{TP+FN}
\end{eqnarray*}
* F1-score: 
\begin{eqnarray*}
F1 = \frac{2\times P\times R}{P+R}
\end{eqnarray*}
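
As a worked example of these three formulas, the sketch below computes them from raw counts. The function name `precision_recall_f1`, the made-up counts, and the convention of returning 0.0 when a denominator is zero are all assumptions made for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute (P, R, F1) from raw counts.

    Returns 0.0 for any metric whose denominator is zero (an assumed convention).
    """
    p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1


# Made-up counts: 8 true positives, 2 false positives, 8 false negatives
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(p, r, round(f1, 3))  # 0.8 0.5 0.615
```

Here $P = 8/10$ and $R = 8/16$, so $F1 = (2 \times 0.8 \times 0.5)/1.3 \approx 0.615$, which sits closer to the lower of the two values than their arithmetic mean would.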


### Exercise 2.1

The code below defines a ConfusionMatrix class for the binary classification task.  Currently, it computes the number of TPs, FPs, FNs and TNs.  Test it out with predictions and test data for the
* sentiment analysis task (Amazon dvd review data, as used above)
* topic classification task (weather_football sentence data)

In [None]:
class ConfusionMatrix:
    def __init__(self,predictions,goldstandard,classes=("P","N")):
        (self.c1,self.c2)=classes
        self.TP=0
        self.FP=0
        self.FN=0
        self.TN=0
        for p,g in zip(predictions,goldstandard):
            if g==self.c1:
                if p==self.c1:
                    self.TP+=1
                else:
                    self.FN+=1
            
            elif p==self.c1:
                self.FP+=1
            else:
                self.TN+=1
        
    
    def precision(self):
        p=0
        #put your code to compute precision here
        
        return p
    
    def recall(self):
        r=0
        #put your code to compute recall here
        
        return r
    
    def f1(self):
        f1=0
        #put your code to compute f1 here
         
        return f1 

In [None]:
#docs will contain the documents to classify, labels contains the corresponding gold standard labels
docs,labels=zip(*Amazon_test)

In [None]:


senti_cm=ConfusionMatrix(my_dvd_classifier.classify_many(docs),labels)
print(senti_cm.TP)
print(senti_cm.FP)
print(senti_cm.TN)
print(senti_cm.FN)

In [None]:
senti_nb_cm=ConfusionMatrix(nb_dvd_classifier.classify_many(docs),labels)
print(senti_nb_cm.TP)
print(senti_nb_cm.FP)
print(senti_nb_cm.TN)
print(senti_nb_cm.FN)

In [None]:
docs,labels=zip(*test_data)
nb_cm=ConfusionMatrix(myclassifier.classify_many(docs),labels,("football","weather"))

In [None]:
print(nb_cm.TP)
print(nb_cm.FP)
print(nb_cm.TN)
print(nb_cm.FN)

### Exercise 2.2
* Add functionality to the ConfusionMatrix class code to compute precision, recall and F1 score
* Use your code to evaluate the performance of the different classifiers you have constructed.
* Interpret your results

## Investigating the impact of the quantity of training data
We will begin by exploring the impact on classification accuracy of using different quantities of training data.

The code in the cell below combines functionality built up earlier and will enable you to get training and testing data (in the correct format) for your classifiers.  It also defines a WordListClassifier class which expects training data in the same format as the NB Classifier - this is very important if we want to be able to easily switch between using different classifiers.


In [None]:
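from functools import reduce           # reduce is not a builtin in Python 3
from nltk.probability import FreqDist  # used by WordListClassifier.train below

def feature_extract(review):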
    #print(review.words())
    return {word:True for word in review.words()}

def get_training_test_data(category):
    reader=AmazonReviewCorpusReader().category(category)
    pos_train, pos_test = split_data(reader.positive().documents())
    neg_train, neg_test = split_data(reader.negative().documents())
    train_data=[(feature_extract(review),'P')for review in pos_train]+[(feature_extract(review),'N') for review in neg_train]
    test_data=[(feature_extract(review),'P')for review in pos_test]+[(feature_extract(review),'N') for review in neg_test]
    return train_data,test_data



class WordListClassifier(SimpleClassifier):
    #this WordListClassifier uses the same feature representation as the NB classifier
    #i.e., a multivariate Bernoulli event model where multiple occurrences of the same word in the same document are not counted.
        
    def __init__(self,k):
        self._labels=["P","N"]
        self.k=k
        
    def get_all_words(self,docs):
        return reduce(lambda words,doc: words + list(doc.keys()), docs, [])
    
    def train(self,training_data):
        pos_train=[doc for (doc,label) in training_data if label == self.labels()[0]]
        neg_train=[doc for (doc,label) in training_data if label == self.labels()[1]]
        
        pos_freqdist=FreqDist(self.get_all_words(pos_train))
        neg_freqdist=FreqDist(self.get_all_words(neg_train))
        
        self._pos=most_frequent_words(pos_freqdist,self.k)
        self._neg=most_frequent_words(neg_freqdist,self.k)
     

Now run the code in the cell below several times.  Each time it should generate a new sample of review data, train the classifiers and evaluate them.

In [None]:

import pandas as pd  # needed for the results table and plot below

training,testing=get_training_test_data("dvd")


#stopwords = stopwords.words('english')
word_list_size = 100
classifiers={"Word List":WordListClassifier(word_list_size),
             "Naive Bayes":NBClassifier()}
use=["Word List","Naive Bayes"]

results=[]
for name,classifier in classifiers.items():
    if name in use:
        classifier.train(training)
        accuracy=classifier_evaluate(classifier,testing)
        print("The accuracy of {} classifier is {}".format(name,accuracy))
        results.append((name,accuracy))
             
df = pd.DataFrame(results)
display(df)
ax = df.plot.bar(title="Experimental Results",legend=False,x=0)
ax.set_ylabel("Classifier Accuracy")
ax.set_xlabel("Classifier")
ax.set_ylim(0,1.0)

As you can see, the classifiers have different accuracies on different runs. 

### Exercise 3.1
Copy the cell above and move the copy to be positioned below this cell. Then adapt the code so that the accuracy reported for each classifier is the average across multiple runs.
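
One possible way to structure the averaging is sketched below. It assumes a hypothetical function `run_once` that performs a single sample/train/evaluate cycle and returns a dictionary mapping classifier names to scores; the scores used in the demonstration are made up and are not real results.

```python
from statistics import mean


def average_over_runs(run_once, n_runs=5):
    """Call run_once() n_runs times and average the scores per classifier name.

    run_once is a hypothetical function returning a dict {classifier name: score}.
    """
    scores_by_name = {}
    for _ in range(n_runs):
        for name, score in run_once().items():
            scores_by_name.setdefault(name, []).append(score)
    return {name: mean(scores) for name, scores in scores_by_name.items()}


# Demonstration with made-up scores standing in for two real runs
fake_runs = iter([
    {"Word List": 0.5, "Naive Bayes": 0.75},
    {"Word List": 0.75, "Naive Bayes": 0.25},
])
averages = average_over_runs(lambda: next(fake_runs), n_runs=2)
print(averages)  # {'Word List': 0.625, 'Naive Bayes': 0.5}
```

In the notebook, `run_once` would wrap the calls to `get_training_test_data`, `train` and `classifier_evaluate`, so that each run draws a fresh random split.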

### Exercise 3.2
Adapt the code so that it calculates average precision, recall and F1-score rather than average accuracy.

The next step involves measuring the performance of both the word list and Naïve Bayes classifiers on a range of subsets of the dvd reviews in the extended dvd review corpus.

- The full data set has 1000 positive and 1000 negative reviews. 
- You should continue to use 30% of the data for testing, so this means that we have up to 700 positive and 700 negative reviews to sample from.
- Consider (at least) the following sample sizes: 1, 10, 50, 100, 200, 400, 600 and 700.
- Note that the sample size is not the total number of reviews, but the number of positive reviews (which is also equal to the number of negative reviews).

### Exercise 3.3
Copy the code cell that you created for the last exercise, and place the copy below this cell. Then adapt the code to determine accuracy, precision, recall and F1-score for each classifier on each subset.

Use the `sample` function from the random module, which means you should include the line:  
`from random import sample`
- Make sure that you are selecting samples that have an equal number of positive and negative reviews.
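
A balanced sample can be drawn by sampling each class separately. The following is a minimal sketch, assuming training data in the (features, label) format used in this notebook; the helper name `balanced_sample` and the toy data are illustrative only.

```python
from random import sample


def balanced_sample(train_data, n, labels=("P", "N")):
    """Return n randomly chosen training items for each label.

    train_data is a list of (features, label) pairs.
    random.sample raises ValueError if n exceeds the number of items for a label.
    """
    subset = []
    for label in labels:
        items = [item for item in train_data if item[1] == label]
        subset += sample(items, n)
    return subset


# Made-up training data: 5 positive and 5 negative items
toy_train = [({"w%d" % i: True}, "P") for i in range(5)] + \
            [({"w%d" % i: True}, "N") for i in range(5)]

subset = balanced_sample(toy_train, 2)
print([label for _, label in subset])  # ['P', 'P', 'N', 'N']
```

Which items are chosen varies from run to run, but the number of items per class is always `n`.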

Use a Pandas dataframe to display the results in a table.
- The table should have nine columns:
 - C1 for the sample sizes, 
 - C2-C5 for the Word List classifier performance metrics, and 
 - C6-C9 for the Naïve Bayes classifier performance metrics.

- You can use `pd.set_option('display.precision',2)` to limit the reals to have 2 digits after the decimal point (recent versions of pandas no longer accept the shorter key `'precision'`).
- Create a dataframe like this:
```
pd.DataFrame(list(zip(<column 1 list>, <column 2 list>, ...)),
                  columns=<a list of the column headings>)
```

### Exercise 3.4

Make a copy of the cell you created for the previous exercise and move it to be positioned below this cell. Using the new cell, repeat the above for each of the product categories.
- The available categories are `'dvd'`, `'book'`, `'kitchen'` and `'electronics'`. 



### Exercise 3.5
Interpret your results.  Specifically,
1. What is the impact of the amount of training data on classifier performance?  
2. Does this vary according to the classifier used?
3. Does this vary according to the category of the data?
4. Which classifier would you recommend to somebody else to use in their product? Are there any caveats or scenarios that you would warn them about (for example, when it might not work as well as expected or when a different classifier might be better)?