# Week 3: Basic Document Classification (Part 1)

## Preliminaries 

In [None]:
#necessary library imports and setup introduced previously

from google.colab import drive
#mount google drive
drive.mount('/content/drive/')

import sys
#sys.path.append(r'T:\Departments\Informatics\LanguageEngineering') 
#sys.path.append(r'\\ad.susx.ac.uk\ITS\TeachingResources\Departments\Informatics\LanguageEngineering\resources')
#sys.path.append(r'/Users/juliewe/resources')
sys.path.append('/content/drive/My Drive/NLENotebooks/resources/')

import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from itertools import zip_longest
from nltk.tokenize import word_tokenize

from sussex_nltk.corpus_readers import ReutersCorpusReader

Mounted at /content/drive/
Sussex NLTK root directory is /content/drive/My Drive/NLENotebooks/resources


In [None]:
#download nltk resources
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Overview 
In labs this week (and next), the focus will be on the application of sentiment analysis. You will be using a corpus of **book reviews** within an **Amazon review corpus**.

You will be exploring various techniques that can be used to classify the sentiment of Amazon book reviews as either positive or negative. 

You will be developing your own **Word List** and **Naïve Bayes** classifiers and then comparing them to the **NLTK Naïve Bayes** classifier.

## Creating training and testing sets
You will be training and testing various document classifiers. It is essential that the data used in the testing phase is not used during the training phase, since this can lead to overestimating performance. 

We now introduce the `split_data` function (defined in the cell below) which can be used to get separate **training** and **testing** sets.

> Look through the code in the following cell, reading the comments and making sure that you understand each line.

In [None]:
from random import sample # have a look at https://docs.python.org/3/library/random.html to see what random.sample does
from sussex_nltk.corpus_readers import AmazonReviewCorpusReader

 
def split_data(data, ratio=0.7): # when the second argument is not given, it defaults to 0.7
    """
    Given a corpus generator and a ratio, partitions the corpus into training data
    and test data, where the proportion of documents placed in train is ratio.

    :param data: A corpus generator.
    :param ratio: The proportion of training documents (default 0.7)
    :return: a pair (tuple) of lists where the first element of the 
            pair is a list of the training data and the second is a list of the test data.
    """
    
    data = list(data) # data is a generator, so this puts all the generated items in a list
 
    n = len(data)  #Find out the number of samples present
    train_indices = sample(range(n), int(n * ratio))          #Randomly select training indices
    test_indices = list(set(range(n)) - set(train_indices))   #Other items are testing indices
 
    train = [data[i] for i in train_indices]           #Use training indices to select data
    test = [data[i] for i in test_indices]             #Use testing indices to select data
 
    return (train, test)                       #Return split data
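
As a quick sanity check, here is a minimal sketch that applies the same splitting logic to a toy list. The names `toy_docs`, `toy_train` and `toy_test` are made up for this illustration, and the seed is fixed only so that the run is repeatable. It shows that the two parts have the expected sizes and share no items.

```python
from random import sample, seed

def split_data(data, ratio=0.7):
    """Same logic as the function above, repeated so this cell runs on its own."""
    data = list(data)
    n = len(data)
    train_indices = sample(range(n), int(n * ratio))
    test_indices = list(set(range(n)) - set(train_indices))
    train = [data[i] for i in train_indices]
    test = [data[i] for i in test_indices]
    return (train, test)

seed(0)  # assumption: fixed only to make this illustration repeatable
toy_docs = ["doc{}".format(i) for i in range(10)]  # made-up stand-ins for reviews
toy_train, toy_test = split_data(toy_docs, ratio=0.8)

print(len(toy_train), len(toy_test))   # sizes of the two parts
print(set(toy_train) & set(toy_test))  # items shared by both parts (should be none)
```

Because the test indices are computed as a set difference, no item can end up in both parts, which is exactly the property needed to avoid overestimating performance.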
 

Now we can use this function together with a <code>reader</code> object to create training and testing data. Note that <code>AmazonReviewCorpusReader().category("dvd")</code> returns a reader over just the *dvd* reviews. The methods <code>positive()</code> and <code>negative()</code> can be called to create readers over reviews classified according to their sentiment.

In [None]:
#Create an Amazon corpus reader pointing at only dvd reviews
dvd_reader = AmazonReviewCorpusReader().category("dvd")

#The following two lines use the documents function on the Amazon corpus reader. 
#This returns a generator over reviews in the corpus. 
#Each review is an instance of a Python class called AmazonReview. 
#An AmazonReview object contains all the data about a review.
dvd_pos_train, dvd_pos_test = split_data(dvd_reader.positive().documents())
dvd_neg_train, dvd_neg_test = split_data(dvd_reader.negative().documents())

#You can also combine the training data
dvd_train = dvd_pos_train + dvd_neg_train
dvd_test=dvd_pos_test + dvd_neg_test

In [None]:
dvd_pos_test[0].words()

### Exercise 1.1
* Generate 80:20 *training*:*testing* splits of all 4 categories of reviews (*dvd*, *book*, *kitchen* and *electronics*), containing **positive** and **negative** reviews.  
* Record the number of reviews according to category, sentiment and dataset (training or testing) in a Pandas dataframe
* Answer the following questions:
    1. Regarding the *training* data for *books*, how many are a) **positive**, b) **negative**?
    2. Regarding the **negative** *testing* data, how many reviews are there for each category: a) *dvd*, b) *book*, c) *kitchen* and d) *electronics*? 

In [None]:
dvd_reader = AmazonReviewCorpusReader().category("dvd")
dvd_pos_train, dvd_pos_test = split_data(dvd_reader.positive().documents(),ratio=0.8)
dvd_neg_train, dvd_neg_test = split_data(dvd_reader.negative().documents(),ratio=0.8)
print(len(dvd_pos_train))

800


In [None]:
print(len(dvd_neg_test))

200


How can we make this better?

In [None]:
categories=["dvd","kitchen","electronics","book"]

numbers=[]

for c in categories:
  c_reader= AmazonReviewCorpusReader().category(c)
  pos_train, pos_test = split_data(c_reader.positive().documents(),ratio=0.8)
  numbers.append((c,"positive","train",len(pos_train)))
  numbers.append((c,"positive","test",len(pos_test)))
  neg_train, neg_test = split_data(c_reader.negative().documents(),ratio=0.8)
  numbers.append((c,"negative","train",len(neg_train)))
  numbers.append((c,"negative","test",len(neg_test)))

print(numbers)


[('dvd', 'positive', 'train', 800), ('dvd', 'positive', 'test', 200), ('dvd', 'negative', 'train', 800), ('dvd', 'negative', 'test', 200), ('kitchen', 'positive', 'train', 800), ('kitchen', 'positive', 'test', 200), ('kitchen', 'negative', 'train', 800), ('kitchen', 'negative', 'test', 200), ('electronics', 'positive', 'train', 800), ('electronics', 'positive', 'test', 200), ('electronics', 'negative', 'train', 800), ('electronics', 'negative', 'test', 200), ('book', 'positive', 'train', 800), ('book', 'positive', 'test', 200), ('book', 'negative', 'train', 800), ('book', 'negative', 'test', 200)]


In [None]:
df=pd.DataFrame(numbers,columns=['category','sentiment','dataset','number'])
df

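|   | category | sentiment | dataset | number |
|---|----------|-----------|---------|--------|
| 0 | dvd | positive | train | 800 |
| 1 | dvd | positive | test | 200 |
| 2 | dvd | negative | train | 800 |
| 3 | dvd | negative | test | 200 |
| 4 | kitchen | positive | train | 800 |
| 5 | kitchen | positive | test | 200 |
| 6 | kitchen | negative | train | 800 |
| 7 | kitchen | negative | test | 200 |
| 8 | electronics | positive | train | 800 |
| 9 | electronics | positive | test | 200 |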


Training data for books:


1.   positive = 800
2.   negative = 800

Negative testing data:
  200 of each





## Creating word lists
The next section will explain how to use a sentiment classifier that bases its decisions on word lists. The classifier requires a list of words indicating positive sentiment, and a second list of words indicating negative sentiment. Given positive and negative word lists, a document's overall sentiment is determined based on counts of occurrences of words that occur in the two lists. In this section we are concerned with the creation of the word lists. We will be considering both hand-crafted lists and automatically generated lists.

### Exercise 2.1

- Create a reasonably long hand-crafted list of words that you think indicate positive sentiment.
- Create a reasonably long hand-crafted list of words that indicate negative sentiment.

Use the following cells to store these lists in the variables `my_positive_word_list` and `my_negative_word_list`.

In [None]:
my_positive_word_list = ["good","great","lovely", "excellent"] # extend this one or put your own list here
my_negative_word_list = ["bad", "terrible", "awful", "dreadful"] # extend this one or put your own list here

Next, you should try to derive word lists from the data. One way to do this is to use the most frequent words in positive reviews as your positive list, and the most frequent words in negative reviews as your negative list. This can be done with the [NLTK <code style="background-color: #F5F5F5;">FreqDist</code>](http://www.nltk.org/api/nltk.html#module-nltk.probability) object. 

> You should make sure you understand the code in the cell below.

In [None]:
from nltk.probability import FreqDist # see http://www.nltk.org/api/nltk.html#module-nltk.probability
from sussex_nltk.corpus_readers import AmazonReviewCorpusReader
from functools import reduce # see https://docs.python.org/3/library/functools.html

#Helper function. Given a list of reviews, return a list of all the words in those reviews
#To understand this look at the description of functools.reduce in https://docs.python.org/3/library/functools.html
def get_all_words(amazon_reviews):

    return reduce(lambda words,review: words + review.words(), amazon_reviews, [])

#Frequency distributions over all words in the positive and negative dvd training reviews
pos_freqdist = FreqDist(get_all_words(dvd_pos_train))
neg_freqdist = FreqDist(get_all_words(dvd_neg_train))

Some more examples of reduce

In [None]:
#adding up a list
mylist=[1,4,5,2,98]
reduce(lambda x,y:x+y,mylist,0)

110

In [None]:
mylist=['t','h','e',' ','d','o','g']
reduce(lambda x,y:x+y,mylist,"")

'the dog'

So what does `get_all_words()` do?

In [None]:
for review in dvd_pos_train:
  print(review.words())
  break

['This', 'wonderful', 'TV', 'special', 'from', 'the', '70', "'s", 'is', 'timeless', 'and', 'has', 'great', 'music', 'by', 'Harry', 'Nihlssen', '(', 'not', 'sure', 'of', 'that', 'spelling', ')', '.', 'The', 'lesson', 'is', 'all', 'about', 'tolerance', 'with', 'lots', 'of', 'side', 'issues', 'of', 'great', 'value', '.', 'I', 'am', 'thrilled', 'to', 'own', 'this', 'DVD', '.', 'It', 'brought', 'back', 'all', 'kinds', 'of', 'memories', 'for', 'me', '-', 'both', 'the', 'story', 'AND', 'especially', 'the', 'music', '.', 'It', 'is', 'a', 'perfect', 'addition', 'to', 'my', 'collection', 'of', 'movies', 'for', 'my', 'grandchildren']


It's just going to make a long (flattened) list of all of the words in the reviews.
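
The flattening can be seen on a toy example. This is only a sketch: `ToyReview` is a hypothetical stand-in for `AmazonReview` that provides nothing but a `words()` method, and `get_all_words_chain` is an alternative introduced here for comparison, not part of the lab code. It uses `itertools.chain.from_iterable`, which avoids building a new list at every step in the way that repeated `+` does.

```python
from functools import reduce
from itertools import chain

class ToyReview:
    """Hypothetical stand-in for AmazonReview: it only provides words()."""
    def __init__(self, text):
        self._text = text

    def words(self):
        return self._text.split()

toy_reviews = [ToyReview("great film"), ToyReview("boring plot"), ToyReview("great music")]

def get_all_words(amazon_reviews):
    # Start from [] and append each review's word list to the accumulated list
    return reduce(lambda words, review: words + review.words(), amazon_reviews, [])

def get_all_words_chain(amazon_reviews):
    # Alternative (an assumption, not the lab's method): stream the words lazily, then build one list
    return list(chain.from_iterable(review.words() for review in amazon_reviews))

flat = get_all_words(toy_reviews)
print(flat)                                      # ['great', 'film', 'boring', 'plot', 'great', 'music']
print(flat == get_all_words_chain(toy_reviews))  # True
```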

The constructor for `FreqDist` counts how many times each word type occurs and stores the counts in an object that behaves much like a dictionary. It is specialised so that the keys are the items being counted (here, strings) and the values are counts. We can look things up in a `FreqDist` in exactly the same way as in a dictionary, but it supports some other operations too (such as addition and subtraction).

In [None]:
pos_freqdist

In [None]:
pos_freqdist['This']

398

In [None]:
pos_freqdist['this']

1319
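
The two lookups above give different counts because lookups are case-sensitive: `'This'` and `'this'` are separate keys. The sketch below uses `collections.Counter` (which `FreqDist` subclasses) on a made-up token list, so it runs without the corpus. It shows one possible way of merging case variants, namely lower-casing the tokens before counting. Whether you want this depends on the task.

```python
from collections import Counter

# Made-up tokens for illustration only
tokens = ["This", "film", "is", "great", "and", "this", "cast", "is", "great"]

case_sensitive = Counter(tokens)
case_folded = Counter(token.lower() for token in tokens)

print(case_sensitive["This"], case_sensitive["this"])  # 1 1
print(case_folded["this"])                             # 2
print(case_folded["This"])                             # 0 (missing keys count as zero)
```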

### Exercise 2.2
Explain (in words) how the <code>get_all_words()</code> function works.  Your description should include details about
1. the input
2. the output
3. the algorithm used to generate the output from the input

YOU NEED TO TYPE SOME WORDS HERE!

### Exercise 2.3
In the blank code cell below write code that uses the frequency lists, `pos_freqdist` and `neg_freqdist`, created in the above cell and `my_positive_word_list` and `my_negative_word_list` that you manually created earlier to determine whether or not the review data conforms to your expectations. In particular, whether:
- the words you expected to indicate positive sentiment actually occur more frequently in positive reviews than negative reviews
- the words you expected to indicate negative sentiment actually occur more frequently in negative reviews than positive reviews.

Display your findings in a table using pandas.

In [None]:
def check_expectations(a_word_list,expectation,pos_freqdist=pos_freqdist,neg_freqdist=neg_freqdist):
    #expectation is a positive number if words are expected to be positive
    #expectation is a negative number if words are expected to be negative

    for word in a_word_list:
        pos_freq=pos_freqdist.get(word,0)
        neg_freq=neg_freqdist.get(word,0)
        diff=pos_freq-neg_freq
        if diff*expectation>0:
            print("As expected: for {} difference is {}".format(word,diff))
        else:
            print("Contrary to expectations: for {} difference is {}".format(word,diff))
        

In [None]:
check_expectations(my_positive_word_list,1)

Contrary to expectations: for good difference is -41
As expected: for great difference is 165
As expected: for lovely difference is 5
As expected: for excellent difference is 51


In [None]:
check_expectations(my_negative_word_list,-1)

As expected: for bad difference is -140
As expected: for terrible difference is -34
As expected: for awful difference is -37
As expected: for dreadful difference is -2
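
The exercise asks for a table, whereas the cells above print one line per word. One possible way to get tabular output is to collect a row per word and hand the rows to pandas. The sketch below is an assumption about how that might look: `expectation_rows` and the toy counts are made up for illustration, and plain `Counter` objects stand in for the frequency distributions so that the cell runs on its own.

```python
from collections import Counter

def expectation_rows(word_list, expected, pos_counts, neg_counts):
    """Build one dictionary per word; expected is 'positive' or 'negative'."""
    rows = []
    for word in word_list:
        pos_freq = pos_counts.get(word, 0)
        neg_freq = neg_counts.get(word, 0)
        diff = pos_freq - neg_freq
        observed = "positive" if diff > 0 else "negative" if diff < 0 else "neither"
        rows.append({"word": word, "expected": expected,
                     "pos_freq": pos_freq, "neg_freq": neg_freq,
                     "difference": diff, "as_expected": observed == expected})
    return rows

# Toy counts, not taken from the corpus
toy_pos = Counter({"great": 5, "good": 3})
toy_neg = Counter({"good": 4, "bad": 6})

rows = expectation_rows(["good", "great"], "positive", toy_pos, toy_neg)
for row in rows:
    print(row)
# In the notebook, pd.DataFrame(rows) would display these rows as a table.
```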


### Exercise 2.4
Now, you are going to create positive and negative word lists automatically from the training data. In order to do this:

1. write two new functions to help with automating the process of generating wordlists.

    - `most_frequent_words` - this function should take THREE arguments: 2 frequency distributions and a natural number, k. It should order words by how much more they occur in one frequency distribution than the other. It should then return the top k highest scoring words. You might want to use the `most_common` method from the `FreqDist` class - this returns a list of word, frequency pairs ordered by frequency. You might also or alternatively want to use Python's built-in `sorted` function.
    - `words_above_threshold` - this function also takes three arguments: 2 frequency distributions and a natural number, k. Again, it should order words by how much more they occur in one distribution than the other.  It should return all of the words that have a score greater than k.

2. Remove punctuation and stopwords from consideration. You can re-use code from near the end of Lab_2_2.
3. Using the training data, create two sets of positive and negative word lists using these functions (1 set with each function). 
4.  Display these 4 lists (possibly in a `Pandas` dataframe?)



In [None]:
posdiff=pos_freqdist-neg_freqdist
posdiff

In [None]:
posdiff.get('excellent',0)

51

In [None]:
posdiff.get('good',0)

0
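
The lookup for 'good' gives 0 even though 'good' was found earlier to be 41 occurrences more frequent in the negative reviews. This is because subtraction on a `FreqDist` behaves like subtraction on `collections.Counter` (its parent class): only strictly positive results are kept. The sketch below uses made-up counts, chosen only so that the differences match the 51 and -41 seen above, and shows how to keep the sign if you need it.

```python
from collections import Counter

# Made-up counts; only the differences (51 and -41) mirror the outputs above
toy_pos = Counter({"excellent": 60, "good": 100})
toy_neg = Counter({"excellent": 9, "good": 141})

toy_diff = toy_pos - toy_neg
print(toy_diff)                 # Counter({'excellent': 51})
print(toy_diff.get("good", 0))  # 0, because the negative result was dropped

# Computing the difference by hand keeps negative values
signed = {word: toy_pos[word] - toy_neg.get(word, 0) for word in toy_pos}
print(signed["good"])           # -41
```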

In [None]:
pos_freqdist.most_common()

In [None]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

def most_frequent_words(posfreq,negfreq,topk):
    difference=[(w,f-negfreq.get(w,0)) for (w,f) in posfreq.most_common()]
    sorteddiff=sorted(difference,key=lambda pair:pair[1],reverse=True)
    normalised=[w.lower() for (w,f) in sorteddiff]
    filtered=[w for w in normalised if w.isalpha() and w not in stop]
    return filtered[:topk]

In [None]:
top_pos=most_frequent_words(pos_freqdist,neg_freqdist,50)
print(top_pos)

['great', 'well', 'also', 'best', 'film', 'love', 'still', 'first', 'family', 'many', 'wonderful', 'see', 'always', 'excellent', 'classic', 'gives', 'show', 'music', 'one', 'enjoy', 'set', 'story', 'comedy', 'must', 'lot', 'john', 'episode', 'season', 'loved', 'perfect', 'years', 'performance', 'men', 'man', 'us', 'fun', 'collection', 'young', 'hilarious', 'watch', 'shows', 'films', 'series', 'amazing', 'day', 'played', 'performances', 'enjoyed', 'old', 'one']


In [None]:
top_neg=most_frequent_words(neg_freqdist,pos_freqdist,50)
print(top_neg)

['movie', 'like', 'bad', 'would', 'could', 'worst', 'better', 'even', 'nothing', 'acting', 'boring', 'waste', 'book', 'money', 'much', 'character', 'movies', 'make', 'plot', 'minutes', 'instead', 'horrible', 'want', 'people', 'stupid', 'something', 'going', 'good', 'scenes', 'quality', 'script', 'awful', 'erin', 'supposed', 'worse', 'problem', 'either', 'terrible', 'get', 'think', 'version', 'read', 'actors', 'dialogue', 'way', 'say', 'least', 'away', 'felt', 'ridiculous']


In [None]:
def above_threshold(posfreq,negfreq,threshold):
  difference=[(w.lower(),f-negfreq.get(w,0)) for (w,f) in posfreq.most_common()]
  sorteddiff=sorted(difference,key=lambda pair:pair[1],reverse=True)
  filtered=[w for (w,f) in sorteddiff if w.isalpha() and w not in stop and f>threshold]
  return filtered

In [None]:
above20pos = above_threshold(pos_freqdist,neg_freqdist,20)
print(above20pos)

['great', 'well', 'also', 'best', 'love', 'one', 'family', 'still', 'first', 'many', 'episode', 'show', 'wonderful', 'always', 'season', 'excellent', 'gives', 'music', 'perfect', 'fun', 'man', 'enjoy', 'classic', 'john', 'loved', 'watch', 'collection', 'enjoyed', 'hilarious', 'film', 'times', 'lot', 'one', 'workout', 'years', 'old', 'episodes', 'men', 'frank', 'especially', 'life', 'young', 'true', 'anyone', 'favorite', 'amazing', 'musical', 'best', 'never', 'makes', 'must', 'vs', 'little', 'comedy', 'job', 'day', 'role', 'war', 'body', 'includes', 'adrian', 'series', 'cast', 'shows', 'wife', 'plays', 'son', 'woman', 'takes', 'documentary', 'great', 'named', 'bergman', 'russell', 'lily', 'different', 'along', 'fine', 'view', 'course', 'friend', 'features', 'jack', 'city', 'see', 'work', 'set', 'though', 'performances', 'able', 'including', 'finds', 'terrific', 'gehry']


In [None]:
above100neg = above_threshold(neg_freqdist,pos_freqdist,100)
print(above100neg)

['movie', 'like', 'bad', 'would', 'could']


## Creating a word list based classifier
Now you have a number of word lists for use with a classifier. 
> Make sure you understand the following code, which will be used as the basis for creating a word list based classifier.

In [None]:
from nltk.classify.api import ClassifierI
import random

class SimpleClassifier(ClassifierI): 

    def __init__(self, pos, neg): 
        self._pos = pos 
        self._neg = neg 

    def classify(self, words): 
        score = 0
        
        # add code here that assigns an appropriate value to score
        return "N" if score < 0 else "P"

    def batch_classify(self, docs): 
        return [self.classify(doc.words() if hasattr(doc, 'words') else doc) for doc in docs] 

    def labels(self): 
        return ("P", "N")

#Example usage:

classifier = SimpleClassifier(top_pos, top_neg)
classifier.classify("I read the book".split())

### Exercise 3.1

- Copy the above code cell and move it to below this one. Then complete the `classify` method in the above code as specified below.
- Test your classifier on several very simple hand-crafted examples to verify that you have implemented `classify` correctly.

The classifier is initialised with a list of positive words, and a list of negative words. The words of a document are passed to the `classify` method (which is partially completed in the above code fragment). The `classify` method should be defined so that each occurrence of a negative word decrements `score`, and each occurrence of a positive word increments `score`. 
- For `score` less than 0, an "`N`" for negative should be returned.
- For `score` greater than 0, "`P`" for positive should be returned.
- For `score` of 0, the classification decision should be made randomly (see https://docs.python.org/3/library/random.html).


In [None]:

from nltk.classify.api import ClassifierI
import random

class SimpleClassifier(ClassifierI): 

    def __init__(self, pos, neg): 
        self._pos = pos 
        self._neg = neg 

    def classify(self, words): 
        score = 0
        
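        # each occurrence of a positive word increments score,
        # each occurrence of a negative word decrements it
        for word in words:
            if word in self._pos:
                score+=1
            if word in self._neg:
                score-=1
        
        if score < 0:
            return "N"
        if score > 0:
            return "P"
        return random.choice(self.labels()) # score of 0: decide randomly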

    def batch_classify(self, docs): 
        return [self.classify(doc.words() if hasattr(doc, 'words') else doc) for doc in docs] 

    def labels(self): 
        return ("P", "N")

#Example usage:

classifier = SimpleClassifier(top_pos, top_neg)
classifier.classify("I hated this awful movie".split())

'N'

### Exercise 3.2
* Extend your SimpleClassifier class so that it has a `train` function which will derive the wordlists from training data.  You could build a separate class for each way of automatically deriving wordlists (which both inherit from SimpleClassifier) OR a single class which takes an extra parameter at training time.

In [None]:
class SimpleClassifier_mf(SimpleClassifier):
    
    def __init__(self,k):
        super().__init__([],[]) # word lists stay empty until train() is called
        self._k=k
    
    def train(self,pos_train,neg_train):
        pos_freqdist = FreqDist(get_all_words(pos_train))
        neg_freqdist = FreqDist(get_all_words(neg_train))
        self._pos=most_frequent_words(pos_freqdist,neg_freqdist,self._k)
        self._neg=most_frequent_words(neg_freqdist,pos_freqdist,self._k)
    
    

In [None]:
dvdclassifier=SimpleClassifier_mf(100)

In [None]:
dvdclassifier.train(dvd_pos_train,dvd_neg_train)

Try out your classifier on the test data. We will look at how to evaluate classifiers next week, but in an ideal world, most of the positive test items will have been classified as 'P' and most of the negative test items will have been classified as 'N'.
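
Scanning a long list of labels by eye is error-prone, so it can help to summarise the predictions. The sketch below is not a proper evaluation (that comes next week). The helper `label_proportions` and the toy predictions are made up for illustration, and the helper assumes a non-empty list of "P"/"N" labels such as the one returned by `batch_classify`.

```python
from collections import Counter

def label_proportions(predictions):
    """Return the fraction of 'P' and 'N' labels in a non-empty list of predictions."""
    counts = Counter(predictions)
    total = len(predictions)
    return {label: counts[label] / total for label in ("P", "N")}

toy_predictions = ["N", "N", "P", "N"]  # made-up labels, not classifier output
print(label_proportions(toy_predictions))  # {'P': 0.25, 'N': 0.75}
```

In the notebook, something like `label_proportions(dvdclassifier.batch_classify(dvd_neg_test))` would show what fraction of the negative test reviews were labelled 'N'.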

In [None]:
dvdclassifier.classify("I hated this movie".split())

'N'

In [None]:
dvdclassifier.batch_classify(dvd_pos_test)

In [None]:
dvdclassifier.batch_classify(dvd_neg_test)

['N',
 'N',
 'N',
 'N',
 'P',
 'N',
 'N',
 'P',
 'N',
 'N',
 'N',
 'N',
 'P',
 'N',
 'P',
 'P',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'P',
 'N',
 'N',
 'P',
 'N',
 'P',
 'P',
 'N',
 'N',
 'P',
 'N',
 'P',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'P',
 'P',
 'N',
 'P',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'P',
 'N',
 'N',
 'N',
 'N',
 'P',
 'N',
 'N',
 'N',
 'P',
 'N',
 'N',
 'N',
 'P',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'P',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'P',
 'P',
 'N',
 'N',
 'N',
 'P',
 'N',
 'N',
 'P',
 'N',
 'N',
 'N',
 'P',
 'P',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'P',
 'N',
 'P',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'P',
 'N',
 'P',
 'N',
 'P',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'P',
 'P',
 'N',
 'N',
 'N',
 'P',
 'N',
 'P',
 'P',
 'P',
 'P',
 'N',
 'N',
 'P',
 'N',
 'N',
 'P',
 'N',
 'N',
 'N',
 'N',
 'P',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'P',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N',
 'N'