<p style="text-align:center">
PSY 394U <b>Data Analytics with Python</b>, Spring 2018


<img style="width: 400px; padding: 0px;" src="https://github.com/sathayas/JupyterAnalyticsSpring2018/blob/master/images/Title_pics.png?raw=true" alt="title pics"/>

</p>

<p style="text-align:center; font-size:40px; margin-bottom: 30px;"><b> Text classification </b></p>

<p style="text-align:center; font-size:18px; margin-bottom: 32px;"><b>April 24 - 26, 2018</b></p>

<hr style="height:5px;border:none" />

Text data can be analyzed by classification and clustering algorithms. This is done by extracting features from a text corpus, then performing classification (using the extracted features together with the target category labels) or clustering (using the extracted features alone). Here, we cover a few simple examples of text classification and clustering. 

# 1. String classification
<hr style="height:1px;border:none" />

The goal here is to classify strings into categories. To do so, we will use the **`names`** corpus, a collection of female and male names in NLTK, and construct a classifier to determine whether a name is female or male.

`<NameClassifier.py>`

In [1]:
import nltk
import random

# reading names from the names corpus
from nltk.corpus import names
femaleNames = names.words('female.txt')
maleNames = names.words('male.txt')

Here, we are reading in names of females and males. 

In [2]:
# creating name-label pairs, then shuffling
nameData = []
for iName in femaleNames:
    nameData.append((iName, 'female'))
for iName in maleNames:
    nameData.append((iName, 'male'))
random.shuffle(nameData)

Once both lists are read, we pair each name with its category ('female' or 'male') as a tuple. Then we shuffle the data so that male and female names are mixed.

To extract a feature from the shuffled data, we will use a custom function called **`gender_feature`**. This function takes a string, then returns a feature dictionary (in this example, containing the last letter).

In [3]:
# a function to return a feature to classify whether a name is
# male or female.
# Only the feature dictionary is returned; the label is paired with it later
def gender_feature(name):
    featureDict = {'last-letter': name[-1]}
    return featureDict

Then we convert the list of names (**`nameData`**) into a list of features (**`last-letter`**). Again, we shall keep both the feature dictionary and the label in **`featureData`**.

In [4]:
# converting the name data into feature (i.e., just the last letter)
# as well as the label (female / male)
featureData = [(gender_feature(n), gender) for (n, gender) in nameData]

At this point, we are generating training and testing data sets, with the testing data set comprising 1000 observations.

In [5]:
# splitting into training and testing data sets
trainData, testData = featureData[1000:], featureData[:1000]
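
Because `random.shuffle` reorders the data differently on every run, the split, and hence the accuracy figures reported in this section, will vary from run to run. Below is a minimal sketch of a reproducible shuffle and split, assuming a fixed seed is acceptable. The toy list, the seed value, and the variable names are hypothetical and only stand in for `nameData`.

```python
import random

# Hypothetical toy data standing in for nameData
toy_data = [('Ann', 'female'), ('Bob', 'male'), ('Cara', 'female'),
            ('Dan', 'male'), ('Eve', 'female'), ('Finn', 'male'),
            ('Gina', 'female'), ('Hank', 'male'), ('Iris', 'female'),
            ('Jack', 'male')]

rng = random.Random(2018)   # private generator with a fixed, arbitrary seed
shuffled = toy_data[:]      # copy, so the original order is left untouched
rng.shuffle(shuffled)

# first n_test items for testing, the rest for training
n_test = 3
test_part, train_part = shuffled[:n_test], shuffled[n_test:]
print(len(train_part), len(test_part))
```

Re-running this block gives the same split every time, which makes it easier to compare classifiers trained on different features.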

Then we train a classifier. Here, we use a naive Bayes algorithm. A naive Bayes classifier assigns each observation to the most probable label given its observed feature(s), using Bayes' theorem and the "naive" assumption that the features are independent of one another given the label. Naive Bayes classifiers are widely used in text classification, such as spam detection and sentiment analysis. The naive Bayes classifier is available in NLTK as **`NaiveBayesClassifier`**. Its interface is somewhat different from that of classifiers in **Scikit-learn** (a.k.a., `sklearn`): we supply the features and the category labels together, as a list of (feature dictionary, label) pairs. 

In [6]:
# training a classifier (Naive Bayes)
clf = nltk.NaiveBayesClassifier.train(trainData)
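
To make the idea concrete, here is a minimal hand-rolled sketch of the naive Bayes calculation for a single feature, on hypothetical toy data. It uses add-one smoothing for simplicity, which is an assumption of this sketch and not necessarily the estimator NLTK uses internally. The names `posterior` and `classify` are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: (last letter, label) pairs
train = [('a', 'female'), ('a', 'female'), ('e', 'female'), ('y', 'female'),
         ('k', 'male'), ('o', 'male'), ('n', 'male'), ('e', 'male')]

label_counts = Counter(label for _, label in train)
letter_counts = defaultdict(Counter)
for letter, label in train:
    letter_counts[label][letter] += 1
vocab = {letter for letter, _ in train}

def posterior(letter):
    """P(label | letter) via Bayes' theorem, with add-one smoothing."""
    scores = {}
    for label, n_label in label_counts.items():
        prior = n_label / len(train)                       # P(label)
        likelihood = ((letter_counts[label][letter] + 1)
                      / (n_label + len(vocab)))            # P(letter | label)
        scores[label] = prior * likelihood
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

def classify(letter):
    """Pick the label with the highest posterior probability."""
    probs = posterior(letter)
    return max(probs, key=probs.get)

print(posterior('a'))
print(classify('a'), classify('k'))
```

With several features, the same calculation multiplies one likelihood term per feature, which is where the independence assumption comes in.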

Now the classifier has been trained. Let's see how well it works.

In [8]:
# classification example
print(clf.classify(gender_feature('Nemo')))

male


In [9]:
print(clf.classify(gender_feature('Dory')))

female


Now, we shall see how the classifier performs on our testing data.

In [10]:
# classifier performance on the testing data
print(nltk.classify.accuracy(clf, testData))

0.765
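
One way to put an accuracy figure in context is to compare it with a majority-class baseline, that is, the accuracy obtained by always predicting the most common label in the training data. The sketch below shows the calculation on hypothetical label lists; in the notebook, the labels would come from `trainData` and `testData`.

```python
from collections import Counter

# Hypothetical label lists (not the real corpus proportions)
train_labels = ['female'] * 6 + ['male'] * 4
test_labels = ['female', 'male', 'female', 'female', 'male']

# the most common label in the training data
majority_label = Counter(train_labels).most_common(1)[0][0]

# accuracy of always predicting that label on the testing data
n_hits = sum(1 for y in test_labels if y == majority_label)
baseline_accuracy = n_hits / len(test_labels)
print(majority_label, baseline_accuracy)
```

A classifier is only useful to the extent that it beats this baseline.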


This is reasonably good accuracy for a classifier that uses a single letter as its only feature. The naive Bayes classifier in NLTK also provides a convenient method, **`show_most_informative_features`**, to examine the most informative features.

In [12]:
# most informative features
clf.show_most_informative_features(10)

Most Informative Features
             last-letter = 'k'              male : female =     40.5 : 1.0
             last-letter = 'a'            female : male   =     35.8 : 1.0
             last-letter = 'f'              male : female =     14.6 : 1.0
             last-letter = 'p'              male : female =     11.2 : 1.0
             last-letter = 'd'              male : female =     10.1 : 1.0
             last-letter = 'v'              male : female =      9.2 : 1.0
             last-letter = 'o'              male : female =      8.5 : 1.0
             last-letter = 'm'              male : female =      8.1 : 1.0
             last-letter = 'u'              male : female =      7.8 : 1.0
             last-letter = 'g'              male : female =      6.5 : 1.0


So, it seems like a name ending with "a" is likely classified as a female name.
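
Each ratio in the listing compares how likely a feature value is under one label versus the other. Below is a rough sketch of that calculation, using hypothetical counts and a smoothing constant of 0.5 chosen for illustration (it is not taken from the NLTK source). The function names are made up.

```python
from collections import Counter

# Hypothetical counts of last letters for each label
female_letters = Counter({'a': 30, 'e': 25, 'k': 1, 'n': 4})
male_letters = Counter({'a': 2, 'e': 8, 'k': 12, 'n': 18})
n_bins = len(set(female_letters) | set(male_letters))

def likelihood(counts, letter, smoothing=0.5):
    """Smoothed estimate of P(last letter | label)."""
    return (counts[letter] + smoothing) / (sum(counts.values()) + smoothing * n_bins)

def ratio(letter):
    """Return (more likely label, less likely label, likelihood ratio)."""
    p_f = likelihood(female_letters, letter)
    p_m = likelihood(male_letters, letter)
    if p_f >= p_m:
        return ('female', 'male', p_f / p_m)
    return ('male', 'female', p_m / p_f)

for letter in ['a', 'k']:
    high, low, r = ratio(letter)
    print("last-letter = %r  %s : %s = %.1f : 1.0" % (letter, high, low, r))
```

A large ratio means that seeing the feature value shifts the classifier strongly toward one label.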

How can we improve the performance of the classifier? Let's examine misclassified cases and see if there is any pattern.

In [None]:
# examining classification errors
errorData = []
testDataFull = nameData[:1000]
for iData in testDataFull:
    trueCat = iData[1]
    predCat = clf.classify(gender_feature(iData[0]))
    if predCat != trueCat:
        errorData.append((trueCat, predCat, iData[0]))


# printing out the errors
for (y_true, y_pred, name) in sorted(errorData):
    print('Truth: %-6s\t' % y_true, end='')
    print('Pred: %-6s\t' % y_pred, end='')
    print('Name: %-12s' % name)

(I will not print out all the misclassified cases)

Notice that there are some patterns. For example,
```
...
Truth: female	Pred: male  	Name: Catlin      
...
Truth: female	Pred: male  	Name: Christin    
...
Truth: female	Pred: male  	Name: Dyann       
Truth: female	Pred: male  	Name: Emlynn      
...
Truth: female	Pred: male  	Name: Jacquelin   
...
Truth: female	Pred: male  	Name: Joann       
Truth: female	Pred: male  	Name: Joycelin    
...
Truth: female	Pred: male  	Name: Kerstin     
...
```
Notice that some misclassified female names end with "in" and "nn." Moreover,
```
...
Truth: male  	Pred: female	Name: Artie       
...
Truth: male  	Pred: female	Name: Benjie      
...
Truth: male  	Pred: female	Name: Bobbie      
...
Truth: male  	Pred: female	Name: Eddie       
...
Truth: male  	Pred: female	Name: Ricky       
Truth: male  	Pred: female	Name: Rocky       
...
Truth: male  	Pred: female	Name: Sparky      
...
Truth: male  	Pred: female	Name: Tucky       
...
```
Some misclassified male names end with "ie" and "ky". So, in addition to the last letter, we can use the last two letters as another feature to improve the classification performance.
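
Instead of eyeballing the listing, the same pattern can be found by counting two-letter suffixes among the misclassified names. Below is a small sketch using only the names shown in the excerpts above. The variable names `sampleErrors` and `suffixCounts` are hypothetical.

```python
from collections import Counter

# (truth, prediction, name) triples copied from the excerpts above
sampleErrors = [
    ('female', 'male', 'Catlin'), ('female', 'male', 'Christin'),
    ('female', 'male', 'Dyann'), ('female', 'male', 'Emlynn'),
    ('female', 'male', 'Jacquelin'), ('female', 'male', 'Joann'),
    ('female', 'male', 'Joycelin'), ('female', 'male', 'Kerstin'),
    ('male', 'female', 'Artie'), ('male', 'female', 'Benjie'),
    ('male', 'female', 'Bobbie'), ('male', 'female', 'Eddie'),
    ('male', 'female', 'Ricky'), ('male', 'female', 'Rocky'),
    ('male', 'female', 'Sparky'), ('male', 'female', 'Tucky'),
]

# tally the last two letters of each misclassified name, by true label
suffixCounts = Counter((truth, name[-2:].lower())
                       for (truth, pred, name) in sampleErrors)

for (truth, suffix), n in suffixCounts.most_common():
    print('Truth: %-6s\tSuffix: %s\tCount: %d' % (truth, suffix, n))
```

Applied to the full `errorData` list, a tally like this shows which suffixes account for most of the errors.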

`<NameClassifierRevised.py>`

In [14]:
# a function to return a feature to classify whether a name is
# male or female.
# Only the feature dictionary is returned; the label is paired with it later
def gender_feature(name):
    featureDict = {'last-letter': name[-1], 'last2': name[-2:]}
    return featureDict

Now there are two features in the feature dictionary: **`last-letter`** and **`last2`**. Let's use these features and re-classify the name data.

In [15]:
# converting the name data into features 
# as well as the label (female / male)
featureData = [(gender_feature(n), gender) for (n, gender) in nameData]

# splitting into training and testing data sets
trainData, testData = featureData[1000:], featureData[:1000]

# training a classifier (Naive Bayes)
clf = nltk.NaiveBayesClassifier.train(trainData)

# classifier performance on the testing data
print(nltk.classify.accuracy(clf, testData))

0.792


As you can see, there is some improvement in the performance. Here are the most informative features.

In [16]:
# most informative features
clf.show_most_informative_features(15)

Most Informative Features
                   last2 = 'na'           female : male   =     96.1 : 1.0
                   last2 = 'la'           female : male   =     67.0 : 1.0
             last-letter = 'k'              male : female =     40.5 : 1.0
                   last2 = 'ia'           female : male   =     36.9 : 1.0
             last-letter = 'a'            female : male   =     35.8 : 1.0
                   last2 = 'ra'           female : male   =     33.2 : 1.0
                   last2 = 'ld'             male : female =     32.8 : 1.0
                   last2 = 'sa'           female : male   =     32.6 : 1.0
                   last2 = 'ta'           female : male   =     30.2 : 1.0
                   last2 = 'rt'             male : female =     28.3 : 1.0
                   last2 = 'us'             male : female =     26.0 : 1.0
                   last2 = 'rd'             male : female =     25.0 : 1.0
                   last2 = 'io'             male : female =     23.9 : 1.0

As you can see, the majority of informative features are the last two letters.

### Exercise
1. **Last 3 letters**. Revise the code above by adding an additional feature, the last 3 letters. Then re-run the classifier. Print out the 15 most informative features.

# 2. Document classification
<hr style="height:1px;border:none" />

## Document classification with NLTK
In addition to classification of string data, you can also classify documents using NLTK. Such document classification is often done by using a **bag-of-words** approach. A bag-of-words is a list of words used in a document. Here, we ignore word order or any grammatical structure. We solely focus on which words are used in a document, and with a sufficient number of documents in a corpus, we can build a classifier to categorize a document. 
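
As a small illustration of what a bag-of-words keeps and discards, the sketch below builds one from plain whitespace-separated text. The helper `bag_of_words` and the two sentences are hypothetical.

```python
def bag_of_words(text):
    """Return the set of lower-cased words in a text, ignoring order and counts."""
    return {w.lower() for w in text.split()}

# two hypothetical reviews with the same words in a different order
doc_a = "the plot was good but the acting was bad"
doc_b = "the acting was good but the plot was bad"

print(sorted(bag_of_words(doc_a)))
print(bag_of_words(doc_a) == bag_of_words(doc_b))
```

The two sentences say different things, yet they produce the same bag-of-words. This is the price of ignoring word order, and the approach still works well when enough documents are available.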

In this example, we will use the **`movie_reviews`** corpus from NLTK, one of the example corpora we have been using. This type of classification can be used in a **sentiment analysis**, to infer the sentiment of the writer (positive or negative, in this particular case).

`<DocClassify.py>`

In [17]:
import nltk
import random
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import LinearSVC

# creating a list of document-label pairs
from nltk.corpus import movie_reviews as mr
reviewList = []
for iCat in mr.categories():  # first, going over categories (pos or neg)
    for iReview in mr.fileids([iCat]):   # reviews in that category
        reviewPair = (mr.words(iReview), iCat)
        reviewList.append(reviewPair)

First, we read the documents in the corpus, and create a list of document-label pairs. 

In [18]:
# shuffling, and separating into testing and training data sets
random.shuffle(reviewList)
trainList, testList = reviewList[500:], reviewList[:500]

Since the reviews are grouped by category (pos or neg), we shuffle them first, then split them into the testing (500 reviews) and training (all the rest) data sets. Next, we create a list of the 2000 most frequently appearing words in the training data.

In [19]:
# creating a list of all words in the training data set
allWords = []
for iReviewPair in trainList:
    reviewWords = [w.lower() for w in iReviewPair[0]]
    # Just in case someone writes a review IN ALL CAPS
    allWords += reviewWords


# word frequency, and just consider 2000 most frequent words
allWordFreq = nltk.FreqDist(allWords)
featureWords = [w for (w,c) in allWordFreq.most_common(2000)]

Here, the most frequently used words are punctuation marks and stop words. But since the feature word list is long (2000 words), we do not worry about those. At this point, we define a function to extract a bag-of-words from each document.

In [21]:
# Document features (whether contains certain words)
def document_features(document): 
    document_words = set(w.lower() for w in document)  # lower case, to match featureWords
    features = {}
    for w in featureWords:
        features['contains({})'.format(w)] = (w in document_words)
    return features

This function returns a dictionary of features, indicating whether each of the 2000 feature words is contained in the document. We use this function to extract features from each document in the training and testing data sets.

In [22]:
# extracting features for training and testing data
trainSet = [(document_features(d), c) for (d,c) in trainList]
testSet = [(document_features(d), c) for (d,c) in testList]

Each entry in the training and testing data set is a dictionary of features, i.e., whether a certain word is contained in the document, as well as the target class ('pos' or 'neg').
```
({'contains(surprise)': False,
  'contains(version)': False,
  'contains(lady)': False,
  'contains(constantly)': False,
  'contains(minute)': True,
  'contains(sheer)': False,
  'contains(memorable)': False,
  'contains(hospital)': False,
  'contains(himself)': False,
  ...},
 'pos')
```

Now, we are ready to run the naive Bayes classifier. *Please note that it may take a few minutes to run the classifier*.

In [25]:
# classifier
clf = nltk.NaiveBayesClassifier.train(trainSet)
print(nltk.classify.accuracy(clf, testSet)) 

0.828


As you can see, the classifier did a fairly good job in classifying the review sentiment. The most informative features are:

In [26]:
# most informative features
clf.show_most_informative_features(15)

Most Informative Features
        contains(seagal) = True              neg : pos    =     11.0 : 1.0
         contains(mulan) = True              pos : neg    =      8.3 : 1.0
   contains(wonderfully) = True              pos : neg    =      8.3 : 1.0
         contains(awful) = True              neg : pos    =      6.0 : 1.0
          contains(lame) = True              neg : pos    =      5.3 : 1.0
        contains(wasted) = True              neg : pos    =      5.1 : 1.0
          contains(jedi) = True              pos : neg    =      5.0 : 1.0
         contains(waste) = True              neg : pos    =      4.9 : 1.0
       contains(freedom) = True              pos : neg    =      4.7 : 1.0
         contains(worst) = True              neg : pos    =      4.5 : 1.0
    contains(ridiculous) = True              neg : pos    =      4.5 : 1.0
        contains(poorly) = True              neg : pos    =      4.2 : 1.0
        contains(superb) = True              pos : neg    =      4.2 : 1.0

## Using a classifier from `sklearn` in NLTK

Besides the naive Bayes classifier, you can also use a classifier available in `sklearn` from within NLTK. This is done by a *wrapper* available in NLTK. A wrapper is an object that adapts the interface of one library so that it can be used through the interface of another. So, let's say we want to use a support vector machine (SVM) from `sklearn`. In that case, we need to import **`SklearnClassifier`** from **`nltk.classify.scikitlearn`**, a wrapper utility enabling the use of an `sklearn` classifier in NLTK. In addition, we also need to import the SVM classifier from `sklearn`. We will use **`LinearSVC`**, an SVM with a linear kernel. 
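
To illustrate the wrapper idea in general terms, the sketch below adapts a `fit`/`predict` style model so that it can be trained and queried with (feature dictionary, label) pairs. Both classes are hypothetical and written only for illustration. They are not how `SklearnClassifier` is implemented.

```python
class NearestRowModel:
    """Hypothetical fit/predict model: returns the label of the closest training row."""
    def fit(self, X, y):
        self.X_, self.y_ = X, y
        return self

    def predict(self, X):
        def distance(a, b):
            # number of positions at which two rows differ
            return sum(ai != bi for ai, bi in zip(a, b))
        return [self.y_[min(range(len(self.X_)),
                            key=lambda i: distance(row, self.X_[i]))]
                for row in X]


class DictWrapper:
    """Hypothetical wrapper: exposes train/classify on feature dictionaries."""
    def __init__(self, model):
        self._model = model
        self._keys = []

    def _to_row(self, features):
        # turn a feature dictionary into a fixed-order list of 0s and 1s
        return [int(bool(features.get(k, False))) for k in self._keys]

    def train(self, labeled_featuresets):
        self._keys = sorted({k for feats, _ in labeled_featuresets for k in feats})
        X = [self._to_row(feats) for feats, _ in labeled_featuresets]
        y = [label for _, label in labeled_featuresets]
        self._model.fit(X, y)
        return self

    def classify(self, features):
        return self._model.predict([self._to_row(features)])[0]


toy_set = [({'contains(awful)': True}, 'neg'),
           ({'contains(lame)': True}, 'neg'),
           ({'contains(superb)': True}, 'pos')]
wrapped = DictWrapper(NearestRowModel()).train(toy_set)
print(wrapped.classify({'contains(superb)': True}))
```

The wrapped model never sees a dictionary. The wrapper does the conversion in both directions, which is the general role a wrapper plays between two libraries.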

In [27]:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import LinearSVC

Now we are ready to use a linear SVM for our movie review data.

In [28]:
# Now with SVM classifier (linear kernel)
clf_svm = SklearnClassifier(LinearSVC())
clf_svm.train(trainSet)
print(nltk.classify.accuracy(clf_svm, testSet)) 

0.782


Note that, since this is a classifier from `sklearn`, not from NLTK, it does not let you print out the most informative features.

### Exercise
1. **Reviewer sentiment, same or different?** The program **`SentimentClassify.py`** (available on GitHub) classifies user reviews on Amazon.com as positive (1) or negative (0), using the naive Bayes classifier in NLTK. You want to see if the classifier trained on Amazon reviews works on reviews from other platforms. In the directory **`SentimentReviews`** (available on GitHub) are labeled reviews from IMDB (**`imdb_labelled.txt`**) and Yelp (**`yelp_labelled.txt`**). Choose either one of the review corpora, and classify the reviews using the classifier from `SentimentClassify.py` based on Amazon reviews. Calculate the classification accuracy.
2. **Sanity check**. Modify the program **`SentimentClassify.py`** so that it can construct a classifier based on the review corpus you chose for the earlier exercise. Calculate the classification accuracy, and print the 15 most informative features.

## Document classification with `sklearn`

NLTK is great for processing text data. However, its classification functionality is not as extensive as that of libraries specialized in machine learning. Luckily, we can perform document classification using **`sklearn`**. The classification tools available in `sklearn` have been optimized and run much faster than those in NLTK. We will revisit the sentiment analysis of movie reviews from earlier, this time using `sklearn`.

`<DocClassifyFreq.py>`

In [29]:
import nltk
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# creating a list of document-label pairs
from nltk.corpus import movie_reviews as mr
reviewList = []
for iCat in mr.categories():  # first, going over categories (pos or neg)
    for iReview in mr.fileids([iCat]):   # reviews in that category
        reviewPair = (mr.raw(iReview), iCat)
        reviewList.append(reviewPair)

# splitting into training and testing data
X = [d for (d, c) in reviewList]
Y = [c for (d, c) in reviewList]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    test_size=500,
                                                    random_state=0)

Here, we load the data, and split it into the training and testing data using `sklearn`'s **`train_test_split`** function. 

Next, we generate a bag-of-words for each review. To do so, we use the **`CountVectorizer`** transformation object available in **`sklearn.feature_extraction.text`**. `CountVectorizer` tokenizes a document into words, changes them to lower case, and drops punctuation, all in one step. It can also remove stop words, but only if you ask for it through the **`stop_words`** parameter; with the default settings used here, stop words are kept. 
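
The behavior of `CountVectorizer` is easiest to see on a tiny example. The sketch below uses two hypothetical sentences, once with the default settings and once with `stop_words='english'` passed explicitly to request stop-word filtering. The variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# two hypothetical mini-documents
docs = ["The plot was GOOD, and the acting was good!",
        "The plot was bad."]

# default settings: lower-casing and tokenization
vect_all = CountVectorizer()
counts_all = vect_all.fit_transform(docs)

# same, but with English stop-word filtering explicitly requested
vect_stop = CountVectorizer(stop_words='english')
vect_stop.fit(docs)

print(sorted(vect_all.vocabulary_))    # words found with default settings
print(sorted(vect_stop.vocabulary_))   # words left after stop-word filtering
print(counts_all.toarray())            # rows: documents, columns: words
```

The `vocabulary_` attribute maps each word to its column index in the count matrix, so it can be used to look up the count of a particular word in a particular document.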

In [30]:
# word occurrence counts
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)

The transformation object **`count_vect`** in this case returns word frequency counts in all documents in the training data `X_train`. You can examine the list of all words in the training data corpus by

In [31]:
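# List of words
# (get_feature_names_out() in scikit-learn 1.0 and later;
#  older versions use get_feature_names() instead)
#count_vect.get_feature_names_out()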

**`X_train_counts`** is a matrix of word occurrence counts, with rows corresponding to documents in the training data, and columns corresponding to the unique words appearing in the training data corpus.

In [32]:
X_train_counts.shape

(1500, 35321)

Note that the majority of the elements in `X_train_counts` are zeros, so it is stored as a *sparse matrix* rather than a dense 2D array. You can still examine the non-zero elements of `X_train_counts` by

In [33]:
# indices for non-zero elements in the sparse matrix
X_train_counts.nonzero()

(array([   0,    0,    0, ..., 1499, 1499, 1499], dtype=int32),
 array([ 8406, 14937,  1835, ..., 12551,  9283, 34287], dtype=int32))

Next, we want to convert the word occurrence counts to word frequencies. This is because raw counts depend heavily on the length of a document. For example, the word "absolutely" may appear only 3 times in a 2-page essay, but a 500-page novel may contain 232 uses of "absolutely." The **`TfidfTransformer`** transformer object in **`sklearn.feature_extraction.text`** rescales the word counts of each document (by default, to unit length) so that they behave like length-adjusted word frequencies (known as **term frequency**, or **tf**). In addition, it down-weights words that appear in many documents of the corpus, because such common words add very little information for classification. This down-weighting uses the **inverse document frequency**, or **idf**. With its default settings, `TfidfTransformer` converts count data to **tf-idf** data.
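
The arithmetic behind tf-idf can be worked through by hand on a tiny example. The sketch below uses hypothetical counts and assumes the smoothed idf formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by scaling each document to unit length, which is how `sklearn` documents its default settings. It is an illustration of the calculation, not a replacement for `TfidfTransformer`.

```python
import math

# Hypothetical word counts for three tiny documents
counts = [{'good': 2, 'movie': 1},
          {'bad': 1, 'movie': 1},
          {'movie': 3}]

n_docs = len(counts)
vocab = sorted({w for doc in counts for w in doc})

# document frequency: number of documents containing each word
df = {w: sum(1 for doc in counts if w in doc) for w in vocab}

# smoothed inverse document frequency (assumed formula, see above)
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf(doc):
    """Multiply counts by idf, then scale the document to unit length."""
    raw = {w: doc.get(w, 0) * idf[w] for w in vocab}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

weights = [tfidf(doc) for doc in counts]
for w in vocab:
    print('%-6s df=%d idf=%.3f' % (w, df[w], idf[w]))
print(weights[0])
```

The word "movie" appears in every document, so it receives the smallest idf and ends up with less weight than "good" in the first document, even after accounting for the counts.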

In [35]:
# converting counts to tf-idf (idf weighting is on by default)
tf_transformer = TfidfTransformer().fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

We can use **`X_train_tf`** as the features in a classifier. Here, we use a naive Bayes classifier again, this time the **`MultinomialNB`** classifier object, a multinomial naive Bayes classifier available in **`sklearn.naive_bayes`**. 

In [36]:
# classifier (naive Bayes)
clf_nb = MultinomialNB().fit(X_train_tf, Y_train)

Now that the classifier has been trained, we can classify the testing data. The testing data first need to be converted to count data, then to frequency data. For this, we have to use **`count_vect`** and **`tf_transformer`**, respectively, which were already fitted to our training data, so that the testing features line up with the training features. 

In [37]:
# converting the testing set to term frequency
X_test_counts = count_vect.transform(X_test)  # transform only; do not re-fit
X_test_tf = tf_transformer.transform(X_test_counts)  # transform only; do not re-fit

Finally, we classify **`X_test_tf`**. 

In [38]:
# classifying the testing data
Y_pred_nb = clf_nb.predict(X_test_tf)
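
The fit-on-training, transform-on-testing sequence can also be bundled into a single object with `sklearn`'s `Pipeline`, which applies each step in order. Below is a minimal sketch on hypothetical toy reviews. The step names (`'counts'`, `'tfidf'`, `'nb'`) and the variable names are arbitrary choices for this example.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data
train_docs = ["a wonderful and superb film",
              "superb acting, wonderful story",
              "an awful waste of time",
              "lame plot and awful acting"]
train_labels = ['pos', 'pos', 'neg', 'neg']

# counts -> tf-idf -> naive Bayes, chained into one estimator
text_clf = Pipeline([('counts', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('nb', MultinomialNB())])

# fit() fits every step on the training documents
text_clf.fit(train_docs, train_labels)

# predict() only transforms new documents with the fitted steps
pred = text_clf.predict(["wonderful film", "awful waste"])
print(list(pred))
```

Because the pipeline remembers the fitted vectorizer and transformer, there is no risk of accidentally re-fitting them on the testing data.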

We can examine the classifier performance.

In [39]:
# accuracy
print('Accuracy - Naive Bayes: %6.4f' % accuracy_score(Y_test,Y_pred_nb))
print(confusion_matrix(Y_test,Y_pred_nb))
print(classification_report(Y_test,Y_pred_nb))

Accuracy - Naive Bayes: 0.8260
[[219  35]
 [ 52 194]]
             precision    recall  f1-score   support

        neg       0.81      0.86      0.83       254
        pos       0.85      0.79      0.82       246

avg / total       0.83      0.83      0.83       500
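
The numbers in the report can be recovered from the confusion matrix by hand. In `sklearn`'s confusion matrix, rows are the true labels and columns are the predicted labels, here in the order (neg, pos). A short sketch of the arithmetic, using the matrix printed above (the variable names are made up):

```python
# Confusion matrix from the output above:
# rows are true labels, columns are predicted labels, in the order (neg, pos)
cm = [[219, 35],
      [52, 194]]
labels = ['neg', 'pos']

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total    # correct predictions / all predictions

report = {}
for i, label in enumerate(labels):
    predicted_as_label = sum(cm[r][i] for r in range(len(labels)))  # column sum
    truly_label = sum(cm[i])                                        # row sum
    report[label] = {'precision': cm[i][i] / predicted_as_label,
                     'recall': cm[i][i] / truly_label}

print('accuracy: %.3f' % accuracy)
for label in labels:
    print('%s  precision: %.2f  recall: %.2f'
          % (label, report[label]['precision'], report[label]['recall']))
```

Precision asks how many of the reviews predicted as a given label truly carry it, whereas recall asks how many of the reviews truly carrying a label were found.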



Just for fun, we can try a linear SVM as a classifier on the same data.

In [40]:
# classifier (Linear SVM)
clf_svm = LinearSVC().fit(X_train_tf, Y_train)

# classifying the testing data
Y_pred_svm = clf_svm.predict(X_test_tf)

# accuracy
print('Accuracy - Linear SVM: %6.4f' % accuracy_score(Y_test,Y_pred_svm))
print(confusion_matrix(Y_test,Y_pred_svm))
print(classification_report(Y_test,Y_pred_svm))

Accuracy - Linear SVM: 0.8760
[[218  36]
 [ 26 220]]
             precision    recall  f1-score   support

        neg       0.89      0.86      0.88       254
        pos       0.86      0.89      0.88       246

avg / total       0.88      0.88      0.88       500



* Name classification
   * Exercise: last 3 letters, first letter vowel
* Text classification
   * NLTK classifier (Naive Bayes, sklearn wrapper)
       * Exercise: SentimentReview
   * sklearn tools
   * (cross validation)
       * Exercise: News groups
* Text clustering