# Clustering and unsupervised learning

The simplest version of your task is to produce two representations of the 40-volume course corpus:

1. A plausible topic model using latent Dirichlet allocation
1. A _k_-means clustering on the basis of (normalized) word frequencies

The _k_-means clustering should take the form of cluster assignments per volume, as illustrated in the textbook. The topic model should be represented as a list of keywords associated with each topic, again as shown in the textbook.

There is at least one thing you'll need to do in addition to the textbook examples:

* Use the topic model output (or any other method) to uncover and remove additional proper names from your data.

There are a bunch of proper names in the corpus. As we discussed in class, most of these are undesirable in the sense that they do not represent meaningful semantic (that is, subject-matter) connections between any two volumes that happen to share them. So we want to get rid of them. 

One way to do this is to build topic models repeatedly, adding names from each run's output to your stopword list until you stop seeing personal names among the top keywords. Another is to remove all proper nouns or named entities of type "person" from the input data. It's up to you how you approach the problem.
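The second approach can be sketched using Penn Treebank part-of-speech tags, assuming each sentence is a list of `(token, tag)` pairs. The helper name `drop_proper_nouns` and the sample sentence are hypothetical, and this is only one possible way to do it. Tagger errors mean that some names will slip through and some capitalized common nouns may be dropped.

```python
# Penn Treebank tags for singular and plural proper nouns
PROPER_NOUN_TAGS = {'NNP', 'NNPS'}


def drop_proper_nouns(tagged_sentence):
    """Return the (token, tag) pairs whose tag is not a proper-noun tag."""
    return [
        (token, tag)
        for token, tag in tagged_sentence
        if tag not in PROPER_NOUN_TAGS
    ]


# Hypothetical tagged sentence, for illustration only
sentence = [('Elizabeth', 'NNP'), ('walked', 'VBD'), ('to', 'TO'),
            ('the', 'DT'), ('village', 'NN'), ('.', '.')]
kept = drop_proper_nouns(sentence)
print([token for token, tag in kept])
```

The same test on the tag could instead be added to the filtering condition inside `normalize()`.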

To help you get started, I've supplied three large stopword lists (from Jockers, Underwood, and Goldstone) in the `'data/wordlists'` directory on GitHub. Each file name begins with 'stopwords-'. I've also given you a function to import those lists and add their content to the basic NLTK English stopwords list. You can manage your own stopword list either in your code or (better yet) as an additional stopword file that you load alongside the others.
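If you keep your own names in a separate file, a small helper like the one below could merge new names into it after each modeling run. This is a sketch under assumptions: the function name `add_to_stopword_file` and the file name `stopwords-names.txt` are hypothetical (any name matching `stop*.txt` should be picked up by the supplied import function), and words are lowercased on the assumption that tokens are lowercased before the stopword check.

```python
import os
import tempfile


def add_to_stopword_file(path, new_words):
    """Merge new_words into the one-word-per-line file at path; return the merged set."""
    words = set()
    if os.path.exists(path):
        with open(path, 'r') as fh:
            words = {line.strip() for line in fh if line.strip()}
    # Store lowercase forms, skipping empty strings
    words.update(w.strip().lower() for w in new_words if w.strip())
    with open(path, 'w') as fh:
        fh.write('\n'.join(sorted(words)) + '\n')
    return words


# Demonstration in a temporary directory; in the notebook you would point
# at a file inside wordlist_dir instead
with tempfile.TemporaryDirectory() as tmp:
    names_file = os.path.join(tmp, 'stopwords-names.txt')  # hypothetical file name
    add_to_stopword_file(names_file, ['Elizabeth', 'Darcy'])
    merged = add_to_stopword_file(names_file, ['darcy', 'Emma'])
print(sorted(merged))
```

Keeping the list in a file rather than in code means that each run's additions persist and are easy to review.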

You'll notice, too, that I've supplied a short list (just one item for now) of offensive terms. These should not be removed from the corpus (because they are meaningful), but they also should not be displayed in raw form in the output. The supplied `mask_offensive()` function returns a version of these words with every letter except the first and the last two replaced by `'*'`. I've supplied a lightly modified version of the `normalize()` function that uses `mask_offensive()` to apply this transformation.

## Your minimal outputs should be:

1. Topic keywords for each topic in the final version of your model, showing no (or very few) proper names in any topic.
1. Cluster assignments from a _k_-means clustering on the data with names and other stopwords removed.
1. A (very) brief discussion of your results, emphasizing your sense of how well they reflect any knowledge you may have about the books in question.

Most of the work for the two primary tasks can be accomplished with minor modifications to the code included in chapter 6 of the textbook.
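For the first output, the keyword listing amounts to ranking each topic's term weights. The sketch below is not the textbook's code: `top_keywords` is a hypothetical helper and the weights are toy values. With a fitted scikit-learn `LatentDirichletAllocation` model, you would presumably pass `model.components_` and the vectorizer's `get_feature_names_out()` (scikit-learn 1.0 or later) instead.

```python
def top_keywords(components, feature_names, n=10):
    """Return, for each topic, the n terms with the largest weights."""
    topics = []
    for weights in components:
        # Term indices sorted by descending weight within this topic
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        topics.append([feature_names[i] for i in ranked[:n]])
    return topics


# Toy topic-term weights (2 topics x 4 terms), for illustration only
vocab = ['ship', 'whale', 'marriage', 'estate']
weights = [
    [5.0, 9.0, 0.1, 0.2],
    [0.3, 0.1, 7.0, 4.0],
]
for idx, words in enumerate(top_keywords(weights, vocab, n=2)):
    print(f"Topic {idx}: {' '.join(words)}")
```

Scanning this listing after each run is also a convenient way to spot names that still need to go on your stopword list.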

## Optional ways to extend this work

If you have the time and inclination to push yourself:

* Visualize the _k_-means output in two dimensions, coloring each volume by the cluster to which it is assigned. You might want to consult the vectorization problem set answers for a model approach to this type of visualization.
* Visualize the output of your topic model using `pyLDAvis`.
* Repeat the clustering using topic fractions per document in place of word frequencies. To do this, you'll need to produce a doc-topic matrix to substitute for the doc-term matrix. You can do this by running the `.transform()` method of your trained topic model over the normalized corpus.
* Explore a range of settings for normalization, vectorization, dimension-reduction, and modeling parameters to get a feel for how they affect the output.
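The third extension, clustering on topic fractions, might look roughly like the sketch below. It makes several assumptions: the toy token lists stand in for your normalized corpus, the topic and cluster counts are arbitrary, and it uses scikit-learn's `KMeans` rather than the NLTK clusterer imported in this notebook.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy pre-tokenized "volumes", standing in for the normalized corpus
docs = [
    ['ship', 'whale', 'sea', 'captain', 'whale'],
    ['sea', 'ship', 'voyage', 'captain'],
    ['marriage', 'estate', 'sister', 'ball'],
    ['estate', 'marriage', 'letter', 'sister'],
]

# The documents are already token lists, so the analyzer passes them through
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
doc_term = vectorizer.fit_transform(docs)

# Fit LDA and get one row of topic fractions per document
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(doc_term)

# Cluster on topic fractions instead of word frequencies
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(doc_topic)

for i, label in enumerate(labels):
    print(f"volume {i}: cluster {label}, topic fractions {doc_topic[i].round(2)}")
```

Each row of `doc_topic` sums to one, so the clustering compares volumes by their mix of topics rather than by their vocabulary directly.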

In [22]:
import os
import sys
import nltk
import unicodedata
import numpy as np
import glob

from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD, NMF, PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from itertools import groupby
from operator import itemgetter
from nltk.corpus import wordnet as wn
from nltk.cluster import KMeansClusterer
from collections import Counter

# Locations of the corpus texts and wordlists on your system
pickle_dir = os.path.join('..', 'data', 'pickled')
wordlist_dir = os.path.join('..', 'data', 'wordlists')

# Import our libraries
sys.path.append(os.path.join('..', 'libraries'))
from TMN import PickledCorpusReader

def read_wordlist(pattern):
    """Return the set of non-empty, stripped lines from all files matching pattern"""
    words = set()
    for f in glob.glob(pattern):
        with open(f, 'r') as fh:
            words.update(line.strip() for line in fh if line.strip())
    return words


def get_wordlists_from_files(wordlist_dir):
    """Read stopwords and offensive terms from files in wordlist_dir, return a set for each"""
    stopwords = read_wordlist(os.path.join(wordlist_dir, 'stop*.txt'))
    offensive = read_wordlist(os.path.join(wordlist_dir, 'offensive*.txt'))
    return stopwords, offensive

STOPWORDS, OFFENSIVE = get_wordlists_from_files(wordlist_dir)
STOPWORDS = set(nltk.corpus.stopwords.words('english')).union(STOPWORDS)

## LDA topic modeling

In [12]:
def identity(words):
    return words


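def mask_offensive(token):
    """Make lemmatized offensive words less objectionable for display"""
    # Keep the first letter and the last two; mask everything in between.
    # Tokens of three or fewer characters have nothing to mask.
    if token in OFFENSIVE and len(token) > 3:
        return token[0] + (len(token)-3)*'*' + token[-2:]
    return token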

class TextNormalizer(BaseEstimator, TransformerMixin):

    def normalize(self, document):
        return [
            mask_offensive(self.lemmatize(token, tag).lower())
            for paragraph in document
            for sentence in paragraph
            for (token, tag) in sentence
            if not self.is_punct(token) and not self.is_stopword(token)
        ]
    
    """
    TEXTBOOK CODE HERE. 
    I have supplied only the modified normalize() method that masks offensive terms.
    You need to get the rest of the code for this class and for the SklearnTopicModels
    class from the textbook, or else write it yourself.
    
    NB. To remove proper nouns, you can modify the normalize() function pretty trivially.
    """

## _k_-Means

In [14]:
lemmatizer = nltk.WordNetLemmatizer()

def is_punct(token):
    # Is every character punctuation?
    return all(
        unicodedata.category(char).startswith('P')
        for char in token
    )


def wnpos(tag):
    # Return the WordNet POS tag from the Penn Treebank tag
    return {
        'N': wn.NOUN,
        'V': wn.VERB,
        'R': wn.ADV,
        'J': wn.ADJ
    }.get(tag[0], wn.NOUN)


def normalize(document, stopwords=STOPWORDS):
    """
    Takes a document = list of (token, pos_tag) tuples
    Removes stopwords and punctuation, lowercases, lemmatizes
    """

    for token, tag in document:
        token = token.lower().strip()

        if is_punct(token) or (token in stopwords):
            continue
            
        yield mask_offensive(lemmatizer.lemmatize(token, wnpos(tag)))

"""
TEXTBOOK/YOUR CODE HERE.
Pretty much just the KMeansTopics class; the rest is included above, again modifying the
normalize() function to include output masking.
NB. You'll need to remove names in the same way you did for the topic model.
"""

## Discussion

A (very) brief discussion of your results, emphasizing your sense of how well they reflect any knowledge you may have about the books in question.