# CIS600 - Social Media & Data Mining
<img src="https://www.syracuse.edu/wp-content/themes/g6-carbon/img/syracuse-university-seal.svg?ver=6.3.9" style="width: 200px;"/>

# Natural Language Processing, with NLTK

###  February 22, 2018

### Discovering network structure is just one aspect of social media mining. Let's look at the actual *content* users generate on social media, starting with data provided by the NLTK package.

### Running the next cell should bring up a window prompting you to select data for download. You should select "book" in order to get everything used in the [NLTK Book](http://www.nltk.org/book/).

In [1]:
# Importing NLTK; included in Conda distro
import nltk
nltk.download()

## Reuters Corpus

### Let's start with `reuters`, a *corpus* taken from Reuters' reporting. This corpus is already grouped into categories and into *training* and *test* sets.

In [13]:
from nltk.corpus import reuters

### A *corpus* is a collection of documents. The documents in this case are articles, but in general they could be other things, such as individual tweets. Think "body" of work: *corpus* is Latin for "body".

### Here we have a list of all the documents' ids.

In [14]:
reuters_ids = reuters.fileids()

### The first and last documents...

In [None]:
print(reuters_ids[0],reuters_ids[-1])

### Notice that there are *test* and *training* documents. We will use that when we do classification, where the training documents will be used to 'learn' a model, and the test documents to evaluate the quality of that model.
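
### A minimal sketch of how that split can be recovered from the ids alone, assuming every id starts with either `training/` or `test/`. The short list `sample_ids` is a hypothetical stand-in for `reuters_ids`.

```python
# Hypothetical stand-in for reuters_ids; the real list comes from reuters.fileids().
sample_ids = ['test/14826', 'training/9865', 'training/9880']

# Split on the prefix that marks each document's role.
train_ids = [doc_id for doc_id in sample_ids if doc_id.startswith('training/')]
test_ids = [doc_id for doc_id in sample_ids if doc_id.startswith('test/')]

print(len(train_ids), len(test_ids))
```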

### The `reuters` corpus is grouped into many overlapping categories...

In [None]:
reuters_cats = reuters.categories()
print(reuters_cats, len(reuters_cats))

### That was all of them, but the function `categories` can be applied to a particular document to get its categories.

In [None]:
reuters.categories('training/9865')

### ...or a list of documents

In [None]:
reuters.categories(['training/9865','training/9880'])

### And you can pass a category or list of categories to the `fileids` function.

In [None]:
reuters.fileids('barley')

In [None]:
reuters.fileids(['barley','corn'])

### We can get the words appearing in a list of documents or in documents belonging to a specified category.

In [None]:
reuters.words('training/9865')[:14]

In [None]:
reuters.words(['training/9865','training/9880'])

In [None]:
reuters.words(categories='barley')

In [None]:
reuters.words(categories=['barley','corn'])

### What about the actual content? The `raw` function returns it as a single string.

In [None]:
print(reuters.raw('test/14826'))

### For more on available methods, see `help`.

In [None]:
help(nltk.corpus.reader)

## Movie Reviews Corpus (from Bo Pang and Lillian Lee)

In [16]:
from nltk.corpus import movie_reviews

In [17]:
movie_ids = movie_reviews.fileids()

In [None]:
print(movie_ids[0],movie_ids[-1])

### These are split into *negative* and *positive* movie reviews - this is the sort of classification we would like to do for sentiment analysis.

In [None]:
movie_reviews.categories()

In [None]:
movie_reviews.categories('neg/cv000_29416.txt')

In [None]:
print(len(movie_reviews.fileids('neg')), len(movie_reviews.fileids('pos')))

### Again, we can look at the raw content of the document.

In [None]:
movie_reviews.raw('neg/cv000_29416.txt')

## Twython

### The NLTK Twitter module depends on the Twython package.

### Twython is another Python package for interacting with the Twitter API. Maybe you'll prefer it. The first three examples below sample from the public stream through NLTK's `Twitter` convenience class.

### See the [NLTK Twitter HOWTO](http://www.nltk.org/howto/twitter.html) for more details.

In [None]:
from nltk.twitter import Twitter
tw = Twitter()
tw.tweets(keywords='love, hate', limit=10) #sample from the public stream

In [None]:
tw = Twitter()
tw.tweets(follow=['759251', '612473'], limit=10) # see what CNN and BBC are talking about

In [None]:
tw = Twitter()
tw.tweets(to_screen=False, limit=25)

### Let's use credentials. They must be stored in a file with the name "credentials.txt" kept in your *twitter-files* directory. The file must have the following format:

```
app_key=YOUR_CONSUMER_KEY  
app_secret=YOUR_CONSUMER_SECRET  
oauth_token=YOUR_ACCESS_TOKEN  
oauth_token_secret=YOUR_ACCESS_TOKEN_SECRET
```
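
### To make the format concrete, here is a rough sketch of reading such a file's contents into a dictionary. `parse_credentials` is a hypothetical helper written for illustration, not NLTK's own code; it only assumes one `key=value` pair per line.

```python
def parse_credentials(text):
    """Turn 'key=value' lines into a dict (hypothetical helper, not NLTK's code)."""
    creds = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or '=' not in line:
            continue  # skip blank or malformed lines
        key, value = line.split('=', 1)
        creds[key.strip()] = value.strip()
    return creds


# Placeholder values, copied from the format shown above.
example = """app_key=YOUR_CONSUMER_KEY
app_secret=YOUR_CONSUMER_SECRET
oauth_token=YOUR_ACCESS_TOKEN
oauth_token_secret=YOUR_ACCESS_TOKEN_SECRET"""

creds = parse_credentials(example)
print(sorted(creds))
```

### A dictionary of this shape can be unpacked into keyword arguments with `**`, which is the pattern behind calls such as `Streamer(**oauth)`.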

In [7]:
from nltk.twitter import Query, Streamer, Twitter, TweetViewer, TweetWriter, credsfromfile

In [None]:
oauth = credsfromfile()
client = Streamer(**oauth)
client.register(TweetViewer(limit=10))
client.sample()

In [None]:
client = Streamer(**oauth)
client.register(TweetViewer(limit=10))
client.filter(track='refugee, germany')

In [None]:
client = Query(**oauth)
tweets = client.search_tweets(keywords='nltk', limit=10)
tweet = next(tweets)
from pprint import pprint
pprint(tweet, depth=1)

### (What is `next`?)

In [None]:
help(next)
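
### A small illustration of what `next` does, using a plain Python generator as a stand-in for an iterator of tweets (`fake_tweets` is hypothetical). Each call to `next` consumes one item, so a later `for` loop over the same iterator only sees what is left.

```python
def fake_tweets():
    """Hypothetical generator standing in for an iterator of tweets."""
    for text in ['first', 'second', 'third']:
        yield {'text': text}


stream = fake_tweets()
first = next(stream)                      # consumes the first item
rest = [tw['text'] for tw in stream]      # only the remaining items
done = next(stream, None)                 # exhausted: returns the default, no exception

print(first['text'], rest, done)
```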

### Printing those tweets

In [None]:
for tweet in tweets:
    print(tweet['text'])

### Now for some initial processing steps. Ultimately, you'll want a mathematical representation of tweets, reviews, posts - whatever you are trying to classify.

In [24]:
from nltk import *

## Tokenization

In [None]:
s = 'We bought apples, oranges, etc., etc.'
tokens = tokenize.word_tokenize(s)
print(tokens)

### It may seem trivial because all we are doing is breaking the sentence down into words. But which things count? Notice that the commas appear in our list of tokens as well. With special characters thrown into the mix, as in a tweet, things become even more complicated.
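
### To see why "which things count?" is a real question, here is a rough comparison of two naive tokenizers built from the standard library only. Neither is what NLTK does internally; the regular expression is just an assumption chosen for illustration.

```python
import re

s = 'We bought apples, oranges, etc., etc.'

# Splitting on whitespace leaves punctuation glued to the words.
whitespace_tokens = s.split()

# A crude regex: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", s)

print(whitespace_tokens)
print(regex_tokens)
```

### Note that the regex version splits the abbreviation "etc." into two tokens, which is exactly the kind of decision a tokenizer has to make.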

In [32]:
t = '''#qcpoli enjoyed a hearty laugh today with #plq
    debate audience for @jflisee #notrehome tune was that the intended reaction?'''

In [50]:
tt = TweetTokenizer()

In [51]:
tokens2 = tt.tokenize(t)

In [52]:
print(tokens2)

['#qcpoli', 'enjoyed', 'a', 'hearty', 'laugh', 'today', 'with', '#plq', 'debate', 'audience', 'for', '@jflisee', '#notrehome', 'tune', 'was', 'that', 'the', 'intended', 'reaction', '?']


### These results are different from what you'd get from the old-fashioned tokenizer:

In [None]:
tokens3 = tokenize.word_tokenize(t)
print(tokens3)

### Tokenization is just the fundamental first step toward a model. Whether you use N-grams, Word2Vec, Bag-of-Words, Naïve Bayes, or whatever, you will almost certainly start with tokenization, because we need to chop things up into pieces before we can understand them.

### (We will look at each of those; don't worry if they sound alien.)

## Stemming/Lemmatization

### Many words are subtle variants of each other or of another more basic word. Examples:

- likes $\to$ like
- carries $\to$ carry
- books $\to$ book

### A natural next step after tokenization, particularly if you are taking frequency of words into account, is to identify root words whose variations occur as different tokens. For instance, if you are searching documents containing "democracy", you probably want results including documents containing "democratic" as well.

### Technically, *stemming* strips prefixes/suffixes by rule, so the result need not be a real word, while *lemmatization* reduces a token to a legitimate dictionary word (its *lemma*).

### Lemmatization is a non-trivial task that relies on both *rules* and *dictionaries*. In other words, sometimes suffix stripping is enough (*stemmed* $\to$ *stem*), but other cases require a lookup (*sought* $\to$ *seek*).

### Furthermore, lemmatization cannot be done one token at a time, since parts of speech (POS) must be considered. Example:

- bored/bore/bear
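
### The interplay of rules, lookups and parts of speech can be sketched with a toy lemmatizer. Everything here is hypothetical: the two suffix rules, the tiny table of irregular forms and the POS tags `'v'`, `'n'`, `'a'` are made up for illustration and are far cruder than what a real lemmatizer uses.

```python
# Hypothetical toy tables, for illustration only.
SUFFIX_RULES = [('ies', 'y'), ('s', '')]   # tried in order
IRREGULAR = {
    ('sought', 'v'): 'seek',   # needs a lookup: no suffix rule applies
    ('bored', 'v'): 'bore',    # verb: "the talk bored me"
    ('bored', 'a'): 'bored',   # adjective: "a bored audience"
}


def toy_lemmatize(word, pos):
    """Look up (word, POS) first, then fall back to suffix rules."""
    key = (word.lower(), pos)
    if key in IRREGULAR:
        return IRREGULAR[key]
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[:-len(suffix)] + replacement
    return word


for word, pos in [('likes', 'v'), ('carries', 'v'), ('books', 'n'),
                  ('sought', 'v'), ('bored', 'v'), ('bored', 'a')]:
    print(word, pos, '->', toy_lemmatize(word, pos))
```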

### Stemming in NLTK

In [None]:
tokens = word_tokenize(s)
porter = PorterStemmer()
stems = [porter.stem(t) for t in tokens]
print(stems)

### (The Porter stemmer is named for [Martin Porter](https://tartarus.org/martin/index.html).)

### Stemming with the Lancaster stemmer (named for Lancaster University).

In [None]:
lancaster = LancasterStemmer()
stems = [lancaster.stem(t) for t in tokens]
print(stems)

### Lemmatization in NLTK

In [58]:
wnl = WordNetLemmatizer()
print([wnl.lemmatize(t) for t in tokens])

['We', 'bought', 'apple', ',', 'orange', ',', 'etc.', ',', 'etc', '.']


## Stopwords

In [59]:
from nltk.corpus import stopwords
import string

In [None]:
stop = stopwords.words('english')
print(stop)

In [None]:
tokens_filtered = [w for w in tokens if w.lower() not in stop and w not in string.punctuation]
print(tokens_filtered)

### Remark: which languages are supported in `stopwords`? And what is included in `string.punctuation`?

In [None]:
print(stopwords.fileids())

In [None]:
print(string.punctuation)

## Frequency

### The most frequent words are often stopwords and can be deleted (depending on the application). Very rare words are often typos to be dismissed (or counted as an occurrence of another word). Surprisingly short dictionaries (200 words) suffice for many applications.

In [None]:
tokens = tokenize.word_tokenize(reuters.raw('test/14826'))
fdist = FreqDist(tokens)
print(fdist.most_common(100))

### Some basic summary stats...

In [None]:
print("Total number of tokens = {}".format(fdist.N()))
print("Total number of unique tokens = {}".format(len(fdist.keys())))

In [None]:
for token in fdist:
    print("Term " + token + " occurs " + str(fdist[token]) + " times.")
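
### The pruning idea from the start of this section (drop the most frequent and the rarest tokens) can be sketched with a plain `Counter`. The token list and both thresholds below are made up for illustration; sensible cutoffs depend on the corpus and the application.

```python
from collections import Counter

# Hypothetical toy text; 'teh' plays the role of a rare typo.
tokens = 'the cat sat on the mat the dog sat on the log teh'.split()
counts = Counter(tokens)


def prune_vocabulary(counts, min_count, max_count):
    """Keep tokens whose frequency lies in [min_count, max_count]."""
    return {tok for tok, c in counts.items() if min_count <= c <= max_count}


# Drop anything seen only once (likely noise) or more than 3 times (likely a stopword).
vocab = prune_vocabulary(counts, min_count=2, max_count=3)
print(sorted(vocab))
```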

### We can visualize the distribution of frequency, too.

In [None]:
fdist.plot()

In [None]:
fdist.plot(cumulative=True)

## Text Normalization

### We get text content from many different sources and we want a unified format. Issues of grammaticality, spelling, punctuation, acronyms, weird tokens (e.g. emoticons) and others make this hard. There is no nifty Python package to handle it all for us, but here is an example of one tool used in normalizing text, *edit distance*:

In [None]:
from nltk.metrics import *
edit_distance('rain', 'shine')
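
### To see what such a distance measures, here is a minimal sketch of the classic Levenshtein dynamic program, counting insertions, deletions and substitutions at cost 1 each. `levenshtein` is our own illustrative function, assumed to match the default behaviour of NLTK's `edit_distance` rather than reproduce its code.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    # prev[j] = distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


print(levenshtein('rain', 'shine'))
```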

### See also the [Jaro-Winkler distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance). There is also [phonemic distance](https://en.wikipedia.org/wiki/Phonetic_algorithm), based on the pronunciation of words.

### You might as well download [this paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.207.6218&rep=rep1&type=pdf) because I will assign it as reading eventually.

## Representation

### So far, we have looked at basic tools for breaking down text and cleaning it up. In order to plug it into a machine learning algorithm, text will need to be broken down, cleaned up and then encoded in vectors.

- Word2Vec - Skip-Gram/CBOW
- Bag-of-Words
- N-grams
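
### As a rough sketch of the simplest of these, Bag-of-Words, each document becomes a vector of token counts over a shared vocabulary. The two "documents" below are hypothetical and already tokenized.

```python
from collections import Counter

# Two tiny hypothetical documents, already tokenized.
docs = [['good', 'movie', 'good', 'plot'],
        ['bad', 'movie']]

# Fix an ordering of the vocabulary so every document maps to the same axes.
vocab = sorted({token for doc in docs for token in doc})


def bag_of_words(doc):
    """Count how often each vocabulary word occurs in doc."""
    counts = Counter(doc)
    return [counts[token] for token in vocab]


vectors = [bag_of_words(doc) for doc in docs]
print(vocab)
print(vectors)
```

### Word order is thrown away entirely, which is one motivation for N-grams.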

### Using `nltk` to calculate N-grams is a natural next step after tokenization and normalization, for sentiment analysis on tweets, say.

In [73]:
tokens = tt.tokenize(t)

In [None]:
for b in bigrams(tokens):
    print(b)

In [None]:
for r in trigrams(tokens):
    print(r)

In [None]:
for n in ngrams(tokens,4):
    print(n)

### These can then be transformed into numerical vectors using, say, a *one-hot* encoding.
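
### A minimal sketch of such an encoding for bigrams, using only built-ins. The token list is hypothetical, and adjacent pairs are formed with `zip`; summing the one-hot vectors gives a count vector for the whole text.

```python
tokens = ['the', 'intended', 'reaction', 'the', 'intended']

# Bigrams as adjacent pairs of tokens.
pairs = list(zip(tokens, tokens[1:]))

# Assign each distinct bigram a fixed position.
vocab = sorted(set(pairs))
index = {pair: i for i, pair in enumerate(vocab)}


def one_hot(pair):
    """Vector with a single 1 at the bigram's position."""
    vec = [0] * len(vocab)
    vec[index[pair]] = 1
    return vec


encoded = [one_hot(pair) for pair in pairs]

# Column-wise sum: how often each bigram occurs in the text.
doc_vector = [sum(col) for col in zip(*encoded)]
print(doc_vector)
```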

## Classification

### After we have tokenized and cleaned and encoded, what then? Then we want to do classification. We want to learn from the data. We will look at three different methods for this, at least:

- decision trees
- support vector machines
- naïve Bayes

### You can also use neural nets and any other thing that can be made to operate on the representation.

In [None]:
# Example taken from Sklearn docs.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets


def make_meshgrid(x, y, h=.02):
    """Create a mesh of points to plot in

    Parameters
    ----------
    x: data to base x-axis meshgrid on
    y: data to base y-axis meshgrid on
    h: stepsize for meshgrid, optional

    Returns
    -------
    xx, yy : ndarray
    """
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy


def plot_contours(ax, clf, xx, yy, **params):
    """Plot the decision boundaries for a classifier.

    Parameters
    ----------
    ax: matplotlib axes object
    clf: a classifier
    xx: meshgrid ndarray
    yy: meshgrid ndarray
    params: dictionary of params to pass to contourf, optional
    """
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out


# import some data to play with
iris = datasets.load_iris()
# Take the first two features. We could avoid this by using a two-dim dataset
X = iris.data[:, :2]
y = iris.target

# we create an instance of SVM and fit our data. We do not scale our
# data since we want to plot the support vectors
C = 1.0  # SVM regularization parameter
models = (svm.SVC(kernel='linear', C=C),
          svm.LinearSVC(C=C),
          svm.SVC(kernel='rbf', gamma=0.7, C=C),
          svm.SVC(kernel='poly', degree=3, C=C))
models = (clf.fit(X, y) for clf in models)

# title for the plots
titles = ('SVC with linear kernel',
          'LinearSVC (linear kernel)',
          'SVC with RBF kernel',
          'SVC with polynomial (degree 3) kernel')

# Set-up 2x2 grid for plotting.
fig, sub = plt.subplots(2, 2)
plt.subplots_adjust(wspace=0.4, hspace=0.4)

X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)

for clf, title, ax in zip(models, titles, sub.flatten()):
    plot_contours(ax, clf, xx, yy,
                  cmap=plt.cm.coolwarm, alpha=0.8)
    ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xlabel('Sepal length')
    ax.set_ylabel('Sepal width')
    ax.set_xticks(())
    ax.set_yticks(())
    ax.set_title(title)

plt.show()

### What is this "kernel" business? Let's look at an example to illustrate the basic idea.