# Natural Language Processing 
### (NLP for short)

## Morning Roadmap:

* Overview of NLP.
* Terminology.
* Featurizing text.
 * Bag of words and vectorizing documents.
 * Cosine similarity.
 * Stop words.
 * Stemming and lemmatization.
 * The TF-IDF transform. 
 * Summary of preprocessing.
 * Demonstration.
 * Discussion.



## Overview of NLP

Natural language processing is concerned with understanding text using computation. People working within the field are often concerned with:

 * Information retrieval.
  * How do you find a document or a particular fact within a document? 
 * Document classification.
  * What is the document about amongst mutually exclusive categories? 
 * Machine translation.
  * How do you write an English phrase in Chinese? Think of Google translate. 
 * Sentiment analysis. 
  * Was a product review positive or negative?

Natural language processing is a huge field and we will just touch on some of the concepts.

### Terminology

* Corpus: a dataset of text, e.g. newspaper articles, tweets, etc.
* Document: a single entry from our corpus.
* Vocabulary: the list of distinct words that appear in the corpus.
* Type: a single word in a vocabulary.
* Token: an instance of a type in a document.
 * A single document can have multiple tokens of the same type, e.g. "the dog ran after the other dog" has two tokens of type "dog".
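
To make the distinction between types and tokens concrete, here is a minimal sketch in plain Python. It assumes that splitting on whitespace is an adequate tokenizer for this sentence; the variable names are only illustrative.

```python
from collections import Counter

document = "the dog ran after the other dog"

# Tokens are the individual word occurrences, in order.
tokens = document.split()

# Types are the distinct words; Counter maps each type to its token count.
type_counts = Counter(tokens)

print(len(tokens))         # number of tokens in the document
print(len(type_counts))    # number of distinct types
print(type_counts["dog"])  # number of tokens of type "dog"
```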

## Featurizing text

### Bag of words

So far in this course, we've trained models using numerical or categorical data. In order to process text, we'll have to translate it into these data types. One not-so-great way to featurize a piece of text is to consider its binary representation, treating every bit as a categorical feature. However, individual bits carry very little meaning. Because we are human and understand text, we can choose better features. Letters don't carry very much meaning, but words do.

A natural way to featurize a document is to count the number of times that each word occurs in it. This is called a bag of words representation of that document, because it is as if we are removing the words from our document and throwing them into a bag. All that matters in a bag of words is how many times a word appears in the bag. This number is called the term frequency, tf. The term frequency may also refer to how many times a word occurs in a whole corpus; context usually distinguishes the two cases.

$tf = number \ of \ times \ a \ word \ appears \ in \ a \ document \ or \ corpus$

Although this representation destroys all the structural information about the document, it does convert the document into a list of counts, which is essentially a vector. Now that we have transformed our document into numerical data, we can use that data to do machine learning. (Aside: in this lecture we will focus on a measure of closeness called cosine similarity, which can then be used in an unsupervised learning algorithm.)

#### Example

We will take the following four documents and make a document term matrix. Each row of the matrix corresponds to a document and is called a term frequency vector. Each column corresponds to a word. A particular entry records how many times that word occurs within that document.



In [12]:

s1 = "the buffalo"
s2 = "buffalo buffalo buffalo"
s3 = "the cat"
s4 = "buffalo buffalo buffalo buffalo buffalo"

corpus = [s1,s2,s3,s4]

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
document_term_matrix = cv.fit_transform(corpus)
print(cv.vocabulary_)
print(document_term_matrix.todense())


{'the': 2, 'buffalo': 0, 'cat': 1}
[[1 0 1]
 [3 0 0]
 [0 1 1]
 [5 0 0]]


The first column corresponds to the word "buffalo", the second "cat" and the third "the", or ["buffalo","cat","the"]. 

The document "the buffalo" has the vector representation [1,0,1] since "buffalo" happens once, "cat" happens zero times and "the" happens once. 

### Closeness measures

Once we have a vector representation of a document, we can ask questions about it, such as how similar it is to another document. In order to answer that, however, we need some measure of similarity. We'll talk about cosine similarity and the Euclidean distance as measures of similarity.

#### Cosine similarity

Recall that for two vectors $\vec{x}$ and $\vec{y}$, $\vec{x} \cdot \vec{y} = ||\vec{x}|| ||\vec{y}|| \cos{\theta}$. And so,

$\frac{\vec{x} \cdot \vec{y} }{||\vec{x}|| ||\vec{y}||} = \cos{\theta}$

This is called the cosine similarity of two vectors because it is the cosine of the angle between them. Intuitively, the more similar two documents are, the smaller the angle between them, and the more dissimilar they are, the larger the angle. An extreme example is when two documents share no words in common; the dot product is zero and therefore the cosine is zero. At the opposite extreme, when two documents are identical they share all of the same words with the same frequencies, so their cosine is 1.
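
The formula can be sketched in a few lines of plain Python. The helper name `cosine_similarity` is an assumption for illustration, and the vectors are the term frequency vectors from the example above, over the vocabulary ["buffalo", "cat", "the"]. The function assumes neither vector is all zeros.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Term frequency vectors over the vocabulary ["buffalo", "cat", "the"].
the_buffalo = [1, 0, 1]   # "the buffalo"
buffalo_x3 = [3, 0, 0]    # "buffalo buffalo buffalo"
the_cat = [0, 1, 1]       # "the cat"

print(cosine_similarity(the_buffalo, buffalo_x3))
print(cosine_similarity(the_buffalo, the_cat))
print(cosine_similarity(buffalo_x3, the_cat))  # no shared words
```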

[Note: insert figures.]

#### Euclidean distance

We could equally try the Euclidean distance $||\vec{x}-\vec{y}||$, but there is a big problem with it: the Euclidean distance grows with the length of a document. Intuitively, duplicating each word in our bag of words generates a vector that points in exactly the same direction, yet its Euclidean distance from the original vector is no longer zero. One solution is to normalize the vectors to unit length before calculating the Euclidean distance. Then increasing the length of a document does not change the Euclidean distance unless the direction of the term frequency vector changes.

The cosine similarity and the Euclidean distance run in opposite directions. A large Euclidean distance means two documents are dissimilar and a small one means they are similar, whereas a large cosine similarity corresponds to similar documents and a small one to dissimilar documents. For instance, with normalized vectors, two documents that share no words will have a Euclidean distance of $\sqrt{2}$ and two copies of the same document will have a Euclidean distance of $0$ (compare that to 0 and 1 for cosine similarity).

However, the Euclidean distance of two normalized vectors contains the same information as the cosine similarity of the two vectors. For unit vectors, $||\vec{x}-\vec{y}||^2 = 2 - 2\cos{\theta}$, so given the Euclidean distance we can uniquely determine the angle between the vectors (using $2\sin{(\theta/2)}=distance$). Once we have this angle, we can calculate the cosine and thus the cosine similarity.
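
As a quick numerical check of this equivalence, the sketch below normalizes two of the example vectors and recovers the cosine similarity from the Euclidean distance alone, using the identity $||\vec{x}-\vec{y}||^2 = 2 - 2\cos{\theta}$ for unit vectors. The helper names are assumptions for illustration.

```python
import math

def normalize(v):
    """Scale a nonzero vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = normalize([1, 0, 1])  # "the buffalo"
y = normalize([0, 1, 1])  # "the cat"

distance = euclidean_distance(x, y)
cos_theta = sum(a * b for a, b in zip(x, y))  # dot product of unit vectors

# For unit vectors, distance^2 = 2 - 2*cos(theta), so the cosine
# similarity can be recovered from the distance alone.
recovered_cos = 1 - distance ** 2 / 2
theta = math.acos(cos_theta)

print(distance, 2 * math.sin(theta / 2))  # the two values agree
print(cos_theta, recovered_cos)           # and so do these
```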

[Note: insert figures.]



### Back to the example

### Stop words

Does the cosine similarity measure of similarity between the four documents in the example make sense? Intuitively "the buffalo" and "buffalo buffalo buffalo" are similar and "the cat" and "the buffalo" are dissimilar. The cosine similarity of "the buffalo" and "buffalo buffalo buffalo" is 0.7. However the cosine similarity of "the buffalo" and "the cat" is 0.5. The cosine similarity is far too high for "the buffalo" and "the cat".

The word "the" does not carry much meaning, and yet we are treating it as just as important as the word "cat". One way to deal with this is to simply remove the word "the" from our vocabulary. In that case the word "the" would be called a stop word. A typical way to build a vocabulary is to pass the vectorizer a list of stop words; the vocabulary then contains every word that occurs at least once in the corpus and is not in the list of stop words.

Let's consider the previous example but with the stop words "a" and "the" removed. 

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words = ['a','the'])
document_term_matrix = cv.fit_transform(corpus)
print(cv.vocabulary_)
print(document_term_matrix.todense())

{'buffalo': 0, 'cat': 1}
[[1 0]
 [3 0]
 [0 1]
 [5 0]]


Now the cosine similarity between any two of the three buffalo sentences is 1, and it is zero between any buffalo sentence and the sentence "the cat".

### TF-IDF

Stop words are uninformative because they occur in almost every document. We can remove the stop words when we know what they are. However, it may still be the case that some words occur in more documents than others. Uncommon words are often more informative than common words, and we may want to scale our term frequency vectors to reflect this. TF-IDF is a systematic way of doing so.

TF-IDF is an acronym for the product of two parts: the term frequency tf and what is called the inverse document frequency idf. The term frequency is just the counts in a term frequency vector. 

$tf(term,document) = \# \ of \ times \ a \ term \ appears \ in \ a \ document$

The idf part is defined in terms of the document frequency. The document frequency is 

$df(term,corpus) = \frac{ \# \ of \ documents \ that \ contain \ a \ term}{ \# \ of \ documents \ in \ the \ corpus}$

The inverse document frequency is defined in terms of the document frequency as

$idf(term,corpus) = \log{\frac{1}{df(term,corpus)}}$.

It is called the inverse document frequency but really it is the log of the inverse document frequency. Finally tf-idf is just

tf-idf $ = tf(term,document) * idf(term,corpus)$

At this point, there is some inconsistency in our notation. The document frequency is a proportion whereas the term frequency is a count. Different authors use different conventions for what these terms mean. Often the difference is in the choice of how to normalize numbers. For instance, we could normalize our term frequency vectors so that the Euclidean norm is 1, that is, so that the $L2$ norm is 1. We could also divide the term frequency of a word by the sum of the term frequencies over all words in our vocabulary. This is called $L1$ normalization and is the way we normalize probability distributions so that the probabilities sum to 1. The choice of how to normalize the term frequencies will not change the cosine similarity. However, changing the normalization of the document frequency could change the cosine similarity between two vectors.
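
The definitions above translate directly into plain Python. The following is a minimal sketch, assuming whitespace tokenization, the natural logarithm, and a small three-document corpus chosen for illustration; the function names are not from any library. Note that with these definitions a term that appears in every document, such as "the", gets a weight of exactly zero.

```python
import math
from collections import Counter

corpus = ["the brown buffalo buffalo", "the brown dog", "the cat"]
documents = [doc.split() for doc in corpus]

def document_frequency(term):
    """Fraction of documents in the corpus that contain the term."""
    return sum(term in doc for doc in documents) / len(documents)

def idf(term):
    """Log of the inverse document frequency."""
    return math.log(1 / document_frequency(term))

def tf_idf(doc):
    """Map each term in a tokenized document to tf * idf."""
    counts = Counter(doc)  # raw term frequencies
    return {term: tf * idf(term) for term, tf in counts.items()}

weights = tf_idf(documents[0])
for term, weight in weights.items():
    print(term, round(weight, 3))
```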

We can repeat the vectorization process for our example but now computing tf-idf for each. 

In [26]:
s1 = "the brown buffalo buffalo"
s2 = "the brown dog"
s3 = "the cat"
corpus = [s1,s2,s3]

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
document_tfidf_matrix = tfidf.fit_transform(corpus)
print(tfidf.vocabulary_)
print(document_tfidf_matrix.todense())

{'brown': 0, 'the': 4, 'buffalo': 1, 'dog': 3, 'cat': 2}
[[ 0.34261996  0.90100815  0.          0.          0.26607496]
 [ 0.54783215  0.          0.          0.72033345  0.42544054]
 [ 0.          0.          0.861037    0.          0.50854232]]


Note: <a href='http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html'>scikit-learn</a> actually computes tf + tf*idf (that is, it adds one to the idf) and then normalizes, so that terms like "the" that occur in every document don't go entirely to zero.
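
The matrix above can be reproduced in plain Python, which makes the difference from the textbook definition visible. This sketch assumes scikit-learn's documented defaults (`smooth_idf=True`, which adds one to the document counts inside the logarithm, and `norm='l2'`); the function names here are illustrative and not part of scikit-learn.

```python
import math
from collections import Counter

corpus = ["the brown buffalo buffalo", "the brown dog", "the cat"]
documents = [doc.split() for doc in corpus]
vocabulary = sorted({word for doc in documents for word in doc})
n_docs = len(documents)

def smoothed_idf(term):
    # Assumed default: ln((1 + n) / (1 + df)) + 1, where df counts documents.
    # The "+ 1" keeps terms that occur in every document at weight 1, not 0.
    df = sum(term in doc for doc in documents)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    counts = Counter(doc)
    row = [counts[term] * smoothed_idf(term) for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in row))
    return [w / norm for w in row]  # L2-normalize each document vector

matrix = [tfidf_row(doc) for doc in documents]
print(vocabulary)
for row in matrix:
    print([round(w, 4) for w in row])
```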

### Stemming and lemmatization

Oftentimes two words have essentially the same meaning, and we may want to combine them into the same vector component. For instance, if one document has the word "dolphin" and another has the word "dolphins", then at present these two words are treated as distinct words in our vocabulary. Stemming and lemmatization are two ways of mapping words which mean essentially the same thing to the same root.

#### Stemming

Stemming, see <a href='http://snowball.tartarus.org/algorithms/porter/stemmer.html'>Porter stemming</a>, is a way of algorithmically reducing words to a shared root. It does this by applying rules that strip common suffixes from words. For instance, a stemmer might reduce both "fisher" and "fishing" to the root "fish". However, stemming does not always return words. For instance, "argue" and "argument" may both get mapped to "argu".
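
To illustrate the idea of suffix stripping, here is a deliberately crude toy stemmer. It is not the Porter algorithm, which uses a much more careful set of rules; the suffix list and the minimum stem length are assumptions made up for this sketch.

```python
# Suffixes to strip, checked in this order; the list is illustrative only.
SUFFIXES = ("ing", "ers", "er", "s")

def crude_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

for word in ["fishing", "fisher", "dolphins", "dolphin"]:
    print(word, "->", crude_stem(word))
```

A rule list this short will mangle plenty of real words, which is one reason to reach for an established stemmer in practice.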

#### Lemmatization

Lemmatization seeks to map similar words to the same root by mapping the meanings of words. Lemmatization algorithms use giant databases of words and their relationships, like <a href='https://wordnet.princeton.edu'>WordNet</a>. Because it deals with actual meanings, hand-picked by real people rather than algorithmically generated as with stemming, lemmatization tends to work better.

With lemmatization, the words "argument" and "argue" would both map to "argue". The words "better" and "good" may also both map to "good" even though these two words share no common letters. 

### Generalization of bag of words: n-grams

Instead of counting how many times a particular word appears in a document, we can count how often a pair of words appears in a document. If we only count pairs of words that appear in consecutive order, then each such pair is called a bigram. If we count the number of times that a three-word combination occurs in consecutive order, then that is called a trigram. In general we can count how often $n$ words occur in consecutive order, which is called an n-gram.

Creating bigrams, trigrams and n-grams in general blows up our feature space. If we had a vocabulary of size $|V|$ and we use bigrams instead of individual words, we will have $|V|^2$ possible bigrams. In general, for a vocabulary of size $|V|$ there are $|V|^n$ possible n-grams. The larger $n$ is, the more combinations will have occurred only once in our corpus. If we make $n$ too large, we risk overfitting.
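
Extracting n-grams from a tokenized document takes only a sliding window. The sketch below assumes whitespace tokenization; the function name `ngrams` is chosen for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of consecutive tokens, each as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the dog ran after the other dog".split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts)
print(trigram_counts)
```

Even in this tiny document every bigram occurs exactly once, although the unigrams "the" and "dog" each occur twice, which hints at how quickly n-gram counts become sparse.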

### Steps to featurize text

1. Split on whitespace.
2. Remove punctuation.
3. Lowercase.
4. Stem/lemmatize.
5. Remove stop words.
6. Perform tf-idf (optional depending on situation).
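
The steps above can be sketched as a single function. This is a minimal illustration under stated assumptions: the stop word list is made up, punctuation means ASCII punctuation from the standard library, and the stemming step is left as a comment because it depends on which stemmer or lemmatizer you choose.

```python
import string
from collections import Counter

STOP_WORDS = {"a", "the", "in"}  # illustrative list, not a standard one
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)

def featurize(text):
    """Turn raw text into a bag of words (a Counter of term frequencies)."""
    # 1. Split on whitespace.
    tokens = text.split()
    # 2. Remove punctuation.
    tokens = [t.translate(PUNCTUATION_TABLE) for t in tokens]
    # 3. Lowercase.
    tokens = [t.lower() for t in tokens]
    # 4. Stem/lemmatize (omitted here; plug in a stemmer of your choice).
    # 5. Remove stop words, and drop tokens that were pure punctuation.
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    return Counter(tokens)

print(featurize("The cat, in the hat!"))
```

The resulting counts can then be arranged into term frequency vectors and, optionally, reweighted with tf-idf.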

## Evening Roadmap:

### Naive Bayes
* Generative vs. discriminative Models.
* Document classification.
* Mathematical definition.
* What is naive about Naive Bayes?
* Example with document classification.
* Laplace smoothing.
* Numerical underflow.
* Generalizations.
* Pros and cons.

### Generative vs. discriminative models.

* Generative models model the full joint probability distribution. E.g. $P(X,Y)$.
* Discriminative models only model conditional probability distributions. E.g. $P(Y|X)$.

Naive Bayes is a generative model: we seek to estimate the joint probability distribution. However, we can still use it to discriminate between classes, since we can calculate the conditional probability distribution from the joint probability distribution as follows:

$P(Y|X)=P(Y,X)/P(X)$

Models where we are seeking to model the underlying probability distribution are called generative because we can use our model to generate new data. We do this by sampling with the probabilities given to us by our model. 

### Document classification.

In this lecture we will be applying Naive Bayes to document classification. 

Document classification is when you are trying to figure out what category a document belongs to based on the text of the document.

For instance, you could be looking at a news article and want to decide what category of articles it belongs to, e.g. sports, art, weather, etc. You happen to have a corpus of news articles which are pre-categorized. You use these articles to make a decision about which class your new article belongs to. There are many machine learning algorithms one could use to do this. Today we'll focus on Naive Bayes.

### Naive Bayes

Naive Bayes is a way of estimating the **maximum a posteriori (MAP)** category:

MAP = $\arg\max_C P(C|x_1...x_n)$

where $C$ is a category and $x_1...x_n$ is the evidence. In words, this is just figuring out which category is most probable given the evidence. 

Suppose that $x_1...x_n$ is a vector representation of bag of words according to some vocabulary $V$. That is, we do not take words into account that are not part of our vocabulary. In order to estimate $P(C|x_1...x_n)$ directly, we would need to count the number of times that the category $C$ and the words $x_1...x_n$ happen in our corpus and then divide by the number of times that the words $x_1...x_n$ happen. More formally:

$P(C|x_1...x_n)=\frac{N(C,x_1...x_n)}{N(x_1...x_n)}$,

where $N(\cdot)$ denotes the number of documents. However, we may never have seen that particular document, especially if the document is more than a few words long, e.g. a news article. In order to estimate the probabilities, we'll have to make some simplifying assumptions. One such simplifying assumption is Naive Bayes.

Naive Bayes derives its name from Bayes' law:

$P(C|x_1...x_n) = \frac{P(x_1...x_n|C)P(C)}{P(x_1...x_n)}$.

However, this is not yet naive. The naive assumption of Naive Bayes is that, given the category, each word in the distribution $P(x_1...x_n|C)$ is independent of every other word. That is, $P(x_1...x_n|C)=P(x_1|C)...P(x_n|C)$. This, however, is a completely ridiculous assumption, especially when applied to document classification. It assumes that the next word that you are reading does not depend on any words that you have already read or any words that you will read. One basic counterexample is terms like "San Francisco" and "San Diego". The term "San" does not typically occur independently of "Francisco" and "Diego"; it almost always occurs in combination with such terms. Yet Naive Bayes treats them as independent. The Naive Bayes model also assumes that whether the word "game" appears in the document is completely independent of whether "football" occurred. Of course this is not true.

If the independence of Naive Bayes is so ridiculous, why use it? Because it works. It is not perfect, but it still does better than random. 

"Better than random" may not be a very compelling reason to use Naive Bayes, and indeed, Naive Bayes is often worse than other models, see <a href="https://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml06.pdf">Caruana et al.</a>. However, Naive Bayes has some attractive features. First, training is really simple. Essentially you need only keep track of the number of times a word appears in a document of a particular category. Why is that? 
* In order to estimate $\frac{P(x_1...x_n|C)P(C)}{P(x_1...x_n)}$ we approximate $P(x_1...x_n|C)P(C)$ with the product $P(x_1|C)...P(x_n|C)P(C)$. This product is no longer a probability, since it does not add up to one over the categories. Instead it is a score that we use in order to make a decision. 
* The denominator is the same for each class and so is irrelevant for deciding which category a document belongs to. Instead we compute the score $P(x_1|C)...P(x_n|C)P(C)$ to decide which category is most probable.
* In order to decide which category a document belongs to, we need only calculate $P(x_i|C)$ for each $i$ and $P(C)$. 

$P(x_i|C)=\frac{Number\ of\ times\ the \ word\ type \ x_i\ occurs\ in\ documents\ of\ category\ C}{Number\ of\ word\ tokens \ in\ documents\ of\ category\ C}$

$P(C)=\frac{Number\ of \ word \ tokens \ in\ documents\ of\ category\ C}{Number \ of \ word \ tokens \ in \ the \ entire \ corpus}$

* In order to do this, we need only store a count of the number of times that a word type $x_i$ occurs in documents of category $C$, a count of the number of words that occur in category $C$, and the total number of words in the corpus. 


Thus training amounts to just storing a list of numbers in memory. If we get a new document in our corpus, we need only update our counts in order to update our model. This makes it a natural choice for "online" applications, where data is continuously changing.  
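
A minimal sketch of this bookkeeping is shown below. It follows the two estimates given above literally, including the token-based estimate of $P(C)$; the class name, method names, and the two tiny training documents are assumptions made up for illustration.

```python
from collections import Counter, defaultdict

class NaiveBayesCounts:
    """Stores only the counts needed for the estimates above."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word type -> count
        self.tokens_per_category = Counter()     # category -> number of tokens
        self.total_tokens = 0

    def update(self, tokens, category):
        """Add one labeled document; this is all that training does."""
        self.word_counts[category].update(tokens)
        self.tokens_per_category[category] += len(tokens)
        self.total_tokens += len(tokens)

    def p_word(self, word, category):
        """Estimate of P(x_i | C) from the stored counts."""
        return self.word_counts[category][word] / self.tokens_per_category[category]

    def p_category(self, category):
        """Estimate of P(C) from the stored token counts."""
        return self.tokens_per_category[category] / self.total_tokens

model = NaiveBayesCounts()
model.update("the team won the game".split(), "sports")
model.update("the painting hung in the gallery".split(), "art")

print(model.p_word("the", "sports"))
print(model.p_category("art"))
```

Calling `update` again with a new labeled document is the whole "online" update step: no retraining pass over the old data is needed.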

### Example with document classification. 

Suppose we wanted to classify the document "the cat in the hat" according to whether it is about sports or art. We want to estimate:
    
$P(sports|``the\ cat\ in\ the\ hat'')$

and 

$P(art|``the\ cat\ in\ the\ hat'')$. 

Treating "the" and "in" as stop words, we score each category as follows.

$P(sports|``the\ cat\ in\ the\ hat'') \propto P(sports)P(cat|sports)P(hat|sports)$

and 

$P(art|``the\ cat\ in\ the\ hat'') \propto P(art)P(cat|art)P(hat|art)$.

Suppose that we knew that 75% of our articles are about sports, and that 1% of the words in sports articles are "cat" and 2% are "hat". Also, 25% of the articles are about art; in those, 5% of the words are "cat" and 2% are "hat". We can then calculate these scores:

$P(sports)P(cat|sports)P(hat|sports)=0.75 * 0.01 * 0.02 = 0.00015$


$P(art)P(cat|art)P(hat|art)=0.25 * 0.05 * 0.02 = 0.00025$

From these scores, we decide that the article is more likely to be about art than sports. 
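
The same arithmetic can be checked in code. This sketch hard-codes the probabilities stated in the example; the dictionary layout and the function name `score` are assumptions for illustration (it uses `math.prod`, which requires Python 3.8 or later).

```python
import math

priors = {"sports": 0.75, "art": 0.25}
word_probs = {
    "sports": {"cat": 0.01, "hat": 0.02},
    "art": {"cat": 0.05, "hat": 0.02},
}

def score(category, words):
    """Unnormalized score: P(C) times the product of P(word | C)."""
    return priors[category] * math.prod(word_probs[category][w] for w in words)

scores = {category: score(category, ["cat", "hat"]) for category in priors}
prediction = max(scores, key=scores.get)

print(scores)
print(prediction)
```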



### Laplace smoothing

Suppose we are looking at a document which we know is about dolphins. However, the author made an offhand joke about football. And we happen to have a corpus for which the word football never appears in an article about dolphins. Thus when we calculate the score, we will find that $P(football|dolphins)=0$. And thus our score will be zero, despite the fact that the article uses the word "dolphin" over and over again. This is a problem. One way to solve this problem is with Laplace smoothing. 

Let $n_{x,C}$ be the number of words of type $x$ which appear in documents of category $C$. Also let $N_C$ be the number of words in all documents that are about $C$. Normally we would compute $P(x|C)$ as $n_{x,C}/N_C$. Laplace smoothing involves adding $\alpha$ to every count in every class. So instead of using $n_{x,C}$ we use $n_{x,C}+\alpha$. Since we add $\alpha$ once for each word type, the total number of words in articles of category $C$ becomes $N_C+\alpha |V|$, where $|V|$ is the size of our vocabulary. Thus the new probability will be

$Laplace \ P(x|C) = \frac{n_{x,C}+\alpha}{N_C+\alpha |V|}$.

A typical value for $\alpha$ is $\alpha = 1$. However, $\alpha$ can also be treated as a hyperparameter, to be chosen using cross-validation. 
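
Here is a minimal sketch of the smoothed estimate. The counts are hypothetical and the vocabulary is limited to four words purely to keep the example small; the function name is an assumption.

```python
def laplace_probability(n_xc, n_c, vocabulary_size, alpha=1.0):
    """Smoothed estimate of P(x|C): (n_xC + alpha) / (N_C + alpha * |V|)."""
    return (n_xc + alpha) / (n_c + alpha * vocabulary_size)

# Hypothetical word counts for the category "dolphins" over a 4-word vocabulary.
counts = {"dolphin": 8, "ocean": 3, "swim": 1, "football": 0}
n_c = sum(counts.values())

smoothed = {
    word: laplace_probability(n, n_c, len(counts))
    for word, n in counts.items()
}

print(smoothed["football"])    # no longer zero
print(sum(smoothed.values()))  # the smoothed estimates still sum to one
```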

### Numerical underflow

We might run into a problem in calculating the score for Naive Bayes. Imagine that we have 1000 words in a document, each of which occurs with a typical probability of 1/10000. If we try to calculate $\left(\frac{1}{10000}\right)^{1000}$, we will run into problems. Namely, Python will consider that number to be zero. 

In [4]:
0==(1.0/10000)**1000

True

What's happening? Python stores floating point numbers with finite precision. If a number becomes smaller than the smallest positive floating point number, it is stored as zero. This is called numerical underflow. If we are going to use Python to rank documents by multiplying probabilities, then we will need to do something about numerical underflow. One thing that we can do is compare the logs of the numbers. Log has the great property that it is monotonic: if $y>x>0$ then $\log(y)>\log(x)$. Furthermore, logs have the property that $\log(xy)=\log(x) + \log(y)$. Thus we can use the log-sum score 

$\log(P(C)P(x_1|C)...P(x_n|C))=\log(P(C)) +\sum_i \log(P(x_i|C))$. 

The log-sum score is what we will use in practice. 
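
The sketch below contrasts the two ways of computing the score for the 1000-word document described above. The prior of 0.5 is a hypothetical value and the function names are assumptions; all probabilities are assumed to be strictly positive (for example after Laplace smoothing), since the log of zero is undefined.

```python
import math

def product_score(prior, word_probabilities):
    """Direct product of probabilities; underflows for long documents."""
    result = prior
    for p in word_probabilities:
        result *= p
    return result

def log_score(prior, word_probabilities):
    """Log-sum score: log P(C) + sum_i log P(x_i | C)."""
    return math.log(prior) + sum(math.log(p) for p in word_probabilities)

# 1000 words, each with probability 1/10000, and a hypothetical prior of 0.5.
word_probabilities = [1.0 / 10000] * 1000

print(product_score(0.5, word_probabilities))  # underflows to 0.0
print(log_score(0.5, word_probabilities))      # a large negative, finite number
```

Because the logarithm is monotonic, ranking categories by the log-sum score gives the same decision as ranking by the product would if it could be computed exactly.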

### Generalization

So far we've applied Naive Bayes to document classification. But fundamentally, Naive Bayes can be used with any classification problem whenever there is not enough data to calculate the joint probability distribution. 

For instance, you could apply Naive Bayes to a continuous probability distribution. As an example, we could be trying to decide whether a person is male or female based on their height and weight. Of course we expect that height and weight are correlated and that there is some joint distribution 

$P(height,weight|gender)$ 

which represents that. A Naive Bayes model would naively assume that the height and weight are independent given the gender, i.e. 

$P(height,weight|gender)=P(height|gender)P(weight|gender)$. 

We might further simplify by assuming that, within each gender, height and weight are each normally distributed. We could then use the score 

$P(height|gender)P(weight|gender)P(gender)$ 

in order to decide which gender we think a person is. 
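
A minimal sketch of this Gaussian variant follows. The means, standard deviations, and priors are hypothetical numbers invented for illustration (heights in cm, weights in kg), not estimates from real data, and the function names are assumptions. The score is computed in log form to avoid underflow.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

# Hypothetical per-class (mean, std) pairs; not estimated from real data.
params = {
    "male": {"height": (178.0, 7.0), "weight": (80.0, 12.0)},
    "female": {"height": (165.0, 7.0), "weight": (65.0, 12.0)},
}
priors = {"male": 0.5, "female": 0.5}

def log_score(gender, height, weight):
    """log P(gender) + log P(height | gender) + log P(weight | gender)."""
    h_mean, h_std = params[gender]["height"]
    w_mean, w_std = params[gender]["weight"]
    return (math.log(priors[gender])
            + math.log(normal_pdf(height, h_mean, h_std))
            + math.log(normal_pdf(weight, w_mean, w_std)))

def classify(height, weight):
    """Pick the gender with the highest log score."""
    return max(priors, key=lambda g: log_score(g, height, weight))

print(classify(180.0, 82.0))
print(classify(160.0, 60.0))
```

In practice the per-class means and standard deviations would be estimated from labeled training data rather than written down by hand.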

### Pros and cons

You might want to use Naive Bayes if 

* there is not enough data to calculate the joint probability distribution. This happens when the number of features is large compared to the number of data points. 
* you are in an "online" setting where updating your model needs to be fast. Remember that in order to update a Naive Bayes model, we need only update a list of counts. 

You might not want to use Naive Bayes because

* it is often outperformed by other models.