#Natural Language Processing

* NLP for short.
* Large field devoted to using computation to understand text data.
 * Information Retrieval
 * Document classification
 * Machine translation
 * Semantic Orientation
 
Huge, complex field.
 

#Terminology

* _Corpus_: a dataset of text. e.g. Newspaper Articles, tweets, etc.
* _Document_: A single entry from our corpus. e.g. Sentence, Tweet, Article, Complete Works of Shakespeare etc. 
* _Vocabulary_: All the words that appear in our corpus.
* _Bag of Words_: A vector representation of a document based on word counts.
* _Stop Words_: Words we ignore in our analysis that are too common to be useful.
* _Token_: A single unit of text, usually a word.
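
To make the terms above concrete, here is a minimal pure-Python sketch that builds a vocabulary and bag-of-words vectors for a two-document corpus. The stop-word list, the regular expression used for tokenizing, and all function names are assumptions made for illustration. They are not what any particular library does.

```python
import re
from collections import Counter

# Hypothetical stop-word list, chosen only for this illustration.
STOP_WORDS = {"a", "is"}

def tokenize(document):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", document.lower())

def build_vocabulary(corpus, stop_words=STOP_WORDS):
    """Map every non-stop word in the corpus to a column index (alphabetical order)."""
    words = set()
    for document in corpus:
        for token in tokenize(document):
            if token not in stop_words:
                words.add(token)
    return {word: index for index, word in enumerate(sorted(words))}

def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokenize(document))
    vector = [0] * len(vocabulary)
    for word, index in vocabulary.items():
        vector[index] = counts[word]
    return vector

toy_corpus = ["Isaac is giving a lecture", "Isaac is going whitewater rafting"]
toy_vocabulary = build_vocabulary(toy_corpus)
toy_vectors = [bag_of_words(doc, toy_vocabulary) for doc in toy_corpus]
```

Each document becomes one row of counts, with one column per vocabulary word. Word order within the document is thrown away, which is where the "bag" in the name comes from.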

In [14]:
import numpy as np

In [15]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
#A really tiny example.
corpus = ['Isaac is giving a lecture', 'Isaac is going whitewater rafting']
c = CountVectorizer()
c = c.fit(corpus) #What happens here?
c.vocabulary_

{u'giving': 0,
 u'going': 1,
 u'is': 2,
 u'isaac': 3,
 u'lecture': 4,
 u'rafting': 5,
 u'whitewater': 6}

In [16]:
c.transform(corpus).todense()

matrix([[1, 0, 1, 1, 1, 0, 0],
        [0, 1, 1, 1, 0, 1, 1]])

#TF-IDF
TF-IDF stands for Term Frequency Inverse Document Frequency. The name is very descriptive:
$$ tfidf(t, d) = tf(t, d) * idf(t, D) $$

For some term $t$, some document $d$, and some corpus of documents $D$. This is pretty memorable, but what do its parts mean?

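$$ tf(t, d) = freq(t, d) $$

Where $freq(t, d)$ is the number of times $t$ appears in $d$. Some variants divide this count by $|d|$, the number of terms in the document; the example below uses raw counts.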

$$ idf(t, D) = \log\frac{N}{|\{d \in D : t \in d\}|} $$

Which is the log of the total number of documents in the corpus, $N$, divided by the number of documents in which the term appears: _the inverse document frequency_.
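
As a rough sketch, the unsmoothed definitions above translate into a few lines of plain Python. The term counts are written out by hand for the two toy documents, and the function names are made up for this example.

```python
import math

# Hand-written term counts for the two toy documents (illustrative data).
toy_docs = [
    {"isaac": 1, "is": 1, "giving": 1, "lecture": 1},
    {"isaac": 1, "is": 1, "going": 1, "whitewater": 1, "rafting": 1},
]

def tf(term, doc):
    """Raw count of the term in one document."""
    return doc.get(term, 0)

def idf(term, docs):
    """log(N / number of documents containing the term)."""
    n_containing = sum(1 for doc in docs if term in doc)
    # Raises ZeroDivisionError if the term appears in no document.
    return math.log(float(len(docs)) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

rare_score = tfidf("giving", toy_docs[0], toy_docs)   # appears in 1 of 2 documents
common_score = tfidf("isaac", toy_docs[0], toy_docs)  # appears in both documents
```

With this textbook form, a term that occurs in every document gets a score of exactly zero, and a term that occurs in no document cannot be scored at all.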

#Back to example

In [17]:
d = c.transform(corpus).todense()
d

matrix([[1, 0, 1, 1, 1, 0, 0],
        [0, 1, 1, 1, 0, 1, 1]])

How can we get the idf from this?

In [18]:
document_frequency = np.sum(d > 0, axis=0)
document_frequency

matrix([[1, 1, 2, 2, 1, 1, 1]])

In [19]:
N = d.shape[0]
N

2

#IDF

IDF is calculated _almost_ like you expect, but with a few tweaks to deal with the possible presence of zeros in our data.

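If we might need idf for new terms that don't appear in our corpus, we add one to the document frequency to prevent division by zero in the denominator. The code below also adds one to $N$ and to the final result, so a term that appears in every document gets an idf of 1 rather than 0.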

In [20]:
idf = np.log(float(N+1) /(1.+document_frequency)) + 1
idf

matrix([[ 1.40546511,  1.40546511,  1.        ,  1.        ,  1.40546511,
          1.40546511,  1.40546511]])

In [21]:
from sklearn.preprocessing import normalize

#TFIDF (at last)

There are a variety of competing normalization strategies for tfidf. The choice can change your results, so it is something you should be aware of, although many choices will probably have a similar effect.

In [22]:
tfidf  = np.multiply(d, idf)
#tfidf = normalize(tfidf, norm='l2')
tfidf

matrix([[ 1.40546511,  0.        ,  1.        ,  1.        ,  1.40546511,
          0.        ,  0.        ],
        [ 0.        ,  1.40546511,  1.        ,  1.        ,  0.        ,
          1.40546511,  1.40546511]])

Or... more easily...

In [23]:
z = TfidfTransformer(norm=None)
z.fit_transform(d).todense()

matrix([[ 1.40546511,  0.        ,  1.        ,  1.        ,  1.40546511,
          0.        ,  0.        ],
        [ 0.        ,  1.40546511,  1.        ,  1.        ,  0.        ,
          1.40546511,  1.40546511]])
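
One common normalization choice is to scale each row to unit L2 length, which is what the commented-out `normalize(tfidf, norm='l2')` line would do. A minimal pure-Python sketch of that scaling for a single row, with a hypothetical helper name:

```python
import math

def l2_normalize(vector):
    """Scale a vector so its Euclidean length is 1 (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]

# First row of the unnormalized tfidf matrix shown above.
first_row = [1.40546511, 0.0, 1.0, 1.0, 1.40546511, 0.0, 0.0]
unit_row = l2_normalize(first_row)
```

After this step, long and short documents are on the same scale, so the dot product of two rows depends only on the angle between them.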

#Cosine similarity
##Or, what to do with these vectors

Cosine similarity is a way of measuring the similarity between vectors.

$$ a \cdot b = ||a||\,||b||\,\cos(\theta) $$

We can rearrange the definition of the dot product above to get

$$ \cos(\theta) = \frac{a \cdot b}{||a||\,||b||} $$
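
The rearranged formula can be sketched directly in plain Python. The helper name is made up, and the two vectors are the tfidf rows computed earlier for the toy corpus.

```python
import math

def cosine_sim(a, b):
    """cos(theta) = (a . b) / (||a|| ||b||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc_1 = [1.40546511, 0.0, 1.0, 1.0, 1.40546511, 0.0, 0.0]
doc_2 = [0.0, 1.40546511, 1.0, 1.0, 0.0, 1.40546511, 1.40546511]
similarity = cosine_sim(doc_1, doc_2)
```

Because the formula divides by both norms, it gives the same answer whether or not the vectors were normalized beforehand. Note that this sketch does not guard against all-zero vectors.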

In [24]:
from sklearn.metrics.pairwise import cosine_similarity

In [25]:
new_document = corpus + ['Isaac is giving a lecture on whitewater rafting']
new_document

['Isaac is giving a lecture',
 'Isaac is going whitewater rafting',
 'Isaac is giving a lecture on whitewater rafting']

In [26]:
sim = z.transform(c.transform(new_document))
cosine_similarity(sim)

array([[ 1.        ,  0.29121942,  0.77523967],
       [ 0.29121942,  1.        ,  0.67172541],
       [ 0.77523967,  0.67172541,  1.        ]])

#How do we evaluate the quality of our method?

It's not supervised!

* Compare against something else.
* Design an experiment.

#What else is in NLP?

1. POS-Tagging
2. word2vec
3. Stemming e.g. hosting/host, lying/lie etc.
4. chunking
5. grammars
6. semantic orientation

#Naive Bayes

Naive Bayes is an extremely simple, but still amazingly effective, machine learning technique.

##Where/When?
1. n << p
2. n small
3. n large
4. streams of input data (online learning)
5. multi-class
6. low memory applications

#Bayes

$$ P(C|X) = \frac{P(X|C)P(C)}{P(X)} $$

For $C$ a class, and $X$ some data.
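
A small numeric sketch of the rule, using made-up priors and likelihoods for two hypothetical classes. Here $P(X)$ is obtained by summing $P(X|C)P(C)$ over the classes.

```python
def posterior(prior, likelihood):
    """Apply Bayes' rule: prior is {class: P(C)}, likelihood is {class: P(X|C)}."""
    evidence = sum(prior[c] * likelihood[c] for c in prior)  # P(X)
    return {c: prior[c] * likelihood[c] / evidence for c in prior}

# Made-up numbers: 20% of articles are about sports, and the word 'ball'
# shows up in half of the sports articles but only 5% of the others.
prior = {"sports": 0.2, "other": 0.8}
likelihood = {"sports": 0.5, "other": 0.05}
post = posterior(prior, likelihood)
```

With these numbers, seeing 'ball' moves the sports class from a 20% prior to the more likely class, even though sports articles are the minority.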

#The algorithm.

$$ P(c|X) \propto P(c) \times P(x_1|c) \times P(x_2|c) \times ... \times P(x_p|c) $$ 

Cool, looks good, let's just go ahead and implement that...

![](https://mathspig.files.wordpress.com/2011/01/6-angry-teacher.jpg)

#Naivete

$$ P(X|c) = \prod_i P(x_i | c) \implies P(x_i|c, x_j) = P(x_i|c)\,\forall\, i \neq j $$

So our model assumes this is true, but think about how ridiculous this is in practice.

$$ P(\text{'ball' appears in article}\,|\, \text{article is about sports}) = P(\text{'ball' appears in article}\,|\,\text{article is about sports and 'soccer' appears in article})$$

##But, if it works...

Then fine, but don't go claiming the number that results is _actually_ a probability or anything.

<img src='https://therealryanbell.files.wordpress.com/2013/12/pouting-child.jpg' width=400px />

In practice, the relative ranking of the classes is often preserved, even when the probabilities themselves are off.

#Laplace smoothing

What if $P(x_i|c) = 0$? This means that we see no instances of a feature for a given class in our training data. What consequence does this have for prediction?

So what can we do?

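$$ P(x_i | c) = \frac{count_c(x_i) + \alpha}{count_c(X) + \alpha \cdot p} $$

Where $count_c(x_i)$ is the number of times feature $x_i$ occurs in class $c$, $count_c(X)$ is the total count of all features in class $c$, and $p$ is the number of distinct features.

Add some $\alpha$, often $\alpha=1$, to prevent our probabilities from going all the way to zero.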
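
A minimal sketch of the smoothed estimate, assuming the second term in the denominator is meant to be $\alpha$ times the number of distinct features. The counts and the function name are invented for illustration.

```python
def smoothed_likelihoods(feature_counts, alpha=1.0):
    """Estimate P(x_i | c) for one class with additive (Laplace) smoothing.

    feature_counts maps every feature in the vocabulary to its count within
    the class, including features whose count is zero.
    """
    total = sum(feature_counts.values())
    n_features = len(feature_counts)
    denominator = total + alpha * n_features
    return {f: (n + alpha) / denominator for f, n in feature_counts.items()}

# Made-up word counts for articles in a hypothetical 'sports' class.
sports_counts = {"ball": 3, "soccer": 1, "election": 0}
smoothed = smoothed_likelihoods(sports_counts)
unsmoothed = smoothed_likelihoods(sports_counts, alpha=0.0)
```

Without smoothing, 'election' has probability zero, so a single occurrence of that word would zero out the whole product for the class. With $\alpha=1$ it gets a small but nonzero probability, and the estimates still sum to one.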

#MLE Estimation

$$ P(c|X) \propto P(c)\prod_i P(x_i | c) $$

So our Bayes classifier would choose:

$$ argmax_c{P(c)\prod_i P(x_i | c)} $$

But in high-dimensional problems like text, $P(x_i|c)$ is likely to be small, and the product _very_ small, which can result in underflow.

So how about maximizing log?

$$ argmax_c{P(c)\prod_i P(x_i | c)} = argmax_c{\log \left( P(c)\prod_i P(x_i | c) \right)} = argmax_c \left( \log P(c) + \sum_i \log P(x_i | c) \right) $$

The argmax is unchanged because $\log$ is monotonically increasing.
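
The underflow problem is easy to reproduce. In this sketch, two hypothetical classes each have 100 features with tiny made-up likelihoods. The direct product collapses to zero for both, while the log-space score still tells them apart.

```python
import math

def product_score(prior, likelihoods):
    """P(c) * prod_i P(x_i | c), computed directly in floating point."""
    score = prior
    for p in likelihoods:
        score *= p
    return score

def log_score(prior, likelihoods):
    """log P(c) + sum_i log P(x_i | c)."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)

# Made-up per-feature likelihoods for two classes with 100 features each.
class_a = [1e-4] * 100
class_b = [1e-5] * 100

direct_a = product_score(0.5, class_a)  # true value is far below the smallest float
direct_b = product_score(0.5, class_b)
log_a = log_score(0.5, class_a)
log_b = log_score(0.5, class_b)
```

Both direct products come out as `0.0`, so the classifier could not choose between the classes. The log scores are large negative numbers, but they are finite and correctly rank class a above class b.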


#Types of features for Naive Bayes

Easy to see how we would compute this for categorical features $X$. What about numerical features?

Assume some distribution on those features, and then use the pdf of that distribution for $P(x_i|c)$. Normal is a pretty popular choice (especially if you normalize your features).

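$$ x_i \,|\, c \sim N(\mu_{i,c}, \sigma_{i,c}^2) $$

With the mean and variance estimated separately for each feature $i$ and each class $c$.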
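
A minimal sketch of the Gaussian approach for a single numerical feature within a single class. The training values and the function names are made up for this example.

```python
import math

def fit_gaussian(values):
    """Estimate the mean and standard deviation of one feature within one class."""
    n = float(len(values))
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(variance)

def gaussian_pdf(x, mean, std):
    """Normal density, used in place of P(x_i | c) for a numerical feature."""
    variance = std ** 2
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return math.exp(exponent) / math.sqrt(2.0 * math.pi * variance)

# Made-up training values of one feature for one class.
class_values = [1.0, 2.0, 3.0]
mu, sigma = fit_gaussian(class_values)
density_near = gaussian_pdf(2.0, mu, sigma)  # at the class mean
density_far = gaussian_pdf(6.0, mu, sigma)   # far from the class mean
```

The density takes the place of $P(x_i|c)$ in the product, so values close to a class's mean raise that class's score. A feature with zero variance within a class would cause a division by zero here, so a real implementation would need to guard against that.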