# Natural Language Processing / Naive Bayes

## Objectives

* Turn raw text data into useful features with an NLP pipeline.
* Build a Naive Bayes classifier.


# NLP

* Text is inherently unstructured.
* What tricks can we use to convert text data into useful (structured) formats?

## Terms

* A _corpus_ is a body of text (the web, a news site, an anthology of stories...).
* A _document_ is a stand-alone part of your corpus (a website, an article, a story...).
* A _vocabulary_ is the set of words in your corpus. If you don't want to limit yourself to just these words (for example, if your corpus changes over time), you can instead use a list of words from an English dictionary as your vocabulary.
* A _Bag of Words_ is a vector representation of the words in a document. Think of each word as a column, each document as a row, and the corresponding matrix entry as the number of times that word appears in that document.
* A _token_ is a single word.
* _Stop Words_ are (common) words we ignore because they are not useful in distinguishing text. 
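
To make these terms concrete, here is a minimal sketch in plain Python. The three-document toy corpus, the tiny stop-word set, and the whitespace split used as a tokenizer are all assumptions made for illustration; they are not part of the data set used later.

```python
from collections import Counter

# Toy corpus (hypothetical): each string is one document.
toy_corpus = [
    "the cat sat on the mat",
    "the dog sat",
    "a cat and a dog",
]
# Hypothetical stop words for this toy example only.
toy_stop_words = {"the", "a", "and", "on"}

# Tokens: split on whitespace (a real pipeline would use a proper tokenizer).
tokenized = [doc.split() for doc in toy_corpus]

# Vocabulary: the sorted set of non-stop-word tokens in the corpus.
vocabulary = sorted({w for doc in tokenized for w in doc if w not in toy_stop_words})

# Bag of words: one row per document, one column per vocabulary word.
bag = []
for doc in tokenized:
    counts = Counter(doc)
    bag.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bag:
    print(row)
```

Each row of `bag` is the vector for one document, and each column lines up with one entry of `vocabulary`.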


## Building an NLP Pipeline

* This morning, we will go through some steps you can use to build an NLP Pipeline to turn unstructured text data into something a computer can train a classifier on.



In [None]:
import nltk
nltk.download('all')

In [3]:
#Using the news groups data set, let's fetch all of the data
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

#Subset the data, this dataset is huge
cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)



## Tokenizing 

The first step is usually turning your raw text documents into lists of words.

In [20]:
from nltk.tokenize import word_tokenize
text = "This is a sentence in English. This is another one."

# Split the raw text into individual words.
document = word_tokenize(text)
print(document)

['This', 'is', 'a', 'sentence', 'in', 'English.', 'This', 'is', 'another', 'one', '.']
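
For intuition about what a tokenizer does, a rough stand-in can be written with a regular expression. This is a simplification and not how `word_tokenize` is implemented; the pattern and the variable names below are assumptions chosen for illustration.

```python
import re

sample = "This is a sentence in English. This is another one."

# Match either a run of word characters or a single punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", sample)
print(tokens)
```

This pattern always splits punctuation into its own token, so it will not agree with NLTK on every input (contractions and abbreviations in particular).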


## Sentence Tokenizing

Sometimes you want to treat each sentence as a separate document.

In [19]:
from nltk.tokenize import sent_tokenize

# Split the document into individual sentences
sentences = sent_tokenize(text)
print(sentences)

# Split individual sentences into words
corpus = []
for sentence in sentences:
    corpus.append(word_tokenize(sentence))
print(corpus)

['This is a sentence in English.', 'This is another one.']
[['This', 'is', 'a', 'sentence', 'in', 'English', '.'], ['This', 'is', 'another', 'one', '.']]


## Stop Words

nltk has functionality for removing common words that show up in almost every document. 

### Question:

Can you think of applications where you might not want to remove these kinds of words from your text?

In [10]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no

In [21]:
# Remove stop words from the tokenized document.
# The comparison is case-sensitive, so capitalized words like 'This' are kept.
document = [w for w in document if w not in stop_words]
print(document)

['This', 'sentence', 'English.', 'This', 'another', 'one', '.']
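
The filtered output above still contains 'This' because the NLTK stop-word list is lowercase and the membership test is case-sensitive. One possible fix is to lowercase each token before the test. The sketch below uses a small hypothetical stop-word set so that it runs without the NLTK data; in practice you would use `stopwords.words('english')`.

```python
# Hypothetical, abbreviated stop-word set for illustration only.
small_stop_words = {"this", "is", "a", "in"}
tokens = ['This', 'is', 'a', 'sentence', 'in', 'English.', 'This', 'is', 'another', 'one', '.']

# Lowercase each token before the membership test so 'This' is removed too.
filtered = [w for w in tokens if w.lower() not in small_stop_words]
print(filtered)
```

Storing the stop words in a `set` also makes each membership test a constant-time lookup instead of a scan through a list.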


## Stemming and Lemmatization

Stemming removes morphological affixes from words. Lemmatization replaces words with their "lemma" (or base word: "good" is the lemma of "better").

* running -> run
* generously -> generous
* better -> good
* dogs -> dog
* mice -> mouse

Here are a few examples using the nltk libraries. Don't try to write your own functions to do this.

In [95]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

print(SnowballStemmer('english').stem('running'))
print(WordNetLemmatizer().lemmatize('mice'))

run
mouse


## N-grams

* N-grams (or Ngrams) are strings of N consecutive words in your corpus.
* These are extra "features" in your data that contain more information than individual words.

In [24]:
from nltk.util import ngrams

# An example of 3-grams in our "document" above.
# Recent nltk versions return a generator, so wrap the result in list() to display it.
list(ngrams(document, 3))

[('This', 'sentence', 'English.'),
 ('sentence', 'English.', 'This'),
 ('English.', 'This', 'another'),
 ('This', 'another', 'one'),
 ('another', 'one', '.')]
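
The sliding-window idea behind n-grams can be sketched in a few lines of plain Python. The helper name `make_ngrams` is hypothetical and is not part of nltk; the token list is copied from the output above.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip over n staggered views of the list; zip stops at the shortest view.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['This', 'sentence', 'English.', 'This', 'another', 'one', '.']
bigrams = make_ngrams(tokens, 2)
print(bigrams[:3])
print(len(bigrams))
```

A list of `k` tokens yields `k - n + 1` n-grams, which is why the seven tokens above produce five 3-grams.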

## Bag-of-Words

How do you turn a document into a bag-of-words?


In [88]:
from sklearn.feature_extraction.text import CountVectorizer

# Divide our original text into sentences
# text = "This is a sentence in English. This is another one."
corpus = sent_tokenize(text)
print(corpus)

# Build a bag of words model
c = CountVectorizer()
print(c.fit(corpus).vocabulary_)
print(c.fit_transform(corpus).todense())

['This is a sentence in English.', 'This is another one.']
{u'english': 1, u'another': 0, u'sentence': 5, u'this': 6, u'is': 3, u'in': 2, u'one': 4}
[[0 1 1 1 0 1 1]
 [1 0 0 1 1 0 1]]


In [66]:
# Build a bag of words model from the newsgroups_train text corpus we downloaded before
c = CountVectorizer(stop_words=stopwords.words('english'))
bag_of_words = c.fit_transform(newsgroups_train['data'])
print(c.vocabulary_)



## TF-IDF

If a word appears in almost all documents in a corpus, it's not useful for classification. For example, the word "bayesian" might appear in most blog posts about machine learning, but "kernel" might only show up in specific ones.

To capture this idea, use something called _Term Frequency - Inverse Document Frequency_ (TF-IDF). Suppose

* $t$ is a token
* $d$ is a document
* $D$ is a corpus
* $N$ is the total number of documents in $D$

then

$$tfidf(t, d, D) = tf(t, d) \cdot idf(t, D)$$

where

$$tf(t, d) = freq(t, d)$$

is the number of times that the token $t$ appears in $d$ (often normalized by dividing by the total length of $d$), and

$$idf(t, D) = \log{\frac{N}{|\{d \in D : t \in d\}|}}$$

measures how rare $t$ is across the whole corpus $D$; it does not depend on the individual document $d$.
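
As a worked example, the definitions can be applied by hand to a tiny corpus. This is a minimal sketch: the three tokenized documents and the helper names `tf`, `idf`, and `tfidf` are assumptions made for illustration, and the term frequency is the raw (unnormalized) count.

```python
import math

# Hypothetical toy corpus D, already tokenized.
docs = [
    ["bayesian", "kernel", "kernel"],
    ["bayesian", "prior"],
    ["bayesian", "posterior"],
]
N = len(docs)

def tf(t, d):
    # Raw count of token t in document d.
    return d.count(t)

def idf(t, docs):
    # Log of (number of documents) / (number of documents containing t).
    df = sum(1 for d in docs if t in d)
    return math.log(float(N) / df)

def tfidf(t, d, docs):
    return tf(t, d) * idf(t, docs)

print(tfidf("bayesian", docs[0], docs))  # "bayesian" is in every document
print(tfidf("kernel", docs[0], docs))    # "kernel" is in one document only
```

"bayesian" appears in all three documents, so its IDF is $\log(3/3) = 0$ and its TF-IDF is zero everywhere. "kernel" appears twice in the first document and nowhere else, so it scores $2 \log 3$. Note that this unsmoothed `idf` divides by zero for a token that appears in no document.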

### Computing TF-IDF



In [68]:
import numpy as np

# Count, for each word, the number of documents it appears in
bag_of_words_matrix = bag_of_words.todense()
document_freq = np.sum(bag_of_words_matrix > 0, axis=0)
print(document_freq)

[[471 412   7 ...,   1   2   3]]


In [72]:
# N is the number of documents
N = bag_of_words_matrix.shape[0]

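# Smooth the IDF: add 1 to N and to each document frequency so the ratio stays
# finite even for a word that appears in no document, then add 1 to the result
# so that a word appearing in every document still gets a nonzero weight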
idf = np.log(float(N+1) / (1.0 + document_freq)) + 1
idf

matrix([[ 4.17690557,  4.31043697,  8.25444302, ...,  9.64073738,
          9.23527227,  8.9475902 ]])
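
Written out, the smoothed variant computed in the cell above is the following, where $df(t)$ is shorthand (introduced here only for readability) for the document frequency $|\{d \in D : t \in d\}|$:

$$idf_{smooth}(t, D) = \log{\frac{N + 1}{1 + df(t)}} + 1$$

For a token that appears in every document, $df(t) = N$, so the logarithm is $\log 1 = 0$ and the smoothed IDF equals 1 rather than 0.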

In [80]:
from sklearn.preprocessing import normalize
tfidf = np.multiply(bag_of_words_matrix, idf)
tfidf = normalize(tfidf, norm='l2')
tfidf

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.00849825,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       ..., 
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]])

In [78]:
# It's probably easier just to do this
from sklearn.feature_extraction.text import TfidfTransformer

z = TfidfTransformer(norm='l2')
z.fit_transform(bag_of_words_matrix).todense()

matrix([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.00849825,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        ..., 
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ]])

In [77]:
# or
from sklearn.feature_extraction.text import TfidfVectorizer

t = TfidfVectorizer(stop_words=stopwords.words('english'))
t.fit_transform(newsgroups_train['data']).todense()

matrix([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.00849825,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        ..., 
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ]])

## Cosine Similarity

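We need a way to compare our documents. Use the cosine similarity metric! It is the cosine of the angle between two document vectors: 1 means the vectors point in the same direction, and 0 means the documents share no terms.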

In [89]:
from sklearn.metrics.pairwise import cosine_similarity

# Use cosine similarity to measure the similarity of the 3 sentences ("documents") in a corpus
corpus.append("English is my favorite language.")
cosine_similarity(t.transform(corpus))

array([[ 1.        ,  0.        ,  0.28671097],
       [ 0.        ,  1.        ,  0.        ],
       [ 0.28671097,  0.        ,  1.        ]])
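
To see what the metric computes, here is a minimal sketch in plain Python. The helper name `cosine` and the three count vectors are hypothetical; the function assumes equal-length vectors and does not handle all-zero vectors, which would cause a division by zero.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical count vectors over a shared four-word vocabulary.
doc_a = [1, 0, 1, 1]
doc_b = [0, 1, 0, 1]
doc_c = [2, 0, 2, 2]

print(cosine(doc_a, doc_b))  # one shared word
print(cosine(doc_a, doc_c))  # same direction, different length
```

`doc_c` is `doc_a` with every count doubled, so the two have a similarity of (approximately, up to floating-point error) 1. This is why cosine similarity is useful for text: it compares the mix of words in two documents and ignores their lengths.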