# Introduction to Natural Language Processing

In this notebook, we'll go over some fundamental feature extraction concepts (very shallow NLP) using NLTK and scikit-learn. This lays the foundation for applying machine learning to text data.

### The term-document matrix representation of a corpus.

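This representation, also known as a "bag of words", is extremely common for shallow NLP tasks such as document-level classification. Each row stands for a document, each column for a term, and each cell holds the number of times that term occurs in that document.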
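
To make the idea concrete, here is a minimal sketch that builds the same kind of matrix by hand with the standard library only. The tokenization rule (lowercase, then keep runs of two or more word characters) and the helper name `simple_tokenize` are assumptions chosen for illustration, not a description of any library's internals.

```python
import re
from collections import Counter

docs = ['The patient was seen for bird flu.',
        'The patient was seen for chickenpox.',
        'The patient was seen for dengue.']

def simple_tokenize(doc):
    # Assumed, simplified rule: lowercase, keep runs of 2+ word characters.
    return re.findall(r'\b\w\w+\b', doc.lower())

# One Counter of token frequencies per document.
counts = [Counter(simple_tokenize(doc)) for doc in docs]

# The vocabulary is the sorted set of all tokens; each term becomes a column.
vocab = sorted(set(token for c in counts for token in c))

# One row per document, one column per vocabulary term (missing terms count as 0).
matrix = [[c[term] for term in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```

Word order is discarded entirely: only the counts survive, which is where the name "bag of words" comes from.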

In [55]:
# example unicode text
docs = [u'The patient was seen for bird flu.', 
        u'The patient was seen for chickenpox.', 
        u'The patient was seen for dengue.']

In [56]:
from sklearn.feature_extraction.text import CountVectorizer
tf_vectorizer = CountVectorizer(max_df=1., min_df=0,
                                max_features=100,
                                ngram_range=(1, 1),
                                stop_words=None)

tf = tf_vectorizer.fit_transform(docs)
tf_vectorizer.get_feature_names()

[u'bird',
 u'chickenpox',
 u'dengue',
 u'flu',
 u'for',
 u'patient',
 u'seen',
 u'the',
 u'was']

*tf* is a sparse matrix.

To see its contents, we'll use pandas to convert it to a table...

In [57]:
import pandas as pd
pd.DataFrame(tf.todense(),columns=tf_vectorizer.get_feature_names())

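|   | bird | chickenpox | dengue | flu | for | patient | seen | the | was |
|---|------|------------|--------|-----|-----|---------|------|-----|-----|
| 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |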
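
As a rough illustration of why a sparse format is useful, the sketch below stores only the non-zero cells of the table above. The dictionary keyed by `(row, column)` is an assumption made for clarity; it is not how SciPy lays out its sparse matrices internally.

```python
# Dense rows copied from the table above (columns: bird ... was).
dense = [[1, 0, 0, 1, 1, 1, 1, 1, 1],
         [0, 1, 0, 0, 1, 1, 1, 1, 1],
         [0, 0, 1, 0, 1, 1, 1, 1, 1]]

# Keep only the non-zero cells, keyed by (row, column).
sparse = {(i, j): value
          for i, row in enumerate(dense)
          for j, value in enumerate(row)
          if value != 0}

total_cells = len(dense) * len(dense[0])
print("stored %d of %d cells" % (len(sparse), total_cells))
```

With three near-identical sentences the saving is small, but on a real corpus most documents contain only a tiny fraction of the vocabulary, so almost every cell is zero.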


**Stop words.** Words such as 'the', 'a', 'an', and many prepositions are often uninformative, so they are commonly dropped from the analysis as follows...

In [58]:
# apply stop words
tf_vectorizer = CountVectorizer(max_df=1., min_df=0,
                                max_features=100,
                                ngram_range=(1, 1),
                                stop_words='english')

tf = tf_vectorizer.fit_transform(docs)
pd.DataFrame(tf.todense(),columns=tf_vectorizer.get_feature_names())

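|   | bird | chickenpox | dengue | flu | patient | seen |
|---|------|------------|--------|-----|---------|------|
| 0 | 1 | 0 | 0 | 1 | 1 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 | 1 |
| 2 | 0 | 0 | 1 | 0 | 1 | 1 |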
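
Stop-word removal itself is just a membership test against a fixed list. The sketch below uses a tiny hypothetical list, `STOP_WORDS`, made up for this example; the built-in English list selected by `stop_words='english'` is much longer.

```python
# A tiny, hypothetical stop list for illustration only.
STOP_WORDS = {'the', 'was', 'for', 'a', 'an'}

# Tokens of the first document, already lowercased.
tokens = ['the', 'patient', 'was', 'seen', 'for', 'bird', 'flu']

# Drop every token that appears in the stop list, preserving order.
kept = [t for t in tokens if t not in STOP_WORDS]
print(kept)  # ['patient', 'seen', 'bird', 'flu']
```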


**N-grams.** Single words or tokens are called unigrams, or 1-grams. Sometimes unigrams do not capture enough information. In that case, we can use sequences of adjacent tokens (n-grams) as features.

For example, we want the matrix to record that "bird flu" was present in the first document, not just that the first document mentioned "bird" and that it also mentioned "flu".

In [59]:
# Let's get the unigrams and bi-grams from the documents.
tf_vectorizer = CountVectorizer(max_df=1., min_df=0,
                                max_features=100,
                                ngram_range=(1, 2),
                                stop_words='english')

tf = tf_vectorizer.fit_transform(docs)
pd.DataFrame(tf.todense(),columns=tf_vectorizer.get_feature_names())

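|   | bird | bird flu | chickenpox | dengue | flu | patient | patient seen | seen | seen bird | seen chickenpox | seen dengue |
|---|------|----------|------------|--------|-----|---------|--------------|------|-----------|-----------------|-------------|
| 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |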
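
Note that the table contains "seen bird" even though the original text reads "seen for bird": the stop word "for" was removed before the n-grams were formed. The sketch below reproduces that behaviour by sliding a window over the already-filtered tokens. The function name `word_ngrams` is hypothetical and only meant to illustrate the idea.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all n-grams of adjacent tokens for min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the token list.
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

# Tokens of the first document after stop-word removal.
tokens = ['patient', 'seen', 'bird', 'flu']
unigrams_and_bigrams = word_ngrams(tokens, 1, 2)
print(unigrams_and_bigrams)
```

The result lists the four unigrams followed by 'patient seen', 'seen bird', and 'bird flu', which are exactly the columns with a 1 in the first row above.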


We may want to add more complex features, such as tokens paired with their part-of-speech (POS) tags.

In [60]:
import nltk
def tokenize_and_tag(doc):
    # Split the document into tokens, then pair each token with its POS tag.
    return nltk.pos_tag(nltk.word_tokenize(doc))

tf_vectorizer = CountVectorizer(max_df=1., min_df=1,
                                max_features=100,
                                ngram_range=(1, 1),
                                tokenizer=tokenize_and_tag)

tf = tf_vectorizer.fit_transform(docs)
pd.DataFrame(tf.todense(),columns=tf_vectorizer.get_feature_names())

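|   | (., .) | (bird, NN) | (chickenpox, NN) | (dengue, NN) | (flu, NN) | (for, IN) | (patient, NN) | (seen, VBN) | (the, DT) | (was, VBD) |
|---|--------|------------|------------------|--------------|-----------|-----------|---------------|-------------|-----------|------------|
| 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |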
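
The column names above are tuples, which can be awkward to work with downstream. One possible alternative, sketched here as an assumption rather than something the notebook does, is to join each token and tag into a single string such as `bird/NN`. The tagged pairs are copied from the output above so the example runs without NLTK.

```python
from collections import Counter

# (token, tag) pairs for the first document, copied from the output above.
tagged = [('the', 'DT'), ('patient', 'NN'), ('was', 'VBD'), ('seen', 'VBN'),
          ('for', 'IN'), ('bird', 'NN'), ('flu', 'NN'), ('.', '.')]

# Join each pair into one string (assumed 'token/TAG' convention) and count occurrences.
features = Counter('%s/%s' % (token, tag) for token, tag in tagged)
print(sorted(features))
```

A tokenizer that returns such strings keeps every feature name a plain string while still separating, say, a noun reading of a word from a verb reading.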


Now let's look at the 20 Newsgroups data and check how the number of features scales as longer n-grams are added.

In [61]:
# Get the 20 newsgroups Corpus (just the train set for now)
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='train',
                          categories=('rec.autos',
                             'rec.motorcycles',
                             'rec.sport.baseball',
                             'rec.sport.hockey'),
                          remove=('headers', 'footers', 'quotes'))

In [62]:
for n in range(1,5):
    tf_vectorizer = CountVectorizer(max_df=1., min_df=1,
                                    max_features=999999,
                                    ngram_range=(1, n),
                                    stop_words='english')
    tf = tf_vectorizer.fit_transform(news.data[:500])
    print("Using 1-grams up to %d-grams on the first 500 news groups documents yields %d features" % (n, tf.shape[1]))

Using 1-grams up to 1-grams on the first 500 news groups documents yields 9328 features
Using 1-grams up to 2-grams on the first 500 news groups documents yields 39452 features
Using 1-grams up to 3-grams on the first 500 news groups documents yields 71233 features
Using 1-grams up to 4-grams on the first 500 news groups documents yields 102880 features
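
The same growth can be seen on a small scale without any library. The sketch below counts the distinct n-grams in the three toy documents from earlier as the maximum n increases. The helper `distinct_ngrams` is hypothetical, and stop words are kept here for simplicity.

```python
def distinct_ngrams(token_lists, max_n):
    """Count distinct n-grams (1 <= n <= max_n) across all tokenized documents."""
    seen = set()
    for tokens in token_lists:
        for n in range(1, max_n + 1):
            for start in range(len(tokens) - n + 1):
                seen.add(tuple(tokens[start:start + n]))
    return len(seen)

# The three toy documents from earlier, lowercased and tokenized by hand.
token_lists = [['the', 'patient', 'was', 'seen', 'for', 'bird', 'flu'],
               ['the', 'patient', 'was', 'seen', 'for', 'chickenpox'],
               ['the', 'patient', 'was', 'seen', 'for', 'dengue']]

sizes = [distinct_ngrams(token_lists, n) for n in range(1, 5)]
print(sizes)  # [9, 17, 24, 30]
```

Every increase in the maximum n adds a new layer of columns, so the feature count keeps climbing while the number of documents stays fixed, which is the pattern in the 20 Newsgroups numbers above.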
