In [2]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 144

# Natural Language Processing
<!-- requirement: small_data/apple_fruit.txt -->
<!-- requirement: small_data/apple_inc.txt -->
<!-- requirement: small_data/ford_car.txt -->
<!-- requirement: small_data/ford_crossing.txt -->
<!-- requirement: small_data/window_glass.txt -->
<!-- requirement: small_data/windows_ms.txt -->

Natural language processing (NLP) is concerned with developing algorithms that enable computers to process or understand human (or "natural") language. In this notebook we will focus on using natural language processing to classify texts into different genres. Natural language processing is also concerned with language generation and translation, but these topics are rather advanced and are beyond the scope of this notebook. 

**EXAMPLE:**
In a given source block of (prose) text, you may want to be able to tell apart Apple (the company) vs. apple (the fruit).  Ideally you would also be able to tell apart Ford vs ford, Windows vs windows, etc. via similar examples.

**The type of learner**: We're going to choose to look at this as a _supervised classification_ problem.  There are also unsupervised approaches, but you have to make choices sometimes.  This means we need some "marked up" data:

**The training dataset**: Having limited resources at our disposal, we're going to try to use Wikipedia's articles on the given topics as our chosen "corpus" of text.

**The test dataset**: A good idea would be to mark up a small corpus by hand, or to use sentences culled from Wikipedia with the disambiguation coming from looking at the target of outgoing links.  We're not going to be that scientific for lack of time.

## Text as a "bag of words"

You can imagine that documents belonging to the same class will share more words in common with each other than with documents belonging to different classes. This idea is key to how we will construct our classifier. As such, as a first step, we will keep track of the word counts or frequencies for each document. 

  - Split the text into words
  - Count how many times each word (in some fixed vocabulary) occurs
  - _(Optionally)_ normalize the counts against some baseline
  - _(Variant)_ Just do a binary "yes / no" for whether each word (in some vocabulary) is contained in the material
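
The steps above can be sketched in a few lines of plain Python. This is only an illustration: the `tokenize` helper, the four-word `vocabulary`, and the example sentence are made up for this sketch, and the regular expression is a crude stand-in for a real tokenizer such as spaCy's.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase, then keep runs of letters, digits and apostrophes.
    # A crude stand-in for a real tokenizer.
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(text, vocabulary, binary=False):
    # Count every token, then read off the counts for the fixed vocabulary.
    counts = Counter(tokenize(text))
    if binary:
        # Variant: only record whether each vocabulary word is present.
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary and sentence, chosen only for illustration.
vocabulary = ["apple", "pie", "iphone", "the"]
sentence = "The apple pie beat the other apple desserts."

count_vector = bag_of_words(sentence, vocabulary)
binary_vector = bag_of_words(sentence, vocabulary, binary=True)
print(count_vector)   # [2, 1, 0, 2]
print(binary_vector)  # [1, 1, 0, 1]
```

Words outside the vocabulary ("beat", "other", "desserts") are simply dropped, which is why the choice of vocabulary matters so much.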
  
Let us count the words we see in documents (in this case, sentences) pertaining to apples (the fruit) and Apple (the company). Our first step will be to pull in the training (and test) data. We will want to clean both datasets on the way in: our goal is to have each text as a list of strings, one string for each sentence. We'll be using spaCy for this. It should already be installed, and its data set downloaded, on your box.

In [3]:
from spacy.en import English
nlp = English()

When spaCy loads a document, it automatically tokenizes it into sentences and words. For example:

In [4]:
doc = nlp(u'Here is a text document. spaCy requires it to be a unicode string.')
# doc.sents is a generator producing sentences
for sentence in doc.sents:
    print sentence
# doc can be indexed to find the individual words
print "Word 3:", doc[3]

Here is a text document.
spaCy requires it to be a unicode string.
Word 3: text


Now let's grab our documents from the apple and Apple wikipedia pages.

In [5]:
import urllib2
from bs4 import BeautifulSoup
import re

# Spit out (slightly cleaned up) sentences from a Wikipedia article.
def wikipedia_to_sents(url):
    soup = BeautifulSoup(urllib2.urlopen(url), 'lxml').find(attrs={'id':'mw-content-text'})
    
    # The text is littered by references like [n].  Drop them.
    def drop_refs(s):
        return ''.join(re.split('\[\d+\]', s))
    
    paragraphs = [drop_refs(p.text) for p in soup.find_all('p')]
    return [s.text for paragraph in paragraphs for s in nlp(paragraph).sents if len(s) > 2]

fruit_sents = wikipedia_to_sents("http://en.wikipedia.org/wiki/Apple")
company_sents = wikipedia_to_sents("http://en.wikipedia.org/wiki/Apple_Inc.")

In [6]:
company_sents[-105:-100]

[u"The statement was released after the results from the company's probe into its suppliers' labor practices were published in early 2010.",
 u"Foxconn was not specifically named in the report, but Apple identified a series of serious labor violations of labor laws, including Apple's own rules, and some child labor existed in a number of factories.",
 u'Apple committed to the implementation of changes following the suicides.',
 u'Also in 2010, workers in China planned to sue iPhone contractors over poisoning by a cleaner used to clean LCD screens.',
 u'One worker claimed that he and his coworkers had not been informed of possible occupational illnesses.']

In [7]:
fruit_sents[-105:-100]

[u'In the wild, apples grow readily from seeds.',
 u'However, like most perennial fruits, apples are ordinarily propagated asexually by grafting.',
 u'This is because seedling apples are an example of "extreme heterozygotes", in that rather than inheriting DNA from their parents to create a new apple with those characteristics, they are instead significantly different from their parents.',
 u'Triploid cultivars have an additional reproductive barrier in that 3 sets of chromosomes cannot be divided evenly during meiosis, yielding unequal segregation of the chromosomes (aneuploids).',
 u'Even in the case when a triploid plant can produce a seed (apples are an example), it occurs infrequently, and seedlings rarely survive.']

### Count Vectorizer

Learning algorithms, however, prefer vectors of numbers, not text. When we convert a document into such a vector, the output will typically be a very large, but usually sparse, vector: the number of coordinates is the number of words in our dictionary, and the $i$-th coordinate entry is the number of occurrences of the $i$-th word.

There's a reasonable implementation of this in the `CountVectorizer` class in sklearn.feature_extraction.text. See http://scikit-learn.org/stable/modules/classes.html#text-feature-extraction-ref for more detail on the options. When we use sklearn's `CountVectorizer`, we generate a matrix where each column corresponds to a word, each row corresponds to a document, and the values are the word counts.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

bag_of_words_vectorizer = CountVectorizer()

counts = bag_of_words_vectorizer.fit_transform( fruit_sents + company_sents  )
print counts.shape

(802, 3997)


In [9]:
# Note that counts is a **sparse** matrix.
print counts.toarray()       #This is what it actually looks like.. there are non-zero entries, really!
print "\n", counts           # .. this is just describing the non-zero entries

[[0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]

  (0, 1563)	1
  (0, 2738)	1
  (0, 3548)	1
  (0, 1998)	1
  (0, 1522)	1
  (0, 2063)	1
  (0, 516)	1
  (0, 1432)	1
  (0, 3131)	1
  (0, 1843)	1
  (0, 1040)	1
  (0, 1983)	1
  (0, 1175)	1
  (0, 626)	1
  (0, 1327)	1
  (0, 339)	1
  (0, 818)	1
  (0, 2892)	1
  (0, 2228)	2
  (0, 3718)	2
  (0, 367)	2
  (0, 3623)	3
  (1, 1620)	1
  (1, 3395)	1
  (1, 1694)	1
  :	:
  (800, 634)	1
  (800, 2817)	1
  (800, 380)	1
  (800, 3532)	1
  (800, 873)	1
  (800, 3948)	1
  (800, 3684)	1
  (800, 2319)	1
  (800, 3621)	1
  (800, 2351)	1
  (800, 3670)	1
  (800, 1522)	1
  (800, 1843)	1
  (800, 339)	1
  (800, 367)	1
  (801, 3968)	1
  (801, 1551)	1
  (801, 3168)	1
  (801, 3967)	1
  (801, 86)	1
  (801, 3081)	1
  (801, 418)	1
  (801, 346)	1
  (801, 1843)	1
  (801, 3623)	1


### Hashing Vectorizer

When doing "bag of words" type techniques on a *large* corpus and without an existing vocabulary, there is a simple trick that is often useful.  The issue (and solution) is as follows: 

 - The output is a feature vector, so that whenever we encounter a word we must look up which coordinate slot it is in.  A naive way would be to keep a list of all the words encountered so far, and look up each word when it is encountered.  Whenever we encounter a new word, we see if we've already seen it before and if not -- assign it a new number.  This requires storing all the words that we have seen in memory, cannot be done in parallel (because we'd have to share the hash-table of seen words), etc.
 - A **hash function** takes as input something complicated (like a string) and spits out a number, with the desired property being that different inputs *usually* produce different outputs.  (This is how hash tables are implemented, as the name suggests.)
 - So -- rather than exactly looking up the coordinate of a given word, we can just use its hash value (modulo a big size that we choose).  This is fast and parallelizes easily.  (There are some downsides: different words can collide in the same slot, and you cannot tell, after the fact, what word each of your features actually corresponds to!)
 
Scikit-learn includes `sklearn.feature_extraction.text.HashingVectorizer` to do this.  It behaves as almost a drop-in replacement for `CountVectorizer`.
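
As a rough sketch of the idea (not of scikit-learn's actual implementation), the function below maps each token to a slot using a hash value modulo the number of features. The name `hashed_bag_of_words`, the choice of `zlib.crc32` as the hash, and the tiny `n_features=8` are all assumptions made for illustration; `crc32` is used only because it is in the standard library and gives the same value on every run.

```python
import zlib


def hashed_bag_of_words(tokens, n_features=16):
    # No vocabulary is stored: the slot of a word is computed from the word itself.
    vector = [0] * n_features
    for token in tokens:
        # Mask to an unsigned 32-bit value so the result is the same on Python 2 and 3.
        hash_value = zlib.crc32(token.encode("utf-8")) & 0xFFFFFFFF
        vector[hash_value % n_features] += 1
    return vector


tokens = "the apple pie beat the other apple desserts".split()
example = hashed_bag_of_words(tokens, n_features=8)
print(example)
```

With only 8 slots, unrelated words will often share a coordinate. In practice the number of features is chosen large enough that such collisions are rare.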

## Word importance: Term frequency–inverse document frequency (TF-IDF)

Just using counts (or frequencies), as above, is clearly not great.  Both apples the fruit and Apple the company are enjoyed around the world!  We would like to find words that are common in one document, but not common in all of them.  This is the goal of the __tf-idf weighting__.  A precise definition is:


  1. If $d$ denotes a document and $t$ denotes a term, then the _raw term frequency_ $\mathrm{tf}^{raw}(t,d)$ is
  $$ \mathrm{tf}^{raw}(t,d) = \text{the number of times the term $t$ occurs in the document $d$} $$
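  The vector of all term frequencies can optionally be _normalized_, either by dividing by the sum of all the occurrence counts ($L^1$) or by the Euclidean length of the vector of word occurrence counts ($L^2$).  Scikit-learn by default uses the second norm (`norm='l2'`), although it applies it to the final tf-idf vector of each document rather than to the raw counts.  The $L^2$-normalized term frequency is: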
  $$ \mathrm{tf}(t,d) = \mathrm{tf}^{L^2}(t,d) = \frac{\mathrm{tf}^{raw}(t,d)}{\sqrt{\sum_t \mathrm{tf}^{raw}(t,d)^2}} $$
  2. If $D$ is the set of all documents in the corpus, then the _inverse document frequency_ is
  $$ \mathrm{idf}^{naive}(t,D) = \log \frac{\# D}{\# \{d \in D : t \in d\}} \\
  = \log \frac{\text{count of all documents}}{\text{count of those documents containing the term $t$}} $$
  with a common variant being
  $$ \mathrm{idf}(t, D) = \log \frac{\# D}{1 + \# \{d \in D : t \in d\}} \\
   = \log \frac{\text{count of all documents}}{1 + \text{count of those documents containing the term $t$}} $$
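  (Without this tweak we would omit the $1+$ in the denominator and have to worry about dividing by zero if $t$ is not found in any documents. Scikit-learn's default, `smooth_idf=True`, is a closely related variant: it adds one to both the numerator and the denominator and then adds one to the result, $\log \frac{1 + \# D}{1 + \# \{d \in D : t \in d\}} + 1$, so a term that occurs in every document still gets a nonzero weight.)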
  3. Finally, the weight that we assign to the term $t$ appearing in document $d$ and depending on the corpus of all documents $D$ is
  $$ \mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \mathrm{idf}(t,D) $$
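
The three definitions above translate almost line for line into plain Python. The sketch below follows the formulas as written in this notebook (not scikit-learn's exact weighting); the function names and the three tiny tokenized documents are invented for illustration.

```python
import math
from collections import Counter


def tf_l2(term, doc_tokens):
    # Raw count of the term divided by the Euclidean length of the count vector.
    counts = Counter(doc_tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return counts[term] / norm if norm else 0.0


def idf_naive(term, docs):
    # log(#D / #{d : t in d}); raises ZeroDivisionError if no document contains the term.
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / float(containing))


def idf_plus_one(term, docs):
    # The variant with 1 + ... in the denominator; safe for unseen terms.
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / float(1 + containing))


def tfidf(term, doc_tokens, docs, idf=idf_plus_one):
    return tf_l2(term, doc_tokens) * idf(term, docs)


# Hypothetical toy corpus: each document is already a list of tokens.
docs = [
    ["apple", "pie", "recipe"],
    ["apple", "iphone", "apple", "store"],
    ["cider", "apple", "press"],
]

print(idf_naive("apple", docs))                      # 0.0, since "apple" is in every document
print(idf_plus_one("apple", docs))                   # negative: log(3/4)
print(tfidf("iphone", docs[1], docs, idf=idf_naive))
```

Note how "apple", which appears in every document, gets zero weight under the naive form and a negative weight under the $1+$ form, while "iphone" keeps a positive weight under both.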
  
### TF-IDF Vectorizer
The `CountVectorizer` and `HashingVectorizer` can be used with tf-idf by combining them with the `TfidfTransformer` (the `TfidfVectorizer` is the `CountVectorizer` together with the `TfidfTransformer`). For our application (where the training and test data are small), we may as well just use `TfidfVectorizer` -- but it is good to know that `HashingVectorizer` is there.

In [11]:
#TFIDF close to 0 for a word very common in every document

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.en import STOPWORDS
STOPWORDS

ng_tfidf=TfidfVectorizer(max_features=300, 
                         ngram_range=(1,2), 
                         stop_words=STOPWORDS.union({u'apple', u'apples'}))
ng_tfidf=ng_tfidf.fit( fruit_sents + company_sents )
print ng_tfidf.get_feature_names()[100:105]
print ng_tfidf.transform( fruit_sents + company_sents )

[u'fruits', u'gb', u'generally', u'generation', u'glass']
  (0, 273)	0.549576096914
  (0, 160)	0.616903034937
  (0, 140)	0.28918122199
  (0, 99)	0.243654334759
  (0, 43)	0.281533668607
  (0, 38)	0.308451517468
  (1, 296)	0.445304622192
  (1, 273)	0.429002680864
  (1, 160)	0.481558527212
  (1, 109)	0.490979140851
  (1, 99)	0.380396321466
  (2, 291)	0.632920240853
  (2, 273)	0.514998978583
  (2, 160)	0.578089976368
  (3, 299)	0.390177490275
  (3, 178)	0.4416207104
  (3, 109)	0.466036513083
  (3, 87)	0.457094483644
  (3, 27)	0.476032851518
  (4, 121)	1.0
  (5, 274)	0.547663532004
  (5, 142)	0.570796550966
  (5, 109)	0.611764622316
  (6, 273)	0.378163619965
  (6, 248)	0.424491323711
  :	:
  (797, 149)	0.473088877072
  (797, 112)	0.473088877072
  (797, 34)	0.404690231275
  (797, 29)	0.364463637141
  (797, 19)	0.353565156264
  (798, 259)	0.344533253602
  (798, 223)	0.39101887204
  (798, 133)	0.408086676997
  (798, 33)	0.467041477833
  (798, 31)	0.429461923816
  (798, 30)	0.399125364897
  (79

### Exercises:
  1. Imagine that $D$ consists of just two documents $D = \{ d_1, d_2 \}$ and that the word "cultivar" occurs in $d_1$ but not in $d_2$.  What is 
  $$ \mathrm{tfidf}(\mathrm{"cultivar"}, d_i, D)$$
for each $i=1,2$?  For simplicity, use $\mathrm{tf}^{raw}$ and the version of `idf` without the $1+$.  

  2. Same question as 1, but now use $\mathrm{tf}^{L^2}$ and the version of `idf` with the $1+$.  

  3. What happens to the tf-idf weighting of a word if it occurs in all (or all but one) documents?  Consider both forms of `idf`.
  
  4. In the example below, we consider each sentence as a separate document for the purpose of tf-idf.  What happens if you instead treat the input as just two documents, one for each starting article?
  
### Hints/Answers:
  1. For $i=2$ it is zero.  For $i=1$, it is the number of occurrences of "cultivar" in $d_1$ multiplied by $\log 2$.
  2. For $i=2$, it is zero.  For $i=1$, it is .. also zero.
  3. Answer: If the word occurs in all documents, and there are $N$ of them, then the $1+$ form gives `idf` $= \log \frac{N}{N+1} < 0$ while the other form gives `idf` $= \log \frac{N}{N} = 0$.  If the word occurs in all-but-one document, then the $1+$ form gives `idf` $= \log \frac{N}{N} = 0$ while the other form gives `idf` $= \log \frac{N}{N-1} \approx 1/N$.
  4. It works less well, because of what's discussed in 3.  tf-idf doesn't work so well with few documents and where relevant words  occur (even if with wildly different frequencies!) in both.

## Document similarity metrics

A common problem is looking up a document similar to a given snippet, or relatedly comparing two documents for similarity.  The above provides a simple method for this called __cosine similarity__:
  - To each of the two documents $d_1, d_2$ in a corpus of documents $D$, assign its tf or tf-idf vector $$ (v_i)_{j} = \mathrm{tfidf}( t_{j}, d_i, D ) $$
  where $i$ ranges over indices for documents, and $j$ ranges over indices for terms in the vocabulary.
  - To compare two documents, simply find the cosine of the angle between the vectors:
  $$ \frac{v_i \cdot v_{i'}}{|v_i| |v_{i'}|} $$
  
(There's also a variant using binary vectors and Jaccard distance.)
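
Both measures are short enough to write out by hand. The sketch below assumes each document has already been turned into a list of numbers of equal length (counts or tf-idf weights); the helper names and the three toy vectors are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    # Cosine of the angle between u and v; defined as 0.0 if either vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard_distance(u, v):
    # Treat each vector as the set of coordinates where it is nonzero.
    present_u = {i for i, a in enumerate(u) if a}
    present_v = {i for i, b in enumerate(v) if b}
    union = present_u | present_v
    if not union:
        return 0.0
    return 1.0 - len(present_u & present_v) / float(len(union))


# Hypothetical count vectors over a three-word vocabulary.
doc_a = [1, 2, 0]
doc_b = [2, 4, 0]   # same direction as doc_a, just a longer document
doc_c = [0, 0, 3]   # shares no words with doc_a

print(cosine_similarity(doc_a, doc_b))  # very close to 1
print(cosine_similarity(doc_a, doc_c))  # 0.0
print(jaccard_distance(doc_a, doc_c))   # 1.0
```

Cosine similarity ignores the length of the vectors, so a document and a copy of it repeated twice count as identical. That is usually what we want when comparing texts of different lengths.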

## Engineering your features

There are some considerations to make when deciding which "words" or features to include in your classifier. In our example application, we might try to use:
   - Capitalization of the word apple? (_a_pple vs _A_pple)    
   - Pluralization of the word apple? (apples)
   - Possessive form of the word apple? (Apple's)
   - Presence (or frequency) of certain well-chosen words : Does (e.g.,) the word "computer" or "fruit" occur in the sentence?  (This feature regards the sentence as a simple __bag of words__ without regard to trying to parse its structure.)
   - In addition to single words, we can also look for __n-grams__: Strings of n consecutive words.
   - There are common techniques for determining which words / n-grams to look for.  One of them is called __tf-idf__.

### Stop words
It's common to want to __omit__ certain common words when doing these counts -- "a", "an", and "the" are common enough so that their counts do not tend to give us any hints as to the meaning of documents.  Such words that we want to omit are called __stop words__ (they don't stop anything, though). NOTE: If you are using tf-idf rather than raw word frequencies to compute your document similarity, excluding stop words matters less, because the idf factor already down-weights words that occur in most documents.

spaCy contains a standard list of such stop words for English in `spacy.en.STOPWORDS`.  In our application, we'd also want to include "apple" -- it is certainly not going to help us distinguish our two meanings!
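
Filtering stop words is just a set lookup. In the sketch below, `TOY_STOP_WORDS` is a tiny hand-picked set standing in for a real list such as spaCy's, and `remove_stop_words` is a hypothetical helper name.

```python
# A tiny hand-picked stop list, standing in for a real one.
TOY_STOP_WORDS = {"a", "an", "the", "is", "of", "on", "my"}


def remove_stop_words(tokens, stop_words):
    # Compare in lowercase so "The" and "the" are treated alike.
    return [token for token in tokens if token.lower() not in stop_words]


tokens = "The apple pie recipe is on my desk".split()

# Add the ambiguous keyword itself, since it appears under both meanings.
stop_words = TOY_STOP_WORDS.union({"apple"})
filtered = remove_stop_words(tokens, stop_words)
print(filtered)  # ['pie', 'recipe', 'desk']
```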

In [12]:
from spacy.en import STOPWORDS
STOPWORDS

{u'a',
 u'about',
 u'above',
 u'across',
 u'after',
 u'afterwards',
 u'again',
 u'against',
 u'all',
 u'almost',
 u'alone',
 u'along',
 u'already',
 u'also',
 u'although',
 u'always',
 u'am',
 u'among',
 u'amongst',
 u'amoungst',
 u'amount',
 u'an',
 u'and',
 u'another',
 u'any',
 u'anyhow',
 u'anyone',
 u'anything',
 u'anyway',
 u'anywhere',
 u'are',
 u'around',
 u'as',
 u'at',
 u'back',
 u'be',
 u'became',
 u'because',
 u'become',
 u'becomes',
 u'becoming',
 u'been',
 u'before',
 u'beforehand',
 u'behind',
 u'being',
 u'below',
 u'beside',
 u'besides',
 u'between',
 u'beyond',
 u'bill',
 u'both',
 u'bottom',
 u'but',
 u'by',
 u'call',
 u'can',
 u'cannot',
 u'cant',
 u'co',
 u'computer',
 u'con',
 u'could',
 u'couldnt',
 u'cry',
 u'de',
 u'describe',
 u'detail',
 u'did',
 u'didn',
 u'do',
 u'does',
 u'doesn',
 u'doing',
 u'don',
 u'done',
 u'down',
 u'due',
 u'during',
 u'each',
 u'eg',
 u'eight',
 u'either',
 u'eleven',
 u'else',
 u'elsewhere',
 u'empty',
 u'enough',
 u'etc',
 u'even

In [13]:
# o_O
print 'fify' in STOPWORDS
print 'six' in STOPWORDS, 'seven' in STOPWORDS, 'eight' in STOPWORDS

True
True False True


In [14]:
# The vocabulary *can* be built for you.  
#
# For instance, here we'll compute and then use the top 300 words by frequency -- *ignoring*
# the so-called "stopwords": these are words like "a", "and", "the" that are very common
# "apple" is not useful for distinguishing the two, but is common, so add it as a stopword.
#
# Nevertheless, this method is probably NOT GOOD.  See tf-idf instead.
counter=CountVectorizer(max_features=300,
                        stop_words=STOPWORDS.union({u'apple'}))
counter=counter.fit( fruit_sents + company_sents )
print counter.get_feature_names()

# Now we can use it with that vectorizer, like so...
counter.transform(company_sents)
counter.transform(fruit_sents)

[u'000', u'10', u'100', u'12', u'16', u'1984', u'1997', u'20', u'2001', u'2006', u'2007', u'2008', u'2009', u'2010', u'2011', u'2012', u'2013', u'2014', u'2015', u'2016', u'2017', u'30', u'500', u'800', u'according', u'added', u'america', u'american', u'announced', u'annual', u'app', u'apples', u'application', u'applications', u'apps', u'april', u'august', u'away', u'began', u'best', u'billion', u'board', u'brand', u'business', u'california', u'called', u'camera', u'campus', u'center', u'century', u'ceo', u'characteristics', u'china', u'climate', u'commercial', u'companies', u'company', u'computers', u'conditions', u'consumer', u'consumers', u'content', u'continues', u'cook', u'corporate', u'corporation', u'country', u'cultivar', u'cultivars', u'cultivated', u'current', u'cut', u'data', u'davidson', u'day', u'days', u'december', u'design', u'designed', u'desktop', u'developed', u'development', u'device', u'devices', u'different', u'digital', u'directly', u'disease', u'display', u'early

<206x300 sparse matrix of type '<type 'numpy.int64'>'
	with 592 stored elements in Compressed Sparse Row format>

### Stemming

In our original hand-built vocabulary, we had to include both "apple" and "apples".  It would have been useful to identify them as one word.

This is not limited to just trailing "s" characters: e.g., the words "carry", "carries", "carrying", and "carried" all carry -- roughly -- the same meaning.  The process of replacing them by a common root, or **stem**, is called stemming -- the stem will not, in general, be a full word itself.
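
To make the idea concrete, here is a deliberately crude stemmer. It is a toy written for this notebook, not the Porter stemmer or any library's algorithm: the suffix rules and the name `toy_stem` are assumptions chosen so that the "carry" family collapses to one stem.

```python
def toy_stem(word):
    word = word.lower()
    # Step 1: strip one common inflectional ending, if the remainder is long enough.
    rules = (("ies", "i"), ("ied", "i"), ("ing", ""), ("ed", ""), ("s", ""))
    for suffix, replacement in rules:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[:len(word) - len(suffix)] + replacement
            break
    # Step 2: map a trailing "y" to "i" so that "carry" matches "carries".
    if word.endswith("y") and len(word) > 2:
        word = word[:-1] + "i"
    return word


stems = [toy_stem(w) for w in ["carry", "carries", "carrying", "carried"]]
print(stems)                                   # ['carri', 'carri', 'carri', 'carri']
print(toy_stem("apples"), toy_stem("apple"))   # apple apple
```

The common stem "carri" is not an English word, which is exactly the difference from the lemma "carry". Rules this simple will also mangle plenty of words, so a real project would use an established stemmer or lemmatizer.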

There's a related process called **lemmatization**: The analog of the "stem" here _is_ an actual word.  After spaCy processes some text, the `lemma_` property of each word contains the lemmatized version.

In [16]:
print [w.lemma_ for w in nlp(u'carry carries carrying carried')]
print [w.lemma_ for w in nlp(u'eat eating eaten ate')]
print ' '.join(w.lemma_ for w in nlp(u"The quick brown fox jumped over the lazy dog.  "
                                     u"I can't believe it's not butter.  "
                                     u"I tried to ford the river and my unfortunate oxen died."))

[u'carry', u'carry', u'carry', u'carry']
[u'eat', u'eat', u'eat', u'eat']
the quick brown fox jump over the lazy dog .   i can not believe it ' not butter .   i try to ford the river and my unfortunate ox die .


We can tell our bag-of-words counters (or tf-idf) to run on lemmatized text.  This way it won't have to include both e.g., 'apple' and 'apples':

In [17]:
default_tokenizer = TfidfVectorizer().build_tokenizer()
    
def tokenize_lemma(text):
    return [w.lemma_ for w in nlp(text)]

stop_words_lemma = set(w.lemma_ for w in nlp(' '.join(STOPWORDS)))

ng_stem_tfidf = TfidfVectorizer(max_features=300, 
                                ngram_range=(1,2), 
                                stop_words=stop_words_lemma.union({"apple"}),
                                tokenizer=tokenize_lemma)
ng_stem_tfidf = ng_stem_tfidf.fit(fruit_sents + company_sents)

ng_stem_vocab = ng_stem_tfidf.get_feature_names()
print ng_stem_vocab

[u'  ', u'   billion', u'   gb', u'   million', u'"', u'" ,', u'" .', u'$', u'%', u"'", u"' ,", u"'s", u'(', u')', u') ,', u') .', u',', u', "', u", '", u", 's", u', -', u', .', u', 2011', u', 2012', u', 2016', u', announce', u', feature', u', include', u', introduce', u', job', u', release', u'-', u'- -', u'.', u'. "', u'10', u'100', u'100 %', u'1997', u'2', u'2001', u'2006', u'2007', u'2007 ,', u'2008', u'2008 ,', u'2009', u'2010', u'2010 ,', u'2011', u'2011 ,', u'2012', u'2012 ,', u'2013', u'2013 ,', u'2014', u'2014 ,', u'2015', u'2015 ,', u'2016', u'2016 ,', u'2017', u'3', u'30', u'3g', u'4', u'5', u'6', u'7', u'8', u'9', u':', u';', u']', u'accord', u'add', u'allow', u'american', u'announce', u'app', u'app store', u'application', u'april', u'august', u'begin', u'best', u'billion', u'board', u'brand', u'bring', u'build', u'camera', u'campus', u'cause', u'center', u'century', u'ceo', u'change', u'china', u'climate', u'commercial', u'company', u"company 's", u'company .', u'consumer'

### Tokenization

Tokenization refers to splitting the text into pieces, in this case into sentences and into words. Instead of looking at just single words, it is also useful to look at **n-grams**: These are n-word long sequences of words (e.g., each of "farmer's market", "market share", and "farm share" is a 2-gram).

The exact same sort of counting techniques apply.  The `CountVectorizer` class has built in support for this, too:

If you pass it the `ngram_range=(m, M)` then it will count $n$-grams with  $m \leq n \leq M$.
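
What the `(m, M)` range means can be seen from a hand-rolled version. The `ngrams` helper below is a hypothetical illustration of the idea, not scikit-learn's code; it takes an already tokenized sentence.

```python
def ngrams(tokens, m, M):
    # All n-grams with m <= n <= M, shortest first, each joined with a single space.
    result = []
    for n in range(m, M + 1):
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result


tokens = ["farmer's", "market", "share"]
print(ngrams(tokens, 2, 2))  # ["farmer's market", 'market share']
print(ngrams(tokens, 1, 2))  # the three single words followed by the two 2-grams
```

A vocabulary of $V$ distinct words allows up to $V^2$ distinct 2-grams, so the feature space grows quickly as the range widens. This is one reason to cap it with `max_features`.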

In [18]:
ng_counter=CountVectorizer(max_features=300, 
                           ngram_range=(2,2), 
                           stop_words=STOPWORDS.union({u'apple', u'Apple'}))
ng_counter=ng_counter.fit( fruit_sents + company_sents  )
print ng_counter.get_feature_names()

# Now we can use it with that vectorizer, like so...
ng_counter.transform(company_sents)
ng_counter.transform(fruit_sents)

[u'000 square', u'000 time', u'100 renewable', u'19th century', u'2007 jobs', u'2011 jobs', u'2016 update', u'21 2016', u'27 2010', u'30 2012', u'33182 122', u'37 33182', u'3g service', u'40 gb', u'48 total', u'4s iphone', u'500 known', u'700 billion', u'800 000', u'84 million', u'a4 processor', u'according report', u'added support', u'adverse reactions', u'aim alliance', u'alexander great', u'app store', u'april 24', u'august 24', u'backlit lcd', u'bear fruit', u'begin producing', u'best global', u'billion annual', u'billion cash', u'billion dollar', u'billion downloads', u'birch pollen', u'birch syndrome', u'board directors', u'brand loyalty', u'brands report', u'bud sports', u'carbon dioxide', u'carbon footprint', u'cash reserves', u'central asia', u'ceo began', u'ceo michael', u'ceo tim', u'citation needed', u'climate counts', u'companies time', u'company history', u'company product', u'company revenue', u'computers saw', u'computers use', u'consumer electronics', u'consumer market

<206x300 sparse matrix of type '<type 'numpy.int64'>'
	with 135 stored elements in Compressed Sparse Row format>

In [None]:
# Problem with this: the number of possible 2-grams grows like the square of the
# vocabulary size, which is the biggest problem.

In [19]:
# we have 300 "features"
print len(ng_counter.get_feature_names())

300


### Part of speech tagging
Consider the "Ford" vs "ford" example.  As a human being, the easiest way to tell these apart is that Ford is a __noun__ while ford is a __verb__.

Fortunately, spaCy also has a part-of-speech tagger: You give it a sentence, and it tries to tag the parts of speech (e.g., noun, verb, adjective, etc.).  The broad category is given in the `.pos_` property, while a more detailed description, using the [UPenn Treebank Tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html), is in the `.tag_` property.

(N.B. Nothing's perfect -- the tagger will make mistakes.)

In [None]:
s1 = u"I tried to ford the river, and my unfortunate oxen died."
s2 = u"Henry Ford built factories to facilitate the construction of the Ford automobile."

In [None]:
[(w.text, w.pos_, w.tag_) for w in nlp(s1)]

In [None]:
[(w.text, w.pos_, w.tag_) for w in nlp(s2)]

### Capitalization, punctuation, etc.
There are the obvious features that we had in mind....
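
As a rough sketch, these surface features can be counted with a regular expression alone. The `surface_features` helper below is hypothetical and much cruder than a tokenizer-based version: it assumes a plain ASCII apostrophe, a plural formed by adding "s", and it cannot tell nouns from verbs.

```python
import re


def surface_features(sentence, keyword):
    # Match the keyword as a whole word, optionally followed by plural "s" and/or "'s".
    pattern = re.compile(r"\b(" + re.escape(keyword) + r")(s)?('s)?\b", re.IGNORECASE)
    capitalized = plural = possessive = 0
    for match in pattern.finditer(sentence):
        capitalized += match.group(1)[0].isupper()
        plural += match.group(2) is not None
        possessive += match.group(3) is not None
    # [number capitalized, number plural, number possessive]
    return [capitalized, plural, possessive]


print(surface_features("How is Apple's stock doing?", "apple"))  # [1, 0, 1]
print(surface_features("I have three apples.", "apple"))         # [0, 1, 0]
```

Capitalization alone is unreliable at the start of a sentence ("Apples are tasty."), which is one reason to combine these counts with other features rather than trust any single one.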

## Building the Classifier
Disclaimer: This version is actually pretty bad -- it uses many of the right ideas, but puts them together pretty poorly (and with fairly little available data). Let's first start by creating a function to retrieve text from a url. 

In [20]:
def wikipedia_to_sents(url):
    """
    Retrieves a URL from wikipedia, and returns a list of sentences (of at least 3 words) in the body text.
    """
    files_by_url = {
      "http://en.wikipedia.org/wiki/Ford_(crossing)": "ford_crossing.txt",
      "http://en.wikipedia.org/wiki/Ford": "ford_car.txt",
      "http://en.wikipedia.org/wiki/Apple": "apple_fruit.txt",
      "http://en.wikipedia.org/wiki/Apple_Inc.": "apple_inc.txt",
      "http://en.wikipedia.org/wiki/Window": "window_glass.txt",
      "http://en.wikipedia.org/wiki/Microsoft_Windows": "windows_ms.txt"
    }
    
    try:
        with open("small_data/{}".format(files_by_url[url])) as wiki_file:
            soup = BeautifulSoup(wiki_file.read(), 'lxml').find(attrs={'id':'mw-content-text'})
    except KeyError:
        soup = BeautifulSoup(urllib2.urlopen(url), 'lxml').find(attrs={'id':'mw-content-text'})
    
    # The text is littered by references like [n].  Drop them.
    def drop_refs(s):
        return ''.join( re.split('\[\d+\]', s) )
    
    paragraphs = [drop_refs(p.text) for p in soup.find_all('p')]
    return [s.text for paragraph in paragraphs for s in nlp(paragraph).sents if len(s) > 2]

We'll then perform feature engineering in a transformer class. 

In [21]:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class AdHocFeatures(BaseEstimator, TransformerMixin):
    """
    Given a keyword (e.g., "apple"), will transform documents into an
    encoding of several ad hoc features of each occurrence of the keyword:
        - If the keyword is capitalized
        - If it is plural
        - If it is possessive (in the stupid sense of being followed by 's)
        - If the keyword is a verb (e.g., for Ford vs ford)
    """

    def __init__(self, keyword):
        self.keyword = nlp(keyword)[0].lemma_
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return np.asarray([self.transform_doc(x) for x in X])
    
    def feature_posessive(self, doc):
        ## N.B. spaCy will tokenize "Apple's" as ["Apple", "'s"]
        hits = [i for i, word in enumerate(doc) if word.lemma_ == self.keyword]
        return sum((i + 1) < len(doc) and doc[i+1].text == "'s" for i in hits)
    
    def transform_doc(self, row):
        doc = nlp(row)
        words = [word for word in doc if word.lemma_ == self.keyword]
        return [sum(word.is_title for word in words),
                sum(word.tag_ in (u'NNS', u'NNPS') for word in words),
                self.feature_posessive(doc),
                sum(word.pos_ == u'VERB' for word in words)]

Next, we'll make our classifier. To do this, we will use a multinomial **Naive Bayes** model. Detailed information on Naive Bayes can be found in the Naive Bayes notebook, but we'll briefly describe it here. The goal is, given a set of observed features $X_1, \ldots, X_p$, to find the label $Y$ with the maximum conditional probability. In other words, we know what our distributions of words ($X$'s) should look like for a given genre ($Y$) from our training data, and we would like to use this information to find the genre of a body of text ($Y$) given its words ($X$'s) for new data. Bayes' theorem gives us a way to compute the latter conditional probability from the former. 
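
To show the mechanics, here is a minimal multinomial Naive Bayes written from scratch. It is a sketch under simplifying assumptions (add-one smoothing, words unseen in training are ignored), not scikit-learn's `MultinomialNB`; the class name `TinyMultinomialNB` and the four toy training documents are invented for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB(object):
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive smoothing, so unseen (word, label) pairs get nonzero probability

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = set(w for counts in self.word_counts.values() for w in counts)
        return self

    def log_score(self, tokens, label):
        # log P(Y) + sum of log P(X_i | Y): the log posterior up to a constant shared by all labels.
        total_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / float(total_docs))
        counts = self.word_counts[label]
        denominator = sum(counts.values()) + self.alpha * len(self.vocabulary)
        for token in tokens:
            if token in self.vocabulary:
                score += math.log((counts[token] + self.alpha) / denominator)
        return score

    def predict(self, tokens):
        return max(self.class_counts, key=lambda label: self.log_score(tokens, label))


# Hypothetical, hand-made training data.
train_docs = [["eat", "fruit", "pie"], ["fruit", "tree", "orchard"],
              ["iphone", "stock", "ceo"], ["stock", "computer", "iphone"]]
train_labels = ["fruit", "fruit", "company", "company"]

model = TinyMultinomialNB().fit(train_docs, train_labels)
print(model.predict(["fruit", "pie"]))     # fruit
print(model.predict(["iphone", "stock"]))  # company
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow when documents are long.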

In [22]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline, FeatureUnion

def make_classifier(base_word, meaning1, meaning2):
    """
    Given
        - a base word (e.g., "apple", "ford") that can have ambiguous meaning
        - a pair meaning1 = (name1, url1) of a label for the first meaning, and a Wikipedia URL for it
        - a pair meaning2 = ... for the other meaning
    Returns a classifier that predicts the meaning
    """
    name1, url1 = meaning1
    name2, url2 = meaning2
    sents1 = wikipedia_to_sents(url1)
    sents2 = wikipedia_to_sents(url2)
    
    stop_words_lemma = set(w.lemma_ for w in nlp(' '.join(STOPWORDS)))
    def tokenize_lemma(text):
        return [w.lemma_ for w in nlp(text)]

    features = FeatureUnion([('stem_vectorizer',
                              TfidfVectorizer(ngram_range=(1,2),
                                              stop_words=stop_words_lemma.union({base_word}),
                                              tokenizer=tokenize_lemma)),
                             ('ad_hoc', AdHocFeatures(base_word))])
    pipe = Pipeline([('features', features),
                     ('classifier', MultinomialNB())])

    # Build the training data
    train_res  = [name1] * len(sents1) + [name2] * len(sents2)
    
    return pipe.fit(sents1 + sents2, train_res)

In [23]:
base_word = u"apple"
options = [ ("fruit", "http://en.wikipedia.org/wiki/Apple"),
            ("company", "http://en.wikipedia.org/wiki/Apple_Inc.") ]
print make_classifier(base_word, *options).predict([
    u"I'm baking a pie with my granny smith apples.",
    u"I looked up the recipe on my Apple iPhone.",
    u"The apple pie recipe is on my desk.",
    u"How is Apple's stock doing?",
    u"I'm drinking apple juice.",
    u"I have three apples.",
    u"Steve Jobs is the CEO of apple.",
    u"Steve Jobs likes to eat apples."
])

['fruit' 'company' 'company' 'company' 'company' 'fruit' 'company' 'fruit']


We can also do this for other classes of text documents. 

In [24]:
base_word = u"windows"
options = [ ("building", "http://en.wikipedia.org/wiki/Window"),
            ("software", "http://en.wikipedia.org/wiki/Microsoft_Windows") ]
print make_classifier(base_word, *options).predict([
    u"Bill Gates was involved with Windows.",
    u"Could you open the window?",
    u"The 'broken window' theory related broken windows to increases in crime rate.",
    u"The windows are all made of shatter-proof glass.",
    u"Could you install windows on your computer?",
    u"Could you install windows on your house?"
])

['software' 'building' 'building' 'building' 'building' 'building']


In [25]:
base_word = u"ford"
options = [ ("crossing", "http://en.wikipedia.org/wiki/Ford_(crossing)"),
            ("company", "http://en.wikipedia.org/wiki/Ford") ]
print make_classifier(base_word, *options).predict([
    u"I tried to ford the river and my unfortunate oxen died.",
    u"Ford makes cars, though their quality is sometimes in dispute.",
    u"The Ford Mustang is an iconic automobile.",
    u"The river crossing was shallow, but we could not ford it."
])

['company' 'company' 'company' 'crossing']


### Exercises / Brainstorming for Improvement:
  1. Change the code to use just the ad hoc features. How does this change the results? Why do you think this is?
  2. Same question as 1, but for the tf-idf features.
  3. Change the formation of tf-idf features as follows: when doing the tf-idf weighting (in the call to `fit` in `make_classifier`) we pass in the sentences as separate documents. How do the results change if we pass them in as just two documents?
  4. What ideas do you think could improve the performance of this model?

### Exit Tickets
1. What are some other options for modeling with text data besides bag of words?
1. How would you account for the fact that word meanings change over time?
1. How do stopwords, stemming, and limiting the number of features affect the bias-variance trade-off?

In [None]:
# Answers: 2. Lemmatize, and map older meanings to the older versions of the words.
# 3. They reduce overfitting and noise. Removing stopwords reduces bias, so the model does a better job of fitting the signal.

### Spoilers

Some ideas / hints for the exercises:
  - A key problem with the model is the small amount of training data.  At the least, we could follow links from the given Wikipedia articles.  Better would be to find other sources that directly use the words Apple/apple.
  - In this specific case (apple/Apple) we would do better by using a few human created absolute rules _first_: e.g., typos aside -- apple's will always refer to the company and apples to the fruit, so we do not need to run a more complicated learner. 
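
A rules-first approach along these lines might look like the sketch below. The two rules are the ones suggested above, the function names are hypothetical, and `fallback` stands for whatever learned classifier handles the sentences the rules do not cover.

```python
import re


def rule_based_guess(sentence):
    # Rule 1: the possessive form is taken to mean the company.
    if re.search(r"\bapple's\b", sentence, re.IGNORECASE):
        return "company"
    # Rule 2: the plural form is taken to mean the fruit.
    if re.search(r"\bapples\b", sentence, re.IGNORECASE):
        return "fruit"
    # No rule applies: let the caller decide.
    return None


def classify_with_rules(sentence, fallback):
    guess = rule_based_guess(sentence)
    return guess if guess is not None else fallback(sentence)


print(rule_based_guess("How is Apple's stock doing?"))  # company
print(rule_based_guess("I have three apples."))         # fruit
print(rule_based_guess("I'm drinking apple juice."))    # None
```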

## Additional NLP topics and resources

Natural language processing is a big field.  We only (really) talked about a few tools and techniques.  Here are some other terms that are relevant:

 - Context free grammars (and probabilistic context free grammars): This is a simple and basic technique for parsing.  
 - [Word2vec](https://code.google.com/p/word2vec/) is a popular tool for creating a vectorized representation of a text corpus. The learned vectors can then be used to identify/predict words similar to a target, or even (weakly) to reason by analogy. For example, vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'), and vector('king') - vector('man') + vector('woman') is close to vector('queen').
 
 To use Word2vec in Python (and get the computation speed improvements) look at [gensim](https://radimrehurek.com/gensim/models/word2vec.html) and [cython](https://github.com/cython/cython/wiki/Installing). This [Git repo](https://github.com/danielfrg/word2vec) is an alternative way to access the algorithm. You might also use this [Kaggle competition](https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-2-word-vectors) as a reference.  spaCy gives words a `.vector` property based on [a particular](https://spacy.io/docs#token-distributional) word2vec model.

*Copyright &copy; 2017 The Data Incubator.  All rights reserved.*
