# Natural Language Processing
<!-- requirement: small_data/apple_fruit.txt -->
<!-- requirement: small_data/apple_inc.txt -->
<!-- requirement: small_data/ford_car.txt -->
<!-- requirement: small_data/ford_crossing.txt -->
<!-- requirement: small_data/window_glass.txt -->
<!-- requirement: small_data/windows_ms.txt -->


Natural language processing (NLP) is concerned with developing algorithms that enable computers to process or understand human (or "natural") language. In this notebook we will focus on using natural language processing to classify texts into different genres. 

**EXAMPLE:**
In a given source block of text, we may want to differentiate Apple (the company) vs. apple (the fruit) using supervised classification.

## Text as a "bag of words"


We can imagine that documents belonging to the same class will share more words in common with each other than with documents belonging to different classes. This idea is key to how we will construct our classifier. As such, as a first step, we will keep track of the word counts (or frequencies) for each document. 

  - Split the text into words
  - Count how many times each word (in some fixed vocabulary) occurs
  - _(Optionally)_ normalize the counts against some baseline
  - _(Variant)_ Just do a binary "yes / no" for whether each word (in some vocabulary) is contained in the material
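
The steps above can be sketched with nothing but the standard library. The tokenizer used here (lowercase, then keep runs of letters) and the tiny `vocabulary` are illustrative assumptions, not what `spaCy` or scikit-learn do internally.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return raw counts, normalized frequencies, and binary indicators."""
    # Split the text into lowercase words (a deliberately naive tokenizer).
    words = re.findall(r"[a-z]+", text.lower())
    # Count only the words that belong to the fixed vocabulary.
    counts = Counter(w for w in words if w in vocabulary)
    raw = [counts[term] for term in vocabulary]
    # Normalize against a baseline: here, the number of in-vocabulary words.
    total = sum(raw)
    freq = [c / total if total else 0.0 for c in raw]
    # Binary variant: is each vocabulary word present at all?
    binary = [int(c > 0) for c in raw]
    return raw, freq, binary

vocabulary = ["apple", "fruit", "company", "tree"]
raw, freq, binary = bag_of_words(
    "The apple tree bears fruit. Every apple is a fruit.", vocabulary)
print(raw)     # [2, 2, 0, 1]
print(binary)  # [1, 1, 0, 1]
```

Each document becomes a vector with one slot per vocabulary word; word order is discarded, which is what makes it a "bag".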
  
Let us count the words we see in documents (in this case, sentences) pertaining to apples (the fruit) and Apple (the company). Our first step will be to pull in the training (and test) data. We will want to clean both data sets on the way in: our goal is to have each text as a list of strings, one string for each sentence. We'll be using `spaCy` for this. It should already be installed, and its data set downloaded, on your box.

In [2]:
import spacy
nlp = spacy.load("en_core_web_sm")

When `spaCy` loads a document, it automatically tokenizes it into sentences and words. For example:

In [3]:
doc = nlp('Here is a text document. spaCy requires it to be a unicode string.')
# doc.sents is a generator producing sentences
for sentence in doc.sents:
    print(sentence)
# doc can be indexed to find the individual words
print("Word 3:", doc[3])

Here is a text document.
spaCy requires it to be a unicode string.
Word 3: text


Now let's grab our documents from the "apple" and "Apple" Wikipedia pages.

In [4]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Spit out (slightly cleaned up) sentences from a Wikipedia article.
def wikipedia_to_sents(url):
    soup = BeautifulSoup(urlopen(url), 'lxml').find(attrs={'id':'mw-content-text'})
    
    # The text is littered by references like [n].  Drop them.
    def drop_refs(s):
        return re.sub(r'\[\d+\]', '', s)  # raw string avoids invalid-escape warnings
    
    paragraphs = [drop_refs(p.text) for p in soup.find_all('p')]
    return [s.text for paragraph in paragraphs for s in nlp(paragraph).sents if len(s) > 2]

# Articles are pinned to a specific version for consistent results
fruit_sents = wikipedia_to_sents("https://en.wikipedia.org/?title=Apple&oldid=1069655337")
company_sents = wikipedia_to_sents("https://en.wikipedia.org/?title=Apple_Inc.&oldid=1070267868")

In [5]:
company_sents[-105:-100]

['Apple was ranked No. 4 on the 2018 Fortune 500 rankings of the largest United States corporations by total revenue.',
 'Apple has created subsidiaries in low-tax places such as Ireland, the Netherlands, Luxembourg, and the British Virgin Islands to cut the taxes it pays around the world.',
 'According to The New York Times, in the 1980s Apple was among the first tech companies to designate overseas salespeople in high-tax countries in a manner that allowed the company to sell on behalf of low-tax subsidiaries on other continents, sidestepping income taxes.',
 'In the late 1980s, Apple was a pioneer of an accounting technique known as the "Double Irish with a Dutch sandwich," which reduces taxes by routing profits through Irish subsidiaries and the Netherlands and then to the Caribbean.',
 'British Conservative Party Member of Parliament Charlie Elphicke published research on October 30, 2012, which showed that some multinational companies, including Apple Inc., were making billions o

In [6]:
fruit_sents[-105:-100]

['However, more than with most perennial fruits, apples must be propagated asexually to obtain the sweetness and other desirable characteristics of the parent.',
 'This is because seedling apples are an example of "extreme heterozygotes", in that rather than inheriting genes from their parents to create a new apple with parental characteristics, they are instead significantly different from their parents, perhaps to compete with the many pests.',
 'Triploid cultivars have an additional reproductive barrier in that three sets of chromosomes cannot be divided evenly during meiosis, yielding unequal segregation of the chromosomes (aneuploids).',
 'Even in the case when a triploid plant can produce a seed (apples are an example), it occurs infrequently, and seedlings rarely survive.',
 'Because apples are not true breeders when planted as seeds, although cuttings can take root and breed true, and may live for a century, grafting is usually used.']

### Count Vectorizer


Machine learning algorithms, however, prefer vectors of numbers, not text. When we do this translation, the output will typically be a very large, but usually sparse, vector: The number of coordinates is the number of words in our dictionary, and the $i$-th coordinate entry is the number of occurrences of the $i$-th word.

There's a reasonable implementation of this in the `CountVectorizer` class in `sklearn.feature_extraction.text`. See http://scikit-learn.org/stable/modules/classes.html#text-feature-extraction-ref for more detail on the options. When we use sklearn's `CountVectorizer`, we generate a matrix where each column corresponds to a word, each row corresponds to a document, and the values are the word counts.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

bag_of_words_vectorizer = CountVectorizer()

counts = bag_of_words_vectorizer.fit_transform( fruit_sents + company_sents  )
print(counts.shape)

(867, 4377)


In [8]:
# Note that counts is a **sparse** matrix.
print(counts.toarray())       # The dense view: mostly zeros, but there are non-zero entries, really!
print()
print(counts)                 # The sparse view: lists only the non-zero entries as (row, column) count

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

  (0, 379)	3
  (0, 429)	2
  (0, 2220)	1
  (0, 1402)	1
  (0, 1764)	1
  (0, 3101)	1
  (0, 715)	1
  (0, 4074)	1
  (0, 2475)	1
  (0, 1320)	1
  (1, 429)	1
  (1, 2475)	1
  (1, 4075)	1
  (1, 453)	2
  (1, 1128)	1
  (1, 4332)	1
  (1, 391)	1
  (1, 3968)	2
  (1, 2606)	1
  (1, 4293)	1
  (1, 1905)	1
  (1, 3722)	1
  (1, 2066)	1
  (1, 1821)	1
  (2, 2220)	1
  :	:
  (865, 2066)	1
  (865, 4020)	1
  (865, 1919)	1
  (865, 2099)	1
  (865, 1442)	1
  (865, 1070)	1
  (865, 1862)	1
  (865, 4329)	1
  (865, 4140)	1
  (865, 953)	1
  (865, 3460)	1
  (865, 3997)	1
  (865, 469)	1
  (866, 429)	1
  (866, 2066)	1
  (866, 2755)	2
  (866, 2838)	1
  (866, 3944)	1
  (866, 2247)	1
  (866, 102)	1
  (866, 3849)	1
  (866, 2898)	1
  (866, 1490)	1
  (866, 3446)	1
  (866, 162)	1
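
The `(row, column)  count` pairs printed above list only the non-zero cells of the matrix. As a rough illustration of why that is economical, here is a hand-rolled version on a three-document toy corpus. The corpus and the whitespace tokenizer are assumptions for the sketch; `CountVectorizer` tokenizes differently (for example, it drops one-letter words by default).

```python
from collections import Counter

docs = ["an apple is a fruit",
        "apple trees bear fruit",
        "the company sells phones"]

# Assign each distinct word a column index, in sorted order.
all_words = sorted({w for d in docs for w in d.split()})
vocabulary = {word: j for j, word in enumerate(all_words)}

# Store only the non-zero entries, keyed by (row, column).
sparse = {}
for i, doc in enumerate(docs):
    for word, count in Counter(doc.split()).items():
        sparse[(i, vocabulary[word])] = count

# Expand to a dense matrix to see how many cells are zero.
dense = [[sparse.get((i, j), 0) for j in range(len(vocabulary))]
         for i in range(len(docs))]
n_cells = len(docs) * len(vocabulary)
print(len(sparse), "non-zero entries out of", n_cells, "cells")
```

With a realistic vocabulary of thousands of words and sentences of a few dozen words each, the fraction of non-zero cells becomes far smaller still.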


In [9]:
# Peek at a slice of the (alphabetically sorted) vocabulary
bag_of_words_vectorizer.get_feature_names_out()[200:244]

array(['above', 'abroad', 'absence', 'absent', 'absolutely', 'abuse',
       'abuses', 'abusive', 'acceptable', 'acceptance', 'access',
       'accessible', 'accessions', 'accessories', 'accessory', 'acclaim',
       'acclaimed', 'acclimatized', 'accomplish', 'according', 'account',
       'accountability', 'accounted', 'accounting', 'accounts',
       'achieves', 'acid', 'acidity', 'acknowledge', 'acknowledges',
       'acquire', 'acquired', 'acquiring', 'acquisition', 'acquisitions',
       'acres', 'across', 'act', 'acted', 'acting', 'action', 'actions',
       'active', 'actively'], dtype=object)

In [10]:
# The mapping from each word to its column index
bag_of_words_vectorizer.vocabulary_

{'an': 379,
 'apple': 429,
 'is': 2220,
 'edible': 1402,
 'fruit': 1764,
 'produced': 3101,
 'by': 715,
 'tree': 4074,
 'malus': 2475,
 'domestica': 1320,
 'trees': 4075,
 'are': 453,
 'cultivated': 1128,
 'worldwide': 4332,
 'and': 391,
 'the': 3968,
 'most': 2606,
 'widely': 4293,
 'grown': 1905,
 'species': 3722,
 'in': 2066,
 'genus': 1821,
 'originated': 2820,
 'central': 791,
 'asia': 479,
 'where': 4280,
 'its': 2236,
 'wild': 4295,
 'ancestor': 388,
 'sieversii': 3615,
 'still': 3797,
 'found': 1738,
 'today': 4021,
 'apples': 432,
 'have': 1946,
 'been': 582,
 'for': 1716,
 'thousands': 3996,
 'of': 2755,
 'years': 4356,
 'europe': 1511,
 'were': 4274,
 'brought': 686,
 'to': 4020,
 'north': 2712,
 'america': 370,
 'european': 1512,
 'colonists': 899,
 'religious': 3311,
 'mythological': 2646,
 'significance': 3618,
 'many': 2488,
 'cultures': 1131,
 'including': 2076,
 'norse': 2711,
 'greek': 1883,
 'christian': 832,
 'tradition': 4053,
 'from': 1760,
 'seed': 3526,
 'tend':

### Hashing Vectorizer


When doing "bag of words" type techniques on a *large* corpus and without an existing vocabulary, there is a simple trick that is often useful.  The issue (and solution) is as follows: 

 - The output is a feature vector, so that whenever we encounter a word we must look up which coordinate slot it is in.  A naive way would be to keep a list of all the words encountered so far, and look up each word when it is encountered.  Whenever we encounter a new word, we see if we've already seen it before and if not -- assign it a new number.  This requires storing all the words that we have seen in memory, cannot be done in parallel (because we'd have to share the table of seen words), etc.
 - A **hash function** takes as input something complicated (like a string) and spits out a number, with the desired property being that different inputs *usually* produce different outputs.  (This is how hash tables are implemented, as the name suggests.)
 - So -- rather than exactly looking up the coordinate of a given word, we can just use its hash value (modulo a big size that we choose).  This is fast and parallelizes easily.  (There are some downsides: You cannot tell, after the fact, what word each of your features actually corresponds to!)
 
Scikit-learn includes `sklearn.feature_extraction.text.HashingVectorizer` to do this.  It behaves as almost a drop-in replacement for `CountVectorizer`.
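
To make the trick concrete, here is a minimal sketch of hashing words straight into a fixed-size count vector. The function name `hashed_bag_of_words`, the choice of MD5, and the tiny `n_features=16` are all assumptions made for illustration; scikit-learn's `HashingVectorizer` uses a different hash function and a much larger default size.

```python
import hashlib

def hashed_bag_of_words(text, n_features=16):
    """Count words into a fixed-size vector using a hash instead of a vocabulary."""
    vector = [0] * n_features
    for word in text.lower().split():
        # Python's built-in hash() of strings is salted per process, so use a
        # stable digest to get the same slot on every run and every machine.
        digest = hashlib.md5(word.encode("utf-8")).digest()
        slot = int.from_bytes(digest[:8], "big") % n_features
        vector[slot] += 1
    return vector

v1 = hashed_bag_of_words("apple pie and apple juice")
v2 = hashed_bag_of_words("juice and pie and apple")
print(v1)
print(v2)
```

No table of seen words is kept anywhere, so separate workers can vectorize separate documents independently. The price is that two different words may land in the same slot (a collision), and a slot index cannot be mapped back to a word.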

## Word importance: Term frequency–inverse document frequency (TF-IDF)


Just using counts (or frequencies), as above, is clearly not great.  Both apples, the fruit, and Apple, the company, are enjoyed around the world!  We would like to find words that are common in one document, but not common in all of them.  This is the goal of the __tf-idf weighting__.  A precise definition is:


  1. If $d$ denotes a document and $t$ denotes a term, then the _raw term frequency_ $\mathrm{tf}^{raw}(t,d)$ is
  
  $$ \mathrm{tf}^{raw}(t,d) = \text{the number of times the term $t$ occurs in the document $d$} $$
  
  The vector of all term frequencies can optionally be _normalized_ either by dividing by the sum of all word occurrence counts ($L^1$) or by the Euclidean length of the vector of word occurrence counts ($L^2$).  Scikit-learn by default does this second one:
  
  $$ \mathrm{tf}(t,d) = \mathrm{tf}^{L^2}(t,d) = \frac{\mathrm{tf}^{raw}(t,d)}{\sqrt{\sum_{t'} \mathrm{tf}^{raw}(t',d)^2}} $$
  
  (Scikit-learn actually does this normalization _after_ applying the $\mathrm{idf}$ step below)
  
  2. If $D$ is the set of possible documents, then the _inverse document frequency_ is
  
  $$ \mathrm{idf}^{naive}(t,D) = \log \frac{\# D}{\# \{d \in D : t \in d\}} \\
  = \log \frac{\text{count of all documents}}{\text{count of those documents containing the term $t$}} $$
  
  with a common variant being
  
  $$ \mathrm{idf}(t, D) = \log \frac{1 + \# D}{1 + \# \{d \in D : t \in d\}} \\
   = \log \frac{1 + \text{count of all documents}}{1 + \text{count of those documents containing the term $t$}} $$
   
  (Without the $1+$ in the denominator, we have to worry about dividing by zero if $t$ is not found in any documents). This will produce zero weight if a term appears in every document, so Scikit-learn actually uses
  
  $$ \mathrm{idf}(t, D) = 1 + \log \frac{1+\# D}{1 + \# \{d \in D : t \in d\}}$$
  
  3. Finally, the weight that we assign to the term $t$ appearing in document $d$ and depending on the corpus of all documents $D$ is
  
  $$ \mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \mathrm{idf}(t,D) $$
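
The definitions above can be worked through by hand on a toy corpus. This is a sketch under stated assumptions: the three documents are made up, tokens are split on whitespace, the logarithm is the natural log, and the $L^2$ normalization is applied after the idf weighting, as the note in step 1 describes.

```python
import math
from collections import Counter

docs = [
    "apple fruit tree apple",
    "apple company phone",
    "apple fruit pie",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
df = {t: sum(t in doc for doc in tokenized) for t in vocab}

# Smoothed idf, with the leading "1 +" so ubiquitous terms keep non-zero weight.
idf = {t: 1.0 + math.log((1 + n_docs) / (1 + df[t])) for t in vocab}

def tfidf_vector(doc):
    counts = Counter(doc)                          # raw term frequencies
    weights = [counts[t] * idf[t] for t in vocab]  # tf * idf
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]             # L2 normalization, applied last

print(vocab)
for doc in tokenized:
    print([round(w, 3) for w in tfidf_vector(doc)])
```

Here "apple" occurs in every document, so its idf is exactly $1$, the smallest possible value, while "company" (one document) gets a larger idf than "fruit" (two documents).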
  

### TF-IDF Vectorizer

The `CountVectorizer` and `HashingVectorizer` can be used to compute tf-idf values by combining them with the `TfidfTransformer` (the `TfidfVectorizer` is the `CountVectorizer` together with the `TfidfTransformer`). For our application (where the training and test data is small), we may as well just use `TfidfVectorizer` -- but it is good to know that `HashingVectorizer` is there.

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

ng_tfidf = TfidfVectorizer(max_features=300)
ng_tfidf.fit(fruit_sents + company_sents)
print(ng_tfidf.get_feature_names_out()[100:105])
print(ng_tfidf.transform(fruit_sents + company_sents))

['fruit' 'generally' 'genome' 'golden' 'government']
  (0, 265)	0.3322776109544498
  (0, 205)	0.3751168070999027
  (0, 158)	0.36159994684483426
  (0, 133)	0.2083640248196821
  (0, 100)	0.287599162892657
  (0, 56)	0.1893017126779034
  (0, 31)	0.21340585183865515
  (0, 26)	0.6407951368214393
  (1, 295)	0.3429914934653169
  (1, 266)	0.33490142716129156
  (1, 252)	0.1833729444178797
  (1, 171)	0.2998147685178659
  (1, 158)	0.3687277536387121
  (1, 118)	0.11773514895097484
  (1, 105)	0.3752742454129184
  (1, 72)	0.3752742454129184
  (1, 36)	0.4356175996972143
  (1, 31)	0.10880623883994807
  (1, 27)	0.11661306278015948
  (2, 287)	0.37442789103849117
  (2, 283)	0.3609358550033812
  (2, 265)	0.3316673706807502
  (2, 252)	0.08974896766627086
  (2, 242)	0.3609358550033812
  (2, 231)	0.3673440073121736
  :	:
  (864, 252)	0.07884157365358642
  (864, 244)	0.25781194962619014
  (864, 225)	0.298751665090909
  (864, 135)	0.18381935337339275
  (864, 118)	0.10124093766505689
  (864, 99)	0.18670107242901
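
As a quick check of the claim that `TfidfVectorizer` is `CountVectorizer` followed by `TfidfTransformer`, the two routes can be compared on a toy corpus. The corpus is made up for illustration, and both routes use default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)
from sklearn.pipeline import make_pipeline

toy_corpus = [
    "the apple tree bears fruit",
    "the company sells phones",
    "apple pie needs fruit",
]

# One step: counting and tf-idf weighting together.
direct = TfidfVectorizer().fit_transform(toy_corpus)

# Two steps: raw counts first, then the tf-idf reweighting.
two_step = make_pipeline(CountVectorizer(), TfidfTransformer()).fit_transform(toy_corpus)

print(direct.shape, two_step.shape)
print(np.allclose(direct.toarray(), two_step.toarray()))
```

Swapping `CountVectorizer` for `HashingVectorizer` in the two-step route should work similarly, though its defaults differ (it normalizes rows and alternates signs unless told otherwise), so those options would probably need adjusting first.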

### Limiting features

Note that we used `max_features=300` here to limit how many terms the vectorizer returns, restricting it to the $300$ most common terms.  This works for the `CountVectorizer` as well, but not the `HashingVectorizer`.  

We could also limit the number of terms using `min_df` and `max_df` (also for `TfidfVectorizer` and `CountVectorizer` but not `HashingVectorizer`). Here `min_df` indicates the minimum number of documents a term must appear in, and `max_df` the maximum.  `min_df=10`, for example, would reject any term that doesn't appear in at least $10$ documents, and `min_df=0.01` would reject any term that doesn't appear in at least $1\%$ of documents - integers are interpreted as a fixed number of documents, floats as a fraction.

In [17]:
example_doc = ["singleton appears once",
              "once appears more than once than",
              "once more"]

cv = CountVectorizer(min_df=2, max_df=0.8)
cv.fit(example_doc)

print(cv.get_feature_names_out())

['appears' 'more']


Here, `singleton` and `than` are rejected because they only appear in one document (even though `than` appears twice in its document), and `once` is rejected since it appears in every document, leaving only `appears` and `more`.

## Document similarity metrics


A common problem is looking up a document similar to a given snippet, or relatedly comparing two documents for similarity.  The above provides a simple method for this called __cosine similarity__:
  - To each document $d_i$ in a corpus of documents $D$, assign its tf or tf-idf vector $v_i$ with entries $$ (v_i)_{j} = \mathrm{tfidf}( t_{j}, d_i, D ) $$
  where $i$ ranges over indices for documents, and $j$ ranges over indices for terms in the vocabulary.
  - To compare two documents $d_i$ and $d_{i'}$, simply find the cosine of the angle between their vectors:
  $$ \frac{v_i \cdot v_{i'}}{|v_i| |v_{i'}|} $$
  
(There's also a variant using binary vectors and Jaccard distance.)
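
Both measures are short enough to write out directly. The sketch below uses raw word counts rather than tf-idf weights to keep it brief, stores each vector as a dictionary of its non-zero entries, and uses made-up example sentences.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def jaccard_similarity(a, b):
    """Size of the shared vocabulary divided by the size of the combined one."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

d1 = Counter("apple trees bear fruit".split())
d2 = Counter("apple pie is made from fruit".split())
d3 = Counter("the company sells phones".split())

print(cosine_similarity(d1, d2), cosine_similarity(d1, d3))
print(jaccard_similarity(d1, d2), jaccard_similarity(d1, d3))
```

Documents with no words in common score $0$ on both measures, and a document compared with itself scores $1$. The Jaccard *distance* is one minus the Jaccard similarity computed here.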

## Engineering your features


There are many considerations to make when deciding which "words" or features to include in your classifier. In our example application, we might try to consider:
   - Capitalization of the word apple? (`apple` vs `Apple`)
   - Pluralization of the word apple? (apples)
   - Possessive form of the word apple? (Apple's)
   - Presence (or frequency) of certain well-chosen words : Does (e.g.,) the word "computer" or "fruit" occur in the sentence?  (This feature regards the sentence as a simple __bag of words__ without regard to trying to parse its structure.)
   - In addition to single words, we can also look for __n-grams__: Strings of n consecutive words.
   - Using the __tf-idf__ values of words or n-grams.

### Stop words

It's common to want to __omit__ certain common words when doing these counts -- "a", "an", and "the" are common enough so that their counts do not tend to give us any hints as to the meaning of documents.  Such words that we want to omit are called __stop words__. (Note that, if you are using tf-idf and not the word frequency to compute your document similarity, excluding stop words matters much less, since the idf factor already down-weights words that appear in most documents.)

`spaCy` contains a standard list of such stop words for English in `spacy.lang.en.stop_words.STOP_WORDS`.  In our application, we'd also want to include "apple" -- it is certainly not going to help us distinguish our two meanings!

In [18]:
from spacy.lang.en.stop_words import STOP_WORDS

# Removing a few words that don't lemmatize well
STOP_WORDS = STOP_WORDS.difference({'he','his','her','hers'})

STOP_WORDS

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

The question of which words to include in your list of stop words is not always a straightforward one. A detail to keep in mind is that your list needs to use the same preprocessing and tokenization as your vectorizer. For instance, the stop words supplied by `spaCy` include several contractions (`"'d"`, `"'ll"`, etc.) with the apostrophe `'` included. Because `scikit-learn` handles these contractions [a little differently](https://scikit-learn.org/stable/modules/feature_extraction.html#stop-words), we need to add a few contractions *without* the apostrophe to our list of stop words. Otherwise, `scikit-learn` will give us a `UserWarning` to let us know that we need to tweak our list.

In [19]:
STOP_WORDS = STOP_WORDS.union({'ll', 've'})
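
To see where the bare `ll` and `ve` come from, we can apply a word-splitting regular expression to a few contractions. The pattern below is, to the best of my knowledge, the default `token_pattern` of `CountVectorizer` (runs of two or more word characters); treat that as an assumption and check the documentation for your installed version.

```python
import re

# Assumed default token pattern of CountVectorizer: two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

for text in ["we'll", "they've", "I'd", "it's"]:
    print(text, "->", re.findall(DEFAULT_TOKEN_PATTERN, text.lower()))
```

The apostrophe is not a word character, so `we'll` splits into `we` and `ll`, neither of which matches the `spaCy` entry `'ll`. One-letter fragments such as the `d` in `I'd` are dropped entirely by this pattern, so they need no stop-word entry.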

Another thing to look out for is whether your list of stop words might lead you to exclude words that are actually valuable in the context of your project. It is worth checking if the stop word list includes words that might be important for you.  Some of the choices may be idiosyncratic.

In [20]:
print('six' in STOP_WORDS, 'seven' in STOP_WORDS, 'eight' in STOP_WORDS)

True False True


Once stop words have been removed, we might hope that the most-frequently-used remaining words are the important features to keep.  For instance, here we'll compute and then use the top 300 words by frequency, *ignoring* the stop words from above.  Since "apple" is not useful in distinguishing the meanings, but will be common, we add it as a stop word.

Nevertheless, this method is probably not as good as taking the top words as determined by tf-idf.

In [21]:
counter = CountVectorizer(max_features=300,
                          # newer scikit-learn versions expect a list here, not a set
                          stop_words=list(STOP_WORDS.union({'apple'})))
counter.fit(fruit_sents + company_sents)
print(counter.get_feature_names_out())

# Now we can use it with that vectorizer, like so...
counter.transform(fruit_sents + company_sents)

['000' '10' '100' '13' '14' '19' '1984' '1985' '1997' '20' '2001' '2006'
 '2007' '2008' '2009' '2010' '2011' '2012' '2013' '2014' '2015' '2016'
 '2017' '2018' '2019' '2020' '2021' '30' '500' 'according' 'allowed'
 'america' 'american' 'announced' 'app' 'apples' 'applications' 'apps'
 'april' 'asia' 'audio' 'august' 'away' 'based' 'began' 'best' 'billion'
 'board' 'brand' 'business' 'called' 'came' 'campus' 'carbon' 'central'
 'century' 'ceo' 'china' 'cider' 'city' 'climate' 'companies' 'company'
 'computer' 'computers' 'conditions' 'consumer' 'consumers' 'content'
 'continues' 'cook' 'corporation' 'cost' 'created' 'criticized' 'cultivar'
 'cultivars' 'cultivated' 'data' 'davidson' 'day' 'december' 'design'
 'designed' 'desktop' 'despite' 'development' 'device' 'devices'
 'different' 'digital' 'early' 'efforts' 'electronics' 'employees' 'end'
 'energy' 'environmental' 'europe' 'european' 'executives' 'facebook'
 'following' 'fortune' 'found' 'foxconn' 'free' 'fruit' 'generally'
 'genera

<867x300 sparse matrix of type '<class 'numpy.int64'>'
	with 3940 stored elements in Compressed Sparse Row format>

### Tokenization


Tokenization refers to splitting the text into pieces, in this case into sentences and into words. Instead of looking at just single words, it is also useful to look at **n-grams**: These are n-word long sequences of words (e.g., each of "farmer's market", "market share", and "farm share" is a 2-gram).

The exact same sort of counting techniques apply.  The `CountVectorizer` class has built-in support for this, too:

If you pass it the `ngram_range=(m, M)` then it will count $n$-grams with  $m \leq n \leq M$.
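
As a minimal sketch of what gets counted, here is a plain-Python generator that mimics the `(m, M)` convention. The helper name `ngrams` and the example phrase are made up for illustration.

```python
def ngrams(words, m, M):
    """Yield every n-gram of `words` with m <= n <= M, as space-joined strings."""
    for n in range(m, M + 1):
        for start in range(len(words) - n + 1):
            yield " ".join(words[start:start + n])

words = "the farm share market".split()
print(list(ngrams(words, 1, 2)))
# ['the', 'farm', 'share', 'market', 'the farm', 'farm share', 'share market']
```

So `(1, 2)` yields single words plus adjacent pairs, while `(2, 2)` yields only the pairs.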

In [22]:
ng_counter = CountVectorizer(max_features=300, 
                             ngram_range=(2,2), 
                             # newer scikit-learn versions expect a list here, not a set
                             stop_words=list(STOP_WORDS.union({'apple', 'Apple'})))
ng_counter.fit( fruit_sents + company_sents  )
print(ng_counter.get_feature_names_out())
print()
print(len(ng_counter.get_feature_names_out()))

['000 time' '000 units' '100 million' '100 renewable' '122 0090'
 '13 billion' '17th century' '19th century' '2011 jobs' '2016 update'
 '2017 announced' '2019 update' '2020 announced' '2021 update'
 '2022 update' '3349 122' '37 3349' '500 known' '500 list' '65 billion'
 'according report' 'active use' 'advanced manufacturing'
 'adverse reactions' 'advertising campaigns' 'aim alliance'
 'allowed company' 'amazon echo' 'ancestor malus' 'announced away'
 'announced billion' 'announced internal' 'announcement came'
 'anti competitive' 'app devices' 'app store' 'app tracking' 'apps app'
 'august 2018' 'backlit lcd' 'board directors' 'brand loyalty'
 'carbon dioxide' 'central asia' 'ceo tim' 'chief operating'
 'chinese government' 'climate counts' 'co founder' 'coca cola'
 'collection database' 'commercial orchards' 'commonly known'
 'companies time' 'company felt' 'company focus' 'company history'
 'company market' 'company product' 'company products'
 'company proprietary' 'company provide

### Stemming


In our original hand-built vocabulary, we had to include both "apple" and "apples".  It would have been useful to identify them as one word.

This is not limited to just trailing "s" characters: e.g., the words "carry", "carries", "carrying", and "carried" all carry -- roughly -- the same meaning.  The process of replacing them by a common root, or **stem**, is called stemming -- the stem will not, in general, be a full word itself.
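
To give a feel for what a stemmer does, here is a deliberately crude suffix-stripping sketch. The function `crude_stem` and its handful of rules are invented for this illustration; real stemmers (the Porter stemmer, for instance) use many more rules and conditions.

```python
def crude_stem(word):
    """Strip a few common English suffixes. A toy, not a real stemmer."""
    word = word.lower()
    rules = (("ies", "i"), ("ied", "i"), ("ing", ""), ("ed", ""), ("s", ""))
    for suffix, replacement in rules:
        # Only strip when a reasonably long stem would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:len(word) - len(suffix)] + replacement
            break
    # Treat a trailing "y" like "i" so that "carry" matches "carries".
    if word.endswith("y"):
        word = word[:-1] + "i"
    return word

print([crude_stem(w) for w in "carry carries carrying carried".split()])
# ['carri', 'carri', 'carri', 'carri']
print([crude_stem(w) for w in "apple apples".split()])
# ['apple', 'apple']
```

Note that the common stem `carri` is not an English word. Suffix stripping also cannot relate irregular forms such as "eat" and "ate".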

There's a related process called **lemmatization**: The analog of the "stem" here _is_ an actual word.  After `spaCy` processes some text, the `lemma_` property of each word contains the lemmatized version.

In [23]:
print([w.lemma_ for w in nlp('carry carries carrying carried')])
print([w.lemma_ for w in nlp('eat eating eaten ate')])
print(' '.join(w.lemma_ for w in nlp("The quick brown fox jumped over the lazy dog.  "
                                     "I can't believe it's not butter.  "
                                     "I tried to ford the river and my unfortunate oxen died.")))

['carry', 'carry', 'carry', 'carry']
['eat', 'eat', 'eat', 'eat']
the quick brown fox jump over the lazy dog .   I can not believe it be not butter .   I try to ford the river and my unfortunate oxen die .


We can tell our bag-of-words counters (or tf-idf) to run on lemmatized text.  This way it won't have to include both e.g., 'apple' and 'apples'.  The way to do this with `CountVectorizer` or `TfidfVectorizer` is to supply a function with the `tokenizer` option.  This function can process the text before performing the counts (or computing tf-idf values).  

In [24]:
def tokenize_lemma(text):
    return [w.lemma_.lower() for w in nlp(text)]

stop_words_lemma = set(tokenize_lemma(' '.join(sorted(STOP_WORDS))))

ng_stem_tfidf = TfidfVectorizer(max_features=300, 
                                # newer scikit-learn versions expect a list here, not a set
                                stop_words=list(stop_words_lemma.union({'apple'})),
                                tokenizer=tokenize_lemma,
                                token_pattern=None        # Is ignored, since tokenizer is specified
                               )
ng_stem_tfidf = ng_stem_tfidf.fit(fruit_sents + company_sents)

ng_stem_vocab = ng_stem_tfidf.get_feature_names_out()
print(ng_stem_vocab)

['\n' '"' '$' '%' "'s" '(' ')' ',' '-' '.' '1' '10' '100' '19' '1984'
 '1985' '1997' '2' '2006' '2007' '2008' '2010' '2011' '2012' '2014' '2015'
 '2016' '2017' '2018' '2019' '2020' '2021' '3' '30' '6' ':' ';' ']'
 'advertisement' 'allow' 'america' 'american' 'announce' 'app'
 'application' 'april' 'attempt' 'audio' 'august' 'away' 'base' 'beat'
 'begin' 'big' 'billion' 'board' 'brand' 'build' 'business' 'campaign'
 'campus' 'carbon' 'cause' 'center' 'central' 'century' 'ceo' 'change'
 'china' 'claim' 'climate' 'clone' 'color' 'come' 'company' 'computer'
 'consumer' 'contain' 'continue' 'control' 'cook' 'corporation' 'cost'
 'country' 'create' 'criticize' 'cultivar' 'cultivate' 'customer' 'datum'
 'day' 'december' 'design' 'desktop' 'develop' 'development' 'device'
 'different' 'digital' 'disease' 'early' 'eat' 'effort' 'electronic'
 'employee' 'end' 'energy' 'europe' 'european' 'executive' 'facebook'
 'facility' 'factory' 'feature' 'find' 'focus' 'follow' 'food' 'form'
 'foxconn' 'frui

### Part of speech tagging

Consider the "Ford" vs "ford" example.  As a human being, the easiest way to tell these apart is that "Ford" is a __noun__ while "ford" is a __verb__.

Fortunately, `spaCy` also has a part-of-speech tagger: You give it a sentence, and it tries to tag the parts of speech (e.g., noun, verb, adjective, etc.).  The broad category is given in the `.pos_` property, while a more detailed description, using the [UPenn Treebank Tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html), is in the `.tag_` property.

(N.B. Nothing's perfect -- the tagger will make mistakes.)

In [25]:
s1 = "I tried to ford the river, and my unfortunate oxen died."
s2 = "Henry Ford built factories to facilitate the construction of the Ford automobile."

In [26]:
[(w.text, w.pos_, w.tag_) for w in nlp(s1)]

[('I', 'PRON', 'PRP'),
 ('tried', 'VERB', 'VBD'),
 ('to', 'PART', 'TO'),
 ('ford', 'VERB', 'VB'),
 ('the', 'DET', 'DT'),
 ('river', 'NOUN', 'NN'),
 (',', 'PUNCT', ','),
 ('and', 'CCONJ', 'CC'),
 ('my', 'PRON', 'PRP$'),
 ('unfortunate', 'ADJ', 'JJ'),
 ('oxen', 'NOUN', 'NN'),
 ('died', 'VERB', 'VBD'),
 ('.', 'PUNCT', '.')]

In [27]:
[(w.text, w.pos_, w.tag_) for w in nlp(s2)]

[('Henry', 'PROPN', 'NNP'),
 ('Ford', 'PROPN', 'NNP'),
 ('built', 'VERB', 'VBD'),
 ('factories', 'NOUN', 'NNS'),
 ('to', 'PART', 'TO'),
 ('facilitate', 'VERB', 'VB'),
 ('the', 'DET', 'DT'),
 ('construction', 'NOUN', 'NN'),
 ('of', 'ADP', 'IN'),
 ('the', 'DET', 'DT'),
 ('Ford', 'PROPN', 'NNP'),
 ('automobile', 'NOUN', 'NN'),
 ('.', 'PUNCT', '.')]

### Capitalization, punctuation, etc.

These are the obvious features that we had in mind at the start: whether the word is capitalized, plural, or followed by a possessive `'s`.

## Building the Classifier

*Disclaimer: This version is actually pretty bad&mdash;it uses many of the right ideas, but puts them together poorly and is being trained on fairly limited data.*

Let's take all of these ideas and combine them into a function to build a classifier to do word disambiguation.  First, we create a function to retrieve text from a URL.

In [28]:
def wikipedia_to_paragraphs(url):
    """
    Retrieves a URL from wikipedia, and returns a list of paragraphs 
    (based on the 'p' html paragraph tag) 
    """
    files_by_url = {
      "http://en.wikipedia.org/wiki/Ford_(crossing)": "ford_crossing.txt",
      "http://en.wikipedia.org/wiki/Ford": "ford_car.txt",
      "http://en.wikipedia.org/wiki/Apple": "apple_fruit.txt",
      "http://en.wikipedia.org/wiki/Apple_Inc.": "apple_inc.txt"
    }
    
    try:
        with open("small_data/{}".format(files_by_url[url]), encoding='utf-8') as wiki_file:
            soup = BeautifulSoup(wiki_file.read(), 'lxml')\
            .find(attrs={'id':'mw-content-text'})
    except KeyError:
        soup = BeautifulSoup(urlopen(url), 'lxml').find(attrs={'id':'mw-content-text'})
    
    # The text is littered by references like [n].  Drop them.
    def drop_refs(s):
        return re.sub(r'\[\d+\]', '', s)  # raw string avoids invalid-escape warnings
    
    return [drop_refs(p.text) for p in soup.find_all('p') if p.text != '']

We'll then perform feature engineering in a transformer class. 

In [29]:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class AdHocFeatures(BaseEstimator, TransformerMixin):
    """
    Given a keyword (e.g., "apple"), will transform documents into an
    encoding of several ad hoc features of each occurrence of the keyword:
        - If the keyword is capitalized
        - If it is plural
        - If it is possessive (in the stupid sense of being followed by 's)
        - If the keyword is a verb (e.g., for Ford vs ford)
    """

    def __init__(self, keyword):
        self.keyword = nlp(keyword)[0].lemma_
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return np.asarray([self.transform_doc(x) for x in X])
    
    def feature_posessive(self, doc):
        ## N.B. spaCy will tokenize "Apple's" as ["Apple", "'s"]
        hits = [i for i, word in enumerate(doc) if word.lemma_ == self.keyword]
        return sum((i + 1) < len(doc) and doc[i+1].text == "'s" for i in hits)
    
    def transform_doc(self, row):
        doc = nlp(row)
        words = [word for word in doc if word.lemma_ == self.keyword]
        return [sum(word.is_title for word in words),
                sum(word.tag_ in ('NNS', 'NNPS') for word in words),
                self.feature_posessive(doc),
                sum(word.pos_ == 'VERB' for word in words)]

Next, we'll make our classifier. To do this, we will use a multinomial **Naive Bayes** model. Detailed information on Naive Bayes can be found in the Naive Bayes notebook, but we'll briefly describe it here: 

> The goal is, given a set of observed features $X_1, \ldots, X_p$, to find the label $Y$ with the maximum conditional probability. In other words, we know what our distributions of words ($X$'s) should look like for a given genre ($Y$) from our training data, and we would like to use this information to find the genre of a body of text ($Y$) given its words ($X$'s) for new data. Bayes' theorem gives us a way to compute the latter conditional probability from the former. 
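
To make the description above more concrete, here is a small from-scratch sketch of a multinomial Naive Bayes classifier over word counts. Everything in it is an assumption made for illustration: the helper names `train_nb` and `predict_nb`, the four training sentences, whitespace tokenization, and add-one (Laplace) smoothing with `alpha=1.0`. It is not how `MultinomialNB` is implemented internally, only the same idea in miniature.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log P(Y) and log P(word | Y) from labeled documents."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {}
    for label in class_docs:
        # Smoothing keeps unseen (word, class) pairs from getting probability 0.
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        log_lik = {w: math.log((word_counts[label][w] + alpha) / denom)
                   for w in vocab}
        log_prior = math.log(class_docs[label] / len(labels))
        model[label] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    """Return the label maximizing log P(Y) + sum of log P(word | Y)."""
    words = doc.lower().split()
    scores = {}
    for label, (log_prior, log_lik) in model.items():
        # Words never seen in training are skipped for every class alike.
        scores[label] = log_prior + sum(log_lik[w] for w in words if w in log_lik)
    return max(scores, key=scores.get)

train_docs = ["the apple tree bears fruit", "apple pie needs sweet fruit",
              "apple released a new iphone", "the company sells the iphone"]
train_labels = ["fruit", "fruit", "company", "company"]

model = train_nb(train_docs, train_labels)
print(predict_nb(model, "sweet fruit tree"))  # fruit
print(predict_nb(model, "new iphone"))        # company
```

The "naive" part is the sum inside `predict_nb`: each word contributes independently given the class, which is rarely true of real language but often works well enough for classification.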

In [30]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline, FeatureUnion

def make_classifier(base_word, meaning1, meaning2):
    """
    Given
        - a base word (e.g., "apple", "ford") that can have ambiguous meaning
        - a pair meaning1 = (name1, url1) of a label for the first meaning, and a Wikipedia URL for it
        - a pair meaning2 = ... for the other meaning
    Returns a classifier that predicts the meaning
    """
    name1, url1 = meaning1
    name2, url2 = meaning2
    para1 = wikipedia_to_paragraphs(url1)
    para2 = wikipedia_to_paragraphs(url2)
    # Balance the classes by truncating both lists to the shorter length
    minlen = min(len(para1), len(para2))
    para1, para2 = para1[:minlen], para2[:minlen]
    
    def tokenize_lemma(text):
        return [w.lemma_.lower() for w in nlp(text)]

    stop_words_lemma = set(tokenize_lemma(' '.join(STOP_WORDS)))
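    features = FeatureUnion([('stem_vectorizer',
                              TfidfVectorizer(ngram_range=(1,2),
                                              # newer scikit-learn versions expect a list, not a set
                                              stop_words=list(stop_words_lemma.union({base_word})),
                                              tokenizer=tokenize_lemma,
                                              token_pattern=None)),  # ignored, since tokenizer is specified
                             ('ad_hoc', AdHocFeatures(base_word))])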
    pipe = Pipeline([('features', features),
                     ('classifier', MultinomialNB())])

    # Build the training data
    train_res  = [name1] * len(para1) + [name2] * len(para2)
    
    return pipe.fit(para1 + para2, train_res)

In [31]:
base_word = "apple"
apple_fruit = ("fruit", "http://en.wikipedia.org/wiki/Apple")
apple_company = ("company", "http://en.wikipedia.org/wiki/Apple_Inc.")
apple_classifier = make_classifier(base_word, apple_fruit, apple_company)

print(apple_classifier.predict([
    "I'm baking a pie with my granny smith apples.",
    "I looked up the recipe on my Apple iPhone.",
    "The apple pie recipe is on my desk.",
    "How is Apple's stock doing?",
    "I'm drinking apple juice.",
    "I have three apples.",
    "Steve Jobs is the CEO of apple.",
    "Steve Jobs likes to eat apples."
]))



['fruit' 'company' 'fruit' 'company' 'fruit' 'fruit' 'company' 'fruit']


We can also do this for other classes of text documents. 

In [32]:
base_word = "ford"
ford_crossing = ("crossing", "http://en.wikipedia.org/wiki/Ford_(crossing)")
ford_company = ("company", "http://en.wikipedia.org/wiki/Ford")
ford_classifier = make_classifier(base_word, ford_crossing, ford_company)

print(ford_classifier.predict([
    "I tried to ford the river and my unfortunate oxen died.",
    "Ford makes cars, though their quality is sometimes in dispute.",
    "The Ford Mustang is an iconic automobile.",
    "The river crossing was shallow, but we could not ford it."
]))



['crossing' 'company' 'company' 'crossing']
