# Introduction to Text Processing
## Reintroduction to Strings
Before beginning with text processing in full, we will review and introduce several important topics in working with strings.
### String Attributes
Let's quickly review some useful features of strings.

In [None]:
## Strings, demarcated by single or double quotation marks, can be
## joined together by the addition operator. They may also be combined
## across multiple lines with parentheses. Be sure not to add 
## commas, or it will be interpreted as a tuple!

example_string = ('With educated people, I suppose, punctuation is a matter of rule; '
                  'with me it is a matter of feeling. But I must say I have a great '
                  'respect for the semi-colon; its a useful little chap.')
print(example_string)

In [None]:
## Strings have a number of useful methods and operators for identifying 
## features of their content.
print('Is "orangutan" in string?: %s' %('orangutan' in example_string) )
print('Number of "a"s in string?: %s' %example_string.count('a'))
print('Index of "respect" in string?: %s' %example_string.index('respect'))

In [None]:
## Strings have a number of useful methods for manipulating their content,
## e.g. converting to lowercase.
print(example_string.lower())

In [None]:
## The replace method swaps every occurrence of one substring for another.
print(example_string.replace('educated','obnoxious'))

In [None]:
## Strings even have functions to break down and recombine their constituents.
split_string = example_string.split(' ')
sort_string = ' '.join(sorted(split_string))

print(split_string)
print(sort_string)

### Regular Expressions
Though the basic string methods are useful, they can be clunky for certain tasks. For example, if we wanted to remove every punctuation mark in the string above, we would have to chain together multiple calls to the replace method. Similarly, if we wanted to find every word ending with "ing", no single method could give us an answer in one step (especially not with punctuation in the way). 

We briefly introduce regular expressions, a mini-language used to interrogate strings, and highlight some use cases. The [documentation](https://docs.python.org/2/library/re.html) provides many more examples and details for usage.
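
To make the contrast concrete, here is a small sketch. The sentence and variable names are illustrative only and are not part of the dataset. It removes a handful of punctuation marks with chained `replace` calls, does the same with a single `re.sub` call, and then finds words ending in "ing" with one pattern.

```python
import re

text = 'Reading, writing; and coding: all involve typing.'

# Chained replace calls: one call per punctuation mark.
chained = text.replace(',', '').replace(';', '').replace(':', '').replace('.', '')

# One regular expression covering the same four marks.
single = re.sub(r'[,;:.]', '', text)

# Words ending in "ing": \b marks a word boundary, \w+ is one or more word characters.
ing_words = re.findall(r'\b\w+ing\b', text)

print(chained == single)
print(ing_words)
```

The character class `[,;:.]` grows by one character per extra mark, whereas the chained version grows by one method call.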

In [None]:
import re    # Import the regular expressions package.

## Define pattern: words beginning with "r". 
## To break it down, the pattern works in four parts:
## 1. "\W" matches any non-word character (e.g. a space) before the word.
## 2. "r" specifies that "r" should follow that character.
## 3. "[a-z]" matches any lowercase letter.
## 4. "*" allows any number of repeats (including zero) of the previous.
## The r prefix marks a raw string, so the backslash reaches re unchanged.

pattern = r'\Wr[a-z]*'
print( 'Find words beginning with "r": %s' %re.findall(pattern,example_string) )

## Define pattern: words ending with "s" or "t".
## To break it down, the pattern works in five parts.
## 1. "\W" matches any non-word character before the word.
## 2. "[a-z]" matches any lowercase letter.
## 3. "*" allows any number of repeats (including zero) of the previous.
## 4. "[st]" matches either "s" or "t".
## 5. "\W" matches any non-word character after the word.

pattern = r'\W[a-z]*[st]\W'
print( 'Find words ending with "s" or "t": %s' %re.findall(pattern,example_string) )

In [None]:
import string

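## Find all punctuation. re.escape protects characters such as "]" and "\"
## that would otherwise have a special meaning inside the brackets.
pattern = '[%s]' %re.escape(string.punctuation)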

print(string.punctuation)
print( 'Find all punctuation: %s' %re.findall(pattern,example_string) )

In [None]:
## Find all digits (each digit is matched separately).
pattern = r'\d'
print('Find all numbers: %s' %re.findall(pattern, 'my favorite numbers are 4, 24, 48'))

We can also use regular expressions to find and replace multiple characters at once. This will form the basis of tokenization (discussed in greater detail below).

In [None]:
## Remove all punctuation (i.e. replace with '')
pattern = '[%s]' %re.escape(string.punctuation)
print(re.sub(pattern, '', example_string) )

### Reading and Writing Text Files
Though we have previously shown how to read and save data tables with Pandas, let's briefly discuss how to read and write text files with Python. To do this, we will use the built-in **open** function.

In [None]:
## Open file for writing.
write_file = open('example_lincoln.txt', 'w')      # The 'w' stands for write.

## Write lines to file.
write_file.write( example_string + '\n' )  # We add a newline character.

## Close file for writing.
write_file.close()

We can also append to a previously written file. Notice the change in syntax below. Using the "with" statement, we do not need to close the file explicitly. The code inside the "with" block is executed and, once the block completes, the file is closed automatically.

In [None]:
## Open file for appending.
with open('example_lincoln.txt', 'a') as append_file:
    
    append_file.write( re.sub(pattern, '', example_string) + '\n' ) # Append de-punctuated line.

We will now open the file for reading and print the lines in the file.

In [None]:
with open('example_lincoln.txt', 'r') as read_file:
    
    for line in read_file.readlines():
        
        print(line.strip())    # We use the "strip" attribute to remove whitespace/line breaks.
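
The three file modes used above can be seen side by side in the following sketch. It writes to a scratch file in a temporary directory (the file name is hypothetical) so that nothing in the working directory is touched, and it checks that the file really is closed once the "with" block ends.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory, so nothing is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'scratch.txt')

with open(path, 'w') as f:      # 'w' creates the file or truncates an existing one
    f.write('first line\n')

with open(path, 'a') as f:      # 'a' appends to the existing contents
    f.write('second line\n')

with open(path, 'r') as f:      # 'r' opens the file read-only
    lines = [line.strip() for line in f]

print(lines)
print(f.closed)   # the "with" block closed the file on exit
```

Iterating over the file object directly reads one line at a time, which avoids loading a large file into memory all at once.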

## Dataset: NSF Awards (1970-2016)
The National Science Foundation (NSF) has made public the records of every award it has granted from 1970 onwards, including the awarding NSF division, the awarded institution, the amount awarded, and, most importantly, the abstract of the awarded project. In this demonstration we will explore the temporal evolution of scientific topics, as well as the scientific topics that draw the most research funding from the NSF.

**NOTE:** The raw data (originally gathered from [here](https://www.nsf.gov/awardsearch/download.jsp)) are stored in XML format. Some code for parsing, reformatting, and saving XML data can be found in the 'nsf' folder in the *nsf_xml_parsing.ipynb* notebook.

Before turning to the text data, let's quickly characterize some trends in the funding data.

In [None]:
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import read_csv, concat
sns.set_style('white')
sns.set_context('notebook', font_scale=2.5)
%matplotlib inline

## Define years.
years = range(1970,2017)

## Iteratively load and concatenate grant dataframes.
grants = concat([read_csv(os.path.join('nsf',str(year),'grants.csv')) for year in years])

## Fix ID formatting (i.e. prepend 0s). Set as index.
grants.ID = grants.ID.apply(lambda i: '%0.7d' %i)
grants = grants.set_index('ID')

print(grants.shape)
grants.tail(2)

### Standardizing Variables: String Matching
Unfortunately there are inconsistencies in the naming conventions in this dataset. For example, look at the following unique categories in Directorate. It is readily apparent that abbreviations were used inconsistently, so that certain directorates now have multiple names despite being the same entity (e.g. 'Direct For Computer & Info Scie & Enginr' and 'Directorate for Computer & Information Science & Engineering'). 

In [None]:
for title in np.unique( grants.dropna().Directorate ):
    print(title)

We could manually recode these items with a series of np.where commands, but there may be a more elegant solution with **string matching** techniques. String matching techniques attempt to characterize the similarity of two strings. There are a number of metrics for computing the difference between two strings ([Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance), [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance), [Damerau–Levenshtein distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance), [Jaro–Winkler distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance)), but here we will use a tool built into vanilla Python: the **SequenceMatcher** class from the difflib module. 

From the SequenceMatcher docstring:
>SequenceMatcher is a flexible class for comparing pairs of sequences of any type. The basic algorithm dates back to Ratcliff's and Obershelp's "gestalt pattern matching". The basic idea is to find
the longest contiguous matching subsequence. The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence.  This does not yield minimal edit sequences, but does tend to yield matches that "look right" to people.

We can use this algorithm to compute a similarity score between any two strings (e.g. directorate titles).

In [None]:
from difflib import SequenceMatcher

def similar(a,b):
    return SequenceMatcher(None, a, b).ratio()

print('Silly comparison 1: %0.3f' %similar('the cat in the hat', 'the bat in the vat'))
print('Silly comparison 2: %0.3f' %similar('the cat in the hat', 'the dog on the log'))
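
To see where the similarity score comes from, the sketch below recomputes it by hand from the matching blocks. It relies on the documented definition of `ratio()` as 2*M/T, where M is the number of matched characters and T is the combined length of both strings. The variable names are illustrative.

```python
from difflib import SequenceMatcher

a, b = 'the cat in the hat', 'the dog on the log'
matcher = SequenceMatcher(None, a, b)

# Each matching block is a run of identical characters found in both strings.
# The final block is a zero-length sentinel, so it adds nothing to the sum.
blocks = matcher.get_matching_blocks()
matched = sum(block.size for block in blocks)

# ratio() is defined as 2*M/T: M matched characters, T total characters.
by_hand = 2.0 * matched / (len(a) + len(b))

print(by_hand, matcher.ratio())
```

A score of 1.0 means the strings are identical and 0.0 means they share no characters.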

We will use this algorithm to find the most likely replacements for the abbreviated titles.

In [None]:
## Extract titles.
titles_abbr = [d for d in grants.dropna().Directorate.unique() if d.startswith('Direct ')]
titles_full = [d for d in grants.dropna().Directorate.unique() if not d.startswith('Direct ')]

## Iteratively identify most similar string.
abbr_dict = dict()
for abbr in titles_abbr:
    abbr_dict[abbr] = titles_full[np.argmax([similar(abbr, full) for full in titles_full])]
    
## Print.
for abbr, full in abbr_dict.items():
    print('%s --> %s' %(abbr,full))

Perfect matches! We will use this information to update the DataFrame.

In [None]:
## Update directorates.
for abbr, full in abbr_dict.items():
    grants.loc[grants.Directorate==abbr,'Directorate'] = full
    
## Cheat a little.
grants.loc[grants.Directorate=='OFFICE OF THE DIRECTOR','Directorate'] = 'Office Of The Director'
grants.loc[grants.Directorate=='Directorate for Geosciences','Directorate'] = 'Directorate For Geosciences'

As can be seen below, we were successfully able to remove all of the duplicates by collapsing across identical directorates.

In [None]:
for title in np.unique( grants.dropna().Directorate ):
    print(title)

### Trends: Total Grants by Year
The total number of grants has not changed much since the 70s.

In [None]:
fig = plt.figure(figsize=(15,3))
ax = sns.countplot(x='Year', data=grants, color='#377eb8');
ax.set_xticklabels([str(y)[2:] for y in years], fontsize=14);
ax.set_yscale('log')
ax.set_title('NSF Total Grants by Year')
sns.despine()
plt.tight_layout()

### Trends: Total Funding by Year
Funding increased steadily until the [Great Recession (2008-2009)](https://www.nsf.gov/news/news_summ.jsp?cntn_id=129264), after which funding has steadily declined.

In [None]:
g = sns.catplot(x='Year', y='Funds', data=grants, estimator=np.sum, ci=None, 
               kind='bar', color='#377eb8', height=5, aspect=3)
g.ax.set_xticklabels([str(y)[2:] for y in years], fontsize=14)
g.ax.set_yscale('log')
g.ax.set_title('NSF Total Funding by Year');

### Trends: Total Grants by Directorate

In [None]:
fig = plt.figure(figsize=(15,3))
order = grants.Directorate.value_counts().index
ax = sns.countplot(x='Directorate', data=grants, color='#377eb8', order=order);
ax.set_xticklabels(order,rotation=-60, ha='left', fontsize=14);
ax.set_xlabel('')
ax.set_yscale('log')
ax.set_title('NSF Total Grants by Directorate')
sns.despine()

### Trends: Average Awarded Amount by Directorate

In [None]:
gb = grants.groupby('Directorate').Funds.mean()
factor_order = gb.index[np.argsort(gb)[::-1]]
g = sns.catplot(x='Directorate', y='Funds', data=grants, estimator=np.mean, ci=68, 
                kind='bar', color='#377eb8', order=factor_order, height=5, aspect=3);
g.set_xticklabels(rotation=-60, ha='left', fontsize=14);
g.ax.set_xlabel('')
g.ax.set_yscale('log')
g.ax.set_title('NSF Average Award Amount by Directorate');

### Trends: Total Grants Awarded by Institution

In [None]:
fig = plt.figure(figsize=(15,3))
order = grants.Institution.value_counts().index
ax = sns.countplot(x='Institution', data=grants, color='#377eb8', order=order[:10]);
ax.set_xticklabels(order[:10],rotation=-60, ha='left', fontsize=16);   # Label only the 10 plotted bars.
ax.set_xlabel('')
ax.set_yscale('log')
ax.set_title('NSF Total Grants by Institution')
sns.despine()

### Trends: Total Funding Awarded by Institution

In [None]:
gb = grants.groupby('Institution').Funds.sum()
order = gb.index[np.argsort(gb)[::-1]]
g = sns.catplot(x='Institution', y='Funds', data=grants, estimator=np.sum, ci=None, 
                kind='bar', color='#377eb8', order=order[:10], height=5, aspect=3);
g.set_xticklabels(rotation=-60, ha='left', fontsize=14);
g.ax.set_xlabel('')
g.ax.set_title('NSF Total Funding Awarded by Institution');

### Summary
A few general takeaways:
1. Though the total number of grants awarded by the NSF has not increased substantially over the past 4 decades, the total amount of funds allocated has steadily increased (up until the 2008-2009 recession at least). 
2. Not every granting directorate is equal: the Directorates of Mathematics, Physics, Engineering, and Biology award the most grants, whereas the Directorates of Scientific Coordination, Geosciences, and Education award the most on average. The Directorates of Polar Research and Information Management consistently award the least, and the least often.
3. Large public research institutions (e.g. Berkeley, Michigan, Washington) have been awarded the most grants, but are not necessarily the most highly funded institutions. 

Let us see if the text information encoded in the abstracts can be used to predict research funding, as well as if certain words predict different Directorates/Institutions.

## Natural Language Processing
In the next section we will summarize the steps involved in setting up a preprocessing pipeline for natural language processing tasks. Specifically, we will cover issues concerning text encoding, tokenizing, spell checking, stopwords, and stemming/lemmatization. We will then build a simple pipeline and apply it to the abstracts of NSF awarded grants. 

**NOTE:** The abstracts for each year are stored in the *abstracts.txt* file in the directory for that year. Within a file, each line holds one abstract, with its corresponding ID at the beginning of the line.

### Text Encoding
The first major hurdle in text processing is identifying the text encoding of the file. The encoding is the scheme that maps characters to the bytes stored on disk. If a file is decoded with the wrong encoding, the text comes out garbled or cannot be read at all. 

Fortunately, our abstract data has been encoded in the universal UTF-8. For files not encoded as such, however, the **codecs** module can be used to read and decode files of alternate encodings. To show how this goes wrong, we will try opening a file encoded in UTF-16.

In [None]:
import codecs

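## Attempt to open the UTF-16 file as UTF-8. This is expected to fail,
## typically with a UnicodeDecodeError (or else to yield garbled text).
with codecs.open('example_utf16.txt', 'r', encoding='utf-8') as f:
    line = f.readline()
    
print(line)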

This is easily fixed when we specify the correct encoding.

In [None]:
## Open file with UTF-16 encoding.
with codecs.open('example_utf16.txt', 'r', encoding='utf-16') as f:
    line = f.readline()
    
print(line)
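
The same failure can be reproduced without any file, which may help clarify what is going on. The sketch below encodes a short string (chosen for illustration) both ways and then tries to decode the UTF-16 bytes as UTF-8.

```python
# "naive cafe" with two accented characters, written with escapes to stay ASCII-safe.
text = 'na\u00efve caf\u00e9'

utf8_bytes = text.encode('utf-8')
utf16_bytes = text.encode('utf-16')   # starts with a two-byte byte-order mark

# The same ten characters occupy a different number of bytes in each encoding.
print(len(utf8_bytes), len(utf16_bytes))

# Decoding with the matching codec recovers the text.
roundtrip = utf16_bytes.decode('utf-16')

# Decoding UTF-16 bytes as UTF-8 fails: the byte-order mark is not valid UTF-8.
try:
    utf16_bytes.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError as err:
    decoded_ok = False
    print('Decoding failed:', err.reason)
```

The bytes on disk are identical in both attempts; only the rule used to interpret them changes.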

For the purposes of our demonstration, we will use the first abstract from 2016. Abstracts are stored with their IDs line-by-line. IDs are tab-separated from their corresponding abstract. 

In [None]:
## Open 2016 abstracts document and read first line (i.e. first abstract).
with codecs.open(os.path.join('nsf','2016','abstracts.txt'), 'r', encoding='utf-8') as f:
    abstract = f.readline()
    
## Separate ID and abstract.
ID, abstract = abstract.split('\t')    # "ID" avoids shadowing the built-in id().

## Use regular expressions to remove any remaining XML paragraph break tags.
pattern = '<br/>'
abstract = re.sub(pattern, '', abstract)

print(abstract)
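
As a hedged variation on the parsing step, the sketch below works on a made-up line in the same "ID, tab, abstract" layout. It assumes, purely for illustration, that an abstract might itself contain a tab or that sentences might be separated only by a break tag; neither is confirmed for the real files.

```python
# Hypothetical line in the "ID<TAB>abstract" layout described above.
line = '1600001\tThis project studies manifolds.<br/>It also trains students.\n'

# maxsplit=1 splits only at the first tab, so a stray tab inside the
# abstract text would not break the two-way unpacking.
award_id, abstract_text = line.rstrip('\n').split('\t', 1)

# Replacing break tags with a space keeps neighbouring sentences apart.
abstract_text = abstract_text.replace('<br/>', ' ')

print(award_id)
print(abstract_text)
```

Replacing the tag with an empty string, as in the cell above, is equally valid when the tags are already surrounded by whitespace.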

### Tokenizing
The next step is breaking down the abstract into its constituent elements, i.e. words. We do this using the tokenizing function from the Natural Language Toolkit (NLTK). The NLTK library is:
>... a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries.

In [None]:
import nltk
nltk.download_gui()

In [None]:
## Define path for NLTK to search for tokenizer information.
nltk.data.path.append(os.path.abspath('nltk_data'))

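The cells below tokenize in two steps. First, sent_tokenize splits the abstract into sentences. Second, a RegexpTokenizer splits each sentence into words: it accepts a regular expression describing which text to keep and discards the rest. Below we keep only word characters (letters, digits, and underscores) and hyphens.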

In [None]:
from nltk.tokenize import sent_tokenize

## Tokenize into sentences.
sentences = sent_tokenize( abstract )

for n, sentence in enumerate(sentences):
    print( '%s %s...' %(n,sentence[:70]) )

In [None]:
from nltk.tokenize import RegexpTokenizer

## Define regex inclusion pattern. This pattern accepts all word
## characters (letters, digits, underscore) and the hyphen character.
regex = r'[\w-]+'

## Initialize tokenizer with regular expression.
tokenizer = RegexpTokenizer(regex)

## Apply to sentences.
sentences = [tokenizer.tokenize(sentence) for sentence in sentences]

print(sentences[0])
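
For intuition, the word-level step can be approximated with the standard library alone. This sketch is not the NLTK implementation; it simply applies the same inclusion pattern with `re.findall` to an illustrative sentence.

```python
import re

sentence = 'Calabi-Yau manifolds arise in string theory (see, e.g., 10-dimensional models).'

# Keep runs of word characters (letters, digits, underscore) and hyphens;
# everything else acts as a separator and is discarded.
word_pattern = re.compile(r'[\w-]+')
tokens_sketch = word_pattern.findall(sentence)

print(tokens_sketch)
```

Note that hyphenated terms survive intact, while the abbreviation "e.g." is split into two single-letter tokens. Short fragments like these are one reason the full pipeline later drops tokens with fewer than 3 characters.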

### Part-of-Speech Tagging
**Part-of-speech (POS) tagging:** POS tagging is the linguistic process of identifying **parts of speech** (e.g. nouns, verbs, adjectives). NLTK has extensive [documentation](http://www.nltk.org/book/ch05.html) and many useful tools for decomposing sentences into their respective lexical categories. With this approach, we could identify proper nouns in sentences (NNP and NNPS tags) and drop them. Unfortunately, POS tagging is a non-trivial linguistic problem, with many corner cases and errors, and can also be computationally demanding. 

In [None]:
from nltk import pos_tag

## Tag the parts of speech of the first sentence.
pos_tag(sentences[0])

What do these tags mean? A full list is available [here](https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/).

|Class|Type|Tag|Example| 
|---|---|---|---|
|Noun|Singular<br/>Plural<br/>Proper (s)<br/>Proper (p)|NN<br/>NNS<br/>NNP<br/>NNPS|citizen<br/>citizens<br/>American<br/>Americans|
|Verb|Base<br/>Past<br/>Gerund<br/>Past participle<br/>Present|VB<br/>VBD<br/>VBG<br/>VBN<br/>VBP|take<br/>took<br/>taking<br/>taken<br/>take|
|Adjective|Base<br/>Comparative<br/>Superlative|JJ<br/>JJR<br/>JJS|big<br/>bigger<br/>biggest|
|Other|Conjunction<br/>Determiner<br/>Preposition|CC<br/>DT<br/>IN|and<br/>this, that<br/>in, of|

In [None]:
## Tokenize all sentences and store with associated POS.
tokens_pos = np.vstack([pos_tag(sentence) for sentence in sentences])

print(tokens_pos.shape)
print(tokens_pos[::15])
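
Once tokens carry tags, filtering by tag is a simple membership test. The sketch below uses hand-written (token, tag) pairs, which are hypothetical and not tagger output, to show how proper nouns could be dropped.

```python
# Hand-written (token, tag) pairs in the Penn Treebank style.
# In practice these would come from a tagger such as pos_tag.
tagged = [('Yau', 'NNP'), ('studied', 'VBD'), ('manifolds', 'NNS'),
          ('in', 'IN'), ('Boston', 'NNP')]

# Tags for singular and plural proper nouns.
proper_tags = {'NNP', 'NNPS'}

kept = [(token, tag) for token, tag in tagged if tag not in proper_tags]
print(kept)
```

The same test with `tag.startswith('V')` or `tag.startswith('N')` selects whole classes of tags, such as all verbs or all nouns.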

### Stop Words
One easy way to increase the informational content of your corpus is to remove highly common words (e.g. a, an, and, of, this, that, etc.). These are likely to appear in every document and thus have little explanatory power. Many packages (NLTK, SpaCy, Scikit-Learn) have built-in **stop words**, i.e. a list of the most frequently appearing words for a given language. Below we will use Scikit-Learn's list.

In [None]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
stop_words = list(ENGLISH_STOP_WORDS)

print(stop_words[:20])

We will apply the stoplist to our tokens using the in1d command from NumPy. After applying stopwords, the abstract is reduced by approximately 75 words.

In [None]:
## Define function to lowercase all strings.
lower = np.vectorize(lambda s: s.lower())

## Match all tokens in stoplist.
ix = np.in1d(np.apply_along_axis(lower, 0, tokens_pos[:,0]), stop_words)

## Remove all stopwords (i.e. those in tokens matching stopwords)
tokens_pos = tokens_pos[~ix]

print('Total tokens after stop words: %s' %len(tokens_pos))
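
The filtering logic itself is independent of NumPy. A plain-Python sketch, using a tiny hypothetical stoplist rather than Scikit-Learn's, looks like this:

```python
# Tiny hypothetical stoplist; real lists contain a few hundred entries.
stoplist = {'the', 'of', 'and', 'a', 'in', 'is', 'this'}

words = ['This', 'project', 'studies', 'the', 'geometry', 'of', 'manifolds']

# Lowercase before the lookup so that "This" matches "this".
content_words = [w for w in words if w.lower() not in stoplist]

print(content_words)
```

Storing the stoplist as a set keeps each lookup fast regardless of how long the list is.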

As a quick note, stoplists largely remove "glue" words (i.e. conjunctions, determiners, prepositions). We can confirm this by looking at the types of tags left over after applying the stop words. As can be seen below, what largely remains are nouns, verbs, adjectives, and adverbs.

In [None]:
print(np.unique(tokens_pos[:,1]))

### Stemming and Lemmatization
A second way to increase the informational content of your corpus (though far less simple than utilizing stop words) is through stemming and lemmatizing your tokens. The [Stanford NLP group](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) provides the following definitions:
* **Stemming:** refers to a process that removes the ends of words, thereby reducing them to their stems. This most frequently includes the removal of derivational affixes (e.g. joyful --> joy).
* **Lemmatization:** refers to a process that converts words so as to remove inflectional endings only and to return the base or dictionary form of a word, i.e. its lemma (e.g. walks, walked, walking --> walk). 
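
The difference between the two definitions can be illustrated with deliberately crude stand-ins. Neither function below is a real stemmer or lemmatizer: the suffix list and the lookup table are hypothetical and exist only to contrast rule-based trimming with dictionary lookup.

```python
def toy_stem(word, suffixes=('ful', 'ing', 'ed', 's')):
    """Strip the first matching suffix (a crude stand-in for a real stemmer)."""
    for suffix in suffixes:
        # Require at least three remaining characters so short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Hypothetical lookup table standing in for a real lemma dictionary.
toy_lemmas = {'walks': 'walk', 'walked': 'walk', 'walking': 'walk', 'mice': 'mouse'}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word unchanged."""
    return toy_lemmas.get(word, word)

for word in ['joyful', 'walking', 'mice']:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The rule-based version handles "joyful" without any dictionary but cannot relate "mice" to "mouse"; the lookup version does the reverse.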

There are several methods of stemming/lemmatizing implemented by different modules and functions. For example, NLTK and SpaCy both include stemming/lemmatizing functions which produce slightly different outputs. We provide examples of some differences below.

In [None]:
from pandas import DataFrame
from nltk.stem import LancasterStemmer, PorterStemmer, WordNetLemmatizer

## Separate tokens from POS-tags.
tokens, tags = tokens_pos.T

## Initialize and vectorize functions.
ls = np.vectorize(LancasterStemmer().stem)
ps = np.vectorize(PorterStemmer().stem)
wnl = np.vectorize(WordNetLemmatizer().lemmatize)

## Apply and store outputs.
outputs = DataFrame(np.vstack([tokens, tags, np.apply_along_axis(ls, 0, tokens), 
                               np.apply_along_axis(ps, 0, tokens),
                               np.apply_along_axis(wnl, 0, tokens)]).T,
                     columns=('Token','Tag','Lancaster','Porter','WordNet'))

## Compare and print agreement.
for col in outputs.columns[2:]:
    print('%s agreement: %0.3f' %(col, (outputs['Token']==outputs[col]).mean()))
    
outputs.iloc[::12]

Each approach has its own strengths and weaknesses. The WordNetLemmatizer is the most conservative, largely only de-pluralizing certain noun tokens. LancasterStemmer is the most liberal in its pruning, occasionally trimming so much that the word changes entirely (e.g. population --> pop, elementary --> el). The PorterStemmer seems to be somewhere in the middle, stemming many words without rendering them unintelligible (e.g. dimensional --> dimension, population --> popul). It is again worth emphasizing that there is no one correct solution; what works best depends on the corpus and the goal of the analysis.

In [None]:
## Extract all verbs from tokens.
verbs = tokens[[tag.startswith('V') for tag in tags]]

## Apply PorterStemmer to verbs.
verbs_ps = np.apply_along_axis(ps, 0, verbs)

print(verbs[::5])
print(verbs_ps[::5])

In [None]:
## Extract all plural nouns from tokens.
nouns = tokens[[tag.startswith('N') and tag.endswith('S') for tag in tags]]

## Apply WordNetLemmatizer to plural nouns.
nouns_wnl = np.apply_along_axis(wnl, 0, nouns)

print(nouns)
print(nouns_wnl)

### Spell Checking
Though this step is not strictly necessary, and will not be used below, there are options for spell-checking in Python. Peter Norvig has written an incredible [tutorial for spellchecking in python](http://norvig.com/spell-correct.html) that builds intuition for how a spell checker can be written by hand with minimal code. We will instead cheat and use the **pyenchant** package, a spellchecking library for Python [(documentation)](https://pythonhosted.org/pyenchant/).

In [None]:
import enchant

## Define the language of use (i.e. American English).
## Possible languages include: English, German, French, and more.
spellcheck = enchant.Dict('en_US')

The *check* method will look up a specified string to see if it is correctly spelled.

In [None]:
print( spellcheck.check('Hello') )
print( spellcheck.check('Helo') )

The *suggest* method will provide a list of possible correct spellings.

In [None]:
print( spellcheck.suggest('Helo') )

In our case, let's check to see which words it does not recognize. As can be seen, the spellchecker breaks down for proper nouns.

In [None]:
[s for s in tokens if not spellcheck.check(str(s))]

If we were to want to replace these with a suggested word, the recommended words for Archimedean are accurate; the first suggested word is its root, Archimedes. This is not terribly surprising given its status as a better-known proper noun.

In [None]:
spellcheck.suggest('Archimedean')

Unfortunately Calabi (of the less frequently encountered [Calabi–Yau manifold](https://en.wikipedia.org/wiki/Calabi%E2%80%93Yau_manifold)) does not possess the same fame. Its top recommendation is calabash, "an evergreen tropical American tree that bears fruit in the form of large woody gourds."

In [None]:
spellcheck.suggest('Calabi')
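
For intuition on how suggestion lists like these can arise, here is a minimal sketch in the spirit of Norvig's tutorial. It is not how pyenchant works internally. The vocabulary is a tiny hypothetical set, and only candidates a single edit away are considered.

```python
import string

def edits1(word):
    """All strings one edit away: a deletion, transposition, replacement, or insertion."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

# Hypothetical vocabulary; a real checker would load a full word list.
vocabulary = {'hello', 'help', 'held', 'halo', 'world'}

def suggest(word):
    """Return the word itself if known, else known words one edit away."""
    word = word.lower()
    if word in vocabulary:
        return [word]
    return sorted(edits1(word) & vocabulary)

print(suggest('Helo'))
print(suggest('Calabi'))
```

A word such as "Calabi" gets no suggestions here because nothing in the small vocabulary lies within one edit. A dictionary-backed checker faces the same limitation with rare proper nouns, only at a larger scale.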

### Putting it all together

In [None]:
import os, codecs, re
import nltk, enchant
import numpy as np
from pandas import read_csv, concat
from nltk import pos_tag
from nltk.tokenize import RegexpTokenizer, sent_tokenize
from nltk.stem import LancasterStemmer, PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Define parameters.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

## Define years.
years = range(2006,2017)

## Define path for NLTK to search for tokenizer information.
nltk.data.path.append(os.path.abspath('nltk_data'))

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Load grant information.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

## Iteratively load and concatenate grant dataframes.
grants = concat([read_csv(os.path.join('nsf',str(year),'grants.csv')) for year in years])

## Fix ID formatting (i.e. prepend 0s). Set as index.
grants.ID = grants.ID.apply(lambda i: '%0.7d' %i)
grants = grants.set_index('ID')

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Define functions.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

## Define functions for stop words.
lower = np.vectorize(lambda s: s.lower())
stop_words = list(ENGLISH_STOP_WORDS)

## Define functions for pipeline.
def parse_abstracts(filepath):
    
    ## Open file.
    with codecs.open(filepath, 'r', encoding='utf-8') as file:
        
        ## Iterate over abstracts.
        for line in file.readlines():
            
            ## Split into ID / abstract.
            try:
                ID, abstract = line.strip().split('\t')
            except ValueError:
                continue   
            
            abstract = re.sub('<br/>', '', abstract)           
            
            yield(ID,abstract)
            
def tokenize_abstract(abstract, pattern):
    
    ## Sentence tokenize.
    sentences = sent_tokenize(abstract)
    
    ## Initialize word tokenizer.
    tokenizer = RegexpTokenizer(pattern)
    
    ## Word tokenize. 
    tokens = [tokenizer.tokenize(sentence) for sentence in sentences]
    tokens = [sentence for sentence in tokens if sentence]
    return(tokens)

def pos_tag_abstract(tokens):
    tokens, tags = np.vstack([pos_tag(sentence) for sentence in tokens]).T
    return(tokens, tags)

def stem_verbs(tokens, tags, stemmer=PorterStemmer):
    
    ## Initialize stemmer from the class passed in.
    stem = np.vectorize(stemmer().stem)
    
    ## Find indices of verbs.
    ix = [tag.startswith('V') for tag in tags]
    
    ## Apply stemmer.
    tokens[ix] = np.apply_along_axis(stem, 0, tokens[ix])
    return(tokens)

def lemmatize_nouns(tokens, tags, lemmatizer=WordNetLemmatizer):
    
    ## Initialize lemmatizer from the class passed in.
    lemmatize = np.vectorize(lemmatizer().lemmatize)
    
    ## Find indices of nouns.
    ix = [tag.startswith('N') for tag in tags]
    
    ## Apply lemmatizer.
    tokens[ix] = np.apply_along_axis(lemmatize, 0, tokens[ix])
    return(tokens)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Main loop.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

IDs = []
for year in years:
                   
    ## Initialize generator.
    filepath = os.path.join('nsf', str(year), 'abstracts.txt')
    G = parse_abstracts(filepath)
    
    for ID, abstract in G: 
        
        ## Tokenize abstract.
        tokens = tokenize_abstract(abstract, r'[\w-]+')
        
        ## POS-tag abstract.
        tokens, tags = pos_tag_abstract(tokens)

        ## Remove all tokens with fewer than 3 characters.
        stop_ix = np.array([True if len(token) < 3 else False for token in tokens])
        tokens, tags = tokens[~stop_ix], tags[~stop_ix]
        if not len(tokens): continue
        
        ## Apply stopwords.
        stop_ix = np.in1d(np.apply_along_axis(lower, 0, tokens), stop_words)
        tokens, tags = tokens[~stop_ix], tags[~stop_ix]
        if not len(tokens): continue
        
        ## Remove proper nouns.
        stop_ix = np.in1d(tags, ['NNP','NNPS'])
        tokens, tags = tokens[~stop_ix], tags[~stop_ix]
        if not len(tokens): continue

        ## Stem/lemmatize tokens.
        ## Try/except to handle when there are no such tags.
        try: tokens = lemmatize_nouns(tokens, tags)
        except ValueError: pass
        
        try: tokens = stem_verbs(tokens, tags)
        except ValueError: pass
        
        ## Write to file.
        out_path = os.path.join('nsf','tokenized', '%s.txt' %ID)
        with codecs.open(out_path, 'w', encoding='utf-8') as out_file:
            line = ' '.join(tokens).lower()
            out_file.write(line)
            
        ## Append ID.
        IDs.append(ID)
        
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Limit grants to included abstracts.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

## Restrict.
grants = grants.loc[IDs]

## Save.
grants.to_csv(os.path.join('nsf', 'abstracts_metadata.csv'))

print('Done.')

## Machine Learning
It is easy to train the same sorts of machine learning models we previously encountered on text data. What is necessary, of course, is converting the text data into some numeric format for modeling purposes. Scikit-Learn has a number of functions for accomplishing this. 

### Rescaling Text Data, Part 1
The most straightforward way of rescaling text data to numeric format is to simply count the occurrence of words across documents. In doing so, we end up with an NxM matrix, where N is the number of documents and M is the number of unique tokens. The **CountVectorizer** class from Scikit-Learn performs just this function. Before applying CountVectorizer to the NSF data, let's first demonstrate its utility with a toy dataset.

Our dataset will consist of 20 features drawn from 2 categories: {A01, A02, A03, ..., A10} and {B01, B02, B03, ..., B10}. To generate this dataset, we will construct a transition matrix which dictates the probability of encountering, in a document, Token2 given Token1. The tokens from category A will have slightly higher probabilities towards other category A members, and likewise for category B. 

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
sns.set_context('notebook', font_scale=2)
%matplotlib inline

## Define homophily index.
hindex = 3.0

## Define tokens.
tokens = ['A%0.2d' %n for n in np.arange(10)+1] + ['B%0.2d' %n for n in np.arange(10)+1]

## Define transition probability matrix.
trans_mat = np.hstack( [np.ones((10,10))*hindex, np.ones((10,10))] )
trans_mat = np.vstack( [trans_mat, trans_mat[:, ::-1]] )
trans_mat[np.diag_indices_from(trans_mat)] = 0
trans_mat = np.apply_along_axis(lambda arr: arr / arr.sum(), 1, trans_mat)

## Plot.
fig = plt.figure(figsize=(8,6))
ax = sns.heatmap(trans_mat, square=True, xticklabels=tokens, yticklabels=tokens, 
                 cbar_kws={'label': 'Transition Probability'});
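
The probabilities in the heatmap can be checked with a little arithmetic. The sketch below redoes the row normalization by hand using exact fractions. It uses an integer `hindex` of 3 in place of the 3.0 above so that the fractions stay exact.

```python
from fractions import Fraction

hindex = 3            # weight for tokens in the same category
n_per_category = 10   # tokens per category

# A row holds 9 same-category weights (the diagonal is zeroed out)
# and 10 cross-category weights of 1.
row_total = (n_per_category - 1) * hindex + n_per_category * 1

p_same_each = Fraction(hindex, row_total)    # one specific same-category token
p_other_each = Fraction(1, row_total)        # one specific other-category token

p_same_total = (n_per_category - 1) * p_same_each
p_other_total = n_per_category * p_other_each

print(p_same_each, p_other_each)
print(p_same_total, p_other_total)
```

Under these settings the next token stays within the current category with probability 27/37, or roughly 73% of the time.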

With this transition probability matrix generated, we will generate 100 documents. Each document starts with one of the tokens in turn, and new tokens are sampled from the transition matrix until the document reaches a randomly drawn length J (an integer between 2 and 14).

In [None]:
np.random.seed(47404)

## Generate 100 random integers.
doc_lengths = np.random.randint(2,15,100)

## Generate documents.
documents = []
for token, doc_length in zip(tokens*5, doc_lengths):
    
    ## Initialize document.
    document = [token]
    
    ## Iteratively add tokens to document.
    while len(document) < doc_length:
        
        ix = tokens.index(document[-1])
        document.append( np.random.choice(tokens, 1, p=trans_mat[ix])[0] )
        
    ## Append document to documents.
    document = ' '.join(document)
    documents.append(document)

documents[:3]

We will use **CountVectorizer** to convert these documents into their numeric representation.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

## Initialize CountVectorizer class.
cvec = CountVectorizer(lowercase=False)

## Fit to documents.
cfit = cvec.fit(documents)
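## NOTE: get_feature_names_out requires scikit-learn >= 1.0 
## (older versions use get_feature_names).
print(cfit.get_feature_names_out())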

In [None]:
## Use fit_transform to generate counts.
cmat = cvec.fit_transform(documents)
cmat
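
To make the layout of this matrix concrete, here is a minimal sketch on three hand-written documents. The documents and the `toy_` names are made up for illustration and are not part of the toy dataset above. Each row is a document, each column is a token, and each entry is a count.

```python
from sklearn.feature_extraction.text import CountVectorizer

## Three made-up documents (illustrative only).
toy_docs = ['A01 B02 A01', 'B02 B03', 'A01']

## Fit and transform in one step.
toy_cvec = CountVectorizer(lowercase=False)
toy_mat = toy_cvec.fit_transform(toy_docs)

## Recover the column order from the fitted vocabulary (token -> column index).
toy_vocab = sorted(toy_cvec.vocabulary_, key=toy_cvec.vocabulary_.get)
print(toy_vocab)
print(toy_mat.toarray())
```

The first row should show that A01 occurs twice and B02 once in the first document. The matrix is stored in sparse format, so only the non-zero counts take up memory.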

In [None]:
## Plot.
fig = plt.figure(figsize=(8,6))
ax = sns.heatmap(cmat.toarray(), xticklabels=cfit.get_feature_names_out(), 
                 yticklabels='', cbar_kws={'label': 'Token Counts'});
ax.set_ylabel('Documents')

In [None]:
from pandas import DataFrame
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score

## Define dependent variable.
y = np.array([0 if doc.startswith('A') else 1 for doc in documents])

## Initialize logistic regression w/ cross-validation class.
lrcv = LogisticRegressionCV(Cs=[1.], cv=5, random_state=0)

## Fit.
lrf = lrcv.fit(cmat, y)

## Print scores.
print('Average classification score: %0.2f' %lrf.scores_[1].mean()) 

## Plot coefficients.
fig = plt.figure(figsize=(12,4))
ax = sns.barplot(x=cfit.get_feature_names_out(), y=lrf.coef_.flatten(), 
                 hue=[token[0] for token in tokens])
ax.hlines(0,-0.1,20.1)
ax.set_xticklabels(ax.get_xticklabels(), fontsize=18)
ax.set_title('CV-Logistic Regression Coefficients')
sns.despine()

### Rescaling NSF Data
Using all abstracts between 2006 and 2016, let us attempt to predict Directorate category from the tokenized text alone. Just as before, the first step will be to perform frequency counts of the tokens within documents. 

To produce the frequency counts, we will still use CountVectorizer but we will initialize it to read data separated into files.

Because running CountVectorizer on the full corpus can be computationally intensive, and thus slow, we have precomputed the sparse matrix. We will read it in and reconstruct it using NumPy and SciPy.

In [2]:
import os
import numpy as np
from scipy.sparse import csr_matrix

## Load compressed numpy file.
npz = np.load(os.path.join('nsf','NSF_cvec.npz'))
features = npz['features']

## Build sparse matrix.
cmat = csr_matrix((npz['data'], npz['indices'], npz['indptr']), npz['shape'])
cmat

<134244x146663 sparse matrix of type '<class 'numpy.int64'>'
	with 16568090 stored elements in Compressed Sparse Row format>

Let's plot the distribution of tokens. 

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
sns.set_context('notebook', font_scale=2)
%matplotlib inline

## Compute feature sums (convert into NumPy array).
feature_counts = np.array(cmat.sum(axis=0)).squeeze()

## Plot
fig, ax = plt.subplots(1,1,figsize=(6,4))
ax.plot( np.sort(feature_counts)[::-1], linewidth=3 )
ax.set_xticks([0,15e4])
ax.set_xlabel('Token Rank')
ax.set_yscale('log')
ax.set_ylabel('Token Count')
ax.set_title('NSF Abstracts (2006-2016)')
sns.despine()
plt.tight_layout()

The token counts are heavy-tailed, consistent with a power law distribution: a small number of tokens account for most occurrences. We can create a wordcloud visualizing the most common words in our corpus. (See [here](https://amueller.github.io/word_cloud/index.html) for more examples.)

In [None]:
from wordcloud import WordCloud

## Construct token-count dictionary.
token_counts = dict()
for feature, feature_count in zip(features, feature_counts):
    token_counts[feature] = feature_count
    
## Initialize and generate wordcloud from token-counts.
wordcloud = WordCloud(background_color='white', 
                      max_font_size=40).generate_from_frequencies(token_counts)

## Print top 10 tokens.
for k, v in sorted(token_counts.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print('%s: %s' %(k,v))
    
## Display wordcloud.
plt.figure(figsize=(12,8))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Next we will load in the metadata associated with the abstracts.

In [None]:
from pandas import read_csv

## Load metadata.
metadata = read_csv(os.path.join('nsf','abstracts_metadata.csv'))
metadata.Directorate.value_counts()

Before fitting the model, we will remove the categories with low counts.

In [None]:
## Define major (frequent) directorate categories.
directorates = [d for d in metadata.Directorate.astype(str).unique() 
                if d.startswith('Directorate') or d.endswith('Director')]

## Reduce sparse matrix & DataFrame to major categories.
ix = np.isin(metadata.Directorate, directorates)
cmat = cmat[ix]
metadata = metadata.loc[ix].copy()

print(cmat.shape)
metadata.Directorate.value_counts()

We should also remove low frequency tokens. Let's visualize how many singletons and near-singletons there are.

In [None]:
## Digitize frequency counts into bins: 1, 2, 3, ..., 9, >=10
bins = np.arange(10)
cats = np.digitize(feature_counts, bins, right=True)

## Count instances.
bins, counts = np.unique(cats, return_counts=True)

## Visualize.
fig = plt.figure(figsize=(8,4))
ax = sns.barplot(x=bins, y=counts, color='#377eb8')
ax.set(xlabel='Token Frequency', ylabel='Count', yticks=[0,2e4,4e4,6e4])
ax.set_xticklabels([1,2,3,4,5,6,7,8,9,r'$\geq$10'])
sns.despine()
plt.tight_layout()
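
The binning above relies on `np.digitize` with `right=True`. As a quick sketch of how that call assigns bins, here it is applied to a handful of made-up counts (the `toy_` names are illustrative only):

```python
import numpy as np

## Made-up token counts (illustrative only).
toy_counts = np.array([1, 2, 9, 10, 500])

## With right=True, a count c between 1 and 9 lands in bin c,
## and every count of 10 or more lands in bin 10.
toy_cats = np.digitize(toy_counts, np.arange(10), right=True)
print(toy_cats)
```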

Let's remove singletons only, i.e. all tokens occurring only once.

In [None]:
## Find all features occurring only once.
singleton_ix = feature_counts == 1

## Remove from feature names and sparse matrix.
cmat = cmat[:,~singleton_ix]
features = features[~singleton_ix]
cmat.shape

### Predicting NSF Data: Naive Bayes Classifiers
To predict directorate category from our token data, we will rely on Naive Bayes classification. Naive Bayes classifiers are similar to linear models, but they are faster to train and especially efficient for sparse data (such as text data). They achieve this efficiency by summarizing the relationship between each feature and each outcome with simple per-class statistics. In the case of binary/multinomial classification, this amounts to counting the number of times a feature appears with a given outcome. Given that computing counts (i.e. summing) is incredibly cheap, Naive Bayes classifiers can achieve high efficiency even with high-dimensional data.
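
As a rough illustration of the counting step, the sketch below sums token counts within each class for a tiny made-up count matrix. The arrays and class labels are hypothetical and only meant to show the arithmetic.

```python
import numpy as np

## Made-up count matrix: 4 documents x 3 tokens (illustrative only).
toy_X = np.array([[2, 0, 1],
                  [1, 0, 0],
                  [0, 3, 1],
                  [0, 1, 1]])
toy_y = np.array(['bio', 'bio', 'edu', 'edu'])

## Sum the counts of each token over the documents of each class.
toy_classes = np.unique(toy_y)
toy_class_counts = np.vstack([toy_X[toy_y == c].sum(axis=0) for c in toy_classes])
print(toy_classes)
print(toy_class_counts)
```

Each row of the result holds the total count of every token within one class, which is all the classifier needs to store about the training data.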

Naive Bayes classifiers rely on one parameter: alpha. Alpha, as before, controls model complexity; here it acts as a smoothing parameter. Increasing alpha instructs the classification algorithm to add more virtual observations to every feature, thereby smoothing the counts and reducing model complexity. Performance is relatively insensitive to the setting of alpha; however, tuning it usually improves accuracy somewhat.
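
The effect of alpha can be sketched with additive smoothing, where the estimated probability of a token within a class is (count + alpha) / (class total + alpha × number of tokens). The helper `smoothed_probs` and the counts below are made up for illustration, and this is a simplified sketch rather than the library's implementation.

```python
import numpy as np

def smoothed_probs(class_counts, alpha):
    """Additive smoothing of per-class token counts (rows = classes)."""
    class_counts = np.asarray(class_counts, dtype=float)
    n_features = class_counts.shape[1]
    totals = class_counts.sum(axis=1, keepdims=True)
    return (class_counts + alpha) / (totals + alpha * n_features)

## Made-up per-class token counts: 2 classes x 3 tokens.
toy_class_counts = np.array([[3, 0, 1],
                             [0, 4, 2]])

## Larger alphas pull every row towards a uniform distribution.
for alpha in [1, 100, 10000]:
    print(alpha, smoothed_probs(toy_class_counts, alpha).round(3))
```

With a small alpha the probabilities follow the observed counts closely. With a very large alpha the added observations swamp the data, and every token looks equally likely in every class.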

As before, we will test out several levels of alpha to find an optimal set of model parameters. **NOTE:** This may take several minutes to run. 

In [None]:
from pandas import DataFrame
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score, StratifiedKFold

## Define variables.
X = cmat
y = metadata.Directorate

## Define alphas.
alphas = np.power(10., np.arange(5))
n_alphas = alphas.shape[0]

## Initialize cross validation method.
n_splits = 10
kfold = StratifiedKFold(n_splits=n_splits)

## Main loop.
scores = np.empty((n_alphas, n_splits))
for n in range(n_alphas):
    
    ## Initialize MultinomialNB
    mnb = MultinomialNB(alpha=alphas[n])
    
    ## Compute scores.
    scores[n] = cross_val_score(mnb, X, y, cv=kfold)

## Assemble into DataFrame. Melt.
df = DataFrame(scores.T, columns=alphas)
df = df.melt(var_name='Alpha', value_name='Score')

## Plot.
fig = plt.figure(figsize=(8,4))
sns.swarmplot(x='Alpha', y='Score', data=df)
sns.despine()
plt.tight_layout()

It appears that the out-of-the-box Naive Bayes classifier (i.e. alpha = 1) performs close to the best, if not the best. In any case, higher alpha values dramatically worsen model performance. Using an 80/20 split, let's fit a classifier to the data and see which tokens best predict each category. 

In [None]:
from sklearn.model_selection import train_test_split

## Split data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

## Initialize MultinomialNB
mnb = MultinomialNB(alpha=1)

## Fit model.
mnb_fit = mnb.fit(X_train, y_train)

## Print scores.
print('MultinomialNB train: score = %0.3f' %mnb_fit.score(X_train,y_train))
print('MultinomialNB test: score = %0.3f' %mnb_fit.score(X_test,y_test))

print('Feature count matrix: (%s, %s)' %mnb_fit.feature_count_.shape)
mnb_fit.feature_count_

In [None]:
## Extract most common features per category.
for n, feature_class in enumerate(mnb_fit.classes_):
    
    ## Get sorted indices of counts.
    ix = np.argsort(mnb_fit.feature_count_[n])[::-1]
    
    ## Print feature class + top 10 features.
    print(feature_class)
    print(' '.join(features[ix][:10]) + '\n' )

As can be seen, the top words per feature class are mixed in terms of their representativeness. Certain tokens seem fairly representative of the top-10 lists to which they belong (e.g. ["plant", "specie", "gene"] for Biological Sciences; ["Education", "Teacher"] for Education and Human Resources). Other tokens appear in the top-10 list of every outcome class and are hardly uniquely representative (e.g. "research", "project", "student"). Below, we address some methods of improving this.

### Rescaling Text Data, Part 2
Beyond simple token-frequency scaling, CountVectorizer also allows for **n-gram tokenization.** In n-gram tokenization, sequences of n consecutive tokens are counted rather than single tokens alone. This can be useful for capturing recurring phrases (e.g. "chaos theory", "dynamical systems", "decision making"). This can be set with the ngram_range parameter.
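
A minimal sketch of the `ngram_range` setting on one made-up sentence follows. The sentence and variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

## One made-up document (illustrative only).
toy_docs = ['chaos theory describes dynamical systems']

## ngram_range=(1, 2) counts single tokens and pairs of adjacent tokens.
ngram_cvec = CountVectorizer(ngram_range=(1, 2))
ngram_mat = ngram_cvec.fit_transform(toy_docs)

## The vocabulary now contains phrases such as "chaos theory".
print(sorted(ngram_cvec.vocabulary_))
```

Note that adding bigrams grows the vocabulary quickly, so on a large corpus the resulting matrix can be much wider than the unigram one.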

A second way of scaling the text employs the **term frequency–inverse document frequency (tf-idf)** method. From Introduction to Machine Learning with Python:
>The intuition of this method is to give high weight to a term that appears
often in a particular document, but not in many documents in the corpus. If a word
appears often in a particular document, but not in very many documents, it is likely
to be very descriptive of the content of that document.

Scikit-learn implements the tf-idf method in two classes: **TfidfTransformer**, which takes in the sparse matrix output produced by CountVectorizer and transforms it, and **TfidfVectorizer**, which takes in the text data and does both the bag-of-words feature extraction and the tf-idf transformation.
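
As a rough sketch of the arithmetic, the block below computes tf-idf weights by hand with NumPy for a made-up count matrix. It assumes a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, followed by scaling each row to unit Euclidean length. As far as we are aware this matches the default settings of TfidfTransformer, but other variants of tf-idf exist.

```python
import numpy as np

## Made-up count matrix: 3 documents x 3 tokens (illustrative only).
toy_counts = np.array([[2, 1, 0],
                       [0, 1, 1],
                       [1, 1, 0]], dtype=float)

## Document frequency: number of documents containing each token.
n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)

## Smoothed inverse document frequency (assumed variant).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

## Weight counts by idf, then scale each row to unit length.
toy_tfidf = toy_counts * idf
toy_tfidf = toy_tfidf / np.linalg.norm(toy_tfidf, axis=1, keepdims=True)
print(idf.round(3))
print(toy_tfidf.round(3))
```

The second token appears in every document and so receives the lowest idf, while the third token appears in only one document and receives the highest.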

In [3]:
from sklearn.feature_extraction.text import TfidfTransformer

## Initialize TfidfTransformer.
transformer = TfidfTransformer( )

## Apply to sparse CountVec matrix.
tfidf_mat = transformer.fit_transform(cmat)
print(tfidf_mat)

  (0, 70441)	0.113637674138
  (0, 120754)	0.202854530521
  (0, 125075)	0.0838692649594
  (0, 89630)	0.291782184402
  (0, 55271)	0.0455057467047
  (0, 111429)	0.135244837884
  (0, 117220)	0.0759391518063
  (0, 78014)	0.105373551151
  (0, 124512)	0.140751945834
  (0, 39420)	0.0890407222174
  (0, 86048)	0.0841323981938
  (0, 107687)	0.212514989274
  (0, 44772)	0.346016980525
  (0, 60256)	0.215655927431
  (0, 120530)	0.0778200152808
  (0, 121130)	0.0466794657949
  (0, 22340)	0.164603068423
  (0, 144249)	0.0446698649464
  (0, 117225)	0.0824308011862
  (0, 52494)	0.0522343506536
  (0, 37803)	0.0514224680389
  (0, 111620)	0.0487696173513
  (0, 133972)	0.0669435303252
  (0, 124880)	0.0558466794757
  (0, 43763)	0.0296783622264
  :	:
  (134243, 78906)	0.0909763412385
  (134243, 145156)	0.0837730056086
  (134243, 61019)	0.131948252761
  (134243, 76582)	0.102985775905
  (134243, 31828)	0.0904026823137
  (134243, 136175)	0.109341027497
  (134243, 29831)	0.149211122458
  (134243, 55627)	0.1338682636

Let's directly compare the performance of the two means of normalizing the data.

**NOTE:** This may take a minute to run.

In [None]:
## Initialize cross validation method.
n_splits = 20
## NOTE: random_state only takes effect when shuffle=True.
kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    
## Initialize MultinomialNB
mnb = MultinomialNB(alpha=1)
    
## Compute scores.
scores = np.empty((n_splits,2))
scores[:,0]= cross_val_score(mnb, cmat, y, cv=kfold)         # CountVec
scores[:,1]= cross_val_score(mnb, tfidf_mat, y, cv=kfold)    # TF-IDF

In [None]:
## Assemble into DataFrame. Melt.
df = DataFrame(scores, columns=['CountVec','TF-IDF'])
df = df.melt(var_name='Method', value_name='Score')

## Plot.
fig = plt.figure(figsize=(8,4))
sns.boxplot(x='Method', y='Score', data=df)
sns.despine()
plt.tight_layout()

Let's show the most representative tokens now.

In [None]:
## Define variables.
X = tfidf_mat
y = metadata.Directorate

## Split data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

## Initialize MultinomialNB
mnb = MultinomialNB(alpha=1)

## Fit model.
mnb_fit = mnb.fit(X_train, y_train)

## Print scores.
print('MultinomialNB train: score = %0.3f' %mnb_fit.score(X_train,y_train))
print('MultinomialNB test: score = %0.3f' %mnb_fit.score(X_test,y_test))

In [None]:
## Extract most common features per category.
for n, feature_class in enumerate(mnb_fit.classes_):
    
    ## Get sorted indices of counts.
    ix = np.argsort(mnb_fit.feature_count_[n])[::-1]
    
    ## Print feature class + top 10 features.
    print(feature_class)
    print(' '.join(features[ix][:10]) + '\n' )

As can be seen, many of the tokens in each directorate's top-10 list are now more representative of its respective discipline. 

### Topic Modeling

In [4]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=100, learning_method="online", random_state=0)
lda_fit = lda.fit(tfidf_mat)