# A Very Brief Intro to Natural Language Processing

## Regular expressions

A quick and flexible way to find patterns in strings is to use a library that matches regular expressions. In Python, the built-in `re` module does the trick.

In [1]:
import re

Here is a very simple example: the pattern `\w+` matches every run of word characters (letters, digits, and underscores).

In [18]:
re.findall(r'\w+', 'apple pear plum cherries')

['apple', 'pear', 'plum', 'cherries']
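A slightly richer sketch of the same idea follows. The sample sentence is made up for illustration. It shows that a more specific pattern narrows what is matched, and that `re.search` returns `None` when a pattern does not match at all.

```python
import re

# Made-up sample text (assumption: for illustration only).
text = "Order 66 shipped on 2018-04-24 to user_42."

# \d+ matches every run of digits.
numbers = re.findall(r"\d+", text)
print(numbers)

# A more specific pattern: a date in YYYY-MM-DD form.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(dates)

# re.search returns a match object, or None when nothing matches.
found_pear = re.search(r"pear", "apple pear plum") is not None
found_kiwi = re.search(r"kiwi", "apple pear plum") is not None
print(found_pear, found_kiwi)
```

Raw strings (`r"..."`) are used so that backslashes reach the regex engine unchanged.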

This is a fairly good introduction:
http://www.zytrax.com/tech/web/regex.htm

This is a really good tutorial that shows how regular expressions match or do not match:
https://regexone.com/

And this link covers Python's regex library for beginners:
https://developers.google.com/edu/python/regular-expressions

## How to feed a text to the computer?

First of all, we have to decide which elements to use as building blocks. A common approach is to break the text down into words. But even this task is not as simple as it first seems. The example further below is a tweet by Donald Trump.

Possible problems:
* lower and uppercase
* punctuation
* emojis
* character encoding
* named entity recognition...

In [19]:
trump_tweet = 'Having great meetings and discussions with my friend, President @EmmanuelMacron of France. We are in the midst of meetings on Iran, Syria and Trade. We will be holding a joint press conference shortly, here at the @WhiteHouse. 🇺🇸🇫🇷'

Does this work?

In [22]:
print(' '.join(re.findall(r'\w+', trump_tweet.lower())))

having great meetings and discussions with my friend president emmanuelmacron of france we are in the midst of meetings on iran syria and trade we will be holding a joint press conference shortly here at the whitehouse
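Only partly: the punctuation is gone, but so are the `@` signs of the mentions and the flag emojis. As a minimal sketch (an illustrative pattern, not a complete tweet tokenizer), an optional leading `@` or `#` can be allowed so that mentions and hashtags survive. The text is the first sentence of the tweet above, repeated so that the block runs on its own.

```python
import re

# First sentence of the tweet above.
text = ("Having great meetings and discussions with my friend, "
        "President @EmmanuelMacron of France.")

# Plain \w+ drops the '@', so a mention turns into an ordinary word.
plain = re.findall(r"\w+", text.lower())

# An optional leading '@' or '#' keeps mentions and hashtags intact.
with_handles = re.findall(r"[@#]?\w+", text.lower())

print(plain)
print(with_handles)
```

This pattern is still naive: it would, for example, split an e-mail address into pieces, and it ignores emojis entirely.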


## nltk

"NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum."

In [23]:
import nltk

In [28]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/bokanyie/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Let's revisit our previous task.

In [31]:
print(' '.join(nltk.word_tokenize(trump_tweet.lower())))

having great meetings and discussions with my friend , president @ emmanuelmacron of france . we are in the midst of meetings on iran , syria and trade . we will be holding a joint press conference shortly , here at the @ whitehouse . 🇺🇸🇫🇷


Look for a good tweet tokenizer with built-in Twitter-specific regular expressions, either within `nltk` or elsewhere!
http://www.nltk.org/api/nltk.tokenize.html
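One candidate within `nltk` is `TweetTokenizer` from `nltk.tokenize`. The sketch below applies it to the last sentence of the tweet (without the emojis) and only checks that the handle stays in one piece. Other tokenizers may suit a given task better.

```python
from nltk.tokenize import TweetTokenizer

# Last sentence of the tweet above, without the flag emojis.
text = ("We will be holding a joint press conference shortly, "
        "here at the @WhiteHouse.")

# preserve_case=False lowercases the tokens; handles such as
# @whitehouse are kept as single tokens.
tokenizer = TweetTokenizer(preserve_case=False)
tokens = tokenizer.tokenize(text)
print(tokens)
```

Unlike `nltk.word_tokenize`, this tokenizer is rule-based and does not need the `punkt` download.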

## Stemming

If we are only interested in the core meaning of a word, we can often strip inflections such as third-person endings or plurals. This is called stemming.

In [36]:
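from nltk.stem import SnowballStemmer

# ignore_stopwords=True leaves stopwords such as 'having' unstemmed;
# it needs the NLTK stopwords corpus: nltk.download('stopwords')
snow = SnowballStemmer('english', ignore_stopwords=True)
for word in nltk.word_tokenize(trump_tweet.lower())[:10]:
    print(word, '->', snow.stem(word))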

having -> having
great -> great
meetings -> meet
and -> and
discussions -> discuss
with -> with
my -> my
friend -> friend
, -> ,
president -> presid
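The point of stemming is that different inflected forms collapse to the same stem. A minimal sketch with a made-up word list, using the stemmer without the stopword option so that no extra download is needed:

```python
from nltk.stem import SnowballStemmer

# Plain English Snowball stemmer (no stopword handling).
stemmer = SnowballStemmer("english")

# Made-up word list: three inflected forms of the same word.
words = ["meeting", "meetings", "meets"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```

Note that a stem does not have to be a dictionary word, as `presid` in the output above shows.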


## Term-document matrix

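The term-document matrix is a useful algebraic representation of a collection of texts. In the orientation used below (a document-term matrix), each row corresponds to a document, each column to a term, and each entry counts how often that term occurs in that document.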

https://en.wikipedia.org/wiki/Document-term_matrix
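To make the idea concrete, here is a minimal sketch that builds such a matrix by hand with the standard library only. The three sentences are made up for illustration and are not taken from the tweet archive.

```python
import re
from collections import Counter

# Made-up example documents (assumption: for illustration only).
docs = [
    "the house passed the bill",
    "the senate will vote tomorrow",
    "vote on the bill tomorrow",
]

# Tokenize each document into lowercase words.
tokenized = [re.findall(r"\w+", doc.lower()) for doc in docs]

# The sorted set of all terms defines the columns of the matrix.
vocabulary = sorted({term for tokens in tokenized for term in tokens})

# One row per document, one column per term; entries are term counts.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Every row has the same length (the vocabulary size), regardless of how long the original document was.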

Let us look at such a matrix built from some more Trump tweets, taken from http://www.trumptwitterarchive.com/archive:

In [52]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [42]:
obamacare = pd.read_csv('obamacare.csv')

In [46]:
c = CountVectorizer()

In [53]:
c.fit_transform(obamacare['text'].head(3).tolist()).todense()

matrix([[0, 1, 2, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1,
         0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1,
         1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 3, 0, 1,
         0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 2, 1, 1, 0, 1, 0, 0, 1, 1,
         0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 2, 0, 0,
         0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
        [1, 0, 1, 1, 1, 0, 2, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0,
         1, 1, 1, 0, 0, 0, 0, 1, 2, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0,
         0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 4, 1, 0,
         1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1]], dtype=int64)

In [54]:
# scikit-learn >= 1.0; older versions use c.get_feature_names()
' '.join(c.get_feature_names_out())

'00 amp and approved approximately as at based be been being biggest bill come conference cut democrats develop essentially eventually fact final for goes great has hated healthcare history house if in individual is just mandate morning most new news obamacare of on our over part passed plan reform remember repealed repeals republicans senate signed states tax terminated terrible that the there time to together tomorrow under unfair united unpopular very vote which white will'

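Note that with its default settings `CountVectorizer` does not exclude so-called stopwords: words such as "and", "the", and "of" are still in the vocabulary above. It only lowercases the text and drops punctuation and single-character tokens. To remove common English stopwords, pass `stop_words='english'`.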
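Whether stopwords are actually removed can be checked by comparing the vocabulary with and without the `stop_words` option. The sketch below uses two made-up sentences rather than the tweets, and reads the vocabulary from the fitted `vocabulary_` attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up example documents (assumption: for illustration only).
docs = [
    "the house passed the plan",
    "the senate will vote on the plan",
]

# Default settings: no stopword filtering.
default_vocab = sorted(CountVectorizer().fit(docs).vocabulary_)

# Built-in English stopword list.
filtered_vocab = sorted(
    CountVectorizer(stop_words="english").fit(docs).vocabulary_
)

print(default_vocab)
print(filtered_vocab)
```

The built-in list is generic, so it is worth checking that it does not drop words that matter for the task at hand.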

## Other questions

What should we do when the texts do not have the same length? How can we exclude irrelevant words? How can we design a weighting scheme that suits our purposes?
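One common answer to the weighting question is tf-idf, which downweights terms that occur in many documents. This is only one possible scheme and not necessarily the one intended here. The sketch below implements the plain textbook variant by hand on made-up sentences; library implementations often use smoothed variants, so their numbers can differ.

```python
import math
import re
from collections import Counter

# Made-up example documents (assumption: for illustration only).
docs = [
    "the house passed the bill",
    "the senate will vote tomorrow",
    "vote on the bill tomorrow",
]
tokenized = [re.findall(r"\w+", doc.lower()) for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tf_idf(tokens):
    """Relative term frequency times log inverse document frequency."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }


weights = [tf_idf(tokens) for tokens in tokenized]
print(weights[0])
```

A term that appears in every document, such as "the" here, gets weight zero, while a term unique to one document gets the highest weight.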

How should we deal with the sequential ("time series") nature of texts? What inputs do neural networks expect?
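As a hedged starting point for the last question: sequence models are often fed fixed-length lists of integer word indices, so the word order is kept and shorter texts are padded. The sketch below shows that encoding with the standard library only. The sentences and the choice of `0` as the padding index are assumptions made for illustration.

```python
import re

# Made-up example documents (assumption: for illustration only).
docs = [
    "the house passed the bill",
    "the senate will vote",
]
tokenized = [re.findall(r"\w+", doc.lower()) for doc in docs]

# Index 0 is reserved for padding; real terms start at 1.
vocabulary = sorted({term for tokens in tokenized for term in tokens})
index = {term: i + 1 for i, term in enumerate(vocabulary)}

# Replace every word by its index, keeping the original word order.
sequences = [[index[term] for term in tokens] for tokens in tokenized]

# Pad shorter sequences with 0 so that all have the same length.
max_len = max(len(seq) for seq in sequences)
padded = [seq + [0] * (max_len - len(seq)) for seq in sequences]

for row in padded:
    print(row)
```

Unlike the count matrix above, this representation preserves word order, at the cost of needing a fixed maximum length.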