# Introduction to Natural Language Processing
## 1. Data Representation
### ASI Data Science Fellowship IX - 6th October 2017

***Before you do anything, apply the 'requirements' environment to your server to make sure that you have all the modules we are going to need for the examples below.***

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# dictionary of colours for making nice plots later
PARTY_COLOURS = {'trump': '#E91D0E', 'obama': '#00A6EF'}

%matplotlib inline

This notebook will take you through many of the concepts we have introduced in this session. We will use the same dataset for all examples, namely a collection of 6000 or so tweets from @realDonaldTrump and @BarackObama. 

Wherever possible we will use `sklearn`, Python's machine learning library, which you are most likely already familiar with. For a few tasks we will turn to `nltk` (Natural Language Toolkit), a Python library for NLP.

In [3]:
df = pd.read_pickle('tweets.pkl')
df.head()

| | date | id | text | label |
|---|---|---|---|---|
| 0 | 2017-02-08 15:23:29 | 829349943613734912 | `b'Thank you to our great Police Chiefs &amp; S...` | 0 |
| 1 | 2014-09-19 16:26:07 | 513001099302023168 | `b'"This is not your fight alone. It\'s on all ...` | 1 |
| 2 | 2017-05-23 15:43:16 | 867043258932875264 | `b'Thank you for such a wonderful and unforgett...` | 0 |
| 3 | 2017-06-15 10:55:37 | 875305788708974592 | `b'They made up a phony collusion with the Russ...` | 0 |
| 4 | 2015-06-23 20:07:16 | 613438190829965312 | `b"It doesn't get much better than the chance f...` | 1 |


## Data Cleaning 

There are many things to consider when cleaning text data. Some problems are common to other data types, such as how to deal with missing values. Others are unique to text data, such as removing HTML tags or URLs. We don't want to focus too much on data cleaning in this course, but we've done a little cleaning below to give you a taste. Generally speaking, regular expressions (available in Python in the `re` module) will get you pretty far. For specific tasks there are often existing libraries you can use. For example, `feedparser` is good for getting data from an RSS feed, and `beautifulsoup` is good for parsing HTML/XML.

In [4]:
import re

def clean_tweet(text):
    # decode the raw bytes into a utf-8 string
    text = text.decode('utf-8')
    # remove commas in numbers (else vectorizer will split on them)
    text = re.sub(r',([0-9])', r'\1', text)
    # replace the HTML entity for & (with or without its trailing ;)
    text = re.sub(r'&amp;?', 'and', text)
    # strip urls
    return re.sub(r'https?://\S*', '', text)

df['text'] = df['text'].map(clean_tweet)
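
The cleaning function above targets one HTML entity by hand. As a rough alternative sketch, the standard library's `html.unescape` decodes any HTML entity in one call. The helper name `clean_text` and the sample string are made up for illustration and are not part of the course code.

```python
import html
import re

def clean_text(text):
    """Illustrative cleaner: decode entities, drop URLs, tidy whitespace."""
    # decode HTML entities such as &amp; -> & and &lt; -> <
    text = html.unescape(text)
    # strip urls
    text = re.sub(r'https?://\S+', '', text)
    # remove thousands separators inside numbers, e.g. 6,000 -> 6000
    text = re.sub(r'(?<=\d),(?=\d)', '', text)
    # collapse the runs of whitespace left behind by the removals
    return re.sub(r'\s+', ' ', text).strip()

sample = 'Police Chiefs &amp; Sheriffs: 6,000 strong https://t.co/abc123 today'
cleaned = clean_text(sample)
print(cleaned)
```

Unlike `clean_tweet`, this sketch keeps `&` as a symbol rather than spelling it out as "and"; which is preferable depends on how you tokenize afterwards.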

## Tokenizing

The field of natural language processing contains a lot of jargon from linguistics. We don't want to get too bogged down in defining lots of new terms, but the following two are helpful:

- Type: An element of the vocabulary. May be a word, may be an n-gram (ordered sequence of words)
- Token: An instance of a type in running text.
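
To make the distinction concrete, here is a small sketch on a made-up sentence: every whitespace-separated word is a token, while the set of distinct words gives the types.

```python
text = "the cat sat on the mat and the dog sat too"

# tokens: every occurrence of a word in the running text
tokens = text.split()

# types: the distinct vocabulary elements those tokens are instances of
types = set(tokens)

n_tokens = len(tokens)
n_types = len(types)
print(n_tokens, "tokens,", n_types, "types")
```

Here "the" contributes three tokens but only one type.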

Any given language has a large enough vocabulary that trying to do data science on the set of all possible sentences is totally impractical. Instead it helps to break text up into smaller chunks, a process called tokenizing.

Exactly how we do this will depend on the problem, but some common ways include splitting on whitespace, or splitting on non-alphanumeric characters. In general, the method of tokenizing will be informed by the format of the text data being studied.
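
The sketch below compares three simple tokenizing strategies on one of the tweets, using only the `re` module. The third pattern is the default `token_pattern` of `sklearn`'s `CountVectorizer`, which keeps runs of two or more word characters; the variable names are illustrative.

```python
import re

text = "Let's keep fighting for real, lasting change!"

# 1. split on whitespace: punctuation stays attached to the words
whitespace_tokens = text.split()

# 2. split on runs of non-alphanumeric characters, dropping empty strings
alnum_tokens = [t for t in re.split(r'\W+', text) if t]

# 3. keep runs of two or more word characters
#    (the default token_pattern used by sklearn's CountVectorizer)
sklearn_style_tokens = re.findall(r'(?u)\b\w\w+\b', text)

print(whitespace_tokens)
print(alnum_tokens)
print(sklearn_style_tokens)
```

Note how the apostrophe in "Let's" is handled differently by each strategy: it is kept intact by the first, split into "Let" and "s" by the second, and reduced to "Let" by the third because single-character tokens are discarded.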

**Tokenizers are accessed in a slightly roundabout way in `sklearn` as below. Run this cell a few times to tokenize random tweets.**

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from random import randint

# tokenize a random tweet
i = randint(0, len(df) - 1)
tokenizer = CountVectorizer().build_tokenizer()
tokenizer(df['text'].iloc[i])

['For',
 'all',
 'that',
 'Beau',
 'Biden',
 'achieved',
 'in',
 'his',
 'life',
 'nothing',
 'claimed',
 'fuller',
 'focus',
 'of',
 'his',
 'love',
 'and',
 'devotion',
 'than',
 'his',
 'family',
 'President',
 'Obama']

## Vectorizing

Tokenizing breaks our raw text data down into more manageable chunks, but it's still not in a form that is particularly useful for training models. Let's look at a few common, simple ways of vectorizing text data. We will use `sklearn`, which can efficiently vectorize text data and store the results as `scipy` sparse matrices.

### Count Vectors

Perhaps the simplest way to vectorize is to count the number of times each type in the vocabulary appears in a given piece of text, and to store those counts as a vector.
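
As a minimal from-scratch sketch of the idea (not how `sklearn` implements it), the following builds count vectors for a made-up three-document corpus. The corpus and the helper name `count_vector` are assumptions for illustration.

```python
from collections import Counter

docs = ["we love the data", "the data the model", "we fit the model"]

# the vocabulary is the sorted list of every type seen in the corpus
vocab = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc, vocab):
    """Return the number of times each vocabulary term occurs in doc."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

# one row per document, one column per vocabulary term
count_matrix = [count_vector(doc, vocab) for doc in docs]

print(vocab)
for row in count_matrix:
    print(row)
```

Every document is mapped to a vector of the same length, one entry per vocabulary term, regardless of how many words it contains.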

To get some intuition, let's try it on a small test corpus of 10 random tweets.

**Use `sample` on the series `df['text']` to get a random selection of 10 tweets**

In [6]:
import random
random.sample(range(len(df)),10)

[5768, 5213, 1650, 2375, 3265, 5001, 3116, 3258, 5173, 2622]

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

# draw 10 random tweets and renumber them 0..9 so we can index by position
test_corpus = df['text'].sample(10).reset_index(drop=True)

Given the sample, we can create count vectors using `CountVectorizer` from `sklearn`. We set `max_features=5` so as to work with a small vocabulary of only the most common terms.

See the next cell for usage of `CountVectorizer`.

In [15]:
# create a count vectorizer with our desired parameters
count_vectorizer = CountVectorizer(max_features=5)

# first 'fit' the vectorizer to the corpus
# this step automatically determines the vocabulary
count_vectorizer.fit(test_corpus)

# then 'transform' the corpus to count vectors (a matrix)
count_vectors = count_vectorizer.transform(test_corpus)

count_vectorizer.get_feature_names()

['for', 'rt', 'the', 'to', 'we']

The next cell will display each of your tweets and the corresponding counts.

In [19]:
features = count_vectorizer.get_feature_names()

# we use .toarray() to convert from sparse 
# array to dense numpy array
for i, row in enumerate(count_vectors.toarray()):
    print(test_corpus[i])
    print(pd.DataFrame({'Terms': features, 'Counts': row}).to_string(index=False))
    print("-" * 40)

Thank you to former campaign adviser Michael Caputo for saying so powerfully that there was no Russian collusion in our winning campaign.
Counts Terms
     0   for
     0    rt
     0   the
     1    to
     1    we
----------------------------------------
Let's keep fighting for real, lasting change: 
Counts Terms
     0   for
     0    rt
     0   the
     0    to
     0    we
----------------------------------------
Great bilateral meetings at Élysée Palace w/ President @EmmanuelMacron. The friendship between our two nations and… 
Counts Terms
     0   for
     0    rt
     3   the
     1    to
     0    we
----------------------------------------
Working in Bedminster, N.J., as long planned construction is being done at the White House. This is not a vacation - meetings and calls!
Counts Terms
     0   for
     1    rt
     2   the
     0    to
     1    we
----------------------------------------
More Anti-Catholic Emails From Team Clinton:  
Counts Terms
     1   for
     0    rt

### Term frequency vectors

Count vectors are very sensitive to document length. In our case we expect all tweets to be similar lengths, but in general we might be dealing with documents of varying lengths, so it makes sense to normalise the count vectors. This results in so-called frequency vectors.
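
By default `TfidfVectorizer` normalises each row to unit Euclidean (L2) length (`norm='l2'`), rather than dividing by the total number of words. The sketch below applies that normalisation by hand to two made-up count vectors; the helper name `l2_normalise` is illustrative.

```python
import math

def l2_normalise(counts):
    """Scale a count vector so that its Euclidean length is 1."""
    norm = math.sqrt(sum(c * c for c in counts))
    if norm == 0:
        # a document with no vocabulary terms stays all zeros
        return [0.0] * len(counts)
    return [c / norm for c in counts]

# two vocabulary terms appearing once each
two_terms = l2_normalise([0, 1, 0, 0, 1])

# four vocabulary terms, one of them appearing twice
mixed = l2_normalise([1, 1, 2, 1, 0])

print([round(x, 6) for x in two_terms])
print([round(x, 6) for x in mixed])
```

This explains values such as 0.707107 (that is, 1/√2) that appear whenever a document contains exactly two vocabulary terms once each.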

**Using `TfidfVectorizer`, compute term frequency vectors for the test corpus and print them out as we did for the count vectors. Make sure you set `use_idf=False` when initialising your `TfidfVectorizer`. As before, limit the vocabulary to 5 types.**

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_vectorizer = TfidfVectorizer(max_features=5, use_idf = False)
tf_corpus = tf_vectorizer.fit_transform(test_corpus)
tf_vectorizer.get_feature_names()

['and', 'in', 'is', 'the', 'to']

In [25]:
features = tf_vectorizer.get_feature_names()
for i, row in enumerate(tf_corpus.toarray()):
    print(test_corpus[i])
    print(pd.DataFrame({'Terms': features, 'Counts': row}).to_string(index=False))
    print("-" * 40)

Thank you to former campaign adviser Michael Caputo for saying so powerfully that there was no Russian collusion in our winning campaign.
Counts Terms
0.000000   and
0.707107    in
0.000000    is
0.000000   the
0.707107    to
----------------------------------------
Let's keep fighting for real, lasting change: 
Counts Terms
   0.0   and
   0.0    in
   0.0    is
   0.0   the
   0.0    to
----------------------------------------
Great bilateral meetings at Élysée Palace w/ President @EmmanuelMacron. The friendship between our two nations and… 
Counts Terms
0.707107   and
0.000000    in
0.000000    is
0.707107   the
0.000000    to
----------------------------------------
Working in Bedminster, N.J., as long planned construction is being done at the White House. This is not a vacation - meetings and calls!
Counts Terms
0.377964   and
0.377964    in
0.755929    is
0.377964   the
0.000000    to
----------------------------------------
More Anti-Catholic Emails From Team Clinton:  
Counts T

### tfidf vectors

tf-idf stands for 'term frequency - inverse document frequency'. Given our term frequencies, we re-weight each one by the inverse of its document frequency, that is, the number of documents in the corpus that contain the term. A given term will therefore have a larger value if it appears many times in the document but infrequently across the corpus. In this sense tf-idf automatically detects and upweights the terms that are most likely to help us distinguish between documents.
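
The following is a from-scratch sketch of the weighting on a made-up corpus, assuming the smoothed formula that `sklearn` uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation. The corpus and all names are illustrative.

```python
import math

# a toy corpus of already-tokenized documents
docs = [["the", "wall"], ["the", "economy"], ["the", "wall", "the", "border"]]
n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# document frequency: in how many documents does each term appear?
doc_freq = {t: sum(t in doc for doc in docs) for t in vocab}

# smoothed inverse document frequency (assumed sklearn default)
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_vector(doc):
    """Raw counts re-weighted by idf, then scaled to unit L2 length."""
    weighted = [doc.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weighted))
    return [w / norm if norm else 0.0 for w in weighted]

second = tfidf_vector(docs[1])
print(vocab)
print([round(w, 3) for w in second])
```

In the second document "the" and "economy" both occur once, but "economy" receives the larger weight because it appears in only one document while "the" appears in all three.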

**Compute tfidf vectors for your test_corpus. You can once again use `TfidfVectorizer`, but this time set `use_idf=True`.**

In [44]:
tfidf_vectorizer = TfidfVectorizer(max_features=5, use_idf=True)
tfidf_corpus = tfidf_vectorizer.fit_transform(test_corpus)
tfidf_vectorizer.get_feature_names()

['and', 'in', 'is', 'the', 'to']

In [45]:
# print out your vectors in a nice way
features = tfidf_vectorizer.get_feature_names()
for i, row in enumerate(tfidf_corpus.toarray()):
    print(test_corpus[i])
    print(pd.DataFrame({'Terms': features, 'Counts': row}).to_string(index=False))
    print("-" * 40)

## N-grams

So far we have only considered individual words and their frequencies. We lose a lot of information by doing so, because we discard word order and grammar.

A simple solution is to use n-grams, that is, sequences of n consecutive words, when we tokenize.
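
As a small sketch of what an n-gram tokenizer produces, the hypothetical helper `ngrams` below slides a window of length n over a list of word tokens.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "make america great again".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A document of k tokens yields k - n + 1 n-grams, so the number of distinct types in the vocabulary tends to grow quickly as n increases.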

In [47]:
# you can tokenize/vectorize with n-grams using the parameter
# ngram_range. It takes a tuple of ints that specify min and max
# n-gram lengths
ngram_vectorizer = CountVectorizer(max_features=5, ngram_range=(2,2))

ngram_vectorizer.fit_transform(test_corpus)
ngram_vectorizer.get_feature_names()

['actions will', 'at the', 'president obama', 'so powerfully', 'source of']

**Use `ngram_vectorizer` to compute bigram count vectors for your test corpus**

In [None]:
# your code here

In [None]:
# print your vectors in a nice way

The same principles as before, namely vectorizing using term frequencies or term frequency-inverse document frequencies, apply here too.

A big advantage of tokenizing using n-grams is that models can learn some basic information about which words tend to appear together, and which words follow on from other sequences.

### Generative Model

We've written a tweet generator that uses n-grams and a simple Markov model to generate new tweets based on some training data. When using unigrams, it just returns a random collection of words that follows the same distribution as the observed data. However, bigrams and trigrams already manage to capture a lot of information about how words are used together.

You can try the model using 10-grams or some large value of n too. But at that point there is not enough training data to make the Markov model particularly interesting. The model will just start to repeat actual tweets rather than generating new content. It will have massively overfit the data.

If you have time, feel free to dive into the source code to see how the generator works, but you are also more than welcome to just use it as a black box.
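
For intuition only, here is a minimal sketch of a bigram Markov generator. It is not the `TweetGenerator` source code; the function names, the start and end markers, and the toy corpus are all assumptions made for illustration.

```python
import random
from collections import defaultdict

START, END = '<s>', '</s>'

def train_bigram_model(texts):
    """Map each word to the list of words observed directly after it."""
    followers = defaultdict(list)
    for text in texts:
        words = [START] + text.lower().split() + [END]
        for current, nxt in zip(words, words[1:]):
            followers[current].append(nxt)
    return followers

def generate(followers, max_words=20, seed=None):
    """Walk the chain from START until END or max_words is reached."""
    rng = random.Random(seed)
    word, output = START, []
    while len(output) < max_words:
        # sampling from the list reproduces the observed bigram frequencies
        word = rng.choice(followers[word])
        if word == END:
            break
        output.append(word)
    return ' '.join(output)

toy_corpus = ["we will win", "we will build", "they will lose"]
model = train_bigram_model(toy_corpus)
generated = generate(model, seed=0)
print(generated)
```

Even on this tiny corpus the model can produce a sentence such as "they will win" that never appeared in the training data, because each word depends only on the one before it.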

In [27]:
from asi_nlp.twitter import TweetGenerator

# select the training tweets once (label 0 appears to be @realDonaldTrump,
# judging by the sample rows shown earlier)
training_tweets = df.loc[df['label'] == 0, 'text']

unigram_generator = TweetGenerator(1)
unigram_generator.train(training_tweets)

bigram_generator = TweetGenerator(2)
bigram_generator.train(training_tweets)

trigram_generator = TweetGenerator(3)
trigram_generator.train(training_tweets)

Model trained!
Model trained!
Model trained!


In [28]:
unigram_generator.generate()

['covered their been news killer ms wikileaks time its with hillary korea in better together wrong of and voter enjoy']

In [48]:
bigram_generator.generate()

['rt if we are fake news to meet ag to nato commander just arrived in big failure countries charge them and staff thought her to vote maga gopchairwoman']

In [51]:
trigram_generator.generate()

['settled the trump dossier is complete and total fabrication utter nonsense very unfair']