# Introduction to Natural Language Processing
## 1. Data Representation
### ASI Data Science Fellowship XIII - 24th January 2019

In [None]:
# download the data used in this notebook
!wget "https://s3-eu-west-1.amazonaws.com/fellowship-teaching-materials/NLP/tweets.pkl"


In [None]:
import pandas

This notebook will take you through many of the concepts we have introduced in this session. We will use the same dataset for all examples, namely a collection of 6000 or so tweets from @realDonaldTrump and @BarackObama. 

Wherever possible we will use `sklearn`, Python's machine learning library, which you are most likely already familiar with. For a few tasks we will turn to `nltk` (Natural Language Toolkit), a Python library for Natural Language Processing (NLP).

In [None]:
df = pandas.read_pickle('tweets.pkl')
df.head()

## Data Cleaning 

There are many things to consider when cleaning text data. Some problems are common to other data types, such as how to deal with missing values. Others are unique to text data, such as removing HTML tags or URLs. We don't want to focus too much on data cleaning in this course, so we've done just a little cleaning below to give you a taste. Generally speaking, regular expressions (available in Python through the `re` module) will get you pretty far. For specific tasks there are often existing libraries you can use: for example, `feedparser` is good for getting data from an RSS feed, and `beautifulsoup` is good for parsing HTML/XML.

In [None]:
import re

def clean_tweet(text):
    # decode UTF-8 bytes to str (tweets that are already str are left as-is)
    if isinstance(text, bytes):
        text = text.decode('utf-8')
    # remove commas in numbers (else vectorizer will split on them)
    text = re.sub(r',([0-9])', r'\1', text)
    # replace the HTML entity for & (with or without its trailing ;)
    text = re.sub(r'&amp;?', 'and', text)
    # strip urls
    return re.sub(r'https?://\S*', '', text)

df['text'] = df['text'].map(clean_tweet)

## Tokenizing

The field of NLP contains a lot of jargon from linguistics. We don't want to get too bogged down in defining lots of new terms, but the following two are helpful:

- Type: An element of the vocabulary. May be a word, may be an n-gram (ordered sequence of words)
- Token: An instance of a type in running text.
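
To make the distinction concrete, here is a minimal sketch on a made-up sentence (the sentence and variable names are illustrative only, not part of the tweet dataset). Each word occurrence is a token; each distinct word is a type.

```python
from collections import Counter

text = "the cat sat on the mat the end"

# Tokens: every word occurrence in the running text
tokens = text.split()

# Types: the distinct vocabulary items those tokens are instances of
types = set(tokens)

# How many tokens instantiate each type
token_counts = Counter(tokens)

print(len(tokens), "tokens,", len(types), "types")
print(token_counts.most_common(1))
```

Here the type `the` is instantiated by three separate tokens.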

Any given language has a large enough vocabulary that trying to do data science on the set of all possible sentences is totally impractical. Instead it helps to break text up into smaller chunks, a process called tokenizing.

Exactly how we do this will depend on the problem, but some common ways include splitting on whitespace, or splitting on non-alphanumeric characters. In general, the method of tokenizing will be informed by the format of the text data being studied.
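
As a rough illustration of how much the choice matters, the sketch below tokenizes one made-up sentence in three ways using only the standard library. The third pattern keeps runs of two or more word characters, which to our understanding is what the default `token_pattern` of `CountVectorizer` does; treat that correspondence as an assumption and check the documentation of your `sklearn` version.

```python
import re

text = "Thank you, Ohio! I'm sure we'll win big."

# 1. Split on whitespace: punctuation stays attached to the words
by_whitespace = text.split()

# 2. Split on runs of non-alphanumeric characters, dropping empty strings
by_non_alnum = [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]

# 3. Keep only runs of two or more word characters
by_word_pattern = re.findall(r"\b\w\w+\b", text)

print(by_whitespace)
print(by_non_alnum)
print(by_word_pattern)
```

Notice that the third approach silently drops single-character tokens such as `I`, which may or may not be what you want for a given problem.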

**Tokenizers are accessed in a slightly roundabout way in `sklearn`, as below. Run this cell a few times to tokenize random tweets.**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from random import randint

# tokenize a random tweet
i = randint(0, len(df) - 1)
tokenizer = CountVectorizer().build_tokenizer()
tokenizer(df['text'].iloc[i])

## Vectorizing

Tokenizing breaks our raw text data down into more manageable chunks, but it's still not in a form that is particularly useful for training models. Let's look at a few common, simple ways of vectorizing text data. We will use `sklearn`, which can efficiently vectorize text data and stores the result as `scipy` sparse matrices.

### Count Vectors

Perhaps the simplest way to vectorize a piece of text is to count the number of times each type appears in it.

To get some intuition, let's try it on a small test corpus of 10 random tweets.

**Use `sample` on the series `df['text']` to get a random selection of 10 tweets**.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

test_corpus = # your code here

test_corpus

Given the sample, we can create count vectors using `CountVectorizer` from `sklearn`. We set `max_features=5` so as to work with a small vocabulary of only the most common terms.

See the next cell for usage of `CountVectorizer`.

In [None]:
# create a count vectorizer with our desired parameters
count_vectorizer = CountVectorizer(max_features=5)

# first 'fit' the vectorizer to the corpus
# this step automatically determines the vocabulary
count_vectorizer.fit(test_corpus)

# then 'transform' the corpus to count vectors (a matrix)
count_vectors = count_vectorizer.transform(test_corpus)

In [None]:
# scikit-learn >= 1.0; on older versions use get_feature_names()
features = count_vectorizer.get_feature_names_out()
# we use .toarray() to convert from a sparse
# matrix to a dense numpy array
for i, row in enumerate(count_vectors.toarray()):
    print(test_corpus.iloc[i])
    print(pandas.DataFrame({'Terms': features, 'Counts': row}).to_string(index=False))
    print("-" * 40)
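
The `fit` and `transform` steps above can be mimicked by hand on a toy corpus, which may help to see what a count vector actually contains. This is only a sketch with made-up documents and a whitespace tokenizer; `toy_corpus` and `count_vector` are hypothetical names, and `CountVectorizer` additionally lowercases, applies its own token pattern and returns a sparse matrix.

```python
from collections import Counter

toy_corpus = ["make america great", "great great jobs"]

# 'fit': collect the vocabulary (one entry per type), sorted alphabetically
vocabulary = sorted({word for doc in toy_corpus for word in doc.split()})

# 'transform': count how often each vocabulary term occurs in a document
def count_vector(doc):
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

toy_vectors = [count_vector(doc) for doc in toy_corpus]

print(vocabulary)
print(toy_vectors)
```

Each row has one entry per vocabulary term, so every document is mapped to a vector of the same length regardless of how many words it contains.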

### Term frequency vectors

Count vectors are very sensitive to document length. In our case we expect all tweets to be of similar length, but in general we might be dealing with documents of varying lengths, so it makes sense to normalise the count vectors. This results in so-called term frequency vectors.
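
The sketch below shows the effect of normalisation on two hypothetical count vectors, one for a document and one for the same document repeated twice. Dividing by the total count (L1 normalisation) gives proportions; dividing by the Euclidean length (L2 normalisation) gives unit-length vectors. Note that `TfidfVectorizer` defaults to `norm='l2'`, so with `use_idf=False` its output should correspond to the second variant rather than to plain proportions.

```python
import math

def l1_normalise(counts):
    # divide by the total number of counted tokens
    total = sum(counts)
    return [c / total for c in counts] if total else list(counts)

def l2_normalise(counts):
    # divide by the Euclidean length of the vector
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else list(counts)

short_doc = [0, 2, 1, 0]  # counts for some document
long_doc = [0, 4, 2, 0]   # the same document repeated twice

print(l1_normalise(short_doc))
print(l1_normalise(long_doc))
print(l2_normalise(short_doc))
```

After either normalisation the short and the long document map to the same vector, which is exactly the insensitivity to length we are after.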

**Using `TfidfVectorizer`, compute term frequency vectors for the test corpus and print them out as we did for the count vectors. Make sure you set `use_idf=False` when initialising your `TfidfVectorizer`. As before limit the vocabulary to 5 types.**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_vectorizer = # your code here

In [None]:
# scikit-learn >= 1.0; on older versions use get_feature_names()
features = tf_vectorizer.get_feature_names_out()
for i, row in enumerate(<your_vectors>):
    # print out your vectors in a nice way

### tf-idf vectors

tf-idf stands for 'term frequency - inverse document frequency'. Given our term frequencies, we re-weight each one by the inverse of its document frequency, that is, the number of documents in the corpus that contain the term. A given term will therefore have a larger value if it appears many times in the document but infrequently across the corpus. In this sense tf-idf automatically detects and upweights terms that are likely to help us distinguish between documents.
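
To see the re-weighting at work, here is a minimal hand-rolled sketch on a made-up corpus. It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation, which we assume matches the defaults of `TfidfVectorizer`; other textbooks and libraries use slightly different variants, so treat the exact numbers as illustrative. The names `toy_corpus`, `doc_freq` and `tfidf` are hypothetical.

```python
import math
from collections import Counter

toy_corpus = ["great america", "great jobs", "great wall jobs"]
tokenized = [doc.split() for doc in toy_corpus]
n_docs = len(tokenized)

# document frequency: number of documents containing each term
doc_freq = Counter(term for doc in tokenized for term in set(doc))

# smoothed inverse document frequency (assumed to match sklearn's default)
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
       for term, df in doc_freq.items()}

def tfidf(doc_tokens):
    # term counts re-weighted by idf, then scaled to unit Euclidean length
    counts = Counter(doc_tokens)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

weights = tfidf(tokenized[0])
print(idf)
print(weights)
```

In the first document both words occur once, but `america` receives the larger weight because it appears in only one document, whereas `great` appears in all three.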

**Compute tfidf vectors for your test_corpus. You can once again use `TfidfVectorizer`, but this time set `use_idf=True`.**

In [None]:
tfidf_vectorizer = # your code here

In [None]:
# print out your vectors in a nice way

## N-grams

So far we have only considered individual words and their frequencies. We lose a lot of information doing so, because we discard word order, grammar and so on.

A simple remedy is to tokenize into n-grams, that is, sequences of n consecutive words.
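
Extracting n-grams from a list of tokens amounts to sliding a window of length n over it. The helper below is a hypothetical sketch (the name `ngrams` is ours, not an `sklearn` function) applied to a made-up phrase.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "make america great again".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A text of k tokens yields k - n + 1 n-grams, so the vocabulary of possible types grows quickly with n while each individual type becomes rarer.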

In [None]:
# you can tokenize/vectorize with n-grams using the parameter
# ngram_range. It takes a tuple of ints that specify min and max
# n-gram lengths
ngram_vectorizer = CountVectorizer(max_features=5, ngram_range=(2, 2))

**Use `ngram_vectorizer` to compute bigram count vectors for your test corpus**

In [None]:
# your code here

In [None]:
# print your vectors in a nice way

The same principles as before, namely vectorizing using term frequencies or term frequency-inverse document frequencies, apply here too.

A big advantage of tokenizing using n-grams is that models can learn some basic information about which words tend to appear together, and which words follow on from other sequences.

### Generative Model

We've written a tweet generator that uses n-grams and a simple Markov model to generate new tweets based on some training data. You can see that when using unigrams, it just returns a random collection of words that follow the same distribution as the observed data. However, bigrams and trigrams already manage to capture a lot of information about how words are used together.

You can also try the model with 10-grams or some other large value of n, but at that point there is not enough training data to make the Markov model particularly interesting. The model will just start to repeat actual tweets rather than generating new content: it will have massively overfit the data.

If you have time, feel free to dive into the source code to see how the generator works, but you are also more than welcome to just use it as a black box.
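
For readers who would like a feel for the idea before opening the source, here is a minimal sketch of an n-gram Markov generator. It is not the `TweetGenerator` implementation, which may differ in every detail; the function names, the `<s>` and `</s>` boundary markers and the toy training sentences are all assumptions made for illustration.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical boundary markers

def train_ngram_model(texts, n):
    """Map each context of n-1 words to the list of words observed after it."""
    model = defaultdict(list)
    for text in texts:
        tokens = [START] * (n - 1) + text.split() + [END]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - (n - 1):i])
            model[context].append(tokens[i])
    return model

def generate_text(model, n, rng, max_words=30):
    """Sample one word at a time, conditioning on the previous n-1 words."""
    context = tuple([START] * (n - 1))
    words = []
    while len(words) < max_words:
        next_word = rng.choice(model[context])
        if next_word == END:
            break
        words.append(next_word)
        # slide the context window forward by one word
        context = (context + (next_word,))[1:]
    return " ".join(words)

toy_tweets = ["we will win", "we will make great deals"]
rng = random.Random(0)
bigram_model = train_ngram_model(toy_tweets, 2)
print(generate_text(bigram_model, 2, rng))
```

With n = 1 the context is empty, so words are drawn independently from the overall word distribution. With a large n, almost every context has been seen only once, so the sketch can do little but replay a training sentence, which is the overfitting described above.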

In [None]:
from asi_nlp.twitter import TweetGenerator

unigram_generator = TweetGenerator(1)
unigram_generator.train(df[df['label'] == 0]['text'])

bigram_generator = TweetGenerator(2)
bigram_generator.train(df[df['label'] == 0]['text'])

trigram_generator = TweetGenerator(3)
trigram_generator.train(df[df['label'] == 0]['text'])

In [None]:
unigram_generator.generate()

In [None]:
bigram_generator.generate()

In [None]:
trigram_generator.generate()