# Week 5 - Natural Language Processing (NLP)

In this week's exercise, we will apply neural networks to a new type of data - text. You will also pick your own corpus of text to play with.  **But first, some definitions** (there's a lot of new jargon in 'NLP'):
* *Corpus*: The set of all text documents you want to work on
* *Document*: An individual unit of text in your corpus

Below are some **examples of corpora** to get you thinking:

* Corpus of 100,000 IMDB reviews, where each document is an individual review
* Corpus of 20 English novels, where each document is an individual novel
* Corpus of one Malay novel, where each document is a chapter

You can **browse for a corpus at the links below** (if you don't already have one in mind):

* https://github.com/niderhoff/nlp-datasets
* https://www.gutenberg.org/catalog/

If this is your first time working with text, it's probably easier to deal with a corpus of many short documents - for example the IMDB review dataset, which is linked below in Chapter 2.  Play around with several corpora over the course of the week, and work with something that interests you. Remember text doesn't have to be English (try other languages), or even a natural language (try code or musical notation)!

**Key learning resources** for the week:
* https://web.stanford.edu/~jurafsky/slp3/ - legendary textbook introducing key theory and concepts of working with text, up to deep learning methods
* http://web.stanford.edu/class/cs224n/ - great course that introduces theory and concepts of text processing in the context of deep learning (can read class notes / assignments and skip videos if you are short on time) 
* https://course.fast.ai/index.html - fast.ai's introduction to deep learning (you'll have to pick out the bits about text and RNNs) is an efficient and effective way of tackling the topic
* https://www.datacamp.com/courses/natural-language-processing-fundamentals-in-python - very hands-on DataCamp course that will let you practice using existing tools for NLP tasks

Some **additional tools below** that can help in NLP (if you haven't found them already):
* scikit-learn has a handy set of features for NLP
* https://spacy.io/ - commercially oriented Python package for NLP
* https://www.nltk.org/ - slightly more academically oriented Python package for NLP

## Imports

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.datasets import make_blobs
import matplotlib.cm as cm
from scipy import stats
from collections import Counter, defaultdict
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
from gensim.models.tfidfmodel import TfidfModel
import itertools



In [2]:
# import tarfile
# from pathlib import Path

# data_folder = Path("data/raw/")
# file_to_open = data_folder / "aclImdb_v1.tar.gz"

# tar = tarfile.open(file_to_open, "r:gz")
# tar.extractall(path=data_folder)
# tar.close()

# Chapter 1: How do we turn text into data we can use?

### Convert your corpus into bags of words

We can't apply any of the techniques we have learned over the past few weeks directly on raw text.  Therefore, our first task is to convert our corpus into numbers.  The simplest way to do this is to use a **bag of words**. You can see some examples of this here: https://liferay.de.dariah.eu/tatom/index.html.
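
As a minimal sketch of the idea (using two made-up toy documents and plain whitespace tokenization, both of which are assumptions for illustration only), each document becomes a vector of word counts over a shared vocabulary:

```python
from collections import Counter

# Two toy "documents" (illustrative only, not taken from any dataset)
docs = ["the cat sat on the mat", "the dog sat"]

# Tokenize by whitespace and count the tokens in each document
bags = [Counter(doc.split()) for doc in docs]

# A shared, sorted vocabulary fixes the column order of the count vectors
vocab = sorted(set(token for bag in bags for token in bag))

# One row per document, one column per vocabulary word
# (a Counter returns 0 for words the document does not contain)
vectors = [[bag[word] for word in vocab] for bag in bags]

print(vocab)
for row in vectors:
    print(row)
```

Real corpora usually need more careful tokenization and cleaning than `split()` provides.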

Once you understand the concept, convert your corpus and documents into bags of words below:

In [3]:
df_raw = pd.read_csv('data/raw/Twitter-sentiment-self-drive-DFE.csv', encoding='ISO-8859-1')
df_raw.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,sentiment,sentiment:confidence,our_id,sentiment_gold,sentiment_gold_reason,text
0,724227031,True,golden,236,,5,0.7579,10001,5\n4,Author is excited about the development of the...,Two places I'd invest all my money if I could:...
1,724227032,True,golden,231,,5,0.8775,10002,5\n4,Author is excited that driverless cars will be...,Awesome! Google driverless cars will help the ...
2,724227033,True,golden,233,,2,0.6805,10003,2\n1,The author is skeptical of the safety and reli...,If Google maps can't keep up with road constru...
3,724227034,True,golden,240,,2,0.882,10004,2\n1,The author is skeptical of the project's value.,Autonomous cars seem way overhyped given the t...
4,724227035,True,golden,240,,3,1.0,10005,3,Author is making an observation without expres...,Just saw Google self-driving car on I-34. It w...


In [4]:
df = df_raw[df_raw.sentiment != 'not_relevant']['text']
df.head()

0    Two places I'd invest all my money if I could:...
1    Awesome! Google driverless cars will help the ...
2    If Google maps can't keep up with road constru...
3    Autonomous cars seem way overhyped given the t...
4    Just saw Google self-driving car on I-34. It w...
Name: text, dtype: object

In [5]:
# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Tokenize all the tweets in the dataframe
raw_corpus = [word_tokenize(tweet) for tweet in df]

corpus = []
corpus_bow = []

# Build the stop word set once; calling stopwords.words() for every token is slow
stop_words = set(stopwords.words('english'))

for tweet in raw_corpus:
    # lower case the tokens
    lower_tokens = [token.lower() for token in tweet]
    
    # Retain alphabetic words: alpha_only
    alpha_only = [t for t in lower_tokens if t.isalpha()]
    
    # Remove all stop words: no_stops
    no_stops = [t for t in alpha_only if t not in stop_words]

    # Lemmatize all tokens into a new list: lemmatized
    lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]
    corpus.append(lemmatized)

    # Create the bag-of-words: bow
    bow = Counter(lemmatized)
    corpus_bow.append(bow)

### Show us your bags

Show and explain what one of your documents looks like as a bag of words below.  What are the advantages and disadvantages of encoding text as bags of words?
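
One disadvantage that is easy to demonstrate (a small sketch with two invented sentences, not the only point worth discussing) is that a bag of words throws away word order, so sentences with opposite meanings can map to the same bag:

```python
from collections import Counter

# Two toy sentences with the same words in a different order
bag_a = Counter("dog bites man".split())
bag_b = Counter("man bites dog".split())

# The bags are identical even though the meanings differ
same_bag = bag_a == bag_b
print(same_bag)
```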

#### Single Document Topic Identification:

In [6]:
corpus_bow[0].most_common(5)

[('two', 1), ('place', 1), ('invest', 1), ('money', 1), ('could', 1)]

#### Corpus Topic Identification:

In [7]:
# Create a Dictionary from the tokenized tweets: dictionary
dictionary = Dictionary(corpus)
gensim_corpus = [dictionary.doc2bow(document) for document in corpus]

In [8]:
# Create the defaultdict: total_word_count
total_word_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(gensim_corpus):
    total_word_count[word_id] += word_count

# Create a sorted list from the defaultdict: sorted_word_count
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True) 

# Print the top 5 words across all documents alongside the count
for word_id, word_count in sorted_word_count[:5]:
    print(dictionary.get(word_id), word_count)

car 6811
http 3696
google 3096
driverless 1868
driving 1709


### Tell us a story with your bags
Now that your text is in a more digestible format, you can apply previously learned techniques to better understand the corpus. **Create a brief story around your corpus, for example by using clustering techniques.** Some examples of what you can do below:
* Use *Hierarchical Clustering* to understand similarity of documents in your corpus. What distance measure works best? Are the results what you expect?
* Learn about *Latent Dirichlet Allocation* to extract topics from your corpora, and measure each document on how much of each topic it contains. How do you interpret these topics?
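
To get a feel for the distance measure question, here is a rough sketch comparing Euclidean and cosine distance on three invented bags (the toy tweets and both helper functions are hypothetical, written only for this illustration). Cosine distance ignores how long a document is, while Euclidean distance is sensitive to repeated words:

```python
import math
from collections import Counter

def cosine_distance(bag_a, bag_b):
    """Return 1 - cosine similarity between two bag-of-words Counters."""
    shared = set(bag_a) & set(bag_b)
    dot = sum(bag_a[w] * bag_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty bag as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(bag_a, bag_b):
    """Return the straight-line distance between two count vectors."""
    words = set(bag_a) | set(bag_b)
    return math.sqrt(sum((bag_a[w] - bag_b[w]) ** 2 for w in words))

toy_bags = [
    Counter("google driverless car".split()),
    Counter("google driverless car car car".split()),  # same topic, repeated word
    Counter("invest money place".split()),             # unrelated vocabulary
]

for i in range(len(toy_bags)):
    for j in range(i + 1, len(toy_bags)):
        print(i, j,
              round(cosine_distance(toy_bags[i], toy_bags[j]), 3),
              round(euclidean_distance(toy_bags[i], toy_bags[j]), 3))
```

When working with a matrix of count vectors, `linkage` (imported above) accepts a `metric` argument such as `metric='cosine'`, so the same comparison can be made on the full corpus.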

Some **potential inspiration** below (but please keep your own story simple!):
* https://liferay.de.dariah.eu/tatom/topic_model_mallet.html covers a few examples of text analysis
* http://fantheory.viacom.com/
* https://pudding.cool/2017/02/vocabulary/

Additional resources on LDA (if you are interested): 
* https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d
* https://www.youtube.com/watch?v=DDq3OVp9dNA

In [9]:
NUM_TOPICS = 5
ldamodel = LdaModel(gensim_corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.086*"http" + 0.082*"car" + 0.048*"google" + 0.014*"driverless"')
(1, '0.147*"driving" + 0.126*"self" + 0.087*"car" + 0.047*"google"')
(2, '0.100*"car" + 0.031*"google" + 0.018*"driverless" + 0.015*"http"')
(3, '0.091*"car" + 0.076*"http" + 0.047*"google" + 0.018*"driverless"')
(4, '0.086*"car" + 0.064*"http" + 0.032*"driverless" + 0.020*"google"')


In [10]:
print(ldamodel.get_document_topics(gensim_corpus[0]))

[(0, 0.025259985), (1, 0.025088727), (2, 0.7310085), (3, 0.19304325), (4, 0.025599515)]
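
Each pair in the output above is a topic id and the share of the document attributed to that topic. A small sketch of how to read it, using the numbers copied from the cell output (the variable names are just for illustration):

```python
# Output of get_document_topics for the first tweet, copied from the cell above
doc_topics = [(0, 0.025259985), (1, 0.025088727), (2, 0.7310085),
              (3, 0.19304325), (4, 0.025599515)]

# The dominant topic is the pair with the largest share
dominant_topic, weight = max(doc_topics, key=lambda pair: pair[1])

# The shares should add up to roughly 1
total = sum(share for _, share in doc_topics)

print(dominant_topic, round(weight, 2))
print(round(total, 3))
```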


### Normalize your bags
In the above exercise, you may find it important to normalize your data.  One useful method when dealing with text is *Term Frequency - Inverse Document Frequency (TF-IDF)*. You can see more detail on this here: http://blog.christianperone.com/2011/10/machine-learning-text-feature-extraction-tf-idf-part-ii/.

Once you understand the concept, **express your data as TF-IDF vectors (instead of simple bag-of-words counts), and see if it changes your above story**. 
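
To make the idea concrete, here is a hand-rolled sketch of one common TF-IDF variant (raw term count times log2 of the inverse document frequency) on three invented documents. The toy documents and the `tfidf` helper are assumptions for illustration; gensim's `TfidfModel` also normalises each document vector by default, so its numbers will differ from this unnormalised version:

```python
import math
from collections import Counter

# Three toy tokenized documents (illustrative only)
docs = [
    ["google", "car", "car"],
    ["google", "car", "invest"],
    ["car", "money"],
]
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """Weight each word by its count times log2(N / document frequency)."""
    counts = Counter(doc)
    return {w: c * math.log2(n_docs / doc_freq[w]) for w, c in counts.items()}

weights = tfidf(docs[1])
for word, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(word, round(weight, 3))
```

A word that appears in every document ("car" here) gets a weight of zero, while a word unique to one document ("invest") is weighted most heavily.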

In [11]:
# Create a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(gensim_corpus)

# Calculate the tfidf weights of doc: tfidf_weights
tfidf_weights = tfidf[gensim_corpus[0]]

### Show us your bags (Version 2)

Show and explain what one of your documents looks like as a TF-IDF vector below.  How is this different from a simple bag-of-words?

In [12]:
# Print the first five weights
print(tfidf_weights[:5])

# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)

[(0, 0.008460742956819567), (1, 0.25039186204171204), (2, 0.48792055660159156), (3, 0.4072926714561025), (4, 0.3995601932716216)]
invest 0.48792055660159156
printing 0.48792055660159156
money 0.4072926714561025
place 0.3995601932716216
two 0.36818979327784845


# Chapter 2: Simple Supervised Learning with Text
Now that you are comfortable with treating text as numbers, we can try out supervised learning.  We'll use a labelled dataset of IMDB reviews to classify each review as 'positive' or 'negative'.  You can **find the data below:**

http://ai.stanford.edu/~amaas/data/sentiment/

Load in and process the data, then train a supervised learning model.  **You should achieve a validation or test set accuracy of about 85%**. Pretty good for a simple bag, no?
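
The exercise does not prescribe a particular model. As one possible baseline, the sketch below implements multinomial Naive Bayes over bags of words on four made-up reviews (the data, the `predict` helper and the choice of Naive Bayes are all assumptions for illustration, and the toy result says nothing about the accuracy you will get on IMDB):

```python
import math
from collections import Counter, defaultdict

# Tiny made-up training set (illustrative only, not the IMDB data)
train = [
    ("a great and moving film", "positive"),
    ("great acting and a great story", "positive"),
    ("a dull and boring film", "negative"),
    ("boring story and terrible acting", "negative"),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)  # label -> bag of words over all its reviews
for text, label in train:
    word_counts[label].update(text.split())

vocab = {w for bag in word_counts.values() for w in bag}

def predict(text):
    """Return the label with the highest log-posterior under Naive Bayes."""
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / len(train))  # log prior
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids log(0) for unseen word/label pairs
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("a great story"))
print(predict("terrible and boring"))
```

On the real dataset, scikit-learn's `CountVectorizer` or `TfidfVectorizer` combined with `MultinomialNB` or `LogisticRegression` would be a more practical route than writing this by hand.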

# Chapter 3: Playing with Recurrent Neural Networks (RNN)
So far, we've only treated text as a simple bag, with reasonable results.  We'll now shift to a more complex representation of language: recurrent neural networks.  To do so, we need to process text at the word or character level, and capture the sequence of a document. 

Our task here is to build an RNN that 'eats up' sequences of characters in order to predict the next character in a sequence, for every step in the sequence of a document. This is a common (and fun) task, with lots of examples available online. 

For this task, use existing RNN APIs (don't code everything from scratch) from Keras or PyTorch. 

**Read up on RNNs and this exercise** below:
* http://karpathy.github.io/2015/05/21/rnn-effectiveness/ - start here!
* https://github.com/martin-gorner/tensorflow-rnn-shakespeare - video, slides and code going through an example with Shakespeare
* http://killianlevacher.github.io/blog/posts/post-2016-03-01/post.html - another nice example based on Trump tweets

### Prepare your data

Our first step is to prepare our text. **Process your corpus into a format that can be used by an RNN, and walk through one sequence below**.

An **example way to shape your data** for this task is as follows (feel free to play around with different structures):

*In this example your corpus starts with the string 'the cat and I'*
* RNN input: divide your text into sequences of 10 characters e.g. 'the cat an'
* RNN output: the 1 character immediately following RNN input sequences e.g. 'd'. 
* Note: You may or may not want to divide your text into overlapping strings (e.g. RNN input contains 'the cat an', 'he cat and', 'e cat and ', ...). How is the model different in each case?
* Note: Your 'vocabulary' or `vocab_size` here is the number of unique characters in your text (and therefore the number of classes you want to predict)
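
The shaping described above can be sketched in plain Python as follows. The `make_sequences` helper and its `step` parameter are hypothetical names chosen for this illustration; `step=1` gives overlapping inputs and `step=seq_len` gives non-overlapping ones:

```python
text = "the cat and I"
seq_len = 10

# Character vocabulary: every unique character gets an integer id
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
vocab_size = len(chars)

def make_sequences(text, seq_len, step=1):
    """Slide a window over the text and pair each window with the next character."""
    inputs, targets = [], []
    for start in range(0, len(text) - seq_len, step):
        inputs.append(text[start:start + seq_len])
        targets.append(text[start + seq_len])
    return inputs, targets

inputs, targets = make_sequences(text, seq_len, step=1)

# Integer-encode the inputs and targets so they can be fed to a network
encoded_inputs = [[char_to_idx[ch] for ch in seq] for seq in inputs]
encoded_targets = [char_to_idx[ch] for ch in targets]

for seq, target in zip(inputs, targets):
    print(repr(seq), "->", repr(target))
print("vocab_size:", vocab_size)
```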

### Generate text

Once the model is trained, we can use it to generate completely new text in the style of your training data.  **Train a model using your original choice of corpus below, and generate some sample sentences.** Don't worry too much about your loss / accuracy during training, but instead check on the text your model is generating. Your generated text should be somewhat coherent, i.e. similar to your training text in structure, and not excessively misspelled.

An **example model architecture** is as follows (feel free to play around with different structures):
* Embedding (for each character in your vocab) of dimension 64
* Dropout of 20% for the embedding input to the RNN
* 2 LSTM layers, each of dimension 512 (play around with the number and dimension of hidden layers)
* Dropout of 50% for each LSTM layer
* Dense softmax layer of same dimension as your vocab size (e.g. if your vocab size is 100, this layer outputs the probability of each of the 100 possible characters being the next character)
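
To illustrate what the final softmax layer produces and how it is used during generation, here is a small sketch with a hypothetical 4-character vocabulary and made-up scores (none of these values come from a trained model). Sampling from the probabilities, rather than always taking the most likely character, is one common way to get more varied text:

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and made-up scores from the final layer
vocab = ["a", "b", "c", " "]
logits = [2.0, 0.5, 0.1, 1.0]
probs = softmax(logits)

# Greedy choice: always pick the most likely character
greedy_char = vocab[probs.index(max(probs))]

# Sampled choice: draw a character in proportion to its probability
random.seed(0)  # fixed seed so the sketch is repeatable
sampled_char = random.choices(vocab, weights=probs, k=1)[0]

print([round(p, 3) for p in probs])
print(repr(greedy_char), repr(sampled_char))
```

To generate a sentence, the chosen character is appended to the input sequence and the model is asked for the next set of probabilities, and so on.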
    
**You should understand what each of the above elements is and how it works at a high level by the end of this week's exercise.**

### Generalizing the exercise
How do you think you can apply what you learned in the above exercise to other problems involving text? For example, how would you tackle the previous IMDB sentiment classification task using an RNN architecture? **Discuss below.**

(*Bonus*: create an RNN model for the IMDB classification task and discuss your results. How does the performance compare to your bag of words model?)

# Chapter 4: RNNs from scratch
Now that you understand how to use RNNs, it's time to build a basic one from scratch.  You won't understand how they work until you get stuck in the weeds! 

### Generate text (Version 2)
Your task is now to **build the forward pass of a simple RNN, without using any existing RNN APIs**. You can use PyTorch or TensorFlow (Keras is too high level for this exercise), both of which will automatically handle backpropagation for you.  If you use TensorFlow, please research and use eager execution (the default mode in TensorFlow 2.x) - it replaces TensorFlow 1.x's default graph / session framework, which is very difficult to learn and debug.

Similar to last week's exercise, create a class for your network (write forward and loss steps, allowing PyTorch or TensorFlow to handle backpropagation for you).  Consider appropriate sizes for your input, hidden and output layers - your `__init__` method should take in the params `hidden_size`, `vocab_size`, and `embedding_size` (if you use embeddings). Using these variables, you should initialise three weight layers `input_layer`, `hidden_layer`, and `output_layer`.  In an RNN, you will also have to deal with another item - the `hidden_state`. (Note: your RNN structure may vary slightly from this depending on your learning materials, but the key part is always `hidden_state`)
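
To show how the pieces fit together, here is a rough NumPy sketch of the recurrence, written only to illustrate shapes and data flow. It is not a solution to the exercise: it has no loss, no training and no automatic differentiation, biases are left out, the toy sizes are arbitrary, and the `embedding` table and `forward` function are assumptions beyond the names given above:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_size, hidden_size = 9, 4, 8  # arbitrary toy sizes

# Parameters (layer names follow the text; the embedding table is an assumption)
embedding = rng.normal(scale=0.1, size=(vocab_size, embedding_size))
input_layer = rng.normal(scale=0.1, size=(embedding_size, hidden_size))
hidden_layer = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
output_layer = rng.normal(scale=0.1, size=(hidden_size, vocab_size))

def forward(char_ids):
    """Run the recurrence over one sequence and return scores for every step."""
    hidden_state = np.zeros(hidden_size)  # start each sequence from a zero state
    logits = []
    for idx in char_ids:
        x = embedding[idx]
        # The new state mixes the current input with the previous state
        hidden_state = np.tanh(x @ input_layer + hidden_state @ hidden_layer)
        # Scores over the vocabulary for the next character
        logits.append(hidden_state @ output_layer)
    return np.stack(logits), hidden_state

logits, final_state = forward([0, 3, 5, 1])
print(logits.shape)
print(final_state.shape)
```

The same `hidden_state` is carried from one character to the next, which is what lets the network use earlier characters when scoring later ones.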

You should **train your RNN on the same data and task as in Chapter 3.**

**How do the results of your basic RNN compare to your model in Chapter 3?**  What do you think explains the difference in performance? Discuss below.

Some relevant resources on LSTMs (and RNN theory) below if you are interested:
* http://colah.github.io/posts/2015-08-Understanding-LSTMs/
* https://www.youtube.com/watch?v=93rzMHtYT_0&list=LLpNVCNE9cYqVrjb2O8bZUGg&index=2&t=0s
* https://www.youtube.com/watch?v=zQxm3Upr3_I
* http://harinisuresh.com/2016/10/09/lstms/

### Bonus Challenges (not required!):
1. Build the forward pass of an LSTM, without using any existing RNN APIs (as above, with PyTorch or TensorFlow)
1. Build a basic RNN or LSTM in NumPy - including forward pass as well as backpropagation