# X05: Introduction to Natural Language Processing

Natural Language Processing (NLP) is a way of approaching analysing text data. We won't be delving too much into the nuts and bolts of NLP since it's a massive area worthy of several hours/weeks/years of study depending upon your level of interest. This lesson is adapted from the excellent beginners tutorial available on <a href = "https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words">Kaggle</a> however there is a more in depth course available on <a href = "https://www.coursera.org/course/nlpintro">Coursera</a> should you wish to explore NLTK further.

## NLTK

We're going to be using a library called <a href = "www.nltk.org">NLTK</a> which stands for Natural Language ToolKit. This is included in the Anaconda installation so you don't need to install it via the pip.

However you will need to install some data as NLTK comes with a vast array of texts, trained models, grammar modules etc. To bring up the download window you can type:

    nltk.download()
    
into your Jupyter notebook start the NLTK downloader where you can select individual files to download.

For now we don't need to download anything since we'll be importing data to use. We'll start by importing some libraries as follows:

In [47]:
import nltk
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import re

from nltk.corpus import stopwords # Import the stop word list

We're now going to import our data:

In [2]:
# Defining Variables

path = 'C:\\Users\\owner\\Google Drive\\DfT Data Science\\Projects\\P018 - NLP Training\\'

l_train = 'labeledTrainData.tsv'        # Labeled training set
u_train = 'unlabeledTrainData.tsv'      # Unlabelled training set
test = 'testData.tsv'                   # Test set
sample = 'sampleSubmission.csv'         # Sample data

In [3]:
# Importing the data

df_ltrain = pd.read_csv(path+l_train, header=0,delimiter="\t", quoting=3)  # Labelled Training data
df_utrain = pd.read_csv(path+u_train, header=0,delimiter="\t", quoting=3)  # Unlabelled Training data
df_test = pd.read_csv(path+test, header=0,delimiter="\t", quoting=3)       # Test data

Note that the 
    quoting = 3
option tells Python to ignore double quotes. This option comes in handy when parsing text data. 

Also note that the files are .tsv files. This stands for 'Tab Separated Values'. These files work in a similar manner to .csv files that most of you should be familiar with, however when working with text data, commas will appear quite often so we're using the tab as a separator instead.

## Part 1: Data Cleaning

Before we can start to process the data, we must first clean it up so it's in a state to be processed. This means:

* Removing the html markup
* Removing punctuation
* Making all text lower case
* Removing stop words

As we have several datasets of thousands of rows each, we'll take a single review and process this first before 'scaling up' our code to apply to the dataframes we've created:

In [4]:
review = df_ltrain.iloc[0]['review']
review

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

We can firstly remove the html markup using Beautiful Soup:

In [66]:
review = BeautifulSoup(review,'lxml').get_text()
review

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 2

Then we can remove the punctuation using a Regular Expression. Regular Expressions are a lot like Find and Replace in traditional programs, in that they can find strings, characters etc. and replace them with an alternative. We can use the re built in library in conjunction with the .sub() (substitute) method to do this as follows:

In [5]:
review = re.sub("[^a-zA-Z]",   # Find anything not in (aka ^) a-z or A-Z
                " ",           # Replace it with a blank
                review)        # The object to perform the action on
review

' With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay  br    br   Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him  br    br   The actual feature film bit when it finally st

We can then replace upper case text with lower case text using the lower() method:

In [68]:
review = review.lower()
review

' with all this stuff going down at the moment with mj i ve started listening to his music  watching the odd documentary here and there  watched the wiz and watched moonwalker again  maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  some of it has subtle messages about mj s feeling towards the press and also the obvious message of drugs are bad m kay visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring  some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him the actual feature film bit when it finally starts is only on for    m

Next we need to remove Stopwords. These are words with little meaning such as 'a', 'I', 'the' etc. We can do this with NLTKs stopwords library and a list comprehension as follows:

In [26]:
stops = set(stopwords.words("english"))        # Creating a set as sets are more efficient than lists
review = review.split()                        # Splitting the text into individual words
review = [w for w in review if not w in stops] # Removing the stopwords via a list comprehension
review = " ".join(review)                      # Joining the word list back together with spaces in between the words
review = [review]                              # Turning the review into a list
review 

['With stuff going moment MJ started listening music watching odd documentary watched The Wiz watched Moonwalker Maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent Moonwalker part biography part feature film remember going see cinema originally released Some subtle messages MJ feeling towards press also obvious message drugs bad kay br br Visually impressive course Michael Jackson unless remotely like MJ anyway going hate find boring Some may call MJ egotist consenting making movie BUT MJ fans would say made fans true really nice br br The actual feature film bit finally starts minutes excluding Smooth Criminal sequence Joe Pesci convincing psychopathic powerful drug lord Why wants MJ dead bad beyond Because MJ overheard plans Nah Joe Pesci character ranted wanted people know supplying drugs etc dunno maybe hates MJ music br br Lots cool things like MJ turning car robot whole Speed Demon sequence Also director must patience saint cam

Our data is now ready to process! However we must apply this to all the data we have via a function as follows:

In [56]:
# Function to clean the data up

stops = set(stopwords.words("english"))     # Because sets are quicker than lists!

def cleaner(row):
    review = re.sub("[^a-zA-Z]",                         
                    " ",              
                    BeautifulSoup(row["review"],'lxml').get_text().lower() )
    review = review.split()
    review = [w for w in review if not w in stops]
    review = ( " ".join(review))   
    return review

df_ltrain["clean text"] = df_ltrain.apply(cleaner,axis=1)
df_utrain["clean text"] = df_utrain.apply(cleaner,axis=1)
df_test["clean text"]   = df_test.apply(cleaner,axis=1)

In [58]:
df_ltrain.head(5)

Unnamed: 0,id,sentiment,review,clean text
0,"""5814_8""",1,"""With all this stuff going down at the moment ...",stuff going moment mj started listening music ...
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ...",classic war worlds timothy hines entertaining ...
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell...",film starts manager nicholas bell giving welco...
3,"""3630_4""",0,"""It must be assumed that those who praised thi...",must assumed praised film greatest filmed oper...
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ...",superbly trashy wondrously unpretentious explo...


## Part 2: Bag of Words

Now that we've cleaned our data up, we can create our bag of words

The Bag of Words model learns a vocabulary from all of the documents, then models each document by counting the number of times each word appears. For example, consider the following two sentences:

* Sentence 1: "The cat sat on the hat"

* Sentence 2: "The dog ate the cat and the hat"

From these two sentences, our vocabulary is as follows:

* { the, cat, sat, on, hat, dog, ate, and }

To get our bags of words, we count the number of times each word occurs in each sentence. In Sentence 1, "the" appears twice, and "cat", "sat", "on", and "hat" each appear once, so the feature vector for Sentence 1 is:

* { the, cat, sat, on, hat, dog, ate, and }
* Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 }

Similarly, the features for Sentence 2 are: { 3, 1, 0, 0, 1, 1, 1, 1}

## Making the bag of words

Making a Bag of Words in Python is accompished using the Scikit-learn library. In this example we're going to be limiting the number of words to 5000 both due to processing time and also the danger of overfitting.

The code for this is as follows:

In [84]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer = "word",   
                             tokenizer = None,    
                             preprocessor = None, 
                             stop_words = None,   
                             max_features = 5000)      # Creating the Bag of Words

ltrain = df_ltrain["clean text"].tolist()              # Converting the features column to a list
ltrain_features = vectorizer.fit_transform(ltrain)     # Fitting the training data to our model
ltrain_features = ltrain_features.toarray()            # Converting the test data to an array

## Decision Trees & Random Forests

Now we have our bag of words, we're going to do some supervised learning with a Random Forest, which is essentially a type of Decision Tree.

Tom is at Blockbuster deciding on a film to rent and has invited Emma along to help him decide which one to watch. Emma begins by asking Tom for a list of films he has previously enjoyed (a training dataset). 

Tom them picks up a film and Emma asks him a series of questions such as 'What genre of film is it?' , 'Which actors & actresses are in it' and 'Who directed it?' and cross references this with the list he has given her before giving him a yes or no reccomendation as to whether he will enjoy the film (classifying the test dataset).

In this example Emma is a <b>Decision Tree</b>.

However Emma is just one person and her questions may lean towards on aspects that actually Tom doesn't care about (overfitting).

So to get a better idea Tom invites Rupesh, Kathryn, Will and Howard to Blockbuster and has them repeat the process. Each gives their answer and Tom decides based upon the majority opinion (an ensemble classifier).

But what if Emma, Rupesh, Kathryn, Will and Howard have the exact same process for asking and answering questions and therefore give him similar answers? To get around this Tom decides to randomly select films to give to each of his friends to avoid getting the same list of answers.

There is also the danger of spurious connections. For example, two of the films on Tom's favoutites list are The Breakfast Club and Predator. He likes the Breakfast Club because it's coming-of-age comedy/drama starring several 'Brat Pack' members, but he likes Predator because it is a sci-fi film starring Arnold Schwarzenegger. However, one thing connecting these films is that they were both made in the 1980s. As such his friends may ask which year the film was made and conclude that Tom likes these films because they were made in the 1980s and reccomend he watch Footloose, something which he has no inclination to watch!

To fix this, Tom decides to make his friends randomly choose which questions to ask.

This introduces randomness to both the sampling process (giving each of his friends a random list of films) and the model (by making them ask a random question).

This process is known as a <b>Random Forest</b>.

### For example:
    
Tom randomly selects a list of films he likes and gives it to each of his friends:
    
Emma: I like Predator, The Breakfast Club, Shawshank Redemption, Wall-E, Blade Runner
Rupesh: I like Predator, Shawshank Redemption, Aliens, How to Train Your Dragon, The Dark Knight
Kathryn: I like The Breakfast Club, Terminator 2, Inception, Whiplash, Shawshank Redemption
Will: I like Mystery Men, Aliens, Wall-E, Shawshank Redemption, Fight Club
Howard: I like Predator, How to Train Your Dragon, Kindergarden Cop, Aliens, Die Hard
    
Tom then picks up Total Recall and gets his friends then ask him random questions from their list to determine whether he will like it:

Emma: ['What year was Predator made?', 'Who directed Predator', 'Is it a Rom-com?' ....], ['When did you first watch the Breakfast Club?','Is it a cartoon?', 'Did you watch it at the cinema?']... <br/>
Rupesh: ['Who is the lead actor in Predator?' , 'Is it an animated film?', 'Is it a Sci-fi film?' ...], ['Who wrote the Shawshank Redemption?', 'Is it based on a True Story?','Is it an ensemble cast?']...<br/>
Kathryn: ['When was the Breakfast club made?' , 'What was it's budget?', 'Was it considered a box office success?'...], ['Who Directed Mystery Men?', 'What was it's budget?', 'Who is the lead actor / actress?']... <br/>

... and so on.

Based upon Tom's list of films, and the list of questions his friends will each give him a yes / no answer as to whether he will like Total Recall.

## Information Gain & Entropy

In order to pick which feature to split on, we need a way of measuring how good the split is. This is where information gain and entropy come in.

We would like to choose questions that give a lot of information about the tree’s prediction. For example, if there is a single yes/no question that accurately predicts the outputs 99% of the time, then that question allows us to “gain” a lot of information about our data. In order to measure how much information we gain, we introduce entropy which is a measure of uncertainty associated with our data. We can intuitively think that if a data set had only one label then we have a low entropy. So we would like to split our data in a way that minimizes the entropy. The better the splits, the better our prediction will be.

## Random Forests in Scikit-Learn

Coding these concepts from scratch would be a laborious and time consuming job. Fortunately this is all built into Scikit-learn's RandomForestClassifier library and you can create a Random Forest model very simply:

In [61]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators = 100)              # Creating a blank Random Forest. n_estimators is the number of trees in the forest
forest = forest.fit(ltrain_features, df_ltrain["sentiment"])     # Training the model with our data

In [68]:
ltest = df_test["clean text"].tolist()                           # Converting the test data column to a list
ltest_features = vectorizer.fit_transform(ltest)                 # Fitting the test data to our model
ltest_features = ltest_features.toarray()                        # Converting the test data to an array

In [86]:
result = forest.predict(ltest_features)                          # Running our model

df_result = pd.DataFrame(data={"id":df_test["id"],
                            "review":df_test["review"],
                            "sentiment":result})                 # Transferring the result to a DataFrame

In [97]:
df_result

Unnamed: 0,id,review,sentiment
0,"""12311_10""","""Naturally in a film who's main themes are of ...",0
1,"""8348_2""","""This movie is a disaster within a disaster fi...",0
2,"""5828_4""","""All in all, this is a movie for kids. We saw ...",0
3,"""7186_2""","""Afraid of the Dark left me with the impressio...",1
4,"""12128_7""","""A very accurate depiction of small time mob l...",0
5,"""2913_8""","""...as valuable as King Tut's tomb! (OK, maybe...",0
6,"""4396_1""","""This has to be one of the biggest misfires ev...",1
7,"""395_2""","""This is one of those movies I watched, and wo...",1
8,"""10616_1""","""The worst movie i've seen in years (and i've ...",0
9,"""9074_9""","""Five medical students (Kevin Bacon, David Lab...",0


## Further Reading

<a href = "https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words">Kaggle Tutorial</a><br/>
<a href = "https://medium.com/@josemarcialportilla/enchanted-random-forest-b08d418cb411#.ikr779iae">Introduction to Random Forests</a><br/>
<a href = "www.nltk.org">NLTK</a><br/>
<a href = "http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">Scikit-Learn Random Forest Class</a><br/>