<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">
 
# Introduction to Natural Language Processing

## Icebreaker

Today we're talking about text data. With your neighbour, discuss some use cases for analysing/making predictions using text data.

- what problems could you solve with text analysis algorithms?
- what are some text-based **targets** you could predict with machine learning?

# Learning objectives

- Definition of NLP
- Applications of NLP
- Techniques to implement NLP (Bag of Words, Preprocessing)
- How to exploit NLP in Machine Learning

# What is Natural Language Processing?

Natural Language Processing (NLP) is a field of Computer Science concerned with automating the:

- understanding
- interpretation
- manipulation

of human language, whether written or spoken.

![](assets/images/a-panorama-of-natural-language-processing-6-638.jpg)

Source: https://www.slideshare.net/TedXiao/a-panorama-of-natural-language-processing

Using linguistics, we can transform human language (written text, audio) into _digestible_ data for computer algorithms.

![](assets/images/nlp_1.png)

The process works both ways:

- a computer can receive human language as input
- a computer can send human language as output

# Where do we use Natural Language Processing?

- **Chatbots:** Understand natural language from the user and return intelligent responses

- **Information retrieval:** Search!

- **Information extraction:** Structured information from unstructured documents, e.g. Google extracting events from emails

- **Machine translation**

- **Predictive text input**

- **Sentiment analysis**

- **Automatic summarisation:** Extractive or abstractive summarisation.

- **Speech recognition and generation:** Speech-to-text, text-to-speech

## Why is NLP hard?

### Ambiguity

- Hospitals Are Sued by 7 Foot Doctors

- Juvenile Court to Try Shooting Defendant

- Local High School Dropouts Cut in Half

### Non-standard English

- slang

- txt msg speak (LOL)

- newly coined words like "retweet"

### Idioms

- "throw in the towel"

### Tricky entity names

"Where is A Bug's Life playing?"

### Sarcasm

The Data Science/machine learning approach is **not** to model human language, but to **find patterns**

## Today: what category does a "document" belong to based on the words in it?

Category can be anything - it's just a classification problem!

For our example we will use text from Yelp reviews to predict the star rating given by the user

# Techniques to implement NLP

In [1]:
import pandas as pd

df = pd.read_csv("assets/data/yelp.csv.gz")
df.shape

(10000, 10)

In [2]:
df.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


In [3]:
reviews = df["text"].values
star_ratings = df["stars"].values
print(reviews[0])
print(star_ratings[0])

My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.

Do yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I've ever had.  I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.

While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I've ever had.

Anyway, I can't wait to go back!
5


### How do we turn this into a machine learning task?

Our target is the star rating, so it's a classification problem.

But what are our features?

## Bag of Words

Our first approach is simple: each review (or, generically, each "document") is characterised by the words that appear in it.

We use all our documents to construct a "vocabulary" and create one feature for each word (binary presence or a count).

### Example

Let's use Bag of Words to describe the following three documents.

![](assets/images/bag_example_1.png)

We create a dictionary which contains ALL the words we find in these documents.

![](assets/images/bag_example_2.png)

We then count EVERY word in the dictionary for every document, so that each document can be described by a list of occurrences (count vector).

![](assets/images/bag_example_3.png)

Other times, rather than knowing how many times a word occurs in a document, we may just need to know whether this word is present or not (Boolean vector).

![](assets/images/bag_example_4.png)
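
The same steps can be sketched in a few lines of plain Python. The three documents below are made up for illustration (they are not the ones in the figures), and splitting on whitespace is a simplifying assumption.

```python
# Three toy documents (hypothetical, not the ones in the figures above).
docs = [
    "the food was great",
    "the service was slow",
    "great food great service",
]

# Build the vocabulary: every distinct word across all documents, sorted
# so that each word gets a stable column position.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Count vector: how many times each vocabulary word occurs in a document.
count_vectors = [[doc.split().count(word) for word in vocabulary] for doc in docs]

# Boolean vector: 1 if the word is present at all, 0 otherwise.
boolean_vectors = [[int(count > 0) for count in row] for row in count_vectors]

print(vocabulary)
for counts, flags in zip(count_vectors, boolean_vectors):
    print(counts, flags)
```

The third document repeats "great", so its count vector contains a 2 where its Boolean vector contains a 1.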

In other cases, the dictionary can be tailored to specific searches, such as:

- positive sentiment dictionary ("amazing", "brilliant", "fantastic", ....)
- negative sentiment dictionary ("horrible", "terrible", "disgusting", ...)

A document that contains lots of words from the positive sentiment dictionary and few from the negative one can be classified as `positive`, and vice versa.
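
A minimal sketch of this dictionary approach, assuming tiny hand-made word sets and a simple "whichever count is larger wins" rule. The function name `classify_sentiment` and the `neutral` fallback are illustrative choices, not part of the lesson.

```python
# Tiny hand-made dictionaries, seeded with the example words listed above.
POSITIVE = {"amazing", "brilliant", "fantastic"}
NEGATIVE = {"horrible", "terrible", "disgusting"}

def classify_sentiment(text):
    """Label a document by comparing positive and negative word counts."""
    words = text.lower().split()
    n_pos = sum(word in POSITIVE for word in words)
    n_neg = sum(word in NEGATIVE for word in words)
    if n_pos > n_neg:
        return "positive"
    if n_neg > n_pos:
        return "negative"
    return "neutral"  # tie or no dictionary words: an assumption of this sketch

print(classify_sentiment("The food was amazing and the view fantastic"))
print(classify_sentiment("Horrible service and a terrible burger"))
```

The first call returns `positive` and the second `negative`; a real sentiment dictionary would of course need far more words.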

Useful link

https://machinelearningmastery.com/gentle-introduction-bag-words-model/

# Preprocessing

### 1. Lowercasing
### 2. Remove stopwords
### 3. Stemming/Lemmatisation

What do we need to do with documents like this?

In [4]:
reviews[0]

'My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I\'ve ever had.  I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I\'ve ever had.\n\nAnyway, I can\'t wait to go back!'

### 1. Lowercasing

We usually want "The" and "the" to be the same word!

In [5]:
words = reviews[0].split()
words[:10]

['My',
 'wife',
 'took',
 'me',
 'here',
 'on',
 'my',
 'birthday',
 'for',
 'breakfast']

In [6]:
lower_words = [w.lower() for w in words]
lower_words[0:10]

['my',
 'wife',
 'took',
 'me',
 'here',
 'on',
 'my',
 'birthday',
 'for',
 'breakfast']

You could equally `uppercase` every word; the point of this step is to stop words being treated as different only because their case differs.

### 2. Remove stopwords

Stopwords are very common words (such as "the", "and" or "my") that carry little useful information about the content of the document.

We'll be using NLTK ([http://nltk.org](http://nltk.org))

Useful link: http://xpo6.com/list-of-english-stop-words/

In [63]:
from nltk.corpus import stopwords as nltk_stopwords

stopwords = nltk_stopwords.words('english')
print(len(stopwords))
stopwords[:10]

179


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [61]:
print("number of unique words in the original document", len(set(lower_words)))
useful_words = [word for word in lower_words if word not in stopwords]
print("number of unique words in the original document, excluding stopwords", len(set(useful_words)))

number of unique words in the original document 106
number of unique words in the original document, excluding stopwords 76
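
One thing to watch: `split()` keeps punctuation attached to words, so a token like `'was.'` does not match the stopword `'was'`. The sketch below shows one possible fix, stripping leading and trailing punctuation before filtering. The short `stop_list` is a hand-picked stand-in for NLTK's list, and the sentence is made up.

```python
import string

# A tiny hand-picked stop list for illustration only; the notebook uses NLTK's list.
stop_list = {"it", "was", "and", "the"}

tokens = "It was excellent. And it was tasty, it was.".lower().split()

# Naive filter: 'was.' and 'excellent.' keep their punctuation.
naive = [t for t in tokens if t not in stop_list]

# Strip leading/trailing punctuation first, then filter.
stripped = [t.strip(string.punctuation) for t in tokens]
cleaned = [t for t in stripped if t and t not in stop_list]

print(naive)
print(cleaned)
```

The naive version lets `'was.'` through, while the cleaned version keeps only `'excellent'` and `'tasty'`.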


### 3. Stemming

- We want words like "having" and "have" to be the same

Useful link: https://tartarus.org/martin/PorterStemmer/

In [9]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')
stemmer.stem("having")

'have'

In [10]:
stemmed_words = [stemmer.stem(word) for word in useful_words]
print(useful_words[:10])
print(stemmed_words[:10])

['wife', 'took', 'birthday', 'breakfast', 'excellent.', 'weather', 'perfect', 'made', 'sitting', 'outside']
['wife', 'took', 'birthday', 'breakfast', 'excellent.', 'weather', 'perfect', 'made', 'sit', 'outsid']
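
To get a feel for what a stemmer does, here is a deliberately crude suffix stripper. It is a toy written for this example, not the Porter or Snowball algorithm, and the `toy_stem` name and its three-suffix rule list are assumptions.

```python
def toy_stem(word):
    """Crude suffix stripping; NOT the Porter or Snowball algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["looking", "looked", "looks", "look", "sitting"]:
    print(word, "->", toy_stem(word))
```

All the forms of "look" collapse to the same stem, but "sitting" becomes "sitt". Real stemmers add extra rules for cases like doubled consonants, which is why Snowball returns `'sit'` in the output above.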


### 3b. Lemmatisation

Like stemming, but it looks words up in a dictionary (WordNet) and returns a real word (the lemma) instead of just chopping off suffixes.

In [11]:
from nltk import WordNetLemmatizer

lem = WordNetLemmatizer()
lemmatised_words = [lem.lemmatize(word, 'v') for word in useful_words]
print(useful_words[:10])
print(stemmed_words[:10])
print(lemmatised_words[:10])

['wife', 'took', 'birthday', 'breakfast', 'excellent.', 'weather', 'perfect', 'made', 'sitting', 'outside']
['wife', 'took', 'birthday', 'breakfast', 'excellent.', 'weather', 'perfect', 'made', 'sit', 'outsid']
['wife', 'take', 'birthday', 'breakfast', 'excellent.', 'weather', 'perfect', 'make', 'sit', 'outside']


Downside: you need to tell it the part of speech of each word (verb, noun etc.); above we passed `'v'`, which treats every word as a verb.

A "part-of-speech tagging" step can supply this automatically. More info here: [https://www.nltk.org/book/ch05.html](https://www.nltk.org/book/ch05.html)
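
A rough sketch of the glue needed between a tagger and the lemmatiser: taggers typically emit Penn Treebank tags (`VBD`, `NN`, ...), while WordNet expects a single letter (`'v'`, `'n'`, `'a'`, `'r'`). The helper name `penn_to_wordnet` and the tagged example are hypothetical; the tags are written by hand here, not produced by a real tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also the fallback for every other tag

# Hand-written tags for "took the excellent breakfast quickly" (illustrative).
tagged = [("took", "VBD"), ("the", "DT"), ("excellent", "JJ"),
          ("breakfast", "NN"), ("quickly", "RB")]

print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

With NLTK's tagger data downloaded, the `(word, tag)` pairs could come from `nltk.pos_tag(words)`, and the mapped letter could be passed as the second argument to `lem.lemmatize`.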

Useful link: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

# Machine Learning with Bag of Words

By using Bag of Words:

- we can describe a document in a computer-friendly way
- every document becomes a vector
- we can assign a `score` to the vector
- we can assign a `label` to the vector

Our model can therefore:

- predict the `score` of the document (`regression problem`)
- predict the `label` of the document (`classification problem`)

In [62]:
# apply stemming
df["text_stemmed"] = df["text"].apply(lambda x: " ".join([stemmer.stem(w) for w in x.split()]))
print(df["text"].values[0])
print(df["text_stemmed"].values[0])

My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.

Do yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I've ever had.  I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.

While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I've ever had.

Anyway, I can't wait to go back!
my wife took me here on my birthday for breakfast and it was excellent. the weather was perfect which made sit

Let's start with "1 star" vs. "5 star" reviews as a binary classification

In [50]:
from sklearn.model_selection import train_test_split

X = df.loc[df["stars"].isin([1, 5]), "text_stemmed"]
y = df.loc[df["stars"].isin([1, 5]), "stars"]

# stratify=y keeps the 1-star/5-star proportions the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
y.value_counts()

5    3337
1     749
Name: stars, dtype: int64

Useful link on stratify

https://stackoverflow.com/questions/34842405/parameter-stratify-from-method-train-test-split-scikit-learn/41074531
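
To illustrate the idea behind `stratify`, here is a plain-Python sketch that splits each class separately so both sets keep the same class mix. It is not scikit-learn's implementation; the `stratified_split` helper is hypothetical and only the class counts are taken from the output above.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_idx, test_idx), sampling test_size of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Same class counts as the notebook output above: 3337 five-star, 749 one-star.
labels = [5] * 3337 + [1] * 749
train_idx, test_idx = stratified_split(labels)

for name, idx in [("train", train_idx), ("test", test_idx)]:
    counts = Counter(labels[i] for i in idx)
    print(name, len(idx), round(counts[1] / len(idx), 3))
```

Both sets end up with roughly 18% 1-star reviews, matching the full dataset. Without stratification, a random split could leave the minority class over- or under-represented in the test set.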

In [64]:
# LogisticRegression needs numerical features, so fitting on raw text fails (see the error below)
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train)


ValueError: could not convert string to float: "yikes, read other review i realiz my bad experi wasn't unique. as a server i make a veri laid back customer. i like pretti much everyth i eat and don't requir a lot of attent from the waiter. la piccola cucina would benefit from just one extra person in the front of the house. our guy, though ador and friendly, was too busi to refil our drink and to rememb to bring our appet (though charg us for it). the ahi tuna (high recommend over the other fish option he said) was so overcook it was the color and consist of chicken. like other review mentioned, he was frantic and made that clear to everi customer. at one point i even saw him in the kitchen cook - they need anoth person! i left super stress out from the experience, which is very, veri unusu for me. you can either have bad servic or bad food, but not both."

In [66]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(binary=True,
                      stop_words='english',
                      lowercase=True # default
                     )

# starting from the 2860 documents in our training set, we translate each one into a bag of words,
# i.e. a vector of 0/1 flags marking which vocabulary words it contains (binary=True)
X_train_text = vec.fit_transform(X_train)
X_test_text = vec.transform(X_test)

print(len(vec.vocabulary_))
# look at some random features
print(vec.get_feature_names_out()[1000:1010].tolist())  # get_feature_names() was removed in scikit-learn 1.2

14379
['astound', 'astounded', 'astounding', 'astrological', 'astronom', 'asu', 'atari', 'ate', 'atf', 'athelet']


In [67]:
# the bag-of-words matrix is numerical, so LogisticRegression can now be fitted
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train_text, y_train)
print(lr.intercept_, lr.coef_)

[1.13192726] [[ 2.39703753e-01  4.95926015e-02 -1.36257012e-03 ...  2.78036103e-03
   8.09948163e-06  6.41188584e-05]]
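
Each coefficient belongs to one vocabulary word. Assuming scikit-learn orders the classes as `[1, 5]`, a positive coefficient should push a review towards 5 stars and a negative one towards 1 star. The sketch below ranks words by weight; the words and numbers are invented for illustration and are not taken from the fitted model.

```python
# Hypothetical (word, coefficient) pairs; in the notebook they would come
# from pairing the vectoriser's vocabulary with lr.coef_[0].
weights = {"amazing": 1.8, "delicious": 1.2, "table": 0.05,
           "rude": -1.5, "worst": -2.1}

# Sort from most negative to most positive coefficient.
ranked = sorted(weights.items(), key=lambda pair: pair[1])

print("most 1-star-like:", ranked[:2])
print("most 5-star-like:", ranked[-2:])
```

Looking at the extremes of the ranking is a quick sanity check that the model has picked up sensible patterns.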


In [74]:
from sklearn.metrics import accuracy_score

y_pred = lr.predict(X_test_text)

accuracy_score(y_test, y_pred)

0.9290375203915171
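
Before celebrating, it helps to compare this accuracy with a trivial baseline. The classes are imbalanced, so a "model" that always predicts 5 stars already does quite well. The arithmetic below uses the class counts from the `y.value_counts()` output above.

```python
# Class counts taken from the y.value_counts() output above.
n_five_star = 3337
n_one_star = 749

# Accuracy of a "model" that ignores the text and always predicts 5 stars.
baseline_accuracy = n_five_star / (n_five_star + n_one_star)
print(round(baseline_accuracy, 3))
```

The baseline is about 0.82, so accuracy alone flatters the model. This is one reason to also look at a metric such as `f1`.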

In [76]:
import numpy as np
from sklearn.model_selection import cross_val_score

scores = cross_val_score(lr, X_train_text, y_train, scoring="f1", cv=5)
print(scores, np.mean(scores))

[0.83       0.7486631  0.72527473 0.82978723 0.72527473] 0.7717999572392562


Useful link on Cross-Validation

https://machinelearningmastery.com/k-fold-cross-validation/
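
As a rough picture of what `cv=5` does, the sketch below cuts the sample indices into five folds and uses each fold once for validation. The `k_fold_indices` helper is hypothetical and simplified: it neither shuffles nor stratifies, whereas scikit-learn uses stratified folds for classifiers.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

for train_idx, val_idx in k_fold_indices(10, k=5):
    print("validate on", val_idx, "train on", len(train_idx), "samples")
```

Every sample is used for validation exactly once, and the model is trained five times on the remaining data, which is where the five scores come from.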

Useful link on `f1`

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
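
For reference, `f1` combines precision and recall into one number. With labels 1 and 5, scikit-learn's default `pos_label=1` should mean the score describes how well the 1-star class is found. The confusion counts below are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # of the predicted positives, how many were right
    recall = tp / (tp + fn)     # of the actual positives, how many were found
    return 2 * precision * recall / (precision + recall)

# Hypothetical confusion counts for the 1-star class.
f1 = f1_from_counts(tp=150, fp=30, fn=75)
print(round(f1, 3))
```

Here precision is about 0.83 and recall about 0.67, giving an F1 of about 0.74. True negatives (correctly predicted 5-star reviews) do not enter the formula, so the majority class cannot inflate the score.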

Try some more things!

`min_df` is the minimum number of documents a word must appear in to be kept in the vocabulary.

In [77]:
def try_new_vectoriser(vec, X, y):
    # fit the vectoriser on X and report the vocabulary size
    X_text = vec.fit_transform(X)
    print(len(vec.vocabulary_))
    lr = LogisticRegression()
    # use the y passed in rather than the global y_train
    scores = cross_val_score(lr, X_text, y, scoring="f1", cv=5)
    print(scores, np.mean(scores))

try_new_vectoriser(CountVectorizer(binary=True,
                                   stop_words='english',
                                   min_df=2
                                  ),
                   X_train,
                   y_train)

7251
[0.84       0.75132275 0.72432432 0.83597884 0.72826087] 0.7759773562382257
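
The drop in vocabulary size comes from discarding words seen in only one document. A plain-Python sketch of that filter, on made-up documents, might look like this (it mimics the idea of `min_df`, not scikit-learn's code):

```python
from collections import Counter

docs = ["great food", "great service", "slow service", "great asu"]
min_df = 2

# Document frequency: in how many documents does each word appear?
# set() makes sure a word repeated within one document counts once.
doc_freq = Counter(word for doc in docs for word in set(doc.split()))

# Keep only the words that reach the min_df threshold.
vocabulary = sorted(word for word, n_docs in doc_freq.items() if n_docs >= min_df)
print(vocabulary)
```

Only "great" and "service" survive; rare words such as typos or one-off names are dropped, which shrinks the feature space without losing much signal.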


Instead of binary let's use actual counts

In [78]:
try_new_vectoriser(CountVectorizer(binary=False,
                                   stop_words='english',
                                   min_df=2
                                  ),
                   X_train,
                   y_train)

7251
[0.83673469 0.73096447 0.72527473 0.77419355 0.7311828 ] 0.7596700460486747


Limit to top 1000 most frequent words

In [79]:
try_new_vectoriser(CountVectorizer(binary=False,
                                   stop_words='english',
                                   min_df=2,
                                   max_features=1000
                                  ),
                   X_train,
                   y_train)

1000
[0.79802956 0.7106599  0.72727273 0.74074074 0.73958333] 0.7432572512948411
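
`max_features` works from the other direction: it keeps only the words with the highest total count across the corpus. A small sketch on made-up documents, assuming ties do not matter:

```python
from collections import Counter

docs = ["great food great service", "slow service", "great coffee"]
max_features = 2

# Total count of every word across the whole corpus.
term_freq = Counter(word for doc in docs for word in doc.split())

# Keep only the max_features most frequent words.
vocabulary = sorted(word for word, _ in term_freq.most_common(max_features))
print(vocabulary)
```

With `max_features = 2` only "great" and "service" remain. In the run above, cutting the vocabulary to 1000 words lowered the mean `f1` a little, which suggests that some of the rarer words were useful.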


## Exercise

- Start with notebook 01 to install `nltk`!