# References

Kaggle tutorial [NLP with spaCy](https://www.kaggle.com/learn/natural-language-processing)           
DATACAMP :: Advanced_NLP_with_spaCy              
Free version of Datacamp course [spaCy](https://course.spacy.io/en/)                 

Tags: NLP, spaCy

In [28]:
# imports
import os
import random

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import minibatch

In [None]:
# getting help

# help(plt.hist)
# help(plt.legend)

In [None]:
# settings
import warnings
warnings.filterwarnings('ignore')
#print all rows of a df in ipython shell 
pd.set_option('display.max_rows', None)
#print all columns of a df in ipython shell 
pd.set_option('display.max_columns', None)

# Data preparation

In [36]:
path = os.path.abspath(os.getcwd())
datadir = 'data'
full_path = os.path.join(path, datadir)
spam_file = os.path.join(full_path, "datasets_483_982_spam.csv")
rawData = pd.read_csv(spam_file, encoding="ISO-8859-1")
# keep only the label and message columns and give them readable names
fullCorpus = rawData[['v1', 'v2']].copy()
fullCorpus.columns = ['label', 'text']
spam = fullCorpus


In [37]:
# data for case study #1

# large data file; only the first 100 reviews are used below
yelp_file = os.path.join(full_path, "yelp_academic_dataset_review.csv.zip")
rawData = pd.read_csv(yelp_file)

reviews = rawData[:100]

# For files too large for pandas, dask DataFrames implement the pandas API:
# import dask.dataframe as dd
# df = dd.read_csv('s3://.../2018-*-*.csv')

In [39]:
#spam.head(10)
reviews.head()

Unnamed: 0,user_id,text,date,review_id,business_id,funny,cool,useful,stars
0,b'hG7b0MtEbXx5QzbzE6C_VA',b'Total bill for this horrible service? Over $...,b'2013-05-07 04:34:36',b'Q1sbwvVQXV2734tPgoKj4Q',b'ujmEBvifdJM6h6RLv4wQIg',1,0,6,1.0
1,b'yXQM5uF2jS6es16SJzNHfg',"b""I *adore* Travis at the Hard Rock's new Kell...",b'2017-01-14 21:30:33',b'GJXCdrto3ASJOqKeVWPi6Q',b'NZnhc2sEQy3RmzKTZnqtwQ',0,0,0,5.0
2,b'n6-Gk65cPZL6Uz8qRm3NYw',"b""I have to say that this office really has it...",b'2016-11-09 20:09:03',b'2TzJjDVDEuAW6MR5Vuc1ug',b'WTqjgwHlXbSFevF32_DJVw',0,0,3,5.0
3,b'dacAIZ6fTM6mqwW5uxkskg',"b""Went in for a lunch. Steak sandwich was deli...",b'2018-01-09 20:56:38',b'yi0R0Ugj_xUx_Nek0-_Qig',b'ikCg8xy5JIg_NGPx-MSIDA',0,0,0,5.0
4,b'ssoyf2_x0EQMed6fgHeMyQ',b'Today was my second out of three sessions I ...,b'2018-01-30 23:07:38',b'11a8sVPMUFtaC7_ABRkmtw',b'b1b1eb3uo-w561D0ZfCEiQ',0,0,7,1.0


# Intro to NLP

## NLP with spaCy

In [4]:
#!python -m spacy download en
#!python -m spacy download en_core_web_lg
#!python -m spacy download en_core_web_sm

In [2]:
# load english model 
nlp = spacy.load('en_core_web_sm')

# with the model create a doc to be processed
doc = nlp("Tea is healthy and calming, don't you think?")

## Tokenizing

Calling nlp on a string returns a document object that contains tokens. A token is a unit of text in the document, such as an individual word or punctuation mark. spaCy splits contractions like "don't" into two tokens, "do" and "n't". You can see the tokens by iterating through the document.

In [3]:
# tokens
for token in doc:
    print(token)

Tea
is
healthy
and
calming
,
do
n't
you
think
?
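
To make the splitting rule concrete, here is a minimal sketch of a regex tokenizer that reproduces the output above for this one sentence. It is a toy stand-in and not spaCy's actual tokenizer; the pattern and the `tokenize` helper are assumptions made for illustration only.

```python
import re

# Toy pattern (an assumption, not spaCy's rules):
#   n't          -> the contraction suffix as its own token
#   \w+(?=n't)   -> the word part that precedes "n't" (e.g. "do" in "don't")
#   \w+          -> any other word
#   [^\w\s]      -> a single punctuation character
TOKEN_RE = re.compile(r"n't|\w+(?=n't)|\w+|[^\w\s]")

def tokenize(text):
    """Split text into word, contraction and punctuation tokens."""
    return TOKEN_RE.findall(text)

tokens = tokenize("Tea is healthy and calming, don't you think?")
print(tokens)
```

A real tokenizer handles many more cases (URLs, abbreviations, other contractions), which is why the notebook relies on spaCy rather than a hand-written pattern.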


## Text preprocessing

There are a few types of preprocessing to improve how we model with words. The first is "lemmatizing." The "lemma" of a word is its base form. For example, "walk" is the lemma of the word "walking". So, when you lemmatize the word walking, you would convert it to walk.

It's also common to remove stopwords. Stopwords are words that occur frequently in the language and don't contain much information. English stopwords include "the", "is", "and", "but", "not".

With a spaCy token, token.lemma_ returns the lemma, while token.is_stop returns a boolean True if the token is a stopword (and False otherwise).

In [5]:
# print each token with its lemma and whether it is a stopword
print("Token \t\tLemma \t\tStopword")
print("-"*40)
for token in doc:
    print(f"{str(token)}\t\t{token.lemma_}\t\t{token.is_stop}")

Token 		Lemma 		Stopword
----------------------------------------
Tea		tea		False
is		be		True
healthy		healthy		False
and		and		True
calming		calm		False
,		,		False
do		do		True
n't		not		True
you		-PRON-		True
think		think		False
?		?		False
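
The table above can be turned into a preprocessing step that keeps only informative lemmas. The sketch below imitates that step in plain Python; the `LEMMAS` and `STOPWORDS` tables are hand-written, hypothetical stand-ins for what spaCy exposes through token.lemma_ and token.is_stop, and dropping punctuation is an extra choice made here for illustration.

```python
# Hypothetical lookup tables standing in for token.lemma_ and token.is_stop.
LEMMAS = {"is": "be", "calming": "calm", "n't": "not"}
STOPWORDS = {"is", "and", "do", "n't", "you"}

def preprocess(tokens):
    """Lowercase tokens, drop stopwords and punctuation, map the rest to lemmas."""
    kept = []
    for tok in tokens:
        low = tok.lower()
        is_punct = not any(ch.isalnum() for ch in low)
        if low in STOPWORDS or is_punct:
            continue
        kept.append(LEMMAS.get(low, low))
    return kept

tokens = ["Tea", "is", "healthy", "and", "calming", ",",
          "do", "n't", "you", "think", "?"]
print(preprocess(tokens))  # ['tea', 'healthy', 'calm', 'think']
```

Note that removing stopwords such as "not" can change the meaning of a sentence, so whether this step helps depends on the task.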


## Pattern Matching

In [11]:
# pattern matching
# create a matcher object
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

# create a list of terms to match in the text
terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']

# matcher needs the patterns as document objects
patterns = [nlp(text) for text in terms]
matcher.add("TerminologyList", patterns)

# Borrowed from https://daringfireball.net/linked/2019/09/21/patel-11-pro
text_doc = nlp("Glowing review overall, and some really interesting side-by-side "
               "photography tests pitting the iPhone 11 Pro against the "
               "Galaxy Note 10 Plus and last year’s iPhone XS and Google Pixel 3.") 

# run the matcher on the doc; each match is a (match_id, start, end) tuple
matches = matcher(text_doc)
print(matches)

# print the rule name and the matched span for every match
for match_id, start, end in matches:
    print(nlp.vocab.strings[match_id], text_doc[start:end])

[(3766102292120407359, 17, 19), (3766102292120407359, 22, 24), (3766102292120407359, 30, 32), (3766102292120407359, 33, 35)]
TerminologyList iPhone 11
TerminologyList Galaxy Note
TerminologyList iPhone XS
TerminologyList Google Pixel
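
The idea behind matching on the lowercase form can be sketched without spaCy. The helper below is a simplified illustration, not how PhraseMatcher is implemented: it splits on whitespace instead of using a real tokenizer, and `find_phrases` is a hypothetical name. Like the matcher output above, it returns (start, end) token offsets.

```python
def find_phrases(tokens, phrases):
    """Return sorted (start, end) token spans where a phrase occurs, ignoring case."""
    lowered = [t.lower() for t in tokens]
    matches = []
    for phrase in phrases:
        target = [w.lower() for w in phrase.split()]
        n = len(target)
        # slide a window of the phrase length over the token list
        for start in range(len(lowered) - n + 1):
            if lowered[start:start + n] == target:
                matches.append((start, start + n))
    return sorted(matches)

tokens = "pitting the iPhone 11 Pro against the Galaxy Note 10 Plus".split()
matches = find_phrases(tokens, ["galaxy note", "iphone 11"])
print(matches)                                      # [(2, 4), (7, 9)]
print([" ".join(tokens[s:e]) for s, e in matches])  # ['iPhone 11', 'Galaxy Note']
```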


# Text Classification with spaCy using bag of words

A common task in NLP is text classification. This is "classification" in the conventional machine learning sense, and it is applied to text. Examples include spam detection, sentiment analysis, and tagging customer queries.

In this tutorial, you'll learn text classification with spaCy. The classifier will detect spam messages, a feature found in most email clients. The data is the spam DataFrame loaded in the data preparation section, with a label column ("ham" or "spam") and a text column holding the message.


## Bag of Words

## Building a Bag of Words model

Once you have your documents in a bag of words representation, you can use those vectors as input to any machine learning model. spaCy handles the bag of words conversion and building a simple linear model for you with the TextCategorizer class.
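
As a rough illustration of the representation itself, the sketch below builds bag of words count vectors by hand. It is only an assumption-laden toy (whitespace splitting, lowercase, a vocabulary built from two example sentences) and not what TextCategorizer does internally; `bag_of_words` is a hypothetical helper.

```python
from collections import Counter

def bag_of_words(docs):
    """Return the sorted vocabulary and one count vector per document."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        # one entry per vocabulary word; word order in the document is discarded
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

docs = ["free tea free entry", "tea is healthy"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['entry', 'free', 'healthy', 'is', 'tea']
print(vectors)  # [[1, 2, 0, 0, 1], [0, 0, 1, 1, 1]]
```

Each vector has one position per vocabulary word, so its length grows with the vocabulary, and most entries are zero for short messages.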

The TextCategorizer is a spaCy pipe. Pipes are classes for processing and transforming tokens. When you create a spaCy model with nlp = spacy.load('en_core_web_sm'), there are default pipes that perform part-of-speech tagging, entity recognition, and other transformations. When you run text through a model with doc = nlp("Some text here"), the output of the pipes is attached to the tokens in the doc object. The lemmas for token.lemma_ come from one of these pipes.

You can remove or add pipes to models. What we'll do here is create an empty model without any pipes (other than a tokenizer, since all models always have a tokenizer). Then, we'll create a TextCategorizer pipe and add it to the empty model.

Since the classes are either ham or spam, we set "exclusive_classes" to True. We've also configured it with the bag of words ("bow") architecture. spaCy provides a convolutional neural network architecture as well, but it's more complex than you need for now.

Next we'll add the labels to the model. Here "ham" is the label for real messages and "spam" is the label for spam messages.


In [19]:
# create a TextCategorizer model
# Create an empty model
nlp = spacy.blank("en")

# Create the TextCategorizer with exclusive classes and "bow" architecture
textcat = nlp.create_pipe(
              "textcat",
              config={
                "exclusive_classes": True,
                "architecture": "bow"})

# Add the TextCategorizer to the empty model
nlp.add_pipe(textcat)

# Add labels to text classifier
textcat.add_label("ham")
textcat.add_label("spam")


1

## Training a Text Categorizer Model

In [21]:
# create training data

train_texts = spam['text'].values
train_labels = [{'cats': {'ham': label == 'ham',
                          'spam': label == 'spam'}} 
                for label in spam['label']]
train_data = list(zip(train_texts, train_labels))
train_data[:3]


[('Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
  {'cats': {'ham': True, 'spam': False}}),
 ('Ok lar... Joking wif u oni...', {'cats': {'ham': True, 'spam': False}}),
 ("Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
  {'cats': {'ham': False, 'spam': True}})]

In [24]:
# training the model

random.seed(1)
spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

# losses is created once, so each printed value is a running total over all epochs so far
losses = {}
for epoch in range(10):
    random.shuffle(train_data)
    # Create the batch generator with batch size = 8
    batches = minibatch(train_data, size=8)
    # Iterate through minibatches
    for batch in batches:
        # Each batch is a list of (text, label) but we need to
        # send separate lists for texts and labels to update().
        # This is a quick way to split a list of tuples into lists
        texts, labels = zip(*batch)
        nlp.update(texts, labels, sgd=optimizer, losses=losses)
    print(losses)

{'textcat': 1.3595883706348104}
{'textcat': 1.7019103596425111}
{'textcat': 1.8889567541276335}
{'textcat': 2.0086188285025415}
{'textcat': 2.0908757135393223}
{'textcat': 2.142813575270255}
{'textcat': 2.1792574747399316}
{'textcat': 2.205051286399283}
{'textcat': 2.224066504236819}
{'textcat': 2.238125779956701}
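
The shuffling, batching, and unzipping steps in the loop above can be tried without spaCy. In this sketch `simple_minibatch` is a hypothetical stand-in for spacy.util.minibatch with a fixed batch size, and the five training examples are made up for illustration.

```python
import random

def simple_minibatch(items, size):
    """Yield consecutive chunks of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Made-up (text, annotation) pairs in the same shape as train_data.
train_data = [
    (f"message {i}", {"cats": {"ham": i % 2 == 0, "spam": i % 2 == 1}})
    for i in range(5)
]

random.seed(1)
random.shuffle(train_data)

batches = list(simple_minibatch(train_data, size=2))
print([len(batch) for batch in batches])  # [2, 2, 1]

# zip(*batch) turns a list of (text, label) pairs into a tuple of texts
# and a tuple of labels, which is the form passed to nlp.update above.
texts, labels = zip(*batches[0])
print(texts)
print(labels)
```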


## Making Predictions

In [25]:
# making predictions

texts = ["Are you ready for the tea party????? It's gonna be wild",
         "URGENT Reply to this message for GUARANTEED FREE TEA" ]
docs = [nlp.tokenizer(text) for text in texts]
    
# Use textcat to get the scores for each doc
textcat = nlp.get_pipe('textcat')
scores, _ = textcat.predict(docs)

print(scores)

# From the scores, find the label with the highest score/probability
predicted_labels = scores.argmax(axis=1)
print([textcat.labels[label] for label in predicted_labels])


[[9.9995530e-01 4.4716071e-05]
 [1.3936103e-02 9.8606384e-01]]
['ham', 'spam']
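
The last step, turning scores into labels, is just an argmax over each row. The sketch below repeats it in plain Python using the scores printed above; the label order ("ham", "spam") is taken from the order in which the labels were added.

```python
# Scores copied from the output above: one row per text, one column per label.
scores = [
    [9.9995530e-01, 4.4716071e-05],
    [1.3936103e-02, 9.8606384e-01],
]
labels = ("ham", "spam")

def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

predicted = [labels[argmax(row)] for row in scores]
print(predicted)  # ['ham', 'spam']
```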


# Text Classification with spaCy using Word Embeddings

## Word Embeddings

## Word embeddings as features

![Image](img/WordEmbeddings.PNG)

In [29]:
# Need to load the large model to get the vectors
nlp = spacy.load('en_core_web_lg')

In [30]:
# Disabling other pipes because we don't need them and it'll speed up this part a bit
text = "These vectors can be used as features for machine learning models."
with nlp.disable_pipes():
    vectors = np.array([token.vector for token in  nlp(text)])

vectors.shape

(12, 300)

In [31]:
# doc.vector averages the token vectors, giving one 300-dimensional vector per message

with nlp.disable_pipes():
    doc_vectors = np.array([nlp(text).vector for text in spam.text])
    
doc_vectors.shape

(5572, 300)
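
By default, a spaCy document vector is the average of its token vectors, which is why 5572 messages yield a (5572, 300) array. The sketch below shows that averaging on made-up three-dimensional vectors; `mean_vector` is a hypothetical helper and the numbers are not real embeddings.

```python
def mean_vector(token_vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

# Two made-up token vectors with 3 dimensions instead of 300.
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]
doc_vector = mean_vector(token_vectors)
print(doc_vector)  # [2.0, 1.0, 1.0]
```

Averaging discards word order, so two messages with the same words in a different order get the same document vector.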

## Classification Models

With the document vectors, you can train scikit-learn models, xgboost models, or any other standard approach to modeling. 

Here is an example using support vector machines (SVMs). Scikit-learn provides an SVM classifier, LinearSVC, which works similarly to other scikit-learn models.

In [32]:
# train, test split
X_train, X_test, y_train, y_test = train_test_split(doc_vectors, spam.label,
                                                    test_size=0.1, random_state=1)

In [33]:
# dual=False solves the primal problem, which is preferred when there are more samples than features
svc = LinearSVC(random_state=1, dual=False, max_iter=10000)
svc.fit(X_train, y_train)
print(f"Accuracy: {svc.score(X_test, y_test) * 100:.3f}%")

Accuracy: 97.312%


# Document similarity

Documents with similar content generally have similar vectors, so you can find similar documents by measuring the similarity between their vectors. A common metric for this is cosine similarity, the cosine of the angle θ between two vectors a and b:

$$\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$$

This is the dot product of a and b divided by the product of their magnitudes. Cosine similarity ranges from -1 to 1, corresponding to vectors that point in opposite directions and vectors that point in the same direction, respectively. To calculate it, you can use the metric from scikit-learn or write your own function.

In [34]:
# cosine similarity
def cosine_similarity(a, b):
    return a.dot(b)/np.sqrt(a.dot(a) * b.dot(b))

In [35]:
a = nlp("REPLY NOW FOR FREE TEA").vector
b = nlp("According to legend, Emperor Shen Nung discovered tea when leaves from a wild tree blew into his pot of boiling water.").vector
cosine_similarity(a, b)

0.7030031
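
To check the stated range, the sketch below evaluates the same formula on small hand-picked vectors using only the standard library. The vectors are made up so that the three boundary cases (same direction, perpendicular, opposite direction) are easy to verify by hand.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])           # same direction
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # right angle
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])     # opposite direction

print(same, perpendicular, opposite)  # 1.0 0.0 -1.0
```

The length of the vectors does not matter, only their direction, which is why [1, 2] and [2, 4] score 1.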

# Case study #1 :: Sentiment analysis

## Vectorizing Language

Embeddings are both conceptually clever and practically effective.

So let's try them for the sentiment analysis model you built for the restaurant. Then you can find the most similar review in the data set given some example text. It's a task where you can easily judge for yourself how well the embeddings work.

## Creating document vectors

In [41]:
# We just want the vectors so we can turn off other models in the pipeline
with nlp.disable_pipes():
    vectors = np.array([nlp(review.text).vector for idx, review in reviews.iterrows()])
    
vectors.shape

(100, 300)
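
With one vector per review, finding the review most similar to an example text comes down to a cosine similarity search. The sketch below shows the search on made-up two-dimensional vectors; `most_similar` is a hypothetical helper, and in the notebook the query would instead be the vector of the example text and the candidates the rows of the vectors array.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

def most_similar(query, candidates):
    """Return (index, similarity) of the candidate closest to the query."""
    sims = [cosine_similarity(query, vec) for vec in candidates]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]

# Made-up stand-ins for document vectors (2 dimensions instead of 300).
candidates = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query = [0.5, 0.9]

best_index, best_sim = most_similar(query, candidates)
print(best_index, round(best_sim, 3))
```

The returned index can then be used to look up the text of the matching review, which makes it easy to judge by eye whether the embeddings capture similarity.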

## Training a Model on Document Vectors

Next you'll train a LinearSVC model using the document vectors. It runs quickly and works well in high-dimensional settings like the one you have here.

After running the LinearSVC model, you might try experimenting with other types of models to see whether they improve your results.

In [44]:
reviews.head()

Unnamed: 0,user_id,text,date,review_id,business_id,funny,cool,useful,stars
0,b'hG7b0MtEbXx5QzbzE6C_VA',b'Total bill for this horrible service? Over $...,b'2013-05-07 04:34:36',b'Q1sbwvVQXV2734tPgoKj4Q',b'ujmEBvifdJM6h6RLv4wQIg',1,0,6,1.0
1,b'yXQM5uF2jS6es16SJzNHfg',"b""I *adore* Travis at the Hard Rock's new Kell...",b'2017-01-14 21:30:33',b'GJXCdrto3ASJOqKeVWPi6Q',b'NZnhc2sEQy3RmzKTZnqtwQ',0,0,0,5.0
2,b'n6-Gk65cPZL6Uz8qRm3NYw',"b""I have to say that this office really has it...",b'2016-11-09 20:09:03',b'2TzJjDVDEuAW6MR5Vuc1ug',b'WTqjgwHlXbSFevF32_DJVw',0,0,3,5.0
3,b'dacAIZ6fTM6mqwW5uxkskg',"b""Went in for a lunch. Steak sandwich was deli...",b'2018-01-09 20:56:38',b'yi0R0Ugj_xUx_Nek0-_Qig',b'ikCg8xy5JIg_NGPx-MSIDA',0,0,0,5.0
4,b'ssoyf2_x0EQMed6fgHeMyQ',b'Today was my second out of three sessions I ...,b'2018-01-30 23:07:38',b'11a8sVPMUFtaC7_ABRkmtw',b'b1b1eb3uo-w561D0ZfCEiQ',0,0,7,1.0


In [46]:
vectors

array([[-0.19047938,  0.1821789 , -0.02618041, ..., -0.04719066,
         0.01550798,  0.09590381],
       [-0.0256067 ,  0.1500729 , -0.11254921, ..., -0.04775102,
         0.03123375,  0.06923966],
       [-0.07813025,  0.1925469 , -0.14057271, ..., -0.04411409,
         0.06611802,  0.08484872],
       ...,
       [-0.03007777,  0.16273198, -0.08914592, ..., -0.15289608,
         0.1121713 ,  0.07087018],
       [-0.04080139,  0.22955525, -0.13488048, ..., -0.05866174,
         0.01276404,  0.0743405 ],
       [-0.02303501,  0.15512745, -0.08919447, ..., -0.04124865,
         0.03225725,  0.04699771]], dtype=float32)

In [43]:
# The Yelp data has no `sentiment` column, so a target must be built first.
# Assumption: reviews with 4 or more stars count as positive (1), the rest as negative (0).
sentiment = (reviews['stars'] >= 4).astype(int)

# train, test split
X_train, X_test, y_train, y_test = train_test_split(vectors, sentiment,
                                                    test_size=0.1, random_state=1)

# Create the LinearSVC model
model = LinearSVC(random_state=1, dual=False)

# Fit the model
model.fit(X_train, y_train)

# Report accuracy on the held-out reviews
print(f'Model test accuracy: {model.score(X_test, y_test)*100:.3f}%')