# Using spaCy for Natural Language Processing

## Introduction

In the lecture, we discussed natural language processing in data science and practiced handling text data with the NLTK library in Python.
In fact, there are many other powerful Python libraries besides NLTK that we can use to process natural language.

spaCy is an industrial-strength natural language processing library written in Python and Cython. It aims for higher processing speed and better accuracy than NLTK. It also provides some features not found in NLTK, such as word vectors.

In this tutorial, we will walk through some useful features of spaCy, using simple examples to show how each one works. Then we will use a larger data set, the tweet data, to demonstrate two real-world uses of those features.

## Installing spaCy

First, to install spaCy, we can run the following pip command in an Anaconda environment.

pip install -U spacy

To enable the processing pipeline, we also need to download a language model.

python -m spacy.en.download all

This downloads the English language model. spaCy also provides a German language model, which you can try if you are interested.

python -m spacy.de.download all

The downloads may take some time. After that, we can start playing with spaCy in Python.

In [1]:
import pandas as pd
import spacy
import numpy as np
from spacy.en import English
nlp = English()

Here, nlp is a Language object in spaCy, which includes a parser and word vectors built from a huge English text corpus. We can also disable the parser or assign a custom parser during initialization. Here we will stick to the default one.

We can play with the same dataset we used in homework 3 – the tweet data.

In [2]:
tweets = pd.read_csv("tweets_train.csv", na_filter=False)

We get a DataFrame whose text column holds the tweets as strings. Each string is the content of one document.

## spaCy Features

### Tokens

In spaCy, we can generate a Doc object, which holds the processed representation of a document, by calling the Language object on a piece of text. For example,

In [3]:
text = "Wake up Jack! Stop being a potato-coach.It is a good weather today. \
Get on your Nike shoes. Let’s go to New York and play baseball. 12345"
doc = nlp(text.decode("utf-8"))

By iterating through a doc object, we can see the tokens.

In [4]:
token_list = [token.string for token in doc]
print token_list

[u'Wake ', u'up ', u'Jack', u'! ', u'Stop ', u'being ', u'a ', u'potato', u'-', u'coach', u'.', u'It ', u'is ', u'a ', u'good ', u'weather ', u'today', u'. ', u'Get ', u'on ', u'your ', u'Nike ', u'shoes', u'. ', u'Let', u'\u2019s ', u'go ', u'to ', u'New ', u'York ', u'and ', u'play ', u'baseball', u'. ', u'12345']


We can see that the tokens are in their original form. spaCy also attaches a large number of attributes to each token object. For instance, we can get the lemma or lower-case representation through the lemma_ or lower_ attributes. We can even check whether the token looks like a URL, number, or email address, or whether it is a stop word.

In [5]:
print "lemma:", " ".join(token.lemma_ for token in doc)
print "lower:"," ".join(token.lower_ for token in doc)
print doc[-1].like_num
print doc[1].is_stop

lemma: wake up jack ! stop be a potato - coach . it be a good weather today . get on your nike shoe . let ’s go to new york and play baseball . 12345
lower: wake up jack ! stop being a potato - coach . it is a good weather today . get on your nike shoes . let ’s go to new york and play baseball . 12345
True
True


Observe that "lower" prints each word in lower case, while "lemma" lemmatizes the words, so "being" and "is" both print as "be".

The last two lines check whether "12345" looks like a number and whether "up" is a stop word. Both checks return True.

In [6]:
def process(doc):
    tokens = [(token.lemma_).lower() for token in doc if token.is_alpha and not token.is_stop]
    return tokens

token_list = process(doc)
print token_list
lem_doc = nlp((" ".join(token_list).decode("utf-8")))

[u'wake', u'jack', u'stop', u'potato', u'coach', u'good', u'weather', u'today', u'nike', u'shoe', u'let', u'new', u'york', u'play', u'baseball']


With these features, we can process the data as we did in homework 3 to build the token lists.

Note that here we also build lem_doc, a new Doc created from the lemmatized tokens.

spaCy allows us to parse documents in a multi-threaded manner, which makes converting documents to tokens faster.

In [7]:
docs = nlp.pipe([tweet.decode('UTF-8') for tweet in tweets['text']], batch_size=50, n_threads=4)

We will play with the tweet data later on.

### Named Entity

Another feature of spaCy is named entity recognition. The term named entity refers to a real-world object mentioned in the text, such as a person, location, or organization. For instance, in the text above, "Jack" is a person, "Nike" is a company, and "New York" is a place.

spaCy helps us identify named entities and classify them into the categories shown below.

![Named Entity Recognition](img/nea.png)

We can check the type of a named entity through its label_ attribute.

In [8]:
def extract_named_entities(doc, names=["PERSON", "NORP", "GPE", "ORG"]):
    name_entities = {}
    for name in names:
        name_entities[name] = []
    for entity in doc.ents:
        if entity.label_ in names:
            name_entities[entity.label_].append(entity.text)
    return name_entities

name_entities = extract_named_entities(doc)
print name_entities

{'PERSON': [u'Jack'], 'NORP': [], 'ORG': [u'Nike'], 'GPE': [u'New York']}


The previous function lets us extract named entities of certain types. As we can see, spaCy successfully identifies the named entities and their types, as we expected.

Since a named entity can represent a real-world object, it can be used to analyze relationships between objects in the real world, which we will see in a later part of the tutorial.
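
The grouping step inside extract_named_entities does not depend on spaCy itself, so it can be checked on hand-written data. The following is a minimal sketch under that assumption: the (text, label) pairs are typed in by hand to mirror the output above (the last pair is an invented example of a label outside the requested set), and group_by_label is a hypothetical helper name, not part of spaCy.

```python
def group_by_label(pairs, labels=("PERSON", "NORP", "GPE", "ORG")):
    """Group (text, label) pairs by label, keeping only the requested labels."""
    grouped = {label: [] for label in labels}  # every requested label gets a list
    for text, label in pairs:
        if label in grouped:
            grouped[label].append(text)
    return grouped

# Hand-written stand-ins for (entity.text, entity.label_) pairs
toy_pairs = [("Jack", "PERSON"), ("Nike", "ORG"),
             ("New York", "GPE"), ("12345", "CARDINAL")]

toy_entities = group_by_label(toy_pairs)
print(toy_entities["GPE"])
```

Labels that were requested but never seen, such as NORP here, still map to an empty list, which matches the behavior of the function above.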

### Part of Speech

Part-of-speech (POS) tagging is another natural language analysis feature offered by spaCy. It shows whether a word in a text is a noun, verb, adjective, or another type.

Again, we can access the POS information of a token through its pos_ attribute.

In [9]:
def tag_pos(doc):
    tagged_words = [(token.string, token.pos_) for token in doc]
    return tagged_words

tagged_words = tag_pos(lem_doc)
tagged_text = " ".join(word[0] + "/" + word[1] for word in tagged_words)
print tagged_text

wake /VERB jack /NOUN stop /NOUN potato /NOUN coach /NOUN good /ADJ weather /NOUN today /NOUN nike /ADJ shoe /NOUN let /VERB new /ADJ york /NOUN play /NOUN baseball/NOUN


The code above shows the part-of-speech tag of each word as a "/POS" suffix.

As before, we can define a function that extracts words with certain POS tags.

In [10]:
def extract_pos(doc, tags=["NOUN", "VERB", "ADJ", "ADV"]):
    words_pos = {}
    for tag in tags:
        words_pos[tag] = []
    tagged_words = tag_pos(doc)
    for word in tagged_words:
        if word[1] in tags:
            words_pos[word[1]].append(word[0])
    return words_pos

words_pos = extract_pos(lem_doc)
print words_pos

{'ADV': [], 'ADJ': [u'good ', u'nike ', u'new '], 'VERB': [u'wake ', u'let '], 'NOUN': [u'jack ', u'stop ', u'potato ', u'coach ', u'weather ', u'today ', u'shoe ', u'york ', u'play ', u'baseball']}


Part-of-speech tagging is an important step for a machine to understand natural language. In some cases, researchers may want to discard adverbs, which usually do not carry much information, and use only nouns to represent the document. In other cases, researchers may want to analyze adjectives to gauge the emotion of a writer. However, since an English word may be used as various parts of speech, tagging is also a hard task that still requires much future work.

We will not use the POS tagger further in this tutorial, but you can try it on your own if you are interested.
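
As a minimal sketch of the noun-only representation mentioned above, the snippet below filters a few (word, tag) pairs copied by hand from the tagged output earlier in this section. keep_tags is a hypothetical helper name, and no spaCy model is needed to run it.

```python
def keep_tags(tagged_words, tags=("NOUN",)):
    """Return the words whose POS tag is in tags, preserving their order."""
    return [word for word, tag in tagged_words if tag in tags]

# A few (word, tag) pairs copied by hand from the tagged output above
tagged_sample = [("wake", "VERB"), ("jack", "NOUN"), ("good", "ADJ"),
                 ("weather", "NOUN"), ("let", "VERB"), ("baseball", "NOUN")]

nouns_only = keep_tags(tagged_sample)
adjectives = keep_tags(tagged_sample, tags=("ADJ",))
print(nouns_only)
```

The same helper covers the adjective-based analysis by passing a different tags argument.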

### Word Vector

One remarkable feature of spaCy that differentiates it from other natural language processing libraries is that it provides word vectors. A word vector is a "word embedding": a representation of a word as a numeric vector. It is usually used to analyze similarity between words. The most famous word embedding model is word2vec. By default, spaCy uses the word vectors produced by this model.

The similarity between two words is simply measured as the cosine similarity of their vectors.
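
Written out, for two word vectors $a$ and $b$ (symbols introduced here only for illustration), the cosine similarity is

$$\text{sim}(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}$$

The value lies between -1 and 1, and two vectors pointing in the same direction score 1 regardless of their lengths.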

In [11]:
def word_similarity(word_a, word_b):
    if not (word_a.has_vector and word_b.has_vector):
        return 0
    similarity = np.dot(word_a.vector, word_b.vector) / (np.linalg.norm(word_a.vector) * np.linalg.norm(word_b.vector))
    return similarity

word_a, word_b, word_c, word_d = nlp(u'apple pineapple computer microchip')
print "apple and pineapple:", word_similarity(word_a, word_b)
print "computer and microchip:", word_similarity(word_c, word_d)
print "apple and computer:", word_similarity(word_a, word_c)
print "pineapple and computer:", word_similarity(word_b, word_c)

apple and pineapple: 0.586574
computer and microchip: 0.647215
apple and computer: 0.406346
pineapple and computer: 0.294345


We can see that "apple" and "pineapple", and "computer" and "microchip", have higher similarity than "computer" has with either of the two fruits. This makes sense according to our knowledge. It is also interesting that "apple" is more similar to "computer" than "pineapple" is. spaCy probably also takes Apple Computer into consideration when deciding those similarities!

spaCy offers a simpler way to do that.

In [12]:
similarity = word_a.similarity(word_b)

Here word_a and word_b can be either tokens or spans.

We can use this feature to search for documents related to a topic, matching not only exact words but also similar words. We will demonstrate it on the tweet data in the next section.

## Play with real cases

### Loading and Pre-processing

We now return to the tweet data from homework 3 that we loaded earlier.

In [13]:
def load_data(texts):
    # The parameter is named texts so that it does not shadow the pandas module pd.
    docs = []
    for doc in nlp.pipe([tweet.decode('UTF-8') for tweet in texts], batch_size=50, n_threads=4):
        docs.append(doc)
    return docs
docs = load_data(tweets['text'])

In [14]:
print docs[1]

RT @DWStweets: The choice for 2016 is clear: We need another Democrat in the White House. #DemDebate #WeAreDemocrats http://t.co/0n5g0YN46f


The code above prepares a list of docs, each representing a tweet. We can process those docs to generate lemmatized token lists.

In [15]:
token_lists = [process(doc) for doc in docs]
lem_lines = [u' '.join(tokens) for tokens in token_lists]
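lem_docs = []
for line in lem_lines:
    # spaCy expects unicode input: decode byte strings, pass unicode through
    if not isinstance(line, unicode):
        try:
            lem_docs.append(nlp(line.decode("UTF-8")))
        except Exception:
            # skip text that cannot be decoded
            pass
    else:
        lem_docs.append(nlp(line))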
print lem_docs[1]

rt choice clear need democrat white house demdebate wearedemocrats


Note that token_lists is a list of lemmatized token lists, like the ones we used in homework 3. Since we want to fully demonstrate the features of spaCy here, we turn the token lists back into Doc objects so that we can apply spaCy methods to them.

As we can see, the docs are now lemmatized, with stop words and non-alphabetic tokens removed. (Any text that cannot be decoded is discarded; such text is mostly non-English.)

### Related Entities

If we want to see how tightly two entities are related to each other, one easy way is to evaluate their co-occurrence in our corpus.
In this case, we will extract named entities for people and groups from the tweets and see how strongly they are related to each other.

First, we need a function that calculates the co-occurrence of different pairs of entities and the frequency of each entity.

Since a capitalized initial letter is often an important cue in named entity recognition, lemmatization will affect the recognition result. Thus, we use the original docs instead of the lemmatized docs.
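
To make the counting concrete before applying it to spaCy docs, here is a minimal sketch on hand-written entity sets. The three toy "documents" are invented for illustration, and count_co_occurrence is a hypothetical helper name; it takes ready-made sets of entity strings rather than Doc objects.

```python
from collections import Counter, defaultdict

def count_co_occurrence(entity_sets):
    """Count, per document, each entity once and each pair of distinct entities once."""
    freq = Counter()
    co_occurrence = defaultdict(Counter)
    for entities in entity_sets:
        entities = set(entities)  # an entity counts once per document
        freq.update(entities)
        for entity in entities:
            # every other entity in the same document co-occurs with this one
            co_occurrence[entity].update(entities - {entity})
    return freq, co_occurrence

# Invented entity sets standing in for three tweets
toy_sets = [{"Trump", "Hillary"}, {"Trump", "Cruz", "Hillary"}, {"Cruz"}]
toy_freq, toy_co = count_co_occurrence(toy_sets)
print(toy_co["Trump"].most_common(2))
```

Here "Trump" appears in two documents and shares both with "Hillary" but only one with "Cruz", so the co-occurrence counts for "Trump" are 2 and 1 respectively.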

In [16]:
from collections import Counter
def get_co_occurrence(docs, names):
    freq = Counter()
    co_occurrence = {}
    for doc in docs:
        named_entities = extract_named_entities(doc, names)
        entity_set = set([entity for entities in named_entities.values() for entity in entities])
        freq += Counter(entity_set)
        for entity in entity_set:
            co = entity_set.difference([entity])
            if entity in co_occurrence:
                co_occurrence[entity] += Counter(co)
            else:
                co_occurrence[entity] = Counter(co)
    return freq, co_occurrence

f, co = get_co_occurrence(docs, ['PERSON', 'NORP'])
        

In [17]:
print co['Trump'].most_common(5)

[(u'American', 28), (u'Americans', 27), (u'Hillary', 27), (u'Muslims', 11), (u'Cruz', 9)]


The result shows that the entities that appear most often with "Trump" are "American", "Americans", "Hillary", "Muslims" and "Cruz". Obviously they are closely related to Trump. Note that since we are using non-lemmatized words, "American" and "Americans" are not recognized as the same entity. This is something to improve in future work.

The method above analyzes the frequency of co-occurrence, but it has a limitation. The entities that show up most frequently tend to co-occur more with all other entities, which does not necessarily mean they are closely related. Also, the more documents we have, the higher the co-occurrence counts we will find.

Therefore, we want to take the individual frequencies of the entities and the corpus size into consideration by using an indicator called Pointwise Mutual Information (PMI).

$$PMI(x,y) = \log \frac{p(x,y)}{p(x)p(y)}$$

where $p(x,y)$ denotes the fraction of documents in which $x$ and $y$ co-occur, and $p(x)$ and $p(y)$ are the fractions of documents in which $x$ and $y$ occur, respectively.
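
As a quick sanity check of the formula, the sketch below evaluates it on made-up document counts. The counts and the helper name pmi_from_counts are assumptions for illustration only; they do not come from the tweet data.

```python
import math

def pmi_from_counts(n_xy, n_x, n_y, n_docs):
    """PMI = log(p(x,y) / (p(x) * p(y))), with each p estimated as count / n_docs."""
    p_xy = n_xy / float(n_docs)
    p_x = n_x / float(n_docs)
    p_y = n_y / float(n_docs)
    return math.log(p_xy / (p_x * p_y))

# Hypothetical corpus of 1000 documents; x and y each appear in 100 of them.
# 10 joint documents is exactly what independence predicts (0.1 * 0.1 * 1000).
pmi_independent = pmi_from_counts(n_xy=10, n_x=100, n_y=100, n_docs=1000)
# 50 joint documents is five times more than independence predicts.
pmi_associated = pmi_from_counts(n_xy=50, n_x=100, n_y=100, n_docs=1000)
print("independent: %.3f, associated: %.3f" % (pmi_independent, pmi_associated))
```

With these counts the first pair scores approximately 0 and the second scores log 5, about 1.61, so a positive PMI indicates entities that appear together more often than chance would suggest.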

Using the formula, we can find the PMI value of entity pairs.

In [18]:
import math
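def calculate_PMI(x, y, docs, names):
    freq, co_occurrence = get_co_occurrence(docs, names)
    if freq[x] == 0 or freq[y] == 0:
        return 0
    co = co_occurrence[x][y]
    if co == 0:
        # x and y never co-occur: log(0) is undefined, so return negative infinity
        return float('-inf')
    f_x = freq[x]
    f_y = freq[y]
    # log(p(x,y) / (p(x) * p(y))) with each p estimated as count / len(docs)
    pmi = math.log(co) + math.log(len(docs)) - math.log(f_x) - math.log(f_y)
    return pmi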
pmi = calculate_PMI("Trump", "American", docs, ['PERSON', 'NORP'])
print pmi

0.815133888394


A big limitation of PMI is that its range is unbounded. Therefore, we do not know how large a PMI value should be to indicate a strong relationship. Another indicator, called Phi-Square, limits the range to [0, 1], which makes the number easier to interpret.

However, it is hard to say which indicator performs best. Each has its advantages on certain data; everything is corpus dependent. We do not have the chance to compare Phi-Square and PMI here, but you can read more at https://en.wikipedia.org/wiki/Phi_coefficient

Such analysis is often referred to as concept co-occurrence analysis.
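
For readers who want to try the Phi-Square indicator mentioned above, here is a minimal sketch that squares the phi coefficient of the 2x2 table of document counts (x present or absent versus y present or absent). The helper name phi_square and the counts are assumptions for illustration; this is not code from the original analysis.

```python
import math

def phi_square(n_xy, n_x, n_y, n_docs):
    """Squared phi coefficient of the 2x2 presence/absence table for x and y."""
    n11 = n_xy                        # documents with both x and y
    n10 = n_x - n_xy                  # x only
    n01 = n_y - n_xy                  # y only
    n00 = n_docs - n_x - n_y + n_xy   # neither
    denom = n_x * (n_docs - n_x) * n_y * (n_docs - n_y)
    if denom == 0:
        # an entity in every document or in none carries no association signal
        return 0.0
    phi = (n11 * n00 - n10 * n01) / math.sqrt(denom)
    return phi * phi

# Hypothetical counts in a corpus of 1000 documents
phi_independent = phi_square(n_xy=10, n_x=100, n_y=100, n_docs=1000)
phi_perfect = phi_square(n_xy=100, n_x=100, n_y=100, n_docs=1000)
print("independent: %.3f, perfect: %.3f" % (phi_independent, phi_perfect))
```

With these made-up counts, the pair that co-occurs exactly as often as chance predicts scores 0, and the pair that always appears together scores 1, which illustrates the bounded range.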

### Search Tweet

As mentioned before, one way to decide whether a tweet is relevant to a certain topic is to check whether it contains any relevant words or phrases. We can use word vector similarity to perform the check.

In [19]:
def is_relevant(tweet, topic):
    for tweet_token in tweet:
        for topic_token in topic:
            if word_similarity(tweet_token, topic_token) > 0.5:
                return True
    return False


In [20]:
def find_relevant(docs,topic):
    relevant_doc = []
    for doc in docs:
        if is_relevant(doc, topic):
            relevant_doc.append(doc)
    return relevant_doc

money = nlp(u'money')
print find_relevant(lem_docs, money)[0]

clinton take campaign cash drug industry enemy demdebate


We can see that we find a tweet related to "money" even though the exact word "money" does not appear in it. It contains the word "cash", which is highly similar to "money".

This can be regarded as an example of fuzzy search.
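
The same thresholding idea can be sketched without a language model. In the snippet below the two-dimensional vectors are invented for illustration (real word vectors have hundreds of dimensions), and cosine, best_match and is_relevant_toy are hypothetical helper names. Returning the best-scoring word alongside the decision makes it easier to see why a tweet matched.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Invented 2-d vectors, for illustration only
TOY_VECTORS = {
    "money": (1.0, 0.0),
    "cash": (0.9, 0.1),
    "weather": (0.0, 1.0),
    "baseball": (0.1, 0.9),
}

def best_match(words, topic, vectors=TOY_VECTORS):
    """Return (score, word) for the word most similar to topic; skip words without a vector."""
    scored = [(cosine(vectors[w], vectors[topic]), w) for w in words if w in vectors]
    return max(scored) if scored else (0.0, None)

def is_relevant_toy(words, topic, threshold=0.5):
    """A tweet is relevant if its best-matching word scores above the threshold."""
    return best_match(words, topic)[0] > threshold

tweet_a = ["clinton", "take", "campaign", "cash"]
tweet_b = ["good", "weather", "baseball"]
print(is_relevant_toy(tweet_a, "money"), is_relevant_toy(tweet_b, "money"))
```

The threshold of 0.5 mirrors the one used in is_relevant above; raising it makes the search stricter, and lowering it admits looser matches.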

## Conclusion

spaCy is a strong, industrial-level natural language library with well-trained models that can boost our analysis tasks. It provides features such as tokenization, named entity recognition, part-of-speech tagging and word vectors that are very helpful when dealing with natural language.

In this tutorial, we walked through these basic functions of spaCy and used them in two real-world cases commonly seen in NLP problems: concept co-occurrence and fuzzy search. Of course, these are simplified examples. Real cases can be much more complicated, but you can regard this as a start and keep digging for more interesting problems.

spaCy also has more advanced functions, such as custom models and taggers. If you are interested, there is much more to learn and practice at https://spacy.io/.