# Natural Language Processing

This notebook is intended as a compilation of different tests and proofs of concept for possible new projects in Natural Language Processing (NLP).

## Concepts

- **Part-of-Speech (PoS) tagging:** Assigns a grammatical tag (noun, verb, adjective, etc.) to every word of a text (the tokens of a doc).
- **Named Entity Recognition (NER):** Classifies spans (sequences of tokens) according to what they refer to (e.g. a person, corporation, date, etc.).
- **Word Embeddings:** The process of converting tokens into mathematical representations that can be fed to the chosen Deep Learning (DL) architecture. There are several types:
    - **One-Hot encoding**: Every word in a vocabulary is translated into a one-hot encoded vector: a sparse vector the size of the vocabulary, full of zeros, with a single one at the position of the word in the vocabulary. This is a valid way to encode words, but it naïvely discards all semantic information, because every pair of distinct words ends up equally far apart.
    - **Dense word embeddings**: Words are encoded as non-sparse vectors whose values place the words in a space where distance reflects closeness in meaning.
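As a minimal sketch of the one-hot idea described in the list above, the following cell builds one-hot vectors with NumPy. The four-word vocabulary and the `one_hot` helper are hypothetical and only serve the illustration; a real vocabulary would hold many thousands of words.

```python
import numpy as np

# Hypothetical toy vocabulary (assumption, not taken from any model).
vocab = ["animal", "banana", "dog", "fruit"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of vocabulary size with a single 1 at the word's index."""
    vector = np.zeros(len(vocab), dtype=int)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("dog"))  # [0 0 1 0]
```

The vector length grows with the vocabulary, which is why this encoding becomes very sparse for realistic vocabularies.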

In [42]:
import numpy as np
import spacy
from spacy import displacy
from spacy.pipeline import TextCategorizer
from spacy.pipeline import Pipe

from scipy import spatial

## spaCy

Testing basic behaviour of the [spaCy library](https://spacy.io/).
First of all, the pretrained models used throughout the notebook need to be loaded.

In [5]:
nlp_en = spacy.load('en_vectors_web_lg') #spacy.load('en')
nlp_es = spacy.load('es_core_news_md') #spacy.load('es')

In [2]:
# pip install spacy
# python -m spacy download en_core_web_sm

import spacy

# Load English tokenizer, tagger, parser, NER and word vectors
#nlp_en = spacy.load('en_core_web_sm')

# Process whole documents
text = (u"When Sebastian Thrun started working on self-driving cars at "
        u"Google in 2007, few people outside of the company took him "
        u"seriously. “I can tell you very senior CEOs of major American "
        u"car companies would shake my hand and turn away because I wasn’t "
        u"worth talking to,” said Thrun, now the co-founder and CEO of "
        u"online higher education startup Udacity, in an interview with "
        u"Recode earlier this week.")
doc = nlp_en(text)  # use the English pipeline loaded above

# Find named entities, phrases and concepts
# (doc.ents is only populated if the loaded pipeline includes an NER component)
for entity in doc.ents:
    print(entity.text, entity.label_)

# Determine semantic similarities
doc1 = nlp_en(u"my fries were super gross")
doc2 = nlp_en(u"such disgusting fries")
similarity = doc1.similarity(doc2)
print(doc1.text, doc2.text, similarity)

Here, a short text is passed through the Spanish pretrained model, resulting in the *doc* object, which stores all the useful information the model extracts from the text. Several attributes can be read from this object, such as the detected tokens (including punctuation marks), the coarse-grained part-of-speech tag of each token and its syntactic dependency relation.

In [5]:
text = (u"Con cien cañones por banda "
        u"viento en popa a toda vela "
        u"no corta el mar, sino vuela "
        u"un velero bergantín.")
doc = nlp_es(text)

print(doc.text)
for token in doc:
    print(token.text, token.pos_, token.dep_)
    
#doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')
 
for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))

Con cien cañones por banda viento en popa a toda vela no corta el mar, sino vuela un velero bergantín.
Con ADP case
cien NUM nummod
cañones NOUN obl
por ADP case
banda NOUN nmod
viento NOUN amod
en ADP case
popa NOUN nmod
a ADP case
toda DET det
vela NOUN nmod
no ADV advmod
corta VERB amod
el DET det
mar NOUN nsubj
, PUNCT punct
sino CONJ cc
vuela VERB ROOT
un DET det
velero NOUN amod
bergantín ADJ obj
. PUNCT punct
Con/ADP__AdpType=Prep <--case-- cañones/NOUN__Gender=Masc|Number=Plur
cien/NUM__Number=Plur|NumType=Card <--nummod-- cañones/NOUN__Gender=Masc|Number=Plur
cañones/NOUN__Gender=Masc|Number=Plur <--obl-- vuela/VERB__Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
por/ADP__AdpType=Prep <--case-- banda/NOUN__Gender=Fem|Number=Sing
banda/NOUN__Gender=Fem|Number=Sing <--nmod-- cañones/NOUN__Gender=Masc|Number=Plur
viento/NOUN__Gender=Masc|Number=Sing <--amod-- cañones/NOUN__Gender=Masc|Number=Plur
en/ADP__AdpType=Prep <--case-- popa/NOUN__Gender=Fem|Number=Sing
popa/NOUN__G
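Each line of the second printout reads as `token/tag <--dependency-- head/tag`, so the output already encodes a tree. As a small sketch, the first seven relations shown above can be regrouped by head to see which tokens hang from which word. The triples are copied by hand from that output (tags omitted); the `children` mapping is just an illustrative name.

```python
from collections import defaultdict

# (token, dependency label, head) triples copied from the first lines of the output above.
edges = [
    ("Con", "case", "cañones"),
    ("cien", "nummod", "cañones"),
    ("cañones", "obl", "vuela"),
    ("por", "case", "banda"),
    ("banda", "nmod", "cañones"),
    ("viento", "amod", "cañones"),
    ("en", "case", "popa"),
]

# Group every token under the head it depends on.
children = defaultdict(list)
for token, dep, head in edges:
    children[head].append((token, dep))

for head, deps in children.items():
    print(head, "->", deps)
```

With a live doc, the same grouping could presumably be read directly from `token.head` and `token.dep_`, which are the attributes the cell above prints.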

Some of this information can also be visualized graphically, for example the dependency relations between the tokens of the text.

In [28]:
displacy.render(doc, style='dep', jupyter=True)

### Named Entity Recognition (NER)
The NER module assigns an entity label to spans of one or more tokens. The results are not very robust in English, and performance is worse in Spanish.

In [24]:
ttt = nlp_en("I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")
aaa = nlp_es("He comprado 2 acciones a las 9 a.m. porque subieron un 30% en dos días según el NYT")
for ent in ttt.ents:
    print(ent.text, ent.label_)

2 CARDINAL
9 a.m. TIME
30% PERCENT
just 2 days DATE
WSJ ORG


In [25]:
displacy.render(ttt, style='ent', jupyter=True)

### Chunking
The next cell splits a sentence into noun chunks: flat noun phrases (with no verbs in them), each built around a root noun.

In [3]:
doc_en = nlp_en("Wall Street Journal just published an interesting piece on crypto currencies")
doc_es = nlp_es("El Wall Street Journal ha publicado un interesante articulo sobre criptodivisas")
# noun_chunks relies on the dependency parse; the Spanish doc is the one iterated here
for chunk in doc_es.noun_chunks:
    print(chunk.text, chunk.label_, chunk.root.text)

NameError: name 'np_left_deps' is not defined

In [14]:
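texts = [u'One doc', u'...', u'Lots of docs']
# Use the vocab of the English pipeline loaded above
textcat = TextCategorizer(nlp_en.vocab)
# Note: this categorizer has not been trained or loaded from disk,
# which is the likely cause of the TypeError below
for doc in textcat.pipe(texts, batch_size=50):
    pass
scores = textcat.predict([doc1, doc2])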


TypeError: 'bool' object is not callable

### Word Vectors
Feeding text (as characters, words, sentences, or even entire documents) to a DL architecture requires several preprocessing steps, including word vectorization. Several approaches can be taken, such as:
* **One-Hot encoding**: Every word in a vocabulary is translated into a one-hot encoded vector: a sparse vector the size of the vocabulary, full of zeros, with a single one at the position of the word in the vocabulary. This is a valid way to encode words, but it naïvely discards all semantic information from the vocabulary.
* **Word Embeddings**: Here, words are encoded as non-sparse vectors whose values place the words in a space where distance reflects closeness in meaning.
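The difference between the two approaches can be sketched numerically. Under one-hot encoding, any two distinct words are orthogonal, so their cosine similarity is always zero, whereas dense vectors can place related words close together. The dense values below are hand-picked for illustration (an assumption, not the output of any trained model), and `cosine` is a local helper.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# One-hot vectors for a three-word toy vocabulary.
one_hot = {
    "dog": np.array([1.0, 0.0, 0.0]),
    "animal": np.array([0.0, 1.0, 0.0]),
    "banana": np.array([0.0, 0.0, 1.0]),
}

# Hand-picked dense vectors (illustrative only, not from a trained model).
dense = {
    "dog": np.array([0.9, 0.1]),
    "animal": np.array([0.8, 0.3]),
    "banana": np.array([0.1, 0.9]),
}

# One-hot: every pair of distinct words has similarity 0.
print(cosine(one_hot["dog"], one_hot["animal"]), cosine(one_hot["dog"], one_hot["banana"]))

# Dense: "dog" is closer to "animal" than to "banana".
print(cosine(dense["dog"], dense["animal"]), cosine(dense["dog"], dense["banana"]))
```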

In [6]:
print("{0}, Shape: {1}".format(nlp_en.vocab['banana'].vector, 
                               nlp_en.vocab['banana'].vector.shape))

[ 2.0228e-01 -7.6618e-02  3.7032e-01  3.2845e-02 -4.1957e-01  7.2069e-02
 -3.7476e-01  5.7460e-02 -1.2401e-02  5.2949e-01 -5.2380e-01 -1.9771e-01
 -3.4147e-01  5.3317e-01 -2.5331e-02  1.7380e-01  1.6772e-01  8.3984e-01
  5.5107e-02  1.0547e-01  3.7872e-01  2.4275e-01  1.4745e-02  5.5951e-01
  1.2521e-01 -6.7596e-01  3.5842e-01 -4.0028e-02  9.5949e-02 -5.0690e-01
 -8.5318e-02  1.7980e-01  3.3867e-01  1.3230e-01  3.1021e-01  2.1878e-01
  1.6853e-01  1.9874e-01 -5.7385e-01 -1.0649e-01  2.6669e-01  1.2838e-01
 -1.2803e-01 -1.3284e-01  1.2657e-01  8.6723e-01  9.6721e-02  4.8306e-01
  2.1271e-01 -5.4990e-02 -8.2425e-02  2.2408e-01  2.3975e-01 -6.2260e-02
  6.2194e-01 -5.9900e-01  4.3201e-01  2.8143e-01  3.3842e-02 -4.8815e-01
 -2.1359e-01  2.7401e-01  2.4095e-01  4.5950e-01 -1.8605e-01 -1.0497e+00
 -9.7305e-02 -1.8908e-01 -7.0929e-01  4.0195e-01 -1.8768e-01  5.1687e-01
  1.2520e-01  8.4150e-01  1.2097e-01  8.8239e-02 -2.9196e-02  1.2151e-03
  5.6825e-02 -2.7421e-01  2.5564e-01  6.9793e-02 -2

In [9]:
print("{0}, Shape: {1}".format(nlp_es.vocab['banana'].vector, 
                               nlp_es.vocab['banana'].vector.shape))

[ 0.643236 -0.330747  1.100676 -0.113157 -0.358365  1.452562 -0.207473
 -0.875244 -0.494573  0.210413  0.399065  0.245114 -0.032114 -0.635198
  1.629397 -0.218225  0.167917 -0.26751  -1.05198   0.516364  0.845993
 -1.215206  0.283176  0.597703 -0.782784 -0.588792 -1.486947 -0.405671
  0.763754  0.670085  0.293811 -0.051031 -0.59058  -0.337708  0.032161
  0.71774  -0.680784  0.065881 -1.197578  0.043932 -0.175394 -0.300493
  0.07077   0.05357   0.684    -0.592457 -0.092534 -1.039084 -0.957413
 -1.638385], Shape: (50,)


In [43]:
cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)
 
man = nlp_en.vocab['man'].vector
woman = nlp_en.vocab['woman'].vector
queen = nlp_en.vocab['queen'].vector
king = nlp_en.vocab['king'].vector
 
# We now need to find the closest vector in the vocabulary to the result of "man" - "woman" + "queen"
maybe_king = man - woman + queen
computed_similarities = []
 
for word in nlp_en.vocab:
    # Ignore words without vectors
    if not word.has_vector:
        continue
 
    similarity = cosine_similarity(maybe_king, word.vector)
    computed_similarities.append((word, similarity))
 
computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])
print([w[0].text for w in computed_similarities[:10]])



['Minkoff', '30.93', 'Blunt.', 'CAFFIAUX', 'Bamboo', 'DECEASED', 'gorey', 'Ginjo', 'Sibs', 'croup']
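The loop above scores one vocabulary entry at a time. As a sketch of the same "man" - "woman" + "queen" ranking done in a single matrix operation, the following cell uses a tiny hand-made vocabulary. The three-dimensional vectors are invented for the illustration (roughly: royalty, female, person) and are not spaCy vectors, so the result says nothing about the output above.

```python
import numpy as np

# Toy vocabulary with invented vectors (assumption: dimensions ~ royalty, female, person).
words = ["king", "queen", "man", "woman", "apple"]
vectors = np.array([
    [1.0, 0.0, 1.0],   # king
    [1.0, 1.0, 1.0],   # queen
    [0.0, 0.0, 1.0],   # man
    [0.0, 1.0, 1.0],   # woman
    [-1.0, 0.5, 0.0],  # apple
])
index = {word: i for i, word in enumerate(words)}

# Same arithmetic as in the cell above: "man" - "woman" + "queen".
target = vectors[index["man"]] - vectors[index["woman"]] + vectors[index["queen"]]

# Normalise the rows and the target so that a dot product equals cosine similarity.
unit_rows = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
unit_target = target / np.linalg.norm(target)
scores = unit_rows @ unit_target

# Sort the words from most to least similar.
ranking = [words[i] for i in np.argsort(-scores)]
print(ranking)  # ['king', 'queen', 'man', 'woman', 'apple']
```

Note that the input words themselves appear in the ranking; analogy code usually filters them out before picking the best match.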


In [37]:
# Another version
def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v.

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v,
                             i.e. dot(u, v) / (||u|| * ||v||)
    """
    # Dot product between u and v
    dot = u @ v  # np.dot(u, v)
    # L2 norms of u and v
    norm_u = np.linalg.norm(u)  # np.sqrt(np.sum(np.power(u, 2)))
    norm_v = np.linalg.norm(v)
    # Cosine similarity: dot product divided by the product of the norms
    return dot / (norm_u * norm_v)

In [38]:
def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """
    Performs the word analogy task: a is to b as c is to ____.

    Arguments:
    word_a -- a word, string
    word_b -- a word, string
    word_c -- a word, string
    word_to_vec_map -- dictionary that maps words to their corresponding vectors.

    Returns:
    best_word -- the word such that v_b - v_a is close to v_best_word - v_c, as measured by cosine similarity
    """
    # Convert words to lower case
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()

    # Get the word embeddings of the three input words
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]

    max_cosine_sim = -100  # Initialize to a large negative number
    best_word = None       # Keeps track of the best candidate found so far

    # Loop over the whole word vector set
    for w, e_w in word_to_vec_map.items():
        # Skip the input words so that best_word is never one of them
        if w in (word_a, word_b, word_c):
            continue

        # Cosine similarity between (e_b - e_a) and (e_w - e_c)
        cosine_sim = cosine_similarity(e_b - e_a, e_w - e_c)

        # Keep the word with the highest similarity seen so far
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w

    return best_word
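The analogy function is never called in the notebook, so here is a hedged usage sketch. To keep the cell runnable on its own, it restates the same logic compactly under the hypothetical names `cosine` and `analogy_sketch`, and it uses an invented `toy_vectors` dictionary in place of a real word-to-vector map.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy_sketch(word_a, word_b, word_c, word_to_vec_map):
    """Compact restatement of the analogy search: a is to b as c is to ____."""
    e_a, e_b, e_c = (word_to_vec_map[w] for w in (word_a, word_b, word_c))
    # The input words are excluded from the candidates.
    candidates = [w for w in word_to_vec_map if w not in (word_a, word_b, word_c)]
    # Pick the word whose offset from c best matches the offset from a to b.
    return max(candidates, key=lambda w: cosine(e_b - e_a, word_to_vec_map[w] - e_c))

# Invented vectors (assumption: dimensions ~ royalty, female, person).
toy_vectors = {
    "man": np.array([0.0, 0.0, 1.0]),
    "woman": np.array([0.0, 1.0, 1.0]),
    "king": np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([-1.0, 0.5, 0.0]),
}

print(analogy_sketch("man", "woman", "king", toy_vectors))  # queen
```

With real embeddings the same call would need a dictionary built from the model's vocabulary, and the full scan over that dictionary makes each query linear in the vocabulary size.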

### Computing Similarity
The example above shows a way to use word embeddings to find similarities between words. Let's use spaCy's built-in `similarity` method for that.

In [12]:
banana = nlp_en.vocab['banana']
dog = nlp_en.vocab['dog']
fruit = nlp_en.vocab['fruit']
animal = nlp_en.vocab['animal']
 
print(dog.similarity(animal), dog.similarity(fruit)) # 0.6618534 0.23552845
print(banana.similarity(fruit), banana.similarity(animal)) # 0.67148364 0.2427285

0.66185343 0.23552854
0.67148346 0.24272855


It's also possible to compare entire docs.

In [13]:
target = nlp_en("Cats are beautiful animals.")
 
doc1 = nlp_en("Dogs are awesome.")
doc2 = nlp_en("Some gorgeous creatures are felines.")
doc3 = nlp_en("Dolphins are swimming mammals.")
doc4 = nlp_en("I like trains")
 
print(target.similarity(doc1))  # 0.8901765218466683
print(target.similarity(doc2))  # 0.9115828449161616
print(target.similarity(doc3))  # 0.7822956752876101
print(target.similarity(doc4))  # 0.5720727309259485
print(target.similarity(target))  # 1.0

0.8901764174818698
0.911583014469743
0.7822954272178668
0.5720727309259485
1.0
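According to the spaCy documentation, `Doc.vector` defaults to the average of the token vectors and `Doc.similarity` defaults to the cosine similarity of those averages. The sketch below reproduces that idea with NumPy on invented two-dimensional token vectors; `toy_vectors`, `doc_vector` and `doc_similarity` are hypothetical names, and whitespace-free token lists stand in for real tokenization.

```python
import numpy as np

# Invented token vectors (assumption, not taken from any spaCy model).
toy_vectors = {
    "cats": np.array([0.9, 0.1]),
    "dogs": np.array([0.8, 0.2]),
    "animals": np.array([0.7, 0.3]),
    "trains": np.array([0.1, 0.9]),
    "are": np.array([0.5, 0.5]),
}

def doc_vector(tokens):
    """Average the vectors of all tokens in the document."""
    return np.mean([toy_vectors[token] for token in tokens], axis=0)

def doc_similarity(tokens_a, tokens_b):
    """Cosine similarity between the two averaged document vectors."""
    a, b = doc_vector(tokens_a), doc_vector(tokens_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cats_doc = ["cats", "are", "animals"]
dogs_doc = ["dogs", "are", "animals"]
trains_doc = ["trains", "are"]

print(doc_similarity(cats_doc, dogs_doc))    # related docs score close to 1
print(doc_similarity(cats_doc, trains_doc))  # unrelated docs score lower, but well above 0
print(doc_similarity(cats_doc, cats_doc))    # a doc compared with itself scores 1 (up to rounding)
```

Because shared, frequent words pull every average toward the same region, even unrelated documents keep a fairly high score in this sketch, which may be part of why the unrelated sentence above still scores well above zero.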


In [14]:
target = nlp_es("Los gatos son unos animales preciosos.")
 
doc1 = nlp_es("Los perros son fantásticos.")
doc2 = nlp_es("Los felinos son criaturas magníficas.")
doc3 = nlp_es("Los delfines son mamíferos marinos.")
doc4 = nlp_es("Las próximas elecciones se plantean en un escenario de incertidumbre política")
 
print(target.similarity(doc1))  # 0.976771864203624
print(target.similarity(doc2))  # 0.9310405427465178
print(target.similarity(doc3))  # 0.9344633965264562
print(target.similarity(doc4))  # 0.8131771049915301
print(target.similarity(target))  # 1.0

0.976771864203624
0.9310405427465178
0.9344633965264562
0.8131771049915301
1.0


In [24]:
text = "Quiero ejecutar el job framework-online en Jenkins"
#text = "Jenkins is a scheduling tool"
doc = nlp_es(text)

for token in doc:
    print(token.text, token.pos_, token.dep_)

Quiero VERB ROOT
ejecutar VERB xcomp
el DET det
job NOUN obj
framework NOUN amod
- PUNCT punct
online NOUN appos
en ADP case
Jenkins PROPN nmod


In [32]:
spacy.explain("nmod")

'modifier of nominal'

In [36]:
displacy.render(doc, style='dep', jupyter=True)