# Natural Language Processing: Context-aware Tasks

Hi everyone! Today, we're continuing with NLP, specifically looking at context-aware problems, features that we can extract at the document level, and common context-aware tasks such as rule-based sentiment analysis.

We will be using SpaCy's `en_core_web_lg` model extensively for many of the context-aware tasks today. This model was pre-trained on several corpora such as OntoNotes 5, GloVe Common Crawl, and others ([source](https://spacy.io/models/en)). It comes with pipeline components such as a tokenizer, a POS tagger, a dependency parser, a lemmatizer, and a named entity recognizer. It also comes with pre-trained word vectors, from which document embeddings are derived.

In [1]:
from IPython.display import Image
Image(filename='spacy_pipeline.png')

FileNotFoundError: [Errno 2] No such file or directory: 'spacy_pipeline.png'

By default, the SpaCy pipeline's preprocessing steps include tokenization, POS tagging, dependency parsing, and named entity recognition only. However, users have the option to use the other features such as lemmatization, embedding, and other self-defined steps. 

In [6]:
# Run this cell only if you do not have SpaCy and the en_core_web_lg model installed on your device yet
!pip install -U spacy
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.1.0/en_core_web_lg-3.1.0-py3-none-any.whl (777.1 MB)
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.1.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [48]:
import spacy

# Confirm that the installed model loads
spacy.load('en_core_web_lg')

<spacy.lang.en.English at 0x1ae7bf100>

### Parts-of-speech (POS) tagging and dependency parsing

In [3]:
import pandas as pd

In [7]:
# Import spacy and the en_core_web_lg model
import spacy
from spacy import displacy
try:
    nlp = spacy.load("en_core_web_lg")
except OSError:
    # spacy.load raises OSError when the model package is not installed
    print("Error loading 'en_core_web_lg' model.")

In [13]:
"""
Use the en_core_web_lg model to pre-process our text
SpaCy's pre-trained model automatically determines various linguistic properties
such as POS tags and dependency trees
"""

doc = nlp("The Philippine flight was delayed due to trouble with the airplane.")

In [15]:
# Visualize every "token" in the document
tokens = pd.DataFrame(columns=
                      ['text', 'lemma', 'pos', 'tag',
                       'dependency', 'shape', 'is_alphabet',
                       'is_stopword', 'head_text', 'head_pos'])

for token in doc:
    data = [token.text, token.lemma_, token.pos_,
            token.tag_, token.dep_, token.shape_,
            token.is_alpha, token.is_stop,
            token.head.text, token.head.pos_]
    tokens.loc[len(tokens)] = data
tokens

Unnamed: 0,text,lemma,pos,tag,dependency,shape,is_alphabet,is_stopword,head_text,head_pos
0,The,the,DET,DT,det,Xxx,True,True,flight,NOUN
1,Philippine,philippine,ADJ,JJ,amod,Xxxxx,True,False,flight,NOUN
2,flight,flight,NOUN,NN,nsubjpass,xxxx,True,False,delayed,VERB
3,was,be,AUX,VBD,auxpass,xxx,True,True,delayed,VERB
4,delayed,delay,VERB,VBN,ROOT,xxxx,True,False,delayed,VERB
5,due,due,ADP,IN,prep,xxx,True,True,delayed,VERB
6,to,to,ADP,IN,pcomp,xx,True,True,due,ADP
7,trouble,trouble,NOUN,NN,pobj,xxxx,True,False,to,ADP
8,with,with,ADP,IN,prep,xxxx,True,True,trouble,NOUN
9,the,the,DET,DT,det,xxx,True,True,airplane,NOUN


In [16]:
displacy.render(doc, style='dep', jupyter=True)

### Named entity recognition

In [17]:
"""
Use the en_core_web_lg model to pre-process our text
SpaCy's pre-trained model also identifies named entities in the text.
"""

doc = nlp("The Philippine flight was delayed yesterday.")

In [18]:
doc.ents

(Philippine, yesterday)

In [19]:
displacy.render(doc, style='ent', jupyter=True)

### Rule-based sentiment analysis

For sentiment analysis, we will be using the VADER model ([source](https://github.com/cjhutto/vaderSentiment)). VADER is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. 

We will extend SpaCy's `Doc` object with a custom attribute that uses the VADER model to calculate the sentiment polarity of a text.

In [20]:
# Run this cell only if you do not have VADER installed on your device yet
!pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [21]:
# Import VADER model
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

# Import Doc for extending the SpaCy pipeline
from spacy.tokens import Doc

In [22]:
# Define sentiment analysis extensions
"""
You can define the extension as a regular Python function:
def sentiment_analysis(doc):
    return sia.polarity_scores(doc.text)
"""
# Or you can create an anonymous lambda function
sentiment_analysis = lambda doc: sia.polarity_scores(doc.text)

In [23]:
# Instantiate NLP pipeline and set extensions
nlp_with_sentiment = spacy.load("en_core_web_lg")
Doc.set_extension("sentiment", getter=sentiment_analysis, force=True)

In [24]:
doc = nlp_with_sentiment("I love this school, but it's tiring sometimes!")
doc._.sentiment # Access custom properties using the "._." operator

{'neg': 0.0, 'neu': 0.708, 'pos': 0.292, 'compound': 0.4389}
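
The `compound` value is VADER's normalized overall score, ranging from -1 (most negative) to +1 (most positive). Here is a minimal sketch of turning it into a coarse label, assuming the cut-offs suggested in the VADER README (at least 0.05 for positive, at most -0.05 for negative, neutral otherwise). The helper name `label_sentiment` is hypothetical and not part of VADER or SpaCy.

```python
def label_sentiment(scores, threshold=0.05):
    """Map a VADER polarity dict to a coarse label (hypothetical helper)."""
    compound = scores['compound']
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Scores copied from the cell output above
example_scores = {'neg': 0.0, 'neu': 0.708, 'pos': 0.292, 'compound': 0.4389}
print(label_sentiment(example_scores))  # positive
```

The threshold is a convention rather than a rule, so it may need tuning for your own data.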

### Process multiple documents

SpaCy pipelines are optimized for processing many texts at once through the `nlp.pipe()` method, which streams the texts and processes them in batches.

In [25]:
df = pd.DataFrame({
    "text": [
        "I love this school, but it's tiring sometimes!",
        "I'm looking forward to Christmas break.",
        "I'm sad that midterms are over.",
        "School is almost over.",
        "I'll be relaxing during the Christmas break."
    ]
})

In [27]:
df

Unnamed: 0,text
0,"I love this school, but it's tiring sometimes!"
1,I'm looking forward to Christmas break.
2,I'm sad that midterms are over.
3,School is almost over.
4,I'll be relaxing during the Christmas break.


In [28]:
# Use the nlp.pipe() method to process a collection of texts
docs = nlp_with_sentiment.pipe(df['text'])

In [29]:
# Build one row per document, then create the DataFrame in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for doc in docs:
    sentiment = doc._.sentiment  # Compute the VADER scores once per document
    rows.append({
        'text': doc.text,
        'document': doc,
        'com_sent': sentiment['compound'],
        'pos_sent': sentiment['pos'],
        'neg_sent': sentiment['neg'],
        'neu_sent': sentiment['neu']
    })
df_modified = pd.DataFrame(rows, columns=['text', 'document', 'com_sent', 'pos_sent', 'neg_sent', 'neu_sent'])
df_modified

Unnamed: 0,text,document,com_sent,pos_sent,neg_sent,neu_sent
0,"I love this school, but it's tiring sometimes!","(I, love, this, school, ,, but, it, 's, tiring...",0.4389,0.292,0.0,0.708
1,I'm looking forward to Christmas break.,"(I, 'm, looking, forward, to, Christmas, break...",0.0,0.0,0.0,1.0
2,I'm sad that midterms are over.,"(I, 'm, sad, that, midterms, are, over, .)",-0.4767,0.0,0.383,0.617
3,School is almost over.,"(School, is, almost, over, .)",0.0,0.0,0.0,1.0
4,I'll be relaxing during the Christmas break.,"(I, 'll, be, relaxing, during, the, Christmas,...",0.4939,0.348,0.0,0.652


### Document embeddings

SpaCy's `en_core_web_lg` model also ships with pre-trained word vectors, and a document's embedding (`doc.vector`) defaults to the average of its token vectors. We can therefore retrieve an embedding for any processed text directly.
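
To make the idea of a document embedding concrete, here is a small sketch of averaging word vectors dimension by dimension, which is how SpaCy computes `doc.vector` by default. The three-dimensional vectors and the helper `average_vector` are made up for illustration; the real model uses 300 dimensions and its own learned values.

```python
# Toy 3-dimensional "word vectors" (made-up values for illustration only)
token_vectors = {
    "school": [0.5, 1.0, -1.0],
    "is":     [0.0, 0.5,  0.5],
    "over":   [1.0, 0.0,  2.0],
}

def average_vector(tokens, vectors):
    """Average the vectors of the given tokens, dimension by dimension."""
    rows = [vectors[token] for token in tokens]
    return [sum(column) / len(rows) for column in zip(*rows)]

doc_vector = average_vector(["school", "is", "over"], token_vectors)
print(doc_vector)  # [0.5, 0.5, 0.5]
```

Because the result is an average, word order is lost: two texts containing the same words in a different order get the same embedding.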

In [30]:
doc

I'll be relaxing during the Christmas break.

In [31]:
doc.vector.shape # The embedding has 300 dimensions

(300,)

In [34]:
doc.vector # This is the embedding vector

array([ 1.69008091e-01,  2.25695238e-01, -1.68768674e-01, -1.36101559e-01,
        9.52071026e-02, -1.27706647e-01,  6.65480122e-02, -7.65754357e-02,
        4.17082235e-02,  2.11479855e+00, -3.06387693e-01,  1.54878963e-02,
        1.47993699e-01, -7.26084504e-03,  9.56096649e-02, -1.20751448e-01,
       -9.59463567e-02,  1.24279106e+00, -1.21257037e-01,  3.46791223e-02,
       -2.60731522e-02,  1.58247799e-02, -9.91496742e-02, -1.62487905e-02,
       -1.03149183e-01,  9.22244191e-02, -1.16027325e-01, -1.52984649e-01,
       -5.60563465e-04, -3.21794413e-02, -1.47145286e-01, -2.52799004e-01,
       -1.57193437e-01,  1.14202991e-01,  1.21319994e-01, -8.64599831e-03,
        2.11264670e-01,  4.57501560e-02, -1.73189901e-02, -4.68007624e-02,
       -5.22216633e-02,  2.49998440e-04, -3.89611162e-02,  2.48558004e-04,
       -2.88100052e-03,  1.76195100e-01, -1.56749442e-01, -1.43664241e-01,
       -4.12045531e-02,  4.65430021e-02, -4.52112257e-02, -1.05004005e-01,
        1.81275219e-01, -

In the next cells, we embed several documents and collect their embedding vectors.

In [35]:
df_embedded = df.copy()
df_embedded

Unnamed: 0,text
0,"I love this school, but it's tiring sometimes!"
1,I'm looking forward to Christmas break.
2,I'm sad that midterms are over.
3,School is almost over.
4,I'll be relaxing during the Christmas break.


In [36]:
# Embed a bunch of documents and append them to a dataframe
documents = []
vectors = []

for doc in nlp.pipe(df['text']):
    documents.append(doc)
    vectors.append(doc.vector)
    
df_embedded['document'] = pd.Series(documents, name='document')
df_embedded = df_embedded.join(pd.DataFrame(vectors))
df_embedded.head()

Unnamed: 0,text,document,0,1,2,3,4,5,6,7,...,290,291,292,293,294,295,296,297,298,299
0,"I love this school, but it's tiring sometimes!","(I, love, this, school, ,, but, it, 's, tiring...",-0.034511,0.327626,-0.127263,-0.147311,0.046244,0.082281,0.057372,-0.219449,...,-0.036997,-0.04901,-0.035962,-0.047019,0.168925,-0.035583,0.019455,-0.091751,0.053876,0.060225
1,I'm looking forward to Christmas break.,"(I, 'm, looking, forward, to, Christmas, break...",0.150931,0.201293,-0.31519,-0.080541,0.298213,-0.066926,0.099092,-0.045008,...,-0.041712,0.092986,-0.03935,-0.083476,0.057845,-0.062269,-8.7e-05,0.063932,-0.061941,0.207759
2,I'm sad that midterms are over.,"(I, 'm, sad, that, midterms, are, over, .)",-0.054281,0.193412,-0.161498,-0.119088,-0.029463,0.012523,0.049262,-0.030172,...,-0.069403,-0.009032,-0.02105,-0.12302,0.130481,0.057626,3.1e-05,-0.027997,-0.001042,-0.024441
3,School is almost over.,"(School, is, almost, over, .)",-0.072062,0.28576,-0.049852,-0.164506,0.22677,-0.099472,-0.122914,-0.032732,...,0.014001,-0.105848,0.118661,0.025903,-0.07734,0.092603,-0.024908,-0.121334,0.02202,-0.036161
4,I'll be relaxing during the Christmas break.,"(I, 'll, be, relaxing, during, the, Christmas,...",0.169008,0.225695,-0.168769,-0.136102,0.095207,-0.127707,0.066548,-0.076575,...,-0.004041,0.121575,-0.09746,-0.145725,0.123401,0.041988,0.058736,0.094489,-0.080959,0.135943


### Document similarity

Now that we have document embeddings, we can use techniques such as cosine similarity to determine similarity between texts. 
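
As a rough illustration of what cosine similarity measures, the sketch below computes it by hand for plain Python lists. The function `cosine_similarity` is written here for illustration and is not SpaCy's implementation, although SpaCy's default `similarity()` is also based on the cosine of averaged word vectors. It assumes neither vector is all zeros, since that would cause a division by zero.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

A score of 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions. The score depends only on direction, not on vector length.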

In [37]:
a = nlp("Hello there!")
b = nlp("Greetings to you!")
c = nlp("Wikipedia is not a dictionary, or a usage or jargon guide.")

In [38]:
a.similarity(b) # The two sentences look very similar!

0.8158018311870442

In [39]:
a.similarity(c) # Noticeably less similar than a and b

0.5575620500314596

### Topic modeling and thematic analysis

For topic modeling and thematic analysis, we will be using Gensim ([source](https://radimrehurek.com/gensim/)) and pyLDAvis ([source](https://github.com/bmabey/pyLDAvis)). We will use Gensim to implement the Latent Dirichlet Allocation (LDA) model, and use pyLDAvis to visualize the results.
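
LDA works on a bag-of-words corpus, where each document is a list of `(token_id, count)` pairs. The sketch below is a pure-Python analogue of that representation, meant only to show its shape. The sample texts, the `token2id` mapping, and the helper `to_bow` are assumptions for illustration, and the ids Gensim assigns may differ.

```python
from collections import Counter

# Made-up tokenized documents for illustration
texts = [
    ["school", "is", "almost", "over"],
    ["christmas", "break", "is", "almost", "here"],
]

# Assign an integer id to every distinct token, in order of first appearance
token2id = {}
for text in texts:
    for token in text:
        token2id.setdefault(token, len(token2id))

def to_bow(text):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in text)
    return sorted(counts.items())

corpus = [to_bow(text) for text in texts]
print(corpus[0])  # [(0, 1), (1, 1), (2, 1), (3, 1)]
print(corpus[1])  # [(1, 1), (2, 1), (4, 1), (5, 1), (6, 1)]
```

Tokens shared between documents ("is", "almost") keep the same id in both, which is what lets a topic model relate documents through their common vocabulary.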

WARNING: Your notebook might become buggy when using pyLDAvis. Once you are done looking at the visual, simply clear the output of the cell that uses the pyLDAvis chart. 

In [40]:
# Run this cell only if you do not have gensim and pyLDAvis on your device yet
!pip install gensim
!pip install pyLDAvis

Collecting pyLDAvis
  Downloading pyLDAvis-3.3.1.tar.gz (1.7 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Collecting funcy
  Downloading funcy-1.16-py2.py3-none-any.whl (32 kB)
Collecting numpy>=1.20.0
  Downloading numpy-1.21.3-cp38-cp38-macosx_10_9_x86_64.whl (16.9 MB)
Collecting future
  Using cached future-0.18.2-py3-none-any.whl
Collecting numexpr
  Downloading numexpr-2.7.3-cp38-cp38-macosx_10_9_x86_64.whl (99 kB)
Building wheels for collected packages: pyLDAvis

In [44]:
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
import pyLDAvis, pyLDAvis.gensim_models


In [45]:
# Create a corpus from the documents
texts = [[token.text for token in doc] for doc in df_modified['document']] # Get token strings from each document
dictionary = Dictionary(texts) # Create a Gensim dictionary
corpus = [dictionary.doc2bow(text) for text in texts] # Create a Gensim corpus
corpus

In [46]:
# Create an LDA model with 3 topics from the generated corpus
lda = LdaModel(
    corpus=corpus, 
    id2word=dictionary, 
    num_topics=3
)
lda.print_topics()


In [47]:
# Visualize the topic models and figure out the themes
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
vis

# Right-click the cell then choose "Clear outputs" when you are done.
