# Natural Language Processing: Context-aware Tasks

Hi everyone! Today, we're continuing with NLP, specifically looking at context-aware problems, features that we can extract at a document-level, and common context-aware tasks such as rules-based sentiment analysis. 

We will be extensively using SpaCy's `en_core_web_lg` model for many of the context-aware tasks today. This model was pre-trained on several corpora such as OntoNotes 5, GloVe Common Crawl, and others ([source](https://spacy.io/models/en)). It comes with pipeline components such as a tokenizer, a POS tagger, a dependency parser, a lemmatizer, and a named entity recognizer. It also comes with pre-trained document embeddings.

In [None]:
from IPython.display import Image
Image(filename='spacy_pipeline.png')

By default, the SpaCy pipeline's preprocessing steps include tokenization, POS tagging, dependency parsing, and named entity recognition only. However, users have the option to use the other features such as lemmatization, embedding, and other self-defined steps. 

In [None]:
# Uncomment and run only if you do not have SpaCy and the en_core_web_lg model installed on your device yet
# !pip install -U spacy
# !python -m spacy download en_core_web_lg

In [None]:
import spacy
spacy.load('en_core_web_lg')  # Quick check that the model is installed

### Parts-of-speech (POS) tagging and dependency parsing

In [None]:
import pandas as pd

In [None]:
# Import spacy and the en_core_web_lg model
import spacy
from spacy import displacy
try:
    nlp = spacy.load("en_core_web_lg")
except OSError:  # Raised by spacy.load() when the model is not installed
    print("Error loading 'en_core_web_lg' model.")

In [None]:
"""
Use the en_core_web_lg model to pre-process our text
SpaCy's pre-trained model automatically determines various linguistic properties
such as POS tags and dependency trees
"""

doc = nlp("The Philippine flight was delayed due to trouble with the airplane.")

In [None]:
# Visualize every "token" in the document
tokens = pd.DataFrame(columns=
                      ['text', 'lemma', 'pos', 'tag',
                       'dependency', 'shape', 'is_alphabet',
                       'is_stopword', 'head_text', 'head_pos'])

for token in doc:
    data = [token.text, token.lemma_, token.pos_,
            token.tag_, token.dep_, token.shape_,
            token.is_alpha, token.is_stop,
            token.head.text, token.head.pos_]
    tokens.loc[len(tokens)] = data
tokens
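
The `shape` column can look cryptic at first. The sketch below approximates how such an orthographic shape string could be derived: uppercase letters map to `X`, lowercase letters to `x`, digits to `d`, and runs of the same symbol are capped at four. The helper name `approx_shape` is hypothetical, and this illustrates the idea rather than reproducing SpaCy's own implementation.

```python
def approx_shape(text: str) -> str:
    """Approximate an orthographic word shape (hypothetical helper, not SpaCy's code)."""
    shape = []
    last = ""
    run = 0
    for char in text:
        # Map each character to a shape symbol
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char  # Punctuation is kept as-is
        # Track how long the current run of identical symbols is
        if symbol == last:
            run += 1
        else:
            run = 1
            last = symbol
        # Keep at most four repeats of the same symbol
        if run <= 4:
            shape.append(symbol)
    return "".join(shape)

for word in ["Philippine", "was", "2024", "."]:
    print(word, "->", approx_shape(word))
```

Capping the runs means that long words with the same capitalization pattern share one shape, which is what makes the feature useful for grouping tokens.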

In [None]:
displacy.render(doc, style='dep', jupyter=True)

### Named entity recognition

In [None]:
"""
Use the en_core_web_lg model to pre-process our text
SpaCy's pre-trained model also determines any named entities from text. 
"""

doc = nlp("The Philippine flight was delayed yesterday.")

In [None]:
doc.ents

In [None]:
displacy.render(doc, style='ent', jupyter=True)

### Rules-based sentiment analysis

For sentiment analysis, we will be using the VADER model ([source](https://github.com/cjhutto/vaderSentiment)). VADER is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. 

We will extend the SpaCy pipeline with an extra step that uses the VADER model to calculate the sentiment polarity of text.

In [None]:
# Uncomment and run only if you do not have VADER installed on your device yet
# !pip install vaderSentiment

In [None]:
# Import VADER model
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

# Import Doc for extending the SpaCy pipeline
from spacy.tokens import Doc

In [None]:
# Define sentiment analysis extensions
"""
You can define the extension as a regular Python function:
def sentiment_analysis(doc):
    return sia.polarity_scores(doc.text)
"""
# Or you can create an anonymous lambda function
sentiment_analysis = lambda doc: sia.polarity_scores(doc.text)

In [None]:
# Instantiate NLP pipeline and set extensions
nlp_with_sentiment = spacy.load("en_core_web_lg")
Doc.set_extension("sentiment", getter=sentiment_analysis, force=True)

In [None]:
doc = nlp_with_sentiment("I love this school, but it's tiring sometimes!")
doc._.sentiment # Access custom properties using the "._." operator
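
To make "lexicon and rule-based" concrete, here is a toy sketch of the idea: each word gets a valence from a small lexicon, a simple rule flips the valence of a word that follows a negation, and the summed score is squashed into the range [-1, 1]. The lexicon values, the `NEGATIONS` set, and the function name `toy_polarity` are all invented for illustration. The squashing step is modeled on the normalization VADER applies to its compound score, to the best of my understanding, but everything else is far simpler than VADER's actual rules.

```python
import math

# Toy lexicon: these valence values are made up for illustration only
TOY_LEXICON = {"love": 3.0, "great": 3.0, "tiring": -1.5, "sad": -2.0}
NEGATIONS = {"not", "never", "no"}

def toy_polarity(text: str, alpha: float = 15.0) -> float:
    """Return a compound-style score in [-1, 1] (hypothetical helper, not VADER)."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    total = 0.0
    for i, word in enumerate(words):
        valence = TOY_LEXICON.get(word, 0.0)
        # Rule: flip the valence if the previous word is a negation
        if i > 0 and words[i - 1] in NEGATIONS:
            valence = -valence
        total += valence
    # Squash the unbounded sum into [-1, 1]
    return total / math.sqrt(total * total + alpha)

print(toy_polarity("I love this school, but it's tiring sometimes!"))
print(toy_polarity("I do not love mondays."))
```

The first sentence mixes a positive and a negative word, so its score is positive but well below 1; the second is pushed negative by the negation rule.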

### Process multiple documents

SpaCy pipelines have an optimization for processing multiple texts at once: the `nlp.pipe()` method.

In [None]:
df = pd.DataFrame({
    "text": [
        "I love this school, but it's tiring sometimes!",
        "I'm looking forward to Christmas break.",
        "I'm sad that midterms are over.",
        "School is almost over.",
        "I'll be relaxing during the Christmas break."
    ]
})

In [None]:
df

In [None]:
# Use the nlp.pipe() method to process a collection of texts
docs = nlp_with_sentiment.pipe(df['text'])

In [None]:
# Collect one row per document, then build the dataframe in one go
# (DataFrame.append was removed in pandas 2.0)
rows = []
for doc in docs:
    sentiment = doc._.sentiment  # Compute the VADER scores once per document
    rows.append(
        {
            'text': doc.text,
            'document': doc,
            'com_sent': sentiment['compound'],
            'pos_sent': sentiment['pos'],
            'neg_sent': sentiment['neg'],
            'neu_sent': sentiment['neu']
        }
    )
df_modified = pd.DataFrame(rows, columns=['text', 'document', 'com_sent', 'pos_sent', 'neg_sent', 'neu_sent'])
df_modified
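
The compound score is the easiest column to turn into a label. The VADER README suggests treating scores of at least 0.05 as positive, scores of at most -0.05 as negative, and anything in between as neutral. Below is a minimal sketch of that rule; the function name `label_sentiment` is hypothetical, and the scores in the loop are example values rather than outputs from the cells above.

```python
def label_sentiment(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score to a coarse label (hypothetical helper)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Example scores chosen for illustration only
for score in [0.64, -0.48, 0.0]:
    print(score, "->", label_sentiment(score))
```

With the dataframe from the previous cell, the same function could be applied as `df_modified['com_sent'].apply(label_sentiment)`.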

### Document embeddings

SpaCy's `en_core_web_lg` model also comes with pre-trained document embeddings. As such, we can simply use them to immediately retrieve the embeddings of our text. 

In [None]:
doc

In [None]:
doc.vector.shape # The embedding uses 300 feature columns only!

In [None]:
doc.vector # This is the embedding vector

In the following cells, we will embed several documents and retrieve their embedding vectors.

In [None]:
df_embedded = df.copy()
df_embedded

In [None]:
# Embed a bunch of documents and append them to a dataframe
documents = []
vectors = []

for doc in nlp.pipe(df['text']):
    documents.append(doc)
    vectors.append(doc.vector)
    
df_embedded['document'] = pd.Series(documents, name='document')
df_embedded = df_embedded.join(pd.DataFrame(vectors))
df_embedded.head()
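
By default, SpaCy computes `doc.vector` as the average of the document's token vectors. The sketch below shows that averaging with plain Python lists and made-up 3-dimensional vectors (the real model uses 300 dimensions). The names `toy_vectors` and `average_vector` are hypothetical, and tokens without a vector are simply skipped here, which is a simplification.

```python
# Made-up 3-dimensional word vectors, for illustration only
toy_vectors = {
    "school": [0.2, 0.8, 0.1],
    "is": [0.0, 0.1, 0.0],
    "over": [0.4, 0.0, 0.5],
}

def average_vector(tokens, vectors):
    """Average the vectors of the given tokens, dimension by dimension."""
    rows = [vectors[token] for token in tokens if token in vectors]
    if not rows:
        return []  # No known tokens, so there is nothing to average
    # zip(*rows) groups the values of each dimension together
    return [sum(column) / len(rows) for column in zip(*rows)]

doc_vector = average_vector(["school", "is", "over"], toy_vectors)
print(doc_vector)
```

Because the result is a plain average, word order is lost: two documents with the same words in a different order get the same vector.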

### Document similarity

Now that we have document embeddings, we can use techniques such as cosine similarity to determine similarity between texts. 

In [None]:
a = nlp("Hello there!")
b = nlp("Greetings to you!")
c = nlp("Wikipedia is not a dictionary, or a usage or jargon guide.")

In [None]:
a.similarity(b) # The two sentences look very similar!

In [None]:
a.similarity(c) # Not so similar (the score is close to 0.5)
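
Under the hood, `similarity()` defaults to the cosine similarity of the two document vectors. Below is a minimal pure-Python sketch of that formula, using made-up vectors and a hypothetical function name `cosine_similarity`. Returning 0.0 for a zero vector is a convention chosen for this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # Same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # Orthogonal
```

Cosine similarity only compares the direction of the vectors, not their length, so a score is a measure of closeness in the embedding space and not a probability.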

### Topic modeling and thematic analysis

For topic modeling and thematic analysis, we will be using GenSim ([source](https://radimrehurek.com/gensim/)) and pyLDAvis ([source](https://github.com/bmabey/pyLDAvis)). We will use GenSim to implement the Latent Dirichlet Allocation (LDA) model, and use pyLDAvis to visualize the results.

WARNING: Your notebook might become buggy when using pyLDAvis. Once you are done looking at the visual, simply clear the output of the cell that uses the pyLDAvis chart. 

In [None]:
# Uncomment and run only if you do not have gensim and pyLDAvis on your device yet
# !pip install gensim
# !pip install pyLDAvis

In [None]:
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
import pyLDAvis, pyLDAvis.gensim_models

In [None]:
# Create a corpus from the documents
split_texts = [[token.text for token in doc] for doc in df_modified['document']] # Get tokens from each document
dictionary = Dictionary(split_texts) # Create a GenSim dictionary
corpus = [dictionary.doc2bow(text) for text in split_texts] # Create a GenSim corpus
corpus
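
Each entry in `corpus` is a bag-of-words: a list of `(token_id, count)` pairs for one document. The sketch below reproduces that idea with the standard library only. `build_vocab` and `to_bow` are hypothetical helpers, and the ids they assign will not necessarily match the ones GenSim picks.

```python
from collections import Counter

def build_vocab(texts):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for tokens in texts:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Convert one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in tokens if token in vocab)
    return sorted(counts.items())

# Two tiny tokenized documents, for illustration only
texts = [["school", "is", "almost", "over"], ["school", "is", "school"]]
vocab = build_vocab(texts)
bows = [to_bow(tokens, vocab) for tokens in texts]
print(vocab)
print(bows)
```

The representation keeps word counts but drops word order, which is all the LDA model needs as input.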

In [None]:
# Create an LDA model with 3 topics from the generated corpus
lda = LdaModel(
    corpus=corpus, 
    id2word=dictionary, 
    num_topics=3
)
lda.print_topics()

In [None]:
# Visualize the topic models and figure out the themes
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
vis

# Right-click the cell then choose "Clear outputs" when you are done.