![](img/330-banner.png)

# Lecture 16: Introduction to natural language processing 

UBC 2022 Summer

Instructor: Mehrdad Oveisi

## Imports

In [1]:
import os
import re
import string
import sys
import time
from collections import Counter, defaultdict

import IPython
import nltk
import numpy as np
import numpy.random as npr
import pandas as pd
from IPython.display import HTML
from ipywidgets import interactive
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

<br><br>

## Learning objectives

- Broadly explain what natural language processing (NLP) is. 
- Name some common NLP applications. 
- Explain the general idea of a vector space model.
- Explain the difference between different word representations: term-term co-occurrence matrix representation and Word2Vec representation.
- Describe the reasons and benefits of using pre-trained embeddings. 
- Load and use pre-trained word embeddings to find word similarities and analogies. 
- Demonstrate biases in embeddings and learn to watch out for such biases in pre-trained embeddings.
- Use word embeddings in text classification and document clustering using `spaCy`.
- Explain the general idea of topic modeling. 
- Describe the input and output of topic modeling. 
- Carry out basic text preprocessing using `spaCy`.   

<br><br>

## What is Natural Language Processing (NLP)?

- What should a search engine return when asked the following question? 

![](img/lexical_ambiguity.png)


### What is Natural Language Processing (NLP)?
#### How often do you search every day? 

![](img/Google_search.png)


### What is Natural Language Processing (NLP)?

![](img/WhatisNLP.png)


### Everyday NLP applications

![](img/annotation-image.png)

### NLP in news 

You'll often see NLP in the news. Some examples: 
- [How suicide prevention is getting a boost from artificial intelligence](https://abcnews.go.com/GMA/Wellness/suicide-prevention-boost-artificial-intelligence-exclusive/story?id=76541481)
- [Meet GPT-3. It Has Learned to Code (and Blog and Argue).](https://www.nytimes.com/2020/11/24/science/artificial-intelligence-ai-gpt3.html)
- [How Do You Know a Human Wrote This?](https://www.nytimes.com/2020/07/29/opinion/gpt-3-ai-automation.html)
- ...

### Why is NLP hard?

- Language is complex and subtle. 
- Language is ambiguous at different levels. 
- Language understanding involves common-sense knowledge and real-world reasoning.
- All the problems related to representation and reasoning in artificial intelligence arise in this domain. 

### Example: Lexical ambiguity

<br><br>

![](img/lexical_ambiguity.png)

    

### Example: Referential ambiguity
<br><br>


![](img/referential_ambiguity.png)
    

### [Ambiguous news headlines](http://www.fun-with-words.com/ambiguous_headlines.html)

<blockquote>
KICKING BABY CONSIDERED TO BE HEALTHY    
</blockquote> 

- Is **kicking** used as an adjective or a verb?

<blockquote>
MILK DRINKERS ARE TURNING TO POWDER
</blockquote>

- Does **turning** mean "becoming" or "taking up"?

### Overall goal

- Give you a quick introduction to this important field in artificial intelligence which extensively uses machine learning.

![](img/NLP_in_industry.png)


### Today's plan

- Word embeddings
- Topic modeling 
- Basic text preprocessing 

## Word Embeddings 

- The idea is to represent word meaning so that similar words are close together. 

![](img/t-SNE_word_embeddings.png)


(Attribution: Jurafsky and Martin 3rd edition)

### Why do we care about word representation?

- So far we have been talking about sentence or document representation. 
- Now we are going one step back and talking about **word representation**. 
- Although word representations cannot be directly used in text classification tasks such as sentiment analysis with traditional ML models, it's good to know about **word embeddings** because they are so widely used. 
- They are quite useful in more advanced machine learning models such as recurrent neural networks. 

### Word meaning 

- A favourite topic of philosophers for centuries. 
- An example from legal domain: [Are hockey gloves gloves or "articles of plastics"?](https://www.scc-csc.ca/case-dossier/info/sum-som-eng.aspx?cas=36258)

<blockquote>
Canada (A.G.) v. Igloo Vikski Inc. was a tariff code case that made its way to the SCC (Supreme Court of Canada). The case disputed the definition of hockey gloves as either gloves or as "articles of plastics."
</blockquote>

![](img/hockey_gloves_case.png)


### Word meaning: ML and NLP view
- Modeling word meaning that allows us to 
    * draw useful inferences to solve meaning-related problems 
    * find relationships between words, 
        * E.g., which words are similar, which ones have positive or negative connotations
    

## Word representations

### How do we represent words?  

- Suppose you are building a question answering system and you are given the following question and three candidate answers. 
- What kind of relationship between words would we like our representation to capture in order to arrive at the correct answer?  
    
<blockquote>       
<p style="font-size:30px"><b>Question:</b> How <b>tall</b> is Machu Picchu?</p>
    <p style="font-size:30px"><b>Candidate 1:</b> Machu Picchu is 13.164 degrees south of the equator.</p>    
<p style="font-size:30px"><b>Candidate 2:</b> The official height of Machu Picchu is 2,430 m.</p>
<p style="font-size:30px"><b>Candidate 3:</b> Machu Picchu is 80 kilometres (50 miles) northwest of Cusco.</p>    
</blockquote> 
    

### Need a representation that captures *relationships between words*.

- We will be looking at two such representations.  
    1. Sparse representation with **term-term co-occurrence matrix**
    2. Dense representation with **Word2Vec**
- Both are based on two ideas: **distributional hypothesis** and **vector space model**.

### Distributional hypothesis

<blockquote> 
    <p>You shall know a word by the company it keeps.</p>
    <footer>Firth, 1957</footer>
</blockquote>

<blockquote>
    <p>If A and B have almost identical environments we say that they are synonyms.</p>
    <footer>Harris, 1954</footer>
</blockquote>

Example: 

- Her **child** loves to play in the playground. 
- Her **kid** loves to play in the playground. 
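
A minimal sketch of this idea, using only the two example sentences above: collect the words that appear within a small window around a target word and compare the resulting context sets. The helper name `context_words` and the window size of 2 are assumptions made purely for illustration.

```python
def context_words(sentence, target, window=2):
    """Return the set of words within `window` positions of `target`."""
    tokens = sentence.lower().rstrip(".").split()
    contexts = set()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            # Words to the left and right of the target, excluding the target itself
            contexts.update(tokens[lo:i] + tokens[i + 1 : hi])
    return contexts


child_ctx = context_words("Her child loves to play in the playground.", "child")
kid_ctx = context_words("Her kid loves to play in the playground.", "kid")

# Both words appear in exactly the same environment in these two sentences.
print(child_ctx == kid_ctx)
print(sorted(child_ctx))
```

With a real corpus the context sets would rarely be identical, but words with similar meanings would tend to share many context words.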



### Vector space model

- Model the meaning of a word by placing it into a vector space.  
- A standard way to represent meaning in NLP
- The idea is to create **embeddings of words** so that distances among words in the vector space indicate the relationship between them. 

![](img/t-SNE_word_embeddings.png)


(Attribution: Jurafsky and Martin 3rd edition)

### Term-term co-occurrence matrix

- So far we have been talking about documents and we created a document-term co-occurrence matrix (e.g., bag-of-words representation of text). 
- We can also do this with words. The idea is to go through a corpus of text, keeping a count of all of the words that appear in the context of each word (within a window).

- An example: 

![](img/term-term_comat.png)

(Credit: Jurafsky and Martin 3rd edition)
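
The counting procedure described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function name `cooccurrence_counts`, the window size, and the tiny pre-tokenized corpus are made up here, and this is not the course's `CooccurrenceMatrix` implementation.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(tokenized_sents, window=2):
    """Count how often each pair of words appears within `window` positions."""
    counts = defaultdict(Counter)
    for tokens in tokenized_sents:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    counts[word][tokens[j]] += 1
    return counts


# Hypothetical, already-preprocessed toy corpus
toy_corpus = [
    ["machu", "picchu", "official", "height"],
    ["tall", "machu", "picchu"],
]
counts = cooccurrence_counts(toy_corpus, window=2)

# Lay the counts out as a square matrix over a sorted vocabulary.
toy_vocab = sorted({w for sent in toy_corpus for w in sent})
toy_matrix = [[counts[r][c] for c in toy_vocab] for r in toy_vocab]

print(toy_vocab)
for word, row in zip(toy_vocab, toy_matrix):
    print(f"{word:>8}", row)
```

Each row of the resulting matrix is the vector for one word, and the matrix is symmetric because co-occurrence within a symmetric window is mutual.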


### Visualizing word vectors and similarity 

![](img/word_vectors_and_angles.png)

    
(Credit: Jurafsky and Martin 3rd edition)

- The similarity is calculated using dot products between word vectors.
    - Example: $\vec{\text{digital}} \cdot \vec{\text{information}} = 0 \times 1 + 1\times 6 = 6$
    - The higher the dot product, the more similar the words.

- We can also calculate a normalized version of dot products. 
    $$\text{similarity}_{\text{cosine}}(w_1, w_2) = \frac{w_1 \cdot w_2}{\left\lVert w_1\right\rVert_2 \left\lVert w_2\right\rVert_2}$$
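
As a small worked example, the sketch below computes both quantities for the two-dimensional vectors read off the dot-product example above, digital = (0, 1) and information = (1, 6). The helper names `dot` and `cosine_sim` are assumptions for illustration; only the standard library is used.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_sim(u, v):
    """Dot product divided by the product of the L2 norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


digital = [0, 1]
information = [1, 6]

print(dot(digital, information))                   # 6
print(round(cosine_sim(digital, information), 3))  # 0.986

# Scaling a vector changes the dot product but not the cosine similarity.
scaled_digital = [0, 10]
print(dot(scaled_digital, information))                   # 60
print(round(cosine_sim(scaled_digital, information), 3))  # 0.986
```

The last two lines show why the normalized version is useful: frequent words have longer count vectors, which inflates raw dot products but leaves the cosine unchanged.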


### Let's build a term-term co-occurrence matrix for our text.

In [2]:
# You need to run the following line once to download the stop word list.
# nltk.download('stopwords')

In [3]:
sys.path.append("code/.")

from comat import CooccurrenceMatrix
from preprocessing import MyPreprocessor

corpus = [
    "How tall is Machu Picchu?",
    "Machu Picchu is 13.164 degrees south of the equator.",
    "The official height of Machu Picchu is 2,430 m.",
    "Machu Picchu is 80 kilometres (50 miles) northwest of Cusco.",
    "It is 80 kilometres (50 miles) northwest of Cusco, on the crest of the mountain Machu Picchu, located about 2,430 metres (7,970 feet) above mean sea level, over 1,000 metres (3,300 ft) lower than Cusco, which has an elevation of 3,400 metres (11,200 ft).",
]
pp = MyPreprocessor()
pp_corpus = pp.preprocess_corpus(corpus)
cm = CooccurrenceMatrix(pp_corpus)
vocab, comat = cm.fit_transform()
words = [
    key for key, value in sorted(vocab.items(), key=lambda item: (item[1], item[0]))
]
df = pd.DataFrame(comat.todense(), columns=words, index=words, dtype=np.int8)
df.head()

Unnamed: 0,tall,machu,picchu,13.164,degrees,south,equator,official,height,"2,430",...,mean,sea,level,"1,000","3,300",ft,lower,elevation,"3,400","11,200"
tall,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
machu,1,0,5,1,1,0,0,1,1,2,...,0,0,0,0,0,0,0,0,0,0
picchu,1,5,0,1,1,1,0,1,1,2,...,0,0,0,0,0,0,0,0,0,0
13.164,0,1,1,0,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
degrees,0,1,1,1,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
import warnings
from sklearn.metrics.pairwise import cosine_similarity


def similarity(word1, word2):
    """
    Returns similarity score between word1 and word2
    Arguments
    ---------
    word1 -- (str)
        The first word
    word2 -- (str)
        The second word

    Returns
    --------
    None. Prints the similarity score between word1 and word2.
    """
    # Convert the sparse word vectors to flat 1-D NumPy arrays
    v1 = np.asarray(cm.get_word_vector(word1).todense()).ravel()
    v2 = np.asarray(cm.get_word_vector(word2).todense()).ravel()
    # cosine_similarity expects 2-D inputs and returns a 1x1 array here
    cos_sim = cosine_similarity(v1.reshape(1, -1), v2.reshape(1, -1))[0, 0]
    print(
        "The dot product between %s and %s is %0.2f and cosine similarity is %0.2f"
        % (word1, word2, np.dot(v1, v2), cos_sim)
    )


similarity("tall", "height")
similarity("tall", "official")

### Not very reliable similarity scores because we used only 5 sentences.

The dot product between tall and height is 2.00 and cosine similarity is 0.71
The dot product between tall and official is 2.00 and cosine similarity is 0.82


- We are able to capture some similarities between words now. 
- That said, the similarities do not make much sense in the toy example above because we're using a tiny corpus. 
- To find meaningful patterns of similarities between words, we need a large corpus. 
- Let's try a slightly larger corpus and check whether the similarities make sense. 

In [5]:
import wikipedia
from nltk.tokenize import sent_tokenize, word_tokenize

corpus = []

queries = ["Machu Picchu", "human stature", "mountain"]

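for query in queries:
    sents = sent_tokenize(wikipedia.page(query).content)
    corpus.extend(sents)
# `sents` holds only the sentences of the last page fetched;
# len(corpus) gives the total across all three pages.
print("Number of sentences in the last page: ", len(sents))

Number of sentences in the last page:  129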


In [6]:
pp = MyPreprocessor()
pp_corpus = pp.preprocess_corpus(corpus)
cm = CooccurrenceMatrix(pp_corpus)
vocab, comat = cm.fit_transform()
words = [
    key for key, value in sorted(vocab.items(), key=lambda item: (item[1], item[0]))
]
df = pd.DataFrame(comat.todense(), columns=words, index=words, dtype=np.int8)
df.head()

Unnamed: 0,machu,picchu,15th-century,inca,citadel,located,eastern,cordillera,southern,peru,...,contender,"6,600",ranges,lists,accessible,encyclopædia,britannica,11th,quotations,wikiquote
machu,0,93,1,6,2,2,1,0,1,5,...,0,0,0,0,0,0,0,0,0,0
picchu,93,0,1,7,4,1,1,0,1,4,...,0,0,0,0,0,0,0,0,0,0
15th-century,1,1,0,1,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
inca,6,7,1,0,1,1,1,0,0,2,...,0,0,0,0,0,0,0,0,0,0
citadel,2,4,1,1,0,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [7]:
similarity("tall", "height")
similarity("tall", "official")

The dot product between tall and height is 28.00 and cosine similarity is 0.10
The dot product between tall and official is 0.00 and cosine similarity is 0.00


<br><br>

### Sparse vs. dense word vectors

- Term-term co-occurrence matrices are long and sparse. 
    - length |V| is usually large (e.g., > 50,000) 
    - most elements are zero
- That is OK because there are efficient ways to deal with sparse matrices.
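
One way to see why sparsity is manageable: store only the non-zero entries and skip everything else when computing. The sketch below is a toy illustration with made-up rows and hypothetical helper names (`to_sparse`, `sparse_dot`); sparse matrix libraries apply the same idea with far more efficient data structures.

```python
# Two made-up rows over a tiny 10-word vocabulary: mostly zeros.
dense_row_a = [0, 2, 0, 0, 0, 1, 0, 0, 0, 0]
dense_row_b = [0, 3, 0, 0, 4, 0, 0, 0, 0, 0]


def to_sparse(row):
    """Keep only the non-zero entries as {column_index: value}."""
    return {j: x for j, x in enumerate(row) if x != 0}


def sparse_dot(a, b):
    """Dot product that only visits indices present in both vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter dictionary
    return sum(x * b[j] for j, x in a.items() if j in b)


sparse_a, sparse_b = to_sparse(dense_row_a), to_sparse(dense_row_b)
print(sparse_a)                        # {1: 2, 5: 1}
print(sparse_dot(sparse_a, sparse_b))  # 6
```

Storage and computation then scale with the number of non-zero entries rather than with the vocabulary size |V|.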


### Alternative 
- Learn short (~100 to 1000 dimensions) and dense vectors. 
- Short vectors may be easier to train with ML models (fewer weights to train).
- They may generalize better.
- In practice they work much better! 

### Word2Vec 

- A family of algorithms to create dense word embeddings

![](img/word2vec.png)



### Success of Word2Vec

- Able to capture complex relationships between words.
- Example: What is the word that is similar to **WOMAN** in the same sense as **KING** is similar to **MAN**?
- Perform simple algebraic operations on the vector representations of words.
    $\vec{X} = \vec{\text{KING}} - \vec{\text{MAN}} + \vec{\text{WOMAN}}$
- Search in the vector space for the word closest to $\vec{X}$, as measured by cosine distance.

![](img/word_analogies1.png)

    
(Credit: Mikolov et al. 2013)    
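
The vector arithmetic above can be sketched with hand-made vectors. Everything here is an assumption for illustration: the two-dimensional vectors are invented (one coordinate loosely standing for "royalty", the other for "gender"), and `toy_analogy` is a hypothetical helper, not how any particular library implements analogies.

```python
import math

# Made-up 2-D vectors: first coordinate ~ "royalty", second ~ "gender".
toy_vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}


def cosine_sim(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_analogy(a, b, c, vectors):
    """Solve a : b :: c : ? by ranking words against the vector b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]  # skip the query words
    return max(candidates, key=lambda w: cosine_sim(vectors[w], target))


print(toy_analogy("man", "king", "woman", toy_vectors))  # queen
```

The query words themselves are excluded from the candidates, because the vector closest to KING − MAN + WOMAN is often one of the three input words.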


- We can create a dense representation with a library called `gensim`. 

`conda install -n cpsc330 -c anaconda gensim`

In [8]:
from gensim.models import Word2Vec

sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = Word2Vec(sentences, min_count=1)

Let's look at the word vector of the word _cat_. 

In [9]:
model.wv["cat"]

array([-0.00713902,  0.00124103, -0.00717672, -0.00224462,  0.0037193 ,
        0.00583312,  0.00119818,  0.00210273, -0.00411039,  0.00722533,
       -0.00630704,  0.00464721, -0.00821997,  0.00203647, -0.00497705,
       -0.00424769, -0.00310899,  0.00565521,  0.0057984 , -0.00497465,
        0.00077333, -0.00849578,  0.00780981,  0.00925729, -0.00274233,
        0.00080022,  0.00074665,  0.00547788, -0.00860608,  0.00058445,
        0.00686942,  0.00223159,  0.00112468, -0.00932216,  0.00848237,
       -0.00626413, -0.00299237,  0.00349379, -0.00077263,  0.00141129,
        0.00178199, -0.0068289 , -0.00972481,  0.00904058,  0.00619805,
       -0.00691293,  0.00340348,  0.00020606,  0.00475374, -0.00711994,
        0.00402695,  0.00434743,  0.00995737, -0.00447374, -0.00138927,
       -0.00731732, -0.00969783, -0.00908026, -0.00102276, -0.00650329,
        0.00484973, -0.00616403,  0.00251919,  0.00073944, -0.00339216,
       -0.00097922,  0.00997912,  0.00914589, -0.00446183,  0.00

What's the most similar word to the word _cat_? 

In [10]:
model.wv.most_similar("cat")

[('dog', 0.17018885910511017),
 ('woof', 0.004503009375184774),
 ('say', -0.027750369161367416),
 ('meow', -0.04461709037423134)]

This is good. But if you want good and meaningful representations of words you need to train models on a large corpus such as the whole Wikipedia, which is computationally intensive. 

So instead of training our own models, we use **pre-trained embeddings**. These are word embeddings that people have trained on huge corpora and made available for us to use. 

Let's try out Google News pre-trained word vectors. 

In [11]:
# It'll take a while to run this when you try it out for the first time.
import gensim.downloader as api

google_news_vectors = api.load("word2vec-google-news-300")

In [12]:
print("Size of vocabulary: ", len(google_news_vectors))

Size of vocabulary:  3000000


- `google_news_vectors` above has 300-dimensional word vectors for 3,000,000 unique words from Google News. 
- What can we do with these word vectors?

### Finding similar words 

- Given word $w$, search in the vector space for the word closest to $w$ as measured by cosine distance. 

In [13]:
google_news_vectors.most_similar("UBC")

[('UVic', 0.7886475920677185),
 ('SFU', 0.7588528394699097),
 ('Simon_Fraser', 0.7356574535369873),
 ('UFV', 0.6880435943603516),
 ('VIU', 0.6778583526611328),
 ('Kwantlen', 0.6771429181098938),
 ('UBCO', 0.6734487414360046),
 ('UPEI', 0.6731126308441162),
 ('UBC_Okanagan', 0.6709133982658386),
 ('Lakehead_University', 0.6622507572174072)]

In [14]:
google_news_vectors.most_similar("information")

[('info', 0.7363681793212891),
 ('infomation', 0.680029571056366),
 ('infor_mation', 0.673384964466095),
 ('informaiton', 0.6639009118080139),
 ('informa_tion', 0.6601256728172302),
 ('informationon', 0.6339334845542908),
 ('informationabout', 0.6320980191230774),
 ('Information', 0.6186580657958984),
 ('informaion', 0.6093292236328125),
 ('details', 0.6063088774681091)]

If you want to extract all documents containing the word *information* or similar words, you could use this list of nearest neighbours.  

### Finding similarity scores between words

In [15]:
google_news_vectors.similarity("Canada", "hockey")

0.27610135

In [16]:
google_news_vectors.similarity("China", "hockey")

0.060384665

In [17]:
word_pairs = [
    ("height", "tall"),
    ("pineapple", "mango"),
    ("pineapple", "juice"),
    ("sun", "robot"),
    ("GPU", "lion"),
]
for pair in word_pairs:
    print(
        "The similarity between %s and %s is %0.3f"
        % (pair[0], pair[1], google_news_vectors.similarity(pair[0], pair[1]))
    )

The similarity between height and tall is 0.473
The similarity between pineapple and mango is 0.668
The similarity between pineapple and juice is 0.418
The similarity between sun and robot is 0.029
The similarity between GPU and lion is 0.002


In [18]:
def analogy(word1, word2, word3, model=google_news_vectors):
    """
    Returns analogy word using the given model.

    Parameters
    --------------
    word1 : (str)
        word1 in the analogy relation
    word2 : (str)
        word2 in the analogy relation
    word3 : (str)
        word3 in the analogy relation
    model :
        word embedding model

    Returns
    ---------------
        pd.DataFrame
    """
    print("%s : %s :: %s : ?" % (word1, word2, word3))
    sim_words = model.most_similar(positive=[word3, word2], negative=[word1])
    return pd.DataFrame(sim_words, columns=["Analogy word", "Score"])

In [19]:
analogy("man", "king", "woman")

man : king :: woman : ?


Unnamed: 0,Analogy word,Score
0,queen,0.711819
1,monarch,0.618967
2,princess,0.590243
3,crown_prince,0.549946
4,prince,0.537732
5,kings,0.523684
6,Queen_Consort,0.523595
7,queens,0.518113
8,sultan,0.509859
9,monarchy,0.508741


In [20]:
analogy("Montreal", "Canadiens", "Vancouver")

Montreal : Canadiens :: Vancouver : ?


Unnamed: 0,Analogy word,Score
0,Canucks,0.821327
1,Vancouver_Canucks,0.750401
2,Calgary_Flames,0.70547
3,Leafs,0.695783
4,Maple_Leafs,0.691617
5,Thrashers,0.687504
6,Avs,0.681716
7,Sabres,0.665307
8,Blackhawks,0.664625
9,Habs,0.661023


In [21]:
analogy("Toronto", "UofT", "Vancouver")

Toronto : UofT :: Vancouver : ?


Unnamed: 0,Analogy word,Score
0,SFU,0.579245
1,UVic,0.576921
2,UBC,0.571431
3,Simon_Fraser,0.543464
4,Langara_College,0.541347
5,UVIC,0.520495
6,Grant_MacEwan,0.517273
7,UFV,0.51415
8,Ubyssey,0.510421
9,Kwantlen,0.503807


### Examples of semantic and syntactic relationships

![](img/word_analogies2.png)


(Credit: Mikolov et al. 2013)

### Implicit biases and stereotypes in word embeddings

- Word embeddings reflect gender stereotypes present in broader society.
- They may also amplify these stereotypes because of their widespread usage. 
- See the paper [Man is to Computer Programmer as Woman is to ...](http://papers.nips.cc/paper/6228-man-is-to-computer-programmer-as-woman-is-to-homemaker-debiasing-word-embeddings.pdf).

In [22]:
analogy("man", "computer_programmer", "woman")

man : computer_programmer :: woman : ?


Unnamed: 0,Analogy word,Score
0,homemaker,0.562712
1,housewife,0.510505
2,graphic_designer,0.50518
3,schoolteacher,0.497949
4,businesswoman,0.493489
5,paralegal,0.492551
6,registered_nurse,0.490797
7,saleswoman,0.488163
8,electrical_engineer,0.479773
9,mechanical_engineer,0.47554


![](img/eva-srsly.png)

Many modern embeddings are de-biased to some degree, but it is still worth checking pre-trained embeddings for such biases. 

### Other pre-trained embeddings

A number of pre-trained word embeddings are available. The most popular ones are:  

- [Word2Vec](https://code.google.com/archive/p/word2vec/)
    * trained on several corpora using the word2vec algorithm 
- [wikipedia2vec](https://wikipedia2vec.github.io/wikipedia2vec/pretrained/)
    * pretrained embeddings for 12 languages 
- [GloVe](https://nlp.stanford.edu/projects/glove/)
    * trained using [the GloVe algorithm](https://nlp.stanford.edu/pubs/glove.pdf) 
    * published by Stanford University 
- [fastText pre-trained embeddings for 294 languages](https://fasttext.cc/docs/en/pretrained-vectors.html) 
    * trained using [the fastText algorithm](http://aclweb.org/anthology/Q17-1010)
    * published by Facebook    

***Note*** 
These pre-trained word vectors are large (several gigabytes).

Here is a list of all pre-trained embeddings available with `gensim`. 

In [23]:
import gensim.downloader

print(*gensim.downloader.info()["models"].keys(), sep=", ")

fasttext-wiki-news-subwords-300, conceptnet-numberbatch-17-06-300, word2vec-ruscorpora-300, word2vec-google-news-300, glove-wiki-gigaword-50, glove-wiki-gigaword-100, glove-wiki-gigaword-200, glove-wiki-gigaword-300, glove-twitter-25, glove-twitter-50, glove-twitter-100, glove-twitter-200, __testing_word2vec-matrix-synopsis


<br><br>

### Word vectors with spaCy

- spaCy also gives you access to word vectors with bigger models: `en_core_web_md` or `en_core_web_lg`
- `spaCy`'s pre-trained embeddings are trained on the [OntoNotes corpus](https://catalog.ldc.upenn.edu/LDC2013T19).
- This corpus has a collection of different styles of texts such as telephone conversations, newswire, newsgroups, broadcast news, broadcast conversation, weblogs, religious texts. 
- Let's try it out. 

In [24]:
import spacy

nlp = spacy.load("en_core_web_md")

doc = nlp("pineapple")
doc.vector

array([-6.3358e-01,  1.2266e-01,  4.7232e-01, -2.2974e-01, -2.6307e-01,
        5.6499e-01, -7.2338e-01,  1.6736e-01,  4.2030e-01,  9.3788e-01,
       -1.7248e-01,  2.9126e-01, -7.8257e-01, -1.2844e-01,  3.3886e-01,
       -3.3947e-01, -4.2727e-01,  1.4579e+00, -3.1335e-01, -5.2318e-05,
        6.1973e-01, -3.6482e-02, -1.0523e-01, -1.0039e-01,  2.6267e-01,
       -5.3926e-01,  1.8011e-01, -6.1530e-02,  1.9553e-01, -1.0654e+00,
        6.8364e-02,  1.0889e-02,  6.8862e-01, -3.0893e-02, -2.9747e-01,
        2.7474e-01,  3.5519e-01,  6.7799e-02, -4.5287e-01,  5.9265e-01,
       -1.7061e-01, -2.9119e-01, -1.5082e-01, -3.1661e-01, -3.8160e-01,
        2.8878e-01, -9.4547e-02,  1.9872e-01, -4.1434e-02, -4.5629e-03,
        2.0329e-01,  4.4544e-02, -1.0714e-01,  1.3336e-02,  7.6487e-01,
       -9.7920e-01,  4.1987e-01,  1.2322e-01,  2.2844e-01, -3.8980e-01,
        3.5201e-02,  5.8634e-03,  5.3757e-01, -2.7319e-01, -2.0973e-01,
       -5.6800e-01, -1.3261e-01, -4.0366e-02, -4.3473e-01,  5.93

<br><br>

### Representing documents using word embeddings

- Assume that we have reasonable representations of words. 
- How do we represent the meaning of **paragraphs** or **documents**?
- Two simple approaches
    - **Averaging** embeddings
    - **Concatenating** embeddings

### Averaging embeddings

<blockquote>
All empty promises
</blockquote>
    
$(\text{embedding}(\text{all}) + \text{embedding}(\text{empty}) + \text{embedding}(\text{promises}))/3$
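
To make the two approaches concrete, here is a small sketch with invented three-dimensional word vectors (real embeddings have hundreds of dimensions, and all numbers here are made up). It contrasts averaging with concatenating for the phrase above.

```python
# Made-up 3-dimensional embeddings for the words in "All empty promises".
toy_embeddings = {
    "all": [0.2, 0.4, 0.0],
    "empty": [-0.6, 0.1, 0.3],
    "promises": [0.1, -0.2, 0.9],
}
tokens = ["all", "empty", "promises"]
word_vectors = [toy_embeddings[t] for t in tokens]

# Averaging: one coordinate-wise mean, so the size stays fixed at 3.
averaged = [sum(dim) / len(word_vectors) for dim in zip(*word_vectors)]

# Concatenating: vectors laid end to end, so the size grows with the text.
concatenated = [x for vec in word_vectors for x in vec]

print(len(averaged), len(concatenated))  # 3 9
```

Averaging gives every document a vector of the same length regardless of how many words it has, which is convenient for standard classifiers; concatenation keeps word order but produces vectors whose length depends on the document.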

### Average embeddings with spaCy

- We can do this conveniently with [spaCy](https://spacy.io/usage/linguistic-features#vectors-similarity). 
- We need `en_core_web_md` model to access word vectors. 
- You can download the model from the command line, inside your course `conda` environment, as follows.   

```
conda activate cpsc330
python -m spacy download en_core_web_md
```

We can access word vectors for individual words in `spaCy` as follows. 

In [25]:
nlp("empty").vector[0:10]

array([-0.76058 , -0.21996 ,  0.095465,  0.10395 , -0.52345 ,  0.060248,
        0.069298, -0.61524 , -0.37134 ,  1.5214  ], dtype=float32)

We can get **average embeddings** for a sentence or a document in `spaCy` as follows: 

In [26]:
s = "All empty promises"
doc = nlp(s)
avg_sent_emb = doc.vector
print(avg_sent_emb.shape)
print("Vector for: {}\n{}".format((s), (avg_sent_emb[0:10])))

(300,)
Vector for: All empty promises
[-0.6622033   0.06859333  0.18747167  0.04704666 -0.239323    0.043686
  0.13470568 -0.00746667 -0.03796467  1.9193001 ]


### Similarity between documents 

- We can also get similarity between documents as follows. 
- Note that this is **based on average embeddings of each sentence**. 

In [27]:
doc1 = nlp("Deep learning is very popular these days.")
doc2 = nlp("Machine learning is dominated by neural networks.")
doc3 = nlp("A home-made fresh bread with butter and cheese.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))
print(doc2, "<->", doc3, doc2.similarity(doc3))

Deep learning is very popular these days. <-> Machine learning is dominated by neural networks. 0.7346255041188937
Machine learning is dominated by neural networks. <-> A home-made fresh bread with butter and cheese. 0.4759406337841433


- Do these scores make sense? 
- Although the sentences share few words, we are still able to identify that doc1 and doc2 are more similar than doc2 and doc3.  
- You can use such average embedding representation in text classification tasks.
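
To isolate this effect, the toy sketch below compares documents that share no words at all. The two-dimensional embeddings are invented for illustration (one coordinate loosely standing for "technology", the other for "food"), and the helpers `average` and `cosine_sim` are hypothetical names.

```python
import math

# Made-up 2-D embeddings: first coordinate ~ "technology", second ~ "food".
toy_embeddings = {
    "deep": [0.9, 0.1], "learning": [0.8, 0.0], "neural": [0.9, 0.0],
    "networks": [0.7, 0.1], "fresh": [0.0, 0.8], "bread": [0.1, 0.9],
}


def average(tokens):
    """Coordinate-wise mean of the word vectors of `tokens`."""
    vecs = [toy_embeddings[t] for t in tokens]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]


def cosine_sim(u, v):
    """Cosine of the angle between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


doc_a = ["deep", "learning"]
doc_b = ["neural", "networks"]
doc_c = ["fresh", "bread"]

# No word is shared between doc_a and doc_b, so their word overlap is empty...
print(set(doc_a) & set(doc_b))
# ...yet their averaged vectors point in nearly the same direction.
print(round(cosine_sim(average(doc_a), average(doc_b)), 2))
print(round(cosine_sim(average(doc_b), average(doc_c)), 2))
```

A bag-of-words representation would treat `doc_a` and `doc_b` as completely unrelated, whereas the averaged embeddings place them close together and far from `doc_c`.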

###  Airline sentiment analysis using average embedding representation 

- Let's try average embedding representation for airline sentiment analysis.
- We used this dataset last week, so you should already have it in the data directory. If not, you can download it [here](https://www.kaggle.com/jaskarancr/airline-sentiment-dataset). 

In [28]:
df = pd.read_csv("data/Airline-Sentiment-2-w-AA.csv", encoding="ISO-8859-1")

In [29]:
from sklearn.model_selection import cross_validate, train_test_split

train_df, test_df = train_test_split(df, test_size=0.2, random_state=123)
X_train, y_train = train_df["text"], train_df["airline_sentiment"]
X_test, y_test = test_df["text"], test_df["airline_sentiment"]

In [30]:
train_df.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,airline_sentiment,airline_sentiment:confidence,negativereason,negativereason:confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_id,tweet_location,user_timezone
5789,681455792,False,finalized,3,2/25/15 4:21,negative,1.0,Can't Tell,0.6667,Southwest,,mrssuperdimmock,,0,@SouthwestAir link doesn't work,,2/19/15 18:53,5.68604e+17,"Lake Arrowhead, CA",Pacific Time (US & Canada)
8918,681459957,False,finalized,3,2/25/15 9:45,neutral,1.0,,,Delta,,labeles,,0,@JetBlue okayyyy. But I had huge irons on way ...,,2/17/15 10:18,5.6775e+17,,
11688,681462990,False,finalized,3,2/25/15 9:53,negative,1.0,Customer Service Issue,0.6727,US Airways,,DropMeAnywhere,,0,@USAirways They're all reservations numbers an...,"[0.0, 0.0]",2/17/15 14:50,5.67819e+17,"Here, There and Everywhere",Arizona
413,681448905,False,finalized,3,2/25/15 10:10,neutral,1.0,,,Virgin America,,jsamaudio,,0,@VirginAmerica no A's channel this year?,,2/18/15 12:25,5.68144e+17,St. Francis (Calif.),Pacific Time (US & Canada)
4135,681454122,False,finalized,3,2/25/15 10:08,negative,1.0,Bad Flight,0.3544,United,,CajunSQL,,0,"@united missed it. Incoming on time, then Sat...",,2/17/15 14:20,5.67811e+17,"Baton Rouge, LA",


### Bag-of-words representation for sentiment analysis

In [31]:
pipe = make_pipeline(
    CountVectorizer(stop_words="english"), LogisticRegression(max_iter=1000)
)
pipe.named_steps["countvectorizer"].fit(X_train)
X_train_transformed = pipe.named_steps["countvectorizer"].transform(X_train)
print("Data matrix shape:", X_train_transformed.shape)
pipe.fit(X_train, y_train);

Data matrix shape: (11712, 13064)


In [32]:
print("Train accuracy {:.2f}".format(pipe.score(X_train, y_train)))
print("Test accuracy {:.2f}".format(pipe.score(X_test, y_test)))

Train accuracy 0.94
Test accuracy 0.80


### Sentiment analysis with average embedding representation

- Let's see how we can get word vectors using `spaCy`. 
- Let's create average embedding representation for each example. 

In [33]:
X_train_embeddings = pd.DataFrame([text.vector for text in nlp.pipe(X_train)])
X_test_embeddings = pd.DataFrame([text.vector for text in nlp.pipe(X_test)])

We have **reduced dimensionality** from 13,064 to 300! 

In [34]:
X_train_embeddings.shape

(11712, 300)

In [35]:
X_train_embeddings.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,-0.561952,0.225163,-0.293684,-0.082498,-0.192374,0.050405,0.060202,-0.25421,0.089486,1.7739,...,-0.210224,-0.03394,-0.130713,-0.145631,0.138794,0.115038,-0.0813,-0.155296,0.053767,0.150895
1,-0.700668,0.124276,-0.259541,-0.260308,-0.129364,0.06821,0.066224,-0.307202,-0.047832,1.612876,...,-0.172155,0.069523,-0.077437,-0.027307,0.09739,0.174096,0.023029,0.083394,-0.03437,0.089504
2,-0.720738,0.284142,-0.285948,-0.150022,-0.13463,-0.035959,0.124829,-0.282646,0.04468,2.146762,...,-0.308978,0.067948,-0.141636,-0.08379,0.118324,0.212624,-0.079882,-0.146916,0.038091,0.046573
3,-0.602323,0.310292,-0.197665,-0.003409,-0.053958,0.033286,0.190612,-0.242265,0.060668,1.795106,...,-0.16588,0.106567,-0.159539,0.017612,0.111956,0.153079,0.005538,-0.133347,0.075873,-0.150727
4,-0.657547,0.22827,-0.184336,-0.16273,-0.076056,0.042302,0.01439,-0.260486,0.062895,1.678635,...,-0.202215,0.068234,0.006729,-0.011597,0.08786,0.185101,0.031942,-0.112022,0.029528,0.123609


### Sentiment classification using average embeddings 

- What are the train and test accuracies with average word embedding representation? 
- The test accuracy is a bit lower than with bag-of-words (0.75 vs. 0.80), but there is much **less overfitting** (train accuracy 0.74 vs. 0.94). 
- Note that we are using **transfer learning** here.
- The embeddings are trained on a completely different corpus. 

In [36]:
lgr = LogisticRegression(max_iter=1000)
lgr.fit(X_train_embeddings, y_train)
print("Train accuracy {:.2f}".format(lgr.score(X_train_embeddings, y_train)))
print("Test accuracy {:.2f}".format(lgr.score(X_test_embeddings, y_test)))

Train accuracy 0.74
Test accuracy 0.75


### Sentiment classification using advanced sentence representations 

- Since representing documents is so essential for text classification tasks, there are more advanced methods for document representation.

In [37]:
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("paraphrase-distilroberta-base-v1")

In [38]:
emb_sents = embedder.encode("all empty promises")
emb_sents.shape

(768,)

In [39]:
emb_train = embedder.encode(train_df["text"].tolist())
emb_train_df = pd.DataFrame(emb_train, index=train_df.index)
emb_train_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,758,759,760,761,762,763,764,765,766,767
5789,-0.120494,0.250262,-0.022795,-0.116368,0.078650,0.037357,-0.251341,0.321429,-0.143984,-0.123487,...,0.199151,-0.150143,0.167078,-0.407671,-0.066161,0.049514,0.019384,-0.357601,0.125996,0.381073
8918,-0.182954,0.118282,0.066341,-0.136099,0.094947,-0.121303,0.069233,-0.097500,0.025739,-0.367980,...,0.113612,0.114661,0.049926,0.256736,-0.118687,-0.190720,0.011986,-0.141883,-0.230142,0.024899
11688,-0.032988,0.630251,-0.079516,0.148981,0.194709,-0.226264,-0.043630,0.217398,-0.010716,0.069643,...,0.676791,0.244483,0.051042,0.064099,-0.146945,0.090878,-0.090059,0.077212,-0.209226,0.308773
413,-0.119258,0.172168,0.098698,0.319859,0.415475,0.248360,-0.025923,0.385350,0.066414,-0.334289,...,-0.128482,-0.232447,-0.077805,0.181328,0.123244,-0.143693,0.660456,-0.048714,0.204774,0.163496
4135,0.094240,0.360193,0.213747,0.363690,0.275521,0.134936,-0.276319,0.009336,-0.021523,-0.258992,...,0.474885,0.242125,0.294532,0.279013,0.037831,0.089761,-0.548748,-0.049258,0.154525,0.141268
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5218,-0.204409,-0.145290,-0.064201,0.213571,-0.140225,0.338555,-0.148578,0.224516,-0.042963,0.075930,...,-0.161949,0.040582,0.003971,-0.152549,-0.582908,-0.126527,0.060502,-0.111495,-0.097492,0.199322
12252,0.108408,0.438293,0.216812,-0.349289,0.422689,0.377760,0.045198,-0.034096,0.427570,-0.328272,...,0.257849,-0.032362,-0.275003,0.080452,-0.078975,-0.049972,-0.009762,-0.314754,-0.020773,0.268777
1346,0.068411,0.017591,0.236154,0.221446,-0.103567,0.055510,0.062909,0.067425,-0.003504,-0.157758,...,0.007711,0.323297,0.334638,0.367042,-0.068821,0.063667,-0.329990,0.232331,-0.184768,-0.000683
11646,-0.091488,-0.155709,0.032391,0.018313,0.524998,0.563933,-0.080984,0.097983,-0.535285,-0.377195,...,0.428014,-0.144573,0.045296,-0.107935,-0.135673,-0.290019,-0.137200,-0.503395,-0.042567,-0.282592


In [40]:
emb_test = embedder.encode(test_df["text"].tolist())
emb_test_df = pd.DataFrame(emb_test, index=test_df.index)
emb_test_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,758,759,760,761,762,763,764,765,766,767
1671,-0.002864,0.217326,0.124350,-0.082548,0.709688,-0.582442,0.257897,0.169356,0.248880,-0.266686,...,0.501767,0.095387,0.340172,0.087452,-0.368359,0.276195,0.238676,-0.219546,0.066603,0.256149
10951,-0.141048,0.137934,0.131319,0.194773,0.868205,0.078791,-0.131656,0.036243,-0.215749,-0.291946,...,-0.056256,-0.056040,0.147341,0.189665,-0.357366,0.061799,-0.161923,-0.278955,-0.173722,0.065323
5382,-0.252943,0.527507,-0.065609,0.013467,0.207989,0.003881,-0.066282,0.253166,0.021039,0.290956,...,0.180685,-0.042605,-0.173793,-0.079129,-0.169160,0.001317,-0.142593,-0.070816,-0.208826,0.400736
3954,0.054318,0.096738,0.113037,0.032039,0.493064,-0.641102,0.078760,0.402187,0.189743,-0.089538,...,0.123879,-0.285018,-0.297771,0.557171,0.076168,-0.029826,-0.076094,0.225454,0.002134,0.235429
11193,-0.065858,0.223270,0.507333,0.266193,0.104696,-0.219555,0.146247,0.315649,-0.126193,-0.435462,...,0.163994,0.207813,-0.001871,0.109391,-0.166779,-0.249200,-0.525419,-0.413065,0.119939,0.064297
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5861,0.077512,0.322276,0.026697,-0.111392,0.174208,0.235202,0.053888,0.244941,0.181625,-0.226870,...,0.149843,0.311337,0.045975,-0.572319,-0.068257,0.217745,-0.056509,-0.355174,-0.028610,0.090676
3627,-0.173311,-0.023604,0.190388,-0.136543,-0.360269,-0.444687,0.056311,0.291941,-0.399719,-0.167930,...,0.042209,-0.161905,-0.040535,-0.050516,-0.252020,-0.133980,0.155001,-0.154482,-0.060201,-0.126555
12559,-0.124635,-0.101799,0.129061,0.636908,0.681090,0.399300,-0.078321,0.221823,-0.277218,-0.178589,...,0.022364,-0.109274,-0.073540,-0.153336,-0.123705,-0.238896,0.296447,-0.116798,0.115076,-0.345925
8123,0.063508,0.332506,0.119605,-0.001362,-0.161801,-0.082302,-0.025883,0.048027,0.126975,-0.159802,...,0.002221,-0.093885,0.430285,-0.088562,0.321488,0.447437,0.292395,-0.188566,-0.272767,0.126173


In [41]:
lgr = LogisticRegression(max_iter=1000)
lgr.fit(emb_train, y_train)
print("Train accuracy {:.2f}".format(lgr.score(emb_train, y_train)))
print("Test accuracy {:.2f}".format(lgr.score(emb_test, y_test)))

Train accuracy 0.87
Test accuracy 0.83


- Some improvement over bag of words and average embedding representations! 
- But much slower ...

## Break (5 min)

![](img/eva-coffee.png)

<br><br><br><br>

## Topic modeling 

### Topic modeling motivation

- Suppose you have a large collection of documents on a variety of topics. 

### Example: A corpus of news articles 

![](img/TM_NYT_articles.png)

<!-- <center> -->
<!-- <img src="img/TM_NYT_articles.png" height="2000" width="2000">  -->
<!-- </center> -->

### Example: A corpus of food magazines 

![](img/TM_food_magazines.png)

<!-- <center> -->
<!-- <img src="img/TM_food_magazines.png" height="2000" width="2000">  -->
<!-- </center> -->

### A corpus of scientific articles

![](img/TM_science_articles.png)

<!-- <center> -->
<!-- <img src="img/TM_science_articles.png" height="2000" width="2000">  -->
<!-- </center> -->

(Credit: [Dave Blei's presentation](http://www.cs.columbia.edu/~blei/talks/Blei_Science_2008.pdf))

### Topic modeling motivation

- Humans are pretty good at reading and understanding a document and answering questions such as 
    - What is it about?  
    - Which documents is it related to?     
- But for a large collection of documents it would take years to read all documents and organize and categorize them so that they are easy to search.
- You need an automated way
    - to get an idea of what's going on in the data or 
    - to pull documents related to a certain topic

### Topic modeling

- Topic modeling gives you the ability to summarize the major themes in a large collection of documents (corpus). 
    - Example: The major themes in a collection of news articles could be 
        - **politics**
        - **entertainment**
        - **sports**
        - **technology**
        - ...
- Unsupervised ML methods are a common tool for solving such problems.
- Given the hyperparameter $K$, the idea of topic modeling is to describe the data using $K$ "topics".

### Topic modeling: Input and output

- Input
    - A large collection of documents
    - A value for the hyperparameter $K$ (e.g., $K = 3$)
- Output
    1. Topic-words association 
        - For each topic, what words describe that topic? 
    2. Document-topics association
        - For each document, what topics are expressed by the document? 
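
To make the two outputs concrete, here is a small sketch of what they might look like for $K = 2$. All numbers and names (`vocab`, `topic_word`, `doc_topic`) are made up for illustration and do not come from a trained model.

```python
K = 2
vocab = ["fashion", "model", "topic", "probabilistic"]

# Output 1 (topic-words association): one row per topic, one column per word.
# Each row is a probability distribution over the vocabulary.
topic_word = [
    [0.45, 0.45, 0.05, 0.05],
    [0.05, 0.40, 0.30, 0.25],
]

# Output 2 (document-topics association): one row per document, one column per topic.
# Each row is a probability distribution over the K topics.
doc_topic = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.3, 0.7],
]

# The most probable topic for each document
dominant_topic = [max(range(K), key=lambda k: row[k]) for row in doc_topic]
print(dominant_topic)
```

Here the first output has shape (number of topics, vocabulary size) and the second has shape (number of documents, number of topics).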
    

### Topic modeling: Example

- Topic-words association 
    - For each topic, what words describe that topic?  
    - A topic is a mixture of words. 

![](img/topic_modeling_word_topics.png)

<!-- <center> -->
<!-- <img src="img/topic_modeling_word_topics.png" height="1000" width="1000">  -->
<!-- </center> -->    

### Topic modeling: Example

- Document-topics association 
    - For each document, what topics are expressed by the document?
    - A document is a mixture of topics. 
    
![](img/topic_modeling_doc_topics.png)
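
Putting the two ideas together: if a topic is a mixture of words and a document is a mixture of topics, then the chance of seeing a word in a document is the topic-weighted average of that word's probability in each topic. This is the standard way to read an LDA-style model. The sketch below uses made-up numbers and hypothetical topic names.

```python
# Made-up topic-word distributions (each sums to 1)
topic_word = {
    "fashion": {"model": 0.5, "fashion": 0.4, "topic": 0.1},
    "ml": {"model": 0.5, "fashion": 0.0, "topic": 0.5},
}

# Made-up topic mixture for a single document (sums to 1)
doc_topic = {"fashion": 0.8, "ml": 0.2}

# p(word | document) = sum over topics of p(topic | document) * p(word | topic)
doc_word = {}
for topic, weight in doc_topic.items():
    for word, prob in topic_word[topic].items():
        doc_word[word] = doc_word.get(word, 0.0) + weight * prob

print(doc_word)
```

Note that the ambiguous word "model" is likely in both topics, so it stays likely in the document regardless of the topic mixture.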

<!-- <center> -->    
<!-- <img src="img/topic_modeling_doc_topics.png" height="800" width="800">  -->
<!-- </center> -->    

### Topic modeling: Input and output

- Input
    - A large collection of documents
    - A value for the hyperparameter $K$ (e.g., $K = 3$)
- Output
    - For each topic, what words describe that topic?  
    - For each document, what topics are expressed by the document?

![](img/topic_modeling_output.png)

<!-- <center> -->
<!-- <img src="img/topic_modeling_output.png" height="800" width="800">  -->
<!-- </center> -->    

### Topic modeling: Some applications

- Topic modeling is a great EDA tool to get a sense of what's going on in a large corpus. 
- Some examples
    - You want to pull documents related to a particular lawsuit. 
    - You want to examine people's sentiment towards a particular candidate and/or political party, so you want to pull tweets or Facebook posts related to the election.   

### Topic modeling toy example

In [42]:
toy_df = pd.read_csv("data/toy_lda_data.csv")
toy_df

Unnamed: 0,doc_id,text
0,1,famous fashion model
1,2,fashion model pattern
2,3,fashion model probabilistic topic model confer...
3,4,famous fashion model
4,5,fresh fashion model
5,6,famous fashion model
6,7,famous fashion model
7,8,famous fashion model
8,9,famous fashion model
9,10,creative fashion model


In [43]:
from gensim import corpora, matutils, models

corpus = [doc.split() for doc in toy_df["text"].tolist()]
# Create a vocabulary for the lda model
dictionary = corpora.Dictionary(corpus)
# Convert our corpus into document-term matrix for Lda
doc_term_matrix = [dictionary.doc2bow(doc) for doc in corpus]

In [44]:
from gensim.models import LdaModel

# Train an lda model
lda = models.LdaModel(
    corpus=doc_term_matrix,
    id2word=dictionary,
    num_topics=3,
    random_state=123,
    passes=10,
)

In [45]:
### Examine the topics in our LDA model
lda.print_topics(num_words=4)

[(0,
  '0.303*"model" + 0.296*"probabilistic" + 0.261*"topic" + 0.040*"pattern"'),
 (1, '0.245*"kiwi" + 0.219*"apple" + 0.218*"nutrition" + 0.140*"health"'),
 (2, '0.308*"fashion" + 0.307*"model" + 0.180*"famous" + 0.071*"conference"')]

In [46]:
### Examine the topic distribution for a document
print("Document: ", corpus[0])
df = pd.DataFrame(lda[doc_term_matrix[0]], columns=["topic id", "probability"])
df.sort_values("probability", ascending=False)

Document:  ['famous', 'fashion', 'model']


Unnamed: 0,topic id,probability
2,2,0.82876
0,0,0.087849
1,1,0.083391


You can also visualize the topics using `pyLDAvis`.

`conda install -n cpsc330 -c anaconda pyLDAvis`

In [47]:
# Visualize the topics
import warnings  # needed for the simplefilter calls below

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()

warnings.simplefilter(action="ignore", category=FutureWarning)

vis = gensimvis.prepare(lda, doc_term_matrix, dictionary, sort_topics=False)

warnings.simplefilter(action="default", category=FutureWarning)

vis



## Topic modeling pipeline 

- Preprocess your corpus. 
- Train LDA using `Gensim`.
- Interpret your topics.     

### Data

In [48]:
import wikipedia

queries = [
    "Artificial Intelligence",
    "unsupervised learning",
    "Supreme Court of Canada",
    "Peace, Order, and Good Government",
    "Canadian constitutional law",
    "ice hockey",
]
wiki_dict = {"wiki query": [], "text": []}
for query in queries:
    wiki_dict["text"].append(wikipedia.page(query).content)
    wiki_dict["wiki query"].append(query)

wiki_df = pd.DataFrame(wiki_dict)
wiki_df

Unnamed: 0,wiki query,text
0,Artificial Intelligence,Artificial intelligence (AI) is intelligence d...
1,unsupervised learning,Supervised learning (SL) is the machine learni...
2,Supreme Court of Canada,The Supreme Court of Canada (SCC; French: Cour...
3,"Peace, Order, and Good Government","In many Commonwealth jurisdictions, the phrase..."
4,Canadian constitutional law,Canadian constitutional law (French: droit con...
5,ice hockey,Ice hockey (or simply hockey) is a winter team...


### Preprocessing the corpus 

- **Preprocessing is crucial!**
- Tokenization, converting text to lower case
- Removing punctuation and stopwords
- Discarding words with length < threshold or word frequency < threshold        
- Possibly lemmatization: Consider the lemmas instead of inflected forms. 
- Depending upon your application, restrict to specific parts of speech.
    * For example, only consider nouns, verbs, and adjectives
    
We'll use [`spaCy`](https://spacy.io/) for preprocessing. 

In [49]:
import spacy

warnings.simplefilter(action="ignore", category=DeprecationWarning)

nlp = spacy.load("en_core_web_md", disable=["parser", "ner"])

warnings.simplefilter(action="default", category=DeprecationWarning)


In [50]:
def preprocess(
    doc,
    min_token_len=2,
    irrelevant_pos=["ADV", "PRON", "CCONJ", "PUNCT", "PART", "DET", "ADP", "SPACE"],
):
    """
    Given text, min_token_len, and irrelevant_pos carry out preprocessing of the text
    and return a preprocessed string.

    Parameters
    -------------
    doc : (spaCy doc object)
        the spacy doc object of the text
    min_token_len : (int)
        min_token_length required
    irrelevant_pos : (list)
        a list of irrelevant pos tags

    Returns
    -------------
    (str) the preprocessed text
    """

    clean_text = []

    for token in doc:
        if (
            not token.is_stop  # Skip stopwords
            and len(token) > min_token_len  # Keep tokens longer than min_token_len characters
            and token.pos_ not in irrelevant_pos  # Skip tokens with an irrelevant POS tag
        ):
            lemma = token.lemma_  # Take the lemma of the word
            clean_text.append(lemma.lower())
    return " ".join(clean_text)

In [51]:
wiki_df["text_pp"] = [preprocess(text) for text in nlp.pipe(wiki_df["text"])]

In [52]:
wiki_df

Unnamed: 0,wiki query,text,text_pp
0,Artificial Intelligence,Artificial intelligence (AI) is intelligence d...,artificial intelligence intelligence demonstra...
1,unsupervised learning,Supervised learning (SL) is the machine learni...,supervised learning machine learn task learn f...
2,Supreme Court of Canada,The Supreme Court of Canada (SCC; French: Cour...,supreme court canada scc french cour suprême c...
3,"Peace, Order, and Good Government","In many Commonwealth jurisdictions, the phrase...",commonwealth jurisdiction phrase peace order g...
4,Canadian constitutional law,Canadian constitutional law (French: droit con...,canadian constitutional law french droit const...
5,ice hockey,Ice hockey (or simply hockey) is a winter team...,ice hockey hockey winter team sport play ice s...


### Training LDA with [gensim](https://radimrehurek.com/gensim/models/ldamodel.html)

To train an LDA model with [gensim](https://radimrehurek.com/gensim/models/ldamodel.html), you need

- Document-term matrix 
- Dictionary (vocabulary)
- The number of topics ($K$): `num_topics`
- The number of passes: `passes`

### `Gensim`'s `doc2bow`
- Let's first create a dictionary using [`corpora.Dictionary`](https://radimrehurek.com/gensim/corpora/dictionary.html). 

In [53]:
corpus = [doc.split() for doc in wiki_df["text_pp"].tolist()]
dictionary = corpora.Dictionary(corpus)  # Create a vocabulary for the lda model
pd.DataFrame(
    dictionary.token2id.keys(), index=dictionary.token2id.values(), columns=["Word"]
)

Unnamed: 0,Word
0,0026
1,0030
2,0036
3,0040
4,007
...,...
4976,zhenskaya
4977,мячом
4978,русский
4979,хоккей


### `Gensim`'s `doc2bow`
- Now let's convert our corpus into document-term matrix for LDA using [`dictionary.doc2bow`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow).
- For each document, it stores the **frequency of each token** in the document in the format (**token_id**, **frequency**). 

In [54]:
doc_term_matrix = [dictionary.doc2bow(doc) for doc in corpus]
doc_term_matrix[1][:20]

[(292, 4),
 (305, 3),
 (312, 1),
 (318, 3),
 (319, 1),
 (327, 1),
 (329, 3),
 (337, 1),
 (376, 52),
 (386, 1),
 (400, 4),
 (402, 1),
 (403, 1),
 (410, 1),
 (426, 2),
 (428, 4),
 (430, 3),
 (455, 1),
 (466, 1),
 (478, 2)]
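
To make the `(token_id, frequency)` format concrete, here is a plain-Python sketch that mimics what a bag-of-words conversion produces on a tiny made-up corpus. It is not Gensim's implementation: the helper `to_bow` is hypothetical, and the ids are assigned in order of first appearance, which may differ from the ids Gensim assigns.

```python
from collections import Counter

toy_corpus = [["fashion", "model", "model"], ["topic", "model"]]

# Build a vocabulary: map each distinct token to an integer id
token2id = {}
for doc in toy_corpus:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc, token2id):
    """Return a sorted list of (token_id, frequency) pairs for one document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


bows = [to_bow(doc, token2id) for doc in toy_corpus]
print(token2id)
print(bows)
```

Only the tokens that actually occur in a document are stored, which keeps the representation compact for large vocabularies.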

Now we are ready to train an LDA model. 

In [55]:
from gensim.models import LdaModel

num_topics = 3

lda = models.LdaModel(
    corpus=doc_term_matrix,
    id2word=dictionary,
    num_topics=num_topics,
    random_state=42,
    passes=10,
)

### Examine the topics and topic distribution for a document in our LDA model

In [56]:
lda.print_topics(num_words=4)  # Topics

[(0, '0.034*"hockey" + 0.021*"ice" + 0.017*"team" + 0.016*"player"'),
 (1, '0.027*"court" + 0.015*"law" + 0.011*"provincial" + 0.010*"canada"'),
 (2,
  '0.013*"intelligence" + 0.011*"artificial" + 0.008*"learning" + 0.007*"algorithm"')]

In [57]:
print("Document: ", wiki_df.iloc[0, 0])
print("Topic assignment for document: ", lda[doc_term_matrix[0]])  # Topic distribution

Document:  Artificial Intelligence
Topic assignment for document:  [(2, 0.9999198)]


### Visualize topics

In [58]:
warnings.simplefilter(action="ignore", category=FutureWarning)

vis = gensimvis.prepare(lda, doc_term_matrix, dictionary, sort_topics=False)

warnings.simplefilter(action="default", category=FutureWarning)

vis

### (Optional) Topic modeling with `sklearn`

- We are using `Gensim` LDA so that we'll be able to use `CoherenceModel` to evaluate the topic model later. 
- But we can also train an LDA model with `sklearn`.

In [59]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
X = vec.fit_transform(wiki_df["text_pp"])

In [60]:
from sklearn.decomposition import LatentDirichletAllocation

n_topics = 3
lda = LatentDirichletAllocation(
    n_components=n_topics, learning_method="batch", max_iter=10, random_state=0
)
document_topics = lda.fit_transform(X)

In [61]:
print("lda.components_.shape: {}".format(lda.components_.shape))

lda.components_.shape: (3, 4957)


In [62]:
sorting = np.argsort(lda.components_, axis=1)[:, ::-1]
feature_names = np.array(vec.get_feature_names_out())
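
The `argsort` line above orders, for each topic (row), the vocabulary indices from the highest to the lowest weight. Here is a plain-Python sketch of the same idea on a made-up 2 x 4 matrix; `toy_components`, `toy_features`, and `top_words` are hypothetical names used only for illustration.

```python
# Made-up topic-word weights: 2 topics (rows) x 4 words (columns)
toy_components = [
    [0.1, 2.0, 0.5, 0.05],
    [1.5, 0.2, 0.1, 3.0],
]
toy_features = ["court", "hockey", "ice", "law"]


def top_words(row, names, n_words=2):
    """Return the n_words names with the largest weights in row."""
    order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
    return [names[j] for j in order[:n_words]]


top = [top_words(row, toy_features) for row in toy_components]
print(top)
```

This is one way to list the top words per topic without relying on a helper library.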

In [63]:
import mglearn

warnings.simplefilter(action="ignore", category=DeprecationWarning)

mglearn.tools.print_topics(
    topics=range(3),
    feature_names=feature_names,
    sorting=sorting,
    topics_per_chunk=5,
    n_words=10,
)

warnings.simplefilter(action="default", category=DeprecationWarning)

topic 0       topic 1       topic 2       
--------      --------      --------      
court         hockey        intelligence  
law           ice           artificial    
provincial    team          learning      
federal       player        algorithm     
government    league        original      
power         play          retrieve      
canada        game          machine       
justice       penalty       archive       
supreme       puck          human         
case          nhl           displaystyle  






<br><br><br><br>

## Basic text preprocessing

### Introduction 
- Why do we need preprocessing?
    - Text data is unstructured and messy. 
    - We need to "normalize" it before we do anything interesting with it. 
- Example:     
    - **Lemma**: Same stem, same part-of-speech, roughly the same meaning
        - Vancouver's &rarr; Vancouver
        - computers &rarr; computer 
        - rising, rose, rises &rarr; rise

### Tokenization

- Sentence segmentation
    - Split text into sentences
- Word tokenization 
    - Split sentences into words

### Tokenization: sentence segmentation

<blockquote>
MDS is a Master's program at UBC in British Columbia. MDS teaching team is truly multicultural!! Dr. George did his Ph.D. in Scotland. Dr. Timbers, Dr. Ostblom, Dr. Rodríguez-Arelis, and Dr. Kolhatkar did theirs in Canada. Dr. Gelbart did his PhD in the U.S.
</blockquote>

- How many sentences are there in this text? 

In [64]:
### Let's do sentence segmentation on "."
text = (
    "MDS is a Master's program at UBC in British Columbia. "
    "MDS teaching team is truly multicultural!! "
    "Dr. George did his Ph.D. in Scotland. "
    "Dr. Timbers, Dr. Ostblom, Dr. Rodríguez-Arelis, and Dr. Kolhatkar did theirs in Canada. "
    "Dr. Gelbart did his PhD in the U.S."
)

print(text.split("."))

["MDS is a Master's program at UBC in British Columbia", ' MDS teaching team is truly multicultural!! Dr', ' George did his Ph', 'D', ' in Scotland', ' Dr', ' Timbers, Dr', ' Ostblom, Dr', ' Rodríguez-Arelis, and Dr', ' Kolhatkar did theirs in Canada', ' Dr', ' Gelbart did his PhD in the U', 'S', '']


### Sentence segmentation

- In English, period (.) is quite ambiguous. (In Chinese, it is unambiguous.)
    - Abbreviations like Dr., U.S., Inc.  
    - Numbers like 60.44%, 0.98
- ! and ? are relatively unambiguous.
- How about writing regular expressions? 
- A common way is using off-the-shelf models for sentence segmentation. 
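
To see why hand-written regular expressions are fragile here, consider the naive rule "split after `.`, `!`, or `?` when followed by whitespace and a capital letter". The sketch below applies it to a made-up sample; the pattern is only an illustration of one possible rule, not a recommended segmenter.

```python
import re

sample = "Dr. George did his Ph.D. in Scotland. He moved to the U.S. in 2010! Did he like it?"

# Split where sentence-final punctuation is followed by whitespace and a capital letter
naive_pattern = r"(?<=[.!?])\s+(?=[A-Z])"
pieces = re.split(naive_pattern, sample)

for piece in pieces:
    print(repr(piece))
```

The sample has three sentences, but the rule produces four pieces because it treats the abbreviation `Dr.` as a sentence end. Adding exceptions for every abbreviation quickly becomes unmanageable.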

In [65]:
### Let's try to do sentence segmentation using nltk
from nltk.tokenize import sent_tokenize

sent_tokenized = sent_tokenize(text)
print(sent_tokenized)

["MDS is a Master's program at UBC in British Columbia.", 'MDS teaching team is truly multicultural!!', 'Dr. George did his Ph.D. in Scotland.', 'Dr. Timbers, Dr. Ostblom, Dr. Rodríguez-Arelis, and Dr. Kolhatkar did theirs in Canada.', 'Dr. Gelbart did his PhD in the U.S.']


### Word tokenization

<blockquote>
MDS is a Master's program at UBC in British Columbia. 
</blockquote>

- How many words are there in this sentence?  
- Is whitespace a sufficient condition for a word boundary?

### Word tokenization 

<blockquote>
MDS is a Master's program at UBC in British Columbia. 
</blockquote>

- What's our definition of a word?
    - Should British Columbia be one word or two words? 
    - Should punctuation be considered a separate word?
    - What about the punctuation in `U.S.`?
    - What do we do with words like `Master's`?
- This process of identifying word boundaries is referred to as **tokenization**.
- You can use regular expressions, but it is usually better to use off-the-shelf tokenizers.  
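
As a rough illustration of the regex route, the sketch below keeps runs of word characters (optionally followed by an apostrophe suffix) as one token and treats every other non-space character as its own token. The pattern and the helper `regex_tokenize` are assumptions made for this example only.

```python
import re

# A word, optionally with an apostrophe suffix (e.g. Master's), or a single punctuation mark
TOKEN_PATTERN = r"\w+(?:'\w+)?|[^\w\s]"


def regex_tokenize(sentence):
    """Split a sentence into tokens with a simple regular expression."""
    return re.findall(TOKEN_PATTERN, sentence)


tokens_a = regex_tokenize("MDS is a Master's program at UBC in British Columbia.")
tokens_b = regex_tokenize("the U.S.")
print(tokens_a)
print(tokens_b)
```

The first sentence is handled reasonably, but `U.S.` is broken into four tokens, which shows the kind of special case that dedicated tokenizers already handle.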

In [66]:
### Let's do word segmentation on white spaces
print("Splitting on whitespace: ", [sent.split() for sent in sent_tokenized])

### Let's try to do word segmentation using nltk
from nltk.tokenize import word_tokenize

word_tokenized = [word_tokenize(sent) for sent in sent_tokenized]
# This is similar to the input format of word2vec algorithm
print("\n\n\nTokenized: ", word_tokenized)

Splitting on whitespace:  [['MDS', 'is', 'a', "Master's", 'program', 'at', 'UBC', 'in', 'British', 'Columbia.'], ['MDS', 'teaching', 'team', 'is', 'truly', 'multicultural!!'], ['Dr.', 'George', 'did', 'his', 'Ph.D.', 'in', 'Scotland.'], ['Dr.', 'Timbers,', 'Dr.', 'Ostblom,', 'Dr.', 'Rodríguez-Arelis,', 'and', 'Dr.', 'Kolhatkar', 'did', 'theirs', 'in', 'Canada.'], ['Dr.', 'Gelbart', 'did', 'his', 'PhD', 'in', 'the', 'U.S.']]



Tokenized:  [['MDS', 'is', 'a', 'Master', "'s", 'program', 'at', 'UBC', 'in', 'British', 'Columbia', '.'], ['MDS', 'teaching', 'team', 'is', 'truly', 'multicultural', '!', '!'], ['Dr.', 'George', 'did', 'his', 'Ph.D.', 'in', 'Scotland', '.'], ['Dr.', 'Timbers', ',', 'Dr.', 'Ostblom', ',', 'Dr.', 'Rodríguez-Arelis', ',', 'and', 'Dr.', 'Kolhatkar', 'did', 'theirs', 'in', 'Canada', '.'], ['Dr.', 'Gelbart', 'did', 'his', 'PhD', 'in', 'the', 'U.S', '.']]


### Word segmentation 

For some languages you need much more sophisticated tokenizers. 
- For languages such as Chinese, there are no spaces between words.
    - [jieba](https://github.com/fxsjy/jieba) is a popular tokenizer for Chinese. 
- German doesn't separate compound words.
    * Example: _rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz_
    * (the law for the delegation of monitoring beef labeling)

### Types and tokens
- Usually in NLP, we talk about 
    - **Type**: an element in the vocabulary
    - **Token**: an instance of that type in running text 
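
A quick way to see the difference is to count both on a made-up sentence. This sketch assumes whitespace tokenization and lowercased text, which is a simplification; different tokenization choices give different counts.

```python
from collections import Counter

sentence = "the cat sat on the mat near the door"
tokens = sentence.split()  # every occurrence counts as a token
type_counts = Counter(tokens)  # each distinct word is a type

print("Number of tokens:", len(tokens))
print("Number of types:", len(type_counts))
print("Occurrences of 'the':", type_counts["the"])
```

The word "the" is a single type that appears as three tokens.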


### Exercise for you 

<blockquote>    
UBC is located in the beautiful province of British Columbia. It's very close 
to the U.S. border. You'll get to the USA border in about 45 mins by car.     
</blockquote>  

- Consider the example above. 
    - How many types? (task dependent)
    - How many tokens? 

### Other commonly used preprocessing steps

- Punctuation and stopword removal
- Stemming and lemmatization

### Punctuation and stopword removal

- The most frequently occurring words in English are not very informative in many NLP tasks.
    - Example: _the_ , _is_ , _a_ , and punctuation
- Such words are called **stopwords** and are often removed during preprocessing. 

In [67]:
# Let's use `nltk.stopwords`.
# Add punctuations to the list.
stop_words = list(set(stopwords.words("english")))
import string

punctuation = string.punctuation
stop_words += list(punctuation)
# stop_words.extend(['``','`','br','"',"”", "''", "'s"])
print(stop_words)

["haven't", 'mustn', 'his', 'had', 'should', 'd', 'have', 'on', 'has', 'me', 'that', 'theirs', 'ourselves', "you'll", 'our', "that'll", 'ma', 'shouldn', 'didn', 'm', 'weren', "should've", 'shan', 'then', 'such', 'so', 'i', 'just', "you'd", 'been', 'does', 'because', 'if', 'those', 'were', 'him', 'y', 'into', 'than', "she's", 'its', 'them', 'through', 'can', 'hadn', "shouldn't", 'yourself', 'couldn', 'of', "don't", 'for', 'about', "hadn't", 'both', 'here', 'hers', 'in', 'above', 'only', 'some', 'any', 'now', 'after', "aren't", "mightn't", 'until', "it's", 've', 'hasn', 'ain', 'same', 'her', "wouldn't", 'up', 'isn', 'they', 'during', 'before', 'there', 'what', "you're", 'or', 'few', 'to', 'will', "won't", "weren't", 'do', 'down', 'am', 'off', 'once', 'your', 'himself', 'my', 'whom', "isn't", 'is', 'out', 'no', 'who', 'each', 'this', 't', 'from', 'again', 'wasn', 'herself', 'by', 'll', 'with', 'most', 'myself', 'be', 'at', 'further', 'mightn', 'o', 'it', 'between', 'themselves', 'how', 'i

In [68]:
### Get rid of stop words
preprocessed = []
for sent in word_tokenized:
    for token in sent:
        token = token.lower()
        if token not in stop_words:
            preprocessed.append(token)
print(preprocessed)

['mds', 'master', "'s", 'program', 'ubc', 'british', 'columbia', 'mds', 'teaching', 'team', 'truly', 'multicultural', 'dr.', 'george', 'ph.d.', 'scotland', 'dr.', 'timbers', 'dr.', 'ostblom', 'dr.', 'rodríguez-arelis', 'dr.', 'kolhatkar', 'canada', 'dr.', 'gelbart', 'phd', 'u.s']


### Lemmatization 

- For many NLP tasks (e.g., web search) we want to ignore morphological differences between words
    - Example: If your search term is "studying for ML quiz" you might want to include pages containing "tips to study for an ML quiz" or "here is how I studied for my ML quiz"
- Lemmatization converts inflected forms into the base form. 

In [69]:
import nltk

# nltk.download("wordnet")  # You need to run this at least once
# nltk.download('omw-1.4')  # You need to run this at least once

In [70]:
# nltk has a lemmatizer
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print("Lemma of studying: ", lemmatizer.lemmatize("studying", "v"))
print("Lemma of studied: ", lemmatizer.lemmatize("studied", "v"))

Lemma of studying:  study
Lemma of studied:  study


### Stemming

- Stemming has a similar purpose to lemmatization, but it is a crude chopping of affixes. 
    * _automates, automatic, automation_ all reduced to _automat_.
- Usually these reduced forms (stems) are not actual words themselves.  
- A popular stemming algorithm for English is PorterStemmer. 
- Beware that it can be aggressive sometimes.
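
To illustrate what "crude chopping of affixes" means, here is a deliberately naive suffix stripper. It is not the Porter algorithm; the suffix list and the helper `crude_stem` are made up for this example.

```python
# A tiny, made-up suffix list, checked in this order
SUFFIXES = ("ion", "ic", "es", "s")


def crude_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


stems = [crude_stem(w) for w in ["automates", "automatic", "automation"]]
print(stems)
print(crude_stem("news"))
```

All three related words collapse to the non-word `automat`, while `news` is wrongly reduced to `new`, which is the kind of aggressive behaviour to watch out for.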

In [71]:
from nltk.stem.porter import PorterStemmer

text = (
    "UBC is located in the beautiful province of British Columbia... "
    "It's very close to the U.S. border."
)
ps = PorterStemmer()
tokenized = word_tokenize(text)
stemmed = [ps.stem(token) for token in tokenized]
print("Before stemming: ", text)
print("\n\nAfter stemming: ", " ".join(stemmed))

Before stemming:  UBC is located in the beautiful province of British Columbia... It's very close to the U.S. border.


After stemming:  ubc is locat in the beauti provinc of british columbia ... it 's veri close to the u.s. border .


### Other tools for preprocessing 

- We used the [Natural Language Toolkit (NLTK)](https://www.nltk.org/) above.
- Many other tools are available, for example
    - [spaCy](https://spacy.io/)

### [spaCy](https://spacy.io/)

- Industrial strength NLP library. 
- Lightweight, fast, and convenient to use. 
- spaCy does many things that we did above in one line of code! 
- Also has [multi-lingual](https://spacy.io/models/xx) support. 

In [72]:
import spacy


warnings.simplefilter(action="ignore", category=DeprecationWarning)

# Load the model
nlp = spacy.load("en_core_web_md")

warnings.simplefilter(action="default", category=DeprecationWarning)

text = (
    "MDS is a Master's program at UBC in British Columbia. "
    "MDS teaching team is truly multicultural!! "
    "Dr. George did his Ph.D. in Scotland. "
    "Dr. Timbers, Dr. Ostblom, Dr. Rodríguez-Arelis, and Dr. Kolhatkar did theirs in Canada. "
    "Dr. Gelbart did his PhD in the U.S."
)

doc = nlp(text)

In [73]:
# Accessing tokens
tokens = [token for token in doc]
print("\nTokens: ", tokens)

# Accessing lemma
lemmas = [token.lemma_ for token in doc]
print("\nLemmas: ", lemmas)

# Accessing pos
pos = [token.pos_ for token in doc]
print("\nPOS: ", pos)


Tokens:  [MDS, is, a, Master, 's, program, at, UBC, in, British, Columbia, ., MDS, teaching, team, is, truly, multicultural, !, !, Dr., George, did, his, Ph.D., in, Scotland, ., Dr., Timbers, ,, Dr., Ostblom, ,, Dr., Rodríguez, -, Arelis, ,, and, Dr., Kolhatkar, did, theirs, in, Canada, ., Dr., Gelbart, did, his, PhD, in, the, U.S.]

Lemmas:  ['MDS', 'be', 'a', 'Master', "'s", 'program', 'at', 'UBC', 'in', 'British', 'Columbia', '.', 'MDS', 'teaching', 'team', 'be', 'truly', 'multicultural', '!', '!', 'Dr.', 'George', 'do', 'his', 'ph.d.', 'in', 'Scotland', '.', 'Dr.', 'Timbers', ',', 'Dr.', 'Ostblom', ',', 'Dr.', 'Rodríguez', '-', 'Arelis', ',', 'and', 'Dr.', 'Kolhatkar', 'do', 'theirs', 'in', 'Canada', '.', 'Dr.', 'Gelbart', 'do', 'his', 'phd', 'in', 'the', 'U.S.']

POS:  ['PROPN', 'AUX', 'DET', 'PROPN', 'PART', 'NOUN', 'ADP', 'PROPN', 'ADP', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'NOUN', 'NOUN', 'AUX', 'ADV', 'ADJ', 'PUNCT', 'PUNCT', 'PROPN', 'PROPN', 'VERB', 'PRON', 'NOUN', 'ADP', 'P

### Other typical NLP tasks 
In order to understand text, we are usually interested in extracting information from it. Some common tasks in an NLP pipeline are: 
- Part of speech tagging
    - Assigning part-of-speech tags to all words in a sentence.
- Named entity recognition
    - Labelling named “real-world” objects, like persons, companies or locations.    
- Coreference resolution
    - Deciding whether two strings (e.g., UBC vs University of British Columbia) refer to the same entity
- Dependency parsing
    - Representing grammatical structure of a sentence

### Extracting named-entities using spaCy

In [74]:
from spacy import displacy


warnings.simplefilter(action="ignore", category=DeprecationWarning)

doc = nlp(
    "University of British Columbia "
    "is located in the beautiful "
    "province of British Columbia."
)
displacy.render(doc, style="ent")

warnings.simplefilter(action="default", category=DeprecationWarning)

# Text and label of named entity span
print("Named entities:\n", [(ent.text, ent.label_) for ent in doc.ents])
print("\nORG means: ", spacy.explain("ORG"))
print("GPE means: ", spacy.explain("GPE"))

Named entities:
 [('University of British Columbia', 'ORG'), ('British Columbia', 'GPE')]

ORG means:  Companies, agencies, institutions, etc.
GPE means:  Countries, cities, states


### Dependency parsing using spaCy

In [75]:
warnings.simplefilter(action="ignore", category=DeprecationWarning)

doc = nlp("I like cats")
displacy.render(doc, style="dep")

warnings.simplefilter(action="default", category=DeprecationWarning)

### Many other things possible

- A powerful tool 
- All my Capstone groups last year used this tool. 
- You can build your own rule-based searches. 
- You can also access word vectors using spaCy with bigger models. (Currently we are using `en_core_web_md` model.)

<br><br><br><br>

## Bug workaround!

In [1]:
# "Do you have pyLDAvis installed in your virtual 
#  environment? Apparently there is a strong evidence 
#  that it is an interaction with pyLDAvis."
# https://stackoverflow.com/a/69171847/14881731

from IPython.display import HTML
css_str = '<style> \
.jp-Button path { fill: #616161;} \
text.terms { fill: #616161;} \
.jp-icon-warn0 path {fill: var(--jp-warn-color0);} \
.bp3-button-text path { fill: var(--jp-inverse-layout-color3);} \
.jp-icon-brand0 path { fill: var(--jp-brand-color0);} \
text.terms { fill: #616161;} \
</style>'
display(HTML(css_str ))