<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Lesson 14 - Latent Variables and Natural Language Processing

---

## Guided practice and demos

In [1]:
# Imports
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline

# Config
np.random.seed(1)

In [2]:
# spacy is used for pre-processing and traditional NLP
import spacy
from spacy.en import English

# Gensim is used for LDA and word2vec
from gensim.models.word2vec import Word2Vec

In [6]:
# Import data
df = pd.read_csv('stumbleupon.tsv', sep='\t')
df['title'] = df.boilerplate.map(lambda x: json.loads(x).get('title', ''))
df['body'] = df.boilerplate.map(lambda x: json.loads(x).get('body', ''))

## Demo: "LDA in gensim"

Gensim is a library of language processing tools focused on latent variable models for text. It was originally developed by grad students dissatisfied with the implementations of latent models available at the time. Documentation and tutorials are available on the [package’s website](https://radimrehurek.com/gensim/index.html).


Let’s first translate a set of documents (articles) into a matrix representation with a row per document and a column per feature (word or n-gram).

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

body_text = df.body.dropna()
vectorizer = CountVectorizer(binary=False, # uses 1-gram by default
                             stop_words='english',
                             min_df=3) # word must appear in at least 3 docs
vectorizer.fit(body_text)
docs = vectorizer.transform(body_text)

In [10]:
len(vectorizer.get_feature_names())

27283
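To make the document-term matrix concrete, here is a minimal pure-Python sketch of the counting step on three toy documents. The documents, the whitespace tokenizer, and the helper name `count_matrix` are assumptions for illustration; `CountVectorizer` itself uses its own tokenizer and returns a sparse matrix.

```python
from collections import Counter

def count_matrix(documents, min_df=1):
    """Return (vocabulary, rows): one row of word counts per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Keep words that appear in at least `min_df` documents, sorted alphabetically
    vocabulary = sorted(word for word, df in doc_freq.items() if df >= min_df)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

toy_docs = [
    "chocolate cake recipe",
    "chocolate chocolate cookie recipe",
    "workout recipe for strong legs",
]
vocabulary, rows = count_matrix(toy_docs, min_df=2)
print(vocabulary)
for row in rows:
    print(row)
```

With `min_df=2`, only words found in at least two of the toy documents survive, which is the same filtering idea as `min_df=3` in the cell above.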

In [16]:
# Build a mapping of numerical ID to word
id2word = dict(enumerate(vectorizer.get_feature_names()))
print(id2word[5000], id2word[10000], id2word[20000])
id2word

cheers flagship recounting


{0: '00',
 1: '000',
 2: '000000',
 3: '001',
 4: '007',
 5: '00am',
 6: '00pm',
 7: '01',
 8: '01pm',
 9: '02',
 10: '0206790666',
 11: '025',
 12: '03',
 13: '04',
 14: '044',
 15: '05',
 16: '06',
 17: '0674921071',
 18: '07',
 19: '075',
 20: '0782835788',
 21: '08',
 22: '09',
 23: '0g',
 24: '0http',
 25: '0px',
 26: '0s',
 27: '0sodium',
 28: '10',
 29: '100',
 30: '1000',
 31: '100000000000000000',
 32: '1000000000000000000',
 33: '1000px',
 34: '1000s',
 35: '1001',
 36: '10013',
 37: '100g',
 38: '100k',
 39: '100m',
 40: '100ml',
 41: '100px',
 42: '100th',
 43: '101',
 44: '10184',
 45: '102',
 46: '1024',
 47: '103',
 48: '1034',
 49: '1036',
 50: '104',
 51: '105',
 52: '10522',
 53: '10529',
 54: '106',
 55: '107',
 56: '108',
 57: '1080p',
 58: '109',
 59: '1090',
 60: '10am',
 61: '10g',
 62: '10km',
 63: '10lbs',
 64: '10m',
 65: '10mm',
 66: '10oz',
 67: '10pm',
 68: '10px',
 69: '10th',
 70: '11',
 71: '110',
 72: '1100',
 73: '11000',
 74: '1107748553',
 75: '111',

- We want to learn which columns (words) tend to occur together, i.e. are likely to come from the same topic. This is the word distribution of each topic.
- We can also determine which topics are present in each document. This is the topic distribution of each document.
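As a rough illustration of these two distributions, the sketch below hard-codes a toy model with two topics and four words. All numbers are invented, not learned from the StumbleUpon data. It mixes the distributions the way LDA assumes a document's words arise: p(word | doc) is the sum over topics of p(topic | doc) times p(word | topic).

```python
# Toy numbers, invented for illustration (not output of the fitted model)
words = ["cake", "sugar", "muscle", "workout"]

# Word distribution: one row per topic, p(word | topic); each row sums to 1
topic_word = [
    [0.5, 0.4, 0.05, 0.05],   # topic 0: baking
    [0.05, 0.05, 0.4, 0.5],   # topic 1: fitness
]

# Topic distribution for one document, p(topic | doc); sums to 1
doc_topic = [0.8, 0.2]

# Mix the topic-level word distributions using the document's topic weights
doc_word = [
    sum(doc_topic[k] * topic_word[k][w] for k in range(len(doc_topic)))
    for w in range(len(words))
]
for word, prob in zip(words, doc_word):
    print("{}: {:.2f}".format(word, prob))
```

A document that is mostly "baking" ends up with most of its probability mass on baking words, while still leaving some room for fitness words.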

In [17]:
from gensim.models.ldamodel import LdaModel
from gensim.matutils import Sparse2Corpus

# First we convert our word-matrix into gensim's format
corpus = Sparse2Corpus(docs, documents_columns=False)

# Then we fit an LDA model
lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=15) # number of topics to uncover (arbitrary)
# minimum_probability=0.01 by default: topics whose probability for a document
# is below 0.01 are left out of that document's topic list

In [20]:
list(corpus)  # a list of documents; each document is a list of (word_id, count) pairs

[[(1, 1),
  (9, 1),
  (13, 1),
  (28, 2),
  (70, 2),
  (169, 1),
  (364, 1),
  (415, 1),
  (429, 1),
  (497, 3),
  (865, 1),
  (1389, 1),
  (1412, 1),
  (1570, 1),
  (1701, 2),
  (1781, 1),
  (1886, 1),
  (2062, 1),
  (2203, 1),
  (2204, 1),
  (2234, 2),
  (2454, 1),
  (2460, 1),
  (2650, 1),
  (2672, 1),
  (2766, 1),
  (2779, 1),
  (2836, 2),
  (2879, 1),
  (2881, 1),
  (2950, 2),
  (3000, 3),
  (3002, 1),
  (3247, 2),
  (3281, 1),
  (3284, 1),
  (3285, 1),
  (3306, 1),
  (3310, 1),
  (3393, 1),
  (3515, 4),
  (3563, 1),
  (3893, 1),
  (3984, 1),
  (4003, 1),
  (4127, 1),
  (4222, 5),
  (4273, 1),
  (4355, 4),
  (4361, 2),
  (4381, 1),
  (4407, 2),
  (4504, 1),
  (4613, 2),
  (4664, 1),
  (4803, 3),
  (4808, 2),
  (4818, 1),
  (4900, 1),
  (4927, 2),
  (5294, 1),
  (5296, 2),
  (5304, 1),
  (5717, 1),
  (5784, 6),
  (5821, 1),
  (5860, 1),
  (5886, 5),
  (5887, 3),
  (5888, 1),
  (5951, 1),
  (6115, 1),
  (6120, 1),
  (6124, 2),
  (6174, 1),
  (6207, 1),
  (6226, 1),
  (6272, 1),
  (6
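The structure printed above is gensim's bag-of-words format: one list per document, holding `(word_id, count)` pairs for the words that actually occur. A minimal sketch of the same conversion from a small dense count matrix follows. The matrix and the helper name `dense_to_bow` are made up for illustration; `Sparse2Corpus` produces the same kind of pairs from a SciPy sparse matrix.

```python
def dense_to_bow(rows):
    """Convert rows of word counts into lists of (word_id, count) pairs."""
    return [
        [(word_id, count) for word_id, count in enumerate(row) if count > 0]
        for row in rows
    ]

toy_counts = [
    [1, 0, 2, 0],   # document 0: word 0 once, word 2 twice
    [0, 0, 0, 3],   # document 1: word 3 three times
]
toy_corpus = dense_to_bow(toy_counts)
print(toy_corpus)
```

Zero counts are simply omitted, which keeps the representation compact for large vocabularies.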

In this model, we must explicitly specify the number of topics we want the model to uncover. This is a critical parameter, but there isn’t much guidance on how to choose it, so use domain expertise where possible.


Now we need to assess the goodness of fit for our model. As with other unsupervised learning techniques, validation is mostly a matter of interpretation.

#### Use the following questions to guide you:

- Did we learn reasonable topics?
- Do the words that make up a topic make sense?
- Is this topic helpful towards our goal?

#### We can evaluate fit by viewing the top words in each topic.

- Gensim has a `show_topics()` function for this.

In [21]:
for ti, topic in enumerate(lda_model.show_topics(num_topics=15, num_words=10)):
    print("Topic: {}".format(ti))
    print(topic)

Topic: 0
(0, '0.008*"image" + 0.007*"track" + 0.007*"small" + 0.006*"link" + 0.006*"buzz" + 0.006*"campaign" + 0.006*"like" + 0.006*"2011" + 0.006*"static" + 0.005*"images"')
Topic: 1
(1, '0.006*"news" + 0.006*"workout" + 0.005*"exercises" + 0.005*"muscle" + 0.004*"like" + 0.004*"said" + 0.004*"just" + 0.004*"leg" + 0.004*"body" + 0.004*"make"')
Topic: 2
(2, '0.019*"flashvars" + 0.016*"fashion" + 0.014*"com" + 0.009*"swimsuit" + 0.008*"si" + 0.008*"2011" + 0.008*"http" + 0.007*"images" + 0.006*"image" + 0.006*"chocolate"')
Topic: 3
(3, '0.007*"10" + 0.006*"said" + 0.006*"just" + 0.005*"2009" + 0.005*"like" + 0.005*"2010" + 0.004*"12" + 0.004*"11" + 0.004*"game" + 0.004*"pm"')
Topic: 4
(4, '0.006*"sleep" + 0.005*"make" + 0.004*"people" + 0.004*"like" + 0.004*"just" + 0.004*"brain" + 0.004*"time" + 0.004*"health" + 0.004*"good" + 0.004*"help"')
Topic: 5
(5, '0.029*"com" + 0.028*"online" + 0.026*"www" + 0.025*"http" + 0.022*"guide" + 0.020*"damascus" + 0.017*"cheesecake" + 0.006*"cancer" 
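Each topic comes back as a formatted string of `weight*"word"` terms. If you want numbers you can sort or plot, one option is to parse that string. The sketch below does so with a regular expression on an excerpt of Topic 1 from the output above. The helper name `parse_topic` is hypothetical. (`show_topics` also accepts `formatted=False`, which returns the pairs directly.)

```python
import re

def parse_topic(topic_string):
    """Split a formatted topic string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# First four terms of Topic 1, copied from the output above
topic_1 = '0.006*"news" + 0.006*"workout" + 0.005*"exercises" + 0.005*"muscle"'
parsed = parse_topic(topic_1)
print(parsed)
```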

#### Let's now use our fitted model to predict topics for some new data

(examples taken from http://www.buzzfeed.com/babymantis/25-stupid-newspaper-headlines-1opu)

In [22]:
new_text = [
    "Japanese scientists grow frog eyes and ears",
    "Statistics show that teen pregnancy drops off significantly after age 25",
    "Bugs flying around with wings are flying bugs",
    "Federal agents raid gun shop, find weapons",
    "Marijuana issue sent to a joint committee"
]

# Transform the text into the bag-of-words (bow) space using our vectorizer
new_bow = vectorizer.transform(new_text)

# Transform into format expected by gensim
new_corpus = Sparse2Corpus(new_bow, documents_columns=False)

# Print out first entry + matching words
print(list(new_corpus)[0])
print([(id2word[word_id], count) for word_id, count in list(new_corpus)[0]])

[(8461, 1), (9477, 1), (10556, 1), (11436, 1), (13429, 1), (21460, 1)]
[('ears', 1), ('eyes', 1), ('frog', 1), ('grow', 1), ('japanese', 1), ('scientists', 1)]
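Note that `vectorizer.transform` only counts words already in the fitted vocabulary. Stop words and unseen words are silently dropped, which is why the first headline keeps six of its seven words ("and" is an English stop word). The sketch below mimics that behavior with a much smaller, hypothetical vocabulary and a hypothetical helper `to_bow`.

```python
# Hypothetical fixed vocabulary (word -> id), standing in for the fitted vectorizer
toy_vocabulary = {"ears": 0, "eyes": 1, "frog": 2, "scientists": 3}

def to_bow(text, vocabulary):
    """Count only the words that exist in the fixed vocabulary."""
    counts = {}
    for token in text.lower().split():
        if token in vocabulary:
            word_id = vocabulary[token]
            counts[word_id] = counts.get(word_id, 0) + 1
    return sorted(counts.items())

toy_bow = to_bow("Japanese scientists grow frog eyes and ears", toy_vocabulary)
print(toy_bow)
```

Words missing from the toy vocabulary ("japanese", "grow", "and") contribute nothing, so a headline made entirely of unseen words would map to an empty document.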


#### Some useful `LdaModel` methods

In [31]:
id2word[11849]

'health'

In [35]:
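# list(lda_model.get_document_topics(new_corpus))
# Topics most relevant to word 11849 ('health'), as (topic_id, probability) pairs
print(lda_model.get_term_topics(11849, minimum_probability=0.002))
# Top 10 words for topic 10, as (word_id, probability) pairs
print(lda_model.get_topic_terms(topicid=10, topn=10))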

[(4, 0.0037836226167982136), (5, 0.0029790995206494478), (6, 0.0038716929666972265), (7, 0.003927172971523331), (8, 0.0070169109287124363)]
[(4332, 0.016544101843291154), (8244, 0.01110171070853422), (22414, 0.0097198968607114643), (5146, 0.0083802910382261217), (0, 0.0068817088869400833), (24780, 0.0042015211432145216), (15006, 0.0039876160052542678), (26589, 0.003354022156726763), (4581, 0.0032713244170213276), (25555, 0.0031676769641280494)]


#### Transform into LDA space by applying fitted LDA model to the corpus

In [32]:
lda_vector = lda_model[new_corpus]

# lda_model is the model we trained on df.body;
# that training step is where topics 0-14 were created

#### For each entry we can extract `(topic, probability)` tuples indicating how strongly it belongs to each topic

In [33]:
[list(lda_vec) for lda_vec in lda_vector]

[[(8, 0.63194204219504491), (10, 0.2442482111575916)],
 [(5, 0.33372078866691623), (8, 0.5579455950427773)],
 [(0, 0.011111132892560582),
  (1, 0.011111138772837799),
  (2, 0.011111128331178135),
  (3, 0.011111116912792408),
  (4, 0.011111116286353853),
  (5, 0.22354021344859279),
  (6, 0.011111132521579378),
  (7, 0.011111116851688446),
  (8, 0.01111111430839774),
  (9, 0.011111119668563121),
  (10, 0.011111128723051578),
  (11, 0.01111112146125022),
  (12, 0.63201516112286193),
  (13, 0.011111128401433405),
  (14, 0.011111130296858406)],
 [(12, 0.86666626887404608)],
 [(0, 0.011111122441748361),
  (1, 0.011111133371909112),
  (2, 0.011111202926901645),
  (3, 0.011111136362720117),
  (4, 0.011111133639174734),
  (5, 0.01111112802795663),
  (6, 0.011111125252458201),
  (7, 0.84444410487660893),
  (8, 0.011111136851258417),
  (9, 0.011111116053869526),
  (10, 0.011111129301093071),
  (11, 0.011111127452686139),
  (12, 0.011111137244433487),
  (13, 0.011111128631958579),
  (14, 0.0111111

#### Extract most prominent LDA topics for each entry

In [None]:
top_topics = [max(x, key=lambda item: item[1]) for x in list(lda_vector)]
top_topics
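Applied to the first, second, and fourth documents from the output above (the ones whose topic lists are short and complete), the `max` call picks the `(topic_id, probability)` pair with the largest probability. Probabilities are rounded to four places here.

```python
# (topic_id, probability) pairs copied from the output above, rounded to 4 places
example_vectors = [
    [(8, 0.6319), (10, 0.2442)],
    [(5, 0.3337), (8, 0.5579)],
    [(12, 0.8667)],
]
# For each document keep the pair whose probability (item[1]) is largest
example_top = [max(doc, key=lambda item: item[1]) for doc in example_vectors]
print(example_top)
```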

#### Print out text + topic

In [None]:
for i, topic_tuple in enumerate(top_topics):
    print(new_text[i])
    print("{0:.1f}% as topic #{1}:".format(100 * topic_tuple[1], topic_tuple[0]))
    print(lda_model.print_topic(topic_tuple[0],topn=10), "\n")

For more examples on using LDA with gensim, see: http://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html

## Demo: Word2Vec in gensim

We will build a Word2Vec model using the text body of the articles available in the StumbleUpon dataset.

The Word2Vec class has many arguments:

- `size` is the dimensionality of the word vectors (renamed `vector_size` in gensim 4.0)
- `window` is the maximum distance, in words, between the target word and the context words used to predict it
- `min_count` is the minimum number of times a word must appear in the corpus to be kept in the vocabulary
- `workers` is the number of worker threads used to speed up model training

In [36]:
from gensim.models import Word2Vec

# Setup the text body
text = df.body.dropna().map(lambda x: x.split())
model = Word2Vec(text,
                 size=100,      # dimensionality of the word vectors
                 window=5,      # max distance between the target word and its context words
                 min_count=5,   # ignore words that appear fewer than 5 times
                 workers=4)     # number of worker threads (can speed up model training)

#`sg` defines the training algorithm. By default (`sg=0`), CBOW is used.
#Otherwise (`sg=1`), skip-gram is employed.
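To illustrate what `window` controls, the sketch below lists the (target, context) pairs a skip-gram model would see for one toy sentence. This is a simplification with a hypothetical helper `context_pairs` and a fixed window; gensim's training loop does more, for example downsampling very frequent words.

```python
def context_pairs(tokens, window=2):
    """Return (target, context) pairs within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

toy_sentence = "bake the chocolate cake".split()
toy_pairs = context_pairs(toy_sentence, window=1)
for pair in toy_pairs:
    print(pair)
```

Skip-gram (`sg=1`) predicts each context word from the target, while CBOW (`sg=0`) predicts the target from its combined context words. A larger `window` produces more pairs per word and tends to capture broader, more topical relationships.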

The model has a `most_similar` function that finds the words most similar to the ones you query.
It returns the words that are most often used in the same contexts.
Because the vectors were learned from the StumbleUpon articles, the neighbors reflect how words are used in this dataset.

In [38]:
model.most_similar(positive=['cookie', 'brownie'])  # most similar to this list of words

[('cupcake', 0.9304956197738647),
 ('cake', 0.8600763082504272),
 ('crust', 0.8569961190223694),
 ('tart', 0.8484545946121216),
 ('pie', 0.8456522822380066),
 ('mini', 0.839384913444519),
 ('cakes', 0.822641134262085),
 ('cheesecake', 0.8222636580467224),
 ('icing', 0.8154040575027466),
 ('filling', 0.8088855743408203)]
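The scores in these results are cosine similarities between word vectors. A minimal sketch of ranking neighbors for a single query word is shown below, using made-up 3-dimensional vectors (the real vectors here have 100 dimensions). The names `toy_vectors`, `cosine`, and `nearest` are hypothetical.

```python
import math

# Invented vectors for illustration only
toy_vectors = {
    "cookie":  [0.9, 0.1, 0.0],
    "brownie": [0.8, 0.2, 0.1],
    "cake":    [0.7, 0.3, 0.0],
    "lung":    [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

neighbors = nearest("cookie", toy_vectors)
print(neighbors)
```

Cosine similarity ranges from -1 to 1, so words pointing in nearly the same direction score close to 1 regardless of vector length.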

In [39]:
model.most_similar(negative=['cookie','brownie'])

[('chronic', 0.4840221405029297),
 ('abuse', 0.483193039894104),
 ('injuries', 0.47000136971473694),
 ('obesity', 0.468454509973526),
 ('lung', 0.4670487642288208),
 ('management', 0.4635861814022064),
 ('Dr', 0.46205583214759827),
 ('conditions.', 0.4572290778160095),
 ('demolitions', 0.45115551352500916),
 ('cognitive', 0.4498996138572693)]

#### Word vector maths: 

- "man - boy $\approx$ person"

In [40]:
model.most_similar(positive=['man'], negative=['boy'])

[('person', 0.4179037809371948),
 ('you', 0.349819153547287),
 ('those', 0.3148405849933624),
 ('anyone', 0.31106966733932495),
 ('someone', 0.30445587635040283),
 ('example', 0.30161377787590027),
 ('thing', 0.2990867495536804),
 ('fertilisation', 0.2867507338523865),
 ('people', 0.2861977815628052),
 ('difference', 0.28583407402038574)]
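The `positive` and `negative` lists are combined into a single query vector before the nearest neighbors are looked up. The sketch below shows the idea with invented 2-dimensional vectors: add the positive vectors, subtract the negative ones, then pick the remaining word with the highest cosine similarity. gensim additionally normalizes the vectors before combining them, so treat this as a conceptual sketch rather than the library's exact code; `analogy` is a hypothetical helper.

```python
import math

# Invented 2-D vectors: first axis ~ "royalty", second axis ~ "gender"
toy_vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    # Exclude the query words themselves, as most_similar does
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

result = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(result)
```

Here "king + woman - man" lands exactly on the toy vector for "queen". With real, noisy vectors the match is only approximate, as the outputs in this section show.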

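#### Read this as "woman is to man as boy is to ...?"
 
- The query below computes "man + boy - woman". The more familiar analogy "man is to woman as boy is to girl" corresponds to "woman + boy - man $\approx$ girl"; with the query as written, `girl` only ranks fifth.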

In [41]:
model.most_similar(positive=['man', 'boy'], negative=['woman'])

[('daughter', 0.8149204254150391),
 ('son', 0.7855384349822998),
 ('brother', 0.7824009656906128),
 ('wife', 0.7821298241615295),
 ('girl', 0.7799780964851379),
 ('father', 0.7783591747283936),
 ('girlfriend', 0.7603330612182617),
 ('dad', 0.7416545152664185),
 ('guy', 0.7385281324386597),
 ('husband', 0.733360767364502)]

#### "cheesecake + cake - frosting = pie"

In [42]:
model.most_similar(positive=['cheesecake', 'cake'], negative=['frosting'])

[('pie', 0.763269305229187),
 ('tart', 0.7427929639816284),
 ('brownie', 0.7358402013778687),
 ('cupcake', 0.7290023565292358),
 ('crust', 0.7189767360687256),
 ('cookie', 0.6843904256820679),
 ('dessert', 0.683058500289917),
 ('pizza', 0.6806069612503052),
 ('cakes', 0.6800334453582764),
 ('dish', 0.6611363887786865)]

#### data + science - statistics = ?

In [43]:
model.most_similar(positive=['data', 'science'], negative=['statistics'])

[('technology', 0.7104990482330322),
 ('research', 0.7101888656616211),
 ('industry', 0.6933605074882507),
 ('company', 0.6924563050270081),
 ('report', 0.6858687400817871),
 ('device', 0.6835160255432129),
 ('FDA', 0.6788550019264221),
 ('human', 0.6675242781639099),
 ('potential', 0.656730592250824),
 ('product', 0.6564807891845703)]

#### technology + entrepreneur - hipster = ?

In [44]:
model.most_similar(positive=['technology', 'entrepreneur'], negative=['hipster'])

[('concept', 0.7965396642684937),
 ('design', 0.7740277051925659),
 ('manufacturing', 0.7712497115135193),
 ('computing', 0.7689513564109802),
 ('electronics', 0.7687945365905762),
 ('cell', 0.7665839791297913),
 ('aircraft', 0.763490617275238),
 ('innovation', 0.7627125978469849),
 ('solar', 0.7612137794494629),
 ('engineering', 0.7589035630226135)]

#### Which one of these doesn't fit?

In [46]:
print(model.doesnt_match("breakfast cereal lunch dinner".split()))
print(model.doesnt_match("facebook twitter tumblr myspace".split()))

cereal
myspace
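One way to think about `doesnt_match`: average the word vectors and return the word that is least similar to that average. The sketch below uses invented 2-dimensional vectors and a hypothetical helper `odd_one_out`. It works on unit-length vectors, as gensim does, but treat the details as an approximation of the library's behavior.

```python
import math

# Invented vectors: three "meal" words point one way, "cereal" another
toy_vectors = {
    "breakfast": [0.9, 0.1],
    "lunch":     [0.8, 0.2],
    "dinner":    [0.85, 0.15],
    "cereal":    [0.2, 0.9],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    # Mean direction of all the words in the list
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    # Lowest dot product with the mean direction = least similar word
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

odd = odd_one_out("breakfast cereal lunch dinner".split(), toy_vectors)
print(odd)
```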


#### Similarity between two words

In [49]:
print(model.similarity('man', 'woman'))
print(model.similarity('man', 'monkey'))
print(model.similarity('apple', 'pear'))
print(model.similarity('man', 'apple'))

0.905201263333
0.206245709222
0.642689643942
-0.0620174904301


#### Inspect a single vector

In [50]:
model['man']

array([ 0.79671896,  0.28557587, -0.24596512, -2.13458872,  0.84468973,
        1.49832821, -0.63018841,  1.32799911,  0.17558731, -0.54315615,
        0.62714541,  0.84165853, -0.94405061, -0.66986221,  0.31884253,
       -0.66172236, -1.25170887,  0.6737963 ,  0.1414531 , -0.05315728,
       -0.10914821, -0.13506451,  1.80610347, -0.62940747, -0.03612972,
       -1.27625263, -1.04190004, -0.6224041 , -0.38445905,  0.65051568,
       -0.97188848, -0.74773932,  0.65067387,  0.65322876,  1.32147264,
        0.34293902, -0.66689903,  0.96012127,  1.0663625 ,  0.71801472,
       -0.46326935, -2.06709695, -1.10924947,  0.14489691, -0.01258467,
       -0.19534077, -0.37217668,  0.77195185, -1.43389034,  0.08513048,
        0.29235303,  0.2183858 , -0.01322468,  1.17714226,  1.54776442,
       -0.22917001,  0.12437997,  0.03246436, -0.88347846, -0.72030008,
        1.83631122, -0.62434083,  0.53218418,  0.43283603, -0.52869284,
       -0.11392604, -1.45422685,  0.53606284,  0.68476307,  0.12