In [144]:
import os.path
import numpy as np
import re
import pandas as pd
import matplotlib.pyplot as plt

import gensim 
from gensim.models import LdaModel
from gensim.models.wrappers import LdaMallet

import gensim.corpora as corpora
from gensim.corpora import Dictionary

from gensim import matutils, models

import pyLDAvis.gensim
import string
from multiprocessing import  Pool
import time
pd.set_option('display.max_colwidth', 100)
from pandarallel import pandarallel
%matplotlib inline

## Topic modeling with LDA <a name="lda"></a>

We will explore the topics in `scikit-learn`'s [20 newsgroups text dataset](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) using [`gensim`'s `ldamodel`](https://radimrehurek.com/gensim/models/ldamodel.html). Topic modeling is usually used to discover the abstract "topics" that occur in a collection of documents when the actual topics are unknown. But since the 20 newsgroups text dataset is labeled with categories (e.g., sports, hardware, religion), you will be able to cross-check the topics discovered by your model against the actual ones.

Let's load the train and test portions of the data, convert the train portion into a pandas DataFrame, and examine the first few rows. Looking at the training subset does not violate the golden rule; we keep a separate test subset so that we can later examine how well the learned LDA model assigns topics to unseen documents.

In [2]:
### BEGIN STARTER CODE
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
### END STARTER CODE

In [189]:
### BEGIN STARTER CODE
data = {'text':[], 'target_name':[], 'target':[]}
data['text'] = newsgroups_train.data
data['target_name'] = [newsgroups_train.target_names[target] for target in newsgroups_train.target]
data['target'] = [target for target in newsgroups_train.target]
df = pd.DataFrame(data)
df.head()
### END STARTER CODE

| | text | target_name | target |
|---|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac... | rec.autos | 7 |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Fi... | comp.sys.mac.hardware | 4 |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Willis)\nSubject: PB questions...\nOrganization: Purdu... | comp.sys.mac.hardware | 4 |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: Weitek P9000 ?\nOrganization: Harris Computer Syste... | comp.graphics | 1 |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDowell)\nSubject: Re: Shuttle Launch Question\nOrgani... | sci.space | 14 |


In [4]:
print(df['text'].iloc[4])

From: jcm@head-cfa.harvard.edu (Jonathan McDowell)
Subject: Re: Shuttle Launch Question
Organization: Smithsonian Astrophysical Observatory, Cambridge, MA,  USA
Distribution: sci
Lines: 23

From article <C5owCB.n3p@world.std.com>, by tombaker@world.std.com (Tom A Baker):
>>In article <C5JLwx.4H9.1@cs.cmu.edu>, ETRAT@ttacs1.ttu.edu (Pack Rat) writes...
>>>errors. ...".  I am wondering what an "expected error" might
>>>be.  Sorry if this is a really dumb question, but
> 
> Parity errors in memory or previously known conditions that were waivered.
>    "Yes that is an error, but we already knew about it"
> I'd be curious as to what the real meaning of the quote is.
> 
> tom


My understanding is that the 'expected errors' are basically
that don't have the right values in yet because they aren't
set till after launch, and suchlike. Rather than fix the code
and possibly introduce new bugs, they just tell the crew

 - Jonathan





In [5]:
np.unique(df.target)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [6]:
df.shape

(11314, 3)

###  Preprocessing

We want our topic model to identify interesting and important patterns. For that we need to "normalize" our text. Preprocessing is a crucial step before you train an LDA model and it markedly affects the results.


In [7]:
### BEGIN STARTER CODE
import spacy
# Load spaCy's small English model
nlp = spacy.load("en_core_web_sm")

### END STARTER CODE

In [82]:
doc = nlp(df['text'].iloc[4])
i = 57
print(doc[i])
doc[i].lemma_.lower()

C5JLwx.4H9.1@cs.cmu.edu


'c5jlwx.4h9.1@cs.cmu.edu'

In [190]:
### BEGIN STARTER CODE
def preprocess(text, 
               min_token_len = 2, 
               irrelevant_pos = ['ADV','PRON','CCONJ','PUNCT','PART','DET','ADP','SPACE']): 
    """
    Given text, min_token_len, and irrelevant_pos carry out preprocessing of the text 
    and return a preprocessed string. 
    
    Parameters
    -------------
    text : (str) 
        the text to be preprocessed
    min_token_len : (int) 
        min_token_length required
    irrelevant_pos : (list) 
        a list of irrelevant pos tags
    
    Returns
    -------------
    (str) the preprocessed text
    """
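    # Run the spaCy pipeline, then keep the lowercased lemma of every token
    # that is not an email address, too short, a stop word, or an irrelevant POS.
    doc = nlp(text)
    kept = []
    for token in doc:
        if token.like_email or len(token) < min_token_len or token.is_stop:
            continue
        if token.pos_ in irrelevant_pos:
            continue
        kept.append(token.lemma_.lower())
    return " ".join(kept)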
### END STARTER CODE    
print(preprocess(df['text'].iloc[4]))



In [191]:
### YOUR ANSWER HERE
from pandarallel import pandarallel
pandarallel.initialize()
s = time.time()
df['pp_text'] = df.text.parallel_apply(preprocess)
t = time.time()-s
print("time taken is",t,'seconds')

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
time taken is 200.45548725128174 seconds


In [192]:
### YOUR ANSWER HERE
df['pp_text'].head()

0    thing subject car nntp post host rac3.wam.umd.edu organization university maryland college park ...
1    guy kuo subject si clock poll final summary final si clock report keywords si acceleration clock...
2    thomas willis subject pb question organization purdue university engineering computer network di...
3    jgreen@amber joe green subject weitek p9000 organization harris computer systems division line 1...
4    jonathan mcdowell subject shuttle launch question organization smithsonian astrophysical observa...
Name: pp_text, dtype: object

###  Build dictionary and document-term co-occurrence matrix

We need two things to build `gensim`'s `LdaModel`: a dictionary and a document-term co-occurrence matrix. 

In [219]:
### YOUR ANSWER HERE
corpus = [text.split() for text in df.pp_text.values]
dictionary = corpora.Dictionary(corpus)

In [220]:
doc_term_matrix = [dictionary.doc2bow(doc) for doc in corpus]
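
To make these two objects concrete, here is a minimal pure-Python sketch of the same idea on a toy corpus. It only illustrates the bag-of-words representation: `build_vocab` and `to_bow` are hypothetical helpers, not part of `gensim`, and the ids they assign need not match those produced by `corpora.Dictionary`.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(doc, vocab):
    """Return a sorted list of (token_id, count) pairs; unknown tokens are dropped."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Toy corpus (made up for illustration), already tokenized.
toy_corpus = [
    "shuttle launch question shuttle".split(),
    "clock poll final clock report".split(),
]
vocab = build_vocab(toy_corpus)
bows = [to_bow(doc, vocab) for doc in toy_corpus]
print(vocab)
print(bows)
```

Each document becomes a sparse list of `(id, count)` pairs. In this sketch, tokens that are missing from the vocabulary are simply ignored, which is worth keeping in mind when unseen documents are converted with a dictionary built on the training data.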

###  Build a topic model



In [208]:
lda = models.LdaModel(corpus=doc_term_matrix, 
                      id2word=dictionary, 
                      num_topics=7, 
                      passes=10)

In [210]:

lda.print_topics(num_words=20)

[(0,
  '0.014*"space" + 0.006*"nasa" + 0.005*"launch" + 0.004*"orbit" + 0.004*"moon" + 0.004*"food" + 0.004*"earth" + 0.004*"organization" + 0.003*"gordon" + 0.003*"msg" + 0.003*"science" + 0.003*"article" + 0.003*"subject" + 0.003*"satellite" + 0.003*"disease" + 0.003*"mission" + 0.003*"research" + 0.003*"banks" + 0.003*"center" + 0.003*"year"'),
 (1,
  '0.011*"file" + 0.008*"use" + 0.007*"window" + 0.007*"problem" + 0.007*"line" + 0.006*"subject" + 0.006*"card" + 0.006*"program" + 0.006*"drive" + 0.005*"organization" + 0.005*"windows" + 0.005*"g9v" + 0.005*"system" + 0.005*"work" + 0.005*"write" + 0.005*"disk" + 0.005*"need" + 0.004*"run" + 0.004*"driver" + 0.004*"set"'),
 (2,
  '0.014*"subject" + 0.014*"organization" + 0.009*"university" + 0.008*"line" + 0.007*"nntp" + 0.007*"host" + 0.007*"lines" + 0.007*"write" + 0.007*"posting" + 0.006*"article" + 0.005*"good" + 0.005*"game" + 0.005*"year" + 0.004*"team" + 0.004*"new" + 0.004*"like" + 0.004*"think" + 0.004*"know" + 0.003*"distrib

### Model description

* A total of 7 topics were chosen, and the 20 highest-weighted words are displayed for each topic.
* Each entry in the output pairs a word with its probability under that topic.
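
Each topic string returned by `print_topics` has the form `weight*"word" + ...`. As a small sketch, the helper below (a hypothetical `parse_topic`, not a `gensim` function) turns such a string into `(word, weight)` pairs, which can be handy for tabulating or filtering topic words. The sample string is copied from the start of topic 0 above.

```python
import re

def parse_topic(topic_string):
    """Split a 'weight*"word" + ...' string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# First four entries of topic 0 as printed above.
topic_0 = '0.014*"space" + 0.006*"nasa" + 0.005*"launch" + 0.004*"orbit"'
for word, weight in parse_topic(topic_0):
    print(f"{word:<8}{weight:.3f}")
```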

###  Visualization and interpretation



In [198]:
### BEGIN STARTER CODE
topic_labels = {0:'Science and technology'}
### END STARTER CODE

In [211]:
### YOUR ANSWER HERE
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda, doc_term_matrix, dictionary, sort_topics=False)
vis


![](plots.png)

In [259]:
### YOUR ANSWER HERE
topic_labels = {
    0:'Space technology',
    1:'Computers',
    2:'Sports',
    3:'Religion',
    4:'Computer peripherals',
    5:'Science and Technology',
    6: 'Politics'
}
topic_labels

{0: 'Space technology',
 1: 'Computers',
 2: 'Sports',
 3: 'Religion',
 4: 'Computer peripherals',
 5: 'Science and Technology',
 6: 'Politics'}

###  Test on unseen documents 

In this dataset, we already know the "topics" (labels) of each article. You will examine to what extent the topics identified by the LDA model match the actual labels of unseen documents.

In [None]:
### BEGIN STARTER CODE
data = {'text':[], 'target':[]}
data['text'] = newsgroups_test.data
data['target_name'] = [newsgroups_test.target_names[target] for target in newsgroups_test.target]
data['target'] = [target for target in newsgroups_test.target]
test_df = pd.DataFrame(data)
sample_test_df = test_df.sample(100)
sample_test_df
### END STARTER CODE

pandarallel.initialize()
s = time.time()
sample_test_df['pp_text'] = sample_test_df.text.parallel_apply(preprocess)
t = time.time()-s
print("time taken is",t,'seconds')

In [221]:
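test_corpus = [text.split() for text in sample_test_df.pp_text.values]
# Reuse the training dictionary; separate names keep the training matrix intact.
test_doc_term_matrix = [dictionary.doc2bow(doc) for doc in test_corpus]

In [236]:
p = lda[test_doc_term_matrix[1]]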
p

[(1, 0.14103273),
 (3, 0.40715578),
 (4, 0.032082643),
 (5, 0.31440067),
 (6, 0.099440254)]

In [235]:
max(p,key=lambda item:item[1])

(3, 0.4071601)

In [260]:

def get_most_prob_topic(unseen_document, model = lda):
    """
    Given an unseen_document and a trained LDA model, find the most likely
    topic (the topic with the highest probability) in the topic distribution
    of the unseen document and return its label together with its probability.
    Relies on the global `dictionary` and `topic_labels` defined above.
    
    Parameters
    ------------
    unseen_document : (str) 
        the document to be labeled with a topic
    model : (gensim ldamodel) 
        the trained LDA model
    
    Returns
    -------------
        (str) a string of the form 
        `most likely topic label:probability of that label` 
    
    Examples
    ----------
    >>> get_most_prob_topic("The research uses an HMM for discovering gene sequence.", 
    ...                     model = lda)
    'Science and Technology:0.435'
    """   
    doc = preprocess(unseen_document)
    bow_vector = dictionary.doc2bow(doc.split())
    topics = model[bow_vector]
    max_topic = max(topics,key=lambda item:item[1])
    topic_prob = topic_labels[max_topic[0]]+":"+str(max_topic[1])
    return topic_prob
  

In [265]:
sample_test_df['predicted'] = sample_test_df.text.apply(get_most_prob_topic)

In [269]:
### YOUR ANSWER HERE
sample_test_df[['target_name','predicted']].head(20)

| | target_name | predicted |
|---|---|---|
| 3835 | rec.autos | Science and Technology:0.6712201 |
| 1393 | alt.atheism | Religion:0.40715802 |
| 2553 | comp.graphics | Sports:0.40570045 |
| 4854 | sci.crypt | Science and Technology:0.9308202 |
| 2202 | talk.politics.misc | Politics:0.33784994 |
| 5997 | sci.crypt | Science and Technology:0.7994448 |
| 158 | rec.sport.hockey | Sports:0.978897 |
| 629 | talk.religion.misc | Religion:0.49454913 |
| 1284 | comp.sys.mac.hardware | Sports:0.87936836 |
| 7526 | rec.autos | Sports:0.46092635 |
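
One rough way to quantify the agreement is to map each newsgroup name to the coarse topic labels we would accept for it and count the matches. The mapping below is an assumption made for illustration (other groupings are defensible), `acceptable_labels` and `agreement` are hypothetical helpers, and the ten rows are the ones displayed above.

```python
# Hypothetical mapping from newsgroup-name prefixes to acceptable topic labels.
# This grouping is an illustrative assumption, not part of the dataset.
PREFIX_TO_LABELS = {
    "rec.sport": {"Sports"},
    "sci.space": {"Space technology", "Science and Technology"},
    "sci.": {"Science and Technology"},
    "comp.": {"Computers", "Computer peripherals"},
    "talk.politics": {"Politics"},
    "talk.religion": {"Religion"},
    "soc.religion": {"Religion"},
    "alt.atheism": {"Religion"},
}

def acceptable_labels(target_name):
    """Return the labels accepted for a newsgroup (first matching prefix wins)."""
    for prefix, labels in PREFIX_TO_LABELS.items():
        if target_name.startswith(prefix):
            return labels
    return set()  # e.g. rec.autos has no obvious counterpart among the 7 topics

def agreement(rows):
    """Fraction of (target_name, predicted) rows whose label is acceptable."""
    hits = sum(
        predicted.split(":")[0] in acceptable_labels(target_name)
        for target_name, predicted in rows
    )
    return hits / len(rows)

# The ten (target_name, predicted) rows shown in the output above.
rows = [
    ("rec.autos", "Science and Technology:0.6712201"),
    ("alt.atheism", "Religion:0.40715802"),
    ("comp.graphics", "Sports:0.40570045"),
    ("sci.crypt", "Science and Technology:0.9308202"),
    ("talk.politics.misc", "Politics:0.33784994"),
    ("sci.crypt", "Science and Technology:0.7994448"),
    ("rec.sport.hockey", "Sports:0.978897"),
    ("talk.religion.misc", "Religion:0.49454913"),
    ("comp.sys.mac.hardware", "Sports:0.87936836"),
    ("rec.autos", "Sports:0.46092635"),
]
print(f"agreement: {agreement(rows):.2f}")
```

Under this particular mapping, 6 of the 10 displayed predictions are accepted. The same helper could be applied to the full sample by passing `list(zip(sample_test_df['target_name'], sample_test_df['predicted']))`.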




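* The LDA assignments are often sensible (e.g., `rec.sport.hockey` is labeled Sports and `sci.crypt` is labeled Science and Technology), but there are also many wrong ones (e.g., `comp.sys.mac.hardware` is labeled Sports).
* Better preprocessing would likely help: header tokens such as "subject", "organization", "nntp", and "host" rank among the top words of several topics.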
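
As one possible direction, the sketch below removes header lines and quoted reply lines before the text reaches `preprocess`. `strip_headers_and_quotes` is a hypothetical helper, and its heuristics (headers end at the first blank line; quoted lines start with `>`) are assumptions about the message layout. `fetch_20newsgroups` also accepts a `remove=('headers', 'footers', 'quotes')` argument that serves a similar purpose.

```python
def strip_headers_and_quotes(text):
    """Drop the header block (everything up to the first blank line)
    and any quoted reply lines (lines starting with '>')."""
    # Assumption: headers and body are separated by the first blank line.
    _, sep, body = text.partition("\n\n")
    if not sep:  # no blank line found: treat the whole text as body
        body = text
    kept = [line for line in body.splitlines()
            if not line.lstrip().startswith(">")]
    return "\n".join(kept).strip()

# Made-up toy message in the style of a newsgroup post.
message = (
    "From: someone@example.com (Some One)\n"
    "Subject: Re: Shuttle Launch Question\n"
    "Organization: Example Org\n"
    "Lines: 5\n"
    "\n"
    "> I'd be curious as to what the real meaning of the quote is.\n"
    "\n"
    "My understanding is that the crew is told which warnings to ignore.\n"
)
print(strip_headers_and_quotes(message))
```

Running `preprocess` on the stripped text rather than on the raw message should keep tokens such as "subject" and "organization" out of the vocabulary, although whether this improves the topics would need to be checked by retraining the model.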