
# Exercise 1 - Topic Modeling

In this notebook, we will apply our understanding of topic modeling techniques such as LDA and NMF.

__Fill in the sections marked with `<YOUR CODE HERE>`__

## Import Libraries

In [3]:
import nltk
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
import gensim

import re
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

In [4]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [5]:
pd.set_option('display.max_colwidth', None)


## Get Dataset

For this assignment, we will use the __20 Newsgroup__ dataset. This dataset contains ~11k news articles spread across 20 news categories. The ``sklearn`` library provides an easy-to-use interface to fetch this dataset.

In [6]:
newsgroups_train = fetch_20newsgroups(subset='train')

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [7]:
# view the news categories
newsgroups_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

## Pre-process Text

## Question 1: Complete Regex to remove emails (1 point)

In [8]:
# Convert to list
data = newsgroups_train.data

# Remove emails (raw strings keep the regex escapes intact)
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]

# Remove extra spaces / new lines
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub(r"'", "", sent) for sent in data]

print(data[:1])

['From: (wheres my thing) Subject: WHAT car is this!? Nntp-Posting-Host: rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: 15 I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail. Thanks, - IL ---- brought to you by your neighborhood Lerxst ---- ']
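
To see what each of the three substitutions does in isolation, here is a minimal sketch on a made-up string. The sample text and its address are hypothetical and only serve as an illustration; the patterns are the same as in the cell above.

```python
import re

# Hypothetical sample text; the address below is made up for illustration.
sample = "From: jane.doe@example.com (Jane)\nSubject:   hello\n\nIt's a   test"

no_email = re.sub(r'\S*@\S*\s?', '', sample)   # drop any token containing '@'
one_space = re.sub(r'\s+', ' ', no_email)      # collapse whitespace runs and newlines
cleaned = re.sub(r"'", '', one_space)          # strip single quotes

print(cleaned)  # From: (Jane) Subject: hello Its a test
```

Note that the email pattern removes the whole whitespace-delimited token around the `@`, which is why the `From:` line of the first article keeps only the display name.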


In [9]:
stop_words = nltk.corpus.stopwords.words('english')
wtk = nltk.tokenize.RegexpTokenizer(r'\w+')
wnl = nltk.stem.wordnet.WordNetLemmatizer()

## Question 2: Complete the `normalize_corpus` function (2 points)

__Note:__ Keep only tokens with length 2 or more (as compared to 1 or more in Tutorial 1)

__Hint:__ The `normalize_corpus()` function in Tutorial 1 will come in handy here

In [10]:
def normalize_corpus(news_articles):
    norm_articles = []
    for article in tqdm(news_articles):
        article = article.lower()
        article_tokens = [token.strip() for token in wtk.tokenize(article)]
        article_tokens = [wnl.lemmatize(token) for token in article_tokens if not token.isnumeric()]
        article_tokens = [token for token in article_tokens if len(token) > 1]
        article_tokens = [token for token in article_tokens if token not in stop_words]
        article_tokens = list(filter(None, article_tokens))
        if article_tokens:
            norm_articles.append(article_tokens)
    return norm_articles

In [11]:
%%time

norm_data = normalize_corpus(data)
print(len(norm_data))

100%|██████████| 11314/11314 [00:20<00:00, 554.61it/s]

11314
CPU times: user 20.1 s, sys: 235 ms, total: 20.4 s
Wall time: 20.4 s
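
The effect of the individual filters is easier to follow on a single sentence. The sketch below is a simplified stand-in for `normalize_corpus`, not the function itself: it assumes a tiny hypothetical stop list instead of NLTK's English list and skips lemmatization, so it runs without any downloads.

```python
import re

# Tiny hypothetical stop list; the notebook uses NLTK's full English list.
STOP = {"the", "a", "is", "in", "it", "was"}

def simple_normalize(text):
    """Lowercase, tokenize on word characters, then apply the same filters."""
    tokens = re.findall(r'\w+', text.lower())          # same pattern as the RegexpTokenizer
    tokens = [t for t in tokens if not t.isnumeric()]  # drop purely numeric tokens
    tokens = [t for t in tokens if len(t) > 1]         # keep tokens of length >= 2
    return [t for t in tokens if t not in STOP]        # drop stop words

print(simple_normalize("It was a 2-door car in 1975, the X model"))
# ['door', 'car', 'model']
```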





# Topic Modeling with LDA

## Feature Engineering: Bi-Grams

## Question 3: Fill up the necessary code snippets to create a Bi-gram Bag of Words Model (1 point)

#### Build the bi-gram phrase model

__Note:__ Use `min_count` and `threshold` parameters similar to the tutorial 

In [12]:
bigram = gensim.models.Phrases(norm_data, 
                               min_count=20, 
                               threshold=20, 
                               delimiter=b'_')
bigram_model = gensim.models.phrases.Phraser(bigram)

print(bigram_model[norm_data[0]][:50])

['wheres', 'thing', 'subject', 'car', 'nntp_posting', 'host', 'rac3', 'wam', 'umd_edu', 'organization_university', 'maryland_college', 'park', 'line', 'wa_wondering', 'anyone', 'could', 'enlighten', 'car', 'saw', 'day', 'wa', 'door', 'sport', 'car', 'looked', 'late', '60', 'early', '70', 'wa', 'called', 'bricklin', 'door', 'really', 'small', 'addition', 'front', 'bumper', 'wa', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'spec', 'year']
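
As a rough intuition for how `min_count` and `threshold` interact, the sketch below scores adjacent word pairs with the formula `(count(a, b) - min_count) * N / (count(a) * count(b))`, which is the form of gensim's default scorer. It is a simplified illustration, not gensim's implementation: the function name is hypothetical, and `N` is assumed here to be the number of distinct unigrams.

```python
from collections import Counter

def score_bigrams(docs, min_count=1, threshold=1.0):
    """Return {(a, b): score} for adjacent pairs whose score exceeds threshold."""
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:
        unigrams.update(doc)
        bigrams.update(zip(doc, doc[1:]))   # adjacent token pairs
    vocab_size = len(unigrams)              # assumption: unigram vocabulary only
    scores = {}
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            scores[(a, b)] = score
    return scores

# Hypothetical toy corpus
docs = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "is", "far"],
    ["it", "is", "cold"],
]
print(score_bigrams(docs, min_count=1, threshold=1.5))
# {('new', 'york'): 2.0}
```

Pairs that co-occur often relative to how frequent each word is on its own score highly, so raising `threshold` yields fewer, more strongly associated phrases.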


In [13]:
norm_corpus_bigrams = [bigram_model[doc] for doc in norm_data]

#### Generate the dictionary

In [14]:
# Create a dictionary representation of the documents.
dictionary = gensim.corpora.Dictionary(norm_corpus_bigrams)
print('Sample word to number mappings:', list(dictionary.items())[:15])
print('Total Vocabulary Size:', len(dictionary))

Sample word to number mappings: [(0, '60'), (1, '70'), (2, 'addition'), (3, 'anyone'), (4, 'body'), (5, 'bricklin'), (6, 'brought'), (7, 'bumper'), (8, 'called'), (9, 'car'), (10, 'could'), (11, 'day'), (12, 'door'), (13, 'early'), (14, 'engine')]
Total Vocabulary Size: 94305


#### Remove unnecessary terms

__Note:__ Use `no_below` and `no_above` parameters similar to the tutorial 

In [15]:
# Filter out words that occur in fewer than 20 documents,
# or in more than 60% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.6)
print('Total Vocabulary Size:', len(dictionary))

Total Vocabulary Size: 7989
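
The filtering rule is based on document frequency, that is, the number of documents a term appears in, not on its total count. A minimal pure-Python sketch of that rule follows; the helper name is hypothetical, and gensim's additional `keep_n` cap is ignored here.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Keep tokens appearing in at least `no_below` documents and
    in at most a fraction `no_above` of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))            # count each token once per document
    max_docs = int(no_above * len(docs))     # convert the fraction to a document count
    return {tok for tok, n in doc_freq.items() if no_below <= n <= max_docs}

# Hypothetical toy corpus
docs = [
    ["car", "engine", "the"],
    ["car", "the", "door"],
    ["space", "the", "nasa"],
    ["space", "the", "car"],
]
print(sorted(filter_by_doc_freq(docs, no_below=2, no_above=0.75)))
# ['car', 'space']
```

Here `the` is dropped for being in every document, while `engine`, `door` and `nasa` are dropped for being too rare.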


#### Create the Bag of Words model

In [16]:
# Transforming corpus into bag of words vectors
bow_corpus = [dictionary.doc2bow(text) for text in norm_corpus_bigrams]

In [17]:
# view sample transformation
print(bow_corpus[1][:50])

[(10, 2), (31, 1), (42, 1), (50, 1), (51, 1), (52, 2), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 2), (59, 1), (60, 1), (61, 5), (62, 1), (63, 1), (64, 1), (65, 1), (66, 2), (67, 1), (68, 2), (69, 1), (70, 1), (71, 1), (72, 2), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87, 3), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 4), (95, 1), (96, 1)]
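
Each pair in the output above is `(token_id, count)`. The sketch below mimics that representation with a hypothetical three-word vocabulary; it is an illustration of the idea rather than gensim's `doc2bow` itself.

```python
from collections import Counter

# Hypothetical token-to-id mapping standing in for the gensim dictionary
token2id = {"car": 0, "door": 1, "engine": 2}

def to_bow(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

print(to_bow(["car", "door", "car", "bumper"], token2id))
# [(0, 2), (1, 1)]
```

Tokens missing from the mapping (`bumper` here) are silently skipped, which is also why terms removed by the filtering step no longer appear in the vectors.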


## Topic Modeling using LDA

### LDA using ``MALLET``
The MALLET framework is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. MALLET stands for __MA__chine __L__earning for __L__anguag__E__ __T__oolkit. It was developed by Andrew McCallum along with several people at the University of Massachusetts Amherst. The MALLET topic modeling toolkit contains efficient, sampling-based implementations of Latent Dirichlet Allocation, Pachinko Allocation, and Hierarchical LDA. To use MALLET’s capabilities, we need to download the framework.

In [18]:
!wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip

--2021-02-08 00:54:31--  http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
Resolving mallet.cs.umass.edu (mallet.cs.umass.edu)... 128.119.246.70
Connecting to mallet.cs.umass.edu (mallet.cs.umass.edu)|128.119.246.70|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16184794 (15M) [application/zip]
Saving to: ‘mallet-2.0.8.zip’


2021-02-08 00:54:33 (9.91 MB/s) - ‘mallet-2.0.8.zip’ saved [16184794/16184794]



In [19]:
!unzip -q mallet-2.0.8.zip

## Question 4: Build an LDA topic model with MALLET (1 point)

__Hint:__ Refer to the tutorial and use a similar configuration for the model settings (hyperparameters). __Also set the total topics to be 20__

In [20]:
%%time
TOTAL_TOPICS = 20
# gensim's built-in LdaModel (not MALLET); this is the model scored in Question 6
lda_model = gensim.models.LdaModel(corpus=bow_corpus, 
                                   id2word=dictionary, 
                                   chunksize=1740, 
                                   alpha='auto', 
                                   eta='auto', 
                                   random_state=42,
                                   iterations=500, 
                                   num_topics=TOTAL_TOPICS, 
                                   passes=20, 
                                   eval_every=None)

CPU times: user 3min 57s, sys: 2min 25s, total: 6min 22s
Wall time: 3min 38s


In [21]:
%%time

MALLET_PATH = 'mallet-2.0.8/bin/mallet'
lda_mallet = gensim.models.wrappers.LdaMallet(mallet_path=MALLET_PATH, 
                                              corpus=bow_corpus, 
                                              num_topics=TOTAL_TOPICS, 
                                              id2word=dictionary,
                                              iterations=500, 
                                              workers=4)

CPU times: user 5.39 s, sys: 61.5 ms, total: 5.45 s
Wall time: 1min 26s


__The model may take some time to run depending on your system config__

## Question 5: View Topics (1 point)

__Hint:__ The _View Topics_ section in Tutorial 1 might be useful here

In [22]:
topics = [[(term, round(wt, 3)) 
               for term, wt in lda_mallet.show_topic(n, topn=20)] 
                   for n in range(0, lda_mallet.num_topics)]
topics_df = pd.DataFrame([', '.join([term for term, wt in topic])  
                              for topic in topics],
                         columns = ['Terms per Topic'],
                         index=['Topic'+str(t) for t in range(1, lda_mallet.num_topics+1)]
                         )

topics_df

| | Terms per Topic |
|---|---|
| Topic1 | god, christian, wa, jesus, people, bible, life, religion, church, belief, truth, word, faith, ha, atheist, man, law, love, christ, doe |
| Topic2 | wa, armenian, people, jew, world, war, turkish, greek, jewish, turkey, government, arab, land, attack, nazi, country, muslim, killed, ha, german |
| Topic3 | space, center, system, nasa, earth, year, technology, project, cost, launch, research, satellite, data, moon, mission, wa, science, orbit, rocket, design |
| Topic4 | game, team, player, wa, year, win, play, hockey, season, fan, ha, division, good, baseball, goal, la, boston, run, hit, league |
| Topic5 | ax, max, pl, sl, mr, tm, 1d9, wm, m3, bj, mq, ml, mn, ah, au, mi, tg, km, mw, st |
| Topic6 | line, bike, mark, ca, back, dod, mike, road, john, time, flame, left, ride, day, dog, ed, blue, motorcycle, line_article, black |
| Topic7 | drive, problem, card, system, driver, mac, scsi, window, bit, memory, computer, monitor, disk, pc, machine, apple, board, work, mode, video |
| Topic8 | key, system, information, encryption, ha, chip, message, government, security, bit, de, computer, technology, data, access, privacy, phone, clipper, communication, internet |
| Topic9 | science, ha, wa, problem, food, effect, study, theory, human, result, writes, system, disease, doctor, medical, health, patient, case, line_article, research |
| Topic10 | car, sale, price, power, buy, good, engine, sell, light, water, model, ha, box, ground, wire, cost, circuit, speed, high, mile |


## Question 6: Evaluate Model Performance (1 point)

__Note:__ print the Cv and UMass coherence scores

In [23]:
cv_coherence_model_lda = gensim.models.CoherenceModel(model=lda_model, corpus=bow_corpus, 
                                                      texts=norm_corpus_bigrams,
                                                      dictionary=dictionary, 
                                                      coherence='c_v')

avg_coherence_cv = cv_coherence_model_lda.get_coherence()

In [24]:
umass_coherence_model_lda = gensim.models.CoherenceModel(model=lda_model, corpus=bow_corpus, 
                                                         texts=norm_corpus_bigrams,
                                                         dictionary=dictionary, 
                                                         coherence='u_mass')

avg_coherence_umass = umass_coherence_model_lda.get_coherence()

In [25]:
print('Avg. Coherence Score (Cv):', avg_coherence_cv)
print('Avg. Coherence Score (UMass):', avg_coherence_umass)

Avg. Coherence Score (Cv): 0.6003681369400063
Avg. Coherence Score (UMass): -2.471784412585273
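
The UMass score is negative because it averages log conditional co-occurrence ratios, which are at most slightly above zero; values closer to zero indicate more coherent topics. The sketch below is a simplified version that assumes add-one smoothing and a plain mean over word pairs. gensim's `CoherenceModel` differs in details such as the smoothing constant, so the numbers are not comparable to the scores above.

```python
import math

def umass_coherence(top_words, docs, eps=1.0):
    """Mean of log((D(w_i, w_j) + eps) / D(w_j)) over ordered pairs j < i.
    Assumes every word in top_words occurs in at least one document."""
    doc_sets = [set(d) for d in docs]

    def doc_count(*words):
        # number of documents containing all the given words
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            scores.append(math.log((doc_count(w_i, w_j) + eps) / doc_count(w_j)))
    return sum(scores) / len(scores)

# Hypothetical toy corpus
docs = [
    ["car", "engine"],
    ["car", "engine", "wheel"],
    ["car", "wheel"],
    ["god", "faith"],
]
coherent_score = umass_coherence(["car", "engine", "wheel"], docs)
incoherent_score = umass_coherence(["car", "god", "faith"], docs)
print(coherent_score > incoherent_score)  # True
```

Words that rarely share a document pull the score down, which is what penalizes a topic that mixes unrelated terms.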


## Inference on documents

Here we take a few documents and predict / infer their topics using our trained LDA model. You can also use any new documents in this scenario, but you need to transform them into bag-of-words vectors before prediction.

#### Create a sample dataset of 3 documents

In [26]:
sample_docs = [' '.join(doc) for doc in norm_data[5:8]]
sample_docs

['foxvog douglas subject rewording second amendment idea organization vtt line article tavares writes article foxvog douglas writes article tavares writes article john lawrence rutledge writes massive destructive power many modern weapon make cost accidental crimial usage weapon great weapon mass destruction need control government individual access would result needle death million make right people keep bear many modern weapon non existant thanks stating youre coming needle say disagree every count believe individual right weapon mass destruction find hard believe would support neighbor right keep nuclear weapon biological weapon nerve gas property cannot even agree keeping weapon mass destruction hand individual hope dont sign blank check course term must rigidly defined bill doug foxvog say weapon mass destruction mean cbw nuke sarah brady say weapon mass destruction mean street sweeper shotgun semi automatic sks rifle doubt us term using quote allegedly back john lawrence rutledge

#### Check their class labels

Since this is a labeled dataset, we can see the actual class / category labels of these news posts.

In [27]:
print(np.array(newsgroups_train.target_names)[newsgroups_train.target[5:8]])

['talk.politics.guns' 'sci.med' 'comp.sys.ibm.pc.hardware']


## Question 7: Pre-process documents (1 point)

__Note:__ You can refer to Tutorial 1 or to the steps above (before building the model)

1. Tokenize the sample documents to get list of words per document (string splitting is useful here)

2. Get bigram phrases for each tokenized document using `bigram_model`

3. Use the `dictionary` built previously in the above section to get the BOW vectors using `gensim`

In [28]:
from nltk.tokenize import word_tokenize
# 1. Tokenize documents
tokenized_norm_docs = [word_tokenize(doc) for doc in sample_docs]

# 2. Bi-gram phrases for tokenized documents
bigram_data = [bigram_model[doc] for doc in tokenized_norm_docs]

# 3. BOW vectors for each document
bow_vectorized_features = [dictionary.doc2bow(text) for text in bigram_data]

## Question 8: Inference with trained topic model (1 point)

__Note:__ Use the trained `lda_mallet` model from above to predict and get the top (most dominant) topic per document. Remember to refer to the __Interpret Results__ section in Tutorial 1 if needed.

In [58]:
# Infer the topic distribution of every sample document with the trained model
predictions = lda_mallet[bow_vectorized_features]

# Keep only the most dominant (highest-weight) topic per document
top_topics = [max(doc_topics, key=lambda item: item[1])
                  for doc_topics in predictions]

# Shift topic ids by 1 so they match the 'Topic1'..'Topic20' index of topics_df
final_topics = [(topic+1, round(weight, 3)) for topic, weight in top_topics]

In [None]:
print(final_topics)

In [None]:
[topics_df.loc['Topic'+str(topic_id)]['Terms per Topic'] 
    for topic_id, weight in final_topics]
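
Picking the dominant topic amounts to taking the highest-weight entry of each document's topic distribution. A minimal sketch with made-up distributions (the weights below are hypothetical, not model output):

```python
# Hypothetical per-document topic distributions as (topic_id, weight) pairs
doc_topic_dists = [
    [(0, 0.10), (1, 0.70), (2, 0.20)],
    [(0, 0.55), (1, 0.05), (2, 0.40)],
]

# Highest-weight topic per document
dominant = [max(dist, key=lambda pair: pair[1]) for dist in doc_topic_dists]

# Shift to 1-based ids to match labels such as 'Topic1'
labelled = [(topic_id + 1, round(weight, 3)) for topic_id, weight in dominant]
print(labelled)  # [(2, 0.7), (1, 0.55)]
```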

# Topic Modeling using NMF

## Get list of documents

In [None]:
norm_docs = [' '.join(tokenized_doc) for tokenized_doc in norm_data]

## Question 9: Generate Bag of Words features (1 point)

__Note:__

1. Use `CountVectorizer` 
2. Set `min_df` as 20 and `max_df` as 0.6
3. Use both 1 and 2-grams

In [None]:
# norm_docs holds space-joined strings, so let CountVectorizer split them with its
# default token pattern; an identity tokenizer would iterate over single characters.
cv = CountVectorizer(min_df=20, max_df=0.6, ngram_range=(1, 2))
cv_features = cv.fit_transform(norm_docs)

cv_features.shape

In [None]:
vocabulary = np.array(cv.get_feature_names())
print('Total Vocabulary Size:', len(vocabulary))


## Question 10: Train NMF Topic Model (1 point)

__Note:__ You can use a similar config as Tutorial 2

In [None]:
%%time 

nmf_model = NMF(n_components=TOTAL_TOPICS, solver='cd', max_iter=500,
                random_state=42, alpha=.1, l1_ratio=.85)
document_topics = nmf_model.fit_transform(cv_features)

CPU times: user 23.9 s, sys: 19.1 s, total: 43 s
Wall time: 21.9 s
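
NMF approximates the document-term matrix `V` as a product `W @ H` of two non-negative matrices, where `W` holds document-topic weights and `H` holds topic-term weights. The sketch below illustrates the idea with the classic multiplicative update rules on a tiny made-up matrix. This is a swapped-in technique for illustration only: the cell above uses scikit-learn's coordinate descent solver with regularization, which this sketch does not reproduce, and the function name is hypothetical.

```python
import numpy as np

def nmf_multiplicative(V, k, n_iter=200, seed=0, eps=1e-10):
    """Factor V (n x m, non-negative) into W (n x k) and H (k x m)
    using multiplicative updates for the Frobenius objective."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update topic-term weights
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update document-topic weights
    return W, H

# Hypothetical 4-document, 4-term count matrix with two obvious blocks
V = np.array([[3, 2, 0, 0],
              [2, 3, 0, 0],
              [0, 0, 4, 1],
              [0, 0, 1, 4]], dtype=float)

W, H = nmf_multiplicative(V, k=2)
error = np.linalg.norm(V - W @ H)
print(W.shape, H.shape, round(float(error), 3))
```

Because the updates only multiply by non-negative ratios, `W` and `H` stay non-negative throughout, which is what makes the resulting topics interpretable as additive combinations of terms.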


## Question 11: Display Topics and their Terms (2 points)

__Note:__ We have done a similar exercise in Tutorial 2

In [None]:
top_terms = 20  # number of terms to display per topic

topic_terms = nmf_model.components_
topic_key_term_idxs = np.argsort(-np.absolute(topic_terms), axis=1)[:, :top_terms]
topic_keyterms = vocabulary[topic_key_term_idxs]
topics = [', '.join(topic) for topic in topic_keyterms]
pd.set_option('display.max_colwidth', None)
topics_df = pd.DataFrame(topics,
                         columns = ['Terms per Topic'],
                         index=['Topic'+str(t) for t in range(1, TOTAL_TOPICS+1)])
topics_df
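
The term lookup relies on sorting each row of the topic-term matrix by descending weight and using the resulting indices to index into the vocabulary array. A minimal sketch with a hypothetical two-topic matrix:

```python
import numpy as np

# Hypothetical vocabulary and topic-term weights (2 topics x 5 terms)
vocab = np.array(["car", "engine", "god", "faith", "wheel"])
topic_term = np.array([[0.9, 0.7, 0.0, 0.1, 0.5],
                       [0.0, 0.1, 0.8, 0.6, 0.0]])

top_n = 3
# Negate so argsort (ascending) yields indices in descending weight order
top_idx = np.argsort(-topic_term, axis=1)[:, :top_n]
top_terms_per_topic = vocab[top_idx]   # fancy indexing maps ids back to terms

print(top_terms_per_topic.tolist())
# [['car', 'engine', 'wheel'], ['god', 'faith', 'engine']]
```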