## Practical 1. Topic Modelling using LDA
### Strictly for internal use at Singapore Polytechnic. Do not disclose!

In [1]:
from sklearn.datasets import fetch_20newsgroups

In [2]:
newsgroups_train = fetch_20newsgroups(subset='train', shuffle = True)
newsgroups_test = fetch_20newsgroups(subset='test', shuffle = True)

In [3]:
print(list(newsgroups_train.target_names))

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


### As you can see, the news categories share some distinct themes, such as sports, religion, science, technology and politics.
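One rough way to see these themes is to group the category names by the part before the first dot. The sketch below copies the names from the output above; grouping by prefix is only an illustrative choice, not part of the practical itself.

```python
from collections import defaultdict

# Category names copied from the target_names output above.
target_names = [
    'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
    'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
    'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
    'sci.space', 'soc.religion.christian', 'talk.politics.guns',
    'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc',
]

# Group categories by the prefix before the first dot (illustrative grouping).
themes = defaultdict(list)
for name in target_names:
    themes[name.split('.')[0]].append(name)

for prefix, names in sorted(themes.items()):
    print(f"{prefix}: {len(names)} categories")
```

The `comp`, `rec`, `sci` and `talk` prefixes account for most of the 20 categories, which is why a small number of topics can already capture much of the structure.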

In [4]:
# Let's look at a sample news post
newsgroups_train.data[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

## Step 2: Data Preprocessing

In [5]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import numpy as np

# Fix NumPy's global random seed so that repeated runs are more comparable
np.random.seed(400)

In [6]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Wilson\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [7]:
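stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()  # create once and reuse for every token

def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then reduce it to its stem
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))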

# Tokenize, drop stopwords and short tokens, then lemmatize and stem
def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        # Keep only non-stopword tokens longer than 3 characters
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [8]:
processed_docs = [preprocess(doc) for doc in newsgroups_train.data]

In [9]:
print(processed_docs[0])

['lerxst', 'thing', 'subject', 'nntp', 'post', 'host', 'organ', 'univers', 'maryland', 'colleg', 'park', 'line', 'wonder', 'enlighten', 'door', 'sport', 'look', 'late', 'earli', 'call', 'bricklin', 'door', 'small', 'addit', 'bumper', 'separ', 'rest', 'bodi', 'know', 'tellm', 'model', 'engin', 'spec', 'year', 'product', 'histori', 'info', 'funki', 'look', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst']
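Note that the word "car" is missing from the output above, even though the post is about a car: it is dropped by the `len(token) > 3` filter. The sketch below reproduces only that filtering logic. The regex tokenizer and the tiny `TOY_STOPWORDS` set are hypothetical stand-ins for `simple_preprocess` and `STOPWORDS`, and no lemmatizing or stemming is applied.

```python
import re

# Hypothetical stand-in for gensim's STOPWORDS; far smaller than the real set.
TOY_STOPWORDS = {"this", "that", "what", "from", "there", "could"}

def toy_filter(text):
    """Lowercase, split into alphabetic tokens, then drop stopwords
    and tokens of 3 characters or fewer."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) > 3]

print(toy_filter("WHAT car is this!? It was a 2-door sports car."))
```

Only "door" and "sports" survive here; "car" is removed purely because of its length, so a stricter or looser length threshold changes which content words reach the topic model.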


## Step 3: Bag of words on the dataset

In [33]:
'''
Create a dictionary from 'processed_docs' that maps each unique word to an integer id
and records word frequency statistics for the training set,
using gensim.corpora.Dictionary, and call it 'dictionary'
'''
dictionary = gensim.corpora.Dictionary(processed_docs)

In [34]:

'''
OPTIONAL STEP
Remove very rare and very common words:

- words appearing in fewer than 15 documents
- words appearing in more than 10% of all documents
- after that, keep only the 100000 most frequent words
'''
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n= 100000)
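Both thresholds are based on document frequency (how many documents contain a word), not on the raw number of occurrences. The following is a minimal pure-Python sketch of that rule under stated assumptions: `toy_filter_extremes` is a hypothetical helper written for illustration, it ignores `keep_n`, and it is not gensim's implementation.

```python
from collections import Counter

def toy_filter_extremes(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents (document frequency, not raw count)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    num_docs = len(docs)
    return {tok for tok, df in doc_freq.items()
            if df >= no_below and df / num_docs <= no_above}

docs = [
    ["space", "nasa", "orbit", "orbit", "orbit"],
    ["space", "game", "team"],
    ["space", "game", "nasa"],
    ["space", "hockey"],
]
kept = toy_filter_extremes(docs, no_below=2, no_above=0.5)
print(sorted(kept))
```

In this toy corpus "orbit" occurs three times but only in one document, so it is treated as rare, while "space" appears in every document and is treated as too common.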

In [35]:
'''
Create the bag-of-words model for each document, i.e. for each document we create a list of
(word id, count) pairs reporting which words appear and how many times. Save this to 'bow_corpus'
'''
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
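To make the bag-of-words format concrete, here is a small sketch of the idea behind `doc2bow`. The helper `toy_doc2bow` and the three-word vocabulary are hypothetical and exist only for illustration; the real method works on the `dictionary` built above.

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; tokens missing
    from the vocabulary are ignored."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

token2id = {"door": 0, "look": 1, "sport": 2}  # hypothetical vocabulary
doc = ["door", "sport", "look", "door", "bricklin", "look"]
print(toy_doc2bow(doc, token2id))
```

Word order is discarded and "bricklin" disappears because it is not in the vocabulary, which mirrors what happens to words removed by the filtering step.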

In [36]:
'''
Preview BOW for a sample preprocessed document
'''
document_num = 20
bow_doc_x = bow_corpus[document_num]

# Each entry is a (word id, count) pair
for word_id, count in bow_doc_x:
    print("Word {} (\"{}\") appears {} time.".format(word_id, dictionary[word_id], count))

Word 18 ("rest") appears 1 time.
Word 166 ("clear") appears 1 time.
Word 336 ("refer") appears 1 time.
Word 350 ("true") appears 1 time.
Word 391 ("technolog") appears 1 time.
Word 437 ("christian") appears 1 time.
Word 453 ("exampl") appears 1 time.
Word 476 ("jew") appears 1 time.
Word 480 ("lead") appears 1 time.
Word 482 ("littl") appears 3 time.
Word 520 ("wors") appears 2 time.
Word 721 ("keith") appears 3 time.
Word 732 ("punish") appears 1 time.
Word 803 ("california") appears 1 time.
Word 859 ("institut") appears 1 time.
Word 917 ("similar") appears 1 time.
Word 990 ("allan") appears 1 time.
Word 991 ("anti") appears 1 time.
Word 992 ("arriv") appears 1 time.
Word 993 ("austria") appears 1 time.
Word 994 ("caltech") appears 2 time.
Word 995 ("distinguish") appears 1 time.
Word 996 ("german") appears 1 time.
Word 997 ("germani") appears 3 time.
Word 998 ("hitler") appears 1 time.
Word 999 ("livesey") appears 2 time.
Word 1000 ("motto") appears 2 time.
Word 1001 ("order") appear

## Step 4: Running LDA using Bag of Words

1. **num_topics** is the number of requested latent topics to be extracted from the training corpus.
2. **id2word** is a mapping from word ids (integers) to words (strings).
3. **passes** is the number of training passes through the corpus.
4. **workers** is the number of extra processes to use for parallelization.

In [37]:
lda_model = gensim.models.LdaMulticore(bow_corpus,
                                       num_topics=8,
                                       id2word=dictionary,
                                       passes=10,
                                       workers=2)

In [38]:
# For each topic, we will explore the words occurring in that topic and their relative weights
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

Topic: 0 
Words: 0.012*"armenian" + 0.010*"israel" + 0.008*"isra" + 0.006*"turkish" + 0.006*"kill" + 0.006*"jew" + 0.005*"arab" + 0.005*"govern" + 0.004*"live" + 0.004*"attack"


Topic: 1 
Words: 0.009*"govern" + 0.006*"presid" + 0.005*"american" + 0.005*"weapon" + 0.005*"clinton" + 0.004*"gun" + 0.004*"crime" + 0.004*"firearm" + 0.003*"public" + 0.003*"polic"


Topic: 2 
Words: 0.022*"window" + 0.012*"file" + 0.009*"program" + 0.007*"version" + 0.006*"graphic" + 0.006*"color" + 0.006*"imag" + 0.006*"display" + 0.006*"server" + 0.005*"applic"


Topic: 3 
Words: 0.005*"pitt" + 0.005*"food" + 0.004*"bank" + 0.004*"medic" + 0.004*"research" + 0.004*"scienc" + 0.004*"health" + 0.004*"studi" + 0.004*"program" + 0.004*"entri"


Topic: 4 
Words: 0.014*"game" + 0.011*"team" + 0.009*"play" + 0.007*"player" + 0.006*"bike" + 0.005*"hockey" + 0.005*"season" + 0.004*"score" + 0.004*"leagu" + 0.003*"john"


Topic: 5 
Words: 0.012*"drive" + 0.009*"chip" + 0.008*"card" + 0.008*"file" + 0.008*"encrypt"
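Each topic is printed as a single string of `weight*"word"` terms. If the weights are needed as numbers, one option is to parse the string, as in the sketch below (`parse_topic` is a hypothetical helper; the input is the Topic 4 string copied from the output above). gensim's `show_topic` method returns similar (word, weight) pairs directly, so parsing is only needed when working from saved text output.

```python
import re

# Topic 4 string copied from the output above.
topic_str = ('0.014*"game" + 0.011*"team" + 0.009*"play" + 0.007*"player" + '
             '0.006*"bike" + 0.005*"hockey" + 0.005*"season" + 0.004*"score" + '
             '0.004*"leagu" + 0.003*"john"')

def parse_topic(topic):
    """Turn a 'weight*"word" + ...' string into a list of (word, weight) pairs."""
    return [(word, float(weight))
            for weight, word in re.findall(r'([0-9.]+)\*"([^"]+)"', topic)]

pairs = parse_topic(topic_str)
print(pairs[:3])
```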

### Classification of the topics
Using the words in each topic and their corresponding weights, these are the categories we can probably infer:

0: politics
1: gun violence
2: graphic card
3: food and health
4: sports
5: encryption
6: religion
7: space

## Step 5: Testing on unseen document

In [81]:
num = 600
unseen_document = newsgroups_test.data[num]
print(unseen_document)

From: mccall@mksol.dseg.ti.com (fred j mccall 575-3539)
Subject: Re: Vandalizing the sky.
Organization: Texas Instruments Inc
Lines: 65

In <C63nA8.4C1@news.cso.uiuc.edu> gfk39017@uxa.cso.uiuc.edu (George F. Krumins) writes:

>I was suggesting that the minority of professional and amateur astronomers
>have the right to a dark, uncluttered night sky.

And from whence does this right stem, that it overrides the 'rights'
of the rest of us?

>Let me give you an example.  When you watch TV, they have commercials to pay
>for the programming.  You accept that as part of watching.  If you don't like
>it, you can turn it off.  If you want to view the night sky, and there is a
>floating billboard out there, you can't turn it off.  It's the same 
>reasoning that limits billboards in scenic areas.

And if you want to view that television station, you have to watch the
commercials.  You can't turn them off and still be viewing the
television station.  In other words, if you don't like what you see,

In [82]:
# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

In [83]:
for index, score in sorted(lda_model[bow_vector], key=lambda tup: tup[1], reverse=True):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.4081532657146454	 Topic: 0.013*"space" + 0.009*"nasa" + 0.006*"power" + 0.005*"wire" + 0.005*"orbit"
Score: 0.2321239560842514	 Topic: 0.012*"christian" + 0.008*"jesus" + 0.006*"exist" + 0.005*"moral" + 0.005*"bibl"
Score: 0.22693298757076263	 Topic: 0.009*"govern" + 0.006*"presid" + 0.005*"american" + 0.005*"weapon" + 0.005*"clinton"
Score: 0.10106444358825684	 Topic: 0.012*"armenian" + 0.010*"israel" + 0.008*"isra" + 0.006*"turkish" + 0.006*"kill"
Score: 0.02939053624868393	 Topic: 0.005*"pitt" + 0.005*"food" + 0.004*"bank" + 0.004*"medic" + 0.004*"research"
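To turn this distribution into a single prediction, one simple approach is to take the topic with the highest score and look up its manual label. The sketch below hard-codes the scores (rounded from the output above) so it runs without the model. The topic ids are an assumption, inferred by matching the printed words to the label list in Step 4.

```python
# Manual labels from the "Classification of the topics" list in Step 4.
topic_labels = {0: "politics", 1: "gun violence", 2: "graphic card",
                3: "food and health", 4: "sports", 5: "encryption",
                6: "religion", 7: "space"}

# (topic_id, score) pairs; scores rounded from the output above,
# topic ids inferred from the printed topic words (an assumption).
doc_topics = [(0, 0.101), (1, 0.227), (3, 0.029), (6, 0.232), (7, 0.408)]

# Pick the pair with the largest score.
best_id, best_score = max(doc_topics, key=lambda pair: pair[1])
print(f"Predicted topic: {topic_labels[best_id]} (score {best_score:.3f})")
```

The second and third scores are close to each other, so it can be worth reporting the top few topics rather than only the maximum.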


In [84]:
cat_num = newsgroups_test.target[num]
print(f'category number: {cat_num}')

category number: 14


In [85]:
print(f'category name: {newsgroups_test.target_names[cat_num]}')

category name: sci.space
