# Topic Modeling Using Latent Dirichlet Allocation
#### LDA assigns the text of a document to one or more topics. It builds a topics-per-document model and a words-per-topic model, both with Dirichlet priors.
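
The generative story behind LDA can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the four-word vocabulary, the two topics, and the prior value 0.5 are made up, and the Dirichlet draws use normalized Gamma samples from the standard library rather than gensim's implementation.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one toy document following the LDA generative story."""
    # 1. Draw this document's topic mixture from a Dirichlet prior.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position from the document's mixture.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return theta, words

rng = random.Random(0)
vocab = ["drive", "disk", "game", "team"]  # hypothetical toy vocabulary
# Each topic's word distribution is itself a Dirichlet draw (one row per topic).
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(2)]
theta, words = generate_document(topic_word, vocab, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(words)
```

Training LDA runs this story in reverse: given only the words, it estimates the topic mixture of each document and the word distribution of each topic.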

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 

In [2]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
import gensim

## Step 1: Load the dataset
#### The dataset used is the 20 Newsgroups dataset, which is available from sklearn. It contains newsgroup posts grouped into 20 categories.

In [4]:
from sklearn.datasets import fetch_20newsgroups

In [5]:
newsgroups_train = fetch_20newsgroups(subset='train', shuffle = True)
newsgroups_test = fetch_20newsgroups(subset='test', shuffle = True)

#### Category names of the newsgroups

In [6]:
newsgroups_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [7]:
type(newsgroups_train.data)

list

In [8]:
len(newsgroups_train.data)

11314

In [9]:
data=newsgroups_train.data

In [10]:
data[2]

'From: twillis@ec.ecn.purdue.edu (Thomas E Willis)\nSubject: PB questions...\nOrganization: Purdue University Engineering Computer Network\nDistribution: usa\nLines: 36\n\nwell folks, my mac plus finally gave up the ghost this weekend after\nstarting life as a 512k way back in 1985.  sooo, i\'m in the market for a\nnew machine a bit sooner than i intended to be...\n\ni\'m looking into picking up a powerbook 160 or maybe 180 and have a bunch\nof questions that (hopefully) somebody can answer:\n\n* does anybody know any dirt on when the next round of powerbook\nintroductions are expected?  i\'d heard the 185c was supposed to make an\nappearence "this summer" but haven\'t heard anymore on it - and since i\ndon\'t have access to macleak, i was wondering if anybody out there had\nmore info...\n\n* has anybody heard rumors about price drops to the powerbook line like the\nones the duo\'s just went through recently?\n\n* what\'s the impression of the display on the 180?  i could probably swin

## Step 2: Data Preprocessing

### Perform the following steps:

#### Tokenization: Lowercase the text, replace every character other than a-z with a space, and split the result into words.
#### Words with 3 or fewer characters are removed.
#### All stopwords are removed.
#### Words are lemmatized: the WordNet lemmatizer, called without a part-of-speech tag, treats each word as a noun, so plurals are reduced to the singular ("questions" becomes "question").
#### Stemming is not applied in the code below, so verb forms such as "starting" or "gave" are kept as they are.
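
The filtering part of these steps can be tried on a single sentence without any NLTK data. This is a minimal sketch, not the notebook's exact pipeline: `STOP_WORDS` is a hypothetical, tiny stopword list standing in for NLTK's English list, `preprocess` is an illustrative helper name, and lemmatization is left out so the example stays dependency-free.

```python
import re

# Hypothetical, tiny stopword list for illustration; the notebook uses NLTK's English list.
STOP_WORDS = {"the", "this", "that", "with", "have", "from", "were", "after"}

def preprocess(text, min_len=4):
    """Lowercase, strip non-letters, then drop stopwords and short tokens."""
    text = text.lower()
    # Replace every character that is not a-z with a space (removes digits and punctuation).
    text = re.sub(r"[^a-z]", " ", text)
    # split() with no argument also collapses runs of whitespace.
    tokens = text.split()
    return [t for t in tokens if len(t) >= min_len and t not in STOP_WORDS]

print(preprocess("My Mac Plus finally gave up the ghost this weekend, after 1985!"))
# ['plus', 'finally', 'gave', 'ghost', 'weekend']
```

Note that `min_len=4` keeps only words of four or more characters, which is the same condition as `len(word) > 3`.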

In [11]:
lemma=WordNetLemmatizer()

In [12]:
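# nltk.download('stopwords'); nltk.download('wordnet')  # needed once if the NLTK data is missing
stop_words = set(stopwords.words('english'))  # build the stopword set once, not once per word
new_data=[]
for a in data:
    a=a.lower()
    text=re.sub('[^a-z]',' ',a)  # replace everything except a-z with a space
    text=text.split(' ')
    # keep words longer than 3 characters that are not stopwords, then lemmatize them
    text=[lemma.lemmatize(word) for word in text if word not in stop_words and len(word) > 3]
    text=' '.join(text)
    new_data.append(text)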

In [13]:
print(len(new_data))

11314


In [14]:
new_data[2]

'twillis purdue thomas willis subject question organization purdue university engineering computer network distribution line well folk plus finally gave ghost weekend starting life back sooo market machine sooner intended looking picking powerbook maybe bunch question hopefully somebody answer anybody know dirt next round powerbook introduction expected heard supposed make appearence summer heard anymore since access macleak wondering anybody info anybody heard rumor price drop powerbook line like one went recently impression display could probably swing disk rather really feel much better display look great store really good could solicit opinion people worth taking disk size money active display realize real subjective question played around machine computer store breifly figured opinion somebody actually us machine daily might prove helpful well hellcat perform thanks bunch advance info could email post summary news reading time premium final around corner willis twillis purdue purd

## Step 3: Bag of words on the dataset

In [15]:
vector=CountVectorizer()

In [16]:
X = vector.fit_transform(new_data)

In [17]:
X.shape

(11314, 73210)

#### Convert sparse matrix to gensim corpus.

In [18]:
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

#### Mapping from word IDs to words

In [19]:
id_map = {idx: word for word, idx in vector.vocabulary_.items()}
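
To make the three objects in this step concrete (count matrix, gensim-style corpus, and id-to-word map), here is a small sketch in plain Python. The two toy documents are made up, and `collections.Counter` stands in for `CountVectorizer`; the only property borrowed from it is that column ids follow the sorted order of the words.

```python
from collections import Counter

docs = ["drive disk drive", "game team game game"]  # hypothetical cleaned documents

# Vocabulary: word -> column id, in sorted word order.
all_words = sorted({w for d in docs for w in d.split()})
vocabulary = {word: i for i, word in enumerate(all_words)}

# Sparse bag of words: one list of (word_id, count) pairs per document,
# which is the shape of corpus that gensim models iterate over.
corpus = [sorted((vocabulary[w], c) for w, c in Counter(d.split()).items()) for d in docs]

# Inverse mapping id -> word, the same shape as id_map in the notebook.
id_map = {i: w for w, i in vocabulary.items()}

print(vocabulary)  # {'disk': 0, 'drive': 1, 'game': 2, 'team': 3}
print(corpus)      # [[(0, 1), (1, 2)], [(2, 3), (3, 1)]]
print([[(id_map[i], c) for i, c in doc] for doc in corpus])
```

Word order inside each document is discarded; only the counts survive, which is all LDA uses.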

## Step 4: Running LDA using Bag of Words

In [20]:
lda_model = gensim.models.LdaModel(corpus,id2word=id_map,num_topics=8,passes=35)

In [29]:
topics_words = lda_model.print_topics(num_topics=8,num_words=10)

In [30]:
for no,top in topics_words:
    print('topic',no)
    print(top)
    print('\n')

topic 0
0.022*"drive" + 0.016*"card" + 0.011*"scsi" + 0.010*"disk" + 0.009*"driver" + 0.008*"problem" + 0.008*"system" + 0.007*"line" + 0.007*"subject" + 0.007*"organization"


topic 1
0.012*"chip" + 0.012*"encryption" + 0.011*"clipper" + 0.007*"government" + 0.007*"key" + 0.007*"security" + 0.006*"system" + 0.006*"public" + 0.005*"privacy" + 0.005*"escrow"


topic 2
0.021*"line" + 0.020*"subject" + 0.019*"organization" + 0.012*"posting" + 0.011*"nntp" + 0.011*"host" + 0.011*"article" + 0.011*"writes" + 0.010*"university" + 0.007*"would"


topic 3
0.011*"game" + 0.009*"team" + 0.008*"line" + 0.008*"year" + 0.008*"subject" + 0.007*"organization" + 0.007*"writes" + 0.006*"player" + 0.005*"article" + 0.005*"play"


topic 4
0.010*"people" + 0.009*"would" + 0.006*"think" + 0.006*"writes" + 0.006*"subject" + 0.005*"line" + 0.005*"organization" + 0.005*"know" + 0.005*"article" + 0.005*"christian"


topic 5
0.008*"people" + 0.006*"state" + 0.006*"government" + 0.005*"armenian" + 0.005*"would" 
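
Each printed topic is a string of `weight*"word"` terms. If the words and weights are needed as data, for example to filter or plot them, the string can be parsed with a regular expression. A minimal sketch, where `parse_topic` is a hypothetical helper name and the input is the start of topic 0 from the output above:

```python
import re

def parse_topic(topic_string):
    """Turn '0.022*"drive" + 0.016*"card"' into [('drive', 0.022), ('card', 0.016)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic_0 = '0.022*"drive" + 0.016*"card" + 0.011*"scsi" + 0.010*"disk"'
print(parse_topic(topic_0))
# [('drive', 0.022), ('card', 0.016), ('scsi', 0.011), ('disk', 0.01)]
```

With the trained model in memory, `lda_model.show_topic(topic_id, topn=10)` should return the same kind of (word, weight) pairs directly, so parsing is mainly useful when only the printed output is available.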

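## Step 5: Classification of the topics
#### Labels are assigned by reading the most probable words of each topic. Topic 2 consists mostly of header words (line, subject, organization) and gets no label.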


* Hardware
* Encryption
* Sports
* Religion
* Politics
* Graphics
* Space


## Step 6: Testing model on unseen document

In [23]:
num=20
unseen_document = newsgroups_test.data[num]
print(unseen_document)

From: Matjaz.Gams@IJS.si
Subject: ``Call for Papers for INFORMATIKA'' Vol. 17 No. 2
Originator: abrodnik@watdragon.uwaterloo.ca
Organization: Jozef Stefan Institute, Ljubljana, Slovenia
Lines: 69

This is an invitation to send articles to the Informatica magazine.
The first fully international issue has been published and echoes 
are quite favourable. For any information, contact (matjaz.gams@ijs.si). 

Dear Colleague,                                        April 25, 1993

Number 1 of volume 17 of Informatica is now out of print and some of you 
will receive it in a week or so. As you will see, the journal is structured
in the following way: the editorial (first page); profiles (second page
-- biography of an editor, in this issue, Terry Winograd); the edited
part of papers (pp. 3-80); mission and research reports (A plan for
knowledge archives project in Japan and CSLI in Stanford, pp. 81-100);
and news and announcements (pp. 101-108). This structure is mentioned to
give you a suggest

In [24]:
print(newsgroups_test.target[num])

1


In [25]:
list(newsgroups_test.target_names)

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [26]:
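stop_words = set(stopwords.words('english'))  # build the stopword set once

def clean_text(a):
    # same cleaning as for the training data: lowercase, keep a-z only,
    # drop stopwords and words of 3 or fewer characters, lemmatize
    a = a.lower()
    text = re.sub('[^a-z]', ' ', a)
    text = text.split(' ')
    text = [lemma.lemmatize(word) for word in text if word not in stop_words and len(word) > 3]
    text = ' '.join(text)
    return text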

In [27]:
unseen_document=clean_text(unseen_document)
unseen_document

'matjaz gam subject call paper informatika originator abrodnik watdragon uwaterloo organization jozef stefan institute ljubljana slovenia line invitation send article informatica magazine first fully international issue published echo quite favourable information contact matjaz gam dear colleague april number volume informatica print receive week journal structured following editorial first page profile second page biography editor issue terry winograd edited part paper mission research report plan knowledge archive project japan csli stanford news announcement structure mentioned give suggestion could help make content journal significant diverse interesting bringing view discourse great emphasis given called editorial page page express opinion belief writing editor problem within scope computing informatics extending concerning discipline cybernetics advanced cognitive science mind informationally concerned neural science advanced technology photonics asked professor terry winograd w

In [28]:
y = vector.transform([unseen_document])
y_corpus = gensim.matutils.Sparse2Corpus(y, documents_columns=False)
prediction = lda_model.get_document_topics(y_corpus)
list(sorted(prediction))

[[(1, 0.021908386),
  (2, 0.019237904),
  (4, 0.18396628),
  (5, 0.2053172),
  (6, 0.3739754),
  (7, 0.19483472)]]

Model predicted Graphics with about 37% probability (topic 6, 0.374)

#### The model's most probable topic for the unseen document, Graphics, agrees with its true category, comp.graphics (target 1).

In [38]:
num=158
unseen_document = newsgroups_test.data[num]
print(unseen_document)

From: nhmas@gauss.med.harvard.edu (Mark Shneyder 432-4219)
Subject: Re: BUFFALO 4, Boston 0: Defense!!
Organization: HMS
Lines: 27
NNTP-Posting-Host: gauss.med.harvard.edu

In article <1993Apr21.142357.14164@alchemy.chem.utoronto.ca> golchowy@alchemy.chem.utoronto.ca (Gerald Olchowy) writes:
>Fuhr is proving the Fuhr-bashers wrong, but Boston is an awfully
>good team.
>

Yeh,but :

1.Biran Sutter's playoff record as the head coach in St.L wasn't very
impressive. His Blues teams were eliminated very early in the playoffs.
It doesn't look like this trend will change with the Bruins.

2. Bruins have never come back to win after falling behind 2-0 in their
entire 68-year history. It doesn't look like Buffalo will just lose
their next two games at the Aud with the way Fuhr has been standing on
his head.

Basically,the Bruins will be on the golf course by next weekend.
Also,it seems like the whole Boston area has gone baseball crazy after
an incredibly great start by the Red Sox(best record 

In [39]:
print(newsgroups_test.target[num])

10


In [40]:
unseen_document=clean_text(unseen_document)
unseen_document

'nhmas gauss harvard mark shneyder subject buffalo boston defense organization line nntp posting host gauss harvard article alchemy chem utoronto golchowy alchemy chem utoronto gerald olchowy writes fuhr proving fuhr bashers wrong boston awfully good team biran sutter playoff record head coach impressive blue team eliminated early playoff look like trend change bruin bruin never come back falling behind entire year history look like buffalo lose next game fuhr standing head basically bruin golf course next weekend also seems like whole boston area gone baseball crazy incredibly great start best record major league would caller bruin flagship station sportsradio talking yapping patriot upcoming draft celtic bruin somehow around really schocked bruin folding early mark'

In [41]:
y = vector.transform([unseen_document])
y_corpus = gensim.matutils.Sparse2Corpus(y, documents_columns=False)
prediction = lda_model.get_document_topics(y_corpus)
list(sorted(prediction))

[[(2, 0.07447449), (3, 0.902919), (5, 0.01685181)]]

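Model predicted Sports with about 90% probability (topic 3), which agrees with the true category, rec.sport.hockey (target 10)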
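
Only three of the eight topics appear in this prediction, and their probabilities add up to a little less than 1. As far as I know, that is because `get_document_topics` drops topics below the model's `minimum_probability` threshold (0.01 by default in gensim). The sketch below mimics that filtering on a made-up full distribution; `visible_topics` is a hypothetical helper, not a gensim function.

```python
def visible_topics(full_distribution, minimum_probability=0.01):
    """Keep only the (topic_id, probability) pairs at or above the threshold."""
    return [(t, p) for t, p in enumerate(full_distribution) if p >= minimum_probability]

# Hypothetical full distribution over 8 topics for one document (sums to 1).
full = [0.001, 0.002, 0.074, 0.903, 0.001, 0.017, 0.001, 0.001]
shown = visible_topics(full)
print(shown)                               # [(2, 0.074), (3, 0.903), (5, 0.017)]
print(round(sum(p for _, p in shown), 3))  # 0.994
```

Passing `minimum_probability=0` to `get_document_topics` should return an entry for every topic if the full distribution is needed.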

## Step 7: Saving The model

In [31]:
lda_model.save('topic_finding')
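
The saved model can later be restored with `gensim.models.LdaModel.load('topic_finding')`. To classify new text, the word-to-id vocabulary of the fitted `CountVectorizer` is needed as well, so it may be worth saving it next to the model. A minimal sketch using JSON, under assumptions: the small vocabulary and the file name `vocabulary.json` are made up, and in the notebook the dictionary would come from `vector.vocabulary_`.

```python
import json
import os
import tempfile

# Hypothetical small vocabulary; in the notebook this would be vector.vocabulary_.
source_vocabulary = {"disk": 0, "drive": 1, "game": 2, "team": 3}

# The ids may be NumPy integers, which json cannot serialize, so cast them to int.
vocabulary = {word: int(idx) for word, idx in source_vocabulary.items()}

path = os.path.join(tempfile.mkdtemp(), "vocabulary.json")  # hypothetical file name
with open(path, "w", encoding="utf-8") as fh:
    json.dump(vocabulary, fh)

with open(path, encoding="utf-8") as fh:
    restored = json.load(fh)

# Rebuild the id -> word mapping that was passed to the model as id2word.
id_map = {idx: word for word, idx in restored.items()}
print(restored == vocabulary, id_map[1])  # True drive
```

The restored dictionary can be passed as `CountVectorizer(vocabulary=restored)` so that new documents are mapped to the same column ids the model was trained on.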