# Topic Modelling Using Latent Dirichlet Allocation
#### LDA assigns the text of a document to one or more topics. It builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.
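
To make the two Dirichlet-distributed components concrete, here is a minimal sketch of the generative story that LDA assumes, using only the standard library. The toy vocabulary, the number of topics, and the concentration values `alpha` and `beta` are illustrative assumptions and do not come from this notebook; `sample_dirichlet` and `generate_document` are hypothetical helper names.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(vocab, num_topics, doc_length, alpha, beta, rng):
    # Words-per-topic model: one distribution over the vocabulary per topic.
    # (Sampled here for brevity; in LDA the topics are shared by the whole corpus.)
    topic_word = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    # Topics-per-document model: one distribution over topics for this document.
    doc_topic = sample_dirichlet([alpha] * num_topics, rng)
    words = []
    for _ in range(doc_length):
        # Pick a topic for this word position, then a word from that topic.
        topic = rng.choices(range(num_topics), weights=doc_topic)[0]
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return doc_topic, words

rng = random.Random(0)
vocab = ["disk", "drive", "goal", "team", "orbit", "launch"]  # toy vocabulary (assumption)
doc_topic, words = generate_document(vocab, num_topics=3, doc_length=8,
                                     alpha=0.5, beta=0.5, rng=rng)
print(doc_topic)
print(words)
```

Training LDA runs this story in reverse: given only the observed words, it estimates the topic mixtures and the word distributions that most plausibly produced them.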

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 

In [2]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
import gensim

## Step 1: Load the dataset
#### The dataset used is the 20 Newsgroups dataset, which is available through sklearn. It contains news articles grouped into 20 categories.

In [4]:
from sklearn.datasets import fetch_20newsgroups

In [5]:
newsgroups_train = fetch_20newsgroups(subset='train', shuffle = True)
newsgroups_test = fetch_20newsgroups(subset='test', shuffle = True)

#### Topics of different news groups

In [6]:
newsgroups_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [7]:
type(newsgroups_train.data)

list

In [8]:
len(newsgroups_train.data)

11314

In [9]:
data=newsgroups_train.data

In [10]:
data[2]

'From: twillis@ec.ecn.purdue.edu (Thomas E Willis)\nSubject: PB questions...\nOrganization: Purdue University Engineering Computer Network\nDistribution: usa\nLines: 36\n\nwell folks, my mac plus finally gave up the ghost this weekend after\nstarting life as a 512k way back in 1985.  sooo, i\'m in the market for a\nnew machine a bit sooner than i intended to be...\n\ni\'m looking into picking up a powerbook 160 or maybe 180 and have a bunch\nof questions that (hopefully) somebody can answer:\n\n* does anybody know any dirt on when the next round of powerbook\nintroductions are expected?  i\'d heard the 185c was supposed to make an\nappearence "this summer" but haven\'t heard anymore on it - and since i\ndon\'t have access to macleak, i was wondering if anybody out there had\nmore info...\n\n* has anybody heard rumors about price drops to the powerbook line like the\nones the duo\'s just went through recently?\n\n* what\'s the impression of the display on the 180?  i could probably swin

## Step 2: Data Preprocessing

### Perform the following steps:

#### Tokenization: Lowercase the text, replace every non-letter character with a space, and split the result into words.
#### Words with 3 or fewer characters are removed.
#### All stopwords are removed.
#### Words are lemmatized - each word is reduced to its dictionary form with WordNetLemmatizer, which treats words as nouns by default (for example, "questions" becomes "question").
#### Stemming (reducing words to their root form) is not applied in the code below; only lemmatization is used.
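
The cleaning steps can be sketched on a single sentence using only the standard library. The stopword set here is a tiny hand-written stand-in for NLTK's English list, and lemmatization is left out so that the sketch needs no downloaded corpora; `simple_clean` and `TOY_STOPWORDS` are hypothetical names used for illustration only.

```python
import re

# Tiny illustrative stopword set (assumption); the notebook uses NLTK's English list.
TOY_STOPWORDS = {"this", "that", "with", "have", "from", "been", "were"}

def simple_clean(text, stopwords=TOY_STOPWORDS, min_len=4):
    text = text.lower()
    # Replace everything that is not a lowercase letter with a space.
    text = re.sub(r"[^a-z]", " ", text)
    tokens = text.split()
    # Keep words of at least min_len characters that are not stopwords.
    kept = [t for t in tokens if len(t) >= min_len and t not in stopwords]
    return " ".join(kept)

cleaned = simple_clean("I have been looking at the PowerBook 160, with 4MB of RAM!")
print(cleaned)  # looking powerbook
```

Digits and punctuation disappear in the regex step, so tokens such as "160" and "4MB" never reach the length filter as whole words.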

In [11]:
lemma=WordNetLemmatizer()

In [12]:
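stop_words = set(stopwords.words('english'))  # build the stopword set once, not once per word
new_data = []
for doc in data:
    doc = doc.lower()
    text = re.sub('[^a-z]', ' ', doc)  # replace everything except lowercase letters with a space
    words = text.split()
    # keep words longer than 3 characters that are not stopwords, then lemmatize them
    words = [lemma.lemmatize(word) for word in words if word not in stop_words and len(word) > 3]
    new_data.append(' '.join(words))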

In [13]:
print(len(new_data))

11314


In [14]:
new_data[2]

'twillis purdue thomas willis subject question organization purdue university engineering computer network distribution line well folk plus finally gave ghost weekend starting life back sooo market machine sooner intended looking picking powerbook maybe bunch question hopefully somebody answer anybody know dirt next round powerbook introduction expected heard supposed make appearence summer heard anymore since access macleak wondering anybody info anybody heard rumor price drop powerbook line like one went recently impression display could probably swing disk rather really feel much better display look great store really good could solicit opinion people worth taking disk size money active display realize real subjective question played around machine computer store breifly figured opinion somebody actually us machine daily might prove helpful well hellcat perform thanks bunch advance info could email post summary news reading time premium final around corner willis twillis purdue purd

## Step 3: Bag of words on the dataset

In [15]:
vector=CountVectorizer()

In [16]:
X = vector.fit_transform(new_data)

In [17]:
X.shape

(11314, 73210)
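
The shape above is (number of documents, number of distinct words). As a rough sketch of what such a document-term matrix holds, the following builds one by hand for a two-document toy corpus with `collections.Counter`. The toy documents and the helper name `to_count_row` are assumptions for illustration; CountVectorizer does the same job with a sparse matrix and its own tokenizer.

```python
from collections import Counter

# Toy corpus (assumption), already cleaned and space-separated.
docs = ["powerbook display question", "display price question question"]

# Vocabulary: every distinct word, sorted and mapped to a column index.
vocabulary = {word: idx
              for idx, word in enumerate(sorted({w for d in docs for w in d.split()}))}

def to_count_row(doc, vocabulary):
    """Return one row of the matrix: the count of each vocabulary word in doc."""
    counts = Counter(doc.split())
    row = [0] * len(vocabulary)
    for word, n in counts.items():
        row[vocabulary[word]] = n
    return row

matrix = [to_count_row(d, vocabulary) for d in docs]
print(vocabulary)  # {'display': 0, 'powerbook': 1, 'price': 2, 'question': 3}
print(matrix)      # [[1, 1, 0, 1], [1, 0, 1, 2]]
```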

#### Convert sparse matrix to gensim corpus.

In [18]:
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

#### Mapping from word IDs to words

In [19]:
# invert the vectorizer's word -> column-index vocabulary
id_map = {idx: word for word, idx in vector.vocabulary_.items()}
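
gensim represents each document as a list of (token_id, count) pairs and needs an id-to-word mapping to print readable topics. A minimal sketch of both ideas on toy data follows; the counts and the vocabulary are made-up values, and the plain-Python conversion only illustrates the format rather than reproducing `Sparse2Corpus`.

```python
# Toy word -> id vocabulary and document-term counts (assumptions).
vocabulary = {"display": 0, "powerbook": 1, "price": 2, "question": 3}
count_rows = [[1, 1, 0, 1], [1, 0, 1, 2]]

# Invert the vocabulary so that ids map back to words.
id_to_word = {idx: word for word, idx in vocabulary.items()}

# Keep only the non-zero entries of each row as (token_id, count) pairs.
bow_corpus = [[(idx, n) for idx, n in enumerate(row) if n > 0] for row in count_rows]

for doc in bow_corpus:
    print([(id_to_word[idx], n) for idx, n in doc])
```

Storing only the non-zero pairs is what keeps a corpus with tens of thousands of vocabulary entries manageable in memory.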

## Step 4: Running LDA using Bag of Words

In [None]:
lda_model = gensim.models.LdaModel(corpus,id2word=id_map,num_topics=8,passes=35)

In [None]:
topics_words = lda_model.print_topics(num_topics=8,num_words=10)

In [None]:
for no,top in topics_words:
    print('topic',no)
    print(top)
    print('\n')

## Step 5: Classification of the topics
#### Each topic is labelled by hand from its most probable words:


* Hardware
* Encryption
* Sports
* Religion
* Politics
* Graphics
* Space


## Step 6: Testing model on unseen document

In [None]:
num=20
unseen_document = newsgroups_test.data[num]
print(unseen_document)

In [None]:
print(newsgroups_test.target[num])

In [None]:
list(newsgroups_test.target_names)

In [None]:
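def clean_text(a):
    """Apply the same cleaning as in Step 2 to a single document."""
    stop_words = set(stopwords.words('english'))  # build the stopword set once per call
    a = a.lower()
    text = re.sub('[^a-z]', ' ', a)  # replace everything except lowercase letters with a space
    words = text.split()
    words = [lemma.lemmatize(word) for word in words if word not in stop_words and len(word) > 3]
    return ' '.join(words)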

In [None]:
unseen_document=clean_text(unseen_document)
unseen_document

In [None]:
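y = vector.transform([unseen_document])
y_corpus = gensim.matutils.Sparse2Corpus(y, documents_columns=False)
prediction = lda_model.get_document_topics(y_corpus)
# the corpus holds a single document; list its topics from most to least probable
sorted(next(iter(prediction)), key=lambda pair: pair[1], reverse=True)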

#### The model assigns the unseen document to the Graphics topic with about 97% probability.
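
Reading off a prediction amounts to taking the topic with the highest probability and looking up its hand-assigned label. The sketch below shows that step in isolation. Both the (topic_id, probability) pairs and the `topic_labels` mapping are made-up values for illustration; they are not output from this model, and the real topic ids depend on the training run.

```python
# Hypothetical (topic_id, probability) pairs for one document.
doc_topics = [(0, 0.01), (5, 0.97), (6, 0.02)]
# Hypothetical hand-assigned labels for those topic ids.
topic_labels = {0: "Hardware", 5: "Graphics", 6: "Space"}

# Pick the pair with the largest probability.
best_topic, best_prob = max(doc_topics, key=lambda pair: pair[1])
print(f"{topic_labels[best_topic]}: {best_prob:.0%}")  # Graphics: 97%
```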

In [None]:
num=158
unseen_document = newsgroups_test.data[num]
print(unseen_document)

In [None]:
print(newsgroups_test.target[num])

In [None]:
unseen_document=clean_text(unseen_document)
unseen_document

In [None]:
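y = vector.transform([unseen_document])
y_corpus = gensim.matutils.Sparse2Corpus(y, documents_columns=False)
prediction = lda_model.get_document_topics(y_corpus)
# the corpus holds a single document; list its topics from most to least probable
sorted(next(iter(prediction)), key=lambda pair: pair[1], reverse=True)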

#### The model assigns the unseen document to the Sports topic with about 90% probability.

## Step 7: Saving the model

In [None]:
lda_model.save('topic_finding')