<a href="https://colab.research.google.com/github/sahilshah9111/Topic-Modelling-LDA/blob/main/Topic_Modelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Load the data**

In [None]:
import pandas as pd
import numpy as np
import re
from sklearn.datasets import fetch_20newsgroups
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

In [None]:
#!pip install -U gensim
import gensim
from gensim import corpora, models
from gensim.models.coherencemodel import CoherenceModel
#!pip install pyLDAvis
import pyLDAvis.gensim_models

Successfully installed gensim-4.2.0
Collecting pyLDAvis
  Downloading pyLDAvis-3.3.1.tar.gz (1.7 MB)


In [None]:
#nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
newsgroups_train = fetch_20newsgroups(subset='train')

df = pd.DataFrame({'post': newsgroups_train['data'], 'target': newsgroups_train['target']})
df['target_names'] = df['target'].apply(lambda t: newsgroups_train['target_names'][t])
df.head()

Unnamed: 0,post,target,target_names
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,7,rec.autos
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,4,comp.sys.mac.hardware
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,4,comp.sys.mac.hardware
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...,1,comp.graphics
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...,14,sci.space


As a text preprocessing step, we first remove URLs, HTML tags, email addresses, and non-alphabetic characters, and lowercase the result. After that, we lemmatize the text and remove stopwords.

In [None]:
def remove_urls(text):
  url_pattern = re.compile(r'https?://\S+|www\.\S+')
  return url_pattern.sub(r'', text)
  
def remove_html(text):
  # Non-greedy match for anything that looks like an HTML tag, e.g. <br> or </p>
  html_pattern = re.compile(r'<.*?>')
  return html_pattern.sub(r'', text)
    
def remove_emails(text):
  email_pattern = re.compile(r'\S*@\S*\s?')
  return email_pattern.sub(r'', text)
    
def remove_new_line(text):
  return re.sub(r'\s+', ' ', text)
    
def remove_non_alpha(text):
  return re.sub("[^A-Za-z]+", ' ', str(text))

def preprocess_text(text):
  t = remove_urls(text)
  t = remove_html(t)
  t = remove_emails(t)
  t = remove_new_line(t)
  t = remove_non_alpha(t)
  return t
    
def lemmatize_words(text, lemmatizer):
  return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

def remove_stopwords(text, stop_words):
  # Named stop_words so it does not shadow the imported nltk `stopwords` module
  return " ".join([word for word in str(text).split() if word not in stop_words])

df['post_preprocessed'] = df['post'].apply(preprocess_text).str.lower()

print('lemming...')
#nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
df['post_final'] = df['post_preprocessed'].apply(lambda post: lemmatize_words(post, lemmatizer))

print('remove stopwords...')

#nltk.download('stopwords')
swords = set(stopwords.words('english'))

# Remove stopwords from the lemmatized text so the lemmatization step is not overwritten
df['post_final'] = df['post_final'].apply(lambda post: remove_stopwords(post, swords))
df.head()

lemming...


[nltk_data] Downloading package wordnet to /root/nltk_data...


remove stopwords...


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Unnamed: 0,post,target,target_names,post_preprocessed,post_final
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,7,rec.autos,from where s my thing subject what car is this...,thing subject car nntp posting host rac wam um...
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,4,comp.sys.mac.hardware,from guy kuo subject si clock poll final call ...,guy kuo subject si clock poll final call summa...
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,4,comp.sys.mac.hardware,from thomas e willis subject pb questions orga...,thomas e willis subject pb questions organizat...
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...,1,comp.graphics,from joe green subject re weitek p organizatio...,joe green subject weitek p organization harris...
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...,14,sci.space,from jonathan mcdowell subject re shuttle laun...,jonathan mcdowell subject shuttle launch quest...
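
To make the cleaning steps concrete, here is a small standalone sketch that runs the same regular expressions on one made-up post. The HTML pattern `<.*?>`, the sample text, and the tiny `TOY_STOPWORDS` set are assumptions for illustration only; the notebook itself uses NLTK's English stopword list and also lemmatizes, which is skipped here to avoid the WordNet download.

```python
import re

# Hypothetical mini stopword list; the notebook uses NLTK's English list instead.
TOY_STOPWORDS = {"from", "where", "s", "my", "what", "is", "this", "at"}

def clean(text):
    """Apply the regex cleaning steps in order, then lowercase."""
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'<.*?>', '', text)                   # HTML tags (assumed pattern)
    text = re.sub(r'\S*@\S*\s?', '', text)              # email addresses
    text = re.sub(r'\s+', ' ', text)                    # newlines and repeated spaces
    text = re.sub(r'[^A-Za-z]+', ' ', text)             # anything that is not a letter
    return text.lower().strip()

# Made-up sample modelled on the first row of the dataset
sample = ("From: lerxst@wam.umd.edu (where's my thing)\n"
          "Subject: WHAT car is this!? <b>Photos</b> at http://example.com/car")

cleaned = clean(sample)
tokens = [w for w in cleaned.split() if w not in TOY_STOPWORDS]

print(cleaned)  # from where s my thing subject what car is this photos at
print(tokens)   # ['thing', 'subject', 'car', 'photos']
```

Note that stripping non-letters splits "where's" into "where" and "s", which is why single-letter tokens show up later unless they are filtered out.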


After preprocessing, we don’t need to build the document-term matrix (DTM) by hand. Gensim builds it for us, starting from a dictionary that maps each unique token to an integer id.

In [None]:
posts = [x.split(' ') for x in df['post_final']]
id2word = corpora.Dictionary(posts)
print(id2word)

Dictionary<77511 unique tokens: ['addition', 'anyone', 'body', 'bricklin', 'brought']...>


The next step is to convert the corpus (the list of documents) into a document-term matrix using the dictionary prepared above. The vectorizer used here is bag of words: each document becomes a list of (token id, count) pairs.

In [None]:
corpus_tf = [id2word.doc2bow(text) for text in posts]
print(corpus_tf[0])

[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 5), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1)]


Each tuple is a (token id, count) pair for the first document: token 0 occurs once, token 1 occurs twice, token 7 occurs five times, and so on.
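
The (id, count) representation can be reproduced in a few lines of plain Python. This is only a simplified stand-in for what `corpora.Dictionary` and `doc2bow` do; the helper names `build_vocab` and `to_bow` and the toy posts are made up for this sketch, and Gensim may assign ids in a different order.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Two tiny, made-up tokenized posts
toy_posts = [
    ["car", "engine", "car", "door"],
    ["space", "shuttle", "launch", "space", "space"],
]

vocab = build_vocab(toy_posts)
bow_0 = to_bow(toy_posts[0], vocab)
bow_1 = to_bow(toy_posts[1], vocab)

print(vocab)  # {'car': 0, 'engine': 1, 'door': 2, 'space': 3, 'shuttle': 4, 'launch': 5}
print(bow_0)  # [(0, 2), (1, 1), (2, 1)]
print(bow_1)  # [(3, 3), (4, 1), (5, 1)]
```

Word order is discarded: only which tokens occur, and how often, survives into the model input.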

Next, we create and fit an LDA model, then check its quality using a coherence score.

In [None]:
lda_model = models.ldamodel.LdaModel(corpus=corpus_tf,
                                     id2word=id2word,
                                     num_topics=6,
                                     random_state=100,
                                     alpha=0.0001,
                                     eta=0.1,
                                     iterations=10,
                                     per_word_topics=True)

print(lda_model.print_topics())



[(0, '0.015*"subject" + 0.015*"lines" + 0.013*"organization" + 0.007*"posting" + 0.007*"article" + 0.007*"nntp" + 0.007*"university" + 0.006*"host" + 0.006*"writes" + 0.004*"would"'), (1, '0.030*"x" + 0.007*"subject" + 0.006*"c" + 0.006*"organization" + 0.006*"lines" + 0.005*"use" + 0.005*"would" + 0.004*"one" + 0.004*"windows" + 0.003*"file"'), (2, '0.007*"one" + 0.007*"would" + 0.007*"writes" + 0.006*"subject" + 0.006*"lines" + 0.006*"article" + 0.005*"organization" + 0.005*"people" + 0.004*"think" + 0.004*"like"'), (3, '0.233*"ax" + 0.035*"g" + 0.030*"w" + 0.029*"r" + 0.027*"p" + 0.026*"q" + 0.023*"f" + 0.022*"u" + 0.022*"v" + 0.018*"c"'), (4, '0.007*"would" + 0.007*"people" + 0.006*"one" + 0.006*"god" + 0.004*"lines" + 0.004*"organization" + 0.004*"writes" + 0.004*"subject" + 0.004*"think" + 0.004*"article"'), (5, '0.007*"one" + 0.006*"subject" + 0.006*"organization" + 0.006*"lines" + 0.005*"writes" + 0.005*"would" + 0.005*"like" + 0.004*"use" + 0.004*"get" + 0.004*"article"')]
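
The topic strings printed above are easier to inspect as (word, weight) pairs. The sketch below parses one of them with a regular expression; `parse_topic` and the `HEADER_WORDS` set are hypothetical helpers written for this illustration, and the input is the start of topic 4 copied from the output. On a live model, `lda_model.show_topic(topic_id)` returns such pairs directly.

```python
import re

def parse_topic(topic_str):
    """Turn '0.007*"would" + 0.006*"god"' into [('would', 0.007), ('god', 0.006)]."""
    return [(word, float(weight))
            for weight, word in re.findall(r'([\d.]+)\*"([^"]+)"', topic_str)]

# First five terms of topic 4, copied from the printed output
topic_4 = '0.007*"would" + 0.007*"people" + 0.006*"one" + 0.006*"god" + 0.004*"lines"'
pairs = parse_topic(topic_4)

# Assumed list of newsgroup header words that recur across most topics above
HEADER_WORDS = {"subject", "lines", "organization", "nntp", "posting", "host"}
content_words = [word for word, _ in pairs if word not in HEADER_WORDS]

print(pairs)
print(content_words)  # ['would', 'people', 'one', 'god']
```

Seeing words such as "subject", "lines", and "organization" in almost every topic suggests that message headers are still leaking through the preprocessing.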


In [None]:
# Compute coherence score
coherence = CoherenceModel(model=lda_model, texts=posts, dictionary=id2word, coherence='u_mass')

print('\nCoherence Score: ', coherence.get_coherence())


Coherence Score:  -1.3257969134157024
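
The `u_mass` measure rewards topics whose top words tend to appear in the same documents. The sketch below shows the idea on a toy corpus, assuming the commonly cited formulation with add-one smoothing. Gensim's own pipeline may differ in smoothing and aggregation details, so this is not expected to reproduce the score above; `umass_coherence` and the toy documents are made up for illustration.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs, smoothing=1.0):
    """Mean of log((D(w_i, w_j) + smoothing) / D(w_j)) over word pairs with j < i.

    D(w) is the number of documents containing w, and D(w_i, w_j) the number
    containing both. top_words should be ordered from most to least probable,
    and every word must occur in at least one document.
    """
    doc_sets = [set(d) for d in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = []
    for j, i in combinations(range(len(top_words)), 2):  # pairs with j < i
        w_j, w_i = top_words[j], top_words[i]
        scores.append(math.log((doc_freq(w_i, w_j) + smoothing) / doc_freq(w_j)))
    return sum(scores) / len(scores)

# Made-up tokenized documents
toy_docs = [
    ["car", "engine", "door"],
    ["car", "engine"],
    ["space", "launch"],
    ["car", "space"],
]

coherent = umass_coherence(["car", "engine", "door"], toy_docs)
mixed = umass_coherence(["car", "space", "launch"], toy_docs)
print(coherent, mixed)  # the first value is higher (closer to zero)
```

Under this formulation a higher score means the top words co-occur more often, so the topic that mixes cars and space scores lower than the one about cars alone.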
