# News Modeling

Topic modeling involves **extracting features from document terms** and using
mathematical structures and frameworks like matrix factorization and SVD to generate **clusters or groups of terms** that are distinguishable from each other and these clusters of words form topics or concepts

Topic modeling is a method for **unsupervised classification** of documents, similar to clustering on numeric data

These concepts can be used to interpret the main **themes** of a corpus and also make **semantic connections among words that co-occur together** frequently in various documents

Topic modeling can help in the following areas:
- discovering the **hidden themes** in the collection
- **classifying** the documents into the discovered themes
- using the classification to **organize/summarize/search** the documents

Frameworks and algorithms to build topic models:
- Latent semantic indexing
- Latent Dirichlet allocation
- Non-negative matrix factorization

## Latent Dirichlet Allocation (LDA)
The latent Dirichlet allocation (LDA) technique is a **generative probabilistic model** where each **document is assumed to have a combination of topics** similar to a probabilistic latent semantic indexing model

In simple words, the idea behind LDA is that of two folds:
- each **document** can be described by a **distribution of topics**
- each **topic** can be described by a **distribution of words**

### LDA Algorithm

- 1. For each document, **randomly initialize each word to one of the K topics** (k is chosen beforehand)
- 2. For each document D, go through each word w and compute:
    - **P(T |D)** , which is a proportion of words in D assigned to topic T
    - **P(W |T )** , which is a proportion of assignments to topic T over all documents having the word W
- **Reassign word W with topic T** with probability P(T |D)´ P(W |T ) considering all other words and their topic assignments

![LDA](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/LDA.png)

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Stop words removal
    - Email removal
    - Non-alphabetic words removal
    - Tokenize
    - Lowercase
    - BiGrams & TriGrams
    - Lemmatization
- Create a dictionary for the document
- Filter low frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Evaluate the Topic Model
    - Model Perplexity
    - Topic Coherence
- Visualize the topics

### Install the necessary library

In [1]:
! pip install pyLDAvis gensim spacy

Defaulting to user installation because normal site-packages is not writeable
Collecting pyLDAvis
  Using cached pyLDAvis-3.4.0-py3-none-any.whl (2.6 MB)
Collecting gensim
  Downloading gensim-4.3.1-cp310-cp310-win_amd64.whl (24.0 MB)
     --------------------------------------- 24.0/24.0 MB 11.5 MB/s eta 0:00:00
Collecting spacy
  Downloading spacy-3.5.1-cp310-cp310-win_amd64.whl (12.2 MB)
     --------------------------------------- 12.2/12.2 MB 15.9 MB/s eta 0:00:00
Collecting jinja2
  Downloading Jinja2-3.1.2-py3-none-any.whl (133 kB)
     -------------------------------------- 133.1/133.1 kB 8.2 MB/s eta 0:00:00
Collecting numexpr
  Downloading numexpr-2.8.4-cp310-cp310-win_amd64.whl (92 kB)
     ---------------------------------------- 92.7/92.7 kB ? eta 0:00:00
Collecting funcy
  Using cached funcy-1.18-py2.py3-none-any.whl (33 kB)
Collecting smart-open>=1.8.1
  Using cached smart_open-6.3.0-py3-none-any.whl (56 kB)
Collecting catalogue<2.1.0,>=2.0.6
  Using cached catalogue-2.0


[notice] A new release of pip available: 22.3.1 -> 23.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


### Import the libraries

In [2]:
import nltk
! nltk.download('stopwords')

'nltk.download' is not recognized as an internal or external command,
operable program or batch file.


In [3]:
! pip install gensim
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from gensim.models import LdaModel
from gensim.corpora import Dictionary
from pprint import pprint
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from gensim import corpora, models
import gensim


[notice] A new release of pip available: 22.3.1 -> 23.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


Defaulting to user installation because normal site-packages is not writeable


### Download the dataset
Dataset: https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

#### 20-Newsgroups dataset
- 11K newsgroups posts
- 20 news topics

In [4]:
! wget https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

--2023-03-20 11:45:15--  https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23237087 (22M) [text/plain]
Saving to: 'newsgroups.json.5'

     0K .......... .......... .......... .......... ..........  0% 6.74M 3s
    50K .......... .......... .......... .......... ..........  0% 58.3M 2s
   100K .......... .......... .......... .......... ..........  0% 12.4M 2s
   150K .......... .......... .......... .......... ..........  0% 27.6M 2s
   200K .......... .......... .......... .......... ..........  1% 74.9M 1s
   250K .......... .......... .......... .......... ..........  1% 6.70M 2s
   300K .......... .......... .......... .......... ..........  1%  123M 1s
   350K .......... ......

  9150K .......... .......... .......... .......... .......... 40% 29.5M 1s
  9200K .......... .......... .......... .......... .......... 40% 21.5M 1s
  9250K .......... .......... .......... .......... .......... 40% 18.3M 1s
  9300K .......... .......... .......... .......... .......... 41%  143M 1s
  9350K .......... .......... .......... .......... .......... 41% 24.7M 1s
  9400K .......... .......... .......... .......... .......... 41% 32.7M 1s
  9450K .......... .......... .......... .......... .......... 41% 16.5M 1s
  9500K .......... .......... .......... .......... .......... 42%  120M 1s
  9550K .......... .......... .......... .......... .......... 42% 23.2M 1s
  9600K .......... .......... .......... .......... .......... 42% 21.8M 1s
  9650K .......... .......... .......... .......... .......... 42%  236M 1s
  9700K .......... .......... .......... .......... .......... 42% 42.1M 1s
  9750K .......... .......... .......... .......... .......... 43% 10.6M 1s
  9800K ....

### Load the dataset

In [5]:
import pandas as pd

df = pd.read_json('newsgroups.json')
print(df.head())

                                             content  target  \
0  From: lerxst@wam.umd.edu (where's my thing)\nS...       7   
1  From: guykuo@carson.u.washington.edu (Guy Kuo)...       4   
2  From: twillis@ec.ecn.purdue.edu (Thomas E Will...       4   
3  From: jgreen@amber (Joe Green)\nSubject: Re: W...       1   
4  From: jcm@head-cfa.harvard.edu (Jonathan McDow...      14   

            target_names  
0              rec.autos  
1  comp.sys.mac.hardware  
2  comp.sys.mac.hardware  
3          comp.graphics  
4              sci.space  


### Preprocess the data

In [6]:
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [7]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\chail\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\chail\AppData\Roaming\nltk_data...


True

In [8]:
df = pd.read_json('newsgroups.json')

In [9]:
# Preprocess the text column
df['preprocessed_text'] = df['content'].apply(lambda x: x.lower())
df['preprocessed_text'] = df['preprocessed_text'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
df['preprocessed_text'] = df['preprocessed_text'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))
stop_words = set(stopwords.words('english'))
df['preprocessed_text'] = df['preprocessed_text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
lemmatizer = WordNetLemmatizer()
df['preprocessed_text'] = df['preprocessed_text'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))

In [10]:
df.head()

Unnamed: 0,content,target,target_names,preprocessed_text
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,7,rec.autos,lerxstwamumdedu wheres thing subject car nntpp...
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,4,comp.sys.mac.hardware,guykuocarsonuwashingtonedu guy kuo subject si ...
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,4,comp.sys.mac.hardware,twillisececnpurdueedu thomas e willis subject ...
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...,1,comp.graphics,jgreenamber joe green subject weitek p organiz...
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...,14,sci.space,jcmheadcfaharvardedu jonathan mcdowell subject...


### Email Removal

In [11]:
import re

In [12]:
# Defininig a regular expression for matching email addresses
email_regex = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'

# Removing email addresses from the preprocessed text
df['preprocessed_text'] = df['preprocessed_text'].apply(lambda x: re.sub(email_regex, '', x))
df.head()

Unnamed: 0,content,target,target_names,preprocessed_text
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,7,rec.autos,lerxstwamumdedu wheres thing subject car nntpp...
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,4,comp.sys.mac.hardware,guykuocarsonuwashingtonedu guy kuo subject si ...
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,4,comp.sys.mac.hardware,twillisececnpurdueedu thomas e willis subject ...
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...,1,comp.graphics,jgreenamber joe green subject weitek p organiz...
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...,14,sci.space,jcmheadcfaharvardedu jonathan mcdowell subject...


### Newline Removal

In [13]:
# Removing newlines from the preprocessed text
df['preprocessed_text'] = df['preprocessed_text'].apply(lambda x: x.replace('\n', ' '))
df.head()

Unnamed: 0,content,target,target_names,preprocessed_text
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,7,rec.autos,lerxstwamumdedu wheres thing subject car nntpp...
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,4,comp.sys.mac.hardware,guykuocarsonuwashingtonedu guy kuo subject si ...
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,4,comp.sys.mac.hardware,twillisececnpurdueedu thomas e willis subject ...
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...,1,comp.graphics,jgreenamber joe green subject weitek p organiz...
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...,14,sci.space,jcmheadcfaharvardedu jonathan mcdowell subject...


### Single Quotes Removal

In [14]:
# Removing single quotes from the preprocessed text
df['preprocessed_text'] = df['preprocessed_text'].apply(lambda x: x.replace("'", ""))
df.head()

Unnamed: 0,content,target,target_names,preprocessed_text
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,7,rec.autos,lerxstwamumdedu wheres thing subject car nntpp...
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,4,comp.sys.mac.hardware,guykuocarsonuwashingtonedu guy kuo subject si ...
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,4,comp.sys.mac.hardware,twillisececnpurdueedu thomas e willis subject ...
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...,1,comp.graphics,jgreenamber joe green subject weitek p organiz...
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...,14,sci.space,jcmheadcfaharvardedu jonathan mcdowell subject...


### Tokenize
- Create **sent_to_words()** 
    - Use **gensim.utils.simple_preprocess**
    - Use **generator** instead of an usual function

In [15]:
from gensim.utils import simple_preprocess

In [16]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(simple_preprocess(str(sentence), deacc=True)) #True means removes punctuation
#usage example
data = df['preprocessed_text'].values.tolist()
data_words = list(sent_to_words(data))
print(data_words[:5])



### Stop words Removal
- Extend the stop words corpus with the following words
    - from
    - subject
    - re
    - edu
    - use

In [17]:
from gensim.parsing.preprocessing import STOPWORDS

In [18]:
# Defining additional stop words
CUSTOM_STOPWORDS = STOPWORDS.union(set(['from', 'subject', 're', 'edu', 'use']))

def remove_stopwords(texts):
    return [[word for word in doc if word not in CUSTOM_STOPWORDS] for doc in texts]

# Usage example:
data_words_nostops = remove_stopwords(data_words)
print(data_words_nostops[:5])



#### remove_stopwords( )

In [19]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

In [20]:
def remove_stopwords(texts):

    return [[word for word in doc if word not in stop_words] for doc in texts if doc]

In [21]:
data_words_nostops = remove_stopwords(data_words)

### Bigrams
- Use **gensim.models.Phrases**
- 100 as threshold

In [22]:
from gensim.models.phrases import Phrases, Phraser

In [23]:
# Build the bigram models
bigram = Phrases(data_words_nostops, min_count=5, threshold=100)
bigram_mod = Phraser(bigram)

In [24]:
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

data_words_bigrams = make_bigrams(data_words_nostops)

In [25]:
print(data_words_nostops[:10])
print(data_words[:10])



#### make_bigrams( )

In [26]:
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

### Lemmatization
- Use spacy
    - Download spacy en model (if you have not done that before)
    - Load the spacy model

In [27]:
! pip install spacy
!python -m spacy download en_core_web_sm

Defaulting to user installation because normal site-packages is not writeable



[notice] A new release of pip available: 22.3.1 -> 23.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     --------------------------------------- 12.8/12.8 MB 18.7 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.5.0
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')



[notice] A new release of pip available: 22.3.1 -> 23.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [28]:
import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

#### lemmatizaton( )

In [32]:
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [33]:
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

In [34]:
print(data_lemmatized[:1])

[['s', 'thing', 'car', 'nntppostinghost', 'organization', 'park', 'line', 'wonder', 'enlighten', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'door', 'really', 'small', 'addition', 'separate', 'rest', 'body', 'know', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'make', 'history', 'info', 'funky', 'look', 'car', 'email', 'thank', 'bring', 'neighborhood', 'lerxst']]


### Create a Dictionary

In [35]:
id2word = corpora.Dictionary(data_lemmatized)

### Create Corpus

In [36]:
texts = data_lemmatized

### Filter low-frequency words

In [37]:
id2word.filter_extremes(no_below=5, no_above=0.5)

### Create Index 2 word dictionary

In [38]:
id2word = corpora.Dictionary(data_lemmatized)

In [39]:
print(id2word)

Dictionary<55919 unique tokens: ['addition', 'body', 'bring', 'call', 'car']...>


### Build a News Topic Model

#### LdaModel
- **num_topics** : this is the number of topics you need to define beforehand
- **chunksize** : the number of documents to be used in each training chunk
- **alpha** : this is the hyperparameters that affect the sparsity of the topics
- **passess** : total number of training assess

In [40]:
from gensim.models import LdaModel
from gensim.corpora import Dictionary

In [41]:
# Creating a dictionary
dictionary = Dictionary(data_lemmatized)

In [42]:
# Filtering out words that occur less than 10 documents
dictionary.filter_extremes(no_below=10, no_above=0.5)

In [43]:
# Creating a bag of words corpus
corpus = [dictionary.doc2bow(text) for text in data_lemmatized]

In [44]:
# Trainining the LDA model
lda_model = LdaModel(corpus=corpus,
                     id2word=dictionary,
                     num_topics=20,
                     chunksize=100,
                     alpha='auto',
                     passes=10,
                     random_state=42)

### Print the Keyword in the 10 topics

In [45]:
for idx, topic in lda_model.show_topics(formatted=False, num_topics=10):
    print('Topic: {} \nWords: {}'.format(idx, [w[0] for w in topic]))
    print('\n')

Topic: 3 
Words: ['kill', 'soldier', 'village', 'armenian', 'war', 'attack', 'land', 'murder', 'people', 'turkish']


Topic: 14 
Words: ['space', 'center', 'launch', 'mission', 'agency', 'research', 'development', 'orbit', 'satellite', 'moon']


Topic: 15 
Words: ['image', 'earth', 'picture', 'object', 'search', 'scan', 'requirement', 'planet', 'surface', 'role']


Topic: 13 
Words: ['file', 'program', 'entry', 'format', 'nhl', 'session', 'decent', 'noise', 'correctly', 'att']


Topic: 10 
Words: ['window', 'driver', 'internet', 'application', 'screen', 'display', 'printer', 'manager', 'library', 'version']


Topic: 12 
Words: ['number', 'email', 'include', 'send', 'information', 'book', 'post', 'source', 'list', 'address']


Topic: 8 
Words: ['system', 'use', 'problem', 'run', 'set', 'card', 'bit', 'machine', 'help', 'work']


Topic: 7 
Words: ['evidence', 'reason', 'believe', 'people', 'claim', 'state', 'fact', 'case', 'exist', 'mean']


Topic: 6 
Words: ['get', 'nntppostinghost', 'n

## Evaluation of Topic Models
- Model Perplexity
- Topic Coherence

### Model Perplexity

Model perplexity is a measurement of **how well** a **probability distribution** or probability model **predicts a sample**

In [46]:
print('Perplexity:', lda_model.log_perplexity(corpus))

Perplexity: -11.923750230142153


### Topic Coherence
Topic Coherence measures score a single topic by measuring the **degree of semantic similarity** between **high scoring words** in the topic.

In [47]:
from gensim.models import CoherenceModel

In [48]:
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.4931678054187249


### Visualize the Topic Model
- Use **pyLDAvis**
    - designed to help users **interpret the topics** in a topic model that has been fit to a corpus of text data
    - extracts information from a fitted LDA topic model to inform an interactive web-based visualization

In [49]:
!pip install pyLDAvis

Defaulting to user installation because normal site-packages is not writeable



[notice] A new release of pip available: 22.3.1 -> 23.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [50]:
import pyLDAvis.gensim_models
import gensim

# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary=lda_model.id2word)
vis

  default_term_info = default_term_info.sort_values(
