# News Modeling

Topic modeling **extracts features from document terms** and uses
mathematical techniques such as matrix factorization and singular value decomposition (SVD) to generate **clusters or groups of terms** that are distinguishable from each other. These clusters of words form topics or concepts.

Topic modeling is a method for **unsupervised classification** of documents, similar to clustering on numeric data.

These concepts can be used to interpret the main **themes** of a corpus and to make **semantic connections among words that frequently co-occur** in various documents.

Topic modeling can help in the following areas:
- discovering the **hidden themes** in the collection
- **classifying** the documents into the discovered themes
- using the classification to **organize/summarize/search** the documents

Frameworks and algorithms to build topic models:
- Latent semantic indexing
- Latent Dirichlet allocation
- Non-negative matrix factorization

## Latent Dirichlet Allocation (LDA)
The latent Dirichlet allocation (LDA) technique is a **generative probabilistic model** in which each **document is assumed to be a mixture of topics**, similar to a probabilistic latent semantic indexing model.

In simple words, the idea behind LDA is twofold:
- each **document** can be described by a **distribution of topics**
- each **topic** can be described by a **distribution of words**
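
The two assumptions above can be read as a recipe for generating a document. The following is a minimal sketch of that generative story using only the standard library. The vocabulary, the two topics and their probabilities, and the helper names `sample_dirichlet` and `generate_document` are all made up for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocab, doc_len, alpha, rng):
    """Generate one toy document following the LDA generative story."""
    # Each document gets its own distribution over topics.
    doc_topic_probs = sample_dirichlet(alpha, rng)
    topic_ids = range(len(topic_word_probs))
    words = []
    for _ in range(doc_len):
        # Pick a topic for this word position from the document's topic mix.
        topic = rng.choices(topic_ids, weights=doc_topic_probs)[0]
        # Pick the word from that topic's distribution over the vocabulary.
        words.append(rng.choices(vocab, weights=topic_word_probs[topic])[0])
    return doc_topic_probs, words

rng = random.Random(0)
vocab = ["game", "team", "win", "drive", "disk", "file"]
topic_word_probs = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # hypothetical "sports" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # hypothetical "computers" topic
]
doc_topic_probs, words = generate_document(
    topic_word_probs, vocab, doc_len=8, alpha=[0.5, 0.5], rng=rng
)
print(doc_topic_probs, words)
```

Training an LDA model runs this story in reverse: given only the words, it estimates the topic mix of each document and the word distribution of each topic.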

### LDA Algorithm

1. For each document, **randomly assign each word to one of the K topics** (K is chosen beforehand)
2. For each document D, go through each word W and compute:
    - **P(T | D)**, the proportion of words in D that are currently assigned to topic T
    - **P(W | T)**, the proportion of assignments to topic T, over all documents, that come from the word W
3. **Reassign word W to a topic T** with probability P(T | D) × P(W | T), considering all other words and their topic assignments
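
The reassignment loop can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not what gensim does internally: the function name `toy_lda`, the smoothing constants `alpha` and `beta`, and the three tiny documents are all hypothetical. Smoothing is added so that no topic ever gets probability zero.

```python
import random
from collections import defaultdict

def toy_lda(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy sampler: repeatedly reassign each word with weight P(T|D) * P(W|T)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Step 1: randomly assign every word occurrence to one of the K topics.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # Count tables derived from the current assignments.
    doc_topic = [[0] * num_topics for _ in docs]                # words of doc d in topic t
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # occurrences of word w in topic t
    topic_total = [0] * num_topics                              # all words in topic t
    for d, doc in enumerate(docs):
        for w, t in zip(doc, assignments[d]):
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the word's current assignment so only the other words count.
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1

                # Step 2: smoothed P(T|D) * P(W|T) for every topic T.
                weights = []
                for t in range(num_topics):
                    p_t_given_d = (doc_topic[d][t] + alpha) / (len(doc) - 1 + num_topics * alpha)
                    p_w_given_t = (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    weights.append(p_t_given_d * p_w_given_t)

                # Step 3: sample the new topic in proportion to those weights.
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1
    return assignments, doc_topic

docs = [
    ["game", "team", "win", "game"],
    ["drive", "disk", "file", "disk"],
    ["team", "win", "play"],
]
assignments, doc_topic = toy_lda(docs, num_topics=2)
print(doc_topic)
```

Each row of `doc_topic` counts how many words of that document ended up in each topic, which is the unnormalized form of P(T | D).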

![LDA](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/LDA.png)

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Stop words removal
    - Email removal
    - Non-alphabetic words removal
    - Tokenize
    - Lowercase
    - BiGrams & TriGrams
    - Lemmatization
- Create a dictionary for the document
- Filter low frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Evaluate the Topic Model
    - Model Perplexity
    - Topic Coherence
- Visualize the topics

### Install the necessary library

In [1]:
#! pip install pyLDAvis gensim spacy

### Import the libraries

In [6]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from gensim.models import LdaModel
from gensim.corpora import Dictionary
from pprint import pprint
import pandas as pd
import json
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from gensim import corpora, models
import gensim
import spacy
import pyLDAvis.gensim_models

### Download the dataset
Dataset: https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

#### 20-Newsgroups dataset
- 11K newsgroups posts
- 20 news topics


### Load the dataset

In [7]:
# Load the dataset from the provided URL
url = "https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json"

# Load JSON data into a DataFrame
df = pd.read_json(url)

# Display the first few rows of the dataset to check if it loaded correctly
print(df.head())

                                             content  target  \
0  From: lerxst@wam.umd.edu (where's my thing)\nS...       7   
1  From: guykuo@carson.u.washington.edu (Guy Kuo)...       4   
2  From: twillis@ec.ecn.purdue.edu (Thomas E Will...       4   
3  From: jgreen@amber (Joe Green)\nSubject: Re: W...       1   
4  From: jcm@head-cfa.harvard.edu (Jonathan McDow...      14   

            target_names  
0              rec.autos  
1  comp.sys.mac.hardware  
2  comp.sys.mac.hardware  
3          comp.graphics  
4              sci.space  


### Preprocess the data

### Email Removal

In [8]:
df['content'] = df['content'].str.replace(r'\S*@\S*\s?', '', regex=True)

### Newline Removal

In [9]:
df['content'] = df['content'].str.replace(r'\s+', ' ', regex=True)

### Single Quotes Removal

In [33]:
df['content'] = df['content'].str.replace(r"\'", '', regex=True)
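
To see what the three patterns do, here is a small sketch that applies the same regular expressions with the standard `re` module to a made-up message. The sample string and the address in it are invented for illustration. The notebook itself applies the patterns to the whole `content` column through pandas.

```python
import re

# Made-up post in the style of the dataset (the address is hypothetical).
sample = "From: someone@example.com (A User)\nSubject: What's this?\n\nIt's   a car."

no_email = re.sub(r'\S*@\S*\s?', '', sample)      # drop tokens containing '@'
one_line = re.sub(r'\s+', ' ', no_email)          # collapse newlines and repeated spaces
no_quotes = re.sub(r"\'", '', one_line)           # drop single quotes

print(no_quotes)
```

Note that removing the quote turns "What's" into "Whats", which is why tokens such as `wheres` appear later in the tokenized output.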

### Tokenize
- Create **sent_to_words()** 
    - Use **gensim.utils.simple_preprocess**
    - Use a **generator** instead of a regular function

In [40]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True, min_len=1))

In [78]:
data_words = list(sent_to_words(df['content']))
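
The point of the generator is that documents are tokenized one at a time, only when requested. The sketch below shows this with a simplified tokenizer. The function `toy_sent_to_words` and its regular expression are a hypothetical stand-in for `gensim.utils.simple_preprocess`, which also lowercases and tokenizes but additionally handles accents and token length limits.

```python
import re

def toy_sent_to_words(sentences):
    """Yield one token list per document (simplified stand-in for simple_preprocess)."""
    for sentence in sentences:
        # Lowercase and keep runs of ASCII letters only.
        yield re.findall(r"[a-z]+", str(sentence).lower())

docs = ["What car is this?", "The doors were REALLY small."]

token_stream = toy_sent_to_words(docs)  # nothing has been tokenized yet
first = next(token_stream)              # tokenizes only the first document
remaining = list(token_stream)          # consumes the rest of the stream

print(first)
print(remaining)
```

Wrapping the generator in `list(...)`, as the notebook does, materializes everything at once. That is convenient for a dataset of this size but gives up the memory savings.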

### Stop words Removal
- Extend the stop words corpus with the following words
    - from
    - subject
    - re
    - edu
    - use

In [42]:
from gensim.parsing.preprocessing import STOPWORDS
additional_stopwords = set(['from', 'subject', 're', 'edu', 'use'])
stop_words = STOPWORDS.union(additional_stopwords)

#### remove_stopwords( )

In [43]:
def remove_stopwords(texts):
    return [[word for word in doc if word not in stop_words] for doc in texts]

In [44]:
data_words_nostops = remove_stopwords(data_words)
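
A quick way to check the filtering logic is to run it on toy data. In this sketch `base_stop_words` is a made-up three-word stand-in for gensim's `STOPWORDS`, and the two token lists are invented.

```python
# Made-up stand-in for gensim's STOPWORDS (which is also a frozenset).
base_stop_words = frozenset({"the", "is", "of"})
extra_stop_words = {"from", "subject", "re", "edu", "use"}

# union() returns a new set and leaves the base set unchanged.
stop_words = base_stop_words.union(extra_stop_words)

docs = [
    ["from", "the", "university", "of", "maryland"],
    ["subject", "car", "is", "funky"],
]
filtered = [[w for w in doc if w not in stop_words] for doc in docs]
print(filtered)
```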

### Bigrams
- Use **gensim.models.Phrases**
- 100 as threshold

In [47]:
from gensim.models import Phrases

['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp_posting_host', 'rac_wam_umd_edu', 'organization', 'university', 'of', 'maryland_college_park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front_bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']


#### make_bigrams( )

In [48]:
def make_bigrams(texts):
    bigram = Phrases(texts, min_count=1, threshold=100)
    return [bigram[line] for line in texts]

In [50]:
data_words_bigrams = make_bigrams(data_words_nostops)
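
To give a feel for what `threshold` controls, the sketch below scores adjacent word pairs with the formula used by gensim's default scorer, `(bigram_count - min_count) / count_a / count_b * vocab_size`. A pair is joined into a bigram only when its score exceeds the threshold. The helper `bigram_scores` and the toy sentences are hypothetical, and the sketch takes the vocabulary size to be the number of distinct single words, which is an assumption. gensim keeps its own vocabulary bookkeeping, so absolute scores from the library will differ.

```python
from collections import Counter

def bigram_scores(texts, min_count=1):
    """Score adjacent word pairs: (n_ab - min_count) / n_a / n_b * vocab_size."""
    unigrams = Counter(w for doc in texts for w in doc)
    bigrams = Counter(pair for doc in texts for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)  # assumption: distinct single words only
    return {
        (a, b): (n_ab - min_count) / unigrams[a] / unigrams[b] * vocab_size
        for (a, b), n_ab in bigrams.items()
    }

texts = [
    ["front", "bumper", "was", "separate"],
    ["front", "bumper", "was", "small"],
    ["the", "bumper", "was", "front"],
]
scores = bigram_scores(texts, min_count=1)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

Pairs score higher when they occur together often relative to how common each word is on its own, so a high threshold such as 100 keeps only strongly associated pairs.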

### Lemmatization
- Use spacy
    - Download spacy en model (if you have not done that before)
    - Load the spacy model

In [51]:
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.0/en_core_web_sm-3.7.0-py3-none-any.whl (12.8 MB)


In [54]:
nlp = spacy.load("en_core_web_sm")

#### lemmatization( )

In [55]:
def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each token list with spaCy, keeping only tokens whose POS tag is allowed (see https://spacy.io/api/annotation)."""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [56]:
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

In [57]:
print(data_lemmatized[:1])

[['s', 'thing', 'car', 'nntp_poste', 'host_rac', 'wam_umd', 'organization', 'university', 'park', 'line', 'wondering_enlighten', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 's', 'early', 'door', 'small', 'addition', 'bumper_separate', 'rest', 'body', 'know', 'tellme_model', 'engine', 'spec', 'year', 'production', 'car', 'history', 'look', 'car', 'e', 'mail', 'thank', 'bring']]


### Create a Dictionary

In [58]:
# Create a Dictionary
dictionary = Dictionary(data_words_bigrams)

# Filter out tokens that appear in fewer than 10 documents (adjust the parameter as needed)
dictionary.filter_extremes(no_below=10)

# Create a corpus based on the dictionary
corpus = [dictionary.doc2bow(doc) for doc in data_words_bigrams]

# Print the dictionary
print(dictionary)

Dictionary<11123 unique tokens: ['addition', 'body', 'brought', 'car', 'day']...>


### Create Corpus

In [59]:
# Create a corpus based on the dictionary
corpus = [dictionary.doc2bow(doc) for doc in data_words_bigrams]

### Filter low-frequency words

In [60]:
# Filter out tokens that appear in fewer than `no_below` documents or in more than a `no_above` fraction of all documents (adjust as needed)
dictionary.filter_extremes(no_below=10, no_above=0.5)

# Re-create the corpus based on the updated dictionary
corpus = [dictionary.doc2bow(doc) for doc in data_words_bigrams]
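
The dictionary, filtering, and bag-of-words steps can be mimicked in a few lines of plain Python, which may help clarify what the corpus contains. This is a simplified sketch, not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, the token lists are made up, and ids are assigned alphabetically here purely for readability.

```python
from collections import Counter

def build_vocab(texts, no_below=2, no_above=0.5):
    """Keep tokens found in at least `no_below` documents and in at most
    a `no_above` fraction of all documents; map each kept token to an id."""
    doc_freq = Counter(tok for doc in texts for tok in set(doc))
    kept = sorted(
        tok for tok, df in doc_freq.items()
        if df >= no_below and df / len(texts) <= no_above
    )
    return {tok: idx for idx, tok in enumerate(kept)}

def to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(tok for tok in doc if tok in token2id)
    return sorted((token2id[tok], n) for tok, n in counts.items())

texts = [
    ["car", "engine", "car"],
    ["car", "door"],
    ["engine", "door", "door"],
    ["car", "wheel"],
]
token2id = build_vocab(texts, no_below=2, no_above=0.5)
corpus = [to_bow(doc, token2id) for doc in texts]
print(token2id)
print(corpus)
```

In this toy run "car" is dropped for being too common and "wheel" for being too rare, so a document made up only of such words becomes an empty bag.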

### Create an index-to-word dictionary

In [61]:
# Create index to word dictionary
index2word = {idx: word for idx, word in dictionary.items()}

In [62]:
word_for_index_0 = index2word[0]

### Build a News Topic Model

#### LdaModel
- **num_topics** : the number of topics, which you need to define beforehand
- **chunksize** : the number of documents to be used in each training chunk
- **alpha** : the hyperparameter that affects the sparsity of the document-topic distributions
- **passes** : the total number of training passes over the corpus

In [69]:
from gensim.models import LdaModel

# Set the number of topics
num_topics = 20

# Build the LDA model
lda_model = LdaModel(
    corpus=corpus,  # The corpus in bag-of-words format
    id2word=dictionary,  # Mapping of word IDs to words
    num_topics=num_topics,  # Number of topics
    chunksize=100,  # Number of documents to be used in each training chunk
    alpha='auto',  # Learn an asymmetric document-topic prior from the corpus
    passes=20  # Total number of training passes
)

### Print the top 10 keywords of each topic

In [70]:
# Print the topics
topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)

(0, '0.072*"jews" + 0.052*"jewish" + 0.047*"united_states" + 0.045*"war" + 0.039*"pl" + 0.038*"crime" + 0.038*"allowed" + 0.034*"died" + 0.025*"step" + 0.024*"state"')
(1, '0.068*"year" + 0.040*"team" + 0.036*"game" + 0.024*"win" + 0.022*"points" + 0.021*"play" + 0.021*"games" + 0.020*"vs" + 0.017*"face" + 0.017*"lost"')
(2, '0.734*"ax" + 0.162*"q" + 0.034*"f" + 0.015*"v" + 0.008*"mu" + 0.005*"g" + 0.004*"bc" + 0.003*"p" + 0.003*"icon" + 0.002*"jpl"')
(3, '0.086*"talking" + 0.058*"lots" + 0.056*"washington" + 0.049*"die" + 0.049*"health" + 0.040*"ms" + 0.039*"treatment" + 0.036*"patients" + 0.036*"failed" + 0.032*"medical"')
(4, '0.157*"o" + 0.043*"bike" + 0.034*"dr" + 0.029*"picture" + 0.027*"block" + 0.025*"dod" + 0.024*"dc" + 0.023*"ps" + 0.020*"disease" + 0.019*"rm"')
(5, '0.304*"x" + 0.050*"c" + 0.031*"file" + 0.023*"code" + 0.018*"sun" + 0.017*"graphics" + 0.017*"version" + 0.016*"entry" + 0.015*"available" + 0.015*"thanks_advance"')
(6, '0.040*"drive" + 0.032*"space" + 0.025*"so

## Evaluation of Topic Models
- Model Perplexity
- Topic Coherence

### Model Perplexity

Model perplexity measures **how well** a **probability distribution** or probability model **predicts a sample**; lower values indicate a better fit.

In [71]:
# log_perplexity returns the per-word likelihood bound, not the perplexity itself
perplexity = lda_model.log_perplexity(corpus)
print(f"Model Perplexity: {perplexity}")

Model Perplexity: -12.696255525168604
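
The printed number is negative because gensim's `log_perplexity` returns a per-word likelihood bound. According to the gensim documentation, the perplexity itself is `2 ** (-bound)`. A minimal sketch of the conversion, reusing the value printed above:

```python
# Per-word likelihood bound as printed by the notebook cell above.
bound = -12.696255525168604

# gensim defines perplexity as 2 raised to the negative bound.
perplexity = 2 ** (-bound)
print(f"Perplexity: {perplexity:.1f}")
```

A bound closer to zero therefore corresponds to a lower perplexity. The value here is computed on the training corpus, so it says little about how the model would fare on unseen documents.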


### Topic Coherence
Topic coherence measures score a single topic by measuring the **degree of semantic similarity** between the **high-scoring words** in the topic.

In [72]:
from gensim.models import CoherenceModel

# Compute Topic Coherence
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print(f"Topic Coherence: {coherence_lda}")

Topic Coherence: nan
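
A `nan` score can occur when some top words of a topic never appear in the `texts` handed to `CoherenceModel`, since the underlying co-occurrence ratios then divide by zero. That is a plausible but unverified explanation here: the model was trained on `data_words_bigrams`, while coherence is computed against `data_lemmatized`, so passing the same token lists that built the dictionary may resolve it. The sketch below shows one way to check for such a mismatch. The helper `missing_topic_words` is hypothetical, and the topic words and texts are toy data loosely modeled on the output above.

```python
def missing_topic_words(topic_words, texts):
    """Return, per topic, the top words that never occur in the reference texts."""
    seen = {tok for doc in texts for tok in doc}
    return {
        topic_id: [w for w in words if w not in seen]
        for topic_id, words in topic_words.items()
    }

# Toy data: topic words are unlemmatized, reference texts are lemmatized.
topic_words = {0: ["jews", "jewish", "died"], 1: ["games", "points", "team"]}
texts = [["jew", "jewish", "die"], ["game", "point", "team"]]

print(missing_topic_words(topic_words, texts))
```

With the real model, the per-topic word lists could be collected from `lda_model.show_topic(topic_id)` before running the same check.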


### Visualize the Topic Model
- Use **pyLDAvis**
    - designed to help users **interpret the topics** in a topic model that has been fit to a corpus of text data
    - extracts information from a fitted LDA topic model to inform an interactive web-based visualization

In [73]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Prepare the visualization
pyLDAvis.enable_notebook()
vis_data = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis_data)