# News Modeling

Topic modeling involves **extracting features from document terms** and using
mathematical techniques such as matrix factorization and singular value decomposition (SVD) to generate **clusters or groups of terms** that are distinguishable from each other. These clusters of words form topics or concepts.

Topic modeling is a method for **unsupervised classification** of documents, similar to clustering on numeric data.

These concepts can be used to interpret the main **themes** of a corpus and to make **semantic connections among words that co-occur** frequently in various documents.

Topic modeling can help in the following areas:
- discovering the **hidden themes** in the collection
- **classifying** the documents into the discovered themes
- using the classification to **organize/summarize/search** the documents

Frameworks and algorithms to build topic models:
- Latent semantic indexing
- Latent Dirichlet allocation
- Non-negative matrix factorization

## Latent Dirichlet Allocation (LDA)
The latent Dirichlet allocation (LDA) technique is a **generative probabilistic model** in which each **document is assumed to be a mixture of topics**, similar to the probabilistic latent semantic indexing model.

In simple terms, the idea behind LDA is twofold:
- each **document** can be described by a **distribution of topics**
- each **topic** can be described by a **distribution of words**
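
As a rough illustration of these two ideas, the sketch below generates a toy document by first drawing a topic from the document's topic distribution and then drawing a word from that topic's word distribution. The topic names, vocabularies, and probabilities are made up for the example. The full LDA model also draws both kinds of distribution from Dirichlet priors, which this sketch skips.

```python
import random


def generate_document(doc_topic_dist, topic_word_dists, length, rng):
    """Sample a document: pick a topic for each position, then a word from it."""
    topics = list(doc_topic_dist)
    topic_weights = [doc_topic_dist[t] for t in topics]
    words = []
    for _ in range(length):
        # Draw a topic according to the document's topic distribution
        topic = rng.choices(topics, weights=topic_weights, k=1)[0]
        # Draw a word according to that topic's word distribution
        vocab = list(topic_word_dists[topic])
        word_weights = [topic_word_dists[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words


# Hypothetical toy distributions, not taken from the newsgroups data
topic_word_dists = {
    "space": {"orbit": 0.5, "launch": 0.3, "nasa": 0.2},
    "autos": {"car": 0.6, "engine": 0.3, "dealer": 0.1},
}
doc_topic_dist = {"space": 0.8, "autos": 0.2}

rng = random.Random(0)
document = generate_document(doc_topic_dist, topic_word_dists, length=10, rng=rng)
print(document)
```

Training a topic model runs this story in reverse: only the words are observed, and the two sets of distributions are inferred from them.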

### LDA Algorithm

1. For each document, **randomly assign each word to one of the K topics** (K is chosen beforehand)
2. For each document D, go through each word W and, for each topic T, compute:
    - **P(T | D)**, the proportion of words in D currently assigned to topic T
    - **P(W | T)**, the proportion of assignments to topic T, over all documents, that come from the word W
3. **Reassign word W to topic T** with probability P(T | D) × P(W | T), considering all other words and their topic assignments
4. Repeat steps 2 and 3 for a number of iterations until the assignments settle
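
The steps above can be sketched as a small collapsed Gibbs sampler. This is a minimal sketch under stated assumptions, not the procedure gensim uses (its `LdaModel` is trained with a variational method). The function name `gibbs_lda`, the smoothing constants `alpha` and `beta`, and the toy documents are all illustrative choices. The smoothing terms are added so that no topic ever gets zero probability.

```python
import random
from collections import defaultdict


def gibbs_lda(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy sampler: returns topic assignments and the count tables behind them."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Step 1: randomly assign every word occurrence to a topic
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * num_topics for _ in docs]                 # words in D assigned to T
    topic_word = [defaultdict(int) for _ in range(num_topics)]   # times W is assigned to T
    topic_total = [0] * num_topics                               # all assignments to T
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = assignments[d][i]
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current word so only the other assignments count
                t = assignments[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1

                # Step 2: score each topic by P(T | D) * P(W | T), with smoothing.
                # P(T | D) is left unnormalized because its denominator is the
                # same for every topic and cancels when sampling.
                weights = []
                for k in range(num_topics):
                    p_t_d = doc_topic[d][k] + alpha
                    p_w_t = (topic_word[k][w] + beta) / (topic_total[k] + beta * vocab_size)
                    weights.append(p_t_d * p_w_t)

                # Step 3: reassign the word by sampling from those scores
                t = rng.choices(range(num_topics), weights=weights, k=1)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1

    return assignments, doc_topic, topic_word


# Hypothetical toy corpus of already tokenized documents
docs = [
    ["car", "engine", "dealer", "car"],
    ["orbit", "launch", "nasa", "orbit"],
    ["car", "engine", "orbit"],
]
assignments, doc_topic, topic_word = gibbs_lda(docs, num_topics=2)
print(assignments)
```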

![LDA](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/LDA.png)

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Stop words removal
    - Email removal
    - Non-alphabetic words removal
    - Tokenize
    - Lowercase
    - BiGrams & TriGrams
    - Lemmatization
- Create a dictionary for the document
- Filter low frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Evaluate the Topic Model
    - Model Perplexity
    - Topic Coherence
- Visualize the topics

### Install the necessary library

In [1]:
!pip install pyLDAvis gensim spacy
!pip install gensim==3.6.0



### Import the libraries

In [2]:
import pandas as pd
import numpy as np
import json
import re
import requests
from gensim import corpora, models
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize



### Download the dataset
Dataset: https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

#### 20-Newsgroups dataset
- 11K newsgroups posts
- 20 news topics

### Load the dataset

In [11]:
def load_data(data_url):
    response = requests.get(data_url)
    if response.status_code == 200:
        data = json.loads(response.text)
        return data
    else:
        print("Failed to load data.")
        return None

data_url = 'https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json'
data = load_data(data_url)

In [12]:
df = pd.DataFrame(data)

### Preprocess the data

### Tokenize
- Create **sent_to_words()**
    - Use **gensim.utils.simple_preprocess**
    - Use a **generator** instead of a regular function

In [13]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

# Build the stop word set once instead of on every call
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = re.sub(r'\S*@\S*\s?', '', str(text))  # remove email addresses
    text = re.sub(r'\s+', ' ', text)             # collapse runs of whitespace
    text = re.sub(r'\W', ' ', text)              # replace non-word characters with spaces
    text = re.sub(r'\d+', '', text)              # remove digits
    text = text.lower()
    tokens = word_tokenize(text)
    # Drop English stop words
    tokens = [word for word in tokens if word not in stop_words]
    return tokens

df['Processed_data'] = df['content'].apply(preprocess_text)



### Stop words Removal
- Extend the stop words corpus with the following words
    - from
    - subject
    - re
    - edu
    - use
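
The cells shown in this notebook do not extend the stop word list, and the printed topics further down still contain words such as "subject" and "use". A minimal sketch of the extension step could look like the following. The helper names are hypothetical, and the short `base_stop_words` list is only a stand-in for `stopwords.words('english')` so that the example runs without downloading NLTK data.

```python
def extend_stop_words(base_stop_words, extra_words):
    """Return a new set containing the base stop words plus the extras."""
    return set(base_stop_words) | {word.lower() for word in extra_words}


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not stop words."""
    return [token for token in tokens if token not in stop_words]


# Stand-in for stopwords.words('english'); the real NLTK list is much longer
base_stop_words = ["the", "a", "is", "in", "of"]
extra_words = ["from", "subject", "re", "edu", "use"]
stop_words = extend_stop_words(base_stop_words, extra_words)

tokens = ["from", "the", "subject", "car", "engine", "edu"]
filtered = remove_stop_words(tokens, stop_words)
print(filtered)
```

These five words come from the headers and addresses that appear in almost every newsgroup post, so they carry little information about the topic of a post.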

### Create a dictionary and corpus

In [16]:
dictionary = corpora.Dictionary(df['Processed_data'])
corpus = [dictionary.doc2bow(text) for text in df['Processed_data']]
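
The steps listed earlier include filtering low-frequency words, which the cells shown here do not do. The sketch below illustrates the idea with plain Python on a made-up corpus: it drops words that appear in too few documents or in too large a share of them. The function name and thresholds are illustrative assumptions.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below=2, no_above=0.5):
    """Drop words found in fewer than no_below docs or in more than no_above of them."""
    num_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each word once per document
        doc_freq.update(set(doc))
    keep = {
        word
        for word, freq in doc_freq.items()
        if freq >= no_below and freq / num_docs <= no_above
    }
    return [[word for word in doc if word in keep] for doc in docs]


# Hypothetical tokenized documents
docs = [
    ["car", "engine", "would"],
    ["car", "dealer", "would"],
    ["orbit", "launch", "would"],
    ["orbit", "nasa", "would"],
]
filtered_docs = filter_by_document_frequency(docs, no_below=2, no_above=0.5)
print(filtered_docs)
```

In gensim, `Dictionary.filter_extremes` with its `no_below` and `no_above` parameters plays a similar role. If it is used, the bag-of-words corpus should be built after the dictionary has been filtered.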

### Build a News Topic Model

#### LdaModel
- **num_topics** : the number of topics, which must be chosen beforehand
- **chunksize** : the number of documents used in each training chunk
- **alpha** : the hyperparameter that affects the sparsity of the document-topic distributions (`'auto'` learns it from the data)
- **passes** : the total number of training passes over the corpus

In [17]:
num_topics = 10
lda_model = models.LdaModel(corpus=corpus,
                            id2word=dictionary,
                            num_topics=num_topics,
                            random_state=42,
                            update_every=1,
                            chunksize=100,
                            passes=5,
                            alpha='auto',
                            per_word_topics=True)

### Print the keywords in the 10 topics

In [20]:
topics = lda_model.show_topics(num_topics=num_topics, num_words=10, formatted=True)
for topic in topics:
    print(topic)

(0, '0.054*"r" + 0.042*"p" + 0.041*"w" + 0.041*"_" + 0.036*"g" + 0.034*"u" + 0.030*"q" + 0.022*"c" + 0.020*"e" + 0.019*"x"')
(1, '0.007*"government" + 0.006*"state" + 0.006*"people" + 0.005*"gun" + 0.004*"mr" + 0.004*"states" + 0.004*"us" + 0.004*"american" + 0.004*"national" + 0.004*"president"')
(2, '0.009*"writes" + 0.008*"article" + 0.007*"organization" + 0.007*"lines" + 0.007*"subject" + 0.006*"year" + 0.006*"like" + 0.005*"go" + 0.005*"good" + 0.005*"would"')
(3, '0.357*"ax" + 0.028*"_o" + 0.018*"gv" + 0.016*"ei" + 0.015*"di" + 0.010*"p" + 0.010*"el" + 0.009*"bf" + 0.008*"ql" + 0.007*"rlk"')
(4, '0.011*"use" + 0.009*"key" + 0.007*"system" + 0.006*"one" + 0.006*"may" + 0.006*"chip" + 0.006*"would" + 0.005*"public" + 0.005*"number" + 0.004*"used"')
(5, '0.012*"people" + 0.012*"god" + 0.010*"one" + 0.010*"would" + 0.007*"think" + 0.007*"writes" + 0.007*"evidence" + 0.006*"know" + 0.006*"believe" + 0.006*"say"')
(6, '0.011*"car" + 0.008*"drive" + 0.007*"scsi" + 0.006*"gm" + 0.006*"id

## Evaluation of Topic Models
- Model Perplexity
- Topic Coherence

### Model Perplexity

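Model perplexity measures **how well** a **probability distribution** or probability model **predicts a sample**; lower perplexity indicates a better fit. Note that gensim's `log_perplexity` returns a per-word likelihood bound (a negative number) rather than the perplexity itself, so values closer to zero are better.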

In [21]:
perplexity = lda_model.log_perplexity(corpus)
print(f"Model Perplexity: {perplexity}")

Model Perplexity: -8.805778755872058
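
The printed value is the per-word bound, not a perplexity. A minimal sketch of the conversion follows, assuming gensim's convention of reporting the perplexity estimate as 2 raised to the negated bound. The helper name is hypothetical.

```python
import math


def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity estimate."""
    return math.pow(2.0, -per_word_bound)


per_word_bound = -8.805778755872058  # value printed by the cell above
perplexity = bound_to_perplexity(per_word_bound)
print(f"Perplexity estimate: {perplexity:.1f}")
```

Perplexity is most useful for comparing models trained on the same corpus, for example with different numbers of topics, and is ideally computed on held-out documents rather than on the training corpus.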


### Topic Coherence
Topic coherence scores a single topic by measuring the **degree of semantic similarity** between the **high-scoring words** in the topic.

In [22]:
coherence_model = models.CoherenceModel(model=lda_model, texts=df['Processed_data'], dictionary=dictionary, coherence='c_v')
coherence_score = coherence_model.get_coherence()
print(f"Topic Coherence: {coherence_score}")

Topic Coherence: 0.6175215557628652
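
To give a feel for how a coherence score relates to word co-occurrence, here is a sketch of a simpler measure than the `c_v` score used above: a UMass-style coherence that rewards pairs of top words that appear in the same documents. This is a swapped-in illustration rather than the `c_v` computation, and the function name and toy documents are assumptions. gensim's `CoherenceModel` also accepts `coherence='u_mass'`.

```python
import math


def umass_coherence(top_words, docs):
    """Average log((D(w_i, w_j) + 1) / D(w_j)) over ordered pairs of top words."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents that contain every one of the given words
        return sum(1 for words_in_doc in doc_sets if all(w in words_in_doc for w in words))

    score, pairs = 0.0, 0
    for i in range(1, len(top_words)):
        for j in range(i):
            # Condition on the higher-ranked word of the pair
            denominator = doc_count(top_words[j])
            if denominator == 0:
                continue  # skip words that never occur in the corpus
            score += math.log((doc_count(top_words[i], top_words[j]) + 1) / denominator)
            pairs += 1
    return score / pairs if pairs else 0.0


# Hypothetical tokenized documents
docs = [
    ["car", "engine", "dealer"],
    ["car", "engine"],
    ["orbit", "launch"],
    ["orbit", "nasa", "launch"],
]
coherent = umass_coherence(["car", "engine"], docs)
mixed = umass_coherence(["car", "orbit"], docs)
print(coherent, mixed)
```

The topic whose top words share documents scores higher than the one that mixes unrelated words, which is the intuition behind coherence measures in general.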


### Visualize the Topic Model
- Use **pyLDAvis**
    - designed to help users **interpret the topics** in a topic model that has been fit to a corpus of text data
    - extracts information from a fitted LDA topic model to inform an interactive web-based visualization

In [23]:
import pyLDAvis
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()

vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)
