# News Modeling

Topic modeling involves **extracting features from document terms** and using mathematical techniques such as matrix factorization and singular value decomposition (SVD) to generate **clusters or groups of terms** that are distinguishable from each other. These clusters of words form topics or concepts.

Topic modeling is a method for **unsupervised classification** of documents, similar to clustering on numeric data.

These concepts can be used to interpret the main **themes** of a corpus and to make **semantic connections among words that frequently co-occur** in various documents.

Topic modeling can help in the following areas:
- discovering the **hidden themes** in the collection
- **classifying** the documents into the discovered themes
- using the classification to **organize/summarize/search** the documents

Frameworks and algorithms to build topic models:
- Latent semantic indexing
- Latent Dirichlet allocation
- Non-negative matrix factorization

## Latent Dirichlet Allocation (LDA)
Latent Dirichlet allocation (LDA) is a **generative probabilistic model** in which each **document is assumed to be a mixture of topics**, similar to the probabilistic latent semantic indexing (pLSI) model.

In simple words, the idea behind LDA is twofold:
- each **document** can be described by a **distribution of topics**
- each **topic** can be described by a **distribution of words**
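
To make these two distributions concrete, here is a minimal sketch of the generative story LDA assumes, using only the standard library. The topic names, vocabularies, and probabilities are made up for illustration and are not taken from the newsgroups data. In full LDA both distributions are themselves drawn from Dirichlet priors; here they are fixed by hand.

```python
import random

# Hypothetical topic -> word distributions (each one sums to 1).
topic_word = {
    "autos": {"car": 0.5, "engine": 0.3, "door": 0.2},
    "space": {"orbit": 0.5, "launch": 0.3, "engine": 0.2},
}

# Hypothetical document -> topic distribution (sums to 1).
doc_topic = {"autos": 0.7, "space": 0.3}


def generate_document(doc_topic, topic_word, length, rng):
    """Sample each word by drawing a topic first, then a word from that topic."""
    topics = list(doc_topic)
    topic_weights = [doc_topic[t] for t in topics]
    words = []
    for _ in range(length):
        topic = rng.choices(topics, weights=topic_weights, k=1)[0]
        vocab = list(topic_word[topic])
        word_weights = [topic_word[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample_doc = generate_document(doc_topic, topic_word, length=8, rng=rng)
print(sample_doc)
```

Training an LDA model runs this story in reverse: given only the words, it estimates the two distributions that most plausibly produced them.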

### LDA Algorithm

1. For each document, **randomly assign each word to one of the K topics** (K is chosen beforehand)
2. For each document D, go through each word W and, for each topic T, compute:
    - **P(T | D)**, the proportion of words in D that are currently assigned to topic T
    - **P(W | T)**, the proportion of assignments to topic T, over all documents, that come from the word W
3. **Reassign word W to a topic T** with probability P(T | D) × P(W | T), considering all other words and their topic assignments
4. Repeat steps 2 and 3 until the topic assignments stabilize
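
The steps above can be sketched as a toy collapsed Gibbs sampler in pure Python. This is an illustration under stated assumptions, not what gensim does internally (gensim's `LdaModel` uses online variational Bayes). The function name `gibbs_lda`, the smoothing values `alpha` and `beta`, and the toy documents are all hypothetical.

```python
import random


def gibbs_lda(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA on a list of tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    vocab_size = len(vocab)

    # Step 1: randomly assign every word occurrence to one of the K topics.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # Count tables derived from the current assignments.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    for d, doc in enumerate(docs):
        for i, word in enumerate(doc):
            topic = assignments[d][i]
            doc_topic[d][topic] += 1
            topic_word[topic][word] += 1
            topic_total[topic] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Take the current word out of the counts before resampling it.
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][word] -= 1
                topic_total[old] -= 1

                # Step 2: unnormalized P(T | D) * P(W | T) for every topic T.
                # alpha and beta are smoothing terms (assumed values) that
                # keep every weight strictly positive.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]

                # Step 3: reassign the word to a topic drawn with those weights.
                new = rng.choices(range(num_topics), weights=weights, k=1)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][word] += 1
                topic_total[new] += 1

    return assignments, doc_topic, topic_word


toy_docs = [
    ["car", "engine", "door", "car"],
    ["orbit", "launch", "orbit", "engine"],
    ["car", "door", "engine"],
]
assignments, doc_topic, topic_word = gibbs_lda(toy_docs, num_topics=2)
print(doc_topic)  # per-document topic counts after sampling
```

The denominator of P(T | D) is the same for every topic within a document, so the sketch leaves it out; `random.choices` only needs relative weights.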

![LDA](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/LDA.png)

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Stop words removal
    - Email removal
    - Non-alphabetic words removal
    - Tokenize
    - Lowercase
    - BiGrams & TriGrams
    - Lemmatization
- Create a dictionary for the document
- Filter low frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Evaluate the Topic Model
    - Model Perplexity
    - Topic Coherence
- Visualize the topics

### Install the necessary library

In [1]:
! pip install pyLDAvis gensim spacy nltk pandas



### Import the libraries

In [30]:
import gensim
import spacy
import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import Phrases, LdaModel
from nltk.corpus import stopwords


### Download the dataset
Dataset: https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

#### 20-Newsgroups dataset
- 11K newsgroups posts
- 20 news topics

In [31]:
url = "https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json"
data = pd.read_json(url)
data

| Unnamed: 0 | content | target | target_names |
|---|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 | rec.autos |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 | comp.sys.mac.hardware |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 | comp.sys.mac.hardware |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 | comp.graphics |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 | sci.space |
| ... | ... | ... | ... |
| 11309 | From: jim.zisfein@factory.com (Jim Zisfein) \n... | 13 | sci.med |
| 11310 | From: ebodin@pearl.tufts.edu\nSubject: Screen ... | 4 | comp.sys.mac.hardware |
| 11311 | From: westes@netcom.com (Will Estes)\nSubject:... | 3 | comp.sys.ibm.pc.hardware |
| 11312 | From: steve@hcrlgw (Steven Collins)\nSubject: ... | 1 | comp.graphics |


### Load the dataset

In [32]:
dataset = data['content']

### Preprocess the data

### Email Removal

In [33]:
import re
dataset  = dataset.map(lambda x: re.sub(r'\S+@\S+', '', x))


### Newline Removal

In [34]:
dataset  = dataset.map(lambda x: x.replace('\n', ' '))

### Single Quotes Removal

In [35]:
dataset = dataset.map(lambda x: x.replace("'", ''))
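
The three cleaning steps above can be checked on a single string before running them over the whole column. The post below is made up for illustration; only the three substitutions come from the cells above.

```python
import re

# Hypothetical raw post in the same style as the dataset entries.
raw_post = "From: someone@example.com (A User)\nSubject: Where's my car?\nLines: 2"

cleaned = re.sub(r'\S+@\S+', '', raw_post)  # drop anything that looks like an e-mail address
cleaned = cleaned.replace('\n', ' ')         # turn newlines into spaces
cleaned = cleaned.replace("'", '')           # drop single quotes ("Where's" becomes "Wheres")

print(cleaned)
```

Removing the quote rather than replacing it with a space keeps contractions as a single token, which is why a token such as `wheres` shows up after tokenization.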

### Tokenize
- Create **sent_to_words()** 
    - Use **gensim.utils.simple_preprocess**
    - Use a **generator** instead of a regular function

In [36]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))


In [37]:
data_words = list(sent_to_words(dataset))
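
For intuition about what `simple_preprocess` does with `deacc=True`, the sketch below approximates it with the standard library: lowercase the text, strip accents, and keep alphabetic tokens of 2 to 15 characters. The names `rough_preprocess`, `strip_accents`, and `sent_to_words_sketch` are hypothetical, and the real gensim function may differ in edge cases.

```python
import re
import unicodedata

# Runs of word characters that are not digits (an approximation of gensim's tokenizer).
TOKEN_PATTERN = re.compile(r'(?:(?!\d)\w)+')


def strip_accents(text):
    """Remove combining accent marks, e.g. turn 'é' into 'e'."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')


def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, de-accent, tokenize, and drop very short or very long tokens."""
    text = strip_accents(text.lower())
    return [tok for tok in TOKEN_PATTERN.findall(text) if min_len <= len(tok) <= max_len]


def sent_to_words_sketch(sentences):
    """Generator: yields one token list at a time instead of building them all at once."""
    for sentence in sentences:
        yield rough_preprocess(str(sentence))


tokens = list(sent_to_words_sketch(["Wheres my café? I saw 2 cars."]))
print(tokens)
```

Because the function is a generator, documents are processed lazily; wrapping it in `list(...)` materializes the result only when it is needed.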

### Stop words Removal
- Extend the stop words corpus with the following words
    - from
    - subject
    - re
    - edu
    - use

In [38]:
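import nltk
nltk.download('stopwords')  # fetches the stop word lists if they are not already installed
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])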


#### remove_stopwords( )

In [84]:
def remove_stopwords(texts):
    # Use a set so each membership test is fast
    stop_set = set(stop_words)
    return [[word for word in doc if word not in stop_set] for doc in texts]

data_words_nostops = remove_stopwords(data_words)

### Bigrams
- Use **gensim.models.Phrases**
- Use a **threshold** of 100 (a higher threshold yields fewer phrases)

In [85]:
from gensim.models import Phrases
bigram = Phrases(data_words_nostops, min_count=10, threshold=100)

def make_bigrams(texts):
    return [bigram[doc] for doc in texts]

data_words_bigrams = make_bigrams(data_words_nostops)
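
To see how `min_count` and `threshold` interact, here is a small sketch of the scoring rule that, as far as I know, gensim's default scorer follows (the formula from the original word2vec phrase-detection work). The function name `phrase_score` and all the counts are made up for illustration.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score a candidate bigram: high when the pair co-occurs far more than chance."""
    return (count_ab - min_count) / count_a / count_b * vocab_size


THRESHOLD = 100  # same value as in the Phrases call above

# Two rare words that almost always appear together score high.
rare_pair = phrase_score(count_a=40, count_b=50, count_ab=30,
                         vocab_size=20000, min_count=10)

# Two common words that occasionally appear together score low.
common_pair = phrase_score(count_a=2000, count_b=3000, count_ab=30,
                           vocab_size=20000, min_count=10)

for name, score in [("rare_pair", rare_pair), ("common_pair", common_pair)]:
    print(name, round(score, 4), "merged" if score > THRESHOLD else "kept separate")
```

Pairs whose score exceeds the threshold are joined with an underscore, which is where tokens such as `nntp_posting` and `wam_umd` in the output come from.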

In [86]:
data_words_bigrams

[['wheres',
  'thing',
  'car',
  'nntp_posting',
  'host',
  'rac',
  'wam_umd',
  'organization',
  'university',
  'maryland_college',
  'park',
  'lines',
  'wondering',
  'anyone',
  'could',
  'enlighten',
  'car',
  'saw',
  'day',
  'door',
  'sports',
  'car',
  'looked',
  'late',
  'early',
  'called',
  'bricklin',
  'doors',
  'really',
  'small',
  'addition',
  'front',
  'bumper',
  'separate',
  'rest',
  'body',
  'know',
  'anyone',
  'tellme',
  'model',
  'name',
  'engine',
  'specs',
  'years',
  'production',
  'car',
  'made',
  'history',
  'whatever',
  'info',
  'funky',
  'looking',
  'car',
  'please',
  'mail',
  'thanks',
  'il',
  'brought',
  'neighborhood',
  'lerxst'],
 ['guy_kuo',
  'si',
  'clock',
  'poll',
  'final',
  'call',
  'summary',
  'final',
  'call',
  'si',
  'clock',
  'reports',
  'keywords',
  'si',
  'acceleration',
  'clock',
  'upgrade',
  'article',
  'shelley',
  'qvfo',
  'innc',
  'organization',
  'university',
  'washington

#### make_bigrams( )

In [87]:
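def make_bigrams(texts):
    # Apply the trained Phrases model to merge frequent word pairs in each document
    return [bigram[doc] for doc in texts]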

### Lemmatization
- Use spacy
    - Download spacy en model (if you have not done that before)
    - Load the spacy model

In [80]:
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.0/en_core_web_sm-3.7.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [88]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
nlp

<spacy.lang.en.English at 0x2b7239fd0>

#### lemmatization( )

In [89]:
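def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document, keeping only tokens with an allowed part of speech.

    POS tag reference: https://spacy.io/api/annotation
    """
    texts_out = []
    for sent in texts:
        # Re-join the tokens so spaCy can tag and lemmatize the whole document
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out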

In [90]:
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

In [91]:
print(data_lemmatized[:1])

[['s', 'thing', 'car', 'nntp_poste', 'host', 'rac', 'wam_umd', 'organization', 'university', 'park', 'line', 'wonder', 'enlighten', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'door', 'really', 'small', 'addition', 'front', 'bumper', 'separate', 'rest', 'body', 'know', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'make', 'history', 'info', 'funky', 'look', 'car', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst']]


### Create a Dictionary

In [92]:
from gensim.corpora import Dictionary
dictionary = Dictionary(data_lemmatized)

### Create Corpus

In [93]:
corpus = [dictionary.doc2bow(text) for text in data_lemmatized]

### Filter low-frequency words

In [94]:
dictionary.filter_extremes(no_below=5, no_above=0.5)
corpus = [dictionary.doc2bow(text) for text in data_lemmatized]
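
The two ideas in this step, document-frequency filtering and the bag-of-words encoding, can be sketched with the standard library. This is a simplified illustration, not gensim's implementation: `filter_vocab` and `to_bow` are hypothetical helpers, the toy documents are made up, and gensim assigns token ids differently.

```python
from collections import Counter

toy_docs = [
    ["car", "engine", "car"],
    ["car", "door"],
    ["engine", "orbit"],
    ["car", "launch", "orbit"],
]


def filter_vocab(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` docs and at most `no_above` of all docs."""
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * len(docs)
    kept = sorted(tok for tok, df in doc_freq.items() if no_below <= df <= max_docs)
    return {tok: idx for idx, tok in enumerate(kept)}


def to_bow(doc, token2id):
    """Encode a document as sorted (token_id, count) pairs, ignoring filtered tokens."""
    counts = Counter(tok for tok in doc if tok in token2id)
    return sorted((token2id[tok], n) for tok, n in counts.items())


token2id = filter_vocab(toy_docs, no_below=2, no_above=0.5)
toy_corpus = [to_bow(doc, token2id) for doc in toy_docs]
print(token2id)
print(toy_corpus)
```

In the toy data, `car` is dropped for being too common (3 of 4 documents) while `door` and `launch` are dropped for being too rare. The corpus has to be rebuilt after filtering because the filter changes which tokens have ids, which is why `doc2bow` is called again in the cell above.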

### Create Index to Word Dictionary

In [95]:
# Create Index to word dictionary
# id2token is filled lazily, so look up one item first to populate it
temp = dictionary[0]
id2word = dictionary.id2token
id2word

{0: 'addition',
 1: 'body',
 2: 'bring',
 3: 'bumper',
 4: 'call',
 5: 'car',
 6: 'day',
 7: 'door',
 8: 'early',
 9: 'engine',
 10: 'enlighten',
 11: 'front',
 12: 'funky',
 13: 'history',
 14: 'host',
 15: 'info',
 16: 'know',
 17: 'late',
 18: 'look',
 19: 'mail',
 20: 'make',
 21: 'model',
 22: 'name',
 23: 'neighborhood',
 24: 'nntp_poste',
 25: 'park',
 26: 'production',
 27: 'rac',
 28: 'really',
 29: 'rest',
 30: 's',
 31: 'see',
 32: 'separate',
 33: 'small',
 34: 'spec',
 35: 'sport',
 36: 'thank',
 37: 'thing',
 38: 'university',
 39: 'wam_umd',
 40: 'wonder',
 41: 'year',
 42: 'acceleration',
 43: 'adapter',
 44: 'add',
 45: 'answer',
 46: 'article',
 47: 'attain',
 48: 'base',
 49: 'brave',
 50: 'brief',
 51: 'card',
 52: 'clock',
 53: 'cpu',
 54: 'detail',
 55: 'do',
 56: 'especially',
 57: 'experience',
 58: 'fair',
 59: 'final',
 60: 'floppy',
 61: 'floppy_disk',
 62: 'functionality',
 63: 'heat',
 64: 'hour',
 65: 'keyword',
 66: 'knowledge',
 67: 'message',
 68: 'netw

### Build a News Topic Model

#### LdaModel
- **num_topics**: the number of topics, which you need to define beforehand
- **chunksize**: the number of documents to be used in each training chunk
- **alpha**: the prior on each document's topic distribution, which affects how sparse the topic mixtures are; `'auto'` learns an asymmetric prior from the corpus
- **passes**: the total number of training passes through the corpus

In [96]:
ldamodel = LdaModel(corpus, num_topics=15, id2word=id2word, passes=20, chunksize=100, alpha='auto')

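### Print the top 10 keywords of each topic
- **top_topics** returns each topic's top words together with a coherence score (UMass by default), ordered from most to least coherent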

In [97]:
print(ldamodel.top_topics(corpus,topn=10))

[([(0.025102759, 'get'), (0.022942627, 'know'), (0.021482272, 'think'), (0.020423062, 'make'), (0.020363376, 'go'), (0.017554216, 'say'), (0.016860114, 'good'), (0.016654259, 'time'), (0.016446045, 'see'), (0.016027348, 'well')], -0.9824462876082904), ([(0.02123026, 'believe'), (0.020043062, 'evidence'), (0.019800128, 'reason'), (0.014657215, 'people'), (0.014048684, 'claim'), (0.0129031, 'say'), (0.012409707, 'question'), (0.011887853, 'mean'), (0.0116438065, 'exist'), (0.011272035, 'sense')], -1.5181630767526673), ([(0.10508674, 'nntp_poste'), (0.09909073, 'host'), (0.06738176, 'article'), (0.049816053, 'reply'), (0.035393756, 'thank'), (0.03202165, 'university'), (0.030116007, 'post'), (0.024217857, 'mail'), (0.021635043, 'm'), (0.018764725, 'help')], -1.733909716973707), ([(0.04264433, 'program'), (0.04144653, 'file'), (0.034835476, 'information'), (0.025746338, 'available'), (0.025044078, 'include'), (0.021013953, 'source'), (0.020891607, 'image'), (0.019762851, 'message'), (0.018

## Evaluation of Topic Models
- Model Perplexity
- Topic Coherence

### Model Perplexity

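Model perplexity is a measurement of **how well** a **probability distribution** or probability model **predicts a sample**. A lower perplexity indicates a better fit. Note that gensim's `log_perplexity` returns the per-word likelihood bound, which is negative, rather than the perplexity itself.

In [98]:
# Model Perplexity
# log_perplexity returns the per-word likelihood bound, not the perplexity itself
perplexity = ldamodel.log_perplexity(corpus)
print("Perplexity: ", perplexity)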

Perplexity:  -10.995148262396489
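
The value printed above is a per-word likelihood bound, so it can be converted into a conventional perplexity. A small sketch, assuming the base-2 convention that gensim uses when it logs perplexity during training; the variable names are illustrative.

```python
# Per-word likelihood bound copied from the output above.
log_bound = -10.995148262396489

# Perplexity is 2 raised to the negative bound: a bound closer to zero
# means a lower (better) perplexity.
perplexity_value = 2 ** (-log_bound)
print("Perplexity estimate:", perplexity_value)
```

This score is computed on the training corpus itself, so it is most useful for comparing models trained on the same data rather than as an absolute measure.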


### Topic Coherence
Topic coherence scores a single topic by measuring the **degree of semantic similarity** between the **high-scoring words** in that topic. Higher coherence generally indicates topics that are easier to interpret.

In [99]:
from gensim.models.coherencemodel import CoherenceModel

coherence_model = CoherenceModel(model=ldamodel, coherence='c_v', texts=data_lemmatized, corpus=corpus, dictionary=dictionary)
coherence_score = coherence_model.get_coherence()
print("Coherence Score: ", coherence_score)


Coherence Score:  0.5289886195071403
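
To give a feel for what a coherence measure computes, here is a simplified UMass-style score in pure Python. It is not the `c_v` measure used above, which is more involved; it only illustrates the shared idea that a topic is coherent when its top words tend to appear in the same documents. The function name `umass_coherence` and the toy documents are hypothetical.

```python
import math

toy_docs = [
    ["car", "engine", "door"],
    ["car", "engine"],
    ["orbit", "launch"],
    ["orbit", "car"],
]


def umass_coherence(top_words, docs, epsilon=1.0):
    """Average log ratio of co-document frequency to document frequency.

    Assumes every word in `top_words` occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            later, earlier = top_words[i], top_words[j]
            scores.append(
                math.log((doc_freq(later, earlier) + epsilon) / doc_freq(earlier))
            )
    return sum(scores) / len(scores)


coherent_topic = umass_coherence(["car", "engine"], toy_docs)
mixed_topic = umass_coherence(["car", "launch"], toy_docs)
print(coherent_topic, mixed_topic)
```

Words that share documents (`car`, `engine`) score higher than words that never co-occur (`car`, `launch`). UMass-style scores are zero or negative, which matches the negative values returned by `top_topics` earlier.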


### Visualize the Topic Model
- Use **pyLDAvis**
    - designed to help users **interpret the topics** in a topic model that has been fit to a corpus of text data
    - extracts information from a fitted LDA topic model to inform an interactive web-based visualization

In [100]:
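import pyLDAvis
# Recent pyLDAvis releases expose the gensim helper as gensim_models
# (older releases used pyLDAvis.gensim)
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()

In [101]:
pyLDAvis.gensim_models.prepare(ldamodel, corpus, dictionary)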