# News Modeling

Topic modeling involves **extracting features from document terms** and using mathematical techniques such as matrix factorization and singular value decomposition (SVD) to generate **clusters or groups of terms** that are distinguishable from each other. These clusters of words form topics or concepts.

Topic modeling is a method for **unsupervised classification** of documents, similar to clustering on numeric data.

These topics can be used to interpret the main **themes** of a corpus and to make **semantic connections among words that co-occur** frequently across documents.

Topic modeling can help in the following areas:
- discovering the **hidden themes** in the collection
- **classifying** the documents into the discovered themes
- using the classification to **organize/summarize/search** the documents

Frameworks and algorithms to build topic models:
- Latent semantic indexing
- Latent Dirichlet allocation
- Non-negative matrix factorization

## Latent Dirichlet Allocation (LDA)
Latent Dirichlet allocation (LDA) is a **generative probabilistic model** in which each **document is assumed to be a mixture of topics**, similar to the probabilistic latent semantic indexing model.

In simple terms, the idea behind LDA is twofold:
- each **document** can be described by a **distribution of topics**
- each **topic** can be described by a **distribution of words**
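
The two distributions above can be read as a recipe for generating a document: pick a topic for each word position from the document's topic distribution, then pick a word from that topic's word distribution. The following is a minimal sketch of that recipe only. The topic names, vocabulary, and probabilities are made up for illustration, and full LDA additionally draws both distributions from Dirichlet priors, which this sketch skips.

```python
import random

# Hypothetical toy model: two topics, each a distribution over three words.
topic_word = {
    "autos": {"car": 0.5, "engine": 0.3, "door": 0.2},
    "space": {"orbit": 0.5, "launch": 0.3, "moon": 0.2},
}

def generate_document(doc_topics, n_words, rng):
    """Sample a document: choose a topic per word, then a word from that topic."""
    topics = list(doc_topics)
    words = []
    for _ in range(n_words):
        topic = rng.choices(topics, weights=[doc_topics[t] for t in topics])[0]
        vocab = list(topic_word[topic])
        word = rng.choices(vocab, weights=[topic_word[topic][w] for w in vocab])[0]
        words.append(word)
    return words

rng = random.Random(0)
# A document that is mostly about autos, with a little space content mixed in.
doc = generate_document({"autos": 0.8, "space": 0.2}, n_words=10, rng=rng)
print(doc)
```

Training a topic model runs this idea in reverse: only the words are observed, and the two sets of distributions are inferred from them.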

### LDA Algorithm

1. For each document, **randomly assign each word to one of the K topics** (K is chosen beforehand)
2. For each document D, go through each word W and compute:
    - **P(T | D)**, the proportion of words in D currently assigned to topic T
    - **P(W | T)**, the proportion of assignments to topic T, over all documents, that come from the word W
3. **Reassign word W to topic T** with probability P(T | D) × P(W | T), considering all other words and their topic assignments
4. Repeat steps 2 and 3 for a number of iterations until the assignments stabilize

![LDA](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/LDA.png)
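
The reassignment loop described above can be sketched as a toy collapsed Gibbs sampler. This is an illustrative sketch, not how gensim trains its model (gensim uses online variational Bayes). The function name `gibbs_lda`, the smoothing constants `alpha` and `beta`, and the tiny corpus are assumptions made for the example.

```python
import random
from collections import defaultdict

def gibbs_lda(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; alpha and beta are assumed smoothing constants."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]                # words in doc d assigned to topic t
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # times word w is assigned to topic t
    topic_total = [0] * num_topics                              # all words assigned to topic t
    assignments = []

    def update(d, w, t, delta):
        doc_topic[d][t] += delta
        topic_word[t][w] += delta
        topic_total[t] += delta

    # Step 1: give every word a random initial topic
    for d, doc in enumerate(docs):
        assignments.append([rng.randrange(num_topics) for _ in doc])
        for w, t in zip(doc, assignments[d]):
            update(d, w, t, +1)

    # Steps 2-4: repeatedly resample the topic of each word
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, assignments[d][i], -1)  # exclude the word being resampled
                weights = [
                    (doc_topic[d][t] + alpha)  # proportional to P(T | D)
                    * (topic_word[t][w] + beta) / (topic_total[t] + beta * vocab_size)  # P(W | T)
                    for t in range(num_topics)
                ]
                new_t = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new_t
                update(d, w, new_t, +1)
    return assignments, doc_topic

docs = [["car", "engine", "car", "door"], ["orbit", "launch", "moon", "orbit"],
        ["car", "door", "engine"], ["moon", "launch", "orbit"]]
assignments, doc_topic = gibbs_lda(docs, num_topics=2)
print(doc_topic)  # per-document counts of words assigned to each topic
```

The smoothing terms keep every topic possible for every word, so no probability collapses to zero early in sampling.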

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Stop words removal
    - Email removal
    - Non-alphabetic words removal
    - Tokenize
    - Lowercase
    - BiGrams & TriGrams
    - Lemmatization
- Create a dictionary for the document
- Filter low frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Evaluate the Topic Model
    - Model Perplexity
    - Topic Coherence
- Visualize the topics

### Install the necessary library

In [1]:
#! pip install pyLDAvis gensim spacy

Defaulting to user installation because normal site-packages is not writeable
Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
Collecting funcy
  Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Collecting numexpr
  Downloading numexpr-2.8.7-cp39-cp39-macosx_10_9_x86_64.whl (102 kB)
Installing collected packages: numexpr, funcy, pyLDAvis
Successfully installed funcy-2.0 numexpr-2.8.7 pyLDAvis-3.4.1
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


### Import the libraries

In [60]:
import gensim
import spacy
import pandas as pd
from gensim.utils import simple_preprocess
from gensim.models.phrases import Phrases, Phraser
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis



### Download the dataset
Dataset: https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

#### 20-Newsgroups dataset
- 11K newsgroups posts
- 20 news topics

In [15]:
import requests

url = "https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json"
data = requests.get(url).json()

### Load the dataset

In [26]:
df = pd.DataFrame(data)
df

| | content | target | target_names |
|---|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 | rec.autos |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 | comp.sys.mac.hardware |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 | comp.sys.mac.hardware |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 | comp.graphics |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 | sci.space |
| ... | ... | ... | ... |
| 11309 | From: jim.zisfein@factory.com (Jim Zisfein) \n... | 13 | sci.med |
| 11310 | From: ebodin@pearl.tufts.edu\nSubject: Screen ... | 4 | comp.sys.mac.hardware |
| 11311 | From: westes@netcom.com (Will Estes)\nSubject:... | 3 | comp.sys.ibm.pc.hardware |
| 11312 | From: steve@hcrlgw (Steven Collins)\nSubject: ... | 1 | comp.graphics |


### Preprocess the data

### Email Removal

In [27]:
import re

In [28]:
df['content'] = df['content'].apply(lambda x: re.sub(r'\S*@\S*\s?', '', x))
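
To see what the pattern above removes, it helps to run it on a single string. The sketch below applies the same regular expression to the first header line of the dataset. The variable names are chosen for the example.

```python
import re

# Same pattern as above: any run of non-space characters containing "@",
# plus at most one trailing whitespace character.
EMAIL_PATTERN = re.compile(r'\S*@\S*\s?')

sample = "From: lerxst@wam.umd.edu (where's my thing)"
cleaned = EMAIL_PATTERN.sub('', sample)
print(cleaned)  # From: (where's my thing)
```

The pattern removes any whitespace-delimited token that contains `@`, so it also strips tokens that are not valid addresses. That is usually acceptable for topic modeling.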

### Newline Removal

In [30]:
df['content'] = df['content'].replace('\n', ' ', regex=True)

### Single Quotes Removal

In [32]:
df['content'] = df['content'].replace("'", "", regex=True)

### Tokenize
- Create **sent_to_words()** 
    - Use **gensim.utils.simple_preprocess**
    - Use a **generator** instead of a regular function

In [36]:
def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases and tokenizes the text; deacc=True also strips accent marks
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

In [35]:
data_words = list(sent_to_words(df['content']))
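
The sketch below illustrates the two ideas in this step without requiring gensim: a rough stand-in for `simple_preprocess`, and the lazy behaviour of a generator. The helper names `sketch_preprocess` and `sent_to_words_sketch` are hypothetical, and the regular expression only approximates gensim's tokenizer (it keeps ASCII letters only). The length limits mirror the `min_len=2` and `max_len=15` defaults of `simple_preprocess`.

```python
import re

def sketch_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for simple_preprocess: lowercase, keep alphabetic tokens of moderate length."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

def sent_to_words_sketch(sentences):
    # A generator yields one tokenized document at a time instead of building the whole list up front
    for sentence in sentences:
        yield sketch_preprocess(str(sentence))

gen = sent_to_words_sketch(["What car is this?", "I saw a 2-door sports car"])
first = next(gen)   # only the first document has been tokenized at this point
rest = list(gen)    # consuming the generator tokenizes the remaining documents
print(first)  # ['what', 'car', 'is', 'this']
print(rest)   # [['saw', 'door', 'sports', 'car']]
```

Wrapping the generator in `list(...)`, as the cell above does, materializes every document at once. Iterating over it directly keeps memory use low for large corpora.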

### Stop words Removal
- Extend the stop words corpus with the following words
    - from
    - subject
    - re
    - edu
    - use

In [38]:
from gensim.parsing.preprocessing import STOPWORDS

my_stop_words = STOPWORDS.union(set(['from', 'subject', 're', 'edu', 'use']))

#### remove_stopwords( )

In [39]:
def remove_stopwords(texts):
    # Each doc is already a list of tokens, so it can be filtered directly
    return [[word for word in doc if word not in my_stop_words] for doc in texts]

In [40]:
data_words_nostops = remove_stopwords(data_words)

In [41]:
data_words_nostops

[['wheres',
  'thing',
  'car',
  'nntp',
  'posting',
  'host',
  'rac',
  'wam',
  'umd',
  'organization',
  'university',
  'maryland',
  'college',
  'park',
  'lines',
  'wondering',
  'enlighten',
  'car',
  'saw',
  'day',
  'door',
  'sports',
  'car',
  'looked',
  'late',
  'early',
  'called',
  'bricklin',
  'doors',
  'small',
  'addition',
  'bumper',
  'separate',
  'rest',
  'body',
  'know',
  'tellme',
  'model',
  'engine',
  'specs',
  'years',
  'production',
  'car',
  'history',
  'info',
  'funky',
  'looking',
  'car',
  'mail',
  'thanks',
  'il',
  'brought',
  'neighborhood',
  'lerxst'],
 ['guy',
  'kuo',
  'si',
  'clock',
  'poll',
  'final',
  'summary',
  'final',
  'si',
  'clock',
  'reports',
  'keywords',
  'si',
  'acceleration',
  'clock',
  'upgrade',
  'article',
  'shelley',
  'qvfo',
  'innc',
  'organization',
  'university',
  'washington',
  'lines',
  'nntp',
  'posting',
  'host',
  'carson',
  'washington',
  'fair',
  'number',
  'brav

### Bigrams
- Use **gensim.models.Phrases**
- 100 as threshold

In [42]:
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
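
The `min_count` and `threshold` parameters decide which adjacent word pairs get merged into a single token such as `nntp_posting`. The sketch below shows the kind of score involved, assuming the default scorer follows the formula `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`. Pairs whose score exceeds `threshold` are merged. The function name and toy corpus are made up for the example, and the exact vocabulary size gensim uses in its normalization may differ from this sketch.

```python
from collections import Counter

def bigram_scores(docs, min_count=5):
    """Score adjacent word pairs; higher means the pair co-occurs more than chance suggests."""
    unigrams = Counter(w for doc in docs for w in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue  # pairs seen too rarely are ignored
        scores[(a, b)] = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
    return scores

# Hypothetical toy corpus: "nntp posting" always appear together, "host" also appears elsewhere
docs = [["nntp", "posting", "host"]] * 6 + [["host", "lines"]] * 6
scores = bigram_scores(docs, min_count=5)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

Because `host` occurs in many other contexts, `posting host` scores lower than `nntp posting`. A high `threshold` such as 100 therefore keeps only strongly associated pairs.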


#### make_bigrams( )

In [43]:
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

In [44]:
data_words_bigrams = make_bigrams(data_words_nostops)

In [45]:
data_words_bigrams

[['wheres',
  'thing',
  'car',
  'nntp_posting',
  'host',
  'rac_wam',
  'umd',
  'organization',
  'university',
  'maryland_college',
  'park',
  'lines',
  'wondering',
  'enlighten',
  'car',
  'saw',
  'day',
  'door',
  'sports',
  'car',
  'looked',
  'late',
  'early',
  'called',
  'bricklin',
  'doors',
  'small',
  'addition',
  'bumper',
  'separate',
  'rest',
  'body',
  'know',
  'tellme',
  'model',
  'engine',
  'specs',
  'years',
  'production',
  'car',
  'history',
  'info',
  'funky',
  'looking',
  'car',
  'mail',
  'thanks',
  'il',
  'brought',
  'neighborhood',
  'lerxst'],
 ['guy_kuo',
  'si',
  'clock',
  'poll',
  'final',
  'summary',
  'final',
  'si',
  'clock',
  'reports',
  'keywords',
  'si',
  'acceleration',
  'clock',
  'upgrade',
  'article',
  'shelley',
  'qvfo',
  'innc',
  'organization',
  'university',
  'washington',
  'lines',
  'nntp_posting',
  'host',
  'carson_washington',
  'fair',
  'number',
  'brave',
  'souls',
  'upgraded',
 

['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp_posting_host', 'rac_wam_umd_edu', 'organization', 'university', 'of', 'maryland_college_park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front_bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']




### Lemmatization
- Use spaCy
    - Download the spaCy English model `en_core_web_sm` (if you have not done that before)
    - Load the spaCy model

In [20]:
#! python -m spacy download en_core_web_sm



In [48]:
# Load the small English pipeline; the parser and NER are disabled because only POS tags and lemmas are needed
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

#### lemmatization( )

In [49]:
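def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each tokenized document, keeping only tokens whose POS tag is allowed.

    POS tag reference: https://spacy.io/api/annotation
    """
    texts_out = []
    for sent in texts:
        # Re-join the tokens so spaCy can tag them in context
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out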

In [50]:
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

In [51]:
print(data_lemmatized[:1])

[['s', 'thing', 'car', 'nntp_poste', 'host', 'rac_wam', 'university', 'park', 'line', 'wonder', 'enlighten', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'door', 'small', 'addition', 'bumper', 'separate', 'rest', 'body', 'know', 'model', 'engine', 'spec', 'year', 'production', 'car', 'history', 'info', 'funky', 'look', 'car', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst']]


### Create a Dictionary

In [52]:
id2word = gensim.corpora.Dictionary(data_lemmatized)

### Create Corpus

In [53]:
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
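
Each entry of `corpus` is a bag-of-words: a list of `(token_id, count)` pairs for one document. The sketch below mimics that representation with the standard library only. The helper names and the order in which ids are assigned are assumptions of the sketch, not gensim's implementation.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [["car", "engine", "car"], ["engine", "door"]]
token2id = build_vocab(docs)
bow = [to_bow(d, token2id) for d in docs]
print(token2id)  # {'car': 0, 'engine': 1, 'door': 2}
print(bow)       # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded in this representation, which is why phrases worth keeping together are merged into bigram tokens before this step.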


### Filter low-frequency words

In [54]:
id2word.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
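
With these arguments, `filter_extremes` keeps tokens that appear in at least 15 documents (`no_below`) and in no more than 50% of all documents (`no_above`), and then keeps at most the 100,000 most frequent of those (`keep_n`). The sketch below reproduces that rule with the standard library on a toy corpus. The function name and the tie-breaking order are assumptions of the sketch.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above, keep_n=None):
    """Return the tokens that survive document-frequency filtering."""
    # Count the number of documents each token appears in (not total occurrences)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Most frequent tokens first; ties broken alphabetically (an assumption of this sketch)
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return set(kept[:keep_n])

docs = [["car", "engine"], ["car", "door"], ["car", "orbit"], ["engine", "orbit", "typo"]]
vocab = filter_by_doc_freq(docs, no_below=2, no_above=0.5)
print(sorted(vocab))  # ['engine', 'orbit']
```

Here `car` is dropped for being too common (3 of 4 documents), while `door` and `typo` are dropped for being too rare.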


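### Create Index to word dictionary

Note that rebuilding the dictionary from `data_lemmatized` restores the full vocabulary, so the `filter_extremes` call above has no effect on the model trained below. To train on the filtered vocabulary, `filter_extremes` would have to run before the corpus is created, and this rebuild would be skipped.

In [55]:
id2word = gensim.corpora.Dictionary(data_lemmatized)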


### Build a News Topic Model

#### LdaModel
- **num_topics**: the number of topics, which must be chosen beforehand
- **chunksize**: the number of documents processed in each training chunk
- **alpha**: the hyperparameter that controls the sparsity of the per-document topic distribution; `'auto'` lets gensim learn an asymmetric prior from the corpus
- **passes**: the total number of training passes over the corpus

In [56]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20,
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)


### Print the top 10 keywords of each topic

In [57]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))


Topic: 0 
Words: 0.107*"team" + 0.099*"game" + 0.085*"win" + 0.074*"play" + 0.043*"year" + 0.034*"server" + 0.028*"run" + 0.026*"goal" + 0.025*"score" + 0.021*"division"
Topic: 1 
Words: 0.063*"wing" + 0.062*"nhl" + 0.057*"controller" + 0.047*"external" + 0.039*"vote" + 0.033*"flight" + 0.032*"hook" + 0.032*"battery" + 0.024*"percent" + 0.023*"task"
Topic: 2 
Words: 0.124*"gun" + 0.054*"law" + 0.042*"crime" + 0.040*"weapon" + 0.039*"citizen" + 0.032*"firearm" + 0.025*"rate" + 0.023*"tax" + 0.022*"court" + 0.019*"carry"
Topic: 3 
Words: 0.051*"character" + 0.043*"monitor" + 0.041*"internal" + 0.030*"series" + 0.029*"cable" + 0.029*"hole" + 0.028*"normal" + 0.028*"generate" + 0.027*"past" + 0.024*"font"
Topic: 4 
Words: 0.051*"sale" + 0.042*"player" + 0.033*"cheap" + 0.033*"bike" + 0.027*"wire" + 0.027*"drug" + 0.022*"review" + 0.019*"ride" + 0.018*"ground" + 0.018*"material"
Topic: 5 
Words: 0.057*"evidence" + 0.037*"believe" + 0.032*"reason" + 0.029*"faith" + 0.024*"sense" + 0.022*"exi

## Evaluation of Topic Models
- Model Perplexity
- Topic Coherence

### Model Perplexity

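Model perplexity measures **how well** a **probability distribution** or probability model **predicts a sample**; lower perplexity indicates a better fit. Note that gensim's `log_perplexity` returns a per-word log-likelihood bound rather than the perplexity itself, so the value printed below is negative; a bound closer to zero corresponds to a lower perplexity.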

In [58]:
print('Perplexity: ', lda_model.log_perplexity(corpus))

Perplexity:  -13.822208537623625
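
The number printed above is the per-word likelihood bound returned by `log_perplexity`, not a perplexity in the usual sense. According to the gensim documentation, the corresponding perplexity is `2^(-bound)`. A minimal sketch of the conversion, using the value from the output above:

```python
# Per-word likelihood bound as printed by lda_model.log_perplexity(corpus) above
bound = -13.822208537623625

# gensim reports perplexity as 2 ** (-bound); lower is better
perplexity = 2 ** (-bound)
print(f"Perplexity: {perplexity:.1f}")
```

Perplexity is mainly useful for comparing models trained on the same corpus, ideally on held-out documents rather than on the training corpus as done here.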


### Topic Coherence
Topic coherence measures score a single topic by measuring the **degree of semantic similarity** between the **high-scoring words** in the topic. Higher scores indicate topics whose top words tend to appear together.

In [59]:
from gensim.models import CoherenceModel

coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)



Coherence Score:  0.5348419123535703
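
The `c_v` measure used above combines several steps and is not easy to reproduce by hand. To give an intuition for what a coherence score captures, the sketch below implements a simpler, UMass-style measure instead: for each pair of top words, it takes the log of how often the two co-occur in documents relative to how often the higher-ranked word occurs. This is a different measure from `c_v`, and the function name and toy corpus are assumptions made for the example.

```python
import math

def umass_coherence(top_words, docs):
    """Average log((D(wi, wj) + 1) / D(wj)) over pairs of top words, wj ranked above wi."""
    doc_sets = [set(d) for d in docs]

    def doc_count(*words):
        # Number of documents containing all of the given words
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            wi, wj = top_words[i], top_words[j]
            denom = doc_count(wj)
            if denom:
                scores.append(math.log((doc_count(wi, wj) + 1) / denom))
    return sum(scores) / len(scores) if scores else 0.0

docs = [["team", "game", "win"], ["team", "game"], ["gun", "law"], ["gun", "law", "court"]]
coherent = umass_coherence(["team", "game"], docs)  # words that share documents
mixed = umass_coherence(["team", "gun"], docs)      # words that never co-occur
print(round(coherent, 3), round(mixed, 3))
```

A topic whose top words share documents scores higher than one that mixes unrelated words, which is the same intuition behind comparing `c_v` scores across models with different `num_topics`.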


### Visualize the Topic Model
- Use **pyLDAvis**
    - designed to help users **interpret the topics** in a topic model that has been fit to a corpus of text data
    - extracts information from a fitted LDA topic model to inform an interactive web-based visualization

In [61]:
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, corpus, id2word)
vis