`S.Y.Babu, Data Scientist`

# Latent Dirichlet Allocation (LDA)   
# Unsupervised

This is a probabilistic model used to find cluster assignments for documents.  
It uses two sets of probabilities to cluster documents: 
- **P(word | topic)**: the probability that a particular word is associated with a particular topic. This first set of probabilities forms the **Word X Topic** matrix.  
- **P(topic | document)**: the probability that a particular topic is associated with a particular document. This second set of probabilities forms the **Topics X Documents** matrix.   
These probabilities are estimated for all words, topics and documents.    



We will be using the dataset of the Australian Broadcasting Corporation, available on Kaggle.  
If the data was downloaded as a ZIP archive, the commented-out cells below show how to unzip it.


In [1]:
#dataset in zip format
#import zipfile
#import os
# Path to the ZIP file
#zip_file_path = 'lda.zip'
# Folder to extract files to
#extracted_folder_path = 'lda_data'
# Extract ZIP file
#with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    #zip_ref.extractall(extracted_folder_path)

In [2]:
# List the extracted files
#extracted_files = os.listdir(extracted_folder_path)
#print(extracted_files)

In [3]:
#Identify the CSV file (assuming there is only one CSV file)
#csv_file = [file for file in extracted_files if file.endswith(".csv")]
#if csv_file:
    #csv_file_path = os.path.join(extracted_folder_path, csv_file[0])
    #print("CSV file found:", csv_file_path)
#else:
    #print("No CSV file found!")
    #exit()
#csv
#csv_file

In [4]:
# Load the dataset into a pandas DataFrame
#df = pd.read_csv(csv_file_path, delimiter=',', on_bad_lines='skip')     # check the delimiter (',' or '|')
# If a ParserError is raised, `on_bad_lines='skip'` skips the problematic rows

In [5]:
import warnings
# To ignore all warnings, you can use the following:
warnings.filterwarnings('ignore')

## Import Useful Libraries 

In [6]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

## Load the Dataset

In [7]:
news_data = pd.read_csv("abcnews-date-text.csv")
news_data.shape

(1244184, 2)

In [8]:
news_data.sample(5)

Unnamed: 0,publish_date,headline_text
87233,20040427,man arrested over spanish bombings
24254,20030616,call for marine park plan to have more no fishing
262909,20061010,security council condemns north korea nuclear ...
704052,20120624,aussie kennedy wins in japan
1178282,20191006,paris attacker radical vision islam anti terro...


Our data has over a million records, and there are two columns: 
- the date on which a particular headline was published.  
- the actual headline.   
By looking at the 5 sampled rows, we can see that we don't have the topic of the headline text! So, we will use LDA to attempt to figure out clusters of the news.   
Over **a million** records is a lot of data. To make the computation faster, we will use only **20000** records. You can increase the number of observations if you wish. 

## Preprocessing.    

In [9]:
NUM_SAMPLES = 20000 # The number of samples to use 
sample_df = news_data.sample(NUM_SAMPLES, replace=False).reset_index(drop=True)

In [10]:
sample_df.shape

(20000, 2)

In [11]:
sample_df.sample(5) # randomly show 5 rows

Unnamed: 0,publish_date,headline_text
18201,20060718,landing gear problems force f 111 to dump fuel
9710,20091203,turnbull stands by bishop
11140,20050505,union maintains attack on hospital management
4395,20181004,abc board appoints independent adviser for inv...
18906,20140623,nbnco has a strategy to keep phone and interne...
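
Because `sample` draws rows at random, each run of the notebook selects different headlines, so the vocabulary size and the topics reported further down can vary between runs. A minimal sketch of a reproducible draw follows, assuming a tiny stand-in DataFrame and an arbitrary seed (`random_state=42` is an illustrative choice, not a value from this notebook):

```python
import pandas as pd

# Stand-in for news_data (assumption: tiny made-up frame for illustration)
toy_news = pd.DataFrame({
    "publish_date": [20030219, 20030220, 20030221, 20030222, 20030223],
    "headline_text": ["headline a", "headline b", "headline c", "headline d", "headline e"],
})

# Fixing random_state makes the same rows come back on every run
first_draw = toy_news.sample(3, replace=False, random_state=42).reset_index(drop=True)
second_draw = toy_news.sample(3, replace=False, random_state=42).reset_index(drop=True)

print(first_draw.equals(second_draw))  # True: both draws select identical rows
```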


We are not interested in the **publish_date** column, since we will only be using the **headline_text** data.    

**`max_df`**` : float in range [0.0, 1.0] or int, default=1.0`<br>
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

**`min_df`**` : float in range [0.0, 1.0] or int, default=1`<br>
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.     


By defining the **CountVectorizer** object as below, we ignore:   
- all terms that appear in more than 95% of the documents in our corpus. Terms above this threshold are considered not significant; most of them are `stopwords`. English stop words are additionally removed through `stop_words="english"`.   

- all terms that appear in fewer than three documents of the corpus.  

In [12]:
cv = CountVectorizer(max_df=0.95, min_df=3, stop_words="english")
dtm = cv.fit_transform(sample_df['headline_text'])

In [13]:
dtm

<20000x6362 sparse matrix of type '<class 'numpy.int64'>'
	with 90127 stored elements in Compressed Sparse Row format>

We can observe that our Document X Term Matrix (dtm) has:  
- 20000 documents, and  
- 6362 distinct words   

We can also get all those words using the `get_feature_names_out()` function

In [14]:
#feature_names = cv.get_feature_names()
feature_names = cv.get_feature_names_out()
len(feature_names) # show the total number of distinct words
#get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2; use get_feature_names_out() instead.

6362

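Let's have a look at some of the features that have been extracted from the documents. Note that the slice below starts at index 6420, which is past the end of the 6362-term vocabulary, so it returns an empty array; a slice such as `feature_names[-10:]` would show the last ten terms instead.  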

In [15]:
feature_names[6420:]

array([], dtype=object)

## LDA     
From our DTM matrix, we can now build our LDA model to extract topics from the underlying texts. The number of topics to be extracted is a hyperparameter, so we do not know it at a glance. In our case, we will be using 7 topics.   
LDA is an iterative algorithm; we will run 30 iterations in our case, whereas the default value of `max_iter` is 10.  

In [16]:
# Set the number of topics
NB_TOPICS = 7 

# Create the model
LDA_model = LatentDirichletAllocation(n_components = NB_TOPICS, max_iter = 30, random_state = 2021)

# Fit the model on the dtm
LDA_model.fit(dtm)
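
Since the number of topics is a hyperparameter, one rough way to compare candidate values is the model's `perplexity` on the document-term matrix (lower is better). The sketch below is only an illustration on a made-up corpus with arbitrary candidate values; it is not how the value 7 was chosen in this notebook, and scoring on a held-out split would give a fairer comparison than scoring on the training data as done here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus (assumption: made up for illustration only)
toy_docs = [
    "police charge man after car crash",
    "man killed in highway car crash",
    "government announces new health budget",
    "minister defends government health plan",
    "rain floods town and closes road",
    "storm and rain delay cricket match",
]
toy_dtm = CountVectorizer().fit_transform(toy_docs)

candidate_topic_counts = [2, 3, 4]  # arbitrary candidates
perplexities = {}
for k in candidate_topic_counts:
    model = LatentDirichletAllocation(n_components=k, max_iter=30, random_state=2021)
    model.fit(toy_dtm)
    # Perplexity of the fitted model on the same matrix it was trained on
    perplexities[k] = model.perplexity(toy_dtm)

for k, value in perplexities.items():
    print(f"{k} topics: perplexity = {value:.2f}")
```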

### Show Stored Words.   
Let's randomly have a look at some of the words that have been stored.  

In [17]:
len(feature_names)

6362

In [19]:
import random 
for index in range(15):
    # randrange excludes the upper bound, so the index always stays inside the vocabulary
    random_word_ID = random.randrange(len(feature_names))
    print(feature_names[random_word_ID])

loophole
praises
robbed
owner
roger
bounces
whales
taxi
officers
riewoldt
half
suburbs
quits
25pc
stumps


### Top Words Per Topic

In [20]:
len(LDA_model.components_[0])

6362

In [21]:
# Pick a single topic 
a_topic = LDA_model.components_[0]

# Get the indices that would sort this array
a_topic.argsort()

array([1104, 1214, 1507, ..., 2505, 4953, 2508], dtype=int64)

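In [22]:
# Indices sorted from the least to the most representative word
sorted_indices = a_topic.argsort()

# Weight of the word least representative of this topic
a_topic[sorted_indices[0]]

In [23]:
# Weight of the word most representative of this topic
a_topic[sorted_indices[-1]]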

Let's have a look at the top 10 words for the topic we previously took

In [24]:
top_10_words_indices = a_topic.argsort()[-10:]

for i in top_10_words_indices:
    print(cv.get_feature_names_out()[i])

act
indigenous
sydney
new
deal
minister
health
government
says
govt


This looks like a government-related topic. Let's have a look at all 7 topics found. 

In [25]:
for i, topic in enumerate(LDA_model.components_):
    print("THE TOP {} WORDS FOR TOPIC #{}".format(10, i))
    print([cv.get_feature_names_out()[index] for index in topic.argsort()[-10:]])
    print("\n")

THE TOP 10 WORDS FOR TOPIC #0
['act', 'indigenous', 'sydney', 'new', 'deal', 'minister', 'health', 'government', 'says', 'govt']


THE TOP 10 WORDS FOR TOPIC #1
['abuse', 'child', 'man', 'cup', 'accused', 'death', 'world', 'murder', 'interview', 'court']


THE TOP 10 WORDS FOR TOPIC #2
['change', 'fears', 'set', 'industry', 'workers', 'china', 'market', 'australian', 'home', 'new']


THE TOP 10 WORDS FOR TOPIC #3
['rail', 'coronavirus', 'cuts', 'covid', 'says', 'ban', 'wa', 'public', 'group', 'calls']


THE TOP 10 WORDS FOR TOPIC #4
['open', 'takes', 'council', 'says', 'urges', 'final', 'missing', 'water', 'win', 'plan']


THE TOP 10 WORDS FOR TOPIC #5
['country', 'west', 'nsw', 'qld', 'budget', 'gold', 'day', 'north', 'coast', 'south']


THE TOP 10 WORDS FOR TOPIC #6
['road', 'abc', 'dies', 'charged', 'woman', 'killed', 'car', 'crash', 'man', 'police']




### Attach Discovered Topic Labels to Original News

In [26]:
# Link documents to topics
final_topics = LDA_model.transform(dtm)

# Show the shape of the object 
print(final_topics.shape)

(20000, 7)


In [27]:
final_topics

array([[0.02041921, 0.02044082, 0.02040819, ..., 0.02046598, 0.87734868,
        0.0204834 ],
       [0.87736266, 0.02045173, 0.02042787, ..., 0.02042951, 0.02045849,
        0.02042967],
       [0.71415387, 0.04761907, 0.04763602, ..., 0.04771204, 0.04761906,
        0.04761906],
       ...,
       [0.20528415, 0.02380954, 0.0238289 , ..., 0.02384414, 0.02381496,
        0.0238271 ],
       [0.03571435, 0.03615018, 0.03571435, ..., 0.03575049, 0.78494718,
        0.03600911],
       [0.01593638, 0.23815635, 0.01588777, ..., 0.13800094, 0.16887341,
        0.13421798]])

**final_topics** contains, for each of our 20,000 documents, the probability that the document belongs to each of the 7 topics. This is a Document X Topics matrix. 
For example, below are the probability values for the fifth document (index 4).

In [28]:
final_topics[4]

array([0.42854914, 0.02857144, 0.0286978 , 0.02857144, 0.02857144,
       0.02882197, 0.42821677])

In [29]:
final_topics[4].argmax()

0

This value (0) means that our LDA model thinks that the fifth document (index 4) belongs to topic 0, i.e. the first topic.
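
`argmax` keeps only the single most probable topic, which can hide near ties. As a small sketch using the probabilities printed for `final_topics[4]` (the variable names are illustrative), the gap between the best and the second-best topic shows how decisive the assignment is:

```python
import numpy as np

# Topic probabilities for one document, copied from the final_topics[4] output
doc_probs = np.array([0.42854914, 0.02857144, 0.0286978, 0.02857144,
                      0.02857144, 0.02882197, 0.42821677])

ranked = doc_probs.argsort()[::-1]  # topic indices from most to least probable
best, runner_up = ranked[0], ranked[1]
margin = doc_probs[best] - doc_probs[runner_up]

print(best, runner_up)   # 0 6
print(round(margin, 5))  # 0.00033
```

Here topics 0 and 6 are almost equally likely for this headline, so the hard label 0 should be read with caution.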

### Combination with the original data     
Let's create a new column called **Topic N°** that holds the index of the topic to which each document belongs.

In [30]:
sample_df["Topic N°"] = final_topics.argmax(axis=1)

In [31]:
sample_df.head()

Unnamed: 0,publish_date,headline_text,Topic N°
0,20101231,surf rescues prompt renewed warnings to swimmers,5
1,20150904,nsw government unveils olympic stadium plans,0
2,20120210,hewson and carr analyse politics,0
3,20031014,astle blow for kiwis ahead of one day series,4
4,20140921,deadly crash into uni undiscovered for hours,0


According to our LDA model:   
- the first document belongs to topic 5.  
- the second document belongs to topic 0. 
- the third document belongs to topic 0.  
etc.   
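
Once every document carries a topic index, a quick count per topic shows how the documents are spread across topics. The sketch below uses only the five labels visible in the `sample_df.head()` output (the frame `toy_df` is a stand-in, not the full sample):

```python
import pandas as pd

# Topic labels of the five rows shown by sample_df.head()
toy_df = pd.DataFrame({"Topic N°": [5, 0, 0, 4, 0]})

# Number of documents assigned to each topic, sorted by topic index
topic_counts = toy_df["Topic N°"].value_counts().sort_index()
counts_as_dict = {int(topic): int(count) for topic, count in topic_counts.items()}

print(counts_as_dict)  # {0: 3, 4: 1, 5: 1}
```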

In [32]:
!pip install pyLDAvis



In [33]:
#!pip install pyLDAvis==3.3.1
#to help visualize and interpret the results of topic modeling. 
#It's commonly used with topic models like Latent Dirichlet Allocation (LDA) to visualize topics,


In [35]:
#import pyLDAvis.sklearn
#pyLDAvis 3.4.0 renamed the pyLDAvis.sklearn module to pyLDAvis.lda_model
import pyLDAvis
#import pyLDAvis.display

## Some Visualization       
We will be using the `pyLDAvis` module to visualize the topics associated with our documents.   

In [36]:
pyLDAvis.enable_notebook() # To enable the visualization on the notebook

In [37]:
# Prepare pyLDAvis inputs
# components_ holds unnormalized weights; pyLDAvis expects each topic row to sum to 1
topic_term_dists = LDA_model.components_ / LDA_model.components_.sum(axis=1, keepdims=True)

panel = pyLDAvis.prepare(
    topic_term_dists=topic_term_dists,               # Topic-term distributions (rows sum to 1)
    doc_topic_dists=LDA_model.transform(dtm),        # Document-topic distributions
    doc_lengths=dtm.sum(axis=1).A1,                 # Document lengths
    vocab=cv.get_feature_names_out(),               # Vocabulary terms
    term_frequency=dtm.sum(axis=0).A1,              # Term frequencies
    mds='tsne'                                      # Use t-SNE for dimensionality reduction (optional)
)

In [38]:
#panel = pyLDAvis.sklearn.prepare(LDA_model, dtm, cv, mds='tsne') # Create the panel for the visualization for old version
panel

### Some Comments On The Graphic     

- By selecting a particular term on the right, we can see which topic(s) it belongs to.    
- Conversely, by choosing a topic on the left, we can see all its terms, from the most to the least relevant.  

In [None]:
--END--