Topic modeling is a general method that is useful for analyzing newspaper text. It uses machine learning algorithms to identify the underlying topics or themes in a collection of newspaper articles. Grouping articles by topic helps researchers understand the issues, debates, and trends that shape news coverage.

Non-negative Matrix Factorization (NMF): NMF is a matrix factorization technique that decomposes a non-negative document-term matrix (here, TF-IDF weights rather than raw word frequencies) into two smaller non-negative matrices: one gives the weight of each term in each topic, and the other gives the weight of each topic in each document.
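
To make the decomposition concrete, here is a minimal sketch on a toy matrix. The matrix `V`, the choice of two topics, and the names `toy_model`, `W`, and `H` are assumptions made for illustration only; they are not part of the notebook's data.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy document-term matrix: 4 documents (rows) x 6 terms (columns).
# Documents 0-1 use only the first three terms, documents 2-3 only the last three.
V = np.array([
    [3, 2, 1, 0, 0, 0],
    [4, 3, 1, 0, 0, 0],
    [0, 0, 0, 1, 2, 3],
    [0, 0, 0, 1, 1, 2],
], dtype=float)

toy_model = NMF(n_components=2, init='nndsvda', max_iter=1000, random_state=0)
W = toy_model.fit_transform(V)   # document-topic weights, shape (4, 2)
H = toy_model.components_        # topic-term weights, shape (2, 6)

print(W.shape, H.shape)
print(np.round(W @ H, 1))        # approximate reconstruction of V
```

The product `W @ H` only approximates `V`; how close it gets depends on the number of topics chosen.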

In this code, we first load the text data from the folder containing the .txt files and then convert it into a document-term matrix using the TF-IDF vectorizer from Scikit-learn. We then create an NMF model with k topics (k = 5 in this run) and fit it to the document-term matrix.

We print the top 10 words for each topic and compute each document's weight for every topic; a document can then be assigned to the topic with its highest weight. Note that NMF weights are not probabilities. Finally, we print the top 5 documents for each topic to get an idea of the types of documents that fall under each topic. Because we are working with a folder of .txt files, we print the filenames of the documents instead of their content.
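
As a rough illustration of assigning a topic by highest weight, the sketch below uses a small made-up weight matrix. `example_weights` is hypothetical; in the notebook the equivalent matrix is the one returned by the model's `transform` step.

```python
import numpy as np

# Hypothetical document-topic weights: 4 documents (rows) x 3 topics (columns).
example_weights = np.array([
    [0.10, 0.70, 0.05],
    [0.00, 0.02, 0.40],
    [0.55, 0.10, 0.30],
    [0.20, 0.25, 0.90],
])

# Dominant topic per document: column index of the largest weight in each row.
dominant_topic = example_weights.argmax(axis=1)
print(dominant_topic)  # [1 2 0 2]

# Top 2 documents for the topic in column 2, strongest first.
top_docs = example_weights[:, 2].argsort()[::-1][:2]
print(top_docs)  # [3 1]
```

Topic indices here are zero-based, whereas the notebook's printed output numbers topics from 1.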

In [None]:
!pip install scikit-learn --no-index  # only needed on SHARCNET

In [1]:
import os
import glob
import heapq
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import chardet
import re

In [41]:
# from watermark import watermark

# %load_ext watermark

# %watermark --iversions

In [15]:
# Set the path to the root folder containing all subfolders
# root_path = 'C:\\textmining\\echo'
root_path ='challenge' + os.sep + 'echo'
doc_type = 'volumes'
min_word_len = 4 #try to avoid mr, sr, the, etc.
news_range = '[1920 to 1930]'

In [26]:
# Create a list to hold the text data
data = []
# And something for files
files = []

In [27]:
folder_list = sorted(os.listdir(root_path))
folder_paths = [os.path.join(root_path,i) for i in folder_list]

In [28]:
ranges = re.findall(r'\d+', news_range) #pick out numbers
range0 = int(ranges[0])
range1 = int(ranges[1])
re_lns = re.compile("\r\n?|\n") #remove line returns
re_alpha = re.compile(r'[^a-zA-Z\s]') #remove non-alpha (raw string so \s is not treated as an invalid string escape)
prob_words = ['mr','mrs','sald','th', 'amherstburg'] #non-topic words and common OCR errors

docpath = os.sep + doc_type

# Traverse the directory tree and read all text files
for dirpath, dirnames, filenames in os.walk(root_path):
    for filename in filenames:
        folder = int(dirpath.split(os.sep)[2])
        #use len(data) == 0 to debug:
        if filename.endswith('.txt') and doc_type in dirpath and range0 <= folder <= range1:
            file_path = os.path.join(dirpath, filename)
            with open(file_path, 'r', encoding='utf-8') as f:
                text = f.read()
                text = re.sub(re_lns, ' ', text)
                text = re.sub(re_alpha, '', text)
                text_words = text.split()
                topic_words = [word for word in text_words if word.lower() not in prob_words and len(word) >= min_word_len]
                data.append(' '.join(topic_words))
                files.append(file_path)
print("quantity", len(data))

quantity 571
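
The cleaning steps inside the loop can be checked in isolation on a sample string. The sketch below wraps them in a helper; `clean_text` and the sample sentence are hypothetical names and data introduced only for this illustration, while the patterns and word list mirror the ones defined above.

```python
import re

re_lns = re.compile(r"\r\n?|\n")        # line returns
re_alpha = re.compile(r"[^a-zA-Z\s]")   # anything that is not a letter or whitespace
prob_words = ['mr', 'mrs', 'sald', 'th', 'amherstburg']
min_word_len = 4

def clean_text(text):
    """Hypothetical helper: apply the same cleaning steps as the loop above."""
    text = re_lns.sub(' ', text)
    text = re_alpha.sub('', text)
    words = [w for w in text.split()
             if w.lower() not in prob_words and len(w) >= min_word_len]
    return ' '.join(words)

sample = "Mr. Smith sald the council\r\nmet in 1925."
print(clean_text(sample))  # Smith council
```

Short words such as "the" and "met" are dropped by the length filter alone, which is why `min_word_len` has such a strong effect on the vocabulary.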


Before fitting the model, we convert the text data to a document-term matrix with the TfidfVectorizer, removing English stop words in the process. Stemming or lemmatizing the text beforehand is a further option (not applied here) for reducing the number of terms and improving the quality of the features.

We have set max_features=10000, which means that the vectorizer keeps only the 10,000 most frequent terms in the corpus, and min_df=2, which drops terms that appear in fewer than two documents. You can experiment with different values of these parameters to find the best setting for your dataset and analysis.

In [33]:
# Remove stop words and convert text data to a document-term matrix
tfidf = TfidfVectorizer(stop_words='english', max_features=10000, min_df=2)
dtm = tfidf.fit_transform(data)

In [34]:
# Instantiate an NMF model with k topics
k = 5
nmf_model = NMF(n_components=k, init='nndsvda', solver= 'mu', max_iter= 10000)  #nndsvd or nndsvda or nndsvdar

In [35]:
# Fit the model to the document-term matrix
nmf_model.fit(dtm)

NMF(init='nndsvda', max_iter=10000, n_components=5, solver='mu')

In [39]:
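# Print top 10 words for each topic (ascending weight, so the strongest word is last)
feature_names = tfidf.get_feature_names_out()  # look up the vocabulary once, not per word
for i, topic in enumerate(nmf_model.components_):
    print(f'Top 10 words for topic {i+1}:')
    print([feature_names[index] for index in topic.argsort()[-10:]])
    print('\n')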

Top 10 words for topic 1:
['home', 'sunday', 'held', 'county', 'school', 'miss', 'phone', 'essex', 'years', 'church']


Top 10 words for topic 2:
['bank', 'farm', 'sullivan', 'street', 'house', 'acres', 'sale', 'phone', 'good', 'apply']


Top 10 words for topic 3:
['motion', 'mayor', 'windsor', 'sandwich', 'essex', 'road', 'town', 'reeve', 'county', 'council']


Top 10 words for topic 4:
['evening', 'teams', 'club', 'pins', 'league', 'score', 'games', 'high', 'game', 'team']


Top 10 words for topic 5:
['elected', 'years', 'councillors', 'january', 'deputy', 'year', 'acclamation', 'reeve', 'december', 'christmas']




In [40]:
# Get the topic weights for each document (rows: documents, columns: topics)
topic_results = nmf_model.transform(dtm)

In [41]:
# Print the top 5 documents for each topic
top_n = 5  # number of documents to list; independent of the number of topics k
for i in range(k):
    print(f'Top {top_n} documents for topic {i+1}:')
    # Indices of the top_n highest-weighted documents, listed in file order
    top_idx = sorted(topic_results[:, i].argsort()[-top_n:])
    file_names = [os.path.basename(files[j]) for j in top_idx]
    print(file_names)


Top 5 documents for topic 1:
['1925-06-19.txt', '1925-09-25.txt', '1926-02-12.txt', '1926-09-24.txt', '1927-07-08.txt']
Top 5 documents for topic 2:
['1920-04-23.txt', '1920-08-13.txt', '1920-08-20.txt', '1920-08-27.txt', '1922-04-28.txt']
Top 5 documents for topic 3:
['1928-07-27.txt', '1929-09-13.txt', '1930-05-23.txt', '1930-05-30.txt', '1930-06-27.txt']
Top 5 documents for topic 4:
['1928-03-02.txt', '1928-03-16.txt', '1928-04-20.txt', '1928-11-09.txt', '1928-11-16.txt']
Top 5 documents for topic 5:
['1924-12-26.txt', '1925-12-25.txt', '1927-12-23.txt', '1927-12-30.txt', '1928-12-21.txt']

