In this code, we first load the text data from the folder containing the .txt files and then convert the text data into a document-term matrix using the TF-IDF vectorizer from Scikit-learn. We then create an NMF model with 10 topics and fit it to the document-term matrix.

We print the top 10 words for each topic and assign a topic to each document based on the highest probability topic assignment. Finally, we print the top 5 documents for each topic to get an idea of the types of documents that fall under each topic. Note that in this case, we are printing the filenames of the documents instead of their content since we are working with a folder of .txt files.

In [38]:
import os
import glob
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import chardet
import re

In [43]:
from watermark import watermark

In [40]:
%load_ext watermark

In [41]:
%watermark --iversions

pandas : 1.4.4
re     : 2.2.1
chardet: 4.0.0



In [16]:
# Set the path to the root folder containing all subfolders
root_path = 'C:\\textmining\\echo'

In [17]:
# Create a list to hold the text data
data = []

In [18]:
# Traverse the directory tree and read all text files
for dirpath, dirnames, filenames in os.walk(root_path):
    for filename in filenames:
        if filename.endswith('.txt'):
            file_path = os.path.join(dirpath, filename)
            with open(file_path, 'r', encoding='utf-8') as f:
                text = f.read()
                data.append(text)

In [19]:
file_path

'C:\\textmining\\echo\\1876_01\\1876-01-28-0004.txt'

In [20]:
# import os
# import re

# # Define the root directory containing the text files
# root_dir = 'D:/textmining/echo'

# # Define regular expression to match non-alphanumeric characters
# regex = re.compile('[^a-zA-Z0-9\s]')

# # Initialize an empty set to store unique special characters found
# unique_chars = set()

# # Loop over all files in the root directory and its subdirectories
# for subdir, dirs, files in os.walk(root_dir):
#     for file in files:
#         # Check if the file is a text file
#         if file.endswith('.txt'):
#             # Read in the text file
#             file_path = os.path.join(subdir, file)
#             with open(file_path, 'r', encoding='utf-8') as f:
#                 text = f.read()

#             # Find all matches of the regular expression in the text
#             matches = re.findall(regex, text)

#             # Add any unique matches found to the set of unique special characters
#             for match in matches:
#                 unique_chars.add(match)

# # Print the set of unique special characters found
# if len(unique_chars) > 0:
#     print('Unique special characters found:', unique_chars)
# else:
# # Define regular expression to match non-alphanumeric characters
# regex = re.compile('[^a-zA-Z0-9\s]')#     print('No special characters found')

In [21]:
# # Define regular expression to match non-alphanumeric characters
regex = re.compile('[^a-zA-Z0-9\s]')

# Loop over all files in the root directory and its subdirectories
for subdir, dirs, files in os.walk(root_path):
    for file in files:
        # Check if the file is a text file
        if file.endswith('.txt'):
            # Read in the text file
            file_path = os.path.join(subdir, file)
            with open(file_path, 'r', encoding='utf-8') as f:
                text = f.read()

            # Replace all matches of the regular expression in the text with an empty string
            text = re.sub(regex, '', text)

            # Write the modified text back to the file
            with open(file_path, 'w', encoding='utf-8') as f:
                f.write(text)


Before converting the text data to a document-term matrix, we try removing stop words and stemming/lemmatizing the text before passing it to the TfidfVectorizer to reduce the number of terms and improve the quality of the features.

we have set max_features=1000, which means that the vectorizer will only consider the top 1000 most frequent terms in the corpus. You can experiment with different values of max_features to find the best setting for your dataset and analysis.

In [22]:
# Remove stop words and convert text data to a document-term matrix
tfidf = TfidfVectorizer(stop_words='english', max_features=100000, min_df=2)
dtm = tfidf.fit_transform(data)

In [23]:
# Instantiate an NMF model with k topics
k = 5
nmf_model = NMF(n_components=k, init='nndsvd', solver= 'mu', max_iter= 10000)

In [24]:
# Fit the model to the document-term matrix
nmf_model.fit(dtm)



NMF(init='nndsvd', max_iter=10000, n_components=5, solver='mu')

In [25]:
# Print top 10 words for each topic
for i, topic in enumerate(nmf_model.components_):
    print(f'Top 10 words for topic {i+1}:')
    print([tfidf.get_feature_names()[index] for index in topic.argsort()[-10:]])
    print('\n')

Top 10 words for topic 1:
['good', 'nnd', 'house', 'amherstburg', 'town', 'windsor', 'ho', 'mr', 'bo', 'tho']


Top 10 words for topic 2:




['books', 'boots', 'old', 'cash', 'best', 'tweeds', 'new', 'hand', 'stock', 'goods']


Top 10 words for topic 3:
['moved', 'ont', 'motion', 'election', 'council', 'house', 'money', 'wigle', 'amherstburg', 'mr']


Top 10 words for topic 4:
['meeting', 'bank', 'amherstburg', 'malden', 'soap', 'esq', 'evening', '00', 'town', 'mr']


Top 10 words for topic 5:
['amherstburg', 'cured', 'cases', 'fever', 'remedy', 'hand', 'tho', 'cure', 'pain', 'charm']




In [26]:
# Get topic assignments for each document
topic_results = nmf_model.transform(dtm)

In [27]:
# Print the top 5 documents for each topic
for i in range(k):
    print(f'Top 5 documents for topic {i+1}:')
    file_names = [os.path.basename(files[j]) for j in range(len(files)) if topic_results[j, i] == max(topic_results[:, i])]
    print(file_names[:5])
    print('\n')


Top 5 documents for topic 1:
[]


Top 5 documents for topic 2:
[]


Top 5 documents for topic 3:
[]


Top 5 documents for topic 4:
[]


Top 5 documents for topic 5:
[]




In [34]:
pip install watermark


Collecting watermark
  Downloading watermark-2.3.1-py2.py3-none-any.whl (7.2 kB)
Installing collected packages: watermark
Successfully installed watermark-2.3.1
Note: you may need to restart the kernel to use updated packages.


In [35]:
from watermark import watermark

In [42]:
%watermark --iversions

pandas : 1.4.4
re     : 2.2.1
chardet: 4.0.0

