# Do Some Other Stuff

This notebook is based on the [DARIAH-DE tutorial on topic modeling with MALLET](https://de.dariah.eu/tatom/topic_model_mallet.html). Full explanations of the code and the underlying theory are available there. Note that this code was intended for use with small groups of novels, rather than large numbers of small documents. It may not run efficiently on our data, and the graphs may not be completely readable out of the box. **Consider this notebook experimental.**

You should run this code after importing your data and running `2_clean_data.ipynb` on it. If you have a folder containing a prepared set of text files, you can also set that as your source directory in the configuration below.

This notebook allows you to run MALLET and create your topic model independently of `3_make_topic_model.ipynb`. You do not need to run that notebook. However, if you have already created your model in that notebook, you can skip the "Import Data to MALLET" and "Create Topic Model" cells. The rest of the notebook should work.

By default, this notebook runs its topic models with `--random-seed 1`. If you run the same model in `3_make_topic_model.ipynb` with this setting, you should get the same results.

Some of the cells further down in the notebook can take further configuration. Be sure to read any instructions above each cell.

## Configuration

In [None]:
# Project Name
project_name = 'alternative_test'

# Location of Source Files
source_data = 'caches/text_files_clean' # Default
# source_data = 'scrubbed' # Custom source

## Imports

In [None]:
import os
import itertools
import operator
import numpy as np
import pandas as pd

## Settings

In [None]:
## Project directory
project_dir = %pwd
print(project_dir)

## Import project settings
from settings import *

## Make the model directory
!mkdir -p {model_dir}

## Set various variables
CORPUS_PATH = os.path.join(project_dir, source_data)
filenames = sorted([os.path.join(CORPUS_PATH, fn) for fn in os.listdir(CORPUS_PATH)])

## Import Data to MALLET

In [None]:
## Build the mallet import command string
mallet_import_args = '--input ' + CORPUS_PATH + ' ' \
  + '--output ' + project_dir + '/' + model_dir + '/' + model_file + ' ' \
  + '--keep-sequence ' \
  + '--remove-stopwords ' \
  + '--extra-stopwords ' + project_dir + '/' + stopwords_dir + '/' + stopwords_file + ' '
mallet_import_command = 'mallet import-dir ' + mallet_import_args
print(mallet_import_command+'\n')

## Run mallet; capture and display output
mout = !mallet import-dir {mallet_import_args}
print('\n'.join(mout)+'\n')

print(os.listdir(project_dir + '/' + model_dir))

print('\n-----\nModel import done.')
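
Because the import command above is assembled by string concatenation and run through the shell, a project path that contains a space would be split into several arguments. The following is a minimal sketch of one way to guard against that, using only the flags already shown in the cell above. The helper name `build_import_args` and the example paths are hypothetical and not part of this project.

```python
import os
import shlex

def build_import_args(corpus_path, model_path, stopwords_path):
    """Return the `mallet import-dir` arguments as a list of strings."""
    return [
        'import-dir',
        '--input', corpus_path,
        '--output', model_path,
        '--keep-sequence',
        '--remove-stopwords',
        '--extra-stopwords', stopwords_path,
    ]

# Hypothetical paths, chosen to include a space
args = build_import_args(
    os.path.join('my project', 'caches', 'text_files_clean'),
    os.path.join('my project', 'models', 'topics.mallet'),
    os.path.join('my project', 'stopwords', 'stopwords.txt'),
)

# Quote each argument so the shell keeps paths with spaces intact
command = 'mallet ' + ' '.join(shlex.quote(a) for a in args)
print(command)
```

Keeping the arguments in a list also makes it possible to hand them to `subprocess.run(['mallet'] + args)` without any shell quoting at all.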

## Set some helper variables

In [None]:
## Set some helper variables
_model_path = os.path.join(project_dir, model_dir)
_model_file = os.path.join(_model_path, model_file)
_model_state = os.path.join(_model_path, model_state)
_output_doc_topics = os.path.join(_model_path, model_composition)
_output_topic_keys = os.path.join(_model_path, model_keys)
_word_topic_counts = os.path.join(_model_path, model_counts)

## Create Topic Model

In [None]:
## Build the mallet training command string
mallet_train_args = '--input ' + _model_file + ' ' \
  + '--random-seed 1 ' \
  + '--num-topics ' + str(model_num_topics) + ' ' \
  + '--optimize-interval 10 ' \
  + '--output-state ' + _model_state + ' ' \
  + '--output-topic-keys ' + _output_topic_keys + ' ' \
  + '--output-doc-topics ' + _output_doc_topics + ' ' \
  + '--word-topic-counts-file ' + _word_topic_counts
mallet_train_command = 'mallet train-topics ' + mallet_train_args
print(mallet_train_command+'\n')

## Run mallet
!mallet train-topics {mallet_train_args}
print(os.listdir(project_dir + '/' + model_dir))

print('\n-----\nModel training done.')

## Create a Document-Topic Matrix

In [None]:
## Create a document-topic matrix

# Helper function
def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue) 

doctopic_triples = []

mallet_docnames = []

with open(_output_doc_topics) as f:
    f.readline()  # read one line in order to skip the header
    for line in f:
        docnum, docname, *values = line.rstrip().split('\t')
        mallet_docnames.append(docname)
        for topic, share in grouper(2, values):
            triple = (docname, int(topic), float(share))
            doctopic_triples.append(triple)

# Sort the triples
# Triple is (docname, topicnum, share) so sort(key=operator.itemgetter(0,1))
# Sorts on (docname, topicnum) which is what we want
doctopic_triples = sorted(doctopic_triples, key=operator.itemgetter(0,1))

# Sort the document names rather than relying on MALLET's ordering
mallet_docnames = sorted(mallet_docnames)

# Collect into a document-topic matrix
num_docs = len(mallet_docnames)

num_topics = len(doctopic_triples) // len(mallet_docnames)

doctopic = np.zeros((num_docs, num_topics))

for i, (doc_name, triples) in enumerate(itertools.groupby(doctopic_triples, key=operator.itemgetter(0))):
    doctopic[i, :] = np.array([share for _, _, share in triples])
    
print('Document-Topic Matrix created.')
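
To see what the parsing step does, here is a small sketch that runs the same logic on two made-up lines instead of a real MALLET output file. It assumes the layout the cell above expects: a header line, then one tab-separated line per document with alternating topic numbers and shares. The sample document names and values are invented for illustration.

```python
import io
import itertools
import operator

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)

# Invented sample in the assumed doc-topics layout
sample = io.StringIO(
    "#doc name topic proportion ...\n"
    "0\tfile:/corpus/b.txt\t1\t0.75\t0\t0.25\n"
    "1\tfile:/corpus/a.txt\t0\t0.6\t1\t0.4\n"
)

triples = []
sample.readline()  # skip the header
for line in sample:
    docnum, docname, *values = line.rstrip().split('\t')
    for topic, share in grouper(2, values):
        triples.append((docname, int(topic), float(share)))

# Sort by (docname, topic) so shares line up by topic number
triples.sort(key=operator.itemgetter(0, 1))

rows = {}
for docname, group in itertools.groupby(triples, key=operator.itemgetter(0)):
    rows[docname] = [share for _, _, share in group]

print(rows)
```

Note that the topics arrive in a different order on each line; sorting the triples is what puts every document's shares into topic order.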

## Get Article Names

You have two options here. You can try to generate article names from the data file names or create a list of names yourself. The latter takes more work, but the former may create unwieldy or unhelpful article names.

In [None]:
# Configure your own article names -- uncomment the next line to use this
# article_names = ['article1', 'article2', 'article3']

# Generate article names automatically -- comment out these lines to disable them
article_names = []
for fn in filenames:
     basename = os.path.basename(fn)
     name, ext = os.path.splitext(basename)
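     if name.endswith('.json'):
         name = name[:-len('.json')]  # remove a literal '.json' suffix; rstrip() would strip characters, not the suffix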
     article_names.append(name)
    
print('Article Names:')
print(article_names)
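
Removing a `.json` suffix from a name is easy to get subtly wrong, because `str.rstrip` treats its argument as a set of characters rather than as a suffix. The sketch below shows the difference on an invented file name; the helper `article_name` is hypothetical and only illustrates the idea.

```python
import os

def article_name(path, suffix='.json'):
    """Return the file name without its extension or a trailing suffix."""
    name, _ext = os.path.splitext(os.path.basename(path))
    if name.endswith(suffix):
        name = name[:-len(suffix)]
    return name

# Suffix removal keeps the whole name
print(article_name('caches/text_files_clean/jason.json.txt'))

# rstrip removes any trailing '.', 'j', 's', 'o' or 'n' characters
print('jason'.rstrip('.json'))
```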

## Get Document-Topic Groups

In [None]:
# Get doc-topic groups
 
# Turn article_names into an array so we can use NumPy functions
article_names = np.asarray(article_names)

doctopic_orig = doctopic.copy()

# Group the doc-topic groups
num_groups = len(set(article_names))

doctopic_grouped = np.zeros((num_groups, num_topics))

for i, name in enumerate(sorted(set(article_names))):
     doctopic_grouped[i, :] = np.mean(doctopic[article_names == name, :], axis=0)
 
doctopic = doctopic_grouped
print('Doc-Topic groups array was generated.')
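
The grouping step averages the rows of all documents that share a name. A minimal sketch with invented values, where two of three documents are named `a`:

```python
import numpy as np

# Invented names and topic shares: rows 0 and 2 belong to the same group
names = np.asarray(['a', 'b', 'a'])
shares = np.array([[0.2, 0.8],
                   [0.5, 0.5],
                   [0.6, 0.4]])

groups = sorted(set(names))
grouped = np.zeros((len(groups), shares.shape[1]))

for i, name in enumerate(groups):
    # The boolean mask selects every row whose name matches this group
    grouped[i, :] = np.mean(shares[names == name, :], axis=0)

print(grouped)
```

When every name is unique, each group contains a single row and the grouped array has the same rows as the original, ordered by name.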

## Prepare the Model for Inspection

In [None]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

filenames = [os.path.join(CORPUS_PATH, fn) for fn in sorted(os.listdir(CORPUS_PATH))]

vectorizer = CountVectorizer(input='filename')

dtm = vectorizer.fit_transform(filenames)  # a sparse matrix

dtm.shape

dtm.data.nbytes  # number of bytes dtm takes up

dtm.toarray().data.nbytes  # number of bytes dtm as array takes up

doctopic_orig.shape

doctopic_orig.data.nbytes  # number of bytes document-topic shares take up

print('Document-Term Matrix prepared.')

## Identify Significant Topics in Each Document

Make sure that you configure the number of topics you wish to display for each document in the `TOP_N_TOPICS` variable below.

In [None]:
# Identify significant topics in each document

# Number of topics to display
TOP_N_TOPICS = 5

articles = sorted(set(article_names))

print("Top topics in...\n")

for i in range(len(doctopic)):
     top_topics = np.argsort(doctopic[i,:])[::-1][0:TOP_N_TOPICS]
     top_topics_str = ' '.join(str(t) for t in top_topics)
     print("{}: {}".format(articles[i], top_topics_str))
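
The ranking relies on `np.argsort`, which returns indices in ascending order of value; reversing the result and slicing gives the highest-valued topics first. A short sketch with invented shares for one document:

```python
import numpy as np

# Invented topic shares for a single document
shares = np.array([0.1, 0.5, 0.05, 0.35])

# Indices sorted by share, largest first, truncated to the top two
top = np.argsort(shares)[::-1][:2]
print(top)
```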

## Get Topic-Word Distribution

Configure the `N_WORDS_DISPLAY` variable below to change the number of terms displayed for each topic. If you wish to display the top words in a single topic, you can create a new cell and use `print(topic_words[1])` to print the words in topic 1. 

In [None]:
# Get topic-word distribution

N_WORDS_DISPLAY = 10

with open(_output_topic_keys) as f:  # avoid shadowing the built-in input()
     topic_keys_lines = f.readlines()

topic_words = []

for line in topic_keys_lines:
     _, _, words = line.split('\t')  # tab-separated
     words = words.rstrip().split(' ')  # remove the trailing '\n'
     topic_words.append(words)

for t in range(len(topic_words)):
     print("Topic {}: {}".format(t, ' '.join(topic_words[t][:N_WORDS_DISPLAY])))
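
The cell above expects each line of the topic keys file to hold three tab-separated fields, with the topic's words in the third field separated by spaces. The sketch below applies the same parsing to two invented lines in that assumed layout.

```python
# Invented lines in the assumed layout: topic number, weight, words
sample_lines = [
    "0\t0.25\tnovel author reader book\n",
    "1\t0.10\tscience data research study\n",
]

topic_words = []
for line in sample_lines:
    _, _, words = line.split('\t')                 # keep only the third field
    topic_words.append(words.rstrip().split(' '))  # drop the newline, split into words

print(topic_words[1][:2])
```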

## Find Distinctive Topics

This tool allows you to determine which topics are most distinctive in a comparison of two documents. Make sure to configure `doc1` and `doc2` below.

In [None]:
# Configure the names of the two documents to be compared (from the list of article_names)
doc1          = "guardian_hum_0_4"
doc2          = "guardian_hum_118_3"
SHOW_N_TOPICS = 10

# Find distinctive topics
doc1_indices, doc2_indices = [], []

for index, fn in enumerate(sorted(set(article_names))):
     if doc1 in fn:
         doc1_indices.append(index)
     elif doc2 in fn:
         doc2_indices.append(index)

doc1_avg = np.mean(doctopic[doc1_indices, :], axis=0)

doc2_avg = np.mean(doctopic[doc2_indices, :], axis=0)

keyness = np.abs(doc1_avg - doc2_avg)

ranking = np.argsort(keyness)[::-1]  # from highest to lowest; [::-1] reverses order in Python sequences

# Show distinctive topics:
print('Distinctive Topics (ranked from most to least):')
print(ranking[:SHOW_N_TOPICS])
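
The distinctiveness measure used above is the absolute difference between the two documents' average topic shares, ranked from largest to smallest. A small worked sketch with invented shares for three topics:

```python
import numpy as np

# Invented average topic shares for two documents
doc1_avg = np.array([0.7, 0.2, 0.1])
doc2_avg = np.array([0.1, 0.3, 0.6])

# Absolute difference per topic: roughly 0.6, 0.1 and 0.5
keyness = np.abs(doc1_avg - doc2_avg)

# Topic indices ordered from most to least distinctive
ranking = np.argsort(keyness)[::-1]
print(ranking)
```

Because the difference is absolute, the ranking shows which topics separate the two documents but not which document each topic favors; comparing `doc1_avg` and `doc2_avg` directly answers that.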


## Show Topic Shares in Layered Bar Chart

By default, the image is 8" x 8". To adjust the size of the graph, modify the line `fig=plt.figure(figsize=(8, 8), dpi= 80)` in the cell below.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
fig=plt.figure(figsize=(8, 8), dpi= 80)

# Layered bar chart
N, K = doctopic.shape  # N documents, K topics

ind = np.arange(N)  # the x-axis locations for the articles

width = 0.5  # the width of the bars

plots = []

height_cumulative = np.zeros(N)

for k in range(K):
     color = plt.cm.coolwarm(k/K, 1)
     if k == 0:
         p = plt.bar(ind, doctopic[:, k], width, color=color)
     else:
         p = plt.bar(ind, doctopic[:, k], width, bottom=height_cumulative, color=color)
     height_cumulative += doctopic[:, k]
     plots.append(p)

plt.ylim((0, 1))  # proportions sum to 1, so the height of the stacked bars is 1

plt.ylabel('Topics')

plt.title('Topics in articles')

plt.xticks(ind+width/2, sorted(set(article_names)))  # one label per row of the grouped doctopic array
plt.xticks(rotation=90)

plt.yticks(np.linspace(0, 1, 11))  # ticks at 0.0, 0.1, ..., 1.0

topic_labels = ['Topic {}'.format(k) for k in range(K)]

# see http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.legend for details
# on making a legend in matplotlib
# plt.legend([p[0] for p in plots], topic_labels)

plt.show()

## Show Topic Shares in Heatmap

By default, the image is 15" x 5". To adjust the size of the graph, modify the line `fig=plt.figure(figsize=(15, 5), dpi= 80)` in the cell below.

In [None]:
# Heatmap
fig=plt.figure(figsize=(15, 5), dpi= 80)

# Ref: http://nbviewer.ipython.org/5427209
# Ref: http://code.activestate.com/recipes/578175-hierarchical-clustering-heatmap-python/
plt.pcolor(doctopic, norm=None, cmap='Blues')

# put the major ticks at the middle of each cell
# the trailing semicolon ';' suppresses output
plt.yticks(np.arange(doctopic.shape[0])+0.5, sorted(set(article_names)));

plt.xticks(np.arange(doctopic.shape[1])+0.5, topic_labels);

# flip the y-axis so the texts are in the order we anticipate
plt.gca().invert_yaxis()

# rotate the ticks on the x-axis
plt.xticks(rotation=90)

# add a legend
plt.colorbar(cmap='Blues')

plt.tight_layout()  # fixes margins

plt.show()

## Show Topic-Word Associations

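Configure the number of topics and the number of top words to display below. Note that `num_topics` must match the number of topics in your model, since it sets the width of the word-topic count array.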

In [None]:
# Topic-Word Associations

num_topics = 50
num_top_words = 10

# Ignore this redundant code
# with open(_output_topic_keys) as input:
#      topic_keys_lines = input.readlines()

# topic_words = []

# for line in topic_keys_lines:
#      _, _, words = line.split('\t')  # tab-separated
#      words = words.rstrip().split(' ')  # remove the trailing '\n'
#      topic_words.append(words)

# for t in range(len(topic_words)):
#      print("Topic {}: {}".format(t, ' '.join(topic_words[t][:15])))


mallet_vocab = []

word_topic_counts = []

with open(_word_topic_counts) as f:
     for line in f:
         _, word, *topic_count_pairs = line.rstrip().split(' ')
         topic_count_pairs = [pair.split(':') for pair in topic_count_pairs]
         mallet_vocab.append(word)
         counts = np.zeros(num_topics)
         for topic, count in topic_count_pairs:
             counts[int(topic)] = int(count)
         word_topic_counts.append(counts) 

word_topic = np.array(word_topic_counts)

word_topic.shape
    
# np.sum(word_topic, axis=0) sums across rows, so it yields totals of words assigned to topics
word_topic = word_topic / np.sum(word_topic, axis=0)

mallet_vocab = np.array(mallet_vocab)  # convert vocab from a list to an array so we can use NumPy operations on it

for t in range(num_topics):
     top_words_idx = np.argsort(word_topic[:,t])[::-1]  # descending order
     top_words_idx = top_words_idx[:num_top_words]
     top_words = mallet_vocab[top_words_idx]
     top_words_shares = word_topic[top_words_idx, t]
     print("Topic #{}:".format(t))
     for word, share in zip(top_words, top_words_shares):
         print("{} : {}".format(np.round(share, 3), word))

# Word share visualisation -- experimental; probably won't work

# num_top_words = 10

# fontsize_base = 70 / np.max(word_topic) # font size for word with largest share in corpus

# for t in range(num_topics):
#      plt.subplot(1, num_topics, t + 1)  # plot numbering starts with 1
#      plt.ylim(0, num_top_words + 0.5)  # stretch the y-axis to accommodate the words
#      plt.xticks([])  # remove x-axis markings ('ticks')
#      plt.yticks([]) # remove y-axis markings ('ticks')
#      plt.title('Topic #{}'.format(t))
#      top_words_idx = np.argsort(word_topic[:,t])[::-1]  # descending order
#      top_words_idx = top_words_idx[:num_top_words]
#      top_words = mallet_vocab[top_words_idx]
#      top_words_shares = word_topic[top_words_idx, t]
#      for i, (word, share) in enumerate(zip(top_words, top_words_shares)):
#          plt.text(0.3, num_top_words-i-0.5, word, fontsize=fontsize_base*share)

# plt.tight_layout()
    
# # Number of word types associated with each topic
# np.sum(word_topic > 0, axis=0)
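
To make the word-topic step concrete, the sketch below parses two invented lines in the layout the cell above assumes (an index, a word, then `topic:count` pairs separated by spaces) and normalizes each topic column so that it sums to 1.

```python
import numpy as np

# Invented lines in the assumed word-topic counts layout
sample_lines = ["0 novel 0:3 1:1\n", "1 data 1:3\n"]
num_topics = 2  # must match the number of topics in the sample

vocab, rows = [], []
for line in sample_lines:
    _, word, *pairs = line.rstrip().split(' ')
    counts = np.zeros(num_topics)
    for pair in pairs:
        topic, count = pair.split(':')
        counts[int(topic)] = int(count)
    vocab.append(word)
    rows.append(counts)

word_topic = np.array(rows)

# Divide each column by its total to turn counts into shares per topic
word_topic = word_topic / word_topic.sum(axis=0)
print(word_topic)
```

A topic that received no words at all would have a column total of zero and produce `nan` values, which is one more reason to keep `num_topics` in line with the trained model.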