# Using MALLET

DS 5001 Text as Data

## Getting MALLET on Your System

[Download MALLET](https://github.com/mimno/Mallet/zipball/master) as a zip archive file.

For other options, check out the [home page](https://mimno.github.io/Mallet/).

Once you unzip the archive, you will find an executable in the `bin` directory of the folder it creates.

You can now run MALLET in a variety of ways:

- You can run the `mallet` binary file by its full path.\
  `/home/rca2t/Documents/MSDS/DS5001/mimno-Mallet-5fbf800/bin/mallet`
- You can create a symbolic link to the binary somewhere convenient on your system and call the link.\
  e.g. `ln -s mimno-Mallet-5fbf800/bin/mallet mallet`
- You can run `ant` to build and install it on your system.\
  See the Appendix below for information on how to compile [MALLET](https://mimno.github.io/Mallet/topics) if you are so inclined.

For MALLET to run properly:

- Have the Java JDK and Ant installed.
  - `module load java/8`
  - `module load ant`
- Put `JAVA_HOME` and `ANT_HOME` in your environment.
  - `export JAVA_HOME="<location_of_java_root_dir>"`
  - `export ANT_HOME="<location_of_ant_binary>"`
- Download and unarchive MALLET in a suitable directory.
- Put `MALLET_HOME` in your environment.
  - `export MALLET_HOME="<location_of_mallet_root_dir>"`
- Run `ant` in the MALLET directory.
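
The checklist above can be verified from Python before going further. The following is a minimal sketch, not part of MALLET: the helper name `check_mallet_env` is hypothetical, and it only reports whether the three environment variables are set and whether `java` and `ant` are on your `PATH`.

```python
import os
import shutil

def check_mallet_env(env=None):
    """Report which variables and tools from the checklist are visible."""
    env = os.environ if env is None else env
    report = {}
    # Environment variables named in the setup steps
    for var in ("JAVA_HOME", "ANT_HOME", "MALLET_HOME"):
        report[var] = env.get(var)  # None means the variable is not set
    # Executables expected on PATH
    for exe in ("java", "ant"):
        report[exe] = shutil.which(exe)  # None means not found on PATH
    return report

for name, value in check_mallet_env().items():
    print(f"{name:12s} {value if value else 'NOT FOUND'}")
```

Any line that prints `NOT FOUND` points to a step in the list that still needs attention.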

## Set Up

### Configure

In [62]:
import configparser
config = configparser.ConfigParser()
config.read("../../../env.ini")
data_home = config['DEFAULT']['data_home']
output_dir = config['DEFAULT']['output_dir']
local_lib = config['DEFAULT']['local_lib']
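
For reference, here is a sketch of what `env.ini` might contain. Only the key names (`data_home`, `output_dir`, `local_lib`, and `mallet_binary`, which is read later in this notebook) come from the notebook itself; the paths are placeholders, and the name `example_config` is used so the real `config` object is not overwritten.

```python
import configparser

# Hypothetical env.ini contents; only the key names come from this notebook.
example_ini = """
[DEFAULT]
data_home = /path/to/data
output_dir = /path/to/output
local_lib = /path/to/lib
mallet_binary = /path/to/mallet/bin/mallet
"""

example_config = configparser.ConfigParser()
example_config.read_string(example_ini)

# Confirm that every key this notebook reads is present
required = ["data_home", "output_dir", "local_lib", "mallet_binary"]
missing = [key for key in required if key not in example_config["DEFAULT"]]
print("missing keys:", missing)
```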

In [63]:
data_prefix = 'novels'

In [64]:
mallet_data = f"{output_dir}/mallet-demo"
!mkdir -p {mallet_data}


In [65]:
OHCO = ['book_id', 'chap_id', 'para_num', 'sent_num', 'token_num']

In [66]:
max_words = 10000

# For MALLET
num_topics = 20
num_iters = 1000
show_interval = 100

### Import

In [67]:
import pandas as pd
# import numpy as np # Not needed today

## Prepare

### Import a CORPUS

In [68]:
LIB = pd.read_csv(f"{data_home}/{data_prefix}/{data_prefix}-LIB.csv").set_index(OHCO[:1])

In [69]:
CORPUS = pd.read_csv(f"{data_home}/{data_prefix}/{data_prefix}-CORPUS.csv").set_index(OHCO)

In [70]:
CORPUS.head()

| book_id | chap_id | para_num | sent_num | token_num | pos | term_str |
|---|---|---|---|---|---|---|
| secretadversary | 1 | 0 | 1 | 0 | DT | the |
| secretadversary | 1 | 0 | 1 | 1 | NNP | young |
| secretadversary | 1 | 0 | 1 | 2 | NNP | adventurers |
| secretadversary | 1 | 0 | 1 | 3 | NNP | ltd |
| secretadversary | 1 | 1 | 0 | 0 | JJ | tommy |


### Create DOC

In [71]:
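def gather_docs(CORPUS, ohco_level, term_col='term_str'):
    """Join the tokens of each document, defined by the first ohco_level index levels, into one string."""
    OHCO = CORPUS.index.names
    # Cast to str so missing or numeric tokens do not break the join
    CORPUS[term_col] = CORPUS[term_col].astype('str')
    DOC = CORPUS.groupby(OHCO[:ohco_level])[term_col].apply(lambda x:' '.join(x)).to_frame('doc_str')
    return DOC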
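
To see what `gather_docs` does, here is a small sketch on made-up data. The toy table and the name `toy_docs` are illustrative assumptions; the function is repeated (with `list()` around the level names for safety) so the block runs on its own.

```python
import pandas as pd

def gather_docs(CORPUS, ohco_level, term_col='term_str'):
    OHCO = CORPUS.index.names
    CORPUS[term_col] = CORPUS[term_col].astype('str')
    # Group by the first ohco_level index levels and join tokens with spaces
    DOC = (CORPUS.groupby(list(OHCO[:ohco_level]))[term_col]
           .apply(lambda x: ' '.join(x))
           .to_frame('doc_str'))
    return DOC

# A toy token table with a three-level index (book, chapter, token)
toy = pd.DataFrame(
    {
        "book_id": ["a", "a", "a", "a", "b", "b"],
        "chap_id": [1, 1, 2, 2, 1, 1],
        "token_num": [0, 1, 0, 1, 0, 1],
        "term_str": ["the", "young", "tommy", "said", "a", "scandal"],
    }
).set_index(["book_id", "chap_id", "token_num"])

# Level 2 means one document per (book_id, chap_id) pair
toy_docs = gather_docs(toy, 2)
print(toy_docs)
```

With `ohco_level=2` each chapter becomes one row; passing `1` would instead produce one row per book.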

In [72]:
DOC = gather_docs(CORPUS, 2)

In [73]:
DOC['n_tokens'] = DOC.doc_str.apply(lambda x: len(x.split()))

In [74]:
DOC

| book_id | chap_id | doc_str | n_tokens |
|---|---|---|---|
| adventures | 1 | a scandal in bohemia i to sherlock holmes she ... | 8608 |
| adventures | 2 | the red headed league i had called upon my fri... | 9197 |
| adventures | 3 | a case of identity my dear fellow said sherloc... | 7020 |
| adventures | 4 | the boscombe valley mystery we were seated at ... | 9685 |
| adventures | 5 | the five orange pips when i glance over my not... | 7365 |
| ... | ... | ... | ... |
| udolpho | 54 | vi unnatural deeds do breed unnatural troubles... | 5605 |
| udolpho | 55 | vii but in these cases we still have judgment ... | 4164 |
| udolpho | 56 | viii then fresh tears stood on her cheek as do... | 2522 |
| udolpho | 57 | ix now my task is smoothly done i can fly or i... | 977 |


### Dump corpus to CSV file

In [75]:
mallet_corpus = DOC.join(LIB)[['doc_str','author_id']]
mallet_corpus.columns = 'doc_content doc_label'.split()
mallet_corpus[['doc_label','doc_content']].to_csv(f'{mallet_data}/novels-corpus.csv', index=False)

In [82]:
mallet_corpus

| book_id | chap_id | doc_content | doc_label |
|---|---|---|---|
| adventures | 1 | a scandal in bohemia i to sherlock holmes she ... | doyle |
| adventures | 2 | the red headed league i had called upon my fri... | doyle |
| adventures | 3 | a case of identity my dear fellow said sherloc... | doyle |
| adventures | 4 | the boscombe valley mystery we were seated at ... | doyle |
| adventures | 5 | the five orange pips when i glance over my not... | doyle |
| ... | ... | ... | ... |
| udolpho | 54 | vi unnatural deeds do breed unnatural troubles... | radcliffe |
| udolpho | 55 | vii but in these cases we still have judgment ... | radcliffe |
| udolpho | 56 | viii then fresh tears stood on her cheek as do... | radcliffe |
| udolpho | 57 | ix now my task is smoothly done i can fly or i... | radcliffe |


In [84]:
# !more {mallet_data}/novels-corpus.csv

## MALLET Time

### Show MALLET options

In [77]:
# mallet_home = "/home/rca2t/Documents/MSDS/DS5001/mimno-Mallet-5fbf800/bin"
mallet = config['DEFAULT']['mallet_binary']
mallet

'/home/rca2t/opt/mallet/bin/mallet'

In [78]:
# ! {mallet_home}/mallet 

### Import corpus

This converts the `.csv` file we just created into a `.mallet` file, a format that MALLET can process more efficiently.

Here we pass all of our options as command-line arguments.

Note that there is also an option to put these arguments into a file and pass the file as an argument.
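
Outside a notebook, where the `!` shell escape is not available, the same call can be assembled as an argument list. This is a minimal sketch: the helper `mallet_command` and the binary path are hypothetical, while the command and option names are the ones used in this notebook.

```python
import shlex

def mallet_command(binary, command, **options):
    """Build an argument list like: mallet <command> --option value ..."""
    args = [binary, command]
    for name, value in options.items():
        # Python keyword names use underscores; MALLET option names use hyphens
        args += [f"--{name.replace('_', '-')}", str(value)]
    return args

cmd = mallet_command(
    "/path/to/mallet/bin/mallet",  # hypothetical path to the binary
    "import-file",
    input="novels-corpus.csv",
    output="novels-corpus.mallet",
    keep_sequence="TRUE",
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when paths contain spaces.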

In [79]:
!{mallet} import-file \
--input {mallet_data}/novels-corpus.csv \
--output {mallet_data}/novels-corpus.mallet \
--keep-sequence TRUE

### Train topics

Now we generate our model with a set of parameters.

- `num-topics`: The number of topics to generate.
- `num-iterations`: The number of sampling iterations to run.
- `output-doc-topics`: The file to dump our doc-topic data (THETA).
- `output-topic-keys`: The file to dump our topic keys.
- `word-topic-counts-file`: The file to dump our word-topic counts (raw PHI).
- `topic-word-weights-file`: The file to dump our topic-word weights (PHI).
- `xml-topic-report`: The XML file to dump a report.
- `xml-topic-phrase-report`: The XML file to dump the phrases associated with topics.
- `show-topics-interval`: The iteration interval at which to show results during model fitting.
- `use-symmetric-alpha`: Whether or not to use the same alpha value for each topic.
- `optimize-interval`: The number of iterations between hyperparameter optimization steps.
- `diagnostics-file`: The XML file to dump some diagnostic information.

In [80]:
!{mallet} train-topics \
--input {mallet_data}/novels-corpus.mallet \
--num-topics {num_topics} \
--num-iterations {num_iters} \
--output-doc-topics {mallet_data}/novels-doc-topics.txt \
--output-topic-keys {mallet_data}/novels-topic-keys.txt \
--word-topic-counts-file {mallet_data}/novels-word-topic-counts-file.txt \
--topic-word-weights-file {mallet_data}/novels-topic-word-weights-file.txt \
--xml-topic-report {mallet_data}/novels-topic-report.xml \
--xml-topic-phrase-report {mallet_data}/novels-topic-phrase-report.xml \
--show-topics-interval {show_interval} \
--use-symmetric-alpha false  \
--optimize-interval 100 \
--diagnostics-file {mallet_data}/novels-diagnostics.xml

Mallet LDA: 20 topics, 5 topic bits, 11111 topic mask
Data loaded.
max tokens: 15717
total tokens: 1164070
<10> LL/token: -9.09352
<20> LL/token: -8.60192
<30> LL/token: -8.42725
<40> LL/token: -8.32408
<50> LL/token: -8.24761
<60> LL/token: -8.18959
<70> LL/token: -8.14445
<80> LL/token: -8.11114
<90> LL/token: -8.08777

0	0.25	the and that for with sergeant miss franklin out when this had rachel which have diamond betteredge what was him 
1	0.25	the was and but been were had not would two who one there room very this that about from small 
2	0.25	the and was were they their which for had could more not with all them only that from even first 
3	0.25	the and his was had there with him down they that said out but for all were could upon sir 
4	0.25	her she had and was for herself could with not all more would when might from moment but left very 
5	0.25	the and that had from was its into among these with one over when air all this mind dark mountains 
6	0.25	the not manfred said thou t
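
The `LL/token` lines in the log give the log-likelihood per token at each reported iteration, and in the run above the value rises steadily. If you capture the log as text, a sketch like the following can extract those values for plotting. The regular expression is an assumption based only on the line format shown above, and the sample text is copied from that output.

```python
import re

# A few lines copied from the training log above
log_text = """\
<10> LL/token: -9.09352
<20> LL/token: -8.60192
<30> LL/token: -8.42725
"""

# Assumed line format: <iteration> LL/token: value
LL_PATTERN = re.compile(r"^<(\d+)> LL/token: (-?\d+(?:\.\d+)?)", re.MULTILINE)

def parse_ll(text):
    """Return (iteration, log-likelihood per token) pairs found in the log."""
    return [(int(i), float(v)) for i, v in LL_PATTERN.findall(text)]

ll = parse_ll(log_text)
# True when every reported value is higher than the one before it
is_improving = all(b[1] > a[1] for a, b in zip(ll, ll[1:]))
print(ll, is_improving)
```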

## Inspect Results

In [81]:
!ls -l {mallet_data}

total 30121
-rw------- 1 rca2t users  7960141 Mar 20 14:24 novels-corpus.csv
-rw------- 1 rca2t users  4998651 Mar 20 14:24 novels-corpus.mallet
-rw------- 1 rca2t users    92188 Mar 20 14:26 novels-diagnostics.xml
-rw------- 1 rca2t users   138623 Mar 20 14:26 novels-doc-topics.txt
-rw------- 1 rca2t users     2368 Mar 20 14:26 novels-topic-keys.txt
-rw------- 1 rca2t users    58616 Mar 20 14:26 novels-topic-phrase-report.xml
-rw------- 1 rca2t users    19999 Mar 20 14:26 novels-topic-report.xml
-rw------- 1 rca2t users 16896437 Mar 20 14:26 novels-topic-word-weights-file.txt
-rw------- 1 rca2t users   640529 Mar 20 14:26 novels-word-topic-counts-file.txt
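
As a starting point for working with these files, the sketch below loads topic keys into a pandas data frame. It assumes that `novels-topic-keys.txt` uses the same three tab-separated fields that appear in the training log (topic id, alpha, top terms); the function name `read_topic_keys` and the column names are illustrative, and the sample rows are shortened from the log above.

```python
import csv
import io
import pandas as pd

# Two shortened rows in the assumed format: id <TAB> alpha <TAB> terms
sample_keys = (
    "0\t0.25\tthe and that for with sergeant miss franklin\n"
    "1\t0.25\tthe was and but been were had not\n"
)

def read_topic_keys(path_or_buffer):
    """Read a topic keys file into a data frame indexed by topic id."""
    topics = pd.read_csv(
        path_or_buffer,
        sep="\t",
        header=None,
        names=["topic_id", "alpha", "top_terms"],
        index_col="topic_id",
        quoting=csv.QUOTE_NONE,  # treat quote characters in terms literally
    )
    topics["top_terms"] = topics["top_terms"].str.strip()
    return topics

TOPICS = read_topic_keys(io.StringIO(sample_keys))
print(TOPICS)
```

To read the real file, pass its path instead of the `StringIO` buffer, for example `read_topic_keys(f"{mallet_data}/novels-topic-keys.txt")`.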


## Appendix

### Mallet

Website: https://mimno.github.io/Mallet/

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

MALLET includes sophisticated tools for document classification: efficient routines for converting text to "features", a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.

In addition to classification, MALLET includes tools for sequence tagging for applications such as named-entity extraction from text. Algorithms include Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields. These methods are implemented in an extensible system for finite state transducers.

Topic models are useful for analyzing large collections of unlabeled text. The MALLET topic modeling toolkit contains efficient, sampling-based implementations of Latent Dirichlet Allocation, Pachinko Allocation, and Hierarchical LDA.

Many of the algorithms in MALLET depend on numerical optimization. MALLET includes an efficient implementation of Limited Memory BFGS, among many other optimization methods.

In addition to sophisticated Machine Learning applications, MALLET includes routines for transforming text documents into numerical representations that can then be processed efficiently. This process is implemented through a flexible system of "pipes", which handle distinct tasks such as tokenizing strings, removing stopwords, and converting sequences into count vectors.

An add-on package to MALLET, called GRMM, contains support for inference in general graphical models, and training of CRFs with arbitrary graphical structure.

### Installation

To build a Mallet 2.0 development release, you must have the Apache ant build tool installed. From the command prompt, first change to the mallet directory, and then type
`ant`

If `ant` finishes with `"BUILD SUCCESSFUL"`, Mallet is now ready to use.

If you would like to deploy Mallet as part of a larger application, it is helpful to create a single ".jar" file that contains all of the compiled code. Once you have compiled the individual Mallet class files, use the command:
`ant jar`

This process will create a file "mallet.jar" in the "dist" directory within Mallet.

### Usage

Once you have installed Mallet, you can run it with the following command:

```bash
bin/mallet [command] --option value --option value ...
```
Type `bin/mallet` to get a list of commands, and use the option `--help` with any command to get a description of valid options.

For details about the commands please visit the API documentation and website at: https://mimno.github.io/Mallet/


### List of Algorithms

* Topic Modelling
  * LDA
  * Parallel LDA
  * DMR LDA
  * Hierarchical LDA
  * Labeled LDA
  * Polylingual Topic Model
  * Hierarchical Pachinko Allocation Model (PAM)
  * Weighted Topic Model
  * LDA with integrated phrase discovery
  * Word Embeddings (word2vec) using skip-gram with negative sampling
* Classification
  * AdaBoost
  * Bagging
  * Winnow
  * C45 Decision Tree
  * Ensemble Trainer
  * Maximum Entropy Classifier (Multinomial Logistic Regression)
  * Naive Bayes
  * Rank Maximum Entropy Classifier
  * Posterior Regularization Auxiliary Model
* Clustering
  * Greedy Agglomerative
  * Hill Climbing
  * K-Means
  * K-Best
* Sequence Prediction Models
  * Conditional Random Fields
  * Maximum Entropy Markov Models
  * Hidden Markov Models
  * Semi-Supervised Sequence Prediction Models
* Linear Regression