## Analyzing the Unabomber Manifesto
### CSC630: Rudd Fawcett


Below are notes on the development of my text project. The website [is live here](http://csc630.rudd.io/projects/unabomber-manifesto/).

This project produced a single text-based visualization: a strip-style map of the fifty most frequent words in the manifesto. The goals of this project were twofold:

1. To develop a beautiful experience and website that accurately and interactively visualized the distribution of text.
2. To develop a firm grasp and understanding of basic text analysis through the nltk Python library.

Having just finished The Discovery Channel's documentary on the Unabomber, I was looking forward to starting this project.

Overall, this project would not have been possible without the raw text of the manifesto as originally published in *The Washington Post*, which was [distributed online](http://www.washingtonpost.com/wp-srv/national/longterm/unabomber/manifesto.text.htm). The manifesto, which includes over thirty-five thousand words, is broken up into sections and then paragraphs, in the style of a mid-1970s doctoral thesis paper.

The visualization for this project is rather unconventional, and frankly something that I haven't seen anywhere else (so I may have invented something new?!). It maps the book horizontally, broken up and marked by sections and chapters, and then plots the fifty most frequent words across this horizontal and vertical plane. I would encourage you to check it out through the link above for a better idea of what I'm talking about.

### Preparing Data for Use

Preparing data for this project proved to be a huge pain and took up a lot of time. Originally, I analyzed the text by breaking it up by section > paragraph > sentence. This was interesting for looking at pieces of the whole, but when I put the parts back together, none of it lined up (the text indices, for example). I therefore dropped that type of analysis and ended up taking the full text head on.

While analyzing the data, I had three different scripts: two that generated the indices of paragraphs and sections based on regular expressions, and another that tabulated the most frequent words throughout the manifesto. Using the three output files from these scripts, you can start to stitch together various layers.
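
As a rough idea of what stitching layers together can look like, here is a minimal sketch that assigns each word occurrence to the section whose character span contains it. The rows are made-up values in the shape of `section_list.csv` and `word_list.csv`, and `locate` is a hypothetical helper, not part of the scripts in this write-up. It assumes the spans are sorted by `index_left`.

```python
from bisect import bisect_right

# Made-up rows in the shape of section_list.csv and word_list.csv.
sections = [
    {"section": "INTRODUCTION", "index_left": 40, "index_right": 900},
    {"section": "THE PSYCHOLOGY OF MODERN LEFTISM", "index_left": 935, "index_right": 2400},
]
words = [
    {"word": "society", "index": 120, "rank": 1},
    {"word": "power", "index": 1500, "rank": 2},
    {"word": "society", "index": 910, "rank": 1},  # falls between two section bodies
]

def locate(spans, index):
    """Return the span whose [index_left, index_right) contains index, else None."""
    starts = [span["index_left"] for span in spans]
    pos = bisect_right(starts, index) - 1
    if pos >= 0 and index < spans[pos]["index_right"]:
        return spans[pos]
    return None

# Attach a section name (or None) to every word occurrence.
for word in words:
    span = locate(sections, word["index"])
    word["section"] = span["section"] if span else None
```

The same lookup would work for paragraphs, since `paragraph_list.csv` uses the same `index_left` and `index_right` columns.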

### Generating Section Indices

In [None]:
import string

import numpy as np
import pandas as pd
import nltk
from nltk import RegexpTokenizer
from nltk.corpus import stopwords

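punctuation = string.punctuation + '“”’.,'
with open('manifesto.txt') as f:
    text = f.read()

# Headings are lines made up entirely of capitals, hyphens, curly quotes, and spaces.
ch_tokenizer = RegexpTokenizer(r'^(?:[A-Z-‘’]+ ?)+?(?=\n)')
# Pair each heading with its (start, end) character span in the text.
sections_by_title_indices = list(zip(ch_tokenizer.tokenize(text), ch_tokenizer.span_tokenize(text)))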
sections = []

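for i, section in enumerate(sections_by_title_indices):
    # [title, [start of the body (end of the heading), end of the body]]
    c = [section[0], [section[1][1], 0]]
    try:
        # The body ends where the next heading begins.
        c[1][1] = sections_by_title_indices[i+1][1][0]
    except IndexError:
        # The last heading has no successor, so its right index stays 0.
        pass
    sections.append(c)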

section_indices = []
for section in sections:
    # Skip the document title and the final notes section.
    if 'INDUSTRIAL SOCIETY AND ITS FUTURE' in section[0] or 'NOTES' in section[0]:
        continue
    section_indices.append({
        'section': section[0],
        'index_left': section[1][0],
        'index_right': section[1][1]
    })

df = pd.DataFrame(section_indices)
df.to_csv('section_list.csv', index=True)

The code above simply maps all of the headings and then uses them to figure out where each section begins and ends.
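
To make the idea concrete, here is a small sketch of the same heading-to-span mapping on a made-up sample. It uses the standard library's `re.finditer` as a stand-in for nltk's `RegexpTokenizer`, and a simplified heading pattern, so treat it as an illustration rather than the exact script. Unlike the code above, it closes the last section at the end of the text instead of leaving its right index at 0.

```python
import re

# Made-up sample in the manifesto's layout: all-caps headings on their own lines.
sample = "INTRODUCTION\n1. First paragraph.\n\nTHE POWER PROCESS\n33. Another paragraph.\n"

# Simplified heading pattern: a whole line of capitals, spaces, and hyphens.
heading = re.compile(r"^[A-Z][A-Z -]*$", re.MULTILINE)
matches = list(heading.finditer(sample))

spans = []
for current, following in zip(matches, matches[1:] + [None]):
    left = current.end()  # the body starts where the heading ends
    right = following.start() if following else len(sample)
    spans.append((current.group(), left, right))

# Slicing the text with a span gives back that section's body.
title, left, right = spans[0]
body = sample[left:right].strip()
```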

### Generating Paragraph Indices

In [None]:
import string

import numpy as np
import pandas as pd

import nltk
from nltk import RegexpTokenizer

punctuation = string.punctuation + '“”’.,'
with open('manifesto.txt') as f:
    text = f.read()

# Numbered paragraphs start a line with digits followed by a period and whitespace.
pp_tokenizer = RegexpTokenizer(r'^(?:[0-9]+ ?)+?\.\s')
paragraphs_by_title_indices = list(zip(pp_tokenizer.tokenize(text), pp_tokenizer.span_tokenize(text)))
paragraphs = []

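for i, paragraph in enumerate(paragraphs_by_title_indices):
    # [number, [start of the body (end of the number), end of the body]]
    p = [paragraph[0], [paragraph[1][1], 0]]
    try:
        # The body ends where the next paragraph number begins.
        p[1][1] = paragraphs_by_title_indices[i+1][1][0]
    except IndexError:
        # The last numbered item has no successor, so its right index stays 0.
        pass
    paragraphs.append(p)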

paragraph_indices = []

for paragraph in paragraphs:
    entry = {
        'paragraph': paragraph[0].replace('. ', ''),
        'index_left': paragraph[1][0],
        'index_right': paragraph[1][1]
    }
    paragraph_indices.append(entry)

    # Stop after paragraph 232: later numbered items appear in the
    # "final notes" section, which I cut.
    if '232. ' in paragraph[0]:
        break

print(paragraph_indices)

df = pd.DataFrame(paragraph_indices)
df.to_csv('paragraph_list.csv', index=False)

Like the section code, the code above simply maps the indices of every numbered paragraph.
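
Since misaligned indices were what sank my first attempt, a quick consistency check on the output is worth having. The sketch below is a hypothetical helper (`check_spans` is not part of the scripts above) run on made-up rows in the shape of `paragraph_list.csv`. It flags spans with impossible bounds and spans that start inside an earlier one.

```python
def check_spans(rows, text_length):
    """Return a list of problems found in rows, which are expected in document order."""
    problems = []
    previous_right = 0
    for row in rows:
        left, right = row["index_left"], row["index_right"]
        if not 0 <= left <= right <= text_length:
            problems.append(f"{row['paragraph']}: bad bounds ({left}, {right})")
        elif left < previous_right:
            problems.append(f"{row['paragraph']}: overlaps the previous span")
        previous_right = max(previous_right, right)
    return problems

# Made-up rows; the second and third are deliberately broken.
rows = [
    {"paragraph": "1", "index_left": 10, "index_right": 200},
    {"paragraph": "2", "index_left": 204, "index_right": 0},    # right index never filled in
    {"paragraph": "3", "index_left": 150, "index_right": 300},  # starts inside paragraph 1
]
problems = check_spans(rows, text_length=1000)
```

An empty list means the spans are at least internally consistent. It does not prove they line up with the right text.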

### Generating Most Frequent Word Indices
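
The cell below reads `most_common_words.csv`, a table with `word`, `rank`, and `count` columns; the script that tabulated it is not reproduced in this write-up. As a rough sketch of how such a table could be built, the example below counts words with `collections.Counter`. The sample sentence and the tiny stopword set are assumptions for illustration; a real run would read `manifesto.txt` and would probably use a fuller stopword list such as nltk's.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the real text and stopword list.
sample = "The system needs power. Power, and the system, shape society."
stop_words = {"the", "and", "a", "of", "to"}

# Lowercase the text and keep runs of letters and apostrophes as tokens.
tokens = re.findall(r"[a-z']+", sample.lower())
counts = Counter(token for token in tokens if token not in stop_words)

# Rows in the word/rank/count shape that most_common_words.csv is read with.
rows = [
    {"word": word, "rank": rank, "count": count}
    for rank, (word, count) in enumerate(counts.most_common(50), start=1)
]
```

Writing `rows` out with `csv.DictWriter` or `pd.DataFrame(rows).to_csv(...)` would produce a file the cell below can read.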

In [None]:
import json
import string
import csv

import numpy as np
import pandas as pd
from collections import Counter

import nltk
from nltk import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.probability import FreqDist

punctuation = string.punctuation + '“”’.,'
text = open('manifesto.txt').read()

most_common_words = {}
for row in csv.DictReader(open('most_common_words.csv')):
    most_common_words[row['word']] = {
        'rank': row['rank'],
        'count': row['count']
    }

word_indices = []

# https://stackoverflow.com/a/14307628/6669540
def all_occurrences(haystack, needle):
    """Yield the start index of each non-overlapping occurrence of needle in haystack."""
    start = 0
    while True:
        start = haystack.find(needle, start)
        if start == -1:
            return
        yield start
        start += len(needle)

# Note: str.find matches substrings and is case-sensitive.
for common in most_common_words:
    for index in all_occurrences(text, common):
        word_indices.append({
            'word': common,
            'index': index,
            'rank': most_common_words[common]['rank']
        })

print(word_indices)

df = pd.DataFrame(word_indices)
df.to_csv('word_list.csv', index=False)
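
One caveat with the search above: `str.find` matches substrings and is case-sensitive, so a search for "power" also hits "powerful" and misses "Power". If whole-word positions are wanted instead, one option is a regular expression with word boundaries. The helper below is a hypothetical alternative, not what generated `word_list.csv`.

```python
import re

def whole_word_occurrences(text, word):
    """Yield the start index of each case-insensitive, whole-word match."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        yield match.start()

sample = "Power corrupts. The powerful seek power."

# Substring search, as in the cell above: finds "power" inside "powerful".
substring_hits = [i for i in range(len(sample)) if sample.startswith("power", i)]

# Whole-word search: skips "powerful" and picks up the capitalized "Power".
whole_word_hits = list(whole_word_occurrences(sample, "power"))
```

Which behavior is right depends on the visualization: whole-word matching changes the counts, so the ranks in `most_common_words.csv` would need to be tabulated the same way.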

## Final Thoughts
This proved to be a great project for me to work on. It pushed me both in my creative thinking (in trying to come up with a visualization) and in my text-analysis abilities, which admittedly are still rather limited. Overall, I am glad that I had the opportunity to work on such a project. Make sure to [check it out here](http://csc630.rudd.io/projects/unabomber-manifesto/).

## Citations and Attributions

- "The Unabomber Trial: The Manifesto." *The Washington Post* http://www.washingtonpost.com/wp-srv/national/longterm/unabomber/manifesto.text.htm
