# Text Processing and LDA

## Change Log
### v.1
- initial build

### v.2 
- only use `isalpha()` words in dictionary/model
- remove misspelled words by building difference of sets against `nltk.corpus.words.words()`

### v.3
- lots more notes added
- more work on investigating spell correction routines 

### v.4 
- re-ordered spelling routines
- spell correcting is now online and part of the pipeline using `enchant` / `pyenchant`
- I am making the assumption this code is pipelining a single school at a time
- removing stopwords now after spelling correction

### v.5
- added variables for constants `MIN_WORD_COUNT`
- now using `itertools.chain.from_iterable()` to flatten list of lists
- now using a `collections.defaultdict()` (hash) to do word counts 
- word counts (due to the above two optimizations) are now near instantaneous
- using `file_root` variable to propagate changing filenames throughout notebook
- replaced nested comprehensions with `itertools`
- added full parameters to LDA model
- add logging so we can see any warnings from gensim
- added random seed for debugging (not needed unless you want reproducible results)
- added `dictionary.filter_extremes()` (still need to find best settings)

### v.6
- removed all LSI / LSA now just doing LDA
- tuned LDA parameters for `passes=10` and `update_every=0` (batch mode) by default
- added standard `matplotlib` setup from our student notebooks
- added `remove_border()` from our student notebooks
- added histograms for word frequency counts
- added models for LDA from traditional bow as well as tf-idf
- added comments at the top of most cells to give some indication of what runtime takes
- added descriptive statistics on the dictionary

### v.7
- now using mysql database routines
- now properly closing filehandle on file routines
- create `read_review_file` function
- create `clean_review_file` function
- removed "memory friendly" routines which work with persistent disk

### v.8
- increased stopwords
- add `WordNet` Lemmatization
- removing all words `len(word) < 3` in `clean_reviews`
- added bigram support (`CELL 110` controls whether unigrams or bigrams are used)
- `CELL 130` and `CELL 230` will not work properly with bigrams in use so leave those commented out if using bigrams
- `CELL 200` will not produce any output if using bigrams
- labeled cells

### v.9
- `CELL 70` grab school names
- `CELL 80` now pass a stop list to `clean_reviews`
- `CELL 110` now always computing bigrams
- added some additional stopwords to `custom_stopset`
- `CELL 170` stitches unigram and bigram texts together for processing
- `CELL 175` we chose which text to process: unigrams, bigrams or a unigram+bigram text
- `CELL 210` now calling `corpus corpus_bow`
- `CELL 240` now calling `tfidf corpus_tfidf`
- added `CELL 265` choose a `corpus`
- removed `CELL 280` (code factored into `CELL 270`)
- I upgraded my own workstation to premium (free) versions of anaconda/accelerate, timings now reflect that
- only now showing times for cells that take a while
- moved `summary_statistics` into its own `CELL 25`
- moved getting `clean reviews` from `CELL 90` to new `CELL 94`
- moved descriptive statistics from `CELL 90` to new `CELL 96`
- moved saving `corpus` from `CELL 210` to `CELL 212`
- Added nicer headings 
- added information about installing distributed computing components
- LDA now running in distributed mode

### v.10
- added `CELL 1500` HDP model for experimentation
- added `MAX_NUM_REVIEWS` so you can reduce amount of reviews when testing
- added `feature` dict to control what features are performed on text
- added tokenize feature and step in `clean_reviews`, before this was being done when removing stopwords
- re-worked entire `clean_reviews` function around `feature` dict
- incorporated `feature` dict to control which cells are enabled
- added pos tagging for later analysis
- updated `CELL 1000` for exploratory analysis

### v.11
- added `CELL 95` to look at the text
- corrected error in `CELL 60` where the spell corrector was not initialized unless `analyze_spell_correct` was set
- now using `nltk.tokenize` for tokenization of sentences and words

### v.12
- added `CELL 85` to instantiate parallel processing
- added `parallel` to `feature` dict
- added decoration to `CELL 80` for parallel processing
- modified `CELL 80` so it used `data_to_clean` and `stopset_to_use` as a name space hack
- modified `CELL 90, 94` to use `data_to_clean` and `stopset_to_use`
- modified `CELL 90` to use parallel processing for `school_names` if set
- added `CELL 91` to merge parallel data back into single result for `school_names`
- added parallel routines to `CELL 94`
- `CELL 96` now renumbered to `CELL 98`
- `CELL 95` now renumbered to `CELL 97`
- added `CELL 96` to merge data returned from parallel workers for `texts`, `texts_uncorrected`, etc
- move `bigram_stoplist` creation from `CELL 90` to `CELL 93`
- `stopset_bigrams` is now just a set of bigrams of school names, nothing else
- `CELL 97` now using `features` dict to control output
- added `tfidf` to `features` dict to control corpus creation and dictionary transform
- added cell magic `%%time` to `CELLS 90, 94, 96, 270`
- added "way", "thing", and "lot" to `stopset_unigram`
- factored code into `get_review_data()` for `CELL 97`
- added `CELL 95` text diagnostics
- tracked down bug in `CELL 80` `clean_reviews`, now imputing "the" instead of skipping `None` reviews in order to preserve indices across texts
- `CELL 80` moved lowercase function after spell correct, this will allow pos tagging to work more effectively
- `CELL 80`, factored out removal of smallwords (`len < 3`) into its own `remove_smallwords` step which can be set in `feature` dict
- features in `features` dict are now listed in the order of processing
- starts Notes section to notebook
- added second pass of stopword removal after lemmatization in `CELL 80`
- HyperThreading tested, tested with 8 virtual cores

### v.13
- added feature `only_tagged`; when set, only parts of speech in `tag_set` are used
- added `tag_set` which contains part of speech tags we will use when feature `only_tagged` is set
- removed `only_nn` as deprecated by `only_tagged`
- modified `CELL 90, 94` to distribute `tag_set` to remote workers
- modified `CELL 80` to now use `tag_set`
- added "everybody", "everyone", "great", "excellent" and "part" to stoplist
- server setup on Amazon EC2, see Notes section for more info, entire notebook up and running fine
- added notes about installing Anaconda's premium distro as a recommendation for faster BLAS libraries
- `CELL 80` added conversion of POS tags from treebank/penn style to morphy style for passing into Lemmatizer (not being used yet)
- added info about starting the Pyro nameserver to Installation
- `clean_reviews()` is now renamed to `clean_data()` which is more reflective of what the function is doing

### v.14
- moved spellcheck routine to be above POS tagger
- `CELL 80` map penn pos tags to morphy pos tags when pos_tag enabled for lemmatization
- re-write `CELL 80` `clean_data` to be pos_tag aware on all routines, allowing lemmatization to use pos tags
- added to `custom_stoplist` 'hill','valley','den','alto','crest','wood','land'
- `CELL 1000` wrote `get_topic_data()`
- added `perplexity_search` feature parameter
- added perplexity grid search to `CELL 268`
- added "love", "amazing", and "okay" to `custom_stoplist`
- `CELL 97` is now `CELL 98`, `CELL 98` is now `CELL 99`
- new `CELL 97` created to save variables `reviews, school_names, texts, texts_uncorrected, texts_unlemmatized, texts_pos, lemma_dict`
- added `preprocess_save` to `feature` dict.  Saves pre-processed data in `CELL 97`
- added `preprocess_load` to `feature` dict.  Loads pre-processed data in `CELL 70` instead of raw db/file data
- skipping `CELLS 75, 80, 85, 90, 91, 94, 95 and 96` if `preprocess_load` set
- `CELL 269` added to review results of perplexity search
- added `preprocess_file` variable to use as filename for loading and saving preprocessed data

### v.15
- add import re to global imports and `CELL 85`
- stripping escape characters in `CELL 70` using a `re` compile pattern if `remove_html` is set
- added updated `mysql_ops` class in `CELL 30`
- modified `CELL 70` to use updated `mysql_ops` syntax
- identified bug in `gensim`, filed on [github](https://github.com/piskvorky/gensim/issues/144#issuecomment-29562894), proposed a solution, fixed


### v.16
- updated formatting and flow to `CELL 268`
- removed unigram and bigram frequency count histograms from `CELL 150, 151`
- merged `CELL 160` into `CELL 150` and `CELL 161` into `CELL 151`
- multiple additional comments added to cells
- `CELL 1500` (HDP model) removed
- `CELL 140` removed, as its functionality is already handled by `dictionary.filter_extremes()` in `CELL 180`
- Now saving and loading `nces_code` and `universal_id` as `reviews_indexes` in `CELL 70` and `CELL 97`
- Added routine to extract beta, gamma and log probabilities and save them to a file in `CELL 95`

### v.17
- code from `CELL 270` moved to new `CELL 292` to build `corpus_model`
- modified `CELL 270` to store `num_topics` as a global variable
- modified `CELL 290` to save model using `file_root`
- added `CELL 295` to save `corpus_model`
- added `CELL 310` to store `gamma` and `beta` using `file_root`
- `CELL 97` now stores files individually using `file_root` as a base
- modify `CELL 80` to work with situation where `pos_tag` is set but `only_tagged` is not set
- added `dict_corpus_save` feature to control saving of dictionary and corpus in `CELL 182` and `CELL 212`
- added `dict_corpus_load` feature to control loading of dictionary and corpus in `CELL 205`
- added `model_save` feature to control saving of model in `CELL 290`
- added `model_load` feature to control loading of model in `CELL 289`
- added global variable `num_topics` in `CELL 0` which is used for creation of filenames to be loaded/saved and LDA model
- added feature `beta_gamma_save` to save `beta` and `gamma` of model in `CELL 310`
- added feature `model_topics_save` to save `model_topics` in `CELL 302`
- added feature `corpus_model_save` to save `corpus_model` in `CELL 296`
- added feature `model_topics_load` to load `model_topics` in `CELL 300`
- added `CELL 300` to view model_topics
- added feature `corpus_model_load` to load `corpus_model` in `CELL 293`
- moved code from `CELL 292` to new `CELL 295` for viewing `corpus_model`
- added feature `texts_final_load` to load `texts_final` in `CELL 160`
- added feature `texts_final_save` to save `texts_final` in `CELL 178`

### v.18
- updated `CELL 70` to show filenames and sizes as they are loaded
- updated `CELL 97` to show filenames and sizes as they are loaded

### v.19
- `CELL 40` which dealt with reading from a file has been removed
- No longer using `preprocess_file` to save all texts to, instead saving each one separately
- Removed `CELL 120` and `CELL 130` which dealt with analyzing our spell correction
- Factored `CELL 150` and `CELL 151` into a function `frequent_tokens` in `CELL 150`
- `CELL 151` now displays frequency counts of unigrams, `CELL 152` displays frequency counts of bigrams
- features `preprocess_load` and `texts_final_load` are mutually exclusive; enabling both may cause errors
- added `CELL 1100` to look at how to start getting data ready to be handed off to the database
- numerous logic corrections in how features are handled and what CELLs should run
- added `id` to `reviews_indexes` in `CELL 70`

### v.20
- added `postdate` to `reviews_indexes`
- `CELL 1500` through `CELL 1580` added to output Word Frequency Counts on various slices and cuts of the data 
- `CELL 150` now using NLTK's FreqDist() function

### v.21
- `CELL 10` and `CELL 20` removed
- copious amounts of annotation added for final submission


### TODO/CONCERNS
- factor out pickle read/write functions
- `CELL 70` convert db functions to variables
- Many more notes to write
- Is the stopset ideal?


# Installation
It is recommended you install the premium version of Anaconda, Accelerate and IOPro from [Continuum](https://store.continuum.io/cshop/anaconda/).  

You will see that there is an "All Products are free for Academic Use" link in the upper right, where you can obtain
free licensing.  This will give you, among many other benefits, faster BLAS libraries.  You can verify your BLAS libraries here:

    import numpy as np
    np.__config__.show()

You should install `gensim`

    pip install gensim
You should install `nltk`
    
    pip install nltk
You will need to install specific modules from `nltk`.  They are `stopwords, punkt, wordnet, maxent_treebank_pos_tagger`. Fire up a `python` interpreter and run

    import nltk
    nltk.download()
You should install `pymysql`

    pip install pymysql

If you get any errors about `error: invalid command 'egg_info'` you just need to install `setuptools`, as our original 
anaconda distribution in class used a package called `distribute` but parts of that have been factored out to `setuptools`

    pip install --upgrade setuptools

You should install `Enchant` for spell check capability: http://www.abisource.com/projects/enchant/

I used [homebrew](http://brew.sh) to install it on OSX: `brew install enchant`

You should then install `pyenchant`

    pip install pyenchant

#### iPython Parallel Processing
iPython's parallel processing is handled through an architecture called [ipcluster](http://ipython.org/ipython-doc/dev/parallel/parallel_process.html).  We use this in our notebook to parallelize the text pre-processing pipeline.

First start an ipython `ipcluster` on your machine.  For example, for 4 cores:

    ipcluster start -n 4 

- many CPUs, such as Intel chips with HyperThreading, expose two virtual cores per physical core, so you can usually run, say, 8 virtual cores on a chip with 4 physical cores
- make sure `parallel` feature is set in `feature` dict
- make sure the decoration in `CELL 80` is uncommented `@dv.remote(block=True)`
 
You could also start the `ipcluster` right from within the iPython notebook using the "Cluster" tab.
 
#### Gensim Distributed Computing for LDA
[Gensim](http://radimrehurek.com/gensim/) leverages a distributed RMI-type architecture called [Pyro4](http://pythonhosted.org/Pyro4/) to distribute the workload to other cores or computers on the same broadcast domain.  Install `gensim[distributed]`:

    pip install gensim[distributed]

If `gensim` is already installed, use `pip install --upgrade gensim[distributed]` instead.  This installs Pyro4 as well.  Pyro4 is the distributed framework that `gensim` utilizes.

set environment variables (these must be set in the session you use to start the notebook from)
    
    export PYRO_SERIALIZERS_ACCEPTED=pickle
    export PYRO_SERIALIZER=pickle

start the `Pyro4` nameserver

    python -m Pyro4.naming -n 0.0.0.0 &
start workers (ex: on a four core system to start four workers)
    
    python -m gensim.models.lda_worker &
    python -m gensim.models.lda_worker &
    python -m gensim.models.lda_worker &
    python -m gensim.models.lda_worker &
start dispatcher (just runs on one node, can also be a worker node)

    python -m gensim.models.lda_dispatcher &
make sure the model has the `distributed=True` option set

that's it, you should see info about the distributed computing in the logs as you run the model

# Configuration

#### CELL 0
`CELL 0` contains a number of options which control the operation of the notebook.

The notebook can be navigated in an ad-hoc way, skipping cells you do not wish to use.  To make various routine tasks easier, as well as provide control over things like the text pipeline, a dictionary called `feature` is used.  This is self-documented in `CELL 0`.  An important thing to note is that many of the features could be enabled simultaneously but may not make sense or may give undesirable output.  There is little to no dependency checking. 
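
Since there is little dependency checking, a small guard run right after `CELL 0` could flag conflicting settings early.  The following is only a sketch: `check_features` is a hypothetical helper that does not exist in the notebook, it assumes `feature` is a plain dict of booleans, and the two rules it encodes are taken from the notes in this notebook (the second one, that `only_tagged` needs `pos_tag`, is our reading of those notes).

```python
def check_features(feature):
    """Return a list of human-readable problems found in a feature dict.

    Hypothetical helper: the notebook itself performs no such checking.
    """
    problems = []
    # The change log notes that these two load options must not be combined.
    if feature.get("preprocess_load") and feature.get("texts_final_load"):
        problems.append("preprocess_load and texts_final_load are mutually exclusive")
    # Assumption: only_tagged filters on part-of-speech tags, so it needs pos_tag.
    if feature.get("only_tagged") and not feature.get("pos_tag"):
        problems.append("only_tagged requires pos_tag")
    return problems


feature = {"pos_tag": True, "only_tagged": True,
           "preprocess_load": True, "texts_final_load": True}
for problem in check_features(feature):
    print("WARNING:", problem)
```

Returning a list instead of raising keeps the notebook usable in the ad-hoc style described above: you see the warnings and decide whether to continue.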

##### Used by the Database
`MAX_NUM_REVIEWS` can be set.  This will control how many reviews are ingested if using the database.  If set to `0` then all reviews returned by the function being used will be included.

`state` can be set; when retrieving from the database, that state's data will be retrieved.  If this is set to `all` or `ALL`, data for all states will be retrieved.


##### Used when loading or saving to files
`data_dir` can be set as the location to use when loading/saving texts, corpus, dictionaries, models, etc.

`file_root` can be set, this controls the naming of all the many files that are loaded and saved.  You need to be careful what this is set to if any of the "save" bits are set in the `feature` dict, as you could overwrite previously saved data if not careful.

##### Used by the text processing pipeline
`tag_set` can be set to control which parts of speech are included when `only_tagged` is used in conjunction with feature `pos_tag`.  The various parts of speech that can be set here are explained at [Penn Treebank Tags](http://bulba.sdsu.edu/jeanette/thesis/PennTags.html).

`custom_stopset` is a list of words which is removed from consideration in any model processing.  This is unioned with the common set of english stopwords found in NLTK's `nltk.corpus.stopwords.words('english')` to produce `stopset_unigram`.  This is the stoplist we use to remove words for any consideration in unigram processing.

`stopset_unigram` is a unioned set of the above mentioned `custom_stopset` and NLTK's `nltk.corpus.stopwords.words('english')`.
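
The union described above can be sketched in a few lines.  To keep the example runnable without downloading NLTK data, `english_stopwords` is a tiny stand-in for `nltk.corpus.stopwords.words('english')`, and `custom_stopset` holds only a handful of the custom words named in the change log; neither is the full list used in the notebook.

```python
# Stand-in for nltk.corpus.stopwords.words('english'); the real list is much longer.
english_stopwords = ["the", "a", "and", "is", "of", "to"]

# A few of the custom stopwords named in the change log.
custom_stopset = {"way", "thing", "lot", "great", "excellent", "love"}

# The union of both sets gives the stoplist applied to unigrams.
stopset_unigram = custom_stopset | set(english_stopwords)

tokens = ["the", "teacher", "is", "great", "and", "the", "campus", "is", "clean"]
kept = [t for t in tokens if t not in stopset_unigram]
print(kept)  # ['teacher', 'campus', 'clean']
```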

##### Used by LDA
`num_topics` sets the number of topics to be used when building the LDA model.

#### CELL 175

In `CELL 175` you choose which text will go on to be considered in LDA.  Normally this is `text_combined` which is a combination of `texts` and `texts_bigrams` produced in `CELL 170`
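
As a rough illustration of how a unigram+bigram text can be stitched together, the sketch below builds bigrams from adjacent tokens and appends them to each document.  The helper name `make_bigrams`, the `_` joiner, and the representation of `stopset_bigrams` as a set of token pairs are assumptions made for this example; the notebook's own cells may differ.

```python
def make_bigrams(tokens, stopset_bigrams=frozenset()):
    """Join adjacent tokens into bigrams, skipping pairs in stopset_bigrams."""
    pairs = zip(tokens, tokens[1:])
    return [f"{a}_{b}" for a, b in pairs if (a, b) not in stopset_bigrams]


texts = [["jazz", "band", "drama", "dept"], ["blood", "drive"]]
texts_bigrams = [make_bigrams(doc) for doc in texts]

# Each combined document holds its unigrams followed by its bigrams.
text_combined = [uni + bi for uni, bi in zip(texts, texts_bigrams)]
print(text_combined[1])  # ['blood', 'drive', 'blood_drive']
```

Passing a set of school-name pairs as `stopset_bigrams` would keep bigrams such as a school's name out of the combined text.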

#### CELL 180

In `CELL 180` you choose what tokens make it into the dictionary. By default we have this set to drop any tokens that appear in fewer than 5 documents.  We also drop any token that appears in more than 50% of all documents.  We also keep only the top 100000 most frequent tokens.  These can be set as desired.
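
In `gensim` these three thresholds correspond to the `no_below`, `no_above` and `keep_n` arguments of `dictionary.filter_extremes()`.  To make the rule itself concrete, here is a plain-Python sketch of the same filtering logic.  It is an illustration written for these notes, not gensim's implementation, and `filter_tokens` is a hypothetical name.

```python
from collections import Counter


def filter_tokens(texts, no_below=5, no_above=0.5, keep_n=100000):
    """Return the set of tokens that survive document-frequency filtering."""
    # Document frequency: the number of documents each token appears in.
    doc_freq = Counter(token for doc in texts for token in set(doc))
    max_docs = no_above * len(texts)
    survivors = [(tok, df) for tok, df in doc_freq.items()
                 if no_below <= df <= max_docs]
    # Keep only the keep_n survivors with the highest document frequency.
    survivors.sort(key=lambda pair: pair[1], reverse=True)
    return {tok for tok, _ in survivors[:keep_n]}


texts = [["school", "teacher"], ["school", "music"],
         ["school", "teacher"], ["school", "sport"]]
print(filter_tokens(texts, no_below=2, no_above=0.5))  # {'teacher'}
```

In the toy corpus, `school` is dropped because it appears in every document, while `music` and `sport` are dropped because each appears in only one.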

#### CELL 240

`CELL 240` will transform the corpus via `tfidf` if set in the `feature` dict.  LDA as published models raw term counts in the bag of words.  Converting those counts to real-valued weights, as `tfidf` does, may be beneficial, but it is not a valid transformation for traditional LDA.  This feature has been left in place from earlier experiments in this project, but nearly all of our work has used the plain Bag of Words without any transformation, including TF-IDF.
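
The difference between the two representations is easy to see on a toy corpus.  The sketch below uses one common TF-IDF weighting (count times `log2(N / df)`) chosen purely for illustration; a library's TF-IDF model may scale and normalize differently, so the exact numbers are not meant to match `gensim`.

```python
import math
from collections import Counter

texts = [["class", "class", "teacher"], ["class", "music"]]

# Bag of words: integer counts per document, which is what LDA models.
corpus_bow = [Counter(doc) for doc in texts]

# Document frequency of each token across the corpus.
doc_freq = Counter(token for doc in texts for token in set(doc))
n_docs = len(texts)

# TF-IDF: counts rescaled by log2(N / df), giving real-valued weights.
corpus_tfidf = [
    {tok: count * math.log2(n_docs / doc_freq[tok]) for tok, count in bow.items()}
    for bow in corpus_bow
]

print(dict(corpus_bow[0]))  # {'class': 2, 'teacher': 1}
print(corpus_tfidf[0])      # {'class': 0.0, 'teacher': 1.0}
```

After the transform the values are no longer counts: `class`, which occurs twice in the first document, ends up with a weight of zero because it appears in every document.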

#### CELL 268

`CELL 268` can be used to do a grid search over any variable you wish.  The current stable branch of `gensim` however has a bug, which has been fixed in the [pyro_threads](https://github.com/piskvorky/gensim/tree/pyro_threads) branch.  So it is recommended that you install this branch if you wish to use this feature.  Otherwise, the Pyro4 underlying framework will run out of threads.  If not doing a grid search then the current release version of gensim[distributed] will work fine.



# Notes
`CELL` numbers are used at the top of each cell just as a label so they can be tracked in changes as they are happening and discussed.

#### Texts
The following key texts are currently created in the pipeline, depending on what options are enabled.  These texts are all for diagnostic purposes; the only one actually being used by LDA is the final output `texts`.  The other texts are created at intermediate points in the pipeline and allow us to see how certain transformations are performing, such as spell correction.  The order of creation is below:

1. `reviews_indexes`
2. `reviews`
3. `school_names`
4. `texts_uncorrected` 
5. `texts_pos`
6. `texts_unlemmatized`
7. `lemma_dict`
8. `texts`

A summary of each text:

- `reviews_indexes` - this is a copy of the original gsid, nces_code, universal_id and postdate in a dataframe
- `reviews` - this is the data as ingested from the file or the database
- `school_names` - these are the school names pulled from the db AFTER they have been through pre-processing.  Since most words in school names are not in our tag set (NN or NNS for example), and many words in school names are stopped, this list is quite sparse.  It is used only for the creation of a bigram_stopset, so that any bigrams created involving school names are not used.
- `texts_uncorrected` - this is the text before spell correction
- `texts_pos` - this is the text after it has been tagged with parts of speech
- `texts_unlemmatized` - this is the text before lemmatization
- `lemma_dict` - This is a dictionary of a set of words.  It is used just as an analysis to see what words were converted to what lemmas.  
- `texts` - this is the final text which is used further on in the notebook and comprises the bulk of what is input into LDA

There are various filters and transformations being done on the text.  Here is a workflow of where the transforms and filters are done, in relation to where the texts are being created.  All texts are created in `CELL 80 clean_data()` function.

#### General Data Cleaning Pipeline:

    reviews and school_names created in CELL 70
        clean_data() called in CELL 80
            impute "the" if review == None
            remove HTML encodings and escape characters
            tokenize documents into sentences
            tokenize sentences into words                      * texts_uncorrected created
            spell correct words
            POS tag words                                      * texts_pos created
            remove punctuation
            remove small words (len < 3)
            remove non-alpha words
            lowercase all words
            remove stopwords                                   * texts_unlemmatized created
            lemmatize all words
            lemma_dict created
            remove small words (len < 3) (again)
            remove stopwords (again)                           * texts created
            

A server has been configured on [Amazon EC2](http://aws.amazon.com/ec2/instance-types/) for our final analysis.

This server is a Compute Optimized `c3.8xlarge` instance.  It has the following specifications:

- CPU: 64-bit
- vCPU: 32
- ECU: 108
- Memory: 60 GB
- Processor: Intel Xeon E5-2680

# Text Processing Pipeline
## Text Processing
Text processing was done using Python’s [Natural Language Toolkit (NLTK)](http://nltk.org) and [iPython clustering](http://ipython.org/ipython-doc/dev/parallel/).  [GreatSchools](http://www.greatschools.org) reviews were ingested from the database, scattered to multiple ipython worker nodes, processed and finally merged back together into one text.

Text processing followed the following workflow:

![Alt text](files/img_bf-notebook-0.png)

The premium version of [Continuum’s Anaconda](https://store.continuum.io/cshop/anaconda/), with [Accelerate](https://store.continuum.io/cshop/accelerate/), [IOPro](https://store.continuum.io/cshop/iopro/) and [MKL Optimizations](https://store.continuum.io/cshop/mkl-optimizations/) was used.  This distribution provided faster [BLAS](http://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) libraries and greatly optimized NumPy operations.

The data from GreatSchools was in a MySQL database and that was the method used for ingesting the data into the text pipeline.

## Sharding the reviews data

iPython’s parallel clustering capabilities were leveraged.  This allows each core to represent a worker node which can work on part of the overall task.  With a 32 core system you can have 32 worker nodes, theoretically allowing you to process data in 1/32 the time of using a single core.  iPython has a scatter function which will take a dataset and shard it to all the workers available.  This was used to take the complete set of GreatSchools data and shard it across all workers.  In our production system we used an [Amazon c3.8xlarge](http://aws.amazon.com/ec2/instance-types/#instance-details), which is a 32 core system with 60GB of memory.  We supplemented the physical memory with 20GB of “virtual memory” using a swapfile, thus giving each core approximately 28,000 reviews to work on with 2.5GB of memory each.
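
The scatter, process, and merge pattern can be sketched with the standard library alone.  This is a stand-in rather than the notebook's code: it uses a thread pool instead of an iPython cluster, `scatter` and `gather` are plain functions written for this example (not the iPython API), and `clean` is a placeholder for the real cleaning function.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter(items, n_workers):
    """Split items into n_workers contiguous shards of near-equal size."""
    size, extra = divmod(len(items), n_workers)
    shards, start = [], 0
    for i in range(n_workers):
        end = start + size + (1 if i < extra else 0)
        shards.append(items[start:end])
        start = end
    return shards


def gather(shards):
    """Merge shards back into one list, preserving the original order."""
    return [item for shard in shards for item in shard]


def clean(shard):
    # Placeholder for the real cleaning step.  Missing reviews are imputed
    # as "the" so that document indices stay aligned across texts.
    return [(review or "the").lower().split() for review in shard]


reviews = ["Great teachers", None, "Active music dept", "Large classes", "Good sports"]
with ThreadPoolExecutor(max_workers=2) as pool:
    texts = gather(pool.map(clean, scatter(reviews, 2)))

print(len(texts) == len(reviews))  # True
```

Because the shards are contiguous and merged in order, document `i` in the merged text still corresponds to review `i` in the input.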

![Alt text](files/img_bf-notebook-2.png)

## Order of text processing pipeline

During the course of working on the project, the order of operations being performed in the text pipeline changed.  There are many ways in which the operations can be combined, and most orderings involve a trade-off between performance and extracting the cleanest text.  For example, it is important that punctuation remain in place for part of speech (pos) tagging to perform well.  It is also important for words to be spelled correctly for a pos tagger to work correctly.  Similarly, it is difficult for a pos tagger to identify all Proper Nouns if you lowercase all text beforehand.  The final order of operations was chosen as the best for producing meaningful and clean nouns, which would be used as inputs to a topic model built using LDA.

![Alt text](files/img_bf-notebook-3.png)

The text contains various [escape character](http://en.wikipedia.org/wiki/Escape_character) sequences and HTML encodings that needed to be removed, as not all the functions being used in text processing could handle them.  Basic string `replace()` methods and `re.sub()` methods were used to handle this.
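
A minimal sketch of this kind of cleanup is shown below.  The patterns, the helper name `strip_encodings`, and the use of `html.unescape` are assumptions made for illustration; the notebook's own patterns are not reproduced here.

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^>]+>")      # HTML tags such as <br>
QUOTE_ESCAPE = re.compile(r"\\(['\"])")   # \' or \" stored as literal text


def strip_encodings(text):
    """Remove HTML tags, decode HTML entities, and drop escape sequences."""
    text = TAG_PATTERN.sub(" ", text)
    text = html.unescape(text)
    # Keep the quote character, drop the backslash in front of it.
    text = QUOTE_ESCAPE.sub(r"\1", text)
    # Replace literal and real line/tab escapes with a space.
    for seq in ("\\r", "\\n", "\\t", "\r", "\n", "\t"):
        text = text.replace(seq, " ")
    return text


raw = "GPA\\'s over 4.0 &amp; rising<br>"
print(strip_encodings(raw).strip())  # GPA's over 4.0 & rising
```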

The text was next tokenized using the [NLTK sentence and word tokenizers](http://nltk.org/api/nltk.tokenize.html).

After tokenization, the text was spell corrected using the [Enchant spell correction library](http://www.abisource.com/projects/enchant/) from [Abiword](http://www.abisource.com) and its python helper library [pyenchant](http://pyenchant.readthedocs.org/en/latest/).  There were numerous spelling errors, which would have fragmented the analysis (word counts, parts of speech tagging, topic creation, etc.), so it was decided to spell correct.  This was one of the most expensive operations in the text processing.  Each word was compared to an English dictionary, and if no match was found, the most likely candidate was selected as long as it was within an edit distance of 2.
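
The selection rule can be sketched without Enchant installed.  In `pyenchant` the dictionary lookup and the candidate list would come from `Dict.check()` and `Dict.suggest()`; here they are replaced by a small set of known words and a hand-written suggestion list, and treating the first close-enough suggestion as "most likely" is an assumption of this sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def correct(word, known_words, suggestions, max_distance=2):
    """Return word unchanged if known, else the first close-enough suggestion."""
    if word in known_words:
        return word
    for candidate in suggestions:
        if edit_distance(word, candidate) <= max_distance:
            return candidate
    return word


known = {"drama", "extraordinary"}
print(correct("extraodinary", known, ["extraordinary", "extraordinaire"]))
print(edit_distance("Alameda", "Alarmed"))  # 2
```

The second line of output shows a side effect of this rule: a proper noun that is missing from the dictionary can sit within distance 2 of an unrelated word, which is consistent with "Alameda" appearing as "Alarmed" in the example review later in this section.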

Part of speech tagging was performed using [NLTK’s Treebank maximum entropy classifier](http://nltk.org/api/nltk.classify.html) based pos tagger.  This is a slow but accurate pos tagger that was originally trained on the [Treebank](http://en.wikipedia.org/wiki/Treebank) corpus.  Ideally, we would have used a classifier that was trained on GreatSchools reviews or reviews data in general, but we felt the accuracy was good enough for our work.  We were looking at using only nouns and plural nouns in our topic model, and the default classification for almost all pos taggers, including this one, is noun.  Most topic models are built using nouns as they are most representative of concrete topics.  Although adjectives, verbs and other parts of speech can add interesting context, they generally overpower the underlying nouns and can inject sentiment.  We were not using the words to classify sentiment, so only nouns were of interest and only nouns were passed further on in the pipeline.

Once pos tagging was performed we then removed punctuation and words that were less than 3 characters.  These words would add no value to our topic model, and removing them greatly reduces the amount of data we are working with.

All words were then lowercased using the normal python string `lower` method.

Next, stop words were removed.  The stop words started with a list of common English stop words from the [NLTK stopword corpus](http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus-module.html).  Additional words were manually added to create a `custom_stopset`, which we used to stop against all unigrams.
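
The filtering steps described in the last few paragraphs can be sketched as a single pass over the tagged tokens.  The function name `filter_tagged` and the tiny stoplist are assumptions for this example, and applying the tag filter first is our simplification; the tagged tokens are taken from the example review shown later in this section.

```python
import string

tag_set = {"NN", "NNS"}                  # parts of speech to keep
stopset_unigram = {"the", "lot", "way"}  # small stand-in stoplist


def filter_tagged(tagged_tokens, tag_set, stopset):
    """Apply the tag, punctuation, length, alpha, lowercase and stopword filters."""
    kept = []
    for word, tag in tagged_tokens:
        if tag not in tag_set:
            continue
        if word in string.punctuation or len(word) < 3 or not word.isalpha():
            continue
        word = word.lower()
        if word not in stopset:
            kept.append(word)
    return kept


tagged = [("There", "EX"), ("is", "VBZ"), ("a", "DT"), ("range", "NN"),
          ("of", "IN"), ("academic", "JJ"), ("levels", "NNS"),
          ("offered", "VBD"), ("AP", "NNP"), ("classes", "NNS"), (".", ".")]
print(filter_tagged(tagged, tag_set, stopset_unigram))
# ['range', 'levels', 'classes']
```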

Next, words were lemmatized using [NLTK’s WordNet Lemmatizer](http://nltk.org/_modules/nltk/stem/wordnet.html).  This was to reduce the dimension of data even further by aggregating inflected forms of the same word.  The lemmatizer uses the pos tags to determine the correct base form (lemma) of each word.  For example, the lemmatizer would take *teacher* and *teachers* and make sure both were just *teacher*.  This kept us from fragmenting topics.  This operation was expensive but we felt it improved the quality of data considerably.  Other techniques such as stemming were looked at, but lemmatization appeared to be superior for our purpose, as we wanted to make sure the words that were used were actual words humans would understand, and stemmers can aggressively mangle words.
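
The WordNet lemmatizer takes a one-letter part of speech (`n`, `v`, `a` or `r`) rather than a Penn Treebank tag, so the tags from the pos tagger have to be mapped first.  The sketch below shows one way to write that mapping; the function name `penn_to_morphy` and the choice to return `None` for unmapped tags are assumptions, not the notebook's code.

```python
# Penn Treebank tag prefixes and the WordNet part of speech they map to.
PENN_PREFIX_TO_WORDNET = (
    ("JJ", "a"),  # adjectives: JJ, JJR, JJS
    ("VB", "v"),  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    ("NN", "n"),  # nouns: NN, NNS, NNP, NNPS
    ("RB", "r"),  # adverbs: RB, RBR, RBS
)


def penn_to_morphy(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if unmapped."""
    for prefix, wordnet_pos in PENN_PREFIX_TO_WORDNET:
        if penn_tag.startswith(prefix):
            return wordnet_pos
    return None


for tag in ("NNS", "VBD", "JJ", "RB", "DT"):
    print(tag, "->", penn_to_morphy(tag))
```

With NLTK and its WordNet data installed, the result would be passed as the `pos` argument, for example `WordNetLemmatizer().lemmatize("teachers", pos=penn_to_morphy("NNS"))`.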

## Example of a review moving through the pipeline

`State California
Data from reviews for review number 4
GSID: 214510 NCES Code: 060177000041    Universal ID: 600001`

#### Original Review Text

`We just went to Open House at Alameda High, and found the teachers very dedicated.  They know their students and were ready to give individual feedback to parents.  There is a range of academic levels offered including some very challenging AP classes. Some classes are rather large, but the EXP and AP classes may be smaller. There is a range of students, from those who are not interested in studying, to those who are highly motivated and hard working, with GPA's over 4.0. There is a lot of student involvement in extracurricular activities, such as a poetry anthology they were selling to raise money for the English Dept., a book sale to benefit the media center, a blood drive sponsored by one of the clubs.  They have after school sports, an active music dept (the jazz band just performed at Yoshi's), extraodinary drama dept. that puts on several plays each year.`

#### Tokenized Text with encodings removed

`['We', 'just', 'went', 'to', 'Open', 'House', 'at', 'Alameda', 'High', ',', 'and', 'found', 'the', 'teachers', 'very', 'dedicated', '.', 'They', 'know', 'their', 'students', 'and', 'were', 'ready', 'to', 'give', 'individual', 'feedback', 'to', 'parents', '.', 'There', 'is', 'a', 'range', 'of', 'academic', 'levels', 'offered', 'including', 'some', 'very', 'challenging', 'AP', 'classes', '.', 'Some', 'classes', 'are', 'rather', 'large', ',', 'but', 'the', 'EXP', 'and', 'AP', 'classes', 'may', 'be', 'smaller', '.', 'There', 'is', 'a', 'range', 'of', 'students', ',', 'from', 'those', 'who', 'are', 'not', 'interested', 'in', 'studying', ',', 'to', 'those', 'who', 'are', 'highly', 'motivated', 'and', 'hard', 'working', ',', 'with', 'GPA\\', "'s", 'over', '4.0', '.', 'There', 'is', 'a', 'lot', 'of', 'student', 'involvement', 'in', 'extracurricular', 'activities', ',', 'such', 'as', 'a', 'poetry', 'anthology', 'they', 'were', 'selling', 'to', 'raise', 'money', 'for', 'the', 'English', 'Dept.', ',', 'a', 'book', 'sale', 'to', 'benefit', 'the', 'media', 'center', ',', 'a', 'blood', 'drive', 'sponsored', 'by', 'one', 'of', 'the', 'clubs', '.', 'They', 'have', 'after', 'school', 'sports', ',', 'an', 'active', 'music', 'dept', '(', 'the', 'jazz', 'band', 'just', 'performed', 'at', 'Yoshi\\', "'s", ')', ',', 'extraodinary', 'drama', 'dept', '.', 'that', 'puts', 'on', 'several', 'plays', 'each', 'year', '.']`

#### Text after spell correction and part of speech tagging

`[('We', 'PRP'), ('just', 'RB'), ('went', 'VBD'), ('to', 'TO'), ('Open', 'NNP'), ('House', 'NNP'), ('at', 'IN'), ('Alarmed', 'NNP'), ('High', 'NNP'), (',', ','), ('and', 'CC'), ('found', 'VBN'), ('the', 'DT'), ('teachers', 'NNS'), ('very', 'RB'), ('dedicated', 'VBN'), ('.', '.'), ('They', 'PRP'), ('know', 'VBP'), ('their', 'PRP$'), ('students', 'NNS'), ('and', 'CC'), ('were', 'VBD'), ('ready', 'RB'), ('to', 'TO'), ('give', 'VB'), ('individual', 'JJ'), ('feedback', 'NN'), ('to', 'TO'), ('parents', 'NNS'), ('.', '.'), ('There', 'EX'), ('is', 'VBZ'), ('a', 'DT'), ('range', 'NN'), ('of', 'IN'), ('academic', 'JJ'), ('levels', 'NNS'), ('offered', 'VBD'), ('including', 'VBG'), ('some', 'DT'), ('very', 'RB'), ('challenging', 'VBG'), ('AP', 'NNP'), ('classes', 'NNS'), ('.', '.'), ('Some', 'DT'), ('classes', 'NNS'), ('are', 'VBP'), ('rather', 'RB'), ('large', 'JJ'), (',', ','), ('but', 'CC'), ('the', 'DT'), ('EXP', 'NNP'), ('and', 'CC'), ('AP', 'NNP'), ('classes', 'NNS'), ('may', 'MD'), ('be', 'VB'), ('smaller', 'JJR'), ('.', '.'), ('There', 'EX'), ('is', 'VBZ'), ('a', 'DT'), ('range', 'NN'), ('of', 'IN'), ('students', 'NNS'), (',', ','), ('from', 'IN'), ('those', 'DT'), ('who', 'WP'), ('are', 'VBP'), ('not', 'RB'), ('interested', 'JJ'), ('in', 'IN'), ('studying', 'NN'), (',', ','), ('to', 'TO'), ('those', 'DT'), ('who', 'WP'), ('are', 'VBP'), ('highly', 'RB'), ('motivated', 'VBN'), ('and', 'CC'), ('hard', 'RB'), ('working', 'VBG'), (',', ','), ('with', 'IN'), ('GPA', 'NNP'), ('S', 'NNP'), ('over', 'IN'), ('4.0', 'CD'), ('.', '.'), ('There', 'EX'), ('is', 'VBZ'), ('a', 'DT'), ('lot', 'NN'), ('of', 'IN'), ('student', 'NN'), ('involvement', 'NN'), ('in', 'IN'), ('extracurricular', 'JJ'), ('activities', 'NNS'), (',', ','), ('such', 'JJ'), ('as', 'IN'), ('a', 'DT'), ('poetry', 'NN'), ('anthology', 'NN'), ('they', 'PRP'), ('were', 'VBD'), ('selling', 'VBG'), ('to', 'TO'), ('raise', 'VB'), ('money', 'NN'), ('for', 'IN'), ('the', 'DT'), ('English', 'NNP'), ('Dept', 'NNP'), (',', 
','), ('a', 'DT'), ('book', 'NN'), ('sale', 'NN'), ('to', 'TO'), ('benefit', 'VB'), ('the', 'DT'), ('media', 'NNS'), ('center', 'NN'), (',', ','), ('a', 'DT'), ('blood', 'NN'), ('drive', 'NN'), ('sponsored', 'VBD'), ('by', 'IN'), ('one', 'CD'), ('of', 'IN'), ('the', 'DT'), ('clubs', 'NNS'), ('.', '.'), ('They', 'PRP'), ('have', 'VBP'), ('after', 'IN'), ('school', 'NN'), ('sports', 'NNS'), (',', ','), ('an', 'DT'), ('active', 'JJ'), ('music', 'NN'), ('dept', 'NN'), ('(', ':'), ('the', 'DT'), ('jazz', 'NN'), ('band', 'NN'), ('just', 'RB'), ('performed', 'VBN'), ('at', 'IN'), ('Yoshi\\', 'NNP'), ('S', 'NNP'), (')', 'NNP'), (',', ','), ('extraordinary', 'JJ'), ('drama', 'NN'), ('dept', 'NN'), ('.', '.'), ('that', 'IN'), ('puts', 'NNS'), ('on', 'IN'), ('several', 'JJ'), ('plays', 'NNS'), ('each', 'DT'), ('year', 'NN'), ('.', '.')]`

#### Text after lowercase, removal of stopwords, removal of small words, removal of punctuation

`['feedback', 'range', 'levels', 'classes', 'classes', 'classes', 'range', 'studying', 'involvement', 'activities', 'poetry', 'anthology', 'money', 'book', 'sale', 'media', 'center', 'blood', 'drive', 'clubs', 'sports', 'music', 'dept', 'jazz', 'band', 'drama', 'dept', 'puts', 'plays']`

#### Text after lemmatization

`['feedback', 'range', 'level', 'class', 'class', 'class', 'range', 'studying', 'involvement', 'activity', 'poetry', 'anthology', 'money', 'book', 'sale', 'medium', 'center', 'blood', 'drive', 'club', 'sport', 'music', 'dept', 'jazz', 'band', 'drama', 'dept', 'put', 'play']`

The text above contains only nouns (POS tags NN and NNS).  This is the final output for this particular review, which is used in later processing: bigrams are added to it, and the result is fed into our LDA model.
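
To make the noun filter concrete, here is a minimal sketch of the `only_tagged` step on a handful of `(token, tag)` pairs copied from the tagged review above. The helper name `keep_tagged` is hypothetical and not part of the notebook; the real filtering happens inline in `clean_data()` in `CELL 80`.

```python
# A few (token, tag) pairs copied from the tagged review shown above.
tagged = [('the', 'DT'), ('teachers', 'NNS'), ('very', 'RB'), ('dedicated', 'VBN'),
          ('individual', 'JJ'), ('feedback', 'NN'), ('Open', 'NNP'), ('House', 'NNP')]

# Same tag set as in CELL 0: common nouns, singular and plural.
tag_set = ['NN', 'NNS']


def keep_tagged(tagged_tokens, tags):
    """Return only the tokens whose Penn Treebank tag is in `tags` (hypothetical helper)."""
    allowed = set(tags)
    return [token for token, tag in tagged_tokens if tag in allowed]


nouns = keep_tagged(tagged, tag_set)
print(nouns)
```

Note that `teachers` survives this filter but is dropped later, because it is in `custom_stopset`; proper nouns such as `Open` and `House` (NNP) never pass the filter at all.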

## Merging the reviews data

After each worker had processed its share of the data independently, the results were merged back together into a single text with the same number of documents as the original.  The difference is that the merged text had been processed and reduced, for the most part, to nouns.
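
The merge itself is a simple concatenation. The sketch below assumes that each worker returns its shard with documents in their original order and that the workers are iterated in shard order; the function name and the sample data are hypothetical.

```python
def merge_worker_results(results):
    """Concatenate per-worker lists of documents, preserving worker order."""
    merged = []
    for worker_texts in results:
        merged.extend(worker_texts)
    return merged


# Hypothetical output from two workers, each holding a contiguous shard.
worker_results = [
    [['feedback', 'range'], ['class', 'music']],  # worker 0: documents 0 and 1
    [['drama', 'play']],                          # worker 1: document 2
]

texts = merge_worker_results(worker_results)
print(len(texts))
```

Because empty reviews are replaced with a placeholder rather than skipped, the merged list keeps a 1:1 correspondence with the original review indexes.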


# Code

In [None]:
# CELL 0

%matplotlib inline

import string
import nltk
import itertools
from gensim import corpora, models, similarities
from collections import defaultdict
import sys
import logging
import random
import operator
import pymysql
import csv
import time
import numpy as np
import scipy as sp
import scipy.stats  # ensures sp.stats is loaded for summary_stats() in CELL 25
import pandas as pd
import matplotlib.pyplot as plt
# below for saving/loading files
import pickle
# below for search/replace of escape characters
import re
# below two imports needed for spell correction
import enchant
from nltk.metrics import edit_distance
# below for bigrams
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk import bigrams
# below for part of speech tagging
# uses 'taggers/maxent_treebank_pos_tagger/english.pickle'
from nltk.corpus import wordnet
# below for tokenizing
from nltk.tokenize import word_tokenize, sent_tokenize
# below for lemmatizer
from nltk.stem import WordNetLemmatizer

# set seed (only for debugging, no reason to do this unless you want repeatable result)
# One area I did find it useful, was in experimenting with grid searches
# np.random.seed(42)
# random.seed(42)

# setup logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# features, listed in the order of processing
feature = {

           # Text Pre-Processing
           # The features below concern the text pre-processing pipeline.  For example, the only time we use IPython's
           # ipcluster is during text pre-processing.  Be very careful if you have any "save" bits set.  Make sure
           # that your file_root (defined below the feature dict) is set to a correct value, so that you do not overwrite
           # data you wish to keep.
           
           # Note: preprocess_load and texts_final_load are mutually exclusive, so it would not make sense to set them both.
           # Other combinations that do not make sense could lead to errors as well.
           
           'parallel'              : 1, # enables ipython parallelization using multiple cores
           'preprocess_save'       : 0, # write preprocessed data to file for later import CELL 97
           'preprocess_load'       : 0, # load previously saved preprocessed data from file instead of db (CELL 70)
           'remove_html'           : 1, # remove some html encodings, not needed if using data in database
           'tokenize'              : 1, # tokenize using NLTK's sentence and word tokenizers
           'spell_correct'         : 1, # spell correct
           'pos_tag'               : 1, # part of speech tag
           'only_tagged'           : 1, # only use parts of speech types listed in tag_set
           'remove_punctuation'    : 1, # remove punctuation
           'remove_smallwords'     : 1, # removes words with len(word) < 3
           'alpha_only'            : 0, # use alphabetic words only (probably not necessary)
           'analyze_spell_correct' : 1, # save text before spell correction for later analysis in CELL 98, 120, 130
           'lowercase'             : 1, # lowercase text
           'remove_stopwords'      : 1, # remove stopwords 
           'lemmatize'             : 1, # lemmatize text
           'analyze_lemmatize'     : 1, # save text before lemmatization for later analysis in CELL 98, 120 
           'lemma_dict'            : 1, # produce a dict of lemmas with words mapped to them in CELL 100 (time consuming)
           'texts_final_load'      : 0, # load the texts_final
           'texts_final_save'      : 0, # save the texts_final
           
           # Dictionary / Corpus
           # The features below concern themselves with the creation of a dictionary and corpus.
           
           'dict_corpus_load'      : 0, # load the dictionary and corpus in CELL 205
           'dict_corpus_save'      : 0, # save the dictionary created in CELL 182 and the corpus created in CELL 212
           'tfidf'                 : 0, # TF-IDF transform corpus, otherwise leave as bag of words
           
           # LDA
           # The features below concern the actual creation of the LDA model.  This is where the Pyro4
           # framework would be used if available.
           
           'perplexity_search'     : 0, # enables perplexity grid search in CELL 268 rather than regular model in 270
           'model_load'            : 0, # load the model in CELL 289
           'model_save'            : 0, # save the model in CELL 290
           'corpus_model_load'     : 0, # load the corpus model in CELL 293
           'corpus_model_save'     : 0, # save the corpus model in CELL 295
           'beta_gamma_save'       : 0, # save the LDA models beta and gamma in CELL 310
           'model_topics_load'     : 0, # load model_topics in CELL 301
           'model_topics_save'     : 0  # save model_topics in CELL 302
           }

# Maximum number of reviews to use if loading from the database; set to 0 to use all review data requested.
# If you are just trying out this notebook, I recommend setting this to a low number like 500, 1000, 3000, etc.
# If the query set in CELL 70 is "all reviews" or similar, text processing and model building could take many
# hours to complete, depending on the platform it is running on.
MAX_NUM_REVIEWS = 500

# State to get data for, should be set to a valid 2 letter state abbreviation.  If set to "ALL" it will pull data for all states
# This applies for when loading data from the database
state = "CA"

# directory to look for data files
data_dir = "data"

# file root name (do not add file extension).  This is used to build almost every filename used in the notebook, whether you are
# loading or saving files.  It's important to set it correctly if you have any of the "save" bits set in the feature dict,
# otherwise you could overwrite previously saved data.
file_root = "schools"

# which parts of speech tags to use when only_tagged feature enabled
# for an explanation of the tags see http://bulba.sdsu.edu/jeanette/thesis/PennTags.html
tag_set = ['NN','NNS']

# our stopsets
custom_stopset = ['school','teachers','students','children','parents','kids','child','parent','son','sons','student','schools',
                  'daughter','daughters',"I'm","i'm","I've","i've",'teacher','principal','principle','mr','Mr','ms','Mrs',
                  'dint','kindergarten','would','went','get','one','ones','st','make','year','years','aw','grandson','ma',
                  "La's",'la','way','thing','lot','everybody','everyone','great','excellent','part','hill','valley','den',
                  'alto','crest','wood','land','love','amazing','okay']

stopset_unigram = set(nltk.corpus.stopwords.words('english'))
stopset_unigram = stopset_unigram.union(set(custom_stopset))
stopset_bigrams = []

# number of topics to use in the LDA model.  This is set here, as opposed to just in CELL 270 where the model runs, because it is
# used in the creation of many filenames which can be saved/loaded (set in feature dict) that are specific to the number of
# topics of the model.
num_topics = 75

In [None]:
# CELL 25
# summary statistics
def summary_stats(texts):
    """
    Produces Summary Statistics on the texts provided
    
    Parameters
    ----------
    texts : A tokenized set of documents (list of lists)
    """
    token_count = defaultdict(int)
    for word in itertools.chain.from_iterable(texts):
        token_count[word] += 1
    print "%d words appear in the text just 1 time" % sum([1 for word in token_count if token_count[word] == 1])
    print "%d words appear in the text just 2 times" % sum([1 for word in token_count if token_count[word] == 2])
    n, min_max, mean, var, skew, kurt = sp.stats.describe(token_count.values())
    print("Number of unique words: {0:d}".format(n))
    print("Minimum freq: {0:8d} Maximum freq: {1:8d}".format(min_max[0], min_max[1]))
    print("Mean: {0:8.2f}".format(mean))
    print("Std. deviation : {0:8.6f}".format(np.std(token_count.values())))
    print("Variance: {0:8.2f}".format(var))
    print("Skew : {0:8.2f}".format(skew))
    print("Kurtosis: {0:8.2f}".format(kurt))
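
The counting step at the heart of `summary_stats()` can be tried on a toy input. The sketch below is written for Python 3 and swaps in `collections.Counter` for the `defaultdict(int)` used above; the sample `texts` and the helper name `token_frequencies` are made up for illustration.

```python
from collections import Counter
from itertools import chain


def token_frequencies(texts):
    """Count how often each token appears across a list of tokenized documents."""
    return Counter(chain.from_iterable(texts))


# Hypothetical tokenized documents (list of lists), same shape as `texts` above.
texts = [['class', 'music', 'class'], ['music', 'drama'], ['class']]

counts = token_frequencies(texts)
once = sum(1 for c in counts.values() if c == 1)   # words seen exactly once
twice = sum(1 for c in counts.values() if c == 2)  # words seen exactly twice

print("%d unique words, %d appear once, %d appear twice" % (len(counts), once, twice))
```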

In [None]:
# CELL 30
# MySQL functions

"""
class mysql_ops

functions:

    mysql_connect - connects to our MySQL db
    mysql_disconnect - disconnects
    get_mysql_data - generic retrieval function that invokes a stored procedure returning a denormalized df combining school and review data
    
    retrieval methods (commented inline)
    
"""
class mysql_ops(object):
    
    # returns a connection to our MySQL db
    @classmethod
    def mysql_connect(self):
        
        cnx = pymysql.connect(host='cs109instance.ccikshmkulj7.us-east-1.rds.amazonaws.com', 
                               port=3306, 
                               user='michael', 
                               passwd='michael', 
                               db='cs109gs')
        this_cursor = cnx.cursor(pymysql.cursors.DictCursor)
        return cnx, this_cursor
    
    # close cursor and MySQL connection
    @classmethod
    def mysql_disconnect(self, cnx, cursor):
        cursor.close()
        cnx.close() # close db connection
        
    # get data from mysql db
    def get_mysql_data(self, sql, data_id1=None, data_id2=None):
        
        cnx, cursor = self.mysql_connect()
        
        if data_id2:
            cursor.callproc(sql, (data_id1, data_id2))
        elif data_id1:
            cursor.callproc(sql, (data_id1,))
        else:
            cursor.callproc(sql)
        
        if (cursor.rowcount > 0):
            df = pd.DataFrame(cursor.fetchall())
            self.mysql_disconnect(cnx, cursor)# close db connection
            return df
        else:
            self.mysql_disconnect(cnx, cursor)# close db connection
            print "no rows returned"
        
        return None
    
    # return all reviews for one school gsid (denormalized, i.e. with the associated school data repeated per review)
    def get_reviews_gsid(self, gsid):
        sql = 'school_reviews_by_gsid'
        return self.get_mysql_data(sql, gsid)
     
    # return ethnic composition data for one school gsid 
    def get_race_gsid(self, gsid):
        sql = 'school_race_by_gsid'
        return self.get_mysql_data(sql, gsid)
    
    # return GS census data for one school gsid 
    def get_census_gsid(self, gsid):
        sql = 'school_census_by_gsid'
        return self.get_mysql_data(sql, gsid)
    
    # return GS test results data for one school gsid 
    def get_results_gsid(self, gsid):
        sql = 'school_results_by_gsid'
        return self.get_mysql_data(sql, gsid)
 
    # return all reviews for one school ncesid (denormalized)
    def get_reviews_ncesid(self, ncesid):
        sql = 'school_reviews_by_ncesid'
        return self.get_mysql_data(sql, ncesid)
    
    # return all reviews for one district ncesid
    def get_reviews_districtncesid(self, districtncesid):
        sql = 'school_reviews_by_districtncesid'
        return self.get_mysql_data(sql, districtncesid)
    
    # return all reviews for one state 
    def get_reviews_state(self, state):
        sql = 'school_reviews_by_state'
        return self.get_mysql_data(sql, state)
    
    # return all census data for one state
    def get_census_state(self, state):
        sql = 'school_census_by_state'
        return self.get_mysql_data(sql, state)
    
    # return all test results for one state
    def get_results_state(self, state):
        sql = 'school_results_by_state'
        return self.get_mysql_data(sql, state)
    
    # return all test results for one state and year
    def get_results_state_year(self, state, year):
        sql = 'school_results_by_state_year'
        return self.get_mysql_data(sql, state, year)
    
    # return all reviews for one county
    def get_reviews_county(self, state, county):
        sql = 'school_reviews_by_county'
        return self.get_mysql_data(sql, state, county)
    
    # return all reviews containing text string
    def get_reviews_string(self, text_string):
        sql = 'school_reviews_by_string'
        return self.get_mysql_data(sql, text_string)
    
    # note the psql syntax: returns all schools / reviews in db 
    # CAREFUL WITH THIS ONE:  IT WILL TAKE A LONG TIME AND RETURN EVERYTHING!
    def get_all_reviews(self):
        sql = 'school_reviews' 
        return self.get_mysql_data(sql)
        #df = psql.frame_query(sql, cn)
        #cn = cnx.mysql_disconnect()
        #return df
    
    #return all cities and towns for a given state
    def get_cities_towns_by_state(self, state):
        sql = 'get_cities_towns_by_state' 
        return self.get_mysql_data(sql, state)
    

In [None]:
# CELL 50
# Spell Checker Class

# SpellingReplacer class taken from Python Text Processing with NLTK 2.0 Cookbook
# It returns the word unchanged if it's in the dictionary; otherwise it returns the closest suggestion,
# provided its edit distance is <= self.max_dist.  If no suggestion is close enough, it returns the original word.
class SpellingReplacer(object):
    
    def __init__(self, dict_name='en_US', max_dist=2):
        self.spell_dict = enchant.Dict(dict_name)
        self.max_dist = max_dist
    def replace(self, word):
        if self.spell_dict.check(word):
            return word
        suggestions = self.spell_dict.suggest(word)
        if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
            return suggestions[0]
        else:
            return word
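
The acceptance rule in `replace()` can be exercised without a dictionary backend. The sketch below is a pure-Python illustration, not the notebook's code: `levenshtein` is a hand-written stand-in for `nltk.metrics.edit_distance` with default arguments, and the suggestion lists are hard-coded assumptions rather than real Enchant output.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def pick_replacement(word, suggestions, max_dist=2):
    """Accept the top suggestion only if it is within max_dist edits of word."""
    if suggestions and levenshtein(word, suggestions[0]) <= max_dist:
        return suggestions[0]
    return word


# Hypothetical suggestion lists; a real run would get these from enchant.
print(pick_replacement('cookbok', ['cookbook', 'cookbooks']))
print(pick_replacement('xyzzyq', ['example']))
```

Only the first suggestion is ever considered, so a closer match further down the list is ignored. That mirrors the class above and keeps the correction cheap.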

In [None]:
# CELL 60
# We instantiate a SpellingReplacer, and do a quick test
if(feature['spell_correct']):
    replacer = SpellingReplacer()
    print replacer.replace('cookbok')

In [None]:
# CELL 70
# if preprocess_load is set, get reviews as well as other data from pre-saved files
# otherwise get reviews data from the db

if(feature['preprocess_load']):
    filename = data_dir + '/' + file_root + '-reviews_indexes.pickle'
    with open(filename) as f:
        reviews_indexes = pickle.load(f)
    print "Loaded file %s of size %d" % (filename,len(reviews_indexes))

    filename = data_dir + '/' + file_root + '-reviews.pickle'
    with open(filename) as f:
        reviews = pickle.load(f)
    print "Loaded file %s of size %d" % (filename,len(reviews))

    filename = data_dir + '/' + file_root + '-school_names.pickle'
    with open(filename) as f:
        school_names = pickle.load(f)
    print "Loaded file %s of size %d" % (filename,len(school_names))
    
    filename = data_dir + '/' + file_root + '-texts.pickle'
    with open(filename) as f:
        texts = pickle.load(f)
    print "Loaded file %s of size %d" % (filename,len(texts))

    filename = data_dir + '/' + file_root + '-texts_uncorrected.pickle'
    with open(filename) as f:
        texts_uncorrected = pickle.load(f)
    print "Loaded file %s of size %d" % (filename,len(texts_uncorrected))

    filename = data_dir + '/' + file_root + '-texts_unlemmatized.pickle'
    with open(filename) as f:
        texts_unlemmatized = pickle.load(f)
    print "Loaded file %s of size %d" % (filename,len(texts_unlemmatized))

    filename = data_dir + '/' + file_root + '-texts_pos.pickle'
    with open(filename) as f:
        texts_pos = pickle.load(f)
    print "Loaded file %s of size %d" % (filename,len(texts_pos))

    filename = data_dir + '/' + file_root + '-lemma_dict.pickle'
    with open(filename) as f:
        lemma_dict = pickle.load(f)
    print "Loaded file %s of size %d" % (filename,len(lemma_dict))

if(not (feature['preprocess_load'] or feature['texts_final_load'])):
    # instance mysql_ops
    cnx = mysql_ops()
    
    # Here is where we set which query to use.  It can be constrained by MAX_NUM_REVIEWS or not.
    # You can add whatever query you like; a full list is available in CELL 30.  Below we just show "get_all_reviews()" and
    # "get_reviews_state()", as those would likely be the most commonly used with this notebook.
    if(MAX_NUM_REVIEWS == 0):
        if((state == "ALL") or (state == "all")):
            reviews = cnx.get_all_reviews()[['id','nces_code','universal_id','postdate','name','reviews']]
        else:
            reviews = cnx.get_reviews_state(state)[['id','nces_code','universal_id','postdate','name','reviews']]
    else:
        if((state == "ALL") or (state == "all")):
            reviews = cnx.get_all_reviews()[['id','nces_code','universal_id','postdate','name','reviews']][:MAX_NUM_REVIEWS]
        else:
            reviews = cnx.get_reviews_state(state)[['id','nces_code','universal_id','postdate','name','reviews']][:MAX_NUM_REVIEWS]
            
    # We save off the indexes of each review so we can correlate everything later if needed
    reviews_indexes = reviews[['id','nces_code','universal_id','postdate']]
    
    # This is a list of school names, which we immediately de-duplicate with set().  It's only used to create bigrams,
    # which will then be added to a special stopset for bigrams.  Here they are unprocessed, but after CELL 90 they will
    # look very different and be very sparse, as most will have been removed during pre-processing (most parts of school
    # names are either stopped in our pre-processing or are not parts of speech that we decide to keep, nouns for example)
    school_names = list(set(reviews['name']))
    
    # This is the core of what we are processing and building our model on.  The unprocessed data from the database.
    reviews = reviews['reviews']
    print "Ingested %d documents from database" % len(reviews)

In [None]:
# CELL 75
# setup ipython for parallel processing
if(feature['parallel'] and not (feature['preprocess_load'] or feature['texts_final_load'])):
    from IPython.parallel import Client

    # first start an ipython cluster on your machine.  For example, for 4 cores:
    #    ipcluster start -n 4 

    # Setup client instance
    rc = Client()
    print "Discovered %d cores" % len(rc)
    rc.ids

    # we will use a DirectView object for direct execution across all cores
    # all your cores are belong to us
    dv = rc[:]

    # We will block on all executions
    dv.block=True

    dv.scatter('partition_ids', range(len(rc)))
    %px print(partition_ids)
    %px partition_id = partition_ids[0]
    %px print(partition_id)
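
`CELL 75` scatters a list of engine ids so that each engine receives exactly one, and `CELL 90` later scatters the data itself. The sketch below illustrates the idea of splitting a list into contiguous, near-even shards in pure Python. It is only an illustration under that assumption: the `partition` helper is hypothetical, and the exact chunk boundaries chosen by IPython's `scatter` may differ.

```python
def partition(seq, n_workers):
    """Split seq into n_workers contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(seq), n_workers)
    chunks = []
    start = 0
    for i in range(n_workers):
        size = base + (1 if i < extra else 0)  # the first `extra` chunks get one more item
        chunks.append(seq[start:start + size])
        start += size
    return chunks


# Ten hypothetical document ids spread over four workers.
chunks = partition(list(range(10)), 4)
print(chunks)
```

Contiguous shards matter here because the results are later concatenated worker by worker, which only restores the original document order if each shard is a contiguous slice.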

In [None]:
# CELL 80
# remove punctuation, remove stopwords, remove words < 3, remove non alpha words, spell correct, lemmatize
# you need to comment out the below decorator if you don't want to run parallel
if(not (feature['preprocess_load'] or feature['texts_final_load'])):

    @dv.remote(block=True)
    def clean_data():
    
        replacer = SpellingReplacer()

        texts = []
        texts_uncorrected = []
        texts_unlemmatized = []
        texts_pos = []
        lemma_dict = defaultdict(set)
        counter = 0
    
        # we need to strip escape characters so we compile a pattern
        hexchars = re.compile('\\\\x(\w{2})')
        
        # Instantiate a WordNet Lemmatizer
        wnl = WordNetLemmatizer()
        
        # convert penn_tag to morphy_tag used by WordNet Lemmatizer
        # http://stackoverflow.com/questions/5364493/lemmatizing-pos-tagged-words-with-nltk
        morphy_tag = {'NN':wordnet.NOUN,'JJ':wordnet.ADJ,'VB':wordnet.VERB,'RB':wordnet.ADV}
        
        print "Starting to process %d documents" % len(data_to_clean)
    
        for review_text in data_to_clean:
            
            # print result to stdout, note that when using parallel stdout is not sent to the Client
            if(not feature['parallel']):
                counter += 1
                if ((counter % 1000)==0):
                    print "%d documents processed" % counter
                sys.stdout.flush()
                          
            # If it's empty, impute "the", which is ignored anyway.  This needs to be done instead of skipping in order
            # to keep our indexes 1:1 across all texts for analysis
            if review_text is None:
                review_text = "the"
        
            # remove html encodings
            if(feature['remove_html']):
                review_text = review_text.replace('&amp;','').replace('&lt;','').replace('&gt;','').replace('&quot;','').replace('&#039;','').replace('&#034;','')
                review_text = re.sub(hexchars, '', review_text.encode("string-escape"))
          
            # tokenize, you don't want to turn this off
            if(feature['tokenize']):
                review_text = [word for sent in sent_tokenize(review_text) for word in word_tokenize(sent)]
            else:
                review_text = [word for word in review_text.split()]
            
            # Below line is just to store uncorrected text so we can check spelling routines later
            if(feature['analyze_spell_correct'] and feature['spell_correct']):
                texts_uncorrected.append(review_text)
            
            # Spell correction using the Enchant library
            # The corrected words that are returned may be capitalized as part of the correction, or even have
            # things like apostrophes added.  We will just leave these in place for now as this does not affect
            # the data quality. What is important is that at this point the data has been de-duplicated, corrected and
            # is consistent.  So instead of "california", "California", "Califonria", it may all be just "California"
            # Note that spell correction DOES remove punctuation!
            if(feature['spell_correct']):
                review_text =  [word for words in review_text for word in replacer.replace(words).split()]
                   
            # get parts of speech
            if(feature['pos_tag']):
                review_text_pos = [word for word in nltk.pos_tag(review_text)]
                texts_pos.append(review_text_pos)
            
                # Remove anything that is not in our tag_set
                if(feature['only_tagged']):
                    # review_text = [word[0] for word in review_text_pos if word[1] in tag_set]
                    review_text = [word for word in review_text_pos if word[1] in tag_set]
                else:
                    review_text = [word for word in review_text_pos]
    
            # remove punctuation
            if(feature['remove_punctuation']):
                if(feature['pos_tag']):
                    review_text = [(word[0].translate(string.maketrans("",""), string.punctuation),word[1]) for word in review_text]
                else:
                    review_text = [word.translate(string.maketrans("",""), string.punctuation) for word in review_text]
                        
            # remove smallwords
            if(feature['remove_smallwords']):
                if(feature['pos_tag']):
                    review_text = [(word[0],word[1]) for word in review_text if not len(word[0]) < 3]
                else:
                    review_text = [word for word in review_text if not len(word) < 3]
                
            # Only consider alpha words (probably unnecessary)
            if(feature['alpha_only']):
                if(feature['pos_tag']):
                    review_text = [(word[0],word[1]) for word in review_text if word[0].isalpha()]
                else:
                    review_text = [word for word in review_text if word.isalpha()]
            
            # lowercase all text
            if(feature['lowercase']):
                if(feature['pos_tag']):
                    review_text = [(word[0].lower(),word[1]) for word in review_text]
                else:
                    review_text = [word.lower() for word in review_text]
            
            # remove stopwords
            if(feature['remove_stopwords']):
                if(feature['pos_tag']):
                    review_text = [word for word in review_text if not word[0] in stopset_to_use]
                else:
                    review_text = [word for word in review_text if not word in stopset_to_use]
            
            # Below line is just to store unlemmatized text so we can check statistics later
            if((feature['analyze_lemmatize'] and feature['lemmatize']) or (feature['analyze_spell_correct'] and feature['spell_correct'])):
                if(feature['pos_tag']):
                    texts_unlemmatized.append([word[0] for word in review_text])
                else:
                    texts_unlemmatized.append(review_text)
                    
            # Lemmatize using the Wordnet Lemmatizer
            # Use nltk's lemmatizer to create word stems
            if(feature['lemmatize'] and not feature['lemma_dict']):
                if(feature['pos_tag']):
                    # convert penn_tag to morphy_tag used by WordNet Lemmatizer
                    review_text = [wnl.lemmatize(word[0],morphy_tag[word[1][:2]]) for word in review_text]
                else:
                    review_text = [wnl.lemmatize(word) for word in review_text]
                
            # creating a dictionary of lemmas can take a while, you may wish to first try on a small set by setting MAX_NUM_REVIEWS
            # in CELL 0
            if(feature['lemmatize'] and feature['lemma_dict']):
                lemma_text = []
                for word in review_text:
                    if(feature['pos_tag']):
                        orig_word = word[0]
                        lemma = wnl.lemmatize(word[0],morphy_tag[word[1][:2]])
                    else:
                        orig_word = word
                        lemma = wnl.lemmatize(word)
                        
                    # if word doesn't match its lemma then add this word to our lemma_dict
                    if lemma != orig_word:
                        lemma_dict[lemma].add(orig_word)
                        
                    lemma_text.append(lemma)
                
                review_text = lemma_text
    
            # remove stopwords and smallwords a second time, to make sure none were introduced from lemmatization 
            if(feature['lemmatize']):
                if(feature['remove_smallwords']):
                    review_text = [word for word in review_text if not len(word) < 3]
    
                if(feature['remove_stopwords']):
                    review_text = [word for word in review_text if not word in stopset_to_use]
    
                        
            # if pos_tag is turned on but lemma is not, then our text still has tags and we need to remove them
            if(feature['pos_tag'] and not feature['lemmatize']):
                review_text = [word[0] for word in review_text]
            
            # add to our list
            texts.append(review_text)
        return texts, texts_uncorrected, texts_unlemmatized, texts_pos, lemma_dict
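
The `morphy_tag` lookup in `clean_data()` works because `tag_set` only contains `NN` and `NNS`; a tag whose first two letters are not in the dict (for example `DT`) would raise a `KeyError`. The sketch below shows one possible way to guard the lookup if `tag_set` is widened. The fallback to noun is an assumption made for this example, not something the notebook does, and the constants are written as plain strings so that the block runs without NLTK (they match the values of `wordnet.NOUN`, `wordnet.ADJ`, `wordnet.VERB` and `wordnet.ADV`).

```python
# WordNet part-of-speech codes as plain strings.
NOUN, ADJ, VERB, ADV = 'n', 'a', 'v', 'r'

# Same mapping as morphy_tag in CELL 80, keyed on the first two letters of the Penn tag.
MORPHY_TAG = {'NN': NOUN, 'JJ': ADJ, 'VB': VERB, 'RB': ADV}


def penn_to_wordnet(penn_tag, default=NOUN):
    """Map a Penn Treebank tag to a WordNet POS code, falling back to `default`.

    The fallback is a hypothetical safeguard; the notebook indexes the dict directly.
    """
    return MORPHY_TAG.get(penn_tag[:2], default)


for tag in ['NNS', 'VBD', 'JJR', 'RB', 'DT']:
    print(tag, penn_to_wordnet(tag))
```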

In [None]:
# CELL 85
# setup clean_data for parallel processing
if(feature['parallel'] and not (feature['preprocess_load'] or feature['texts_final_load'])):

    # We synchronize our imports to our remote workers, as they have separate environments from ours
    # these are just the imports that are needed for clean_data()
    with dv.sync_imports():
        import nltk
        import sys
        import string
        import enchant
        import re
        from collections import defaultdict
        from nltk.stem import WordNetLemmatizer
        from nltk.tokenize import word_tokenize, sent_tokenize
        from nltk.metrics import edit_distance
        from nltk.corpus import wordnet

In [None]:
%%time
# CELL 90

# %%time reports Wall time: 4.19 s for my Quad Core 2.66Ghz mac pro in parallel mode
# %%time reports Wall time: 1min 2s for Amazon instance using ALL reviews data ~900k

# get school names and create bigrams from them, then add those bigrams to our bigram stopset

# We clean our school_names the exact same way we clean other data; this must be done for them to be used
# properly as stop words

# You will NOT see stdout from the workers if using parallel processing

# The commands below make use of the global namespace as opposed to passing variables to the function. IPython's
# parallel architecture basically relies on global variables in each worker's own environment.  We rename them here to
# keep things organized
if(not (feature['preprocess_load'] or feature['texts_final_load'])):
    
    data_to_clean = school_names
    stopset_to_use = stopset_unigram
    
    # copy the data_to_clean and stopset_to_use into name space of remote workers
    if(feature['parallel']):
        
        # We use IPython's parallel processing "scatter" to shard data_to_clean (here the school names) across our cores
        dv.scatter('data_to_clean',data_to_clean)
        
        # We copy the objects feature, stopset, tag_set as well as the Class SpellingReplacer to our remote workers
        dv['feature'] = feature
        dv['stopset_to_use'] = stopset_to_use
        dv['tag_set'] = tag_set
        dv['SpellingReplacer'] = SpellingReplacer
        
        # Show size of entire data being sent for processing
        print "Total data size: %d" % len(data_to_clean)
        
        # Show the shard size of each remote worker
        print "Sharded data size for each worker"
        %px print len(data_to_clean)
    
        # clean the school_names
        result = clean_data()
    else:
        # We are not running parallel and we only care about the first text returned from clean_data()
        school_names = clean_data()[0]

In [None]:
# CELL 91
# Merge parallel data from the remote workers back into one set for school_names
if(feature['parallel'] and not (feature['preprocess_load'] or feature['texts_final_load'])):
    school_names = []

    for worker in result:
        school_names.extend(worker[0])

    print "Total data size after merge: %d" % len(school_names)

## Enriching the text with bigrams

We created bigrams from the above text in `CELL 110` and included those as well.  This surfaced interesting, frequently occurring bigrams such as *test scores*, *front office*, *music program*, *field trip*, etc.  Once again, these were built only from nouns.  The bigrams were added back to our processed documents, so that each review ended up as a list of its unigrams and bigrams.  We created a stopset in `CELL 93` from the school names, so that those could be removed from any bigrams that were created.  The school names themselves were very frequent in bigrams, and we were not concerned with the names of schools when doing topic analysis.
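
As a rough illustration of this step, the sketch below builds a bigram stopset from a cleaned school name and appends the remaining bigrams to a review. It is a simplification under stated assumptions: it keeps every adjacent pair of tokens, whereas `CELL 110` may select bigrams differently, and the helper names and sample data are hypothetical.

```python
def make_bigrams(tokens):
    """Join each pair of adjacent tokens into a single space-separated string."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def add_bigrams(tokens, stopset_bigrams):
    """Return the unigrams followed by every bigram that is not in the stopset."""
    kept = [bg for bg in make_bigrams(tokens) if bg not in stopset_bigrams]
    return tokens + kept


# Hypothetical cleaned school name and review (already lowercased nouns).
school_names = [['glenn', 'brook']]
stopset_bigrams = set(bg for name in school_names for bg in make_bigrams(name))

review = ['glenn', 'brook', 'music', 'program', 'field', 'trip']
enriched = add_bigrams(review, stopset_bigrams)
print(enriched)
```

The unigrams `glenn` and `brook` stay in the review; only the bigram `glenn brook` is stopped, in line with the decision to stop school names at the bigram level only.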


In [None]:
# CELL 93
# Create stopset_bigrams, a bigram stopset built from the school names

# We create bigrams from the school_names and collect them into a set to form our final bigram stopset

# We only stop bigrams made from school names, not unigrams.  This is because there are too many words potentially used in school 
# names which we would not want to stop, for example the word "Art" or "Science" or any number of other words.  When discussing 
# a school in a review however, reviewers repeatedly mention school names.  For example, if the school name were "Glenn Brook Middle
# School" they may constantly mention "Glenn Brook"; this gets rid of it.  In reality, when working with a tag_set of just
# nouns (NN/NNS), most names of schools are removed from the text anyway, as they are typically proper nouns (NNP) or adjectives
# and never survive further along to give us any trouble.  However, having this bigram creation and stoplist is very useful if you 
# are NOT limiting to just nouns, so it gives us great flexibility.
if(not feature['texts_final_load']):
    school_names_bigram = [[" ".join(bigram) for bigram in bigrams(text)] for text in school_names]
    school_names_bigram = list(itertools.chain.from_iterable(school_names_bigram))
    stopset_bigrams = set(school_names_bigram)

In [None]:
%%time
# CELL 94 
# collect optional uncorrected, unlemmatized, pos tagged and lemma dict texts for later analysis

# time 1m 56s for 3000 reviews on my 2.66Ghz quad core xeon mac pro
# time 25m for 37349 reviews on my 2.66Ghz quad core xeon mac pro
# time 18m 18s for 37349 reviews on my 2.66Ghz quad core xeon mac pro with 8 virtual cores HyperThreading
# time 31m 20s for 166610 CA reviews on Amazon EC2 16 hyperthreaded server
# time 2h 15m for 890000 reviews of all schools on Amazon EC2 32 hyperthreaded server 

# get our cleaned reviews
# We can do this using parallel processing or not
# You will NOT see stdout from the workers if using parallel processing

# The commands below make use of the global name space in each worker as opposed to passing variables to the function. IPython's
# parallel architecture basically utilizes separate python processes, each with their own global namespace 
if(not (feature['preprocess_load'] or feature['texts_final_load'])):
    
    data_to_clean = reviews
    stopset_to_use = stopset_unigram
    
    # copy the data_to_clean and stopset_to_use into name space of remote workers
    if(feature['parallel']):
        
        # We use IPython's parallel processing "scatter" to shard "reviews" across our cores
        dv.scatter('data_to_clean',data_to_clean)
        
        # We copy the objects feature and stopset, as well as the Class SpellingReplacer to our remote workers
        dv['feature'] = feature
        dv['stopset_to_use'] = stopset_to_use
        dv['tag_set'] = tag_set
        dv['SpellingReplacer'] = SpellingReplacer
        
        # Show size of entire data being sent for processing
        print "Total data size: %d" % len(data_to_clean)
        
        # Show the shard size of each remote worker
        print "Sharded data size for each worker"
        %px print len(data_to_clean)
    
        # clean the reviews
        result = clean_data()
    else:
        # run non-parallel
        texts,texts_uncorrected,texts_unlemmatized,texts_pos,lemma_dict = clean_data()

In [None]:
# CELL 95
# Diagnostics to make sure all of our data is here.  There should be no drift in the lengths of texts.

if(feature['parallel'] and not (feature['preprocess_load'] or feature['texts_final_load'])):

    texts_size = 0
    texts_uncorrected_size = 0
    texts_unlemmatized_size = 0
    texts_pos_size = 0
    lemma_dict_size = 0

    # iterate through each worker and get lengths of the texts
    for worker in result:
        texts_size += len(worker[0])
        texts_uncorrected_size += len(worker[1])
        texts_unlemmatized_size += len(worker[2])
        texts_pos_size += len(worker[3])
        lemma_dict_size += len(worker[4])
        
if(not feature['parallel'] or feature['preprocess_load']):
    texts_size = len(texts)
    texts_uncorrected_size = len(texts_uncorrected)
    texts_unlemmatized_size = len(texts_unlemmatized)
    texts_pos_size = len(texts_pos)
    lemma_dict_size = len(lemma_dict)

if(not feature['texts_final_load']):
    print "original reviews size %d" % len(reviews)
    print "combined size of remote workers texts_pos %d" % texts_pos_size
    print "combined size of remote workers texts_uncorrected %d" % texts_uncorrected_size
    print "combined size of remote workers texts_unlemmatized %d" % texts_unlemmatized_size
    print "combined size of remote workers texts %d" % texts_size
    # sum of len of separate dictionaries is not necessarily equal to their original/merged size
    print "combined size of remote workers lemma_dict %d" % lemma_dict_size

In [None]:
%%time
# CELL 96
# Merge parallel data from the remote workers back into one set
if(feature['parallel'] and not (feature['preprocess_load'] or feature['texts_final_load'])):
    texts = []
    texts_uncorrected = []
    texts_unlemmatized = []
    texts_pos = []
    lemma_dict = defaultdict(set)

    # iterate through each worker and merge the texts back together
    for worker in result:
        texts += worker[0]
        texts_uncorrected += worker[1]
        texts_unlemmatized += worker[2]
        texts_pos += worker[3]
        for k in worker[4]:
            lemma_dict[k] = lemma_dict[k].union(worker[4][k])
    print "Total data size after merge: %d" % len(texts)

In [None]:
# CELL 97
# save data 
# write reviews, school_names, texts, texts_uncorrected, texts_unlemmatized, texts_pos, lemma_dict
# out to a file using pickle.  This is so we don't have to re-run pre-processing later if we want to work on the same data
if(feature['preprocess_save']):
    # (suffix, object) pairs; each object is pickled to <data_dir>/<file_root>-<suffix>.pickle
    to_save = [('reviews_indexes', reviews_indexes),
               ('reviews', reviews),
               ('school_names', school_names),
               ('texts', texts),
               ('texts_uncorrected', texts_uncorrected),
               ('texts_unlemmatized', texts_unlemmatized),
               ('texts_pos', texts_pos),
               ('lemma_dict', lemma_dict)]

    for suffix, obj in to_save:
        filename = data_dir + '/' + file_root + '-' + suffix + '.pickle'
        print "Writing file %s of size %d" % (filename,len(obj))
        # binary mode keeps the pickle portable across platforms and pickle protocols
        with open(filename, 'wb') as f:
            pickle.dump(obj, f)

In [None]:
# CELL 98
# Look at the text or analyze the text processing pipeline for a single review

def get_review_data(review_num):
    print "\nData from reviews for review number %d" % review_num
    print "GSID: %s NCES Code: %s    Universal ID: %s" % (reviews_indexes['id'][review_num],reviews_indexes['nces_code'][review_num],reviews_indexes['universal_id'][review_num])
    print "\n"
    print reviews[review_num]
    
    if(feature['analyze_spell_correct'] and feature['spell_correct']):
        print "\nData from texts_uncorrected for review number %d" % review_num
        print texts_uncorrected[review_num]

    if(feature['pos_tag']):
        print "\nData from texts_pos for review number %d" % review_num
        print texts_pos[review_num]

    if((feature['analyze_lemmatize'] and feature['lemmatize']) or (feature['analyze_spell_correct'] and feature['spell_correct'])):
        print "\nData from texts_unlemmatized for review number %d" % review_num
        print texts_unlemmatized[review_num]

    print "\nData from texts for review number %d" % review_num
    print texts[review_num]
  
if(not feature['texts_final_load']):
    print "Length of ingested reviews: %d" % len(reviews)
    print "Length of texts_pos: %d" % len(texts_pos)
    print "Length of texts_uncorrected: %d" % len(texts_uncorrected)
    print "Length of texts_unlemmatized: %d" % len(texts_unlemmatized)
    print "Length of texts: %d" % len(texts)
    get_review_data(4)

In [None]:
# CELL 99
# view descriptive statistics on our texts
# Very interesting information here, over half the text is words that appear just 1 or 2 times!
if(not feature['texts_final_load']):
    # conditions mirror get_review_data() in CELL 98
    if(feature['analyze_spell_correct'] and feature['spell_correct']):
        print "Descriptive statistics for texts_uncorrected"
        summary_stats(texts_uncorrected)
    if((feature['analyze_lemmatize'] and feature['lemmatize']) or (feature['analyze_spell_correct'] and feature['spell_correct'])):
        print "\nDescriptive statistics for texts_unlemmatized (after spell correction, removal of POS not in our tag_set, removal of stopwords, etc. but before lemmatization)"
        summary_stats(texts_unlemmatized)
    print "\nDescriptive statistics for texts (lemmatized and done with pre-processing)"
    summary_stats(texts)

In [None]:
# CELL 100
# This cell just prints some interesting information about the lemmatization

if(not feature['texts_final_load']):
    if(feature['lemmatize'] and feature['lemma_dict']):
        lemma_list = []
        lemma_total = 0
        for k,v in lemma_dict.iteritems():
            lemma_list.append((k,list(v)))
            lemma_total += len(v)+1
        print "there were %d words that were lemmatized down to a total of %d lemmas" % (lemma_total,len(lemma_dict))
        lemma_list = sorted(lemma_list, key = lambda x: len(x[1]), reverse=True)
        for lemma in lemma_list:
            print lemma

In [None]:
# CELL 110
# Create texts_bigrams

# Our final text is actually a combination of the original unigrams we decided to keep and their bigrams.
if(not feature['texts_final_load']):
    texts_bigrams = [[" ".join(bigram) for bigram in bigrams(text)] for text in texts]
    texts_bigrams = [[bigram for bigram in text if bigram not in stopset_bigrams] for text in texts_bigrams]
    # texts_bigrams
    
# Look at top 100 Bigrams (from our unigram text)
# bcf = BigramCollocationFinder.from_words(itertools.chain.from_iterable(texts))
# bcf.nbest(BigramAssocMeasures.likelihood_ratio, 100)

In [None]:
# CELL 150
# examining the most frequent words, to see at what point trash is introduced

def frequent_tokens(texts,topn):
    """
    Prints frequency counts of tokens
    
    Parameters
    ----------
    texts : tokenized, list of lists
        the text to analyze
    topn : int
        print top n frequency counts
    """
    fdist = nltk.FreqDist(list(itertools.chain.from_iterable(texts)))
    for word,count in fdist.items()[:topn]:
        print count, word

In [None]:
# CELL 151
# topn most frequent unigrams
if(not feature['texts_final_load']):
    frequent_tokens(texts,50)

In [None]:
# CELL 152
# topn most frequent bigrams
if(not feature['texts_final_load']):
    frequent_tokens(texts_bigrams, 50)

In [None]:
# CELL 160
# Load texts_final
if(feature['texts_final_load']):
    with open(data_dir + '/' + file_root + '-texts_final' + '.pickle', 'rb') as f:
        texts_final = pickle.load(f)

In [None]:
# CELL 170
# Stitch unigrams and bigrams together in a new text texts_combined

if(not feature['texts_final_load']):
    texts_combined = []
    for index,text in enumerate(texts):
        texts_combined.append(text+texts_bigrams[index])
    
    print len(texts_combined)

In [None]:
# CELL 175
# Choose which text we will be processing with our model:
#    unigrams = texts
#    bigrams  = texts_bigrams
#    unigrams + bigrams = texts_combined

if(not feature['texts_final_load']):
    # texts_final = texts
    # texts_final = texts_bigrams
    texts_final = texts_combined

In [None]:
# CELL 178
# Save texts_final

if(feature['texts_final_save']):
    with open(data_dir + '/' + file_root + '-texts_final' + '.pickle', 'wb') as f:
        pickle.dump(texts_final, f)

## Consideration of various models

Various models were considered and experimented with using a variety of tools.  In Python, [LSI](http://en.wikipedia.org/wiki/Latent_semantic_indexing), [LDA](http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) and [HDP](http://en.wikipedia.org/wiki/Hierarchical_Dirichlet_process) were all tried.  Ultimately we chose to work with the [gensim implementation of LDA](http://radimrehurek.com/gensim/models/ldamodel.html).  There were a number of reasons gensim was chosen:

- Gensim is a mature project that has been vetted for over 5 years
- Gensim is built specifically around the creation of topic models
- Gensim runs on Python, a language that can deal with large amounts of data
- Gensim supports a distributed architecture, allowing us to parallelize its operations 

## LDA Workflow

LDA operates over a bag of words (bow).  We first create a dictionary from our final processed text, and then pass the text through the dictionary to produce the bow.  LDA then takes the dictionary and the bow and builds the model.  The end result is that topics are assigned to each document.

![Alt text](files/img_bf_classification.png)
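
To make the dictionary and bow steps concrete, here is a small plain-Python sketch of the two data structures.  It is not gensim's implementation: the helper names `build_token2id` and `to_bow` and the sample documents are hypothetical, and gensim may assign ids in a different order.  The shape of the result, a list of `(token_id, count)` pairs per document, is the same as what `dictionary.doc2bow()` produces in `CELL 210`.

```python
from collections import Counter

def build_token2id(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(text, token2id):
    """Convert one tokenized document into sorted (token_id, count) pairs.

    Tokens that are missing from the dictionary are dropped.
    """
    counts = Counter(token2id[token] for token in text if token in token2id)
    return sorted(counts.items())

# hypothetical processed reviews (unigrams plus one bigram)
texts = [["teacher", "student", "teacher"], ["student", "music program"]]
token2id = build_token2id(texts)
corpus_bow = [to_bow(text, token2id) for text in texts]
print(corpus_bow)
```

A word that never made it into the dictionary simply disappears from the bow, which is why filtering the dictionary also filters the corpus built from it.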

## Dictionary and Corpus Creation

Our dictionary was created from the output of our text processing.  After the dictionary was created, we removed infrequent and frequent words using the `dictionary.filter_extremes()` method: we removed all words that did not appear in at least 5 documents, and all words that appeared in more than 60% of the documents.  There are various ways in which you can represent a corpus.  LDA relies on a bow representation.  We experimented with using a TF-IDF transform, but doing so is not consistent with the model of LDA.
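
The filtering rule can be illustrated without gensim.  The sketch below is an assumption-laden illustration of the rule as described (keep a token when it appears in at least `no_below` documents and in no more than a fraction `no_above` of them); the function name `filter_by_document_frequency` and the sample documents are hypothetical, and gensim's own method additionally supports `keep_n` and re-assigns ids.

```python
from collections import Counter

def filter_by_document_frequency(texts, no_below=5, no_above=0.6):
    """Return the set of tokens that survive document-frequency filtering."""
    num_docs = len(texts)
    # count each token once per document it appears in
    doc_freq = Counter(token for text in texts for token in set(text))
    return set(token for token, df in doc_freq.items()
               if no_below <= df <= no_above * num_docs)

# hypothetical corpus of 5 tiny documents
texts = [["school", "teacher"], ["school", "teacher"], ["school", "music"],
         ["school", "office"], ["school", "library"]]
kept = filter_by_document_frequency(texts, no_below=2, no_above=0.6)
print(kept)
```

In this toy corpus *school* is dropped for being in every document, the words seen only once are dropped for being too rare, and only *teacher* remains.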

## Bug discovery and fix in Gensim

Gensim is a project that has existed for over 5 years, and its code is solid and trusted.  While running our grid searches, however, we discovered a previously unknown bug.  We identified the root cause and recommended a solution to the author, who implemented it in the [pyro threads](https://github.com/piskvorky/gensim/tree/pyro_threads) branch.  The bug was related to how gensim was allocating threads: it was not releasing the threads when it was finished with the model.  Creating new models repeatedly, for example while iterating through a grid search, would therefore exhaust the available threads.  The design pattern used by gensim was quick and dirty and did not stick to the traditional producer/consumer pattern that is common to RMI models.  It had a context in which it blocked continuously until new data arrived.  This was modified so that it would wait for a signal that there was new data, sleeping and regularly checking for it, which allowed it to receive further instructions as needed.
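
The difference between blocking forever and waiting with a regular check can be sketched with the standard library.  This is not the gensim patch; it is a generic producer/consumer illustration, and the names `worker_loop`, `jobs`, `stop_event` and `poll_seconds` are hypothetical.

```python
import threading
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

def worker_loop(jobs, stop_event, results, poll_seconds=0.05):
    """Process jobs until stop_event is set, never blocking indefinitely."""
    while not stop_event.is_set():
        try:
            # wait briefly for new data instead of blocking forever
            job = jobs.get(timeout=poll_seconds)
        except queue.Empty:
            continue  # nothing arrived; loop back and re-check stop_event
        results.append(job * 2)  # stand-in for real work
        jobs.task_done()

jobs = queue.Queue()
stop_event = threading.Event()
results = []
thread = threading.Thread(target=worker_loop, args=(jobs, stop_event, results))
thread.start()

for n in (1, 2, 3):
    jobs.put(n)
jobs.join()        # block until every queued job has been processed
stop_event.set()   # ask the worker to exit
thread.join()      # the thread ends instead of staying blocked
print(results)
```

Because the worker wakes up regularly, it notices the stop request and exits, so repeatedly creating and discarding workers does not leave blocked threads behind.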


In [None]:
# CELL 180
# create a dictionary of words and filter it
# essentially making an N-D vector representation

if(not feature['dict_corpus_load']):

    # assigns integer id's to each word along with word counts and other statistics
    dictionary = corpora.Dictionary(texts_final)

    # filter dictionary, still need to find optimal settings default is no filtering
    # Currently trying this at no_below=5, no_above=0.6, keep_n=None
    # filters
    # 1. less than no_below documents (absolute number) or
    # 2. more than no_above documents (fraction of total corpus size, not absolute number).
    # after (1) and (2), keep only the first keep_n most frequent tokens (or keep all if None).
    # After the pruning, shrink resulting gaps in word ids. (this means word ids may change after gap shrinking)
    dictionary.filter_extremes(no_below=5, no_above=0.6, keep_n=None)
    # print dictionary

In [None]:
# CELL 182
# save dictionary to disk for later use

if(feature['dict_corpus_save']):
    dictionary.save(data_dir + '/' + file_root + '.dict') 

In [None]:
# CELL 190
# look at mappings between id's and words

# print dictionary.token2id

In [None]:
# CELL 200
# test our dictionary

# we take the string to lower() and split on whitespace.  If this example had more complex things like punctuation, we would
# need to handle it like I did earlier with the full text reviews......this is just a simple example.
# the words "has" and "and" are ignored because these are stopwords, and were filtered and never added to the dictionary 
# in the first place.  The word "sauropod" does not appear in the dictionary because it was not in any review we ingested.
# id mappings below may be different from actual at this time
# 51 is "school", 57 is "students", 61 is "teachers"
# The ,1 after each id is the word count, which in this case is 2 for students and 1 for others 
# Note: now that I do dictionary filtering, the above words are filtered out, so the example is not the same as
# what you will see below
# new_doc = "military organizes disgrace and sauropods"
# new_vec = dictionary.doc2bow(new_doc.lower().split())
# print new_vec 

In [None]:
# CELL 205
# load a previously saved dictionary and corpus

if(feature['dict_corpus_load']):

    dictionary = corpora.Dictionary.load(data_dir + '/' + file_root + '.dict')

    # do not re-filter if the dictionary you are loading was already previously filtered before it was saved 
    # dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)
    # print dictionary

    corpus_bow = corpora.MmCorpus(data_dir + '/' + file_root + '.mm')
    # print corpus_bow

In [None]:
# CELL 210
# our text converted to a bag of words based on our dictionary
if(not feature['dict_corpus_load']):
    corpus_bow = [dictionary.doc2bow(text) for text in texts_final]

In [None]:
# CELL 212
# save corpus to disk for later use
if(feature['dict_corpus_save']):
    corpora.MmCorpus.serialize(data_dir + '/' + file_root + '.mm', corpus_bow) 
    # print corpus_bow

In [None]:
# CELL 230
# How many times does a particular word appear in the corpus?

# Look up the id of the token 
# token_id = dictionary.token2id['military']

# Now get the count for that id
# total_sum = sum(dict(doc).get(token_id, 0) for doc in corpus_bow)
# print total_sum

In [None]:
# CELL 240

# create a TF-IDF model of our corpus
# this model can now convert data that was represented as "bag of words" to the new TF-IDF representation
# From the documentation "Expects a bag-of-words (integer values) training corpus during initialization. During transformation, 
# it will take a vector and return another vector of the same dimensionality, except that features which were rare in the training 
# corpus will have their value increased. It therefore converts integer-valued vectors into real-valued ones, while leaving the 
# number of dimensions intact. It can also optionally normalize the resulting vectors to (Euclidean) unit length."
# Note, I did not normalize here but I could.
if(feature['tfidf']):
    model_tfidf = models.TfidfModel(corpus_bow)

In [None]:
# CELL 250
# view a bow in our tfidf corpus
# for example I take review 1
# doc_bow = [(0, 1), (1, 1), (2, 1), (3, 1), (4, 2), (5, 1), (6, 1), (7, 1), (8, 1), (9, 2), (10, 1), (11, 1), (12, 4), 
#           (13, 1), (14, 1), (15, 2), (16, 1), (17, 1), (18, 1), (19, 1), (20, 2), (21, 1), (22, 1), (23, 1), (24, 2), 
#           (25, 1), (26, 1), (27, 2), (28, 2), (29, 1), (30, 2), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), 
#           (37, 1), (38, 1), (39, 2), (40, 1), (41, 1), (42, 2), (43, 1), (44, 1), (45, 2), (46, 1), (47, 1), (48, 1), 
#           (49, 1), (50, 1), (51, 3), (52, 1), (53, 1), (54, 1), (55, 2), (56, 1), (57, 3), (58, 1), (59, 2), (60, 1), 
#           (61, 3), (62, 1), (63, 1), (64, 1), (65, 2), (66, 1), (67, 1), (68, 2)]
# print model_tfidf[doc_bow]

In [None]:
# CELL 260
# transform the entire corpus to TF-IDF
if(feature['tfidf']):
    corpus_tfidf = model_tfidf[corpus_bow]

# This can take a while to print (15m or so)
# for doc in corpus_tfidf:
#    print doc

In [None]:
# CELL 265
# choose a corpus
if(feature['tfidf']):
    corpus = corpus_tfidf
else:
    corpus = corpus_bow

In [None]:
# CELL 268
# perplexity grid search
# we will take our documents, divide them into 80% train and 20% test.  We will measure the test data against the model built with 
# train  and calculate a perplexity measurement.  The goal will be to find the parameters which minimize perplexity.
# Intuition gathered from posts on the gensim mailing list including https://groups.google.com/forum/#!topic/gensim/tsGNoDkMY7U

if(feature['perplexity_search']):
    
    grid = defaultdict(list)

    # Choose a parameter you are wanting to search, for example num_topics or alpha / eta, make sure you substitute "parameter_value"
    # into the model below instead of a static value.
    #
    # num topics
    # parameter_list=[10, 25, 50, 75, 100]
    
    # alpha / eta
    parameter_list=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 1.5]
    
    # shuffle corpus
    cp = list(corpus)
    random.shuffle(cp)

    # split into 80% training and 20% test sets
    p = int(len(cp) * .8)
    cp_train = cp[0:p]
    cp_test = cp[p:]

    # for num_topics_value in num_topics_list:
    for parameter_value in parameter_list:

        # print "starting pass for num_topic = %d" % num_topics_value
        print "starting pass for parameter_value = %.3f" % parameter_value
        start_time = time.time()

        # run model
        model = models.ldamodel.LdaModel(corpus=cp_train, id2word=dictionary, num_topics=40, chunksize=2000, 
                                        passes=50, update_every=0, alpha=parameter_value, eta=parameter_value, decay=0.5,
                                        distributed=True)
    
        # show elapsed time for model
        elapsed = time.time() - start_time
        print "Elapsed time: %s" % elapsed
    
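        # model.bound() returns the variational bound for the test documents, which is not yet a perplexity
        perplex = model.bound(cp_test)
        print "Variational bound: %s" % perplex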
        grid[parameter_value].append(perplex)
    
        per_word_perplex = np.exp2(-perplex / sum(cnt for document in cp_test for _, cnt in document))
        print "Per-word Perplexity: %s" % per_word_perplex
        grid[parameter_value].append(per_word_perplex)

In [None]:
# CELL 269
# View the results of our grid search
if(feature['perplexity_search']):
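    for parameter_value in sorted(grid):
        # each entry holds [value returned by model.bound(), per-word perplexity] for that parameter value
        print parameter_value, grid[parameter_value]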

## LDA model building

The LDA model was built using 75 topics.  A grid search was run across multiple topic counts, and a number of topics with low perplexity was chosen.  The team members further validated this choice by looking at the results of models built with various numbers of topics and choosing the one that looked best.  Grid searches were also conducted for the hyperparameters `alpha` and `eta`, once again measuring perplexity.  It was found that the default parameters were reasonable.  The default `alpha` and `eta` are `1/num_topics`.
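
The per-word perplexity used in the grid search of `CELL 268` is the bound returned by `model.bound()` divided by the number of words in the held-out documents, negated and used as a base-2 exponent.  The sketch below restates that arithmetic in plain Python; the function name `per_word_perplexity`, the held-out documents and the bound value are hypothetical and chosen only so the numbers are easy to follow.

```python
def per_word_perplexity(bound, corpus_bow):
    """Convert a total bound over a corpus into a per-word perplexity (base 2)."""
    # total number of word occurrences in the held-out documents
    total_words = sum(count for document in corpus_bow for _, count in document)
    return 2.0 ** (-bound / total_words)

held_out = [[(0, 2), (1, 1)], [(1, 3), (2, 2)]]   # 8 words in total
bound = -64.0                                      # hypothetical value
print(per_word_perplexity(bound, held_out))
```

Lower values are better, so the grid search looks for the parameter setting that minimizes this number on the 20% test split.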

## Gensim’s distributed architecture

Gensim uses the [Pyro4](http://pythonhosted.org/Pyro4/) framework to allow for distributed parallel processing of LDA jobs.  The corpus is divided into chunks, and each worker is assigned a chunk.  Since we were working with 32 cores, we used 32 workers over which the entire corpus was divided.  Gensim handled all of the sharding and merging of the data.
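
Gensim does the chunking internally, so nothing like the following appears in the notebook.  As a rough illustration of the relationship "chunk size is about number of documents divided by number of workers" in batch mode, here is a plain-Python sketch; `split_into_chunks` is a hypothetical helper, not a gensim function.

```python
def split_into_chunks(documents, num_workers):
    """Split documents into at most num_workers contiguous chunks of near-equal size."""
    # ceiling division so that every document lands in some chunk
    chunksize = max(1, -(-len(documents) // num_workers))
    return [documents[i:i + chunksize] for i in range(0, len(documents), chunksize)]

documents = list(range(10))   # stand-ins for 10 bow documents
chunks = split_into_chunks(documents, 4)
print(chunks)
```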


In [None]:
%%time
# CELL 270
# Choose a model

# took 24m 57s to do LDA num_topics=40, chunksize=500, passes=100, update_every=0, distributed=True 
# on my 4 core 2.66Ghz mac pro, on 37349 reviews using 8 virtual cores (8 workers)

# If you don't have gensim setup for distributed set distributed=False
# update_every is set to 0 as we are running in "batch" mode vs. online
# chunksize is ideally = num documents / num workers     (when running in batch mode)
# alpha and eta default to 1/num_topics if not set (None)
if(not feature['model_load']):
    model = models.ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=num_topics, chunksize=100, 
                                  passes=1, update_every=0, alpha=None, eta=None, decay=0.5,
                                  distributed=True)

In [None]:
# CELL 289
# Load model

if(feature['model_load']):
    model = models.ldamodel.LdaModel.load(data_dir + '/' + file_root + '-model-t' + str(num_topics) + '.lda')

In [None]:
# CELL 290
# Save model

if(feature['model_save']):
    model.save(data_dir + '/' + file_root + '-model-t' + str(num_topics) + '.lda') 

In [None]:
# CELL 292
# Build corpus model

# create a double wrapper over the original corpus: bow->fold-in-lda, bow->tfidf->fold-in-lda, etc.
# this is so you can view how documents are classified by the model
if(not feature['corpus_model_load']):
    corpus_model = model[corpus]

In [None]:
# CELL 293
# Load corpus model

if(feature['corpus_model_load']):
    with open(data_dir + '/' + file_root + '-corpus-model-t' + str(num_topics) + '.pickle', 'rb') as f:
        corpus_model = pickle.load(f)

In [None]:
# CELL 295
# View corpus model

# you can leave this commented unless you wish to view the contents of corpus_model in its entirety
# I don't recommend it, because the transforms are done on the fly and it takes about 15min
# Here you can see each review, and how it relates to each of the n topics.  Each list is a review (document) and each item 
# in the list is a tuple of (topic number,score)
# the actual transformations, for example bow->lda are actually executed here, on the fly
# for doc in corpus_model: 
#    print doc

In [None]:
# CELL 296
# Save corpus model

if(feature['corpus_model_save']):
    with open(data_dir + '/' + file_root + '-corpus-model-t' + str(num_topics) + '.pickle', 'wb') as f:
        pickle.dump(corpus_model, f)

In [None]:
# CELL 300
# load topics
if(feature['model_topics_load']):
    with open(data_dir + '/' + file_root + '-topics-t' + str(num_topics) + '.pickle', 'rb') as f:
        model_topics = pickle.load(f)
    for num,topic in enumerate(model_topics[0]):
        print num,topic

In [None]:
# CELL 301
# print topics
model_topics = model.print_topics(num_topics)
model_topics

In [None]:
# CELL 302
# save topics
if(feature['model_topics_save']):
    with open(data_dir + '/' + file_root + '-topics-t' + str(num_topics) + '.pickle', 'wb') as f:
        pickle.dump(model_topics, f)

In [None]:
# CELL 310
# These functions copied from https://groups.google.com/forum/#!searchin/gensim/visualize/gensim/SxFKsSsBTRs/cN6p3XaH4rUJ
# They extract the gamma, beta and log probabilities 

def get_gamma(lda, corpus):
    """
    Return gamma from a gensim LdaModel instance.
    
    Parameters
    ----------
    lda : LdaModel
        A fitted model.
    corpus : gensim Corpus
        An iterable Bag-of-Words Corpus used to fit the LDA model

    Returns
    -------
    gamma : ndarray
        An ndarray that contains gamma.
    """
    # lda.VAR_MAXITER = 'est' 
    chunksize = lda.chunksize
    chunker = itertools.groupby(enumerate(corpus),
               key=lambda (docno, doc): docno/chunksize)
    all_gamma = []
    for chunk_no, (key, group) in enumerate(chunker):
        chunk = np.asarray([np.asarray(doc) for _, doc in group])
        (gamma, sstats) = lda.inference(chunk)
        all_gamma.append(gamma)
    return np.vstack(all_gamma)

def blei_gamma(fname, gamma):
    """
    Writes the gamma file in Blei's format.

    Parameters
    ----------
    fname : str
        Path to file to save to
    gamma : ndarray
        Numpy array returned from get_gamma
    """
    np.savetxt(fname, gamma, fmt='%5.10f')

def blei_beta(fname, lda):
    """
    Write log probabilities to a space-delimited file as lda-c final.beta

    Parameters
    ----------
    fname : str
        Filename
    lda : LdaModel
        LdaModel instance
    """
    expElogbeta = np.log(lda.expElogbeta)
    np.savetxt(fname, expElogbeta, fmt='%5.10f')
    
if(feature['beta_gamma_save']):
    blei_beta(data_dir + '/' + file_root + '-t' + str(num_topics) + '.beta', model)
    gamma = get_gamma(model, corpus)
    blei_gamma(data_dir + '/' + file_root + '-t' + str(num_topics) + '.gamma', gamma)

In [None]:
# CELL 1000
# Exploratory things you can try

# set a review number to examine
review_num=0

def get_topic_data(review_num):
    """
    Produces information about topics associated to a particular review
    
    Parameters
    ----------
    review_num : int
        a valid review number 
    """
    
    # Look at the topics associated to this review in the corpus_model
    print "\nTopics for review number %d" % review_num
    topics = model[corpus[review_num]]
    print topics
    print "\nTopic details for topics in review number %d" % review_num
    topic_info = [model.print_topic(topic[0],topn=10) for topic in topics]
    for topic in topic_info:
        print topic

# Look at this review in the corpus (bag of words, tf-idf, etc)
# print corpus[review_num]

# get data about the document as it moved through the pipeline
if(not feature['texts_final_load']):
    get_review_data(review_num)
get_topic_data(review_num)

## Generating topic information for each review to be written back to the database 

The final result we needed was a list of all topics and their percentages for each review.  We built a CSV file of this information for easy import back into the database.  Each row contained the school's GSID, NCES Code and Universal ID, followed by n fields, one field per topic number.  So when building a model with 75 topics, as our final model was, we had 75 topic value fields.  Each field holds the percentage of the review that is represented by that topic.  A review could correspond to multiple topics.  Any topic score that was less than 1% was just kept at 0.  Later these values would all be normalized before being stored in the database.
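
The row layout can be sketched as follows.  The expansion from sparse `(topic_num, score)` pairs to one field per topic follows the description above; the normalization step is an assumption (dividing by the row sum), since the notebook does not show how the database normalizes the values.  The function name `topic_row` and the sample scores are hypothetical.

```python
def topic_row(topics, num_topics, min_score=0.01):
    """Expand sparse (topic_num, score) pairs into a dense, normalized row."""
    row = [0.0] * num_topics
    for topic_num, score in topics:
        # scores below the 1% threshold stay at 0
        if score >= min_score:
            row[topic_num] = score
    total = sum(row)
    if total > 0:
        # assumed normalization: rescale so the kept scores sum to 1
        row = [score / total for score in row]
    return row

topics = [(0, 0.375), (3, 0.125), (4, 0.005)]   # hypothetical model output
row = topic_row(topics, num_topics=5)
print(row)
```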


In [None]:
%%time
# CELL 1100 
# Create CSV file to place topic numbers into the database for each review

with open(data_dir + '/' + file_root + '-review_topics-t' + str(num_topics) + '.csv', 'wb') as csvfile:
    topicwriter = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    
    for row in xrange(len(reviews)):
        # we store gsid, nces_code and universal_id in topic_array_1 which is str (some nces_codes are alphanum)
        topic_array_1 = np.empty(3, dtype="S10")
        # we store each topic percentage in topic_array_2 which is floats
        topic_array_2 = np.zeros(num_topics, float)
        
        if(row % 1000 == 0):
            print "Processing row %d" % row
        gsid, nces_code, universal_id, postdate = reviews_indexes.irow(row)
        # some gsid's are None because the school had no reviews
        if(gsid is None):
            gsid = 999999999
        topic_array_1[0], topic_array_1[1], topic_array_1[2] = gsid, nces_code, universal_id

        # we get the topics from the model
        topics = model[corpus[row]]
    
        # we update the topic_array_2 with the percentage values for each topic
        for topic in topics:
            topic_num,topic_score = topic
            topic_array_2[topic_num] = topic_score
         
        # we combine both arrays into a single writerow
        topicwriter.writerow(list(topic_array_1) + list(topic_array_2))
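
The normalization step mentioned above is not shown in this notebook. The following is a minimal sketch of one way it could work, assuming "normalized" means rescaling each review's topic values so they sum to 1. The helper names `dense_topic_row` and `normalize_row` are hypothetical, and the sketch is written for Python 3, unlike the Python 2 cells in this notebook.

```python
def dense_topic_row(sparse_topics, num_topics):
    """Expand (topic_num, score) pairs into a fixed-length list; missing topics stay 0."""
    row = [0.0] * num_topics
    for topic_num, topic_score in sparse_topics:
        row[topic_num] = topic_score
    return row


def normalize_row(row):
    """Rescale a row so its values sum to 1 (assumed meaning of 'normalized')."""
    total = sum(row)
    if total == 0:
        # no topic was returned for this review, so leave the row as all zeros
        return list(row)
    return [value / total for value in row]


# toy example with 5 topics instead of 75
sparse = [(0, 0.6), (3, 0.2)]
row = dense_topic_row(sparse, 5)
normalized = normalize_row(row)
print(normalized)
```

Because topics below the threshold are dropped, the raw values for a review generally sum to less than 1, which is why some rescaling is needed if the stored values are meant to be shares of the whole.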

## Misc Cells for Word Frequency Counts
The cells below are used in a very ad-hoc way to produce word counts for different slices of the data and export them in CSV format.  We use a design similar to our original work earlier in the notebook, but it's different enough that we could not re-use that code as is.  With more time, we could factor much of the code below into more generalized functions.  These cells are not even controlled by the feature dict; they are ad-hoc cells to be run manually.

In [None]:
# CELL 1500 
# Word Frequency Counts
# remove encodings, tokenize, remove punctuation, remove words < 3, lowercase, remove stopwords

# This assumes the parallel architecture has been set up in CELLS 75 and 85 and those CELLS have been run, otherwise this will fail.
# The below function freq_clean() is not designed to be run serially.

@dv.remote(block=True)
def freq_clean():
    """
    tokenizes and cleans the text that exists in data_to_clean

    Returns
    ----------
    texts : list of lists
        one list of cleaned tokens per review
    """

    texts = []

    counter = 0

    # we need to strip escape characters so we compile a pattern
    hexchars = re.compile('\\\\x(\w{2})')
        
    print "Starting to process %d documents" % len(data_to_clean)

    for review_text in data_to_clean:
        
        # print result to stdout, note that when using parallel stdout is not sent to the Client
        # counter += 1
        # if ((counter % 1000)==0):
        #    print "%d documents processed" % counter
        # sys.stdout.flush()
                          
        # remove html encodings
        review_text = review_text.replace('&amp;','').replace('&lt;','').replace('&gt;','').replace('&quot;','').replace('&#039;','').replace('&#034;','')
        review_text = re.sub(hexchars, '', review_text.encode("string-escape"))
      
        # tokenize, you don't want to turn this off
        review_text = [word for sent in sent_tokenize(review_text) for word in word_tokenize(sent)]

        # remove punctuation
        review_text = [word.translate(string.maketrans("",""), string.punctuation) for word in review_text]
                    
        # remove smallwords
        review_text = [word for word in review_text if not len(word) < 3]
        
        # lowercase all text
        review_text = [word.lower() for word in review_text]
        
        # remove stopwords
        review_text = [word for word in review_text if not word in stopset_to_use]
        
        texts.append(review_text)
    return texts
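
Since `freq_clean()` only runs on the remote workers, it can be useful to try the same steps serially on a single review. The following is a rough Python 3 sketch of those steps, not the notebook's implementation. It swaps the NLTK tokenizers for a plain whitespace split, skips the hex-escape stripping, and uses a tiny hypothetical stopword set (`STOPSET`). The name `clean_one` is also hypothetical.

```python
import string

# Hypothetical stand-ins: a tiny stopword set instead of the NLTK stopword list
STOPSET = {"the", "and", "was", "are", "this"}
HTML_ENTITIES = ["&amp;", "&lt;", "&gt;", "&quot;", "&#039;", "&#034;"]
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_one(review_text, stopset=STOPSET):
    """Apply the freq_clean() steps to one review and return its tokens."""
    # remove html encodings
    for entity in HTML_ENTITIES:
        review_text = review_text.replace(entity, "")
    # tokenize on whitespace (assumption: stands in for the NLTK tokenizers)
    tokens = review_text.split()
    # remove punctuation
    tokens = [tok.translate(PUNCT_TABLE) for tok in tokens]
    # remove words shorter than 3 characters
    tokens = [tok for tok in tokens if len(tok) >= 3]
    # lowercase all text
    tokens = [tok.lower() for tok in tokens]
    # remove stopwords
    return [tok for tok in tokens if tok not in stopset]


cleaned = clean_one("The teachers &amp; staff are GREAT, and my son loves it!")
print(cleaned)
```

The order of the steps matters: lowercasing happens before stopword removal, so capitalized stopwords such as "The" are still caught.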

In [None]:
# CELL 1510
# Merge Freq Count data back to one variable

def freq_merge(result):
    """
    Merges parallelized data back into a single structure
    
    Parameters
    ----------
    result : array
        An array containing the values returned from each thread/worker
    
    Returns
    ----------
    texts : list of lists
        a tokenized text
    """
    # Merge the data from the remote workers back into one list
    texts = []
    
    # iterate through each worker and merge the texts back together
    for worker in result:
        texts += worker
    print "Total after merge: %d" % len(texts)
    return texts

In [None]:
# CELL 1520
# get reviews from database

# get reviews for a single state
def get_state(state):
    """
    Gets all reviews from the database for a single state
    
    Parameters
    ----------
    state : string
        abbreviation of state
    
    Returns
    ----------
    reviews : pandas Series
        the raw review text for the state, one entry per review
    """
    cnx = mysql_ops()
    reviews = cnx.get_reviews_state(state)['reviews']
    print "Ingested %d documents from database for state %s" % (len(reviews),state)
    return reviews

# get all reviews
def get_all():
    """
    Gets all reviews from the database
    
    Returns
    ----------
    reviews : pandas DataFrame
        all reviews along with their metadata columns (e.g. stars, reviewer, type)
    """
    cnx = mysql_ops()
    reviews = cnx.get_all_reviews()
    print "Ingested %d documents from database for all states" % len(reviews)
    return reviews

In [None]:
# CELL 1530
# write freq counts to a file

def freq_write(texts,state):
    """
    Write the 25 most frequent words and their counts to a file
    
    Parameters
    ----------
    texts : list of lists
        a tokenized text
    state : string
        name of state or other slice name
    """
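    print "Writing file for state %s" % state
    fdist = nltk.FreqDist(itertools.chain.from_iterable(texts))
    # sort explicitly by descending count so the top 25 do not depend on how items() is ordered
    top_words = sorted(fdist.items(), key=lambda item: item[1], reverse=True)[:25]
    # the with block closes the file even if a write fails
    with open(data_dir + '/' + state + '-freq.txt', 'w') as f:
        for word, count in top_words:
            f.write("%s,%s\n" % (word, count))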
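
The counting step in `freq_write()` can be tried on its own with toy data. The following sketch uses `collections.Counter` from the standard library in place of `nltk.FreqDist`, which is a substitution made for this illustration. The helper name `top_words` is hypothetical, and the code is written for Python 3.

```python
import itertools
from collections import Counter


def top_words(texts, n=25):
    """Return the n most frequent (word, count) pairs across all tokenized texts."""
    # flatten the list of token lists and count every word
    counts = Counter(itertools.chain.from_iterable(texts))
    return counts.most_common(n)


texts = [["great", "teachers", "great"], ["teachers", "care"], ["great", "school"]]
for word, count in top_words(texts, n=2):
    # same "word,count" line format that freq_write() puts in the file
    print("%s,%s" % (word, count))
```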

In [None]:
# CELL 1540
# get all reviews

reviews_all = get_all()

In [None]:
# CELL 1550
# slice the reviews in the various ways we want counts for (uncomment the slices you need)

# stars_0 = reviews_all[reviews_all['stars'] == 99]
# stars_1 = reviews_all[reviews_all['stars'] == 1]
# stars_2 = reviews_all[reviews_all['stars'] == 2]
# stars_3 = reviews_all[reviews_all['stars'] == 3]
# stars_4 = reviews_all[reviews_all['stars'] == 4]
# stars_5 = reviews_all[reviews_all['stars'] == 5]
# reviewer_parent = reviews_all[reviews_all['reviewer'] == 'parent']
# reviewer_student = reviews_all[reviews_all['reviewer'] == 'student']
# reviewer_other = reviews_all[reviews_all['reviewer'] == 'other']
# reviewer_teacher = reviews_all[reviews_all['reviewer'] == 'teacher']
# reviewer_former_student = reviews_all[reviews_all['reviewer'] == 'former student']
# reviewer_empty = reviews_all[reviews_all['reviewer'] == '']
# reviewer_staff = reviews_all[reviews_all['reviewer'] == 'staff']
# reviewer_administrator = reviews_all[reviews_all['reviewer'] == 'administrator']
# reviewer_principal = reviews_all[reviews_all['reviewer'] == 'principal']
# type_public = reviews_all[reviews_all['type'] == 'public']
# type_private = reviews_all[reviews_all['type'] == 'private']
# type_charter = reviews_all[reviews_all['type'] == 'charter']


In [None]:
# CELL 1560
# examine lengths

# print len(stars_0)
# print len(stars_1)
# print len(stars_2)
# print len(stars_3)
# print len(stars_4)
# print len(stars_5)
# print len(reviewer_parent)
# print len(reviewer_student)
# print len(reviewer_other)
# print len(reviewer_teacher)
# print len(reviewer_former_student)
# print len(reviewer_empty)
# print len(reviewer_staff)
# print len(reviewer_administrator)
# print len(reviewer_principal)
# print len(type_public)
# print len(type_private)
# print len(type_charter)

In [None]:
# CELL 1570
# examine the value counts of any type we wish just to make sure we sliced all the values

print reviews_all['type'].value_counts()

In [None]:
%%time 
# CELL 1580
# main pipeline to process freq counts

states = ['AK','AL','AR','AZ','CA','CO','CT','DC','DE','FL',
          'GA','HI','IA','ID','IL','IN','KS','KY','LA','MA',
          'MD','ME','MI','MN','MO','MS','MT','NC','ND','NE',
          'NH','NJ','NM','NV','NY','OH','OK','OR','PA','PR',
          'RI','SC','SD','TN','TX','UT','VA','VT','WA','WI',
          'WV','WY']

# For processing states you can use this as is.  To process other types of slices, comment out the
# for loop and unindent its body, then change the reviews= and state= lines to match the slice you're working with.
# It's very crude as this was done toward the end, but it worked nicely.
for state in states:
    # state = 'type_charter'
    reviews = get_state(state)
    
    print "Processing state %s" % state

    # reviews = type_charter['reviews']
    
    data_to_clean = reviews
    stopset_to_use = set(nltk.corpus.stopwords.words('english'))
    
    # copy data_to_clean and stopset_to_use into the namespace of the remote workers
    # We use IPython's parallel processing "scatter" to shard "reviews" across our cores
    dv.scatter('data_to_clean',data_to_clean)
    
    # We copy the stopword set to every remote worker
    dv['stopset_to_use'] = stopset_to_use
    
    # Show size of entire data being sent for processing
    print "Total data size: %d" % len(data_to_clean)
    
    # Show the shard size of each remote worker
    print "Sharded data size for each worker"
    %px print len(data_to_clean)
    
    # clean the reviews
    result = freq_clean()
    
    # merge results back
    texts = freq_merge(result)
    
    # write file 
    freq_write(texts,state)
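
The scatter, clean and merge pattern used in the loop above can be illustrated without a cluster. The sketch below only approximates the idea of sharding a list into contiguous chunks and merging the per-worker results back in order. It is not IPython's partitioning code, the helpers `shard` and `merge` are hypothetical, and the "cleaning" is a simple `split()` stand-in. It is written for Python 3.

```python
def shard(items, num_workers):
    """Split items into num_workers contiguous, near-equal chunks."""
    base, extra = divmod(len(items), num_workers)
    chunks, start = [], 0
    for worker in range(num_workers):
        # the first `extra` workers take one additional item each
        size = base + (1 if worker < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def merge(results):
    """Concatenate the per-worker lists back into one list, as freq_merge() does."""
    merged = []
    for worker_result in results:
        merged += worker_result
    return merged


reviews = ["review %d" % i for i in range(10)]
chunks = shard(reviews, 4)
print([len(chunk) for chunk in chunks])

# each "worker" tokenizes its own shard; here this simply runs serially
results = [[text.split() for text in chunk] for chunk in chunks]
texts = merge(results)
print(len(texts))
```

Because the chunks are contiguous and merged in worker order, the merged list keeps the original review order.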

## Model selection

We selected the model based on out-of-sample perplexity, which we calculated with code from the <a href="http://nlp.stanford.edu/downloads/tmt/tmt-0.4/">Stanford topic modeling toolkit</a>.  We used samples of 50,000 and 80,000 reviews, and compared runs of 500 and 1,500 iterations.

Surprisingly, the number of reviews did not matter very much: perplexity is similar for 50,000 and 80,000 reviews.  The elbow in the perplexity curve appears at around 50-80 topics.  The number of iterations also made little difference once the number of topics was large.

<img src="files/img_bf-notebook-perplex.png">
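
For reference, perplexity on held-out data is commonly defined as the exponential of the negative average per-token log-likelihood, so lower is better. The sketch below shows that definition only. It is not the Stanford toolkit's code, and it assumes the per-document log-likelihoods have already been computed by a model. The function name `perplexity` and the toy numbers are hypothetical, and the code is written for Python 3.

```python
import math


def perplexity(doc_log_likelihoods, doc_token_counts):
    """exp of the negative average per-token log-likelihood over held-out documents."""
    total_log_likelihood = sum(doc_log_likelihoods)
    total_tokens = sum(doc_token_counts)
    return math.exp(-total_log_likelihood / total_tokens)


# toy held-out set: a model that gives every token probability 1/100
token_counts = [10, 20, 30]
log_likelihoods = [n * math.log(1.0 / 100) for n in token_counts]
result = perplexity(log_likelihoods, token_counts)
print(result)
```

A model that spreads probability uniformly over 100 words has a perplexity of about 100, which gives a feel for the scale: a better model behaves as if it were choosing among fewer equally likely words.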

## Topic naming

To name the topics, we reviewed the 10 most frequent words associated with each topic, as well as the 10 highest-scoring reviews for that topic, i.e., the reviews most likely to fall into it.  Using blind review by two team members, followed by verification by all team members, we classified the 75 topics into 8 categories and 30 subcategories.  At least 13 hours were spent on this task to ensure that the names captured the topics estimated by LDA as accurately as possible.
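
The two lists used for naming can be pulled out with a few lines of code. The following is a sketch under assumptions, not the code we ran. It assumes each topic is available as a word-to-weight mapping and each review as a list of topic scores, and it ranks words by weight. The helper names and toy data are hypothetical, and the code is written for Python 3.

```python
import heapq


def top_words_for_topic(topic_word_weights, n=10):
    """Return the n highest-weighted words for one topic (dict of word -> weight)."""
    return heapq.nlargest(n, topic_word_weights, key=topic_word_weights.get)


def top_reviews_for_topic(doc_topic_rows, topic_num, n=10):
    """Return the indexes of the n reviews scoring highest on topic_num."""
    return heapq.nlargest(
        n, range(len(doc_topic_rows)), key=lambda i: doc_topic_rows[i][topic_num]
    )


# toy data: one topic with four words, three reviews scored on two topics
topic_0 = {"teacher": 0.08, "homework": 0.05, "bus": 0.01, "math": 0.06}
rows = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]

words = top_words_for_topic(topic_0, n=2)
review_ids = top_reviews_for_topic(rows, 0, n=2)
print(words)
print(review_ids)
```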

Future work in this area could include:

- Use <a href="https://www.mturk.com/mturk/">Amazon Mechanical Turk</a> to verify naming

- Use tree-related text analysis algorithms