# Visualize Term Frequency Distributions

This notebook demonstrates how to visualize term frequency distributions using the Rosette API's `/morphology/lemmas` endpoint.  The code is available on [GitHub](https://github.com/zyocum/compare-vocabulary/blob/master/README.md).

## Setup

The first section imports the [Rosette API Python binding module](https://github.com/rosette-api/python).  We also import some helper functions from `visualize.py` and `compare_vocabulary.py`.  Both modules can also be run through their command-line drivers if you prefer (run `./visualize.py -h` and `./compare_vocabulary.py -h` for usage instructions).  Finally, we import some helpers for rendering inline HTML within the notebook.

In [2]:
import warnings

with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=DeprecationWarning)
    warnings.simplefilter("ignore", category=ImportWarning)

    import os
    from getpass import getpass  # needed to prompt for the API key when instantiating the API

    from visualize import visualize, color_key
    from compare_vocabulary import fdist
    from rosette.api import API
    from IPython.display import display, HTML

## Instantiating a Rosette API Instance

The next step is to initialize the Rosette API so that we can make API calls.  This requires a Rosette API key.  If you don't have one yet, you can sign up at [https://developer.rosette.com](https://developer.rosette.com).  After instantiating a `rosette.api.API` instance, we set the `output` URL parameter to `rosette` so that responses come back in Rosette's Annotated Data Model (ADM), whose detailed morphological analyses include the part-of-speech annotations we need.

In [3]:
api = API(
    user_key=(
        os.environ.get('ROSETTE_USER_KEY') or # load key from environment variable if possible
        getpass(prompt='Enter your Rosette API key: ') # fall back to prompting user for key
    ),
    service_url='https://api.rosette.com/rest/v1/'
)
api.setUrlParameter('output', 'rosette')

## Decide which Part-of-Speech (POS) Tags to Include

The next step is to determine which part-of-speech tags we are interested in comparing.  The less interesting tags have been commented out below, but you can experiment with different tags based on your interests.

In [4]:
POS_TAGS = {
    'ADJ',   # Adjective
    #'ADP',   # Adposition
    'ADV',   # Adverb
    #'AUX',   # Auxiliary verb
    #'CONJ',  # Coordinating conjunction
    #'DET',   # Determiner
    'INTJ',  # Interjection
    'NOUN',  # Noun
    'NUM',   # Numeral
    #'PART',  # Particle
    #'PRON',  # Pronoun
    'PROPN', # Proper noun
    #'PUNCT', # Punctuation
    #'SCONJ', # Subordinating conjunction
    'SYM',   # Symbol
    'VERB',  # Verb
    'X',     # Other
}
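
To make the effect of this set concrete, here is a minimal sketch of filtering by tag.  It assumes each analyzed token has already been reduced to a `(lemma, pos)` pair; the `keep_tagged` helper and the hand-tagged sample pairs are hypothetical and are not part of `compare_vocabulary.py`, which may filter differently.

```python
def keep_tagged(analyses, pos_tags):
    """Return the lemmas whose part-of-speech tag is in pos_tags."""
    return [lemma for lemma, pos in analyses if pos in pos_tags]


# Hypothetical, hand-tagged (lemma, POS) pairs for illustration only
sample = [
    ('the', 'DET'),
    ('raven', 'NOUN'),
    ('quoth', 'VERB'),
    ('nevermore', 'ADV'),
    ('.', 'PUNCT'),
]

# Determiners and punctuation are dropped because their tags are not in the set
kept = keep_tagged(sample, {'NOUN', 'VERB', 'ADV'})
print(kept)
```

Using a set for the tags keeps each membership check constant-time, which matters little here but scales to larger corpora.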

## Load Corpora from `data` Directories

The following block identifies the directories `data/{carroll,frost,poe,shakespeare,whitman,yeats}` as the corpora to analyze.  Each directory contains a small collection of poems by a famous poet.  To analyze your own corpora, add directories of plain-text `.txt` files to the `data` directory and replace the directory names below.  We also pick a value for `n`, the cut-off that limits each frequency distribution to the `n` most frequent terms in the corpus.  Increase `n` for more results, or set `n = None` to analyze the entire vocabulary.

In [5]:
corpora = 'carroll', 'frost', 'poe', 'shakespeare', 'whitman', 'yeats'
n = 100 # visualize frequencies for top n most frequent terms
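
The role of `n` can be illustrated with the standard library's `collections.Counter`.  This is only a sketch of a top-n cut-off: the `top_n` helper and the sample terms are hypothetical, and `fdist` may compute its distribution differently.

```python
from collections import Counter


def top_n(terms, n=None):
    """Count terms and keep the n most frequent; n=None keeps them all."""
    return Counter(terms).most_common(n)


# Hypothetical lemmas standing in for a corpus
terms = ['rose', 'thorn', 'rose', 'night', 'rose', 'night']

top_two = top_n(terms, 2)   # only the two most frequent terms
everything = top_n(terms)   # n=None: the whole vocabulary, most frequent first
print(top_two)
print(everything)
```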

## Display Color Key

To help interpret the color-coded part-of-speech tags for each term, a color key is rendered below.

In [6]:
display(HTML(color_key()))

| Tag | Name | Color |
| --- | --- | --- |
| ADJ | Adjective | seagreen |
| ADP | Adposition | brown |
| ADV | Adverb | limegreen |
| AUX | Auxiliary verb | blue |
| CONJ | Coordinating conjunction | orangered |
| DET | Determiner | silver |
| INTJ | Interjection | mocha |
| NOUN | Noun | orange |
| NUM | Numeral | skyblue |
| PART | Particle | magenta |

## Visualize the Frequency Distributions

The following code loops over the corpus directories and computes a frequency distribution from the terms that occur in each corpus.  Each term is then rendered in the color of its part of speech and sized in proportion to its frequency.  Hover your mouse over an individual term to see its numerical frequency.

In [7]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore", ResourceWarning)

    for corpus in corpora:
        display(HTML(f'<h1>{corpus}</h1>'))
        fd = fdist(f'data/{corpus}', api, n)
        display(HTML(visualize(fd, pos_tags=POS_TAGS)))
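
For readers curious how such a rendering could work, the following sketch turns a frequency mapping into HTML `<span>` elements whose color depends on the part of speech, whose font size scales with frequency, and whose `title` attribute supplies the hover text.  The `render_terms` function, its pixel bounds, and the sample data are assumptions made for illustration; this is not the implementation in `visualize.py`.

```python
from html import escape


def render_terms(freqs, colors, min_px=12, max_px=48):
    """Render a {(term, pos): count} mapping as HTML spans sized by frequency."""
    if not freqs:
        return ''
    peak = max(freqs.values())
    spans = []
    # Most frequent terms first
    for (term, pos), count in sorted(freqs.items(), key=lambda kv: -kv[1]):
        # Scale linearly so the most frequent term is rendered at max_px
        size = min_px + (max_px - min_px) * count / peak
        color = colors.get(pos, 'black')  # fall back to black for unknown tags
        spans.append(
            f'<span title="{count}" style="color: {color}; font-size: {size:.0f}px">'
            f'{escape(term)}</span>'
        )
    return ' '.join(spans)


# Hypothetical counts; the colors follow the key shown earlier
freqs = {('raven', 'NOUN'): 4, ('dark', 'ADJ'): 2}
markup = render_terms(freqs, {'NOUN': 'orange', 'ADJ': 'seagreen'})
print(markup)
```

In a notebook, the resulting string could be passed to `display(HTML(markup))` in the same way as the output of `visualize` above.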