# Collocation Metrics for Bi- and Trigrams

Collocation is another way of discussing co-occurrence; in natural language processing, the term "collocation" usually refers to phrases of two or more tokens that commonly occur together in a given context. You can use this notebook to understand how common certain bi- and trigrams are in your project. Generally speaking, the more tokens you have in your project, and the larger your project data is, the more meaningful these metrics will be. 

For a brief introduction to the concept of collocation in natural language processing and to some of the metrics used in this notebook, see <a href="https://medium.com/@nicharuch/collocations-identifying-phrases-that-act-like-individual-words-in-nlp-f58a93a2f84a" target="_blank">Collocations</a>. More in-depth explanations can be found in a <a href="https://nlp.stanford.edu/fsnlp/promo/colloc.pdf" target="_blank">NLP textbook chapter on collocations</a> and in Gerlof Bouma's <a href="https://www.semanticscholar.org/paper/Normalized-%28pointwise%29-mutual-information-in-Bouma/15218d9c029cbb903ae7c729b2c644c24994c201?p2df" target="_blank">Normalized (Pointwise) Mutual Information in Collocation Extraction</a>.

This notebook allows you to calculate five different collocation metrics: 1) Likelihood ratio; 2) Mutual information (MI) scores; 3) Pointwise mutual information (PMI) scores; 4) Student's t-test; and 5) Chi-squared test.

<strong>Important:</strong> Collocation metrics are only useful when you can tokenize on bi- and trigrams. Therefore, this notebook assumes your documents include full-text data, and that this data is stored as a string in the `content` field of each document (see the **Settings** cell).

### Technical Note

This notebook uses the NLTK package to build a custom tokenizer to tokenize project uni-, bi-, and trigrams. This tokenizer differs from the one used in the WE1S preprocessing pipeline. See the module's <a href="README.md" target="_blank">README.md</a> file for more information.

### INFO

__author__    = 'Lindsay Thomas'  
__copyright__ = 'copyright 2020, The WE1S Project'  
__license__   = 'MIT'  
__version__   = '2.0'  
__email__     = 'lindsaythomas@miami.edu'

## Settings

In [None]:
# Python imports
import os
import csv
from pathlib import Path
from IPython.display import display, HTML

# Import scripts
%run scripts/count_tokens.py

# Define paths
current_dir     = %pwd
current_pathobj = Path(current_dir)
project_dir     = str(current_pathobj.parent.parent)
project_name    = os.path.basename(project_dir)
current_reldir  = current_dir.split("/write/")[1]
data_dir        = project_dir + '/project_data'
json_dir        = project_dir + '/project_data/json'
content_field   = 'content'
stopword_file   = '/home/jovyan/write/pub/templates/project_template/modules/topic_modeling/scripts/we1s_standard_stoplist.txt'

display(HTML('<p style="color: green;"><strong>Setup complete.</strong></p>'))

## 1. Configure Code

You must run all of the cells in this "Configure Code" section, even if you do not change the values.

### Set Tokenization Length
Configure the `set_length` variable below according to the length of ngram you are analyzing. Since collocations always involve 2 or more words, this section of the notebook only works with bigrams and trigrams. The default is bigrams; to count trigrams, comment out the bigram line, and uncomment the trigram line. 

**Note:** Because this code does not strip hyphens, hyphenated words like "first-generation" are considered unigrams.
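
As a rough illustration of what tokenizing into bigrams and trigrams means, the sketch below splits a sentence on whitespace and pairs up adjacent tokens. It is not the tokenizer this notebook uses (that one is built with NLTK in the module's scripts); the `ngrams` helper and the sample sentence are hypothetical.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Hyphenated words contain no whitespace, so each one stays a single token.
tokens = "first-generation students often work part-time".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

Because "first-generation" and "part-time" are never split, each counts as one unigram and can appear as one member of a bigram or trigram.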

In [None]:
# Choose to analyze bigrams or trigrams
set_length = 'bigram'
# set_length = 'trigram'


if set_length not in ['bigram', 'trigram']:
    display(HTML("<p style=\"color: red;\">The <code>set_length</code> variable must be <code>'bigram'</code> or <code>'trigram'</code>.</p>"))
else:
    msg = 'You have set the <code>set_length</code> variable to <code>' + set_length + '</code>.'
    display(HTML('<p style="color: green;">' + msg + '</p>'))

### Configure Punctuation Setting

This cell strips common punctuation from project documents. It will **NOT** strip hyphens, single or double, in order to account for hyphenated words and phrases such as "first-generation". Because this punctuation list is bespoke and not standardized (standardized options strip hyphens), some punctuation marks or other non-Unicode characters may make it through. You do not need to change anything about the below cell (unless you are interested in the frequency of punctuation marks, or @ signs, etc.), but you do need to run it. If you do not want to remove punctuation from your documents, you should set the `punctuations` variable to an empty string by uncommenting the line that says `punctuations = ''` in the cell below.
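
The sketch below shows one way to strip punctuation while keeping hyphens, using only the Python standard library. The `strip_chars` set and the `strip_punctuation` helper are assumptions made for illustration; how the module's scripts actually apply the `punctuations` variable is not shown in this notebook.

```python
# Hypothetical character set for illustration; the hyphen is deliberately absent.
strip_chars = "!()[]{};:'\",<>./?@#$%^&*_~`"

def strip_punctuation(text, chars=strip_chars):
    """Delete every character listed in `chars` from `text`."""
    return text.translate(str.maketrans('', '', chars))

cleaned = strip_punctuation('The "first-generation" students (2019) arrived.')
print(cleaned)
```

Passing an empty string for `chars` leaves the text unchanged, which mirrors the effect of setting `punctuations = ''`.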

In [None]:
# Define punctuation to strip
punctuations = "_______________________\'m\'n\'ve!()[]{};:\'\"\,<>./?@#$%^&*_~''``''"

# To strip no punctuation, uncomment the line below
# punctuations = ''

if punctuations == '':
    display(HTML('<p style="color: red;">You have elected not to strip any punctuation.</p>'))
else:
    msg = 'You have set the <code>punctuations</code> variable to <code>' + punctuations + '</code>.'
    display(HTML('<p style="color: green;">' + msg + '</p>'))   

### Configure Stop Word Setting

The default setting is to delete stop words from your data using the WE1S standard stoplist. You can view this list in your project's `modules/topic_modeling/scripts` folder. You can edit this file for your project or create a custom stoplist. If you use a custom list, make sure that it is a plain text file with one word per line. Upload the file to your project and configure the `stopword_file` variable in the **Settings** cell to indicate the path to your custom stop word file.

If your data has already had the stop words you want removed or if you do not want to remove stop words, change the value of `set_stopwords` to `False`.

It is generally recommended to delete stop words from a document before obtaining bi- and/or trigram frequencies. This will result in "inexact" bi- and trigrams, however, as any stop words will be deleted *before* tokenization into bi- or trigrams. If you are interested in specific bi- or trigrams that contain stop words, such as "first" in "first generation" (without a hyphen), you may want to create a custom stop word list.
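
To see why removing stop words first produces "inexact" bigrams, consider this small sketch. The stoplist and sentence are hypothetical, and the `bigrams` helper stands in for the notebook's tokenizer.

```python
stopwords = {'the', 'of', 'in'}  # hypothetical stoplist for illustration

def bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))

tokens = 'the future of the humanities in college'.split()
kept = [t for t in tokens if t not in stopwords]

print(bigrams(tokens))  # bigrams as they appear in the text
print(bigrams(kept))    # bigrams formed after stop words are deleted
```

After filtering, "future" and "humanities" form a bigram even though they were never adjacent in the original sentence.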

In [None]:
# Delete stopwords from content fields before obtaining word frequencies.
# If set to True, stop words will be deleted. If set to false, stop words will not be deleted.
set_stopwords = True

if set_stopwords == True:
    display(HTML('<p style="color: green;">You have elected to strip stopwords.</p>'))
else:
    display(HTML('<p style="color: red;">You have elected not to strip stopwords.</p>'))

## 2. Calculate Token Frequencies

The cell below obtains the frequency values you need to calculate all collocation metrics below. To run any cells in **Section 3** of this notebook, you must run this cell.

In [None]:
# Obtain token frequencies
all_finders_freq, all_finders_list, freq, bad_jsons = frequency_dir(json_dir, content_field, set_stopwords, 
                                                                    punctuations, set_length, stopword_file)

if len(bad_jsons) > 0:
        msg = 'Token frequency calculations complete. Warning! ' + str(len(bad_jsons)) + ' documents failed to load and will not be included in the calculation. '
        msg += 'If this number is large, this may significantly affect your results.'
        display(HTML('<p style="color: red;">' + msg + '</p>'))

if all_finders_list != []:
    msg = 'Token frequency calculations complete. Calculate collocation metrics in the next section.'
    display(HTML('<p style="color:green;">' + msg + '</p>'))
if all_finders_list == []:
    display(HTML('<p style="color: red;">No results found.</p>'))

## 3. Calculate Collocation Metrics
This section of the notebook allows you to calculate five different collocation metrics: 1) Likelihood ratio; 2) Mutual information (MI) scores; 3) Pointwise mutual information (PMI) scores; 4) Student's t-test; and 5) Chi-squared test. All cells in this section of the notebook rely on the calculations you performed in section 2. You must run the previous cell before you can run any cells below.

For more information about each of these collocation metrics, see this module's <a href="README.md" target="_blank">README.md</a> file.
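
As one concrete illustration of these metrics, the sketch below computes a Student's t-score for a bigram from raw counts, following the common textbook formulation in which the sample variance is approximated by the observed bigram probability. The counts are invented for illustration, and the `t_score` helper is hypothetical; the implementation used by this notebook's scripts may differ in detail.

```python
import math

def t_score(pair_count, count_w1, count_w2, n_tokens):
    """Compare the observed bigram probability with the one expected under independence."""
    observed = pair_count / n_tokens
    expected = (count_w1 / n_tokens) * (count_w2 / n_tokens)
    # Variance is approximated by the observed probability (an assumption of this sketch).
    return (observed - expected) / math.sqrt(observed / n_tokens)

# Invented counts: two words seen 50 and 40 times in a 1,000-token corpus.
t_frequent_pair = t_score(pair_count=20, count_w1=50, count_w2=40, n_tokens=1000)
t_rare_pair = t_score(pair_count=1, count_w1=50, count_w2=40, n_tokens=1000)

print(t_frequent_pair, t_rare_pair)
```

With these counts, independence would predict about two co-occurrences. The pair seen 20 times therefore gets a positive score, and the pair seen only once gets a negative one.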

### Select Collocation Metric
Select the collocation metric you would like to calculate in the cell below. You may select `'likelihood'`, `'mi'`, `'pmi'`, `'t-test'`, or `'chi-square'`. If you select `'pmi'` or `'mi'`, you should select a value for `freq_filter` below.

In [None]:
# Set to 'likelihood', 'mi', 'pmi', 't-test', or 'chi-square'
metric = ''

if metric not in ['likelihood', 'mi', 'pmi', 't-test', 'chi-square']:
    display(HTML('<p style="color:red;">The <code>metric</code> variable must be set to <code>likelihood</code>, <code>mi</code>, <code>pmi</code>, <code>t-test</code>, <code>chi-square</code>.</p>'))
elif metric in ['mi', 'pmi']:
    display(HTML('<p style="color:green;">The <code>metric</code> variable has been set to <code>' + metric + '</code>.</p>'))
    display(HTML('<p style="color:red;">You should set a <code>freq_filter</code> value in the next cell.</p>'))
else:
    display(HTML('<p style="color:green;">The <code>metric</code> variable has been set to <code>' + metric + '</code>. You do not need to run the next cell.</p>'))
    freq_filter=None

### Set Frequency Filter (MI and PMI Metrics Only)

MI and PMI scores are sensitive to unique words, which can make results less meaningful because often unique words will occur much less frequently throughout a corpus. To account for this, you can set a frequency filter so that you only measure MI or PMI scores for bi- or trigrams that occur a certain number of times. 
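
The effect can be seen in a small pure-Python sketch. It scores bigrams with a simplified PMI (log base 2 of the observed pair probability over the probability expected under independence, with both estimated from the token count) and then applies a minimum-count cutoff. The toy corpus and the `pmi` and `score_bigrams` helpers are hypothetical; how `collocation_metric` applies `freq_filter` is defined in the module's scripts and is not shown here.

```python
import math
from collections import Counter

# Toy corpus: "quantum zebra" occurs once; the other pairs repeat.
tokens = ('liberal arts ' * 5 + 'quantum zebra ' + 'arts college ' * 3).split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n_tokens = len(tokens)

def pmi(pair):
    """Simplified PMI: log2 of observed over expected co-occurrence."""
    w1, w2 = pair
    expected = unigram_counts[w1] * unigram_counts[w2]
    return math.log2(bigram_counts[pair] * n_tokens / expected)

def score_bigrams(min_count=None):
    """Score every bigram seen at least `min_count` times, highest PMI first."""
    pairs = [p for p, c in bigram_counts.items() if min_count is None or c >= min_count]
    return sorted(((p, pmi(p)) for p in pairs), key=lambda item: item[1], reverse=True)

unfiltered = score_bigrams()
filtered = score_bigrams(min_count=3)

print(unfiltered[0])
print(filtered)
```

Without a cutoff, the single occurrence of "quantum zebra" ranks first because both words appear nowhere else. With a cutoff of 3, only the bigrams that recur in the corpus remain.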

The `freq_filter` variable is set to `None` by default below. If you would like to apply a frequency filter, please provide a value for `freq_filter`, such as `freq_filter=5`. This value determines the frequency cutoff. 

If you are NOT calculating MI or PMI scores, you do not need to run the cell below. 

In [None]:
# Set frequency cutoff 
freq_filter = None
display(HTML('<p style="color:green;">You have set the <code>freq_filter</code> variable to <code>' + str(freq_filter) + '</code>.</p>'))

### Perform Calculations

In [None]:
# Perform Calculations
try:
    ResultsTable, all_scores = collocation_metric(set_length, all_finders_list, metric, freq_filter=freq_filter)
    display(HTML('<p style="color: green;">Calculations complete. View results in the below cells.</p>'))
except NameError:
    display(HTML('<p style="color:red;">You have not provided values for all required variables. Check sections 1-2 of this notebook.</p>'))

### View Dataframe of Scores

The cell below uses a <a href="https://github.com/quantopian/qgrid" target="_blank">QGrid</a> widget to display results in a dataframe, sorted from highest to lowest. Click a column label to sort by that column. Click it again to reverse sort. Click the filter icon to the right of the column label to apply filters (for instance, reducing the table to only documents from specific sources). You can re-order the columns by dragging the column label.

In [None]:
# Display dataframe
qgrid_widget = qgrid.show_grid(ResultsTable, grid_options=grid_options, show_toolbar=False)

qgrid_widget

### Save Dataframe to CSV

The cell below will save the version of the dataframe you see displayed in the cell above. To save the full version of the dataframe (disregarding any filtering, etc., you have done in the qgrid dataframe), skip the next cell, uncomment the code in the cell below it, and run that cell.

Either cell will create a csv file in this module directory with whatever name you assign to the `csv_file` variable.

In [None]:
# Configure csv file name
csv_file = ''

# Save version of dataframe you see above to csv
# (check the csv_file variable, not the imported csv module)
if csv_file != '':
    changed_df = qgrid_widget.get_changed_df()
    changed_df.to_csv(csv_file, index_label = 'Index')
    display(HTML('<p style="color:green;">Csv file called <code>' + csv_file + '</code> created.</p>'))
else:
    display(HTML('<p style="color:red;">You have not provided a filename for the csv file.</p>'))

In [None]:
## Configure csv file name
# csv_file = ''

## Save the full dataframe to a csv file
# if csv_file != '':
#     ResultsTable.to_csv(csv_file, index_label = 'Index')
#     display(HTML('<p style="color:green;">Csv file called <code>' + csv_file + '</code> created.</p>'))
# else:
#     display(HTML('<p style="color:red;">You have not provided a filename for the csv file.</p>'))

### View Scores for a Specific Token and Save to CSV

You can check to see what other tokens are highly associated with your chosen token across your project, according to your selected metric. Enter only a single word below; it does not work if you enter a bigram or a trigram. Enter that word below following the format `token = 'example'`.
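
Conceptually, this lookup keeps only the scored n-grams that contain your token and ranks them by score. The sketch below illustrates the idea with invented data. The list-of-tuples structure and the `scores_for_token` helper are assumptions; the actual structure of `all_scores` and the behavior of `order_collocation_scores` are defined in the module's scripts.

```python
# Hypothetical scores for illustration: (ngram, score) pairs.
example_scores = [
    (('liberal', 'arts'), 12.4),
    (('arts', 'college'), 7.9),
    (('fine', 'arts'), 9.1),
    (('public', 'college'), 3.2),
]

def scores_for_token(scores, token):
    """Keep n-grams that contain `token`, sorted from highest to lowest score."""
    matches = [(ngram, score) for ngram, score in scores if token in ngram]
    return sorted(matches, key=lambda item: item[1], reverse=True)

arts_scores = scores_for_token(example_scores, 'arts')
print(arts_scores)
```

The membership test compares whole tokens, so a token such as "art" would not match "arts" in this sketch.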

In [None]:
# Configure token
token = ''

if token == '' or token == None:
    display(HTML('<p style="color:red;">You have not selected a token.</p>'))
else:
    check = token.split(' ')
    if len(check) > 1:
        display(HTML('<p style="color:red;">Your <code>token</code> can only be a unigram.</p>'))
    else:
        display(HTML('<p style="color:green;">You have set the <code>token</code> variable to <code>' + token + '</code>.</p>'))

You may also choose to save this information to a csv file by changing the value of the `save_csv` variable to `True`.
This will create a csv file in this module's directory with whatever name you assign to the `csv_file` variable. If you do not wish to save a csv file, set the value of the `csv_file` variable to `None`.

In [None]:
# Select True or False
save_csv = False
# Give the csv_file a name or select None
csv_file = None

if save_csv == False and csv_file == None:
    display(HTML('<p style="color:green;">You have elected not to save a csv file.</p>'))
elif save_csv == True and csv_file != None:
    display(HTML('<p style="color:green;">You have elected to save a csv file and have set <code>csv_file</code> to <code>' + str(csv_file) + '</code>.</p>'))
elif save_csv == False and csv_file != None:
    display(HTML('<p style="color:red;">You have given the csv file a name but set <code>save_csv</code> to <code>' + str(save_csv) + '</code></p>'))
elif save_csv == True and csv_file == None:
     display(HTML('<p style="color:red;">You have set <code>save_csv</code> to <code>' + str(save_csv) + '</code> but not provided a value for <code>csv_file</code>.</p>'))     

Run the cell below to see which other token or tokens (depending on whether you calculated bi- or trigram frequencies) occur with your chosen token throughout your project, along with the score for each grouping. If you have not elected to save the results to a csv file, they will print to the cell output.

In [None]:
# Get token_scores
token_scores = order_collocation_scores(all_scores, token, save_csv, csv_file)

if save_csv == True:
    display(HTML('<p style="color:green;">CSV file of results called <code>' + csv_file + '</code> created.</p>'))
elif token == '':
    display(HTML('<p style="color:red;">You have not selected a token.</p>'))
else:
    print(token_scores)