# Identify the misuse of acronyms

In large documents with multiple authors and heavy jargon, acronyms tend to be misused in three ways:
1. Used but not defined. (Ex. This report will be submitted to the EPA.)
2. Used before being defined. (Ex. This report will be submitted to the EPA. The Environmental Protection Agency (EPA) will review the report.)
3. Defined multiple times. (Ex. This report will be submitted to the Environmental Protection Agency (EPA). The Environmental Protection Agency (EPA) will review the report.)

This notebook identifies instances of acronym misuse in a Word document. It treats an acronym enclosed in parentheses, such as "(EPA)", as a definition and any other occurrence as a use. Word's "Find" function can then be used to locate and correct each instance.
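As a minimal sketch of the three categories, the following applies them to sentences modeled on the examples above, with each definition written as the full name followed by the acronym in parentheses, which is the form the notebook's code looks for. The helper name `classify_acronym` and its return strings are assumptions made for illustration; they are not part of the notebook.

```python
def classify_acronym(text, acronym):
    """Return which kind of misuse, if any, applies to `acronym` in `text`.

    Hypothetical helper for illustration only.
    """
    # Strip sentence punctuation but keep parentheses, then split into words.
    words = text.translate({ord(p): None for p in '.?!,:;"\''}).split()
    # A parenthesised acronym counts as a definition; a bare one counts as a use.
    defined = [i for i, w in enumerate(words) if w == '(' + acronym + ')']
    used = [i for i, w in enumerate(words) if w == acronym]
    if not defined:
        return 'used but not defined' if used else 'not present'
    if used and min(used) < min(defined):
        return 'used before being defined'
    if len(defined) > 1:
        return 'defined multiple times'
    return 'ok'


never = 'This report will be submitted to the EPA.'
early = ('This report will be submitted to the EPA. '
         'The Environmental Protection Agency (EPA) will review the report.')
twice = ('This report will be submitted to the Environmental Protection Agency (EPA). '
         'The Environmental Protection Agency (EPA) will review the report.')

print(classify_acronym(never, 'EPA'))  # used but not defined
print(classify_acronym(early, 'EPA'))  # used before being defined
print(classify_acronym(twice, 'EPA'))  # defined multiple times
```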

In [1]:
from docx import Document
import pandas as pd
import numpy as np

In [2]:
document = Document('Example Document.docx')
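# Join paragraphs with a space so the last word of one paragraph
# does not fuse with the first word of the next.
text = ' '.join(paragraph.text for paragraph in document.paragraphs)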
# Create a list of the individual words after removing punctuation.
text_words = text.translate({ord(punc): None for punc in '.?!,:;"\''})
text_words = text_words.split()
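To show what this tokenization step produces, here is a small sketch on a made-up sentence (the sentence and the variable names `sample`, `strip_table`, and `sample_words` are assumptions for illustration). Note that parentheses are deliberately left in place, because the later cells rely on them to tell a definition from a use.

```python
# Hypothetical sample sentence, used only to show the tokenization step.
sample = 'The Environmental Protection Agency (EPA) reviews it; the EPA, however, is slow.'

# Map each punctuation character to None so str.translate deletes it.
strip_table = {ord(punc): None for punc in '.?!,:;"\''}

# Delete the punctuation, then split on whitespace.
sample_words = sample.translate(strip_table).split()
print(sample_words)
# ['The', 'Environmental', 'Protection', 'Agency', '(EPA)', 'reviews',
#  'it', 'the', 'EPA', 'however', 'is', 'slow']
```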

### Get a list of suspected acronyms from the document

In [3]:
# Get a list of words that are enclosed in parentheses.
words_parentheses_with_paren = [word for word in text_words if 
                                                    word.startswith('(') and
                                                    word.endswith(')') and 
                                                    len(word) > 3]
words_parentheses = [word.translate({ord(punc): None for punc in '()'}) for word in words_parentheses_with_paren]
# Get a list of words that are all caps.
text_words_cleaned = [word.translate({ord(punc): None for punc in '()'}) for word in text_words]
words_all_caps = [word for word in text_words_cleaned if word.isupper()]
# Get a list of suspected acronyms by combining the parentheses and all caps words
suspected_acronyms = words_parentheses + words_all_caps
suspected_acronyms = list(set(suspected_acronyms))
# Remove some items that are not acronyms
acronyms_to_be_removed = []
for acronym in suspected_acronyms:
    # No letters.
    if not any(c.isalpha() for c in acronym):
        acronyms_to_be_removed.append(acronym)
    # Only one character.
    if len(acronym) <= 1:
        acronyms_to_be_removed.append(acronym)
    # Contains a hyphen: likely a sample, area of concern, or monitoring well ID
    # (items specific to the environmental field), not an acronym.
    if '-' in acronym:
        acronyms_to_be_removed.append(acronym)
    # TODO: Exclude section headers and words that are all caps for another reason.
acronyms = list(set(suspected_acronyms) - set(acronyms_to_be_removed))
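The three exclusion rules can also be read as a single predicate. The sketch below applies them to a made-up candidate list; the helper name `looks_like_acronym` and the sample values are assumptions for illustration, not part of the notebook.

```python
def looks_like_acronym(candidate):
    """Apply the three exclusion rules to one candidate (hypothetical helper)."""
    has_letter = any(c.isalpha() for c in candidate)   # rule 1: must contain a letter
    long_enough = len(candidate) > 1                   # rule 2: more than one character
    no_hyphen = '-' not in candidate                   # rule 3: no hyphenated IDs
    return has_letter and long_enough and no_hyphen


# Hypothetical candidates: two real acronyms and one failure per rule.
candidates = ['EPA', 'A', '2020', 'MW-1', 'UST']
kept = [c for c in candidates if looks_like_acronym(c)]
print(kept)  # ['EPA', 'UST']
```

A list comprehension keeps the original order, whereas the set difference used above does not guarantee any order.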

### Create a dataframe of summary statistics

In [4]:
acronym_stats = dict.fromkeys(acronyms)
for key in acronym_stats:
    acronym_stats[key] = {'defined': [], 'used': []}
# Record the word positions where each acronym appears in parentheses,
# which is treated as a definition, e.g. "(EPA)".
for key in acronym_stats:
    parenthesised_key = '(' + key + ')'
    for loc, word in enumerate(text_words):
        if word == parenthesised_key:
            acronym_stats[key]['defined'].append(loc)
# Record the word positions where each acronym appears on its own (a use).
for key in acronym_stats:
    for loc, word in enumerate(text_words):
        if word == key:
            acronym_stats[key]['used'].append(loc)
acronym_df = pd.DataFrame(acronym_stats).T
acronym_df.index.name = 'acronym'
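To make the shape of these statistics concrete, the following sketch builds the same `defined`/`used` position lists for a short made-up word list, without pandas. The function name `find_positions` and the sample words are assumptions for illustration; it walks the words once rather than once per acronym, which is an alternative to the loops above rather than a description of them.

```python
def find_positions(words, acronyms):
    """Map each acronym to the word positions where it is defined and used.

    Hypothetical helper: a parenthesised occurrence counts as a definition,
    a bare occurrence counts as a use.
    """
    stats = {key: {'defined': [], 'used': []} for key in acronyms}
    for loc, word in enumerate(words):
        if word in stats:
            stats[word]['used'].append(loc)
        elif word.startswith('(') and word.endswith(')') and word[1:-1] in stats:
            stats[word[1:-1]]['defined'].append(loc)
    return stats


# Hypothetical tokenized text.
words = ['The', 'EPA', 'is', 'the', 'Environmental', 'Protection',
         'Agency', '(EPA)', 'and', 'EPA', 'rules']
stats = find_positions(words, ['EPA'])
print(stats)  # {'EPA': {'defined': [7], 'used': [1, 9]}}
```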

### List the acronyms that were misused

In [5]:
defined_multiple_times, never_defined, used_before_defined = [], [], []
for index, row in acronym_df.iterrows():
    if len(row['defined']) == 0:
        never_defined.append(index)
    # 'defined' is non-empty here; only compare when the acronym is also used on its own.
    elif row['used'] and min(row['used']) < min(row['defined']):
        used_before_defined.append(index)
    elif len(row['defined']) > 1:
        defined_multiple_times.append(index)
print('Acronyms used before being defined: ' + ', '.join(used_before_defined))
print('\n')
print('Acronyms used but never defined: ' + ', '.join(never_defined))
print('\n')
print('Acronyms defined more than once: ' + ', '.join(defined_multiple_times))

Acronyms used before being defined: NAPL, RCRA


Acronyms used but never defined: 25V, 2H+, AOC, EPA, GQS, II, IIA, ISCO, IT, ITRC, LSRP, NX, PA, SCC, SCC/SRS, SESC, SI/RI, SRS, VOC


Acronyms defined more than once: UST
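Because cell 5 uses an `if`/`elif` chain, each acronym lands in at most one list: an acronym that is both used before its first definition and defined twice is reported only as used before being defined. If listing every applicable problem is preferable, a variant along these lines could be used. The helper name `misuse_flags` and its flag strings are assumptions for illustration.

```python
def misuse_flags(defined, used):
    """Return every misuse category that applies, given position lists.

    Hypothetical helper; `defined` and `used` are lists of word positions.
    """
    flags = set()
    if not defined:
        # Nothing else can be checked without a definition.
        flags.add('never defined')
        return flags
    if used and min(used) < min(defined):
        flags.add('used before defined')
    if len(defined) > 1:
        flags.add('defined more than once')
    return flags


print(misuse_flags(defined=[], used=[3]))        # {'never defined'}
print(misuse_flags(defined=[2], used=[5]))       # set()
print(sorted(misuse_flags(defined=[7, 20], used=[1, 9])))
# ['defined more than once', 'used before defined']
```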
