The purpose of this code is to demonstrate a technique that could prove useful when searching for the meaning of an acronym.

Imagine that you have the sentence: "The spokesperson for the Department of State said that peace is good." Here the words "Department", "of", and "State" are the components of the acronym DOS. How might we capture this automatically?

A first step is to treat this as a finite-state type of problem. A sequence of words can only be accepted as a possible spelling-out of the acronym if its words begin, in order, with the letters of the acronym.
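
The acceptance rule described above can be sketched as a small check on a candidate word sequence. This is only an illustration: the function name `spells_acronym` is hypothetical and does not appear in the cells that follow.

```python
def spells_acronym(words, acronym):
    """Return True if the first letters of `words` spell `acronym` (case-insensitive)."""
    # A candidate needs exactly one word per acronym letter.
    if len(words) != len(acronym):
        return False
    # Compare each word's first letter with the acronym letter at the same position.
    return all(
        len(word) > 0 and word[0].lower() == letter.lower()
        for word, letter in zip(words, acronym)
    )


print(spells_acronym(["Department", "of", "State"], "DOS"))    # True
print(spells_acronym(["Department", "of", "Defense"], "DOS"))  # False
```

The check only answers yes or no for one candidate; finding where such a candidate starts inside a sentence is the scanning problem the cells below work through.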

In [25]:
from nltk.tokenize import TreebankWordTokenizer
treebank_tokenizer = TreebankWordTokenizer()



acronym = "DOS"
acronym_letters = list(acronym)
print(acronym_letters) # notice here that we took a single word and broke it into a list of characters

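# now let's get a sentence that contains the spelled-out acronym
# (named `sentence` rather than `input` so we don't shadow Python's built-in input())
sentence = "The Department of State is located in Foggy Bottom."

tokens = treebank_tokenizer.tokenize(sentence)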
print(tokens)

# now we need to come up with some mechanism to scan through the list of tokens to start 
# seeing if we can find a sequence of words that match our acronym

acronym_letters_index = 0 # we set an index of the acronym letters to 0 meaning we are 
# about to start looking for the sequence of words

acronym_found = False
acronym_token_range = None

for index in range(len(tokens)):
    # what we now have is a for loop that will iterate through the indices of the tokens in the sentence
    token = tokens[index] # this gets the particular token that we are looking at
    if acronym_letters_index < len(acronym_letters) and token[0].lower() == acronym_letters[acronym_letters_index].lower():
        print("A match!")
        acronym_letters_index += 1 # we now are looking for a word with the next letter
        
        acronym_start_index = index # an index of where we are saying the acronym starts
        acronym_end_index = -1 # here we set the end index to -1 to indicate that we've not yet found an acronym
        
        # once we've found a possible first word we can start looking to the right of the current word to
        # see if the following words match the remaining letters of the acronym
        for acronym_finder_index in range(index + 1, len(tokens)):
            # notice we start this for loop at one word 
            # to the right of where we currently are
            next_token = tokens[acronym_finder_index]
            if acronym_letters_index < len(acronym_letters) and next_token[0].lower() == acronym_letters[acronym_letters_index].lower() :
                print("next letter found")
                if (acronym_letters_index + 1 < len(acronym_letters)):
                    # this block of code means we have captured part of the acronym and are moving on to the next letter
                    acronym_letters_index += 1
                else:
                    print("success!")
                    # if we are here then we have captured the acronym
                    acronym_token_range = [acronym_start_index, acronym_finder_index]
                    acronym_found = True
                    acronym_start_index = 0
                    break

if acronym_found:
    print("Acronym Found")
    print(tokens[acronym_token_range[0]:acronym_token_range[1]+1])
    print(" ".join(tokens[acronym_token_range[0]:acronym_token_range[1]+1]))
            
            


['D', 'O', 'S']
['The', 'Department', 'of', 'State', 'is', 'located', 'in', 'Foggy', 'Bottom', '.']
A match!
next letter found
next letter found
success!
A match!
Acronym Found
['Department', 'of', 'State']
Department of State


So this implementation looks like it might work, but what if the sentence contains a similar phrase that matches only part of the acronym, and the real expansion comes at the end?

In [26]:
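# now let's get a sentence with a near-miss expansion before the real one
# (named `sentence` rather than `input` so we don't shadow Python's built-in input())
sentence = "The Department of Defense is located in Arlington but the Department of State is located in Foggy Bottom."

tokens = treebank_tokenizer.tokenize(sentence)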
print(tokens)

# now we need to come up with some mechanism to scan through the list of tokens to start 
# seeing if we can find a sequence of words that match our acronym

acronym_letters_index = 0 # we set an index of the acronym letters to 0 meaning we are 
# about to start looking for the sequence of words

acronym_found = False
acronym_token_range = None

for index in range(len(tokens)):
    # what we now have is a for loop that will iterate through the indices of the tokens in the sentence
    token = tokens[index] # this gets the particular token that we are looking at
    if acronym_letters_index < len(acronym_letters) and token[0].lower() == acronym_letters[acronym_letters_index].lower():
        print("A match!")
        acronym_letters_index += 1 # we now are looking for a word with the next letter
        
        acronym_start_index = index # an index of where we are saying the acronym starts
        acronym_end_index = -1 # here we set the end index to -1 to indicate that we've not yet found an acronym
        
        # once we've found a possible first word we can start looking to the right of the current word to
        # see if the following words match the remaining letters of the acronym
        for acronym_finder_index in range(index + 1, len(tokens)):
            # notice we start this for loop at one word 
            # to the right of where we currently are
            next_token = tokens[acronym_finder_index]
            if acronym_letters_index < len(acronym_letters) and next_token[0].lower() == acronym_letters[acronym_letters_index].lower():
                print("next letter found")
                if (acronym_letters_index + 1 < len(acronym_letters)):
                    # this block of code means we have captured part of the acronym and are moving on to the next letter
                    acronym_letters_index += 1
                else:
                    print("success!")
                    # if we are here then we have captured the acronym
                    acronym_token_range = [acronym_start_index, acronym_finder_index]
                    acronym_found = True
                    acronym_start_index = 0
                    break
            

if acronym_found:
    print("Acronym Found")
    print(tokens[acronym_token_range[0]:acronym_token_range[1]+1])
    print(" ".join(tokens[acronym_token_range[0]:acronym_token_range[1]+1]))
            

['The', 'Department', 'of', 'Defense', 'is', 'located', 'in', 'Arlington', 'but', 'the', 'Department', 'of', 'State', 'is', 'located', 'in', 'Foggy', 'Bottom', '.']
A match!
next letter found
next letter found
success!
A match!
Acronym Found
['Department', 'of', 'Defense', 'is', 'located', 'in', 'Arlington', 'but', 'the', 'Department', 'of', 'State']
Department of Defense is located in Arlington but the Department of State


What we find here is that the inner loop skips over non-matching words and keeps scanning until it finds the remaining letter, so we get this wrong, overly long span.

What we need to do is reset acronym_letters_index back to 0 as soon as a following word fails to match, which means that we are not on the right path to find the acronym and must restart.

In [28]:
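# the same sentence as before, with a near-miss expansion before the real one
# (named `sentence` rather than `input` so we don't shadow Python's built-in input())
sentence = "The Department of Defense is located in Arlington but the Department of State is located in Foggy Bottom."

tokens = treebank_tokenizer.tokenize(sentence)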
print(tokens)

# now we need to come up with some mechanism to scan through the list of tokens to start 
# seeing if we can find a sequence of words that match our acronym

acronym_letters_index = 0 # we set an index of the acronym letters to 0 meaning we are 
# about to start looking for the sequence of words

acronym_found = False
acronym_token_range = None

for index in range(len(tokens)):
    # what we now have is a for loop that will iterate through the indices of the tokens in the sentence
    token = tokens[index] # this gets the particular token that we are looking at
    if acronym_letters_index < len(acronym_letters) and token[0].lower() == acronym_letters[acronym_letters_index].lower():
        print("A match!")
        acronym_letters_index += 1 # we now are looking for a word with the next letter
        
        acronym_start_index = index # an index of where we are saying the acronym starts
        acronym_end_index = -1 # here we set the end index to -1 to indicate that we've not yet found an acronym
        
        # once we've found a possible first word we can start looking to the right of the current word to
        # see if the following words match the remaining letters of the acronym
        for acronym_finder_index in range(index + 1, len(tokens)):
            # notice we start this for loop at one word 
            # to the right of where we currently are
            next_token = tokens[acronym_finder_index]
            if acronym_letters_index < len(acronym_letters) and next_token[0].lower() == acronym_letters[acronym_letters_index].lower():
                print("next letter found")
                if (acronym_letters_index + 1 < len(acronym_letters)):
                    # this block of code means we have captured part of the acronym and are moving on to the next letter
                    acronym_letters_index += 1
                else:
                    print("success!")
                    # if we are here then we have captured the acronym
                    acronym_token_range = [acronym_start_index, acronym_finder_index]
                    acronym_found = True
                    acronym_start_index = 0
                    break
            else:
                acronym_letters_index = 0
                break
            

if acronym_found:
    print("Acronym Found")
    print(tokens[acronym_token_range[0]:acronym_token_range[1]+1])
    print(" ".join(tokens[acronym_token_range[0]:acronym_token_range[1]+1]))
            

['The', 'Department', 'of', 'Defense', 'is', 'located', 'in', 'Arlington', 'but', 'the', 'Department', 'of', 'State', 'is', 'located', 'in', 'Foggy', 'Bottom', '.']
A match!
next letter found
A match!
A match!
next letter found
next letter found
success!
A match!
Acronym Found
['Department', 'of', 'State']
Department of State


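This implementation seems to work, but improvements are still possible. For example, the last "A match!" in the output is printed after "success!": the outer loop keeps running once the acronym has been found, and acronym_letters_index still points at the final letter, so "State" matches again.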
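
One possible cleanup, offered as a sketch rather than as the notebook's own method, is to move the search into a function that tests a fixed-width window of consecutive tokens at every start position and returns on the first hit. This replaces the index-resetting loop with a sliding window, so there is no state left over to reset. The function name `find_acronym_expansion` is hypothetical, and the token list is copied from the output above so the example runs without NLTK.

```python
def find_acronym_expansion(tokens, acronym):
    """Return (start, end) indices of the first run of consecutive tokens whose
    initial letters spell `acronym`, or None if no such run exists."""
    letters = acronym.lower()
    width = len(letters)
    if width == 0:
        return None
    # Try every position where a window of `width` tokens still fits.
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(tok[0].lower() == ch for tok, ch in zip(window, letters)):
            return start, start + width - 1  # stop at the first full match
    return None


tokens = ['The', 'Department', 'of', 'Defense', 'is', 'located', 'in',
          'Arlington', 'but', 'the', 'Department', 'of', 'State', 'is',
          'located', 'in', 'Foggy', 'Bottom', '.']

match = find_acronym_expansion(tokens, "DOS")
if match is not None:
    start, end = match
    print(" ".join(tokens[start:end + 1]))  # Department of State
```

Because every window is checked from scratch, "Department of Defense" is rejected as a whole and the scan simply moves on. The trade-off is that this version only accepts expansions with exactly one word per letter.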

There are additional problems with this implementation that you might wish to fix. One is identifying the expansion of FBI (Federal Bureau of Investigation) in real text, where the word "of" contributes no letter to the acronym and has to be skipped.
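
A minimal sketch of one way to handle the "of" problem is shown below, assuming that a small set of connector words may be skipped once a candidate expansion has started. The function name `find_expansion_skipping` and the contents of `SKIPPABLE` are illustrative assumptions, not a definitive list, and the token list is written by hand rather than produced by the tokenizer.

```python
# Assumption: an illustrative set of connector words that may sit inside an
# expansion without contributing a letter to the acronym.
SKIPPABLE = {"of", "the", "and", "for"}


def find_expansion_skipping(tokens, acronym, skippable=SKIPPABLE):
    """Return (start, end) indices of the first token span whose words spell
    `acronym`, allowing connector words inside the span, or None."""
    letters = acronym.lower()
    if not letters:
        return None
    for start in range(len(tokens)):
        pos = start
        matched = 0  # number of acronym letters matched so far
        while pos < len(tokens) and matched < len(letters):
            tok = tokens[pos]
            if tok[0].lower() == letters[matched]:
                matched += 1  # this word supplies the next letter
            elif matched > 0 and tok.lower() in skippable:
                pass  # connector word inside a candidate: skip it
            else:
                break  # not on the right path; try the next start position
            pos += 1
        if matched == len(letters):
            return start, pos - 1
    return None


fbi_tokens = ['The', 'Federal', 'Bureau', 'of', 'Investigation', 'is',
              'headquartered', 'in', 'Washington', '.']
span = find_expansion_skipping(fbi_tokens, "FBI")
if span is not None:
    print(" ".join(fbi_tokens[span[0]:span[1] + 1]))  # Federal Bureau of Investigation
```

The letter test runs before the skip test, so "of" still counts as the "O" in "Department of State" for DOS and is only skipped when it does not match the letter being looked for.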