The aim here is to write a function that prints how many times each word appears in a file, with the words sorted alphabetically.

Reading each line of a file as a string is straightforward and does not need a user-defined function (UDF). Let's begin by writing a UDF that takes a string and updates a dictionary holding the count of each word that has occurred so far.

In [31]:
def update_wordcount_dict(line_str, wordcount_dict):
    line_words_list = (line_str.lower()).split() # convert string to lower case and split on whitespace
    
    for line_word in line_words_list:
        if line_word in wordcount_dict:
            curr_word_count = wordcount_dict[line_word]
            wordcount_dict[line_word] = curr_word_count + 1
        else:
            wordcount_dict[line_word] = 1

The function above modifies the dictionary passed to it in place and relies on the caller seeing those changes. We need to test whether this works.

In [32]:
wordcount_dict = {} 
update_wordcount_dict("We are not what we should be", wordcount_dict)
print('After the first line:')
print(wordcount_dict)
update_wordcount_dict("We are not what we should be", wordcount_dict)
print('After the second line:')
print(wordcount_dict)
update_wordcount_dict("We are not what we need to be", wordcount_dict)
print('After the third line:')
print(wordcount_dict)
del wordcount_dict

After the first line:
{'we': 2, 'are': 1, 'not': 1, 'what': 1, 'should': 1, 'be': 1}
After the second line:
{'we': 4, 'are': 2, 'not': 2, 'what': 2, 'should': 2, 'be': 2}
After the third line:
{'we': 6, 'are': 3, 'not': 3, 'what': 3, 'should': 2, 'be': 3, 'need': 1, 'to': 1}


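So a dictionary can indeed be modified inside a UDF. Python passes a reference to the same dictionary object, and because dictionaries are mutable, in-place updates made in the function are visible to the caller.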
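
The distinction that matters here is between mutating the object a parameter refers to and rebinding the parameter name to a new object. A minimal sketch of the difference, using two helper functions (`add_key` and `rebind` are hypothetical names introduced only for illustration, not part of the code above):

```python
def add_key(d):
    # Mutates the dictionary object that the caller passed in.
    d["seen"] = 1

def rebind(d):
    # Rebinds the local name d to a new dictionary;
    # the caller's dictionary is left untouched.
    d = {"seen": 1}

counts = {}
rebind(counts)
after_rebind = dict(counts)  # snapshot taken after rebind()
add_key(counts)
after_add = dict(counts)     # snapshot taken after add_key()

print(after_rebind)
print(after_add)
```

Only the in-place update survives the call, which is why update_wordcount_dict works: it assigns to keys of the dictionary rather than assigning to the parameter name itself.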

Next, we need to write the print_words function, which takes a file name as input, reads the file line by line, and calls update_wordcount_dict for each line.

In [45]:
def print_words(filename):
    wordcount_dict = {}
    
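    # Text mode "r" replaces "rU", which was removed in Python 3.11;
    # the with block closes the file when the loop finishes.
    with open(filename, "r") as f:
        for file_line_str in f:
            update_wordcount_dict(file_line_str, wordcount_dict)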
        
    # sorted() on a dictionary returns a list of its keys in alphabetical order
    sorted_words = sorted(wordcount_dict)
    
    # print each word together with its count
    for word_key in sorted_words:
        print("%s    : %d" % (word_key, wordcount_dict[word_key]))
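
For comparison, the same tally can be sketched with `collections.Counter` from the standard library, which handles the "first time seen" case on its own. This is an alternative under stated assumptions, not the approach used above: `count_words` is a hypothetical helper, and the input is an in-memory list of lines rather than a file.

```python
from collections import Counter

def count_words(lines):
    # Hypothetical helper: tally lower-cased, whitespace-separated words
    # across an iterable of strings.
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

sample_lines = [
    "We are not what we should be",
    "We are not what we need to be",
]
counts = count_words(sample_lines)

# Iterating over sorted(counts) visits the words in alphabetical order.
for word in sorted(counts):
    print("%s    : %d" % (word, counts[word]))
```

An open file object could be passed in place of `sample_lines`, since iterating over a file also yields one string per line.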
    

Now, we need to write the main function.

In [47]:
def main():
    file_name_str = "small.txt"
    print(file_name_str)
    print_words(file_name_str)
    
if __name__ == "__main__":
    main()

small.txt
--    : 1
are    : 3
at    : 1
be    : 3
but    : 1
coach    : 1
football    : 1
least    : 1
need    : 1
not    : 3
should    : 1
to    : 2
used    : 1
we    : 6
what    : 3


