# Word Length in James Joyce's Ulysses

As the title suggests, my objective is to figure out how many words of each length there are in the entirety of Ulysses.

In [10]:
# Read the Ulysses text file and split it on whitespace (space, tab, newline).
# Punctuation attached to a word counts toward that word's length.
with open('ulysses.txt', 'r', encoding='utf8') as f:
    data = f.read()
words = data.split()

# Populate a dictionary where the key is a word length and the value is its number of occurrences.
# Track the longest word as we go instead of hardcoding its length.
length_data = {}
longest_word = ''
for word in words:
    length_data[len(word)] = length_data.get(len(word), 0) + 1
    if len(word) > len(longest_word):
        longest_word = word

# Print results sorted by word length.
for key in sorted(length_data):
    print("{}: {}".format(key, length_data[key]))

# Named total_letters rather than sum, so the built-in sum() is not shadowed.
total_letters = 0
total_words = 0
for k in length_data:
    total_letters += k * length_data[k]
    total_words += length_data[k]

print("\nTotal number of letters in Ulysses is {}.".format(total_letters))
print("Total number of words in Ulysses is {}.".format(total_words))
print("Average length of words in Ulysses is {:.1f}.".format(total_letters / total_words))
print("The longest word was {} at {} letters long.".format(longest_word, len(longest_word)))

1: 9375
2: 40719
3: 55526
4: 44338
5: 33911
6: 25941
7: 21178
8: 14151
9: 9415
10: 6035
11: 3505
12: 1908
13: 1018
14: 478
15: 207
16: 90
17: 65
18: 22
19: 17
20: 9
21: 9
22: 5
23: 2
24: 4
25: 1
26: 2
27: 2
28: 3
29: 2
30: 2
31: 1
34: 2
36: 1
37: 1
39: 1
53: 1
91: 1
105: 1

Total number of letters in Ulysses is 1255643.
Total number of words in Ulysses is 267949.
Average length of words in Ulysses is 4.7.
The longest word was Nationalgymnasiummuseumsanatoriumandsuspensoriumsordinaryprivatdocentgeneralhistoryspecialprofessordoctor at 105 letters long.


## Steps of Analysis

The steps required to generate my results were fairly simple and are largely explained by the comments in the code cell. Nonetheless, here they are:

* Read text file into a very long string.
* Split the string on whitespace to produce a list of words.
* Iterate over the list to populate a dictionary, where each key is a word length and each value is the number of words of that length.
* Iterate over the dictionary to produce values like total number of letters, word count, and average word length.
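
The steps above can also be sketched more compactly with `collections.Counter`. This is only an illustrative alternative, not the code that produced the results: the helper names `word_length_counts` and `summarize` are hypothetical, and a short sample string stands in for the contents of `ulysses.txt`.

```python
from collections import Counter

def word_length_counts(text):
    """Map each word length to the number of whitespace-delimited words of that length."""
    return Counter(len(word) for word in text.split())

def summarize(counts):
    """Return (total letters, total words, average word length) for a length histogram."""
    total_words = sum(counts.values())
    total_letters = sum(length * n for length, n in counts.items())
    average = total_letters / total_words if total_words else 0.0
    return total_letters, total_words, average

# Sample text used purely for illustration; in the notebook this would be f.read().
sample = "Stately, plump Buck Mulligan came from the stairhead"
counts = word_length_counts(sample)
total_letters, total_words, average = summarize(counts)

for length in sorted(counts):
    print("{}: {}".format(length, counts[length]))
print("Average length: {:.1f}".format(average))
```

Note that "Stately," counts as eight characters here, because splitting on whitespace leaves the comma attached to the word.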

## Conclusions

The generated dictionary yields some interesting figures. First of all, according to a cursory Wikipedia search, the average modern novel comes to about 40,000 words. Ulysses, unsurprisingly, absolutely crushes this figure at a colossal 267,949 words.

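According to a quick Google search, the average length of an English word is 5.1 letters. Interestingly, the average word length in Ulysses falls short of this at 4.7 letters. Because the text was split on whitespace only, attached punctuation counts toward each word's length, so the true letter average is, if anything, slightly lower still.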
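
As a quick check on that average, the arithmetic can be redone directly from the printed output. The sketch below copies the totals and the counts for lengths 1 through 4 from the results above, and the variable names are my own. It also computes what share of all words are four characters or shorter, which gives a rough sense of how heavily short words weigh on the mean.

```python
# Figures copied from the notebook output above.
short_counts = {1: 9375, 2: 40719, 3: 55526, 4: 44338}
total_letters = 1255643
total_words = 267949

# Average word length is total characters divided by total words.
average_length = total_letters / total_words

# Fraction of all words that are four characters or shorter.
short_share = sum(short_counts.values()) / total_words

print("Average word length: {:.1f}".format(average_length))
print("Share of words with 4 or fewer characters: {:.1%}".format(short_share))
```

With these figures, a little over half of the words in the text are four characters or shorter.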

Despite this, there are certainly some striking anomalies in terms of word length. For instance, the longest word in the novel stands at a whopping 105 letters. When I first saw this, I thought it must be a mistake; maybe there was a URL somewhere that was read as a single word. However, after printing the word out and searching for it in the text file, I found that it was the last title in a list of people with progressively more absurd titles. The second longest word, at 91 letters, was the product of similar absurdity.
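
One way to pull out the longest words for inspection, rather than checking for a specific length, is `heapq.nlargest` with `len` as the key. This is a minimal sketch: the helper name `longest_words` is hypothetical, and the short list stands in for the full `words` list built from the text file.

```python
import heapq

def longest_words(words, n=2):
    """Return the n longest distinct words, longest first."""
    # dict.fromkeys removes duplicates while preserving first-seen order.
    distinct = dict.fromkeys(words)
    return heapq.nlargest(n, distinct, key=len)

# Stand-in for the list produced by data.split() in the notebook.
sample_words = ["the", "stairhead", "Mulligan", "a", "the", "plump"]

for word in longest_words(sample_words):
    print("{} ({} characters)".format(word, len(word)))
```

Run against the full word list with `n=2`, this should surface the two longest entries so they can be searched for in the text file.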

Though it falls outside the scope of the assignment, it would have been interesting to try to determine why Ulysses has an average word length below that of the English language in general. Given its reputation as a highly esoteric text, this particular result was very surprising to me.