## Text Statistics using Python and Regular Expressions

### Introduction

The goal of this project is to show relevant statistics of a text file similar to the Word Count box from Microsoft Word.

![](figures/TextStatisticsFromMSWord_Small.png)

Notepad++ also provides the total number of characters and lines for each file at the bottom portion of the application.

![](figures/TextStatisticsFromNotepadPlusPlus.png)

This project will also display the top 10 most frequently occurring words from the file.

Results show that the computed text statistics closely match the values reported by Microsoft Word and Notepad++. The most frequently occurring words in a compilation of Shakespeare's complete works correspond to the most commonly used words in English.

The experience from this endeavor will be used in creating an email spam filter model, which depends heavily on features extracted from sample email texts.

### String Manipulation

The next cells will show how common Python functions (e.g., __len__, __print__, __split__, __lower__, and __set__) can be used in this project. The sample text is taken from Tolkien's novel *The Hobbit*.

"In a hole in the ground there lived a hobbit. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or to eat: it was a hobbit-hole, and that means comfort." 

In [1]:
text = "In a hole in the ground there lived a hobbit."

In [2]:
# total characters
numChars = len(text)
print("Total number of characters: ", numChars)

Total number of characters:  45


In [3]:
# Split and total number of words after split
splitWords = text.split()
numWords = len(splitWords)
print("Total number of words: ", numWords)
print("Words: ", splitWords)

Total number of words:  10
Words:  ['In', 'a', 'hole', 'in', 'the', 'ground', 'there', 'lived', 'a', 'hobbit.']


There are several observations that can be made from this naive implementation to get words from the text.

The 10th word is "hobbit.", with a period at the end of the word. This will cause issues later if the word "hobbit" is encountered without the period. The code would treat the two as different words, even though the only difference is a punctuation mark at the end of the sentence.

The first word "In" may also be considered distinct from the fourth word "in" if preprocessing is not performed on the raw text.

The first preprocessing that can be done is to convert everything to lowercase using the __lower__ function.

In [4]:
# Convert to lowercase
splitWords = text.lower().split()
splitWords

['in', 'a', 'hole', 'in', 'the', 'ground', 'there', 'lived', 'a', 'hobbit.']

In [5]:
# Get unique words
distinctWords = set(text.lower().split())
distinctWords

{'a', 'ground', 'hobbit.', 'hole', 'in', 'lived', 'the', 'there'}

Notice that there is only a single instance of the word "in". 

However, the string "hobbit." is still currently considered as a distinct word. We need to take note of the characters that need to be retained in the text.

In [6]:
# retain these characters (store them in a set)
retainChars = {'a','b','c','d','e','f','g','h','i','j',
               'k','l','m','n','o','p','q','r','s','t',
               'u','v','w','x','y','z',' ','-',"'"}

The apostrophe is retained to handle word contractions such as "We'd", which stands for "We had" or "We would". Removing the apostrophe will transform "We'd" to "Wed", which is also a valid word.
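
The effect can be seen in a small sketch. The helper name `keep_only` and the sample sentence are hypothetical, introduced only for illustration; the notebook's own function is defined further down.

```python
import string

# Hypothetical helper for illustration only (not part of the notebook).
def keep_only(s, allowed):
    """Lowercase s and drop every character that is not in allowed."""
    return ''.join(ch for ch in s.lower() if ch in allowed)

letters_and_space = set(string.ascii_lowercase) | {' '}
with_apostrophe = letters_and_space | {"'"}

sentence = "We'd meet on Wed."
without = keep_only(sentence, letters_and_space).split()
kept = keep_only(sentence, with_apostrophe).split()

print(without)  # ['wed', 'meet', 'on', 'wed']  -> "we'd" collides with "wed"
print(kept)     # ["we'd", 'meet', 'on', 'wed'] -> the two words stay distinct
```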

The __normalize__ function below will combine the two operations of converting the characters to lowercase and only retaining characters specified in the set, retainChars. 

In [7]:
def normalize(s):
    """Convert string to lowercase and keep only characters 
        specified in retainChars"""
    return ''.join(char for char in s.lower() if char in retainChars)

In [8]:
normalizedText = normalize(text)
distinctWords = set(normalizedText.split())
print("Normalized text: ", normalizedText)
print("Distinct words: ", distinctWords)


Normalized text:  in a hole in the ground there lived a hobbit
Distinct words:  {'hobbit', 'there', 'lived', 'a', 'ground', 'the', 'hole', 'in'}


Notice that the word hobbit does not contain the period anymore after passing the raw text to the __normalize__ function.

### Using Regular Expressions

The __normalize__ function can be rewritten to use the concept of regular expressions (regex) to retain only the characters specified in *retainChars*. The equivalent regex pattern for *retainChars* is __r'[a-z\s\'-]'__.

Substituting all characters that do not belong to the *retainChars* set with empty string values will also achieve the same effect as creating a new string variable and retaining only the characters in the *retainChars* set.

To achieve this, we will use the __re.sub__ function with a pattern of __r'[^a-z\s\'-]'__. The __^__ symbol at the start of a character class means negation.

In [9]:
import re
text = "In a hole in the ground there lived a middle-aged hobbit. I'd like to live there."
normalizedText = re.sub(r'[^a-z\s\'-]','',text.lower())
print("Normalized text: ", normalizedText)

Normalized text:  in a hole in the ground there lived a middle-aged hobbit i'd like to live there
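
Two details of this pattern may be worth checking on a different input: the hyphen is literal because it is the last character in the class, and `\s` keeps tabs and newlines as well as spaces. The sample string below is made up for illustration, and precompiling with `re.compile` is an optional variation rather than something the notebook does.

```python
import re

# Same character class as above, written with double quotes so the
# apostrophe needs no escaping, and compiled once for reuse.
pattern = re.compile(r"[^a-z\s'-]")

sample = "  Well-known FACT:\tBilbo's 50 rings!\n"
cleaned = pattern.sub('', sample.strip().lower())

print(repr(cleaned))    # "well-known fact\tbilbo's  rings"
print(cleaned.split())  # ['well-known', 'fact', "bilbo's", 'rings']
```

The digits and punctuation disappear, while the hyphen, the apostrophe, and the tab survive. The doubled space left behind by "50" is harmless because `split()` treats any run of whitespace as one separator.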


The regular-expression version of the __normalize__ function is shown below. Note that the leading and trailing whitespace is trimmed using the __strip__ function before the input string is converted to lowercase.

In [10]:
import re
def normalizeRegEx(s):
    """Convert string to lowercase and keep only characters 
        specified in regex pattern"""
    return re.sub(r'[^a-z\s\'-]','',s.strip().lower())

text = "In a hole in the ground there lived a middle-aged hobbit. I'd like to live there."
normalizedText = normalizeRegEx(text)
normalizedText

"in a hole in the ground there lived a middle-aged hobbit i'd like to live there"

### Extracting Basic Text Statistics

We are interested in finding the following information from the text file: (a) number of words, (b) characters (with spaces), (c) characters (no spaces), and (d) lines.

The following code snippet can extract these pieces of information.

For the number of lines, we assume that an empty file has zero lines. However, if there is at least a single character in the file, even without a newline character (__\n__), the number of lines is considered to be one.

In [11]:
text = "In a hole in the ground there lived a middle-aged hobbit. I'd like to live there."

# (a) number of words
numWords = len(normalizeRegEx(text).split())
print("Number of words: ", numWords)

# (b) characters (with spaces)
numChars = len(text)
print("Number of characters (with spaces): ", numChars)

# (c) characters (no spaces)
textNoSpaces = re.sub(r'[\s]','', text)
numCharsNoSpace = len(textNoSpaces)
print(textNoSpaces)
print("Number of characters (no spaces): ", numCharsNoSpace)

# (d) lines
numLines = 1 + text.count("\n") if len(text) > 0 else 0 # assume default of 1 line if there is content
print("Number of lines: ", numLines)

Number of words:  16
Number of characters (with spaces):  81
Inaholeinthegroundtherelivedamiddle-agedhobbit.I'dliketolivethere.
Number of characters (no spaces):  66
Number of lines:  1
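
The line-counting rule can be isolated and checked against a few edge cases. This is a minimal sketch; `count_lines` is a hypothetical helper name that wraps the same expression used in the cell above.

```python
# Hypothetical helper wrapping the line-count expression from the cell above.
def count_lines(text):
    """Return 0 for empty text, otherwise 1 plus the number of newlines."""
    return 1 + text.count("\n") if len(text) > 0 else 0

print(count_lines(""))          # 0
print(count_lines("one line"))  # 1
print(count_lines("a\nb"))      # 2
print(count_lines("a\nb\n"))    # 3 (the trailing newline opens an empty last line)
```

Note that a trailing newline counts as starting one more (empty) line under this rule, whereas `len("a\nb\n".splitlines())` would give 2.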


### Extracting the Number of Paragraphs

For the purposes of this project, a new paragraph starts at a newline character, optionally followed by additional whitespace characters before the next non-whitespace character. The regex pattern for this scenario is: __r"\n\s*"__. Under this definition, every line break inside the stripped text starts a new paragraph, and consecutive blank lines are absorbed into a single match.

To get the number of paragraphs, we just need to determine the number of instances this pattern is encountered in the input text. 

Similar to the number of lines, we assume that an empty file has zero paragraphs. However, if there is at least a single character in the file, even without a newline character (\n), the number of paragraphs is considered to be one.

To remove the effect of newline characters before and after the main body of text, the leading and trailing whitespace is removed using the __strip__ function.


The next cells consider several cases related to extracting the number of paragraphs.

In [12]:
import re

In [13]:
# Case 1: Empty text (No paragraph)
textpar = ""

# Get number of paragraphs
textpar = textpar.strip()

# Match each newline together with any whitespace that follows it
regex = r"\n\s*"
matches = re.findall(regex, textpar)
numParagraphs = 1 + len(matches) if len(textpar) > 0 else 0 # assume default of 1 paragraph if there is content
print("Number of paragraphs: ", numParagraphs)


Number of paragraphs:  0


In [14]:
# Case 2: Single Sentence (One paragraph)
textpar = "The quick brown fox jumped over the lazy dog."

# Get number of paragraphs
textpar = textpar.strip()

# Match each newline together with any whitespace that follows it
regex = r"\n\s*"
matches = re.findall(regex, textpar)
numParagraphs = 1 + len(matches) if len(textpar) > 0 else 0 # assume default of 1 paragraph if there is content
print("Number of paragraphs: ", numParagraphs)


Number of paragraphs:  1


In [15]:
# Case 3: Two paragraphs
textpar = "The quick brown fox jumped over the lazy dog.\n\n\t\nThe quick brown fox jumped over the lazy dog."

# Get number of paragraphs
textpar = textpar.strip()

# Match each newline together with any whitespace that follows it
regex = r"\n\s*"
matches = re.findall(regex, textpar)
numParagraphs = 1 + len(matches) if len(textpar) > 0 else 0 # assume default of 1 paragraph if there is content
print("Number of paragraphs: ", numParagraphs)


Number of paragraphs:  2
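
The three cases repeat the same steps, so they can be folded into one helper and checked together. This is a sketch under the paragraph definition given above; `count_paragraphs` is a hypothetical name, not a function from the notebook.

```python
import re

# Hypothetical helper combining the strip / findall / count steps above.
def count_paragraphs(text):
    """Return 0 for blank text, otherwise 1 plus the number of newline runs."""
    text = text.strip()
    if not text:
        return 0
    # Each match is one newline plus all whitespace directly after it,
    # so "\n\n\t\n" between two sentences counts as a single break.
    return 1 + len(re.findall(r"\n\s*", text))

fox = "The quick brown fox jumped over the lazy dog."
print(count_paragraphs(""))                       # 0
print(count_paragraphs(fox))                      # 1
print(count_paragraphs(fox + "\n\n\t\n" + fox))   # 2
print(count_paragraphs("\n\n" + fox + "\n"))      # 1 (outer newlines are stripped)
print(count_paragraphs(fox + "\n" + fox))         # 2 (a single newline also counts)
```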


### Creating the ___PrintTextStatistics___ Function

The code snippets above can be combined into a single function that we can call and print all text statistics in a concise manner. 



In [16]:
import re

def PrintTextStatistics(filename):
    """ Print text statistics for a given text file """
    
    with open(filename, 'r') as myfile:
        text = myfile.read()
    
    # (a) number of words
    numWords = len(normalizeRegEx(text).split())

    # (b) characters (with spaces)
    numChars = len(text)

    # (c) characters (no spaces)
    textNoSpaces = re.sub(r'[\s]','', text)
    numCharsNoSpaces = len(textNoSpaces)

    # (d) lines
    numLines = 1 + text.count("\n") if len(text) > 0 else 0 # assume default of 1 line if there is content
    
    text = text.strip() # remove leading/trailing spaces
    
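    # (e) paragraphs: each newline plus any whitespace after it marks a break
    # (the greedy \s* already absorbs further newlines, so no trailing \n* is needed)
    regex = r"\n\s*"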
    matches = re.findall(regex, text)
    numParagraphs = 1 + len(matches) if len(text) > 0 else 0 # assume default of 1 paragraph if there is content
    
    # (f) print text statistics
    print("Text statistics for file: ", filename)
    print("Number of words: ", numWords)
    print("Number of characters (with spaces): ", numChars)
    print("Number of characters (no spaces): ", numCharsNoSpaces)
    print("Number of lines: ", numLines)
    print("Number of paragraphs: ", numParagraphs)

The new __PrintTextStatistics__ function will be tested on a real-world text file of *The Complete Works of William Shakespeare*, available through Project Gutenberg (http://www.gutenberg.org).

In [17]:
PrintTextStatistics("data/shakespeare.txt")

Text statistics for file:  data/shakespeare.txt
Number of words:  900068
Number of characters (with spaces):  5458199
Number of characters (no spaces):  4039809
Number of lines:  124457
Number of paragraphs:  114840


For comparison purposes, the results from the Word Count box in Microsoft Word are shown below.

![](figures/TextStatisticsFromShakespeare.png)

The number of characters (with no spaces) and the number of paragraphs match. The other numbers are slightly off. The results from Notepad++ are also shown below.

![](figures/TextStatisticsFromShakespeareNotepadPlusPlus.png)

This time, the number of characters (with spaces) and the number of lines match.

These matches and small differences follow from the definitions used. This project shares its definitions of lines and characters (with spaces) with Notepad++, and its definitions of paragraphs and characters (no spaces) with Microsoft Word. The word count differs, presumably because Microsoft Word uses a different rule for what constitutes a word.
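
One property of this project's word count that could contribute to such a gap is that tokens made up entirely of digits or punctuation vanish during normalization. The sketch below shows this on a made-up sample string. Whether it accounts for the difference with Microsoft Word is an assumption that has not been verified here.

```python
import re

def normalizeRegEx(s):
    """Same normalization as defined earlier, repeated to keep this cell self-contained."""
    return re.sub(r'[^a-z\s\'-]', '', s.strip().lower())

# Made-up sample containing numbers and free-standing punctuation.
sample = "Sonnet 18 , published in 1609 : Shall I compare thee"

raw_tokens = sample.split()                         # whitespace-delimited tokens
normalized_tokens = normalizeRegEx(sample).split()  # tokens counted by this project

print(len(raw_tokens))         # 11
print(len(normalized_tokens))  # 7
print(normalized_tokens)       # ['sonnet', 'published', 'in', 'shall', 'i', 'compare', 'thee']
```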

### Word Frequency

It would also be interesting to know which words occur most frequently in the given text file. The general idea is to find the distinct words and count how many times each occurs in the text. The counts are then sorted in descending order, yielding the words that occur most frequently in the text.

The distinct words and the related counts will be stored in a dictionary structure, with the word as the key and counts as the value.

The __GetWordFrequencyDictionary__ function (which calls __normalizeRegEx__) below can be used to get this dictionary structure. 

In [18]:
import re
def normalizeRegEx(s):
    """Convert string to lowercase and keep only characters 
        specified in regex pattern"""
    return re.sub(r'[^a-z\s\'-]','',s.strip().lower())


In [19]:
# get word frequency
def GetWordFrequencyDictionary(text):
    """ Returns a dictionary where the keys are distinct words and the values are the counts of those words """
    
    text = normalizeRegEx(text)
    words = text.split()
    
    wordFrequency = {} # Initialize dictionary
    for word in words:
        if word in wordFrequency: # just increment count if word was discovered before
            wordFrequency[word] += 1
        else:
            wordFrequency[word] = 1 # add to dictionary
            
    return wordFrequency
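
As an alternative, the standard library's `collections.Counter` performs the same tallying without an explicit loop. The sketch below is a variation, not the notebook's method; `word_frequency_counter` is a hypothetical name, and the normalization pattern is inlined so the block runs on its own.

```python
import re
from collections import Counter

# Hypothetical alternative to GetWordFrequencyDictionary using Counter.
def word_frequency_counter(text):
    """Return a Counter mapping each distinct normalized word to its count."""
    normalized = re.sub(r"[^a-z\s'-]", '', text.strip().lower())
    return Counter(normalized.split())

freq = word_frequency_counter("In a hole in the ground there lived a hobbit.")
print(freq['in'], freq['a'], freq['hobbit'])  # 2 2 1
print(freq.most_common(3))                    # three highest counts as (word, count) pairs
```

A `Counter` is a `dict` subclass, so it can be used wherever the plain dictionary is expected. Its `most_common` method breaks ties by first occurrence, which can order equal counts differently from sorting `(count, word)` tuples in reverse.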

### Enhanced *PrintTextStatisticsWithTopWords* Function

The original *PrintTextStatistics* function above will be modified to also print the top 10 most frequently occurring words in the text file.

To find the number of characters (no spaces), the original *PrintTextStatistics* function applies the __len__ function to a new string variable (*textNoSpaces*) generated from the original input text. The new implementation below instead uses a regular expression to find all non-whitespace characters and reports the total number of matches as the number of characters (no spaces).
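
The two ways of counting non-whitespace characters should always agree, which can be checked quickly on a made-up sample containing a space, a tab, and a newline. `\S` is used here as shorthand for `[^\s]`.

```python
import re

sample = "In a hole\tin the\nground."

# Approach 1: delete all whitespace, then measure what is left.
count_by_removal = len(re.sub(r'\s', '', sample))

# Approach 2: collect every non-whitespace character and count the matches.
count_by_matching = len(re.findall(r'\S', sample))

print(count_by_removal, count_by_matching)  # 19 19
```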

In [20]:
import re

def PrintTextStatisticsWithTopWords(filename):
    """ Print text statistics for a given text file including most frequently occurring words """
        
    with open(filename, 'r') as myfile:
        text = myfile.read()
        
    # characters (with spaces)
    numChars = len(text)
    
    # characters (with no spaces): count every non-whitespace match
    matchNonSpace = re.findall(r'[^\s]', text)
    numCharsNoSpace = len(matchNonSpace)
        
    # lines
    numLines = 1 + text.count("\n") if len(text) > 0 else 0 # assume default of 1 line if there is content
    
    text = text.strip() # remove leading/trailing spaces for counting paragraphs
    
    # paragraphs: each newline plus any whitespace after it marks a break
    regex = r"\n\s*"
    matchesPar = re.findall(regex, text)
    numParagraphs = 1 + len(matchesPar) if len(text) > 0 else 0 # assume default of 1 paragraph if there is content
    
    wordFrequency = GetWordFrequencyDictionary(text)
    
    # number of words: total of all per-word counts (the built-in sum suffices, no numpy needed)
    numWords = sum(wordFrequency.values())
    
    # create list of (count, word) from the dictionary, highest count first
    wordList = [(count, word) for word, count in wordFrequency.items()]
    wordList.sort(reverse=True)
    
    # print text statistics
    print("Text statistics for file: ", filename)
    print("Number of words: ", numWords)
    print("Number of characters (with spaces): ", numChars)
    print("Number of characters (no spaces): ", numCharsNoSpace)
    print("Number of lines: ", numLines)
    print("Number of paragraphs: ", numParagraphs)
    print("\nThe top 10 words in the file are: ")
    for rank, (count, word) in enumerate(wordList[:10], start=1):
        print("{0:2d}. {1:10} {2:4d}".format(rank, word, count))

In [21]:
PrintTextStatisticsWithTopWords("data/shakespeare.txt")

Text statistics for file:  data/shakespeare.txt
Number of words:  900068
Number of characters (with spaces):  5458199
Number of characters (no spaces):  4039809
Number of lines:  124457
Number of paragraphs:  114840

The top 10 words in the file are: 
 1. the        27594
 2. and        26704
 3. i          20248
 4. to         19165
 5. of         18164
 6. a          14430
 7. you        13568
 8. my         12461
 9. that       11098
10. in         10953


Notice that the top 10 words from the real-world text file of The Complete Works of William Shakespeare (available at http://www.gutenberg.org) are in the list of the most common words in English (https://en.wikipedia.org/wiki/Most_common_words_in_English). 

### Acknowledgements

This project was inspired by the "*Case Study: Text Statistics*" chapter from the book, *Python: Visual QuickStart Guide (3rd Edition)*. 

This project improves upon the book's approach by using regular expressions to find patterns in the input text. Additional text statistics, such as the number of paragraphs and the number of characters (no spaces), are also provided.

The results from this project are also compared with existing third-party software that provides similar text statistics.