# Word Count Analysis
## Introduction


This notebook implements a word count analysis tool. The program reads a text file, validates input words, and counts occurrences of search terms efficiently while ensuring search terms match whole words in the text.

The notebook uses Python's built-in `re` module for regular expressions.


The function `word_count_summary()` accepts two arguments: `file_path` and `search_terms`.

`search_terms` can be either a single string or a list of strings. Each string may contain only these characters: A-Z, a-z, 0-9, and underscore.
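
As a minimal sketch of the character rule above: in Python 3, `\w` in a `str` pattern also matches non-ASCII letters and digits unless `re.ASCII` is set, so an explicit character class is the stricter reading of "A-Z, a-z, 0-9, and underscore". The names `ALLOWED` and `is_valid_term` are hypothetical and are not part of the notebook's code.

```python
import re

# Explicit ASCII class: letters, digits, and underscore only.
ALLOWED = re.compile(r'[A-Za-z0-9_]+')

def is_valid_term(term):
    """Return True if term is a non-empty str made only of A-Z, a-z, 0-9, or '_'."""
    return isinstance(term, str) and ALLOWED.fullmatch(term) is not None

print(is_valid_term("Elizabeth"))       # True
print(is_valid_term("15_th October_"))  # False: contains a space
print(is_valid_term("café"))            # False here, although r'\w+' would accept it
```

Unlike the notebook's validators, this sketch returns a boolean instead of raising, which keeps the rule easy to test in isolation.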

## Steps in the Solution
- Validation functions: `validate_word()` and `validate_search_terms()` check that words and search terms adhere to the required format.
- Word boundary detection: `is_whole_word()` makes sure the search terms match only as whole words in the text.
- Result formatting: `print_single_word_count()` and `print_word_count_table()` display the results either as a single word count or as a table covering multiple search terms.
- Word count function: `word_count_summary()` reads the text file and counts the occurrences of the search terms, using the utility functions above.
- Error handling: unexpected data types and invalid characters are caught and reported.
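
The whole-word step in the list above can be illustrated with a small sketch. It assumes a regular-expression word boundary (`\b`) is an acceptable way to express "whole word"; the helper name `count_whole_word` and the sample sentence are made up for illustration.

```python
import re

def count_whole_word(word, text):
    """Count case-sensitive whole-word occurrences of word in text."""
    # re.escape keeps the term literal; \b anchors both ends at word boundaries.
    pattern = re.compile(rf'\b{re.escape(word)}\b')
    return len(pattern.findall(text))

sample = "There is the other theme; the end."
print(count_whole_word("the", sample))    # 2: 'other' and 'theme' are not counted
print(count_whole_word("There", sample))  # 1: matching is case-sensitive
```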

This is the first working draft of the function. An improved version follows later in the notebook.

In [95]:
import re

def validate_word(word):
    """Validate that the word contains only allowed characters (A-Z, a-z, 0-9, and underscore)."""
    # Use an explicit ASCII class: in Python 3, \w would also match non-ASCII letters and digits
    if not re.fullmatch(r'[A-Za-z0-9_]+', word):
        raise ValueError(f"The word '{word}' contains invalid characters. Only A-Z, a-z, 0-9, and underscore '_' are allowed.")

def validate_search_terms(search_terms):
    """Validate search_terms as a string or a list of strings, checking that every string contains only allowed characters. Return the terms as a list."""
    if isinstance(search_terms, str):
        # Validate a single string directly
        validate_word(search_terms)
        return [search_terms]
    elif isinstance(search_terms, list):
        # Check that every item in the list is a valid string; raise an error otherwise
        for term in search_terms:
            # Ensure each term is a string
            if not isinstance(term, str):
                raise TypeError(f"Expected a string, but got {type(term).__name__} for term '{term}'.")
            validate_word(term)
        # Return the original list if all terms are valid
        return search_terms
    else:
        raise TypeError("search_terms should be a string or a list of strings.")

def is_whole_word(word, text):
    """Check if the word appears as a whole word in the processed text, matching case exactly."""
    index = text.find(word)
    while index != -1:
        # Ensure that the found word is at a word boundary
        if (index == 0 or not text[index - 1].isalnum() and text[index - 1] != '_') and \
           (index + len(word) == len(text) or not text[index + len(word)].isalnum() and text[index + len(word)] != '_'):
            return True
        index = text.find(word, index + 1)
    return False
    

def print_single_word_count(word, words):
    """Print how many times a single word appears in the list of words."""

    word_count = sum(is_whole_word(word, w) for w in words)
    if word_count == 1:
        print(f'The word `{word}` appears {word_count} time.')
    else:
        print(f'The word `{word}` appears {word_count} times.')

def print_word_count_table(search_terms, counts):
    """Print the search results as a formatted table without using external packages."""
    # Determine maximum length for keyword and count columns
    max_word_length = max(len(word) for word in search_terms + ["WORD"])
    max_count_length = max(len(str(count)) for count in counts + ["COUNT"])

    # Calculate column widths based on maximum lengths, adding padding
    word_col_width = max_word_length + 2  # Add padding for spaces
    count_col_width = max_count_length + 2  # Add padding for spaces
    
    # Print header with dynamic dashes
    print(f"|{'-' * word_col_width}|{'-' * count_col_width}|")
    print(f"| {'WORD'.ljust(word_col_width - 1)}| {'COUNT'.rjust(count_col_width - 1)}|")
    print(f"|{'-' * word_col_width}|{'-' * count_col_width}|")
    
    # Print each row with word and count, left-aligned words and right-aligned counts
    for keyword, count in zip(search_terms, counts):
        print(f"| {keyword.ljust(word_col_width - 1)}| {str(count).rjust(count_col_width - 1)}|")
    
    # Print footer with total
    total_count = sum(counts)
    print(f"|{'-' * word_col_width}|{'-' * count_col_width}|")
    print(f"| {'TOTAL'.ljust(word_col_width - 1)}| {str(total_count).rjust(count_col_width - 1)}|")
    print(f"|{'-' * word_col_width}|{'-' * count_col_width}|")

def word_count_summary(file_path, search_terms):
    """Read the file content and count the occurrences of search terms.

    Args:
        file_path (str): The path to the file to be read.
        search_terms (str or list): The search term(s) to count in the file, given as a string or a list of strings.
            Each string may contain only A-Z, a-z, 0-9, and _.

        A single string is treated as one search term; it is not split into separate words.
        A list is validated to ensure that all of its elements are strings.
    """
    try:
        # call Validate function
        search_terms = validate_search_terms(search_terms)

        # Read file 
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
            
            # Split content into words using a whole-word pattern
            words = re.findall(r'\b\w+\b', content)

            # Proceed with counting based on the normalized search_terms
            counts = [sum(is_whole_word(keyword, word) for word in words) for keyword in search_terms]
            
            
            if len(search_terms) == 1:
                print_single_word_count(search_terms[0], words)
            else:
                print_word_count_table(search_terms, counts)

    except FileNotFoundError:
        print('Error: File not found.')
    except ValueError as ve:
        print(f"Validation error: {ve}")
    except TypeError as te:
        print(f"Type error: {te}")
    except Exception as e:
        print(f'An error occurred: {str(e)}')




In [96]:
# Demonstrate that this is a working solution
word_count_summary("../../pride-and-prejudice.txt", "the")

The word `the` appears 4060 times.


#### Second, improved version of the script

To optimize the code above, here are some potential improvements:

 1. Store the search terms in an order-preserving set in validate_search_terms(), so that duplicates are removed.
 2. Precompile regular expressions with re.compile() instead of passing the pattern string on every call.
 3. Simplify the is_whole_word() condition with a regular expression.
 4. Improve the comments and add more detailed docstrings that document argument types.
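
Improvements 1 and 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's implementation: it swaps the per-token `is_whole_word()` check for a `collections.Counter` lookup, and the names `WORD_PATTERN`, `dedupe_keep_order`, and `count_terms` are hypothetical.

```python
import re
from collections import Counter

# Compiled once at import time and reused on every call.
WORD_PATTERN = re.compile(r'\b\w+\b')

def dedupe_keep_order(terms):
    """Drop duplicate terms while keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(terms))

def count_terms(text, terms):
    """Tokenize text once, then look up each unique term's count."""
    tokens = Counter(WORD_PATTERN.findall(text))
    # A Counter returns 0 for missing keys, so absent terms need no special case.
    return {term: tokens[term] for term in dedupe_keep_order(terms)}

text = "the cat and the hat and the bat"
print(dedupe_keep_order(['the', 'is', 'and', 'the']))  # ['the', 'is', 'and']
print(count_terms(text, ['the', 'is', 'and', 'the']))  # {'the': 3, 'is': 0, 'and': 2}
```

Because every token produced by `\b\w+\b` consists only of word characters, comparing a term to a token for equality gives the same result as a whole-word search inside that token.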



In [71]:
import re

def validate_word(word):
    """
    Validate the word to ensure it contains only valid characters.
    
    Args:
        word (str): The word to validate.

    Raises:
        ValueError: If the word contains invalid characters.
    """
    # Explicit ASCII class: in Python 3, \w would also match non-ASCII letters and digits
    valid_word_pattern = re.compile(r'[A-Za-z0-9_]+')
    if not valid_word_pattern.fullmatch(word):
        raise ValueError(f"The word '{word}' contains invalid characters. Only A-Z, a-z, 0-9, and underscore '_' are allowed.")

def validate_search_terms(search_terms):
    """
    Validate search_terms as a string or a list of strings and ensure uniqueness while maintaining order.
    
    Args:
        search_terms (str or list): Words to validate.
    
    Returns:
        list: Validated and normalized list of search terms.

    Raises:
        TypeError: If search_terms is not a string or a list of strings.
        ValueError: If any term contains invalid characters.
    """
    if isinstance(search_terms, str):
        validate_word(search_terms)
        return [search_terms]
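    # Validate each element of the list against [A-Za-z0-9_] as well
    elif isinstance(search_terms, list):
        # Ensure each term is a string before de-duplicating
        for term in search_terms:
            if not isinstance(term, str):
                raise TypeError(f"Expected a string, but got {type(term).__name__} for term '{term}'.")
        # dict.fromkeys() acts as an ordered set: it drops duplicates and keeps first-seen order
        ordered_set = list(dict.fromkeys(search_terms))
        for term in ordered_set:
            validate_word(term)
        return ordered_set
    else:
        raise TypeError("search_terms should be a string or a list of strings. String only allows A-Z, a-z, 0-9, and underscore '_'.")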

def is_whole_word(word, text):
    """
    Check if the word appears as a whole word in the text, so that searching for 'there' does not also count 'therefore'.

    Args:
        word (str): The word to search for.
        text (str): The text to search within.

    Returns:
        bool: True if the word appears as a whole word, False otherwise.
    """
    # Use re.escape to escape special characters in the word
    pattern = re.compile(rf'\b{re.escape(word)}\b')
    return bool(pattern.search(text))

def print_single_word_count(word, words):
    """
    Print the count of a single word in the text.
    
    Args:
        word (str): The word to count.
        words (list): List of words in the text.
    """
    word_count = sum(is_whole_word(word, w) for w in words)
    if word_count == 1:
        print(f'The word `{word}` appears {word_count} time.')
    else:
        print(f'The word `{word}` appears {word_count} times.')

def print_word_count_table(search_terms, counts):
    """
    Print a table of word counts in a formatted manner.
    
    Args:
        search_terms (list): List of search terms.
        counts (list): List of counts corresponding to the search terms.
    """

    # Determine maximum length for keyword and count columns
    max_word_length = max(len(word) for word in search_terms + ["WORD"])
    max_count_length = max(len(str(count)) for count in counts + ["COUNT"])

    # Calculate column widths based on maximum lengths, adding padding 
    word_col_width = max_word_length + 2  
    count_col_width = max_count_length + 2  
    
    # Print header with dynamic dashes
    print(f"|{'-' * word_col_width}|{'-' * count_col_width}|")
    print(f"| {'WORD'.ljust(word_col_width - 1)}| {'COUNT'.rjust(count_col_width - 1)}|")
    print(f"|{'-' * word_col_width}|{'-' * count_col_width}|")
    
    # Print each row with word and count, left-aligned words and right-aligned counts
    for keyword, count in zip(search_terms, counts):
        print(f"| {keyword.ljust(word_col_width - 1)}| {str(count).rjust(count_col_width - 1)}|")
    
    # Print footer with total
    total_count = sum(counts)
    print(f"|{'-' * word_col_width}|{'-' * count_col_width}|")
    print(f"| {'TOTAL'.ljust(word_col_width - 1)}| {str(total_count).rjust(count_col_width - 1)}|")
    print(f"|{'-' * word_col_width}|{'-' * count_col_width}|")

def word_count_summary(file_path, search_terms):
    """Read the file content and count the occurrences of search terms.

    Args:
        file_path (str): The path to the file to be read.
        search_terms (str or list): The search term(s) to count in the file, given as a string or a list of strings.
            Each string may contain only A-Z, a-z, 0-9, and _.

        A single string is treated as one search term; it is not split into separate words.
        A list is validated to ensure that all of its elements are strings, and duplicates are removed.
    """
    try:
        # Validate and normalize search terms
        search_terms = validate_search_terms(search_terms)

        # Read and process the file content
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
            
            # Split content into words using a whole-word pattern
            words = re.findall(r'\b\w+\b', content)

            # Proceed with counting based on the normalized search_terms
            counts = [sum(is_whole_word(keyword, word) for word in words) for keyword in search_terms]
            
            # Print results
            if len(search_terms) == 1:
                print_single_word_count(search_terms[0], words)
            else:
                print_word_count_table(search_terms, counts)

    except FileNotFoundError:
        print('Error: File not found.')
    except ValueError as ve:
        print(f"Validation error: {ve}")
    except TypeError as te:
        print(f"Type error: {te}")
    except Exception as e:
        print(f'An error occurred: {str(e)}')




# Testing section

In [81]:
word_count_summary('../../pride-and-prejudice.txt', 5)

Type error: search_terms should be a string or a list of strings. String only allows A-Z, a-z, 0-9, and underscore '_'.


In [92]:
word_count_summary('../../pride-and-prejudice.txt', "15_th October_")


Validation error: The word '15_th October_' contains invalid characters. Only A-Z, a-z, 0-9, and underscore '_' are allowed.


In [90]:
word_count_summary('../../pride-and-prejudice.txt', "_there_")

The word `_there_` appears 5 times.


In [91]:
word_count_summary('../../pride-and-prejudice.txt', "there")

The word `there` appears 285 times.
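
The separate counts for `_there_` and `there` are consistent with the underscore being a word character: the tokenizer pattern `\b\w+\b` keeps `_there_` as a single token, so it never counts toward `there`. A small check on a made-up sentence (not taken from the novel):

```python
import re

# Underscore belongs to \w, so '_there_' stays one token.
tokens = re.findall(r'\b\w+\b', "He said _there_ and pointed there.")
print(tokens)  # ['He', 'said', '_there_', 'and', 'pointed', 'there']
print(tokens.count("there"), tokens.count("_there_"))  # 1 1
```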


In [72]:
word_count_summary('../../pride-and-prejudice.txt', "Elizabeth")

The word `Elizabeth` appears 634 times.


In [73]:
word_count_summary("../../pride-and-prejudice.txt", "the")

The word `the` appears 4060 times.


In [93]:
word_count_summary("../../pride-and-prejudice.txt", ["the",5])

Type error: Expected a string, but got int for term '5'.


In [74]:
word_count_summary("../../pride-and-prejudice.txt", ["Jane", "Elizabeth", "Mary", "Kitty", "Lydia"])

|-----------|-------|
| WORD      |  COUNT|
|-----------|-------|
| Jane      |    292|
| Elizabeth |    634|
| Mary      |     39|
| Kitty     |     71|
| Lydia     |    170|
|-----------|-------|
| TOTAL     |   1206|
|-----------|-------|


In [75]:
word_count_summary("../../pride-and-prejudice.txt", ['the', 'is', 'and', 'the'] )

|------|-------|
| WORD |  COUNT|
|------|-------|
| the  |   4060|
| is   |    831|
| and  |   3434|
|------|-------|
| TOTAL|   8325|
|------|-------|


In [77]:
word_count_summary("../../a-tale-of-two-cities.txt", "times")

The word `times` appears 51 times.


In [78]:
print(word_count_summary("../../a-tale-of-two-cities.txt", ["London", "Paris"])) 

|--------|-------|
| WORD   |  COUNT|
|--------|-------|
| London |     28|
| Paris  |     63|
|--------|-------|
| TOTAL  |     91|
|--------|-------|
None


In [79]:
print(word_count_summary("../../a-tale-of-two-cities.txt", "pizza"))

The word `pizza` appears 0 times.
None
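
The trailing `None` in the last two outputs appears because `word_count_summary()` prints its results and returns nothing, so wrapping the call in `print()` prints the implicit `None` return value. If the counts are needed programmatically, one option is a variant that returns them. The sketch below is hypothetical (the name `count_words_in_file` is not part of the notebook) and skips validation and error handling for brevity; it runs on a small temporary file rather than the novels used above.

```python
import os
import re
import tempfile

def count_words_in_file(file_path, search_terms):
    """Return a {term: count} dict instead of printing (hypothetical helper)."""
    with open(file_path, 'r', encoding='utf-8') as file:
        tokens = re.findall(r'\b\w+\b', file.read())
    return {term: tokens.count(term) for term in search_terms}

# Write a tiny sample file so the example is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write("London and Paris; Paris again.")
    path = tmp.name

result = count_words_in_file(path, ["London", "Paris", "pizza"])
os.remove(path)  # clean up the temporary file
print(result)  # {'London': 1, 'Paris': 2, 'pizza': 0}
```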
