# L15: Text Analysis (Part I)
[1. Regular Expressions](#1.-Regular-Expressions)\
[2. Dictionary-Based Textual Analysis](#2.-Dictionary-Based-Textual-Analysis)

# 1. Regular Expressions

Import Python's regular expressions module `re`

In [1]:
import re

### Looking for Patterns in Text

This code demonstrates how to match a basic regex `r"OI"` in a sentence.

In [2]:
# loads Python's regular expressions module
import re

text = "OI for FY 2019 was 12.4 billion, up more than eight percent from OI in FY 2018."

# returns a Match object of the first match, if it exists
x1 = re.search(r"OI", text)

# finds all matches of "OI"
x2 = re.findall(r"OI", text)

# splits text at ","
x3 = re.split(r",", text)

# replaces "OI" with "Operating Income"
x4 = re.sub(r"OI", "Operating Income", text)

print(f'Result of re.search:\n{x1}\n')
print(f'Result of re.findall:\n{x2}\n')
print(f'Result of re.split:\n{x3}\n')
print(f'Result of re.sub:\n{x4}')

Result of re.search:
<re.Match object; span=(0, 2), match='OI'>

Result of re.findall:
['OI', 'OI']

Result of re.split:
['OI for FY 2019 was 12.4 billion', ' up more than eight percent from OI in FY 2018.']

Result of re.sub:
Operating Income for FY 2019 was 12.4 billion, up more than eight percent from Operating Income in FY 2018.


To perform case-insensitive regex matching, we can either pass the `re.IGNORECASE` flag to a `re` function or convert the input string to lowercase using the string’s `lower()` method:

In [3]:
x1 = re.findall(r'MD&A', " This year's MD&a Section is located ... Please refer to our md&A section on page ... ", re.IGNORECASE)
x2 = re.findall(r'md&a', " This year's MD&a Section is located ... Please refer to our md&A section on page ... ".lower())
                   
print(x1)
print(x2)

['MD&a', 'md&A']
['md&a', 'md&a']
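As a further illustration of the flag-based option, here is a minimal sketch of two equivalent ways to request case-insensitive matching: compiling the pattern once with `re.IGNORECASE`, or embedding the inline modifier `(?i)` at the start of the pattern. The variable names (`sample`, `mdna_pattern`, `compiled_matches`, `inline_matches`) are chosen for this sketch only.

```python
import re

sample = "This year's MD&a Section is located ... Please refer to our md&A section on page ..."

# Option 1: compile the pattern once with the IGNORECASE flag and reuse it
mdna_pattern = re.compile(r"MD&A", re.IGNORECASE)
compiled_matches = mdna_pattern.findall(sample)

# Option 2: embed the flag in the pattern itself with the inline modifier (?i)
inline_matches = re.findall(r"(?i)MD&A", sample)

print(compiled_matches)
print(inline_matches)
```

Both calls return the matches in their original casing, unlike the `lower()` approach, which changes the text before matching.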


### Character Sets in Regex

In [4]:
text = "This project has increased our revenues by more than 70% in FY 2019."

# returns all single-digit matches
x1 = re.findall(r'[0-9]', text)

# returns every character that is not a letter, space, period, or comma
x2 = re.findall(r'[^a-zA-Z \.,]', text)

# returns all two-digit numbers followed by "%"
x3 = re.findall(r'\d\d%', text)

print(x1)
print(x2)
print(x3)

['7', '0', '2', '0', '1', '9']
['7', '0', '%', '2', '0', '1', '9']
['70%']


### Groups in Regex

In [5]:
regex = r"\b(?P<word>\w+)\s(?P=word)\b"
text = "we we expect this this trend to continue"

# prints an output where all double words are replaced with a single word
print(re.sub(regex,r"\g<word>", text))

we expect this trend to continue


In [6]:
date = "09/14/2020"

# specifies three named groups, namely 'Month', 'Day', and 'Year'
regex = r"(?P<Month>\d{1,2})/(?P<Day>\d{1,2})/(?P<Year>\d{2,4})"

# identifies regex matches in date
date_matches = re.search(regex, date)
# prints matches for each of the named groups
print("Month: ", date_matches.group('Month'))
print("Day: ", date_matches.group('Day'))
print("Year: ", date_matches.group('Year'))

Month:  09
Day:  14
Year:  2020
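When a text contains several dates, the same named groups can be collected for every match. The sketch below is one possible way to do this with `re.finditer` and `Match.groupdict()`; the sample sentence and the names `date_regex`, `sample_text`, and `parsed_dates` are made up for illustration.

```python
import re

# same named groups as in the example above
date_regex = re.compile(r"(?P<Month>\d{1,2})/(?P<Day>\d{1,2})/(?P<Year>\d{2,4})")

# hypothetical sample text with two dates
sample_text = "The filing dated 09/14/2020 supersedes the one dated 3/5/19."

# finditer yields one Match object per date; groupdict() returns
# a dictionary that maps each group name to the text it captured
parsed_dates = [match.groupdict() for match in date_regex.finditer(sample_text)]

for parsed in parsed_dates:
    print(parsed)
```

Note that the captured values are strings; converting them to integers or to `datetime` objects is a separate step.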


### Examples

**Example 1: Character Sets**

Assume we want to identify all numbers in text followed by % or the word percent. To do so, we create a character set of digits (allowing for period ‘.’ in numbers); we also specify the “% vs. percent” option using the regex “or” symbol, |.

In [7]:
text = """This project has resulted in over 70% of our 2019 revenues to date. 
As a result, our operating income increased by 9%, while our operating expenses 
increased by 12%. We had a 12.5 percent increase in regional sales."""

# recall that ?: after the left parenthesis specifies a non-capturing group
x = re.findall(r'[\d\.]+(?:\%|\s\bpercent\b)', text)
print(x)

['70%', '9%', '12%', '12.5 percent']
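If the goal is to work with the percentages numerically, one option is to put the number in a capturing group so that `re.findall` returns only the digits. The following is a sketch under that assumption; the stricter number pattern `\d+(?:\.\d+)?` and the names `numbers` and `percentages` are choices made for this illustration, not part of the example above.

```python
import re

text = """This project has resulted in over 70% of our 2019 revenues to date.
As a result, our operating income increased by 9%, while our operating expenses
increased by 12%. We had a 12.5 percent increase in regional sales."""

# the capturing group holds only the number; \d+(?:\.\d+)? requires digits
# on both sides of the decimal point, so a lone "." cannot match
numbers = re.findall(r"(\d+(?:\.\d+)?)(?:%|\spercent\b)", text)

# convert the captured strings to floats for further calculations
percentages = [float(number) for number in numbers]

print(percentages)
```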


**Example 2: Character Sets, Quantifiers, Groups, Lookbehinds**

Assume we want to identify basic company information – CIK number, company name, filing date, and SIC industry code – from the company filing’s SEC header. To do so, we need to create separate regular expressions for all items that we want to capture.\
The following SEC filing is from https://www.sec.gov/Archives/edgar/data/80424/000008042418000100/0000080424-18-000100.txt

In [8]:
# an example of a standard input header used in SEC filings
header = """<SEC-DOCUMENT>0000080424-18-000100.txt : 20181019
<SEC-HEADER>0000080424-18-000100.hdr.sgml : 20181019
<ACCEPTANCE-DATETIME>20181019161731
ACCESSION NUMBER:		0000080424-18-000100
CONFORMED SUBMISSION TYPE:	10-Q
PUBLIC DOCUMENT COUNT:		68
CONFORMED PERIOD OF REPORT:	20180930
FILED AS OF DATE:		20181019
DATE AS OF CHANGE:		20181019

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			PROCTER & GAMBLE Co
		CENTRAL INDEX KEY:			0000080424
		STANDARD INDUSTRIAL CLASSIFICATION:	SOAP, DETERGENT, CLEANING PREPARATIONS, PERFUMES, COSMETICS [2840]
		IRS NUMBER:				310411980
		STATE OF INCORPORATION:			OH
		FISCAL YEAR END:			0630
[...]
</SEC-HEADER>"""

# CIK is a 10-digit number, so we use the quantifier
# {10} to consider only 10-digit numbers in the match.
# This regex specifies the text "CENTRAL INDEX KEY:",
# followed by whitespace \s (matched zero or more times,
# as indicated by *), followed by a group capturing
# a 10-digit number.
# Also, note that re.findall with a single capturing group
# returns only the group match, not the full match
cik = re.findall(r"CENTRAL INDEX KEY:\s*(\d{10})", header)

# This regex specifies the text "COMPANY CONFORMED NAME:",
# followed by whitespace \s (matched zero or more times),
# followed by a group capturing any character one or
# more times.
# The MULTILINE flag makes ^ and $ match at the
# beginning and end of each line instead of only at
# the beginning and end of the whole string
company_name = re.findall(r"COMPANY CONFORMED NAME:\s*(.+)$", header, re.MULTILINE)

# This regex specifies the text "FILED AS OF DATE:",
# followed by whitespace \s (matched zero or more times),
# followed by a group capturing an 8-digit number, as
# all dates in SEC headers are in the YYYYMMDD
# format
filing_date = re.findall(r'FILED AS OF DATE:\s*(\d{8})', header)

# This regex uses a positive lookbehind to start matching
# right after the text "STANDARD INDUSTRIAL
# CLASSIFICATION:"; the greedy .+ then consumes the rest of
# the line and backtracks so that the group captures the
# last 4-digit number on that line (the SIC code)
sic = re.findall(r'(?<=STANDARD INDUSTRIAL CLASSIFICATION:).+(\d{4})', header)

print(cik)
print(company_name)
print(filing_date)
print(sic)

['0000080424']
['PROCTER & GAMBLE Co']
['20181019']
['2840']
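When many filings are processed, some header fields may be missing, and indexing into an empty `re.findall` result would fail. The sketch below shows one way to collect the fields into a dictionary with `re.search`, returning `None` for anything not found. The shortened header fragment, the `HEADER_PATTERNS` table, the `parse_header` helper, the bracket-based SIC pattern, and the `state` field are all assumptions made for this illustration.

```python
import re

# shortened, hypothetical header fragment in the same layout as above
header = """CONFORMED SUBMISSION TYPE:\t10-Q
FILED AS OF DATE:\t\t20181019
\t\tCOMPANY CONFORMED NAME:\t\t\tPROCTER & GAMBLE Co
\t\tCENTRAL INDEX KEY:\t\t\t0000080424
\t\tSTANDARD INDUSTRIAL CLASSIFICATION:\tSOAP, DETERGENT, CLEANING PREPARATIONS, PERFUMES, COSMETICS [2840]
"""

# one compiled pattern per field; each has exactly one capturing group
HEADER_PATTERNS = {
    "cik": re.compile(r"CENTRAL INDEX KEY:\s*(\d{10})"),
    "company_name": re.compile(r"COMPANY CONFORMED NAME:\s*(.+)$", re.MULTILINE),
    "filing_date": re.compile(r"FILED AS OF DATE:\s*(\d{8})"),
    # variation on the lookbehind version: capture the digits inside [...]
    "sic": re.compile(r"STANDARD INDUSTRIAL CLASSIFICATION:.*\[(\d{4})\]"),
    # this field is absent from the fragment above, to show the None case
    "state": re.compile(r"STATE OF INCORPORATION:\s*(\w{2})"),
}

def parse_header(header_text: str) -> dict:
    """Return a dict of header fields; missing fields map to None."""
    fields = {}
    for name, pattern in HEADER_PATTERNS.items():
        match = pattern.search(header_text)
        fields[name] = match.group(1) if match else None
    return fields

fields = parse_header(header)
print(fields)
```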


**Example 3: Word Boundaries and Quantifiers**

Assume we want to identify all word matches that start with “risk”, but can have different endings (e.g., risks, risky, risking, etc.). We also want to calculate the percentage of such risk words relative to all words in text. To do so, we use regex word boundaries to perform single word matches.

In [9]:
text = """An investment in our common stock involves a
high degree of risk. You should carefully consider
the risks summarized below. The risks are discussed
more fully in the Risk Factors section of this
prospectus immediately following this prospectus
summary. These risks include, but are not limited
to, the following [...] These operations are risky
[...] Macroeconomic fluctuations increase the
riskiness of our operations. As indicated in
Section 2.1, our company's long-term risks include
[...] """

# this regex matches a word boundary followed by the
# text string 'risk', followed by a word character
# (repeated zero or more times), followed by a word
# boundary; re.IGNORECASE specifies case-insensitive
# matching
risk_words = re.findall(r"\brisk\w*\b", text, re.IGNORECASE)

# matches all single words in text (allowing for
# hyphens and apostrophes within words)
all_words = re.findall(r"\b[a-zA-Z\'\-]+\b", text)

# len() here returns the number of words that start
# with the string 'risk', i.e., the number of
# matches in the risk_words list
risk_words_freq = len(risk_words)

# the number of all words in text
all_words_freq = len(all_words)

# percentage of risk-related words in text
text_riskiness = 100 * risk_words_freq / all_words_freq

print(risk_words)
print(text_riskiness)

['risk', 'risks', 'risks', 'Risk', 'risks', 'risky', 'riskiness', 'risks']
11.428571428571429
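Beyond the overall percentage, it can be useful to see how often each variant of a risk word occurs. The sketch below tallies lowercased matches with `collections.Counter`; the short sample sentence and the names `sample`, `risk_words`, and `variant_counts` are hypothetical.

```python
import re
from collections import Counter

# hypothetical sample text
sample = "Risk factors include market risk and credit risks. These risks are risky."

# same pattern as above; lowercasing merges 'Risk' and 'risk' into one key
risk_words = [word.lower() for word in re.findall(r"\brisk\w*\b", sample, re.IGNORECASE)]

# Counter maps each distinct variant to its number of occurrences
variant_counts = Counter(risk_words)

for variant, count in variant_counts.most_common():
    print(variant, count)
```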


# 2. Dictionary-Based Textual Analysis

### Identifying Words and Sentences in Text

In [10]:
text = """We invested in six areas of the business that
account for nearly 40% of total Macy's sales.
Dresses, fine jewelry, big ticket, men's tailored,
women's shoes and beauty, these investments were
aimed at driving growth through great products, 
top-performing colleagues, improved environment and
enhanced marketing. All six areas continued to
outperform the balance of the business on market
share, return on investment and profitability. And
we capture approximately 9% of the market in these
categories."""

x = re.findall(r"\b[a-zA-Z\'\-]+\b", text)
# Regex "\b[a-zA-Z\'\-]+\b" searches for all words in
# text, allowing apostrophes and hyphens in words,
# e.g., company's, state-of-the-art

print(x)
print(len(x))

['We', 'invested', 'in', 'six', 'areas', 'of', 'the', 'business', 'that', 'account', 'for', 'nearly', 'of', 'total', "Macy's", 'sales', 'Dresses', 'fine', 'jewelry', 'big', 'ticket', "men's", 'tailored', "women's", 'shoes', 'and', 'beauty', 'these', 'investments', 'were', 'aimed', 'at', 'driving', 'growth', 'through', 'great', 'products', 'top-performing', 'colleagues', 'improved', 'environment', 'and', 'enhanced', 'marketing', 'All', 'six', 'areas', 'continued', 'to', 'outperform', 'the', 'balance', 'of', 'the', 'business', 'on', 'market', 'share', 'return', 'on', 'investment', 'and', 'profitability', 'And', 'we', 'capture', 'approximately', 'of', 'the', 'market', 'in', 'these', 'categories']
73


In [11]:
# Regex pattern that identifies a sentence;
# re.compile compiles a regular expression pattern
# into a regular expression object in Python
sentence_regex=re.compile(r"\b[A-Z](?:[^\.!?]|\.\d)*[\.!?]")

def identify_sentences(input_text: str):
    # finds all matches of sentence_regex in input_text
    sentences = re.findall(sentence_regex, input_text)
    return sentences

sentences = identify_sentences(text)

# enumerate is a Python function that, when applied to
# a list, returns the list elements along with their
# indexes (counter); 1 indicates that the counter
# should start from 1 instead of the default 0
for counter, sentence in enumerate(sentences, 1):
    print(counter, sentence)

1 We invested in six areas of the business that
account for nearly 40% of total Macy's sales.
2 Dresses, fine jewelry, big ticket, men's tailored,
women's shoes and beauty, these investments were
aimed at driving growth through great products, 
top-performing colleagues, improved environment and
enhanced marketing.
3 All six areas continued to
outperform the balance of the business on market
share, return on investment and profitability.
4 And
we capture approximately 9% of the market in these
categories.
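To make the behavior of this sentence pattern more concrete, the sketch below runs it on two short, made-up strings. The first shows that the `\.\d` alternative keeps decimal numbers inside a sentence; the second shows a limitation, since abbreviations with periods are split into fragments. The sample strings and variable names are assumptions made for this illustration.

```python
import re

# same sentence pattern as above
sentence_regex = re.compile(r"\b[A-Z](?:[^\.!?]|\.\d)*[\.!?]")

# a period followed by a digit is treated as part of the sentence
decimal_sample = "Revenue rose 3.5% in Q2. Costs fell."
decimal_sentences = sentence_regex.findall(decimal_sample)

# abbreviations with periods are a weak spot of this simple pattern:
# each capital letter plus period is matched as its own "sentence"
abbreviation_sample = "U.S. sales rose."
abbreviation_sentences = sentence_regex.findall(abbreviation_sample)

print(decimal_sentences)
print(abbreviation_sentences)
```

Cases like the second one are a common reason to prefer a trained sentence segmenter over a hand-written pattern.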


### spacy

Install `spacy` and its English (or other language) model: \
https://spacy.io/usage \
https://spacy.io/models/en
* pip install -U pip setuptools wheel
* pip install -U spacy
* python -m spacy download en_core_web_sm

Use `spacy` to identify words and sentences

In [12]:
import spacy

# load the English language model in spacy
nlp = spacy.load('en_core_web_sm')

# parse the text with the loaded pipeline; this returns a spaCy Doc object
a_text = nlp(text)

# create a list of word tokens; note, this list will
# include punctuation marks and other symbols
token_list = []
for token in a_text:
    token_list.append(token.text)
print(token_list)

sentences = list(a_text.sents)

# print all sentences
for counter, sentence in enumerate(sentences, 1):
    print(counter, sentence)

['We', 'invested', 'in', 'six', 'areas', 'of', 'the', 'business', 'that', '\n', 'account', 'for', 'nearly', '40', '%', 'of', 'total', 'Macy', "'s", 'sales', '.', '\n', 'Dresses', ',', 'fine', 'jewelry', ',', 'big', 'ticket', ',', 'men', "'s", 'tailored', ',', '\n', 'women', "'s", 'shoes', 'and', 'beauty', ',', 'these', 'investments', 'were', '\n', 'aimed', 'at', 'driving', 'growth', 'through', 'great', 'products', ',', '\n', 'top', '-', 'performing', 'colleagues', ',', 'improved', 'environment', 'and', '\n', 'enhanced', 'marketing', '.', 'All', 'six', 'areas', 'continued', 'to', '\n', 'outperform', 'the', 'balance', 'of', 'the', 'business', 'on', 'market', '\n', 'share', ',', 'return', 'on', 'investment', 'and', 'profitability', '.', 'And', '\n', 'we', 'capture', 'approximately', '9', '%', 'of', 'the', 'market', 'in', 'these', '\n', 'categories', '.']
1 We invested in six areas of the business that
account for nearly 40% of total Macy's sales.

2 Dresses, fine jewelry, big ticket, men'

### Stemming and Lemmatization

We first need to import NLTK’s stemming and lemmatization modules and then apply the `stem` and `lemmatize` methods to the words of interest.


In [14]:
import nltk
nltk.download('wordnet')

# import Porter stemmer Module
from nltk.stem import PorterStemmer
# import WordNet lemmatization Module
from nltk.stem import WordNetLemmatizer

# object for Porter stemmer
stemmer = PorterStemmer()
# object for WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# Then, performing stemming on single words is as simple as:
print(f"Stemming for 'increasing' is {stemmer.stem('increasing')}")
print(f"Stemming for 'increases' is {stemmer.stem('increases')}")
print(f"Stemming for 'increased' is {stemmer.stem('increased')}")

# To improve the accuracy of lemmatization, we need to
# provide each word's part of speech (POS); here we specify the POS as verb ("v")
print(f"Lemmatization for 'increasing' is {lemmatizer.lemmatize('increasing', pos='v')}")
print(f"Lemmatization for 'increases' is {lemmatizer.lemmatize('increases', pos='v')}")
print(f"Lemmatization for 'increased' is {lemmatizer.lemmatize('increased', pos='v')}")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...


Stemming for 'increasing' is increas
Stemming for 'increases' is increas
Stemming for 'increased' is increas
Lemmatization for 'increasing' is increase
Lemmatization for 'increases' is increase
Lemmatization for 'increased' is increase


Performing lemmatization or stemming at the sentence level requires more work, as we need to split sentences into single words and identify each word’s part of speech.

In [15]:
# WordNet is just another NLTK corpus reader
from nltk.corpus import wordnet
# note: newer NLTK releases may instead require the
# 'averaged_perceptron_tagger_eng' and 'punkt_tab' resources
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

# import NLTK tokenizer and part-of-speech (POS) tagger
from nltk import word_tokenize, pos_tag
# import Porter stemmer class
from nltk.stem import PorterStemmer
# import WordNet lemmatizer class
from nltk.stem import WordNetLemmatizer

# a default dictionary is similar to Python's regular
# dictionary, but returns a default value if a requested
# key does not exist in the dictionary
from collections import defaultdict

# object for Porter stemmer
stemmer = PorterStemmer()
# object for WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# create a dictionary where single-letter keys are
# mapped to WordNet part-of-speech (noun, adjective, etc.)
# identifiers; by default, if a key does not exist in
# the dictionary, return noun (wordnet.NOUN)
tag_map = defaultdict(lambda: wordnet.NOUN)
# add key 'J' to the dictionary indicating adjective
tag_map['J'] = wordnet.ADJ
# add key 'V' to the dictionary indicating verb
tag_map['V'] = wordnet.VERB
# add key 'R' to the dictionary indicating adverb
tag_map['R'] = wordnet.ADV

text = """We delivered adjusted earnings per share of $2.12.
For the year, comparable sales were down 0.7%
on an owned plus licensed basis, and we delivered
adjusted earnings per share of $2.91."""

# function that stems text
def stem_text(text: str):
    # split text into (word) tokens
    tokens = word_tokenize(text)
    stemmed_text = []
    for token in tokens:
        stem = stemmer.stem(token)
        stemmed_text.append(stem)
    # concatenate stemmed tokens with a
    # space (" ") in between
    return " ".join(stemmed_text)

# function that lemmatizes text
def lemmatize_text(text: str):
    # split text into tokens
    tokens = word_tokenize(text)
    lemmatized_text = []
    for token, tag in pos_tag(tokens):
        # lemmatize word tokens; tag[0] returns the first
        # letter of the POS tag, which is the key in tag_map
        lemma = lemmatizer.lemmatize(token, tag_map[tag[0]])
        lemmatized_text.append(lemma)
    # concatenate lemmatized tokens with a
    # space in between
    return " ".join(lemmatized_text)

# print stemmed version of text
print(stem_text(text))
# print lemmatized version of text
print(lemmatize_text(text))

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


we deliv adjust earn per share of $ 2.12 . for the year , compar sale were down 0.7 % on an own plu licens basi , and we deliv adjust earn per share of $ 2.91 .
We deliver adjusted earnings per share of $ 2.12 . For the year , comparable sale be down 0.7 % on an owned plus licensed basis , and we deliver adjusted earnings per share of $ 2.91 .


### Tone Analysis

We begin by loading dictionaries of the words that we are interested in counting. \
We recommend storing these dictionary files in plain text (.txt) format, either tab-delimited or comma-separated, with every dictionary word or phrase on a separate line. \
In the example below, both “positive.txt” and “negative.txt” dictionary files contain base-form words as well as inflected words (e.g., increase, increases, increasing, increased), so we do not need to perform word stemming or lemmatization.

In [16]:
# Let us start with a simple tone analysis, where each
# word is equally weighted and we do not account for
# negators.

# First, we need to specify the locations of
# our dictionary files.
# file path (location) of a text file with positive
# words; every word is on a separate line in the file
positive_words_dict = r"./positive.txt"
# file path to a text file with negative words
negative_words_dict = r"./negative.txt"

# To be able to match all positive and negative words
# from the dictionaries, we need to create a list of
# regular expressions corresponding to these words

# The following function reads all dictionary terms
# into a Python list and converts the terms to regular
# expressions

def create_dict_regex_list(dict_file: str):
    """Creates a list of regex expressions of
    dictionary terms."""
    # opens the specified dict_file in "r" (read) mode
    with open(dict_file, "r") as file:
        # reads the content of the file
        # line by line and creates a list of
        # dictionary phrases
        dict_terms = file.read().splitlines()
    # re.compile(pattern) compiles a regular
    # expression pattern into an object that can be used
    # for matching with re.search, re.findall, etc.;
    # by adding "\b" (i.e., word boundary) on each
    # side of a dictionary term in the regex, we force
    # an exact match of that dictionary term
    dict_terms_regex = [re.compile(r'\b' + term + r'\b') for term in dict_terms]
    # specifies the output of the function - in our
    # case, a list of regex expressions that
    # correspond to the input dictionary file
    return dict_terms_regex

# Now we can apply our function to create Regex lists
# for positive and negative dictionary terms
positive_dict_regex = create_dict_regex_list(positive_words_dict)
negative_dict_regex = create_dict_regex_list(negative_words_dict)

# print the first three entries of each Regex dictionary
print(positive_dict_regex[0:3])
print(negative_dict_regex[0:3])

[re.compile('\\bable\\b'), re.compile('\\babundance\\b'), re.compile('\\babundant\\b')]
[re.compile('\\babandon\\b'), re.compile('\\babandoned\\b'), re.compile('\\babandoning\\b')]
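Dictionary files sometimes contain blank lines, stray whitespace, or terms with characters that have a special meaning in regular expressions. As a hedged sketch of one way to guard against this, the helper below (a hypothetical `build_term_patterns`, working on an in-memory list rather than a file) strips each term, skips empty entries, and escapes the term with `re.escape` before adding the word boundaries. The sample terms are made up for illustration.

```python
import re

def build_term_patterns(terms):
    """Compile one word-boundary regex per non-empty dictionary term."""
    patterns = []
    for term in terms:
        # remove surrounding whitespace and skip blank lines
        term = term.strip()
        if not term:
            continue
        # re.escape makes characters such as "." match literally
        patterns.append(re.compile(r"\b" + re.escape(term) + r"\b"))
    return patterns

# hypothetical dictionary lines, including a blank line and padded text
sample_terms = ["able", "", "  strong  ", "1.5x"]
patterns = build_term_patterns(sample_terms)

# without escaping, the "." in "1.5x" would also match the "2" in "125x"
escaped_matches = patterns[2].findall("coverage of 1.5x versus 125x")
print(len(patterns), escaped_matches)
```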


Next, we need to write a function that will count positive, negative, and all words in a given text, so we can calculate document Tone as follows:\
$Tone(\%) = 100 \times \dfrac{PositiveWordCount - NegativeWordCount}{TotalWordCount}$
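As a quick arithmetic check of the Tone formula, here is a minimal sketch with made-up counts. The helper name `tone_percent`, the example counts, and the choice to return 0.0 for a text with no words are all assumptions made for this illustration.

```python
def tone_percent(positive_count: int, negative_count: int, total_count: int) -> float:
    """Tone in percent: 100 * (positive - negative) / total."""
    # assumption: an empty text is given a neutral tone of 0.0
    if total_count == 0:
        return 0.0
    return 100 * (positive_count - negative_count) / total_count

# hypothetical document with 250 words, 12 positive and 7 negative
example_tone = tone_percent(12, 7, 250)
print(example_tone)
```

With these counts the net positive word count is 5, so Tone is 100 × 5 / 250 = 2.0 percent.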

In [17]:
def get_tone(input_text:str):
    """ Counts All and Specific Words in Text """

    ### Positive Words ###

    # finds all regex matches and returns them as a
    # list of lists, so the output of this search
    # will be of the following format:
    # [['able'], [], ['abundant', 'abundant'], [], ...]

    positive_words_matches = [re.findall(regex, input_text) for regex in positive_dict_regex]

    # len() measures the length of each list of matches,
    # so the output of this list transformation
    # will be of the following format: [1, 0, 2, 0, ...]
    positive_words_counts = [len(match) for match in positive_words_matches]
    positive_words_sum = sum(positive_words_counts)

    ### Negative Words ###

    # in a similar manner, we can get word counts for
    # negative words: find all matches of the negative
    # words' regular expressions
    negative_words_matches = [re.findall(regex, input_text) for regex in negative_dict_regex]

    # calculates the number of matches for each
    # dictionary term regex
    negative_words_counts = [len(match) for match in negative_words_matches]
    negative_words_sum = sum(negative_words_counts)

    ### Total Words ###

    # searches for all words in text, allowing
    # apostrophes and hyphens in words, e.g.,
    # "company's" , "state-of-the-art"
    total_words = re.findall(r"\b[a-zA-Z\'\-]+\b", input_text)

    # calculates the number of all words in text
    total_words_count = len(total_words)

    # Finally, we can calculate Tone
    # (expressed in % terms) as:
    tone = 100 * (positive_words_sum - negative_words_sum) / total_words_count
    return (total_words_count, positive_words_sum, negative_words_sum, tone)

# Applying our get_tone function to an input text:
counts = get_tone("""At FedEx Ground , we have the market
leading e-commerce portfolio. We continue to see
strong demand across all customer segments with our
new seven-day service . We will increase our speed
advantage during the New Year. Our Sunday roll-out
will speed up some lanes by one and two full
transit days. This will increase our advantage
significantly. And as you know, we are already
faster by at least one day when compared to UPS's
ground service in 25% of lanes. It is also really
important to note our speed advantage and seven-day
service is also very valuable for the premium B2B
sectors, including healthcare and perishables
shippers. Now, turning to Q2, I'm not pleased with
our financial results.""")

# output the results as (Total Word Count,
# Number of Positive Words, Number of Negative Words,
# Tone)
print(counts)

(114, 7, 0, 6.140350877192983)


Here is a list of negators we might want to consider in our regular expressions: \
**not, never, no, none, nobody, nothing, don’t, doesn’t, won’t, shan’t, didn’t, shouldn’t, wouldn’t, couldn’t, can’t, cannot, neither, nor** \
To account for negators in our previous code, we need to rewrite our regular expressions for word counts as follows:

In [18]:
# First, we update our function that compiles regular expressions
def create_dict_regex_list_with_negators(dict_file: str):
    """Creates a list of regex expressions of dictionary terms."""
    with open(dict_file, "r") as file:
        # reads dictionary lines one by one
        dict_terms = file.read().splitlines()
    # alternation of all negators; adjacent string literals are
    # concatenated by Python (a trailing backslash inside a raw
    # string would put a literal "\" and newline into the pattern)
    negators = (r"not|never|no|none|nobody|nothing|don't"
                r"|doesn't|won't|shan't|didn't|shouldn't|wouldn't|couldn't|can't"
                r"|cannot|neither|nor")
    # the first capturing group in this regex
    # captures all possible negators, allowing for
    # zero or one match as indicated by ? after the
    # group; the second group captures dictionary terms
    dict_terms_regex = [re.compile(r"(" + negators + r")?\s(" + term + r")\b") for term in dict_terms]

    # returns a list of regex expressions that
    # correspond to the input dictionary file,
    # allowing for negators
    return dict_terms_regex

# Now we can apply our function to create regex lists
# for positive and negative dictionary terms
positive_dict_regex = create_dict_regex_list_with_negators(positive_words_dict)
negative_dict_regex = create_dict_regex_list_with_negators(negative_words_dict)

# prints the first entries of each regex dictionary
print(positive_dict_regex[0])
print(negative_dict_regex[0])

re.compile("(not|never|no|none|nobody|nothing|don't|doesn't|won't|shan't|didn't|shouldn't|wouldn't|couldn't|can't|cannot|neither|nor)?\\s(able)\\b")
re.compile("(not|never|no|none|nobody|nothing|don't|doesn't|won't|shan't|didn't|shouldn't|wouldn't|couldn't|can't|cannot|neither|nor)?\\s(abandon)\\b")


Then, the updated version of our function to calculate document tone is as follows:

In [19]:
# calculates tone with negators
def get_tone2(input_text:str):
    """Counts All and Specific Words in Text, and checks for the presence of negators"""

    # find all words in text
    total_words = re.findall(r"\b[a-zA-Z\'\-]+\b", input_text)
    total_words_count = len(total_words)
    
    # Positive Words #
    # To account for negators, we can separately count
    # positive and negated positive words
    positive_word_count = 0
    negated_positive_word_count = 0

    for regex in positive_dict_regex:
        # searches for all occurrences of the regex
        matches = re.findall(regex, input_text)
        for match in matches:
            # if match is not empty
            if len(match) > 0:
                # prints the match output; this
                # is for illustration purposes
                # (i.e., optional)
                print(match)
            # if the first element of the match
            # is empty, no negator is present
            if match[0] == '':
                # so, increase the count of
                # positive words by 1
                positive_word_count += 1
            else:
                # otherwise, a negator is present,
                # so increase the count of negated
                # positive words by 1
                negated_positive_word_count += 1

    # Here negated positive words are simply excluded
    # from the positive count (i.e., treated as neutral);
    # to shift their sentiment from +1 to -1 instead, add
    # negated_positive_word_count to negative_words_sum
    positive_words_sum = positive_word_count
    
    # Repeat the same for Negative Words:
    negative_word_count = 0
    negated_negative_word_count = 0

    for regex in negative_dict_regex:
        # search for all occurrences of the regex
        matches = re.findall(regex, input_text)
        for match in matches:
            # if match is not empty
            if len(match) > 0:
                print(match)
            # if the first element of the match
            # is empty, no negator is present
            if match[0] == '':
                # so, increase the count of
                # negative words by 1
                negative_word_count += 1
            else:
                # otherwise, a negator is present, so
                # increase the count of negated
                # negative words by 1
                negated_negative_word_count += 1
    # Here negated negative words are likewise excluded
    # from the negative count (treated as neutral); to shift
    # their sentiment from -1 to +1 instead, add
    # negated_negative_word_count to positive_words_sum
    negative_words_sum = negative_word_count

    # Then, Tone is:
    tone = 100 * (positive_words_sum - negative_words_sum) / total_words_count
    return (total_words_count, positive_words_sum, negative_words_sum, tone)

# Applying the get_tone2 function to an
# example text:

counts = get_tone2("""At FedEx Ground , we have the market
leading e-commerce portfolio. We continue to see
strong demand across all customer segments with our
new seven-day service . We will increase our speed
advantage during the New Year. Our Sunday roll-out
will speed up some lanes by one and two full
transit days. This will increase our advantage
significantly. And as you know, we are already
faster by at least one day when compared to UPS's
ground service in 25% of lanes. It is also really
important to note our speed advantage and seven-day
service is also very valuable for the premium B2B
sectors, including healthcare and perishables
shippers. Now, turning to Q2, I'm not pleased with
our financial results.""")

# output results
print(counts)

('', 'advantage')
('', 'advantage')
('', 'advantage')
('', 'leading')
('not', 'pleased')
('', 'strong')
('', 'valuable')
(114, 6, 0, 5.2631578947368425)
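In the output above, the negated match "not pleased" is left out of both the positive and the negative counts. If one prefers to flip the sentiment of negated words instead, the sums can be adjusted as in the sketch below. The helper name `tone_with_flipped_negators` is hypothetical, and the inputs are the counts implied by the printed matches (114 words, 6 positive, 0 negative, 1 negated positive, 0 negated negative).

```python
def tone_with_flipped_negators(total_words: int,
                               positive: int,
                               negative: int,
                               negated_positive: int,
                               negated_negative: int) -> float:
    """Tone in percent when negated words count toward the opposite tone."""
    # a negated negative word (e.g., "not bad") is counted as positive
    positive_sum = positive + negated_negative
    # a negated positive word (e.g., "not pleased") is counted as negative
    negative_sum = negative + negated_positive
    return 100 * (positive_sum - negative_sum) / total_words

flipped_tone = tone_with_flipped_negators(114, 6, 0, 1, 0)
print(round(flipped_tone, 2))
```

Under this treatment Tone is 100 × (6 − 1) / 114, or about 4.39 percent, compared with about 5.26 percent when negated words are ignored.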


Finally, it is also possible that there are additional words between a given negator and a dictionary term (e.g., “not very encouraging” or “never went well”). In this case, we can modify the regular expressions above (in the dict_terms_regex list) and allow them to match phrases (N-grams) that contain dictionary terms (as opposed to individual dictionary terms). In other words, we can modify the regular expressions to allow for extra word(s) between the negators and dictionary terms:

In [None]:
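# the first capturing group in this regex
# captures all possible negators, allowing for
# zero or one match as indicated by ? after the
# group; the last group captures dictionary terms;
# the non-capturing group (?:\w+\s){0,2} matches
# zero, one, or two words between a negator
# and a dictionary term; the negative lookahead keeps
# a negator from being consumed as one of these words,
# which would otherwise hide it from the first group.
# dict_terms is the list of dictionary terms read
# inside create_dict_regex_list_with_negators

negators = (r"not|never|no|none|nobody|nothing|don't"
            r"|doesn't|won't|shan't|didn't|shouldn't|wouldn't|couldn't|can't"
            r"|cannot|neither|nor")
dict_terms_regex = [re.compile(r"(" + negators + r")?\s(?:(?!(?:" + negators + r")\b)\w+\s){0,2}(" + term + r")\b") for term in dict_terms]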
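To see the negator-window idea in action without dictionary files, here is a self-contained sketch. The helper `build_negation_pattern`, its `max_gap` parameter, the three-term dictionary, and the sample sentence are all assumptions made for this illustration. The sketch also adds two safeguards that are design choices of this example: a word boundary before the negator, and a negative lookahead so that a negator is never consumed as one of the in-between words.

```python
import re

NEGATORS = ["not", "never", "no", "none", "nobody", "nothing", "don't",
            "doesn't", "won't", "shan't", "didn't", "shouldn't", "wouldn't",
            "couldn't", "can't", "cannot", "neither", "nor"]
NEGATOR_ALTERNATION = "|".join(re.escape(negator) for negator in NEGATORS)

def build_negation_pattern(term: str, max_gap: int = 2):
    """Regex with group 1 = optional negator and group 2 = dictionary term."""
    # an in-between word: any word that is not itself a negator
    gap_word = r"(?:(?!(?:" + NEGATOR_ALTERNATION + r")\b)\w+\s)"
    return re.compile(
        r"(\b(?:" + NEGATOR_ALTERNATION + r"))?\s"   # optional negator
        + gap_word + r"{0," + str(max_gap) + r"}"    # up to max_gap words
        + r"(" + re.escape(term) + r")\b"            # dictionary term
    )

# hypothetical mini-dictionary and sample sentence
terms = ["encouraging", "well", "strong"]
sample = ("Results were not very encouraging, and the launch "
          "never went well, but demand is strong.")

found = []
for term in terms:
    found.extend(build_negation_pattern(term).findall(sample))

for negator, term in found:
    print(repr(negator), term)
```

Because group 1 is the negator (or an empty string) and group 2 is the dictionary term, tuples of this shape can be processed in the same way as the matches in `get_tone2`.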