# L16: Text Analysis(Part II)
[1. Quantifying Text Complexity](#1.-Quantifying-Text-Complexity)\
[2. Sentence Structure and Classification](#2.-Sentence-Structure-and-Classification)\
[3. Measuring Text Similarity](#3.-Measuring-Text-Similarity)

# 1. Quantifying Text Complexity

### Calculating Text Length

Define `count_words` function that calculates word count for an input text. \
In the following code, function `identify_words` uses a regular expression to find and output all words in a given text. Function count_words first applies `identify_words` to a given text to get a list of all words from that text(stored in variable `words`). It then outputs the length of that list, which is equivalent to the total number of words in text.

In [1]:
import re
def identify_words(input_text :str):
    """ Extracts all words from a given text. """
    words = re.findall(r"\b[a-zA-Z\'\-]+\b", input_text)
    return words
def count_words(input_text : str):
    """ Counts the number of words in a given text. """
    words = identify_words(input_text)
    # calculates the number of words in a given text
    word_count = len(words)
    return word_count

Applying `count_words` to a short text would yield:

In [2]:
# excerpt from Microsoft Corporation's 2016 10-K.
text = """ We acquire other companies and intangible
assets and may not realize all the economic benefit
from those acquisitions , which could cause an
impairment of goodwill or intangibles . We review
our amortizable intangible assets for impairment
when events or changes in circumstances indicate
the carrying value may not be recoverable . We test
goodwill for impairment at least annually . Factors
that may be a change in circumstances , indicating
that the carrying value of our goodwill or
amortizable intangible assets may not be
recoverable , include a decline in our stock price
and market capitalization , reduced future cash flow
estimates , and slower growth rates in industry
segments in which we participate . We may be
required to record a significant charge on our
consolidated financial statements during the period
in which any impairment of our goodwill or
amortizable intangible assets is determined ,
negatively affecting our results of operations ."""
text_length = count_words(text)
print(f"Number of words in text : {text_length}")

Number of words in text : 143


In a similar manner, we can write a function that counts the number of sentences in an input text

In [3]:
def identify_sentences(input_text:str):
    """ Extracts all sentences from a given text. """
    sentences = re.findall(r"\b[A-Z](?:[^\.!?]|\.\d)*[\.!?]", input_text)
    return sentences

def count_sentences(input_text:str) :
    """ Counts the number of sentences in input_text. """
    sentences = identify_sentences(input_text)
    sentence_count = len(sentences)
    return sentence_count

num_sentences = count_sentences(text)
print(f"Number of sentences in text: {num_sentences}")

Number of sentences in text: 5


### Measuring Text Readability Using the Fog Index

Write a function `count_syllables` that counts the number of syllables in a given word, and a function `is_complex_word` that for a given word returns True if the word has more than three syllables in it, and False otherwise


In [4]:
# regex pattern that matches vowels in a word
# (case-insensitive); used for syllable count
re_syllables = re.compile(r'(^|[^aeuoiy])(?!e$)[aeouiy]', re.IGNORECASE)
def count_syllables(word:str):
    """ Counts the number of syllables in a word. """
    # gets all syllable regex pattern matches
    # in the input word
    syllables_matches = re_syllables.findall(word)
    return len(syllables_matches)
def is_complex_word(word:str):
    """ Checks whether word has three or more syllables. """
    return count_syllables(word)>= 3

Consider the following example with `count_syllables` and `is_complex_word` functions:

In [5]:
print("Number of syllables in word \"Text\":", count_syllables("Text"))
print("Is word \"Text\" complex?", is_complex_word("Text"))
print("Number of syllables in word \"analysis\":", count_syllables("analysis"))
print("Is word \"analysis\" complex?", is_complex_word(" analysis "))
print("Number of syllables in word \"procedure\":", count_syllables("procedure"))
print("Is word \"procedure\" complex?", is_complex_word("procedure"))

Number of syllables in word "Text": 1
Is word "Text" complex? False
Number of syllables in word "analysis": 4
Is word "analysis" complex? True
Number of syllables in word "procedure": 3
Is word "procedure" complex? True


Write a function that computes the fog index score

In [6]:
def calculate_fog(text:str):
    """ Calculates the fog index for a given text. """
    # extracts all sentences from the input text
    sentences = identify_sentences(text)
    # extracts all words from the input text
    words = identify_words(text)
    # creates a list of complex words by using
    # is_complex_word function as a filter
    complex_words = list(filter(is_complex_word, words))
    # calculates and returns the fog index
    return 0.4*(float(len(words)) / float(len(sentences)) + 100*float(len(complex_words)) / float(len(words)))

In [7]:
fog_score = calculate_fog(text)
print ("The fog index score is", fog_score)

The fog index score is 21.78965034965035


### Using Python Packages to Calculate the Fog Index
* pip install py-readability-metrics
* python -m nltk.downloader punkt

In [8]:
# Readability class provides methods to compute various
# readability metrics
from readability import Readability

# create a new Readability object with the example text
# as an input
r = Readability(text)

# calculate and output the fog index
fog_score = r.gunning_fog()
print(fog_score)

score: 21.78965034965035, grade_level: 'college_graduate'


# 2. Sentence Structure and Classification

### Identifying forward-looking sentences

We will start with generating regular expressions that correspond to future-oriented terms as per Appendix “Identifying
Forward-Looking Disclosures” in Muslu et al. (2015).

In [9]:
import re
# To identify FLS , we need a dictionary file that
# includes future - oriented verbs and their
# conjugations as well as terms that identify
# references to the future . In our case , this
# file is "fls_terms.txt."

# file path(location)to a text file with FLS
# terms(dictionary structure : one term per line)
fls_terms_file = r".\fls_terms.txt"

# next , create a list of regex expressions that
# match FLS terms
def create_fls_regex_list(fls_terms_file : str):
    """ Creates a list of regex expressions of FLS terms """
    
    # opens the specified dict_file in "r" (read) mode
    with open(fls_terms_file ,"r") as file:
        # reads the content of the file line -by - line
        # and creates a list of FLS terms
        fls_terms = file.read().splitlines()
        
    # creates a list of FLS regex expressions by adding
    # word boundary (\b) anchors to the beginning and
    # the ending of each FLS term
    fls_terms_regex = [re.compile(r'\b'+term+r'\b') for term in fls_terms]
    return fls_terms_regex

# creates a list of FLS regex expressions
fls_terms_regex = create_fls_regex_list (fls_terms_file)
print(fls_terms_regex[0:3])

[re.compile('\\bwill\\b'), re.compile('\\bfuture\\b'), re.compile('\\bnext fiscal\\b')]


Let's write a function that checks if a sentence is forward-looking or not

In [10]:
def is_forward_looking (sentence:str, year:int):
    """ Returns whether sentence is forward-looking."""
    # creates a list of regex expression that match up
    # to 10 years into the future
    future_year_terms =[re.compile(r"[^$,]\b" + str(y) + r"\b(?!(%|,\d|.\d))") for y in range(year+1, year+10)]
    
    # combines FLS regex expressions , i.e. , regular
    # expressions for FLS terms and future years
    fls_terms_with_future_years = fls_terms_regex + future_year_terms
    
    for fls_term in fls_terms_with_future_years :
        # fls_term . search(sentence)returns a match
        # object if there is a match , and " None "
        # if there is no FLS term match in the
        # sentence
        if fls_term.search(sentence):
            return True
    return False

# Input text - excerpt from Apple 's Q4 2018
# Earnings Conference Call Transcript
text = """ Finally , we launched a completely new website
experience for Atlanta . The new online experience
provides a modern and fresh brand look and includes
enhanced simplicity and flexibility for shopping and
buying that easily transitions to a home delivery or
in - store experience . We are excited to put the customer
in the driver seat . This experience is a unique and
powerful integration of our own in - store and online
capabilities . Keep in mind , we will continue to improve
both the customer and associate experience in Atlanta
and use these earnings to inform how we roll out into
other markets . As we previously announced , we
anticipate having the omni channel experience available
to the majority of our customers by February 2020. To
expand omni channel , we anticipate opening additional
customer experience centers . We 're currently in the
process of planning the next locations while taking
state regulations into consideration ."""
sentence_regex = re.compile(r"\b[A-Z](?:[^\.!?]|\.\d)*[\.!?]")
                               
def identify_sentences(input_text:str) :
    sentences = re.findall(sentence_regex, input_text)
    return sentences
                               
sentences = identify_sentences(text)
for sentence in sentences:
    print("\033[1m", is_forward_looking(sentence, 2018), "\033[0m", ":", sentence)

[1m False [0m : Finally , we launched a completely new website
experience for Atlanta .
[1m False [0m : The new online experience
provides a modern and fresh brand look and includes
enhanced simplicity and flexibility for shopping and
buying that easily transitions to a home delivery or
in - store experience .
[1m False [0m : We are excited to put the customer
in the driver seat .
[1m False [0m : This experience is a unique and
powerful integration of our own in - store and online
capabilities .
[1m True [0m : Keep in mind , we will continue to improve
both the customer and associate experience in Atlanta
and use these earnings to inform how we roll out into
other markets .
[1m True [0m : As we previously announced , we
anticipate having the omni channel experience available
to the majority of our customers by February 2020.
[1m True [0m : To
expand omni channel , we anticipate opening additional
customer experience centers .
[1m False [0m : We 're currently in the
proc

### Dictionary Approach to Sentence Classification

In [11]:
# This code implements is a simplified version of
# sentence classification as earnings - oriented or
# not and quantitative or not as in Bozanic et
# al.(2018)

# regex for identifying sentences
sentence_regex = re.compile(r"\b[A-Z](?:[^\.!?]|\.\d)*[\.!?]")

def identify_sentences(input_text:str):
    """ Returns all sentences in the input text """
    sentences = re.findall(sentence_regex, input_text)
    return sentences

earn_terms = ["earnings", "EPS", "income", "loss",
              "losses", "profit", "profits"]
quant_terms = ["thousand", "thousands", "million",
               "millions", "billion", "billions",
               "percent", "%", "dollar", "dollars", "$"]

# creates a list of earnings regex expressions
earn_terms_regex = [re.compile(r'\b' + term + r'\b') for term in earn_terms]

# creates a list of regexes for quantitative terms
quant_terms_regex = [re.compile(r'\b' + term + r'\b') for term in quant_terms]

# checks if there is a match for at least one earnings
# term in the input sentence
def is_earn_oriented(sentence:str) :
    """ Checks whether a sentence is earnings-oriented. """
    for term in earn_terms_regex:
        if term.search(sentence, re.IGNORECASE):
            return True
    return False

# checks if there is a match for at least one
# qualitative term in the input sentence
def is_quantitative(sentence : str):
    """ Checks whether a sentence is quantitative in nature. """
    for term in quant_terms_regex:
        if term.search(sentence, re.IGNORECASE):
            return True
    return False

# input text
text = """Operating income margins, excluding the
restructuring charges, are projected to be in the
range of 4.5% to 4.8%, and interest expense and
other income are forecasted to be approximately
$18 million and $6 million, respectively. While
operating performance is expected to remain
strong, Agribusiness profits are expected to be
lower in the third and fourth quarters as pricing
for subsequent sales will not match the high level
of the June delivery. The Company expects its
capital expenditures in 2008 to be approximately
$300 million, an 8% reduction from 2007 capital
expenditures of $326 million. During the third
quarter, the company made further progress
implementing the strategic cost reductions that
will support the targeted growth investments
announced in July 2005."""

sentences = identify_sentences(text)

# next, we classify each sentence as earnings-
# oriented or not, quantitative or not
for sentence in sentences:
    print("\033[1m", "*** Earnings-oriented:", is_earn_oriented(sentence),
           "\033[1m", "*** Quantitative :", is_quantitative(sentence),
           "\033[0m", "---", sentence)

[1m *** Earnings-oriented: True [1m *** Quantitative : True [0m --- Operating income margins, excluding the
restructuring charges, are projected to be in the
range of 4.5% to 4.8%, and interest expense and
other income are forecasted to be approximately
$18 million and $6 million, respectively.
[1m *** Earnings-oriented: True [1m *** Quantitative : False [0m --- While
operating performance is expected to remain
strong, Agribusiness profits are expected to be
lower in the third and fourth quarters as pricing
for subsequent sales will not match the high level
of the June delivery.
[1m *** Earnings-oriented: False [1m *** Quantitative : True [0m --- The Company expects its
capital expenditures in 2008 to be approximately
$300 million, an 8% reduction from 2007 capital
expenditures of $326 million.
[1m *** Earnings-oriented: False [1m *** Quantitative : False [0m --- During the third
quarter, the company made further progress
implementing the strategic cost reductions that
will

### Identifying Sentence Subjects and Objects

First, we demonstrate how to extract sentences from text using spacy:

In [13]:
import spacy
# load spacy 's English language model
nlp = spacy.load("en_core_web_sm")

# a sample text
text = """Q1 revenue reached $12.7 billion. We are
thrilled with the continued growth of Apple Card.
We experienced some product shortages due to very
strong customer demand for both Apple Watch and
AirPod during the quarter. Apple is looking at
buying U.K. startup for $1 billion."""

# parses the input text using spacy 's nlp class
parsed_text = nlp(text)

# gets a list of sentences identified by spacy
# property " sents " yields identified sentences
sentences = list(parsed_text.sents)

# recall that function enumerate () when applied
# to a list, returns its elements along with their
# indexes
for num, sentence in enumerate(sentences,1):
    print("Sentence", str(num), ":", sentence)

Sentence 1 : Q1 revenue reached $12.7 billion.
Sentence 2 : We are
thrilled with the continued growth of Apple Card.

Sentence 3 : We experienced some product shortages due to very
strong customer demand for both Apple Watch and
AirPod during the quarter.
Sentence 4 : Apple is looking at
buying U.K. startup for $1 billion.


Next, we can apply spacy’s tagging method to identify subjects and objects in sentences

In [14]:
def sentence_subj_obj(sentence) :
    """ Identifies subjects and objects in a sentence """
    results = []
    for token in sentence :
        # records the token 's text and its dependency
        entry = {"Token": token.text, 
                 "Dependency": token.dep_}
        results.append(entry)

    # spacy parses token dependencies and assigns a
    # dependency code for each token ; tokens that are
    # either objects or subjects will include "obj" or
    # " subj " in their dependency codes ; for a full list
    # of spacy 's dependencies and their codes , visit
    # spacy.io
    
    # creates a new list of tokens and their
    # dependencies based on results list by keeping
    # only tokens with "obj" and " subj " dependencies
    filtered_results =[entry for entry in results
                       if ('obj ' in entry['Dependency'])
                       or ('subj' in entry['Dependency'])]
    return filtered_results

# recall that function enumerate () when applied to a
# list , returns its elements along with their indexes
for num , sentence in enumerate(sentences,1) :
    print ("Sentence", str(num), ":", sentence_subj_obj(sentence))

Sentence 1 : [{'Token': 'revenue', 'Dependency': 'nsubj'}]
Sentence 2 : [{'Token': 'We', 'Dependency': 'nsubjpass'}]
Sentence 3 : [{'Token': 'We', 'Dependency': 'nsubj'}]
Sentence 4 : [{'Token': 'Apple', 'Dependency': 'nsubj'}]


Finally, spacy allows us to easily output and visualize complete sentence structure with all word dependencies:

In [15]:
# displacy allows to visualize a sentence structure
from spacy import displacy

# tags all(word)tokens in an input sentence
def sentence_tagging(sentence):
    results = []
    for token in sentence:
        # gets a token, its lemmatized version, POS,
        # dependency, and checks whether it is a stop
        # word or not
        entry = {"Token": token.text ,
                 "Lemma_Token": token.lemma_ ,
                 "POS": token.pos_ ,
                 "Dependency": token.dep_ ,
                 "Stop_word": token.is_stop}
        results.append(entry)
    return results

# applies sentence_tagging to all sentences
tagged_sentences = [sentence_tagging(s) for s in sentences]

# prints the output for the first sentence
print(tagged_sentences[0])

# visualizes sentence dependency
displacy.render(parsed_text, style ="dep")

[{'Token': 'Q1', 'Lemma_Token': 'Q1', 'POS': 'PROPN', 'Dependency': 'compound', 'Stop_word': False}, {'Token': 'revenue', 'Lemma_Token': 'revenue', 'POS': 'NOUN', 'Dependency': 'nsubj', 'Stop_word': False}, {'Token': 'reached', 'Lemma_Token': 'reach', 'POS': 'VERB', 'Dependency': 'ROOT', 'Stop_word': False}, {'Token': '$', 'Lemma_Token': '$', 'POS': 'SYM', 'Dependency': 'quantmod', 'Stop_word': False}, {'Token': '12.7', 'Lemma_Token': '12.7', 'POS': 'NUM', 'Dependency': 'compound', 'Stop_word': False}, {'Token': 'billion', 'Lemma_Token': 'billion', 'POS': 'NUM', 'Dependency': 'dobj', 'Stop_word': False}, {'Token': '.', 'Lemma_Token': '.', 'POS': 'PUNCT', 'Dependency': 'punct', 'Stop_word': False}]


### Identifying Named Entities

First, we demonstrate how to identify and extract named entities from text:

In [16]:
# create a dictionary with descriptions for spacy 's
# entity type codes ; the list is available on spacy.io
entity_type_descriptions = {
    'PERSON':'People, including fictional.',
    'NORP':' Nationalities or religious or political groups.',
    'FAC':'Buildings, airports, highways, bridges, etc.',
    'ORG':'Companies, agencies, institutions, etc.',
    'GPE':'Countries, cities, states.',
    'LOC':'Non -GPE locations, mountain ranges, bodies of water.',
    'PRODUCT':'Objects, vehicles, foods, etc. (Not services.)',
    'EVENT':'Named hurricanes, battles, wars, sports events, etc.',
    'WORK':'OF_ART Titles of books, songs, etc.',
    'LAW':'Named documents made into laws.',
    'LANGUAGE':'Any named language.',
    'DATE':'Absolute or relative dates or periods.',
    'TIME':'Times smaller than a day.',
    'PERCENT':'Percentage, including "%". ',
    'MONEY':'Monetary values, including unit.',
    'QUANTITY':'Measurements, as of weight or distance.',
    'ORDINAL':'" first ", " second ", etc.',
    'CARDINAL':'Numerals that do not fall under another type.'}

# gets a list of all named entities identified
# by spacy, and output them
# property "ents" returns all identified named
# entities in the text
named_entities = parsed_text.ents

for ent in named_entities :
    # gets the named entity (ent. text)
    entity = ent.text
    # gets the named entity type code
    # (e.g., PERSON, ORG, etc.)
    entity_type = ent.label_
    # gets the named entity description from
    # entity_type_descriptions dictionary using
    # its type code
    entity_desc = entity_type_descriptions[entity_type]
    print(f'{ entity:<15}{entity_type:<10}{entity_desc}')

$12.7 billion  MONEY     Monetary values, including unit.
Apple Card     ORG       Companies, agencies, institutions, etc.
Apple Watch    ORG       Companies, agencies, institutions, etc.
AirPod         ORG       Companies, agencies, institutions, etc.
the quarter    DATE      Absolute or relative dates or periods.
Apple          ORG       Companies, agencies, institutions, etc.
U.K.           GPE       Countries, cities, states.
$1 billion     MONEY     Monetary values, including unit.


Now, we can calculate the specificity measure by dividing the number of named entities by the number of words in text:

In [17]:
# counts the number of all words
# we assume that every token in a sentence is a word
# unless it is punctuation.
num_words = len([token for token in parsed_text if not token.is_punct])

num_entities = len(named_entities)
specificity_score = num_words / num_entities

print('Number of named entities:', num_entities)
print('Number of words:', num_words)
print('Specificity score:', specificity_score)

Number of named entities: 8
Number of words: 52
Specificity score: 6.5


### Using Stanford NLP for part-of-speech and named entity recognition tasks

`Stanza` can be installed using either conda or pip as follows:
* conda install -c stanfordnlp stanza
* pip install stanza

Before processing text, we need to download a Stanza language module and create a `Pipeline` object. `Pipeline` object specifies the type of processing that will be applied to a given text (e.g., tokenization, lemmatization, dependency parsing, etc.).

In [19]:
import stanza
# downloads the English module. The size of the
# downloaded module is about 400 MB. The module
# has to be download only once
stanza.download('en')

# creates a(text processing)Pipeline object using
# the English language module with tokenizer , part
# of speech and named entity recognition
nlp = stanza.Pipeline(lang = 'en', processors = 'tokenize, pos, ner')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json:   0%|   …

2023-12-26 16:12:14 INFO: Downloading default packages for language: en (English) ...
2023-12-26 16:12:14 INFO: File exists: C:\Users\yangs\stanza_resources\en\default.zip
2023-12-26 16:12:17 INFO: Finished downloading models and saved to C:\Users\yangs\stanza_resources.
2023-12-26 16:12:17 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json:   0%|   …

2023-12-26 16:12:18 INFO: Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| pos       | combined_charlm           |
| ner       | ontonotes-ww-multi_charlm |

2023-12-26 16:12:18 INFO: Using device: cpu
2023-12-26 16:12:18 INFO: Loading: tokenize
2023-12-26 16:12:18 INFO: Loading: mwt
2023-12-26 16:12:18 INFO: Loading: pos
2023-12-26 16:12:18 INFO: Loading: ner
2023-12-26 16:12:18 INFO: Done loading processors!


Next, we create a `Stanza` document object by providing an input text. `Stanza` will immediately parse the input text using previously specified text processors at this step. We can retrieve parsed sentences and words through document object properties:

In [20]:
# sample text (same as in the previous example)
text = """ Q1 revenue reached $12.7 billion. We are
thrilled with the continued growth of Apple Card.
We experienced some product shortages due to very
strong customer demand for both Apple Watch and
AirPod during the quarter. Apple is looking at
buying U.K. startup for $1 billion."""

# creates Stanza document object
doc = nlp(text)

# extracts sentences
sentences = doc.sentences

print('Sentences:')
# prints the first 20 characters of each sentence
for sentence in sentences :
    print(sentence.text[0:20] + '...')

print('\nWords:')
# prints all the words in the first sentence
for word in sentences[0].words:
    print(word.text)

Sentences:
Q1 revenue reached $...
We are
thrilled with...
We experienced some ...
Apple is looking at
...

Words:
Q1
revenue
reached
$
12.7
billion
.


For each word, we can output its part-of-speech tag by accessing the value of its `.pos` property:

In [21]:
# outputs POS information for each word in the second sentence
for word in sentences[1].words :
    print(f'{word.text:<10} {word.pos}')

We         PRON
are        AUX
thrilled   VERB
with       ADP
the        DET
continued  VERB
growth     NOUN
of         ADP
Apple      PROPN
Card       PROPN
.          PUNCT


Similarly, we can output all entities identified by `Stanza`’s NER processor for a given text (or individual sentence) by accessing `.ents` property:

In [22]:
# outputs all entities identified in the input text
for ent in doc.ents:
    print (f'{ent.text:<15} {ent.type}')

Q1              ORG
$12.7 billion   MONEY
Apple Card      ORG
Apple Watch     ORG
AirPod          ORG
the quarter     DATE
Apple           ORG
U.K.            GPE
$1 billion      MONEY


# 3. Measuring Text Similarity

### Text Similarity Measure for Long Text: Cosine Similarity

To access NLTK’s word tokenizer and the list of stop words we need to download two NLTK modules as follows (this has to be done only once):

In [23]:
import nltk
# download NLTK's stopwords module
nltk.download('stopwords')
# downlod NLTK's punkt module
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\yangs\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\yangs\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

When calculating text similarity, we should exclude these punctuation character tokens as they introduce noise to bag-of-words vectors. Conveniently, Python includes a list of punctuation characters; we only need to add apostrophe to that list.

In [24]:
# Python includes a collection of all punctuation
# characters
from string import punctuation
# add apostrophe to the punctuation character list
punctuation_w_apostrophe = punctuation + "’"
# print all characters
print(punctuation_w_apostrophe)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~’


Now, we can write a custom word tokenizer using NTLK’s list of stop words and the Porter stemmer:

In [25]:
# imports word tokenizer from NLTK
from nltk import word_tokenize
# imports list of stop words from NLTK
from nltk.corpus import stopwords
# imports Porter Stemmer module from NLTK
from nltk.stem import PorterStemmer

# creates a list of English stop words
set_stopwords = set(stopwords.words('english'))
# creates a Porter stemmer object
stemmer = PorterStemmer()

# creates a custom tokenizer that removes stop words ,
# punctuation , and stems the remaining words
def custom_tokenizer(text:str) :
    # gets all tokens(words)from the lower - cased
    # input text
    tokens = word_tokenize(text.lower())
    # filters out stop words
    no_sw_tokens = [t for t in tokens if t not in set_stopwords]
    # filters out punctuation character tokens
    no_sw_punct_tokens = [t for t in no_sw_tokens if t not in punctuation_w_apostrophe]
    # stems the remaining words
    stem_tokens = [stemmer.stem(t) for t in no_sw_punct_tokens]
    # returns stemmed tokens(words)
    return stem_tokens

Let us demonstrate how this tokenizer works using text excerpts from business description sections of 10-K filings of three telecommunication companies:

In [26]:
# excerpt from Verizon Communications Inc. 2018 10 -K
doc_verizon = """Verizon Communications Inc. (Verizon
or the Company) is a holding company that, acting
through its subsidiaries, is one of the world’s
leading providers of communications, information
and entertainment products and services to
consumers, businesses and governmental agencies."""

# excerpt from AT&T Inc. 2018 10 -K
doc_att = """We are a leading provider of
communications and digital entertainment services
in the United States and the world. We offer our
services and products to consumers in the U.S.,
Mexico and Latin America and to businesses and
other providers of telecommunications services
worldwide."""

# excerpt from Sprint Corporation 2018 10 -K
doc_sprint = """Sprint Corporation, including its
consolidated subsidiaries, is a communications
company offering a comprehensive range of wireless
and wireline communications products and services
that are designed to meet the needs of individual
consumers, businesses, government subscribers and
resellers."""

tokens_verizon = custom_tokenizer(doc_verizon)
print(tokens_verizon)

tokens_att = custom_tokenizer(doc_att)
print(tokens_att)

tokens_sprint = custom_tokenizer(doc_sprint)
print(tokens_sprint)

['verizon', 'commun', 'inc.', 'verizon', 'compani', 'hold', 'compani', 'act', 'subsidiari', 'one', 'world', 'lead', 'provid', 'commun', 'inform', 'entertain', 'product', 'servic', 'consum', 'busi', 'government', 'agenc']
['lead', 'provid', 'commun', 'digit', 'entertain', 'servic', 'unit', 'state', 'world', 'offer', 'servic', 'product', 'consum', 'u.s.', 'mexico', 'latin', 'america', 'busi', 'provid', 'telecommun', 'servic', 'worldwid']
['sprint', 'corpor', 'includ', 'consolid', 'subsidiari', 'commun', 'compani', 'offer', 'comprehens', 'rang', 'wireless', 'wirelin', 'commun', 'product', 'servic', 'design', 'meet', 'need', 'individu', 'consum', 'busi', 'govern', 'subscrib', 'resel']


Finally, we can use Scikit-learn’s CountVectorizer class o convert text documents to bag-of-words vectors:

In [27]:
# CountVectorizer converts text to bag-of-words vectors
from sklearn.feature_extraction.text import CountVectorizer

# creates a list of three documents; one for each
# company
documents = [doc_verizon, doc_att, doc_sprint]

# creates a CountVectorizer object with the custom
# tokenizer
count_vectorizer = CountVectorizer(tokenizer=custom_tokenizer)

# converts text documents to bag-of-word vectors
count_vecs = count_vectorizer.fit_transform(documents)

# prints first ten bag-of-words features(words)
print(count_vectorizer.get_feature_names_out()[:10])

# prints first ten bag-of-words elements(counts)for
# each vector the output is a matrix where each row
# represents a document vector the element(count )
# order in each vector corresponds to the order of
# the bag-of-word features
print(count_vecs.toarray()[:,:10])

['act' 'agenc' 'america' 'busi' 'commun' 'compani' 'comprehens' 'consolid'
 'consum' 'corpor']
[[1 1 0 1 2 2 0 0 1 0]
 [0 0 1 1 1 0 0 0 1 0]
 [0 0 0 1 2 1 1 1 1 1]]




Calculating **Cosine Similarity**

In [28]:
# cosine_similarity calculates cosine similarity
# between vectors
from sklearn.metrics.pairwise import cosine_similarity

# calculates text cosine similarity and stores results
# in a matrix . The matrix stores pairwise similarity
# scores for all documents , similarly to a covariance
# matrix
cosine_sim_matrix = cosine_similarity(count_vecs)

# outputs the similarity matrix
print(cosine_sim_matrix)

[[1.         0.44854261 0.40768712]
 [0.44854261 1.         0.32225169]
 [0.40768712 0.32225169 1.        ]]


We can slightly modify our previous code to create bag-of-words vectors with IDF weights by using `TfidfVectorizer` class instead of `CountVectorizer`. The former class automatically calculates and applies IDF weights for each documents in the list (corpus) of documents.

In [29]:
# TfidfVectorizer converts text to TF -IDF bag -of - words
# vectors
from sklearn.feature_extraction.text import TfidfVectorizer

# creates a TfidfVectorizer object with the custom
# tokenizer
tfidf_vectorizer = TfidfVectorizer(tokenizer = custom_tokenizer)

# converts text documents to TF -IDF vectors
tfidf_vecs = tfidf_vectorizer.fit_transform(documents)

# prints first four bag -of - words features(words)
print(tfidf_vectorizer.get_feature_names_out()[:4])

# prints first four bag -of - words TF -IDF counts for each
# vector.The output is a matrix where each row
# represents a document vector
print(tfidf_vecs.toarray()[:,:4]) # prints first four elements of each vector

['act' 'agenc' 'america' 'busi']
[[0.22943859 0.22943859 0.         0.13551013]
 [0.         0.         0.23464902 0.13858749]
 [0.         0.         0.         0.13365976]]


To compute the cosine similarity between TF-IDF vectors, we can use NTLK’s `cosine_similarity` function 

In [30]:
# computes the cosine similarity matrix for TF -IDF vectors
tfidf_cosine_sim_matrix = cosine_similarity(tfidf_vecs)

# outputs the similarity matrix
print(tfidf_cosine_sim_matrix)

[[1.         0.30593809 0.23499515]
 [0.30593809 1.         0.17890296]
 [0.23499515 0.17890296 1.        ]]


### Text Similarity Measure for Short Text: Levenshtein Distance

`NLTK` library provides a function called `edit_distance` that calculates the Levenshtein distance between two pieces of text:

In [31]:
# edit_distance computes Levenshtein distance between
# two pieces of text
from nltk import edit_distance
# example : account and accounts
print(edit_distance("account","accounts"))
# example : account and count
print(edit_distance("account","count "))
# example : account and access
print(edit_distance("account","access"))

1
3
4


Creating a Similarity Measure using the Levenshtein Distance

In [32]:
# similarity measure based on the Levenshtein distance
# greater values indicate more similar text
def edit_similarity(t1, t2):
    # lowercase the input strings
    (t1, t2) = (t1.lower(), t2.lower())
    # calculates the Levenshtein distance between the
    # input strings
    distance = edit_distance(t1, t2)
    # calculates length of the longest input string
    longest_text_len = max(len(t1),len(t2))
    # if both t1 and t2 are empty strings, they are
    # identical ; thus return 1 as the output
    if longest_text_len == 0:
        return 1.0
    # else compute the similarity measure as
    # 1 -(levenshtein_distance / length of the longest input string)
    else:
        return(1.0 - float(distance)/float(longest_text_len))

Let us demonstrate how to apply this similarity measure on an example. Consider the problem of matching observations based on company names. The name of an S&P 500 firm, Fidelity National Information Services, is recorded in Capital IQ’s Compustat database as “Fidelity National Info Svcs”.

In [33]:
# original company name
orig_name = "Fidelity National Information Services"
# shortened company name
comp_name = "Fidelity National Info Svcs"

# calculates and outputs the Levenshtein distance
levenshtein_distance = edit_distance(orig_name, comp_name)
print("Levenshtein distance:", levenshtein_distance)

# calculates and output the similarity score based on
# Levenshtein distance
levenshtein_similarity = edit_similarity(orig_name, comp_name)
print("Levenshtein similarity score:", levenshtein_similarity)

Levenshtein distance: 11
Levenshtein similarity score: 0.7105263157894737


### Measuring Semantic Similarity using Word2Vec Embedding Model

**Data preprocessing steps**: 
* removing stop words, special characters, numbers, and extra spaces
* extracting individual words from the input text

In [34]:
import re
# imports word tokenizer from NLTK
import nltk
# download NLTK 's stopwords module
nltk.download('stopwords')
from nltk import word_tokenize
# imports list of stop words from NLTK
from nltk.corpus import stopwords

# creates a list of English stop words
set_stopwords = set(stopwords.words('english'))

# path to the input txt file with Apple 's 2018 MD&A
input_file = r"./Apple_MDNA.txt"

# reads file content
file_content = open(input_file, "r", encoding='utf-8').read()

# converts text to lowercase ; removes all special characters, digits and extra spaces
processed_content = file_content.lower()
processed_content = re.sub(r'[^a-zA-Z]', ' ',processed_content)
processed_content = re.sub(r'\s+', ' ',processed_content)

# creates a list of lists of individual words - this is the input format to Word2Vec model
processed_content = [processed_content]
words = [nltk.word_tokenize(e) for e in processed_content ]

# removes stop words from the list of words
for i in range (len(words)) :
    words [i] = [w for w in words[i] if w not in set_stopwords]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\yangs\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We can use `Gensim` library to build our Word2Vec model. `Gensim` can be installed using either `conda` or `pip` as follows:
* conda install -c anaconda gensim
* pip install gensim

Based on Apple’s MD&A wording, we can find similar or related words to the word “sales”:

In [35]:
# imports Word2Vec from Gensim library
from gensim.models import Word2Vec

# creates a Word2Vec model , ignoring words that occur 
# less than two times in the input text
word2vec = Word2Vec(words, min_count = 2)

# identifies most related / similar words to 'sales'
# based on the input text provided
related_words = word2vec.wv.most_similar('sales')
related_words

[('company', 0.9615598320960999),
 ('tax', 0.9547304511070251),
 ('may', 0.9459083080291748),
 ('net', 0.9442216753959656),
 ('foreign', 0.9435575604438782),
 ('product', 0.9414733648300171),
 ('financial', 0.9407042860984802),
 ('revenue', 0.9382311701774597),
 ('billion', 0.9371017813682556),
 ('cash', 0.9360306859016418)]

In the example above, we used only one textual document to train the Word2Vec model. However, the performance of the model in identifying word clusters and similarities will greatly improve when we increase the training corpus. A popular option for training Word2Vec is the Google News dataset model. It consists of 300-dimensional embeddings for around three million words and phrases (see https://code.google.com/archive/p/word2vec/ for details and to download ‘GoogleNewsvectors-negative300.bin.gz’ file (∼1.5GB)). With the pre-trained model we can access the word vectors and get the similarity scores as follows:

In [36]:
from gensim.models import KeyedVectors
# load embeddings directly from the downloaded file
# called "GoogleNews-vectors-negative300.bin"
model = KeyedVectors.load_word2vec_format(r'C:\Users\yangs\Downloads\GoogleNews-vectors-negative300.bin', binary = True)

# similarity between pairs of words
a = model.similarity('confident', 'uncertain')
b = model.similarity('recession', 'crisis')
# most similar words
c = model.most_similar('accounting')
# identifies a word that does not belong in the list
d = model.doesnt_match("good great amazing bad".split())

print(a)
print(b)
print(c)
print(d)

0.38531393
0.59829676
[('Accounting', 0.6579887270927429), ('bookkeeping', 0.6002781391143799), ('auditing', 0.5503429174423218), ('Arthur_Andersen_Enron', 0.5320826768875122), ('restatement', 0.5319856405258179), ('accountancy', 0.5315807461738586), ('bookeeping', 0.5051406621932983), ('Generally_Accepted_Accounting_Principles', 0.5034367442131042), ('accouting', 0.5023786425590515), ('Irina_Parkhomenko_spokeswoman', 0.49402597546577454)]
bad
