# Classification performance of the keyword search algorithm

The data and code here produce the in-sample and out-of-sample classification performance statistics.

I use a keyword search to find forward-looking sentences. The keyword list for forward-looking statements is available [here](https://github.com/yiyangw2/time_frame_gold_corpus/blob/main/fls_terms2.txt). In addition, I treat a sentence as forward-looking if it mentions a calendar year within the coming ten years. The word list is adapted from [previous studies](https://pubsonline.informs.org/doi/abs/10.1287/mnsc.2014.1921).

Among the forward-looking statements, I use a time frame keyword list to identify those with time frames. The keyword list for time frames is available [here](https://github.com/yiyangw2/time_frame_gold_corpus/blob/main/qt_terms2.txt). It is adapted from [previous studies](https://link.springer.com/article/10.1007/s11142-015-9329-8).
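
As a minimal sketch of the two-stage idea described above, the snippet below first flags forward-looking sentences and then, among those, flags the ones that mention a time frame. The two term lists are short hypothetical stand-ins, not the contents of `fls_terms2.txt` or `qt_terms2.txt`, and the helper names `compile_terms` and `classify` are illustrative only. The future-year rule is left out here to keep the example small.

```python
import re

# Hypothetical stand-ins for the real keyword files.
FLS_TERMS = ["will", "expect", "anticipate"]
QT_TERMS = ["next quarter", "next year", "by the end of"]

def compile_terms(terms):
    """Wrap each term in word-boundary anchors so 'will' does not match 'willing'."""
    return [re.compile(r"\b" + term + r"\b") for term in terms]

FLS_REGEX = compile_terms(FLS_TERMS)
QT_REGEX = compile_terms(QT_TERMS)

def classify(sentence):
    """Return (forward_looking, forward_looking_with_time_frame) for one sentence."""
    text = sentence.lower()
    fl = any(rx.search(text) for rx in FLS_REGEX)
    qt = any(rx.search(text) for rx in QT_REGEX)
    return fl, fl and qt

print(classify("We expect margins to improve next quarter."))  # (True, True)
print(classify("We will grow."))                               # (True, False)
print(classify("We are willing to consider it."))              # (False, False)
```

The word-boundary anchors are what keep "willing" from counting as a match for "will" in the last example.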

In [1]:
import pandas as pd
gold_standard = pd.read_excel("Gold Standard Corpus.xlsx")
gold_standard[['file_name', 'last_update', 'section', 'context', 
               'speaker_number', 'future_year',
               'speaker_text', 'fl', 'qt', 'fl_qt', 'fl_cal',
               'qt_cal', 'fl_qt_cal', 'training sample', 'test sample']]

| | file_name | last_update | section | context | speaker_number | future_year | speaker_text | fl | qt | fl_qt | fl_cal | qt_cal | fl_qt_cal | training sample | test sample |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12614899_T | 2019-08-01 19:19:18+10:00 | 1 | qa | 23 | 2019 | Okay. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 12614899_T | 2019-08-01 19:19:18+10:00 | 1 | qa | 23 | 2019 | And lastly from me, can you kind of give us a ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 12614899_T | 2019-08-01 19:19:18+10:00 | 1 | qa | 23 | 2019 | What's the share of wallet for Knoll products? | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 702568_T | 2003-01-30 16:25:51+11:00 | 1 | qa | 109 | 2003 | Okay. | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 702568_T | 2003-01-30 16:25:51+11:00 | 1 | qa | 109 | 2003 | So I don't think anybody said much more. | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1283 | 2723598_T | 2010-02-17 13:53:38+11:00 | 1 | pres | 5 | 2010 | With that, let's open the call to your questions. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1284 | 2723598_T | 2010-02-17 13:53:38+11:00 | 1 | pres | 5 | 2010 | Yvette, please review the Q&A procedure. | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1285 | 739440_T | 2003-05-07 12:37:12+10:00 | 1 | qa | 154 | 2003 | This is a follow up question from Chris Joseph... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1286 | 739440_T | 2003-05-07 12:37:12+10:00 | 1 | qa | 154 | 2003 | from [inaudible]. | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |


The text underlying the gold standard is in [`Gold Standard Corpus.xlsx`](https://github.com/yiyangw2/time_frame_gold_corpus/blob/main/Gold%20Standard%20Corpus.xlsx). It consists of 200 randomly selected management paragraphs (1288 sentences). Each sentence carries identifiers for where it comes from. The columns `fl_cal`, `qt_cal` and `fl_qt_cal` record whether the algorithm classifies the sentence as forward-looking, as containing a time frame, and as a forward-looking statement with a time frame, respectively. Similarly, `fl`, `qt` and `fl_qt` record my manual classification of the sentence on the same three dimensions. The underlying earnings call transcripts are from the Melbourne Centre for Corporate Governance and Regulation (MCCGR) database.

The following functions apply the keyword lists to each sentence and show how `fl_cal`, `qt_cal` and `fl_qt_cal` in `gold_standard` are produced.

In [2]:
import re

def create_regex_list(terms_file:str):
    """Creates a list of compiled regex patterns,
    one for each term listed in terms_file"""

    # opens the specified terms_file in "r" (read) mode
    with open(terms_file,"r") as file:
        # reads the content of the file line-by-line
        # and creates a list of terms
        terms = file.read().splitlines()

    # creates a list of regex expressions by adding
    # word boundary (\b) anchors to the beginning and
    # the ending of each term
    terms_regex = [re.compile(r'\b' + term + r'\b') for term in terms]
    return terms_regex

In [3]:
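def get_future_year_regex(future_year):
    """Returns one pattern per year in range(future_year + 1, future_year + 10)."""
    # the year must follow a character other than "$" or "," and must not be
    # followed by "%", by "," and a digit, or by any character and a digit
    return [re.compile(r"[^$,]\b" + str(y) + r"\b(?!(%|,\d|.\d))") 
            for y in range(future_year + 1, future_year + 10)]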

def is_forward_looking(sentence:str, future_year=None):
    """Returns whether sentence is forward-looking."""
    fls_terms_regex = create_regex_list(r"fls_terms2.txt")
    
    if future_year:
        regex = get_future_year_regex(future_year) + fls_terms_regex
    else:
        regex = fls_terms_regex
    for term in regex:
        if term.search(sentence.lower()):
            return True
    return False
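
To make the behavior of the future-year pattern concrete, the sketch below copies `get_future_year_regex` from the cell above and applies it to a few made-up sentences. The helper `mentions_future_year` and the example sentences are illustrative assumptions, not part of the original notebook.

```python
import re

def get_future_year_regex(future_year):
    # Same pattern as in the cell above, reproduced so this sketch runs on its own.
    return [re.compile(r"[^$,]\b" + str(y) + r"\b(?!(%|,\d|.\d))")
            for y in range(future_year + 1, future_year + 10)]

def mentions_future_year(sentence, future_year):
    """Hypothetical helper: True if any future-year pattern matches."""
    text = sentence.lower()
    return any(rx.search(text) for rx in get_future_year_regex(future_year))

examples = [
    "We expect growth in 2021.",         # True: year inside the range
    "Our target for 2028 is unchanged",  # True: last year in the range
    "Our target for 2029 is unchanged",  # False: outside range(2020, 2029)
    "Revenue of $2021 million",          # False: preceded by "$"
    "About 2020.5 units",                # False: followed by "." and a digit
    "2021 will be strong",               # False: no character before the year
]
for s in examples:
    print(mentions_future_year(s, 2019), "-", s)
```

The last case shows that the leading `[^$,]` consumes one character, so a year at the very start of a sentence is not matched by this pattern alone.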

In [4]:
def count_term(text:str, regex):
    """Returns the total number of matches in text
    across all patterns in regex."""
    text = text.lower()
    count = 0
    for term in regex:
        count = count + len(re.findall(term, text))
    return count

def is_term(text:str, regex):
    """Returns whether any pattern in regex matches text."""
    return count_term(text, regex) > 0

def is_qt_term(text:str):
    """Returns whether text contains a time frame keyword."""
    return is_term(text, create_regex_list(r"qt_terms2.txt"))

def is_future_year(text, future_year):
    return is_term(text, get_future_year_regex(future_year))

In [5]:
gold_standard['is_future_year'] = gold_standard.apply(lambda x: is_future_year(x['speaker_text'], x['future_year']), axis=1)
gold_standard['is_term'] = gold_standard['speaker_text'].map(is_qt_term)
gold_standard['fl_cal'] = gold_standard.apply(lambda x: is_forward_looking(x['speaker_text'], x['future_year']), axis=1)
gold_standard['qt_cal'] = (gold_standard.is_term | gold_standard.is_future_year)
gold_standard["fl_qt_cal"] = gold_standard.fl_cal & gold_standard.qt_cal

The following function reports the performance of the algorithm in identifying forward-looking statements and forward-looking statements with time frames.

In [6]:
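def print_stats(df, type = 'fl'):
    """Prints accuracy, precision and true positive rate (recall)
    of the algorithm's label (type + '_cal') against the manual label (type)."""
    var = type
    var_cal = type + '_cal'

    # confusion-matrix counts (the _cal column is boolean, as produced in In [5])
    tn = sum((df[var] == df[var_cal]) & ~df[var_cal])
    fp = sum((df[var] != df[var_cal]) & df[var_cal])
    fn = sum((df[var] != df[var_cal]) & ~df[var_cal])
    tp = sum((df[var] == df[var_cal]) & df[var_cal])

    print("Accuracy {:.2f}%".format( 100 * (tp + tn)/(tp + tn + fp + fn)))
    # precision and true positive rate are skipped when their denominator is zero
    if tp + fp > 0:
        print("Precision {:.2f}%".format( 100 * tp/(tp + fp)))
    if tp + fn > 0:
        print("True positive rate {:.2f}%".format( 100 * tp/(tp + fn)))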
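
As a quick sanity check of the three statistics, the sketch below computes them on a made-up set of eight labels using plain Python. The helper `confusion_counts` and the toy labels are assumptions for illustration; the figures reported in this notebook come from `print_stats` applied to `gold_standard`.

```python
def confusion_counts(manual, predicted):
    """Count tp, fp, fn, tn for paired 0/1 labels (hypothetical helper)."""
    pairs = list(zip(manual, predicted))
    tp = sum(1 for m, p in pairs if m == 1 and p == 1)
    fp = sum(1 for m, p in pairs if m == 0 and p == 1)
    fn = sum(1 for m, p in pairs if m == 1 and p == 0)
    tn = sum(1 for m, p in pairs if m == 0 and p == 0)
    return tp, fp, fn, tn

# Toy data: three sentences are truly forward-looking; the algorithm
# finds two of them and raises one false alarm.
manual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(manual, predicted)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)  # reported as "True positive rate" in print_stats

print(tp, fp, fn, tn)                                        # 2 1 1 4
print("Accuracy {:.2f}%".format(100 * accuracy))             # Accuracy 75.00%
print("Precision {:.2f}%".format(100 * precision))           # Precision 66.67%
print("True positive rate {:.2f}%".format(100 * recall))     # True positive rate 66.67%
```

Accuracy is high here mostly because true negatives dominate, which is why precision and the true positive rate are reported alongside it.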

In [7]:
print_stats(gold_standard[gold_standard['training sample']==1], type = 'fl')

Accuracy 95.23%
Precision 82.09%
True positive rate 85.94%


In [8]:
print_stats(gold_standard[gold_standard['training sample']==1], type = 'fl_qt')

Accuracy 97.16%
Precision 90.20%
True positive rate 69.70%


In [9]:
print_stats(gold_standard[gold_standard['training sample']==0], type = 'fl')

Accuracy 95.82%
Precision 83.61%
True positive rate 87.93%


In [10]:
print_stats(gold_standard[gold_standard['training sample']==0], type = 'fl_qt')

Accuracy 96.81%
Precision 76.00%
True positive rate 73.08%
