# Classification performance of the keyword search algorithm

The data and code here produce the in-sample and out-of-sample classification performance statistics.

I use keyword search to find forward-looking sentences. The forward-looking keyword list is available [here](https://github.com/yiyangw2/time_frame_gold_corpus/blob/main/fls_terms2.txt). In addition, I treat a sentence as forward-looking if it mentions a calendar year within the coming ten years. The word list is adapted from [previous studies](https://pubsonline.informs.org/doi/abs/10.1287/mnsc.2014.1921).
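
As a rough illustration of the matching idea, the sketch below wraps each keyword in word-boundary anchors and adds patterns for the years after the call. The terms in `EXAMPLE_TERMS`, the helper names and the `horizon` parameter are assumptions made for this example; the real terms come from `fls_terms2.txt`, and the default of nine years mirrors `range(future_year+1, future_year+10)` in the code further down. Unlike the notebook code, the sketch escapes each term with `re.escape`, so it treats terms as literal text.

```python
import re

# Hypothetical terms for illustration only; the real list is in fls_terms2.txt.
EXAMPLE_TERMS = ["will", "expect", "anticipate"]

def build_patterns(terms):
    """Anchor each term with \\b so that 'will' does not match 'willing'."""
    return [re.compile(r"\b" + re.escape(term) + r"\b") for term in terms]

def future_year_patterns(call_year, horizon=9):
    """Patterns for the years following call_year (horizon is an assumed parameter)."""
    return [
        re.compile(r"\b" + str(year) + r"\b")
        for year in range(call_year + 1, call_year + 1 + horizon)
    ]

def looks_forward(sentence, call_year):
    """Return True if the sentence contains a keyword or a future year."""
    patterns = build_patterns(EXAMPLE_TERMS) + future_year_patterns(call_year)
    text = sentence.lower()
    return any(pattern.search(text) for pattern in patterns)

print(looks_forward("We expect margins to improve.", 2019))
print(looks_forward("She was willing to help.", 2019))
print(looks_forward("The plant opens in 2021.", 2019))
```

The first and third sentences are flagged (a keyword and a future year, respectively); the second is not, because the word boundary prevents `will` from matching inside `willing`.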

Among the forward-looking statements, I use a time frame keyword list to identify those that specify a time frame. The time frame keyword list is available [here](https://github.com/yiyangw2/time_frame_gold_corpus/blob/main/qt_terms2.txt). It is adapted from [previous studies](https://link.springer.com/article/10.1007/s11142-015-9329-8).
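
The two-stage logic can be sketched as follows: one flag for forward-looking keywords, one flag for time frame keywords, and their product for forward-looking statements with a time frame. Both keyword lists and the function names here are hypothetical stand-ins for illustration; the real lists are in `fls_terms2.txt` and `qt_terms2.txt`.

```python
import re

# Hypothetical keyword lists; the real ones live in fls_terms2.txt and qt_terms2.txt.
FL_TERMS = ["will", "expect"]
TIME_FRAME_TERMS = ["next quarter", "by the end of", "over the next"]

def contains_any(sentence, terms):
    """Return True if any term occurs in the sentence as a whole word or phrase."""
    text = sentence.lower()
    return any(re.search(r"\b" + re.escape(term) + r"\b", text) for term in terms)

def classify(sentence):
    """Return the forward-looking flag, the time frame flag and their product."""
    fl = int(contains_any(sentence, FL_TERMS))
    qt = int(contains_any(sentence, TIME_FRAME_TERMS))
    return {"fl": fl, "qt": qt, "fl_qt": fl * qt}

print(classify("We expect to finish the rollout by the end of June."))
print(classify("We will keep investing in the brand."))
print(classify("Next quarter was mentioned in the prior call."))
```

Only the first sentence receives `fl_qt = 1`: the second has no time frame term, and the third has a time frame term but no forward-looking keyword.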

In [86]:
import pandas as pd
gold_standard = pd.read_excel("Gold Standard Corpus.xlsx")
gold_standard[['file_name', 'last_update', 'section', 'context', 'speaker_number', 'future_year','speaker_text', 'fl', 'qt', 'fl_qt', 'fl_cal',
       'qt_cal', 'fl_qt_cal', 'training sample', 'test sample']]

Unnamed: 0,file_name,last_update,section,context,speaker_number,future_year,speaker_text,fl,qt,fl_qt,fl_cal,qt_cal,fl_qt_cal,training sample,test sample
0,12614899_T,2019-08-01 19:19:18+10:00,1,qa,23,2019,Okay.,0,0,0,0,0,0,0,0
1,12614899_T,2019-08-01 19:19:18+10:00,1,qa,23,2019,"And lastly from me, can you kind of give us a ...",0,0,0,0,0,0,1,0
2,12614899_T,2019-08-01 19:19:18+10:00,1,qa,23,2019,What's the share of wallet for Knoll products?,0,0,0,0,0,0,1,0
3,702568_T,2003-01-30 16:25:51+11:00,1,qa,109,2003,Okay.,0,0,0,0,0,0,1,0
4,702568_T,2003-01-30 16:25:51+11:00,1,qa,109,2003,So I don't think anybody said much more.,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1283,2723598_T,2010-02-17 13:53:38+11:00,1,pres,5,2010,"With that, let's open the call to your questions.",0,0,0,0,0,0,0,1
1284,2723598_T,2010-02-17 13:53:38+11:00,1,pres,5,2010,"Yvette, please review the Q&A procedure.",0,0,0,0,0,0,1,0
1285,739440_T,2003-05-07 12:37:12+10:00,1,qa,154,2003,This is a follow up question from Chris Joseph...,0,0,0,0,0,0,1,0
1286,739440_T,2003-05-07 12:37:12+10:00,1,qa,154,2003,from [inaudible].,0,0,0,0,0,0,1,0


The underlying text of the gold standard is in [`Gold Standard Corpus.xlsx`](https://github.com/yiyangw2/time_frame_gold_corpus/blob/main/Gold%20Standard%20Corpus.xlsx). It consists of 200 randomly selected management paragraphs (1288 sentences). As shown above, each sentence carries identifiers for its source. The columns `fl_cal`, `qt_cal` and `fl_qt_cal` record the algorithm's classification: whether the sentence is forward-looking, whether it contains a time frame, and whether it is a forward-looking statement with a time frame. Similarly, the columns `fl`, `qt` and `fl_qt` record my manual classification of the same three properties. The underlying earnings call transcripts are from the Melbourne Centre for Corporate Governance and Regulation (MCCGR) database.

The following functions apply the keyword matching to each sentence and show how `fl_cal`, `qt_cal` and `fl_qt_cal` in `gold_standard` are produced.

In [87]:
import re

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

def word_count(sent):
    words = word_tokenize(sent)
    words_clean = [t for t in words if not _is_punctuation(t)]
    # print(words_clean)
    return len(words_clean)

punctuation_regex = r'[!\"#$%&\'\\()*+,-./:;<=>?@[\]^_`{|}~’]'

def _is_punctuation(token):
    match = re.match(r'^' + punctuation_regex + r'$', token)
    return match is not None

def clean_sent(sent):
    return re.sub(punctuation_regex, "", sent)

def text_word_count(text):
    sents = sent_tokenize(text)
    clean_sents = [clean_sent(sent) for sent in sents]
    return sum([word_count(sent) for sent in clean_sents])

def create_regex_list(terms_file:str):
    """Creates a list of compiled regexes, one per line of terms_file."""

    # opens the terms file in "r" (read) mode and reads it
    # line by line into a list of terms
    with open(terms_file,"r") as file:
        terms = file.read().splitlines()

    # adds word boundary (\b) anchors to the beginning and
    # the end of each term before compiling it
    terms_regex = [re.compile(r'\b' + term + r'\b') for term in terms]
    return terms_regex


def create_fls_regex_list(fls_terms_file:str):
    """Creates a list of compiled regexes for the FLS terms."""
    # same construction as create_regex_list, applied to the FLS terms file
    return create_regex_list(fls_terms_file)

fls_terms_file = r"fls_terms2.txt"

fls_terms_regex = create_fls_regex_list(fls_terms_file)

def is_forward_looking(sentence:str, fls_terms_regex):
    """Returns whether sentence is forward-looking."""
    for fls_term in fls_terms_regex:
        if fls_term.search(sentence):
            return True
    return False

qt_terms_file = r"qt_terms2.txt"

qt_terms_regex = create_regex_list(qt_terms_file)

def count_term(text:str, qt_terms_regex_with_future_years):
    """Returns the total number of matches of the given regex terms in text."""
    text = text.lower()
    qt_count = 0
    for qt_term in qt_terms_regex_with_future_years:
        qt_count = qt_count + len(qt_term.findall(text))
    return qt_count

def is_term(text:str, qt_terms_regex):
    """Returns whether text contains at least one time frame term."""
    # count_term lowercases the text itself
    return count_term(text, qt_terms_regex) > 0


def is_future_year(text:str, future_year_terms):
    """Returns whether text mentions at least one future year."""
    return count_term(text, future_year_terms) > 0

In [91]:
def disclosure_horizon(the_text, fls_key_list=fls_terms_regex, qt_key_list=qt_terms_regex, future_year=2023): 
  fl_dict = {}
  sentences = nltk.tokenize.sent_tokenize(the_text)
  nfl_sent = 0
  nfl_qt_sent = 0
  nfl_qt_sent_keys = 0 
  nfl_qt_sent_year = 0
  nfl_qt_sent_both = 0
  nfl_nqt_sent = 0 
  fl_sent = 0
  fl_qt_sent = 0
  fl_qt_sent_keys = 0 
  fl_qt_sent_year = 0
  fl_qt_sent_both = 0
  fl_nqt_sent = 0 
  
  future_year = int(future_year)
  # regex patterns for the nine calendar years following future_year
  future_year_terms=[re.compile(r"[^$,]\b" + str(y) + r"\b(?!(%|,\d|.\d))") for y in range(future_year+1,future_year+10)]
  fls_keys = fls_key_list + future_year_terms
  for sent in sentences:
    sent_lower = sent.lower()
    if is_forward_looking(sent_lower, fls_keys): 
      fl_sent = fl_sent + 1
      if is_term(sent_lower, qt_key_list) and is_future_year(sent_lower, future_year_terms):
        fl_qt_sent_both = fl_qt_sent_both + 1
      elif is_term(sent_lower, qt_key_list) and (not is_future_year(sent_lower, future_year_terms)):
        fl_qt_sent_keys = fl_qt_sent_keys + 1
      elif (not is_term(sent_lower, qt_key_list)) and is_future_year(sent_lower, future_year_terms):
        fl_qt_sent_year = fl_qt_sent_year + 1
      else:
        fl_nqt_sent = fl_nqt_sent + 1
    else:
      nfl_sent = nfl_sent + 1
      if is_term(sent_lower, qt_key_list) and is_future_year(sent_lower, future_year_terms):
        nfl_qt_sent_both = nfl_qt_sent_both + 1
      elif is_term(sent_lower, qt_key_list) and (not is_future_year(sent_lower, future_year_terms)):
        nfl_qt_sent_keys = nfl_qt_sent_keys + 1
      elif (not is_term(sent_lower, qt_key_list)) and is_future_year(sent_lower, future_year_terms):
        nfl_qt_sent_year = nfl_qt_sent_year + 1
      else:
        nfl_nqt_sent = nfl_nqt_sent + 1
  
  fl_qt_sent = fl_qt_sent_keys + fl_qt_sent_year + fl_qt_sent_both
  nfl_qt_sent = nfl_qt_sent_keys + nfl_qt_sent_year + nfl_qt_sent_both
  
  fl_dict.update({"fl_sent": fl_sent})
  fl_dict.update({"fl_qt_sent": fl_qt_sent}) 
  fl_dict.update({"fl_qt_sent_both": fl_qt_sent_both})   
  fl_dict.update({"fl_qt_sent_keys": fl_qt_sent_keys})   
  fl_dict.update({"fl_qt_sent_year": fl_qt_sent_year})   
  fl_dict.update({"fl_nqt_sent": fl_nqt_sent}) 
  fl_dict.update({"nfl_sent": nfl_sent})
  fl_dict.update({"nfl_qt_sent": nfl_qt_sent}) 
  fl_dict.update({"nfl_qt_sent_both": nfl_qt_sent_both})   
  fl_dict.update({"nfl_qt_sent_keys": nfl_qt_sent_keys})   
  fl_dict.update({"nfl_qt_sent_year": nfl_qt_sent_year})   
  fl_dict.update({"nfl_nqt_sent": nfl_nqt_sent})   
  text = the_text.lower()
  fl_dict.update({"word": text_word_count(text)})
  fl_dict.update({"sent": len(sentences)})
  
  if len(sentences) > 0 and (text_word_count(the_text)>0):
    return fl_dict


In [92]:
def sent_classifier(sent, fls_key_list=fls_terms_regex, qt_key_list=qt_terms_regex, future_year=2023): 
  
  future_year = int(future_year)  
  future_year_terms=[re.compile(r"[^$,]\b" + str(y) + r"\b(?!(%|,\d|.\d))") for y in range(future_year+1,future_year+10)]
  fls_keys = fls_key_list + future_year_terms
  sent_lower = sent.lower()
  
  fl = 0
  qt = 0
  
  if is_forward_looking(sent_lower, fls_keys): 
    fl = 1
  if is_term(sent_lower, qt_key_list) or is_future_year(sent_lower, future_year_terms):
    qt = 1
  
  if text_word_count(sent)>0:
    return fl, qt
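
To see what the future-year pattern in `disclosure_horizon` and `sent_classifier` accepts and rejects, the sketch below rebuilds the same regular expression and runs it on a few made-up sentences. The helper names and example sentences are assumptions for illustration; only the pattern string and the year range are copied from the code above.

```python
import re

def future_year_patterns(future_year):
    """Rebuild the future-year patterns for the nine years after future_year."""
    return [
        re.compile(r"[^$,]\b" + str(y) + r"\b(?!(%|,\d|.\d))")
        for y in range(future_year + 1, future_year + 10)
    ]

def mentions_future_year(sentence, future_year):
    """Return True if any future-year pattern matches the sentence."""
    return any(p.search(sentence) for p in future_year_patterns(future_year))

examples = [
    "We plan to open the plant in 2021.",  # plain year mention
    "The project cost $2021 in total.",    # preceded by a dollar sign
    "Margins grew 2021% on paper.",        # followed by a percent sign
    "2021 will be a better year.",         # year is the first token
    "By 2021 5G coverage will expand.",    # followed by a space and a digit
]
for sentence in examples:
    print(mentions_future_year(sentence, 2019), "|", sentence)
```

With a call year of 2019, only the first sentence matches. The dollar and percent cases are excluded by design. The fourth is rejected because `[^$,]` requires some character before the year, so a year at the very start of a sentence is not detected. The fifth is rejected because the `.` in the lookahead is unescaped and therefore matches any character followed by a digit.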


In [93]:
# Classify each sentence once, using the year stored alongside it in the corpus.
results = [
    sent_classifier(sent=text, future_year=year)
    for text, year in zip(gold_standard["speaker_text"], gold_standard["future_year"])
]

gold_standard['fl_cal'] = [fl for fl, _ in results]
gold_standard['qt_cal'] = [qt for _, qt in results]

In [94]:
def fl_qt(fl, qt):
    return fl * qt

gold_standard["fl_qt_cal"] = gold_standard.apply(lambda x: fl_qt(x["fl_cal"], x["qt_cal"]), axis=1)

The following function generates statistics to report the performance of the algorithm on identifying forward-looking statements and forward-looking statements with time frames.

In [95]:
def print_stats(df, type='fl'):
    """Prints accuracy, precision and true positive rate for one label type."""
    # Cast the 0/1 labels to booleans so that ~ is a logical NOT
    # rather than a bitwise NOT on integers.
    actual = df[type].astype(bool)
    predicted = df[type + '_cal'].astype(bool)

    tn = (~actual & ~predicted).sum()
    fp = (~actual & predicted).sum()
    fn = (actual & ~predicted).sum()
    tp = (actual & predicted).sum()

    print("Accuracy {:.2f}%".format(100 * (tp + tn) / (tp + tn + fp + fn)))
    if tp + fp > 0:
        print("Precision {:.2f}%".format(100 * tp / (tp + fp)))
    if tp + fn > 0:
        print("True positive rate {:.2f}%".format(100 * tp / (tp + fn)))
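
To make the three statistics concrete, here is a small worked example on made-up labels, written in plain Python so that the counts can be checked by hand. The function name and the two label lists are hypothetical and are not taken from the gold standard.

```python
def classification_stats(actual, predicted):
    """Return confusion counts and summary rates for paired 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(pairs),
        # Rates are undefined when their denominator is zero.
        "precision": tp / (tp + fp) if tp + fp > 0 else None,
        "true_positive_rate": tp / (tp + fn) if tp + fn > 0 else None,
    }

# Made-up labels: manual coding versus algorithm output for eight sentences.
manual = [1, 1, 0, 0, 1, 0, 0, 0]
algorithm = [1, 0, 0, 1, 1, 0, 0, 0]

stats = classification_stats(manual, algorithm)
print(stats)
```

Here the algorithm finds two of the three manually coded positives and raises one false alarm, so accuracy is 6/8 = 75%, precision is 2/3 and the true positive rate is 2/3.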

In [96]:
print_stats(gold_standard[gold_standard['training sample']==1], type = 'fl')

Accuracy 95.23%
Precision 82.09%
True positive rate 85.94%


In [97]:
print_stats(gold_standard[gold_standard['training sample']==1], type = 'fl_qt')

Accuracy 97.16%
Precision 90.20%
True positive rate 69.70%


In [98]:
print_stats(gold_standard[gold_standard['training sample']==0], type = 'fl')

Accuracy 95.82%
Precision 83.61%
True positive rate 87.93%


In [99]:
print_stats(gold_standard[gold_standard['training sample']==0], type = 'fl_qt')

Accuracy 96.81%
Precision 76.00%
True positive rate 73.08%
