# Text to Features

### Overview

In this notebook, we load the dataframe `df_tsla OR aapl.csv` of Tweets containing the ticker symbols `AAPL` and `TSLA`. For each Tweet, given as a string, we compute the (normalized) number of matches against several word libraries. The word libraries include:
- `Henry08_poswords.txt` and `Henry08_negwords.txt`, containing positive and negative words, respectively, from Henry (2008).
- `LM11_pos_words.txt` and `LM11_neg_words.txt`, containing words related to positive and negative sentiments, respectively, from Loughran and McDonald (2011).
- `ML_positive_bigram.csv` and `ML_negative_bigram.csv`, containing positive and negative bigrams (no trigrams?), respectively, from Hagenau et al. (2013).
- `news_library.txt`, containing names of mainstream business news agencies.

The word counts will be added as new columns to the dataframe and saved to a new file, `df_tsla_aapl_features_added.csv`, which will be used to fit our model.

### Libraries

We use `pandas` for dataframes and `spaCy` for linguistic operations.

In [3]:
import pandas as pd
import spacy

We use `spaCy`'s `en_core_web_sm` model as the underlying English language processing model. Throughout this notebook, we denote the model by `nlp`.

In [2]:
nlp = spacy.load('en_core_web_sm')

We also use `PhraseMatcher` (https://spacy.io/api/phrasematcher) to count matches, since it can match multi-word phrases such as bigrams as well as single words.

In [4]:
from spacy.matcher import PhraseMatcher
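
To illustrate how `PhraseMatcher` handles a bigram alongside a single word, here is a minimal sketch. It assumes spaCy v3 (where `matcher.add` takes a key and a list of `Doc` patterns) and uses a blank English pipeline, `nlp_demo`, instead of `en_core_web_sm` so that no model download is needed. The phrases and the example sentence are made up for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank English pipeline is enough here: PhraseMatcher only needs the tokenizer.
nlp_demo = spacy.blank("en")

# attr="LOWER" compares lowercased token text, so matching is case-insensitive.
matcher = PhraseMatcher(nlp_demo.vocab, attr="LOWER")

# Hypothetical phrase lists: library "0" holds a single word, library "1" a bigram.
matcher.add("0", [nlp_demo.make_doc("decline")])
matcher.add("1", [nlp_demo.make_doc("record high")])

doc = nlp_demo.make_doc("TSLA hits a Record High after the recent decline")

# Each match is (match_id, start, end), where start/end are token indices
# and match_id is a hash that the vocab's string store maps back to the key.
matches = sorted(
    ((nlp_demo.vocab.strings[match_id], start, end) for match_id, start, end in matcher(doc)),
    key=lambda m: m[1],
)
print(matches)
```

The `start` and `end` values are token positions in the tokenized text, with `end` exclusive, so a bigram match spans two tokens.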

### Functions

In this section, we write all the functions necessary to compute the word counts. 

In [None]:
def txt_to_tokens(filename: str):
    # Placeholder: the body of this function has not been written yet.
    raise NotImplementedError

In [None]:
def csv_to_tokens(filename: str):
    # Placeholder: the body of this function has not been written yet.
    raise NotImplementedError
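
As a rough idea of what a `.txt` loader could look like, the sketch below reads a whitespace-separated word list into a Python list. This is only an assumption about the intended behavior, based on the contents of `Henry08_negwords.txt` printed at the end of this notebook. The helper name `read_word_list` and the temporary demo file are hypothetical and not part of the notebook.

```python
import os
import tempfile


def read_word_list(filename: str) -> list:
    """Hypothetical helper: read a whitespace-separated word list from a .txt file."""
    with open(filename, "r", encoding="utf-8") as f:
        # str.split() with no argument splits on any whitespace,
        # so the trailing newline does not produce an empty entry.
        return f.read().split()


# Demonstrate on a small temporary file rather than the real library.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_words.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("negative negatives fail fails\n")
    demo_words = read_word_list(demo_path)

print(demo_words)
```

A list of strings in this shape could be passed as one element of `keys` in `tweet_to_wordlocs`. The bigram `.csv` files would need their own loader, since their layout is not shown here.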

In [7]:
def tweet_to_wordlocs(tweet, keys, case_sensitive=False):
    """
    Input:
    tweet -> The tweet text as a string.
    keys -> A list of lists of key words/bigrams. For example, keys = [henry08_pos, henry08_neg, ..., newslib].
            Each key word/bigram is assumed to contain only English letters and spaces.
    case_sensitive -> If True, match the terms in a case-sensitive manner.
    
    Output:
    wordlocs -> A list whose elements are of the form [keyloc, start, end]. If a key word/bigram from the
    i-th library, keys[i], spans tokens j through j+k-1 (inclusive) of the tokenized tweet, then
    keyloc = i, start = j and end = j+k. The elements of wordlocs are sorted based on the value of start.
    """
    # Define the phrase matcher according to input case sensitivity
    if case_sensitive:
        matcher = PhraseMatcher(nlp.vocab)
    else:
        matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
    
    # Tokenize the tweet
    tweet_text = nlp(tweet)
    
    # Tokenize the key phrases and introduce them to the phrase matcher model.
    for i in range(len(keys)):
        phrases = [nlp(phrase) for phrase in keys[i]]
        matcher.add(str(i), phrases)
    
    # Find the matches
    matches = matcher(tweet_text)
    
    # Define and populate wordlocs
    wordlocs = []
    for i in range(len(matches)):
        keyloc_shifted, start, end = matches[i]
        keyloc = int(nlp.vocab.strings[keyloc_shifted])
        wordlocs.append([keyloc, start, end])
    
    return wordlocs

In [8]:
def wordlocs_to_wordcounts(wordlocs, tweet_length, num_keys, normalize=True):
    """
    Input:
    wordlocs -> The list of keys and locations found in the tweet, cf. the tweet_to_wordlocs function.
    tweet_length -> The length of the tweet in words.
    num_keys -> The number of phrase lists, i.e. len(keys) from tweet_to_wordlocs function.
    normalize -> If True, the word count for each keyword list is normalized by tweet_length.
                 If False, the raw word count will be returned.
    
    Output:
    wordcounts -> A list of length num_keys. Each element is the (normalized) word count corresponding
                  to the number of phrases from one of the phrase lists that appear in the tweet, 
                  as reported in wordlocs.
    """
    # Define and populate wordcounts
    wordcounts = [0 for i in range(num_keys)]
    for j in range(len(wordlocs)):
        wordcounts[wordlocs[j][0]] += 1
    
    # Perform normalization as needed
    if normalize:
        return [wordcounts[i]/tweet_length for i in range(num_keys)]
    
    # In the case where normalization is not called for
    return wordcounts
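
To make the counting and normalization concrete, here is a small worked example. The function `count_by_library` is a condensed, illustrative variant of `wordlocs_to_wordcounts`, not the notebook's own code, and the `wordlocs` values are made up. It also returns zeros for an empty tweet, which is an added assumption; the function above would raise a `ZeroDivisionError` when `tweet_length` is 0 and `normalize` is True.

```python
from collections import Counter


def count_by_library(wordlocs, tweet_length, num_keys, normalize=True):
    """Condensed, illustrative variant of wordlocs_to_wordcounts."""
    # Tally how many matches came from each library index.
    counts = Counter(keyloc for keyloc, _start, _end in wordlocs)
    raw = [counts.get(i, 0) for i in range(num_keys)]
    if not normalize:
        return raw
    # Assumption: an empty tweet yields all-zero features instead of an error.
    if tweet_length == 0:
        return [0.0] * num_keys
    return [c / tweet_length for c in raw]


# Hypothetical matches in an 8-word tweet: two hits from library 0, one from library 2.
example_wordlocs = [[0, 1, 2], [2, 3, 5], [0, 6, 7]]

raw_counts = count_by_library(example_wordlocs, tweet_length=8, num_keys=3, normalize=False)
normalized = count_by_library(example_wordlocs, tweet_length=8, num_keys=3)

print(raw_counts)   # [2, 0, 1]
print(normalized)   # [0.25, 0.0, 0.125]
```

Note that a bigram match counts once, regardless of how many tokens it spans, while the normalization divides by the full tweet length.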

In [10]:
# Use a context manager so the file handle is closed after reading
with open('Twitter_Sentiment_Analysis/Directional_Feature_Libraries/Henry08_negwords.txt', 'r') as f:
    text = f.read()

In [11]:
text

'negative negatives fail fails failing failure weak weakness weaknesses difficult difficulty hurdle hurdles obstacle obstacles slump slumps slumping slumped uncertain uncertainty unsettled unfavorable downturn depressed disappoint disappoints disappointing disappointed disappointment risk risks risky threat threats penalty penalties down decrease decreases decreasing decreased decline declines declining declined fall falls falling fell fallen drop drops dropping dropped deteriorate deteriorates deteriorating deteriorated worsen worsens worsening weaken weakens weakening weakened worse worst low lower lowest less least smaller smallest shrink shrinks shrinking shrunk below under challenge challenges challenging challenged\n'