# Text to Features


### Overview

In this notebook, we load the table `tweets_data.csv` of tweets. For each tweet, given as a string, we compute the (normalized) number of words it contains from each of several word libraries. The word libraries include:
- `Henry08_poswords.txt` and `Henry08_negwords.txt`, containing positive and negative words, respectively, from Henry (2008).
- `LM11_pos_words.txt` and `LM11_neg_words.txt`, containing words related to positive and negative sentiments, respectively, from Loughran and McDonald (2011).
- `ML_positive_bigram.csv` and `ML_negative_bigram.csv`, containing positive and negative bigrams, respectively, from Hagenau et al. (2013).
- `news_library.txt`, containing names of mainstream business news agencies.

Furthermore, for each row in the table, we identify which of the stock indices (ticker symbols) listed in `snp500_list.csv` and `nyse_list.csv` are mentioned in the tweet.

The word counts and the list of mentioned stock indices will be added as new columns to the dataframe and saved to a new file: `tweets_data_features_added.csv`. That file will be read by the notebook that performs our model fit.

This notebook can easily be modified to process a different tweet table file.


### Libraries

We use `pandas` for dataframes and `spaCy` for linguistic operations.

In [1]:
import pandas as pd
import spacy

We use `spaCy`'s `en_core_web_sm` model as the underlying English language processing model. Throughout this notebook, we denote the model by `nlp`.

In [2]:
nlp = spacy.load('en_core_web_sm')

We also import `PhraseMatcher` (https://spacy.io/api/phrasematcher), which matches multi-word phrases such as bigrams. In the current version it is only used by `tweet_to_wordlocs` in the Appendix.

In [3]:
from spacy.matcher import PhraseMatcher

### Setup

For the code in this section to work, first run the cells in the [Helper Functions](#the_destination) section. The docstrings of the helper functions also provide additional details about the methods employed.
<br><br>
Define the local path of your repository folder.

In [10]:
localpath = "/Users/josht/Documents/GitHub/erdos_twitter_project"

Load the file `tweets_data.csv` into a dataframe. 

In [11]:
# Run this for real data
#df_tweets = pd.read_csv(localpath + "/data/tweets_data.csv")

# Run this during the trial run
#df_tweets = pd.read_csv(localpath + "/data/Tweets_Raw/df_tsla OR aapl.csv")

# Starbucks Example
df_tweets = pd.read_csv(localpath + "/data/df_Starbucks.csv")

The tweet contents appear as strings in the `text` column.

In [12]:
df_tweets.head(3)

Unnamed: 0,created_at,entities_cashtags,entities_hashtags,entities_urls,public_metrics_like_count,public_metrics_quote_count,public_metrics_reply_count,public_metrics_retweet_count,text,entities_mentions,created_at_user,public_metrics_followers_count,public_metrics_following_count,public_metrics_listed_count,public_metrics_tweet_count,media_type
0,2021-09-30 19:59:36,0,0,2,4,0,0,0,Campus labor shortage delays opening of the St...,1,2021-07-26 18:14:59,21,28,0,35,0
1,2021-09-30 19:59:12,0,0,0,4,0,3,0,Yo what are fire Starbucks drinks,0,2020-05-23 16:28:47,96,116,0,919,0
2,2021-09-30 19:59:07,0,0,2,0,0,0,0,https://t.co/dTZGW5bmsO\n\nhttps://t.co/APZH7I...,2,2010-09-16 05:43:14,4151,4962,56,391400,0


We load the libraries of key phrases and news agency names, then put them in a list called `keys`. Each element of `keys` is the set of entries (words, bigrams, or agency names) from the corresponding library file.

In [13]:
keyfiles_words = [localpath + "/Twitter_Sentiment_Analysis/Directional_Feature_Libraries/Henry08_poswords.txt",
                  localpath + "/Twitter_Sentiment_Analysis/Directional_Feature_Libraries/Henry08_negwords.txt",
                  localpath + "/Twitter_Sentiment_Analysis/Directional_Feature_Libraries/LM11_pos_words.txt",
                  localpath + "/Twitter_Sentiment_Analysis/Directional_Feature_Libraries/LM11_neg_words.txt"]

In [14]:
keyfiles_bigrams = [localpath + "/Twitter_Sentiment_Analysis/Directional_Feature_Libraries/ML_positive_bigram.csv",
                   localpath + "/Twitter_Sentiment_Analysis/Directional_Feature_Libraries/ML_negative_bigram.csv"]

In [15]:
keyfiles_news = [localpath + "/Twitter_Sentiment_Analysis/Relevance_Feature_Libraries/news_library.txt"]

In [16]:
# All libraries
keys = [get_keywords(keyfile) for keyfile in keyfiles_words] + [get_keybigrams(keyfile) for keyfile in keyfiles_bigrams] + [get_news_agencies(keyfile) for keyfile in keyfiles_news]

# Ignoring bigrams (significantly faster)
#keys = [get_keywords(keyfile) for keyfile in keyfiles_words] + [get_news_agencies(keyfile) for keyfile in keyfiles_news]


To see how many entries each library contains, we compute

In [17]:
[len(keys[i]) for i in range(len(keys))]

[104, 85, 354, 2355, 12130, 13330, 23]

Define the library names in a legible manner for reference.

In [18]:
# Run this for real data
key_library = ["Henry08_pos", "Henry08_neg", "LM11_pos", "LM11_neg", "Hagenau13_pos", "Hagenau13_neg", "News_agencies"]

# Run this during the trial run
#key_library = ["Henry08_pos", "Henry08_neg", "LM11_pos", "LM11_neg", "News_agencies"]
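
As a quick sanity check, the library names can be paired with the sizes reported above. The following is a minimal sketch: `library_sizes` and `size_by_library` are names introduced here for illustration only, and the numbers are copied from the output of `In [17]` rather than recomputed.

```python
# Sizes copied from the output above; in the notebook they would come from
# [len(k) for k in keys].
library_sizes = [104, 85, 354, 2355, 12130, 13330, 23]
key_library = ["Henry08_pos", "Henry08_neg", "LM11_pos", "LM11_neg",
               "Hagenau13_pos", "Hagenau13_neg", "News_agencies"]

# Pair each library name with its size (hypothetical helper dict).
size_by_library = dict(zip(key_library, library_sizes))

for name, size in size_by_library.items():
    print(f"{name:>15}: {size}")
```

Because `zip` stops at the shorter input, a mismatch between the number of names and the number of libraries would show up as a missing entry, so comparing `len(size_by_library)` with `len(keys)` is a cheap guard.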

Now, we prepare the list of stock indices, starting once again from the file paths.

In [19]:
keyfiles_stocks = [localpath + "/data/Stock_indices/snp500_list.csv",
                  localpath + "/data/Stock_indices/nyse_list.csv"]

We load the dataframe from each csv file, then extract the list of ticker symbols from its `Symbol` column. Finally, we store all of these lists in `stocks`, in a similar manner to `keys`.

In [20]:
stocks = []

for file in keyfiles_stocks:
    df_stocks = pd.read_csv(file)
    stocks.append(list(df_stocks["Symbol"]))

For `snp500_list.csv`, the company names are available under the column `Security`. For `nyse_list.csv`, however, only index names are available under the column `Name`. Unfortunately, there is no regular pattern for going from index names to company names. Moreover, index names are so long and specific that they are probably very rarely mentioned in full on Twitter. I am open to an alternative solution.

In [21]:
company_names = []

# Add company names from snp500_list.csv
df_stocks = pd.read_csv(keyfiles_stocks[0])
company_names.append(list(df_stocks["Security"]))

# As a placeholder for nyse_list.csv, append an empty list
company_names.append([])

Finally, define the stock index library names for reference.

In [22]:
stock_library = ["S&P500", "NYSE"]

### Mentioned Stock Indices

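We apply `get_stock_list` to each tweet in `df_tweets["text"]` and each stock library in `stocks`. We store the stock lists in the list called `mentioned_stocks`. Note that `company_names` is not passed to `get_stock_list` here, so only symbols prefixed by `$` or `#` are matched.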

In [23]:
mentioned_stocks = [[[] for i in range(df_tweets.shape[0])] for j in range(len(stocks))]

for j in range(len(stocks)):
    for i in range(df_tweets.shape[0]):
        mentioned_stocks[j][i] = get_stock_list(df_tweets["text"].iloc[i], stocks[j])

Then, we put `mentioned_stocks` into corresponding new columns, e.g. `Mentioned_stocks_S&P500`.

In [24]:
for j in range(len(stocks)):
    df_tweets["Mentioned_stocks_" + stock_library[j]] = mentioned_stocks[j]

Now, we see that new columns are added listing the mentioned stocks. Some rows have empty lists for both columns.

In [25]:
df_tweets.sample(3)

Unnamed: 0,author_id,created_at,created_at_user,location,name,public_metrics_followers_count,public_metrics_following_count,public_metrics_like_count,public_metrics_listed_count,public_metrics_quote_count,public_metrics_reply_count,public_metrics_retweet_count,public_metrics_tweet_count,source,text,tweet_id,username,Mentioned_stocks_S&P500,Mentioned_stocks_NYSE
28,1023266600223481856,2018-12-31T23:41:21.000Z,2018-07-28T17:59:03.000Z,,Polixenes,4163,1023,2,102,0,1,0,13389,Twitter Web Client,@DeanSheikh1 @Jekajojo @schlosta2 @elonmusk @T...,1079885248568135680,Polixenes13,[],[]
10,24843079,2018-12-31T23:50:05.000Z,2009-03-17T05:04:24.000Z,,"Dan Stringer, SEC Pimp",7509,2107,0,0,0,0,22,117795,Twitter for iPhone,RT @4Awesometweet: 79% of Electric Vehicle ta...,1079887447155040256,Danstringer74,[],[]
16,1042888795501289472,2018-12-31T23:43:19.000Z,2018-09-20T21:30:39.000Z,,Capvalue89,14,923,0,0,0,0,60,1192,Twitter for iPhone,RT @ElonBachman: Am I hallucinating? $TSLA jus...,1079885743437295617,CapitalOnValue,[TSLA],[]


Finally, to save time for future steps, we drop from `df_tweets` the rows whose tweet mentions no stock index.

In [26]:
# Collect the indices for the rows in which no stock from each stock library is mentioned.
emp_ind = []
for i in range(df_tweets.shape[0]):
    if len(df_tweets["Mentioned_stocks_S&P500"].iloc[i]) == 0 and len(df_tweets["Mentioned_stocks_NYSE"].iloc[i]) == 0:
        emp_ind.append(df_tweets.index[i])   # store the row label, since drop() removes rows by label

df_tweets_shorten = df_tweets.drop(emp_ind).copy()
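
The same filtering can also be expressed with a boolean mask, which avoids collecting row labels by hand. The following is only a sketch on a toy dataframe: `toy`, `has_stock`, and `toy_shorten` are illustrative names, and the three rows are made up.

```python
import pandas as pd

# Toy stand-in for df_tweets with the two list-valued columns.
toy = pd.DataFrame({
    "text": ["$TSLA up", "nice coffee", "#HRI news"],
    "Mentioned_stocks_S&P500": [["TSLA"], [], []],
    "Mentioned_stocks_NYSE": [[], [], ["HRI"]],
})

# True where at least one of the two lists is non-empty.
has_stock = (
    (toy["Mentioned_stocks_S&P500"].map(len) > 0)
    | (toy["Mentioned_stocks_NYSE"].map(len) > 0)
)

# Keep only the rows that mention a stock.
toy_shorten = toy[has_stock].copy()
print(toy_shorten.index.tolist())
```

A mask selects rows by position in the Series rather than by label, so it behaves the same whether or not the dataframe still has its default integer index.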


We see that some rows have been removed.

In [27]:
print("Total number of rows:", len(df_tweets.index))

Total number of rows: 30


In [28]:
print("Total number of rows that mention a stock:", len(df_tweets_shorten.index))

Total number of rows that mention a stock: 15


### Word Counts

We apply `tweet_to_wordcounts` (and `tweet_to_bigramcounts` for the bigram libraries) to each tweet in `df_tweets_shorten["text"]`, i.e. to each tweet that mentions at least one stock. Then, we store the results in `wordcounts_all`.
<br><br>
Warning: this step may take a while.

In [29]:
wordcounts_all = [[-1 for i in range(df_tweets_shorten.shape[0])] for j in range(len(keys))]

for i in range(df_tweets_shorten.shape[0]):
    
    # Run this if ignoring bigrams
    # wordcounts = tweet_to_wordcounts(df_tweets_shorten["text"].iloc[i], keys)
    
    # Run this if including bigrams
    wordcounts = tweet_to_wordcounts(df_tweets_shorten["text"].iloc[i], keys[:4] + [keys[-1]])
    bigramcounts = tweet_to_bigramcounts(df_tweets_shorten["text"].iloc[i], keys[4:6])
    
    for j in range(len(keys)):
        if j <= 3:
            wordcounts_all[j][i] = wordcounts[j]
        elif j == len(keys) - 1:
            wordcounts_all[-1][i] = wordcounts[-1]
        else:
            wordcounts_all[j][i] = bigramcounts[j - 4]

Then, we put the resulting word counts for each phrase library into the corresponding new column, e.g. `Word_count_Henry08_pos`.

In [30]:
for j in range(len(keys[:-1])):
    df_tweets_shorten["Word_count_" + key_library[j]] = wordcounts_all[j]

Finally, we add the column for the number of news agency names that appear.

In [31]:
df_tweets_shorten["News_agencies_names_count"] = wordcounts_all[-1]

### Results

Applying all of the above operations related to word counts and mentioned stock indices, we obtain `df_tweets_shorten` in the following form. Note that the counts are normalized: each count is divided by the length of the raw tweet string, `len(tweet)`, i.e. its number of characters.

In [32]:
df_tweets_shorten[["text", "Mentioned_stocks_S&P500", "Mentioned_stocks_NYSE", "Word_count_Henry08_pos",
                  "Word_count_Henry08_neg", "Word_count_LM11_pos", "Word_count_LM11_neg", "Word_count_Hagenau13_pos", "Word_count_Hagenau13_neg","News_agencies_names_count"]]

Unnamed: 0,text,Mentioned_stocks_S&P500,Mentioned_stocks_NYSE,Word_count_Henry08_pos,Word_count_Henry08_neg,Word_count_LM11_pos,Word_count_LM11_neg,Word_count_Hagenau13_pos,Word_count_Hagenau13_neg,News_agencies_names_count
0,"$BLSP huge volume, closes up 12.5%. Shares st...","[MSFT, AAPL, AMZN, FB, TSLA, BRK.B]",[HRI],0.003311,0.0,0.0,0.0,0.0,0.0,0.0
1,"RT @Polixenes13: Ross, please just never stop ...",[TSLA],[],0.0,0.0,0.0,0.007143,0.0,0.0,0.0
2,@kimpaquette I pity the fool that is shorting ...,[TSLA],[],0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,$TSLA passed 190K model 3 VIN registered. Yeah...,[TSLA],[],0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,RT @TeslaCharts: Fraud. Fraud. Fraud. Fraud. \...,[TSLA],[],0.0,0.0,0.0,0.013333,0.0,0.0,0.0
8,FCC to suspend most operations this week due t...,[AAPL],[],0.008264,0.0,0.0,0.016529,0.0,0.008264,0.0
9,https://t.co/0HOIOe45er\n\n$TSLA news,[TSLA],[],0.0,0.0,0.0,0.0,0.0,0.0,0.0
11,Props to @GerberKawasaki for at least spelling...,[TSLA],[],0.0,0.010526,0.0,0.0,0.0,0.0,0.0
12,RT @stockmarkettv: Tesla Production Numbers So...,[TSLA],[],0.0,0.0,0.0,0.0,0.0,0.0,0.0
15,"Bottom Filled on $WDBG Closed up 23% , Ready t...","[MSFT, AAPL, AMZN, FB, TSLA, BRK.B]",[HRI],0.009967,0.0,0.0,0.003322,0.0,0.0,0.0
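
The helper `tweet_to_wordcounts` divides each count by `len(tweet)`, the character length of the tweet string, which is why the values above are so small. The arithmetic can be sketched as follows; `normalized_count` and the 140-character `toy_tweet` are hypothetical and serve only to illustrate the scale of the numbers.

```python
def normalized_count(n_matches: int, tweet: str) -> float:
    """Divide a raw match count by the character length of the tweet."""
    return n_matches / len(tweet)

# A made-up 140-character tweet containing one library word.
toy_tweet = "x" * 140
value = normalized_count(1, toy_tweet)
print(round(value, 6))  # 0.007143
```

With this normalization, longer tweets yield smaller values for the same number of matches, so the columns are comparable across tweets only as densities per character.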


Finally, we save `df_tweets_shorten` to a new csv file called `tweets_data_features_added.csv`. During the trial run, the active line below writes to `df_tsla_aapl_features_added.csv` instead.

In [33]:
# Run this for real data
#df_tweets_shorten.to_csv(localpath + "/data/Tweets_Preprocessed/tweets_data_features_added.csv", index=False)

# Run this during the trial run
df_tweets_shorten.to_csv(localpath + "/data/Tweets_Preprocessed/df_tsla_aapl_features_added.csv", index=False)

<a id='the_destination'></a>
### Helper Functions

For brevity, we write down all the necessary but lengthy functions in this section.

In [4]:
def get_keywords(filename: str):
    """
    Input: 
    filename -> The file name and its directory in string, with ending included. 
                The directory must be relative to the location of this notebook.
                This file contains keywords separated by space.
                
    Output: 
    keywords -> The set of strings, each of which is a word from the txt file.
    """
    # Load the file into a string, closing the file handle afterwards
    with open(filename, 'r') as f:
        text = f.read()
    
    # Define keywords
    keywords = []
    
    # This is to keep track of the word we are reading as we traverse text.
    this_word = ""
    
    # Traversing text
    for i in range(len(text)):
        if text[i] == " ":    # When running into " ", we have finished reading a word. 
            keywords.append(this_word)
            this_word = ""
        elif text[i:] == "\n":   # This may occur at the end of the string.
            break
        else:     # With an additional letter, just add it to the current word.
            this_word = this_word + text[i].lower()
    
    # If the string does not end in " ", we will need to append the last word to keywords.
    if this_word != "":
        keywords.append(this_word)
    
    # Return the result in the set format.
    return set(keywords)
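
For comparison, the same idea can be written more compactly with `str.split`. This is a sketch of an alternative, not the function used above: `get_keywords_compact` is a hypothetical name, and unlike `get_keywords` it splits on any whitespace (including newlines) and never produces empty strings when two spaces occur in a row.

```python
import os
import tempfile

def get_keywords_compact(filename: str) -> set:
    """Hypothetical variant: lowercase the file and split on any whitespace."""
    with open(filename, "r") as f:
        return set(f.read().lower().split())

# Demonstrate on a throwaway file rather than a real library file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Gain  PROFIT strong\n")
    path = tmp.name

words = get_keywords_compact(path)
os.remove(path)
print(sorted(words))
```

Because the splitting rules differ, the two functions may return sets of different sizes on the same library file, so the library sizes reported earlier should be rechecked before swapping one for the other.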

In [5]:
def get_keybigrams(filename: str):
    """
    Input: 
    filename -> The file name and its directory in string, with ending included. 
                The directory must be relative to the location of this notebook.
                This file contains key bigrams separated by "\n". 
    
    This function should work for any n-grams, given that the text file is written in the same format.
                
    Output: 
    keybigrams -> The set of strings, each of which is a bigram from the csv file.
    """
    # Load the file into a string, closing the file handle afterwards
    with open(filename, 'r') as f:
        text = f.read()
    
    # Define keywords
    keybigrams = []
    
    # This is to keep track of the word we are reading as we traverse text.
    this_bigram = ""
    
    # Traversing text
    for i in range(len(text)):
        if text[i] == "\n":    # When running into "\n", we have finished reading a bigram.
            keybigrams.append(this_bigram)
            this_bigram = ""
        else:     # With an additional letter or space, just add it to the current bigram.
            this_bigram = this_bigram + text[i].lower()
    
    # If the string does not end in "\n", we will need to append the last bigram to keybigrams.
    if this_bigram != "":
        keybigrams.append(this_bigram)
    
    # Return the result in the set format.
    return set(keybigrams)

In [6]:
def get_news_agencies(filename: str):
    """
    Input: 
    filename -> The file name and its directory in string, with ending included. 
                The directory must be relative to the location of this notebook.
                This file contains names of news agencies separated by space.
                
    Output: 
    news_agencies -> The set of strings, each of which is a name of news agency from the input file.
    """
    # Load the file into a string, closing the file handle afterwards
    with open(filename, 'r') as f:
        text = f.read()
    
    # Define news_agencies
    news_agencies = []
    
    # This is to keep track of the news agency name we are reading as we traverse text.
    this_word = ""
    
    # Traversing text
    for i in range(len(text)):
        if text[i] == " ":    # When running into " ", we have finished reading an agency's name. 
            news_agencies.append(this_word)
            this_word = ""
        elif text[i:] == "\n":   # This may occur at the end of the string.
            break
        else:     # With an additional letter, just add it to the current agency's name.
            this_word = this_word + text[i].lower()
    
    # If the string does not end in " ", we will need to append the last agency's name to news_agencies.
    if this_word != "":
        news_agencies.append(this_word)
    
    # Return the result in the set format.
    return set(news_agencies)

In [7]:
def tweet_to_wordcounts(tweet, keys, normalize=True):
    """
    Input:
    tweet -> The raw tweet text in string
    keys -> The list of sets of key words. For example, keys = [henry08_pos, henry08_neg, ..., newslib]
            Each keyword is assumed to be a single word containing only English letters.
            WARNING: bigram libraries must be excluded; use tweet_to_bigramcounts for those.
    normalize -> If True, the count for each keyword set is divided by len(tweet),
                 i.e. the number of characters in the raw tweet.
                 If False, the raw count is returned.
    
    Output:
    wordcounts -> A list of length len(keys). Each element is the (normalized) number of distinct
                  words in the tweet that belong to the corresponding keyword set.
    """
    # Define a spaCy's doc object for the tweet
    tweet_doc = nlp(tweet.lower())
    
    # Convert the doc object into a set of words
    tweet_words = set([token.text for token in tweet_doc])
    
    # Initialize wordcounts
    wordcounts = []
    
    # For each word library, we count the distinct words of the tweet that belong to it, using the
    # efficient set intersection. Then, if called for, we normalize by the character length of the raw tweet.
    for i in range(len(keys)):
        this_wordcount = len(tweet_words.intersection(keys[i])) 
        if normalize:
            this_wordcount_normalized = this_wordcount / len(tweet)
            wordcounts.append(this_wordcount_normalized)
        else:
            wordcounts.append(this_wordcount)
    
    return wordcounts
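
The counting step can be illustrated without spaCy. In the sketch below, `count_library_hits` is a hypothetical stand-in that tokenizes with `str.split` instead of `nlp`, and `toy_pos` and `toy_neg` are made-up libraries. It shows that the set intersection counts distinct words, so a word repeated in the tweet contributes only once.

```python
def count_library_hits(tweet: str, libraries: list, normalize: bool = True) -> list:
    """Count distinct library words in a tweet, one count per library."""
    # str.split is a stand-in tokenizer; the notebook uses spaCy instead.
    tweet_words = set(tweet.lower().split())
    counts = []
    for library in libraries:
        hits = len(tweet_words & library)
        # Normalize by the character length of the tweet, as the notebook does.
        counts.append(hits / len(tweet) if normalize else hits)
    return counts

toy_pos = {"gain", "strong"}
toy_neg = {"loss", "fraud"}
raw = count_library_hits("Fraud fraud fraud and a strong gain",
                         [toy_pos, toy_neg], normalize=False)
print(raw)  # [2, 1]
```

If the number of occurrences matters more than the number of distinct words, a `collections.Counter` over the tokens would be the natural replacement for the set.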

In [8]:
def tweet_to_bigramcounts(tweet, keys, normalize=True):
    """
    Input:
    tweet -> The raw tweet text in string
    keys -> The list of sets of key bigrams. For example, keys = [hagenau13_pos, hagenau13_neg]
            Each key bigram is assumed to contain only English letters and a space.
    normalize -> If True, the count for each bigram set is divided by len(tweet),
                 i.e. the number of characters in the raw tweet.
                 If False, the raw count is returned.
    
    Output:
    wordcounts -> A list of length len(keys). Each element is the (normalized) number of distinct
                  bigrams in the tweet that belong to the corresponding bigram set.
    """
    # Define a spaCy's doc object for the tweet
    tweet_doc = nlp(tweet.lower())
        
    # Convert the doc object into a list of words with stop words removed, in accordance with the bigram libraries.
    tweet_words = [token.text for token in tweet_doc if not token.is_stop]
    
    # Define the set of bigrams from the tweet, consisting of pairs of neighboring words.
    tweet_bigrams = set([tweet_words[i] + " " + tweet_words[i+1] for i in range(len(tweet_words) - 1)])
    
    # Initialize wordcounts
    wordcounts = []
    
    # For each bigram library, we count the distinct bigrams of the tweet that belong to it, using the
    # efficient set intersection. Then, if called for, we normalize by the character length of the raw tweet.
    for i in range(len(keys)):
        this_wordcount = len(tweet_bigrams.intersection(keys[i])) 
        if normalize:
            this_wordcount_normalized = this_wordcount / len(tweet)
            wordcounts.append(this_wordcount_normalized)
        else:
            wordcounts.append(this_wordcount)
    
    return wordcounts
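
The bigram construction in `tweet_to_bigramcounts` pairs each word with its right-hand neighbor. A small sketch of that step, with a hypothetical helper `adjacent_bigrams` and made-up inputs `toy_words` and `toy_library`:

```python
def adjacent_bigrams(words: list) -> set:
    """Join each pair of neighboring words with a single space."""
    return {f"{first} {second}" for first, second in zip(words, words[1:])}

# Hypothetical token list, assumed to have stop words already removed.
toy_words = ["shares", "surge", "record", "high"]
toy_library = {"record high", "profit warning"}

bigrams = adjacent_bigrams(toy_words)
print(sorted(bigrams))
print(len(bigrams & toy_library))  # 1
```

Since stop words are removed before pairing, two words that were separated by a stop word in the original tweet become neighbors, which is what the bigram libraries expect.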

In [9]:
def get_stock_list(tweet, stock_indices, company_names=()):   # immutable default avoids the shared mutable-default pitfall
    """
    Input:
    tweet -> The raw tweet text in string
    stock_indices -> The list of stock indices in string
    company_names -> The list of company names corresponding to stock_indices
                     If not provided, company names will not be searched for in tweet.
    
    Output:
    stock_list -> A list of stock indices in stock_indices that are mentioned in tweet,
                  either by indices or by company names.
    """
    # To make this case-insensitive, make tweet all lowercase.
    tweet_processed = tweet.lower()
    
    # Initialize stock_list as an empty list.
    stock_list = []
    
    # For each stock index, make it lowercase then find if it appears in tweet_processed.
    # To avoid false positives (in finding a mention), we only consider indices preceded
    # by "$" or "#" and followed by " ".
    for i in range(len(stock_indices)):
        loc_dollar = tweet_processed.find("$" + stock_indices[i].lower() + " ")
        loc_hashtag = tweet_processed.find("#" + stock_indices[i].lower() + " ")
        if max(loc_dollar, loc_hashtag) >= 0:
            stock_list.append(stock_indices[i])
    
    # For each company name, if available, make it lowercase then find if it appears in tweet_processed.
    for i in range(len(company_names)):
        loc = tweet_processed.find(" " + company_names[i].lower() + " ")
        if loc >= 0:
            stock_list.append(stock_indices[i])
    
    return list(set(stock_list))
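
Because `get_stock_list` requires a trailing space, a symbol at the very end of a tweet or directly before punctuation is not detected. One possible alternative is a regular expression with a negative lookahead. The sketch below is an assumption about how that could look, not the method used in this notebook; `find_tickers` is a hypothetical name.

```python
import re

def find_tickers(tweet: str, tickers: list) -> list:
    """Return tickers that appear as $TICKER or #TICKER, case-insensitively."""
    found = []
    for ticker in tickers:
        # (?!\w) rejects a following letter, digit, or underscore, so "$AAPL"
        # does not match inside "$AAPLX" but does match at the end of the
        # tweet or before punctuation such as "," or "!".
        pattern = r"[$#]" + re.escape(ticker) + r"(?!\w)"
        if re.search(pattern, tweet, flags=re.IGNORECASE):
            found.append(ticker)
    return found

print(find_tickers("Long $tsla, short $AAPLX. Also #BRK.B",
                   ["TSLA", "AAPL", "BRK.B"]))
```

One trade-off of this pattern: if both `BRK` and `BRK.B` were in the list, `$BRK.B` would match both, since `.` is not a word character. Whether that matters depends on the symbol lists in use.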

### Appendix

The functions below are no longer used in the current version. We merely keep them in case we decide to revert the changes.

In [None]:
def key_lemmatize(keywords):
    """
    Input:
    keywords -> The list of key words/bigrams extracted from a library file.
    
    Output:
    keywords_lemm -> A list with elements from keywords, each converted into its lemma form using spaCy
    """
    # Define keywords_lemm
    keywords_lemm = []
    
    # Populate keywords_lemm by strings written from the lemma of the word.
    for i in range(len(keywords)):
        this_doc = nlp(keywords[i])
        this_word_lemm = ""
        for token in this_doc:
            this_word_lemm = this_word_lemm + token.lemma_ + " "   # Add space in case there are multiple words
        keywords_lemm.append(this_word_lemm[:-1])      # Discard the final space
    
    return keywords_lemm

In [None]:
def preprocess_tweet(tweet_doc):
    """
    Input:
    tweet_doc -> The raw tweet text in spaCy's doc type
    
    Output:
    processed_words -> A list of words in tweet_doc with stop words (do, is, not, you, etc) removed.
    """
    # Define and populate the list of all tokens in tweet_doc
    token_list = []
    for token in tweet_doc:
        token_list.append(token)
    
    # Define and write down the tweet without stop words
    tweet_cleaned = ""
    for token in token_list:
        if not token.is_stop:
            tweet_cleaned = tweet_cleaned + token.text + " "
    
    # Finally, convert to doc once again
    return nlp(tweet_cleaned)

In [None]:
def tweet_to_wordlocs(tweet, keys, remove_stop=True):
    """
    Input:
    tweet -> The raw tweet text in string
    keys -> The list of lists of key words/bigrams. For example, keys = [henry08_pos, henry08_neg, ..., newslib]
            Each key word/bigram is assumed to contain only English letter and space.
    remove_stop -> If True, remove stop words (do, is, not, you, etc) from tweet as a preprocessing step.
                   This option is recommended if the user would like to use the bigram lists by Hagenau et al (2013)
                   because they come with stop words removed, e.g. "able add" assumes that "to" has been removed.
    
    Output:
    wordlocs -> A list whose elements are of the form [keyloc, start, end], such that for the key word/bigram 
    from the i-th library, keys[i], located in tweet from tweet[j] to tweet[j+k-1] inclusive, we have 
    keyloc = i, start = j and end = j+k. The elements of wordlocs are sorted based on the value of start.
    """
    # Define the case-insensitive phrase matcher.
    matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
    
    # Convert tweet into a spaCy's doc object with stop words removed if called for.
    if remove_stop:
        tweet_text = preprocess_tweet(nlp(tweet))   # preprocess_tweet expects a spaCy doc, not a raw string
    else:
        tweet_text = nlp(tweet)
    
    # Tokenize the key phrases and introduce them to the phrase matcher model.
    for i in range(len(keys)):
        phrases = [nlp(phrase) for phrase in keys[i]]
        matcher.add(str(i), phrases)
    
    # Find the matches
    matches = matcher(tweet_text)
    
    # Define and populate wordlocs
    wordlocs = []
    for i in range(len(matches)):
        keyloc_shifted, start, end = matches[i]
        keyloc = int(nlp.vocab.strings[keyloc_shifted])
        wordlocs.append([keyloc, start, end])
    
    return wordlocs

In [None]:
def wordlocs_to_wordcounts(wordlocs, tweet_length, num_keys, normalize=True):
    """
    Input:
    wordlocs -> The list of keys and locations found in the tweet, c.f. tweet_to_wordlocs function.
    tweet_length -> The length of tweet in words.
    num_keys -> The number of phrase lists, i.e. len(keys) from tweet_to_wordlocs function.
    normalize -> If True, the word count for each keyword list is normalized by tweet_length.
                 If False, the raw word count will be returned.
    
    Output:
    wordcounts -> A list of length num_keys. Each element is the (normalized) word count corresponding
                  to the number of phrases from one of the phrase lists that appear in the tweet, 
                  as reported in wordlocs.
    """
    # Define and populate wordcounts
    wordcounts = [0 for i in range(num_keys)]
    for j in range(len(wordlocs)):
        wordcounts[wordlocs[j][0]] += 1
    
    # Perform normalization as needed
    if normalize:
        return [wordcounts[i]/tweet_length for i in range(num_keys)]
    
    # In the case where normalization is not called for
    return wordcounts