## Introduction

In this notebook, we attempt to predict whether trading the stock index mentioned in a tweet will yield a return greater than 1%, given the popularity, sentiment scores, and other features of the tweet. We assume that we enter a long position in the stock index at a fixed time, $t_1$, after the tweet, and that we sell the index at a later fixed time, $t_2$.
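
For concreteness, the prediction target can be sketched as a binary label derived from the percentage return. The sample rows printed later in this notebook are consistent with `return` being the percentage change from `buy_price` to `sell_price`. The helper `percent_return` and the column name `target` are hypothetical names used only for illustration.

```python
import pandas as pd

RETURN_THRESHOLD = 1.0  # percent; the 1% cutoff from the problem statement


def percent_return(buy_price: pd.Series, sell_price: pd.Series) -> pd.Series:
    """Percentage return of a long position opened at buy_price and closed at sell_price."""
    return 100.0 * (sell_price - buy_price) / buy_price


# Two (buy_price, sell_price) pairs copied from the sample output of the merged data set.
toy = pd.DataFrame({
    "buy_price": [349.007882, 74.42],
    "sell_price": [353.59224, 77.45],
})
toy["return"] = percent_return(toy["buy_price"], toy["sell_price"])

# Hypothetical label column: 1 if the trade returns more than 1%, else 0.
toy["target"] = (toy["return"] > RETURN_THRESHOLD).astype(int)
```

The strict inequality mirrors the phrase "greater than 1%"; a return of exactly 1% would be labeled 0 under this convention.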

We will use two Bayes-based classifiers in this notebook: the naive Bayes classifier and quadratic discriminant analysis (QDA).




## Naive Bayes Classifier

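Recall that the naive Bayes model assumes that the features are conditionally independent given the class: for each class $c$, the joint likelihood factorizes into the per-feature likelihoods $P(X_j|y=c)$. To stay close to this central assumption, we will only include features that are not strongly correlated with one another. Low correlation does not guarantee independence, so this is only a heuristic check.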
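
Written out, the conditional-independence assumption behind naive Bayes gives the following posterior. Here $m$ denotes the number of features, matching the notation used in the QDA section:

$$P(y=c \mid X_1,\ldots,X_m) \;\propto\; P(y=c)\prod_{j=1}^{m} P(X_j \mid y=c)$$

The predicted class is the $c$ that maximizes the right-hand side.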

We begin by loading the data and performing the train-test split.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_parquet("/Users/josht/Documents/tweet_stock_merged_data1.parquet")

In [4]:
df.sample(5)

index,time_stamp,entities_cashtags,entities_hashtags,entities_urls,like_count,quote_count,reply_count,retweet_count,text,entities_mentions,...,Company_ticker,time_of_day,morning,evening,night,buy_price,delta_buy,sell_price,delta_sell,return
2958,2021-03-31 08:59:03,0,0,1,11,0,0,8,The global shortage of semiconductors and wint...,0,...,COST,morning,0,1,0,349.007882,297.0,353.59224,424917.0,1.31354
3684,2021-09-10 08:48:48,0,0,1,13,0,0,6,As Mexico’s economy rebounds from its biggest ...,0,...,ARE,morning,0,1,0,201.69855,2532.0,199.471485,348132.0,-1.104155
24997,2021-06-01 17:10:34,0,0,2,1,0,0,0,Another key U.S. inflation gauge surges in Apr...,0,...,KEY,afternoon,1,0,0,23.14,58406.0,22.75,485666.0,-1.685393
5643,2021-03-31 06:47:02,0,0,1,350,23,31,112,Pfizer says Covid vaccine is 100% effective in...,0,...,PFE,morning,0,1,0,35.400956,58.0,35.400956,422038.0,0.0
2485,2020-07-28 16:46:38,0,0,1,51,8,8,9,AMD pops after it raises revenue forecast for ...,0,...,AMD,afternoon,1,0,0,74.42,22.0,77.45,472462.0,4.071486


In [5]:
df.columns

Index(['time_stamp', 'entities_cashtags', 'entities_hashtags', 'entities_urls',
       'like_count', 'quote_count', 'reply_count', 'retweet_count', 'text',
       'entities_mentions', 'created_at_user', 'followers_count',
       'following_count', 'listed_count', 'tweet_count', 'media_type',
       'Company_name', 'Word_count_News_agencies', 'Word_count_Henry08_pos',
       'Word_count_Henry08_neg', 'Word_count_LM11_pos', 'Word_count_LM11_neg',
       'Word_count_Hagenau13_pos', 'Word_count_Hagenau13_neg',
       'Tweet_Length_characters', 'Tweet_Length_words', 'Compound_vader',
       'Positive_vader', 'Negative_vader', 'Neutral_vader', 'Company_ticker',
       'time_of_day', 'morning', 'evening', 'night', 'buy_price', 'delta_buy',
       'sell_price', 'delta_sell', 'return'],
      dtype='object')

In [7]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size = 0.2, shuffle = True)

In [8]:
df.shape

(724283, 40)

In [9]:
df_train.shape

(579426, 40)

In [10]:
df_test.shape

(144857, 40)

These are the candidate features for our model. The last three (`morning`, `evening`, `night`) are indicator columns that result from one-hot encoding a categorical variable. We will drop the last one to avoid redundancy.
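
A small sketch of why one indicator column is redundant: if each row belongs to exactly one category, the one-hot indicators in that row sum to 1, so any one of them is determined by the others. The toy values below are invented for illustration; only the column names mirror the notebook.

```python
import pandas as pd

# Toy one-hot block (invented values), assuming each row belongs to exactly one category.
onehot = pd.DataFrame({
    "morning": [1, 0, 0, 1],
    "evening": [0, 1, 0, 0],
    "night":   [0, 0, 1, 0],
})

# The last indicator is fully determined by the other two.
reconstructed_night = 1 - onehot["morning"] - onehot["evening"]
night_is_redundant = bool((reconstructed_night == onehot["night"]).all())

# Dropping the last entry of a (shortened, illustrative) feature list.
example_features = ["like_count", "morning", "evening", "night"]
kept_features = example_features[:-1]
```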

In [6]:
potential_features = ['like_count', 'quote_count', 'reply_count', 'retweet_count', 'followers_count', 
                      'following_count', 'listed_count', 'tweet_count', 'Word_count_News_agencies', 
                      'Word_count_Henry08_pos', 'Word_count_Henry08_neg', 'Word_count_LM11_pos', 
                      'Word_count_LM11_neg', 'Word_count_Hagenau13_pos', 'Word_count_Hagenau13_neg', 
                      'Tweet_Length_characters', 'Tweet_Length_words', 'Compound_vader', 'Positive_vader', 
                      'Negative_vader', 'Neutral_vader', 'morning', 'evening', 'night']

Now, we perform the feature selection step for the quantitative features by computing the correlation matrix (on the training set).

In [19]:
corr_matrix_df = df_train[potential_features[:-3]].corr()

In [20]:
corr_matrix_df

feature,like_count,quote_count,reply_count,retweet_count,followers_count,following_count,listed_count,tweet_count,Word_count_News_agencies,Word_count_Henry08_pos,...,Word_count_LM11_pos,Word_count_LM11_neg,Word_count_Hagenau13_pos,Word_count_Hagenau13_neg,Tweet_Length_characters,Tweet_Length_words,Compound_vader,Positive_vader,Negative_vader,Neutral_vader
like_count,1.0,0.495638,0.502478,0.891131,0.016976,0.000108,0.017122,-0.038683,-0.005451,-0.002728,...,0.00288,0.02594,-0.001635,-0.00157,0.016688,0.031433,-0.013853,0.006404,0.025028,-0.023102
quote_count,0.495638,1.0,0.489429,0.561916,0.037296,-0.000925,0.038635,0.010126,-0.002593,-0.007294,...,-0.005972,0.015188,-0.004874,-0.004131,0.008671,0.007655,-0.012275,-0.003795,0.013255,-0.006421
reply_count,0.502478,0.489429,1.0,0.410516,0.019404,-0.008374,0.021807,-0.044171,-0.008027,0.000312,...,0.00535,0.03499,-0.002164,-0.000158,0.041206,0.059907,-0.020078,0.005187,0.036431,-0.030284
retweet_count,0.891131,0.561916,0.410516,1.0,0.041915,0.000963,0.043701,-0.008367,-0.00527,-0.005535,...,-0.006157,0.040136,-0.006272,-0.005834,0.026676,0.028313,-0.030214,-0.008275,0.032422,-0.016513
followers_count,0.016976,0.037296,0.019404,0.041915,1.0,-0.017614,0.950704,0.511766,-0.032219,-0.053775,...,-0.055389,0.027471,-0.065368,-0.049618,-0.004304,-0.145654,-0.053585,-0.032438,0.037051,-0.000269
following_count,0.000108,-0.000925,-0.008374,0.000963,-0.017614,1.0,-0.035016,-0.123873,0.157147,-0.019587,...,-0.013278,0.008763,-0.017772,-0.018363,-0.015399,-0.000819,-0.011607,-0.023603,-0.0095,0.025912
listed_count,0.017122,0.038635,0.021807,0.043701,0.950704,-0.035016,1.0,0.555819,-0.037679,-0.052634,...,-0.060241,0.04225,-0.065781,-0.040962,-0.007792,-0.139311,-0.063256,-0.034747,0.051558,-0.008786
tweet_count,-0.038683,0.010126,-0.044171,-0.008367,0.511766,-0.123873,0.555819,1.0,-0.071893,-0.073595,...,-0.087251,-0.052534,-0.071131,-0.060577,-0.167681,-0.238944,-0.05401,-0.069261,-0.008499,0.062152
Word_count_News_agencies,-0.005451,-0.002593,-0.008027,-0.00527,-0.032219,0.157147,-0.037679,-0.071893,1.0,0.073828,...,0.029116,0.008959,0.033386,-0.005771,0.067424,0.09036,0.01808,-0.013756,-0.030258,0.032808
Word_count_Henry08_pos,-0.002728,-0.007294,0.000312,-0.005535,-0.053775,-0.019587,-0.052634,-0.073595,0.073828,1.0,...,0.43931,0.015355,0.25748,0.124122,0.177828,0.178001,0.211411,0.233783,-0.040444,-0.160277


In [21]:
corr_matrix = np.array(corr_matrix_df)

In [22]:
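corr_cutoff = 0.2

# Print every pair of quantitative features whose absolute correlation exceeds the cutoff.
feature_names = list(corr_matrix_df.columns)
n_features = len(feature_names)  # 21 quantitative features

for i in range(n_features):
    for j in range(i + 1, n_features):
        if abs(corr_matrix[i, j]) > corr_cutoff:
            print("(", feature_names[i], ",", feature_names[j], "):", corr_matrix[i, j])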

( like_count , quote_count ): 0.4956381363991591
( like_count , reply_count ): 0.5024777725205849
( like_count , retweet_count ): 0.8911314947551016
( quote_count , reply_count ): 0.4894287489819968
( quote_count , retweet_count ): 0.5619161421793059
( reply_count , retweet_count ): 0.4105164848734175
( followers_count , listed_count ): 0.950703868913496
( followers_count , tweet_count ): 0.5117657685029273
( listed_count , tweet_count ): 0.5558185618374456
( tweet_count , Tweet_Length_words ): -0.23894393772741623
( Word_count_Henry08_pos , Word_count_LM11_pos ): 0.439310412392636
( Word_count_Henry08_pos , Word_count_Hagenau13_pos ): 0.2574804250792026
( Word_count_Henry08_pos , Compound_vader ): 0.21141141443012385
( Word_count_Henry08_pos , Positive_vader ): 0.23378296051735503
( Word_count_Henry08_neg , Word_count_LM11_neg ): 0.23789377482671648
( Word_count_Henry08_neg , Negative_vader ): 0.24751615735319674
( Word_count_LM11_pos , Compound_vader ): 0.3579687708574853
( Word_coun
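
One possible way to turn a list of correlated pairs like this into a feature subset is a greedy filter: walk through the features in order and keep a feature only if its absolute correlation with every feature already kept is at most the cutoff. This is a sketch under that assumption, not necessarily the selection rule used in this notebook; the function name `select_uncorrelated` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def select_uncorrelated(corr: pd.DataFrame, cutoff: float) -> list:
    """Greedily keep features whose |correlation| with all kept features is <= cutoff."""
    kept = []
    for name in corr.columns:
        if all(abs(corr.loc[name, other]) <= cutoff for other in kept):
            kept.append(name)
    return kept


# Toy data: b is nearly a copy of a, while c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=1000),
    "c": rng.normal(size=1000),
})
selected = select_uncorrelated(toy.corr(), cutoff=0.2)
```

The result depends on the column order, since earlier features are favored; reordering the columns (for example, by expected usefulness) would change which member of a correlated group survives.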

## Quadratic Discriminant Analysis

QDA assumes that, for each class $c$, the features jointly follow a multivariate normal distribution: $(X_1,\ldots,X_m) \mid y=c \sim \mathcal{N}(\mu_c, \Sigma_c)$, with a class-specific mean $\mu_c$ and covariance matrix $\Sigma_c$. To approximate this assumption more closely, we will take the logarithm of some of the features whose distributions are strongly right-skewed.
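
A minimal sketch of this preprocessing, assuming `np.log1p` (that is, $\log(1+x)$) is used so that zero counts remain defined. The notebook text does not state which features are transformed or which variant of the logarithm is applied, so the synthetic data and variable names below are illustrative only.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)
n = 500

# Synthetic right-skewed, count-like features (stand-ins for columns such as like counts).
X0 = rng.lognormal(mean=1.0, sigma=1.0, size=(n, 2))  # class 0
X1 = rng.lognormal(mean=2.0, sigma=0.5, size=(n, 2))  # class 1
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

# log1p compresses the long right tail, bringing each class closer to a Gaussian shape.
X_log = np.log1p(X)

# QDA fits one mean vector and one covariance matrix per class.
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_log, y)
train_accuracy = qda.score(X_log, y)
```

On real data the transform would be applied to the training and test sets alike, and accuracy would be reported on the held-out test set rather than on the training data as in this toy example.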