<center> <h3>Sentiment Analysis on Stock Tweets</h3> </center>
<center><h4>Sina Soltanieh, Byron Phan, Nadezhda Shiroglazova</h4></center>

<hr style="height:2px; border:none; color:black; background-color:black;">

#### Executive Summary:

This project focuses on predicting the sentiment of tweets about stock performance, motivated by the growth in social media activity and price volatility since the COVID-19 pandemic began. We work with two datasets: one of tweets, which are used in our machine learning models, and one of public stock financials, which we wrangle and prepare for future machine learning work. We classify the tweets into three classes: positive, negative, and neutral. To do so, we try three machine learning models: Linear SVM, Logistic Regression, and Multinomial Naive Bayes. We find that Logistic Regression and Linear SVM are best suited for this application, with Linear SVM slightly favored when selecting the best 100 features. Our results demonstrate that the sentiment of stock tweets can be classified by these models, albeit with significant room for improvement.

<hr style="height:2px; border:none; color:black; background-color:black;">

## Outline
1. <a href='#1'>INTRODUCTION</a>
2. <a href='#2'>METHOD</a>
3. <a href='#3'>RESULTS</a>
4. <a href='#4'>DISCUSSION</a>

<a id="1"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

## 1. INTRODUCTION

<h4>Problem Statement</h4>

This project is focused on predicting the sentiment of tweets regarding stock performance. We believe that the ability to analyze the sentiment of stock tweets can help predict the future performance of the underlying securities. By also analyzing financial data it may even be possible to predict a target closing price.

This project only covers the analysis of stock tweets with machine learning techniques, but we also demonstrate some preliminary steps for preparing financial data for ML analysis.

<h4>Significance of the Problem</h4>

The rise of “meme stocks” such as GameStop, which gained huge social media traction and saw astronomical gains and losses within mere hours, raises the question of how much influence social media has on stock prices. Thus, this project will focus on collecting and analyzing tweets about public stocks and use a predictive model to estimate the closing price of these stocks on future days.

Previous work has been done in this field. Kordonis et al. analyzed how stock prices are affected by tweet sentiment. They used ML algorithms including SVM and Naive Bayes and achieved an accuracy of 87% with closing price prediction errors under 10%.

Kordonis, J., Symeonidis, S., & Arampatzis, A. (2016). Stock Price Forecasting via Sentiment Analysis on Twitter. In Proceedings of the 20th Pan-Hellenic Conference on Informatics. PCI ’16: 20th Pan-Hellenic Conference on Informatics. ACM. https://doi.org/10.1145/3003733.3003787

<h4>Questions</h4>

* What ML model would we use to minimize the classifier accuracy difference between testing and training data to increase generalization of the model?
* Are there any financial data that correlate with each other (e.g. High might correlate with HP%)?
* How can we improve the performance of sentiment classification specifically for tweets?

These questions largely depend on the type of tweets collected. After analyzing classifier accuracy, we can adjust our tweet collection methods to potentially increase accuracy and reduce overfitting:

* The `lang='en'` parameter may help improve model accuracy, but it restricts unclassified tweets.
* We will update to the full-archive Twitter API (only 250 requests/mo allowed) once we verify that the classifier works.
* We can filter out tweets with more than a set threshold of tickers ($) to get more accurate sentiment on a specific public company.
* We can use an emoji-supported sentiment lexicon for the initial polarity score label for our sentiment classifier.

<a id="2"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

## 2. METHOD

### 2.1. Data Acquisition

Two datasets are used: a financial dataset, which includes 5 years of stock data and derived financial metrics for various companies (~1500 rows), and a dataset of tweets (~1000 tweets for now). The tweets dataset variables are date, tweet, and sentiment score.

For the financial dataset, the data is quantitative and recorded for each trading day. The variables are as follows: 
- Date - trading day date in YYYY-MM-DD format
- Close - stock price at the end of the trading day
- Open - stock price at the beginning of the trading day
- High - stock's highest price during the trading day
- Low - stock's lowest price during the trading day
- Volume - amount of stock traded
- Change % - percent change in price from open to close
- Dividends - dividends paid on trading day
- Stock Splits - ratio of shares obtained to previous during stock split event
- HL % - percent difference between the high and low price, relative to the low
- HPR - return from previous day including profit (dividends)
- Market Capitalization - total market value of equity
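
The derived metrics can be checked by hand. The sketch below recomputes Change %, HL %, and HPR in plain Python for a single AAPL trading day (2016-12-12), using prices that appear in the stock data later in this notebook. The function and variable names are illustrative and are not part of the project code.

```python
# Worked example of the derived metrics for one trading day.
# Prices are AAPL values for 2016-12-12 (previous close from 2016-12-09).
def change_pct(open_price, close_price):
    """Percent change from open to close."""
    return (close_price - open_price) / open_price * 100

def hl_pct(high, low):
    """Percent spread between the day's high and low, relative to the low."""
    return (high - low) / low * 100

def hpr(close_price, prev_close, dividends=0.0):
    """Holding period return versus the previous close, including dividends."""
    return (close_price - prev_close + dividends) / prev_close * 100

day = {"Open": 28.32, "High": 28.75, "Low": 28.12, "Close": 28.33, "Dividends": 0.0}
prev_close = 28.49

metrics = {
    "Change %": round(change_pct(day["Open"], day["Close"]), 2),
    "HL %": round(hl_pct(day["High"], day["Low"]), 2),
    "HPR": round(hpr(day["Close"], prev_close, day["Dividends"]), 2),
}
print(metrics)
```

Note that HPR compares against the previous trading day's close for the same ticker, so the first day in a series has no HPR value.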

### 2.2. Data Analysis

The target for the tweets dataset is a polarity score, Sentiment Score, provided by an external sentiment lexicon and corpus (AFINN).

To analyze our data, we are tackling classification. The sentiment analysis classifier will vectorize tweet text and predict a sentiment class: positive, neutral, or negative.

For the sentiment classifier, the following algorithms will be utilized: LinearSVC, MultinomialNB, and LogisticRegression. We chose to focus on these algorithms (other than DecisionTreeRegressor) because the following source mentions that they are the most effective for problems involving text sentiment: A. Pak and P. Paroubek. Twitter as a Corpus for Sentiment Analysis and Opinion Mining. LREC, pages 1320–1326, 2010. For Random Forest, a type of decision tree algorithm, see https://link.springer.com/article/10.1007/s10796-021-10135-7.

- LinearSVC - Support Vector Classification: works by using a linear kernel to find the best fit hyperplane (maximized margin) which splits data into classes
- MultinomialNB - Naive Bayes: Bayesian classification algorithm suited for discrete features
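- LogisticRegression: uses a logistic (sigmoid) model to estimate class probabilities; scikit-learn's implementation also handles more than two classes, as needed for our three sentiment classes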

For the sentiment classifier we are utilizing tf-idf with n-grams to vectorize the tweets into a matrix of many adjacent word-combination features. N-grams help cover combinations of words that may have different meanings and sentiment together than apart (e.g. “bad not good” vs “good not bad”). We will also remove stopwords to combat overfitting. We are planning on normalizing stock price data using a standard scaler, because stock prices are based on shares outstanding and aren’t reflective of the market capitalization (true market value of equity). 
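
To illustrate why n-grams matter, the following minimal sketch vectorizes the two example phrases with and without bigrams. It assumes scikit-learn's default tokenizer and no stopword removal; the variable names are ours and the two-document corpus is only for demonstration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["bad not good", "good not bad"]

# Unigrams only: both documents contain the same three words,
# so their tf-idf vectors are identical.
unigram_vectors = TfidfVectorizer(ngram_range=(1, 1)).fit_transform(docs).toarray()
same_with_unigrams = np.allclose(unigram_vectors[0], unigram_vectors[1])

# Unigrams + bigrams: word order now produces different features
# ("not good" appears only in the first document, "not bad" only in the second).
bigram_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
bigram_vectors = bigram_vectorizer.fit_transform(docs).toarray()
same_with_bigrams = np.allclose(bigram_vectors[0], bigram_vectors[1])

print(sorted(bigram_vectorizer.vocabulary_))
print(same_with_unigrams, same_with_bigrams)
```

With unigrams alone the classifier cannot tell the two phrases apart; adding bigrams gives it the features it needs to do so.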

Plotting the price data for a stock is a quick way to see the price trend for a particular ticker. Additionally, a scatter matrix can be used to visualize correlations between features, which also helps identify features that break the conditional independence assumption of Naive Bayes classification algorithms. In addition to these two visualizations, we include a word cloud, built with the stylecloud library, to show the most common words in our tweet corpus. It also shows that restricting multiple tickers within tweets would be a good idea, as we don’t want the sentiment about one company to be masked by mentions of many other companies.

<a id="3"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

## 3. RESULTS

In [None]:
import pandas as pd
import yfinance as yf
import re
import plotly.express as px
import stylecloud
from afinn import Afinn
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

### 3.1. Data Wrangling

In [None]:
TWEETS_URL = 'https://raw.githubusercontent.com/sisolta/ds3000/main/initial_tweets.csv'
tweets = pd.read_csv(TWEETS_URL, index_col=0)
tweets.head()

Unnamed: 0,Date,Original Text,Clean Text,Important Text,Sentiment Score
0,2021-11-14 09:23:49+00:00,most profitable crypto group\n\n$RCAT $NAKD $C...,most profitable crypto group\n\n$rcat $nakd $c...,profitable crypto group rcat nakd cei fami bbi...,2.0
1,2021-11-14 09:17:13+00:00,5 Advanced Secrets Every Options Trader Should...,5 advanced secrets every options trader should...,5 advanced secrets every options trader know s...,1.0
2,2021-11-14 09:06:40+00:00,✅Stocks \n✅Options \n✅Day trading \n\n$CLDR $N...,✅stocks \n✅options \n✅day trading \n\n$cldr $n...,✅stocks ✅options ✅day trading cldr nok abev zn...,2.0
3,2021-11-14 09:00:51+00:00,$NVDA　圧倒的。\n\nFB復活するかな。\n\n$AAPL\n$AMZN\n$FB\n...,$nvda　圧倒的。\n\nfb復活するかな。\n\n$aapl\n$amzn\n$fb\n...,nvda 圧倒的。 fb復活するかな。 aapl amzn fb goog msft nvda,0.0
4,2021-11-14 09:00:37+00:00,"$ATOM call, its a key level, trade safe..... \...","$atom call, its a key level, trade safe..... \...","atom call , key level , trade safe ..... soul ...",1.0


In [None]:
STOCKS_URL = 'https://raw.githubusercontent.com/sisolta/ds3000/main/stocks_historical.csv'
# two header rows: ticker (level 0) and price field (level 1)
stocks_historical = pd.read_csv(STOCKS_URL, index_col=0, header=[0,1])
print(stocks_historical.head())
aapl = stocks_historical['AAPL']
aapl.head()

              GOOG                                                       \
              Open    High     Low   Close Adj Close   Volume Dividends   
Date                                                                      
2016-12-09  780.00  789.43  779.02  789.29    789.29  1821900         0   
2016-12-12  785.04  791.25  784.35  789.27    789.27  2104100         0   
2016-12-13  793.90  804.38  793.34  796.10    796.10  2145200         0   
2016-12-14  797.40  804.00  794.01  797.07    797.07  1704200         0   
2016-12-15  797.34  803.00  792.92  797.85    797.85  1626500         0   

                           AMZN          ...      TSLA                AAPL  \
           Stock Splits    Open    High  ... Dividends Stock Splits   Open   
Date                                     ...                                 
2016-12-09            0  770.00  770.25  ...         0          0.0  28.08   
2016-12-12            0  766.40  766.89  ...         0          0.0  28.32   
2016-12-1

Unnamed: 0_level_0,Open,High,Low,Close,Adj Close,Volume,Dividends,Stock Splits
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2016-12-09,28.08,28.67,28.08,28.49,26.81,137610400,0.0,0.0
2016-12-12,28.32,28.75,28.12,28.33,26.66,105497600,0.0,0.0
2016-12-13,28.46,28.98,28.44,28.8,27.11,174935200,0.0,0.0
2016-12-14,28.76,29.05,28.75,28.8,27.11,136127200,0.0,0.0
2016-12-15,28.84,29.18,28.81,28.95,27.25,186098000,0.0,0.0


#### Data Preprocessing

We preprocess the tweets to remove potential disruptors to our sentiment analysis model: mentions, links, punctuation, and the hashtag symbol. We keep the hashtag text itself, as hashtags usually contain important identifiers relating to the overall tweet sentiment (this was done in our dataset generation code).

In [None]:
# regex for pattern matching url
# source: https://github.com/Traumatizn/RegEx/blob/main/Python/Url_Pattern.md
url_pattern = r'((?:(?<=[^a-zA-Z0-9]){0,}(?:(?:https?\:\/\/){0,1}(?:[a-zA-Z0-9\%]{1,}\:[a-zA-Z0-9\%]{1,}[@]){,1})(?:(?:\w{1,}\.{1}){1,5}(?:(?:[a-zA-Z]){1,})|(?:[a-zA-Z]{1,}\/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\:[0-9]{1,4}){1})){1}(?:(?:(?:\/{0,1}(?:[a-zA-Z0-9\-\_\=\-]){1,})*)(?:[?][a-zA-Z0-9\=\%\&\_\-]{1,}){0,1})(?:\.(?:[a-zA-Z0-9]){0,}){0,1})'

# preprocesses tweets
def preprocess_tweet(tweet):
    tweet_all_lower = tweet.lower()
    tweet_no_mentions = re.sub(r'@[A-Za-z0-9_]+', '', tweet_all_lower)
    tweet_no_links = re.sub(url_pattern, '', tweet_no_mentions)
    tweet_no_hashtag_symbol = re.sub(r'#', '', tweet_no_links)
    return tweet_no_hashtag_symbol

# 'Date' stays a regular column; only the cleaned text column is (re)computed
tweets['Clean Text'] = tweets['Original Text'].apply(preprocess_tweet)
tweets.head()

Unnamed: 0,Date,Original Text,Clean Text,Important Text,Sentiment Score
0,2021-11-14 09:23:49+00:00,most profitable crypto group\n\n$RCAT $NAKD $C...,most profitable crypto group\n\n$rcat $nakd $c...,profitable crypto group rcat nakd cei fami bbi...,2.0
1,2021-11-14 09:17:13+00:00,5 Advanced Secrets Every Options Trader Should...,5 advanced secrets every options trader should...,5 advanced secrets every options trader know s...,1.0
2,2021-11-14 09:06:40+00:00,✅Stocks \n✅Options \n✅Day trading \n\n$CLDR $N...,✅stocks \n✅options \n✅day trading \n\n$cldr $n...,✅stocks ✅options ✅day trading cldr nok abev zn...,2.0
3,2021-11-14 09:00:51+00:00,$NVDA　圧倒的。\n\nFB復活するかな。\n\n$AAPL\n$AMZN\n$FB\n...,$nvda　圧倒的。\n\nfb復活するかな。\n\n$aapl\n$amzn\n$fb\n...,nvda 圧倒的。 fb復活するかな。 aapl amzn fb goog msft nvda,0.0
4,2021-11-14 09:00:37+00:00,"$ATOM call, its a key level, trade safe..... \...","$atom call, its a key level, trade safe..... \...","atom call , key level , trade safe ..... soul ...",1.0
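
As a quick illustration of the cleaning steps, here is a minimal standalone sketch applied to a made-up tweet. It assumes a much simpler URL pattern than the one used above (it only matches links starting with `http://` or `https://`) and adds a whitespace-collapsing step that the project code does not have; `preprocess_tweet_simple` is a hypothetical name.

```python
import re

MENTION = re.compile(r'@[A-Za-z0-9_]+')
# Assumption: simplified stand-in for the longer URL pattern used in the notebook.
SIMPLE_URL = re.compile(r'https?://\S+')

def preprocess_tweet_simple(tweet):
    """Lowercase, then strip mentions, links, and '#' symbols (hashtag text is kept)."""
    text = tweet.lower()
    text = MENTION.sub('', text)
    text = SIMPLE_URL.sub('', text)
    text = text.replace('#', '')
    # Extra step for readability: collapse the whitespace left behind by removals.
    return ' '.join(text.split())

example = "@Trader Buy $AAPL now! #Bullish https://t.co/xyz"
cleaned = preprocess_tweet_simple(example)
print(cleaned)
```

The ticker sign and the hashtag word survive this step; the ticker sign is only removed later, during tokenization.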


We preprocess the stock data because the stock price alone isn't indicative of firm value, and normalizing these values will help generalize the model to other stock tickers. StandardScaler standardizes each column by subtracting the column mean and dividing by the column standard deviation, so each column ends up with zero mean and unit variance.

In [None]:
# normalize stock data
def normalize_data(df):
    return pd.DataFrame(StandardScaler().fit_transform(df), index=df.index, columns=df.columns)

aapl_normalized = normalize_data(aapl)
aapl_normalized.head()

Unnamed: 0_level_0,Open,High,Low,Close,Adj Close,Volume,Dividends,Stock Splits
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2016-12-09,-1.138828,-1.130382,-1.133433,-1.129279,-1.127806,0.294495,-0.126194,-0.028194
2016-12-12,-1.132731,-1.128374,-1.132405,-1.133341,-1.131568,-0.277693,-0.126194,-0.028194
2016-12-13,-1.129175,-1.122601,-1.124177,-1.121409,-1.120281,0.959551,-0.126194,-0.028194
2016-12-14,-1.121555,-1.120844,-1.116207,-1.121409,-1.120281,0.268067,-0.126194,-0.028194
2016-12-15,-1.119523,-1.117582,-1.114664,-1.117602,-1.11677,1.15845,-0.126194,-0.028194
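
The standardization itself is simple to reproduce. The sketch below z-scores the first five AAPL closing prices shown above using only the standard library; `standardize` is an illustrative helper, and it uses the population standard deviation, which is what StandardScaler uses as well.

```python
import math
import statistics

def standardize(values):
    """Z-score a list of numbers using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

closes = [28.49, 28.33, 28.80, 28.80, 28.95]  # first five AAPL closes shown above
z = standardize(closes)
print([round(v, 3) for v in z])
```

The numbers differ from the notebook output because the notebook standardizes over the full five-year column, not just these five rows.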


#### Feature Extraction

We perform feature extraction on the tweets by cleaning each tweet and defining a target for supervised learning. Here, we remove stopwords and ticker signs and use an external lexicon (AFINN) to score the tweet's sentiment. This was done in the dataset generation Jupyter notebook.

In [None]:
# clean tweet and score it using external lexicon for target in supervised learning
afinn = Afinn()
# build the stopword set once instead of reloading the list for every token
english_stopwords = set(stopwords.words('english'))

def filter_and_score(clean_tweet):
    # tokenize and drop stopwords and the ticker sign
    tweet_tokens = [word for word in word_tokenize(clean_tweet)
                    if word not in english_stopwords and word != '$']
    important_text = ' '.join(tweet_tokens)
    # score tweet - sentiment score b/w -6 and +6 based on positivity of tweet
    # (mapped later to -1 = negative, 0 = neutral, 1 = positive)
    # assumption: the score is computed on the cleaned tweet text
    score = afinn.score(clean_tweet)
    return important_text, score

# done in dataset generation notebook, nltk modules must be installed (large)
# tweets['Important Text'], tweets['Sentiment Score'] = zip(*tweets['Clean Text'].map(filter_and_score))
tweets.head()

Unnamed: 0,Date,Original Text,Clean Text,Important Text,Sentiment Score
0,2021-11-14 09:23:49+00:00,most profitable crypto group\n\n$RCAT $NAKD $C...,most profitable crypto group\n\n$rcat $nakd $c...,profitable crypto group rcat nakd cei fami bbi...,2.0
1,2021-11-14 09:17:13+00:00,5 Advanced Secrets Every Options Trader Should...,5 advanced secrets every options trader should...,5 advanced secrets every options trader know s...,1.0
2,2021-11-14 09:06:40+00:00,✅Stocks \n✅Options \n✅Day trading \n\n$CLDR $N...,✅stocks \n✅options \n✅day trading \n\n$cldr $n...,✅stocks ✅options ✅day trading cldr nok abev zn...,2.0
3,2021-11-14 09:00:51+00:00,$NVDA　圧倒的。\n\nFB復活するかな。\n\n$AAPL\n$AMZN\n$FB\n...,$nvda　圧倒的。\n\nfb復活するかな。\n\n$aapl\n$amzn\n$fb\n...,nvda 圧倒的。 fb復活するかな。 aapl amzn fb goog msft nvda,0.0
4,2021-11-14 09:00:37+00:00,"$ATOM call, its a key level, trade safe..... \...","$atom call, its a key level, trade safe..... \...","atom call , key level , trade safe ..... soul ...",1.0
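
For readers without the NLTK and AFINN packages installed, the idea can be sketched with a toy lexicon. Everything below is hypothetical: the word valences and the stopword list are made up for illustration, and the real AFINN word list is far larger. The sketch only shows the general pattern of filtering tokens and summing word valences.

```python
# Hypothetical mini-lexicon; these valences are made up for illustration.
TOY_LEXICON = {"profitable": 2, "safe": 1, "crash": -2, "loss": -3}
TOY_STOPWORDS = {"the", "a", "is", "its", "and", "to"}

def filter_tokens(text):
    """Lowercase, split on whitespace, and drop stopwords and bare '$' tokens."""
    return [tok for tok in text.lower().split()
            if tok not in TOY_STOPWORDS and tok != "$"]

def lexicon_score(tokens, lexicon):
    """Sum the valence of every token found in the lexicon (0 for unknown words)."""
    return sum(lexicon.get(tok, 0) for tok in tokens)

tokens = filter_tokens("The trade is safe and profitable")
score = lexicon_score(tokens, TOY_LEXICON)
print(tokens, score)
```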


We perform feature extraction on the stock data to create new columns representing important financial metrics, namely percent change, holding period return, high-low percentage, and market capitalization.

In [None]:
# add features to stock data
# move the ticker level into the row index: rows become (Date, ticker)
stocks_historical = stocks_historical.stack(level=0)

# add key metrics as additional features
# percent change
stocks_historical['Change %'] = (stocks_historical['Close'] - stocks_historical['Open']) / stocks_historical['Open'] * 100
# holding period return: take the previous close within each ticker (index level 1),
# since a plain shift(1) on the stacked frame would read another ticker's row
prev_close = stocks_historical.groupby(level=1)['Close'].shift(1)
stocks_historical['HPR'] = (stocks_historical['Close'] - prev_close + stocks_historical['Dividends']) / prev_close * 100
# high low percentage
stocks_historical['HL %'] = (stocks_historical['High'] - stocks_historical['Low']) / stocks_historical['Low'] * 100

# restructure index so tickers appear together as first column level
stocks_historical = stocks_historical.unstack().swaplevel(0, 1, axis=1)
# drop first row - doesn't have calculated HPR
stocks_historical.drop(index=stocks_historical.index[0], axis=0, inplace=True)

# brute force market capitalization: current shares outstanding (in billions)
# times each day's close, so historical values are approximations
def get_current_shares(ticker):
    return yf.Ticker(ticker).info.get('sharesOutstanding', 0) / 1e9

for ticker in stocks_historical.columns.unique(level=0).values:
    stocks_historical[ticker, 'Mkt Cap ($B)'] = get_current_shares(ticker) * stocks_historical[ticker, 'Close']
stocks_historical = stocks_historical.sort_index(axis=1).round(2)
stocks_historical.head()

Unnamed: 0_level_0,AAPL,AAPL,AAPL,AAPL,AAPL,AAPL,AAPL,AAPL,AAPL,AAPL,...,TSLA,TSLA,TSLA,TSLA,TSLA,TSLA,TSLA,TSLA,TSLA,TSLA
Unnamed: 0_level_1,Adj Close,Change %,Close,Dividends,HL %,HPR,High,Low,Mkt Cap ($B),Open,...,Close,Dividends,HL %,HPR,High,Low,Mkt Cap ($B),Open,Stock Splits,Volume
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
2016-12-12,26.66,0.04,28.33,0.0,2.24,-26.3,28.75,28.12,464.79,28.32,...,38.49,0.0,1.67,-95.12,38.88,38.24,38.65,38.56,0.0,12194500
2016-12-13,27.11,1.19,28.8,0.0,1.9,-25.18,28.98,28.44,472.5,28.46,...,39.63,0.0,4.3,-95.02,40.26,38.6,39.8,38.64,0.0,34119500
2016-12-14,27.11,0.14,28.8,0.0,1.04,-27.33,29.05,28.75,472.5,28.76,...,39.74,0.0,3.18,-95.01,40.6,39.35,39.91,39.75,0.0,20754500
2016-12-15,27.25,0.38,28.95,0.0,1.28,-27.15,29.18,28.81,474.97,28.84,...,39.52,0.0,1.7,-95.05,40.15,39.48,39.69,39.68,0.0,16098000
2016-12-16,27.29,-0.45,28.99,0.0,0.73,-26.64,29.12,28.91,475.62,29.12,...,40.5,0.0,2.53,-94.88,40.52,39.52,40.67,39.62,0.0,18984500
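
Because the metrics are computed on a frame stacked by date and ticker, a previous-close lookup has to stay within one ticker to give a meaningful holding period return. The toy example below (made-up prices and ticker names) contrasts a plain `shift(1)` with a grouped shift; it is a sketch of the idea, not the project's data.

```python
import pandas as pd

# Toy stacked frame: one row per (Date, Ticker); prices are made up.
index = pd.MultiIndex.from_product(
    [["2021-01-04", "2021-01-05"], ["AAA", "BBB"]], names=["Date", "Ticker"]
)
stacked = pd.DataFrame({"Close": [10.0, 100.0, 11.0, 90.0]}, index=index)

# Plain shift: the "previous" row can belong to a different ticker.
naive_prev = stacked["Close"].shift(1)

# Grouped shift: the previous close is looked up within each ticker.
grouped_prev = stacked.groupby(level="Ticker")["Close"].shift(1)

stacked["HPR"] = (stacked["Close"] - grouped_prev) / grouped_prev * 100
print(stacked)
```

In the toy data, the plain shift pairs AAA's second-day close with BBB's first-day close, while the grouped shift pairs it with AAA's own first-day close.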


#### Data Wrangling: Extract Features & Target

We are extracting the features and target for our tweet sentiment classifier.

In [None]:
# create features and target for text dataframe
def features_and_target(df, features, target, target_fx):
    # use the dataframe passed in rather than the global tweets frame
    features = df[features]
    target = df[target].apply(target_fx)
    return features, target

features, target = features_and_target(tweets, 'Important Text', 'Sentiment Score', lambda x: 0 if x == 0 else abs(x) / x)
features.head()
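
The target transformation reduces each lexicon score to its sign. A more explicit version of the same mapping, written as a standalone sketch (`score_to_class` is an illustrative name), looks like this:

```python
def score_to_class(score):
    """Map a lexicon score to -1 (negative), 0 (neutral), or 1 (positive)."""
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0

scores = [2.0, 1.0, 0.0, -3.0]
classes = [score_to_class(s) for s in scores]
print(classes)
```

This gives the same classes as the lambda above for any non-missing score, and it avoids the division.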

### 3.2. Data Exploration

#### Line Chart

In [None]:
fig = px.line(stocks_historical['AAPL'].reset_index(), x='Date', y='Close', title="AAPL Historical Closing Price")
fig.show()

<a href="https://ibb.co/6BjKqY6"><img src="https://i.ibb.co/2NGzpqX/Screen-Shot-2021-12-09-at-4-00-47-PM.png" alt="Screen-Shot-2021-12-09-at-4-00-47-PM" border="0"></a>

*Figure I: Line chart of financial dataset which displays the historical close price of Apple Inc. (AAPL) over 5 years.*

*Variables: Date, Close*

*The line chart shows a gradual increase in price, and much higher growth rates and volatilities when COVID hit.*

#### Scatter Matrix

In [None]:

fig = px.scatter_matrix(stocks_historical['AAPL'],
    dimensions=["High", 'HPR', 'HL %', 'Mkt Cap ($B)'])
# apply the layout before rendering so the size settings take effect
fig.update_layout(
    width=1600,
    height=1600,
    hovermode='closest',
)
fig.show()

<img src="https://i.ibb.co/4TH4S7V/image2.png" alt="image2" border="0">

*Figure II: part of a scatter matrix visualizing correlations between features in the financial dataset.*

*Variables: High, HPR, HL %, Mkt Cap ($B)*

*Here, we can identify features that break the conditional independence assumption of Naive Bayes classification algorithms: the scatterplot of such a feature against another feature will look linear.*
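
A linear pattern in the scatter matrix corresponds to a Pearson correlation close to 1 or -1. As a minimal sketch, the code below computes the correlation between closing prices and a market cap derived from them. The share count is a hypothetical constant; because market cap in this notebook is a constant share count times Close, the two are perfectly linearly related by construction.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

closes = [28.33, 28.80, 28.80, 28.95, 28.99]
shares_in_billions = 16.4  # hypothetical constant share count
mkt_cap = [shares_in_billions * c for c in closes]

r = pearson(closes, mkt_cap)
print(round(r, 6))
```

A correlation this strong says more about how the feature was built than about the market, which is worth keeping in mind when reading the scatter matrix.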

#### Wordcloud

In [None]:
stylecloud.gen_stylecloud(text=' '.join(list(tweets['Important Text'])), max_words=100, output_name="vis_tweets_wc.png")

<img src="https://i.ibb.co/pfdYnwg/vis-tweets-wc.png" alt="vis-tweets-wc" border="0">

*Figure III: wordcloud for current tweet corpus for ‘\\$AAPL’ search.*

*Variables: Important Text*

*The wordcloud shows many other stock tickers. This suggests excluding tweets that list several tickers, such as ‘\\$AAPL \\$TSLA \\$SPY’, to increase sentiment classifier accuracy.*
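
One way to act on this observation is to count the distinct tickers in each tweet and drop tweets above a threshold. The sketch below is an assumption-laden illustration: the cashtag pattern (a `$` followed by one to five letters), the threshold of two, and the function names are all ours, and the sample tweets are made up.

```python
import re

# Assumption: a ticker is a '$' followed by one to five letters.
CASHTAG = re.compile(r'\$[A-Za-z]{1,5}\b')

def count_tickers(text):
    """Number of distinct cashtags in a tweet, ignoring case."""
    return len({tag.upper() for tag in CASHTAG.findall(text)})

def keep_focused(tweet_texts, max_tickers=2):
    """Keep only tweets that mention at most max_tickers distinct tickers."""
    return [t for t in tweet_texts if count_tickers(t) <= max_tickers]

sample = [
    "$AAPL looks strong after earnings",
    "$AAPL $TSLA $SPY $NVDA $AMZN join our group",
    "Holding $aapl and $MSFT for the long run",
]
focused = keep_focused(sample, max_tickers=2)
print(focused)
```

The filter has to run on the original tweet text, before the ticker sign is stripped during tokenization.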

#### Dimensionality Reduction

In [None]:
# reduce into two dimensions
# TruncatedSVD uses singular value decomposition (SVD) to reduce dimensions
# The algorithm works on sparse matrices as well as fractional data returned by the tfidf vectorizer
# Latent semantic analysis (LSA)
# https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
svd = TruncatedSVD(n_components=2, random_state=3000)

# vectorize the tweets with tf-idf (unigrams to trigrams, terms found in at least 5 tweets)
vect = TfidfVectorizer(min_df=5, ngram_range=(1,3)).fit(features)
features_vect = vect.transform(features)

# project the tf-idf matrix onto two components using the SVD algorithm
reduced_data = svd.fit_transform(features_vect)

# collect the components and the target in a dataframe
reduced_df = pd.DataFrame(reduced_data, columns = ["Component1", "Component2"])
reduced_df["target"] = target
print(reduced_df[:5])
svd.components_

   Component1  Component2  target
0    0.113307    0.148365     1.0
1    0.155884    0.115688     1.0
2    0.159520    0.093583     1.0
3    0.212081    0.150393     0.0
4    0.141513    0.102058     1.0


In [None]:
graph = px.scatter(reduced_df, x='Component1', y='Component2', color = 'target')
graph.show()

<img src="https://i.ibb.co/PwjmJpF/newplot.png" alt="newplot" border="0">

*Figure IV: Dimensionality reduction of tweet text into two components using latent semantic analysis, pictured as a scatter plot.*

*Variables: all --> SVD --> Component1, Component2, target*
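
When reading a two-component plot like this, it helps to know how much of the variance the two components retain. The sketch below shows how that could be checked with `explained_variance_ratio_` on a tiny made-up corpus; the toy tweets are placeholders and the numbers it prints say nothing about our real dataset.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up tweets, used only to make the example self-contained.
toy_tweets = [
    "aapl earnings beat expectations",
    "aapl stock drops after earnings",
    "tsla deliveries beat expectations",
    "tsla stock drops on recall news",
    "market rally lifts aapl and tsla",
]

tfidf = TfidfVectorizer().fit_transform(toy_tweets)
svd = TruncatedSVD(n_components=2, random_state=3000)
reduced = svd.fit_transform(tfidf)

# Share of the variance captured by each of the two components.
print(reduced.shape)
print(svd.explained_variance_ratio_)
```

If the two ratios sum to a small fraction, overlapping classes in the scatter plot do not necessarily mean the classes overlap in the full feature space.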


### 3.3. Model Training

In [None]:
# splits features and target into training and testing data
def split_train_test(features, target):
    X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)
    return X_train, X_test, y_train, y_test

# create training and testing data
X_train, X_test, y_train, y_test = split_train_test(features, target)

In [None]:
# define classifiers
classifiers = {
    'Multinomial Naive Bayes': MultinomialNB(),
    'Support Vector Machine': LinearSVC(),
    'Logistic Regression': LogisticRegression()
}

In [None]:
# vectorize and create vocabulary
# utilize n-grams to preserve context
vect = TfidfVectorizer(min_df=5, ngram_range=(1,3)).fit(X_train)

#encode the words in X_train and X_test based on the vocabulary
X_train_vectorized = vect.transform(X_train)
X_test_vectorized = vect.transform(X_test)

#### All Features

In [None]:
# train classifiers and print results
def classifiers_percentage_split(classifiers, X_train, X_test, y_train, y_test):
    for classifier_name, classifier_object in classifiers.items():
        # train the classifier
        model = classifier_object.fit(X=X_train, y=y_train)
        print(classifier_name)
        # classification accuracy metric
        print(f'\tClassification accuracy on training set: {model.score(X_train, y_train):.3f}')
        print(f'\tClassification accuracy on testing set: {model.score(X_test, y_test):.3f}')
        # f1-score metric
        print(f'\tf1-score on training set: {f1_score(y_true=y_train, y_pred=model.predict(X_train), average="weighted"):.3f}')
        print(f'\tf1-score on testing set: {f1_score(y_true=y_test, y_pred=model.predict(X_test), average="weighted"):.3f}\n')

classifiers_percentage_split(classifiers, X_train_vectorized, X_test_vectorized, y_train, y_test)

Multinomial Naive Bayes
	Classification accuracy on training set: 0.812
	Classification accuracy on testing set: 0.720
	f1-score on training set: 0.798
	f1-score on testing set: 0.699

Support Vector Machine
	Classification accuracy on training set: 0.913
	Classification accuracy on testing set: 0.720
	f1-score on training set: 0.911
	f1-score on testing set: 0.710

Logistic Regression
	Classification accuracy on training set: 0.864
	Classification accuracy on testing set: 0.732
	f1-score on training set: 0.856
	f1-score on testing set: 0.714



All three algorithms performed well on the training set but noticeably worse on the testing set (for example, the Support Vector Machine drops from 0.913 training accuracy to 0.720 testing accuracy). This indicates that the models are overfitting.
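
The size of the train/test gap can be tabulated directly from the accuracies printed above. This small sketch copies those numbers into a dictionary and ranks the models by gap; the variable names are illustrative.

```python
# Accuracies copied from the results printed above (all features, default settings).
results = {
    "Multinomial Naive Bayes": {"train": 0.812, "test": 0.720},
    "Support Vector Machine": {"train": 0.913, "test": 0.720},
    "Logistic Regression": {"train": 0.864, "test": 0.732},
}

# A larger gap between training and testing accuracy suggests more overfitting.
gaps = {name: round(acc["train"] - acc["test"], 3) for name, acc in results.items()}
largest_gap_model = max(gaps, key=gaps.get)

print(gaps)
print(largest_gap_model)
```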

#### Selected Features

In [None]:
# SelectKBest to select the most important features using a chi squared test
def SelectKBest_feature_selection(n_features, classifiers, X_train, X_test, y_train, y_test):
    select = SelectKBest(score_func=chi2, k=n_features)
    select.fit(X_train, y_train)
    X_train_selected = select.transform(X_train)
    X_test_selected = select.transform(X_test)
    # reuse previous function to fit and print results
    return classifiers_percentage_split(classifiers, X_train_selected, X_test_selected, y_train, y_test)

SelectKBest_feature_selection(100, classifiers, X_train_vectorized, X_test_vectorized, y_train, y_test)

Multinomial Naive Bayes
	Classification accuracy on training set: 0.717
	Classification accuracy on testing set: 0.688
	f1-score on training set: 0.686
	f1-score on testing set: 0.653

Support Vector Machine
	Classification accuracy on training set: 0.803
	Classification accuracy on testing set: 0.740
	f1-score on training set: 0.794
	f1-score on testing set: 0.727

Logistic Regression
	Classification accuracy on training set: 0.772
	Classification accuracy on testing set: 0.728
	f1-score on training set: 0.754
	f1-score on testing set: 0.707



### 3.4. Model Optimization

We are performing model optimization by hyperparameter tuning in order to reduce overfitting and increase generalizability of the classifiers to new data.

In [None]:
nb_params_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
svm_params_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
log_params_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}

In [None]:
def hyperparameter_tune_algorithms(X_train, y_train):
    nb = GridSearchCV(MultinomialNB(), nb_params_grid, cv=5)
    nb.fit(X_train, y_train)
    print(f'NB best parameters: {nb.best_params_}')
    svm = GridSearchCV(LinearSVC(max_iter=10000), svm_params_grid, cv=5)
    svm.fit(X_train, y_train)
    print(f'SVM best parameters: {svm.best_params_}')
    log = GridSearchCV(LogisticRegression(max_iter=10000), log_params_grid, cv=5)
    log.fit(X_train, y_train)
    print(f'Log best parameters: {log.best_params_}')
    return nb, svm, log

nb_tuned, svm_tuned, log_tuned = hyperparameter_tune_algorithms(X_train_vectorized, y_train)

NB best parameters: {'alpha': 0.001}
SVM best parameters: {'C': 0.1}
Log best parameters: {'C': 1}


### 3.5. Model Testing

In [None]:
# define classifiers with tuned algorithms
tuned_classifiers = {
    'Multinomial Naive Bayes': nb_tuned,
    'Support Vector Machine': svm_tuned,
    'Logistic Regression': log_tuned
}

#### All Features

In [None]:
# train classifiers and print results
def tuned_classifiers_predict(classifiers, X_train, X_test, y_train, y_test):
    for classifier_name, classifier_object in classifiers.items():
        # re-run the grid search on this training data and refit the best estimator
        model = classifier_object.fit(X=X_train, y=y_train)
        print(classifier_name)
        # classification accuracy metric
        print(f'\tClassification accuracy on testing set: {model.score(X_test, y_test):.3f}')
        # f1-score metric
        print(f'\tf1-score on testing set: {f1_score(y_true=y_test, y_pred=model.predict(X_test), average="weighted"):.3f}\n')

tuned_classifiers_predict(tuned_classifiers, X_train_vectorized, X_test_vectorized, y_train, y_test)

Multinomial Naive Bayes
	Classification accuracy on testing set: 0.692
	f1-score on testing set: 0.680

Support Vector Machine
	Classification accuracy on testing set: 0.728
	f1-score on testing set: 0.708

Logistic Regression
	Classification accuracy on testing set: 0.732
	f1-score on testing set: 0.714



#### Selected Features

In [None]:
# train classifiers and print results for subset of features
def tuned_classifiers_predict_subset(n_features, classifiers, X_train, X_test, y_train, y_test):
    select = SelectKBest(score_func=chi2, k=n_features)
    select.fit(X_train, y_train)
    X_train_selected = select.transform(X_train)
    X_test_selected = select.transform(X_test)
    # reuse previous function to fit and print results
    return tuned_classifiers_predict(classifiers, X_train_selected, X_test_selected, y_train, y_test)

tuned_classifiers_predict_subset(100, tuned_classifiers, X_train_vectorized, X_test_vectorized, y_train, y_test)

Multinomial Naive Bayes
	Classification accuracy on testing set: 0.720
	f1-score on testing set: 0.695

Support Vector Machine
	Classification accuracy on testing set: 0.740
	f1-score on testing set: 0.730

Logistic Regression
	Classification accuracy on testing set: 0.736
	f1-score on testing set: 0.727



<a id="4"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">


## 4. DISCUSSION

For this project we compared Linear SVC, MultinomialNB, and Logistic Regression. 

When training with all features, each model performed as follows:
  * `Linear SVM`
    * `Accuracy (test):  0.720 `
    * `F1 (train): 0.911 `
    * `F1 (test): 0.710`
  * `MultinomialNB`
    * `Accuracy (test):  0.720 `
    * `F1 (train): 0.798 `
    * `F1 (test): 0.699`
  * `Logistic Regression`
    * `Accuracy (test):  0.732 `
    * `F1 (train): 0.856 `
    * `F1 (test): 0.714`
  
All three models showed similar test set accuracy and f1 scores, with Linear SVM and Logistic Regression slightly outperforming Multinomial Naive Bayes. However, while Linear SVM had a relatively high F1 score, this model suffered from the greatest overfitting as demonstrated by the large difference between its train and test set F1 scores.

Each model was then trained on the top 100 features selected from the vectorized dataset using a chi-squared test. When trained on this reduced feature set, each model achieved similar test scores, but Linear SVM overfitting was significantly reduced: the gap between its train and test F1 scores shrank from 0.201 (0.911 vs. 0.710) to 0.067 (0.794 vs. 0.727).

We used grid search cross validation to find the optimal hyperparameters for each model. The optimal alpha value for Multinomial Naive Bayes was 0.001, while the optimal C values for SVM and Logistic Regression were found to be 0.1 and 1, respectively.

Training these tuned models on the entire feature set yielded test results not significantly different from those of the default models.

Because the best performance was yielded when training a Linear SVM model on 100 selected best features, Linear SVM would be the best choice on a refined dataset. However, Logistic Regression offered better accuracy and F1 scores on unselected datasets. For this reason, both Linear SVM and Logistic Regression would be suitable choices for our predictive model, depending on whether features were selected.

Based on the performance of these models, we can conclude that our models are able to predict the sentiment of a tweet to some extent, but not with a satisfactory accuracy of at least 90%.

To answer our original questions:

* What ML model would we use to minimize the classifier accuracy difference between testing and training data to increase generalization of the model?

Linear SVM provided the best performance on a dataset with selected features, though it exhibited greater overfitting on the entire dataset. For an unselected dataset, Logistic Regression should be used. Therefore, both of these models are well suited for our application.

* Are there any financial data that correlate with each other (e.g. High might correlate with HP%)?

The financial data that we analyzed largely did not correlate with each other, with the exception of Market Cap and High price. However, this is not a significant correlation as a high price would be trivially expected to correlate with a larger market cap when looking at a single ticker.

* How can we improve the performance of sentiment classification specifically for tweets?

Selecting the 100 best features significantly reduced overfitting for Linear SVM and improved test accuracy for all three tuned models (for example, from 0.728 to 0.740 for Linear SVM). Because we only tried the best 100 features, selecting different numbers of features might further increase performance.
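
One possible way to explore other feature counts is to treat the number of selected features as a hyperparameter and search over it with cross-validation. The sketch below is a minimal illustration on a made-up corpus: the texts, labels, and candidate values of `k` are placeholders, and on our real data the grid would include values around 100.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus with made-up labels (1 = positive, -1 = negative).
texts = [
    "great earnings strong growth", "strong rally great gains",
    "great quarter strong demand", "gains continue strong buy",
    "great buy strong upside", "strong gains great momentum",
    "weak earnings big loss", "big loss weak outlook",
    "weak demand shares drop", "shares drop big selloff",
    "weak quarter big miss", "selloff continues weak guidance",
]
labels = [1] * 6 + [-1] * 6

# Vectorizing and selecting inside the pipeline means both steps are
# refit on the training folds only during cross-validation.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("select", SelectKBest(score_func=chi2)),
    ("clf", LinearSVC()),
])

param_grid = {"select__k": [2, 5, "all"]}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)

print(search.best_params_, round(search.best_score_, 3))
```

Keeping feature selection inside the pipeline avoids choosing features with knowledge of the validation folds, which would make the cross-validated scores look better than they are.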
  

The results of our project indicate that stock tweets can be analyzed with some degree of accuracy. Our project can extract needed information from tweets and achieve accuracies of ~73%. However, the usage of these tweets to predict future stock performance may be dubious. Many of the tweets that we analyzed were clearly posted by bots and many of them tagged several popular symbols in a clear attempt to attract user attention. Because of the large volume of these low-quality tweets, Twitter data may not be entirely reliable. Better cleaning techniques might be employed to only search for tweets posted by humans.

We identify few, if any, ethical issues with this project. The tweets we analyze are public and meant to be analyzed in this manner. However, the stock market is volatile and any results that tweet analysis may produce should not be taken as gospel. Improperly relying on machine learning techniques would be especially harmful to individual retail traders, possibly increasing inequitable outcomes in financial markets.

In this project, we used natural language processing techniques to classify the sentiment of stock tweets. Our models presented relatively good performance, but still need to be improved. Future experiments may include, as mentioned, better cleaning techniques to select for genuine human tweets. Tweets can be filtered by the quality of their poster, including location, frequency of posting, and amount of activity on their posts. With better data, we can improve the quality of our predictions. With higher quality predictions, NLP analysis of stock tweets might be applicable in projecting the performance of stocks.

<a id="5"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

### CONTRIBUTIONS

- Sections 1 and 2 (Introduction and Method) were done together.
- Section 3 visualizations were split among all of us; Sina did wrangling and model optimization.
- Section 4 was done by Byron and Nadia.