# Objectives 
* Classify tweets as bullish, neutral or bearish for a particular asset
* Use classified tweets to build and backtest a trading strategy

Consider model assumptions


# 0. Setup

In [None]:
!pip install GetOldTweets3 #Some documentation at https://github.com/Mottl/GetOldTweets3

import pandas as pd
import GetOldTweets3 as got
from datetime import datetime, date

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

Collecting GetOldTweets3
  Downloading https://files.pythonhosted.org/packages/ed/f4/a00c2a7c90801abc875325bb5416ce9090ac86d06a00cc887131bd73ba45/GetOldTweets3-0.0.11-py3-none-any.whl
Collecting pyquery>=1.2.10
  Downloading https://files.pythonhosted.org/packages/78/43/95d42e386c61cb639d1a0b94f0c0b9f0b7d6b981ad3c043a836c8b5bc68b/pyquery-1.4.1-py2.py3-none-any.whl
Collecting cssselect>0.7.9
  Downloading https://files.pythonhosted.org/packages/3b/d4/3b5c17f00cce85b9a1e6f91096e1cc8e8ede2e1be8e96b87ce1ed09e92c5/cssselect-1.1.0-py2.py3-none-any.whl
Installing collected packages: cssselect, pyquery, GetOldTweets3
Successfully installed GetOldTweets3-0.0.11 cssselect-1.1.0 pyquery-1.4.1


# 1. Scraping tweets
Things to consider when picking an asset: 
* Tweets per unit of time
* Price movement over the relevant time period. Has the price risen and fallen within the period, or has it moved in one direction?
* Availability of price data
* 'Quality' of tweets, judged on a small sample

---

Some sources of time series data: Yahoo Finance, OFX, Federal Reserve, Investing.com

In [None]:
#Collect tweets containing part or all of the string text_query
text_query = 'tsla'
count = 100
#Build the query object
tweetCriteria = got.manager.TweetCriteria().setQuerySearch(text_query)\
                                            .setMaxTweets(count)\
                                            .setTopTweets(True) #Greatly improves tweet quality, but widens the time span covered (~2 days for 100 top tweets vs ~2 mins for 100 unfiltered tweets, for 'tesla')

#Fetch the matching tweets as a list of tweet objects
tweets = got.manager.TweetManager.getTweets(tweetCriteria)
#Keep only the date and text of each tweet
text_tweets = [[tweet.date, tweet.text] for tweet in tweets]
#Build a dataframe from the list: column 0 is the date, column 1 is the text
tweets_df = pd.DataFrame(text_tweets)

In [None]:
#Print the text of every tweet in the sample
for t in tweets_df[1]:
  print(t)

If you’re confused about stocks. Trying to do it yourself. Did you know GK is one of the ONLY RIA firms with no minimums? You can have access to one of the fastest growing, most influential investment teams in finance. (Source @LABJnews ) http://GerberKawasaki.com $tsla #tesla
Lenovo Legion x @PlayApex Now is your chance to rise above the rest. Unmatched performance, purposeful engineering, modern design. Gear up with a machine as savage as you are. Stylish outside. Savage inside.
When the markets are closed, and you're worried you didn't buy enough $SPAQ $TSLA $WKHS $NIO
Clean energy $SEDG $TSLA 
Could this be the bottom, or near the bottom of the Nasdaq? $VIX $VXN $NDX $QQQ $SPY $SPX $AAPL $NVDA $FB $TSLA $AMZN
Tesla Should Not Be Valued as Car Company: Gerber - thoughts on tesla from Bloomberg TV today. $tsla ⁦@elonmusk⁩ https://www.bloomberg.com/news/videos/2020-09-12/tesla-should-not-be-valued-as-car-company-gerber-video
$tsla LIFTOFF!!!!! $msft $aapl $tsla $nflx $fb $amzn $shop $

In [None]:
#Use positional indexing so this still works if fewer than `count` tweets were returned
print("\nTime difference between first and last tweet in sample: ", tweets_df[0].iloc[0] - tweets_df[0].iloc[-1])


Time difference between first and last tweet in sample:  3 days 16:15:08


# 2. Preprocessing tweets
The main objective of preprocessing is to reduce redundancy in our dictionary (the set of all words which appear at least once in some tweet in the sample).

Here are some thoughts to guide preprocessing, from looking at the tweets above and https://www.analyticsvidhya.com/blog/2018/07/hands-on-sentiment-analysis-dataset-python/:


* Exclamation marks, question marks, fully capitalised words and hashtags can help classify tweets as neutral or non-neutral
* Convert words with a leading capital to lowercase
* Similarly, we can try lemmatization (reduce words to their dictionary form, e.g. convert 'loves', 'loving' and 'loved' to 'love') and stemming (strip suffixes such as 'ing', 'ed' and 'ly')
* Remove all other punctuation, special characters, URLs and numbers
* Common short words like 'and', 'are', 'is' and 'at' (stopwords) don't imply any obvious sentiment, so they can be removed

See https://textblob.readthedocs.io/en/dev/index.html
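
The steps in the list above can be sketched with the standard library alone. This is a minimal illustration, not a tuned pipeline: the function name `preprocess`, the tiny `STOPWORDS` set and the example tweet are all assumptions made for the sketch, and lemmatization / stemming are left out because they need an external library.

```python
import re

# Illustrative stopword list (an assumption); a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "at", "is", "of", "the", "to"}

URL_RE = re.compile(r"https?://\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def preprocess(tweet):
    """Return (tokens, cues) for one raw tweet."""
    words = tweet.split()
    # Record the emphasis cues before they are stripped from the text.
    cues = {
        "exclamations": tweet.count("!"),
        "questions": tweet.count("?"),
        "hashtags": sum(1 for w in words if w.startswith("#")),
        "all_caps": sum(
            1 for w in words
            if len(w.strip("!?.,:;")) > 1
            and w.strip("!?.,:;").isalpha()
            and w.strip("!?.,:;").isupper()
        ),
    }
    text = URL_RE.sub(" ", tweet).lower()   # remove URLs, then lowercase
    text = NON_LETTER_RE.sub(" ", text)     # drop punctuation, digits, symbols
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return tokens, cues


tokens, cues = preprocess(
    "Clean energy is at the TOP!!! $SEDG $TSLA https://example.com #tesla"
)
print(tokens)
print(cues)
```

Keeping the cues in a separate dictionary means the punctuation can be removed from the text without losing the signal it carries. Note that the letter-only filter also splits contractions ("can't" becomes "can" and "t"), which may or may not be acceptable.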




# 3. Feature extraction and selection
Here we extract and select the most representative features of the text by obtaining vector representations of our tweets.

Resources: 
* https://arxiv.org/pdf/1908.10063.pdf (FinBERT)
* https://arxiv.org/pdf/1711.05345.pdf (Unsupervised transfer learning, §2.2)
* https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction (API for BoW, TF-IDF)
* http://jalammar.github.io/illustrated-bert/

Training our own embedding model is infeasible for this task, so we'll use a pre-trained one. Word2Vec and GloVe are context-insensitive (each word gets a single vector regardless of the surrounding words), so we'll use ELMo or BERT instead.
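
Before reaching for a contextual model, it may help to see what a "vector representation of a tweet" looks like in the simplest case. The sketch below hand-rolls a TF-IDF weighting (the bag-of-words baseline linked in the resources) on made-up tokenised tweets. The function name `tfidf_vectors` and the plain `log(N / df)` formula are assumptions for illustration; library implementations typically use a smoothed variant, so the numbers will differ.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map tokenised documents to TF-IDF vectors over a shared vocabulary."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Plain idf = log(N / df); an unsmoothed textbook form (assumption).
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero for empty tweets
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors


# Made-up tokenised tweets, one list of tokens per tweet.
docs = [
    ["tsla", "liftoff"],
    ["clean", "energy", "tsla"],
    ["tsla", "bottom", "nasdaq"],
]
vocab, vectors = tfidf_vectors(docs)
for row in vectors:
    print([round(x, 3) for x in row])
```

Because 'tsla' appears in every tweet, its weight is zero everywhere: a term shared by the whole sample carries no information for telling tweets apart. Word order and context are ignored entirely, which is the limitation that motivates ELMo or BERT.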

---

A pre-trained FinBERT model could cover steps 3 and 4 (feature extraction and modelling) in one go. We could also try unsupervised transfer learning on a pre-trained BERT / FinBERT model to improve accuracy on our data.

In [None]:
from textblob import TextBlob

tweets_df[1][:5].apply(lambda x: TextBlob(x).sentiment)

0                 (0.55, 0.75)
1                  (-0.2, 0.1)
2                   (0.8, 0.7)
3    (0.5, 0.8333333333333334)
4                   (0.0, 0.0)
Name: 1, dtype: object
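
Each pair in the output above is (polarity, subjectivity), with polarity in [-1, 1]. One rough way to turn polarity into the three target classes is a symmetric threshold. The function name `label_from_polarity` and the threshold of 0.1 are assumptions for illustration, and TextBlob's lexicon is general-purpose, so positive wording is not guaranteed to mean bullish.

```python
def label_from_polarity(polarity, threshold=0.1):
    """Map a polarity score in [-1, 1] to one of the three target classes."""
    if polarity > threshold:
        return "bullish"
    if polarity < -threshold:
        return "bearish"
    return "neutral"


# Polarity values copied from the output above.
polarities = [0.55, -0.2, 0.8, 0.5, 0.0]
labels = [label_from_polarity(p) for p in polarities]
print(labels)
```

Comparing these labels against the tweet text by eye is a quick sanity check on whether a general-purpose sentiment score is usable for this task.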

In [None]:
for s in range(5):
  print(tweets_df[1][s])

dudes get around some girls and start talking about "you think I own too many Tesla shares??"
If you can’t purchase the Tesla stock today, look into the companies that manufacture parts for the company. If Tesla is growing then so are the companies who are connected to them. Here are a few. 
Lol the Apple and Tesla stock split got Robinhood looking like the SNKRS app #StockTalk
The sexual tension between me and the Apple and Tesla stock split
Omg I’m up $3500 off a $500 investment on Apple and Tesla 


# 4. Modelling
We want to group unlabelled vector representations of tweets, so the obvious approach is to use a clustering algorithm (see https://developers.google.com/machine-learning/clustering/clustering-algorithms). 
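
As a minimal sketch of the clustering idea, here is plain k-means on toy 2-D points that stand in for tweet vectors. Everything in it is illustrative: the points are made up, real embeddings have hundreds of dimensions, and initialising the centroids with the first `k` points is an assumption chosen only to make the run reproducible.

```python
def kmeans(points, k, n_iter=20):
    """Plain k-means on lists of floats; returns (assignments, centroids)."""
    # Naive initialisation (assumption, for reproducibility): first k points.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            assignments[i] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Toy 2-D "embeddings" standing in for tweet vectors (made-up numbers).
points = [
    [0.9, 0.8], [-0.8, -0.9], [0.0, 0.1],
    [1.0, 0.9], [-0.9, -1.0], [0.1, 0.0],
]
assignments, centroids = kmeans(points, k=3)
print(assignments)
```

The algorithm returns cluster indices, not sentiment labels. Deciding which cluster (if any) corresponds to bullish, neutral or bearish still requires reading a sample of tweets from each one, and nothing guarantees that the clusters separate by sentiment rather than by topic.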

# 5. Backtesting
We now have tweets labelled with sentiment. A possible approach is to group tweets by day, find the average sentiment of tweets for each day and use this to create a buy / hold / sell signal for each day. We can then align this vector of signals with a daily close price vector for the same asset and track performance based on a simple rule. 
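
The approach described above might look like the following in pandas. This is a sketch under stated assumptions, not a finished backtest: the tweets and prices are made up, sentiment is encoded as +1 / 0 / -1, the `threshold` of 0.2 is arbitrary, and the rule (long, flat or short) is only one possible reading of buy / hold / sell. It also ignores transaction costs.

```python
import pandas as pd

# Made-up labelled tweets: +1 bullish, 0 neutral, -1 bearish.
tweets = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-09-08 10:00", "2020-09-08 15:30",
        "2020-09-09 09:00", "2020-09-09 11:00",
        "2020-09-10 14:00",
    ]),
    "sentiment": [1, 1, -1, -1, 0],
})
# Made-up daily close prices for the same asset.
close = pd.Series(
    [100.0, 110.0, 99.0, 99.0],
    index=pd.to_datetime(["2020-09-08", "2020-09-09", "2020-09-10", "2020-09-11"]),
)

# Average sentiment per calendar day.
daily = tweets.groupby(tweets["date"].dt.normalize())["sentiment"].mean()

# Simple rule: long above +threshold, short below -threshold, flat otherwise.
threshold = 0.2
signal = pd.Series(0, index=daily.index)
signal[daily > threshold] = 1
signal[daily < -threshold] = -1

# Hold each day's signal over the NEXT day's return, so a position
# never depends on tweets posted after it was opened.
returns = close.pct_change()
position = signal.reindex(close.index).fillna(0).shift(1).fillna(0)
strategy_returns = position * returns.fillna(0)
equity = (1 + strategy_returns).cumprod()
print(equity)
```

The one-day shift is the minimum guard against look-ahead bias. Comparing `equity` with a buy-and-hold curve, `(1 + returns.fillna(0)).cumprod()`, gives a first baseline for judging whether the signal adds anything.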