# Twitter Cashtags as a Predictive Metric for Catalyst Volatility 

In [1]:
import pandas as pd
import numpy as np
from yahoofinancials import YahooFinancials as yf
import requests
from bs4 import BeautifulSoup
from datetime import datetime, date, timedelta

## Introduction

The goal of this analysis was to see whether Twitter cashtags (ex: $AAPL) occur at a different rate before catalyst events that turned out to be volatile than before events that were non-volatile. Much of this is built on Tweepy, a Python wrapper for the Twitter API. I gained developer access to the Twitter API on an academic basis. A dummy Twitter account was created for this developer access, set to private, and used to follow a curated list of people involved in biotech on Twitter.

The accounts followed by the dummy account were then analyzed to find the usernames that tweeted cashtags most often. Only the 20 users who tweeted the most cashtags in the last 2 days were included in the final user set; this was done to avoid wasting API calls on users who rarely tweet cashtags. It was necessary to limit the size of the dataset because the Twitter API limits the number of calls per 15 minutes. Once the frequent cashtag tweeters were identified, the datetime and cashtags of each of their tweets were collected. It is important to note that the code reads each user's timeline, so cashtags from retweets are included alongside those from the user's own tweets. In total, 13,399 tweets from the last 30 days were examined; only the last 30 days were evaluated because of the Twitter API limits.

The cashtags were combined in a dict whose keys are dates and whose values are lists of the cashtags tweeted on that date. Occurrences of the tickers in the catalyst dataset (period_df) were then counted. Note that the date of each catalyst event was also pulled, and occurrences were only counted for days before the catalyst event. The summed occurrence of each ticker was then compared between events classified as volatile (positive or negative) and events classified as non-volatile. Volatile events were associated with a slightly higher mean ticker occurrence (34.7 vs 28.1); the difference is not big enough to be significant, but it is still interesting.

The catalyst dataset was web-scraped from biopharmcatalyst.com. The dates were converted to datetime, price data was generated for the unique tickers in the dataset, and each event outcome was classified based on the total return 2 days after the event. The focus was on recent events, so only events in the last 60 days were included. Some of the most recent events were also eliminated because the event outcome classifier needs 2 days between the event date and the last day in the price data. Events were classified as positive, negative, or na (not applicable) based on price action after the event: a 2-day total return of +10% or more is positive, -10% or less is negative, and anything in between is na. Another classifier called Volatile was added, which records whether the event outcome was volatile, either positive or negative (value of 1), or not volatile (value of 0). Only Twitter data going back 30 days was generated, so the earlier events in the dataset are not covered by the Twitter data.

Future projects could involve sentiment analysis of the actual tweets using NLP. This would allow sentiment to be tracked over time for various cashtags. This was not done in this analysis because I was focused on the ability to predict volatility, not direction. It is also possible to have Tweepy make the maximum number of Twitter API calls, pause for 15 minutes until the limit resets, and then continue. This was not done in this analysis because I wanted to focus on a small, exploratory dataset and not get carried away with functions that take hours to run once.
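
The pause-and-continue idea can be sketched without any Twitter dependency. In the sketch below, `fetch_with_pauses`, `fetch_one`, `calls_per_window` and `window_seconds` are hypothetical names introduced for illustration, and the window size used in the example is made up; the real limits depend on the endpoint and access level. Tweepy also has a `wait_on_rate_limit=True` option on `tweepy.API` that should handle the waiting automatically.

```python
import time

def fetch_with_pauses(targets, fetch_one, calls_per_window,
                      window_seconds=15 * 60, sleep=time.sleep):
    """Call fetch_one(target) for every target, pausing after each full window.

    calls_per_window and window_seconds are assumptions supplied by the
    caller; they are not read from the Twitter API.
    """
    results = []
    for i, target in enumerate(targets):
        # After every calls_per_window calls, wait for the limit to reset.
        if i > 0 and i % calls_per_window == 0:
            sleep(window_seconds)
        results.append(fetch_one(target))
    return results

# Usage with a stand-in fetch function and a fake sleep that only records pauses
pauses = []
out = fetch_with_pauses(["a", "b", "c", "d", "e"], str.upper,
                        calls_per_window=2, window_seconds=900,
                        sleep=pauses.append)
print(out)     # ['A', 'B', 'C', 'D', 'E']
print(pauses)  # [900, 900]
```

Passing `sleep` in as a parameter keeps the sketch testable without actually waiting 15 minutes.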

## Generation of catalyst dataset to evaluate effects of CashTags

In [2]:
URL = "https://www.biopharmcatalyst.com/calendars/historical-catalyst-calendar" #the url for the website with the historical catalyst dataset
page = requests.get(URL)
soup = BeautifulSoup(page.content,'html.parser')
results = soup.find(id='historical-catalysts')

tickers = results.find_all('a', class_='ticker')
drug = results.find_all('strong', class_='drug')
indication = results.find_all('div', class_='indication')
date_list_og = results.find_all('time', class_='catalyst-date')
note = results.find_all('div', class_='catalyst-note')
status = results.find_all('td', class_='stage')

def strip_text(alist): #strips the text from a list, meant to be used for cleaning raw web scraping output
    res = [a.text.strip() for a in alist]
    return res

tickers_raw = strip_text(tickers)
drug_raw = strip_text(drug)
indication_raw = strip_text(indication)
date_raw = strip_text(date_list_og)
note_raw = strip_text(note)
status_raw = strip_text(status)
#strips the text from each raw web scraping output

datetime_format = [datetime.strptime(x,'%m/%d/%Y') for x in date_raw] #reformats the dates pulled into datetime
status_raw = [x.replace(' ','') for x in status_raw] #edits the status list to have no spaces

#creates dataframe of the reformatted web scraping output
raw_df = pd.DataFrame({'Ticker':tickers_raw,'Drug':drug_raw,'Indication':indication_raw,'Date':datetime_format,'Status':status_raw})

In [3]:
raw_df.dtypes

Ticker                object
Drug                  object
Indication            object
Date          datetime64[ns]
Status                object
dtype: object

In [4]:
currDate = datetime(2020,12,2)
periodMin = currDate - timedelta(days=60)
periodMax = currDate - timedelta(days=3) # to adjust for event outcome classification calculation

period_df = raw_df.loc[(raw_df['Date']>=periodMin) & (raw_df['Date']<=periodMax)].copy() # copy so that later column assignments do not trigger SettingWithCopyWarning

In [5]:
period_df # this df is events that happened in the last 60 days

Unnamed: 0,Ticker,Drug,Indication,Date,Status
2475,MGTX,AAV-RPGR,X-Linked Retinitis Pigmentosa,2020-10-03,Phase1/2
2476,YMAB,Omburtamab,CNS/Leptomeningeal Metastases from Neuroblastoma,2020-10-05,BLAFiling
2477,MGEN,Cobomarsen - SOLAR,Cutaneous T-Cell Lymphoma,2020-10-05,Phase2
2478,AMGN,AMG 510,Non-small cell lung cancer (NSCLC),2020-10-05,Phase2
2479,KNSA,Mavrilimumab,Giant cell arteritis (GCA),2020-10-06,Phase2
...,...,...,...,...,...
2600,UROV,Vibegron,Irritable bowel syndrome (IBS),2020-11-24,Phase2a
2601,LQDA,LIQ861,Pulmonary arterial hypertension,2020-11-25,CRL
2602,RVNC,DAXI (RT002),Moderate to severe glabellar (frown) lines,2020-11-25,PDUFA
2603,YMAB,Naxitamab,Neuroblastoma,2020-11-25,Approved


In [6]:
# the stock data is just to measure what the event outcome was. 

def stock_data(alist, date1, date2): #function to generate adj close price data for tickers in alist
    yf_tickers = yf(alist)
    res = yf_tickers.get_historical_price_data(date1, date2,'daily') # this returns json
    # formats json into pd df with adj close data, date as index, ticker as columns
    res_df = pd.DataFrame({
        a: {x['formatted_date']: x['adjclose'] for x in res[a]['prices']}
        for a in alist
    }).round(decimals=2)
    return res_df

uniqueTickers = period_df['Ticker'].unique().tolist()

date1 = '2020-09-01'
date2 = '2020-12-01'

price_data = stock_data(uniqueTickers,date1,date2) # generates price df
price_data.index = pd.to_datetime(price_data.index) #reformats the price data array to have datetime format index
price_data['Day'] = [a for a in range(len(price_data.index))] #adds Day count column for use as second index
price_data.head()

Unnamed: 0,MGTX,YMAB,MGEN,AMGN,KNSA,SIOX,CANF,CRBP,BMY,LLY,...,INCY,KZIA,ARQT,BLPH,VTRS,ALNY,UROV,LQDA,RYTM,Day
2020-09-01,12.36,40.71,13.05,249.17,16.6,,2.03,8.91,60.51,146.55,...,92.97,8.01,24.8,10.37,15.5,130.41,8.76,4.78,28.77,0
2020-09-02,12.73,40.82,14.11,256.38,16.53,,1.92,9.38,61.14,148.79,...,95.44,8.09,25.39,10.12,15.71,130.66,8.79,4.46,29.83,1
2020-09-03,12.59,38.37,13.65,246.24,15.71,,1.8,9.38,59.6,148.26,...,91.57,7.34,24.8,9.92,15.46,125.19,8.84,4.39,29.3,2
2020-09-04,12.64,38.46,13.5,246.72,15.37,,1.61,9.25,59.5,150.14,...,90.49,7.35,24.4,9.46,15.81,122.83,8.85,4.34,29.88,3
2020-09-08,13.02,40.65,13.5,239.55,15.56,,1.7,2.23,58.09,148.55,...,89.73,7.0,24.19,9.27,15.64,122.99,9.05,4.45,29.4,4


In [7]:
def fixDates(dates_list, max_tries=4):
    df = price_data #specifies which dataframe to look in to check each date
    res = []
    for event_date in dates_list: #for each date in the list
        for offset in range(max_tries): #tries the date itself, then each of the following calendar days
            candidate = event_date + timedelta(days=offset)
            if candidate in df.index: #checks if the date is a trading day in the price data
                res.append(candidate) #appends if successful
                break
        else: #runs only when no trading day was found within max_tries tries
            res.append(pd.NaT) #NaT marks the event so it can be removed from the dataframe
    return res

fixed_dates = fixDates(period_df['Date'].to_list()) #rolls each date forward to the next trading day in the price data; dates with no match within 4 tries are replaced with NaT
period_df['Date'] = fixed_dates # sets the column 'Date' in period_df to be the fixed dates
period_df = period_df.dropna() #drops the NaT values from the dataframe that were generated by the function above

In [21]:
period_df['Day'] = price_data.loc[period_df['Date'],'Day'].to_list() #finds the Day count corresponding to each datetime in the dataframe based on the price array datetimes and day counts

In [22]:
period_df = period_df.loc[period_df['Day'] <= price_data['Day'].max()-2].copy() # drops the events that are too close to the max Day in the price data, this allows the classifier to look 2 days ahead

days = period_df['Day'].to_list()
days_adj = [a+2 for a in days]

def find(alist,blist,df):
    arange = [a for a in range(len(alist))] # a range of the list
    res = [ df.loc[df['Day'] == blist[a],alist[a]].item() for a in arange ] # finds the price data associated with the day and ticker input. 
    return pd.Series(res)

# data kept as pd series to make calc easier 
tickers = period_df['Ticker'].to_list() # tickers with duplicates 
day_0 = find(tickers,days,price_data) # pulls the price data for day 0
day_2 = find(tickers,days_adj,price_data) # pulls the price data for day +2
tot_return_2days = round(day_2/day_0 -1,3).to_list() # calculates the tot return 

def sort(alist): #classifies each 2 day post total return as a positive, negative, or na event using a +/-10% threshold
    res = []
    for a in alist:
        if a >= 0.10: #if positive event
            res.append('pos')
        elif a <= -0.10: #if negative event
            res.append('neg')
        else: #otherwise na event (NaN returns also land here)
            res.append('na')
    return res
            
sorted_class = sort(tot_return_2days)
period_df['Class'] = sorted_class #adds the classification of each event to the dataframe for further analysis
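
As a worked example of the +/-10% rule applied by `sort`, the sketch below classifies a single event from two prices. `classify_return` and its `threshold` parameter are hypothetical names introduced here for illustration, and the prices are made up.

```python
def classify_return(day_0_price, day_2_price, threshold=0.10):
    """Label an event from its 2-day total return using a +/-threshold rule."""
    total_return = round(day_2_price / day_0_price - 1, 3)
    if total_return >= threshold:
        return 'pos'
    if total_return <= -threshold:
        return 'neg'
    return 'na'

# Made-up prices, for illustration only
print(classify_return(10.00, 11.50))  # pos  (return = +0.15)
print(classify_return(10.00, 8.50))   # neg  (return = -0.15)
print(classify_return(10.00, 10.40))  # na   (return = +0.04)
```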

In [24]:
period_df.tail()

Unnamed: 0,Ticker,Drug,Indication,Date,Status,Day,Class
2599,ALNY,Lumasiran,Primary Hyperoxaluria Type 1 (PH1),2020-11-24,Approved,59,na
2600,UROV,Vibegron,Irritable bowel syndrome (IBS),2020-11-24,Phase2a,59,na
2601,LQDA,LIQ861,Pulmonary arterial hypertension,2020-11-25,CRL,60,na
2602,RVNC,DAXI (RT002),Moderate to severe glabellar (frown) lines,2020-11-25,PDUFA,60,na
2603,YMAB,Naxitamab,Neuroblastoma,2020-11-25,Approved,60,pos


In [25]:
period_df['Class'].value_counts() # this shows the occurrence of each event outcome in this small dataset.

na     95
neg    23
pos    11
Name: Class, dtype: int64

In [30]:
""" Simple code that classifies events as either 1 (volatile) or 0 (not volatile) based on event outcome sorting """
volatility = []
for a in sorted_class:
    if a == 'pos' or a == 'neg':
        volatility.append(1)
    else:
        volatility.append(0)

In [32]:
period_df['Volatile'] = volatility

In [33]:
period_df.head()

Unnamed: 0,Ticker,Drug,Indication,Date,Status,Day,Class,Volatile
2475,MGTX,AAV-RPGR,X-Linked Retinitis Pigmentosa,2020-10-05,Phase1/2,23,na,0
2476,YMAB,Omburtamab,CNS/Leptomeningeal Metastases from Neuroblastoma,2020-10-05,BLAFiling,23,na,0
2477,MGEN,Cobomarsen - SOLAR,Cutaneous T-Cell Lymphoma,2020-10-05,Phase2,23,neg,1
2478,AMGN,AMG 510,Non-small cell lung cancer (NSCLC),2020-10-05,Phase2,23,na,0
2479,KNSA,Mavrilimumab,Giant cell arteritis (GCA),2020-10-06,Phase2,24,na,0


In [35]:
period_df['Volatile'].value_counts() # value counts for vol and non-vol events

0    95
1    34
Name: Volatile, dtype: int64

In [38]:
volDf = period_df.loc[(period_df['Class'] == 'pos') | (period_df['Class'] == 'neg')] # creation of dataframe with only vol events

In [40]:
volDf.tail() # just to make sure it worked

Unnamed: 0,Ticker,Drug,Indication,Date,Status,Day,Class,Volatile
2573,BYSI,Plinabulin + TAC (Trial 106) - Protective-2,Chemotherapy-induced neutropenia (CIN),2020-11-16,Phase3,53,neg,1
2581,LXRX,Sotagliflozin,Heart failure,2020-11-17,Phase3,54,neg,1
2583,BNTX,BNT162b2,COVID-19 vaccine,2020-11-18,Phase3,55,pos,1
2596,BLPH,INOpulse Inhaled Nitric Oxide,COVID-19,2020-11-23,Phase3,58,neg,1
2603,YMAB,Naxitamab,Neuroblastoma,2020-11-25,Approved,60,pos,1


In [45]:
naDf = period_df.loc[period_df['Class'] == 'na'] # df with only non-vol events

In [47]:
naDf.tail() # inspecting the non-vol events whose ticker occurrence will be measured

Unnamed: 0,Ticker,Drug,Indication,Date,Status,Day,Class,Volatile
2598,VTRS,Dolutegravir,Pediatric Formulation of Dolutegravir (DTG),2020-11-23,Approved,58,na,0
2599,ALNY,Lumasiran,Primary Hyperoxaluria Type 1 (PH1),2020-11-24,Approved,59,na,0
2600,UROV,Vibegron,Irritable bowel syndrome (IBS),2020-11-24,Phase2a,59,na,0
2601,LQDA,LIQ861,Pulmonary arterial hypertension,2020-11-25,CRL,60,na,0
2602,RVNC,DAXI (RT002),Moderate to severe glabellar (frown) lines,2020-11-25,PDUFA,60,na,0


## Generation of Twitter Data

In [59]:
import tweepy
from tweepy import Cursor

""" These are the Twitter API keys """ # the values have been removed because they are registered to me personally
api_key = ''
api_secret_key = ''

access_token = ''
access_token_secret = ''

def connect_to_twitter_OAuth(): # function to auth twitter API
    auth = tweepy.OAuthHandler(api_key, api_secret_key)
    auth.set_access_token(access_token, access_token_secret)

    api = tweepy.API(auth)
    return api

api = connect_to_twitter_OAuth() # connect to twitter API

In [90]:
# generate list of usernames that my account is currently following
usernames = []
for user in tweepy.Cursor(api.friends, screen_name='biotechtwit1').items():
    usernames.append(user.screen_name) # friend is the same as following, this code allows you to generate a list of usernames 

In [91]:
usernames[15:20] # inspecting the usernames, making sure it worked

['HoganMullally', '10kdiver', 'AdamB1438', 'SuperMugatu', 'NIH']

In [92]:
""" Goal is to pull usernames and cashtags """

if len(usernames) > 0:
    cashtags = []
    names = []
    tweet_count = 0
    # Pulls the cashtags and associated username for the last 2 days.
    # Goal is to measure who tweets the most cashtags; only those users will be scanned further back, to be more efficient with API calls.
    end_date = datetime(2020,12,2) - timedelta(days=2) # in the last 2 days
    for target in usernames:
        for status in Cursor(api.user_timeline, id=target).items():
            tweet_count += 1
            symbols = getattr(status, "entities", {}).get("symbols", [])
            for ent in symbols:
                if ent is not None and ent.get("text") is not None:
                    cashtags.append(ent["text"]) # append cashtag to the list
                    names.append(target) # append name of person to the list

            if status.created_at < end_date: # stop once the timeline passes the cutoff (checked after collection, so the first older tweet is still counted)
                break

In [102]:
len(cashtags) # this works

1775

In [103]:
len(names)

1775

In [121]:
pairs = list(zip(names,cashtags))

# merges duplicate values and appends list of cashtags for each username (using dict)
res = {}
for name,tag in pairs:
    alist = res.get(name,[]) + [tag] # adds the tag to the list of the associated username.
    res[name] = alist # creates a list as the value for each key

In [97]:
keys = [a for a in res.keys()] # list of the keys from the dict 

In [132]:
count = [] # simple code that counts the len of each list in the value of each key in the dict 
for a in keys:
    count.append(len(res[a]))

In [133]:
len(count) == len(keys) # just a check for errors, these must be the same len

True

In [136]:
userCashtags = pd.DataFrame({'name':keys,'count':count}) # df that shows the number of cashtags associated with each username
print(userCashtags.shape)
userCashtags.head() # inspection

(91, 2)


Unnamed: 0,name,count
0,RFortunae,7
1,RMPerry88,4
2,JNVcapital,10
3,HindenburgRes,16
4,JamesEKrause,14


In [147]:
active = userCashtags.nlargest(20,'count')['name'].to_list() # this is a list of users that use cashtags often enough to be worthwhile. (top 20)

In [148]:
print(round(len(active)/len(usernames),3)) # this is the fraction of usernames that made it into the active group out of the total

0.152
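
The zip, dict, and count steps above amount to counting how many times each username appears in `names`. A more compact way to get a similar ranking could use `collections.Counter`. This is only a sketch: `top_cashtag_users` is a hypothetical helper, the sample usernames are made up, and ties may be ordered differently than with `nlargest`.

```python
from collections import Counter

def top_cashtag_users(names, n=20):
    """Return the n usernames that appear most often in names.

    names holds one entry per collected cashtag, so a user's count equals
    the number of cashtags found on their timeline.
    """
    return [user for user, _ in Counter(names).most_common(n)]

# Made-up usernames, for illustration only
sample_names = ['alice', 'bob', 'alice', 'carol', 'alice', 'bob']
print(top_cashtag_users(sample_names, n=2))  # ['alice', 'bob']
```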


In [149]:
""" Pulling Dates and Cashtags from Active group of twitter users """
# this code pulls datetime of tweet and the associated cashtag for each tweet in the active users dataset.
if len(active) > 0:
    cashtags = []
    dates = []
    tweet_count = 0
    # Pulls the cashtags and associated tweet datetime for the last 30 days, for the active users only
    end_date = datetime(2020,12,2) - timedelta(days=30) # in the last 30 days
    for target in active:
        for status in Cursor(api.user_timeline, id=target).items():
            tweet_count += 1
            symbols = getattr(status, "entities", {}).get("symbols", [])
            for ent in symbols:
                if ent is not None and ent.get("text") is not None:
                    cashtags.append(ent["text"]) # append cashtag to the list
                    dates.append(status.created_at) # append datetime of the tweet to the list

            if status.created_at < end_date: # stop once the timeline passes the cutoff (checked after collection, so the first older tweet is still counted)
                break

In [150]:
len(cashtags) # this is the number of cashtags 

8878

In [153]:
len(cashtags) == len(dates)

True

In [163]:
datesSer = pd.Series(dates) # convert to pd ser
datesYr = datesSer.dt.date.to_list() # pulls just the date, no time, from the datetime ex: 2020-2-1 11:58:12 -> 2020-2-1

In [165]:
tuples = list(zip(datesYr,cashtags)) # creates list of (date, cashtag) tuples

# creation of dict keyed by date; tweets from the same date are merged under one key
result = {}
for day,tag in tuples:
    theList = result.get(day,[]) + [tag] # adds the cashtag to the list already stored for that date (repeated cashtags are kept so they can be counted)
    result[day] = theList

In [222]:
def sumOccur(df,someDict):
    tickers = df['Ticker'].to_list()
    eventDates = df['Date'].to_list()
    output = []
    for i in range(len(tickers)):
        analyte = tickers[i]
        occurence = []
        dates = []
        for a in someDict: # each key is a date
            count = 0
            for x in someDict[a]:
                if x == analyte:
                    count += 1
                    occurence.append(count) # appends the running count for that date, so a date with k mentions adds 1+2+...+k to the sum
                    dates.append(a)

        # dataframe with dates in Dates col.
        res = pd.DataFrame({'Dates':dates,'Occurence':occurence})
        res['Dates'] = pd.to_datetime(res['Dates'])
        res = res.loc[res['Dates']<eventDates[i]] # keeps only mentions from before the catalyst date
        theSum = res['Occurence'].sum()
        output.append(theSum)

    return output
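
Note that `sumOccur` appends the running count each time a ticker matches, so a date with k mentions contributes 1 + 2 + ... + k to the sum rather than k. If a plain count of mentions before the event is wanted instead, it could be sketched as below. `mentions_before` is a hypothetical helper, not part of the notebook, and the sample data is made up; for this sample the running-count weighting would give 4 where the plain count gives 3.

```python
from collections import Counter
from datetime import date

def mentions_before(ticker, event_date, tags_by_date):
    """Count how often ticker appears on dates strictly before event_date.

    tags_by_date maps a datetime.date to the list of cashtags seen that day,
    the same shape as the date-keyed dict built from the tweets.
    """
    total = 0
    for day, tags in tags_by_date.items():
        if day < event_date:
            total += Counter(tags)[ticker]
    return total

# Made-up data, for illustration only
sample = {
    date(2020, 11, 20): ['YMAB', 'YMAB', 'AMGN'],
    date(2020, 11, 23): ['YMAB'],
    date(2020, 11, 25): ['YMAB', 'YMAB'],  # event day itself, excluded
}
print(mentions_before('YMAB', date(2020, 11, 25), sample))  # 3
```

The event dates in `period_df` are pandas Timestamps, so they would need `.date()` before being compared with the dict keys in this form.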

In [223]:
sumOccurence = sumOccur(period_df,result)

In [217]:
period_df['sumOccur'] = sumOccurence

In [228]:
period_df = period_df.drop(columns=['meanOccur'], errors='ignore') # I initially tried using the mean occurrence; the sum works better because you can then take the mean across the dataset. The Twitter code takes a long time to run and I didn't want to restart the kernel, so I dropped the leftover column manually (the code that added it has been removed). errors='ignore' keeps this line from failing when the column is absent.

In [229]:
period_df # this just shows the new column

Unnamed: 0,Ticker,Drug,Indication,Date,Status,Day,Class,Volatile,sumOccur
2475,MGTX,AAV-RPGR,X-Linked Retinitis Pigmentosa,2020-10-05,Phase1/2,23,na,0,0.0
2476,YMAB,Omburtamab,CNS/Leptomeningeal Metastases from Neuroblastoma,2020-10-05,BLAFiling,23,na,0,0.0
2477,MGEN,Cobomarsen - SOLAR,Cutaneous T-Cell Lymphoma,2020-10-05,Phase2,23,neg,1,0.0
2478,AMGN,AMG 510,Non-small cell lung cancer (NSCLC),2020-10-05,Phase2,23,na,0,0.0
2479,KNSA,Mavrilimumab,Giant cell arteritis (GCA),2020-10-06,Phase2,24,na,0,0.0
...,...,...,...,...,...,...,...,...,...
2599,ALNY,Lumasiran,Primary Hyperoxaluria Type 1 (PH1),2020-11-24,Approved,59,na,0,22.0
2600,UROV,Vibegron,Irritable bowel syndrome (IBS),2020-11-24,Phase2a,59,na,0,91.0
2601,LQDA,LIQ861,Pulmonary arterial hypertension,2020-11-25,CRL,60,na,0,31.0
2602,RVNC,DAXI (RT002),Moderate to severe glabellar (frown) lines,2020-11-25,PDUFA,60,na,0,4.0


In [230]:
volMean = period_df.loc[period_df['Volatile'] == 1]['sumOccur'].mean().round(decimals=3) # this is the mean of the summed occurrence in the vol dataset
print('Mean of the Sum occurence of cashtags in the volatile catalyst dataset: ',volMean)

Mean of the Sum occurence of cashtags in the volatile catalyst dataset:  34.706


In [231]:
naMean = period_df.loc[period_df['Volatile'] == 0]['sumOccur'].mean().round(decimals=3) # this is the mean of the summed occurrence in the not vol dataset
print('Mean of the Sum occurence of cashtags in the non-volatile catalyst dataset: ',naMean)

Mean of the Sum occurence of cashtags in the non-volatile catalyst dataset:  28.116
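
Whether a gap like the one between these two means is larger than chance could be checked with a permutation test. The sketch below uses only the standard library; `permutation_p_value` is a hypothetical helper, and the sample counts are made up and are not the notebook's data.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign the group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations

# Made-up counts, for illustration only (not the notebook's data)
volatile = [0, 5, 12, 40, 18]
non_volatile = [0, 0, 3, 10, 45, 6]
p = permutation_p_value(volatile, non_volatile, n_permutations=2000)
print(0.0 <= p <= 1.0)  # True
```

In the notebook, the two groups could be passed as `period_df.loc[period_df['Volatile'] == 1, 'sumOccur'].to_list()` and the same expression with `== 0`.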


In [152]:
tweet_count # that's a lot of tweets; this is the number of tweets examined for cashtags, not all of them had any.

13399