Recent [research](https://www.scs.cmu.edu/news/nearly-half-twitter-accounts-discussing-reopening-america-may-be-bots) by Carnegie Mellon University suggests that bots may be responsible for up to 50% of tweets on particular topics. Could bots also be trying to push certain narratives about particular stocks, and is it possible to find these bots? In this notebook I attempt to answer this question by finding bots based on a set of common-sense features and criteria. Then I compare bot tweets with the average in terms of how their sentiment correlates with stock price. For most of the stocks considered, suspected bot tweets show a noticeably higher correlation with stock price than the average sentiment across all writers in the dataset.


<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
    
<center><h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background-color:#1E90FF; border:0; color:#FFF5EE' role="tab" aria-controls="home">Content</h2></center>

1. [Data Exploration](#1)
2. [Average Sentiment and Stock Price](#2)
3. [Feature Engineering](#3)
4. ["If it tweets like a bot, it is a bot"](#4)
5. [How does Bot sentiment correlate with Stock Price?](#5)
    
    
If you are interested in playing with time series, check out my [dataset on electricity prices and demand](https://www.kaggle.com/aramacus/electricity-demand-in-victoria-australia) in Victoria (Australian state). And please upvote it if you like it.

In [None]:
pip install yfinance

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

import yfinance as yf

from nltk.sentiment.vader import SentimentIntensityAnalyzer

import re
import os
import warnings
warnings.filterwarnings("ignore")

In [None]:
pd.set_option('display.max_columns', 30)
sns.set_context("paper", font_scale=2)

PATH = "/kaggle/input/tweets-about-the-top-companies-from-2015-to-2020/"

<a id="1"></a><center><h2 style='background-color:#1E90FF; border:0; color:#FFF5EE'>Data Exploration</h2></center>

In [None]:
company = pd.read_csv(os.path.join(PATH, "Company.csv"))
company = company.set_index("ticker_symbol").to_dict()["company_name"]
company

In [None]:
tweet = pd.read_csv(os.path.join(PATH, "Tweet.csv"))
tweet.head()

It would be convenient to convert "post_date" to the datetime format right away.

In [None]:
tweet['datetime'] = pd.to_datetime(tweet['post_date'], unit='s')
tweet = tweet.drop(['post_date'], axis=1, inplace=False)

In [None]:
company_tweet = pd.read_csv(os.path.join(PATH, "Company_Tweet.csv"))
company_tweet.head()

Some of the tweets mention more than one company of interest:

In [None]:
company_tweet.loc[company_tweet['ticker_symbol'] == 'GOOGL', 'ticker_symbol'] = 'GOOG'
print("Total records: {} | Unique tweet indexes: {}".format(len(company_tweet), company_tweet['tweet_id'].nunique()))

In [None]:
tweet.info()

Let's count the missing values in the tweet table.

In [None]:
tweet.isnull().sum()

Less than 2% of tweet records have no poster credentials. Drop the missing values and count the unique writers.

In [None]:
tweet = tweet.dropna()
print(f"Number of writers: {tweet['writer'].nunique()}")

Let's calculate the share of duplicates among all the tweets.

In [None]:
print("Percent of duplicated tweets: {:.2f} %".format(sum(tweet['body'].duplicated())/len(tweet) * 100))

To get a better sense of writer activity, let's build a histogram of the number of tweets per writer.

In [None]:
stats = tweet[['writer', 'tweet_id']].groupby('writer').agg("count").rename(columns={'tweet_id' : 'tweet_count'})

In [None]:
sns.set_context("paper", font_scale=2)

plt.figure(figsize=(12, 8))

sns.histplot(data=stats, x='tweet_count', bins=50, log_scale=True)
plt.yscale('log')
plt.title("Posts count histogram")
plt.xlabel("Tweets per writer")
plt.ylabel("Number of writers")
None

<a id="2"></a><center><h2 style='background-color:#1E90FF; border:0; color:#FFF5EE'>Average Sentiment and Stock Price</h2></center>

First, let's use the NLTK Sentiment Intensity Analyzer to evaluate the sentiment of each tweet. Even though the analyzer does not complain when fed raw text, let's help it a bit with a mild cleanup of the tweet text.

In [None]:
sentiment_nltk = SentimentIntensityAnalyzer()

# Mild cleaning: remove weblinks, then @mentions, #hashtags and $ticker symbols (the whole token), then collapse extra spaces
tweet['prep_body'] = tweet['body'].replace(r"https?:\S+|http?:\S+|www?:\S+", '', regex=True).replace(r"[@#\$][a-zA-Z]+", '', regex=True).replace(r"\s\s+", ' ', regex=True).str.strip()

tweet['positive_sentiment'] = tweet['prep_body'].apply(lambda x: sentiment_nltk.polarity_scores(x)['pos'])
tweet['negative_sentiment'] = tweet['prep_body'].apply(lambda x: sentiment_nltk.polarity_scores(x)['neg'])
tweet['total_sentiment'] = tweet['prep_body'].apply(lambda x: sentiment_nltk.polarity_scores(x)['compound'])

tweet.head()
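
The cell above calls `polarity_scores` three times per tweet. As a minimal sketch, the same three columns can be filled from a single scoring pass. The helper `add_sentiment_columns` and the stand-in `fake_scorer` are hypothetical names introduced here for illustration; in the notebook, `sentiment_nltk.polarity_scores` would be passed as `score_fn`.

```python
import pandas as pd

def add_sentiment_columns(df, text_col, score_fn):
    """Score each text once and attach pos/neg/compound as new columns."""
    # score_fn returns a dict per text; build one frame from all dicts
    scores = pd.DataFrame(df[text_col].map(score_fn).tolist(), index=df.index)
    out = df.copy()
    out["positive_sentiment"] = scores["pos"]
    out["negative_sentiment"] = scores["neg"]
    out["total_sentiment"] = scores["compound"]
    return out

# Hypothetical stand-in scorer with the same output keys as VADER,
# used only so the sketch runs without the VADER lexicon.
def fake_scorer(text):
    pos = float("great" in text)
    neg = float("bad" in text)
    return {"pos": pos, "neg": neg, "neu": 1.0 - max(pos, neg), "compound": pos - neg}

demo = pd.DataFrame({"prep_body": ["great quarter", "bad guidance", "earnings today"]})
scored = add_sentiment_columns(demo, "prep_body", fake_scorer)
```

With several million tweets, scoring once per tweet instead of three times should cut the runtime of this step by roughly two thirds.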

It is interesting to look again at duplicates among the prepared tweets, after hashtags and weblinks were dropped.

In [None]:
print("Percent of cleaned duplicated tweets: {:.2f} %".format(sum(tweet['prep_body'].duplicated())/len(tweet) * 100))

After pre-processing, the proportion of duplicated tweets increased from 10.35% to 25.25%. About 15% of tweets therefore differ only in links, mentions, hashtags or ticker symbols, which suggests templated text and could indicate bots.<br>
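
To illustrate why the duplicate share grows after cleanup, here is a small sketch on made-up tweets (the texts are invented for this example). It reuses the cleanup patterns from the cell above: two tweets that differ only in ticker and link become identical once those tokens are removed.

```python
import pandas as pd

# Made-up tweets: the first two differ only in ticker symbol and link
toy = pd.Series([
    "Buy now $AAPL http://t.co/abc",
    "Buy now $TSLA http://t.co/xyz",
    "Earnings call was solid #MSFT",
])

# Same cleanup patterns as in the notebook cell above
cleaned = (
    toy.replace(r"https?:\S+|http?:\S+|www?:\S+", "", regex=True)
       .replace(r"[@#\$][a-zA-Z]+", "", regex=True)
       .replace(r"\s\s+", " ", regex=True)
       .str.strip()
)

# duplicated() marks every repeat after the first occurrence
raw_dup_share = toy.duplicated().mean()
clean_dup_share = cleaned.duplicated().mean()
```

Here no raw tweet is a duplicate, while one of the three cleaned tweets is.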

Next, let's get the stock prices using the Yahoo Finance API. For the sake of simplicity, consider only the closing price.

In [None]:
tweet['date'] = tweet['datetime'].dt.date

prices = yf.download(tickers=" ".join([st for st in company.keys() if st != "GOOGL"]),
    start=tweet['date'].min().strftime('%Y-%m-%d'),
    end=tweet['date'].max().strftime('%Y-%m-%d'),
    interval='1d'
).reset_index()

prices = prices.drop(["Adj Close", "Volume", "Open", "High", "Low"], axis=1)

prices.head()

Aggregate tweet sentiment by date and match it with prices by date.

In [None]:
stats = tweet[['date', 'positive_sentiment', 'negative_sentiment', 'total_sentiment']].groupby('date').mean()

prices['date'] = prices['Date'].dt.date
prices = prices.drop(['Date'], axis=1)
stats = prices.join(stats, how='inner', on='date')

Calculate the sentiment-price correlation coefficients for closing prices.

In [None]:
price_cols = [('Close', ticker) for ticker in company.keys() if ticker != 'GOOGL']
sentim_cols = ['positive_sentiment', 'negative_sentiment', 'total_sentiment']
stats[price_cols + sentim_cols].corr().loc[sentim_cols, price_cols].style.background_gradient(cmap='coolwarm')
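
As a minimal sketch of what the correlation table measures, the example below averages sentiment per day, aligns it with a closing price by date, and computes the Pearson coefficient. All numbers are made up for illustration, and the single flat `Close` column is a simplification of the multi-ticker price table used in the notebook.

```python
import pandas as pd

# Made-up tweet-level sentiment scores
tweets = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-03"],
    "total_sentiment": [0.2, 0.4, 0.1, 0.6, 0.8],
})
# Made-up closing prices for one ticker
closes = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-02", "2020-01-03"],
    "Close": [100.0, 98.0, 105.0],
})

# One mean sentiment value per day, then an inner join on the trading days
daily = tweets.groupby("date", as_index=False)["total_sentiment"].mean()
merged = closes.merge(daily, on="date", how="inner")

# Pearson correlation between the two daily series
corr = merged["Close"].corr(merged["total_sentiment"])
```

The inner join drops days without trading, such as weekends, so tweets posted on those days do not enter the coefficient.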

To further illustrate this correlation, the plots below show the normalized stock price and the overall positive sentiment as functions of date.

In [None]:
colors = {"AMZN" : "tab:red", 
          "GOOG" : "tab:blue", 
          "AAPL" : "tab:orange", 
          "MSFT" : "tab:purple", 
          "TSLA" : "tab:green"}

plt.figure(figsize=(15, 5))

t = "AMZN"

sns.lineplot(x=stats["date"], y=stats[("Close", t)]/max(stats[("Close", t)]), color=colors[t], label=company[t])
sns.lineplot(x=stats["date"], y=stats["positive_sentiment"]/max(stats["positive_sentiment"]), color="tab:cyan", label="Positive Sentiment")
plt.title(company[t])
plt.xlabel("Date")
plt.ylabel("Arbitrary units")
None

In [None]:
plt.figure(figsize=(15, 5))

t = "GOOG"

sns.lineplot(x=stats["date"], y=stats[("Close", t)]/max(stats[("Close", t)]), color=colors[t], label=company[t])
sns.lineplot(x=stats["date"], y=stats["positive_sentiment"]/max(stats["positive_sentiment"]), color="tab:cyan", label="Positive Sentiment")
plt.title(company[t])
plt.xlabel("Date")
plt.ylabel("Arbitrary units")
None

In [None]:
plt.figure(figsize=(15, 5))

t = "AAPL"

sns.lineplot(x=stats["date"], y=stats[("Close", t)]/max(stats[("Close", t)]), color=colors[t], label=company[t])
sns.lineplot(x=stats["date"], y=stats["positive_sentiment"]/max(stats["positive_sentiment"]), color="tab:cyan", label="Positive Sentiment")
plt.title(company[t])
plt.xlabel("Date")
plt.ylabel("Arbitrary units")
None

In [None]:
plt.figure(figsize=(15, 5))

t = "MSFT"

sns.lineplot(x=stats["date"], y=stats[("Close", t)]/max(stats[("Close", t)]), color=colors[t], label=company[t])
sns.lineplot(x=stats["date"], y=stats["positive_sentiment"]/max(stats["positive_sentiment"]), color="tab:cyan", label="Positive Sentiment")
plt.title(company[t])
plt.xlabel("Date")
plt.ylabel("Arbitrary units")
None

In [None]:
plt.figure(figsize=(15, 5))

t = "TSLA"

sns.lineplot(x=stats["date"], y=stats[("Close", t)]/max(stats[("Close", t)]), color=colors[t], label=company[t])
sns.lineplot(x=stats["date"], y=stats["positive_sentiment"]/max(stats["positive_sentiment"]), color="tab:cyan", label="Positive Sentiment")
plt.title(company[t])
plt.xlabel("Date")
plt.ylabel("Arbitrary units")
None

## Summary

There seems to be a fair bit of correlation between stock prices and sentiment. For total sentiment, which combines positive and negative scores, the correlation is between 44% and 54%.
<p>For all stocks considered, mean daily positive sentiment correlated more strongly with prices than mean daily total sentiment, with values between 50% and 55%. Negative sentiment generally correlated less with prices, most notably for Tesla, where the correlation was much weaker.

<a id="3"></a><center><h2 style='background-color:#1E90FF; border:0; color:#FFF5EE'>Feature Engineering</h2></center>

In the case of Twitter, the idea of influencing stock prices via sentiment manipulation can be implemented with bot networks. Below I try to identify at least some of the accounts that could be bots.<br>
To keep track of writers' tweeting patterns, let's introduce a "posters" table, which will be filled in as this exploration proceeds. The first feature that may tell us something about a writer is their peak hourly tweet rate.

In [None]:
tweet['hour'] = tweet['datetime'].dt.hour
data = tweet[['writer', 'hour', 'date', 'tweet_id']].groupby(['writer', 'hour', 'date']).count().reset_index().rename(columns={'tweet_id' : 'tweet_rate'})

tweet = tweet.drop(['hour'], axis=1)

indmax = data.groupby('writer').agg({'tweet_rate' : 'idxmax'})
posters = data.iloc[indmax.tweet_rate].sort_values(by='tweet_rate').set_index('writer')
posters = posters.drop(['hour', 'date'], axis=1).rename(columns={'tweet_rate' : 'max_tweet_rate'})
posters
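
The peak hourly tweet rate can be sketched on a tiny made-up table: count tweets per writer, date and hour, then take the largest count for each writer. The data and the use of `size()` are illustrative choices, not the notebook's exact code.

```python
import pandas as pd

# Made-up timestamps: writer "a" posts three times within one hour
toy = pd.DataFrame({
    "writer": ["a", "a", "a", "a", "b", "b"],
    "datetime": pd.to_datetime([
        "2020-01-01 10:05", "2020-01-01 10:20", "2020-01-01 10:45",
        "2020-01-02 10:10", "2020-01-01 09:00", "2020-01-01 15:30",
    ]),
})

# Tweets per writer within each (date, hour) bucket
hourly = (
    toy.assign(date=toy["datetime"].dt.date, hour=toy["datetime"].dt.hour)
       .groupby(["writer", "date", "hour"])
       .size()
       .rename("tweet_rate")
)

# Busiest single hour for each writer
max_tweet_rate = hourly.groupby(level="writer").max()
```

Buckets follow clock hours, so a burst that straddles the top of the hour is split across two buckets and can understate the true peak.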

Next, let's get the mean tweet rate of each poster for each hour of the day, from 0 to 23.

In [None]:
hours = data[['writer', 'hour', 'tweet_rate']].groupby(['writer', 'hour']).mean().sort_values(by='tweet_rate')
hours = hours.reset_index().pivot(index='writer', columns='hour', values='tweet_rate').fillna(0)
hours.columns.name = None
posters = posters.join(hours, how='outer')
posters.sort_values(by='max_tweet_rate').head()

### Plot sample writers' average hourly tweet rate (while active)

In [None]:
sns.set_context("paper", font_scale=1)

columns = list(range(24))

nrows = 2
ncols = 3
fig, axs = plt.subplots(nrows, ncols, figsize=(17, 12))

sample_writers = ['PeteStock11', 'JimAndrews518', 'computer_hware', 'larryne', 'MarleyJayBiz', 'politicalHEDGE']

for i, writer in enumerate(sample_writers):
    c = i // 2
    r = i - nrows * c
    
    posters.loc[writer, columns].plot(kind='bar', ax=axs[r, c])
    axs[r, c].set_title(writer)
    axs[r, c].set_ylabel("average tweet rate")
    axs[r, c].set_xlabel("hour")
    
plt.show()
None

### Average time between subsequent tweets
Too short a time between successive tweets can indicate machine authorship. To account for long absences, such as vacations, only the shortest 75% of time intervals (in seconds) are used. For writers with only one tweet, "mean_diff_sec" is set to the maximum observed value.

In [None]:
def in_qrange(ser, q):
    return ser.between(*ser.quantile(q=q))

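# Gap between each tweet and the same writer's previous tweet.
# Note: .dt.seconds is only the seconds component of the gap (whole days are dropped);
# .dt.total_seconds() would return the full gap instead.
tweet['timediff'] = tweet.sort_values('datetime', ascending=False).groupby(['writer']).datetime.diff(-1).dt.seconds.fillna(np.inf)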

In [None]:
data = tweet.loc[tweet['timediff'].transform(in_qrange, q=[0, 0.75]), ['writer', 'timediff']].groupby('writer').agg(['mean']).rename(columns={'mean' : 'mean_diff_sec'})
data.columns = data.columns.droplevel()

tweet = tweet.drop(['timediff'], axis=1)

posters = posters.join(data, on='writer', how='left').fillna(max(data['mean_diff_sec']))
posters.loc[posters['mean_diff_sec'] == 0, 'mean_diff_sec'] = max(data['mean_diff_sec'])
posters.sort_values(by='mean_diff_sec').head()
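
A minimal sketch of the gap feature on made-up timestamps follows. It departs from the notebook in two labeled ways: it uses `.dt.total_seconds()` so that multi-day gaps keep their full length, and it simply leaves out writers with a single tweet instead of assigning them the maximum value.

```python
import pandas as pd

# Made-up data: writer "a" posts in a quick burst, then returns four days later
toy = pd.DataFrame({
    "writer": ["a", "a", "a", "a", "b"],
    "datetime": pd.to_datetime([
        "2020-01-01 10:00:00", "2020-01-01 10:00:04", "2020-01-01 10:00:10",
        "2020-01-05 10:00:10", "2020-01-01 12:00:00",
    ]),
}).sort_values(["writer", "datetime"])

# Gap to the previous tweet by the same writer, in full seconds
toy["gap_sec"] = toy.groupby("writer")["datetime"].diff().dt.total_seconds()

# Keep only the shortest 75% of all gaps, then average per writer
gaps = toy.dropna(subset=["gap_sec"])
cutoff = gaps["gap_sec"].quantile(0.75)
mean_gap = gaps[gaps["gap_sec"] <= cutoff].groupby("writer")["gap_sec"].mean()
```

The four-day pause falls above the 75% cutoff and is ignored, so writer "a" is characterized by the burst alone.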

### Fraction of non-original tweets

In [None]:
data = tweet.loc[tweet['prep_body'].duplicated(), ['writer', 'tweet_id']].groupby('writer').count().rename(columns={'tweet_id' : 'duplicate_posts'})

posters = posters.join(tweet[['writer', 'tweet_id']].groupby('writer').count().rename(columns={'tweet_id' : 'total_posts'}), how='left')

In [None]:
posters = posters.join(data, how='left').fillna(0)
posters['duplicate_posts'] = posters['duplicate_posts']/posters['total_posts']
posters.head()
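
The duplicate fraction per writer can be sketched as the mean of a boolean "is duplicate" column within each writer's tweets. The tweets below are made up for illustration.

```python
import pandas as pd

# Made-up cleaned tweets; "Buy now" is posted by three different writers
toy = pd.DataFrame({
    "writer": ["a", "b", "b", "c", "c"],
    "prep_body": ["Buy now", "Buy now", "Buy now", "Nice earnings", "Buy now"],
})

# The pandas default keep="first" leaves the first occurrence unflagged
toy["is_dup"] = toy["prep_body"].duplicated()

# Share of each writer's tweets that repeat an earlier text
dup_fraction = toy.groupby("writer")["is_dup"].mean()
```

Because the first occurrence is not flagged, writer "a" scores 0 here despite posting the templated text first. Passing `keep=False` to `duplicated()` would flag every copy, which is a stricter alternative.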

Some conclusions about whether a poster is a bot can be drawn from the collected features. For example:

* abnormal hourly tweet rate ("max_tweet_rate") 
* too short mean time between successive tweets ("mean_diff_sec")
* lack of hours with no tweets (too few hour columns, "0" to "23", when tweet rate was 0)
* all tweets are among the duplicates ("duplicate_posts" = 1.0)

With crude criteria, such as abnormal endurance ("max_tweet_rate" above 100), extreme typing speed ("mean_diff_sec" below 10) or an inability to sleep (fewer than 3 hour columns with a tweet rate of 0), some bots can be found.

In [None]:
columns = list(range(24))
bot_check = pd.DataFrame(index=posters.index)

bot_check["max_tweet_rate"] = (posters["max_tweet_rate"] > 100).astype(np.int8)
bot_check["mean_diff_sec"] = (posters["mean_diff_sec"] < 10).astype(np.int8)
bot_check["abscence_hours"] = ((posters[columns] == 0).astype(int).sum(axis=1) < 3).astype(np.int8)
bot_check["all_duplicates"] = (posters["duplicate_posts"] == 1).astype(np.int8)

bot_check.head()
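
The flagging logic can be sketched on three made-up writers. The thresholds mirror the cell above; the `zero_hours` column is a hypothetical shortcut for the count of hour columns equal to 0.

```python
import pandas as pd

# Made-up feature values for three writers
toy = pd.DataFrame(
    {
        "max_tweet_rate": [150, 4, 120],
        "mean_diff_sec": [3.0, 900.0, 45.0],
        "zero_hours": [1, 16, 10],  # hypothetical: hours of the day with no tweets
        "duplicate_posts": [1.0, 0.1, 0.6],
    },
    index=["w1", "w2", "w3"],
)

# One 0/1 flag per criterion
flags = pd.DataFrame({
    "max_tweet_rate": toy["max_tweet_rate"] > 100,
    "mean_diff_sec": toy["mean_diff_sec"] < 10,
    "abscence_hours": toy["zero_hours"] < 3,
    "all_duplicates": toy["duplicate_posts"] == 1,
}).astype("int8")

# Writers with two or more flags are treated as suspected bots
flag_count = flags.sum(axis=1)
suspects = flag_count[flag_count > 1].index.tolist()
```

Requiring at least two flags keeps a writer like "w3", who trips only the hourly rate criterion, out of the suspect list.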

In [None]:
print("max hourly tweet rate > 100 : {} writers".format(sum(bot_check["max_tweet_rate"])))

In [None]:
print("mean time between tweets < 10 seconds : {} writers".format(sum(bot_check["mean_diff_sec"])))

In [None]:
print("less than 3 hours of not tweeting : {} writers".format(sum(bot_check["abscence_hours"])))

In [None]:
print("not a single original post : {} writers".format(sum(bot_check["all_duplicates"])))

<a id="4"></a><center><h2 style='background-color:#1E90FF; border:0; color:#FFF5EE'>"If it tweets like a bot, it is a bot"</h2></center>

Bootstrapping: find all tweets from writers with at least two flags in "bot_check". Every other writer in "bot_check" who posted one of these tweets gets a "tweet_like_bot" flag. Then re-count the writers with at least two flags.

In [None]:
bot_check[bot_check.sum(axis=1) > 1]

On the first count, there are only 25 writers with two or more flags.

In [None]:
bot_tweets = tweet.loc[tweet['writer'].isin(bot_check[bot_check.sum(axis=1) > 1].index), 'prep_body'].unique()
bot_check['tweet_like_bot'] = bot_check.index.isin(tweet.loc[tweet['prep_body'].isin(bot_tweets), 'writer'].unique()).astype(np.int8)
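
The propagation step can be sketched on made-up data: collect the cleaned texts of the seed bots, then flag every writer who posted any of those texts. The writers, texts and the `seed_bots` set are invented for this example.

```python
import pandas as pd

# Made-up cleaned tweets
toy = pd.DataFrame({
    "writer": ["bot1", "bot1", "u1", "u2", "u3"],
    "prep_body": ["Buy now", "Top pick today", "Buy now", "Nice earnings", "Top pick today"],
})
seed_bots = {"bot1"}  # hypothetical writers that already carry two or more flags

# Every distinct text posted by a seed bot
bot_texts = toy.loc[toy["writer"].isin(seed_bots), "prep_body"].unique()

# Every writer who posted at least one of those texts
echoing = toy.loc[toy["prep_body"].isin(bot_texts), "writer"].unique()

writers = pd.Index(sorted(toy["writer"].unique()))
tweet_like_bot = pd.Series(writers.isin(echoing).astype("int8"), index=writers)
```

A single shared text is enough to earn the flag, and the seed bots always flag themselves. Under the "more than one flag" rule, a writer with no earlier flag still ends up with only one flag after this step.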

In [None]:
print("Percent of bots : {:.2f}%".format(sum(bot_check.sum(axis=1) > 1)/len(posters)*100))

In [None]:
bots = bot_check.loc[bot_check.sum(axis=1) > 1].index
tweet['group'] = 'user'
tweet.loc[tweet.writer.isin(bots), 'group'] = 'bot'

## Summary

There seem to be 29996 bots among the 140131 unique writers in the dataset, or 21.41%. This number was obtained via a two-stage process. First, four features were calculated for each writer:

* abnormal hourly tweet rate ("max_tweet_rate") 
* too short mean time between successive tweets ("mean_diff_sec")
* lack of hours with no tweets (too few hour columns, "0" to "23", when tweet rate was 0)
* all tweets are among the duplicates ("duplicate_posts" = 1.0)

Next, writers with at least two flags were deemed to be bots. At this stage there were only 25 such writers. All their tweets were then collected in "bot_tweets", and an additional feature was added that records whether a writer posted one of the tweets from "bot_tweets". Finally, writers were tallied again over all five features, and those with at least two flags were deemed to be bots.

<a id="5"></a><center><h2 style='background-color:#1E90FF; border:0; color:#FFF5EE'>How does Bot sentiment correlate with Stock Price?</h2></center>

In [None]:
stats = tweet[tweet['group']=="bot"][['date', 'positive_sentiment', 'negative_sentiment', 'total_sentiment']].groupby('date').mean()

stats = prices.join(stats, how='inner', on='date')

In [None]:
price_cols = [('Close', ticker) for ticker in company.keys() if ticker != 'GOOGL']
sentim_cols = ['positive_sentiment', 'negative_sentiment', 'total_sentiment']
stats[price_cols + sentim_cols].corr().loc[sentim_cols, price_cols].style.background_gradient(cmap='coolwarm')
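
The bot versus average comparison can be sketched by running the same daily aggregation once on all tweets and once on the bot subset. The numbers are made up, so the coefficients say nothing about the real dataset, and the helper `sentiment_price_corr` is a hypothetical name.

```python
import pandas as pd

# Made-up tweets, one bot and one user tweet per day
tweets = pd.DataFrame({
    "date": ["d1", "d1", "d2", "d2", "d3", "d3"],
    "group": ["bot", "user", "bot", "user", "bot", "user"],
    "total_sentiment": [0.1, 0.5, 0.2, 0.2, 0.6, 0.4],
})
# Made-up closing prices indexed by the same day labels
closes = pd.Series({"d1": 100.0, "d2": 102.0, "d3": 105.0}, name="Close")

def sentiment_price_corr(df):
    """Pearson correlation of daily mean sentiment with the closing price."""
    daily = df.groupby("date")["total_sentiment"].mean()
    # Series.corr aligns both series on their index labels before computing
    return daily.corr(closes)

corr_all = sentiment_price_corr(tweets)
corr_bot = sentiment_price_corr(tweets[tweets["group"] == "bot"])
```

Comparing `corr_bot` with `corr_all` is the same kind of comparison the two correlation tables in this notebook allow, one ticker at a time.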

In [None]:
sns.set_context("paper", font_scale=2)

plt.figure(figsize=(15, 5))

t = "AMZN"

sns.lineplot(x=stats["date"], y=stats[("Close", t)]/max(stats[("Close", t)]), color=colors[t], label=company[t])
sns.lineplot(x=stats["date"], y=stats["positive_sentiment"]/max(stats["positive_sentiment"]), color="tab:olive", label="Bots, Positive Sentiment")
plt.title(company[t])
plt.xlabel("Date")
plt.ylabel("Arbitrary units")
None

In [None]:
plt.figure(figsize=(15, 5))

t = "GOOG"

sns.lineplot(x=stats["date"], y=stats[("Close", t)]/max(stats[("Close", t)]), color=colors[t], label=company[t])
sns.lineplot(x=stats["date"], y=stats["positive_sentiment"]/max(stats["positive_sentiment"]), color="tab:olive", label="Bots, Positive Sentiment")
plt.title(company[t])
plt.xlabel("Date")
plt.ylabel("Arbitrary units")
None

In [None]:
plt.figure(figsize=(15, 5))

t = "AAPL"

sns.lineplot(x=stats["date"], y=stats[("Close", t)]/max(stats[("Close", t)]), color=colors[t], label=company[t])
sns.lineplot(x=stats["date"], y=stats["positive_sentiment"]/max(stats["positive_sentiment"]), color="tab:olive", label="Bots, Positive Sentiment")
plt.title(company[t])
plt.xlabel("Date")
plt.ylabel("Arbitrary units")
None

In [None]:
plt.figure(figsize=(15, 5))

t = "MSFT"

sns.lineplot(x=stats["date"], y=stats[("Close", t)]/max(stats[("Close", t)]), color=colors[t], label=company[t])
sns.lineplot(x=stats["date"], y=stats["positive_sentiment"]/max(stats["positive_sentiment"]), color="tab:olive", label="Bots, Positive Sentiment")
plt.title(company[t])
plt.xlabel("Date")
plt.ylabel("Arbitrary units")
None

In [None]:
plt.figure(figsize=(15, 5))

t = "TSLA"

sns.lineplot(x=stats["date"], y=stats[("Close", t)]/max(stats[("Close", t)]), color=colors[t], label=company[t])
sns.lineplot(x=stats["date"], y=stats["positive_sentiment"]/max(stats["positive_sentiment"]), color="tab:olive", label="Bots, Positive Sentiment")
plt.title(company[t])
plt.xlabel("Date")
plt.ylabel("Arbitrary units")
None

## Summary

Taken together, bot tweets track prices astonishingly more closely than the average for Google, Microsoft and Amazon, in terms of both positive and total sentiment. For these companies, the correlation of total bot sentiment with price is up to 20-25% higher than average. For Apple, total bot sentiment shows a 10% increase over the average, still a formidable effect. The difference is much smaller for Tesla, where bot sentiment differs from the average by only 1%.

<p> Interestingly, negative sentiment from bot tweets behaves very differently from overall negative sentiment in terms of correlation with stock prices. Average negative sentiment was positively correlated with stock prices, while negative bot sentiment was negatively correlated with them.
    
**Please upvote this notebook if you found this short exploration useful or interesting.**