Our project aims to analyze Twitter messages containing bitcoin-related keywords in order to guess how the price will move the next day. To do this, we scraped Twitter for a small set of tweets with the keywords bitcoin, cryptocurrency, and BTC. Unfortunately, there weren't any recent premade datasets we could use for training. We searched for online databases, but they all covered years before bitcoin had an active market. The Twitter API was also very restrictive, so our dataset ended up being the limiting factor for how well we could train our final network.

We take these tweets and run a sentiment analysis on them with a logistic regression method. Then we pass vectors containing the sentiments for each day, along with the price movement of the following day, as training data. The idea is to train the network to use a day's sentiment as a guide for guessing whether prices rise or fall the next day.
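The pairing of a day's sentiment with the following day's price movement can be sketched as below. This is a minimal illustration, not the notebook's actual pipeline: the function name `build_daily_pairs`, the `daily_close` dictionary, and the toy values are all assumptions, and it assumes that consecutive entries in the price table are consecutive calendar days.

```python
from collections import defaultdict

def build_daily_pairs(tweet_days, tweet_sentiments, daily_close):
    """Pair each day's mean sentiment with the next day's price direction.

    tweet_days:       'YYYY-MM-DD' strings, one per tweet
    tweet_sentiments: 0/1 sentiment labels, one per tweet
    daily_close:      dict 'YYYY-MM-DD' -> closing price (hypothetical input)
    """
    # Group the per-tweet sentiments by calendar day
    by_day = defaultdict(list)
    for day, sentiment in zip(tweet_days, tweet_sentiments):
        by_day[day].append(sentiment)

    days = sorted(daily_close)  # ISO dates sort chronologically as strings
    pairs = []
    for today, tomorrow in zip(days, days[1:]):
        if today not in by_day:
            continue  # no tweets were collected for this day
        mean_sentiment = sum(by_day[today]) / len(by_day[today])
        went_up = 1 if daily_close[tomorrow] > daily_close[today] else 0
        pairs.append((today, mean_sentiment, went_up))
    return pairs

# Toy values, for illustration only
pairs = build_daily_pairs(
    ["2018-01-01", "2018-01-01", "2018-01-02", "2018-01-02"],
    [1, 0, 1, 1],
    {"2018-01-01": 100.0, "2018-01-02": 110.0, "2018-01-03": 105.0},
)
print(pairs)
```

Each tuple holds the day, the mean sentiment of that day's tweets, and a 1/0 flag for whether the price rose the following day.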

To start, we scrape Twitter using the Tweepy library and write the results to a formatted CSV file.

We define a function called "twitter_search", which uses the tweepy library to search up to 10 days in the past for tweets.

In [1]:
def twitter_search(authentication, path, start_date, final_date, word):
    # authentication is a list containing the setup parameters (including a secret key)

    import tweepy
    import csv
    import re
    import time

    # authentication parameters
    access_token = authentication[0]
    access_token_secret = authentication[1]
    consumer_key = authentication[2]
    consumer_secret = authentication[3]

    # authentication
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)

We then cap the number of tweets collected per day so that we cover a broader range of dates before tripping the API rate limit. Each tweet is parsed, retweets are filtered out and URLs are stripped, and the result is written to a CSV file.

In [2]:
    # Continuation of the twitter_search body from In [1]
    # Open/create a file to append data
    csvFile = open(path, 'a')

    # Use csv writer
    csvWriter = csv.writer(csvFile, delimiter='\t')

    cont = 0  # tweets written for the current date
    # Tweet searching: walk backwards one day at a time
    current_date = final_date
    while current_date != start_date:
        try:
            # since_id expects a tweet ID, not a date, so the lower date bound
            # goes into the query through the standard "since:" search operator
            for tweet in tweepy.Cursor(api.search,
                                       q=word + ' since:' + start_date,
                                       until=current_date,
                                       count=100,  # the standard search API returns at most 100 per page
                                       lang='en').items():
                if 'RT @' not in tweet.text:  # elimination of the retweets
                    text = re.sub(r"http\S+", "", tweet.text)  # elimination of the URLs in tweets
                    csvWriter.writerow([tweet.retweet_count, tweet.user.name.encode('utf-8'),
                                        tweet.created_at, text.encode('utf-8')])
                    cont = cont + 1
                if cont == 1000000:  # per-date cap reached
                    break
        except tweepy.TweepError:
            pass  # e.g. rate limit hit; move on to the previous day
        # Always step back one day so the loop terminates
        current_date = time_decrease(current_date)
        cont = 0

    csvFile.close()


def time_decrease(date):
    # Return the day before `date`; both are 'YYYY-MM-DD' strings
    from datetime import datetime
    from datetime import timedelta
    date = datetime.strptime(date, '%Y-%m-%d')
    delta = timedelta(days=1)
    new_date = date - delta
    final_date = datetime.isoformat(new_date)[0:10]
    return final_date

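The cleaning step can be tried in isolation without any API access. The sketch below applies the same retweet test and URL pattern as the scraper to a couple of made-up strings; the helper name `clean_tweet` and the trailing `strip()` are additions for illustration.

```python
import re

URL_PATTERN = re.compile(r"http\S+")

def clean_tweet(text):
    """Return the tweet text with URLs removed, or None for retweets."""
    if "RT @" in text:
        return None  # same retweet test as in the scraper
    # strip() is an addition here; the scraper keeps surrounding whitespace
    return URL_PATTERN.sub("", text).strip()

# Made-up sample tweets
samples = [
    "RT @someone: bitcoin to the moon",
    "BTC looks strong today https://example.com/chart",
]
cleaned = [c for c in (clean_tweet(s) for s in samples) if c is not None]
print(cleaned)
```

Note that the `'RT @'` test also drops any original tweet that merely quotes that substring, which is a trade-off of this simple filter.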

Then we pass the Twitter CSV data through the text classifier. We tried linear and non-linear SVMs, decision trees, logistic regression, and multi-layer perceptrons (MLPs), and in the end settled on the MLP for its effectiveness (~65-70% accuracy). For this first block of code, we import our dependencies, drawing from the scikit-learn library among others.

In [3]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gensim
from gensim import corpora
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import csv
import numpy as np

dim_e=100 #dimension of the embedding space

n_m=100000     #number of mails 

v_dim=10000  #validation set dimension

#read csv file

def read_csv(filename,n_m):
    with open(filename) as csvfile:
        readCSV = csv.reader(csvfile, delimiter='\t')
        values = []
        text = []
        for row in readCSV:
            a=float(row[1])   #label
            b=row[3]          #text
            values.append(a)
            text.append(b)

    return values[0:n_m],text[0:n_m]

values,mails=read_csv('sent3.csv',n_m)
labels=np.asarray(values)
mails0=mails[0:n_m-v_dim]  #we don't take the validation set for training 
text_mails=[s for sentence in mails0 for s in sent_tokenize(sentence)]


Then we import the sample.txt file, which contains text from many novels in the Project Gutenberg literature database.

In [None]:
with open('sample.txt') as f:  #read the txt file
    text0=f.read()
text1=text_mails +sent_tokenize(text0)
#text1=text_mails 

#tokenize, remove punctuation, common words and very rare words
#(raw string so that the lone backslash is not treated as an escape)
stoplist = set(r'for a of the and to in ! " # $ % & ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~'.split())
texts = [[word.lower() for word in word_tokenize(document) if word not in stoplist] for document in text1]

from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1] for text in texts]
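To see what this filtering does, here is a minimal stand-alone sketch on a toy corpus. It uses `str.split` in place of NLTK's `word_tokenize` and a shortened stoplist, and it lowercases before the stoplist check (the cell above checks first, so a capitalized "The" would pass through there); the function name `filter_tokens` is hypothetical.

```python
from collections import Counter

STOPLIST = {"for", "a", "of", "the", "and", "to", "in"}

def filter_tokens(documents, min_count=2):
    """Lowercase, drop stop words, then drop tokens seen fewer than min_count times."""
    tokenized = [[w.lower() for w in doc.split() if w.lower() not in STOPLIST]
                 for doc in documents]
    # Count every remaining token across the whole corpus
    counts = Counter(tok for doc in tokenized for tok in doc)
    return [[tok for tok in doc if counts[tok] >= min_count] for doc in tokenized]

docs = ["The price of bitcoin rises", "Bitcoin price falls", "Buy the dip"]
filtered = filter_tokens(docs)
print(filtered)
```

Only "price" and "bitcoin" occur at least twice, so the third document ends up empty. On a small tweet corpus this kind of filter can remove a lot of text, which is one reason to mix in the larger sample.txt corpus.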

Then we train a Word2Vec model using the gensim library to turn our words into vectors for the network. After training, we sort the vectors by word frequency to check that the Word2Vec conversion worked as expected.

In [None]:
#note: this cell uses the gensim 3.x API (size=, iter=, model.wv.vocab)
model = gensim.models.Word2Vec(texts, min_count=5, iter=20, size=dim_e, sorted_vocab=1)
#use the model to calculate similarities between words

#sort vectors by frequency

l_k=list(model.wv.vocab.keys())

def f_key(x):
    return model.wv.vocab[x].count

#words ordered from most to least frequent
label=sorted(l_k,key=f_key,reverse=True)

#one row per word, in the same order as label
X_sort=np.asarray([model.wv[w] for w in label])
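Similarity between two word vectors is usually measured with cosine similarity, which is also what gensim's `model.wv.similarity` reports. The sketch below computes it by hand on made-up two-dimensional vectors, so the numbers say nothing about the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up embeddings, for illustration only
toy_vectors = {"bitcoin": [1.0, 0.0], "btc": [2.0, 0.0], "novel": [0.0, 3.0]}

same_direction = cosine_similarity(toy_vectors["bitcoin"], toy_vectors["btc"])
orthogonal = cosine_similarity(toy_vectors["bitcoin"], toy_vectors["novel"])
print(same_direction, orthogonal)
```

Vectors pointing the same way score 1 regardless of length, and orthogonal vectors score 0.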


We then visualize the Word2Vec encoding with a t-SNE plot to see how words are related to each other and how closely the model places similar words.

In [None]:
#tsne visualize
plot_i=0
plot_f=500

tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X_sort[plot_i:plot_f,:])

def plot_with_labels(low_dim_embs, labels):
  assert low_dim_embs.shape[0] >= len(labels), 'More labels than embeddings'
  plt.figure(figsize=(18, 18))  # in inches
  for i, label in enumerate(labels):
    x, y = low_dim_embs[i, :]
    plt.scatter(x, y)
    plt.annotate(label,
                 xy=(x, y),
                 xytext=(5, 2),
                 textcoords='offset points',
                 ha='right',
                 va='bottom')
  plt.show()

plot_with_labels(X_tsne, label[plot_i:plot_f])

Then we load bitcoin pricing data to be combined later with our tweet sentiments. The code below pairs each tweet with a price set a couple of hours after the tweet.

In [None]:
def compare(date1, date2):
    # True if date1 is strictly earlier than date2
    from datetime import datetime
    date1 = datetime.strptime(date1, '%Y-%m-%d %H:%M')
    date2 = datetime.strptime(date2, '%Y-%m-%d %H:%M')
    return date1 < date2

def time_decrease(date):
    # Return the day before `date`; both are 'YYYY-MM-DD' strings
    from datetime import datetime
    from datetime import timedelta
    date = datetime.strptime(date, '%Y-%m-%d')
    delta = timedelta(days=1)
    new_date = date - delta
    final_date = datetime.isoformat(new_date)[0:10]
    return final_date

def convert(date):
    # Convert a 'YYYY-MM-DD HH:MM' string to a Unix timestamp (local time)
    from datetime import datetime
    import time
    new_date = time.mktime(datetime.strptime(date, '%Y-%m-%d %H:%M').timetuple())
    return new_date

# Open/Create a file to append data
csvFile = open('C:\\Users\\Marco\\Desktop\\name.csv', 'a')
    
#Use csv Writer
csvWriter = csv.writer(csvFile, delimiter='\t' )


with open('C:\\Users\\Marco\\Desktop\\tweets.csv') as f:
    r = csv.reader(f, delimiter=';')
    r = list(r)
    
with open('C:\\Users\\Marco\\Desktop\\val.csv') as f:
    s = csv.reader(f, delimiter=';')
    s = list(s)

            
#timestamps (xp) and values (yp) of the price series
xp=np.zeros(len(s))
yp=np.zeros(len(s))
for cont, row in enumerate(s):
    xp[cont]=convert(row[0])
    yp[cont]=float(row[1])

#timestamp of each tweet
x=np.zeros(len(r))
for cont, row in enumerate(r):
    x[cont]=convert(row[2])

#price at each tweet time, by linear interpolation of the price series
#(np.interp expects xp to be increasing)
y=np.interp(x,xp,yp)

for cont, row in enumerate(r):
    csvWriter.writerow([row[0],row[2],row[3],y[cont]])

csvFile.close()
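The price lookup relies on linear interpolation between the two price samples that bracket each tweet's timestamp. A pure-Python sketch of that calculation is shown below; it mirrors the default behaviour of `np.interp` (values outside the sampled range are clamped to the end points) and assumes the timestamps are sorted in increasing order. The function name and the toy numbers are illustrative.

```python
from bisect import bisect_right

def interpolate_price(t, times, prices):
    """Linearly interpolate the price at time t; clamp outside the sampled range."""
    if t <= times[0]:
        return prices[0]
    if t >= times[-1]:
        return prices[-1]
    i = bisect_right(times, t)  # now times[i-1] <= t < times[i]
    t0, t1 = times[i - 1], times[i]
    p0, p1 = prices[i - 1], prices[i]
    return p0 + (p1 - p0) * (t - t0) / (t1 - t0)

# Toy price series sampled once per hour (timestamps in seconds)
price_times = [0.0, 3600.0, 7200.0]
price_values = [100.0, 110.0, 90.0]
tweet_times = [1800.0, 3600.0, 9000.0]

paired = [interpolate_price(t, price_times, price_values) for t in tweet_times]
print(paired)
```

A tweet halfway between two samples gets the midpoint price, and a tweet later than the last sample simply gets the last known price.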

Then we get to the business end of the sentiment classifier: an MLP that classifies each text from the sum of its word vectors, which we store in numpy arrays. The trained classifier is then applied to the tweets that were paired with prices.

In [4]:
from sklearn.neural_network import MLPClassifier

text_mails0=[[s  for s in sent_tokenize(mail)] for mail in mails]
text_mails=text_mails0[0:n_m]

words = [[word.lower() for sentence in mail for word in word_tokenize(sentence) if word not in stoplist] for mail in text_mails]

frequency = defaultdict(int)
for text in words:
    for token in text:
        frequency[token] += 1

words = [[token for token in text if frequency[token] > 1] for text in words]

#sum of the word vectors for each mail

mail_vectors=np.zeros((n_m,dim_e)); 

for j in range(0,n_m):
    m=words[j]
    for k in range(0,len(m)):
      try:
        mail_vectors[j,:]=mail_vectors[j,:]+np.asarray(model[m[k]]) 
      except KeyError:
        pass #skips words which aren't in the learned vocabulary
        
clf=MLPClassifier(alpha=1)    #0.248 with sample and 200000

#to train the model we only take the first n_m-v_dim mails (the training set)
print('training')
clf.fit(mail_vectors[0:n_m-v_dim],labels[0:n_m-v_dim])

#output to monitor validation error

print("validation error=",np.sum(np.abs(labels[n_m-v_dim:n_m]-clf.predict(mail_vectors[n_m-v_dim:n_m,:])))/v_dim)

print("s=",clf.predict(mail_vectors[1:100,:]))
print("l=",labels[1:100])
def read_csv1(filename):
  with open(filename) as csvfile:         
      readCSV = csv.reader(csvfile, delimiter='\t')
      text = []
      price = []
      for row in readCSV:
        a=row[2]
        b=row[3]
        text.append(a)
        price.append(b)
        
  return text, price

tweets, price =read_csv1('tweet-price.csv') #tweets and price are parallel lists with one entry per tweet

text_mails=[[s  for s in sent_tokenize(mail)] for mail in tweets]


#calculate the sentiments and word vectors for each tweet

words = [[word.lower() for sentence in mail for word in word_tokenize(sentence) if word not in stoplist] for mail in text_mails]

frequency = defaultdict(int)
for text in words:
    for token in text:
        frequency[token] += 1

words = [[token for token in text if frequency[token] > 1] for text in words]

tweet_vectors=np.zeros((len(words),dim_e));

tweet_sentiment=np.zeros(len(words))



for j in range(0,len(words)):
    m=words[j]                  #word vectors for each tweet
    for k in range(0,len(m)):
      try:                    
        tweet_vectors[j,:]=tweet_vectors[j,:]+np.asarray(model[m[k]])
      except KeyError:
        pass


tweet_sentiment=clf.predict(tweet_vectors);   #sentiments

print('sentiments=',tweet_sentiment)

X_train=np.asarray(tweet_sentiment)

Y_train=np.asarray(price)

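The two core ideas of this cell, summing word vectors per text while skipping out-of-vocabulary words, and measuring the error as the mean absolute difference between 0/1 labels and predictions, can be illustrated without gensim or scikit-learn. The embeddings and labels below are made up, and the function names are hypothetical.

```python
def document_vector(tokens, embeddings, dim):
    """Sum the embeddings of all known tokens; unknown tokens are skipped."""
    total = [0.0] * dim
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:
            continue  # word is not in the learned vocabulary
        total = [a + b for a, b in zip(total, vec)]
    return total

def error_rate(y_true, y_pred):
    """Mean absolute difference; for 0/1 labels this is the misclassified fraction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up three-dimensional embeddings
toy_embeddings = {"bitcoin": [1.0, 0.0, 2.0], "rises": [0.5, 1.0, 0.0]}
vec = document_vector(["bitcoin", "rises", "unknownword"], toy_embeddings, 3)
err = error_rate([1, 0, 1, 1], [1, 1, 1, 0])
print(vec, err)
```

Because the vectors are summed rather than averaged, longer texts produce larger vectors, which is a property of this representation worth keeping in mind when comparing tweets of different lengths.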

After this, we group the tweet sentiments into blocks and pair each tweet sentiment with an associated bitcoin price set a couple of hours after the tweet.

Once the sentiments are paired with prices, we write them to a CSV file for portability and then read the two columns of the CSV back into arrays. The arrays are fed into a binary classifier that tries to distinguish between up and down "guesses".

In [None]:
import keras
import csv
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
import numpy as np

coinarrays=[]
coinarrayp=[]
with open('puta2.csv', newline='') as coinfile:
    coinreader=csv.reader(coinfile, delimiter='\t', quotechar='|')
    for row in coinreader:
        coinarrays.append(float(row[0]))
        coinarrayp.append(float(row[1]))

#the first 800 rows are the test set, rows from 801 on are the training set
#inputs are reshaped to (n_samples, 1) to match input_dim=1
x_train = np.asarray(coinarrays[801:]).reshape(-1, 1)
y_train = np.asarray(coinarrayp[801:])
x_test = np.asarray(coinarrays[0:800]).reshape(-1, 1)
y_test = np.asarray(coinarrayp[0:800])

model = Sequential()

model.add(Dense(64, input_dim=1, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
#sigmoid, because a softmax over a single unit always outputs 1
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['mean_squared_error'])

model.fit(x_train, y_train,
          epochs=20,
          batch_size=800)
score = model.evaluate(x_test, y_test, batch_size=800)
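For a network with a single output unit, the choice of output activation matters a great deal. A softmax normalizes over the units of its layer, so with one unit it returns 1 for every input, whereas a sigmoid maps the logit to a probability between 0 and 1. The short check below uses plain Python rather than Keras to show the difference and the resulting binary cross-entropy; it is a numeric illustration, not a model of the network above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    """Cross-entropy for one 0/1 label and one predicted probability."""
    p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

logits = [-2.0, 0.0, 2.0]
single_unit_softmax = [softmax([z])[0] for z in logits]
sigmoid_outputs = [sigmoid(z) for z in logits]
print(single_unit_softmax)
print(sigmoid_outputs)

# Loss for a "down" label (0) under each activation, at logit 0
print(binary_crossentropy(0, single_unit_softmax[1]),
      binary_crossentropy(0, sigmoid_outputs[1]))
```

With the constant output of 1, every example labeled 0 incurs a very large loss and the gradient carries no information about the input, so the network cannot learn to separate the two classes.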

The MSE and loss ended up being extremely high for our data. The sentiment analysis operated on a binary basis, which made our inputs very coarse, while the exponential increase of the bitcoin price within the limited window in which we were able to obtain tweets meant that, network optimization aside, our results were somewhat doomed from the start to be very skewed.
However, as an exercise, this project taught us a lot about the large amounts of effort and time that machine learning and data science in general require. We admit defeat to the Twitter datalords.