## Jacob Roach

In [1]:
# Import the needed Packages.
import pandas as pd
import numpy as np
from datetime import timedelta
import tensorflow as tf
from tensorflow.keras import layers, Input
from tensorflow.keras.models import Model, Sequential, load_model
from tensorflow.keras.layers import Dense, LSTM
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

## Data Collection and Feature Engineering
Before any modeling was performed, the necessary data was collected from two sources. The first was Twitter data, gathered with the Twitter Developer API and the `tweepy` module. Tweets containing the word "bitcoin" were streamed for several days, written to a `.pkl` file, and saved for later feature engineering.

The second was the price of a single Bitcoin. Over the same interval in which the Twitter data was collected (plus twenty-four hours after the last Tweet was recorded), the price was recorded once per minute, along with the corresponding timestamp.

Once the Twitter and Bitcoin data had been recorded, further feature engineering was applied. For each stored Tweet, the price of Bitcoin at the time the Tweet was posted was added as the Tweet's `inital_price`. Then, if the price of Bitcoin increased within three hours of the Tweet being posted, the feature `increase` was assigned a value of `1`; otherwise, it was assigned `0`.
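The labeling step described above can be sketched with `pandas.merge_asof`. This is a minimal illustration, not the original feature-engineering code: the helper `label_tweets`, the `price` column, and the toy data are hypothetical, and "increased within three hours" is interpreted here as "the price three hours after the Tweet is higher than the price at the Tweet", which is an assumption.

```python
import numpy as np
import pandas as pd


def label_tweets(tweets, prices, horizon=pd.Timedelta(hours=3)):
    """Attach the price at Tweet time and a binary `increase` label (hypothetical helper)."""
    tweets = tweets.sort_values("time").reset_index(drop=True)
    prices = prices.sort_values("time")

    # Most recent recorded price at or before each Tweet's timestamp.
    at_tweet = pd.merge_asof(tweets[["time"]], prices, on="time")

    # Most recent recorded price at or before (Tweet time + horizon).
    shifted = pd.DataFrame({"time": tweets["time"] + horizon})
    later = pd.merge_asof(shifted, prices, on="time")

    # Column name spelled as in the text above.
    tweets["inital_price"] = at_tweet["price"].to_numpy()
    # Assumption: "increase" means the later price is strictly higher.
    tweets["increase"] = (later["price"].to_numpy() > at_tweet["price"].to_numpy()).astype(int)
    return tweets


# Toy data: one price per minute, rising for 200 minutes and then falling.
minutes = np.arange(361)
times = pd.date_range("2021-01-01", periods=361, freq="min")
prices = pd.DataFrame({"time": times, "price": np.where(minutes <= 200, minutes, 400 - minutes).astype(float)})
tweets = pd.DataFrame({"time": times[[0, 150]], "text": ["first", "second"]})

labeled = label_tweets(tweets, prices)
print(labeled[["time", "inital_price", "increase"]])
```

A backward as-of join is used so that a Tweet posted between two price recordings picks up the last price recorded before it.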

Finally, the text of each recorded Tweet was cleaned and standardized. The cleaned Tweet was then embedded with BERT, which returns a vector of length 384. This vector is stored as the `embedding` feature.
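The exact cleaning rules are not listed here, so the following is only a plausible sketch of what "cleaned and standardized" might involve. The function `clean_tweet` and every rule in it (lowercasing, dropping URLs and mentions, stripping punctuation) are assumptions. The embedding step itself is not shown, since it depends on which BERT model was used.

```python
import re


def clean_tweet(text):
    """Normalize raw Tweet text (hypothetical rules, for illustration only)."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)          # drop @mentions
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop punctuation, emoji, and the '#' symbol
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace


cleaned = clean_tweet("RT @user: Bitcoin is UP!! https://t.co/abc #BTC")
print(cleaned)
```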

In [16]:
# Read in the training data.
data = pd.read_pickle("../data/3_25_training_data.pkl")

# Reset the index, convert each embedding to an array.
data = data.reset_index(drop=True)
data["embedding"] = data["embedding"].apply(np.asarray)

# Remove bad rows: keep only rows whose timestamp plus 12 hours also appears in the data.
shifted_stamps = {stamp - timedelta(hours=12) for stamp in data["time"]}
data = data.loc[data["time"].isin(shifted_stamps), :]

In [17]:
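# Create a new train-test split by timestamp (for aggregation).
stamps = np.unique(data.time)
data.set_index(["time"], inplace=True)
# Sample test timestamps without replacement so that no timestamp is drawn twice.
test_stamps = np.random.choice(stamps, size=int(stamps.shape[0] * .20), replace=False)
# Copy the test rows so that a prediction column can be added later without warnings.
test_data = data.loc[test_stamps, :].copy()
train_data = data.loc[~data.index.isin(test_stamps), :]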
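Splitting on timestamps rather than on individual rows keeps every Tweet from a given minute on the same side of the split, which the later per-timestamp aggregation relies on. The toy example below sketches that idea and checks that the two sides share no timestamps. The toy DataFrame, the fixed seed, and the use of `numpy.random.default_rng` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, assumed here for reproducibility

# Toy data: 10 timestamps with 3 Tweets each.
times = pd.date_range("2021-01-01", periods=10, freq="min")
toy = pd.DataFrame({"time": times.repeat(3), "increase": rng.integers(0, 2, size=30)})

# Draw 20% of the unique timestamps without replacement.
stamps = np.unique(toy["time"].to_numpy())
test_stamps = rng.choice(stamps, size=int(len(stamps) * 0.20), replace=False)

# Route whole timestamps to one side or the other.
is_test = toy["time"].isin(test_stamps)
toy_test, toy_train = toy[is_test], toy[~is_test]

shared = set(toy_train["time"]) & set(toy_test["time"])
print(len(toy_train), len(toy_test), len(shared))
```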

With the training data read in, a quick inspection shows the size and class balance of the dataset.

In [18]:
# Investigate the DataFrame.
print("There are", len(data), "rows in the DataFrame.")
print("There are", (data["increase"] == 1).sum(), "records with an increase, and",
      (data["increase"] == 0).sum(), "with a decrease.\n")

There are 187303 rows in the DataFrame.
There are 126717 records with an increase, and 60586 with a decrease.
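Because the two classes are imbalanced, it helps to know what accuracy a trivial classifier would reach before judging any model. The short calculation below uses only the counts printed above. Treating the majority-class rate as the baseline is a common convention, not something this notebook states.

```python
# Counts taken from the cell output above.
n_increase = 126717
n_decrease = 60586

# Accuracy of always predicting the more common class.
baseline_accuracy = max(n_increase, n_decrease) / (n_increase + n_decrease)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

A model that always predicts an increase would therefore be right roughly 68% of the time on data with this class balance.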



In [19]:
# Create training and testing data.
x_train_sk = train_data["embedding"]
y_train_sk = train_data["increase"]
x_test_sk = test_data["embedding"]
y_test_sk = test_data["increase"]

# Convert to Tensors.
x_train = tf.convert_to_tensor(x_train_sk.to_list())
y_train = tf.convert_to_tensor(y_train_sk.to_list())
x_test = tf.convert_to_tensor(x_test_sk.to_list())
y_test = tf.convert_to_tensor(y_test_sk.to_list())

In [20]:
# Build the model: a feed-forward network with one hidden layer (not recurrent, despite the variable name).
input_layer = Input((768,))
dense = Dense(128, activation="relu")(input_layer)
output = Dense(2, activation="softmax")(dense)  # One output unit per class.
rnn_model = Model(input_layer, output)

# Compile the model.
rnn_model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# Fit the model for 25 epochs, validating on the held-out test data.
rnn_model.fit(x_train, y_train, epochs=25, validation_data=(x_test, y_test))

Epoch 1/25
...
Epoch 25/25


<keras.callbacks.History at 0x7fc5776110d0>

In [72]:
# Other models to try:
### - SVM
### - Naive Bayes
### - kNN
### - Random Forests
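As a starting point for the alternatives listed above, the sketch below fits a random forest on synthetic stand-in embeddings. The data, the embedding length of 8, and the hyperparameters are all placeholders rather than values from this project. The main point is the `np.vstack` step: scikit-learn expects a 2-D feature matrix, whereas the `embedding` column holds one array per row.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: one small random vector per row.
toy = pd.DataFrame({
    "embedding": [rng.normal(size=8) for _ in range(40)],
    "increase": rng.integers(0, 2, size=40),
})

# Stack the per-row arrays into a single (n_samples, n_features) matrix.
X = np.vstack(toy["embedding"].tolist())
y = toy["increase"].to_numpy()

# Hyperparameters here are arbitrary placeholders.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X[:30], y[:30])
forest_preds = forest.predict(X[30:])
print(forest_preds)
```

The same `X` and `y` could be passed to `SVC` or another scikit-learn classifier, since they share the `fit`/`predict` interface.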

In [23]:
# Apply to DataFrame: predict class 0 only when its probability is strictly higher.
predictions = rnn_model.predict(x_test)
predictions = np.where(predictions[:, 0] > predictions[:, 1], 0, 1)
test_data = test_data.assign(prediction=predictions)
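The cell above turns each pair of softmax outputs into a class label. On a small hand-made array (the probabilities below are invented for illustration), the same conversion can be written with `np.argmax`:

```python
import numpy as np

# Invented softmax outputs: one row per Tweet, one column per class.
probs = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.4, 0.6],
])

# Index of the larger probability in each row is the predicted class.
labels = np.argmax(probs, axis=1)
print(labels)
```

One difference is worth noting: on an exact tie, `np.argmax` returns the first index (class 0), whereas the comparison used in the cell assigns class 1.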

In [24]:
# Create a new DataFrame.
aggregated = pd.DataFrame(test_stamps, columns=["time"])

# Get the per-timestamp Tweet count and the sums of the actual and predicted labels.
agg_count = test_data.loc[:, ["increase"]].groupby("time").count()
agg_sum = test_data.loc[:, ["increase", "prediction"]].groupby("time").sum()

# Change column names.
agg_count = agg_count.rename(columns={"increase": "total_count"})
agg_sum = agg_sum.rename(columns={"increase": "actual", "prediction": "pred_count"})

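# Final join.
agg = agg_count.join(agg_sum)
# Collapse the summed labels back to a single 0/1 label per timestamp.
agg["actual"] = agg["actual"].apply(lambda x: 0 if x == 0 else 1)
# Fraction of Tweets at each timestamp that were predicted as an increase.
agg["pred_perc"] = agg["pred_count"] / agg["total_count"]
agg = agg[["actual", "total_count", "pred_count", "pred_perc"]]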
agg.to_csv("../data/3_25_agg.csv")
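To make the aggregation concrete, the toy example below builds the same four columns for two timestamps using pandas named aggregation. The data is invented, and the final step, which turns `pred_perc` into a single per-timestamp prediction with a 0.5 threshold, is an assumption about how the aggregated file might be used rather than something this notebook does.

```python
import pandas as pd

# Invented data: three Tweets at the first timestamp, two at the second.
toy = pd.DataFrame(
    {
        "increase": [1, 1, 1, 0, 0],
        "prediction": [1, 0, 1, 0, 1],
    },
    index=pd.DatetimeIndex(["2021-01-01 00:00"] * 3 + ["2021-01-01 00:01"] * 2, name="time"),
)

# One row per timestamp; for 0/1 labels, "max" matches "1 if the sum is nonzero".
toy_agg = toy.groupby("time").agg(
    actual=("increase", "max"),
    total_count=("increase", "count"),
    pred_count=("prediction", "sum"),
)
toy_agg["pred_perc"] = toy_agg["pred_count"] / toy_agg["total_count"]

# Assumed decision rule: predict an increase when more than half the Tweets say so.
toy_agg["agg_pred"] = (toy_agg["pred_perc"] > 0.5).astype(int)
print(toy_agg)
```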