## Jacob Roach

In [37]:
# Import the needed packages.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import tensorflow as tf
from tensorflow.keras import Input
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM
from tensorflow.keras.models import load_model

## Data Collection and Feature Engineering
Before any modeling was performed, the necessary data were collected from two sources. The first was Twitter: using the Twitter Developer API and the `tweepy` module, Tweets containing the word "bitcoin" were streamed for several days. These Tweets were written to a `.pkl` file and saved for later feature engineering.

The second source was the price of a single Bitcoin. Over the same interval in which the Twitter data was collected, plus twenty-four hours after the last Tweet was recorded, the price of a Bitcoin was recorded each minute along with the corresponding timestamp.

Once the Twitter and Bitcoin data had been recorded, further feature engineering was applied. For each stored Tweet, the price of Bitcoin at the time the Tweet was made was added as the Tweet's `inital_price`. Then, if the price of Bitcoin increased within twenty-four hours of that time, the feature `increase` was assigned a value of `1`; otherwise, `increase` was assigned a value of `0`.
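
The labeling step is not shown in this notebook, so the following is only a minimal sketch of how it might look. It assumes hypothetical column names (`time` for both tables, `price` for the minute-level Bitcoin data) and reads "increased within twenty-four hours" as "the price twenty-four hours later is higher than the price at the time of the Tweet"; the original pipeline may have used a different rule.

```python
import pandas as pd


def label_tweets(tweets: pd.DataFrame, prices: pd.DataFrame) -> pd.DataFrame:
    """Attach `inital_price` and `increase` to each Tweet.

    Assumes (hypothetically) that `tweets` has a `time` column and `prices`
    has `time` and `price` columns, all stored as pandas datetimes.
    """
    tweets = tweets.sort_values("time").reset_index(drop=True)
    prices = prices.sort_values("time")

    # Most recent recorded price at or before the moment of each Tweet.
    labeled = pd.merge_asof(
        tweets, prices.rename(columns={"price": "inital_price"}), on="time"
    )

    # Look up the price again with every Tweet time shifted forward by a day.
    later = labeled[["time"]].copy()
    later["time"] = later["time"] + pd.Timedelta(hours=24)
    later = pd.merge_asof(
        later, prices.rename(columns={"price": "later_price"}), on="time"
    )

    # 1 if the later price is higher, 0 otherwise (including missing prices).
    labeled["increase"] = (later["later_price"] > labeled["inital_price"]).astype(int)
    return labeled
```

Because the Bitcoin price was recorded for an extra twenty-four hours after the last Tweet, every Tweet should have a later price available for the second lookup.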

Finally, the text of each recorded Tweet was cleaned and standardized. The cleaned Tweet was then BERTified, returning a vector of length 512 that was stored as the `bertified` feature. Only the `bertified` and `increase` features were kept, and these form the training data used in this notebook.
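
The exact cleaning rules are not given here, so the sketch below only illustrates what "cleaned and standardized" could mean for a Tweet. The function name `clean_tweet` and the specific rules (lowercasing, dropping URLs and mentions, keeping hashtag words) are assumptions, not the steps actually used to build the training data.

```python
import re


def clean_tweet(text: str) -> str:
    """Lowercase a Tweet and strip URLs, mentions, and stray symbols."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)          # drop URLs
    text = re.sub(r"@\w+", " ", text)                  # drop @mentions
    text = text.replace("#", "")                       # keep hashtag words only
    text = re.sub(r"[^a-z0-9\s.,!?$%'-]", " ", text)   # drop emoji and other symbols
    return re.sub(r"\s+", " ", text).strip()           # collapse whitespace


print(clean_tweet("RT @user: #Bitcoin to the MOON!! https://t.co/abc"))
```

The cleaned string would then be passed to whichever encoder produces the 512-length `bertified` vector.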

In [31]:
# Read in the training data.
data = pd.read_pickle("../data/training_data.pkl")

# Reset the index.
data = data.reset_index(drop=True)

# Convert each list to an array.
data["bertified"] = data["bertified"].apply(np.asarray)

Once the training data has been read in, it is briefly inspected to show the nature of the dataset.

In [26]:
# Investigate the DataFrame.
n_increase = (data["increase"] == 1).sum()
n_decrease = (data["increase"] == 0).sum()  # Includes any Tweets where the price did not change.
print("There are", len(data), "rows in the DataFrame.")
print("There are", n_increase, "records with an increase, and", n_decrease, "with a decrease.\n")

# Now, show the summary of the data.
print(data["increase"].describe())

There are 114233 rows in the DataFrame.
There are 55912 records with an increase, and 58321 with a decrease.

count    114233.000000
mean          0.489456
std           0.499891
min           0.000000
25%           0.000000
50%           0.000000
75%           1.000000
max           1.000000
Name: increase, dtype: float64


Now that the training data has been loaded, it can be partitioned into training and testing sets.

In [61]:
# Create the training and testing sets (90% / 10% split).
x_train, x_test, y_train, y_test = train_test_split(data["bertified"], data["increase"], test_size=.10)

# Convert to Tensors.
x_train = tf.convert_to_tensor(x_train.to_list())
y_train = tf.convert_to_tensor(y_train.to_list())
x_test = tf.convert_to_tensor(x_test.to_list())
y_test = tf.convert_to_tensor(y_test.to_list())
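
As an aside, a column holding one array per row can also be turned into a single 2-D matrix with NumPy before it is handed to TensorFlow. The snippet below is a sketch on toy data; the names `toy`, `features`, and `labels` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data: one 512-length vector per row.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "bertified": [rng.normal(size=512) for _ in range(10)],
    "increase": [0, 1] * 5,
})

# Stack the per-row vectors into one (n_rows, 512) float32 matrix.
features = np.stack(toy["bertified"].to_list()).astype(np.float32)
labels = toy["increase"].to_numpy(dtype=np.int64)

print(features.shape, labels.shape)
```

Passing a fixed `random_state` to `train_test_split` would additionally make the partition reproducible between runs.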

In [71]:
# Build the model.
input_layer = Input((512,))
dense = Dense(128, activation="relu")(input_layer)
output = Dense(2, activation="softmax")(dense)  # One output unit per class.
model = Model(input_layer, output)

# Compile the model.
model.compile(loss="sparse_categorical_crossentropy", 
              optimizer="adam",
              metrics=["accuracy"])

# Fit the model. Note that the test set doubles as the validation set here.
model.fit(x_train, y_train, epochs=30, validation_data=(x_test, y_test))

Epoch 1/30
...
Epoch 30/30


<tensorflow.python.keras.callbacks.History at 0x1474c27c0>
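
To make the compiled loss more concrete, the following NumPy sketch reproduces what a two-unit softmax output combined with `sparse_categorical_crossentropy` computes on a toy batch. It is a simplified illustration (Keras additionally clips probabilities for numerical stability), and the logits and labels are made up.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def sparse_categorical_crossentropy(labels: np.ndarray, probs: np.ndarray) -> float:
    """Mean negative log-probability assigned to the true integer class."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked).mean())


# Made-up pre-softmax outputs for three Tweets and their integer labels.
logits = np.array([[2.0, 0.0], [0.0, 0.0], [-1.0, 3.0]])
labels = np.array([0, 1, 1])

probs = softmax(logits)
loss = sparse_categorical_crossentropy(labels, probs)
print(probs)
print(loss)
```

The labels stay as plain integers (`0` or `1`), which is why the "sparse" variant of the loss is used rather than one-hot targets.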

In [70]:
# Evaluate the model on the test set; returns [loss, accuracy].
model.evaluate(x=x_test, y=y_test)

[1.1049458980560303, 0.6046043634414673]
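
To put these two numbers in context, they can be compared against trivial baselines. The sketch below uses the class counts printed earlier; since those counts describe the whole dataset rather than the 10% test split, the baseline accuracy is only approximate.

```python
import math

# Counts reported when the DataFrame was inspected.
n_rows = 114233
n_no_increase = 58321

# Accuracy of always predicting the more common class (no increase).
majority_accuracy = n_no_increase / n_rows

# Cross-entropy of always predicting 50/50 for the two classes.
uniform_loss = math.log(2)

# Values returned by model.evaluate above.
test_loss, test_accuracy = 1.1049458980560303, 0.6046043634414673

print(f"majority-class accuracy: {majority_accuracy:.4f}")
print(f"model accuracy:          {test_accuracy:.4f}")
print(f"uniform-guess loss:      {uniform_loss:.4f}")
print(f"model loss:              {test_loss:.4f}")
```

The accuracy of roughly 0.60 is above the majority-class baseline of roughly 0.51, while the loss is above the `ln 2` of a 50/50 guess. One possible reading is that the model is confidently wrong on some Tweets, which would be consistent with overfitting over 30 epochs, although the per-epoch metrics are not preserved here to confirm it.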