## Jacob Roach

In [2]:
# Import the needed Packages.
import pandas as pd

## Data Collection and Feature Engineering
Before any modeling was performed, the necessary data was collected from two distinct sources. The first was Twitter data, gathered using the Twitter Developer API and the `tweepy` module. Tweets containing the word "bitcoin" were streamed for several days, written to a `.pkl` file, and saved for later feature engineering.

The second source was the price of a single Bitcoin. Over the same interval in which the Twitter data was collected, plus twenty-four hours after the last Tweet was recorded, the price of a Bitcoin was recorded each minute along with the corresponding timestamp.

Once the Twitter and Bitcoin data were recorded, further feature engineering was applied. For each stored Tweet, the price of Bitcoin at the time the Tweet was made was added as the Tweet's `inital_price`. Then, if the price of Bitcoin increased within twenty-four hours of the time the Tweet was made, the feature `increase` was assigned a value of `1`; otherwise, `increase` was assigned a value of `0`.
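The labeling step described above can be sketched roughly as follows. This is a minimal illustration rather than the code actually used: the toy data, the column names `created_at`, `text`, `timestamp`, and `price`, and the helper `label_tweets` are all assumptions. It also assumes that "increased within twenty-four hours" means the price recorded twenty-four hours after the Tweet is higher than `inital_price`; other readings are possible.

```python
import pandas as pd

# Hypothetical toy data. Only `inital_price` and `increase` are names taken
# from the notebook; every other column name is an assumption.
tweets = pd.DataFrame({
    "created_at": pd.to_datetime(["2019-01-01 00:00:30", "2019-01-01 00:02:10"]),
    "text": ["bitcoin to the moon", "selling my bitcoin"],
})
prices = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-01-01 00:00:00",
        "2019-01-01 00:01:00",
        "2019-01-01 00:02:00",
        "2019-01-02 00:00:00",
        "2019-01-02 00:02:00",
    ]),
    "price": [3700.0, 3705.0, 3710.0, 3720.0, 3690.0],
})


def label_tweets(tweets, prices, horizon=pd.Timedelta(hours=24)):
    """Attach `inital_price` and `increase` to each Tweet (illustrative only)."""
    tweets = tweets.sort_values("created_at").reset_index(drop=True)
    prices = prices.sort_values("timestamp")

    # Price at Tweet time: the most recent recorded price at or before the Tweet.
    labeled = pd.merge_asof(
        tweets, prices,
        left_on="created_at", right_on="timestamp", direction="backward",
    )
    labeled = labeled.rename(columns={"price": "inital_price"}).drop(columns="timestamp")

    # Price one horizon later, looked up the same way.
    lookup = pd.DataFrame({"lookup_time": labeled["created_at"] + horizon})
    later = pd.merge_asof(
        lookup, prices,
        left_on="lookup_time", right_on="timestamp", direction="backward",
    )

    # 1 if the later price is higher than the initial price, otherwise 0.
    # A missing price on either side compares as False and yields 0.
    higher = later["price"].to_numpy() > labeled["inital_price"].to_numpy()
    labeled["increase"] = higher.astype(int)
    return labeled


labeled = label_tweets(tweets, prices)
print(labeled[["created_at", "inital_price", "increase"]])
```

`pd.merge_asof` is used here because Tweets arrive at arbitrary seconds while prices are recorded once per minute, so an exact join on the timestamp would rarely match.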

Finally, the text of each recorded Tweet was cleaned and standardized. The cleaned Tweet was then BERTified, returning a vector of length 512, which was stored as the `bertified` feature. Only the `bertified` and `increase` features were kept, and these form the training data used in this notebook.
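The exact cleaning rules are not listed here, so the following is only a sketch of what "cleaned and standardized" could look like for Tweet text. The function `clean_tweet` and every rule in it (lowercasing, dropping URLs and mentions, stripping punctuation) are assumptions, not the notebook's actual preprocessing.

```python
import re


def clean_tweet(text):
    """Illustrative Tweet cleaner; the specific rules are assumptions."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)          # drop user mentions
    text = text.replace("#", " ")              # keep the hashtag word, drop the symbol
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop remaining punctuation
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace


print(clean_tweet("RT @user: #Bitcoin is UP!! https://t.co/abc123"))
```

The cleaned string would then be passed to whichever BERT encoder produces the `bertified` vector.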

In [26]:
# Read in the training data.
data = pd.read_pickle("../data/training_data.pkl")

# Reset the index.
data = data.reset_index(drop=True)

Once the training data has been read in, it is briefly inspected to show the nature of the dataset.

In [25]:
# Investigate the DataFrame.
print("There are", len(data), "rows in the DataFrame.")
# Count the records in each class.
n_increase = int((data["increase"] == 1).sum())
n_decrease = int((data["increase"] == 0).sum())
print("There are", n_increase, "records with an increase, and", n_decrease, "with a decrease.\n")

# Now, show the summary of the data.
print(data["increase"].describe())

There are 114233 rows in the DataFrame.
There are 55912 records with an increase, and 58321 with a decrease.

count    114233.000000
mean          0.489456
std           0.499891
min           0.000000
25%           0.000000
50%           0.000000
75%           1.000000
max           1.000000
Name: increase, dtype: float64


Now that the training data has been loaded, it can be partitioned into training and testing sets.
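One way the partitioning could be done is sketched below on a small stand-in DataFrame. The helpers `split_data` and `to_arrays`, the 80/20 ratio, the random seed, and the toy vectors of length 4 are all assumptions made for illustration; only the column names `bertified` and `increase` come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real training data (assumed shape: one vector per row).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "bertified": [rng.normal(size=4) for _ in range(10)],
    "increase": [0, 1] * 5,
})


def split_data(data, test_fraction=0.2, seed=42):
    """Shuffle the rows, then hold out a fraction of them for testing."""
    shuffled = data.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled.iloc[n_test:], shuffled.iloc[:n_test]


def to_arrays(frame):
    """Stack the per-row vectors into a 2-D feature matrix and a label vector."""
    X = np.vstack(frame["bertified"].tolist())
    y = frame["increase"].to_numpy()
    return X, y


train, test = split_data(toy)
X_train, y_train = to_arrays(train)
X_test, y_test = to_arrays(test)
print(X_train.shape, X_test.shape)
```

Because the Tweets were collected over consecutive days, a chronological split (training on earlier Tweets and testing on later ones) is an alternative worth considering, since a random shuffle can place near-simultaneous Tweets in both sets.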

In [27]:
# Create training and testing.

# Create train and test vectors.

In [28]:
# Train the model.

In [29]:
# Test the model.