# Merge Market Price & Sentiment Datasets


* Merge the SPY price dataset with the daily sentiment dataset.
* Split the merged data into training and testing datasets.
* The training dataset is used for model training and for hyperparameter tuning via cross-validation.
* The testing dataset is an internal hold-out set used to compare the performance of different models.
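
Because the rows are ordered in time, cross-validation on the training set is usually done so that each validation fold comes after the rows used to fit the model. The notebook does not show how the folds are built, so the following is only a minimal sketch of one common scheme (an expanding window). The function name `expanding_window_splits` and the fold sizes are assumptions made for illustration.

```python
from typing import Iterator, List, Tuple


def expanding_window_splits(
    n_rows: int, n_folds: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, val_idx) pairs where validation rows always follow training rows."""
    if n_folds < 1 or n_rows < n_folds + 1:
        raise ValueError("need at least n_folds + 1 rows")
    fold_size = n_rows // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = fold_size * k
        # The last fold absorbs any remainder rows.
        val_end = n_rows if k == n_folds else fold_size * (k + 1)
        yield list(range(train_end)), list(range(train_end, val_end))


# Hypothetical usage: 10 time-ordered rows, 4 folds.
for train_idx, val_idx in expanding_window_splits(n_rows=10, n_folds=4):
    print(f"train rows 0-{train_idx[-1]}, validate rows {val_idx[0]}-{val_idx[-1]}")
```

Each fold trains on a longer prefix of the data than the previous one, so no validation row is ever older than a training row in the same fold.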

In [3]:
import pandas as pd

## Merge SPY Price Dataset and Daily Sentiment Dataset

In [4]:
# Load SPY price data
spy = pd.read_csv('../data/cleaned_spy_price.csv')
spy['date'] = pd.to_datetime(spy['date'])

spy.head()

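|   | date | spy_close | spy_return | spy_direction |
|---|------|-----------|------------|---------------|
| 0 | 2018-03-02 | 239.550018 | 0.005155 | 1 |
| 1 | 2018-03-05 | 242.318726 | 0.011558 | 1 |
| 2 | 2018-03-06 | 242.932983 | 0.002535 | 0 |
| 3 | 2018-03-07 | 242.843948 | -0.000367 | 1 |
| 4 | 2018-03-08 | 244.019043 | 0.004839 | 1 |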


In [6]:
# Load sentiment data (e.g., from VADER or any sentiment extraction)
sentiment = pd.read_csv('../data/daily_sentiment_vader.csv')
sentiment['date'] = pd.to_datetime(sentiment['date'])

sentiment.head()


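|   | date | daily_sentiment_score | daily_sentiment_label | headline_count |
|---|------|-----------------------|-----------------------|----------------|
| 0 | 2018-03-01 | -0.083457 | negative | 7 |
| 1 | 2018-03-02 | -0.0999 | negative | 6 |
| 2 | 2018-03-05 | -0.1103 | positive | 6 |
| 3 | 2018-03-06 | 0.227525 | positive | 4 |
| 4 | 2018-03-07 | -0.125275 | negative | 8 |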


In [7]:
# Merge SPY and sentiment data on 'date'
merged_data = pd.merge(spy, sentiment, on='date', how='inner')

merged_data.head()

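|   | date | spy_close | spy_return | spy_direction | daily_sentiment_score | daily_sentiment_label | headline_count |
|---|------|-----------|------------|---------------|-----------------------|-----------------------|----------------|
| 0 | 2018-03-02 | 239.550018 | 0.005155 | 1 | -0.0999 | negative | 6 |
| 1 | 2018-03-05 | 242.318726 | 0.011558 | 1 | -0.1103 | positive | 6 |
| 2 | 2018-03-06 | 242.932983 | 0.002535 | 0 | 0.227525 | positive | 4 |
| 3 | 2018-03-07 | 242.843948 | -0.000367 | 1 | -0.125275 | negative | 8 |
| 4 | 2018-03-08 | 244.019043 | 0.004839 | 1 | 0.03796 | positive | 5 |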
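
An inner join silently drops any date that appears in only one of the two datasets (for example, 2018-03-01 is present in the sentiment rows shown above but not in the SPY rows). One way to see which dates are lost is an outer merge with pandas' `indicator=True` option. The sketch below uses small toy frames (`spy_demo`, `sentiment_demo`, hypothetical names) built from values in the outputs above, not the full CSV files.

```python
import pandas as pd

# Toy frames built from the first rows shown in the outputs above.
spy_demo = pd.DataFrame({
    "date": pd.to_datetime(["2018-03-02", "2018-03-05", "2018-03-06"]),
    "spy_close": [239.550018, 242.318726, 242.932983],
})
sentiment_demo = pd.DataFrame({
    "date": pd.to_datetime(["2018-03-01", "2018-03-02", "2018-03-05", "2018-03-06"]),
    "daily_sentiment_score": [-0.083457, -0.0999, -0.1103, 0.227525],
})

# indicator=True adds a "_merge" column saying where each row came from;
# validate="one_to_one" raises an error if a date is duplicated on either side.
audit = pd.merge(
    spy_demo, sentiment_demo, on="date", how="outer",
    indicator=True, validate="one_to_one",
)

# Rows not marked "both" are the dates an inner join would drop.
dropped = audit.loc[audit["_merge"] != "both", ["date", "_merge"]]
print(dropped)
```

Running the same audit on the real `spy` and `sentiment` frames would show how many trading days have no headlines, and how many sentiment days fall on non-trading days.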


## Split into Training/Validation and Internal Testing Datasets

* Use the oldest 80% of the data for training and cross-validation.
* Use the newest 20% of the data for internal testing.

In [8]:
# Sort by date to maintain time order
merged_data.sort_values('date', inplace=True)
merged_data.reset_index(drop=True, inplace=True)

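# Index of the first test row: the first 80% of rows go to training
split_index = int(len(merged_data) * 0.8)

# Split chronologically; .copy() makes the splits independent of merged_data
df_train = merged_data.iloc[:split_index].copy()
df_test = merged_data.iloc[split_index:].copy()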

In [9]:
# Save the splits
df_train.to_csv("../data/train_dataset.csv", index=False)
df_test.to_csv("../data/test_dataset.csv", index=False)
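
CSV files do not store column dtypes, so the `date` column comes back as plain strings unless it is parsed again on load. The sketch below illustrates the round trip with an in-memory buffer standing in for the saved file. The frame `demo` is a hypothetical two-row stand-in, not the real split.

```python
import io

import pandas as pd

# Hypothetical stand-in for one of the saved splits.
demo = pd.DataFrame({
    "date": pd.to_datetime(["2018-03-02", "2018-03-05"]),
    "spy_direction": [1, 1],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that the CSV format does not keep.
reloaded = pd.read_csv(buffer, parse_dates=["date"])
print(reloaded.dtypes)
```

A downstream notebook that reads `train_dataset.csv` or `test_dataset.csv` could pass the same `parse_dates=["date"]` argument.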

In [10]:
print(f"Training set: {df_train['date'].min()} to {df_train['date'].max()} ({len(df_train)} rows)")
print(f"Testing set:  {df_test['date'].min()} to {df_test['date'].max()} ({len(df_test)} rows)")

Training set: 2018-03-02 00:00:00 to 2020-04-20 00:00:00 (223 rows)
Testing set:  2020-04-21 00:00:00 to 2020-07-16 00:00:00 (56 rows)
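
The date ranges printed above do not overlap, which is the property a chronological split needs. As a sketch, the same split can be wrapped in a helper that checks this automatically. The name `chronological_split` and the synthetic daily dates are assumptions for illustration. Only the row count (279 = 223 + 56) is taken from the output above.

```python
import pandas as pd


def chronological_split(df: pd.DataFrame, train_frac: float = 0.8):
    """Sort by date and split so every training row precedes every test row."""
    ordered = df.sort_values("date").reset_index(drop=True)
    split_index = int(len(ordered) * train_frac)
    train = ordered.iloc[:split_index].copy()
    test = ordered.iloc[split_index:].copy()
    # Guard against leakage. The strict comparison assumes dates are unique.
    if len(train) and len(test) and not train["date"].max() < test["date"].min():
        raise ValueError("training and testing sets overlap in time")
    return train, test


# Synthetic frame with the same number of rows as the merged dataset.
demo = pd.DataFrame({"date": pd.date_range("2018-03-02", periods=279, freq="D")})
train_demo, test_demo = chronological_split(demo)
print(len(train_demo), len(test_demo))
```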
