<a href="https://colab.research.google.com/github/submarinejuice/CP322-Final-Project-Group-9/blob/main/cp322_FINAL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
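# Install dependencies
# %pip installs into the running kernel's environment, so it still works when
# the shell has no `pip` on PATH (the `!pip` form fails in that case).

%pip install --upgrade pip

%pip install yfinance pandas numpy scikit-learn matplotlib seaborn shap tensorflow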

zsh:1: command not found: pip
zsh:1: command not found: pip


In [5]:
import yfinance as yf
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import shap
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, Attention, Input
from tensorflow.keras.optimizers import Adam

# Project Overview
#
# Neuro + Fintech + Financial Text Sentiment Predictor
#
# Goal: Predict Buy / Sell / Hold decisions by integrating multiple data modalities:
# - Stock price data: historical OHLC, returns, moving averages, volatility indicators
# - Financial news sentiment: daily sentiment scores derived from news headlines or articles
# - Simulated cognitive features: attention, stress, risk appetite, confidence (used until a real dataset is available)
#
# This project demonstrates multi-modal machine learning by combining:
# 1. Market numeric data (stocks)
# 2. Textual data (financial news sentiment)
# 3. Neuro-inspired cognitive signals
#
# Key Features of the Project:
# - Temporal modeling: cognitive features and lagged news sentiment are sequence-dependent
# - Multi-modal integration: numeric, textual, and simulated cognitive features feed into a single model
# - Evaluation & Explainability: model performance measured via accuracy and F1-score, with feature importance explored using SHAP
#
# Objectives:
# 1. Build a sequence-aware model (LSTM/GRU with attention) to predict trading actions
# 2. Demonstrate non-obvious patterns by including temporal and multi-modal dependencies
# 3. Perform ablation studies to quantify the contribution of cognitive and sentiment features
# 4. Provide interpretable insights into feature importance and model behavior

def get_stock_data(tickers, start_date, end_date):
    """Download real stock price data"""
    data = yf.download(tickers, start=start_date, end=end_date, auto_adjust=True)
    return data

TICKERS = ['AAPL', 'TSLA', 'GOOGL']
START_DATE = '2018-01-01'
END_DATE = '2024-01-01'

price_data = get_stock_data(TICKERS, START_DATE, END_DATE)
print(f"Data shape: {price_data.shape}")
print(f"Columns: {price_data.columns.tolist()}")
print(f"Date range: {price_data.index[0]} to {price_data.index[-1]}")

ModuleNotFoundError: No module named 'yfinance'

# Loading the data CSV from the repo so we always have it and don't have to import it manually
Michelle's addition

In [1]:

import os

REPO_URL = "https://github.com/submarinejuice/CP322-Final-Project-Group-9"
REPO_NAME = "CP322-Final-Project-Group-9"

if not os.path.exists(REPO_NAME):
    # First time in this Colab session: clone the repo
    !git clone {REPO_URL}
else:
    # Repo already there in this runtime: pull latest changes
    %cd {REPO_NAME}
    !git pull
    %cd /content

# Move into repo so relative paths work
%cd /content/{REPO_NAME}


Cloning into 'CP322-Final-Project-Group-9'...
remote: Enumerating objects: 29, done.
remote: Counting objects: 100% (29/29), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 29 (delta 4), reused 8 (delta 1), pack-reused 0 (from 0)
Receiving objects: 100% (29/29), 339.84 KiB | 10.00 MiB/s, done.
Resolving deltas: 100% (4/4), done.
/content/CP322-Final-Project-Group-9


In [10]:
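import os
import pandas as pd

print("Current directory:", os.getcwd())
print("Repo contents:", os.listdir())
print("DATASET contents:", os.listdir("DATASET"))

df = pd.read_csv("DATASET/AE_investment_dataset.csv")
df.info()  # prints row/column counts, dtypes, and memory usage

# Bare expressions are displayed only when they end a cell, so the three
# calls below produce no output here; wrap them in print() to see them.
df.head()
df.isna().mean().sort_values().head(20)  # 20 columns with the lowest missing fraction
df.columns.tolist()

# Print every column name, one per line
for c in df.columns:
    print(c)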




Current directory: /content/CP322-Final-Project-Group-9
Repo contents: ['README.md', '.git', 'cp322_FINAL.ipynb', 'DATASET']
DATASET contents: ['AE_investment_dataset.csv']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Columns: 364 entries, Participant_code to SCR_AnticipatoryS4_T10
dtypes: float64(356), int64(5), object(3)
memory usage: 85.4+ KB
Participant_code
Age
Gender
Nationality
Ethnicity
Played_stock_market
Played_in_years
Played_how_often
Stock_amount_S1_T1
Stock_amount_S1_T2
Stock_amount_S1_T3
Stock_amount_S1_T4
Stock_amount_S1_T5
Stock_amount_S1_T6
Stock_amount_S1_T7
Stock_amount_S1_T8
Stock_amount_S1_T9
Stock_amount_S1_T10
Stock_amount_S2_T1
Stock_amount_S2_T2
Stock_amount_S2_T3
Stock_amount_S2_T4
Stock_amount_S2_T5
Stock_amount_S2_T6
Stock_amount_S2_T7
Stock_amount_S2_T8
Stock_amount_S2_T9
Stock_amount_S2_T10
Stock_amount_S3_T1
Stock_amount_S3_T2
Stock_amount_S3_T3
Stock_amount_S3_T4
Stock_amount_S3_T5
Stock_amount_S3_T6
Stock_amount_S3_T7
Stock_amo

## Quick note, because I didn't know what PANAS meant
- PANAS is the Positive and Negative Affect Schedule, a widely used self-report scale that measures mood along two dimensions: positive affect and negative affect. Developed in 1988 by Watson, Clark, and Tellegen, it is a 20-item measure used in research and clinical settings to gauge how strongly someone experiences feelings such as interest, excitement, and enthusiasm (positive affect) versus distress, upset, and nervousness (negative affect).
## How it works
- 20 items: The scale consists of 20 words that describe different feelings and emotions.
- Two dimensions: These items are separated into two subscales: one for positive affect (PA) and one for negative affect (NA).
- Rating scale: Participants rate the extent to which they felt each item over a specific time frame (e.g., "right now," "today," "over the past few weeks") on a 5-point scale, from 1 (very slightly or not at all) to 5 (extremely).
- Scoring: The ratings for the 10 positive items are summed into a positive affect score, and the ratings for the 10 negative items are summed into a negative affect score, so each score ranges from 10 to 50. A higher positive score indicates more positive affect, while a higher negative score indicates more negative affect.
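
The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the dataset's own scoring: the item words are the standard PANAS adjectives as commonly listed, and the `ratings` dict and `score_panas` helper are hypothetical, since the PANAS column names in `AE_investment_dataset.csv` are not shown here.

```python
# Standard PANAS adjectives as commonly listed (assumption: the dataset may
# name or order its PANAS columns differently).
PA_ITEMS = ["interested", "excited", "strong", "enthusiastic", "proud",
            "alert", "inspired", "determined", "attentive", "active"]
NA_ITEMS = ["distressed", "upset", "guilty", "scared", "hostile",
            "irritable", "ashamed", "nervous", "jittery", "afraid"]


def score_panas(ratings):
    """Return (positive_affect, negative_affect) from a dict of item -> rating (1-5)."""
    for item in PA_ITEMS + NA_ITEMS:
        if not 1 <= ratings[item] <= 5:
            raise ValueError(f"rating for {item!r} must be between 1 and 5")
    positive = sum(ratings[item] for item in PA_ITEMS)
    negative = sum(ratings[item] for item in NA_ITEMS)
    return positive, negative


# Hypothetical participant: rates every positive item 4 and every negative item 2.
example = {**{item: 4 for item in PA_ITEMS}, **{item: 2 for item in NA_ITEMS}}
pa_score, na_score = score_panas(example)
print(pa_score, na_score)
```

Each subscale is a plain sum of ten ratings, so both scores fall between 10 and 50.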

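Building a per-trial table with:
1. Per-step inputs:
  - money_in_stocks
  - mean_return
  - stock_fluctuation
  - scr_anticipatory
2. Static inputs (the participant-level columns carried along in `id_cols`):
  - Age, Gender, Nationality, Ethnicity, Played_stock_market
3. Target:
  - Whether they invested in the stock (1 if money_in_stocks > 0, else 0)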
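
One way to get from the wide CSV to this per-trial table is to melt each column family and parse the suffix. The sketch below runs on a toy frame rather than the real data. It assumes every trial-level column ends in `S<number>_T<number>` (as in `SCR_AnticipatoryS4_T10`), and it treats `S` as a session index and `T` as a trial index, which is a guess. The `wide_to_long` helper and the `session`, `trial`, and `invested` names are hypothetical.

```python
import re

import pandas as pd

# Assumed suffix pattern: ..._S<session>_T<trial> or ...S<session>_T<trial>
SUFFIX_PATTERN = r"S(\d+)_T(\d+)$"


def wide_to_long(frame, prefix, value_name, id_col="Participant_code"):
    """Melt all columns starting with `prefix` into one row per (participant, session, trial)."""
    cols = [c for c in frame.columns
            if c.startswith(prefix) and re.search(SUFFIX_PATTERN, c)]
    long = frame.melt(id_vars=[id_col], value_vars=cols,
                      var_name="column", value_name=value_name)
    parts = long["column"].str.extract(SUFFIX_PATTERN).astype(int)
    long["session"] = parts[0]
    long["trial"] = parts[1]
    return long.drop(columns="column")


# Toy stand-in for the real dataset (two participants, one session, two trials).
toy = pd.DataFrame({
    "Participant_code": ["P1", "P2"],
    "Money_in_stocks_S1_T1": [0.0, 50.0],
    "Money_in_stocks_S1_T2": [20.0, 0.0],
    "SCR_AnticipatoryS1_T1": [0.1, 0.3],
    "SCR_AnticipatoryS1_T2": [0.2, 0.4],
})

keys = ["Participant_code", "session", "trial"]
money = wide_to_long(toy, "Money_in_stocks_S", "money_in_stocks")
scr = wide_to_long(toy, "SCR_AnticipatoryS", "scr_anticipatory")

# Left-join so trials without a matching SCR column are kept with NaN.
trials = money.merge(scr, on=keys, how="left").sort_values(keys).reset_index(drop=True)

# Target: 1 if any money was put into stocks on that trial, else 0.
trials["invested"] = (trials["money_in_stocks"] > 0).astype(int)
print(trials)
```

The left join matters if a column family lacks some trials: those rows survive with missing values instead of being silently dropped.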

In [6]:
import re
import pandas as pd

# 1. ID & static columns we carry along
id_cols = ["Participant_code", "Age", "Gender", "Nationality", "Ethnicity", "Played_stock_market"]

# 2. Find all trial-level columns by prefix
stock_cols = [c for c in df.columns if c.startswith("Money_in_stocks_S")]
scr_cols   = [c for c in df.columns if c.startswith("SCR_AnticipatoryS")]
ret_cols   = [c for c in df.columns if c.startswith("Mean_Return_S")]
fluc_cols  = [c for c in df.columns if c.startswith("stock_fluctuation_S")]

print("n_stock_cols:", len(stock_cols))
print("n_scr_cols:", len(scr_cols))
print("n_return_cols:", len(ret_cols))
print("n_fluctuation_cols:", len(fluc_cols))


n_stock_cols: 40
n_scr_cols: 40
n_return_cols: 36
n_fluctuation_cols: 36
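
The counts above differ (40 stock and SCR columns versus 36 return and fluctuation columns), so some trials have no return or fluctuation value. A small sketch for finding which ones, shown on toy column lists: it assumes the names end in `S<number>_T<number>`, and the `Mean_Return_S1_T1`-style names and the `trial_keys` helper are illustrative guesses, not confirmed by the output above.

```python
import re

SUFFIX_RE = re.compile(r"S(\d+)_T(\d+)$")


def trial_keys(columns):
    """Collect the (S, T) index pairs found at the end of each column name."""
    keys = set()
    for name in columns:
        match = SUFFIX_RE.search(name)
        if match:
            keys.add((int(match.group(1)), int(match.group(2))))
    return keys


# Toy lists standing in for stock_cols and ret_cols.
toy_stock_cols = [f"Money_in_stocks_S{s}_T{t}" for s in (1, 2) for t in (1, 2, 3)]
toy_ret_cols = [f"Mean_Return_S{s}_T{t}" for s in (1, 2) for t in (1, 2)]

# (S, T) pairs that have a stock column but no return column.
missing = sorted(trial_keys(toy_stock_cols) - trial_keys(toy_ret_cols))
print(missing)
```

On the real data, the same call would be `trial_keys(stock_cols) - trial_keys(ret_cols)`, provided the suffix assumption holds.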


In [3]:
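# Requires stock_cols, scr_cols, ret_cols, and fluc_cols from the column-discovery cell;
# running this cell first in a fresh kernel raises NameError.
print("Example stock cols:", stock_cols[:5])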
print("Example SCR cols:", scr_cols[:5])
print("Example return cols:", ret_cols[:5])
print("Example fluctuation cols:", fluc_cols[:5])


NameError: name 'stock_cols' is not defined