# What's missing
For next time, Monday at 13:00 at uni:

Jasi:
- Replaces the images of the graphs with ones that contain all the data
- Sets up a framework in the text

Rasmus:
- Adds the Naive Bayes text / results in "Jasmin"
- - Intended as a test / baseline, but it actually did okay :)
- Random Forest for classification and for regression


In [87]:
#Library imports
import pandas as pd
import numpy as np

from transformers import pipeline


In [80]:
#Importing and investigating the data
data = pd.read_csv('data.csv')
data.head(3)

Unnamed: 0,Title,Date,CP
0,"JPMorgan Predicts 2008 Will Be ""Nothing But Net""",2008-01-02,1447.16
1,Dow Tallies Biggest First-session-of-year Poin...,2008-01-02,1447.16
2,2008 predictions for the S&P 500,2008-01-02,1447.16


## Identify sentiment
The following code uses a transformer model to classify the sentiment of each headline as positive, neutral, or negative. Each headline gets a score based on how confident the model is, and these scores are later averaged per day to find the average sentiment on that day. The price difference to the previous day in the data is also calculated.
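
The scoring and aggregation described above can be illustrated without running the model. The sketch below is only an illustration: the labels, scores and prices are made-up values, and the `toy` and `daily` names are hypothetical. It shows how a label and a confidence turn into a signed score, and how the per-day mean and the day-over-day price change follow from that.

```python
import pandas as pd

# Hypothetical classifier outputs for four headlines over two days (made-up values)
toy = pd.DataFrame({
    "Date": ["2008-01-02", "2008-01-02", "2008-01-03", "2008-01-03"],
    "label": ["negative", "neutral", "positive", "positive"],
    "score": [0.8, 0.9, 0.6, 0.4],
    "CP": [100.0, 100.0, 102.0, 102.0],
})

# Map each label to a sign: negative -> -1, neutral -> 0, positive -> +1
sign = toy["label"].map({"negative": -1, "neutral": 0, "positive": 1})
toy["Score"] = sign * toy["score"]

# One row per day: mean signed score and that day's closing price
daily = toy.groupby("Date", as_index=False).agg(
    Score=("Score", "mean"), CP=("CP", "first")
)

# Price change relative to the previous day present in the data (0 for the first day)
daily["Total price difference"] = daily["CP"].diff().fillna(0.0)
daily["Percentage price difference"] = daily["CP"].pct_change().fillna(0.0) * 100

print(daily)
```

Note that a confident neutral headline pulls the daily mean towards zero, because its signed score is 0 regardless of the confidence.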

In [95]:
#Use finbert model, found on HuggingFace
classifier = pipeline("text-classification", model="ProsusAI/finbert")

results = []

last_price = data.iloc[0]['CP'] #This variable is used for calculating the price difference from day to day
previous_date = data.iloc[0]['Date']

for index, row in data.iterrows():
    # Extract the sentiment of each headline.
    # Each headline gets a signed sentiment score: positive (+) for positive sentiment,
    # 0 for neutral and negative (-) for negative sentiment.
    # The magnitude reflects how confident the model is. If there is more than one
    # headline per day, the scores are averaged in a later step.
    output = classifier(row['Title'])

    label = output[0]['label']
    score = output[0]['score']

    #Assign sentiment score (- if negative, + if positive)
    if label == 'negative':
        sentiment_score = -score
    elif label == 'neutral':
        sentiment_score = 0
    elif label == 'positive':
        sentiment_score = score      

    price_difference = row['CP'] - last_price

    if price_difference > 0:
        price_increase = 1
        price_decrease = 0
    elif price_difference < 0:
        price_increase = 0
        price_decrease = 1
    else:
        price_increase = 0
        price_decrease = 0
    
    price_difference_percentage = ((row['CP'] - last_price) / last_price) * 100

    #Return score as well as price-difference to previous day both in total and as a percentage
    results.append([row['Date'], sentiment_score, price_difference, price_difference_percentage, price_increase, price_decrease])

    #If the next Date is different, change the last_price variable
    try:
        if row['Date'] != data.iloc[index +1]['Date']:
            last_price = row['CP']
    except IndexError:
        pass

df = pd.DataFrame(results, columns=["Date", "Score", "Total price difference", "Percentage price difference", "Price increase", "Price decrease"])

#Aggregate to a per-day basis
grouped_df = df.groupby(['Date'], as_index=False).mean() #Use mean of scores. This way, we represent the overall sentiment in the market on the given day
grouped_df.to_csv('aggregated_per_day.csv')
grouped_df['Positive sentiment'] = np.where(grouped_df['Score'] > 0, 1, 0)
grouped_df['Negative sentiment'] = np.where(grouped_df['Score'] < 0, 1, 0)

print("----- Data for the first 10 days: -----")
grouped_df.head(10)

Device set to use mps:0


----- Data for the first 10 days: -----


Unnamed: 0,Date,Score,Total price difference,Percentage price difference,Price increase,Price decrease,Positive sentiment,Negative sentiment
0,2008-01-02,-0.522108,0.0,0.0,0.0,0.0,0,1
1,2008-01-03,0.861627,0.0,0.0,0.0,0.0,1,0
2,2008-01-07,0.586567,-30.98,-2.140745,0.0,1.0,1,0
3,2008-01-09,-0.421696,-7.05,-0.497818,0.0,1.0,0,1
4,2008-01-10,0.542341,11.2,0.794817,1.0,0.0,1,0
5,2008-01-22,-0.602433,-109.83,-7.73271,0.0,1.0,0,1
6,2008-01-29,-0.766623,51.8,3.95269,1.0,0.0,0,1
7,2008-01-30,0.0,-6.49,-0.4764,0.0,1.0,0,0
8,2008-02-01,0.0,39.61,2.921501,1.0,0.0,0,0
9,2008-02-05,-0.308031,-58.78,-4.212352,0.0,1.0,0,1


In [139]:
print("--- Prior probabilities ---")
print(f"P(Price Increase): {round(grouped_df['Price increase'].sum() / len(grouped_df),2)}")
print(f"P(Price Decrease): {round(grouped_df['Price decrease'].sum() / len(grouped_df),2)}")
print(f"P(Positive sentiment): {round(grouped_df['Positive sentiment'].sum() / len(grouped_df),2)}")
print(f"P(Negative sentiment): {round(grouped_df['Negative sentiment'].sum() / len(grouped_df),2)}")

--- Prior probabilities ---
P(Price Increase): 0.54
P(Price Decrease): 0.46
P(Positive sentiment): 0.25
P(Negative sentiment): 0.44


The prior probabilities above suggest a slightly imbalanced dataset that leans towards price increases. If we invested on randomly chosen days, the price would increase on 54% of them (not considering broker fees etc.). Throughout the rest of the project, we investigate whether this probability can be increased using sentiment.
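
To make the role of the prior concrete: a model that always predicts "increase" is correct exactly as often as the price goes up, so the prior doubles as a baseline accuracy. The sketch below uses made-up labels, not the notebook's data, and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical daily labels: 1 = price increase, 0 = no increase (made-up values)
price_increase = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])

# Prior probability of an increase
prior_increase = price_increase.mean()

# A baseline that always predicts "increase" is right whenever the price went up
always_up = np.ones_like(price_increase)
baseline_accuracy = (always_up == price_increase).mean()

print(f"P(Price Increase): {prior_increase:.2f}")
print(f"Accuracy of always predicting an increase: {baseline_accuracy:.2f}")
```

Any sentiment-based model therefore has to beat this number on the same data to be useful.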

## Classification of increase / decrease
The later models attempt to predict the size of the price increase / decrease with regression. As a baseline, we first attempt to classify whether the sentiment of a given day's headlines triggers an increase / decrease compared to the previous day. This classification is performed with the Naive Bayes classifier.
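
As a rough illustration of the idea behind Naive Bayes, the sketch below applies Bayes' rule by counting on a single binary feature. The values are made up and the variable names are hypothetical. It is a counting version of the idea only: `GaussianNB`, which the notebook uses, instead fits a normal distribution per class to each feature.

```python
import numpy as np

# Hypothetical per-day indicators (made-up values, not taken from the dataset)
positive_sentiment = np.array([1, 1, 1, 0, 0, 0, 1, 0])
price_increase = np.array([1, 1, 0, 0, 0, 1, 1, 0])

# Prior: P(increase)
p_increase = price_increase.mean()
# Likelihood: P(positive sentiment | increase)
p_pos_given_increase = positive_sentiment[price_increase == 1].mean()
# Evidence: P(positive sentiment)
p_pos = positive_sentiment.mean()

# Bayes' rule: P(increase | positive sentiment)
p_increase_given_pos = p_pos_given_increase * p_increase / p_pos

print(f"P(increase | positive sentiment): {p_increase_given_pos:.2f}")
```

With more than one feature, the "naive" part is the assumption that the features are independent given the class, so their likelihoods are simply multiplied.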

In [180]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, classification_report

#Try with and without shuffling the data
for shuffle in [True, False]:
    print(f"--- Model Evaluation (Shuffle = {shuffle}) ---")
    
    #Price Increase (Categorical)
    X_cat, y_increase = grouped_df[['Positive sentiment', 'Negative sentiment']], grouped_df['Price increase']
    X_train, X_test, y_train, y_test = train_test_split(X_cat, y_increase, test_size=0.2, random_state=67, shuffle=shuffle)
    nb_increase_cat = GaussianNB()
    y_pred_increase_cat = nb_increase_cat.fit(X_train, y_train).predict(X_test)
    
    #Use F1-Score for better evaluation of imbalanced classes (average='binary' for 1/0 targets)
    increase_f1_categorical = f1_score(y_test, y_pred_increase_cat, average='binary', zero_division=0)
    
    print("-- Predictions based on Categorical Sentiment for Price Increase --")
    print(f"Accuracy: {accuracy_score(y_test, y_pred_increase_cat):.4f}")
    print(f"F1-Score: {increase_f1_categorical:.4f}")
    
    #Price Decrease (Categorical)
    X_cat, y_decrease = grouped_df[['Positive sentiment', 'Negative sentiment']], grouped_df['Price decrease']
    X_train, X_test, y_train, y_test = train_test_split(X_cat, y_decrease, test_size=0.2, random_state=67, shuffle=shuffle)
    nb_decrease_cat = GaussianNB()
    y_pred_decrease_cat = nb_decrease_cat.fit(X_train, y_train).predict(X_test)
    decrease_f1_categorical = f1_score(y_test, y_pred_decrease_cat, average='binary', zero_division=0)

    print("-- Predictions based on Categorical Sentiment for Price Decrease --")
    print(f"Accuracy: {accuracy_score(y_test, y_pred_decrease_cat):.4f}")
    print(f"F1-Score: {decrease_f1_categorical:.4f}")
    
    #Price Decrease (Numerical/Score)
    X_num, y_decrease = grouped_df[['Score']], grouped_df['Price decrease']
    X_train, X_test, y_train, y_test = train_test_split(X_num, y_decrease, test_size=0.2, random_state=67, shuffle=shuffle)
    nb_decrease_num = GaussianNB()
    y_pred_decrease_num = nb_decrease_num.fit(X_train, y_train).predict(X_test)
    decrease_f1_numerical = f1_score(y_test, y_pred_decrease_num, average='binary', zero_division=0)
    
    print("-- Predictions based on Numerical Score for Price Decrease --")
    print(f"Accuracy: {accuracy_score(y_test, y_pred_decrease_num):.4f}")
    print(f"F1-Score: {decrease_f1_numerical:.4f}")
    
    #Price Increase (Numerical/Score)
    X_num, y_increase = grouped_df[['Score']], grouped_df['Price increase']
    X_train, X_test, y_train, y_test = train_test_split(X_num, y_increase, test_size=0.2, random_state=67, shuffle=shuffle)
    nb_increase_num = GaussianNB()
    y_pred_increase_num = nb_increase_num.fit(X_train, y_train).predict(X_test)
    increase_f1_numerical = f1_score(y_test, y_pred_increase_num, average='binary', zero_division=0)
    
    print("-- Predictions based on Numerical Score for Price Increase --")
    print(f"Accuracy: {accuracy_score(y_test, y_pred_increase_num):.4f}")
    print(f"F1-Score: {increase_f1_numerical:.4f}\n")

--- Model Evaluation (Shuffle = True) ---
-- Predictions based on Categorical Sentiment for Price Increase --
Accuracy: 0.5883
F1-Score: 0.6309
-- Predictions based on Categorical Sentiment for Price Decrease --
Accuracy: 0.5897
F1-Score: 0.5355
-- Predictions based on Numerical Score for Price Decrease --
Accuracy: 0.5912
F1-Score: 0.4271
-- Predictions based on Numerical Score for Price Increase --
Accuracy: 0.5883
F1-Score: 0.6785

--- Model Evaluation (Shuffle = False) ---
-- Predictions based on Categorical Sentiment for Price Increase --
Accuracy: 0.6567
F1-Score: 0.6107
-- Predictions based on Categorical Sentiment for Price Decrease --
Accuracy: 0.6567
F1-Score: 0.6930
-- Predictions based on Numerical Score for Price Decrease --
Accuracy: 0.5997
F1-Score: 0.3769
-- Predictions based on Numerical Score for Price Increase --
Accuracy: 0.5983
F1-Score: 0.7044



### Evaluation
We used the Naive Bayes classifier to try to predict price increases / decreases based on categorical as well as numerical predictors. With accuracies between 0.59 and 0.66, this is a slight improvement over the 0.54 prior probability of a price increase. The categorical sentiment performs best when the data is split chronologically (Shuffle = False), where it reaches an accuracy of 0.6567 for both price increases and price decreases.
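
The difference between the two settings comes down to which days end up in the test set. The sketch below uses ten numbered days as a stand-in for the real dates (an assumption for illustration) and shows that `shuffle=False` keeps the order, so the test set is the last 20% of the days.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for ten chronologically ordered days
days = np.arange(10)

# shuffle=False: train on the first 80%, test on the last 20%
train_chrono, test_chrono = train_test_split(days, test_size=0.2, shuffle=False)

# shuffle=True: test days are drawn from the whole period
train_shuffled, test_shuffled = train_test_split(
    days, test_size=0.2, shuffle=True, random_state=67
)

print("Chronological test days:", test_chrono)
print("Shuffled test days:", test_shuffled)
```

With a chronological split the model is evaluated on a period it has never seen, whereas a shuffled split mixes test days in between training days.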

## Classification using Random Forests
To supplement the NB classification approach above, we also try a Random Forest classifier.

In [178]:
from sklearn.ensemble import RandomForestClassifier

#Try with and without shuffling the data
for shuffle in [True, False]:
    print(f"--- Model Evaluation (Shuffle = {shuffle}) ---")
    
    #Price Increase (Categorical)
    X_cat, y_increase = grouped_df[['Positive sentiment', 'Negative sentiment']], grouped_df['Price increase']
    X_train, X_test, y_train, y_test = train_test_split(X_cat, y_increase, test_size=0.2, random_state=67, shuffle=shuffle)
    rf_increase_cat = RandomForestClassifier(max_depth=3, n_estimators=1000, random_state=67)
    y_pred_increase_cat = rf_increase_cat.fit(X_train, y_train).predict(X_test)
    
    #Use F1-Score for better evaluation of imbalanced classes (average='binary' for 1/0 targets)
    increase_f1_categorical = f1_score(y_test, y_pred_increase_cat, average='binary', zero_division=0)
    
    print("-- Predictions based on Categorical Sentiment for Price Increase --")
    print(f"Accuracy: {accuracy_score(y_test, y_pred_increase_cat):.4f}")
    print(f"F1-Score: {increase_f1_categorical:.4f}")
    
    #Price Decrease (Categorical)
    X_cat, y_decrease = grouped_df[['Positive sentiment', 'Negative sentiment']], grouped_df['Price decrease']
    X_train, X_test, y_train, y_test = train_test_split(X_cat, y_decrease, test_size=0.2, random_state=67, shuffle=shuffle)
    rf_decrease_cat = RandomForestClassifier(max_depth=3, n_estimators=1000, random_state=67)
    y_pred_decrease_cat = rf_decrease_cat.fit(X_train, y_train).predict(X_test)
    decrease_f1_categorical = f1_score(y_test, y_pred_decrease_cat, average='binary', zero_division=0)

    print("-- Predictions based on Categorical Sentiment for Price Decrease --")
    print(f"Accuracy: {accuracy_score(y_test, y_pred_decrease_cat):.4f}")
    print(f"F1-Score: {decrease_f1_categorical:.4f}")
    
    #Price Decrease (Numerical/Score)
    X_num, y_decrease = grouped_df[['Score']], grouped_df['Price decrease']
    X_train, X_test, y_train, y_test = train_test_split(X_num, y_decrease, test_size=0.2, random_state=67, shuffle=shuffle)
    rf_decrease_num = RandomForestClassifier(max_depth=3, n_estimators=1000, random_state=67)
    y_pred_decrease_num = rf_decrease_num.fit(X_train, y_train).predict(X_test)
    decrease_f1_numerical = f1_score(y_test, y_pred_decrease_num, average='binary', zero_division=0)
    
    print("-- Predictions based on Numerical Score for Price Decrease --")
    print(f"Accuracy: {accuracy_score(y_test, y_pred_decrease_num):.4f}")
    print(f"F1-Score: {decrease_f1_numerical:.4f}")
    
    #Price Increase (Numerical/Score)
    X_num, y_increase = grouped_df[['Score']], grouped_df['Price increase']
    X_train, X_test, y_train, y_test = train_test_split(X_num, y_increase, test_size=0.2, random_state=67, shuffle=shuffle)
    rf_increase_num = RandomForestClassifier(max_depth=3, n_estimators=1000, random_state=67)
    y_pred_increase_num = rf_increase_num.fit(X_train, y_train).predict(X_test)
    increase_f1_numerical = f1_score(y_test, y_pred_increase_num, average='binary', zero_division=0)
    
    print("-- Predictions based on Numerical Score for Price Increase --")
    print(f"Accuracy: {accuracy_score(y_test, y_pred_increase_num):.4f}")
    print(f"F1-Score: {increase_f1_numerical:.4f}\n")

--- Model Evaluation (Shuffle = True) ---
-- Predictions based on Categorical Sentiment for Price Increase --
Accuracy: 0.5883
F1-Score: 0.6309
-- Predictions based on Categorical Sentiment for Price Decrease --
Accuracy: 0.5897
F1-Score: 0.5355
-- Predictions based on Numerical Score for Price Decrease --
Accuracy: 0.5926
F1-Score: 0.4457
-- Predictions based on Numerical Score for Price Increase --
Accuracy: 0.5912
F1-Score: 0.6764

--- Model Evaluation (Shuffle = False) ---
-- Predictions based on Categorical Sentiment for Price Increase --
Accuracy: 0.6567
F1-Score: 0.6107
-- Predictions based on Categorical Sentiment for Price Decrease --
Accuracy: 0.6567
F1-Score: 0.6930
-- Predictions based on Numerical Score for Price Decrease --
Accuracy: 0.6510
F1-Score: 0.5488
-- Predictions based on Numerical Score for Price Increase --
Accuracy: 0.6510
F1-Score: 0.7154



### Evaluation
The results are very similar to those of NB, while NB is slightly faster to compute. This is an argument for sticking to NB.
It is strange that 

## Regression
Here we try to predict the exact stock return (in percent) using different regression methods.

### Random Forest Regression


In [197]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error

#Try with and without shuffling the data
for shuffle in [True, False]:
    print(f"--- Model Evaluation (Shuffle = {shuffle}) ---")
    
    #Percentage price difference (Numerical/Score)
    X, y = grouped_df[['Score']].astype(float), grouped_df['Percentage price difference'].astype(float)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=67, shuffle=shuffle)
    rf = RandomForestRegressor(max_depth=3, n_estimators=1000, random_state=67)
    y_pred_rf = rf.fit(X_train, y_train).predict(X_test)
    
    rmse = root_mean_squared_error(y_true=y_test, y_pred=y_pred_rf)
    # Calculating "hit rate" (directional accuracy)
    y_test_sign = np.sign(y_test)
    y_pred_sign = np.sign(y_pred_rf)
    directional_hits = (y_test_sign == y_pred_sign)
    directional_accuracy = np.mean(directional_hits)

    print(directional_accuracy)
    print(f"RMSE: {rmse}")

--- Model Evaluation (Shuffle = True) ---
0.5883190883190883
RMSE: 1.2454612716669526
--- Model Evaluation (Shuffle = False) ---
0.6509971509971509
RMSE: 1.062711645210153
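
The two numbers printed for each split are the directional accuracy (hit rate) and the RMSE. The sketch below computes both by hand on four made-up returns, to show what each one measures. The values are not from the dataset.

```python
import numpy as np

# Hypothetical true and predicted daily returns in percent (made-up values)
y_true = np.array([1.0, -2.0, 0.5, -0.5])
y_pred = np.array([0.5, -1.0, -0.5, -1.5])

# RMSE: square root of the mean squared error, in the same unit as the returns
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Directional accuracy: share of days where the predicted sign matches the true sign
directional_accuracy = np.mean(np.sign(y_true) == np.sign(y_pred))

print(f"Directional accuracy: {directional_accuracy}")
print(f"RMSE: {rmse}")
```

The two metrics can disagree: a prediction can have the right sign but be far off in size, or be close in size but on the wrong side of zero, as on the third day here.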
