# 1. Naive Bayes (NB)

### Product
Based on a day's news headlines, we want to develop a model that predicts whether that day's closing price will rise or fall.

Assumption: each headline (`Title`) was published before the market closed, so it affects the same day's closing price (`CP`).

Step 1: Define sentiment in headlines
- Use a model from Hugging Face



In [87]:
#Library imports
import pandas as pd
import numpy as np

from transformers import pipeline


In [80]:
#Importing and investigating the data
data = pd.read_csv('data.csv')
data.head(3)

Unnamed: 0,Title,Date,CP
0,"JPMorgan Predicts 2008 Will Be ""Nothing But Net""",2008-01-02,1447.16
1,Dow Tallies Biggest First-session-of-year Poin...,2008-01-02,1447.16
2,2008 predictions for the S&P 500,2008-01-02,1447.16


## Identify sentiment
The following code uses a transformer model to classify each headline as positive, neutral, or negative. The model's confidence becomes a signed score (positive for positive sentiment, negative for negative sentiment, 0 for neutral), and these scores are later averaged per day to obtain that day's overall sentiment. The code also calculates the change in closing price relative to the previous date in the dataset.

In [95]:
#Use finbert model, found on HuggingFace
classifier = pipeline("text-classification", model="ProsusAI/finbert")

results = []

last_price = data.iloc[0]['CP'] #Closing price of the previous date in the data, used for the day-to-day price difference

for index, row in data.iterrows():
    #Extract the sentiment from each headline.
    #Each headline gets a signed sentiment score: positive (+) for positive sentiment, 0 for neutral, and negative (-) for negative sentiment.
    #The magnitude reflects how confident the model is. If a day has more than one headline, the scores are averaged in a later step.
    output = classifier(row['Title'])

    label = output[0]['label']
    score = output[0]['score']

    #Assign sentiment score (- if negative, + if positive, 0 otherwise)
    if label == 'negative':
        sentiment_score = -score
    elif label == 'positive':
        sentiment_score = score
    else: #'neutral' (or any unexpected label) counts as no sentiment
        sentiment_score = 0

    price_difference = row['CP'] - last_price

    if price_difference > 0:
        price_increase = 1
        price_decrease = 0
    elif price_difference < 0:
        price_increase = 0
        price_decrease = 1
    else:
        price_increase = 0
        price_decrease = 0
    
    price_difference_percentage = ((row['CP'] - last_price) / last_price) * 100

    #Return score as well as price-difference to previous day both in total and as a percentage
    results.append([row['Date'], sentiment_score, price_difference, price_difference_percentage, price_increase, price_decrease])

    #If the next Date is different, change the last_price variable
    try:
        if row['Date'] != data.iloc[index +1]['Date']:
            last_price = row['CP']
    except IndexError:
        pass

df = pd.DataFrame(results, columns=["Date", "Score", "Total price difference", "Percentage price difference", "Price increase", "Price decrease"])

#Aggregate to a per-day basis
grouped_df = df.groupby(['Date'], as_index=False).mean() #Use mean of scores. This way, we represent the overall sentiment in the market on the given day
grouped_df.to_csv('aggregated_per_day.csv')
grouped_df['Positive sentiment'] = np.where(grouped_df['Score'] > 0, 1, 0)
grouped_df['Negative sentiment'] = np.where(grouped_df['Score'] < 0, 1, 0)

print("----- Data for the first 10 days: -----")
grouped_df.head(10)

Device set to use mps:0


----- Data for the first 10 days: -----


Unnamed: 0,Date,Score,Total price difference,Percentage price difference,Price increase,Price decrease,Positive sentiment,Negative sentiment
0,2008-01-02,-0.522108,0.0,0.0,0.0,0.0,0,1
1,2008-01-03,0.861627,0.0,0.0,0.0,0.0,1,0
2,2008-01-07,0.586567,-30.98,-2.140745,0.0,1.0,1,0
3,2008-01-09,-0.421696,-7.05,-0.497818,0.0,1.0,0,1
4,2008-01-10,0.542341,11.2,0.794817,1.0,0.0,1,0
5,2008-01-22,-0.602433,-109.83,-7.73271,0.0,1.0,0,1
6,2008-01-29,-0.766623,51.8,3.95269,1.0,0.0,0,1
7,2008-01-30,0.0,-6.49,-0.4764,0.0,1.0,0,0
8,2008-02-01,0.0,39.61,2.921501,1.0,0.0,0,0
9,2008-02-05,-0.308031,-58.78,-4.212352,0.0,1.0,0,1
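
To make the aggregation step concrete, here is a minimal, model-free sketch of how signed headline scores collapse into one daily score and the two sentiment flags. The labels and confidence values are invented for illustration and are not FinBERT outputs; `signed_score` and `daily_flags` are hypothetical helper names.

```python
from statistics import mean

def signed_score(label, score):
    """Map a (label, confidence) pair to a signed sentiment score."""
    if label == "negative":
        return -score
    if label == "positive":
        return score
    return 0.0  # neutral headlines contribute zero

def daily_flags(headlines):
    """Average the signed scores of one day and derive the two flags."""
    day_score = mean(signed_score(label, score) for label, score in headlines)
    return day_score, int(day_score > 0), int(day_score < 0)

# Hypothetical day with three headlines (values made up for illustration)
example_day = [("positive", 0.90), ("negative", 0.60), ("neutral", 0.95)]
day_score, positive_flag, negative_flag = daily_flags(example_day)
print(round(day_score, 2), positive_flag, negative_flag)
```

Note that neutral headlines pull the daily mean towards zero without changing its sign, and that a day is flagged as positive even when the mean is only slightly above zero.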


In [139]:
print("--- Prior probabilities ---")
print(f"P(Price Increase): {round(grouped_df['Price increase'].sum() / len(grouped_df),2)}")
print(f"P(Price Decrease): {round(grouped_df['Price decrease'].sum() / len(grouped_df),2)}")
print(f"P(Positive sentiment): {round(grouped_df['Positive sentiment'].sum() / len(grouped_df),2)}")
print(f"P(Negative sentiment): {round(grouped_df['Negative sentiment'].sum() / len(grouped_df),2)}")

--- Prior probabilities ---
P(Price Increase): 0.54
P(Price Decrease): 0.46
P(Positive sentiment): 0.25
P(Negative sentiment): 0.44


The prior probabilities above show a slightly imbalanced dataset that leans towards price increases. If we invest on a randomly chosen day, the price rises on about 54% of days (not considering broker fees etc.). Throughout the rest of the project, we investigate whether this probability can be increased using sentiment.
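
Whether sentiment can lift the base rate comes down to comparing it with conditional frequencies such as P(Price increase | Positive sentiment). The following is a minimal sketch of that comparison, using a small invented list of daily flags rather than the project's data; `conditional_rate` is a hypothetical helper.

```python
def conditional_rate(days, condition_key, outcome_key):
    """Estimate P(outcome = 1 | condition = 1) by counting days."""
    matching = [day for day in days if day[condition_key] == 1]
    if not matching:
        return None  # condition never occurs, so the rate is undefined
    return sum(day[outcome_key] for day in matching) / len(matching)

# Invented toy data: one dict per day (not taken from the real dataset)
toy_days = [
    {"positive": 1, "negative": 0, "increase": 1},
    {"positive": 1, "negative": 0, "increase": 1},
    {"positive": 1, "negative": 0, "increase": 0},
    {"positive": 0, "negative": 1, "increase": 0},
    {"positive": 0, "negative": 1, "increase": 1},
    {"positive": 0, "negative": 0, "increase": 0},
]

base_rate = sum(day["increase"] for day in toy_days) / len(toy_days)
rate_given_positive = conditional_rate(toy_days, "positive", "increase")
rate_given_negative = conditional_rate(toy_days, "negative", "increase")

print(f"P(increase) = {base_rate:.2f}")
print(f"P(increase | positive) = {rate_given_positive:.2f}")
print(f"P(increase | negative) = {rate_given_negative:.2f}")
```

Sentiment is only informative if the conditional rates differ noticeably from the base rate; if they are all close to it, no classifier built on these flags can do much better than the prior.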

## Classification of increase / decrease
The later models will attempt to predict the price change with regression. As a baseline, we first treat the problem as classification: does the sentiment of a day's headlines go along with a price increase or decrease compared to the previous day? This classification is performed with a Naive Bayes classifier.
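
For intuition, Gaussian Naive Bayes with a single continuous feature (such as the daily score) can be sketched in plain Python: each class gets a prior, a mean, and a variance, and the prediction is the class with the highest log posterior. This is a simplified illustration on invented numbers, not scikit-learn's implementation; the helper names and the variance floor `min_var` are assumptions made for this sketch.

```python
import math

def fit_gaussian_nb(xs, ys, min_var=1e-9):
    """Estimate prior, mean, and variance of the feature for each class."""
    params = {}
    for cls in sorted(set(ys)):
        values = [x for x, y in zip(xs, ys) if y == cls]
        mu = sum(values) / len(values)
        var = sum((v - mu) ** 2 for v in values) / len(values)
        # min_var is an assumed floor that avoids division by zero
        params[cls] = (len(values) / len(xs), mu, max(var, min_var))
    return params

def predict_gaussian_nb(params, x):
    """Return the class with the largest log prior + Gaussian log likelihood."""
    def log_posterior(cls):
        prior, mu, var = params[cls]
        log_likelihood = -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        return math.log(prior) + log_likelihood
    return max(params, key=log_posterior)

# Invented toy data: daily sentiment scores and whether the price rose (1) or not (0)
scores = [0.8, 0.6, 0.7, 0.1, -0.5, -0.7, -0.6, -0.2]
rose = [1, 1, 1, 1, 0, 0, 0, 0]

model = fit_gaussian_nb(scores, rose)
print(predict_gaussian_nb(model, 0.65))   # close to the class-1 mean
print(predict_gaussian_nb(model, -0.55))  # close to the class-0 mean
```

With several features, the "naive" part is that the per-feature log likelihoods are simply added, as if the features were independent given the class. For binary flags such as the sentiment indicators, the Gaussian assumption is only a rough fit.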

In [132]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def nb_accuracy(features, target, shuffle):
    #Fit a Gaussian Naive Bayes model on an 80/20 train/test split and return the test accuracy
    X, y = grouped_df[features], grouped_df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=67, shuffle=shuffle)
    gnb = GaussianNB()
    y_pred = gnb.fit(X_train, y_train).predict(X_test)
    return accuracy_score(y_test, y_pred)

categorical_features = ['Positive sentiment', 'Negative sentiment']
numerical_features = ['Score']

#Try with and without shuffling the data
#shuffle=False keeps the rows in date order, so the test set is the last 20% of days
for shuffle in [True, False]:
    ### -- Modelling for price increase / decrease based on categorical sentiment -- ###
    increase_accuracy_categorical = nb_accuracy(categorical_features, 'Price increase', shuffle)
    decrease_accuracy_categorical = nb_accuracy(categorical_features, 'Price decrease', shuffle)

    ### -- Modelling for price decrease / increase on numerical (continuous) sentiment -- ###
    decrease_accuracy_numerical = nb_accuracy(numerical_features, 'Price decrease', shuffle)
    increase_accuracy_numerical = nb_accuracy(numerical_features, 'Price increase', shuffle)

    print("--- Model Evaluation ---")
    print(f"Shuffle = {shuffle}")
    print("\n-- Predictions based on categorical values --")
    print(f"For prediction of price increases, the model has an accuracy of: {increase_accuracy_categorical:.4f}")
    print(f"For prediction of price decrease, the model has an accuracy of: {decrease_accuracy_categorical:.4f}")

    print("\n-- Predictions based on numerical (continuous) values --")
    print(f"For prediction of price increase, the model has an accuracy of: {increase_accuracy_numerical:.4f}")
    print(f"For prediction of price decrease, the model has an accuracy of: {decrease_accuracy_numerical:.4f}\n")

--- Model Evaluation ---
Shuffle = True

-- Predictions based on categorical values --
For prediction of price increases, the model has an accuracy of: 0.5883
For prediction of price decrease, the model has an accuracy of: 0.5897

-- Predictions based on numerical (continuous) values --
For prediction of price increase, the model has an accuracy of: 0.5883
For prediction of price decrease, the model has an accuracy of: 0.5912

--- Model Evaluation ---
Shuffle = False

-- Predictions based on categorical values --
For prediction of price increases, the model has an accuracy of: 0.6567
For prediction of price decrease, the model has an accuracy of: 0.6567

-- Predictions based on numerical (continuous) values --
For prediction of price increase, the model has an accuracy of: 0.5983
For prediction of price decrease, the model has an accuracy of: 0.5997



### Evaluation
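We used the Naive Bayes classifier to predict price increases and decreases from categorical as well as numerical predictors. With shuffled data, all four models reach an accuracy of about 0.59, a slight improvement over the 0.54 prior probability of a price increase. When the data is kept in chronological order (`shuffle=False`), the categorical sentiment flags reach 0.66 for both increases and decreases, while the numerical score stays at about 0.60. These figures should be read with care: the share of increases in each test split is not reported, so part of the difference may reflect the composition of the test period rather than predictive skill.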
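
One way to put such accuracies in context is to compare them with the majority-class rate of the test split itself, since a chronological hold-out can contain a different share of increases than the full dataset. Below is a minimal sketch, assuming an 80/20 chronological split and invented labels; `chronological_split` and `majority_baseline` are hypothetical helpers, and the rounding of the test size is a simplification.

```python
def chronological_split(labels, test_size=0.2):
    """Split an ordered list so the last test_size share becomes the test set."""
    n_test = max(1, round(len(labels) * test_size))
    return labels[:-n_test], labels[-n_test:]

def majority_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority = max(set(train_labels), key=train_labels.count)
    hits = sum(1 for label in test_labels if label == majority)
    return majority, hits / len(test_labels)

# Invented, date-ordered labels: 1 = price increase, 0 = no increase
labels = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
train, test = chronological_split(labels)
majority, baseline_accuracy = majority_baseline(train, test)
print(majority, baseline_accuracy)
```

In this toy example the overall share of increases is 0.7, yet always predicting an increase is right on every test day. A model's accuracy is only meaningful if it beats this baseline on the same test split.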