# Financial Stock Price Prediction with News Sentiment

## Introduction

This notebook is aimed to serve as an introduction to the creation of a recommender that operates on an open dataset of stock market prices in order to predict prices in the future. This data is converted into various financial metrics and is also enriched with sentiment analysis information derived from news headlines that are mapped to specific tickers. It covers the data download and processing steps, the calculation of the features fed into the prediction model, and finally a simple model that predicts the profitability of assets to rank them. 

Before we get started, we need to confgure where we are to store the dataset and models produced. If you are running this notebook locally, then any folder on your machine should be fine. If you are working within a container, you may need to change the directory to a mounted writable file system.

In [1]:
# Local Mode
#storageDIR = "HugeStockMarketDataset" # creates a dataset directory in the same folder as the notebook
#storageDIRNews = "NewsSentimentDataset"
# Container Mode
storageDIR = "/tmp/iPythonNotebooks/dataset/stock" # creates a dataset directory in the /tmp/ directory for the container
storageDIRNews = "/tmp/iPythonNotebooks/dataset/news" # creates a dataset directory in the /tmp/ directory for the container


## Dataset

Different types of financial asset recommendation system use different types of data to prodice their recommendations. This approach is known as Profitability Prediction, where assets that are predicted to gain significant value over the following year are recommended. This type of approach uses past pricing data, i.e. the price for different assets over time to identify pricing trends and hence future profitable assets. Hence, as input we need a dataset that for a range of assets contains their price history over time.

For illustration, in this notebook we will use an open dataset, although you may wish to swap this for data from your market of choice. The dataset used here is the Huge Stock Market Dataset compiled by Boris Marjanovic, and is publically available <b><a href='https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs'>here</a>.</b>

The dataset comprises historical price and volume data for all US-based stocks and ETFs trading on the NYSE, NASDAQ, and NYSE markets, and runs up to the last quarter of 2017. For each financial asset (stock or ETF), it contains a series of price entries describing the market price of that asset on different days. Each entry is comprised of:
 - Date: The date of the pricing data 
 - Open: Opening price for that day
 - High: The maximum price for that day
 - Low: The minimum price for that day
 - Close: The closing price for that day
 - Volume: The amount of the asset that is traded 
 - OpenInt: The total number of outstanding contracts held by market participants

## Downloading the Dataset

This dataset can be downloaded through the <b><a href='https://github.com/Kaggle/kaggle-api'>Kaggle API</a></b>. Users should make an account on the Kaggle website and download an API token in order to access this dataset on their local machine. This is available from the Account section of the Kaggle user profile.

To use the Kaggle API, we first install the package through pip. We then export the username and the API key from the aforementioned token, and use the download command to fetch and unzip the dataset. For our current experiment we will use only the set of Stocks.

(The below commands can also be entered directly into your terminal, without the ! prefix.)

In [2]:
import os
os.environ["KAGGLE_USERNAME"] = "Your Kaggle user name here"
os.environ["KAGGLE_KEY"] = "Your Kaggle user key here"

!pip install kaggle
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()
api.dataset_download_files('borismarjanovic/price-volume-data-for-all-us-stocks-etfs', path=storageDIR)
print("Download Complete")

You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.[0m
Download Complete


## Extracting and Loading the Dataset

After downloading the dataset we need to extract the files and then convert it into a Pandas DataFrame, which is somewhat like a large data table that makes the raw data easier to analyse. 

In [3]:
import zipfile
import pandas as pd
import numpy as np
import glob, os, random, math

# Unzip the Dataset
with zipfile.ZipFile(storageDIR+"/price-volume-data-for-all-us-stocks-etfs.zip", 'r') as zip_ref:
    zip_ref.extractall(storageDIR)

# Replace this with your dataset path
path = storageDIR+'/Stocks/' 
all_files = glob.glob(os.path.join(path, "*.us.txt"))
dfs = []

# Iterating through files and only using non-empty files
for f in all_files:
    if os.path.getsize(f) > 0:
        df = pd.read_csv(f) 
        df['Stock'] = f.split('/')[-1].split('.')[0]
        dfs.append(df)
full_kaggle_df = pd.concat(dfs)

print("Dataset Extraction and Loading as Dataframe Complete")

Dataset Extraction and Loading as Dataframe Complete


In [4]:
full_kaggle_df.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,OpenInt,Stock
0,2005-02-25,3.0,3.0,2.88,2.93,5100,0,axdx
1,2005-02-28,2.86,2.92,2.83,2.91,5400,0,axdx
2,2005-03-01,2.89,2.99,2.87,2.97,10300,0,axdx
3,2005-03-02,2.97,2.97,2.9,2.9,4900,0,axdx
4,2005-03-03,2.91,2.91,2.86,2.86,5000,0,axdx


## Filtering the Dataset

Pandas allows us to perform manipulations on the pricing data so that we can extract only what we need for training the model. For the purpose of our illustration, we will only use pricing data from 2016 and 2017, where we consider 2016 as the 'past' and hence we can use that period for learning what assets are likely to be profitable, and 2017 as the future (which we will use to test how well our model performs later). In effect, for our experiment here we can consider the date of asset recommendation to be the 1st of February 2017. 

Lets first filter the dataset to only hold data from the dates we care about:

In [5]:
#full_kaggle_df['Date'] = pd.to_datetime(full_kaggle_df['Date'])
#full_kaggle_df['year'] = full_kaggle_df['Date'].dt.year
#full_kaggle_df['month'] = full_kaggle_df['Date'].dt.month
#full_kaggle_df['day'] = full_kaggle_df['Date'].dt.day

# Selecting only that data from either 2016 or 2017
full_kaggle_df = full_kaggle_df[(full_kaggle_df['Date'] >= '2016-01-01')]
#full_kaggle_df['Date'] = full_kaggle_df['Date'].dt.strftime('%Y-%m-%d')

stocks = full_kaggle_df['Stock'].unique().tolist()
pricedfs = []
for s in stocks:
    df = full_kaggle_df[full_kaggle_df['Stock'] == s]
    df = df.rename(columns={'Open': 'open', 'High': 'high', 'Low': 'low', 'Close':'close', 'Volume': 'volume'})
    df = df.iloc[1:]
    pricedfs.append(df)
    
print("Dataset Filtering Complete")

Dataset Filtering Complete


In [54]:
full_kaggle_df

Unnamed: 0,Date,Open,High,Low,Close,Volume,OpenInt,Stock
2650,2016-01-04,21.50,21.90,20.500,21.27,235999,0,axdx
2651,2016-01-05,21.25,21.87,20.920,21.25,130507,0,axdx
2652,2016-01-06,21.00,21.21,20.360,20.68,196246,0,axdx
2653,2016-01-07,20.53,20.53,19.800,19.97,319336,0,axdx
2654,2016-01-08,22.96,22.97,19.840,19.95,317046,0,axdx
...,...,...,...,...,...,...,...,...
3196,2017-11-06,42.11,42.40,41.815,41.84,658811,0,enr
3197,2017-11-07,41.89,41.89,41.070,41.83,1496501,0,enr
3198,2017-11-08,43.94,44.83,40.640,42.80,2696595,0,enr
3199,2017-11-09,42.25,43.42,41.780,42.57,840855,0,enr


In [55]:
pricedfs

[            Date   open   high    low  close  volume  OpenInt Stock  \
 2651  2016-01-05  21.25  21.87  20.92  21.25  130507        0  axdx   
 2652  2016-01-06  21.00  21.21  20.36  20.68  196246        0  axdx   
 2653  2016-01-07  20.53  20.53  19.80  19.97  319336        0  axdx   
 2654  2016-01-08  22.96  22.97  19.84  19.95  317046        0  axdx   
 2655  2016-01-11  19.80  20.08  18.31  18.97  526832        0  axdx   
 ...          ...    ...    ...    ...    ...     ...      ...   ...   
 3115  2017-11-06  18.00  19.55  17.60  19.40  998854        0  axdx   
 3116  2017-11-07  19.50  19.50  18.55  18.80  729410        0  axdx   
 3117  2017-11-08  19.00  19.05  18.05  18.90  678804        0  axdx   
 3118  2017-11-09  18.90  19.70  18.30  19.55  685261        0  axdx   
 3119  2017-11-10  19.55  19.65  18.55  18.75  442027        0  axdx   
 
              tp   tr_    tr    atr_14  plus_dm  down_dm     adx_14  
 2651  21.346667   NaN   NaN       NaN     0.00     0.00        

## Feature Creation for the Model

Now that we have the pricing data in a more useful form, we can now convert that data into additional indicators that a machine learned model can use for identifying patterns/trends. In effect, we want to capture how the price for an asset changed in the recent past, for use as indicators for future performance (of course past performance is not always a good indicator, and more advanced approaches may mix in other sources of evidence here). We convert the pricing data into 14 different indicator (feature) types:

**NOTE:** In the following equations, the sub-index $t$ indicates the time of computation of the metric. $t-1$ might indicate, then, the previous day, and so on.

1. <b>True range</b>: The average true range (ATR) is a market volatility indicator. The true range indicator is taken as the greatest of the following: current high less the current low; the absolute value of the current high less the previous close; and the absolute value of the current low less the previous close. The ATR is a moving average of the true ranges. Usually, it is computed over 14 days ($n=14$)

\begin{equation}
\text{TR}_t = \max{\left(\text{High}_t - \text{Low}_t, |\text{High}_t - \text{Close}_{t-1}|, |\text{Low}_t - \text{Close}_{t-1}|\right)}
\end{equation}

\begin{equation}
    \text{ATR}_t(n) = \frac{(n-1)\cdot\text{ATR}_{t-1}   + \text{TR}_t}{n}
\end{equation}

2. <b>Average directional index </b>: The average directional index (ADX) is a technical analysis indicator used by some traders to determine the strength of a trend. The ADX makes use of a positive (+DI) and negative (-DI) directional indicator in addition to the trendline. The ADX identifies a strong trend when it is over 25 and a weak trend when it is below 20. Crossovers of the -DI and +DI lines can be used to generate trade signals. Usually, it is computed over a period of 14 days ($n=14$)

\begin{equation}
\text{ADX}_t(n) = \frac{(n-1)\cdot\text{ADX}_{t-1}(n) + \text{DX}_{t}(n)}{n}
\end{equation}

\begin{equation}
\text{DX}_t(n) = 100\cdot\frac{\left|\text{+DI}_t(n) - \text{-DI}_t(n)\right|}{\left|\text{+DI}_t(n) + \text{-DI}_t(n)\right|}
\end{equation}

\begin{equation}
\text{(+/-)DI}_t(n) = 100\cdot\frac{\text{(+/-)SmDM}_t(n)}{\text{ATR}_t(n)}
\end{equation}

\begin{equation}
\text{(+/-)smDM}_t(n) = \sum_{i=1}^{n}\text{(+/-)DM}_{t-i} - \frac{1}{n}\sum_{i=1}^{n}\text{(+/-)DM}_{t-i} + \text{(+/-)DM}_{t}
\end{equation}

\begin{equation}
\text{+DM}_t = \begin{cases}
\text{High}_t - \text{High}_{t-1} & \text{if } \text{High}_t - \text{High}_{t-1} > \text{Low}_{t-1} - \text{Low}_{t}\\
0 & \text{otherwise}
\end{cases}
\end{equation}

\begin{equation}
\text{-DM}_t = \begin{cases}
\text{Low}_{t-1} - \text{Low}_{t} & \text{if } \text{High}_t - \text{High}_{t-1} < \text{Low}_{t-1} - \text{Low}_{t}\\
0 & \text{otherwise}
\end{cases}
\end{equation}

3. <b>Moving average convergence divergence</b>: Moving average convergence divergence (MACD) is a trend-following momentum indicator that shows the relationship between two moving averages of a security’s price. The MACD is calculated by subtracting the 26-period exponential moving average (EMA) from the 12-period EMA.
\begin{equation}
\text{EMA}_t(n) = \left(\text{Close}_t * \left(\frac{\alpha}{1 + n}\right)\right) + EMA_{t-1}(n) * \left(1 - \left(\frac{\alpha}{1 + n}\right)\right) \\
\end{equation}
where $\alpha$ is an smoothing factor (we take here as $\alpha=2$) and $n$ is the number of days in the period. Then:

\begin{equation}
\text{MACD}_t = \text{EMA}_t(12) - \text{EMA}_t(26)
\end{equation}

4. <b>Momentum</b>: Momentum is the rate of acceleration of a security's price. It refers to the inertia of a price trend to continue either rising or falling for a particular length of time, usually taking into account both price and volume information. Here we calculate momentum as the difference between the close prices over 1, 3, 5, 7, 14, 21, and 28 trading days respectively. If we denote by $n$ the number of trading days:

\begin{equation}
\text{Momentum}_t(n) = \text{Close}_t - \text{Close}_{t-n}
\end{equation}


5. <b>Rate of change</b>: The rate of change (ROC) is the speed at which a variable changes over a specific period of time. ROC is often used when speaking about momentum.

\begin{equation}
\text{ROC}_t(n) = \frac{\text{Momentum}_t(n)}{\text{Close}_t}
\end{equation}

6. <b>Relative strength index</b>: The relative strength index (RSI) is a momentum indicator that measures the magnitude of recent price changes to evaluate overbought or oversold conditions in the price of a stock or other asset. The RSI is displayed as an oscillator (a line graph that moves between two extremes) and can have a reading from 0 to 100. Here, again, the common period to use is 14 days ($n$ = 14).

\begin{equation}
\text{RSI}_t(n) = 100 - \left(\frac{100}{1 + \text{RS}_t(n)}\right)
\end{equation}

\begin{equation}
\text{RS}_t(n) = \frac{\text{EMAGain}_t(n)}{\text{EMALoss}_t(n)}
\end{equation}

\begin{equation}
\text{EMAGain}_t(n) = \frac{(n-1)\cdot\text{EMAGain}_t(n) + \text{Gain}_t}{n}
\end{equation}

\begin{equation}
\text{Gain}_t = \begin{cases}
                    \text{Close}_t - \text{Close}_{t-1} & \text{if } \text{Close}_t > \text{Close}_{t-1} \\
                    0 & \text{otherwise}
                    \end{cases}
\end{equation}

\begin{equation}
\text{EMALoss}_t(n) = \frac{(n-1)\cdot\text{EMALoss}_t(n) + \text{Loss}_t}{n}
\end{equation}

\begin{equation}
\text{Loss}_t = \begin{cases}
                    \text{Close}_{t-1} - \text{Close}_t & \text{if } \text{Close}_t < \text{Close}_{t-1} \\
                    0 & \text{otherwise}
                    \end{cases}
\end{equation}

7. <b>Vortex indicator</b>: A vortex indicator (VI) is an indicator composed of two lines - an uptrend line (VI+) and a downtrend line (VI-). These lines are typically colored green and red respectively. A vortex indicator is used to spot trend reversals and confirm current trends.

\begin{equation}
\text{VI+}_t(n) = \frac{\text{SumVM+}_t(n)}{\text{SumTR}_t(n)}
\end{equation}

\begin{equation}
\text{VI-}_t(n) = \frac{\text{SumVM-}_t(n)}{\text{SumTR}_t(n)}
\end{equation}

\begin{equation}
\text{SumTR}_t(n) = \sum_{i = 0}^{n-1} \text{TR}_{t-i}
\end{equation}

\begin{equation}
\text{SumVM(+/-)}_t(n) = \sum_{i = 0}^{n-1} \text{VM(+/-)}_{t-i}
\end{equation}

\begin{equation}
\text{VM+}_t = \left| \text{High}_t - \text{Low}_{t-1}\right|
\end{equation}
\begin{equation}
\text{VM-}_t = \left| \text{Low}_t - \text{High}_{t-1}\right|
\end{equation}

8. <b>Detrended close oscillator</b>: A detrended price oscillator, used in technical analysis, strips out price trends in an effort to estimate the length of price cycles from peak to peak or trough to trough. Unlike other oscillators, such as the MACD, the DPO is not a momentum indicator. It instead highlights peaks and troughs in price, which are used to estimate buy and sell points in line with the historical cycle.

\begin{equation}
\text{DCO}_t(n) = \text{Close}_{t-(n/2 + 1)} - \text{SMA}_t(n)
\end{equation}

\begin{equation}
\text{SMA}_t(n) = \frac{1}{n}\sum_{i=0}^{n-1} \text{Close}_{t-i}
\end{equation}

9. <b>Returns</b>: The returns on investment (ROI) represent the percentage change between close prices on different dates, across different periods.

\begin{equation}
\text{ROI}_t(n) = \frac{\text{Close}_t - \text{Close}_{t-n}}{\text{Close}_{t-n}}
\end{equation}

10. <b>Volatility</b>: Volatility represents the risk of a stock as expressed by its fluctuations, and is expressed as the standard deviation of the logarithmic returns of the stock. In this case, we take the daily returns.
\begin{equation}
\text{Volatility}_t(N,n) = \sqrt{\frac{1}{N-1} \sum_{i=0}^{N-1} \log^2(\text{ROI}_{t-i}(n)) - \left(\frac{1}{N-1} \sum_{i=0}^{N-1} \log(\text{ROI}_{t-i}(n))\right)^2} * \sqrt{n}
\end{equation}
Here, $N$ represents the number of periods we consider for measuring the Volatility (here, we take $N$ days), and $n$ represents the period of time for computing the ROI (here, we take $n = 1$ day). In the right square root, $n$ is the number of periods covered by the ROI calculation. For instance, if we took a monthly measure of ROI, we should measure $n$ in months. In this example, as each period is equal to a day, we take $n = 1$.


11. <b>Force index</b>: The force index (FI) is a technical indicator that measures the amount of power used to move the price of an asset. The force index uses price and volume to determine the amount of strength behind a price move. The index is an oscillator, fluctuating between positive and negative territory. It is unbounded meaning the index can go up or down indefinitely. It is used for trend and breakout confirmation, as well as spotting potential turning points by looking for divergences.

\begin{equation}
\text{FI}_t(1) = \left(\text{Close}_t - \text{Close}_{t-1}\right) \cdot \text{Volume}_t
\end{equation}

\begin{equation}
\text{FI}_t(n) = \left(\text{FI}_t(1) \cdot \left(\frac{\alpha}{1 + n}\right)\right) + \text{FI}_{t-1}(n) \cdot \left(1 - \left(\frac{\alpha}{1 + n}\right)\right)
\end{equation}

12. <b>Accumulation/Distribution index</b>: The accumulation/distribution indicator (A/D) is a cumulative indicator that uses volume and price to assess whether a stock is being accumulated or distributed. The A/D measure seeks to identify divergences between the stock price and the volume flow. This provides insight into how strong a trend is.
\begin{equation}
\text{A/D}_t = \text{A/D}_{t-1} + \text{MFV}_t
\end{equation}
where the Money Flow Volume (MFV) is
\begin{equation}
\text{MFV}_t = \text{MFM}_t \cdot \text{Volume}_t
\end{equation}
and the Money Flow Multiplier (MFM) is computed as:
\begin{equation}
\text{MFM}_t = \frac{(\text{Close}_t - \text{Low}_t)  - (\text{High}_t - \text{Close}_t)}{\text{High}_t - \text{Low}_t}
\end{equation}

13. <b>Chaikin oscillator</b>: This estimator measures the difference between the three day and ten day exponential moving averages of the accumulation/distribution index. It measures the momentum predicted by oscillations around the accumulation-distribution line.

\begin{equation}
\text{Chaikin}_t = \text{EMAA\D}_t(3) - \text{EMAA\D}_t(10)
\end{equation}

\begin{equation}
\text{EMAA\D}_t(n) = \left(\text{A\D}_t \cdot \left(\frac{\alpha}{1 + n}\right)\right) + \text{EMAA\D}_{t-1}(n) \cdot \left(1 - \left(\frac{\alpha}{1 + n}\right)\right)
\end{equation}

13. <b>Min-max</b>: This presents the minimum and maximum close price over a specific period.






In [14]:
def true_range(df, N=14):
    atr_name = 'atr_' + str(N)
    df['tr'] = np.maximum(df["high"], df["close"].shift(1)) - np.minimum(df["low"], df["close"].shift(1))
    df[atr_name] = df['tr'].ewm(alpha=1/N, min_periods=N).mean()
    
    return df

def average_directional_index(df, N=14):
    adx_name = 'adx_' + str(N)
    atr_name = 'atr_' + str(N)
    
    if not atr_name in df.columns:
        true_range(df, N)

    upmove =  df['high'] - df['high'].shift(1)
    downmove = df['low'].shift(1) - df['low']

    df['plus_dm'] = np.where((upmove > downmove) & (upmove > 0), upmove, 0)
    df['down_dm'] = np.where((downmove > upmove) & (downmove > 0), downmove, 0)
    
    upi = 100 * df['plus_dm'].ewm(alpha=1/N, min_periods=N).mean() /  df[atr_name]
    downi = 100 * df['down_dm'].ewm(alpha=1/N, min_periods=N).mean() /  df[atr_name]
    df[adx_name] = 100 * (np.abs(upi - downi) / (upi + downi)).ewm(alpha=1/N, min_periods=14).mean()
    df =  df.drop(['plus_dm', 'down_dm'], axis=1)
    return df

def moving_average_convergence_divergence(df):
    close_EMA_26 = df['close'].ewm(span=26, adjust=False).mean()
    close_EMA_12 = df['close'].ewm(span=12, adjust=False).mean()

    df['MACD'] = close_EMA_12 - close_EMA_26
    return df

def momentum(df, periods=[1,3,5,7,14,21,28]):
    for t in periods:
        df[f"m_{t}"] = df['close'].diff(t)
    return df

def rate_of_change(df, periods=[1,3,5,7,14,21,28]):
    for t in periods:
        df[f"roc_{t}"] = df[f"m_{t}"] / df['close'].shift(t)
    return df

def relative_strength_index(df, N=14):
    u = df['close'].diff()
    d = df['close'].shift(1) - df['close']
    df['up'] = np.where(u > 0, u, 0)
    df['down'] = np.where(d > 0, d, 0)
    rsi_name = 'rsi_' + str(N)
    df[rsi_name] = 100 - 100 / ( 1 + df['up'].ewm(span=N, adjust=False).mean() / df['down'].ewm(span=N, adjust=False).mean())

    df = df.drop(['up', 'down'], axis=1)
    return df

def vortex_indicator(df, N=14):
    if not 'tr' in df.columns:
        true_range(df, N)
    
    vm_up = np.abs(df['high'] - df['low'].shift(1))
    vm_down = np.abs(df['low'] - df['high'].shift(1))

    tr_14 = df['tr'].rolling(window=N).sum()
    vm_up_14 = vm_up.rolling(window=N).sum()
    vm_down_14 = vm_down.rolling(window=N).sum()

    df[f"vi_{N}_plus"] = vm_up_14 / tr_14
    df[f"vi_{N}_neg"] = vm_down_14 / tr_14

    return df

def detrended_close_oscillator(df, N=22):
    dco_name = 'dco_' + str(N)
    mid_index = int(N/2+1)
    df[dco_name] = df['close'].shift(mid_index) - df['close'].rolling(window=N).mean()
    return df

def returns(df, periods=[1,3,5,7,14,21,28,84,168]):
    for t in periods:
        df[f"return_{t}"] = (df['close'] - df['close'].shift(t)) / df['close'].shift(t)
   # df['log_return_1'] = np.log(df['close'] / df['close'].shift(1))
    return df

def log_returns(df, periods=[1,3,5,7,14,21,28,84,168]):
    for t in periods:
        df[f"log_return_{t}"] = (df['close'] - df['close'].shift(t)) / df['close'].shift(t)
    return df

def volatility(df, roi_periods = [1], periods=[3,5,7,14,21,28,84,168]):
    for n in roi_periods:
        name = f"log_return_{n}"
        if not name in df.columns:
            log_returns(df, roi_periods)
            break
    
    for t in periods:
        for n in roi_periods:
            df[f"volatility_{t}_{n}"] = df[f"log_return_{n}"].rolling(window=t).std()*np.sqrt(n)

    df['3_28_volatility_ratio'] = df['volatility_3_1'] / df['volatility_28_1']
    return df

def force_index(df):
    df['force_index'] = (df['close'] - df['close'].shift(1)) * df['volume']
    return df

def accumulation_distribution_index(df):
    df['accdist'] = ((2 * df['close'] - (df['low'] + df['high'])) / (df['high'] - df['low'])) * df['volume']
    df['accdist'] = df['accdist'].expanding().sum()
    return df

def chaikin_oscillator(df):
    if not 'accdist' in df.columns:
        accumulation_distribution_index(df)
    df['chakin_oscillator'] = df['accdist'].ewm(span=3).mean() - df['accdist'].ewm(span=10).mean()
    
    return df

def min_max(df, periods=[3,5,7,14,21,28]):
    for t in periods:
        df[f"min_{t}"] = df['close'].rolling(window=t).min()
        df[f"max_{t}"] = df['close'].rolling(window=t).max()

        
        df[f'exp_mean_{t}'] = df['close'].ewm(span=t).mean()

    return df

def mean_price(df, periods=[3,5,7,14,21,28,84,168]):
    for t in periods:
        df[f'mean_{t}'] = df['close'].rolling(window=t).mean()
    
    return df


In [57]:
newpricedfs = []
for p in pricedfs:
    if not p.empty:
        p['tp'] = (p['high'] + p['low'] + p['close']) / 3
        p1 = true_range(p)
        p1 = average_directional_index(p1)
        p1 = moving_average_convergence_divergence(p1)
        p1 = momentum(p1)
        p1= rate_of_change(p1)
        p1 = relative_strength_index(p1)
        p1 = vortex_indicator(p1)
        p1 = detrended_close_oscillator(p1)
        p1 = returns(p1)
        p1 = volatility(p1)
        p1 = force_index(p1)
        p1 = accumulation_distribution_index(p1)
        p1 = chaikin_oscillator(p1)
        p1 = min_max(p1)
        p1 = mean_price(p1)
        newpricedfs.append(p1)
print ("Metrics calculated for all stocks")

Metrics calculated for all stocks


In [58]:
newpricedfs[0]

Unnamed: 0,Date,open,high,low,close,volume,OpenInt,Stock,tp,tr_,...,max_28,exp_mean_28,mean_3,mean_5,mean_7,mean_14,mean_21,mean_28,mean_84,mean_168
2651,2016-01-05,21.25,21.87,20.92,21.25,130507,0,axdx,21.346667,,...,,21.250000,,,,,,,,
2652,2016-01-06,21.00,21.21,20.36,20.68,196246,0,axdx,20.750000,0.89,...,,20.954821,,,,,,,,
2653,2016-01-07,20.53,20.53,19.80,19.97,319336,0,axdx,20.100000,0.88,...,,20.602830,20.633333,,,,,,,
2654,2016-01-08,22.96,22.97,19.84,19.95,317046,0,axdx,20.920000,3.13,...,,20.421735,20.200000,,,,,,,
2655,2016-01-11,19.80,20.08,18.31,18.97,526832,0,axdx,19.120000,1.77,...,,20.088485,19.630000,20.164,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3115,2017-11-06,18.00,19.55,17.60,19.40,998854,0,axdx,18.850000,1.95,...,22.55,20.236996,18.900000,19.320,19.328571,19.335714,19.730952,20.360714,23.047321,24.703720
3116,2017-11-07,19.50,19.50,18.55,18.80,729410,0,axdx,18.950000,0.95,...,22.55,20.137892,18.733333,19.110,19.264286,19.232143,19.597619,20.239286,22.937798,24.674554
3117,2017-11-08,19.00,19.05,18.05,18.90,678804,0,axdx,18.666667,1.00,...,22.55,20.052520,19.033333,18.880,19.185714,19.157143,19.519048,20.112500,22.819345,24.647470
3118,2017-11-09,18.90,19.70,18.30,19.55,685261,0,axdx,19.183333,1.40,...,22.30,20.017864,19.083333,18.930,19.142857,19.135714,19.476190,20.005357,22.710417,24.631696


## Sentiment analysis
Now that we have calculated financial metrics based on the stock pricing information we have, we can further enrich this dataset with sentiment scores. These sentiment scores are calculated for each stock ticker, by using their respective news headlines where available over the course of 2016 and the first month of 2017, and represent a positive or negative opinion towards that stock as predicted from the news in question. 

### Dataset
In order to perform the sentiment analysis, we download a dataset from Kaggle. This dataset contains the headlines of financial news for a collection of more than 6000 stocks, ranging from 2009 to 2020. The dataset is contained in a .csv file containing the following information:

- Index key
- Article headline: the headline of the article.
- URL: url of the news article.
- Timestamp: The date the corresponding news article was published (Format: YYYY-MM-DD hh:mm:ssTZD, where TZD is -hh:mm or +hh:mm)
- Ticker: the stock ticker related to the corresponding news article.

In [18]:
api.dataset_download_files('miguelaenlle/massive-stock-news-analysis-db-for-nlpbacktests', path=storageDIRNews)
print("Download Complete")

Download Complete


Once the dataset has been downloaded, we have to prepare it for the sentiment analysis task. This would be done as follows:

1. From the Kaggle dataset, obtain the headlines from 2016 to 2017.
2. Process them, so they are ready for using them with the NLTK library.
3. Collect the stock prices for the different tickers in the period from 2016 to 2017.
4. Retrieve the stock prices at the time of the news, and one week / one month / three months afterwards.
5. The difference between stocks will determine the sentiment of the news: if it is positive and bigger than a threshold, we can classify it as positive; if it is negative and lower than a threshold, as negative. Otherwise, we can classify it as neutral.

First, we store the dataset in a Pandas data frame.

In [19]:
with zipfile.ZipFile(storageDIRNews+"/massive-stock-news-analysis-db-for-nlpbacktests.zip", 'r') as zip_ref:
    zip_ref.extractall(storageDIRNews)

dataset = pd.read_csv(storageDIRNews + "/raw_partner_headlines.csv", index_col=0)
dataset

  mask |= (ar1 == a)


Unnamed: 0,headline,url,publisher,date,stock
2,Agilent Technologies Announces Pricing of $5……...,http://www.gurufocus.com/news/1153187/agilent-...,GuruFocus,2020-06-01 00:00:00,A
3,Agilent (A) Gears Up for Q2 Earnings: What's i...,http://www.zacks.com/stock/news/931205/agilent...,Zacks,2020-05-18 00:00:00,A
4,J.P. Morgan Asset Management Announces Liquida...,http://www.gurufocus.com/news/1138923/jp-morga...,GuruFocus,2020-05-15 00:00:00,A
5,"Pershing Square Capital Management, L.P. Buys ...",http://www.gurufocus.com/news/1138704/pershing...,GuruFocus,2020-05-15 00:00:00,A
6,Agilent Awards Trilogy Sciences with a Golden ...,http://www.gurufocus.com/news/1134012/agilent-...,GuruFocus,2020-05-12 00:00:00,A
...,...,...,...,...,...
1849874,Consumer Cyclical Sector Wrap,https://www.benzinga.com/content/12/08/2846030...,webmaster,2012-08-20 00:00:00,ZX
1849875,Consumer Cyclical Sector Wrap,https://www.benzinga.com/content/12/07/2767124...,webmaster,2012-07-23 00:00:00,ZX
1849876,Zacks #5 Rank Additions for Monday - Tale of t...,http://www.zacks.com/stock/news/73497/here-are...,Zacks,2012-04-23 00:00:00,ZX
1849877,4 Stock Strategies From Wall Street: Feb. 9 (U...,http://www.thestreet.com/story/11409053/1/4-st...,TheStreet.Com,2012-02-09 00:00:00,ZX


Afterwards, we filter the headlines, and we only obtain those between 2016 and 2017.

In [20]:
actual_dataset = dataset.filter(items=['headline', 'date','stock'])
filter1 = actual_dataset.loc[actual_dataset['date'] < "2018-01-01"]
filtered_dataset = filter1.loc[filter1['date'] > "2016-01-01"]
filtered_dataset

Unnamed: 0,headline,date,stock
443,AdvisorShares Announces December —…–7 Distribu...,2017-12-29 00:00:00,A
444,Is Agilent Technologies With A Good Business A...,2017-12-22 00:00:00,A
445,Agilent Technologies Inc (A) Files –…-K for th...,2017-12-21 00:00:00,A
446,86 Dividend Growth Stocks Going Ex-Dividend Ne...,2017-12-21 00:00:00,A
447,Agilent (A) Down 3.7% Since Earnings Report: C...,2017-12-21 00:00:00,A
...,...,...,...
1849851,China Zenix Auto International's (ZX) CEO Junq...,2016-05-19 00:00:00,ZX
1849852,China Zenix Auto reports Q1 results,2016-05-19 00:00:00,ZX
1849853,China Zenix's (ZX) CEO Junqiu Gao on Q4 2015 R...,2016-04-15 00:00:00,ZX
1849854,China Zenix Auto International (ZX) Down Ahead...,2016-04-15 00:00:00,ZX


### Preparation of the datasets for the NLTK library

Before we can apply the news to analyze its sentiment, we first have to create the dataset that will work with the sentiment analysis library that we are using in this notebook: NLTK. Considering that, we have to apply some processing to the actual headlines:

1. Tokenize the headlines
2. Lemmatize and stem the words.
3. Clean the headlines (remove stopwords)

As a first step, we have to tokenize them:

In [21]:
!pip install nltk

You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.[0m


In [22]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [23]:
from nltk.tokenize import word_tokenize

tokenized = []
for index, row in filtered_dataset.iterrows():
    token = word_tokenize(row['headline'])
    tokenized.append(token)
    
filtered_dataset.insert(2, "token_headline", tokenized, True)
filtered_dataset

Unnamed: 0,headline,date,token_headline,stock
443,AdvisorShares Announces December —…–7 Distribu...,2017-12-29 00:00:00,"[AdvisorShares, Announces, December, —…–7, Dis...",A
444,Is Agilent Technologies With A Good Business A...,2017-12-22 00:00:00,"[Is, Agilent, Technologies, With, A, Good, Bus...",A
445,Agilent Technologies Inc (A) Files –…-K for th...,2017-12-21 00:00:00,"[Agilent, Technologies, Inc, (, A, ), Files, –...",A
446,86 Dividend Growth Stocks Going Ex-Dividend Ne...,2017-12-21 00:00:00,"[86, Dividend, Growth, Stocks, Going, Ex-Divid...",A
447,Agilent (A) Down 3.7% Since Earnings Report: C...,2017-12-21 00:00:00,"[Agilent, (, A, ), Down, 3.7, %, Since, Earnin...",A
...,...,...,...,...
1849851,China Zenix Auto International's (ZX) CEO Junq...,2016-05-19 00:00:00,"[China, Zenix, Auto, International, 's, (, ZX,...",ZX
1849852,China Zenix Auto reports Q1 results,2016-05-19 00:00:00,"[China, Zenix, Auto, reports, Q1, results]",ZX
1849853,China Zenix's (ZX) CEO Junqiu Gao on Q4 2015 R...,2016-04-15 00:00:00,"[China, Zenix, 's, (, ZX, ), CEO, Junqiu, Gao,...",ZX
1849854,China Zenix Auto International (ZX) Down Ahead...,2016-04-15 00:00:00,"[China, Zenix, Auto, International, (, ZX, ), ...",ZX


Once the different headlines have been tokenized, we have to lemmatize them. That way, we can work with the basic form of the different words, instead of working with plurals, verbal forms, etc. This makes easy for them to work on the sentiment analysis platform.

In [24]:
import nltk
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [26]:
from nltk.tag import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer

# We define here a function to lemmatize the different headlines
# after they have been tokenized
def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []
    for word, tag in pos_tag(tokens):
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
            
        lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_sentence

lemmatized = []
for index, row in filtered_dataset.iterrows():
    lemma = lemmatize_sentence(row['token_headline'])
    lemmatized.append(lemma)
    
filtered_dataset.insert(2, "lemmatized_headline", lemmatized, True)
filtered_dataset

Unnamed: 0,headline,date,lemmatized_headline,lemmatized_headline.1,token_headline,stock
443,AdvisorShares Announces December —…–7 Distribu...,2017-12-29 00:00:00,"[AdvisorShares, Announces, December, —…–7, Dis...","[AdvisorShares, Announces, December, —…–7, Dis...","[AdvisorShares, Announces, December, —…–7, Dis...",A
444,Is Agilent Technologies With A Good Business A...,2017-12-22 00:00:00,"[Is, Agilent, Technologies, With, A, Good, Bus...","[Is, Agilent, Technologies, With, A, Good, Bus...","[Is, Agilent, Technologies, With, A, Good, Bus...",A
445,Agilent Technologies Inc (A) Files –…-K for th...,2017-12-21 00:00:00,"[Agilent, Technologies, Inc, (, A, ), Files, –...","[Agilent, Technologies, Inc, (, A, ), Files, –...","[Agilent, Technologies, Inc, (, A, ), Files, –...",A
446,86 Dividend Growth Stocks Going Ex-Dividend Ne...,2017-12-21 00:00:00,"[86, Dividend, Growth, Stocks, Going, Ex-Divid...","[86, Dividend, Growth, Stocks, Going, Ex-Divid...","[86, Dividend, Growth, Stocks, Going, Ex-Divid...",A
447,Agilent (A) Down 3.7% Since Earnings Report: C...,2017-12-21 00:00:00,"[Agilent, (, A, ), Down, 3.7, %, Since, Earnin...","[Agilent, (, A, ), Down, 3.7, %, Since, Earnin...","[Agilent, (, A, ), Down, 3.7, %, Since, Earnin...",A
...,...,...,...,...,...,...
1849851,China Zenix Auto International's (ZX) CEO Junq...,2016-05-19 00:00:00,"[China, Zenix, Auto, International, 's, (, ZX,...","[China, Zenix, Auto, International, 's, (, ZX,...","[China, Zenix, Auto, International, 's, (, ZX,...",ZX
1849852,China Zenix Auto reports Q1 results,2016-05-19 00:00:00,"[China, Zenix, Auto, report, Q1, result]","[China, Zenix, Auto, report, Q1, result]","[China, Zenix, Auto, reports, Q1, results]",ZX
1849853,China Zenix's (ZX) CEO Junqiu Gao on Q4 2015 R...,2016-04-15 00:00:00,"[China, Zenix, 's, (, ZX, ), CEO, Junqiu, Gao,...","[China, Zenix, 's, (, ZX, ), CEO, Junqiu, Gao,...","[China, Zenix, 's, (, ZX, ), CEO, Junqiu, Gao,...",ZX
1849854,China Zenix Auto International (ZX) Down Ahead...,2016-04-15 00:00:00,"[China, Zenix, Auto, International, (, ZX, ), ...","[China, Zenix, Auto, International, (, ZX, ), ...","[China, Zenix, Auto, International, (, ZX, ), ...",ZX


And, finally, to finish the preparation of the sentences, we have to perform some cleaning over them: we just put all the tokens in lower cases and remove stopwords from the text. As all the news in the dataset are in English, we just use the basic English stopwords provided by the NLTK library.

In [27]:
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

import string

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [37]:
def clean_lemmatized_tokens(tokens):
    cleaned_tokens = []
    for token in tokens:
        low_token = token.lower()
        if len(low_token) > 0 and low_token not in string.punctuation and low_token not in stop_words:
            cleaned_tokens.append(low_token)
    return cleaned_tokens

cleaned = []
for index, row in filtered_dataset.iterrows():
    cleaned_tokens = clean_lemmatized_tokens(row['lemmatized_headline'])
    cleaned.append(cleaned_tokens)
filtered_dataset.insert(2, "cleaned_headline", cleaned, True)
df = filtered_dataset.filter(items=['cleaned_headline', 'date','stock'])
df

Unnamed: 0,cleaned_headline,date,stock
443,"[advisorshares, announces, december, —…–7, dis...",2017-12-29 00:00:00,A
444,"[agilent, technologies, good, business, total,...",2017-12-22 00:00:00,A
445,"[agilent, technologies, inc, files, –…-k, fisc...",2017-12-21 00:00:00,A
446,"[86, dividend, growth, stocks, going, ex-divid...",2017-12-21 00:00:00,A
447,"[agilent, 3.7, since, earnings, report, rebound]",2017-12-21 00:00:00,A
...,...,...,...
1849851,"[china, zenix, auto, international, 's, zx, ce...",2016-05-19 00:00:00,ZX
1849852,"[china, zenix, auto, report, q1, result]",2016-05-19 00:00:00,ZX
1849853,"[china, zenix, 's, zx, ceo, junqiu, gao, q4, 2...",2016-04-15 00:00:00,ZX
1849854,"[china, zenix, auto, international, zx, ahead,...",2016-04-15 00:00:00,ZX


### Sentiment computation

Once the different headlines have been pre-processed, we can generate the training / test datasets. For that, we have to read the stock information, and obtain the difference in the pricing between the stocks at the day of the news, and after sometime has passed (in this example, we are using a month as the score, but it can be the desired amount of time). This measure will be the value we use as the "sentiment" of the piece of news: if the difference is positive, we shall count it as a positive example whereas, if it is negative, we consider that the sentiment of the headline is negative.

In order to split the data in training and test sets, we use the 1st January 2017 as the timestamp (news from 2016 are used as training, whereas the rest are used as test).

In [39]:
import datetime
import calendar
from dateutil.relativedelta import relativedelta

tickers = df["stock"].unique()

training_dataset = []
test_dataset = []
not_included = []
count = 0

for ticker in tickers:
    try:
        file_name = storageDIR+'/Stocks/' + ticker.lower() + ".us.txt"
        info = pd.read_csv(file_name)

        # Once we have obtained the csv, we just do the following
        reduced_df = df[df['stock']==ticker]

        for index, row in reduced_df.iterrows():
            date_time = datetime.datetime.strptime(row['date'], '%Y-%m-%d %H:%M:%S')
            current_str = date_time.strftime('%Y-%m-%d')
            next_date_time = date_time + relativedelta(months=1)
            next_str = next_date_time.strftime('%Y-%m-%d')

            cur_series = info[info['Date']==current_str]
            fut_series = info[info['Date']==next_str]

            if cur_series.size > 0 and fut_series.size > 0:
                score = fut_series["Close"].iloc[0] - cur_series["Close"].iloc[0]
                dictionary = dict([token, True] for token in row['cleaned_headline'])
                if score > 0:
                    tuple_val = (dictionary, "Positive")
                else:
                    tuple_val = (dictionary, "Negative")

                if row['date'] < '2017':
                    training_dataset.append(tuple_val)
                else:
                    test_dataset.append(tuple_val)
    
    except Exception:
        not_included.append(ticker)
    
print("The following tickers did not have data:")
print(tickers)


The following tickers did not have data:
['A' 'AAC' 'AADR' ... 'ZU' 'ZUMZ' 'ZX']


In [40]:
training_dataset

[({'agilent': True,
   'acquire': True,
   'belgian': True,
   'molecular': True,
   'diagnostics': True,
   'firm': True,
   'multiplicom': True},
  'Positive'),
 ({'cancer': True,
   'immunotherapy': True,
   'market': True,
   '2016': True,
   'takeaways': True,
   'expect': True,
   '2017': True},
  'Positive'),
 ({'agilent': True,
   'technologies': True,
   'raises': True,
   'dividend': True,
   '–5': True},
  'Positive'),
 ({'growth': True, 'stock': True, 'land': True, 'value': True, 'stocks': True},
  'Positive'),
 ({'agilent': True,
   'technologies': True,
   'ceo': True,
   'mike': True,
   'mcmullen': True,
   'q4': True,
   '2016': True,
   'results': True,
   'earnings': True,
   'call': True,
   'transcript': True},
  'Positive'),
 ({'agilent': True,
   'technologies': True,
   'inc.': True,
   '2016': True,
   'q4': True,
   'results': True,
   'earnings': True,
   'call': True,
   'slides': True},
  'Positive'),
 ({'agilent': True,
   'fq4': True,
   'revenues': True,

In [41]:
test_dataset

[({'agilent': True,
   'boosts': True,
   'market': True,
   'share': True,
   'fda': True,
   'nod': True,
   'new': True,
   'cancer': True,
   'test': True},
  'Positive'),
 ({'new': True,
   'strong': True,
   'buy': True,
   'stocks': True,
   'july': True,
   '24th': True},
  'Positive'),
 ({'top': True,
   'ranked': True,
   'momentum': True,
   'stocks': True,
   'buy': True,
   'july': True,
   '17th': True},
  'Negative'),
 ({'mitek': True,
   'systems': True,
   'mitk': True,
   'looks': True,
   'good': True,
   'stock': True,
   'adds': True,
   '10.5': True,
   'session': True},
  'Negative'),
 ({'agilent': True,
   'expands': True,
   'use': True,
   'cancer': True,
   'diagnostics': True,
   'europe': True},
  'Positive'),
 ({'fujifilm': True,
   'fujiy': True,
   'beats': True,
   'earnings': True,
   'misses': True,
   'revenues': True,
   'q4': True},
  'Positive'),
 ({'new': True,
   'strong': True,
   'buy': True,
   'stocks': True,
   'june': True,
   '12th': True

### Classification
In the previous steps, we built the training and test data for our classifier. Now, we can a) train b) evaluate and c) use the sentiment analysis classifier according to our needs. As a simple model, we will use the Naive Bayes classifier provided by the NLTK library.

In [42]:
from nltk import classify
from nltk import SklearnClassifier

In [None]:
classifier = NaiveBayesClassifier.train(training_dataset)
print("Accuracy is:", classify.accuracy(classifier, test_dataset))

In [43]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
classifier = SklearnClassifier(RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1))

classifier.train(training_dataset)
print("Accuracy is:", classify.accuracy(classifier, test_dataset))

Accuracy is: 0.5364436066603833


Once the classifier has been trained, we store the values into a file. First, as we want to combine the scores of the different KPIs and the sentiment analysis, we provide the values of the prediction for the training examples (news published during 2016).

In [44]:
tickers = df["stock"].unique()

f = open('pred_file_2016.csv','w')
f.write("date,ticker,pos,neg")
count = 0

for ticker in tickers:
    try:
        file_name = storageDIR+'/Stocks/' + ticker.lower() + ".us.txt"
        info = pd.read_csv(file_name)
        
        # Once we have obtained the csv, we just do the following
        reduced_df = df[df['stock']==ticker]
        reduced_df = reduced_df[reduced_df['date'] < '2017']
        
        for index, row in reduced_df.iterrows():
            date_time = datetime.datetime.strptime(row['date'], '%Y-%m-%d %H:%M:%S')
            current_str = date_time.strftime('%Y-%m-%d')
            next_date_time = date_time + relativedelta(months=1)
            next_str = next_date_time.strftime('%Y-%m-%d')
                        
            cur_series = info[info['Date']==current_str]
            fut_series = info[info['Date']==next_str]
            
            if cur_series.size > 0 and fut_series.size > 0:
                dictionary = dict([token, True] for token in row['cleaned_headline'])
                t = classifier.prob_classify(dictionary)
                f.write("\n" + current_str + "," + ticker + "," + str(t.prob('Positive')) + "," + str(t.prob('Negative')))
    except Exception:
        print(ticker + " does not have series data")
f.close()

AADR does not have series data
AAVL does not have series data
ABCW does not have series data
ABGB does not have series data
ABTL does not have series data
ACAS does not have series data
ACAT does not have series data
ACCU does not have series data
ACE does not have series data
ACFN does not have series data
ACG does not have series data
ACIM does not have series data
ACMP does not have series data
ACPW does not have series data
ACT does not have series data
ACTS does not have series data
ACUR does not have series data
ADAT does not have series data
ADEP does not have series data
ADGE does not have series data
ADK does not have series data
ADPT does not have series data
ADRD does not have series data
ADT does not have series data
AEGR does not have series data
AEPI does not have series data
AF does not have series data
AFCB does not have series data
AFFX does not have series data
AFK does not have series data
AFM does not have series data
AFOP does not have series data
AGA does not have

And then, as we are trying to predict the stock values one month ahead of February 1st 2017, we store the value of the sentiment for the news published at January 2017 (in the definitive prediction, we shall use the predicted sentiment of the last news article published before February 1st 2017)

In [45]:
tickers = df["stock"].unique()

f = open('pred_file_01_01_2017.csv','w')
f.write("ticker,pos,neg")
count = 0

for ticker in tickers:
    try:
        file_name = storageDIR+'/Stocks/' + ticker.lower() + ".us.txt"
        info = pd.read_csv(file_name)
        
        # Once we have obtained the csv, we just do the following
        reduced_df = df[df['stock']==ticker]
        reduced_df = reduced_df[reduced_df['date'] < '2017-01-02']
        
        series = reduced_df['cleaned_headline']
        if series.size > 0:
            dictionary = dict([token, True] for token in series.iloc[-1])
            t = classifier.prob_classify(dictionary)
            f.write("\n" + ticker + "," + str(t.prob('Positive')) + "," + str(t.prob('Negative')))
    except Exception:
        print(ticker + " does not have series data")

f.close()


AADR does not have series data
AAVL does not have series data
ABCW does not have series data
ABGB does not have series data
ABTL does not have series data
ACAS does not have series data
ACAT does not have series data
ACCU does not have series data
ACE does not have series data
ACFN does not have series data
ACG does not have series data
ACIM does not have series data
ACMP does not have series data
ACPW does not have series data
ACT does not have series data
ACTS does not have series data
ACUR does not have series data
ADAT does not have series data
ADEP does not have series data
ADGE does not have series data
ADK does not have series data
ADPT does not have series data
ADRD does not have series data
ADT does not have series data
AEGR does not have series data
AEPI does not have series data
AF does not have series data
AFCB does not have series data
AFFX does not have series data
AFK does not have series data
AFM does not have series data
AFOP does not have series data
AGA does not have

## Sentiment enrichment and dataset splitting

Now that we have calculated financial metrics based on the stock pricing information, we can further enrich this dataset with the sentiment scores we computed. These sentiment scores are calculated for each stock ticker, by using their respective news headlines where available over the course of 2016 and the first month of 2017, and represent a positive or negative opinion towards that stock as predicted from the news in question. 

As the target value we are predicting is the close price of the stock 28 trading days out, we shift the values of the close price ahead by 28 days so that our model can investigate if there is a suitable relationship between our predictors and future close price.

In [59]:
newdfs_training = []
newdfs_test_1m = []

newdfs_training_without_sentiment = []
newdfs_test_1m_without_sentiment = []

newdfs_old = []
newdfs_test_1m_old = []

sentiment_2016 = pd.read_csv('pred_file_2016.csv').rename(columns={'ticker':'Stock', 'date': 'Date', 'pos': 'POSITIVE', 'neg': 'NEGATIVE'})
sentiment_2016['Stock'] = sentiment_2016['Stock'].str.lower()
sentiment_2016['Date'] = pd.to_datetime(sentiment_2016['Date'])

sentiment_2017 = pd.read_csv('pred_file_01_01_2017.csv').rename(columns={'ticker':'Stock','pos': 'POSITIVE', 'neg': 'NEGATIVE'})
sentiment_2017['Stock'] = sentiment_2017['Stock'].str.lower()

for d in newpricedfs:
    d['target_price'] = d['close'].shift(-28)

    d['Date'] = pd.to_datetime(d['Date'])
    d['year'] = d['Date'].dt.year
    d['month'] = d['Date'].dt.month
    d['day'] = d['Date'].dt.day 
    
    sentiment_snippet = sentiment_2016[sentiment_2016['Stock'] == d['Stock'].values.tolist()[0]]
    dt = d[d['year'] == 2016].drop(columns=['year', 'month', 'day'])
    dt['Stock'] = d['Stock'].values.tolist()[0]
    dts = pd.merge(dt, sentiment_snippet, on=['Stock', 'Date'], how='inner')
    
    
    sentiment_snippet_test = sentiment_2017[sentiment_2017['Stock'] == d['Stock'].values.tolist()[0]]
    d1m = d[(d['year'] == 2017) & ((d['month'] == 1) | (d['month'] == 2) | (d['month'] == 3))].drop(columns=['year', 'month', 'day'])
    d1m['Stock'] = d['Stock'].values.tolist()[0]
    d1ms = pd.merge(d1m, sentiment_snippet_test, on=['Stock'], how='left')
    
    
    d['Date'] = d['Date'].dt.strftime('%Y-%m-%d')
        
    if ((d1m.shape[0] > 0) & (d1ms.shape[0] > 0) & (dts.shape[0] > 0) & (dt.shape[0] > 0)):
        newdfs_training.append(dts)
        newdfs_test_1m.append(d1ms)
        
        dt_ws = dts.drop(columns=['POSITIVE', 'NEGATIVE'])
        d1m_ws = d1ms.drop(columns=['POSITIVE', 'NEGATIVE'])
        
        newdfs_training_without_sentiment.append(dt_ws)
        newdfs_test_1m_without_sentiment.append(d1m_ws)
        
        dt_old = dt[['Stock', 'Date', 'return_84', 'return_168', 'return_28', 'mean_84', 'mean_168', 'mean_28', 'volatility_84_1',
                    'volatility_168_1', 'volatility_28_1', 'target_price']].drop_duplicates()
        d1m_old = d1m[['Stock', 'Date', 'return_84', 'return_168', 'return_28', 'mean_84', 'mean_168', 'mean_28', 'volatility_84_1',
                    'volatility_168_1', 'volatility_28_1', 'target_price']].drop_duplicates()
        
        newdfs_old.append(dt_old)
        newdfs_test_1m_old.append(d1m_old)
#         
print("Dataset Divided into Training and Test Sets")

Dataset Divided into Training and Test Sets


In [60]:
newdfs_test_1m_old

[     Stock       Date  return_84  return_168  return_28    mean_84   mean_168  \
 2902  axdx 2017-01-03  -0.047727    0.724280  -0.187984  24.262144  20.651191   
 2903  axdx 2017-01-04  -0.026709    0.820491  -0.146825  24.255120  20.708870   
 2904  axdx 2017-01-05  -0.122881    0.782946  -0.173653  24.220596  20.762977   
 2905  axdx 2017-01-06  -0.209807    0.738568  -0.194000  24.156905  20.813929   
 2906  axdx 2017-01-09  -0.163584    0.727504  -0.150313  24.109524  20.864941   
 ...    ...        ...        ...         ...        ...        ...        ...   
 2959  axdx 2017-03-27  -0.093254    0.166411  -0.033827  22.833631  23.332382   
 2960  axdx 2017-03-28  -0.059880    0.179860  -0.010504  22.815774  23.353751   
 2961  axdx 2017-03-29  -0.076000    0.149826  -0.008584  22.793155  23.371667   
 2962  axdx 2017-03-30   0.006263    0.226463   0.041037  22.794940  23.398155   
 2963  axdx 2017-03-31  -0.016227    0.181774   0.002066  22.790179  23.420358   
 
         mean_

## Stock price prediction model - Random forest regression

After having curated our training and testing sets, we proceed to use a simple machine learning model to predict the profitability of assets from the test dataset, shifted 28 trading days out from the end of our training period. TA random forest regression model is an ensemble model, which means that it combines the predictions of multiple machine learning models in order to improve prediction accuracy and avoid overfitting. A random forest model aggregates decision trees and takes the average of their predictions or probabilities for classes; decision trees are supervised models that attempt to infer simple decision rules from the features and use these to make predictions.

Random forests introduce randomness from two other sources, namely that each tree is built with a random sample from the training set, and that the splitting of nodes during tree construction is either found from all features or from a random subset of the maximum permitted number of features. These sources of randomness are intended to minimize the impact of each singular decision tree's sensitivity.


For the purpose of evaluation, we run this model on three separate training sets:

1. First, similar to the approach taken in the initial Infinitech profitability estimation recommender on the marketplace (found <a href='https://marketplace.infinitech-h2020.eu/login?redirect_to=https%3A%2F%2Fmarketplace.infinitech-h2020.eu%2Fassets%2Ffinancial-asset-recommender-profitabiliy-estimation'>here</a>), we take the basic KPIs of average price, volatility, and returns over a number of months to predict price.
2. Then, we add more advanced technical analysis indicators to this dataset.
3. Finally, we add sentiment features.

This is to evaluate the impact of adding more advanced technical analysis indicators and sentiment features in predicting the close price.

In [61]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error


predictions_old = {}
test_prices_old = {}
mae_old = {}
mse_old = {}
r2_old = {}

for i, t in enumerate(newdfs_old):
    if not t.empty:
        instrument_id = t['Stock'].values.tolist()[0]
        print (instrument_id)
        te = newdfs_test_1m_old[i]
        
        t.replace([np.inf, -np.inf], np.nan, inplace=True)
        te.replace([np.inf, -np.inf], np.nan, inplace=True)
        
        t.dropna(inplace=True)
        
        train_price = t['target_price']
        t = t.drop(columns=['Stock', 'Date', 'target_price'])
        
        test_price = te['target_price']
        te = te.drop(columns=['Stock', 'Date', 'target_price'])
        
        if not t.empty and not te.empty:
            scaler = StandardScaler()
            scaled_data = scaler.fit_transform(t)
            scaler = StandardScaler()
            scaled_test = scaler.fit_transform(te)
            
            scaled_data = scaled_data[~np.isnan(scaled_data).any(axis=1)]
            scaled_data = scaled_data[np.isfinite(scaled_data).any(axis=1)]
            scaled_test = scaled_test[~np.isnan(scaled_test).any(axis=1)]
            scaled_test = scaled_test[np.isfinite(scaled_test).any(axis=1)]
            
            if (len(scaled_test) >= 1):
                
                # Prediction

                model = RandomForestRegressor()
                model.fit(scaled_data, train_price)
                pred = model.predict(scaled_test)
                
                predictions_old[instrument_id] = pred
                test_prices_old[instrument_id] = test_price

                mse_old[instrument_id] = mean_squared_error(test_price, pred)
                mae_old[instrument_id] = mean_absolute_error(test_price, pred)
                r2_old[instrument_id] = r2_score(test_price, pred)
                    

axdx
rad
chu
vac
incr
nksh
nnn
gcbc
htlf
sgmo
aoi
igr
rm
smsi
ston
kst
grpn
irl
strt
hty
wsr
evol
mag
nbn
ipgp
mobl
jbss
fn
blin
gty
fgen
ndls
gnbc
bksc
kio
aldw
mack
afmd
kbsf
afb
bcc
ftd
dswl
rvnc
cor
lmnr
msex
dct
tat
sgms
rvlt
flex
wbig
plxs
mygn
ntg
virc
pfd
ipar
emi
rlj
syx
alex
vbnd
mchp
ess
dvcr
main
lmos
hcap
hchc
ddr
cobz
hiw
swz
gbr
vnom
plpm
odp
true
fl
flt
news
fran
ivr
acy
inve
pkg
kss
akao
fac
amtd
tdw
wbid
cva
icd
wdc
cown
rcmt
slp
are
wbc
atra
xncr
artx
csf
ttek
mmsi
bmo
matx
gdv
sgb
finl
basi
ardm
brfs
hmn
xoxo
gpt
ivc
rice
pcom
uht
wyy
fcnca
nan
gbdc
yume
kof
pjc
hsgx
praa
crai
pky
pfbc
bldr
pfsi
qrvo
phi
cffi
crk
npk
vntv
el
ibtx
ren
pes
ardx
mco
jbt
usat
avb
tast
aed
isl
bms
tgh
apto
pdco
amrc
ffbc
ldos
afg
swx
fprx
veev
noa
cts
mwa
ch
shos
wina
aamc
wit
wcg
idra
sm
farm
cl
mnov
amrk
etp
fsfg
bif
mbtf
igt
dkl
oesx
sfbc
bfz
kool
rli
banc
cytr
evp
giii
hear
vrsn
sir
evtc
shoo
clb
pets
cphc
orbk
hhc
aeo
dlhc
kmb
alk
hurc
mnk
gzt
htht
csx
srt
gmo
cuba
weys
hbp
kep
cwt


In [62]:
predictions = {}
test_prices = {}
mae = {}
mse = {}
r2 = {}
for i, t in enumerate(newdfs_training):
    if not t.empty:
        instrument_id = t['Stock'].values.tolist()[0]
        print (instrument_id)
        te = newdfs_test_1m[i]
        
        t.replace([np.inf, -np.inf], np.nan, inplace=True)
        te.replace([np.inf, -np.inf], np.nan, inplace=True)
        
        t.dropna(inplace=True)
        
        train_price = t['target_price']
        t = t.drop(columns=['Stock', 'Date', 'target_price'])
        
        test_price = te['target_price']
        te = te.drop(columns=['Stock', 'Date', 'target_price'])
        
        if not t.empty and not te.empty:
            scaler = StandardScaler()
            scaled_data = scaler.fit_transform(t)
            scaler = StandardScaler()
            scaled_test = scaler.fit_transform(te)
            
            scaled_data = scaled_data[~np.isnan(scaled_data).any(axis=1)]
            scaled_data = scaled_data[np.isfinite(scaled_data).any(axis=1)]
            scaled_test = scaled_test[~np.isnan(scaled_test).any(axis=1)]
            scaled_test = scaled_test[np.isfinite(scaled_test).any(axis=1)]
            
            if (len(scaled_test) >= 1):
                

                model = RandomForestRegressor()
                model.fit(scaled_data, train_price)
                pred = model.predict(scaled_test)

                predictions[instrument_id] = pred

                test_prices[instrument_id] = test_price
                mse[instrument_id] = mean_squared_error(test_price, pred)
                mae[instrument_id] = mean_absolute_error(test_price, pred)
                r2[instrument_id] = r2_score(test_price, pred)
                    

axdx
rad
chu
vac
incr
nksh
nnn
gcbc
htlf
sgmo
aoi
igr
rm
smsi
ston
kst
grpn
irl
strt
hty
wsr
evol
mag
nbn
ipgp
mobl
jbss
fn
blin
gty
fgen
ndls
gnbc
bksc
kio
aldw
mack
afmd
kbsf
afb
bcc
ftd
dswl
rvnc
cor
lmnr
msex
dct
tat
sgms
rvlt
flex
wbig
plxs
mygn
ntg
virc
pfd
ipar
emi
rlj
syx
alex
vbnd
mchp
ess
dvcr
main
lmos
hcap
hchc
ddr
cobz
hiw
swz
gbr
vnom
plpm
odp
true
fl
flt
news
fran
ivr
acy
inve
pkg
kss
akao
fac
amtd
tdw
wbid
cva
icd
wdc
cown
rcmt
slp
are
wbc
atra
xncr
artx
csf
ttek
mmsi
bmo
matx
gdv
sgb
finl
basi
ardm
brfs
hmn
xoxo
gpt
ivc
rice
pcom
uht
wyy
fcnca
nan
gbdc
yume
kof
pjc
hsgx
praa
crai
pky
pfbc
bldr
pfsi
qrvo
phi
cffi
crk
npk
vntv
el
ibtx
ren
pes
ardx
mco
jbt
usat
avb
tast
aed
isl
bms
tgh
apto
pdco
amrc
ffbc
ldos
afg
swx
fprx
veev
noa
cts
mwa
ch
shos
wina
aamc
wit
wcg
idra
sm
farm
cl
mnov
amrk
etp
fsfg
bif
mbtf
igt
dkl
oesx
sfbc
bfz
kool
rli
banc
cytr
evp
giii
hear
vrsn
sir
evtc
shoo
clb
pets
cphc
orbk
hhc
aeo
dlhc
kmb
alk
hurc
mnk
gzt
htht
csx
srt
gmo
cuba
weys
hbp
kep
cwt


In [63]:
predictions_ws = {}
test_prices_ws = {}
mae_ws = {}
mse_ws = {}
r2_ws = {}

for i, t in enumerate(newdfs_training_without_sentiment):
    if not t.empty:
        instrument_id = t['Stock'].values.tolist()[0]
        print (instrument_id)
        te = newdfs_test_1m_without_sentiment[i]
        
        t.replace([np.inf, -np.inf], np.nan, inplace=True)
        te.replace([np.inf, -np.inf], np.nan, inplace=True)
        
        t.dropna(inplace=True)
        
        train_price = t['target_price']
        t = t.drop(columns=['Stock', 'Date', 'target_price'])
        
        test_price = te['target_price']
        te = te.drop(columns=['Stock', 'Date', 'target_price'])
        
        if not t.empty and not te.empty:
            scaler = StandardScaler()
            scaled_data = scaler.fit_transform(t)
            scaler = StandardScaler()
            scaled_test = scaler.fit_transform(te)
            
            scaled_data = scaled_data[~np.isnan(scaled_data).any(axis=1)]
            scaled_data = scaled_data[np.isfinite(scaled_data).any(axis=1)]
            scaled_test = scaled_test[~np.isnan(scaled_test).any(axis=1)]
            scaled_test = scaled_test[np.isfinite(scaled_test).any(axis=1)]
            
            if (len(scaled_test) >= 1):

                model = RandomForestRegressor()
                model.fit(scaled_data, train_price)
                pred = model.predict(scaled_test)

                predictions_ws[instrument_id] = pred

                test_prices_ws[instrument_id] = test_price
                mse_ws[instrument_id] = mean_squared_error(test_price, pred)
                mae_ws[instrument_id] = mean_absolute_error(test_price, pred)
                r2_ws[instrument_id] = r2_score(test_price, pred)
                    

axdx
rad
chu
vac
incr
nksh
nnn
gcbc
htlf
sgmo
aoi
igr
rm
smsi
ston
kst
grpn
irl
strt
hty
wsr
evol
mag
nbn
ipgp
mobl
jbss
fn
blin
gty
fgen
ndls
gnbc
bksc
kio
aldw
mack
afmd
kbsf
afb
bcc
ftd
dswl
rvnc
cor
lmnr
msex
dct
tat
sgms
rvlt
flex
wbig
plxs
mygn
ntg
virc
pfd
ipar
emi
rlj
syx
alex
vbnd
mchp
ess
dvcr
main
lmos
hcap
hchc
ddr
cobz
hiw
swz
gbr
vnom
plpm
odp
true
fl
flt
news
fran
ivr
acy
inve
pkg
kss
akao
fac
amtd
tdw
wbid
cva
icd
wdc
cown
rcmt
slp
are
wbc
atra
xncr
artx
csf
ttek
mmsi
bmo
matx
gdv
sgb
finl
basi
ardm
brfs
hmn
xoxo
gpt
ivc
rice
pcom
uht
wyy
fcnca
nan
gbdc
yume
kof
pjc
hsgx
praa
crai
pky
pfbc
bldr
pfsi
qrvo
phi
cffi
crk
npk
vntv
el
ibtx
ren
pes
ardx
mco
jbt
usat
avb
tast
aed
isl
bms
tgh
apto
pdco
amrc
ffbc
ldos
afg
swx
fprx
veev
noa
cts
mwa
ch
shos
wina
aamc
wit
wcg
idra
sm
farm
cl
mnov
amrk
etp
fsfg
bif
mbtf
igt
dkl
oesx
sfbc
bfz
kool
rli
banc
cytr
evp
giii
hear
vrsn
sir
evtc
shoo
clb
pets
cphc
orbk
hhc
aeo
dlhc
kmb
alk
hurc
mnk
gzt
htht
csx
srt
gmo
cuba
weys
hbp
kep
cwt


## Model Effectiveness Evaluation

After the prediction of returns, we proceed onto the evaluation of our model. We can do so by means of more classical metrics that assess the difference between actual and predicted close price values. For regressors, some of these metrics are the R-squared score, mean absolute error (MAE) and the mean squared error (MSE). We take the first model with the basic KPIs as our baseline to see how our new features impact prediction. Note that this is a slightly different task from the previous notebook we discussed (linked above), as it predicts close price one month out, not profitability.

In [64]:
mse_df = pd.DataFrame.from_dict(data=mse, columns=['MSE - KPIS+sentiment'], orient='index').reset_index()
mse_df_ws = pd.DataFrame.from_dict(data=mse_ws, columns=['MSE - KPIs'], orient='index').reset_index()
mse_df_old = pd.DataFrame.from_dict(data=mse_old, columns=['Baseline (MSE - Basic KPIS)'], orient='index').reset_index()
mse_df_full = pd.merge(mse_df, mse_df_ws, on='index')
mse_df_full = pd.merge(mse_df_full, mse_df_old, on='index')
mse_df_full['% difference in MSE - KPIS+sentiment'] = (mse_df_full['MSE - KPIS+sentiment'] - 
                                    mse_df_full['Baseline (MSE - Basic KPIS)'])/mse_df_full['Baseline (MSE - Basic KPIS)']
mse_df_full['% difference in MSE - KPIS'] = (mse_df_full['MSE - KPIs'] - 
                                    mse_df_full['Baseline (MSE - Basic KPIS)'])/mse_df_full['Baseline (MSE - Basic KPIS)']

mae_df = pd.DataFrame.from_dict(data=mae, columns=['MAE - KPIS+sentiment'], orient='index').reset_index()
mae_df_ws = pd.DataFrame.from_dict(data=mae_ws, columns=['MAE - KPIs'], orient='index').reset_index()
mae_df_old = pd.DataFrame.from_dict(data=mae_old, columns=['Baseline (MAE - Basic KPIS)'], orient='index').reset_index()
mae_df_full = pd.merge(mae_df, mae_df_ws, on='index')
mae_df_full = pd.merge(mae_df_full, mae_df_old, on='index')
mae_df_full['% difference in MAE - KPIS+sentiment'] = (mae_df_full['MAE - KPIS+sentiment'] - 
                                    mae_df_full['Baseline (MAE - Basic KPIS)'])/mae_df_full['Baseline (MAE - Basic KPIS)']
mae_df_full['% difference in MAE - KPIS'] = (mae_df_full['MAE - KPIs'] - 
                                    mae_df_full['Baseline (MAE - Basic KPIS)'])/mae_df_full['Baseline (MAE - Basic KPIS)']

r2_df = pd.DataFrame.from_dict(data=r2, columns=['R2 - KPIS+sentiment'], orient='index').reset_index()
r2_df_ws = pd.DataFrame.from_dict(data=r2_ws, columns=['R2 - KPIs'], orient='index').reset_index()
r2_df_old = pd.DataFrame.from_dict(data=r2_old, columns=['Baseline (R2 - Basic KPIS)'], orient='index').reset_index()
r2_df_full = pd.merge(r2_df, r2_df_ws, on='index')
r2_df_full = pd.merge(r2_df_full, r2_df_old, on='index')
r2_df_full['% difference in R2 - KPIS+sentiment'] = (r2_df_full['R2 - KPIS+sentiment'] - 
                                    r2_df_full['Baseline (R2 - Basic KPIS)'])/r2_df_full['Baseline (R2 - Basic KPIS)']
r2_df_full['% difference in R2 - KPIS'] = (r2_df_full['R2 - KPIs'] - 
                                    r2_df_full['Baseline (R2 - Basic KPIS)'])/r2_df_full['Baseline (R2 - Basic KPIS)']


In [65]:
mse_df_full.sort_values(by='% difference in MSE - KPIS+sentiment').head(10)

Unnamed: 0,index,MSE - KPIS+sentiment,MSE - KPIs,Baseline (MSE - Basic KPIS),% difference in MSE - KPIS+sentiment,% difference in MSE - KPIS
2541,rvp,0.004945,0.005336,0.897394,-0.994489,-0.994054
2358,prto,0.226794,0.226794,33.236203,-0.993176,-0.993176
2136,tril,0.074316,0.07286,6.38582,-0.988362,-0.98859
1386,rxii,0.005663,0.005655,0.347964,-0.983725,-0.983748
226,arwr,0.062524,0.076582,3.781132,-0.983464,-0.979746
922,brcd,0.065828,0.096709,2.750002,-0.976063,-0.964833
571,elgx,0.125341,0.128351,4.66284,-0.973119,-0.972474
2148,anth,2.358522,2.388216,78.768982,-0.970058,-0.969681
108,ardm,0.133085,0.133085,4.11166,-0.967632,-0.967632
751,gene,0.003472,0.003497,0.105667,-0.967145,-0.966909


In [66]:
mse_df_full.describe()

Unnamed: 0,MSE - KPIS+sentiment,MSE - KPIs,Baseline (MSE - Basic KPIS),% difference in MSE - KPIS+sentiment,% difference in MSE - KPIS
count,3160.0,3160.0,3160.0,3160.0,3160.0
mean,562846.0,564551.1,3577574.0,-0.002325,-0.002137
std,19833340.0,20044520.0,182588400.0,0.781511,0.785997
min,0.0002600506,0.0002600506,0.0004569644,-0.994489,-0.994054
25%,0.8466413,0.8356722,1.017718,-0.446354,-0.444824
50%,4.013226,4.045162,5.423835,-0.121553,-0.124605
75%,19.85394,19.66189,25.24239,0.202765,0.197766
max,866559100.0,880838100.0,10230110000.0,13.620089,13.620089


In [67]:
mae_df_full.sort_values(by='% difference in MAE - KPIS+sentiment').head(10)

Unnamed: 0,index,MAE - KPIS+sentiment,MAE - KPIs,Baseline (MAE - Basic KPIS),% difference in MAE - KPIS+sentiment,% difference in MAE - KPIS
2541,rvp,0.059069,0.061697,0.619004,-0.904574,-0.900329
2358,prto,0.441935,0.441935,4.546111,-0.902788,-0.902788
2136,tril,0.189434,0.188821,1.759255,-0.892322,-0.89267
226,arwr,0.212979,0.221387,1.639302,-0.870079,-0.86495
1386,rxii,0.061013,0.06099,0.450332,-0.864515,-0.864568
2148,anth,1.326898,1.347441,7.217354,-0.816152,-0.813305
571,elgx,0.303113,0.309277,1.483731,-0.795709,-0.791554
533,bpk,0.016163,0.015987,0.076851,-0.789686,-0.791974
3071,lbai,0.509599,0.512517,2.422582,-0.789646,-0.788442
2627,rbpaa,0.142015,0.142015,0.672773,-0.788912,-0.788912


In [68]:
mae_df_full.describe()

Unnamed: 0,MAE - KPIS+sentiment,MAE - KPIs,Baseline (MAE - Basic KPIS),% difference in MAE - KPIS+sentiment,% difference in MAE - KPIS
count,3160.0,3160.0,3160.0,3160.0,3160.0
mean,25.482494,25.357068,45.491216,-0.027367,-0.027265
std,676.423874,677.685344,1613.689053,0.378619,0.3791
min,0.013135,0.013135,0.016104,-0.904574,-0.902788
25%,0.799039,0.797619,0.852641,-0.25804,-0.257034
50%,1.779241,1.785425,2.001658,-0.048362,-0.048384
75%,3.993844,3.991863,4.374407,0.128794,0.125594
max,27261.005887,27589.279142,85817.63455,3.496922,3.496922


In [69]:
r2_df_full.sort_values(by='R2 - KPIS+sentiment').head(10)

Unnamed: 0,index,R2 - KPIS+sentiment,R2 - KPIs,Baseline (R2 - Basic KPIS),% difference in R2 - KPIS+sentiment,% difference in R2 - KPIS
3112,rgse,-7091.512188,-6610.277051,-3942.69113,0.798648,0.67659
612,wgl,-894.078065,-829.525502,-1454.503384,-0.385304,-0.429685
525,drys,-781.959743,-794.86124,-9242.187325,-0.915392,-0.913996
368,opht,-652.544086,-691.050252,-607.181665,0.07471,0.138128
2730,gale,-635.342629,-635.284134,-2074.052972,-0.693671,-0.693699
684,bas,-411.024586,-396.679781,-1136.904971,-0.638471,-0.651088
1786,gevo,-387.462577,-368.059139,-444.369743,-0.128063,-0.171728
3077,tcs,-315.837272,-331.423373,-151.119161,1.089988,1.193126
33,mack,-315.409544,-309.619399,-160.554199,0.964505,0.928442
2675,ajg,-186.90465,-181.810148,-109.483056,0.707156,0.660624


In [70]:
r2_df_full.describe()

Unnamed: 0,R2 - KPIS+sentiment,R2 - KPIs,Baseline (R2 - Basic KPIS),% difference in R2 - KPIS+sentiment,% difference in R2 - KPIS
count,3160.0,3160.0,3160.0,3160.0,3160.0
mean,-13.503239,-13.354323,-17.888153,-0.246814,-0.21401
std,130.416394,122.10157,187.39923,10.76969,9.639015
min,-7091.512188,-6610.277051,-9242.187325,-562.209375,-496.140707
25%,-11.075835,-11.060709,-13.231736,-0.666473,-0.657995
50%,-3.901985,-3.887939,-5.70223,-0.196326,-0.203239
75%,-1.067606,-1.057816,-2.215868,0.219765,0.211517
max,0.575261,0.397028,0.753375,62.094656,62.094656


In [71]:
mean_se = mse_df_full.mean()
mean_ae = mae_df_full.mean()
mean_r2 = r2_df_full.mean()

mean_metrics = pd.DataFrame({'Metric': ['MSE', 'MAE', 'R2'], 
                        'Baseline (Basic KPIs)': [mean_se['Baseline (MSE - Basic KPIS)'], 
                                                  mean_ae['Baseline (MAE - Basic KPIS)'],
                                                 mean_r2['Baseline (R2 - Basic KPIS)']], 
                       'KPIs': [mean_se['MSE - KPIS+sentiment'], 
                                                  mean_ae['MAE - KPIS+sentiment'],
                                                 mean_r2['R2 - KPIS+sentiment']],
                        'KPIS+sentiment':[mean_se['MSE - KPIS+sentiment'], 
                                                  mean_ae['MAE - KPIS+sentiment'],
                                                 mean_r2['R2 - KPIS+sentiment']]
                       })
mean_metrics

Unnamed: 0,Metric,Baseline (Basic KPIs),KPIs,KPIS+sentiment
0,MSE,3577574.0,562846.043097,562846.043097
1,MAE,45.49122,25.482494,25.482494
2,R2,-17.88815,-13.503239,-13.503239


From the above evaluation, we notice that there are a fair few assets whose predicted prices have extremely large prediction errors. These result in large MAE and MSE values. We also notice that a negative R-squared value indicates that the model fits the data less accurately than random guessing. Further, the addition of sentiment features does not seem to add any improvement to the price prediction model at all. Therefore, we can conclude that we perhaps have yet to identify a model, or an appropriate combination of features, that can serve to perform this task with sufficient success.

<!-- In the above table, we are showing the top 10 stocks that were predicted to be profitable. The last two columns report the actual return on investment after 9 months and asset volitility. Note that a return value of 1.5 means a 150% return on investment. 

We notice that predicted returns for the top stocks are exceedingly high, which is not ordinary. However, we can also see that the actual returns for these stocks are similar for several of these instances, i.e. the model is not wrong in predicting these as profitable investments in the short term. However, we can also see that the volitility fo these stocks is very high, i.e. these are 'high-risk' assets that may subsequently crash in price. 

We can also analyse the statistics for this predictions across the dataset. -->

<!-- Here we are able to see the average returns and volatility of the entire test set. 

We can also examine the predicted returns and volatility over the top 1, 5, 10, and 20 stocks, ranked by their predicted profitability. This will serve to provide an idea of how financially feasible our recommendations are, and whether the recommended stocks are an improvement to the average returns of the test set. -->

<!-- The returns and volatility for the top stocks, ranked by predicted returns, are far higher than their averages across the test set. This indicates that ranking assets by their predicted returns can produce some highly profitable but risk-laden investment recommendations, which might be suitable for aggressive investors. However, it remains to be seen how much of this is owed to fluctuations and outliers in the data, and perhaps even if there are better ways to capture the returns and volatility of the dataset.

Next, we look at the differences between the actual and predicted returns. -->

<!-- Lastly, we can examine the mean absolute error and mean squared error of the predictions. As these can be quite dependent on the dataset and problem in question, we also assume a simple baseline, by taking the median of all stock returns from the test dataset. We then compare the results of applying these metrics to the baseline and our predictor model. -->

<!-- We can see from this that the random forest model presents an improvement (reduction) in both MAE and MSE. -->