# Introduction

This  notebook is meant to be a starting point for teams competing in Bot Xchange in ShriTeq 2022. It will help you:

- Get started with processing the data

- Learn how to reshape the time series data into something you can use for your mode

- Get started with a rudimentary RandomForestRegressor model to predict stock prices

To use this notebook:

1. Click the three dots at the top right of this page

undefined. Click 'Download'. This will download this notebook and all project files to your local filesystem

# Data processing

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('opening_prices_biotech.csv')
df.head()

Unnamed: 0,ticker,day_0,day_1,day_2,day_3,day_4,day_5,day_6,day_7,day_8,...,day_941,day_942,day_943,day_944,day_945,day_946,day_947,day_948,day_949,day_950
0,NVO,49.0179,49.5701,49.7771,50.2947,50.4069,50.6312,50.2947,49.8548,50.01,...,48.9641,49.0019,50.4401,49.9102,49.9291,49.7021,48.9735,48.7938,48.2923,47.4123
1,VRTX,121.09,123.92,124.38,123.99,124.97,127.68,125.55,123.03,123.89,...,174.48,177.01,178.23,176.32,172.31,172.05,170.29,169.55,169.24,167.69
2,REGN,535.59,531.06,535.56,536.45,542.28,550.54,542.62,531.76,519.38,...,284.66,288.75,297.36,295.02,290.21,287.7,280.45,275.3,277.27,275.31
3,SGEN,42.31,42.73,43.71,43.7,44.51,45.3,44.82,43.88,42.07,...,73.02,73.15,73.81,75.07,73.08,73.82,76.7,79.17,85.6,85.26
4,ALNY,88.27,87.24,91.29,90.16,92.49,93.48,94.18,92.18,93.49,...,85.84,84.9,86.69,85.18,83.04,83.63,80.13,81.1,80.07,74.15


For our model that predicts the next opening price of a stock, we'll take the last _n_ days' prices as input data. We will set n = 50 for this notebook, but **we encourage you to experiment** with what number of features work. 

We will also take all-time historical data, such as the standard deviation, historical highs and lows, as well as the mean value of the stock.

In [3]:
# Only do this if you're using sets of 50
df.drop('day_0', axis=1, inplace=True)
df.head()

Unnamed: 0,ticker,day_1,day_2,day_3,day_4,day_5,day_6,day_7,day_8,day_9,...,day_941,day_942,day_943,day_944,day_945,day_946,day_947,day_948,day_949,day_950
0,NVO,49.5701,49.7771,50.2947,50.4069,50.6312,50.2947,49.8548,50.01,49.2164,...,48.9641,49.0019,50.4401,49.9102,49.9291,49.7021,48.9735,48.7938,48.2923,47.4123
1,VRTX,123.92,124.38,123.99,124.97,127.68,125.55,123.03,123.89,120.8,...,174.48,177.01,178.23,176.32,172.31,172.05,170.29,169.55,169.24,167.69
2,REGN,531.06,535.56,536.45,542.28,550.54,542.62,531.76,519.38,499.0,...,284.66,288.75,297.36,295.02,290.21,287.7,280.45,275.3,277.27,275.31
3,SGEN,42.73,43.71,43.7,44.51,45.3,44.82,43.88,42.07,41.84,...,73.02,73.15,73.81,75.07,73.08,73.82,76.7,79.17,85.6,85.26
4,ALNY,87.24,91.29,90.16,92.49,93.48,94.18,92.18,93.49,92.76,...,85.84,84.9,86.69,85.18,83.04,83.63,80.13,81.1,80.07,74.15


## Constants

In [4]:
NUM_STOCKS = len(df)
print(f'Number of stocks in DataFrame: {NUM_STOCKS}')

Number of stocks in DataFrame: 48


In [5]:
ORIGINAL_NUM_DAYS = len(df.iloc[0])
print(f'Original number of days for which we have price data for each ticker: {ORIGINAL_NUM_DAYS}')

Original number of days for which we have price data for each ticker: 951


In [6]:
DAYS_PER_BATCH = 50
print(f'Numbers of days for which we have price data in one division of the original data: {DAYS_PER_BATCH}')

Numbers of days for which we have price data in one division of the original data: 50


## Creating the new DataFrame

In [7]:
# Save the tickers before dropping it
tickers = df['ticker']
df = df.drop('ticker', axis = 1)

# Now that the dataframe is numerical only, we can easily compute the following
df['mean'] = df.mean(axis=1)
df['std'] = df.std(axis=1)
df['min'] = df.min(axis=1)
df['max'] = df.max(axis=1)
df['ticker'] = tickers
df.head()

Unnamed: 0,day_1,day_2,day_3,day_4,day_5,day_6,day_7,day_8,day_9,day_10,...,day_946,day_947,day_948,day_949,day_950,mean,std,min,max,ticker
0,49.5701,49.7771,50.2947,50.4069,50.6312,50.2947,49.8548,50.01,49.2164,48.3968,...,49.7021,48.9735,48.7938,48.2923,47.4123,43.04198,5.792453,5.792453,53.2909,NVO
1,123.92,124.38,123.99,124.97,127.68,125.55,123.03,123.89,120.8,119.01,...,172.05,170.29,169.55,169.24,167.69,137.722917,38.664037,38.664037,194.67,VRTX
2,531.06,535.56,536.45,542.28,550.54,542.62,531.76,519.38,499.0,510.5,...,287.7,280.45,275.3,277.27,275.31,381.977832,53.964763,53.964763,550.54,REGN
3,42.73,43.71,43.7,44.51,45.3,44.82,43.88,42.07,41.84,39.42,...,73.82,76.7,79.17,85.6,85.26,58.427116,12.56451,12.56451,85.6,SGEN
4,87.24,91.29,90.16,92.49,93.48,94.18,92.18,93.49,92.76,87.39,...,83.63,80.13,81.1,80.07,74.15,80.599735,25.244504,25.244504,147.89,ALNY


In [8]:
new_col_names = ['ticker', 'std', 'min', 'max', 'mean']

for i in range(1, DAYS_PER_BATCH + 1, 1):
    col_name = 'price_' + str(i)
    new_col_names.append(col_name)

In [9]:
# Create an empty DataFrame with these column headings
new = pd.DataFrame(columns=new_col_names)
new.head()

Unnamed: 0,ticker,std,min,max,mean,price_1,price_2,price_3,price_4,price_5,...,price_41,price_42,price_43,price_44,price_45,price_46,price_47,price_48,price_49,price_50


In [10]:
for i in range(1, DAYS_PER_BATCH + 1, 1):
    dfs = []
    original_cols = [f'day_{j}' for j in range(i, ORIGINAL_NUM_DAYS, 50)]
    for name in original_cols:
        dfs.append(df[name])

    combined = pd.concat(dfs, ignore_index=True).reindex()
    new[f'price_{i}'] = combined

In [11]:
new.head()

Unnamed: 0,ticker,std,min,max,mean,price_1,price_2,price_3,price_4,price_5,...,price_41,price_42,price_43,price_44,price_45,price_46,price_47,price_48,price_49,price_50
0,,,,,,49.5701,49.7771,50.2947,50.4069,50.6312,...,45.1445,43.5744,44.6355,45.0323,44.8253,45.1445,45.4723,44.6527,48.6384,48.5348
1,,,,,,123.92,124.38,123.99,124.97,127.68,...,88.06,84.12,87.33,86.44,86.83,86.13,88.94,92.25,89.91,87.36
2,,,,,,531.06,535.56,536.45,542.28,550.54,...,392.0,383.69,396.36,396.92,396.24,387.86,410.0,406.23,404.55,401.79
3,,,,,,42.73,43.71,43.7,44.51,45.3,...,30.53,29.2,30.12,30.45,30.95,30.28,31.86,32.53,31.73,32.46
4,,,,,,87.24,91.29,90.16,92.49,93.48,...,62.05,58.74,58.96,57.72,58.0,59.49,59.92,62.33,60.66,58.99


In [12]:
NUM_ENTRIES = len(new)
print(NUM_ENTRIES)

912


In [13]:
means = stds = min_vals = max_vals = tickers = np.zeros(NUM_ENTRIES)
tickers = list(map(str, tickers))

"""
The new dataframe consists of 912 rows. They are ordered in 19 sets of 48 each. 
This means that the 0th value and 48th value both point to the same stock
"""

for i in range(NUM_STOCKS):
    row = df.iloc[i]
    mean = row['mean']
    std = row['std']
    min_val = row['min']
    max_val = row['max']
    ticker = row['ticker']


In [14]:
for i in range(0, 48, 1):
    row = df.iloc[i]
    mean = row['mean']
    std = row['std']
    min_val = row['min']
    max_val = row['max']
    ticker = row['ticker']
    for j in range(i, 912, 48):
        means[j] = mean
        stds[j] = std
        min_vals[j] = min_val
        max_vals[j] = max_val
        tickers[j] = ticker

In [15]:
new['mean'] = means
new['std'] = stds
new['min'] = min_vals
new['max'] = max_vals
new['ticker'] = tickers

In [16]:
new.head(50)

Unnamed: 0,ticker,std,min,max,mean,price_1,price_2,price_3,price_4,price_5,...,price_41,price_42,price_43,price_44,price_45,price_46,price_47,price_48,price_49,price_50
0,NVO,53.2909,53.2909,53.2909,53.2909,49.5701,49.7771,50.2947,50.4069,50.6312,...,45.1445,43.5744,44.6355,45.0323,44.8253,45.1445,45.4723,44.6527,48.6384,48.5348
1,VRTX,194.67,194.67,194.67,194.67,123.92,124.38,123.99,124.97,127.68,...,88.06,84.12,87.33,86.44,86.83,86.13,88.94,92.25,89.91,87.36
2,REGN,550.54,550.54,550.54,550.54,531.06,535.56,536.45,542.28,550.54,...,392.0,383.69,396.36,396.92,396.24,387.86,410.0,406.23,404.55,401.79
3,SGEN,85.6,85.6,85.6,85.6,42.73,43.71,43.7,44.51,45.3,...,30.53,29.2,30.12,30.45,30.95,30.28,31.86,32.53,31.73,32.46
4,ALNY,147.89,147.89,147.89,147.89,87.24,91.29,90.16,92.49,93.48,...,62.05,58.74,58.96,57.72,58.0,59.49,59.92,62.33,60.66,58.99
5,BMRN,106.3,106.3,106.3,106.3,104.4,105.83,105.91,106.02,106.3,...,78.11,74.31,76.92,76.64,80.75,82.5,87.44,88.91,89.73,87.7
6,INCY,151.69,151.69,151.69,151.69,109.21,110.26,108.98,109.98,110.27,...,73.47,71.29,74.0,75.0,74.76,74.01,74.79,74.0,73.51,71.64
7,TECH,212.847,212.847,212.847,212.847,85.2154,85.3011,86.6815,87.7763,87.4241,...,84.3325,82.0582,83.0329,83.654,82.9182,82.3544,85.6226,86.1291,87.2471,87.8683
8,UTHR,169.31,169.31,169.31,169.31,159.28,159.15,159.99,161.87,159.26,...,128.2,126.26,128.75,125.83,123.01,122.14,126.94,128.48,126.0,127.0
9,JAZZ,183.49,183.49,183.49,183.49,144.15,144.24,141.78,142.65,142.56,...,124.96,120.0,121.92,124.72,123.48,121.42,123.4,124.92,125.44,122.78


As we can see, we have the same metadata for rows 0 and 1 as we do for 48 and 49 respectively. 

## Preparing data for ML

### Establishing variables

We first need to establish input feature names and the name of the output feature. Here, we'll use every feature apart from the 50th day's price as the target feature, i.e, the 50th day's price is our y-variable.

In [17]:
inp_features = new_col_names[1:-1]
X = new[inp_features]
X.head()

Unnamed: 0,std,min,max,mean,price_1,price_2,price_3,price_4,price_5,price_6,...,price_40,price_41,price_42,price_43,price_44,price_45,price_46,price_47,price_48,price_49
0,53.2909,53.2909,53.2909,53.2909,49.5701,49.7771,50.2947,50.4069,50.6312,50.2947,...,44.1869,45.1445,43.5744,44.6355,45.0323,44.8253,45.1445,45.4723,44.6527,48.6384
1,194.67,194.67,194.67,194.67,123.92,124.38,123.99,124.97,127.68,125.55,...,89.41,88.06,84.12,87.33,86.44,86.83,86.13,88.94,92.25,89.91
2,550.54,550.54,550.54,550.54,531.06,535.56,536.45,542.28,550.54,542.62,...,399.05,392.0,383.69,396.36,396.92,396.24,387.86,410.0,406.23,404.55
3,85.6,85.6,85.6,85.6,42.73,43.71,43.7,44.51,45.3,44.82,...,30.71,30.53,29.2,30.12,30.45,30.95,30.28,31.86,32.53,31.73
4,147.89,147.89,147.89,147.89,87.24,91.29,90.16,92.49,93.48,94.18,...,64.52,62.05,58.74,58.96,57.72,58.0,59.49,59.92,62.33,60.66


In [18]:
target_feature = new_col_names[-1]
y = new[target_feature]
y

0       48.5348
1       87.3600
2      401.7900
3       32.4600
4       58.9900
         ...   
907     51.4146
908     13.0000
909      9.0800
910     15.7900
911     10.3984
Name: price_50, Length: 912, dtype: float64

### Training-Testing Split
For this notebook, we're just splitting data into training and testing. You might want to create a validation set as well.

In [19]:
from sklearn.model_selection import train_test_split

# 20% of our data will be validation data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

Let's save this in case we want to come back later

In [20]:
X_train.to_csv('X_train.csv')
X_test.to_csv('X_test.csv')
y_train.to_csv('y_train.csv')
y_test.to_csv('y_test.csv')

# Creating the model

## Random forest model

We will be using a simple random forest model for demonstation purposes in this notebook. **You should not use this simple of a model in the competition**. As you will see, it will not achieve ideal performance. This will put your models at a competitive disadvantage when predicting prices.

In [21]:
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

In [22]:
model = RandomForestRegressor(n_estimators=2, random_state=0, n_jobs=2)
model.fit(X_train, y_train)

In [23]:
y_pred = model.predict(X_test)

In [24]:
from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 2.008217759562841
Mean Squared Error: 18.381158921407092
Root Mean Squared Error: 4.287325380864752


## Custom Evaluation functions
These will help us better contextualise our predictions in the context of the dataset

MAPE: https://en.wikipedia.org/wiki/Mean_absolute_percentage_error

WMAPE: https://en.wikipedia.org/wiki/WMAPE

In [25]:
def mape(y_true, y_pred):
    ape = np.abs((y_true - y_pred) / y_true)
    ape[~np.isfinite(ape)] = 1. # pessimist estimate
    return np.mean(ape)

def wmape(y_true, y_pred):
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true))

In [26]:
mape(y_test, y_pred)

0.04017294247796121

In [27]:
wmape(y_test, y_pred)

0.03530892531200969

# What's next?
Your job: Create a model that does the best possible job of predicting the next day's stock price based on historical data for that stock.
Explore:
- Creating training data in a different fashion
- New features
- Alternative algorithsm and models, such as neural networks
- Changing the hyperparameters on the random forest

**Remember what your algorithm will be fed and craft input features around that.**
Your algorithm will be given all-time historical data for each stock (as was the case in the original DataFrame here).  From there, you can choose how many days' data to use, what statistical features to extract, etc.

**Good luck!**

<a style='text-decoration:none;line-height:16px;display:flex;color:#5B5B62;padding:10px;justify-content:end;' href='https://deepnote.com?utm_source=created-in-deepnote-cell&projectId=86376c80-cc11-49b3-a94a-d3d931acdd32' target="_blank">
 </img>
Created in <span style='font-weight:600;margin-left:4px;'>Deepnote</span></a>