# Introduction

This notebook is meant to be a starting point for teams competing in Bot Xchange in ShriTeq 2022. It will help you:

- Get started with processing the data

- Learn how to reshape the time series data into something you can use for your model

- Get started with a rudimentary RandomForestRegressor model to predict stock prices

To use this notebook:

1. Click the three dots at the top right of this page

2. Click 'Download'. This will download this notebook and all project files to your local filesystem.

# Data processing

In [1]:
import pandas as pd
import numpy as np

Below, we import the historical stock price dataset. This should have been provided to you as a link in a separate document, but you can also download it manually from the sidebar here.

The important thing to note here is that each row consists of historical data for one stock. Day 0 is the price on the first day, day 1 is the price on the second day, and so on, i.e., the prices here are in sequence. Also note that we have the same number of days of data for all stocks - i.e., there are no missing values.

We will use this dataset to train our model to predict stock prices. 

Note: When team programs are run in the evaluation phase, a different dataset consisting of different stocks in a similar industry will be used.

In [2]:
df = pd.read_csv('med_stock_data.csv')
df.head()

Unnamed: 0,ticker,day_0,day_1,day_2,day_3,day_4,day_5,day_6,day_7,day_8,...,day_1354,day_1355,day_1356,day_1357,day_1358,day_1359,day_1360,day_1361,day_1362,day_1363
0,UNH,71.5371,71.9338,72.1454,71.3696,71.4137,70.0118,70.8759,70.1882,71.0786,...,279.8966,279.6659,283.5101,287.2582,282.0013,283.76,283.7984,284.4808,284.5192,282.2223
1,JNJ,80.5057,80.6913,80.5622,80.4331,81.579,80.8688,82.0067,81.5064,81.6274,...,133.1297,133.8,133.9583,136.7419,135.6806,135.9226,135.8947,136.1088,135.7178,135.0848
2,LLY,50.9207,50.9124,51.3297,50.8373,51.2462,50.4784,51.2045,50.9708,51.526,...,118.4571,122.1636,123.31,126.5675,126.472,125.211,125.1442,125.6887,125.6409,125.3161
3,PFE,20.3353,20.5124,20.2503,20.0519,20.1511,19.896,20.2148,20.0235,20.0164,...,33.757,33.3889,33.3804,33.6799,33.7398,33.6628,33.6371,33.7484,33.6371,33.2006
4,ABBV,37.15,37.5427,37.6784,36.7573,36.8359,37.6427,37.8426,37.9997,37.8212,...,79.1581,79.2021,78.8766,79.5099,78.7447,79.5099,79.334,79.1405,78.4721,77.5837


In [3]:
df.shape

(84, 1365)

This means that we have 84 stocks, with 1364 days of stock prices for each.
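The layout described above (one row per stock, one column per day, no missing values) can be checked in a few lines. The sketch below uses a small made-up frame rather than the real dataset, and the names `toy`, `day_cols`, and `has_missing` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the real dataset: one row per stock, one column per day.
toy = pd.DataFrame({
    'ticker': ['AAA', 'BBB'],
    'day_0': [10.0, 20.0],
    'day_1': [10.5, 19.5],
    'day_2': [11.0, 19.0],
})

# Every column except 'ticker' holds one day of prices.
day_cols = [c for c in toy.columns if c.startswith('day_')]
num_stocks, num_days = len(toy), len(day_cols)

# True if any price is missing anywhere in the frame.
has_missing = bool(toy[day_cols].isna().any().any())

print(num_stocks, num_days, has_missing)
```

Running the same three checks on `df` should agree with the shape reported above, assuming the only non-price column is `ticker`.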

For our model that predicts the next opening price of a stock, we'll take the last _n_ days' prices as input data to predict the price on the next day. We will set n = 50 for this notebook, but **we encourage you to experiment** with how many days work best.

These **may or may not be good** features for this problem. Do not blindly follow the selection of features or model architecture in this notebook. This notebook is primarily intended to help you **get started** with processing and reshaping the data and building a **rudimentary** model with it.
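The windowing idea described above (the last _n_ prices as input, the following price as the target) can be sketched for a single price series as follows. This is only an illustration: `make_windows` is a hypothetical helper, and it produces overlapping windows, which is a different choice from the non-overlapping sets of 51 days built later in this notebook.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(prices, n):
    """Return (X, y): each row of X holds n consecutive prices, y holds the price that follows."""
    prices = np.asarray(prices, dtype=float)
    # Each window has n inputs plus 1 target, so it is n + 1 prices long.
    windows = sliding_window_view(prices, n + 1)
    return windows[:, :-1], windows[:, -1]

X_demo, y_demo = make_windows([1.0, 2.0, 3.0, 4.0, 5.0], n=3)
print(X_demo)
print(y_demo)
```

Overlapping windows yield many more training rows, but neighbouring rows share most of their values, so a random train/test split on them may look better than it really is.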

## Constants

In [4]:
NUM_STOCKS = len(df)
print(f'Number of stocks in DataFrame: {NUM_STOCKS}')

Number of stocks in DataFrame: 84


In [5]:
ORIGINAL_NUM_DAYS = len(df.iloc[0]) - 1
print(f'Original number of days for which we have price data for each ticker: {ORIGINAL_NUM_DAYS}')

Original number of days for which we have price data for each ticker: 1364


In [6]:
DAYS_PER_BATCH = 51
print(f'Numbers of days for which we have price data in one division of the original data: {DAYS_PER_BATCH}')
SETS_PER_STOCK = (ORIGINAL_NUM_DAYS // DAYS_PER_BATCH)
print(f'Expected number of sets of {DAYS_PER_BATCH} for each stock: {SETS_PER_STOCK}')
EXPECTED_TOTAL_SETS = SETS_PER_STOCK * NUM_STOCKS
print(f'Expected number of sets of {DAYS_PER_BATCH}: {EXPECTED_TOTAL_SETS}')

Numbers of days for which we have price data in one division of the original data: 51
Expected number of sets of 51 for each stock: 26
Expected number of sets of 51: 2184


## Creating the new DataFrame

We need to turn this raw dataset into something that our model can use. We want each row to be a training example for our model, i.e., each row will consist of 50 days of stock prices as the input variables and the price on the 51st day as the target variable to predict.

We need to extract these 2184 sets of 51 from the original dataset. We will then split this into training and test data. 

Note: remember that you may wish to use a different number of days as an input than 50. In that case, you may need to update parts of this notebook.

In [7]:
# We have 1364 days of data. The largest multiple of 51 that fits into 1364 is 1326, so day_1326 is the last column from which we will extract data.
last_day = ORIGINAL_NUM_DAYS  - (ORIGINAL_NUM_DAYS %  DAYS_PER_BATCH)
last_day

1326

In [8]:
new_col_names = []
for i in range(1, DAYS_PER_BATCH + 1, 1):
    col_name = 'price_' + str(i)
    new_col_names.append(col_name)
# Create an empty DataFrame with these column headings
new = pd.DataFrame(columns=new_col_names)
new

Unnamed: 0,price_1,price_2,price_3,price_4,price_5,price_6,price_7,price_8,price_9,price_10,...,price_42,price_43,price_44,price_45,price_46,price_47,price_48,price_49,price_50,price_51


In [9]:
"""
Here, we create a new dataframe where each row represents one training example. 
This piece of code can be difficult to wrap your head around, so we suggest playing around with the variables
if you're trying to understand it. 
It should reflect changes in previous parts of your code if all your constants are configured correctly 
"""
for i in range(1, DAYS_PER_BATCH + 1, 1):
    dfs = []
    original_cols = [f'day_{j}' for j in range(i, last_day + 1, DAYS_PER_BATCH)]
    for name in original_cols:
        dfs.append(df[name])

    combined = pd.concat(dfs, ignore_index=True).reindex()
    new[f'price_{i}'] = combined

new.tail()

Unnamed: 0,price_1,price_2,price_3,price_4,price_5,price_6,price_7,price_8,price_9,price_10,...,price_42,price_43,price_44,price_45,price_46,price_47,price_48,price_49,price_50,price_51
2179,60.8,60.4,61.02,61.51,61.32,62.0,61.45,62.0,61.79,62.42,...,62.97,61.79,62.27,62.08,62.98,62.86,62.91,63.43,64.98,66.96
2180,129.94,124.66,127.41,127.75,127.89,127.34,127.6,127.1,126.2,124.82,...,124.19,123.24,124.42,124.59,127.1,126.12,125.89,128.18,128.63,128.16
2181,94.16,91.87,92.53,90.27,89.65,87.64,86.45,88.98,86.18,86.25,...,84.48,83.81,85.89,87.95,86.05,83.3,85.0,89.42,91.29,92.59
2182,34.6514,33.957,34.358,34.6807,34.4362,35.0719,36.4705,37.0182,37.3899,37.4779,...,38.9058,38.3679,38.5439,38.2701,38.0451,38.5048,38.0354,38.8178,39.4144,39.5904
2183,55.3,53.67,54.87,55.75,56.37,59.5,59.66,57.94,59.52,60.78,...,59.05,57.66,57.94,58.12,58.94,58.94,58.89,60.17,61.98,67.62
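One way to convince yourself of what the loop above does is to compare it with a plain NumPy reshape on a tiny made-up frame. In the sketch below, `batches_by_loop` mirrors the loop and `batches_by_reshape` is an alternative formulation; both are hypothetical helper names, and both assume that the only non-price column is `ticker` and that the column `day_{last_day}` exists.

```python
import numpy as np
import pandas as pd

def batches_by_loop(df, days_per_batch):
    """Column-gathering approach, as in the cell above."""
    num_days = df.shape[1] - 1  # every column except 'ticker'
    last_day = num_days - (num_days % days_per_batch)
    out = pd.DataFrame()
    for i in range(1, days_per_batch + 1):
        cols = [df[f'day_{j}'] for j in range(i, last_day + 1, days_per_batch)]
        out[f'price_{i}'] = pd.concat(cols, ignore_index=True)
    return out

def batches_by_reshape(df, days_per_batch):
    """Same rows, built by reshaping the day_1..day_{last_day} block."""
    num_days = df.shape[1] - 1
    last_day = num_days - (num_days % days_per_batch)
    sets_per_stock = last_day // days_per_batch
    arr = df[[f'day_{j}' for j in range(1, last_day + 1)]].to_numpy()
    arr = arr.reshape(len(df), sets_per_stock, days_per_batch)
    # Put the set index first so that rows are ordered set by set, then stock by stock.
    return arr.transpose(1, 0, 2).reshape(-1, days_per_batch)

# Toy data: 3 stocks, 7 days; the price of stock s on day d is 100 * s + d.
toy = pd.DataFrame(
    [[f'S{s}'] + [100.0 * s + d for d in range(7)] for s in range(3)],
    columns=['ticker'] + [f'day_{d}' for d in range(7)],
)
loop_result = batches_by_loop(toy, 3).to_numpy()
reshape_result = batches_by_reshape(toy, 3)
print(np.array_equal(loop_result, reshape_result))
```

With this ordering, the first `NUM_STOCKS` rows hold the first set of days for every stock, the next `NUM_STOCKS` rows hold the second set, and so on.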


In [10]:
NUM_ENTRIES = len(new)
print(NUM_ENTRIES)
print(EXPECTED_TOTAL_SETS) # we should have as many sets of 51 as we had predicted earlier

2184
2184


## Preparing data for ML

### Establishing variables

Here, we extract the first 50 days of data as the x-variables (i.e., the input variables) and the 51st day of data as the y-variable (i.e., the target variable to predict).

In [11]:
inp_features = new_col_names[:-1]
X = new[inp_features]
X.head()

Unnamed: 0,price_1,price_2,price_3,price_4,price_5,price_6,price_7,price_8,price_9,price_10,...,price_41,price_42,price_43,price_44,price_45,price_46,price_47,price_48,price_49,price_50
0,71.9338,72.1454,71.3696,71.4137,70.0118,70.8759,70.1882,71.0786,72.0396,72.4805,...,76.7499,76.1478,75.1295,75.59,76.7233,75.1738,74.6337,76.4577,75.3421,75.4572
1,80.6913,80.5622,80.4331,81.579,80.8688,82.0067,81.5064,81.6274,82.2246,82.4828,...,86.5284,86.1059,84.5459,84.684,85.8621,84.8709,83.2216,85.009,83.1566,82.4335
2,50.9124,51.3297,50.8373,51.2462,50.4784,51.2045,50.9708,51.526,51.139,51.6018,...,55.0514,55.43,54.7654,54.8747,55.4721,54.799,53.722,54.6391,53.8651,53.8987
3,20.5124,20.2503,20.0519,20.1511,19.896,20.2148,20.0235,20.0164,20.0235,20.4487,...,21.0934,20.81,20.6612,20.7533,20.7958,20.5691,20.4132,20.8313,20.6258,20.6612
4,37.5427,37.6784,36.7573,36.8359,37.6427,37.8426,37.9997,37.8212,38.2853,38.7994,...,41.5983,41.3627,41.0842,40.9985,41.8339,41.1271,40.3702,41.784,40.283,39.4844


In [12]:
target_feature = new_col_names[-1]
y = new[target_feature]
y

0        74.6780
1        81.6779
2        53.6379
3        20.2928
4        38.5562
          ...   
2179     66.9600
2180    128.1600
2181     92.5900
2182     39.5904
2183     67.6200
Name: price_51, Length: 2184, dtype: float64

### Training-Testing Split
For this notebook, we're just splitting data into training and testing. You might want to create a validation set as well.

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
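If you do want a validation set, one possible approach is to shuffle the row positions once and cut them into three groups. The sketch below uses only NumPy and is not the method used in this notebook; `three_way_split` and its default fractions are assumptions chosen for illustration.

```python
import numpy as np

def three_way_split(n_rows, val_frac=0.1, test_frac=0.1, seed=1):
    """Shuffle row positions and cut them into train / validation / test index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_frac))
    n_val = int(round(n_rows * val_frac))
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(2184)
print(len(train_idx), len(val_idx), len(test_idx))
```

The index arrays can then be applied with `X.iloc[train_idx]`, `y.iloc[train_idx]`, and so on. Keep in mind that any random split mixes rows from different time periods, so a split ordered by time may give a more honest estimate.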

Let's save this in case we want to come back to it later.

In [14]:
X_train.to_csv('X_train.csv', index=False)
X_test.to_csv('X_test.csv', index=False)
y_train.to_csv('y_train.csv', index=False)
y_test.to_csv('y_test.csv', index=False)

# Creating the model

## Random forest model

We will be using a simple random forest model for demonstration purposes in this notebook. **You should not use a model this simple in the competition**. As you will see, it does not achieve ideal performance, which would put you at a competitive disadvantage when predicting prices. There may be other types of models better suited to this problem.

In [15]:
from sklearn.ensemble import RandomForestRegressor

In [16]:
model = RandomForestRegressor(n_estimators=100, random_state=0, n_jobs=2)
model.fit(X_train.values, y_train.values)

In [17]:
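# The model was fitted on plain arrays, so we pass plain arrays here too (this avoids a feature-name warning)
y_pred = model.predict(X_test.values)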



In [18]:
from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 1.8956070730593637
Mean Squared Error: 64.45397726942628
Root Mean Squared Error: 8.028323440758118
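The RMSE above is much larger than the MAE, which suggests that a small number of predictions are far off. To put such numbers in context, it can help to compare them with a naive "persistence" baseline that simply predicts the last input price. The sketch below is an added illustration, not part of the original workflow; `persistence_scores` is a hypothetical helper and the demo values are made up.

```python
import numpy as np

def persistence_scores(X, y):
    """Score the 'tomorrow equals today' guess: predict the last input price for each row."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    errors = y - X[:, -1]
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    return mae, rmse

# Tiny made-up example: three rows of four input prices each.
X_demo = np.array([[10.0, 10.5, 10.2, 10.4],
                   [50.0, 49.0, 49.5, 50.5],
                   [20.0, 20.1, 20.3, 20.2]])
y_demo = np.array([10.9, 50.0, 20.2])
mae, rmse = persistence_scores(X_demo, y_demo)
print(mae, rmse)
```

Calling `persistence_scores(X_test.values, y_test.values)` gives figures that can be compared directly with the model's errors. A model that cannot beat this baseline is probably not learning much from the input window.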


## Saving the model

Let's now save the model to a file. When we want to use the model in our program, we'll load it from the file.

We will use the pickle library for this.

In [22]:
import pickle
FILE_NAME = 'model.sav'
# Use a context manager so that the file is closed once the model has been written
with open(FILE_NAME, 'wb') as f:
    pickle.dump(model, f)

## How to use the model in your code

In [23]:
# Load the model from the local filesystem. The path may vary based on how your code is structured
with open(FILE_NAME, 'rb') as f:
    model = pickle.load(f)

In [24]:
# Take the last row of the test dataset as a sample input. In your code, you'll need to write something different
x = X_test.values[-1]
sample_pred = (model.predict([x]))[0] # predict() expects a 2D input (one row per sample) and returns a 1D array of predictions
sample_pred
sample_pred

38.476219999999984

# What's next?
Your job: Create a model that does the best possible job of predicting the next day's stock price based on historical data for that stock.
Explore:
- Using a different model architecture - research what kinds of models are best suited to predicting the next price in a series of prices
- Using a larger set of days
- Creating new features
- Changing the hyperparameters on the random forest

**Remember what your algorithm will be fed and craft input features around that.**
Your algorithm will be given all-time historical data for each stock (as was the case in the original DataFrame here).  From there, you can choose how many days' data to use, what statistical features to extract, etc.
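As a rough illustration of that last point, the sketch below turns one stock's full price history into a single feature row: the most recent `n_days` prices plus a few summary statistics. The function name `build_features` and the particular statistics are assumptions picked for the example, not a recommended feature set.

```python
import numpy as np

def build_features(history, n_days=50):
    """Turn one stock's full price history into a single feature row.

    Uses the last n_days prices plus four summary statistics of that window.
    """
    history = np.asarray(history, dtype=float)
    if len(history) < n_days:
        raise ValueError(f'need at least {n_days} prices, got {len(history)}')
    window = history[-n_days:]
    # Day-over-day relative changes within the window.
    returns = np.diff(window) / window[:-1]
    extras = np.array([window.mean(), window.std(), returns.mean(), returns.std()])
    return np.concatenate([window, extras])

# Made-up history: 200 prices rising steadily from 100 to 110.
demo = build_features(np.linspace(100.0, 110.0, 200), n_days=50)
print(demo.shape)
```

Whatever features you choose, the same function has to be applied both when building the training rows and when making predictions, so that the model sees inputs in the same layout each time.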

Look at the reference documents provided to you for more details, or ask for help on [Discord](https://discord.gg/Gt3ZGvssgH).

**Good luck!**
