# Fastai Starter Notebook

While I was looking into the [JPX Tokyo Stock Exchange Prediction competition](https://www.kaggle.com/competitions/jpx-tokyo-stock-exchange-prediction), I realized that there was no starter notebook for fastai yet. Because of this, I decided to put together a notebook that uses fastai Tabular. I will keep updating it as I make improvements.

## Import fastai.tabular

In [None]:
from fastai.tabular.all import *

## Read Data

In [None]:
input_path = Path('../input/jpx-tokyo-stock-exchange-prediction')

In [None]:
all_df = pd.read_csv(input_path/'train_files'/'stock_prices.csv')

In [None]:
# Some Target values are NaN. I am replacing them with 0 for now, but there is probably a better way to handle this.
# Assigning the result back avoids relying on an in-place fill through attribute access.
all_df['Target'] = all_df['Target'].fillna(0)
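
One possible alternative to filling with 0 is to drop the rows whose Target is missing, so the model never trains on a made-up label. The snippet below is only a sketch on a toy frame (the values are invented for illustration, and `toy_df` and `labelled_df` are hypothetical names); I have not checked whether it scores better than the fill.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for stock_prices.csv (values are made up for illustration)
toy_df = pd.DataFrame({
    'SecuritiesCode': [1301, 1332, 1333, 1376],
    'Close': [2742.0, 571.0, np.nan, 1510.0],
    'Target': [0.0007, np.nan, 0.0126, np.nan],
})

# Keep only rows whose Target is known; feature columns may still contain NaN
labelled_df = toy_df.dropna(subset=['Target']).reset_index(drop=True)
n_dropped = len(toy_df) - len(labelled_df)
print(f'dropped {n_dropped} rows without a Target')
```

Missing feature values (like the NaN in `Close` above) are left alone here, since the `FillMissing` proc used later in this notebook deals with those.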

In [None]:
def prices_data_transform(df, log_columns):
    df = df.copy()
    for col in log_columns: 
        df[f'{col}_log'] = np.log1p(df[col])
    return df

In [None]:
all_df = prices_data_transform(all_df, ['Open', 'High', 'Low', 'Close', 'Volume'])
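
As a quick illustration of what the transform does: `np.log1p(x)` computes `log(1 + x)`, so it stays finite when a column contains a 0 (which I am assuming can happen for something like Volume), and `np.expm1` undoes it. The numbers below are arbitrary.

```python
import numpy as np

raw = np.array([0.0, 9.0, 99.0, 999.0])   # arbitrary example values
logged = np.log1p(raw)                     # log(1 + x), defined at x = 0
restored = np.expm1(logged)                # inverse of log1p

print(logged)
print(restored)
```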

## Split Data

I wanted to split the data in a way that made sense, and I considered a few options. One was a time-based approach, where everything after a certain date would be validation and everything before would be training. I decided (for now at least) to split by security code instead, because we know that different time periods will produce different results. By splitting on security code, I am hoping that my training and validation data will behave similarly and that I can build something that works well. I may still end up doing the time-based split at some point if this approach performs poorly.

In [None]:
all_stocks = all_df.SecuritiesCode.unique()
np.random.seed(42)
valid_stocks = np.random.choice(all_stocks, 200, replace=False)
non_valid_stocks = [s for s in all_stocks if s not in valid_stocks]
test_stocks = np.random.choice(non_valid_stocks, 100, replace=False)
train_stocks = [s for s in non_valid_stocks if s not in test_stocks]

In [None]:
# Make sure that we don't accidentally have more or fewer stocks than we started with
test_eq(len(train_stocks)+len(valid_stocks)+len(test_stocks), len(all_stocks))

In [None]:
all_df['is_train'] = np.isin(all_df.SecuritiesCode.values, train_stocks)
all_df['is_valid'] = np.isin(all_df.SecuritiesCode.values, valid_stocks)
all_df['is_test'] = np.isin(all_df.SecuritiesCode.values, test_stocks)
test_df = all_df[all_df.is_test].copy()
valid_df = all_df[all_df.is_valid].copy()
train_df = all_df[all_df.is_train].copy()
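
For comparison, the time-based split I mentioned earlier could look roughly like the sketch below. It runs on a toy frame, and the cutoff date is a hypothetical value picked only for the example; nothing else in this notebook uses it.

```python
import pandas as pd

# Toy frame; the real data has one row per stock per trading day
toy_df = pd.DataFrame({
    'Date': pd.to_datetime(['2021-01-04', '2021-01-05', '2021-01-06', '2021-01-07'] * 2),
    'SecuritiesCode': [1301] * 4 + [1332] * 4,
})

cutoff = pd.Timestamp('2021-01-06')            # hypothetical cutoff date
toy_df['is_valid'] = toy_df['Date'] >= cutoff  # everything on or after the cutoff is validation

train_part = toy_df[~toy_df.is_valid]
valid_part = toy_df[toy_df.is_valid]
print(len(train_part), len(valid_part))
```

With this kind of split every stock appears in both sets, but no validation row is earlier than any training row.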

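Now that we've split everything and labeled it, let's recombine training and validation into a single DataFrame, since `TabularDataLoaders.from_df` takes one frame plus the indices of the validation rows.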

In [None]:
tv_df = pd.concat((train_df, valid_df))

In [None]:
tv_df.reset_index(drop=True, inplace=True)

In [None]:
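# Scaled copy of Target. It is not used below; note that the column name says 10000 but the factor is 1000.
tv_df['Target_10000'] = tv_df.Target*1000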

In [None]:
valid_idx = tv_df[tv_df.is_valid].index

## Create Tabular dls

In [None]:
dls = TabularDataLoaders.from_df(
    tv_df,
    valid_idx=valid_idx,
    y_names='Target',
    cat_names=['SupervisionFlag'],
    # Not used yet: 'ExpectedDividend', 'AdjustmentFactor'
    cont_names=['Open_log', 'High_log', 'Low_log', 'Close_log', 'Volume_log'],
    procs=[Categorify, FillMissing, Normalize],
    bs=1024,
)

## Create tabular learner

In [None]:
learn = tabular_learner(dls)

## Run lr_find

In [None]:
learn.lr_find(suggest_funcs=(slide,valley))

## fit_one_cycle

In [None]:
learn.fit_one_cycle(1, 1e-3)

## Evaluation

In [None]:
def calc_spread_return_per_day(df, portfolio_size, toprank_weight_ratio):
    assert df['Rank'].min() == 0
    assert df['Rank'].max() == len(df['Rank']) - 1
    weights = np.linspace(start=toprank_weight_ratio, stop=1, num=portfolio_size)
    purchase = (df.sort_values(by='Rank')['Target'][:portfolio_size] * weights).sum() / weights.mean()
    short = (df.sort_values(by='Rank', ascending=False)['Target'][:portfolio_size] * weights).sum() / weights.mean()
    return purchase - short

In [None]:
def calc_spread_return_sharpe(
    df: pd.DataFrame, # predicted results
    portfolio_size: int = 200, # # of equities to buy/sell
    toprank_weight_ratio: float = 2 # the relative weight of the most highly ranked stock compared to the least.
) -> float: # Sharpe ratio
    buf = df.groupby('Date').apply(calc_spread_return_per_day, portfolio_size, toprank_weight_ratio)
    sharpe_ratio = buf.mean() / buf.std()
    return sharpe_ratio
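
To make the metric a bit more concrete, here is a small worked example of the per-day spread return on four made-up stocks. The helper below repeats the same arithmetic as `calc_spread_return_per_day` (without the assertions) only so that the block runs on its own; `spread_return_one_day` and the toy numbers are my own illustration, not part of the competition code.

```python
import numpy as np
import pandas as pd

def spread_return_one_day(df, portfolio_size, toprank_weight_ratio):
    # Weights fall linearly from toprank_weight_ratio (best rank) down to 1
    weights = np.linspace(start=toprank_weight_ratio, stop=1, num=portfolio_size)
    ranked = df.sort_values(by='Rank')['Target'].values
    purchase = (ranked[:portfolio_size] * weights).sum() / weights.mean()    # top-ranked stocks
    short = (ranked[::-1][:portfolio_size] * weights).sum() / weights.mean() # bottom-ranked stocks
    return purchase - short

# One toy trading day with made-up returns
day = pd.DataFrame({'Target': [0.03, 0.01, -0.01, -0.02]})

perfect = day.assign(Rank=[0, 1, 2, 3])   # rank 0 given to the highest return
inverted = day.assign(Rank=[3, 2, 1, 0])  # rank 0 given to the lowest return

best = spread_return_one_day(perfect, portfolio_size=2, toprank_weight_ratio=2)
worst = spread_return_one_day(inverted, portfolio_size=2, toprank_weight_ratio=2)
print(best, worst)
```

With weights `[2, 1]`, the perfect ranking buys `(0.03*2 + 0.01*1) / 1.5` and shorts `(-0.02*2 + -0.01*1) / 1.5`, so the spread is about 0.08; the inverted ranking gives about -0.08. The Sharpe ratio is then the mean of these daily spreads divided by their standard deviation.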

## Calculate Sharpe Ratio on validation set

In [None]:
valid_df['Rank'] = valid_df.groupby('Date').Target.rank(method='first', ascending=False)-1

# Rank each day's stocks by the predicted Target (rank 0 = highest prediction),
# but keep the true Target so the Sharpe ratio is scored against actual returns.
daily_dfs = []
for dt, df in valid_df.groupby('Date'):
    test_dl = learn.dls.test_dl(df)
    with learn.no_bar():
        preds, targs, decoded = learn.get_preds(dl=test_dl, with_decoded=True)
    day_df = df[['Date', 'SecuritiesCode', 'Target']].copy()  # true Target, used for scoring
    day_df['Pred'] = decoded.squeeze().tolist()
    day_df['Rank'] = day_df['Pred'].rank(method='first', ascending=False).astype(int) - 1
    daily_dfs.append(day_df)
full_submission_df = pd.concat(daily_dfs)

calc_spread_return_sharpe(full_submission_df, portfolio_size=20)

## Create Submission

In [None]:
import jpx_tokyo_market_prediction

In [None]:
env = jpx_tokyo_market_prediction.make_env()

In [None]:
iter_test = env.iter_test()    # an iterator which loops over the test files

In [None]:
full_submission_df = pd.DataFrame(columns=['Date', 'SecuritiesCode', 'Rank'])

In [None]:
for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    prices = prices_data_transform(prices, ['Open', 'High', 'Low', 'Close', 'Volume'])
    test_dl = learn.dls.test_dl(prices)
    with learn.no_bar():
        preds, targs, decoded = learn.get_preds(dl=test_dl, with_decoded=True)
    test_dl.items['Target'] = decoded.squeeze().tolist()
    # Rank 0 should go to the highest predicted Target, so rank in descending order
    test_dl.items['Rank'] = test_dl.items.groupby('Date')['Target'].rank(method='first', ascending=False) - 1
    # Cast Rank to int before both storing and submitting
    submission_df = test_dl.items[['Date', 'SecuritiesCode', 'Rank']].astype({'Rank': np.int64})
    full_submission_df = pd.concat((full_submission_df, submission_df))
    env.predict(submission_df)

In [None]:
submission_df
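
Before submitting, it may be worth checking that the ranks for each date run from 0 to n-1 with no gaps or repeats, which is what the assertions in `calc_spread_return_per_day` expect. The helper below is a hypothetical check of my own (the name `ranks_are_valid` and the toy frames are not from the competition API), shown on two tiny examples.

```python
import pandas as pd

def ranks_are_valid(df):
    # Hypothetical helper: True if every Date's ranks are exactly 0..n-1
    for _, day in df.groupby('Date'):
        if sorted(day['Rank']) != list(range(len(day))):
            return False
    return True

good = pd.DataFrame({
    'Date': ['2021-12-06'] * 3,
    'SecuritiesCode': [1301, 1332, 1333],
    'Rank': [2, 0, 1],
})
bad = good.assign(Rank=[1, 1, 2])   # duplicate rank, and rank 0 is missing

print(ranks_are_valid(good), ranks_are_valid(bad))
```

In this notebook it could be called as `ranks_are_valid(submission_df)` on the last batch.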