# Intro

Now that the data has been transformed into a usable format, some elementary EDA will be performed to check the numerical features.

In [1]:
import pandas as pd
from os.path import join
from os import listdir

In [2]:
def read_batches(path):
    """
    Read the batches stored in the path and group them
    :param path: the path with batches
    :return: a pandas dataframe with the grouped and de-duplicated batches
    """
    # Regroup batches
    batches = [pd.read_csv(join(path, f)) for f in listdir(path)]

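    # The observations are sampled at random, so duplicates are unlikely but possible; drop them.
    # drop_duplicates returns a new frame, so the result has to be assigned back.
    observations = pd.concat(batches)
    observations = observations.drop_duplicates(subset=["date", "asset_file"])

    # Parse the date strings into datetimes
    observations["date"] = pd.to_datetime(observations["date"])

    # Drop rows with missing values
    observations = observations.dropna()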

    return observations
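
A note on the de-duplication step: `DataFrame.drop_duplicates` returns a new frame unless `inplace=True` is passed, so its result has to be kept for the duplicates to actually disappear. The following is a minimal sketch on a made-up toy frame (the file names and values are illustrative assumptions, not rows from the batches):

```python
import pandas as pd

# Toy frame with one repeated (date, asset_file) pair; all values are made up.
toy = pd.DataFrame({
    "date": ["2012-03-06", "2012-03-06", "2010-09-09"],
    "asset_file": ["a.txt", "a.txt", "b.txt"],
    "label": [0, 0, 2],
})

# Calling the method without keeping the result leaves `toy` unchanged.
toy.drop_duplicates(subset=["date", "asset_file"])
rows_without_assignment = len(toy)

# Keeping the result retains only the first row of each duplicate group.
deduped = toy.drop_duplicates(subset=["date", "asset_file"])
rows_after_assignment = len(deduped)

print(rows_without_assignment, rows_after_assignment)
```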

In [3]:
observations = read_batches("DataBatches2")

In [4]:
observations

Unnamed: 0,asset_file,date,stock,1_month_return,6_month_return,12_month_return,1_month_volatility,6_month_volatility,12_month_volatility,1_month_img,6_month_img,12_month_img,label
0,Data/Stocks\cca.us.txt,2012-03-06,1,-0.004883,0.130125,0.211069,0.005297,0.006436,0.007214,Charts2/img_500_1_month.PNG,Charts2/img_500_6_month.PNG,Charts2/img_500_12_month.PNG,0
1,Data/Stocks\wwr.us.txt,2010-09-09,1,0.326923,-0.103896,-0.295918,0.062183,0.060901,0.057894,Charts2/img_501_1_month.PNG,Charts2/img_501_6_month.PNG,Charts2/img_501_12_month.PNG,2
2,Data/Stocks\apu.us.txt,2014-03-04,1,0.008864,0.033208,0.027071,0.006744,0.012142,0.013019,Charts2/img_502_1_month.PNG,Charts2/img_502_6_month.PNG,Charts2/img_502_12_month.PNG,2
3,Data/Stocks\aste.us.txt,2011-12-05,1,-0.032136,-0.039234,0.034616,0.039193,0.038240,0.030234,Charts2/img_503_1_month.PNG,Charts2/img_503_6_month.PNG,Charts2/img_503_12_month.PNG,2
4,Data/Stocks\dtus.us.txt,2011-11-01,1,-0.005257,-0.100670,-0.118569,0.005065,0.006326,0.007545,Charts2/img_504_1_month.PNG,Charts2/img_504_6_month.PNG,Charts2/img_504_12_month.PNG,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,Data/Stocks\pahc.us.txt,2016-09-27,1,0.303545,0.003591,-0.123365,0.032447,0.027091,0.027717,Charts2/img_9495_1_month.PNG,Charts2/img_9495_6_month.PNG,Charts2/img_9495_12_month.PNG,0
496,Data/Stocks\tlgt.us.txt,2016-07-12,1,0.320127,0.216058,0.165035,0.034619,0.037514,0.040948,Charts2/img_9496_1_month.PNG,Charts2/img_9496_6_month.PNG,Charts2/img_9496_12_month.PNG,0
497,Data/Stocks\irl.us.txt,2006-10-20,1,-0.001629,0.164298,0.460406,0.011995,0.014218,0.012064,Charts2/img_9497_1_month.PNG,Charts2/img_9497_6_month.PNG,Charts2/img_9497_12_month.PNG,2
498,Data/Stocks\spok.us.txt,2008-11-12,1,0.006921,0.286695,-0.237152,0.045302,0.040031,0.036625,Charts2/img_9498_1_month.PNG,Charts2/img_9498_6_month.PNG,Charts2/img_9498_12_month.PNG,2


## Labels

In [5]:
observations.groupby("label").agg(count=("asset_file", "count"))

label,count
0,5951
1,2118
2,6925


The above distribution of labels is in line with our intuition about equity returns. There are slightly more positive returns (label 2) than negative returns (label 0) and a rather small number of neutral returns (label 1).
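
To put numbers on this, the label counts from the table can be turned into shares. This is a small sketch that re-enters the three counts by hand; on the actual frame, `observations["label"].value_counts(normalize=True)` should give the same kind of result.

```python
import pandas as pd

# Label counts copied from the table above
# (0 = negative, 1 = neutral, 2 = positive returns).
label_counts = pd.Series({0: 5951, 1: 2118, 2: 6925}, name="count")

# Share of each label in the full dataset.
label_share = label_counts / label_counts.sum()

print(label_share.round(3))
```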

## Date

In [6]:
observations.date.hist()

[Figure: histogram of observation dates]

From the above histogram it is clear that the majority of observations come from recent years. This is in line with the raw data, which is also skewed towards more recent years. This could affect the model, since recent events (such as the 2008 crisis) could be over-weighted. For now this is left as is, but depending on how the model performs, a new cleaned dataset with more balanced dates could be constructed.
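
If the dates ever need to be rebalanced, one possible approach is to cap the number of observations per calendar year. The sketch below is an assumption about how that could be done, not something this notebook applies; the toy dates, the `cap` value, and the seed are all hypothetical.

```python
import pandas as pd

# Toy dates standing in for observations.date, skewed towards later years.
toy = pd.DataFrame({"date": pd.to_datetime([
    "2006-10-20", "2008-11-12", "2010-09-09", "2011-11-01", "2011-12-05",
    "2012-03-06", "2014-03-04", "2016-07-12", "2016-09-27", "2016-11-03",
])})

# Number of observations per calendar year.
per_year = toy["date"].dt.year.value_counts().sort_index()

# Shuffle the rows, then keep at most `cap` observations per year.
cap = 1  # hypothetical cap, chosen only for this toy example
shuffled = toy.sample(frac=1, random_state=0)
balanced = shuffled.groupby(shuffled["date"].dt.year).head(cap)

print(per_year)
print(len(balanced))
```

Shuffling before `head` makes the rows kept within each year a random subset rather than the first ones encountered.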

## Returns

In [7]:
observations[["1_month_return", "6_month_return", "12_month_return"]].describe()

Unnamed: 0,1_month_return,6_month_return,12_month_return
count,14994.0,14994.0,14994.0
mean,0.009377,0.053186,0.098148
std,0.159049,0.638698,0.800949
min,-0.819417,-0.958691,-0.98801
25%,-0.042918,-0.100952,-0.146749
50%,0.004755,0.024571,0.04002
75%,0.049268,0.143872,0.218156
max,6.0,60.0,49.833333


From this overview, it appears that there are some extreme returns: the maxima are 600% for the 1 month returns, 6000% for the 6 month returns, and roughly 4983% for the 12 month returns. These outliers could make the trained model less generalizable. To prevent this, observations with a return above 300% at any horizon are removed. A 300% return still corresponds to a 4X return over the period, which is highly unlikely but within reasonable bounds.
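
Before filtering, it can help to see how many rows exceed the 300% cap at each horizon. The following sketch does this on a toy frame with made-up values; the names `RETURN_COLUMNS` and `MAX_RETURN` are introduced here purely for illustration.

```python
import pandas as pd

RETURN_COLUMNS = ["1_month_return", "6_month_return", "12_month_return"]
MAX_RETURN = 3  # a 300% return, i.e. the final value is 4x the initial value

# Toy rows; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "1_month_return": [0.01, 6.0, -0.05, 0.30],
    "6_month_return": [0.13, 0.20, 60.0, -0.10],
    "12_month_return": [0.21, 0.50, 0.10, 2.99],
})

# Number of rows above the cap, per horizon.
exceeding = (toy[RETURN_COLUMNS] > MAX_RETURN).sum()

# Keep only rows where every horizon is at or below the cap.
within_cap = toy[(toy[RETURN_COLUMNS] <= MAX_RETURN).all(axis=1)]

print(exceeding)
print(len(within_cap))
```

The `.all(axis=1)` form expresses the same condition as chaining one comparison per column with `&`, and scales more easily if further horizons are added.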

In [8]:
observations = observations[
    (observations["1_month_return"] <= 3) &
    (observations["6_month_return"] <= 3) &
    (observations["12_month_return"] <= 3)
].copy()

In [9]:
len(observations.index)

14902

From the original dataset of 14994 observations, 92 were removed, leaving a dataset of 14902 observations.

## Volatility

In [10]:
# Summary statistics for the 1, 6 and 12 month volatilities
observations[["1_month_volatility", "6_month_volatility", "12_month_volatility"]].describe()

Unnamed: 0,1_month_volatility,6_month_volatility,12_month_volatility
count,14902.0,14902.0,14902.0
mean,0.023548,0.025029,0.079727
std,0.025811,0.024955,6.622723
min,0.0,8.8e-05,9.8e-05
25%,0.010235,0.011742,0.012243
50%,0.017307,0.019248,0.020106
75%,0.02879,0.031279,0.031967
max,1.017118,0.813391,808.481065


Similar to returns, there are some extremely high volatilities, especially in the 12 month feature, whose maximum is about 808. However, these are limited to very few observations and correspond to the remaining observations with very large returns. They are considered acceptable and will not be removed from the dataset.
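
One way to check the link between extreme volatilities and large returns is to compare the rows that rank highest on each. The sketch below uses a toy frame with made-up values and assumes a unique index; since the notebook's frame is concatenated from several batches, its index labels may repeat, in which case calling `reset_index(drop=True)` first would be needed.

```python
import pandas as pd

# Toy rows; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "12_month_return": [0.04, 2.80, -0.15, 0.22, 2.50],
    "12_month_volatility": [0.020, 0.900, 0.012, 0.031, 0.450],
})

# Rows with the highest 12 month volatility.
top_vol = toy.nlargest(2, "12_month_volatility")

# Rows with the largest 12 month returns.
top_ret = toy.nlargest(2, "12_month_return")

# Index labels that appear in both selections.
overlap = top_vol.index.intersection(top_ret.index)

print(sorted(overlap.tolist()))
```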

# Set-up train/validation/test sets

Now the dataset will be split into 50% training, 25% validation, and 25% test sets.

In [11]:
from sklearn.model_selection import train_test_split
import os

observations.drop(
    columns=["asset_file", "date"], # these features will not be taken into account during model training/testing
    inplace=True
)

train, temp = train_test_split(observations, test_size=0.5)
val, test = train_test_split(temp, test_size=0.5)

# Create the output directory if it does not exist yet
os.makedirs("ModelData2", exist_ok=True)

train.to_csv("ModelData2/obs_train.csv", index=False)
val.to_csv("ModelData2/obs_val.csv", index=False)
test.to_csv("ModelData2/obs_test.csv", index=False)
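
The split above is random and unseeded, so each run produces different sets. A possible variant, sketched here as an assumption rather than the method used in this notebook, fixes `random_state` for reproducibility and passes `stratify` so that each set keeps roughly the same label mix. The toy frame and the `SEED` value are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a label mix similar to the one seen earlier; feature values are made up.
toy = pd.DataFrame({
    "1_month_return": [i / 1000 for i in range(400)],
    "label": [0] * 160 + [1] * 56 + [2] * 184,
})

SEED = 42  # hypothetical seed, any fixed integer works

# 50% train, then split the remaining half evenly into validation and test.
train, temp = train_test_split(
    toy, test_size=0.5, stratify=toy["label"], random_state=SEED
)
val, test = train_test_split(
    temp, test_size=0.5, stratify=temp["label"], random_state=SEED
)

print(len(train), len(val), len(test))
```

Stratifying matters most for the neutral label, which is the smallest class and could otherwise end up under-represented in the validation or test set.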