
# Data fixing method using ANN

It is no secret that machine learning practitioners spend most of their time fixing datasets and proxying missing data points. Here we will try to construct an artificial neural network to perform this cleaning task for us.

The dataset that we are going to use is a daily dataset of yield curves obtained from the US treasury website, which is public. I will briefly discuss the function to download and clean the dataset in the following cells. 



In [0]:
from keras.models import Model, save_model, load_model 
from keras.layers import Dense, Input, Dropout
from keras.optimizers import Adam
from keras import regularizers
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import matplotlib as mpl
import urllib.request
import xml.etree.ElementTree as et  # cElementTree was removed in Python 3.9
from copy import deepcopy
import os
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


## Data processing
Before we jump into the development of the network, let's first get some data. 

From https://www.treasury.gov/resource-center/data-chart-center/interest-rates/ we can download yearly files of interest rates observed daily. To do so we will use the **urllib** module. Unfortunately, the downloaded file is in XML format, so we will have to do some parsing to get a working dataset.

In [0]:
link = r"https://www.treasury.gov/resource-center/data-chart-center/interest-rates/pages/XmlView.aspx?data=yieldyear&year="
years = [2000 + y for y in range(2, 19)]  # years to download: 2002 through 2018

Let's quickly define a function to process the XML files.

In [0]:
def read_xml(file):
    # Namespaces used inside the downloaded Atom feed
    atom = "{http://www.w3.org/2005/Atom}"
    meta = "{http://schemas.microsoft.com/ado/2007/08/dataservices/metadata}"
    curve = []
    curve_labels = []
    for node in et.parse(file).getroot():
        content = node.find(atom + "content")
        if content is None:
            continue  # not a data entry, skip it
        props = content.find(meta + "properties")
        if props is None:
            continue
        curve.append([elem.text for elem in props])
        if not curve_labels:
            # Column names come from the tags of the first data entry
            curve_labels = [elem.tag.split('}')[1] for elem in props]
    return curve, curve_labels
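
To make the parsing step concrete, here is a minimal sketch that runs the same namespace lookups on a tiny hand-written feed. The sample XML, including the field name `BC_1MONTH` and its values, is made up for illustration and only mimics the layout the function expects; the real feed carries many more fields per entry.

```python
import xml.etree.ElementTree as et

ATOM = "{http://www.w3.org/2005/Atom}"
META = "{http://schemas.microsoft.com/ado/2007/08/dataservices/metadata}"

# Hypothetical sample: one non-data node (<title>) and one data entry.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <title>sample</title>
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:Id>1</d:Id>
        <d:BC_1MONTH>1.5</d:BC_1MONTH>
      </m:properties>
    </content>
  </entry>
</feed>"""

root = et.fromstring(SAMPLE)
rows, labels = [], []
for node in root:
    content = node.find(ATOM + "content")
    if content is None:          # the <title> node has no <content> child
        continue
    props = content.find(META + "properties")
    if props is None:
        continue
    rows.append([elem.text for elem in props])
    # Strip the "{namespace}" prefix to get plain column names
    labels = [elem.tag.split("}")[1] for elem in props]

print(labels)
print(rows)
```

Note that every value comes back as a string; the conversion to numbers happens later, when the CSV is read back with pandas.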

Next, we define the downloader for the data.

In [0]:
def maybe_download(filename):
    # Download only if the file is not already cached on disk
    if not os.path.exists(filename):
        total = []
        labels = None
        for y in years:
            response = urllib.request.urlopen(link + str(y))
            curve, year_labels = read_xml(response)
            if labels is None:
                labels = year_labels  # column names are taken from the first year
            total.extend(curve)
        pd.DataFrame(total, columns=labels).sort_values(by='Id').to_csv(filename)

maybe_download("dataset.csv")

OK, at this point we should have dataset.csv saved on our drive. Let's load it into a dataframe and deal with the NaNs. For the experiments, training and testing we will use only the rows that contain no NaNs. In addition, we will create a separate dataset from the rows that do contain NaNs, and we will try to fix it using our trained model. We will call this model the Neuro Patcher.

In [0]:
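dataset = pd.read_csv('dataset.csv')
val = dataset.drop(columns=dataset.columns[:3])  # keep only the rate columns
cols = dataset.columns[3:-1]
val = val.values/100                             # percent -> decimal
val[np.isnan(val)] = 0.0                         # mark missing values with 0
zero_rows = list(set(np.where(val==0)[0]))       # rows with at least one zero entry
non_zero_rows = [a for a in range(len(val)) if a not in zero_rows]
train = val[non_zero_rows]                       # complete rows: training and testing
val = val[zero_rows]                             # incomplete rows: to be patched
print(train[1:5])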

[[0.0014 0.0015 0.0018 0.0029 0.0059 0.0099 0.0193 0.0265 0.0332 0.0423
  0.0447 0.0447]
 [0.0015 0.0015 0.0018 0.0026 0.0061 0.0103 0.0199 0.0271 0.034  0.0428
  0.0452 0.0452]
 [0.0014 0.0015 0.0018 0.0027 0.0059 0.01   0.0195 0.0266 0.0335 0.0427
  0.0453 0.0453]
 [0.0015 0.0015 0.0018 0.0026 0.0059 0.01   0.0193 0.0265 0.0334 0.0424
  0.045  0.045 ]]
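
The cell above uses 0 as the marker for a missing value, so a row with a genuinely observed zero rate would also be classed as incomplete. As a hedged alternative, the sketch below splits the rows on the NaN mask itself, before any zero is written. The toy array `raw` is made up for illustration and is not Treasury data.

```python
import numpy as np

# Toy stand-in for the rate matrix: 3 days x 3 tenors, in percent.
raw = np.array([[1.0, 2.0, 3.0],
                [np.nan, 2.1, 3.1],
                [0.0, 2.2, 3.2]])

rates = raw / 100                       # percent -> decimal
has_nan = np.isnan(rates).any(axis=1)   # True for rows with a missing value

complete = rates[~has_nan]              # candidate training rows
incomplete = rates[has_nan]             # rows to patch later

print(complete.shape, incomplete.shape)
```

Here the third row, with its observed 0.0, stays in the complete set. Whether that is desirable depends on the data: since the network is trained to read 0 as "damaged", a true zero rate in the training rows would be ambiguous, which may be a reason to keep the notebook's stricter filter.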


Now we need to prepare the training dataset. Let's assume the following structure:


*   $I_n = \{\hat{R}_t(T_0),...,\hat{R}_t(T_N) \}$
*   $O_n = \{R_t(T_0),...,R_t(T_N) \}$

where $I_n, O_n$ are inputs and outputs respectively. $R_t(T_n)$ is the yield curve value at time $t$ and tenor $T_n$. $\hat{R}$ means that this component can be damaged (set to 0).


For this purpose we will create a quick function that takes yield curves as input, together with the probability `prob` with which each element of the input is damaged.

In [0]:
def create_batch(input_data, prob=0.3):
  N,M = input_data.shape
  inputs = np.zeros((N,M))
  labels = np.zeros((N,M))
  inputs[:] = input_data
  labels[:] = input_data
  mask = np.random.choice([False, True], (N,M), p=[1-prob, prob])
  inputs[mask] = 0
  return inputs,labels

inputs, labels = create_batch(train[1:2])
for i in range(len(inputs[0])):
  print(inputs[0][i],'->', labels[0][i])


0.0014000000000000002 -> 0.0014000000000000002
0.0 -> 0.0015
0.0018 -> 0.0018
0.0029 -> 0.0029
0.0 -> 0.0059
0.0 -> 0.009899999999999999
0.019299999999999998 -> 0.019299999999999998
0.0265 -> 0.0265
0.0332 -> 0.0332
0.0 -> 0.042300000000000004
0.0 -> 0.0447
0.0447 -> 0.0447
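
Because every element is damaged independently with probability `prob`, the number of zeroed tenors varies from curve to curve, while the overall damaged fraction should be close to `prob`. A minimal sketch to check this with a seeded generator follows; the function `damage` and its `seed` parameter are hypothetical helpers, not part of the notebook.

```python
import numpy as np

def damage(curves, prob=0.3, seed=0):
    """Return (damaged copy, clean copy, mask) for an array of curves."""
    rng = np.random.default_rng(seed)
    mask = rng.random(curves.shape) < prob   # True where a value is zeroed
    damaged = np.where(mask, 0.0, curves)
    return damaged, curves.copy(), mask

# 1000 flat toy curves with 12 tenors each (made-up values)
curves = np.full((1000, 12), 0.02)
damaged, clean, mask = damage(curves)

print(round(mask.mean(), 3))   # overall damaged fraction, close to prob
print(mask.sum(axis=1)[:5])    # damaged tenors per curve varies
```

Seeding the generator makes the damaged samples reproducible, which can help when comparing different network configurations on the same data.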


Finally, it's time to assemble our model. First, let's sample some training data.

In [0]:
X = []
y = []
n,m = train.shape

dataset_max_size = 100000
for i in range(dataset_max_size//n):
  temp_x,temp_y = create_batch(train)
  X.extend(temp_x.tolist())
  y.extend(temp_y.tolist()) 
X = np.array(X)
y = np.array(y)  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)

Let's set up the layers.

In [0]:
lr=0.0001
inp_layer = Input(shape=(m,))
lnn1 = Dense(m+2, activation='relu')(inp_layer)
lnn2 = Dense(m+2, activation='relu')(lnn1)
lnn3 = Dense(m+2, activation='relu')(lnn2)
output = Dense(m, activation='linear')(lnn3)

Define the model

In [0]:
regressor = Model(inp_layer, output)
regressor.compile(optimizer=Adam(lr=lr), loss='mean_squared_error')
regressor.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 12)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 14)                182       
_________________________________________________________________
dense_2 (Dense)              (None, 14)                210       
_________________________________________________________________
dense_3 (Dense)              (None, 14)                210       
_________________________________________________________________
dense_4 (Dense)              (None, 12)                180       
Total params: 782
Trainable params: 782
Non-trainable params: 0
_________________________________________________________________
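
The parameter counts in the summary can be checked by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. A short sketch of that arithmetic, assuming `m = 12` tenors as shown in the summary:

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

m = 12                                    # number of tenors, per the summary
sizes = [m, m + 2, m + 2, m + 2, m]       # input, three hidden layers, output
per_layer = [dense_params(a, b) for a, b in zip(sizes, sizes[1:])]
total = sum(per_layer)

print(per_layer, total)   # [182, 210, 210, 180] 782
```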


Train the model

In [0]:
regressor.fit(x=X_train, y=y_train, shuffle=True, epochs=5000, batch_size=2058, verbose=0)

<keras.callbacks.History at 0x7fc9e4bc0278>

Test score:

In [0]:
score = regressor.evaluate(X_test,y_test, batch_size = 1024)
print('Test error:', score)

Test error: 7.509482618667786e-07
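
To put this number in context: the rates are stored as decimals, so the square root of the mean squared error is roughly 8.7 basis points. That average runs over all tenors, including those that were left undamaged in the input, so it may be more informative to also measure the error on the damaged entries alone. The sketch below shows one way to do that; `masked_rmse` is a hypothetical helper and the toy arrays are made up.

```python
import numpy as np

def masked_rmse(y_true, y_pred, x_in):
    """RMSE restricted to the entries that were zeroed in the input."""
    damaged = (x_in == 0)
    return float(np.sqrt(np.mean((y_true[damaged] - y_pred[damaged]) ** 2)))

mse = 7.509482618667786e-07       # test error reported above
rmse_bp = np.sqrt(mse) * 1e4      # 1 basis point = 1e-4 in decimal terms
print(round(rmse_bp, 1))

# Toy check with made-up numbers: two damaged entries, each off by 0.001
y_true = np.array([[0.01, 0.02], [0.03, 0.04]])
x_in = np.array([[0.01, 0.0], [0.0, 0.04]])
y_pred = np.array([[0.01, 0.021], [0.029, 0.04]])
toy = masked_rmse(y_true, y_pred, x_in)
print(toy)
```

With the notebook's objects, the call would presumably be `masked_rmse(y_test, regressor.predict(X_test), X_test)`.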


In [0]:
validation = regressor.predict(X_test)

# Pick a random test curve
pos = np.random.choice(len(validation))

# Keep the observed inputs and fill the damaged (zero) entries with predictions
patched = np.array(X_test[pos])
patched[patched==0] = validation[pos, patched==0]

plt.plot(validation[pos],'y')
plt.plot(y_test[pos],'b-.')
plt.plot(X_test[pos],'g*')
plt.plot(patched,'co')
plt.legend([ 'predicted','actual', 'input', 'patched'])
plt.show()
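
The same patching step can be applied to whole arrays at once, for example to the rows with real missing values that were set aside earlier. Below is a minimal sketch under that assumption; `patch_rows` and `dummy_predict` are hypothetical names, and the dummy predictor only stands in for the trained network so that the example runs on its own.

```python
import numpy as np

def patch_rows(predict, rows):
    """Fill zero entries of `rows` with predictions; keep observed values."""
    rows = np.asarray(rows, dtype=float)
    predicted = np.asarray(predict(rows))
    return np.where(rows == 0, predicted, rows)

# Stand-in predictor for illustration only: always predicts 2%.
def dummy_predict(x):
    return np.full_like(x, 0.02)

demo = np.array([[0.01, 0.0, 0.03]])
patched_demo = patch_rows(dummy_predict, demo)
print(patched_demo)
```

With the notebook's objects, one would pass `regressor.predict` and the incomplete rows `val` instead of the stand-ins. Unlike the test set, those rows have no ground truth, so the result can only be judged by plausibility, for example by plotting the patched curves.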