## Getting raw data from the Mercury FTP server
Download the raw data and save it at /home/ubuntu/MLMortgage/data/raw/[raw_directory].
For this example, [raw_directory]='chuncks_random_c1mill'. If the directory does not exist, the wget command will create it.
You can replace the example file 'temporalloandynmodifmrstaticitu-random-1mill-2mill.txt' with the desired file.
Note: This works only on Ubuntu.

In [None]:
%cd

In [None]:
!wget --ftp-user=machinelearning --ftp-password=Mdje7i3739# ftp://mercury.vichara.co.uk/temporalloandynmodifmrstaticitu-random-1mill-2mill.txt -P /home/ubuntu/MLMortgage/data/raw/chuncks_random_c1mill/
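
The directory layout described above can also be prepared from Python before downloading. The following is a minimal sketch, not part of the project code: the helper name `prepare_raw_target` and its arguments are hypothetical, and it only builds the `raw/[raw_directory]` folder and returns the path where the downloaded file is expected to land.

```python
from pathlib import Path


def prepare_raw_target(data_root, raw_directory, filename):
    """Create data_root/raw/raw_directory if needed and return the file path.

    Hypothetical helper: the notebook itself hard-codes this path in the
    wget command.
    """
    target_dir = Path(data_root) / "raw" / raw_directory
    # parents=True builds missing intermediate folders; exist_ok=True makes
    # the call safe to repeat when the folder is already there.
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / filename


# Example call (paths taken from the text above):
# prepare_raw_target("/home/ubuntu/MLMortgage/data",
#                    "chuncks_random_c1mill",
#                    "temporalloandynmodifmrstaticitu-random-1mill-2mill.txt")
```

Checking that the returned path exists after the download is a quick way to confirm that wget wrote the file where the later cells expect it.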

## Beginning Steps

In [None]:
import sys
import os
import pandas as pd
from pathlib import Path
from datetime import datetime
import argparse
import psutil

nb_dir = os.path.join(Path(os.getcwd()).parents[0], 'src', 'data')
if nb_dir not in sys.path:
    sys.path.insert(0, nb_dir)
# print(sys.path)
import features_selection as fs
import make_dataset as md
import build_data as bd
import get_raw_data as grd
import data_classes
import glob

models_dir = os.path.join(Path(os.getcwd()).parents[0], 'src', 'models')
if models_dir not in sys.path:
    sys.path.insert(0, models_dir)
import nn_real as nn


In [None]:
RAW_DIR = os.path.join(Path(os.getcwd()).parents[0], 'data', 'raw') 
PRO_DIR = os.path.join(Path(os.getcwd()).parents[0], 'data', 'processed')

print(RAW_DIR, PRO_DIR)

## Preprocessing  
From the console, you can run:

#### $ cd /home/ubuntu/MLMortgage/src/data

#### $ python build_data.py --prepro_step=preprocessing --prepro_dir=chuncks_random_c1mill --prepro_chunksize=500000 --train_period 121 143 --valid_period 144 147 --test_period 148 155


For this example, the raw file is read from the 'data/raw/chuncks_random_c1mill' directory and the processed file is saved to 'data/processed/chuncks_random_c1mill', following the folder name given in the --prepro_dir parameter. The periods define the training, validation and testing ranges. prepro_chunksize sets how many records are processed per block, so the data is handled in chunks instead of record by record. In this implementation the preprocessed file is saved in .h5 format, because the format supports compression and can hold the training, validation and testing datasets in a single file.
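
To illustrate the idea of chunked processing with period ranges, here is a minimal standard-library sketch. It is not the logic of build_data.py, which is not shown in this notebook (the cell below even passes dividing='percentage'). The function names, the `(period, payload)` row layout and the inclusive bounds are all assumptions made for the example.

```python
from itertools import islice


def iter_chunks(rows, chunksize):
    """Yield lists of at most `chunksize` rows from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return
        yield chunk


def split_by_period(chunk, train_period, valid_period, test_period):
    """Route each (period, payload) row to train, valid or test.

    Assumption: periods are inclusive [start, end] pairs; rows that fall
    outside all three ranges are dropped.
    """
    bounds = {"train": train_period, "valid": valid_period, "test": test_period}
    parts = {name: [] for name in bounds}
    for row in chunk:
        period = row[0]
        for name, (start, end) in bounds.items():
            if start <= period <= end:
                parts[name].append(row)
                break
    return parts


# Toy data: one row per period, processed ten rows at a time.
rows = [(p, "loan-%d" % p) for p in range(120, 157)]
for chunk in iter_chunks(rows, chunksize=10):
    parts = split_by_period(chunk, (121, 143), (144, 147), (148, 155))
    print({name: len(part) for name, part in parts.items()})
```

Processing in chunks keeps memory use bounded by the chunk size rather than by the size of the raw file, which is the motivation for a parameter such as prepro_chunksize.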

The following cells do the same as the console command:

In [None]:
FLAGS, UNPARSED = bd.update_parser(argparse.ArgumentParser())    
# These are the most important preprocessing parameters:
FLAGS.prepro_dir='chuncks_random_c1mill' # the same directory name is used under both 'raw' and 'processed'.
FLAGS.prepro_chunksize=500000 
FLAGS.train_period=[121,279] #[121, 143] 
FLAGS.valid_period=[280,285] #[144, 147] 
FLAGS.test_period=[286,304] #[148, 155]
                                                
print(FLAGS)    

In [None]:
glob.glob(os.path.join(RAW_DIR, FLAGS.prepro_dir,"*.txt"))

In [None]:
startTime = datetime.now()
# Create the output directory if it does not exist yet.
os.makedirs(os.path.join(PRO_DIR, FLAGS.prepro_dir), exist_ok=True)
bd.allfeatures_preprocessing(
    RAW_DIR, PRO_DIR, FLAGS.prepro_dir,
    FLAGS.train_period, FLAGS.valid_period, FLAGS.test_period,
    dividing='percentage', chunksize=FLAGS.prepro_chunksize,
    refNorm=FLAGS.ref_norm, with_index=FLAGS.prepro_with_index,
    output_hdf=True)
print('Preprocessing - Time: ', datetime.now() - startTime)

In [None]:
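# `!cd` runs in a temporary subshell and does not persist; the %cd magic changes the notebook's working directory.
%cd /home/ubuntu/MLMortgage/src/data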

In [None]:
# 1st h5 file:
!python build_data.py --prepro_step=slicing --slice_input_dir=chuncks_random_c1mill --slice_output_dir chuncks_random_c1millx2_train chuncks_random_c1millx2_valid chuncks_random_c1millx2_test --slice_tag train valid test --slice_target_name 1-1mill_cs1200_train 1-1mill_cs1200_valid 1-1mill_cs1200_test --slice_target_size=36000000 --slice_index=0

In [None]:
# 2nd. h5 file:
!python build_data.py --prepro_step=slicing --slice_input_dir=chuncks_random_c1mill --slice_output_dir chuncks_random_c1millx2_train chuncks_random_c1millx2_valid chuncks_random_c1millx2_test --slice_tag train valid test --slice_target_name 2mill-3mill_cs1200_train 2mill-3mill_cs1200_valid 2mill-3mill_cs1200_test --slice_target_size=36000000 --slice_index=1
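
The two slicing commands differ only in the target-name prefix and the slice index, so they can be generated rather than typed by hand. The sketch below assumes nothing about build_data.py beyond the flags that appear in the commands above. The helper name `slicing_command` and its defaults are hypothetical.

```python
def slicing_command(slice_index, target_prefix,
                    input_dir="chuncks_random_c1mill",
                    output_prefix="chuncks_random_c1millx2",
                    target_size=36000000):
    """Build the argument list for one slicing run (hypothetical helper)."""
    tags = ["train", "valid", "test"]
    cmd = ["python", "build_data.py",
           "--prepro_step=slicing",
           "--slice_input_dir=" + input_dir,
           "--slice_output_dir"]
    # One output directory and one target name per tag.
    cmd += ["%s_%s" % (output_prefix, tag) for tag in tags]
    cmd += ["--slice_tag"] + tags
    cmd += ["--slice_target_name"]
    cmd += ["%s_%s" % (target_prefix, tag) for tag in tags]
    cmd += ["--slice_target_size=%d" % target_size,
            "--slice_index=%d" % slice_index]
    return cmd


# Print the command for the second h5 file.
print(" ".join(slicing_command(1, "2mill-3mill_cs1200")))
```

The returned list can be passed to `subprocess.run(cmd, cwd="/home/ubuntu/MLMortgage/src/data", check=True)`, which runs the script from the right directory without depending on the notebook's current working directory.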

## Training
From the console, execute:

#### $ cd /home/ubuntu/MLMortgage/src/models

#### $ python nn_real.py --train_dir=chuncks_random_c1mill --valid_dir=chuncks_random_c1mill --test_dir=chuncks_random_c1mill --logdir=/home/ubuntu/real_summaries_4425_-15ep_99-01/ --epoch_num=15 --max_epoch_size=-1 --batch_size=4425

This run trains for 15 epochs over the entire dataset (max_epoch_size=-1). The training, validation and testing datasets are all read from the same directory, /home/ubuntu/MLMortgage/data/processed/chuncks_random_c1mill/.

The checkpoints and model results are saved to the directory given by logdir, for example /home/ubuntu/real_summaries_4425_-15ep_99-01/. You can change it by modifying the FLAGS.logdir variable.
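
To get a feel for how epoch_num, max_epoch_size and batch_size interact, the following sketch estimates the number of batches per epoch. It is an illustration only: the text above states only that max_epoch_size=-1 means the entire dataset, so treating a positive value as a cap on records per epoch is an assumption, and the record count used in the example is made up.

```python
import math


def steps_per_epoch(num_records, batch_size, max_epoch_size=-1):
    """Estimate batches per epoch (hypothetical helper, not from nn_real.py)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if max_epoch_size == -1:
        # -1 means: use every record in the dataset.
        records = num_records
    else:
        # Assumption: a positive value caps the records seen per epoch.
        records = min(num_records, max_epoch_size)
    # The last batch may be smaller than batch_size, hence the ceiling.
    return math.ceil(records / batch_size)


# Made-up dataset size, with the batch size and epoch count from the command above.
steps = steps_per_epoch(num_records=1000000, batch_size=4425)
print("steps per epoch:", steps, "total steps:", steps * 15)
```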

To execute step by step:


In [None]:
import tensorflow as tf

FLAGS, UNPARSED = nn.update_parser(argparse.ArgumentParser())
print("UNPARSED", UNPARSED)
FLAGS.logdir = Path('/home/ubuntu/real_summaries4425-15ep_test/')
# Create the log directory if it does not exist yet.
os.makedirs(FLAGS.logdir, exist_ok=True)
FLAGS = nn.FLAGS_setting(FLAGS, 1)
FLAGS.train_dir = 'chuncks_random_c1millx2_train'
FLAGS.valid_dir = 'chuncks_random_c1millx2_valid'
FLAGS.test_dir = 'chuncks_random_c1millx2_test'
FLAGS.train_period=[121,279] #[121, 143] 
FLAGS.valid_period=[280,285] #[144, 147] 
FLAGS.test_period=[286,304] #[148, 155]
FLAGS.epoch_num=15 
FLAGS.max_epoch_size=-1 
FLAGS.batch_size=4425
print("FLAGS", FLAGS) #you can change the FLAGS by adding the setting before this line.

In [None]:
DATA = md.get_h5_data(PRO_DIR, FLAGS.train_dir, FLAGS.valid_dir, FLAGS.test_dir, train_period=FLAGS.train_period, valid_period=FLAGS.valid_period, test_period=FLAGS.test_period) 
print('Features List: ', DATA.train.features_list)
print('Labels List: ', DATA.train.labels_list)

In [None]:
FLAGS.log_file.write('METRICS:  %s\r\n' % str(FLAGS))
FLAGS.log_file.write('training files:  %s\r\n' % str(DATA.train._dict))
# print('training files:  %s\r\n' % str(DATA.train._dict))
FLAGS.log_file.write('validation files:  %s\r\n' % str(DATA.validation._dict))
# print('validation files:  %s\r\n' % str(DATA.validation._dict))
FLAGS.log_file.write('testing files:  %s\r\n' % str(DATA.test._dict))        
# print('testing files:  %s\r\n' % str(DATA.test._dict))     

In [None]:
print('Training features - Sample', DATA.train._dict[0]['dataset_features'][0:100]) # widen the slice to print more records

In [None]:
architecture = nn.architecture_settings(DATA, FLAGS)
print('RAM before build: ', psutil.virtual_memory()) #  physical memory usage
FLAGS.log_file.write('RAM  before build: %s\r\n' % str(psutil.virtual_memory()))
graph = nn.build_graph(architecture, FLAGS)        
print('RAM after build', psutil.virtual_memory()) #  physical memory usage
FLAGS.log_file.write('RAM  after build: %s\r\n' % str(psutil.virtual_memory()))
nn.run_model(graph, 'testing_data', 1,  FLAGS, DATA)      