Insight Data Engineering Coding Challenge

Introduction and background

This challenge assumes the author is a data engineer working at a financial institution that analyzes real-time stock market data. To determine the best trading strategy, the company's data scientists created a machine learning model to predict the future price of a stock every hour, and they want to test it on real-time stock data.

The task of the data engineer is to help test how accurate the predictions are over time by comparing them with newly arriving real-time stock prices.

Challenge details

This repo reads two input files: one provides the actual value of each stock every hour, and the other lists the predicted value of various stocks at certain hours during the same time period. By default both are in the input folder; users can also specify other paths in the run.sh bash file. The input folder also contains a window.txt file, which holds a single integer denoting the window size (in hours) for calculating the average error.

The output file reports the average error, that is, the average difference between the actual stock prices and the predicted values over a specified sliding time window. Results are written to output/comparison.txt. The output path can also be changed in the run.sh bash file.

Calculation

To compute the average error in a given time window, we find the absolute difference between the actual and predicted values of every matched stock and time pair, and call this result error. We then compute the average of all errors within a certain time window to get the average error.
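The calculation above can be sketched as follows. This is a minimal illustration, not the code in this repo: the function name `average_error`, the dictionary layout keyed by `(hour, stock)`, and returning `None` for a window with no matched pairs are all assumptions made for the example, and the prices are made up.

```python
def average_error(actual, predicted, start, end):
    """Mean absolute error over matched (hour, stock) pairs with start <= hour <= end.

    `actual` and `predicted` map (hour, stock) -> price.
    Returns None when no pair matches in the window (an assumption of this sketch).
    """
    errors = [
        abs(actual[key] - price)
        for key, price in predicted.items()
        if start <= key[0] <= end and key in actual
    ]
    return sum(errors) / len(errors) if errors else None


# Made-up sample data: (2, "CCC") has no actual price, so it is skipped.
actual = {(1, "AAA"): 10.0, (1, "BBB"): 20.0, (2, "AAA"): 11.0}
predicted = {(1, "AAA"): 10.5, (2, "AAA"): 10.0, (2, "CCC"): 5.0}

print(average_error(actual, predicted, 1, 2))  # errors 0.5 and 1.0 -> 0.75
```

Only predictions that have a matching actual price for the same stock and hour contribute to the average; unmatched entries on either side are ignored.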

Input files

Lines in both input files, actual.txt and predicted.txt, are listed in chronological order. Both contain the following pipe-delimited fields:

time: An integer greater than 0 designating the hour.
stock: ID of a stock
price: Price of a given stock at the given hour.

The file, window.txt, will contain one value (an integer greater than 0), specifying the size of the sliding window in hours.
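As an illustration of the format described above, the following sketch parses pipe-delimited lines into a dictionary and reads a window value. The helper name `read_prices`, the in-memory sample data, and the choice to skip blank lines are assumptions for this example and may differ from what src/prediction-validation.py does.

```python
import io


def read_prices(lines):
    """Parse 'time|stock|price' lines into a dict keyed by (hour, stock)."""
    prices = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines (assumption of this sketch)
        hour, stock, price = line.split("|")
        prices[(int(hour), stock)] = float(price)
    return prices


# Made-up sample standing in for actual.txt or predicted.txt.
sample = io.StringIO("1|AAA|10.00\n1|BBB|20.50\n2|AAA|11.25\n")
prices = read_prices(sample)

# window.txt holds a single positive integer.
window = int(io.StringIO("2\n").read().strip())

print(prices[(2, "AAA")], window)
```

With real files, the `io.StringIO` objects would be replaced by `open(path)` handles.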

Output file

The program creates an output file named comparison.txt where each line has the following pipe-delimited fields:

Starting hour of the time window
Ending hour of the time window
Average error, rounded to 2 decimal places.
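A sketch of how such output lines could be produced is shown below. It assumes the window advances one hour at a time and that a window with no matched pairs is written as `NA`; neither detail is stated in this README, and the function name `comparison_lines` and the sample errors are hypothetical.

```python
def comparison_lines(errors_by_hour, first_hour, last_hour, window):
    """Build 'start|end|average' lines for each sliding window.

    `errors_by_hour` maps hour -> list of absolute errors observed in that hour.
    """
    lines = []
    for start in range(first_hour, last_hour - window + 2):
        end = start + window - 1
        errors = [
            e
            for hour in range(start, end + 1)
            for e in errors_by_hour.get(hour, [])
        ]
        if errors:
            avg = "{:.2f}".format(sum(errors) / len(errors))
        else:
            avg = "NA"  # placeholder for an empty window (assumption)
        lines.append("{}|{}|{}".format(start, end, avg))
    return lines


# Made-up absolute errors for hours 1 to 3, with a 2-hour window.
errors_by_hour = {1: [0.5, 0.25], 2: [1.0], 3: [0.75, 0.25]}
lines = comparison_lines(errors_by_hour, 1, 3, 2)
print("\n".join(lines))
# 1|2|0.58
# 2|3|0.67
```

Note that the average is taken over all individual errors in the window, not over per-hour averages, which matches the definition given in the Calculation section.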

How to run

To run this program, simply navigate to the folder of this repo on your local machine, and type ./run.sh. The results will appear in the output folder once the process is over.

Dependencies

Python 3.6

You only need native Python to run this program. It should also work in Python 3.5 and Python 3.7. To make sure Python 3 is installed in your command line environment, type python3 in the command line and check that the Python interpreter starts correctly.

Advanced usage

In the run.sh file, you can find the following command:

python3 ./src/prediction-validation.py -a ./input/actual.txt -p ./input/predicted.txt -w ./input/window.txt  -o ./output/comparison.txt

Here is what each argument to ./src/prediction-validation.py means:

-a: the path to the file containing actual values. REQUIRED
-p: the path to the file containing predicted values. REQUIRED
-w: the path to the file containing the window value. REQUIRED
-o: the path to the output file. OPTIONAL. If not specified, the default value is ./output/comparison.txt.

By modifying the keyword arguments within the run.sh file, you can customize the paths to those files mentioned above.
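For readers who want to see how flags like these are typically declared, here is a minimal sketch using the standard-library `argparse` module. It is an assumption that the script uses `argparse` at all; the `dest` names and help strings are hypothetical and chosen only for this example.

```python
import argparse


def build_parser():
    """Declare the four flags described above (names after `dest=` are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Compare predicted and actual stock prices."
    )
    parser.add_argument("-a", dest="actual", required=True,
                        help="path to the file containing actual values")
    parser.add_argument("-p", dest="predicted", required=True,
                        help="path to the file containing predicted values")
    parser.add_argument("-w", dest="window", required=True,
                        help="path to the file containing the window value")
    parser.add_argument("-o", dest="output", default="./output/comparison.txt",
                        help="path to the output file")
    return parser


# Parse an explicit argument list so the example runs without a command line.
args = build_parser().parse_args(
    ["-a", "./input/actual.txt", "-p", "./input/predicted.txt",
     "-w", "./input/window.txt"]
)
print(args.output)  # falls back to the default because -o was omitted
```

Omitting any of the three required flags makes `argparse` print a usage message and exit, which mirrors the REQUIRED/OPTIONAL split listed above.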

Author

Yuanming (Jeremy) Shi, Institute for Artificial Intelligence, University of Georgia

For the original challenge repo, see https://github.com/InsightDataScience/prediction-validation
