This project is the beta version of the "Central Intelligence Platform" designed by me. The platform is built for stock trading and money management.
Latest commit dc13326 Aug 10, 2019

Applied Data Science in Stock Market

This repository provides an analysis of applied data science in the stock market.


This project requires the reader to have a broad range of knowledge, including but not limited to (1) financial accounting, (2) time-series analysis, (3) predictive modeling, (4) coding in R, (5) app design with software packages such as R Shiny, and (6) parallel computing using shell scripts.


Project summary: What is tomorrow's stock price? In the big data era, what search technique can we use to extract useful information and minimize prediction error in a regression problem? This project studies price action in capital markets as a random walk, starting from limit theorems. Through a clear construction, we derive algorithms from a series of theorems to create standardized buy signals given a trader's committed frequency of market participation. Using the processed data, we apply an influence measure, the I-score, to select robust stock clusters for portfolio construction. Simulation results show that, under the same risk profile, a $1,000 initial investment returns $5,000, while over the same period the S&P 500 returns less than $1,500. Empirical evidence shows a 97% error reduction on average.

Mathematical Model

Lo et al. (2002) introduced a non-parametric statistic that measures the predictivity of a cluster of variables for a given data set in a discrete framework. After reading the dissertations of Huang (2004) and Hsu (2014), we adopted an extension of their methodology to measure predictivity in a continuous framework.
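To make the discrete case concrete, here is a minimal sketch of an influence score of this kind. It is written in Python for illustration only (the project itself is coded in R) and is not taken from the repository. It assumes the commonly cited unnormalized form, in which the observations are partitioned into cells by the joint values of the selected discrete variables and the score sums n_j^2 * (cell mean - overall mean)^2 over the cells. The function name `i_score` and the toy data are hypothetical, and published versions usually rescale this sum.

```python
from collections import defaultdict
from statistics import mean


def i_score(rows, y):
    """Unnormalized influence score for a cluster of discrete variables.

    rows: list of tuples, one tuple of discrete feature values per observation
    y:    list of numeric responses, same length as rows
    Assumption: score = sum over partition cells of n_j^2 * (cell mean - overall mean)^2.
    """
    y_bar = mean(y)
    cells = defaultdict(list)
    # Observations that share the same joint feature values fall in the same cell.
    for x, yi in zip(rows, y):
        cells[tuple(x)].append(yi)
    return sum(len(v) ** 2 * (mean(v) - y_bar) ** 2 for v in cells.values())


# Toy data: the first variable separates the responses, the second does not.
y = [0, 0, 1, 1]
informative = [(0,), (0,), (1,), (1,)]
uninformative = [(0,), (1,), (0,), (1,)]
print(i_score(informative, y), i_score(uninformative, y))
```

A variable cluster whose cells have means far from the overall mean scores high, while a cluster that splits the data into cells with the same mean scores zero.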

The following graph, taken from Hsu (2014), illustrates how nearest neighborhoods are used to measure the local mean in the predictivity score.
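The idea of a nearest-neighbor local mean can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the formula from Hsu (2014) and not code from this repository. It assumes Euclidean distance, a neighborhood of the k closest points (including the point itself), and a score equal to the average squared deviation of the local means from the overall mean. The names `knn_local_means` and `knn_score` and the toy data are hypothetical.

```python
import math
from statistics import mean


def knn_local_means(X, y, k):
    """Mean response over each point's k nearest neighbors (the point itself included)."""
    local = []
    for xi in X:
        # Rank all observations by Euclidean distance to xi.
        order = sorted(range(len(X)), key=lambda j: math.dist(xi, X[j]))
        local.append(mean(y[j] for j in order[:k]))
    return local


def knn_score(X, y, k):
    """Assumed score: average squared gap between local means and the overall mean."""
    y_bar = mean(y)
    return mean((m - y_bar) ** 2 for m in knn_local_means(X, y, k))


# Toy data: two tight groups of points on a line.
X = [(0.0,), (0.1,), (1.0,), (1.1,)]
print(knn_score(X, [0, 0, 1, 1], k=2))  # responses follow the groups
print(knn_score(X, [0, 1, 0, 1], k=2))  # responses unrelated to the groups
```

When nearby points share similar responses, the local means move away from the overall mean and the score rises. When the responses are unrelated to position, the local means collapse toward the overall mean.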

Performance and Results

We report a 97% average error reduction on a held-out test set across the 30 component stocks of the Dow Jones 30. Below we present a sample of selected test-set results for MMM under two approaches: (1) a time-series ARMA model, and (2) the I-score as a feature selection method applied before regression.
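For readers who want to reproduce this kind of comparison, the sketch below shows one way an error reduction between a baseline and a candidate model could be computed on a held-out set. This README does not state which error metric lies behind the 97% figure, so mean squared error is an assumption here. The function names and the numbers are illustrative only and are not results from this project.

```python
def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def error_reduction(actual, baseline_pred, candidate_pred):
    """Fraction of the baseline's MSE removed by the candidate model."""
    return 1.0 - mse(actual, candidate_pred) / mse(actual, baseline_pred)


# Made-up held-out values and predictions, for illustration only.
actual = [1.0, 2.0, 3.0]
baseline = [2.0, 3.0, 4.0]   # stands in for a baseline such as ARMA
candidate = [1.5, 2.5, 3.5]  # stands in for regression after feature selection
print(error_reduction(actual, baseline, candidate))
```

A value of 0.97 under this definition would mean the candidate's held-out MSE is 3% of the baseline's.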


Yiqiao Yin is the designated presenter for this presentation. He will mainly use the Presentation Slides as the main material. For detailed reference, we invite our audience to read the paper on the research site. The paper is also uploaded as a zip file in the doc folder of the GitHub repository.

Shiny App

We also built a platform as a Shiny App, which serves as a supplement to the paper and the presentation. Because the Shiny server is slow when executing code that downloads live data, we present only limited information in the Shiny App. The app can be accessed here.

R Notebook

In addition to the files above, we also provide an R notebook and an image. The R notebook loads the RData files saved in the doc folder, and the script then produces graphs such as the following. Like the R Shiny App above, the R notebook is one of several supplements to the presentation materials.

About Author

This is Yiqiao Yin, a graduate student in statistics at Columbia University.

  • I lead a team of analysts at Columbia University, all from the Master in Statistics program offered by the Department of Statistics.
  • Team members are Anke, Peilin, and Chuqiao.


This project provides a fair and robust analysis for predicting security prices. However, money management is more art than science, and we have not disclosed our game-planning strategy for risk management. Hence, this project does not serve as investment advice, nor are we responsible for any monetary losses resulting from investment decisions made by any audience. The risk of managing money in your investments is solely your responsibility.

Special thanks to Shaw-Hwa Lo for his work on the non-parametric statistic, the I-score.
