# Stock Market Time Series Analysis and Predictive Modeling 

By Christian Zachariah



<img alt="Illustration for a stock price prediction using machine learning project" src="https://dezyre.gumlet.io/images/blog/stock-price-prediction-using-machine-learning-project/image_845932577291646715294687.png?w=768&dpr=1.5" width="75%">

Image from: https://www.projectpro.io/article/stock-price-prediction-using-machine-learning-project/571

---
### Abstract

Financial markets are complex systems that are influenced by a number of factors. Because of this, making accurate stock price predictions is a challenging task. Despite this, quant firms such as Jane Street Capital and Two Sigma leverage powerful trading algorithms to allow for highly accurate stock market predictions. 

In an effort to understand how these quant firms see such unparalleled levels of success, I aim to explore tools such as TensorFlow and Python Numba to make stock price predictions based on historic stock price data. I aim to benchmark the model's accuracy, training time between GPU-accelerated TensorFlow and CPU-based training, inference speed, and computational improvements through Python Numba optimizations.

I hope to accelerate feature calculations such as moving averages and standard deviations using Numba, model training using TensorFlow's GPU capabilities, and prediction computation. A successful outcome from this project would be developing a model with 50% or greater accuracy that outperforms baseline methods. A demonstration of a significant reduction in computation time for model training and prediction would also be considered a success.


----
### Schedule

* Sunday February 9 - Project Proposal Milestone Due
* Week of February - Collect historic stock price data
* Week of February - Have data cleaned and missing values handled
* Week of February - Conduct exploratory data analysis to visualize trends
* Sunday February 26 - Project Part 1 Due
* Week of March - Implement baseline models (Moving average)
* Week of March - Develop ML models (Random Forest)
* Week of March - Evaluate model performance (MAE, R^2)
* Week of March - Use best-performing model to create simple trading strategy
* Week of April - Test strategy on historic data, compare returns to benchmark
* Week of April - Document findings, challenges, and room for research and improvement
* April 15 - Final Project due
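
The moving-average baseline and the MAE and R^2 metrics named in the schedule can be sketched in a few lines of plain Python. This is only a minimal illustration: the function names, the window size, and the price list are assumptions made up for the example, not results or code from the project.

```python
from statistics import mean


def moving_average_forecast(prices, window):
    """Predict each price as the mean of the previous `window` prices."""
    if window < 1 or window >= len(prices):
        raise ValueError("window must be between 1 and len(prices) - 1")
    return [mean(prices[i - window:i]) for i in range(window, len(prices))]


def mean_absolute_error(actual, predicted):
    """MAE: average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """R^2: 1 minus (residual sum of squares / total sum of squares)."""
    avg = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - avg) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up closing prices, for illustration only.
prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
window = 2
predicted = moving_average_forecast(prices, window)
actual = prices[window:]  # the prices the forecasts are compared against

print("MAE:", mean_absolute_error(actual, predicted))
print("R^2:", r_squared(actual, predicted))
```

On a steadily rising series like this one, the moving average always lags behind, so R^2 comes out negative. A negative R^2 means the forecast is worse than simply predicting the mean of the actual values.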

---
### Part 1 Software Exploration

For this project, I aim to explore TensorFlow for machine learning-based time series forecasting, and Python Numba for computational optimization. I chose TensorFlow because it provides an efficient framework for training and deploying machine learning models, which is pivotal to the project since we will be comparing the effectiveness of different models. Since I am relatively new to TensorFlow, I will be using documentation listed in the TensorFlow Core as my starting point for setting up the time series forecasting. I will also be using the Python Numba documentation as a starting point for optimizing my model's computation and training time.

This process will consist of setting up a TensorFlow environment with GPU acceleration to improve model training speed. I will then implement feature engineering techniques to ensure that the dataset is structured for efficient model training. Once these steps have been completed, I plan on using the MSU HPCC to run and test my code against large-scale financial datasets. The expected outcome is a fully functional TensorFlow model, optimized financial feature engineering functions written with Python Numba, and a well-structured repository with clear documentation outlining my process and the instructions for reproducing my results.


TensorFlow Core: https://www.tensorflow.org/tutorials/structured_data/time_series

Python Numba Documentation:
https://numba.readthedocs.io/en/stable/user/index.html
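
As a rough sketch of the kind of feature engineering function I have in mind, the example below computes a rolling mean and rolling standard deviation with explicit loops, which is the style of code Numba's `njit` decorator is designed to compile. The function name `rolling_mean_std`, the window size, and the sample prices are assumptions for illustration only. The example falls back to plain Python if Numba is not installed, so it shows the structure rather than a measured speedup.

```python
import numpy as np

try:
    from numba import njit  # optional: compiles the function to machine code
except ImportError:
    def njit(func):
        """Fallback when Numba is not installed: run the plain Python function."""
        return func


@njit
def rolling_mean_std(prices, window):
    """Rolling mean and population standard deviation over `window` values.

    Positions without a full window of history are left as NaN.
    """
    n = prices.shape[0]
    means = np.full(n, np.nan)
    stds = np.full(n, np.nan)
    for i in range(window - 1, n):
        total = 0.0
        for j in range(i - window + 1, i + 1):
            total += prices[j]
        m = total / window
        squared = 0.0
        for j in range(i - window + 1, i + 1):
            squared += (prices[j] - m) ** 2
        means[i] = m
        stds[i] = np.sqrt(squared / window)
    return means, stds


# Made-up prices, for illustration only.
prices = np.array([1.0, 2.0, 3.0, 4.0])
means, stds = rolling_mean_std(prices, 2)
print(means)
print(stds)
```

This sketch divides by the window size (a population standard deviation). Some libraries default to the sample standard deviation, which divides by the window size minus one, so results should be compared with that in mind.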



---
### Part 2 Benchmark and Optimization

For benchmarking, I will evaluate model training time, inference speed, and computational efficiency. Specifically, I will be benchmarking CPU vs. GPU training time, with the goal of comparing how much faster TensorFlow can train a model on a GPU vs. a CPU. I will also be comparing the performance of Numba vs. the performance of standard Python functions, with the hope of seeing improvements from using Numba-accelerated feature calculations as opposed to using only Python. And finally, I will be benchmarking parallel processing in TensorFlow, with the goal of investigating whether multi-threading or vectorized computations can further optimize the models.
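
A minimal timing harness for these comparisons might look like the sketch below. It uses only the standard library, and the helper names (`best_time`, `speedup`, `slow_sum`) are hypothetical. The two functions being compared are stand-ins for a baseline and an optimized implementation, not the project's actual code. When timing a Numba-compiled function or a GPU workload, it is usually worth calling the function once before measuring so that one-time compilation or initialization cost is not counted.

```python
import time


def best_time(func, *args, repeats=5):
    """Return the fastest wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, optimized_seconds):
    """How many times faster the optimized version is than the baseline."""
    return baseline_seconds / optimized_seconds


def slow_sum(values):
    """Stand-in baseline: an explicit Python loop."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [float(i) for i in range(100_000)]
t_loop = best_time(slow_sum, values)
t_builtin = best_time(sum, values)  # stand-in for the optimized version
print(f"loop: {t_loop:.6f}s  built-in: {t_builtin:.6f}s  "
      f"speedup: {speedup(t_loop, t_builtin):.1f}x")
```

Taking the best of several runs reduces noise from other processes on a shared machine, which matters on a cluster login or development node.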

TensorFlow already supports GPU parallelism, and Python Numba allows parallel execution for numerical operations. Using these capabilities, I will optimize the models with the hope of seeing success in several metrics:

- Achieve a 4x speedup in feature engineering through Numba optimizations
- Reduce model training time by a factor of 5 or more on a GPU compared to CPU-based training
- Improve inference speed to allow faster predictions

A successful outcome will demonstrate that machine learning models for stock prediction can be significantly optimized through the use of TensorFlow's GPU acceleration and Python Numba's computational efficiency.

---- END TEMPLATE ----

---
<a name=-Setting-up-your-GIT-repository></a>
# 3.  Setting up your GIT repository


To submit your proposal you are going to create a project folder, commit your proposal to the folder, and share your git repository with your instructor.  The following videos from the [Getting to know git (Tutorial)](https://msu-cmse-courses.github.io/cmse401-S23-student/0000--Jupyter-Getting-Started-Guide.html) may be helpful in setting up your repository.

In [1]:
# Git init introduction https://www.youtube.com/playlist?list=PLqPfbT7gwVP_AlE6HeDQUJsG4nUbGyeh3
from IPython.display import YouTubeVideo
YouTubeVideo("IAAv4DjYYUA",width=640,height=360, cc_load_policy=True)

The following are instructions for how to use the more general GitHub:

In [None]:
# Initializing a repository using GitHub
from IPython.display import YouTubeVideo
YouTubeVideo("dpeHlFm8SYU",width=640,height=360, cc_load_policy=True)

As you update and change the files in your repository you will need to push those changes to GitHub.  The following instructions walk you through this process:

In [None]:
# Git add and commit https://www.youtube.com/playlist?list=PLqPfbT7gwVP_AlE6HeDQUJsG4nUbGyeh3

from IPython.display import YouTubeVideo
YouTubeVideo("GTM-h5xX2Lk",width=640,height=360, cc_load_policy=True)

----

## What not to include (building a .gitignore file)

The first thing we want to teach is that not everything should go into a git repository, i.e. we do not want to bloat our repository with unwanted files.  The git repository works best with text files that represent "source" code and not compiled or generated code. Here are some basic guidelines of what not to include:


* ```.ipynb_checkpoints``` - These folders are generated when you run Jupyter notebooks.  They are "temporary" folders that will change each time you run your notebook and should not be included in your repository. 
* ```__pycache__``` - Similar to the .ipynb_checkpoints folders, these folders are often generated when running Python scripts and should not be included in your repository. 
* **_Other "Temporary" files_** - Temporary files are generated by all types of software and often start with a special character such as the dot (.) or the tilde (~).  For example, many text editors generate temporary files to save a document in case of a program crash.  Do not include temporary files in your repository. 
* **_Compiled Code_** - Programs written in languages such as C and Fortran must be compiled to an executable in order to run on your computer. These compiled files are not editable and should be left out of your repository.  Instead, it is better to include instructions for compiling the source code as part of your repository.  
* **_Program Output_** - Do not include any program output in your repository (unless for very specific reasons such as documentation, testing, or figures in your final report).  Assume that any output that can be generated by the source code should not be included with the source code (it is redundant). 

A good rule of thumb is that if you did not generate the file and/or do not know what it is, you probably do NOT want to include it in your repository. 

**_WARNING_** do not blindly add all files to your repository with the * (star) syntax.  This is bad practice. For example do NOT do the following:

    git add * #THIS IS BAD!!!!

### Other files to avoid

In addition to the above files, it is good to avoid any type of "Binary" file (with a few exceptions).  As stated earlier, git works best with text files so it can easily track changes. Some example binary files to avoid include:

- **_Large Data files_**  Although it is good to include a few example inputs to your software, avoid committing entire datasets.  It is best to store these files someplace else.  
- **_Non-Text formats_** such as Word, Excel or PowerPoint documents should be avoided.  These tend to change each time they are opened even if the core text does not change. It is better to use a text-based alternative. 


**_Note:_** one exception to the above rules is image files (e.g. jpg or png) that are used in markdown or in the documentation.  It is typically okay to include these since they tend to get included only once and do not change much as the project evolves. 

The ```.gitignore``` (typically read "dot git ignore") is a text file that contains a list of patterns (shell-style wildcards, which are similar to but simpler than the regular expressions we will learn about later) that specify names of files we do not want to include in a git repository. 


### .gitignore file

The ```.gitignore``` (read "dot git ignore") file is used to help keep unwanted files out of your project.  Each line of the ```.gitignore``` file is a filename or pattern you want git to ignore.  For example, based on what we said above, a good place to start on your ```.gitignore``` file would be the following two lines:
```
.ipynb_checkpoints
__pycache__
```

What should go into a .gitignore depends a lot on the type of project.  However, you don't need to invent these from scratch. For example, you could just copy the .gitignore file from the course repository or find one on the internet.  If you are using GitHub, it can also automatically create a .gitignore file for you from a Python template when you create the repository. 
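
To get a feel for how wildcard patterns match file names, here is a rough Python sketch using the standard library's `fnmatch` module. This is only an approximation for illustration: real git ignore rules have extra features (such as negation with `!` and directory anchoring with `/`) that this sketch does not handle, and the helper name `is_ignored` and the example paths are made up. To see what git itself decides for a file, you can run `git check-ignore -v <path>` inside your repository.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_ignored(path, patterns):
    """Approximate check: True if any part of the path matches any pattern."""
    parts = PurePosixPath(path).parts
    return any(fnmatchcase(part, pattern) for part in parts for pattern in patterns)


# Example patterns: two folder names plus two wildcard patterns.
patterns = [".ipynb_checkpoints", "__pycache__", "*.pyc", "*~"]

# Hypothetical paths, for illustration only.
paths = [
    "notebooks/.ipynb_checkpoints/proposal-checkpoint.ipynb",
    "src/__pycache__/model.cpython-311.pyc",
    "src/model.py",
    "README.md~",
]

for path in paths:
    status = "ignored" if is_ignored(path, patterns) else "kept"
    print(f"{status:8} {path}")
```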

In [None]:
#Hidden files https://www.youtube.com/playlist?list=PLqPfbT7gwVP_AlE6HeDQUJsG4nUbGyeh3

from IPython.display import YouTubeVideo
YouTubeVideo("kzI-mPSY8y4",width=640,height=360, cc_load_policy=True)

---

## Avoid Spaces in file names

When you name the files and folders inside of a repository, it is important that your names **_DO NOT include spaces_**.  Although all modern computers have ways to accept names with spaces, do not use them.  Instead use underscores (\_) or ```CamelCase``` (no spaces and a capital letter at the beginning of each word in the name).  Avoiding spaces in your names will **_ALWAYS_** save time in the long run.  
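
If you want to check an existing project folder for names with spaces, a short script along the following lines could help. This is a sketch under simple assumptions: the function name `names_with_spaces` is made up for this example, and it skips the `.git` folder since that is managed by git itself.

```python
import os


def names_with_spaces(root="."):
    """Return paths (relative to `root`) of files and folders whose name contains a space."""
    offenders = []
    for folder, dirnames, filenames in os.walk(root):
        # Do not descend into git's own metadata folder.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in dirnames + filenames:
            if " " in name:
                full_path = os.path.join(folder, name)
                offenders.append(os.path.relpath(full_path, root))
    return sorted(offenders)


if __name__ == "__main__":
    for path in names_with_spaces("."):
        print("Contains a space:", path)
```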

----

## Always Use Relative Paths

In your code there are two basic ways to specify the location of a folder on your computer: relative paths and absolute paths.  A relative path is a path starting from your current directory, and an absolute path is a path starting from your computer's "root" directory.  

- Relative paths typically start with a single dot (.), representing the current directory, or two dots (..), representing the current directory's parent folder.
- Absolute paths typically start at the global root directory (/) on a Linux or Mac machine or with a drive label (e.g. C:) on a Windows machine.  

**_ALWAYS_** use relative paths in your git repository.  This ensures that others will be able to use your software if they download it onto their computer.  For example:

    Good: ./data/  or ../data/ is a relative path to a child directory or sibling directory called data. 
    Bad (not acceptable): C:/research/data or /mnt/home/data are absolute paths to a data directory
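
In Python, the standard `pathlib` module can tell the two kinds of paths apart and build relative paths that work on any operating system. The sketch below is only an illustration: the helper `is_relative` and the file name `prices.csv` are hypothetical.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath


def is_relative(path_text, windows=False):
    """True if the text is a relative path under Linux/Mac (or Windows) rules."""
    flavour = PureWindowsPath if windows else PurePosixPath
    return not flavour(path_text).is_absolute()


for text in ["./data/", "../data/", "/mnt/home/data"]:
    kind = "relative" if is_relative(text) else "absolute"
    print(f"{text} -> {kind}")

windows_text = "C:/research/data"
kind = "relative" if is_relative(windows_text, windows=True) else "absolute"
print(f"{windows_text} -> {kind}")

# Build a relative path in a portable way; "prices.csv" is a made-up file name.
data_file = Path("data") / "prices.csv"
print(data_file, "is absolute?", data_file.is_absolute())
```

Joining path pieces with the `/` operator lets `pathlib` pick the right separator for the machine the code runs on, so the same line works for anyone who clones the repository.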
 
 

----

## Jupyter notebook files in git repositories

It turns out that Jupyter notebook files and git repositories work very poorly together.  Jupyter notebook files are a unique combination of source and program-generated information.  So, every time you run a Jupyter file it can add output cells, which makes git think you changed something important. In many cases it is just a few numbers or some output text.  When you run the ```git status``` command it always looks like Jupyter notebook files have changed even when the source has not changed. 

A good rule of thumb is to clear all of the output before committing any changes to Jupyter notebook files.  

- Open the jupyter notebook file
- Select "Reset Kernel and clear output" from the menu
- Save the notebook file.
- Do your "git add" and "git commit" commands
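
The same clean-up can be scripted. The sketch below assumes the common version 4 notebook layout, where an `.ipynb` file is JSON with a top-level `cells` list and each code cell stores `outputs` and `execution_count`. The function names are made up for this example. Recent versions of `nbconvert` also provide a `--clear-output` option for the same purpose, if you prefer a ready-made tool.

```python
import json


def clear_outputs(notebook):
    """Remove outputs and execution counts from a notebook loaded as a dict."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def clear_outputs_in_file(path):
    """Load a notebook file, clear its outputs, and write it back in place."""
    with open(path, encoding="utf-8") as handle:
        notebook = json.load(handle)
    clear_outputs(notebook)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(notebook, handle, indent=1, ensure_ascii=False)
        handle.write("\n")


# Tiny in-memory example notebook, for illustration only.
example = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["1 + 1"],
         "execution_count": 3, "outputs": [{"output_type": "execute_result"}]},
    ]
}
cleaned = clear_outputs(example)
print(cleaned["cells"][1]["outputs"], cleaned["cells"][1]["execution_count"])
```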

The following video goes through why we have to treat Jupyter notebooks this way:

[Direct Link](https://www.youtube.com/embed/79hW_TzLos8)

In [None]:
# Jupyter vs. Git -- https://www.youtube.com/playlist?list=PLqPfbT7gwVP_AlE6HeDQUJsG4nUbGyeh3
from IPython.display import YouTubeVideo
YouTubeVideo("79hW_TzLos8",width=640,height=360, cc_load_policy=True)



---
<a name=-Turning-in-your-Project></a>
# 4.  Turning in your Project


In order to turn in your GIT repository you just need to give the instructors and classmates the permissions to clone the repository and provide the full git command. Please use the following form to submit this information.

Or if you prefer, you can use the [Direct Link](https://forms.microsoft.com/Pages/ResponsePage.aspx?id=MHEXIi9k2UGSEXQjetVofbihPqVa-WtNjOGYhCwpOgRUMlhQQlY0WE04SzUyRVhWR1NDRFRZSzdOMS4u)

In [None]:
from IPython.display import HTML
HTML(
"""
<iframe 
	src="https://forms.microsoft.com/Pages/ResponsePage.aspx?id=MHEXIi9k2UGSEXQjetVofbihPqVa-WtNjOGYhCwpOgRUMlhQQlY0WE04SzUyRVhWR1NDRFRZSzdOMS4u" 
	width="100%" 
	height="1100px" 
	frameborder="0" 
	marginheight="0" 
	marginwidth="0">
	Loading...
</iframe>
"""
)

---

# 5. Rubric

Below is the grading rubric that will be used to assess your project proposals.   Please use the following as a guide when preparing your proposal:

```
(50 points) Proposal
    - (5 points) Is there a project title?
    - (5 points) Is there a motivating picture included and referenced?
    - (10 points) Is the abstract well written and easy to understand?
    - (5 points) Are there due dates and milestones in the schedule?
    - (10 points) Are the goals for Part1 clear from the proposal, including links and references?
    - (10 points) Are the goals for Part2 clear from the proposal, including links and references?
    - (5 points) Is the proposal written in an ipynb or md file? 

(50 points) Correctly setting up your git repository
    - (10 points) Was everything turned in on time?
    - (10 points) Is there a gitignore file (does it work)?
    - (10 points) Are the permissions to the git repository set up correctly?
    - (5 points) Does the project use correct Filenames?
    - (5 points) Does the project have any temporary or hidden files that should not be included?
    - (10 points) Were all directions followed?
    
100 points total
```


-----
### Congratulations, you are done!

Make sure your report is added to your GitHub repository as well as D2L. You should also make sure that you have added the instructors as collaborators to your project repository so that we can access your project. 

Written by Dr. Dirk Colbry, Michigan State University (Updated by Dr. Nathan Haut in Spring 2025)
<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.

----