# Module 2: Fundamentals of Data Engineering

## Sprint 4: Capstone Project

# About this Sprint

It is time for your second Capstone Project.  
You will work on this project for the whole Sprint.  
The outcome of this Sprint should potentially serve as your portfolio item, so try to show your best work!

This time your objective is even more challenging - you will be required to iteratively build and implement a plan for a large dataset based on business objectives.  
You'll have to translate the business requirements, making assumptions where necessary, into a plan for your project.

Even though you have learned a lot about Python, SQL, and Apache Airflow in this Sprint, working on a larger project will be the real challenge.

Good luck!

## Context

You are hired by a hedge fund specializing in using the latest technologies to develop bespoke trading strategies to gain an edge in trading various financial instruments.  
You are working in a team that is using machine learning to predict the price movements of commodities - oil, natural gas, gold, silver, etc.  
Your role in the team is to help your team release the precious metals machine learning models that they have been developing for the past year.

- Your database should contain at least two schema objects (tables, views, etc.) - one for training machine learning models and another for analytical workflows.

During the planning session you committed to delivering these tasks:
- Set up an RDBMS that will have the required data to be used for both model training and analytical workflows. The database objects that you use should contain at minimum columns for each of the metals, a timestamp, and an id. The table used for machine learning training should only contain rows for the last twelve hourly entries, as this is how much data is required for training the ML model. The table used for analytics should contain all available data.
- Set up a data ingestion solution for gold, silver, platinum, and palladium prices. You can use any API that you'd like. Some suggestions for the APIs: [Live Metal Prices](https://rapidapi.com/solutionsbynotnull/api/live-metal-prices), [Gold & Silver Prices](https://www.goldapi.io/), [Metals-API](https://metals-api.com/documentation), [api.metals.live](http://api.metals.live/), [Metalprice API](https://metalpriceapi.com/pricing). P.S. you should not need to use a paid plan for this project.
- Modify the `Model` class to connect to the real data sources and use them for training the models.
- Set up the machine learning training pipeline, which should result in multiple trained model files.
- Set up periodical backups of the machine learning models and the database.
- Keep all historical data available for analytical purposes.

Note: for the RDBMS, you can use any technology that you like (e.g., IBM Db2, PostgreSQL, etc.); for orchestrating your jobs, use Apache Airflow.
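
One way to meet the two-schema-object requirement is a single table that keeps all history for analytics, plus a view that exposes only the twelve most recent rows for training. The sketch below uses Python's built-in `sqlite3` as a stand-in for whichever RDBMS you choose. The names `metal_prices`, `metal_prices_training`, `recorded_at`, and the per-metal column names are assumptions made for illustration, and the view relies on ingestion producing exactly one row per hour.

```python
import sqlite3

# Hypothetical schema: one wide table with a column per metal.
SCHEMA = """
CREATE TABLE IF NOT EXISTS metal_prices (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    recorded_at TEXT NOT NULL UNIQUE,  -- ISO 8601 timestamp, one row per hour
    xauusd      REAL NOT NULL,         -- gold
    xagusd      REAL NOT NULL,         -- silver
    xptusd      REAL NOT NULL,         -- platinum
    xpdusd      REAL NOT NULL          -- palladium
);

-- Training view: the twelve most recent rows, returned oldest first.
CREATE VIEW IF NOT EXISTS metal_prices_training AS
SELECT * FROM (
    SELECT * FROM metal_prices ORDER BY recorded_at DESC LIMIT 12
) ORDER BY recorded_at ASC;
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the analytics table and the training view if they do not exist."""
    conn.executescript(SCHEMA)


def insert_price_row(
    conn: sqlite3.Connection,
    recorded_at: str,
    xauusd: float,
    xagusd: float,
    xptusd: float,
    xpdusd: float,
) -> None:
    """Insert one hourly observation; a repeated timestamp is silently skipped.

    `INSERT OR IGNORE` is SQLite syntax; other databases spell this differently.
    """
    conn.execute(
        "INSERT OR IGNORE INTO metal_prices "
        "(recorded_at, xauusd, xagusd, xptusd, xpdusd) VALUES (?, ?, ?, ?, ?)",
        (recorded_at, xauusd, xagusd, xptusd, xpdusd),
    )
    conn.commit()
```

Keeping the training object as a view means the analytics table stays the single source of truth, and nothing has to be deleted to keep the training set at twelve rows.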

### Model Definition

This is the code for training machine learning models that the data scientists in your team have given you to test your pipelines.
For reference, it includes a way to generate data, so that you know how to structure your data when you connect a real data source to the model.
Their final models are going to be different, but the API will be the same, so make your solution modular.

```python
import numpy as np
import pandas as pd
from pathlib import Path
from sktime.forecasting.arima import ARIMA

rng = np.random.default_rng()

AR_LOWER = 0.1
AR_UPPER = 0.6
MEAN_LOWER = 1000
MEAN_UPPER = 2000
STD = 1


def generate_integrated_autocorrelated_series(
    p: float, mean: float, std: float, length: int
) -> np.ndarray:
    """Generates an integrated autocorrelated time series using a specified autoregression parameter, mean and standard deviation of the normal distribution, and the desired length of the series."""
    x = 0
    ar1_series = np.asarray([x := p * x + rng.normal(0, 1) for _ in range(length)])
    return np.cumsum(ar1_series * std) + mean


def generate_sample_data(
    cols: list[str], x_size: int, y_size: int
) -> tuple[pd.DataFrame, pd.DataFrame, tuple[np.ndarray, np.ndarray]]:
    """Generates sample training and test data for specified columns. The data consists of autocorrelated series, each created with randomly generated autoregression coefficients and means. The method also returns the generated autocorrelation coefficients and means for reference. 'x_size' determines the length of the training set, and 'y_size' determines the length of the test set. 'cols' determines the names of the columns."""
    ar_coefficients = rng.uniform(AR_LOWER, AR_UPPER, len(cols))
    means = rng.uniform(MEAN_LOWER, MEAN_UPPER, len(cols))
    full_dataset = pd.DataFrame.from_dict(
        {
            col_name: generate_integrated_autocorrelated_series(
                ar_coefficient, mean, STD, x_size + y_size
            )
            for ar_coefficient, mean, col_name in zip(ar_coefficients, means, cols)
        }
    )
    return (
        full_dataset.head(x_size),
        full_dataset.tail(y_size),
        (ar_coefficients, means),
    )


class Model:
    def __init__(self, tickers: list[str], x_size: int, y_size: int) -> None:
        self.tickers = tickers
        self.x_size = x_size
        self.y_size = y_size
        self.models: dict[str, ARIMA] = {}

    def train(self, /, use_generated_data: bool = False) -> None:
        if use_generated_data:
            data, _, _ = generate_sample_data(
                self.tickers, self.x_size, self.y_size
            )
        else:
            raise NotImplementedError
        for ticker in self.tickers:
            dataset = data[ticker].values
            model = ARIMA(order=(1, 1, 0), with_intercept=True, suppress_warnings=True)
            model.fit(dataset)
            self.models[ticker] = model

    def save(self, path_to_dir: str | Path) -> None:
        path_to_dir = Path(path_to_dir)
        path_to_dir.mkdir(parents=True, exist_ok=True)
        for ticker in self.tickers:
            full_path = path_to_dir / ticker
            self.models[ticker].save(full_path)
```

### Training and Saving the Model

The data scientists also said that you can use this script to see how the model training and saving works with generated data.

```python
model = Model(["XAUUSD", "XAGUSD", "XPTUSD", "XPDUSD"], 12, 1)
model.train(use_generated_data=True)
model.save("model1")
```
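
To swap the generated data for real prices, `Model.train` needs a DataFrame shaped like the first value returned by `generate_sample_data`: one column per ticker and `x_size` rows ordered from oldest to newest. Below is a minimal sketch of such a loader. It assumes a DB-API connection (here `sqlite3`) and a hypothetical `metal_prices_training` table or view with a `recorded_at` column and one lower-cased column per ticker; none of these names come from the data scientists' code.

```python
import sqlite3

import pandas as pd


def load_training_data(
    conn: sqlite3.Connection,
    tickers: list[str],
    x_size: int,
    source: str = "metal_prices_training",  # hypothetical table or view name
) -> pd.DataFrame:
    """Return the latest `x_size` rows, oldest first, with one column per ticker."""
    # Identifiers cannot be bound as query parameters, so validate them first.
    for name in (*tickers, source):
        if not name.replace("_", "").isalnum():
            raise ValueError(f"Unsafe identifier: {name!r}")

    columns = ", ".join(ticker.lower() for ticker in tickers)
    query = (
        f"SELECT {columns} FROM ("
        f"SELECT * FROM {source} ORDER BY recorded_at DESC LIMIT {int(x_size)}"
        f") AS latest ORDER BY recorded_at ASC"
    )
    frame = pd.read_sql_query(query, conn)
    # Restore the ticker spelling that the Model class uses as column names.
    frame.columns = list(tickers)

    if len(frame) < x_size:
        raise ValueError(f"Expected {x_size} rows, found only {len(frame)}")
    return frame
```

With a loader like this, the `else` branch of `Model.train` could assign its result to `data` instead of raising `NotImplementedError`, which keeps the database details outside the model code.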

### Dependencies

When you asked about the dependencies (Python version, third-party libraries, etc.) for this code, the data scientists couldn't give you a good answer: they didn't remember which version of Python they used in their notebooks.
They said that they simply ran `pip install sktime` and other libraries were already installed in the environment that they used.
You will need to figure out the dependencies for this pipeline yourself.
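
One low-effort way to pin the dependencies down is to record the interpreter and package versions from an environment in which the training script runs successfully. The sketch below uses only the standard library. The package list is an assumption based on the imports in the model code, and the function names are illustrative.

```python
import sys
from importlib import metadata

# Assumed package list, taken from the imports in the model code.
PACKAGES = ("numpy", "pandas", "sktime")


def snapshot_environment(packages: tuple[str, ...] = PACKAGES) -> dict[str, str]:
    """Map 'python' and each package name to the version found in this environment."""
    versions = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def to_requirements(versions: dict[str, str]) -> str:
    """Render the installed packages as pinned requirements.txt lines."""
    return "\n".join(
        f"{name}=={version}"
        for name, version in versions.items()
        if name != "python" and version != "not installed"
    )


if __name__ == "__main__":
    print(to_requirements(snapshot_environment()))
```

Two hints when working this out: the `str | Path` annotation in `Model.save` is evaluated when the class is defined, so the code as given needs Python 3.10 or newer, and sktime's `ARIMA` is, as far as the sktime documentation indicates, a wrapper around `pmdarima`, which a plain `pip install sktime` may not install.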

## Objectives for this Part

- Practice translating business requirements into data engineering tasks.
- Practice architecting a solution feeding data to a Python application.
- Practice orchestrating jobs using Apache Airflow.
- Practice using machine learning to solve business problems.
- Practice deploying multiple machine learning models.

## Requirements

- Your solution should encompass the functionality outlined in the Context section.
- Create a plan for your deliveries. This should include your assumptions, overall objectives, and objectives for each step in your plan. You are not expected to have a plan for the whole project but instead have a clear understanding of what you'll try to achieve in the next step and build the plan one step at a time.
- Architect a solution enabling the machine learning models to use the data in your database.
- Your machine learning training pipeline should be scheduled to run immediately after loading the data from the APIs.
- Your system should back up the entire database and the machine learning models every six hours and store the last twenty backups.
- Provide suggestions about how your analysis can be improved.
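
The backup requirement amounts to a write-then-prune routine that runs every six hours. Here is a minimal sketch of that routine for the model files, assuming each backup is a directory named by its UTC timestamp under a hypothetical backup root. Producing the database dump itself (for example with `pg_dump` if you use PostgreSQL) is left out, and the function names are illustrative.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

MAX_BACKUPS = 20  # the requirement: keep the last twenty backups


def create_backup(
    models_dir: Path, backup_root: Path, now: Optional[datetime] = None
) -> Path:
    """Copy the model files into a new timestamped directory and return its path."""
    now = now or datetime.now(timezone.utc)
    # Timestamps in this format sort lexicographically in chronological order.
    target = backup_root / now.strftime("%Y%m%dT%H%M%SZ")
    shutil.copytree(models_dir, target / "models")
    return target


def prune_backups(backup_root: Path, keep: int = MAX_BACKUPS) -> list[Path]:
    """Delete all but the `keep` newest backup directories; return the deleted paths."""
    backups = sorted(path for path in backup_root.iterdir() if path.is_dir())
    stale = backups[:-keep] if keep > 0 else backups
    for path in stale:
        shutil.rmtree(path)
    return stale
```

Pruning only after a new backup has been written successfully means a failed backup never reduces the number of usable copies.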

## Bonus Challenges

As a data engineer, you will spend a significant amount of your time learning new things.  
Sometimes you will do that for fun, but most of the time, you will have an urgent problem, and you will need to quickly learn some new skills to be able to solve it.  
It is essential to build this skill gradually - it is extremely valuable for all data engineers.  
The bonus challenges are designed to simulate these types of situations.  
These challenges require you to do something that we haven't covered in the course yet.  
Instead of trying to do all of the bonus challenges, concentrate on just one or two and do them well.  
All of the bonus challenges are optional - no points will be deducted if you skip them.

- Write unit and integration tests for your solution.
- Extend your solution to store all hyperparameters and parameters of each model for each training run in a separate table or database.
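
For the second bonus challenge, one possible layout is a table with one row per ticker per training run, with the hyperparameters and the fitted parameters stored as JSON. The sketch below again uses `sqlite3` as a stand-in; the table name `model_training_runs`, its columns, and the function names are hypothetical. With sktime, the two dictionaries could presumably be obtained from the estimator's `get_params()` and `get_fitted_params()` methods.

```python
import json
import sqlite3
from datetime import datetime, timezone
from typing import Any, Optional


def create_runs_table(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) training-run log table if it does not exist."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS model_training_runs (
            id                INTEGER PRIMARY KEY AUTOINCREMENT,
            ticker            TEXT NOT NULL,
            trained_at        TEXT NOT NULL,
            hyperparameters   TEXT NOT NULL,  -- JSON
            fitted_parameters TEXT NOT NULL   -- JSON
        )
        """
    )


def log_training_run(
    conn: sqlite3.Connection,
    ticker: str,
    hyperparameters: dict[str, Any],
    fitted_parameters: dict[str, Any],
    trained_at: Optional[datetime] = None,
) -> int:
    """Store one training run and return its row id."""
    trained_at = trained_at or datetime.now(timezone.utc)
    cursor = conn.execute(
        "INSERT INTO model_training_runs "
        "(ticker, trained_at, hyperparameters, fitted_parameters) "
        "VALUES (?, ?, ?, ?)",
        (
            ticker,
            trained_at.isoformat(),
            # default=str falls back to a string for values json cannot encode.
            json.dumps(hyperparameters, default=str),
            json.dumps(fitted_parameters, default=str),
        ),
    )
    conn.commit()
    return cursor.lastrowid
```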

## Evaluation Criteria

- Adherence to the requirements. How well did you meet the requirements?
- Code quality. Was your code well-structured? Did you use the appropriate levels of abstraction? Did you remove commented-out and unused code?
- Code performance. Did you use suitable algorithms and data structures to solve the problems?
- Presentation quality. Coherence of the presentation of the project, how well everything is explained.
- General understanding of the topic.

## Correction

During your project correction, you should present it as if talking to a data scientist building the machine learning model in your team.  
You can assume that they will have strong data science and decent software engineering skills - they will understand technical jargon but are not expected to notice things that could have been done better or ask about the choices you've made.
They are very familiar with the problem (but it is always worth recapping which part of the problem you were asked to solve), so don't spend your time explaining trivial concepts or simple code snippets. Your best bet is to focus your presentation on technological and design choices as well as the end-user functionality of your solution.


## General Correction Guidelines

For an in-depth explanation about how corrections work at Turing College, please read [this doc](https://turingcollege.atlassian.net/wiki/spaces/DLG/pages/537395951/Peer+expert+reviews+corrections).
