# Programming Assignment 2: Expeditions into ML Workflows


# Team Details

When submitting, fill your team details in this cell. Note that this is a markdown cell.

**Student 1 Name: Doddapaneni Udith**

**Student 1 ID: 142201012**

**Student 2 Name:**

**Student 2 ID:**

**Student 3 Name:**

**Student 3 ID:**


# Assignment Details

Even though model training gets all the focus, it is often a small part of your ML pipeline. Many other components must work well for the model to work well. I am assuming that your other courses, such as DS/ML, will cover basic ML model training and inference.

In this assignment (PA2), you will explore some components of "traditional" ML pipelines that are typically underexplored. In the next one (PA3), you will explore some modern components such as AutoML, ML observability, and decision theory.

Here is a corresponding illustration from the well-known paper "Hidden Technical Debt in Machine Learning Systems", published at NeurIPS 2015.

![ML Pipeline](resources/ml_pipeline.png) 

# Assignment Breakdown

This assignment is likely to be slightly more challenging, and possibly more time-consuming, than PA1, so please start early. Please use the course forums on Moodle LMS if you face any issues. I have cherry-picked some interesting sub-problems. As with PA1, this assignment only scratches the surface of a vast and interesting topic, so feel free to learn more about the individual topics and libraries on your own.

Being a data scientist requires endless experiments, such as trying out different variations of the model, hyperparameters, etc. So your code needs to be modular and easy to change. Most importantly, when you make changes, nothing else should break. In this assignment, you will learn some basic ideas about how to design such extensible ML systems.

### 1. Configuration Management via Pydantic (10 points)

Modern ML pipelines are becoming increasingly complex, with different teams responsible for different components. So it is possible that one team makes a change that *silently* breaks your model. Recently, [Pydantic](https://docs.pydantic.dev/latest/) has become a popular way to do data validation in Python. In this task, you will use Pydantic to do some basic configuration in ML pipelines.
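To make the idea of validated configuration concrete, here is a minimal sketch. The class name `TrainingConfig` and its fields are hypothetical and chosen only for illustration; they are not the configuration used in the graded task.

```python
from pydantic import BaseModel, Field, ValidationError


class TrainingConfig(BaseModel):
    """Hypothetical training configuration; all field names are illustrative."""

    learning_rate: float = Field(gt=0, le=1)  # must lie in (0, 1]
    n_estimators: int = Field(default=100, ge=1)  # at least one estimator
    algorithm: str = "random_forest"


# A valid configuration is parsed and type-checked on construction.
config = TrainingConfig(learning_rate=0.1, n_estimators=200)

# An invalid value fails loudly here instead of silently breaking the model later.
try:
    TrainingConfig(learning_rate=-0.5)
    validation_failed = False
except ValidationError:
    validation_failed = True
```

The point of the sketch is that a bad value introduced by another team is rejected at the boundary, when the configuration object is built, rather than deep inside training code.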

### 2. Model Serving (30 points)

We will start from the end by focusing on model serving. You will write code to serve your ML model in four different modalities: command-line based, terminal based, API based, and web based.

### 3. Model Evaluation Metrics (10 points)

In this task, you will explore some of the summary metrics available for evaluating classifiers. You will also learn how to create a new metric and use it inside the scikit-learn ecosystem. 
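As an illustration of plugging a new metric into scikit-learn, the sketch below wraps a hand-written function with `make_scorer` so that it can be passed as the `scoring` argument. The choice of specificity as the metric, the synthetic dataset, and the classifier are assumptions made for the example, not the metric required by the task.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import cross_val_score


def specificity(y_true, y_pred):
    """True negative rate, TN / (TN + FP), for binary labels 0 and 1."""
    tn, fp, _fn, _tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tn / (tn + fp) if (tn + fp) else 0.0


# Wrap the plain function so scikit-learn utilities can call it as a scorer.
specificity_scorer = make_scorer(specificity)

# Synthetic data, used only to show the scorer inside cross-validation.
X, y = make_classification(n_samples=200, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=5, scoring=specificity_scorer
)
```

The same scorer object can be passed anywhere scikit-learn accepts a `scoring` argument, which is what makes a custom metric usable across the ecosystem.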

### 4. Error Estimation (15 points)

Next, you will explore some methods to estimate the generalization error and analyze how accurate they are. We will also evaluate some advanced methods such as Bootstrap to get confidence intervals around error estimates. 
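The bootstrap idea can be sketched with the standard library alone. The example below computes a percentile bootstrap interval for a test error rate; the function name `bootstrap_error_ci`, the per-example error list, and the number of resamples are all hypothetical choices for illustration, and the task may ask for a different variant.

```python
import random


def bootstrap_error_ci(errors, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-example 0/1 errors."""
    rng = random.Random(seed)
    n = len(errors)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(errors, k=n)  # resample with replacement
        means.append(sum(sample) / n)
    means.sort()
    # Take the alpha/2 and 1 - alpha/2 empirical quantiles of the resampled means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical test-set outcomes: 1 = misclassified, 0 = correct.
errors = [1] * 15 + [0] * 85
point_estimate = sum(errors) / len(errors)
lower, upper = bootstrap_error_ci(errors)
```

The interval width reflects how much the error estimate would move if the test set had been drawn differently, which a single point estimate cannot show.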

### 5. Hyper Parameter Tuning (10 points)

In this task, we will explore four common approaches for hyperparameter optimization.
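The four approaches are described in the task itself. As a rough illustration of what a search over hyperparameters looks like, the sketch below contrasts two widely used strategies, exhaustive grid search and random search, which are assumed here only for illustration. The `validation_score` function is a hypothetical stand-in for training a model and scoring it on validation data.

```python
import itertools
import random


def validation_score(params):
    """Hypothetical stand-in for 'train a model, return its validation score'."""
    return -((params["lr"] - 0.1) ** 2) - 0.01 * abs(params["depth"] - 5)


def grid_search(grid):
    """Evaluate every combination in the grid and return the best one."""
    keys = list(grid)
    candidates = [
        dict(zip(keys, values)) for values in itertools.product(*grid.values())
    ]
    return max(candidates, key=validation_score)


def random_search(grid, n_trials, seed=0):
    """Evaluate a fixed budget of randomly sampled combinations."""
    rng = random.Random(seed)
    candidates = [
        {key: rng.choice(values) for key, values in grid.items()}
        for _ in range(n_trials)
    ]
    return max(candidates, key=validation_score)


grid = {"lr": [0.01, 0.1, 1.0], "depth": [3, 5, 7]}
best_grid = grid_search(grid)  # tries all 9 combinations
best_random = random_search(grid, n_trials=4)  # tries only 4
```

Grid search is guaranteed to find the best point on the grid but its cost grows multiplicatively with each added hyperparameter, while random search trades that guarantee for a fixed evaluation budget.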

### 6. Putting It All Together (10 points)

In this task, you will combine all of the aforementioned ideas to build a pipeline for training an ML model.

### 7. Model Performance Reports (15 points)

In the final task, we will learn some basic tricks for generating reports that show the performance of the model. We will use the concept of templates to generate complex and interactive reports. 
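The template idea can be sketched as follows: the layout of the report lives in a template, and the model's numbers are substituted in at render time. This sketch uses `string.Template` from the standard library as a stand-in; the task may rely on a dedicated template engine instead, and the function name `render_report` and the metric values are hypothetical.

```python
from string import Template

# The report layout is fixed; only the $placeholders change between runs.
REPORT_TEMPLATE = Template(
    """<html>
  <body>
    <h1>$title</h1>
    <table>
      <tr><th>Metric</th><th>Value</th></tr>
$rows
    </table>
  </body>
</html>"""
)


def render_report(title, metrics):
    """Fill the template with one table row per metric."""
    rows = "\n".join(
        f"      <tr><td>{name}</td><td>{value:.3f}</td></tr>"
        for name, value in metrics.items()
    )
    return REPORT_TEMPLATE.substitute(title=title, rows=rows)


# Hypothetical metric values, used only to show the rendering step.
html = render_report("Model Performance", {"accuracy": 0.912, "f1": 0.874})
```

Keeping layout and data separate means the same template can be reused for every model variant you train.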

# Setting up Programming Assignment Environment

As in PA1, we will use rye for managing the assignment. Please follow the instructions below to set up the assignment environment.

1. Open the DS5612_PA2 folder in a terminal.

2. Install the relevant Python interpreter and packages.

> rye sync

3. Please refer to PA1's 01_environment_setup.html for more details about how to use rye. The most relevant ones are 

> rye fmt
>
> rye lint
> 
> rye test






# Submission Instructions

This section provides clear instructions on how to submit the assignment. Incorrect submissions will incur negative points :)

### Linting and Formatting (-5 points in the worst case)

It is important that your code is formatted in accordance with the widely used PEP conventions. We will be using ruff to do this. If you have set up ruff as the formatter in your editor, formatting happens whenever you save a file. If not, you can manually format the files using the following command.

> rye fmt

It is important that your code has as few issues as possible. This covers not only syntax errors and bugs but also code aesthetics and code quality. So, the TAs will be running an aggressive linter with nearly a thousand rules enabled. This is a useful thing to do at the beginning of your career to learn to code in a Pythonic style. If your code has more than 50 errors, the TAs will deduct 5 points. You can check your current status by running

> rye lint

Ruff is reasonably smart and can automatically fix a decent chunk of issues (just remember to close the files inside your editor).

> rye lint --fix

Some of the more pedantic linter errors are disabled (see ignore key of tool.ruff.lint table in pyproject.toml). I am open to adding more based on popular request, but usually it is good to have a lean ignore list.

### Testing Your Code

A decent chunk of this assignment will be graded automatically. For the automatically graded tasks, the grader is split into public and private test cases. The automatic grader provided with the assignment tests your code against the public test cases. The TAs have access to a grader with the private test cases, which are a superset of the public test cases.

Please follow the instructions carefully, as the grading will be binary: either you get the full score or none at all.

You can use the following command to run ALL the test cases. Run this before submission.

> rye test

You can also test the code for individual tasks by using pytest markers. The full list of markers can be found inside tool.pytest.ini_options of pyproject.toml. The name of the marker for a given task will be provided as part of the task description.

> rye test -- -m MARKER

For example, you can run only the test for the typer deployment using the command

> rye test -- -m t2typer


### Preparing the Submission file (-5 points in the worst case)

When submitting, first make a copy of the DS5612_PA2 folder. Make the following changes to the copy and then submit it as the DS5612_PA2.zip file.

- Delete the .git, .pdm-build, .venv, .ruff_cache, and .pytest_cache folders from the project root.
- Delete the .gitignore and .python-version files from the project root.
- Delete the \_\_pycache\_\_ folders from src/ds5612_pa2 and src/ds5612_pa2/code.
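The steps above can be automated with a small script. The following is a sketch rather than an official submission tool: the function name `prepare_submission` and its arguments are hypothetical, and it assumes that the zip file should contain a top-level DS5612_PA2 folder. Check the resulting archive manually before submitting.

```python
import shutil
from pathlib import Path

FOLDERS_TO_DELETE = [".git", ".pdm-build", ".venv", ".ruff_cache", ".pytest_cache"]
FILES_TO_DELETE = [".gitignore", ".python-version"]
PYCACHE_PARENTS = ["src/ds5612_pa2", "src/ds5612_pa2/code"]


def prepare_submission(project_root, work_dir):
    """Copy the project into work_dir, clean the copy, and zip it."""
    project_root, work_dir = Path(project_root), Path(work_dir)
    copy_root = work_dir / "DS5612_PA2"

    # Work on a copy so the original project (and its .venv) stays intact.
    shutil.copytree(project_root, copy_root, symlinks=True)

    for name in FOLDERS_TO_DELETE:
        shutil.rmtree(copy_root / name, ignore_errors=True)
    for name in FILES_TO_DELETE:
        (copy_root / name).unlink(missing_ok=True)
    for parent in PYCACHE_PARENTS:
        shutil.rmtree(copy_root / parent / "__pycache__", ignore_errors=True)

    # Assumption: the archive keeps DS5612_PA2 as its top-level folder.
    archive = shutil.make_archive(
        str(work_dir / "DS5612_PA2"), "zip", root_dir=work_dir, base_dir="DS5612_PA2"
    )
    return Path(archive)
```

A call such as `prepare_submission(".", "../submission")` would be run from the project root, with `../submission` being an existing, empty directory outside the project. Because `ignore_errors=True` hides deletion failures, it is worth listing the archive contents once to confirm that none of the excluded folders remain.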


