In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.max_rows', 1000)

In [2]:
df = pd.read_csv("./train_competition_2026.csv")

# Kaggle Data Competition Project

### Project Description

Context: Medical data from patients in the ER. 

There are multiple timepoints observed per patient. Some of the predictors are constant in time; others are dynamic. The responses y_1 and y_2 are measurements of health indicators (e.g. arterial pressure, heart rate) recorded 5 minutes after the predictors. These responses usually signal whether a patient's condition is deteriorating to an extreme point.
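One way to check which predictors are constant within a patient and which vary over time is to count distinct values per `sub_id`. The sketch below uses a small made-up frame that borrows column names from this notebook; the names `toy`, `static_cols`, and `dynamic_cols` are illustrative, and the result on the real data may differ.

```python
import pandas as pd

# Toy frame with a few columns named as in this notebook (illustrative values).
toy = pd.DataFrame({
    "sub_id": [0, 0, 0, 1, 1, 1],
    "num_0": [1.38, 1.38, 1.38, -1.0, -1.0, -1.0],
    "cat_0": [1, 1, 1, 1, 1, 1],
    "t_0": [105.5, 104.4, 104.0, 105.4, 105.4, 104.6],
})

predictors = ["num_0", "cat_0", "t_0"]

# Count distinct values of each predictor within each patient.
n_unique = toy.groupby("sub_id")[predictors].nunique()

# A predictor is treated as static if every patient has at most one value for it.
is_static = (n_unique <= 1).all()
static_cols = is_static[is_static].index.tolist()
dynamic_cols = is_static[~is_static].index.tolist()

print("static:", static_cols)
print("dynamic:", dynamic_cols)
```

Static columns can be kept as they are, while dynamic columns are natural candidates for per-patient lag or rolling features.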

We want to build a model that predicts these values so that we can flag a patient at risk 5 minutes before the deterioration actually happens. Your task for this project is to build a machine learning model that takes in the predictors and returns predictions for y_1 and y_2. At each of the checkpoints (2/17 at 11:59 PM, 3/2 at 11:59 PM, and 3/11 at 11:59 PM) you will use your model to predict the 2 responses for a test set. Your performance relative to your competitors (based on MAE) will determine part of your score for the project.
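For local validation it helps to compute the metric yourself. The sketch below assumes that "average MAE" means the mean of the two per-target MAEs (one for y_1, one for y_2); the handout does not spell out the exact formula, so treat that as an assumption. The function name `average_mae` and the numbers are illustrative.

```python
import numpy as np

def average_mae(y_true, y_pred):
    """Return (mean of per-target MAEs, per-target MAEs).

    Both inputs have shape (n_rows, 2): column 0 is y_1, column 1 is y_2.
    Averaging the two targets equally is an assumption, not a stated rule.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    per_target = np.abs(y_true - y_pred).mean(axis=0)
    return per_target.mean(), per_target

# Illustrative values: two rows of true responses and constant predictions.
y_true = np.array([[33.4, 107.4], [35.0, 100.0]])
y_pred = np.array([[42.0, 82.0], [42.0, 82.0]])

avg, per_target = average_mae(y_true, y_pred)
print("per-target MAE:", per_target)
print("average MAE:", avg)
```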


### Grading Breakdown

Evaluation of your project will be determined by the following breakdown:

#### 1. Relative performance at first checkpoint, Feb 17 (5 pts)
     - Top 25% of scorers on test set will receive 5/5 points
     - Next 50% of scorers will receive 4.5/5
     - Bottom 25% of scorers will receive 4/5, conditional on meeting a minimum threshold average MAE of 10 on
         the private test set
     - Failure to beat an avg MAE of 10 on the private test set will receive 2.5/5
     - No submission = 0/5

#### 2. Relative performance at second checkpoint, Mar 2 (5 pts)

     - Top 25% of scorers on test set will receive 5/5 points
     - Next 50% of scorers will receive 4/5
     - Bottom 25% of scorers will receive 3/5, conditional on meeting a minimum threshold average MAE of 10 on
         the private test set
     - Failure to improve upon your MAE from the first checkpoint on the private test set will receive 2.5/5
     - No submission = 0/5
     
#### 3. Relative performance at final due date, Mar 11 (10 pts)

     - Top 25% of scorers on test set will receive 10/10 points
     - Next 50% of scorers will receive 8/10
     - Bottom 25% of scorers will receive 6/10
         
#### 4. Write up of your model development process in a blog (NOT hosted online -- submit a .html or .pdf over Canvas). (60 pts)

    Components of the blog:
    - Sections separated by checkpoint, describing the models you had explored at that point /5
    - Visualization & EDA -- should at least be in the first section (or in multiple sections if you continue doing EDA after the first checkpoint) /10
    - Description of the feature engineering you used /10
    - Model comparison & validation across candidate models + visualization of results /20
    - Summary of why you think your winning model performed the best /10
    - Description of division of work across partners /5

    Feel free to include other details in your blog like navigating computational issues, bug fixing, brainstorming of solutions/models, etc. 


#### 5. Peer evaluation of partners' teamwork, attendance in class, and contributions to model development blogs (20 pts)

#### Extra. The top 5 teams will have the opportunity to present their models and development process on the final day of class. This presentation can add up to 1 point of extra credit to your overall course grade.

### First checkpoint 
Your first task should be to perform EDA and construct and tune a baseline model which has an average MAE of 10 or less. The due date for this checkpoint will be Feb 17. 
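A simple reference point for the MAE threshold is a constant prediction evaluated on held-out patients. The sketch below is one possible setup, not a required one: it holds out whole `sub_id` groups (so rows from one patient never appear on both sides of the split) and predicts the training median of each response, which is the constant that minimizes MAE on the training rows. The data here is synthetic, so the printed numbers say nothing about the real competition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the training data: 20 patients, 30 rows each.
n_sub, n_rows = 20, 30
toy = pd.DataFrame({
    "sub_id": np.repeat(np.arange(n_sub), n_rows),
    "y_1": rng.normal(40, 8, n_sub * n_rows),
    "y_2": rng.normal(90, 12, n_sub * n_rows),
})

# Hold out 20% of patients (not 20% of rows) for validation.
sub_ids = toy["sub_id"].unique()
val_ids = rng.choice(sub_ids, size=len(sub_ids) // 5, replace=False)
is_val = toy["sub_id"].isin(val_ids)
train, val = toy[~is_val], toy[is_val]

# Constant baseline: the training median of each response.
targets = ["y_1", "y_2"]
const_pred = train[targets].median()

# Per-target MAE on the held-out patients, then their mean.
val_mae = (val[targets] - const_pred).abs().mean()
print(val_mae)
print("average MAE:", val_mae.mean())
```

Any candidate model can be scored on the same split, which makes it easy to see how much it improves on the constant baseline.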

In [3]:
df.head(10)

Unnamed: 0,obs,sub_id,time,num_0,num_1,num_2,cat_0,cat_1,cat_2,cat_3,cat_4,t_0,t_1,t_2,t_3,t_4,y_1,y_2
0,0,0,2068-09-19 23:34:11,1.38,49,7,1,3,1,0,1,105.5,95.0,67.4,36.6,23.2,33.4,107.4
1,0,0,2068-09-19 23:35:11,1.38,49,7,1,3,1,0,1,104.4,95.0,66.4,37.8,22.7,33.4,107.4
2,0,0,2068-09-19 23:36:11,1.38,49,7,1,3,1,0,1,104.0,95.0,65.2,37.0,22.1,33.4,107.4
3,0,0,2068-09-19 23:37:11,1.38,49,7,1,3,1,0,1,102.8,95.0,63.4,35.9,20.7,33.4,107.4
4,0,0,2068-09-19 23:38:11,1.38,49,7,1,3,1,0,1,101.3,95.1,59.1,34.5,18.1,33.4,107.4
5,0,0,2068-09-19 23:39:11,1.38,49,7,1,3,1,0,1,99.7,94.8,56.8,33.4,16.5,33.4,107.4
6,0,0,2068-09-19 23:40:11,1.38,49,7,1,3,1,0,1,101.2,94.7,60.8,35.4,19.5,33.4,107.4
7,0,0,2068-09-19 23:41:11,1.38,49,7,1,3,1,0,1,99.9,94.2,64.3,36.2,20.7,33.4,107.4
8,0,0,2068-09-19 23:42:11,1.38,49,7,1,3,1,0,1,107.2,94.8,67.7,38.3,24.2,33.4,107.4
9,0,0,2068-09-19 23:43:11,1.38,49,7,1,3,1,0,1,109.0,95.4,68.0,35.5,22.9,33.4,107.4
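Because the rows for a patient are spaced one minute apart in the preview above, per-patient lag and rolling features are a natural starting point for feature engineering. The sketch below is a minimal example under that assumption, using three `t_0` values from the preview plus two made-up rows for a second patient; the feature names (`t_0_lag1`, `t_0_diff1`, `t_0_roll3_mean`) are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "sub_id": [0, 0, 0, 1, 1],
    "time": [
        "2068-09-19 23:34:11", "2068-09-19 23:35:11", "2068-09-19 23:36:11",
        "2134-04-01 22:23:14", "2134-04-01 22:24:14",
    ],
    "t_0": [105.5, 104.4, 104.0, 105.4, 105.4],
})

# Parse timestamps and order rows within each patient before taking lags.
toy["time"] = pd.to_datetime(toy["time"])
toy = toy.sort_values(["sub_id", "time"]).reset_index(drop=True)

# Grouping by sub_id keeps lags from leaking across patients.
g = toy.groupby("sub_id")["t_0"]
toy["t_0_lag1"] = g.shift(1)    # previous reading for the same patient
toy["t_0_diff1"] = g.diff()     # change since the previous reading
toy["t_0_roll3_mean"] = g.transform(
    lambda s: s.rolling(3, min_periods=1).mean()  # mean of up to the last 3 readings
)

print(toy)
```

The first row of each patient has no previous reading, so its lag and difference are missing; a model or imputation step has to handle those values.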


## Test set data predictors, without y_1, y_2

In [4]:
ex_test = pd.read_csv("./test_no_outcome.csv")

ex_test.head(50)

Unnamed: 0,obs,sub_id,time,num_0,num_1,num_2,cat_0,cat_1,cat_2,cat_3,cat_4,t_0,t_1,t_2,t_3,t_4
0,18,1,2134-04-01 22:23:14,-1.0,38,1,1,1,0,0,0,105.4,99.8,50.7,61.4,36.8
1,18,1,2134-04-01 22:24:14,-1.0,38,1,1,1,0,0,0,105.4,99.4,49.4,61.1,36.2
2,18,1,2134-04-01 22:25:14,-1.0,38,1,1,1,0,0,0,104.6,99.0,49.7,61.4,36.6
3,18,1,2134-04-01 22:26:14,-1.0,38,1,1,1,0,0,0,104.5,99.6,51.7,61.8,37.2
4,18,1,2134-04-01 22:27:14,-1.0,38,1,1,1,0,0,0,104.6,99.5,52.5,61.9,37.5
5,18,1,2134-04-01 22:28:14,-1.0,38,1,1,1,0,0,0,102.8,98.9,60.8,66.0,42.7
6,18,1,2134-04-01 22:29:14,-1.0,38,1,1,1,0,0,0,103.3,100.0,58.2,63.9,40.1
7,18,1,2134-04-01 22:30:14,-1.0,38,1,1,1,0,0,0,104.5,100.0,63.2,67.0,43.9
8,18,1,2134-04-01 22:31:14,-1.0,38,1,1,1,0,0,0,108.1,99.8,69.7,72.0,49.3
9,18,1,2134-04-01 22:32:14,-1.0,38,1,1,1,0,0,0,107.7,98.8,65.6,69.7,46.4


## Example Submission Format

In [5]:
ex_submission = pd.read_csv("./sample_submission.csv")

ex_submission.head(50)

Unnamed: 0,obs,y_1,y_2
0,18,42.0,82.0
1,19,42.0,82.0
2,20,42.0,82.0
3,21,42.0,82.0
4,22,42.0,82.0
5,23,42.0,82.0
6,24,42.0,82.0
7,25,42.0,82.0
8,26,42.0,82.0
9,27,42.0,82.0
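A submission can be assembled by reusing the `obs` column of the sample file and filling in your own predictions. The sketch below works on a toy copy of the sample and a hypothetical prediction array `preds`; whether the unnamed leading index column is required by the scorer is not stated here, so the sketch simply reproduces the column layout shown above, and it writes to an in-memory buffer rather than a real file.

```python
import io
import numpy as np
import pandas as pd

# Toy copy of the first rows of the sample submission shown above.
sample = pd.DataFrame({
    "obs": [18, 19, 20],
    "y_1": [42.0, 42.0, 42.0],
    "y_2": [82.0, 82.0, 82.0],
})

# Hypothetical model output: one row per obs, columns ordered (y_1, y_2).
preds = np.array([[40.1, 85.0], [41.3, 84.2], [39.8, 86.7]])

# Reuse the sample's obs values so row count and order match the expected format.
submission = sample[["obs"]].copy()
submission["y_1"] = preds[:, 0]
submission["y_2"] = preds[:, 1]

# Basic sanity checks before writing.
assert len(submission) == len(sample)
assert submission[["y_1", "y_2"]].notna().all().all()

# The default index=True writes a leading unnamed column, as in the sample.
buf = io.StringIO()
submission.to_csv(buf)
round_trip = pd.read_csv(io.StringIO(buf.getvalue()))
print(round_trip.columns.tolist())
```

To write an actual file, pass a path to `to_csv` instead of the buffer.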
