# Assignment 2

**Course:** 4DT907 - Project in Data Intensive Systems  
**Date:** 2026-02-04  
**Team:** Samuel, Nasser, Emil, Jesper

## 1. Iterations Over ML Process Steps


### 1.1 ML Pipeline Overview

The ML pipeline follows an iterative development approach with the following key steps:

1. **Data Collection & Preparation**
2. **Feature Engineering & Selection**
3. **Model Training & Experimentation**
4. **Model Evaluation & Selection**
5. **Model Deployment & Monitoring**

### 1.2 Data Preprocessing

**Dataset**: `AimoScore_WeakLink_big_scores_A2.xls` converted to CSV format.

**Preprocessing Steps** (from `a2.ipynb`):

1. **Data Loading**
   - Load dataset from `src/data/AimoScore_WeakLink_big_scores_A2.csv`
   - Initial exploration of features and target variable

2. **Data Cleaning**
   - Drop `EstimatedScore` column
   - Remove duplicate records using `drop_duplicates()`
   - Handle missing values if present

3. **Exploratory Data Analysis**
   - Statistical summary of features
   - Correlation analysis with heatmap visualization
   - Distribution analysis of target variable

4. **Train-Test Split**
   - Split data using `train_test_split()` from scikit-learn
   - Random seed set for reproducibility
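
The cleaning and splitting steps above can be sketched as a single function. This is a minimal sketch, not the code from `a2.ipynb`: the function name `prepare_data`, the `target` argument, and dropping rows with missing values (rather than imputing them) are assumptions made for illustration. The defaults `seed=42` and `test_size=0.2` mirror the logged parameters in Section 1.4.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def prepare_data(df: pd.DataFrame, target: str, seed: int = 42,
                 test_size: float = 0.2):
    """Clean the raw frame and return X_train, X_test, y_train, y_test."""
    # Drop the EstimatedScore column (if present), then remove exact duplicates.
    cleaned = df.drop(columns=["EstimatedScore"], errors="ignore").drop_duplicates()
    # Assumption: rows with missing values are dropped; the notebook may impute.
    cleaned = cleaned.dropna()

    X = cleaned.drop(columns=[target])
    y = cleaned[target]
    # A fixed random_state makes the split reproducible across runs.
    return train_test_split(X, y, test_size=test_size, random_state=seed)
```

A call would look like `prepare_data(pd.read_csv("src/data/AimoScore_WeakLink_big_scores_A2.csv"), target="<target column>")`, where the target column name has to be taken from the dataset.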

### 1.3 Model Training Iterations

The project implements **Linear Regression** models with various configurations:

#### Iteration 1: Baseline Model
- **Model**: Simple Linear Regression
- **Features**: All features except EstimatedScore
- **Purpose**: Establish baseline performance
- **Result**: R2_Mean: 0.641

#### Iteration 2: Feature Selection
- **Approach**: Remove low-correlation features
- **Method**: Correlation analysis and statistical tests
- **Goal**: Reduce overfitting and improve generalization
- **Result**: R2_Mean: 0.543
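
One way to remove low-correlation features is to keep only the columns whose absolute Pearson correlation with the target reaches a cutoff. The sketch below illustrates that idea in plain Python; the helper names, the cutoff value, and the toy data are hypothetical and not taken from the notebook.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def select_features(columns, target, min_abs_corr=0.1):
    """Return the names of columns whose |correlation| with the target reaches the cutoff."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= min_abs_corr]


# Toy data for illustration only.
columns = {
    "strong": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [2.0, 1.0, 3.0, 1.0, 2.0],
    "constant": [7.0] * 5,
}
target = [2.1, 3.9, 6.2, 8.1, 9.8]
print(select_features(columns, target, min_abs_corr=0.5))
```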

#### Iteration 3: Outlier Handling
- **Detection**: Leverage analysis, Cook's distance
- **Treatment**: Remove or cap outliers; a threshold multiplier of 3 gave the best result
- **Impact**: Improved model stability
- **Result**: R2_Mean: **0.706**
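
Cook's distance combines each sample's residual with its leverage. The sketch below computes it for an ordinary least squares fit with NumPy and drops samples above a cutoff. The cutoff rule used here (`threshold_multiplier` times the mean distance) is an assumption; the report only states that a multiplier of 3 worked best, not what it multiplies. Function names are hypothetical.

```python
import numpy as np


def cooks_distance(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Cook's distance of each sample for an OLS fit with intercept."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])      # design matrix with intercept
    p = A.shape[1]                            # number of fitted coefficients
    G = np.linalg.pinv(A.T @ A)
    # Leverage h_ii = a_i^T (A^T A)^-1 a_i, without building the full hat matrix.
    leverage = np.einsum("ij,jk,ik->i", A, G, A)
    residuals = y - A @ (G @ A.T @ y)
    mse = residuals @ residuals / (n - p)
    return (residuals ** 2 / (p * mse)) * leverage / (1.0 - leverage) ** 2


def remove_outliers(X, y, threshold_multiplier=3.0):
    """Drop samples whose distance exceeds multiplier * mean distance (assumed rule)."""
    d = cooks_distance(X, y)
    keep = d <= threshold_multiplier * d.mean()
    return X[keep], y[keep], keep
```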

#### Iteration 4: Weighted Regression
- **Approach**: Apply weights to training samples
- **Result**: R2_Mean: 0.544

### 1.4 MLflow & DagsHub Integration

**Experiment Tracking Setup**:

```python
import dagshub
import mlflow
import scripts.ml_utils as MLUtils

# Initialize DagsHub and MLflow
dagshub.init(repo_owner="SamuelFredricBerg", repo_name="4dt907", mlflow=True)
utils = MLUtils.MLUtils("Project_Model")
```

**Logged Metrics** (per experiment run):
- **R² Score, mean** (`R2_Mean`): Primary metric for model comparison
- **R² Score, standard deviation** (`R2_Std`)
- **Mean Squared Error, mean** (`MSE_Mean`)
- **Mean Squared Error, standard deviation** (`MSE_Std`)
- **Mean Absolute Error, mean** (`MAE_Mean`)
- **Mean Absolute Error, standard deviation** (`MAE_Std`)
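
The sketch below shows how such paired metrics could be derived, assuming the `_Mean` and `_Std` suffixes denote the mean and standard deviation over the cross-validation folds. The function name `summarize_folds`, the use of the population standard deviation, and the fold scores are illustrative assumptions, not values from the experiments.

```python
from statistics import fmean, pstdev


def summarize_folds(fold_metrics):
    """Collapse per-fold metric dicts into <name>_Mean / <name>_Std entries."""
    summary = {}
    for name in fold_metrics[0]:
        values = [fold[name] for fold in fold_metrics]
        summary[f"{name}_Mean"] = fmean(values)
        # Population standard deviation (the same convention as numpy.std's default).
        summary[f"{name}_Std"] = pstdev(values)
    return summary


# Made-up scores for three folds, for illustration only.
folds = [
    {"R2": 0.70, "MSE": 0.020, "MAE": 0.10},
    {"R2": 0.72, "MSE": 0.018, "MAE": 0.11},
    {"R2": 0.68, "MSE": 0.022, "MAE": 0.12},
]
print(summarize_folds(folds))
```

The resulting dictionary could then be passed to `mlflow.log_metrics()` inside an active run.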

**Logged Parameters**:
```python
config = {
    "data_split_seed": 42,
    "test_size": 0.2,
    "n_folds": 5,
    "shuffle": True,
    "variant": "Remove outliers/Removing dupes",
    "threshold_multiplier": 3,
}
```

**Artifacts**:
- Correlation heatmaps

### 1.5 Automated Model Selection & Promotion

**Challenger System** (`ml_utils.py`):

The `auto_check_challenger()` function automatically compares new models against the current `@dev` model:

```python
# Register the model only if it beats the current @dev model on R2_Mean
if utils.auto_check_challenger(run.info.run_id, metric_name="R2_Mean"):
    mlflow.sklearn.log_model(model, "model", registered_model_name="Project_Model")
    # Point the @dev alias at the newly registered version
    latest_v = utils.client.get_latest_versions("Project_Model")[0].version
    utils.client.set_registered_model_alias("Project_Model", "dev", latest_v)
    print("New model beat current @dev, uploading to DagsHub")
else:
    print("Did not beat current @dev, model not uploaded to DagsHub")
```
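
The internals of `auto_check_challenger()` are not shown here. As a rough sketch of the comparison it presumably performs, the hypothetical helper below returns `True` when the challenger's metric strictly improves on the current one, and treats a missing `@dev` model as an automatic win. Both behaviours are assumptions.

```python
from typing import Optional


def beats_current(challenger: float, current: Optional[float],
                  higher_is_better: bool = True) -> bool:
    """Return True when the challenger metric strictly improves on the current one."""
    if current is None:
        # Assumption: with no model behind the alias yet, any challenger wins.
        return True
    return challenger > current if higher_is_better else challenger < current


# R2_Mean values from iterations 3 and 1 above.
print(beats_current(0.706, 0.641))
```

For error metrics such as `MSE_Mean`, the same helper would be called with `higher_is_better=False`.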

**Model Aliases**:
- **`@dev`**: Latest development model (highest R² on dev set)
- **`@prod`**: Production model (promoted from @dev after validation)
- **`@backup`**: Previous production model (for rollback)

**Promotion Workflow**:
1. New model trained and logged to MLflow
2. Automatic comparison with `@dev`
3. If better: becomes new `@dev`, old `@dev` → `@backup`
4. Manual promotion: `@dev` → `@prod` via `promote_dev_to_prod()`
5. Rollback available: `@backup` → `@prod` via `revert_backup_to_prod()`
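
The alias movements in steps 3 to 5 can be illustrated with an in-memory stand-in for the model registry. The class `AliasRegistry` and the method `set_dev` are hypothetical; the other two method names are borrowed from the workflow above, but the real functions in `ml_utils.py` operate on the MLflow registry and may differ in detail.

```python
class AliasRegistry:
    """In-memory stand-in for registered-model aliases (alias -> version)."""

    def __init__(self):
        self.aliases = {}

    def set_dev(self, version):
        # Step 3: the previous @dev moves to @backup, the new version becomes @dev.
        if "dev" in self.aliases:
            self.aliases["backup"] = self.aliases["dev"]
        self.aliases["dev"] = version

    def promote_dev_to_prod(self):
        # Step 4: manual promotion points @prod at the current @dev version.
        self.aliases["prod"] = self.aliases["dev"]

    def revert_backup_to_prod(self):
        # Step 5: rollback points @prod at the @backup version.
        self.aliases["prod"] = self.aliases["backup"]


registry = AliasRegistry()
registry.set_dev("1")
registry.promote_dev_to_prod()
registry.set_dev("2")          # version 1 becomes @backup
registry.promote_dev_to_prod()
registry.revert_backup_to_prod()
print(registry.aliases)
```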

### 1.6 Model Evaluation Results

**Champion Model Selection**:

The dev model is selected based on:
- Highest R² score on test set

The prod (champion) model is selected based on:
- Highest R² score on test set
- A manual check that the model with the highest R² is actually the best model

**Key Performance Indicators**:
- **R² Score**: Measures variance explained by the model
- **MAE**: Mean absolute error

**Model Validation**:
- Cross-validation (K-fold)
- Train/Test split validation

## 2. Deployed Client-Server System

This section documents the architecture, design, implementation, and usage of the deployed prediction system.

### 2.1 System Architecture

The system follows a **three-tier architecture**:

```text
┌─────────────┐      HTTP/REST      ┌─────────────┐      ┌─────────────┐
│   React     │ ◄─────────────────► │   FastAPI   │ ◄────┤   DagsHub   │
│  Frontend   │                     │   Backend   │      │             │
│             │                     │             │      │             │
└─────────────┘                     └─────────────┘      └─────────────┘
```

Service ports and the DagsHub model URIs are configured via environment variables.

**Container Orchestration** (Docker Compose):
- **Backend Container**: `4dt907-backend`
- **Frontend Container**: `4dt907-frontend`
- **Network**: `app-network` (bridge driver)
- **Volumes**: Mounted for hot-reload during development

### 2.2 Design Principles

**1. Separation of Concerns**
- Frontend: UI/UX and user interaction
- Backend: Business logic and ML model serving
- Data Tier: Model storage and versioning

**2. Stateless API Design**
- RESTful endpoints
- No server-side session management
- Scalable horizontally

**3. Model Caching Strategy**
- Thread-safe caching with locks
- Cache key: model URI or model name
- Reduces model loading latency

**4. Versioned APIs**
- `/api/v1/`: Current stable API
- `/api/v2/`: Future/experimental endpoints
- Backward compatibility maintained

**5. Environment-Driven Configuration**
- Port configuration via environment variables
- MLflow URI from `.env` file
- Model URIs configurable per environment

### 2.3 Implementation Details

#### 2.3.1 Backend Implementation (FastAPI)

**File Structure**:
```text
src/backend/
├── app/
│   ├── api/
│   │   ├── health.py              # Health check endpoint
│   │   ├── v1/
│   │   │   ├── endpoints/
│   │   │   │   ├── predict.py     # Prediction endpoints
│   │   │   │   └── model_info.py  # Model metadata endpoints
│   │   │   └── router.py          # V1 route aggregation
│   │   └── v2/                    # V2 API scaffold
│   ├── services/
│   │   └── model_service.py       # Core ML model service
│   ├── schemas/
│   │   └── prediction.py          # Pydantic models
│   └── main.py                    # FastAPI app initialization
├── tests/                         # Comprehensive test suite
├── requirements.txt               # Python dependencies
└── Dockerfile                     # Backend container image
```

**Key API Endpoints**:

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check, returns service status |
| `/api/v1/predict/champion` | POST | Predict using production model |
| `/api/v1/predict/latest` | POST | Predict using latest dev model |
| `/api/v1/model-info/champion` | GET | Get champion model metadata |
| `/api/v1/model-info/latest` | GET | Get latest model metadata |

**Model Service Features** (`model_service.py`):
- **Dynamic Model Loading**: Load models from MLflow registry
- **Alias Resolution**: Support for `@dev`, `@prod`, `@backup` aliases
- **Fallback Mechanism**: Handle registry limitations gracefully
- **Thread-Safe Caching**: Prevent race conditions
- **Feature Validation**: Check input feature count matches model expectations
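
The caching and validation features can be sketched as follows. This is not the code from `model_service.py`: the class `ModelCache`, the function `validate_features`, and the choice to hold one lock for the whole load are assumptions. The loader is injected, so any callable that maps a model URI to a model (for example `mlflow.pyfunc.load_model`) would fit.

```python
import threading
from typing import Any, Callable, Dict, Sequence


class ModelCache:
    """Load each model at most once, even under concurrent requests."""

    def __init__(self, loader: Callable[[str], Any]):
        self._loader = loader
        self._models: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def get(self, model_uri: str) -> Any:
        # Holding the lock while loading makes concurrent callers wait
        # instead of triggering duplicate loads of the same model.
        with self._lock:
            if model_uri not in self._models:
                self._models[model_uri] = self._loader(model_uri)
            return self._models[model_uri]


def validate_features(features: Sequence[float], expected_count: int) -> None:
    """Raise ValueError when the feature count does not match the model."""
    if len(features) != expected_count:
        raise ValueError(
            f"Expected {expected_count} features, got {len(features)}"
        )
```

A single global lock is the simplest correct option; it serializes loads of different models, which is acceptable when only a handful of aliases are served.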

**Request/Response Schema**:

```python
# Request
{
  "features": [1.0, 2.0, 3.0, ...]  # List of floats
}

# Response
{
  "prediction": 75.3,               # Predicted score
  "model_uri": "models:/Project_Model@prod"  # Model used
}
```

#### 2.3.2 Frontend Implementation (React + Vite)

**File Structure**:
```text
src/frontend/
├── src/
│   ├── components/
│   │   └── Predict.jsx       # Main prediction UI component
│   ├── App.jsx                # Root application component
│   ├── main.jsx               # Application entry point
│   └── index.css              # Global styles + TailwindCSS
├── public/                    # Static assets
├── nginx.conf.template        # Nginx reverse proxy config
├── vite.config.js             # Vite dev server + proxy
├── package.json               # Node dependencies
└── Dockerfile                 # Multi-stage frontend build
```

**UI Components**:

1. **Model Selection Dropdown**
   - Choose between "Champion" and "Latest" models
   - Dynamic endpoint selection based on variant

2. **Feature Input Field**
   - Comma-separated number input
   - Client-side validation (parseFeatures function)
   - Clear error messages for invalid input

3. **Prediction Button**
   - Loading state during API call
   - Disabled state to prevent double-submission

4. **Result Display**
   - Shows prediction value
   - Displays model URI for traceability
   - Error messages with details

**Styling**:
- TailwindCSS utility classes
- Custom gradient background (`.bg-aurora`)
- Glassmorphism effect with backdrop blur
- Responsive design (mobile-friendly)

#### 2.3.3 Docker & Deployment

**Docker Compose Configuration** (`docker-compose.yml`):

```yaml
services:
  backend:
    build:
      context: ./backend
      args:
        - BACKEND_PORT=${BACKEND_PORT}
    container_name: 4dt907-backend
    environment:
      - BACKEND_PORT=${BACKEND_PORT}
      # Note: MLFLOW_TRACKING_URI should be added here
    volumes:
      - ./backend:/backend_app
      - /opt/.venv                  # Cache Python dependencies
    networks:
      - app-network

  frontend:
    build:
      context: ./frontend
      args:
        - BACKEND_PORT=${BACKEND_PORT}
        - FRONTEND_PORT=${FRONTEND_PORT}
    container_name: 4dt907-frontend
    ports:
      - "${FRONTEND_PORT}:${FRONTEND_PORT}"
    depends_on:
      - backend
    networks:
      - app-network
```

**Backend Dockerfile**:
- Base image: `python:3.12-slim`
- Install dependencies from `requirements.txt`
- Expose configured port
- Run with Uvicorn ASGI server

**Frontend Dockerfile** (Multi-stage):
- **Stage 1**: Build React app with Vite
- **Stage 2**: Serve with Nginx
- Dynamic port configuration via environment substitution
- Reverse proxy to backend

**Environment Variables**:
- `BACKEND_PORT`: Backend service port (default: 8080)
- `FRONTEND_PORT`: Frontend service port (default: 3030)
- `MLFLOW_TRACKING_URI`: MLflow server URL
- `MODEL_URI_PROD`, `MODEL_URI_DEV`, `MODEL_URI_BACKUP`: Direct model URIs
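
On the Python side, these variables could be read once at startup. The sketch below is a hypothetical loader (the names `Settings` and `load_settings` are not from the project) that applies the documented port defaults and leaves the other values unset when they are missing.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    backend_port: int
    frontend_port: int
    mlflow_tracking_uri: Optional[str]
    model_uri_prod: Optional[str]
    model_uri_dev: Optional[str]
    model_uri_backup: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment, applying the documented port defaults."""
    return Settings(
        backend_port=int(env.get("BACKEND_PORT", "8080")),
        frontend_port=int(env.get("FRONTEND_PORT", "3030")),
        mlflow_tracking_uri=env.get("MLFLOW_TRACKING_URI"),
        model_uri_prod=env.get("MODEL_URI_PROD"),
        model_uri_dev=env.get("MODEL_URI_DEV"),
        model_uri_backup=env.get("MODEL_URI_BACKUP"),
    )
```

Passing the environment as an argument keeps the function easy to test with a plain dictionary.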

### 2.4 System Usage

#### 2.4.1 Local Development Setup

**Prerequisites**:
- Docker and Docker Compose
- Python 3.12.x
- Node.js 22.x (LTS)

**Starting the System**:

```bash
# Navigate to src directory
cd src

# Create .env file with required variables
cat > .env << EOF
BACKEND_PORT=<port>
FRONTEND_PORT=<port>
MLFLOW_TRACKING_URI=https://dagshub.com/<repository-owner>/<repository-name>.mlflow
MODEL_URI_PROD=models:/<model_name>@<alias>
MODEL_URI_DEV=models:/<model_name>@<alias>
MODEL_URI_BACKUP=models:/<model_name>@<alias>
EOF

# Start services
docker compose up -d

# Check status
docker compose ps
```

**Accessing the System**:
- Frontend UI: `http://localhost:3030`

#### 2.4.2 Making Predictions

**Via Web UI**:
1. Open `http://localhost:3030` in browser
2. Select model variant (Champion or Latest)
3. Enter feature values (comma-separated)
4. Click "Predict" button
5. View prediction result and model URI

**Via API (cURL)**:

```bash
# Predict with champion model
curl -X POST http://localhost:8080/api/v1/predict/champion \
  -H "Content-Type: application/json" \
  -d '{"features": [1.0, 1.0]}'

# Response:
# {
#   "prediction": 75.3,
#   "model_uri": "models:/Project_Model@prod"
# }
```

**Via API (Python)**:

```python
import requests

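response = requests.post(
    "http://localhost:8080/api/v1/predict/latest",
    json={"features": [1.0, 1.0]},
    timeout=10,  # avoid hanging if the backend is unreachable
)
response.raise_for_status()  # surface HTTP errors instead of parsing an error body
result = response.json()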
print(f"Prediction: {result['prediction']}")
print(f"Model: {result['model_uri']}")
```

## 3. DevOps Process

This section documents the CI/CD pipelines, testing strategy, and deployment automation.

### 3.1 CI/CD Pipeline Architecture

The project uses **GitHub Actions** for continuous integration and deployment (deployment is work-in-progress).


### 3.2 Main CI/CD Workflow

**File**: `.github/workflows/main.yml`

**Triggers**:
- Push to `main` branch
- Pull requests targeting `main` branch

**Jobs**:

#### Job 1: Backend Tests
```yaml
backend-test:
  runs-on: ubuntu-latest
  steps:
    - Checkout code
    - Setup Python 3.12 with pip cache
    - Install dependencies (requirements.txt + test tools)
    - Run pytest with coverage
```

**Test Coverage**:
- Unit tests for model service
- API endpoint tests
- Schema validation tests
- Service integration tests

#### Job 2: Frontend Tests
```yaml
frontend-test:
  runs-on: ubuntu-latest
  steps:
    - Checkout code
    - Setup Node.js 22 with npm cache
    - Install dependencies (npm ci)
    - Run ESLint
```

**Quality Checks**:
- Code style (ESLint rules)
- Import validation
- Build verification

#### Job 3: Docker Build & Integration
```yaml
docker-build:
  needs: [backend-test, frontend-test]
  runs-on: ubuntu-latest
  steps:
    - Build multi-container system
    - Wait for services to be ready (health checks)
    - Test frontend accessibility
    - Test backend API (optional)
    - Cleanup containers
```

**Integration Tests**:
- Container orchestration
- Network connectivity
- Service discovery
- Health endpoint verification
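
The "wait for services to be ready" step amounts to polling a health check until it succeeds or a retry budget runs out. The workflow itself most likely does this in shell; the sketch below shows the same logic in Python with hypothetical helper names, using only the standard library.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET request to the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers urllib.error.URLError, refused connections, and timeouts.
        return False


def wait_until_ready(check: Callable[[], bool], attempts: int = 30,
                     delay: float = 1.0) -> bool:
    """Call check() until it succeeds or the attempts are used up."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # no sleep after the final failed attempt
    return False
```

For example, `wait_until_ready(lambda: http_ok("http://localhost:8080/health"))` would poll the backend health endpoint, assuming the default backend port is reachable from where the script runs.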

### 3.3 Testing Strategy

#### 3.3.1 Backend Testing

**Test Structure** (`src/backend/tests/`):
```text
tests/
├── api/
│   ├── v1/endpoints/
│   │   ├── test_predict.py       # Prediction endpoint tests
│   │   └── test_model_info.py    # Model info endpoint tests
│   ├── test_health.py            # Health check tests
│   └── test_deps.py              # Dependency injection tests
├── services/
│   └── test_model_service.py     # Model service unit tests
├── schemas/
│   └── test_prediction.py        # Pydantic schema tests
├── conftest.py                   # Pytest fixtures
└── test_main.py                  # Application-level tests
```

**Key Test Cases**:
- Model loading and caching
- Feature validation (correct vs. incorrect feature count)
- Prediction accuracy
- Error handling (missing models, invalid input)
- Thread safety of cache

**Testing Tools**:
- `pytest`: Test framework
- `pytest-mock`: Mocking support
- `FastAPI TestClient`: API testing

#### 3.3.2 Frontend Testing

**Linting & Code Quality**:
- ESLint configuration (`.eslintrc.cjs`, `eslint.config.js`)
- Style enforcement
- Import validation
- React best practices

**Note**: Automated frontend tests beyond linting are still to be implemented.

### 3.4 Deployment Process

**Development Environment**:
```bash
cd src
docker compose up -d
```

### 3.5 Monitoring & Observability

**Health Checks**:
- Backend: `GET /health`
- Frontend: Root URL accessibility
- Container health: Docker health check commands

**Logging**:
- Application logs: Python logging module
- Access logs: Uvicorn and Nginx
- Error tracking: Exception logging in endpoints

**MLflow Tracking**:
- All experiments logged to DagsHub
- Model performance metrics tracked
- Artifact storage for reproducibility
- Version history maintained

## 4. Team Member Contributions


| Team Member | Contributions               |
|-------------|-----------------------------|
| Samuel      | DevOps and repository setup |
| Nasser      | Frontend and backend        |
| Emil        | Testing                     |
| Jesper      | ML                          |