In [3]:
import os, sys, pathlib
cwd = pathlib.Path.cwd()
if not (cwd / 'scripts').exists():
    if cwd.name == 'notebooks' and (cwd.parent / 'scripts').exists():
        os.chdir(cwd.parent)
if os.getcwd() not in sys.path:
    sys.path.insert(0, os.getcwd())
print('Working dir:', os.getcwd())


Working dir: /home/tanmay08/Desktop/PC/Project/Mega-Project-1


# Centralized baseline

Train a simple logistic regression on the full, unsharded training split and record AUC and precision.

In [4]:
!python -m scripts.preprocess_data --input data/creditcard.csv --clients 3 --out-root data --seed 42 --test-size 0.2

Saved processed CSVs to data/processed and per-client splits to data/clients


In [5]:
!python -m scripts.train_centralized --train data/processed/train.csv --test data/processed/test.csv --label Class --epochs 2 --lr 0.1 --batch-size 512 --reg 0.0

Epoch 1/2 - avg_loss=0.094335
Epoch 2/2 - avg_loss=0.019149
Saved model to outputs/models/logreg.csv
Saved metrics to outputs/metrics/centralized_metrics.csv


# Project flow and concepts (ML + Cryptography)


This project implements federated logistic regression with Secure Aggregation (Bonawitz et al.), so the server never sees any individual client's unmasked update or raw data. Below is the end‑to‑end flow and where each concept is used in the repo.

## 1) Architecture overview


- Data and baseline


  - `scripts/preprocess_data.py` splits the dataset (e.g., `data/creditcard.csv`) into train/test and into client shards.

  - `scripts/train_centralized.py` trains a centralized baseline logistic regression and logs AUC and Precision.

- Federated training


  - Server: `server/server.py` (Flask) orchestrates rounds, collects masked updates, aggregates, evaluates, and logs metrics.

  - Clients: `clients/*/` each run a local training loop, mask their update via Secure Aggregation, and POST to the server.

- Secure Aggregation crypto


  - `server/secure_aggregation/bonawitz.py` implements pairwise masking following Bonawitz et al.

  - `server/secure_aggregation/hkdf_prg.py` provides HKDF-SHA256 and HMAC-DRBG PRG.

- Encoding for masks


  - `scripts/encoding.py` handles clip→fixed-point encode→mod q arithmetic→decode.

## 2) Data and preprocessing


- Input: `data/creditcard.csv` (fraud detection). Class imbalance is significant.

- Preprocessing (`scripts/preprocess_data.py`):

  - Splits into train/test and K client shards.

  - Standardizes/normalizes as needed (project-dependent) and writes CSVs under `data/processed/`.

- Why it matters: federated shards simulate distributed data ownership across clients.
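
The sharding step can be sketched as follows. This is a minimal illustration, not the logic of `scripts/preprocess_data.py`: the function name `stratified_shards` is hypothetical, and stratifying by class is an assumption made here because the fraud class is rare and a plain random split could leave a client with almost no positives.

```python
import random
from collections import defaultdict

def stratified_shards(labels, num_clients, seed=42):
    """Return num_clients lists of row indices with similar class ratios.

    Hypothetical helper for illustration; the real script may shard differently.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    shards = [[] for _ in range(num_clients)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so every client gets a similar share of each class.
        for pos, idx in enumerate(indices):
            shards[pos % num_clients].append(idx)
    return shards

labels = [0] * 97 + [1] * 3          # toy imbalanced labels, not real data
shards = stratified_shards(labels, num_clients=3)
```

Each row lands in exactly one shard, which mirrors the assumption that clients own disjoint data.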

## 3) Model and local training (ML)


- Model: Logistic Regression implemented in pure Python (module renamed to `logistic_regression/LogisticRegression`).

- Local step on client (see `clients/base_client.py`):

  - Pulls current global weights from server.

  - Runs several local epochs of SGD on its shard, optionally with L2 regularization.

  - Produces a weight delta Δw and bias delta Δb, sample‑weighted by client data size.

  - Optionally clips the update to a constant C. This bounds each client's influence on the aggregate, which helps robustness and is a prerequisite for adding differential privacy later.
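
The local step above can be sketched in pure Python. This is an illustrative sketch, not the code in `clients/base_client.py`: the function `local_update`, its parameters, the use of L2-norm clipping, and the choice to clip before sample-weighting are all assumptions.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def local_update(w, b, X, y, epochs=1, lr=0.1, reg=0.0, clip_c=None):
    """Run per-sample SGD from the global (w, b); return (weighted delta, n).

    Hypothetical helper: the delta holds the weights followed by the bias.
    """
    w_loc, b_loc = list(w), b
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w_loc, xi)) + b_loc)
            err = p - yi  # gradient of the log-loss with respect to the logit
            w_loc = [wj - lr * (err * xj + reg * wj) for wj, xj in zip(w_loc, xi)]
            b_loc -= lr * err
    delta = [wl - wg for wl, wg in zip(w_loc, w)] + [b_loc - b]
    if clip_c is not None:
        # Assumed clipping rule: rescale so the L2 norm is at most clip_c.
        norm = math.sqrt(sum(d * d for d in delta))
        if norm > clip_c:
            delta = [d * clip_c / norm for d in delta]
    n = len(X)
    # Weight by the number of local samples; the server divides by the total.
    return [n * d for d in delta], n

delta, n = local_update([0.0], 0.0, X=[[1.0], [-1.0]], y=[1, 0])
```

Returning the sample count alongside the delta lets the server form a weighted average once the (masked) deltas are summed.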

## 4) Encoding and modular arithmetic (pre‑crypto)


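Additive masks hide a value the way a one‑time pad does only when arithmetic wraps around in a finite ring, so each floating‑point update is first mapped to integers modulo a large q.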

- Clip: ensure Δ is within a bounded range.

- Fixed‑point encode: multiply by scale S (e.g., S = 2^20) and round to integers.

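- Modulus: work in Z_q with q = 2^61 − 1 (a Mersenne prime) and compute all sums modulo q.

- Center‑lift: when decoding, map residues from [0, q) back to signed integers (values above q/2 represent negatives), then divide by S to recover floats. This is unambiguous only while the true aggregate stays below q/2 in absolute value.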

- Implementation: `scripts/encoding.py` (clip_vector, encode_vector_to_int, center_lift, decode_int_to_float).
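
The pipeline above can be sketched as follows. The function names match those listed for `scripts/encoding.py`, but the signatures and bodies here are assumptions, including the choice of element-wise clipping.

```python
Q = (1 << 61) - 1   # modulus q = 2^61 - 1, as in the text
S = 1 << 20         # fixed-point scale S = 2^20, as in the text

def clip_vector(v, c):
    # Assumed element-wise clip to [-c, c].
    return [max(-c, min(c, x)) for x in v]

def encode_vector_to_int(v, scale=S, q=Q):
    # Python's % always returns a value in [0, q), so negatives wrap correctly.
    return [round(x * scale) % q for x in v]

def center_lift(a, q=Q):
    # Residues above q/2 represent negative integers.
    return a - q if a > q // 2 else a

def decode_int_to_float(v, scale=S, q=Q):
    return [center_lift(a, q) / scale for a in v]

# Two toy client updates; -1.5 is clipped to -1.0 before encoding.
updates = [[0.5, -0.25], [-1.5, 0.125]]
encoded = [encode_vector_to_int(clip_vector(u, 1.0)) for u in updates]
total = [sum(col) % Q for col in zip(*encoded)]   # what the server aggregates
decoded = decode_int_to_float(total)
```

Because addition modulo q commutes with encoding, decoding the modular sum gives the sum of the clipped updates, up to rounding at resolution 1/S.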

## 5) Secure Aggregation (Crypto)


Goal: the server receives one masked vector per client and learns only the sum of the updates, never any individual update.

- Pairwise keys via ECDH X25519

  - Each client i has an ephemeral X25519 keypair (Curve25519 Diffie‑Hellman).

  - For each peer j, derive a shared secret k_ij = X25519(sk_i, pk_j).

  - Files: `server/secure_aggregation/bonawitz.py` (protocol harness), uses `cryptography`'s X25519.

- Key derivation via HKDF‑SHA256

  - From k_ij, derive a seed s_ij using HKDF with context (round_id, client ids) to avoid reuse.

  - File: `server/secure_aggregation/hkdf_prg.py` (hkdf_expand).

- Pseudorandom mask via HMAC‑DRBG PRG

  - Using s_ij as the PRG seed, stream pseudorandom words until the mask has one element of Z_q per model parameter (weights and bias as one vector).

  - File: `server/secure_aggregation/hkdf_prg.py` (hmac_drbg_stream, prg_vector_mod_q).

- Signed cancellation

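  - For a fixed client order, client i adds mask_ij for every peer j > i and subtracts mask_ji for every peer j < i. Both peers derive the same mask from their shared secret, so mask_ij = mask_ji.

  - Summing across all clients modulo q cancels every pairwise mask; only ΣΔ remains.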

- Bias term

  - Bias is treated as an extra dimension and masked the same way.

- Security intuition

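  - Provided clients do not collude with the server and every client completes the round, each vector the server receives is computationally indistinguishable from a uniformly random vector over Z_q; only the sum carries information about the updates.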
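
The masking and cancellation steps can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the code in `server/secure_aggregation/`: the shared secrets are hashed stand-ins for X25519 outputs, the PRG is counter-mode HMAC-SHA256 rather than a full HMAC-DRBG, and the helper names, signatures, and `info` string format are hypothetical.

```python
import hashlib
import hmac

Q = (1 << 61) - 1

def hkdf_sha256(ikm, info, length=32, salt=b""):
    """HKDF (RFC 5869) extract-then-expand with SHA-256."""
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

def prg_vector_mod_q(seed, dim, q=Q):
    """Expand seed into dim values in [0, q) (simplified counter-mode PRG)."""
    out, counter = [], 0
    while len(out) < dim:
        block = hmac.new(seed, counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
        for k in range(0, 32, 8):
            cand = int.from_bytes(block[k:k + 8], "big") >> 3  # keep 61 bits
            if cand < q and len(out) < dim:  # rejection sampling avoids bias
                out.append(cand)
    return out

def pair_mask(shared_secret, round_id, i, j, dim):
    # Sort the ids so both peers build the same context and hence the same mask.
    lo, hi = sorted((i, j))
    info = f"secagg|round={round_id}|{lo}|{hi}".encode()
    return prg_vector_mod_q(hkdf_sha256(shared_secret, info), dim)

def mask_update(i, encoded, secrets, round_id, q=Q):
    """Add the mask for peers j > i and subtract it for peers j < i."""
    masked = list(encoded)
    for j, secret in secrets.items():
        sign = 1 if i < j else -1
        mask = pair_mask(secret, round_id, i, j, len(encoded))
        masked = [(x + sign * m) % q for x, m in zip(masked, mask)]
    return masked

def demo_secret(i, j):
    # Stand-in for k_ij = X25519(sk_i, pk_j); symmetric in (i, j) by construction.
    lo, hi = sorted((i, j))
    return hashlib.sha256(f"demo-secret-{lo}-{hi}".encode()).digest()

clients = [0, 1, 2]
updates = {0: [5, 7, 11], 1: [Q - 3, 2, 0], 2: [1, 1, 1]}  # already encoded mod q
masked = {
    i: mask_update(i, updates[i], {j: demo_secret(i, j) for j in clients if j != i}, round_id=1)
    for i in clients
}
aggregate = [sum(col) % Q for col in zip(*masked.values())]
```

Each pair's mask is added once and subtracted once, so `aggregate` equals the element-wise sum of `updates` modulo q even though every individual entry of `masked` looks random. If any client fails to submit, its masks do not cancel; the full Bonawitz protocol handles such dropouts with secret sharing, which this sketch omits.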