## Question 1. ZLG algorithm implementation (50 points)
**You are to implement the ZLG algorithm for this problem.**
- **We will use a subset of multiclass data where the label is a protein subcellular localization.**
- **The 8 features are extracted from the protein sequence.**
- **For this problem we are only using points with labels `MIT` or `NUC`.**
- **A total of 673 data points have labels `MIT` (244) or `NUC` (429). We start with the labels of only the first 200 data points (set `Y_k`). The other 473 points are in `Y_u`.**

**First, read the paper and answer the following questions.**
#### 1. What is the idea behind the ZLG algorithm (5 points)?
#### 2. What are the assumptions behind the ZLG algorithm (5 points)?
#### 3. What are the pros and cons of the ZLG algorithm (5 points)?

### Imports

In [1]:
import numpy as np
import pandas as pd 
from scipy.spatial import distance_matrix
from sklearn.preprocessing import LabelEncoder

import packages.zlg.zlg as zlg

### Data Prep

In [2]:
data = pd.read_csv('data/data.csv')
print('shape',data.shape)
print('unique labels', data.Label.unique())
data.head(5)

shape (673, 9)
unique labels ['MIT' 'NUC']


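| | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | Label |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.58 | 0.61 | 0.47 | 0.13 | 0.5 | 0.0 | 0.48 | 0.22 | MIT |
| 1 | 0.43 | 0.67 | 0.48 | 0.27 | 0.5 | 0.0 | 0.53 | 0.22 | MIT |
| 2 | 0.64 | 0.62 | 0.49 | 0.15 | 0.5 | 0.0 | 0.53 | 0.22 | MIT |
| 3 | 0.58 | 0.44 | 0.57 | 0.13 | 0.5 | 0.0 | 0.54 | 0.22 | NUC |
| 4 | 0.42 | 0.44 | 0.48 | 0.54 | 0.5 | 0.0 | 0.48 | 0.22 | MIT |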


In [3]:
# filter out records without desired labels
data_MITNUC = data.loc[data['Label'].isin(['MIT','NUC'])].values

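# split data into features and target, encode target classes as 0 or 1
# (.values on mixed-type columns yields an object array, so cast the features to float)
X = data_MITNUC[:,:8].astype(float)
y = LabelEncoder().fit_transform(data_MITNUC[:,-1])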

print(X.shape)
print(y.shape)

(673, 8)
(673,)


In [4]:
# split data into first 200 records and remainder
n_l = 200

Xk = X[:n_l,:]
Yk = y[:n_l]
Xu = X[n_l:,:]
Yu = y[n_l:]

print(Xk.shape)
print(Yk.shape)
print(Xu.shape)
print(Yu.shape)

(200, 8)
(200,)
(473, 8)
(473,)


# 1.1. Part 1 (5 points)
**TODO:**
- **Let's first construct the weight matrix W.**
- **Use $t = 0$ and $\sigma$ as the standard deviation of $X$.**
- **Then calculate the $D$ matrix and the Laplacian matrix (Delta).**

### Formulas
Similarity is measured using the radial basis function (RBF):

This paper:
$$w_{ij}=\exp{\left( -\frac{1}{\sigma^2}\sum_{d=1}^m (x_{id} - x_{jd})^2 \right)}$$

Another paper (https://www.aaai.org/Papers/ICML/2003/ICML03-118.pdf):

$$w_{ij}=\exp{\left( -\sum_{d=1}^m \frac{(x_{id} - x_{jd})^2}{\sigma^2_d} \right)}$$

where
- $x_i \in \mathbb{R}^m$
- $x_{id}$ is the $d$-th component of instance $x_i$
- $\sigma_1, \ldots ,\sigma_m$ are length scale hyperparameters for each dimension


Note: $\sum_{d=1}^m \left( x_{id} - x_{jd} \right)^2$ is the squared Euclidean distance between $x_i$ and $x_j$

I think I understand what you mean about flattening the matrix, but it doesn't make sense to me. Let me explain. 


Consider the inner expression:

$$ \frac{1}{\sigma^2}\sum_{d=1}^m (x_{id} - x_{jd})^2$$


Suppose:

$$X=\begin{bmatrix}1 & 2 \\ 3 & 5\end{bmatrix}$$

where

$$x_1=\begin{bmatrix}1 & 2 \end{bmatrix}$$

$$x_2=\begin{bmatrix}3 & 5 \end{bmatrix}$$

We can calculate the variance of each of the columns:

$$v=\begin{bmatrix}1 & 2.25 \end{bmatrix}$$


If we use `distance_matrix()`, referenced in the Notebook, the squared distance matrix of $X$ (call it $S$, to keep it apart from $X$ itself) would be:

$$S=\begin{bmatrix}0 & 13 \\ 13 & 0\end{bmatrix}$$

If we now multiply the two terms together, we would get this as the inner result:

$$\frac{1}{v} \cdot S = \begin{bmatrix}0 & 5.78 \\ 5.78 & 0\end{bmatrix}$$

with the off-diagonal term $13/2.25 \approx 5.78$ representing the result of the expression.

However, consider the version of the RBF in this paper referenced by the authors (https://www.aaai.org/Papers/ICML/2003/ICML03-118.pdf):

$$w_{ij}=\exp{\left( -\sum_{d=1}^m \frac{(x_{id} - x_{jd})^2}{\sigma^2_d} \right)}$$

This version clearly divides each squared difference by the variance of its own dimension. I can't find a more efficient way to do that than term by term, and doing it that way gives a different value.

$$\frac{(3-1)^2}{1}+\frac{(5-2)^2}{2.25}=8$$


$$8 \ne 5.78$$


To me, it clearly seems like these two approaches get different results. Am I looking at this the wrong way?
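A quick numerical check of the toy example may help here. The sketch below is only illustrative: it assumes population variance (NumPy's default `ddof=0`) and reads "one shared $\sigma$" as the standard deviation of all entries of $X$, which is an interpretation rather than something the paper or the assignment states. The names `shared`, `per_dim`, and `per_dim_matrix` are made up for this example. It also shows that the per-dimension version does not need a term-by-term loop: rescaling each column by its own $\sigma_d$ first and then taking squared Euclidean distances is equivalent.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 5.0]])

# Squared per-dimension differences between x_1 and x_2: [4, 9]
sq_diff = (X[0] - X[1]) ** 2

# Variant 1 (assumption): a single sigma^2 taken over all entries of X
sigma2_all = X.var()                      # 2.1875
shared = sq_diff.sum() / sigma2_all       # 13 / 2.1875, roughly 5.94

# Variant 2: one sigma_d^2 per dimension (column variances [1, 2.25])
sigma2_cols = X.var(axis=0)
per_dim = (sq_diff / sigma2_cols).sum()   # 4/1 + 9/2.25 = 8

# Vectorized form of variant 2: scale columns by sigma_d, then take
# pairwise squared Euclidean distances of the rescaled rows.
Xs = X / X.std(axis=0)
per_dim_matrix = ((Xs[:, None, :] - Xs[None, :, :]) ** 2).sum(axis=-1)

print(shared, per_dim)
print(per_dim_matrix)
```

So the two formulas do give different numbers in general; they coincide only when every dimension has the same $\sigma_d$.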



In [5]:
# Open question: do we build W from all of X or only from the 200 labeled samples?
# (the next cell uses all of X)

In [6]:
# implemented in custom Python package zlg

Delta = zlg.laplacian_matrix(X,t=0)
Delta.shape

(673, 673)
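Since `zlg.laplacian_matrix` lives in a custom package that is not shown, here is a minimal sketch of what such a function might look like. It is not the package's actual code. It assumes a single $\sigma$ equal to the standard deviation of all entries of $X$ and assumes that `t` is a cutoff below which edge weights are set to zero; both are guesses based on the task description. The name `laplacian_matrix_sketch` is hypothetical.

```python
import numpy as np

def laplacian_matrix_sketch(X, t=0.0, sigma=None):
    """Graph Laplacian Delta = D - W built from RBF weights (illustrative)."""
    X = np.asarray(X, dtype=float)
    if sigma is None:
        sigma = X.std()  # assumption: one sigma over all entries of X
    # Pairwise squared Euclidean distances, shape (n, n)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / sigma**2)
    W[W < t] = 0.0       # assumption: t thresholds small weights; no-op for t = 0
    D = np.diag(W.sum(axis=1))
    return D - W
```

The diagonal of $W$ is left at 1 here; it cancels in $D - W$, so zeroing it would give the same Laplacian.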

# 1.2. Part 2 (5 points) 
**TODO:**
- **Now complete the subroutine to compute the minimum-energy solution for the unlabeled instances. (Hint: Use the formula on page 38, Lecture 7.)**
- **The function also outputs one submatrix that we will use to select points to query.**

In [7]:
# implemented in custom Python package zlg
# def minimum_energy_solution(Delta,n_l,fl):
#     """
#     Args:
#         Delta: The Laplacian matrix. 
#         n_l: Number of labeled points. Notice that Delta should have the upper left submatrix 
#             corresponding to these n_l points. So when new points get labeled, you may need 
#             to rearrange the matrix.
#         fl: Known labels.
#     Returns:
#         Delta_uu_inv: Inverse matrix of the submatrix corresponding to unlabeled points.
#         fu: Minimum energy solution of all unlabeled points.
#     """
#     ## TODO ##
    
#     return Delta_uu_inv, fu

In [8]:
# first 200 points are labeled
labeled = [i for i in range(200)]
unlabeled = [i for i in range(200,673)]

fu,Delta_uu_inv = zlg.minimum_energy_solution(Delta,labeled,unlabeled,Yk)
print(Delta_uu_inv.shape)
print(fu.shape)

(473, 473)
(473,)
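For reference, a minimal sketch of the minimum-energy (harmonic) solution follows. It uses the closed form $f_u = -\Delta_{uu}^{-1}\Delta_{ul} f_l$ from the ZLG paper; whether this is exactly the formula on the lecture slide is not confirmed here. The sketch follows the template signature `(Delta, n_l, fl)` and its return order, which differ from the `zlg` package call above, and the name `minimum_energy_solution_sketch` is hypothetical.

```python
import numpy as np

def minimum_energy_solution_sketch(Delta, n_l, fl):
    """Harmonic solution on the unlabeled nodes (illustrative).

    Assumes the first n_l rows/columns of Delta belong to the labeled points.
    """
    Delta = np.asarray(Delta, dtype=float)
    fl = np.asarray(fl, dtype=float)
    Delta_uu = Delta[n_l:, n_l:]   # unlabeled-unlabeled block
    Delta_ul = Delta[n_l:, :n_l]   # unlabeled-labeled block
    Delta_uu_inv = np.linalg.inv(Delta_uu)
    fu = -Delta_uu_inv @ Delta_ul @ fl
    return Delta_uu_inv, fu

# Tiny check: one unlabeled node tied equally to a 0-labeled and a 1-labeled node.
Delta_toy = np.array([[ 1.0,  0.0, -1.0],
                      [ 0.0,  1.0, -1.0],
                      [-1.0, -1.0,  2.0]])
inv_toy, fu_toy = minimum_energy_solution_sketch(Delta_toy, 2, [0, 1])
print(fu_toy)
```

The explicit inverse is kept only because the next step needs $\Delta_{uu}^{-1}$ itself; if only `fu` were required, `np.linalg.solve` would be the more usual choice.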


# 1.3. Part 3 (15 points) 
**TODO:**
- **We would like to query the points that minimize the expected risk. To do so, we want to be able to calculate the expected estimated risk after querying any point $k$.**
- **The variable `Rhat_fplus_xk` refers to $\hat{R}(f^{+x_k})$.**
- **`fu_xk0` is $f_u^{+(x_k,0)}$ and vice versa for `fu_xk1`.**

I'm confused about the notation involved in calculating expected risk for ZLG. In the paper, we have the following equations:

#### Estimated risk:

$$\hat{R}\left( f^{+(x_k,y_k)} \right)=\sum_{i=1}^n \min\left( f_i^{+(x_k,y_k)},1-f_i^{+(x_k,y_k)}\right)$$

#### Expected estimated risk:

$$\hat{R}\left( f^{+x_k} \right)=
(1-f_k)\hat{R}\left( f^{+(x_k,0)} \right)
+   f_k\hat{R}\left( f^{+(x_k,1)} \right)$$


#### Conditional Distribution of all unlabeled nodes:

$$f_u^{+(x_k,y_k)}=f_u+(y_k-f_k)\frac{(\Delta_{uu}^{-1})_{ \cdot k}}{(\Delta_{uu}^{-1})_{kk}}$$


***

I'm confused about the difference between $f_u$, $f$, and $f_i$. The paper defines $f=\begin{bmatrix}f_l \\ f_u\end{bmatrix}$, but I don't understand exactly how this carries through the equations above. My understanding is that $f_u$ is a vector. I calculated that $f_u^{+(x_k,0)}$ is a vector of the same dimension.

Is $f^{+(x_k,0)}$ the same as $f_u^{+(x_k,0)}$?

***

I'm also confused about how to calculate the estimated risk.

$$\hat{R}\left( f^{+(x_k,0)} \right)=\sum_{i=1}^n \min\left( f_i^{+(x_k,0)},1-f_i^{+(x_k,0)}\right)$$

What is the $n$ in this case? I thought we were dealing with $f_u$. What would $f_1, f_2, \ldots, f_n$ be?

***

In [9]:
# implemented in custom Python package zlg
# def expected_estimated_risk(Delta_uu_inv,k,fu):
#     """
#     Args:
#         Delta_uu_inv: Inverse matrix of the submatrix corresponding to unlabeled points.
#         k: index of one unlabeled point with respect to the uu submatrix (not the entire Delta)
#         fu: Minimum energy solution of all unlabeled points.
#     Returns:
#         Rhat_fplus_xk: Expected estimated risk after querying node k
#     """
#     ## fu plus xk, yk = 0
#     fu_xk0 = fu + (0 - fu[k])*Delta_uu_inv[:,k]/Delta_uu_inv[k,k]
#     ## fu plus xk, yk = 1
#     fu_xk1 = fu + (1 - fu[k])*Delta_uu_inv[:,k]/Delta_uu_inv[k,k]
    
#     ## TODO ##
    
#     return Rhat_fplus_xk
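One possible way to fill in the template is sketched below. It assumes the sum in $\hat{R}$ only needs to run over the unlabeled points: every labeled point has $f_i \in \{0, 1\}$, so $\min(f_i, 1-f_i) = 0$ and it contributes nothing. Under that reading, $f^{+(x_k,0)}$ and $f_u^{+(x_k,0)}$ yield the same risk. The function name `expected_estimated_risk_sketch` is hypothetical and this is not the `zlg` package's code.

```python
import numpy as np

def expected_estimated_risk_sketch(Delta_uu_inv, k, fu):
    """Expected estimated risk after querying unlabeled node k (illustrative)."""
    Delta_uu_inv = np.asarray(Delta_uu_inv, dtype=float)
    fu = np.asarray(fu, dtype=float)
    # k-th column of the inverse, normalized by its diagonal entry
    col = Delta_uu_inv[:, k] / Delta_uu_inv[k, k]
    fu_xk0 = fu + (0 - fu[k]) * col   # harmonic solution if y_k = 0
    fu_xk1 = fu + (1 - fu[k]) * col   # harmonic solution if y_k = 1
    # Estimated risk: sum over unlabeled nodes of min(f_i, 1 - f_i)
    Rhat_xk0 = np.minimum(fu_xk0, 1 - fu_xk0).sum()
    Rhat_xk1 = np.minimum(fu_xk1, 1 - fu_xk1).sum()
    # Weight each outcome by the current belief f_k that y_k = 1
    return (1 - fu[k]) * Rhat_xk0 + fu[k] * Rhat_xk1
```

Note that after the update the queried node itself takes the value 0 or 1, so it drops out of the sum automatically.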


# 1.4. Part 4 (5 points) 
**TODO:**
- **Compute the above expected estimated risk for all unlabeled points and select one to query.**
- **Let's try querying 100 points. Which points are queried? Compare with random queries and make a plot.**

In [10]:
# implemented in custom Python package zlg
# def zlg_query(Delta_uu_inv,n_l,fu,n_samples):
#     """
#     Args:
#         Delta_uu_inv: Inverse matrix of the submatrix corresponding to unlabeled points.
#         n_l: Number of labeled points.
#         fu: Minimum energy solution of all unlabeled points.
#         n_samples: Number of samples.
#     Returns:
#         query_idx: the idx of the point to query, wrt the unlabeled points 
#                 (idx is 0 if it's the first unlabeled point)
#     """
#     n_u = n_samples - n_l
#     query_idx = 0
#     min_Rhat = np.inf
#     ## TODO ##
    
#     return query_idx
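The selection step can also be written without an explicit loop over candidates. The sketch below computes the expected estimated risk of every unlabeled point at once and returns the index of the minimum. It is an illustration under the same assumptions as the formulas above, not the `zlg` package's implementation; `expected_risks_all` and `zlg_query_sketch` are hypothetical names, and `n_l` and `n_samples` are dropped from the signature because the sizes can be read off `fu`.

```python
import numpy as np

def expected_risks_all(Delta_uu_inv, fu):
    """Expected estimated risk for every candidate k (illustrative)."""
    Delta_uu_inv = np.asarray(Delta_uu_inv, dtype=float)
    fu = np.asarray(fu, dtype=float)
    # G[:, k] = k-th column of the inverse divided by its diagonal entry
    G = Delta_uu_inv / np.diag(Delta_uu_inv)[None, :]
    # Column k of F0 / F1 is the updated solution if y_k = 0 / y_k = 1
    F0 = fu[:, None] - fu[None, :] * G
    F1 = fu[:, None] + (1.0 - fu)[None, :] * G
    risk0 = np.minimum(F0, 1.0 - F0).sum(axis=0)
    risk1 = np.minimum(F1, 1.0 - F1).sum(axis=0)
    return (1.0 - fu) * risk0 + fu * risk1

def zlg_query_sketch(Delta_uu_inv, fu):
    """Index (w.r.t. the unlabeled points) with the smallest expected risk."""
    return int(np.argmin(expected_risks_all(Delta_uu_inv, fu)))

# Path graph 0 - a - b - c - 1 with labeled endpoints; a, b, c are unlabeled.
inv_chain = np.array([[3.0, 2.0, 1.0],
                      [2.0, 4.0, 2.0],
                      [1.0, 2.0, 3.0]]) / 4.0
fu_chain = np.array([0.25, 0.5, 0.75])
risks_chain = expected_risks_all(inv_chain, fu_chain)
idx_chain = zlg_query_sketch(inv_chain, fu_chain)
print(risks_chain, idx_chain)
```

This builds two $n_u \times n_u$ arrays, which is cheap for 473 unlabeled points but would need chunking for much larger graphs.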

### Select 1 to Query

In [11]:
idx_1_4 = zlg.zlg_query(fu, Delta_uu_inv, n_l, num_samples=673)
idx_1_4

(473,) (473, 473) 0 0
0.18885223860282407 (473,) 0.0716013979040977
(473,)
(473,) (473, 473) 0 1
0.18885223860282407 (473,) 0.0716013979040977
(473,)
[... debug output for the remaining candidates truncated ...]

  change = (y_k - f_k) * kth_col / kth_diag
  change = (y_k - f_k) * kth_col / kth_diag

208

### ZLG Query

In [12]:
# n_samples = X.shape[0]
# for t in range(100):
#     ## edit this block ##
#     query_idx = zlg_query(Delta_uu_inv,n_l,fu,n_samples)
#     Yk = np.append(Yk,Yu[query_idx])
#     Yu = np.delete(Yu,query_idx)
#     Xk = np.append(Xk,[Xu[query_idx,:]],axis=0)
#     Xu = np.delete(Xu,query_idx, 0)
#     n_l += 1
#     Delta = Laplacian_matrix(np.concatenate((Xk,Xu),axis=0))
#     Delta_uu_inv, fu = minimum_energy_solution(Delta,n_l,Yk)
#     print(query_idx)
#     ## TODO ##
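A possible shape for the query loop is sketched below. Because the Laplacian depends only on the features and not on which labels are known, this version keeps `Delta` fixed and tracks index lists of labeled and unlabeled points instead of reordering the data and rebuilding the matrix on every step. That is a design choice of this sketch, not what the template prescribes, and all function names are hypothetical.

```python
import numpy as np

def harmonic_solution(Delta, labeled, unlabeled, fl):
    """Inverse of the uu block and the harmonic solution on unlabeled nodes."""
    Delta_uu_inv = np.linalg.inv(Delta[np.ix_(unlabeled, unlabeled)])
    fu = -Delta_uu_inv @ Delta[np.ix_(unlabeled, labeled)] @ fl
    return Delta_uu_inv, fu

def expected_risk(Delta_uu_inv, k, fu):
    """Expected estimated risk after querying the k-th unlabeled node."""
    col = Delta_uu_inv[:, k] / Delta_uu_inv[k, k]
    total = 0.0
    for y_k, prob in ((0.0, 1.0 - fu[k]), (1.0, fu[k])):
        f_new = fu + (y_k - fu[k]) * col
        total += prob * np.minimum(f_new, 1.0 - f_new).sum()
    return total

def zlg_active_learning(Delta, y, labeled, n_queries):
    """Greedy ZLG-style querying; returns the original indices queried."""
    y = np.asarray(y)
    labeled = list(labeled)
    known = set(labeled)
    unlabeled = [i for i in range(len(y)) if i not in known]
    queried = []
    for _ in range(n_queries):
        fl = y[labeled].astype(float)
        Delta_uu_inv, fu = harmonic_solution(Delta, labeled, unlabeled, fl)
        risks = [expected_risk(Delta_uu_inv, k, fu) for k in range(len(unlabeled))]
        pick = unlabeled.pop(int(np.argmin(risks)))  # reveal this point's label
        labeled.append(pick)
        queried.append(pick)
    return queried
```

With the notebook's variables this might be called as `zlg_active_learning(Delta, y, range(200), 100)`, assuming `Delta` was built from all of `X` in its original row order.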

### Random Query
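For the random baseline, a minimal sketch is given below. It picks the next point uniformly at random and records the accuracy of the harmonic predictions on the points that are still unlabeled. Thresholding $f_u$ at 0.5 to obtain class predictions is an assumption of this sketch, as are the function names and the `seed` parameter.

```python
import numpy as np

def harmonic_accuracy(Delta, y, labeled, unlabeled):
    """Accuracy of thresholded harmonic predictions on the unlabeled points."""
    Delta_uu = Delta[np.ix_(unlabeled, unlabeled)]
    Delta_ul = Delta[np.ix_(unlabeled, labeled)]
    fu = np.linalg.solve(Delta_uu, -Delta_ul @ y[labeled].astype(float))
    predictions = (fu > 0.5).astype(int)  # assumption: threshold at 0.5
    return float(np.mean(predictions == y[unlabeled]))

def random_query_curve(Delta, y, labeled, n_queries, seed=0):
    """Accuracy after each of n_queries uniformly random label queries."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    labeled = list(labeled)
    known = set(labeled)
    unlabeled = [i for i in range(len(y)) if i not in known]
    accuracies = []
    for _ in range(n_queries):
        pick = unlabeled.pop(int(rng.integers(len(unlabeled))))
        labeled.append(pick)
        accuracies.append(harmonic_accuracy(Delta, y, labeled, unlabeled))
    return accuracies
```

Plotting this curve against the same accuracy measure recorded during the ZLG queries gives the comparison the task asks for. `n_queries` must stay below the number of unlabeled points so that at least one point remains to evaluate on.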

# 1.5. Bonus question 

**Answer the following questions. (Your grade will not exceed 100 for this homework.)**

#### 1. For this dataset, how many labeled data points do you actually need to train the model sufficiently well? 
#### 2. And why?