layout | title | category | typora-copy-images-to |
---|---|---|---|
notes |
Unsupervised |
ml |
../assets |
{:toc}
- labels are not given
- intra-cluster distances are minimized, inter-cluster distances are maximized
- distance measures
- symmetric D(A,B)=D(B,A)
- self-similarity D(A,A)=0
- positivity separation D(A,B)=0 iff A=B
- triangular inequality D(A,B) <= D(A,C)+D(B,C)
- ex. Minkowski Metrics
$d(x,y)=\sqrt[r]{\sum \vert x_i-y_i\vert ^r}$ - r=1 Manhattan distance
- r=1 when y is binary -> Hamming distance
- r=2 Euclidean
- r=$\infty$ "sup" distance
- correlation coefficient - unit independent
- edit distance
- two approaches:
- bottom-up agglomerative clustering - starts with each object in separate cluster then joins
- top-down divisive - starts with 1 cluster then separates
- ex. starting with each item in its own cluster, find best pair to merge into a new cluster
- repeatedly do this to make a tree (dendrogram)
- distances between clusters defined by linkage function
- single-link - closest members (long, skinny clusters)
- complete-link - furthest members (tight clusters)
- average - most widely used
- ex. MST - keep linking shortest link
-
ultrametric distance - tighter than triangle inequality
$d(x, y) \leq \max[d(x,z), d(y,z)]$
- partition n objects into a set of K clusters (must be specified)
- globally optimal: exhaustively enumerate all partitions
- minimize sum of squared distances from cluster centroid
- evaluation w/ labels - purity - ratio between dominant class in cluster and size of cluster
- k-means++ - better at not getting stuck in local minima
- randomly move centers apart
- Complexity:
$O(n^2p)$ for first iteration and then can only get worse
-
latent vars - values not specified in the observed data
-
K-Means
- start with random centers
- E: assign everything to nearest center:
$O(|\text{clusters}|*np) $ - M: recompute centers
$O(np)$ and repeat until nothing changes - partition amounts to Voronoi diagram
- can be viewed as minimizing distortion measure
$J=\sum_n \sum_i z_n^i ||x_n - \mu_i||^2$
-
GMMs:
$p(x|\theta) = \underset{i}{\Sigma} \pi_i \mathcal{N}(x|\mu_i, \Sigma_i)$ -
$l(\theta|x) = \sum_n \log : p(x_n|\theta) \ = \sum_n \log \sum_i \pi_i \mathcal{N}(x_n|\mu_i, \Sigma_i)$ -
hard to maximize bcause log acts on a sum
-
"soft" version of K-means - update means as weighted sums of data instead of just normal mean
-
sometimes initialize K-means w/ GMMs
-
graph LR;
X-->Y;
X --> Z
Z --> Y
- ex.
- latent variable Z has multinomial distr.
-
mixing proportions:
$P(Z^i=1|x, \xi)$ - ex. $ \frac{e^{\xi_i^Tx}}{\sum_je^{\xi_j^Tx}}$
-
mixture components:
$p(y|Z^i=1, x, \theta_i)$ ~ different choices - ex. mixture of linear regressions
- $p(y| x, \theta) = \sum_i \underbrace{\pi_i (x, \xi)}{\text{mixing prop.}} \cdot \underbrace{\mathcal{N}(y|\beta_i^Tx, \sigma_i^2)}{\text{mixture comp.}}$
- ex. mixtures of logistic regressions
- $p(y|x, \theta_i) = \underbrace{\pi_i (x, \xi)}{\text{mixing prop.}} \cdot \underbrace{\mu(\theta_i^Tx)^y\cdot[1-\mu(\theta_i^Tx)]^{1-y}}{\text{mixture comp.}}$ where
$\mu$ is the logistic function
- $p(y|x, \theta_i) = \underbrace{\pi_i (x, \xi)}{\text{mixing prop.}} \cdot \underbrace{\mu(\theta_i^Tx)^y\cdot[1-\mu(\theta_i^Tx)]^{1-y}}{\text{mixture comp.}}$ where
-
mixing proportions:
- also, nonlinear optimization for this (including EM)
In general there is some tension between preserving global properties (e.g. PCA) and local peroperties (e.g. nearest neighborhoods)
Method | Analysis objective | Temporal smoothing | Explicit noise model | Notes |
---|---|---|---|---|
PCA | Covariance | No | No | orthogonality |
FA | Covariance | No | Yes | like PCA, but with errors (not biased by variance) |
LDS/GPFA | Dynamics | Yes | Yes | |
NLDS | Dynamics | Yes | Yes | |
LDA | Classification | No | No | |
Demixed | Regression | No | Yes/No | |
Isomap/LLE | Manifold discovery | No | No | |
T-SNE | .... | .... | ... | |
UMAP | ... | ... | ... |
- NMF -
$\min_{D \geq 0, A \geq 0} ||X-DA||_F^2$ - SEQNMF
- ICA
- remove correlations and higher order dependence
- all components are equally important
- like PCA, but instead of the dot product between components being 0, the mutual info between components is 0
- goals
- minimize statistical dependence between components
- maximize information transferred in a network of non-linear units
- uses information theoretic unsupervised learning rules for neural networks
- problem - doesn't rank features for us
- LDA/QDA - finds basis that separates classes
- reduced to axes which separate classes (perpendicular to the boundaries)
- K-means - can be viewed as a linear decomposition
- spectral clustering - does dim reduction on eigenvalues (spectrum) of similarity matrix before clustering in few dims
- uses adjacency matrix
- basically like PCA then k-means
- performs better with regularization - add small constant to the adjacency matrix
-
want new set of axes (linearly combine original axes) in the direction of greatest variability
- this is best for visualization, reduction, classification, noise reduction
- assume
$X$ (nxp) has zero mean
-
derivation:
- minimize variance of X projection onto a unit vector v
-
$\frac{1}{n} \sum (x_i^Tv)^2 = \frac{1}{n}v^TX^TXv$ subject to$v^T v=1$ -
$\implies v^T(X^TXv-\lambda v)=0$ : solution is achieved when$v$ is eigenvector corresponding to largest eigenvalue
-
- like minimizing perpendicular distance between data points and subspace onto which we project
- minimize variance of X projection onto a unit vector v
-
SVD: let
$U D V^T = SVD(Cov(X))$ -
$Cov(X) = \frac{1}{n}X^TX$ , where X has been demeaned
-
-
equivalently, eigenvalue decomposition of covariance matrix
$\Sigma = X^TX$ -
each eigenvalue represents prop. of explained variance:
$\sum \lambda_i = tr(\Sigma) = \sum Var(X_i)$ -
screeplot - eigenvalues in decreasing order, look for num dims with kink
- don't automatically center/normalize, especially for positive data
-
-
SVD is easier to solve than eigenvalue decomposition, can also solve other ways
- multidimensional scaling (MDS)
- based on eigenvalue decomposition
- adaptive PCA
- extract components sequentially, starting with highest variance so you don't have to extract them all
-
good PCA code: http://cs231n.github.io/neural-networks-2/
X -= np.mean(X, axis = 0) # zero-center data (nxd)
cov = np.dot(X.T, X) / X.shape[0] # get cov. matrix (dxd)
U, D, V = np.linalg.svd(cov) # compute svd, (all dxd)
Xrot_reduced = np.dot(X, U[:, :2]) # project onto first 2 dimensions (n x 2)
- nonlinear pca
- usually uses an auto-associative neural network
-
similar, try to discover topics in a model (which maybe can be linearly combined to produce the original document)
-
ex. LDA - generative model: posits that each document is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics
- D is like autoencoder output weight matrix
-
$a$ is more complicated - requires solving inner minimization problem - outer loop is not quite lasso - weights are not what is penalized
- impose norm
$D$ not too big - algorithms
- thresholding (simplest) - do
$D^Ty$ and then threshold this - basis pursuit - change
$l_0$ to$l_1$ - this will work under certain conditions (with theoretical guarantees)
- matching purusuit - greedy, find support one at a time, then look for the next one
- thresholding (simplest) - do
- goal: want to decompose
$X$ into$z$ , where we assume$X = Az$ - assumptions
- independence:
$P(z) = \prod_i P(z_i)$ - non-gaussianity of
$z$
- independence:
- 2 ways to get
$z$ which matches these assumptions- maximize non-gaussianity of
$z$ - use kurtosis, negentropy - minimize mutual info between components of
$z$ - use KL, max entropy- often equivalent
- maximize non-gaussianity of
- identifiability:
$z$ is identifiable up to a permutation ans scaling of sources when- at most one of the sources
$z_k$ is gaussian -
$A$ is full-rank
- at most one of the sources
- assumptions
- ICA learns components which are completely independent, whereas PCA learns orthogonal components
-
non-linear ica:
$X \approx f(z)$ , where assumptions on$s$ are the same, and$f$ can be nonlinear- to obtain identifiability, we need to restrict
$f$ and/or constrain the distr of the sources$s$
- to obtain identifiability, we need to restrict
- bell & sejnowski 1995 original formulation (slightly different)
- entropy maximization - try to find a nonlinear function
$g(x)$ which lets you map that distr$f(x)$ to uniform- then, that function
$g(x)$ is the cdf of$f(x)$
- then, that function
- in ICA, we do this for higher dims - want to map distr of
$x_1, ..., x_p$ to$y_1, ..., y_p$ where distr over$y_i$ 's is uniform (implying that they are independent)- additionally we want the map to be information preserving
- mathematically:
$\underset{W} \max I(x; y) = \underset{W} \max H(y)$ since$H(y|x)$ is zero (there is no randomness)- assume
$y = \sigma (W x)$ where$\sigma$ is elementwise - (then S = WX,
$W=A^{-1}$ ) - requires certain assumptions so that
$p(y)$ is still a distr:$p(y) = p(x) / |J|$ where J is Jacobian
- assume
- learn W via gradient ascent
$\Delta W \propto \partial / \partial W (\log |J|)$ - there is now something faster called fast ICA
- entropy maximization - try to find a nonlinear function
- topographic ICA (make nearby coefficient like each other)
- interestingly, some types of self-supervised learning perform ICA assuming certain data structure (e.g. time-contrastive learning (hyvarinen et al. 2016))
-
multidimensional scaling (MDS)
- given a a distance matrix, MDS tries to recover low-dim coordinates s.t. distances are preserved
- minimizes goodness-of-fit measure called stress = $\sqrt{\sum (d_{ij} - \hat{d}{ij})^2 / \sum d{ij}^2}$
- visualize in low dims the similarity between individial points in high-dim dataset
- classical MDS assumes Euclidean distances and uses eigenvalues
- constructing configuration of n points using distances between n objects
- uses distance matrix
$d_{rr} = 0$ $d_{rs} \geq 0$
- solns are invariant to translation, rotation, relfection
- solutions types
- non-metric methods - use rank orders of distances
- invariant to uniform expansion / contraction
- metric methods - use values
- non-metric methods - use rank orders of distances
- D is Euclidean if there exists points s.t. D gives interpoint Euclidean distances
- define B = HAH
- D Euclidean iff B is psd
- define B = HAH
-
t-sne preserves pairwise neighbors
- t-sne tutorial
- t-sne tries to match pairwise distances between the original data and the latent space data:
- original data
- distances are converted to probabilities by assuming points are means of Gaussians, then normalizing over all pairs
- variance of each Gaussian is scaled depending on the desired perplexity
- distances are converted to probabilities by assuming points are means of Gaussians, then normalizing over all pairs
- latent data
- distances are calculated using some kernel function
- t-SNE uses heavy-tailed Student's t-distr kernel (van der Maaten & Hinton, 2008)
- SNE use Gausian kernel (Hinton & Roweis, 2003)
- kernels have some parameters that can be picked or learned
- perplexity - how to balance between local/global aspects of data
- distances are calculated using some kernel function
- optimization - for optimization purposes, this can be decomposed into attractive/repulsive forces
- umap: Uniform Manifold Approximation and Projection for Dimension Reduction
- pacmap
- NNK-Means: Dictionary Learning using Non-Negative Kernel regression (shekkizhar & ortega, 2021)
- data summarization - represent large datasets by a small set of elements (e.g. k-means)
- here, use dictionary learning instead of k-means to summarize data
- each dictionary element is a sparse combination of inputs
- use non-negative kernel regesion (NNK) to measure distances when designing the dictionary (shekkizar & ortega, 2020)
- overview: https://blog.openai.com/generative-models/
- notes for deep unsupervised learning
- MLE equivalent to minimizing KL for density estimation:
$\min_\theta KL(p|| p_\theta) =\ \min_\theta-H(p) + \mathbb E_{x\sim p}[-\log p_\theta(x)] \ \max_\theta E_p[\log p_\theta(x)]$
- model input based on input
-
$p(x_1)$ is a histogram (learned prior) -
$p(x_2|x_1)$ is a distr. ouptut by a neural net (output is logits, followed by softmax) - all conditional distrs. can be given by neural net
-
- can model using an RNN: e.g. char-rnn (karpathy, 2015):
$\log p(x) - \sum_i \log p(x_i | x_{1:i-1})$ , where each$x_i$ is a character - can also use masks
- masked autoencoder for distr. estimation - mask some weights so that autoencoder output is a factorized distr
- pick an odering for the pixels to be conditioned on
- ex. 1d masked convolution on wavenet (use past points to predict future points)
- ex. pixelcnn - use masking for pixels to the topleft
- ex. gated pixelcnn - fixes issue with blindspot
- ex. pixelcnn++ - nearby pixel values are likely to cooccur
- ex. pixelSNAIL - uses attention and can get wider receptive field
-
attention:
$A(q, K, V) = \sum_i \frac{\exp(q \cdot k_i)}{\sum_j \exp (q \cdot k_j)} v_i$ - masked attention can be more flexible than masked convolution
- can do super resolution, hierarchical autoregressive model
- masked autoencoder for distr. estimation - mask some weights so that autoencoder output is a factorized distr
- problems
- slow - have to sample each pixel (can speed up by caching activations)
- can also speed up by assuming some pixels conditionally independent
- slow - have to sample each pixel (can speed up by caching activations)
- hard to get a latent reprsentation
- can use Fisher score
$\nabla_\theta \log p_\theta (x)$
- can use Fisher score
- good intro to implementing invertible neural networks: https://hci.iwr.uni-heidelberg.de/vislearn/inverse-problems-invertible-neural-networks/
- input / output dimension need to have same dimension
- we can get around this by padding one of the dimensions with noise variables (and we might want to penalize these slightly during training)
- normalizing flows
- ultimate goal: a likelihood-based model with
- fast sampling
- fast inference (evaluating the likelihood)
- fast training
- good samples
- good compression
- transform some
$p(x)$ to some$p(z)$ -
$x \to z = f_\theta (x)$ , where$z \sim p_Z(z)$ $p_\theta (x) dx = p(z)dz$ $p_\theta(x) = p(f_\theta(x))|\frac {\partial f_\theta (x)}{\partial x}|$
-
- autoregressive flows
- map
$x\to z$ invertible-
$x \to z$ is same as log-likelihood computation -
$z\to x$ is like sampling
-
- end up being as deep as the number of variables
- map
- realnvp (dinh et al. 2017) - can couple layers to preserve invertibility but still be tractable
- downsample things and have different latents at different spatial scales
- other flows
- flow++
- glow
- FFJORD - continuous time flows
- discrete data can be harder to model
- dequantization - add noise (uniform) to discrete data
- intuitively understanding vae
-
VAE tutorial
- minimize
$\mathbb E_{q_\phi(z|x)}[\log p_\theta(x|z)- D_{KL}(q_\phi(z|x):||:p(z))]$ - want latent
$z$ to be standard normal - keeps the space smooth
- want latent
- hard to directly calculate
$p(z|x)$ , since it includes$p(x)$ , so we approximate it with the variational posterior$q_\phi (z|x)$ , which we assume to be Gaussian - goal: $\text{argmin}\phi KL(q\phi(z|x) :|| :p(z|x))$
- still don't have acess to
$p(x)$ , so rewrite$\log p(x) = ELBO(\phi) + KL(q_\phi(z|x) : || : p(z|x))$ - instead of minimizing
$KL$ , we can just maximize the$ELBO=\mathbb E_q [\log p(x, z)] - \mathbb E_q[\log q_\phi (z|x)]$
- still don't have acess to
- mean-field variational inference - each point has its own params (e.g. different encoder DNN) vs amortized inference - same encoder for all points
- minimize
-
pyro explanation
- want large evidence
$\log p_\theta (\mathbf x)$ (means model is a good fit to the data) - want good fit to the posterior
$q_\phi(z|x)$
- want large evidence
- just an autoencoder where the middle hidden layer is supposed to be unit gaussian
- add a kl loss to measure how well it maches a unit gaussian
- for calculation purposes, encoder actually produces means / vars of gaussians in hidden layer rather than the continuous values....
- this kl loss is not too complicated...https://web.stanford.edu/class/cs294a/sparseAutoencoder.pdf
- add a kl loss to measure how well it maches a unit gaussian
- generally less sharp than GANs
- uses mse loss instead of gan loss...
- intuition: vaes put mass between modes while GANs push mass towards modes
- constraint forces the encoder to be very efficient, creating information-rich latent variables. This improves generalization, so latent variables that we either randomly generated, or we got from encoding non-training images, will produce a nicer result when decoded.
-
evaluating gans
- don't have explicit objective like likelihood anymore
- kernel density = parzen-window density based on samples yields likelihood
- inception score $IS(\mathbf x) = \exp(\underbrace{H(\mathbf y)}{\text{want to generate diversity of classes}} - \underbrace{H(\mathbf y|\mathbf x)}{\text{each image should be distinctly recognizable}})$
-
FID - Frechet inception score works directly on embedded features from inception v3 model
- embed population of images and calculate mean + variance in embedding space
- measure distance between these means / variances for real/synthetic images using Frechet distance = Wasseterstein-2 distance
-
infogan
-
problems
- mode collapse - pick just one mode in the distr.
-
train network to be loss function
-
original gan paper (2014)
-
generative adversarial network
-
goal: want G to generate distribution that follows data
- ex. generate good images
-
two models
- G - generative
- D - discriminative
-
G generates adversarial sample x for D
- G has prior z
- D gives probability p that x comes from data, not G
- like a binary classifier: 1 if from data, 0 from G
- adversarial sample - from G, but tricks D to predicting 1
-
training goals
- G wants D(G(z)) = 1
- D wants D(G(z)) = 0
- D(x) = 1
- converge when D(G(z)) = 1/2
- G loss function:
$G = argmin_G log(1-D(G(Z))$ - overall
$\min_g \max_D$ log(1-D(G(Z))
-
training algorithm
- in the beginning, since G is bad, only train my minimizing G loss function
-
projecting into gan latent space (=gan inversion)
- 2 general approaches
- learn an encoder to go image -> latent space
- In-Domain GAN Inversion for Real Image Editing (zhu et al. 2020)
- learn encoder to project image into latent space, with regularizer to make sure it follows the right distr.
- learn an encoder to go image -> latent space
- optimize latent code wrt image directly
- can also learn an encoder to initialize this optimization
- some work designing GANs that are intrinsically invertible
- stylegan-specific - some works which exploit layer-wise noises
- stylegan2 paper: optimize w along with noise maps - need to make sure noise maps don't include signal
- 2 general approaches
- blog post
- seminal paper: Generative Modeling by Estimating Gradients of the Data Distribution (song & ermon, 2019)
- first describe a procedure for gradually turning data into noise
- then training a DNN to invert this procedure step-by-step
- single model handles many different noise levels with shared parameters
- really started earlier: Deep Unsupervised Learning using Nonequilibrium Thermodynamics (sohl-dickstein, ..., ganguli, 2015)
- basics: predict some part of the input (e.g. present from past, bottom from top, etc.)
- ex. denoising autoencoder
- ex. in-painting (can use adversarial loss)
- ex. colorization, split-brain autoencoder
- colorization in video given first frame (helps learn tracking)
- ex. relative patch prediction
- ex. orientation prediction
- ex. nlp
- word2vec
- bert - predict blank word
- contrastive predictive coding - translates generative modeling into classification
- contrastive loss = InfoNCE loss uses cross-entropy loss to measure how well the model can classify the “future” representation amongst a set of unrelated “negative” samples
- negative samples may be from other batches or other parts of the input
- momentum contrast - queue of previous embeddings are "keys"
- match new embedding (query) against keys and use contrastive loss
- similar idea as memory bank
- Unsupervised Visual Representation Learning by Context Prediction (efros 15)
- predict relative location of different patches
- SimCLR (Chen et al, 2020)
- maximize agreement for different points after some augmentation (contrastive loss)
- make the classifier more confident
- entropy minimization - try to minimize the entropy of output predictions (like making confident predictions labels)
- pseudo labeling - just take argmax pred as if it were the label
- label consistency with data augmentation
- Billion-scale semi-supervised learning for image classification (fair 19)
- unsupervised learning + model distillation succeeds on imagenet
- ensembling
- temporal ensembling - ensemble multiple models at different training epochs
- mean teachers - learn from exponential moving average of students
- unsupervised data augmentation - augment and ensure prediction is the same
- distribution alignment - ex. cyclegan - enforce cycle consistency = dual learning = back translation
- simpler is marginal matching
- simplest - fixed-length code
- variable-length code
- arithmetic coding
- motivation: coding one symbol at a time incurs penalty of +1 per symbol - more efficient to encode groups of things
- can be improved with good autoregressive model
- What makes for good views for contrastive learning (tian et al. 2020)
- how to select views (e.g. transformations we want to be invariant to)?
- reduce the mutual information (MI) between views while keeping task-relevant information intact
- Supervised Contrastive Learning (khosla et al. 2020)
- Data-Efficient Image Recognition with Contrastive Predictive Coding
- pre-training with CPC on ImageNet yields super good results
- Automatically Discovering and Learning New Visual Categories with Ranking Statistics