# Neural Network Generalization Reading List

Why do neural networks generalize?

Consider the problem of learning to map inputs from a space $X$ to outputs in a space $Y$ given some finite set of training examples $M = \{(x, y)\}$. All learning algorithms are equally bad at this problem because, in general, knowing $M$ tells you nothing about what the rest of $Y^X$ looks like (where $Y^X$ denotes the set of functions from $X$ into $Y$). This is the famous no-free-lunch theorem of statistical learning theory. David Hume expressed a version of this idea long before we had the tools to give it a rigorous basis.
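
To make the counting argument behind this concrete, here is a toy sketch (the particular $X$, $Y$, and $M$ below are made up purely for illustration) that enumerates every labelling consistent with a finite training set and shows that the unseen inputs are left completely unconstrained:

```python
from itertools import product

# Toy illustration of the no-free-lunch intuition: a finite training set M
# pins down the target on the training inputs but says nothing about the rest.
X = [0, 1, 2, 3]          # input space
Y = [0, 1]                # output space
M = {0: 1, 1: 0}          # training set: two labelled examples

# Enumerate every function f: X -> Y that agrees with M.
unseen = [x for x in X if x not in M]
consistent = []
for labels in product(Y, repeat=len(unseen)):
    f = dict(M)
    f.update(dict(zip(unseen, labels)))
    consistent.append(f)

# |Y|^(|X| - |M|) = 2^2 = 4 consistent functions, and on each unseen input
# every label in Y appears equally often among them: M alone gives no
# reason to prefer one completion over another.
print(len(consistent))                      # 4
for x in unseen:
    print(x, [f[x] for f in consistent])    # labels 0 and 1 each appear twice
```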

Does this mean learning from data is impossible? Clearly not, because humans do it all the time. The key insight is that we don't care about arbitrary learning problems; rather, we care about structured learning problems, i.e., problems where the data-generating function is in some sense compressible, so that a finite number of $(x, y)$ pairs do in fact tell you something about its behavior on the rest of $X$.

Given $M$, the ideal learner would find the shortest program that generates the dataset, where "short" can be formalized as Kolmogorov complexity. This requirement essentially encodes Occam's razor (see also Solomonoff induction). However, a basic result in algorithmic information theory states that finding such a shortest program is uncomputable.
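
In symbols (the notation here is introduced just for this display: $U$ is a fixed universal Turing machine and $\ell(p)$ is the length of a program $p$), the ideal learner's objective and its uncomputable optimum can be written as

$$
\hat{p} \;=\; \arg\min_{p \,:\, U(p) = M} \ell(p),
\qquad\qquad
K(M) \;=\; \min_{p \,:\, U(p) = M} \ell(p),
$$

and no algorithm can compute $K(M)$, or recover $\hat{p}$, for arbitrary $M$.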

Does this mean learning from data is uncomputable? Not quite, because it is not necessary that our learner search the entirety of program space. Just as we get away with using finite-precision floats even though almost all real numbers are uncomputable, is there some dense subset of program space we can efficiently search through?

The last 10 years of AI research have empirically demonstrated that searching through the space of neural networks by using stochastic gradient descent to optimize a loss function over the data is a fantastic practical approximation to computing the shortest data-generating program. This fact is remarkable, and is certainly the most important discovery in the history of AI. Future generations may even judge it as the most important discovery in human history.
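
As a concrete (and deliberately minimal) sketch of what "searching through the space of neural networks with SGD" means in practice, here is a toy training loop; it assumes PyTorch, and the architecture, data, and hyperparameters are arbitrary choices for illustration, not anything prescribed by the readings below.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy dataset M: 256 noisy samples of y = sin(x) on [0, 2*pi].
x = torch.rand(256, 1) * 6.28
y = torch.sin(x) + 0.05 * torch.randn_like(x)

# The slice of "program space" we search: a small feedforward network.
model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(5000):
    idx = torch.randint(0, 256, (32,))        # a random mini-batch of M
    loss = loss_fn(model(x[idx]), y[idx])     # how badly the current weights fit it
    opt.zero_grad()
    loss.backward()                           # gradient of the loss w.r.t. the weights
    opt.step()                                # one small step through weight space
```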

It is important to think of neural networks not only as a special class of functions, such as logistic regressors or support vector machines, but as a programming paradigm. That is, a very clever programmer could manually deduce the weights that implement an arbitrary program, just as they might write lines of C or Python. Training a neural network is truly akin to searching program space. (N.B. a single feedforward neural network is a bit special compared to a C program, because defining a feedforward architecture precisely specifies the number of instructions your program's execution consists of, whereas even a very short C program can execute arbitrarily many instructions. This may seem like a great limitation, but it is not, for two reasons: 1) you can always make your neural network bigger if you need to do a bigger computation, and 2) if you really want to execute an unbounded number of instructions, there are many promising approaches, including recurrence and language model cascades.)
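
To make the "weights as a program" analogy tangible, here is a minimal sketch (the specific network and weights are chosen by hand for illustration): a two-layer ReLU network whose weights are written down directly, with no training, so that it computes XOR, exactly the way one would write a few lines of Python.

```python
import numpy as np

# A hand-written "program" stored in the weights of a two-layer ReLU network:
# it computes XOR on binary inputs, with no training involved.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])
b2 = 0.0

def xor_net(x):
    h = np.maximum(x @ W1 + b1, 0.0)   # hidden units: ReLU(x1+x2), ReLU(x1+x2-1)
    return h @ W2 + b2                 # output: h1 - 2*h2

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x, dtype=float)))   # 0, 1, 1, 0
```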

It is easy to see that the space of neural networks provides good coverage of program space, but it is a mystery why optimizing a loss function with SGD is a useful search procedure that effectively approximates looking for the shortest data-generating program. The readings in this list comprise some of our early efforts to unravel this mystery.

## Preliminaries and Overviews

| Author | Title | Year | Type | Remarks |
| --- | --- | --- | --- | --- |
| Hardt and Recht | "Generalization", in Patterns, Predictions, and Actions | 2021 | book chapter | Classical ideas in statistical learning theory and empirical phenomena in deep learning |
| Friedman, Hastie, and Tibshirani | "Model Assessment and Selection", in Elements of Statistical Learning | 2008 | book chapter | A good overview of classical learning theory. Classical learning theory explains traditional ML algorithms, but seems to contradict empirical phenomena in deep learning |
| Poggio | Theoretical issues in deep networks | 2020 | paper | By one of the founders of computational neuroscience. |
| Roberts, Yaida, and Hanin | Principles of Deep Learning Theory | 2022 | book | A tour de force. One of the few compelling analyses of the dynamics of deep networks. |

## Scaling Laws and Pre-training

| Author | Title | Year | Type | Remarks |
| --- | --- | --- | --- | --- |
| Kaplan et al. | Scaling Laws for Neural Language Models | 2020 | paper | |
| Tripuraneni et al. | On the Theory of Transfer Learning: The Importance of Task Diversity | 2020 | paper | |
| Bommasani et al. | On the Opportunities and Risks of Foundation Models | 2021 | monograph | |
| Wei et al. | Emergent Abilities of Large Language Models | 2022 | paper | |

## Optimization Dynamics

| Author | Title | Year | Type | Remarks |
| --- | --- | --- | --- | --- |
| Zhang et al. | Understanding deep learning requires rethinking generalization | 2016 | paper | |
| Zhang et al. | Theory of Deep Learning III: Generalization Properties of SGD | 2017 | monograph | |
| Jacot et al. | Neural Tangent Kernel: Convergence and Generalization in Neural Networks | 2018 | paper | |
| Wang and Isola | Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere | 2020 | paper | |
| Power et al. | Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets | 2021 | paper | |

## Deep Double Descent

| Author | Title | Year | Type | Remarks |
| --- | --- | --- | --- | --- |
| Hubinger | Understanding "Deep Double Descent" | 2019 | blog post | |
| Belkin et al. | Reconciling modern machine learning practice and the bias-variance trade-off | 2019 | paper | |
| Nakkiran et al. | Deep Double Descent: Where Bigger Models and More Data Hurt | 2019 | paper | The original deep double descent paper |
| Keskar et al. | On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima | 2017 | paper | |
| Dinh et al. | Sharp Minima Can Generalize For Deep Nets | 2017 | paper | |
| Neyshabur et al. | Exploring Generalization in Deep Learning | 2017 | paper | Scale normalization, connection between sharpness and PAC-Bayes theory. |
| Frankle and Carbin | The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks | 2018 | paper | |

## Data Manifolds

| Author | Title | Year | Type | Remarks |
| --- | --- | --- | --- | --- |
| Carlsson et al. | On the Local Behavior of Spaces of Natural Images | 2008 | paper | |
| Olah | Neural Networks, Manifolds, and Topology | 2014 | blog post | |
| Fefferman et al. | Testing the manifold hypothesis | 2016 | paper | |

## Generalization Bounds

| Author | Title | Year | Type | Remarks |
| --- | --- | --- | --- | --- |
| Dziugaite and Roy | Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data | 2017 | paper | |
| Arora et al. | Stronger generalization bounds for deep nets via a compression approach | 2018 | paper | |
| Bubeck and Sellke | A Universal Law of Robustness via Isoperimetry | 2021 | paper | Overparametrized models learn smooth functions. Distinguished among theory papers in that the result seems both quite general and quite relevant. |
| HaoChen et al. | Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss | 2021 | paper | |
