# Gradient Descent
- First-order iterative optimization algorithm for finding a local minimum of a differentiable function.

# Important Concepts in Optimization
- Generalization
- Underfitting vs. overfitting
- Cross validation
- Bias-variance tradeoff
- Bootstrapping
- Bagging and boosting

## Generalization
- How well the learned model will behave on unseen data.
- Good generalization ==> the network's performance on unseen data is expected to be close to its performance on the training data.
![image.png](attachment:image.png)

## Underfitting vs. Overfitting
- Overfitting: the model works well on the training data but not on the test data.
- Underfitting: the network is too simple or has been trained too little, so it does not fit even the training data well.
![image.png](attachment:image.png)
https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html

## Cross-validation
- Cross-validation is a model validation technique for assessing how the model will generalize to an independent (test) data set.
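- The training data is split into k folds; each fold serves once as validation data while the remaining k-1 folds are used for training.
- Cross-validation is used to find the best hyperparameters; the final model is then trained on the entire training set.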
![image.png](attachment:image.png)
https://blog.quantinsti.com/cross-validation-machine-learning-trading-models/
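
The k-fold split described above can be sketched in plain Python. This is a minimal illustration, not a library API: the helper name `k_fold_indices` and the `seed` parameter are assumptions made for this example.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Split indices 0..n_samples-1 into k (train, validation) pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, val))
    return splits

# Each sample appears in exactly one validation fold.
for train_idx, val_idx in k_fold_indices(n_samples=10, k=5):
    print(sorted(val_idx), len(train_idx))
```

In practice, a model would be trained and scored once per split, and the scores averaged to compare hyperparameter settings.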

## Bias and Variance
![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

## Bootstrapping
- Bootstrapping is any test or metric that uses random sampling with replacement.
![image.png](attachment:image.png)
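
As a rough illustration of sampling with replacement, the sketch below estimates the spread of a sample mean. The function name `bootstrap_means` and the toy data are assumptions for this example only.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Return the mean of each bootstrap resample of `data`."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Draw len(data) points with replacement from the original data.
        resample = rng.choices(data, k=len(data))
        means.append(statistics.fmean(resample))
    return means

data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8]
means = bootstrap_means(data)
# The spread of the resampled means approximates the uncertainty of the sample mean.
print(statistics.fmean(means), statistics.stdev(means))
```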

## Bagging vs. Boosting
- Bagging (**B**ootstrapping **agg**regat**ing**)
    - Multiple models are being trained with bootstrapping.
    - e.g., base classifiers are fitted on random subsets, and their individual predictions are aggregated (by voting or averaging).
    
    
- Boosting
    - It focuses on those specific training samples that are hard to classify.
    - A strong model is built by combining weak learners in sequence where each learner learns from the mistakes of the previous weak learner.

![image.png](attachment:image.png)
https://www.datacamp.com/community/tutorials/adaboost-classifier-python
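
The bagging idea can be sketched with very simple base classifiers. The example below assumes 1-D inputs and threshold "stumps" as base models; `fit_stump`, `bagging_fit`, and `bagging_predict` are hypothetical names, and boosting is not covered here.

```python
import random

def fit_stump(xs, ys):
    """Pick the threshold t that minimizes errors of the rule 'predict 1 if x > t'."""
    best_t, best_err = None, None
    for t in sorted(set(xs)):
        err = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def bagging_fit(xs, ys, n_models=25, seed=0):
    """Fit one stump per bootstrap sample of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds

def bagging_predict(thresholds, x):
    """Aggregate the individual stump predictions by majority vote."""
    votes = sum(x > t for t in thresholds)
    return int(votes * 2 > len(thresholds))

xs = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5, 1.7, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
model = bagging_fit(xs, ys)
print(bagging_predict(model, 0.3), bagging_predict(model, 1.8))
```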

# Practical Gradient Descent Methods
## Gradient Descent Methods
### Stochastic gradient descent
- Update with the gradient computed from a single sample.
    
### Mini-batch gradient descent
- Update with the gradient computed from a subset of data.
    
### Batch gradient descent
- Update with the gradient computed from the whole data.
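
The three variants differ only in how many samples feed each update. A minimal sketch, assuming a 1-D linear model `y = w * x` with mean squared error and noise-free toy data (the function names, learning rate, and data are illustrative assumptions):

```python
import random

def gradient(w, batch):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(data, batch_size, lr=0.5, epochs=100, seed=0):
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        # One parameter update per batch of `batch_size` samples.
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= lr * gradient(w, batch)
    return w

data = [(x / 10, 3.0 * x / 10) for x in range(1, 11)]  # y = 3x, no noise

w_sgd = train(data, batch_size=1)            # stochastic: one sample per update
w_mini = train(data, batch_size=4)           # mini-batch: a subset per update
w_batch = train(data, batch_size=len(data))  # batch: the whole data per update
print(w_sgd, w_mini, w_batch)
```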

## Batch-size Matters
- 'It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize.'


- 'We... present numerical evidence that supports the view that large batch methods tend to converge to **sharp minimizers** of the training and testing functions. In contrast, small-batch methods consistently converge to **flat minimizers...** this is due to the inherent noise in the gradient estimation.'
![image.png](attachment:image.png)

## (Stochastic) Gradient Descent
- Choosing the learning rate (lr) is difficult.
![image.png](attachment:image.png)
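
To see why the learning rate is hard to choose, the sketch below runs plain gradient descent on the toy objective `f(w) = w**2` (an assumption made for illustration) with three different rates.

```python
def run_gd(lr, steps=20, w0=1.0):
    """Minimize f(w) = w**2 with the update w <- w - lr * f'(w)."""
    w = w0
    for _ in range(steps):
        grad = 2 * w
        w = w - lr * grad
    return w

# Too small: barely moves. Moderate: approaches 0. Too large: diverges.
for lr in (0.001, 0.1, 1.1):
    print(lr, run_gd(lr))
```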

## Momentum
- Momentum: inertia.
- Even when the gradient oscillates a lot, training still proceeds reasonably well.
![image.png](attachment:image.png)
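
A minimal sketch of one common form of the momentum update, assuming an accumulated gradient `a` and decay factor `beta` (these names and the toy objective `f(w) = w**2` are choices made for this example):

```python
def momentum_step(w, a, grad, lr=0.1, beta=0.9):
    """One momentum update: accumulate gradients, then step along the accumulation."""
    a = beta * a + grad  # accumulated gradient carries past directions forward
    w = w - lr * a
    return w, a

w, a = 5.0, 0.0
for _ in range(300):
    grad = 2 * w  # gradient of f(w) = w**2
    w, a = momentum_step(w, a, grad)
print(w)
```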

## Nesterov Accelerated Gradient (NAG)
- The gradient is computed at a lookahead position (where the momentum step would land) rather than at the current parameters.
![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
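
The lookahead gradient can be sketched as follows. This assumes the same accumulated-gradient form as the momentum notes; `grad_fn` is a hypothetical callable that returns the gradient at a given point.

```python
def nag_step(w, a, grad_fn, lr=0.1, beta=0.9):
    """One Nesterov update: evaluate the gradient at the lookahead point."""
    lookahead = w - lr * beta * a      # where the momentum term alone would move w
    a = beta * a + grad_fn(lookahead)  # accumulate the lookahead gradient
    w = w - lr * a
    return w, a

w, a = 5.0, 0.0
for _ in range(100):
    w, a = nag_step(w, a, grad_fn=lambda v: 2 * v)  # f(w) = w**2
print(w)
```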

## Adagrad
- **Adagrad** adapts the learning rate, performing larger updates for infrequent and smaller updates for frequent parameters.
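- It tracks how much each parameter has changed so far: parameters that have changed a lot receive smaller updates, and parameters that have changed little receive larger updates.
- Because G keeps growing, the fraction (the effective learning rate) eventually approaches 0, so updates stall as training goes on.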
![image.png](attachment:image.png)
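
A minimal per-parameter sketch of the Adagrad update, assuming `G` holds the running sum of squared gradients and `eps` is a small constant for numerical stability (parameter names and defaults are illustrative):

```python
import math

def adagrad_step(w, G, grad, lr=0.5, eps=1e-8):
    """One Adagrad update on lists of parameters, accumulators, and gradients."""
    G = [g_acc + g * g for g_acc, g in zip(G, grad)]  # sum of squared gradients
    w = [wi - lr / math.sqrt(g_acc + eps) * g
         for wi, g_acc, g in zip(w, G, grad)]
    return w, G

w, G = [0.0, 0.0], [0.0, 0.0]
for step in range(3):
    # With a constant gradient the step size shrinks like lr / sqrt(step + 1),
    # and the small-gradient parameter moves as far as the large-gradient one.
    w, G = adagrad_step(w, G, grad=[1.0, 0.1])
    print(step, w)
```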

## Adadelta
- **Adadelta** extends **Adagrad** to counter its monotonically decreasing learning rate by restricting the accumulation window.
- It has no learning rate, so there is little to tune, and it is not widely used.
![image.png](attachment:image.png)
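
A rough scalar sketch of an Adadelta-style update, assuming exponential moving averages of squared gradients (`G`) and squared updates (`H`); note that no learning rate appears. The names and defaults are assumptions for this example.

```python
import math

def adadelta_step(w, G, H, grad, gamma=0.9, eps=1e-6):
    """One Adadelta update for a single scalar parameter."""
    G = gamma * G + (1 - gamma) * grad * grad    # moving average of squared gradients
    delta = -math.sqrt(H + eps) / math.sqrt(G + eps) * grad
    H = gamma * H + (1 - gamma) * delta * delta  # moving average of squared updates
    return w + delta, G, H

w, G, H = 1.0, 0.0, 0.0
for _ in range(5):
    w, G, H = adadelta_step(w, G, H, grad=2 * w)  # f(w) = w**2
    print(w)
```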

## RMSprop
- **RMSprop** is an unpublished, adaptive learning rate method proposed by Geoff Hinton in his lecture.
- Geoffrey Hinton simply tried it and found that it worked well.
![image.png](attachment:image.png)
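
A minimal scalar sketch of the RMSprop update, assuming `G` is an exponential moving average of squared gradients with decay `gamma` (names and defaults are illustrative):

```python
import math

def rmsprop_step(w, G, grad, lr=0.01, gamma=0.9, eps=1e-8):
    """One RMSprop update for a single scalar parameter."""
    G = gamma * G + (1 - gamma) * grad * grad  # moving average, not a growing sum
    w = w - lr / math.sqrt(G + eps) * grad
    return w, G

w, G = 1.0, 0.0
for _ in range(5):
    w, G = rmsprop_step(w, G, grad=2 * w)  # f(w) = w**2
    print(w)
```

Unlike Adagrad's running sum, the moving average can shrink again, so the effective step size does not decay toward zero.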

## Adam
- Adaptive Moment Estimation (Adam) leverages both past gradients and squared gradients.
- It combines momentum with scaling by the gradient magnitude.
![image.png](attachment:image.png)
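
A scalar sketch of the Adam update in its commonly published form with bias correction; the formula shown in the figure may be written slightly differently, and the parameter names and defaults here are assumptions.

```python
import math

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad         # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
for t in range(1, 6):
    w, m, v = adam_step(w, m, v, grad=2 * w, t=t)  # f(w) = w**2
    print(t, w)
```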

# Regularization
- The goal is to make the model work well on test data too, not only on the training data.
- Early stopping
- Parameter norm penalty
- Data augmentation
- Noise robustness
- Label smoothing
- Dropout
- Batch normalization

## Early stopping
- Stop training early, typically when the validation error stops improving.
![image.png](attachment:image.png)
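
One common way to implement this is a patience counter on the validation loss. The sketch below assumes the per-epoch validation losses are already available as a list; `early_stopping_epoch` and `patience` are names chosen for this example.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    stop_epoch = -1
    for epoch, loss in enumerate(val_losses):
        stop_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, stop_epoch

losses = [1.0, 0.8, 0.6, 0.55, 0.57, 0.6, 0.65, 0.5]
print(early_stopping_epoch(losses))
```

In a real training loop, the weights from `best_epoch` would be saved and restored once training stops.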

## Parameter Norm Penalty
- It adds smoothness to the function space.
- Assumption: the smoother the function, the better its generalization performance.
![image.png](attachment:image.png)
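
A minimal sketch assuming an L2 penalty of the form `0.5 * alpha * sum(w**2)` added to the loss, which contributes `alpha * w` to each gradient component. The function names and the coefficient `alpha` are assumptions for this example.

```python
def l2_penalty(weights, alpha):
    """Penalty term added to the loss: 0.5 * alpha * ||w||^2."""
    return 0.5 * alpha * sum(w * w for w in weights)

def penalized_gradient(grad, weights, alpha):
    """Gradient of (loss + L2 penalty): each weight is pulled toward zero."""
    return [g + alpha * w for g, w in zip(grad, weights)]

weights = [3.0, 4.0]
print(l2_penalty(weights, alpha=0.1))
print(penalized_gradient([0.0, 0.0], weights, alpha=0.1))
```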

## Data Augmentation
- More data is always welcome.
- With an unlimited amount of data, almost anything works well.
![image.png](attachment:image.png)

- However, in most cases, training data are given in advance.
- In such cases, we need data augmentation.
![image-2.png](attachment:image-2.png)
- Augmentations must only be applied under the condition that the label does not change.
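
A small sketch of a label-preserving augmentation, assuming an image stored as a list of pixel rows and a horizontal flip that does not change the class (for some data, such as digits or text, a flip could change the label and should not be used). The function names are hypothetical.

```python
import random

def horizontal_flip(image):
    """Mirror each row of an image given as a list of rows."""
    return [row[::-1] for row in image]

def augment(image, label, rng):
    """Randomly flip the image; the label is returned unchanged."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    return image, label

rng = random.Random(0)
image = [[1, 2, 3],
         [4, 5, 6]]
print(augment(image, "cat", rng))
```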

## Noise Robustness
- Add random noise to inputs or weights.
![image.png](attachment:image.png)
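
A minimal sketch, assuming zero-mean Gaussian noise added to each input value during training; the same helper could be applied to a list of weights. The function name and `std` parameter are choices for this example.

```python
import random

def add_gaussian_noise(values, std, rng):
    """Return a copy of `values` with zero-mean Gaussian noise added to each entry."""
    return [v + rng.gauss(0.0, std) for v in values]

rng = random.Random(0)
inputs = [0.2, 0.5, 0.9]
print(add_gaussian_noise(inputs, std=0.05, rng=rng))
```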

## Label Smoothing
- **Mix-up** constructs augmented training examples by mixing both input and output of two randomly selected training data.
![image.png](attachment:image.png)
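
Mix-up can be sketched as a convex combination of two examples. This assumes flat feature lists, one-hot labels, and a mixing weight `lam` drawn from a Beta distribution; the helper name and the Beta parameters are illustrative assumptions.

```python
import random

def mixup(x1, y1, x2, y2, lam):
    """Blend two inputs and their one-hot labels with the same weight lam."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

rng = random.Random(0)
lam = rng.betavariate(1.0, 1.0)  # mixing weight in [0, 1]
x, y = mixup([1.0, 0.0, 0.5], [1.0, 0.0], [0.0, 1.0, 0.5], [0.0, 1.0], lam)
print(lam, x, y)
```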


- **CutMix** constructs augmented training examples by mixing inputs with cut and paste and outputs with soft labels of two randomly selected training data.
![image-2.png](attachment:image-2.png)
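
A minimal CutMix sketch, assuming images stored as equally sized lists of rows and a box given explicitly by `top`, `left`, `height`, and `width` (in practice the box is sampled at random). The label weight follows the fraction of pixels kept from the first image.

```python
def cutmix(img1, y1, img2, y2, top, left, height, width):
    """Paste a box from img2 into img1 and mix the labels by area."""
    mixed = [row[:] for row in img1]
    for r in range(top, top + height):
        for c in range(left, left + width):
            mixed[r][c] = img2[r][c]
    area = len(img1) * len(img1[0])
    lam = 1 - (height * width) / area  # fraction of pixels still from img1
    label = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return mixed, label

zeros = [[0] * 4 for _ in range(4)]
ones = [[1] * 4 for _ in range(4)]
mixed, label = cutmix(zeros, [1.0, 0.0], ones, [0.0, 1.0],
                      top=0, left=0, height=2, width=2)
print(mixed, label)
```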

## Dropout
- In each forward pass, randomly set some neurons to zero.
![image.png](attachment:image.png)
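
A sketch of the widely used "inverted" dropout variant, which is an assumption here since the notes do not specify scaling: surviving activations are divided by the keep probability so that their expected value is unchanged, and nothing is dropped at inference time.

```python
import random

def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p during training (inverted dropout)."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    # Survivors are rescaled by 1 / keep to preserve the expected activation.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
print(dropout([1.0, 2.0, 3.0, 4.0], p=0.5, rng=rng))
print(dropout([1.0, 2.0, 3.0, 4.0], p=0.5, rng=rng, training=False))
```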

## Batch Normalization
- Batch normalization computes the empirical mean and variance independently for each dimension (per layer) and normalizes.
- It normalizes the statistics of the layer it is applied to.
- In classification problems, performance generally improves as layers are stacked deeper.
![image.png](attachment:image.png)
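
The per-dimension normalization can be sketched for a batch stored as a list of feature rows. This covers the training-time forward pass only; the scale `gamma`, shift `beta`, and `eps` are the usual parameters, treated here as fixed scalars for simplicity.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature column of `batch` to zero mean and unit variance."""
    n = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(n)]
    for j in range(n_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n  # biased (population) variance
        for i in range(n):
            out[i][j] = gamma * (column[i] - mean) / math.sqrt(var + eps) + beta
    return out

print(batch_norm([[1.0, 10.0], [3.0, 30.0]]))
```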


- There are different variants of normalization.
![image-2.png](attachment:image-2.png)