# Chapter 11. Practical Methodology

* Songorithm ML: Part 4 - DML [1]
* 김무성

# Contents
* 11.1 Performance Metrics
* 11.2 Default Baseline Models
* 11.3 Determining Whether to Gather More Data
* 11.4 Selecting Hyperparameters
    - 11.4.1 Manual Hyperparameter Tuning
    - 11.4.2 Automatic Hyperparameter Optimization Algorithms
    - 11.4.3 Grid Search
    - 11.4.4 Random Search
    - 11.4.5 Model-Based Hyperparameter Optimization
* 11.5 Debugging Strategies
* 11.6 Example: Multi-Digit Number Recognition

Successfully applying deep learning techniques 
* requires more than just a good knowledge of 
    - <font color="blue">what algorithms exist</font> and 
    - the <font color="blue">principles that explain how they work</font>. 
* A good machine learning practitioner also needs to know 
    - <font color="red">how to choose an algorithm for a particular application</font> and 
    - <font color="red">how to monitor and respond to feedback</font>
        - obtained from experiments in order to improve a machine learning system. 

During day-to-day development of machine learning systems, practitioners need to decide
* whether to gather more data, 
* increase or decrease model capacity, 
* add or remove regularizing features, 
* improve the optimization of a model, 
* improve approximate inference in a model, or 
* debug the software implementation of the model.

<img src="http://www.svms.org/srm/Sewell2006.png" width=600 />

<img src="http://nbviewer.jupyter.org/github/songorithm/ML/blob/master/part2/study01/dml05/figures/fig5.6.png" width=600 />

<img src="http://yosinski.com/mlss12/media/slides/MLSS-2012-Fukumizu-Kernel-Methods-for-Statistical-Learning_050.png" width=600 />

<img src="http://nbviewer.jupyter.org/github/songorithm/ML/blob/master/part2/study04/dml07/figures/cap7.21.png" width=600 />

<img src="http://nbviewer.jupyter.org/github/songorithm/ML/blob/master/part2/study04/dml07/figures/cap7.48.png" width=600 />

Most of this book is about different machine learning models, training algorithms, and objective functions. 
This may give the impression that the most important ingredient to being a machine learning expert is knowing a wide variety of machine learning techniques and being good at different kinds of math.
<font color="red">In practice, one can usually do much better with a correct application of a commonplace algorithm than by sloppily applying an obscure algorithm.</font> Correct application of an algorithm depends on mastering some fairly simple methodology.

We recommend the following practical design process:
* <font color="red">Determine your goals</font>
     - what error metric to use, and 
     - your target value for this error metric. 
     - These goals and error metrics should be driven by the problem that the application is intended to solve.
* <font color="red">Establish a working end-to-end pipeline</font> as soon as possible, 
    - including the estimation of the appropriate performance metrics.
* Instrument the system well to <font color="red">determine bottlenecks in performance</font>.
    - Diagnose which components are performing worse than expected and
    - whether it is due to 
        - overfitting, 
        - underfitting, or 
        - a defect 
            - in the data or 
            - software.
* <font color="red">Repeatedly make incremental changes</font> such as 
    - gathering new data, 
    - adjusting hyperparameters, or 
    - changing algorithms, based on specific findings from your instrumentation.

As a running example, we will use the <font color="red">Street View address number transcription system</font> (Goodfellow et al., 2014d). 
* The purpose of this application is to add buildings to Google Maps.
* Street View cars photograph the buildings and record the GPS coordinates associated with each photograph. 
* A convolutional network recognizes the address number in each photograph, allowing the Google Maps database to add that address in the correct location. 
* The story of how this commercial application was developed gives an example of how to follow the design methodology we advocate.

# 11.1 Performance Metrics

#### what level of performance you desire

Determining your goals, in terms of which <font color="blue">error metric</font> to use, is a necessary first step because your error metric will guide all of your future actions. <font color="red">You should also have an idea of what level of performance you desire</font>.

#### error

Keep in mind that for most applications, it is <font color="red">impossible to achieve absolute zero error</font>.

#### training data

<font color="red">The amount of training data can be limited</font> for a variety of reasons.
* Data collection can require time, money, or human suffering.

#### reasonable level of performance to expect

How can one determine a <font color="red">reasonable level of performance to expect</font>?
* Typically, in the academic setting, 
    - we have some estimate of the error rate that is attainable based on <font color="blue">previously published benchmark results</font>. 
* In the real-world setting, 
    - we have some idea of the error rate that is <font color="blue">necessary for an application</font> to be safe, cost-effective, or appealing to consumers.

#### common performance metrics

Another important consideration besides the target value of the performance metric is the <font color="red">choice of which metric to use</font>.
* Several different performance metrics may be used to measure the effectiveness of a complete application that includes machine learning components. 
* These performance metrics are usually different from the cost function used to train the model.
* As described in Sec. 5.1.2, it is <font color="blue">common to measure</font> 
    - the accuracy, or equivalently, 
        <img src="http://www.welaptega.com/wp-content/uploads/2014/09/testing-accuracy.jpg" width=300 />
    - the error rate, of a system.
        - 1 - accuracy

#### more advanced metrics

##### Reference
* [2] Performance measures in Azure ML: Accuracy, Precision, Recall and F1 Score. - https://blogs.msdn.microsoft.com/andreasderuiter/2015/02/09/performance-measures-in-azure-ml-accuracy-precision-recall-and-f1-score/
* [3] Using ROC plots and the AUC measure in Azure ML - https://blogs.msdn.microsoft.com/andreasderuiter/2015/02/09/using-roc-plots-and-the-auc-measure-in-azure-ml/

However, many applications require <font color="red">more advanced metrics</font>.
* e-mail spam detection
    - Rather than measuring the error rate of a spam classifier, we may wish to measure some form of total cost, where the cost of blocking legitimate messages is higher than the cost of allowing spam messages.
* a binary classifier that is intended to detect some rare event.
    - example : medical test for a rare disease
    - precision and recall
        <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/2000px-Precisionrecall.svg.png" width=600 />
        - PR curve
            <img src="http://blogs.msdn.com/cfs-filesystemfile.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/1588.image_5F00_thumb_5F00_1E8173C0.png" width=600 />
        - F-score: the F1 score is the harmonic mean of precision p and recall r, F = 2pr / (p + r)
            <img src="figures/cap11.1.png" width=600 />
        - Another option is to report the total area lying beneath the PR curve
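
To make the rare-event case concrete, here is a minimal sketch in plain Python of precision, recall, and the F-score. The function name `precision_recall_f1` and the ten-patient data set are made up for illustration and do not come from the book. The point is that a classifier that always answers "no disease" looks good on accuracy while detecting nothing.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels, where 1 marks the rare event."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical data: 2 of 10 patients have the disease.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10                     # never reports the disease
detector = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]      # one hit, one miss, one false alarm

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
print(accuracy)                                      # 0.8
print(precision_recall_f1(y_true, always_negative))  # (0.0, 0.0, 0.0)
print(precision_recall_f1(y_true, detector))         # (0.5, 0.5, 0.5)
```

Treating an undefined precision or recall (zero denominator) as 0.0 is a convention chosen for this sketch; other conventions are possible.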
            

#### decision criteria

In some applications, it is possible for the machine learning system to refuse to make a decision. This is useful when the machine learning algorithm can estimate <font color="red">how confident it should be about a decision</font>, especially if a wrong decision can be harmful and if a human operator is able to occasionally take over.
* coverage
    - Coverage is the <font color="red">fraction of examples for which the machine learning system is able to produce a response</font>
    - One can always obtain 100% accuracy by refusing to process any example, but this reduces the coverage to 0%. 
    - For the Street View task, the goal for the project was to reach human-level transcription accuracy while maintaining 95% coverage.
    - Human-level performance on this task is 98% accuracy.
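
The trade-off between accuracy and coverage can be sketched as follows. This assumes the model emits a confidence score per example and refuses to answer below a threshold. The function name, the threshold values, and the five-example data set are all hypothetical.

```python
def accuracy_and_coverage(confidences, predictions, labels, threshold):
    """Answer only when confidence >= threshold.

    Returns (accuracy on the answered examples, coverage). Accuracy is None
    when the system refuses every example.
    """
    answered = [(p, y) for c, p, y in zip(confidences, predictions, labels)
                if c >= threshold]
    coverage = len(answered) / len(labels)
    if not answered:
        return None, coverage
    accuracy = sum(p == y for p, y in answered) / len(answered)
    return accuracy, coverage


# Hypothetical outputs of a digit transcriber on five images.
confidences = [0.99, 0.95, 0.90, 0.60, 0.55]
predictions = [3, 7, 1, 4, 9]
labels = [3, 7, 1, 5, 9]

print(accuracy_and_coverage(confidences, predictions, labels, 0.0))  # (0.8, 1.0)
print(accuracy_and_coverage(confidences, predictions, labels, 0.7))  # (1.0, 0.6)
```

Raising the threshold improves accuracy on the answered examples at the price of coverage, which is why a target such as the Street View one fixes both numbers at once.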

#### other metrics

Many other metrics are possible. 
* We can, for example, measure <font color="blue">click-through rates</font>, 
    - collect <font color="blue">user satisfaction</font> surveys, and so on. 
        <img src="http://www.mailigen.com/blog/wp-content/uploads/2013/08/350x300px.png" width=600 />
* Many specialized application areas have <font color="red">application-specific criteria</font> as well.

#### Without clearly defined goals

What is important is to determine which performance metric to improve ahead of time, then <font color="blue">concentrate on improving this metric</font>. <font color="red">Without clearly defined goals, it can be difficult to tell whether changes to a machine learning system make progress or not</font>.

# 11.2 Default Baseline Models

After choosing performance metrics and goals, the next step in any practical application is to <font color="red">establish a reasonable end-to-end system as soon as possible</font>. In this section, we provide recommendations for which algorithms to use as the first baseline approach in various situations.

#### without using deep learning

* Depending on the complexity of your problem, <font color="red">you may even want to begin without using deep learning</font>.
* If your problem has a chance of being solved by just choosing a few linear weights correctly, you may want to <font color="red">begin with a simple statistical model like logistic regression</font>.

#### using deep learning

* If you know that your problem falls into an “AI-complete” category like 
    - object recognition, 
    - speech recognition, 
    - machine translation, and so on, 
* then you are likely to do well by <font color="red">beginning with an appropriate deep learning model</font>.

#### model 

* First, choose the <font color="blue">general category</font> of model <font color="red">based on the structure of your data</font>.
    - If you want to perform <font color="blue">supervised learning with fixed-size vectors as input</font> <font color="red"> -> a feedforward network with fully connected layers</font>
    <img src="http://cs231n.github.io/assets/nn1/neural_net2.jpeg" width=300 />
    - If the <font color="blue">input has known topological structure</font> (for example, if the input is an image) <font color="red"> -> CNN</font>
        <img src="http://deeplearning.net/tutorial/_images/mylenet.png" width=600 />
        - In these cases, you should begin by using some kind of <font color="red">piecewise linear unit</font> 
            ##### Reference
            * [6] L1 : Deep Neural Networks (Udacity) - https://drive.google.com/file/d/0B3vuuoFuJsKWdFFkMS10N1BpLTg/view

            - ReLUs or 
            <img src="http://nn.readthedocs.org/en/rtd/image/relu.png" width=300 />
            - their generalizations like 
                - Leaky ReLUs, 
                <img src="http://lamda.nju.edu.cn/weixs/project/CNNTricks/imgs/relufamily.png" width=600 />
                - PReLUs and 
                    ##### Reference 
                    * [4] Benchmarking ReLU and PReLU using MNIST and Theano - http://gforge.se/2015/06/benchmarking-relu-and-prelu/
                    <img src="http://gforge.se/wp-content/uploads/2015/05/PReLU.jpg" width=400 />
                - maxout.
                    ##### Reference
                    * [5] Maxing out the digits - http://fastml.com/maxing-out-the-digits/
                    <img src="http://fastml.com/images/pylearn2/digits/maxout.png" width=400 />
    - If your <font color="blue">input or output is a sequence</font> <font color="red"> -> gated recurrent net (LSTM or GRU)</font>
        <img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png" width=600 />
        <img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png" width=600 />
        <img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-var-GRU.png" width=600 />

#### optimization algorithm
* A reasonable choice of optimization algorithm is 
    - SGD 
        - <font color="blue">with momentum</font> 
            - <font color="blue">with a decaying learning rate</font> 
            (popular decay schemes that perform better or worse on different problems include decaying linearly until reaching a fixed minimum learning rate, decaying exponentially, or decreasing the learning rate by a factor of 2-10 each time validation error plateaus)
            #### Reference
                - [7] Deeplearning4j Updaters Explained - http://deeplearning4j.org/updater
    <img src="http://deeplearning4j.org/img/updater_1.png" width=300 />
    <img src="http://deeplearning4j.org/img/updater_2.png" width=300 />
    <img src="http://i.ytimg.com/vi/s6jC7Wc9iMI/0.jpg" width=300 />
* Another very reasonable alternative is 
    - Adam
    ##### Reference
        - [8] An overview of gradient descent optimization algorithms - http://sebastianruder.com/optimizing-gradient-descent/index.html#adam
* Batch normalization     
    - can have a dramatic effect on optimization performance,
    - especially for 
        - convolutional networks and 
        - networks with sigmoidal nonlinearities.
    - While it is reasonable to omit batch normalization from the very first baseline, it should be introduced quickly if optimization appears to be problematic.
    
    ##### Reference
    * [9] Directions in Convolutional Neural Networks at Google - http://vision.stanford.edu/teaching/cs231n/slides/jon_talk.pdf
    * [10] Batch Normalization (ICML 2015) - http://sanghyukchun.github.io/88/
    <img src="http://sanghyukchun.github.io/images/post/88-1.jpg" width=300 />
    <img src="http://sanghyukchun.github.io/images/post/88-2.png" width=400/>
    <img src="http://sanghyukchun.github.io/images/post/88-5.png" width=400 />
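
The three learning rate decay schemes named under SGD above can be sketched in plain Python as follows. Everything here is illustrative: the function and class names, the parameter names, and the default values (`factor=10.0`, `patience=2`) are assumptions for this sketch, not an API of any particular library.

```python
def linear_decay(step, lr0, lr_min, decay_steps):
    """Decay linearly from lr0 to lr_min over decay_steps, then hold lr_min."""
    frac = min(step / decay_steps, 1.0)
    return (1.0 - frac) * lr0 + frac * lr_min


def exponential_decay(step, lr0, decay_rate, decay_steps):
    """Multiply the learning rate by decay_rate once per decay_steps steps."""
    return lr0 * decay_rate ** (step / decay_steps)


class ReduceOnPlateau:
    """Divide the learning rate by `factor` when validation error stops improving."""

    def __init__(self, lr0, factor=10.0, patience=2):
        self.lr = lr0
        self.factor = factor      # the text suggests a factor between 2 and 10
        self.patience = patience  # epochs without improvement before decaying
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_error):
        """Record one epoch's validation error and return the current rate."""
        if val_error < self.best:
            self.best = val_error
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr /= self.factor
                self.bad_epochs = 0
        return self.lr


print(linear_decay(50, lr0=0.1, lr_min=0.001, decay_steps=100))
print(exponential_decay(100, lr0=0.1, decay_rate=0.5, decay_steps=100))

scheduler = ReduceOnPlateau(lr0=0.1)
for err in [0.50, 0.40, 0.40, 0.40]:  # hypothetical validation errors
    current_lr = scheduler.step(err)
print(current_lr)
```

In the loop, the validation error stops improving after the second epoch, so the plateau scheduler cuts the rate once the patience of two epochs is used up.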
    

#### regularization
* <font color="red">Unless your training set contains tens of millions</font> of examples or more, you should include <font color="red">some mild forms of regularization from the start</font>.
    - Early stopping
        <img src="http://deeplearning4j.org/img/earlystopping.png" width=300 />
    - Dropout
        <img src="http://engineering.flipboard.com/assets/convnets/dropout.png" width=300 />
    - Batch normalization
* <font color="red">If your task is similar to another task that has been studied</font> extensively, you will probably do well by ﬁrst <font color="red">copying the model and algorithm that is already known to perform best on the previously studied task</font>.
    - For example, it is common to use the features from a convolutional network trained on ImageNet to solve other computer vision tasks.
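
Of the mild regularizers listed above, early stopping is the easiest to sketch without a framework. The function below is a hypothetical helper that only scans a list of per-epoch validation errors. A real training loop would also save the model parameters at the best epoch and restore them at the end. The `patience` value and the error sequence are made up.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (best_epoch, best_error).

    Scanning stops once `patience` consecutive epochs pass without a new
    best validation error.
    """
    best_epoch, best_error = None, float("inf")
    stale = 0  # epochs since the last improvement
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_epoch, best_error, stale = epoch, err, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_error


# Hypothetical validation curve: improves, then starts to overfit.
val_errors = [0.90, 0.60, 0.50, 0.52, 0.55, 0.57, 0.40]
print(early_stopping_epoch(val_errors, patience=3))  # (2, 0.5)
```

With `patience=3` the scan stops at epoch 5 and never sees the later value 0.40, which shows the risk of setting the patience too low.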

#### unsupervised learning
* A common question is whether to begin by using unsupervised learning, described further in Part III.
* This is somewhat domain specific.
    - NLP
        - Some domains, such as natural language processing, are known to benefit tremendously from unsupervised learning techniques such as learning unsupervised word embeddings.
    - Computer vision
        - In other domains, such as computer vision, current unsupervised learning techniques do not bring a benefit, except in the semi-supervised setting, when the number of labeled examples is very small.
* If your application is in a context where unsupervised learning is known to be important, then include it in your first end-to-end baseline.
* Otherwise, only use unsupervised learning in your first attempt if the task you want to solve is unsupervised. 
    - You can always try adding unsupervised learning later if you observe that your initial baseline overfits.

# 11.3 Determining Whether to Gather More Data

<font color="blue">After the first end-to-end system is established</font>, it is time to measure the performance of the algorithm and determine how to improve it. Many machine learning novices are tempted to make improvements by trying out many different algorithms. However, <font color="red">it is often much better to gather more data</font> than to improve the learning algorithm.

<font color="red">How does one decide whether to gather more data?</font>
* First, determine <font color="green">whether the performance on the training set is acceptable</font>.
    -  <font color="red">If performance on the training set is poor</font>, 
        - the learning algorithm is not using the training data that is already available, so there is no reason to gather more data.
        - Instead, 
            - <font color="red">try increasing the size of the model</font> 
                - by adding more layers or adding more hidden units to each layer.
            - Also, <font color="red">try improving the learning algorithm</font>, 
                - for example by tuning the learning rate hyperparameter. 
        - <font color="red">If large models and carefully tuned optimization algorithms do not work well</font>, then the problem might be the <font color="red">quality of the training data</font>.
            - The data may be too noisy or may not include the right inputs needed to predict the desired outputs. 
            - This suggests starting over, collecting cleaner data or collecting a richer set of features.
    - <font color="blue">If the performance on the training set is acceptable</font>,  
        - then <font color="blue">measure the performance on a test set</font>. 
* <font color="blue">If the performance on the test set is also acceptable</font>,
    - then there is nothing left to be done.
* <font color="purple">If test set performance is much worse than training set performance</font>,
    - then <font color="purple">gathering more data</font> is one of the most effective solutions.
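
The checklist above can be read as a small decision procedure. The sketch below makes one simplifying assumption that is not in the text: "acceptable" means "at or below a single target error rate" for both the training set and the test set. The function name and the returned labels are hypothetical.

```python
def next_step(train_error, test_error, target_error):
    """Suggest the next action from training and test error rates."""
    if train_error > target_error:
        # Poor training performance: the model is not using the data it
        # already has, so enlarge the model or tune the optimizer first.
        # If that also fails, suspect the quality of the training data.
        return "improve_training_fit"
    if test_error <= target_error:
        # Both training and test performance are acceptable.
        return "done"
    # Training error is acceptable, but test error is worse than the target.
    return "gather_more_data"


print(next_step(train_error=0.30, test_error=0.35, target_error=0.05))
print(next_step(train_error=0.02, test_error=0.04, target_error=0.05))
print(next_step(train_error=0.02, test_error=0.20, target_error=0.05))
```

The order of the checks matters: test error is only consulted once the training error is acceptable, mirroring the order of the bullets.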

#### gathering more data

* The key considerations are the cost and feasibility of gathering more data, the cost and feasibility of reducing the test error by other means, and the amount of data that is expected to be necessary to improve test set performance significantly.
* A simple alternative to gathering more data is to 
    - reduce the size of the model or 
    - improve regularization, 
        - by adjusting hyperparameters such as 
            - weight decay coefficients, or 
        - by adding regularization strategies such as 
            - dropout. 
     - If you find that the gap between train and test performance is still unacceptable even after tuning the regularization hyperparameters, 
         - then gathering more data is advisable.
* When deciding whether to gather more data, 
    - it is also necessary to <font color="red">decide how much to gather</font>. It is helpful to plot curves showing the relationship between training set size and generalization error, like in Fig. 5.4.
    - By extrapolating such curves, one can predict how much additional training data would be needed to achieve a certain level of performance. 
    - Usually, adding a small fraction of the total number of examples will not have a noticeable impact on generalization error. 
    - It is therefore recommended to experiment with training set sizes on a logarithmic scale, for example doubling the number of examples between consecutive experiments.
    <img src="http://nbviewer.jupyter.org/github/songorithm/ML/blob/master/part2/study01/dml05/figures/fig5.4.png" width=600 />
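
A rough sketch of the extrapolation step follows. It assumes, purely for illustration, that generalization error follows a power law `error = a * n**(-b)` in the training set size `n`. The text does not prescribe a curve family, and a real learning curve flattens out at some irreducible error that this model ignores. The error values are hypothetical.

```python
import math


def doubling_sizes(start, maximum):
    """Training set sizes on a logarithmic scale: start, 2*start, 4*start, ..."""
    sizes = []
    n = start
    while n <= maximum:
        sizes.append(n)
        n *= 2
    return sizes


def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) - b * log(n); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def examples_needed(a, b, target_error):
    """Smallest n with a * n**(-b) <= target_error under the fitted model."""
    return math.ceil((a / target_error) ** (1.0 / b))


sizes = doubling_sizes(1000, 8000)      # [1000, 2000, 4000, 8000]
errors = [0.20, 0.16, 0.128, 0.1024]    # hypothetical measured test errors
a, b = fit_power_law(sizes, errors)
print(sizes)
print(examples_needed(a, b, target_error=0.05))
```

In this made-up example each doubling removes only a fifth of the error, so reaching 5% requires several more doublings rather than a small top-up of the current data set.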

If gathering much more data is not feasible, the only other way to improve generalization error is to <font color="red">improve the learning algorithm itself</font>. 
* This becomes the domain of research and not the domain of advice for applied practitioners.

# 11.4 Selecting Hyperparameters
* 11.4.1 Manual Hyperparameter Tuning
* 11.4.2 Automatic Hyperparameter Optimization Algorithms
* 11.4.3 Grid Search
* 11.4.4 Random Search
* 11.4.5 Model-Based Hyperparameter Optimization

## 11.4.1 Manual Hyperparameter Tuning

<img src="figures/cap11.2.png" width=600 />

<img src="figures/cap11.3.png" width=600 />

## 11.4.2 Automatic Hyperparameter Optimization Algorithms

<img src="figures/cap11.4.png" width=600 />

## 11.4.3 Grid Search

## 11.4.4 Random Search

<img src="figures/cap11.5.png" width=600 />

## 11.4.5 Model-Based Hyperparameter Optimization

# 11.5 Debugging Strategies

<img src="figures/cap11.6.png" width=600 />

<img src="figures/cap11.7.png" width=600 />

<img src="figures/cap11.8.png" width=600 />

# 11.6 Example: Multi-Digit Number Recognition

# References


* [1] Chapter 11. Practical Methodology(in Bengio's deep learning book) -  http://www.deeplearningbook.org/contents/guidelines.html
* [2] Performance measures in Azure ML: Accuracy, Precision, Recall and F1 Score. - https://blogs.msdn.microsoft.com/andreasderuiter/2015/02/09/performance-measures-in-azure-ml-accuracy-precision-recall-and-f1-score/
* [3] Using ROC plots and the AUC measure in Azure ML - https://blogs.msdn.microsoft.com/andreasderuiter/2015/02/09/using-roc-plots-and-the-auc-measure-in-azure-ml/
* [4] Benchmarking ReLU and PReLU using MNIST and Theano - http://gforge.se/2015/06/benchmarking-relu-and-prelu/
* [5] Maxing out the digits - http://fastml.com/maxing-out-the-digits/
* [6] L1 : Deep Neural Networks (Udacity) - https://drive.google.com/file/d/0B3vuuoFuJsKWdFFkMS10N1BpLTg/view
* [7] Deeplearning4j Updaters Explained - http://deeplearning4j.org/updater
* [8] An overview of gradient descent optimization algorithms - http://sebastianruder.com/optimizing-gradient-descent/index.html#adam
* [9] Directions in Convolutional Neural Networks at Google - http://vision.stanford.edu/teaching/cs231n/slides/jon_talk.pdf
* [10] Batch Normalization (ICML 2015) - http://sanghyukchun.github.io/88/