# Art and Science of Machine Learning

- You'll learn about aspects of machine learning that require some intuition, good judgment, and experimentation. We call it the art of ML

- We will learn the many knobs and levers involved in training a model. You will manually adjust them to see their effects on model performance

- You will learn how to tune them in an automatic way

- You'll be ready to spice things up with a pinch of science

## The Art of ML

- The goal is to generalize a model so that it performs well on unseen test data, not just on training data

- There are many strategies used in ML to solve this problem. They are collectively known as regularization

### Regularization

- While training a model, we will apply the Ockham's Razor principle as our heuristic guide, favoring simpler models that make fewer assumptions about the training data

- We need to find the right balance between simplicity and accurate fitting of the training data

- Other regularization techniques include dataset augmentation, noise robustness, sparse representations, and many more

- Regularization refers to any technique that helps generalize a model. A generalized model performs well not just on training data but also on never-before-seen test data.

### L1 & L2 Regularizations

- L1 and L2 regularization methods represent model complexity as the magnitude of the weight vector, and try to keep that in check

- L1 regularization results in a solution that's more sparse. Sparsity in this context refers to the fact that some of the weights end up having the optimal value of zero

- This property of L1 regularization is used extensively as a feature selection mechanism. Feature selection simplifies the ML problem by causing a subset of the weights to become zero
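
As a rough illustration of how both penalties measure the magnitude of the weight vector, the sketch below computes them in plain Python. The function names and the regularization rate `lam` are illustrative assumptions and do not come from the course material.

```python
def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of the absolute weight values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of the squared weight values."""
    return lam * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam, kind="l2"):
    """Add the chosen penalty to the data loss (hypothetical helper)."""
    if kind == "l1":
        return data_loss + l1_penalty(weights, lam)
    return data_loss + l2_penalty(weights, lam)


# Example weight vector (made-up values); one weight is already zero.
weights = [0.5, -2.0, 0.0, 1.5]
```

Either penalty grows with the size of the weights, so minimizing the regularized loss trades data fit against model complexity.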

## Learning rate and batch size

- The learning rate controls the size of the step in weight space. If the steps are too small, training will take a long time. On the other hand, if the steps are too large, training will bounce around and could even miss the optimal point

- The gradients are computed within the batch. If the batch is not representative, the loss will jump around too much from batch to batch

- The batch size controls the number of samples that the gradient is calculated on. If the batch size is too small, we could be bouncing around because the batch may not be a good enough representation of the input. On the other hand, if the batch size is too large, training will take a very long time.

- As a rule of thumb, 40 to 100 tends to be a good range for batch size. It can go up to as high as 500
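
To make the effect of step size concrete, here is a minimal sketch that runs plain gradient descent on the toy function f(w) = w², whose gradient is 2w. The toy function, starting point, and the three learning rates are assumptions chosen for illustration only.

```python
def gradient_descent(learning_rate, steps=20, w0=5.0):
    """Minimize f(w) = w**2 (gradient 2*w) and return the final weight."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w


too_small = gradient_descent(0.001)  # tiny steps: barely moves from w0
reasonable = gradient_descent(0.4)   # converges toward the minimum at 0
too_large = gradient_descent(1.1)    # overshoots more on every step
```

With these assumed settings, the smallest rate has hardly left the starting point after 20 steps, the middle one ends very close to zero, and the largest ends farther from the minimum than it started.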

### Optimization

- Gradient descent tries to find the minimum of the loss function by altering the weight values.

- Momentum reduces learning rate when gradient values are small

- AdaGrad gives frequently occurring features lower learning rates

- AdaDelta improves AdaGrad by avoiding reducing learning rate to zero

- Adam is basically AdaGrad with a bunch of fixes

- FTRL, or Follow the Regularized Leader, works well on wide models. At this time, Adam and FTRL make good defaults for deep neural networks as well as linear models
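
The claim that AdaGrad gives frequently occurring features lower learning rates can be sketched as follows. This is a simplified version of the AdaGrad update written for illustration; the two-feature setup, the gradient values, and the helper names are assumptions.

```python
import math


def adagrad_update(weights, grads, accum, learning_rate=0.1, eps=1e-8):
    """One AdaGrad step: accumulate squared gradients, then scale each step."""
    for i, g in enumerate(grads):
        accum[i] += g * g
        weights[i] -= learning_rate * g / (math.sqrt(accum[i]) + eps)
    return weights, accum


def effective_rates(accum, learning_rate=0.1, eps=1e-8):
    """Per-weight learning rate implied by the accumulated squared gradients."""
    return [learning_rate / (math.sqrt(a) + eps) for a in accum]


weights = [0.0, 0.0]
accum = [0.0, 0.0]
for step in range(100):
    # Feature 0 is active on every step; feature 1 only on every tenth step.
    grads = [1.0, 1.0 if step % 10 == 0 else 0.0]
    adagrad_update(weights, grads, accum)

rates = effective_rates(accum)
```

After the loop, the frequently updated weight has the larger accumulator and therefore the smaller effective learning rate.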

### Hand-Tuning ML Models

- Instead of setting the number of epochs, you need to define the number of steps. This is because the number of epochs is not failure-friendly in distributed training

- There are a few rules of thumb that may help guide you. When you monitor your training error, it should steadily decrease, typically steeply at first, and then eventually plateau as the training converges.

- If the training has not converged, try running it for longer. If the training error decreases too slowly, increasing the learning rate may help it decrease faster

- But sometimes, the exact opposite may happen if the learning rate is too high. If the training error varies widely, try decreasing the learning rate

- Lowering the learning rate, plus a larger number of steps or a larger batch size, is often a good combination

- Very small batch sizes can also cause instability. First, try larger values, like a hundred or a thousand, and decrease until you see degradation.
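
When switching from epochs to steps, a conversion like the one below can help keep the amount of data seen roughly constant while you change the batch size. The helper name is hypothetical, and rounding up is an assumption about how a partial final batch should be counted.

```python
def steps_for_epochs(num_examples, batch_size, num_epochs):
    """Number of training steps needed to see the data num_epochs times."""
    total_examples = num_examples * num_epochs
    # Integer ceiling division so a partial final batch still counts as a step.
    return (total_examples + batch_size - 1) // batch_size


steps_small_batch = steps_for_epochs(10000, 100, 5)
steps_large_batch = steps_for_epochs(10000, 500, 5)
```

Increasing the batch size by a factor of five reduces the step count by the same factor for the same number of passes over the data.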

## Hyperparameter Tuning

- A parameter is a real-valued variable that changes during model training, like all those weights and biases that we've come to know so well

- A hyper-parameter, on the other hand, is a setting that we set before training, and it doesn't change afterwards. Examples of hyper-parameters are learning rate, regularization rate, batch size, number of hidden layers in the neural net, and number of neurons in each layer.

### Think Beyond Grid Search

- ML Engine nicely abstracts away the process of hyperparameter tuning. All you need to do to use this service is as follows

- First, you need to express the hyperparameters in need of tuning as command-line arguments

- Then, you need to ensure different iterations of training don't clobber each other

- Finally, you'll need to supply those hyperparameters to the training job

- To supply hyperparameters when submitting a training job, first create a YAML file that describes them, then supply the path to the YAML file via a command-line parameter to the `gcloud` ML Engine command
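
The first two requirements above can be sketched with the standard library alone. The flag names (`--learning_rate`, `--train_batch_size`, `--output_dir`) are hypothetical, and reading a trial identifier from the `task` entry of the `TF_CONFIG` environment variable is an assumption about how the service exposes it; check the service documentation before relying on it.

```python
import argparse
import json
import os


def parse_args(argv=None):
    """Expose each tunable hyperparameter as a command-line argument."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=0.01)
    parser.add_argument("--train_batch_size", type=int, default=64)
    parser.add_argument("--output_dir", required=True)
    return parser.parse_args(argv)


def trial_output_dir(output_dir, tf_config=None):
    """Append the trial ID so different trials write to different folders."""
    if tf_config is None:
        tf_config = os.environ.get("TF_CONFIG", "{}")
    # Assumed layout: {"task": {"trial": "<id>"}}; falls back to no suffix.
    trial = json.loads(tf_config).get("task", {}).get("trial", "")
    return os.path.join(output_dir, trial) if trial else output_dir
```

Keeping each trial's checkpoints in its own subdirectory is what prevents one training iteration from clobbering another.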

## Regularization for sparsity

- We learned about L2 regularization and how it keeps the magnitudes of parameter weights small, as well as how learning rate and batch size affect training

- We also learned how to perform hyperparameter tuning, which is an outer automation loop that tries to find the hyperparameters that give the best-generalizing model


### Deep Dive Regularization for sparsity

- L1 regularization adds the sum of the absolute values of the parameter weights to the loss function, which tends to force the weights of not very predictive features to zero

- L1 Regularization really helped prune our complex model down into a much smaller generalizable model.

- This acts as a built-in feature selector by killing all bad features and leaving only the strongest in the model

- L2 regularization adds the sum of the squared parameter weights to the loss function

- This is great at keeping weights small, providing stability and a unique solution, but it can leave the model unnecessarily large and complex, since all of the features may still remain, albeit with small weights

- A sparse model, such as one produced by L1 regularization, has many benefits. First, with fewer coefficients to store and load, there is a reduction in storage and memory needed with a much smaller model size, which is especially important for embedded models. Also, with fewer features, there are a lot fewer multiply-adds, which not only leads to increased training speed but, more importantly, increased prediction speed.

- In practice, though, the L2 norm usually provides more generalizable models than the L1 norm. However, we will end up with much more complex, heavier models if we use L2 instead of L1
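
One way to see why L1 produces exact zeros while L2 only shrinks weights is to compare the two shrinkage steps side by side. The sketch below uses soft-thresholding, a standard proximal step for an L1 penalty, and multiplicative decay for L2. It is an illustration under assumed values, not necessarily what the course's optimizer does internally.

```python
def l1_shrink(w, threshold):
    """Soft-thresholding: move w toward zero by threshold, clamping at zero."""
    if w > threshold:
        return w - threshold
    if w < -threshold:
        return w + threshold
    return 0.0


def l2_shrink(w, factor):
    """Multiplicative decay: scale w toward zero without ever reaching it."""
    return w * (1.0 - factor)


weights = [3.0, -0.05, 0.02, -1.5]  # made-up weights, two of them tiny
l1_weights = [l1_shrink(w, 0.1) for w in weights]
l2_weights = [l2_shrink(w, 0.1) for w in weights]
```

The two tiny weights become exactly zero under the L1 step, while the L2 step keeps all four weights nonzero, only smaller.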

## Logistic Regression

### Saturation

- Since we use the derivative in backpropagation to update the weights, it is important for the gradient not to become zero, or else training will stop. Saturation occurs when all activations end up in the flat plateaus of the sigmoid, which leads to a vanishing gradient problem and makes training difficult
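
A quick numerical check shows how small the sigmoid's gradient becomes on its plateaus. This is a minimal sketch; the input values are arbitrary.

```python
import math


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


grad_center = sigmoid_grad(0.0)    # steepest point of the curve
grad_plateau = sigmoid_grad(10.0)  # far out on the upper plateau
```

The gradient peaks at 0.25 in the center and is several orders of magnitude smaller on the plateau, which is why saturated units barely update.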

### Benefit of Logistic Regression

- The great thing about logistic regression is that it already outputs a calibrated probability estimate, since the sigmoid function is the cumulative distribution function of the logistic probability distribution. This allows us to actually predict probabilities instead of just binary answers like yes or no, true or false, buy or sell, et cetera

### Making Logistic Regression Better

- Adding regularization to logistic regression helps keep the model simpler by having smaller parameter weights

- To find the optimal L1 and L2 hyperparameter choices during hyperparameter tuning, you're searching for the point in the validation loss function where you obtain the lowest value

- You can tune your choice of threshold to optimize the metric of your choice. Is there any easy way to help us do this? A Receiver Operating Characteristic curve, or ROC curve for short, shows how a given model's predictions create different true positive versus false positive rates when different decision thresholds are used

- Adding penalty terms to the objective function, such as L1 regularization for sparsity and L2 regularization for keeping model weights small, and adding early stopping can help counteract overfitting

- It is also important to choose a tuned threshold for deciding what action to take based on your probability estimate output
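
The relationship between the decision threshold and the points on an ROC curve can be sketched as below. The labels, scores, and thresholds are made-up illustrative data, and the helper name is hypothetical.

```python
def roc_point(labels, scores, threshold):
    """Return (false_positive_rate, true_positive_rate) at one threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives


labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]  # predicted probabilities

# Each threshold yields one (FPR, TPR) point; sweeping it traces the curve.
roc = [roc_point(labels, scores, t) for t in (0.3, 0.5, 0.75)]
```

Raising the threshold in this example lowers the false positive rate, at the cost of missing one of the three positives.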

## Neural Networks

- You can imagine much more complicated spaces than even this spiral that really necessitate the use of neural networks. Neural networks can help as an alternative to feature crossing by combining features

### Linear to Non Linear Neural Networks

- You're probably thinking now, "Hey, I thought neural networks are all about adding layers upon layers of neurons. How can I do deep learning when all of my layers collapse into just one?"

- The solution is adding a non-linear transformation layer, which is facilitated by a nonlinear activation function such as sigmoid, tanh, or ReLU

- Usually, neural networks have nonlinear layers for the first n minus one layers and then have the final layer transformation be linear for regression, or sigmoid or softmax (which we'll talk about soon) for classification

- Without a nonlinearity, whenever two or more linear layers appear consecutively, they can always be collapsed back into one layer, no matter how many there are
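
The collapse of consecutive linear layers can be verified numerically: applying two weight matrices in sequence gives the same result as applying their product once. The matrices and input below are arbitrary illustrative values, and biases are omitted for brevity.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply matrix a by matrix b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in v]


W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

two_layers = matvec(W2, matvec(W1, x))       # two linear layers in sequence
collapsed = matvec(matmul(W2, W1), x)        # one layer with weights W2 @ W1
with_relu = matvec(W2, relu(matvec(W1, x)))  # a nonlinearity in between
```

The first two results are identical, while inserting the ReLU produces a different output that no single linear layer built from `W2 @ W1` reproduces.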

### Non Linear Activation Importance

- Why is it important adding non-linear activation functions to neural networks? The correct answer is because it stops the layers from collapsing back into just a linear model.

## Training Neural Networks

- There are some interesting failure cases to talk about when training, though, such as vanishing gradients, exploding gradients, and dead layers

### Vanishing Gradients

- First, during the training process, especially for deep networks, gradients can vanish. Each additional layer in your network can successively reduce signal versus noise.

- A simple way to fix this is to use non saturating non-linear activation functions such as ReLUs, ELUs, et cetera

### Exploding Gradients

- Second, we can also have the opposite problem, where gradients explode by getting bigger and bigger until our weights get so large that we overflow.

- There are many techniques to try to minimize this, such as weight regularization and smaller batch sizes.

- Another technique is gradient clipping, where we check whether the norm of the gradient exceeds some threshold.

- Another useful technique is batch normalization, which addresses a problem called internal covariate shift
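
Gradient clipping by norm can be sketched in a few lines. This is a minimal illustration, assuming the whole gradient is a flat list of floats; the function name and threshold are illustrative.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale grads so their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # already within the threshold: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]


clipped = clip_by_norm([3.0, 4.0], 1.0)    # norm 5.0 exceeds the threshold
unchanged = clip_by_norm([0.3, 0.4], 1.0)  # norm 0.5 does not
```

Clipping preserves the direction of the gradient and only limits its length, so the update still points the same way.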

### Dead Layers

- Another common failure mode of gradient descent is that ReLU layers can die

- You can use leaky or parametric ReLUs, or even the slower ELUs, to prevent this

- You can also lower your learning rate to help stop ReLU layers from no longer activating and staying dead
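
The difference between a plain and a leaky ReLU shows up in the gradient for negative inputs. In this minimal sketch, the slope `alpha=0.01` is an assumed, commonly used value.

```python
def relu_grad(x):
    """Gradient of ReLU: zero for non-positive inputs."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: a small slope alpha for non-positive inputs."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """Gradient of leaky ReLU: never exactly zero."""
    return 1.0 if x > 0 else alpha
```

A unit stuck in the negative region gets no gradient with ReLU, so it cannot recover, whereas the leaky variant still passes a small gradient back.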

### Drop Out Layer

- Another form of regularization that helps build more generalizable models is adding dropout layers to our neural networks. To use dropout, I add a wrapper to one or more of my layers. In TensorFlow, the parameter you pass is called dropout, which is the probability of dropping a neuron temporarily from the network rather than keeping it turned on

- Typical values for dropout are between 20 and 50 percent. If you go much lower than that, there is not much effect on the network, since you are rarely dropping any nodes. If you go higher, training doesn't happen as well, since the network becomes too sparse to have the capacity to learn
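
The mechanics of dropout can be sketched without any framework. The version below is "inverted" dropout, which rescales the surviving activations so their expected value is unchanged. It is an illustrative assumption, not a copy of TensorFlow's implementation, and the explicit `rng` argument is a hypothetical convenience for reproducibility.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability rate; rescale the survivors."""
    if not training or rate == 0.0:
        return list(activations)  # dropout is disabled at prediction time
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)
dropped = dropout([1.0] * 1000, 0.5, rng)
at_inference = dropout([1.0, 2.0], 0.5, rng, training=False)
```

With a rate of 0.5, roughly half of the units are zeroed on each call and the rest are doubled, while prediction-time calls pass activations through untouched.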

## Multi Class Neural Network

### What if we have both mutually exclusive labels and probabilities

- For our classification output, if we have both mutually exclusive labels and probabilities, we should use `softmax_cross_entropy_with_logits_v2`. This means that there is only one true class for each example, and we allow for soft labels: the label does not need to be one-hot encoded for the true class, but can be any combination of values between zero and one for each class, as long as they all sum to one

### What If the labels are mutually exclusive, but the probabilities aren't

- If the labels are mutually exclusive but the probabilities aren't, we should use `sparse_softmax_cross_entropy_with_logits`. This doesn't allow for soft labels, but it does help reduce the model data size, since you can compress your labels into just the index of the true class rather than a vector with one entry per class for each example

### What If our labels aren't mutually exclusive

- If our labels aren't mutually exclusive, we should use `sigmoid_cross_entropy_with_logits`. This way, we get a probability for each possible class, which can give us confidence scores for each class being represented in the output, such as when an image contains multiple classes or we want to know whether each class is present
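
The three loss choices above can be compared with plain-Python reference versions. These are simplified sketches of the underlying math for a single example, not the TensorFlow functions themselves; the function names and the logits are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one example's logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy(soft_labels, logits):
    """Labels are a full distribution over classes (may be soft)."""
    probs = softmax(logits)
    return -sum(y * math.log(p) for y, p in zip(soft_labels, probs))


def sparse_softmax_cross_entropy(class_index, logits):
    """Label is just the index of the single true class."""
    return -math.log(softmax(logits)[class_index])


def sigmoid_cross_entropy(labels, logits):
    """Independent yes/no loss per class; labels need not sum to one."""
    losses = []
    for y, z in zip(labels, logits):
        p = 1.0 / (1.0 + math.exp(-z))
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1.0 - p)))
    return losses


logits = [2.0, 1.0, 0.1]
```

For a one-hot label, the dense and sparse softmax losses agree. The sparse form simply stores one integer per example instead of a vector.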

## Embeddings

- You will learn how to use embeddings to manage sparse data, so that machine learning models that use sparse data consume less memory and train faster

- Embeddings are also a way to do dimensionality reduction and in that way, make models simpler and more generalizable

- Embeddings are so useful that creating an embedding can be thought of as a machine learning problem in its own right

- Think of a good reusable embedding as being similar to a software library

### Example of Embeddings

- The feature cross of day-hour has 168 unique values (7 days times 24 hours), but we are forcing it to be represented with just two real-valued numbers. So, the model learns how to embed the feature cross in a lower-dimensional space

### Recommendations Case

- The idea is that we have an input that has n dimensions. So what is n in the case of the movies that we looked at? 500,000. Remember that the movie ID is a categorical feature and would normally be one-hot encoded. So, n = 500,000

- In our case, we represented all the movies in a two-dimensional space, so d = 2

- The key point is that d is much, much less than n. The assumption is that user interest in movies can be represented by d aspects; we don't need a much larger number of aspects to represent user interest in movies.

### Data-driven Embeddings

- It is easier to train a model that has d inputs than it is to train a model that has n inputs. Remember that n is much, much larger than d. The fewer the input nodes, the fewer the weights that we have to optimize

- This means that the model trains faster, and has less chance of overfitting

### Sparse Tensors

- Storing the input vector as a one-hot encoded array is a bad idea. A dense representation is extremely inefficient, both for storage and for compute

- Categorical columns are an example of something that is sparse. TensorFlow can do math operations on sparse tensors without having to convert them into dense ones. This saves memory and optimizes compute

- We just take two steps. First, take the original input and represent it as a sparse tensor. Second, send it through an embedding layer, which is done through the call to `embedding_column`
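
To see why the sparse route saves work, the sketch below compares multiplying a one-hot vector by an embedding table with simply looking up the corresponding row. The tiny table (n = 4, d = 2) and the helper names are illustrative assumptions; this is not how TensorFlow represents sparse tensors internally.

```python
def one_hot(index, n):
    """Dense one-hot vector of length n."""
    return [1.0 if i == index else 0.0 for i in range(n)]


def dense_embed(one_hot_vec, table):
    """Embedding via a full vector-matrix product: n * d multiply-adds."""
    d = len(table[0])
    n = len(table)
    return [sum(one_hot_vec[i] * table[i][j] for i in range(n)) for j in range(d)]


def sparse_embed(index, table):
    """Embedding via row lookup: only the active index is touched."""
    return list(table[index])


table = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6], [-0.7, 0.8]]  # n = 4, d = 2
```

Both routes return the same vector, but the lookup never materializes the n-dimensional input, which matters when n is 500,000 rather than 4.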

### Train an Embedding

- Mathematically, an embedding isn't really different from any other hidden layer in a network. You can view it as a handy adapter that allows a network to incorporate sparse or categorical data well

- The weights, when using a deep neural net, are learned with backpropagation just as with other layers. You can do this with a regression problem or with a classification problem.

### Similarity Property

- The number of embedding dimensions is a hyperparameter of your machine learning model. You will have to try different numbers of embedding dimensions because there is a trade-off here.

- Higher-dimensional embeddings can more accurately represent the relationship between input values. But the more dimensions you have, the greater the chance of overfitting

- Also, the model gets larger, and this leads to slower training. So, a good starting point is to go with the fourth root of the total number of possible values

- For example, if you're embedding movie IDs and you have 500,000 movies in your catalogue, the total number of possible values is 500,000. So a good starting point would be the fourth root of 500,000. Now the square root of 500,000 is about 700, and the square root of 700 is about 26
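
The fourth-root rule of thumb is a one-line computation. The helper name below is hypothetical; it returns the unrounded value, leaving the choice of a nearby integer to you.

```python
def embedding_dim_rule_of_thumb(num_values):
    """Fourth root of the number of possible categorical values."""
    return num_values ** 0.25


movie_dims = embedding_dim_rule_of_thumb(500000)
```

For 500,000 movies the exact value is a little above the hand estimate of 26, because 700 slightly underestimates the square root of 500,000. Since this is only a starting point for tuning, either neighbor is a reasonable first try.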

## Custom Estimator

### Keras

- Keras, if you haven't heard of it, is a very intuitive open-source front end to deep learning models

- Keras comes into play when you think of custom estimators because Keras provides a convenient way to write the model function for a custom estimator

- Unlike TensorFlow, Keras is not an implementation of CNNs or RNNs. Keras is a high-level neural networks API written in Python that supports TensorFlow as a backend

- In other words, when you call a Keras function it turns around and calls a set of TensorFlow functions to implement that functionality

### What Keras does is that it allows us to write our own model

- If you're using Keras, you might want to write a model using Keras, but train and evaluate the Keras model using Estimator. So, using Keras to write a model is just another example of the kind of flexibility that you might want, and that is what we're going to talk about in this module.

### Model Function

- In summary, we call `train_and_evaluate` with a base class Estimator, passing in a function that returns an `EstimatorSpec`, and that's it: we have a custom estimator.

### Keras vs TF Estimator API Usage

- Keras is meant for fast prototyping. It does not handle distributed training or scaled predictions

- For productionization, we will want to use the Estimator API. So oftentimes you will take ML prototypes written in Keras and you will have to operationalize them