# Fundamentals of Machine Learning

## Evaluating machine-learning models

#### Simple hold-out validation

```
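import numpy as np

# `data`, `test_data`, and `get_model()` are placeholders for your own
# dataset arrays and model factory.
num_validation_samples = 10000

# Shuffle before splitting so the validation set is representative.
np.random.shuffle(data)

validation_data = data[:num_validation_samples]   # hold-out validation set
training_data = data[num_validation_samples:]     # everything else is for training

# Train on the training data and evaluate on the validation data.
model = get_model()
model.train(training_data)
validation_score = model.evaluate(validation_data)

# Tune the model, retrain it, evaluate it ...

# Once the hyperparameters are tuned, train the final model from scratch
# on all non-test data.
model = get_model()
model.train(np.concatenate([training_data,
                            validation_data]))

test_score = model.evaluate(test_data)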
```

#### K-fold validation

```
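import numpy as np

# `data`, `test_data`, and `get_model()` are placeholders for your own
# dataset arrays and model factory.
k = 4
num_validation_samples = len(data) // k

np.random.shuffle(data)

validation_scores = []

for fold in range(k):
    # Select the validation partition for this fold.
    validation_data = data[num_validation_samples * fold:
                           num_validation_samples * (fold + 1)]
    # Use the remainder of the data for training. np.concatenate joins the
    # two slices; `+` on NumPy arrays would add them element-wise instead.
    training_data = np.concatenate([
        data[:num_validation_samples * fold],
        data[num_validation_samples * (fold + 1):]])

    # Create a brand-new, untrained model for every fold.
    model = get_model()
    model.train(training_data)
    validation_score = model.evaluate(validation_data)
    validation_scores.append(validation_score)

# The validation score is the average over the k folds.
validation_score = np.average(validation_scores)

# Train the final model on all non-test data.
model = get_model()
model.train(data)
test_score = model.evaluate(test_data)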
```

#### Iterated K-fold validation with shuffling

Iterated K-fold validation applies K-fold validation multiple times, shuffling the data before each K-way split. The final score is the average of the scores obtained at each run of K-fold validation. With P iterations, this means training and evaluating P × K models, which can be expensive.
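
The procedure can be sketched as follows. This is a minimal illustration, not a reference implementation: `iterated_k_fold` and the `fit_and_score` callback are hypothetical names, and the "model" in the usage example is a toy baseline that predicts the training mean.

```python
import numpy as np


def iterated_k_fold(data, targets, fit_and_score, k=4, iterations=3, seed=0):
    """Average the validation score over several shuffled runs of K-fold.

    `fit_and_score(train_x, train_y, val_x, val_y)` is a hypothetical callback
    that trains a fresh model and returns its validation score.
    """
    rng = np.random.default_rng(seed)
    fold_size = len(data) // k
    run_scores = []
    for _ in range(iterations):
        # Reshuffle before every run so each run uses different folds.
        order = rng.permutation(len(data))
        x, y = data[order], targets[order]
        fold_scores = []
        for fold in range(k):
            start, stop = fold_size * fold, fold_size * (fold + 1)
            val_x, val_y = x[start:stop], y[start:stop]
            train_x = np.concatenate([x[:start], x[stop:]])
            train_y = np.concatenate([y[:start], y[stop:]])
            fold_scores.append(fit_and_score(train_x, train_y, val_x, val_y))
        run_scores.append(np.mean(fold_scores))
    # Final score: the average of the per-run averages.
    return float(np.mean(run_scores))


# Toy usage: a baseline that always predicts the training mean, scored by MSE.
def mean_baseline(train_x, train_y, val_x, val_y):
    prediction = train_y.mean()
    return float(np.mean((val_y - prediction) ** 2))


x = np.arange(40, dtype="float32").reshape(20, 2)
y = x.sum(axis=1)
score = iterated_k_fold(x, y, mean_baseline, k=4, iterations=3)
```

Each call trains `iterations * k` models, so this protocol is mostly worth its cost when little data is available and the evaluation needs to be as precise as possible.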

## Data preprocessing, feature engineering and feature learning

#### Vectorization

All inputs and targets in a neural network must be tensors of floating-point data (or, in specific cases, tensors of integers). Whatever data you need to process, you must first turn it into tensors, a step called _data vectorization_.
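
As one possible illustration, lists of integer indices (for example word indices) can be turned into a float tensor by multi-hot encoding. The function name `vectorize_sequences` and the sample data are assumptions made for this sketch.

```python
import numpy as np


def vectorize_sequences(sequences, dimension):
    """Multi-hot encode lists of integer indices into a float32 matrix."""
    results = np.zeros((len(sequences), dimension), dtype="float32")
    for i, sequence in enumerate(sequences):
        # Set to 1.0 every column whose index appears in the sequence.
        results[i, sequence] = 1.0
    return results


sequences = [[0, 3], [1, 1, 2]]  # hypothetical index lists
x = vectorize_sequences(sequences, dimension=5)
```

Repeated indices collapse to a single 1.0, so this encoding records which indices occur but not how often or in what order.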

#### Value normalization

Feeding a network data that takes relatively large values, or features with very different ranges, can make training harder. To make learning easier, your data should:

* Take small values (typically in the 0-1 range)
* Be homogeneous (all features in roughly the same range)

Additionally, the following stricter normalization is common and often helps, although it isn't always necessary:
* Normalize each feature independently to have a mean of 0.
* Normalize each feature independently to have a standard deviation of 1.

```
# Assumes x is a 2D float array of shape (samples, features).
x -= x.mean(axis=0)
x /= x.std(axis=0)
```
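
When there is a separate test set, a common practice is to compute the mean and standard deviation on the training data only and reuse them for the test data. The sketch below assumes synthetic data and the variable names `train_x` and `test_x`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 3 features with large, shifted values.
train_x = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))
test_x = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

# Statistics come from the training data only.
mean = train_x.mean(axis=0)
std = train_x.std(axis=0)

train_x = (train_x - mean) / std
test_x = (test_x - mean) / std  # reuse the training statistics
```

After this step the training features have a mean of 0 and a standard deviation of 1; the test features will be close to that but not exactly, which is expected.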

#### Handling missing values

If you're expecting missing values in the test data, but the network was trained on data without any missing values, the network won't have learned to ignore missing values. In this situation, you should artificially generate training samples with missing entries: copy some training samples several times, and drop some of the features that you expect are likely to be missing in the test data.
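
One way to sketch this augmentation is shown below. The function name, the choice to copy half of the rows each time, and the use of 0 as the placeholder for a missing value are all assumptions of this example; 0 only works as a placeholder if it isn't already a meaningful value in your data.

```python
import numpy as np


def add_missing_value_copies(x, y, missing_columns, copies=2,
                             fraction=0.5, fill_value=0.0, seed=0):
    """Append copies of random training rows with some features blanked out."""
    rng = np.random.default_rng(seed)
    xs, ys = [x], [y]
    n_rows = int(len(x) * fraction)
    for _ in range(copies):
        rows = rng.choice(len(x), size=n_rows, replace=False)
        x_copy = x[rows]  # fancy indexing returns a copy, so x is untouched
        x_copy[:, missing_columns] = fill_value
        xs.append(x_copy)
        ys.append(y[rows])  # the targets of the copied rows are unchanged
    return np.concatenate(xs), np.concatenate(ys)


x = np.arange(1, 25, dtype="float32").reshape(8, 3)
y = np.arange(8)
aug_x, aug_y = add_missing_value_copies(x, y, missing_columns=[2], copies=2)
```

The original samples are kept as they are, so the network sees both complete rows and rows where the chosen features are missing.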

### Feature engineering 

Feature engineering is the process of using your own knowledge about the data and about the machine-learning algorithm at hand to make the algorithm work better by applying hardcoded (nonlearned) transformations to the data before it goes into the model.
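
As a hypothetical illustration (not taken from the text above), suppose one input is the hour of the day. Knowing that hours are cyclical, you could hardcode a transformation that places each hour on a circle, so that 23:00 and 00:00 end up close together. The function name `encode_hour` is an assumption.

```python
import numpy as np


def encode_hour(hour):
    """Map an hour of day (0-24) to a (sin, cos) point on the unit circle."""
    angle = 2.0 * np.pi * np.asarray(hour, dtype="float64") / 24.0
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)


features = encode_hour([0, 6, 12, 23])

# Distance between 23:00 and 00:00 versus 12:00 and 00:00 in the new space.
late_to_midnight = np.linalg.norm(features[3] - features[0])
noon_to_midnight = np.linalg.norm(features[2] - features[0])
```

With the raw hour, 23 and 0 are the two most distant values; with the engineered features they are neighbours, which is the relationship a model would otherwise have to learn from data.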

## Overfitting and underfitting 

Optimization refers to the process of adjusting a model to get the best performance possible on the training data, whereas generalization refers to how well the trained model performs on data it has never seen before.

A model trained on more data will naturally generalize better. When getting more data isn't possible, the next-best solution is to modulate the quantity of information that your model is allowed to store, or to add constraints on what information it's allowed to store.

The process of fighting overfitting this way is called regularization. 

### Reducing the network's size

The simplest way to prevent overfitting is to reduce the size of the model: the number of learnable parameters in the model. 

The general workflow to find an appropriate model size is to start with relatively few layers and parameters, and increase the size of the layers or add new layers until you see diminishing returns with regard to validation loss. 
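
To get a feel for how quickly the size grows, the sketch below counts the learnable parameters of a stack of fully connected layers. The helper `dense_param_count` and the candidate layer sizes are assumptions chosen for illustration; the input dimension of 10000 matches the Keras examples in these notes.

```python
def dense_param_count(input_dim, layer_units):
    """Number of weights and biases in a stack of fully connected layers."""
    total = 0
    previous = input_dim
    for units in layer_units:
        total += previous * units + units  # weight matrix plus bias vector
        previous = units
    return total


# Hypothetical candidates, from small to large.
candidates = {
    "small": [4, 4, 1],
    "medium": [16, 16, 1],
    "large": [512, 512, 1],
}
counts = {name: dense_param_count(10000, units)
          for name, units in candidates.items()}
```

In practice you would train each candidate and compare validation loss curves; the parameter count only tells you how much capacity each one has.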

### Adding weight regularization

A common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to take only small values, which makes the distribution of weight values more regular. This is called weight regularization, and it's done by adding to the loss function of the network a cost associated with having large weights.

* L1 regularization – The cost added is proportional to the absolute value of the weight coefficients (the L1 norm of the weights).
* L2 regularization – The cost added is proportional to the square of the value of the weight coefficients (the squared L2 norm of the weights).

```
from keras import models
from keras import layers
from keras import regularizers

model = models.Sequential()
# l2(0.001): every coefficient in the layer's weight matrix adds
# 0.001 * weight_value ** 2 to the total loss of the network.
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

# regularizers.l1(0.001) – L1 regularization
# regularizers.l1_l2(l1=0.001, l2=0.001) – Simultaneous L1 and L2 regularization
```
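
To make the added cost concrete, here is a NumPy sketch of how the two penalties could be computed for a single weight matrix and added to a loss value. The helper names, the weights, and the `data_loss` value are hypothetical.

```python
import numpy as np


def l1_penalty(weights, rate):
    """Cost proportional to the sum of absolute weight values."""
    return rate * np.sum(np.abs(weights))


def l2_penalty(weights, rate):
    """Cost proportional to the sum of squared weight values."""
    return rate * np.sum(np.square(weights))


weights = np.array([[0.5, -2.0],
                    [0.0, 1.0]])
data_loss = 0.25  # hypothetical loss on a batch, before regularization

total_l1 = data_loss + l1_penalty(weights, 0.001)
total_l2 = data_loss + l2_penalty(weights, 0.001)
```

Because the L2 penalty squares each weight, the single large weight (-2.0) dominates it, whereas the L1 penalty grows only linearly with weight size.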

### Adding dropout 

Dropout is one of the most effective and most commonly used regularization techniques for neural networks. Dropout, applied to a layer, consists of randomly dropping out (setting to zero) a number of output features of the layer during training. The _dropout rate_ is the fraction of the features that are zeroed out; it's usually set between 0.2 and 0.5.

```
from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dropout(0.5))  # zero out half of the features during training
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))
```
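
What a dropout layer does to its input can be sketched in NumPy as below. The rescaling of the surviving features by `1 / (1 - rate)` is the usual "inverted dropout" convention; it is an assumption of this sketch rather than something stated in the text above, and the function name `dropout` is illustrative.

```python
import numpy as np


def dropout(layer_output, rate, rng, training=True):
    """Inverted dropout: zero a random fraction of features, rescale the rest."""
    if not training or rate == 0.0:
        return layer_output  # nothing is dropped outside of training
    # Each feature is kept with probability 1 - rate.
    keep_mask = rng.random(layer_output.shape) >= rate
    # Scale up the kept features so the expected sum stays the same.
    return layer_output * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)
activations = np.ones((1000, 16))
dropped = dropout(activations, rate=0.5, rng=rng)
zero_fraction = np.mean(dropped == 0.0)
```

With a rate of 0.5, roughly half of the entries become 0 and the rest are doubled, so the average activation stays close to its original value.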

To recap, the most common ways to prevent overfitting in neural networks are:
* Get more training data
* Reduce the capacity of the network
* Add weight regularization
* Add dropout

## Choosing the right last-layer activation and loss function for your model

| Problem Type | Last-layer Activation | Loss Function |
|-----|-----|-----|
| Binary classification | `sigmoid` | `binary_crossentropy` | 
| Multiclass, single-label classification | `softmax` | `categorical_crossentropy` | 
| Multiclass, multilabel classification | `sigmoid` | `binary_crossentropy` |
| Regression to arbitrary values | None | `mse` |
| Regression to values between 0 and 1 | `sigmoid` | `mse` or `binary_crossentropy` |
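
The difference between the `softmax` and `sigmoid` rows can be seen numerically. The NumPy sketch below uses hand-written versions of both activations and hypothetical logits; it is an illustration, not how Keras implements them.

```python
import numpy as np


def sigmoid(logits):
    """Squash each value independently into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-logits))


def softmax(logits):
    """Turn a vector of values into a probability distribution."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # for stability
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


logits = np.array([2.0, 1.0, -1.0])  # hypothetical last-layer outputs

single_label = softmax(logits)  # probabilities compete and sum to 1
multi_label = sigmoid(logits)   # each class gets its own independent score
```

Softmax fits single-label problems because the classes share one unit of probability, whereas sigmoid lets several classes score high at once, which is what multilabel classification needs.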

# Chapter summary
* Define the problem at hand and the data on which you’ll train. Collect this data, or annotate it with labels if need be.
* Choose how you’ll measure success on your problem. Which metrics will you monitor on your validation data?
* Determine your evaluation protocol: hold-out validation? K-fold validation? Which portion of the data should you use for validation?
* Develop a first model that does better than a basic baseline: a model with statistical power.
* Develop a model that overfits.
* Regularize your model and tune its hyperparameters, based on performance on the validation data. A lot of machine-learning research tends to focus only on this step, but keep the big picture in mind.