<h1>4 - Fundamentals of machine learning</h1>

<h2>Four branches of machine learning</h2>

- Supervised learning - Generally, almost all applications of deep learning that are in the spotlight these days belong in this category, such as optical character recognition, speech recognition, image classification, and language translation.

  - Classification
  - Regression
  - Sequence generation - Given a picture, predict a caption describing it.
  - Syntax tree prediction - Given a sentence, predict its decomposition into a syntax tree.
  - Object detection - Given a picture, draw a bounding box around certain objects inside the picture.
  - Image segmentation - Given a picture, draw a pixel-level mask on a specific object.


- Unsupervised learning

Finding interesting transformations of the input data without the help of any targets. Dimensionality reduction and clustering are the best-known categories of unsupervised learning.

- Self-supervised learning

A specific instance of supervised learning: supervised learning without human-annotated labels, so there are no humans in the loop. The labels are generated from the input data itself. Example: trying to predict the next frame in a video, given past frames, or the next word in a text, given previous words.

- Reinforcement learning

An agent receives information about its environment and learns to choose actions that will maximize some reward.

Classification and regression glossary:

- Sample or input - One data point that goes into your model.
- Prediction or output - What comes out of your model.
- Target - The truth. What your model should ideally have predicted, according to an external source of data.
- Prediction error or loss value - A measure of the distance between your model's prediction and the target.
- Classes - A set of possible labels to choose from in a classification problem. For example, when classifying cat and dog pictures, "dog" and "cat" are the two classes.
- Label - A specific instance of a class annotation in a classification problem. For instance, if picture #1234 is annotated as containing the class "dog," then "dog" is a label of picture #1234.
- Ground-truth or annotations - All targets for a dataset, typically collected by humans.
- Binary classification - A classification task where each input sample should be categorized into one of two exclusive categories.
- Multiclass classification - A classification task where each input sample should be categorized into one of more than two categories: for instance, classifying handwritten digits.
- Multilabel classification - A classification task where each input sample can be assigned multiple labels. For instance, a given image may contain both a cat and a dog and should be annotated both with the "cat" label and the "dog" label. The number of labels per image is usually variable.
- Scalar regression - A task where the target is a continuous scalar value. Predicting house prices is a good example: the different target prices form a continuous space.
- Vector regression - A task where the target is a set of continuous values: for example, a continuous vector. If you're doing regression against multiple values (such as the coordinates of a bounding box in an image), then you're doing vector regression.
- Mini-batch or batch - A small set of samples (typically between 8 and 128) that are processed simultaneously by the model. The number of samples is often a power of 2, to facilitate memory allocation on GPU. When training, a mini-batch is used to compute a single gradient-descent update applied to the weights of the model.
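To make the mini-batch definition concrete, here is a minimal sketch of slicing a dataset into batches. The helper name `iterate_minibatches` and the toy array are assumptions for illustration, not part of any library.

```python
import numpy as np

def iterate_minibatches(samples, batch_size=32):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

# Hypothetical dataset: 100 samples with 4 features each
samples = np.arange(400, dtype="float32").reshape(100, 4)

# Each batch would drive one gradient-descent update during training
batch_shapes = [batch.shape for batch in iterate_minibatches(samples, batch_size=32)]
```

With 100 samples and a batch size of 32, this produces three full batches and one leftover batch of 4 samples.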

<h2>1. Evaluating a Model</h2>

Split the data into three sets: training, validation (feedback used to improve the model), and test.

<h3>With Little Data</h3>

Three classic evaluation recipes: simple hold-out validation, K-fold validation, and iterated K-fold validation with shuffling.

Simple hold-out Validation:

Set apart some fraction of your data as your test set. Train on the remaining data, and evaluate on the test set. As you saw in the previous sections, in order to prevent information leaks, you shouldn't tune your model based on the test set, and therefore you should also reserve a validation set.

Flaw: if little data is available, then your validation and test sets may contain too few samples to be statistically representative of the data at hand.

In [1]:
import numpy as np

# Assumed to be defined elsewhere: `data` (a NumPy array of samples),
# `test_data` (a held-out test set), and `get_model()` (returns a new, untrained model).
num_validation_samples = 10000

# Shuffling the data is usually appropriate
np.random.shuffle(data)

# Defines the validation set
validation_data = data[:num_validation_samples]

# Defines the training set
training_data = data[num_validation_samples:]

# Trains a model on the training data and evaluates it on the validation data
model = get_model()
model.train(training_data)
validation_score = model.evaluate(validation_data)

# At this point you can tune your model,
# retrain it, evaluate it, tune it again...

# Once you've tuned your hyperparameters, it's common to train your
# final model from scratch on all non-test data available.
model = get_model()
model.train(np.concatenate([training_data, validation_data]))
test_score = model.evaluate(test_data)

<h3>K-FOLD VALIDATION</h3>

Split your data into K partitions of equal size. For each partition i, train a model on the remaining K - 1 partitions, and evaluate it on partition i. Your final score is then the average of the K scores obtained.

Pro: helpful when the performance of your model shows significant variance based on your train-test split.

In [2]:
import numpy as np

# Assumed to be defined elsewhere: `data` (a NumPy array), `test_data`, and `get_model()`.
k = 4
num_validation_samples = len(data) // k
np.random.shuffle(data)
validation_scores = []
for fold in range(k):
    # Selects the validation-data partition
    validation_data = data[num_validation_samples * fold:
                           num_validation_samples * (fold + 1)]

    # Uses the remainder of the data as training data. np.concatenate joins
    # the two slices; on NumPy arrays the + operator would add element-wise.
    training_data = np.concatenate([
        data[:num_validation_samples * fold],
        data[num_validation_samples * (fold + 1):]])

    # Creates a brand-new instance of the model (untrained)
    model = get_model()
    model.train(training_data)
    validation_score = model.evaluate(validation_data)
    validation_scores.append(validation_score)

# Validation score: average of the validation scores of the k folds
validation_score = np.average(validation_scores)

# Trains the final model on all non-test data available
model = get_model()
model.train(data)
test_score = model.evaluate(test_data)

<h3>ITERATED K-FOLD VALIDATION WITH SHUFFLING</h3>

Use this when you have relatively little data available and need to evaluate your model as precisely as possible.

It consists of applying K-fold validation multiple times, shuffling the data every time before splitting it K ways. The final score is the average of the scores obtained at each run of K-fold validation. Note that you end up training and evaluating P × K models (where P is the number of iterations you use), which can be very expensive.
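The procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the data is synthetic, and a least-squares linear fit stands in for whatever model you would actually build, train, and evaluate.

```python
import numpy as np

def k_fold_scores(x, y, train_and_score, k):
    """Return the k validation scores from one pass of K-fold validation."""
    fold_size = len(x) // k
    scores = []
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        val_x, val_y = x[start:stop], y[start:stop]
        train_x = np.concatenate([x[:start], x[stop:]])
        train_y = np.concatenate([y[:start], y[stop:]])
        scores.append(train_and_score(train_x, train_y, val_x, val_y))
    return scores

def iterated_k_fold(x, y, train_and_score, k=4, iterations=3, seed=0):
    """Average the scores of `iterations` K-fold runs, reshuffling before each run."""
    rng = np.random.default_rng(seed)
    all_scores = []
    for _ in range(iterations):
        order = rng.permutation(len(x))  # new shuffle for every run
        all_scores.extend(k_fold_scores(x[order], y[order], train_and_score, k))
    return float(np.mean(all_scores)), len(all_scores)

def least_squares_mse(train_x, train_y, val_x, val_y):
    """Stand-in model (an assumption): fit a linear model, return validation MSE."""
    weights, *_ = np.linalg.lstsq(train_x, train_y, rcond=None)
    return float(np.mean((val_x @ weights - val_y) ** 2))

# Synthetic regression data, purely for illustration
rng = np.random.default_rng(42)
x = rng.normal(size=(120, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

mean_score, models_trained = iterated_k_fold(x, y, least_squares_mse, k=4, iterations=3)
```

Here `models_trained` equals P × K (3 × 4 = 12), which is where the cost of the method comes from.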

<h3>Pointers</h3>

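Randomly shuffle data: if samples are ordered by class, shuffle before splitting so you don't end up with digits 1-7 in training and only 8-9 in test.

Arrow of time: when predicting the future from the past, don't shuffle the data before splitting; make sure all test data comes after the training data on the timeline.

Redundancy: make sure the same data point doesn't appear in both the training and the test (or validation) set.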
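The last two pointers can be sketched in a few lines of NumPy. The helper names `temporal_split` and `drop_test_overlap` and the toy data are assumptions made for this example.

```python
import numpy as np

def temporal_split(samples, timestamps, num_test):
    """Sort by time and keep the most recent `num_test` samples for testing."""
    order = np.argsort(timestamps, kind="stable")
    cutoff = len(samples) - num_test
    return samples[order[:cutoff]], samples[order[cutoff:]]

def drop_test_overlap(train, test):
    """Remove training rows that also appear verbatim in the test set."""
    test_rows = {tuple(row) for row in test.tolist()}
    keep = [tuple(row) not in test_rows for row in train.tolist()]
    return train[np.array(keep, dtype=bool)]

# Toy data: the row [2, 2] occurs twice, once early and once late in time
samples = np.array([[3, 3], [1, 1], [2, 2], [5, 5], [4, 4], [2, 2]])
timestamps = np.array([30, 10, 20, 50, 40, 45])

train, test = temporal_split(samples, timestamps, num_test=2)
train = drop_test_overlap(train, test)
```

After the split, the test set holds the two most recent rows, and the duplicated `[2, 2]` row is removed from the training set so it cannot leak into evaluation.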

<h2>2. Data preprocessing, feature engineering, and feature learning</h2>

<h3>Data preprocessing for neural networks</h3>

- Data preprocessing aims to make the raw data at hand more amenable to neural networks. This includes vectorization, normalization, handling missing values, and feature extraction.

<h4>VECTORIZATION</h4>

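- All inputs and targets in a neural network must be tensors of floating-point data (or, in specific cases, tensors of integers).
- Whatever data you need to process (sound, images, text), you must first turn it into tensors, a step called data vectorization.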
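As one possible illustration of vectorization, the sketch below multi-hot encodes lists of integer word indices into a float tensor. The function name, the toy sequences, and the dimension of 5 are assumptions for this example.

```python
import numpy as np

def vectorize_sequences(sequences, dimension):
    """Multi-hot encode lists of integer indices into a float32 matrix."""
    results = np.zeros((len(sequences), dimension), dtype="float32")
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0  # set the listed indices of row i to 1
    return results

# Hypothetical input: three samples, each a list of word indices in [0, 5)
sequences = [[0, 3], [1, 2, 4], [4]]
x = vectorize_sequences(sequences, dimension=5)
```

Each row of `x` is now a fixed-length floating-point vector, regardless of how long the original sequence was.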

<h4>VALUE NORMALIZATION</h4>

In general, it isn't safe to feed into a neural network data that takes relatively large values (for example, multidigit integers, which are much larger than the initial values taken by the weights of a network) or data that is heterogeneous (for example, data where one feature is in the range 0-1 and another is in the range 100-200). Doing so can trigger large gradient updates that will prevent the network from converging. To make learning easier for your network, your data should have the following characteristics:

- Take small values - Typically, most values should be in the 0-1 range.
- Be homogeneous - That is, all features should take values in roughly the same range.

Additionally, the following stricter normalization practice is common and can help, although it isn't always necessary (for example, you wouldn't do this in digit classification):

- Normalize each feature independently to have a mean of 0.
- Normalize each feature independently to have a standard deviation of 1.

In [3]:
# This is easy to do with NumPy arrays:
import numpy as np

# Placeholder data: x is a 2D float matrix of shape (samples, features)
x = np.random.random((100, 5))
x -= x.mean(axis=0)
x /= x.std(axis=0)
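When a separate test set exists, a common practice is to compute the mean and standard deviation on the training data only and reuse them for the test data, so that no test-set information leaks into preprocessing. A minimal sketch, with synthetic data and variable names chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two features on very different scales
train_x = rng.normal(loc=[0.5, 150.0], scale=[0.1, 20.0], size=(200, 2))
test_x = rng.normal(loc=[0.5, 150.0], scale=[0.1, 20.0], size=(50, 2))

# Statistics come from the training data only
mean = train_x.mean(axis=0)
std = train_x.std(axis=0)

train_x = (train_x - mean) / std
test_x = (test_x - mean) / std  # reuses the training statistics
```

The normalized test features will be close to, but not exactly, zero mean and unit variance, which is expected.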

<h4>HANDLING MISSING VALUES</h4>

In general, with neural networks, it's safe to input missing values as 0, with the condition that 0 isn't already a meaningful value. The network will learn from exposure to the data that the value 0 means missing data and will start ignoring the value.
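A minimal sketch of this idea, assuming missing entries are stored as NaN and that 0 is not otherwise a meaningful value for these features (the toy array is made up for illustration):

```python
import numpy as np

# Hypothetical feature matrix with missing entries encoded as NaN
x = np.array([[0.2, np.nan, 0.7],
              [np.nan, 0.5, 0.1]], dtype="float32")

# Replace every missing entry with 0 and leave the other values untouched
x_filled = np.where(np.isnan(x), 0.0, x)
```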


<h4>Feature engineering</h4>

Modern deep learning removes the need for most feature engineering, because neural networks are capable of automatically extracting useful features from raw data.

Feature engineering is still important:

- Good features still allow you to solve problems more elegantly while using fewer resources.
- Good features let you solve a problem with far less data.

<h3>Overfitting and underfitting</h3>

The fundamental issue in machine learning is the tension between optimization and generalization.

If the model starts learning patterns that are misleading or irrelevant (overfitting):
- The best solution is to get more training data.
- The next-best solution is to modulate the quantity of information that your model is allowed to store, or to add constraints on what information it's allowed to store. If a network can only afford to memorize a small number of patterns, the optimization process will force it to focus on the most prominent patterns, which have a better chance of generalizing well.
- Fighting overfitting in this way is called regularization. Some techniques:


<h3> 1. Reduce the network size: </h3>


- A network's size is the number of learnable parameters in the model, determined by the number of layers and the number of units per layer. This is often referred to as the model's capacity.
- The more capacity the network has, the more quickly it can model the training data (resulting in a low training loss), but the more susceptible it is to overfitting (resulting in a large difference between the training and validation loss).
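To get a feel for capacity, the sketch below counts the learnable parameters of a stack of fully connected layers. The 10,000-dimensional input and the 16-unit layers mirror the movie-review network used in this section; the 4-unit variant and the helper name are assumptions for illustration.

```python
def dense_param_count(input_dim, layer_units):
    """Number of weights and biases in a stack of fully connected layers."""
    total = 0
    for units in layer_units:
        total += input_dim * units + units  # weight matrix plus bias vector
        input_dim = units  # this layer's output feeds the next layer
    return total

original = dense_param_count(10000, [16, 16, 1])  # 160305 parameters
smaller = dense_param_count(10000, [4, 4, 1])     # 40029 parameters
```

Shrinking the hidden layers from 16 to 4 units cuts the parameter count to roughly a quarter, which limits how much the network can memorize.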

<h3> 2. Adding weight regularization </h3>

- Simpler models are less likely to overfit than complex ones.
- A common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to take only small values, which makes the distribution of weight values more regular. This is called weight regularization, and it is done by adding to the loss function a cost associated with having large weights.


Weight regularization comes in two flavors:

- L1 regularization - The cost added is proportional to the absolute value of the weight coefficients (the L1 norm of the weights).
- L2 regularization - The cost added is proportional to the square of the value of the weight coefficients (the squared L2 norm of the weights). L2 regularization is also called weight decay in the context of neural networks.
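The two penalties can be written out directly in NumPy. This is a sketch of the arithmetic only, with a made-up weight matrix and hypothetical helper names; in practice the framework adds these terms to the loss for you.

```python
import numpy as np

def l1_penalty(weights, factor):
    """L1 cost: factor times the sum of absolute weight values."""
    return factor * np.sum(np.abs(weights))

def l2_penalty(weights, factor):
    """L2 cost: factor times the sum of squared weight values."""
    return factor * np.sum(np.square(weights))

# Hypothetical weight matrix of a small layer
weights = np.array([[0.5, -1.0], [2.0, 0.0]])

l1_cost = l1_penalty(weights, 0.001)  # 0.001 * 3.5
l2_cost = l2_penalty(weights, 0.001)  # 0.001 * 5.25
```

Note how the L2 penalty is dominated by the largest weight (2.0), while the L1 penalty grows linearly with each weight's magnitude.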



In [4]:
# Example: adding L2 weight regularization to the movie-review classification network.

from keras import layers, models, regularizers

model = models.Sequential()
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

# l2(0.001) means every coefficient in the weight matrix of the layer will
# add 0.001 * weight_coefficient_value ** 2 to the total loss of the network.

# As an alternative to L2 regularization, you can use one of the following Keras weight regularizers.
regularizers.l1(0.001)  # L1 regularization
regularizers.l1_l2(l1=0.001, l2=0.001)  # Simultaneous L1 and L2 regularization

<h3> 3. Adding dropout </h3>

Dropout is one of the most effective and most commonly used regularization techniques for neural networks.

Dropout, applied to a layer, consists of randomly dropping out (setting to zero) a number of output features of the layer during training. 
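What dropout does to a layer's output can be sketched in plain NumPy. This assumes the common "inverted dropout" convention, in which the surviving activations are scaled up during training so that nothing needs to change at test time; the function name and toy input are illustrative.

```python
import numpy as np

def dropout(layer_output, rate, rng, training=True):
    """Zero out a random fraction `rate` of the features during training."""
    if not training:
        return layer_output  # dropout is inactive at test time
    keep_mask = rng.random(layer_output.shape) >= rate
    # Scale the kept values so the expected activation stays the same
    return layer_output * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
layer_output = np.ones((4, 1000))
dropped = dropout(layer_output, rate=0.5, rng=rng)
```

With a rate of 0.5, about half of the entries in `dropped` are zero and the rest are doubled, so the average activation stays close to 1.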

In Keras, you can introduce dropout in a network via the Dropout layer, which is applied to the output of the layer right before it:

In [6]:
# Adding two Dropout layers to the network

from keras import layers, models

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))

See 4.5 for outline