# A Single Neuron

**1. What is Deep Learning**

Some of the most impressive advances in artificial intelligence in recent years have been in the field of deep learning. Natural language translation, image recognition, and game playing are all tasks where deep learning models have neared or even exceeded human-level performance.

So what is deep learning? Deep learning is an approach to machine learning characterized by deep stacks of computations. This depth of computation is what has enabled deep learning models to disentangle the kinds of complex and hierarchical patterns found in the most challenging real-world datasets.

Through their power and scalability, neural networks have become the defining model of deep learning. Neural networks are composed of neurons, where each neuron individually performs only a simple computation. The power of a neural network comes instead from the complexity of the connections these neurons can form.

**2. The Linear Unit**

![image.png](attachment:image.png)

The input is x. Its connection to the neuron has a weight, w. Whenever a value flows through a connection, you multiply the value by the connection's weight. For the input x, what reaches the neuron is $ w \times x $. A neural network "learns" by modifying its weights.

The b is a special kind of weight we call the bias. The bias doesn't have any input data associated with it; instead, we put a 1 in the diagram so that the value that reaches the neuron is just b (since $1 \times b = b$). The bias enables the neuron to modify the output independently of its inputs.

The y is the value the neuron ultimately outputs. To get the output, the neuron sums up all the values it receives through its connections. This neuron's activation is $y = w \times x + b$.
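
The formula above can be written directly as a small function. This is a minimal sketch in plain Python; the weight and bias values are hypothetical, chosen only for illustration.

```python
def linear_unit(x, w, b):
    """Return the output of a single linear unit: y = w * x + b."""
    return w * x + b


# Hypothetical weight and bias, not taken from any trained model.
w, b = 2.5, 90.0

y = linear_unit(5.0, w, b)
print(y)  # 2.5 * 5.0 + 90.0 = 102.5
```

Setting x to 0 returns just the bias, which is why the bias lets the neuron shift its output independently of the input.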

**3. The Linear Unit as a Model**

Though individual neurons will usually only function as part of a larger network, it's often useful to start with a single neuron model as a baseline. Single neuron models are linear models. 

![image-2.png](attachment:image-2.png)

**4. Multiple Inputs**

We can just add more input connections to the neuron, one for each additional feature. To find the output, we multiply each input by its connection weight and then add them all together.

![image-3.png](attachment:image-3.png)

The formula for this neuron would be $ y = w_0 x_0 + w_1 x_1 + w_2 x_2 + b $. A linear unit with two inputs will fit a plane, and a unit with more inputs than that will fit a hyperplane.
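
As a rough illustration of the multi-input formula, the sketch below computes the weighted sum for three inputs in plain Python. The input, weight, and bias values are made up for the example.

```python
def linear_unit(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias: y = sum(w_i * x_i) + b."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(w * x for w, x in zip(weights, inputs)) + bias


# Hypothetical values for x_0..x_2, w_0..w_2 and b.
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 2.0]
b = 4.0

y = linear_unit(x, w, b)
print(y)  # 0.5*1.0 + (-1.0)*2.0 + 2.0*3.0 + 4.0 = 8.5
```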

**5. Linear Units in Keras**

The easiest way to create a model in Keras is through keras.Sequential, which creates a neural network as a stack of layers. We can create models like those above using a dense layer.

We could define a linear model accepting three input features and producing a single output.

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

# Create a network with 1 linear unit
model = keras.Sequential([layers.Dense(units=1, input_shape=[3])])

With the first argument, units, we define how many outputs we want. In this case we are just predicting one output, so we use units = 1.

With the second argument, input_shape, we tell Keras the dimensions of the inputs. Setting input_shape = [3] ensures the model will accept three features as input.

# Deep Neural Networks

**1. Introduction**

We're going to see how we can build neural networks capable of learning the complex kinds of relationships deep neural nets are famous for.

The key idea here is modularity, building up a complex network from simpler functional units. 

**2. Layers**

Neural networks typically organize their neurons into layers. When we collect together linear units having a common set of inputs we get a dense layer. 

![image.png](attachment:image.png)

You could think of each layer in a neural network as performing some kind of relatively simple transformation. Through a deep stack of layers, a neural network can transform its inputs in more and more complex ways. In a well-trained neural network, each layer is a transformation getting us a little bit closer to a solution.

A "layer" in Keras is a very general kind of thing. A layer can be, essentially, any kind of data transformation. Many layers, like the convolutional and recurrent layers, transform data through use of neurons and differ primarily in the pattern of connections they form. Others, though, are used for feature engineering or just simple arithmetic.

**3. The Activation Function**

It turns out, however, that two dense layers with nothing in between are no better than a single dense layer by itself. Dense layers by themselves can never move us out of the world of lines and planes. What we need is something nonlinear. What we need are activation functions.
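
To see why stacking purely linear layers gains nothing, the sketch below pushes an input through two dense layers with no activation and compares the result with a single linear unit whose weight and bias are computed from the two layers. The layer sizes and all weight values are hypothetical.

```python
import math


def dense_linear(inputs, weights, biases):
    """A dense layer with no activation: one weighted sum per unit."""
    return [sum(w * x for w, x in zip(unit_weights, inputs)) + b
            for unit_weights, b in zip(weights, biases)]


# Hypothetical network: 1 input -> 2 hidden units -> 1 output.
w1, b1 = [[2.0], [-1.0]], [1.0, 3.0]
w2, b2 = [[3.0, 0.5]], [-4.0]

# Collapse both layers into one linear unit y = w_single * x + b_single.
w_single = sum(w2[0][j] * w1[j][0] for j in range(2))
b_single = sum(w2[0][j] * b1[j] for j in range(2)) + b2[0]


def stacked(x):
    """Output of the two stacked linear layers."""
    return dense_linear(dense_linear([x], w1, b1), w2, b2)[0]


def single(x):
    """Output of the equivalent single linear unit."""
    return w_single * x + b_single


for x in [-2.0, 0.0, 1.5]:
    print(x, stacked(x), single(x), math.isclose(stacked(x), single(x)))
```

Whatever weights are chosen, the stacked version always reduces to one line, which is the motivation for placing something nonlinear between the layers.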

An activation function is simply some function we apply to each of a layer's outputs (its activations). The most common is the rectifier function max(0, x).

![image-2.png](attachment:image-2.png)

The rectifier function has a graph that's a line with the negative part "rectified" to zero. Applying the function to the outputs of a neuron will put a bend in the data, moving us away from simple lines. 

When we attach the rectifier to a linear unit, we get a rectified linear unit or ReLU. Applying a ReLU activation to a linear unit means the output becomes max(0, $wx+b$).
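
A minimal sketch of the rectifier and of a ReLU unit in plain Python follows. The weight and bias are hypothetical values picked so that one input lands on each side of the bend.

```python
def relu(z):
    """Rectifier: pass positive values through, replace negatives with 0."""
    return max(0.0, z)


def relu_unit(x, w, b):
    """A linear unit followed by the rectifier: max(0, w * x + b)."""
    return relu(w * x + b)


# Hypothetical weight and bias.
w, b = 2.0, -5.0

print(relu_unit(1.0, w, b))  # 2.0*1.0 - 5.0 = -3.0, rectified to 0.0
print(relu_unit(4.0, w, b))  # 2.0*4.0 - 5.0 = 3.0, passed through
```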

**4. Stacking Dense Layers** 

Now that we have some nonlinearity, let's see how we can stack layers to get complex data transformations.

![image-3.png](attachment:image-3.png)

The layers before the output layer are sometimes called hidden since we never see their outputs directly. Notice that the final (output) layer is a linear unit, meaning it has no activation function. That makes this network appropriate for a regression task, where we are trying to predict some arbitrary numeric value. Other tasks (like classification) might require an activation function on the output.

**5. Building Sequential Models**

The Sequential model we've been using will connect together a list of layers in order from first to last: the first layer gets the input, the last layer produces the output.

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # the hidden ReLU layers
    layers.Dense(units=4, activation='relu', input_shape=[2]),
    layers.Dense(units=3, activation='relu'),
    # the linear output layer 
    layers.Dense(units=1),
])

Be sure to pass all the layers together in a list, like [layer, layer, layer, ...], instead of as separate arguments. To add an activation function to a layer, just give its name in the activation argument.
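
One way to get a feel for the model above is to count its trainable weights by hand. The sketch below assumes each dense unit has one weight per input plus one bias; the helper name dense_param_count is made up for this example and is not part of Keras.

```python
def dense_param_count(n_inputs, units):
    """Weights plus biases in a dense layer: one weight per (input, unit) pair, one bias per unit."""
    return n_inputs * units + units


# Same layout as the model above: 2 inputs, then layers of 4, 3 and 1 units.
n_inputs = 2
layer_units = [4, 3, 1]

total = 0
for units in layer_units:
    count = dense_param_count(n_inputs, units)
    print(f"Dense({units}) with {n_inputs} inputs: {count} parameters")
    total += count
    n_inputs = units  # this layer's outputs are the next layer's inputs

print("total:", total)  # 12 + 15 + 4 = 31
```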

# Stochastic Gradient Descent 

**1. Introduction**

We learned how to build fully-connected networks out of stacks of dense layers. When first created, all of the network's weights are set randomly -- the network doesn't know anything yet. In this lesson, we're going to see how to train a neural network.

As with all machine learning tasks, we begin with a set of training data. Each example in the training data consists of some features together with an expected target. Training the network means adjusting its weights in such a way that it can transform the features into the target.

In addition to the training data, we need two more things:
- A "loss function" that measures how good the network's predictions are.
- An "optimizer" that can tell the network how to change its weights.

**2. The Loss Function**

We've seen how to design an architecture for a network, but we haven't seen how to tell a network what problem to solve. This is the job of the loss function. 

The loss function measures the disparity between the target's true value and the value the model predicts.

Different problems call for different loss functions. We have been looking at regression problems, where the task is to predict some numerical value. A common loss function for regression problems is the mean absolute error, or MAE. For each prediction y_pred, MAE measures the disparity from the true target y_true by an absolute difference abs(y_true - y_pred).

The total MAE loss on a dataset is the mean of all these absolute differences.

![image.png](attachment:image.png)
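
The MAE definition above takes only a few lines to compute by hand. In this sketch the targets and predictions are invented numbers, used only to show the arithmetic.

```python
def mean_absolute_error(y_true, y_pred):
    """Mean of abs(y_true - y_pred) over all examples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical targets and predictions.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)
print(mae)  # (0.5 + 0.0 + 2.0 + 1.0) / 4 = 0.875
```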

Besides MAE, other loss functions you might see for regression problems are the mean-squared error (MSE) and the Huber loss.

During training, the model will use the loss function as a guide for finding the correct value of its weights. In other words, the loss function tells the network its objective. 

**3. The optimizer - Stochastic Gradient Descent**

We've described the problem we want the network to solve, but now we need to say how to solve it. This is the job of the optimizer. The optimizer is an algorithm that adjusts the weights to minimize the loss.

Virtually all of the optimization algorithms used in deep learning belong to a family called stochastic gradient descent. They are iterative algorithms that train a network in steps. 

- Sample some training data and run it through the network to make predictions
- Measure the loss between the predictions and the true values
- Finally, adjust the weights in a direction that makes the loss smaller 

Then just do this over and over until the loss is as small as you like (or until it won't decrease any further). Each iteration's sample of training data is called a minibatch (or often just "batch"), while a complete round of the training data is called an epoch. The number of epochs you train for is how many times the network will see each training example.
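
The three steps can be sketched for a single linear unit in plain Python. This is an illustration under stated assumptions, not how Keras implements training: it uses the mean squared error rather than MAE because its gradient is simpler to write by hand, it starts from zero weights instead of random ones so the run is reproducible, and the data is synthetic (generated from y = 3x + 1).

```python
import random


def sgd_fit(xs, ys, learning_rate=0.05, batch_size=4, epochs=500, seed=0):
    """Fit y = w * x + b with minibatch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0  # fixed start for reproducibility
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Step 1: run the minibatch through the model.
            # Step 2: measure the error of each prediction.
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            # Gradient of the mean squared error with respect to w and b.
            grad_w = 2.0 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = 2.0 * sum(errors) / len(batch)
            # Step 3: shift the weights in the direction that lowers the loss.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Synthetic, noise-free data from the line y = 3x + 1.
xs = [i / 10 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]

w, b = sgd_fit(xs, ys)
print(round(w, 3), round(b, 3))
```

With 20 examples and a batch size of 4, each epoch consists of 5 minibatches, so 500 epochs amount to 2500 weight updates.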

![image-2.png](attachment:image-2.png)

**4. Learning Rate and Batch size**

Notice that the line only makes a small shift in the direction of each batch (instead of moving all the way). The size of these shifts is determined by the learning rate. A smaller learning rate means the network needs to see more minibatches before its weights converge to their best values.
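
The effect of the learning rate on the number of steps can be seen on a toy problem. The sketch below runs plain gradient descent (no minibatches) on the hypothetical one-weight loss (w - target)**2 and counts how many updates it takes to get close to the best value; the function name, the target, and the tolerance are all assumptions made for the example.

```python
def steps_to_converge(learning_rate, start=0.0, target=3.0, tol=1e-3, max_steps=10_000):
    """Gradient descent on loss(w) = (w - target)**2; return the number of steps until |w - target| < tol."""
    w = start
    for step in range(1, max_steps + 1):
        grad = 2.0 * (w - target)  # derivative of the loss at the current w
        w -= learning_rate * grad  # the shift is proportional to the learning rate
        if abs(w - target) < tol:
            return step
    return max_steps


small = steps_to_converge(0.01)
large = steps_to_converge(0.1)
print("learning rate 0.01:", small, "steps")
print("learning rate 0.1: ", large, "steps")
```

On this toy loss the smaller learning rate needs roughly ten times as many updates. On real networks a learning rate that is too large can also overshoot and fail to converge, which is part of why the choice is not always obvious.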

The learning rate and the size of the minibatches are the two parameters that have the largest effect on how the SGD training proceeds. Their interaction is often subtle and the right choice for these parameters isn't always obvious. 

Fortunately, for most work it won't be necessary to do an extensive hyperparameter search to get satisfactory results. Adam is an SGD algorithm that has an adaptive learning rate, which makes it suitable for most problems without any parameter tuning. Adam is a great general-purpose optimizer.

**5. Adding the Loss and Optimizer**

After defining a model, you can add a loss function and optimizer with the model's compile method:

In [None]:
model.compile(
    optimizer = "adam",
    loss = "mae"
)

**6. Example**

In [1]:
import pandas as pd
from IPython.display import display

red_wine = pd.read_csv('../input/dl-course-data/red-wine.csv')

# Create training and validation splits
df_train = red_wine.sample(frac=0.7, random_state=0)
df_valid = red_wine.drop(df_train.index)
display(df_train.head(4))

# Scale to [0, 1]
max_ = df_train.max(axis=0)
min_ = df_train.min(axis=0)
df_train = (df_train - min_) / (max_ - min_)
df_valid = (df_valid - min_) / (max_ - min_)

# Split features and target
X_train = df_train.drop('quality', axis=1)
X_valid = df_valid.drop('quality', axis=1)
y_train = df_train['quality']
y_valid = df_valid['quality']


# Define a model 

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(512, activation='relu', input_shape=[11]),
    layers.Dense(512, activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dense(1),
])

model.compile(
    optimizer='adam',
    loss='mae',
)

# Train the model; fit() returns a History object with the loss for each epoch

history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=10,
)

# convert the training history to a dataframe

history_df = pd.DataFrame(history.history)

# use Pandas native plot method

history_df['loss'].plot();

# Overfitting and Underfitting

**1. Introduction**

Recall from the example in the previous lesson that Keras will keep a history of the training and validation loss over the epochs that it is training the model. In this lesson, we're going to learn how to interpret these learning curves and how we can use them to guide model development. In particular, we'll examine the learning curves for evidence of underfitting and overfitting and look at a couple of strategies for correcting them.

**2. Interpreting the Learning Curves**

You might think about the information in the training data as being of two kinds: signal and noise. The signal is the part that generalizes, the part that can help our model make predictions from new data. The noise is the part that is only true of the training data; the noise is all of the random fluctuation that comes from data in the real world, or all of the incidental, non-informative patterns that can't actually help the model make predictions. The noise is the part that might look useful but really isn't.

We train a model by choosing weights or parameters that minimize the loss on a training set. You might know, however, that to accurately assess a model's performance, we need to evaluate it on a new set of data, the validation data.

When we train a model, we've been plotting the loss on the training set epoch by epoch. To this we'll add a plot of the loss on the validation data too. These plots we call the learning curves. To train deep learning models effectively, we need to be able to interpret them.

![image.png](attachment:image.png)

Now, the training loss will go down either when the model learns signal or when it learns noise. But the validation loss will go down only when the model learns signal. (Whatever noise the model learned from the training set won't generalize to new data.) So, when a model learns signal both curves go down, but when it learns noise a gap is created between the curves. The size of the gap tells you how much noise the model has learned.

Ideally, we would create models that learn all of the signal and none of the noise. This will practically never happen. Instead we make a trade: we can get the model to learn more signal at the cost of learning more noise. So long as the trade is in our favor, the validation loss will continue to decrease. After a certain point, however, the trade can turn against us, the cost exceeds the benefit, and the validation loss begins to rise.

This trade-off indicates that there can be two problems that occur when training a model: not enough signal or too much noise. Underfitting the training set is when the loss is not as low as it could be because the model hasn't learned enough signal. Overfitting the training set is when the loss is not as low as it could be because the model learned too much noise. The trick to training deep learning models is finding the best balance between the two.

**3. Capacity**

A model's capacity refers to the size and complexity of the patterns it is able to learn. For neural networks, this will largely be determined by how many neurons the network has and how they are connected together. If it appears that your network is underfitting the data, you should try increasing its capacity.

You can increase the capacity of a network either by making it wider (adding more units to existing layers) or by making it deeper (adding more layers). Wider networks have an easier time learning more linear relationships, while deeper networks prefer more nonlinear ones.

In [None]:
model = keras.Sequential([
    layers.Dense(16, activation='relu'),
    layers.Dense(1),
])

wider = keras.Sequential([
    layers.Dense(32, activation='relu'),
    layers.Dense(1),
])

deeper = keras.Sequential([
    layers.Dense(16, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(1),
])

**4. Early Stopping**

We mentioned that when a model is too eagerly learning noise, the validation loss may start to increase during training. To prevent this, we can simply stop the training whenever it seems the validation loss isn't decreasing anymore. Interrupting the training this way is called early stopping.

![image.png](attachment:image.png)

Once we detect that the validation loss is starting to rise again, we can reset the weights back to where the minimum occurred. This ensures that the model won't continue to learn noise and overfit the data.

Training with early stopping also means we're in less danger of stopping the training too early, before the network has finished learning signal. So besides preventing overfitting from training too long, early stopping can also prevent underfitting from not training long enough. Just set your training epochs to some large number, and early stopping will take care of the rest. 

**5. Adding Early Stopping**

In Keras, we include early stopping in our training through a callback. A callback is just a function you want to run every so often while the network trains.

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    min_delta=0.001, # minimum amount of change to count as an improvement
    patience=20, # how many epochs to wait before stopping
    restore_best_weights=True,
)

These parameters say: "If there hasn't been at least an improvement of 0.001 in the validation loss over the previous 20 epochs, then stop the training and keep the best model you found." It can sometimes be hard to tell if the validation loss is rising due to overfitting or just due to random batch variation. The parameters allow us to set some allowances around when to stop.
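
The rule described above can be sketched in plain Python to make the roles of min_delta and patience concrete. This is a simplified illustration of the idea, not Keras's actual implementation; the function name and the list of validation losses are hypothetical.

```python
def early_stopping_epoch(val_losses, min_delta=0.001, patience=20):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Improvement of at least min_delta: remember it and reset the counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch  # stop here; the best weights are from best_epoch
    return len(val_losses) - 1, best_epoch  # ran out of epochs without triggering


# Hypothetical validation losses: falling, then rising as overfitting sets in.
losses = [1.0, 0.8, 0.7, 0.69, 0.71, 0.72, 0.75, 0.80]
stop_epoch, best_epoch = early_stopping_epoch(losses, min_delta=0.001, patience=3)
print(stop_epoch, best_epoch)  # stops at epoch 6, best was epoch 3
```

Restoring the best weights corresponds to going back to the model as it was at best_epoch rather than keeping the one from stop_epoch.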

**6. Example**

In [None]:
from tensorflow import keras
from tensorflow.keras import layers, callbacks

early_stopping = callbacks.EarlyStopping(
    min_delta=0.001, # minimum amount of change to count as an improvement
    patience=20, # how many epochs to wait before stopping
    restore_best_weights=True,
)

model = keras.Sequential([
    layers.Dense(512, activation='relu', input_shape=[11]),
    layers.Dense(512, activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dense(1),
])
model.compile(
    optimizer='adam',
    loss='mae',
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=500,
    callbacks=[early_stopping], # put your callbacks in a list
    verbose=0,  # turn off training log
)

# Dropout and Batch Normalization

**1. Introduction**

There's more to the world of deep learning than just dense layers. There are dozens of kinds of layers you might add to a model. Some are like dense layers and define connections between neurons, and others can do preprocessing or transformations of other sorts.

In this lesson, we'll learn about two kinds of special layers, which don't contain any neurons themselves but add some functionality that can sometimes benefit a model in various ways. Both are commonly used in modern architectures.

**2. Dropout**

The first of these is the "dropout layer", which can help correct overfitting. 

In the last lesson we talked about how overfitting is caused by the network learning spurious patterns in the training data. To recognize these spurious patterns, a network will often rely on very specific combinations of weights, a kind of "conspiracy" of weights. Being so specific, they tend to be fragile: remove one and the conspiracy falls apart.

This is the idea behind dropout. To break up these conspiracies, we randomly drop out some fraction of a layer's input units every step of training, making it much harder for the network to learn those spurious patterns in the training data. Instead, it has to search for broad, general patterns, whose weight patterns tend to be more robust.

![image.png](attachment:image.png)

You could also think about dropout as creating a kind of ensemble of networks. The predictions will no longer be made by one big network, but instead by a committee of smaller networks. Individuals in the committee tend to make different kinds of mistakes, but be right at the same time, making the committee as a whole better than any individual.
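
What happens to a layer's inputs during one training step can be sketched in plain Python. The version below zeroes each input with probability rate and scales the survivors by 1 / (1 - rate) so that the expected sum stays the same, which is also how the Keras Dropout layer behaves during training. The input values and the random seed are arbitrary.

```python
import random


def dropout(inputs, rate, rng):
    """Zero each input with probability `rate`; scale the survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else x * keep_scale for x in inputs]


rng = random.Random(0)  # seeded so the sketch is reproducible
activations = [1.0] * 10_000  # hypothetical inputs to the next layer

dropped = dropout(activations, rate=0.3, rng=rng)
fraction_zeroed = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(round(fraction_zeroed, 2), round(mean_after, 2))
```

Because a different random subset is dropped at every step, no single weight combination can be relied on, which is the "conspiracy breaking" effect described above.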

**3. Adding Dropout**

In Keras, the dropout rate argument, rate, defines what fraction of the input units to shut off. Put the Dropout layer just before the layer you want the dropout applied to.

In [None]:
keras.Sequential([
    # ...
    layers.Dropout(rate=0.3),  # apply 30% dropout to the next layer
    layers.Dense(16),
    # ...
])

**4. Batch Normalization**

The next special layer we'll look at performs "batch normalization" which can help correct training that is slow or unstable. 

With neural networks, it's generally a good idea to put all of your data on a common scale, perhaps with something like scikit-learn's StandardScaler or MinMaxScaler. The reason is that SGD will shift the network weights in proportion to how large an activation the data produces. Features that tend to produce activations of very different sizes can make for unstable training behavior.

Now, if it's good to normalize the data before it goes into the network, maybe also normalizing inside the network would be better! In fact, we have a special kind of layer that can do this, the batch normalization layer. A batch normalization layer looks at each batch as it comes in, first normalizing the batch with its own mean and standard deviation, and then also putting the data on a new scale with two trainable rescaling parameters. Batchnorm, in effect, performs a kind of coordinated rescaling of its inputs.
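
The two stages described above (normalize with the batch's own statistics, then rescale) can be sketched for a single feature in plain Python. This covers only the training-time computation; gamma and beta stand for the two trainable rescaling parameters, and epsilon is a small constant assumed here to avoid division by zero. The batch values are invented.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, epsilon=1e-3):
    """Normalize one feature across a batch, then rescale with gamma and shift with beta."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    normalized = [(x - mean) / math.sqrt(variance + epsilon) for x in batch]
    return [gamma * z + beta for z in normalized]


# Hypothetical batch of one feature on a large scale.
batch = [10.0, 20.0, 30.0, 40.0]

out = batch_norm(batch)
out_mean = sum(out) / len(out)
out_var = sum((z - out_mean) ** 2 for z in out) / len(out)
print(round(out_mean, 6), round(out_var, 4))  # close to mean 0 and variance 1

rescaled = batch_norm(batch, gamma=2.0, beta=5.0)
print(round(sum(rescaled) / len(rescaled), 6))  # the mean moves to beta
```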

Most often, batchnorm is added as an aid to the optimization process (though it can sometimes also help prediction performance). Models with batchnorm tend to need fewer epochs to complete training. Moreover, batchnorm can also fix various problems that can cause the training to get "stuck". Consider adding batch normalization to your models, especially if you're having trouble during training.

**5. Adding Batch Normalization**

It seems that batch normalization can be used at almost any point in a network. You can put it after a layer, or between a layer and its activation function. And if you add it as the first layer of your network, it can act as a kind of adaptive preprocessor, standing in for something like scikit-learn's StandardScaler.

In [None]:
layers.Dense(16),
layers.BatchNormalization(),
layers.Activation('relu'),

**6. Example**

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(1024, activation='relu', input_shape=[11]),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1024, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1024, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1),
])

model.compile(
    optimizer='adam',
    loss='mae',
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=100,
    verbose=0,
)


# Show the learning curves
history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot();

# Binary Classification

**1. Introduction**

So far in this course, we've learned about how neural networks can solve regression problems. Now we're going to apply neural networks to another common machine learning problem: classification. Most everything we've learned up until now still applies. The main difference is in the loss function we use and in what kind of outputs we want the final layer to produce.

**2. Binary Classification**

Classification into one of two classes is a common machine learning problem. You might want to predict whether or not a customer is likely to make a purchase, whether or not a credit card transaction was fraudulent, whether deep space signals show evidence of a new planet, or whether a medical test shows evidence of a disease. These are all binary classification problems.

In your raw data, the classes might be represented by strings like "Yes" and "No", or "Dog" and "Cat". Before using this data we'll assign a class label: one class will be 0 and the other will be 1. Assigning numeric labels puts the data in a form a neural network can use.

**3. Accuracy and Cross-Entropy**

Accuracy is one of the many metrics in use for measuring success on a classification problem. Accuracy is the ratio of correct predictions to total predictions: accuracy = number_correct / total. A model that always predicted correctly would have an accuracy score of 1.0. All else being equal, accuracy is a reasonable metric to use whenever the classes in the dataset occur with about the same frequency.
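
The accuracy formula, and the reason class balance matters, can be seen with a few lines of plain Python. The labels below are invented; the second example assumes a hypothetical dataset where 95 of 100 examples belong to class 0.

```python
def accuracy(y_true, y_pred):
    """Ratio of correct predictions to total predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels and predictions: 3 of 5 are correct.
print(accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))  # 3 / 5 = 0.6

# Hypothetical imbalanced dataset: a "model" that always predicts class 0
# still scores highly, even though it never finds the rare class.
imbalanced_true = [0] * 95 + [1] * 5
always_zero = [0] * 100
print(accuracy(imbalanced_true, always_zero))  # 95 / 100 = 0.95
```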

The problem with accuracy (and most other classification metrics) is that it can't be used as a loss function. SGD needs a loss function that changes smoothly, but accuracy, being a ratio of counts, changes in "jumps". So, we have to choose a substitute to act as the loss function. This substitute is the cross-entropy function.

Now, recall that the loss function defines the objective of the neural network during training. With regression, our goal was to minimize the distance between the expected outcome and the predicted outcome. We chose MAE to measure this distance. 

For classification, what we want instead is a distance between probabilities, and this is what cross-entropy provides. Cross-Entropy is a sort of measure for the distance from one probability distribution to another. 

The idea is that we want our network to predict the correct class with probability 1.0. The further away the predicted probability is from 1.0, the greater will be the cross-entropy loss. 
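
A minimal sketch of binary cross-entropy in plain Python shows how the loss grows as the predicted probability for the correct class moves away from 1.0. Here p_pred is the predicted probability of class 1, and the small eps used for clipping is an assumption added to avoid taking the logarithm of zero.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean of -(t * log(p) + (1 - t) * log(1 - p)) over all examples."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log() never sees 0
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


# The true class is 1 in each case; only the predicted probability changes.
confident_right = binary_cross_entropy([1], [0.9])
unsure = binary_cross_entropy([1], [0.5])
confident_wrong = binary_cross_entropy([1], [0.1])
print(round(confident_right, 3), round(unsure, 3), round(confident_wrong, 3))
```

Unlike accuracy, this value changes smoothly as the predicted probability changes, which is what makes it usable as a loss for SGD.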

The technical reasons we use cross-entropy are a bit subtle, but the main thing to take away from this section is just this: use cross-entropy for a classification loss; other metrics you might care about (like accuracy) will tend to improve along with it.

**4. Making Probabilities with the Sigmoid Function**

The cross-entropy and accuracy functions both require probabilities as inputs, meaning numbers from 0 to 1. To convert the real-valued outputs produced by a dense layer into probabilities, we attach a new kind of activation function, the sigmoid function.

![image.png](attachment:image.png)

To get the final class prediction, we define a threshold probability. Typically this will be 0.5, so that rounding will give us the correct class: below 0.5 means the class with label 0 and 0.5 or above means the class with label 1. A 0.5 threshold is what Keras uses by default with its accuracy metric.
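
The sigmoid and the thresholding step can be sketched together in plain Python. The helper predict_class is a hypothetical name, and it follows the rule as stated above (0.5 or above maps to label 1). The sigmoid is written in two branches only to avoid overflow for large negative inputs.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # for negative z this stays small and cannot overflow
    return e / (1.0 + e)


def predict_class(z, threshold=0.5):
    """Turn a real-valued output into a class label using the threshold."""
    return 1 if sigmoid(z) >= threshold else 0


# Hypothetical real-valued outputs of the final dense layer.
for z in [-2.0, 0.0, 3.0]:
    print(z, round(sigmoid(z), 3), predict_class(z))
```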

**5. Example**

In [1]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(4, activation='relu', input_shape=[33]),
    layers.Dense(4, activation='relu'),    
    layers.Dense(1, activation='sigmoid'),
])

In the final layer, include a 'sigmoid' activation so that the model will produce class probabilities. Add the cross-entropy loss and accuracy metric to the model with its compile method. For two-class problems, be sure to use the 'binary' versions.

In [None]:
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['binary_accuracy'],
)

The model in this particular problem can take quite a few epochs to complete training, so we'll include an early stopping callback for convenience.

In [None]:
early_stopping = keras.callbacks.EarlyStopping(
    patience=10,
    min_delta=0.001,
    restore_best_weights=True,
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=512,
    epochs=1000,
    callbacks=[early_stopping],
    verbose=0, # hide the output because we have so many epochs
)