## Deep Learning Notes from [DeepLizard](https://deeplizard.com/learn/playlist/PLZbbT5o_s2xq7LwI2y8_QtvuXZedL6tQU)

In [1]:
from keras.models import Sequential
from keras.layers import Dense , Activation

Dense layer - connects every input to every output of the layer  
Here 32 is the number of neurons in the hidden layer; the input layer has 10 neurons, as specified by `input_shape`  
Even though we have specified only 2 `Dense` layers, this is a 3-layer model, since `input_shape` implicitly
defines the input layer, which becomes the first layer

In [17]:
model = Sequential([
    Dense(32,input_shape = (10,), activation='relu'), # 32 - number of neurons , input_shape - shape of one input sample
    Dense(2 , activation='softmax')
])
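
As a rough illustration of what a `Dense` layer computes, the sketch below implements one fully connected layer in plain Python: each output is a weighted sum of all the inputs plus a bias, passed through an activation. The helper names (`relu`, `dense_forward`, `dense_param_count`) and the tiny weights are made up for illustration and are not Keras internals.

```python
def relu(values):
    # ReLU keeps positive values and replaces negative ones with zero.
    return [max(0.0, v) for v in values]

def dense_forward(inputs, weights, biases, activation):
    # weights[j][i] connects input i to output j, so every input
    # contributes to every output (hence "fully connected").
    outputs = []
    for w_row, b in zip(weights, biases):
        outputs.append(sum(w * x for w, x in zip(w_row, inputs)) + b)
    return activation(outputs)

def dense_param_count(n_inputs, n_outputs):
    # One weight per (input, output) pair plus one bias per output.
    return n_inputs * n_outputs + n_outputs

inputs = [1.0, -2.0, 3.0]
weights = [[0.5, 0.0, 1.0], [-1.0, 1.0, 0.0]]  # 3 inputs -> 2 outputs
biases = [0.0, 1.0]
hidden = dense_forward(inputs, weights, biases, relu)
print(hidden)

# Trainable parameters of the two Dense layers in the model above.
print(dense_param_count(10, 32), dense_param_count(32, 2))
```

The second output is zero because its weighted sum is negative and ReLU clips it.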

## There are different types of layers:
 - Dense (fully connected)
  - Connects each input to each output
 - Convolutional Layers 
  - Image Analysis
 - Pooling Layers
 - Recurrent Layers
  - Time Series
 - Normalization Layers
 - Many others

In [30]:
# Creating another model to classify cats and dogs with 3 input features. Note that the first layer in the list
# is a hidden layer and is actually the second layer of the network

layers = [
    Dense(5, input_shape = (3,),  activation='relu'),   # we need to provide only the first layer shape
    Dense(2 , activation='softmax') # model will be able to infer the next layers
]

model = Sequential(layers)

### Another way of specifying activation function

In [37]:
model = Sequential()

In [39]:
model.add(Dense(5 , input_shape = (3,)))
model.add(Activation('relu'))

In [41]:
model.add(Dense(2))
model.add(Activation('softmax'))

In [43]:
# model.get_config()

## Optimizer

The optimizer updates the weights in the model, e.g. SGD (stochastic gradient descent)  
The main function of the optimizer is to minimize the loss


### What happens during one pass ?

During the first pass the weights are initialized at random and the data passes through the network. At the last layer a probability
is output for each class; these probabilities are compared with the actual labels, and the difference is called the loss. The optimizer then updates the weights to reduce this loss
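
The loop of forward pass, loss, and weight update can be sketched with a single weight and plain gradient descent. This is a toy under stated assumptions: a one-weight linear model, a squared-error loss, and a starting weight of 0.0 instead of a random value so the run is reproducible. `train_step` is a hypothetical helper, not a Keras function.

```python
def train_step(weight, x, target, learning_rate):
    prediction = weight * x                     # forward pass
    loss = (prediction - target) ** 2           # how far off the prediction is
    gradient = 2 * (prediction - target) * x    # d(loss) / d(weight)
    new_weight = weight - learning_rate * gradient  # step against the gradient
    return new_weight, loss

weight = 0.0
losses = []
for _ in range(20):
    weight, loss = train_step(weight, x=2.0, target=4.0, learning_rate=0.05)
    losses.append(loss)

print(losses[0], losses[-1], weight)
```

The loss shrinks on every step and the weight approaches 2.0, the value for which `weight * 2.0` equals the target 4.0.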

### Epoch - a single pass through the entire training data is called an epoch

In [62]:
from keras.optimizers import Adam
from keras.metrics import sparse_categorical_crossentropy

In [59]:
layers = [
    Dense(16, input_shape = (1,) , activation='relu'),
    Dense(32 , activation= 'relu'),
    Dense(2, activation='softmax')
]

In [60]:
model = Sequential(layers)

In [63]:
model.compile(Adam(learning_rate=.001) , loss= 'sparse_categorical_crossentropy' , metrics=['accuracy'])

In [93]:
model.fit(x=train_samples, y=train_labels, batch_size=10, epochs=20, shuffle=True, verbose=2 )

In [80]:
model.loss

'sparse_categorical_crossentropy'

In [85]:
model.layers

[<keras.layers.core.Dense at 0x7f815100fa50>,
 <keras.layers.core.Dense at 0x7f815100fd50>,
 <keras.layers.core.Dense at 0x7f815100fe10>]

### Validation in Keras

#### If we want to hold out validation data from the training data, we can use `validation_split`, which sets aside the given fraction (0.3 below, i.e. 30%) as validation data

In [92]:
model.fit(x=train_samples, y=train_labels,validation_split=0.3, batch_size=10, epochs=20, shuffle=True, verbose=2 )
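
To make the effect of `validation_split` concrete, here is a minimal sketch of the idea in plain Python. It assumes the held-out slice is taken from the end of the data before any shuffling, which is how the Keras documentation describes it; `split_validation` is a hypothetical helper and the rounding is my own simplification, not the exact Keras code.

```python
def split_validation(samples, labels, validation_split):
    # Hold out the last `validation_split` fraction of the data.
    n_val = int(round(len(samples) * validation_split))
    split_at = len(samples) - n_val
    train = (samples[:split_at], labels[:split_at])
    valid = (samples[split_at:], labels[split_at:])
    return train, valid

samples = list(range(10))
labels = [s % 2 for s in samples]
(train_x, train_y), (valid_x, valid_y) = split_validation(samples, labels, 0.3)
print(train_x, valid_x)
```

Because the slice comes from the end, data that is sorted by class should be shuffled before calling `fit`, otherwise the validation set may contain only one class.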

#### Else, if we have separate validation data, we can pass it using `validation_data`

valid_set = (valid_samples, valid_labels)   # passed as validation_data=valid_set

In [None]:
model.predict(test_sample, batch_size=10, verbose=0)

### How to know if the model is overfitting
#### If the validation loss is considerably higher than the training loss, or keeps rising while the training loss keeps falling, the model is probably overfitting

### Steps to reduce overfitting
 - #### We can do Data Augmentation
     - Cropping
     - Rotating
     - Zooming
     - Flipping
     - etc
 - #### Dropout
     - Randomly ignores a subset of the nodes in a layer during training, hence the name dropout
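
The dropout idea from the list above can be sketched in a few lines. This is an illustrative version of so-called inverted dropout, where the surviving activations are scaled up so their expected sum stays the same; the function name and the fixed seed are assumptions made for the example, not the Keras implementation.

```python
import random

def dropout(activations, rate, rng):
    # Each activation is zeroed with probability `rate`; the survivors
    # are scaled by 1 / (1 - rate) to keep the expected total unchanged.
    keep_prob = 1.0 - rate
    result = []
    for a in activations:
        if rng.random() < rate:
            result.append(0.0)
        else:
            result.append(a / keep_prob)
    return result

rng = random.Random(0)  # fixed seed so the example is reproducible
dropped = dropout([1.0] * 1000, rate=0.5, rng=rng)
print(dropped.count(0.0), "of", len(dropped), "activations were dropped")
```

Dropout is only applied during training; at prediction time all nodes are used.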

### Underfitting 
#### The model is not able to classify even the data it was trained on

### Steps to reduce underfitting
 - #### Increase the model complexity by adding more layers or more neurons (the opposite of what we do for overfitting)
 - #### Add more features to the input samples
 - #### Reduce dropout


### Unsupervised Learning 

#### Accuracy - we do not measure accuracy since the data has no labels. Clustering is one example of unsupervised learning.

### Autoencoder 
A neural network that takes in an input and outputs a reconstruction of that input. The loss function measures how similar the reconstructed image is to the original image

### Flow - Image -> Encoder -> Compressed Representation -> Decoder -> Reconstructed Input

loss = difference between the reconstructed input and the original input (for example, the mean squared error)
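
One common way to score a reconstruction is the mean squared error between the original and the reconstructed values. The notes do not say which loss the lesson uses, so treat this as one possible choice; the pixel values below are made up.

```python
def mean_squared_error(original, reconstructed):
    # Average of the squared per-pixel differences; 0.0 means a perfect copy.
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)

original = [0.0, 0.5, 1.0, 0.5]   # a tiny flattened "image"
perfect = [0.0, 0.5, 1.0, 0.5]    # ideal reconstruction
blurry = [0.5, 0.5, 0.5, 0.5]     # reconstruction that lost the detail

print(mean_squared_error(original, perfect))
print(mean_squared_error(original, blurry))
```

The better the decoder rebuilds the input from the compressed representation, the closer this loss gets to zero.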

#### The optimizer is still some version of SGD. A practical application is removing noise from images (denoising)

### Semi - Supervised Learning

Uses a combination of both supervised and unsupervised learning 

Pseudo-labeling - only a portion of the data is labeled, while the rest remains unlabeled; the model's own predictions are used as labels for the unlabeled part

Steps - first train a regular neural network on the labeled data, then use this model to predict labels for the unlabeled data, then train the model again on the combination of labeled and pseudo-labeled data (the data whose labels we predicted)
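
The three steps can be sketched end to end with a deliberately tiny stand-in for the neural network: a nearest-centroid classifier on one-dimensional data. The classifier, the helper names, and the data are all assumptions chosen to keep the example short; only the train, predict, retrain flow mirrors the description above.

```python
def fit_centroids(samples, labels):
    # "Training": store the mean sample value of each class.
    centroids = {}
    for cls in set(labels):
        members = [s for s, l in zip(samples, labels) if l == cls]
        centroids[cls] = sum(members) / len(members)
    return centroids

def predict(centroids, samples):
    # Assign each sample to the class with the nearest centroid.
    return [min(centroids, key=lambda c: abs(s - centroids[c])) for s in samples]

labeled_x = [1.0, 2.0, 9.0, 10.0]
labeled_y = ["cat", "cat", "dog", "dog"]
unlabeled_x = [1.5, 3.0, 8.0, 9.5]

# Step 1: train on the labeled data only.
centroid_model = fit_centroids(labeled_x, labeled_y)
# Step 2: predict pseudo-labels for the unlabeled data.
pseudo_y = predict(centroid_model, unlabeled_x)
# Step 3: retrain on labeled + pseudo-labeled data combined.
centroid_model = fit_centroids(labeled_x + unlabeled_x, labeled_y + pseudo_y)

print(pseudo_y)
print(centroid_model)
```

If the first model's predictions are wrong, the errors are baked into the second round of training, so pseudo-labeling works best when the initial model is already reasonably accurate.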

### Data Augmentation

Creating new data by modifying existing data, so that we have more samples to train on.  
E.g. if most of the images of dogs face right, it is difficult for the model to learn left-facing dogs.  
We fix this by horizontally flipping the images. We usually do not flip vertically, since it would not make sense: in the real world we rarely see upside-down images.
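
A horizontal flip is simple enough to show directly. The sketch below treats an image as a list of pixel rows, which is an assumption for illustration; in practice an image library or the framework's augmentation utilities would do this.

```python
def flip_horizontal(image):
    # Reverse every row so that left and right are swapped.
    return [list(reversed(row)) for row in image]

image = [
    [1, 2, 3],
    [4, 5, 6],
]
flipped = flip_horizontal(image)

# The augmented dataset contains both the original and the flipped copy.
augmented = [image, flipped]
print(flipped)
```

Flipping twice gives back the original image, so one flip per image is all this augmentation can add.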

### One Hot Encoding

Vector length will be equal  to number of classes / labels.  
Each element of the vector will be zero , except the one which the vector corresponds to (hence the name one hot encode)
Eg :

cat - [1,0,0]    
dog - [0,1,0]      
lizard - [0,0,1]    
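
A minimal sketch of the encoding, assuming the class order cat, dog, lizard; `one_hot` is a hypothetical helper written for this example.

```python
def one_hot(label, classes):
    # 1 at the position of the label, 0 everywhere else.
    return [1 if c == label else 0 for c in classes]

classes = ["cat", "dog", "lizard"]
encodings = {label: one_hot(label, classes) for label in classes}
for label, vector in encodings.items():
    print(label, vector)
```

Each vector has exactly one 1, and its length always equals the number of classes.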

### CNN

A type of neural network that specializes in pattern detection, hence useful for image analysis.  
The basis of a CNN is its convolutional layers, which are hidden layers; a CNN can have other kinds of layers as well.
For each convolutional layer we need to specify the number of **filters**; the filters are what detect the patterns.

E.g. an edge detector filter detects edges      
As we go deeper into the network, the later layers are able to detect more complex features like ears, eyes, mouth etc

#### Filter

A filter is a small matrix of some size, e.g. (3,3) or (4,4). This matrix slides across the whole image one position at a time, thereby visiting all the pixels in the image; this process is called **convolving**
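
The sliding can be written out by hand. The sketch below slides a 3 by 3 filter over a 4 by 4 image without padding and sums the element-wise products at each position, which is what deep learning libraries typically compute for a convolutional layer. The image and the vertical-edge filter values are made up for illustration.

```python
def convolve2d_valid(image, kernel):
    n_rows, n_cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    output = []
    # Visit every position where the filter fits entirely inside the image.
    for r in range(n_rows - k_rows + 1):
        row = []
        for c in range(n_cols - k_cols + 1):
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# Dark image with a bright column on the right (a vertical edge).
image = [
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = convolve2d_valid(image, vertical_edge)
print(feature_map)
```

The output is large only where the filter overlaps the bright column, which is how a filter "detects" a pattern.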

When adding a convolutional layer to a model, we also have to specify how many filters we want the layer to have. 

Gradient ascent differs from gradient descent by trying to maximize the loss rather than minimize it; it is used when visualizing filters, in order to emphasize the pattern each filter detects


#### Zero Padding

When we apply a filter to the original image in a convolutional layer, the resulting output shrinks   
This happens because the filter cannot be centered on the pixels at the edges, so the output becomes smaller

E.g. for a 4 by 4 input and a 3 by 3 filter the output is only a 2 by 2 matrix 

This can be calculated ahead of time using the formula
$(n - f + 1) \times (n - f + 1)$    
where n = number of rows/columns of the image and
      f = number of rows/columns of the filter

In [99]:
# Calculation for above example 2*2 here
(4-3+1)* (4-3+1)

4

In [101]:
# Another example. for 28*28 image with 3*3 filter output will be 26*26
(28 -3+1)* (28-3+1)

676

In [102]:
import math
math.sqrt(676)

26.0

If we pass through many layers and many filters, the image becomes smaller and smaller    
Another issue is that we lose valuable information at the edges of the image

#### Zero padding comes into play here: adding a border of zeros around the image lets us preserve its original size

When specifying a convolutional layer we can specify whether to use padding or not   
Depending on the input and the filter size, the padding could be a border of one pixel, or two, or three, and so on   
Most neural network libraries can add the padding automatically; we just need to specify whether we want padding or not

#### Filter size is specified by the *kernel_size* parameter

There are two types of padding (the `padding` parameter); the default is valid padding
- valid - means no padding
- same - means the input is zero-padded so that the output has the same size as the input (with stride 1)
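
The two padding modes can be compared with the general output-size formula `(n + 2p - f) // s + 1`, where `p` is the number of zero borders and `s` the stride. With `p = 0` and `s = 1` it reduces to `n - f + 1`. Both helpers are hypothetical, and `same_padding` assumes an odd filter size with stride 1.

```python
def conv_output_size(n, f, padding=0, stride=1):
    # n: input rows/columns, f: filter rows/columns,
    # padding: number of zero borders added on each side.
    return (n + 2 * padding - f) // stride + 1

def same_padding(f):
    # Borders needed so the output size equals the input size
    # (assumes an odd filter size and stride 1).
    return (f - 1) // 2

# 'valid': no padding, the output shrinks.
print(conv_output_size(4, 3), conv_output_size(28, 3))

# 'same': pad so that the output keeps the input size.
print(conv_output_size(28, 3, padding=same_padding(3)))
print(conv_output_size(20, 5, padding=same_padding(5)))
```

A 3 by 3 filter needs one border of zeros and a 5 by 5 filter needs two, which is why the amount of padding depends on the filter size.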

### Max Pooling in NN

Max pooling is done after applying the filters of a convolutional layer.   
Filter size $2 \times 2$     
Stride = 2

We calculate the max value of each $2 \times 2$ region and then move by the stride, which is 2 here

We can think of these $2 \times 2$ regions as pools, and since we take the max of each one, hence the name max pooling

Let's take an example


A $28 \times 28$ image after applying a $3 \times 3$ filter -> $26 \times 26$ -> after max pooling becomes $13 \times 13$
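
Here is a small sketch of max pooling with a 2 by 2 pool and stride 2, written in plain Python on a made-up 4 by 4 feature map; `max_pool2d` is a hypothetical helper, not the Keras layer.

```python
def max_pool2d(matrix, pool_size=2, stride=2):
    output = []
    # Move the pool window across the matrix in steps of `stride`.
    for r in range(0, len(matrix) - pool_size + 1, stride):
        row = []
        for c in range(0, len(matrix[0]) - pool_size + 1, stride):
            window = [matrix[r + i][c + j]
                      for i in range(pool_size)
                      for j in range(pool_size)]
            row.append(max(window))  # keep only the largest value in the pool
        output.append(row)
    return output

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 9, 8],
    [6, 2, 7, 3],
]
pooled = max_pool2d(feature_map)
print(pooled)

# The size example from the text: 26 x 26 becomes 13 x 13.
big = [[0] * 26 for _ in range(26)]
print(len(max_pool2d(big)), len(max_pool2d(big)[0]))
```

Each output value summarizes one pool, so the width and height are halved while the strongest responses are kept.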

#### Why do we need max pooling ?
- Since the network keeps only the larger values (the more prominent features), it focuses on what matters most, and since the size decreases, the computational load decreases too
- Reduces overfitting

In [111]:
from keras.layers import MaxPooling2D #since the previous layers were 2d we are using maxpool2d
from keras.layers import Conv2D ,Flatten

layers = [
    Dense(16 , activation='relu', input_shape = (20,20,3)),
    Conv2D(32, kernel_size = (3,3) ,activation= 'relu' , padding='same'),
    MaxPooling2D(pool_size=(2,2) ,strides= 2, padding='valid'),
    Conv2D(64, kernel_size=(5,5) ,activation='relu', padding='same'),
    Flatten(),
    Dense(2,activation='softmax')
]

model=Sequential(layers)

In [112]:
model.layers

[<keras.layers.core.Dense at 0x7f81334f2dd0>,
 <keras.layers.convolutional.Conv2D at 0x7f81334f2250>,
 <keras.layers.pooling.MaxPooling2D at 0x7f81335ff7d0>,
 <keras.layers.convolutional.Conv2D at 0x7f81335ffd90>,
 <keras.layers.core.Flatten at 0x7f81334cc210>,
 <keras.layers.core.Dense at 0x7f81334ccad0>]

### Average pooling - takes the average of each pool instead of the max
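
For comparison with max pooling, a minimal sketch of average pooling on a made-up 4 by 4 feature map, again with a 2 by 2 pool and stride 2. `average_pool2d` is a hypothetical helper for illustration only.

```python
def average_pool2d(matrix, pool_size=2, stride=2):
    output = []
    for r in range(0, len(matrix) - pool_size + 1, stride):
        row = []
        for c in range(0, len(matrix[0]) - pool_size + 1, stride):
            window = [matrix[r + i][c + j]
                      for i in range(pool_size)
                      for j in range(pool_size)]
            row.append(sum(window) / len(window))  # mean of the pool
        output.append(row)
    return output

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 9, 8],
    [6, 2, 7, 3],
]
averaged = average_pool2d(feature_map)
print(averaged)
```

The output has the same shape as with max pooling, but strong individual responses are smoothed out instead of preserved.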