# Machine Learning

Machine learning (ML) is the practice of using algorithms to analyze data, learn from that data, and then make predictions about new data.

# Machine Learning vs Traditional Programming

For example, consider analyzing the sentiment of a media outlet and classifying it as positive or negative.

A traditional algorithm looks for particular words that have been tagged as positive or negative and, based on the count of those words, concludes whether the sentiment is positive or negative. It can only predict based on the words it already knows to be positive or negative.

A machine learning algorithm analyzes large amounts of data and learns the features that mark the sentiment as positive or negative. With what it has learned, it can classify new data as positive or negative sentiment.
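
To make the traditional, word-counting approach concrete, here is a minimal sketch. The word lists and the function name `rule_based_sentiment` are hypothetical and chosen only for illustration; the point is that the rule cannot say anything about words it was never given.

```python
# Hypothetical word lists; a real system would need far larger ones.
POSITIVE_WORDS = {"good", "great", "excellent", "love"}
NEGATIVE_WORDS = {"bad", "poor", "terrible", "hate"}


def rule_based_sentiment(text):
    """Classify text by counting known positive and negative words."""
    words = text.lower().split()
    positive = sum(word in POSITIVE_WORDS for word in words)
    negative = sum(word in NEGATIVE_WORDS for word in words)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "unknown"  # no tagged words, or a tie


print(rule_based_sentiment("a great match and an excellent result"))  # positive
print(rule_based_sentiment("what a dreadful performance"))            # unknown
```

The second sentence is clearly negative to a human reader, but "dreadful" is not in either list, so the rule has nothing to count.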

# Deep Learning

Deep learning (DL) is a subfield of ML that uses algorithms inspired by the structure and function of the brain's neural networks.

Learning can take place in two ways: supervised or unsupervised.

Supervised learning occurs when the algorithm learns and makes inferences from data that has already been labelled.

Unsupervised learning occurs when the algorithm learns and makes inferences from unlabelled data.

Labelled data - for example, a data set of 1000 images of dogs and cats in which each image is labelled as either dog or cat.

Unlabelled data - the images carry no dog or cat labels, so the algorithm learns from the different features of the images and groups them based on their similarities or differences.

Since these algorithms are based on the structure and function of the brain's neural networks, the models in deep learning are called artificial neural networks.

# Artificial Neural Network (ANN)

ANNs are computing systems inspired by the brain's neural networks.

<pre>
1) These networks contain a collection of connected units called neurons or artificial neurons.
2) Each connection between neurons can transmit a signal from one neuron to another.
3) The receiving neuron processes the signal and passes it downstream to the neurons connected to it.
4) Neurons are organized in layers, where each layer performs a particular transformation.
   The signal travels from the input layer to the output layer, and each layer in between is called a hidden layer.
</pre>

We will use Keras, a neural network API for Python.

# Keras Sequential Model

Sequential Model is a linear stack of layers

In [6]:
from keras.models import Sequential
from keras.layers import Dense, Activation
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")

In [7]:
model=Sequential([Dense(32,input_shape=(10,),activation='relu'),Dense(2,activation='softmax')])

Dense is the most basic layer in Keras; it connects each input to each output.

The hidden layer above is a Dense layer, since it connects each of its inputs to each of its outputs.

# Layers in ANN

<pre>
Different types of layers exist. Each is suited to a particular type of task and performs a different type of transformation:

1) Dense - connects each input to each output
2) Convolutional - most suited for working with images
3) Pooling 
4) Recurrent - suited for working with time series data
5) Normalization
6) Many others
</pre>

Let us look at how the layers actually work. Consider a neural network with 3 layers: input (3 nodes), hidden (5 nodes) and output (2 nodes). Each node is called a neuron. The nodes in the input layer are the features of the sample being passed in.

Each neuron is connected to the neurons of the next layer via weights. A weight is just a number, typically initialized to a small random value. Each input arriving at a neuron is multiplied by the weight on its connection, and the products are added up to get the weighted sum. This weighted sum is passed to an activation function, which transforms it (a sigmoid, for instance, squashes it into a number between 0 and 1). The result is then passed on to the output layer.

The nodes in the output layer are the categories. If the problem is to classify images as cats or dogs, we need two nodes for the 2 categories; if we were to include lizards as well, we would need 3 nodes.
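
The weighted-sum-then-activation step can be sketched in plain Python for the 3-5-2 network described above. This is only an illustration: the helper `dense_forward`, the sample and all weight values are made up, sigmoid is used in both layers for simplicity, and the bias term that real Dense layers add is left out.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense_forward(inputs, weights, activation):
    """One fully connected layer: a weighted sum per neuron, then the activation.

    weights[j][i] is the weight on the connection from input i to neuron j.
    """
    outputs = []
    for neuron_weights in weights:
        weighted_sum = sum(w * x for w, x in zip(neuron_weights, inputs))
        outputs.append(activation(weighted_sum))
    return outputs


# Made-up values: one sample with 3 features.
sample = [0.5, 0.1, 0.9]
# 5 hidden neurons, each with 3 incoming weights.
hidden_weights = [
    [0.2, 0.8, 0.5],
    [0.9, 0.1, 0.3],
    [0.4, 0.4, 0.4],
    [0.7, 0.2, 0.6],
    [0.1, 0.5, 0.9],
]
# 2 output neurons, each with 5 incoming weights.
output_weights = [
    [0.3, 0.6, 0.1, 0.8, 0.5],
    [0.7, 0.2, 0.9, 0.4, 0.1],
]

hidden = dense_forward(sample, hidden_weights, sigmoid)
output = dense_forward(hidden, output_weights, sigmoid)
print(len(hidden), len(output))  # 5 2
```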

In [2]:
from keras.models import Sequential
from keras.layers import Dense, Activation
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")

In [4]:
model=Sequential([Dense(5, input_shape=(3,), activation='relu'), Dense(2, activation='softmax')])
#The model definition starts at the hidden layer. We pass the shape of the input layer
#to the first layer of the model, because the model needs to know the shape of the data
#it will initially receive. The input shapes of the later layers are inferred.

# Activation Functions in Neural Network

In an ANN, the activation function of a neuron defines the output of that neuron given a set of inputs.

Take the sigmoid activation function: if the input is a very negative number, it transforms it to a value close to 0; if the input is a very positive number, it transforms it to a value close to 1; and if the input is close to 0, it transforms it to a value near 0.5, between the two extremes.

Why do we need an activation function? The idea is biologically inspired by the activity in our brains, where different neurons are fired, or activated, by different stimuli. When we smell freshly baked cookies, certain neurons in the brain fire; when we smell something unpleasant, other neurons may fire. So at any moment a neuron is either firing or not. With sigmoid, the closer the value is to 1, the more activated the neuron is, and the closer it is to 0, the less activated it is.

Not every activation function produces values between 0 and 1. One of the most widely used activation functions, relu (Rectified Linear Unit), transforms the input x to max(0, x). The greater the value, the more activated the neuron is.
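
For reference, both activation functions mentioned above fit in a few lines of plain Python. This is a sketch of the mathematical definitions, not of how Keras implements them internally.

```python
import math


def sigmoid(x):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Return x for positive inputs and 0 otherwise, that is max(0, x)."""
    return max(0.0, x)


# A very negative, a zero and a very positive input.
for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), relu(x))
```

Sigmoid stays inside (0, 1) for all three inputs, while relu leaves the positive input unchanged and clips the negative one to 0.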

Another way to add a layer and an activation function:

In [6]:
model=Sequential()
model.add(Dense(5,input_shape=(3,)))
model.add(Activation('relu'))

# Training

Training a model is an optimization problem: we are trying to optimize the weights within the model. During the process of learning, these weights change constantly as they move toward their optimal values.

How the weights are optimized depends on the optimizer we use. The most common optimizer is Stochastic Gradient Descent (SGD). Every optimizer has an objective, and SGD's objective is to minimize the loss function, that is, to assign weights such that the loss is as close to 0 as possible. There are different types of loss function; mean squared error is one example.

What is the loss, exactly? Suppose we pass an image to the model to classify it as a dog or a cat. When predicting, the model assigns probabilities to the image being a cat or a dog. The loss is the error between what the model predicts and what the label actually is.

So the data is passed through the model repeatedly, the weights are adjusted each time, and the model learns.

# How a neural network learns?

A single pass of the whole training set through the model is called an epoch.
We pass the data for multiple epochs until the model learns to predict accurately.

When the model first receives data, its weights are set to some initial values, and an output is generated at the end of the network. The loss of that output is then calculated with respect to the actual label. At this point the model calculates the gradient of the loss function with respect to each weight [gradient is just another word for the derivative of a function with respect to its variables]. The gradient is then multiplied by the learning rate [typically a small value between 0.01 and 0.0001]. The resulting value is subtracted from the current weight to give the updated weight.

The weights are updated in this way throughout training (in Keras, once per batch of samples) while SGD works to minimize the loss. The weights move slowly towards their optimized values.

This incremental updating of the weights towards their optimal values is what we mean when we say the model is learning.
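
The update rule can be tried out on a model with a single weight. The sketch below is an assumption-laden toy: the data, the one-weight model y = w * x and the helper names are invented for illustration, and it uses the whole data set for every update (plain gradient descent) rather than the random batches SGD would use.

```python
# Toy data generated by y = 2 * x, so the ideal weight is 2.0.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def mse_loss(w):
    """Mean squared error of the one-weight model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(w):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.0               # arbitrary starting weight
learning_rate = 0.01  # step size

for epoch in range(200):
    # New weight = old weight minus (learning rate * gradient).
    w = w - learning_rate * mse_gradient(w)

print(w, mse_loss(w))
```

After the loop the weight has moved from 0.0 to very nearly 2.0, and the loss has fallen from 30.0 to almost 0.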

<pre>
import keras
from keras import backend as k
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import Adam
from keras.metrics import categorical_crossentropy
#basic imports

model=Sequential([
    Dense(16, input_shape=(1,), activation='relu'),
    Dense(32, activation='relu'),
    Dense(2,activation='softmax')
])
#define a model with 2 hidden layers and 1 output layer

model.compile(Adam(learning_rate=0.0001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
#when compiling the model we pass the optimizer; Adam is a variant of SGD, used here with a learning
#rate of 0.0001 (older Keras versions spell this argument lr)
#the loss function is sparse_categorical_crossentropy, and metrics lists what we want reported
#during training

model.fit(scaled_train_samples,train_labels,batch_size=10,epochs=20, shuffle=True, verbose=2)
#this function trains the model; scaled_train_samples is the training data (assumed to be prepared
#beforehand), with the matching labels in train_labels; batch_size is how many samples are sent
#through the model at a time; epochs is how many times the whole data set is passed; shuffle says
#whether to shuffle the data before every epoch; verbose controls how much output is shown
</pre>

# Loss in neural network

As discussed earlier, the loss is calculated from each input's prediction and its label. For example, take a model whose objective is to classify images as dog or cat, with 0 being the label for cat and 1 the label for dog. Suppose that for a given cat image the model outputs 0.25; the error is 0.25 - 0 = 0.25.

In the same way, the model calculates the error for every input, and these errors are passed to a loss function.

Example - mean squared error: with this loss function we square every error, sum the squares, and divide by the number of samples to get their average.

The value of the loss function is reported after every epoch, and it should decrease over multiple epochs, since minimizing it is the objective of SGD as it updates the weights.
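
Mean squared error as described above can be written directly. The labels follow the cat = 0, dog = 1 convention from the text; the predictions are made-up numbers chosen for illustration.

```python
def mean_squared_error(labels, predictions):
    """Average of the squared differences between predictions and labels."""
    squared_errors = [(p - y) ** 2 for y, p in zip(labels, predictions)]
    return sum(squared_errors) / len(squared_errors)


labels = [0, 1, 1, 0]                  # 0 = cat, 1 = dog
predictions = [0.25, 0.75, 0.5, 0.0]   # made-up model outputs

print(mean_squared_error(labels, predictions))  # 0.09375
```

The four squared errors are 0.0625, 0.0625, 0.25 and 0, which sum to 0.375 and average to 0.09375.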

# Learning Rate in neural network

<pre>
Earlier we saw the general idea: after the loss function is calculated, the gradient of the loss function is calculated with respect to each weight. The gradient is then multiplied by the learning rate.

We start with arbitrary weights, then incrementally update them to move closer and closer to their optimized values as SGD focuses on minimizing the loss. The size of each step towards the optimized weights depends on the learning rate, so the learning rate can be thought of as the step size. It typically lies between 0.01 and 0.0001, but the actual value may vary.

At each update, the gradient is calculated and multiplied by the learning rate, and this value is subtracted from the current weight to get the new, updated weight.

Choosing a learning rate requires testing, as it is one of the hyperparameters that needs to be tested and tuned for each model. As a starting point it can be set between 0.01 and 0.0001.

If we choose a learning rate at the higher end of this range, we risk overshooting: we take steps so large in the direction of the minimum that we shoot past it and miss it. On the other hand, if we choose a learning rate at the lower end, it will take longer to reach the minimum loss and the optimized weights.
</pre>
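
The effect of the step size can be seen on a toy loss with a single weight. Everything here is assumed for illustration: the loss (w - 3)^2, the helper names and the three learning rates, which are picked to show the behaviour on this particular function and are not recommendations for a real network.

```python
def loss(w):
    """A made-up loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def gradient(w):
    """Derivative of the loss with respect to w."""
    return 2.0 * (w - 3.0)


def run_gradient_descent(learning_rate, steps=20, start=0.0):
    """Apply the update rule `steps` times and return the final weight."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w


# Too small, reasonable, and too large for this particular loss.
for learning_rate in (0.001, 0.1, 1.1):
    w = run_gradient_descent(learning_rate)
    print(learning_rate, w, loss(w))
```

With the smallest rate the weight has barely moved after 20 steps, with the middle rate it is close to 3, and with the largest rate each step overshoots the minimum by more than the last, so the loss grows instead of shrinking.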

# Train, Validation and Test sets

To train a model, the data is broken into 3 parts: training data, validation data and test data. During each epoch the model produces outputs for the training data and learns from them. At the same time, the model also produces outputs for the data in the validation set. The model is not trained on this data, and the weights are not updated on the basis of it. The main use of validation data is to check that the model neither overfits nor underfits the data.

Overfitting - the model becomes really good at classifying the training data, but it is not as good at classifying the validation data.

The test set is the data we use to test the model after training, by having it predict the labels.

The difference between the test data and the other two sets is that the test data is passed to the model without labels, while the other two sets are labelled.
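
A three-way split can be sketched with the standard library alone. The helper `split_dataset` and the 80/10/10 proportions are assumptions made for this example; in practice the fractions depend on how much data is available.

```python
import random


def split_dataset(samples, validation_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle the samples and cut them into train, validation and test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_validation = int(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


samples = list(range(100))  # stand-ins for 100 real samples
train, validation, test = split_dataset(samples)
print(len(train), len(validation), len(test))  # 80 10 10
```

Shuffling before cutting matters when the original data is ordered, for example all cats first and all dogs last.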

In the model.fit function, the training samples must be either a NumPy array or a list of NumPy arrays, and the training labels must be a NumPy array.

We don't have to define a validation set explicitly: we can set the validation_split parameter to a fraction, which tells Keras to hold out that fraction of the training data as validation data.

We can also create an explicit validation set separate from the training data. To do that, we pass the data in the validation_data parameter, which Keras expects to be a tuple of the form (samples, labels).

The test set has the same format as the training samples, and we use it when we call model.predict.

# Predicting with the neural network

If we are happy with the metrics of the model, we pass it the test data for prediction.
These predictions are based on what the model has learned so far.

<strong>predictions=model.predict(scaled_test_samples,batch_size=10, verbose=0)</strong>

Each row of predictions holds the probabilities the model assigns to each class for one sample.
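
To turn those probabilities into class labels, we can take the index of the largest value in each row. The numbers below are made up, and plain lists stand in for the NumPy array that model.predict returns.

```python
# Made-up output in the shape model.predict returns for a 2-class softmax:
# one row per sample, one probability per class.
predictions = [
    [0.92, 0.08],
    [0.30, 0.70],
    [0.55, 0.45],
]

# The predicted class is the index of the largest probability in each row.
predicted_classes = [row.index(max(row)) for row in predictions]
print(predicted_classes)  # [0, 1, 0]
```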

# Overfitting in neural network

<pre>
When the model is good at predicting the data in the training set but does not perform well on data it wasn't trained on, the model is said to be overfitting.

How do we know? From the metrics: when the validation metrics or the test metrics are clearly worse than the training metrics, the model is unable to generalize.

How to reduce it?
1) The easiest way is to add more data. The more data the model has, the better it will learn; more data also brings more diversity, which makes the model less likely to overfit.

2) Data augmentation - the process of creating additional data by reasonably modifying the existing data. For an image classifier, that could be rotating, flipping or zooming the images to add more data to the data set.

3) Reducing the complexity of the model, by reducing the number of layers or neurons, so that the model is able to generalize better.

4) Dropout - when added to our model, it randomly drops a subset of nodes from a layer during training. This
prevents the model from relying too heavily on particular nodes, making it generalize better.
</pre>
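
Dropout, the fourth item above, can be sketched as a random mask over a layer's outputs. The function below is a hypothetical illustration and not the Keras layer itself; it uses the common "inverted dropout" variant, in which the surviving values are scaled up so the layer's expected total stays the same.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the survivors.

    Rescaling by 1 / (1 - rate) keeps the expected sum of the layer unchanged.
    """
    keep_probability = 1.0 - rate
    return [
        a / keep_probability if rng.random() < keep_probability else 0.0
        for a in activations
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable
layer_output = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
print(dropout(layer_output, rate=0.5, rng=rng))
```

Each call drops a different random subset, so no single node can be relied on during training. At prediction time dropout is switched off and all nodes take part.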

# Underfitting in neural network

<pre>
When the model is not even able to predict the data it was trained on, the model is said to be underfitting. We can tell when the metrics for the training set are poor.

How to reduce it?

1) Increase the complexity of the model - increase the number of layers or the number of neurons, or change the types of layers we are using.

2) Add more features to the input samples in our training set, if we can. Example - suppose we want to predict stock prices based on the closing prices of the last 3 days, so the initial features are close1, close2 and close3. If we add more features, such as the opening price and the volume, the model might learn to predict better.

3) Reduce dropout (a regularization technique). Dropout is only applied during training, not during validation. So if we see that the model works well for the validation set but not for the training set, this is a good indication that we need to reduce the dropout.
</pre>

# Supervised Learning

Supervised learning occurs when the data in the training set is labelled. After each prediction, the loss is calculated against the true (encoded) label.

In [3]:
import keras
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import Adam

Using TensorFlow backend.


In [4]:
model=Sequential([Dense(16,input_shape=(2,),activation="relu"),
                 Dense(32, activation="relu"),
                 Dense(2,activation="softmax")])

In [5]:
#older Keras versions spell the learning_rate argument lr
model.compile(Adam(learning_rate=0.0001),loss="sparse_categorical_crossentropy",metrics=['accuracy'])

In [16]:
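import numpy as np
#each sample is [weight, height]; Keras expects NumPy arrays
train_sample=np.array([[150,67],[130,60],[200,65],[125,52],[230,72],[181,70]])

In [17]:
#male=0
#female=1
train_label=np.array([1,1,0,1,0,0])

#fit takes the samples as x and the labels as y
model.fit(x=train_sample,y=train_label,batch_size=3,epochs=10,shuffle=True,verbose=2)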

# Unsupervised Learning

The data in the training set is not labelled.

But if labels are not available, how is the model evaluated? Since the model has no labels to compare against, there is no point in calculating accuracy. Accuracy is not a metric in unsupervised learning.

Given unlabelled data, the model attempts to learn some type of structure from the data and to extract features from it. Essentially, the model learns to map given inputs to particular outputs based on what it has learned about the structure of the data.

One area of unsupervised learning is clustering. Suppose we are given the heights and weights of people without any labels. The model will try to group them into clusters. If we plot the data, it may fall into 2 groups, and each group can then be given a label.
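
As one possible illustration of clustering, here is a small k-means sketch in plain Python. K-means is just one clustering algorithm among many and is chosen here as an assumption; the helper names and the choice of starting centroids are likewise made up for the example.

```python
def mean_point(points):
    """Coordinate-wise average of a non-empty list of (weight, height) points."""
    return tuple(sum(values) / len(points) for values in zip(*points))


def squared_distance(a, b):
    """Squared straight-line distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and moving
    each centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [squared_distance(point, c) for c in centroids]
            clusters[distances.index(min(distances))].append(point)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            mean_point(cluster) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return clusters, centroids


# Weight and height pairs from the supervised example, with the labels removed.
people = [(150, 67), (130, 60), (200, 65), (125, 52), (230, 72), (181, 70)]
# Starting centroids: simply the first two points (an arbitrary choice).
clusters, centroids = k_means(people, centroids=[people[0], people[1]])
print(clusters)
```

On this tiny data set the two clusters separate the three heavier people from the three lighter ones, even though no labels were given.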

Another example of unsupervised learning is the autoencoder. An autoencoder is an ANN that takes an input and outputs a reconstruction of that input. Example - given images of handwritten text (digits in our case), the goal is to output a reconstructed image that is as close as possible to the input image. Since this is a neural network, a loss is calculated; it can be defined as the difference between the original image and the output image. This loss has to be minimized, as that is the objective of the optimizer (SGD).

An application of autoencoders is denoising images, where we try to extract the meaningful data from an image that contains noise.

# Semi Supervised Learning

Semi-supervised learning uses a combination of supervised and unsupervised learning. Suppose you have a large data set of which only some is labelled. Instead of manually labelling the rest of the data, we can train a model on the labelled data that is available. Once the model is trained, we use it to predict labels for the unlabelled data by passing that data through the model. Now that the previously unlabelled data has labels, we can train the model on the full data set. This is how semi-supervised learning is achieved.

# Data Augmentation

Data augmentation is the process of creating more data by reasonably modifying the data available to us, for example by flipping, zooming, rotating, changing the color of, or cropping images.

If we have only a small amount of data in our data set, we can make the model more robust by adding modified copies of the originals. This can also reduce overfitting, since it adds more data to the data set. For example, suppose the data set contains only images of dogs facing right. If the deployed model comes across images of dogs facing left, it might not be able to classify them, so flipping is a reasonable modification for increasing the data.
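
A horizontal flip is easy to sketch when an image is stored as a list of pixel rows. The tiny image and the helper `flip_horizontal` are made up for illustration; real pipelines would normally use an image library for this.

```python
def flip_horizontal(image):
    """Mirror an image, stored as a list of pixel rows, left to right."""
    return [list(reversed(row)) for row in image]


# A made-up 3x4 grayscale image whose bright pixels sit on the right.
image = [
    [0, 0, 128, 255],
    [0, 0, 128, 255],
    [0, 0, 128, 255],
]

# The augmented data set holds the original plus its mirrored copy.
augmented_dataset = [image, flip_horizontal(image)]
print(flip_horizontal(image)[0])  # [255, 128, 0, 0]
```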

# One Hot Encoding

In Keras, labels for images are often one-hot encoded vectors. When we train our model using labelled images of dogs and cats, the model does not interpret these labels as words, and the output the model predicts is not in the form of words either. So the labels are encoded as numbers.

One method of encoding the labels of categorical data is one-hot encoding. It transforms the labels into vectors of 0s and 1s, where the length of each vector is equal to the number of categories. Each index in the vector is associated with a particular category.

With each category having its own place in the vector, the intuition behind the name one-hot is simple: all the elements of the vector are 0 except the one for the actual category, which is 1.
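
The encoding can be written in one line of plain Python. The helper `one_hot_encode` and the category order are assumptions made for this example.

```python
def one_hot_encode(label, categories):
    """Return a vector of 0s with a single 1 at the index of `label`."""
    return [1 if category == label else 0 for category in categories]


categories = ["cat", "dog", "lizard"]  # index 0, 1, 2
for label in categories:
    print(label, one_hot_encode(label, categories))
# cat [1, 0, 0]
# dog [0, 1, 0]
# lizard [0, 0, 1]
```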

# Convolutional Neural Network(s) (CNN or ConvNet)

CNNs are most widely used for image analysis and classification. Think of a CNN as an ANN with some specialization for picking out, or detecting, patterns and making sense of them. This pattern detection is what makes CNNs useful for image analysis.

A CNN has hidden layers called convolutional layers. It has other layers too, but the convolutional layer is the basis of a CNN.

What do convolutional layers do? Just like other layers, a convolutional layer transforms its input and passes the transformed output to the next layer. This transformation is called the convolution operation, and it is what lets these layers detect patterns.

For each layer we need to specify the filters, and it is these filters that actually detect the patterns. By pattern we mean the things that make up a single image: edges, shapes, corners, textures, objects and so on. Each filter detects a particular pattern, so there may be a filter for edge detection, a filter for square shapes, a filter for circular shapes, etc. These basic geometric filters appear at the beginning of the network. The deeper we go, the more sophisticated the filters become, so later filters may be able to detect objects such as ears, eyes or skin.

Example - we have a CNN and we pass images of handwritten digits to the network. As we know, each convolutional layer needs its filters specified. A filter can be thought of as a small matrix for which we decide the number of rows and columns. The values within the matrix are initially set to random numbers. Suppose we choose to have 1 filter of size 3*3 in the convolutional layer. The filter slides over each 3*3 block of pixels in the input image until it has covered every possible 3*3 block. This sliding is referred to as convolving.

At each position, we take the dot product of the filter matrix with the 3*3 block of image pixels it covers. This occurs for every 3*3 block, and we store these dot products.

After the filter has been convolved with the input image, we have a new representation of the image. This new matrix is passed to the next layer.
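
The sliding and dot product steps can be sketched in plain Python. This is an illustration under stated assumptions: the 4x4 image is made up, and the filter values are hand-picked to respond to vertical edges, whereas in a real CNN they start out random and are learned during training.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over every position where it fits entirely inside
    `image`, storing the dot product computed at each position."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    output = []
    for row in range(out_h):
        output_row = []
        for col in range(out_w):
            dot_product = sum(
                kernel[i][j] * image[row + i][col + j]
                for i in range(kernel_h)
                for j in range(kernel_w)
            )
            output_row.append(dot_product)
        output.append(output_row)
    return output


# Made-up 4x4 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked 3x3 filter that responds to vertical edges (dark to bright).
vertical_edge_filter = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

print(convolve2d(image, vertical_edge_filter))  # [[3, 3], [3, 3]]
```

A 3*3 filter fits into a 4*4 image at only 2*2 positions, which is why the new representation is smaller than the input.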


# Visualizing Convolutional Filters from CNN