# 1 DL Fundamentals

Unlike classical ML, DL does not require a separate feature-extraction step: the network learns the features itself. It also works with non-tabular (unstructured) data.

**What is it**: DL automates the process of extracting useful patterns from data via neural networks and optimization.

Deep Learning is a subfield of machine learning that uses artificial neural networks to learn from large amounts of data. It involves training complex models with multiple layers, allowing them to extract high-level features from raw input data.

•	**ANN**: ANN stands for **Artificial Neural Network**, a computational model inspired by the structure and function of the human brain. An ANN recognizes patterns in data, makes predictions, and learns from past examples by adjusting the weights of the connections between artificial neurons. The main terms are: neural networks, activation functions, cost functions, gradient descent, and backpropagation. ANNs are typically used with tabular data and are also called multilayer perceptron (MLP) networks.

•	**CNN**: CNN stands for **Convolutional Neural Network**, a specialized type of ANN that is commonly used for image recognition and computer vision tasks. Its structure lets it identify and classify objects within images by extracting spatial features with convolutional layers. It also works with non-tabular, unstructured data such as images, text, and documents.


•	**RNN**: RNN stands for **Recurrent Neural Network**, which is a type of ANN that is designed to process sequential data such as time-series data, natural language, and speech. RNN uses loops to allow information to persist, making it capable of recognizing patterns in sequential data and making predictions based on previous inputs.
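The "loop" in an RNN can be illustrated with a minimal scalar sketch. The update rule `h_t = tanh(w_x * x_t + w_h * h_prev + b)` is the common textbook form, and the weights `w_x`, `w_h`, `b` and the input sequence are hypothetical values chosen only for illustration; none of them come from the notes above.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One recurrent step: mix the current input with the previous hidden state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Feed a sequence through the same step, carrying the hidden state forward."""
    h = 0.0  # initial hidden state (assumed to be zero)
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states

# Only the first input is non-zero, yet later states stay non-zero:
# information from the first step persists through the loop.
states = run_rnn([1.0, 0.0, 0.0])
```

Because the same weights are reused at every step, the hidden state acts as a memory of earlier inputs.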


## Basic concepts in Deep Learning

1. **Perceptron Model**

The idea behind the perceptron model is to convert a simplified biological neuron into a mathematical model. The perceptron dates back to 1958. The picture below shows how a simplified biological neuron maps to a perceptron model.

![image.png](attachment:269d3b25-4964-4ac7-a69e-12d67d1c6b44.png)

2. **The Function in the Perceptron Model**

There is a function in the perceptron model that does the calculations in the model. 

![image.png](attachment:cbab4660-3567-4238-9e2b-88dc8e42c211.png)

- y = (x1*w1) + (x2*w2) + b
- where:
- x1, x2 : inputs,
- w1, w2 : weights,
- b : bias. (A bias is added to each neuron. It is an extra learnable value that makes learning easier, and it plays the role that the intercept plays in parametric algorithms.)
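The perceptron function can be sketched in a few lines of plain Python. This sketch assumes one bias per neuron, as the note on `b` describes; the function name `perceptron_output` and the example numbers are hypothetical.

```python
def perceptron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a single bias term for the neuron."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

# Hypothetical example values, chosen only for illustration.
y = perceptron_output(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.1)
```

Here `y` is `1.0*0.5 + 2.0*(-0.25) + 0.1`, so the two weighted inputs cancel and only the bias remains.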


## Neural Networks

The key terms are:

- Neuron: a single perceptron model
- Layers: input layer, hidden layer, output layer
- Deep neural network

1. **Neuron**

A neuron is exactly a single perceptron model. It is the most fundamental unit of a neural network, so we refer to single perceptrons as neurons. Each circle in a neural network diagram represents a neuron.

2. **Layers**: Every neural network consists of three fundamental layers:

- Input Layer: receives the information
- Hidden Layer: where the computations are done
- Output Layer: where the final results of the computations show up

![image.png](attachment:7ff0882e-a2da-4ca4-a51b-ad0272f01c9a.png)

3. **Deep Neural Network**

A neural network with a single hidden layer is simply a neural network; once it contains two or more hidden layers, it is called a deep neural network.

## Activation Functions

1. **What is an Activation Function**

A neuron (node) takes its inputs, multiplies them by their weights, adds the bias, and then produces an output using the activation function. In other words, each neuron has two tasks: (1) weight the inputs and add the bias; (2) produce an output with the activation function.

The activation function is what distinguishes a deep neural network from linear regression. Activation functions are the decision-making units in deep neural networks.

The activation function introduces non-linearity into the neural network so that it can learn more complex functions. Without it, the network could only learn a linear combination of its input data. The activation function translates a neuron's inputs into its output. **The activation function is responsible for deciding whether a neuron should be activated or not. The neuron computes the weighted sum of its inputs and adds the bias, and the activation function is applied to that result. Its basic purpose is to introduce non-linearity into the output of a neuron.**

Each neuron is characterized by its weight, bias, and activation function.  

In a neural network, the input is fed to the input layer, and the neurons perform a linear transformation on it using the weights and biases:

- (weight * input) + bias
- Then, an activation function is applied to the above result.
- y= activation((weight*input)+bias)

After that, the output from the activation function moves to the next hidden layer and the same process is repeated. 
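The flow described above (linear transformation, activation, then on to the next layer) can be sketched as follows. This is a minimal illustration in plain Python, assuming a sigmoid activation; the helper names `neuron` and `dense_layer` and all weight values are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic activation: squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    """y = activation((weight * input) + bias), summed over all inputs."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

def dense_layer(inputs, weight_rows, biases, activation):
    """A layer is a list of neurons that all read the same inputs."""
    return [neuron(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]

x = [1.0, 2.0]
# Hidden layer with two neurons (hypothetical weights).
hidden = dense_layer(x, [[0.5, -0.5], [0.25, 0.25]], [0.0, 0.0], sigmoid)
# The hidden outputs become the inputs of the next layer.
output = dense_layer(hidden, [[1.0, -1.0]], [0.0], sigmoid)
```

Each additional hidden layer repeats the same call, taking the previous layer's outputs as its inputs.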

![image.png](attachment:a21915b3-6f86-4f63-a50a-a64f7087671f.png)

2. **Types of Activation Functions**

There are several types of activation functions. We need to choose the appropriate activation function while creating a deep learning model. 

There are three main families of activation functions:

- **Binary step function**: maps negative inputs to 0 and positive inputs to 1, so it produces a binary output. It cannot handle multi-class classification.
- **Linear activation function**: used for regression problems, where the output can take any value between negative and positive infinity. It passes its input through unchanged, which is why it is also called "no activation function": the neuron applies the weights and bias and outputs the result as is. It is not effective with backpropagation, because its derivative is a constant: the activation contributes only a fixed scale factor to the gradient. A stack of layers with linear activations is also equivalent to a single linear layer, so all the hidden layers collapse into what is effectively one layer.
- **Non-linear activation functions**: these allow backpropagation. Because they are differentiable and their derivatives depend on the input, the weights can be updated meaningfully. They also make it useful to stack multiple hidden layers. Modern neural network models use non-linear activation functions. **The most common types of non-linear activation functions are:**

1. **Sigmoid (logistic)**: squashes values into the range 0 to 1 and so normalizes the output of each neuron. Pros: smooth gradient, clear predictions. Cons: vanishing gradient, computationally expensive, and outputs are not zero-centered. TanH addresses some of these disadvantages, notably the lack of zero-centered outputs.
2. **TanH/Hyperbolic Tangent**: produces values between -1 and +1. It can therefore be used when the data contains negative values that matter and the task is not binary classification. Pros: smooth gradient, more efficient than sigmoid, zero-centered outputs. Cons: vanishing gradient (which prevents the weights from being updated) and computationally expensive. It is used in hidden layers in NLP models.
3. **ReLU (Rectified Linear Unit)**: passes positive values through unchanged and sets negative values to zero, which is what makes it non-linear. Pros: non-linear, computationally efficient. Cons: the dying ReLU problem. When a neuron's input is zero or negative, the gradient of the function is zero, so that neuron receives no weight updates through backpropagation and cannot learn; a neuron stuck in this region is effectively dead. LeakyReLU was developed to overcome this.
4. **LeakyReLU**: Pros: prevents the dying ReLU problem. It has a small positive slope in the negative region, so it enables backpropagation even for negative input values. Otherwise it behaves like ReLU. Cons: results are not consistent: Leaky ReLU does not provide consistent predictions for negative input values, and its benefit over ReLU varies from one problem to another, which makes it less dependable. For this reason it is used less often in hidden layers.
5. **Softmax**: Pros: able to handle multiple classes, whereas the other activation functions handle only one class. Softmax normalizes the outputs for each class to values between 0 and 1 that sum to 1. It is also useful for output neurons: it is generally used in the output layer for multi-class classification.
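The five functions listed above can be written in a few lines of plain Python. This is a minimal sketch using their standard textbook definitions; the LeakyReLU slope `alpha=0.01` is a commonly used default and an assumption here, not a value taken from these notes.

```python
import math

def sigmoid(z):
    """Output in (0, 1); written in two branches to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def tanh(z):
    """Output in (-1, 1), centered on zero."""
    return math.tanh(z)

def relu(z):
    """Passes positive values through, sets negative values to zero."""
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but keeps a small slope (alpha) for negative inputs."""
    return z if z > 0 else alpha * z

def softmax(zs):
    """Turns a list of scores into probabilities that sum to 1."""
    m = max(zs)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
```

Note that `softmax` takes the scores of a whole layer at once, while the other four act on one value at a time.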

💡**Tips**:

All these activation functions are non-linear. Modern neural network models use non-linear activation functions.
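To see why non-linearity matters, the sketch below checks that two stacked layers with a linear ("no") activation behave exactly like one linear layer. It uses one scalar neuron per layer for simplicity; the weights `w1`, `b1`, `w2`, `b2` are hypothetical values.

```python
def linear_layer(x, w, b):
    """A neuron with a linear activation: the weighted input plus bias, unchanged."""
    return w * x + b

# Hypothetical weights for two stacked linear layers.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_linear_layers(x):
    return linear_layer(linear_layer(x, w1, b1), w2, b2)

# Expanding w2 * (w1 * x + b1) + b2 gives a single equivalent layer.
w_eq = w2 * w1
b_eq = w2 * b1 + b2

def single_linear_layer(x):
    return linear_layer(x, w_eq, b_eq)

differences = [abs(two_linear_layers(x) - single_linear_layer(x))
               for x in [-2.0, 0.0, 1.5]]
```

Every entry in `differences` is zero (up to floating-point rounding), so the extra layer adds no expressive power unless a non-linear activation sits between the layers.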

![image.png](attachment:bf482150-fe58-4e79-9dd2-445068b71ca8.png)

## Multi-Class Classification

It is important to choose the appropriate activation function for multi-class classification problems.

1. **Binary Classification**

In a binary classification problem, there is a single node at the output layer. This node outputs a choice between two possible classes, such as whether:

- A given email is spam or not spam.
- A given tumor is malignant or benign.

2. **Multi-Class Classification**

In multi-class classification, there are more than two classes. For example:
- Is this picture a dog, a cat, a horse, or a donkey?
- Is this flower a rose, a daisy, an acacia, or a lilac?
- Is that plane a Boeing 777, Airbus 320, Boeing 747, or Embraer 190?

Thus, there are multiple nodes in the output layer.

There are two types of multi classes:
- **Mutually Exclusive Classes** (for example, based on its features an iris is either versicolor, setosa, or virginica)
- **Non-Exclusive Classes** (for example, based on the inputs a patient may have the flu, a cold, or both)

![image.png](attachment:ecea8445-365a-4916-b102-1486ea1e7ab0.png)

In Mutually Exclusive Classes, each data point has a single class: it can be green, red, or blue, but not blue and red at the same time. In Non-Exclusive Classes, however, each data point might have multiple classes. For example, data point 1 has classes "A" and "B", and data point 3 has classes "C" and "B".

We should use the Softmax function for Mutually Exclusive Classes and the Sigmoid function for Non-Exclusive Classes.
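The difference between the two choices can be sketched with a three-node output layer. The raw scores below are hypothetical, and the 0.5 decision threshold for the sigmoid outputs is a common convention assumed here, not something stated in these notes.

```python
import math

def softmax(zs):
    """Probabilities over mutually exclusive classes (they sum to 1)."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Independent probability for one class."""
    return 1.0 / (1.0 + math.exp(-z))

# Mutually exclusive: pick the single most probable class.
exclusive_probs = softmax([2.0, 1.0, 0.1])
predicted_class = exclusive_probs.index(max(exclusive_probs))

# Non-exclusive: each class is decided on its own, so several can be true.
non_exclusive_probs = [sigmoid(z) for z in [2.0, -1.0, 0.5]]
predicted_labels = [p >= 0.5 for p in non_exclusive_probs]
```

With softmax the classes compete for a total probability of 1; with sigmoid each output node answers its own yes/no question, so `predicted_labels` can contain more than one `True`.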


## Key Terminology (continued)

1. **Cost Function**

In Deep Learning, the cost function (also referred to as the loss function) is **used to measure how badly the model is performing**. In other words, a cost function is a measure of how wrong the model is in its ability to predict the result. It is essentially the difference between the predicted value and the actual value.

Thus, the objective of a Deep Learning model is **to find the weights and biases that minimize the cost function**. In other words, models learn by minimizing the cost function.
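As a concrete illustration, here are two widely used cost functions in plain Python: mean squared error (common for regression) and binary cross-entropy (common for binary classification). The notes above do not name a specific cost function, so these two are assumptions chosen as typical examples; the `eps` clipping value is likewise an illustrative choice.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Penalizes confident wrong probabilities; labels are 0 or 1."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical predictions: the second set is further from the labels.
labels = [1, 0, 1]
good_cost = binary_cross_entropy(labels, [0.9, 0.1, 0.8])
bad_cost = binary_cross_entropy(labels, [0.4, 0.6, 0.3])
```

In both functions a lower value means the predictions are closer to the actual values, which is what training tries to achieve.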

![image.png](attachment:e2c45483-169e-483f-a257-40b64d20da75.png)

![image.png](attachment:4f1d552a-90c6-4838-bf2c-36adf7abc981.png)

2. **Gradient Descent**

Gradient descent is **an optimization algorithm used to minimize the cost function**. It is best used when the parameters cannot be calculated analytically and must be searched for by an optimization algorithm. The name itself hints at the meaning: according to the Cambridge Dictionary, "gradient" means "how steep a slope is" and "descent" means "a movement down".
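A minimal sketch of gradient descent on a one-parameter cost function follows. The cost `(w - 3)^2`, its minimum at `w = 3`, the starting point, and the step settings are all hypothetical choices made for illustration.

```python
def cost(w):
    """A toy cost function with its minimum at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Derivative of the cost: how steep the slope is at w."""
    return 2.0 * (w - 3.0)

def gradient_descent(start, learning_rate, steps):
    """Repeatedly move the parameter a small step downhill."""
    w = start
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w

w_final = gradient_descent(start=0.0, learning_rate=0.1, steps=100)
```

Each step moves `w` against the sign of the gradient, so the cost shrinks until `w` settles near the minimum.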


3. **Learning Rate**

Imagine you are walking on top of a mountain and you are able to take huge steps. Your goal is to reach the river at the bottom. The cost function (the error) is at its maximum at the top of the mountain and at its minimum at river level.

The size of your steps can be considered the learning rate. If the steps are large enough, you reach the minimum of the cost function very quickly, but if they are too large you might step over the minimum. The learning rate is shown as the "incremental steps" in the picture under the Gradient Descent title.

**The learning rate is a hyperparameter in machine learning that determines the size of the steps taken in the gradient descent optimization algorithm when updating the model parameters during training. In simpler terms, the learning rate is a value that controls how quickly or slowly a machine learning model learns from the data it is being trained on. It determines how much the model's parameters are adjusted with each step of the optimization algorithm.**

**A high learning rate means that the model's parameters are updated quickly, which can lead to faster convergence during training, but may also cause the algorithm to overshoot the optimal solution and become unstable. On the other hand, a low learning rate means that the model's parameters are updated slowly, which can result in slower convergence during training, but may also lead to a more stable and accurate solution.**

The choice of learning rate is critical in the training process because it can have a significant impact on the performance of the model. **An appropriate learning rate can help the model converge quickly and accurately, while an inappropriate learning rate can result in poor performance or even failure to converge.**
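The effect of the learning rate can be sketched by running the same number of gradient descent steps with three different step sizes. The toy cost `(w - 3)^2` and the three learning rates are hypothetical values picked to show slow convergence, good convergence, and divergence.

```python
def gradient(w):
    """Derivative of the toy cost (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

def final_error(learning_rate, steps=20, start=0.0):
    """Distance from the minimum after a fixed number of update steps."""
    w = start
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return abs(w - 3.0)

too_small = final_error(0.01)  # stable, but still far from the minimum
reasonable = final_error(0.1)  # close to the minimum after the same steps
too_large = final_error(1.1)   # overshoots more on every step and diverges
```

With the largest value, every update jumps past the minimum by more than the previous one, so the error grows instead of shrinking.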

There are various techniques to choose an appropriate learning rate, such as using a fixed learning rate, adapting the learning rate during training, or using an adaptive learning rate method such as **Adam or Adagrad**. The choice of method depends on the specific problem being solved and the type of machine learning model being used.


4. **Adam Optimizer: Adaptive Moment Estimation**

Adam, AdaGrad, RMSProp, and SGDNesterov are examples of optimizers; each is a different way of implementing the Gradient Descent algorithm.

Adam is one of several optimization methods that implement Gradient Descent. It is among the most widely used and is often one of the most efficient in practice. The picture below compares Adam with other optimization methods.

![image.png](attachment:3917990a-5969-4ede-816f-59442bb96058.png)

**Adam (Adaptive Moment Estimation) is an optimization algorithm commonly used in machine learning to update the weights of a neural network during training.**

Adam combines the advantages of two other popular optimization ideas: momentum, as used with stochastic gradient descent (SGD), and root mean square propagation (RMSProp). It uses the first and second moments of the gradients to adaptively adjust the learning rate of each weight, which helps the algorithm converge faster and more reliably.

**The key advantages of Adam are that it has modest memory requirements (it keeps only two moving averages per weight) and is less sensitive to hyperparameters such as the learning rate**. This makes it a popular choice for training deep neural networks on large datasets. To summarize, Adam is a powerful optimization algorithm that uses a combination of adaptive learning rates and momentum to optimize the weights of a neural network during training, allowing for faster and more reliable convergence.
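The update rule can be sketched for a single weight. This is a minimal illustration of the standard Adam update (moving averages of the gradient and the squared gradient, with bias correction), not a production implementation; `beta1=0.9`, `beta2=0.999`, and `eps=1e-8` are the commonly cited defaults, while the learning rate, step count, and toy cost `(w - 3)^2` are hypothetical.

```python
import math

def adam_minimize(gradient, w, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimize a one-parameter function with the Adam update rule."""
    m = 0.0  # first moment: moving average of the gradient (momentum)
    v = 0.0  # second moment: moving average of the squared gradient
    for t in range(1, steps + 1):
        g = gradient(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        # The step is scaled per weight by the recent gradient magnitude.
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w

def toy_gradient(w):
    """Derivative of the toy cost (w - 3)^2."""
    return 2.0 * (w - 3.0)

w_adam = adam_minimize(toy_gradient, w=0.0)
```

The two state variables `m` and `v` are the "first and second moments" mentioned above; in a real network one pair is stored for every weight.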


5. **Backpropagation**

Backpropagation is a widely used algorithm for training artificial neural networks in machine learning. **It is a supervised learning algorithm that enables the network to learn from labeled data and adjust the weights and biases of the network to minimize the error between the predicted output and the true output.**

In simpler terms, backpropagation is a method for computing the gradients of the loss function with respect to the weights and biases of the neural network. These gradients are used to update the parameters of the network during training, with the aim of reducing the error of the network on the training data.

**The backpropagation algorithm works by propagating the error from the output layer back through the network, layer by layer, and calculating the contribution of each neuron to the error. This is done using the chain rule of calculus, which allows the gradients to be computed efficiently.**

Once the gradients are computed, they are used to update the weights and biases of the network using an optimization algorithm such as gradient descent. This process is repeated iteratively until the network converges to a set of weights and biases that minimize the error on the training data.

In summary, backpropagation is a powerful algorithm for training neural networks by computing the gradients of the loss function with respect to the weights and biases of the network. It allows the network to learn from labeled data and adjust its parameters to minimize the error on the training data, enabling it to make accurate predictions on new data.

**The objective of a Deep Learning model is to find the weights and biases that minimize the cost function (error), because the model learns by minimizing the error. Gradient Descent is the optimization algorithm used to minimize that error, and to apply it we need to move backward through the network to update the weights and biases. This whole process (going back and updating the weights and biases) is called Backpropagation.**
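The whole loop (forward pass, chain rule backward, weight update) can be sketched on the smallest possible network: one input, one sigmoid hidden neuron, and one linear output neuron, trained on a single data point with a squared-error cost. The architecture, starting weights, data point, and learning rate are all hypothetical; the numerical gradient is included only as a sanity check on the chain-rule result.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Forward pass: hidden activation h, then the linear output."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    return h, w2 * h + b2

def loss(params, x, y):
    _, y_pred = forward(params, x)
    return (y_pred - y) ** 2

def backward(params, x, y):
    """Chain rule from the output back to every weight and bias."""
    w1, b1, w2, b2 = params
    h, y_pred = forward(params, x)
    d_y = 2.0 * (y_pred - y)   # d loss / d y_pred
    d_w2, d_b2 = d_y * h, d_y  # output layer parameters
    d_h = d_y * w2             # error propagated back to the hidden neuron
    d_z = d_h * h * (1.0 - h)  # through the sigmoid: sigmoid'(z) = h * (1 - h)
    d_w1, d_b1 = d_z * x, d_z  # hidden layer parameters
    return [d_w1, d_b1, d_w2, d_b2]

def numerical_gradient(params, x, y, eps=1e-6):
    """Finite-difference estimate, used only to check backward()."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, x, y) - loss(down, x, y)) / (2 * eps))
    return grads

params = [0.5, -0.3, 0.8, 0.1]  # hypothetical starting w1, b1, w2, b2
x, y = 1.5, 1.0                 # a single hypothetical training example
analytic = backward(params, x, y)
numeric = numerical_gradient(params, x, y)

initial_loss = loss(params, x, y)
learning_rate = 0.1
for _ in range(200):            # gradient descent using the backprop gradients
    grads = backward(params, x, y)
    params = [p - learning_rate * g for p, g in zip(params, grads)]
final_loss = loss(params, x, y)
```

The values in `analytic` and `numeric` agree closely, and `final_loss` ends up far below `initial_loss`, which is the "going back and updating" cycle described above in its simplest form.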
