# <center>521160P Introduction to artificial intelligence<br><br>Exercise material<br><br>Reinforcement learning and deep learning<br></center>


Reinforcement learning and deep learning are areas of artificial intelligence that often solve different types of problems. They can also be combined, as was done when developing an artificial intelligence for Atari video games with the deep Q-learning algorithm [1]. However, this exercise material treats reinforcement learning and deep learning as separate areas.

## Reinforcement learning

Machine learning is divided into supervised learning, unsupervised learning, and reinforcement learning. The reinforcement learning approach differs from supervised learning, which is based on data and its known example outputs, and from unsupervised learning, which is based on the structure of the data. In reinforcement learning, learning takes place through trial and error in real time [2]. The actor, i.e. the **agent**, tries to find the best possible **action in each state** it can reach while operating in a certain **environment**. The actions selected by the agent are evaluated using positive and negative **rewards** (feedback), depending on whether the action led to the desired outcome. The reward values are stored in a state-action table by updating the value of the state-action pair in question with the update rule of the reinforcement learning method. Over time, the agent learns the **policy** (strategy) by which it achieves its goal with optimal actions. In Figure 1, the agent learns the best actions for the states while operating in a specific environment.

<br>
<div style="width:image width px; font-size:80%; text-align:center;">
    <center>
    <img src='imgs\reinforcement.png' width='550' height='auto' style='padding-bottom:0.5em;' />
    </center>
    <span>Figure 1. The agent acts in a specific environment, learning the best actions.</span>
</div>
<br>

Let’s take a simple example of reinforcement learning: traffic light control at an intersection. The traffic light control system acts as the agent. The agent receives information about the number of cars in the different lanes of the intersection, and based on this it must let the cars pass through the intersection smoothly. The numbers of cars in the lanes at different times are the states, and the decisions made by the control system, such as which lanes are shown a green light and for how long, are the actions. Rewards for the agent's actions in different situations are given, for example, based on the waiting time of the cars at the intersection. The environment of the example consists of all possible states, i.e. how many cars can accumulate in the different lanes of the intersection, and all possible actions that the agent can perform in different situations. In the beginning, the agent takes random actions and traffic at the intersection is congested. Over a long period of time, the agent learns to relieve congestion using the values in the state-action table learned from the rewards, and it is able to form policies for different situations.

As the agent learns a policy by trial and error, it uses what it has already learned (exploitation) to select actions that lead to good rewards. In addition, the agent can select lower-valued or completely new actions (exploration) in order to get a more comprehensive picture of the entire environment. Often, when training an agent, the two ways of selecting actions are alternated, for example by selecting the best action observed so far nine times out of ten and a random action one time out of ten (epsilon-greedy policy).

Different reinforcement learning methods take advantage of different features. A method can be model-free, in which case it does not model its environment during learning: it makes no assumptions about how states and actions depend on each other and learns only from the rewards it receives. Correspondingly, a model-based method learns a model of its environment. In this case, for example, when the agent takes action $A_{1}$ in state $S_{1}$ and receives a new state $S_{2}$ and reward $R_{1}$ from the environment, the model learns the probability that taking $A_{1}$ in $S_{1}$ leads to state $S_{2}$ with reward $R_{1}$.
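
One simple way to learn such a model is to count how often each outcome follows a state-action pair. The sketch below assumes this counting approach; the class `TransitionModel` and the observed experience are hypothetical and only illustrate the idea.

```python
from collections import Counter, defaultdict

class TransitionModel:
    """Estimates P(next_state, reward | state, action) from observed counts."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def observe(self, state, action, next_state, reward):
        # Record one experienced transition.
        self.counts[(state, action)][(next_state, reward)] += 1

    def probability(self, state, action, next_state, reward):
        outcomes = self.counts[(state, action)]
        total = sum(outcomes.values())
        return outcomes[(next_state, reward)] / total if total else 0.0

model = TransitionModel()
# Hypothetical experience: action "A1" in state "S1" was tried four times.
for next_state, reward in [("S2", 1), ("S2", 1), ("S2", 1), ("S3", 0)]:
    model.observe("S1", "A1", next_state, reward)

print(model.probability("S1", "A1", "S2", 1))  # 3 of 4 observations -> 0.75
```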

The Monte Carlo method is a model-free reinforcement learning method that must wait for feedback until the terminal state of an episode before it can calculate the new value of a state [3]. The terminal state can be, for example, the final move of one game of Blackjack or, in the intersection example, the moment when the cars have stopped and new feedback on the congestion can be given. The Monte Carlo method uses the update rule of Equation 1.

<br>
\begin{equation}
Q(S_{t}, A_{t}) \leftarrow Q(S_{t}, A_{t}) + \alpha \: (R_{t}-Q(S_{t}, A_{t}))\;, \tag{1}
\end{equation}
<br>

*where $Q(S_{t}, A_{t})$ is the Q-value of the state-action pair, $R_{t}$ is the reward received by the state-action pair, and $\alpha$ is the learning rate, a weighting factor that determines how strongly new information replaces old information. The learning rate takes values in $[0,1]$: at 0 no learning takes place at all, and at 1 the new reward completely replaces the previous Q-value.*

If the environment is deterministic, i.e. the same action in the same state always leads to the same outcome, or stationary, i.e. the probability distributions that govern its changes stay the same over time, the value $\alpha = \frac{1}{N(S_{t}, A_{t})}$ can be used as the learning rate, where $N(S_{t}, A_{t})$ is the number of times the state-action pair has been visited. With this choice, the Q-value is the average of all rewards received so far for that pair.
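
The effect of the learning rate $\alpha = \frac{1}{N(S_{t}, A_{t})}$ can be checked with a short sketch. The reward list is illustrative, and exact fractions are used only to avoid rounding noise; the printed Q-value and the running average agree at every step.

```python
from fractions import Fraction

def update(q, reward, alpha):
    """Equation 1: move Q towards the new reward by a fraction alpha."""
    return q + alpha * (reward - q)

rewards = [4, 3, 2, 1]  # illustrative rewards for one state-action pair
q = Fraction(0)
for n, reward in enumerate(rewards, start=1):
    q = update(q, reward, Fraction(1, n))  # alpha = 1 / N(S, A)
    mean_so_far = Fraction(sum(rewards[:n]), n)
    print(n, q, mean_so_far)
```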

Let’s take an example of how the Monte Carlo method works. For a state-action pair $(S, A)$, the received rewards are $R_{t}=[4,3,2,1]$ and the learning rate is $\alpha = \frac{1}{2}$. Using Equation 1 and starting from $Q(S, A)=0$, the Q-values of the state-action pair at different points in time can be calculated.

<br>
$\hspace{7.55cm} Q(S, A)=0$<br>
$R_{1}=4, \hspace{6cm}     Q(S, A)= 0 + \frac{1}{2}(4-0) = 2$<br>
$R_{2}=3, \hspace{6cm}     Q(S, A)= 2 + \frac{1}{2}(3-2) = \frac{5}{2}$<br>
$R_{3}=2, \hspace{6cm}     Q(S, A)= \frac{5}{2} + \frac{1}{2}(2-\frac{5}{2}) = \frac{9}{4}$<br>
$R_{4}=1, \hspace{6cm}     Q(S, A)= \frac{9}{4} + \frac{1}{2}(1-\frac{9}{4}) = \frac{13}{8}$<br>
<br>
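
The same calculation can be reproduced with a few lines of Python. This is only a sketch of Equation 1 applied to the numbers above; the function name `monte_carlo_update` is chosen for the example, and `Fraction` is used so that the results appear as the same fractions as in the hand calculation.

```python
from fractions import Fraction

def monte_carlo_update(q, reward, alpha):
    """Equation 1: Q <- Q + alpha * (R - Q)."""
    return q + alpha * (reward - q)

alpha = Fraction(1, 2)
q = Fraction(0)          # initial Q-value
history = []
for reward in [4, 3, 2, 1]:
    q = monte_carlo_update(q, reward, alpha)
    history.append(q)

print([str(value) for value in history])  # ['2', '5/2', '9/4', '13/8']
```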

Another widely used reinforcement learning method, Q-learning, is also a model-free method. In it, the Q-value of a state-action pair is updated in the state-action table immediately after the action has been taken in the state, so there is no need to wait for feedback until the end of the whole episode. Q-learning uses the update rule of Equation 2.

<br>
\begin{equation}
Q(S_{t},A_{t}) \leftarrow Q(S_{t},A_{t}) + \alpha \: [R_{t} + \gamma \max_{a} Q(S_{t+1},a) - Q(S_{t},A_{t})]\:, \tag{2}
\end{equation}
<br>

*where $Q(S_{t},A_{t})$ is the Q-value of the state-action pair, $\max_{a} Q(S_{t+1},a)$ is the largest Q-value among the state-action pairs of the next state, and $\gamma$ is the discount factor, which determines how much future rewards are attenuated relative to the reward that is immediately available. The discount factor takes values in $[0,1]$: at 0 the agent is short-sighted and takes only the immediate reward into account, and at 1 the agent is far-sighted and values all future rewards equally.*
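
A single Q-learning update can be sketched as follows. The table layout (a list of rows, one row per state and one column per action) and all numeric values are assumptions made for the illustration.

```python
def q_learning_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Apply Equation 2 to one observed transition, modifying q_table in place."""
    best_next = max(q_table[next_state])        # largest Q-value in the next state
    target = reward + gamma * best_next         # discounted estimate of the return
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]

# Hypothetical table with 2 states (rows) and 2 actions (columns).
q_table = [[0.0, 0.0],
           [1.0, 3.0]]
new_value = q_learning_update(q_table, state=0, action=1, reward=2.0,
                              next_state=1, alpha=0.5, gamma=0.9)
print(new_value)  # approximately 0.5 * (2.0 + 0.9 * 3.0) = 2.35
```

With `gamma=0` the target reduces to the immediate reward alone, which corresponds to the short-sighted agent described above.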


## Deep learning 

Deep learning is currently one of the hottest research topics in artificial intelligence. The development of the neural networks it uses has been inspired over time by the structure and function of the human brain. Deep learning has been studied since the 1960s, but the real breakthrough was not achieved until the 2010s, with the growing availability of big data, the increase in the computing power of computers, and the development of machine learning methods. Deep learning can be applied in almost all areas of artificial intelligence.

Artificial neural networks, like the human brain, consist of neurons, which are simple interconnected data processing units. The structures of biological and artificial neurons resemble each other, as shown in Figure 2. In a biological neuron, the incoming branches (dendrites) carry information as electrical impulses towards the cell body, from where the outgoing branch (axon) carries the processed information forward. Correspondingly, in an artificial neuron, the inputs multiplied by their weights and the bias term are summed together in an adder. The sum is then passed through an activation function, such as the sigmoid function or the ReLU (rectified linear unit) function, which is a nonlinear mapping [4]. If a threshold function is used as the activation function, the neuron is called a perceptron.

<br>
<div style="width:image width px; font-size:80%; text-align:center;">
    <center>
    <img src='imgs\neurons.png' width='850' height='auto' style='padding-bottom:0.5em;' />
    </center>
    <span>Figure 2. The structures of the biological and artificial neurons resemble each other.</span>
</div>
<br>
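
The computation of a single artificial neuron can be sketched in plain Python. The inputs, weights, and bias values below are hypothetical; the last lines show a perceptron (threshold activation) whose hand-picked weights make it compute the logical AND of two binary inputs.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def threshold(z):
    return 1.0 if z >= 0 else 0.0

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical inputs and parameters: weighted sum is 0.5 - 2.0 + 0.5 = -1.0.
inputs, weights, bias = [1.0, 2.0], [0.5, -1.0], 0.5
print(neuron(inputs, weights, bias, sigmoid))  # 1 / (1 + e), roughly 0.27
print(neuron(inputs, weights, bias, relu))     # 0.0

# A perceptron with hand-picked weights that behaves like a logical AND.
and_outputs = [neuron([a, b], [1.0, 1.0], -1.5, threshold)
               for a in (0, 1) for b in (0, 1)]
print(and_outputs)  # [0.0, 0.0, 0.0, 1.0]
```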

A multilayer neural network constructed of artificial neurons consists of an input layer, an output layer, and hidden layers between them, as in Figure 3. Training a neural network means adjusting the weights of the neurons in the network. A single data sample enters the input layer, from where information travels through the hidden layers to the output layer. When the true output of the sample is known, the weights can be adjusted so that the output predicted by the network moves closer to it. This is done with the backpropagation algorithm, which traverses the network backwards from the output layer to the input layer and calculates the partial derivative of the cost function with respect to each weight. The weights are then adjusted by an update rule that uses the partial derivatives, reducing the error between the prediction and the true output.

<br>
<div style="width:image width px; font-size:80%; text-align:center;">
    <center>
    <img src='imgs\neuralnetwork.png' width='700' height='auto' style='padding-bottom:0.5em;' />
    </center>
    <span>Figure 3. A multilayer neural network consists of an input layer, an output layer, and hidden layers.</span>
</div>
<br>
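
The training loop described above can be sketched for a very small network with two inputs, two hidden neurons, and one output neuron. Everything specific here is an assumption made for the illustration: the sigmoid activation, the squared-error cost, the learning rate, the random initial weights, and the toy task (the logical OR of two binary inputs). It is not the network of Figure 3.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, net):
    """Forward pass through one hidden layer and one output neuron."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(weights, x)) + b)
              for weights, b in zip(net["w_hidden"], net["b_hidden"])]
    output = sigmoid(sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"])
    return hidden, output

def train_epoch(data, net, learning_rate):
    """Backpropagate each sample once; return the mean squared error seen."""
    total = 0.0
    for x, target in data:
        hidden, output = forward(x, net)
        total += (output - target) ** 2
        # Chain rule at the output neuron: d(error) / d(weighted sum).
        delta_out = 2.0 * (output - target) * output * (1.0 - output)
        # Propagate backwards to each hidden neuron (uses the old output weights).
        delta_hidden = [delta_out * w * h * (1.0 - h)
                        for w, h in zip(net["w_out"], hidden)]
        # Update rule: move every weight against its partial derivative.
        for j, h in enumerate(hidden):
            net["w_out"][j] -= learning_rate * delta_out * h
            net["b_hidden"][j] -= learning_rate * delta_hidden[j]
            for i, xi in enumerate(x):
                net["w_hidden"][j][i] -= learning_rate * delta_hidden[j] * xi
        net["b_out"] -= learning_rate * delta_out
    return total / len(data)

rng = random.Random(0)
net = {
    "w_hidden": [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)],
    "b_hidden": [0.0, 0.0],
    "w_out": [rng.uniform(-1, 1) for _ in range(2)],
    "b_out": 0.0,
}
# Hypothetical toy task: the logical OR of two binary inputs.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
errors = [train_epoch(data, net, learning_rate=0.5) for _ in range(2000)]
print(f"error: {errors[0]:.3f} -> {errors[-1]:.3f}")
```

The weights are updated after every sample here; updating once per batch of samples is an equally common design choice.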

By feeding the data samples through the neural network multiple times, good weights are eventually learned. If there is little data and the neural network has many neurons, there is a risk of overfitting. Conversely, if highly complex data is input to a simple neural network, the network cannot learn all the properties of the data and underfitting occurs.

Next, let’s take a closer look at a deep learning architecture called the convolutional neural network (CNN), which is most commonly used in image and video recognition and classification applications. Convolutional neural networks include, for example, convolutional layers, pooling layers, fully-connected layers, dropout layers, and normalization layers [5]. In a convolutional neural network, matrix-shaped inputs, such as the pixel values of images, are first processed in convolutional layers, and the features they produce pass through the rest of the network to the output layer.

The convolutional layers contain masks (also called kernels or filters) of weights. The mask is moved over each position of the matrix-shaped input, and a convolution operation is calculated between the mask and the part of the input it covers. The weights of the masks are updated during training so that the masks extract a variety of features from the input. In Figure 4, a 3x3 input is convolved with a 3x3 mask using a stride of one unit in the x-axis and y-axis directions. Zero-padding one unit wide has been added around the original input, because convolution with masks larger than 1x1 would otherwise reduce the size of the result. The size of the features produced by the convolutional layer is calculated with Equation 3.

<br>
<div style="width:image width px; font-size:80%; text-align:center;">
    <center>
    <img src='imgs\convolution.png' width='400' height='auto' style='padding-bottom:0.5em;' />
    </center>
    <span style="text-align:center">Figure 4. A 3x3 input (blue border) is surrounded by zero-padding one unit wide.<br>It is convolved with a 3x3 mask (red border) using a stride of one unit in the x-axis and y-axis directions.</span>
</div>
<br>

<br>
\begin{equation}
O =  \frac{W-F+2P}{S} + 1\:, \tag{3}
\end{equation}
<br>

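*where $W$ is the input width, $F$ is the width of the mask, $P$ is the width of the zero-padding, $S$ is the stride, and $O$ is the width of the output produced by the convolutional layer.* For the example in Figure 4, $O = \frac{3-3+2 \cdot 1}{1} + 1 = 3$, so the output is as wide as the input.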
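
Equation 3 and the sliding-mask operation can be sketched in plain Python for square inputs. The input values and the mask are hypothetical. As in most deep learning libraries, the mask is applied without flipping it, which strictly speaking is a cross-correlation.

```python
def conv_output_width(input_width, mask_width, padding, stride):
    """Equation 3: O = (W - F + 2P) / S + 1."""
    span = input_width - mask_width + 2 * padding
    if span % stride != 0:
        raise ValueError("mask does not fit the padded input evenly")
    return span // stride + 1

def convolve2d(image, mask, padding=0, stride=1):
    """Slide the mask over the zero-padded image and sum the elementwise products."""
    size = len(image) + 2 * padding
    padded = [[0] * size for _ in range(size)]
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            padded[r + padding][c + padding] = value
    f = len(mask)
    out = conv_output_width(len(image), f, padding, stride)
    return [[sum(padded[r * stride + i][c * stride + j] * mask[i][j]
                 for i in range(f) for j in range(f))
             for c in range(out)]
            for r in range(out)]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]    # hypothetical 3x3 input
mask = [[0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]     # hypothetical mask that copies the centre value
result = convolve2d(image, mask, padding=1, stride=1)
print(conv_output_width(3, 3, 1, 1))  # 3
print(result)                         # same 3x3 values as the input
```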

The pooling layer is often placed in the neural network immediately after the convolutional layer. The pooling layer compresses the information, reducing the spatial size of the features. This in turn speeds up the computation and reduces the risk of overfitting. In max pooling, the largest value is selected within each area of a grid laid over the input. In average pooling, the values within each grid area are averaged. In Figure 5, a 4x4 input is pooled with a 2x2 max-pooling layer and, for comparison, with a 2x2 average-pooling layer.

<br>
<div style="width:image width px; font-size:80%; text-align:center;">
    <center>
    <img src='imgs\pooling.png' width='400' height='auto' style='padding-bottom:0.5em;' />
    </center>
    <span>Figure 5. Pooling with max pooling and average pooling</span>
</div>
<br>
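
Both pooling variants can be sketched with one helper function. The 4x4 values below are hypothetical and are not the values shown in Figure 5; the sketch also assumes a square input whose width is divisible by the pooling size.

```python
def pool2d(features, size, reduce):
    """Split a square feature map into size x size blocks and reduce each block."""
    n = len(features)
    return [[reduce([features[r + i][c + j]
                     for i in range(size) for j in range(size)])
             for c in range(0, n, size)]
            for r in range(0, n, size)]

def mean(values):
    return sum(values) / len(values)

features = [[1, 3, 2, 4],
            [5, 7, 6, 8],
            [4, 2, 0, 1],
            [3, 1, 2, 3]]   # hypothetical 4x4 feature map
max_pooled = pool2d(features, 2, max)
avg_pooled = pool2d(features, 2, mean)
print(max_pooled)  # [[7, 8], [4, 3]]
print(avg_pooled)  # [[4.0, 5.0], [2.5, 1.5]]
```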

When the task is to predict which category each image sample belongs to, the rest of the convolutional network consists of fully-connected layers. The matrix-shaped features produced by the convolutional layers and pooling layers are flattened into a one-dimensional list. The numeric values in the list are fed to the input layer of the fully-connected part, from where they are transformed through the possible hidden layers into the predicted class information at the output layer.

Overfitting is a common problem when training convolutional neural networks. In addition to pooling layers, it can be controlled with dropout layers and normalization layers. In a dropout layer, some of the neurons are randomly dropped out of the neural network during the training phase, as shown in Figure 6. The most common normalization layer is batch normalization. In it, during each training iteration, the batch mean is subtracted from the features of the batch and the result is divided by the batch standard deviation.

<br>
<div style="width:image width px; font-size:80%; text-align:center;">
    <center>
    <img src='imgs\dropout.png' width='450' height='auto' style='padding-bottom:0.5em;' />
    </center>
    <span>Figure 6. On the left is the original neural network and on the right is the neural network after some neurons have been dropped out.</span>
</div>
<br>
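
Dropout and batch normalization can each be sketched in a few lines. Several details below are assumptions that the text above does not specify: the surviving activations are rescaled ("inverted dropout", a common variant), a small constant `eps` guards against division by zero, and the learnable scale and shift that batch normalization layers usually include are left out.

```python
import random
import statistics

def dropout(activations, drop_probability, rng, training=True):
    """Zero each activation with probability drop_probability during training."""
    if not training:
        return list(activations)
    keep = 1.0 - drop_probability
    # Assumption: survivors are scaled by 1/keep so the expected sum is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def batch_normalize(batch, eps=1e-5):
    """Normalize one feature over a batch: subtract the mean, divide by the std."""
    mean = statistics.fmean(batch)
    variance = statistics.pvariance(batch, mu=mean)
    return [(x - mean) / (variance + eps) ** 0.5 for x in batch]

rng = random.Random(0)
dropped = dropout([1.0] * 10, drop_probability=0.5, rng=rng)
print(dropped)   # each entry is either 0.0 (dropped) or 2.0 (kept and rescaled)

normalized = batch_normalize([1.0, 2.0, 3.0, 4.0])   # hypothetical batch
print(normalized)  # mean close to 0, standard deviation close to 1
```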

Entire convolutional networks are formed from different layer types. Figure 7 shows the structure of a convolutional neural network that predicts whether a 32x32 pixel RGB image belongs to the category dog, cat, or bird. The convolutional network consists of three consecutive blocks, each formed by a convolutional layer and a max-pooling layer. The features extracted from the image are then fed via the fully-connected layer to the output layer, which gives the class information. After the convolutional neural network in Figure 7 has been trained, it classifies a test image of a cat with a 97 percent posterior probability for the class cat, a 2 percent posterior probability for the class dog, and a 1 percent posterior probability for the class bird.

<br>
<div style="width:image width px; font-size:80%; text-align:center;">
    <center>
    <img src='imgs\network.jpg' width='1100' height='auto' style='padding-bottom:0.5em;' />
    </center>
    <span>Figure 7. A convolutional neural network built for classifying 32x32 pixel RGB images.</span>
</div>
<br>
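
How the spatial size of the features shrinks through three convolution and max-pooling blocks can be traced with Equation 3. The layer settings below are assumptions chosen for illustration (3x3 masks with padding 1 and stride 1, 2x2 pooling with stride 2); they are not read from Figure 7, so the actual network may use different values.

```python
def output_width(input_width, mask_width, padding, stride):
    """Equation 3: O = (W - F + 2P) / S + 1."""
    return (input_width - mask_width + 2 * padding) // stride + 1

# Assumed layer settings (not taken from Figure 7).
CONV = dict(mask_width=3, padding=1, stride=1)   # keeps the width unchanged
POOL = dict(mask_width=2, padding=0, stride=2)   # halves the width

width = 32                  # 32x32 pixel input image
widths = [width]
for block in range(3):      # three convolution + max-pooling blocks
    width = output_width(width, **CONV)
    width = output_width(width, **POOL)
    widths.append(width)

print(widths)  # [32, 16, 8, 4]
```

Under these assumptions, the features entering the fully-connected layer are 4x4 in spatial size per feature channel.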

## Sources

[1]    Mnih V., Kavukcuoglu K., Silver D., Graves A., Antonoglou I., Wierstra D. & Riedmiller M. (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

[2]    Sutton R. S. & Barto A. G. (1998) Reinforcement learning: An introduction. MIT Press.

[3]    Silver D. Reinforcement learning lectures. URL: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html.

[4]    Bishop C. M. (2006) Pattern recognition and machine learning. Springer Science+Business Media.

[5]    Fei-Fei L. Convolutional Neural Networks for Visual Recognition. URL: https://cs231n.github.io/convolutional-networks/.
