# RNN 
- so far our NNs have made an assumption that biological neurons do not:
- all the information flows in one direction, from left to right
- why not have neuronal connections which are less restrictive? connections that can also go backwards
- when we create backward connections in a specific way, with a time delay of one step, we call the result a recurrent neural network

#### note: there are many ways to make backward connections, but only one specific pattern is called a recurrent neural network. Diagram:
- 1 hidden layer in which every hidden unit has an arrow that goes back to every hidden unit, INCLUDING ITSELF

- we'll start with the simple RNN, also known as the Elman RNN
- code exercise on shapes to reinforce our understanding
- the simple RNN is almost never used anymore; for the time being LSTM is the most popular unit
- LSTM and GRU: what advantages they have over the simple RNN
- practice on datasets

#### Recall: W.T * x + b is just a neuron!

## Simple RNN / Elman Unit
- imagine you use an ANN for classifying words, e.g. part-of-speech tagging or spam detection
- there can be multiple different answers depending on the context
- e.g. the context in which the word "bank" is used should change the chance of the text being classified as spam or not
- hidden features (features transformed in hidden layers) are also called hidden representations
- the hidden layers are the network; the outputs of these hidden layers are hidden representations of the input
- back to our classification problem:
- instead of taking in only the input at the current timestamp, the network also takes in the hidden representation from the previous timestamp
- notice: by doing this, there is a pathway from all previous words to the current output!
- I think this is where it departs from Markov models, because it doesn't depend only on the previous step.
- it does not just depend on the previous timestamp but on all the past timestamps, through the previous hidden representations; this is why it is called a recurrent NN.
- hidden representations = hidden states, the common terminology when dealing with sequences
- typically we assume the first hidden state is a vector of zeros (see iPad for the equation)
- as usual it's useful to think about shapes: can we infer what the shapes of our weights and bias vectors should be?
- M: size of the hidden vector. D: size of the input vector.
- Wh: MxM
- Wx: DxM
- bh and bx: M
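
The recurrence described above can be written out as follows. This is only a sketch: it assumes the tanh activation (the Keras default mentioned later in these notes) and a single bias $b_h$; if separate biases $b_x$ and $b_h$ are kept, they simply add up to one vector of size M.

$$
h_t = \tanh\left(W_x^\top x_t + W_h^\top h_{t-1} + b_h\right), \qquad h_0 = \mathbf{0}, \qquad t = 1, \dots, T
$$

Here $x_t \in \mathbb{R}^D$, $h_t \in \mathbb{R}^M$, $W_x \in \mathbb{R}^{D \times M}$, $W_h \in \mathbb{R}^{M \times M}$ and $b_h \in \mathbb{R}^M$, which matches the shapes listed above.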

### How to use Elman Unit to solve problems?
- 1) Many-to-one tasks (e.g. spam detection or sentiment analysis): you have a whole sequence of inputs but only a single output
- 2) Many-to-many tasks (e.g. part-of-speech tagging or time series anomaly detection): you have a sequence of inputs and a sequence of outputs. For anomaly detection, any point in a time series could be classified as anomalous

- our network will end with a final dense layer, as it always does, and a final activation appropriate for the task at hand
- yet there are multiple ways to connect the final dense layer
- h(t) is the output of the RNN at every time step; the question is what to do with all these hidden outputs
- the answer depends on the type of task
- if we're doing many-to-one, there are two options: keep only the final hidden state, which contains the info from the whole time series, and pass only that to the final dense layer; or apply global max pooling over all the hidden states and pass the max to the dense layer
- if we're doing a many-to-many task, keep all the hidden states, each of which contains the info only up to that point, and pass all of them to the final dense layer to get a series of separate predictions, one for each timestamp.
- NOTE: the same dense layer is applied at all timestamps, just as the same simple RNN weights are used at every timestamp (another example of weight sharing)
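
The two ways of connecting the final dense layer can be sketched in NumPy. The sizes and the random weights below are illustrative assumptions, not values from the lecture; the point is only the shapes and the fact that one dense layer (`Wo`, `bo`) is shared by every timestamp.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, M, K = 6, 3, 5, 2           # illustrative sizes (assumptions)

X = rng.standard_normal((T, D))   # one input sequence
Wx = rng.standard_normal((D, M))  # input -> hidden
Wh = rng.standard_normal((M, M))  # hidden -> hidden
bh = np.zeros(M)
Wo = rng.standard_normal((M, K))  # final dense layer, shared by all timestamps
bo = np.zeros(K)

# Run the recurrence and keep every hidden state.
H = np.zeros((T, M))
h = np.zeros(M)                   # initial hidden state is all zeros
for t in range(T):
    h = np.tanh(X[t] @ Wx + h @ Wh + bh)
    H[t] = h

y_many_to_one = H[-1] @ Wo + bo   # only the final hidden state -> shape (K,)
y_many_to_many = H @ Wo + bo      # same dense layer at every timestamp -> shape (T, K)

print(y_many_to_one.shape, y_many_to_many.shape)
```

The last row of `y_many_to_many` is the same vector as `y_many_to_one`, since both come from the final hidden state.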

#### Consider a CNN with conv > pool > conv > pool... with TxD input and TxM output
- the RNN gives us exactly the same shape, because we end up with T hidden vectors, each of size M.
- convolutions and RNNs are not different things here; they are different perspectives on the same kind of data.

#### Just as in CNN before passing our data to final dense layer, we need to obtain a single flat feature vector.
- In both cases we reduce TxM to M.
- one way to obtain such a feature vector is global max pooling
- In an RNN, it takes the max value over time for each hidden unit, so you end up with M features. Put simply, we're getting rid of the time dimension!
- It makes sense to take the maximum, since we use it as a proxy for which value matters most
- ex: think of a movie review where the word "terrible" appears, but not towards the end of the sentence. Due to the vanishing gradient problem, the RNN may not be able to pick up on "terrible" if it appears too far from the end. By taking the max, we look at the hidden values from every timestamp, which lets us see more clearly the words that matter most for predicting the target.
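
A tiny NumPy illustration of global max pooling over time. The hidden-state values are made up for the example (hypothetical), with one strong activation early in the sequence:

```python
import numpy as np

# Hypothetical hidden states for T = 4 timestamps and M = 3 hidden units.
H = np.array([
    [0.1, -0.2,  0.0],
    [0.9, -0.1,  0.2],   # a strong activation early in the sequence
    [0.0,  0.3, -0.5],
    [0.2,  0.1,  0.1],   # final hidden state
])

last_state = H[-1]       # what a plain many-to-one RNN would pass on
pooled = H.max(axis=0)   # global max pooling: max over the time axis

print(last_state)  # [0.2 0.1 0.1]
print(pooled)      # [0.9 0.3 0.2]
```

The early 0.9 survives in `pooled` even though it is not visible in the final hidden state, and the T x M array has been reduced to a single vector of size M.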

### ANN vs RNN
- for the many-to-many case, let's think about what the shape of the output will be.
- ANN: D inputs, M hidden units, K outputs
- RNN: a T x D input sequence gives a T x M sequence of hidden vectors (every timestamp gets its own hidden state vector). After passing each hidden state vector through the final dense layer, each output has size K, so we get an output sequence of shape T x K
- so imagine our task is to predict parts of speech tags (there are 8 of them), then K would be 8.
- you can see how this is analogous to ANN

### Can you stack multiple RNN layers?
- Yes
- if the output of a hidden layer is TxM1, this is just another time series, so we can pass it to another hidden layer and get another hidden output shaped TxM2, which is yet another time series
#### Note: one common mistake is to confuse T and M: the sequence length is not the same as the size of the hidden vector
- normally we wouldn't change the size of the hidden layers (M1 = M2)
- previously, when you saw a circle in an ANN diagram, you could think of it as a single number
- in RNN diagrams, however, each circle does not represent a number but a whole RNN layer, which outputs a vector
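
Stacking can be sketched in NumPy with a small helper. `simple_rnn_forward` is a hypothetical function written for this note, and the weights are random; M1 and M2 are deliberately different here only so that T, M1 and M2 are easy to tell apart in the printed shapes.

```python
import numpy as np

def simple_rnn_forward(X, Wx, Wh, bh):
    """Return all hidden states (T x M) for one input sequence X (T x D)."""
    T, M = X.shape[0], Wh.shape[0]
    H = np.zeros((T, M))
    h = np.zeros(M)                      # initial hidden state
    for t in range(T):
        h = np.tanh(X[t] @ Wx + h @ Wh + bh)
        H[t] = h
    return H

rng = np.random.default_rng(0)
T, D, M1, M2 = 10, 3, 5, 4               # illustrative sizes (assumptions)

X = rng.standard_normal((T, D))
# First layer: T x D -> T x M1
H1 = simple_rnn_forward(X, rng.standard_normal((D, M1)),
                        rng.standard_normal((M1, M1)), np.zeros(M1))
# Second layer treats H1 as just another time series: T x M1 -> T x M2
H2 = simple_rnn_forward(H1, rng.standard_normal((M1, M2)),
                        rng.standard_normal((M2, M2)), np.zeros(M2))

print(X.shape, H1.shape, H2.shape)  # (10, 3) (10, 5) (10, 4)
```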

#### we studied the Elman unit, but these ideas also apply to LSTM and GRU; the interface is the same in all cases
- the units themselves are different, but the way they are incorporated into an RNN is the same

## Code Preparation
model structure stays the same

#### In Tensorflow: 

in the diagram, the hidden RNN layer has a self-loop: a self-loop means SimpleRNN and no loop means Dense
- for RNNs the default activation function is the hyperbolic tangent, though you can use other activation functions if you like; this is unlike the Dense layer, where the default activation is the identity
- let's again think about the shape of the data in the many-to-one case
    - we start with a multivariate time series shaped TxD
    - this then goes through a SimpleRNN layer
    - SimpleRNN takes in a time series of vectors x1, x2, all the way to xT (each vector of size D)
    - it converts them into a time series of hidden vectors h1, h2, ..., hT (each vector of size M, usually)
    - each hidden vector depends on the current input x plus the previous hidden vector
    - we only keep hT (size M)
    - the final dense layer works like in a regular ANN and outputs yT (a vector of size K)
    
#### now let's consider shape in many-to-many case:

with return_sequences=True, SimpleRNN returns all the hidden vectors of the time series in a single array of size TxM
- suppose this is a many-to-many task where we want a prediction for each timestamp
- we want y1, y2, ..., yT, each of size K, i.e. TxK in total
- we can pass the RNN output straight through a Dense layer and the output will be TxK
- you don't need any extra argument for the Dense layer; it is applied along the last axis, so TensorFlow handles a single vector and a time series of vectors the same way.
- if you pass a single vector, you'll get back a vector of size K.
- if you pass a series of vectors, you will get back a series of vectors, each of size K.
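
A minimal Keras sketch of the many-to-many case. The sizes are illustrative assumptions; the layers are the same `SimpleRNN` and `Dense` used elsewhere in these notes.

```python
from tensorflow.keras.layers import Input, SimpleRNN, Dense
from tensorflow.keras.models import Model

T, D, M, K = 10, 3, 5, 2   # illustrative sizes (assumptions)

i = Input(shape=(T, D))
x = SimpleRNN(M, return_sequences=True)(i)  # keep every hidden state: (None, T, M)
x = Dense(K)(x)                             # applied at each timestamp: (None, T, K)
many_to_many = Model(i, x)

print(many_to_many.output_shape)  # (None, 10, 2)
```

The leading `None` is the batch dimension N, which Keras leaves unspecified until data is passed in.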

#### another scenario: we're dealing again with many-to-one task but we need global max pooling:

that's when you want to keep all the hidden values and take the maximum value over time: apply global max pooling just as we did with CNNs
- this collapses a time series of TxM into a single vector of size M, completely eliminating the time dimension; then we move on to our typical final dense layer
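
A minimal Keras sketch of many-to-one with global max pooling, assuming the standard `GlobalMaxPooling1D` layer and illustrative sizes:

```python
from tensorflow.keras.layers import Input, SimpleRNN, GlobalMaxPooling1D, Dense
from tensorflow.keras.models import Model

T, D, M, K = 10, 3, 5, 2   # illustrative sizes (assumptions)

i = Input(shape=(T, D))
x = SimpleRNN(M, return_sequences=True)(i)  # all hidden states: (None, T, M)
x = GlobalMaxPooling1D()(x)                 # max over the time axis: (None, M)
x = Dense(K)(x)                             # (None, K)
pooled_model = Model(i, x)

print(pooled_model.output_shape)  # (None, 2)
```

Note that `return_sequences=True` is needed here even though the task is many-to-one, because the pooling layer needs every hidden state to take the max over.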

#### last scenario : stacking Layers
- when you stack multiple RNN layers together:
- recall: the input of an RNN must be a time series, therefore the output of the previous layer must also be a time series
- this is done by setting return_sequences=True on every RNN layer that feeds into another RNN layer
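
A minimal Keras sketch of two stacked RNN layers for a many-to-one task (sizes are illustrative assumptions). Only the first layer needs `return_sequences=True`; the last RNN layer can return just its final hidden state.

```python
from tensorflow.keras.layers import Input, SimpleRNN, Dense
from tensorflow.keras.models import Model

T, D, M, K = 10, 3, 5, 2   # illustrative sizes (assumptions)

i = Input(shape=(T, D))
x = SimpleRNN(M, return_sequences=True)(i)  # must emit a sequence for the next RNN: (None, T, M)
x = SimpleRNN(M)(x)                         # last RNN layer keeps only hT: (None, M)
x = Dense(K)(x)                             # (None, K)
stacked = Model(i, x)

print(stacked.output_shape)  # (None, 2)
```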

### Easy to use LSTM and GRU (Preview)

## RNN: Paying Attention to Shapes 
- once again he opened a Colab notebook that I don't have :/

In [None]:
import tensorflow as tf
print(tf.__version__)

In [2]:
from tensorflow.keras.layers import Input, SimpleRNN, Dense, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD, Adam

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt



#### Things you should automatically know and have memorized:
- N = number of samples
- T = sequence length
- D = number of input features
- M = number of hidden units
- K = number of output units

#### sidenote: K > 1 does not automatically mean you are doing classification with softmax; you can do a multidimensional regression as well
- imagine you're trying to predict lat-long coordinates: K would be two, but it is still a regression problem

#### this multidimensional regression idea didn't fully click for me: do the two outputs necessarily have something to do with each other in that case?

In [3]:
# make some data:
N = 1   # number of samples
T = 10  # sequence length
D = 3   # number of input features (vector dimensionality)
K = 2   # number of output units
X = np.random.randn(N, T, D)  # OK, our X will be shaped NxTxD, but what is actually happening here?


In [4]:
X

array([[[-0.19453217, -0.7936975 ,  1.23878508],
        [-1.52211801, -0.3633141 , -1.09612726],
        [-0.27881969,  0.74103717,  1.03219534],
        [ 0.63763537,  1.08308935, -0.18640772],
        [-0.56533241, -0.01329518, -1.90528166],
        [-0.08785873, -2.34340253, -0.50334106],
        [ 0.84780722, -0.87618371,  0.19949783],
        [-0.50037419, -0.02923432,  1.64864089],
        [-0.56440474, -1.35151236,  0.30333497],
        [-0.0813958 , -0.46163601, -0.8162573 ]]])
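
To answer the question in the comment above: `np.random.randn(N, T, D)` fills an array of shape (N, T, D) with independent draws from a standard normal distribution, so it is just random stand-in data with the right shape. A quick way to see how the axes are laid out:

```python
import numpy as np

N, T, D = 1, 10, 3
X = np.random.randn(N, T, D)   # N * T * D independent standard-normal values

print(X.shape)        # (1, 10, 3): N samples, each a T x D sequence
print(X[0].shape)     # (10, 3): the single sample's whole sequence
print(X[0, 0].shape)  # (3,): the D features at the first timestamp
```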

In [9]:
# make an RNN
# number of hidden units M
M = 5

i = Input(shape = (T,D))
x = SimpleRNN(M)(i) #we assumed default activation which is tanh
x = Dense(K)(x) #let's assume we're doing regression so there is no activation function
model = Model(i,x)

In [10]:
#use model to make a pred
Yhat = model.predict(X)
print(Yhat)

[[ 0.28285617 -0.09928322]]


In [11]:
# the instructor's output was 1x2 but mine was 10x2, why? did mine compute an output for each timestamp separately?
#?SimpleRNN
# but return_sequences shows up as False by default, so what gives
# because, you idiot, you connected the Dense layer to i instead of x; fixed now, thx to ChatGPT

In [12]:
model.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 10, 3)]           0         
                                                                 
 simple_rnn_1 (SimpleRNN)    (None, 5)                 45        
                                                                 
 dense_1 (Dense)             (None, 2)                 12        
                                                                 
Total params: 57
Trainable params: 57
Non-trainable params: 0
_________________________________________________________________
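
The parameter counts in the summary can be checked by hand from the weight shapes discussed earlier (Wx is DxM, Wh is MxM, bh has size M; the dense layer has an MxK matrix plus K biases). A small sketch of the arithmetic:

```python
D, M, K = 3, 5, 2   # same sizes as in the notebook

rnn_params = D * M + M * M + M   # Wx (D x M) + Wh (M x M) + bh (M)
dense_params = M * K + K         # Wo (M x K) + bo (K)

print(rnn_params, dense_params, rnn_params + dense_params)  # 45 12 57
```

Note that T does not appear anywhere: the same weights are reused at every timestamp, so the sequence length does not change the number of parameters.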


In [13]:
# this is a model with random weights and biases
model.layers[1].get_weights()

# it looks like three big arrays

[array([[-0.83532584, -0.6514124 , -0.02678806, -0.72216   , -0.79930604],
        [-0.29870445,  0.13089561,  0.59346145,  0.21358949, -0.29906106],
        [-0.06798428, -0.69046766, -0.79839563, -0.35065335,  0.251105  ]],
       dtype=float32),
 array([[ 0.05338168,  0.83906746, -0.18806717, -0.48868334, -0.13760741],
        [ 0.42733547, -0.09336749, -0.15131587, -0.2911593 ,  0.8372555 ],
        [-0.5687558 , -0.20128128,  0.47295716, -0.6277616 ,  0.13501695],
        [ 0.02498548, -0.4723776 , -0.6730736 , -0.45234   , -0.34437704],
        [-0.7003053 ,  0.15360272, -0.5147986 ,  0.27878094,  0.3784738 ]],
       dtype=float32),
 array([0., 0., 0., 0., 0.], dtype=float32)]

In [14]:
#it's actually more helpful to print shapes of these arrays:
# first output is input > hidden
# second output is hidden > hidden
# third output is bias term

a,b,c = model.layers[1].get_weights()
print(a.shape, b.shape, c.shape)

# the first weight is 3x5 (D x M): input to hidden layer
# the second weight is 5x5 (M x M): hidden layer to hidden layer (recurrent)
# the third is a vector of length M: the bias term

(3, 5) (5, 5) (5,)


In [17]:
# makes sense, because we set M to 5

Wx, Wh, bh = model.layers[1].get_weights()
Wo, bo = model.layers[2].get_weights() #dense layer

In [35]:
# last step is to do our manual RNN calculation
# we'll initialize the hidden state to a vector of zeros
# comparing the two outputs is also our way to confirm that the initial hidden state is indeed zero

h_last = np.zeros(M)
x = X[0]  # we only have 1 sample, so this selects its whole T x D sequence. what would we do if we had more than 1 sample?
Yhats = [] #where we store the outputs

for t in range(T):
    h = np.tanh(x[t].dot(Wx) + h_last.dot(Wh) + bh)
    y = h.dot(Wo) + bo #we'll only care about this value in the last iteration
    Yhats.append(y)
    
    h_last = h
    
print(Yhats[-1])

[0.02319835 0.19658582]


In [36]:
print(Yhats)

[array([-0.05560619, -0.25817303]), array([ 0.7174579 , -0.82431744]), array([0.01325308, 1.03616249]), array([-0.79721722,  0.64520941]), array([-0.26528943, -0.69266743]), array([0.4197635 , 0.93329063]), array([-0.2404735 ,  0.12029364]), array([ 0.41897202, -0.78028751]), array([ 1.0837912 , -0.25283553]), array([0.02319835, 0.19658582])]


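#### the last entry of Yhats should equal what model.predict returned in In [10]. Note: the values printed in this export do not match ([0.2829, -0.0993] vs [0.0232, 0.1966]), so X or the model was probably re-created between those runs; re-run the cells in order to check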

### Bonus exercise: Calculate this for more than 1 sample

In [42]:
#make some data:
N = 4 #sample
T = 10 #sequence length
D = 3 # feature number (vector dimensionality)
K = 2
X = np.random.randn(N, T, D)
X

#now I have 4 samples

array([[[-0.36626404, -0.28749524, -0.67429819],
        [-0.01822258,  0.98873318,  0.25819092],
        [ 0.39025371, -1.17494434, -0.12687456],
        [ 0.36452043,  0.24914082,  0.51245152],
        [-0.40053118,  0.37266551,  1.01692962],
        [-0.24503446, -0.71286245, -0.65378666],
        [ 0.96050579,  0.59688585,  1.79240562],
        [-0.21362317,  1.1411716 , -1.35714152],
        [-0.95025084,  0.19518304,  0.92154138],
        [ 0.83387336,  0.93824977, -0.86927701]],

       [[ 0.66820928, -0.58017879, -0.93077255],
        [ 1.01782246,  0.42371897, -0.11747005],
        [-0.26584489, -0.69592804,  2.32856302],
        [ 1.21116868,  0.68692477, -0.90206902],
        [-1.08766561,  0.12870312,  0.84659945],
        [ 0.10460487, -0.83208768,  0.69560535],
        [ 0.21582375, -0.14485585,  1.43077286],
        [ 0.85893891, -0.27599019,  1.56865603],
        [-0.68035431,  0.67757608,  1.46120034],
        [ 1.40912492,  0.13270857,  0.52291865]],

       [[ 1.0024

In [43]:
M = 5

i = Input(shape = (T,D))
x = SimpleRNN(M)(i) 
x = Dense(K)(x) #let's assume we're doing regression so there is no activation function
model = Model(i,x)

In [44]:
#use model to make predictions
Yhat = model.predict(X)
print(Yhat)

#we have 2 outputs for each of 4 samples

[[ 1.5781996  -1.4572457 ]
 [ 0.47916812 -0.23345041]
 [ 1.6686987  -1.5694398 ]
 [ 0.9615597  -0.7691152 ]]


In [45]:
model.summary()

Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, 10, 3)]           0         
                                                                 
 simple_rnn_4 (SimpleRNN)    (None, 5)                 45        
                                                                 
 dense_4 (Dense)             (None, 2)                 12        
                                                                 
Total params: 57
Trainable params: 57
Non-trainable params: 0
_________________________________________________________________


In [46]:
model.layers[1].get_weights()

# ok, it didn't compute separate weights for each sample, makes sense

[array([[-0.5862994 ,  0.5813628 ,  0.7974624 ,  0.07803494, -0.7945263 ],
        [-0.8436101 , -0.09644419,  0.68786484, -0.43702722, -0.7325891 ],
        [ 0.3859269 , -0.4878196 ,  0.7217867 , -0.63964015,  0.15325826]],
       dtype=float32),
 array([[-0.21672702,  0.35616493,  0.34219792,  0.6425723 , -0.54422176],
        [-0.2823942 ,  0.53048223, -0.29434738,  0.3309667 ,  0.66532904],
        [-0.23114593, -0.09281105,  0.8623884 , -0.12879865,  0.42149094],
        [ 0.8877089 ,  0.3662093 ,  0.22412716,  0.09347865,  0.13744962],
        [-0.17840238,  0.6700836 ,  0.04809683, -0.6724838 , -0.25419044]],
       dtype=float32),
 array([0., 0., 0., 0., 0.], dtype=float32)]

In [47]:
a,b,c = model.layers[1].get_weights()
print(a.shape, b.shape, c.shape)

(3, 5) (5, 5) (5,)


In [48]:
Wx, Wh, bh = model.layers[1].get_weights()
Wo, bo = model.layers[2].get_weights()

In [49]:
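Yhats = np.zeros((N, T, K))  # where we store the outputs (I swear I came up with storing it this way myself)

for n in range(N):  # n rather than i, so the Input tensor i is not overwritten
    x = X[n]
    h_last = np.zeros(M)  # reset the hidden state for every sample, as Keras does by default
    for t in range(T):
        h = np.tanh(x[t].dot(Wx) + h_last.dot(Wh) + bh)
        y = h.dot(Wo) + bo

        Yhats[n, t] = y

        h_last = h

print(Yhats[:, -1])
# YAY, IT WORKED
# note: the output printed below comes from a run where h_last was initialized only once,
# before the loops, so the hidden state leaked from one sample into the next; that is why
# rows 2-4 differ from model.predict in the third decimal place while row 1 matches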

[[ 1.57819971 -1.45724568]
 [ 0.48013484 -0.2339073 ]
 [ 1.66979636 -1.57084694]
 [ 0.96161535 -0.76916654]]
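
The per-sample loop can also be replaced by a batched version that carries one hidden-state row per sample. This is a sketch with random stand-in weights (an assumption; in the notebook they would come from `model.layers[1].get_weights()` and `model.layers[2].get_weights()`), and it checks the batched result against a loop that resets the hidden state for each sample.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, D, M, K = 4, 10, 3, 5, 2

X = rng.standard_normal((N, T, D))
# Random stand-ins for the Keras weights (assumption).
Wx, Wh, bh = rng.standard_normal((D, M)), rng.standard_normal((M, M)), np.zeros(M)
Wo, bo = rng.standard_normal((M, K)), np.zeros(K)

# Per-sample loop, resetting the hidden state for every sample.
loop_out = np.zeros((N, K))
for n in range(N):
    h = np.zeros(M)
    for t in range(T):
        h = np.tanh(X[n, t] @ Wx + h @ Wh + bh)
    loop_out[n] = h @ Wo + bo

# Batched version: H holds one hidden state per sample, shape (N, M).
H = np.zeros((N, M))
for t in range(T):
    H = np.tanh(X[:, t] @ Wx + H @ Wh + bh)
batch_out = H @ Wo + bo

print(batch_out.shape)                   # (4, 2)
print(np.allclose(loop_out, batch_out))  # True
```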
