### Neural Network Training: TensorFlow Implementation

Given a training set of (x,y) examples, how do you train the NN?

In [None]:
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation="sigmoid"),
    Dense(units=15, activation="sigmoid"),
    Dense(units=1, activation="sigmoid"),
])

from tensorflow.keras.losses import BinaryCrossentropy

model.compile(loss=BinaryCrossentropy())  # Specify the loss function
model.fit(X, Y, epochs=100)  # Train for 100 epochs, i.e. 100 passes over the training set

Therefore:
1. Specify the model, which tells TensorFlow how to compute the output for inference (e.g. with *Sequential*)
2. Compile the model using a specific loss function
3. Train the model!

### Training Details

Model training steps:
1. Specify how to compute the output given the input $\vec{x}$ and the parameters $\vec{w}$ and $b$ (define the model)
2. Specify the loss (for logistic regression: $L(f_{\vec{w},b}(\vec{x}^i),y^i) = -y^i\log(f_{\vec{w},b}(\vec{x}^i)) - (1-y^i)\log(1-f_{\vec{w},b}(\vec{x}^i))$)
3. Specify the cost function (for logistic regression: $J(\vec{w},b) = -\frac{1}{m}\sum_{i=1}^{m}[y^i\log(f_{\vec{w},b}(\vec{x}^i)) + (1-y^i)\log(1-f_{\vec{w},b}(\vec{x}^i))]$)
4. Train the model on the data to minimize the cost function: **using gradient descent**

**Remember**: The *loss function* is computed for a single example, and the *cost function* is the average of the loss over the entire training set
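To make the loss/cost distinction concrete, here is a minimal sketch in plain Python. The predictions and labels are made-up values for illustration only, and the function names `loss` and `cost` are hypothetical helpers, not part of any library.

```python
import math

def loss(f, y):
    """Logistic (binary cross-entropy) loss for ONE example.
    f is the model output f(x) in (0, 1); y is the label (0 or 1)."""
    return -y * math.log(f) - (1 - y) * math.log(1 - f)

def cost(predictions, labels):
    """Cost J: the average of the per-example losses over the training set."""
    m = len(labels)
    return sum(loss(f, y) for f, y in zip(predictions, labels)) / m

# Hypothetical model outputs and true labels for m = 3 examples
predictions = [0.9, 0.2, 0.7]
labels = [1, 0, 1]

per_example = [loss(f, y) for f, y in zip(predictions, labels)]
J = cost(predictions, labels)
print("loss per example:", per_example)
print("cost J:", J)
```

Each entry of `per_example` is a loss; `J` is the single number that training tries to minimize.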

Now, how these steps map to training a NN:
1. *model = Sequential([ Dense(...), Dense(...),...])*
    * Specifies the entire architecture of the NN layer by layer
2. *model.compile(loss=BinaryCrossentropy())* specifies the loss; averaging this loss over the training set gives you the cost!
    * *BinaryCrossentropy* is referring to the exact loss function described above that is used for logistic regression
    * For regression you might use *MeanSquaredError()* loss 
3. *model.fit(X,y, epochs=100)*
    * Inside *fit*, TensorFlow computes the derivatives needed for gradient descent using **backpropagation**

### Alternatives to the Sigmoid function

Recall the demand prediction example for t-shirts. Using a sigmoid activation for the awareness neuron treats awareness as essentially binary: people are either aware or not. **In reality, the degree to which buyers are aware may not be binary; there are degrees of awareness.**

To let the activation take on any non-negative value, we can swap in a different function:  
$$g(z) = \max(0,z)$$

This says that if $z<0$ then $g(z) = 0$, and if $z \geq 0$ then $g(z) = z$

This function is called **ReLU** and it stands for **Rectified Linear Unit**

The **Linear activation function** is another popular choice:
$$g(z)=z$$
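The three activation functions discussed so far can be written in a few lines of plain Python. This is only an illustrative sketch; the input values are arbitrary.

```python
import math

def sigmoid(z):
    # Squashes any z into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    # g(z) = max(0, z): zero for negative z, unchanged otherwise
    return max(0.0, z)

def linear(z):
    # g(z) = z: the input passes through unchanged
    return z

for z in [-2.0, 0.0, 3.0]:
    print(f"z={z:5.1f}  sigmoid={sigmoid(z):.4f}  relu={relu(z):.1f}  linear={linear(z):.1f}")
```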

### Choosing the activation function

It turns out that, depending on the target label, there is usually a fairly straightforward choice of activation function.

**Output Layer**
* If you are working on a binary classification problem, the *output layer* should use the *sigmoid* activation function, since $y$ is either 1 or 0 and the network then outputs the probability that $y=1$
* If you are trying to predict how a stock price will change, the *output layer* would most likely use the *Linear* activation function, because the output can be positive or negative
* However, if you are trying to predict the price of a house, which can never be negative, the *output layer* should use the *ReLU* activation function

**Hidden Layers**
* *ReLU* is by far the most common choice for hidden layers
* The *sigmoid* function is hardly ever used in hidden layers. The one place it is still used is the *output layer* of a binary classification problem
* One reason is that *ReLU* is faster to compute. Another is that *sigmoid* goes flat in two places (for very negative and very positive $z$) while *ReLU* goes flat in only one (for $z<0$). Where the activation is flat, the gradients are small and gradient descent is slow
* As a result, NNs that use *ReLU* tend to learn faster

### Why do we need activation functions

What if we only used Linear activation functions? Then the model would just become a big *Linear regression* model and would not be able to fit to anything more complex.

Alternatively, if we used the *Linear activation function* for the hidden layers and a *sigmoid activation function* for the output layer, it can be shown that the model becomes equivalent to **Logistic Regression**.

Therefore, the common rule is: do not use the *Linear activation function* for hidden layers; use *ReLU* instead.
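The collapse can be checked numerically. The sketch below uses a network with a single unit per layer and made-up scalar parameters; expanding $w^{[2]}(w^{[1]}x + b^{[1]}) + b^{[2]}$ gives one linear function $wx + b$.

```python
import math

# Hypothetical scalar parameters for a 2-layer network (one unit per layer)
w1, b1 = 2.0, 0.5   # hidden layer
w2, b2 = 3.0, 1.0   # output layer

def two_linear_layers(x):
    a1 = w1 * x + b1        # hidden layer with linear activation g(z) = z
    return w2 * a1 + b2     # output layer with linear activation

# Expanding w2*(w1*x + b1) + b2 gives a single linear function w*x + b
w = w2 * w1
b = w2 * b1 + b2

def single_linear_model(x):
    return w * x + b

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for x in [-2.0, 0.0, 1.5]:
    # Linear hidden layer + linear output behaves like linear regression
    assert abs(two_linear_layers(x) - single_linear_model(x)) < 1e-12
    # Linear hidden layer + sigmoid output behaves like logistic regression
    assert abs(sigmoid(two_linear_layers(x)) - sigmoid(single_linear_model(x))) < 1e-12

print("equivalent single-layer parameters: w =", w, " b =", b)
```

The same argument extends to vectors and matrices, since a product of weight matrices is again a single matrix.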

### ReLU activation function

* A function built from *ReLU* units is composed of different linear pieces (it is piecewise linear).
* The slope stays constant within each linear piece and then changes abruptly at the transition points.
* At every transition point the contribution of another unit is added.
* The *ReLU* can 'deactivate' units (output 0) so that they do not interfere, and then turn them on once the input reaches the portion they are meant to fit
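A small sketch of this behaviour: three hypothetical ReLU units with made-up switch-on points at 0, 1 and 2 are summed. Each unit contributes nothing until its input becomes positive, so the slope of the sum increases by one at every transition point.

```python
def relu(z):
    return max(0.0, z)

def f(x):
    # Three hypothetical units; each one "switches on" at a different point
    return relu(x) + relu(x - 1.0) + relu(x - 2.0)

def slope(x, eps=1e-6):
    # Numerical estimate of the slope of f at x
    return (f(x + eps) - f(x)) / eps

for x in [-0.5, 0.5, 1.5, 2.5]:
    print(f"x={x:4.1f}  f(x)={f(x):.2f}  slope~{round(slope(x))}")
```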

### Multiclass Classification

Multiclass classification means there are more than two possible output labels.  

For example, instead of trying to recognize only 1 or 0, you may want to recognize all the digits in a zip code. It is still classification, because $y$ can only take a limited number of values.

In previous examples the classification problems had "groups" of x's and o's, but now imagine there are several "groups" that all can be represented with different shapes like rectangles, triangles etc.

Now you want to predict the chance that y=1(o's), y=2(x's), y=3(rectangles), or y=4(triangles).

**Now you need a model whose decision boundaries split the data into four different categories; the algorithm that does this is called *softmax regression***

### Softmax Regression algorithm

*It is a generalization of logistic regression, which is a binary classification algorithm, to the multiclass classification context.*  

Take ordinary logistic regression, where $a_1 = g(z)$, so that $a_1$ is some number between 0 and 1. If it is estimating the probability of the label being 1 and the output is 0.71, then there is a 71% chance that it is 1.  

But it is also implicitly calculating the probability of the label being 0, which is the remainder. **Therefore logistic regression calculates two values: the probability of it being a 1 (71%), and the probability of it being a 0 (29%)**

This works for 2 possible output values. Now let's look at Softmax regression where there are 4 possible output values:

It calculates:  
$z_1 = \vec{w}_1\cdot\vec{x} + b_1$  for y = 1  
$z_2 = \vec{w}_2\cdot\vec{x} + b_2$  for y = 2  
$z_3 = \vec{w}_3\cdot\vec{x} + b_3$  for y = 3  
$z_4 = \vec{w}_4\cdot\vec{x} + b_4$  for y = 4  



Then it computes $a_1$ as follows:
$$a_1 = \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}+e^{z_4}}$$

Here $a_1$ is the estimate of $P(y=1)$ given the input features $\vec{x}$.

Similarly:  
$$a_2 = \frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}+e^{z_4}}$$

$$a_3 = \frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}+e^{z_4}}$$

$$a_4 = \frac{e^{z_4}}{e^{z_1}+e^{z_2}+e^{z_3}+e^{z_4}}$$

Given the output values:  
$a_1 = 0.30$  
$a_2 = 0.20$  
$a_3 = 0.15$  

Then $a_4$ must be equal to $(1-0.3-0.2-0.15)=0.35$

In the general case, where $y$ can take the values $1,2,3,...,N$:
$$z_j = \vec{w}_j\cdot\vec{x} + b_j$$

$$a_j = \frac{e^{z_j}}{\sum_{k=1}^Ne^{z_k}}$$

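The loss function for Softmax regression is the following:

$$loss(a_1, a_2,...,a_N,y) = \begin{cases} -\log a_1 & \text{if } y=1 \\ -\log a_2 & \text{if } y=2 \\ \quad\vdots \\ -\log a_N & \text{if } y=N \end{cases}$$

This is the *cross-entropy* loss: only the $a_j$ of the true class $y=j$ contributes to the loss.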
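The formulas for $a_j$ and for the loss can be sketched directly in plain Python. The $z$ values below are made up for illustration, and the function names are hypothetical helpers.

```python
import math

def softmax(z):
    """a_j = exp(z_j) / sum_k exp(z_k)"""
    ez = [math.exp(v) for v in z]
    total = sum(ez)
    return [e / total for e in ez]

def cross_entropy_loss(a, y):
    """loss = -log(a_y); y is the true class, numbered 1..N as in the text."""
    return -math.log(a[y - 1])

z = [1.0, 2.0, 3.0, 4.0]    # hypothetical values of z_1..z_4
a = softmax(z)
print("a =", a, " sum =", sum(a))

# The loss is small when the true class has a high estimated probability
for y in (1, 4):
    print(f"true class y={y}: loss = {cross_entropy_loss(a, y):.4f}")
```

The activations always sum to 1, and the loss is largest when the true class was given a low probability.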

### NN with Softmax output

In order to build a NN to do *multiclass classification* we need to take the *Softmax* model and put it on the output layer of the NN.

E.g., if you have to classify 10 classes, then the output layer will be a *Softmax* activation layer with 10 outputs. Now it can give you the probability of any of the 10 classes.

Different from the sigmoid or ReLU, $a_1$ is not only a function of $z_1$ (e.g. $a_1 = g(z_1)$) but now it is a function of all the '$z$' values, as seen from the formula:  
$$a_1 = \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}+e^{z_4}}$$

**TensorFlow implementation**

In [None]:
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy, BinaryCrossentropy

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='softmax')
])

model.compile(loss=SparseCategoricalCrossentropy())

model.fit(X,Y,epochs=100)

### Improved Implementation of Softmax

Depending on how you calculate the losses, the computer will have different round off errors, e.g.

In [1]:
x1 = 2/10000
print(f"{x1:.18f}")

0.000200000000000000


In [2]:
x2 = 1 + (1/10000)-(1-1/10000)
print(f"{x2:.18f}")

0.000199999999999978


There is a better way of computing the loss that has smaller rounding errors and therefore gives more accurate results.  

Computing intermediate values introduces extra rounding errors. E.g., in logistic regression, if we first calculate $a$ separately and then plug it into the loss function, rounding errors accumulate. **INSTEAD** it is better to substitute the formula for $a$ directly into the loss function and let the computer evaluate the whole expression at once.  
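The effect can be seen in a small plain-Python sketch. `loss_two_step` computes $a$ first and then the loss; `loss_from_logit` uses a standard algebraic rearrangement of the same loss written directly in terms of $z$. Both function names are hypothetical, and the rearranged formula is shown only to illustrate the idea, not as a description of what any library does internally.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_two_step(z, y):
    # Step 1: compute the intermediate value a. Step 2: plug a into the loss.
    a = sigmoid(z)
    return -y * math.log(a) - (1 - y) * math.log(1 - a)

def loss_from_logit(z, y):
    # Same loss with a = sigmoid(z) substituted in and simplified algebraically
    return max(z, 0.0) - y * z + math.log1p(math.exp(-abs(z)))

# For moderate z the two versions agree
print(loss_two_step(2.0, 1), loss_from_logit(2.0, 1))

# For large z, a rounds to exactly 1.0, so log(1 - a) is log(0)
try:
    print(loss_two_step(40.0, 0))
except ValueError as err:
    print("two-step version failed:", err)
print(loss_from_logit(40.0, 0))
```

The two-step version loses all information once $a$ rounds to 1.0, while the direct version still returns a loss close to 40.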

In TensorFlow, this looks like:



In [None]:
model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=1, activation='linear')  # Previously it was 'sigmoid'
])

model.compile(loss=BinaryCrossentropy(from_logits=True))  # from_logits=True tells the loss to expect z, so the sigmoid is folded into the loss and computed more accurately

model.fit(X, Y, epochs=100)

# Take the output value and map it through the logistic (sigmoid) function to get the probability

logit = model(X)
f_x = tf.nn.sigmoid(logit)

Now let's apply this same logic to the *Softmax* regression.

In [None]:
model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='linear') #Previously it was 'softmax'
])

model.compile(loss=SparseCategoricalCrossentropy(from_logits=True)) #Set again to true

model.fit(X,Y,epochs=100)

# The model now outputs z1,...,z10 instead of a1,...,a10, so apply softmax to get probabilities
logits = model(X)
f_x = tf.nn.softmax(logits)

### Classification with multiple outputs (Multi-label classification problem)

This is the case where each input (e.g. an image) can have multiple labels associated with it.  

E.g., in a self-driving example, a single picture can contain pedestrians, buses and/or cars. A single input (image) can therefore have 3 (or more!) labels. 

This also means that the target output $y$ will be a vector of many numbers (3 in this case).

**How to build a NN for this?**
* Possibly treat it as 3 different problems and build a NN for each 'label', e.g. the first NN detects cars, the second detects buses and the third detects pedestrians
* Another way is to train a single NN whose output layer has 3 units, so that the output is a vector of 3 numbers
    * Each output unit can have a *sigmoid* activation function to predict the probability of its label
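
A minimal sketch of such an output layer in plain Python. The $z$ values and the 0.5 decision threshold are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical output-layer z values for one image, one per label
logits = {"car": 2.0, "bus": -1.0, "pedestrian": 0.5}

# Each output unit applies its own sigmoid, independently of the others
probabilities = {label: sigmoid(z) for label, z in logits.items()}

# Unlike softmax, these probabilities do not need to sum to 1:
# several labels can be present in the same image
detected = [label for label, p in probabilities.items() if p >= 0.5]

print(probabilities)
print("detected:", detected)
```

This is the key difference from multiclass classification, where exactly one class is chosen per input.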

### Softmax Lab

In [7]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

def my_softmax(z):
    ez = np.exp(z)
    sm = ez/np.sum(ez)
    return(sm)


In [8]:
centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]
X_train, y_train = make_blobs(n_samples=2000, centers=centers, cluster_std=1.0, random_state=30)

In [9]:
#Obvious organization

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=4, activation='softmax')
])

model.compile(loss=SparseCategoricalCrossentropy(),
              optimizer = tf.optimizers.Adam(0.001),
)

model.fit(
    X_train, y_train,
    epochs=10
)

p_nonpreferred = model.predict(X_train)
print(p_nonpreferred[:2])
print("largest value", np.max(p_nonpreferred), "smallest value", np.min(p_nonpreferred))

Epoch 1/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - loss: 0.8978  
Epoch 2/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 975us/step - loss: 0.3992
Epoch 3/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - loss: 0.1652  
Epoch 4/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 795us/step - loss: 0.0922
Epoch 5/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 954us/step - loss: 0.0679
Epoch 6/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 854us/step - loss: 0.0562
Epoch 7/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 844us/step - loss: 0.0498
Epoch 8/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 886us/step - loss: 0.0450
Epoch 9/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 866us/step - loss: 0.0417
Epoch 10/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 850us/step - lo

In [11]:
#Now the preferred way of doing it

preferred_model = Sequential([
    Dense(25, activation='relu'),
    Dense(15, activation='relu'),
    Dense(4, activation='linear')
])

preferred_model.compile(
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(0.001),
)

preferred_model.fit(
    X_train,y_train,
    epochs=10
)

p_preferred = preferred_model.predict(X_train)
print(f"two example output vectors:\n {p_preferred[:2]}")
print("largest value: ",np.max(p_preferred), "smallest value: ", np.min(p_preferred))

Epoch 1/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 850us/step - loss: 1.0921 
Epoch 2/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 824us/step - loss: 0.4360
Epoch 3/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 780us/step - loss: 0.1970
Epoch 4/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 793us/step - loss: 0.1184
Epoch 5/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 790us/step - loss: 0.0874
Epoch 6/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 775us/step - loss: 0.0721
Epoch 7/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 782us/step - loss: 0.0632
Epoch 8/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 806us/step - loss: 0.0574
Epoch 9/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 839us/step - loss: 0.0536
Epoch 10/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 846us/step - l

In [12]:
# The output predictions are not yet probabilities!

sm_preferred = tf.nn.softmax(p_preferred).numpy()
print(f"two example output vectors:\n {sm_preferred[:2]}")
print("largest value: ",np.max(sm_preferred), "smallest value", np.min(sm_preferred))

two example output vectors:
 [[2.0704423e-03 1.7401201e-03 9.5333314e-01 4.2856321e-02]
 [9.8938560e-01 9.4401883e-03 9.1024896e-04 2.6394229e-04]]
largest value:  0.9999908 smallest value 7.082175e-11


In [13]:
#To find the most likely category, the softmax is not required. You can find the index of the largest output using np.argmax()
for i in range(5):
    print(f"{p_preferred[i]}, category: {np.argmax(p_preferred[i])}")

[-2.8254275  -2.999236    3.3067744   0.20466296], category: 2
[ 5.188727    0.53661877 -1.8023942  -3.0403824 ], category: 0
[ 3.8686466  0.6968279 -1.4726437 -2.415347 ], category: 0
[-1.4387716   4.5538263  -3.151655   -0.32492137], category: 1
[-0.11418828 -4.5038643   5.19038    -1.9442217 ], category: 2


### Multi-class Classification Lab

Imagine an example of taking in a photo and having to classify subjects in the photo as {dog, cat, horse, other}

In [14]:
# This uses Scikit learn to create a training set with 4 categories
classes = 4
m = 100
centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]
std = 1.0
X_train, y_train = make_blobs(n_samples=m, centers=centers, cluster_std=std, random_state=30)

In [15]:
#show unique classes in data set
print(f"unique classes {np.unique(y_train)}")
#show how classes are represented
print(f"class representation {y_train[:10]}")
#show shapes of the dataset
print(f"shape of the X_train: {X_train.shape}, shape of the y_train: {y_train.shape}")


unique classes [0 1 2 3]
class representation [3 3 3 0 3 3 3 3 2 0]
shape of the X_train: (100, 2), shape of the y_train: (100,)


In [17]:
tf.random.set_seed(1234) #applied to achieve consistent results
model = Sequential([
    Dense(2, activation='relu', name="L1"),
    Dense(4, activation='linear', name="L2")
])

model.compile(
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer = tf.keras.optimizers.Adam(0.01),
)

model.fit(
    X_train, y_train,
    epochs = 200
)

Epoch 1/200
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - loss: 1.7661  
Epoch 2/200
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - loss: 1.6083 
Epoch 3/200
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - loss: 1.4886 
Epoch 4/200
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - loss: 1.3901 
Epoch 5/200
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - loss: 1.3091 
Epoch 6/200
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step - loss: 1.2421 
Epoch 7/200
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - loss: 1.1872 
Epoch 8/200
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - loss: 1.1424 
Epoch 9/200
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - loss: 1.1055 
Epoch 10/200
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - loss: 1.0748 
Epoch 11

<keras.src.callbacks.history.History at 0x2457d4d28b0>

In [18]:
#Gathering the trained parameters from the first layer
l1 = model.get_layer("L1")
W1,b1 = l1.get_weights()

### Advanced optimization

Gradient descent is an optimization algorithm that is widely used throughout ML.

Remember that gradient descent is:
$$w_j = w_j - \alpha\frac{\partial}{\partial w_j}J(\vec{w},b)$$

With a fixed learning rate $\alpha$, gradient descent takes small steps towards the minimum of the cost function. But what if an algorithm could adjust the learning rate automatically and take bigger steps when that is safe?  

This algorithm is called the ***Adam algorithm***. It adapts the *learning rate* during training: if the learning rate is too big, it makes it smaller, and if it is too small, it makes it bigger.

Adam - *Adaptive Moment estimation*

Adam also uses a different *learning rate* for every parameter $w_1, w_2, w_3, ..., b$

**The logic behind the algorithm is**:
* if the parameter keeps moving in the same direction > Then increase the learning rate and,
* if the parameter keeps oscillating > Then reduce the learning rate
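
The notes above describe only the intuition. For reference, the sketch below applies the commonly published Adam update rule to a single parameter on a made-up toy cost $J(w) = (w-3)^2$. The hyperparameter values (`alpha`, `beta1`, `beta2`, `eps`) are commonly used defaults chosen for this illustration, and library implementations may differ in details.

```python
import math

def grad(w):
    # Gradient of the toy cost J(w) = (w - 3)^2, which is minimized at w = 3
    return 2.0 * (w - 3.0)

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
w, m, v = 0.0, 0.0, 0.0

for t in range(1, 1001):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g        # running average of the gradients
    v = beta2 * v + (1 - beta2) * g * g    # running average of the squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    w -= alpha * m_hat / (math.sqrt(v_hat) + eps)

print("w after 1000 steps:", w)
```

When the gradients keep the same sign, `m_hat` stays large relative to `sqrt(v_hat)` and the effective step is big; when they oscillate, the positive and negative gradients cancel in `m_hat` and the effective step shrinks.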


### Alternative Layer Types (Convolutional NNs)

**All the layers explored so far were *Dense* layers, which means that every neuron in a layer gets all the activations from the previous layer**.

Using this type you can still build very powerful NNs.  

However, there are NNs with other properties. Another type is called the ***Convolutional*** Layer.

How does a **Convolutional layer** work (image example):
* Rather than letting each neuron look at the entire image, each neuron only looks at a single part of the image
* The second neuron also looks only at the pixels in a limited region, a different one from the first neuron's
* Why do we do this?
    * Speeds up the computation
    * Needs less training data
    * Less prone to overfitting

How does the **CNN** work:
* Given an EKG signal, i.e. a time series of, say, $x_1$ to $x_{100}$
* Each neuron in the *Convolutional* layer only gets part of the time series
    * e.g., neuron one gets $x_1$ to $x_{20}$, neuron two gets $x_{11}$ to $x_{30}$, neuron three gets $x_{21}$ to $x_{40}$ and so on
    * Say this layer has 9 units
* The next layer (also *Convolutional*) has, say, 3 units
    * In this layer the neurons also only get to see part of the activations produced by the first layer
    * e.g., neuron one gets $a_1$ to $a_{5}$, neuron two gets $a_{3}$ to $a_{7}$, and so on
* Finally, the output layer is a sigmoid unit that takes all the activations of the last hidden layer and computes the probability  

Here there are quite some **architectural** choices:
* How big is the *window* of inputs each neuron can look at?
* How many neurons per layer?  

With this you can build more effective NNs
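
The windows in the EKG example can be generated with a few lines of plain Python. The helper `windows` and the step sizes between neighbouring windows (10 in the first layer, 2 in the second) are assumptions inferred from the index ranges quoted above.

```python
def windows(n_inputs, size, stride):
    """Return the 1-based (first, last) input indices seen by each neuron."""
    return [(s, s + size - 1) for s in range(1, n_inputs - size + 2, stride)]

# First convolutional layer: 100 inputs, each neuron sees 20 of them
layer1 = windows(n_inputs=100, size=20, stride=10)

# Second convolutional layer: its inputs are the activations of layer 1
layer2 = windows(n_inputs=len(layer1), size=5, stride=2)

print(len(layer1), "units in layer 1:", layer1)
print(len(layer2), "units in layer 2:", layer2)
```

With these choices the first layer ends up with 9 units and the second with 3, matching the example. Changing the window size or the step changes the number of neurons per layer.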


### Computation Graph and Backpropagation on a Large NN

Take a NN with input $\vec{x}$ and two nodes (one per layer) with activations $\vec{a^{[1]}}$ and $\vec{a^{[2]}}$, both using the *ReLU* activation function:



Given the values $w^{[1]} = 2$, $b^{[1]} = 0$, $w^{[2]} = 3$, $b^{[2]} = 1$, $x = 1$ and $y = 5$  
And $g(z) = max(0,z)$

Now we have that:  
$\vec{a^{[1]}} = g(w^{[1]}*\vec{x} + b^{[1]})$ and  
$\vec{a^{[2]}} = g(w^{[2]}*\vec{a^{[1]}} + b^{[2]})$, finally  
$J(w,b) = \frac{1}{2}(\vec{a^{[2]}} - y)^2$

The computation graph would look like (step by step calculation):  

1. $t^{[1]} = w^{[1]}*x = 2$
2. $z^{[1]} = t^{[1]} + b^{[1]} = 2$
3. $a^{[1]} = g(z^{[1]}) = 2$
4. $t^{[2]} = w^{[2]}*a^{[1]} = 6$
5. $z^{[2]} = t^{[2]} + b^{[2]} = 7$
6. $a^{[2]} = g(z^{[2]}) = 7$
7. $J = \frac{1}{2}(a^{[2]}-y)^2 = 2$

Now for the **Backprop**:

You have to find the derivative at each node, starting with the derivative of $J$ w.r.t. ${a^{[2]}}$ and working your way back to $w^{[1]}$. To do this you need the **chain rule** from calculus. This is a very efficient way of calculating the derivatives: for a NN with N nodes and P parameters it takes roughly N+P steps, rather than the roughly N×P steps it would take to compute each derivative separately.
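
As a sketch, the backward pass for this small network can be written out by hand in plain Python, using the values given above. The variable names are illustrative, and the derivative of ReLU at exactly $z=0$ is taken to be 0 here by convention.

```python
def g(z):
    return max(0.0, z)

def g_prime(z):
    # Derivative of ReLU (taken as 0 at z = 0 by convention here)
    return 1.0 if z > 0 else 0.0

w1, b1, w2, b2, x, y = 2.0, 0.0, 3.0, 1.0, 1.0, 5.0

# Forward pass (left to right through the computation graph)
z1 = w1 * x + b1
a1 = g(z1)
z2 = w2 * a1 + b2
a2 = g(z2)
J = 0.5 * (a2 - y) ** 2

# Backward pass (right to left, reusing each derivative via the chain rule)
dJ_da2 = a2 - y
dJ_dz2 = dJ_da2 * g_prime(z2)
dJ_dw2 = dJ_dz2 * a1
dJ_db2 = dJ_dz2
dJ_da1 = dJ_dz2 * w2
dJ_dz1 = dJ_da1 * g_prime(z1)
dJ_dw1 = dJ_dz1 * x
dJ_db1 = dJ_dz1

def cost(w1_value):
    # Recompute J for a slightly different w1 (numerical check)
    a1_value = g(w1_value * x + b1)
    return 0.5 * (g(w2 * a1_value + b2) - y) ** 2

eps = 0.001
numeric_dJ_dw1 = (cost(w1 + eps) - cost(w1)) / eps

print("J =", J)
print("dJ/dw2 =", dJ_dw2, " dJ/db2 =", dJ_db2)
print("dJ/dw1 =", dJ_dw1, " dJ/db1 =", dJ_db1)
print("numerical estimate of dJ/dw1:", numeric_dJ_dw1)
```

Each intermediate derivative (for example `dJ_dz2`) is computed once and then reused by everything to its left in the graph, which is where the efficiency comes from.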

In [19]:
import sympy 
from sympy import symbols, diff

In [20]:
# Let's compare the symbolic derivative with the arithmetic calculation
J,w = symbols('J,w')
J = w**3
J

w**3

In [21]:
dJ_dw = diff(J, w)  # differentiate J with respect to w
dJ_dw

3*w**2

In [22]:
dJ_dw.subs([(w,2)]) #derivative at the point of w=2

12

In [23]:
#Now for the arithmetic calculation:
J = (2)**3
J_epsilon = (2+0.001)**3
k = (J_epsilon-J)/0.001
print(f"J = {J}, J_epsilon = {J_epsilon}, dJ_dw~= k = {k}")

J = 8, J_epsilon = 8.012006000999998, dJ_dw~= k = 12.006000999997823
