# Binary Cross Entropy Discussion

```
tf.nn.sigmoid_cross_entropy_with_logits(
    labels=None, logits=None, name=None
)
```

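This function exists because the usual pipeline first passes the raw value through a sigmoid to get a probability and then feeds that probability into BCE to get the loss.
Since this function takes logits as input, you do not need to apply the sigmoid yourself.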

When tf.keras.losses.BinaryCrossentropy is created with from_logits = True,
it calls tf.nn.sigmoid_cross_entropy_with_logits directly.

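With from_logits = False, the sigmoid output is first converted back into logits and then passed to tf.nn.sigmoid_cross_entropy_with_logits anyway, which is a pointless round trip! In the Keras backend code quoted below, the conversion looks like this: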

```
_epsilon = _to_tensor(epsilon(), output.dtype.base_dtype)
output = tf.clip_by_value(output, _epsilon, 1 - _epsilon)
output = tf.log(output / (1 - output))
```
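To see what this round trip costs, here is a minimal pure-Python sketch of the same clip-then-log step. It assumes the default Keras epsilon of 1e-7, and the helper names `sigmoid` and `prob_to_logit` are made up for illustration. Moderate logits come back almost exactly, but anything beyond roughly ±16.1 is flattened by the clipping.

```python
import math

EPSILON = 1e-7  # assumed: the default value of tf.keras.backend.epsilon()

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def prob_to_logit(p, eps=EPSILON):
    # Mirror of the quoted snippet: clip into [eps, 1 - eps], then invert the sigmoid.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

moderate = prob_to_logit(sigmoid(2.0))        # recovered almost exactly
extreme = prob_to_logit(sigmoid(24.281986))   # capped near log(1e7) by the clip
print(moderate)
print(extreme)
```

Passing logits directly with from_logits = True skips both the clipping and the extra log.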

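For this reason, it is better to use tf.keras.losses.BinaryCrossentropy with from_logits = True.
In addition, tf.nn.sigmoid_cross_entropy_with_logits is numerically stable, for the following reason (x is the logit and z is the label):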

```
z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x))    (this is the BCE formula)
Simplifying:
= z * -log(1 / (1 + exp(-x))) + (1 - z) * -log(exp(-x) / (1 + exp(-x)))
= z * log(1 + exp(-x)) + (1 - z) * (-log(exp(-x)) + log(1 + exp(-x)))
= z * log(1 + exp(-x)) + (1 - z) * (x + log(1 + exp(-x)))
= (1 - z) * x + log(1 + exp(-x))
= x - x * z + log(1 + exp(-x))
```
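The simplification can be checked numerically. The sketch below compares the textbook definition with the last line of the derivation on a few moderate logits, where neither form has overflow problems. The function names are illustrative only.

```python
import math

def bce_definition(x, z):
    # Textbook BCE applied to sigmoid(x); only safe for moderate x.
    p = 1.0 / (1.0 + math.exp(-x))
    return -z * math.log(p) - (1 - z) * math.log(1 - p)

def bce_simplified(x, z):
    # Last line of the derivation: x - x * z + log(1 + exp(-x)).
    return x - x * z + math.log(1 + math.exp(-x))

# Largest disagreement over a small grid of logits and both labels.
max_gap = max(
    abs(bce_definition(x, z) - bce_simplified(x, z))
    for x in (-5.0, -0.5, 0.0, 0.5, 5.0)
    for z in (0, 1)
)
print(max_gap)
```

The gap should be at the level of floating-point rounding, which confirms that the two expressions are the same function.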

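For x < 0, exp(-x) can overflow, so the expression above is rewritten as in the first block below. Combining that with the x >= 0 case gives the single form in the second block, which is stable for any x and avoids overflow: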

```
 x - x * z + log(1 + exp(-x))
= log(exp(x)) - x * z + log(1 + exp(-x))
= - x * z + log(1 + exp(x))
```

```
max(x, 0) - x * z + log(1 + exp(-abs(x)))
```
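The experiment below shows why this matters: math.exp overflows when x is too large, so form a fails for large positive x, while the combined form b does not.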
```
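import math

def a(x, z):
    # Form derived for x < 0; exp(x) overflows for large positive x.
    return -x * z + math.log(1 + math.exp(x))

def b(x, z):
    # Combined form: exp() only ever sees a non-positive argument.
    return max(x, 0) - x * z + math.log(1 + math.exp(-abs(x)))

# z is the label, so it should be 0 or 1.
try:
    print(a(1000, 1))
except OverflowError as err:
    print("a overflows:", err)

print(b(1000000, 1))  # 0.0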
```
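As a further cross-check, the sketch below confirms that the combined form agrees with whichever branch of the derivation is safe for a given x, and that it stays finite at extreme logits. It uses `math.log1p` instead of `log(1 + ...)`, which is a small precision tweak of my own and not something the formula above requires. The function names are illustrative.

```python
import math

def stable_bce(x, z):
    # max(x, 0) - x * z + log(1 + exp(-|x|)); log1p keeps precision when exp(-|x|) is tiny.
    return max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))

def branch_form(x, z):
    # Pick the derivation branch whose exp() argument is non-positive.
    if x >= 0:
        return x - x * z + math.log(1 + math.exp(-x))
    return -x * z + math.log(1 + math.exp(x))

grid = [(x, z) for x in (-30.0, -2.0, 0.0, 2.0, 30.0) for z in (0, 1)]
max_gap = max(abs(stable_bce(x, z) - branch_form(x, z)) for x, z in grid)

# Extreme logits: no overflow, and the loss is ~0 when the sign of x matches the label.
extremes = [stable_bce(x, z) for x in (-1e6, 1e6) for z in (0, 1)]
print(max_gap, extremes)
```

A confident and correct prediction gives a loss of 0, and a confident and wrong one gives a loss of about |x|.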
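To get probabilities, apply a sigmoid to the predicted values yourself.
The sigmoid squeezes each value into the range 0 to 1: very large values approach 1 and very negative values approach 0.

The result can be read as the probability of label 1.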
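A minimal pure-Python sketch of that conversion follows. The `sigmoid` helper is illustrative and branches on the sign of x so that exp() never receives a large positive argument. The sample logits are taken from the model.predict output shown later in this note.

```python
import math

def sigmoid(x):
    # Evaluate exp() only at non-positive arguments so it cannot overflow.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

logits = [9.66855, -6.414172, -0.9882879]
probs = [sigmoid(v) for v in logits]  # probability of label 1 for each logit
print(probs)

# Extreme logits saturate at exactly 1.0 and 0.0 instead of raising.
print(sigmoid(1000.0), sigmoid(-1000.0))
```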

```
import logging, os, warnings

# Show only TensorFlow errors; TF_CPP_MIN_LOG_LEVEL is set before
# importing tensorflow so that it takes effect
logging.disable(logging.WARNING)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
# Suppress sklearn warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import my_trace as tc
import tensorflow as tf
from sklearn.utils import shuffle
from data_helper import XY_from_df
from sklearn.metrics import classification_report, confusion_matrix

SEED = 5
TRAIN_PATH = "../dataset/train.csv"
TARGET_NAMES = ["bad", "good"]
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Load the training data
df_train = pd.read_csv(TRAIN_PATH)
X, Y = XY_from_df(df_train)
```

## Model Structure

```
def get_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation=None),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Dropout(0.5, seed=SEED),

        tf.keras.layers.Dense(32, activation=None),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Dropout(0.5, seed=SEED),

        # No activation on the last layer: the model outputs logits
        tf.keras.layers.Dense(1, activation=None),
    ])

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.005, name='adam'),
                  loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                  metrics=['acc'])

    return model
```

## Train

```
model = get_model()
model.fit(X, Y, epochs=100, batch_size=64, verbose=0)
```


## Predict

```
model.predict(X[:10])  # raw logits, since the last Dense layer has no activation
```

```
array([[ 9.66855  ],
       [ 4.721215 ],
       [-6.414172 ],
       [-0.9882879],
       [-5.2085776],
       [ 9.643806 ],
       [ 3.7633142],
       [ 3.3025436],
       [24.281986 ],
       [-3.9070172]], dtype=float32)
```

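```
# Convert the model's logit outputs to probabilities
def predict_proba(X):
    results = []
    # Sigmoid maps each logit to the probability of label 1
    P = tf.nn.sigmoid(model.predict(X))
    for p in P.numpy():
        label_1_p = float(p[0])
        # Column 0 is P(label 0), column 1 is P(label 1)
        results.append([1 - label_1_p, label_1_p])
    return np.array(results)
```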

```
predict_proba(X[:10])
```

```
array([[6.33001328e-05, 9.99936700e-01],
       [8.82577896e-03, 9.91174221e-01],
       [9.98364504e-01, 1.63549639e-03],
       [7.28749633e-01, 2.71250367e-01],
       [9.94560305e-01, 5.43969544e-03],
       [6.48498535e-05, 9.99935150e-01],
       [2.26804018e-02, 9.77319598e-01],
       [3.54839563e-02, 9.64516044e-01],
       [0.00000000e+00, 1.00000000e+00],
       [9.80295699e-01, 1.97043009e-02]])
```
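If only the predicted class is needed, the sigmoid can be skipped: it is monotonic and sigmoid(0) = 0.5, so a probability of at least 0.5 is the same as a logit of at least 0. The sketch below is a pure-Python illustration using the first few logits from the output above. It assumes that index 0 of TARGET_NAMES corresponds to label 0 and index 1 to label 1, and both helper functions are hypothetical.

```python
import math

TARGET_NAMES = ["bad", "good"]  # assumed mapping: index 0 -> label 0, index 1 -> label 1

def label_from_logit(logit):
    # sigmoid(logit) >= 0.5 exactly when logit >= 0, so no sigmoid call is needed.
    return TARGET_NAMES[1 if logit >= 0 else 0]

def label_from_proba(logit):
    # Same decision made the long way, through the probability of label 1.
    p1 = 1.0 / (1.0 + math.exp(-logit))
    return TARGET_NAMES[1 if p1 >= 0.5 else 0]

logits = [9.66855, 4.721215, -6.414172, -0.9882879, -5.2085776]
labels = [label_from_logit(v) for v in logits]
print(labels)
```

A 0.5 cutoff is only one possible choice. A different probability threshold t corresponds to the logit threshold log(t / (1 - t)).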