# In-class exercise 7: Deep Learning 1 (Part A)
In this notebook we will see how to write efficient and numerically stable code.

In [126]:
import numpy as np
import matplotlib.pyplot as plt
import time

%matplotlib inline

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.preprocessing import minmax_scale

## Loading the data

In [128]:
X, y = load_breast_cancer(return_X_y=True)

# Scale each feature to [-1, 1] range
X = minmax_scale(X, feature_range=(-1, 1))

Check the shapes

In [130]:
# TODO
# Check the shapes of X and y
print("The shapes of X :", X.shape)  # shape of the feature matrix (n_samples, n_features)
print("The shape of y :", y.shape)  # shape of the label vector (n_samples,)

The shapes of X : (569, 30)
The shape of y : (569,)


# 1. Vectorization

## 1.1. Logistic regression (two classes)

**Setting:** Logistic regression (two classes)

**Task:** Generate predictions for the entire dataset

**Data:** $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^{n}$

**Model:** $f(x) = \sigma(w^T x + b)$

In [134]:
n_features = X.shape[1]  # 30 features per sample
w = np.random.normal(size=[n_features], scale=0.1)  # weight vector: one weight per feature, standard deviation reduced from 1 to 0.1
b = np.random.normal(size=[1])  # bias: a single value, b ~ N(0, 1)

Check the shapes

In [136]:
print("w:", w)
print("b:", b)


w: [-0.03283724 -0.00453142  0.03874803 -0.0254831   0.01703173 -0.08540781
 -0.08210018 -0.01399902 -0.03677893  0.01313341  0.05320284  0.01832433
 -0.04683557  0.06853425 -0.03517877 -0.06111264  0.01527318  0.1097886
  0.06336582  0.14272366  0.14136258  0.09721994 -0.20071091 -0.07773389
  0.17203439  0.06824441  0.03175286 -0.12386361  0.0902873   0.02730797]
b: [0.45633641]


Define the `sigmoid` function

In [138]:
def sigmoid(t):
    """Apply sigmoid to the input array."""
    # TODO
    return 1 / (1 + np.exp(-t))

Does it work for any input?

In [140]:
# input is a scalar
print(sigmoid(0))

# input is a vector
print(sigmoid(np.array([0, 1, 2])))

# input is a matrix
print(sigmoid(np.array([[0, 1, 2], [-1, -2, -3]])))

0.5
[0.5        0.73105858 0.88079708]
[[0.5        0.73105858 0.88079708]
 [0.26894142 0.11920292 0.04742587]]


This is called **broadcasting**. The smaller array is "broadcast" across the larger array so that they have compatible shapes. Numpy does this automatically. Let's see how it works. (Also see [here](https://numpy.org/doc/stable/user/basics.broadcasting.html#).)

In [142]:
# How does broadcasting work between a scalar and a vector?
# TODO
vector = np.array([1, 2, 3])
scalar = 5

# Broadcasting in action
result = vector + scalar
print(result)

[6 7 8]


In [143]:
# How does broadcasting work between a scalar and a matrix?
# TODO
# Define a scalar and a matrix
scalar = 3
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

# Perform broadcasting addition
result = matrix + scalar

# Print results
print("Matrix:\n", matrix)
print("Scalar:", scalar)
print("Result of broadcasting:\n", result)

Matrix:
 [[1 2 3]
 [4 5 6]]
Scalar: 3
Result of broadcasting:
 [[4 5 6]
 [7 8 9]]


In [144]:
# How does broadcasting work between a vector and a matrix?
# TODO
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
row_vector = np.array([10, 20, 30])  # Shape: (3,)

result1 = matrix + row_vector  # Broadcasting
print(result1)
column_vector = np.array([[10],
                          [20]])  # Shape: (2, 1)

result2 = matrix + column_vector  # Broadcasting
print(result2)

[[11 22 33]
 [14 25 36]]
[[11 12 13]
 [24 25 26]]
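
The rule behind these examples can be checked without doing any arithmetic. The following is a minimal sketch, assuming NumPy 1.20 or newer (for `np.broadcast_shapes`); the variable names are illustrative only. Shapes are aligned from the trailing dimension, and two dimensions are compatible when they are equal or one of them is 1.

```python
import numpy as np

matrix = np.zeros((2, 3))

# Shapes are compared from the trailing dimension backwards;
# two dimensions are compatible if they are equal or one of them is 1.
shape_row = np.broadcast_shapes(matrix.shape, (3,))    # (2, 3) with (3,)
shape_col = np.broadcast_shapes(matrix.shape, (2, 1))  # (2, 3) with (2, 1)
print(shape_row, shape_col)

# A vector of shape (2,) is NOT compatible with (2, 3):
# the trailing dimensions are 3 and 2, and neither of them is 1.
try:
    matrix + np.array([10, 20])
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print("Broadcasting failed:", err)
```

To add a length-2 vector to each column, it first has to be reshaped to `(2, 1)`, as the column vector in the cell above does.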


### Bad - for loops

Generate predictions with a logistic regression model using a for-loop.

In [146]:
def predict_for_loop(X, w, b):
    """Generate predictions with a logistic regression model using a for-loop.

    Args:
        X: data matrix, shape (N, D)
        w: weights vector, shape (D)
        b: bias term, shape (1)

    Returns:
        y: probabilities of the positive class, shape (N)
    """
    # TODO
    N, D = X.shape  # Number of samples (N) and features (D)
    y = np.zeros(N)  # Initialize the output probabilities

    for i in range(N):
        # Calculate z = x_i * w + b for the i-th sample
        z = np.dot(X[i], w) + b
        # Apply the sigmoid function and ensure it's a scalar
        y[i] = sigmoid(z).item()  # .item() ensures the result is a scalar

    return y

### Good - vectorization

Generate predictions with a logistic regression model using vectorized operations.

In [148]:
def predict_vectorized(X, w, b):
    """Generate predictions with a logistic regression model using vectorized operations.

    Args:
        X: data matrix, shape (N, D)
        w: weights vector, shape (D)
        b: bias term, shape (1)

    Returns:
        y: probabilities of the positive class, shape (N)
    """
    # TODO
    # Compute the linear model z = X @ w + b
    z = np.dot(X, w) + b  # X @ w is the same as np.dot(X, w)
    
    # Apply the sigmoid function to z to get probabilities
    y = sigmoid(z)
    
    return y

### Make sure that both variants produce the same results

In [150]:
results_for_loop = predict_for_loop(X, w, b)
results_vectorized = predict_vectorized(X, w, b)

Are the results the same?

In [152]:
# TODO
# Check the shape of the results
print(f"Shape of results_for_loop: {results_for_loop.shape}")
print(f"Shape of results_vectorized: {results_vectorized.shape}")

# Check if both results are numerically close
# if np.allclose(results_for_loop, results_vectorized):  # this checks closeness; use == to test exact equality
#     print("Both methods produce the same result!")
# else:
#     print("The results are different.")
np.all(results_for_loop == results_vectorized)

Shape of results_for_loop: (569,)
Shape of results_vectorized: (569,)


False

What is the norm of the difference?

In [154]:
# TODO
# Compute the difference between the two results
difference = results_for_loop - results_vectorized

# Compute the L2 norm (Euclidean norm) of the difference
norm_difference = np.linalg.norm(difference)

# Print the result
print("Norm of the difference:", norm_difference)

Norm of the difference: 6.77599954797753e-16


Are they close enough?

In [280]:
# TODO
# This depends on the tolerance we choose, e.g. 1e-6 or 1e-10
if np.allclose(results_for_loop, results_vectorized):  # tolerance-based check; use == for exact equality
    print("Both methods produce the same result!")
else:
    print("The results are different.")

Both methods produce the same result!
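
Why can two mathematically identical computations differ in the last bits? Floating-point addition is not associative, so the order of operations matters. The snippet below is a small illustration with plain Python floats, not taken from the exercise itself. `np.isclose` and `np.allclose` test `|a - b| <= atol + rtol * |b|`, with defaults `rtol=1e-05` and `atol=1e-08`.

```python
import numpy as np

# The same three numbers, added in two different orders.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)            # exact comparison
print(abs(left - right))        # the gap is on the order of machine epsilon
print(np.isclose(left, right))  # tolerance-based comparison
```

The for-loop and the vectorized prediction accumulate their dot products in different orders, which is a plausible source of the tiny norm reported above.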


### Compare the runtime of two variants

In [158]:
%%timeit
predict_for_loop(X, w, b)

2.18 ms ± 26 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [159]:
%%timeit
predict_vectorized(X, w, b)

8.79 μs ± 70.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
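
The `%%timeit` magic only works inside IPython/Jupyter. In a plain script, a similar comparison can be sketched with the standard-library `timeit` module. The data below is random and the names `X_demo`, `w_demo`, and `dot_loop` are hypothetical stand-ins, so absolute timings will differ from the ones shown above.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)        # arbitrary fixed seed
X_demo = rng.normal(size=(569, 30))   # same shape as the dataset above
w_demo = rng.normal(size=30)

def dot_loop(X, w):
    """Compute X @ w one row at a time."""
    out = np.zeros(X.shape[0])
    for i in range(X.shape[0]):
        out[i] = np.dot(X[i], w)
    return out

# Average wall-clock time per call over 20 repetitions.
t_loop = timeit.timeit(lambda: dot_loop(X_demo, w_demo), number=20) / 20
t_vec = timeit.timeit(lambda: X_demo @ w_demo, number=20) / 20
print(f"loop: {t_loop:.2e} s, vectorized: {t_vec:.2e} s")
```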


## 1.2. K-nearest neighbors
A more complicated task: compute the matrix of pairwise distances.

Given a data matrix `X` of size `[N, D]`, compute the matrix `dist` of pairwise distances of size `[N, N]`, where `dist[i, j] = l2_distance(X[i], X[j])`.

The L2 distance is:

$$
d_2(a, b) = \sqrt{\sum_{i=1}^d (a_i - b_i)^2}
$$

### Bad - for loops

In [163]:
def l2_distance(x, y):
    """Compute Euclidean distance between two vectors."""
    # TODO
    return np.sqrt(np.sum((x - y) ** 2))

In [164]:
def distances_for_loop(X):
    """Compute pairwise distances between all instances (for loop version).

    Args:
        X: data matrix, shape (N, D)

    Returns:
        dist: matrix of pairwise distances, shape (N, N)
    """
    # TODO
    N = X.shape[0]  # Number of rows
    dist = np.zeros((N, N))  # Initialize the distance matrix

    for i in range(N):
        for j in range(N):
            dist[i, j] = l2_distance(X[i], X[j])
    
    return dist

In [165]:
# compute pairwise distances using for loops
dist1 = distances_for_loop(X)
print(dist1)

[[0.         2.99965852 1.91448055 ... 3.46242403 2.02714709 5.10607491]
 [2.99965852 0.         1.50824977 ... 1.49479679 3.33795732 3.28948876]
 [1.91448055 1.50824977 0.         ... 1.89129137 2.10557371 3.96949079]
 ...
 [3.46242403 1.49479679 1.89129137 ... 0.         3.27879383 2.51965124]
 [2.02714709 3.33795732 2.10557371 ... 3.27879383 0.         5.32401844]
 [5.10607491 3.28948876 3.96949079 ... 2.51965124 5.32401844 0.        ]]


### Good - vectorization

How can we compute all the distances in a vectorized way?

Start with a simpler example.

In [284]:
x = np.arange(5, dtype=np.float64)  # a 1D vector of shape (5,)
x

array([0., 1., 2., 3., 4.])

Use `numpy` broadcasting to compute the matrix of pairwise distances in a vectorized way. We achieve this by adding a new axis to `x` using `np.newaxis`.

In [286]:
# TODO
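x[:, None].repeat(x.shape[0], axis=1)
# x[:, None] turns the 1D array of shape (5,) into a 2D array of shape (5, 1)
# axis=1 repeats along the column dimension: the vector is treated as one column and copied into five columns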

array([[0., 0., 0., 0., 0.],
       [1., 1., 1., 1., 1.],
       [2., 2., 2., 2., 2.],
       [3., 3., 3., 3., 3.],
       [4., 4., 4., 4., 4.]])

In [290]:
# TODO
x[None, :].repeat(x.shape[0], axis=0)
# As above: the 1D array of shape (5,) becomes 2D with shape (1, 5) and is then repeated along the row dimension

array([[0., 1., 2., 3., 4.],
       [0., 1., 2., 3., 4.],
       [0., 1., 2., 3., 4.],
       [0., 1., 2., 3., 4.],
       [0., 1., 2., 3., 4.]])

In [292]:
# TODO
x[:, None] - x[None, :]  # elementwise difference; (5, 1) and (1, 5) broadcast to (5, 5)

array([[ 0., -1., -2., -3., -4.],
       [ 1.,  0., -1., -2., -3.],
       [ 2.,  1.,  0., -1., -2.],
       [ 3.,  2.,  1.,  0., -1.],
       [ 4.,  3.,  2.,  1.,  0.]])

The same result can be achieved using `None` indexing.

In [296]:
# TODO
np.sum(np.square(x[:,None]-x[None,:]))

100.0

In [175]:
# TODO

In [315]:
def distances_vectorized(X):
    """Compute pairwise distances between all instances (vectorized version).

    Args:
        X: data matrix, shape (N, D)

    Returns:
        dist: matrix of pairwise distances, shape (N, N)
    """
    
    # (N, 1, D) - (1, N, D) broadcasts to (N, N, D);
    # np.sum(..., axis=-1) then sums over the last (feature) dimension D
    return np.sqrt(np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))

In [317]:
# compute pairwise distances using vectorized operations
dist2 = distances_vectorized(X)

### Make sure that both variants produce the same results

In [320]:
np.allclose(dist1, dist2)

True
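
The broadcasting approach builds an intermediate array of shape `(N, N, D)`, which can become a memory bottleneck for large `N`. A common alternative, sketched here under the assumption that slightly reduced precision near zero distances is acceptable, uses the identity $\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2 a^T b$ so that only `(N, N)` arrays are needed. The function name `distances_gram` and the random test data are hypothetical.

```python
import numpy as np

def distances_gram(X):
    """Pairwise L2 distances via ||a||^2 + ||b||^2 - 2 a.b (no (N, N, D) array)."""
    sq_norms = np.sum(X ** 2, axis=1)                        # shape (N,)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Rounding can produce tiny negative values; clip them before the sqrt.
    np.maximum(sq_dist, 0.0, out=sq_dist)
    return np.sqrt(sq_dist)

rng = np.random.default_rng(0)        # arbitrary fixed seed
X_demo = rng.normal(size=(50, 4))     # hypothetical small data matrix
reference = np.sqrt(np.sum((X_demo[:, None] - X_demo[None, :]) ** 2, axis=-1))
print(np.allclose(distances_gram(X_demo), reference, atol=1e-6))
```

The subtraction of nearly equal numbers makes this variant less accurate for points that are very close together, which is why the comparison uses a looser `atol`.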

### Best - library function

In [323]:
from scipy.spatial.distance import cdist, pdist, squareform

dist3 = cdist(X, X)
dist4 = squareform(pdist(X))

Make sure that both variants produce the same results

In [326]:
# Use np.allclose to compare
np.allclose(dist2, dist3)

True

### Compare the runtime

In [185]:
%%timeit
distances_for_loop(X)

1.04 s ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [186]:
%%timeit
distances_vectorized(X)

3.4 ms ± 141 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [187]:
%%timeit
cdist(X, X)

2.93 ms ± 94 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [188]:
%%timeit
squareform(pdist(X))

2.41 ms ± 62.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Lessons:
1. For-loops are extremely slow! Avoid them whenever possible.
2. A better alternative - use matrix operations & broadcasting
3. An even better alternative - use library functions (if they are available).
4. Implementations with for-loops can be useful for debugging vectorized code.

# 2. Numerical stability
Typically, GPUs use single-precision (32-bit) floating point numbers (in some cases even half precision / 16-bit). This significantly speeds up the computations, but also makes numerical issues a lot more likely.
Because of this we always have to be extremely careful to implement our code in a numerically stable way.

Most commonly, numerical issues occur when dealing with `log` and `exp` functions (e.g. when computing cross-entropy of a categorical distribution) and `sqrt` for values close to zero (e.g. when computing standard deviations or normalizing the $L_2$ norm).

In [332]:
np.finfo(np.float64), np.finfo(np.float32), np.finfo(np.float16)

(finfo(resolution=1e-15, min=-1.7976931348623157e+308, max=1.7976931348623157e+308, dtype=float64),
 finfo(resolution=1e-06, min=-3.4028235e+38, max=3.4028235e+38, dtype=float32),
 finfo(resolution=0.001, min=-6.55040e+04, max=6.55040e+04, dtype=float16))
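
Two consequences of these limits are worth seeing directly. This is an illustrative sketch; the specific test values `1e-8` and `90.0` are chosen for the demonstration. First, a number smaller than the machine epsilon is absorbed when added to 1. Second, `exp` overflows once its argument exceeds the logarithm of the largest representable value.

```python
import numpy as np

for dtype in (np.float64, np.float32, np.float16):
    info = np.finfo(dtype)
    # Largest t for which exp(t) is still finite in this dtype.
    max_exp_arg = float(np.log(np.float64(info.max)))
    print(dtype.__name__, "eps =", info.eps, "| exp overflows above ~", round(max_exp_arg, 2))

# 1e-8 is below float32 machine epsilon (about 1.19e-7), so it is lost.
absorbed = (np.float32(1.0) + np.float32(1e-8)) == np.float32(1.0)
print(absorbed)

# exp(90) exceeds the float32 maximum and becomes inf.
with np.errstate(over="ignore"):
    overflowed = np.isinf(np.exp(np.float32(90.0)))
print(overflowed)
```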

## 2.1. Avoiding numerical overflow (exploding `exp`)

Softmax function $f : \mathbb{R}^D \to \Delta^{D - 1}$ converts a vector $\mathbf{x} \in \mathbb{R}^D$ into a vector of probabilities.

$$f(\mathbf{x})_j = \frac{\exp(x_j)}{\sum_{d=1}^{D} \exp(x_d)}$$

In [334]:
def softmax_unstable(logits):
    # TODO
    exp = np.exp(logits)
    return exp / np.sum(exp, axis=0)

Apply the softmax function to the following vector.

In [336]:
x = np.linspace(0.0, 4.0, 5).astype(np.float32)
x

array([0., 1., 2., 3., 4.], dtype=float32)

In [338]:
softmax_unstable(x)

array([0.01165623, 0.03168492, 0.08612854, 0.23412167, 0.6364086 ],
      dtype=float32)

Now apply it to the following vector

In [199]:
x = np.linspace(50.0, 90.0, 5).astype(np.float32)
x

array([50., 60., 70., 80., 90.], dtype=float32)

In [340]:
softmax_unstable(x)

array([ 0.,  0.,  0.,  0., nan], dtype=float32)

### How to avoid the explosion?

Shift the values by a constant $C$.

$$f(\mathbf{x})_j = \frac{\exp(x_j - C)}{\sum_{d=1}^{D} \exp(x_d - C)}$$
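
The shift does not change the result, because the factor $\exp(-C)$ cancels between numerator and denominator:

$$\frac{\exp(x_j - C)}{\sum_{d=1}^{D} \exp(x_d - C)} = \frac{\exp(-C)\exp(x_j)}{\exp(-C)\sum_{d=1}^{D} \exp(x_d)} = \frac{\exp(x_j)}{\sum_{d=1}^{D} \exp(x_d)}$$

A common choice is $C = \max_d x_d$. Then every exponent satisfies $x_d - C \le 0$, so each $\exp(x_d - C)$ lies in $(0, 1]$ and cannot overflow, and the denominator is at least 1 because the largest entry contributes $\exp(0) = 1$.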

In [342]:
def softmax_stable(logits):
    """Compute softmax values for each set of scores in logits."""
    # TODO
    # Subtract the maximum so that the largest exponent is exp(0) = 1
    logits_shifted = logits - np.max(logits, axis=0)
    exp_shifted = np.exp(logits_shifted)
    return exp_shifted / np.sum(exp_shifted, axis=0)
    

In [344]:
x = np.linspace(50.0, 90.0, 5).astype(np.float64)

In [346]:
softmax_stable(x)

array([4.24816138e-18, 9.35719813e-14, 2.06106005e-09, 4.53978686e-05,
       9.99954600e-01])
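
The functions above normalize along `axis=0`, which suits a single vector. For a batch of logits with one example per row, the reduction needs `axis=-1` with `keepdims=True` so that the shapes broadcast. The sketch below assumes that layout; `softmax_rows` and `batch` are hypothetical names, and the result is compared against `scipy.special.softmax`.

```python
import numpy as np
from scipy.special import softmax as scipy_softmax

def softmax_rows(logits):
    """Numerically stable softmax over the last axis (one example per row)."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

batch = np.array([[0.0, 1.0, 2.0],
                  [50.0, 70.0, 90.0],
                  [1000.0, 1000.0, 1000.0]], dtype=np.float32)

probs = softmax_rows(batch)
print(probs.sum(axis=-1))                                   # each row sums to 1
print(np.allclose(probs, scipy_softmax(batch, axis=-1)))    # matches the library
```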

## 2.2. Working in the log-space / simplifying the expressions

Binary cross entropy (BCE) loss for a logistic regression model (corresponds to the negative log-likelihood of a Bernoulli model):

$$-\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, b) = -\sum_{i=1}^{N} \left[ y_i \log \sigma(\mathbf{w}^T \mathbf{x}_i + b) + (1 - y_i) \log (1 - \sigma(\mathbf{w}^T \mathbf{x}_i + b)) \right]$$


Implement the BCE computation.

In [358]:
def sigmoid(t):
    # TODO
    return 1/(1+np.exp(-t))


def binary_cross_entropy_unstable(scores, labels):
    """Compute binary cross-entropy loss for one sample."""
    # TODO
    return -labels * np.log(sigmoid(scores))-(1-labels)* np.log(1-sigmoid(scores))

In [360]:
x = np.array([[20.0, 20.0]])  # [1, 2]
w = np.array([[1.0, 1.0]])  # [1, 2]
y = np.array([1.0])  # [1,]

# 1. compute logits
# TODO
scores = np.dot(x,w.T)
# 2. compute loss
# TODO
binary_cross_entropy_unstable(scores, y)  # returns nan: sigmoid(40) rounds to 1, so log(1 - 1) = -inf and 0 * (-inf) = nan



array([[nan]])

Try to simplify the BCE loss as much as possible

In [366]:
def binary_cross_entropy_stable(scores, labels):
    # TODO
    return np.log(1+np.exp(scores)) - labels * scores
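
The simplification can be derived for a single sample as follows, writing $s$ as shorthand for the score $\mathbf{w}^T \mathbf{x} + b$ (the symbol $s$ is introduced here only for this derivation). Using

$$\log \sigma(s) = -\log(1 + \exp(-s)) = s - \log(1 + \exp(s)), \qquad \log(1 - \sigma(s)) = -\log(1 + \exp(s)),$$

the per-sample loss becomes

$$-y \log \sigma(s) - (1 - y) \log(1 - \sigma(s)) = -y s + y \log(1 + \exp(s)) + (1 - y) \log(1 + \exp(s)) = \log(1 + \exp(s)) - y s.$$

The simplified form never evaluates $\log(0)$, which is what produced the `nan` in the direct implementation.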

In [368]:
# 1. compute logits
# TODO
scores = np.dot(x,w.T)
# 2. compute loss
# TODO
binary_cross_entropy_stable (scores,y)

array([[0.]])
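
The simplified expression still evaluates `np.exp(scores)`, which overflows for very large scores (for example `scores = 1000` in `float64`). One possible refinement, offered as a sketch rather than as the intended solution, replaces `log(1 + exp(s))` with `np.logaddexp(0, s)`, which computes `log(exp(0) + exp(s))` without forming `exp(s)` explicitly. The name `bce_logaddexp` and the test values are hypothetical.

```python
import numpy as np

def bce_logaddexp(scores, labels):
    """BCE per sample: log(1 + exp(s)) - y * s, using logaddexp for stability."""
    # np.logaddexp(0, s) = log(exp(0) + exp(s)) = log(1 + exp(s))
    return np.logaddexp(0.0, scores) - labels * scores

scores_demo = np.array([-1000.0, -40.0, 0.0, 40.0, 1000.0])
labels_demo = np.array([0.0, 0.0, 1.0, 1.0, 1.0])

loss = bce_logaddexp(scores_demo, labels_demo)
print(loss)  # all entries are finite, even for |score| = 1000
```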

## 2.3. Loss of numerical precision

Implement the log sigmoid function 

$$f(x) = \log \sigma(x) = \log \left(\frac{1}{1 + \exp(-x)}\right)$$

In [215]:
def log_sigmoid_unstable(x):
    # TODO
    return np.log(1/(1+np.exp(-x)))

`float32` has much lower "resolution" than `float64`

In [217]:
x = np.linspace(0, 30, 11).astype(np.float32)
x, log_sigmoid_unstable(x)

(array([ 0.,  3.,  6.,  9., 12., 15., 18., 21., 24., 27., 30.],
       dtype=float32),
 array([-6.9314718e-01, -4.8587341e-02, -2.4756414e-03, -1.2338923e-04,
        -6.1989022e-06, -3.5762793e-07,  0.0000000e+00,  0.0000000e+00,
         0.0000000e+00,  0.0000000e+00,  0.0000000e+00], dtype=float32))

In [218]:
x = np.linspace(0, 30, 11).astype(np.float64)
log_sigmoid_unstable(x)

array([-6.93147181e-01, -4.85873516e-02, -2.47568514e-03, -1.23402190e-04,
       -6.14419348e-06, -3.05902274e-07, -1.52299796e-08, -7.58256125e-10,
       -3.77513576e-11, -1.87960758e-12, -9.34807787e-14])

Implement the log-sigmoid function in a numerically stable way

In [220]:
def log_sigmoid_stable(x):
    # TODO
    return -np.log1p(np.exp(-x))  # log1p(z) = log(1 + z): mathematically the same as above, but accurate for small z

In [221]:
x = np.linspace(0, 30, 11).astype(np.float32)
log_sigmoid_stable(x)

array([-6.9314718e-01, -4.8587352e-02, -2.4756852e-03, -1.2340219e-04,
       -6.1441938e-06, -3.0590229e-07, -1.5229981e-08, -7.5825601e-10,
       -3.7751344e-11, -1.8795287e-12, -9.3576229e-14], dtype=float32)
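
The `log1p` version is accurate for large positive inputs, but for large negative inputs `np.exp(-x)` itself overflows (for example `x = -1000` yields `-inf` instead of `-1000`). A variant that covers both tails can be sketched with `np.logaddexp`, using $\log \sigma(x) = -\log(\exp(0) + \exp(-x))$. The function name `log_sigmoid_logaddexp` and the test inputs are hypothetical.

```python
import numpy as np

def log_sigmoid_logaddexp(x):
    """log(sigmoid(x)) computed as -logaddexp(0, -x), stable in both tails."""
    return -np.logaddexp(0.0, -x)

x_demo = np.array([-1000.0, -30.0, 0.0, 30.0, 1000.0])
out = log_sigmoid_logaddexp(x_demo)
print(out)  # approximately x for very negative x, approximately 0 for very positive x
```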

Relevant functions: `np.log1p`, `np.expm1`, `scipy.special.logsumexp`, `scipy.special.softmax` -- these are also implemented in all major deep learning frameworks.
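
As an illustration of how these helpers combine, a log-softmax can be written directly in log-space with `scipy.special.logsumexp`, avoiding the division and the subsequent `log`. This is a minimal sketch; `log_softmax` and `logits_demo` are names chosen here, and the input reuses the vector from Section 2.1.

```python
import numpy as np
from scipy.special import logsumexp

def log_softmax(logits):
    """log(softmax(logits)) over the last axis, computed in log-space."""
    return logits - logsumexp(logits, axis=-1, keepdims=True)

logits_demo = np.array([50.0, 60.0, 70.0, 80.0, 90.0], dtype=np.float32)
log_probs = log_softmax(logits_demo)
print(log_probs)
print(np.exp(log_probs).sum())  # close to 1
```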

## Lessons:
1. Be especially careful when working with `log` and `exp` functions in **single precision** floating point arithmetic
2. Work in the log-space when possible
3. Use numerically stable library functions when available