### 1. Why is it generally preferable to use a Logistic Regression classifier rather than a classical Perceptron?
**Answer**: A classical Perceptron converges only if the dataset is linearly separable, and it outputs hard class decisions with no estimate of confidence. A Logistic Regression classifier generally converges to a reasonably good solution even when the data is not linearly separable, and it outputs class probabilities, which say how confident the model is in each prediction. Because its cost function (the log loss) is smooth, its weights can also be updated in a more fine-grained way during training. Note that both are linear models: neither can learn a non-linear decision boundary on its own.

To make a Perceptron equivalent to a Logistic Regression classifier, replace its step function with the logistic (sigmoid) activation function (or softmax if it has several output neurons) and train it with Gradient Descent, or another optimizer, to minimize the log loss (cross-entropy).
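To make the contrast concrete, here is a minimal sketch of a single neuron evaluated with both activations. The weights, bias, and sample points are made up for illustration, and only the standard library is used.

```python
import math

def linear_score(weights, bias, x):
    """Weighted sum z = w . x + b, shared by both models."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def step(z):
    """Perceptron activation: a hard 0/1 decision."""
    return 1 if z >= 0 else 0

def sigmoid(z):
    """Logistic activation: an estimated probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters and inputs, chosen only for illustration.
weights, bias = [2.0, -1.0], 0.5
near_boundary = [0.1, 0.6]      # z is about 0.1
far_from_boundary = [2.0, 0.5]  # z = 4.0

for x in (near_boundary, far_from_boundary):
    z = linear_score(weights, bias, x)
    print(f"z={z:.2f}  perceptron={step(z)}  logistic={sigmoid(z):.3f}")
```

Both points receive class 1 from the step function, while the sigmoid output distinguishes a borderline case from a confident one.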

### 2. Why was the logistic activation function a key ingredient in training the first MLPs?
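**Answer**: The logistic function is differentiable and its derivative is nonzero everywhere, so Gradient Descent always has a slope to follow. The step function used in Perceptrons is flat everywhere except at zero, where it is discontinuous, so its gradient gives no signal for updating the weights. This is what made gradient-based training with backpropagation possible for MLPs. The logistic function is also non-linear, which matters because a stack of purely linear layers collapses into a single linear transformation.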
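The gradient argument can be checked numerically. The sketch below compares a finite-difference slope of the step function with the analytic derivative of the sigmoid; the evaluation points and the helper name `numerical_slope` are arbitrary choices for illustration.

```python
import math

def step(z):
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """Analytic derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def numerical_slope(f, z, eps=1e-4):
    """Central finite difference, used here as a stand-in for the gradient."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

for z in (-2.0, 0.5, 3.0):
    print(f"z={z:+.1f}  step slope={numerical_slope(step, z):.3f}  "
          f"sigmoid slope={sigmoid_derivative(z):.3f}")
```

Away from zero the step function's slope is exactly 0, so a gradient-based update would leave the weights unchanged, whereas the sigmoid always yields a small but usable slope.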

### 3. Name three popular activation functions. Can you draw them?
**Answer**: Three popular activation functions are ReLU (Rectified Linear Unit), Sigmoid, and Tanh.
- ReLU: $f(x) = \max(0, x)$
<img src = "https://www.nomidl.com/wp-content/uploads/2022/04/image-10.png">
- Sigmoid: $f(x) = \frac{1}{1+e^{-x}}$
<img src = "https://www.nomidl.com/wp-content/uploads/2022/04/image-11.png">
- Tanh: $f(x) = \tanh(x)$
<img src = "https://www.baeldung.com/wp-content/uploads/sites/4/2022/02/tanh.png">
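If the images do not load, the three curves can be sketched by tabulating a few values. This is a small standard-library illustration; the sample points in `xs` are arbitrary.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Name -> function, so all three can be tabulated in one loop.
ACTIVATIONS = {"ReLU": relu, "Sigmoid": sigmoid, "Tanh": math.tanh}

xs = [-3.0, -1.0, 0.0, 1.0, 3.0]  # arbitrary sample points
print("x       ", ", ".join(f"{x:6.1f}" for x in xs))
for name, fn in ACTIVATIONS.items():
    values = ", ".join(f"{fn(x):6.3f}" for x in xs)
    print(f"{name:8s}", values)
```

The table shows the characteristic ranges: ReLU is zero for negative inputs and unbounded above, Sigmoid stays within (0, 1), and Tanh stays within (-1, 1) and is centered on zero.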


### 4. MLP with specific architecture: Questions about shapes and equations
**Answer**: The shapes below correspond to an MLP with 10 inputs, one hidden layer of 50 neurons, and an output layer of 3 neurons, all using the ReLU activation function.
- **Shape of input matrix $X$**: [N, 10], where N is the number of samples in the batch.
- **Shape of hidden layer's weight matrix $W_h$**: [10, 50]
- **Shape of its bias vector $b_h$**: [50]
- **Shape of output layer's weight matrix $W_o$**: [50, 3]
- **Shape of its bias vector $b_o$**: [3]
- **Shape of network's output matrix $Y$**: [N, 3]

**Equation**:

$$Y = \mathrm{ReLU}(\mathrm{ReLU}(X W_h + b_h)\, W_o + b_o)$$

The bias vectors are broadcast across the N rows when they are added.
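The shapes can be verified with a short NumPy sketch. The batch size `N = 4` and the random parameter values are assumptions made only so that the code runs; the layer sizes are the ones listed above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
N = 4  # hypothetical batch size

X = rng.normal(size=(N, 10))    # input matrix
Wh = rng.normal(size=(10, 50))  # hidden layer weights
bh = np.zeros(50)               # hidden layer biases
Wo = rng.normal(size=(50, 3))   # output layer weights
bo = np.zeros(3)                # output layer biases

def relu(z):
    return np.maximum(0, z)

H = relu(X @ Wh + bh)  # hidden activations, shape (N, 50)
Y = relu(H @ Wo + bo)  # network output, shape (N, 3)

print(H.shape, Y.shape)  # (4, 50) (4, 3)
```

The bias vectors have one entry per neuron, and NumPy broadcasting adds them to every row of the batch.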

### 5. Neurons in the output layer for specific tasks
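**Answer**: Spam-or-ham classification is a binary task, so a single output neuron with the sigmoid activation function is enough: its output is the estimated probability that the email is spam. MNIST has 10 mutually exclusive classes, so you need 10 output neurons, one per digit, and you typically use the softmax activation function, which outputs one probability per class.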
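As a rough illustration of the two output layers, the sketch below turns raw neuron outputs (logits) into probabilities. The logit values are invented for the example and do not come from a trained model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Binary task: one output neuron, one probability.
spam_logit = 1.2  # hypothetical output of the single neuron
p_spam = sigmoid(spam_logit)

# 10-class task: ten output neurons, one probability per digit.
digit_logits = [0.1, 2.0, -1.0, 0.3, 0.0, 0.5, -0.2, 1.1, 0.4, -0.7]  # hypothetical
digit_probs = softmax(digit_logits)
predicted_digit = max(range(10), key=lambda k: digit_probs[k])

print(f"P(spam) = {p_spam:.3f}")
print(f"predicted digit = {predicted_digit}, sum of probabilities = {sum(digit_probs):.3f}")
```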

### 6. What is backpropagation and how does it work? What is the difference between backpropagation and reverse-mode autodiff?
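**Answer**: Backpropagation is the algorithm used to train neural networks. For each mini-batch it makes a forward pass to compute the predictions and the loss, then a backward pass that applies the chain rule layer by layer, from the output back to the input, to compute the gradient of the loss with respect to every weight and bias. A Gradient Descent step then uses these gradients to update the parameters. Reverse-mode autodiff is the more general technique: it efficiently computes gradients for any computational graph, and the backward pass of backpropagation relies on it. Strictly speaking, backpropagation refers to the whole training procedure (gradient computation plus parameter updates), while reverse-mode autodiff only computes gradients.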
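The backward pass can be written out by hand for a tiny network. The sketch below assumes one scalar input, one sigmoid hidden neuron, one linear output neuron, and a squared-error loss; all parameter values are hypothetical. It checks the chain-rule gradients against finite differences.

```python
import math

def forward(params, x):
    """Forward pass: returns hidden activation h and output y."""
    w1, b1, w2, b2 = params
    h = 1.0 / (1.0 + math.exp(-(w1 * x + b1)))  # sigmoid hidden neuron
    y = w2 * h + b2                              # linear output neuron
    return h, y

def loss(params, x, target):
    _, y = forward(params, x)
    return 0.5 * (y - target) ** 2

def backward(params, x, target):
    """Backward pass: chain rule applied from the output back to the input."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    dL_dy = y - target              # derivative of the squared-error loss
    dL_dw2 = dL_dy * h
    dL_db2 = dL_dy
    dL_dh = dL_dy * w2              # error propagated to the hidden neuron
    dL_dz = dL_dh * h * (1.0 - h)   # through the sigmoid
    dL_dw1 = dL_dz * x
    dL_db1 = dL_dz
    return [dL_dw1, dL_db1, dL_dw2, dL_db2]

def numerical_gradients(params, x, target, eps=1e-6):
    """Finite-difference gradients, used only to verify the backward pass."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, x, target) - loss(down, x, target)) / (2 * eps))
    return grads

params = [0.5, -0.3, 1.5, 0.1]  # hypothetical w1, b1, w2, b2
x, target = 2.0, 1.0
analytic = backward(params, x, target)
numeric = numerical_gradients(params, x, target)
for name, a, n in zip(["w1", "b1", "w2", "b2"], analytic, numeric):
    print(f"dL/d{name}: backprop={a:+.6f}  numeric={n:+.6f}")
```

The backward pass reuses intermediate results such as `dL_dy` and `dL_dz`, so all four gradients come out of one sweep, while the finite-difference check needs two loss evaluations per parameter.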

### 7. Can you list all the hyperparameters you can tweak in an MLP? If the MLP overfits the training data, how could you tweak these hyperparameters to try to solve the problem?
**Answer**: Hyperparameters in an MLP include the number of hidden layers, the number of neurons in each hidden layer, the activation function of each layer, the weight initialization, the optimizer and its learning rate, the batch size, the number of training epochs, and the type and strength of regularization. If the MLP overfits the training data, you can reduce the number of hidden layers or the number of neurons per layer, add dropout, apply L1/L2 regularization, or stop training earlier.
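One way to expose these knobs in Keras is sketched below. The helper `build_mlp`, its parameter names, and the default values are hypothetical; the input size of 784 and the 10 output classes assume an MNIST-style task. Early stopping is included as one possible way to stop training before the model overfits.

```python
import tensorflow as tf

def build_mlp(n_hidden=2, n_neurons=64, dropout_rate=0.3, l2_factor=1e-4):
    """Build an MLP whose capacity and regularization are set by arguments."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(784,)))
    for _ in range(n_hidden):
        model.add(tf.keras.layers.Dense(
            n_neurons,
            activation="relu",
            # L2 penalty on the layer's weights
            kernel_regularizer=tf.keras.regularizers.l2(l2_factor)))
        # Randomly drops a fraction of activations during training
        model.add(tf.keras.layers.Dropout(dropout_rate))
    model.add(tf.keras.layers.Dense(10, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Stops training when the validation loss has not improved for 5 epochs.
early_stopping_cb = tf.keras.callbacks.EarlyStopping(
    patience=5, restore_best_weights=True)

model = build_mlp()
model.summary()
```

To fight overfitting with this sketch, you would lower `n_hidden` or `n_neurons`, raise `dropout_rate` or `l2_factor`, and pass `early_stopping_cb` in the `callbacks` list of `model.fit` together with validation data.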

### 8. Train a deep MLP on the MNIST dataset and see if you can get over 98% precision.

```python
import tensorflow as tf

# Load MNIST data
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # Normalize to [0, 1]

# Flatten the data
x_train = x_train.reshape(-1, 28*28)
x_test = x_test.reshape(-1, 28*28)

# Create the MLP model
model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(28*28,)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Create checkpoints
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("mnist_model.h5", save_best_only=True)

# Use TensorBoard
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir="./logs")

# Train the model
model.fit(x_train, y_train, epochs=10, validation_split=0.2,
          callbacks=[checkpoint_cb, tensorboard_cb])

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test)

print(f"Test accuracy: {test_acc * 100:.2f}%")

# To start TensorBoard, run the following command in your terminal:
# tensorboard --logdir=./logs
```


Output after 10 epochs (per-epoch progress bars omitted):

```
Test accuracy: 98.12%
```
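The exercise asks about precision, while the script above reports accuracy. A minimal sketch of per-class and macro-averaged precision is shown below; the helper names and the toy labels are hypothetical. With the trained model, the predicted labels could be obtained with `model.predict(x_test).argmax(axis=1)`.

```python
def per_class_precision(y_true, y_pred, n_classes):
    """Precision for each class: correct predictions / all predictions of that class."""
    true_positives = [0] * n_classes
    predicted = [0] * n_classes
    for t, p in zip(y_true, y_pred):
        predicted[p] += 1
        if t == p:
            true_positives[p] += 1
    # A class that was never predicted gets precision 0.0 by convention here.
    return [tp / n if n else 0.0 for tp, n in zip(true_positives, predicted)]

def macro_precision(y_true, y_pred, n_classes):
    """Unweighted mean of the per-class precisions."""
    precisions = per_class_precision(y_true, y_pred, n_classes)
    return sum(precisions) / n_classes

# Toy labels for illustration only (3 classes instead of 10).
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

precisions = per_class_precision(y_true, y_pred, n_classes=3)
print("per-class precision:", [round(p, 3) for p in precisions])
print("macro precision:", round(macro_precision(y_true, y_pred, n_classes=3), 3))
```

On a roughly balanced dataset such as MNIST, macro precision and accuracy tend to be close, but they are not the same metric.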
