# Feedforward Neural Networks (FNN)

## Problem Type
**Feedforward Neural Networks (FNN)** are primarily used for:
- **Supervised Learning**
- **Regression** and **Classification** tasks
- **Applications**: Image recognition, speech recognition, tabular data analysis, and many other predictive modeling tasks.

### How Feedforward Neural Networks Work
- **Input Layer:**
  - The input layer receives the data. Each neuron in this layer represents a feature in the dataset.
- **Hidden Layers:**
  - FNNs consist of one or more hidden layers where computations are performed. Each neuron in these layers applies a linear transformation followed by a non-linear activation function (e.g., ReLU, sigmoid).
- **Activation Functions:**
  - Non-linear activation functions allow the network to model complex relationships. Common functions include ReLU, sigmoid, and tanh.
- **Output Layer:**
  - The final layer outputs predictions. For classification, it might use a softmax function to output probabilities. For regression, it outputs a continuous value.
- **Forward Propagation:**
  - Input data is passed through the network, layer by layer, in a forward direction, to generate an output.
- **Loss Function:**
  - The difference between the predicted output and the actual target is measured using a loss function (e.g., cross-entropy for classification, mean squared error for regression).
- **Backpropagation:**
  - The network uses backpropagation to compute the gradient of the loss function with respect to each weight by applying the chain rule, allowing the network to update weights via gradient descent.
- **Weight Updating:**
  - The network's weights are adjusted to minimize the loss function. This process is repeated over many passes through the training data (epochs), each consisting of one or more weight updates, until the model converges or meets a predefined stopping criterion.
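
The steps above can be sketched in plain NumPy. This is a minimal illustration, not how Keras implements training: the XOR toy data, the layer width of 8, the weight scale, and the learning rate are all assumptions chosen to keep the example small. It runs a forward pass, measures cross-entropy loss, backpropagates with the chain rule, and applies a full-batch gradient descent update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): XOR with one-hot targets.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])

# One hidden layer (8 ReLU units) and a 2-class softmax output layer.
W1 = rng.normal(0.0, 0.5, size=(2, 8))
b1 = np.zeros(8)
W2 = rng.normal(0.0, 0.5, size=(8, 2))
b2 = np.zeros(2)


def forward(X):
    """Forward propagation: linear -> ReLU -> linear -> softmax."""
    z1 = X @ W1 + b1
    h = np.maximum(z1, 0.0)
    logits = h @ W2 + b2
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return z1, h, probs


def cross_entropy(probs, Y):
    """Mean categorical cross-entropy over the batch."""
    return float(-np.mean(np.sum(Y * np.log(probs + 1e-12), axis=1)))


learning_rate = 0.1
losses = []
for epoch in range(2000):
    z1, h, probs = forward(X)
    losses.append(cross_entropy(probs, Y))

    # Backpropagation: apply the chain rule from the output back to the input.
    d_logits = (probs - Y) / len(X)   # gradient of softmax + cross-entropy
    dW2 = h.T @ d_logits
    db2 = d_logits.sum(axis=0)
    d_h = d_logits @ W2.T
    d_z1 = d_h * (z1 > 0)             # ReLU passes gradient only where z1 > 0
    dW1 = X.T @ d_z1
    db1 = d_z1.sum(axis=0)

    # Weight update: one gradient descent step per pass over the data.
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Because the whole dataset is used in every step, each iteration of this loop is also one epoch; with mini-batches, an epoch would contain several such updates.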

### Key Tuning Hyperparameters
- **`number_of_layers`:**
  - **Description:** Number of hidden layers in the network.
  - **Impact:** More layers can capture complex patterns but may lead to overfitting if the network is too deep.
  - **Default:** Typically ranges from `1` to `3`, but deeper networks are used in more complex tasks.
- **`number_of_neurons_per_layer`:**
  - **Description:** Number of neurons in each hidden layer.
  - **Impact:** More neurons allow for capturing more features but increase the risk of overfitting and require more computational resources.
  - **Default:** Common choices range from `64` to `512` per layer.
- **`activation_function`:**
  - **Description:** Function used to introduce non-linearity into the model.
  - **Impact:** ReLU is commonly used due to its performance benefits, but other functions like sigmoid or tanh may be used depending on the task.
  - **Default:** `ReLU` for hidden layers, `softmax` for classification output, `linear` for regression output.
- **`learning_rate`:**
  - **Description:** Step size for updating the network's weights during training.
  - **Impact:** Higher values speed up training but may cause instability; lower values provide more stable convergence but slow down training.
  - **Default:** Typically between `1e-5` and `1e-3` (Keras's `Adam` optimizer defaults to `1e-3`), often adjusted dynamically using a learning rate scheduler.
- **`batch_size`:**
  - **Description:** Number of samples processed before the model's weights are updated.
  - **Impact:** Larger batch sizes lead to more stable gradient estimates but require more memory.
  - **Default:** Typically `32`, `64`, or `128`.
- **`epochs`:**
  - **Description:** Number of times the entire dataset is passed through the network during training.
  - **Impact:** More epochs allow the model to learn better but may lead to overfitting if trained too long.
  - **Default:** Usually ranges from `10` to `100`, depending on the dataset and task.
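
Two of these settings are easy to reason about with a little arithmetic. The sketch below uses hypothetical helper names (`count_updates`, `gradient_descent`) that are not part of any library. The first helper shows how `batch_size` and `epochs` determine the number of weight updates, using the figures from the Iris example later in this notebook (120 training samples, batch size 5, 200 epochs). The second runs plain gradient descent on the toy loss `f(w) = w**2` to show how a learning rate that is too high causes divergence.

```python
import math


def count_updates(n_samples, batch_size, epochs):
    """Return (updates per epoch, total updates); the last partial batch counts."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


def gradient_descent(learning_rate, w0=1.0, steps=50):
    """Minimize f(w) = w**2 (gradient 2*w) from w0 and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w
        w -= learning_rate * grad
    return w


steps_per_epoch, total_updates = count_updates(n_samples=120, batch_size=5, epochs=200)
print(steps_per_epoch, total_updates)  # 24 4800

w_stable = gradient_descent(learning_rate=0.1)    # each step multiplies w by 0.8
w_unstable = gradient_descent(learning_rate=1.1)  # each step multiplies w by -1.2
print(abs(w_stable) < 1e-3, abs(w_unstable) > 1.0)  # True True
```

On this toy loss each step multiplies `w` by `1 - 2 * learning_rate`, so any rate above `1.0` makes the iterates grow instead of shrink. Real networks have no such closed form, but the same trade-off applies.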

### Pros vs Cons

| Pros                                                  | Cons                                                   |
|-------------------------------------------------------|--------------------------------------------------------|
| Capable of modeling complex, non-linear relationships | Requires large amounts of data and computational power |
| Flexible architecture that can be adapted to various types of data | Prone to overfitting, especially with deep architectures |
| Can approximate any continuous function on a bounded input domain to arbitrary accuracy, given enough hidden neurons (Universal Approximation Theorem) | Often requires extensive hyperparameter tuning to achieve optimal performance |
| Benefits from modern hardware acceleration (e.g., GPUs) | Black-box nature makes it difficult to interpret the model |
| Extensive support in deep learning frameworks (e.g., TensorFlow, PyTorch) | Training can be time-consuming and resource-intensive |

### Evaluation Metrics
- **Accuracy (Classification):**
  - **Description:** Ratio of correct predictions to total predictions.
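  - **Good Value:** Higher is better; what counts as strong depends on the task and class balance, so compare against a baseline such as always predicting the majority class.
  - **Bad Value:** At or below the majority-class baseline (0.5 for a balanced binary problem) suggests the model has learned little.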
- **Precision (Classification):**
  - **Description:** Proportion of true positives among all positive predictions.
  - **Good Value:** Higher values indicate fewer false positives, especially important in imbalanced datasets.
  - **Bad Value:** Low values suggest many false positives.
- **Recall (Classification):**
  - **Description:** Proportion of actual positives correctly identified.
  - **Good Value:** Higher values indicate fewer false negatives, important when missing a positive case is costly.
  - **Bad Value:** Low values suggest many false negatives.
- **F1 Score (Classification):**
  - **Description:** Harmonic mean of Precision and Recall.
  - **Good Value:** Higher values indicate a good balance between Precision and Recall.
  - **Bad Value:** Low values suggest a poor balance between Precision and Recall.
- **R-squared (Regression):**
  - **Description:** Proportion of variance in the dependent variable explained by the model.
  - **Good Value:** Higher is better; values closer to 1 indicate a strong model.
  - **Bad Value:** Values close to 0 suggest the model explains little of the variance; negative values mean it performs worse than always predicting the mean.
- **Mean Absolute Error (MAE) (Regression):**
  - **Description:** Measures the average absolute difference between predicted and actual values.
  - **Good Value:** Lower is better; values close to `0` relative to the scale of the target indicate high accuracy.
  - **Bad Value:** Higher values suggest significant prediction errors.
- **Root Mean Squared Error (RMSE) (Regression):**
  - **Description:** Measures the square root of the average squared difference between predicted and actual values.
  - **Good Value:** Lower is better; values close to `0` relative to the scale of the target indicate high accuracy. RMSE penalizes large errors more heavily than MAE.
  - **Bad Value:** Higher values suggest the model's predictions deviate significantly from actual values.
- **Log Loss (Classification):**
  - **Description:** Average negative log-likelihood of the true labels under the predicted probabilities (each between 0 and 1); confident wrong predictions are penalized heavily.
  - **Good Value:** Lower is better; values close to `0` indicate that the model's predictions are close to the true labels.
  - **Bad Value:** Higher values suggest poor prediction probabilities.
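
The definitions above can be checked by hand on a small example. The sketch below computes each metric directly with NumPy; the labels, probabilities, and regression values are made up for illustration, and the 0.5 decision threshold is an assumption. In practice `sklearn.metrics` provides equivalent functions.

```python
import numpy as np

# --- Binary classification (illustrative data) ---
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3])
y_pred = (y_prob >= 0.5).astype(int)  # assumed threshold of 0.5

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

accuracy = float(np.mean(y_true == y_pred))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Log loss: clip probabilities so log(0) never occurs.
p = np.clip(y_prob, 1e-15, 1 - 1e-15)
log_loss = float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75

# --- Regression (illustrative data) ---
t = np.array([3.0, 5.0, 2.0, 7.0])        # actual values
t_hat = np.array([2.5, 5.0, 3.0, 8.0])    # predicted values

mae = float(np.mean(np.abs(t - t_hat)))
rmse = float(np.sqrt(np.mean((t - t_hat) ** 2)))
ss_res = float(np.sum((t - t_hat) ** 2))       # residual sum of squares
ss_tot = float(np.sum((t - t.mean()) ** 2))    # total sum of squares
r_squared = 1.0 - ss_res / ss_tot

print(mae, rmse)  # 0.625 0.75
```

Note that RMSE (0.75) exceeds MAE (0.625) here because squaring gives the two errors of size 1 more weight than the error of size 0.5.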



In [None]:
import os

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"  # Suppresses INFO and WARNING messages
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from keras.layers import Dense, Input
from keras.models import Sequential
from keras.optimizers import Adam
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [None]:
# Load Iris dataset
iris = load_iris()
X = iris.data
y_ = iris.target.reshape(-1, 1)  # Reshape labels into a single column (OneHotEncoder expects 2D input)

In [None]:
# One Hot encode the class labels
encoder = OneHotEncoder(sparse_output=False)
y = encoder.fit_transform(y_)

In [None]:
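# Split the data for training and testing (80% / 20%)
# random_state fixes the shuffle so the split is the same on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)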

In [None]:
# Build the model
model = Sequential()

model.add(Input(shape=(4,)))  # Add an Input layer
model.add(Dense(10, activation="relu", name="fc1"))
model.add(Dense(10, activation="relu", name="fc2"))
model.add(Dense(3, activation="softmax", name="output"))

# Adam optimizer with learning rate of 0.001
optimizer = Adam(learning_rate=0.001)
model.compile(optimizer, loss="categorical_crossentropy", metrics=["accuracy"])

print("Neural Network Model Summary: ")
model.summary()  # summary() prints directly and returns None, so it is not wrapped in print()

# Train the model
model.fit(X_train, y_train, verbose=2, batch_size=5, epochs=200)

# Test on unseen data
results = model.evaluate(X_test, y_test)

print("Final test set loss: {:.4f}".format(results[0]))
print("Final test set accuracy: {:.4f}".format(results[1]))

In [None]:
predictions = model.predict(X_test)
predictions = np.argmax(predictions, axis=1)  # Convert predictions to labels
true_labels = np.argmax(y_test, axis=1)  # Convert one-hot encoded y_test to labels

# Map the numerical labels back to original class names
predictions = iris.target_names[predictions]
true_labels = iris.target_names[true_labels]

print(classification_report(true_labels, predictions))

In [None]:
cm = confusion_matrix(true_labels, predictions)

#plt.figure(figsize=(10, 7))
sns.heatmap(
    cm,
    annot=True,
    cmap="Blues",
    xticklabels=iris.target_names,
    yticklabels=iris.target_names,
)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.show()