# Session 6

## Hyperparameter Tuning
**Question**

Do model hyperparameters (for neural networks: number of layers, number of nodes in each layer, activation functions, dropout rate, etc.) change the accuracy of the model, while algorithm hyperparameters (for gradient descent: batch size, number of epochs, learning rate schedule, etc.) leave the accuracy unchanged and only affect how fast training converges?

**Solution**

In machine learning, the term "hyperparameter" is used broadly to describe settings or configurations that govern the learning process but are not learned from the data. However, the context in which the term is used can vary, and this leads to the distinction between model hyperparameters and algorithm hyperparameters.



Both model hyperparameters and algorithm hyperparameters can affect the accuracy of the model, not just the speed of convergence. Here's how:

### Model Hyperparameters:

1. **Number of Layers and Nodes**: Affect the model's capacity to learn. Too few can lead to underfitting, while too many can lead to overfitting.
2. **Activation Functions**: Different activation functions can capture different types of non-linearity in the data.
3. **Dropout Rate**: Helps in preventing overfitting, thereby potentially increasing the generalization accuracy.

Changing these can significantly affect the accuracy of the model on both the training and testing datasets.

### Algorithm Hyperparameters:

1. **Batch Size**: Smaller batches give noisier gradient estimates, which often has a regularizing effect and lowers generalization error. Larger batches give smoother updates and usually finish an epoch faster on parallel hardware, but they may generalize worse.
2. **Number of Epochs**: Training for more epochs can sometimes improve accuracy, but it can also lead to overfitting.
3. **Learning Rate and Schedule**: Too large a learning rate can make training oscillate, diverge, or settle on a suboptimal solution, while too small a learning rate can make training very slow, so the model may not reach a good solution within the available epochs.

These also have a direct impact on the model's performance and accuracy, not just the speed of convergence. 
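The effect of the learning rate can be seen even on a one-parameter problem. The following is a minimal sketch, assuming a toy loss $f(w) = (w - 3)^2$ and plain gradient descent; the function names are made up for illustration and it is not a real training loop.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def gradient_descent(learning_rate, steps=50, start=0.0):
    """Run plain gradient descent on the toy loss and return the final w."""
    w = start
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)      # derivative of (w - 3)^2
        w -= learning_rate * grad   # standard gradient descent update
    return w


for lr in (0.001, 0.1, 1.1):
    w = gradient_descent(lr)
    print(f"lr={lr}: final w={w:.4g}, final loss={loss(w):.4g}")
```

With the same number of steps, the smallest rate is still far from the minimum, the middle rate gets very close to it, and the largest rate overshoots further on every step and diverges. The final loss differs in each case, not just the time taken.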

### Convergence and Accuracy

It's important to note that faster convergence doesn't always equate to higher accuracy. Sometimes a model may converge quickly to a suboptimal solution. Similarly, a model may take longer to converge but may find a better (or worse) solution in terms of accuracy.

### Summary

Both types of hyperparameters are crucial to tune carefully during the model development process. Failing to do so can result in a poorly performing model. Often, researchers and practitioners will use techniques like grid search or random search, and more advanced methods like Bayesian optimization, to find the best hyperparameter settings in terms of accuracy and computational efficiency.
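As a rough illustration of grid search versus random search, here is a sketch in plain Python. The `validation_score` function is a hypothetical stand-in for "train a model with these settings and return the validation accuracy"; in a real project that call is the expensive part.

```python
import itertools
import random


def validation_score(learning_rate, dropout):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    return 1.0 - 1000 * (learning_rate - 0.01) ** 2 - (dropout - 0.3) ** 2


def grid_search(learning_rates, dropouts):
    """Try every combination of the listed values and keep the best one."""
    candidates = itertools.product(learning_rates, dropouts)
    return max(candidates, key=lambda c: validation_score(*c))


def random_search(n_trials, seed=0):
    """Sample settings at random (learning rate on a log scale) and keep the best one."""
    rng = random.Random(seed)
    candidates = [
        (10 ** rng.uniform(-4, -1), rng.uniform(0.0, 0.6))
        for _ in range(n_trials)
    ]
    return max(candidates, key=lambda c: validation_score(*c))


best_grid = grid_search([0.001, 0.01, 0.1], [0.1, 0.3, 0.5])
best_random = random_search(n_trials=20)
print("grid  :", best_grid, validation_score(*best_grid))
print("random:", best_random, validation_score(*best_random))
```

Grid search only ever tests the values you list, while random search spreads the same budget of trials across the whole range, which is one reason it is often preferred when there are many hyperparameters.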

## OpenCV vs Neural Nets feature detection

**Question**

There are algorithms like Canny edge detection in OpenCV for edge detection, and in neural nets the kernels also try to detect features like edges, so what's the difference? Where do we use OpenCV, and where do we use neural nets?



*Classical image-processing filters of the kind OpenCV provides are the inspiration for the convolution kernels in neural networks.*

*OpenCV traditionally works on single images, not on images in bulk. You have to iterate over the images explicitly in a loop and apply each operation yourself. OpenCV has feature detection techniques, but they all have to be applied manually. With a CNN, feature extraction happens implicitly: we don't say which types of features to extract or what the kernel values should be. We only specify the kernel size and the initial random values, and the kernel values are learned automatically.*

*OpenCV works more at the level of an individual image: if you want to apply some correction to an image, it does that. The deep learning approach is to look at many images, learn patterns from them, feed those patterns to a fully connected layer, and make predictions.*

## Filters in OpenCV feature extraction methods vs filters in CNN

In traditional computer vision methods using OpenCV (or similar libraries), feature extraction often relies on pre-defined filters or techniques with fixed values, patterns, or parameters. These are based on mathematical and empirical observations about image structures.

Some common examples include:

1. **Sobel and Scharr Filters**: These filters are used for edge detection by approximating the image intensity gradient (first derivative) in the horizontal (x) and/or vertical (y) direction.
   
2. **Laplacian Filter**: Used to compute the second derivative of an image, highlighting regions of rapid intensity change.
   
3. **Gabor Filters**: Used to detect edges and textures in images.
   
4. **Histogram of Oriented Gradients (HOG)**: A feature descriptor used for object detection. It counts occurrences of gradient orientation in localized portions of an image.
   
5. **Scale-Invariant Feature Transform (SIFT)**, **Speeded-Up Robust Features (SURF)**, **Oriented FAST and Rotated BRIEF (ORB)**: These are algorithms used for detecting and describing local features in images.

In contrast, Convolutional Neural Networks (CNNs) in deep learning do not rely on hand-crafted filters. Instead, they learn the optimal filters directly from the data during the training process. The initial values of these filters are usually set randomly (or using certain initialization strategies), and they get updated via backpropagation to best fit the data. 

Over the years, CNNs have demonstrated superior performance in many computer vision tasks compared to traditional methods. This is largely because they can learn complex hierarchies of features directly from data, without relying on manually engineered features which may not capture all nuances of the data.
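To make the idea of a fixed, hand-crafted filter concrete, here is a minimal pure-Python sketch that slides the standard horizontal Sobel kernel over a tiny toy image. The helper name `correlate2d_valid` and the image are made up for illustration; with OpenCV you would normally call `cv2.Sobel` instead of writing the loop yourself.

```python
# Standard 3x3 Sobel kernel for the horizontal (x) gradient: fixed, never learned.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]


def correlate2d_valid(image, kernel):
    """Slide a square kernel over the image with no padding and stride 1."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# 4x6 toy image: dark left half (0) and bright right half (9), i.e. one vertical edge.
image = [[0, 0, 0, 9, 9, 9] for _ in range(4)]
response = correlate2d_valid(image, SOBEL_X)
for row in response:
    print(row)   # each row is [0, 36, 36, 0]: strong response only at the edge
```

A CNN layer performs the same sliding-window computation. The difference is that the nine numbers in the kernel start out random and are adjusted by backpropagation instead of being fixed in advance.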

Both traditional computer vision techniques like the Canny edge detector and Convolutional Neural Networks (CNNs) are capable of edge detection, but they are fundamentally different in their approaches and use-cases.

### Traditional Methods (e.g., Canny Edge Detector)

1. **Deterministic**: The algorithm is fixed and designed by human experts to capture specific types of edges.
   
2. **Fast**: These methods are usually computationally less intensive compared to CNNs.

3. **Hand-engineered Features**: The features (edges in this case) are decided in advance and explicitly programmed.

4. **Less Data-dependent**: Useful when you have limited labeled data for training.

5. **Transparency**: Easier to understand, interpret, and debug because you know exactly what the algorithm is doing.

6. **Limited Flexibility**: Designed for a specific task and may not generalize well to more complex or different types of tasks.

### Convolutional Neural Networks (CNNs)

1. **Learned Features**: The network learns to identify the features that are most useful for a given task during the training process.

2. **Data-dependent**: Requires a substantial amount of labeled data to train effectively.

3. **Highly Flexible**: Can learn complex hierarchical features and generalize to a wide range of tasks.

4. **Computationally Intensive**: Requires more computational resources, especially for large networks.

5. **Black Box**: Often difficult to interpret, as the features are learned implicitly during training.

6. **End-to-End Learning**: CNNs can learn to perform the entire task (e.g., not just edge detection but object recognition) in a single pass, which can be an advantage.

### When to Use Which:

1. **Resource Constraints**: If computational resources are limited, traditional methods may be more suitable.

2. **Data Availability**: If you have limited labeled data, traditional methods may offer a better solution.

3. **Task Complexity**: For simple edge detection, traditional methods might suffice. For complex tasks where edge detection is just a part of the overall objective, CNNs can be more suitable.

4. **Interpretability**: If you need to understand, explain, or debug the algorithm, traditional methods offer more transparency.

5. **Real-Time Requirements**: Traditional methods can be faster and may be more suitable for real-time applications.

6. **Generalization**: If the task requires recognizing complex patterns and generalizing across varied scenarios, CNNs are more capable.

In summary, traditional methods like the Canny edge detector are generally quicker and easier to implement and understand, but they are less flexible and may not generalize well to complex tasks. CNNs, on the other hand, are more flexible and can learn complex features but are computationally intensive and require more data. The choice between the two would depend on the specific requirements of your project.

## Recap

![image.png](attachment:image.png)

**Save and Load model** 

If you had no option to save the model and wanted to make predictions a month later with the same model, you would have to retrain it, because all the weights and biases a model learns live only in memory. A month later they will no longer be in memory, so you would need to retrain the model.

If the model is saved, you can just load it into memory and start making predictions.
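The idea can be sketched with a toy model whose "learned" parameters are written to disk and read back later. This is only an illustration under simple assumptions (a two-parameter linear model stored as JSON, with made-up values); with Keras you would typically use `model.save(...)` and `keras.models.load_model(...)`, which store the architecture together with the weights.

```python
import json
import os
import tempfile


def predict(params, x):
    """Toy linear model: y = w * x + b."""
    return params["w"] * x + params["b"]


# Pretend these values came out of a long training run (made-up numbers).
trained = {"w": 2.5, "b": -1.0}

# Save: persist the learned parameters so they outlive the current process.
path = os.path.join(tempfile.mkdtemp(), "toy_model.json")
with open(path, "w") as f:
    json.dump(trained, f)

# Load: a month later, read the parameters back instead of retraining.
with open(path) as f:
    restored = json.load(f)

print(predict(restored, 4.0))   # 9.0, same as the original model would give
```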

![image.png](attachment:image.png)

If your image classifier is trained to recognise cat images, it will not be able to identify humans in a picture. Recognising cats or dogs is a classification problem: the classifier assigns one label to the whole image and does not recognise multiple parts of the image.

This scenario does not fall under image classification; it falls under object detection within an image.

**Kernel**

The number of trainable parameters depends on the kernel size: a small kernel means fewer parameters to train, and a large kernel means more, because the kernel values are the weights the model learns as the filter/kernel convolves over the image. There is no standard kernel size; we can experiment with it based on the problem.
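The relationship between kernel size and parameter count can be checked with a little arithmetic. The sketch below assumes a standard 2D convolution layer with one bias per filter; the function name and the example layer (3 input channels, 32 filters) are chosen only for illustration.

```python
def conv2d_param_count(kernel_size, in_channels, filters, use_bias=True):
    """Trainable parameters of a standard 2D convolution layer."""
    kh, kw = kernel_size
    weights = kh * kw * in_channels * filters   # one kh x kw x in_channels kernel per filter
    biases = filters if use_bias else 0         # one bias per filter
    return weights + biases


for k in (3, 5, 7):
    count = conv2d_param_count((k, k), in_channels=3, filters=32)
    print(f"{k}x{k} kernel: {count} parameters")   # 896, 2432, 4736
```

Note that the count does not depend on the height or width of the input image, only on the kernel size, the number of input channels, and the number of filters.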

**Padding and Stride**

With padding='same' and stride=1, the output feature map has the same height and width as the input image.
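A short sketch of the output-size arithmetic along one spatial dimension, assuming the usual 'same' and 'valid' padding conventions used by Keras-style layers. The function name is hypothetical.

```python
import math


def conv_output_size(n, kernel, stride=1, padding="valid"):
    """Output length along one spatial dimension of a convolution."""
    if padding == "same":
        # Input is padded so that only the stride shrinks the output.
        return math.ceil(n / stride)
    if padding == "valid":
        # No padding: the kernel must fit entirely inside the input.
        return (n - kernel) // stride + 1
    raise ValueError(f"unknown padding: {padding!r}")


print(conv_output_size(28, kernel=3, stride=1, padding="same"))    # 28
print(conv_output_size(28, kernel=3, stride=1, padding="valid"))   # 26
print(conv_output_size(28, kernel=3, stride=2, padding="same"))    # 14
```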

**Pool Size**

Normally we keep $2 \times 2$, so the height and width of the feature map are halved every time. If the output from the convolution layer is very large, we may need to test pool sizes other than the standard $2 \times 2$.
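The halving can be traced with a few lines of Python. This sketch assumes a pooling layer with no padding whose stride equals its pool size (the common default); the starting size of 64 is just an example.

```python
def pooled_size(n, pool=2):
    """Output length along one dimension after pooling (stride = pool size, no padding)."""
    return n // pool


size = 64
sizes = [size]
for _ in range(3):          # three pooling layers in a row
    size = pooled_size(size)
    sizes.append(size)

print(sizes)   # [64, 32, 16, 8]
```

Because both height and width are halved, each $2 \times 2$ pooling step keeps only a quarter of the values in the feature map.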

## Data Augmentation

### Image Data Generator arguments

Many of the arguments of the image data generator let you augment your data. Parameters like rotation_range, zoom_range, and horizontal_flip randomly apply changes to your data, which brings in more variation.

![image.png](attachment:image.png)

**Question**


If the training data changes in every epoch because of augmentation, how will the loss ever reduce?



**Solution**

That's a good question. While data augmentation does introduce variability in the training data across epochs, it doesn't mean the data becomes completely unrecognizable or drastically different from the original. Instead, the augmented data retains the core characteristics of the original data, but with slight perturbations or modifications.

Here's why the loss can still reduce despite the variability introduced by data augmentation:

1. **Consistent Semantics**: Data augmentation techniques such as rotation, zooming, or flipping don't change the semantic content of the image. For example, a rotated image of a cat is still an image of a cat. The model still learns the underlying patterns to recognize cats, even if they appear at different orientations.

2. **Generalization**: Augmentation acts as a regularizer. By showing the model different variations of the same data, the model becomes more robust and less sensitive to small changes in the input. This robustness often translates to better generalization on the validation and test datasets.

3. **Avoiding Overfitting**: Without augmentation, a model, especially a deep one, might start to memorize the training samples rather than learning the general patterns. By introducing augmentation, the model is less likely to overfit to specific instances of the training data, as it's constantly exposed to slightly different variations.

4. **Loss Computation**: The loss is computed based on the difference between the predicted labels and the true labels. Since the labels remain consistent (e.g., all augmented versions of a cat image still have the label "cat"), the model can still improve its predictions and reduce the loss over time.

5. **Gradual Improvement**: While each epoch might present slightly different data to the model, the changes are incremental. The model will adjust its weights based on the augmented data, refining its predictions over time.

6. **Learning Rate and Optimization**: Modern optimization techniques, combined with appropriate learning rates, ensure that the model converges towards a solution even in the presence of noise or variability in the data.

In practice, data augmentation has been shown to improve model performance across a variety of tasks, especially when the amount of labeled training data is limited. The key is to use reasonable augmentation techniques that represent plausible variations the model might encounter in real-world scenarios.
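The point about consistent labels can be illustrated with a tiny example. This is a sketch under simple assumptions: the "image" is a 2x2 list of pixel values, the only augmentations are random flips, and the function name is hypothetical.

```python
import random


def random_augment(image, rng):
    """Randomly flip a 2D image (a list of rows) horizontally and/or vertically."""
    if rng.random() < 0.5:
        image = [row[::-1] for row in image]   # horizontal flip
    if rng.random() < 0.5:
        image = image[::-1]                    # vertical flip
    return image


rng = random.Random(42)
image, label = [[1, 2], [3, 4]], "cat"

for epoch in range(3):
    augmented = random_augment(image, rng)
    # The pixel arrangement may differ per epoch, but the label never changes.
    print(f"epoch {epoch}: image={augmented}, label={label}")
```

Every variant contains the same pixel values and carries the same label, so the model is always being pushed towards the same target. That is why the loss can keep falling even though the exact inputs differ between epochs.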

## How many random images will be generated during augmentation?


In practice, when you train a model using this data generator, you'll typically specify a batch_size and steps per epoch. The generator then provides batch_size augmented images for each step of each epoch. The actual images in each batch will be different for each epoch due to the random nature of the augmentations.



There are ways to control this, e.g. whether 1, 30, or 100 augmented versions of each image go into one epoch.

I do not want to create many duplicates while training my model. If I have 100 images in the training samples, I will create 100 augmented versions in one epoch: exactly one augmented version per image, with no image augmented many times. Different transformations can be picked (e.g. rotation, flip, shift), but each image gets one augmented version. I want to control this, otherwise too many near-identical images will keep coming up again and again.
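The "one augmented version per image per epoch" scheme can be sketched as a small generator. This is an illustration in plain Python, not the Keras generator itself: the images are stand-in lists, the only augmentation is a random reversal, and `augmented_epoch` is a hypothetical name. With 100 images and a batch size of 32, one pass needs `ceil(100 / 32) = 4` steps.

```python
import math
import random


def augmented_epoch(images, batch_size, rng):
    """Yield batches in which every source image appears exactly once, randomly augmented."""
    order = list(range(len(images)))
    rng.shuffle(order)                          # new order every epoch
    for start in range(0, len(order), batch_size):
        batch = []
        for idx in order[start:start + batch_size]:
            img = images[idx]
            if rng.random() < 0.5:
                img = img[::-1]                 # stand-in augmentation: reverse the 1D "image"
            batch.append((idx, img))
        yield batch


images = [[i, i + 1, i + 2] for i in range(100)]    # 100 stand-in training images
rng = random.Random(0)
batches = list(augmented_epoch(images, batch_size=32, rng=rng))

steps_per_epoch = math.ceil(len(images) / 32)
print(len(batches), steps_per_epoch)                # 4 4
print(sum(len(b) for b in batches))                 # 100 augmented images, one per source image
```

Setting the number of steps per epoch higher than this would make the generator cycle through the data again within the same epoch, producing additional augmented versions of the same images.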

<font color=blue>*Note*

*When you are training neural nets you cannot afford to have too little data. Augmentation brings in more data with more variety across your epochs, so it can significantly improve the performance of your model by adding variety and variation to the input images.*

**Question**

How important is batch size? Will it affect performance?

**Solution**

On the training data the error rate should not be bad; if for some reason it is bad, that is called underfitting. For example, a decision tree with a very small depth, such as 1, cannot make good predictions even on the training data itself: it is underfitting.


Underfitting reasons:

1. The model is not adequate: you need to make the model more complex (more layers, etc.).
2. Proper data is not being provided for training. In each epoch we keep passing batches of data, not the entire dataset in one go.
3. We pass batches because we need to cater to large volumes of data when training our model.
4. So either your model is not complex enough, or you are not passing adequate data for each weight update. Try changing the batch size (32, 50, 100, etc.).
5. Increasing the batch size gives more data to every weight update, so that the loss and accuracy calculated at the end of an epoch should be decent numbers.
6. If you watch the output of model.fit while the model is running, the training accuracy increases within an epoch as more batches are processed.

![image.png](attachment:image.png)

At batch 8 the accuracy is 0.38 and at batch 21 it is 0.48. Once all the training batches in an epoch are complete, the validation data is used and you get the validation accuracy. Here, for example, the validation accuracy appears after all 27 training batches of the epoch are complete, and it is evaluated on the validation data.
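The accuracy shown during an epoch is a running average over the batches processed so far, which is why it can climb from batch to batch. The sketch below shows the bookkeeping with made-up numbers (four batches of 32 images, with the model getting more images right as its weights are updated); the function name and the counts are hypothetical and are not taken from the run in the screenshot.

```python
def running_accuracy(correct_per_batch, batch_sizes):
    """Accuracy over all samples seen so far, recorded after each batch."""
    seen = 0
    correct = 0
    history = []
    for c, n in zip(correct_per_batch, batch_sizes):
        correct += c
        seen += n
        history.append(correct / seen)
    return history


# Hypothetical counts of correctly classified images in each batch of one epoch.
batch_sizes = [32, 32, 32, 32]
correct_per_batch = [8, 12, 16, 20]

history = running_accuracy(correct_per_batch, batch_sizes)
for step, acc in enumerate(history, start=1):
    print(f"batch {step}: running accuracy = {acc:.4f}")
```

The validation accuracy is different: it is computed once, after the last training batch of the epoch, on data that was not used for the weight updates.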

