# TOPIC: Understanding Pooling and Padding in CNN

1.  
Pooling, also known as downsampling, is a crucial operation in Convolutional Neural Networks (CNNs). Its primary purpose is to progressively reduce the spatial dimensions (width and height) of the input volume, while retaining the most important information. Pooling layers are typically inserted between successive convolutional layers in a CNN architecture. The two most common types of pooling are max pooling and average pooling.

Here's a breakdown of the purpose and benefits of pooling in CNNs:

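Dimensionality Reduction: Pooling reduces the spatial size of each feature map, which lowers the computational cost of the network. Pooling layers have no learnable parameters of their own; what they shrink is the number of activations passed on. This is essential for managing memory usage and computational resources, especially in deeper networks where the number of activations, and the parameters of any fully connected layers that consume them, can become extremely large.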

Translation Invariance: Pooling provides a certain degree of translation invariance. This means that the exact location of a feature within the input is less important. For example, if a particular feature is detected in one region of the image, max pooling will select the most prominent activation in that region, regardless of its precise location.

Increased Receptive Field: Pooling effectively increases the receptive field of the neurons in the subsequent layers. By reducing the spatial dimensions, pooling allows neurons in deeper layers to capture information from a larger area of the input volume, which can lead to learning more abstract features.

Feature Generalization: Pooling helps in generalizing the features learned by the convolutional layers. By aggregating information from small local neighborhoods, pooling enables the network to focus on the most salient features while discarding irrelevant details, thus promoting better generalization to unseen data.

Control Overfitting: Pooling can help in reducing overfitting by introducing a form of regularization. By summarizing the presence of features in small regions, pooling reduces the chances of overfitting to noisy details in the training data.

Computational Efficiency: Pooling reduces the spatial dimensions of the feature maps, leading to a decrease in the number of parameters and computations required in subsequent layers. This makes the network more computationally efficient, both in terms of memory usage and processing time.

Overall, pooling plays a crucial role in CNNs by facilitating dimensionality reduction, enhancing translation invariance, increasing the receptive field, promoting feature generalization, controlling overfitting, and improving computational efficiency. These benefits contribute to the effectiveness of CNNs in various computer vision tasks such as image classification, object detection, and segmentation.
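
The dimensionality-reduction and translation-invariance points above can be checked on a toy signal. The sketch below is illustrative only: `max_pool_1d` is a hypothetical helper written in plain Python (not a library function), and it pools a 1D list instead of a 2D feature map to keep the arithmetic visible.

```python
def max_pool_1d(signal, size=2, stride=2):
    """Max pooling over a 1D list; windows do not overlap when size == stride."""
    return [max(signal[i:i + size]) for i in range(0, len(signal) - size + 1, stride)]

original = [0, 9, 0, 0, 0, 3, 0, 0]
shifted = [9, 0, 0, 0, 3, 0, 0, 0]   # same features, moved one step left
crossed = [0, 0, 9, 0, 0, 0, 3, 0]   # moved one step right, across window boundaries

print(max_pool_1d(original))  # [9, 0, 3, 0]
print(max_pool_1d(shifted))   # [9, 0, 3, 0]
print(max_pool_1d(crossed))   # [0, 9, 0, 3]
```

The pooled output is half the length of the input, and the one-step shift is absorbed as long as each feature stays inside its pooling window. A shift that crosses a window boundary, as in `crossed`, changes the output, which is why the invariance is only approximate.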

2.  Max Pooling:

Max pooling is a pooling operation often used in convolutional neural networks (CNNs).
In max pooling, for each region of the input feature map, the maximum value within that region is taken.
It helps in retaining the most activated features, discarding less relevant ones, and providing a form of translation invariance.

Min Pooling:

Min pooling is not as commonly used as max pooling but operates similarly.
Instead of taking the maximum value, it takes the minimum value within each region of the input feature map.
This operation can be useful in certain scenarios, such as when you want to highlight the least activated features or suppress noise.

The main difference between the two lies in the operation they perform on the input feature maps. Max pooling retains the strongest activation in each region, while min pooling retains the weakest. Max pooling is far more common in practice because a strong activation usually signals that a feature is present, so keeping it preserves the most useful information.
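
As a minimal sketch of the operations described above, the following plain-Python example applies max, min, and average pooling to the same 4x4 feature map. The helper names `pool2d` and `mean` and the sample values are assumptions made for illustration; they are not taken from any library.

```python
def pool2d(fmap, size=2, stride=2, reduce_fn=max):
    """Apply reduce_fn to every size x size window of a 2D list."""
    rows = range(0, len(fmap) - size + 1, stride)
    cols = range(0, len(fmap[0]) - size + 1, stride)
    return [[reduce_fn([fmap[r + i][c + j] for i in range(size) for j in range(size)])
             for c in cols] for r in rows]

def mean(values):
    return sum(values) / len(values)

feature_map = [
    [1, 3, 2, 0],
    [5, 6, 1, 2],
    [0, 2, 9, 4],
    [1, 1, 3, 7],
]

max_pooled = pool2d(feature_map, reduce_fn=max)   # [[6, 2], [2, 9]]
min_pooled = pool2d(feature_map, reduce_fn=min)   # [[1, 0], [0, 3]]
avg_pooled = pool2d(feature_map, reduce_fn=mean)  # [[3.75, 1.25], [1.0, 5.75]]
print(max_pooled, min_pooled, avg_pooled)
```

Only the reduction function changes between the three variants; the window traversal is identical.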






3.   Padding in Convolutional Neural Networks (CNNs) is a technique used to preserve the spatial dimensions of the input volume after applying convolutional operations. It involves adding additional pixels around the border of the input image or feature map. These additional pixels are usually filled with zeros, hence the term "zero-padding." The size of the padding is determined by parameters specified by the user or chosen based on the desired output size.

Here's a detailed discussion of the concept of padding in CNNs and its significance:

Preservation of Spatial Dimensions: One of the primary reasons for using padding is to preserve the spatial dimensions of the input volume. Convolutional operations, especially with large filter sizes, can cause the spatial dimensions of the feature maps to shrink with each layer. Padding helps maintain the spatial size of the feature maps, which can be crucial for retaining spatial information and preventing information loss, especially at the borders of the input.

Centering the Convolution Kernel: Padding allows the convolutional kernel to be centered on every pixel of the input, including those on the border. Without padding, the kernel can only be centered on pixels that are at least half a kernel width away from the edge, so border pixels contribute to fewer outputs and the feature maps shrink.

Boundary Effects Mitigation: Convolution operations near the borders of the input feature maps tend to have fewer valid positions for the kernel to slide over, leading to reduced output size and potential information loss. Padding helps mitigate these boundary effects by ensuring that every pixel in the input has the same opportunity to be processed by the convolutional kernel.

Control Over Output Size: By adjusting the amount of padding, users can control the spatial dimensions of the output feature maps. This is particularly useful when designing CNN architectures, as it provides flexibility in determining the size of the feature maps at each layer, which can affect the overall performance of the network.

Handling Strides: Padding also plays a role in handling convolutional layers with strides greater than 1. When the stride is greater than 1, the convolutional operation skips certain positions in the input, resulting in a reduction in the size of the output feature maps. Padding can be used to compensate for this reduction and achieve the desired output size.

Improvement in Performance: In some cases, padding can improve the performance of CNNs by reducing the loss of spatial information at the borders. It is particularly helpful in deeper architectures, where unpadded feature maps would otherwise shrink at every convolutional layer and limit how many layers can be stacked.

Overall, padding is a fundamental concept in CNNs that helps preserve spatial information, mitigate boundary effects, control the output size, and improve the performance of convolutional operations, making it an essential component in the design and implementation of CNN architectures for various computer vision tasks.
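
To make the size-preservation point concrete, here is a small sketch in plain Python. `zero_pad` and `conv2d_valid` are hypothetical helpers written for this example, the kernel is assumed to be square, and the "convolution" is the stride-1 cross-correlation that CNN libraries usually compute.

```python
def zero_pad(image, pad):
    """Surround a 2D list with `pad` rows and columns of zeros."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    padded += [[0] * pad + list(row) + [0] * pad for row in image]
    padded += [[0] * width for _ in range(pad)]
    return padded

def conv2d_valid(image, kernel):
    """Slide a square kernel over the image at stride 1, with no padding."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [[sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
             for c in range(out_w)] for r in range(out_h)]

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
box = [[1, 1, 1]] * 3  # 3x3 filter that sums its window

no_pad = conv2d_valid(image, box)
padded = conv2d_valid(zero_pad(image, 1), box)
print(len(no_pad), len(no_pad[0]))   # 2 2
print(len(padded), len(padded[0]))   # 4 4
```

Without padding the 4x4 input shrinks to 2x2; with one ring of zeros the output stays 4x4, and the corner pixels now sit at the center of a window.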



4.  When it comes to convolutional neural networks (CNNs), padding plays a crucial role in determining the size of the output feature maps after convolution. Let's discuss "same-padding" (usually implemented by padding with zeros) and "valid-padding" and their effects on the output feature map size:

Same-padding (Zero-padding):

Same-padding is a technique where the input is padded with zeros around the border so that the output feature map has the same spatial dimensions as the input.
This is typically achieved by adding zeros to the input image or feature map around its borders before applying the convolution operation.
Same-padding helps in retaining the spatial dimensions of the input, which is useful when you want to maintain the spatial resolution throughout the network.
The formula to calculate the output size with same-padding is:

$$\text{Output size} = \frac{\text{Input size} + 2 \times \text{padding} - \text{filter size}}{\text{stride}} + 1$$

Valid-padding:

Valid-padding is a padding technique where no padding is applied to the input. It means the convolution operation is applied only to positions where the filter fits entirely within the input without overflowing.
With valid-padding, the spatial dimensions of the output feature map are reduced compared to the input, as the convolution operation can't be applied near the borders.
This is useful when you want to reduce the spatial dimensions gradually, as you move deeper into the network, which is common in many CNN architectures.
The formula to calculate the output size with valid-padding is:

$$\text{Output size} = \frac{\text{Input size} - \text{filter size}}{\text{stride}} + 1$$
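
The two formulas above can be wrapped in a single helper, since valid-padding is the case where padding is 0. This is a sketch under stated assumptions: `conv_output_size` and `same_padding` are hypothetical names, the floor division reflects how frameworks commonly treat strides that do not divide evenly (the formulas above do not spell this out), and `same_padding` assumes an odd filter size and a stride of 1.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Output size along one spatial dimension; padding=0 gives valid-padding."""
    return (input_size + 2 * padding - filter_size) // stride + 1

def same_padding(filter_size):
    """Padding per side that preserves the size at stride 1 (odd filter sizes)."""
    return (filter_size - 1) // 2

print(conv_output_size(32, 5))                            # valid-padding: 28
print(conv_output_size(32, 5, padding=same_padding(5)))   # same-padding: 32
print(conv_output_size(28, 2, stride=2))                  # 2x2 pooling, stride 2: 14
```

The same function also covers pooling layers, because a pooling window moves over the input exactly like a filter.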

# TOPIC: Exploring LeNet

1.   LeNet-5 is one of the pioneering Convolutional Neural Network (CNN) architectures developed by Yann LeCun et al. in the 1990s. It was primarily designed for handwritten digit recognition tasks and played a significant role in popularizing CNNs for image classification. Here's an overview of the LeNet-5 architecture:

Input Layer: LeNet-5 takes as input grayscale images of size 32x32 pixels. In the original implementation, the input images were centered and normalized.

Convolutional Layers: LeNet-5 consists of two convolutional layers followed by subsampling layers (pooling layers). The convolutional layers apply a set of learnable filters (kernels) to extract features from the input images. In the original LeNet-5 architecture:

The first convolutional layer uses a 5x5 kernel with a stride of 1. It produces feature maps with 6 channels.
The second convolutional layer also uses a 5x5 kernel with a stride of 1 and produces feature maps with 16 channels.

Subsampling (Pooling) Layers: After each convolutional layer, LeNet-5 employs subsampling layers to reduce the spatial dimensions of the feature maps and control the number of parameters. These subsampling layers use average pooling in the original LeNet-5:

After the first convolutional layer, a subsampling layer with 2x2 average pooling and a stride of 2 is applied.
After the second convolutional layer, another subsampling layer with 2x2 average pooling and a stride of 2 is applied.

Activation Function: Throughout the network, the hyperbolic tangent (tanh) function was used as the activation function, providing non-linearity to the model.

Fully Connected Layers: Following the convolutional and subsampling layers, LeNet-5 includes three fully connected layers:

The first fully connected layer has 120 units.
The second fully connected layer has 84 units.
The final output layer has 10 units, corresponding to the 10 classes (digits 0-9) in the MNIST dataset, which was the dataset LeNet-5 was originally trained on. The original network used Euclidean radial basis function (RBF) units here; modern re-implementations replace them with a softmax layer.

Softmax Activation: In modern re-implementations, the softmax activation function is applied to the output layer to produce a probability distribution over the 10 classes, indicating the likelihood of each input image belonging to a particular digit class.

Training: LeNet-5 was trained using the backpropagation algorithm with stochastic gradient descent (SGD) optimization. It utilized techniques such as weight sharing and gradient-based learning to efficiently learn and update the parameters of the network.

Overall, LeNet-5 demonstrated the effectiveness of CNNs for handwritten digit recognition and laid the foundation for subsequent advancements in deep learning and computer vision. Its simple yet effective architecture paved the way for more complex CNN architectures used in various image recognition tasks today.
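
The layer sizes quoted above can be traced with a few lines of arithmetic. The sketch below assumes no padding in any layer and uses a hypothetical helper `out_size`; the stage labels C1, S2, C3, and S4 are used here only as names for the two convolution and two subsampling stages.

```python
def out_size(size, kernel, stride=1):
    """Spatial output size of an unpadded convolution or pooling stage."""
    return (size - kernel) // stride + 1

# (name, kernel size, stride, output channels)
stages = [
    ("C1 conv 5x5", 5, 1, 6),
    ("S2 pool 2x2", 2, 2, 6),
    ("C3 conv 5x5", 5, 1, 16),
    ("S4 pool 2x2", 2, 2, 16),
]

size = 32  # 32x32 grayscale input
sizes = []
for name, kernel, stride, channels in stages:
    size = out_size(size, kernel, stride)
    sizes.append(size)
    print(f"{name}: {size}x{size}x{channels}")

flattened = size * size * channels
print("inputs to the first fully connected layer:", flattened)  # 400
```

The 32x32 input shrinks to 28, 14, 10, and finally 5 pixels per side, so the first fully connected layer receives 5 x 5 x 16 = 400 values.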






2.  LeNet-5 is a classic convolutional neural network (CNN) architecture, pioneered by Yann LeCun and his colleagues in the 1990s. It was primarily designed for handwritten digit recognition tasks, such as recognizing digits in checks or forms. Here are the key components of LeNet-5 and their respective purposes:

Convolutional Layers:

LeNet-5 consists of two convolutional layers.
The purpose of convolutional layers is to extract various features from the input images.
Each convolutional layer applies a set of learnable filters (also known as kernels) to the input image, convolving them across the input spatial dimensions to produce feature maps.
These feature maps capture different levels of abstraction, such as edges, textures, or more complex patterns.

Pooling Layers:

LeNet-5 includes pooling layers after each convolutional layer.
The pooling layers serve to downsample the feature maps obtained from the convolutional layers.
The original LeNet-5 uses average pooling (subsampling); many modern re-implementations substitute max pooling, which takes the maximum value within each pooling region.
Pooling helps reduce the spatial dimensions of the feature maps, making the network more computationally efficient and providing some degree of translational invariance.

Activation Functions:

LeNet-5 uses the hyperbolic tangent (tanh), a sigmoid-shaped function, as its activation; ReLU (Rectified Linear Unit) appears only in modern re-implementations.
Activation functions introduce non-linearity into the network, enabling it to learn complex mappings between the input and output.
The non-linear activation functions help the network capture intricate patterns and relationships in the data.

Fully Connected Layers:

Following the convolutional and pooling layers, LeNet-5 has three fully connected layers.
Fully connected layers connect every neuron in one layer to every neuron in the next layer, similar to traditional artificial neural networks.
These layers serve as classifiers, taking the high-level features extracted by the convolutional layers and making predictions based on them.
The last fully connected layer typically outputs the class scores, which are then passed through a softmax function to obtain class probabilities.

Softmax Layer:

In modern implementations, the final layer of LeNet-5 is a softmax layer.
Softmax converts the raw class scores into probabilities, ensuring that they sum up to 1.
It provides the probability distribution over the different classes, indicating the network's confidence in its predictions.
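
The conversion from raw class scores to probabilities can be sketched in a few lines of plain Python. The function below is a generic softmax, not code from any LeNet-5 implementation, and the three example scores are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(sum(probs))
```

The largest score receives the largest probability, and the ordering of the classes is preserved.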

3.  LeNet-5, being one of the earliest CNN architectures, brought significant advancements in image classification tasks, particularly in handwritten digit recognition. However, it also has its set of advantages and limitations when applied to modern image classification tasks. Let's discuss them:

Advantages of LeNet-5:
Effective Feature Extraction: LeNet-5 demonstrated the effectiveness of convolutional layers in automatically learning hierarchical features from input images. It showed that convolutional layers followed by pooling layers can efficiently capture spatial hierarchies in the data, making it suitable for image classification tasks.

Parameter Sharing: LeNet-5 relies on parameter sharing, where the same set of weights (kernel) is used across different spatial locations of the input image. This reduces the number of parameters in the network, making it more efficient in terms of memory usage and training time.

Translation Invariance: Through the use of pooling layers, LeNet-5 achieved translation invariance to some extent, enabling it to recognize patterns regardless of their precise location in the input image. This property is particularly useful in tasks where object position may vary slightly.

Simple and Elegant Architecture: LeNet-5 has a relatively simple architecture compared to modern CNNs, which makes it easy to understand and implement. This simplicity also makes it suitable for educational purposes and as a baseline for more complex models.

Highly Effective for Handwritten Digit Recognition: LeNet-5 demonstrated state-of-the-art performance on handwritten digit recognition tasks, especially on datasets like MNIST. It showed that CNNs could achieve remarkable accuracy even with relatively small training datasets.
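
The saving from parameter sharing mentioned above can be estimated with simple counting. The sketch below compares the first convolutional layer (six 5x5 filters on a one-channel 32x32 input) with a hypothetical fully connected layer that maps the same input to the same 28x28x6 output; `conv_params` and `dense_params` are helper names invented for this example, and both counts include one bias per output unit or filter.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus one bias per filter; the count does not depend on image size."""
    return (kernel * kernel * in_channels + 1) * out_channels

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return (n_in + 1) * n_out

shared = conv_params(kernel=5, in_channels=1, out_channels=6)
unshared = dense_params(n_in=32 * 32, n_out=28 * 28 * 6)

print(shared)    # 156
print(unshared)  # 4821600
```

Under these assumptions the convolutional layer needs 156 parameters where an unshared layer of the same shape would need about 4.8 million.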

Limitations of LeNet-5:
Limited Capacity: LeNet-5 has a relatively shallow architecture compared to modern CNNs. It may not have enough capacity to capture complex patterns present in large and diverse datasets, such as those encountered in modern image classification tasks like ImageNet.

Small Receptive Field: Due to its small kernel sizes and limited depth, LeNet-5 has a small receptive field, meaning it may struggle to capture contextual information from distant parts of the input image. This can limit its ability to recognize objects in cluttered scenes or images with complex backgrounds.

Lack of Activation Function Diversity: LeNet-5 predominantly uses the hyperbolic tangent (tanh) activation function, which saturates for large inputs and therefore produces very small gradients during training. Modern CNNs often use activation functions like ReLU (Rectified Linear Unit), which do not saturate for positive inputs, to alleviate this issue.

Not Suitable for Large-Scale Image Classification: While LeNet-5 performs well on small-scale image classification tasks like MNIST, its performance may degrade significantly when applied to larger and more complex datasets due to its limited capacity and shallow architecture.

Training Dynamics: Training LeNet-5 can be slower compared to modern architectures, especially when dealing with large datasets, as it lacks some of the optimization techniques and architectural innovations that have been developed since its inception.
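
The activation-function limitation listed above comes down to the size of the derivative. The following sketch, using only the standard `math` module and two helper names chosen for this example, compares the derivative of tanh with that of ReLU at a few input values.

```python
import math

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

for x in (0.5, 2.0, 5.0):
    print(f"x={x}: tanh'={tanh_grad(x):.4f}, relu'={relu_grad(x):.1f}")
```

As the input grows, the tanh derivative falls toward zero while the ReLU derivative stays at 1, which is one reason gradients pass through deep ReLU networks more easily.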

In summary, while LeNet-5 laid the foundation for modern CNNs and achieved groundbreaking results in handwritten digit recognition, its limitations in terms of capacity, receptive field, and training dynamics make it less suitable for large-scale image classification tasks compared to more contemporary architectures. Nonetheless, it remains a significant milestone in the development of convolutional neural networks and continues to serve as a benchmark in the field of deep learning.






4.

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

# Load and preprocess the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype('float32') / 255.0
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

# Define the LeNet-5 architecture
model = models.Sequential([
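    # padding='same' on the 28x28 MNIST images stands in for the 32x32 input of the original LeNet-5
    layers.Conv2D(6, kernel_size=(5, 5), padding='same', activation='tanh', input_shape=(28, 28, 1)),
    # The original LeNet-5 subsamples with 2x2 average pooling
    layers.AveragePooling2D(pool_size=(2, 2)),
    layers.Conv2D(16, kernel_size=(5, 5), activation='tanh'),
    layers.AveragePooling2D(pool_size=(2, 2)),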
    layers.Flatten(),
    layers.Dense(120, activation='tanh'),
    layers.Dense(84, activation='tanh'),
    layers.Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
history = model.fit(x_train, y_train, epochs=10, batch_size=128, validation_data=(x_test, y_test))

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f'Test accuracy: {test_acc}')


# TOPIC: Analyzing AlexNet

1.  AlexNet is a landmark convolutional neural network (CNN) architecture developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. It won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 by a significant margin, marking a breakthrough in the field of computer vision. Here's an overview of the AlexNet architecture:

Architecture Overview:
Input Layer: AlexNet takes as input RGB images of size 224x224 pixels. The only preprocessing applied is subtracting the mean activity over the training set from each pixel.

Convolutional Layers: AlexNet consists of five convolutional layers. The first uses large 11x11 filters with a stride of 4, the second uses 5x5 filters, and the remaining three use 3x3 filters; all layers after the first use a stride of 1.

Activation Function: Throughout the network, AlexNet uses the rectified linear unit (ReLU) activation function after each convolutional layer. ReLU helps introduce non-linearity into the model and accelerates training compared to traditional activation functions like tanh or sigmoid.

Max Pooling Layers: After the first, second, and fifth convolutional layers, AlexNet employs max-pooling layers to downsample the feature maps. Max pooling reduces the spatial dimensions of the feature maps while retaining the most important features. The max-pooling layers use a 3x3 window with a stride of 2, so neighboring windows overlap.

Normalization Layers: In between some of the convolutional layers, AlexNet includes local response normalization (LRN) layers. These layers help normalize the responses within local neighborhoods across multiple feature maps, enhancing the model's generalization ability.

Fully Connected Layers: Following the convolutional and pooling layers, AlexNet includes three fully connected layers. The first two fully connected layers have 4096 neurons each, while the third fully connected layer serves as the output layer with 1000 neurons corresponding to the 1000 ImageNet classes.

Dropout: To prevent overfitting, AlexNet utilizes dropout regularization after the first two fully connected layers. Dropout randomly sets a fraction of the neurons to zero during training, effectively forcing the network to learn redundant representations and improving generalization.

Softmax Activation: The softmax activation function is applied to the output layer to produce a probability distribution over the 1000 ImageNet classes, indicating the likelihood of each class given the input image.

Training: AlexNet was trained using stochastic gradient descent (SGD) with momentum. Data augmentation techniques such as random cropping and horizontal flipping were also employed to increase the diversity of the training data and improve generalization.
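
The way the feature maps shrink through the network can be traced with the standard output-size formula. This sketch assumes a 227x227 input, a size commonly used so that the first layer's arithmetic works out exactly (the text above quotes 224x224), and it assumes size-preserving padding of 2 for the 5x5 layer and 1 for the 3x3 layers. `out_size` is a hypothetical helper.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling stage."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel size, stride, padding, output channels)
stages = [
    ("conv1 11x11, stride 4", 11, 4, 0, 96),
    ("pool1 3x3, stride 2", 3, 2, 0, 96),
    ("conv2 5x5", 5, 1, 2, 256),
    ("pool2 3x3, stride 2", 3, 2, 0, 256),
    ("conv3 3x3", 3, 1, 1, 384),
    ("conv4 3x3", 3, 1, 1, 384),
    ("conv5 3x3", 3, 1, 1, 256),
    ("pool5 3x3, stride 2", 3, 2, 0, 256),
]

size = 227  # assumed input size, see the note above
trace = []
for name, kernel, stride, padding, channels in stages:
    size = out_size(size, kernel, stride, padding)
    trace.append(size)
    print(f"{name}: {size}x{size}x{channels}")

flattened = size * size * channels
print("inputs to the first fully connected layer:", flattened)  # 9216
```

Under these assumptions the spatial size goes 55, 27, 27, 13, 13, 13, 13, 6, so the first fully connected layer sees 6 x 6 x 256 = 9216 values.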

Advantages of AlexNet:
Deep Architecture: AlexNet was one of the first deep CNN architectures, consisting of eight layers (five convolutional and three fully connected layers). Its depth allowed it to learn complex hierarchical features directly from raw pixels, leading to superior performance on image classification tasks.

ReLU Activation: The use of ReLU activation after each convolutional layer helped alleviate the vanishing gradient problem and accelerate training. ReLU also provided sparsity and improved the representational capacity of the network.

Data Augmentation: AlexNet employed data augmentation techniques during training, such as random cropping and horizontal flipping, which helped prevent overfitting and improved the model's generalization ability.

Dropout Regularization: By incorporating dropout regularization after the fully connected layers, AlexNet effectively prevented overfitting and improved the model's robustness to noise in the training data.

LRN Layers: Local response normalization (LRN) layers helped enhance the model's generalization ability by normalizing responses within local neighborhoods across multiple feature maps.

Limitations of AlexNet:
Computational Complexity: AlexNet's architecture is computationally expensive, especially compared to earlier shallow CNN architectures like LeNet-5. Training and deploying AlexNet require significant computational resources.

Memory Consumption: The large number of parameters in AlexNet's fully connected layers can lead to high memory consumption during both training and inference, making it less suitable for deployment on resource-constrained devices.

Limited Receptive Field: Despite its depth, AlexNet's receptive field is relatively small compared to modern architectures like VGG and ResNet. This limited receptive field may restrict its ability to capture global contextual information in large images.

Overfitting Risk: While dropout regularization helped mitigate overfitting to some extent, AlexNet may still be prone to overfitting, especially when trained on small datasets or with limited data augmentation.

Training Dynamics: Training AlexNet can be challenging due to its depth and complexity. Proper initialization, learning rate scheduling, and regularization techniques are crucial for achieving good performance.
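
The memory-consumption point above can be made concrete by counting parameters. The sketch below assumes a single-tower variant of AlexNet (no split across two GPUs), a 6x6x256 feature map entering the first fully connected layer, and one bias per filter or unit; `conv_params` and `dense_params` are helper names invented for this example, so the totals are estimates for that variant only.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * out_channels

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return (n_in + 1) * n_out

conv_total = sum([
    conv_params(11, 3, 96),
    conv_params(5, 96, 256),
    conv_params(3, 256, 384),
    conv_params(3, 384, 384),
    conv_params(3, 384, 256),
])
fc_total = sum([
    dense_params(6 * 6 * 256, 4096),
    dense_params(4096, 4096),
    dense_params(4096, 1000),
])

print(conv_total)  # 3747200
print(fc_total)    # 58631144
print(f"share held by fully connected layers: {fc_total / (conv_total + fc_total):.1%}")
```

In this variant roughly 94% of the parameters sit in the three fully connected layers, which is why they dominate the memory footprint.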

2.  
AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, was a revolutionary convolutional neural network (CNN) that won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. Its success not only marked a significant milestone in the field of computer vision but also reignited interest in deep learning. AlexNet's architectural innovations contributed to its breakthrough performance, and they can be summarized as follows:

Deeper Architecture:

Before AlexNet, most neural networks were relatively shallow due to computational limitations and a lack of effective training techniques. AlexNet introduced a deeper architecture consisting of 5 convolutional layers followed by 3 fully connected layers, making it significantly deeper than previous CNNs used for similar tasks.

ReLU (Rectified Linear Unit) Activation:

AlexNet was one of the first neural networks to use ReLU as its activation function instead of the traditional tanh or sigmoid. ReLU helped the network train faster by mitigating the vanishing gradient problem, as it provides a linear output for positive input values and zero for negative input values.

Use of Dropout:

AlexNet was among the first large-scale networks to use dropout to reduce overfitting in the fully connected layers. By randomly setting a fraction of the units to 0 at each update during training, dropout prevents complex co-adaptations on training data, making the model more robust and generalizable.
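
A minimal sketch of the mechanism, in plain Python, is shown below. It implements the "inverted" variant that rescales the surviving units during training, which is the form most current frameworks use and is an assumption here, not a description of the original AlexNet code. The function name `dropout` and its parameters are chosen for this example.

```python
import random

def dropout(activations, rate=0.5, training=True, rng=random):
    """Zero each unit with probability `rate`; rescale survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so repeated runs drop the same units
acts = [1.0] * 10
dropped = dropout(acts, rate=0.5, rng=rng)
print(dropped)                          # a mix of 0.0 and 2.0
print(dropout(acts, training=False))    # unchanged at inference time
```

Because a different random subset of units is silenced at every update, no unit can rely on a specific partner always being present.
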

Overlapping Pooling:

Traditional pooling layers in CNNs before AlexNet used non-overlapping pooling windows. AlexNet used overlapping max-pooling, where the stride (2) is smaller than the window size (3), so neighboring windows share inputs. The authors reported that this slightly lowered the error rate and made the model somewhat harder to overfit.
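
The difference between the two schemes is easiest to see by listing which input positions each window covers. The helper `pooling_windows` below is a hypothetical function written for this illustration, and it works along a single dimension of length 7.

```python
def pooling_windows(length, size, stride):
    """Index ranges covered by each pooling window along one dimension."""
    return [list(range(start, start + size))
            for start in range(0, length - size + 1, stride)]

non_overlapping = pooling_windows(7, size=2, stride=2)
overlapping = pooling_windows(7, size=3, stride=2)

print(non_overlapping)  # [[0, 1], [2, 3], [4, 5]]
print(overlapping)      # [[0, 1, 2], [2, 3, 4], [4, 5, 6]]
```

With a window of 3 and a stride of 2, positions 2 and 4 fall into two windows each, while the number of outputs stays the same as in the non-overlapping case.
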

Large-scale Data and GPU Training:

AlexNet was trained on two GPUs for a week, which was unprecedented at the time. The use of GPUs allowed for faster computation, enabling the training of deeper and more complex models on large datasets like ImageNet. Additionally, training on GPUs facilitated the use of larger minibatches, which contributed to the network's convergence and performance.

Data Augmentation:

To further combat overfitting, AlexNet implemented data augmentation techniques such as image translations, horizontal reflections, and alterations in the intensity of the RGB channels. These techniques significantly increased the diversity of the training data, helping the model generalize better to new, unseen data.

Local Response Normalization (LRN):

Although later research showed that LRN is not always necessary, AlexNet used LRN to apply a form of lateral inhibition inspired by the behavior of real neurons: a strong activation in one feature map dampens the responses of neighboring feature maps at the same position, which the authors found improved generalization.
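
A sketch of the normalization at a single spatial position is given below. It divides each channel's activation by a term that grows with the squared activations of its neighboring channels. The default constants (n=5, k=2, alpha=1e-4, beta=0.75) follow the values usually quoted for AlexNet and should be treated as assumptions here, as should the function name.

```python
def local_response_norm(channels, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize each channel by the energy of up to n neighboring channels."""
    half = n // 2
    out = []
    for i, a in enumerate(channels):
        lo = max(0, i - half)
        hi = min(len(channels) - 1, i + half)
        energy = sum(v * v for v in channels[lo:hi + 1])
        out.append(a / (k + alpha * energy) ** beta)
    return out

quiet = local_response_norm([1.0, 0.0])
loud = local_response_norm([1.0, 100.0])
print(quiet[0], loud[0])  # the same unit is damped more next to a strong neighbor
```

The first channel has the same input in both calls, but its normalized response is smaller when the adjacent channel is strongly active.
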

These architectural innovations made AlexNet significantly more effective and efficient than its predecessors, setting a new standard for CNNs in computer vision tasks. Its success not only demonstrated the potential of deep learning in practical applications but also inspired the development of more advanced deep learning models and techniques in the years that followed.






3.    In AlexNet, each type of layer (convolutional, pooling, and fully connected) plays a crucial role in the overall architecture, contributing to its success in image classification tasks. Let's discuss the role of each layer type in AlexNet:

1. Convolutional Layers:
The convolutional layers in AlexNet are responsible for extracting hierarchical features from the input images. These layers consist of learnable filters (kernels) that slide across the input images, performing element-wise multiplication and summation operations to produce feature maps. The key role of convolutional layers in AlexNet includes:

Feature Extraction: Convolutional layers extract low-level features such as edges, textures, and patterns from the input images. As the network progresses through deeper convolutional layers, higher-level features are learned, representing more complex structures.

Hierarchical Representation: The hierarchical nature of convolutional layers allows the network to learn increasingly abstract representations of the input images. Each subsequent convolutional layer builds upon the features learned by the previous layers, enabling the network to capture intricate patterns and objects.

Parameter Sharing: One of the significant advantages of convolutional layers is parameter sharing, where the same set of weights (kernel) is used across different spatial locations of the input image. This reduces the number of parameters in the network, making it more efficient and facilitating the learning of translation-invariant features.

2. Pooling Layers:
Pooling layers in AlexNet are interspersed between convolutional layers and are responsible for reducing the spatial dimensions of the feature maps while preserving the most important information. The pooling layers in AlexNet use max pooling, which selects the maximum value within each pooling region. The key role of pooling layers includes:

Spatial Reduction: Pooling layers downsample the feature maps, reducing their spatial dimensions. This helps in controlling the computational complexity of the network by reducing the number of parameters and computations in subsequent layers.

Translation Invariance: By selecting the maximum value within each pooling region, max pooling introduces a degree of translation invariance. This property enables the network to focus on the most salient features while discarding irrelevant details, making it more robust to object variations in the input images.

3. Fully Connected Layers:
The fully connected layers in AlexNet, also known as dense layers, are responsible for learning high-level representations of the extracted features and making predictions. These layers connect every neuron in one layer to every neuron in the next layer, forming a fully connected network. The key role of fully connected layers includes:

High-Level Representation: Fully connected layers aggregate the features learned by the convolutional layers and transform them into a compact representation suitable for classification. The neurons in these layers capture complex combinations of features and relationships between them.

Classification: The final fully connected layer in AlexNet serves as the output layer and is responsible for predicting the class probabilities of the input images. It typically employs the softmax activation function to produce a probability distribution over the classes.

Model Complexity: Fully connected layers contribute significantly to the model's parameter count and computational complexity. However, they also enable the network to learn complex decision boundaries and achieve high classification accuracy.

4.

import tensorflow as tf
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical

# Load and preprocess the CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

# Define the AlexNet architecture
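model = models.Sequential([
    # Upsample the 32x32 CIFAR-10 images to 227x227; at 32x32 the stride-4 convolution and
    # the three pooling stages shrink the feature maps below the 3x3 pooling window
    layers.Input(shape=(32, 32, 3)),
    layers.Resizing(227, 227),
    # Convolutional layers
    layers.Conv2D(96, kernel_size=(11, 11), strides=(4, 4), activation='relu'),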
    layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2)),
    layers.Conv2D(256, kernel_size=(5, 5), padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2)),
    layers.Conv2D(384, kernel_size=(3, 3), padding='same', activation='relu'),
    layers.Conv2D(384, kernel_size=(3, 3), padding='same', activation='relu'),
    layers.Conv2D(256, kernel_size=(3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2)),
    # Fully connected layers
    layers.Flatten(),
    layers.Dense(4096, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(4096, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
history = model.fit(x_train, y_train, epochs=20, batch_size=128, validation_data=(x_test, y_test))

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f'Test accuracy: {test_acc}')
