<a href="https://colab.research.google.com/github/tn-220/EmergingClass/blob/main/cats_dogs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Microsoft hosts the Kaggle Cats and Dogs data set, which contains cat and dog images. You can download it from:

https://www.microsoft.com/en-us/download/confirmation.aspx?id=54765

After downloading the data set, unzip it.



I put about 1,500 cat images in MyDrive/pets/Cats and about 1,500 dog images in MyDrive/pets/Dogs.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import numpy as np                # a library used for working with arrays and numerical operations.
import matplotlib.pyplot as plt   # a data visualization library
import os                         # a module that provides a way of interacting with the operating system
import cv2                        # a library for image and video processing.
from tqdm import tqdm             # a library that provides a progress bar for iterative operations in a loop or on an iterable object.

data_dir = "/content/drive/MyDrive/pets/"   # My directory, which contains the two subdirectories Cats and Dogs. Change this to your own path.

categories = ["Dogs","Cats"]

for item in categories:
    path = os.path.join(data_dir, item)  # path to the Dogs or Cats directory
    for img in os.listdir(path)[:3]:  # preview only the first three images
        img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)  # read the file as a grayscale array
        plt.imshow(img_array, cmap='gray')  # plot it
        plt.show()  # display it
        print(img)  # print the file name


In [None]:
print(img_array)

In [None]:
print(img_array.shape)

In [None]:
IMG_SIZE = 100

new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
plt.imshow(new_array, cmap='gray')
plt.show()

In [None]:
IMG_SIZE = 200

new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
plt.imshow(new_array, cmap='gray')
plt.show()

In [None]:
training_data = []

def create_training_data():
    """Load every image, resize it, and append [image, label] to training_data."""
    for item in categories:
        path = os.path.join(data_dir, item)
        class_num = categories.index(item)  # class label: 0 = dog, 1 = cat

        for img in tqdm(os.listdir(path)):  # iterate over each image
            try:
                img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)  # read as a grayscale array
                new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))  # resize so all images have the same shape
                training_data.append([new_array, class_num])  # add the image and its label to the training data
            except Exception:  # unreadable or corrupt files make cv2.resize fail; skip them
                pass

In [None]:
create_training_data()

print("\n",len(training_data))
print("\n",type(training_data))

In [None]:
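# np.shape() raises an error on this list in recent NumPy versions,
# because each sample mixes an image array with an integer label.
print(len(training_data))          # number of [image, label] samples
print(training_data[0][0].shape)   # shape of the first image array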
print(training_data[0])

In [None]:
import random

random.shuffle(training_data)

"""
random.shuffle reorders the list training_data in place.
Without it, all dog images would come before all cat images, so the model
would see long runs of a single class, and the samples held out by
validation_split (taken from the end of the data) would all be cats.
"""


In [None]:
for sample in training_data[:10]:
    print(sample[1])
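
To see why shuffling a list of `[features, label]` pairs is safe, here is a minimal sketch with toy data. The name `toy_data` and the seed value are illustrative assumptions, not part of the notebook. Each pair moves as a unit, so every image keeps its own label, and seeding the generator makes the order repeatable.

```python
import random

# Toy stand-in for training_data: [features, label] pairs, all 0s first, then all 1s.
toy_data = [[f"dog_{i}", 0] for i in range(5)] + [[f"cat_{i}", 1] for i in range(5)]

rng = random.Random(42)   # hypothetical seed, used only to make the run repeatable
rng.shuffle(toy_data)     # shuffles in place, like random.shuffle(training_data)

labels = [label for _, label in toy_data]
print(labels)             # the order of 0s and 1s is now randomized

# Every feature is still paired with its original label.
pairs_intact = all(name.startswith("dog") == (label == 0) for name, label in toy_data)
print(pairs_intact)
```

Shuffling `X` and `y` as two separate lists would break this pairing, which is why the notebook shuffles before splitting the samples into features and labels.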


In [None]:
len(training_data)   # np.shape() fails here because each sample mixes an array with an int

In [None]:
X = []
y = []

for features, label in training_data:
    X.append(features)
    y.append(label)

X = np.array(X).reshape(-1, IMG_SIZE, IMG_SIZE, 1)    # (batch_size, height, width, channels), 1 channel for grayscale 


In [None]:
import pickle       # a module for serializing and de-serializing Python objects

# Saving the arrays with pickle avoids repeating the preprocessing
# every time the notebook is run.

with open("X.pickle", "wb") as pickle_out:   # the with block closes the file automatically
    pickle.dump(X, pickle_out)

with open("y.pickle", "wb") as pickle_out:
    pickle.dump(y, pickle_out)


In [None]:
# Load the data that was previously saved with pickle.dump().

with open("X.pickle", "rb") as pickle_in:
    X = pickle.load(pickle_in)

with open("y.pickle", "rb") as pickle_in:
    y = pickle.load(pickle_in)


In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D


X = X/255.0

model = Sequential()  #  create a new sequential model in Keras where you can easily add, remove or modify the layers in the model.

model.add(Conv2D(256, (3, 3), input_shape=X.shape[1:])) # the layer will learn 256 different filters during training, each looking for a different feature in the input image.
model.add(Activation('relu'))             # an activation function that returns the input if it is positive, and 0 otherwise.
model.add(MaxPooling2D(pool_size=(2, 2))) # to reduce the size of the output feature maps while preserving the most important features.


model.add(Conv2D(256, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Flatten())  # this converts our 3D feature maps to 1D feature vectors

model.add(Dense(64))  # adds a fully connected layer with 64 neurons; no activation is specified, so its output is linear.

model.add(Dense(1))   # adds a fully connected layer with a single neuron to the neural network model.
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(np.array(X), np.array(y), batch_size=32, epochs=20, validation_split=0.3)
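
As a rough check on the architecture above, the sketch below traces the feature-map size through the layers by hand. It assumes `IMG_SIZE = 200` (the last value set earlier) and the Keras defaults: `'valid'` padding and stride 1 for `Conv2D`, and a stride equal to the pool size for `MaxPooling2D`. The helper names are hypothetical.

```python
def conv_out(size, kernel=3):
    """Output width of a 'valid', stride-1 convolution."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output width of max pooling with stride equal to the pool size."""
    return size // pool

size = 200                          # assumed IMG_SIZE
size = pool_out(conv_out(size))     # first Conv2D + MaxPooling2D: 200 -> 198 -> 99
size = pool_out(conv_out(size))     # second Conv2D + MaxPooling2D: 99 -> 97 -> 48

filters = 256
flat = size * size * filters        # length of the vector produced by Flatten
dense_params = flat * 64 + 64       # weights + biases of Dense(64)

print(size, flat, dense_params)     # 48 589824 37748800
```

Under these assumptions, almost all of the model's parameters sit in the `Dense(64)` layer, so a smaller `IMG_SIZE` or fewer filters reduces the model size considerably.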



**Batch_size:** the number of training examples used in one iteration, that is, one parameter update. The training data is divided into batches, and after each batch the model's parameters are updated using the gradient of the loss averaged over that batch. A batch size that is too small gives noisy gradient estimates, while one that is too large can exhaust memory.


**Epochs:** the number of times training passes through the entire training dataset. One epoch is one full pass, so `epochs=20` means the model sees every training image 20 times.
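
To make the relationship between batch size and epochs concrete, this sketch counts parameter updates for the `fit` call above. The sample count of 3,000 is an assumption based on the roughly 1,500 images per class mentioned earlier; the real number depends on how many files load successfully.

```python
import math

n_samples = 3000          # assumed: about 1500 cats + 1500 dogs
validation_split = 0.3
batch_size = 32
epochs = 20

n_val = round(n_samples * validation_split)         # samples held out for validation
n_train = n_samples - n_val                         # samples actually used for training
steps_per_epoch = math.ceil(n_train / batch_size)   # the last batch may be smaller
total_updates = steps_per_epoch * epochs

print(n_train, steps_per_epoch, total_updates)      # 2100 66 1320
```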


**Validation_split:** the fraction of the data passed to `fit` that is held out as a validation set. Keras takes these samples from the end of the arrays before any per-epoch shuffling, trains only on the rest, and evaluates the model on the held-out samples after each epoch. Because the model never trains on them, the validation metrics show how well it generalizes and can help detect overfitting.
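
The sketch below mimics a validation split by position, using NumPy slicing on toy labels rather than Keras itself. It illustrates why the data should be shuffled first: the held-out samples come from the end of the arrays, so unshuffled data (all dogs, then all cats) would leave only one class in the validation set. The helper `split_tail` is hypothetical.

```python
import numpy as np

def split_tail(y, validation_split):
    """Hold out the last fraction of y, the way a positional split does."""
    n_val = round(len(y) * validation_split)
    split_at = len(y) - n_val
    return y[:split_at], y[split_at:]

# Unshuffled toy labels: 10 dogs (0) followed by 10 cats (1).
y_sorted = np.array([0] * 10 + [1] * 10)
train_sorted, val_sorted = split_tail(y_sorted, 0.3)
print(np.unique(val_sorted))        # only class 1 is left in the validation set

# After shuffling, the held-out tail is a random sample of both classes.
rng = np.random.default_rng(0)      # seed chosen only for repeatability
y_shuffled = rng.permutation(y_sorted)
train_shuffled, val_shuffled = split_tail(y_shuffled, 0.3)
print(len(train_shuffled), len(val_shuffled))   # 14 6
```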