![Washu Math](https://sites.wustl.edu/scao/files/2020/10/Screen-Shot-2020-10-25-at-1.03.49-PM.png)


# WashU Math Undergrad Seminar Dec 9 2020

## Thank you for attending and Adeli Hutton for hosting.

# How computers learn to recognize cats and dogs: an introduction to deep learning and the optimization methods behind the curtain

Welcome to an undergrad seminar like you have never seen before! Today we will learn how computers learn to tell cats from dogs using machine learning, along with a little bit of the Python language.

![](https://www.python.org/static/community_logos/python-logo-master-v3-TM.png)
<br/>
<br/>
In this notebook we will use a software package called TensorFlow!
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/11/TensorFlowLogo.svg/1200px-TensorFlowLogo.svg.png" alt="drawing" width="500"/>
<br/>
<br/>
<br/>
We will leave PyTorch for next semester's [Math 450: Optimization Methods in Machine Learning](https://scaomath.github.io/teaching/sp2021-math450).
<img src="https://upload.wikimedia.org/wikipedia/commons/9/96/Pytorch_logo.png" alt="drawing" width="700"/>
<br/>
<br/>
If you have already registered on Kaggle, now please click the **COPY and EDIT** button on the upper right corner.
![](https://sites.wustl.edu/scao/files/2020/10/Screen-Shot-2020-10-25-at-1.09.19-PM.png)

# Notebook style Python

This is called a "**Notebook**".

This is a markdown cell. We can write words in this cell.

Welcome to our math seminar.

## Command vs. Edit Modes

There are two different keyboard input modes:

1. **Command mode** - binds the keyboard to notebook level actions. Indicated by a grey cell border with a blue left margin.
2. **Edit mode** - when you're typing in a cell. Indicated by a green cell border.

Experiment with switching between command and edit modes in this cell. 

Hint: If you're in command mode, type `enter` or double-click to enter edit mode. If you're in edit mode, type `esc` or `cmd`+`m` (`ctrl`+`m` in Windows/Linux) to enter command mode.

In [None]:
print("This is a code cell.")
print("Hello world")
print(f"3+2 is {3+2}")
# a more advanced f-string example

In [None]:
# Simple variable assignment
# This is a comment
x = 5.0

In [None]:
print(type(x), '\n', dir(x))

In [None]:
# simple calculations
# ** means exponentiation
print(2**3)

In [None]:
# simple lambda functions
f = lambda x: x**3

In [None]:
f(2)

# Logic

Computers are really good at computing when given **EXACT and CLEAR** instructions. For example: what is $f(5)$ if $f(x) = x^2 - x + 2$? Or: if something is true, do another thing. However, computers are (or used to be) not so good at many other things, for example recognizing cats and dogs in photos.
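
As a small illustration of an exact and clear instruction, the polynomial above can be evaluated directly. This is a minimal sketch; the function name `poly` is a hypothetical choice made here so that it does not overwrite the lambda `f` defined earlier.

```python
def poly(x):
    """Evaluate x^2 - x + 2, the example polynomial from the text."""
    return x**2 - x + 2

print(poly(5))  # 5^2 - 5 + 2 = 22
```

There is no ambiguity for the computer here: the same input always produces the same output.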

![](https://storage.googleapis.com/kaggle-competitions/kaggle/3362/media/woof_meow.jpg)

In [None]:
# == vs =
a = 2 # we let a be 2

In [None]:
a == 3 # check if a equals 3 (this is False, since a is 2)

In [None]:
# simple if-then condition
# flow control

In [None]:
a = 1
if a == 2:
    print(f'a is {a}')
else:
    print('Nothing')

In [None]:
# browse and introduce the Cats vs dog competition

# Computers can learn from examples!

Just like us! Imagine we are preparing for an exam: teachers give us some practice exams (with answers available), and we train ourselves by doing these practice problems, honing our skills. Then, in the actual exam, we are able to tackle the problems without knowing the answers beforehand (hopefully). The examples are called data.

Imagine how computers would learn calculus just by looking at the problems and solutions (no theorems)...

First let us load some packages into our system.

In [None]:
import os
import numpy as np 
import pandas as pd 
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
import random

In [None]:
dir(pd)

In [None]:
print(os.listdir("../input/dogs-vs-cats/"))

Let us unzip the compressed images in the `train.zip` and `test1.zip` (this may take a while).

In [None]:
!unzip -q '../input/dogs-vs-cats/train.zip'
!unzip -q '../input/dogs-vs-cats/test1.zip'

## Prepare the data

In [None]:
filenames = os.listdir("./train")
print(filenames[:10])

In a computer system, we need to turn abstract words such as "cat" and "dog" into 0s and 1s so that the computer can understand them! We store our data in a DataFrame.

In [None]:
categories = []
for filename in filenames:
    category = filename.split('.')[0]
    if category == 'dog':
        categories.append(1)
    else: # cat
        categories.append(0)

df = pd.DataFrame({
    'filename': filenames,
    'category': categories
})
print(df.head(20))

Now let us view a sample image (randomly chosen).

In [None]:
from keras.preprocessing.image import load_img

In [None]:
sample = df.sample(1)
image = load_img("./train/"+sample['filename'].values[0])
fig = plt.figure()
fig.set_size_inches(6,6)
plt.imshow(image)
print(sample)

# Deep learning model

In [None]:
from keras import layers, applications, optimizers, callbacks
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense, Activation,GlobalMaxPooling2D
from keras.preprocessing.image import ImageDataGenerator
from keras.applications import VGG16
from keras.models import Model, load_model
from keras.utils import plot_model, to_categorical

image_size = 224
input_shape = (image_size, image_size, 3)

epochs = 6
batch_size = 16

pre_trained_model = VGG16(input_shape=input_shape, include_top=False, weights="imagenet")
    
for layer in pre_trained_model.layers[:15]:
    layer.trainable = False

for layer in pre_trained_model.layers[15:]:
    layer.trainable = True
    
last_layer = pre_trained_model.get_layer('block5_pool')
last_output = last_layer.output
    
# Collapse each feature map to a single number with global max pooling
x = GlobalMaxPooling2D()(last_output)
# Add a fully connected layer with 512 hidden units and ReLU activation
x = Dense(512, activation='relu')(x)
# Add a dropout rate of 0.3
x = Dropout(0.3)(x)
# Add a final sigmoid layer for classification
x = layers.Dense(1, activation='sigmoid')(x)

model = Model(pre_trained_model.input, x)

model.compile(loss='binary_crossentropy',
              optimizer=optimizers.Adam(learning_rate=1e-4),
              metrics=['accuracy'])

model.summary()

In [None]:
plot_model(model, to_file='./model_vgg16.png', show_shapes=True)

# Preprocessing the data for the model

- `train_df`: data for training the model.
- `validate_df`: data for validating the trained model (the model has not seen these data before).

In [None]:
df['category'] = df['category'].astype('str')
train_df, validate_df = train_test_split(df, test_size=0.1)
train_df = train_df.reset_index()
validate_df = validate_df.reset_index()

# validate_df = validate_df.sample(n=100).reset_index() # use for fast testing code purpose
# train_df = train_df.sample(n=1800).reset_index() # use for fast testing code purpose

total_train = train_df.shape[0]
total_validate = validate_df.shape[0]

In [None]:
train_datagen = ImageDataGenerator(
    rotation_range=16,
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest',
    width_shift_range=0.1,
    height_shift_range=0.1
)

train_generator = train_datagen.flow_from_dataframe(
    train_df, 
    "./train/", 
    x_col='filename',
    y_col='category',
    class_mode='binary',
    target_size=(image_size, image_size),
    batch_size=batch_size
)

validation_datagen = ImageDataGenerator(rescale=1./255)
validation_generator = validation_datagen.flow_from_dataframe(
    validate_df, 
    "./train/",  
    x_col='filename',
    y_col='category',
    class_mode='binary',
    target_size=(image_size, image_size),
    batch_size=batch_size
)

## Augment the data

In [None]:
example_df = train_df.sample(n=1).reset_index(drop=True)
example_generator = train_datagen.flow_from_dataframe(
    example_df, 
    "./train/", 
    x_col='filename',
    y_col='category',
#     class_mode='binary'
)
plt.figure(figsize=(12, 12))
for i in range(0, 9):
    plt.subplot(3, 3, i+1)
    for X_batch, Y_batch in example_generator:
        image = X_batch[0]
        plt.imshow(image)
        break
plt.tight_layout()
plt.show()
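
To see what a single augmentation does to the underlying numbers, here is a NumPy-only sketch of a horizontal flip on a tiny made-up "image". It is an illustration of the idea only, not how `ImageDataGenerator` is implemented internally.

```python
import numpy as np

# A tiny 2x3 grayscale "image"; each number is a pixel intensity (made-up values).
image = np.array([[1, 2, 3],
                  [4, 5, 6]])

# A horizontal flip mirrors the image left to right.
# The label (cat or dog) stays the same, so we get an extra training example for free.
flipped = np.fliplr(image)

print(image)
print(flipped)
```

A flipped cat is still a cat, which is why such transformations are a cheap way to show the model more varied examples.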

# Test if our model can recognize this image!

In [None]:
test_filenames = os.listdir("./test1/")
test_df = pd.DataFrame({
    'filename': test_filenames[:64]
})

nb_samples = test_df.shape[0]
test_gen = ImageDataGenerator(rescale=1./255)
test_generator = test_gen.flow_from_dataframe(
    test_df, 
    "./test1/", 
    x_col='filename',
    y_col=None,
    class_mode=None,
    batch_size=batch_size,
    target_size=(image_size, image_size),
    shuffle=False
)

In [None]:
# this may take a while
predict = model.predict(test_generator, steps=int(np.ceil(nb_samples/batch_size)))
threshold = 0.5
test_df['category'] = np.where(predict > threshold, 1,0)

In [None]:
test_df
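
The thresholding step can be seen in isolation with a few made-up probabilities. This is a minimal sketch: the values in `probs` are invented for illustration and only mimic the `(n, 1)` shape of a sigmoid output.

```python
import numpy as np

# Made-up sigmoid outputs, one row per image, interpreted as estimates of P(dog).
probs = np.array([[0.92], [0.08], [0.51]])
threshold = 0.5

# Above the threshold -> label 1 (dog), otherwise label 0 (cat).
labels = np.where(probs > threshold, 1, 0).ravel()
names = ['dog' if label == 1 else 'cat' for label in labels]

print(labels)
print(names)
```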

## check prediction results

Without training, the model just labels every image it sees as a "dog" (the "1" label). Also, the `dog` and `cat` strings have been removed from the test image filenames to avoid "cheating".

In [None]:
sample_test = test_df.sample(n=9).reset_index()
sample_test.head()
plt.figure(figsize=(12, 12))
for index, row in sample_test.iterrows():
    filename = row['filename']
    category = row['category']
    category = 'cat' if category == 0 else 'dog'
    img = load_img("./test1/"+filename, target_size=(256, 256))
    plt.subplot(3, 3, index+1)
    plt.imshow(img)
    plt.xlabel(f'{filename} : {category} ')
plt.tight_layout()
plt.show()

# Now let's load a trained model

The prediction is like a random guess. Imagine we have spent three days doing practice exams for the actual exam!

In [None]:
model = load_model('../input/vgg16catsvsdogs/model_0_vgg16.h5')
model.summary()

## Let this model see the images again and check the results

In [None]:
# this may take a while
predict = model.predict(test_generator, steps=int(np.ceil(nb_samples/batch_size)))
threshold = 0.5
test_df['category'] = np.where(predict > threshold, 1, 0)

In [None]:
sample_test = test_df.sample(n=9).reset_index()
sample_test.head()
plt.figure(figsize=(12, 12))
for index, row in sample_test.iterrows():
    filename = row['filename']
    category = row['category']
    category = 'cat' if category == 0 else 'dog'
    img = load_img("./test1/"+filename, target_size=(256, 256))
    plt.subplot(3, 3, index+1)
    plt.imshow(img)
    plt.xlabel(f'{filename} : {category} ')
plt.tight_layout()
plt.show()

## Pretty accurate, isn't it?

# How does this computer algorithm achieve that?!
Long story... First we have to learn how computers represent images.

![](https://sites.wustl.edu/scao/files/2020/10/linear_dogs.jpg)

### A computer stores an image as a "matrix"

In [None]:
# scalar, vector
a = 1
v = [1,2]

In [None]:
# matrix
m = [[1,2], [3,4]]
print(np.array(m))

In [None]:
# tensor

## This is a tensor

![](https://www.tensorflow.org/guide/images/tensor/reshape-before.png)

Reference: Introduction to Tensors at TensorFlow guide. https://www.tensorflow.org/guide/tensor
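
As a rough sketch of the same idea in NumPy (used here instead of TensorFlow only to keep the example small), scalars, vectors, matrices, and tensors differ in their number of axes. The variable names below are illustrative.

```python
import numpy as np

scalar = np.array(5.0)                    # 0 axes: a single number
vector = np.array([1, 2])                 # 1 axis
matrix = np.array([[1, 2], [3, 4]])       # 2 axes: rows and columns
tensor = np.arange(24).reshape(2, 3, 4)   # 3 axes: 2 blocks, each a 3x4 matrix

for name, arr in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("tensor", tensor)]:
    print(name, "ndim =", arr.ndim, "shape =", arr.shape)
```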

In [None]:
# example of imshow
a = np.array([[0,4], [2,10]])
plt.imshow(a);

## Let us load an image from Pokemon dataset

In [None]:
pokemon_filename = os.listdir("../input/pokemon-images-dataset/pokemon_jpg/pokemon_jpg/")
random_pokemon = random.choice(pokemon_filename)
G = plt.imread("../input/pokemon-images-dataset/pokemon_jpg/pokemon_jpg/"+random_pokemon)
plt.imshow(G)

In [None]:
random_pokemon

## But what is G????

In [None]:
# check G
type(G)

In [None]:
G.shape

# G is a tensor!

In [None]:
# show only 1 color channel
G1 = G[:,:,0]
plt.imshow(G1, cmap='Reds');

## Computers store these images as tensors!
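
To make the height x width x channel layout concrete, here is a sketch that builds a tiny synthetic RGB image by hand. The 4x4 size and the colors are arbitrary choices for illustration, not taken from the Pokemon dataset.

```python
import numpy as np

# A hypothetical 4x4 RGB image: height x width x 3 color channels, values 0..255.
img = np.zeros((4, 4, 3), dtype=np.uint8)
img[:, :, 0] = 255        # fill the red channel everywhere
img[1:3, 1:3, 2] = 255    # add blue to the central 2x2 square

red = img[:, :, 0]        # a single color channel is an ordinary matrix

print(img.shape, red.shape)
print(img[0, 0], img[1, 1])  # a corner pixel vs. a central pixel
```

Passing `img` to `plt.imshow` would display a red square with a magenta center.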


# How do computers learn?

We transform the problem into an optimization problem! First, the neural network is a nonlinear function:
$$
\hat{y} = h(x; w), \quad \text{vs } y
$$
where $x$ is the datum (sample, matrix or tensor), $\hat{y}$ is the output of the model, $y$ is called the ground truth, and $w$ is the parameter of our model.

If $y$ is just a real scalar value (for example, a stock price), then we can tweak our model's $w$ by solving the following minimization problem, where $i$ indexes the $i$-th sample and $N$ is the number of samples:
$$
\min_{w\in \Omega} L(w):= \min_{w\in \Omega} \frac{1}{N}\sum_{i=1}^N \big|h(x^{(i)}; w) - y^{(i)}\big|^2,
$$
where $L(w)$ is called the loss function (it is a function in $w$!!!).
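
To see the loss as a function of $w$, here is a minimal sketch that assumes a toy linear model $h(x; w) = wx$ and made-up data generated by $y = 2x$. Averaging the squared errors over the samples is one common convention; the model and data are illustrative assumptions, not part of the cats-vs-dogs pipeline.

```python
import numpy as np

# Made-up data following y = 2x exactly.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

def h(x, w):
    """Toy model: a line through the origin with slope w."""
    return w * x

def loss(w):
    """Mean squared error between the model output and the ground truth."""
    return np.mean((h(x, w) - y) ** 2)

for w in [0.0, 1.0, 2.0, 3.0]:
    print(f"w = {w}: L(w) = {loss(w):.3f}")
```

The loss is zero at $w = 2$, the slope that generated the data, and grows as $w$ moves away from it.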

You might have seen this picture:
![](https://sites.wustl.edu/scao/files/2020/12/nn.png)

Each layer can be written as the following:
$$
a^{(l+1)} = \sigma(W a^{(l)} + b)
$$
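
A single layer of this form can be sketched in a few lines of NumPy. The sizes (3 inputs, 2 outputs), the numbers, and the choice of ReLU for $\sigma$ are assumptions made for illustration.

```python
import numpy as np

def relu(z):
    """ReLU activation: keep positive entries, replace negative ones with 0."""
    return np.maximum(z, 0.0)

a = np.array([1.0, -2.0, 0.5])       # activations a^(l) of the current layer (3 units)
W = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, -1.0]])     # weight matrix mapping 3 units to 2 units
b = np.array([0.5, -0.5])            # bias vector

a_next = relu(W @ a + b)             # a^(l+1) = sigma(W a^(l) + b)
print(a_next)                        # W a + b = [2.5, -3.0], so ReLU gives [2.5, 0.0]
```

A deep network stacks many such layers, each with its own $W$ and $b$.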
<br/>
<br/>
<br/>
The VGG16 is a deep convolutional neural network, and this is a miniature of our model: imagine those little blocks are like magnifiers + translators.
![](https://sites.wustl.edu/scao/files/2020/12/cnn.png)

## Cross-entropy

What if the ground truth we are interested in is a probability distribution, for example, if $x$ stands for an image of a dog:
$$
P(y = 1| x) = 1 \text{ and } P(y= 0 | x) = 0.
$$
So we are really interested in approximating $P(y|\mathbf{x})$:

$$
h(\mathbf{x}) := h(\mathbf{x};\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^\top \mathbf{x})}
=: \sigma(\mathbf{w}^\top \mathbf{x})  \in (0,1)
$$

where $\sigma(z) = 1/(1+e^{-z})$ is the sigmoid function.
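
The formula for $h$ can be checked numerically. This is a small sketch with made-up weights and inputs; it shows that the sigmoid squashes any real number into $(0, 1)$.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.0])   # made-up weights
x = np.array([2.0, 1.0])    # made-up sample

z = w @ x                   # 0.5*2.0 + (-1.0)*1.0 = 0.0
print(sigmoid(z))           # sigmoid(0) = 0.5: the model is undecided

# Large negative inputs map close to 0, large positive inputs close to 1.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
```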

Now $h(\mathbf{x})$ is our estimate of $P(y=1|\mathbf{x})$ (the conditional probability that the sample is in class 1, given $\mathbf{x}$), and $1 - h(\mathbf{x})$ is our estimate of $P(y=0|\mathbf{x}) = 1 - P(y=1|\mathbf{x})$. Moreover, because $y = 0$ or $1$,

$$
P(y|\mathbf{x}) \text{ is estimated by } h(\mathbf{x})^y \big(1 - h(\mathbf{x}) \big)^{1-y}.
$$

When the true $y$ is 1, we want $h(\mathbf{x})$ to be close to 1; when the true $y$ is 0, we want $h(\mathbf{x})$ to be close to 0.

The cross-entropy loss between two probability distributions $p$ and $q$ is defined as follows, where $K=2$ is the number of classes, $y$ is the true label, and $\hat {y}$ is the prediction from the model (its estimate of $y$):

$$
H(p,q)\ =\ -\sum^{K}_{k=1}p_{k}\log q_{k}\ =\ -y\log {\hat {y}}-(1-y)\log(1-{\hat {y}})
$$

Since we estimate $y$ using $h(\mathbf{x})$,

$$
L (\mathbf{w}; X, \mathbf{y}) = - \frac{1}{N}\sum_{i=1}^N 
\Bigl\{y^{(i)} \ln\big( h(\mathbf{x}^{(i)}; \mathbf{w}) \big) 
+ (1 - y^{(i)}) \ln\big( 1 - h(\mathbf{x}^{(i)};\mathbf{w}) \big) \Bigr\}.
\tag{$\star$}
$$
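
The loss $(\star)$ can be sketched directly in NumPy. The function name `binary_cross_entropy`, the clipping constant `eps` (added here only to avoid taking $\ln 0$), and the sample predictions are assumptions made for illustration; this is not the Keras implementation.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average cross-entropy between 0/1 labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # guard against log(0)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))

y = np.array([1.0, 0.0, 1.0])                  # true labels: dog, cat, dog
confident_right = np.array([0.9, 0.1, 0.9])    # made-up good predictions
confident_wrong = np.array([0.1, 0.9, 0.1])    # made-up bad predictions

loss_right = binary_cross_entropy(y, confident_right)
loss_wrong = binary_cross_entropy(y, confident_wrong)
print(loss_right, loss_wrong)
```

Confident correct predictions give a loss near 0, while confident wrong predictions are penalized heavily.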

and the minimization problem we are solving is:

$$
\min_{\mathbf{w}} L (\mathbf{w}; X, \mathbf{y})
$$

# Stochastic Gradient descent

Loss

$$L(\mathbf{w}) := L(\mathbf{w}; X,\mathbf{y}) = \frac{1}{N}\sum_{i=1}^N f_i(\mathbf{w}; \mathbf{x}^{(i)},y^{(i)})$$ 

# Gradient descent for this loss:
> Choose an initial guess $\mathbf{w}_0$, step size (learning rate) $\eta$, number of iterations $M$<br><br>
>    For $k=0,1,2, \cdots, M$<br>
>    &nbsp;&nbsp;&nbsp;&nbsp;    $\displaystyle\mathbf{w}_{k+1} =  \mathbf{w}_k - \eta\nabla_{\mathbf{w}} L(\mathbf{w}_k) =  \mathbf{w}_k - \eta\frac{1}{N}\sum_{i=1}^N \nabla_{\mathbf{w}} f_i(\mathbf{w}_k; \mathbf{x}^{(i)},y^{(i)})$
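
The gradient descent loop can be sketched on a toy problem. The example assumes the sigmoid model $h(\mathbf{x};\mathbf{w}) = \sigma(\mathbf{w}^\top \mathbf{x})$ with the cross-entropy loss, for which the gradient of the averaged loss is $\frac{1}{N} X^\top (h - \mathbf{y})$. The four data points, the step size, and the iteration count are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    """Averaged cross-entropy loss for the sigmoid model."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    """Gradient of the averaged loss with respect to w, using all N samples."""
    return X.T @ (sigmoid(X @ w) - y) / len(y)

# Toy data: one feature plus a constant column for the bias; label 1 if the feature is positive.
X = np.array([[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(2)      # initial guess w_0
eta = 0.5            # step size (learning rate)
M = 100              # number of iterations

initial_loss = loss(w, X, y)
for k in range(M):
    w = w - eta * grad(w, X, y)   # w_{k+1} = w_k - eta * grad L(w_k)
final_loss = loss(w, X, y)

print(initial_loss, final_loss, w)
```

Every step uses the whole dataset, which becomes expensive when $N$ is large, as with 25,000 images.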


### SGD
> Choose an initial guess $\mathbf{w}_0$, step size (learning rate) $\eta$, number of iterations $M$, batch size $n_{\text{batch}}$<br><br>
>    For $k=0,1,2, \cdots, M$<br>
>    &nbsp;&nbsp;&nbsp;&nbsp;    Randomly choose a mini-batch $B_k \subset \{1, \dots, N\}$ containing $n_{\text{batch}}$ samples<br>
>    &nbsp;&nbsp;&nbsp;&nbsp;    $\displaystyle\mathbf{w}_{k+1} =  \mathbf{w}_k - \eta\frac{1}{n_{\text{batch}}}\sum_{i\in B_k} \nabla_{\mathbf{w}} f_i(\mathbf{w}_k; \mathbf{x}^{(i)},y^{(i)}) \approx \mathbf{w}_k - \eta\nabla_{\mathbf{w}} L(\mathbf{w}_k)$
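
A mini-batch version of the loop can be sketched as follows. Each step estimates the gradient from a small random batch rather than from all samples. The synthetic data, the seed, and the hyperparameters are illustrative assumptions; the Keras model earlier in this notebook is compiled with the Adam optimizer, a refinement of this basic idea.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y, eps=1e-12):
    """Averaged cross-entropy loss; eps guards against log(0)."""
    p = np.clip(sigmoid(X @ w), eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    """Gradient of the averaged loss over the samples that are passed in."""
    return X.T @ (sigmoid(X @ w) - y) / len(y)

rng = np.random.default_rng(0)
N, n_batch = 200, 16
feature = rng.normal(size=N)
X = np.column_stack([feature, np.ones(N)])   # feature plus a constant bias column
y = (feature > 0).astype(float)              # label 1 if the feature is positive

w = np.zeros(2)
eta, M = 0.5, 300
initial_loss = loss(w, X, y)
for k in range(M):
    batch = rng.choice(N, size=n_batch, replace=False)   # random mini-batch of indices
    w = w - eta * grad(w, X[batch], y[batch])            # step using only the batch
final_loss = loss(w, X, y)

print(initial_loss, final_loss)
```

Each update touches only 16 of the 200 samples, so it is much cheaper than a full gradient step, at the price of a noisier descent direction.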

# Summary:

1. Computers represent data as vectors, matrices, or tensors.
2. A deep learning model learns how to classify images through optimization.
3. Our neural network model (VGG16) uses mathematical operations to extract features from images.
4. The model is "trained" through solving an optimization problem.
5. We will learn how to code these in Math 450 in Spring 2021.

Email: s.cao@wustl.edu