# Neural Networks on MNIST

This Jupyter notebook explains various approaches for implementing neural networks that recognize digits on [MNIST](http://yann.lecun.com/exdb/mnist/) dataset.

## Preparing the MNIST dataset

Most deep learning frameworks provide APIs for loading famous datasets like MNIST (e.g., `torchvision.datasets.MNIST` in pytorch). The APIs are handy, but hide the important step for preparing a training data for a deep learning framework; when graduating from an example dataset to the real data, we must convert a training data of our interest into the data structure that is acceptable by a deep learning framework.

The cell below downloads the original distribution of the MNIST dataset on the Web, converts the dataset into `numpy` arrays, and saves the arrays as the file `mnist.npz` with keyword names.

In [1]:
import gzip
import os
import sys
import struct
import numpy as np

def read_image(fi):
    magic, n, rows, columns = struct.unpack(">IIII", fi.read(16))
    assert magic == 0x00000803
    assert rows == 28
    assert columns == 28
    rawbuffer = fi.read()
    assert len(rawbuffer) == n * rows * columns
    rawdata = np.frombuffer(rawbuffer, dtype='>u1', count=n*rows*columns)
    return rawdata.reshape(n, rows, columns).astype(np.float32) / 255.0

def read_label(fi):
    magic, n = struct.unpack(">II", fi.read(8))
    assert magic == 0x00000801
    rawbuffer = fi.read()
    assert len(rawbuffer) == n
    return np.frombuffer(rawbuffer, dtype='>u1', count=n)

if __name__ == '__main__':
    os.system('wget -N http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz')
    os.system('wget -N http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz')
    os.system('wget -N http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz')
    os.system('wget -N http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz')

    np.savez_compressed(
        'mnist',
        train_x=read_image(gzip.open('train-images-idx3-ubyte.gz', 'rb')),
        train_y=read_label(gzip.open('train-labels-idx1-ubyte.gz', 'rb')),
        test_x=read_image(gzip.open('t10k-images-idx3-ubyte.gz', 'rb')),
        test_y=read_label(gzip.open('t10k-labels-idx1-ubyte.gz', 'rb'))
    )

The file contains four numpy arrays (one tensor and array for each split of training and test sets) with the keywords:

+ `train_x`: $60000 \text{ (images)} \times 28 \text{ (y)} \times 28 \text{ (x)}$
+ `train_y`: $60000 \text{ (labels)}$
+ `test_x`: $10000 \text{ (images)} \times 28 \text{ (y)} \times 28 \text{ (x)}$
+ `test_y`: $10000 \text{ (labels)}$


### Install pytorch

In [2]:
!pip install torch torchvision



### Transformers

In [3]:

import torch
import torch.nn as nn
data = "Transformers is a media franchise produced by American toy company Hasbro and Japanese toy company Takara Tomy   It primarily follows the heroic Autobots and the villainous Decepticons   two alien robot factions at war that can transform into other forms   such as vehicles and animals   The franchise encompasses toys   animation   comic books   video games and films   As of            it generated more than      trillion          billion   in revenue         making it one of the highest  grossing media franchises of all time  The franchise began in          with the Transformers toy line   comprising transforming mecha toys from Takara  s Diaclone and Micro Change toylines rebranded for Western markets         The term   Generation      covers both the animated television series The Transformers and the comic book series of the same name   which are further divided into Japanese   British and Canadian spin  offs   respectively   Sequels followed   such as the Generation    comic book and Beast Wars TV series   which became its own mini  universe   Generation    characters underwent two reboots with Dreamwave Productions in          and IDW Publishing in            with a third starting in            There have been other incarnations of the story based on different toy lines during and after the     th century   The first was the Robots in Disguise series   followed by three shows   Armada   Energon   and Cybertron   that constitute a single universe called the   Unicron Trilogy    A live  action film series started in            again distinct from previous incarnations   while the Transformers   Animated series merged concepts from the G   continuity   the          live  action film and the   Unicron Trilogy     For most of the         s   in an attempt to mitigate the wave of reboots   the   Aligned Continuity   was established   In            Transformers   Cyberverse debuted   once again   distinct from the previous incarnations  Although initially a separate and competing franchise started in            Tonka  s GoBots became the intellectual property of Hasbro after their buyout of Tonka in            Subsequently   the universe depicted in the animated series Challenge of the GoBots and follow  up film GoBots   Battle of the Rock Lords was retroactively established as an alternate universe within the Transformers multiverse        FictionTransformers   Generation                          Main articles   Transformers   Generation    and The Transformers   TV series  Classic Transformers franchise logo used until         Spider  Man battles Megatron on the cover of The Transformers       Generation One is a retroactive term for the Transformers characters that appeared between          and            The Transformers began with the         s Japanese toy lines Micro Change and Diaclone   They presented robots able to transform into everyday vehicles   electronic items or weapons   Hasbro bought the Micro Change and Diaclone toys   and partnered with Takara         Marvel Comics was hired by Hasbro to create the backstory   editor  in  chief Jim Shooter wrote an overall story   and gave the task of creating the characters to writer Dennis O  Neil         Unhappy with O  Neil  s work   although O  Neil created the name   Optimus Prime       Shooter chose Bob Budiansky to create the characters        The Transformers mecha were largely designed by Shōji Kawamori   the creator of the Japanese mecha anime franchise Macross   which was adapted into the Robotech franchise in North America           Kawamori came up with the idea of transforming mechs while working on the Diaclone and Macross franchises in the early         s   such as the VF     Valkyrie in Macross and Robotech     with his Diaclone mechs later providing the basis for Transformers        The primary concept of Generation One is that the heroic Optimus Prime   the villainous Megatron   and their finest soldiers crash land on pre  historic Earth in the Ark and the Nemesis before awakening in            Cybertron hurtling through the Neutral zone as an effect of the war   The Marvel comic was originally part of the main Marvel Universe   with appearances from Spider  Man and Nick Fury   plus some cameos         as well as a visit to the Savage Land          The Transformers TV series began around the same time   Produced by Sunbow Productions and Marvel Productions   later Hasbro Productions   from the start it contradicted Budiansky  s backstories   The TV series shows the Autobots looking for new energy sources   and crash landing as the Decepticons attack           Marvel interpreted the Autobots as destroying a rogue asteroid approaching Cybertron           Shockwave is loyal to Megatron in the TV series   keeping Cybertron in a stalemate during his absence           but in the comic book he attempts to take command of the Decepticons           The TV series would also differ wildly from the origins Budiansky had created for the Dinobots                   the Decepticon turned Autobot Jetfire           known as Skyfire on TV             the Constructicons   who combine to form Devastator                     and Omega Supreme                   The Marvel comic establishes early on that Prime wields the Creation Matrix   which gives life to machines   In the second season   the two  part episode The Key to Vector Sigma introduced the ancient Vector Sigma computer   which served the same original purpose as the Creation Matrix   giving life to Transformers     and its guardian Alpha Trion  In            the cartoon became the film The Transformers   The Movie   which is set in the year            It introduced the Matrix as the   Autobot Matrix of Leadership     as a fatally wounded Prime gives it to Ultra Magnus   however   as Prime dies he drops the matrix   which is then caught by Hot Rod who subsequently becomes Rodimus Prime later on in the film   Unicron   a transformer who devours planets   fears its power and recreates a heavily damaged Megatron as Galvatron   as well as Bombshell or Skywarp becoming Cyclonus   Thundercracker becoming Scourge and two other Insecticons becoming Scourge  s huntsmen   the Sweeps   Eventually   Rodimus Prime takes out the Matrix and destroys Unicron           In the United Kingdom   the weekly comic book interspliced original material to keep up with U  S   reprints           and The Movie provided much new material   Writer Simon Furman proceeded to expand the continuity with movie spin  offs involving the time travelling Galvatron                   The Movie also featured guest voices from Leonard Nimoy as Galvatron   Scatman Crothers as Jazz   Casey Kasem as Cliffjumper   Orson Welles as Unicron and Eric Idle as the leader of the Junkions   Wreck  Gar   though unnamed in the movie     The Transformers theme tune for the film was performed by Lion with   Weird Al   Yankovic adding a song to the soundtrack  The third season followed up The Movie   with the revelation of the Quintessons having used Cybertron as a factory   Their robots rebel   and in time the workers become the Autobots and the soldiers become the Decepticons     Note   This appears to contradict background presented in the first two seasons of the series     It is the Autobots who develop transformation           Due to popular demand           Optimus Prime is resurrected at the conclusion of the third season           and the series ended with a three  episode story arc   However   the Japanese broadcast of the series was supplemented with a newly produced OVA   Scramble City   before creating entirely new series to continue the storyline   ignoring the          end of the American series   The extended Japanese run consisted of The Headmasters   Super  God Masterforce   Victory and Zone   then in illustrated magazine form as Battlestars   Return of Convoy and Operation   Combination   Just as the TV series was wrapping up   Marvel continued to expand its continuity   It followed The Movie  s example by killing Prime         and Megatron           albeit in the present day   Dinobot leader Grimlock takes over as Autobot leader           There was a G  I   Joe crossover         and the limited series The Transformers   Headmasters   which further expanded the scope to the planet Nebulon           It led on to the main title resurrecting Prime as a Powermaster          "


#"The franchise began in 1984 with the Transformers toy line, comprising transforming mecha toys from Takara's Diaclone and Micro Change toylines rebranded for Western markets.[2] The term Generation 1 covers both the animated television series The Transformers and the comic book series of the same name, which are further divided into Japanese, British and Canadian spin-offs, respectively. Sequels followed, such as the Generation 2 comic book and Beast Wars TV series, which became its own mini-universe. Generation 1 characters underwent two reboots with Dreamwave Productions in 2001 and IDW Publishing in 2005, with a third starting in 2019. There have been other incarnations of the story based on different toy lines during and after the 20th century. The first was the Robots in Disguise series, followed by three shows (Armada, Energon, and Cybertron) that constitute a single universe called the Unicron Trilogy"

data = data.replace(".", "")
data.replace(",","")

data = data.lower()
data = data.split(" ")


uniques = len(set(data))
print(uniques)
words = list(set(data))

numdata = []
for word in data:
  numdata.append(words.index(word))

numdata = torch.IntTensor(numdata)
print(numdata)

519
tensor([ 10, 230, 269,  ...,   0,   0,   0], dtype=torch.int32)


In [4]:

import math




class self_attention(nn.Module):
  def __init__(self, device="cuda:0"):
    super().__init__()
    self.num_words = 12
    self.num_dims = 12

    self.pos_encoding = torch.zeros((self.num_words,self.num_dims))

    for pos in range(self.num_words):
      for i in range(self.num_dims):
        if pos % 2 == 0:
          self.pos_encoding[pos,i] = math.sin(pos/math.pow(10000, i/self.num_dims))
        else:
          self.pos_encoding[pos,i] = math.cos(pos/math.pow(10000, i-1/self.num_dims))
    self.pos_encoding = self.pos_encoding.to(device)

    self.Q = nn.Linear(self.num_dims, self.num_dims).to(device)
    self.K = nn.Linear(self.num_dims, self.num_dims).to(device)
    self.V = nn.Linear(self.num_dims, self.num_dims).to(device)
    self.softmax = nn.Softmax(dim=0)
    self.embedding = nn.Embedding(519,self.num_dims).to(device)
    self.flatten = nn.Flatten(start_dim=0)
    self.preoutput = nn.Linear(self.num_words*self.num_dims,512).to(device)
    self.gelu = nn.GELU()
    self.output = nn.Linear(512, 519).to(device)

  def forward(self, x, device="cuda:0"):
    x = x.to(device)
    x = self.embedding(x)
    x += self.pos_encoding
    x_q = self.Q(x)
    x_k = self.K(x)
    x_v = self.V(x)
    x_qk =  x_q.matmul(x_k.transpose(0,1))/math.sqrt(self.num_dims)
    x = self.softmax(x_qk)*x_v
    x = self.flatten(x)
    x=self.preoutput(x)
    x = self.gelu(x)
    x = self.output(x)

    return x

torch.manual_seed(0)
self_atten = self_attention()

#p = self_atten(numdata[:12])

optimizer = torch.optim.Adam(self_atten.parameters(), lr = 0.0001)
loss = nn.CrossEntropyLoss()

for k in range(50):
  loss_n = 0
  for i in range(len(numdata)-13):
    optimizer.zero_grad()
    x = numdata[i:i+12]
    y = torch.zeros(519)
    y = y.to("cuda:0")
    y[numdata[i+12]] = 1
    y_p = self_atten(x)
    l = loss(y_p, y)
    l.backward()
    optimizer.step()
    loss_n += l
  loss_n /= len(numdata)-13
  print("epoch:", k ,"loss:", loss_n)


#p = self_atten(numdata[:12])

#print(p, words[p.argmax()], data[13])

accuracy = 0
for i in range(len(numdata)-13):
  x = numdata[i:i+12]
  y = data[i+12]
  y_p = self_atten(x)
  y_p = y_p.to("cpu")
  #print(y, words[y_p.argmax()])
  if y != words[y_p.argmax()]:
    print("Wrong!")
  else:
    accuracy += 1
accuracy /= len(numdata)-13
print("accuracy:", accuracy)







epoch: 0 loss: tensor(4.4186, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 1 loss: tensor(3.6963, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 2 loss: tensor(3.6020, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 3 loss: tensor(3.5178, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 4 loss: tensor(3.4197, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 5 loss: tensor(3.2865, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 6 loss: tensor(3.1224, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 7 loss: tensor(2.9508, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 8 loss: tensor(2.7855, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 9 loss: tensor(2.6288, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 10 loss: tensor(2.4811, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 11 loss: tensor(2.3421, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 12 loss: tensor(2.2102, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 13 loss: tensor(2.0828, device='cuda:0', grad_fn=<DivBackward0>)
ep

In [23]:
import math
class self_attention_pure(nn.Module):
  def __init__(self, device="cuda:0", p_in=12, mid = 512, dims=0):
    super().__init__()
    print(p_in)
    self.num_words = p_in
    self.num_dims = p_in
    if dims != 0:
      self.num_dims = dims

    self.Q = nn.Linear(self.num_dims, self.num_dims).to(device)
    self.K = nn.Linear(self.num_dims, self.num_dims).to(device)
    self.V = nn.Linear(self.num_dims, self.num_dims).to(device)
    self.softmax = nn.Softmax(dim=0)
    self.preoutput_0 = nn.Linear(self.num_dims,mid).to(device)
    self.preoutput_1 = nn.Linear(mid, self.num_dims).to(device)
    self.gelu = nn.GELU()

  def forward(self, x, device="cuda:0"):
    x = x.to(device)
    id = x
    x_q = self.Q(x)
    x_k = self.K(x)
    x_v = self.V(x)
    x_qk = x_q*x_k/math.sqrt(self.num_dims)
    x = self.softmax(x_qk)*(x_v)
    x = self.preoutput_0(x)
    x = self.preoutput_1(x)
    x += id
    x = self.gelu(x)

    return x


class self_attention_w_pos_enc(nn.Module):
  def __init__(self, device="cuda:0", words = 12, dims = 12):
    super().__init__()
    self.num_words = words
    self.num_dims = dims
    self.embedding = nn.Embedding(519,self.num_dims).to(device)
    self.pos_encoding = torch.zeros((self.num_words,self.num_dims))

    for i in range(self.num_words):
      for j in range(self.num_dims):
        if j % 2 == 0:
          self.pos_encoding[i,j] = math.sin(i/math.pow(10000, j/self.num_dims))
        else:
          self.pos_encoding[i,j] = math.cos(i/math.pow(10000, j-1/self.num_dims))
    self.pos_encoding = self.pos_encoding.to(device)

    self.attention = self_attention_pure(device=device, p_in = self.num_dims, dims = self.num_dims, mid=512)
    self.output = nn.Linear(512, 519).to(device)

  def forward(self, x, device="cuda:0"):
    x = x.to(device)
    x = self.embedding(x)
    x += self.pos_encoding
    x = self.attention(x)
    x = self.output(x)

    return x

class self_attention_w_pos_enc_pure(nn.Module):
  def __init__(self, device="cuda:0", num_words = 12, num_dims=12, out=512):
    super().__init__()
    self.num_words = num_words
    self.num_dims = num_dims
    self.embedding = nn.Embedding(519,self.num_dims).to(device)
    self.pos_encoding = torch.zeros((self.num_words,self.num_dims))

    for i in range(self.num_words):
      for j in range(self.num_dims):
        if j % 2 == 0:
          self.pos_encoding[i,j] = math.sin(i/math.pow(10000, j/self.num_dims))
        else:
          self.pos_encoding[i,j] = math.cos(i/math.pow(10000, j-1/self.num_dims))
    self.pos_encoding = self.pos_encoding.to(device)

    self.attention = self_attention_pure(device=device, p_in = self.num_dims, dims = self.num_dims, mid=out)

  def forward(self, x, device="cuda:0"):
    x = x.to(device)
    x = self.embedding(x)
    x += self.pos_encoding
    x = self.attention(x)
    return x


class multi_layer_self_attention(nn.Module):
  def __init__(self, device="cuda:0", outs=[144,256,512,519], seq_length=12):
    super().__init__()
    self.attention_w_enc = self_attention_w_pos_enc_pure(device=device, num_words=seq_length, num_dims=12, out=outs[0])
    self.attention_1 = self_attention_pure(device=device, p_in=12, mid=outs[1])
    self.linear_1 = nn.Linear(seq_length*12, outs[0]).to(device)
    self.linear_2 = nn.Linear(outs[0], outs[1]).to(device)
    self.flatten = nn.Flatten(start_dim=0)
    self.output = nn.Linear(outs[1], outs[3]).to(device)


  def forward(self, x, device="cuda:0"):
    x = x.to(device)
    x = self.attention_w_enc(x)
    x = self.attention_1(x)
    x = self.flatten(x)
    x = self.linear_1(x)
    x = self.linear_2(x)
    x = self.output(x)
    return x

torch.manual_seed(0)
seq_length = 24
self_atten = multi_layer_self_attention(seq_length=seq_length)
#p = self_atten(numdata[:12])

optimizer = torch.optim.Adam(self_atten.parameters(), lr = 0.0001)
loss = nn.CrossEntropyLoss()
schedule = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=2, cooldown=3, factor=0.333)



for k in range(50):
  loss_n = 0
  for i in range(len(numdata)-(seq_length+1)):
    optimizer.zero_grad()
    x = numdata[i:i+seq_length]
    y = torch.zeros(519)
    y = y.to("cuda:0")
    y[numdata[i+seq_length]] = 1
    y_p = self_atten(x)
    l = loss(y_p, y)
    l.backward()
    optimizer.step()
    loss_n += l
  loss_n /= len(numdata)-seq_length-1
  schedule.step(loss_n)
  print("epoch:", k ,"loss:", loss_n)


#p = self_atten(numdata[:12])

#print(p, words[p.argmax()], data[13])

accuracy = 0
for i in range(len(numdata)-(seq_length+1)):
  x = numdata[i:i+seq_length]
  y = data[i+seq_length]
  y_p = self_atten(x)
  y_p = y_p.to("cpu")
  #print(y, words[y_p.argmax()])
  if y != words[y_p.argmax()]:
    print("Wrong!")
  else:
    accuracy += 1
accuracy /= len(numdata)-seq_length-1
print("accuracy:", accuracy)

12
12
epoch: 0 loss: tensor(4.2351, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 1 loss: tensor(3.5792, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 2 loss: tensor(3.3300, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 3 loss: tensor(3.0878, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 4 loss: tensor(2.7788, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 5 loss: tensor(2.3964, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 6 loss: tensor(2.0140, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 7 loss: tensor(1.6950, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 8 loss: tensor(1.4465, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 9 loss: tensor(1.2531, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 10 loss: tensor(1.0971, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 11 loss: tensor(0.9650, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 12 loss: tensor(0.8525, device='cuda:0', grad_fn=<DivBackward0>)
epoch: 13 loss: tensor(0.7556, device='cuda:0', grad_fn=<DivBackward