# Learnable Test Project
## Task Description
For this task, you are going to implement a Connectionist Temporal Classification (CTC) based convolutional recurrent neural network (CRNN) OCR model that transcribes images of English handwriting into strings. 

Following the instructions below, you are going to:
1. Build a data pipeline
2. Implement an image encoder (most likely a convolutional neural network encoder)
3. Build a decoder (most likely a recurrent neural network based decoder)
4. Implement the training function to train the model 
5. Implement the validation function to validate the trained model


To prepare for the tasks above, please first familiarize yourself with [CRNN-CTC](https://towardsdatascience.com/intuitively-understanding-connectionist-temporal-classification-3797e43a86c), [pytorch CTC loss](https://pytorch.org/docs/stable/nn.html#ctcloss), [pytorch Dataset](https://pytorch.org/docs/stable/data.html), [pytorch LSTM](https://pytorch.org/docs/stable/nn.html#lstm) and [pytorch GRU](https://pytorch.org/docs/stable/nn.html#gru).
An example of using a PyTorch Dataset is also available here: [pytorch Dataset example](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html)
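
As a quick intuition for how a CTC output sequence is read, here is a minimal sketch of greedy (best-path) decoding: collapse repeated indices, then drop blanks. It assumes index 0 is the blank token and that characters follow the task alphabet shifted by one, which matches the label convention used later in this notebook. `greedy_ctc_decode` is a hypothetical name and is not part of `helper_functions.py`.

```python
ALPHABET = 'abcdefghilmnoprstuwy'  # same alphabet as defined later in this notebook
BLANK = 0  # assumption: index 0 is the CTC blank, characters start at index 1


def greedy_ctc_decode(best_path):
    """Collapse repeated indices, then drop blanks (best-path CTC decoding).

    best_path: list of per-time-step argmax indices.
    """
    chars = []
    prev = None
    for idx in best_path:
        # Emit a character only when the index changes and is not the blank.
        if idx != prev and idx != BLANK:
            chars.append(ALPHABET[idx - 1])
        prev = idx
    return ''.join(chars)


# 'b' is index 2 and 'y' is index 20 under this convention.
print(greedy_ctc_decode([0, 2, 2, 0, 0, 20, 20, 0]))  # by
# A blank between two identical indices keeps both characters.
print(greedy_ctc_decode([1, 1, 0, 1]))  # aa
```

The second call shows why the blank matters: without it, a doubled letter could never be distinguished from one letter held over several time steps.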

There are CTC-based OCR models in PyTorch available online, and you are allowed to search for and reference that code. Furthermore, you are allowed to copy and paste code from those repos for this task, but you must **cite the source**. However, you are not allowed to copy another person's implementation in its entirety, i.e., you must follow our defined function headers to implement the model. 

We also provide a few helper functions, which you can find in `helper_functions.py`. You can also create your own Python files with helper functions or variables, but the main functionality should remain in this notebook.  

### Please enter your name here.
Candidate Name:Wenzheng He

## 1.1 data pipeline
You are going to build a dataloader for your model. You can either follow the structure provided below or write your own dataloader. You should assume that loading all image samples into RAM at once would exhaust it. So please do not load all images into memory at once; instead, use an [online dataloader](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html).
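
To illustrate what "online" loading means, here is a minimal, dependency-free sketch of a map-style dataset that keeps only file paths in memory and reads each file when it is indexed. A PyTorch `Dataset` follows the same `__len__` / `__getitem__` protocol. The class name `LazyFileDataset` and the temporary files are hypothetical stand-ins for the real images.

```python
import os
import tempfile


class LazyFileDataset:
    """Map-style dataset sketch: stores only file paths, reads bytes on demand."""

    def __init__(self, paths):
        self.paths = list(paths)  # cheap: just strings, no image data

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # The file is opened only when this particular sample is requested.
        with open(self.paths[idx], 'rb') as f:
            return f.read()


# Demo with three small temporary files standing in for images.
tmp_dir = tempfile.mkdtemp()
demo_paths = []
for i in range(3):
    path = os.path.join(tmp_dir, 'img_{}.bin'.format(i))
    with open(path, 'wb') as f:
        f.write(bytes([i]) * 4)
    demo_paths.append(path)

dataset = LazyFileDataset(demo_paths)
print(len(dataset), dataset[2])
```

Memory use then scales with the batch size rather than with the size of the whole dataset.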

### Data Description:
For this task, you will be given \~30k handwritten English words, of which \~5k are for training and \~3k are for validation, together with the information needed to train your OCR model. Please ignore the 'unlabeled' data.

This dataset is a subset of IAM. To make the task easier, we only picked data samples whose label length is **at most 3**.

You will be given a file: `meta_candidate.json` and a folder `data_candidate/`. 

`data_candidate/` contains image files for training and validation, as well as unlabeled handwritten English words, each with the filename `<img_id>.png`.


`meta_candidate.json` has the following structure: 
```text
{
  "image id":{"bin"(int): binarization threshold
              "label"(str): image label
              "path"(str): basename for image name ('<img_id>.png')
              "train"(str): label for the usage of image samples,
                       "train" for training, 
                       "val" for validation, 
                       "unlabeled" is unlabeled data.
              }
}
```
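
A minimal sketch of how entries with this structure could be selected by split. The three entries are made up for illustration (ids, labels and thresholds are not taken from the real file), and the empty label on the unlabeled entry is an assumption. `select_split` is a hypothetical helper.

```python
import json

# Hypothetical three-entry excerpt; all values are invented for illustration.
meta_text = '''
{
  "img_0001": {"bin": 128, "label": "by", "path": "img_0001.png", "train": "train"},
  "img_0002": {"bin": 140, "label": "the", "path": "img_0002.png", "train": "val"},
  "img_0003": {"bin": 150, "label": "", "path": "img_0003.png", "train": "unlabeled"}
}
'''
meta = json.loads(meta_text)


def select_split(meta, use):
    """Return the entries whose "train" field equals `use` ("train" or "val")."""
    return [entry for entry in meta.values() if entry['train'] == use]


train_entries = select_split(meta, 'train')
print([e['path'] for e in train_entries])  # ['img_0001.png']
```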

Please DO NOT use data from anywhere else.

Note that the selected subset of IAM doesn't contain certain letters, and those letters are therefore excluded from the alphabet for this task. Pay attention to the variable *ALPHABET* below and the helper function load_lexicon. 
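
The label encoding that follows from this alphabet can be sketched as below. It assumes tokens are indexed in `ALPHABET` order; in the notebook the real mapping comes from the lexicon helper, so `token2index` and `encode_label` here are illustrative stand-ins only.

```python
ALPHABET = 'abcdefghilmnoprstuwy'
MAX_LEN = 3
PAD = 0  # index 0 is reserved (CTC blank / padding), so characters start at 1

# Assumption: tokens are indexed in ALPHABET order; the real mapping comes from the helper.
token2index = {ch: i for i, ch in enumerate(ALPHABET)}


def encode_label(label):
    """Map a word to 1-based indices and right-pad with PAD up to MAX_LEN."""
    if len(label) > MAX_LEN or any(ch not in token2index for ch in label):
        raise ValueError('label outside the task alphabet or too long: {!r}'.format(label))
    return [token2index[ch] + 1 for ch in label] + [PAD] * (MAX_LEN - len(label))


print(encode_label('by'))  # [2, 20, 0]
```

Words containing letters missing from the alphabet (for example `j` or `k`) raise an error in this sketch instead of being silently mis-encoded.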

In [257]:
#GLOBAL VARIABLES
DATA_PATH = './data/data_candidate/'
META_PATH = './data/meta_candidate.json'
BATCH_SIZE = 64
MAX_LEN = 3 # max length of word is 3
ALPHABET = 'abcdefghilmnoprstuwy' # only these letters will appear in data

import torch
from torch.utils.data import DataLoader, Dataset
from helper_functions import * # take your time to check out our helpful helper functions
import os
import cv2
import torch.nn as nn # needed by the encoder/decoder modules defined below
import torch.nn.functional as F
import numpy as np

class OcrDataset(Dataset):
    def __init__(self, use='train',augment=False):
        '''
        dataset class for the ocr model

        :param use(str): the set of data you want to load, "train" for training, "val" for evaluation
        :param augment(bool): a flag that decides if we are going to augment data
        '''
        super(OcrDataset, self).__init__()
        ##### your code here #######
        self.augment = augment
        self.meta = load_json(META_PATH) ## read in metadata
        self.meta = [b for a,b in self.meta.items() 
                     if len(b['label']) > 0 and ## exclude unlabeled samples
                     len(b['label']) <= MAX_LEN and ## only include labels of up to 3 characters
                     set(b['label']).issubset(ALPHABET) and ## only include labels made of the selected letters
                     b['train']==use] ## pick the requested split
        ###########################

    def __getitem__(self, idx):
        '''
        return a sample, augmentation is optional. 
        :param idx: index of data
        :return: {"img"(numpy array or torch tensor): grayscale image with shape (1, im_height, im_width)
                  "label"(numpy array or torch tensor): embedded and padded label with shape (3,). By default CTC loss 
                                                        reserves index 0 for the blank token, and 0 is reused here for padding, 
                                                        so we need to increase all token indices by one. 
                                                        Ex: for label 'by', the function returns [2, 20, 0]
                  "len"(int): number of characters of the sample's label.
                  ... add anything more for your convenience
                 }
        '''
        ##### your code here ######
        token2index, _ = get_lexicon()
        if torch.is_tensor(idx):
            idx = idx.tolist()
        img_path = os.path.join(DATA_PATH,
                                self.meta[idx]['path'])
        img = cv2.imread(img_path) ## read in the image
        img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) ## convert to grayscale
        img = cv2.resize(img, (128,64), interpolation = cv2.INTER_AREA) ## resize to 128x64 (WxH), close to the average width and height in the dataset
        img = torch.tensor(img[None,:,:]).float() ## add the channel dimension and convert to a float tensor
        label = self.meta[idx]['label'] ## get the target label
        length = len(label) ## number of characters
        label_token = torch.LongTensor([token2index[i]+1 for i in label] + [0]*(MAX_LEN-length)) ## convert tokens to indices, pad with 0 at the end
        ###########################
        if self.augment:
            # augments
            pass
        return {'img':img,'len':length,'label':label_token}

    def __len__(self):
        '''
        total number of samples
        :return(int): number of samples
        '''
        ##### your code here ######
        return len(self.meta)
        ###########################

### test dataloader

In [258]:
ocr_dataset_train = OcrDataset('train')
data_loader_train = DataLoader(ocr_dataset_train, batch_size = BATCH_SIZE, shuffle=True)
ocr_dataset_eval = OcrDataset('val')
data_loader_eval = DataLoader(ocr_dataset_eval, batch_size = BATCH_SIZE, shuffle=False) # validation data does not need shuffling

for i,sample in enumerate(data_loader_train):
    print(sample['img'],sample['label'],sample['len'])
    if i >1:
        break

tensor([[[[251., 251., 251.,  ..., 251., 251., 251.],
          [251., 251., 251.,  ..., 251., 251., 251.],
          [251., 251., 251.,  ..., 251., 251., 251.],
          ...,
          [251., 251., 250.,  ..., 249., 250., 250.],
          [250., 251., 251.,  ..., 250., 250., 250.],
          [249., 250., 251.,  ..., 251., 251., 249.]]],


        [[[251., 251., 251.,  ..., 251., 251., 251.],
          [251., 251., 251.,  ..., 251., 251., 251.],
          [251., 251., 251.,  ..., 251., 251., 251.],
          ...,
          [251., 251., 251.,  ..., 144., 179., 179.],
          [251., 251., 251.,  ..., 142., 176., 176.],
          [251., 251., 251.,  ..., 200., 221., 221.]]],


        [[[255., 255., 255.,  ..., 255., 255., 255.],
          [255., 255., 255.,  ..., 255., 255., 255.],
          [255., 255., 255.,  ..., 255., 255., 255.],
          ...,
          [235., 235., 236.,  ..., 246., 245., 244.],
          [237., 237., 239.,  ..., 249., 244., 242.],
          [237., 238., 239., 

### 1.2 Convolutional Recurrent Neural Network Model
You are going to implement a convolutional recurrent neural network model. Follow the instructions below to build an encoder and a decoder, then combine them to get the OCR model.

#### 1.2.1 Encoder
Implement any image encoder you like for the CTC model. You can reference existing ones. Note that we only want reasonable accuracy for this task, so please consider inference speed when designing the model architecture. An overly complex CNN model would also increase your training time and thus leave you with less time to tune the model. 

You should try to guarantee that the height dimension of the output feature map is 1; otherwise, you'll need to handle the remaining height in the decoder.
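
Whether a given stack of layers reduces the height to 1 can be checked with plain arithmetic before building the model. The sketch below uses the standard output-size formula for convolution and pooling (floor mode, dilation 1) and traces a 64x128 input through the same kernel, stride and padding settings as the encoder implemented in this notebook. The helper names `out_size` and `trace` are hypothetical.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Output length of a conv/pool layer along one axis (floor mode, dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1


def trace(height, width, layers):
    """Apply layers given as ((kh, kw), (sh, sw), (ph, pw)) and return the final (H, W)."""
    for (kh, kw), (sh, sw), (ph, pw) in layers:
        height = out_size(height, kh, sh, ph)
        width = out_size(width, kw, sw, pw)
    return height, width


same_conv = ((3, 3), (1, 1), (1, 1))    # 3x3 conv, padding 1: size preserved
valid_conv = ((3, 3), (1, 1), (0, 0))   # 3x3 conv, no padding: shrinks by 2
pool_2x2 = ((2, 2), (2, 2), (0, 0))     # halves height and width
pool_h = ((2, 2), (2, 1), (0, 1))       # halves height only, width grows by 1

layers = [same_conv, pool_2x2, same_conv, pool_2x2,
          same_conv, same_conv, pool_2x2,
          same_conv, same_conv, pool_h,
          same_conv, valid_conv, pool_h]

print(trace(64, 128, layers))  # (1, 16)
```

The two height-only poolings are what keep 16 time steps along the width while the height collapses to 1.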

In [314]:
##### your code here ######
# input to encoder is a batch of samples(batch_size, 1, im_height, im_width)
class CTCEncoder(nn.Module):
    def __init__(self):
        super(CTCEncoder, self).__init__()
        ##set some parameters for cnn
        ks = [3, 3, 3, 3, 3, 3, 3, 3] ##kernel
        ps = [1, 1, 1, 1, 1, 1, 1, 0] ##padding
        ss = [1, 1, 1, 1, 1, 1, 1, 1] ##stride
        nm = [32, 64, 128, 256, 256, 512, 512, 512] ##feature map depth

        ##define cnn structure
        cnn = nn.Sequential()
        
        def convRelu(i, batchNormalization=False):
            nIn = 1 if i == 0 else nm[i - 1]
            nOut = nm[i]
            cnn.add_module('conv{0}'.format(i),
                           nn.Conv2d(nIn, nOut, ks[i], ss[i], ps[i]))
            if batchNormalization:
                cnn.add_module('batchnorm{0}'.format(i), nn.BatchNorm2d(nOut))
            cnn.add_module('relu{0}'.format(i), nn.ReLU(True))

        convRelu(0)                                                 # 32x64x128
        cnn.add_module('pooling{0}'.format(0), nn.MaxPool2d(2, 2))  # 32x32x64
        convRelu(1)                                                 # 64x32x64
        cnn.add_module('pooling{0}'.format(1), nn.MaxPool2d(2, 2))  # 64x16x32
        convRelu(2, True)                                           # 128x16x32
        convRelu(3)                                                 # 256x16x32
        cnn.add_module('pooling{0}'.format(2), nn.MaxPool2d(2, 2))  # 256x8x16
        convRelu(4, True)                                           # 256x8x16
        convRelu(5)                                                 # 512x8x16
        cnn.add_module('pooling{0}'.format(3), nn.MaxPool2d((2, 2),(2, 1),(0,1)))  # 512x4x17
        convRelu(6, True)                                                          # 512x4x17
        convRelu(7)                                                                # 512x2x15
        cnn.add_module('pooling{0}'.format(4), nn.MaxPool2d((2, 2),(2, 1),(0,1)))  # 512x1x16
        
        self.cnn = cnn
    def forward(self, feature):
        return self.cnn(feature)
###########################

In [316]:
##test input/output shape
en_test = CTCEncoder()
test = ocr_dataset_train[100]["img"].unsqueeze(0) # add a batch dimension; the sample is already a float tensor
result = en_test(test)
print("input shape:{}".format(test.shape))
print("output shape:{}".format(result.shape))

input shape:torch.Size([1, 1, 64, 128])
output shape:torch.Size([1, 512, 1, 16])


  


#### 1.2.2 Decoder
Implement a decoder that takes the CNN-encoded features and outputs loglogits, i.e. log-probabilities over the lexicon at each time step. In a CRNN model, the decoder is most likely RNN based. 
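
PyTorch's CTC loss expects log-probabilities, which is why the decoder ends with a log-softmax over the lexicon dimension. As a dependency-free illustration of that final step for a single time step, here is a numerically stable log-softmax in plain Python. The input scores are arbitrary example values.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)  # subtracting the max avoids overflow in exp()
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]


log_probs = log_softmax([2.0, 1.0, 0.1])
print(log_probs)
# Exponentiating recovers probabilities that sum to 1 (up to rounding).
print(sum(math.exp(lp) for lp in log_probs))
```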

In [222]:
class CTCDecoder(nn.Module):
    '''
    decoder for the transcription model

    '''
    def __init__(self):
        super(CTCDecoder, self).__init__()
        ##### your code here ######
        self.nh = 256
        self.nOut = len(ALPHABET)+1
        self.rnn = nn.LSTM(512, self.nh, bidirectional=True)
        self.fc = nn.Linear(self.nh * 2, self.nOut)
        ###########################
    def forward(self, feature):
        '''
        forward function for decoding submodule

        :param feature(torch tensor): encoded feature with shape (batch_size, hidden_dim, 1, feature_map_width) 
        :return: loglogits(torch tensor) for each block on the image(feature_map_width, batch_size, lexicon_size) 
        '''
        ##### your code here ######
        feature = feature.squeeze(2) #remove height dimension from CNN output
        feature = feature.permute(2, 0, 1) #move feature_map_width to the first dimension
        recurrent, _ = self.rnn(feature)
        T, b, h = recurrent.size()
        t_rec = recurrent.view(T * b, h) ## flatten time and batch dims; h = 2 * nh because the LSTM is bidirectional

        output = self.fc(t_rec)  # [T * b, nOut]
        output = output.view(T, b, -1)
        output = output.log_softmax(2)
        ###########################
    
        return output

In [319]:
##test input/output shape
de_test = CTCDecoder()
result_de = de_test.forward(result)
print("input:{}\noutput:{}".format(result.shape,result_de.shape))

input:torch.Size([1, 512, 1, 16])
output:torch.Size([16, 1, 21])


#### 1.2.3 Combine Encoder and Decoder

In [196]:
import torch.nn as nn
from torch.nn import CTCLoss #https://pytorch.org/docs/stable/nn.html#ctcloss
import torch.optim as optim
import torch

USE_GPU=True
LR = 1e-3

class Crnn(nn.Module):
    '''
    module for the whole transcription model, crnn stands for
    convolutional recurrent network
    '''
    def __init__(self):
        super(Crnn, self).__init__()
        ##### fill in #####
        self.encoder = CTCEncoder()
        self.decoder = CTCDecoder()
        self.loss = CTCLoss()
        self.optimizer = optim.Adam(self.parameters(), lr = LR)
        ################### 
    def forward(self, x):
        '''
        forward function for crnn model, converts a batch of
        images to a batch of sequences of log logits.
        :param x(pytorch tensor): a tensor of images (batch_size, 1, im_height, im_width)
        :return: loglogits(pytorch tensor) with shape(feature_map_width, batch_size, lexicon_size) 
        '''
        encoded_feature = self.encoder(x)
        loglogits = self.decoder(encoded_feature)
        return loglogits


Crnn()

Crnn(
  (encoder): CTCEncoder(
    (cnn): Sequential(
      (conv0): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (relu0): ReLU(inplace=True)
      (pooling0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (conv1): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (relu1): ReLU(inplace=True)
      (pooling1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (conv2): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (batchnorm2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu2): ReLU(inplace=True)
      (conv3): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (relu3): ReLU(inplace=True)
      (pooling2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (conv4): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (batchnorm4): B

In [318]:
##test input/output shape
model_test = Crnn()
result_crnn = model_test(test)
print("input:{}\noutput:{}".format(test.shape,result_crnn.shape))

input:torch.Size([1, 1, 64, 128])
output:torch.Size([16, 1, 21])


In [322]:
#test input/output for one batch
sample_test = next(iter(data_loader_train))
result_crnn = model_test(sample_test['img'])
print("input:{}\noutput:{}".format(sample_test['img'].shape, result_crnn.shape))

input:torch.Size([64, 1, 64, 128])
output:torch.Size([16, 64, 21])


In [None]:
##encoder/decoder reference source:https://github.com/foamliu/CRNN/blob/master/models.py

### 1.3 train step

In [266]:
def train(model, sample):
    '''
    train function for transcription model

    :param model(pytorch module): pytorch transcription model
    :param sample(dict): a minibatch of samples, the dict should at least include
                        keys 'img', 'label', 'len'
    :return: loss(float) and accuracy(float) for current train step
    '''
    # get loglogits
    model.train() # set the model to training mode
    x = sample['img']
    if USE_GPU:
        x = x.cuda() # .cuda() returns a copy, so the result must be reassigned
    loglogits = model(x)
    
    ##### fill in the code to format output, get loss and back propagate #####
    # see https://pytorch.org/docs/stable/nn.html#ctcloss for the CTC loss documentation

    # format targets and lengths
    batch_size = loglogits.shape[1]
    target_lengths = torch.as_tensor(sample['len'], dtype=torch.long)
    targets = sample['label']
    if USE_GPU:
        targets = targets.cuda()
    # every sample in the batch uses the full output sequence length
    output_lengths = torch.full((batch_size,), loglogits.size(0), dtype=torch.long)
    
    # get loss
    loss = model.loss(loglogits, targets, output_lengths, target_lengths)
    
    # back propagate and update weights
    model.optimizer.zero_grad()
    loss.backward()
    model.optimizer.step()
    #####
    
    accy = calc_accy(loglogits, sample['label'], sample['len'])
    return loss.item(), accy
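
The CTC loss is only finite when the output sequence is long enough for the target: one time step per character, plus one blank between each pair of identical neighbouring characters. A minimal sketch of that feasibility check, assuming the 16 time steps produced by the encoder in this notebook. `min_ctc_input_length` is a hypothetical helper.

```python
def min_ctc_input_length(label):
    """Fewest time steps CTC needs to emit `label`: one per character,
    plus one blank between each pair of identical neighbours."""
    repeats = sum(1 for a, b in zip(label, label[1:]) if a == b)
    return len(label) + repeats


T = 16  # time steps produced by the encoder in this notebook (feature map width)
for word in ['by', 'see', 'all']:
    needed = min_ctc_input_length(word)
    print(word, needed, needed <= T)
```

With labels of at most 3 characters the worst case is 5 time steps, so 16 output steps leave comfortable headroom.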

### 1.4 validation step

In [267]:
def validate(model, sample):
    '''
    validation function for transcription model

    :param model(pytorch module): pytorch transcription model
    :param sample(dict): a minibatch of samples, the dict should at least include
                        keys 'img', 'label', 'len'
    :return: accuracy(float) for validation step
    '''
    ##### fill in the code to forward and get validation output #####
    model.eval() # set the model to evaluation mode
    x = sample['img']
    if USE_GPU:
        x = x.cuda() # .cuda() returns a copy, so the result must be reassigned
    with torch.no_grad(): # no gradients are needed for validation
        logits = model(x)
    labels = sample['label']
    label_lengths = sample['len']
    ################################################################
    accy = calc_accy(logits, labels, label_lengths)
    return accy

### Training and Experimenting
Use the functions above to create your training script. Tune hyperparameters and run any experiments and analysis you like. The training accuracy should be >90% and the validation accuracy should be >50%. Feel free to add any other features, such as training curve visualization, checkpoint loading and saving, and so on. 

In [328]:
def train_loop(n_epochs, model, start_epoch = 0):
    if USE_GPU:
        model.cuda()
    for epoch in range(1, n_epochs+1):
        ##variable to track loss/acc
        train_loss = 0.0
        train_acc = 0.0
        valid_acc = 0.0
        ##train
        for i,sample in enumerate(data_loader_train):
            loss_item, acc = train(model, sample)
            train_loss += loss_item
            train_acc += acc
        ##valid
        for i,sample in enumerate(data_loader_eval):
            acc = validate(model, sample)
            valid_acc += acc
        print('Epoch: {} \tTraining Loss: {:.6f} \tTraining acc: {:.6f} \tValidation acc: {:.6f}'.format(
            start_epoch + epoch, 
            train_loss/len(data_loader_train),
            train_acc/len(data_loader_train),
            valid_acc/len(data_loader_eval),
            ))


In [270]:
#### free style time #####
model = Crnn()
#first train 10 epochs
train_loop(10, model)

Epoch: 1 	Training Loss: 3.307678 	Training acc: 0.000000 	Validation acc: 0.000000
Epoch: 2 	Training Loss: 2.717684 	Training acc: 0.000000 	Validation acc: 0.000000
Epoch: 3 	Training Loss: 2.138598 	Training acc: 0.000000 	Validation acc: 0.000000
Epoch: 4 	Training Loss: 1.539362 	Training acc: 0.093924 	Validation acc: 0.046474
Epoch: 5 	Training Loss: 1.045012 	Training acc: 0.401975 	Validation acc: 0.151415
Epoch: 6 	Training Loss: 0.768983 	Training acc: 0.573351 	Validation acc: 0.415175
Epoch: 7 	Training Loss: 0.576011 	Training acc: 0.666862 	Validation acc: 0.322696
Epoch: 8 	Training Loss: 0.461980 	Training acc: 0.733398 	Validation acc: 0.514630
Epoch: 9 	Training Loss: 0.351848 	Training acc: 0.790039 	Validation acc: 0.497638
Epoch: 10 	Training Loss: 0.272768 	Training acc: 0.834983 	Validation acc: 0.477758


In [275]:
#train 5 more epochs
train_loop(5, model, 10)

Epoch: 11 	Training Loss: 0.211068 	Training acc: 0.871853 	Validation acc: 0.534096
Epoch: 12 	Training Loss: 0.169552 	Training acc: 0.899544 	Validation acc: 0.374516
Epoch: 13 	Training Loss: 0.123459 	Training acc: 0.927127 	Validation acc: 0.554404
Epoch: 14 	Training Loss: 0.112462 	Training acc: 0.938368 	Validation acc: 0.576053
Epoch: 15 	Training Loss: 0.080453 	Training acc: 0.954102 	Validation acc: 0.604098


In [276]:
#train 10 more see if we can get any better
train_loop(10, model, 15)

Epoch: 16 	Training Loss: 0.055026 	Training acc: 0.968338 	Validation acc: 0.547925
Epoch: 17 	Training Loss: 0.074274 	Training acc: 0.956467 	Validation acc: 0.613589
Epoch: 18 	Training Loss: 0.057719 	Training acc: 0.964149 	Validation acc: 0.610342
Epoch: 19 	Training Loss: 0.036333 	Training acc: 0.979687 	Validation acc: 0.619543
Epoch: 20 	Training Loss: 0.023815 	Training acc: 0.987305 	Validation acc: 0.627943
Epoch: 21 	Training Loss: 0.016887 	Training acc: 0.993164 	Validation acc: 0.631258
Epoch: 22 	Training Loss: 0.011599 	Training acc: 0.995703 	Validation acc: 0.640238
Epoch: 23 	Training Loss: 0.008070 	Training acc: 0.997070 	Validation acc: 0.639520
Epoch: 24 	Training Loss: 0.009193 	Training acc: 0.996289 	Validation acc: 0.614114
Epoch: 25 	Training Loss: 0.012915 	Training acc: 0.992383 	Validation acc: 0.580391


In [None]:
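##validation accuracy peaked at epoch 22 (0.640) and then dropped while training accuracy stayed above 0.99,
##so the model seems to be starting to overfit the training data.
##save the current model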
torch.save(model.state_dict(), "model_crnn.pt")  
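
One possible way to act on the overfitting signal is to track the best validation epoch and stop once it has not improved for a few epochs. The sketch below is plain Python over the validation accuracies logged above for epochs 16 to 25. The helper names and the patience value of 3 are illustrative assumptions, not part of the original training loop.

```python
def best_epoch(val_accs, first_epoch=1):
    """Return (epoch, accuracy) of the highest validation accuracy."""
    best_idx = max(range(len(val_accs)), key=lambda i: val_accs[i])
    return first_epoch + best_idx, val_accs[best_idx]


def should_stop(val_accs, patience=3):
    """True once the last `patience` epochs all failed to beat the earlier best."""
    if len(val_accs) <= patience:
        return False
    best_before = max(val_accs[:-patience])
    return all(acc <= best_before for acc in val_accs[-patience:])


# Validation accuracies logged above for epochs 16-25.
val_accs = [0.547925, 0.613589, 0.610342, 0.619543, 0.627943,
            0.631258, 0.640238, 0.639520, 0.614114, 0.580391]

print(best_epoch(val_accs, first_epoch=16))  # (22, 0.640238)
print(should_stop(val_accs, patience=3))     # True
```

In a training loop, the same check could decide when to save a checkpoint (whenever the validation accuracy improves) so that the saved weights correspond to the best epoch rather than the last one.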