# Team KASA



#### Extracting, cropping and preprocessing images from SVHN dataset

---


This notebook extracts the SVHN dataset, crops the images to a specified size (e.g. 32x32 or 54x54) and depth (RGB = 3, greyscale = 1), and stores the result in an .h5 file.

Install the packages below if they are not already installed.

In [14]:
# ! pip install seaborn
# ! pip install h5py

### Importing packages

In [2]:
import os
import sys
import tarfile
import numpy as np
import json
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from PIL import Image, ImageDraw
# Alias IPython's Image so it does not shadow PIL.Image imported above
from IPython.display import display, HTML, Image as IPyImage
import h5py

plt.rcParams['figure.figsize'] = (16.0, 4.0)
%matplotlib inline

### Downloading datasets and extracting images
The datasets are downloaded from <a href="http://ufldl.stanford.edu/housenumbers/">this page</a>.  
They are distributed in three categories: train, test and extra. Download only the datasets that your memory constraints allow.  
Each category contains images as well as a digitStruct.mat file that specifies the bounding boxes of the digits to be extracted.  
**NOTE:** The extra dataset contains roughly 200,000 (~2 lakh) images, so ensure there is enough memory and disk space before downloading it.

**CHANGE:** Select the data files to be loaded.  
Since loading the extra dataset takes a long time, the code for it has been commented out.

In [3]:
# **** CHANGE *** Select the datafiles to be loaded
# datasets=['train','test','extra']
datasets=['train','test']

from data_import import load_file
load_file(datasets)

Loading train tar file....
train tar file loaded!
Loading test tar file....
test tar file loaded!
Extracting train.tar.gz ...
Extracting test.tar.gz ...
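
The internals of `load_file` are not shown in this notebook. As a rough sketch of the extraction step only, assuming each archive is a gzip-compressed tar file that unpacks into its own folder (as the log above suggests), it could look like the following. The helper name `extract_archive` and the `dest_dir` parameter are hypothetical and are not part of `data_import`.

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir="data"):
    """Extract a .tar.gz archive into dest_dir and return the member names.

    Hypothetical helper: the real data_import.load_file may work differently.
    """
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return names
```

Calling `extract_archive("train.tar.gz")` would, under these assumptions, place the images and `digitStruct.mat` under `data/train/`, which is where the following cells look for them.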


### Cropping images using digitStruct.mat and generating the datasets

In [4]:
from extract_box import DigitStructWrapper

print("This may take a while!!")
print('Collecting digit structure for train data...')
digitFile=DigitStructWrapper(os.path.join('data','train','digitStruct.mat'))
box_train=digitFile.unpack_all()

print('Collecting digit structure for test data...')
digitFile=DigitStructWrapper(os.path.join('data','test','digitStruct.mat'))
box_test=digitFile.unpack_all()

# print('Collecting digit structure for extra data...')
# digitFile=DigitStructWrapper(os.path.join('data','extra','digitStruct.mat'))
# box_extra=digitFile.unpack_all()


This may take a while!!
Collecting digit structure for train data...
Collecting digit structure for test data...


In [6]:
size_image=32
depth=3

from crop_generate_data import building_dataset
print('Train set')
X_train, y_train = building_dataset(box_train, "data/train/",size_image,depth)
print("Train", X_train.shape, y_train.shape)


print('Test set')
X_test, y_test = building_dataset(box_test, "data/test/",size_image,depth)
print("Test", X_test.shape, y_test.shape)

# print('Extra set')
# X_extra, y_extra = building_dataset(box_extra, "data/extra/",size_image,depth)
# print("Extra", X_extra.shape, y_extra.shape)




Train set
Train (33402, 32, 32, 3) (33402, 6)
Test set
Test (13068, 32, 32, 3) (13068, 6)
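
The code of `building_dataset` is not shown here, so the following is only a sketch of one plausible cropping scheme: take the smallest rectangle that covers all digit boxes of an image, crop to it, and resize the crop to `size_image` x `size_image`. The box format (dicts with `left`, `top`, `width`, `height`), the function names, and the nearest-neighbour resampling are all assumptions made for illustration.

```python
import numpy as np


def union_box(boxes, img_h, img_w):
    """Smallest rectangle covering every digit box, clipped to the image."""
    left = min(b["left"] for b in boxes)
    top = min(b["top"] for b in boxes)
    right = max(b["left"] + b["width"] for b in boxes)
    bottom = max(b["top"] + b["height"] for b in boxes)
    left, top = max(int(left), 0), max(int(top), 0)
    right, bottom = min(int(right), img_w), min(int(bottom), img_h)
    return left, top, right, bottom


def crop_and_resize(image, boxes, size):
    """Crop image (H, W, C) to the union box and resize to (size, size, C)."""
    left, top, right, bottom = union_box(boxes, image.shape[0], image.shape[1])
    crop = image[top:bottom, left:right]
    # Nearest-neighbour resampling: pick one source row/column per output pixel.
    rows = np.arange(size) * crop.shape[0] // size
    cols = np.arange(size) * crop.shape[1] // size
    return crop[rows][:, cols]


# Synthetic 60x100 RGB image with two hypothetical digit boxes
image = np.zeros((60, 100, 3), dtype=np.uint8)
boxes = [
    {"left": 20, "top": 10, "width": 15, "height": 30},
    {"left": 36, "top": 12, "width": 14, "height": 30},
]
patch = crop_and_resize(image, boxes, 32)
print(patch.shape)  # (32, 32, 3)
```

Stacking one such patch per image would give an array of shape `(N, 32, 32, 3)`, matching the shapes printed above.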


To restrict the labels to 5-digit numbers instead of 6-digit numbers, run the code snippet below.

In [7]:
y_train=y_train[:,:-1]
# y_extra=y_extra[:,:-1]
y_test=y_test[:,:-1]
print('Training set', X_train.shape, y_train.shape)
# print('Extra set', X_extra.shape, y_extra.shape)
print('Test set', X_test.shape, y_test.shape)

Training set (33402, 32, 32, 3) (33402, 5)
Test set (13068, 32, 32, 3) (13068, 5)


To convert the images from RGB to greyscale, run the code snippets below.

In [8]:
depth=1
def rgb2gray(images):
    """Convert a batch of images from RGB to greyscale.

    Expects shape (N, H, W, 3) and returns shape (N, H, W, 1).
    """
    greyscale = np.dot(images, [0.2989, 0.5870, 0.1140])
    return np.expand_dims(greyscale, axis=3)


# Transform the images to greyscale
X_train = rgb2gray(X_train).astype(np.float32)
X_test = rgb2gray(X_test).astype(np.float32)
# X_extra = rgb2gray(X_extra).astype(np.float32)


In [10]:
print('Training set', X_train.shape, y_train.shape)
# print('Extra set', X_extra.shape, y_extra.shape)
print('Test set', X_test.shape, y_test.shape)

Training set (33402, 32, 32, 1) (33402, 5)
Test set (13068, 32, 32, 1) (13068, 5)
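
As a quick sanity check of the conversion above, the weighted sum 0.2989 R + 0.5870 G + 0.1140 B can be applied to a tiny synthetic batch. The array names here are illustrative only; a pure white pixel should map to roughly 255 and a pure green pixel to roughly 150.

```python
import numpy as np

weights = np.array([0.2989, 0.5870, 0.1140])

# Two 4x4 RGB images: one pure white, one pure green
batch = np.zeros((2, 4, 4, 3), dtype=np.uint8)
batch[0] = 255
batch[1, ..., 1] = 255

# np.dot contracts the last axis (the 3 colour channels) with the weights,
# then expand_dims restores a single-channel axis.
grey = np.expand_dims(np.dot(batch, weights), axis=3).astype(np.float32)
print(grey.shape)  # (2, 4, 4, 1)
print(grey[0, 0, 0, 0], grey[1, 0, 0, 0])
```

The weights sum to 0.9999, so white comes out marginally below 255.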


Split the training data into training and validation sets.

In [11]:
def random_sample(a, b):
    """Return a shuffled boolean mask of length a with exactly b True entries."""
    patch = np.array([True]*b + [False]*(a-b))
    np.random.shuffle(patch)
    return patch

# Pick 4000 training samples (and, optionally, 2000 extra samples) for validation
sample1 = random_sample(X_train.shape[0], 4000)
# sample2 = random_sample(X_extra.shape[0], 2000)

# Create validation set from the sampled data
X_val = np.concatenate([X_train[sample1]])
y_val = np.concatenate([y_train[sample1]])

# Keep the data not contained by sample
X_train = np.concatenate([X_train[~sample1]])
y_train = np.concatenate([y_train[~sample1]])

# Moved to validation and training set
# del X_extra, y_extra 

print("Training", X_train.shape, y_train.shape)
print('Validation', X_val.shape, y_val.shape)

Training (29402, 32, 32, 1) (29402, 5)
Validation (4000, 32, 32, 1) (4000, 5)
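
The split above depends on NumPy's global random state, so it changes from run to run. If a reproducible split is wanted, one possible variant is a seeded mask like the sketch below. The function name `split_mask` and the `seed` parameter are hypothetical and not part of the original notebook.

```python
import numpy as np


def split_mask(n_total, n_selected, seed=0):
    """Boolean mask of length n_total with exactly n_selected True entries."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(n_total, dtype=bool)
    # Draw n_selected distinct indices and mark them as selected
    mask[rng.choice(n_total, size=n_selected, replace=False)] = True
    return mask


# Toy data: 10 samples with 2 features each
X = np.arange(10 * 2).reshape(10, 2)
mask = split_mask(len(X), 3)
X_val_demo, X_train_demo = X[mask], X[~mask]
print(X_val_demo.shape, X_train_demo.shape)  # (3, 2) (7, 2)
```

Indexing with `mask` and `~mask` guarantees that the two subsets are disjoint and together cover every sample.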


The final dataset will be stored in the data folder by the name given below

In [12]:
file_str="digits_"+str(size_image)+"_"+str(size_image)+"_"+str(depth)+".h5"
file_str

'digits_32_32_1.h5'

In [13]:
# Create file
h5f = h5py.File('data/'+file_str, 'w')

# Store the datasets
h5f.create_dataset('train_dataset', data=X_train)
h5f.create_dataset('train_labels', data=y_train)
h5f.create_dataset('valid_dataset', data=X_val)
h5f.create_dataset('valid_labels', data=y_val)

# Close the file
h5f.close()
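
To check that a file written this way can be read back, a round trip along the following lines may help. It is a self-contained sketch that writes a small dummy file to a temporary directory instead of touching `data/`; the file name `digits_demo.h5` and the dummy array contents are made up for illustration, while the dataset names mirror the ones used above.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "digits_demo.h5")

# Write two small dummy datasets; the context manager closes the file for us
with h5py.File(path, "w") as h5f:
    h5f.create_dataset("train_dataset", data=np.zeros((4, 32, 32, 1), dtype=np.float32))
    h5f.create_dataset("train_labels", data=np.zeros((4, 5), dtype=np.int64))

# Read them back; [:] loads the full dataset into a NumPy array
with h5py.File(path, "r") as h5f:
    X_loaded = h5f["train_dataset"][:]
    y_loaded = h5f["train_labels"][:]

print(X_loaded.shape, y_loaded.shape)  # (4, 32, 32, 1) (4, 5)
```

Note that the cell above stores only the training and validation sets, so a model notebook reading the real file would find `train_dataset`, `train_labels`, `valid_dataset` and `valid_labels`.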