
In [None]:
import torch
import os
import imageio
import numpy as np

## Images

### Single image

We need to be able to load an image from common image formats and then transform the data into a tensor representation.

In [None]:
# Let's read a JPEG image

img_arr = imageio.imread('https://raw.githubusercontent.com/deep-learning-with-pytorch/dlwpt-code/master/data/p1ch4/image-dog/bobby.jpg')
img_arr.shape

(720, 1280, 3)

What are these? The first number is the height, while the second is the width. The last one is the number of channels, Red, Green and Blue.

So, for example, indexing `img_arr[0,0]` will return the RGB values for the top-leftmost pixel.

We could convert this to a PyTorch tensor, but PyTorch modules that deal with images expect the layout Channels x Height x Width, so we need to change the layout using the `permute` method. In this case we need to put the channel dimension (the RGB values) first, while the other two dimensions keep their relative order.

In [None]:
img = torch.from_numpy(img_arr)
out = img.permute(2,0,1)

In [None]:
out.shape

torch.Size([3, 720, 1280])
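
A quick check on a small synthetic array can make the effect of `permute` more concrete. This is only a sketch: the 2 x 3 x 3 array below is a made-up stand-in for a real image. It suggests that `permute` only changes how the data is indexed, and that the result shares storage with the original tensor rather than copying it.

```python
import numpy as np
import torch

# Hypothetical tiny "image": height 2, width 3, 3 channels
img_arr = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)
img = torch.from_numpy(img_arr)   # H x W x C
out = img.permute(2, 0, 1)        # C x H x W

# Channel 1 of pixel (row 0, column 2) is the same value in both layouts
same_pixel = bool(out[1, 0, 2] == img[0, 2, 1])

# permute returns a view: writing through `out` also changes `img`
out[0, 0, 0] = 255
shares_storage = bool(img[0, 0, 0] == 255)

print(out.shape, same_pixel, shares_storage)
```

If an independent copy is needed, calling `.clone()` on the permuted tensor is one option.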

### Multiple images in a batch

When we have multiple images, we store them in a batch, a tensor with four dimensions: N x C x H x W (Number of images, channels, height and width).

A batch of 3 RGB images, each of 256 x 256 pixels, can be pre-allocated as follows:

In [None]:
batch_size = 3
batch = torch.zeros(batch_size, 3, 256, 256, dtype=torch.uint8)

In [None]:
data_dir = 'https://raw.githubusercontent.com/deep-learning-with-pytorch/dlwpt-code/master/data/p1ch4/image-cats/'
filenames = ['cat1.png', 'cat2.png', 'cat3.png']

In [None]:
# Loop through file names, load a NumPy array, convert to tensor with the right
# permutation and add to the tensor
for i, filename in enumerate(filenames):
  filepath = data_dir + filename
  print(filepath)
  img_arr = imageio.imread(filepath)
  img_torch = torch.from_numpy(img_arr)
  out = img_torch.permute(2,0,1)
  out = out[:3] # keep just the R, G, B channels and drop the alpha channel, if any
  batch[i] = out

https://raw.githubusercontent.com/deep-learning-with-pytorch/dlwpt-code/master/data/p1ch4/image-cats/cat1.png
https://raw.githubusercontent.com/deep-learning-with-pytorch/dlwpt-code/master/data/p1ch4/image-cats/cat2.png
https://raw.githubusercontent.com/deep-learning-with-pytorch/dlwpt-code/master/data/p1ch4/image-cats/cat3.png


In [None]:
# Now the batch has the required shapes
batch.shape

torch.Size([3, 3, 256, 256])
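
As an alternative to pre-allocating the batch and filling it in a loop, the images could be collected in a list and combined with `torch.stack`. The sketch below uses random tensors as hypothetical stand-ins for images that have already been loaded and permuted to C x H x W; it assumes all images have the same size.

```python
import torch

# Hypothetical stand-ins for three already-loaded C x H x W images
images = [torch.randint(0, 256, (3, 256, 256), dtype=torch.uint8) for _ in range(3)]

# torch.stack adds a new leading dimension, giving N x C x H x W
batch_stacked = torch.stack(images)
print(batch_stacked.shape, batch_stacked.dtype)
```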

### Image data normalisation

Neural networks tend to train best when the input data ranges roughly from 0 to 1, or from -1 to 1. So we want to cast the tensor to floating point and then normalise the pixel values.

There are 2 ways to do this:
1. just divide the values of the pixels by 255 (the maximum value representable with 8 bits)
2. calculate the mean and standard deviation of each channel, and then scale the data so that the output has zero mean and unit standard deviation across each channel.

In [None]:
# First method
batch_std1 = batch.float()
batch_std1 /= 255

In [None]:
batch_std1.shape

torch.Size([3, 3, 256, 256])

In [None]:
# Check RGB values for the top-left pixel in first image of the batch
batch_std1[0,:,0,0]


tensor([0.6118, 0.5451, 0.5059])

In [None]:
# Second method
batch_std2 = batch.float()
number_of_channels = batch_std2.shape[1]
for i in range(number_of_channels):
  mean = torch.mean(batch_std2[:,i])
  std = torch.std(batch_std2[:,i])
  batch_std2[:,i] = (batch_std2[:,i] - mean)/std

In [None]:
# And again we can check the RGB values for the top-left pixel in first image of the batch
batch_std2[0,:,0,0]


tensor([0.1439, 0.4632, 0.7792])
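
The per-channel loop can also be written without an explicit loop, by reducing over every dimension except the channel one. The following is a sketch on a small random batch (a hypothetical stand-in for the image batch); the seed and sizes are arbitrary.

```python
import torch

torch.manual_seed(0)

# Synthetic batch standing in for the image batch: N x C x H x W
batch = torch.randint(0, 256, (3, 3, 8, 8), dtype=torch.uint8)
x = batch.float()

# Reduce over batch, height and width; keepdim=True leaves shape 1 x C x 1 x 1,
# so the statistics broadcast against the full batch
mean = x.mean(dim=(0, 2, 3), keepdim=True)
std = x.std(dim=(0, 2, 3), keepdim=True)
x_norm = (x - mean) / std

# Per-channel mean should be close to 0 and standard deviation close to 1
print(x_norm.mean(dim=(0, 2, 3)), x_norm.std(dim=(0, 2, 3)))
```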

### 3D images (volumetric data)

No real difference, we just have a 5th dimension, which we can call `depth`. So the 5D tensor will have these dimensions: N x C x D x H x W (number of images, channels, depth, height and width).

We can show how it works by loading an image in a specialised format (DICOM).


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
dir_path = '/content/drive/My Drive/ML/dicom'

In [None]:
vol_arr = imageio.volread(dir_path, 'DICOM')
vol_arr.shape

(99, 512, 512)

Here we have an array where the first dimension is the depth, the second is the height and the third is the width. Note that this is just one 3D image (stored as 99 separate files, one per slice).

We can convert this into a tensor, but we note that there is one dimension missing, i.e. the channels. In this case, we just have one channel (greyscale). We add it with the `unsqueeze` method.

In [None]:
vol = torch.from_numpy(vol_arr).float()
vol = torch.unsqueeze(vol, 0)
vol.shape

torch.Size([1, 99, 512, 512])

Now we could repeat the process for each image in the dataset, grouping them in a batch, i.e. a tensor with 5 dimensions, of which the first one would be the 3D image number.
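
A minimal sketch of that batching step, using small random tensors as hypothetical stand-ins for loaded volumes (the sizes are arbitrary and all volumes are assumed to have the same shape):

```python
import torch

# Hypothetical stand-ins for two loaded volumes, each D x H x W
volumes = [torch.rand(9, 16, 16) for _ in range(2)]

# Add the channel dimension to each volume, then stack along a new batch dimension
batch_vol = torch.stack([v.unsqueeze(0) for v in volumes])
print(batch_vol.shape)  # N x C x D x H x W
```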

## Tabular data

Most common type of data (e.g. `csv`). The challenge is that this is often heterogeneous data (integers, floats, text, etc.) which needs to be converted into numeric tensors.

Let's load a `csv` file and convert it into a tensor. 

In [None]:
path = 'https://raw.githubusercontent.com/deep-learning-with-pytorch/dlwpt-code/master/data/p1ch4/tabular-wine/winequality-white.csv'
wineq_numpy = np.loadtxt(path, dtype=np.float32, delimiter=';', skiprows=1)
wineq_numpy

array([[ 7.  ,  0.27,  0.36, ...,  0.45,  8.8 ,  6.  ],
       [ 6.3 ,  0.3 ,  0.34, ...,  0.49,  9.5 ,  6.  ],
       [ 8.1 ,  0.28,  0.4 , ...,  0.44, 10.1 ,  6.  ],
       ...,
       [ 6.5 ,  0.24,  0.19, ...,  0.46,  9.4 ,  6.  ],
       [ 5.5 ,  0.29,  0.3 , ...,  0.38, 12.8 ,  7.  ],
       [ 6.  ,  0.21,  0.38, ...,  0.32, 11.8 ,  6.  ]], dtype=float32)

We can check that the file has been loaded correctly by loading the data separately with pandas, getting the column names and checking that their number matches the second dimension of the array above.

In [None]:
import pandas as pd
columns = pd.read_csv(path, sep=';').columns

In [None]:
wineq_numpy.shape, len(columns)

((4898, 12), 12)

In [None]:
columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

In [None]:
# We can then convert the array into a tensor
wineq = torch.from_numpy(wineq_numpy)
wineq.shape, wineq.dtype

(torch.Size([4898, 12]), torch.float32)

This tensor contains a column with the scores, though in general we'd want that to be in a separate tensor as it would be the set of labels used in training (aka the `target`). We do this below

In [None]:
data = wineq[:,:-1]
target = wineq[:,-1]

In [None]:
data.shape, target.shape

(torch.Size([4898, 11]), torch.Size([4898]))

Now we need to choose how to represent the labels. We could treat them as **categorical data**, in which case we would convert them to integers. The other option is to apply **one-hot encoding**.

In [None]:
# Convert the target into categorical data
target = target.long()
target

tensor([6, 6, 6,  ..., 6, 7, 6])

In [None]:
# Apply one-hot encoding - this is done with the `scatter_` method.
# Note: there are 4898 rows, and 10 labels. So, the resulting tensor needs to be
# a 2D tensor (4898, 10). 

# We're starting from a 1D tensor and we want to go to a 2D tensor without changing
# the content of the tensor

target_unsqueezed = target.unsqueeze(1)
target.shape, target_unsqueezed.shape


(torch.Size([4898]), torch.Size([4898, 1]))

In [None]:
# Then we can use the `scatter_` method. The first argument is the dimension we
# apply this to, the second is a column tensor indicating the indices to scatter,
# the third is the scalar to scatter (1 in this case).

# First we define the empty tensor, with the same number of rows as the target, 
# but 10 columns, given that there are 10 different scores

target_onehot = torch.zeros(target.shape[0], 10)

# And then we populate it
target_onehot.scatter_(1, target_unsqueezed, 1.0)

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 1., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

In [None]:
target_onehot.shape

torch.Size([4898, 10])
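
For comparison, `torch.nn.functional.one_hot` gives the same encoding in one call. The sketch below uses a short, made-up tensor of scores rather than the wine data, and checks that the two routes agree.

```python
import torch
import torch.nn.functional as F

# Small hypothetical set of integer scores in the range 0-9
scores = torch.tensor([6, 6, 7, 3, 9])

# Route 1: scatter_ into a tensor of zeros
onehot_scatter = torch.zeros(scores.shape[0], 10)
onehot_scatter.scatter_(1, scores.unsqueeze(1), 1.0)

# Route 2: F.one_hot returns an integer tensor, so cast to float to compare
onehot_fn = F.one_hot(scores, num_classes=10).float()

print(torch.equal(onehot_scatter, onehot_fn))
# The original scores can be recovered with argmax over the class dimension
print(onehot_fn.argmax(dim=1))
```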

We can use a boolean tensor to subset another one. In this case, we can set a threshold on the target variable, obtain a bool tensor and then use that to subset the data.

In [None]:
subsetter = target <= 3
subsetter.shape, subsetter.dtype

(torch.Size([4898]), torch.bool)

In [None]:
bad_wine = data[subsetter]
data.shape, bad_wine.shape

(torch.Size([4898, 11]), torch.Size([20, 11]))
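
Boolean masks can also be combined. The sketch below uses a made-up miniature dataset (5 rows, 2 features) rather than the wine data, purely to illustrate the indexing.

```python
import torch

# Hypothetical miniature dataset: 5 samples, 2 features, with integer scores
data = torch.tensor([[7.0, 0.27], [6.3, 0.30], [8.1, 0.28], [5.5, 0.29], [6.0, 0.21]])
target = torch.tensor([6, 3, 7, 2, 5])

mask = target <= 3              # bool tensor, one entry per row
bad = data[mask]                # keeps only the rows where mask is True
print(mask.sum().item(), bad.shape)

# Masks can be combined with & (and), | (or) and ~ (not)
mid = data[(target > 3) & (target < 7)]
print(mid.shape)
```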

## Time series

The objective in this case is to take the time fields in tabular data and use them to add a time dimension, i.e. going from a 2D tensor to a 3D one.

In [None]:
# Load a dataset of bike rentals over two years
path = 'https://raw.githubusercontent.com/deep-learning-with-pytorch/dlwpt-code/master/data/p1ch4/bike-sharing-dataset/hour-fixed.csv'
bikes_numpy = np.loadtxt(
    path,
    dtype=np.float32,
    delimiter=",",
    skiprows=1,
    converters={1: lambda x: float(x[8:10])})

In [None]:
columns = pd.read_csv(path, sep=',').columns
columns

Index(['instant', 'dteday', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday',
       'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed',
       'casual', 'registered', 'cnt'],
      dtype='object')

In [None]:
bikes_numpy.shape

(17520, 17)

There are 17,520 entries, with data across 17 dimensions. There is one entry per hour over a period of 2 years.

Suppose now we want to reshape the data to have one collection per day (24 hours). In this case we want a tensor with these dimensions:
- Number of collections: 730 (number of days in 2 years)
- Number of data points per collection: 24 (number of hours in a day)
- Number of dimensions, or channels per data point: 17

We can do this using `view`

In [None]:
bikes = torch.from_numpy(bikes_numpy)
daily_bikes = bikes.view(-1, 24, bikes.shape[1])

In [None]:
daily_bikes.shape

torch.Size([730, 24, 17])

The only issue is that now we have N (sequences) x L (hours) x C (channels), but the desired ordering is N x C x L. So we need to transpose the last two dimensions.

In [None]:
daily_bikes = daily_bikes.transpose(2,1)
daily_bikes.shape

torch.Size([730, 17, 24])
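
To see what `view` and `transpose` do to the data, here is a sketch on a small made-up table (48 hourly rows and 3 columns, i.e. 2 days). Both operations return views on the same storage; only the transposed tensor stops being contiguous.

```python
import torch

# Hypothetical table: 48 hourly rows (2 days) with 3 columns
table = torch.arange(48 * 3, dtype=torch.float32).reshape(48, 3)

daily = table.view(-1, 24, 3)       # N x L x C
daily_t = daily.transpose(1, 2)     # N x C x L

# Row 30 of the table is hour 6 of day 1; column 2 is channel 2
same_value = bool(daily_t[1, 2, 6] == table[30, 2])

print(daily.shape, daily_t.shape)
print(daily.is_contiguous(), daily_t.is_contiguous(), same_value)
```

Note that `view` relies on the rows being ordered in time and on their number being an exact multiple of 24.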

Finally, let's apply one-hot encoding to one of the categorical values, i.e. `weathersit`, which is in position 9. This can take four values, from 1 to 4.

In [None]:
# First let's extract this and check its dimensions
weather = daily_bikes[:,9,:]
weather.shape

torch.Size([730, 24])

In [None]:
# We actually want it to have shape 730 x 4 (categories) x 24, so we need to add a dimension
# but without adding any data. Also, we reduce the values by 1, as the weather ratings go from 1 to 4
# and we need them to be 0-based
weather_unsqueezed = weather.long().unsqueeze(1) - 1
weather_unsqueezed.shape

torch.Size([730, 1, 24])

In [None]:
# Now we can apply the one-hot encoding. First define the target tensor
weather_onehot = torch.zeros(daily_bikes.shape[0], 4, daily_bikes.shape[2])
weather_onehot.shape

torch.Size([730, 4, 24])

In [None]:
# And finally we can use scatter
weather_onehot.scatter_(1, weather_unsqueezed, 1.0)
weather_onehot.shape

torch.Size([730, 4, 24])

In [None]:
# The last step is to concatenate this with the original dataset
daily_bikes_onehot = torch.cat((daily_bikes, weather_onehot), dim=1)
daily_bikes_onehot.shape

torch.Size([730, 21, 24])
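
A sanity check for this kind of encoding, sketched on random weather codes (a hypothetical stand-in for the real column): each (day, hour) pair should have exactly one active category, and `argmax` over the category dimension should recover the original codes.

```python
import torch

# Hypothetical weather codes (1-4) for 2 days x 24 hours
weather = torch.randint(1, 5, (2, 24))
index = weather.unsqueeze(1) - 1            # N x 1 x L, zero-based

onehot = torch.zeros(2, 4, 24)
onehot.scatter_(1, index, 1.0)              # write a 1 along dim 1 at each index

# Each (day, hour) pair has exactly one active category
one_per_hour = bool((onehot.sum(dim=1) == 1).all())
# argmax over the category dimension recovers the original codes
recovered = onehot.argmax(dim=1) + 1

print(one_per_hour, torch.equal(recovered, weather))
```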

## Text

The goal is to turn text into a tensor of numbers, in line with the other cases.

Essentially, whether we operate at character level or word level, the technique is the same: we use one-hot encoding. 

### Character level

In [None]:
# Let's read some text (`>` overwrites the file, so re-running the cell does not duplicate the text)
!curl https://raw.githubusercontent.com/deep-learning-with-pytorch/dlwpt-code/master/data/p1ch4/jane-austen/1342-0.txt > /content/sample_data/jane.txt



In [None]:
with open('/content/sample_data/jane.txt', 'r', encoding='utf8') as f:
  text = f.read() 

len(text)


1408380

We now need to parse through all the characters in the text and provide a one-hot encoding for each of them. Each character will be represented by a vector whose length equals the number of distinct characters in the encoding.

To start, we just focus on one line of text, splitting on `\n`.

In [None]:
lines = text.split('\n')
line = lines[200]
line

'“Impossible, Mr. Bennet, impossible, when I am not acquainted with him'

Now, we first create a tensor of zeros which will hold the one-hot encoded vectors (based on characters). The dimensions are:
- length of line: we need a vector for each character in the line
- number of possible characters: using ASCII, this would be 128

In [None]:
ASCII_size = 128
letter_t = torch.zeros(len(line), ASCII_size)
letter_t.shape

torch.Size([70, 128])

To do the one-hot encoding by character, we enumerate through the (lower-cased) string. For each character, we get its corresponding ASCII value using the `ord` function and then populate the corresponding position in the tensor

In [None]:
for i, letter in enumerate(line.lower().strip()):
  letter_index = ord(letter) if ord(letter) < 128 else 0 # to only encode "normal characters"
  letter_t[i][letter_index] = 1
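
The same encoding can be sketched without writing into the tensor one element at a time, by first building a tensor of character indices and then calling `torch.nn.functional.one_hot`. The sample string below is a made-up short line, and `decoded` is a hypothetical helper variable used only to check the round trip.

```python
import torch
import torch.nn.functional as F

line = "Impossible, Mr. Bennet"   # hypothetical short sample

# Map every character to its code point, sending anything outside ASCII to index 0
indices = torch.tensor([ord(c) if ord(c) < 128 else 0 for c in line.lower()])

# One row per character, one column per ASCII code
letter_t = F.one_hot(indices, num_classes=128).float()
print(letter_t.shape)

# Decoding: argmax gives back the code point of each character
decoded = "".join(chr(i) for i in letter_t.argmax(dim=1).tolist())
print(decoded)
```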

### Word level

In this case, we want to apply one-hot encoding to individual words. We proceed as follows:

- We first define a function to split strings into lists of words (removing punctuation)
- We use the function on the entire corpus to build a dictionary to provide the positioning index for each word
- We iterate through the words in the chosen sentence (also split into individual words) and populate the tensor of one-hot encoded vectors

In [None]:
def clean_text(string):
  punctuation = ".,;:!?“\'_-"
  words = string.lower().replace('\n', ' ').split()
  clean_words = [word.strip(punctuation) for word in words]
  return clean_words

In [None]:
# Clean and split our chosen sentence
words_to_encode = clean_text(line)
len(words_to_encode)

11

In [None]:
# Now we apply the same function to the corpus, define the set of unique words,
# and build a lookup dictionary which we can use in the one-hot encoding
unique_words = sorted(set(clean_text(text)))
len(unique_words)

8222

In [None]:
dictionary = {word: i for i, word in enumerate(unique_words)}
len(dictionary), dictionary['impossible']

(8222, 3820)

In [None]:
# We have a dictionary with 8,222 unique words. Now we can define a tensor for
# the one-hot encoding of our sentence. Accordingly the dimensions will be:
# - 11: the length of our sentence
# - 8222: the length of the dictionary  

In [None]:
word_t = torch.zeros(len(words_to_encode), len(dictionary))
word_t.shape

torch.Size([11, 8222])

In [None]:
# And finally we can do the one-hot encoding
for i, word in enumerate(words_to_encode):
  word_index = dictionary[word]
  word_t[i][word_index] = 1
  print(f'#{i} {word_index}: {word}')

#0 3820: impossible
#1 4893: mr
#2 879: bennet
#3 3820: impossible
#4 8007: when
#5 3731: i
#6 428: am
#7 5045: not
#8 228: acquainted
#9 8082: with
#10 3609: him
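
The word indices printed above already carry all the information; the one-hot tensor is just an expanded form of them. As a sketch on a made-up toy corpus (not the Jane Austen text), the indices can be collected in a tensor first and expanded with `torch.nn.functional.one_hot` afterwards.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy corpus and sentence
corpus = "it is a truth universally acknowledged that a single man"
sentence = "a single truth"

# Sorted unique words give a stable word -> index lookup
vocab = {word: i for i, word in enumerate(sorted(set(corpus.split())))}
word_indices = torch.tensor([vocab[w] for w in sentence.split()])

# One row per word in the sentence, one column per word in the vocabulary
word_t = F.one_hot(word_indices, num_classes=len(vocab)).float()
print(word_indices, word_t.shape)
```

Looking up `vocab[w]` raises a `KeyError` for a word that is not in the corpus, so a real pipeline would need some policy for unknown words.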
