<a href="https://colab.research.google.com/github/rahiakela/deep-learning-with-pytorch/blob/4-real-world-data-representation-with-tensors/real_world_data_representation_with_tensors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Real-world data representation with tensors

Tensors are the building blocks for data in PyTorch. Neural networks take tensors as input and produce tensors as output. In fact, all operations within a neural network and during optimization are operations between tensors, and all parameters (such as weights and biases) in a neural network are tensors. Having a good sense of how to perform operations on tensors and index them effectively is central to using tools like PyTorch successfully.


## Setup

In [1]:
import numpy as np
import torch
import csv

In [2]:
torch.set_printoptions(edgeitems=2, precision=2)

## Tabular data

The simplest form of data you’ll encounter in your machine learning job is sitting in a spreadsheet, in a CSV (comma-separated values) file, or in a database. Whatever the medium, this data is a table containing one row per sample (or record), in which columns contain one piece of information about the sample.

Columns may contain numerical values, such as temperatures at specific locations, or labels, such as a string expressing an attribute of the sample (like "blue"). **Therefore, tabular data typically isn’t homogeneous; different columns don’t have the same type.** You might have a column showing the weight of apples and another encoding their color in a label.

PyTorch tensors, on the other hand, are homogeneous. Other data science packages, such as Pandas, have the concept of the data frame, an object representing a data set with named, heterogeneous columns. By contrast, information in PyTorch is encoded as numbers, typically floating-point (though integer types are supported as well).

Numeric encoding is deliberate, because neural networks are mathematical entities that take real numbers as inputs and produce real numbers as output through successive application of matrix multiplications and nonlinear functions.

**Your first job as a deep learning practitioner, therefore, is to encode heterogeneous, real-world data in a tensor of floating-point numbers, ready for consumption by a neural network.**
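As a minimal sketch of what such an encoding can look like, the snippet below turns a tiny table with one numeric column and one string column into a single floating-point tensor. The apple records, the `color_to_index` mapping, and all variable names are made up for illustration and are not part of the wine data used later.

```python
import torch

# Hypothetical records: (weight in grams, color label).
apples = [(150.0, "red"), (172.5, "green"), (160.0, "red")]

# Assumed mapping from label to integer; any consistent assignment would do.
color_to_index = {"green": 0, "red": 1}

# Replace each string label with its integer code, stored as a float.
rows = [[weight, float(color_to_index[color])] for weight, color in apples]
apple_tensor = torch.tensor(rows, dtype=torch.float32)

print(apple_tensor.shape)  # torch.Size([3, 2])
print(apple_tensor.dtype)  # torch.float32
```

Whether an integer code like this is a sensible input for a model depends on the variable; the one-hot encoding discussed later in this notebook is the usual alternative for labels with no natural order.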

We start with something fun: wine. The Wine Quality data set is a freely available table containing chemical characterizations of samples of vinho verde (a wine from northern Portugal) together with a sensory quality score. You can download the data set for white wines at https://archive.ics.uci.edu/ml/machine-learning-databases/winequality/winequality-white.csv.

<img src='https://github.com/rahiakela/img-repo/blob/master/deep-learning-with-pytorch/wine-datasets.png?raw=1' width='800'/>

You hope to find a relationship between one of the chemical
columns in your data and the quality column. Here, you’re expecting to see quality increase as sulfur decreases.

Before you can get to that observation, however, you need to be able to examine
the data in a more usable way than opening the file in a text editor. We’ll show you how to load the data by using Python and then turn it into a PyTorch tensor.

### Loading dataset

Python offers several options for loading a CSV file quickly. Three popular options are:

* The csv module that ships with Python
* NumPy
* Pandas

The third option is the most time- and memory-efficient, but we’ll avoid introducing an additional library into your learning trajectory merely to load a file. Because NumPy is already imported, we’ll use it here.
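For comparison, here is a rough sketch of the first option, the csv module. To keep it self-contained it reads from an in-memory string rather than a file; the three-column sample and its values are made up and only mimic the semicolon-delimited layout with a header row.

```python
import csv
import io

import numpy as np

# Made-up stand-in for a semicolon-delimited file with a header row.
sample = io.StringIO("acidity;sugar;quality\n7.0;20.7;6\n6.3;1.6;6\n")

reader = csv.reader(sample, delimiter=";")
header = next(reader)                               # first row holds the column names
rows = [[float(v) for v in row] for row in reader]  # remaining rows hold numbers
table = np.array(rows, dtype=np.float32)

print(header)       # ['acidity', 'sugar', 'quality']
print(table.shape)  # (2, 3)
```

With a real file, `io.StringIO(...)` would be replaced by an opened file object.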

In [3]:
! wget https://raw.githubusercontent.com/deep-learning-with-pytorch/dlwpt-code/master/data/p1ch4/tabular-wine/winequality-white.csv

--2020-07-06 04:59:02--  https://raw.githubusercontent.com/deep-learning-with-pytorch/dlwpt-code/master/data/p1ch4/tabular-wine/winequality-white.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 264426 (258K) [text/plain]
Saving to: ‘winequality-white.csv’


2020-07-06 04:59:02 (4.93 MB/s) - ‘winequality-white.csv’ saved [264426/264426]



In [4]:
wine_path = 'winequality-white.csv'
wineq_numpy = np.loadtxt(wine_path, dtype=np.float32, delimiter=';', skiprows=1)
wineq_numpy

array([[ 7.  ,  0.27,  0.36, ...,  0.45,  8.8 ,  6.  ],
       [ 6.3 ,  0.3 ,  0.34, ...,  0.49,  9.5 ,  6.  ],
       [ 8.1 ,  0.28,  0.4 , ...,  0.44, 10.1 ,  6.  ],
       ...,
       [ 6.5 ,  0.24,  0.19, ...,  0.46,  9.4 ,  6.  ],
       [ 5.5 ,  0.29,  0.3 , ...,  0.38, 12.8 ,  7.  ],
       [ 6.  ,  0.21,  0.38, ...,  0.32, 11.8 ,  6.  ]], dtype=float32)

Next, check that all the data has been read.

In [5]:
# Read only the header row; the with block closes the file afterward.
with open(wine_path, newline='') as f:
  col_list = next(csv.reader(f, delimiter=';'))
wineq_numpy.shape, col_list

((4898, 12),
 ['fixed acidity',
  'volatile acidity',
  'citric acid',
  'residual sugar',
  'chlorides',
  'free sulfur dioxide',
  'total sulfur dioxide',
  'density',
  'pH',
  'sulphates',
  'alcohol',
  'quality'])

And now proceed to convert the NumPy array to a PyTorch tensor.

In [6]:
wineq = torch.from_numpy(wineq_numpy)
wineq.shape, wineq.type()

(torch.Size([4898, 12]), 'torch.FloatTensor')

At this point, you have a torch.FloatTensor containing all columns, including the last, which refers to the quality score.

### Preparing training and testing set

You could treat the score as a continuous variable, keep it as a real number, and perform a regression task, or treat it as a label and try to predict that label from the chemical analysis in a classification task. 

In both methods, you typically remove the score from the tensor of input data and keep it in a separate tensor, so that you can use the score as the ground truth without it being input to your model:

In [7]:
# Select all rows and all columns except the last one
data = wineq[:, :-1]
data, data.shape

(tensor([[ 7.00,  0.27,  ...,  0.45,  8.80],
         [ 6.30,  0.30,  ...,  0.49,  9.50],
         ...,
         [ 5.50,  0.29,  ...,  0.38, 12.80],
         [ 6.00,  0.21,  ...,  0.32, 11.80]]), torch.Size([4898, 11]))

In [8]:
# Select all rows and the last column
target = wineq[:, -1]
target, target.shape

(tensor([6., 6.,  ..., 7., 6.]), torch.Size([4898]))

If you want to transform the target tensor into a tensor of labels, you have two options, depending on the strategy or how you want to use the categorical data.

1. One option is to treat the labels as an integer vector of scores
2. The other approach is to build a one-hot encoding of the scores

In [9]:
target = wineq[:, -1].long()
target

tensor([6, 6,  ..., 7, 6])

If targets were string labels (such as wine color), assigning an integer number to each string would allow you to follow the same approach.
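A small sketch of that idea follows. The list of color strings and the names `label_to_index` and `color_target` are hypothetical; the wine data set used here has no such column.

```python
import torch

# Hypothetical string labels, one per sample.
colors = ["white", "red", "white", "rose", "red"]

# Sort the unique labels so the mapping is reproducible across runs.
label_to_index = {label: i for i, label in enumerate(sorted(set(colors)))}
color_target = torch.tensor([label_to_index[c] for c in colors], dtype=torch.long)

print(label_to_index)         # {'red': 0, 'rose': 1, 'white': 2}
print(color_target.tolist())  # [2, 0, 2, 1, 0]
```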

The other approach is to build a one-hot encoding of the scores—that is, encode
each of the ten scores in a vector of ten elements, with all elements set to zero but one, at a different index for each score. 

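This way, a score of 1 could be mapped to the vector (1,0,0,0,0,0,0,0,0,0), a score of 5 to (0,0,0,0,1,0,0,0,0,0), and so on. (The cell below uses the score itself as a zero-based column index, so a score of 6 sets the element at index 6.)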

**One-hot encoding is appropriate for quantitative scores when fractional values between integer scores (such as 2.4) make no sense for the application (when score is either this or that).**

You can achieve one-hot encoding by using the scatter_ method, which fills the
tensor with values from a source tensor along the indices provided as arguments.

In [10]:
target_onehot = torch.zeros(target.shape[0], 10)
target_onehot.scatter_(1, target.unsqueeze(1), 1.0)

tensor([[0., 0.,  ..., 0., 0.],
        [0., 0.,  ..., 0., 0.],
        ...,
        [0., 0.,  ..., 0., 0.],
        [0., 0.,  ..., 0., 0.]])

First, notice that the method name, scatter_, ends with an underscore. This convention in PyTorch indicates that the method doesn’t return a new tensor but instead modifies the tensor in place. The arguments for scatter_ are:

* The dimension along which the following two arguments are specified
* A column tensor indicating the indices of the elements to scatter
* A tensor containing the elements to scatter or a single scalar to scatter (1, in this case)

The second argument of scatter_, the index tensor, is required to have the same
number of dimensions as the tensor you scatter into. Because target_onehot has two dimensions (4898x10), you need to add an extra dummy dimension to target by
using unsqueeze:

In [11]:
target_unsqueezed = target.unsqueeze(1)
target_unsqueezed

tensor([[6],
        [6],
        ...,
        [7],
        [6]])

The call to unsqueeze adds a singleton dimension, from a 1D tensor of 4898 elements to a 2D tensor of size (4898x1), without changing its contents.
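To see the effect of scatter_ on something small enough to inspect, here is a sketch with four made-up scores. It also compares the result against `torch.nn.functional.one_hot`, which is not used elsewhere in this notebook and is shown only as a cross-check.

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([6, 3, 9, 0])           # made-up integer labels
onehot = torch.zeros(scores.shape[0], 10)
onehot.scatter_(1, scores.unsqueeze(1), 1.0)  # row i gets a 1 at column scores[i]

# one_hot builds the same encoding as an integer tensor; cast it to float to compare.
reference = F.one_hot(scores, num_classes=10).float()

print(torch.equal(onehot, reference))  # True
print(onehot.argmax(dim=1).tolist())   # [6, 3, 9, 0]
```

Taking the argmax along dimension 1 recovers the original scores, which is a convenient way to check that the ones landed in the intended columns.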

**PyTorch allows you to use class indices directly as targets while training neural networks. If you want to use the score as a categorical input to the network, however, you’d have to transform it to a one-hot encoded tensor.**
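One place where class indices are accepted directly is `torch.nn.functional.cross_entropy`. The sketch below uses made-up, all-zero scores for ten classes, so every class gets the same probability and the loss should come out close to ln(10).

```python
import math

import torch
import torch.nn.functional as F

logits = torch.zeros(3, 10)        # made-up, uninformative scores for 10 classes
targets = torch.tensor([6, 6, 7])  # class indices (dtype torch.long), no one-hot needed

loss = F.cross_entropy(logits, targets)
print(abs(loss.item() - math.log(10)) < 1e-5)  # True
```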

### Data normalization

Now go back to your data tensor, containing the 11 variables associated with the
chemical analysis. You can use the functions in the PyTorch Tensor API to manipulate your data in tensor form. 

First, obtain the mean and variance of each column:

In [12]:
data_mean = torch.mean(data, dim=0)  # mean
data_mean

tensor([6.85e+00, 2.78e-01, 3.34e-01, 6.39e+00, 4.58e-02, 3.53e+01, 1.38e+02,
        9.94e-01, 3.19e+00, 4.90e-01, 1.05e+01])

In [13]:
data_var = torch.var(data, dim=0)  # variance
data_var

tensor([7.12e-01, 1.02e-02, 1.46e-02, 2.57e+01, 4.77e-04, 2.89e+02, 1.81e+03,
        8.95e-06, 2.28e-02, 1.30e-02, 1.51e+00])

In this case, dim=0 indicates that the reduction is performed along dimension 0. 

At this point, you can normalize the data by subtracting the mean and dividing by the standard deviation, which helps with the learning process.

In [14]:
data_normalized = (data - data_mean) / torch.sqrt(data_var)
data_normalized

tensor([[ 1.72e-01, -8.18e-02,  ..., -3.49e-01, -1.39e+00],
        [-6.57e-01,  2.16e-01,  ...,  1.35e-03, -8.24e-01],
        ...,
        [-1.61e+00,  1.17e-01,  ..., -9.63e-01,  1.86e+00],
        [-1.01e+00, -6.77e-01,  ..., -1.49e+00,  1.04e+00]])
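As a quick sanity check on this normalization, the sketch below repeats the same steps on random made-up data and confirms that each column ends up with a mean close to 0 and a standard deviation close to 1. It relies on the fact that `torch.var` and `torch.std` both apply Bessel's correction (dividing by N - 1) by default.

```python
import torch

torch.manual_seed(0)
x = torch.randn(100, 3) * 5.0 + 2.0   # made-up data: 100 samples, 3 columns

mean = torch.mean(x, dim=0)
var = torch.var(x, dim=0)             # divides by N - 1 by default
x_normalized = (x - mean) / torch.sqrt(var)

# torch.std uses the same correction, so it matches sqrt(var).
print(torch.allclose(torch.std(x, dim=0), torch.sqrt(var)))                 # True
print(torch.allclose(x_normalized.mean(dim=0), torch.zeros(3), atol=1e-5))  # True
print(torch.allclose(x_normalized.std(dim=0), torch.ones(3), atol=1e-5))    # True
```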

Next, look at the data with an eye to finding an easy way to tell good and bad wines apart at a glance. 

First, use the torch.le function to determine which rows in target
correspond to a score less than or equal to 3.

In [15]:
bad_indexes = torch.le(target, 3)
bad_indexes.shape, bad_indexes.dtype, bad_indexes.sum()

(torch.Size([4898]), torch.bool, tensor(20))

Note that only 20 of the bad_indexes entries are True!

**By leveraging a feature in PyTorch called advanced indexing, you can use a Boolean tensor to index the data tensor.**

This tensor essentially filters data to be only the items (or rows) that correspond to True in the indexing tensor. The bad_indexes tensor has the same shape as target, with a value of False or True depending on the outcome of the comparison between your threshold and each element in the original target tensor:


In [16]:
bad_data = data[bad_indexes]
bad_data.shape

torch.Size([20, 11])

Note that the new bad_data tensor has 20 rows, the same as the number of True entries in the bad_indexes tensor. It retains all 11 columns.
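The same row-filtering behavior is easier to follow on a tensor small enough to read in full. The values and labels in this sketch are made up.

```python
import torch

values = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # made-up 3x2 data
labels = torch.tensor([2, 7, 3])                              # made-up scores

mask = torch.le(labels, 3)   # True where the score is <= 3
selected = values[mask]      # keeps rows 0 and 2, drops row 1

print(mask.tolist())         # [True, False, True]
print(selected.tolist())     # [[1.0, 2.0], [5.0, 6.0]]
```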

Now you can start to get information about wines grouped into good, middling,
and bad categories. Take the .mean() of each column:

In [17]:
bad_data = data[torch.le(target, 3)]
mid_data = data[torch.gt(target, 3) & torch.lt(target, 7)]
good_data = data[torch.ge(target, 7)]

bad_mean = torch.mean(bad_data, dim=0)
mid_mean = torch.mean(mid_data, dim=0)
good_mean = torch.mean(good_data, dim=0)

for i, args in enumerate(zip(col_list, bad_mean, mid_mean, good_mean)):
  print('{:2} {:20} {:6.2f} {:6.2f} {:6.2f}'.format(i, *args))

 0 fixed acidity          7.60   6.89   6.73
 1 volatile acidity       0.33   0.28   0.27
 2 citric acid            0.34   0.34   0.33
 3 residual sugar         6.39   6.71   5.26
 4 chlorides              0.05   0.05   0.04
 5 free sulfur dioxide   53.33  35.42  34.55
 6 total sulfur dioxide 170.60 141.83 125.25
 7 density                0.99   0.99   0.99
 8 pH                     3.19   3.18   3.22
 9 sulphates              0.47   0.49   0.50
10 alcohol               10.34  10.26  11.42


It looks as though you’re on to something here. At first glance, the bad wines seem to have higher total sulfur dioxide, among other differences. You could use a threshold on total sulfur dioxide as a crude criterion for discriminating good wines from bad ones. 

Now get the indexes in which the total sulfur dioxide column is below the midpoint you calculated earlier, like so:

In [18]:
total_sulfur_threshold = 141.83
total_sulfur_data = data[:, 6]
predicted_indexes = torch.lt(total_sulfur_data, total_sulfur_threshold)
predicted_indexes.shape, predicted_indexes.dtype, predicted_indexes.sum()

(torch.Size([4898]), torch.bool, tensor(2727))

Your threshold implies that slightly more than half of the wines are going to be high-quality.

Next, you need to get the indexes of the good wines:

In [19]:
actual_indexes = torch.gt(target, 5)
actual_indexes.shape, actual_indexes.dtype, actual_indexes.sum()

(torch.Size([4898]), torch.bool, tensor(3258))

Because you have about 500 more good wines than your threshold predicted, you
already have hard evidence that the threshold isn’t perfect.

Now you need to see how well your predictions line up with the actual rankings.
Perform a logical and between your prediction indexes and the good indexes
(remembering that each index is a tensor of Boolean values), and use that intersection of wines in agreement to determine how well you did:

In [20]:
n_matches = torch.sum(actual_indexes & predicted_indexes).item()
n_predicted = torch.sum(predicted_indexes).item()
n_actual = torch.sum(actual_indexes).item()

n_matches, n_matches / n_predicted, n_matches / n_actual

(2018, 0.74000733406674, 0.6193984039287906)

You got around 2,000 wines right! Because you predicted about 2,700 wines to be good, there is a 74 percent chance that a wine you predict to be high-quality actually is. 

Unfortunately, there are about 3,260 good wines, and you identified only 62 percent of them. Well, we guess you got what you signed up for; that result is barely better than random.
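In the usual terminology, the two ratios computed above are the precision and the recall of the threshold rule. The sketch below recomputes them from the counts printed in the cells above and adds the share of good wines in the whole data set as a point of comparison; the variable names `precision`, `recall`, and `base_rate` are introduced here for illustration.

```python
# Counts copied from the outputs above.
n_matches, n_predicted, n_actual, n_total = 2018, 2727, 3258, 4898

precision = n_matches / n_predicted  # share of predicted-good wines that are good
recall = n_matches / n_actual        # share of good wines that the rule found
base_rate = n_actual / n_total       # share of good wines overall

print(round(precision, 2), round(recall, 2), round(base_rate, 2))  # 0.74 0.62 0.67
```

Labeling every wine as good would already be right about 67 percent of the time, which puts the 74 percent figure into perspective.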

**This example is naïve, of course. You know for sure that multiple variables contribute to wine quality and that the relationships between the values of these variables and the outcome (which could be the actual score rather than a binarized version of it) are likely to be more complicated than a simple threshold on a single value.**

**Indeed, a simple neural network would overcome all these limitations, as would a lot of other basic machine learning methods.**

## Time series

In the preceding section, we covered how to represent data organized in a flat table. As we noted, every row in the table was independent from the others; their order did not matter. Equivalently, no column encoded information on what rows came before and what rows came after.

Going back to the wine data set, you could have had a Year column that allowed you to look at how wine quality evolved year over year. (Unfortunately, we don’t have such data at hand, but we’re working hard on collecting the data samples manually, bottle by bottle.)

In the meantime, we’ll switch to another interesting data set: data from the Capital bike-share system in Washington, D.C., reporting the hourly count of rental bikes between 2011 and 2012, together with the corresponding weather and seasonal information.

**The goal is to take a flat 2D data set and transform it into a 3D one.**

<img src='https://github.com/rahiakela/img-repo/blob/master/deep-learning-with-pytorch/time-series.png?raw=1' width='800'/>

> Transforming a 1D multichannel data set into a 2D multichannel data set by separating the date and hour of each sample into separate axes.

In the source data, each row is a separate hour of data. We want to change the row-per-hour organization so that you have one axis that increases at a rate of one day per index increment and another axis that represents hour of day (independent of the date). The third axis is different columns of data (weather, temperature, and so on).

### Load Dataset

In [21]:
! wget https://raw.githubusercontent.com/deep-learning-with-pytorch/dlwpt-code/master/data/p1ch4/bike-sharing-dataset/hour-fixed.csv

--2020-07-06 04:59:10--  https://raw.githubusercontent.com/deep-learning-with-pytorch/dlwpt-code/master/data/p1ch4/bike-sharing-dataset/hour-fixed.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1148001 (1.1M) [text/plain]
Saving to: ‘hour-fixed.csv’


2020-07-06 04:59:10 (10.7 MB/s) - ‘hour-fixed.csv’ saved [1148001/1148001]



In [22]:
# load csv file and convert date strings to numbers corresponding to the day of the month in column 1.
bikes_numpy = np.loadtxt('hour-fixed.csv', dtype=np.float32, delimiter=',', skiprows=1, converters={1: lambda x: float(x[8:10])})
bikes = torch.from_numpy(bikes_numpy)
bikes

tensor([[1.00e+00, 1.00e+00,  ..., 1.30e+01, 1.60e+01],
        [2.00e+00, 1.00e+00,  ..., 3.20e+01, 4.00e+01],
        ...,
        [1.74e+04, 3.10e+01,  ..., 4.80e+01, 6.10e+01],
        [1.74e+04, 3.10e+01,  ..., 3.70e+01, 4.90e+01]])

For every hour, the data set reports the following variables:

```python
instant         # index of record
day             # day of month
season          # season (1: spring, 2: summer, 3: fall, 4: winter)
yr              # year (0: 2011, 1: 2012)
mnth            # month (1 to 12)
hr              # hour (0 to 23)
holiday         # holiday status
weekday         # day of the week
workingday      # working day status
weathersit      # weather situation
                # (1: clear, 2:mist, 3: light rain/snow, 4: heavy rain/snow)
temp            # temperature in C
atemp           # perceived temperature in C
hum             # humidity
windspeed       # windspeed
casual          # number of casual users
registered      # number of registered users
cnt             # count of rental bikes
```
In a time-series data set such as this one, rows represent successive time points: a dimension along which they’re ordered. Sure, you could treat each row as independent and try to predict the number of circulating bikes based on, say, a particular time of day regardless of what happened earlier.

This existence of an ordering, however, gives you the opportunity to exploit causal relationships across time. You can predict bike rides at one time based on the fact that it was raining at an earlier time, for example. For the time being, you’re going to focus on learning how to turn your bike-sharing data set into something that your neural network can ingest in fixed-size chunks.

Now go back to your bike-sharing data set. The first column is the index (the
global ordering of the data); the second is the date; the sixth is the time of day. You have everything you need to create a data set of daily sequences of ride counts and other exogenous variables. Your data set is already sorted, but if it weren’t, you could use torch.sort on it to order it appropriately.
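If the rows were not already in order, one way to sort them would be to compute the ordering of the index column and use it to reorder all rows. The sketch below does this with `torch.argsort` on a made-up three-row tensor whose first column plays the role of the index.

```python
import torch

# Made-up rows: the first column is the record index, deliberately out of order.
rows = torch.tensor([[3.0, 30.0], [1.0, 10.0], [2.0, 20.0]])

order = torch.argsort(rows[:, 0])  # positions that would sort the first column
sorted_rows = rows[order]          # reorder whole rows with that permutation

print(sorted_rows[:, 0].tolist())  # [1.0, 2.0, 3.0]
print(sorted_rows[:, 1].tolist())  # [10.0, 20.0, 30.0]
```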

All you have to do to obtain your daily hours data set is view the same tensor in batches of 24 hours. 

Take a look at the shape and strides of your bikes tensor:

In [23]:
bikes.shape, bikes.stride()

(torch.Size([17520, 17]), (17, 1))

That’s 17,520 hours, 17 columns. Now reshape the data to have three axes (day, hour, and then your 17 columns):

In [24]:
daily_bikes = bikes.view(-1, 24, bikes.shape[1])
daily_bikes.shape, daily_bikes.stride()

(torch.Size([730, 24, 17]), (408, 17, 1))

What happened here? First, the bikes.shape[1] is 17, which is the number of columns in the bikes tensor. 

The real crux of the code is the call to view: it changes the way that the tensor looks at the same data contained in storage.

**Calling view on a tensor returns a new tensor that changes the number of dimensions and the striding information without changing the storage.** As a result, you can rearrange your tensor at zero cost because no data is copied at all. Your call to view requires you to provide the new shape for the returned tensor. **Use the -1 as a placeholder for "however many indexes are left, given the other dimensions and the original number of elements."**

**Remember that Storage is a contiguous, linear container for numbers (floating-point, in this case). Your bikes tensor has rows stored one after the other in corresponding storage, as confirmed by the output from the call to bikes.stride() earlier.**
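The claim that view copies no data can be checked on a small made-up tensor: a write through the reshaped tensor should be visible in the original, and both should report the same starting address in memory.

```python
import torch

flat = torch.arange(12.0).reshape(6, 2)   # made-up: 6 "hours", 2 columns
chunks = flat.view(-1, 3, flat.shape[1])  # -1 is inferred as 2 "days" of 3 hours

print(chunks.shape, chunks.stride())      # torch.Size([2, 3, 2]) (6, 2, 1)

# Both tensors look at the same memory, so a write through one shows in the other.
chunks[0, 0, 0] = 100.0
print(flat[0, 0].item())                     # 100.0
print(flat.data_ptr() == chunks.data_ptr())  # True
```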

The rightmost dimension is the number of columns in the original data set. In the middle dimension, you have time split into chunks of 24 sequential hours. 

In other words, you now have N sequences of L hours in a day for C channels. 

To get to your desired NxCxL ordering, you need to transpose the tensor:

In [25]:
daily_bikes = daily_bikes.transpose(1, 2)
daily_bikes.shape, daily_bikes.stride()

(torch.Size([730, 17, 24]), (408, 1, 17))
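The strides in this output show that transpose, like view, only changes how the storage is indexed. A sketch on a small made-up tensor makes the consequence visible: the transposed tensor is no longer contiguous, and calling `contiguous()` produces a copy laid out in the new order.

```python
import torch

x = torch.arange(24.0).view(2, 3, 4)  # made-up sizes: N=2, L=3, C=4
xt = x.transpose(1, 2)                # now N x C x L, same storage

print(xt.shape, xt.stride())              # torch.Size([2, 4, 3]) (12, 1, 4)
print(xt.is_contiguous())                 # False
print(torch.equal(xt[0, 1], x[0, :, 1]))  # True: same values, different indexing

xc = xt.contiguous()                  # copies the data into row-major order
print(xc.stride())                    # (12, 3, 1)
```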

The weather-situation variable is ordinal: it has 4 levels, from 1 for the best weather to 4 for the worst. You could treat this variable as categorical, with levels interpreted as labels, or as continuous. 

If you choose categorical, you turn the variable into a one-hot encoded vector and concatenate the columns with the data set. To make rendering your data easier, limit yourself to the first day for now. 

First, initialize a zero-filled matrix with a number of rows equal to the number of hours in the day and a number of columns equal to the number of weather levels:

In [26]:
first_day = bikes[:24].long()
weather_onehot = torch.zeros(first_day.shape[0], 4)
first_day[:, 9]

tensor([1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 2, 2, 2, 2])

Then scatter ones into our matrix according to the corresponding level at each row. 

Remember the use of unsqueeze to add a singleton dimension earlier:

In [27]:
# decreasing the values by 1 because the weather situation ranges from 1 to 4, whereas indices are 0-based.
weather_onehot.scatter_(dim=1, index=first_day[:, 9].unsqueeze(1) - 1, value=1.0)

tensor([[1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.]])

The day started with weather 1 and ended with 2, so that seems right.

Last, concatenate your matrix to your original data set, using the cat function.

In [28]:
torch.cat((bikes[:24], weather_onehot), 1)[:1]

tensor([[ 1.00,  1.00,  1.00,  0.00,  1.00,  0.00,  0.00,  6.00,  0.00,  1.00,
          0.24,  0.29,  0.81,  0.00,  3.00, 13.00, 16.00,  1.00,  0.00,  0.00,
          0.00]])

For cat to succeed, the tensors must have the same size along the other dimensions (the row dimension, in this case).

Note that your new last four columns are 1, 0, 0, 0—exactly what you’d expect
with a weather value of 1.
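The size requirement for cat can be illustrated with zero- and one-filled placeholder tensors of the same shapes as above. The second call is expected to fail because the row counts differ; the 23-row tensor is made up to trigger that failure.

```python
import torch

a = torch.zeros(24, 17)
b = torch.ones(24, 4)
joined = torch.cat((a, b), dim=1)  # rows match, columns are appended
print(joined.shape)                # torch.Size([24, 21])

# Mismatched row counts cannot be joined along dim=1.
try:
    torch.cat((a, torch.ones(23, 4)), dim=1)
    failed = False
except RuntimeError:
    failed = True
print(failed)                      # True
```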

You could have done the same thing with the reshaped daily_bikes tensor.
Remember that it’s shaped (N, C, L), where L = 24. 

First, create the zero tensor, with the same N and L but with the number of additional columns (4) as C:

In [29]:
daily_weather_onehot = torch.zeros(daily_bikes.shape[0], 4, daily_bikes.shape[2])
daily_weather_onehot.shape

torch.Size([730, 4, 24])

Then scatter the one-hot encoding into the tensor in the C dimension. Because the operation is performed in place, only the content of the tensor changes:

In [30]:
daily_weather_onehot.scatter_(1, daily_bikes[:, 9, :].long().unsqueeze(1) - 1, 1.0)
daily_weather_onehot.shape

torch.Size([730, 4, 24])

Concatenate along the C dimension:

In [31]:
daily_bikes = torch.cat((daily_bikes, daily_weather_onehot), dim=1)

We mentioned earlier that this method isn’t the only way to treat the weather-situation variable. Indeed, its labels have an ordinal relationship, so you could pretend that they’re special values of a continuous variable. You might transform the variable so that it runs from 0.0 to 1.0:

In [32]:
daily_bikes[:, 9, :] = (daily_bikes[:, 9, :] - 1.0) / 3.0

As noted earlier, rescaling variables to the [0.0, 1.0] interval or the [-1.0, 1.0] interval is something that you’ll want to do for all quantitative variables,
such as temperature (column 10 in your data set). You’ll see why later; for now, we’ll say that it’s beneficial to the training process.

You have multiple possibilities for rescaling variables. You can map their range to [0.0, 1.0]

In [33]:
temp = daily_bikes[:, 10, :]
temp_min = torch.min(temp)
temp_max = torch.max(temp)
daily_bikes[:, 10, :] = (daily_bikes[:, 10, :] - temp_min) / (temp_max - temp_min)
daily_bikes

tensor([[[1.00e+00, 2.00e+00,  ..., 2.30e+01, 2.40e+01],
         [1.00e+00, 1.00e+00,  ..., 1.00e+00, 1.00e+00],
         ...,
         [0.00e+00, 0.00e+00,  ..., 0.00e+00, 0.00e+00],
         [0.00e+00, 0.00e+00,  ..., 0.00e+00, 0.00e+00]],

        [[2.50e+01, 2.60e+01,  ..., 4.60e+01, 4.70e+01],
         [2.00e+00, 2.00e+00,  ..., 2.00e+00, 2.00e+00],
         ...,
         [0.00e+00, 0.00e+00,  ..., 0.00e+00, 0.00e+00],
         [0.00e+00, 0.00e+00,  ..., 0.00e+00, 0.00e+00]],

        ...,

        [[1.73e+04, 1.73e+04,  ..., 1.74e+04, 1.74e+04],
         [3.00e+01, 3.00e+01,  ..., 3.00e+01, 3.00e+01],
         ...,
         [0.00e+00, 0.00e+00,  ..., 0.00e+00, 0.00e+00],
         [0.00e+00, 0.00e+00,  ..., 0.00e+00, 0.00e+00]],

        [[1.74e+04, 1.74e+04,  ..., 1.74e+04, 1.74e+04],
         [3.10e+01, 3.10e+01,  ..., 3.10e+01, 3.10e+01],
         ...,
         [0.00e+00, 0.00e+00,  ..., 0.00e+00, 0.00e+00],
         [0.00e+00, 0.00e+00,  ..., 0.00e+00, 0.00e+00]]])

or subtract the mean and divide by the standard deviation:

In [34]:
temp = daily_bikes[:, 10, :]
daily_bikes[:, 10, :] = (daily_bikes[:, 10, :] - torch.mean(temp)) / torch.std(temp)
daily_bikes

tensor([[[1.00e+00, 2.00e+00,  ..., 2.30e+01, 2.40e+01],
         [1.00e+00, 1.00e+00,  ..., 1.00e+00, 1.00e+00],
         ...,
         [0.00e+00, 0.00e+00,  ..., 0.00e+00, 0.00e+00],
         [0.00e+00, 0.00e+00,  ..., 0.00e+00, 0.00e+00]],

        [[2.50e+01, 2.60e+01,  ..., 4.60e+01, 4.70e+01],
         [2.00e+00, 2.00e+00,  ..., 2.00e+00, 2.00e+00],
         ...,
         [0.00e+00, 0.00e+00,  ..., 0.00e+00, 0.00e+00],
         [0.00e+00, 0.00e+00,  ..., 0.00e+00, 0.00e+00]],

        ...,

        [[1.73e+04, 1.73e+04,  ..., 1.74e+04, 1.74e+04],
         [3.00e+01, 3.00e+01,  ..., 3.00e+01, 3.00e+01],
         ...,
         [0.00e+00, 0.00e+00,  ..., 0.00e+00, 0.00e+00],
         [0.00e+00, 0.00e+00,  ..., 0.00e+00, 0.00e+00]],

        [[1.74e+04, 1.74e+04,  ..., 1.74e+04, 1.74e+04],
         [3.10e+01, 3.10e+01,  ..., 3.10e+01, 3.10e+01],
         ...,
         [0.00e+00, 0.00e+00,  ..., 0.00e+00, 0.00e+00],
         [0.00e+00, 0.00e+00,  ..., 0.00e+00, 0.00e+00]]])

In this latter case, the variable has zero mean and unit standard deviation. If the variable were drawn from a Gaussian distribution, 68 percent of the samples would sit in the [-1.0, 1.0] interval.
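Both rescaling options can be compared side by side on a handful of made-up temperature values, which makes the resulting ranges easy to read off.

```python
import torch

temp = torch.tensor([0.0, 5.0, 10.0, 15.0, 20.0])  # made-up temperatures

# Option 1: map the observed range to [0.0, 1.0].
minmax = (temp - temp.min()) / (temp.max() - temp.min())

# Option 2: subtract the mean and divide by the standard deviation.
standardized = (temp - temp.mean()) / temp.std()

print(minmax.tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(torch.allclose(standardized.mean(), torch.tensor(0.0), atol=1e-6))  # True
print(torch.allclose(standardized.std(), torch.tensor(1.0), atol=1e-6))   # True
```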

Great—you’ve built another nice data set that you’ll get to use later. For now, it’s important only that you got an idea of how a time series is laid out and how you can wrangle the data into a form that a network will digest.

**Other kinds of data look like a time series, in that strict ordering exists. The top two in that category are text and audio.**

## Text

Deep learning has taken the field of natural language processing (NLP) by storm, particularly by using models that repeatedly consume a combination of new input and previous model output. These models are called recurrent neural networks, and they’ve been applied with great success to text categorization, text generation, and automated translation systems.

Networks operate on text at two levels: at character level, by processing one character at a time, and at word level, in which individual words are the finest-grained entities seen by the network. The technique you use to encode text information into tensor form is the same whether you operate at character level or at word level. This technique is nothing magic; you stumbled upon it earlier. It’s one-hot encoding.

Let's load Jane Austen’s Pride and Prejudice, a Project Gutenberg text. The copy used here is downloaded from the dlwpt-code repository on GitHub.

In [40]:
!wget https://raw.githubusercontent.com/deep-learning-with-pytorch/dlwpt-code/master/data/p1ch4/jane-austen/1342-0.txt

--2020-07-06 05:44:47--  https://raw.githubusercontent.com/deep-learning-with-pytorch/dlwpt-code/master/data/p1ch4/jane-austen/1342-0.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 711298 (695K) [text/plain]
Saving to: ‘1342-0.txt’


2020-07-06 05:44:47 (6.51 MB/s) - ‘1342-0.txt’ saved [711298/711298]



In [41]:
with open('1342-0.txt', encoding='utf8') as f:
  text = f.read()

At this point, you need to parse the characters in the text and provide a one-hot encoding for each of them. Each character will be represented by a vector of length equal to the number of characters in the encoding. This vector will contain all zeros except for a 1 at the index corresponding to the location of the character in the encoding.

First, split your text into a list of lines and pick an arbitrary line to focus on:

In [42]:
lines = text.split('\n')
line = lines[200]
line

'“Impossible, Mr. Bennet, impossible, when I am not acquainted with him'

Create a tensor that can hold the total number of one-hot encoded characters for the whole line:

In [43]:
# 128 hardcoded due to the limits of ASCII
letter_tensor = torch.zeros(len(line), 128)
letter_tensor.shape

torch.Size([70, 128])

Note that letter_tensor holds a one-hot encoded character per row. Now set a 1 on each row in the right position so that each row represents the right character. 

The index where the 1 has to be set corresponds to the index of the character in the encoding:

In [45]:
# The text uses directional double quotes, which aren’t valid ASCII, so screen them out here.
for i, letter in enumerate(line.lower().strip()):
  letter_index = ord(letter) if ord(letter) < 128 else 0
  letter_tensor[i][letter_index] = 1

You’ve one-hot encoded your sentence into a representation that a neural network
can digest. You could do word-level encoding the same way by establishing a vocabulary and one-hot encoding sentences, sequences of words, along the rows of your tensor. Because a vocabulary contains many words, this method produces wide encoded vectors that may not be practical.
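One way to check a character-level encoding is to decode it again: the position of the 1 in each row is the character code. The sketch below repeats the encoding loop on a short made-up line and recovers the lowercased text.

```python
import torch

line = "Mr. Bennet"                       # short made-up line, ASCII only
letter_tensor = torch.zeros(len(line), 128)
for i, letter in enumerate(line.lower()):
    letter_index = ord(letter) if ord(letter) < 128 else 0
    letter_tensor[i][letter_index] = 1

# Recover each character from the position of the 1 in its row.
decoded = "".join(chr(int(k)) for k in letter_tensor.argmax(dim=1))
print(decoded)  # mr. bennet
```

Characters outside ASCII would all decode to the character with code 0, because the loop maps them to index 0.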

Let's define clean_words, which takes text and returns it lowercase and stripped of punctuation.

In [46]:
def clean_words(input_str):
  punctuation = '.,;:"!?”“_-'
  word_list = input_str.lower().replace('\n', ' ').split()
  word_list = [word.strip(punctuation) for word in word_list]
  return word_list

In [47]:
words_in_line = clean_words(line)
line, words_in_line

('“Impossible, Mr. Bennet, impossible, when I am not acquainted with him',
 ['impossible',
  'mr',
  'bennet',
  'impossible',
  'when',
  'i',
  'am',
  'not',
  'acquainted',
  'with',
  'him'])

Next, build a mapping of words to indexes in your encoding:

In [48]:
word_list = sorted(set(clean_words(text)))
word2index_dict = {word: i for (i, word) in enumerate(word_list)}

len(word2index_dict), word2index_dict['impossible']

(7261, 3394)

Note that word2index_dict is now a dictionary with words as keys and an integer as value. You’ll use this dictionary to efficiently find the index of a word as you one-hot encode it.

Now focus on your sentence. Break it into words and one-hot encode it—that is,
populate a tensor with one one-hot encoded vector per word. Create a zero-filled tensor, and then set a 1 at the index of each word in the sentence.

In [50]:
word_tensor = torch.zeros(len(words_in_line), len(word2index_dict))
for i, word in enumerate(words_in_line):
  word_index = word2index_dict[word]
  word_tensor[i][word_index] = 1
  print('{:2} {:4} {}'.format(i, word_index, word))

print(word_tensor.shape)

 0 3394 impossible
 1 4305 mr
 2  813 bennet
 3 3394 impossible
 4 7078 when
 5 3315 i
 6  415 am
 7 4436 not
 8  239 acquainted
 9 7148 with
10 3215 him
torch.Size([11, 7261])


At this point, word_tensor represents one sentence of length 11 in an encoding space of size 7261, the number of words in your dictionary.
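The explicit loop can also be replaced by building an index tensor first and encoding it in one call. The sketch below does this with `torch.nn.functional.one_hot` on a tiny made-up vocabulary that stands in for the full dictionary; it is an alternative formulation, not the approach used in the cell above.

```python
import torch
import torch.nn.functional as F

# Tiny made-up vocabulary standing in for the full word-to-index dictionary.
vocab = {"am": 0, "i": 1, "impossible": 2, "not": 3}
sentence = ["i", "am", "not", "impossible"]

indices = torch.tensor([vocab[w] for w in sentence])
encoded = F.one_hot(indices, num_classes=len(vocab)).float()

print(encoded.shape)                   # torch.Size([4, 4])
print(encoded.argmax(dim=1).tolist())  # [1, 0, 3, 2]
```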

## Images