<a href="https://colab.research.google.com/github/rahiakela/deep-learning-with-pytorch/blob/4-real-world-data-representation-with-tensors/real_world_data_representation_with_tensors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Real-world data representation with tensors

Tensors are the building blocks for data in PyTorch. Neural networks take tensors as input and produce tensors as output. In fact, all operations within a neural network and during optimization are operations between tensors, and all parameters (such as weights and biases) in a neural network are tensors. Having a good sense of how to perform operations on tensors and index them effectively is central to using tools like PyTorch successfully.


## Setup

In [0]:
! pip install torch==1.5.0+cpu torchvision==0.6.0+cpu -f https://download.pytorch.org/whl/torch_stable.html

In [0]:
import numpy as np
import torch
import csv

In [0]:
torch.set_printoptions(edgeitems=2, precision=2)

## Tabular data

The simplest form of data you’ll encounter in your machine learning job is sitting in a spreadsheet, in a CSV (comma-separated values) file, or in a database. Whatever the medium, this data is a table containing one row per sample (or record), in which columns contain one piece of information about the sample.

Columns may contain numerical values, such as temperatures at specific locations, or labels, such as a string expressing an attribute of the sample (like "blue"). **Therefore, tabular data typically isn’t homogeneous; different columns don’t have the same type.** You might have a column showing the weight of apples and another encoding their color in a label.

PyTorch tensors, on the other hand, are homogeneous. Other data science packages, such as Pandas, have the concept of the data frame, an object representing a data set with named, heterogeneous columns. By contrast, information in PyTorch is encoded as a number, typically floating-point (though integer types are supported as well).
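
As a small illustration of that homogeneity (the values are made up, and the result assumes PyTorch's default floating-point type has not been changed), mixing integers and floats in one tensor promotes every element to a single dtype:

```python
import torch

ints = torch.tensor([1, 2, 3])       # all integers: stored as 64-bit integers
mixed = torch.tensor([1, 2.5, 3])    # one float promotes the whole tensor

print(ints.dtype, mixed.dtype)       # torch.int64 torch.float32
```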

Numeric encoding is deliberate, because neural networks are mathematical entities that take real numbers as inputs and produce real numbers as output through successive application of matrix multiplications and nonlinear functions.

**Your first job as a deep learning practitioner, therefore, is to encode heterogeneous, real-world data in a tensor of floating-point numbers, ready for consumption by a neural network.**

We start with something fun: wine. The Wine Quality data set is a freely available table containing chemical characterizations of samples of vinho verde (a wine from northern Portugal) together with a sensory quality score. You can download the data set for white wines at https://archive.ics.uci.edu/ml/machine-learning-databases/winequality/winequality-white.csv.

<img src='https://github.com/rahiakela/img-repo/blob/master/deep-learning-with-pytorch/wine-datasets.png?raw=1' width='800'/>

You hope to find a relationship between one of the chemical
columns in your data and the quality column. Here, you’re expecting to see quality increase as sulfur decreases.

Before you can get to that observation, however, you need to be able to examine
the data in a more usable way than opening the file in a text editor. We’ll show you how to load the data by using Python and then turn it into a PyTorch tensor.

### Loading dataset

Python offers several options for loading a CSV file quickly. Three popular options are:

* The csv module that ships with Python
* NumPy
* Pandas

The third option is the most time- and memory-efficient, but we’ll avoid introducing an additional library into your learning trajectory merely to load a file. Because NumPy is already imported, we’ll use it to load the file.
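
For comparison, here is a minimal sketch of the first option, the built-in csv module. It reads from a tiny in-memory stand-in rather than the real file; the column names `a` and `b` and all values are made up for illustration.

```python
import csv
import io

import numpy as np

# Tiny in-memory stand-in for a semicolon-delimited CSV file (made-up values)
sample = io.StringIO('"a";"b";"quality"\n7.0;0.27;6\n6.3;0.30;6\n')

reader = csv.reader(sample, delimiter=';')
header = next(reader)                                # first row holds the column names
rows = [[float(v) for v in row] for row in reader]   # remaining rows are numeric

table = np.array(rows, dtype=np.float32)             # 2 samples x 3 columns
print(header, table.shape)
```

The csv module hands back strings, so the conversion to floats is explicit here, whereas `np.loadtxt` does it for you.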

In [14]:
! wget https://raw.githubusercontent.com/deep-learning-with-pytorch/dlwpt-code/master/data/p1ch4/tabular-wine/winequality-white.csv

In [4]:
wine_path = 'winequality-white.csv'
wineq_numpy = np.loadtxt(wine_path, dtype=np.float32, delimiter=';', skiprows=1)
wineq_numpy

array([[ 7.  ,  0.27,  0.36, ...,  0.45,  8.8 ,  6.  ],
       [ 6.3 ,  0.3 ,  0.34, ...,  0.49,  9.5 ,  6.  ],
       [ 8.1 ,  0.28,  0.4 , ...,  0.44, 10.1 ,  6.  ],
       ...,
       [ 6.5 ,  0.24,  0.19, ...,  0.46,  9.4 ,  6.  ],
       [ 5.5 ,  0.29,  0.3 , ...,  0.38, 12.8 ,  7.  ],
       [ 6.  ,  0.21,  0.38, ...,  0.32, 11.8 ,  6.  ]], dtype=float32)

Next, check that all the data has been read.

In [5]:
# Read only the header row; the with-block closes the file afterwards
with open(wine_path, newline='') as f:
    col_list = next(csv.reader(f, delimiter=';'))
wineq_numpy.shape, col_list

((4898, 12),
 ['fixed acidity',
  'volatile acidity',
  'citric acid',
  'residual sugar',
  'chlorides',
  'free sulfur dioxide',
  'total sulfur dioxide',
  'density',
  'pH',
  'sulphates',
  'alcohol',
  'quality'])

And now proceed to convert the NumPy array to a PyTorch tensor.

In [6]:
wineq = torch.from_numpy(wineq_numpy)
wineq.shape, wineq.type()

(torch.Size([4898, 12]), 'torch.FloatTensor')

At this point, you have a torch.FloatTensor containing all columns, including the last, which refers to the quality score.
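
One detail worth knowing about this conversion: `torch.from_numpy` does not copy the data, so the tensor and the NumPy array share the same memory. A minimal sketch on a made-up array:

```python
import numpy as np
import torch

arr = np.zeros((2, 3), dtype=np.float32)
t = torch.from_numpy(arr)    # no copy: the tensor wraps the array's buffer

arr[0, 0] = 1.0              # modify the NumPy array...
print(t[0, 0].item())        # ...and the tensor sees the change: 1.0

t_copy = t.clone()           # clone() makes an independent copy
arr[0, 1] = 5.0              # visible in t, but not in t_copy
```

If you plan to modify the array or the tensor later and want them to stay independent, calling `clone()` on the tensor is one way to do it.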

### Preparing training and testing set

You could treat the score as a continuous variable, keep it as a real number, and perform a regression task, or treat it as a label and try to guess that label from the chemical analysis in a classification task.

In both approaches, you typically remove the score from the tensor of input data and keep it in a separate tensor, so that you can use the score as the ground truth without it being input to your model:

In [7]:
# Select all rows and all columns except the last one
data = wineq[:, :-1]
data, data.shape

(tensor([[ 7.0000,  0.2700,  0.3600,  ...,  3.0000,  0.4500,  8.8000],
         [ 6.3000,  0.3000,  0.3400,  ...,  3.3000,  0.4900,  9.5000],
         [ 8.1000,  0.2800,  0.4000,  ...,  3.2600,  0.4400, 10.1000],
         ...,
         [ 6.5000,  0.2400,  0.1900,  ...,  2.9900,  0.4600,  9.4000],
         [ 5.5000,  0.2900,  0.3000,  ...,  3.3400,  0.3800, 12.8000],
         [ 6.0000,  0.2100,  0.3800,  ...,  3.2600,  0.3200, 11.8000]]),
 torch.Size([4898, 11]))

In [8]:
# Select all rows and the last column
target = wineq[:, -1]
target, target.shape

(tensor([6., 6., 6.,  ..., 6., 7., 6.]), torch.Size([4898]))

If you want to transform the target tensor into a tensor of labels, you have two options, depending on how you want to use the categorical data.

1. One option is to treat the labels as a vector of integer scores
2. The other approach is to build a one-hot encoding of the scores

In [9]:
target = wineq[:, -1].long()
target

tensor([6, 6, 6,  ..., 6, 7, 6])

If targets were string labels (such as wine color), assigning an integer number to each string would allow you to follow the same approach.
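
A minimal sketch of that idea, assuming hypothetical string labels (the `colors` list and the `class_to_index` mapping are illustrative and not part of the white-wine file):

```python
import torch

# Hypothetical string labels, one per sample
colors = ["white", "red", "white", "rose"]

# Sort the distinct labels so the mapping is reproducible across runs
classes = sorted(set(colors))                            # ['red', 'rose', 'white']
class_to_index = {name: i for i, name in enumerate(classes)}

labels = torch.tensor([class_to_index[c] for c in colors], dtype=torch.long)
print(labels)  # tensor([2, 0, 2, 1])
```

Keeping `classes` around lets you map predicted indices back to the original strings later.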

The other approach is to build a one-hot encoding of the scores: encode each of the ten possible scores in a vector of ten elements, with all elements set to zero except one, at a different index for each score.

For example, if the score itself is used as the index, as in the code below, a score of 1 maps to the vector (0,1,0,0,0,0,0,0,0,0), a score of 5 to (0,0,0,0,0,1,0,0,0,0), and so on.

**One-hot encoding is appropriate for quantitative scores when fractional values between integer scores (such as 2.4) make no sense for the application (when the score is either this or that).**

You can achieve one-hot encoding by using the scatter_ method, which fills the
tensor with values from a source tensor along the indices provided as arguments.

In [10]:
target_onehot = torch.zeros(target.shape[0], 10)
target_onehot.scatter_(1, target.unsqueeze(1), 1.0)

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 1., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

Notice that the name scatter_ ends with an underscore. This convention in PyTorch indicates that the method modifies the tensor in place rather than returning a new tensor. The arguments for scatter_ are:

* The dimension along which the following two arguments are specified
* A column tensor indicating the indices of the elements to scatter
* A tensor containing the elements to scatter or a single scalar to scatter (1, in this case)
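
To see the three arguments at work on something small enough to read in full, here is a sketch with three made-up samples and three classes. The comparison with `torch.nn.functional.one_hot` is an addition for illustration; the text above uses only scatter_.

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([0, 2, 1])              # class indices, dtype int64
onehot = torch.zeros(scores.shape[0], 3)      # 3 samples x 3 classes

# dim=1: for each row i, write 1.0 into column scores[i]
onehot.scatter_(1, scores.unsqueeze(1), 1.0)
print(onehot)
# tensor([[1., 0., 0.],
#         [0., 0., 1.],
#         [0., 1., 0.]])

# F.one_hot builds the same encoding, returned as an integer tensor
same = F.one_hot(scores, num_classes=3)
```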

The second argument of scatter_, the index tensor, is required to have the same
number of dimensions as the tensor you scatter into. Because target_onehot has two dimensions (4898x10), you need to add an extra dummy dimension to target by
using unsqueeze:

In [12]:
target_unsqueezed = target.unsqueeze(1)
target_unsqueezed

tensor([[6],
        [6],
        ...,
        [7],
        [6]])

The call to unsqueeze adds a singleton dimension, turning a 1D tensor of 4898 elements into a 2D tensor of size (4898x1), without changing its contents.
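
The position argument decides where the new dimension goes. A short sketch on a made-up three-element tensor:

```python
import torch

v = torch.tensor([6, 6, 7])   # shape (3,)
col = v.unsqueeze(1)          # shape (3, 1): a column
row = v.unsqueeze(0)          # shape (1, 3): a row

print(v.shape, col.shape, row.shape)
# torch.Size([3]) torch.Size([3, 1]) torch.Size([1, 3])

restored = col.squeeze(1)     # squeeze removes the singleton dimension again
```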

**PyTorch allows you to use class indices directly as targets while training neural networks. If you want to use the score as a categorical input to the network, however, you’d have to transform it to a one-hot encoded tensor.**
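
As a sketch of the first point, `nn.CrossEntropyLoss` accepts integer class indices as targets. The logits below are random stand-ins for a model's output, and the target values are made up:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 10)             # stand-in model outputs: 4 samples, 10 classes
targets = torch.tensor([6, 6, 7, 5])    # class indices, dtype torch.long

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, targets)         # no one-hot encoding of the targets needed
print(loss.item())
```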

### Data normalization

Now go back to your data tensor, containing the 11 variables associated with the
chemical analysis. You can use the functions in the PyTorch Tensor API to manipulate your data in tensor form. 

First, obtain means and standard deviations for each column: