# PyTorch Tutorial - Tensors, grad, and logistic regression

Prof. Dorien Herremans, with many thanks to Nelson Lui for the base text. 

**To edit the notebook**:

There are two ways to edit the notebook.

You can either open it in the "playground", where you can change and run cells. After closing the tab, your changes will be lost. To do so, press "File" > "Open in playground".

Alternatively, you can make a copy of this notebook to your own Google Drive account through "File" > "Save a copy in Drive..."

**Activating the GPU on Colab**:

Colab now gives you 12 hours of free GPU time (before you have to request a new node).
Simply select "GPU" in the Accelerator drop-down in Notebook Settings (either through the Edit menu or the command palette at cmd/ctrl-shift-P).

# Setting up the notebook on Colab

Let's check that we are using the GPU environment and that CUDA is available:

In [None]:
# Import PyTorch and other libraries
import torch
import numpy as np
from tqdm import tqdm

print("PyTorch version:")
print(torch.__version__)
print("GPU Detected:")
print(torch.cuda.is_available())

PyTorch version:
1.8.1+cu101
GPU Detected:
True


# What is PyTorch?

PyTorch is a deep learning package for building dynamic computation graphs.

More broadly, it's a GPU-compatible replacement for NumPy. You can think of it as NumPy + auto-differentiation.

# Basic Mechanics

If you are interested in basic operations, please go through the official PyTorch tutorial on tensors here: https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#sphx-glr-beginner-blitz-tensor-tutorial-py

## Tensors
The `Tensor` type is essentially a NumPy `ndarray`. The critical difference is that a `Tensor` can be moved to the GPU for accelerated computing.

There are several types of `Tensor`, each of which corresponds to a NumPy `dtype` and to whether the data lives on the CPU or the GPU.

The main ones you will probably use are:

| Data Type | CPU Tensor Type | GPU Tensor Type | NumPy dtype |
| --- | --- | --- | --- |
| 32-bit floating point | `torch.FloatTensor` | `torch.cuda.FloatTensor` | `float32` |
| 8-bit integer (unsigned) | `torch.ByteTensor` | `torch.cuda.ByteTensor` | `uint8` |
| 64-bit integer (signed)  | `torch.LongTensor` | `torch.cuda.LongTensor` | `int64` |

In general, you want to use `FloatTensor` by default, unless your data is specifically an integer (in which case you'd use a `LongTensor`) or your data is bits (you'd want to use `ByteTensor`).

You can find a full list of tensor types [here](http://pytorch.org/docs/master/tensors.html).
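
In recent PyTorch versions, the element type and the location of a tensor are exposed through its `dtype` and `device` attributes. A minimal sketch (the variable names are only illustrative):

```python
import torch

# Choose the element type explicitly with the dtype argument.
floats = torch.tensor([1.5, 2.5], dtype=torch.float32)
longs = torch.tensor([1, 2, 3], dtype=torch.int64)
bytes_ = torch.tensor([0, 255], dtype=torch.uint8)

for t in (floats, longs, bytes_):
    # .dtype is the element type, .device says where the data lives.
    print(t.dtype, t.device)

# Convert between types with .to(); this returns a new tensor.
as_float = longs.to(torch.float32)
print(as_float.dtype)
```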

To construct an uninitialized 4x6 matrix (think `malloc` for those of you familiar with the `C` language, so the contents are not guaranteed to be all `0`), we can use:

In [None]:
# Note that torch.Tensor is short for torch.FloatTensor
uninit_float = torch.Tensor(4, 6)
print(uninit_float)
print("Type of above Tensor (it's also printed when you print the tensor):")
print(type(uninit_float))

tensor([[1.8783e+34, 3.0686e-41, 7.0065e-44, 6.8664e-44, 6.3058e-44, 6.7262e-44],
        [7.5670e-44, 6.3058e-44, 6.7262e-44, 6.8664e-44, 1.1771e-43, 6.7262e-44],
        [7.1466e-44, 8.1275e-44, 7.4269e-44, 6.8664e-44, 8.1275e-44, 6.7262e-44],
        [7.5670e-44, 6.4460e-44, 7.9874e-44, 6.7262e-44, 7.2868e-44, 7.4269e-44]])
Type of above Tensor (it's also printed when you print the tensor):
<class 'torch.Tensor'>


We can also create Tensors directly from (optionally nested) lists.

In [None]:
some_float_tensor = torch.Tensor([3.2, 4.3, 5.5])
print(some_float_tensor)

tensor([3.2000, 4.3000, 5.5000])


If we want a random uniform initialized `FloatTensor`, we can use `rand`.



In [None]:
rand_float = torch.rand(4, 6)
print(rand_float)
print(type(rand_float))

tensor([[0.2706, 0.5993, 0.8612, 0.1359, 0.9041, 0.2781],
        [0.1394, 0.6222, 0.2020, 0.1138, 0.3147, 0.1496],
        [0.2429, 0.2838, 0.4365, 0.5322, 0.9875, 0.2481],
        [0.1620, 0.5209, 0.6757, 0.2193, 0.0877, 0.0500]])
<class 'torch.Tensor'>


Let's print the `shape` of our random tensor. In NumPy / PyTorch / other tensor-manipulation libraries, `shape` refers to the dimensions of the tensor.

In [None]:
# Get the size of the rand float
print(rand_float.size())

# What's this weird torch.Size datatype?
print(type(rand_float.size()))
print()

# It's just a tuple!
print("Is rand_float.size() a tuple?")
print(isinstance(rand_float.size(), tuple))
print()

# We can even extract specific dimensions.
# The two lines below are functionally identical.
print("Size of rand_float dimension 1:")
print(rand_float.size()[0])
print(rand_float.size(0))

torch.Size([4, 6])
<class 'torch.Size'>

Is rand_float.size() a tuple?
True

Size of rand_float dimension 1:
4
4


## NumPy Bridge

It's very easy to convert a NumPy array into a Torch Tensor and vice versa. If the tensor is on the CPU, the two objects share their underlying memory, so changing one will change the other.


In [None]:
a = torch.ones(6)
print(a)

b = a.numpy()
print(b)

tensor([1., 1., 1., 1., 1., 1.])
[1. 1. 1. 1. 1. 1.]


Notice how they share the same memory: when you change one, the other changes as well:

In [None]:
a.add_(4)
print(a)
print(b)

tensor([5., 5., 5., 5., 5., 5.])
[5. 5. 5. 5. 5. 5.]


We can just as easily convert the other way, and once again observe the same memory sharing behaviour: 

In [None]:
import numpy as np

a = np.ones(6)
b = torch.from_numpy(a)

np.add(a, 1, out=a)
print(a)
print(b)

[2. 2. 2. 2. 2. 2.]
tensor([2., 2., 2., 2., 2., 2.], dtype=torch.float64)


## Operations

PyTorch has a huge library of various operations (e.g. indexing, slicing, math, linear algebra, sampling, etc). They're all listed [here](http://pytorch.org/docs/0.3.1/torch.html). We'll experiment with the addition operation below.

We can add with the normal Python `+` operator.

In [None]:
other_rand_float = torch.rand(4, 6)
# Three ways to add!

# Python Operator +
print(rand_float + other_rand_float)

tensor([[0.6864, 0.7006, 1.2178, 0.3397, 1.4092, 0.8570],
        [0.6551, 1.4517, 0.7089, 0.3275, 0.8564, 0.3426],
        [0.7976, 0.5284, 1.0883, 1.0456, 1.0361, 0.5432],
        [1.1440, 0.6275, 1.0399, 1.0930, 0.1585, 0.4573]])


We can also use the `torch.add` function.

In [None]:
# torch.add
print(torch.add(rand_float, other_rand_float))

tensor([[0.6864, 0.7006, 1.2178, 0.3397, 1.4092, 0.8570],
        [0.6551, 1.4517, 0.7089, 0.3275, 0.8564, 0.3426],
        [0.7976, 0.5284, 1.0883, 1.0456, 1.0361, 0.5432],
        [1.1440, 0.6275, 1.0399, 1.0930, 0.1585, 0.4573]])


We can also add in-place to `rand_float`. This modifies the `rand_float` tensor. All PyTorch operations that modify in place end with an underscore ("_").

In [None]:
# Add in-place to rand_float. This modifies rand_float!
rand_float.add_(other_rand_float)
print(rand_float)

tensor([[0.6864, 0.7006, 1.2178, 0.3397, 1.4092, 0.8570],
        [0.6551, 1.4517, 0.7089, 0.3275, 0.8564, 0.3426],
        [0.7976, 0.5284, 1.0883, 1.0456, 1.0361, 0.5432],
        [1.1440, 0.6275, 1.0399, 1.0930, 0.1585, 0.4573]])


## Broadcasting

Broadcasting is a construct in NumPy and PyTorch that lets operations apply to tensors of different shapes. Under certain conditions, a smaller tensor can be "broadcast" across a bigger one. This is often desirable to do, since the looping happens at the C-level and is _incredibly_ efficient in both speed and memory.

In the example below, `x` has shape `(3,)` and `y` has shape `(5, 3)`. We can still add them together --- the smaller tensor is automatically added to each row of the larger tensor.

In [None]:
# Random LongTensors from 0 to 9.
x = torch.LongTensor(3).random_(10)
y = torch.LongTensor(5, 3).random_(10)

print(x)
print(y)
print(x+y)

tensor([2, 6, 0])
tensor([[8, 5, 8],
        [0, 1, 7],
        [7, 4, 1],
        [5, 5, 5],
        [8, 7, 4]])
tensor([[10, 11,  8],
        [ 2,  7,  7],
        [ 9, 10,  1],
        [ 7, 11,  5],
        [10, 13,  4]])
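
Broadcasting aligns shapes starting from the last dimension, so a vector of shape `(5,)` cannot be added to a `(5, 3)` tensor directly. A sketch of adding one value per row instead, using `unsqueeze` to obtain shape `(5, 1)` (the names and values are illustrative):

```python
import torch

y = torch.arange(15).view(5, 3)   # shape (5, 3)
per_row = torch.arange(5) * 100   # shape (5,)

# Shapes (5, 3) and (5,) do not line up: the trailing dims are 3 and 5.
try:
    y + per_row
except RuntimeError as err:
    print("Not broadcastable:", type(err).__name__)

# Adding a trailing dimension of size 1 gives shape (5, 1),
# which broadcasts across the 3 columns.
result = y + per_row.unsqueeze(1)
print(result.shape)
print(result)
```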


**Broadcasting, if used improperly, can also lead to inadvertent bugs**. 

Consider this example: say you want to multiply a matrix of shape `(4, 6)` with one of shape `(6, 4)` to get something of shape `(4, 4)`. You might be tempted to use the `*` operator, but `*` performs elementwise multiplication. For matrix multiplication, use either `Tensor.mm` or the `@` operator.

In old versions of PyTorch, `*` on two tensors with the same number of elements but different shapes could silently succeed, which made this mistake particularly nasty and hard to detect. That behavior has thankfully been removed: in a recent version, the last line of the cell below raises a `RuntimeError` because shapes `(4, 6)` and `(6, 4)` cannot be broadcast together.

In [None]:
x = torch.LongTensor(4, 6).random_(10)  # [4,6]
y = torch.LongTensor(6, 4).random_(10)  # [6,4]

print("x: ", x)
print("y: ", y)

# Matrix multiply
print("x @ y (matrix multiply) : ", x @ y)

# USUALLY UNINTENTIONAL ELEMENTWISE-MULTIPLICATION
print("x * y (elementwise multiply) : ", x * y)

x:  tensor([[7, 3, 7, 6, 4, 4],
        [1, 9, 8, 5, 8, 6],
        [6, 9, 4, 6, 9, 6],
        [9, 9, 1, 0, 9, 6]])
y:  tensor([[8, 2, 5, 3],
        [6, 9, 8, 0],
        [4, 4, 2, 9],
        [5, 3, 0, 2],
        [8, 1, 8, 2],
        [0, 9, 4, 4]])
x @ y (matrix multiply) :  tensor([[164, 127, 121, 120],
        [183, 192, 181, 125],
        [220, 190, 206, 108],
        [202, 166, 215,  78]])


RuntimeError: ignored

A big part of programming with tensors is keeping track of the expected shapes of your tensors and whether these shapes are actually showing up --- doing so will dramatically reduce the number of bugs you have.
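
One lightweight way to keep track of shapes is to assert them wherever you have an expectation. A sketch, where the helper `matmul_checked` is a hypothetical name and not part of PyTorch:

```python
import torch

def matmul_checked(a, b):
    """Matrix-multiply two 2D tensors after checking that their shapes agree."""
    assert a.dim() == 2 and b.dim() == 2, "expected two matrices"
    assert a.size(1) == b.size(0), f"inner dims differ: {a.shape} vs {b.shape}"
    out = a @ b
    # The result should have the rows of a and the columns of b.
    assert out.shape == (a.size(0), b.size(1))
    return out

x = torch.rand(4, 6)
y = torch.rand(6, 4)
print(matmul_checked(x, y).shape)
```

Calling `matmul_checked(x, x)` fails immediately with a readable message instead of producing a confusing error further down the line.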

## Reshaping

It's often desirable to reshape a Tensor, maybe to broadcast with something else or to turn it into something that is easier to reason about.
We can do that with the `.view` function.

In [None]:
x = torch.LongTensor(4, 4).random_(10)
print(x)

# Turn it into a Tensor of shape (2, 8)
y = x.view(2, 8)
print(y)

# Turn it into a Tensor of shape (8, ?).
# The -1 is inferred from the shape of the Tensor.
z = x.view(8, -1)
print(z)

# Turn it into a Tensor of shape (16,) (flatten it).
# This is the same as x.view(16).
flat = x.view(-1)
print(flat)

tensor([[7, 2, 0, 5],
        [5, 0, 0, 6],
        [4, 0, 8, 2],
        [0, 3, 4, 0]])
tensor([[7, 2, 0, 5, 5, 0, 0, 6],
        [4, 0, 8, 2, 0, 3, 4, 0]])
tensor([[7, 2],
        [0, 5],
        [5, 0],
        [0, 6],
        [4, 0],
        [8, 2],
        [0, 3],
        [4, 0]])
tensor([7, 2, 0, 5, 5, 0, 0, 6, 4, 0, 8, 2, 0, 3, 4, 0])
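
`.view` never copies data, so it only works when the requested shape is compatible with the tensor's memory layout. A sketch of a common case where it fails (a transposed tensor), together with `.reshape`, which copies when it has to:

```python
import torch

x = torch.arange(6).view(2, 3)
xt = x.t()                    # transpose: shape (3, 2), not contiguous
print(xt.is_contiguous())

try:
    xt.view(-1)               # cannot be expressed without copying
except RuntimeError as err:
    print("view failed:", type(err).__name__)

flat = xt.reshape(-1)         # falls back to a copy when needed
print(flat)

also_flat = xt.contiguous().view(-1)  # explicit copy, then view
print(also_flat)
```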


## Slicing and Indexing

PyTorch follows the same conventions that NumPy uses for array slicing and indexing. [Here's a good intro to slicing and indexing in NumPy](http://www.scipy-lectures.org/intro/numpy/array_object.html#indexing-and-slicing).

In [None]:
x = torch.LongTensor(3, 5).random_(10)
print(x)

# Get the first row
print("First row:")
print(x[0])

# Get the last row
print("Last row:")
print(x[-1])

# Get the 3rd column
print("3rd column from left:")
print(x[:, 2])

# Get the last column
print("Last column from left:")
print(x[:, -1])

tensor([[0, 2, 5, 7, 2],
        [9, 3, 3, 9, 1],
        [4, 7, 3, 4, 4]])
First row:
tensor([0, 2, 5, 7, 2])
Last row:
tensor([4, 7, 3, 4, 4])
3rd column from left:
tensor([5, 3, 3])
Last column from left:
tensor([2, 1, 4])


Here's a slightly more complex example with a 3D Tensor --- slicing and indexing a 3D tensor is quite common in neural NLP, especially when dealing with the output of a recurrent neural network (RNN). The same slicing principles apply, though.

In [None]:
# Shape of x is (batch_size, sequence_length, hidden_dim)
# 3 is the batch size.
# 5 is the sequence length of all examples in the batch.
# 10 is the size of the RNN hidden state.
x = torch.LongTensor(3, 5, 10).random_(15)
print(x)

# Get the last LSTM outputs for each example in the batch
print("Final LSTM outputs for each example: ")
print(x[:, -1, :])

tensor([[[14,  1,  4,  9,  0, 10, 12,  2, 12, 12],
         [ 3, 11, 14,  1,  5,  2,  6,  1, 14, 12],
         [11, 12,  0,  6, 12,  9,  6,  1,  5,  1],
         [ 2,  5,  8, 11, 13,  7, 10,  9,  1,  4],
         [ 2, 10,  3,  0, 11,  7,  0,  7,  0,  6]],

        [[ 5, 13, 12,  0,  1,  9,  2,  0,  4,  6],
         [ 0,  8,  8,  9,  4,  6, 10,  4, 11,  0],
         [11,  6, 14, 13,  6,  1,  2,  7,  1,  9],
         [ 0,  8,  7,  7,  3, 10,  3,  7,  2,  3],
         [ 6, 11, 14,  2, 11, 12, 10,  0,  7,  9]],

        [[ 8,  3,  9, 13, 10, 12,  4,  3, 13, 11],
         [ 6,  4,  7, 13,  0,  9,  6, 14, 12,  9],
         [14, 11,  8,  9, 12,  6,  2, 13, 12,  7],
         [ 7,  0, 11,  7,  8,  2,  2, 14,  9,  2],
         [ 7,  5, 11, 14,  2,  5, 12, 10,  1,  5]]])
Final LSTM outputs for each example: 
tensor([[ 2, 10,  3,  0, 11,  7,  0,  7,  0,  6],
        [ 6, 11, 14,  2, 11, 12, 10,  0,  7,  9],
        [ 7,  5, 11, 14,  2,  5, 12, 10,  1,  5]])


# Using the GPU

PyTorch allows you to easily move computations to the GPU --- just move the associated input tensors to the GPU with the `.cuda()` function.

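Note that GPU and CPU tensors live on different devices and cannot be mixed in a single operation. `type()` reports `torch.Tensor` for both, so check the `.device` attribute to tell them apart.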

In [None]:
if torch.cuda.is_available():
  # Create a Tensor
  x = torch.rand(3, 5)
  print(type(x))

  # Move it to the GPU
  x_gpu = x.cuda()
  print(type(x_gpu))

<class 'torch.Tensor'>
<class 'torch.Tensor'>
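
A common way to write code that runs with or without a GPU is to pick a `torch.device` once and move tensors with `.to(device)`. A sketch (the name `device` is just a convention):

```python
import torch

# Fall back to the CPU when no GPU is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.rand(3, 5)
x_dev = x.to(device)

# type() is torch.Tensor either way; .device tells the two apart.
print(x.device, x_dev.device)

# Operands must be on the same device, so create new tensors there too.
y_dev = torch.ones(3, 5, device=device)
print((x_dev + y_dev).device)
```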


If you're using a machine with a GPU, you can run `nvidia-smi` in bash to get GPU usage statistics. Below, you can see the type of GPU, the current memory usage, the amount of memory the GPU has, and the % of the GPU being used for computation.

In [None]:
!nvidia-smi

Tue Jun  1 03:54:12 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    32W / 250W |    897MiB / 16280MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Let's test out the GPU with a large matrix multiply!

In [None]:
# Some test inputs
test_input_one = torch.rand(1000, 9000)
test_input_two = torch.rand(9000, 1000)

In [None]:
%%timeit
test_input_one.mm(test_input_two)

1 loop, best of 5: 226 ms per loop


In [None]:
# Move to GPU (only if one is available)
using_GPU = torch.cuda.is_available()
if using_GPU:
  gpu_test_input_one = test_input_one.cuda()
  gpu_test_input_two = test_input_two.cuda()

In [None]:
%%timeit -n 100
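# This now automatically runs on the GPU!
# Note: CUDA calls return before the GPU has finished, so without
# torch.cuda.synchronize() this mostly measures the time to launch the work.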
if using_GPU:
  gpu_test_input_one.mm(gpu_test_input_two)

The slowest run took 7.66 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 5: 9.77 µs per loop
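
CUDA operations are launched asynchronously, so a timing in the microsecond range largely reflects how long it takes to queue the work rather than to finish it. Below is a sketch of a timing that waits for the GPU with `torch.cuda.synchronize()`. The helper `time_matmul` is a hypothetical name, and the measured numbers will vary by machine:

```python
import time
import torch

def time_matmul(a, b, repeats=3):
    """Average seconds per a.mm(b), waiting for the GPU to finish if one is used."""
    if a.is_cuda:
        torch.cuda.synchronize()      # finish any pending work first
    start = time.perf_counter()
    for _ in range(repeats):
        a.mm(b)
    if a.is_cuda:
        torch.cuda.synchronize()      # wait for the queued kernels
    return (time.perf_counter() - start) / repeats

a = torch.rand(1000, 9000)
b = torch.rand(9000, 1000)
print("CPU seconds per matmul:", time_matmul(a, b))

if torch.cuda.is_available():
    print("GPU seconds per matmul:", time_matmul(a.cuda(), b.cuda()))
```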


Using a GPU can give you massive speedups for tensor operations since most of them are easily parallelizable. Historically, the success of deep learning is inextricably tied to the ability to efficiently train the models on GPUs.

To take advantage of this, **you want to be using PyTorch tensor operations almost everywhere** --- avoid explicitly iterating over tensors!
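
As a small illustration, here is a sketch that computes the same dot product once with an explicit Python loop and once with a single tensor operation:

```python
import torch

a = torch.arange(1000, dtype=torch.float32)
b = torch.ones(1000)

# Slow: a Python-level loop over individual elements.
total = 0.0
for i in range(a.size(0)):
    total += (a[i] * b[i]).item()

# Fast: one tensor operation that runs in optimized native code.
vectorized = torch.dot(a, b)

print(total, vectorized.item())
```

Both give the same result, but the loop pays Python overhead for every element, while the tensor operation does all the work in one call.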

# Computation Graphs

A computation graph is simply a way to define a sequence of operations to go from input to model output. 

You can think of the nodes in the graph as representing operations, and the edges as representing the tensors going in and out.

For example, say we wanted to build a linear regression model. This has the form $\hat y = Wx + b$. 

In this equation, $x$ is our input, $W$ is a learned weight matrix, $b$ is a learned bias, and $\hat y$ is the predicted output. 

As a computation graph, this looks like:

![Linear Regression Computation Graph](https://imgur.com/IcBhTjS.png)

When implementing deep learning models, you're basically designing and specifying computation graphs. It's a bit like playing with Legos in that you're stringing together a bunch of blocks (the operations) to achieve a final desired output.
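
In PyTorch the graph is built as the operations run. A sketch of the linear regression example with arbitrary illustrative sizes (3 inputs, 2 outputs); `requires_grad=True` marks the tensors for which gradients should be tracked:

```python
import torch

torch.manual_seed(0)
W = torch.randn(2, 3, requires_grad=True)   # learned weight matrix
b = torch.randn(2, requires_grad=True)      # learned bias
x = torch.tensor([1.0, 2.0, 3.0])           # input

y_hat = W @ x + b                           # builds the graph: multiply, then add

print(y_hat.shape)
# Each result remembers the operation that produced it.
print(y_hat.grad_fn)
print(y_hat.grad_fn.next_functions)
```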

# Tensors, Variables and Autograd

One of PyTorch's key features (and what makes it a deep learning library) is the ability to specify arbitrary computation graphs and compute gradients on them automatically. For more detail, please see the official tutorial on grad at PyTorch: https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#sphx-glr-beginner-blitz-autograd-tutorial-py

We can do this directly on Tensor objects. In older versions of PyTorch (before 0.4), you first had to import `Variable` from `torch.autograd` and wrap the tensor in it. You will still see this in tutorials floating around the internet, so it is worth recognizing the notation. A tensor that takes part in a computation graph gives you access to:

*   The data of the tensor (accessed with the `.data` member)
*   The gradient with respect to this tensor (accessed with the `.grad` member)
*   The function that created it (accessed with the `.grad_fn` member)

For legacy purposes, I want to mention that you will sometimes see this: 

`x = Variable(torch.Tensor([1, 2, 3]), requires_grad=True)`

In newer versions of PyTorch, you can simply use:

`x = torch.tensor([1., 2., 3.], requires_grad=True)`


In [None]:
x = torch.tensor([1., 2., 3.], requires_grad=True)
# You can access the underlying tensor with the .data attribute
print(x.data)

# Any operation you could use on plain Tensors also works on tensors that track gradients
y = torch.tensor([4., 5., 6.], requires_grad=True)
z = x + y
print(z.data)

# But z also stores where it came from!
print(z.grad_fn)

tensor([1., 2., 3.])
tensor([5., 7., 9.])
<AddBackward0 object at 0x7fc5401fa450>


A note on the `requires_grad` argument: with most NN code, you don’t want to set `requires_grad=True` unless you explicitly want the gradient w.r.t. your input. In this example, however, `requires_grad=True` is necessary because otherwise there would be no gradients to compute, since there are no model parameters.
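
To see the difference, note that the parameters of a layer already have `requires_grad=True`, while a plain input tensor does not. A sketch using `nn.Linear` with arbitrary sizes:

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 1)
inputs = torch.rand(4, 3)

# Model parameters track gradients by default; inputs do not.
print([p.requires_grad for p in layer.parameters()])
print(inputs.requires_grad)

# The output depends on the parameters, so it is part of the graph.
out = layer(inputs)
print(out.requires_grad)

# Inside torch.no_grad(), no graph is recorded (useful for evaluation).
with torch.no_grad():
    print(layer(inputs).requires_grad)
```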

Let's do some more operations and calculate the gradient.

In [None]:
z_sum = torch.sum(z)
print(z_sum)
print(z_sum.grad_fn)

tensor(21., grad_fn=<SumBackward0>)
<SumBackward0 object at 0x7fc5401fa310>


Say we want to calculate the derivative of the sum w.r.t. the 
first element of x (in math,  $\frac{\partial z_{sum}}{\partial x_0}$).

Autograd knows that: $$ z_{sum} = x_0 + y_0 + x_1 + y_1 + x_2 + y_2$$

So the derivative of $z_{sum}$ w.r.t. $x_0$ is 1! Similarly, the derivative with respect to every element of $x$ is 1. Let's verify this with autograd.

In [None]:
# Backprop from z_sum backwards into the graph
# It'll follow the chain of computation by going from grad_fn to grad_fn
# until it reaches the input.
z_sum.backward()
print(x.grad)

tensor([1., 1., 1.])


Try running the block above multiple times! What do you notice happening?

**The gradient in `.grad` accumulates each time we call `.backward()`**. This is convenient for some models, but when training we will usually want to explicitly zero out the gradients before running backpropagation (more on this later).
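
A sketch of the accumulation and of resetting it by hand. The graph is rebuilt on every iteration, as it would be in a training loop; an optimizer's `zero_grad()` performs the same reset for all of its parameters:

```python
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)

history = []
for step in range(3):
    loss = (2 * x).sum()      # d(loss)/dx is 2 for every element
    loss.backward()           # adds the new gradient to x.grad
    history.append(x.grad.clone())
    print(step, x.grad)

# Reset the accumulated gradient in place, then backprop once more.
x.grad.zero_()
(2 * x).sum().backward()
print(x.grad)
```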

# Structuring PyTorch models

At the highest level, `nn.Module` defines what most would refer to as a "model". It's a convenient way of encapsulating the trainable parameters of a model or of a component of your model, and subclassing it gives you methods for moving your model to the GPU, saving it, loading it, etc.

When you're building your own model, you're going to subclass `nn.Module`. Critically, you also need to override the `__init__()` and `forward()` functions.

*   In `__init__()`, you should take arguments that modify how the model runs (e.g. # of layers, # of hidden units, output sizes). You'll also set up most of the layers that you use in the forward pass here.
*   In `forward()`, you define the "forward pass" of your model, or the operations needed to transform input to output. **You can use any of the Tensor operations in the forward pass.**
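
A minimal sketch of this pattern. `LinearModel` and its sizes are hypothetical and only serve to show what subclassing `nn.Module` provides:

```python
import torch
import torch.nn as nn

class LinearModel(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        # Layers assigned as attributes are registered automatically.
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, x):
        # Any tensor operation can be used here.
        return self.linear(x)

model = LinearModel(input_size=2, output_size=1)

# The module keeps track of every trainable parameter.
for name, param in model.named_parameters():
    print(name, tuple(param.shape))

# state_dict() holds the weights; it is what you save and load.
clone = LinearModel(input_size=2, output_size=1)
clone.load_state_dict(model.state_dict())

x = torch.rand(5, 2)
print(torch.equal(model(x), clone(x)))
```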



### A simple example `Module` : Logistic Regression

As a simple example of how to make a Module, let's build a logistic regression model.

Logistic regression takes an input $x$, applies a linear transform, and squashes the result into a probability for each output class. If you recall from the lecture, we start with a linear regression model on the input variables and pass its output through a logistic sigmoid function. As a module, this looks like:

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class LogisticRegression(nn.Module):
  # input_size: Dimensionality of input feature vector.
  # num_classes: The number of classes in the classification problem.
  def __init__(self, input_size, num_classes):
    # Always call the superclass (nn.Module) constructor first!
    super(LogisticRegression, self).__init__()
    # Set up the linear transform
    self.linear = nn.Linear(input_size, num_classes)
    # The sigmoid activation is applied in forward() below, because the
    # nn.BCELoss loss function we use later expects probabilities.

  # Forward's sole argument is the input.
  # input is of shape (batch_size, input_size)
  def forward(self, x):
    # Apply the linear transform.
    # out is of shape (batch_size, num_classes).
    out = self.linear(x)
    # Apply the sigmoid to squash each output into a probability
    # in the range (0, 1) for each example.
    out = torch.sigmoid(out)
    return out

Modules are also callable! As a result, we can do the following to apply them to an input. Note how the number of features determines the size of the linear layer above. 

In [None]:
# Binary classification
num_outputs = 1
num_input_features = 2

# Create the logistic regression model
logreg_clf = LogisticRegression(num_input_features, num_outputs)

print(logreg_clf)

LogisticRegression(
  (linear): Linear(in_features=2, out_features=1, bias=True)
)
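
To make the connection to the formula explicit, here is a sketch that applies a bare `nn.Linear` layer followed by a sigmoid and checks the result against $\sigma(xW^\top + b)$ computed by hand. It is a stand-in for the module above, not the module itself:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
linear = nn.Linear(2, 1)
x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # (batch_size, 2)

# Calling the layer like a function runs its forward pass.
probs = torch.sigmoid(linear(x))

# The same computation written out: sigmoid(x W^T + b).
logits = x @ linear.weight.t() + linear.bias
manual = 1 / (1 + torch.exp(-logits))

print(probs.shape)
print(torch.allclose(probs, manual))
```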


We set a learning rate and select the gradient descent optimizer to train the model. For a super small example, we manually define a training set for the XOR problem.

In [None]:
import torch 
lr_rate = 0.001  # alpha

# training set of input X and labels Y
X = torch.Tensor([[0,0],[0,1], [1,0], [1,1]])
Y = torch.Tensor([0,1,1,0]).view(-1,1) #view is similar to numpy.reshape() here it makes it into a column

# Run the forward pass of the logistic regression model
sample_output = logreg_clf(X) #completely random at the moment
print(X)
print(sample_output)

loss_function = nn.BCELoss() 
# SGD: stochastic gradient descent is used to train/fit the model
optimizer = torch.optim.SGD(logreg_clf.parameters(), lr=lr_rate)

tensor([[0., 0.],
        [0., 1.],
        [1., 0.],
        [1., 1.]])
tensor([[0.5033],
        [0.4767],
        [0.5164],
        [0.4898]], grad_fn=<SigmoidBackward>)
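
`nn.BCELoss` expects probabilities and computes the mean binary cross-entropy, $-\frac{1}{N}\sum_i \left[ y_i \log \hat y_i + (1 - y_i) \log (1 - \hat y_i) \right]$. A sketch that checks this by hand on made-up predictions and labels:

```python
import torch
import torch.nn as nn

y_hat = torch.tensor([[0.9], [0.2], [0.6], [0.3]])  # made-up predicted probabilities
y = torch.tensor([[1.], [0.], [1.], [0.]])          # made-up labels

loss_function = nn.BCELoss()
builtin = loss_function(y_hat, y)

# Mean binary cross-entropy written out explicitly.
manual = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).mean()

print(builtin.item(), manual.item())
print(torch.allclose(builtin, manual))
```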


Now we can train!

Take a moment to study what is happening here. This process will keep coming back. 

In [None]:
import numpy as np 
# from torch.autograd import Variable

#training loop:

epochs = 2001 #how many times we go through the training set
steps = X.size(0) #steps = 4; we have 4 training examples (I know, tiny training set :)

for i in range(epochs):
    for j in range(steps):
        # randomly sample from the training set:
        data_point = np.random.randint(X.size(0))
        # store the retrieved datapoint into 2 separate variables of the right shape
        x_var = X[data_point]
        y_var = Y[data_point]

        # print(x_var.size())
        
        optimizer.zero_grad() # empty (zero) the gradient buffers
        y_hat = logreg_clf(x_var) #get the output from the model

        loss = loss_function(y_hat, y_var) #calculate the loss
        loss.backward() #backprop
        optimizer.step() #does the update

    if i % 500 == 0:
        print("Epoch: {0}, Loss: {1}, ".format(i, loss.item()))

Epoch: 0, Loss: 0.7317752242088318, 
Epoch: 500, Loss: 0.7088961601257324, 
Epoch: 1000, Loss: 0.6454429030418396, 
Epoch: 1500, Loss: 0.7373309135437012, 
Epoch: 2000, Loss: 0.6999746561050415, 


As expected the loss remains high. XOR needs a non-linear model to work well (or, a feature engineering trick: add a third input feature: $x_1 * x_2$). Next week, we'll tackle this properly with neural networks...

Below you can see for yourself how badly it works :)


In [None]:
test = [[0,0],[0,1],[1,1],[1,0]]

for trial in test: 
  Xtest = torch.Tensor(trial)
  y_hat = logreg_clf(Xtest)
  
  if y_hat > 0.5:
    prediction = 1
  else: 
    prediction = 0
    
  print("{0} xor {1} = {2}".format(int(Xtest[0]), int(Xtest[1]), prediction))



0 xor 0 = 1
0 xor 1 = 0
1 xor 1 = 0
1 xor 0 = 1
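
The feature engineering trick mentioned earlier can be sketched as follows: with $x_1 x_2$ appended as a third feature, XOR becomes linearly separable. The learning rate, the number of steps, and the use of full-batch updates are illustrative choices, not the settings used in the cells above:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = torch.tensor([[0.], [1.], [1.], [0.]])

# Append the product x1 * x2 as a third input feature.
X_aug = torch.cat([X, X[:, :1] * X[:, 1:]], dim=1)   # shape (4, 3)

model = nn.Sequential(nn.Linear(3, 1), nn.Sigmoid())
loss_function = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)

for step in range(5000):
    optimizer.zero_grad()
    loss = loss_function(model(X_aug), Y)   # full batch: all 4 points
    loss.backward()
    optimizer.step()

with torch.no_grad():
    predictions = (model(X_aug) > 0.5).int().view(-1)

print("final loss:", loss.item())
print("predictions:", predictions.tolist())
```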


In this week's homework you will experiment with logistic regression on a problem that is actually suited for linear separation... instead of one that requires non-linear transformations. (Don't worry, we'll solve XOR later on with a deeper neural network!)
