# Classroom 3 - Basic machine learning with ```PyTorch```

The first thing we need to do for this workshop is to install both ```pytorch``` and ```scikit-learn```, along with some other packages we need for this week.

```
pip install --upgrade pip
pip install torch scikit-learn matplotlib pandas
```

__Load packages__

In [1]:
# system tools
import os

# pytorch
import torch
from torch import nn

# pandas
import pandas as pd

# scikit-learn
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# matplotlib
import matplotlib.pyplot as plt

__Creating a tensor__

In [None]:
x_tensor = torch.tensor([[1., -1.], 
                         [1., -1.]])
print(type(x_tensor))

In [None]:
print(x_tensor)

__Tensor to numpy array__

In [None]:
# tensor to numpy
x_array = x_tensor.numpy()
print(type(x_array))

__And back again__

In [None]:
# numpy to tensor
x_tensor2 = torch.tensor(x_array)
print(type(x_tensor2))

In [None]:
# check for element-wise equality
print(x_tensor2 == x_tensor)
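
One detail worth knowing about these conversions: calling ```.numpy()``` on a CPU tensor returns an array that shares memory with the tensor, whereas ```torch.tensor()``` makes a copy. The short sketch below illustrates this; the variable names are made up for this example.

```python
import torch

a = torch.tensor([1., 2., 3.])

# .numpy() shares memory with the CPU tensor
shared = a.numpy()
# torch.tensor() copies the data into a new tensor
copied = torch.tensor(shared)

# change the original tensor in place
a[0] = 100.

print(shared[0])  # the numpy array sees the change
print(copied[0])  # the copy does not
```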

## Finding the minimum of a polynomial

We begin here by creating an initial value for ```x``` and defining the function ```y```.

The goal is to find the _minimum_ value of y, i.e. in this case the turning point of the function.


In [None]:
x = torch.tensor([3.], 
                 requires_grad=True)

In [None]:
y = x**2 - 3*x + 2
print(y)

__Create SGD optimizer__

In [None]:
optimizer = torch.optim.SGD([x],     # starting value
                            lr=0.01) # learning rate


__Calculate the gradient__

We first run a _backward pass_, which computes the gradient of the function ```y``` with respect to ```x``` at the current value of ```x```.

In [None]:
y.backward()

In [None]:
print(x.grad) # examine
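
For this polynomial we can check the result by hand: the derivative of x² - 3x + 2 is 2x - 3, which equals 3 at x = 3. A minimal, self-contained sketch of that check (the variable names are for illustration only):

```python
import torch

x_check = torch.tensor([3.], requires_grad=True)
y_check = x_check**2 - 3*x_check + 2

# backward pass fills in x_check.grad with dy/dx at x = 3
y_check.backward()

# derivative worked out by hand: dy/dx = 2x - 3
analytic = 2 * x_check.detach() - 3

print(x_check.grad, analytic)
```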

__Make a step in the right direction__

In [None]:
# step in the direction to minimize y
optimizer.step()

In [None]:
# reset the gradient (PyTorch accumulates gradients across backward passes, so this is required)
optimizer.zero_grad()

In [None]:
# we see that x has improved (the minimum is at x = 1.5, so we are moving in the right direction)
print(x)
# the gradient has been reset (prints None in recent PyTorch versions, a zero tensor in older ones)
print(x.grad)
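
The reason for resetting the gradient is that every backward pass adds to whatever is already stored in ```.grad```. A small sketch of what happens if we skip the reset (names are illustrative):

```python
import torch

x_acc = torch.tensor([3.], requires_grad=True)

# first backward pass: gradient is 2*3 - 3 = 3
y_first = x_acc**2 - 3*x_acc + 2
y_first.backward()
grad_after_first = x_acc.grad.clone()

# second backward pass without resetting: the new gradient is added on top
y_second = x_acc**2 - 3*x_acc + 2
y_second.backward()
grad_after_second = x_acc.grad.clone()

print(grad_after_first, grad_after_second)
```

The second value is twice the first, even though ```x_acc``` never moved, so a step taken with it would be too large.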

__Run this for 1000 steps__

In [None]:
for i in range(1000):
    #print(x)

    # forward pass / or just calculate the outcome
    y = x**2 - 3*x + 2

    # backward pass on the thing we want to minimize
    y.backward()

    # take a step in the "minimize direction"
    optimizer.step()

    # zero the gradient
    optimizer.zero_grad()

__Print the local minimum__

What we see is that gradient descent, starting from our chosen value of ```x```, converges on the minimum of the function at x = 1.5.

In [None]:
print(x)
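
Each step of plain SGD (without momentum) amounts to the update x ← x - lr · dy/dx. As a rough sketch, the same loop can be written without ```pytorch``` at all, using the hand-derived gradient 2x - 3. The names below are illustrative and not part of the notebook.

```python
# plain-Python gradient descent on y = x**2 - 3*x + 2
x_val = 3.0   # same starting value as above
lr = 0.01     # same learning rate as above

for _ in range(1000):
    grad = 2 * x_val - 3       # dy/dx worked out by hand
    x_val = x_val - lr * grad  # step against the gradient

y_val = x_val**2 - 3 * x_val + 2
print(x_val, y_val)
```

This only works because we could derive the gradient ourselves; the point of the backward pass is that ```pytorch``` does that part for us.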

### Bonus task

- Try and define some functions of your own and see if you can find the minimum. (There are tools online where you can check what the actual minimum is, to see if the algorithm gets it right!)

## Linear regression

The same general procedure can be used when performing linear regression on data points. 

In this example, we're using ```scikit-learn``` to artificially generate some data points for us.

In [None]:
X_numpy, y_numpy = datasets.make_regression(n_samples=100,    # number of individual data points
                                            n_features=1,     # each data point represents a single feature
                                            noise=20,         # technically, SD of gaussian noise applied to the output
                                            random_state=4)   # a random state for reproducibility

__Plot the data__

Note that here we're using ```matplotlib``` the lazy way, instead of explicitly defining ```fig, ax```. This is fine for experimental notebooks, but don't do it in your codebases!

In [None]:
# plot the sample
plt.plot(X_numpy, y_numpy, 'ro')
plt.show()

__Convert data to tensors__

In [None]:
# cast to float Tensor
X = torch.tensor(X_numpy, dtype=torch.float)
y = torch.tensor(y_numpy, dtype=torch.float)

__Check the shapes__

In [None]:
print(X.shape)
print(y.shape)

__Reshape ```y```__

In [None]:
y = y.view(y.shape[0], 1) # view is similar to reshape; here it sets the shape to (100, 1)
print(y.shape)
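
The reshape matters because the model's predictions will have shape (100, 1). If the targets kept their shape of (100,), element-wise operations between the two would broadcast to (100, 100) instead of comparing matching pairs. A small illustration with placeholder tensors:

```python
import torch

pred = torch.zeros(100, 1)      # stand-in for model output
target_flat = torch.zeros(100)  # targets before reshaping

# mismatched shapes broadcast to a 100 x 100 matrix
diff_wrong = pred - target_flat

# after reshaping, each prediction lines up with one target
target_col = target_flat.view(-1, 1)
diff_right = pred - target_col

print(diff_wrong.shape, diff_right.shape)
```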

__Check datatypes__

In [None]:
print(y.dtype)
print(X.dtype)

__Get number of samples and features__

We'll use this information below when setting the input size of the model.

In [None]:
n_samples, n_features = X.shape

__Initialize a linear model__

In [None]:
# Linear model f = wx + b
input_size = n_features 
output_size = 1

# create a weight and a bias (beta and intercept), initialized randomly
model = nn.Linear(input_size, output_size)

__Set learning rate, check parameters__

In [None]:
learning_rate = 0.01 # feel free to change this
print(list(model.parameters())) # only two parameters: a beta (weight) and an intercept (bias)

__Define a loss function and an optimization algorithm__

In [None]:
criterion = nn.MSELoss()

In [None]:
optimizer = torch.optim.SGD(model.parameters(), # parameters to optimize
                            lr=learning_rate    # the speed at which we optimize them / how fast the model learns (think step size)
                            )

__Run for 100 epochs__

In [None]:
epochs = 100
for epoch in range(epochs):
    # Forward pass / calc predicted y
    # a + b*X
    y_predicted = model(X)
    
    # calculate loss / MSE
    loss = criterion(y_predicted, y)
    
    # Backward pass / gradient and update
    loss.backward()
    optimizer.step()

    # zero grad before new step
    optimizer.zero_grad()

    # some print to see that it is running
    if (epoch+1) % 10 == 0:
        print(f'epoch: {epoch+1}, loss = {loss.item():.4f}')
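
For a linear model with MSE loss there is also a closed-form least-squares solution, which gives a handy sanity check for the training loop. The sketch below is self-contained and makes its own assumptions: the synthetic data (y = 3x + 2 plus noise), the larger learning rate, and the number of epochs are chosen for this example and are not the notebook's values.

```python
import numpy as np
import torch
from torch import nn

torch.manual_seed(0)
rng = np.random.default_rng(0)

# synthetic data (an assumption for this sketch): y = 3x + 2 plus small noise
x_np = rng.normal(size=(100, 1)).astype(np.float32)
y_np = 3 * x_np + 2 + 0.1 * rng.normal(size=(100, 1)).astype(np.float32)

# closed-form least squares: columns are [x, 1] for slope and intercept
A = np.hstack([x_np, np.ones_like(x_np)])
solution, *_ = np.linalg.lstsq(A, y_np, rcond=None)
slope_ls, intercept_ls = solution[0, 0], solution[1, 0]

# same fit by gradient descent
X_t, y_t = torch.from_numpy(x_np), torch.from_numpy(y_np)
lin = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(lin.parameters(), lr=0.1)
for _ in range(500):
    loss = loss_fn(lin(X_t), y_t)
    loss.backward()
    opt.step()
    opt.zero_grad()

print(slope_ls, intercept_ls)
print(lin.weight.item(), lin.bias.item())
```

If the two pairs of numbers are far apart, the loop probably needs more epochs or a different learning rate.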

__Get predicted values__

In [None]:
# get predictions as a numpy array (detach from the computation graph first)
predicted = model(X).detach().numpy()

__Plot results__

In [None]:
plt.plot(X_numpy, y_numpy, 'ro')
plt.plot(X_numpy, predicted, 'b')
plt.show()

## Logistic Regression Classifier with text data

So far we haven't actually looked at any text data! 

In the following section, we're going to use some real-world text data in a binary classification problem. We're going to use the document vectorization techniques we saw in the lecture, and see how to build a Logistic Regression classifier with ```pytorch```.

In [None]:
# NOTE: add the path components to your data file here before running
filepath = os.path.join()

In [None]:
data = pd.read_csv(filepath)

__Creating train/test splits__

A common practice when building ML/DL models is to use explicitly defined subsets of data for different tasks - [training vs testing](https://upload.wikimedia.org/wikipedia/commons/b/bb/ML_dataset_training_validation_test_sets.png), for example. This is slightly different from how we work when doing statistical modelling (in most cases).

```scikit-learn``` has a simple tool that allows us to quickly split our dataset.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data["text"], data["label"], 
                                                    test_size=0.2, 
                                                    random_state=42)
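
To get a feel for what the split does, here is a toy sketch on ten made-up documents (the data and variable names are invented for illustration). With ```test_size=0.2```, two of the ten items end up in the test set.

```python
from sklearn.model_selection import train_test_split

texts = [f"document {i}" for i in range(10)]  # made-up documents
labels = [i % 2 for i in range(10)]           # made-up binary labels

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

print(len(train_texts), len(test_texts))
```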

__Creating a document vectorizer__

There are a lot of different parameters here that we're not going to look at but please do [check them out in the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

The exact same approach can be applied using TfidfVectorizer() instead of CountVectorizer() - [give it a try](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)!

__Initialize vectorizer__

In [None]:
vectorizer = CountVectorizer()
# vectorizer = TfidfVectorizer()

__Fit to the training data__

In [None]:
# vectorized training data
X_train_vect = vectorizer.fit_transform(X_train)

# vectorized test data
X_test_vect = vectorizer.transform(X_test)
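
Note the difference between the two calls: ```fit_transform``` learns the vocabulary from the training data, while ```transform``` only reuses it. A toy sketch (the corpus is made up) shows that a word never seen in training is simply ignored at test time:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["the cat sat", "the dog barked"]  # toy training corpus
test_docs = ["the bird sat"]                    # "bird" never appears in training

toy_vectorizer = CountVectorizer()
train_counts = toy_vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_counts = toy_vectorizer.transform(test_docs)        # reuses it unchanged

print(sorted(toy_vectorizer.vocabulary_))
print(test_counts.toarray())
```

This is also why the vectorizer is fitted on the training split only: the test documents should not influence which features exist.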

__Convert to tensors__

In [None]:
# vectorized training data
X_train_vect = torch.tensor(X_train_vect.toarray(), dtype=torch.float)

# vectorized test data
X_test_vect = torch.tensor(X_test_vect.toarray(), dtype=torch.float)

__Convert labels__

In [None]:
# training labels
y_train = torch.tensor(list(y_train), dtype=torch.float)
# test labels
y_test = torch.tensor(list(y_test), dtype=torch.float)

In [None]:
y_train = y_train.view(y_train.shape[0], 1)
y_test = y_test.view(y_test.shape[0], 1)

__Initialization parameters for Logistic Regression__

In [None]:
n_samples, n_features = X_train_vect.shape
input_size = n_features 
output_size = 1

__Creating the model__

Notice here that we are still using a Linear layer, but this time we pass its output through a sigmoid and use a different loss function - [Binary Cross Entropy loss](https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html).

In [None]:
# create weights and a bias (betas and intercept), initialized randomly
model = nn.Linear(input_size, output_size)
learning_rate = 0.01 # feel free to change this

In [None]:
print(list(model.parameters()))

In [None]:
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), # parameters to optimize
                            lr=learning_rate    # the speed at which we optimize them / how fast the model learns (think step size)
                            )
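
Keep in mind that ```nn.BCELoss``` expects probabilities between 0 and 1, which is why the linear output has to go through a sigmoid first. As an aside, ```pytorch``` also offers ```nn.BCEWithLogitsLoss```, which applies the sigmoid internally. It is not what this notebook uses, but the sketch below (with made-up numbers) suggests the two routes give the same value.

```python
import torch
from torch import nn

logits = torch.tensor([[2.0], [-1.0], [0.5]])   # made-up raw linear outputs
targets = torch.tensor([[1.0], [0.0], [1.0]])   # made-up labels

# option 1: sigmoid first, then BCELoss on the probabilities
loss_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

# option 2: BCEWithLogitsLoss takes the raw outputs directly
loss_logits = nn.BCEWithLogitsLoss()(logits, targets)

print(loss_probs.item(), loss_logits.item())
```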

__Run the model for 100 epochs__

In [None]:
epochs = 100
# sigmoid squashes the linear output into a probability between 0 and 1
m = nn.Sigmoid()
for epoch in range(epochs):
    # Forward pass / calc predicted y
    # sigmoid(a + b*X)
    y_predicted = m(model(X_train_vect))

    # calculate loss / binary cross entropy
    # (no rounding here: round() has zero gradient, so the model would not learn)
    loss = criterion(y_predicted, y_train)

    # Backward pass / gradient and update
    loss.backward()
    optimizer.step()

    # zero grad before new step
    optimizer.zero_grad()

    # some print to see that it is running
    if (epoch+1) % 10 == 0:
        print(f'epoch: {epoch+1}, loss = {loss.item():.4f}')

__Check performance against test data__

We use ```torch.no_grad()``` here to switch off gradient tracking, since we don't need gradients when we are only making predictions (inference).

In [None]:
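with torch.no_grad():
    # apply the sigmoid to turn the linear output into probabilities
    y_pred = torch.sigmoid(model(X_test_vect))
    # round to get class labels (0 or 1)
    y_pred_class = y_pred.round()
    # percentage of correct predictions
    correct = (y_pred_class == y_test).sum()
    print((correct / y_test.shape[0]) * 100)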
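
To see what ```torch.no_grad()``` does in practice, the small sketch below runs the same forward pass with and without it on a toy model (the model and input are placeholders, not the classifier above).

```python
import torch
from torch import nn

toy_model = nn.Linear(3, 1)
toy_input = torch.ones(2, 3)

# normal forward pass: the output is attached to the computation graph
out_tracked = toy_model(toy_input)

# inside no_grad: no graph is built, so the output carries no gradient info
with torch.no_grad():
    out_untracked = toy_model(toy_input)

print(out_tracked.requires_grad, out_untracked.requires_grad)
```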

### Bonus tasks

- Can you write your own version of ```CountVectorizer()```? In other words, a function that takes a corpus of documents and creates a bag-of-words representation for every document?
- What about ```TfidfVectorizer()```? Make sure to look over the formulae in the slides from Wednesday, and also the Jurafsky and Martin book.