<div class="alert alert-block alert-info" style="margin-top: 20px">

      
| Name | Description | Date |
| :- | -------------: | :-: |
| Reza Hashemi | Training and evaluating machine learning models - 2nd PyTorch Datasets | 23 August 2019 |
</div>

# Training and evaluating machine learning models
- Train-test split
- k-fold Cross-Validation

In [0]:
!pip3 install torch torchvision



In [0]:
import numpy as np
import pandas as pd
import torch, torchvision
torch.__version__

'1.1.0'

In [0]:
# use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

## 1. Train-test split
- Splitting train and test data in PyTorch

### Import data
- Import [epileptic seizure data](https://archive.ics.uci.edu/ml/datasets/Epileptic+Seizure+Recognition) from the UCI ML repository
- Split train and test data using ```random_split()```
- Train a logistic regression model on the training data and evaluate it on the test data

In [0]:
class SeizureDataset(torch.utils.data.Dataset):
  def __init__(self):
    # import and initialize dataset
    df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00388/data.csv")
    df = df[df.columns[1:]]
        
    self.X = df[df.columns[:-1]].values
    self.Y = df["y"].astype("category").cat.codes.values.astype(np.int32)
    
  def __getitem__(self, idx):
    # get item by index
    return self.X[idx], self.Y[idx]
  
  def __len__(self):
    # returns length of data
    return len(self.X)

In [0]:
seizuredataset = SeizureDataset()

In [0]:
NUM_INSTANCES = len(seizuredataset)
TEST_RATIO = 0.3
TEST_SIZE = int(NUM_INSTANCES * TEST_RATIO)
TRAIN_SIZE = NUM_INSTANCES - TEST_SIZE

print(NUM_INSTANCES, TRAIN_SIZE, TEST_SIZE)

11500 8050 3450


In [0]:
train_data, test_data = torch.utils.data.random_split(seizuredataset, (TRAIN_SIZE, TEST_SIZE))

print(len(train_data), len(test_data))

8050 3450
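Conceptually, a random split shuffles the instance indices once and cuts the shuffled list into two disjoint parts. The sketch below illustrates that idea with NumPy only. It is not the implementation of ```random_split()```, and the helper name `split_indices` and its `seed` argument are made up for this example.

```python
import numpy as np

def split_indices(num_instances, test_ratio, seed=0):
    """Shuffle indices 0..num_instances-1 and cut them into train and test parts."""
    rng = np.random.RandomState(seed)          # fixed seed makes the split repeatable
    permuted = rng.permutation(num_instances)  # every index appears exactly once
    test_size = int(num_instances * test_ratio)
    train_size = num_instances - test_size
    return permuted[:train_size], permuted[train_size:]

train_idx, test_idx = split_indices(11500, 0.3)
print(len(train_idx), len(test_idx))  # 8050 3450
```

Because both parts come from one permutation, no instance can end up in both the training and the test set.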


In [0]:
# when splitting train and test sets, data loader for each dataset should be made separately
train_loader = torch.utils.data.DataLoader(train_data, batch_size = 64, shuffle = True)
test_loader = torch.utils.data.DataLoader(test_data, batch_size = 64, shuffle = False)

In [0]:
# logistic regression model
model = torch.nn.Linear(178, 5).to(device)
criterion = torch.nn.CrossEntropyLoss()  
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)  
model

Linear(in_features=178, out_features=5, bias=True)
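The model is a single linear layer that outputs raw scores (logits). ```torch.nn.CrossEntropyLoss``` applies log-softmax to those scores before computing the negative log-likelihood, so no softmax layer is needed in the model itself. A minimal NumPy sketch of that computation follows. The helper names `softmax` and `cross_entropy` and the toy values are illustrative only.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean negative log-probability assigned to the true class of each row."""
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return -np.log(picked).mean()

# two toy samples with 5 classes each, matching the 5 outputs of the model
logits = np.array([[2.0, 0.5, -1.0, 0.0, 0.1],
                   [0.0, 0.0, 0.0, 0.0, 0.0]])
labels = np.array([0, 3])

probs = softmax(logits)
loss = cross_entropy(logits, labels)
print(probs.sum(axis=1), loss)
```

With all-zero logits every class gets probability 1/5, so the loss of the second sample alone equals log(5), the value expected from an uninformed 5-class model.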

In [0]:
num_step = len(train_loader)

for epoch in range(100):
  for i, (x, y) in enumerate(train_loader):
    x, y = x.float().to(device), y.long().to(device)
    outputs = model(x)
    
    loss = criterion(outputs, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

  # report the loss of the last mini-batch every 10 epochs (not an epoch average)
  if (epoch + 1) % 10 == 0:
    print("Epoch: {}, Loss: {:.5f}".format(epoch + 1, loss.item()))

Epoch: 10, Loss: 18.91227
Epoch: 20, Loss: 35.44379
Epoch: 30, Loss: 38.28590
Epoch: 40, Loss: 53.74394
Epoch: 50, Loss: 36.99458
Epoch: 60, Loss: 41.47899
Epoch: 70, Loss: 33.08844
Epoch: 80, Loss: 17.66103
Epoch: 90, Loss: 36.11055
Epoch: 100, Loss: 31.70798


In [0]:
y_true, y_pred, y_prob  = [], [], []
with torch.no_grad():
  for x, y in test_loader:
    # ground truth
    y = list(y.numpy())
    y_true += y
    
    x = x.float().to(device)
    outputs = model(x)

    # predicted label (index of the largest logit)
    _, predicted = torch.max(outputs, 1)
    predicted = list(predicted.cpu().numpy())
    y_pred += predicted
    
    # probability for each label (softmax turns the logits into probabilities)
    prob = list(torch.softmax(outputs, dim=1).cpu().numpy())
    y_prob += prob

In [0]:
# calculating overall accuracy
num_correct = 0

for i in range(len(y_true)):
  if y_true[i] == y_pred[i]:
    num_correct += 1

print("Accuracy: ", num_correct/len(y_true))

Accuracy:  0.2353623188405797
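An accuracy of about 0.24 is close to what random guessing would give on five roughly balanced classes (about 0.2), so it is worth checking which classes are confused with which. The sketch below builds a confusion matrix with NumPy and reads the overall accuracy and per-class recall from it. The helper `confusion_matrix` and the toy labels are assumptions for illustration, and the real `y_true` and `y_pred` lists from above could be passed in the same way.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true labels, columns are predicted labels."""
    matrix = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix

# toy labels standing in for the y_true / y_pred lists collected above
y_true_demo = [0, 1, 2, 2, 4, 3]
y_pred_demo = [0, 2, 2, 2, 4, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo, num_classes=5)
accuracy = np.trace(cm) / cm.sum()            # correct predictions sit on the diagonal
per_class_recall = np.diag(cm) / cm.sum(axis=1)
print(cm)
print(accuracy, per_class_recall)
```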


## 2. k-fold Cross-Validation
- Perform k-fold cross validation in PyTorch
- Cross validation can be implemented using NumPy, but we rely on ```skorch``` and ```sklearn``` here for ease of implementation
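For reference, the NumPy route mentioned above could look roughly like the sketch below: shuffle the indices once, cut them into k folds, and use each fold once as the validation set while training on the rest. The generator name `kfold_indices` and its `seed` argument are hypothetical. Note also that for an integer `cv` and a classifier, ```sklearn``` uses stratified folds, which this sketch does not reproduce.

```python
import numpy as np

def kfold_indices(num_instances, k, seed=0):
    """Yield (train_idx, valid_idx) pairs for k-fold cross validation."""
    rng = np.random.RandomState(seed)
    permuted = rng.permutation(num_instances)
    folds = np.array_split(permuted, k)  # k nearly equal, disjoint chunks
    for i in range(k):
        valid_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, valid_idx

splits = list(kfold_indices(11500, 5))
for train_idx, valid_idx in splits:
    print(len(train_idx), len(valid_idx))  # 9200 2300 for each of the 5 folds
```

Each instance is used for validation exactly once and for training k - 1 times, and the k validation scores are then averaged.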

In [0]:
!pip install -U skorch

Collecting skorch
  Downloading https://files.pythonhosted.org/packages/c7/df/1e0be91bf4c91fce5f99cc4edd89d3dfc16930d3fc77588493558036a8d2/skorch-0.6.0-py3-none-any.whl (101kB)
Installing collected packages: skorch
Successfully installed skorch-0.6.0


In [0]:
from skorch import NeuralNetClassifier
from sklearn.model_selection import cross_val_score

In [0]:
# import data
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00388/data.csv")
df = df[df.columns[1:]]

X_data = df[df.columns[:-1]].values.astype(np.float32)
y_data = df["y"].astype("category").cat.codes.values.astype(np.int64)

print(X_data.shape, y_data.shape)

(11500, 178) (11500,)


In [0]:
# wrap the model in skorch's high-level classifier and run 5-fold cross validation with cross_val_score()
# note: `model` is the Linear module already trained in section 1, so the folds likely start from those weights
logistic = NeuralNetClassifier(model, max_epochs = 10, lr = 1e-2)
scores = cross_val_score(logistic, X_data, y_data, cv = 5, scoring = "accuracy")

  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1           nan       0.1859           nan  0.1532
      2           nan       0.2005           nan  0.0956
      3           nan       0.1870           nan  0.0965
      4           nan       0.1810           nan  0.0849
      5           nan       0.1929           nan  0.0832
      6           nan       0.1886           nan  0.0811
      7           nan       0.1848           nan  0.0855
      8           nan       0.1848           nan  0.0813
      9           nan       0.1853           nan  0.0843
     10           nan       0.1853           nan  0.0850
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1           nan       0.2315           nan  0.0807
      2           nan       0.2375           nan  0.0794
      3           nan       0.2261           nan  0.

In [0]:
# print out results
print(scores)
print(scores.mean(), scores.std())

[0.19130435 0.23478261 0.22304348 0.23478261 0.20478261]
0.2177391304347826 0.017180020928177813
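The `nan` values in the `train_loss` and `valid_loss` columns mean the cross-validation scores above should be read with caution. A likely cause, to our understanding of ```skorch```, is that `NeuralNetClassifier` defaults to a negative log-likelihood criterion and takes the logarithm of the module output, which therefore has to be probabilities. The linear model here returns raw logits, some of them negative. The NumPy sketch below only demonstrates that numerical effect on made-up values and does not call ```skorch```.

```python
import numpy as np

# made-up logits for one sample with 5 classes; some entries are negative
logits = np.array([[1.5, -2.0, 0.3, -0.7, 0.1]])

# taking the log of raw logits yields nan wherever a logit is negative
with np.errstate(invalid="ignore"):
    log_of_logits = np.log(logits)

# converting to probabilities first (softmax) keeps every log finite
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
log_of_probs = np.log(probs)

print(log_of_logits)
print(log_of_probs)
```

One option, stated as an assumption and not run here, is to pass `criterion=torch.nn.CrossEntropyLoss` to `NeuralNetClassifier` so that the loss works on logits directly. Another is to pass the module class with fresh parameters (for example `module=torch.nn.Linear, module__in_features=178, module__out_features=5`) instead of the already trained `model`.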
