## Preparing Data for Deep Learning with PyTorch

This tutorial covers the steps you need to prepare data for training a neural network.

## First step: imports

You will need to import torch (for general tensor functionality), and then a torch-specific package (nn) that specifically handles neural networks. You will also need to import numpy, pandas, scikit-learn, and random (for shuffling things).

In [4]:
import torch
import torch.nn as nn
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
import random

## Our dataset
For this tutorial, we will use the sklearn breast cancer dataset, which contains 30 features measured for 569 samples.

Our goal is to use a neural network to predict from this data whether a tumor is benign or malignant (a precision medicine application!).

## Loading the data
Our first step is to load the dataset from sklearn. We are using sklearn's built-in dataset here so you do not have to deal with downloading and using large data files (yet).
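
Before splitting the dataset into data and labels, it can help to look at what the loaded object contains. The sketch below is one way to do that; the variable name `bunch` is only an illustrative choice so that it does not clash with the `data` variable used later in this tutorial.

```python
from sklearn.datasets import load_breast_cancer

# `bunch` is a hypothetical name picked for this sketch only
bunch = load_breast_cancer()

print(bunch.data.shape)          # (number of samples, number of features)
print(bunch.target.shape)        # one label per sample
print(bunch.feature_names[:3])   # names of the first few measured columns
print(bunch.target_names)        # class name for label 0, then for label 1
```

The position of each name in `target_names` tells you which class an integer label stands for, which is worth checking before interpreting any predictions.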

### Labels in PyTorch
Class labels for PyTorch classification losses (such as nn.CrossEntropyLoss) must be integers that start at zero and increase sequentially (e.g. [0,1,2]), so always check that this is the case! The breast cancer labels are already 0 and 1, as the output below shows. If your labels are not in this form, remap them first.
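
If your labels arrive as strings or as non-consecutive numbers, one possible way to remap them is with `np.unique`. This is a minimal sketch on made-up labels (`raw_labels` is hypothetical and not part of the breast cancer dataset), not the only way to do it.

```python
import numpy as np

# Hypothetical raw labels that are NOT 0-based consecutive integers
raw_labels = np.array(["benign", "malignant", "benign", "unknown", "malignant"])

# np.unique returns the sorted distinct values; with return_inverse=True it
# also returns, for every element, the index of its value in that sorted array
classes, encoded = np.unique(raw_labels, return_inverse=True)

print(classes)   # the distinct class names, sorted
print(encoded)   # integer labels starting at 0
```

Keep `classes` around: `classes[k]` gives back the original name for the integer label `k`.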

In [5]:

#### load the data object
data = load_breast_cancer()

#### break it down into data and labels (for each index)
labels = list(data.target)
data = data.data

print(data)
print(labels)

[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
 ...
 [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
 [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
 [7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 

## Scaling
We know that in data science, scaling is very important: features measured on very different scales can make training slow or unstable. The widely used ReLU activation also outputs zero for negative values, so we will keep every feature non-negative by normalizing the data with sklearn's MinMaxScaler (0-1 range).

In [6]:
scaler = MinMaxScaler()

scaled_data = scaler.fit_transform(data)

print(scaled_data)

[[0.52103744 0.0226581  0.54598853 ... 0.91202749 0.59846245 0.41886396]
 [0.64314449 0.27257355 0.61578329 ... 0.63917526 0.23358959 0.22287813]
 [0.60149557 0.3902604  0.59574321 ... 0.83505155 0.40370589 0.21343303]
 ...
 [0.45525108 0.62123774 0.44578813 ... 0.48728522 0.12872068 0.1519087 ]
 [0.64456434 0.66351031 0.66553797 ... 0.91065292 0.49714173 0.45231536]
 [0.03686876 0.50152181 0.02853984 ... 0.         0.25744136 0.10068215]]
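
To see what this scaling does, here is a small sketch that applies the min-max formula by hand, column by column: $x' = (x - x_{min}) / (x_{max} - x_{min})$. With its default 0-1 range, MinMaxScaler computes the same thing per feature. The matrix `X` is made up for illustration, and the sketch assumes no column is constant (otherwise the denominator would be zero).

```python
import numpy as np

# Made-up matrix: 3 samples (rows), 2 features (columns)
X = np.array([[1.0, 10.0],
              [2.0, 30.0],
              [4.0, 20.0]])

col_min = X.min(axis=0)   # smallest value in each column
col_max = X.max(axis=0)   # largest value in each column

# Subtract the column minimum, then divide by the column range
X_scaled = (X - col_min) / (col_max - col_min)

print(X_scaled)
```

After scaling, every column has a minimum of 0 and a maximum of 1, regardless of its original units.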


## Batching
We learned today that training on small batches, rather than on the whole dataset at once, can help reduce overfitting, so here is a Python function that lets us make batches of whatever size we prefer.
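
As a quick sanity check on the arithmetic, we can work out how many batches to expect for this dataset. The sketch below assumes 569 samples and a batch size of 16, and the variable names are illustrative only.

```python
import math

n_samples = 569    # number of samples in the breast cancer dataset
batch_size = 16    # batch size assumed for this check

# Number of complete batches, and how many samples are left over
full_batches, leftover = divmod(n_samples, batch_size)

# The leftover samples form one extra, smaller batch
total_batches = math.ceil(n_samples / batch_size)

print(full_batches, leftover, total_batches)
```

So we expect 35 full batches of 16 plus one final batch of 9, for 36 batches in total.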

In [7]:
def batchify(data,labels,batch_size=16):
    
    batches= []
    label_batches = []

    ''' We will move through the data, 16 (or some other number) points at a time, until 
    there are no more 16 point chunks left. We will add these subarrays of 16 into a new list as their
    own batch'''
    
    for n in range(0,len(data),batch_size):
        # use <= so the last full batch is kept when len(data) divides evenly by batch_size
        if n+batch_size <= len(data):
            batches.append(data[n:n+batch_size])
            label_batches.append(labels[n:n+batch_size])

    ''' If the data does not have a number of points divisible by 16, then we will add on one more smaller 
    batch to make sure we get all the data!'''
    if len(data)%batch_size > 0:
        batches.append(data[len(data)-(len(data)%batch_size):len(data)])
        label_batches.append(labels[len(data)-(len(data)%batch_size):len(data)])
        
    return batches,label_batches
        
    
#### if your data has any order to it that you don't want to affect your analysis, shuffle it first like so:
temp = list(zip(scaled_data,labels)) 
random.shuffle(temp) 
data,labels = zip(*temp)

batches,label_batches = batchify(data,labels)

print(batches)
print(label_batches)

[array([[0.52103744, 0.0226581 , 0.54598853, 0.36373277, 0.59375282,
        0.7920373 , 0.70313964, 0.73111332, 0.68636364, 0.60551811,
        0.35614702, 0.12046941, 0.3690336 , 0.27381126, 0.15929565,
        0.35139844, 0.13568182, 0.30062512, 0.31164518, 0.18304244,
        0.62077552, 0.14152452, 0.66831017, 0.45069799, 0.60113584,
        0.61929156, 0.56861022, 0.91202749, 0.59846245, 0.41886396],
       [0.64314449, 0.27257355, 0.61578329, 0.50159067, 0.28987993,
        0.18176799, 0.20360825, 0.34875746, 0.37979798, 0.14132266,
        0.15643672, 0.08258929, 0.12444047, 0.12565979, 0.11938675,
        0.08132304, 0.0469697 , 0.25383595, 0.08453875, 0.0911101 ,
        0.60690146, 0.30357143, 0.53981772, 0.43521431, 0.34755332,
        0.15456336, 0.19297125, 0.63917526, 0.23358959, 0.22287813],
       [0.60149557, 0.3902604 , 0.59574321, 0.44941676, 0.51430893,
        0.4310165 , 0.46251172, 0.63568588, 0.50959596, 0.21124684,
        0.22962158, 0.09430251, 0.18037035, 0
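
PyTorch also ships its own batching utilities, which can stand in for a hand-written function like `batchify`. The sketch below uses `TensorDataset` and `DataLoader` from `torch.utils.data`. The arrays `features` and `targets` are random stand-ins with the same shapes as this tutorial's data (an assumption made so the example runs on its own); in the notebook you would pass the scaled data and labels instead.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# Random stand-ins shaped like the tutorial data: 569 samples, 30 features
rng = np.random.default_rng(0)
features = rng.random((569, 30))
targets = rng.integers(0, 2, size=569)

# Inputs become float tensors, class labels become integer (long) tensors
X = torch.tensor(features, dtype=torch.float32)
y = torch.tensor(targets, dtype=torch.long)

dataset = TensorDataset(X, y)

# shuffle=True reshuffles the samples every time the loader is iterated
loader = DataLoader(dataset, batch_size=16, shuffle=True)

# Look at the first batch only
for batch_X, batch_y in loader:
    print(batch_X.shape, batch_y.shape)
    break
```

With this approach the shuffling and the smaller final batch are handled by the loader, and each batch is already a tensor.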

## Next Steps
Now the data is pretty much ready to be used in a neural network!
First, ask me any questions you have.
Then, move on to the next tutorial!