In this episode, we'll demonstrate how to process numerical data that we'll later use to train our very first artificial neural network. 


## Samples and Labels

To train any neural network in a supervised learning task, we first need a data set of samples and the corresponding labels for those samples.

When we refer to samples, we mean the underlying data set, where each individual item or data point within that set is called a sample. A label is the known, correct output associated with a given sample.

**Note that in deep learning, samples are also commonly referred to as input data or inputs, and labels are also commonly referred to as target data or targets.**

### Expected data format

When preparing data, we first need to understand the format that the data need to be in for the end goal we have in mind. In our case, we want our data to be in a format that we can pass to a neural network model.

The first model we'll build in an upcoming episode will be a **Sequential model** from the Keras API integrated within TensorFlow.

The Sequential model receives data during training, which occurs when we call the ***fit()*** function on the model.

[Documentation of fit() function](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential#fit)

In the ***fit()*** function, **x** is the input data and **y** is the label data for that input, provided in the same format or data structure as **x**.
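
To make the pairing of **x** and **y** concrete, here is a minimal sketch using plain NumPy arrays. The toy values and the names `x` and `y` are illustrative assumptions, not data from this episode. The point is only that sample `x[i]` is matched with label `y[i]`, so both arrays need the same number of entries.

```python
import numpy as np

# Hypothetical toy data: four ages (inputs) and their binary labels (targets).
x = np.array([[15], [42], [70], [88]])  # one row per sample, one column per feature
y = np.array([0, 0, 1, 1])              # one label per sample

# Sample x[i] is paired with label y[i], so the first dimensions must agree.
assert x.shape[0] == y.shape[0]

print(x.shape)  # (4, 1)
print(y.shape)  # (4,)
```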

## Process data in code

We'll start out with a very simple classification task using a simple numerical data set.

We first need to import the libraries we'll be working with. 

In [12]:
import numpy as np
from random import randint
from sklearn.utils import shuffle
from sklearn.preprocessing import MinMaxScaler

Next, we create two empty lists. One will hold the **input data**, the other will hold the **target data or labels**. 

In [13]:
train_labels = []
train_samples = []

### Data Creation

For this simple task, we'll be creating our own example data set.

As motivation for this data, let's suppose that an experimental drug was tested on individuals ranging from age 13 to 100 in a clinical trial. The trial had **2100** participants. Half of the participants were under 65 years old, and the other half were 65 years of age or older.

The trial showed that around 95% of patients 65 or older experienced side effects from the drug, and around 95% of patients under 65 experienced no side effects, generally showing that elderly individuals were more likely to experience side effects.

Ultimately, we want to build a model to tell us whether or not a patient will experience side effects solely based on the patient's age. The judgement of the model will be based on the training data.

**Labels:**
- 1: patient did experience side effects
- 0: patient didn't experience side effects

In [14]:
for i in range(50):
    # The ~5% of younger individuals who did experience side effects
    random_younger = randint(13,64)
    train_samples.append(random_younger)
    train_labels.append(1)

    # The ~5% of older individuals who did not experience side effects
    random_older = randint(65,100)
    train_samples.append(random_older)
    train_labels.append(0)

for i in range(1000):
    # The ~95% of younger individuals who did not experience side effects
    random_younger = randint(13,64)
    train_samples.append(random_younger)
    train_labels.append(0)

    # The ~95% of older individuals who did experience side effects
    random_older = randint(65,100)
    train_samples.append(random_older)
    train_labels.append(1)
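
As a sanity check on the loops above, the following sketch rebuilds the same kind of data set inside a function and counts how many patients in each age group are labeled as having side effects. The function name `make_trial_data`, its parameters, and the seeded generator are assumptions added for illustration. The counts do not depend on the random ages drawn, only on the loop sizes.

```python
from random import Random

def make_trial_data(n_minority=50, n_majority=1000, seed=0):
    """Hypothetical helper that mirrors the data creation loops."""
    rng = Random(seed)  # seeded so repeated runs draw the same ages
    samples, labels = [], []
    for _ in range(n_minority):
        samples.append(rng.randint(13, 64))   # younger, side effects
        labels.append(1)
        samples.append(rng.randint(65, 100))  # older, no side effects
        labels.append(0)
    for _ in range(n_majority):
        samples.append(rng.randint(13, 64))   # younger, no side effects
        labels.append(0)
        samples.append(rng.randint(65, 100))  # older, side effects
        labels.append(1)
    return samples, labels

samples, labels = make_trial_data()

younger_with_effects = sum(1 for age, lab in zip(samples, labels) if age < 65 and lab == 1)
older_with_effects = sum(1 for age, lab in zip(samples, labels) if age >= 65 and lab == 1)

print(len(samples))                 # 2100 participants in total
print(younger_with_effects / 1050)  # 50 / 1050, roughly 5% of the younger group
print(older_with_effects / 1050)    # 1000 / 1050, roughly 95% of the older group
```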

This is what the train_samples data looks like.

In [15]:
for i in train_samples:
    print(i)

21
82
53
77
35
91
45
95
64
71
31
68
57
90
47
69
29
79
39
66
58
66
59
89
31
77
47
93
64
72
45
85
38
89
42
90
59
79
52
88
20
93
28
65
63
75
21
93
44
65
40
97
18
78
57
65
63
67
37
66
46
67
63
99
35
78
28
84
25
99
17
74
32
82
29
83
35
65
54
81
48
75
21
78
23
84
26
86
48
81
23
97
17
73
44
91
14
95
57
72
55
96
51
68
64
86
45
88
64
82
54
92
18
88
64
100
19
88
25
96
62
80
23
72
40
73
32
78
23
75
48
73
48
92
26
92
18
92
56
89
44
68
57
70
38
98
17
77
24
85
59
72
55
83
46
70
30
88
52
87
44
93
23
67
23
66
44
79
63
90
62
79
38
93
38
98
40
82
52
66
30
78
59
77
47
91
29
71
36
65
20
79
23
67
36
79
20
92
30
98
38
91
17
65
17
78
40
74
50
93
47
94
49
79
40
88
42
89
17
78
64
82
61
72
22
76
22
68
50
74
14
69
64
74
52
70
22
84
62
67
39
69
30
86
22
82
61
83
18
67
46
77
53
88
17
86
13
73
22
98
35
98
21
66
56
85
45
75
47
99
62
87
38
78
39
90
26
100
61
86
22
77
42
89
60
66
34
99
64
97
49
91
57
98
25
76
57
69
60
100
48
73
27
69
29
98
41
78
41
85
33
99
43
94
38
96
22
79
62
91
13
80
63
83
41
73
23
70
41
97
38
75
...

This is what the train_labels look like.

In [16]:
for i in train_labels:
    print(i)

1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1


### Data Processing

We now convert both lists into numpy arrays, since that is the format the fit() function expects, as discussed earlier. We then shuffle the arrays to remove any order that was imposed on the data during the creation process.

In [17]:
train_labels = np.array(train_labels)
train_samples = np.array(train_samples)
train_labels, train_samples = shuffle(train_labels, train_samples)
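
The important property of this shuffle is that both arrays are reordered in the same way, so each sample keeps its label. The sketch below illustrates that idea with a plain NumPy permutation rather than scikit-learn's `shuffle`. It is a stand-in that shows the principle, and the toy arrays and variable names are assumptions.

```python
import numpy as np

# Hypothetical toy data: ages and their labels.
samples = np.array([15, 42, 70, 88, 93])
labels = np.array([0, 0, 1, 1, 1])
pairs_before = set(zip(samples.tolist(), labels.tolist()))

# Draw one random ordering and apply it to BOTH arrays.
rng = np.random.default_rng(seed=0)
order = rng.permutation(len(samples))
samples_shuffled = samples[order]
labels_shuffled = labels[order]

# Every (sample, label) pair survives the shuffle intact.
pairs_after = set(zip(samples_shuffled.tolist(), labels_shuffled.tolist()))
assert pairs_before == pairs_after
```

Shuffling each array independently would break this pairing, which is why a single shared ordering is used.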

In this form, the data is in the required format, so we could already pass it to the model. Before doing that, however, we'll first scale the data down to a range from 0 to 1.

We'll use **scikit-learn's MinMaxScaler class** to rescale all of the data from its original range of 13 to 100 to a range of 0 to 1.

We reshape the data purely as a technical requirement, since the **fit_transform()** function doesn't accept 1D data by default.

In [18]:
scaler = MinMaxScaler(feature_range=(0,1))
scaled_train_samples = scaler.fit_transform(train_samples.reshape(-1,1))
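
For a target range of 0 to 1, min-max scaling maps each value `x` to `(x - min) / (max - min)`, where `min` and `max` are taken from the data the scaler was fitted on. The sketch below applies that formula by hand with NumPy to three example ages. The array `ages` is an assumption chosen so that the minimum is 13 and the maximum is 100.

```python
import numpy as np

# Hypothetical example ages covering the full range of the trial.
ages = np.array([13, 58, 100], dtype=float)

# Min-max scaling to the range [0, 1], computed by hand.
scaled = (ages - ages.min()) / (ages.max() - ages.min())

# 13 maps to 0.0, 100 maps to 1.0, and 58 maps to 45/87, roughly 0.517.
print(scaled)
```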

Now that the data has been scaled, let's iterate over the scaled data to see what it looks like now. 

In [19]:
for i in scaled_train_samples:
    print(i)

[0.51724138]
[0.97701149]
[1.]
[0.25287356]
[0.18390805]
[0.71264368]
[0.31034483]
[0.79310345]
[0.73563218]
[0.88505747]
[0.87356322]
[0.01149425]
[0.81609195]
[0.73563218]
[0.65517241]
[0.94252874]
[0.04597701]
[0.35632184]
[0.20689655]
[0.52873563]
[0.82758621]
[0.52873563]
[0.28735632]
[0.70114943]
[0.79310345]
[0.29885057]
[0.3908046]
[0.31034483]
[0.95402299]
[0.51724138]
[0.55172414]
[0.66666667]
[0.01149425]
[0.82758621]
[0.91954023]
[0.18390805]
[0.75862069]
[0.3908046]
[0.1954023]
[0.04597701]
[0.91954023]
[0.68965517]
[0.94252874]
[0.93103448]
[0.67816092]
[0.75862069]
[0.68965517]
[0.45977011]
[0.86206897]
[0.1954023]
[0.33333333]
[0.10344828]
[0.27586207]
[0.37931034]
[0.97701149]
[0.73563218]
[0.85057471]
[0.25287356]
[0.87356322]
[0.93103448]
[0.56321839]
[0.06896552]
[0.03448276]
[0.14942529]
[0.68965517]
[0.28735632]
[0.59770115]
[0.22988506]
[0.59770115]
[0.72413793]
[0.51724138]
[0.94252874]
[0.67816092]
[0.10344828]
[0.74712644]
[0.74712644]
[0.57471264]
[0.35632184]
...
[0.17241379]
[0.55172414]
[0.28735632]
[0.88505747]
[0.88505747]
[0.52873563]
[0.31034483]
[0.89655172]
[0.33333333]
[0.64367816]
[0.22988506]
[0.81609195]
[0.72413793]
[0.85057471]
[0.36781609]
[0.56321839]
[0.65517241]
[0.71264368]
[0.96551724]
[0.88505747]
[0.48275862]
[0.02298851]
[0.62068966]
[0.02298851]
[0.96551724]
[0.81609195]
[0.56321839]
[0.25287356]
[0.66666667]
[0.34482759]
[0.04597701]
[0.50574713]
[0.31034483]
[0.64367816]
[0.6091954]
[0.54022989]
[0.12643678]
[0.72413793]
[0.52873563]
[0.94252874]
[0.94252874]
[0.49425287]
[0.14942529]
[0.18390805]
[0.2183908]
[0.49425287]
[0.63218391]
[0.85057471]
[0.17241379]
[0.26436782]
[0.81609195]
[0.65517241]
[0.75862069]
[0.91954023]
[0.70114943]
[0.81609195]
[0.14942529]
[0.66666667]
[0.18390805]
[0.29885057]
[0.01149425]
[0.52873563]
[0.6091954]
[0.12643678]
[0.44827586]
[0.03448276]
[0.36781609]
[0.86206897]
[0.90804598]
[0.10344828]
[0.65517241]
[0.48275862]
[0.93103448]
[0.02298851]
[0.17241379]
[0.24137931]
[0.90804598]
...
[0.36781609]
[0.16091954]
[0.57471264]
[0.65517241]
[0.77011494]
[0.97701149]
[0.3908046]
[0.90804598]
[0.18390805]
[0.37931034]
[0.54022989]
[0.48275862]
[0.26436782]
[0.37931034]
[0.10344828]
[0.72413793]
[0.34482759]
[0.01149425]
[0.]
[0.50574713]


In [20]:
print(scaled_train_samples.shape)

(2100, 1)
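
The shape printed above has a second dimension of size 1 because of the `reshape(-1, 1)` call made before scaling. As a small illustration on an assumed three-element array, `-1` tells NumPy to infer the number of rows, and `1` requests a single column:

```python
import numpy as np

ages = np.array([13, 58, 100])   # 1D array
column = ages.reshape(-1, 1)     # 2D array: rows inferred, one column

print(ages.shape)    # (3,)
print(column.shape)  # (3, 1)

# ravel() flattens the column back into a 1D array with the same values.
flat = column.ravel()
print(flat.shape)    # (3,)
```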


At this point, we've generated some sample raw data, put it into the numpy format that our model will require, and rescaled it to a scale ranging from 0 to 1.

In an upcoming episode, we'll use this data to train a neural network and see what kind of results we can get. 