# A simple learning example
Let's assume that we have the following data regarding students:
- various *features*: age at graduation, LM final grade, ML grade, height
- whether they have a fun job or not, encoded by a binary label: fun = 1, not fun = -1

We will now assume that there is a relation between the features and the label, and see whether we can predict the label for future students.

In [1]:
#importing some useful packages
#NumPy(http://www.numpy.org): useful for scientific computing, provides array objects, etc.
import numpy as np
#package for random number generators and other useful "random" stuff
import random as rnd
#package with some default math functions
import math
#specific model from a ML package
from sklearn.linear_model import LogisticRegression

Let's fix a random seed so that the analysis can be replicated. Note that this seeds only Python's `random` module, which is the generator used to create the data below.

In [2]:
FV_ID_number = 10
rnd.seed(FV_ID_number)
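
As a quick check of what seeding buys us, the sketch below reseeds the generator and draws the same values twice. The helper name `draw_ages` is hypothetical and used only for illustration; the seed value 10 and the age range are taken from this notebook.

```python
import random as rnd

def draw_ages(seed, count=5):
    """Seed Python's generator, then draw `count` ages between 20 and 28."""
    rnd.seed(seed)
    # rnd.randint includes both endpoints (20 and 28 can both be drawn)
    return [rnd.randint(20, 28) for _ in range(count)]

first_run = draw_ages(10)
second_run = draw_ages(10)
print(first_run == second_run)  # True: same seed, same sequence
```

The seed affects only Python's `random` module; code that draws from NumPy's generators would need its own seed.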

We assume that *there is* a function that, given the values of the features, provides the label. Here, we can *choose* the function ourselves.

In [3]:
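def get_true_label(student_features):
    # features are ordered as: age, LM grade, ML grade, height
    age, LM_grade, ML_grade, height = student_features
    # weighted sum of centered and rescaled features; height has weight 0,
    # so it does not influence the label
    score = (-1. * (age - 23.) / 10.
             + 0.6 * (LM_grade - 90.) / 111.
             + 0.4 * (ML_grade - 18.) / 31.
             + 0. * height)
    # np.sign maps the score to 1, -1, or 0 when the score is exactly zero
    return np.sign(score)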

Let's see some examples:

In [4]:
student_1 = [25, 110, 31, 181]
label_1 = get_true_label(student_1)
print("Label of student 1: ",label_1)
student_2 = [30, 92, 18, 260]
label_2 = get_true_label(student_2)
print("Label of student 2: ",label_2)

Label of student 1:  1.0
Label of student 2:  -1.0
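
One edge case is worth knowing about: `np.sign` returns 0 when its argument is exactly zero, so the labeling function can produce a value that is neither 1 nor -1. The sketch below restates the function so that it runs on its own and evaluates it on a hypothetical student chosen to sit exactly on the boundary.

```python
import numpy as np

def get_true_label(student_features):
    age, LM_grade, ML_grade, height = student_features
    score = (-1.0 * (age - 23.0) / 10.0
             + 0.6 * (LM_grade - 90.0) / 111.0
             + 0.4 * (ML_grade - 18.0) / 31.0
             + 0.0 * height)
    return np.sign(score)

# hypothetical student for whom every term of the score is zero
boundary_student = [23, 90, 18, 170]
print(get_true_label(boundary_student))  # 0.0
```

None of the 40 labels printed in this notebook is 0, but these feature values lie inside the ranges used to generate the data, so a different seed could in principle produce such a student.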


Let's generate some data!

In [5]:
# number of students for which we have data
N = 40
# let's save the data in matrix X
X = np.zeros((N,4))
for i in range(N):
    # age: integer in 20..28 (rnd.randint includes both endpoints)
    X[i,0] = rnd.randint(20,28)
    # LM_grade: integer in 80..112
    X[i,1] = rnd.randint(80,112)
    # ML_grade: integer in 18..32
    X[i,2] = rnd.randint(18,32)
    # height: integer in 140..201
    X[i,3] = rnd.randint(140,201)

Y = np.zeros(N)
for i in range(N):
    Y[i] = get_true_label(X[i,:])
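
The labeling loop can also be written as one matrix operation. This is a minimal sketch, assuming the same coefficients as `get_true_label`; the names `WEIGHTS`, `OFFSETS`, `SCALES`, and `get_true_labels` are introduced here for illustration and are not part of the original notebook.

```python
import numpy as np

# Same coefficients as get_true_label, one entry per feature:
# age, LM grade, ML grade, height (height has weight 0).
WEIGHTS = np.array([-1.0, 0.6, 0.4, 0.0])
OFFSETS = np.array([23.0, 90.0, 18.0, 0.0])
SCALES = np.array([10.0, 111.0, 31.0, 1.0])

def get_true_labels(features):
    """Label every row of an (N, 4) feature matrix at once."""
    features = np.asarray(features, dtype=float)
    # center and rescale each column, then take the weighted sum per row
    scores = ((features - OFFSETS) / SCALES) @ WEIGHTS
    return np.sign(scores)

# three rows taken from the data printed in this notebook
sample = [[20, 107, 25, 176],
          [27, 97, 28, 191],
          [25, 84, 21, 200]]
print(get_true_labels(sample))
```

Because the arithmetic is grouped differently than in the loop, results could differ for a score within rounding error of zero; for the rows used here the scores are far from zero.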

Let's look at the data

In [6]:
print(X)

[[ 20. 107.  25. 176.]
 [ 20.  93.  25. 192.]
 [ 27.  97.  28. 191.]
 [ 22.  82.  26. 171.]
 [ 25.  84.  21. 200.]
 [ 25.  82.  24. 195.]
 [ 22. 102.  24. 166.]
 [ 24.  96.  25. 151.]
 [ 24. 103.  20. 169.]
 [ 23. 108.  27. 164.]
 [ 20.  80.  21. 148.]
 [ 23.  99.  26. 163.]
 [ 23. 100.  28. 175.]
 [ 27. 107.  25. 144.]
 [ 25. 112.  20. 193.]
 [ 23. 106.  21. 142.]
 [ 20. 111.  22. 192.]
 [ 21.  85.  20. 164.]
 [ 25.  89.  19. 201.]
 [ 21. 108.  20. 191.]
 [ 23. 102.  24. 166.]
 [ 27.  95.  28. 157.]
 [ 22.  91.  31. 147.]
 [ 24. 109.  22. 150.]
 [ 22.  91.  25. 189.]
 [ 25. 100.  24. 154.]
 [ 20.  82.  23. 197.]
 [ 25.  95.  19. 156.]
 [ 27. 105.  27. 150.]
 [ 26. 111.  28. 197.]
 [ 23.  97.  26. 170.]
 [ 27.  84.  20. 171.]
 [ 27. 105.  20. 166.]
 [ 28. 102.  26. 164.]
 [ 27.  90.  26. 168.]
 [ 21. 106.  31. 142.]
 [ 20. 108.  26. 144.]
 [ 20. 102.  19. 149.]
 [ 21. 109.  31. 191.]
 [ 27.  88.  25. 195.]]


Let's look at the labels (they were already generated together with the data above)

In [7]:
print(Y)

[ 1.  1. -1.  1. -1. -1.  1.  1. -1.  1.  1.  1.  1. -1. -1.  1.  1.  1.
 -1.  1.  1. -1.  1.  1.  1. -1.  1. -1. -1. -1.  1. -1. -1. -1. -1.  1.
  1.  1.  1. -1.]


# The real world starts here

Let's split the data into two parts: a training set (the first 20 students) to learn a model, and a test set (the remaining 20 students) to test our model.

In [8]:
X_train = X[0:20,:]
Y_train = Y[0:20]
print(X_train)
print(Y_train)
X_test = X[20:,:]
Y_test = Y[20:]
print(X_test)
print(Y_test)

[[ 20. 107.  25. 176.]
 [ 20.  93.  25. 192.]
 [ 27.  97.  28. 191.]
 [ 22.  82.  26. 171.]
 [ 25.  84.  21. 200.]
 [ 25.  82.  24. 195.]
 [ 22. 102.  24. 166.]
 [ 24.  96.  25. 151.]
 [ 24. 103.  20. 169.]
 [ 23. 108.  27. 164.]
 [ 20.  80.  21. 148.]
 [ 23.  99.  26. 163.]
 [ 23. 100.  28. 175.]
 [ 27. 107.  25. 144.]
 [ 25. 112.  20. 193.]
 [ 23. 106.  21. 142.]
 [ 20. 111.  22. 192.]
 [ 21.  85.  20. 164.]
 [ 25.  89.  19. 201.]
 [ 21. 108.  20. 191.]]
[ 1.  1. -1.  1. -1. -1.  1.  1. -1.  1.  1.  1.  1. -1. -1.  1.  1.  1.
 -1.  1.]
[[ 23. 102.  24. 166.]
 [ 27.  95.  28. 157.]
 [ 22.  91.  31. 147.]
 [ 24. 109.  22. 150.]
 [ 22.  91.  25. 189.]
 [ 25. 100.  24. 154.]
 [ 20.  82.  23. 197.]
 [ 25.  95.  19. 156.]
 [ 27. 105.  27. 150.]
 [ 26. 111.  28. 197.]
 [ 23.  97.  26. 170.]
 [ 27.  84.  20. 171.]
 [ 27. 105.  20. 166.]
 [ 28. 102.  26. 164.]
 [ 27.  90.  26. 168.]
 [ 21. 106.  31. 142.]
 [ 20. 108.  26. 144.]
 [ 20. 102.  19. 149.]
 [ 21. 109.  31. 191.]
 [ 27.  88.  25. 195.]]
[ 1. -1.  1.  1.  1. -1.  1. -1. -1. -1.  1. -1. -1. -1. -1.  1.  1.  1.
  1. -1.]
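
Taking the first 20 rows for training works here because the rows were generated independently, so their order carries no information. When the order of a dataset might matter (for example, data sorted by age), a common alternative is to shuffle before splitting. The sketch below uses a hypothetical helper, `shuffled_split`, and placeholder data; scikit-learn's `train_test_split` offers similar functionality.

```python
import numpy as np

def shuffled_split(X, Y, n_train, seed=10):
    """Shuffle the row indices, then split into training and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return X[train_idx], Y[train_idx], X[test_idx], Y[test_idx]

# placeholder data with the same shape as in this notebook (40 students, 4 features)
X_demo = np.arange(40 * 4, dtype=float).reshape(40, 4)
Y_demo = np.where(np.arange(40) % 2 == 0, 1.0, -1.0)

X_tr, Y_tr, X_te, Y_te = shuffled_split(X_demo, Y_demo, n_train=20)
print(X_tr.shape, X_te.shape)  # (20, 4) (20, 4)
```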

Now let's use an ML method to "learn" a function to predict the label from the data we have!

In [9]:
clf = LogisticRegression().fit(X_train, Y_train)
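
To see what was "learned", one can inspect the fitted model's weights. The sketch below refits the same kind of model on the 20 training rows printed above (copied by hand, so treat them as an assumption) and prints one coefficient per feature plus the intercept. The exact values depend on the scikit-learn version and its default settings, so no expected output is shown; with unscaled features like these the solver may also emit a convergence warning.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# training rows (age, LM grade, ML grade, height) as printed in this notebook
X_train = np.array([
    [20, 107, 25, 176], [20, 93, 25, 192], [27, 97, 28, 191], [22, 82, 26, 171],
    [25, 84, 21, 200], [25, 82, 24, 195], [22, 102, 24, 166], [24, 96, 25, 151],
    [24, 103, 20, 169], [23, 108, 27, 164], [20, 80, 21, 148], [23, 99, 26, 163],
    [23, 100, 28, 175], [27, 107, 25, 144], [25, 112, 20, 193], [23, 106, 21, 142],
    [20, 111, 22, 192], [21, 85, 20, 164], [25, 89, 19, 201], [21, 108, 20, 191],
], dtype=float)
Y_train = np.array([1, 1, -1, 1, -1, -1, 1, 1, -1, 1,
                    1, 1, 1, -1, -1, 1, 1, 1, -1, 1], dtype=float)

clf = LogisticRegression().fit(X_train, Y_train)

# one weight per feature, in the order age, LM grade, ML grade, height
print("coefficients:", clf.coef_[0])
print("intercept:", clf.intercept_[0])
print("classes:", clf.classes_)
```

Since the true labeling function gives height a weight of 0, it is interesting to compare the learned weight for height with the others; the model is not guaranteed to recover exactly zero from only 20 examples.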

Let's see how we do on the training data, which the model has already seen:

In [10]:
#this reports the fraction of students on which our model makes the right prediction
clf.score(X_train,Y_train)

1.0

But this was easy: after all, this is data the model has already seen! What about data that it has not seen?

In [11]:
student_new_1 = [24, 106, 29, 161]
label_new_1 = get_true_label(student_new_1)
print("True label of student_new_1:",label_new_1)
predicted_label_new_1 = clf.predict([student_new_1])
print("Predicted label of student_new_1:",predicted_label_new_1)
student_new_2 = [32, 104, 24, 191]
label_new_2 = get_true_label(student_new_2)
print("True label of student_new_2:",label_new_2)
predicted_label_new_2 = clf.predict([student_new_2])
print("Predicted label of student_new_2:",predicted_label_new_2)

True label of student_new_1: 1.0
Predicted label of student_new_1: [1.]
True label of student_new_2: -1.0
Predicted label of student_new_2: [-1.]


Let's see how we do on the test set!

In [12]:
#this reports the fraction of students on which our model makes the right prediction
accuracy_test = clf.score(X_test,Y_test)

print("Accuracy on test data: ",accuracy_test)

Accuracy on test data:  0.95
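
The accuracy reported by `score` is the fraction of predictions that match the true labels. The sketch below computes this fraction by hand on made-up labels (not the notebook's actual predictions): with 20 students and exactly one mistake, the result is 19/20 = 0.95, the same value as the test accuracy above. The helper name `accuracy` is introduced here for illustration.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# made-up example: 20 labels, with a single wrong prediction at the end
y_true = np.array([1.0, -1.0] * 10)
y_pred = y_true.copy()
y_pred[-1] = -y_pred[-1]

print(accuracy(y_true, y_pred))  # 0.95
```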
