# The ML Pipeline

This exercise sheet does not focus on understanding and training a specific method, such as a classification procedure, but on everything around it. You will get to know important aspects of the ML methodology, including generating synthetic data, extracting features, splitting the data set for training and testing, and evaluation methods. Whenever you implement another ML method in the upcoming days, you can rely on what you learn today.

In [None]:
# imports
%load_ext autoreload
%autoreload 2
%matplotlib inline  

import numpy as np
import numpy.random as rng
import matplotlib.pyplot as plt
#import solutions

from sklearn.linear_model import LogisticRegression

## Generate data

EX1: Please add the missing code in `get_data1` to draw N uniformly distributed samples from $-\pi$ to $\pi$.  
EX2: Please add the missing code in `get_data1` and `get_data2` to add normally distributed noise scaled by the given noise factor.

In [None]:
def get_data1(N=1000, noise=.1):

    def circle(x,radius):
        return np.sin(x) * radius, np.cos(x) * radius
    
    y = rng.randint(0,2,N)
    # YOUR CODE HERE
    X = rng.uniform(-np.pi,np.pi,N)
    # END
    
    # draw one radius per class and place each sample on the circle of its class
    X = np.array([circle(x,radius) for x,radius in zip(X,rng.uniform(4,8,2)[y])])
    # YOUR CODE HERE
    X += noise * rng.randn(*X.shape)
    # END
    
    # translate
    X[:,0] += noise * rng.uniform(0,10)
    X[:,1] += noise * rng.uniform(0,10)
    #print(y.shape)
    return X,y


def get_data2(N=1000, noise=.5):
    
    y = rng.randint(0,2,N)
    X = np.linspace(0, 6, N)
    
    def desc(x):
        return x, -x + 6
    
    def asc(x):
        return x,x
    
    X = np.array([asc(x) if yc == 1 else desc(x) for x,yc in zip(X, y)])
    
    # YOUR CODE HERE
    X += noise * rng.randn(*X.shape)
    # END
    
    # translate
    X[:,0] += noise * rng.uniform(0,10)
    X[:,1] += noise * rng.uniform(0,10)

    return X,y

In [None]:
X1,y1 = get_data1(noise=.2)

In [None]:
X2,y2 = get_data2(noise=.2)

EX3: Create a scatter plot of the X values and color the points according to their y value. Please make sure that both axes have the same scaling.

In [None]:
def plot_data(X,y):
    # YOUR CODE HERE
    plt.scatter(X[:,0],X[:,1],c=y)
    plt.axis('equal')
    # END 

In [None]:
plot_data(X1,y1)

In [None]:
plot_data(X2,y2)

## Extract features

EX4: Center the given data around zero.  
EX5: Using what you know about the structure of the data, create a custom one-dimensional feature representation in which the classes are linearly separable.  
EX6: Using what you know about the structure of the data, create a custom two-dimensional feature representation in which the classes are linearly separable.  

In [None]:
def extract_features_basic(X):
    # YOUR CODE HERE
    # subtract the per-column mean; this returns a new array and leaves X untouched
    return X - np.mean(X, axis=0)
    # END

In [None]:
def extract_features1(X):
    # YOUR CODE HERE
    X = extract_features_basic(X)
    X = np.sqrt(X[:,0]**2+X[:,1]**2)
    return X
    # END
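
The idea behind the radius feature can be checked on toy data. The following is a minimal sketch, not the sheet's reference solution: the radii 4 and 7, the noise level 0.2 and the threshold 5.5 are assumptions chosen for illustration, and `demo_rng` is a hypothetical local generator.

```python
import numpy as np

demo_rng = np.random.default_rng(0)
n = 400
angles = demo_rng.uniform(-np.pi, np.pi, n)
labels = demo_rng.integers(0, 2, n)
radii = np.where(labels == 1, 7.0, 4.0)            # assumed radii, one per class
pts = np.column_stack([np.sin(angles) * radii, np.cos(angles) * radii])
pts += 0.2 * demo_rng.standard_normal(pts.shape)   # assumed noise level

# Center the points, then use the distance to the origin as the only feature.
centered = pts - pts.mean(axis=0)
radius_feature = np.linalg.norm(centered, axis=1)

# A single threshold between the two radii separates the classes.
threshold = 5.5
threshold_accuracy = ((radius_feature > threshold).astype(int) == labels).mean()
print(threshold_accuracy)
```

Because the classes differ only in their distance from the center, one threshold on this feature is enough, which is exactly what a linear classifier on a one-dimensional input can express.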

In [None]:
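def extract_features2(X):
    # YOUR CODE HERE
    X = extract_features_basic(X)
    # first feature: product of the centered coordinates
    # (positive on the ascending line, negative on the descending one);
    # second feature: constant zero, kept so the representation stays two dimensional
    features = np.zeros(X.shape)
    features[:,0] = X[:,0] * X[:,1]
    return features
    # END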
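
To see why the product of the centered coordinates separates two crossing lines, consider the noise-free case. This is a small sketch under assumed toy data (already centered, with class 1 on the line y = x and class 0 on the line y = -x); `demo_rng` is a hypothetical local generator.

```python
import numpy as np

demo_rng = np.random.default_rng(0)
n = 500
u = np.linspace(-3, 3, n)                  # centered x coordinate, never exactly 0
labels = demo_rng.integers(0, 2, n)
# class 1 sits on the ascending line y = u, class 0 on the descending line y = -u
pts = np.column_stack([u, np.where(labels == 1, u, -u)])

# The product equals u**2 for class 1 and -u**2 for class 0,
# so its sign alone recovers the label.
product = pts[:, 0] * pts[:, 1]
pred = (product > 0).astype(int)
sign_accuracy = (pred == labels).mean()
print(sign_accuracy)
```

With noise, points close to the crossing can end up on the wrong side of zero, so the separation is only approximate on the real data.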

In [None]:
X1_feat = extract_features1(X1)

In [None]:
_ = plt.plot(X1_feat, y1, "o")

In [None]:
X2_feat = extract_features2(X2)

In [None]:
plot_data(X2_feat, y2)

EX7: Implement the `train_test_split` function that splits `X` and `y` into a training part and a test part, where `test_portion` is the fraction of samples that goes into the test set. The samples are shuffled before splitting if `perform_shuffle` is `True`.

In [None]:
def train_test_split(X, y, test_portion=.25, perform_shuffle=True):
    # YOUR CODE HERE
    X, y = np.asarray(X), np.asarray(y)
    if perform_shuffle:
        # apply the same permutation to X and y so that the pairs stay aligned
        idx = rng.permutation(len(X))
        X, y = X[idx], y[idx]
    # the first (1 - test_portion) of the samples is used for training
    n_train = int((1 - test_portion) * len(X))
    X_train, X_test = X[:n_train], X[n_train:]
    y_train, y_test = y[:n_train], y[n_train:]

    return X_train, y_train, X_test, y_test
    # END

In [None]:
X1_train, y1_train, X1_test, y1_test = train_test_split(X1_feat, y1)
X2_train, y2_train, X2_test, y2_test = train_test_split(X2_feat, y2)
print(X1_train.shape)
print(y1_train.shape)
print(X1_test.shape)
print(y1_test.shape)
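
A quick way to sanity-check a split is to run it on toy data in which the label is a known function of the sample. The sketch below assumes a hypothetical helper `split_by_permutation` (not part of the sheet) that shuffles with a single index permutation.

```python
import numpy as np

def split_by_permutation(X, y, test_portion=0.25, seed=0):
    """Hypothetical helper: shuffle X and y together, then cut once."""
    gen = np.random.default_rng(seed)
    idx = gen.permutation(len(X))          # one permutation for both arrays
    X, y = X[idx], y[idx]
    n_train = int((1 - test_portion) * len(X))
    return X[:n_train], y[:n_train], X[n_train:], y[n_train:]

# Toy data: the label is the parity of the sample, so a broken pairing is easy to spot.
X_toy = np.arange(20)
y_toy = X_toy % 2
X_tr, y_tr, X_te, y_te = split_by_permutation(X_toy, y_toy)

print(X_tr.shape, X_te.shape)
print(np.array_equal(y_tr, X_tr % 2), np.array_equal(y_te, X_te % 2))
```

Checking the shapes confirms the ratio, and checking the parity confirms that every sample still carries its own label after shuffling.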

The following two cells each train a classifier. In this exercise you do not need to worry about how the classifier works. Just note that the predictions for the different splits are stored in `train_prediction1` and `test_prediction1` (first data set) and in `train_prediction2` and `test_prediction2` (second data set).

In [None]:
clf1 = LogisticRegression()
clf1.fit(X1_train.reshape(len(X1_train), 1), y1_train)
train_prediction1 = clf1.predict(X1_train.reshape(len(X1_train), 1))
test_prediction1 = clf1.predict(X1_test.reshape(len(X1_test), 1))

In [None]:
clf2 = LogisticRegression()
clf2.fit(X2_train.reshape(len(X2_train), 2), y2_train)
train_prediction2 = clf2.predict(X2_train.reshape(len(X2_train), 2))
test_prediction2 = clf2.predict(X2_test.reshape(len(X2_test), 2))

## Evaluation    

EX8: Please compute the following four evaluation metrics:

1. precision
2. recall
3. accuracy
4. F1 score
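
For reference, with $TP$, $FP$, $FN$ and $TN$ denoting the numbers of true positives, false positives, false negatives and true negatives, the four metrics are commonly defined as:

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
\text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$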


In [None]:
def evaluate(y_pred, y_true):
    true_pos = 0
    false_pos = 0
    false_neg = 0
    true_neg = 0
    # count the four cells of the confusion matrix (labels are 0 or 1)
    for i,j in zip(y_pred,y_true):
        if i == 1 and j == 1:
            true_pos += 1
        elif i == 1 and j == 0:
            false_pos += 1
        elif i == 0 and j == 1:
            false_neg += 1
        else:
            true_neg += 1

    # YOUR CODE HERE
    precision = true_pos/(true_pos+false_pos)
    recall = true_pos/(true_pos + false_neg)
    accuracy = (true_pos + true_neg)/len(y_pred)
    print("precision=", precision)
    print("recall=", recall)
    print("accuracy=",accuracy)
    print("f1=", 2 * ((precision * recall)/(precision + recall)))
    # END
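
The same counts can be obtained without a loop and checked by hand on a tiny example. This is a sketch with hypothetical helpers `confusion_counts` and `metrics`. Returning 0.0 when a denominator is zero is an assumed convention, not something the sheet prescribes.

```python
import numpy as np

def confusion_counts(y_pred, y_true):
    """Hypothetical helper: count TP, FP, FN, TN for binary 0/1 labels."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    return tp, fp, fn, tn

def metrics(y_pred, y_true):
    tp, fp, fn, tn = confusion_counts(y_pred, y_true)
    # assumed convention: an undefined ratio is reported as 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, accuracy, f1

y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0]
print(confusion_counts(y_pred_toy, y_true_toy))  # (2, 1, 2, 3)
print(metrics(y_pred_toy, y_true_toy))
```

In this example two of the three positive predictions are correct (precision 2/3), two of the four positives are found (recall 1/2), five of eight predictions are right (accuracy 5/8), and the F1 score works out to 4/7.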
    


### First data set

In [None]:
evaluate(test_prediction1, y1_test)

EX9: Plot the test set `X1_test` as a histogram and visualize the decision boundary of the classifier `clf1`. Hint: You can use `clf1.predict(...)` to generate new output.

In [None]:
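# YOUR CODE HERE
_ = plt.hist(X1_test)
# predict on a dense grid; the boundary is where the predicted class changes
grid = np.linspace(X1_test.min(), X1_test.max(), 500)
grid_pred = clf1.predict(grid.reshape(-1, 1))
for x in grid[1:][np.diff(grid_pred) != 0]:
    plt.axvline(x, color="k", linestyle="--")
# END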
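
For a logistic regression on a single feature, the decision boundary can also be read off the fitted parameters: the model predicts class 1 where $w x + b > 0$, so the boundary lies at $x = -b / w$. The sketch below illustrates this on assumed stand-in data (two groups of radii around 4 and 7). The names `demo_rng` and `demo_clf` are hypothetical and not part of the sheet.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Assumed stand-in for the 1-D radius feature: two groups around 4 and 7.
demo_rng = np.random.default_rng(0)
r = np.concatenate([demo_rng.normal(4, 0.3, 200), demo_rng.normal(7, 0.3, 200)])
labels = np.repeat([0, 1], 200)

demo_clf = LogisticRegression().fit(r.reshape(-1, 1), labels)

# The model predicts class 1 where w * x + b > 0, so the boundary is x = -b / w.
w = demo_clf.coef_[0, 0]
b = demo_clf.intercept_[0]
boundary = -b / w

plt.hist(r[labels == 0], bins=20, alpha=0.6, label="class 0")
plt.hist(r[labels == 1], bins=20, alpha=0.6, label="class 1")
plt.axvline(boundary, color="k", linestyle="--", label="decision boundary")
plt.legend()
```

Plotting one histogram per class makes it visible how many samples of each class fall on the wrong side of the dashed line.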

### Second data set

In [None]:
evaluate(test_prediction2, y2_test)

EX10: Plot `X2_test` as a scatter plot and visualize the decision boundary of `clf2` as a contour. Hint: `plt.contourf(...)` can be used to plot a contour.

In [None]:
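# YOUR CODE HERE
lo, hi = X2_test.min(axis=0) - 1, X2_test.max(axis=0) + 1
xx, yy = np.meshgrid(np.linspace(lo[0], hi[0], 200), np.linspace(lo[1], hi[1], 200))
plt.contourf(xx, yy, clf2.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape), alpha=.3)
plot_data(X2_test, y2_test)
# END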
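
Instead of hard class predictions, the predicted probability of class 1 can be contoured, with the 0.5 level marking the decision boundary. The following is a minimal sketch on assumed stand-in data (two Gaussian blobs). The names `demo_rng` and `demo_clf` are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Assumed 2-D stand-in data: two Gaussian blobs, one per class.
demo_rng = np.random.default_rng(0)
pts = np.vstack([demo_rng.normal(-2, 1, (200, 2)), demo_rng.normal(2, 1, (200, 2))])
labels = np.repeat([0, 1], 200)
demo_clf = LogisticRegression().fit(pts, labels)

# Build a grid that covers the data and evaluate the model at every grid point.
lo, hi = pts.min(axis=0) - 1, pts.max(axis=0) + 1
xx, yy = np.meshgrid(np.linspace(lo[0], hi[0], 200), np.linspace(lo[1], hi[1], 200))
grid = np.column_stack([xx.ravel(), yy.ravel()])
proba = demo_clf.predict_proba(grid)[:, 1].reshape(xx.shape)

plt.contourf(xx, yy, proba, levels=10, alpha=0.3)      # probability of class 1
plt.contour(xx, yy, proba, levels=[0.5], colors="k")   # decision boundary
plt.scatter(pts[:, 0], pts[:, 1], c=labels, s=10)
plt.axis("equal")
```

The filled contours show how confident the model is in each region, which a plot of the hard predictions alone would hide.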

#### Generalization

EX11: Interpret the output of the following cell with regard to the concept of generalization. Point out the crucial parts that have an effect on generalization.

In [None]:
X2_feat = extract_features_basic(X2)

X2_train, y2_train, X2_test, y2_test = train_test_split(X2_feat, y2, test_portion=.5, perform_shuffle=False)

clf3 = LogisticRegression()
clf3.fit(X2_train.reshape(len(X2_train), 2), y2_train)
train_prediction2 = clf3.predict(X2_train.reshape(len(X2_train), 2))
test_prediction2 = clf3.predict(X2_test.reshape(len(X2_test), 2))

print("training performance")
evaluate(train_prediction2, y2_train)
print("test performance")
evaluate(test_prediction2, y2_test)

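The classifier generalizes poorly here. Two choices are responsible. First, the data is split 50/50 without shuffling, and since `get_data2` generates the samples in order of increasing x, the training set covers only one half of the x range and the test set only the other half. Second, only the basic centered features are used, in which the two crossing lines are not linearly separable.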
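
The effect of the unshuffled split can be reproduced in isolation. The sketch below uses assumed toy data that mimics the two crossing lines (already centered, noise level 0.2). `demo_rng` and `fit_and_score` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

demo_rng = np.random.default_rng(0)
n = 1000
u = np.linspace(-3, 3, n)                      # samples are sorted along the x axis
labels = demo_rng.integers(0, 2, n)
# class 1 lies on the ascending line y = u, class 0 on the descending line y = -u
pts = np.column_stack([u, np.where(labels == 1, u, -u)])
pts += 0.2 * demo_rng.standard_normal(pts.shape)

def fit_and_score(train_idx, test_idx):
    """Train on one index set and return (train accuracy, test accuracy)."""
    clf = LogisticRegression().fit(pts[train_idx], labels[train_idx])
    return (clf.score(pts[train_idx], labels[train_idx]),
            clf.score(pts[test_idx], labels[test_idx]))

half = n // 2
order = np.arange(n)
train_acc, test_acc = fit_and_score(order[:half], order[half:])
print("unshuffled: train", train_acc, "test", test_acc)

shuffled = demo_rng.permutation(n)
train_acc_s, test_acc_s = fit_and_score(shuffled[:half], shuffled[half:])
print("shuffled:   train", train_acc_s, "test", test_acc_s)
```

Without shuffling, the model only sees the left half, where class 1 lies below class 0, and the rule it learns is inverted on the right half. With shuffling, training and test accuracy agree, but both stay near chance because a straight line cannot separate the crossing lines in these raw coordinates.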