# Logistic Regression with SUSY
Today, you'll be a physicist, working with collider data to make new particle discoveries! You've been given a small subset of the SUSY Data Set, which contains the results of a simulated experiment to detect supersymmetric particles.

First, let's start off with the necessary imports.

In [None]:
import csv
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

Next, we will read in the data and split it into training and validation sets.

In [None]:
with open("SUSY_small.csv", "r") as f:
    # Each row holds one event: the class label followed by the features.
    rows = list(csv.reader(f, delimiter=","))
# Convert the parsed strings into a 2-D float array.
data = np.array(rows, dtype=float)

In [None]:
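# Shuffle the row indices, then hold out the last 1000 shuffled rows for validation.
shuffle = np.arange(data.shape[0])
np.random.shuffle(shuffle)
data_train = data[shuffle[:-1000],:]
data_val = data[shuffle[-1000:],:]
# Column 0 is the label; the remaining columns are the features.
X_train, Y_train = data_train[:,1:], data_train[:,0]
X_val, Y_val = data_val[:,1:], data_val[:,0]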

print(X_train.shape, X_val.shape, Y_train.shape, Y_val.shape)

Let's now implement the functions needed to perform gradient descent.

In [None]:
def sigmoid(X, w):
    """
    Compute the elementwise sigmoid of the product Xw.
    Rows of X are data points; w is a column of weights.
    """
    pass

def gradient(X, y, w, onept=False, norm=None, lamb=0):
    """
    Compute the gradient of the regularized loss function.
    Accommodate the case where X is just one data point.
    """
    pass

def loss(X, y, w, norm=None, lamb=0):
    """
    Compute total loss for the data in X, labels in y, params w
    """
    pass

def accuracy(X, y, w):
    """
    Compute accuracy for data in X, labels in y, params w
    """
    pass
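
One possible way to fill in these stubs is sketched below. It is not the only valid answer and rests on several assumptions: labels are 0 or 1, the cross-entropy is averaged over data points rather than summed (so the step size does not depend on the data set size), the penalties are `lamb * sum(|w|)` for `'l1'` and `lamb * sum(w**2)` for `'l2'`, and the clipping constants are added purely for numerical safety.

```python
import numpy as np

def sigmoid(X, w):
    """Elementwise sigmoid of X @ w (rows of X are data points)."""
    z = np.clip(X @ w, -500, 500)  # assumption: clip to avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-z))

def gradient(X, y, w, onept=False, norm=None, lamb=0):
    """Gradient of the mean cross-entropy plus an optional penalty."""
    if onept:
        # Promote a single data point to a 1-row matrix and a length-1 label vector.
        X = np.atleast_2d(X)
        y = np.atleast_1d(y)
    grad = X.T @ (sigmoid(X, w) - y) / X.shape[0]
    if norm == 'l1':
        grad = grad + lamb * np.sign(w)   # subgradient of lamb * sum(|w|)
    elif norm == 'l2':
        grad = grad + 2 * lamb * w        # gradient of lamb * sum(w**2)
    return grad

def loss(X, y, w, norm=None, lamb=0):
    """Mean cross-entropy plus an optional penalty."""
    p = np.clip(sigmoid(X, w), 1e-12, 1 - 1e-12)  # keep log() finite
    value = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    if norm == 'l1':
        value += lamb * np.sum(np.abs(w))
    elif norm == 'l2':
        value += lamb * np.sum(w ** 2)
    return value

def accuracy(X, y, w):
    """Fraction of points whose thresholded prediction matches the label."""
    return np.mean((sigmoid(X, w) >= 0.5) == (y == 1))
```

A finite-difference comparison between `loss` and `gradient` is a quick way to check that the two agree before training.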

Now that we have the needed functions, we can perform gradient descent to train the model.

In [None]:
theta = np.random.normal(0, 0.1, X_train.shape[1])
losses = []
train_accuracies = []
validation_accuracies = []
epsilon = 0.05
num_iterations = 500

for i in range(num_iterations):
    pass
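
A minimal sketch of what the loop body might look like follows. To keep it runnable on its own, it uses synthetic data and three hypothetical stand-in helpers (`_prob`, `_loss`, `_accuracy`); in the notebook you would call your own `sigmoid`, `gradient`, `loss`, and `accuracy` on the real `X_train`, `Y_train`, `X_val`, and `Y_val` instead. It assumes 0/1 labels and a gradient averaged over the training set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data so the sketch runs on its own (0/1 labels).
true_w = rng.normal(size=5)
X_train = rng.normal(size=(400, 5))
Y_train = (X_train @ true_w > 0).astype(float)
X_val = rng.normal(size=(100, 5))
Y_val = (X_val @ true_w > 0).astype(float)

def _prob(X, w):
    # Predicted probability of class 1.
    return 1.0 / (1.0 + np.exp(-np.clip(X @ w, -500, 500)))

def _loss(X, y, w):
    # Mean cross-entropy, with clipping to keep log() finite.
    p = np.clip(_prob(X, w), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def _accuracy(X, y, w):
    return np.mean((_prob(X, w) >= 0.5) == (y == 1))

theta = rng.normal(0, 0.1, X_train.shape[1])
losses = []
train_accuracies = []
validation_accuracies = []
epsilon = 0.05
num_iterations = 500

for i in range(num_iterations):
    # Full-batch step: move against the gradient of the mean loss.
    grad = X_train.T @ (_prob(X_train, theta) - Y_train) / X_train.shape[0]
    theta = theta - epsilon * grad
    # Record one value per iteration so the plotting cell has matching lengths.
    losses.append(_loss(X_train, Y_train, theta))
    train_accuracies.append(_accuracy(X_train, Y_train, theta))
    validation_accuracies.append(_accuracy(X_val, Y_val, theta))
```

Appending exactly one entry to each list per iteration matters because the plotting cell pairs each list with `np.arange(num_iterations)`.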

In [None]:
plt.figure(figsize=[10,10])
plt.plot(np.arange(num_iterations), losses, 'ro')
plt.plot(np.arange(num_iterations), train_accuracies, 'bo')
plt.plot(np.arange(num_iterations), validation_accuracies, 'go')
plt.title('Loss and Accuracy During Training Using Batch Gradient Descent')
plt.ylabel('Value')
plt.xlabel('Iteration Number')
plt.show()
print(accuracy(X_train, Y_train, theta))

In [None]:
print(accuracy(X_val, Y_val, theta))

# Logistic Regression with Regularization
What happens when our model overfits? To provoke overfitting, we will train on only the first 50 training points, then see what we can do to improve our validation accuracy.

In [None]:
X_train_small, Y_train_small = data_train[:50,1:], data_train[:50,0]

In [None]:
theta = np.random.normal(0, 0.1, X_train_small.shape[1])
losses = []
train_accuracies = []
validation_accuracies = []
epsilon = 0.05
num_iterations = 500

for i in range(num_iterations):
    pass

In [None]:
plt.figure(figsize=[10,10])
plt.plot(np.arange(num_iterations), losses, 'ro')
plt.plot(np.arange(num_iterations), train_accuracies, 'bo')
plt.plot(np.arange(num_iterations), validation_accuracies, 'go')
plt.title('Loss and Accuracy During Training Using Batch Gradient Descent')
plt.ylabel('Value')
plt.xlabel('Iteration Number')
plt.show()
print(accuracy(X_train_small, Y_train_small, theta))

In [None]:
print(accuracy(X_val, Y_val, theta))

Doesn't look so good, huh? Now let's see what we can do to improve that with regularization.
Note: We are using a relatively simple model, so regularization may not be very useful in this specific instance.

In [None]:
lambdas = [0, 0.001, 0.01, 0.1, 1, 10]
norms = ['l1', 'l2']

for norm in norms:
    print(norm)
    for lamb in lambdas:
        for i in range(num_iterations):
            pass
        print(lamb)
        # train_acc and val_acc must be computed after the training loop above.
        print(f'Training: {train_acc}, Validation: {val_acc}')
        print()
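
The sweep above could be filled in along the lines of the sketch below. The data is synthetic, and `fit`, `_prob`, `_accuracy`, and `results` are hypothetical names introduced only for illustration; the penalty gradients assume `lamb * sum(|w|)` for `'l1'` and `lamb * sum(w**2)` for `'l2'`. The point to notice is that the weights are re-initialized for every `(norm, lamb)` pair, so each run starts from the same place.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in data: 50 training points, 8 features, 2 informative.
true_w = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
X_small = rng.normal(size=(50, 8))
Y_small = (X_small @ true_w > 0).astype(float)
X_val = rng.normal(size=(1000, 8))
Y_val = (X_val @ true_w > 0).astype(float)

def _prob(X, w):
    return 1.0 / (1.0 + np.exp(-np.clip(X @ w, -500, 500)))

def _accuracy(X, y, w):
    return np.mean((_prob(X, w) >= 0.5) == (y == 1))

def fit(X, y, norm, lamb, epsilon=0.05, num_iterations=500):
    # Fresh, identically seeded initial weights for every configuration.
    theta = np.random.default_rng(0).normal(0, 0.1, X.shape[1])
    for _ in range(num_iterations):
        grad = X.T @ (_prob(X, theta) - y) / X.shape[0]
        if norm == 'l1':
            grad = grad + lamb * np.sign(theta)
        elif norm == 'l2':
            grad = grad + 2 * lamb * theta
        theta = theta - epsilon * grad
    return theta

lambdas = [0, 0.001, 0.01, 0.1, 1, 10]
norms = ['l1', 'l2']
results = {}

for norm in norms:
    print(norm)
    for lamb in lambdas:
        theta = fit(X_small, Y_small, norm, lamb)
        train_acc = _accuracy(X_small, Y_small, theta)
        val_acc = _accuracy(X_val, Y_val, theta)
        results[(norm, lamb)] = (train_acc, val_acc)
        print(lamb)
        print(f'Training: {train_acc}, Validation: {val_acc}')
        print()
```

With `lamb = 0` both norms reduce to unregularized training, which gives a built-in sanity check: the two rows should match exactly.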

To see what kind of effect regularization is having on the weights, train the model with varying norms and lambdas and print out the parameters.

In [None]:
print(theta)
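
To compare the weights side by side, something like the following sketch may help. It again uses synthetic data and hypothetical names (`fit`, `_prob`, `thetas`), fixes `lamb = 0.1` as an arbitrary illustrative value, and assumes the same penalty conventions as `lamb * sum(|w|)` and `lamb * sum(w**2)`. Printing the L1 and L2 norms next to the rounded weights makes the shrinkage easier to see than the raw vector alone.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in data: 50 points, 8 features, only the first 2 matter.
true_w = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
X_small = rng.normal(size=(50, 8))
Y_small = (X_small @ true_w > 0).astype(float)

def _prob(X, w):
    return 1.0 / (1.0 + np.exp(-np.clip(X @ w, -500, 500)))

def fit(X, y, norm=None, lamb=0.0, epsilon=0.05, num_iterations=500):
    # Same seeded starting point for every run so the weights are comparable.
    theta = np.random.default_rng(0).normal(0, 0.1, X.shape[1])
    for _ in range(num_iterations):
        grad = X.T @ (_prob(X, theta) - y) / X.shape[0]
        if norm == 'l1':
            grad = grad + lamb * np.sign(theta)
        elif norm == 'l2':
            grad = grad + 2 * lamb * theta
        theta = theta - epsilon * grad
    return theta

lamb = 0.1  # arbitrary value chosen for illustration
thetas = {
    None: fit(X_small, Y_small),
    'l1': fit(X_small, Y_small, 'l1', lamb),
    'l2': fit(X_small, Y_small, 'l2', lamb),
}

for norm, theta in thetas.items():
    print(norm)
    print(np.round(theta, 3))
    print('L1 norm:', np.abs(theta).sum(), 'L2 norm:', np.linalg.norm(theta))
    print()
```

Because this sketch uses a plain subgradient step for `'l1'`, weights pushed toward zero tend to hover near zero rather than land on it exactly.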