In [17]:
import numpy as np
import helpers
import implementation

# Defining functions

In [18]:
def accuracy(y, tx, w):
    """Return the accuracy of the model."""
    pred    = np.where(tx.dot(w) > 0, 1, 0)
    correct = np.sum(pred == y)
    return correct / len(y)

The `build_test_train` function splits the dataset into a training set and a testing set so that we can evaluate the model on held-out data.

In [19]:
def build_test_train(y, tx, ratio=0.9, seed=1):
    """Split the dataset (y, tx) into training/testing set according to the split ratio"""
    # performing permutation before splitting the dataset
    np.random.seed(seed)
    indices = np.random.permutation(len(y))

    # defining indices for y, tx
    delimiter_indice = int(ratio * len(y))
    te_indices = indices[delimiter_indice:]
    tr_indices = indices[:delimiter_indice]

    # creating the train/test sets
    y_te = y[te_indices]
    y_tr = y[tr_indices]
    tx_te = tx[te_indices]
    tx_tr = tx[tr_indices]
    return y_te, y_tr, tx_te, tx_tr

# Loading data

In [20]:
yb,      input_data,      ids      = helpers.load_csv_data("./data/train.csv")
yb_test, input_data_test, ids_test = helpers.load_csv_data("./data/test.csv")
# building the 0/1 label vector y used by logistic regression (yb > 0 maps to 1)
y  = np.where(yb > 0, 1, 0)

# Finding the best model for the classification

### Logistic regression and least squares
First, we normalize the data and prepend a bias column.

In [21]:
# normalizing the features, then prepending a column of ones for the bias term
x = implementation.z_normalize(input_data)
tx = np.append(np.ones(len(x)).reshape(-1,1), x, axis=1)
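
The helper `implementation.z_normalize` is not shown in this notebook. As a minimal sketch, assuming it standardizes each column to zero mean and unit variance, the preparation step could look like the following. The names `z_normalize_sketch`, `x_demo` and `tx_demo` are hypothetical and only used for illustration.

```python
import numpy as np

def z_normalize_sketch(x):
    """Hypothetical stand-in for implementation.z_normalize (assumed behavior):
    standardize each column to zero mean and unit variance."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (x - mean) / std

# toy data: 3 samples, 2 features (values chosen only for illustration)
x_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])
x_norm = z_normalize_sketch(x_demo)

# prepend a column of ones so that the first weight acts as the bias
tx_demo = np.append(np.ones(len(x_norm)).reshape(-1, 1), x_norm, axis=1)
print(tx_demo.shape)
```

With the bias column in place, `tx.dot(w)` gives the full linear score without a separate intercept variable.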

Then, we train our two models.

In [22]:
w_log_reg, loss = implementation.logistic_regression(y, tx, initial_w=np.zeros(tx.shape[1]), max_iters=2000, gamma=0.000003)
w_ls, loss      = implementation.least_squares(yb, tx)

Logistic regression achieves slightly better accuracy on the full training data (0.750 against 0.745 for least squares).

In [23]:
print('Accuracy for logistic regression : ',accuracy(y, tx, w_log_reg))
print('Accuracy for least squares       : ',accuracy(y, tx, w_ls))

Accuracy for logistic regression :  0.75024
Accuracy for least squares       :  0.744972


#### Testing overfitting
We split our dataset into a training set and a test set.

We first train the model on the training set, then evaluate it on the test set.

In [24]:
y_te,  y_tr,  tx_te, tx_tr = build_test_train(y, tx)

In [25]:
w_log_reg, loss = implementation.logistic_regression(y_tr, tx_tr, initial_w=np.zeros(tx_tr.shape[1]), max_iters=2000, gamma=0.000003)

In [26]:
print('Accuracy for training set : ',accuracy(y_tr, tx_tr, w_log_reg))
print('Accuracy for testing set  : ',accuracy(y_te, tx_te, w_log_reg))

Accuracy for training set :  0.7504977777777778
Accuracy for testing set  :  0.74872


We can see that the accuracies on the training and testing sets are very close (0.7505 and 0.7487).
This means that our model does not overfit noticeably, so adding a regularization term would probably not yield significant improvements on the test set.

Looking at the leaderboard, however, these results are not good enough: many teams achieve over 0.8 accuracy, which suggests that our model is too simple.

### Using interaction of predictors

One good way to increase the complexity of the model is to add interaction predictors.
Let $p_{ij}$ be the interaction predictor combining feature $f_i$ and feature $f_j$.
Then, $p_{ij} = f_i \cdot f_j$.
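
To make the definition concrete, here is a small sketch on toy data (the values and the names `features` and `interactions` are made up for illustration). With 3 features there are $3 \cdot 2 / 2 = 3$ interaction predictors, one per unordered pair. The sketch leaves out normalization and the bias term.

```python
import numpy as np

# toy data: 2 samples, 3 features (values chosen only for illustration)
features = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]])

n_features = features.shape[1]

# one interaction column p_ij = f_i * f_j per unordered pair (i, j) with j < i
interactions = np.column_stack([
    features[:, i] * features[:, j]
    for i in range(n_features)
    for j in range(i)
])

# columns are ordered as (f_1*f_0, f_2*f_0, f_2*f_1)
print(interactions)
```

The number of added columns grows quadratically with the number of features, so the augmented input can become much wider than the original one.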

The function `build_interaction_tx` returns an array `tx` containing the initial features and the interaction predictors.

In [27]:
def build_interaction_tx(input_data, normalisation_function):
    """return the input vector tx with interaction terms"""
    # first normalizing the input data
    input_data = normalisation_function(input_data)

    n_features = input_data.shape[1]
    n_interacted_features = (n_features - 1) * n_features // 2

    # creating the future output array
    x = np.empty((n_features + n_interacted_features, len(input_data)))
    x[:n_features] = input_data.T

    # adding interaction predictors to the output array
    index = n_features
    for i in range(n_features):
        for j in range(i):
            x[index] = x[i] * x[j]
            index = index + 1

    # normalizing the data and adding the bias term
    x = normalisation_function(x.T)
    tx = np.append(np.ones(len(x)).reshape(-1,1), x, axis=1)

    return tx

First, we build our new input with interaction terms.
Then, we split it into training and testing sets.

In [28]:
tx = build_interaction_tx(input_data, implementation.z_normalize)
y_te, y_tr, tx_te, tx_tr = build_test_train(y, tx)
yb_te, yb_tr, tx_te, tx_tr = build_test_train(yb, tx)
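
Calling `build_test_train` twice is safe here because both calls use the same default seed, so the rows are shuffled identically and the `y` and `yb` splits stay aligned with the same `tx_tr` and `tx_te`. The sketch below illustrates this with a hypothetical helper, `permuted_indices`, that mirrors only the seeding and permutation step of the function.

```python
import numpy as np

def permuted_indices(n, seed=1):
    """Hypothetical helper: seeded permutation of range(n),
    mirroring the first step of build_test_train."""
    np.random.seed(seed)
    return np.random.permutation(n)

first = permuted_indices(10)
second = permuted_indices(10)

# the same seed and the same length give the same shuffling
print(np.array_equal(first, second))
```

If the two calls used different seeds, the labels in `yb_tr` would no longer correspond to the rows in `tx_tr`.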

We now train our two models.

In [29]:
w_log_reg, loss = implementation.logistic_regression(y_tr, tx_tr, initial_w=np.zeros(tx.shape[1]), max_iters=8000, gamma=0.0000005)
w_ls,      loss = implementation.least_squares(yb_tr, tx_tr)

In [30]:
print('Least squares')
print('Accuracy for training set : ',accuracy(y_tr, tx_tr, w_ls))
print('Accuracy for testing set  : ',accuracy(y_te, tx_te, w_ls))
print('Logistic regression')
print('Accuracy for training set : ',accuracy(y_tr, tx_tr, w_log_reg))
print('Accuracy for testing set  : ',accuracy(y_te, tx_te, w_log_reg))

Least squares
Accuracy for training set :  0.7910755555555555
Accuracy for testing set  :  0.7894
Logistic regression
Accuracy for training set :  0.8161022222222222
Accuracy for testing set  :  0.81424


We can see that logistic regression achieves better results than least squares (0.814 against 0.789 on the testing set).
Moreover, this model does not suffer from overfitting, as the accuracies on the training and testing sets are very close.

# Training the best model

The best model we found is logistic regression on interaction terms, so we train this model on the entire dataset.

In [31]:
tx = build_interaction_tx(input_data, implementation.z_normalize)

In [32]:
w_best, loss = implementation.logistic_regression(y, tx, initial_w=np.zeros(tx.shape[1]), max_iters=8000, gamma=0.0000005)

We now predict the output for the testing set.

In [33]:
tx_test = build_interaction_tx(input_data_test, implementation.z_normalize)

In [34]:
y_pred = np.where(tx_test.dot(w_best) > 0, 1, -1)
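
Thresholding the linear score at 0 is equivalent to thresholding the predicted probability at 0.5, assuming the model uses the standard sigmoid link (the internals of `implementation.logistic_regression` are not shown in this notebook). A small sketch with made-up scores, where `sigmoid`, `scores`, `by_score` and `by_prob` are illustrative names:

```python
import numpy as np

def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# made-up linear scores, standing in for tx_test.dot(w_best)
scores = np.array([-2.0, -0.1, 0.3, 4.0])

# thresholding the raw score at 0 ...
by_score = np.where(scores > 0, 1, -1)
# ... gives the same labels as thresholding the probability at 0.5
by_prob = np.where(sigmoid(scores) > 0.5, 1, -1)

print(np.array_equal(by_score, by_prob))
```

This is why the prediction step can skip the sigmoid and work directly on `tx_test.dot(w_best)`.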

In [35]:
name = 'submission2'
helpers.create_csv_submission(ids_test, y_pred, name)