In [1]:
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd
from nose.tools import *
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Write your imports in the cell below

In [3]:
# YOUR CODE HERE
# raise NotImplementedError()

In [4]:
np.random.seed(1234)

# Model Training and Improvement Lab
## Comparing and selecting models

### 1. Read the data (1 point)
Like in the previous lab, you need to read the Portuguese bank dataset [here](https://archive.ics.uci.edu/ml/machine-learning-databases/00222/). It has been provided for you in the `data` folder.

Read the dataset using `pandas` (you can use the library with the alias `pd`). Save it in the `bank_data` variable.

In [5]:
bank_data = pd.read_csv("data/bank.csv", sep=";")
# YOUR CODE HERE
# raise NotImplementedError()

In [6]:
# From now on, all test cells might contain hidden tests. If you follow the instructions correctly,
# your solution will be graded with maximum points
assert_is_not_none(bank_data)

### 2. Preprocess the data (1 point)
Separate explanatory features from labels. Save all features (16 columns total) in the variable `bank_features`. Save the labels (corresponding to the `y` column) in the `bank_labels` variable. Rewrite the labels to be `0` and `1` instead of `no` and `yes`: `bank_labels` should be a numeric column.

In [7]:
bank_features, bank_labels = (
    bank_data[bank_data.columns[:-1]],
    bank_data[bank_data.columns[-1]],
)
# YOUR CODE HERE
# raise NotImplementedError()
bank_labels = bank_labels.replace("no", 0)
bank_labels = bank_labels.replace("yes", 1)

In [8]:
bank_features.columns.__len__()

16

In [9]:
bank_labels.dtype

dtype('int64')

In [10]:
assert_is_not_none(bank_features)
assert_is_not_none(bank_labels)

### 3. Get indicator variables (1 point)
Get indicator (dummy) variables for all categorical columns in `bank_features`. Overwrite the `bank_features` variable to store the new data.

In [11]:
# YOUR CODE HERE
bank_features = pd.get_dummies(bank_features)

In [12]:
assert_equal(bank_features.shape, (4521, 51))

### 4. Split the data (1 point)
Split the data into training and testing set, with 70% of the data for training. Because the output labels are not equaly distributed, use stratification based on the `bank_labels`.

In [13]:
(
    bank_features_train,
    bank_features_test,
    bank_labels_train,
    bank_labels_test,
) = train_test_split(bank_features, bank_labels, train_size=0.7)
# YOUR CODE HERE
# raise NotImplementedError()

In [14]:
bank_features_train.shape,bank_features_test.shape,bank_labels_train.shape,bank_labels_test.shape

((3164, 51), (1357, 51), (3164,), (1357,))

In [15]:
assert_is_not_none(bank_features_train)
assert_is_not_none(bank_labels_train)
assert_is_not_none(bank_features_test)
assert_is_not_none(bank_labels_test)

### 5. Train a baseline algorithm (1 point)
Train a logistic regression using the training data. Use 1 000 000 (`1e6`) as the value of C. Score it using the testing data. Save the score in the `baseline_score` variable. You should see a fairly high score.

In [16]:
model = LogisticRegression(C=1e6).fit(bank_features_train, bank_labels_train)
baseline_score = model.score(bank_features_test, bank_labels_test)

# YOUR CODE HERE
# raise NotImplementedError()

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [17]:
baseline_score

0.8865143699336773

In [18]:
assert_is_not_none(model)
assert_greater(baseline_score, 0.7)

### 6. Select a better score (2 points)
As you alrady saw, the positive examples are very few. If you aren't convinced, just check the counts.

We know that the default scoring (accuracy) isn't correct in this case. Better measures would be precision and recall. However, we only want one number. Evaluate the algorithm once again, using a standard scoring method which combines precision and recall. Overwrite the `baseline_score` variable.

Don't forget to score the model on the testing data only.

In [19]:
# YOUR CODE HERE
# raise NotImplementedError()
baseline_score = f1_score(bank_labels_test, model.predict(bank_features_test))
print(baseline_score)

0.2735849056603774


In [20]:
assert_less(baseline_score, 0.7)

### 7. Tune your model (2 points)
Fine-tune the `C` and `max_iter` parameters.

Use full grid search with the following values:
* `C`: 0.0001, 0.01, 0.1, 1, 10, 100, 10000
* `max_iter`: 50, 100, 300, 1000
* `fit_itercept`: True, False

Save the grid search result in the `grid_search` variable. Don't forget to use the better scoring model that you obtained in the previous task.

In [21]:
tuned_params = [
    {
        "C": [0.0001, 0.01, 0.1, 1, 10, 100, 10000],
        "max_iter": [50, 100, 300, 1000],
        "fit_intercept": [True, False],
    }
]
grid_search = GridSearchCV(LogisticRegression(), tuned_params, scoring="f1")

grid_search.fit(bank_features_train, bank_labels_train)
# YOUR CODE HERE
# raise NotImplementedError()

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

In [22]:
assert_is_not_none(grid_search)
assert_is_not_none(grid_search.best_estimator_)

### 8. Compare scores (1 point)
Use the best estimator from your grid search. Score it using the function from problem 6. Save your answer in `tuned_score`.

In [23]:
tuned_score = f1_score(
    bank_labels_test, grid_search.best_estimator_.predict(bank_features_test)
)

# YOUR CODE HERE
# raise NotImplementedError()

In [24]:
print(tuned_score)

0.3949579831932773


In [25]:
print(baseline_score - tuned_score)

-0.12137307753289989


Hmmmm, it seems we have not obtained a better algorithm, even the opposite (the difference is marginal and depends on the random initialization of the cross-validation datasets).

We can, of course, do a lot more things to improve our model's performance, such as normalizing the data, feature selection and feature engineering, trying out different aspects, e.g. polynomial terms, RANSAC; even boosting (we'll talk about this later). However, we'll stop at this point.

What can we conclude? It seems that this is close to the best performance we can get out of this algorithm, given these data points.

We can try improving (cleaning) our dataset, selecting features, etc. but we most likely need a better algorithm. In the next labs, we're going to explore that.