# Homework 3: Classification, Regression, and Other Prediction Models

Before starting, review the past homework to identify the mistakes I made in earlier exercises.

### HW0 Data Preprocessing - Review

+ Issues in data preprocessing / data cleaning
  + discard outliers (the temperature dataset has unreasonable values such as -99.5)
  + ignore, impute, or interpolate missing values (the taipower dataset lacks data from Jan. to Apr.)
  + align records to the top of each hour (make sure each dataset has exactly one record per hour)
+ Flatten JSON to DataFrame: `pandas.json_normalize()` (formerly `pandas.io.json.json_normalize()`, deprecated since pandas 1.0)
+ Write DataFrame to database: `pandas.DataFrame.to_sql()`
+ Calculate Euclidean distance: `numpy.linalg.norm(X - Y)`
+ Other useful functions: `scipy.signal.filtfilt()` for smoothing (noise reduction)
+ Useful programming skills: f-strings, e.g. `f"Hello, my name is {name}."`
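
The first two cleaning steps in the list above (discarding outliers and interpolating missing values) can be sketched without any libraries. This is only an illustration: the function name `clean_series`, the bounds of -10 and 45, and the sample readings are assumptions, not values from the homework.

```python
from typing import List, Optional


def clean_series(values: List[Optional[float]],
                 lower: float, upper: float) -> List[Optional[float]]:
    """Mark out-of-range readings as missing, then linearly interpolate interior gaps."""
    # Step 1: treat anything outside [lower, upper] as missing.
    cleaned = [v if v is not None and lower <= v <= upper else None for v in values]
    # Step 2: fill each gap that has a valid reading on both sides.
    known = [i for i, v in enumerate(cleaned) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        for i in range(left + 1, right):
            weight = (i - left) / gap
            cleaned[i] = cleaned[left] + weight * (cleaned[right] - cleaned[left])
    return cleaned


# Made-up hourly temperatures: -99.5 is a sentinel outlier, None is a missing reading.
readings = [25.0, -99.5, 27.0, None, None, 30.0, None]
cleaned = clean_series(readings, lower=-10.0, upper=45.0)
print(cleaned)
```

Gaps at the very start or end of the series are left as `None`, because there is no second neighbour to interpolate from; such records would have to be ignored or imputed some other way.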

### HW1 Association Rules - Review

+ Three discretization methods
  + equal width (`pandas.cut()`)
  + equal frequency (`pandas.qcut()`)
  + clustering (e.g. k-means)
+ Investigate DataFrame: `head()`, `describe()`, `corr()`
+ Plot DataFrame
  + `df.A.plot(secondary_y=True, label='name', legend=True)` to add another y-axis
  + `df.A.plot.hist(edgecolor='k')` for histograms
+ We can also discretize multidimensional data.
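
To illustrate how the first two discretization methods differ, here is a small sketch on made-up values with one large outlier. The data and variable names are assumptions chosen for the example.

```python
import pandas as pd

# Made-up values; 100 is a deliberate outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Equal width: the range 1..100 is cut into 4 intervals of the same length,
# so the outlier stretches the bins and most values fall into the first one.
equal_width = pd.cut(values, 4, labels=False)

# Equal frequency: bin edges are quantiles, so each bin gets the same number of values.
equal_freq = pd.qcut(values, 4, labels=False)

print(equal_width.tolist())  # [0, 0, 0, 0, 0, 0, 0, 3]
print(equal_freq.tolist())   # [0, 0, 1, 1, 2, 2, 3, 3]
```

With skewed data, equal-width bins can end up almost empty, which is one reason to prefer equal-frequency bins when balanced classes matter.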

### HW2 Clustering - Review

+ Use the elbow method or the silhouette score to choose k.
+ For multidimensional data, normalizing data in advance may produce better results.
+ For high-dimensional data, use principal component analysis (PCA) to reduce dimensions.
+ The clustering result of taipower data in `temporal-clustering.ipynb` also shows a periodic pattern.
  + The electricity consumption on weekends tends to belong to a different cluster.
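
As a rough illustration of why normalizing multidimensional data can matter for distance-based clustering, the sketch below standardizes each column to zero mean and unit standard deviation. The three (temperature, consumption) rows are made up, and `standardize` is a hypothetical helper, not code from the homework.

```python
import math
import statistics


def standardize(rows):
    """Rescale every column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Assumes no column is constant; a zero deviation would divide by zero.
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]


# Made-up (temperature, consumption) pairs on very different scales.
rows = [(20.0, 800.0), (30.0, 805.0), (21.0, 1000.0)]
scaled = standardize(rows)

# Raw distances are dominated by the consumption column.
raw_01 = math.dist(rows[0], rows[1])
raw_02 = math.dist(rows[0], rows[2])
# After scaling, a 10-degree gap counts about as much as a 200-unit consumption gap.
scaled_01 = math.dist(scaled[0], scaled[1])
scaled_02 = math.dist(scaled[0], scaled[2])
print(raw_01, raw_02, scaled_01, scaled_02)
```

In the raw data the second distance is far larger than the first; after standardization the two are of similar size, so both features influence the cluster assignment.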

## Task 1

Round each DATETIME to the nearest hour (e.g. 2017/10/16 14:37:52 -> 2017/10/16 15:00:00).

```sql
UPDATE 逐時觀測
SET 逐時觀測.時間 = DATE_FORMAT(
  DATE_ADD(逐時觀測.時間, INTERVAL 30 MINUTE),
  '%Y-%m-%d %H:00:00'
)
```

```sql
UPDATE Power
SET Power.updateTime = DATE_FORMAT(
  DATE_ADD(Power.updateTime, INTERVAL 30 MINUTE),
  '%Y-%m-%d %H:00:00'
)
```
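
The two queries above share one rule: add 30 minutes, then truncate to the hour. The same rule can be sketched in Python; `round_to_nearest_hour` is a hypothetical helper name used only for this illustration.

```python
from datetime import datetime, timedelta


def round_to_nearest_hour(ts: datetime) -> datetime:
    """Add 30 minutes, then drop minutes and seconds (the same rule as the SQL above)."""
    shifted = ts + timedelta(minutes=30)
    return shifted.replace(minute=0, second=0, microsecond=0)


print(round_to_nearest_hour(datetime(2017, 10, 16, 14, 37, 52)))  # 2017-10-16 15:00:00
print(round_to_nearest_hour(datetime(2017, 10, 16, 14, 29, 59)))  # 2017-10-16 14:00:00
print(round_to_nearest_hour(datetime(2017, 10, 16, 23, 45, 0)))   # 2017-10-17 00:00:00
```

Note that a timestamp exactly on the half hour (e.g. 14:30:00) rounds up under this rule.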

A naive way to generate the dataset:

+ historical feature: temperature in Banqiao, Taiwan
+ future target: electricity consumption in northern Taiwan

```sql
DROP PROCEDURE IF EXISTS dm.generate_dataset;
DELIMITER //
CREATE PROCEDURE dm.generate_dataset(from_time DATETIME, to_time DATETIME)
BEGIN
  DECLARE it INT;
  DECLARE temp DOUBLE;
  DECLARE output TEXT;
  DECLARE curr_time DATETIME;
  WHILE from_time <= to_time DO
    SET it = 0;
    SET output = '';
    SET curr_time = from_time;
    generate_record: WHILE it < 102 DO
      SET temp = NULL;
      IF it < 96 THEN
        SELECT 逐時觀測.溫度 INTO temp FROM 逐時觀測
          WHERE (逐時觀測.時間 = curr_time) AND (逐時觀測.測站 = 'BANQIAO,板橋')
          LIMIT 1;
      ELSE
        SELECT Power.northUsage INTO temp FROM Power
          WHERE (Power.updateTime = curr_time)
          LIMIT 1;
      END IF;
      IF temp IS NULL THEN
        SET output = '';
        LEAVE generate_record;
      END IF;
      SET it = it + 1;
      SET output = CONCAT(output, ',', CAST(temp AS CHAR));
      SET curr_time = DATE_ADD(curr_time, INTERVAL 1 HOUR);
    END WHILE;
    IF output <> '' THEN
      INSERT INTO Records VALUES (output);
    END IF;
    SET from_time = DATE_ADD(from_time, INTERVAL 1 HOUR);
  END WHILE;
END//
DELIMITER ;
SET @from_time = CAST('2016-07-03 01:00:00' AS DATETIME);
SET @to_time = CAST('2017-07-04 00:00:00' AS DATETIME);
CALL generate_dataset(@from_time, @to_time);
```

This is very inefficient because it issues up to 102 single-row queries per record. Try a better approach.

First, check both tables for duplicate timestamps.

```sql
SELECT 逐時觀測.時間
FROM 逐時觀測
WHERE 逐時觀測.測站 = 'BANQIAO,板橋'
GROUP BY 逐時觀測.時間
HAVING COUNT(*) > 1
```

0 rows retrieved.

```sql
SELECT Power.updateTime, COUNT(*) AS cnt
FROM Power
GROUP BY Power.updateTime
HAVING cnt > 1
```

updateTime | cnt
--- | ---
2016-10-17 00:00:00 | 2
2017-05-20 08:00:00 | 3
2017-05-20 13:00:00 | 2
2017-05-30 22:00:00 | 2
2017-07-19 09:00:00 | 2
2017-07-24 01:00:00 | 2
2017-07-24 13:00:00 | 2
2017-08-21 20:00:00 | 2
2017-08-21 21:00:00 | 2

Remove the surplus rows with the following query, which deletes one duplicated row per run.

```sql
-- run 10 times (once per surplus row: 8 timestamps with 2 rows, 1 with 3)
DELETE FROM Power
WHERE Power.updateTime IN (
  SELECT A.updateTime
  FROM (SELECT * FROM Power) AS A
  GROUP BY A.updateTime
  HAVING COUNT(*) > 1
) LIMIT 1
```
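
The figure of 10 runs follows from the duplicate counts in the table above: each run removes one surplus row, and a timestamp with `cnt` rows has `cnt - 1` surplus rows. A small sketch that simulates the repeated `DELETE ... LIMIT 1` on those counts (the simulation itself is an illustration, not part of the homework):

```python
from collections import Counter

# Duplicate counts copied from the query result above.
counts = Counter({
    "2016-10-17 00:00:00": 2,
    "2017-05-20 08:00:00": 3,
    "2017-05-20 13:00:00": 2,
    "2017-05-30 22:00:00": 2,
    "2017-07-19 09:00:00": 2,
    "2017-07-24 01:00:00": 2,
    "2017-07-24 13:00:00": 2,
    "2017-08-21 20:00:00": 2,
    "2017-08-21 21:00:00": 2,
})

surplus = sum(cnt - 1 for cnt in counts.values())

runs = 0
while any(cnt > 1 for cnt in counts.values()):
    # Each run deletes a single row whose timestamp still occurs more than once.
    victim = next(ts for ts, cnt in counts.items() if cnt > 1)
    counts[victim] -= 1
    runs += 1

print(surplus, runs)  # 10 10
```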

Store each record (96 hourly temperature readings followed by 6 hourly consumption targets) in its own CSV file.  
Skip any record that contains missing data.

```sql
DROP PROCEDURE IF EXISTS dm.generate_dataset;
DELIMITER //
CREATE PROCEDURE dm.generate_dataset(from_time DATETIME, to_time DATETIME)
BEGIN
  DECLARE curr_time DATETIME DEFAULT from_time;
  SET @path = '/Users/ywpu/Repositories/data-mining/hw3/';
  generate_record: WHILE curr_time <= to_time DO
    SET @cnt1 = (
      SELECT COUNT(逐時觀測.溫度) FROM 逐時觀測
        WHERE (逐時觀測.時間 >= curr_time)
        AND (逐時觀測.時間 < DATE_ADD(curr_time, INTERVAL 4 DAY))
        AND (逐時觀測.測站 = 'BANQIAO,板橋')
    );
    SET @cnt2 = (
      SELECT COUNT(Power.northUsage) FROM Power
        WHERE (Power.updateTime >= DATE_ADD(curr_time, INTERVAL 4 DAY))
        AND (Power.updateTime < DATE_ADD(curr_time, INTERVAL '4 6' DAY_HOUR))
    );
    IF @cnt1 <> 96 OR @cnt2 <> 6 THEN
      SET curr_time = DATE_ADD(curr_time, INTERVAL 1 HOUR);
      ITERATE generate_record;
    END IF;
    SET @filename = CONCAT(@path, CAST(curr_time AS CHAR), '.csv');
    SET @query_string = CONCAT(
      "SELECT A.Y INTO OUTFILE '", @filename, "'
        FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
      FROM (
        (
          SELECT 逐時觀測.時間 AS X, 逐時觀測.溫度 AS Y FROM 逐時觀測
          WHERE (逐時觀測.時間 >= ?)
          AND (逐時觀測.時間 < DATE_ADD(?, INTERVAL 4 DAY))
          AND (逐時觀測.測站 = 'BANQIAO,板橋')
        ) UNION ALL (
          SELECT Power.updateTime AS X, Power.northUsage AS Y FROM Power
          WHERE (Power.updateTime >= DATE_ADD(?, INTERVAL 4 DAY))
          AND (Power.updateTime < DATE_ADD(?, INTERVAL '4 6' DAY_HOUR))
        ) ORDER BY X ASC
      ) A"
    );
    PREPARE stmt FROM @query_string;
    EXECUTE stmt USING curr_time, curr_time, curr_time, curr_time;
    DEALLOCATE PREPARE stmt;
    SET curr_time = DATE_ADD(curr_time, INTERVAL 1 HOUR);
  END WHILE;
END//
DELIMITER ;
SET @from_time = CAST('2016-07-03 01:00:00' AS DATETIME);
SET @to_time = CAST('2017-07-04 00:00:00' AS DATETIME);
CALL generate_dataset(@from_time, @to_time);
```
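
The windowing rule that the procedure implements (96 hours of temperature, then 6 hours of consumption, skipping any window with a gap) can be sketched in plain Python. The dictionaries, constant values, and the helper `build_record` are assumptions made for this illustration; the real data lives in the database tables.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional

HISTORY_HOURS = 96  # four days of hourly temperature
TARGET_HOURS = 6    # the six hours of consumption that follow


def build_record(start: datetime,
                 temperature: Dict[datetime, float],
                 usage: Dict[datetime, float]) -> Optional[List[float]]:
    """Return 96 temperatures + 6 usage values, or None if any hour is missing."""
    hours = [start + timedelta(hours=h) for h in range(HISTORY_HOURS + TARGET_HOURS)]
    row = [temperature.get(t) for t in hours[:HISTORY_HOURS]]
    row += [usage.get(t) for t in hours[HISTORY_HOURS:]]
    return None if any(v is None for v in row) else row


# Synthetic hourly data: 110 consecutive hours, with one usage reading removed.
t0 = datetime(2016, 7, 3, 1)
temperature = {t0 + timedelta(hours=h): 25.0 for h in range(110)}
usage = {t0 + timedelta(hours=h): 800.0 for h in range(110)}
del usage[t0 + timedelta(hours=100)]

records = []
for h in range(110):
    row = build_record(t0 + timedelta(hours=h), temperature, usage)
    if row is not None:
        records.append(row)

print(len(records))  # 4
```

Only the windows starting at hours 5 to 8 survive: earlier windows include the missing usage reading in their target range, and later ones run past the end of the data.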

Combine all records into `dataset.csv`.

```python
import os

with open('dataset.csv', 'w') as target:
    # Per-record files are named after their start time, so sorting the
    # names also sorts the records chronologically.
    for entry in sorted(os.listdir()):
        if entry.endswith('.csv') and entry.startswith('201'):
            with open(entry) as file:
                # Each file holds one value per line; join them into one CSV row.
                cols = [line.strip() for line in file]
            target.write(','.join(cols) + '\n')
```

Load `dataset.csv` into a DataFrame.

In [1]:
import numpy as np
import pandas as pd
dataset = pd.read_csv('dataset.csv', names=list(range(0, 102)), dtype=np.float64)

In [2]:
dataset.shape

(4568, 102)

In [3]:
dataset.head()

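index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
0 | 27.6 | 28.4 | 29.4 | 29.7 | 28.3 | 27.1 | 26.3 | 25.9 | 25.6 | 25.6 | ... | 27.3 | 27.4 | 27.4 | 27.7 | 841.3 | 826.4 | 789.5 | 773.6 | 778.0 | 783.1
1 | 28.4 | 29.4 | 29.7 | 28.3 | 27.1 | 26.3 | 25.9 | 25.6 | 25.6 | 25.4 | ... | 27.4 | 27.4 | 27.7 | 26.6 | 826.4 | 789.5 | 773.6 | 778.0 | 783.1 | 814.3
2 | 29.4 | 29.7 | 28.3 | 27.1 | 26.3 | 25.9 | 25.6 | 25.6 | 25.4 | 25.3 | ... | 27.4 | 27.7 | 26.6 | 26.0 | 789.5 | 773.6 | 778.0 | 783.1 | 814.3 | 812.4
3 | 29.7 | 28.3 | 27.1 | 26.3 | 25.9 | 25.6 | 25.6 | 25.4 | 25.3 | 25.3 | ... | 27.7 | 26.6 | 26.0 | 25.9 | 773.6 | 778.0 | 783.1 | 814.3 | 812.4 | 809.0
4 | 28.3 | 27.1 | 26.3 | 25.9 | 25.6 | 25.6 | 25.4 | 25.3 | 25.3 | 25.3 | ... | 26.6 | 26.0 | 25.9 | 25.8 | 778.0 | 783.1 | 814.3 | 812.4 | 809.0 | 796.6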


## Task 2

Split the dataset randomly into training (70%) and test (30%) sets using scikit-learn.

In [4]:
from sklearn.model_selection import train_test_split
seed = 666 # Fix the seed so that the results are reproducible.
train, test = train_test_split(dataset, test_size=0.3, random_state=seed)

In [5]:
train.shape

(3197, 102)

In [6]:
train.head()

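index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
2136 | 18.3 | 18.8 | 19.7 | 21.4 | 21.0 | 21.2 | 21.7 | 21.7 | 21.2 | 20.3 | ... | 13.9 | 14.0 | 14.0 | 14.3 | 823.2 | 924.6 | 945.6 | 976.0 | 961.5 | 917.7
3539 | 23.1 | 23.3 | 23.3 | 23.3 | 23.6 | 23.8 | 23.6 | 23.6 | 24.9 | 26.6 | ... | 24.7 | 24.5 | 24.2 | 24.2 | 818.7 | 762.1 | 721.1 | 690.8 | 670.2 | 664.1
2285 | 20.1 | 19.6 | 19.7 | 19.3 | 19.0 | 18.6 | 18.8 | 18.6 | 18.5 | 18.6 | ... | 20.6 | 20.3 | 20.6 | 21.0 | 965.0 | 1011.3 | 1021.1 | 1014.3 | 1022.5 | 1042.4
384 | 23.6 | 23.3 | 23.2 | 23.3 | 23.4 | 23.4 | 23.5 | 23.7 | 24.1 | 24.8 | ... | 24.7 | 24.8 | 25.1 | 25.3 | 842.7 | 788.0 | 757.7 | 745.5 | 726.2 | 711.3
339 | 23.8 | 23.6 | 23.5 | 23.3 | 23.2 | 22.9 | 23.3 | 23.4 | 23.5 | 23.3 | ... | 23.9 | 23.7 | 23.9 | 24.1 | 735.4 | 717.3 | 722.1 | 743.2 | 786.4 | 899.8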


In [7]:
test.shape

(1371, 102)

In [8]:
test.head()

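index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
256 | 29.0 | 28.3 | 27.9 | 27.6 | 27.6 | 27.3 | 27.3 | 26.2 | 26.3 | 26.2 | ... | 23.3 | 23.7 | 22.8 | 22.8 | 828.0 | 851.5 | 896.2 | 895.8 | 889.2 | 880.7
3176 | 22.1 | 22.0 | 21.9 | 22.0 | 22.3 | 23.6 | 24.6 | 26.5 | 27.3 | 28.7 | ... | 25.8 | 25.6 | 25.4 | 24.3 | 764.3 | 728.6 | 737.5 | 737.3 | 759.3 | 794.1
1210 | 27.1 | 26.3 | 25.5 | 25.2 | 24.5 | 24.0 | 23.7 | 23.7 | 23.6 | 23.7 | ... | 20.5 | 22.9 | 25.8 | 28.3 | 922.7 | 924.3 | 943.2 | 912.7 | 929.0 | 929.5
333 | 25.4 | 24.7 | 24.4 | 24.3 | 23.6 | 23.6 | 23.8 | 23.6 | 23.5 | 23.3 | ... | 24.3 | 24.0 | 23.8 | 23.2 | 1005.4 | 938.5 | 884.5 | 823.7 | 772.6 | 738.9
560 | 30.0 | 30.3 | 28.9 | 28.5 | 28.2 | 28.1 | 28.3 | 28.3 | 28.2 | 28.3 | ... | 30.7 | 31.8 | 31.6 | 31.4 | 1103.5 | 1075.9 | 1080.4 | 1078.0 | 1108.0 | 1093.8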


## Task 3

Before classification, discretize the targets into four equal-frequency bins using pandas.

In [57]:
# Call pd.qcut(column, 4, labels=False) on each target column (96 to 101).
discrete = dataset.iloc[:, 96:102].apply(pd.qcut, args=(4,), labels=False)

In [58]:
train_labels = discrete.loc[train.index, :]
test_labels = discrete.loc[test.index, :]

In [59]:
train_features = train.iloc[:, 0:96]
test_features = test.iloc[:, 0:96]

Try different classification algorithms using scikit-learn.
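
Every classifier in this task is scored with `np.sum(np.equal(...)) / test_labels.size`, which is element-wise accuracy: the fraction of individual labels, over all samples and all six target columns, that were predicted correctly. A minimal sketch on made-up labels (three target columns instead of six, for brevity) shows how this differs from requiring a whole row to be right:

```python
import numpy as np

# Made-up labels: 4 samples, 3 target columns, classes 0..3.
y_true = np.array([[0, 1, 2], [3, 3, 1], [2, 0, 0], [1, 1, 1]])
y_pred = np.array([[0, 1, 1], [3, 3, 1], [2, 2, 0], [0, 0, 0]])

matches = np.equal(y_true, y_pred)

# Element-wise accuracy, as used in this homework: 7 of 12 labels match.
elementwise = matches.sum() / y_true.size

# Exact-match accuracy: only 1 of 4 rows is correct in every column.
exact_match = matches.all(axis=1).mean()

print(elementwise, exact_match)
```

Since the targets are split into four equal-frequency bins, uniformly random guessing would score roughly 0.25 on the element-wise measure, which gives a baseline for reading the numbers in this task.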

### k-nearest neighbors

In [12]:
%%time
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier()
clf.fit(train_features, train_labels)
test_preds = clf.predict(test_features)
diff = np.equal(test_labels.values, test_preds)
accuracy = np.sum(diff) / test_labels.size

CPU times: user 369 ms, sys: 6.14 ms, total: 375 ms
Wall time: 377 ms


In [13]:
clf

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [14]:
accuracy

0.76112326768781913

### Multinomial naive Bayes

In [15]:
%%time
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clfs = MultiOutputClassifier(clf)
clfs.fit(train_features, train_labels)
test_preds = clfs.predict(test_features)
diff = np.equal(test_labels.values, test_preds)
accuracy = np.sum(diff) / test_labels.size

CPU times: user 22.9 ms, sys: 2.73 ms, total: 25.6 ms
Wall time: 23.3 ms


In [16]:
clf

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [17]:
accuracy

0.4987843423292001

### Random forest

In [18]:
%%time
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=seed)
clf.fit(train_features, train_labels)
test_preds = clf.predict(test_features)
diff = np.equal(test_labels.values, test_preds)
accuracy = np.sum(diff) / test_labels.size

CPU times: user 297 ms, sys: 10.9 ms, total: 308 ms
Wall time: 308 ms


In [19]:
clf

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=666, verbose=0, warm_start=False)

In [20]:
accuracy

0.76428397763189881

### Support vector machine

In [21]:
%%time
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC
clf = SVC()
clfs = MultiOutputClassifier(clf)
clfs.fit(train_features, train_labels)
test_preds = clfs.predict(test_features)
diff = np.equal(test_labels.values, test_preds)
accuracy = np.sum(diff) / test_labels.size

CPU times: user 17.7 s, sys: 50.6 ms, total: 17.8 s
Wall time: 17.8 s


In [22]:
clf

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [23]:
accuracy

0.81108679795769512

### Gaussian process

In [24]:
%%time
from sklearn.multioutput import MultiOutputClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
clf = GaussianProcessClassifier(random_state=seed)
clfs = MultiOutputClassifier(clf)
clfs.fit(train_features, train_labels)
test_preds = clfs.predict(test_features)
diff = np.equal(test_labels.values, test_preds)
accuracy = np.sum(diff) / test_labels.size

CPU times: user 36min 30s, sys: 13.7 s, total: 36min 44s
Wall time: 11min 12s


In [25]:
clf

GaussianProcessClassifier(copy_X_train=True, kernel=None,
             max_iter_predict=100, multi_class='one_vs_rest', n_jobs=1,
             n_restarts_optimizer=0, optimizer='fmin_l_bfgs_b',
             random_state=666, warm_start=False)

In [26]:
accuracy

0.67456844152686601

### Decision tree

In [27]:
%%time
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=seed)
clf.fit(train_features, train_labels)
test_preds = clf.predict(test_features)
diff = np.equal(test_labels.values, test_preds)
accuracy = np.sum(diff) / test_labels.size

CPU times: user 372 ms, sys: 3.03 ms, total: 375 ms
Wall time: 374 ms


In [28]:
clf

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=666,
            splitter='best')

In [29]:
accuracy

0.67213712618526622

### Multilayer perceptron

In [63]:
%%time
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(random_state=seed)
clfs = MultiOutputClassifier(clf)
clfs.fit(train_features, train_labels)
test_preds = clfs.predict(test_features)
diff = np.equal(test_labels.values, test_preds)
accuracy = np.sum(diff) / test_labels.size

CPU times: user 2.68 s, sys: 203 ms, total: 2.88 s
Wall time: 2.59 s


In [64]:
clf

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=666,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [65]:
accuracy

0.46498905908096277

### Logistic regression

Note: The `LogisticRegression()` in scikit-learn is a **classifier**.

In [66]:
%%time
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=seed)
clfs = MultiOutputClassifier(clf)
clfs.fit(train_features, train_labels)
test_preds = clfs.predict(test_features)
diff = np.equal(test_labels.values, test_preds)
accuracy = np.sum(diff) / test_labels.size

CPU times: user 7.87 s, sys: 28.9 ms, total: 7.9 s
Wall time: 7.91 s


In [67]:
clf

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=666, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [68]:
accuracy

0.59713104789691218

### Gradient boosting

In [69]:
%%time
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=seed)
clfs = MultiOutputClassifier(clf)
clfs.fit(train_features, train_labels)
test_preds = clfs.predict(test_features)
diff = np.equal(test_labels.values, test_preds)
accuracy = np.sum(diff) / test_labels.size

CPU times: user 28.3 s, sys: 77.9 ms, total: 28.4 s
Wall time: 28.4 s


In [70]:
clf

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=666, subsample=1.0, verbose=0,
              warm_start=False)

In [71]:
accuracy

0.74507658643326036

Use continuous values instead of discretized ones for regression.

In [72]:
train_labels = train.iloc[:, 96:102]
test_labels = test.iloc[:, 96:102]

In [73]:
train_features = train.iloc[:, 0:96]
test_features = test.iloc[:, 0:96]

Try different regression algorithms using scikit-learn.  
For regression models, evaluate with the coefficient of determination, R² (returned by `score()`; the variable is still named `accuracy` in the cells below).
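
For reference, R² compares the model's squared error with that of a baseline that always predicts the mean: R² = 1 - SS_res / SS_tot. The sketch below covers a single target column with made-up numbers; `r_squared` is a hypothetical helper, and the way scikit-learn combines the six per-column scores into one number is not reproduced here.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination for one target column."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean) ** 2 for t in y_true)                # mean-baseline error
    return 1 - ss_res / ss_tot


y_true = [800.0, 900.0, 1000.0]

print(r_squared(y_true, [800.0, 900.0, 1000.0]))  # 1.0  (perfect prediction)
print(r_squared(y_true, [900.0, 900.0, 900.0]))   # 0.0  (no better than the mean)
print(r_squared(y_true, [1000.0, 800.0, 1200.0])) # -3.5 (worse than the mean)
```

The last case shows that R² has no lower bound: a model that does worse than predicting the mean gets a negative score, which is how a result such as -26.18 can occur.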

### Bayesian regression

In [38]:
%%time
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import BayesianRidge
clf = BayesianRidge()
clfs = MultiOutputRegressor(clf)
clfs.fit(train_features, train_labels)
accuracy = clfs.score(test_features, test_labels.values)

CPU times: user 221 ms, sys: 21.3 ms, total: 243 ms
Wall time: 177 ms


In [39]:
clf

BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, compute_score=False, copy_X=True,
       fit_intercept=True, lambda_1=1e-06, lambda_2=1e-06, n_iter=300,
       normalize=False, tol=0.001, verbose=False)

In [40]:
accuracy

0.66340461121231298

### Decision tree

In [41]:
%%time
from sklearn.tree import DecisionTreeRegressor
clf = DecisionTreeRegressor(random_state=seed)
clf.fit(train_features, train_labels)
accuracy = clf.score(test_features, test_labels.values)

CPU times: user 261 ms, sys: 3.16 ms, total: 265 ms
Wall time: 263 ms


In [42]:
clf

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=666, splitter='best')

In [43]:
accuracy

0.67454335936962351

### Support vector machine

In [44]:
%%time
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR
clf = SVR()
clfs = MultiOutputRegressor(clf)
clfs.fit(train_features, train_labels)
accuracy = clfs.score(test_features, test_labels.values)

CPU times: user 12.9 s, sys: 83.8 ms, total: 13 s
Wall time: 13.1 s


In [45]:
clf

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [46]:
accuracy

0.015476040418789527

### Gradient boosting

In [47]:
%%time
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import GradientBoostingRegressor
clf = GradientBoostingRegressor(random_state=seed)
clfs = MultiOutputRegressor(clf)
clfs.fit(train_features, train_labels)
accuracy = clfs.score(test_features, test_labels.values)

CPU times: user 6.59 s, sys: 18.3 ms, total: 6.61 s
Wall time: 6.62 s


In [48]:
clf

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=100, presort='auto', random_state=666,
             subsample=1.0, verbose=0, warm_start=False)

In [49]:
accuracy

0.77381980918261439

### Gaussian process

In [51]:
%%time
from sklearn.multioutput import MultiOutputRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
clf = GaussianProcessRegressor(random_state=seed)
clfs = MultiOutputRegressor(clf)
clfs.fit(train_features, train_labels)
accuracy = clfs.score(test_features, test_labels.values)

CPU times: user 1min 43s, sys: 1.02 s, total: 1min 44s
Wall time: 38.5 s


In [52]:
clf

GaussianProcessRegressor(alpha=1e-10, copy_X_train=True, kernel=None,
             n_restarts_optimizer=0, normalize_y=False,
             optimizer='fmin_l_bfgs_b', random_state=666)

In [53]:
accuracy

-26.175814875355499

### k-nearest neighbors

In [74]:
%%time
from sklearn.neighbors import KNeighborsRegressor
clf = KNeighborsRegressor()
clf.fit(train_features, train_labels)
accuracy = clf.score(test_features, test_labels.values)

CPU times: user 370 ms, sys: 3.02 ms, total: 373 ms
Wall time: 372 ms


In [75]:
clf

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=5, p=2,
          weights='uniform')

In [76]:
accuracy

0.88205903631459748

### Multilayer perceptron

In [54]:
%%time
from sklearn.neural_network import MLPRegressor
clf = MLPRegressor(random_state=seed)
clf.fit(train_features, train_labels)
accuracy = clf.score(test_features, test_labels.values)

CPU times: user 2.2 s, sys: 188 ms, total: 2.39 s
Wall time: 2.12 s


In [55]:
clf

MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=666,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [56]:
accuracy

0.52728592759425608

## Task 4

Tune the model's hyperparameters with scikit-learn's `GridSearchCV()`.  
`GridSearchCV()` tries every combination in the parameter grid and uses k-fold cross-validation on the training set to pick the best one.
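
To make the search space concrete, the sketch below expands a grid like the k-nearest-neighbors one used in this task into its individual candidates and builds simple k-fold index splits. It is a library-free illustration, not scikit-learn's implementation: `expand_grid` and `k_fold_indices` are hypothetical helpers, and the choice of k = 3 folds without shuffling is an assumption.

```python
from itertools import product

params = {
    'n_neighbors': list(range(2, 11)),
    'weights': ['uniform', 'distance'],
}


def expand_grid(grid):
    """Return one dict per combination of parameter values (Cartesian product)."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous validation folds (no shuffling)."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


candidates = expand_grid(params)
folds = k_fold_indices(10, 3)

print(len(candidates))               # 18
print(candidates[0])                 # {'n_neighbors': 2, 'weights': 'uniform'}
print(folds)                         # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(len(candidates) * len(folds))  # 54 fits before the final refit
```

Each candidate is trained k times, each time holding out one fold for validation, so the cost of the search grows with the product of the grid size and the number of folds.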

Take the `KNeighborsRegressor()` model as an example:

In [118]:
params = {
    'n_neighbors': list(range(2, 11)),
    'weights': ['uniform', 'distance'],
}

In [119]:
%%time
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
clf = KNeighborsRegressor()
clfs = GridSearchCV(clf, params)
clfs.fit(train_features, train_labels)
accuracy = clfs.score(test_features, test_labels.values)

CPU times: user 34.8 s, sys: 91.7 ms, total: 34.8 s
Wall time: 34.9 s


In [120]:
clfs.best_params_

{'n_neighbors': 2, 'weights': 'distance'}

In [121]:
accuracy

0.91661469619254876

The R² of the tuned model is 0.92, whereas the default model scored 0.88.

In [122]:
%%time
from sklearn.neighbors import KNeighborsRegressor
clf = KNeighborsRegressor(n_neighbors=2, weights='distance')
clf.fit(train_features, train_labels)
accuracy = clf.score(test_features, test_labels.values)

CPU times: user 258 ms, sys: 3.72 ms, total: 262 ms
Wall time: 260 ms


Fitting and scoring the tuned model takes about as long as the default one (260 ms vs. 372 ms wall time).

Take the `MLPRegressor()` model as another example:

In [113]:
params = {
    'hidden_layer_sizes': [(100,), (20, 5), (5, 5, 4)],
    'activation': ['relu', 'tanh', 'logistic'],
    'solver': ['adam', 'lbfgs'],
    'random_state': [seed],
    'max_iter': [1000],
}

In [114]:
# Suppress convergence warnings; some grid candidates may fail to converge.
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings('ignore', category=ConvergenceWarning)

In [115]:
%%time
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
clf = MLPRegressor()
clfs = GridSearchCV(clf, params)
clfs.fit(train_features, train_labels)
accuracy = clfs.score(test_features, test_labels.values)

CPU times: user 4min 33s, sys: 13.2 s, total: 4min 46s
Wall time: 4min 12s


In [116]:
clfs.best_params_

{'activation': 'relu',
 'hidden_layer_sizes': (100,),
 'max_iter': 1000,
 'random_state': 666,
 'solver': 'lbfgs'}

In [117]:
accuracy

0.65205433302654625

The R² of the tuned model is 0.65, whereas the default model scored 0.53.

In [125]:
%%time
from sklearn.neural_network import MLPRegressor
clf = MLPRegressor(activation='relu', hidden_layer_sizes=(100,),
        max_iter=1000, random_state=seed, solver='lbfgs')
clf.fit(train_features, train_labels)
accuracy = clf.score(test_features, test_labels.values)

CPU times: user 12.4 s, sys: 240 ms, total: 12.7 s
Wall time: 8.08 s


Computation time is much longer (8.08 s vs. 2.12 s wall time), possibly due to the higher iteration limit (`max_iter=1000`) and the switch to the `lbfgs` solver.