# Homework 3: Classification, Regression, and Other Prediction Models

Before starting, review the earlier homeworks to recall the mistakes I made and the lessons learned.

### HW0 Data Preprocessing - Review

+ Issues in data preprocessing / data cleaning
  + discard outliers (temperature dataset has unreasonable values like -99.5)
  + ignore, impute or interpolate missing values (taipower dataset lacks data from Jan. to Apr.)
  + align records to the hour (make sure there is exactly one record per hour in both datasets)
+ Flatten JSON to DataFrame: `pandas.json_normalize()` (`pandas.io.json.json_normalize()` before pandas 1.0)
+ Write DataFrame to database: `pandas.DataFrame.to_sql()`
+ Calculate Euclidean distance: `numpy.linalg.norm(X - Y)`
+ Other useful functions: `scipy.signal.filtfilt()` for smoothing (noise reduction)
+ Useful Python feature: f-strings, e.g. `f"Hello, my name is {name}."`
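
The cleaning steps listed above can be sketched in pandas roughly as follows. The sample readings, the `-50` cutoff for implausible temperatures, and all variable names are assumptions made for illustration, not values taken from the homework.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly temperatures; -99.5 stands in for a bad sensor reading.
hours = pd.date_range("2017-10-16 00:00", periods=5, freq=pd.Timedelta(hours=1))
temps = pd.Series([27.6, 28.4, -99.5, 29.7, 28.3], index=hours)

# Discard the outlier (assumed cutoff: anything below -50 is invalid),
# then fill the resulting gap by linear interpolation.
clean = temps.where(temps > -50).interpolate()

# Euclidean distance between two vectors.
x = np.array([0.0, 3.0])
y = np.array([4.0, 0.0])
distance = np.linalg.norm(x - y)

name = "clean"
print(f"{name}[2] = {clean.iloc[2]:.2f}, distance = {distance}")
```

Whether to interpolate or to drop a missing value depends on how long the gap is; a single missing hour is easy to interpolate, while a gap of several months is probably better ignored.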

### HW1 Association Rules - Review

+ Three discretization methods
  + equal width (`pandas.cut()`)
  + equal frequency (`pandas.qcut()`)
  + clustering (e.g. k-means)
+ Investigate DataFrame: `head()`, `describe()`, `corr()`
+ Plot DataFrame
  + `df.A.plot(secondary_y=True, label='name', legend=True)` to add another y-axis
  + `df.A.plot.hist(edgecolor='k')` for histograms
+ We can also discretize multidimensional data.
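
To make the difference between equal-width and equal-frequency binning concrete, here is a small sketch. The sample values and the labels are made up for illustration.

```python
import pandas as pd

# Hypothetical skewed sample: seven small values and one large one.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])
labels = ["A", "B", "C", "D"]

# Equal width: the range 1..100 is cut into four intervals of the same length,
# so almost everything falls into the first bin.
equal_width = pd.cut(values, 4, labels=labels)

# Equal frequency: bin edges are quantiles, so each bin gets two values.
equal_freq = pd.qcut(values, 4, labels=labels)

print(equal_width.tolist())
print(equal_freq.tolist())
```

With skewed data like this, equal-width bins are dominated by the outlier, which is one reason to prefer `pandas.qcut()` when balanced classes are wanted.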

### HW2 Clustering - Review

+ Use the elbow method or silhouette scores to choose k.
+ For multidimensional data, normalizing data in advance may produce better results.
+ For high-dimensional data, use principal component analysis (PCA) to reduce dimensions.
+ The clustering result of taipower data in `temporal-clustering.ipynb` also shows a periodic pattern.
  + The electricity consumption on weekends tends to belong to a different cluster.
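
As a reminder of how the choice of k can be automated, the sketch below normalizes the data and picks the k with the highest silhouette score. The synthetic three-blob data, the candidate range of k, and the fixed seeds are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
points = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

# Normalize each dimension before clustering.
scaled = StandardScaler().fit_transform(points)

# Compute the mean silhouette score for each candidate k.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scaled)
    scores[k] = silhouette_score(scaled, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```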

## Task 1

Round each DATETIME to its nearest hour (e.g. 2017/10/16 14:37:52 -> 2017/10/16 15:00:00).

```sql
UPDATE 逐時觀測
SET 逐時觀測.時間 = DATE_FORMAT(
  DATE_ADD(逐時觀測.時間, INTERVAL 30 MINUTE),
  '%Y-%m-%d %H:00:00'
)
```

```sql
UPDATE Power
SET Power.updateTime = DATE_FORMAT(
  DATE_ADD(Power.updateTime, INTERVAL 30 MINUTE),
  '%Y-%m-%d %H:00:00'
)
```
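
The same add-30-minutes-then-truncate trick can be checked quickly in Python. This is only a sketch of the rounding logic; the function name `round_to_hour` is made up here and is not part of the homework code.

```python
from datetime import datetime, timedelta

def round_to_hour(ts: datetime) -> datetime:
    """Round to the nearest hour by adding 30 minutes and truncating."""
    shifted = ts + timedelta(minutes=30)
    return shifted.replace(minute=0, second=0, microsecond=0)

print(round_to_hour(datetime(2017, 10, 16, 14, 37, 52)))
```

Like the SQL version, this rounds a timestamp at exactly half past the hour up to the next hour.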

A naive way to generate the dataset:

```sql
DROP PROCEDURE IF EXISTS dm.generate_dataset;
DELIMITER //
CREATE PROCEDURE dm.generate_dataset(from_time DATETIME, to_time DATETIME)
BEGIN
  DECLARE it INT;
  DECLARE temp DOUBLE;
  DECLARE output TEXT;
  DECLARE curr_time DATETIME;
  WHILE from_time <= to_time DO
    SET it = 0;
    SET output = '';
    SET curr_time = from_time;
    generate_record: WHILE it < 102 DO
      SET temp = NULL;
      IF it < 96 THEN
        SELECT 逐時觀測.溫度 INTO temp FROM 逐時觀測
          WHERE (逐時觀測.時間 = curr_time) AND (逐時觀測.測站 = 'BANQIAO,板橋')
          LIMIT 1;
      ELSE
        SELECT Power.northUsage INTO temp FROM Power
          WHERE (Power.updateTime = curr_time)
          LIMIT 1;
      END IF;
      IF temp IS NULL THEN
        SET output = '';
        LEAVE generate_record;
      END IF;
      SET it = it + 1;
      -- CONCAT_WS skips the NULL, so the first value gets no leading comma
      SET output = CONCAT_WS(',', NULLIF(output, ''), CAST(temp AS CHAR));
      SET curr_time = DATE_ADD(curr_time, INTERVAL 1 HOUR);
    END WHILE;
    IF output <> '' THEN
      INSERT INTO Records VALUES (output);
    END IF;
    SET from_time = DATE_ADD(from_time, INTERVAL 1 HOUR);
  END WHILE;
END//
DELIMITER ;
SET @from_time = CAST('2016-07-03 01:00:00' AS DATETIME);
SET @to_time = CAST('2017-07-04 00:00:00' AS DATETIME);
CALL generate_dataset(@from_time, @to_time);
```

This is very inefficient because it runs one query per value, up to 102 queries for every candidate record. A better approach follows.

First, find duplicate record times.

```sql
SELECT 逐時觀測.時間
FROM 逐時觀測
WHERE 逐時觀測.測站 = 'BANQIAO,板橋'
GROUP BY 逐時觀測.時間
HAVING COUNT(*) > 1
```

0 rows retrieved.

```sql
SELECT Power.updateTime, COUNT(*) AS cnt
FROM Power
GROUP BY Power.updateTime
HAVING cnt > 1
```

updateTime | cnt
--- | ---
2016-10-17 00:00:00 | 2
2017-05-20 08:00:00 | 3
2017-05-20 13:00:00 | 2
2017-05-30 22:00:00 | 2
2017-07-19 09:00:00 | 2
2017-07-24 01:00:00 | 2
2017-07-24 13:00:00 | 2
2017-08-21 20:00:00 | 2
2017-08-21 21:00:00 | 2

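Remove the surplus copies with the following query. Each run deletes one row, so it takes 10 runs: one for each of the eight timestamps that appear twice and two for the timestamp that appears three times.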

```sql
-- run 10 times
DELETE FROM Power
WHERE Power.updateTime IN (
  SELECT A.updateTime
  FROM (SELECT * FROM Power) AS A
  GROUP BY A.updateTime
  HAVING COUNT(*) > 1
) LIMIT 1
```
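
For comparison, the same duplicate check and cleanup can be sketched in pandas. The small DataFrame below is a made-up stand-in for the `Power` table, and note one difference: the SQL above deletes an arbitrary surplus row, while this sketch keeps the first row per timestamp.

```python
import pandas as pd

# Hypothetical slice of the Power table with repeated timestamps.
power = pd.DataFrame({
    "updateTime": pd.to_datetime([
        "2017-05-20 07:00:00",
        "2017-05-20 08:00:00",
        "2017-05-20 08:00:00",
        "2017-05-20 08:00:00",
        "2017-05-20 13:00:00",
        "2017-05-20 13:00:00",
    ]),
    "northUsage": [800.0, 810.0, 811.0, 812.0, 900.0, 901.0],
})

# Same report as the GROUP BY ... HAVING query: timestamps seen more than once.
counts = power.groupby("updateTime").size()
duplicates = counts[counts > 1]

# Keep the first row per timestamp and drop the surplus copies in one pass.
deduplicated = power.drop_duplicates(subset="updateTime", keep="first")
print(duplicates)
print(len(power) - len(deduplicated), "rows removed")
```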

Store each record (96 hourly temperature readings as history, followed by the 6 hourly power-usage values to predict) in its own CSV file.  
Skip any record that contains missing data.

```sql
DROP PROCEDURE IF EXISTS dm.generate_dataset;
DELIMITER //
CREATE PROCEDURE dm.generate_dataset(from_time DATETIME, to_time DATETIME)
BEGIN
  DECLARE curr_time DATETIME DEFAULT from_time;
  SET @path = '/Users/ywpu/Repositories/data-mining/hw3/';
  generate_record: WHILE curr_time <= to_time DO
    SET @cnt1 = (
      SELECT COUNT(逐時觀測.溫度) FROM 逐時觀測
        WHERE (逐時觀測.時間 >= curr_time)
        AND (逐時觀測.時間 < DATE_ADD(curr_time, INTERVAL 4 DAY))
        AND (逐時觀測.測站 = 'BANQIAO,板橋')
    );
    SET @cnt2 = (
      SELECT COUNT(Power.northUsage) FROM Power
        WHERE (Power.updateTime >= DATE_ADD(curr_time, INTERVAL 4 DAY))
        AND (Power.updateTime < DATE_ADD(curr_time, INTERVAL '4 6' DAY_HOUR))
    );
    IF @cnt1 <> 96 OR @cnt2 <> 6 THEN
      SET curr_time = DATE_ADD(curr_time, INTERVAL 1 HOUR);
      ITERATE generate_record;
    END IF;
    SET @filename = CONCAT(@path, CAST(curr_time AS CHAR), '.csv');
    SET @query_string = CONCAT(
      "SELECT A.Y INTO OUTFILE '", @filename, "'
        FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
      FROM (
        (
          SELECT 逐時觀測.時間 AS X, 逐時觀測.溫度 AS Y FROM 逐時觀測
          WHERE (逐時觀測.時間 >= ?)
          AND (逐時觀測.時間 < DATE_ADD(?, INTERVAL 4 DAY))
          AND (逐時觀測.測站 = 'BANQIAO,板橋')
        ) UNION ALL (
          SELECT Power.updateTime AS X, Power.northUsage AS Y FROM Power
          WHERE (Power.updateTime >= DATE_ADD(?, INTERVAL 4 DAY))
          AND (Power.updateTime < DATE_ADD(?, INTERVAL '4 6' DAY_HOUR))
        ) ORDER BY X ASC
      ) A"
    );
    PREPARE stmt FROM @query_string;
    EXECUTE stmt USING curr_time, curr_time, curr_time, curr_time;
    DEALLOCATE PREPARE stmt;
    SET curr_time = DATE_ADD(curr_time, INTERVAL 1 HOUR);
  END WHILE;
END//
DELIMITER ;
SET @from_time = CAST('2016-07-03 01:00:00' AS DATETIME);
SET @to_time = CAST('2017-07-04 00:00:00' AS DATETIME);
CALL generate_dataset(@from_time, @to_time);
```
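
The windowing logic of the procedure above can also be sketched in pandas, which may be easier to test on small inputs. This is not the method used in this homework; the function name `build_records`, its parameters, and the toy data are assumptions for illustration. It expects two Series indexed by unique hourly timestamps.

```python
import numpy as np
import pandas as pd

def build_records(temps, usage, history=96, horizon=6):
    """Return one row per start hour: `history` temperatures, then `horizon` usages."""
    step = pd.Timedelta(hours=1)
    start = min(temps.index.min(), usage.index.min())
    end = max(temps.index.max(), usage.index.max())
    hours = pd.date_range(start, end, freq=step)
    # Reindexing onto a gap-free hourly grid turns missing hours into NaN.
    t = temps.reindex(hours).to_numpy(dtype=float)
    u = usage.reindex(hours).to_numpy(dtype=float)
    rows = []
    for i in range(len(hours) - history - horizon + 1):
        row = np.concatenate([t[i:i + history], u[i + history:i + history + horizon]])
        if not np.isnan(row).any():  # skip records with any missing value
            rows.append(row)
    return np.array(rows)

# Tiny hypothetical example: 10 hours, with the temperature missing at hour 3.
grid = pd.date_range("2016-07-03 01:00", periods=10, freq=pd.Timedelta(hours=1))
temps = pd.Series(np.arange(10.0), index=grid).drop(grid[3])
usage = pd.Series(100.0 + np.arange(10.0), index=grid)
records = build_records(temps, usage, history=3, horizon=2)
print(records)
```

In the toy run, the three windows whose history covers the missing hour are skipped, which mirrors the `@cnt1 <> 96 OR @cnt2 <> 6` check in the procedure.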

Combine all records into `dataset.csv`.

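```python
import os

# Each per-record CSV holds one value per line; join them into a single row.
with open('dataset.csv', 'w') as target:
    entries = sorted(os.listdir())
    for entry in entries:
        # Record files are named after their start time, so they begin with '201'.
        if entry.endswith('.csv') and entry.startswith('201'):
            with open(entry) as file:
                cols = file.readlines()
            cols = [col.strip() for col in cols]
            row = ','.join(cols) + '\n'
            target.write(row)
```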

Load `dataset.csv` into DataFrame.

```python
import numpy as np
import pandas as pd
dataset = pd.read_csv('dataset.csv', names=list(range(0, 102)), dtype=np.float64)
```

```python
dataset.shape
```

```
(4568, 102)
```

```python
dataset.head()
```

index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
0 | 27.6 | 28.4 | 29.4 | 29.7 | 28.3 | 27.1 | 26.3 | 25.9 | 25.6 | 25.6 | ... | 27.3 | 27.4 | 27.4 | 27.7 | 841.3 | 826.4 | 789.5 | 773.6 | 778.0 | 783.1
1 | 28.4 | 29.4 | 29.7 | 28.3 | 27.1 | 26.3 | 25.9 | 25.6 | 25.6 | 25.4 | ... | 27.4 | 27.4 | 27.7 | 26.6 | 826.4 | 789.5 | 773.6 | 778.0 | 783.1 | 814.3
2 | 29.4 | 29.7 | 28.3 | 27.1 | 26.3 | 25.9 | 25.6 | 25.6 | 25.4 | 25.3 | ... | 27.4 | 27.7 | 26.6 | 26.0 | 789.5 | 773.6 | 778.0 | 783.1 | 814.3 | 812.4
3 | 29.7 | 28.3 | 27.1 | 26.3 | 25.9 | 25.6 | 25.6 | 25.4 | 25.3 | 25.3 | ... | 27.7 | 26.6 | 26.0 | 25.9 | 773.6 | 778.0 | 783.1 | 814.3 | 812.4 | 809.0
4 | 28.3 | 27.1 | 26.3 | 25.9 | 25.6 | 25.6 | 25.4 | 25.3 | 25.3 | 25.3 | ... | 26.6 | 26.0 | 25.9 | 25.8 | 778.0 | 783.1 | 814.3 | 812.4 | 809.0 | 796.6


## Task 2

Split the dataset randomly into training (70%) and test (30%) sets using scikit-learn.

```python
from sklearn.model_selection import train_test_split
train, test = train_test_split(dataset, test_size=0.3)
```

```python
train.shape
```

```
(3197, 102)
```

```python
train.head()
```

index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
459 | 24.2 | 24.8 | 24.7 | 24.7 | 24.6 | 24.4 | 24.8 | 25.1 | 26.1 | 25.7 | ... | 27.1 | 27.6 | 27.4 | 27.8 | 791.3 | 774.6 | 776.2 | 786.8 | 833.2 | 969.7
4135 | 26.0 | 25.9 | 25.8 | 25.6 | 25.5 | 25.5 | 26.3 | 28.3 | 29.9 | 31.6 | ... | 22.7 | 22.6 | 22.5 | 22.6 | 780.5 | 761.1 | 736.5 | 725.7 | 727.7 | 750.1
842 | 22.5 | 22.4 | 22.7 | 22.6 | 22.5 | 22.8 | 22.9 | 22.7 | 22.8 | 22.6 | ... | 21.9 | 21.6 | 21.5 | 21.6 | 1015.7 | 1016.3 | 993.8 | 963.4 | 936.3 | 878.5
3241 | 27.6 | 27.2 | 26.2 | 25.8 | 25.3 | 24.8 | 24.5 | 24.3 | 24.1 | 24.3 | ... | 29.3 | 29.0 | 28.2 | 27.0 | 1078.3 | 1029.6 | 978.0 | 929.9 | 843.7 | 803.8
3380 | 28.2 | 27.2 | 26.4 | 25.9 | 25.7 | 25.2 | 25.2 | 24.8 | 24.6 | 24.7 | ... | 27.4 | 27.1 | 26.5 | 25.9 | 939.3 | 910.2 | 844.0 | 826.8 | 751.6 | 712.2


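```python
test.shape
```

```
(1371, 102)
```

```python
test.head()
```

index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
2288 | 19.3 | 19.0 | 18.6 | 18.8 | 18.6 | 18.5 | 18.6 | 18.6 | 18.7 | 18.7 | ... | 21.0 | 21.5 | 21.7 | 21.0 | 1014.3 | 1022.5 | 1042.4 | 1018.0 | 996.4 | 952.9
220 | 24.7 | 25.6 | 25.6 | 26.3 | 27.0 | 27.8 | 28.5 | 29.2 | 31.0 | 31.2 | ... | 24.8 | 24.7 | 24.5 | 24.4 | 784.5 | 774.1 | 766.2 | 802.5 | 837.0 | 917.9
3414 | 25.7 | 26.8 | 27.8 | 28.5 | 29.8 | 30.1 | 31.1 | 29.9 | 29.9 | 29.2 | ... | 21.3 | 21.0 | 20.8 | 20.7 | 794.5 | 889.6 | 1028.4 | 1081.6 | 1099.9 | 1042.9
3830 | 26.2 | 25.7 | 25.7 | 25.5 | 25.3 | 26.0 | 27.4 | 28.6 | 30.4 | 29.9 | ... | 24.1 | 24.0 | 23.9 | 24.1 | 791.7 | 762.3 | 753.2 | 749.5 | 773.0 | 820.3
1002 | 22.5 | 25.5 | 27.9 | 29.0 | 30.1 | 29.0 | 28.6 | 27.8 | 27.0 | 26.3 | ... | 19.5 | 19.0 | 18.9 | 19.0 | 978.2 | 913.2 | 1022.2 | 1003.2 | 955.1 | 1009.9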
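
The split sizes above follow from how `train_test_split` rounds: the test set gets the ceiling of `test_size * n_rows` and the training set gets the rest. The sketch below checks this on placeholder data with the same number of rows. The `random_state=0` argument is an addition for reproducibility and is not used in the original call.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 4568  # number of records in the dataset
stand_in = np.arange(n_rows * 2).reshape(n_rows, 2)  # placeholder data

# random_state is an added assumption; the original call leaves it unset,
# so every run there produces a different split.
train, test = train_test_split(stand_in, test_size=0.3, random_state=0)

n_test = math.ceil(0.3 * n_rows)
n_train = n_rows - n_test
print(n_train, n_test)
```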


## Task 3

Before classification, discretize the targets into four equal-frequency bins using pandas.

```python
# call pd.qcut(col, 4, labels=['A', 'B', 'C', 'D']) for each col in col 96 ~ 101
discrete = dataset.iloc[:, 96:102].apply(pd.qcut, args=(4,), labels=['A', 'B', 'C', 'D'])
```

```python
train_labels = discrete.loc[train.index, :]
test_labels = discrete.loc[test.index, :]
```

```python
train_features = train.iloc[:, 0:96]
test_features = test.iloc[:, 0:96]
```

```python
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier()
neigh.fit(train_features, train_labels)
# neigh.score(test_features, test_labels)
```

```
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
```

```python
neigh.predict([test_features.loc[2288, :]])
```

```
array([['C', 'C', 'C', 'C', 'C', 'C']], dtype=object)
```
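
Since the `score` call above is commented out, one simple alternative for evaluating a classifier with several label columns is to compute the accuracy of each column separately. The sketch below does this on synthetic data; the features, the two label columns, and the 150/50 split are all made up for illustration and do not reproduce the homework's results.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 200 samples, 4 features, 2 label columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.column_stack([
    np.where(X[:, 0] > 0, "B", "A"),
    np.where(X[:, 1] > 0, "D", "C"),
])
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

neigh = KNeighborsClassifier().fit(X_train, y_train)
pred = neigh.predict(X_test)

# Fraction of correct predictions for each label column.
per_column_accuracy = (pred == y_test).mean(axis=0)
print(per_column_accuracy)
```

Applied to the homework data, the same comparison would give one accuracy per predicted hour (columns 96 to 101).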