# **HW1: Regression** 
In *assignment 1*, you need to finish:

1.  Basic Part: Implement the regression model to predict the number of dengue cases


> *   Step 1: Split Data
> *   Step 2: Preprocess Data
> *   Step 3: Implement Regression
> *   Step 4: Make Prediction
> *   Step 5: Train Model and Generate Result

2.  Advanced Part: Implement a regression model to predict the number of dengue cases using a different approach from the basic part

# 1. Basic Part (60%)
In the first part, you need to implement a regression model to predict the number of dengue cases.

Please save the prediction result in a csv file **hw1_basic.csv**


## Import Packages

> Note: You **cannot** import any other package in the basic part

In [749]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import csv
import math
import random

## Global attributes
Define the global attributes

In [750]:
input_dataroot = 'hw1_basic_input.csv' # Input file named as 'hw1_basic_input.csv'
output_dataroot = 'hw1_basic.csv' # Output file will be named as 'hw1_basic.csv'

input_datalist =  [] # Initial datalist, saved as numpy array
output_datalist =  [] # Your prediction, should be 10 * 4 matrix and saved as numpy array
             # The format of each row should be ['epiweek', 'CityA', 'CityB', 'CityC']

You can add your own global attributes here


In [751]:
training_size = 94
testing_size = 10
order_temp = 1
order_case = 4

## Load the Input File
First, load the basic input file **hw1_basic_input.csv**

The input data is stored in *input_datalist*

In [752]:
# Read input csv to datalist
with open(input_dataroot, newline='') as csvfile:
  input_datalist = np.array(list(csv.reader(csvfile)))

## Implement the Regression Model

> Note: It is recommended to use the functions we defined, but you can also define your own functions


### Step 1: Split Data
Split the data in *input_datalist* into a training dataset and a validation dataset



In [753]:
def SplitData(datalist, start, end, columns_name):
    dataset = pd.DataFrame(datalist[start:end], columns = columns_name)
    return dataset

columns_name = ['epiweek', 'tempA', 'tempB', 'tempC', 'caseA', 'caseB', 'caseC']
training_dataset = SplitData(input_datalist, 1, training_size+1, columns_name)
testing_dataset = SplitData(input_datalist, training_size+1-order_case, training_size+1+testing_size, columns_name)
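
The testing slice above starts `order_case` rows before the first test week, so that the earliest prediction has its lagged case counts available. A minimal sketch on toy data shows the index arithmetic; the `toy_*` names and sizes are illustrative assumptions, not the assignment's values:

```python
import numpy as np

# Toy stand-in for input_datalist: one header row plus 12 data rows.
toy_rows = np.array([["header"]] + [[f"week{i}"] for i in range(1, 13)])
toy_training_size = 8
toy_testing_size = 4
toy_order_case = 2  # number of lagged case counts used as features

# Row 0 is the header, so the training rows start at index 1.
train = toy_rows[1 : toy_training_size + 1]

# Start order_case rows early so the first test week has its lag values.
test_start = toy_training_size + 1 - toy_order_case
test_end = toy_training_size + 1 + toy_testing_size
test = toy_rows[test_start:test_end]

print(train[0, 0], train[-1, 0])   # first and last training week
print(test[0, 0], test[-1, 0])     # first lag row and last test week
print(len(test) - toy_order_case)  # number of weeks actually predicted
```

The last value equals `toy_testing_size`: the extra leading rows only supply lag features and are not predicted themselves.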

### Step 2: Preprocess Data
Handle unreasonable data, such as outliers and missing values
> Hint: Outliers and missing data can be handled by removing the affected rows or by filling in values derived from statistics (for example, the mean or median)  

In [754]:
def Normalize(datas):
    # Min-max scaling to [0, 1]; vectorized so min and max are computed only once
    return (datas - datas.min()) / (datas.max() - datas.min())
def PreprocessData(dataset):
    # Missing data: keep only rows where all three temperatures are present
    has_all_temps = (dataset['tempA'] != '') & (dataset['tempB'] != '') & (dataset['tempC'] != '')
    dataset.where(has_all_temps, other=np.nan, inplace=True)
    dataset.dropna(axis='index', inplace=True)

    # String to number
    for column in dataset:
        if column != 'epiweek':
            dataset[column] = dataset[column].map(float)

    # Outlier for temp: the masks below are True for rows within mean +/- 3 std (rows to keep)
    min_temp_A = dataset['tempA'].mean() - 3*dataset['tempA'].std()
    max_temp_A = dataset['tempA'].mean() + 3*dataset['tempA'].std()
    min_temp_B = dataset['tempB'].mean() - 3*dataset['tempB'].std()
    max_temp_B = dataset['tempB'].mean() + 3*dataset['tempB'].std()
    min_temp_C = dataset['tempC'].mean() - 3*dataset['tempC'].std()
    max_temp_C = dataset['tempC'].mean() + 3*dataset['tempC'].std()
    outlierA_temp = (dataset['tempA'] >= min_temp_A) & (dataset['tempA'] <= max_temp_A)
    outlierB_temp = (dataset['tempB'] >= min_temp_B) & (dataset['tempB'] <= max_temp_B)
    outlierC_temp = (dataset['tempC'] >= min_temp_C) & (dataset['tempC'] <= max_temp_C)
    outlier_temp = outlierA_temp & outlierB_temp & outlierC_temp
    dataset.where(outlier_temp, other=np.nan, inplace=True)
    
    # drop outlier
    dataset.dropna(axis='index', inplace=True)
    dataset.reset_index(drop=True, inplace=True)
    
    # Normalize
    dataset['tempA'] = Normalize(dataset['tempA'])
    dataset['tempB'] = Normalize(dataset['tempB'])
    dataset['tempC'] = Normalize(dataset['tempC'])
    
PreprocessData(training_dataset)
PreprocessData(testing_dataset)
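
The cell above removes rows with missing or outlying temperatures. The hint also mentions filling in values with the help of statistics; a minimal sketch of that alternative follows, assuming median imputation and the same 3-sigma rule. The helper name `impute_with_median` and the toy data are hypothetical, not part of the assignment template:

```python
import numpy as np
import pandas as pd

def impute_with_median(dataset, columns):
    """Return a copy where blanks and 3-sigma outliers in `columns` become the column median."""
    result = dataset.copy()
    for column in columns:
        # Empty strings (and anything non-numeric) become NaN.
        values = pd.to_numeric(result[column], errors="coerce")
        mean, std = values.mean(), values.std()
        is_outlier = (values < mean - 3 * std) | (values > mean + 3 * std)
        values = values.mask(is_outlier)  # treat outliers as missing
        result[column] = values.fillna(values.median())
    return result

# Toy column: 20 plausible readings, one blank, one implausible reading.
toy = pd.DataFrame({"tempA": ["25", "26", "24", "25"] * 5 + ["", "200"]})
cleaned = impute_with_median(toy, ["tempA"])
print(cleaned["tempA"].tail(2).tolist())
```

Imputing keeps every week in the series, which matters here because the lagged case features assume consecutive rows; dropping rows shifts the lags.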

### Step 3: Implement Regression
> Hint: You can use either Matrix Inversion or Gradient Descent to finish this part




In [755]:
def Regression(x, y, learning_rate, learning_iter, dim, N):
    # weights
    w = np.zeros((3, dim))
    
    # iterate 3 cities
    for city in range(3):       
        for i in range(learning_iter):
            x_trans = x[city].transpose()
            
            prediction = np.dot(x[city], w[city])
            loss = prediction - y[city]
            
            curr_gradient = np.dot(x_trans, loss) / (N-order_case)
            w[city] = w[city] - learning_rate[city] * curr_gradient
    
    return w
    
# hyperparameter
w_dim = 1+order_temp+order_case
learning_rate = [0.00001, 0.00001, 0.000001]
learning_iter = 150000

# size of training_dataset
training_N = len(training_dataset)

# x for training_data
x_temp_train = np.array([
    [[(training_dataset['tempA'][i]**(order_temp-j)) for j in range(order_temp+1)] for i in range(order_case, training_N)],        
    [[(training_dataset['tempB'][i]**(order_temp-j)) for j in range(order_temp+1)] for i in range(order_case, training_N)],        
    [[(training_dataset['tempC'][i]**(order_temp-j)) for j in range(order_temp+1)] for i in range(order_case, training_N)]      
])
x_case_train = np.array([
    [[(training_dataset['caseA'][i-j-1]) for j in range(order_case)] for i in range(order_case, training_N)],
    [[(training_dataset['caseB'][i-j-1]) for j in range(order_case)] for i in range(order_case, training_N)],
    [[(training_dataset['caseC'][i-j-1]) for j in range(order_case)] for i in range(order_case, training_N)]
])
x_train = np.array([
    np.array([np.concatenate((x_temp_train[0][i], x_case_train[0][i])) for i in range(training_N-order_case)]),
    np.array([np.concatenate((x_temp_train[1][i], x_case_train[1][i])) for i in range(training_N-order_case)]),
    np.array([np.concatenate((x_temp_train[2][i], x_case_train[2][i])) for i in range(training_N-order_case)])
])

# y for training_data
y_train = np.array([
    [training_dataset['caseA'][i] for i in range(order_case, training_N)],
    [training_dataset['caseB'][i] for i in range(order_case, training_N)],
    [training_dataset['caseC'][i] for i in range(order_case, training_N)]
])

w = Regression(x_train, y_train, learning_rate, learning_iter, w_dim, training_N)
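
The cell above uses gradient descent. The hint also names matrix inversion; a minimal sketch of that closed-form route is shown below, assuming the Moore-Penrose pseudo-inverse from NumPy (`np.linalg.pinv`) in place of an explicit inverse so that a singular matrix does not raise an error. The function name `fit_least_squares` and the toy data are hypothetical:

```python
import numpy as np

def fit_least_squares(x, y):
    """Solve min ||x w - y||^2 in closed form via the pseudo-inverse."""
    return np.linalg.pinv(x) @ y

# Toy data generated from y = 3*t + 2, with design-matrix columns [t, 1]
# (highest power first, like the temperature terms in the cell above).
t = np.arange(5, dtype=float)
x_toy = np.column_stack([t, np.ones_like(t)])
y_toy = 3 * t + 2

w_toy = fit_least_squares(x_toy, y_toy)
print(np.round(w_toy, 6))
```

Under this sketch, one city's weights would be obtained with something like `fit_least_squares(x_train[0], y_train[0])`, with no learning rate or iteration count to tune.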

### Step 4: Make Prediction
Make predictions for the testing dataset and store the values in *output_datalist*

In [756]:
def MakePrediction(dataset, x, w, dim):
    # MAPE
    def MAPE(actual, predict):
        return np.mean(np.abs((actual - predict) / actual)) * 100
        
    # Predict
    predict_A = [] 
    predict_B = [] 
    predict_C = [] 
    output_datalist = []
    
    N = len(dataset)

    # x_case for testing_data
    x_caseA = [[dataset['caseA'][order_case-j-1] for j in range(order_case)]]
    x_caseB = [[dataset['caseB'][order_case-j-1] for j in range(order_case)]]
    x_caseC = [[dataset['caseC'][order_case-j-1] for j in range(order_case)]]
    
    # predict for testing_data
    for index in range(N-order_case):
        x_A = np.concatenate((x[0][index], x_caseA[index]))
        x_B = np.concatenate((x[1][index], x_caseB[index]))
        x_C = np.concatenate((x[2][index], x_caseC[index]))
        answer_A = sum([w[0][i] * x_A[i] for i in range(dim)])
        answer_B = sum([w[1][i] * x_B[i] for i in range(dim)])
        answer_C = sum([w[2][i] * x_C[i] for i in range(dim)])
        
        predict_A.append(round(answer_A))
        predict_B.append(round(answer_B))
        predict_C.append(round(answer_C))
        x_caseA.append([answer_A] + x_caseA[-1][0:-1])
        x_caseB.append([answer_B] + x_caseB[-1][0:-1])
        x_caseC.append([answer_C] + x_caseC[-1][0:-1])
        
        output_datalist.append([dataset['epiweek'][index+order_case], round(answer_A), round(answer_B), round(answer_C)])
        
    # MAPE
    # actual_A = dataset['caseA'][order_case:].to_numpy()
    # actual_B = dataset['caseB'][order_case:].to_numpy()
    # actual_C = dataset['caseC'][order_case:].to_numpy()
    
    # print("MAPE_A =", MAPE(actual_A, predict_A))
    # print("MAPE_B =", MAPE(actual_B, predict_B))
    # print("MAPE_C =", MAPE(actual_C, predict_C))

    return output_datalist

# size of testing_dataset
testing_N = len(testing_dataset)
# x_temp for testing_data
x_temp_test = np.array([
    [[(testing_dataset['tempA'][i]**(order_temp-j)) for j in range(order_temp+1)] for i in range(order_case, testing_N)],        
    [[(testing_dataset['tempB'][i]**(order_temp-j)) for j in range(order_temp+1)] for i in range(order_case, testing_N)],        
    [[(testing_dataset['tempC'][i]**(order_temp-j)) for j in range(order_temp+1)] for i in range(order_case, testing_N)],         
])
output_datalist = MakePrediction(testing_dataset, x_temp_test, w, w_dim)
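
The `MAPE` helper inside `MakePrediction` divides by the actual case count, so a week with zero actual cases produces a division by zero. The sketch below restates the formula on toy numbers and adds a variant that skips such weeks; `mape_skip_zeros` is an assumption for illustration, not the assignment's official metric:

```python
import numpy as np

def mape(actual, predict):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predict = np.asarray(predict, dtype=float)
    return np.mean(np.abs((actual - predict) / actual)) * 100

def mape_skip_zeros(actual, predict):
    """Hypothetical variant that ignores weeks whose actual count is zero."""
    actual = np.asarray(actual, dtype=float)
    predict = np.asarray(predict, dtype=float)
    nonzero = actual != 0
    return mape(actual[nonzero], predict[nonzero])

# Relative errors of 20%, 10%, and 0% average to about 10%.
score = mape([10, 20, 40], [12, 18, 40])
# The zero-count week is dropped, leaving errors of 20% and 0%.
score_skip = mape_skip_zeros([10, 0, 40], [12, 3, 40])
print(round(score, 6), round(score_skip, 6))
```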

### Step 5: Train Model and Generate Result

> Notice: **Remember to output the coefficients of the model here**; otherwise, 5 points will be deducted
* If your regression model is *3x^2 + 2x^1 + 1*, your output should be: 
```
3 2 1
```





In [757]:
print(f"{w[0][0]} {w[0][1]} {w[0][2]} {w[0][3]} {w[0][4]} {w[0][5]}")
print(f"{w[1][0]} {w[1][1]} {w[1][2]} {w[1][3]} {w[1][4]} {w[1][5]}")
print(f"{w[2][0]} {w[2][1]} {w[2][2]} {w[2][3]} {w[2][4]} {w[2][5]}")

1.2931560984319803 2.136323239034926 0.656941346153915 0.2452831218335672 -0.08818116970644299 0.06007123016424466
0.5056667624358357 0.9245519373579979 0.37514942457578326 0.2368765800627512 0.07916403866494195 0.21082956307388598
0.11431316477322064 0.15777993102118826 0.947419358361804 0.028222972484821255 0.07545927296721004 -0.08118063278530917
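
The print calls above index each of the six weights by hand, so they need editing whenever `order_temp` or `order_case` changes. A small sketch of a dimension-independent alternative follows; `w_demo` is a made-up matrix standing in for `w`:

```python
import numpy as np

# Hypothetical 2 x 3 weight matrix; in the notebook this role is played by `w`.
w_demo = np.array([[3.0, 2.0, 1.0], [0.5, 0.25, 0.125]])

# One space-separated line per city, whatever the number of coefficients.
lines = [" ".join(str(value) for value in row) for row in w_demo]
print("\n".join(lines))
```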


## Write the Output File
Write the predictions to the output csv file
> Format: 'epiweek', 'CityA', 'CityB', 'CityC'

In [758]:
with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in output_datalist:
    writer.writerow(row)
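
Since the prediction is expected to be a 10 * 4 table, a quick shape check before submission can catch rows lost during preprocessing. The sketch below is one possible check; `check_prediction_rows` and the sample rows are hypothetical, and an in-memory buffer stands in for the real output file:

```python
import csv
import io

def check_prediction_rows(rows, expected_rows=10, expected_columns=4):
    """Raise ValueError unless `rows` is an expected_rows x expected_columns table."""
    if len(rows) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(rows)}")
    for row in rows:
        if len(row) != expected_columns:
            raise ValueError(f"expected {expected_columns} columns, got {len(row)}: {row}")
    return True

# Made-up predictions in the ['epiweek', 'CityA', 'CityB', 'CityC'] layout.
demo_rows = [[f"2023{week:02d}", 1, 2, 3] for week in range(1, 11)]

# Round-trip through an in-memory buffer instead of a real file.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(demo_rows)
buffer.seek(0)
read_back = list(csv.reader(buffer))
print(check_prediction_rows(read_back))
```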

# 2. Advanced Part (35%)
In the second part, you need to implement the regression in a different way than in the basic part to improve your predictions of the number of dengue cases

We provide you with two files **hw1_advanced_input1.csv** and **hw1_advanced_input2.csv** that can help you in this part

Please save the prediction result in a csv file **hw1_advanced.csv** 


In [759]:
advanced_input1_dataroot = 'hw1_advanced_input1.csv'
advanced_output_dataroot = 'hw1_advanced.csv'

advanced_input1_datalist = []
advanced_output_datalist = []

# Read input csv to datalist
with open(advanced_input1_dataroot, newline='') as csvfile:
    advanced_input1_datalist = np.array(list(csv.reader(csvfile)))
    
# Merge basic_input and advanced_input1
advanced_datalist = np.concatenate((input_datalist, advanced_input1_datalist), axis=1)
advanced_columns_name = ['epiweek', 'tempA', 'tempB', 'tempC', 'caseA', 'caseB', 'caseC', 'epiweek2', 'precipA', 'precipB', 'precipC']
advanced_training_dataset = SplitData(advanced_datalist, 1, training_size+1, advanced_columns_name)
advanced_testing_dataset = SplitData(advanced_datalist, training_size+1-order_case, training_size+1+testing_size, advanced_columns_name)
advanced_training_dataset.drop(columns='epiweek2', inplace=True)
advanced_testing_dataset.drop(columns='epiweek2', inplace=True)

# Preprocess dataset
PreprocessData(advanced_training_dataset)
PreprocessData(advanced_testing_dataset)

# =================================================================================================

advanced_w_dim = 1+order_temp+1+order_case
# hyperparameter
learning_rate = [0.000007, 0.00001, 0.0000001]
learning_iter = 150000

# size of training_dataset
advanced_training_N = len(advanced_training_dataset)

# x for training_data
advanced_x_temp_train = np.array([
    [[(advanced_training_dataset['tempA'][i]**(order_temp-j)) for j in range(order_temp+1)] for i in range(order_case, advanced_training_N)],        
    [[(advanced_training_dataset['tempB'][i]**(order_temp-j)) for j in range(order_temp+1)] for i in range(order_case, advanced_training_N)],        
    [[(advanced_training_dataset['tempC'][i]**(order_temp-j)) for j in range(order_temp+1)] for i in range(order_case, advanced_training_N)]      
])
advanced_x_precip_train = np.array([
    [advanced_training_dataset['precipA'][i] for i in range(order_case, advanced_training_N)],        
    [advanced_training_dataset['precipB'][i] for i in range(order_case, advanced_training_N)],        
    [advanced_training_dataset['precipC'][i] for i in range(order_case, advanced_training_N)]
])
advanced_x_case_train = np.array([
    [[(advanced_training_dataset['caseA'][i-j-1]) for j in range(order_case)] for i in range(order_case, advanced_training_N)],
    [[(advanced_training_dataset['caseB'][i-j-1]) for j in range(order_case)] for i in range(order_case, advanced_training_N)],
    [[(advanced_training_dataset['caseC'][i-j-1]) for j in range(order_case)] for i in range(order_case, advanced_training_N)]
])
advanced_x_train = np.array([
    np.array([np.concatenate((advanced_x_temp_train[0][i], [advanced_x_precip_train[0][i]], advanced_x_case_train[0][i])) for i in range(advanced_training_N-order_case)]),
    np.array([np.concatenate((advanced_x_temp_train[1][i], [advanced_x_precip_train[1][i]], advanced_x_case_train[1][i])) for i in range(advanced_training_N-order_case)]),
    np.array([np.concatenate((advanced_x_temp_train[2][i], [advanced_x_precip_train[2][i]], advanced_x_case_train[2][i])) for i in range(advanced_training_N-order_case)])
])

# y for training_data
advanced_y_train = np.array([
    [advanced_training_dataset['caseA'][i] for i in range(order_case, advanced_training_N)],
    [advanced_training_dataset['caseB'][i] for i in range(order_case, advanced_training_N)],
    [advanced_training_dataset['caseC'][i] for i in range(order_case, advanced_training_N)]
])

# w
advanced_w = Regression(advanced_x_train, advanced_y_train, learning_rate, learning_iter, advanced_w_dim, advanced_training_N)

# =================================================================================================

# size of testing_dataset
advanced_testing_N = len(advanced_testing_dataset)

# x for testing_data
advanced_x_temp_test = np.array([
    [[(advanced_testing_dataset['tempA'][i]**(order_temp-j)) for j in range(order_temp+1)] for i in range(order_case, advanced_testing_N)],        
    [[(advanced_testing_dataset['tempB'][i]**(order_temp-j)) for j in range(order_temp+1)] for i in range(order_case, advanced_testing_N)],        
    [[(advanced_testing_dataset['tempC'][i]**(order_temp-j)) for j in range(order_temp+1)] for i in range(order_case, advanced_testing_N)]      
])
advanced_x_precip_test = np.array([
    [advanced_testing_dataset['precipA'][i] for i in range(order_case, advanced_testing_N)],        
    [advanced_testing_dataset['precipB'][i] for i in range(order_case, advanced_testing_N)],        
    [advanced_testing_dataset['precipC'][i] for i in range(order_case, advanced_testing_N)]
])
advanced_x_test = np.array([
    np.array([np.concatenate((advanced_x_temp_test[0][i], [advanced_x_precip_test[0][i]])) for i in range(advanced_testing_N-order_case)]),
    np.array([np.concatenate((advanced_x_temp_test[1][i], [advanced_x_precip_test[1][i]])) for i in range(advanced_testing_N-order_case)]),
    np.array([np.concatenate((advanced_x_temp_test[2][i], [advanced_x_precip_test[2][i]])) for i in range(advanced_testing_N-order_case)])
])

# Prediction
advanced_output_datalist = MakePrediction(advanced_testing_dataset, advanced_x_test, advanced_w, advanced_w_dim)

print(f"{advanced_w[0][0]} {advanced_w[0][1]} {advanced_w[0][2]} {advanced_w[0][3]} {advanced_w[0][4]} {advanced_w[0][5]} {advanced_w[0][6]}")
print(f"{advanced_w[1][0]} {advanced_w[1][1]} {advanced_w[1][2]} {advanced_w[1][3]} {advanced_w[1][4]} {advanced_w[1][5]} {advanced_w[1][6]}")
print(f"{advanced_w[2][0]} {advanced_w[2][1]} {advanced_w[2][2]} {advanced_w[2][3]} {advanced_w[2][4]} {advanced_w[2][5]} {advanced_w[2][6]}")

# =================================================================================================

# Write the output
with open(advanced_output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in advanced_output_datalist:
    writer.writerow(row)

0.5979616029824036 1.078235379718589 0.6890531959775202 0.59755312893875 0.27090808620770446 -0.07533558108391886 0.050435713855746085
0.4178186412195893 0.8032221356409875 0.12320473457181418 0.3638864061211623 0.22939690075731764 0.07918171402135729 0.21552439665528714
0.010843452466361784 0.017045476193808723 0.20305695040751298 0.6763207581662259 0.2271679897798216 0.07982441667201459 -0.06518298232441437
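
The advanced cell above builds each city's feature rows from temperature powers, one precipitation value, and lagged case counts, repeating the same comprehension for every city. A possible refactoring sketch for one city is shown below; `build_design_matrix` and its parameters are hypothetical names, and the toy inputs are made up:

```python
import numpy as np

def build_design_matrix(temp, cases, order_temp, order_case, extra=None):
    """Return (x, y) for one city.

    Each row of x is [temp^order_temp, ..., temp^0, (extra), case lag 1, ..., case lag order_case].
    """
    temp = np.asarray(temp, dtype=float)
    cases = np.asarray(cases, dtype=float)
    features = []
    for i in range(order_case, len(cases)):
        powers = [temp[i] ** (order_temp - j) for j in range(order_temp + 1)]
        extras = [] if extra is None else [float(extra[i])]
        lags = [cases[i - j - 1] for j in range(order_case)]
        features.append(powers + extras + lags)
    return np.array(features), cases[order_case:]

# Toy series: four weeks, one temperature power, two case lags, one extra feature.
x_demo, y_demo = build_design_matrix(
    temp=[0.1, 0.2, 0.3, 0.4],
    cases=[1, 2, 3, 4],
    order_temp=1,
    order_case=2,
    extra=[10, 20, 30, 40],
)
print(x_demo)
print(y_demo)
```

Called once per city, with or without `extra`, a helper like this would cover both the basic and the advanced feature layouts.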


# Report *(5%)*

Report should be submitted as a pdf file **hw1_report.pdf**

*   Briefly describe the difficulties you encountered 
*   Summarize your work and your reflections 
*   No more than one page






# Save the Code File
Please save your code and submit it as an ipynb file! (**hw1.ipynb**)