# **HW1: Regression** 
In *assignment 1*, you need to finish:

1.  Basic Part: Implement the regression model to predict the number of dengue cases


> *   Step 1: Split Data
> *   Step 2: Preprocess Data
> *   Step 3: Implement Regression
> *   Step 4: Make Prediction
> *   Step 5: Train Model and Generate Result

2.  Advanced Part: Implement a regression model that predicts the number of dengue cases in a different way from the basic part

# 1. Basic Part (60%)
In the first part, you need to implement a regression model to predict the number of dengue cases.

Please save the prediction results in a CSV file named **hw1_basic.csv**.


## Import Packages

> Note: You **cannot** import any other package in the basic part

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import csv
import math
import random

## Global attributes
Define the global attributes

In [None]:
input_dataroot = 'hw1_basic_input.csv' # Input file named as 'hw1_basic_input.csv'
output_dataroot = 'hw1_basic.csv' # Output file will be named as 'hw1_basic.csv'

input_datalist =  [] # Initial datalist, saved as numpy array
output_datalist =  [] # Your predictions, should be a 10 x 4 matrix and saved as a numpy array
             # The format of each row should be ['epiweek', 'CityA', 'CityB', 'CityC']

You can add your own global attributes here


In [None]:
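training_size = 94  # number of rows used for training
testing_size = 10   # number of rows to predict
order_temp = 1      # polynomial order of the temperature feature
order_case = 4      # number of previous case counts used as features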

## Load the Input File
First, load the basic input file **hw1_basic_input.csv**

The input data will be stored in *input_datalist*.

In [None]:
# Read input csv to datalist
with open(input_dataroot, newline='') as csvfile:
  input_datalist = np.array(list(csv.reader(csvfile)))

## Implement the Regression Model

> Note: It is recommended to use the functions we defined, but you can also define your own functions


### Step 1: Split Data
Split the data in *input_datalist* into a training dataset and a validation dataset



In [None]:
def SplitData():
    columns = ['epiweek', 'tempA', 'tempB', 'tempC', 'caseA', 'caseB', 'caseC']
    # Skip row 0 and take the next training_size rows as the training set
    training_dataset = pd.DataFrame(input_datalist[1:training_size+1], columns=columns)
    # Start order_case rows early so the first prediction has its lagged case counts
    testing_dataset = pd.DataFrame(input_datalist[training_size+1-order_case:training_size+1+testing_size],
                                   columns=columns)
    return training_dataset, testing_dataset
        
training_dataset, testing_dataset = SplitData()
print(training_dataset)
print(testing_dataset)

### Step 2: Preprocess Data
Handle unreasonable data, such as missing values and outliers
> Hint: Outliers and missing data can be handled by removing the affected rows or by filling in values derived from statistics
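
As one possible reading of the hint, the sketch below fills missing temperatures with the column median instead of dropping the row, then clips values that fall outside mean ± 3 standard deviations. The helper name `impute_and_clip` and the toy column are hypothetical and are not part of the assignment template.

```python
import numpy as np
import pandas as pd

def impute_and_clip(series, n_std=3.0):
    """Hypothetical helper: fill gaps with the median, then clip to mean +/- n_std * std."""
    # Non-numeric entries such as empty strings become NaN
    values = pd.to_numeric(series, errors='coerce')
    # Replace missing entries with the median of the observed values
    values = values.fillna(values.median())
    mean, std = values.mean(), values.std()
    # Pull extreme values back to the boundary instead of removing the row
    return values.clip(lower=mean - n_std * std, upper=mean + n_std * std)

# Toy data for illustration only
toy = pd.Series(['25.1', '', '26.3', '24.8', '25.5'])
cleaned = impute_and_clip(toy)
print(cleaned.tolist())
```

Filling instead of dropping keeps every week in the series, which matters when later features depend on consecutive rows.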

In [None]:
def PreprocessData():
    def PreprocessDataset(dataset):
        # Missing data: keep only rows where all three temperatures are present
        has_temps = (dataset['tempA'] != '') & (dataset['tempB'] != '') & (dataset['tempC'] != '')
        dataset.where(has_temps, other=np.nan, inplace=True)
        dataset.dropna(axis='index', inplace=True)
    
        # String to number
        for column in ['tempA', 'tempB', 'tempC', 'caseA', 'caseB', 'caseC']:
            dataset[column] = dataset[column].map(float)
    
        # Outlier for temp: keep rows whose temperatures lie within mean +/- 3 std
        within_range = pd.Series(True, index=dataset.index)
        for column in ['tempA', 'tempB', 'tempC']:
            mean, std = dataset[column].mean(), dataset[column].std()
            within_range &= dataset[column].between(mean - 3*std, mean + 3*std)
        dataset.where(within_range, other=np.nan, inplace=True)
        
        # Outlier for case
        # min_case_A = dataset['caseA'].mean() - 3*dataset['caseA'].std()
        # max_case_A = dataset['caseA'].mean() + 3*dataset['caseA'].std()
        # min_case_B = dataset['caseB'].mean() - 3*dataset['caseB'].std()
        # max_case_B = dataset['caseB'].mean() + 3*dataset['caseB'].std()
        # min_case_C = dataset['caseC'].mean() - 3*dataset['caseC'].std()
        # max_case_C = dataset['caseC'].mean() + 3*dataset['caseC'].std()
        # outlierA_case = (dataset['caseA'] >= min_case_A) & (dataset['caseA'] <= max_case_A)
        # outlierB_case = (dataset['caseB'] >= min_case_B) & (dataset['caseB'] <= max_case_B)
        # outlierC_case = (dataset['caseC'] >= min_case_C) & (dataset['caseC'] <= max_case_C)
        # outlier_case = outlierA_case & outlierB_case & outlierC_case
        # dataset.where(outlier_case, other=np.nan, inplace=True)
        
        # drop outlier
        dataset.dropna(axis='index', inplace=True)
        dataset.reset_index(drop=True, inplace=True)
        
        # Normalize temperatures to [0, 1] with min-max scaling
        for column in ['tempA', 'tempB', 'tempC']:
            col_min, col_max = dataset[column].min(), dataset[column].max()
            dataset[column] = (dataset[column] - col_min) / (col_max - col_min)
        # dataset['caseA'] = dataset['caseA'].map(lambda data: (data-dataset['caseA'].min())/(dataset['caseA'].max()-dataset['caseA'].min()))
        # dataset['caseB'] = dataset['caseB'].map(lambda data: (data-dataset['caseB'].min())/(dataset['caseB'].max()-dataset['caseB'].min()))
        # dataset['caseC'] = dataset['caseC'].map(lambda data: (data-dataset['caseC'].min())/(dataset['caseC'].max()-dataset['caseC'].min()))
        
        return dataset
    
    PreprocessDataset(training_dataset)
    PreprocessDataset(testing_dataset)
    
PreprocessData()
print(training_dataset)
print(testing_dataset)

### Step 3: Implement Regression
> Hint: You can use either matrix inversion or gradient descent to finish this part
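
The matrix-inversion route mentioned in the hint can be sketched as follows. It computes the least-squares weights `w = (X^T X)^(-1) X^T y` through the pseudo-inverse. The function name `fit_closed_form` and the toy data are assumptions made for illustration only.

```python
import numpy as np

def fit_closed_form(X, y):
    """Least-squares weights via the pseudo-inverse (hypothetical helper)."""
    # pinv(X) equals (X^T X)^-1 X^T when X has full column rank,
    # and stays well defined when X^T X is singular
    return np.linalg.pinv(X) @ y

# Toy data generated from y = 1 + 2*x (illustration only)
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
X_toy = np.column_stack((np.ones_like(x_toy), x_toy))  # bias column first
y_toy = 1.0 + 2.0 * x_toy
w_toy = fit_closed_form(X_toy, y_toy)
print(w_toy)
```

Unlike gradient descent, this approach needs no learning rate or iteration count, at the cost of a matrix factorization whose size grows with the number of features.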




In [None]:
def Regression():
    learning_rate = [0.0001, 0.0001, 0.0001]
    learning_iter = 10000
    
    # size of training_dataset
    N = len(training_dataset)
    
    # x for training_data
    x_temp = np.array([
        [[(training_dataset['tempA'][i]**j) for j in range(order_temp+1)] for i in range(order_case, N)],        
        [[(training_dataset['tempB'][i]**j) for j in range(order_temp+1)] for i in range(order_case, N)],        
        [[(training_dataset['tempC'][i]**j) for j in range(order_temp+1)] for i in range(order_case, N)]      
    ])
    x_case = np.array([
        [[(training_dataset['caseA'][i-j-1]) for j in range(order_case)] for i in range(order_case, N)],
        [[(training_dataset['caseB'][i-j-1]) for j in range(order_case)] for i in range(order_case, N)],
        [[(training_dataset['caseC'][i-j-1]) for j in range(order_case)] for i in range(order_case, N)]
    ])
    x = np.array([
        np.array([np.concatenate((x_temp[0][i], x_case[0][i])) for i in range(N-order_case)]),
        np.array([np.concatenate((x_temp[1][i], x_case[1][i])) for i in range(N-order_case)]),
        np.array([np.concatenate((x_temp[2][i], x_case[2][i])) for i in range(N-order_case)])
    ])
    
    # y for training_data
    y = np.array([
        [training_dataset['caseA'][i] for i in range(order_case, N)],
        [training_dataset['caseB'][i] for i in range(order_case, N)],
        [training_dataset['caseC'][i] for i in range(order_case, N)]
    ])
    
    # weights per city: [bias, temp^1..temp^order_temp, case lag 1..case lag order_case]
    w = np.zeros((3, 1+order_temp+order_case))
    
    # iterate 3 cities
    for city in range(3):
        x_trans = x[city].transpose()  # constant across iterations
        pre_cost = 0
        for i in range(learning_iter):
            # Batch gradient descent on the halved mean squared error
            prediction = np.dot(x[city], w[city])
            loss = prediction - y[city]

            cost = np.sum(loss ** 2) / (2 * (N-order_case))
            # print(f"Iteration {i}: City{city} Cost= {cost}")
            # Halve the step size when the cost is unchanged from the previous iteration
            if cost == pre_cost:
                learning_rate[city] /= 2
            pre_cost = cost

            curr_gradient = np.dot(x_trans, loss) / (N-order_case)
            w[city] = w[city] - learning_rate[city] * curr_gradient
    return w
    
w = Regression()
print(w)

### Step 4: Make Prediction
Make predictions for the testing dataset and store the values in *output_datalist*

In [None]:
def MakePrediction():
    # Predict
    predict_A = [] 
    predict_B = [] 
    predict_C = [] 
    
    N = len(testing_dataset)
    output_datalist.clear()  # avoid duplicated rows if this cell is re-run
    
    # x_temp for testing_data
    x_temp = np.array([
        [[(testing_dataset['tempA'][i]**j) for j in range(order_temp+1)] for i in range(order_case, N)],        
        [[(testing_dataset['tempB'][i]**j) for j in range(order_temp+1)] for i in range(order_case, N)],        
        [[(testing_dataset['tempC'][i]**j) for j in range(order_temp+1)] for i in range(order_case, N)],         
    ])

    # x_case for testing_data
    x_caseA = [[testing_dataset['caseA'][order_case-j-1] for j in range(order_case)]]
    x_caseB = [[testing_dataset['caseB'][order_case-j-1] for j in range(order_case)]]
    x_caseC = [[testing_dataset['caseC'][order_case-j-1] for j in range(order_case)]]
    
    # predict for testing_data
    for index in range(N-order_case):
        x_A = np.concatenate((x_temp[0][index], x_caseA[index]))
        x_B = np.concatenate((x_temp[1][index], x_caseB[index]))
        x_C = np.concatenate((x_temp[2][index], x_caseC[index]))
        # Weighted sum of the features: [bias, temp terms, lagged cases]
        answer_A = np.dot(w[0], x_A)
        answer_B = np.dot(w[1], x_B)
        answer_C = np.dot(w[2], x_C)
        
        predict_A.append(round(answer_A))
        predict_B.append(round(answer_B))
        predict_C.append(round(answer_C))
        x_caseA.append([answer_A] + x_caseA[-1][0:-1])
        x_caseB.append([answer_B] + x_caseB[-1][0:-1])
        x_caseC.append([answer_C] + x_caseC[-1][0:-1])
        
        output_datalist.append([testing_dataset['epiweek'][index+order_case], round(answer_A), round(answer_B), round(answer_C)])
        
    # MAPE (mean absolute percentage error) against the actual case counts
    actual_A = testing_dataset['caseA'][order_case:].to_numpy()
    actual_B = testing_dataset['caseB'][order_case:].to_numpy()
    actual_C = testing_dataset['caseC'][order_case:].to_numpy()
    
    MAPE_A = np.mean(np.abs((actual_A - predict_A) / actual_A)) * 100
    MAPE_B = np.mean(np.abs((actual_B - predict_B) / actual_B)) * 100
    MAPE_C = np.mean(np.abs((actual_C - predict_C) / actual_C)) * 100
    
    print("MAPE_A =", MAPE_A)
    print("MAPE_B =", MAPE_B)
    print("MAPE_C =", MAPE_C)

    return output_datalist

output_datalist = MakePrediction()

### Step 5: Train Model and Generate Result

> Notice: **Remember to output the coefficients of the model here**, otherwise 5 points will be deducted
* If your regression model is *3x^2 + 2x^1 + 1*, your output should be: 
```
3 2 1
```
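
A minimal sketch of that output format, assuming the weights are stored lowest power first (bias at index 0). The array `coeffs_low_to_high` is a made-up example matching the polynomial above, not the model of this assignment.

```python
import numpy as np

# Hypothetical weights for 3x^2 + 2x^1 + 1, stored as [bias, x^1, x^2]
coeffs_low_to_high = np.array([1, 2, 3])

# Reverse so the highest power comes first, then join with spaces
line = " ".join(str(c) for c in coeffs_low_to_high[::-1])
print(line)
```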





In [None]:
# Each row: temperature coefficient, bias, then the case-lag coefficients
for city in range(3):
    print(f"{w[city][1]} {w[city][0]} {w[city][2]} {w[city][3]} {w[city][4]} {w[city][5]}")

## Write the Output File
Write the predictions to the output CSV file
> Format: 'epiweek', 'CityA', 'CityB', 'CityC'

In [None]:
with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in output_datalist:
    writer.writerow(row)

# 2. Advanced Part (35%)
In the second part, you need to implement regression in a different way from the basic part to improve your predictions of the number of dengue cases.

We provide two files, **hw1_advanced_input1.csv** and **hw1_advanced_input2.csv**, that can help you in this part.

Please save the prediction results in a CSV file named **hw1_advanced.csv**.


# Report *(5%)*

The report should be submitted as a PDF file named **hw1_report.pdf**

*   Briefly describe the difficulties you encountered
*   Summarize your work and your reflections
*   No more than one page






# Save the Code File
Please save your code and submit it as an ipynb file! (**hw1.ipynb**)