# **HW1: Regression** 
In *assignment 1*, you need to finish:

1.  Basic Part: Implement the regression model to predict the number of dengue cases


> *   Step 1: Split Data
> *   Step 2: Preprocess Data
> *   Step 3: Implement Regression
> *   Step 4: Make Prediction
> *   Step 5: Train Model and Generate Result

2.  Advanced Part: Implement a regression model that predicts the number of dengue cases using a different approach from the basic part

# 1. Basic Part (60%)
In the first part, you need to implement a regression model to predict the number of dengue cases

Please save the prediction result in a csv file **hw1_basic.csv**


## Import Packages

> Note: You **cannot** import any other package in the basic part

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import csv
import math
import random

## Global attributes
Define the global attributes

In [None]:
input_dataroot = 'hw1_basic_input.csv' # Input file named as 'hw1_basic_input.csv'
output_dataroot = 'hw1_basic.csv' # Output file will be named as 'hw1_basic.csv'

input_datalist =  [] # Initial datalist, saved as a numpy array
output_datalist =  [] # Your predictions: a 10 x 4 matrix, saved as a numpy array
                      # Each row has the format ['epiweek', 'CityA', 'CityB', 'CityC']

You can add your own global attributes here


In [None]:
training_size = 84
testing_size = 10
order_temp = 1
order_case = 1
training_dataset = []
testing_dataset = []
w = []

## Load the Input File
First, load the basic input file **hw1_basic_input.csv**

The input data is stored in *input_datalist*

In [None]:
# Read input csv to datalist
with open(input_dataroot, newline='') as csvfile:
  input_datalist = np.array(list(csv.reader(csvfile)))

## Implement the Regression Model

> Note: We recommend using the functions defined below, but you can also define your own functions


### Step 1: Split Data
Split the data in *input_datalist* into a training dataset and a validation dataset (stored as `testing_dataset` in the code below)



In [None]:
def SplitData():
    training_dataset = pd.DataFrame(input_datalist[1:training_size+1], 
                                    columns = ['epiweek', 'tempA', 'tempB', 'tempC', 'caseA', 'caseB', 'caseC'])
    testing_dataset = pd.DataFrame(input_datalist[training_size+1:training_size+1+testing_size], 
                                   columns = ['epiweek', 'tempA', 'tempB', 'tempC', 'caseA', 'caseB', 'caseC'])
    return training_dataset, testing_dataset
        
training_dataset, testing_dataset = SplitData()
print(training_dataset)
print(testing_dataset)

### Step 2: Preprocess Data
Handle unreasonable data
> Hint: Outliers and missing data can be handled by removing the affected rows or by filling in values with the help of statistics  

In [None]:
def PreprocessData():
    def PreprocessDataset(dataset):
        min_temp = 15
        max_temp = 40 
        
        # Missing data: keep only rows where all three temperatures are present
        complete = (dataset['tempA'] != '') & (dataset['tempB'] != '') & (dataset['tempC'] != '')
        dataset.where(complete, other=np.nan, inplace=True)
        dataset.dropna(axis='index', inplace=True)
    
        # String to number
        for column in ['tempA', 'tempB', 'tempC', 'caseA', 'caseB', 'caseC']:
            dataset[column] = dataset[column].map(float)

        # Outliers: keep only rows where all three temperatures lie in [min_temp, max_temp]
        in_range_A = (dataset['tempA'] >= min_temp) & (dataset['tempA'] <= max_temp)
        in_range_B = (dataset['tempB'] >= min_temp) & (dataset['tempB'] <= max_temp)
        in_range_C = (dataset['tempC'] >= min_temp) & (dataset['tempC'] <= max_temp)
        in_range = in_range_A & in_range_B & in_range_C
        dataset.where(in_range, other=np.nan, inplace=True)
        dataset.dropna(axis='index', inplace=True)
        dataset.reset_index(drop=True, inplace=True)
    
    PreprocessDataset(training_dataset)
    PreprocessDataset(testing_dataset)
    
PreprocessData()
print(training_dataset)
print(testing_dataset)
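
The hint for this step also mentions filling in values with statistics instead of removing rows. A minimal sketch of that alternative, assuming the column names used in this notebook and the same 15 to 40 degree validity range. The helper name `impute_temperatures` and the choice of the median are illustrative assumptions, not part of the assignment:

```python
import numpy as np
import pandas as pd

def impute_temperatures(dataset, columns=("tempA", "tempB", "tempC"),
                        min_temp=15.0, max_temp=40.0):
    """Return a copy where missing or out-of-range temperatures are
    replaced by the median of the valid values in the same column."""
    cleaned = dataset.copy()
    for col in columns:
        # Empty strings and non-numeric text become NaN.
        values = pd.to_numeric(cleaned[col], errors="coerce")
        # Out-of-range readings are treated as missing too.
        values = values.where((values >= min_temp) & (values <= max_temp))
        cleaned[col] = values.fillna(values.median())
    return cleaned

# Tiny made-up table: one missing value in tempA, one outlier in tempB.
demo = pd.DataFrame({"tempA": ["25.0", "", "27.0"],
                     "tempB": ["24.0", "26.0", "99.0"],
                     "tempC": ["20.0", "22.0", "21.0"]})
imputed = impute_temperatures(demo)
print(imputed)
```

Unlike dropping rows, this keeps every epiweek, so the testing dataset still yields one prediction per row.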

### Step 3: Implement Regression
> Hint: You can use matrix inversion or gradient descent to finish this part




In [None]:
def Regression():
    learning_rate = 0.09
    learning_iter = 10000
    
    # size of training_dataset
    N = len(training_dataset)
    
    # x
    x_temp = np.array([
        [[1, training_data['tempA']] for index, training_data in training_dataset.iterrows()],
        [[1, training_data['tempB']] for index, training_data in training_dataset.iterrows()],
        [[1, training_data['tempC']] for index, training_data in training_dataset.iterrows()]
    ])
    
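    # Lagged case counts (1 to 3 weeks back, clamped at week 0);
    # built here but not used by the temperature-only fit below
    x_case = np.array([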
        [[training_dataset['caseA'][max(0, i-1)], training_dataset['caseA'][max(0, i-2)], training_dataset['caseA'][max(0, i-3)]] for i in range(N)],
        [[training_dataset['caseB'][max(0, i-1)], training_dataset['caseB'][max(0, i-2)], training_dataset['caseB'][max(0, i-3)]] for i in range(N)],
        [[training_dataset['caseC'][max(0, i-1)], training_dataset['caseC'][max(0, i-2)], training_dataset['caseC'][max(0, i-3)]] for i in range(N)]
    ])
        
    # y
    y = np.array([
        training_dataset['caseA'],
        training_dataset['caseB'],
        training_dataset['caseC']
    ])
    
    # weights
    w = np.zeros((3, 1+order_temp))
    
    # Per-sample gradient descent on the squared error
    for i in range(learning_iter):
        # iterate over the 3 cities
        for city in range(3):    
            # iterate over every training sample
            for index in range(N):
                x = x_temp[city][index]                  # feature vector [1, temp]
                prediction = np.dot(x, w[city])
                error = prediction - y[city][index]      # signed residual
                curr_gradient = x * error / N
                w[city] = w[city] - learning_rate * curr_gradient
    return w
    
w = Regression()
print(w)
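
The hint for this step names matrix inversion as an alternative to gradient descent. A minimal sketch of that closed-form approach, solving the normal equation `(X^T X) w = X^T y`. The function name `fit_polynomial` and the demo data are hypothetical, and `np.linalg.pinv` is used here instead of a plain inverse as a precaution against a singular matrix:

```python
import numpy as np

def fit_polynomial(x, y, order=1):
    """Least-squares polynomial fit via the normal equation.

    Returns weights ordered [w0, w1, ..., w_order] (constant term first).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix with columns 1, x, x^2, ..., x^order.
    X = np.vander(x, order + 1, increasing=True)
    # Solve (X^T X) w = X^T y; pinv tolerates a singular X^T X.
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Made-up data lying exactly on y = 3 + 2x, for illustration only.
x_demo = np.array([20.0, 22.0, 24.0, 26.0, 28.0])
y_demo = 3.0 + 2.0 * x_demo
w_demo = fit_polynomial(x_demo, y_demo, order=1)
print(w_demo)
```

With the notebook's data this would be called once per city, for example on the `tempA` and `caseA` columns. It needs no learning rate or iteration count, which makes it a convenient cross-check for the gradient descent weights.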

### Step 4: Make Prediction
Make predictions for the testing dataset and store the values in *output_datalist*

In [None]:
def MakePrediction():
    # Predict
    predict_A = [] 
    predict_B = [] 
    predict_C = [] 
    for index, data in testing_dataset.iterrows():
        predict_A.append(int(w[0][0]+w[0][1]*data['tempA']))
        predict_B.append(int(w[1][0]+w[1][1]*data['tempB']))
        predict_C.append(int(w[2][0]+w[2][1]*data['tempC']))
        
        output_datalist.append([data['epiweek'], predict_A[-1], predict_B[-1], predict_C[-1]])
        
    # MAPE
    actual_A = testing_dataset['caseA'].to_numpy()
    actual_B = testing_dataset['caseB'].to_numpy()
    actual_C = testing_dataset['caseC'].to_numpy()
    
    MAPE_A = np.mean(np.abs((actual_A - predict_A) / actual_A)) * 100
    MAPE_B = np.mean(np.abs((actual_B - predict_B) / actual_B)) * 100
    MAPE_C = np.mean(np.abs((actual_C - predict_C) / actual_C)) * 100
    
    print("MAPE_A =", MAPE_A)
    print("MAPE_B =", MAPE_B)
    print("MAPE_C =", MAPE_C)

    return output_datalist

output_datalist = MakePrediction()
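
The MAPE computation above divides by the actual case counts, so a week with zero cases would produce a division by zero. A small sketch of one way to guard against that, assuming it is acceptable to skip zero-valued weeks. The helper name `mape` is hypothetical, and the official scoring may treat such weeks differently:

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error in percent, skipping zero actuals."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    nonzero = actual != 0  # a zero actual would divide by zero
    if not nonzero.any():
        return float("nan")
    ratio = (actual[nonzero] - predicted[nonzero]) / actual[nonzero]
    return float(np.mean(np.abs(ratio)) * 100)

# Made-up values; the third week has zero actual cases and is skipped.
score = mape([10, 20, 0], [12, 18, 5])
print(score)
```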

### Step 5: Train Model and Generate Result

> Notice: **Remember to output the coefficients of the model here**, otherwise 5 points will be deducted
* If your regression model is *3x^2 + 2x^1 + 1*, your output should be: 
```
3 2 1
```
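
One way to produce that output format is sketched below. It assumes the weights are stored constant term first, as in the `w` array returned by `Regression()` above, so they are reversed before printing. The helper name `format_coefficients` is hypothetical:

```python
import numpy as np

def format_coefficients(weights):
    """Format weights stored constant-term-first as a
    highest-degree-first, space-separated string."""
    reversed_weights = np.asarray(weights, dtype=float)[::-1]
    # 'g' drops trailing zeros and keeps up to 6 significant digits.
    return " ".join(f"{c:g}" for c in reversed_weights)

# Hypothetical weights for 3x^2 + 2x + 1, stored as [w0, w1, w2].
line = format_coefficients([1, 2, 3])
print(line)  # prints "3 2 1"
```

For the three-city weight matrix in this notebook, the same helper could be applied row by row, for example `for city_w in w: print(format_coefficients(city_w))`.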





## Write the Output File
Write the predictions to the output csv file
> Format: 'epiweek', 'CityA', 'CityB', 'CityC'

In [None]:
with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in output_datalist:
    writer.writerow(row)

# 2. Advanced Part (35%)
In the second part, you need to implement regression in a different way from the basic part to improve your predictions of the number of dengue cases

We provide you with two files **hw1_advanced_input1.csv** and **hw1_advanced_input2.csv** that can help you in this part

Please save the prediction result in a csv file **hw1_advanced.csv** 
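
One possible direction, sketched here under assumptions, is to regress on previous weeks' case counts, as the unused `x_case` array in the basic part suggests. The data below is synthetic, the helper name `make_lag_features` is hypothetical, and nothing here reflects the contents of the provided advanced input files:

```python
import numpy as np

def make_lag_features(series, n_lags=3):
    """Build rows [1, y[t-1], ..., y[t-n_lags]] with targets y[t]."""
    series = np.asarray(series, dtype=float)
    rows = [np.concatenate(([1.0], series[t - n_lags:t][::-1]))
            for t in range(n_lags, len(series))]
    return np.array(rows), series[n_lags:]

# Synthetic weekly counts, for illustration only.
cases = np.array([5, 7, 9, 12, 15, 14, 13, 11, 10, 12, 16, 20], dtype=float)
X, y = make_lag_features(cases, n_lags=3)

# lstsq finds least-squares weights without forming an explicit inverse.
weights, *_ = np.linalg.lstsq(X, y, rcond=None)

# One-step-ahead forecast from the three most recent weeks.
latest = np.concatenate(([1.0], cases[-3:][::-1]))
next_week = float(latest @ weights)
print(X.shape, y.shape, next_week)
```

The first `n_lags` weeks are dropped rather than clamped to week 0, so every training row uses real history.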


# Report *(5%)*

The report should be submitted as a pdf file **hw1_report.pdf**

*   Briefly describe the difficulties you encountered 
*   Summarize your work and your reflections 
*   No more than one page






# Save the Code File
Please save your code and submit it as an ipynb file! (**hw1.ipynb**)