# **HW1: Regression** 
In *assignment 1*, you need to finish:

1.  Basic Part: Implement a regression model to predict the number of dengue cases


> *   Step 1: Preprocess Data
> *   Step 2: Split Data
> *   Step 3: Implement Regression
> *   Step 4: Make Prediction
> *   Step 5: Train Model and Generate Result

2.  Advanced Part: Implement a regression model that predicts the number of dengue cases in a different way from the basic part

# 1. Basic Part (60%)
In the first part, you need to implement a regression model to predict the number of dengue cases.

Please save the prediction result in a csv file **hw1_basic.csv**


## Import Packages

> Note: You **cannot** import any other package in the basic part

In [112]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import csv
import math
import random

## Global attributes
Define the global attributes

In [113]:
input_dataroot = 'hw1_basic_input.csv' # Input file named as 'hw1_basic_input.csv'
output_dataroot = 'hw1_basic.csv' # Output file will be named as 'hw1_basic.csv'

input_datalist =  [] # Initial datalist, saved as a numpy array
output_datalist =  [] # Your predictions: a 10 x 4 matrix saved as a numpy array
                      # Each row should be formatted as ['epiweek', 'CityA', 'CityB', 'CityC']

You can add your own global attributes here


In [114]:
processed_dataset_list = [[],[],[]]
train_dataset_list = [[],[],[]]
valid_dataset_list = [[],[],[]]
unknown_dataset_list = [[],[],[]]
predict_dataset_list = [[],[],[]]
phi_matrix = [[],[],[]]
city_w = [[],[],[]]


## Load the Input File
First, load the basic input file **hw1_basic_input.csv**

The input data is stored in *input_datalist*

In [115]:
# Read input csv to datalist
with open(input_dataroot, newline='') as csvfile:
  input_datalist = np.array(list(csv.reader(csvfile)))

## Implement the Regression Model

> Note: It is recommended to use the functions we defined, but you can also define your own functions


### Step 1: Preprocess Data
Handle unreasonable data
> Hint: Outliers and missing data can be handled by removing the affected rows or by filling in values with the help of statistics (for example, the mean or median)

In [116]:
def PreprocessData():

    global input_datalist
    global processed_dataset_list

    # Drop the header row
    input_datalist = np.delete(input_datalist,0,axis=0)

    # For each case-count column (4, 5, 6), append a copy shifted down by one row
    # (new columns 7, 8, 9) so every row also holds the previous row's case count;
    # the first row has no predecessor and gets ''
    for i in [4, 5, 6]:
        B = np.insert(np.delete(input_datalist[:,i],-1,axis=0), 0, np.array(['']), axis=0)
        input_datalist = np.c_[input_datalist, B.transpose()]

    for i in [1,2,3]:

        # Detect empty entries in column i (False in the mask means empty)
        mask = np.isin(input_datalist[:,i], [''], invert = True)

        # Remove rows with an empty entry in column i
        dataset_remove_empty = input_datalist[mask]

        # Convert column i to floats and compute the absolute Z-score of each element
        temp = dataset_remove_empty[:,i].astype(np.double)
        temp_Z_score_abs = np.absolute((temp - np.average(temp))/np.std(temp))

        # Truncating |Z| to an integer gives 0 exactly when |Z| < 1;
        # keep those rows and treat the rest as outliers
        mask_t = np.isin(temp_Z_score_abs.astype(np.int64),[0], invert = False)

        dataset_remove_empty = dataset_remove_empty[mask_t]

        # Detect empty entries in column i+6 (the city's previous-row case count)
        mask = np.isin(dataset_remove_empty[:,i+6], [''], invert = True)

        dataset_remove_empty = dataset_remove_empty[mask]
        dataset_remove_empty = dataset_remove_empty.astype(np.double)

        # Keep only column 0, the temperature (i), the case count (i+3),
        # and the previous-row case count (i+6)
        processed_dataset_list[i-1] = dataset_remove_empty.transpose()[np.array([0,i,i+3,i+6])].transpose()
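
The outlier rule used in the preprocessing step can be tried in isolation on a toy column. The sketch below is only an illustration: the helper name `zscore_keep_mask`, the `threshold` parameter, and the sample values are assumptions made for this example, not part of the assignment template.

```python
import numpy as np

def zscore_keep_mask(values, threshold=1.0):
    """Return a boolean mask that is True where |z-score| < threshold."""
    values = np.asarray(values, dtype=np.double)
    std = np.std(values)
    if std == 0:
        # All values are identical, so nothing can be an outlier
        return np.ones(values.shape, dtype=bool)
    z = np.abs((values - np.mean(values)) / std)
    return z < threshold

# Toy column with one missing entry ('') and one extreme value (made-up data)
raw = np.array(['25.0', '', '26.0', '24.0', '90.0', '25.5'])

present = raw != ''                       # drop missing entries first
temps = raw[present].astype(np.double)    # then convert to floats
keep = zscore_keep_mask(temps, threshold=1.0)
cleaned = temps[keep]
print(cleaned)
```

A threshold of 1 matches the integer-truncation trick in `PreprocessData`, and it discards a large share of ordinary rows; a looser threshold such as 2 or 3 is a common alternative worth comparing on the validation set.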

### Step 2: Split Data
Split the preprocessed data into a training dataset and a validation dataset



In [117]:
def SplitData():
    global train_dataset_list
    global valid_dataset_list

    for i in range(3):
        # Number of rows, excluding the last 10 rows that we want to predict
        data_row = np.size(processed_dataset_list[i],0) - 10

        # Number of rows used for training (the first 70%)
        num_train = int(data_row * 0.7)

        # Number of rows used for validation (the remaining 30%)
        num_valid = data_row - num_train
           
        train_dataset_list[i] = processed_dataset_list[i][0:num_train]
        valid_dataset_list[i] = processed_dataset_list[i][num_train:num_train+num_valid]
        unknown_dataset_list[i] = processed_dataset_list[i][num_train+num_valid:]
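
The same chronological split can be written as a small standalone function, which makes the three slices easier to check. This is a sketch under assumptions: `split_rows`, `n_predict`, and `train_ratio` are names introduced here for illustration, and the toy array stands in for one city's processed data.

```python
import numpy as np

def split_rows(data, n_predict=10, train_ratio=0.7):
    """Split rows in order into (train, valid, unknown) without shuffling."""
    data = np.asarray(data)
    n_known = data.shape[0] - n_predict      # rows that have a known target
    n_train = int(n_known * train_ratio)     # first part for training
    train = data[:n_train]
    valid = data[n_train:n_known]            # rest of the known rows
    unknown = data[n_known:]                 # last n_predict rows
    return train, valid, unknown

# Toy data: 30 rows, 2 columns (made-up values)
toy = np.arange(60).reshape(30, 2)
train, valid, unknown = split_rows(toy)
print(train.shape, valid.shape, unknown.shape)
```

Keeping the rows in order, rather than shuffling, means the validation rows come after the training rows, which mirrors how the last 10 rows are predicted from earlier data.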
        

### Step 3: Implement Regression
> Hint: You can use Matrix Inversion, or Gradient Descent to finish this part




In [118]:
def phi_function(x, m):
    # Identity basis: every city index m currently uses the raw feature value
    return x
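
Since `phi_function` currently returns its input unchanged, the model is linear in the raw features. If a polynomial basis were wanted instead, one possible shape is sketched below; `poly_phi` and `design_matrix` are hypothetical helpers introduced for this example and are not part of the template.

```python
import numpy as np

def poly_phi(x, degree):
    """Return x raised to the given degree (degree 0 gives the bias term 1)."""
    return np.asarray(x, dtype=np.double) ** degree

def design_matrix(x, max_degree):
    """Stack the columns [1, x, x^2, ..., x^max_degree] for a 1-D input x."""
    return np.column_stack([poly_phi(x, d) for d in range(max_degree + 1)])

x = np.array([0.0, 1.0, 2.0])
Phi = design_matrix(x, 2)
print(Phi)
# [[1. 0. 0.]
#  [1. 1. 1.]
#  [1. 2. 4.]]
```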

In [119]:

def Regression(city):
    # Model: y = w0 + w1 * phi(temperature) + w2 * phi(previous cases) + error
    # Columns of train_dataset_list[city]: 1 = temperature, 2 = cases (target), 3 = previous cases
    
    M = city

    row = np.size(train_dataset_list[M],0)
    for i in range(row):
        a = train_dataset_list[M][i][1]
        b = train_dataset_list[M][i][3]
        t = np.array([1, phi_function(a, M), phi_function(b, M)])
        if i==0:
            phi_matrix[M] = t
        else:
            phi_matrix[M] = np.r_[phi_matrix[M], t]

    phi_matrix[M] = phi_matrix[M].reshape(row, 3)
    
    w = np.linalg.inv(phi_matrix[M].T.dot(phi_matrix[M])).dot(phi_matrix[M].T).dot(train_dataset_list[M][:,2].reshape(row, 1))

    return w
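
The hint for this step mentions both matrix inversion and gradient descent. The sketch below compares the two on synthetic, noise-free data so the results can be checked against known weights. The function names, learning rate, iteration count, and data are all assumptions chosen for illustration; on real features with very different scales, gradient descent would likely need feature scaling or a smaller learning rate.

```python
import numpy as np

def fit_normal_equation(Phi, y):
    """Solve (Phi^T Phi) w = Phi^T y without forming an explicit inverse."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

def fit_gradient_descent(Phi, y, lr=0.1, n_iter=5000):
    """Minimise the mean squared error with plain batch gradient descent."""
    w = np.zeros(Phi.shape[1])
    n = Phi.shape[0]
    for _ in range(n_iter):
        grad = (2.0 / n) * Phi.T @ (Phi @ w - y)   # gradient of the MSE
        w -= lr * grad
    return w

# Synthetic data generated from known weights [1, 2, 3] (made-up example)
rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, size=50)
x2 = rng.uniform(-1, 1, size=50)
Phi = np.column_stack([np.ones(50), x1, x2])
y = 1.0 + 2.0 * x1 + 3.0 * x2

w_ne = fit_normal_equation(Phi, y)
w_gd = fit_gradient_descent(Phi, y)
print(w_ne, w_gd)
```

Using `np.linalg.solve` instead of `np.linalg.inv` gives the same weights when the matrix is well conditioned and is usually more numerically stable.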


### Step 4: Make Prediction
Make predictions for the testing dataset and store the values in *output_datalist*

In [120]:
def MakePrediction(city):

    M = city
    
    global city_w

    row = np.size(valid_dataset_list[M],0)

    predict_dataset_list[M] = np.arange(row)

    MAPE = 0

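    for i in range(row):

        a = valid_dataset_list[M][i][1]

        # Use the observed previous case count for the first row,
        # then feed each prediction back in as the next row's previous count
        if i==0:
            b = valid_dataset_list[M][i][3]
        else:
            b = predict_dataset_list[M][i-1]

        # Apply the same basis functions that Regression() used when fitting
        features = np.array([1, phi_function(a, M), phi_function(b, M)])
        predict_dataset_list[M][i] = int(city_w[M].T.dot(features).item())

        y1 = valid_dataset_list[M][i][2]
        y2 = predict_dataset_list[M][i]

        # Accumulate the absolute percentage error for this row
        MAPE += np.absolute((y1-y2)/y1)

    # Print the mean absolute percentage error, truncated to one decimal place
    print(str(int(MAPE*1000/row)/10) + "%" )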
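
The error printed by `MakePrediction` is a mean absolute percentage error (MAPE). A vectorised version is sketched below. Note one assumption that differs from the loop above: this helper skips rows whose true value is zero to avoid dividing by zero, whereas the loop divides by every `y1` directly. The name `mape` is introduced here for illustration.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent, ignoring zero targets."""
    y_true = np.asarray(y_true, dtype=np.double)
    y_pred = np.asarray(y_pred, dtype=np.double)
    nonzero = y_true != 0   # assumption: skip zero targets instead of dividing by zero
    errors = np.abs((y_true[nonzero] - y_pred[nonzero]) / y_true[nonzero])
    return np.mean(errors) * 100.0

# Made-up values: two rows are each off by 10%, the zero target is skipped
score = mape([100, 200, 0], [110, 180, 5])
print(score)
```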





    
    

### Step 5: Train Model and Generate Result

> Notice: **Remember to output the coefficients of the model here**, otherwise 5 points will be deducted
* If your regression model is *3x^2 + 2x^1 + 1*, your output would be: 
```
3 2 1
```
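
One way to print a weight vector in the format shown above is sketched here. It assumes the weights are stored lowest order first (bias in position 0), as they would be for a design matrix whose first column is all ones, and reverses them so the highest-order coefficient comes first. The helper `format_coefficients` and its `highest_first` flag are hypothetical names for this example.

```python
import numpy as np

def format_coefficients(w, highest_first=True):
    """Join the coefficients into one space-separated line."""
    flat = np.asarray(w, dtype=np.double).ravel()
    if highest_first:
        flat = flat[::-1]   # assumption: w is stored bias-first
    return " ".join(f"{c:g}" for c in flat)

# Weights for 3x^2 + 2x^1 + 1, stored bias-first as a column vector
w = np.array([[1.0], [2.0], [3.0]])
line = format_coefficients(w)
print(line)  # 3 2 1
```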





In [121]:
PreprocessData()

SplitData()

for i in range(3):
    city_w[i] = Regression(i)
    MakePrediction(i)


33.6%
32.3%
72.6%


## Write the Output File
Write the predictions to the output CSV file
> Format: 'epiweek', 'CityA', 'CityB', 'CityC'

In [122]:
with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in output_datalist:
    writer.writerow(row)
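
The writer above emits one CSV row per row of *output_datalist*, so that array needs to hold the 10 x 4 prediction matrix before the file is written. A minimal sketch of assembling it is shown below; `build_output` is a hypothetical helper, and the epiweek numbers and predictions are made-up placeholders, not values from the input file.

```python
import numpy as np

def build_output(epiweeks, preds_a, preds_b, preds_c):
    """Stack the epiweek column and three prediction columns into an (n, 4) array."""
    columns = [np.asarray(c).ravel() for c in (epiweeks, preds_a, preds_b, preds_c)]
    if len({len(c) for c in columns}) != 1:
        raise ValueError("all columns must have the same length")
    return np.column_stack(columns)

# Placeholder values only: 10 hypothetical epiweeks and dummy predictions
epiweeks = np.arange(1, 11)
dummy_a = np.full(10, 12)
dummy_b = np.full(10, 7)
dummy_c = np.full(10, 3)

output_example = build_output(epiweeks, dummy_a, dummy_b, dummy_c)
print(output_example.shape)  # (10, 4)
```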

# 2. Advanced Part (35%)
In the second part, you need to implement regression in a different way from the basic part to help you predict the number of dengue cases

We provide you with two files **hw1_advanced_input1.csv** and **hw1_advanced_input2.csv** that can help you in this part

Please save the prediction result in a csv file **hw1_advanced.csv** 


# Report *(5%)*

Report should be submitted as a pdf file **hw1_report.pdf**

*   Briefly describe the difficulties you encountered
*   Summarize your work and your reflections 
*   No more than one page






# Save the Code File
Please save your code and submit it as an ipynb file! (**hw1.ipynb**)