# **HW1: Regression** 
In *assignment 1*, you need to finish:

1.  Basic Part: Implement the regression model to predict the number of dengue cases


> *   Step 1: Split Data
> *   Step 2: Preprocess Data
> *   Step 3: Implement Regression
> *   Step 4: Make Prediction
> *   Step 5: Train Model and Generate Result

2.  Advanced Part: Implement a regression model that predicts the number of dengue cases using a different approach from the basic part

# 1. Basic Part (60%)
In the first part, you need to implement a regression model to predict the number of dengue cases.

Please save the prediction result in a csv file **hw1_basic.csv**


## Import Packages

> Note: You **cannot** import any other package in the basic part

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import csv
import math
import random

## Global attributes
Define the global attributes

In [23]:
input_dataroot = 'hw1_basic_input.csv' # Input file is named 'hw1_basic_input.csv'
output_dataroot = 'hw1_basic.csv' # Output file is named 'hw1_basic.csv'

input_datalist = [] # Initial datalist, saved as a numpy array
output_datalist = [] # Your predictions: a 10 * 4 matrix, saved as a numpy array
                     # Each row has the format ['epiweek', 'CityA', 'CityB', 'CityC']

You can add your own global attributes here


In [24]:
training_size = 84
testing_size = 10
order_temp = 1
order_case = 1
training_dataset = []
testing_dataset = []
w = []

## Load the Input File
First, load the basic input file **hw1_basic_input.csv**

The input data is stored in *input_datalist*

In [25]:
# Read input csv to datalist
with open(input_dataroot, newline='') as csvfile:
  input_datalist = np.array(list(csv.reader(csvfile)))

## Implement the Regression Model

> Note: We recommend using the functions defined below, but you may also define your own functions


### Step 1: Split Data
Split the data in *input_datalist* into a training dataset and a validation dataset (named `testing_dataset` in the code below)



In [26]:
def SplitData():
    training_dataset = pd.DataFrame(input_datalist[1:training_size+1], 
                                    columns = ['epiweek', 'tempA', 'tempB', 'tempC', 'caseA', 'caseB', 'caseC'])
    testing_dataset = pd.DataFrame(input_datalist[training_size+1:training_size+1+testing_size], 
                                   columns = ['epiweek', 'tempA', 'tempB', 'tempC', 'caseA', 'caseB', 'caseC'])
    return training_dataset, testing_dataset
        
training_dataset, testing_dataset = SplitData()
print(training_dataset)
print(testing_dataset)

   epiweek  tempA  tempB  tempC caseA caseB caseC
0   202001  21.48  22.24   9.16   147    89     9
1   202002                        146    99     7
2   202003  24.66  22.32  24.84   198    78    13
3   202004  23.89   24.9  29.66   180    69    14
4   202005  22.85  23.74  29.78   162    57     8
..     ...    ...    ...    ...   ...   ...   ...
79  202128  22.78  23.19  26.25    33    35    59
80  202129  22.75  19.05  26.42    29    22    45
81  202130  23.61  23.23  40.74    39    19    55
82  202131                         42    20    55
83  202132  71.47  22.02  26.26    35    22    66

[84 rows x 7 columns]
  epiweek  tempA  tempB  tempC caseA caseB caseC
0  202133  27.45  18.59  28.02    29    18    56
1  202134  24.26  21.45  26.76    28    25    44
2  202135  28.88  24.26  25.79    31    27    40
3  202136  28.08  24.76  28.64    33    25    39
4  202137  27.79  25.89  26.75    23    12    50
5  202138   26.8  24.27   29.8    27    22    35
6  202139  24.06  22.58  29.39    

### Step 2: Preprocess Data
Handle unreasonable data, such as outliers and missing values
> Hint: Outliers and missing data can be handled by removing the affected rows or by filling in values derived from statistics (for example, the mean or median)
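
The hint above mentions filling in values with the help of statistics as an alternative to removing rows. The following is a minimal sketch of that idea on a hypothetical toy column, not the approach used in the cell below. The toy data, the choice of the median, and the 15 to 40 range are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the empty string mimics a missing reading in the CSV
toy = pd.DataFrame({'tempA': ['21.5', '', '24.7', '71.5', '22.9']})

# Convert strings to floats; unparseable entries (such as '') become NaN
temps = pd.to_numeric(toy['tempA'], errors='coerce')

# Treat readings outside an assumed plausible range as missing too
min_temp, max_temp = 15, 40
temps = temps.where((temps >= min_temp) & (temps <= max_temp))

# Fill every gap with the median of the remaining valid readings
toy['tempA'] = temps.fillna(temps.median())
print(toy)
```

Imputing keeps every week in the dataset, which may matter if a later model relies on consecutive weeks. Removing rows is simpler but shrinks the training set.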

In [27]:
def PreprocessData():
    def PreprocessDataset(dataset):
        min_temp = 15
        max_temp = 40 
        
        # Missing data: keep only rows where all three temperatures are present
        has_all_temps = (dataset['tempA'] != '') & (dataset['tempB'] != '') & (dataset['tempC'] != '')
        dataset.where(has_all_temps, other=np.nan, inplace=True)
        dataset.dropna(axis='index', inplace=True)
    
        # String to number
        for column in ['tempA', 'tempB', 'tempC', 'caseA', 'caseB', 'caseC']:
            dataset[column] = dataset[column].map(float)

        # Outlier: keep only rows where every temperature lies in [min_temp, max_temp]
        in_rangeA = (dataset['tempA'] >= min_temp) & (dataset['tempA'] <= max_temp)
        in_rangeB = (dataset['tempB'] >= min_temp) & (dataset['tempB'] <= max_temp)
        in_rangeC = (dataset['tempC'] >= min_temp) & (dataset['tempC'] <= max_temp)
        in_range = in_rangeA & in_rangeB & in_rangeC
        dataset.where(in_range, other=np.nan, inplace=True)
        dataset.dropna(axis='index', inplace=True)
        dataset.reset_index(drop=True, inplace=True)
    
    PreprocessDataset(training_dataset)
    PreprocessDataset(testing_dataset)
    
PreprocessData()
print(training_dataset)
print(testing_dataset)

   epiweek  tempA  tempB  tempC  caseA  caseB  caseC
0   202003  24.66  22.32  24.84  198.0   78.0   13.0
1   202004  23.89  24.90  29.66  180.0   69.0   14.0
2   202005  22.85  23.74  29.78  162.0   57.0    8.0
3   202006  27.49  25.41  30.38  127.0   52.0   14.0
4   202008  26.20  21.51  27.98   99.0   51.0   15.0
..     ...    ...    ...    ...    ...    ...    ...
60  202125  26.78  23.83  20.53   57.0   23.0   68.0
61  202126  26.43  21.20  23.54   40.0   23.0   54.0
62  202127  26.88  22.90  26.03   51.0   27.0   57.0
63  202128  22.78  23.19  26.25   33.0   35.0   59.0
64  202129  22.75  19.05  26.42   29.0   22.0   45.0

[65 rows x 7 columns]
  epiweek  tempA  tempB  tempC  caseA  caseB  caseC
0  202133  27.45  18.59  28.02   29.0   18.0   56.0
1  202134  24.26  21.45  26.76   28.0   25.0   44.0
2  202135  28.88  24.26  25.79   31.0   27.0   40.0
3  202136  28.08  24.76  28.64   33.0   25.0   39.0
4  202137  27.79  25.89  26.75   23.0   12.0   50.0
5  202138  26.80  24.27  29.8

### Step 3: Implement Regression
> Hint: You can use matrix inversion or gradient descent to finish this part
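
The hint above names two ways to fit the weights. The sketch below shows both on hypothetical noise-free data, so that the two results can be compared. The data, the learning rate, and the iteration count are assumptions chosen for illustration. The gradient descent here is the batch variant, which differs from the per-sample updates in the cell below.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 2 + 3x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Design matrix with a bias column of ones: each row is [1, x_i]
X = np.column_stack([np.ones_like(x), x])

# Matrix inversion (normal equation): solve (X^T X) w = X^T y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on the mean squared error
w_gd = np.zeros(2)
learning_rate = 0.05
for _ in range(5000):
    gradient = X.T @ (X @ w_gd - y) / len(y)
    w_gd -= learning_rate * gradient

print(w_closed)
print(w_gd)
```

Calling `np.linalg.solve` avoids forming the inverse explicitly, which is usually the more numerically stable way to apply the normal equation. On this toy problem both methods should recover weights close to `[2, 3]`.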




In [28]:
def Regression():
    learning_rate = 0.01
    learning_iter = 100
    
    # size of training_dataset
    N = len(training_dataset)
    
    # x
    x = np.array([
        [[1, training_data['tempA']] for index, training_data in training_dataset.iterrows()],
        [[1, training_data['tempB']] for index, training_data in training_dataset.iterrows()],
        [[1, training_data['tempC']] for index, training_data in training_dataset.iterrows()]
    ])
    
    # y
    y = np.array([
        training_dataset['caseA'],
        training_dataset['caseB'],
        training_dataset['caseC']
    ])
    
    # weights
    w = np.zeros((3, 1+order_temp))
    
    for i in range(learning_iter):
        # iter 3 cities
        for city in range(3):    
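            # iter every data (one weight update per sample)
            for index in range(N):
                # x[city][index] is 1-D, so transpose() leaves it unchanged
                x_trans = x[city][index].transpose()
                prediction = np.dot(x[city][index], w[city])
                loss = prediction - y[city][index]  # signed prediction error
                # per-sample gradient of half the squared error, scaled by 1/N
                curr_gradient = np.dot(x_trans, loss) / N
                w[city] = w[city] - learning_rate * curr_gradient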
    return w
    
w = Regression()
print(w)

[[1.10718508 1.57039905]
 [0.23824931 1.04271472]
 [0.89669436 1.87797157]]


### Step 4: Make Prediction
Make predictions for the testing dataset and store the values in *output_datalist*

In [29]:
def MakePrediction():
    predictions = []  # fresh list, so re-running this cell does not duplicate rows
    for index, data in testing_dataset.iterrows():
        predictions.append([data['epiweek'],
                            int(w[0][0] + w[0][1] * data['tempA']),
                            int(w[1][0] + w[1][1] * data['tempB']),
                            int(w[2][0] + w[2][1] * data['tempC'])])
    return predictions

output_datalist = MakePrediction()
print(output_datalist)

[['202133', 44, 19, 53], ['202134', 39, 22, 51], ['202135', 46, 25, 49], ['202136', 45, 26, 54], ['202137', 44, 27, 51], ['202138', 43, 25, 56], ['202139', 38, 23, 56], ['202140', 41, 23, 55], ['202141', 44, 23, 54], ['202142', 43, 23, 53]]


### Step 5: Train Model and Generate Result

> Notice: **Remember to output the coefficients of the model here**, otherwise 5 points will be deducted
* If your regression model is *3x^2 + 2x^1 + 1*, your output would be: 
```
3 2 1
```
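
The format above lists coefficients from the highest power down to the constant term. A minimal sketch of printing weights in that order follows, using the values printed by `Regression()` in Step 3, where each row is stored as `[bias, slope]`. Printing one line per city and using eight decimal places are assumptions, since the notice does not specify either.

```python
import numpy as np

# Weights as printed by Regression() in Step 3: one row per city, [bias, slope]
w = np.array([[1.10718508, 1.57039905],
              [0.23824931, 1.04271472],
              [0.89669436, 1.87797157]])

lines = []
for city_weights in w:
    # Reverse each row so the highest-order coefficient comes first
    lines.append(' '.join(f'{c:.8f}' for c in city_weights[::-1]))
print('\n'.join(lines))
```

Reversing each row is needed only because the weights are stored with the bias first. A model that already stores the highest power first could print its weights directly.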





## Write the Output File
Write the prediction to output csv
> Format: 'epiweek', 'CityA', 'CityB', 'CityC'

In [16]:
with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in output_datalist:
    writer.writerow(row)

# 2. Advanced Part (35%)
In the second part, you need to implement a regression model that differs from the one in the basic part, with the goal of improving your predictions of the number of dengue cases

We provide you with two files **hw1_advanced_input1.csv** and **hw1_advanced_input2.csv** that can help you in this part

Please save the prediction result in a csv file **hw1_advanced.csv** 
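
One possible direction for a different approach, offered here only as an assumption and not as the expected solution, is to use the case counts of earlier weeks as input features. The sketch below uses hypothetical case counts and a lag of one week. It makes no assumption about the contents of the two advanced input files.

```python
import numpy as np

# Hypothetical weekly case counts for one city
cases = np.array([10.0, 12.0, 15.0, 14.0, 18.0, 21.0])
lag = 1  # assumed number of weeks to look back

# Each row is [1, cases one week earlier]; the target is the current week
X = np.column_stack([np.ones(len(cases) - lag), cases[:-lag]])
y = cases[lag:]

# Least-squares fit of y on X
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# One-step-ahead prediction from the most recent observed week
next_week = w[0] + w[1] * cases[-1]
print(w, next_week)
```

Predicting more than one week ahead with this kind of model would require feeding each prediction back in as the next input, which tends to accumulate error over the forecast horizon.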


# Report *(5%)*

Report should be submitted as a pdf file **hw1_report.pdf**

*   Briefly describe the difficulties you encountered
*   Summarize your work and your reflections 
*   No more than one page






# Save the Code File
Please save your code and submit it as an ipynb file! (**hw1.ipynb**)