# **Solar Power Generation Forecast Report**

Taylor Boyd

CS458

## **Part 1 - Splitting Data**

**Split "solar.csv" into training dataset "solar_training.csv" and test dataset "solar_test.csv".**

In [6]:
f = open("solar.csv", 'r')
f2 = open("solar_training.csv", 'w')
f3 = open("solar_test.csv", 'w')

ctr = 0 # variable that tracks what line of the file we're on

# go through each line of the file
for line in f:
    if ctr != 0:
        a_str = line.strip().split(',')
        # column 1 holds the timestamp, formatted as "YYYYMMDD HH:MM"
        time = a_str[1].split(' ')
        # keep only the hour part of the time as an int
        hr = int(time[1].split(":")[0])
        # if timestamp is 20130701 00:00 or earlier, write data to training dataset file
        if int(time[0]) < 20130701:
            f2.write(line)
        elif (int(time[0]) == 20130701 and hr == 0):
            f2.write(line)
        # else, write data to testing dataset file
        else:
            f3.write(line)
    ctr +=1
f.close()
f2.close()
f3.close()

To split the data, I parsed each timestamp into an integer date (YYYYMMDD) and an integer hour, so that any line of data can be checked against the cutoff of 20130701 00:00. Lines with a timestamp at or before the cutoff are written to the training dataset file; all other lines are written to the testing dataset file. Only the hour is compared, not the minutes, so the split assumes that no line has a timestamp between 00:01 and 00:59 on 20130701. The first line of the file is skipped and is not written to either output file.
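The cutoff test can also be written as a single comparison on a parsed timestamp. The following is a minimal sketch, not the code used for the results in this report. It assumes timestamps look like "YYYYMMDD HH:MM"; the names `CUTOFF` and `is_training_row` are made up for illustration.

```python
from datetime import datetime

# Assumed cutoff, taken from the split described in this section.
CUTOFF = datetime(2013, 7, 1, 0, 0)


def is_training_row(timestamp: str) -> bool:
    """Return True if a 'YYYYMMDD HH:MM' timestamp is at or before the cutoff."""
    return datetime.strptime(timestamp, "%Y%m%d %H:%M") <= CUTOFF


# Hypothetical timestamps, for illustration only.
for ts in ["20130630 23:00", "20130701 00:00", "20130701 01:00"]:
    print(ts, is_training_row(ts))
```

Unlike the hour-only check, this comparison also takes the minutes into account, so a timestamp such as 20130701 00:30 would go to the test set.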

## **Part 2 - Model Building**

**Build a 24 hr ahead solar power generation forecast model.**

In [18]:
import numpy as np
import math
import pandas as pd
from sklearn.svm import SVC
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
%matplotlib inline

f2 = open("solar_training.csv", 'r')
f3 = open("solar_test.csv", 'r')

p1_trainingX = [] # plant 1 training data
p1_trainingY = []
p2_trainingX = [] # plant 2 training data
p2_trainingY = []
p3_trainingX = [] # plant 3 training data
p3_trainingY = []

p1_testX = [] # plant 1 test data
p1_testY = []
p2_testX = [] # plant 2 test data
p2_testY = []
p3_testX = [] # plant 3 test data
p3_testY = []

# helper function for separating data and labels
def store_data(dataX, dataY, a_str):
    a_str.pop(0)
    a_str.pop(0)
    temp = []
    for item in range(len(a_str)-1):
        temp.append(float(a_str[item]))
    dataX.append(temp)
    dataY.append(float(a_str[len(a_str)-1]))
    return 0

# go through each line of training dataset file
for line in f2:
    line = line.strip().split(',')
    a_str = line
    plant = int(a_str[0])
    if plant == 1:
        store_data(p1_trainingX, p1_trainingY, a_str)
    elif plant == 2:
        store_data(p2_trainingX, p2_trainingY, a_str)
    else:
        store_data(p3_trainingX, p3_trainingY, a_str)
        
# go through each line of testing dataset file
for line in f3:
    line = line.strip().split(',')
    a_str = line
    plant = int(a_str[0])
    if plant == 1:
        store_data(p1_testX, p1_testY, a_str)
    elif plant == 2:
        store_data(p2_testX, p2_testY, a_str)
    else:
        store_data(p3_testX, p3_testY, a_str)
        
f2.close()
f3.close()

# convert lists to np arrays
p1_trainingX = np.array(p1_trainingX)
p1_trainingY = np.array(p1_trainingY)
p2_trainingX = np.array(p2_trainingX)
p2_trainingY = np.array(p2_trainingY)
p3_trainingX = np.array(p3_trainingX)
p3_trainingY = np.array(p3_trainingY)
p1_testX = np.array(p1_testX)
p1_testY = np.array(p1_testY)
p2_testX = np.array(p2_testX)
p2_testY = np.array(p2_testY)
p3_testX = np.array(p3_testX)
p3_testY = np.array(p3_testY)

# function that splits labels into 5 different classes
def split_classes(arr):
    new_arr = np.zeros(arr.shape[0])
    for i in range(arr.shape[0]):
        if arr[i] < 0.2:
            new_arr[i] = 0
        elif arr[i] < 0.4:
            new_arr[i] = 1
        elif arr[i] < 0.6:
            new_arr[i] = 2
        elif arr[i] < 0.8:
            new_arr[i] = 3
        else:
            new_arr[i] = 4
    return new_arr

# variables for newly defined labels
p1train = np.array(split_classes(p1_trainingY))
p2train = np.array(split_classes(p2_trainingY))
p3train = np.array(split_classes(p3_trainingY))
 
# fit one SVR (support vector regression) model per plant on the class-index labels
regr1 = make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.2))
regr1.fit(p1_trainingX, p1train)
regr2 = make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.2))
regr2.fit(p2_trainingX, p2train)
regr3 = make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.2))
regr3.fit(p3_trainingX, p3train)

# function for predicting power generation 24hrs after given timestamp
# takes in n which is the index for the sample with targeted timestamp
def predict_tmrw(clf, testX, n):
    # collect up to 7 samples ending at index n (fewer if there is less history)
    window = np.array([testX[n-i] for i in range(7) if (n-i) >= 0])
    # predict once for the whole window and average the predicted class indices
    pred = np.mean(clf.predict(window))
    # map the averaged class index to the midpoint of its range (0 -> 0.1, 4 -> 0.9)
    return pred / 5 + 0.1

To build a model for each solar plant and make 24-hour predictions, I first split the possible power generation values into 5 ranges (0-0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8, 0.8-1.0) and relabeled each training sample with the index of its range (0 to 4). Using these labels, I fit one SVR model per plant. Because SVR is a regressor, its output is a continuous value on the class-index scale rather than a discrete class. Next, I wrote a function that makes the prediction. It takes a fitted model, the test samples, and an index n, and looks at up to the last 7 samples ending at n (fewer if less history is available). This window covers 7 rows of the test set, which is a week only if the data has one row per day. The model's predictions over the window are averaged, then divided by 5 and shifted by 0.1, so that a class index maps back to the midpoint of its range.
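The binning and the mapping back to the power scale can be sketched on their own, without a fitted model. This is an illustrative sketch under the assumption that power values lie in [0, 1]; `to_class`, `to_midpoint`, and `window_forecast` are hypothetical helper names, and the predicted class indices below are invented values.

```python
import numpy as np

N_BINS = 5                     # five equal-width ranges over [0, 1]
EDGES = [0.2, 0.4, 0.6, 0.8]   # inner bin edges


def to_class(y):
    """Map power values to bin indices 0..4 (values >= 0.8 fall in bin 4)."""
    return np.digitize(np.asarray(y, dtype=float), EDGES)


def to_midpoint(k):
    """Map a (possibly fractional) bin index to the midpoint of its range."""
    return np.asarray(k, dtype=float) / N_BINS + 0.5 / N_BINS


def window_forecast(pred_classes, n, width=7):
    """Average predicted bin indices for rows n-width+1..n, then decode."""
    start = max(0, n - width + 1)
    return float(to_midpoint(np.mean(pred_classes[start:n + 1])))


# Invented predicted class indices, for illustration only.
pred_classes = np.array([0, 1, 2, 3, 4, 4, 4, 4])
print(to_class([0.0, 0.19, 0.2, 0.59, 0.8, 1.0]))
print(to_midpoint([0, 1, 2, 3, 4]))
print(window_forecast(pred_classes, 7))
```

Because the decoded value is a bin midpoint, a forecast from class indices in 0 to 4 can never fall below 0.1 or above 0.9, which puts a floor on the error for samples near 0 or 1.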

## **Part 3 - Model Evaluation**

**Use MAE and RMSE measures to evaluate the model.**

In [19]:
# takes in trained model and test data, outputs mae
def mae_function(model, testX, testY):
    mae = 0
    for i in range(testX.shape[0]):
        pred = predict_tmrw(model, testX, i)
        mae += abs(testY[i] - pred)
    mae = mae / testX.shape[0]
    return mae

# takes in trained model and test data, outputs rmse
def rmse_function(model, testX, testY):
    rmse = 0
    for i in range(testX.shape[0]):
        pred = predict_tmrw(model, testX, i)
        rmse += (abs(testY[i] - pred)) * (abs(testY[i] - pred))
    rmse = rmse / testX.shape[0]
    rmse = math.sqrt(rmse)
    return rmse

# calculate and print mae of each plant's model
p1_mae = mae_function(regr1, p1_testX, p1_testY)
p2_mae = mae_function(regr2, p2_testX, p2_testY)
p3_mae = mae_function(regr3, p3_testX, p3_testY)
print("MAE")
print(p1_mae)
print(p2_mae)
print(p3_mae)

# calculate and print rmse of each plant's model
p1_rmse = rmse_function(regr1, p1_testX, p1_testY)
p2_rmse = rmse_function(regr2, p2_testX, p2_testY)
p3_rmse = rmse_function(regr3, p3_testX, p3_testY)
print("RMSE")
print(p1_rmse)
print(p2_rmse)
print(p3_rmse)

MAE
0.1781144352180035
0.1871333440553051
0.18999626716860712
RMSE
0.22121105569287766
0.23172321897512396
0.23444825402235855


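RMSE is higher than MAE for every plant (0.221 to 0.234 versus 0.178 to 0.190). This is expected: on the same set of errors RMSE is never smaller than MAE, and it weights large errors more heavily, so the gap reflects a few larger misses rather than one measure doing worse than the other. With the labels binned over a 0 to 1 range, an MAE of about 0.18 means predictions are off by a little under one bin width (0.2) on average. I think it's interesting that the error is highest for plant 3 under both measures.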
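For reference, both measures can be computed directly from arrays of true and predicted values. The sketch below is a vectorized NumPy version, not the loop-based code used to produce the numbers in this report; the function names `mae` and `rmse` and the sample values are made up for illustration.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error: the average of |y_true - y_pred|."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(err)))


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the average squared error."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))


# Hypothetical values, for illustration only.
y_true = [0.0, 0.5, 1.0, 0.2]
y_pred = [0.1, 0.5, 0.6, 0.2]
print("MAE ", mae(y_true, y_pred))
print("RMSE", rmse(y_true, y_pred))
```

In this toy example one large miss (0.4) pulls RMSE well above MAE, which is the same pattern as in the results for the three plants.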