## Predicting Power Generation using Linear Regression and LSTM

## Data  
**power_actual**
- This file contains the solar generation of a certain plant from October 1st, 2017 to September 30th, 2019.
- You'll find the following columns: 'power', 'gti' and 'ghi'. Power is the actual power generated, while GHI (Global Horizontal Irradiance) and GTI (Global Tilted Irradiance) are the parameters that describe the radiation received from the sun.

**weather_actuals**
- This file contains the weather data of the same plant from October 1st, 2017 to September 30th, 2019.
- The column names are self-explanatory.

**weather_forecast**
- This file contains the weather data from October 1st, 2019 to October 27th, 2019. 

The task is to predict the power generated by the given plant over this period: October 1st, 2019 to October 27th, 2019.

## Metrics
The quantity being predicted is `power`, the actual power generated by the plant.

## Data cleaning
The raw data had the following issues:

- Weather data are recorded at 15-minute intervals, while solar generation data are recorded at hourly intervals
- Some negative values (such as -99999) for features that should only have positive values
- Missing data
- Unneeded columns
- Some outliers

To get the data into usable form, I did the following steps:

- Grouped weather data by year, month, day and hour, summing the 15-minute readings to get hourly data
- Set negative values to the median of their column
- Dropped timestamps from the power data that had no counterpart in the weather data
- Dropped unneeded columns that contained only null values
- Replaced outliers with the power value from the previous time step
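
The cleaning steps above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the function names, the `DatetimeIndex` layout, and the z-score rule for detecting outliers are assumptions, since the text does not say how outliers were identified.

```python
import pandas as pd


def clean_weather(weather: pd.DataFrame) -> pd.DataFrame:
    """Aggregate 15-minute weather rows to hourly rows and repair bad values.

    Assumes a DatetimeIndex and numeric columns (hypothetical layout).
    """
    # Drop columns that contain only null values.
    weather = weather.dropna(axis=1, how="all")
    # Sum the 15-minute readings that fall inside each hour.
    hourly = weather.resample("60min").sum()
    # Negative values (e.g. sums polluted by -99999) become missing ...
    valid = hourly.where(hourly >= 0)
    # ... and are then filled with the median of the valid values per column.
    return valid.fillna(valid.median())


def impute_outliers(power: pd.Series, z_max: float = 3.0) -> pd.Series:
    """Replace outliers with the previous time step's value.

    The z-score threshold is an assumed detection rule.
    """
    z = (power - power.mean()) / power.std()
    # Mask outliers as missing, then carry the previous value forward.
    return power.mask(z.abs() > z_max).ffill()
```

Aligning the two sources afterwards could be done with `power.loc[power.index.intersection(hourly.index)]`, which keeps only the timestamps present in both.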

## Exploratory Data Analysis

### Predictors over time
![](img/1.png)

### Correlation coefficients of all features
The correlation matrix shows some multicollinearity between variables, as well as many weak correlations.     

![](img/2.png)

## Feature Engineering

Based on the correlation coefficients, only a subset of the numeric features was selected. For the categorical features, the `icon` and `humidity` fields were selected.
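
One way to express this kind of selection is sketched below. The correlation threshold of 0.3, the function name, and the use of one-hot encoding for the categorical fields are assumptions made for illustration; the text does not state the cutoff or the encoding that was used.

```python
import pandas as pd


def select_features(df, target="power", min_abs_corr=0.3, categorical=("icon",)):
    """Keep numeric columns correlated with the target, plus encoded categoricals.

    min_abs_corr is a hypothetical cutoff, not a value from the project.
    """
    numeric = df.select_dtypes(include="number")
    # Correlation of every numeric column with the target, excluding the target itself.
    corr = numeric.corr()[target].drop(target)
    keep = corr[corr.abs() >= min_abs_corr].index.tolist()
    # One-hot encode the chosen categorical columns.
    cols = list(categorical)
    dummies = pd.get_dummies(df[cols], columns=cols)
    return pd.concat([df[keep], dummies], axis=1)
```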

## Modeling

### Linear Regression
Predictions were made using a linear regression model.
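
For reference, a linear regression fit can be sketched with ordinary least squares in NumPy. The text does not say which library was used for this model, so the helper names below are hypothetical and the code only illustrates the technique.

```python
import numpy as np


def fit_linear_regression(X, y):
    """Return [intercept, coef_1, ..., coef_p] from an ordinary least squares fit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Apply the fitted coefficients to new feature rows."""
    return coef[0] + np.asarray(X, dtype=float) @ coef[1:]
```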

### LSTM (long short-term memory) RNN (recurrent neural network) in Keras
Predictions were made using an LSTM (long short-term memory) model. Data was lagged by 1-day, 2-day, 10-day and 4-week periods.
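
The lagging step might look like the sketch below. It assumes hourly rows with no gaps, so a 1-day lag is a shift of 24 rows, and it assumes the lagged column is `power`; the text does not say which columns were lagged.

```python
import pandas as pd

# Lag lengths from the text, expressed in hourly rows.
LAGS_HOURS = {"1d": 24, "2d": 48, "10d": 240, "4w": 24 * 7 * 4}


def add_lags(df, column="power", lags=LAGS_HOURS):
    """Add one lagged copy of `column` per entry in `lags`."""
    out = df.copy()
    for name, hours in lags.items():
        out[f"{column}_lag_{name}"] = out[column].shift(hours)
    # The first rows have no history for the longest lag, so drop them.
    return out.dropna()
```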

### Train test split
Data was split in a 75:25 ratio: 75% for training and 25% for testing.
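
A 75:25 split can be sketched as below. The split here is chronological (earliest rows for training), which is a common choice for time series but an assumption on my part; the text does not say whether the split was chronological or random.

```python
import pandas as pd


def chronological_split(df, train_fraction=0.75):
    """Split rows in order: the first 75% for training, the rest for testing."""
    cut = int(len(df) * train_fraction)
    return df.iloc[:cut], df.iloc[cut:]
```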

### Hyperparameters
Hyperparameters used for the LSTM were:

- LSTM cells = number of hours being predicted
- epochs = 10
- batch_size = 12
- dropout = 0.3
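
A Keras model using these hyperparameters could be sketched as follows. The architecture is an assumption: a single LSTM layer followed by one dense output, with dropout applied through the LSTM layer's `dropout` argument. The text gives the hyperparameter values but not the layer layout, the optimizer, or the loss, so `adam` and mean squared error are placeholders.

```python
from tensorflow import keras


def build_lstm(timesteps, n_features, horizon_hours):
    """Build a small LSTM regressor (assumed architecture)."""
    model = keras.Sequential([
        keras.Input(shape=(timesteps, n_features)),
        # Number of LSTM cells = number of hours being predicted.
        keras.layers.LSTM(horizon_hours, dropout=0.3),
        # One power value per input sample (assumption).
        keras.layers.Dense(1),
    ])
    # Optimizer and loss are placeholders, not taken from the text.
    model.compile(optimizer="adam", loss="mse")
    return model


def train(model, X, y, epochs=10, batch_size=12):
    """Fit with the epochs and batch size listed above."""
    return model.fit(X, y, epochs=epochs, batch_size=batch_size, verbose=0)
```

Here `X` is expected to have shape `(samples, timesteps, n_features)` and `y` shape `(samples, 1)`.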

### Additional steps
After fitting, each model and its results were saved for later use.

### Model evaluation
Models were scored on RMSE, R² score, MSE, MAE and adjusted R² score.
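
These scores can be computed with NumPy as in the sketch below. The function name is hypothetical, and adjusted R² uses the standard formula `1 - (1 - R²)(n - 1) / (n - p - 1)`, where `n` is the number of samples and `p` the number of features; the text does not show how the project computed it.

```python
import numpy as np


def regression_scores(y_true, y_pred, n_features):
    """Return MSE, RMSE, MAE, median absolute error, R^2 and adjusted R^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    n = len(y_true)
    mse = float(np.mean(err ** 2))
    # R^2 compares the residual error with the variance around the mean.
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R^2 penalises the score for the number of features used.
    adjusted_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "mse": mse,
        "rmse": mse ** 0.5,
        "mae": float(np.mean(np.abs(err))),
        "median_ae": float(np.median(np.abs(err))),
        "r2": r2,
        "adjusted_r2": adjusted_r2,
    }
```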

### Predictions and results
Here are example results for modeling using the hyperparameters above. 
- The plot below covers all of the data available in the weather data, from `2017-09-01` to `2019-09-30`
![](img/3.png)

$$\text{Mean Square Error} = 130.74405940675533$$  
$$\text{Root Mean Square Error} = 11.434336859072998$$  
$$\text{Mean Absolute Error} = 6.6891303464407095$$  
$$\text{Median Absolute Error} = 2.4006800651550293$$  
$$R^2 = 0.6533891923975141$$  
$$\text{Adjusted } R^2 = 0.6506179481555043$$

- The plot below covers the forecast data from `2019-10-01` to `2019-10-27`
![](img/4.png)