# Initial EDA
#### Britain Power Generation (between May 2011 and November 2019)

## 1) Load Required Libraries

In [1]:
library("tidyverse")
library("lubridate")
library("janitor")

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0       ✔ purrr   0.3.1  
✔ tibble  2.0.1       ✔ dplyr   0.8.0.1
✔ tidyr   0.8.3       ✔ stringr 1.4.0  
✔ readr   1.1.1       ✔ forcats 0.3.0  
“package ‘stringr’ was built under R version 3.5.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘lubridate’

The following object is masked from ‘package:base’:

    date

“package ‘janitor’ was built under R version 3.5.2”
Attaching package: ‘janitor’

The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test



## 2) Load Data

In [2]:
bp_generation_df <- read.csv(file="data/gridwatch2.csv", header=TRUE, sep=",")
bp_generation_df$timestamp <- ymd_hms(as.character(bp_generation_df$timestamp))
glimpse(bp_generation_df)

Observations: 891,922
Variables: 22
$ id               <int> 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ timestamp        <dttm> 2011-05-27 15:50:04, 2011-05-27 15:55:02, 2011-05-2…
$ demand           <int> 38874, 38845, 38745, 38826, 38865, 38881, 38876, 389…
$ frequency        <dbl> 50.132, 50.091, 50.034, 49.990, 50.017, 50.092, 50.0…
$ coal             <int> 9316, 9294, 9270, 9262, 9256, 9284, 9243, 9250, 9187…
$ nuclear          <int> 8221, 8225, 8224, 8220, 8210, 8198, 8203, 8202, 8199…
$ ccgt             <int> 18239, 18158, 18110, 18114, 18107, 18074, 18060, 180…
$ wind             <int> 1253, 1304, 1322, 1364, 1370, 1397, 1451, 1490, 1529…
$ pumped           <int> 309, 332, 285, 287, 297, 293, 285, 287, 289, 290, 28…
$ hydro            <int> 636, 633, 634, 635, 637, 637, 635, 635, 636, 634, 63…
$ biomass          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ oil              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ solar         

## 3) Feature Engineering

### 3.1) Create Power Supply Features

**Add Up All Renewable Supply**

In [11]:
bp_generation_df$renewables <- bp_generation_df$wind + bp_generation_df$pumped + bp_generation_df$hydro + bp_generation_df$biomass + bp_generation_df$solar + bp_generation_df$nemo + bp_generation_df$dutch_ict

**Add Up All Fossil Fuel Supply**

In [12]:
bp_generation_df$fossils <- bp_generation_df$coal + bp_generation_df$ccgt + bp_generation_df$oil + bp_generation_df$irish_ict #+ bp_generation_df$ew_ict

**Add Up All Nuclear Supply** 

In [13]:
bp_generation_df$nuclears <- bp_generation_df$nuclear + bp_generation_df$french_ict

**Add Up All Supply** 

In [14]:
bp_generation_df$supply <- bp_generation_df$renewables + bp_generation_df$fossils + bp_generation_df$nuclears

**Make Sure Supply Meets Demand**

In [19]:
(sum(bp_generation_df$supply)/sum(bp_generation_df$demand))*100
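The ratio above is a single aggregate, so it can hide row-level gaps, and `sum()` returns `NA` if any reading is missing. A minimal sketch of a more defensive version of the same check, run on a small made-up data frame (`toy_df` and its values are illustrative and not taken from the dataset):

```r
# Toy data standing in for bp_generation_df (values are made up for illustration)
toy_df <- data.frame(
  supply = c(100, 210, NA, 395),
  demand = c(100, 200, 300, 400)
)

# Keep only rows where both columns are present, so both sums cover the same rows
complete_rows <- complete.cases(toy_df[, c("supply", "demand")])

# Overall supply as a percentage of demand
overall_pct <- sum(toy_df$supply[complete_rows]) / sum(toy_df$demand[complete_rows]) * 100

# Row-level percentage, useful for spotting individual readings that are far off
row_pct <- toy_df$supply / toy_df$demand * 100
summary(row_pct)
```

Whether the real data contains missing values is not shown in the notebook, so the `complete.cases()` filter may turn out to be a no-op there.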

### 3.2) Create Date Features

**Floor Date to Half Hour**

In [27]:
bp_generation_df$DateTime <- bp_generation_df$timestamp
minute(bp_generation_df$DateTime) <- floor(minute(bp_generation_df$DateTime)/30)*30
second(bp_generation_df$DateTime) <- 0
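The manual flooring above can probably be expressed more directly with lubridate's `floor_date()`. The sketch below compares the two approaches on two timestamps copied from the `glimpse()` output; the variable names `ts`, `manual`, and `floored` are introduced here for illustration only.

```r
library(lubridate)

# Two sample timestamps taken from the glimpse() output above
ts <- ymd_hms(c("2011-05-27 15:50:04", "2011-05-27 15:55:02"))

# Manual approach, as in the cell above
manual <- ts
minute(manual) <- floor(minute(manual) / 30) * 30
second(manual) <- 0

# Single-call alternative: round down to the nearest 30-minute boundary
floored <- floor_date(ts, unit = "30 minutes")

# Check that both approaches agree on these samples
all(manual == floored)
```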

**Hour Of Day**

In [28]:
bp_generation_df$DateTime_hour <- hour(bp_generation_df$DateTime) + (minute(bp_generation_df$DateTime)/60)

**Month**

In [29]:
bp_generation_df$DateTime_month <- month(bp_generation_df$DateTime) 

**Weekday**

In [30]:
bp_generation_df$DateTime_wday <- wday(bp_generation_df$DateTime, week_start = getOption("lubridate.week.start", 1)) #Start on Monday
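The `week_start` argument changes which day is numbered 1, so the same date maps to different integers depending on the setting. A small sketch using the first date in the dataset (2011-05-27, a Friday) shows the difference; `d` is just an illustrative variable name.

```r
library(lubridate)

d <- ymd("2011-05-27")      # a Friday

wday(d, week_start = 1)     # Monday-based numbering, as used in the cell above
wday(d, week_start = 7)     # Sunday-based numbering
wday(d, label = TRUE)       # labelled factor instead of an integer
```

Note that `getOption("lubridate.week.start", 1)` only falls back to Monday when the `lubridate.week.start` option is unset, so a session that sets this option would number the days differently.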

**Yearday**

In [31]:
bp_generation_df$DateTime_yday <- yday(bp_generation_df$DateTime)

**Year**

In [32]:
bp_generation_df$DateTime_year <- year(bp_generation_df$DateTime) 

**Check it out**

In [33]:
glimpse(bp_generation_df)

Observations: 891,922
Variables: 32
$ id               <int> 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ timestamp        <dttm> 2011-05-27 15:50:04, 2011-05-27 15:55:02, 2011-05-2…
$ demand           <int> 38874, 38845, 38745, 38826, 38865, 38881, 38876, 389…
$ frequency        <dbl> 50.132, 50.091, 50.034, 49.990, 50.017, 50.092, 50.0…
$ coal             <int> 9316, 9294, 9270, 9262, 9256, 9284, 9243, 9250, 9187…
$ nuclear          <int> 8221, 8225, 8224, 8220, 8210, 8198, 8203, 8202, 8199…
$ ccgt             <int> 18239, 18158, 18110, 18114, 18107, 18074, 18060, 180…
$ wind             <int> 1253, 1304, 1322, 1364, 1370, 1397, 1451, 1490, 1529…
$ pumped           <int> 309, 332, 285, 287, 297, 293, 285, 287, 289, 290, 28…
$ hydro            <int> 636, 633, 634, 635, 637, 637, 635, 635, 636, 634, 63…
$ biomass          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ oil              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ solar         

## 4) Generate Dataset Aggregated By Half Hour

In [49]:
bp_generation_hh_df <- bp_generation_df %>% 
                    group_by(DateTime, DateTime_year, DateTime_yday, DateTime_month, DateTime_wday, DateTime_hour) %>%
                    summarise(demand=sum(demand), renewables=sum(renewables), fossils=sum(fossils), nuclears=sum(nuclears), supply=sum(supply))    

In [50]:
write.csv(bp_generation_hh_df,  "./data/bp_generation_hh_df.csv")

In [51]:
summary(bp_generation_hh_df$demand)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   5722  166664  196411  200367  231260  753999 
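The demand values in this summary are several times larger than the raw readings of roughly 38,000 shown by `glimpse()`. This is presumably because the timestamps are about five minutes apart, so each half-hour bucket sums around six readings. A sketch on made-up data (`toy_df` and the column names `n_readings`, `demand_sum`, `demand_mean` are illustrative) shows how `sum()` and `mean()` differ per bucket, and how a reading count helps spot incomplete buckets:

```r
library(dplyr)
library(lubridate)

# Toy 5-minute readings (made-up values) covering two half-hour buckets
toy_df <- data.frame(
  timestamp = ymd_hms("2011-05-27 15:30:00") + minutes(seq(0, 55, by = 5)),
  demand    = c(rep(38000, 6), rep(39000, 6))
)

toy_hh_df <- toy_df %>%
  mutate(DateTime = floor_date(timestamp, unit = "30 minutes")) %>%
  group_by(DateTime) %>%
  summarise(
    n_readings  = n(),          # how many samples landed in the bucket
    demand_sum  = sum(demand),  # what the notebook computes
    demand_mean = mean(demand)  # average level over the half hour
  )

toy_hh_df
```

If buckets in the real data hold different numbers of readings, the summed values are not directly comparable across buckets, whereas the mean is.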

In [57]:
bp_nonrenewables_daily_df <- bp_generation_hh_df %>% filter(DateTime_year == 2013) %>% group_by(DateTime_yday) %>% 
    summarise(nonrenewable_forecast=(sum(fossils)/12000000)+(sum(nuclears)/12000000)) %>% rename(day=DateTime_yday)
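The notebook does not explain the divisor `12000000`. One reading that fits it, assuming the readings are in MW and sampled every 5 minutes (neither is stated in the source), is a conversion to TWh: each sample covers 1/12 of an hour, so dividing the summed readings by 12 gives MWh, and dividing by a further 1,000,000 gives TWh. A sketch under those assumptions, with a hypothetical constant output:

```r
# Assumptions (not stated in the notebook): readings are in MW, sampled every 5 minutes
readings_per_hour <- 12      # 60 min / 5 min
mwh_per_twh       <- 1e6

# Hypothetical day: a constant 30,000 MW of non-renewable output, 288 samples
samples_mw <- rep(30000, 24 * readings_per_hour)

energy_mwh <- sum(samples_mw) / readings_per_hour  # MW * (1/12 h), summed over the day
energy_twh <- energy_mwh / mwh_per_twh

energy_twh                   # daily energy in TWh under these assumptions
sum(samples_mw) / 12000000   # same value using the notebook's divisor
```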

In [58]:
write.csv(bp_nonrenewables_daily_df,  "./data/bp_nonrenewables_daily_df.csv")