$\underline{\textbf{PV Power Readings Analysis}}$

The objective of this analysis is to build a data-driven model of solar PV power generation for each zipcode at a given time. At a minimum, the model should output the minimum and maximum power generated by the PV components in each zipcode, where the input is a time at 15-minute granularity. A better version would give an expected value and a confidence interval, and would incorporate more inputs such as earlier power readings and weather information.
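
As a rough illustration of the minimal capability, the sketch below computes a minimum and maximum of total power for each 15-minute time-of-day slot from a toy table. The helper name `min_max_by_slot` and the toy values are assumptions made for illustration; only the column names `tsUTC`, `componentId` and `power` follow the data files used in this notebook. Summing over components before taking the envelope is one possible reading of "power generated in each zipcode", not necessarily the model built later.

```python
import pandas as pd

def min_max_by_slot(readings):
    """Min/max of total power per time-of-day slot (hypothetical helper)."""
    ts = pd.to_datetime(readings['tsUTC'])
    # total power over all components at each timestamp
    total = readings.groupby(ts)['power'].sum()
    # time-of-day label of each timestamp, e.g. '12:15'
    slot = total.index.strftime('%H:%M')
    return total.groupby(slot).agg(['min', 'max'])

# made-up readings for two components over two days
toy = pd.DataFrame({
    'tsUTC': ['2014-06-01 12:00:00', '2014-06-01 12:00:00',
              '2014-06-02 12:00:00', '2014-06-02 12:00:00',
              '2014-06-01 12:15:00', '2014-06-02 12:15:00'],
    'componentId': ['1', '2', '1', '2', '1', '1'],
    'power': [1.0, 2.0, 0.5, 1.5, 3.0, 4.0],
})
envelope = min_max_by_slot(toy)
print(envelope)
```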

# All power readings data preprocessing

In order to build such a model, we first need to understand the power readings data from each PV component in a zipcode. Here we preprocess the bulk data by splitting it by zipcode.

## Separate data files based on zipcodes

There are two kinds of data files here. The first kind is $\texttt{metadata}$, where each row represents one PV component and contains its ID, latitude, longitude, zip code, timezone, and possibly installation size. The second kind is $\texttt{timeseriesdata}$, where each row represents one power reading and contains the time, PV component ID, and power.
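
To make the relationship between the two files concrete, here is a small sketch with made-up rows: each reading is linked to a zip code through `componentId`. The values and the names `meta_toy`, `ts_toy` and `joined` are invented for illustration; only the column names come from the files described above.

```python
import pandas as pd

# toy metadata: one row per PV component (values are made up)
meta_toy = pd.DataFrame({
    'componentId': ['1', '2', '3'],
    'zip': ['08641', '08641', '92562'],
})
# toy timeseriesdata: one row per power reading (values are made up)
ts_toy = pd.DataFrame({
    'componentId': ['1', '2', '3', '1'],
    'tsUTC': ['2014-01-01 00:00:00', '2014-01-01 00:00:00',
              '2014-01-01 00:00:00', '2014-01-01 00:15:00'],
    'power': [0.0, 0.1, 0.2, 0.3],
})
# attach the zip code of the component to every reading
joined = ts_toy.merge(meta_toy, on='componentId', how='left')
readings_per_zip = joined.groupby('zip').size()
print(readings_per_zip)
```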

We specify the data source here; the year is given when each file is read. We thank our data provider for helping us find the future of clean energy. (The identity of our data source is confidential, though.)

In [1]:
datasource = 'L'

Here we import the data analysis package 'pandas', along with a few other packages, and specify the main directory.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
mainDir = r'C:\Users\Tee\Documents\Active\EnergyProject\Thesis' #this is main directory in ICME with raw data (raw string so that backslashes are not read as escapes)

We have two years of data, 2013 and 2014. Let's pull up metadata.csv first.

In [3]:
metadata_2013 = pd.read_csv(mainDir+'/data/solar/' + datasource +'/'+ '2013' +'/'+'raw/metadata.csv',
                       dtype={'componentId':'object','zip':'object'})
metadata_2014 = pd.read_csv(mainDir+'/data/solar/' + datasource +'/'+ '2014' +'/'+'raw/metadata.csv',
                       dtype={'componentId':'object','zip':'object'})
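
The `dtype` argument matters for zip codes. The sketch below uses an in-memory CSV with made-up rows to show the difference, assuming the raw files store zip codes with their leading zeros (as the later filter on '08641' suggests).

```python
import io
import pandas as pd

# made-up rows; the real metadata.csv has more columns
csv_text = "componentId,zip\n602,06410\n2023207,08640\n"

as_default = pd.read_csv(io.StringIO(csv_text))
as_object = pd.read_csv(io.StringIO(csv_text),
                        dtype={'componentId': 'object', 'zip': 'object'})

print(as_default['zip'].tolist())  # parsed as integers, leading zero lost
print(as_object['zip'].tolist())   # kept as strings
```

With the default types, a comparison such as `zip == '08640'` would match nothing, which is why the columns are read as 'object'.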

Here is what $\texttt{metadata}$ looks like:

In [4]:
metadata_2013[:1]

Unnamed: 0,nodeId,componentId,latitude,longitude,zip,tz
0,00:04:A3:A1:CA:B3.1,2023207,40.01211,-74.63209,8640,US/Eastern


In [5]:
metadata_2014[:1]

Unnamed: 0,componentId,sizeBucket,latitude,longitude,zip,timezone
0,602,0-1 kW,41.46,-72.91,6410,America/New_York


Let's do a quick check that all PV components in 2013 also appear in 2014.

In [6]:
set(metadata_2013.componentId).issubset(set(metadata_2014.componentId))

True

Since all PV components in 2013 are contained in 2014, we just use $\texttt{metadata}$ from 2014.

In [7]:
metadata = metadata_2014
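
The subset check only answers yes or no. Had it returned False, set differences would list which components are affected. A minimal sketch with made-up IDs (on the real data the sets would be built from `metadata_2013.componentId` and `metadata_2014.componentId`):

```python
ids_2013 = {'1', '2', '3'}        # made-up component ids
ids_2014 = {'1', '2', '4', '5'}   # made-up component ids

missing_in_2014 = sorted(ids_2013 - ids_2014)  # ids present in 2013 only
new_in_2014 = sorted(ids_2014 - ids_2013)      # ids that first appear in 2014
print(missing_in_2014, new_in_2014)
```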

Here we look at how many zipcodes there are and at the number of components in each zipcode.

In [8]:
metadata_2014[['zip','componentId']].groupby(['zip']).agg(['count'])

zip,componentId count
6010,283
6082,231
6084,303
6410,224
8640,181
8641,738
92562,329
92563,249


For the 2014 data, we have information on the installed kW range of each component (column name: sizeBucket). Note that the exact value is confidential and some sites do not have this information. We are particularly interested in two zipcodes: 08641 and 92562. Here are the distributions of kW range:

In [9]:
z = '08641'
d = metadata_2014[metadata_2014.zip == z]
d[['componentId','sizeBucket']].groupby(['sizeBucket']).agg(['count'])
#sum of this table = 112+2+94+3+124+4 = 339 < 738 (399 components have no kW range info)

sizeBucket,componentId count
10-20 kW,112
2-3 kW,2
20-50 kW,94
3-5 kW,3
5-10 kW,124
50-100 kW,4


In [10]:
z = '92562'
d = metadata_2014[metadata_2014.zip == z]
d[['componentId','sizeBucket']].groupby(['sizeBucket']).agg(['count'])
#sum of this table = 1+34+1+2+54+218 = 310 < 329 (19 components have no kW range info)

sizeBucket,componentId count
0-1 kW,1
10-20 kW,34
2-3 kW,1
20-50 kW,2
3-5 kW,54
5-10 kW,218
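
The number of components without a kW range was worked out by hand in the comments above. It can also be counted directly, assuming the missing entries are read as NaN (which is what pandas does for empty CSV cells). A sketch on a made-up table:

```python
import numpy as np
import pandas as pd

# made-up components; two of them have no sizeBucket
d_toy = pd.DataFrame({
    'componentId': ['1', '2', '3', '4', '5'],
    'sizeBucket': ['5-10 kW', '5-10 kW', np.nan, '10-20 kW', np.nan],
})
# dropna=False keeps the missing bucket as its own row in the counts
bucket_counts = d_toy['sizeBucket'].value_counts(dropna=False)
n_missing = int(d_toy['sizeBucket'].isna().sum())
print(bucket_counts)
print(n_missing)
```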


Now we read all timeseries data files.

In [11]:
timeseriesdata2013_1 = pd.read_csv(mainDir+'/data/solar/' + datasource +'/'+ '2013' +'/'+'raw/timeseriesdata.csv',
                             dtype={'componentId':'object'})
timeseriesdata2014_1 = pd.read_csv(mainDir+'/data/solar/' + datasource +'/'+ '2014' +'/'+'raw/timeseriesdata.csv',
                             dtype={'componentId':'object'})
timeseriesdata2014_2 = pd.read_csv(mainDir+'/data/solar/' + datasource +'/'+ '2014' +'/'+'raw/timeseriesdata_2.csv',
                             dtype={'componentId':'object'})
timeseriesdata2014_3 = pd.read_csv(mainDir+'/data/solar/' + datasource +'/'+ '2014' +'/'+'raw/timeseriesdata_3.csv',
                             dtype={'componentId':'object'})

#remove commas from componentId so that the IDs match those in metadata
timeseriesdata2013_1['componentId'] = timeseriesdata2013_1['componentId'].map(lambda x: x.replace(',',''))
timeseriesdata2014_1['componentId'] = timeseriesdata2014_1['componentId'].map(lambda x: x.replace(',',''))
timeseriesdata2014_2['componentId'] = timeseriesdata2014_2['componentId'].map(lambda x: x.replace(',',''))
timeseriesdata2014_3['componentId'] = timeseriesdata2014_3['componentId'].map(lambda x: x.replace(',',''))
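
The same cleanup can be written with the vectorized string accessor, which avoids calling a Python lambda on every row and leaves missing values as NaN instead of raising. A sketch on made-up IDs:

```python
import pandas as pd

# made-up ids, some written with commas
ts_toy = pd.DataFrame({'componentId': ['2,023,207', '602', '467,500']})
# vectorized equivalent of .map(lambda x: x.replace(',', ''))
ts_toy['componentId'] = ts_toy['componentId'].str.replace(',', '', regex=False)
print(ts_toy['componentId'].tolist())
```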

Here is what $\texttt{timeseriesdata}$ looks like:

In [12]:
timeseriesdata2013_1[:1]

Unnamed: 0,tsUTC,componentId,power
0,2013-01-01 00:00:00,467500,0.275


In [13]:
timeseriesdata2014_1[:1]

Unnamed: 0,componentId,tsUTC,powerKw
0,1455,2014-01-01 00:00:00.0,0


In [14]:
timeseriesdata2014_2[:1]

Unnamed: 0,componentId,tsUTC,powerKw
0,2030559,2014-09-18 08:15:00.0,0


In [15]:
timeseriesdata2014_3[:1]

Unnamed: 0,componentId,tsUTC,powerKw
0,2055667,2014-04-27 13:45:00.0,0.05367


Here we would like to establish the same column names across all data frames. Hence we rename the power reading column in the 2014 data and reorder the columns of the 2013 data.

In [16]:
timeseriesdata2014_1=timeseriesdata2014_1.rename(columns = {'powerKw':'power'})
timeseriesdata2014_2=timeseriesdata2014_2.rename(columns = {'powerKw':'power'})
timeseriesdata2014_3=timeseriesdata2014_3.rename(columns = {'powerKw':'power'})

In [17]:
timeseriesdata2013_1=timeseriesdata2013_1[['componentId','tsUTC','power']]
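
The renaming and reordering can be collected in one small function that is applied to every frame. The helper name `normalize` is hypothetical, and parsing `tsUTC` into datetimes is an extra step that the cells above do not perform; the toy rows copy the first rows shown earlier.

```python
import pandas as pd

def normalize(df):
    """Return a copy with columns componentId, tsUTC, power (hypothetical helper)."""
    out = df.rename(columns={'powerKw': 'power'})
    out = out[['componentId', 'tsUTC', 'power']].copy()
    # each file uses a single timestamp format, so parse per file before concatenating
    out['tsUTC'] = pd.to_datetime(out['tsUTC'])
    return out

toy_2013 = pd.DataFrame({'tsUTC': ['2013-01-01 00:00:00'],
                         'componentId': ['467500'], 'power': [0.275]})
toy_2014 = pd.DataFrame({'componentId': ['1455'],
                         'tsUTC': ['2014-01-01 00:00:00.0'], 'powerKw': [0.0]})

combined = pd.concat([normalize(toy_2013), normalize(toy_2014)], ignore_index=True)
print(combined)
```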

We would like to process the power readings data for each zipcode, so we split the data and write files according to zipcodes.

$\textbf{Warning:}$ this process may take a long time.

In [18]:
#ziplist = ['08640','92563'] #use list(set(metadata['zip'])) to process all zipcodes
ziplist = ['08641','92562']
allTimeseriesdata = [timeseriesdata2013_1, timeseriesdata2014_1,
                     timeseriesdata2014_2, timeseriesdata2014_3]

for z in ziplist:
    metadataByZip = metadata[metadata.zip == z]
    ids = set(metadataByZip['componentId'])

    #keep readings from components in this zipcode, then stack all files
    timeseriesdataByZip = pd.concat([df[df.componentId.isin(ids)] for df in allTimeseriesdata],
                                    ignore_index = True)
    directory = mainDir + '/data/solar/' + datasource +'/' + z
    if not os.path.exists(directory):
        os.makedirs(directory)
    metadataByZip.to_csv(directory+"/metadata.csv",index=False)
    timeseriesdataByZip.to_csv(directory+"/timeseriesdata.csv",index=False)
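
If the raw files are too large to hold in memory at once, one option is to filter them while reading, one chunk at a time. The sketch below is an alternative under that assumption, not the procedure used above; the helper name `filter_in_chunks` and the in-memory CSV are made up for illustration.

```python
import io
import pandas as pd

def filter_in_chunks(source, ids, chunksize=100000):
    """Read a timeseries CSV piece by piece, keeping rows whose componentId is in ids."""
    kept = []
    for chunk in pd.read_csv(source, dtype={'componentId': 'object'}, chunksize=chunksize):
        # same comma cleanup as above, applied per chunk
        chunk['componentId'] = chunk['componentId'].str.replace(',', '', regex=False)
        kept.append(chunk[chunk['componentId'].isin(ids)])
    return pd.concat(kept, ignore_index=True)

# made-up file contents standing in for raw/timeseriesdata.csv
csv_text = ("componentId,tsUTC,powerKw\n"
            "1455,2014-01-01 00:00:00.0,0\n"
            "602,2014-01-01 00:00:00.0,0.1\n"
            "1455,2014-01-01 00:15:00.0,0.2\n")
subset = filter_in_chunks(io.StringIO(csv_text), {'1455'}, chunksize=2)
print(subset)
```

Only the rows that pass the filter are kept between chunks, so peak memory depends on `chunksize` and on the size of the filtered result rather than on the full file.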