### Making a data file using PV data, metadata and NWP data

#### Extracting data from PV data
PV data has a single dimension, `datetime`, and one data variable per PV system, named by its `ss_id`. Each variable records the power generated by that system at each timestamp, so the PV data file tells us when, and how much, each `ss_id` generated.

#### Extracting from metadata.csv
`metadata.csv` has an `ss_id` column, which maps each row to a variable in the PV data. The metadata describes the PV sites themselves: given an `ss_id` from the PV data, it provides the longitude, latitude and other features of that site (orientation, tilt, `kwp` and the date it became operational).
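The lookup described above can be sketched as follows. This is a minimal illustration, not the notebook's own code: the two-row frame is a hypothetical stand-in for `metadata.csv` (values copied from the first rows shown later in this notebook), and `site_location` is a helper name made up for this example.

```python
import pandas as pd

# Hypothetical stand-in for metadata.csv, using two rows from the output shown in this notebook.
meta_data = pd.DataFrame(
    {
        "ss_id": [2405, 2406],
        "latitude_rounded": [53.53, 54.88],
        "longitude_rounded": [-1.63, -1.38],
        "kwp": [3.36, 1.89],
    }
)


def site_location(meta: pd.DataFrame, ss_id: str) -> tuple:
    """Return (latitude, longitude) for one PV system (illustrative helper)."""
    # PV variable names are strings, while the metadata column holds integers.
    row = meta.loc[meta["ss_id"] == int(ss_id)]
    if row.empty:
        raise KeyError(f"ss_id {ss_id} not found in metadata")
    # .iloc[0] picks the single matching row explicitly.
    return float(row["latitude_rounded"].iloc[0]), float(row["longitude_rounded"].iloc[0])


lat, lon = site_location(meta_data, "2405")
print(lat, lon)
```

Raising on a missing `ss_id` is a design choice for the sketch; returning `None` would also be reasonable if some PV systems are known to lack metadata.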

#### NWP data
NWP data is multidimensional. Each dimension gives us different information.
- **init_time**: the time at which the forecast was made. We match it to the `datetime` in the PV data: for each PV timestamp we look up the corresponding `init_time` here and get the weather data for that time.
- **step**: how far ahead of `init_time` the forecast runs. For a 48-hour forecast there are 49 steps (0 to 48 hours), stored as time offsets in nanoseconds relative to `init_time`.
- **variable**: the weather variables under consideration, e.g. `t2m` (temperature 2 metres above ground).
- **longitude and latitude**: the location of the weather prediction.

> The data variable name is `ECMWF_UK`


- variable has 14 points
- init_time has 4059 points
- latitude has 241 points
- longitude has 341 points


Hence, we take the `datetime` from the PV data, look up the longitude and latitude for the chosen `ss_id` in the metadata, and use them to reduce the dimensions of the NWP data.
We need data for all 49 steps, so there will be 49 columns, one per step.
Let's start with only `t2m` at `step=0`, and then add more data columns.
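To make the plan concrete, here is a small sketch of the dimension reduction on a synthetic dataset. Everything in it is an assumption for illustration: the array sizes and values are made up, only the dimension names and the `ECMWF_UK` variable name mirror the real data, and `forecast_for_site` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny synthetic stand-in for the NWP dataset: same dimension names, made-up sizes and values.
init_time = (np.datetime64("2021-10-01T00:00") + np.arange(4) * np.timedelta64(6, "h")).astype("datetime64[ns]")
step = (np.arange(49) * np.timedelta64(1, "h")).astype("timedelta64[ns]")
values = np.random.default_rng(0).random((4, 2, 49, 3, 3)).astype("float32")

nwp = xr.Dataset(
    {"ECMWF_UK": (("init_time", "variable", "step", "latitude", "longitude"), values)},
    coords={
        "init_time": init_time,
        "variable": ["t2m", "lcc"],
        "step": step,
        "latitude": [54.0, 55.0, 56.0],
        "longitude": [-5.0, -4.0, -3.0],
    },
)


def forecast_for_site(nwp, pv_time, latitude, longitude, variable="t2m"):
    """Return one forecast value per step for the grid point and run nearest to a PV reading."""
    # Fix latitude and longitude to the nearest grid point.
    point = nwp["ECMWF_UK"].sel(latitude=latitude, longitude=longitude, method="nearest")
    # Fix init_time to the forecast run nearest to the PV timestamp.
    run = point.sel(init_time=pd.Timestamp(pv_time), method="nearest")
    # Fix the weather variable; only the step dimension remains.
    return run.sel(variable=variable)


series = forecast_for_site(nwp, "2021-10-01T05:00", latitude=55.3, longitude=-4.2)
print(series.sizes)
```

Note that `method="nearest"` picks the closest forecast run even if it was issued after the PV timestamp; whether that is acceptable depends on how the data will be used for training.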

In [1]:
import xarray as xr
import ocf_blosc2
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import plotly.express as px


In [2]:
nwp_data = xr.open_dataset("../../../mnt/disks/gcp_data/nwp/ecmwf/UK_v2.zarr")
nwp_data
pv_data = xr.open_dataset("data_files/pv.netcdf", engine='h5netcdf')
pv_data
meta_data = pd.read_csv("data_files/metadata.csv")
meta_data



Unnamed: 0,ss_id,latitude_rounded,longitude_rounded,llsoacd,orientation,tilt,kwp,operational_at
0,2405,53.53,-1.63,E01007430,180.0,35.0,3.36,2010-11-18
1,2406,54.88,-1.38,E01008780,315.0,30.0,1.89,2010-12-03
2,2407,54.88,-1.38,E01008780,225.0,30.0,1.89,2010-12-03
3,2408,54.88,-1.38,E01008780,225.0,30.0,1.89,2010-12-03
4,2409,54.88,-1.38,E01008780,225.0,30.0,1.89,2010-12-03
...,...,...,...,...,...,...,...,...
24657,27063,51.41,-2.49,E01014398,185.0,35.0,4.00,2018-04-18
24658,27064,51.47,-0.59,E01016588,180.0,45.0,2.00,2018-04-18
24659,27065,51.36,-2.93,E01014817,125.0,37.0,4.00,2018-04-18
24660,27066,51.44,-2.85,E01014759,165.0,35.0,4.00,2018-04-18


In [84]:
pv_data
mask = (pv_data['datetime'].dt.month == 10) & (pv_data['datetime'].dt.year == 2021)
october_2021_data = pv_data.sel(datetime=mask)
october_2021_data

In [85]:
dates = october_2021_data['datetime'].values
dates_array = dates.astype(str).tolist()
# dates_array

In [86]:
# Select one PV reading: pick an ss_id and a timestamp from the PV data
ss_id = '10005'
time = '2018-01-01T06:05:00.000000000'
datetime = pv_data.sel(datetime=time)[ss_id]
print(datetime)

# Use the ss_id to look up latitude and longitude in the metadata.
# Note: int() drops the fractional part of the coordinates.
result_row = meta_data[meta_data['ss_id'] == int(ss_id)]
latitude = int(result_row['latitude_rounded'].iloc[0])
longitude = int(result_row['longitude_rounded'].iloc[0])

<xarray.DataArray '10005' ()> Size: 4B
[1 values with dtype=float32]
Coordinates:
    datetime  datetime64[ns] 8B 2018-01-01T06:05:00


In [87]:
nwp_data

In [88]:
# reduce the NWP dimensions by fixing latitude and longitude to the nearest grid point
# data = nwp_data.sel(init_time=time, method="nearest").sel(latitude=latitude, method="nearest").sel(longitude=longitude, method="nearest")
# data
data = nwp_data.sel(latitude=latitude, method="nearest").sel(longitude=longitude, method="nearest")
data

In [89]:
results= {'ss_id':[], 'datetime':[], 'latitude':[], 'longitude':[], 'step':[], 't2m':[], 'mcc':[], 'lcc':[], 'hcc':[], 'u10':[]}

In [90]:
def add_value(results_dict, key, func):
    try:
        value = func()
        results_dict[key].append(value)
    except Exception:
        results_dict[key].append(None)

In [91]:
steps = data.step.values.tolist()
data = nwp_data.sel(latitude=latitude, method="nearest").sel(longitude=longitude, method="nearest")
data

In [92]:
date = '2021-10-10T00:00:00.000000000'
nwp = data.sel(init_time=date)
nwp

In [93]:
# Notes:
# - select the dates for one day first; data.load() can then be tried to load the data locally
# - xarray's to_dataframe() converts an xarray object to a DataFrame directly;
#   the per-day frames can then be concatenated for all days in a month
# - randomly take some locations and some times:
#   size of data: 1000 init times * 49 steps, for 20 to 50 different sites
# - the resulting DataFrame keeps just ss_id plus 10 to 15 weather-variable columns;
#   lat, long and capacity can be taken from the metadata when feeding the model

for time in dates_array:

    str_time = str(time)
    time_data = data.sel(init_time=str_time, method='nearest')
    steps = time_data.step.values.tolist()
    
    for step in steps:
        results['datetime'].append(time)
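        # step is a plain integer (nanoseconds) after .tolist(), so it has no .astype();
        # store it as-is and convert to hours once the CSV has been loaded
        results['step'].append(step)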
        
        results['ss_id'].append(ss_id)
        results['latitude'].append(latitude)
        results['longitude'].append(longitude)
        nwp = time_data.sel(step=str(step), method="nearest")

        add_value(results, 't2m', lambda: nwp['ECMWF_UK'].sel(variable='t2m').values.item())
        add_value(results, 'mcc', lambda: nwp['ECMWF_UK'].sel(variable='mcc').values.item())
        add_value(results, 'lcc', lambda: nwp['ECMWF_UK'].sel(variable='lcc').values.item())
        add_value(results, 'hcc', lambda: nwp['ECMWF_UK'].sel(variable='hcc').values.item())
        add_value(results, 'u10', lambda: nwp['ECMWF_UK'].sel(variable='u10').values.item())



    # t2m = nwp['ECMWF_UK'].sel(variable='t2m').values.item()
    # mcc = nwp['ECMWF_UK'].sel(variable='mcc').values.item()
    # lcc = nwp['ECMWF_UK'].sel(variable='lcc').values.item()
    # hcc = nwp['ECMWF_UK'].sel(variable='hcc').values.item()
    # u10 = nwp['ECMWF_UK'].sel(variable='u10').values.item()
    # results['t2m'].append(t2m)
    # results['mcc'].append(mcc)
    # results['lcc'].append(lcc)
    # results['hcc'].append(hcc)
    # results['u10'].append(u10)
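The notes at the top of this cell mention xarray's `to_dataframe()`. As a rough sketch of that idea, the nested loop could be replaced by a single conversion. The data below is synthetic (made-up values, two init times, three steps, two variables); only the dimension names and the `ECMWF_UK` name follow the real dataset, and the array is assumed to be already reduced to one latitude/longitude.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic point data, already reduced to a single grid point (values are made up).
init_time = np.array(["2021-10-01T00:00", "2021-10-01T06:00"], dtype="datetime64[ns]")
step = (np.arange(3) * np.timedelta64(1, "h")).astype("timedelta64[ns]")
values = np.arange(12, dtype="float32").reshape(2, 2, 3)

point = xr.DataArray(
    values,
    dims=("init_time", "variable", "step"),
    coords={"init_time": init_time, "variable": ["t2m", "lcc"], "step": step},
    name="ECMWF_UK",
)

# Split the "variable" dimension into one data variable per weather variable,
# then flatten (init_time, step) into rows.
wide = point.to_dataset(dim="variable")
df = wide.to_dataframe().reset_index()

# Express the step as whole hours instead of a nanosecond timedelta.
df["step"] = (df["step"] / pd.Timedelta(hours=1)).astype(int)

# Attach the site identifier; latitude, longitude and capacity can be joined from the metadata later.
df.insert(0, "ss_id", "10005")
print(df)
```

This yields one row per `(init_time, step)` pair with one column per weather variable, the same layout as the `results` dictionary, without a Python-level loop over every timestamp and step.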

In [94]:
for key, value in results.items():
    print(f"Length of list for {key}: {len(value)}")

Length of list for ss_id: 380632
Length of list for datetime: 380632
Length of list for latitude: 380632
Length of list for longitude: 380632
Length of list for step: 380632
Length of list for t2m: 380632
Length of list for mcc: 380632
Length of list for lcc: 380632
Length of list for hcc: 380632
Length of list for u10: 380632


In [95]:
results_df = pd.DataFrame.from_dict(results)
results_df

Unnamed: 0,ss_id,datetime,latitude,longitude,step,t2m,mcc,lcc,hcc,u10
0,10005,2021-10-01T00:00:00.000000000,55,-4,0,285.485107,0.311035,0.817078,0.851990,3.937487
1,10005,2021-10-01T00:00:00.000000000,55,-4,3600000000000,285.497803,0.898163,0.705078,0.999420,4.630762
2,10005,2021-10-01T00:00:00.000000000,55,-4,7200000000000,284.968018,0.626984,0.701294,0.930756,4.567980
3,10005,2021-10-01T00:00:00.000000000,55,-4,10800000000000,284.847412,0.839722,0.993683,0.978912,4.091146
4,10005,2021-10-01T00:00:00.000000000,55,-4,14400000000000,285.010742,0.375702,0.997498,0.000000,3.614630
...,...,...,...,...,...,...,...,...,...,...
380627,10005,2021-10-27T23:55:00.000000000,55,-4,158400000000000,280.454834,0.074127,0.067841,0.342194,-0.210285
380628,10005,2021-10-27T23:55:00.000000000,55,-4,162000000000000,280.034424,0.121277,0.148407,1.000000,-1.214453
380629,10005,2021-10-27T23:55:00.000000000,55,-4,165600000000000,280.943359,0.637177,0.878418,0.961578,-1.259577
380630,10005,2021-10-27T23:55:00.000000000,55,-4,169200000000000,281.560547,0.691772,0.981445,0.999847,-2.288478


In [97]:
# results_df.to_csv("steps_oct_2021.csv")

In [24]:
results_oct = pd.read_csv("result_data/steps_oct_2021.csv")
results_oct

Unnamed: 0.1,Unnamed: 0,ss_id,datetime,latitude,longitude,step,t2m,mcc,lcc,hcc,u10
0,0,10005,2021-10-01T00:00:00.000000000,55,-4,0,285.485107,0.311035,0.817078,0.851990,3.937487
1,1,10005,2021-10-01T00:00:00.000000000,55,-4,3600000000000,285.497803,0.898163,0.705078,0.999420,4.630762
2,2,10005,2021-10-01T00:00:00.000000000,55,-4,7200000000000,284.968018,0.626984,0.701294,0.930756,4.567980
3,3,10005,2021-10-01T00:00:00.000000000,55,-4,10800000000000,284.847412,0.839722,0.993683,0.978912,4.091146
4,4,10005,2021-10-01T00:00:00.000000000,55,-4,14400000000000,285.010742,0.375702,0.997498,0.000000,3.614630
...,...,...,...,...,...,...,...,...,...,...,...
380627,380627,10005,2021-10-27T23:55:00.000000000,55,-4,158400000000000,280.454834,0.074127,0.067841,0.342194,-0.210285
380628,380628,10005,2021-10-27T23:55:00.000000000,55,-4,162000000000000,280.034424,0.121277,0.148407,1.000000,-1.214453
380629,380629,10005,2021-10-27T23:55:00.000000000,55,-4,165600000000000,280.943359,0.637177,0.878418,0.961578,-1.259577
380630,380630,10005,2021-10-27T23:55:00.000000000,55,-4,169200000000000,281.560547,0.691772,0.981445,0.999847,-2.288478


In [25]:
results_oct['step'] = (results_oct['step']/3600000000000).astype(int)
results_oct

Unnamed: 0.1,Unnamed: 0,ss_id,datetime,latitude,longitude,step,t2m,mcc,lcc,hcc,u10
0,0,10005,2021-10-01T00:00:00.000000000,55,-4,0,285.485107,0.311035,0.817078,0.851990,3.937487
1,1,10005,2021-10-01T00:00:00.000000000,55,-4,1,285.497803,0.898163,0.705078,0.999420,4.630762
2,2,10005,2021-10-01T00:00:00.000000000,55,-4,2,284.968018,0.626984,0.701294,0.930756,4.567980
3,3,10005,2021-10-01T00:00:00.000000000,55,-4,3,284.847412,0.839722,0.993683,0.978912,4.091146
4,4,10005,2021-10-01T00:00:00.000000000,55,-4,4,285.010742,0.375702,0.997498,0.000000,3.614630
...,...,...,...,...,...,...,...,...,...,...,...
380627,380627,10005,2021-10-27T23:55:00.000000000,55,-4,44,280.454834,0.074127,0.067841,0.342194,-0.210285
380628,380628,10005,2021-10-27T23:55:00.000000000,55,-4,45,280.034424,0.121277,0.148407,1.000000,-1.214453
380629,380629,10005,2021-10-27T23:55:00.000000000,55,-4,46,280.943359,0.637177,0.878418,0.961578,-1.259577
380630,380630,10005,2021-10-27T23:55:00.000000000,55,-4,47,281.560547,0.691772,0.981445,0.999847,-2.288478
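The division above turns nanosecond offsets into hour numbers. A slightly more explicit variant, shown here as a sketch on a few values taken from the table, goes through pandas timedeltas so the unit is visible in the code. The series name `steps_ns` is made up for this example.

```python
import pandas as pd

# A few step values in nanoseconds, as they appear in the saved CSV.
steps_ns = pd.Series([0, 3_600_000_000_000, 7_200_000_000_000, 169_200_000_000_000])

# Interpret the integers as nanosecond timedeltas, then count whole hours.
hours = (pd.to_timedelta(steps_ns, unit="ns") // pd.Timedelta(hours=1)).astype(int)
print(hours.tolist())
```

Either way, the conversion should only be applied once: running it again on a file that already stores hours would divide the step values a second time.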


In [26]:
results_oct.to_csv("result_data/steps_oct_2021.csv")

In [27]:
final = pd.read_csv("result_data/steps_oct_2021.csv")
final

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,ss_id,datetime,latitude,longitude,step,t2m,mcc,lcc,hcc,u10
0,0,0,10005,2021-10-01T00:00:00.000000000,55,-4,0,285.485107,0.311035,0.817078,0.851990,3.937487
1,1,1,10005,2021-10-01T00:00:00.000000000,55,-4,1,285.497803,0.898163,0.705078,0.999420,4.630762
2,2,2,10005,2021-10-01T00:00:00.000000000,55,-4,2,284.968018,0.626984,0.701294,0.930756,4.567980
3,3,3,10005,2021-10-01T00:00:00.000000000,55,-4,3,284.847412,0.839722,0.993683,0.978912,4.091146
4,4,4,10005,2021-10-01T00:00:00.000000000,55,-4,4,285.010742,0.375702,0.997498,0.000000,3.614630
...,...,...,...,...,...,...,...,...,...,...,...,...
380627,380627,380627,10005,2021-10-27T23:55:00.000000000,55,-4,44,280.454834,0.074127,0.067841,0.342194,-0.210285
380628,380628,380628,10005,2021-10-27T23:55:00.000000000,55,-4,45,280.034424,0.121277,0.148407,1.000000,-1.214453
380629,380629,380629,10005,2021-10-27T23:55:00.000000000,55,-4,46,280.943359,0.637177,0.878418,0.961578,-1.259577
380630,380630,380630,10005,2021-10-27T23:55:00.000000000,55,-4,47,281.560547,0.691772,0.981445,0.999847,-2.288478
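The `Unnamed: 0`, `Unnamed: 0.1` and `Unnamed: 0.2` columns appear because each `to_csv` call writes the DataFrame index as an extra column and each `read_csv` call reads it back as data. A minimal sketch of the effect and two ways to avoid it, using an in-memory buffer and a made-up three-column frame rather than the real file:

```python
import io

import pandas as pd

df = pd.DataFrame({"ss_id": [10005, 10005], "step": [0, 1], "t2m": [285.485107, 285.497803]})

# Default round trip: the index is written as an unnamed first column and read back as data.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_extra = pd.read_csv(buf)

# Option 1: do not write the index at all.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
clean = pd.read_csv(buf)

# Option 2: keep the index in the file, but read the first column back as the index.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
restored = pd.read_csv(buf, index_col=0)

print(list(with_extra.columns), list(clean.columns), list(restored.columns))
```

With either option, repeated save/load cycles leave the column set unchanged.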


In [1]:
# now choose one of the 14 data variables, reducing one more dimension by fixing its value.

## if i do data.sel(variable='t2m'), it does not work
# t2m = data['ECMWF_UK'].sel(variable='t2m')
# t2m
# step = data.sel(step='3600000000000')

# t2m = step['ECMWF_UK'].sel(variable='t2m').values
# t2m
# int(t2m)
# mcc = step['ECMWF_UK'].sel(variable='lcc').values.item()
# mcc
# t2m = step['ECMWF_UK'].sel(variable='t2m').values.item()
# mcc = step['ECMWF_UK'].sel(variable='mcc').values.item()
# lcc = step['ECMWF_UK'].sel(variable='lcc').values.item()
# hcc = step['ECMWF_UK'].sel(variable='hcc').values.item()
# u10 = step['ECMWF_UK'].sel(variable='u10').values.item()
# print(t2m,mcc,lcc, hcc, u10)
# int(mcc)
# ds['ECMWF_UK'].sel(variable='t2m')

In [2]:
# # try to get the values from this array
# print(t2m.values)
# print(len(t2m.values))

## gives the values of t2m for 49 steps

Brainstorming:

We now have to save the values in a pandas DataFrame. One option is to create 49 columns, one per step, but doing that for every weather variable gives too many columns. The alternative is a single `step` column that holds the step number, with one row per step.

- Con: `ss_id` and `datetime` are repeated across many rows.
- Pro: the data is easier to read and understand, whereas 49 columns per weather variable would be too many.

Which layout is more beneficial for training? How much does the step value matter? The model basically looks at all the weather variables and then predicts, so a single `step` column should be workable.
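The two layouts are interchangeable, so the choice does not have to be final. As a sketch, the long layout (one `step` column) can be pivoted into the wide layout (one column per step) on demand. The three `t2m` values are taken from the first rows of the October 2021 table; the `t2m_step_N` column names are an assumption made for this example.

```python
import pandas as pd

# Long layout: one row per (ss_id, datetime, step).
long_df = pd.DataFrame(
    {
        "ss_id": ["10005"] * 3,
        "datetime": ["2021-10-01T00:00:00"] * 3,
        "step": [0, 1, 2],
        "t2m": [285.485107, 285.497803, 284.968018],
    }
)

# Wide layout: one row per (ss_id, datetime), one t2m column per step.
wide_df = long_df.pivot(index=["ss_id", "datetime"], columns="step", values="t2m")
wide_df.columns = [f"t2m_step_{s}" for s in wide_df.columns]
wide_df = wide_df.reset_index()
print(wide_df)
```

Storing the long layout and pivoting only when a model needs per-step columns keeps the saved file narrow while leaving both options open.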


In [3]:
# need to make a pandas dataframe which can have t2m for 49 columns for steps
# it will have init_time, ss_id column, and t2m 49 steps column
# start with it, and then get for a month

## create a dictionary

## if we want to do for one month datetime, steps will be nested loop inside. 



In [4]:
# ecmwf = ds.ECMWF_UK
# # ecmwf
# t2m = ds['ECMWF_UK'].sel(variable='t2m')
# t2m_5 = t2m.isel(init_time=slice(5))
# t2m_5

# data_vars = variable.values
# t2m = ds.sel(variable='t2m').values
# t2m


# init_times = ds['init_time'].isel(init_time=slice(10)).values
# data = []

# for var in ds.ECMWF_UK.variable:
#     print(var)
#     var_data = ds.sel(variable = var).isel(init_time=slice(10)).values
#     print(var_data)
#     # data.append(var_data.flatten())


In [5]:
# create a plot of latitude and longitude as x and y coordinates. plot for one weather variable
# can take this dataset, change into dataframes, or into csv
# pick a piece of data and use it to convert to dataframes

In [6]:
# plt.figure(figsize=(10, 6))
# sc = plt.scatter(longitude, latitude, c=value, cmap='viridis')  # Use viridis colormap, or choose another
# plt.colorbar(sc, label='Intensity')  # Add a color bar to show intensity scale
# plt.xlabel('Longitude')
# plt.ylabel('Latitude')
# plt.title('Scatter Plot of Intensity')
# plt.show()

- use the PV data to extract the matching NWP data
- `init_time` and `datetime` are treated as the same: use the `datetime` from PV to select the `init_time` in NWP
- use the metadata to get the details of the PV site via `ss_id`
- try 2 steps for 1 month to get a base version
- start with 1 value