# NEM Open Data Extraction Tool

**Objective**: create a tool that extracts open data from nemweb.com.au and assembles it into an analysis-ready format.

* For the initial testing phase, extracted data will be stored as CSV files
* Once the data extraction pipeline is set up, we can experiment with SQL databases

In [1]:
%load_ext autoreload
%autoreload 2

# Standard library and third-party packages
import os
import pandas as pd

# My modules
from nem_tracker import NEM_tracker
from nem_extract import NEM_extractor
import config
config.use('config.json')

Value for DATA_PATH has been set!


### NEM Tracker

* The `.bulk_update()` method updates all existing resources
* To add a new resource, use `.update_resource(resource, new=True)`
* Tracking CSVs in `.tracker_dir` keep track of resource URLs

In [2]:
data_dir = os.getenv('DATA_PATH')
nem_trk = NEM_tracker(data_dir)
for k, v in nem_trk.resources.items():
    print(k, v['last_update'])

/Reports/Current/Operational_Demand/ACTUAL_DAILY/ 2020-04-10-20:35:11
/Reports/Current/Dispatch_SCADA/ 2020-04-10-20:35:11


In [3]:
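# Resource paths to start tracking (empty here, so nothing new is added)
resources_new = []
nem_trk.add_resources(resources_new)
# Uncomment to refresh every resource that is already tracked
# nem_trk.bulk_update()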

### NEM Extractor

In [9]:
nem_get = NEM_extractor(data_dir)
nem_get.select_resource()

[0] /Reports/Current/Operational_Demand/ACTUAL_DAILY/
[1] /Reports/Current/Dispatch_SCADA/

Selected: /Reports/Current/Operational_Demand/ACTUAL_DAILY/


In [10]:
nem_get.load_tracker_df()
nem_get.current_tracker_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 5 columns):
TIMESTAMP        60 non-null datetime64[ns]
VERSION          60 non-null object
DOWNLOADED       60 non-null bool
DOWNLOAD_DATE    60 non-null datetime64[ns]
URL              60 non-null object
dtypes: bool(1), datetime64[ns](2), object(2)
memory usage: 2.1+ KB


In [11]:
nem_get.set_download_df(2)
nem_get.download_df

|    | TIMESTAMP | VERSION | DOWNLOADED | DOWNLOAD_DATE | URL |
|----|-----------|---------|------------|---------------|-----|
| 58 | 2020-04-08 | V20200409044000 | False | 1900-01-01 | http://nemweb.com.au/Reports/Current/Operation... |
| 59 | 2020-04-09 | V20200410044000 | False | 1900-01-01 | http://nemweb.com.au/Reports/Current/Operation... |


In [12]:
nem_get.download_files()
nem_get.current_tracker_df.tail()

|    | TIMESTAMP | VERSION | DOWNLOADED | DOWNLOAD_DATE | URL |
|----|-----------|---------|------------|---------------|-----|
| 55 | 2020-04-05 | V20200406044001 | False | 1900-01-01 00:00:00 | http://nemweb.com.au/Reports/Current/Operation... |
| 56 | 2020-04-06 | V20200407044000 | False | 1900-01-01 00:00:00 | http://nemweb.com.au/Reports/Current/Operation... |
| 57 | 2020-04-07 | V20200408044000 | False | 1900-01-01 00:00:00 | http://nemweb.com.au/Reports/Current/Operation... |
| 58 | 2020-04-08 | V20200409044000 | True | 2020-04-10 21:00:17 | http://nemweb.com.au/Reports/Current/Operation... |
| 59 | 2020-04-09 | V20200410044000 | True | 2020-04-10 21:00:17 | http://nemweb.com.au/Reports/Current/Operation... |


### Setting up the NEM Loader

* The tracker files can serve as metadata for the contents of the resource folders
  * If `DOWNLOADED == True`, the file should be present in the relevant resource folder
    * We may still want to track file names explicitly, or derive them from the last part of the `URL` field, which ideally matches the file name
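One way to test the idea of deriving file names from the `URL` field is sketched below. It is only a sketch: `filename_from_url` and `missing_downloads` are hypothetical helpers, not part of `nem_extract`, and the URLs and folder are made up for illustration. It assumes each downloaded file is saved under the last path component of its URL.

```python
import os
import tempfile
from urllib.parse import urlparse

import pandas as pd


def filename_from_url(url):
    """Return the last path component of a URL (assumed to be the file name)."""
    return os.path.basename(urlparse(url).path)


def missing_downloads(tracker_df, resource_dir):
    """List files marked DOWNLOADED in the tracker but absent from resource_dir."""
    downloaded = tracker_df[tracker_df["DOWNLOADED"]]
    names = downloaded["URL"].map(filename_from_url)
    return [n for n in names if not os.path.exists(os.path.join(resource_dir, n))]


# Toy tracker with made-up URLs; real tracker CSVs have more columns.
toy_tracker = pd.DataFrame({
    "DOWNLOADED": [True, False, True],
    "URL": [
        "http://example.com/some/resource/one.zip",
        "http://example.com/some/resource/two.zip",
        "http://example.com/some/resource/three.zip",
    ],
})

with tempfile.TemporaryDirectory() as resource_dir:
    # Pretend only one.zip has actually been saved to disk.
    open(os.path.join(resource_dir, "one.zip"), "w").close()
    missing = missing_downloads(toy_tracker, resource_dir)

print(missing)
```

An empty result would suggest the tracker and the folder agree; any names returned are rows flagged as downloaded whose files could not be found.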

* Parsing the downloaded data files:
  * Rows starting with `C` appear to mark the start and end of a data file
  * Rows starting with `I` seem to hold the column names
  * Rows starting with `D` are the data rows
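The row codes above suggest a simple parsing loop, sketched here under the working assumptions stated in the notes: the first field of each row is the code, each `I` row starts a new table, and the `D` rows that follow it have the same number of fields. `parse_cid` is a hypothetical helper and the sample text is invented; real files may carry extra identifier fields that need separate handling.

```python
import csv
import io

import pandas as pd


def parse_cid(text):
    """Split C/I/D-style text into one DataFrame per I (header) row."""
    tables = []
    header, rows = None, []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        kind = record[0]
        if kind == "I":
            # A new header closes the table collected so far, if any.
            if header is not None:
                tables.append(pd.DataFrame(rows, columns=header))
            header, rows = record[1:], []
        elif kind == "D" and header is not None:
            rows.append(record[1:])
        # "C" rows (assumed start/end markers) and other codes are skipped.
    if header is not None:
        tables.append(pd.DataFrame(rows, columns=header))
    return tables


# Invented sample, not taken from a real NEM file.
sample = """C,START OF SAMPLE
I,COL_A,COL_B
D,x,1
D,y,2
C,END OF SAMPLE
"""

tables = parse_cid(sample)
print(tables[0])
```

All values come back as strings, so type conversion (for example with `pd.to_numeric` or `pd.to_datetime`) would be a separate step once the real column layout is confirmed.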