# Data Acquisition and Pre-Processing
## Gregory Myers
<a id='top'></a>

# Hydrocarbon Production and Greenhouse Gas Emissions Data Acquisition and Pre-Processing.
Data are obtained from the U.S. Energy Information Administration (EIA), eia.gov,  
and the Scripps Institution of Oceanography (SIO) at the University of California, San Diego, scrippsco2.ucsd.edu.  
The emissions data focus on the greenhouse gases (GHG) carbon dioxide (CO2), methane (CH4), and nitrous oxide (N2O).  

**The following datasets are queried directly by API from the EIA website in JSON format (product, date range)**
* Annual U.S. Crude Oil Production (EIA)  1859-2019
* Annual U.S. Natural Gas Production (EIA)  1936-2019
* Annual U.S. Natural Gas, Vented and Flared (EIA)  1936-2018
* Annual U.S. Natural Gas Sector CO2 Emissions (EIA)  1980-2017
* Annual U.S. Crude Oil Sector CO2 Emissions (EIA)  1980-2017
* Annual U.S. Industrial CO2 Emissions (EIA)  1980-2017


**The Scripps datasets are manually downloaded from ucsd.edu in text format (product, date range)**
* Annual Atmospheric CO2 Concentration (SIO)  1958-2019
* Annual Atmospheric CH4 Concentration (SIO)  1983-2019
* Annual Atmospheric N2O Concentration (SIO)  2001-2019

**EIA Data**  
The EIA JSON data contains considerable redundancy, mostly in the form of metadata. The primary  
component of interest is a list field holding the date and product values, which is processed in pandas  
to extract the required data and saved to CSV in two formats: comma (,) and pipe (|) separated.  

The EIA production data records crude oil volumes in units of thousand barrels of oil and emissions data  
in million metric tons of CO2.
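
The two output formats differ only in the separator passed to `to_csv`. The following is a minimal sketch of that step, assuming a made-up two-row frame; `df_example` and `to_delimited_text` are hypothetical names used for illustration, and the text is written to an in-memory buffer rather than to the project folders.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for an extracted EIA series.
df_example = pd.DataFrame({"Year": [2016, 2017], "MBBLS": [100.0, 110.5]})


def to_delimited_text(df, sep):
    """Return the frame as delimited text, header included, index dropped."""
    buffer = io.StringIO()
    df.to_csv(buffer, header=True, index=False, sep=sep)
    return buffer.getvalue()


comma_text = to_delimited_text(df_example, ",")
pipe_text = to_delimited_text(df_example, "|")
print(comma_text)
print(pipe_text)
```

Writing both variants from the same frame keeps the two files identical apart from the delimiter.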

**SIO Data**  
The SIO text files containing the GHG concentration data are read into a pandas dataframe, skipping comment lines  
and importing only the space-separated data. A header is added and the data is ready for export, again to  
CSV in two formats as above.  

The SIO concentration data is recorded in parts per million (ppm) for CO2 and parts per billion (ppb) for CH4 and N2O.  
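
As a rough illustration of the read step described above, the sketch below parses a small in-memory text sample. The sample rows and the three-column header are invented for the example; the real files have more columns and real measurements.

```python
import io

import pandas as pd

# Hypothetical excerpt in the layout described above: '#' comment lines
# followed by whitespace-separated columns. The values are made up.
sample_text = """# comment line describing the file
# year month value
2019   11   410.25
2019   12   411.50
"""

df_sample = pd.read_csv(
    io.StringIO(sample_text),
    sep=r"\s+",      # any run of whitespace separates columns
    comment="#",     # skip the comment lines
    names=["Year", "month", "average_ppm"],
)
print(df_sample)
```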

**Final Product**  
In total, nine datasets are produced by this notebook, each in two formats, for a total of 18 CSV files.  

[Return to Introduction](1_Introduction.ipynb)

In [1]:
import pandas as pd
import requests
import json

In [2]:
# helper command for specific jupyter lab extension
%config IPCompleter.greedy=True

### Data folder variables for coding simplification.  
The data_591 folder is the location for the analysis-ready CSV data. Note that these CSV files use the pipe symbol ( | ) as the separator.  
The remaining folders will hold the raw JSON and intermediate CSV files.

In [3]:
data_591 = './'
data_eia = 'data_eia/'
data_scripps = 'data_Scripps/'
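
The export cells assume these folders already exist. One possible way to guarantee that is sketched below; `ensure_folders` is a hypothetical helper, not part of the notebook, and it is demonstrated in a temporary directory so that no project folder is touched.

```python
import os
import tempfile


def ensure_folders(base, folders):
    """Create each folder under base if it is missing and return the full paths."""
    created = []
    for folder in folders:
        path = os.path.join(base, folder)
        os.makedirs(path, exist_ok=True)  # no error if the folder is already there
        created.append(path)
    return created


# Demonstrated in a throwaway directory so nothing in the project is touched.
with tempfile.TemporaryDirectory() as tmp:
    paths = ensure_folders(tmp, ["data_eia/", "data_Scripps/"])
    all_exist = all(os.path.isdir(p) for p in paths)
print(all_exist)
```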

### Here we will update some pandas options to improve the on screen readability of the dataframe listings.

In [4]:
# Full option names are used because short patterns can match more than one option.
pd.set_option('display.expand_frame_repr', True)
pd.set_option('display.colheader_justify', 'right')
pd.set_option('display.precision', 12)

# Link for reformatting pandas from exponential formatting to standard decimal
# https://stackoverflow.com/questions/21137150/format-suppress-scientific-notation-from-python-pandas-aggregation-results
# pd.set_option('display.float_format', lambda x: f'{x:.1f}')

pd.set_option('display.max_rows', 250)
pd.set_option('display.width', 200)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_colwidth', 50)
pd.set_option('display.column_space', 40)
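
If a display option is only needed for a single listing, `pd.option_context` can apply it temporarily instead of changing it for the whole session. A minimal sketch, with the precision value of 3 chosen arbitrarily for the example:

```python
import pandas as pd

default_precision = pd.get_option("display.precision")

# Inside the context the option is overridden; on exit it is restored.
with pd.option_context("display.precision", 3):
    inside_precision = pd.get_option("display.precision")

after_precision = pd.get_option("display.precision")
print(default_precision, inside_precision, after_precision)
```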

### EIA API Query Strings

Natural Gas and Crude Oil Annual Production  

[Top of Page](#top)

In [23]:
us_gas_annual = "http://api.eia.gov/series/?api_key=95c8b4b9cc5b48a195376f74b841d16c&series_id=NG.N9010US2.A"
us_oil_annual = "http://api.eia.gov/series/?api_key=95c8b4b9cc5b48a195376f74b841d16c&series_id=PET.MCRFPUS1.A"

Natural Gas and Crude Oil Sectors, CO2 Emissions

In [24]:
us_gas_emiss = "http://api.eia.gov/series/?api_key=95c8b4b9cc5b48a195376f74b841d16c&series_id=EMISS.CO2-TOTV-TT-NG-US.A"
us_oil_emiss = "http://api.eia.gov/series/?api_key=95c8b4b9cc5b48a195376f74b841d16c&series_id=EMISS.CO2-TOTV-TT-PE-US.A"

Natural Gas Venting and Flaring

In [25]:
us_flared_annual = "http://api.eia.gov/series/?api_key=95c8b4b9cc5b48a195376f74b841d16c&series_id=NG.N9040US2.A"

Industrial CO2 data Query

In [26]:
us_CO2_industrial = "http://api.eia.gov/series/?api_key=95c8b4b9cc5b48a195376f74b841d16c&series_id=EMISS.CO2-TOTV-IC-TO-US.A"
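
The query strings above all share one endpoint and differ only in `series_id`, so they could also be assembled from their parts. The sketch below is one way to do that while keeping the key out of the notebook; `build_series_url` and the `EIA_API_KEY` environment variable are hypothetical names introduced for illustration.

```python
import os
from urllib.parse import urlencode

# Endpoint taken from the query strings used in this notebook.
EIA_SERIES_ENDPOINT = "http://api.eia.gov/series/"


def build_series_url(series_id, api_key):
    """Assemble a series query URL from its two parameters."""
    params = urlencode({"api_key": api_key, "series_id": series_id})
    return EIA_SERIES_ENDPOINT + "?" + params


# EIA_API_KEY is a hypothetical environment variable name; the placeholder
# default only keeps the example runnable without a real key.
api_key = os.environ.get("EIA_API_KEY", "YOUR_API_KEY")
example_url = build_series_url("NG.N9010US2.A", api_key)
print(example_url)
```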

Enable exactly one query dict() by leaving it uncommented; the cells that follow process whichever dict is active.

In [27]:
# eia_queries = {'US_OIL_ANNUAL': us_oil_annual}
# eia_queries = {'US_GAS_ANNUAL': us_gas_annual, 'US_FLARED_ANNUAL': us_flared_annual}
eia_queries = {'US_GAS_EMISS': us_gas_emiss, 'US_OIL_EMISS': us_oil_emiss, 'US_CO2_INDUSTRIAL': us_CO2_industrial}

### API Query Execution  
The following cell iterates over the query dict(), retrieving the JSON response for each query. The JSON files are saved to the ___data_eia___ folder as source data.  
Each retrieved and saved query will be printed to STDOUT to verify the query process.  
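
The save step can be checked without network access by separating it from the request. The following is a sketch under that assumption; `save_payload` is a hypothetical helper, the payload is invented, and the file is written to a temporary directory.

```python
import json
import os
import tempfile


def save_payload(payload, folder, key):
    """Write one decoded JSON payload to <folder>/<key>.json and return the path."""
    path = os.path.join(folder, key + ".json")
    with open(path, "w") as j_out:
        json.dump(payload, j_out)
    return path


# Hypothetical payload; a real response carries far more metadata.
payload = {"series": [{"series_id": "EXAMPLE.A", "data": [["2017", 1.5]]}]}

with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_payload(payload, tmp, "EXAMPLE")
    with open(saved_path) as j_in:
        reloaded = json.load(j_in)
print(os.path.basename(saved_path), reloaded == payload)
```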

[Top of Page](#top)

In [28]:
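for key, query in eia_queries.items():
    data = requests.get(query)
    if data.status_code == 200:
        # Save the decoded response as source data, then log the saved query.
        with open('../data_eia/' + key + '.json', 'w') as j_out:
            json.dump(data.json(), j_out)
        print(key+'.json', query)
    else:
        # Report failed requests instead of skipping them silently.
        print(key, 'request failed with status', data.status_code)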

US_GAS_EMISS.json http://api.eia.gov/series/?api_key=95c8b4b9cc5b48a195376f74b841d16c&series_id=EMISS.CO2-TOTV-TT-NG-US.A
US_OIL_EMISS.json http://api.eia.gov/series/?api_key=95c8b4b9cc5b48a195376f74b841d16c&series_id=EMISS.CO2-TOTV-TT-PE-US.A
US_CO2_INDUSTRIAL.json http://api.eia.gov/series/?api_key=95c8b4b9cc5b48a195376f74b841d16c&series_id=EMISS.CO2-TOTV-IC-TO-US.A


#### An external directory listing to confirm the contents of our data folder.

In [1]:
!ls data_eia/* -laF

-rw-r--r-- 1 gmyers gmyers  637 May 20 23:23 data_eia/us_co2_industrial.csv
-rw-r--r-- 1 gmyers gmyers 1449 May 20 23:22 data_eia/US_CO2_INDUSTRIAL.json
-rw-r--r-- 1 gmyers gmyers 1840 May 20 23:18 data_eia/us_flared_annual.csv
-rw-r--r-- 1 gmyers gmyers 1960 May 20 23:17 data_eia/US_FLARED_ANNUAL.json
-rw-r--r-- 1 gmyers gmyers 2174 May 20 23:18 data_eia/us_gas_annual.csv
-rw-r--r-- 1 gmyers gmyers 2139 May 20 23:17 data_eia/US_GAS_ANNUAL.json
-rw-r--r-- 1 gmyers gmyers  643 May 20 23:23 data_eia/us_gas_emiss.csv
-rw-r--r-- 1 gmyers gmyers 1469 May 20 23:22 data_eia/US_GAS_EMISS.json
-rw-r--r-- 1 gmyers gmyers 2286 May 20 23:21 data_eia/us_oil_annual.csv
-rw-r--r-- 1 gmyers gmyers 3394 May 20 23:21 data_eia/US_OIL_ANNUAL.json
-rw-r--r-- 1 gmyers gmyers  651 May 20 23:23 data_eia/us_oil_emiss.csv
-rw-r--r-- 1 gmyers gmyers 1475 May 20 23:22 data_eia/US_OIL_EMISS.json


In [2]:
!ls data_Scripps/* -laF

-rw-r--r-- 1 gmyers gmyers 19765 May 15 12:06 data_Scripps/atmos_conc_ch4.csv
-rw-r--r-- 1 gmyers gmyers 34803 May 15 12:05 data_Scripps/atmos_conc_co2_csv
-rw-r--r-- 1 gmyers gmyers  9875 May 15 12:06 data_Scripps/atmos_conc_n2o.csv
-rw-r--r-- 1 gmyers gmyers 40512 Apr 11 15:50 data_Scripps/ch4_mm_gl.txt
-rw-r--r-- 1 gmyers gmyers 51579 Apr 11 15:55 data_Scripps/co2_mm_mlo.txt
-rw-r--r-- 1 gmyers gmyers 22576 Apr 11 16:01 data_Scripps/n2o_mm_gl.txt


### Begin processing the downloaded JSON data  
Read each dataset in the query dict() into a pandas dataframe. Each dataframe will be referenced from a key/value pair in a dict dedicated for the process.

In [29]:
df_data = {}

for key in eia_queries.keys():
    with open('../data_eia/' + key + '.json', 'r') as j_in:
        data = json.load(j_in)
        df_data[key] = pd.DataFrame(data['series']).explode('data')
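
To see what `explode('data')` does here, the sketch below applies it to a small invented payload with the same `series`/`data` layout the cell above relies on. All values are made up.

```python
import pandas as pd

# Hypothetical payload with the 'series' -> 'data' layout used above.
payload = {
    "series": [
        {
            "series_id": "EXAMPLE.A",
            "unitsshort": "mmt CO2",
            "data": [["2017", 2.5], ["2016", 2.0]],
        }
    ]
}

df_raw = pd.DataFrame(payload["series"])  # one row; 'data' holds the whole list
df_long = df_raw.explode("data")          # one row per [year, value] pair
print(df_long[["unitsshort", "data"]])
```

The metadata columns are repeated on every exploded row and the original index value is kept, which is why the inspection output shows the same series description on each line with index 0.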

#### Quick data inspection

In [30]:
for key in eia_queries.keys():
    print(key, df_data[key])

US_GAS_EMISS                    series_id                                               name                    units  f unitsshort                                        description copyright  \
0  EMISS.CO2-TOTV-TT-NG-US.A  Total carbon dioxide emissions from all sector...  million metric tons CO2  A    mmt CO2  See http://www.eia.gov/environment/emissions/s...      None   
0  EMISS.CO2-TOTV-TT-NG-US.A  Total carbon dioxide emissions from all sector...  million metric tons CO2  A    mmt CO2  See http://www.eia.gov/environment/emissions/s...      None   
0  EMISS.CO2-TOTV-TT-NG-US.A  Total carbon dioxide emissions from all sector...  million metric tons CO2  A    mmt CO2  See http://www.eia.gov/environment/emissions/s...      None   
0  EMISS.CO2-TOTV-TT-NG-US.A  Total carbon dioxide emissions from all sector...  million metric tons CO2  A    mmt CO2  See http://www.eia.gov/environment/emissions/s...      None   
0  EMISS.CO2-TOTV-TT-NG-US.A  Total carbon dioxide emissions from all se

### Product Processing Blocks
Each product to be processed is broken into separate blocks below due to slight differences  
either in product units of measure or, in the case of the Scripps data, small differences in the  
file content. When possible, product processing is combined to reduce code duplication.  
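
The steps shared by the EIA blocks could be gathered into one function that takes the value column name as a parameter. This is only a sketch of that idea, not the notebook's own code; `series_to_frame` is a hypothetical name and the input values are invented.

```python
import pandas as pd


def series_to_frame(df_exploded, value_col):
    """Turn an exploded 'data' column of [year, value] pairs into two typed columns."""
    pairs = df_exploded["data"].tolist()
    df = pd.DataFrame(pairs, columns=["Year", value_col])
    return df.astype({"Year": "int64", value_col: "float64"})


# Hypothetical exploded input, built the same way as the EIA frames.
df_exploded = pd.DataFrame([{"data": [["2017", 2.5], ["2016", 2.0]]}]).explode("data")
df_tidy = series_to_frame(df_exploded, "mmt_CO2")
print(df_tidy)
```

Unit-specific steps, such as the gas conversion to barrels of oil equivalent, would still be applied per product after this call.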

[Top of Page](#top)

## Crude Oil Block

In [None]:
# Keep the units label and the exploded [year, value] pairs.
df_oil = df_data['US_OIL_ANNUAL'][['unitsshort','data']]

# Split each [year, value] pair into its own column.
df_oil = pd.DataFrame(df_oil.data.values.tolist(), index=df_oil.index, columns=['Year', 'MBBLS'])

df_oil = df_oil.astype({'Year': 'int64', 'MBBLS': 'float64'})

# Parse the year as a datetime, then reduce it back to an integer year below.
df_oil['Year'] = pd.to_datetime(df_oil['Year'], format='%Y')

df_oil.reset_index(drop=True, inplace=True)

df_oil['Year'] = pd.DatetimeIndex(df_oil['Year']).year

df_oil.head(20)

In [None]:
df_oil.to_csv(data_eia+'us_oil_annual.csv', header=True, index=False, )
df_oil.to_csv(data_591+'us_oil_annual.csv', header=True, index=False, sep='|')

## Natural Gas and Vented/Flared Block

In [13]:
# Run with the gas and vented/flared eia_queries dict enabled.
for key in eia_queries.keys():

    name = key.lower()

    df_gas = df_data[key][['unitsshort','data']]

    # Split each [year, value] pair into its own column.
    df_gas = pd.DataFrame(df_gas.data.values.tolist(), index=df_gas.index, columns=['Year', 'MMCF'])

    df_gas = df_gas.astype({'Year': 'int64', 'MMCF': 'float64'})

    df_gas['Year'] = pd.to_datetime(df_gas['Year'], format='%Y')

    df_gas.reset_index(drop=True, inplace=True)

    df_gas['Year']=pd.DatetimeIndex(df_gas['Year']).year

    # 6,000 cubic feet of gas per barrel of oil equivalent: MMCF / 6000 * 1000 gives MBOE.
    df_gas['MBOE'] = (df_gas['MMCF'].values / 6000 * 1000)

    df_gas['MBOE'] = df_gas['MBOE'].round(1)

    # A bare df_gas.head(20) displays nothing inside a loop, so print it.
    print(name, df_gas.head(20))

    df_gas.to_csv(data_eia + name+'.csv', header=True, index=False)
    df_gas.to_csv(data_591 + name+'.csv', header=True, index=False, sep='|')
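
The MBOE column above comes from a fixed conversion of 6,000 cubic feet of gas per barrel of oil equivalent. The arithmetic can be spelled out step by step as below; `mmcf_to_mboe` is a hypothetical helper written for illustration, using the same factor and rounding as the cell above.

```python
CUBIC_FEET_PER_BOE = 6000  # conversion factor used in the cell above


def mmcf_to_mboe(mmcf):
    """Convert million cubic feet of gas to thousand barrels of oil equivalent."""
    cubic_feet = mmcf * 1_000_000          # MMCF -> cubic feet
    boe = cubic_feet / CUBIC_FEET_PER_BOE  # cubic feet -> barrels of oil equivalent
    return round(boe / 1_000, 1)           # BOE -> thousand BOE, one decimal place


print(mmcf_to_mboe(6000))
```

So 6,000 MMCF is 6 billion cubic feet, or 1 million BOE, which is 1,000 MBOE. This matches `MMCF / 6000 * 1000` in the loop.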

## Natural Gas, Oil, and Industrial Emissions Block

In [None]:
for key in eia_queries.keys():

    name = key.lower()

    df_em = df_data[key][['unitsshort','data']]

    df_em = pd.DataFrame(df_em.data.values.tolist(), index=df_em.index, columns=['Year', 'mmt CO2'])

    df_em = df_em.astype({'Year': 'int64', 'mmt CO2': 'float64'})

    df_em['Year'] = pd.to_datetime(df_em['Year'], format='%Y')

    df_em.rename(columns={'mmt CO2': 'mmt_CO2'}, inplace=True)

    df_em.reset_index(drop=True, inplace=True,)

    df_em['Year'] = pd.DatetimeIndex(df_em['Year']).year

    print(name, df_em.head(20))

    df_em.to_csv(data_eia+name+'.csv', header=True, index=False, )
    df_em.to_csv(data_591+name+'.csv', header=True, index=False, sep='|')

## Scripps Atmospheric Concentrations Block

### All Scripps (SIO) data was obtained from noaa.gov
https://www.esrl.noaa.gov/gmd/ccgg/trends/data.html  

Because the production data is based on yearly totals, the concentration value used for each year is taken from the December reading, which stands in for that year.  
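
Selecting the December reading amounts to a row filter on the month column. The sketch below shows one way this might look, using invented monthly values; the column names follow the header applied to the CO2 file in this notebook, and `df_monthly` and `df_december` are hypothetical names.

```python
import pandas as pd

# Hypothetical monthly readings; the values are made up for illustration.
df_monthly = pd.DataFrame(
    {
        "Year": [2018, 2018, 2019, 2019],
        "month": [11, 12, 11, 12],
        "average_ppm": [408.0, 409.0, 410.0, 411.0],
    }
)

# Keep only the December row of each year to get one value per year.
df_december = df_monthly.loc[df_monthly["month"] == 12, ["Year", "average_ppm"]]
df_december = df_december.reset_index(drop=True)
print(df_december)
```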

[Top of Page](#top)

**CO2 URL ->** ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt

In [33]:
header = ['Year', 'month', 'decimal_date', 'average_ppm', 'interpolated_ppm', 'trend_ppm', 'num_days']

df_co2 = pd.read_csv(data_scripps+'co2_mm_mlo.txt', delimiter=r'\s+', comment='#', names=header)

df_co2.sort_values(by=['Year', 'month'], ascending=False, inplace=True)

df_co2.head()

In [35]:
df_co2.to_csv(data_scripps + 'atmos_conc_co2.csv', header=True, index=False, sep=',')
df_co2.to_csv(data_591 + 'atmos_conc_co2.csv', header=True, index=False, sep='|')

## Scripps CH4 Concentration Block

**CH4 URL ->** ftp://aftp.cmdl.noaa.gov/products/trends/ch4/ch4_mm_gl.txt  

In [36]:
header = ['Year', 'month', 'decimal_date', 'average_ppb', 'average_unc', 'trend_ppb', 'trend_unc']

df_ch4 = pd.read_csv(data_scripps+'ch4_mm_gl.txt', delimiter=r'\s+', comment='#', names=header)

df_ch4.sort_values(by=['Year', 'month'], ascending=False, inplace=True)

df_ch4.head()

In [38]:
df_ch4.to_csv(data_scripps + 'atmos_conc_ch4.csv', header=True, index=False, sep=',')
df_ch4.to_csv(data_591 + 'atmos_conc_ch4.csv', header=True, index=False, sep='|')

## Scripps N2O Concentration Block

**N2O URL ->** ftp://aftp.cmdl.noaa.gov/products/trends/n2o/n2o_mm_gl.txt

In [39]:
header = ['Year', 'month', 'decimal_date', 'average_ppb', 'average_unc', 'trend_ppb', 'trend_unc']

df_n2o = pd.read_csv(data_scripps+'n2o_mm_gl.txt', delimiter=r'\s+', comment='#', names=header)

df_n2o.sort_values(by=['Year', 'month'], ascending=False, inplace=True)

df_n2o.head()

In [41]:
df_n2o.to_csv(data_scripps + 'atmos_conc_n2o.csv', header=True, index=False, sep=',')
df_n2o.to_csv(data_591 + 'atmos_conc_n2o.csv', header=True, index=False, sep='|')
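
A quick way to confirm that a pipe-separated export can be read back is to load it with the matching separator. The sketch below does this on an invented in-memory sample rather than on the exported files.

```python
import io

import pandas as pd

# Hypothetical pipe-separated content in the shape written by the export cells.
pipe_text = "Year|month|average_ppb\n2019|12|331.5\n2019|11|331.4\n"

df_check = pd.read_csv(io.StringIO(pipe_text), sep="|")
print(df_check.dtypes)
```

Downstream notebooks reading the analysis-ready files would need the same `sep='|'` argument, since `read_csv` defaults to a comma.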

### End of processing

[Top of Page](#top)