# SIADS 591-592 Milestone 1 Project

## Greenhouse Gas (GHG) Emissions from Upstream and Midstream US Oil and Gas Operations

By Rafee Shaik and Greg Myers  
April-May 2020
<a id='overview'></a>
## Project Overview

The Oil and Natural Gas industry consists of three sectors: the **Upstream** sector, which focuses on Exploring and Producing (E&P) hydrocarbons; the **Midstream** sector, which focuses on transportation and storage facilities; and the **Downstream** sector, which processes and refines raw materials such as crude oil into consumer products like gasoline.

Enhanced hydrocarbon extraction methods, including horizontal drilling and fracking, have boosted Crude Oil and Natural Gas production in the US since 2007. The purpose of this project is to examine Greenhouse Gas (GHG) emissions from the upstream and midstream sectors of the industry and determine whether there is a correlation between accelerated hydrocarbon (crude oil and natural gas) production in the most recent decade and GHG emission rates. If a correlation is found, additional analysis may be able to reveal the causal source. The project focuses on three GHGs: Carbon Dioxide (CO2), Methane (CH4), and Nitrous Oxide (N2O).

<a id='motivation'></a>
## Project Motivation
US hydrocarbon production has increased steadily since the shale boom that started in 2007/2008.

![Fig-0](./HistoricalProduction.png)

This increase in production led to an increase in GHG emissions from the Upstream and Midstream operations of Oil and Gas companies.

It is in every stakeholder's interest to control these emissions while optimizing production. These stakeholders include the operating company, environmental protection agencies, and local and federal governments. The goal of the study is to find any correlation between increased hydrocarbon production and industry sector GHG emissions. Does the correlation apply to the industry as a whole, or to one or more individual components? This study can also help identify opportunities to improve the pipeline infrastructure and to invest in pneumatic devices that can detect and prevent hydrocarbon emissions.

Both project team members work in the Oil and Gas industry and are interested in finding opportunities to reduce GHG emissions while optimizing production.

<a id='DataSources'></a>
## Data Sources
1. **Greenhouse Gas Emissions Data**
**Source:** Emissions data for US Oil and Gas Upstream (Exploration & Production) and Midstream (Pipelines and Storage) facilities can be accessed from the U.S. Environmental Protection Agency (EPA) FLIGHT database.  
**Location:** https://ghgdata.epa.gov/ghgp/main.do  
**Access Method:** The Facility Level Information on GreenHouse gases Tool (FLIGHT) database can be accessed through the website: https://ghgdata.epa.gov/ghgp/main.do.  
Select the appropriate filters, then download the data in Excel format.  
**Format:** Excel spreadsheets  
**Dataset Size:** Six Excel spreadsheets with a total of 41K records.  
**Period:** This data covers emissions from upstream and midstream Oil & Gas operations between 2011 and 2018.  


2. **Crude Oil and Natural Gas Production volumes**
 U.S. Field Production of Crude Oil, and U.S. Natural Gas Gross Withdrawals; Yearly  
**Source:** Energy Information Administration (EIA) datastore.  
**Location:** https://www.eia.gov/opendata/qb.php?category=371  
**Access Method:** API query  
**Format:** JSON  
**Dataset Size:** Crude records: 161; Gas records: 84; ~3 kilobytes each for crude and gas  
**Period:** Crude production records from 1920 to 2019, Natural Gas production records from 1980 to 2019.  


3. **Emission Data from other Industries**
**Source:** Emissions data from other industrial sectors can be downloaded from the EPA data store. We will use this data to compare GHG emissions from Oil & Gas systems with those from other industrial sectors.  
**Location:** https://cfpub.epa.gov/ghgdata/inventoryexplorer/#industry/allgas/source/all  
**Access Method:** Web scraping if no popups prevent it; otherwise, the manual download option.  
**Format:** Web-scraped data or CSV. If we are not able to scrape the data from the web, the CSV file will be named IndustryWiseGHGEmissions.csv.  
**Dataset Size:** 243 records  
**Period:** This data covers emissions from other industries between 2011 and 2018  

## Data Manipulation Methods

<a id='ProcessEmissionData'></a>
### Processing Emissions Data
#### Data Acquisition:
1. The data downloaded from EPA is in Excel spreadsheet format.
2. EPA emission reports are separated by industry sector (upstream and midstream) and gas type (CO2, CH4, and N2O). Altogether we have 6 Excel spreadsheets.
3. Each Excel file contains multiple worksheets, one for each reporting year from 2011 to 2018.  
[Go to the code](EmissionsDataPreparation.ipynb/#ProcessEmissionsData)

#### Parsing Excel spreadsheets:
* The pandas **read_excel** function was used to parse the Excel spreadsheets. It can read multiple worksheets from one Excel file. The option **sheet_name=None** was used to read all sheets from each file. With this option, read_excel() returns a dictionary with the worksheet name as the key and the corresponding data, as a dataframe, as the value.  
* Iterate through each dataframe in the dictionary and add three columns: 1. reporting_year, with the value from the dictionary key, 2. gas type, with the value taken from part of the source file name, and 3. industry sector, with the value taken from part of the source file name.  
* Do the same for all six Excel spreadsheets.  
* Combine all dataframes into a single dataframe using the **pandas.concat()** function.
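
The parsing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the `<sector>_<gas>.xlsx` file name pattern, the helper names, the column names, and the assumption that each worksheet is named after its year are all hypothetical.

```python
from pathlib import Path

import pandas as pd


def tag_sheets(sheets, gas_type, sector):
    """Label every worksheet dataframe and stack them into one dataframe."""
    tagged = []
    for sheet_name, frame in sheets.items():
        frame = frame.copy()
        frame["reporting_year"] = int(sheet_name)  # assumes the sheet name is the year
        frame["gas_type"] = gas_type
        frame["sector"] = sector
        tagged.append(frame)
    return pd.concat(tagged, ignore_index=True)


def load_workbook(path):
    """Read all worksheets of one workbook named like '<sector>_<gas>.xlsx' (assumed)."""
    sector, gas_type = Path(path).stem.split("_")
    sheets = pd.read_excel(path, sheet_name=None)  # dict: sheet name -> dataframe
    return tag_sheets(sheets, gas_type, sector)


# In-memory stand-in for what read_excel(sheet_name=None) returns (made-up values).
demo_sheets = {
    "2011": pd.DataFrame({"FACILITY": ["A", "B"], "EMISSIONS": [10.0, 20.0]}),
    "2012": pd.DataFrame({"FACILITY": ["A"], "EMISSIONS": [12.0]}),
}
demo = tag_sheets(demo_sheets, gas_type="CO2", sector="upstream")
print(demo)
```

Calling `load_workbook` once per spreadsheet and passing the six results to `pd.concat` would give the single combined dataframe.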

#### Processing the data:
1. Each row represents a facility operated under single-company ownership or as a joint venture between multiple companies. We therefore separate the joint ventures, create a row for each company, and calculate each company's portion of the emissions from the joint venture percentages.  
For example:  
```python
    "SHELL OIL CO (51.8%); EXXONMOBIL CORP (48.2%)"
```
2. Joint venture partners are separated by semicolons (;). The `.str.split(';')` method separates them and gives us a list of companies with their share of the GHG emissions.  
3. Turn this single row into one row per partner company by applying the **explode** method to the joint venture company list.  
4. We use a regular expression to parse the company name and partnership percentage into separate columns.  
5. Here is the regular expression used:  
```python
	regex=r'(?P<PARENT_COMPANY>[-\w\s\d,&./()#]+)([\(])(?P<CONTRIBUTION>[\(\d.]+)([%\)]*)'
```
6. Convert partnership percentage and emission quantity from text type to numeric type.  
7. Replace non-numeric values with zero ‘0’
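
The steps above can be sketched end to end as follows. The regular expression is the one quoted in step 5; the column names, the sample rows, and the emission values are hypothetical and only serve to illustrate the flow.

```python
import pandas as pd

# The regular expression quoted in step 5.
OWNER_REGEX = r'(?P<PARENT_COMPANY>[-\w\s\d,&./()#]+)([\(])(?P<CONTRIBUTION>[\(\d.]+)([%\)]*)'

# Hypothetical facility rows; column names and emission values are illustrative.
facilities = pd.DataFrame({
    "FACILITY": ["Platform 1", "Plant 2"],
    "PARENT_COMPANIES": [
        "SHELL OIL CO (51.8%); EXXONMOBIL CORP (48.2%)",
        "CHEVRON CORP (100%)",
    ],
    "EMISSIONS": ["1000", "n/a"],
})

# One list entry per joint venture partner, then one row per partner.
owners = facilities.assign(OWNER=facilities["PARENT_COMPANIES"].str.split(";"))
owners = owners.explode("OWNER").reset_index(drop=True)

# Pull the company name and ownership percentage into separate columns.
parts = owners["OWNER"].str.extract(OWNER_REGEX)
owners["PARENT_COMPANY"] = parts["PARENT_COMPANY"].str.strip()
owners["CONTRIBUTION"] = pd.to_numeric(parts["CONTRIBUTION"], errors="coerce").fillna(0)

# Convert emissions from text to numbers, replacing non-numeric values with 0.
owners["EMISSIONS"] = pd.to_numeric(owners["EMISSIONS"], errors="coerce").fillna(0)

# Each partner's share of the facility emissions.
owners["COMPANY_EMISSIONS"] = owners["EMISSIONS"] * owners["CONTRIBUTION"] / 100
print(owners[["FACILITY", "PARENT_COMPANY", "CONTRIBUTION", "COMPANY_EMISSIONS"]])
```

Resetting the index after `explode` keeps row labels unique, which makes the later column assignments align row by row.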

#### Aggregation:
We applied different levels of aggregation in our analysis. The data has attributes representing Company Name, Reporting Year, Gas type, and Industry sector.
1. Gas type analysis: Data is separated by gas type (CO2, CH4, and N2O), and all emission quantities are expressed in CO2 equivalents. We sum them by company and gas type within the reporting year to compare the emissions of different gas types.
2. Company level emissions: Each company can operate multiple facilities across the US. We sum the emissions from the facilities operated by the same company within a reporting year. This data is used to compare the emissions from different companies over time.
3. Sector level emissions: Data is aggregated by industry sector within the reporting year.
4. Yearly GHG emissions: We sum the emissions from all companies, sectors, and gas types within the reporting year.
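
The four aggregation levels above map onto `groupby` calls with different key sets. The sketch below uses hypothetical column names and made-up emission values.

```python
import pandas as pd

# Hypothetical processed rows (made-up values, in CO2 equivalents).
emissions = pd.DataFrame({
    "reporting_year": [2011, 2011, 2011, 2012],
    "company": ["A", "A", "B", "A"],
    "gas_type": ["CO2", "CH4", "CO2", "CO2"],
    "sector": ["upstream", "upstream", "midstream", "upstream"],
    "emissions": [100.0, 40.0, 60.0, 110.0],
})


def total_by(frame, keys):
    """Sum emissions within each combination of the given key columns."""
    return frame.groupby(keys, as_index=False)["emissions"].sum()


by_gas = total_by(emissions, ["reporting_year", "company", "gas_type"])
by_company = total_by(emissions, ["reporting_year", "company"])
by_sector = total_by(emissions, ["reporting_year", "sector"])
by_year = total_by(emissions, ["reporting_year"])
print(by_year)
```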

#### Joining Emissions data with other datasets:
Reporting_year is used as the joining key when joining the Emissions dataset with other datasets in the analysis, such as Yearly Crude Oil and Natural Gas Production and Emissions from other Industries.  
[Go to the code](EmissionsDataPreparation.ipynb/#Join_Prod_n_Emission)

#### Challenges:
Over time, companies report their emissions under different names. For example, ‘Conoco Phillips’, ‘ConocoPhillips’, and ‘ConocoPhillips Company’ all represent the one company ‘ConocoPhillips’. We used the third-party library ‘cleanco’ to clean up the company names by removing company type suffixes such as ‘LLC’ and ‘Co’. We then performed a company name lookup to standardize the names. The lookup table is stored in a [CSV file](./CompanyName_Lookup.csv).
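
The idea can be illustrated with a small standard-library stand-in. This sketch does not use cleanco; it replaces it with a plain regular expression covering a handful of suffixes, and the lookup entries are hypothetical examples rather than the contents of the project's CSV file.

```python
import re

# Suffixes to drop: a small illustrative subset, not cleanco's full list.
SUFFIXES = re.compile(r"\b(llc|inc|corp|co|company|lp)\.?$", re.IGNORECASE)

# Hypothetical lookup entries; the project keeps its real table in a CSV file.
LOOKUP = {
    "conoco phillips": "ConocoPhillips",
    "conocophillips": "ConocoPhillips",
}


def standardize(name):
    """Strip a trailing company type suffix, then map known variants to one name."""
    cleaned = SUFFIXES.sub("", name.strip()).strip(" ,.")
    return LOOKUP.get(cleaned.lower(), cleaned)


for raw in ["Conoco Phillips", "ConocoPhillips", "ConocoPhillips Company"]:
    print(raw, "->", standardize(raw))
```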

#### Saving aggregated data for Analysis and visualizations:
The aggregated emissions data is saved in the CSV file ['Emissions_aggregatedData.csv'](Emissions_aggregatedData.csv).

<a id='ProcessProductionData'></a>
### Processing Crude and Natural Gas Production volumes datasets
1. Save API query results to a JSON file as an immutable source data reference.
2. Import JSON data (crude & gas) into Pandas data frames.
3. Perform Explode operations to separate date and production data.
4. Transform Date column into a DateTime data type.
5. Create and populate a product type column and drop unused columns.
6. Append crude and gas data frames (long-format).
7. Save the data frame to a CSV file with the name ['Processed_AnnualProductionData.csv'](Processed_AnnualProductionData.csv) for analysis and visualizations.  

[Go to the code](EmissionsDataPreparation.ipynb/#DataPrep_ProdDataPrep)
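
A rough sketch of steps 2 to 6 is shown below. The payload layout (a `series` list whose entries hold a `data` list of `[date, value]` pairs) is an assumption about the API response, and the series ids and values are placeholders, not real production figures.

```python
import pandas as pd


def series_to_long(payload, product):
    """Turn one assumed API payload into long-format rows for a product."""
    series = pd.DataFrame(payload["series"])
    # One row per [date, value] pair.
    rows = series[["data"]].explode("data").reset_index(drop=True)
    frame = pd.DataFrame(rows["data"].tolist(), columns=["date", "production"])
    frame["date"] = pd.to_datetime(frame["date"], format="%Y")
    frame["product"] = product
    return frame


# Placeholder payloads with made-up values.
crude_payload = {"series": [{"series_id": "CRUDE_PLACEHOLDER",
                             "data": [["2018", 10.0], ["2017", 9.0]]}]}
gas_payload = {"series": [{"series_id": "GAS_PLACEHOLDER",
                           "data": [["2018", 30.0], ["2017", 28.0]]}]}

production = pd.concat(
    [series_to_long(crude_payload, "crude"), series_to_long(gas_payload, "gas")],
    ignore_index=True,
)
print(production)
```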

<a id='ProcessOtherIndustryData'></a>
### Processing GHG Emissions from Other industries
1. Emissions data from Other industries were downloaded from EPA.gov website. URL: https://cfpub.epa.gov/ghgdata/inventoryexplorer/#industry/allgas/source/all
2. Download the data in CSV format and save it to 'IndustryWiseGHGEmissions.csv' file.
3. The data is in wide format: one row for each industry, with emission values in separate columns for each year from 2011 to 2018.
4. Use the pandas **melt** function to convert the wide format to long format.
5. Filter out the yearly total values from the dataset.
6. Fill non-numeric values with zero '0'.
7. Save the processed dataframe to a CSV file named 'Emissions_OtherIndustries.csv' for visualization and analysis.  
[Go to the code](EmissionsDataPreparation.ipynb/#DataPrep_OtherIndustries)
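
The wide-to-long conversion can be sketched as follows. The column names, the `Total` row label, and all values are hypothetical; the sketch also assumes that "filtering out the yearly totals" means dropping a total row.

```python
import pandas as pd

# Hypothetical wide-format data: one row per industry, one column per year.
wide = pd.DataFrame({
    "Industry": ["Industry A", "Industry B", "Total"],
    "2011": ["10.5", "n/a", "10.5"],
    "2012": ["11.0", "3.2", "14.2"],
})

# One row per (industry, year) pair.
long_format = wide.melt(id_vars="Industry", var_name="reporting_year",
                        value_name="emissions")

# Drop the total rows, then convert text to numbers with 0 for non-numeric values.
long_format = long_format[long_format["Industry"] != "Total"].copy()
long_format["reporting_year"] = long_format["reporting_year"].astype(int)
long_format["emissions"] = pd.to_numeric(long_format["emissions"],
                                         errors="coerce").fillna(0)
print(long_format)
```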

### Data Integration
Combine the Emissions dataset and the Production volume dataset by reporting year. With this combined dataset we can compare greenhouse gas emission volumes with Crude Oil and Natural Gas production volumes.  
The joined dataset is saved in the CSV file ['ProductionVsEmissionSplit.csv'](ProductionVsEmissionSplit.csv).
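
A minimal sketch of such a join, assuming both datasets carry a `reporting_year` column (the column names and values are made up). A left join from the production side is one possible choice; it keeps years such as 2019 for which production exists but emissions do not.

```python
import pandas as pd

# Made-up yearly totals.
emissions = pd.DataFrame({"reporting_year": [2011, 2012],
                          "emissions": [200.0, 210.0]})
production = pd.DataFrame({"reporting_year": [2011, 2012, 2019],
                           "production": [5.0, 6.0, 9.0]})

# Years without emissions data keep a missing value instead of being dropped.
combined = production.merge(emissions, on="reporting_year", how="left")
print(combined)
```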

<a id='analysis'></a>
## Analysis and Visualization

[Go to Analysis and Visualization Notebook](./EmissionsProject-Visualizations-2.ipynb)

### Correlation between Hydrocarbons production and GHG emissions
The focus of the analysis is studying the impact of increased Hydrocarbons production on Greenhouse Gas (GHG) emissions from the Oil & Gas industry.  

To perform this analysis we joined the annual Oil and Natural Gas production data from EIA (Energy Information Administration, EIA.gov) with the annual GHG emissions data from EPA (Environmental Protection Agency, EPA.gov). These datasets were joined using the Reporting Year as the key. The processed and merged dataset was saved in the ['ProductionVsEmissionSplit.csv'](ProductionVsEmissionSplit.csv) file. We used this merged dataset for our analysis and visualizations.  

We plotted annual production and emission volumes on a time-series line chart to observe the trend in GHG emissions and hydrocarbon production between 2011 and 2018.  
![Combined Hydrocarbon Production vs Combined GHG Emission volumes](ProductionVsEmission.JPG)  
We observed that both GHG emissions and hydrocarbon production volumes followed an increasing trend between 2011 and 2018.  

For a better understanding of GHG emissions from different Oil & Gas sectors, we compared GHG emissions from the individual sectors (Upstream and Midstream) against the different hydrocarbon products (Crude Oil and Natural Gas).
![Product wise and Sector wise Production vs Emission volumes](ProdVsEmissionSecWise.JPG)  

To understand how strong the correlation is between sector-wise GHG emissions and individual product volumes, we calculated Pearson-r between the GHG emissions from each sector and the production volume of each hydrocarbon product.  

We observed a moderate-to-strong correlation (Pearson-r between 0.56 and 0.87) between every hydrocarbon production series and every sector emission series. The highest correlation was between ‘Natural Gas Production’ and ‘Combined GHG Emission’.  

Product: Combined Production , Sector: Combined Emission , pearson-r: 0.8190218781217727  
Product: Combined Production , Sector: Midstream Emission , pearson-r: 0.7307744998650313  
Product: Combined Production , Sector: Upstream Emission , pearson-r: 0.5836843921994607  
Product: Crude Production , Sector: Combined Emission , pearson-r: 0.7620534805062806  
Product: Crude Production , Sector: Midstream Emission , pearson-r: 0.6723432310890928  
Product: Crude Production , Sector: Upstream Emission , pearson-r: 0.5850816693652066  
**Product: Natural Gas Production , Sector: Combined Emission , pearson-r: 0.8700047336668489**  
Product: Natural Gas Production , Sector: Midstream Emission , pearson-r: 0.7874070889774125  
Product: Natural Gas Production , Sector: Upstream Emission , pearson-r: 0.5584520095475436  
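
A listing of this kind can be produced with a loop over product and sector columns. The sketch below uses pandas' `Series.corr`, which computes Pearson-r by default, in place of whichever routine the notebook calls; the column names follow the labels above, but every value is made up for illustration.

```python
import pandas as pd

# Made-up yearly values; not the project's data.
data = pd.DataFrame({
    "Natural Gas Production": [1.0, 2.0, 3.0, 4.0],
    "Crude Production": [2.0, 1.0, 4.0, 3.0],
    "Combined Emission": [10.0, 20.0, 30.0, 40.0],
})

products = ["Natural Gas Production", "Crude Production"]
sectors = ["Combined Emission"]

results = {}
for product in products:
    for sector in sectors:
        # Series.corr uses the Pearson method unless told otherwise.
        results[(product, sector)] = data[product].corr(data[sector])
        print("Product:", product, ", Sector:", sector,
              ", pearson-r:", results[(product, sector)])
```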

### Linear regression -  Predicting future emissions:
**Feature selection:**  
From the above Pearson-r calculations we observed a strong correlation between annual Natural Gas Production volume and annual GHG emission volume. We use ‘Natural Gas Production’ as our independent variable (single feature) to predict the GHG emission volume for the year 2019.  

Since we have only 8 data points, we use all of them to train our linear regression model.  

Refer to the linear regression code module [here](./EmissionsProject-Visualizations-2.ipynb/#linearRegress).  

Our linear regression model predicted 337.62 million metric tons of CO2 equivalent GHG emissions for the year 2019.  

Predicted 2019 GHG emissions are plotted on timeseries chart below  
![2019 Predicted Emission](./PredictedEmission.JPG)
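
As a hedged illustration of a single-feature fit, the sketch below uses `numpy.polyfit` with degree 1 in place of whichever estimator the notebook uses. All eight data points and the 2019 production value are made up and exactly linear, so the numbers have no relation to the prediction reported above.

```python
import numpy as np

# Made-up training data: 8 yearly points, exactly linear for clarity.
gas_production = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
ghg_emissions = 2.0 * gas_production + 1.0

# Ordinary least squares fit of a straight line: emissions = slope * production + intercept.
slope, intercept = np.polyfit(gas_production, ghg_emissions, deg=1)

# Predict emissions for a hypothetical 2019 production value.
production_2019 = 9.0
predicted_2019 = slope * production_2019 + intercept
print(round(predicted_2019, 2))
```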

### A deep dive into sector-wise GHG emissions and Revising regression model:
For a deeper analysis we examined the emissions from individual sectors and gas types. We observed a big jump in Midstream GHG emissions in 2016. 
![Subplots showing annual Emissions by sector, by Gas Type and Number of Operators from each sector](subplots.JPG)  
[Go to chart code](EmissionsProject-Visualizations-2.ipynb/#Fig-6)  
  
This jump in GHG emissions was unexpected, because the energy industry was in a downturn between 2015 and 2016. After talking to a few business analysts and reviewing the changes in EPA's GHG emissions reporting rules between 2015 and 2016, we found that EPA introduced a new rule in October 2015 (Docket ID No. EPA–HQ–OAR–2014–0831, https://www.govinfo.gov/content/pkg/FR-2015-10-22/pdf/2015-25840.pdf) that requires Oil and Gas companies to report GHG emissions from their ‘Gathering and Boosting’ operations. GHG emissions from Gathering and Boosting were not reported in the past. The new rule requires operators to report these emissions under the midstream sector.  

We also observed that GHG emissions from ‘Gathering and Boosting’ operations are high and make up a significant proportion of overall midstream emissions: 80.6, 74.12, and 81.28 million metric tons of CO2e in 2016, 2017, and 2018 respectively. That is about 40 to 50% of total annual midstream emissions.

#### Revision of the regression model:

Since ‘Gathering and Boosting’ added a significant amount of GHG emissions to the midstream total, we revisited our earlier analysis and adjusted the midstream emissions reported before the 2016 EPA rule change took effect. We added a constant 74.12 million metric tons of CO2e (the minimum amount of GHG emissions from Gathering and Boosting since 2016) to the midstream emissions for each year between 2011 and 2015.
We recalculated the correlation coefficient (Pearson-r) and still observed a positive correlation between Natural Gas Production and combined GHG emissions; however, the Pearson-r value dropped from 0.87 to 0.66.
We retrained our linear regression model on the adjusted emissions data and recalculated the estimated 2019 combined GHG emissions. The estimated value for 2019 is 290 million metric tons of CO2 equivalent, which is 47.62 million metric tons less than our earlier estimate made before adjusting the data for the new EPA rule.  
[Follow the recalculation and visualization here](EmissionsProject-Visualizations-2.ipynb/#adjustEmission)
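
The adjustment itself is a constant offset applied to the years before the rule took effect. In the sketch below, the three Gathering and Boosting values are the ones quoted above; the midstream emission values and column names are made up for illustration.

```python
import pandas as pd

# Gathering and Boosting emissions quoted above (million metric tons of CO2e).
gathering_boosting = {2016: 80.6, 2017: 74.12, 2018: 81.28}
offset = min(gathering_boosting.values())  # smallest reported value since 2016

# Made-up midstream totals for 2011 to 2018.
midstream = pd.DataFrame({
    "reporting_year": list(range(2011, 2019)),
    "midstream_emissions": [100.0] * 5 + [180.0, 175.0, 185.0],
})

# Add the constant only to the years before the new reporting rule applied.
pre_rule = midstream["reporting_year"] <= 2015
midstream["adjusted_emissions"] = midstream["midstream_emissions"]
midstream.loc[pre_rule, "adjusted_emissions"] += offset
print(midstream)
```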
 
### Conclusion
**The Pearson-r value of 0.87 suggests a strong correlation between Natural Gas Production and overall GHG emissions; after adjusting for the EPA rule change, the value drops to 0.66. With only 8 years of data, we are cautious about concluding that there is a causal relationship between Natural Gas Production and GHG emissions.**  


<a id='visSummary'></a>
## A Summary of Analysis and Visualizations:

During the analysis we generated several visualizations. Most of them are time-series line and scatter plots. We also produced a histogram and a summary table.  
Here is a list of visualizations and observations:  
* [Fig-1](EmissionsProject-Visualizations-2.ipynb/#Fig-1): A visualization of historical Crude Oil and Natural Gas Production in the USA.
> Observations:  
There has been an increase in US hydrocarbons production since the shale boom that started in 2007/2008

* [Fig-2](EmissionsProject-Visualizations-2.ipynb/#Fig-2): Trend line of total GHG emissions from US Oil and Gas Companies between 2011 and 2018.
>Observations:  
This time-series chart shows the upwards trend in GHG Emissions from Oil & Gas upstream and midstream systems

* [Fig-3](EmissionsProject-Visualizations-2.ipynb/#Fig-3): A summary table showing annual GHG emissions from individual US Oil and Gas producers between 2011 and 2018, sorted by GHG emissions in descending order to show the major emitters at the top.

* [Fig-4](EmissionsProject-Visualizations-2.ipynb/#Fig-4): An Interactive chart - Emission trends between 2011 and 2018 from the top US emitters
>**About Fig-4:**  
This interactive chart can be used to visualize how GHG emissions from the top US emitters changed in the last 8 years.  
Companies are sorted in GHG emission volume descending order.  
Use the slider bar to select the number of companies to compare.  

>**Observations:**  
'ConocoPhillips' emissions trended down between 2011 and 2018, while 'Energy Transfer Partners' emissions trended up.
There may be a reason why ConocoPhillips emissions went down, such as asset divestiture or greater efficiency in reducing GHG emissions while increasing production.  
Similarly, there may be an explanation for why 'Energy Transfer Partners' climbed to the top position over the last eight years. They might have increased their production or increased natural gas flaring.  
This could be the subject of another analysis.

* [Fig-5](EmissionsProject-Visualizations-2.ipynb/#Fig-5): An Interactive chart to compare emissions between different companies and sectors
>Observations:  
We used this interactive chart to study the Upstream and Midstream CO2 emissions from the top 5 emitters.  
The chart shows that Midstream emissions are higher than Upstream emissions for the same companies.  
Upstream emissions were down between 2015 and 2017 due to the Crude oil market downturn but started picking up since 2016.  
Oil market downturn doesn’t seem to impact emissions from midstream operations.  

* [Fig-6](EmissionsProject-Visualizations-2.ipynb/#Fig-6): Four subplots showing emissions by sector, emissions by gas type, the number of operators (companies) in each sector, and total emissions
>Observations:  
>1. GHG emissions are on the rise from Oil & Gas sectors from 2011 to 2018.  
>2. There seems to be a big jump in GHG emissions from midstream operations in 2016.  
>3. N2O emissions are negligible from Oil & Gas industry.  
>4. The 2015 energy industry downturn seems to have affected the number of companies operating in the midstream sector, but not the number of upstream operators. This observation requires a study of its own.
  

* [Fig-7](EmissionsProject-Visualizations-2.ipynb/#Fig-7): A histogram to identify the common emission quantity range from US Oil & Gas companies
>Observations:  
Most of the Oil and Gas companies emit less than 500K metric tons of GHGs.  
Very few companies emit over 5 million tons of GHGs.  

* [Fig-8](EmissionsProject-Visualizations-2.ipynb/#Fig-8): An interactive chart to visualize the correlation between the production volumes of different hydrocarbon products and GHG emissions from the Upstream and Midstream sectors
>Observations:  
A Pearson-r calculation revealed a strong correlation between annual 'Natural Gas Production' volume and combined annual GHG emission quantity.

* [Fig-9](EmissionsProject-Visualizations-2.ipynb/#Fig-9): A line chart with predicted 2019 GHG Emission
>We predicted 337.62 million metric tons of CO2 equivalent GHG emissions from the Oil and Gas Upstream and Midstream sectors in 2019. EPA is scheduled to release the 2019 GHG emission report in October 2020.

* [Fig-10 and Fig-11](EmissionsProject-Visualizations-2.ipynb/#Fig-10): Adjusting Midstream emissions for additional emissions from 'Gathering and Boosting' operations between 2011 and 2015, and recalculating the predicted 2019 GHG emissions value.

<a id='codepipeline'></a>
## Project code pipeline
The project code and visualizations are split into two Jupyter notebooks.
1. Notebook **‘EmissionsDataPreparation.ipynb’** concentrates on data acquisition and processing. Its output is saved in several CSV (Comma Separated Values) files. The output files include: 
>a. **Emissions_aggregatedData.csv** - Contains reporting year, company, gas, sector, GHG emission volume, and rank by 2018 emissions  
>b. **Emissions_OtherIndustries.csv** - Emissions from other industries, with columns for reporting year, industrial sector, and emission volume  
>c. **ProductionVsEmissionSplit.csv** - Production and emission volumes segregated by emission sector and production product type, stored as key-value pairs from 2011 to 2018. Please note that production volume data is available for 2019, but emission data is not.  
2. Notebook **‘EmissionsProject-Visualizations-2.ipynb’** takes the CSV files prepared by the EmissionsDataPreparation.ipynb notebook and generates the required visualizations and analysis.


#### Visualization Technique:
We used graph_objs library from plotly for all visualizations in this analysis.
* Most of our plots are scatter plots with line marks over time-series data.
* We used ipywidgets to add interactivity with the charts.

**Other libraries used:**  
Pandas, numpy, scipy, matplotlib, sklearn


<a id='sow'></a>
## Statement of Work
The Greenhouse Gas Emissions investigation team is composed of **Rafee Shaik** and **Gregory Myers**, **UMSI MADS** students. Both members have professional backgrounds in domestic Crude Oil and Natural Gas production.  

This project doesn’t contain any proprietary datasets and doesn’t require an NDA (Non-Disclosure Agreement) between the University of Michigan and any organization.  
Rafee collaborated with his organization's business unit to review the emissions project idea using the publicly available datasets.  

Some of the findings and tools created as part of the project may benefit the organization.  


### Rafee Shaik

Rafee Shaik has contributed to acquiring the emissions data from EPA, parsing the Emissions data excel spreadsheets, and processing data for further analysis. Rafee also collected and processed emissions data from other industries. He combined emissions and production data and prepared various interactive visualizations that compare emissions from upstream and midstream sectors; and between companies and different GHG gas types.  
Rafee calculated the correlation coefficient between different Hydrocarbon production volumes and sector-wise GHG emission quantities. He prepared a linear regression model to predict future GHG emission quantity.  
He collaborated with Gregory in preparing the final project report.


### Gregory Myers
Gregory Myers contributed to the collection and processing of natural gas and crude oil production data, atmospheric GHG data, and energy and industrial sector emissions data. In addition to data collection, Gregory contributed the following to the project: historical production and GHG concentration visualizations, production versus sector emissions visualizations, and industry top GHG emitter visualizations. Finally, he prepared the skeletal layout of the JupyterLab notebook presentation, which will also receive significant contributions from Rafee Shaik.  

## Lessons learned (What didn't work, and why? )
1. After using both plotly.graph_objs and plotly.express, we feel that plotly.express lets us produce visualizations much faster.
2. Better code management: even though we used GitHub for team coding, we both worked on the master branch, which caused code to be overwritten towards the end of the project. We should have created individual branches to avoid overwriting each other's code.
3. Bokeh seems to be a popular visualization library among data scientists. We will try to use Bokeh in future projects.
