# SIADS 591-592 Milestone 1 Project

## Greenhouse Gas (GHG) Emissions from Upstream and Midstream US Oil and Gas Operations

### Data Acquisition and Processing

* Emission data comes in six Excel files. Each file covers one sector (Upstream or Midstream) and one greenhouse gas (CO2, CH4, or N2O).
* Each file has eight sheets, one per reporting year from 2011 to 2018.
* Combine the sheets from all the Excel files into a single data frame. Add a column to each file-level data frame to indicate the emission type, i.e., CO2, CH4, or N2O.
* Also add a column to indicate which sector the Excel data represents, Upstream or Midstream.
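
The steps above can be sketched without the original spreadsheets. This is a minimal illustration, not the notebook's actual code: `fake_workbook` is a hypothetical stand-in for `pd.read_excel(path, sheet_name=None)`, which returns a dict mapping sheet names to data frames, and the column names and numbers are made up.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_excel(path, sheet_name=None):
# a dict of {sheet name: DataFrame}, one sheet per reporting year.
def fake_workbook(values):
    return {
        year: pd.DataFrame({"FACILITY": ["A", "B"], "GHG QUANTITY": [v, v * 2]})
        for year, v in values.items()
    }

# Keys mimic the <Sector>-<Gas> file naming used in this project (values are made up)
workbooks = {
    "Upstream-CH4": fake_workbook({"2011": 10.0, "2012": 12.0}),
    "Midstream-CO2": fake_workbook({"2011": 5.0, "2012": 6.0}),
}

frames = []
for name, sheets in workbooks.items():
    sector, gas = name.split("-")
    for year, sheet in sheets.items():
        # assign() returns a copy, so the source sheet is left untouched
        frames.append(sheet.assign(GAS=gas, SECTOR=sector, SHEET=year))

# Union all sheets into one data frame with a fresh 0..n-1 index
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Tagging each sheet before concatenation keeps the sector and gas recoverable after the union, which is what the later group-by steps rely on.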


<b>The output from the Jupyter notebook:</b>
* -Add the original dataset to each output file
1. <b>Emissions_aggregatedData.csv</b> - Contains Reporting year, Company, Gas, Sector, GHG Emission volume, and the 2018 emission ranks (upstream, midstream, and overall).
2. <b>Emissions_AggWithSecRank.csv</b> - Contains Company, Sector, total Emission volume, and rank by total emissions from 2011 to 2018.
3. <b>ND_AggregatedEmissions.csv</b> - Emission volumes from the North Dakota region only. Columns: Reporting year, Company, Gas, Sector, and GHG Emission volume.
4. <b>Emissions_OtherIndustries.csv</b> - Emissions from other industries. Columns: Reporting Year, Industrial Sector, and Emission volume.
5. <b>Processed_AnnualProductionData.csv</b> - Oil and gas production volumes. Columns: Product, Reporting Date, and Production Volume.
6. <b>Emission_and_Production.csv</b> - Combines production volumes and emission volumes into a single dataframe, without a split by Product or Emission Sector. Data is stored as key-value pairs, from 2011 onwards.
7. <b>ProductionVsEmissionSplit.csv</b> - Production and emission numbers split by Emission Sector and Product type. Data is stored as key-value pairs, from 2011 onwards.

In [10]:
import pandas as pd
#cleanco library will be used to standardize the company names
from cleanco import cleanco
import re

In [2]:
#Define reusable functions
def standardCompName(inName):
    return cleanco(inName).clean_name().upper()

In [3]:
#Read emission excel spreadsheets
FileList=['Midstream-CH4.xls','Midstream-CO2.xls','Midstream-N2O.xls','Upstream-CH4.xls','Upstream-CO2.xls','Upstream-N2O.xls']
folder_path = "./"

##### Read the six excel spreadsheets, parse them and save the processed data into a CSV file
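
The parsing step in the cell below splits jointly owned facilities into one row per parent company and allocates emissions by ownership share. A minimal sketch of that idea on made-up data follows. The company names, the `GHG` column, and the simplified regular expression are assumptions for illustration; the notebook's own pattern is more permissive.

```python
import pandas as pd

# Made-up facilities; owners are listed as "NAME (share%)" separated by ';'
facilities = pd.DataFrame({
    "FACILITY": ["Plant 1", "Plant 2"],
    "PARENT COMPANIES": ["ACME OIL CO (60%); BETA GAS LLC (40%)", "ACME OIL CO (100%)"],
    "GHG": [1000.0, 500.0],
})

# One row per (facility, owner)
owners = facilities.assign(
    **{"PARENT COMPANIES": facilities["PARENT COMPANIES"].str.split(";")}
).explode("PARENT COMPANIES").reset_index(drop=True)

# Simplified pattern (an assumption): company name followed by "(NN%)"
pattern = r"^\s*(?P<PARENT_COMPANY>.+?)\s*\((?P<CONTRIBUTION_PERCENT>[\d.]+)%\)\s*$"
parts = owners["PARENT COMPANIES"].str.extract(pattern)

owners["PARENT_COMPANY"] = parts["PARENT_COMPANY"].str.strip()
owners["CONTRIBUTION_PERCENT"] = pd.to_numeric(parts["CONTRIBUTION_PERCENT"])

# Allocate each facility's emissions by ownership share
owners["GHG_CONTRIBUTION"] = owners["GHG"] * owners["CONTRIBUTION_PERCENT"] * 0.01
print(owners[["FACILITY", "PARENT_COMPANY", "CONTRIBUTION_PERCENT", "GHG_CONTRIBUTION"]])
```

A quick sanity check on real data is that the allocated contributions per facility add back up to the facility total whenever the listed shares sum to 100%.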

In [32]:
df_list=[]
for file in FileList:
    file_path=folder_path+file

    # File names follow the pattern <Sector>-<Gas>.xls
    sector, gas = file.split('.')[0].split('-')

    # sheet_name=None returns a dict of {sheet name: dataframe}, one sheet per reporting year;
    # the first six rows of each sheet are skipped
    sheets = pd.read_excel(file_path, sheet_name=None,skiprows=[0,1,2,3,4,5])
    for year in sheets:
        df_sheet = sheets[year]
        df_sheet['GAS']=gas
        df_sheet['SECTOR']=sector
        df_list.append(df_sheet)

# combine (union) all these dataframes into a single dataframe       
df_fullset = pd.concat(df_list)
# Some of the facilities are operated under partnership, we need to allocate emission quantities to individual parent companies
df_fullset['PARENT COMPANIES']=df_fullset['PARENT COMPANIES'].str.split(';')
df_fullset=df_fullset.explode('PARENT COMPANIES')

#Separate the company name and contribution
regex=r'(?P<PARENT_COMPANY>[-\w\s\d,&./()#]+)([\(])(?P<CONTRIBUTION>[\(\d.]+)([%\)]*)'
#Compile the regular expression for better performance
compName_RE = re.compile(regex)

df_fullset[['PARENT_COMPANY','CONTRIBUTION_PERCENT']]=df_fullset['PARENT COMPANIES'].str\
.extract(compName_RE)[['PARENT_COMPANY','CONTRIBUTION']]

#Turn those numbers into float values
df_fullset[['GHG QUANTITY (METRIC TONS CO2e)','CONTRIBUTION_PERCENT']]=df_fullset[['GHG QUANTITY (METRIC TONS CO2e)','CONTRIBUTION_PERCENT']].apply(pd.to_numeric)

df_fullset['GHG_CONTRIBUTION']=df_fullset['GHG QUANTITY (METRIC TONS CO2e)']*df_fullset['CONTRIBUTION_PERCENT']*0.01
df_fullset['PARENT_COMPANY']=df_fullset['PARENT_COMPANY'].str.strip()

#Aggregate the data by Parent company, sector, reporting year and Emission type
df_Agg=df_fullset.groupby(['REPORTING YEAR','PARENT_COMPANY','GAS','SECTOR'])['GHG_CONTRIBUTION']\
.agg('sum').reset_index().sort_values(['PARENT_COMPANY','REPORTING YEAR'])

df_Agg['PARENT_COMPANY']=df_Agg['PARENT_COMPANY'].str.strip()

#2018 upstream rank (1 = largest emitter); only GHG_CONTRIBUTION is summed so text columns are left out
df_Agg_upstream=df_Agg[(df_Agg['REPORTING YEAR']==2018) & (df_Agg['SECTOR']=='Upstream')].groupby(['PARENT_COMPANY'])[['GHG_CONTRIBUTION']].sum().sort_values('GHG_CONTRIBUTION',ascending=False).reset_index().reset_index().rename(columns={'index':'2018_UPSTREAM_RANK'})
df_Agg_upstream['2018_UPSTREAM_RANK']=df_Agg_upstream['2018_UPSTREAM_RANK']+1
df_Agg_upstream=df_Agg_upstream.drop(['GHG_CONTRIBUTION'],axis=1)

#2018 midstream rank
df_Agg_midstream=df_Agg[(df_Agg['REPORTING YEAR']==2018) & (df_Agg['SECTOR']=='Midstream')].groupby(['PARENT_COMPANY'])[['GHG_CONTRIBUTION']].sum().sort_values('GHG_CONTRIBUTION',ascending=False).reset_index().reset_index().rename(columns={'index':'2018_MIDSTREAM_RANK'})
df_Agg_midstream['2018_MIDSTREAM_RANK']=df_Agg_midstream['2018_MIDSTREAM_RANK']+1
df_Agg_midstream=df_Agg_midstream.drop(['GHG_CONTRIBUTION'],axis=1)

#2018 Overall rank
df_Agg_overall=df_Agg[(df_Agg['REPORTING YEAR']==2018)].groupby(['PARENT_COMPANY'])[['GHG_CONTRIBUTION']].sum().sort_values('GHG_CONTRIBUTION',ascending=False).reset_index().reset_index().rename(columns={'index':'2018_OVERALL_RANK'})
df_Agg_overall['2018_OVERALL_RANK']=df_Agg_overall['2018_OVERALL_RANK']+1
df_Agg_overall=df_Agg_overall.drop(['GHG_CONTRIBUTION'],axis=1)

# Add ranks to Aggregated dataset
df_Agg_withRanks=pd.merge(df_Agg,df_Agg_upstream,on='PARENT_COMPANY',how='left')\
.merge(df_Agg_midstream,on='PARENT_COMPANY',how='left')\
.merge(df_Agg_overall,on='PARENT_COMPANY',how='left')\
.sort_values('2018_UPSTREAM_RANK')

#Fill uncalculated ranks with max rank + 1 (Series.max() skips NaN, unlike the built-in max())
df_Agg_withRanks['2018_UPSTREAM_RANK']=df_Agg_withRanks['2018_UPSTREAM_RANK'].fillna(df_Agg_withRanks['2018_UPSTREAM_RANK'].max()+1)
df_Agg_withRanks['2018_MIDSTREAM_RANK']=df_Agg_withRanks['2018_MIDSTREAM_RANK'].fillna(df_Agg_withRanks['2018_MIDSTREAM_RANK'].max()+1)

#Standardize company names (done here because the merge needs df_Agg_withRanks to exist)
df_NameLookup = pd.read_csv(folder_path+'CompanyName_Lookup.csv',sep='|')
df_NameLookup['PARENT_COMPANY']=df_NameLookup['PARENT_COMPANY'].str.strip()

df_Agg_withRanks=df_Agg_withRanks.merge(df_NameLookup,on='PARENT_COMPANY', how='left')
## Remove the rows that don't have a company name; these are Excel column-wise sums
df_Agg_withRanks=df_Agg_withRanks[~df_Agg_withRanks['STANDARD_COMPANY_NAME'].isna()]




#Keep a copy of processed data into a file
df_Agg_withRanks.to_csv('Emissions_aggregatedData.csv',index=False,sep='|')

df_Agg_withRanks.head(5)

| | REPORTING YEAR | PARENT_COMPANY | GAS | SECTOR | GHG_CONTRIBUTION | 2018_UPSTREAM_RANK | 2018_MIDSTREAM_RANK | 2018_OVERALL_RANK | STANDARD_COMPANY_NAME |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018 | HILCORP ENERGY CO | N2O | Upstream | 1495.0 | 1.0 | 29.0 | 8.0 | HILCORP ENERGY |
| 1 | 2018 | HILCORP ENERGY CO | N2O | Midstream | 598.70788 | 1.0 | 29.0 | 8.0 | HILCORP ENERGY |
| 2 | 2015 | HILCORP ENERGY CO | CH4 | Midstream | 13146.24412 | 1.0 | 29.0 | 8.0 | HILCORP ENERGY |
| 3 | 2015 | HILCORP ENERGY CO | CH4 | Upstream | 640199.0 | 1.0 | 29.0 | 8.0 | HILCORP ENERGY |
| 4 | 2015 | HILCORP ENERGY CO | CO2 | Midstream | 922041.90366 | 1.0 | 29.0 | 8.0 | HILCORP ENERGY |
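
The cell above derives ranks by sorting totals and reusing the reset index. An alternative sketch with `Series.rank` is shown here on made-up data. This is not the notebook's method, only one way to get the same kind of 1-based rank; `method="first"` is an assumed tie-breaking choice.

```python
import pandas as pd

# Made-up aggregated emissions
agg = pd.DataFrame({
    "REPORTING YEAR": [2018, 2018, 2018, 2017],
    "PARENT_COMPANY": ["A", "B", "C", "D"],
    "SECTOR": ["Upstream", "Upstream", "Midstream", "Upstream"],
    "GHG_CONTRIBUTION": [50.0, 80.0, 30.0, 99.0],
})

# Total 2018 upstream emissions per company
totals_2018 = (
    agg[(agg["REPORTING YEAR"] == 2018) & (agg["SECTOR"] == "Upstream")]
    .groupby("PARENT_COMPANY")["GHG_CONTRIBUTION"]
    .sum()
)

# Rank 1 = largest emitter; ties are broken by order of appearance
upstream_rank = (
    totals_2018.rank(method="first", ascending=False)
    .astype(int)
    .rename("2018_UPSTREAM_RANK")
)

# Left merge keeps companies without 2018 upstream data; their rank is NaN
ranked = agg.merge(upstream_rank.reset_index(), on="PARENT_COMPANY", how="left")

# Fill missing ranks with (worst rank + 1); Series.max() ignores NaN
worst = ranked["2018_UPSTREAM_RANK"].max() + 1
ranked["2018_UPSTREAM_RANK"] = ranked["2018_UPSTREAM_RANK"].fillna(worst)
print(ranked)
```

Companies that did not report in 2018 all share the same placeholder rank, so that value should be read as "unranked" rather than as a real position.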


##### Prepare a dataframe with rank by total emissions across all reporting years (this file is not in use)

In [5]:
#Prepare a dataframe with rank by total emissions across all reporting years
df_Agg_SecRank=df_Agg_withRanks[['STANDARD_COMPANY_NAME','SECTOR','GHG_CONTRIBUTION']]\
.groupby(['STANDARD_COMPANY_NAME','SECTOR'])\
.sum().reset_index().sort_values('GHG_CONTRIBUTION',ascending=False).reset_index(drop=True).reset_index()\
.rename(columns={'index':'RANK'})

df_Agg_SecRank['RANK']=df_Agg_SecRank['RANK']+1

df_Agg_SecRank.to_csv('Emissions_AggWithSecRank.csv',index=False,sep='|')

df_Agg_SecRank.head(5)

| | RANK | STANDARD_COMPANY_NAME | SECTOR | GHG_CONTRIBUTION |
|---|---|---|---|---|
| 0 | 1 | EXXONMOBIL | Midstream | 66370220.0 |
| 1 | 2 | WILLIAMS | Midstream | 65691030.0 |
| 2 | 3 | KINDER MORGAN | Midstream | 63478460.0 |
| 3 | 4 | ENERGY TRANSFER PARTNERS | Midstream | 62582680.0 |
| 4 | 5 | CONOCOPHILLIPS | Upstream | 54862980.0 |


###### Pick up North Dakota-specific emissions (this file is not in use)

In [6]:
df_ND_Emissions=df_fullset[df_fullset['STATE']=='ND'].copy()
df_ND_Emissions['PARENT_COMPANY']=df_ND_Emissions['PARENT_COMPANY'].str.strip()

df_ND_Agg=df_ND_Emissions.groupby(['REPORTING YEAR','PARENT_COMPANY','GAS','SECTOR'])['GHG_CONTRIBUTION']\
.agg('sum').reset_index().sort_values(['PARENT_COMPANY','REPORTING YEAR'])

df_ND_Agg['PARENT_COMPANY']=df_ND_Agg['PARENT_COMPANY'].str.strip()

df_ND_Agg=pd.merge(df_ND_Agg,df_NameLookup,left_on='PARENT_COMPANY', right_on='PARENT_COMPANY',how='left')

df_ND_Agg['PARENT_COMPANY']=df_ND_Agg['STANDARD_COMPANY_NAME']
df_ND_Agg=df_ND_Agg.rename(columns={'PARENT_COMPANY':'COMPANY'}).drop('STANDARD_COMPANY_NAME', axis=1)

df_ND_Agg.to_csv('ND_AggregatedEmissions.csv',sep='|')
df_ND_Agg.head(5)

| | REPORTING YEAR | COMPANY | GAS | SECTOR | GHG_CONTRIBUTION |
|---|---|---|---|---|---|
| 0 | 2017 | 1804 | CH4 | Midstream | 3818.0 |
| 1 | 2017 | 1804 | CO2 | Midstream | 19480.0 |
| 2 | 2018 | 1804 | CH4 | Midstream | 8913.0 |
| 3 | 2018 | 1804 | CO2 | Midstream | 59969.0 |
| 4 | 2018 | 1804 | N2O | Midstream | 24.0 |


## Processing the Emissions from other Industrial sectors

In [7]:
df_industry = pd.read_csv(folder_path+'IndustryWiseGHGEmissions.csv')
# Unpivot the yearly data into columnar (long) format
df_industry=pd.melt(df_industry, id_vars=['Industry Sector'], var_name='Year', value_name='Emission')
#Remove yearly totals
df_industry=df_industry[df_industry['Industry Sector']!='Total']
#Write it to a csv file
df_industry.to_csv('Emissions_OtherIndustries.csv',index=False,sep='|')

In [8]:
df_industry.sample(5)

| | Industry Sector | Year | Emission |
|---|---|---|---|
| 23 | Chemical production and use | 1992 | 72.78805 |
| 172 | Other industrial categories | 2007 | 112.642797 |
| 125 | Mineral products | 2002 | 62.789458 |
| 260 | Fossil fuel combustion: carbon dioxide | 2016 | 767.851626 |
| 60 | Fossil fuel combustion: carbon dioxide | 1996 | 887.37819 |
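
The unpivot step can be seen on a tiny made-up table. The sketch below assumes a wide layout with one column per year and a `Total` row, as the cell above implies; the numbers are invented, and converting `Year` to an integer is an extra step not in the original cell.

```python
import pandas as pd

# Made-up wide table: one row per sector, one column per year
wide = pd.DataFrame({
    "Industry Sector": ["Mineral products", "Chemical production and use", "Total"],
    "1990": [1.0, 2.0, 3.0],
    "1991": [1.5, 2.5, 4.0],
})

# Unpivot: every (sector, year) pair becomes its own row
long_df = pd.melt(wide, id_vars=["Industry Sector"], var_name="Year", value_name="Emission")

# Drop the yearly totals and store the year as an integer
long_df = long_df[long_df["Industry Sector"] != "Total"].assign(
    Year=lambda d: d["Year"].astype(int)
)
print(long_df)
```

The long format puts the year in a single column, which makes it straightforward to filter or group by year later.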


## Oil and Gas Production data processing

In [23]:
# Read Natural Gas production file
dict_gasProduction = pd.read_excel(folder_path+'NaturalGas Production.xls',sheet_name=[1,2], usecols=[0,1],skiprows=[0,1], names=['Date','Production kBOE'])
df_gasProduction = pd.concat(dict_gasProduction.values())
df_gasProduction=df_gasProduction.groupby('Date').sum()
df_gasProduction['Product']='Natural Gas'
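#Convert gas volumes to kBOE using 6,000 cubic feet per BOE; this assumes the source volumes are in MMcf
df_gasProduction['Production kBOE']=df_gasProduction['Production kBOE']*1000/6000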

df_gas_annual=df_gasProduction.groupby(['Product',pd.Grouper(freq="Y")]).sum().reset_index()
df_gas_annual=df_gas_annual[df_gas_annual['Date']<'2020-01-01']

#Pickup the year part
df_gas_annual.Date = df_gas_annual.Date.dt.year


# Crude oil production
df_CrudeProduction=pd.read_excel(folder_path+'CrudeOil Production.xls',sheet_name=1, usecols=[0,1],skiprows=[0,1], names=['Date','Production kBOE'])
df_CrudeProduction['Product']='Crude'
#Crude volumes are used as read; no unit conversion is applied
df_CrudeProduction=df_CrudeProduction.set_index('Date')

df_crude_annual=df_CrudeProduction.groupby(['Product',pd.Grouper(freq="Y")]).sum().reset_index()
df_crude_annual=df_crude_annual[df_crude_annual['Date']<'2020-01-01']
#Pickup the year
df_crude_annual.Date = df_crude_annual.Date.dt.year


#Put both datasets together in columnar format
df_annual_production = pd.concat([df_crude_annual,df_gas_annual])
#Generate file for SP analysis
df_annual_production.to_csv('Processed_AnnualProductionData.csv', sep='|')
df_annual_production.head(5)

| | Product | Date | Production kBOE |
|---|---|---|---|
| 0 | Crude | 1920 | 442929.0 |
| 1 | Crude | 1921 | 472183.0 |
| 2 | Crude | 1922 | 557531.0 |
| 3 | Crude | 1923 | 732407.0 |
| 4 | Crude | 1924 | 713940.0 |
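
The annual roll-up can be sketched on made-up monthly data. The input unit (MMcf) and the factor of 6 Mcf per BOE are assumptions for illustration; the grouping here uses the calendar year directly rather than `pd.Grouper`, which is a substitution, not the notebook's approach.

```python
import pandas as pd

MCF_PER_BOE = 6.0  # assumed conversion factor: 6 thousand cubic feet per barrel of oil equivalent

# Made-up monthly gas volumes, assumed to be in million cubic feet (MMcf)
monthly = pd.DataFrame({
    "Date": pd.to_datetime(["2018-01-15", "2018-02-15", "2019-01-15"]),
    "Production MMcf": [6.0, 12.0, 18.0],
})

# MMcf -> Mcf (x1000) -> BOE (/6) -> kBOE (/1000)
monthly["Production kBOE"] = monthly["Production MMcf"] * 1000 / (MCF_PER_BOE * 1000)

# Sum the monthly values per calendar year
annual = (
    monthly.groupby(monthly["Date"].dt.year)["Production kBOE"]
    .sum()
    .reset_index()
)
print(annual)
```

Converting gas to barrels of oil equivalent puts it on the same scale as crude, so the two products can be added into a combined production figure.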


### Join the Emission Data and Production Data using Reporting Year

##### Put Emissions and Production data into a single columnar dataframe and write it to a file for visualizations

In [16]:
df_individual = df_annual_production[df_annual_production.Date>2010]

#Sum only the production column so the text column 'Product' is left out of the total
df_combined=df_individual.groupby('Date')[['Production kBOE']].sum().reset_index()
df_combined['Product']='Combined Production'

df_ProdSplit=pd.concat([df_combined,df_individual]).rename(columns={'Date':'REPORTING_YEAR','Product':'Key','Production kBOE':'Value'})
df_ProdSplit=df_ProdSplit.replace('Crude','Crude Production').replace('Natural Gas','Natural Gas Production')


df_emission_BySec=df_Agg_withRanks[['REPORTING YEAR','SECTOR','GHG_CONTRIBUTION']].groupby(['REPORTING YEAR','SECTOR']).sum().reset_index()

df_emission_comb=df_emission_BySec.groupby('REPORTING YEAR')[['GHG_CONTRIBUTION']].sum().reset_index()
df_emission_comb['SECTOR']='Combined Emission'
df_emiss_split=pd.concat([df_emission_BySec, df_emission_comb])
df_emiss_split=df_emiss_split.rename(columns={'REPORTING YEAR':'REPORTING_YEAR','SECTOR':'Key','GHG_CONTRIBUTION':'Value'})
df_emiss_split=df_emiss_split.replace('Midstream','Midstream Emission').replace('Upstream','Upstream Emission')

df_prodvsEmission=pd.concat([df_ProdSplit,df_emiss_split]).reset_index(drop=True)
df_prodvsEmission.to_csv('ProductionVsEmissionSplit.csv',sep='|',index=False)
df_prodvsEmission

| | Product | Date | Production kBOE |
|---|---|---|---|
| 91 | Crude | 2011 | 2068316.0 |
| 92 | Crude | 2012 | 2385704.0 |
| 93 | Crude | 2013 | 2734901.0 |
| 94 | Crude | 2014 | 3207206.0 |
| 95 | Crude | 2015 | 3445138.0 |
| 96 | Crude | 2016 | 3235183.0 |
| 97 | Crude | 2017 | 3413418.0 |
| 98 | Crude | 2018 | 4011519.0 |
| 99 | Crude | 2019 | 4464530.0 |
| 38 | Natural Gas | 2011 | 4940064.0 |
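
The key-value layout written to `ProductionVsEmissionSplit.csv` stacks production and emission series under a shared `REPORTING_YEAR` / `Key` / `Value` schema. The sketch below shows the idea on made-up numbers; `to_key_value` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Made-up numbers, for illustration only
production = pd.DataFrame({
    "Date": [2011, 2011, 2012, 2012],
    "Product": ["Crude", "Natural Gas", "Crude", "Natural Gas"],
    "Production kBOE": [10.0, 20.0, 11.0, 22.0],
})
emissions = pd.DataFrame({
    "REPORTING YEAR": [2011, 2011, 2012, 2012],
    "SECTOR": ["Upstream", "Midstream", "Upstream", "Midstream"],
    "GHG_CONTRIBUTION": [5.0, 3.0, 6.0, 4.0],
})

def to_key_value(df, year_col, key_col, value_col, suffix):
    """Hypothetical helper: reshape to REPORTING_YEAR / Key / Value rows
    and add a 'Combined <suffix>' total per year."""
    split = df.rename(columns={year_col: "REPORTING_YEAR", key_col: "Key", value_col: "Value"})
    split = split.assign(Key=split["Key"] + " " + suffix)
    combined = (
        split.groupby("REPORTING_YEAR", as_index=False)["Value"]
        .sum()
        .assign(Key="Combined " + suffix)
    )
    return pd.concat([combined, split], ignore_index=True)[["REPORTING_YEAR", "Key", "Value"]]

key_value = pd.concat([
    to_key_value(production, "Date", "Product", "Production kBOE", "Production"),
    to_key_value(emissions, "REPORTING YEAR", "SECTOR", "GHG_CONTRIBUTION", "Emission"),
], ignore_index=True)
print(key_value)
```

With every series in the same three columns, a plotting tool can treat `Key` as the series name and draw production and emissions on one chart.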


In [None]:
#df_Agg_withRanks['STANDARD_COMPANY_NAME']=df_Agg_withRanks['PARENT_COMPANY'].apply(standardCompName)

In [None]:
#df_Agg_withRanks.to_csv('Emissions_aggregatedData.csv',index=False,sep='|')

In [33]:
df_Agg_withRanks[df_Agg_withRanks['STANDARD_COMPANY_NAME'].str.contains('CONOCO')]

| | REPORTING YEAR | PARENT_COMPANY | GAS | SECTOR | GHG_CONTRIBUTION | 2018_UPSTREAM_RANK | 2018_MIDSTREAM_RANK | 2018_OVERALL_RANK | STANDARD_COMPANY_NAME |
|---|---|---|---|---|---|---|---|---|---|
| 117 | 2018 | CONOCOPHILLIPS | CO2 | Midstream | 2.992739e+06 | 5.0 | 16.0 | 10.0 | CONOCOPHILLIPS |
| 118 | 2018 | CONOCOPHILLIPS | N2O | Midstream | 1.496054e+03 | 5.0 | 16.0 | 10.0 | CONOCOPHILLIPS |
| 119 | 2015 | CONOCOPHILLIPS | CH4 | Midstream | 5.728028e+04 | 5.0 | 16.0 | 10.0 | CONOCOPHILLIPS |
| 120 | 2015 | CONOCOPHILLIPS | CH4 | Upstream | 3.360061e+06 | 5.0 | 16.0 | 10.0 | CONOCOPHILLIPS |
| 121 | 2015 | CONOCOPHILLIPS | CO2 | Midstream | 2.239105e+06 | 5.0 | 16.0 | 10.0 | CONOCOPHILLIPS |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5790 | 2014 | ConocoPhillips Company | CH4 | Upstream | 1.476440e+05 | 312.0 | 279.0 | | CONOCOPHILLIPS |
| 5791 | 2014 | ConocoPhillips Company | CO2 | Midstream | 1.073000e+05 | 312.0 | 279.0 | | CONOCOPHILLIPS |
| 5792 | 2014 | ConocoPhillips Company | CO2 | Upstream | 9.118760e+05 | 312.0 | 279.0 | | CONOCOPHILLIPS |
| 5793 | 2014 | ConocoPhillips Company | N2O | Midstream | 6.050000e+01 | 312.0 | 279.0 | | CONOCOPHILLIPS |
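
The check above shows two raw spellings mapping to one standard name. A minimal sketch of how a lookup merge can both consolidate such variants and surface names missing from the lookup is given below. The data is made up, and the use of `indicator=True` is an illustrative addition, not something the notebook does.

```python
import pandas as pd

# Made-up aggregated emissions with two spellings of the same company
agg = pd.DataFrame({
    "PARENT_COMPANY": ["CONOCOPHILLIPS", "ConocoPhillips Company", "UNKNOWN CO"],
    "GHG_CONTRIBUTION": [100.0, 50.0, 7.0],
})

# Made-up lookup from raw name to standardized name
lookup = pd.DataFrame({
    "PARENT_COMPANY": ["CONOCOPHILLIPS", "ConocoPhillips Company"],
    "STANDARD_COMPANY_NAME": ["CONOCOPHILLIPS", "CONOCOPHILLIPS"],
})

# indicator=True adds a _merge column showing which rows found a match
merged = agg.merge(lookup, on="PARENT_COMPANY", how="left", indicator=True)

# Raw names that are missing from the lookup table
unmatched = merged.loc[merged["_merge"] == "left_only", "PARENT_COMPANY"].tolist()

# Totals per standardized company, with unmatched rows excluded
totals = (
    merged.dropna(subset=["STANDARD_COMPANY_NAME"])
    .groupby("STANDARD_COMPANY_NAME")["GHG_CONTRIBUTION"]
    .sum()
)
print(unmatched)
print(totals)
```

Listing the unmatched names before filtering them out helps separate genuine non-company rows, such as column sums, from companies that simply need a lookup entry.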
