## Texas oil production data

As mentioned earlier, oil and gas production data has been obtained for three states: Texas, North Dakota and Wyoming. These states are among the top oil and gas producers in the country. Texas alone produces more oil than all the offshore US fields put together.

In this notebook, the Texas data is processed.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from os import listdir

### Texas - Railroad Commission of Texas (RRC) data
This data was provided as a bulk data dump by Enigma's help desk and is stored in the RRCRawData folder. The files use the '.dsv' extension and can be read like regular CSV files, but with '}' as the separator.
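
As a quick illustration of the '}' separator, here is a minimal sketch that parses a small in-memory sample instead of the real file. The three column names and the two values are borrowed from the county table shown later in this notebook; the sample string itself is an assumption about the raw layout, not a copy of the actual file.

```python
import io
import pandas as pd

# Hypothetical two-row sample in a '}'-delimited layout (not the real file contents)
sample = (
    "COUNTY_NO}CYCLE_YEAR_MONTH}CNTY_OIL_PROD_VOL\n"
    "1}199301}7355\n"
    "1}199302}6312\n"
)

# Same call pattern as for the real file, just reading from a string buffer
demo = pd.read_csv(io.StringIO(sample), sep='}')
print(demo)
```

Only the `sep` argument differs from a normal CSV read.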

A number of files from the RRC are available, with data aggregated in different ways (only the relevant files will be put on GitHub).

In [2]:
listdir('./')

['.DS_Store',
 '.ipynb_checkpoints',
 'DataWrangling_TexasOilProd.ipynb',
 'OG_COUNTY_CYCLE_DATA_TABLE.dsv',
 'pdq-dump-user-manual_final_ada_1-3-2018.pdf',
 'TexasOilProdCounty.csv']

In [3]:
# The county_cycle data table contains historical production aggregated by county
county_cycle = pd.read_csv('OG_COUNTY_CYCLE_DATA_TABLE.dsv',sep='}')

The county data is modified by prepending the state FIPS code to the county code, which makes the code unique for each county.

In [4]:
area_codes = pd.read_csv('../../Unemployment/BLS_AreaCodes.txt',sep='\t',index_col=False)
county_codes = area_codes[area_codes['area_type_code'] == 'F']
county_codes = county_codes.reset_index(drop=True)

In [5]:
# Characters 2-6 of the BLS area code hold the five-digit FIPS code (state + county)
county_codes['FIPS code'] = county_codes['area_code'].str[2:7]
county_codes.head()

Unnamed: 0,area_type_code,area_code,area_text,display_level,selectable,sort_sequence,FIPS code
0,F,CN0100100000000,"Autauga County, AL",0,T,31,1001
1,F,CN0100300000000,"Baldwin County, AL",0,T,32,1003
2,F,CN0100500000000,"Barbour County, AL",0,T,33,1005
3,F,CN0100700000000,"Bibb County, AL",0,T,34,1007
4,F,CN0100900000000,"Blount County, AL",0,T,35,1009


In [6]:
county_FIPS_names = dict(zip(county_codes['FIPS code'],county_codes['area_text']))
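
To make the code layout concrete, here is a small sketch on a single area code taken from the table above. The helper `texas_fips` is hypothetical (it is not defined anywhere in this notebook); it only mirrors the "48" plus zero-padded county number format used further down for the RRC data.

```python
# A BLS area code as it appears in the table above, kept as a string
area_code = 'CN0100100000000'

# Two-digit state code followed by three-digit county code
fips_from_bls = area_code[2:7]

# Hypothetical helper: Texas state FIPS '48' + zero-padded RRC county number
def texas_fips(county_no):
    return "48%03d" % county_no

print(fips_from_bls)   # 01001
print(texas_fips(1))   # 48001
```

Keeping the codes as strings preserves the leading zero, which matters when the two sources are matched on this key.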

In [7]:
county_cycle.head()

Unnamed: 0,COUNTY_NO,DISTRICT_NO,CYCLE_YEAR,CYCLE_MONTH,CYCLE_YEAR_MONTH,CNTY_OIL_PROD_VOL,CNTY_OIL_ALLOW,CNTY_OIL_ENDING_BAL,CNTY_GAS_PROD_VOL,CNTY_GAS_ALLOW,...,CNTY_CSGD_PROD_VOL,CNTY_CSGD_LIMIT,CNTY_CSGD_GAS_LIFT,CNTY_OIL_TOT_DISP,CNTY_GAS_TOT_DISP,CNTY_COND_TOT_DISP,CNTY_CSGD_TOT_DISP,COUNTY_NAME,DISTRICT_NAME,OIL_GAS_CODE
0,1,5,1993,1,199301,7355,,,0,,...,6347,,,,,,,ANDERSON,5,O
1,1,5,1993,2,199302,6312,,,0,,...,4919,,,,,,,ANDERSON,5,O
2,1,5,1993,3,199303,6222,,,0,,...,4973,,,,,,,ANDERSON,5,O
3,1,5,1993,4,199304,6139,,,0,,...,4410,,,,,,,ANDERSON,5,O
4,1,5,1993,5,199305,5785,,,0,,...,5961,,,,,,,ANDERSON,5,O


County numbers greater than 509 do not show up in the Bureau of Labor Statistics' FIPS codes, and are dropped here. They appear to follow the RRC's own numbering scheme.

In [8]:
county_cycle.drop(county_cycle[county_cycle['COUNTY_NO'] > 509].index,inplace=True)

In [9]:
county_cycle['COUNTY_FIPS_CODE'] = ["48%03d" % x for x in county_cycle.COUNTY_NO]
county_cycle.head()

Unnamed: 0,COUNTY_NO,DISTRICT_NO,CYCLE_YEAR,CYCLE_MONTH,CYCLE_YEAR_MONTH,CNTY_OIL_PROD_VOL,CNTY_OIL_ALLOW,CNTY_OIL_ENDING_BAL,CNTY_GAS_PROD_VOL,CNTY_GAS_ALLOW,...,CNTY_CSGD_LIMIT,CNTY_CSGD_GAS_LIFT,CNTY_OIL_TOT_DISP,CNTY_GAS_TOT_DISP,CNTY_COND_TOT_DISP,CNTY_CSGD_TOT_DISP,COUNTY_NAME,DISTRICT_NAME,OIL_GAS_CODE,COUNTY_FIPS_CODE
0,1,5,1993,1,199301,7355,,,0,,...,,,,,,,ANDERSON,5,O,48001
1,1,5,1993,2,199302,6312,,,0,,...,,,,,,,ANDERSON,5,O,48001
2,1,5,1993,3,199303,6222,,,0,,...,,,,,,,ANDERSON,5,O,48001
3,1,5,1993,4,199304,6139,,,0,,...,,,,,,,ANDERSON,5,O,48001
4,1,5,1993,5,199305,5785,,,0,,...,,,,,,,ANDERSON,5,O,48001


In [10]:
county_cycle.shape

(172426, 25)

In [11]:
# map the county FIPS codes to the county names using the dictionary county_FIPS_names
county_cycle['Mapped_Name'] = county_cycle['COUNTY_FIPS_CODE'].map(county_FIPS_names)
county_cycle['TIME'] = pd.to_datetime(county_cycle.CYCLE_YEAR_MONTH,format='%Y%m')

county_cycle.head()

Unnamed: 0,COUNTY_NO,DISTRICT_NO,CYCLE_YEAR,CYCLE_MONTH,CYCLE_YEAR_MONTH,CNTY_OIL_PROD_VOL,CNTY_OIL_ALLOW,CNTY_OIL_ENDING_BAL,CNTY_GAS_PROD_VOL,CNTY_GAS_ALLOW,...,CNTY_OIL_TOT_DISP,CNTY_GAS_TOT_DISP,CNTY_COND_TOT_DISP,CNTY_CSGD_TOT_DISP,COUNTY_NAME,DISTRICT_NAME,OIL_GAS_CODE,COUNTY_FIPS_CODE,Mapped_Name,TIME
0,1,5,1993,1,199301,7355,,,0,,...,,,,,ANDERSON,5,O,48001,"Anderson County, TX",1993-01-01
1,1,5,1993,2,199302,6312,,,0,,...,,,,,ANDERSON,5,O,48001,"Anderson County, TX",1993-02-01
2,1,5,1993,3,199303,6222,,,0,,...,,,,,ANDERSON,5,O,48001,"Anderson County, TX",1993-03-01
3,1,5,1993,4,199304,6139,,,0,,...,,,,,ANDERSON,5,O,48001,"Anderson County, TX",1993-04-01
4,1,5,1993,5,199305,5785,,,0,,...,,,,,ANDERSON,5,O,48001,"Anderson County, TX",1993-05-01


### Missing counties

In [12]:
county_cycle.COUNTY_NO.nunique() # checking to see the total number of counties is correct: Texas has 254 counties

240

This shows that out of 254 counties, 14 have been left out. Here, I find out which counties have missing data.

In [13]:
kept_names = county_cycle.Mapped_Name.unique()

In [14]:
tx_county_names = [county_FIPS_names[x] for x in county_FIPS_names if 'TX' in county_FIPS_names[x]] 

In [15]:
len(tx_county_names)

254

So the original dictionary had the correct number of counties. Now we can find out which counties were left out in the RRC data.

In [16]:
set(tx_county_names)-set(kept_names)

{'Bailey County, TX',
 'Burnet County, TX',
 'Collin County, TX',
 'Comal County, TX',
 'Deaf Smith County, TX',
 'Delta County, TX',
 'El Paso County, TX',
 'Gillespie County, TX',
 'Hall County, TX',
 'Lamar County, TX',
 'Llano County, TX',
 'Mason County, TX',
 'Parmer County, TX',
 'Randall County, TX'}

These counties have been left out of the RRC data. Note that some of these counties are significant producers of oil and gas. So this dataset is definitely missing some of this key data.

Here I'm also keeping track of the county FIPS codes which have been left out.

In [17]:
kept_nums = county_cycle.COUNTY_FIPS_CODE.unique()

In [18]:
tx_county_nums = [x for x in county_FIPS_names if 'TX' in county_FIPS_names[x]] 

In [19]:
set(tx_county_nums) - set(kept_nums)

{'48017',
 '48053',
 '48085',
 '48091',
 '48117',
 '48119',
 '48141',
 '48171',
 '48191',
 '48277',
 '48299',
 '48319',
 '48369',
 '48381'}

## Final dataframe

This project focuses only on oil production, so I will drop the other columns.

In [20]:
# Isolating just the oil production data
county_cycle_oil = county_cycle[['TIME','Mapped_Name','CNTY_OIL_PROD_VOL']]
county_cycle_oil.tail()

Unnamed: 0,TIME,Mapped_Name,CNTY_OIL_PROD_VOL
179013,2017-09-01,"Zavala County, TX",0
179014,2017-10-01,"Zavala County, TX",0
179015,2017-10-01,"Zavala County, TX",0
179016,2017-11-01,"Zavala County, TX",0
179017,2017-11-01,"Zavala County, TX",0


As seen in the above output, some counties have more than one oil production record for the same month. In that case, we keep the maximum reported oil production.
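
The idea can be sketched on toy data. The county name and volumes below are invented for illustration; the column names match the ones used after the rename in the next cell.

```python
import pandas as pd

# Toy data: the second month appears twice for the same county (values are invented)
toy = pd.DataFrame({
    'County_Name': ['A County, TX'] * 3,
    'Date': ['01/1993', '02/1993', '02/1993'],
    'Oil_Production': [100, 0, 250],
})

# Flag every row that shares its (county, month) key with another row
dupes = toy[toy.duplicated(['County_Name', 'Date'], keep=False)]

# Keep one value per (county, month): the largest reported volume
collapsed = toy.groupby(['County_Name', 'Date'], as_index=False)['Oil_Production'].max()
print(collapsed)
```

Inspecting `dupes` first is a cheap way to see how often the duplicates disagree before deciding how to collapse them.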

In [21]:
county_cycle_oil = county_cycle_oil.rename(columns={'TIME':'Date','Mapped_Name':'County_Name','CNTY_OIL_PROD_VOL':'Oil_Production'})

In [22]:
county_cycle_oil.Date = pd.to_datetime(county_cycle_oil.Date)

In [23]:
county_cycle_oil.Date = county_cycle_oil.Date.dt.strftime('%m/%Y')

In [24]:
county_cycle_oil = county_cycle_oil.groupby(['County_Name','Date']).agg({'Oil_Production':'max'})
county_cycle_oil.tail()

County_Name,Date,Oil_Production
"Zavala County, TX",12/2012,390030
"Zavala County, TX",12/2013,544949
"Zavala County, TX",12/2014,817685
"Zavala County, TX",12/2015,871432
"Zavala County, TX",12/2016,612218


In [25]:
county_cycle_oil.reset_index(inplace=True)

In [26]:
county_cycle_oil = county_cycle_oil.pivot(index='Date',columns='County_Name',values='Oil_Production')

In [27]:
county_cycle_oil.head()

Date,"Anderson County, TX","Andrews County, TX","Angelina County, TX","Aransas County, TX","Archer County, TX","Armstrong County, TX","Atascosa County, TX","Austin County, TX","Bandera County, TX","Bastrop County, TX",...,"Willacy County, TX","Williamson County, TX","Wilson County, TX","Winkler County, TX","Wise County, TX","Wood County, TX","Yoakum County, TX","Young County, TX","Zapata County, TX","Zavala County, TX"
01/1993,121640.0,2866720.0,132.0,30998.0,229457.0,,74280.0,48319.0,,19685.0,...,72844.0,825.0,133441.0,414290.0,82683.0,848809.0,2728824.0,212978.0,7072.0,256497.0
01/1994,110684.0,2846724.0,148.0,33803.0,214768.0,,66752.0,45153.0,,18881.0,...,68147.0,1423.0,90654.0,391482.0,86178.0,804362.0,2643341.0,188740.0,5587.0,146641.0
01/1995,111589.0,2688326.0,133.0,28209.0,187377.0,,60447.0,31324.0,0.0,18120.0,...,57598.0,917.0,76739.0,398144.0,78929.0,707875.0,2576306.0,186605.0,5426.0,92445.0
01/1996,104492.0,2672855.0,193.0,18207.0,177123.0,,60092.0,41908.0,295.0,15495.0,...,54366.0,1239.0,62634.0,394219.0,64957.0,631908.0,2546692.0,169412.0,6703.0,68672.0
01/1997,97139.0,2544895.0,23488.0,17315.0,163035.0,,58672.0,28653.0,191.0,13201.0,...,54684.0,1215.0,43190.0,446820.0,55618.0,561135.0,2556121.0,162501.0,5147.0,48230.0


Now we can add the missing counties, with their production simply set to 0. I want these columns to be present because I will ultimately compare the county oil production data with unemployment and labor force data for all counties. This way, the unemployment and labor force data for these counties will not be omitted from the final combined dataframe.

The NaN values are set to 0, as they are the result of pivoting.

In [28]:
county_cycle_oil = county_cycle_oil.fillna(0)

In [29]:
missing_counties = list(set(tx_county_names)-set(kept_names))

In [30]:
for x in missing_counties:
    county_cycle_oil[x] = 0.0  # no RRC records for this county

In [31]:
county_cycle_oil.sort_index(axis=1,inplace=True)
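
As an aside, the same result (NaNs from the pivot set to 0, plus zero-filled columns for absent counties) could probably be reached in one pass with `reindex`. This is only a sketch on toy data; the county names and the `all_counties` list are hypothetical stand-ins for `tx_county_names`.

```python
import pandas as pd

# Toy long-format data: B County has no record for 02/1993, C County has none at all
long_df = pd.DataFrame({
    'Date': ['01/1993', '01/1993', '02/1993'],
    'County_Name': ['A County, TX', 'B County, TX', 'A County, TX'],
    'Oil_Production': [10, 20, 30],
})
wide = long_df.pivot(index='Date', columns='County_Name', values='Oil_Production')

# Hypothetical full county list, already in sorted order
all_counties = ['A County, TX', 'B County, TX', 'C County, TX']

# reindex adds the absent columns as NaN; fillna then zeroes those and the pivot gaps
filled = wide.reindex(columns=all_counties).fillna(0)
print(filled)
```

With a sorted county list, the separate column sort is no longer needed.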

In [32]:
county_cycle_oil.head()

Date,"Anderson County, TX","Andrews County, TX","Angelina County, TX","Aransas County, TX","Archer County, TX","Armstrong County, TX","Atascosa County, TX","Austin County, TX","Bailey County, TX","Bandera County, TX",...,"Willacy County, TX","Williamson County, TX","Wilson County, TX","Winkler County, TX","Wise County, TX","Wood County, TX","Yoakum County, TX","Young County, TX","Zapata County, TX","Zavala County, TX"
01/1993,121640.0,2866720.0,132.0,30998.0,229457.0,0.0,74280.0,48319.0,0.0,0.0,...,72844.0,825.0,133441.0,414290.0,82683.0,848809.0,2728824.0,212978.0,7072.0,256497.0
01/1994,110684.0,2846724.0,148.0,33803.0,214768.0,0.0,66752.0,45153.0,0.0,0.0,...,68147.0,1423.0,90654.0,391482.0,86178.0,804362.0,2643341.0,188740.0,5587.0,146641.0
01/1995,111589.0,2688326.0,133.0,28209.0,187377.0,0.0,60447.0,31324.0,0.0,0.0,...,57598.0,917.0,76739.0,398144.0,78929.0,707875.0,2576306.0,186605.0,5426.0,92445.0
01/1996,104492.0,2672855.0,193.0,18207.0,177123.0,0.0,60092.0,41908.0,0.0,295.0,...,54366.0,1239.0,62634.0,394219.0,64957.0,631908.0,2546692.0,169412.0,6703.0,68672.0
01/1997,97139.0,2544895.0,23488.0,17315.0,163035.0,0.0,58672.0,28653.0,0.0,191.0,...,54684.0,1215.0,43190.0,446820.0,55618.0,561135.0,2556121.0,162501.0,5147.0,48230.0


Next, I check the last date for which production has been reported. This is useful to keep track of when combining with other dataframes.

In [33]:
# The index holds 'MM/YYYY' strings, so the format is given explicitly
county_cycle_oil.index = pd.to_datetime(county_cycle_oil.index, format='%m/%Y')
county_cycle_oil.sort_index(axis=0,inplace=True)
county_cycle_oil.tail()

Date,"Anderson County, TX","Andrews County, TX","Angelina County, TX","Aransas County, TX","Archer County, TX","Armstrong County, TX","Atascosa County, TX","Austin County, TX","Bailey County, TX","Bandera County, TX",...,"Willacy County, TX","Williamson County, TX","Wilson County, TX","Winkler County, TX","Wise County, TX","Wood County, TX","Yoakum County, TX","Young County, TX","Zapata County, TX","Zavala County, TX"
2017-07-01,42772.0,3094031.0,0.0,3383.0,77206.0,0.0,1586274.0,31191.0,0.0,122.0,...,14024.0,713.0,117169.0,615249.0,14494.0,275487.0,2008449.0,78763.0,4542.0,514086.0
2017-08-01,42453.0,2914908.0,0.0,2412.0,63253.0,0.0,1423428.0,28304.0,0.0,56.0,...,15476.0,631.0,90852.0,509957.0,11303.0,258891.0,1939632.0,77334.0,4724.0,468779.0
2017-09-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2017-10-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2017-11-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Dropping 2017, as some of its values look unreliable (for example, September through November are reported as 0 in every column shown above):

In [35]:
county_cycle_oil = county_cycle_oil.loc[:'2016-12-01']
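
Rather than reading the cutoff off the printed table, the trailing months with no reported production could also be located programmatically. The sketch below uses an invented two-county frame; it assumes that a month in which every county reports 0 means "not yet reported", which is my reading of the output above rather than something the RRC documentation confirms here.

```python
import pandas as pd

# Toy wide-format frame with a DatetimeIndex (values are invented)
idx = pd.to_datetime(['2017-07-01', '2017-08-01', '2017-09-01', '2017-10-01'])
toy = pd.DataFrame(
    {'A County, TX': [5.0, 4.0, 0.0, 0.0],
     'B County, TX': [0.0, 7.0, 0.0, 0.0]},
    index=idx,
)

# True for months where every county reports zero
all_zero = (toy == 0).all(axis=1)

# Last month with at least one non-zero value
last_reported = toy.index[~all_zero].max()
print(last_reported)
```

This only finds months that are entirely empty; partially reported months would still need a manual look.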

In [36]:
county_cycle_oil.tail()

Date,"Anderson County, TX","Andrews County, TX","Angelina County, TX","Aransas County, TX","Archer County, TX","Armstrong County, TX","Atascosa County, TX","Austin County, TX","Bailey County, TX","Bandera County, TX",...,"Willacy County, TX","Williamson County, TX","Wilson County, TX","Winkler County, TX","Wise County, TX","Wood County, TX","Yoakum County, TX","Young County, TX","Zapata County, TX","Zavala County, TX"
2016-08-01,47221.0,2989914.0,0.0,3484.0,87147.0,0.0,1608203.0,35242.0,0.0,93.0,...,17654.0,686.0,135399.0,484328.0,16324.0,294673.0,1956900.0,79573.0,6461.0,647237.0
2016-09-01,44144.0,2978563.0,0.0,3339.0,81275.0,0.0,1546479.0,32845.0,0.0,90.0,...,16603.0,817.0,124777.0,521024.0,15842.0,280943.0,1884683.0,77034.0,5196.0,623681.0
2016-10-01,46425.0,3102870.0,0.0,3585.0,84378.0,0.0,1706671.0,35824.0,0.0,96.0,...,18138.0,524.0,130192.0,619251.0,17180.0,277868.0,1978785.0,79918.0,5930.0,625078.0
2016-11-01,45678.0,2939942.0,0.0,3077.0,78413.0,0.0,1600728.0,31736.0,0.0,151.0,...,14754.0,644.0,122064.0,587838.0,15573.0,281104.0,1871260.0,76406.0,5486.0,596527.0
2016-12-01,46477.0,2882062.0,0.0,3545.0,81196.0,0.0,1594355.0,32321.0,0.0,35.0,...,17185.0,569.0,120549.0,571039.0,16431.0,308243.0,1922891.0,78893.0,5880.0,612218.0


In [37]:
county_cycle_oil.to_csv('TexasOilProdCounty.csv')