## Reading data

You can find the data here: <a href="https://catalog.data.gov/dataset/eia-annual-energy-outlook-for-2011-all-tables-de75e">data.gov</a><br>
This lecture is based on <a href="https://docs.python.org/2/tutorial/inputoutput.html">the Python tutorial on input and output</a>.

Let's start by looking at the file.  Notice that several lines separate the different tables, that strings are surrounded by quotes, and that the file is comma-delimited.

In [23]:
!head data/1-AEO2011.csv

1-AEO2011,,"Case","Region","Row","Quantity","Units (unless otherwise specified)",2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024,2025,2026,2027,2028,2029,2030,2031,2032,2033,2034,2035
"Table 1","Total Energy Supply, Disposition, and Price Summary","Reference case","United States",1,"Total Energy Supply, Disposition, and Price Summary, Reference case",
,,"Reference case","United States",2,"(quadrillion Btu, unless otherwise noted)",
,,"Reference case","United States",3,"Production",
,,"Reference case","United States",4,"   Crude Oil and Lease Condensate","quadrillion Btu",10.51,11.34,11.87,11.76,11.58,12.02,12.40,12.51,12.82,13.07,13.12,13.17,13.07,13.05,13.09,12.97,12.78,12.64,12.50,12.44,12.53,12.48,12.49,12.63,12.91,13.01,13.04,12.80
,,"Reference case","United States",5,"   Natural Gas Plant Liquids","quadrillion Btu",2.41,2.57,2.64,2.63,2.76,2.81,2.83,2.86,2.86,2.89,2.94,2.99,3.06,3.15,3.28,3.43,3.52,3.55,3.57,3.61,3.65,3.67,3.71,3.75,3

We begin by reading the file with read and readline.  To separate a line into its elements, we have to do it manually with split and strip.

In [17]:
with open('data/1-AEO2011.csv','r') as govdat:
    print(govdat.read()[0:100])

1-AEO2011,,"Case","Region","Row","Quantity","Units (unless otherwise specified)",2008,2009,2010,2011


In [18]:
with open('data/1-AEO2011.csv','r') as govdat:
    govhead = govdat.readline()

In [19]:
header = [name.strip('"\n') for name in govhead.split(',')]

In [21]:
header[0:10]

['1-AEO2011',
 '',
 'Case',
 'Region',
 'Row',
 'Quantity',
 'Units (unless otherwise specified)',
 '2008',
 '2009',
 '2010']

<h2>CSV package</h2>

<a href="https://docs.python.org/2/library/csv.html">CSV package documentation</a>

The csv reader automates the split and the removal of quotes, and it keeps a quoted field intact even when the field itself contains commas.
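To see why this matters, here is a minimal sketch that compares a manual split with csv.reader on a single row. The sample line is modeled on the "Table 1" row shown earlier and is held in memory with io.StringIO, so the variable names and the shortened row are assumptions made for illustration only.

```python
import csv
import io

# A sample row modeled on the file: the second field contains commas inside quotes.
line = '"Table 1","Total Energy Supply, Disposition, and Price Summary","Reference case"\n'

# Manual approach: split on every comma, then strip quotes and the newline.
naive = [field.strip('"\n') for field in line.split(',')]

# csv.reader treats a quoted field as one unit, even when it contains the delimiter.
parsed = next(csv.reader(io.StringIO(line), delimiter=',', quotechar='"'))

print(len(naive), naive)    # the title is broken into three pieces
print(len(parsed), parsed)  # the title stays in one field
```

The manual split yields five pieces because it cuts the table title at its internal commas, while the reader returns the three fields that the row actually contains.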

In [88]:
import csv

In [89]:
govf = open('data/1-AEO2011.csv','r')
fcsv = csv.reader(govf,delimiter=',',quotechar='"')

In [90]:
header = next(fcsv)

In [91]:
## Detect when a new table begins, and append the finished table to a list of tables
all_tables = []
econtable = []
table_names = []
for row in fcsv:
    if "Table" in row[0]:
        # A "Table" row starts a new table; store the previous one if it has data
        if econtable:
            all_tables.append(econtable)
        table_names.append(row[-2])
        econtable = []
    if len(row) > 7:
        # Data rows have year columns: keep the labels as text, convert the values to float
        econtable.append([x.strip() for x in row[2:7]] + [float(x.strip()) for x in row[7:]])
all_tables.append(econtable)

In [92]:
govf.close()

In [93]:
table_names

['Total Energy Supply, Disposition, and Price Summary, Reference case',
 'Total Energy Supply, Disposition, and Price Summary, High economic growth',
 'Total Energy Supply, Disposition, and Price Summary, Low economic growth',
 'Total Energy Supply, Disposition, and Price Summary, High oil price',
 'Total Energy Supply, Disposition, and Price Summary, Low oil price']

In [94]:
len(all_tables)

5
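The same table-splitting idea can be tried without the data file. The sketch below runs it on a small in-memory sample and, as a variation on the two parallel lists above, stores each table in a dict keyed by its name. The sample rows, the shortened titles, and the names `sample`, `tables`, and `current` are hypothetical.

```python
import csv
import io

# Hypothetical miniature of the file: a header, then two tables with one data row each.
sample = '''id,,"Case","Region","Row","Quantity","Units",2008,2009
"Table 1","Summary","Reference case","United States",1,"Summary, Reference case",
,,"Reference case","United States",2,"   Crude Oil","quadrillion Btu",10.51,11.34
"Table 1","Summary","Low oil price","United States",1,"Summary, Low oil price",
,,"Low oil price","United States",2,"   Crude Oil","quadrillion Btu",10.51,11.20
'''

tables = {}      # table name -> list of parsed data rows
current = None   # name of the table currently being filled

reader = csv.reader(io.StringIO(sample))
header = next(reader)  # skip the header row
for row in reader:
    if "Table" in row[0]:
        # A new table starts here; its name is the second-to-last field.
        current = row[-2]
        tables[current] = []
    elif len(row) > 7:
        # Data row: text labels first, then the yearly values as floats.
        tables[current].append([x.strip() for x in row[2:7]] + [float(x) for x in row[7:]])

for name, rows in tables.items():
    print(name, rows)
```

Keying the tables by name keeps each name next to its rows, which avoids having to keep two lists in the same order.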

<h2>Pandas</h2>

<a href="http://pandas.pydata.org/pandas-docs/stable/io.html">Pandas I/O documentation</a>

Pandas can expedite this process because it reads the data directly into a pandas DataFrame.

In [2]:
import pandas as pd

econtable = pd.read_csv('data/1-AEO2011.csv',header=0)
print(type(econtable))
econtable = econtable.iloc[3:,2:].dropna() #drop the rows with NAs
econtable.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Case,Region,Row,Quantity,Units (unless otherwise specified),2008,2009,2010,2011,2012,...,2026,2027,2028,2029,2030,2031,2032,2033,2034,2035
3,Reference case,United States,4,Crude Oil and Lease Condensate,quadrillion Btu,10.51,11.34,11.87,11.76,11.58,...,12.5,12.44,12.53,12.48,12.49,12.63,12.91,13.01,13.04,12.8
4,Reference case,United States,5,Natural Gas Plant Liquids,quadrillion Btu,2.41,2.57,2.64,2.63,2.76,...,3.57,3.61,3.65,3.67,3.71,3.75,3.78,3.82,3.86,3.92
5,Reference case,United States,6,Dry Natural Gas,quadrillion Btu,20.83,21.5,21.83,21.61,21.83,...,24.7,24.92,25.21,25.47,25.75,26.02,26.22,26.42,26.67,27.0
6,Reference case,United States,7,Coal,quadrillion Btu,23.85,21.58,22.59,21.75,21.39,...,23.95,24.05,24.4,24.53,24.77,24.96,25.21,25.45,25.72,26.01
7,Reference case,United States,8,Nuclear Power,quadrillion Btu,8.43,8.35,8.39,8.4,8.5,...,9.17,9.17,9.17,9.17,9.17,9.17,9.16,9.16,9.15,9.14


In [86]:
econtable.shape
set(econtable['Case']) #the unique Cases in the table

{'High economic growth',
 'High oil price',
 'Low economic growth',
 'Low oil price',
 'Reference case'}

In [102]:
## We can read the table in chunks. Passing chunksize makes read_csv return an iterator (a TextFileReader) instead of a DataFrame
econtable = pd.read_csv('data/1-AEO2011.csv',header=0,chunksize=10)
print(type(econtable))

<class 'pandas.io.parsers.TextFileReader'>


In [103]:
## As an iterator, we can call next
table = next(econtable)

In [106]:
## It produces a DataFrame, of shape chunksize x p (p is the number of variables in the table)
print(type(table))
table.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,1-AEO2011,Unnamed: 1,Case,Region,Row,Quantity,Units (unless otherwise specified),2008,2009,2010,...,2026,2027,2028,2029,2030,2031,2032,2033,2034,2035
0,Table 1,"Total Energy Supply, Disposition, and Price Su...",Reference case,United States,1,"Total Energy Supply, Disposition, and Price Su...",,,,,...,,,,,,,,,,
1,,,Reference case,United States,2,"(quadrillion Btu, unless otherwise noted)",,,,,...,,,,,,,,,,
2,,,Reference case,United States,3,Production,,,,,...,,,,,,,,,,
3,,,Reference case,United States,4,Crude Oil and Lease Condensate,quadrillion Btu,10.51,11.34,11.87,...,12.5,12.44,12.53,12.48,12.49,12.63,12.91,13.01,13.04,12.8
4,,,Reference case,United States,5,Natural Gas Plant Liquids,quadrillion Btu,2.41,2.57,2.64,...,3.57,3.61,3.65,3.67,3.71,3.75,3.78,3.82,3.86,3.92
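To make the chunk shapes concrete, here is a minimal sketch that iterates over every chunk of a small in-memory table. The five-row sample and the names `sample`, `reader`, and `shapes` are assumptions for illustration; the file on disk behaves the same way with its own row and column counts.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file: 5 data rows, 2 columns.
sample = "a,b\n1,10\n2,20\n3,30\n4,40\n5,50\n"

# With chunksize=2, each iteration yields a DataFrame of at most 2 rows.
reader = pd.read_csv(io.StringIO(sample), header=0, chunksize=2)
shapes = [chunk.shape for chunk in reader]
print(shapes)
```

Every chunk has chunksize rows except possibly the last, which holds whatever rows remain.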


## Reading in chunks with pandas

The previous code read in all of the data at once, and only then could we work with it.  Pandas also provides an easy way to read the data in chunks that we process in sequence.  Here I take the mean of each column over the whole dataset.
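The idea is to keep a running sum of each column and a running row count, then divide once at the end. The sketch below shows this on a small in-memory table and checks it against the mean computed from the whole table. The sample data and the names `total`, `n`, and `chunked_mean` are hypothetical, and restricting the sums to numeric columns with numeric_only=True is a choice made for this sketch.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file: one text column and two numeric columns.
sample = "name,x,y\nr1,1.0,10.0\nr2,2.0,20.0\nr3,3.0,30.0\nr4,4.0,40.0\nr5,5.0,50.0\n"

total = None   # running column sums (a pandas Series once the first chunk is read)
n = 0          # running row count

for chunk in pd.read_csv(io.StringIO(sample), header=0, chunksize=2):
    part = chunk.sum(numeric_only=True)            # column sums for this chunk
    total = part if total is None else total + part
    n += len(chunk)

chunked_mean = total / n

# For comparison: the mean computed from the whole table at once.
full_mean = pd.read_csv(io.StringIO(sample)).mean(numeric_only=True)
print(chunked_mean)
print(full_mean)
```

Only one chunk is held in memory at a time, so the approach scales to files that do not fit in memory. Note that sum skips missing values while the row count includes every row, so dividing by the row count matches the column mean only when the column has no missing values.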

In [15]:
import numpy as np # I'll use numpy for np.array

In [17]:
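## Alternatively we can read the data in chunks
econreader = pd.read_csv('data/1-AEO2011.csv',header=0,chunksize=10)
## For each chunk, pair the column sums with the row count, then add the pairs up with a genexp.
## The builtin sum is used because np.sum on a generator is deprecated; dtype=object keeps the pair as two elements.
Esum, n = sum(np.array([et.sum(), et.shape[0]], dtype=object) for et in econreader)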

In [20]:
econmean = Esum / n ## Calculate the mean

In [22]:
print(econmean)

1-AEO2011                                 NaN
2008                                  17.9125
2009                                  14.4231
2010                                  16.0162
2011                                  16.4398
2012                                   16.742
2013                                   17.127
2014                                    17.41
2015                                  17.7537
2016                                  18.0941
2017                                  18.4604
2018                                  18.8288
2019                                  19.1982
2020                                  19.5986
2021                                  19.9517
2022                                   20.298
2023                                  20.6516
2024                                  20.9878
2025                                  21.3221
2026                                  21.6338
2027                                  21.9346
2028                              