## Project Summary

Since 2005, the U.S. airline industry has experienced the most dramatic merger activity in its history, reducing the number of major carriers in the U.S. from eight to four. My project seeks to provide novel estimates of the changes in consumer and producer welfare in the U.S. due to these mergers. To do so, I estimate a dynamic model of route competition using the entire DB1B dataset, a 10% sample of all airline tickets in the U.S. from 1993 on, provided by the U.S. Department of Transportation. This dataset is large, consisting of roughly 5 million observations per quarter. Further, to estimate the parameters of the dynamic game, I use a simulation and estimation approach, which requires enlarging the DB1B dataset to accommodate routes that do not appear in the data but that might have existed had these mergers been prevented. This data augmentation step increases the number of observations to 11 million per quarter. At this scale, running my simulation in the R programming language is computationally infeasible on my laptop. The eScience Fall 2014 incubator project consists of creating software that allows my simulation to run in parallel on an Amazon EC2 instance, drastically speeding up the computations and allowing me to complete multiple iterations of my simulation. The tasks consist of 1) a data augmentation (DA) step, 2) a value function simulation and estimation (VFE) step, and 3) a counterfactual simulation (CS) step.
For a recent update, please see our blog.
For a final project report, please see below.
## Project Goals

#### Minimum

- Complete the DA step on a 1% DB1B sample
- Run the VFE step completely on a 1% DB1B sample using an Amazon EC2 instance

#### Expected

- Complete "Minimum" goals
- Run the CS step on a 1% DB1B sample using an Amazon EC2 instance

#### Stretch

- Complete "Expected" goals
- Run the VFE and CS steps on larger than 1% samples
- Visualize and interpret initial results
- Project lead: Carlos A. Manzanares (email@example.com), PhD Candidate, Department of Economics, Vanderbilt University (visiting PhD student, Department of Economics, University of Washington).
## Final Report

My primary goal for this project is to simulate the U.S. airline route offerings and prices that would have resulted had all large-scale airline mergers since 2008 been blocked by the U.S. Department of Justice. These counterfactual route offerings and prices can be compared with actual route offerings and prices to determine the "treatment effect" of allowing these mergers to take place, hopefully informing antitrust policy in the future.
The data used to produce this simulation come from the DB1B database, a 10% sample of all airline tickets from 1993 to the present, collected and stored by the U.S. Department of Transportation. The collection of this dataset is a remnant of the days when airlines were heavily regulated (prior to the Airline Deregulation Act of 1978), and it provides a level of detail and comprehensiveness rare in the economics profession. As a consequence, there is a long tradition of economics articles written using samples of DB1B data (and its sister dataset, the T100 database); however, no articles in our profession (to my knowledge) have used the entire DB1B dataset, primarily due to its size. Researchers often resort to studying airline behavior using samples drawn from the largest U.S. airports, leaving behavior at other classes of airports understudied. For example, since mergers likely have large effects on airline offerings at regional airports, including these airports in the analysis would be useful. For my project, since I'm attempting to approximate welfare changes due to airline mergers across the entire U.S., it's imperative that I use as large a sample of the DB1B data as possible. In fact, my medium-term goal is to use the entire DB1B dataset to estimate a dynamic model of airline behavior, which, to my knowledge, has never been attempted. My collaboration with the eScience Institute to enhance the scale of my computation is therefore expected to advance our understanding of recent changes in airline market structure in an important way.
The project involves three general steps. The first is the preparation of my data in a data augmentation (DA) step. For this step, I start with the raw DB1B dataset, filtered to include only the top 60 U.S. airports (I will expand this to all U.S. airports now that the infrastructure is complete). The filtered raw DB1B data contain roughly 5 million rows per quarter, where I keep the following information per row: fare, origin, destination, stop, carrier, and quarter. This yields a csv file with 6 columns and 5M rows. I then augment the data by 1) "collapsing" it into one of 11 fare bins (creating rows that are tuples of fare bin, origin, destination, stop, carrier, and quarter, defined as "products" in my analysis) and 2) adding additional features per row. The collapsing process actually increases the number of rows, since I create rows for products that do not necessarily exist in the data but that could exist in my counterfactual simulation (e.g., a particular fare bin, origin, destination, stop, carrier, quarter tuple that could have existed if firms had remained unmerged during my time period of interest). An example of an additional feature is the distance between the origin and destination. I create 161 such features, resulting in an augmented dataset with 11 million rows per quarter and 167 columns. My analysis focuses on 2003q1 to 2013q4 (44 quarters), with 20 quarters (2003q1 to 2007q4) in the "training" dataset and 24 quarters (2008q1 to 2013q4) in the "testing" dataset. At the start of the eScience Fall incubator, this data augmentation step was coded exclusively in Stata and ran very slowly on my laptop. As I document below, the eScience Institute convinced me this was inefficient and persuaded me to convert this code to R so that it can be run on an Amazon EC2 instance.
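To make the collapsing step concrete, here is a toy sketch in Python (the production code is in Stata/R); the bin edges, column order, and function names below are illustrative assumptions, not the actual ones used in the project:

```python
import bisect
from collections import Counter

# Hypothetical fare bin edges; 10 edges partition fares into 11 bins.
# The actual cut points used in the analysis are not shown here.
FARE_EDGES = [50, 100, 150, 200, 250, 300, 400, 500, 750, 1000]

def fare_bin(fare):
    """Map a fare to one of 11 bins via binary search over the edges."""
    return bisect.bisect_left(FARE_EDGES, fare)

def collapse_to_products(tickets):
    """Collapse (fare, origin, dest, stop, carrier, quarter) tickets into
    fare-bin 'products', counting how many tickets fall in each product."""
    counts = Counter()
    for fare, origin, dest, stop, carrier, quarter in tickets:
        counts[(fare_bin(fare), origin, dest, stop, carrier, quarter)] += 1
    return counts
```

In the real DA step the product rows also receive the 161 additional features (e.g., origin-destination distance) and rows are created for counterfactual products that never appear in the raw data.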
The second step involves simulating and estimating the "value function" (VFE step), the foundational object of interest when estimating parameters of dynamic Markov decision processes. Leaving the deeper theoretical details aside (I'm happy to share them with interested parties), in a nutshell I'm primarily interested in estimating the implied structural parameters of unobserved entry costs and salvage values, and I do so through an equilibrium assumption which implies that observed actions are optimal. The simulation and estimation procedure is coded in R and processes one quarter of augmented data (from the DA step) at a time, for 20 quarters sequentially. The output of this step is 1000 csv files, each with 11M rows and one column. The final step is to take a simple average over the corresponding line in each of the 1000 csv files, returning one csv file of 11M row-wise averages.
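The line-by-line averaging over the 1000 single-column files can be sketched as follows. This is an illustrative Python version (the production averaging code is Andrew's, and its details may differ), with hypothetical file naming:

```python
import csv
import glob

def average_csvs(pattern, out_path):
    """Stream N single-column CSVs in lockstep and write the line-by-line mean.

    Reading all files simultaneously keeps memory at O(number of files)
    rather than O(rows), which matters when each file has 11M rows.
    """
    paths = sorted(glob.glob(pattern))
    files = [open(p, newline="") for p in paths]
    readers = [csv.reader(f) for f in files]
    try:
        with open(out_path, "w", newline="") as out:
            writer = csv.writer(out)
            for rows in zip(*readers):  # one line from each file at a time
                mean = sum(float(r[0]) for r in rows) / len(rows)
                writer.writerow([mean])
    finally:
        for f in files:
            f.close()
```

Note that holding 1000 files open at once may require raising the OS open-file limit; chunking the file list is an easy fallback.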
The third step is the counterfactual simulation step (CS step), where, using the parameters estimated from the VFE step, I restrict merger activity and simulate the route offerings and prices that would have resulted from 2008q1 to 2013q4 had all major U.S. airline mergers been prevented. This simulation is also coded in R and is very similar to the VFE step in that it processes one quarter of augmented data from the DA step at a time for 24 quarters, sequentially. The output of this step is 1000 csv files, each with 11M rows. One difference between the CS step and VFE step is that I don't desire an average over these csv's. Instead, I wish to recover and store the entire distribution of entries for each line.
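One way to keep the whole simulated distribution rather than its mean is to store per-line quantiles. The sketch below is a hypothetical Python illustration (the actual CS post-processing may store the draws differently), using the same streaming pattern as the averaging step:

```python
import csv
import glob

def line_quantiles(pattern, out_path, probs=(0.05, 0.5, 0.95)):
    """For each line index, gather that line's value from every simulation
    file and write summary quantiles of the simulated distribution."""
    files = [open(p, newline="") for p in sorted(glob.glob(pattern))]
    readers = [csv.reader(f) for f in files]
    try:
        with open(out_path, "w", newline="") as out:
            writer = csv.writer(out)
            for rows in zip(*readers):  # one line from each file at a time
                draws = sorted(float(r[0]) for r in rows)
                # Crude empirical quantile: index into the sorted draws.
                writer.writerow(
                    [draws[min(int(p * len(draws)), len(draws) - 1)] for p in probs]
                )
    finally:
        for f in files:
            f.close()
```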
I began the eScience Fall incubator having completed the DA step and the VFE step R code for a 1% sample of the DB1B dataset. Andrew Whitaker, with whom I collaborated most frequently during the quarter, encouraged me to run the VFE simulation on 1% of the data using an Amazon EC2 instance to observe the RAM consumption. We used an Amazon EC2 instance that provides 61GB of RAM (r3.2xlarge), and even on a 1% DB1B sample, the instance would terminate the R program, indicating that my R code was extremely inefficient, memory-wise.
Rather than continuing to scale up the RAM capacity of the EC2 instance, we began a quick and rudimentary process of optimizing my code. This involved cutting each quarter of data into 8 pieces and updating my VFE R code to process one piece at a time, sequentially. This modification worked well: we ran the VFE R code on an r3.2xlarge Amazon EC2 instance and discovered, using the htop utility, that my R script used, at its peak, 6GB of RAM.
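The chunking idea can be sketched in Python (the actual splitting was done around the R code; the function below is an illustrative assumption, not the project's implementation):

```python
def iter_chunks(path, n_chunks=8):
    """Yield the rows of one quarter's file in n_chunks roughly equal
    pieces, so only one piece is resident in memory at a time."""
    with open(path) as f:
        total = sum(1 for _ in f)  # first pass: count rows
    size = -(-total // n_chunks)  # ceiling division
    with open(path) as f:
        chunk = []
        for line in f:
            chunk.append(line.rstrip("\n"))
            if len(chunk) == size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk
```

Each yielded piece is processed and discarded before the next is read, which bounds peak RAM at roughly 1/n_chunks of the quarter plus working storage.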
This initial success encouraged us to scale the simulation up to 100% of my augmented DB1B dataset (which represents 64% of the entire DB1B sample; I hope to scale this up to 100% of the DB1B sample soon). The process of adapting my Stata code and generating my full augmented sample took two weeks on my laptop, prompting Andrew to suggest that I rewrite the Stata code in R so that we could implement the data augmentation step (and store the augmented and raw data) on an Amazon EC2 instance. This recoding task was started during the fall and is ongoing today. I additionally modified my VFE R code to accommodate the full augmented sample, which took an additional week.
Finally, we tested the VFE R code on the full augmented sample using the r3.2xlarge Amazon EC2 instance (61GB capacity). My R script utilized more than 61GB of RAM and was terminated. This prompted what is, in my view, the most fruitful product of our collaboration during the fall incubator. Frustrated by the inefficiency of R's RAM consumption when storing large matrices, Andrew discovered through inspection that the matrices I invoked during my R simulation were "sparse," i.e., they contained very few non-zero entries. Based on this observation, he helped me seek out R packages designed to carry out the functions used in my simulation on sparse matrices. As it turns out, there is a broad set of special-purpose sparse matrix packages in R, and all of the functions I needed for my simulation were available through these packages. I therefore recoded my R simulation to use these sparse packages.
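To illustrate why the sparse representation helps: storing only the non-zero entries turns an intractably large dense matrix into a small set of (row, column, value) triplets, and operations like matrix-vector products cost O(non-zeros) rather than O(rows × columns). The toy Python class below demonstrates the idea; the project itself relies on existing R sparse matrix packages, not this code:

```python
class SparseMatrix:
    """Minimal coordinate-format (COO) sparse matrix: store only non-zeros.

    A dense 11M-row matrix of doubles would not fit in 61GB of RAM, but if
    almost all entries are zero, this triplet representation stays small.
    """
    def __init__(self, shape):
        self.shape = shape
        self.data = {}  # (row, col) -> non-zero value

    def __setitem__(self, key, value):
        if value != 0:  # zeros are never stored
            self.data[key] = value

    def matvec(self, x):
        """Sparse matrix-vector product: touch only the stored entries."""
        y = [0.0] * self.shape[0]
        for (i, j), v in self.data.items():
            y[i] += v * x[j]
        return y
```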
We implemented the revised "sparse package" R script on the r3.2xlarge Amazon EC2 instance on December 2, 2014, two days before the eScience final presentations, and experienced the most encouraging result of the eScience Fall incubator: the "sparse package" R script worked on 100% of my sample, which utilizes 64% of the entire raw DB1B sample. Implementing the VFE step at this scale of data was not envisioned prior to beginning the eScience task, and it allows me to complete the VFE step in roughly 5 hours per run. For comparison, a similar task would take several weeks on my laptop. Given the efficiency with which the R script now utilizes RAM, I am in the process of scaling the calculation up to 100% of the raw DB1B data, a scale that, to my knowledge, has never been utilized in economics (and certainly not to estimate a nation-wide dynamic model of route offerings). The VFE simulation task was further enabled by parallelization code in Python (developed by Andrew) that runs the simulation in an embarrassingly parallel manner. Additionally, Andrew developed Python code to calculate averages over the 1000 csv's obtained from the VFE step. These two Python scripts will also be used in the CS step. Currently, I've implemented the VFE step on 2 of 20 training quarters. Scaling this up to the full 20 is not expected to be difficult, since the quarters are consumed sequentially (meaning we've likely hit the RAM ceiling for the R script).
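The parallelization is "embarrassing" because each simulation run is independent of all the others and writes its own csv. A minimal Python sketch of that structure is below; `run_simulation` is a placeholder standing in for a launch of the R script, not Andrew's actual driver code:

```python
from multiprocessing import Pool

def run_simulation(run_id):
    """Stand-in for one independent simulation run. In the real pipeline
    this would invoke the R script and write one output csv per run_id;
    here it just returns a value so the sketch is self-contained."""
    return run_id * run_id

def run_all(n_runs, n_workers=4):
    """Fan the independent runs out across worker processes. No run shares
    state with any other, which is what makes this embarrassingly parallel."""
    with Pool(n_workers) as pool:
        return pool.map(run_simulation, range(n_runs))
```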
My eScience Fall Incubator experience can best be summarized by the following chart, which details RAM consumption as a function of time and the percentage of the raw DB1B sample consumed.
In summary, I've listed the original goals with their status:
##### Project Goals

###### Minimum

- Complete the DA step on a 1% DB1B sample
  - STATUS: more than complete (64% DB1B sample on 17 of 20 quarters).
- Run the VFE step completely on a 1% DB1B sample using an Amazon EC2 instance
  - STATUS: more than complete (64% DB1B sample on 2 of 20 quarters).

###### Expected

- Complete "Minimum" goals
  - STATUS: more than complete (see above).
- Run the CS step on a 1% DB1B sample using an Amazon EC2 instance
  - STATUS: incomplete, but given the similarity of the VFE and CS steps, this is attainable soon.

###### Stretch

- Complete "Expected" goals
  - STATUS: nearly complete (see above).
- Run the VFE and CS steps on larger than 1% samples
  - STATUS: more than complete for VFE (64% raw DB1B sample); nearly complete (and will be exceeded) for the CS step.
- Visualize and interpret initial results
  - STATUS: incomplete, but attainable with completion of the CS step.
During the eScience Fall 2014 Incubator, I learned an extraordinary amount. I want to highlight some of the skills I acquired through the program:
- Using the terminal to access and run R scripts on Amazon AWS
- Selecting appropriate EC2 instances on Amazon
- Using the htop utility to observe RAM consumption
- Incorporating sparse matrix R packages into large-scale statistical computations
- Parallelizing computations using Python
- Performing line-by-line averages across multiple csv's using Python
One additional note: I found the casual interactions with eScience Institute incubator participants and staff to be very helpful. Learning about the daily developments of projects in other fields helped place my work into context and expanded my understanding of the diversity of "big data" tasks. Additionally, I found it helpful to passively absorb the interactions of others while physically working alongside other participants.
Overall, I found the eScience Institute Fall 2014 Incubator program at UW to be extremely rewarding. I am grateful to have participated in the program and look forward to continued interactions with eScience Institute collaborators in the future. I am grateful for the extensive help of Andrew Whitaker not only during the Fall 2014 Incubator program but also during the months leading up to the Fall. I am also grateful for the ongoing interactions with Bill Howe, Dan Halperin, Jake Vanderplas, Brittany Fiore-Silfvast, Anissa Tanweer, and the other participants of the Fall 2014 incubator program. I would like to especially thank Bill Howe for taking an interest in my research and facilitating my collaboration with the eScience Institute beginning in December of 2013.