call log project
Efficient Computation on Large Spatiotemporal Network Data
- Ian Kelley, Ph.D., Research Consultant, Information School
- Andrew Whitaker, Ph.D., Research Scientist, eScience Institute
The pervasive and rich data available in today’s networked computing environment provides major opportunities for innovative data-intensive applications. Particularly challenging are data analysis projects that rely upon input from millions of sparse, high-dimensional, and dirty data files, which can be difficult and time-consuming to analyze.
The goal of this project was to develop methods and infrastructure for analyzing large-scale call detail record (CDR) data. The first stage of the investigation identified the computational and logistical challenges of collecting, storing, and analyzing this type of data. The next stage evaluated different tools, environments, and middleware that could support the data workflows needed for the analysis. Due to the size and heterogeneity of the datasets, the project focused on current state-of-the-art "big data" systems such as MapReduce, Hive, Shark, and Spark.
Call Detail Records (CDRs) are one such set of information artifacts, consisting of metadata about mobile phone network calls that are passively collected in log files. These records can provide rich information that is useful for explorations ranging from mobility analysis and location inference to calculating probabilities of new product adoption.
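As a rough illustration of the kind of location inference mentioned above, the following minimal sketch estimates each caller's most frequently used cell tower from a handful of records. The column names (`caller_id`, `callee_id`, `timestamp`, `duration_s`, `tower_id`) and the sample rows are assumptions made up for this example; they are not the schema or data used in the project, and real CDR analysis at scale would run on a distributed system rather than in a single process.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical CDR layout, assumed for illustration only:
# caller_id, callee_id, timestamp, duration_s, tower_id
SAMPLE_CDR = """\
caller_id,callee_id,timestamp,duration_s,tower_id
A,B,2014-01-01T08:00:00,60,T1
A,C,2014-01-01T09:30:00,30,T1
A,B,2014-01-01T20:15:00,120,T2
B,A,2014-01-02T07:45:00,45,T3
"""


def most_used_tower(cdr_file):
    """Map each caller to the tower that handled most of their calls."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(cdr_file):
        # Count one call per (caller, tower) pair.
        counts[row["caller_id"]][row["tower_id"]] += 1
    # Pick the single most common tower for each caller.
    return {
        caller: towers.most_common(1)[0][0]
        for caller, towers in counts.items()
    }


result = most_used_tower(io.StringIO(SAMPLE_CDR))
print(result)
```

The per-caller counting step is a simple group-and-aggregate, which is the same shape of computation that systems such as MapReduce or Spark distribute across many machines.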
Many aspects of this incubator project contributed to the design of the computational environment at the Information School's DataLab. The project not only provided expertise in deploying the resulting system, but was also instrumental in the data pipeline evaluations.
The resultant DataLab processing environment is depicted as follows:
This was the result of an in-depth evaluation of the following technologies, some of which have benchmarks listed below:
Research Output & Future Directions
The work done in the incubator also contributed to the development of a research paper, and helped to define future development and directions.
- I. Kelley and J. Blumenstock, “Computational Challenges in the Analysis of Large, Sparse, Spatiotemporal Data,” in Proceedings of the ACM HPDC Sixth International Workshop on Data Intensive Distributed Computing (DIDC’14), Vancouver, Canada, June 2014. http://dl.acm.org/citation.cfm?id=2608025