call log project

Andrew Whitaker edited this page Aug 26, 2014 · 22 revisions

Efficient Computation on Large Spatiotemporal Network Data

Key Personnel

  • Ian Kelley, Ph.D., Research Consultant, Information School
  • Andrew Whitaker, Ph.D., Research Scientist, eScience Institute

Background

The pervasive and rich data available in today’s networked computing environment provide major opportunities for innovative data-intensive applications. Particularly challenging are data analysis projects that rely on input from millions of sparse, high-dimensional, and dirty data files that can be difficult and time-consuming to analyze.

Goals

The goal of this project was to develop methods and infrastructure for analyzing large-scale call detail record (CDR) data. The first step was to identify the computational and logistical challenges of collecting, storing, and analyzing this type of data. The next stage evaluated the tools, environments, and middleware that could support the data workflows needed for that analysis. Because of the size, heterogeneity, and scale of the datasets, the project emphasized current state-of-the-art "big data" systems such as MapReduce, Hive, Shark, and Spark.

Call detail records (CDRs) are one such set of information artifacts: metadata about mobile phone network calls, passively collected in log files. These records provide rich information for explorations ranging from mobility analysis and location inference to calculating probabilities of new product adoption.
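To make the shape of this data concrete, the following is a minimal sketch of how a CDR log might be parsed and aggregated. The five-field schema (`caller_id`, `callee_id`, `timestamp`, `duration_s`, `tower_id`), the helper names, and the sample rows are all assumptions made for illustration; they are not the project's actual format or data. Counting the distinct towers each caller appears at is used here only as a simple stand-in for the kind of per-subscriber aggregation a mobility analysis might start from.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, NamedTuple


class CallRecord(NamedTuple):
    """One CDR row. Field names are illustrative assumptions, not the project's schema."""
    caller_id: str
    callee_id: str
    timestamp: str   # ISO 8601 string, left unparsed for brevity
    duration_s: int
    tower_id: str


def parse_cdrs(text: str) -> List[CallRecord]:
    """Parse CSV text into records, skipping malformed ("dirty") rows."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 5:
            continue  # wrong number of fields
        caller, callee, ts, duration, tower = (field.strip() for field in row)
        if not caller or not tower or not duration.isdigit():
            continue  # missing key field or non-numeric duration
        records.append(CallRecord(caller, callee, ts, int(duration), tower))
    return records


def towers_per_caller(records: List[CallRecord]) -> Dict[str, int]:
    """Count the distinct towers each caller was seen at."""
    towers = defaultdict(set)
    for rec in records:
        towers[rec.caller_id].add(rec.tower_id)
    return {caller: len(seen) for caller, seen in towers.items()}


# Hypothetical sample log: three valid rows and two dirty ones.
SAMPLE = """\
A,B,2014-01-01T08:00:00,60,T1
A,C,2014-01-01T18:30:00,120,T2
B,A,2014-01-02T09:15:00,30,T1
B,A,bad-row
A,B,2014-01-03T08:05:00,,T1
"""

if __name__ == "__main__":
    print(towers_per_caller(parse_cdrs(SAMPLE)))
```

The grouping step is a group-by-key followed by a per-key reduction, which is the general pattern that systems such as MapReduce and Spark distribute across many files; this single-process version only illustrates the logic.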

Actionable Output

Many aspects of this incubator project contributed to the design of the computational environment at the Information School's DataLab. The project not only provided expertise in deploying the resulting system, but was also instrumental in the data pipeline evaluations.

The resultant DataLab processing environment is depicted as follows:

This was the result of an in-depth evaluation of the following technologies, some of which have benchmarks listed below:

Research Output & Future Directions

The work done in the incubator also contributed to a research paper and helped to define directions for future development.

  • I. Kelley and J. Blumenstock, “Computational Challenges in the Analysis of Large, Sparse, Spatiotemporal Data,” in Proceedings of the ACM HPDC Sixth International Workshop on Data Intensive Distributed Computing (DIDC’14), Vancouver, Canada, June 2014. http://dl.acm.org/citation.cfm?id=2608025

Project Links
