
paTest Demo

*** This demo is not yet available for download ***

After installing the SigSRF SDK eval, below are notes and example command lines for the paTest 1 demo. The demo has two (2) purposes:

  • show a predictive analytics application that applies algorithms and deep learning to continuous log data in order to predict failure anomalies

  • provide an example application, including Java and C/C++ source code, that interfaces to Spark and SigSRF software, and shows examples of API usage for both

1 paTest = predictive analytics test

Other Demos

mediaTest Demo (Streaming Media, Buffering, Transcoding, and Packet RFCs)

iaTest Demo (Image Analytics)

Table of Contents

Predictive Analytics from Log Data
    Data Flow Diagram
    Theory
    Log Data Requirements and Format
    Converting Log Data to Time Series
    Frequency Domain Contour Images
Java Source and Spark API Excerpts
Demo Notes
coCPU Notes

Predictive Analytics from Log Data

The demo uses a real case as its basis, parsing unstructured logs generated by a high capacity telephony system (they've been scrubbed of proprietary information). This system handles over 8000 simultaneous sessions, with continuous session setup and tear down, and with varying session duration, using Texas Instruments C6678 CPUs on an ATCA board. A PowerPC CPU on the ATCA board handles session control and initiates sessions; the TI CPUs handle session data over 10 GbE network connections.

During capacity testing a highly infrequent and intermittent anomaly was found to occur, which required painstaking and time-consuming debug effort over the course of several weeks. Eventually it turned out (i) an error occurred in DDR3 memory that required several conditions to coincide, and (ii) the error is related to the "RowHammer" phenomenon that can occur with state-of-the-art DDR3 memories (RowHammer involves repeated, rapid access of a memory row, which can disturb cells in rows adjacent to it in the chip's physical silicon layout). Correcting the error required slight -- but unforeseen -- changes to DDR3 memory chip setup (timing parameters). This case was further remarkable in that all other, independent memory tests created for both the DDR3 chips and the board failed to induce the problem; the required combination of conditions was so unusual that it only occurred during production stress testing.

In post-case analysis, one thing that stood out is that most of the debug effort was spent forcing the error to occur more often, which made it easier to observe and allowed a substantial increase in the rate of testing and software debug insertions. Otherwise, the system could run for several days until the error happened to occur in a way that affected system performance (typically manifesting as a random "core crash", or software critical error in the log data). In general, it's not unusual to spend a disproportionately high amount of debug effort during final production testing -- the last few problems tend to be the most difficult to isolate and identify (a colloquial expression for this situation is the "90-90 rule"). What was unusual in this case was a problem so intermittent that the error rate was less than 1 in 10^17 memory accesses (there were 160 CPU cores on the ATCA board, and 40 GB of DDR3 mem).

Fortunately debug efforts were eventually successful, culminating in a "row marker" memory test that could be inserted into the software at various points in order to catch the error in action, adding a specific, precisely timed event to the log. This allowed possible stress conditions to be correlated using log timestamps, the nature of the problem to be characterized, and finally, DDR3 chip timing parameter changes to be implemented and then verified as the root cause resolution to the problem.

During post-case debrief discussions, one common question among the engineers involved was whether deep learning methods might have been used to identify operational trends occurring temporally near the core crashes, which were visible early in the debug process, thus predicting which stress conditions to emphasize to make the error occur more frequently. If so, then potentially weeks of engineering time could be saved for future production systems with tough, intermittent issues.

Data Flow Diagram

Below is a data flow diagram showing I/O, algorithms, and convolutional neural networks used to predict anomalies in log data.

 

Log data predictive analytics data flow diagram

 

As shown in the above diagram, the approach centers around the concept of converting continuous log measurement data into a series of images, which are then used to train a convolutional neural network. A primary objective of this approach is to take advantage of algorithms, training methods, and inference performance made available by the current state of the art in computer vision.

Theory

Performing "recognition" based on frequency domain data is not a new approach, having been well-established as the primary basis for human vision and speech recognition. For speech, the waveform displays below give an example.

 

2-D spectrograph (frequency domain) display of speech time series data

 

In the above waveform displays, the upper display shows time series data (in yellow), and the lower display shows the equivalent frequency domain data as a "2-D spectrograph", with frequency on the y-axis, time on the x-axis, and amplitude as color coded.

The spectrograph display is actually a series of STFFT output frames, each representing around 20 msec of time series data. For speech, 20 msec is the natural "framesize" of the underlying time series data produced by a human vocal tract. For speech recognition, phoneme segmentation is used to form images for CNN input, where each image consists of 8 to 10 STFFT output frames. For predictive analytics, the natural framesize will vary depending on the specific system under test (SUT) and the nature of the data processed by the system. In the case study being used for this demo, the natural framesize is about 100 usec and each image represents around 40000 STFFT output frames (about 4 sec).

Once the framesize is known, a short-time FFT (Fourier analysis) can be performed to generate output frames and images. As noted in the above data flow diagram, overlap and windowing are used to calculate the STFFT. The overlap + windowing method is sometimes referred to as a "sliding FFT". The FFT treats each frame as if it repeats periodically, so it's important to eliminate "edges" (discontinuities) at frame boundaries, which would otherwise show up as spurious wideband energy; overlap + windowing accomplishes this. It's also important to ensure that all incoming time series data points are evenly spaced; in signal processing, the number of evenly spaced points per second is known as the "sampling rate".
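The overlap + windowing step described above can be sketched in plain Java. This is a minimal illustration only, not the demo's implementation: the class name, the choice of a Hann window, the 50% overlap in the usage example, and the naive O(n^2) DFT (used here in place of a real FFT to keep the sketch dependency-free) are all assumptions.

```java
/* Hypothetical sketch of a "sliding FFT": windowed, overlapped frames -> magnitude spectra */
final class SlidingFftSketch {

   /* Hann window coefficients for a frame of n samples (window choice is an assumption) */
   public static double[] hannWindow(int n) {
      double[] w = new double[n];
      for (int i = 0; i < n; i++) {
         w[i] = 0.5 * (1.0 - Math.cos(2.0 * Math.PI * i / (n - 1)));
      }
      return w;
   }

   /* magnitude of DFT bins 0..n/2 for one frame; naive DFT for clarity, not speed */
   public static double[] magnitudeSpectrum(double[] frame) {
      int n = frame.length;
      double[] mag = new double[n / 2 + 1];
      for (int k = 0; k <= n / 2; k++) {
         double re = 0.0, im = 0.0;
         for (int t = 0; t < n; t++) {
            double angle = -2.0 * Math.PI * k * t / n;
            re += frame[t] * Math.cos(angle);
            im += frame[t] * Math.sin(angle);
         }
         mag[k] = Math.hypot(re, im);
      }
      return mag;
   }

   /* step through x in increments of hop samples, window each frame, return one spectrum per frame */
   public static double[][] slidingFft(double[] x, int frameSize, int hop) {
      if (x.length < frameSize) return new double[0][];
      double[] w = hannWindow(frameSize);
      int numFrames = (x.length - frameSize) / hop + 1;
      double[][] out = new double[numFrames][];
      double[] frame = new double[frameSize];
      for (int f = 0; f < numFrames; f++) {
         for (int i = 0; i < frameSize; i++) frame[i] = x[f * hop + i] * w[i];
         out[f] = magnitudeSpectrum(frame);
      }
      return out;
   }

   public static void main(String[] args) {
      /* test tone with 8 cycles per 64-sample frame, 50% overlap */
      int frameSize = 64, hop = 32;
      double[] x = new double[256];
      for (int i = 0; i < x.length; i++) x[i] = Math.sin(2.0 * Math.PI * 8 * i / frameSize);
      double[][] spec = slidingFft(x, frameSize, hop);
      System.out.println("frames: " + spec.length + ", bins per frame: " + spec[0].length);
   }
}
```

Each row of the returned array corresponds to one STFFT output frame; stacking rows along the time axis and converting magnitudes to log units would give the kind of 2-D image described in this section.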

To create the time series input to the STFFT, multiple measurement time series must be combined, and interpolated where necessary to ensure evenly spaced data points (also known as "gap filling"). The data flow diagram above shows a regression neural net model with one hidden layer calculating a weighted combination of multiple log data measurements. This neural net is trained with a labeled data set; i.e. comparing measurement data during normal operation and near the anomaly. At this point in the data flow, the objective is to eliminate measurements that have little or no statistical correlation with the anomaly, not to predict it, so this neural net acts as a qualifier.
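As a rough illustration of the "weighted combination" above, the following sketch applies a one-hidden-layer regression network sample by sample to several feature time series. Everything specific here is an assumption made for illustration: the class and method names, the tanh activation, and the weight layout. The source does not describe the demo's actual network or training code, and the weights would come from training on the labeled data set.

```java
/* Hypothetical sketch: forward pass of a one-hidden-layer regression net used as a feature combiner */
final class CombinerSketch {

   /* y = w2 . tanh(W1 x + b1) + b2 for a single time step (tanh activation is an assumption) */
   public static double forward(double[] x, double[][] w1, double[] b1, double[] w2, double b2) {
      double y = b2;
      for (int h = 0; h < w1.length; h++) {
         double z = b1[h];
         for (int i = 0; i < x.length; i++) z += w1[h][i] * x[i];
         y += w2[h] * Math.tanh(z);
      }
      return y;
   }

   /* features[f][t] holds feature f at time step t; returns one combined value per time step */
   public static double[] combine(double[][] features, double[][] w1, double[] b1, double[] w2, double b2) {
      int n = features[0].length;
      double[] out = new double[n];
      double[] x = new double[features.length];
      for (int t = 0; t < n; t++) {
         for (int f = 0; f < features.length; f++) x[f] = features[f][t];
         out[t] = forward(x, w1, b1, w2, b2);
      }
      return out;
   }

   public static void main(String[] args) {
      /* two features, three time steps, one hidden unit; all values are made up for illustration */
      double[][] features = { {0.1, 0.2, 0.3}, {1.0, 0.5, 0.0} };
      double[][] w1 = { {1.0, 0.0} };   /* a zero weight effectively drops the second feature */
      double[] combined = combine(features, w1, new double[] {0.0}, new double[] {2.0}, 0.0);
      for (double v : combined) System.out.println(v);
   }
}
```

A feature whose trained weights end up near zero contributes little to the combined series, which is one way to picture the "qualifier" role described above.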

For both the time series combination neural net and the vision CNN, training occurs in two phases (i) to learn normal operation, and (ii) to learn conditions temporally near the anomaly (about a minute before and a few seconds afterwards).

Log Data Requirements and Format

The demo assumes that input log data meets the following requirements:

  • all entries have timestamps. Any entries without timestamps are ignored
  • one or more entries include measurement data. Some entries may be events, or combinations of events and measurements. In the case study example, measurement data include CPU and memory usage, number of current active calls, call setup and tear-down time, etc
  • for training purposes, some logs include the target anomaly, or other error conditions closely associated with the anomaly

Below is an excerpt from the logs used in the demo, with measurement data highlighted:

 

Log data excerpt with measurement data highlighted

 

Converting Log Data to Time Series

Note in the log data excerpt above that some entries include measurement data and some do not, which is typical of general, unstructured log formats. Also note that entry timestamps are not evenly spaced, so any extracted measurement data must be interpolated into one or more time series with uniform sampling periods, in order to apply standard signal processing algorithms.

Below is a waveform plot showing log data "number of sessions" measurements as an interpolated time series.

Time series display of the number-of-sessions feature, after log data extraction and interpolation

The demo uses Apache Commons Math APIs called from Java, in this case a linear interpolator to create a polynomial spline function to approximate missing samples in time series data.
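The interpolation step can be pictured with the dependency-free sketch below, which linearly interpolates irregularly spaced samples onto a uniform grid. It is not the demo's code (the demo uses the Apache Commons Math interpolator mentioned above); the class name, method name, and the 100 msec period in the usage example are assumptions. The sample values are the first four rows of the "num sesn" table shown later on this page.

```java
/* Hypothetical sketch: resample irregular (timestamp, value) pairs onto a uniform grid */
final class UniformResampleSketch {

   /* tsMs must be strictly increasing; output starts at tsMs[0] and advances by periodMs */
   public static double[] resample(long[] tsMs, double[] values, long periodMs) {
      if (tsMs.length != values.length || tsMs.length < 2 || periodMs <= 0) {
         throw new IllegalArgumentException("need >= 2 matching samples and a positive period");
      }
      int n = (int) ((tsMs[tsMs.length - 1] - tsMs[0]) / periodMs) + 1;
      double[] out = new double[n];
      int seg = 0;   /* index of the left-hand sample of the current segment */
      for (int i = 0; i < n; i++) {
         long t = tsMs[0] + i * periodMs;
         while (seg < tsMs.length - 2 && t > tsMs[seg + 1]) seg++;
         double frac = (double) (t - tsMs[seg]) / (tsMs[seg + 1] - tsMs[seg]);
         out[i] = values[seg] + frac * (values[seg + 1] - values[seg]);
      }
      return out;
   }

   public static void main(String[] args) {
      long[] ts = {10309858L, 10310449L, 10310519L, 10311273L};   /* msec */
      double[] numSesn = {39, 40, 39, 38};
      double[] uniform = resample(ts, numSesn, 100);   /* 100 msec period chosen for illustration */
      for (int i = 0; i < uniform.length; i++) {
         System.out.printf("%d %.3f%n", ts[0] + i * 100L, uniform[i]);
      }
   }
}
```

The output has one value per sampling period, which is the form the STFFT stage needs.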

In some cases, if long or irregular intervals between measurements make the data sparse, it may be necessary to curve fit rather than interpolate. The case study in this demo does not require that.

Frequency Domain Contour Images

As shown in the above data flow diagram, extracted log data measurements are converted to time series and given as inputs to a regression model neural network. Each input is considered a feature, and the network's output is a combined time series, which is then processed by short-time FFT analysis. For explanation purposes, below is a frequency domain 2-D contour plot showing only the number of sessions feature (i.e. not combined with other features).

Frequency domain 2D contour display of the number-of-sessions feature

Notes about the above display:

  • Time is on the horizontal axis (as with a time series plot), frequency on the vertical axis, and amplitude is in log-magnitude units indicated by a "heatmap" color scheme. Together the 3 dimensions form a contour display
  • The combination of colors is similar to "inferno" or "magma" colormaps in Matplotlib and R, which are known as perceptually uniform colormaps, and for which researchers have found that the human brain perceives equal data steps as equal color space steps. This is important for convolutional neural networks based on machine vision, in order to generate input images that approximate common real world images
  • Highlighted areas show "wideband energy" which indicates areas of rapid, sharp changes in the time series data. The term comes from signal processing, and typically refers to an edge, or discontinuity in the time-series data. In this case, since the feature is the number of concurrent sessions, such areas indicate the telephony system was rapidly opening and closing sessions

Java Source and Spark API Excerpts

Below is a simplified Java source excerpt that parses unstructured log files.

   while ((line = br.readLine()) != null) {
         
      /* check for string delineating log sections to switch file to write to */
      if (line.contains("Start of one-time log area")) {
         contLog = false;
      } else {

         /* only use lines that contain "core" */
         if (line.contains("core")) {
               
            /* only use lines that start with "core" and don't contain "core" elsewhere */
            if (line.indexOf("core", 1) == -1) {
               if (contLog) {
                  if (firstContLog) firstContLog = false;
                  else bwCont.newLine();
                  bwCont.write(line);
               } else {
                  if (firstOneLog) firstOneLog = false;
                  else bwOne.newLine();
                  bwOne.write(line);
               }
            }
         }
      }
   }

The following Java source excerpt calls Spark APIs to extract specific fields from the unstructured log data, in order to build structured data.

   data = data.filter(col("value").contains(data_field));

   if (data.count() == 0) return data;

   Column split_col = functions.split(col("value"), " ");

   data = data.withColumn("time", split_col.getItem(1));
   data = data.sort(col("time"));

   /* base time value used in log files, all timestamp values will be msec offsets from this starting time */
   long baseTimeValue;
   if (ts_offs_zero) baseTimeValue = getMilliSeconds("01/01/1900-00:00:00.000");
   else baseTimeValue = getMilliSeconds(data.select(col("time")).first().toString().replace("[","").replace("]",""));

   spark.udf().register("tsUDF", (String datetime) -> getMilliSeconds(datetime) - baseTimeValue, DataTypes.LongType);
   data = data.withColumn("ts", functions.callUDF("tsUDF", col("time")));

   data = data.withColumn(data_field, functions.regexp_extract(col("value"), data_field + " = (\\d+)", 1));

For example, after the above processing, if the "number of concurrent sessions" field were to be written to a .csv file and then displayed with an appropriate viewer, it might look something like this:

+--------+--------+
|ts      |num sesn|
+--------+--------+
|10309858|39      |
|10310449|40      |
|10310519|39      |
|10311273|38      |
|10312181|39      |
|10313911|40      |
|10314064|39      |
|10315589|40      |
|10315732|39      |
|10317302|40      |
|10318486|39      |
|10319008|40      |
|10319167|39      |
|10320760|40      |
|10320862|39      |
|10321624|38      |
|10322476|39      |
|10324206|40      |
|10324369|39      |
|10325033|38      |
+--------+--------+

where "ts" is the timestamp (in msec) and "num sesn" is the current number of sessions.
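The Spark excerpt above calls a getMilliSeconds() helper that is not shown. The sketch below gives one possible version, together with a plain-Java equivalent of the regexp_extract() step, so the timestamp and field extraction can be tried without a Spark session. These are assumptions, not the demo's code: the timestamp pattern is inferred only from the literal "01/01/1900-00:00:00.000" (month/day order is a guess), UTC is assumed, and the log line in main() is hypothetical.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/* Hypothetical sketch of the timestamp and field extraction used in the Spark excerpt */
final class LogFieldSketch {

   /* assumed pattern; the real log format may order month and day differently */
   private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("MM/dd/yyyy-HH:mm:ss.SSS");

   /* msec since 1970-01-01 UTC (negative for earlier dates); time zone is an assumption */
   public static long getMilliSeconds(String datetime) {
      return LocalDateTime.parse(datetime.trim(), FMT).toInstant(ZoneOffset.UTC).toEpochMilli();
   }

   /* same idea as regexp_extract(value, dataField + " = (\\d+)", 1); here the field name is quoted literally */
   public static OptionalLong extractField(String line, String dataField) {
      Matcher m = Pattern.compile(Pattern.quote(dataField) + " = (\\d+)").matcher(line);
      return m.find() ? OptionalLong.of(Long.parseLong(m.group(1))) : OptionalLong.empty();
   }

   public static void main(String[] args) {
      /* hypothetical log line: second space-separated token is the timestamp */
      String line = "core0 03/15/2018-10:20:30.250 num sesn = 39";
      String time = line.split(" ")[1];
      long base = getMilliSeconds("03/15/2018-10:20:30.000");
      long ts = getMilliSeconds(time) - base;
      System.out.println("ts = " + ts + ", num sesn = " + extractField(line, "num sesn").getAsLong());
   }
}
```

Subtracting a base time, as the Spark excerpt does, turns absolute timestamps into the msec offsets shown in the "ts" column.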

Install Notes

TBD

Demo Notes

An example command line is shown below:

 /home/spark-2.2.0-bin-hadoop2.7/bin/spark-submit --master local[4] target/simple-project-1.0.jar -r10000 -ilog_file.txt -omultichan_time_series.wav "num sesn" "crt time" "sesn dur"

where the input log is log_file.txt, the sampling rate is 10 kHz, and the output (multichan_time_series.wav) contains three (3) channels, corresponding to extracted data for unstructured data fields "num sesn" (number of concurrent sessions), "crt time" (session create time), and "sesn dur" (session duration).
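For readers who want to see how arguments of this shape could be interpreted, here is a minimal sketch. It is not the demo's parser, which is not shown in the source; the class and field names are hypothetical, and only the -r, -i, and -o options and the trailing field names that appear in the example command line are handled.

```java
import java.util.ArrayList;
import java.util.List;

/* Hypothetical sketch: parse -r<rate> -i<input> -o<output> followed by data field names */
final class PaTestArgsSketch {

   public int sampleRateHz = 0;
   public String inputLog = null;
   public String outputWav = null;
   public final List<String> fields = new ArrayList<>();

   public static PaTestArgsSketch parse(String[] args) {
      PaTestArgsSketch a = new PaTestArgsSketch();
      for (String arg : args) {
         if (arg.startsWith("-r")) a.sampleRateHz = Integer.parseInt(arg.substring(2));
         else if (arg.startsWith("-i")) a.inputLog = arg.substring(2);
         else if (arg.startsWith("-o")) a.outputWav = arg.substring(2);
         else a.fields.add(arg);   /* anything else is treated as a data field name, one per output channel */
      }
      return a;
   }

   public static void main(String[] args) {
      PaTestArgsSketch a = parse(new String[] {
         "-r10000", "-ilog_file.txt", "-omultichan_time_series.wav", "num sesn", "crt time", "sesn dur"
      });
      System.out.println(a.sampleRateHz + " Hz, " + a.inputLog + " -> " + a.outputWav + ", channels: " + a.fields);
   }
}
```

The shell's quoting is what keeps a two-word field name such as "sesn dur" together as a single argument, so each quoted name maps to one output channel.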

coCPU™ Notes

As explained on the main SigSRF SDK page, the demos support optional coCPU™ technology when per-box performance increases are required. Examples include servers with SWaP 2 constraints, very small form-factors, and increasing overall system bandwidth by "fronting" data with additional CPU cores.

coCPU cards add NICs and up to 100s of coCPU cores to scale per-box streaming and performance density. For example, coCPUs can turn conventional 1U, 2U, and mini-ITX servers into high capacity media, HPC, and AI servers, or they can allow an embedded AI server to operate independently of the cloud. The onboard NICs allow coCPUs to front streaming data and perform wirespeed packet filtering, routing decisions, and processing. CPUs on coCPU cards supported by the demos meet the following requirements:

  • High performance, including extensive SIMD capability, 8 or more cores per CPU, L1 and L2 cache, and advanced DMA capability
  • Contain onchip network I/O and packet processing and onchip PCIe
  • Access to 2 (two) GB or more external DDR3 mem
  • Able to efficiently decode camera input, e.g. H.264 streams arriving as input via onchip network I/O
  • CGT 4 supports gcc compatible C/C++ build and link, mature and reliable debug tools, RTOS, and numerous libraries

The current vision + AI server demo uses TI C6678 CPUs, which meet these requirements. Over time, other suitable CPUs may become available.

Combining x86 and c66x CPUs and running software components necessary for AI applications such as H.264 decode, OpenCV and TensorFlow, is another form of an "AI Accelerator". The architecture described here favors fast, reliable development: mature semiconductors and tools, open source software, standard server format, and a wide range of easy-to-use peripherals and storage.

2 SWaP = Size, Weight, and Power Consumption
4 CGT = Code Generation Tools
