# Time Series: Phone Broadcast Data

<br />
<br />

### Table of Contents

* Introduction
* Fixing the File
* Strings to Timestamps
* Time Deltas
* Time Series
* Frequency Histograms
* Looking at the Full Dataset
* Conclusions and Next Steps

<br />
<br />

## Introduction

Dealing with time can be a pain, especially with datasets that use a different time format in each file. This notebook shows how to convert timestamps into useful formats, then use them to manipulate and bin time series, count events, and investigate frequencies.

In [None]:
# Also see
# http://pandas.pydata.org/pandas-docs/stable/timeseries.html

In [None]:
import os
os.listdir('../input/')

In [None]:
import pandas as pd
import numpy as np
from pprint import pprint

## Fixing the File

We'll start with the AllBroadcasts.csv data file. Pandas reads the first 50 lines or so without trouble, but chokes on the whole file because of stray commas.

In [None]:
all_broadcasts = pd.read_csv('../input/AllBroadcasts.csv',nrows=50,header=0)

In [None]:
# Loading all of the data chokes on a column with commas
#all_broadcasts = pd.read_csv('../input/AllBroadcasts.csv',nrows=1000,header=0)

In [None]:
with open('../input/AllBroadcasts.csv','r') as f:
    lines = f.readlines()
lc = set()
for line in lines:
    lc.add( len( line.split(",") ) )

In [None]:
print(lc)
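The set above tells us which field counts occur, but not how often. A minimal sketch of counting them with `collections.Counter` follows; the `sample_lines` list is a hypothetical stand-in for the real file contents, not actual rows from AllBroadcasts.csv.

```python
from collections import Counter

# Hypothetical sample lines standing in for the contents of AllBroadcasts.csv
sample_lines = [
    "UserId,UUID,Extras,Action,timestamp\n",
    "1,abc,{},wifi.STATE_CHANGE,2016-04-28 00:37:19\n",
    "1,def,{a=1, b=2},wifi.SCAN,2016-04-28 00:37:20\n",
    "\n",
]

# Map "number of comma-separated fields" -> "number of lines with that many fields"
field_counts = Counter(len(line.split(",")) for line in sample_lines)
print(sorted(field_counts.items()))
```

Lines whose count is not 5 are the ones that need repair.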

We are expecting five fields (four commas) based on the file header, but we have some lines with no commas (probably an empty line) and some lines with 1, 2, or more extra commas:

In [None]:
print(lines[0])

Here's an example of one of those lines. The first two fields, UserId and UUID, parse fine, but the Extras column contains commas. The Action and timestamp columns at the end also parse fine.

In [None]:
sp = lines[73].split(",")
print(sp)
print(len(sp))

Our parsing strategy, when there are more than 4 commas, is to split into tokens, then recombine all the middle tokens. We can get a list of token indexes that are the "middle" (excluding the first two and last two items) using `range(2, len(sp)-2)`. Because `range()` excludes its end point, this stops at index `len(sp)-3`, the last middle token; adding 1 to the end point would wrongly pull the Action token into the middle as well.

In [None]:
#print(len(sp))
new_sp = []

new_sp.append(sp[0].strip())
new_sp.append(sp[1].strip())

# This one-liner uses a list comprehension to collect each of the middle pieces
# then concatenates everything together with "".join()
# (note that this drops the commas that were inside the Extras field)
middle_token = "".join([sp[j].strip() for j in range(2, len(sp)-2)])
new_sp.append(middle_token)

new_sp.append(sp[-2].strip())
new_sp.append(sp[-1].strip())

print(new_sp)
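If we would rather keep the commas inside the Extras field, one alternative is to split twice from the left and twice from the right. This is only a sketch: `split_five_fields` is a hypothetical helper, the sample line is made up, and it assumes that Extras is the only field that can contain commas.

```python
def split_five_fields(line):
    """Split a line into [UserId, UUID, Extras, Action, timestamp].

    Assumes only the Extras field (the third one) may contain commas.
    Returns None for lines with too few fields (e.g. blank lines).
    """
    head = line.strip().split(",", 2)      # [UserId, UUID, everything else]
    if len(head) < 3:
        return None
    tail = head[2].rsplit(",", 2)          # [Extras, Action, timestamp]
    if len(tail) < 3:
        return None
    return head[:2] + tail

# Hypothetical line with a comma inside the Extras field
example = "1,def,{a=1, b=2},wifi.SCAN,2016-04-28 00:37:20\n"
print(split_five_fields(example))
```

Unlike the join-based approach, this leaves the Extras text exactly as it appeared in the file.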

In [None]:
def strip_non_ascii(string):
    ''' Returns the string without non ASCII characters'''
    stripped = (c for c in string if 0 < ord(c) < 127)
    return ''.join(stripped)

# Note that clean_line returns a list of string tokens, not a string.
def clean_line(line):
    
    line2 = strip_non_ascii(line)
    sp = line2.split(",")
    
    if(len(sp)==5):
        return sp
    
    elif(len(sp)>2):
                
        new_sp = []
        
        new_sp.append(sp[0])
        new_sp.append(sp[1])

        # This one-liner uses a list comprehension to collect each of the middle pieces
        # then concatenates everything together with "".join().
        # range() excludes its end point, so this stops before sp[-2] (the Action).
        middle_token = "".join([sp[j] for j in range(2, len(sp)-2)])
        new_sp.append(middle_token)

        new_sp.append(sp[-2])
        new_sp.append(sp[-1])
        
        return new_sp


In [None]:
clean_headers = strip_non_ascii(lines[0]).strip().split(",")

# Start at second line (skip header) and end at second-to-last line 
# (arrrrrg, blank lines cause blank lists which cause Pandas problems)
# (could add an if to the list comprehension too - if line not [])
clean_tokens = [clean_line(line) for line in lines[1:-1]]
print(clean_headers)

In [None]:
all_broadcasts_full = pd.DataFrame(clean_tokens, columns = clean_headers)
print(all_broadcasts_full.shape)

In [None]:
print(all_broadcasts_full.columns)

Success! We now have all 170,000+ lines of this file loaded into a DataFrame.

## Strings to Timestamps

The AllBroadcasts.csv data file contains timestamps as strings. We can convert these to more useful objects.

In [None]:
#print all_broadcasts
#print all_broadcasts['timestamp']
print(all_broadcasts['timestamp'].loc[0])
print(type(all_broadcasts['timestamp'].loc[0]))

This is the best case: we're given a nicely formatted timestamp string that converts cleanly to a date.

In [None]:
list_of_strings = all_broadcasts['timestamp'].tolist()
print(all_broadcasts['timestamp'].describe())

To convert a single column of date/time strings into datetime objects, use `pd.to_datetime()`:

In [None]:
print(pd.to_datetime(all_broadcasts['timestamp']).head(10))

In [None]:
# Assign the column directly so that it takes on the datetime64 dtype
all_broadcasts['timestamp'] = pd.to_datetime(all_broadcasts['timestamp'])

In [None]:
print(all_broadcasts['timestamp'].loc[0])
print(type(all_broadcasts['timestamp'].loc[0]))
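When the layout of the strings is known in advance, we can pass it to `pd.to_datetime()` explicitly instead of letting Pandas infer it. The sketch below assumes a `YYYY-MM-DD HH:MM:SS` layout and uses made-up sample strings rather than rows from the file.

```python
import pandas as pd

# Made-up strings in the assumed "YYYY-MM-DD HH:MM:SS" layout
raw = pd.Series(["2016-04-28 00:37:19", "2016-04-28 00:37:24"])

# An explicit format avoids any guessing about day/month order
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S")

print(parsed.dtype)
print(parsed.iloc[1] - parsed.iloc[0])
```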

Now all of the date/time strings have been converted to Pandas Timestamp objects. What can we do with them? Min/max operations and comparisons on Timestamps work as we would expect, so we can start by getting the date range covered by this data set:

In [None]:
tmin = all_broadcasts['timestamp'].min()
tmax = all_broadcasts['timestamp'].max()
print(tmin)
print(tmax)

If we wanted to make our own range of Timestamps, at a specified interval, we could use the `pd.date_range(start,end,freq)` function, which creates a series of timestamp objects that starts at start and ends at end, at a frequency of freq (seconds, minutes, days, etc.).

In [None]:
dates = pd.date_range(tmin,tmax,freq='S')
print(dates)

This is useful if we have data without timestamps (create a series with the data, and specify the date/time index object just created as the index). It is also useful if we want to create a "master list" of timestamps covering a certain date/time range with a specified frequency.
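As a small illustration of the first use case, here is a sketch that attaches a generated index to data that arrived without timestamps. The readings, the start time, and the 10-second spacing are all made-up values for the example.

```python
import numpy as np
import pandas as pd

# Made-up readings that arrived without timestamps
readings = np.array([0.1, 0.4, 0.3, 0.9, 0.5])

# Assume one reading every 10 seconds, starting at a known time
index = pd.date_range("2016-04-28 00:37:19",
                      periods=len(readings),
                      freq=pd.Timedelta(seconds=10))

series = pd.Series(readings, index=index)
print(series)
```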

Somewhat related, if we need to convert our long list of timestamps (or any other list of timestamps) into a DatetimeIndex object, we can use the DatetimeIndex object constructor:

In [None]:
print(type(list_of_strings))
print(type(list_of_strings[0]))
print(pd.DatetimeIndex(list_of_strings))

## Time Deltas

Suppose we want to know how long this data set spans - how many seconds, minutes, hours, or days? If we subtract two datetime objects, we get the result as a Timedelta object:

In [None]:
diff = tmax-tmin
print(diff)
print(type(diff))

In [None]:
#print dir(diff)
print("%d minutes %d seconds"%( diff.seconds//60,diff.seconds%60 ))
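One thing to watch for: the `seconds` attribute of a Timedelta holds only the part of the span that is left over after whole days are removed, so it is not the total length once a span exceeds a day. The sketch below uses two made-up timestamps a little over one day apart to show the difference between `seconds` and `total_seconds()`.

```python
import pandas as pd

# Two made-up timestamps, a little more than one day apart
span = pd.Timestamp("2016-04-29 00:52:30") - pd.Timestamp("2016-04-28 00:37:19")

# .days and .seconds are separate components of the span
print(span.days, span.seconds)

# total_seconds() covers the whole span, days included
total = int(span.total_seconds())
minutes, seconds = divmod(total, 60)
print("%d minutes %d seconds" % (minutes, seconds))
```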

If we want to convert the column of absolute timestamps into a column of relative time differences represented with Timedelta objects, we can just subtract a single timestamp from the entire column:

In [None]:
time_diff = all_broadcasts['timestamp'] - all_broadcasts['timestamp'].loc[0]
print(time_diff.head(10))

We can also combine this with Timedelta's built-in methods and fields by defining a function that operates element-wise, then applying that function to the whole column:

In [None]:
def print_me(diff):
    return "%d minutes %d seconds"%( diff.seconds//60,diff.seconds%60 )

print(time_diff.apply( lambda x : print_me(x) ).head(10))
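For simple conversions we may not need an element-wise function at all, since a column of Timedeltas exposes vectorized helpers through the `.dt` accessor. A minimal sketch on made-up timestamps:

```python
import pandas as pd

# Made-up timestamps standing in for the 'timestamp' column
stamps = pd.to_datetime(pd.Series([
    "2016-04-28 00:37:19",
    "2016-04-28 00:38:24",
    "2016-04-28 00:40:19",
]))

elapsed = stamps - stamps.iloc[0]

# Whole column converted to seconds in one call
elapsed_seconds = elapsed.dt.total_seconds()

# A DataFrame with one column per component (days, hours, minutes, seconds, ...)
parts = elapsed.dt.components

print(elapsed_seconds.tolist())
print(parts[["minutes", "seconds"]])
```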

## Time Series

Let's talk about time series proper - that is, Pandas Series objects whose index is actually a DatetimeIndex object. This type of index has some more powerful built-in methods that we'll explore. First, how do we turn a column of data in our DataFrame, which has timestamps in another separate column, into a Series with a time index?

In [None]:
print(all_broadcasts.columns)

Start by getting the timestamps and the data that we're interested in combining. Let's examine the "Action" column. Grab the "Action" and "timestamp" columns; here we use the `.values` attribute to reduce them to NumPy arrays, which keeps things simple.

In [None]:
data_values = all_broadcasts['Action'].values
data_index  = all_broadcasts['timestamp'].values

In [None]:
ts = pd.Series(data_values, index=data_index)

In [None]:
print(ts.head(10))
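Another route to the same kind of object is to let the DataFrame do the work with `set_index()`. This is a sketch on a tiny made-up DataFrame; the action names are hypothetical.

```python
import pandas as pd

# Tiny made-up frame with the same two columns used above
frame = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-04-28 00:37:19", "2016-04-28 00:37:24"]),
    "Action": ["wifi.SCAN", "bluetooth.STATE"],
})

# Move the timestamps into the index, then pick the column of interest
ts_alt = frame.set_index("timestamp")["Action"]

print(type(ts_alt.index))
print(ts_alt)
```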

We already saw that we are looking at about 15 minutes of data:

In [None]:
print(ts.index.max() - ts.index.min())

Let's look at how to select a range of data. We'll explore two examples:
* Extract all data between 2016-04-28 00:37:19 and 00:39:19 (that is, 2 minutes of data specified by timestamp)
* Extract all data between minute 2 and minute 4 of this dataset (that is, 2 minutes of data specified relatively)

To perform the first type of filtering, we want to extract all data falling between two timestamps. This turns out to be really easy, assuming we use nicely-formatted timestamp strings. The syntax looks something like array slicing: `ts[start:end]`.

In [None]:
# To find all time series data between two timestamps, just specify them as ts[start:end]
print(ts['2016-04-28 00:37:19':'2016-04-28 00:39:19'])

To perform the second type of filtering, we want to create a new column of time deltas (a "seconds elapsed" column), and extract all data falling between two time delta values. This is only slightly more complicated - and made easier by the fact that we can still use the same slicing notation. If we make all timestamps relative to the start of the data set, then we can slice it starting at '00:02:00' (2 minutes) and ending at '00:04:00' (4 minutes).

In [None]:
# To find all time series data based on elapsed time, start by making a new time delta index
delta_index = ts.index - ts.index.min()

In [None]:
# Make a new series object with the new time delta index and the same data
tsd = pd.Series(data_values, index=delta_index)

In [None]:
# Now slice this the same way we sliced the other...
print(tsd['00:02:00':'00:04:00'])

The documentation for what strings, exactly, these slicing methods will take is not entirely clear. We can also slice with Timedelta objects directly. Watch the keyword when constructing them: it is `unit`, not `units`. Spelled correctly, the slice selects the two-to-four-minute window:

In [None]:
twomin = pd.Timedelta(2,unit='m')
fourmin = pd.Timedelta(4,unit='m')
print(tsd[twomin:fourmin])
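To check what a Timedelta-based selection returns, it helps to try it on data small enough to read by eye. The sketch below builds a made-up Series with one value per minute, constructs the bounds with keyword arguments (`pd.Timedelta(minutes=2)`), and compares label slicing against an explicit boolean mask.

```python
import pandas as pd

# Made-up data: one value per minute, indexed by elapsed time
offsets = pd.to_timedelta([0, 60, 120, 180, 240, 300], unit="s")
demo = pd.Series(list("abcdef"), index=offsets)

start = pd.Timedelta(minutes=2)
stop = pd.Timedelta(minutes=4)

# Label-based slice; both end points are included
by_slice = demo.loc[start:stop]

# Equivalent boolean mask, with the comparison operators spelled out
mask = (demo.index >= start) & (demo.index <= stop)
by_mask = demo[mask]

print(by_slice.tolist())
print(by_mask.tolist())
```

The mask version is wordier but makes the inclusive end points explicit.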

## Frequency Histograms

Let's suppose we want to use a long index of timestamps to examine the sampling frequency and see if it is consistent across the data set or whether it occurs at random intervals. In this case, we'll use `ts.items()` (called `iteritems()` in older versions of Pandas) to iterate through each item, one timestamp and piece of data at a time. At each step (except the first one), we'll compute the Timedelta between the current timestamp and the previous timestamp. Adding this to a list will give us a collection of data on which to compute statistics and plot histograms.

In [None]:
diffs = []
for (i,(t,d)) in enumerate(ts.items()):
    if i>0:
        diff = t - prev_value
        diffs.append(diff)
    prev_value = t

diffs = pd.Series(diffs)

One more thing to do, before we can plot a histogram, is to convert the Timedelta object (which plotting libraries will not understand) into a number. We can use the `total_seconds()` method to get the equivalent number of seconds of each Timedelta (the `seconds` attribute would drop any whole days):

In [None]:
diffs = diffs.apply(lambda x : x.total_seconds())
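The loop above spells out each step, but the same gaps can be computed without explicit iteration by calling `diff()` on the index. A sketch on a made-up four-row series:

```python
import pandas as pd

# Made-up series with a DatetimeIndex, standing in for ts
stamps = pd.to_datetime([
    "2016-04-28 00:37:19",
    "2016-04-28 00:37:24",
    "2016-04-28 00:37:34",
    "2016-04-28 00:37:39",
])
demo_ts = pd.Series(["a", "b", "c", "d"], index=stamps)

# diff() subtracts each timestamp from the next; the first row has no
# predecessor, so it comes out as NaT and is dropped
gaps = demo_ts.index.to_series().diff().dropna()
gap_seconds = gaps.dt.total_seconds()

print(gap_seconds.tolist())
```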

In [None]:
%matplotlib inline
import matplotlib.pylab as plt
import seaborn as sns

In [None]:
sns.distplot(diffs, kde=False,bins=10)
plt.title('Histogram: Sampling Frequencies')
plt.xlabel('Sampling Interval (s)')
f = plt.gcf()
f.set_size_inches(6,3)
plt.show()

Three sampling intervals are dominant. This, together with the data set we are visualizing, indicates that there are probably 2 or 3 background processes constantly running at fixed intervals, with some other less continual processes mixed in. (Note this is also only 15 minutes of data - we'll get to the full dataset in a moment.)

We could explore this further by grouping the data by "Action" label, and repeating the above procedure on each group to plot a sampling frequency histogram for each different "Action". This would tell us which actions have constant sampling frequencies, and which happen sporadically.

In [None]:
all_broadcasts.columns

In [None]:
grp = all_broadcasts[['timestamp','Action']].groupby(['Action'])

print(grp.groups.keys())

In [None]:
all_keys = grp.groups.keys()

for key in all_keys:

    print("Timestamps matching action '%s':"%(key))
    for t in grp.groups[key]:
        print(all_broadcasts['timestamp'].loc[t])
        
    print("")


This gives us a list of timestamps associated with each action. If we wanted to turn those into Timedeltas from the start of the dataset, we still have the first time stored in `tmin`, so we can subtract that from the matching timestamps:

In [None]:
all_keys = grp.groups.keys()

for key in all_keys:

    print("Timestamps matching action '%s':"%(key))
    for t in grp.groups[key]:
        diff = all_broadcasts['timestamp'].loc[t] - tmin
        print("%s (%s)"%( print_me(diff) , diff ))
        
    print("")


To turn these into histograms, it's probably useful to actually store this information. Let's store it as a list of Series, and use the name attribute of the Series to store the action name:

In [None]:
all_keys = grp.groups.keys()

list_of_series = []

for key in all_keys:
    
    data = []
    list_of_timestamps = grp.groups[key]
    
    for c,t in enumerate(list_of_timestamps):
        if(c>0):
            diff = (all_broadcasts['timestamp'].loc[t] - prior_value).seconds
            data.append(diff)
        prior_value = all_broadcasts['timestamp'].loc[t]
    
    label = key
    
    s = pd.Series(data,name=label)
    list_of_series.append(s)

print(list_of_series)
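The nested loops above make the logic explicit. For comparison, here is a sketch of the same per-action intervals computed with `groupby()` and `diff()`. The frame and its action labels ("wifi", "bt") are made up, and `interval_s` and `intervals_by_action` are names introduced only for this example.

```python
import pandas as pd

# Made-up frame: two actions interleaved in time
frame = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-04-28 00:37:00",
        "2016-04-28 00:37:05",
        "2016-04-28 00:37:10",
        "2016-04-28 00:37:20",
        "2016-04-28 00:37:35",
    ]),
    "Action": ["wifi", "bt", "wifi", "wifi", "bt"],
})

frame = frame.sort_values("timestamp")

# Gap to the previous row *with the same Action*, in seconds
frame["interval_s"] = (
    frame.groupby("Action")["timestamp"].diff().dt.total_seconds()
)

# One named Series of intervals per action; the first row of each group
# has no predecessor (NaN) and is dropped
intervals_by_action = {
    action: group.dropna().reset_index(drop=True).rename(action)
    for action, group in frame.groupby("Action")["interval_s"]
}

for action, s in intervals_by_action.items():
    print(action, s.tolist())
```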

In [None]:
#sns.distplot(list_of_series[0], norm_hist=True, bins=5, kde=False)
interval = 15
fifteen_second_bins = range(0,3*60+interval,interval)
for s in list_of_series:
    sns.distplot(s, norm_hist=False, kde=False, bins=fifteen_second_bins, label=s.name)
#plt.xlim([0,0.1])
f = plt.gcf()
f.set_size_inches(6,4)

plt.xlabel('Sensor Interval (Seconds)')
plt.ylabel('Number')
plt.legend()
plt.show()

Let's unpack that Seaborn command. Because plotting histograms can be squirrelly when you have even slight differences in the distribution of data (especially for small data sets), we specify the bin edges using the `range()` command. We assume most intervals are 3 minutes or less, and split each minute into four parts (15-second intervals), which gives twelve bins:

In [None]:
interval = 15
range(0,3*60+interval,interval)

This way, we can compare histograms of intervals across categories, without dealing with various other complications.
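To see what fixed bin edges do to the counts, without any plotting involved, we can pass the same edges to `np.histogram()`. The interval values below are made up for illustration.

```python
import numpy as np

interval = 15
bin_edges = np.arange(0, 3 * 60 + interval, interval)   # 0, 15, ..., 180

# Made-up intervals in seconds
sample_intervals = [5, 14, 15, 29, 180]

counts, edges = np.histogram(sample_intervals, bins=bin_edges)

print(len(edges), "edges ->", len(counts), "bins")
print(counts)
```

Every bin is half-open, `[left, right)`, except the last one, which also includes its right edge; that is why the value 180 lands in the final bin.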

## Looking at the Full Dataset

Let's take a look at the full dataset. The `clean_line()` function defined above does not call `.strip()` on its tokens (only the single-line example did), so the timestamps still carry trailing \r and \n characters. We'll have to fix that to properly parse the timestamps:

In [None]:
pprint(all_broadcasts_full['timestamp'].head(10).tolist())

In [None]:
all_broadcasts_full.loc[:,'timestamp'] = all_broadcasts_full['timestamp'].apply(lambda x : x.strip())

In [None]:
pprint(all_broadcasts_full['timestamp'].head(10).tolist())

In [None]:
# This does not work:
#all_broadcasts_full.loc[:,'timestamp'] = pd.to_datetime(all_broadcasts_full['timestamp'])

In [None]:
# Count how many timestamp strings fail to parse
nerr = 0
for (i,row) in all_broadcasts_full['timestamp'].items():
    try:
        pd.to_datetime(row)
    except Exception:
        nerr += 1
print("%d errors"%(nerr))

In [None]:
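# errors='coerce' turns strings that cannot be parsed into NaT.
# Assign the column directly so that it takes on the datetime64 dtype.
all_broadcasts_full['timestamp'] = pd.to_datetime(all_broadcasts_full['timestamp'],errors='coerce')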
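Once the bad values have been coerced, we can count them and look at the offending strings without a row-by-row loop. This is a sketch on made-up strings (including one deliberately unparseable value), not on the real column.

```python
import pandas as pd

# Made-up raw values: trailing line endings plus one bad entry
raw = pd.Series([
    "2016-04-28 00:37:19\r\n",
    "garbage\r\n",
    "2016-04-28 00:37:24\r\n",
])

cleaned = raw.str.strip()
parsed = pd.to_datetime(cleaned, errors="coerce")

# NaT marks the rows that failed to parse
n_bad = parsed.isna().sum()
bad_rows = cleaned[parsed.isna()]

print("%d errors" % n_bad)
print(bad_rows.tolist())
```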

In [None]:
# ------
# Step 1: Group

grp = all_broadcasts_full[['timestamp','Action']].groupby(['Action'])

#print(grp.groups.keys())

In [None]:
# -----
# Step 2: Determine Polling Intervals
# 
# (this takes a while)

all_keys = grp.groups.keys()
list_of_series = []
for key in all_keys:
    
    data = []
    list_of_timestamps = grp.groups[key]
    
    for c,t in enumerate(list_of_timestamps):
        
        skip = False
        if(c>0):
            try:
                diff = (all_broadcasts_full['timestamp'].loc[t] - prior_value).seconds
                data.append(diff)
            except Exception:
                skip = True
                
        if(not skip):
            prior_value = all_broadcasts_full['timestamp'].loc[t]

    label = key
    
    s = pd.Series(data,name=label)
    list_of_series.append(s)

Now we have gone through the list of groups (unique actions and the corresponding list of timestamp indexes) and turned each into a number of seconds since the last timestamp from that service. We then appended each Series of intervals to a list. Now we can visualize the list. We'll start by going through the list of keys for the groups - these are the different actions in the `AllBroadcasts.csv` file.

Using list comprehensions, we can filter on different services:

In [None]:
pprint([j for j in all_keys if 'bluetooth' in j])

In [None]:
pprint([j for j in all_keys if 'wifi' in j])

In [None]:
pprint([j for j in all_keys if 'hardware' in j])

We can also use list comprehensions to group our buckets of intervals for each action into groups. Remember, we used the Series name field, which allows us to retrieve Series by name:

In [None]:
bluetooth_series = [s for s in list_of_series if 'bluetooth' in s.name]
wifi_series = [s for s in list_of_series if 'wifi' in s.name]
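If we expect to look up individual actions often, a dictionary keyed on the Series name saves scanning the list each time. A minimal sketch with made-up Series; the action names are hypothetical.

```python
import pandas as pd

# Made-up interval Series, named the way list_of_series is above
demo_series = [
    pd.Series([5, 10], name="wifi.STATE_CHANGE"),
    pd.Series([30], name="bluetooth.STATE_CHANGE"),
    pd.Series([60, 60], name="wifi.p2p.PEERS"),
]

# Direct lookup by action name
by_name = {s.name: s for s in demo_series}

# The same substring filter as before, applied to the dictionary
wifi_only = {name: s for name, s in by_name.items() if "wifi" in name}

print(sorted(wifi_only))
print(by_name["bluetooth.STATE_CHANGE"].tolist())
```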

In [None]:
# -------
# Step 3A: Bluetooth Interval Counts

#sns.distplot(list_of_series[0], norm_hist=True, bins=5, kde=False)
minutes = 5
interval = 15
fifteen_second_bins = range(0,minutes*60+interval,interval)

for s in bluetooth_series:
    sns.distplot(s, norm_hist=False, kde=False, bins=fifteen_second_bins, label=s.name)

f = plt.gcf()
f.set_size_inches(8,6)

plt.xlim([0,minutes*60])
plt.xlabel('Sensor Interval (Seconds)')
plt.ylabel('Number')
plt.legend()
plt.show()


In [None]:
# -------
# Step 3B: Wifi Interval Counts

#sns.distplot(list_of_series[0], norm_hist=True, bins=5, kde=False)
minutes = 5
interval = 15
fifteen_second_bins = range(0,minutes*60+interval,interval)

for s in wifi_series:
    sns.distplot(s, norm_hist=False, kde=False, bins=fifteen_second_bins, label=s.name)

f = plt.gcf()
f.set_size_inches(8,6)

plt.xlim([0,minutes*60])
plt.xlabel('Sensor Interval (Seconds)')
plt.ylabel('Number')
plt.legend()
plt.show()


In [None]:
# -------
# Step 3B: Wifi Interval Counts (improved)

#sns.distplot(list_of_series[0], norm_hist=True, bins=5, kde=False)
minutes = 2
interval = 5
# The interval is 5 seconds here, so name the bins accordingly
five_second_bins = range(0,minutes*60+interval,interval)

# Skip the RSSI series
for s in wifi_series:
    if 'RSSI' not in s.name:
        sns.distplot(s, norm_hist=False, kde=False, bins=five_second_bins, label=s.name)

f = plt.gcf()
f.set_size_inches(12,6)

plt.xlim([0,minutes*60])
plt.xlabel('Sensor Interval (Seconds)')
plt.ylabel('Number')
plt.legend()
plt.show()


Among the high-frequency signals are state changes in the wifi and wifi supplicant, while the lower-frequency signals are p2p state changes and p2p device changes. This is sensible - p2p networks tend to be limited to a smaller area and a smaller number of people.

## Conclusions and Next Steps

This isn't a very in-depth exploration, but got us familiar with time stamps in this file (a format shared by several other files). This sets us up for later data analysis. More concretely, we have an idea of the polling frequencies of various wifi and bluetooth sensors onboard the phone, which can be used as a proxy for changes in environment.

In later notebooks we'll keep exploring the data in these other data sets, focusing on counts and on broad-level statistics to understand what's in the data. Then we can start to understand how to build machine learning models from the data.