## Measuring and predicting locality of smartphone memory access through data mining of trace files
### IITGN CS612 Fall 2017 - Project Checkpoint Presentation (13 November 2017)
### Sohhom Bandyopadhyay (15510011) and Sujata Sinha (15350008)

**Based on trace files from:** <br>
http://iotta.snia.org/tracetypes/3 (Traces collected from **Nexus 5**) <br>
Zhou, D., Pan, W., Wang, W., & Xie, T. (2015, October). I/O characteristics of smartphone applications and their implications for eMMC design. In Workload Characterization (IISWC), 2015 IEEE International Symposium on (pp. 12-21). IEEE.



**Outline (this notebook):**
This notebook presents the code and outputs of some preliminary EDA (Exploratory Data Analysis). Sections:
 - Description of trace file format
 - Conversion and data preprocessing
 - Calculate turnaround time and hardware processing time for each access, based on timestamps
 - Plots of:
    - Average request size (bytes) by type of activity
    - Average turnaround time (microseconds) by type of activity
    - Average hardware time (microseconds) by type of activity
    

**ToDo (next two weeks):** <br>
 - Measuring spatial and temporal locality
 - Machine learning classifier to predict the above

## Trace file format:
**column 0** : start address (in sectors) <br>
**column 1** : access size (in sectors) <br>
**column 2** : access size (in bytes) <br>
**column 3** : access type & waiting status (3 bit number):
 - LSB: indicates read (0) or write (1)
 - MSB: indicates waiting status (0 = yes, 1 = no)
 - Middle bit: unused <br>
 (It is stored as an integer rather than as a binary string, so the only values that occur are 0, 1, 4 and 5.)
 <br>
 
**column 4** : request generate time (generated and inserted into request queue). <br>
**column 5** : request process start time (fetched from the queue and began processing)  <br>
**column 6** : request submit time (submitted to hardware) <br>
**column 7** : request finish time (completed, callback function invoked)  <br>

Thus, any request goes through 4 stages: push to queue -> start processing -> submit to hardware -> finish (callback)
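
The bit layout of column 3 described above can be unpacked with two bit masks. The following is a minimal sketch, not part of the original analysis: the helper name `decode_access_flags` and the output column names `is_write` and `waiting` are hypothetical, and it assumes the flag column was loaded as floats holding only the values 0, 1, 4 and 5.

```python
import numpy as np
import pandas as pd

def decode_access_flags(flags: pd.Series) -> pd.DataFrame:
    """Split the column-3 integer (0, 1, 4 or 5) into its two documented bits.

    Hypothetical helper: the name and the output column names are illustrative.
    """
    values = flags.to_numpy(dtype=np.int64)      # floats such as 4.0 become ints
    return pd.DataFrame({
        "is_write": (values & 1) == 1,           # LSB: 0 = read, 1 = write
        "waiting": (values & 4) == 0,            # MSB: 0 = yes (waiting), 1 = no
    }, index=flags.index)

# The four values that occur in the traces
decoded = decode_access_flags(pd.Series([0.0, 1.0, 4.0, 5.0]))
print(decoded)
```

With the notebook's `dataset` dictionary, the same helper could be applied as `decode_access_flags(dataset['twitter'].iloc[:, 3])`.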


In [None]:
# import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from datetime import datetime
import seaborn
%matplotlib inline

In [None]:
# define where the trace files are located, relative to the working directory
data_dir = 'Trace_files'
# collect the names of all txt files in specified directory
fnames = [x for x in os.listdir(os.path.join('.', data_dir)) if x.endswith('.txt')]
# extract the names of workloads from file names, this will be useful later to create dictionary keys
workload_names = [x.split('_')[1].split('.')[0] for x in fnames]

In [None]:
# read all the Trace Files into a single variable called dataset
# it's dict of dataframes, indexed by name
# example: dataset['twitter'] gives all rows of the file "log186_twitter.txt"
dataset = {fname.split('_')[1].split('.')[0]: pd.read_csv(os.path.join(data_dir, fname), delimiter=r'\s+', header=None, dtype=float) for fname in fnames}

## Convert Unix timestamps to datetime objects
Also extract the request size (in bytes, column 2).

In [None]:
request_sizes = { workname:dataset[workname].iloc[:,2] for workname in workload_names }
# columns 4 through 7 contain the timestamps (floating point numbers), need to convert them into manipulable objects
for cindx in 4, 5, 6, 7:
    for df in dataset.values():
        # apply transformation (non destructive) to each column, then store the updated column back in the dataframe
        df.loc[:,cindx] = df.loc[:,cindx].apply(datetime.fromtimestamp)

### Calculate durations
 ... of various processing stages

In [None]:
turnaround_time = {}
hw_time = {}
for name in workload_names:
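    # turnaround: request generated (column 4) -> request finished (column 7)
    turnaround_time[name] = dataset[name].iloc[:,7] - dataset[name].iloc[:,4]
    # convert each time delta to microseconds; total_seconds() also counts whole seconds and minutes
    turnaround_time[name] = turnaround_time[name].apply(lambda x: x.total_seconds() * 1e6)
    # hardware time: submitted to hardware (column 6) -> request finished (column 7)
    hw_time[name] = dataset[name].iloc[:,7] - dataset[name].iloc[:,6]
    hw_time[name] = hw_time[name].apply(lambda x: x.total_seconds() * 1e6)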
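
The four stages listed in the trace format imply three per-stage durations plus the overall turnaround. As a hedged sketch, these can also be computed directly from the raw float timestamps, without going through datetime objects. The function name `stage_durations_us` and the output column names are hypothetical, the toy values are made up for illustration, and the sketch assumes columns 4 through 7 are still numeric Unix timestamps in seconds.

```python
import pandas as pd

def stage_durations_us(df: pd.DataFrame) -> pd.DataFrame:
    """Per-request stage durations in microseconds.

    Assumes columns 4..7 still hold numeric timestamps in seconds
    (i.e. the frame has not been converted to datetime objects).
    """
    t_gen, t_start, t_submit, t_finish = (df.iloc[:, i] for i in (4, 5, 6, 7))
    return pd.DataFrame({
        "queue_wait": (t_start - t_gen) * 1e6,      # pushed to queue -> processing starts
        "pre_submit": (t_submit - t_start) * 1e6,   # processing starts -> submitted to hardware
        "hardware":   (t_finish - t_submit) * 1e6,  # submitted to hardware -> callback
        "turnaround": (t_finish - t_gen) * 1e6,     # whole lifetime of the request
    })

# Toy single-row trace with made-up values, laid out like the trace files
toy = pd.DataFrame([[100.0, 8.0, 4096.0, 0.0, 10.000, 10.001, 10.002, 10.005]])
durations = stage_durations_us(toy)
print(durations.round(3))
```

By construction, the three stage columns sum to the turnaround column for every request, which gives a quick sanity check on the timestamps.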

### Plot averages across activities

In [None]:
avg_req_size       = {x:np.mean(request_sizes[x]) for x in workload_names}

In [None]:
axes = seaborn.barplot(x=list(avg_req_size.keys()), y=list(avg_req_size.values()), palette='dark')
for item in axes.get_xticklabels():
    item.set_rotation(90)
plt.ylabel('Request size (bytes)')
plt.xlabel('Workload')
plt.title("Average request size across workloads")
plt.show()

In [None]:
# To be plotted:
# calculate the average turnaround time (request finish - request generate) for each workload
avg_tat = {x:np.mean(turnaround_time[x]) for x in workload_names}

In [None]:
# plotting..
axes = seaborn.barplot(x=list(avg_tat.keys()), y=list(avg_tat.values()), palette='dark')
for item in axes.get_xticklabels():
    item.set_rotation(85)
plt.title("Average turnaround time across workloads")
plt.xlabel('Workload')
plt.ylabel('Turnaround time (microseconds)')
plt.show()

In [None]:
# To be plotted:
# calculate the average time taken by hardware to process a request, for each workload
avg_hw = {x:np.mean(hw_time[x]) for x in workload_names}

In [None]:
# plotting..
axes = seaborn.barplot(x=list(avg_hw.keys()), y=list(avg_hw.values()), palette='dark')
for item in axes.get_xticklabels():
    item.set_rotation(85)
plt.title("Average hardware time across workloads")
plt.xlabel('Workload')
plt.ylabel('Hardware time (microseconds)')
plt.show()

## Todo

### Questions to answer:

  - How many unique sector requests?
  - How many repeated requests within a certain threshold of time (e.g. 5ms)?
  - Assuming a certain cache size and configuration, how many bytes could have been cached?
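
One possible way to approach these three questions is sketched below. It is an illustration under stated assumptions rather than the planned implementation: the helper names `reuses_within` and `lru_hit_bytes` are hypothetical, a "request" is identified by its start sector only (column 0), timestamps are assumed to be numeric seconds in trace order, and the cache is a simple LRU whose capacity is counted in entries, not bytes. The toy trace values are made up.

```python
from collections import OrderedDict
import pandas as pd

def reuses_within(df: pd.DataFrame, threshold_s: float) -> int:
    """Count requests whose start sector was last requested at most threshold_s seconds earlier."""
    sectors, times = df.iloc[:, 0], df.iloc[:, 4]   # start sector, generate time (seconds)
    gap = times.groupby(sectors).diff()             # NaN for the first access to a sector
    return int((gap <= threshold_s).sum())

def lru_hit_bytes(df: pd.DataFrame, capacity: int) -> int:
    """Bytes served from a hypothetical LRU cache holding `capacity` start sectors."""
    cache = OrderedDict()
    hit_bytes = 0
    for sector, size in zip(df.iloc[:, 0], df.iloc[:, 2]):
        if sector in cache:
            hit_bytes += size
            cache.move_to_end(sector)               # mark as most recently used
        else:
            cache[sector] = True
            if len(cache) > capacity:
                cache.popitem(last=False)           # evict the least recently used sector
    return int(hit_bytes)

# Toy trace (made-up values) with the same column layout as the trace files
times = [0.000, 0.001, 0.002, 0.010, 0.020, 0.021]
toy_trace = pd.DataFrame({0: [100, 200, 100, 300, 200, 100], 1: 8, 2: 4096, 3: 0,
                          4: times, 5: times, 6: times, 7: times})

print("unique sectors:", toy_trace.iloc[:, 0].nunique())
print("reuses within 5 ms:", reuses_within(toy_trace, 0.005))
print("bytes hit, 2-entry LRU:", lru_hit_bytes(toy_trace, 2))
print("bytes hit, 3-entry LRU:", lru_hit_bytes(toy_trace, 3))
```

Keying the cache on the start sector alone ignores partially overlapping requests; a sector-granular model would be a natural refinement.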
 

In [None]:
for key, df in dataset.items():
    total = df.shape[0]
    repeated = total - df.iloc[:,0].unique().shape[0]
    print(key, ":", repeated, "repeated requests out of total", total)

### Towards locality
For a given workload, for each sector that was requested **more than once**, what are the **time-deltas between first request and subsequent requests?**
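
A minimal sketch of how this question could be answered with a group-wise transform follows. The function name `deltas_from_first_request` and the output column names are hypothetical, the toy values are made up, and the sketch assumes rows are in time order, with column 0 as the start sector and column 4 as the request generate time.

```python
import pandas as pd

def deltas_from_first_request(df: pd.DataFrame) -> pd.DataFrame:
    """For every repeated start sector, time elapsed since its first request.

    Works for numeric timestamps (result in the same unit) as well as for
    datetime columns (result as time deltas).
    """
    sectors, times = df.iloc[:, 0], df.iloc[:, 4]
    first_seen = times.groupby(sectors).transform("min")   # earliest request per sector
    repeats = sectors.duplicated(keep="first")              # True for every access after the first
    return pd.DataFrame({"sector": sectors[repeats],
                         "delta": (times - first_seen)[repeats]})

# Toy trace with made-up values; only columns 0 and 4 matter here
toy_locality = pd.DataFrame({0: [100, 200, 100, 300, 100], 1: 8, 2: 4096, 3: 0,
                             4: [0.0, 1.0, 2.5, 3.0, 7.0]})
repeat_deltas = deltas_from_first_request(toy_locality)
print(repeat_deltas)
```

A histogram of the `delta` column per workload would then give a first impression of temporal locality.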

In [None]:
df = dataset['radio']

In [None]:
uniq_sectors = df.iloc[:,0].unique()

In [None]:
uniq_sectors.shape

In [None]:
5820-1698