# Conversant Test Data analysis

On first look at the data, I Googled `rtb.requests`. RTB most likely stands for "Real Time Bidding", where advertisers bid for ad space on the web, which sounds right up Conversant's alley. With that assumption I dug further. The 'Time' column confused me at first, but after some searching I identified it as Unix epoch time: the number of seconds elapsed since Jan 1, 1970, which explains the huge numbers. With that assumption I converted the epoch time to a readable form and also broke the data down by day of the week and by hour.
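As a quick sanity check on the epoch assumption, the sketch below converts two timestamps to UTC using only the standard library. The sample values are rounded from the minimum and maximum reported by `df['Time'].describe()` further down, so they are approximations rather than actual rows from the file.

```python
from datetime import datetime, timezone

# Approximate bounds of the Time column (rounded, illustrative values)
samples = [1443406000, 1443541000]

for epoch in samples:
    # Interpret the number as seconds since 1970-01-01 00:00:00 UTC
    moment = datetime.fromtimestamp(epoch, tz=timezone.utc)
    print(epoch, moment.isoformat(), moment.strftime('%a'), moment.hour)
```

Both values land on 28 and 29 September 2015, a Monday and a Tuesday, which matches the two weekdays used in the breakdown.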

Since the three data centers could be located in different time zones, I chose to use UTC timestamps so that the timing is accurate and comparable across all of them.

After a test sort of the data, I found some negative values. I assumed these are incorrect, since the number of bids can't be negative, so I removed them. I then took the average ('mean') value per hour, which allowed a trend to form on the chart.
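The cleaning step can be sketched on a toy frame. The rows below are invented for illustration and do not come from the test file; only the column names match. Counting the negative rows first shows how much data the filter discards.

```python
import pandas as pd

# Invented sample rows; only the column names mirror the real data
toy = pd.DataFrame({
    'Value': [120.0, -5.0, 80.0, 100.0],
    'Data center': ['dc=A', 'dc=A', 'dc=I', 'dc=I'],
})

# Count how many rows would be dropped before dropping them
negative_rows = (toy['Value'] < 0).sum()

# Keep only non-negative values, then average per data center
clean = toy[toy['Value'] >= 0]
mean_per_dc = clean.groupby('Data center')['Value'].mean()

print(negative_rows)
print(mean_per_dc)
```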

From the chart, we can see that Data Center (DC) I gets the highest traffic of all three DCs. Its traffic dips between hours 5 and 10 (UTC), then picks up and peaks between hours 15 and 22, on both Monday and Tuesday. DC S has very similar tendencies but much less traffic. DC A's data is suspiciously even across the entire time frame.

One of the main questions I had was about Value. What is it? Is it the number of bids at a certain time, the winning bid, or something else?

In [1]:
import pandas as pd

In [2]:
# Use pandas to load the data from the text file into a data frame
df = pd.read_table('data/SRE_Test_Data.txt', delimiter='\t', names=None, header=0)

In [3]:
# Count the non-null values in each column (the header row is not included)
df.count()

Type           1678
Time           1678
Value          1678
Data center    1678
dtype: int64

In [4]:
# Run `wc -l data/SRE_Test_Data.txt` in your terminal to compare the number of lines
# in the original text file with the number of rows in the data frame
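A minimal sketch of the same check done from Python rather than the terminal. It uses an in-memory string in place of the real file, so the rows are made up; with the real file you would read `data/SRE_Test_Data.txt` instead. The line count includes the header row, so it should be one more than the number of data rows.

```python
import csv
import io

# Made-up stand-in for the tab-separated file (header plus two data rows)
raw = (
    "Type\tTime\tValue\tData center\n"
    "rtb.requests\t1443406000\t10\tdc=A\n"
    "rtb.requests\t1443406060\t12\tdc=I\n"
)

# `wc -l` counts newline characters
total_lines = raw.count("\n")

# DictReader consumes the header, leaving only data rows
rows = list(csv.DictReader(io.StringIO(raw), delimiter="\t"))

print(total_lines, len(rows))
```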

In [5]:
# Sort the data frame by time in ascending order
df = df.sort_values('Time')

In [6]:
df['Time'].describe()

count    1.678000e+03
mean     1.443475e+09
std      3.778246e+04
min      1.443406e+09
25%      1.443444e+09
50%      1.443474e+09
75%      1.443506e+09
max      1.443541e+09
Name: Time, dtype: float64

In [7]:
# Removes negative values
df = df[df['Value'] >= 0]

In [8]:
# Converts epoch time to human-readable UTC time
from datetime import datetime
import calendar

days_of_week = list(calendar.day_abbr)  # Weekday abbreviations indexed by weekday number (0 = Mon)

def normalize_time(epoch):
    event_time = datetime.utcfromtimestamp(epoch)
    return {
        "hour": event_time.hour,
        "weekday": days_of_week[event_time.weekday()],
        "iso8601": event_time.isoformat()
    }

df['Human Time'] = df['Time'].apply(lambda epoch: normalize_time(epoch)['iso8601'])
df['Weekday'] = df['Time'].apply(lambda epoch: normalize_time(epoch)['weekday'])
df['Hour'] = df['Time'].apply(lambda epoch: normalize_time(epoch)['hour'])

In [9]:
# Normalizes epoch time by subtracting the earliest timestamp,
# giving the number of seconds since the first record
minimum_time = df['Time'].min()
df['Normalized Time'] = df['Time'] - minimum_time

In [10]:
# Splits the data frame by data center
dc_a = df[df['Data center'] == 'dc=A']
dc_i = df[df['Data center'] == 'dc=I']
dc_s = df[df['Data center'] == 'dc=S']

In [11]:
# Splits each data center's data by day of the week
dc_a_mon = dc_a[dc_a['Weekday'] == 'Mon']
dc_i_mon = dc_i[dc_i['Weekday'] == 'Mon']
dc_s_mon = dc_s[dc_s['Weekday'] == 'Mon']

dc_a_tue = dc_a[dc_a['Weekday'] == 'Tue']
dc_i_tue = dc_i[dc_i['Weekday'] == 'Tue']
dc_s_tue = dc_s[dc_s['Weekday'] == 'Tue']

In [12]:
# Groups each data center's daily data by hour and computes summary statistics of Value
dc_a_mon_hr = dc_a_mon.groupby(['Hour'])['Value'].describe()
dc_i_mon_hr = dc_i_mon.groupby(['Hour'])['Value'].describe()
dc_s_mon_hr = dc_s_mon.groupby(['Hour'])['Value'].describe()

dc_a_tue_hr = dc_a_tue.groupby(['Hour'])['Value'].describe()
dc_i_tue_hr = dc_i_tue.groupby(['Hour'])['Value'].describe()
dc_s_tue_hr = dc_s_tue.groupby(['Hour'])['Value'].describe()
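The per-data-center, per-day frames above could also be produced with a single grouped aggregation. This is an alternative sketch rather than what the notebook does; the sample rows are invented and the variable names `toy`, `hourly` and `dc_a_mon` are illustrative.

```python
import pandas as pd

# Invented rows with the same columns the notebook builds
toy = pd.DataFrame({
    'Data center': ['dc=A', 'dc=A', 'dc=I', 'dc=I'],
    'Weekday': ['Mon', 'Mon', 'Mon', 'Tue'],
    'Hour': [5, 5, 5, 15],
    'Value': [10.0, 20.0, 40.0, 60.0],
})

# One pass: summary statistics per data center, weekday and hour
hourly = toy.groupby(['Data center', 'Weekday', 'Hour'])['Value'].describe()

# Select one data center and day; the result is indexed by hour
dc_a_mon = hourly.xs(('dc=A', 'Mon'), level=['Data center', 'Weekday'])
print(dc_a_mon['mean'])
```

Each slice taken with `xs` has the same shape as the hand-built frames, so it can be passed to the plotting code in the same way.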

In [13]:
# Plots activity per hour for all three data centers over a 24-hour period
# (how many bids were placed in each data center every hour, on average)

# Only the names used below are imported
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)

scatter_dc_a_mon_hr = go.Scatter(
    x = dc_a_mon_hr.index,
    y = dc_a_mon_hr['mean'],
    name = "Data center A"
)
scatter_dc_i_mon_hr = go.Scatter(
    x = dc_i_mon_hr.index,
    y = dc_i_mon_hr['mean'],
    name = "Data center I"
)
scatter_dc_s_mon_hr = go.Scatter(
    x = dc_s_mon_hr.index,
    y = dc_s_mon_hr['mean'],
    name = "Data center S"
)

scatter_dc_a_tue_hr = go.Scatter(
    x = dc_a_tue_hr.index,
    y = dc_a_tue_hr['mean'],
    name = "Data center A"
)
scatter_dc_i_tue_hr = go.Scatter(
    x = dc_i_tue_hr.index,
    y = dc_i_tue_hr['mean'],
    name = "Data center I"
)
scatter_dc_s_tue_hr = go.Scatter(
    x = dc_s_tue_hr.index,
    y = dc_s_tue_hr['mean'],
    name = "Data center S"
)

mon_data = [scatter_dc_a_mon_hr, scatter_dc_i_mon_hr, scatter_dc_s_mon_hr]
tue_data = [scatter_dc_a_tue_hr, scatter_dc_i_tue_hr, scatter_dc_s_tue_hr]

# Note: `range` sets the axis limits in data units; `domain` is the fraction
# of the plot area an axis occupies and must lie within [0, 1]
mon_layout = go.Layout(
    title = 'Monday breakdown by hour',
    xaxis = dict(
        title = '24 hour period of time in UTC',
        range = [0, 24]
    ),
    yaxis = dict(
        title = 'Amount of bids',
        range = [0, 60000]
    ),
)

tue_layout = go.Layout(
    title = 'Tuesday breakdown by hour',
    xaxis = dict(
        title = '24 hour period of time in UTC',
        range = [0, 24],
        ticklen = 5
    ),
    yaxis = dict(
        title = 'Amount of bids',
        range = [0, 60000],
        ticklen = 5,
    ),
)

mon_fig = go.Figure(data=mon_data, layout=mon_layout)
tue_fig = go.Figure(data=tue_data, layout=tue_layout)

iplot(mon_fig)
iplot(tue_fig)