# Time distribution of transition events

The goal of this notebook is to plot the time between consecutive per-user events, aggregated over all users. This can be used to inspect the given data, e.g. for noisy transitions between antennas or for very large time deltas that we might want to exclude from the analysis.

In [None]:
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from datetime import datetime
import matplotlib.pyplot as plt

The setup code is very similar to [mobilitymatrix.ipynb](mobilitymatrix.ipynb).

In [None]:
spark = (
    SparkSession.builder.master("local[*]")
    .config("spark.driver.memory", "32g")
    .appName("mobility_analysis")
    .getOrCreate()
)

In [None]:
from pyspark.sql.types import StructType, IntegerType, StringType, DoubleType

schema = (
    StructType()
    .add("time", StringType(), True)
    .add("user", StringType(), True)
    .add("zip1", IntegerType(), True)
    .add("zip2", IntegerType(), True)
    .add("lat", DoubleType(), True)
    .add("lon", DoubleType(), True)
)

In [None]:
data = spark.read.csv("../data/calldata", sep="|", schema=schema)

In [None]:
preprocessed = data.rdd.map(lambda row: (row["user"], row["time"]))

In [None]:
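# Keep only each user's timestamps; list() avoids relying on the iterable's internal .data attribute
grouped = preprocessed.groupByKey().map(lambda row: list(row[1]))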

In this step we actually calculate the difference between timestamps for each user:

In [None]:
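def extract_time_between_calls(events):
    # Assuming fixed-width "%Y%m%d%H%M%S" strings, lexicographic order
    # matches chronological order, so a plain sort is sufficient.
    sorted_events = sorted(events)
    ret = []
    for a, b in zip(sorted_events[:-1], sorted_events[1:]):
        delta = datetime.strptime(b, "%Y%m%d%H%M%S") - datetime.strptime(
            a, "%Y%m%d%H%M%S"
        )
        # total_seconds() keeps whole days; .seconds would wrap around at 24 hours
        ret.append(delta.total_seconds())
    return ret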
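
To make the pairwise-difference step concrete, here is a minimal stand-alone sketch on made-up timestamps (the sample values and the helper name `time_deltas_seconds` are illustrative assumptions, not part of the dataset or the notebook). It also shows why the choice between `timedelta.seconds` and `timedelta.total_seconds()` matters when looking for very large gaps: `.seconds` only holds the sub-day remainder.

```python
from datetime import datetime

FMT = "%Y%m%d%H%M%S"  # same timestamp format as used in the notebook


def time_deltas_seconds(timestamps):
    """Return the gaps (in seconds) between consecutive sorted timestamps."""
    parsed = sorted(datetime.strptime(t, FMT) for t in timestamps)
    return [(b - a).total_seconds() for a, b in zip(parsed[:-1], parsed[1:])]


# Hypothetical events of one user, deliberately out of order
sample = ["20200103120030", "20200101120000", "20200101120030"]

deltas = time_deltas_seconds(sample)

# For comparison: .seconds drops whole days, so a 2-day gap shows up as 0
parsed = sorted(datetime.strptime(t, FMT) for t in sample)
wrapped = [(b - a).seconds for a, b in zip(parsed[:-1], parsed[1:])]

print(deltas)   # gaps including whole days
print(wrapped)  # gaps with whole days discarded
```

In this sample the second gap is exactly two days, which `total_seconds()` reports in full while `.seconds` reports as zero.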

Next, we aggregate over the entire data set and send the result to the frontend (the Spark driver):

In [None]:
transitions = grouped.flatMap(extract_time_between_calls).collect()

Finally, we plot it:

In [None]:
# plt.hist returns (counts, bin_edges, patches), not a figure object
counts, bin_edges, patches = plt.hist(transitions)
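
Since the time deltas may span several orders of magnitude, a default linear histogram can be dominated by a few very large gaps. As a complementary check, one could count the deltas per coarse magnitude bucket. The sketch below is only an illustration: the bucket edges, the labels, and the example values are arbitrary assumptions and should be adapted to the actual data.

```python
from bisect import bisect_right
from collections import Counter

# Assumed bucket edges in seconds: 1 minute, 1 hour, 1 day
EDGES = [60, 3600, 86400]
LABELS = ["< 1 min", "1 min - 1 h", "1 h - 1 day", ">= 1 day"]


def bucket_counts(deltas):
    """Count how many deltas fall into each magnitude bucket."""
    counts = Counter(LABELS[bisect_right(EDGES, d)] for d in deltas)
    # Return every label, including empty buckets, in a fixed order
    return {label: counts.get(label, 0) for label in LABELS}


# Made-up deltas in seconds; in the notebook this would be `transitions`
example = [5, 59, 60, 1800, 7200, 90000, 172800]
summary = bucket_counts(example)

for label, count in summary.items():
    print(f"{label:>12}: {count}")
```

A large share in the shortest bucket could hint at noisy antenna transitions, while the last bucket collects the long gaps that might be excluded from the analysis.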