# Basic Exploratory Data Analysis

In this notebook I demonstrate how to download the data and how to visualise a single trip. The import statements below can serve as a default header for Jupyter notebooks in the cookiecutter environment. If you only want to download all the data, call:

``python src\data\make_data.py``

In [45]:
# Load the "autoreload" extension
%load_ext autoreload

# always reload modules marked with "%aimport"
%autoreload 1

import os
import sys
from dotenv import load_dotenv, find_dotenv

import pandas as pd
import numpy as np

#Visualisation Libraries
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
import seaborn as sns
from datetime import datetime
#####
#
# Default way of appending the src directory in the cookiecutter file structure
#
#####

# add the 'src' directory as one where we can import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)

# Loading the dotenv file gives us access to the environment variables set in the dm_mobility_task/.env file
# e.g. I set my token there like this: "KEY_LUKAS"=1234; similarly there is one for KEY_RAPHAEL and KEY_MORITZ
load_dotenv(find_dotenv())

# import my method from the source code
%aimport data.data_utils
from data.data_utils import list_recorded_data
from data.data_utils import download_data_sets
from data.data_utils import get_data_per_trip, get_data_per_token
from data.data_utils import download_all
from data.data_utils import VALID_NAMES
from data.data_utils import get_trip_summaries
%aimport visualization.visualize
from visualization.visualize import plot_track
%aimport data.preprocessing
from data.preprocessing import downsample_time_series, convert_timestamps
from data.preprocessing import downsample_time_series_per_category

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
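
The token lookup in the next cell depends on the `.env` file being found and containing the right key. `os.environ.get` returns `None` when a key is missing, which would only surface later as a less obvious error. A minimal sketch of a stricter lookup follows; `require_env` and `KEY_EXAMPLE` are hypothetical names used for illustration and are not part of `data.data_utils`.

```python
import os


def require_env(name):
    """Return the value of environment variable `name` or raise a clear error.

    Hypothetical helper for illustration; not part of the project's source code.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            "Environment variable {!r} is not set; check the .env file.".format(name)
        )
    return value


# Dummy value for the example; a real token would come from .env via load_dotenv
os.environ["KEY_EXAMPLE"] = "1234"
token_example = require_env("KEY_EXAMPLE")
```

Failing early like this keeps a missing token from being passed on to the download functions.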


In [46]:
# the environment variable specified in .env
# lists the recorded data by user
token = os.environ.get("KEY_RAPHAEL")
recorded_trips = list_recorded_data(token)
recorded_trips

Unnamed: 0,full_name,last_modified,size
0,358568053229914_20171121-144912,2017-11-21 15:00,82K
1,358568053229914_20171121-145403,2017-11-21 15:00,122K
2,358568053229914_20171127-181845,2017-11-27 18:40,605K
3,358568053229914_20171128-130428,2017-11-28 18:00,173K
4,358568053229914_20171128-163426,2017-11-28 18:00,638K
5,358568053229914_20171128-174241,2017-11-28 18:00,707K
6,358568053229914_20171130-110040,2017-11-30 11:20,647K


To avoid too many requests to the server, we can use the **full_name** column to download the data and save it to data/raw. Alternatively, download_data_sets(token) downloads the data for a single token, and download_all() downloads all data from our team.

In [47]:
# download_data_sets(token) works as well,
# but then list_recorded_data is invoked again.
# We can also download all data for our team with:
download_all()
tar_file_names = list(recorded_trips["full_name"] + ".csv.tar.gz")
#download_data_sets(token, file_names=tar_file_names)


Downloaded  358568053229914_20171121-144912.csv.tar.gz
Downloaded  358568053229914_20171121-145403.csv.tar.gz
Downloaded  358568053229914_20171127-181845.csv.tar.gz
Downloaded  358568053229914_20171128-130428.csv.tar.gz
Downloaded  358568053229914_20171128-163426.csv.tar.gz
Downloaded  358568053229914_20171128-174241.csv.tar.gz
Downloaded  358568053229914_20171130-110040.csv.tar.gz
Downloaded  355007075245007_20171108-110713.csv.tar.gz
Downloaded  355007075245007_20171108-132646.csv.tar.gz
Downloaded  355007075245007_20171121-140720.csv.tar.gz
Downloaded  355007075245007_20171121-141338.csv.tar.gz
Downloaded  868049020858898_20171109-131946.csv.tar.gz
Downloaded  868049020858898_20171116-074009.csv.tar.gz
Downloaded  868049020858898_20171123-072847.csv.tar.gz
Downloaded  868049020858898_20171123-074632.csv.tar.gz
Downloaded  868049020858898_20171128-164017.csv.tar.gz
Downloaded  868049020858898_20171128-165210.csv.tar.gz
Downloaded  868049020858898_20171130-074628.csv.tar.gz


The data has now been downloaded to dm_mobility_task/data/raw/token. We can check that by calling:

In [48]:
from data.data_utils import get_file_names, get_data_dir
# also possible for specific token
# get_file_names(os.path.join(get_data_dir(),"raw"), token=token)
recorded_file_names = get_file_names(os.path.join(get_data_dir(),"raw"))
print("We have recorded: {} trips".format(len(recorded_file_names)))
recorded_file_names

We have recorded: 18 trips


['358568053229914/358568053229914_20171127-181845.csv.tar.gz',
 '358568053229914/358568053229914_20171128-163426.csv.tar.gz',
 '358568053229914/358568053229914_20171128-130428.csv.tar.gz',
 '358568053229914/358568053229914_20171121-144912.csv.tar.gz',
 '358568053229914/358568053229914_20171128-174241.csv.tar.gz',
 '358568053229914/358568053229914_20171130-110040.csv.tar.gz',
 '358568053229914/358568053229914_20171121-145403.csv.tar.gz',
 '355007075245007/355007075245007_20171108-132646.csv.tar.gz',
 '355007075245007/355007075245007_20171121-141338.csv.tar.gz',
 '355007075245007/355007075245007_20171108-110713.csv.tar.gz',
 '355007075245007/355007075245007_20171121-140720.csv.tar.gz',
 '868049020858898/868049020858898_20171123-074632.csv.tar.gz',
 '868049020858898/868049020858898_20171116-074009.csv.tar.gz',
 '868049020858898/868049020858898_20171109-131946.csv.tar.gz',
 '868049020858898/868049020858898_20171128-164017.csv.tar.gz',
 '868049020858898/868049020858898_20171128-165210.csv.t
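
The relative paths in the listing above have the form `<token>/<token>_<date>-<time>.csv.tar.gz`, so the number of trips per token can be read off the first path component. A small sketch using a few of the names from the listing; `count_trips_per_token` is a hypothetical helper, and it assumes `/` as the separator, as in the printed output.

```python
from collections import Counter

# A few of the relative paths printed above
example_file_names = [
    "358568053229914/358568053229914_20171127-181845.csv.tar.gz",
    "358568053229914/358568053229914_20171128-163426.csv.tar.gz",
    "355007075245007/355007075245007_20171108-132646.csv.tar.gz",
    "868049020858898/868049020858898_20171123-074632.csv.tar.gz",
]


def count_trips_per_token(file_names):
    """Count files per token, where the token is the leading directory name."""
    return Counter(name.split("/", 1)[0] for name in file_names)


trips_per_token = count_trips_per_token(example_file_names)
for token_name, n_trips in sorted(trips_per_token.items()):
    print(token_name, n_trips)
```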

Now that the data has been downloaded, we can read it from file and start to explore it. Here I will only explore data from my key, but with get_data_per_trip(dir_name="raw") it is possible to load all **raw** data per trip into memory.

In [49]:
# read the data per trip for all users by invoking: get_data_per_trip(dir_name="raw")
#dfs=get_data_per_trip(dir_name="raw")
dfs=get_data_per_token(token)

The data can now be accessed in the following way. Enter one of the following valid names as key in the dictionary:


In [50]:
VALID_NAMES

['annotation', 'cell', 'event', 'location', 'mac', 'marker', 'sensor']

E.g. for the sensor data:

In [51]:
trip_nr = 0
dfs[trip_nr]["sensor"].head(10)

Unnamed: 0,sensor,time,x,y,z,total
0,magnetic,1511803126089,166.75,-72.0,-86.75,201.283691
1,acceleration,1511803126089,0.852,6.828,7.24,9.988247
2,magnetic,1511803126099,166.75,-72.0,-86.75,201.283691
3,acceleration,1511803126099,0.718,6.828,7.221,9.963933
4,magnetic,1511803126109,166.75,-72.0,-86.75,201.283691
5,acceleration,1511803126109,0.651,6.885,7.221,9.998494
6,magnetic,1511803126119,166.75,-72.0,-86.75,201.283691
7,acceleration,1511803126119,0.584,6.828,7.489,10.151244
8,magnetic,1511803126129,163.25,-58.25,-88.0,194.390396
9,acceleration,1511803126129,0.507,6.809,7.412,10.077563
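
In the rows printed above, the `total` column agrees with the Euclidean norm of the `x`, `y` and `z` components. This is an observation from the printed values, not something confirmed from the recording app's source. A quick check on two of the rows:

```python
import math


def vector_norm(x, y, z):
    """Euclidean length of the 3D vector (x, y, z)."""
    return math.sqrt(x * x + y * y + z * z)


# Row 1 (acceleration) and row 0 (magnetic) from the table above
acc_total = vector_norm(0.852, 6.828, 7.24)
mag_total = vector_norm(166.75, -72.0, -86.75)

print(round(acc_total, 4), round(mag_total, 4))
```

Both results match the printed `total` values (9.988247 and 201.283691) to within rounding.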


Get summaries for each recorded trip:

In [52]:
get_trip_summaries(dfs, convert_time=True)

ValueError: Length of values does not match length of index

Next we visualise the acceleration data:

In [None]:
acceleration_df = dfs[trip_nr]["sensor"]
acceleration_df = acceleration_df[acceleration_df["sensor"]=="acceleration"]
acceleration_df.head(3)

A quick visualisation of the acceleration of one of my trips:

In [None]:
small = acceleration_df.drop(["sensor","total"],axis=1).set_index("time")
figsize=(12, 4)
small["x"].plot(figsize=figsize);
plt.ylabel("x")
plt.show();

small["y"].plot(figsize=figsize);
plt.ylabel("y")
plt.show();

small["z"].plot(figsize=figsize);
plt.ylabel("z")
plt.show();



**Plot the GPS data on a Google map and save it to disk as HTML:**

In [None]:
location_df = dfs[trip_nr]["location"]
file_name = "gps_test.html"
plot_track(location_df[["longitude", "latitude"]], file_name)

The track can now be viewed at:


In [None]:
os.path.join("reports","maps",file_name)

----
**Resample to a new time interval for coarser granularity**

The following csv files include time columns: cell, event, location, marker, sensor.

Let's look at an example using the acceleration data of one trip. First we have to convert the integer timestamps in the time column to datetime objects. This can be done via the convert_timestamps function.
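
The integer values in the time column look like Unix epoch timestamps in milliseconds. The sketch below converts the first sensor timestamp shown earlier under that assumption, using only the standard library; `ms_to_datetime` is a hypothetical helper, and whether convert_timestamps works exactly this way is not shown in this notebook.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_datetime(ms):
    """Convert Unix epoch milliseconds to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


# First timestamp of the sensor table shown earlier
first_sample = ms_to_datetime(1511803126089)
print(first_sample.isoformat())
```

The result falls on 2017-11-27 at 17:18:46 UTC, which is consistent with the trip file name 20171127-181845 if the recording device used a UTC+1 local time. For a whole column, `pd.to_datetime(df["time"], unit="ms")` performs the same conversion in pandas.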

In [None]:
acceleration_df = dfs[trip_nr]["sensor"]
acceleration_df = acceleration_df[acceleration_df["sensor"]=="acceleration"]
acceleration_df = convert_timestamps(acceleration_df)
acceleration_df.head()

Next we can downsample the acceleration data from millisecond timestamps to a 1-second interval, where the values within each interval are aggregated via the mean.

**Note** that this drops the sensor column, which has to be appended again afterwards. This is not an issue here because we have only one sensor type. If you want to keep the categorical variable, see the next point.
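
To illustrate why the categorical column disappears, here is a sketch of mean-based downsampling with plain pandas on made-up values. It assumes a datetime index and uses `DataFrame.resample`; the actual implementation of downsample_time_series is not shown here and may differ.

```python
import pandas as pd

# Toy acceleration samples; "time" is epoch milliseconds as in the sensor table
toy = pd.DataFrame({
    "sensor": ["acceleration"] * 3,
    "time": [1511803126089, 1511803126099, 1511803127010],
    "x": [1.0, 3.0, 5.0],
})
toy["time"] = pd.to_datetime(toy["time"], unit="ms")
toy = toy.set_index("time")

# Only numeric columns can be averaged, so the categorical column is left out
toy_resampled = toy[["x"]].resample("1s").mean()

# With a single sensor type the label can simply be attached again afterwards
toy_resampled["sensor"] = "acceleration"
print(toy_resampled)
```

The first two samples fall into the same second and are averaged; the third ends up in the following one-second bin.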

In [None]:
acceleration_df_resampled = downsample_time_series(acceleration_df, time_interval="1S")
acceleration_df_resampled.head()

Another possibility, where we can keep all the categorical values, is to use the downsample_time_series_per_category function, shown here for the full sensor table.

**Note** that we did not convert the time column of dfs[trip_nr] beforehand, so the function performs this step implicitly; otherwise the resampling would not work.
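
One way to resample while keeping a categorical column is to group by that column and by a time bin at the same time. The sketch below does this with `pd.Grouper` on made-up values; it is an illustration under that assumption and not necessarily how downsample_time_series_per_category is implemented.

```python
import pandas as pd

# Toy sensor table with two sensor types; "time" is epoch milliseconds
toy_sensors = pd.DataFrame({
    "sensor": ["magnetic", "acceleration", "magnetic", "acceleration", "acceleration"],
    "time": [1511803126089, 1511803126089, 1511803126099, 1511803126099, 1511803127010],
    "x": [10.0, 1.0, 20.0, 3.0, 5.0],
})
# Explicit timestamp conversion, then use the datetime column as the index
toy_sensors["time"] = pd.to_datetime(toy_sensors["time"], unit="ms")
toy_sensors = toy_sensors.set_index("time")

# Group by sensor type and by 1-second bins of the index, then average
per_category = (
    toy_sensors
    .groupby(["sensor", pd.Grouper(freq="1s")])[["x"]]
    .mean()
    .reset_index()
)
print(per_category)
```

Unlike a plain resample, this grouping only produces rows for bins that actually contain samples, so empty seconds do not appear as NaN rows.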

In [None]:
dfs[trip_nr]["sensor"].head()

In [None]:
all_sensors_resampled = downsample_time_series_per_category(dfs[trip_nr]["sensor"],
                                                            categorical_colnames=["sensor"])

all_sensors_resampled.head()

If we now plot the acceleration once again, this time for the resampled version, we get:

In [None]:
small = acceleration_df_resampled
figsize=(12, 4)
small["x"].plot(figsize=figsize);
plt.ylabel("x")
plt.show();

small["y"].plot(figsize=figsize);
plt.ylabel("y")
plt.show();

small["z"].plot(figsize=figsize);
plt.ylabel("z")
plt.show();