# Exploratory Data Analysis

In this exercise, you will learn how to load, manipulate, and visualize datasets in Python. We will also look at how to calculate features from the data.


## Pandas

A common package for data handling and analysis in Python is `pandas`. Its fundamental data structure is the data frame: a table that can be handled like a table in a relational database. We can select rows, columns, or both.

In [3]:
import numpy as np
import pandas as pd
from scipy.io import arff
import matplotlib.pyplot as plt

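# loadarff returns a tuple (records, metadata); the records go into the data frame
data = arff.loadarff('S08.arff')
df = pd.DataFrame(data[0])

# Uncomment to plot a single sensor channel:
# plt.plot(df["Sensor_T8_Acceleration_X"])
df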

Unnamed: 0,time,Sensor_T8_Acceleration_X,Sensor_T8_Acceleration_Y,Sensor_T8_Acceleration_Z,Sensor_T8_AngularVelocity_X,Sensor_T8_AngularVelocity_Y,Sensor_T8_AngularVelocity_Z,Sensor_RightForeArm_Acceleration_X,Sensor_RightForeArm_Acceleration_Y,Sensor_RightForeArm_Acceleration_Z,...,Sensor_RightLowerLeg_AngularVelocity_X,Sensor_RightLowerLeg_AngularVelocity_Y,Sensor_RightLowerLeg_AngularVelocity_Z,Sensor_LeftLowerLeg_Acceleration_X,Sensor_LeftLowerLeg_Acceleration_Y,Sensor_LeftLowerLeg_Acceleration_Z,Sensor_LeftLowerLeg_AngularVelocity_X,Sensor_LeftLowerLeg_AngularVelocity_Y,Sensor_LeftLowerLeg_AngularVelocity_Z,class
0,0.0,0.970733,-0.055330,0.171758,0.018513,0.028240,-0.000941,0.908353,-0.303781,0.181371,...,0.036398,0.001883,-0.002196,0.991241,-0.035035,0.161290,0.009413,-0.012865,0.022278,"b""'open'"""
1,8.0,0.970733,-0.055330,0.171972,0.031691,0.033888,-0.006589,0.908567,-0.303568,0.181158,...,0.026985,0.000000,-0.014434,0.991241,-0.035035,0.161290,0.016630,-0.007217,0.023533,"b""'open'"""
2,16.0,0.970519,-0.055330,0.172185,0.028553,0.034515,0.000941,0.908567,-0.303568,0.181158,...,0.031064,0.002196,-0.009099,0.991241,-0.035249,0.161077,0.030436,-0.006589,0.016630,"b""'open'"""
3,24.0,0.970519,-0.055116,0.172613,0.033574,0.030122,-0.001569,0.908567,-0.303568,0.181158,...,0.032005,0.009413,-0.007844,0.991241,-0.035463,0.161077,0.030122,0.003765,0.028553,"b""'open'"""
4,32.0,0.970519,-0.055116,0.172613,0.039849,0.008786,0.000941,0.908567,-0.303354,0.180944,...,0.026357,-0.003138,-0.011610,0.991241,-0.035676,0.161290,0.033574,0.002510,0.027298,"b""'open'"""
5,40.0,0.970519,-0.055116,0.172826,0.038594,0.002196,0.011923,0.908780,-0.303354,0.180731,...,0.035457,0.006903,-0.004707,0.991241,-0.035676,0.161290,0.032319,0.005962,0.028867,"b""'open'"""
6,48.0,0.970519,-0.055116,0.172826,0.037653,0.007217,0.000628,0.908780,-0.303140,0.180517,...,0.046752,0.006275,-0.013179,0.991241,-0.035890,0.161290,0.032319,0.015689,0.020082,"b""'open'"""
7,56.0,0.970519,-0.055116,0.172826,0.040791,0.015061,-0.002510,0.908780,-0.303140,0.180517,...,0.033574,0.005962,0.000628,0.991241,-0.036103,0.161290,0.049263,0.003452,0.012237,"b""'open'"""
8,64.0,0.970519,-0.055116,0.173040,0.031691,0.022278,0.016316,0.908994,-0.302927,0.180303,...,0.042360,0.003765,0.000628,0.991241,-0.036103,0.161504,0.032319,0.011610,0.020082,"b""'open'"""
9,72.0,0.970519,-0.055116,0.173254,0.035143,0.033260,0.007217,0.908994,-0.302927,0.180090,...,0.033888,-0.000941,-0.006903,0.991241,-0.036103,0.161504,0.042046,0.014434,0.011923,"b""'open'"""
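
The output above shows the `class` labels as byte strings with embedded quotes (`b"'open'"`). A minimal sketch of how such labels could be turned into plain strings is given below. The toy data frame is a made-up stand-in for `df`, not the real dataset.

```python
import pandas as pd

# Toy stand-in for the loaded data (assumption: labels are bytes, as in the output above).
toy = pd.DataFrame({
    "time": [0.0, 8.0, 16.0],
    "class": [b"'open'", b"'open'", b"'close'"],
})

# Decode bytes to str, then strip the single quotes carried over from the file.
toy["class"] = toy["class"].str.decode("utf-8").str.strip("'")
print(toy["class"].tolist())
```

Cleaning the labels once after loading keeps later comparisons such as `df["class"] == "open"` simple.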


### Selecting rows and columns

Using pandas, it is simple to select only certain rows or columns of a data frame. One option is to select columns by name:


In [20]:
accx = df.loc[:,"Sensor_T8_Acceleration_X"]

Here, `:` stands for "select all rows". If a single column is selected, the result is of type `Series`; otherwise it is a data frame again. We can also access columns by integer position:

In [25]:
acc = df.iloc[:,1:4]
acc

Unnamed: 0,Sensor_T8_Acceleration_X,Sensor_T8_Acceleration_Y,Sensor_T8_Acceleration_Z
0,0.970733,-0.055330,0.171758
1,0.970733,-0.055330,0.171972
2,0.970519,-0.055330,0.172185
3,0.970519,-0.055116,0.172613
4,0.970519,-0.055116,0.172613
5,0.970519,-0.055116,0.172826
6,0.970519,-0.055116,0.172826
7,0.970519,-0.055116,0.172826
8,0.970519,-0.055116,0.173040
9,0.970519,-0.055116,0.173254
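
To make the difference between `Series` and data frame results concrete, here is a small sketch on a made-up frame. The column names `acc_x` and `acc_y` are illustrative and not part of the dataset.

```python
import pandas as pd

# Made-up toy data; column names are illustrative only.
toy = pd.DataFrame({
    "time": [0.0, 8.0, 16.0],
    "acc_x": [0.97, 0.97, 0.96],
    "acc_y": [-0.05, -0.05, -0.06],
})

single = toy.loc[:, "acc_x"]       # one label -> Series
as_frame = toy.loc[:, ["acc_x"]]   # list of labels -> DataFrame, even with one column
by_position = toy.iloc[:, 1:3]     # integer positions 1 and 2 (3 is excluded)

print(type(single).__name__, type(as_frame).__name__)
print(list(by_position.columns))
```

Passing a list of labels is a simple way to keep a data frame even when only one column is needed.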


Rows can be accessed in the same way. For example, the following expression returns only the first 5 rows:

In [24]:
df.iloc[0:5,:]

Unnamed: 0,time,Sensor_T8_Acceleration_X,Sensor_T8_Acceleration_Y,Sensor_T8_Acceleration_Z,Sensor_T8_AngularVelocity_X,Sensor_T8_AngularVelocity_Y,Sensor_T8_AngularVelocity_Z,Sensor_RightForeArm_Acceleration_X,Sensor_RightForeArm_Acceleration_Y,Sensor_RightForeArm_Acceleration_Z,...,Sensor_RightLowerLeg_AngularVelocity_X,Sensor_RightLowerLeg_AngularVelocity_Y,Sensor_RightLowerLeg_AngularVelocity_Z,Sensor_LeftLowerLeg_Acceleration_X,Sensor_LeftLowerLeg_Acceleration_Y,Sensor_LeftLowerLeg_Acceleration_Z,Sensor_LeftLowerLeg_AngularVelocity_X,Sensor_LeftLowerLeg_AngularVelocity_Y,Sensor_LeftLowerLeg_AngularVelocity_Z,class
0,0.0,0.970733,-0.05533,0.171758,0.018513,0.02824,-0.000941,0.908353,-0.303781,0.181371,...,0.036398,0.001883,-0.002196,0.991241,-0.035035,0.16129,0.009413,-0.012865,0.022278,"b""'open'"""
1,8.0,0.970733,-0.05533,0.171972,0.031691,0.033888,-0.006589,0.908567,-0.303568,0.181158,...,0.026985,0.0,-0.014434,0.991241,-0.035035,0.16129,0.01663,-0.007217,0.023533,"b""'open'"""
2,16.0,0.970519,-0.05533,0.172185,0.028553,0.034515,0.000941,0.908567,-0.303568,0.181158,...,0.031064,0.002196,-0.009099,0.991241,-0.035249,0.161077,0.030436,-0.006589,0.01663,"b""'open'"""
3,24.0,0.970519,-0.055116,0.172613,0.033574,0.030122,-0.001569,0.908567,-0.303568,0.181158,...,0.032005,0.009413,-0.007844,0.991241,-0.035463,0.161077,0.030122,0.003765,0.028553,"b""'open'"""
4,32.0,0.970519,-0.055116,0.172613,0.039849,0.008786,0.000941,0.908567,-0.303354,0.180944,...,0.026357,-0.003138,-0.01161,0.991241,-0.035676,0.16129,0.033574,0.00251,0.027298,"b""'open'"""


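We can combine both, like this (note that, unlike `iloc`, label-based slicing with `loc` includes the end label, so `0:5` returns six rows):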

In [28]:
df.loc[0:5,["Sensor_T8_Acceleration_X", "Sensor_T8_Acceleration_Y"]]

Unnamed: 0,Sensor_T8_Acceleration_X,Sensor_T8_Acceleration_Y
0,0.970733,-0.05533
1,0.970733,-0.05533
2,0.970519,-0.05533
3,0.970519,-0.055116
4,0.970519,-0.055116
5,0.970519,-0.055116
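
The two outputs above differ in length: `iloc[0:5]` gives five rows, `loc[0:5]` gives six. The following sketch reproduces this on a made-up frame with a default integer index.

```python
import pandas as pd

# Made-up toy data with the default integer index 0..6.
toy = pd.DataFrame({"acc_x": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]})

by_label = toy.loc[0:5, ["acc_x"]]   # labels 0..5, end label included -> 6 rows
by_position = toy.iloc[0:5, :]       # positions 0..4, end excluded -> 5 rows

print(len(by_label), len(by_position))
```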


Another useful option is to access rows or columns via a boolean expression. For example, the following expression returns all rows where the value of `Sensor_T8_Acceleration_X` is less than 0.7.

In [33]:
df.loc[df.Sensor_T8_Acceleration_X < 0.7,:]

Unnamed: 0,time,Sensor_T8_Acceleration_X,Sensor_T8_Acceleration_Y,Sensor_T8_Acceleration_Z,Sensor_T8_AngularVelocity_X,Sensor_T8_AngularVelocity_Y,Sensor_T8_AngularVelocity_Z,Sensor_RightForeArm_Acceleration_X,Sensor_RightForeArm_Acceleration_Y,Sensor_RightForeArm_Acceleration_Z,...,Sensor_RightLowerLeg_AngularVelocity_X,Sensor_RightLowerLeg_AngularVelocity_Y,Sensor_RightLowerLeg_AngularVelocity_Z,Sensor_LeftLowerLeg_Acceleration_X,Sensor_LeftLowerLeg_Acceleration_Y,Sensor_LeftLowerLeg_Acceleration_Z,Sensor_LeftLowerLeg_AngularVelocity_X,Sensor_LeftLowerLeg_AngularVelocity_Y,Sensor_LeftLowerLeg_AngularVelocity_Z,class
4348,34784.0,0.697714,0.207221,0.661397,0.315657,0.969878,-0.560715,0.336894,-0.788507,0.397565,...,-0.190461,0.055852,0.443677,0.861141,0.515061,0.030335,-0.956072,-0.473172,-0.561657,"b""'close'"""
4349,34792.0,0.691733,0.211707,0.666097,0.299341,0.983997,-0.534045,0.352062,-0.775262,0.411237,...,-0.194540,0.063696,0.458739,0.859004,0.518693,0.030549,-0.714151,-0.468779,-0.552871,"b""'close'"""
4350,34800.0,0.685751,0.216193,0.670797,0.298086,0.962975,-0.510825,0.365307,-0.761803,0.425336,...,-0.215249,0.079699,0.468466,0.856655,0.522324,0.029908,-0.512708,-0.492626,-0.545340,"b""'close'"""
4351,34808.0,0.680197,0.220466,0.675283,0.302793,0.910260,-0.502981,0.377270,-0.748131,0.439863,...,-0.223094,0.079699,0.441481,0.854518,0.525956,0.028626,-0.316912,-0.465328,-0.514904,"b""'close'"""
4352,34816.0,0.674642,0.224525,0.679342,0.310009,0.848133,-0.504864,0.387310,-0.734672,0.454176,...,-0.241607,0.072796,0.430185,0.852596,0.529374,0.026490,-0.069344,-0.430813,-0.466269,"b""'close'"""
4353,34824.0,0.669515,0.229011,0.682974,0.315344,0.767179,-0.514591,0.396283,-0.722068,0.467635,...,-0.252589,0.092250,0.457170,0.850887,0.532365,0.023713,0.193285,-0.390336,-0.414496,"b""'close'"""
4354,34832.0,0.664815,0.233283,0.686178,0.329150,0.706934,-0.505177,0.403974,-0.709891,0.480239,...,-0.258237,0.087543,0.479134,0.849178,0.535142,0.020081,0.407280,-0.351114,-0.377157,"b""'close'"""
4355,34840.0,0.660115,0.237556,0.689169,0.323188,0.658299,-0.496705,0.411023,-0.698996,0.491134,...,-0.230624,0.057107,0.476624,0.847682,0.537492,0.016022,0.540948,-0.326639,-0.355507,"b""'close'"""
4356,34848.0,0.655843,0.241829,0.691946,0.298714,0.662378,-0.497960,0.417432,-0.688955,0.500107,...,-0.201130,0.023533,0.485096,0.846187,0.539842,0.011750,0.562912,-0.291810,-0.344838,"b""'close'"""
4357,34856.0,0.651143,0.245888,0.694723,0.279573,0.697521,-0.502040,0.423627,-0.679983,0.507584,...,-0.136178,0.028240,0.494195,0.844691,0.542192,0.007477,0.521180,-0.280201,-0.344211,"b""'close'"""
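
Boolean expressions can also be combined. The sketch below, on made-up data, joins two conditions with `&`; each condition needs its own parentheses because `&` binds more tightly than the comparison operators. The column names are illustrative.

```python
import pandas as pd

# Made-up toy data; column names are illustrative only.
toy = pd.DataFrame({
    "acc_x": [0.97, 0.69, 0.65, 0.80],
    "label": ["open", "close", "close", "open"],
})

# Element-wise AND of two boolean Series; use | for OR and ~ for NOT.
mask = (toy["acc_x"] < 0.7) & (toy["label"] == "close")
selected = toy.loc[mask, :]

print(mask.tolist())
print(selected.index.tolist())
```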


### Inserting new values

Values in a data frame can be overwritten with the same indexing syntax. The following expression sets all values of column `Sensor_T8_Acceleration_Y` that are smaller than 0 to 0.

In [37]:
df.loc[df.Sensor_T8_Acceleration_Y < 0,"Sensor_T8_Acceleration_Y"] = 0
df

Unnamed: 0,time,Sensor_T8_Acceleration_X,Sensor_T8_Acceleration_Y,Sensor_T8_Acceleration_Z,Sensor_T8_AngularVelocity_X,Sensor_T8_AngularVelocity_Y,Sensor_T8_AngularVelocity_Z,Sensor_RightForeArm_Acceleration_X,Sensor_RightForeArm_Acceleration_Y,Sensor_RightForeArm_Acceleration_Z,...,Sensor_RightLowerLeg_AngularVelocity_X,Sensor_RightLowerLeg_AngularVelocity_Y,Sensor_RightLowerLeg_AngularVelocity_Z,Sensor_LeftLowerLeg_Acceleration_X,Sensor_LeftLowerLeg_Acceleration_Y,Sensor_LeftLowerLeg_Acceleration_Z,Sensor_LeftLowerLeg_AngularVelocity_X,Sensor_LeftLowerLeg_AngularVelocity_Y,Sensor_LeftLowerLeg_AngularVelocity_Z,class
0,0.0,0.970733,0.0,0.171758,0.018513,0.028240,-0.000941,0.908353,-0.303781,0.181371,...,0.036398,0.001883,-0.002196,0.991241,-0.035035,0.161290,0.009413,-0.012865,0.022278,"b""'open'"""
1,8.0,0.970733,0.0,0.171972,0.031691,0.033888,-0.006589,0.908567,-0.303568,0.181158,...,0.026985,0.000000,-0.014434,0.991241,-0.035035,0.161290,0.016630,-0.007217,0.023533,"b""'open'"""
2,16.0,0.970519,0.0,0.172185,0.028553,0.034515,0.000941,0.908567,-0.303568,0.181158,...,0.031064,0.002196,-0.009099,0.991241,-0.035249,0.161077,0.030436,-0.006589,0.016630,"b""'open'"""
3,24.0,0.970519,0.0,0.172613,0.033574,0.030122,-0.001569,0.908567,-0.303568,0.181158,...,0.032005,0.009413,-0.007844,0.991241,-0.035463,0.161077,0.030122,0.003765,0.028553,"b""'open'"""
4,32.0,0.970519,0.0,0.172613,0.039849,0.008786,0.000941,0.908567,-0.303354,0.180944,...,0.026357,-0.003138,-0.011610,0.991241,-0.035676,0.161290,0.033574,0.002510,0.027298,"b""'open'"""
5,40.0,0.970519,0.0,0.172826,0.038594,0.002196,0.011923,0.908780,-0.303354,0.180731,...,0.035457,0.006903,-0.004707,0.991241,-0.035676,0.161290,0.032319,0.005962,0.028867,"b""'open'"""
6,48.0,0.970519,0.0,0.172826,0.037653,0.007217,0.000628,0.908780,-0.303140,0.180517,...,0.046752,0.006275,-0.013179,0.991241,-0.035890,0.161290,0.032319,0.015689,0.020082,"b""'open'"""
7,56.0,0.970519,0.0,0.172826,0.040791,0.015061,-0.002510,0.908780,-0.303140,0.180517,...,0.033574,0.005962,0.000628,0.991241,-0.036103,0.161290,0.049263,0.003452,0.012237,"b""'open'"""
8,64.0,0.970519,0.0,0.173040,0.031691,0.022278,0.016316,0.908994,-0.302927,0.180303,...,0.042360,0.003765,0.000628,0.991241,-0.036103,0.161504,0.032319,0.011610,0.020082,"b""'open'"""
9,72.0,0.970519,0.0,0.173254,0.035143,0.033260,0.007217,0.908994,-0.302927,0.180090,...,0.033888,-0.000941,-0.006903,0.991241,-0.036103,0.161504,0.042046,0.014434,0.011923,"b""'open'"""
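
Note that this assignment changes `df` in place, so later cells see the modified values. The sketch below works on a copy of a made-up frame instead, and compares the mask-based assignment with `Series.clip`, which is one possible alternative for this particular case.

```python
import pandas as pd

# Made-up toy data; the column name is illustrative only.
toy = pd.DataFrame({"acc_y": [-0.05, 0.21, -0.01, 0.24]})

# Variant 1: boolean mask assignment on a copy, so `toy` stays untouched.
masked = toy.copy()
masked.loc[masked["acc_y"] < 0, "acc_y"] = 0

# Variant 2: clip all values below 0 up to 0.
clipped = toy.copy()
clipped["acc_y"] = clipped["acc_y"].clip(lower=0)

print(masked["acc_y"].tolist())
print(masked.equals(clipped))
```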


### Apply
Often, we want to apply a function to every row or column of the data. The method `apply` does this; by default, it applies the function to each column. The following expression calculates the mean per column:

In [46]:
df.iloc[:,1:31].apply(np.mean)

time                                      58560.000000
Sensor_T8_Acceleration_X                      0.967756
Sensor_T8_Acceleration_Y                      0.008310
Sensor_T8_Acceleration_Z                      0.137146
Sensor_T8_AngularVelocity_X                   0.021981
Sensor_T8_AngularVelocity_Y                   0.004654
Sensor_T8_AngularVelocity_Z                   0.007052
Sensor_RightForeArm_Acceleration_X            0.207117
Sensor_RightForeArm_Acceleration_Y           -0.716522
Sensor_RightForeArm_Acceleration_Z            0.269941
Sensor_RightForeArm_AngularVelocity_X         0.031404
Sensor_RightForeArm_AngularVelocity_Y         0.004029
Sensor_RightForeArm_AngularVelocity_Z        -0.002067
Sensor_LeftForeArm_Acceleration_X             0.207117
Sensor_LeftForeArm_Acceleration_Y            -0.716522
Sensor_LeftForeArm_Acceleration_Z             0.269941
Sensor_LeftForeArm_AngularVelocity_X          0.031404
Sensor_LeftForeArm_AngularVelocity_Y          0.004029
Sensor_Lef
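
`apply` also accepts user-defined functions and can work row-wise. The following sketch uses a made-up frame and a hypothetical helper `value_range` to show both directions.

```python
import pandas as pd

# Made-up toy data; column names are illustrative only.
toy = pd.DataFrame({
    "acc_x": [1.0, 2.0, 3.0],
    "acc_y": [4.0, 6.0, 8.0],
})

def value_range(values):
    """Hypothetical helper: difference between largest and smallest value."""
    return values.max() - values.min()

per_column = toy.apply(value_range)        # default axis=0: one result per column
per_row = toy.apply(value_range, axis=1)   # axis=1: one result per row

print(per_column.tolist())
print(per_row.tolist())
```

For common statistics such as the mean, the built-in methods (for example `toy.mean()`) give the same per-column result without `apply`.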

## Exercise 1


Compute the distribution of classes in `df` (hint: `collections.Counter`).
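
As a starting point, here is a minimal sketch of how `collections.Counter` counts labels. The list `labels` is made up; for the exercise, the values of the `class` column would take its place.

```python
from collections import Counter

# Made-up labels standing in for the values of the class column.
labels = ["open", "open", "close", "open", "close"]

counts = Counter(labels)                 # absolute frequency per label
total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}  # relative frequency

print(counts.most_common())
print(shares)
```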

Plot the distribution of classes as a bar plot.

Plot multiple accelerometer axes (e.g. "Sensor_T8_Acceleration_X", "Sensor_T8_Acceleration_Y", "Sensor_T8_Acceleration_Z") as a line plot. The different axes should be drawn in different colors.

## Exercise 2

Next, we want to calculate some features of the data. For sequential data, we will typically calculate features in a window-based fashion. That is, for rows 1 to n, we calculate some feature (mean, ...), then for rows n+1 to 2n and so on. The windows can also be overlapping. In the lecture, you learned about other features that are, for example, based on the frequency of the signal. 
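
The windowing scheme described above can be sketched as follows. This is only an illustration of how window boundaries follow from size and overlap, not the requested `feature` function. The helper name `window_bounds` and the choice to drop an incomplete last window are assumptions.

```python
import numpy as np

def window_bounds(n_rows, size, overlap):
    """Hypothetical helper: (start, stop) pairs for windows of `size` rows.

    `overlap` is a fraction in [0, 1). An incomplete last window is dropped
    (an assumption; padding would be another option).
    """
    step = max(1, int(size * (1 - overlap)))
    return [(start, start + size) for start in range(0, n_rows - size + 1, step)]

signal = np.arange(10, dtype=float)  # made-up 1-D signal: 0.0, 1.0, ..., 9.0

bounds = window_bounds(len(signal), size=4, overlap=0.5)
means = [float(signal[a:b].mean()) for a, b in bounds]

print(bounds)
print(means)
```

With 50% overlap, each window starts half a window length after the previous one, so every row (except near the edges) contributes to two windows.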

Implement a function `feature` which calculates a feature of the data, given a window size, an overlap, a dataset, and a statistical feature function (mean, ...). Calculate the mean, median and variance of accelerometer data of the right leg with window sizes 128, 256 and 512. Use 50% overlap and plot the result. 
