# Cleaning up the raw data

This notebook demonstrates how to do a few things:

1. Expand the filename dummies into columns
2. Drop duplicate time indexes
3. Combine all raw files into one parquet file

Unlike csv, the Apache Parquet format stores the column types alongside the data, so values such as datetimes come back the way they were written. The files are usually much smaller too.

(Support for feather improves with pandas 1.1.0, in particular compression: https://arrow.apache.org/docs/python/generated/pyarrow.feather.write_feather.html)
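
To make the point about accuracy concrete, here is a minimal sketch of what a csv round trip does to a datetime column. The two-row dataframe is made up for illustration; only the column names mirror the logger output.

```python
import io

import pandas as pd

# Hypothetical two-row sample; column names follow the logger output
sample = pd.DataFrame({
    "datetime": pd.to_datetime(["2020-07-31 21:20:35", "2020-07-31 21:20:36"]),
    "temp": [40.780, 41.856],
})

# Write to an in-memory csv and read it straight back
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# csv is plain text, so the datetime column does not come back as a datetime
was_datetime = pd.api.types.is_datetime64_any_dtype(sample["datetime"])
is_datetime = pd.api.types.is_datetime64_any_dtype(restored["datetime"])
print(was_datetime, is_datetime)
```

With csv the type has to be restored by hand on every read (for example with `parse_dates`), which is the kind of step that is easy to forget.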

### Imports

In [1]:
import pandas as pd
import os
import re

### Importing all raw files

The data logging script will output files with predictable filenames, so we can use regular expressions to safely import them all.

  * Wikipedia: https://en.wikipedia.org/wiki/Regular_expression
  * Basic python example: https://docs.python.org/3/library/re.html#finding-all-adverbs

The code below lists every file in the `./data` sub-folder. Filtering with `re.match()` returns only the files that start with `pistress` and end with `.csv`.

#### Regex explanation

I used a raw string `r"   "` to make the regex easier to write. Regular expressions use a lot of backslashes, and in a normal string each one has to be written as `\\` to escape it. The raw string lets you use backslashes without escaping them.

The `^` character will match the start of the filename, and the `$` character will match the end of the filename. These are great for making sure you're really getting what you're expecting. For example, we wouldn't want a file ending in `.csview`.

The `[01]{10}` regular expression will match exactly ten 0 or 1 characters. Being this strict means that we won't accidentally import anything else.
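
As a quick check of the rules above, the sketch below runs a pattern like the one used for the raw files, written with an escaped dot, against a handful of made-up filenames. Only the first name should pass.

```python
import re

# A pattern like the one used for the raw files; the dot is escaped
# so that it matches only a literal "."
pattern = re.compile(r"^pistress_[01]{10}\.csv$")

# Hypothetical filenames for illustration
candidates = [
    "pistress_1100001011.csv",     # valid
    "pistress_1100001011.csview",  # wrong ending
    "pistress_110000101.csv",      # only nine digits
    "pistress_1100001011xcsv",     # no literal dot before csv
    "notes.csv",                   # wrong prefix
]

matches = [name for name in candidates if pattern.match(name)]
print(matches)
```

An unescaped `.` matches any single character, so without the backslash the fourth name would slip through as well.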

In [2]:
raw_files = [x for x in os.listdir("./data") if re.match(r"^pistress_[01]{10}\.csv$", x)]

raw_files

['pistress_1100001011.csv',
 'pistress_1101000111.csv',
 'pistress_1101010011.csv',
 'pistress_1100000011.csv',
 'pistress_0000000011.csv',
 'pistress_1110100011.csv',
 'pistress_1111000111.csv',
 'pistress_1110001011.csv',
 'pistress_1100010011.csv',
 'pistress_1101100011.csv',
 'pistress_1111001011.csv',
 'pistress_1101000011.csv',
 'pistress_1111100011.csv',
 'pistress_1000000011.csv',
 'pistress_1111010011.csv',
 'pistress_1100000111.csv',
 'pistress_1100100011.csv']

#### Reading into pandas

We can send each filename to the `pd.read_csv()` function and then "chain" `.assign()` to add in the filename as its own column.

The list of these dataframes can then be given to `pd.concat()` for merging into one big dataframe.

In [3]:
raw = pd.concat(
    [pd.read_csv(f"./data/{x}").assign(filename=x) for x in raw_files])

raw

Unnamed: 0,datetime,usage,temp,stress,load,filename
0,2020-07-31 21:20:35.169258,5.1,40.780,2,,pistress_1100001011.csv
1,2020-07-31 21:20:36.185128,49.9,41.856,2,,pistress_1100001011.csv
2,2020-07-31 21:20:37.198883,50.1,42.394,2,,pistress_1100001011.csv
3,2020-07-31 21:20:38.205597,49.9,42.932,2,,pistress_1100001011.csv
4,2020-07-31 21:20:39.215587,50.0,43.470,2,,pistress_1100001011.csv
...,...,...,...,...,...,...
3591,2020-07-31 08:10:25.971033,0.0,70.908,0,,pistress_1100100011.csv
3592,2020-07-31 08:10:26.973625,0.0,70.370,0,,pistress_1100100011.csv
3593,2020-07-31 08:10:27.976173,0.0,69.832,0,,pistress_1100100011.csv
3594,2020-07-31 08:10:28.978738,0.0,70.908,0,,pistress_1100100011.csv
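
One thing to notice in the output above: each file keeps its own 0, 1, 2, ... row labels, so the combined dataframe has repeated index values. That is harmless here because the index is replaced by the datetime later on. The sketch below shows the behaviour with two made-up in-memory files, along with the `ignore_index=True` option of `pd.concat()` for cases where unique row labels matter.

```python
import io

import pandas as pd

# Hypothetical stand-ins for two raw files (contents are made up)
fake_files = {
    "pistress_0000000011.csv": "usage,temp\n5.1,40.78\n49.9,41.856\n",
    "pistress_1000000011.csv": "usage,temp\n0.0,70.908\n",
}

frames = [
    pd.read_csv(io.StringIO(text)).assign(filename=name)
    for name, text in fake_files.items()
]

# Default behaviour: each file's row labels are kept, so 0 appears twice
stacked = pd.concat(frames)
# ignore_index=True relabels the rows 0..n-1 instead
relabeled = pd.concat(frames, ignore_index=True)

print(list(stacked.index), list(relabeled.index))
```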


### Creating dummy variables

You may know these as indicator variables or as flags.

This information has been embedded into the filenames. Here is the significance of the filename codes.

```
1.  (a) case_under
2.  (b) case_frame
3.  (c) case_cable
4.  (d) case_gpio
5.  (m) top_solid
6.  (n) top_holed
7.  (o) top_intake (fan)
8.  (p) top_exhaust (fan)
9.  (x) heatsink_main
10. (y) heatsink_sub
```

For example, `1111000111` is a fully formed case with an exhaust fan and heatsinks on both ICs.

#### Dummification

The code below will find all unique filenames and encode them. These can then be joined back to the original raw dataframe.

In [4]:
def dummify_filename(filename):
    # The filenames have an underscore
    # Split on _ and keep the second half
    code = filename.split("_")[1]
    # The string will still have .csv at the end
    # Split on . and keep the first half
    code = code.split(".")[0]
    # Get rid of non-digit characters
    code = re.sub(r"\D", "", code)
    # Break the code into a list of 0/1 integers
    flags = [int(x) for x in list(code)]
    # List the dummy labels
    keys = ["case_under",
            "case_frame",
            "case_cable",
            "case_gpio",
            "top_solid",
            "top_holed",
            "top_intake",
            "top_exhaust",
            "heatsink_main",
            "heatsink_sub"
           ]
    # Output into a dictionary, which pandas can transform into a dataframe
    values = dict(zip(keys, flags))
    values["filename"] = filename
    return values

# Only get unique filenames
# Feed them into the dummify function
flags = pd.DataFrame([dummify_filename(x) for x in raw.filename.drop_duplicates()])

# Just so you can see what this looks like
flags.head()

Unnamed: 0,case_under,case_frame,case_cable,case_gpio,top_solid,top_holed,top_intake,top_exhaust,heatsink_main,heatsink_sub,filename
0,1,1,0,0,0,0,1,0,1,1,pistress_1100001011.csv
1,1,1,0,1,0,0,0,1,1,1,pistress_1101000111.csv
2,1,1,0,1,0,1,0,0,1,1,pistress_1101010011.csv
3,1,1,0,0,0,0,0,0,1,1,pistress_1100000011.csv
4,0,0,0,0,0,0,0,0,1,1,pistress_0000000011.csv


#### Merging

The two dataframes can be merged together, adding the dummy variables as columns. The `filename` column could be dropped since its information content has been extracted, but it ends up being useful later.

In [5]:
# .merge() joins on the shared filename column (.join() would join on indexes instead)
df_flagged = raw.merge(flags, on="filename")

df_flagged.head()

Unnamed: 0,datetime,usage,temp,stress,load,filename,case_under,case_frame,case_cable,case_gpio,top_solid,top_holed,top_intake,top_exhaust,heatsink_main,heatsink_sub
0,2020-07-31 21:20:35.169258,5.1,40.78,2,,pistress_1100001011.csv,1,1,0,0,0,0,1,0,1,1
1,2020-07-31 21:20:36.185128,49.9,41.856,2,,pistress_1100001011.csv,1,1,0,0,0,0,1,0,1,1
2,2020-07-31 21:20:37.198883,50.1,42.394,2,,pistress_1100001011.csv,1,1,0,0,0,0,1,0,1,1
3,2020-07-31 21:20:38.205597,49.9,42.932,2,,pistress_1100001011.csv,1,1,0,0,0,0,1,0,1,1
4,2020-07-31 21:20:39.215587,50.0,43.47,2,,pistress_1100001011.csv,1,1,0,0,0,0,1,0,1,1
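
A merge like this silently multiplies rows if the lookup table ever contains the same key twice. As an optional safeguard, `merge()` accepts a `validate` argument. The sketch below uses made-up miniature dataframes (`raw_small` and `flags_small` are hypothetical names) to show the idea.

```python
import pandas as pd

# Hypothetical miniature versions of the raw and flags dataframes
raw_small = pd.DataFrame({
    "temp": [40.78, 41.856, 70.908],
    "filename": ["a.csv", "a.csv", "b.csv"],
})
flags_small = pd.DataFrame({
    "filename": ["a.csv", "b.csv"],
    "case_under": [1, 0],
})

# validate="many_to_one" raises pandas.errors.MergeError if a filename
# appears more than once in flags_small
merged = raw_small.merge(flags_small, on="filename", validate="many_to_one")
print(merged)
```

The row count of `merged` equals the row count of `raw_small`, which is what we want from a lookup-style merge.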


### Creating the datetime index

We're better off with a datetime index. The `pandas` documentation always talks about it, so I get the impression it's a best practice.

There is a practical benefit to the `DatetimeIndex` too. Its `.round()` method rounds the datetimes to the nearest second, dropping the extra precision we don't really need. The temperature logger reads the sensors once per second, so the sub-second time scale is not really informative.

In [6]:
# Work on a copy so that df_flagged is not modified as a side effect
df_dt = df_flagged.copy()

# Convert the datetime column to a datetime dtype
# This works fine since the datetime is already in a standard format
df_dt["datetime"] = pd.to_datetime(df_dt["datetime"])
df_dt["datetime"] = pd.DatetimeIndex(df_dt["datetime"]).round("s")

# Set datetime as the index
df_dt = df_dt.set_index("datetime")

df_dt.head()

datetime,usage,temp,stress,load,filename,case_under,case_frame,case_cable,case_gpio,top_solid,top_holed,top_intake,top_exhaust,heatsink_main,heatsink_sub
2020-07-31 21:20:35,5.1,40.78,2,,pistress_1100001011.csv,1,1,0,0,0,0,1,0,1,1
2020-07-31 21:20:36,49.9,41.856,2,,pistress_1100001011.csv,1,1,0,0,0,0,1,0,1,1
2020-07-31 21:20:37,50.1,42.394,2,,pistress_1100001011.csv,1,1,0,0,0,0,1,0,1,1
2020-07-31 21:20:38,49.9,42.932,2,,pistress_1100001011.csv,1,1,0,0,0,0,1,0,1,1
2020-07-31 21:20:39,50.0,43.47,2,,pistress_1100001011.csv,1,1,0,0,0,0,1,0,1,1
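
For a closer look at the rounding step, here is a small sketch that uses two timestamps copied from the raw output. It uses the `.dt.round()` accessor on a Series, which is an alternative spelling of the same operation rather than the exact code in the cell above.

```python
import pandas as pd

# Two timestamps copied from the raw output
stamps = pd.Series(["2020-07-31 21:20:35.169258", "2020-07-31 21:20:36.185128"])

# Parse the strings, then round each value to the nearest second
rounded = pd.to_datetime(stamps).dt.round("s")
print(rounded.tolist())
```

Note that rounding goes to the nearest second, so a reading taken at `.7` of a second would move up to the next second. `.dt.floor("s")` would always round down instead.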


### Drop any duplicate indexes

The data should not have any observations in the same second. The delay between sensor readings should be at least 1 second.

Nevertheless I will detect and remove duplicates. I almost always do this when working with data since unexpected duplicates have caused me grief in the past.

In [7]:
# Do we have any duplicated indexes?
# It's a good idea to include this printout for information purposes
print(f"We have duplicate indexes: {any(df_dt.index.duplicated())}")

We have duplicate indexes: False


In [8]:
# Drop duplicate indexes, just in case
# I learned something: the tilde (~) is Python's bitwise NOT operator
# On a boolean array it flips each value, ie: the "vectorized" not (!) if you're from R
df = df_dt.loc[~df_dt.index.duplicated()]
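
Since our data had no duplicates, the cell above does not visibly change anything. The sketch below uses a made-up dataframe with one repeated timestamp to show what the mask looks like and which row survives.

```python
import pandas as pd

# Hypothetical readings where two timestamps collide (values are made up)
small = pd.DataFrame(
    {"temp": [40.0, 41.0, 42.0]},
    index=pd.to_datetime([
        "2020-07-31 21:20:35",
        "2020-07-31 21:20:35",
        "2020-07-31 21:20:36",
    ]),
)

# duplicated() marks every repeat after the first occurrence
mask = small.index.duplicated()
# ~ flips the boolean mask, so the first occurrence of each timestamp is kept
deduped = small.loc[~mask]
print(mask.tolist(), deduped["temp"].tolist())
```

By default the first occurrence is kept. Passing `keep="last"` to `duplicated()` keeps the last one instead.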

### Writing to disk

Below I write the cleaned dataframe to a parquet file for quick and safe storage.

In [9]:
df.reset_index().to_parquet("./data/cleaned.parquet")

In [10]:
# In the near future, feather files will offer compression (pandas 1.1.0)
# df.reset_index().to_feather("./data/cleaned.feather")