The data is currently stored as `.xls` files. In this notebook, we will implement code to load the data into `pandas.DataFrame` objects and store it on disk as more efficient `.parquet` files.

In [1]:
# import any required libraries here
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

First, we need to read the `.xls` files into `pandas.DataFrame` objects. You can use [pandas.read_excel](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) for this.

In [162]:
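# load the building data
# consider the different number of header rows!
def data_loader(file_name1, file_name2):
    """Read two .xls exports of one building and return them as a single DataFrame,
    sorted by time, with flattened column names and duplicate timestamps removed."""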
    
    # read both files; sheet rows 3-5 (0-based 2, 3, 4) form a three-level column header
    df1 = pd.read_excel('/case_study/case_study_data/' + file_name1, header = [2, 3, 4])
    df2 = pd.read_excel('/case_study/case_study_data/' + file_name2, header = [2, 3, 4])

    concat_df = pd.concat([df1, df2], axis = 0) # concatenate the two frames row-wise
    
    # build unique, friendly column names:
    # first character of the level-1 header (Kennzahl) + level-0 header from its sixth character on + level-2 header (Bezeichnung)
    new_columns = [x[1][0] + '_' + x[0][5:] + '_' + x[-1] for x in concat_df.columns]
    new_columns[0] = 'Time' # rename the first column to Time
    concat_df.columns = new_columns

    concat_df = concat_df.drop(0, axis = 0) # drop the rows labelled 0, i.e. the first row of each file
    concat_df['Time'] = pd.to_datetime(concat_df['Time']) # convert to datetime format
    
    # convert every column except Time to float
    value_columns = concat_df.columns[~concat_df.columns.isin(['Time'])]
    concat_df[value_columns] = concat_df[value_columns].astype('float')
    
    concat_df = concat_df.sort_values(['Time']) # sort values by time
    
    concat_df = concat_df.reset_index(drop = True) # reset the index and drop the previous one
    
    concat_df = concat_df.drop_duplicates(['Time'], keep = 'first') # drop rows with a duplicate timestamp, keeping the first
    
    
    return concat_df
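
The column-renaming step inside `data_loader` can be tried in isolation. pandas exposes a multi-row header as tuples with one entry per header row, so the same list comprehension works on plain tuples. The header values below are hypothetical stand-ins chosen only to illustrate the pattern; the real sheets may differ.

```python
# Hypothetical three-level header tuples (level 0, level 1, level 2).
# These values are assumptions for illustration, not taken from the real files.
columns = [
    ("Nr.: 11 01 01", "6.8.1", "Durchfluss"),
    ("Nr.: 11 01 01", "6.8.1", "Wärmeenergie Tarif 1"),
]

# First character of level 1 + level 0 from index 5 onward + level 2.
flat = [level1[0] + "_" + level0[5:] + "_" + level2 for level0, level1, level2 in columns]

print(flat)
```

The slice `[5:]` only gives a clean result if every level-0 header carries a prefix of exactly five characters, which is worth checking for each file.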

Next, we need to implement a function that takes a `pandas.DataFrame` and a path string as input and writes the data to disk as a `.parquet` file. You can use the [PyArrow library](https://arrow.apache.org/docs/python/parquet.html) for this.

In [163]:
# calling data loader function
oh14_df = data_loader(file_name1 = 'OH14.xls', file_name2 = 'OH14_01_26-07_19.xls')
oh12_df = data_loader(file_name1 = 'OH12.xls', file_name2 = 'OH12_01_26-07_19.xls')
kita_hokida_df = data_loader(file_name1 = 'Kita Hokido.xls', file_name2 = 'Kita Hokido_05_22_20-07_19_22.xls')
chemie_df = data_loader(file_name1 = 'Chemie.xls', file_name2 = 'Chemie_01_26-07_19.xls')
gross_df = data_loader(file_name1 = 'Großtagespflege.xls', file_name2 = 'Grosstagespflege_04_05-07_19.xls')
hg_2_df  = data_loader(file_name1 = 'HG II.xls', file_name2 = 'HGII_01_26-07_19.xls')



In [145]:
def write_as_parquet(df, path):
    """Write the DataFrame `df` to disk as a parquet file at `path`."""
    table = pa.Table.from_pandas(df) # construct an Arrow table from the pandas DataFrame
    pq.write_table(table, path) # write the table to disk in parquet format

In [146]:
# writing all the files in parquet format
write_as_parquet(oh14_df, path = '/case_study/case_study_data/OH14.parquet')
write_as_parquet(oh12_df, path = '/case_study/case_study_data/OH12.parquet')
write_as_parquet(kita_hokida_df, path = '/case_study/case_study_data/Kita_hokida.parquet')
write_as_parquet(chemie_df, path = '/case_study/case_study_data/Chemie.parquet')
write_as_parquet(gross_df, path = '/case_study/case_study_data/Großtagespflege.parquet')
write_as_parquet(hg_2_df, path = '/case_study/case_study_data/HGII.parquet')

Now we need the opposite functionality: a function that reads data from a `.parquet` file on disk and returns it as a `pandas.DataFrame`. Implement this function such that it can take a list of column names to load as an _optional_ parameter.

In [147]:
def load_to_pandas(path, columns = None):
    """Read the parquet file at `path` and return it as a DataFrame.

    `columns` is an optional list of column names to load; None loads all columns.
    """
    df = pq.read_pandas(path, columns = columns).to_pandas() # read the .parquet file into a pandas DataFrame
    return df

In [152]:
load_to_pandas(path = '/case_study/case_study_data/OH14.parquet', columns = ['Time', '6_11 01 01_Wärmeenergie Tarif 1', '6_11 01 01_Durchfluss'])

,Time,6_11 01 01_Wärmeenergie Tarif 1,6_11 01 01_Durchfluss
0,2021-07-06 11:45:00,2066251.0,0.155
1,2021-07-06 12:00:00,2066251.0,0.126
2,2021-07-06 12:15:00,2066251.0,0.134
3,2021-07-06 12:30:00,2066252.0,0.130
4,2021-07-06 12:45:00,2066252.0,0.128
...,...,...,...
36269,2022-07-19 01:30:00,2249905.0,0.236
36270,2022-07-19 01:45:00,2249905.0,0.163
36271,2022-07-19 02:00:00,2249906.0,0.158
36272,2022-07-19 02:15:00,2249907.0,0.183


Great! We can now store data more efficiently on disk and know how to load it again. Store all the data we have as one `.parquet` file per building.